Scrape Hidden Form Fields with Screaming Frog and XPath

Extract Values From Hidden Form Fields

Migrating data from one platform to another is usually manageable when the source data is sound. But what happens when you receive product data from a poorly built website that lacks some fundamental data relationships? More specifically, what do you do when the data that associates product IDs with categories isn’t available in the data dump or through a site scrape?

In my use case, I needed a way to translate product data from the old infrastructure into a data set that could be imported into Magento 2. Ordinarily, I’d be able to map all the data fields from the old CMS to the new one through a spreadsheet – or at worst, a few spreadsheets. In this instance, however, the original CMS exports didn’t provide product-to-category associations, so the standard option wasn’t possible.

My initial path forward was to use Screaming Frog and its Tree Table View options. I think this is a great option for smaller sites – like my own website, for example. But if you’re working with an eCommerce website with 8 primary categories, hundreds of subcategories, and thousands of products, the Tree Table View export won’t be an efficient approach.

Upon further investigation of the source code, I discovered a critical nugget of data that would help in this scenario. Each product page had a sidebar contact form that customers could use to request additional details about a specific product. These forms had two hidden fields that identified which product people were enquiring about: the product ID and the category path. Great! Now I just had to find a way to extract this information into a digestible spreadsheet.

This eventually led me to the Custom Extraction option in Screaming Frog. This feature allowed me to include HTML elements in my scrape. I needed to scrape the hidden form fields, which contained the product ID/SKU and the category URI. The 3 options for custom extraction in Screaming Frog are CSS Path, Regex, and XPath. I chose the XPath mode to identify the specific nodes I needed to extract.

Here are some useful XPath expressions I wrote and used.

Extract Form Elements With Specific ID

//form[@id='FORM_ID']/input[@name='PART_ID']/@value
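Before starting a full crawl, it can help to sanity-check the shape of an expression like this against a saved copy of a page. The sketch below is one way to do that with Python's standard-library ElementTree; it is not part of Screaming Frog. The sample markup, the SKU value, and the CATEGORY_PATH field name are hypothetical placeholders. ElementTree supports only a subset of XPath and needs well-formed markup, so the value attribute is read with .get() rather than a trailing /@value step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a product page's sidebar form.
# The field name CATEGORY_PATH and both values are made up for illustration.
SAMPLE_PAGE = """
<html>
  <body>
    <form id="FORM_ID">
      <input type="hidden" name="PART_ID" value="SKU-1001" />
      <input type="hidden" name="CATEGORY_PATH" value="/tools/hand-tools/" />
      <input type="text" name="email" />
    </form>
  </body>
</html>
"""


def hidden_value(page_source, form_id, field_name):
    """Return the value attribute of the named input inside the form, or None."""
    root = ET.fromstring(page_source)
    # ElementTree handles [@attr='value'] predicates but not a final /@value step,
    # so select the <input> element first and read its attribute separately.
    node = root.find(f".//form[@id='{form_id}']/input[@name='{field_name}']")
    return None if node is None else node.get("value")


part_id = hidden_value(SAMPLE_PAGE, "FORM_ID", "PART_ID")
category_path = hidden_value(SAMPLE_PAGE, "FORM_ID", "CATEGORY_PATH")
print(part_id, category_path)
```

Real-world HTML is often not well-formed XML, so a check like this is only suitable for a cleaned-up snippet; the crawl itself still relies on Screaming Frog's own XPath engine.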

Target Nodes With a More Specific Path

//ul[@id='UL_ID']/li/form/input[@name='PART_ID']/@value
//ul[@class='UL_CLASS']/li/form/input[@name='PART_ID']/@value

Starts-With Selector

//form[starts-with(@id, 'FORM_ID_START_CHARACTERS')]/input[@name='PART_ID']/@value
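To illustrate what the starts-with() predicate selects, here is a rough Python equivalent. ElementTree has no starts-with() function, so this sketch swaps in str.startswith() on the id attribute; the form IDs and values are hypothetical, and the markup is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one form whose id begins with the prefix, one that does not.
SAMPLE_PAGE = """
<html><body>
  <form id="contact_form_17">
    <input type="hidden" name="PART_ID" value="SKU-1001" />
  </form>
  <form id="newsletter">
    <input type="hidden" name="PART_ID" value="IGNORED" />
  </form>
</body></html>
"""


def values_for_prefixed_forms(page_source, id_prefix, field_name):
    """Mimic //form[starts-with(@id, prefix)]/input[@name=field]/@value."""
    root = ET.fromstring(page_source)
    values = []
    for form in root.iter("form"):
        # str.startswith stands in for XPath's starts-with(@id, ...).
        if form.get("id", "").startswith(id_prefix):
            # "input[...]" matches direct children only, like the single "/" step.
            for node in form.findall(f"input[@name='{field_name}']"):
                values.append(node.get("value"))
    return values


matches = values_for_prefixed_forms(SAMPLE_PAGE, "contact_form_", "PART_ID")
print(matches)
```

This pattern is useful when the form ID carries a per-page suffix, such as a numeric product key, and only the leading characters stay constant across pages.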

Wildcard Selector

//*[starts-with(@id, 'ID_START_CHARACTERS')]
//div[@id='DIV_ID']//form[starts-with(@id,'FORM_ID_START_CHARACTERS')]/input[@name='PART_ID']/@value
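The difference between the wildcard form (//* matches any element) and the version scoped to a div can be seen in a small sketch. As before, this uses Python's standard-library ElementTree with str.startswith() standing in for starts-with(), and the IDs, the "enquiry_" prefix, and the SKU are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a form nested inside a div, plus an unrelated span
# whose id happens to share the same prefix.
SAMPLE_PAGE = """
<html><body>
  <div id="DIV_ID">
    <section>
      <form id="enquiry_42">
        <input type="hidden" name="PART_ID" value="SKU-2002" />
      </form>
    </section>
  </div>
  <span id="enquiry_badge">New</span>
</body></html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# Rough equivalent of //*[starts-with(@id, 'enquiry_')]: any element at any depth.
wildcard_tags = [
    el.tag for el in root.iter() if el.get("id", "").startswith("enquiry_")
]

# Rough equivalent of
# //div[@id='DIV_ID']//form[starts-with(@id, 'enquiry_')]/input[@name='PART_ID']/@value
scoped_values = [
    node.get("value")
    for div in root.iterfind(".//div[@id='DIV_ID']")
    for form in div.iter("form")  # "//" step: forms at any depth inside the div
    if form.get("id", "").startswith("enquiry_")
    for node in form.findall("input[@name='PART_ID']")
]

print(wildcard_tags, scoped_values)
```

The wildcard picks up the unrelated span as well as the form, which is why scoping the expression to a known container tends to give cleaner extraction results.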

Learn more about the custom extraction feature in the official Screaming Frog documentation.
