Extract Values From Hidden Form Fields
Migrating data from one platform to another is usually manageable if the integrity of the source data is solid. But what happens when you receive product data from a website that was built poorly and lacks some fundamental data dependencies? More specifically, what do you do when the data that associates product IDs with categories isn’t available in the data dump or through a site scrape?
In my use case, I needed a way to properly translate product data from the old infrastructure into a data set that could be imported into Magento 2. Ordinarily I’d most likely be able to map all the data fields from old to new CMS through a spreadsheet – or at worst, a few spreadsheets. However, in this instance, the original CMS exports didn’t provide product-to-category associations – so the standard option wasn’t possible.
My initial path forward was to utilize Screaming Frog and its Tree Table View options. I think this is a great option for smaller sites – like my website for example. But if you’re working with an eCommerce website with 8 primary categories, hundreds of subcategories, and thousands of products, that Tree Table View export won’t be an efficient approach.
Upon further investigation of the source code, I discovered a critical nugget of data that would help in this scenario. On each product page, there was a sidebar contact form where customers could request additional details about a specific product they were interested in. These forms had 2 hidden fields to determine which product people were enquiring about – product ID and category path. Great! Now I just had to find a way to extract this information into a digestible spreadsheet.
This eventually led me to the Custom Extraction option in Screaming Frog. This feature allowed me to include HTML elements in my scrape. I needed to scrape hidden form fields, which contained the product ID/SKU and the category URI. The 3 options for custom extraction in Screaming Frog are CSS Path, Regex, and XPath. I chose the XPath mode to identify the specific nodes I needed to extract.
Here are some useful XPath expressions I wrote and used.
Extract Form Elements With Specific ID
//form[@id='FORM_ID']/input[@name='PART_ID']/@value
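Before running a full crawl, it can help to sanity-check an expression like this against a saved copy of one product page. Below is a minimal sketch using Python's standard library. The sample markup, the CATEGORY_PATH field name, and the hidden_value helper are all made up for illustration. ElementTree only supports a subset of XPath, so the trailing /@value step is replaced by reading the attribute in Python, and it only parses well-formed markup, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical product-page fragment: the ids, names, and values are invented.
SAMPLE_PAGE = """
<html>
  <body>
    <form id="FORM_ID" action="/contact" method="post">
      <input type="hidden" name="PART_ID" value="SKU-1001" />
      <input type="hidden" name="CATEGORY_PATH" value="/tools/hand-tools" />
      <input type="text" name="email" />
    </form>
  </body>
</html>
""".strip()


def hidden_value(page, form_id, field_name):
    """Return the value attribute of the named input inside the form, or None."""
    root = ET.fromstring(page)
    # ElementTree has no attribute-selection step (/@value), so select the
    # element first and read its attribute afterwards.
    node = root.find(f".//form[@id='{form_id}']/input[@name='{field_name}']")
    return None if node is None else node.get("value")


if __name__ == "__main__":
    print(hidden_value(SAMPLE_PAGE, "FORM_ID", "PART_ID"))
    print(hidden_value(SAMPLE_PAGE, "FORM_ID", "CATEGORY_PATH"))
```

If the helper returns None for a page you expect to match, the form ID or input name in the expression probably differs from what is in the page source.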
Target Nodes With a More Specific Path
//ul[@id='UL_ID']/li/form/input[@name='PART_ID']/@value
//ul[@class='UL_CLASS']/li/form/input[@name='PART_ID']/@value
Starts-With Selector
//form[starts-with(@id, 'FORM_ID_START_CHARACTERS')]/input[@name='PART_ID']/@value
Wildcard Selector
//*[starts-with(@id, 'ID_START_CHARACTERS')]
//div[@id='DIV_ID']//form[starts-with(@id,'FORM_ID_START_CHARACTERS')]/input[@name='PART_ID']/@value
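The starts-with and wildcard selectors can be mimicked offline in a similar way. This is only a sketch under a few assumptions: the sample markup, the form ids, and the values_by_id_prefix helper are hypothetical, and because Python's ElementTree does not implement the XPath starts-with() function, the prefix test is done with str.startswith instead. Passing no tag checks every element (the wildcard case), while passing "form" restricts the match to forms.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with two forms whose ids share no common prefix.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="DIV_ID">
      <form id="enquiry_form_17">
        <input type="hidden" name="PART_ID" value="SKU-1001" />
      </form>
    </div>
    <form id="newsletter_form">
      <input type="hidden" name="PART_ID" value="IGNORED" />
    </form>
  </body>
</html>
""".strip()


def values_by_id_prefix(page, id_prefix, field_name, tag=None):
    """Collect input values from elements whose id starts with id_prefix."""
    root = ET.fromstring(page)
    values = []
    # tag=None visits every element (wildcard); a tag name limits the search.
    for element in root.iter(tag):
        if not element.get("id", "").startswith(id_prefix):
            continue
        # Only direct child inputs are matched, mirroring the single "/" step.
        for node in element.findall(f"input[@name='{field_name}']"):
            values.append(node.get("value"))
    return values


if __name__ == "__main__":
    print(values_by_id_prefix(SAMPLE_PAGE, "enquiry_form_", "PART_ID", tag="form"))
    print(values_by_id_prefix(SAMPLE_PAGE, "newsletter", "PART_ID"))
```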
Learn more about the custom extraction feature from the official Screaming Frog documentation.