Parse script text using the chompjs library
Let’s assume you’ve extracted the text of a script tag that contains a JavaScript object, and you pass that text to chompjs’s parse_js_object function.
In this case, the parse_js_object function looks through the script to find the first JavaScript object, extracts it, and turns it into a Python dictionary. This is just the tip of the iceberg with chompjs. Check out the examples on its GitHub page to see other, more difficult formats it can parse easily.
Extract data using JMESPath
Now that you have your dictionary, what’s the best way to get your data out of it? Picking the fields you need out of nested dictionaries can be tedious, but another package makes it much easier: JMESPath.
For example, if you want to get the list of products from that dictionary, you can do that with a single function call:
It doesn’t stop there. Let’s go one step further – say you want the names of the products? You can do:
The change here tells JMESPath to go through the products list and pull out the name field from each item, which leaves you with a list of product names. That is already very useful, but we can go a bit deeper.
Say you want the dict for only one of these products – the one called “Bacon”. Well, we can actually enter a query within the square brackets to filter our results:
As before, we can also pull a specific field out:
Now, let’s do something a bit more interesting. Say I want to find products that are over a certain price. Well, I can do other sorts of conditionals in those brackets too:
You may have noticed that only one of the items has a brand, so the following query gives you just the brand name for that one product. Take care in this case: when some records are missing the field, the result does not tell you which of the dictionaries each value came from:
Finally, you may have noticed that the instock field in our sample has a boolean value, so if we want only the names of the in-stock items, we can get them like so:
These are probably two of the most important packages I use when extracting web data. Many sites embed standard JSON or JavaScript objects in their source, which chompjs can extract for you. In these cases, and with most API responses, you will likely end up with nested dictionaries, and JMESPath makes sifting through them a breeze.
If you would like to see these packages in action, check out my video on YouTube.
Check out other useful open source packages for parsing HTML and extracting data: