What's the best Python option to scrape JavaScript-generated content?

bernard

As in the title, what is the current best method to scrape content that is generated by JavaScript?
 
I have used pyppeteer in the past. There is also another project from Scrapinghub on GitHub.
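For anyone who hasn't tried it, the basic pyppeteer flow looks roughly like this (the URL is a placeholder):

```python
import asyncio

from pyppeteer import launch


async def fetch_rendered_html(url):
    # Launch headless Chromium, let the page run its JS, then grab the rendered DOM
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url, {"waitUntil": "networkidle2"})
    html = await page.content()
    await browser.close()
    return html


html = asyncio.get_event_loop().run_until_complete(
    fetch_rendered_html("https://example.com")
)
print(len(html))
```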

I had to use a browser-based scraper because of an anti-bot mechanism.
 
I tried the requests-html approach (which drives pyppeteer under the hood), but it didn't work; I only got the raw pre-render HTML. Other people say it isn't supported anymore. Not sure.
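For reference, the usual requests-html flow is roughly the snippet below (it downloads Chromium via pyppeteer on the first render() call); your mileage may vary given the maintenance situation:

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com")
r.html.render()     # executes the page's JavaScript in headless Chromium
print(r.html.html)  # the rendered HTML, not the raw pre-load source
```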
 
Selenium (there's a web driver for it for Python too) works, but you need to spawn a headless browser for it versus just scraping with requests.
 
If anyone can do this for me for a modest sum, source code included, hit me up. The job: scrape 5 category pages and grab the usual product data.
 
I've figured it out on my Mac so far using Selenium and BS4. Works fine. Not sure how it works on PythonAnywhere.
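Roughly what that Selenium + BS4 combo looks like with headless Chrome; the URL and the CSS selectors for the product data are placeholders:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/category-page")           # placeholder category page
soup = BeautifulSoup(driver.page_source, "html.parser")   # parse the rendered DOM
driver.quit()

# Hypothetical selectors for the usual product data
for product in soup.select(".product"):
    name = product.select_one(".product-title")
    price = product.select_one(".price")
    print(
        name.get_text(strip=True) if name else None,
        price.get_text(strip=True) if price else None,
    )
```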
 
Why use Python for this?


You would be making the job far simpler by using Node Puppeteer for this.

If you're scraping, chances are you have at least a moderate understanding of JavaScript. It's not too hard to learn Node, or at least get to the stage where you can run a Puppeteer client.
 
I've figured it out on my Mac so far using Selenium and BS4. Works fine. Not sure how it works on PythonAnywhere.
Yeah, it can run on any server. I run some Selenium stuff on Linode. Just set up a normal server, install what you need, and it can run there just like on your own computer.
 
Why use Python for this?


You would be making the job far simpler by using Node Puppeteer for this.

If you're scraping, chances are you have at least a moderate understanding of JavaScript. It's not too hard to learn Node, or at least get to the stage where you can run a Puppeteer client.
Yeah, I've recently started using Puppeteer. It's quick and easy to learn. You just need some basic JavaScript knowledge and you're good to go. Did it all in Visual Studio Code.

My first scraping project took like a day, maybe 2 days if you count the optimizing.

Still a lot to learn, but I'm astonished. This is almost too easy.
 
I suggest looking at the underlying source code for JSON data stores, or looking at the HTTP requests made to the API endpoints. Usually you can skip the entire browser automation stage, which is brittle and has a high maintenance cost.
 
I suggest looking at the underlying source code for JSON data stores, or looking at the HTTP requests made to the API endpoints. Usually you can skip the entire browser automation stage, which is brittle and has a high maintenance cost.

How?

:smile:
 
To check for JSON data stores on a server-rendered webpage:
  1. Right click on a web page in your browser. Select "View page source".
  2. Find the script tags with JSON-serialized objects that contain dynamic content. The giveaway is usually the type attribute being set to "application/json", or a "hardcoded" JS object/variable in the script. The new Reddit homepage does both for data loading.
You can then scrape the tag using normal scraping methods and decode the data with a JSON parser in whatever language you prefer (sketch below).
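A hedged sketch of that idea in Python; the URL and the tag lookup are placeholders, so inspect the real page source to find the actual tag holding the data:

```python
import json

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/some-page", timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Grab the first <script type="application/json"> tag and decode its contents
tag = soup.find("script", type="application/json")
if tag and tag.string:
    data = json.loads(tag.string)
    print(list(data.keys()))  # explore the structure from here
```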

To check for API requests made on page load:
  1. Right click on a web page in your browser. Select "Inspect".
  2. Click on the Network tab. Disable the cache. Reload the page.
  3. Observe all the requests being made by the page.
To check for API requests made on JS interaction:
  1. Right click on a web page in your browser. Select "Inspect".
  2. Click on the Console tab and enable XHR logging (you want XHR requests to show up in the console; it's usually off by default).
  3. Perform the interaction and observe the XHR logs in the console.
Once you've spotted the endpoint, you can usually replicate the request directly, as in the sketch below. The caveat is that this is not for sites that have high bot protection; it's best for public content sites. The browser automation option is best for high-security situations, as there are 101 things a webmaster can do to detect bots, like honeypots or interaction tracking.
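Roughly what that looks like in Python; the endpoint, query parameters, and response shape below are all hypothetical, so copy the real values from the request you saw in DevTools:

```python
import requests

resp = requests.get(
    "https://example.com/api/v1/products",    # hypothetical endpoint from the Network tab
    params={"category": "shoes", "page": 1},  # hypothetical query parameters
    headers={
        "User-Agent": "Mozilla/5.0",          # mimic the browser request
        "Accept": "application/json",
    },
    timeout=30,
)
resp.raise_for_status()

# The response shape is site-specific; "items" is just a placeholder key
for item in resp.json().get("items", []):
    print(item)
```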
 
I tried to do it, but couldn't figure it out. It was some sort of aggregated backend on another domain that runs a bunch of webshops.

Above my paygrade.
 
If you can fetch the API data as mentioned above, it becomes super easy. Sometimes you can use Selenium just to get the login cookie, store it, and inject it when you make the request.
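A rough sketch of that cookie hand-off, assuming Chrome and placeholder URLs; the login steps and the API endpoint are hypothetical:

```python
import requests
from selenium import webdriver

# Log in once with a real browser so the site sets its session cookies
driver = webdriver.Chrome()
driver.get("https://example.com/login")
# ... fill in the login form and submit here ...

# Copy the browser cookies into a plain requests session
session = requests.Session()
for cookie in driver.get_cookies():  # list of dicts with "name"/"value" keys
    session.cookies.set(cookie["name"], cookie["value"])
driver.quit()

# From here on, hit the JSON endpoints directly without a browser
resp = session.get("https://example.com/api/members?page=1")  # hypothetical endpoint
print(resp.json())
```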

I've done this to scrape e.g. directories, people-finder type sites, government sites, etc. If you get the JSON, you can even scrape it directly into Google Sheets.

If you want it for Google Sheets, I can share a bunch of code.
 
If you want it for Google Sheets, I can share a bunch of code.
There's actually the IMPORTXML function that lets you scrape web pages with XPath, returning the results as arrays. You can then use array functions to clean the data. A super useful tool, really nice for prototyping. Plus, the requests are made by Google, so you don't have to worry about IPs or proxies.
 
There's actually the IMPORTXML function that lets you scrape web pages with XPath, returning the results as arrays. You can then use array functions to clean the data. A super useful tool, really nice for prototyping. Plus, the requests are made by Google, so you don't have to worry about IPs or proxies.
Yes, though I like to use Google Apps Script. It allows you to set headers and also manipulate the response.

Especially now that they allow modern JS syntax like classes, async/await, etc., it's very clean to work with.

Will share some code examples today.
 