How to Scrape Websites That Don’t Want to Be Scraped
How you can handle common anti-scraping mechanisms to cover your data needs
Many times during your data analysis projects, you need to gather data from the web.
You can easily scrape simple websites when you:
- Understand HTTP requests
- Understand HTML structure (the DOM)
- Can write XPath or CSS selector expressions to match elements
- Know at least one programming language well enough to clean up the data before saving
But the web has evolved and there are many cases where the above skills are not enough. I’m going to dive into tougher cases and how you can handle them to get the data you need. The cases are ordered from simple to complex.
The server doesn’t send the same HTML to your script
You are using your browser to test a page before creating your web scraping script. When you implement the request in your script, you realize that the data returned to you differs from what you see in view-source in your browser. (Note: View Source and Inspect Element won’t always show the same thing, and there’s a reason for that.)
There’s a chance that what you’re missing is a User-Agent request header. The User-Agent header specifies which browser is making the request. When it’s not there, the website might assume that an automated request is being made. But it’s very simple to fix.
Just add a modern browser’s user agent in your requests.
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
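For example, with Python’s standard library, attaching the header looks roughly like this (the target URL is just a placeholder):

```python
import urllib.request

# Example user agent for desktop Chrome on Linux (any modern browser UA works)
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}

# Attach the header so the site sees what looks like a normal browser request
req = urllib.request.Request('https://example.com/', headers=HEADERS)
# html = urllib.request.urlopen(req).read().decode('utf-8')
```
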
If this doesn’t fix your problem, then the following case might be true.
You see the data on the page, but it’s not in view source
As I pointed out earlier, “Inspect Element” and “View Source” in your browser are not the same thing.
“View Source” shows you the HTML exactly as the website returns it on the initial requests.
So, when the data does not exist inside the initial HTML, there are two main scenarios:
- The data is fetched through XHR requests
- The data is embedded somewhere in the document, probably in JSON format
Let’s start with the first scenario. We will use Chrome DevTools to identify which request is responsible for bringing the data we’re looking for.
I am looking for the request that brings new tweets to the page. Here are the steps I followed to find the right request:
- Opened the Network tab in Chrome DevTools and pressed XHR to see only these kinds of requests (and not CSS, images, etc.)
- Triggered a new load. Sometimes this is a “Load More” button; in this case, it’s caused by scrolling down.
- Looked through the requests until I found the right data within them, using CTRL+F (so if a tweet on the page contains the word “football”, I narrow down the requests by searching for that word)
Now that I’ve found the right request, I can convert this frontend API request into something I can use in my code: right-click the request in the Network tab and choose “Copy as cURL”. You can then either paste the cURL command into Postman for further debugging, or turn it into code with an online converter tool.
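Replicating such an XHR request by hand is also straightforward with Python’s standard library. In this sketch the endpoint, query parameters, and headers are hypothetical placeholders — substitute the values you copied from the Network tab:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and parameters -- replace with the URL, query
# string, and headers copied from the request in the Network tab
BASE_URL = 'https://api.example.com/timeline'
PARAMS = {'count': 20}
HEADERS = {'Accept': 'application/json'}

def build_request(base_url, params, headers):
    # Reproduce the XHR as a plain HTTP GET with the same query string
    url = base_url + '?' + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers=headers)

req = build_request(BASE_URL, PARAMS, HEADERS)
# data = json.loads(urllib.request.urlopen(req).read())
```
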
If replicating the request is not feasible, another option is to drive a real browser in headless mode, with a tool such as Selenium or Puppeteer. The downside is that headless browsers are much slower than single HTTP requests for scraping. It’s also quite hard to scale headless browser scraping and run parallel requests.
But it is an option when nothing else works, or when input is required (actual form input).
How do we handle cases where the data is within a <script> tag somewhere in the document?
This can be tricky, but a well-written Regex can do the job.
Let’s say you’re scraping Instagram profiles to identify businesses.
A lot of the data you need is inside a <script> tag, in JSON format.
I put all of the HTML into Regex101 and experimented until I wrote a regex that captures the data.
The logic of the regex is fairly simple: I match a script element with a specific signature (window._sharedData) and then match everything up to the closing tag. As you can see, the “everything” part (.+?) is inside a capture group, which allows me to extract only that group. The group contains the JSON, which I can then parse and process in my programming language.
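Here is a minimal sketch of that approach in Python; the HTML snippet and the JSON content are made up for illustration, but they’re shaped like what such a page embeds:

```python
import json
import re

# A made-up HTML snippet shaped like the page described above; the real
# document embeds a much larger JSON object the same way
html = ('<script type="text/javascript">'
        'window._sharedData = {"entry_data": {"ProfilePage": []}};'
        '</script>')

# Match the script signature and capture everything (non-greedy)
# up to the ";</script>" that closes it
pattern = r'window\._sharedData = (.+?);</script>'
match = re.search(pattern, html)
data = json.loads(match.group(1))  # group 1 holds only the JSON
```
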
You’re receiving 429 HTTP responses (or other rate-limiting responses)
When you’re receiving a 429 response, it’s quite likely that the website understood that you’re scraping it and is trying to stop you.
In some cases, scraping at a slower pace does the job. So first of all, make sure you’re scraping respectfully.
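A simple way to slow down is to pause between requests. This is a minimal sketch; the `fetch` callable and the delay bounds are placeholders you’d tune for the site:

```python
import random
import time

def polite_fetch(urls, fetch, min_delay=2.0, max_delay=6.0):
    """Fetch each URL in turn, pausing a random interval between requests."""
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        if i < len(urls) - 1:  # no need to sleep after the last request
            time.sleep(random.uniform(min_delay, max_delay))
    return results
```
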
If you are scraping respectfully, and are still getting banned, then here are the tools you can use to avoid that:
- Proxy rotation
- User-Agent rotation
- Act like a real user
One of the main signals websites use to identify scraping is many requests coming from the same IP. You can eliminate this issue by using a proxy rotation service that changes your IP for every request you make.
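The idea can be sketched like this — the proxy addresses below are hypothetical, and in practice a paid rotation service usually exposes a single gateway that rotates IPs for you:

```python
import itertools
import urllib.request

# Hypothetical pool of proxy endpoints
PROXIES = ['http://10.0.0.1:8000', 'http://10.0.0.2:8000', 'http://10.0.0.3:8000']
proxy_pool = itertools.cycle(PROXIES)

def fetch_via_proxy(url):
    proxy = next(proxy_pool)  # a different proxy on every call
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': proxy, 'https': proxy}))
    return opener.open(url).read()
```
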
I personally use Luminati and ScrapingHub Crawlera. I find these two to be more professional and trustworthy. They are also the most expensive, but I find that cheaper alternatives don’t work very well.
There is an important issue to note here. There are 2 kinds of proxy IPs you can get:
- Datacenter IPs
- Residential IPs
Many websites that are common scraping targets, like for example LinkedIn, are very hard to reach with Datacenter IPs. Their anti-scraping mechanisms identify and quickly ban these IPs, which means that most datacenter IPs that you can buy access to, won’t work.
Residential IPs are much more effective at unblocking access to sites, but also quite expensive. And sadly, even residential IPs get banned pretty easily from heavily scraped websites.
The fewer people/companies that scrape a certain site, the more likely it is that Datacenter IPs will work well for unblocking.
User Agent Rotation
When you’re scraping a website that doesn’t want to be scraped, your goal is to make your script unidentifiable. You can further achieve this goal by randomizing the user agents used for the request.
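A minimal sketch of user-agent rotation — the pool below holds a few example strings, and in practice you’d maintain a larger, up-to-date list:

```python
import random

# A small pool of example user agents (use a larger, current list in practice)
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/79.0.3945.88 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/13.0 Safari/605.1.15',
]

def random_headers():
    # Pick a different user agent for each request
    return {'User-Agent': random.choice(USER_AGENTS)}
```
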
Act like a real user
The more you can make your scraping application behave like a real user, the less likely it is to get your IPs banned.
As I explained above, the best way to act like a real user is to use a real browser. But when this option is too slow you can do the following things to look more like a real user.
- Pass a Referer header on each request (or use a framework that does so automatically)
- Use the same headers as the site uses when you use it from the browser (inspect network requests to view the headers)
- Create a session or cookie by spinning up a headless browser, getting cookies and then reusing them throughout your HTTP requests.
- Don’t crawl in a repetitive and predictable way. If a website has 100 pages and you crawl them consecutively (1, 2, 3, …), it’s quite likely that the anti-scraping mechanisms will identify you. Randomize the order as much as you can, add random delays between requests, and, if possible, try not to chain all these requests in one large sequence. If you use parallel requests, you could set a strategy where each target is scraped for a while, then switched for another, and picked up again later.
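The order-randomization idea from the list above can be sketched as follows; the page range and delay bounds are illustrative:

```python
import random

def shuffled_pages(page_count):
    # Visit pages in random order instead of 1, 2, 3, ...
    pages = list(range(1, page_count + 1))
    random.shuffle(pages)
    return pages

def random_delay(min_s=1.0, max_s=5.0):
    # A different pause before every request
    return random.uniform(min_s, max_s)

# for page in shuffled_pages(100):
#     time.sleep(random_delay())
#     fetch_page(page)  # hypothetical fetch function
```
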
The more unpredictable and more user-like you can make your scraper, the more likely it is to work.
Google, LinkedIn and a few other examples will make it hard no matter what you do. In general, the websites that are scraped extremely often, also have extremely good anti-scraping mechanisms.
But you can always try, or use alternative strategies. For example, there are LinkedIn copycat websites for specific industries. For search, there is DuckDuckGo, which is not easy, but definitely easier to scrape than Google.
It’s important to remember to be respectful when scraping, and I believe that’s the best strategy. A scraper can run 24/7 even while you’re sleeping, so it can achieve a lot during your “unproductive” time.