How to Scrape Websites That Don’t Want to Be Scraped

How you can handle common anti-scraping mechanisms to cover your data needs

Aris Pattakos
7 min read · May 19, 2020
Photo by Vidar Nordli-Mathisen on Unsplash

Data analysis projects often require you to gather data from the web.

You can easily scrape simple websites when you:

  • Understand HTTP requests
  • Understand HTML structure (the DOM)
  • Can write either XPath or CSS selector expressions to match elements
  • Know at least one programming language well enough to clean up the data before saving
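
As a rough illustration of those skills working together, here is a minimal sketch that uses only Python's standard library. The markup in SAMPLE_HTML, the class names, and the extract_products helper are all made up for this example, and the snippet is deliberately well-formed so that ElementTree's limited XPath support can parse it. Real pages are rarely that tidy, so in practice you would likely reach for a forgiving HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a fetched response body.
SAMPLE_HTML = """
<html>
  <body>
    <ul id="products">
      <li class="product"><span class="name"> Widget </span><span class="price">$9.99</span></li>
      <li class="product"><span class="name">Gadget</span><span class="price">$24.50</span></li>
    </ul>
  </body>
</html>
"""


def extract_products(html):
    """Return a list of {'name', 'price'} dicts from the sample markup."""
    root = ET.fromstring(html.strip())
    products = []
    # ElementTree supports a small XPath subset, which is enough for this match.
    for item in root.findall(".//li[@class='product']"):
        name = item.find("span[@class='name']").text.strip()
        price_text = item.find("span[@class='price']").text.strip()
        # Clean up the data before saving: drop the currency sign, parse a number.
        products.append({"name": name, "price": float(price_text.lstrip("$"))})
    return products


products = extract_products(SAMPLE_HTML)
for product in products:
    print(product["name"], product["price"])
```

The selection step and the cleanup step are kept separate on purpose: the selector only locates elements, and the stripping and number parsing happen afterwards.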

But the web has evolved and there are many cases where the above skills are not enough. I’m going to dive into tougher cases and how you can handle them to get the data you need. The cases are ordered from simple to complex.

The server doesn’t send the same HTML to your script

You use your browser to inspect a page before writing your scraping script. Then, when you make the same request from your script, you realize that the HTML returned is different from what you see under view-source: in your browser. (Note: "View source" and "Inspect element" won’t always show the same thing, and there’s a reason for that.)
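
One way to confirm the mismatch before trying to fix it is to diff the two versions. The sketch below is only a diagnostic aid under a few assumptions: fetch_html and html_diff are hypothetical helper names, the two HTML strings are invented stand-ins for a page saved from the browser and the body your script received, and the script side uses urllib with its default headers.

```python
import difflib
import urllib.request


def fetch_html(url):
    """Fetch a page the way a bare script would (default urllib headers)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def html_diff(browser_html, script_html):
    """Return unified-diff lines between the browser's source and the script's."""
    return list(
        difflib.unified_diff(
            browser_html.splitlines(),
            script_html.splitlines(),
            fromfile="browser",
            tofile="script",
            lineterm="",
        )
    )


# Hypothetical stand-ins: what the browser showed vs. what the script received.
browser_html = '<html>\n<body>\n<div id="prices">$9.99</div>\n</body>\n</html>'
script_html = "<html>\n<body>\n<p>Access denied</p>\n</body>\n</html>"

diff_lines = html_diff(browser_html, script_html)
for line in diff_lines:
    print(line)
```

To run this against a real page, you would save the browser's page source to a file, read it in as browser_html, and replace script_html with the result of fetch_html for the same URL. An empty diff means both clients received the same markup.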

Written by Aris Pattakos

Lead Software Engineer @Flash Pack - I post programming advice on https://www.bestpractices.tech/