How to Scrape Websites That Don’t Want to Be Scraped
How you can handle common anti-scraping mechanisms to cover your data needs
Data analysis projects often require you to gather data from the web.
You can easily scrape simple websites when you:
- Understand HTTP requests
- Understand HTML structure (the DOM)
- Can write either XPath or CSS selector expressions to match elements
- Know at least one programming language well enough to clean up the data before saving
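To make the checklist concrete, here is a minimal sketch of a simple scrape using only the Python standard library. The `h2.title` target, the `SAMPLE` markup, and the helper names (`TextByClass`, `extract_titles`, `fetch`) are assumptions made for illustration. In a real project you would more likely reach for a dedicated parsing library and real XPath or CSS selectors.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextByClass(HTMLParser):
    """Collect the text of every <tag class="..."> element, roughly what
    the CSS selector `tag.css_class` would match."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so we close correctly.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def fetch(url):
    """HTTP step: download a page and decode it to text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_titles(html):
    """Selection and clean-up steps: match h2.title, normalize whitespace."""
    parser = TextByClass("h2", "title")
    parser.feed(html)
    return [" ".join(text.split()) for text in parser.results if text.strip()]


# Hypothetical markup standing in for a downloaded page.
SAMPLE = """
<html><body>
  <h2 class="title">  First   post </h2>
  <h2 class="title featured">Second post</h2>
  <h2 class="subtitle">Not a title</h2>
</body></html>
"""

print(extract_titles(SAMPLE))  # ['First post', 'Second post']
```

Against a live site you would call `extract_titles(fetch(url))` with the page's URL instead of using `SAMPLE`.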
But the web has evolved and there are many cases where the above skills are not enough. I’m going to dive into tougher cases and how you can handle them to get the data you need. The cases are ordered from simple to complex.
The server doesn’t send the same HTML to your script
Suppose you test a page in your browser before writing your web scraping script. When you make the same request from your script, the HTML you get back differs from what your browser’s View Source shows. (Note: View Source and Inspect Element won’t always show the same thing either. View Source displays the raw HTML the server returned, while Inspect Element displays the live DOM after JavaScript has modified it.)
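One way to confirm the mismatch is to diff the two versions of the page. The sketch below assumes you have saved the browser’s View Source output and your script’s response body as strings. The function name `html_diff` and the sample markup are made up for illustration.

```python
import difflib


def html_diff(browser_html, script_html, context=1):
    """Return unified-diff lines between the browser's and the script's HTML."""
    return list(difflib.unified_diff(
        browser_html.splitlines(),
        script_html.splitlines(),
        fromfile="browser",
        tofile="script",
        n=context,
        lineterm="",
    ))


# Hypothetical responses: the browser got the data, the script got a stub.
browser_html = "<p>Product</p>\n<p>price: 10</p>"
script_html = "<p>Product</p>\n<p>blocked</p>"

for line in html_diff(browser_html, script_html):
    print(line)
```

Lines starting with `-` appear only in the browser’s version and lines starting with `+` only in the script’s. An empty result means the server sent both clients the same HTML.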