The goal of this article is to give you tools & ideas you can use to become more productive when creating web scrapers. The tutorial is language-agnostic, meaning you can apply these techniques in your programming language of choice.
Throughout the article I’m assuming that you’re familiar with CSS selectors, XPath, curl requests, etc. If you’d like a beginner’s guide to web scraping, let me know in the comments. If you’d like to follow along with the examples, here’s the Wikipedia link I used for the screenshots.
1. Copying CSS & XPath selectors
When inspecting any element on a webpage, you can right-click its HTML definition to copy a specific CSS or XPath selector. In my experience, both may need a little tweaking to work just right, especially when you want to select multiple elements. But this is definitely a time-saver when you’re working with complex HTML and want to select elements quickly and with confidence.
// So by copying the XPath for the element we get a query that will work for the specific element.
// But our aim is to get all cite_note-* elements, so we'll need to experiment with that.
//*[@id="cite_note-4"]
2. The $x command
$x('//*[starts-with(@id, "cite_note")]') // one possible query that matches all cite_note-* ids
// In our case it helps us find a way to match all cite_notes and make sure that our XPath syntax is correct.
With $x you can quickly test your XPath queries on multiple pages and make sure they work across several examples. When you’re finished, just copy your XPath query and use it within your code.
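Once the query is validated in the console, using it in code is straightforward. Here’s a minimal sketch in Python, assuming the requests and lxml packages; the Wikipedia URL below is just a stand-in for whichever article you tested against:

import requests
from lxml import html

# The URL is a placeholder; use the page you validated the query on.
response = requests.get("https://en.wikipedia.org/wiki/Web_scraping")
tree = html.fromstring(response.content)

# The same XPath query we refined in the console with $x
notes = tree.xpath('//*[starts-with(@id, "cite_note")]')
print(len(notes), "cite_note elements found")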
3. Copying requests (Reproducing complex requests in your code)
When creating web scrapers, you’ll often need to replicate complex requests that are made within your browser. These requests may include many parameters and specific headers without which they won’t work, so in order to access the information you need you’ll have to replicate them in your code. This allows you to reverse engineer the public APIs used by a website’s frontend and easily access the information you’re looking for, without using heavier tools like Selenium.
By copying the request as a curl command (right-click the request in the Network tab → Copy → Copy as cURL), you get all the headers and parameters used within the request. Executing the copied curl command in your terminal should work as-is, unless the endpoint requires authentication.
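For illustration, a copied command might look something like this; the endpoint, parameters and header values here are made-up placeholders, not a real API from the article:

curl 'https://example.com/api/search?query=web+scraping&limit=20' \
  -H 'User-Agent: Mozilla/5.0' \
  -H 'Accept: application/json' \
  -H 'Referer: https://example.com/'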
4. Curl to code service (GitHub)
This open-source tool translates curl commands into code in your language of choice (Python, Node.js, PHP and a few more), which saves you even more time when creating scrapers. You’ll get all the header configurations and parameters you need to replicate and test these requests in your code.
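As a rough sketch, feeding the placeholder curl command above into such a converter would give you something along these lines in Python, using the requests package (again, the URL, headers and parameters are made up for illustration):

import requests

# Placeholder values carried over from the example curl command above
headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
    "Referer": "https://example.com/",
}
params = {
    "query": "web scraping",
    "limit": "20",
}

response = requests.get("https://example.com/api/search", params=params, headers=headers)
print(response.status_code)
print(response.json())

From here you can tweak the parameters programmatically instead of re-copying the request from the browser every time.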
All these neat features in Google Chrome’s developer tools make web scraping much faster, and knowing about them will make you more productive (and, I hope, happier 😊) as a developer.