Web scraping is the enormously helpful ability to run a program that automatically extracts data from web pages and hands it to you in a structured file for analysis or other uses.
The off-the-shelf web scraper so helpfully named Webscraper is easy to install and relatively easy to use, with many helpful tutorial videos on their website.
As a testing ground for web scraping, I ran a simple search at the McGill Library for ‘image childhood’, intending to scrape the details of the 10 items on the first page of results.
I faced two challenges in pulling this off.
One had nothing to do with the web scraper itself. I have the bibliographic application Zotero installed and it automatically attempts to log me in to the library’s systems whenever I make a request for library materials. This is very helpful in most contexts, but not when the web scraper first ran, since it didn’t have the authorization credentials. So, I had to make sure I was logged in before I ran Webscraper.
The second was a glitch in my initial setup of Webscraper. I neglected to put the selectors for author, title, etc. inside the element wrapper and instead had them at root. Since Webscraper therefore didn’t ‘know’ that I wanted multiple elements scraped, it just returned the first item. I then re-watched a tutorial, realized my error, and fixed it.
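The wrapper mistake is easier to see in code. What follows is only an analogy in Python's standard library, not how Webscraper works internally: the markup, the class names, and both function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup standing in for a page of search results.
PAGE = """
<ul>
  <li class="result"><span class="title">First book</span><span class="author">Author A</span></li>
  <li class="result"><span class="title">Second book</span><span class="author">Author B</span></li>
</ul>
""".strip()

root = ET.fromstring(PAGE)


def scrape_at_root(root):
    """Field selectors applied to the whole page: only the first match of each is found."""
    title = root.find(".//span[@class='title']")
    author = root.find(".//span[@class='author']")
    return [{"title": title.text, "author": author.text}]


def scrape_per_wrapper(root):
    """Field selectors applied inside each wrapper element: one record per result."""
    records = []
    for item in root.findall(".//li[@class='result']"):
        records.append({
            "title": item.find("span[@class='title']").text,
            "author": item.find("span[@class='author']").text,
        })
    return records


print(scrape_at_root(root))      # one record only
print(scrape_per_wrapper(root))  # one record per result item
```

The first function mirrors my mistake: with nothing marking where one record ends and the next begins, there is only one record to return.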
Inspecting the scraped data, I note four surprises:
- The results are not returned in the same order as they are listed on the web page. They do, however, come with a ‘web-scraper-order’ ID field, which is in the order items are listed on the web page. So it’s easy enough to sort them. But it’s odd that they get permuted in the first place.
- I allowed for multiple authors of a work when I set up the scraper (4 of the 10 records had multiple authors). The authors were scraped in ‘long’ format, i.e., the whole record shows up, with its own ID, for each author. The disordered return order then didn’t keep the authors of a single work together. They only appeared together after being re-sorted.
- The publication year sometimes has a copyright mark (©), sometimes doesn’t. I didn’t expect that. The one work that is a revised edition has its publication year in square brackets.
- For some reason the call number, library collection, and availability status of the first two items didn’t get scraped and show as ‘null’.
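The first three quirks can be tidied up after the fact. Here is a minimal sketch in Python, under some loud assumptions: the column names `title`, `author`, and `year` are hypothetical (they depend on how the scraper was set up), the sample rows are invented, and I assume the ‘web-scraper-order’ value contains one or more numbers whose numeric order matches page order, and that a title identifies a single work.

```python
import re


def order_key(row):
    """Sort key built from the digit runs in 'web-scraper-order' (assumed numeric and ordered)."""
    return tuple(int(n) for n in re.findall(r"\d+", row["web-scraper-order"]))


def clean_year(raw):
    """Pull a four-digit year out of strings such as '©1999' or '[2001]'; None if absent."""
    match = re.search(r"\d{4}", raw or "")
    return int(match.group()) if match else None


def collapse_records(rows):
    """Restore page order, then fold 'long' per-author rows into one record per title."""
    works = {}
    for row in sorted(rows, key=order_key):
        work = works.setdefault(
            row["title"],
            {"title": row["title"], "year": clean_year(row.get("year")), "authors": []},
        )
        if row["author"] not in work["authors"]:
            work["authors"].append(row["author"])
    return list(works.values())


# Invented sample rows, deliberately out of order, with one two-author work.
sample = [
    {"web-scraper-order": "1600000000-3", "title": "Book B", "author": "Author X", "year": "[2001]"},
    {"web-scraper-order": "1600000000-1", "title": "Book A", "author": "Author P", "year": "©1999"},
    {"web-scraper-order": "1600000000-2", "title": "Book A", "author": "Author Q", "year": "©1999"},
]

for work in collapse_records(sample):
    print(work)
```

The null fields in the fourth point are a different matter: they are missing from the scrape itself, so no amount of post-processing will bring them back.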
But, although I got some unexpected behaviour, there was no real problem getting the results. And that scraper can now easily be re-used. Indeed, the interesting next step would be to get it to page through the results and scrape all 20,877 of them. That’s where it gets really powerful, and really useful.
Scraping is perhaps the most basic and necessary capacity to study the web as a cultural territory, since it allows one to capture web content for research purposes. But it is enormously useful for all kinds of research purposes, since so much of contemporary research involves interacting with web-based tools. Scrapers offer the opportunity to automate (some of) that, freeing your time for less tedious, more substantive work. Now, Webscraper itself may or may not be the best tool for that, but the point stands.
In terms of my own research, it’s less a matter of can I envision using scraping than can I envision not using scraping. I don’t think I can. I already make programmatic URL requests to Project Gutenberg for necessary texts and will be drawing on many similar e-text repositories. I’m using the Twitter, Reddit, and similar APIs to pull data from them, which is closely related. If I am going to succeed in downloading all the Web of Science citation information for my topic search on ‘child*’, to be better able to explore and visualize the literature, the only way that is going to be possible is with automated assistance. At a maximum of 500 records at a time, pulling 2 million records means 4,000 downloads. Doing that manually? Carpal tunnel here I come. And that kind of scale is beyond what their API permits. Clearly, a bot is called for. I just have to figure out how to make it work!
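The batching arithmetic, at least, is the easy part. A small sketch follows; the function names are my own, and nothing here talks to Web of Science. It only works out which record ranges a bot would have to request, assuming records are numbered from 1.

```python
import math


def batch_count(total_records, batch_size):
    """Number of downloads needed to cover every record."""
    return math.ceil(total_records / batch_size)


def batch_ranges(total_records, batch_size):
    """Yield 1-based, inclusive (first, last) record ranges, one per download."""
    for first in range(1, total_records + 1, batch_size):
        yield first, min(first + batch_size - 1, total_records)


print(batch_count(2_000_000, 500))      # 4000
print(list(batch_ranges(1_200, 500)))   # [(1, 500), (501, 1000), (1001, 1200)]
```

The hard part, actually fetching each range, is exactly what still has to be figured out.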