Scraping data from a list of webpages using Google Docs

OJB – By Paul Bradshaw

Quite often when you’re looking for data as part of a story, that data will not be on a single page, but on a series of pages. To manually copy the data from each one – or even scrape the data individually – would take time. Here I explain a way to use Google Docs to grab the data for you.

Some basic principles

Although Google Docs is a pretty clumsy tool to use to scrape webpages, the method used is much the same as if you were writing a scraper in a programming language like Python or Ruby. For that reason, I think this is a good quick way to introduce the basics of certain types of scrapers.

Here’s how it works:

Firstly, you need a list of links to the pages containing data.

Quite often that list might be on a webpage which links to them all, but if not you should look at whether the links have any common structure, for example “http://www.country.com/data/australia” or “http://www.country.com/data/country2”. If they do, then you can generate a list by filling in the part of the URL that changes each time (in this case, the country name or number), assuming you have a list to fill it from (for example a list of countries or codes, or numbers that simply count up).
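To make that idea concrete, here is a minimal sketch in Python of filling in the changing part of a URL. The base address is the made-up example from above; the function name `build_urls` and the rule for turning a country name into a URL slug (lower case, hyphens for spaces) are my own assumptions, and a real site may well format its addresses differently.

```python
from urllib.parse import quote

# Example pattern from the text above; not a real data source.
BASE_URL = "http://www.country.com/data/"


def build_urls(items, base_url=BASE_URL):
    """Return one URL per item by filling in the part of the address that changes.

    Assumption: the site lower-cases names and replaces spaces with hyphens.
    """
    urls = []
    for item in items:
        slug = str(item).lower().replace(" ", "-")
        urls.append(base_url + quote(slug))
    return urls


# A list of names to fill the pattern from...
country_urls = build_urls(["Australia", "New Zealand"])

# ...or simple addition: country1, country2, country3
numbered_urls = build_urls("country" + str(n) for n in range(1, 4))
```

Either list can then be used as the list of links described in the first step.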

Second, you need the destination pages to have some consistent structure to them. In other words, they should look the same (although looking the same doesn’t mean they have the same structure – more on this below).

The scraper then cycles through each link in your list, grabs particular bits of data from each linked page (because it is always in the same place), and saves them all in one place.
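For comparison, the same three steps might look something like this in Python. This is only a sketch using the standard library, not the Google Docs method itself: the names `FirstTagText`, `fetch` and `scrape` are hypothetical, and I am assuming for illustration that the piece of data you want sits in the first `<h1>` tag of every page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class FirstTagText(HTMLParser):
    """Collect the text inside the first occurrence of a given tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False  # currently between the opening and closing tag?
        self.done = False    # has the first occurrence already been read?
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.tag and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(urls, tag="h1", fetch_page=fetch):
    """Cycle through each link, grab the same bit of data, save it in one place."""
    rows = []
    for url in urls:
        parser = FirstTagText(tag)
        parser.feed(fetch_page(url))
        rows.append((url, "".join(parser.parts).strip()))
    return rows
```

Calling `scrape(list_of_urls)` would return one (URL, text) pair per page. It only works because the data is assumed to be in the same place on every page, which is exactly why the second step matters. The `fetch_page` argument is there so you can try the loop on saved HTML before pointing it at a live site.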

Scraping with Google Docs using =importXML – a case study

If you’ve not used =importXML before it’s worth catching up on my previous two posts: “How to scrape webpages and ask questions with Google Docs and =importXML” and “Asking questions of a webpage – and finding out when those answers change”.

This takes things a little bit further.
