Scraping data from a list of webpages using Google Docs

OJB – By Paul Bradshaw

Quite often when you’re looking for data as part of a story, that data will not be on a single page, but on a series of pages. To manually copy the data from each one – or even scrape the data individually – would take time. Here I explain a way to use Google Docs to grab the data for you.

Some basic principles

Although Google Docs is a pretty clumsy tool to use to scrape webpages, the method used is much the same as if you were writing a scraper in a programming language like Python or Ruby. For that reason, I think this is a good quick way to introduce the basics of certain types of scrapers.

Here’s how it works:

Firstly, you need a list of links to the pages containing data.

Quite often that list might be on a webpage that links to them all, but if not you should look at whether the links share a common structure, for example “http://www.country.com/data/australia” or “http://www.country.com/data/country2”. If they do, then you can generate a list by filling in the part of the URL that changes each time (in this case, the country name or number), assuming you have a list to fill it from (i.e. a list of countries, codes or simple addition).
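To make that concrete, here is a minimal sketch in Python (one of the scraper languages mentioned above, rather than a Google Docs formula) of how such a list can be generated. The base URL and the country names are just hypothetical examples, not data from the original post.

    # A minimal sketch, not from the original tutorial: build the list of links
    # when the URLs share a common structure. The base URL and country names
    # below are hypothetical examples.
    base_url = "http://www.country.com/data/{}"
    countries = ["australia", "brazil", "canada"]  # your own list of countries or codes

    urls = [base_url.format(country) for country in countries]
    print(urls)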

Second, you need the destination pages to have some consistent structure to them. In other words, they should look the same (although looking the same doesn’t mean they have the same structure – more on this below).

The scraper then cycles through each link in your list, grabs particular bits of data from each linked page (because it is always in the same place), and saves them all in one place.
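In a language like Python, that cycle might look something like the sketch below. The URLs, the CSS selectors and the output columns are hypothetical placeholders; the point is only to show the pattern of looping over links, grabbing the same bits from each page and saving them in one place.

    # A hedged sketch of the general scraping pattern described above,
    # not the Google Docs method itself. URLs and selectors are placeholders.
    import csv
    import requests
    from bs4 import BeautifulSoup

    urls = [
        "http://www.country.com/data/australia",
        "http://www.country.com/data/brazil",
    ]

    with open("scraped_data.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title", "population"])
        for url in urls:
            soup = BeautifulSoup(requests.get(url).text, "html.parser")
            # These selectors assume every page keeps the data in the same place.
            title = soup.select_one("h1").get_text(strip=True)
            population = soup.select_one("td.population").get_text(strip=True)
            writer.writerow([url, title, population])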

Scraping with Google Docs using =importXML – a case study

If you’ve not used =importXML before it’s worth catching up on my previous 2 posts How to scrape webpages and ask questions with Google Docs and =importXML and Asking questions of a webpage – and finding out when those answers change.

This takes things a little bit further. [Read more…]

4 Simple Tools for Creating an Infographic Resume

Editor’s note: As data journalists, designers or other data enthusiasts, what better way to show off your skills than with an infographic resume? Here is a very useful article by Mashable’s Erica Swallow introducing four very interesting tools to make your profile stand out! Show us your infographic resume in our Data Art Corner. The best examples will be featured on the DJB’s front page next month!

MASHABLE – By Erica Swallow

As a freelancer or job seeker, it is important to have a resume that stands out among the rest — one of the more visually pleasing options on the market today is the infographic resume.

An infographic resume enables a job seeker to better visualize his or her career history, education and skills.

Unfortunately, not everyone is a graphic designer, and whipping up a professional-looking infographic resume can be a difficult task for the technically unskilled job seeker. For those of us not talented in design, it can also be costly to hire an experienced designer to toil over a career-centric infographic.

Luckily, a number of companies are picking up on this growing trend and building apps to enable the average job seeker to create a beautiful resume.

To spruce up your resume, check out these four tools for creating an infographic CV. If you’ve seen other tools on the market, let us know about them in the comments below.


1. Vizualize.me

Vizualize.me is a new app that turns a user’s LinkedIn profile information into a beautiful, web-based infographic.

After creating an account and connecting via LinkedIn, a user can edit his or her profile summary, work experience, education, links, skills, interests, languages, stats, recommendations and awards. And voila, a stunning infographic is created.

The company’s vision is to “be the future of resumes.” Lofty goal, but completely viable, given that its iteration of the resume is much more compelling than the simple, black-and-white paper version that currently rules the world.


2. Re.vu

Re.vu, a newer name on the market, is another app that enables a user to pull in and edit his or her LinkedIn data to produce a stylish web-based infographic.

The infographic layout focuses on the user’s name, title, biography, social links and career timeline — it also enables a user to add more graphics, including stats, skill evolution, proficiencies, quotes and interests over time.

Besides the career timeline that is fully generated via the LinkedIn connection, the other graphics can be a bit tedious to create, as all of the details must be entered manually.

In the end, though, a very attractive infographic resume emerges. This is, by far, the most visually pleasing option of all of the apps we reviewed.


3. Kinzaa

Based on a user’s imported LinkedIn data, Kinzaa creates a data-driven infographic resume that focuses on a user’s skills and job responsibilities throughout his or her work history.

The tool is still in beta, so it can be a bit wonky at times — but if you’re looking for a tool that helps outline exactly how you’ve divided your time in previous positions, this may be your tool of choice.

Unlike other tools, it also features a section outlining the user’s personality and work environment preferences. Details such as preferences on company size, job security, challenge level, culture, decision-making speed and more are outlined in the personality section, while the work environment section focuses on the user’s work-day length, team size, noise level, dress code and travel preferences.


4. Brazen Careerist Facebook App

Brazen Careerist, the career management resource for young professionals, launched a new Facebook application in September that generates an infographic resume from a user’s Facebook, Twitter and LinkedIn information.

After a user authorizes the app to access his or her Facebook and LinkedIn data, the app creates an infographic resume with a unique URL — for example, my infographic resume is located at brazen.me/u/ericaswallow.

The infographic features a user’s honors, years of experience, recommendations, network reach, degree information, specialty keywords, career timeline, social links and LinkedIn profile image.

The app also creates a “Career Portfolio” section which features badges awarded based on a user’s Facebook, Twitter and LinkedIn achievements. Upon signing up for the app, I earned eight badges, including “social media ninja,” “team player” and “CEO in training.” While badges are a nice addition, they aren’t compelling enough to keep me coming back to the app.

Scraperwiki now makes it easier to ask questions of data

OJB – By Paul Bradshaw

I was very excited recently to read on the Scraperwiki mailing list that the website was working on making it possible to create an RSS feed from a SQL query.

Yes, that’s the sort of thing that gets me excited these days.

But before you reach for a blunt object to knock some sense into me, allow me to explain…

Scraperwiki has, until now, done very well at trying to make it easier to get hold of hard-to-reach data. It has done this in two ways: firstly by creating an environment which lowers the technical barrier to creating scrapers (these get hold of the data); and secondly by lowering the social barrier to creating scrapers (by hosting a space where journalists can ask developers for help in writing scrapers).

This move, however, does something different.

It allows you to ask questions – of any dataset on the site. Not only that, but it allows you to receive updates as those answers change. And those updates come in an RSS feed, which opens up all sorts of possibilities around automatically publishing those answers.

The blog post explaining the development already has a couple of examples of this in practice:

Anna, for example, has scraped data on alcohol licence applications. The new feature not only allows her to get a constant update of new applications in her RSS reader – the feed could also be customised to flag licence applications on a particular street, or from a particular applicant, and so on. [Read more…]
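To make the idea concrete, here is a rough Python sketch of the pattern described above: a question expressed as a SQL query, answered as an RSS feed you can poll for changes. The feed endpoint, table name and column names are hypothetical placeholders, not ScraperWiki’s actual API.

    # A hedged sketch of "RSS feed from a SQL query", under assumed names.
    # FEED_ENDPOINT, licence_applications and its columns are hypothetical.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_ENDPOINT = "https://example.org/sql-to-rss"  # placeholder, not ScraperWiki's real URL
    query = (
        "SELECT applicant, address, date_received "
        "FROM licence_applications "
        "WHERE address LIKE '%High Street%' "
        "ORDER BY date_received DESC"
    )

    feed_url = FEED_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "rss"})
    with urllib.request.urlopen(feed_url) as response:
        feed = ET.parse(response)

    # Each new licence application on that street would appear as a fresh <item>.
    for item in feed.iterfind(".//item"):
        print(item.findtext("title"), "-", item.findtext("pubDate"))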

Strata Summit 2011: Generating Stories From Data [VIDEO]

As the world of data expands, new challenges arise. The complexity of some datasets can be overwhelming for journalists across the globe who “dig” for a story without the necessary technical skills. Narrative Science’s Kristian Hammond addressed this challenge during last week’s Strata Summit in New York in a presentation about a software platform that helps write stories out of numbers…

[youtube P9hJJCOeIB4]

The work of data journalism: Find, clean, analyze, create … repeat

O’REILLY RADAR – By 

Data journalism has rounded an important corner: The discussion is no longer if it should be done, but rather how journalists can find and extract stories from datasets.

Of course, a dedicated focus on the “how” doesn’t guarantee execution. Stories don’t magically float out of spreadsheets, and data rarely arrives in a pristine form. Data journalism — like all journalism — requires a lot of grunt work.

With that in mind, I got in touch with Simon Rogers, editor of The Guardian’s Datablog and a speaker at next week’s Strata Summit, to discuss the nuts and bolts of data journalism. The Guardian has been at the forefront of data-driven storytelling, so its process warrants attention — and perhaps even full-fledged duplication.

Our interview follows.

What’s involved in creating a data-centric story?

 

Simon Rogers: It’s really 90% perspiration. There’s a whole process to making the data work and getting to a position where you can get stories out of it. It goes like this:

  • We locate the data or receive it from a variety of sources — from breaking news stories, government data, journalists’ research and so on.
  • We then start looking at what we can do with the data. Do we need to mash it up with another dataset? How can we show changes over time?
  • Spreadsheets often have to be seriously tidied up — all those extraneous columns and weirdly merged cells really don’t help. And that’s assuming it’s not a PDF, the worst format for data known to humankind.
  • Now we’re getting there. Next up we can actually start to perform the calculations that will tell us if there’s a story or not.
  • At the end of that process is the output. Will it be a story or a graphic or a visualisation? What tools will we use?

We’ve actually produced a graphic (of how we make graphics) that shows the process we go through:

 

Guardian data journalism process
Partial screenshot of “Data journalism broken down.” Click to see the full graphic.
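For readers who want to see what the tidy, mash-up and calculate steps might look like in practice, here is a minimal Python/pandas sketch. The file names and column names are hypothetical, and this is emphatically not The Guardian’s actual code, just an illustration of the workflow Rogers describes.

    # A minimal sketch of the workflow under assumed file and column names.
    import pandas as pd

    spending = pd.read_csv("spending_2011.csv")            # hypothetical source file
    spending = spending.drop(columns=["Notes", "Blank"])   # tidy: drop extraneous columns

    population = pd.read_csv("population.csv")             # second dataset to mash up
    merged = spending.merge(population, on="local_authority")

    # The calculation step: per-head spending and year-on-year change.
    merged["per_head"] = merged["spend_2011"] / merged["population"]
    merged["change"] = merged["spend_2011"] - merged["spend_2010"]

    print(merged.sort_values("change").head(10))           # candidate rows for the story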

What is the most common mistake data journalists make?

Simon Rogers: There’s a tendency to spend months fiddling around [Read more…]

 

 

Data-Driven Journalism In A Box: what do you think needs to be in it?

The following post is from Liliana Bounegru (European Journalism Centre), Jonathan Gray (Open Knowledge Foundation), and Michelle Thorne (Mozilla), who are planning a Data-Driven Journalism in a Box session at the Mozilla Festival 2011, which we recently blogged about here. It is cross-posted at DataDrivenJournalism.net and on the Mozilla Festival Blog.

We’re currently organising a session on Data-Driven Journalism in a Box at the Mozilla Festival 2011, and we want your input!

In particular:

  • What skills and tools are needed for data-driven journalism?
  • What is missing from existing tools and documentation?

If you’re interested in the idea, please come and say hello on our data-driven-journalism mailing list!

Following is a brief outline of our plans so far…

What is it?

The last decade has seen an explosion of publicly available data sources – from government databases, to data from NGOs and companies, to large collections of newsworthy documents. There is increasing pressure for journalists to be equipped with the tools and skills needed to bring value from these data sources to the newsroom and to their readers.

But where can you start? How do you know what tools are available, and what those tools are capable of? How can you harness external expertise to help to make sense of complex or esoteric data sources? How can you take data-driven journalism into your own hands and explore this promising, yet often daunting, new field?

A group of journalists, developers, and data geeks want to compile a Data-Driven Journalism In A Box, a user-friendly kit that includes the most essential tools and tips for data. What is needed to find, clean, sort, create, and visualize data — and ultimately produce a story out of data?

There are many tools and resources already out there, but we want to bring them together into one easy-to-use, neatly packaged kit, specifically catered to the needs of journalists and news organisations. We also want to draw attention to missing pieces and encourage sprints to fill in the gaps as well as tighten documentation.

What’s needed in the Box?

  • Introduction
    • What is data?
    • What is data-driven journalism?
    • Different approaches: Journalist coders vs. Teams of hacks & hackers vs. Geeks for hire
    • Investigative journalism vs. online eye candy
  • Understanding/interpreting data
    • Analysis: resources on statistics, university course material, etc. (OER)
    • Visualization tools & guidelines – Tufte 101, bubbles or graphs?
  • Acquiring data
    • Guide to data sources
    • Methods for collecting your own data
    • FOI / open data
    • Scraping
  • Working with data
    • Guide to tools for non-technical people
    • Cleaning
  • Publishing data
    • Rights clearance
    • How to publish data openly
    • Feedback loop on correcting, annotating, adding to data
    • How to integrate data story with existing content management systems

What bits are already out there?

What bits are missing?

  • Tools that are shaped to newsroom use
  • Guide to browser plugins
  • Guide to web-based tools

Opportunities with Data-Driven Journalism:

  • Reduce costs and time by building on existing data sources, tools, and expertise.
  • Harness external expertise more effectively
  • Towards more trust and accountability of journalistic outputs by publishing supporting data with stories. Towards a “scientific journalism” approach that appreciates transparent, empirically-backed sources.
  • News outlets can find their own story leads rather than relying on press releases
  • Increased autonomy when journalists can produce their own datasets
  • Local media can better shape and inform media campaigns. Information can be tailored to local audiences (hyperlocal journalism)
  • Increase traffic by making sense of complex stories with visuals.
  • Interactive data visualizations allow users to see the big picture & zoom in to find information relevant to them
  • Improved literacy. Better understanding of statistics, datasets, how data is obtained & presented.
  • Towards employable skills.

Visualize This: How to Tell Stories with Data

BRAIN PICKINGS – By Maria Popova

How to turn numbers into stories, or what pattern-recognition has to do with the evolution of journalism.

 

Data visualization is a frequent fixation around here and, just recently, we looked at 7 essential books that explore the discipline’s capacity for creative storytelling. Today, a highly anticipated new book joins their ranks — Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, penned by Nathan Yau of the fantastic FlowingData blog. (Which also makes this a fine addition to our running list of blog-turned-book success stories.) Yau offers a practical guide to creating data graphics that mean something, that captivate and illuminate and tell stories of what matters — a pinnacle of the discipline’s sensemaking potential in a world of ever-increasing information overload.

And in a culture of equally increasing infographics overload, where we are constantly bombarded with mediocre graphics that lack context and provide little actionable insight, Yau makes a special point of separating the signal from the noise and equipping you with the tools to not only create better data graphics but also be a more educated consumer and critic of the discipline.

[youtube Q9RWwKntuXg]

From asking the right questions to exploring data through the visual metaphors that make the most sense to seeing data in new ways and gleaning from it the stories that beg to be told, the book offers a brilliant blueprint to practical eloquence in this emerging visual language. [Read more…]

Will PANDA save data journalism?

Panda image used under a Creative Commons license from Jenn and Tony Bot

Over the past few years, the Knight Foundation News Challenge has helped develop amazing projects such as DocumentCloud and Localwiki.

Data, and its use for journalism, was a big trend among this year’s winners. Needless to say, we were quite excited to see this burst of ideas dedicated to data journalism.

The project that caught our attention, and not just because of its cute name, is PANDA, a newsroom data application that would help journalists find context and relationships between datasets in the blink of an eye.

“While national news organizations often have the staff and know-how to handle federal data, smaller news organizations are at a disadvantage. City and state data are messier, and newsroom staff often lack the tools to use it,” John Bracken from the Knight Foundation explains. The PANDA project will “help news organisations better use public information.”

Brian Boyer, the news applications editor at the Chicago Tribune, in partnership with Investigative Reporters & Editors (IRE) and The Spokane Spokesman-Review, will build a set of open-source, web-based tools that will make it easier for journalists to use and analyze data. “The goal is to have a system that each news organization can put to their own use,” Boyer said. “I want this to be something an editor can set up for you, not your IT department.”

In the following PPT slides, Brian Boyer explains the concept of PANDA and how it could revolutionize data journalism:

 

As you may have gathered by now, there is unfortunately no link to the furry animal: PANDA stands for “PANDA A News Data Application”.

One of the backbones of the project will be Google Refine, a tool launched last year that cleans up messy datasets and detects patterns. One of the added benefits of Google Refine, Boyer said, is that it can help draw relationships across data. It would also allow newsrooms that can’t afford developers to integrate PANDA into their workplace easily.
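As a rough illustration of the kind of clean-up Refine automates, here is a short Python sketch of “fingerprint” clustering, which groups near-duplicate entries in a messy column. It is a simplified stand-in written for this post, not PANDA or Google Refine code.

    # A simplified illustration of fingerprint-style clustering for messy values.
    import re
    from collections import defaultdict

    def fingerprint(value):
        """Lowercase, strip punctuation, then sort and de-duplicate the tokens."""
        tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
        return " ".join(sorted(set(tokens)))

    records = ["Smith, John", "john smith", "John  Smith", "Jane Doe", "DOE, Jane"]

    clusters = defaultdict(list)
    for record in records:
        clusters[fingerprint(record)].append(record)

    for key, variants in clusters.items():
        print(key, "->", variants)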

The PANDA project received a $150,000 grant. The money will mainly be used to hire a developer to build the application and to give the project a nice fancy look and easy-to-use features.

The first step in this project will be to survey journalists on how they would like PANDA to work in their newsroom. The team will then have to implement those needs and scale the project across newsrooms of different sizes.

Dealing with big datasets requires big storage space and Boyer said that the best option would be for PANDA to work with a cloud storage system, although they haven’t worked out any specifics yet.

Other data-related projects received Knight funding: ScraperWiki (you can find our interview with their media partner manager here), OpenBlock Rural, Overview and SwiftRiver.

Here is a video from the Knight Foundation website giving an overview of all the projects:

(For Brian Boyer’s talk about the PANDA project, go to 9:42)

[vimeo 25222167]