7 ways to get data out of PDFs

HELP ME INVESTIGATE – By Paul Bradshaw

A frequent obstacle in data journalism is that the information you want to analyse is locked away in a PDF. Here are 6 ways to tackle that problem – with space for a 7th:

1) For simple PDFs: Google Docs’ conversion facility

Google Docs recently added a feature that allows you to convert a PDF to a ‘Google document’ when you upload it. It’s pretty powerful, and about the simplest way you can extract information.

It does not work, however, if the PDF was generated by scanning – in other words if it is an image, rather than a document that has been converted to PDF.

2) For scanned documents and pulling out key players: Document Cloud

Document Cloud is a tool that lets journalists convert PDFs to text. Along the way it adds ‘semantic’ information, identifying the organisations, people and other ‘entities’ (such as dates and locations) mentioned in the document, and it has some useful features for presenting documents so that others can comment on them.

The good news is that it works very well with scanned documents, using Optical Character Recognition (OCR). The bad news is that you need to ask permission to use it, so if you don’t work as a professional journalist you may not get access. Still, there’s no harm in asking. [Read more…]

10 things every journalist should know about data

NEWS:REWIRED: by SARAH MARSHALL

Every journalist needs to know about data. It is not just the preserve of the investigative journalist but can – and should – be used by reporters writing for local papers, magazines, the consumer and trade press and for online publications.

Think about crime statistics, government spending, bin collections, hospital infections and missing kittens and tell me data journalism is not relevant to your title.

If you think you need to be a hacker as well as a hack then you are wrong. Data journalism combines journalism, research, statistics and programming, but you can dabble: you don’t need to know much maths or code to get started. It can be as simple as copying and pasting data from an Excel spreadsheet. [Read more…]
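
To give a flavour of how little code "getting started" can involve, here is a minimal sketch in Python. It assumes that cells copied from a spreadsheet arrive as tab-separated text, which is how Excel normally places them on the clipboard. The ward names, the figures and the function names `parse_pasted_table` and `total` are all made up for illustration; they do not come from the article.

```python
import csv
import io

# Made-up sample: text as it might look after copying a few spreadsheet
# cells (columns separated by tabs, rows by newlines). Not real figures.
PASTED = "Ward\tBurglaries\nNorth\t12\nSouth\t7\nEast\t19\n"


def parse_pasted_table(text):
    """Turn tab-separated pasted text into a list of dicts keyed by header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def total(rows, column):
    """Sum a whole-number column, skipping blank or missing cells."""
    values = (row.get(column) or "" for row in rows)
    return sum(int(value) for value in values if value.strip())


rows = parse_pasted_table(PASTED)
print(total(rows, "Burglaries"))
```

Swapping `PASTED` for text pasted from your own spreadsheet is enough to try it out, provided the column you add up holds whole numbers.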