Semantic Web and what it means for data journalism

I’ve found myself increasingly interested in the semantic web in recent months, particularly in how it could be applied to data journalism. While the concept is still somewhat in its infancy, its potential to quickly find data, and to abstract it into a format usable by visualizations, is something all data journalists should take note of.

Imagine the Internet as one big decentralized database, with important information explicitly tagged, instead of a big collection of linked text files organized only at the document level, as it currently is. In the foreseeable future, a journalist wanting to answer a question would simply supply this database with a SQL-like query instead of digging through a boatload of content or writing scrapers. Projects like Freebase and Wikipedia’s burgeoning “Datapedia” provide some clues as to the power of this notion: already, the semantic components of Wikipedia make it incredibly easy to answer a wide variety of questions in this manner.

Take, for example, the following bit of SPARQL, a commonly used semantic web query language:

SELECT ?country ?competitors WHERE {
  ?s foaf:page ?country .
  ?s rdf:type <http://dbpedia.org/ontology/OlympicResult> .
  ?s dbpprop:games "2012"^^ .
  ?s dbpprop:competitors ?competitors
} order by desc(?competitors)

If used on DBpedia (a project that mirrors Wikipedia and attempts to expose its data as semantic web constructs), this fairly straightforward 6-line query will return a JSON object listing all countries participating in the London 2012 Olympics and the number of athletes each is sending. Go ahead: try pasting the above snippet into a DBpedia SPARQL query editor, such as the one at live.dbpedia.org/sparql. Accomplishing a similar feat by hand would take hours of scraping or data gathering. Because the endpoint can provide results in JSON, CSV, XML or whatever strikes your fancy, the output can then be fed to a visualization, whether that’s simply a table or something more complex like a bar chart.
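For readers who would rather do this from a script than a query editor, here is a minimal Python sketch of the round trip. It assumes the endpoint follows the standard SPARQL protocol (an HTTP GET with a `query` parameter) and honours an `Accept: application/sparql-results+json` header. The function names and the sample payload are made up for illustration, and the payload uses placeholder values rather than real Olympic figures.

```python
import json
import urllib.parse
import urllib.request


def run_select(endpoint, query, timeout=30):
    """Send a SPARQL SELECT query over HTTP GET and return the decoded JSON."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def bindings_to_rows(payload):
    """Flatten SPARQL JSON results into a list of plain dicts."""
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        # Unbound variables (e.g. from OPTIONAL) are absent from a binding.
        rows.append(
            {v: binding[v]["value"] if v in binding else None for v in variables}
        )
    return rows


# Made-up payload in the standard SPARQL JSON results shape, so the
# flattening step can be tried without touching the network.
SAMPLE = {
    "head": {"vars": ["country", "competitors"]},
    "results": {"bindings": [
        {"country": {"type": "uri", "value": "http://example.org/Country_A"},
         "competitors": {"type": "typed-literal", "value": "12"}},
        {"country": {"type": "uri", "value": "http://example.org/Country_B"}},
    ]},
}

rows = bindings_to_rows(SAMPLE)
```

Against a live endpoint, the call would look something like `bindings_to_rows(run_select("http://live.dbpedia.org/sparql", query))`. The resulting list of dicts is the kind of structure most table and charting libraries accept directly.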

One nice thing about SPARQL is that a lot of the terminology becomes self-evident once you get the hang of it. If you want to find the properties of the OlympicResult class in the DBpedia ontology, you merely have to visit the URL in the rdf:type declaration. That page also links to related classes, so you can find the definitions you need to construct a successful query. For instance, try going to dbpedia.org/page/Canada_at_the_2012_Summer_Olympics, which is the page I used to derive most of the classes and properties for the above query. From that page, you learn that entities of the “olympic result” type are assigned a “dbpprop:games” property (i.e., the year of the games) and a “dbpprop:competitors” property (i.e., the number of competitors, a.k.a. your pay dirt).
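The same exploration can be done with a query rather than a browser. The sketch below builds a SPARQL query that lists every property and value attached to one resource. It relies on the DBpedia convention that a `/page/` URL corresponds to a `/resource/` IRI; the helper names `page_to_resource` and `properties_query` are hypothetical.

```python
def page_to_resource(page_url):
    """Map a DBpedia /page/ URL to the /resource/ IRI used inside queries."""
    if not page_url.startswith(("http://", "https://")):
        page_url = "http://" + page_url
    return page_url.replace("/page/", "/resource/", 1)


def properties_query(resource_iri, limit=200):
    """Build a query listing each property/value pair of a single resource."""
    # Refuse characters that would break out of the <...> IRI syntax.
    if any(ch in resource_iri for ch in '<>" \n\t'):
        raise ValueError("resource IRI contains characters not allowed in an IRI")
    return (
        "SELECT DISTINCT ?property ?value WHERE {\n"
        f"  <{resource_iri}> ?property ?value .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


resource = page_to_resource("dbpedia.org/page/Canada_at_the_2012_Summer_Olympics")
query = properties_query(resource)
```

Pasting the generated query into the same SPARQL editor should list the properties shown on the HTML page, which makes it easier to spot names like dbpprop:games when building a larger query.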

Here’s another, more complex SPARQL query, taken from DBpedia’s documentation:

SELECT DISTINCT ?player {
  ?s foaf:page ?player .
  ?s rdf:type <http://dbpedia.org/ontology/SoccerPlayer> .
  ?s dbpedia2:position ?position .
  ?s ?club .
  ?club ?cap .
  ?s ?place .
  ?place ?population ?pop .
  OPTIONAL { ?s ?tricot . }
  Filter (?population in (, , ))
  Filter (xsd:int(?pop) > 10000000) .
  Filter (xsd:int(?cap) < 40000) .
  Filter (?position = "Goalkeeper"@en || ?position = || ?position = )
} Limit 1000

This selects all pages describing a “player”, of type “SoccerPlayer”, with position “goalkeeper”, playing for a club with a stadium capacity of less than 40,000 and born in a country with a population of greater than 10 million. Producing such a list without the semantic web would be mind-numbingly difficult and would require a very complex scraping routine.

Some limitations

That said, there are some limitations to this. The first is that the amount of well-structured semantic web data out there is limited, at least in comparison with non-semantic web data, though it is growing all the time. Wikipedia/DBpedia seems to be the most useful resource by far at the moment, though semantic web data derived from Wikipedia inherits the main problem of all Wikipedia data: it’s edited by anonymous users. In other words, if something’s incorrect on Wikipedia, it’ll also be wrong in the semantic web resource. Wikipedia also changes quickly, which means the official DBpedia endpoint soon falls out of date. As a result, it’s often better to use live.dbpedia.org, which continuously synchronizes DBpedia with Wikipedia.

The other thing to watch out for is data tampering. If your visualization is hooked up to a data source that users can edit and that has little editorial oversight, the possibility always exists that one of those users will realize the data set feeds a live visualization on a newspaper website somewhere, and will tamper with the data to fill it with profanity or whatnot. As such, while semantic web data from DBpedia might be a good way of getting the initial result, saving that result as a static object within your script afterwards might be the safest course of action.
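One way to follow that advice is sketched below in Python: run the query once, sanity-check the rows, write them to a JSON file, and have the visualization read only that file. The helper names, the file name and the sample rows are all hypothetical, and the check shown (keeping only rows whose count is a plain non-negative integer) is just one simple filter, not a complete defence against tampering.

```python
import json
import tempfile
from pathlib import Path


def validate_rows(rows, numeric_field):
    """Keep only rows whose numeric_field is a plain non-negative integer."""
    clean = []
    for row in rows:
        value = str(row.get(numeric_field, ""))
        if value.isascii() and value.isdigit():
            clean.append({**row, numeric_field: int(value)})
    return clean


def save_snapshot(rows, path):
    """Freeze query results to disk so the live endpoint is no longer needed."""
    Path(path).write_text(json.dumps(rows, indent=2, sort_keys=True), encoding="utf-8")


def load_snapshot(path):
    """Read the frozen results back for the visualization to use."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Placeholder rows standing in for a fetched result; the second one
# imitates a vandalized value and should be dropped by the check.
fetched = [
    {"country": "http://example.org/Country_A", "competitors": "12"},
    {"country": "http://example.org/Country_B", "competitors": "lots"},
]
checked = validate_rows(fetched, "competitors")

with tempfile.TemporaryDirectory() as tmp:
    snapshot_path = Path(tmp) / "olympics_2012.json"
    save_snapshot(checked, snapshot_path)
    restored = load_snapshot(snapshot_path)
```

In a real project the snapshot would live alongside the script rather than in a temporary directory, and would be refreshed by hand after someone has looked over the new data.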

