Counting crime: How journalists make sense of police data

This article was originally published on the Data Journalism Awards Medium Publication managed by the Global Editors Network. You can find the original version right here.



Takeaways from a discussion with experts behind two of the most compelling data projects tackling crime in the US


As a journalist, how do you go about accessing, verifying and visualising datasets on crime and police?


Accessing crime and police data is crucial given the number of shootings and cases of police violence that have made headlines, such as the deaths of Freddie Gray and Philando Castile.

As a journalist, how do you go about accessing, verifying, and visualising datasets on this topic? What kind of ethical questions does that raise? How do you protect the victims? We gathered experts to find out.


Is crime in America rising or falling? The answer is not nearly as simple as politicians sometimes make it out to be.


Tom Meagher is deputy managing editor for The Marshall Project, which has been publishing some of the most compelling crime data journalism of the past few years. Its project Crime in Context won a Data Journalism Awards 2017 prize for its analysis of 40 years’ worth of national and local crime data. Another of its projects, The Next to Die, has been tracking every execution in the US for the last two years in close to real time.

Ciara McCarthy is a journalist who worked on Guardian US’s The Counted project, often referred to as an industry benchmark. It counted the number of people killed by police and other law enforcement agencies in the US throughout 2015 and 2016, monitoring their demographics and telling the stories of how they died.

Both of them joined us during a Slack discussion dedicated to crime and police data at the beginning of November. This article gathers the best tips and advice they dared to share.


The Counted is the most thorough public accounting for deadly use of force in the US


What makes working with crime or police data different from working with any other type of data?


Tom Meagher: Oh, where to begin? In the US, there are a few things that make criminal justice data a little more complicated than in most other beats. First, there’s a presumption of innocence for people accused of crimes until their case works its way through the court systems. So we want to be mindful of how the people our data represents are considered. Not everyone arrested is guilty, but with data it can be easy to overlook that key fact sometimes.

And more practically, in the US the data is so, so fragmented. There are 18,000+ police agencies and thousands of courts that all seem to keep their data in their own way (if they keep it at all). It makes it really challenging to carry out national analyses of how parts of the criminal justice system are operating. There are very few one-stop-shops for data.

Ciara McCarthy: I think, for us at The Counted at least, the main issue we set out to fix was that the data we wanted to analyse and investigate simply didn’t exist. There was no comprehensive or reliable information about how many people died in police custody in the US (although there is lots of available data, of varying reliability, about other pieces of the criminal justice system).

I think that a lot of criminal justice data […] might not be complete or accurate if it’s even been collected. And to echo Tom, that’s the other main issue: With no central body keeping track of the data we were looking at, it was hard to monitor thousands of different law enforcement agencies, all of which follow slightly different policies and standards for releasing information and communicating with reporters.

Although the FBI ‘collects’ this data, it’s wildly inaccurate, and underestimates the true number of people who die in police custody at least by half. It’s optional for police departments to submit their information to the FBI, meaning that most don’t end up doing it.


Previously unpublished data, revealing that only 224 of 18,000 US law enforcement agencies reported fatal shootings in 2014, sheds new light on a flawed system


So would your advice be to ‘build your own data’?


Ciara McCarthy: I think it depends! Once our team started reporting on this issue in particular, it was clear that, at least for deaths in custody, the information the federal government had would have resulted in deeply flawed analyses. But in other areas of the US criminal justice system, the data collected by the government is usable — I think it’s a matter of asking a lot of questions of an available data set before you get started and seeing whether you can make reliable analyses. And if you can’t, then yes! Build your own data.

Tom Meagher: It seems like at The Marshall Project, for nearly every significant investigative story we do, the data doesn’t exist. We have to build it ourselves. As an example, here’s a story I wrote about just a few of the really key criminal justice questions we can’t answer in the US because the data doesn’t exist.


After the deaths of Freddie Gray and Laquan McDonald and others — in an age when police in many cities are under greater scrutiny than they’ve been in decades — how is it that we know so little about how officers employ force to subdue suspects?


As data is tough to get hold of, do you have tips on how or WHERE to find crime and police data?


Tom Meagher: When we’re approaching a story, we have to craft a new strategy every time. For Crime in Context, we had a trove of 40+ years of the federal Uniform Crime Reporting data, but then we had to go back and contact individual police agencies to fill in dozens and dozens of holes we identified.

Then we had to call 70+ police agencies to get them to release the previous year’s data (this was in August) because the FBI didn’t have it yet. We could flag missing records in the data or reports that were suspicious (how could they have -30 assaults in a month?) and had to report each of those out. My friend Steven Rich at the Washington Post likes to say ‘the phone is the most important tool for data journalism’.
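The kind of check Meagher describes, flagging gaps and impossible values so that a reporter knows which agencies to call, can be sketched in a few lines. This is only an illustration: the `MonthlyReport` structure, its field names and the sample figures are assumptions made for the example, not The Marshall Project's actual data or code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonthlyReport:
    # Hypothetical record layout, invented for this example
    agency: str
    year: int
    month: int
    assaults: Optional[int]  # None means the agency filed nothing


def flag_for_follow_up(reports):
    """Return (report, reason) pairs that a reporter should verify by phone."""
    flagged = []
    for r in reports:
        if r.assaults is None:
            flagged.append((r, "missing"))
        elif r.assaults < 0:
            flagged.append((r, "negative count"))
    return flagged


# Made-up sample data
reports = [
    MonthlyReport("Example PD", 2015, 1, 42),
    MonthlyReport("Example PD", 2015, 2, -30),
    MonthlyReport("Example PD", 2015, 3, None),
]

for report, reason in flag_for_follow_up(reports):
    print(f"{report.agency} {report.year}-{report.month:02d}: {reason}")
```

The script only produces a call list. Deciding what a flagged record really means is still reporting work.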

Ciara McCarthy: For us at The Counted we basically went from agency to agency to request and ask for the data. Sometimes we had to request the information under public records law, and sometimes the information (or the basics, at least) was easily distributed. The Counted was a little different from some data analysis projects in that it was live: We added new cases of people killed by police to the database each day.


How do you verify data related to crime and the police, especially when victims come forward to denounce wrongdoing? Any tips or best practice on crowdsourcing for such projects, and establishing trust with sources?


Tom Meagher: We tend to rely on official court records — lawsuit filings, courtroom testimony, decisions — and on other journalists to help us vet information. Our executions project, The Next to Die, is a sort of journalistic crowdsourcing, where we work with reporters and editors in eight other news organisations to help us amass the information that goes into our database.


The Next to Die aims to bring attention, and thus accountability, to upcoming executions.


Ciara McCarthy: A few things I’d point out from our project: First, for us, when we couldn’t give a definitive answer, we noted it (see an example right here). I think part of the genius behind our very brilliant interactive journalists who built the database was they created one that could adapt to our reporting needs as we added to the database.

So if police said someone was armed with a knife, but witnesses said the person had dropped the knife before the shooting, we usually label that ‘disputed’ in our database, and then pursue additional information to try and get a clear answer. In cases of people killed by police, the first piece of information almost always comes from authorities, and that information may or may not be true. So if there are witnesses (often there aren’t) we’ll talk to them to see if they saw something different.
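One way to picture the "disputed" label is as a rule applied to the separate accounts stored for each case. The sketch below is an assumption about how such a rule could look, not The Guardian's actual schema: the `armed_status` function and the source names are invented for illustration.

```python
def armed_status(accounts):
    """Collapse per-source accounts into one label.

    `accounts` maps a source name to what that source says, for example
    {"police": "knife", "witness": "unarmed"}. A value of None means the
    source gave no information on this point.
    """
    claims = {claim for claim in accounts.values() if claim is not None}
    if not claims:
        return "unknown"
    if len(claims) > 1:
        # Sources contradict each other: label it and keep reporting
        return "disputed"
    return claims.pop()


print(armed_status({"police": "knife", "witness": "unarmed"}))
print(armed_status({"police": "knife"}))
```

Keeping every account alongside the label, rather than overwriting the first version, means the record can be updated as new information comes in.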

Secondly, we considered The Counted to be a crowdsourced database, meaning that our readers could reach out and contact us with tips at any time. We had a ‘tip line’ of sorts on our website and we also got information from readers via Facebook, Twitter, and email. Most of the time, the people reaching out to us weren’t sources with sensitive or story-cracking information, but readers with questions about the project or people alerting us to new cases. Sometimes, though, family members of the deceased would reach out to dispute law enforcement’s characterisation of the incident, and when that happened we’d follow up on whatever information they gave us.


The Guardian US had a “tip line” on their website and also got information from readers via Facebook, Twitter and via email


Have you ever been worried about the backlash or negative impact your projects could have?


Tom Meagher: We try to operate in a ‘no surprises’ manner. We go to great lengths to let our subjects know what’s coming out and to give them an opportunity to respond ahead of time. A big story my colleagues undertook on these programmes where you can pay money to stay in safer or nicer jails relied heavily on freedom of information requests and data compiled from more than 25 different police jurisdictions (screenshot below). If you look at the methodology, they describe how they did the analysis and how they took it to each of those police agencies a few weeks before publication to give them a chance to dispute or comment on the analysis.


In what is commonly called “pay-to-stay” or “private jail,” a constellation of small city jails — at least 26 of them in Los Angeles and Orange counties — open their doors to defendants who can afford the option


As far as protecting sources from legal or physical harm, we’re very mindful of that. We go to great lengths to get our sources to go on the record, but if we think they’re potentially in jeopardy, we will allow them to be anonymous, provided we can vet their story independently. We don’t want to put anyone at risk of losing their jobs or of physical harm.

Ciara McCarthy: No one on our team personally encountered any threats or danger as a result of The Counted project as far as I know; I’d say the worst I personally encountered was a few mean tweets and a few terse phone calls with law enforcement officials who weren’t happy about the project. We also didn’t have a ton of anonymous sources whose identity we needed to protect (which I don’t think is something we expected starting out).

Most of the time, if witnesses or family members contradicted the police account, these (very brave) people did so pretty publicly. See, for example, this article (screenshot below) telling the story of an American who filmed police violence. If there were cases where our reporters were working with anonymous sources, they were very cautious and made sure those who were providing information knew what publishing their accounts entailed.


When Feidin Santana filmed Walter Scott’s death, it marked a turning point in the US civil rights movement — and in Santana’s life. He and others who have taken the law into their own hands tell their stories


Do you encounter difficulties in streamlining key definitions (for example ‘armed’ vs ‘unarmed’, or ‘Police custody’), especially when gathering data from multiple sources? How do you resolve these differences?


Tom Meagher: Oh yes, all the time. We find that different agencies or different states will often use the same words but have completely different meanings. In one state, for example, they may have a crime called ‘battery’ that in a different state would be labelled ‘assault’. We first try to make sure that we understand exactly what each term means to each source. We start with getting their data dictionary (or record layout or user’s manual) to see how they define it in print. Then we’ll follow up with interviews with agency personnel to confirm our understanding of the terms. Ultimately, we’ll often create our own categorization scheme that is hopefully more accessible to readers to describe each class of records we see in the data.

In the Pay to Stay story, we had 25+ agencies all using different terms to refer to a fairly arcane set of state statutes that you really needed a law degree to understand. With lots of reporting work, we were able to generally class them as types of crimes with colloquial names (Drugs, Driving Violations) that were still accurate to the legal definitions, muddled as they were. It ultimately made it easier for our readers to grasp the importance of the different types of crimes being reported on.
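A categorization scheme of this kind often ends up as a lookup table keyed by both the source and its own term, because the same word can mean different things in different places. The following is a minimal sketch under that assumption. The state names, the terms and the table itself are hypothetical, and only the category names "Drugs" and "Driving Violations" come from the story described above.

```python
# Hypothetical lookup table: (source, term as the source writes it) -> reader-facing category
CATEGORY_BY_TERM = {
    ("State A", "battery"): "Assault",
    ("State B", "assault"): "Assault",
    ("State A", "dui"): "Driving Violations",
    ("State B", "possession of a controlled substance"): "Drugs",
}


def categorize(source, term):
    """Map a source-specific term to a common category.

    Anything not yet checked against the source's data dictionary is
    returned as "UNREVIEWED" instead of being guessed.
    """
    key = (source, term.strip().lower())
    return CATEGORY_BY_TERM.get(key, "UNREVIEWED")


print(categorize("State A", "Battery"))
print(categorize("State B", "battery"))
```

Returning an explicit "UNREVIEWED" value makes it obvious which terms still need a call to the agency before they are counted.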

“Often in data reporting, it’s tempting to be lulled into thinking that the ‘official data’ that is provided to you is rational and sensible and ready to be analyzed or visualized. In reality, we find most of the time that it’s a complete mess that requires a lot of reporting before we can even think about analyzing it to inform our reporting.” Tom Meagher (The Marshall Project)

Pay-to-stay is a curated collection of links by The Marshall Project, part of their Records project


Ciara McCarthy: We ran into this issue A LOT while working on The Counted project, particularly when it came to defining whether the deceased was armed or unarmed, as you noted. As you can imagine, the law enforcement definition of someone who is armed might differ from what others would consider armed, or the police account might change over time. We ran into this a lot when police shot and killed someone who was driving a car; often, they would say, they opened fire because the person in question was using the car as a weapon. (We did a bigger piece on this here).

That’s obviously super tricky, because it’s difficult to corroborate without video or a witness. A good example of this issue is the case of Zachary Hammond, a teenager who was shot and killed in South Carolina in 2015. Police initially said he drove the car toward the officers, which is why one opened fire. Surveillance footage released later showed that Hammond was driving past the officer, and not directly at him.

So I don’t have an easy answer! Sometimes the only available info we had was from police, but we’d do our best to find other sources when the police account seemed questionable. Basically, it meant a lot of extra reporting and a lot of discussions among our team members.


What tips do you have on visualising crime and police data? How and why do you decide whether or not to show people’s name, photo, or personal information?


Ciara McCarthy: With The Counted, we had built this big database, and wanted people to be able to use it and explore it and learn from it. That’s a main reason why the database included photos, whenever possible: We really wanted to put a face on each person who had died, so we weren’t only focusing on the overall number of people who died.

As for personal information, we would include what was relevant; so, for example, if a person’s medical or mental health history might have impacted their interaction with authorities, we’d be sure to note that.


For regular updates from The Next to Die, follow @thenexttodie on Twitter


Tom Meagher’s tips:

  • You want to give your data context.
  • Avoid one-year comparisons.
  • Set it against historical data as much as possible.
  • As you visualize it, try to remember that every record in that database represents a person — someone who was injured or victimized or killed, or someone who has committed crimes.
  • Try to use your visualization to emphasize their humanity as much as you can. Dots or jagged lines sometimes obscure the people they represent.
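The advice to avoid one-year comparisons can be made concrete with a small calculation that sets the latest year against both the previous year and a longer baseline. This is a sketch with invented numbers: the `compare_to_history` function, the five-year window and the counts are assumptions chosen for illustration, not figures from Crime in Context.

```python
from statistics import mean


def compare_to_history(counts_by_year, latest_year, baseline_years=5):
    """Compare one year with the previous year and with a multi-year average."""
    earlier = sorted(y for y in counts_by_year if y < latest_year)
    window = [counts_by_year[y] for y in earlier[-baseline_years:]]
    if not window:
        raise ValueError("need at least one earlier year")
    latest = counts_by_year[latest_year]
    previous = window[-1]
    baseline = mean(window)
    # Note: a count of zero in the previous year or baseline would divide by zero
    return {
        "vs_previous_year_pct": 100 * (latest - previous) / previous,
        "vs_baseline_pct": 100 * (latest - baseline) / baseline,
    }


# Made-up counts for a single city
counts = {2010: 120, 2011: 110, 2012: 100, 2013: 90, 2014: 80, 2015: 88}
print(compare_to_history(counts, 2015))
```

With these invented counts, the latest year is 10% above the previous one but 12% below the five-year average, which is exactly the kind of context a single year-on-year figure hides.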


Is there one thing you wish someone had told you before you took on The Counted and the Next To Die projects?


Tom Meagher: Building your own databases for open-ended projects can be very fulfilling as a journalist. You’re filling a gap in the public’s understanding of an issue. It’s very worthy. But also keep in mind that you’re committing your news organization to an endless project.

Does the story merit your time and your colleagues’ time for the indefinite future? I’d argue that The Counted and the Next to Die do. But you don’t want to make the decision without understanding the costs and all the other reporting you won’t be able to do for the next few years because you’ll have to be updating your database.

Also, these can be very emotionally taxing subjects to report on. You’re spending your entire professional life (and much of your personal life) immersed in stories of violence, and trauma, and misery. Be sure to take care of yourself and give yourself emotional outlets.


What do you think could be done to improve things? Do we just need more comprehensive data from authorities compiled in a standardised way?


Tom Meagher: The division of powers between local and state and federal governments in the US makes it complicated. There’s realistically not going to ever be a single source for reliable data. What would be a vast improvement would be if more politicians and policymakers embraced the ideas of transparency and accountability, and the idea that better, smarter data will help them and the public understand our justice systems and make better decisions.

As journalists, we’d certainly benefit from that change in mindset, which is still too rare here.

Ciara McCarthy: It would be lovely to get more comprehensive data, but perhaps that’s just wishful thinking. I think getting data from a variety of sources and different types of data will help — comparing a database of media reports vs. official data, for example. That’s what my team is doing with our project, anyway.
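Comparing a media-based database with official data comes down to working out which cases appear in both, and which appear in only one. A minimal sketch, assuming each case has already been reduced to a matching key; the `(date, agency)` keys and the sample cases below are hypothetical.

```python
def compare_sources(media_cases, official_cases):
    """Split cases into those found in both sources and those found in only one."""
    media = set(media_cases)
    official = set(official_cases)
    return {
        "both": media & official,
        "media_only": media - official,
        "official_only": official - media,
    }


# Made-up cases, identified by (date, agency)
media_cases = [("2015-03-01", "Example PD"), ("2015-03-09", "Sample County Sheriff")]
official_cases = [("2015-03-01", "Example PD")]

result = compare_sources(media_cases, official_cases)
print(sorted(result["media_only"]))
```

In practice the hard part is the matching key itself, since names and dates are often recorded differently by each source, so every mismatch is a lead to check and not a finding.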

More comprehensive data from authorities would be amazing, of course, but when that’s not an option I think building your project is a great public service for newsrooms to undertake. One of my favourite things about The Counted was that, on the surface, its mission and premise were pretty simple: The US government should know how many people are killed by police each year. We don’t, so let’s change that.

There’s obviously a ton of different reporting that can (and should!) be done on issues related to police violence, but one thing I really liked about our project was that, at the heart of it, we were saying that we can’t have this public policy discussion without reliable data. I think having this specific, and sometimes narrow, aim for big journalism projects can be really clarifying, and help you achieve impact.


How does it compare in other parts of the world?



Aun Qi Koh of Malaysiakini (Malaysia): I feel like it’s the opposite problem in Malaysia as the official data comes from just one source, the Interior Ministry/Royal Malaysian Police, but it’s not very detailed, and unfortunately we don’t have many other sources of data because there aren’t many checks and balances on the police.



Shree D N of Citizen Matters (India): India has a problem with under-reporting in its crime data. The National Crime Records Bureau is the official data source, but under-reporting is common. This article has some insights on the issue. The methodology used to record offences leads to under-reporting of rape, abduction and stalking.



Eva Constantaras is a data journalist and trainer who recently wrote the Data Journalism Manual for the UN Development Program.


During our November Slack discussion she shared with us great examples from Kenya, Afghanistan and Turkey:

“I think The Counted inspired so many other media outlets because they realized they could build their own databases using similar data collection techniques but getting away from official sources. The Kenya Nation Newsplex team used mostly media reports to compile its Deadly Force Database.

Pajhwok Afghan News maintains a database of terrorist attacks that is much more detailed than anything the government or international bodies maintain. It’s not too much work because they cover all terrorist attacks anyway so they just have to enter them into the database. And then they can generate monthly stories on trends in terrorism in Kabul and across Afghanistan without too much effort.

This paper on collaboration between civic tech and data journalists I think is also relevant. In Turkey, Dag Media works with a domestic violence NGO to track violence against women. The NGO builds the database and the journalists do the stories.”


To see the full discussion, check out previous ones and take part in future ones, join the Data Journalism Awards community on Slack!

Over the past six years, the Global Editors Network has organised the Data Journalism Awards competition to celebrate and credit outstanding work in the field of data-driven journalism worldwide. To see the full list of winners, read about the categories or join the competition yourself, go to our website.


Holding the powerful accountable, using data

This article was originally published on the Data Journalism Awards Medium Publication managed by the Global Editors Network. You can find the original version right here.


From left to right: screenshots of Fact Check: Trump And Clinton Debate For The First Time (NPR, USA), Database of Assets of Serbian Politicians (KRIK, Serbia), and Ctrl+X (ABRAJI, Brazil)


It is referred to as one of the main goals of modern journalism, and yet, in many parts of the world, holding the powerful accountable brings a great many threats and challenges.

How do you go about investigating corruption and finding the data that your government or powerful individuals want to keep hidden? What issues do most data journalists face when working on such investigations and how do they tackle them?

As season 7 of the Data Journalism Awards competition starts this fall, we set up a group discussion on Slack last week and gathered Amita Kelly of NPR (USA), Jelena Vasić of KRIK (Serbia) and Tiago Mali of ABRAJI (Brazil) to discuss the challenges of holding the powerful accountable using data. The three of them gave us great insights on the state of data journalism across Eastern Europe and the Americas.


From left to right: Amita Kelly of NPR (USA), Tiago Mali of ABRAJI (Brazil) and Jelena Vasić of KRIK (Serbia)


In Brazil, the political and judiciary systems seem to go hand-in-hand against freedom of speech


“There is a perception, amongst the politicians and the judiciary system, that they don’t have to be accountable,” said Tiago Mali, project coordinator at The Brazilian Association of Investigative Journalism (ABRAJI) in Brazil.

“The checks and balances are too weak and the judges are often close to the politicians. So many times the first instance judges favour censorship against the media to preserve the politicians. They help each other against freedom of speech.”

In September 2017, the mayor of Betim, a city in Minas Gerais, sued a website that published an investigation against him, Mali explained. The journalist who worked on the story also received threatening calls.

The team at ABRAJI realised that part of the problem was that the judiciary system was not held accountable. They started to expose judges, lawsuits and decisions that aimed at censoring the media.

“It’s our way to increase society’s pressure on them and to shed a light on their misbehaviour,” Mali said.

“We haven’t been directly threatened here in ABRAJI, but we report on cases of many journalists that are being constantly threatened.”


The project Ctrl+X is a database that gathers lawsuits in which people, politicians or companies try to remove content from the internet and hide information from Brazilian audiences.


A Brazilian project denounces politicians trying to remove information from the public eye


ABRAJI won a Data Journalism Awards prize in June 2017 for their project Ctrl+X, which scraped thousands of lawsuits and catalogued close to 2,500 filed by Brazilian politicians who were trying to hide information from the public eye.

“We started because we realised there were too many cases of politicians pulling their weight to silence journalists in courts. We knew of former presidents, governors, and mayors using the judiciary system to prevent the publication of news about them they were not too comfortable with, a practice that we assumed had died with the dictatorship in the 1980s,” Tiago Mali said.

“We didn’t know then how many cases they were amounting to, so we did what every good journalist should do in such a situation: we started the count ourselves.”

In the beginning, in 2014, ABRAJI asked media lawyers and media organisations to provide them with details on the lawsuits filed against them. This work had some impact on the 2014 elections, but not everyone was willing or had time to cooperate.

So the team wanted to go further. In 2015 and 2016, ABRAJI developed scraping tools to parse the many court websites in Brazil for this sort of lawsuit. “As we improved our system, we started to count the cases not in dozens, but in thousands,” Tiago Mali said. “We cannot say that we were not surprised by this.”

“Since its publication, CTRL+X has not only provided insightful data on freedom of expression, but also made their data available for other media to report on the transparency issue. It was crucial that this data be of use for the 2016 election,” said Yolanda Ma, editor of Data Journalism China and jury member of the Data Journalism Awards competition.


Journalists who investigate politicians’ wrongdoings in Serbia face multiple threats


Screenshot of the story by KRIK investigating Serbia’s Defense Minister, Aleksandar Vulin


In September 2017, Serbia’s Defense Minister, Aleksandar Vulin, was at the heart of an investigation by KRIK, the Crime and Corruption Reporting Network in Serbia. He told the country’s anti-corruption agency that his wife’s aunt from Canada lent the couple more than €200,000 to buy their Belgrade apartment, but did not manage to submit convincing evidence to support his claim.

“Vulin’s political party then started publishing official statements against KRIK’s editor, and continued to do so for several days,” said Jelena Vasić, journalist at KRIK. They allegedly said that “KRIK’s editor Stevan Dojcinovic was a ‘drug addict who needs to be tested for drugs’, and accused him of being paid by foreigners to attack the minister.”

The political party also rudely attacked every public figure who came to KRIK’s defence.

After this incident, EU institutions informed Belgrade that they would be tracking the behaviour of Serbia’s officials towards media organisations during the accession process.

But this is not an isolated incident for KRIK. Last July, the home of Dragana Peco, KRIK’s award-winning investigative reporter, was broken into and her belongings turned over, Jelena Vasić explained, alleging foul play. “KRIK journalists have also received death threats on social media,” she said.


KRIK created the most comprehensive online database of assets of Serbian politicians


A Serbian database of politicians’ assets


KRIK won a Data Journalism Awards 2017 prize last June for creating the most comprehensive database of assets of Serbian politicians, which currently consists of property cards of all ministers of the Serbian government and all Serbian presidential candidates running in the 2017 elections.

The database was launched to help Serbian citizens to better understand who the people running their country are and promote greater transparency.

Each profile contains information about the apartments, houses, cars and companies of current ministers or presidential candidates, and details about how they came to possess them.

“What KRIK did with their database project went beyond simply opening data up for examination; they opened minds,” said Paul Radu, executive director of the Organized Crime and Corruption Reporting Project (OCCRP) and a member of the Data Journalism Awards 2017 jury.

“Their work allowed people in Serbia, where open access to data is limited, to see what wealth their politicians had accumulated. The publication of the database sparked investigations by the Serbian Anti-Corruption Agency. At the same time, KRIK journalists were monitored and recorded, and the organisation subjected to smear campaigns. But they persevered in the name of public accountability and transparency.”

The Online Database of Assets of Serbian Politicians attracted a lot of attention. No other organisation in Serbia had ever gone to such depth to investigate this subject as KRIK did.

This database has contributed to higher government transparency and now, details on politicians that would otherwise be hidden are in the public domain.


Journalists in the USA also get their share of challenges


It is no secret that trying to enforce transparency from prominent figures is an uphill battle in the US. Barely six months ago, the current President’s elusive tax returns were a hot topic. “We find that it varies a lot with who is in power and what agency we are looking at,” said Amita Kelly, digital editor for NPR.

“Some are much more transparent and have very detailed policy papers, for example, that can be picked apart. Our challenge in the 2016 election was that with the increasing use of digital and social media by campaigns and candidates, it was often difficult to parse what is truly a policy versus an opinion.”

Has Trump’s election changed the way journalists hold the powerful accountable in the USA?

Amita Kelly argued there have always been difficulties with getting to the center of what the government or corporations are doing:

“I think what changed during the Trump campaign was that his policy proposals or political stances evolved very much over the course of the campaign and his presidency,” Kelly said.


A fact-checking project on political debates in the USA


NPR’s politics team, with help from reporters and editors who cover national security, immigration, business, foreign policy and more, live annotated the debate between Trump and Clinton back in September 2016.


Kelly’s team won a Data Journalism Awards prize last June for their project Fact Check: Trump And Clinton Debate For The First Time, which was the culmination of their day-to-day fact-checking efforts, but on a larger scale due to its live aspect and the number of reporters involved.

“We relied a lot on our journalists’ body of expertise to fact check statements from the campaign and the President — either to confirm what they said or more often, counter things they said with correct information”, Kelly argued. “So it was less a matter of difficulty in finding the information, but more about what we do with the information that’s getting out there.”

Kenneth Cukier, senior editor for digital at The Economist, and member of the Data Journalism Awards 2017 jury, said of the project: “In a world of fake news, one of the most important tasks of journalism is to respond to spin or outright lies with truth quickly and simply, and with sources.”

“NPR did a thoughtful, novel and effective job at checking both US presidential candidates’ statements. The outlet verified, criticised or enriched candidates’ points in a way that marshalled data and facts. It shows how the ethos of journalism for truth can be embedded into code to create a new way to present news events with responsible criticism just alongside it.”


How do you face and tackle threats during such investigations?


All three organisations have systems in place to cope with attacks, intimidation or threats towards journalists.

KRIK has developed a system of defence in situations when they are publicly attacked or when there is a smear campaign against them. “Threats have never stopped us,” Jelena Vasić said.

“We immediately write to all our donors, partners, national and international journalists’ associations, and public figures to tell them what is happening and ask them to give us official statements. Then we publish all of those statements, one by one on our website, so our readers can see that we have the support of professionals and of the community.”

KRIK also frequently asks its readers on social media for financial support, using these kinds of incidents to expand its crowdfunding community and show that the people of Serbia are on its side. This is reminiscent of ProPublica’s “We’re not shutting up” campaign last year.

“We have made a special page on our website where we record (in reverse chronology) every attack on KRIK,” Vasić added.


For additional security, they also have special procedures: journalists working on a story can only talk to their editor about it, and KRIK staff use Signal for telephone communications, as well as encrypted emails.

Tiago Mali of ABRAJI pointed out that journalists facing threats shouldn’t do so on their own.

“It’s important that we unite to defend ourselves against them,” he said. “In Abraji, we monitor these threats and try to investigate aggressions against journalists. The spirit is: if you mess with one, you mess with all.”

The Brazilian organisation also has a project in place called Tim Lopes (named after a journalist who was killed in 2002) where journalists from all over Brazil investigate the deaths of other journalists.

NPR have a system in place to handle threats depending on the level. “We of course get a lot of social media threats that we have to choose whether to engage or not,” Amita Kelly said. “And some of our reporters felt threatened at campaign rallies, etc. But we are very lucky that it is not a persistent issue.”


How do you get hold of the data that your government or powerful individuals want to keep hidden?


For ABRAJI it all started with regularly scraping the judiciary system for lawsuits. “The problem is that there is no flag or anything structured in a lawsuit that tells you it is about censorship or content removal,” Tiago Mali said.

“So we have tried and improved different queries that get us closer to the lawsuits we are looking for. As we collect thousands of these lawsuits, we read every single one of them and sort and classify the ones related to the project. It’s a time-consuming process we automatised step by step.”

The team at ABRAJI now wants to use machine learning to sort and classify the lawsuits. “We want to build an algorithm that does everything automatically, and we would use our time only to review this work,” Mali said. “This would be a tremendous upgrade in efficiency but we still lack the funds to build this structure.”
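To make the query-then-review workflow described above more concrete, here is a minimal sketch in Python of a keyword-based first pass that flags lawsuits for a human to read. It is an illustration only, not ABRAJI's actual code: the keyword list, the function names and the sample records are all invented for this example, and a real project would work with Portuguese-language terms refined over many iterations.

```python
import re
import unicodedata

# Hypothetical search terms; a real project would tune these over time.
KEYWORDS = ["content removal", "remove the article", "take down", "censorship"]


def normalise(text: str) -> str:
    """Lowercase, strip accents and collapse whitespace so matching is forgiving."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", without_accents).strip().lower()


def matched_keywords(text: str, keywords=KEYWORDS) -> list[str]:
    """Return the keywords found in the text, in the order they are listed."""
    clean = normalise(text)
    return [kw for kw in keywords if normalise(kw) in clean]


def triage(lawsuits: dict[str, str]) -> dict[str, list[str]]:
    """Map each lawsuit id to its matched keywords, dropping those with no match."""
    results = {}
    for case_id, text in lawsuits.items():
        hits = matched_keywords(text)
        if hits:
            results[case_id] = hits
    return results


# Invented sample records, for illustration only.
sample = {
    "case-001": "The plaintiff requests the content   removal of a blog post.",
    "case-002": "Dispute over unpaid rent for a commercial property.",
    "case-003": "The candidate asks the court to TAKE DOWN the video.",
}

flagged = triage(sample)
```

A filter like this only narrows the pile. Each flagged lawsuit would still need to be read and classified by a person, which is the step the team hopes machine learning could eventually take over.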

For their database of assets of Serbian politicians, KRIK has used company, criminal, court, and financial records, but also land registry records, sales contracts, and loan and mortgage contracts from Serbia and other countries such as Montenegro, Bosnia and Herzegovina, Croatia, Italy and the Czech Republic, and even from offshore zones such as Delaware, the UAE, and Cyprus.

“We have used FOI requests very often in this project,” Jelena Vasić said. “Major difficulties came from state institutions which stopped replying to our FOI requests, but at the same time they were revealing all details from those requests to politicians and pro-government media, which then used it in smear campaigns against KRIK.”

“In situations like this one, we talk to the Commissioner for Information of Public Importance and also write on our website and social media about the institutions that are not replying to our FOI requests. Despite all the efforts of the authorities to prevent us from obtaining important information, we have managed to get to the majority of documents we needed.”


There is good impact, and there is bad impact


When investigating wrongdoing, trying to bring forward what is kept hidden or denouncing corruption, news teams aim for positive impact.

“Since the very beginning, we wanted to provide data so there could be more journalistic stories on how the politicians and judges are harming freedom of expression in Brazil,” Tiago Mali said.

“We managed to achieve this goal.”

Because Ctrl+X provided insightful data, freedom of expression, a subject normally ignored by Brazilian media, managed to make the news. At the end of the 2016 electoral campaign, more than 200 articles about politicians trying to hide information had been published in Brazilian media using the project’s data. All major Brazilian newspapers, relevant radio stations and a TV show ran stories on freedom of expression with their information.

Yet sometimes an investigative project ends up changing the law, and not necessarily for the better, as was the case in Serbia:

“Because of our investigation, the Serbian Land Registry has changed the way of replying to FOI requests,” Jelena Vasić said. “They have decided that every response from their office should get approval from the headquarters in Belgrade, which was not the case before.”

As for NPR, they’ve noticed a real hunger for fact checks and stories that seek the truth on government leaders. “Our debate fact check was the story with the highest traffic ever, with something like 20+ million views, and people stayed on the story something like 20 minutes, which means they actually read it,” Amita Kelly said.


What could be done to make the job of holding the powerful accountable easier for journalists?


Approve and enforce freedom of information laws: that is what Tiago Mali argues. “Here in Brazil, a big shift happened after the approval of our FOIA. When you don’t need to rely on the willingness of the powerful to give you information (because a law says so), everything becomes much easier.”

“I think it would be very useful if international institutions could react every time a reporter is exposed to public attacks, because here in Serbia our government is afraid of international pressure,” Jelena Vasić added.

For Amita Kelly, it is definitely about pushing for more transparency all around, including laws such as the Freedom of Information Act they have in the U.S. where journalists can request government information. She also thinks news organisations should invest “in allowing reporters to get to know a beat”. Covering an area for a long time helps to develop invaluable sources and expertise.


Bonus: tools and resources used in investigative projects


During our Slack discussion, Tiago Mali of ABRAJI revealed they used Parsehub for the Ctrl+X project. It is a tool for extracting data from websites.

“We have worked with a lot of high-end tools here, programming, etc. But, still, I think there is no faster way to organise the information you work hard to collect than a spreadsheet. Sometimes the spreadsheet has to be a bigger database, a SQL or something you need R to deal with. But still, being able to make queries and organise your thoughts is really important to the investigation.”

Jelena Vasić loves to use company search websites (similar to OpenCorporates) and also Facebook Graph.

“We used different online sources, and were searching through different databases: Orbis and Lexis databases containing millions of entries of companies worldwide that also contain information on shareholders, directors and subsidiaries of companies.”

Vasić also pointed at different local business registries online in Serbia, Bosnia and Herzegovina, Montenegro, Czech Republic and local land registries in Serbia, Montenegro, Croatia.

“Google Docs is simple but has been amazing for collaboration,” Amita Kelly added. “At one point we had up to 50 people across the network in one document commenting on a live transcript.”


To see the full discussion, check out previous ones and take part in future ones, join the DJA community on Slack!

Over the past six years, the Global Editors Network has organised the Data Journalism Awards competition to celebrate and credit outstanding work in the field of data-driven journalism worldwide. To see the full list of winners, read about the categories, join the competition yourself, go to our website.

Marianne Bouchart is the founder and director of HEI-DA, a nonprofit organisation promoting news innovation, the future of data journalism and open data. She runs data journalism programmes in various regions around the world as well as HEI-DA’s Sensor Journalism Toolkit project and manages the Data Journalism Awards competition.

Before launching HEI-DA, Marianne spent 10 years in London where she worked as a web producer, data journalism and graphics editor for Bloomberg News, amongst others. She created the Data Journalism Blog in 2011 and gives lectures at journalism schools, in the UK and in France.