A decade of digital evolution to help reporting revolutions at ICIJ
Groundbreaking journalism like the Panama Papers and FinCEN Files investigations wouldn’t be possible without building our own cutting-edge technology.
Five years ago, on May 9, 2016, the International Consortium of Investigative Journalists published the details of more than 200,000 offshore entities from the Panama Papers to the Offshore Leaks Database.
The addition of so much previously secret tax haven data was the culmination of more than 12 months of rigorous analysis and processing of one of the world’s largest data leaks.
But it wasn’t ICIJ’s first time working with data sets on a scale unseen in traditional journalism.
Five years prior, even before ICIJ officially had a data team, director Gerard Ryle obtained a set of 2.5 million files — at the time the largest of its kind in history — that would eventually become the Offshore Leaks investigation, published in 2013.
As ICIJ research editor Emilia Díaz-Struck remembers, there was a notable quirk in the process for collaborating internationally back in those days.
“With Offshore Leaks, there was a local search tool where you had to go to the ICIJ office to find the documents connected to your country!” Díaz-Struck said.
ICIJ’s technological capabilities have evolved significantly since the time when a reporter had to hop on the phone — or on a plane — to Washington in order to access records. From powerful database tools to innovative platforms for sharing findings, ICIJ has established itself not only as a go-to organization for coordinating major international investigations, but as a creator of the tools needed to enable future cross-border collaborations.
Keep it secret, keep it safe
In a world where investigative journalists are often targets of harassment by the powerful people they report on, security and privacy are the first priorities of any platform ICIJ develops — especially when the nature of ICIJ’s investigations often entails several months of research and reporting before any stories are published.
From ICIJ’s first global collaboration — an investigation into the tobacco industry in 2000 — security and secrecy were integral to the success of the project. Writing for the Guardian 20 years ago, ICIJ member David Leigh described how innovative use of technology and good security practices allowed the collaborators to keep their investigation under wraps until the day it was published, ensuring maximum impact.
“The ICIJ communicates via the internet and a system of secure emails. Its existence demonstrates that it is possible to use the net not merely as a source of information, but as a means of bringing journalists together to work in a new way,” Leigh wrote.
In the 20 years since, keeping investigations secret — and keeping collaborators trained on new security technology — has been a constant challenge.
“One of my big beliefs is that creating a user-friendly interface [means] creating more security for your users,” ICIJ’s chief technology officer, Pierre Romera, said.
Journalists working on the Panama Papers in 2016 had to maintain a separate user account for each of ICIJ’s many research platforms, which meant (hopefully) remembering multiple passwords and login protocols. It also meant multiple points of potential account compromise.
Since then, Romera’s team has built a custom user management system that acts as an all-purpose gateway to ICIJ’s internal platforms. Emails to users are protected by PGP encryption, and two-factor authentication is required at every login. Offering users a single account with enhanced security is both more user-friendly and more secure, Romera said.
While some things have evolved with each ICIJ investigation, other methodologies remain unchanged, according to Romera. ICIJ trains all its collaborators in encryption technologies, and also continues to prefer open-source software whenever possible, to ensure no malicious code is processed on ICIJ servers.
Romera emphasized that no solution is ever 100% safe, but that the precautions ICIJ takes are designed to make things as secure as possible so that journalists who work on ICIJ projects can feel confident that they — and their sources — are not being put at unnecessary risk.
Staying in touch with hundreds of journalists — all at once
Once journalists have secure access to information, they need a secure place to share findings and collaborate.
During the Offshore Leaks investigation, ICIJ’s communication included a rudimentary forum but was still dominated by email and phone calls, which, multiplied across a team of more than 100 journalists in 58 countries, became a significant logistical challenge. Thankfully, communication technology has advanced enormously since 2013.
In 2014, inspired by the experience of the Offshore Leaks collaboration and riffing off an idea by ICIJ member Giannina Segnini, ICIJ started developing the first version of the Global I-Hub. Supported by a Knight Foundation Prototype Grant and based on an open-source framework called Oxwall, the I-Hub aimed to become a nexus of communication that would allow the entire team to share and organize their findings in a central forum.
Oxwall required significant alterations to make it fit for purpose: the software was designed more as a dating platform or a meeting place for niche online communities than for a team of journalists working in secret on a confidential investigation. ICIJ developers had to remove baked-in questions about dating preferences, tweak other features, and experiment with security upgrades before the I-Hub could be rolled out during the 2015 Swiss Leaks investigation.
“You wouldn’t recognize it nowadays, but it looked like a social network,” Díaz-Struck said. “You’d have groups and the possibility to chat with people in the project, upload files and photos.”
The I-Hub became a central component of ICIJ’s process for the Panama Papers, allowing for a natural bridge between reporters working all over the globe. Users researching politicians or other powerful figures in their countries could come together and exchange not just their results, but tips on how to improve their searches and find patterns in the files.
“Magic happened, because the project was so complex but everyone was eager to understand the files we had in front of us, that everyone started sharing [their findings],” Díaz-Struck said. “It happened naturally, and I don’t think the Panama Papers would have been possible without the I-Hub.”
But the size of the Panama Papers and subsequent Paradise Papers investigations, and the volume of information being exchanged between journalists, stretched Oxwall beyond its limits. In 2019, the I-Hub migrated to a new platform, Discourse, to allow for further customization, and ICIJ’s tech team has been building bridges between the I-Hub and Datashare to make global collaboration even more seamless.
How do you dig through 11.5 million documents — or more?
While the I-Hub is useful as a way to enable global collaboration among ICIJ staff and worldwide media partners, Romera points out that it is only one piece of the puzzle.
“I think the I-Hub is a good metaphor for what ICIJ is,” Romera said. “It’s a virtual newsroom with everyone working together and sharing leads. But from a technical point of view, the most important things are on Datashare, because that’s where the stories are — in the documents.”
Datashare is the crown jewel in ICIJ’s repertoire of tech tools. Built in-house in a collaboration between ICIJ’s developers, researchers, data specialists and investigative reporters, Datashare is a platform that can take millions of leaked files — PDFs, emails, documents, spreadsheets, and more — and make them searchable, filterable and easy to analyze.
Like many pieces of the ICIJ tech puzzle, Datashare’s genesis was in those first big data investigations. For Offshore Leaks, there was Interdata, which gave journalists limited access to some of the leaked files; for the rest, they had to make requests of ICIJ’s D.C. team or visit the office in person.
This changed in 2014, when ICIJ officially established a dedicated data unit made up of just three people: data journalist and team leader Mar Cabra, developer Matthew Caruana Galizia, and programmer and data analyst Rigoberto Carvajal.
The data unit developed a cross-platform tool called Extract, which brought together a number of open-source platforms that could perform optical character recognition (OCR) to extract text from PDFs and other file types, then process and index the files into a structured set for easier searching. One of the key innovations was Extract’s scalability. For a leak the size of the 2.6-terabyte Panama Papers, the load was distributed across some 35 servers to speed up the process.
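To give a sense of what spreading a processing job across many machines can look like, here is a minimal sketch in Python. It is not a description of how Extract works internally: the hash-based assignment, the function names (assign_worker, partition) and the sample file paths are all assumptions made for illustration. Only the figure of 35 servers comes from the account above.

```python
import hashlib

NUM_WORKERS = 35  # "some 35 servers" per the article; everything else is hypothetical


def assign_worker(path: str, num_workers: int = NUM_WORKERS) -> int:
    """Map a file path to a worker index in a stable, repeatable way."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(paths, num_workers: int = NUM_WORKERS):
    """Group paths into one work queue per worker."""
    queues = [[] for _ in range(num_workers)]
    for path in paths:
        queues[assign_worker(path, num_workers)].append(path)
    return queues


# Hypothetical file names standing in for leaked documents.
sample_paths = [f"leak/folder-{i % 7}/file-{i}.pdf" for i in range(1000)]
queues = partition(sample_paths)

# Every file lands in exactly one queue, so the queue sizes add up to the total.
print(sum(len(q) for q in queues))
```

Because the assignment depends only on the file path, a file always goes to the same worker if the job is rerun, and each worker could then run OCR and indexing on its own queue independently of the others.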
Once the documents were indexed, ICIJ turned to Project Blacklight, a user-interface system often used by libraries and museums to facilitate searches across large sets of information — like a catalogue, or a cache of 11.5 million leaked files.
But again, ICIJ found itself outgrowing this tech stack.
Datashare was already in early development at the time of the Panama Papers, but didn’t see its first official use until 2019 as part of the Bribery Division investigation. The platform still uses some of the original technology, including an upgraded version of Extract, but is otherwise a total overhaul that has significantly expanded ICIJ’s capacity to rapidly process data and has also added a rich set of features to improve search functionality.
Popular bespoke features that ICIJ developed or upgraded for Blacklight are now baked into Datashare, including easy filtering of search results and a batch search tool, which lets users upload a list of terms (for example, the names of all of the members of Congress) and find all the documents that match any of them.
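The idea behind a batch search can be sketched in a few lines of Python. This is an illustrative toy, not Datashare's implementation: the batch_search function, the in-memory dictionary of documents and the sample names are all hypothetical, and a real system would query a search index rather than scan text directly.

```python
from collections import defaultdict


def batch_search(terms, documents):
    """Return {term: [doc_id, ...]} for each term found in at least one document.

    Matching is a simple case-insensitive substring test.
    """
    lowered = {doc_id: text.lower() for doc_id, text in documents.items()}
    hits = defaultdict(list)
    for term in terms:
        needle = term.strip().lower()
        if not needle:
            continue  # skip blank lines in the uploaded list
        for doc_id, text in lowered.items():
            if needle in text:
                hits[term].append(doc_id)
    return dict(hits)


# Hypothetical documents and search list.
documents = {
    "doc-001": "Incorporation certificate for Example Holdings Ltd, director Jane Roe.",
    "doc-002": "Email from John Doe regarding a transfer.",
    "doc-003": "Invoice addressed to JANE ROE.",
}
results = batch_search(["Jane Roe", "John Doe", "Nobody Here"], documents)

for term, doc_ids in results.items():
    print(term, doc_ids)
```

Terms with no matches are simply absent from the result, which lets a reporter see at a glance which names on a long list appear anywhere in the files.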
Datashare also adds entity extraction, which automatically recognizes names, addresses and other structured information from inside documents, and flags them for analysis, filtering, and further investigation.
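As a rough illustration of what entity extraction does, the Python sketch below pulls email addresses and runs of capitalized words out of a sentence using regular expressions. This is a deliberately naive stand-in and not the method Datashare uses; the patterns, the extract_entities function and the sample sentence are assumptions for illustration, and simple patterns like these will miss many real names and flag many false ones.

```python
import re

# Naive patterns, assumed for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CAPITALIZED_PHRASE = re.compile(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b")


def extract_entities(text):
    """Return candidate entities found in text, grouped by a rough type."""
    return {
        "emails": sorted(set(EMAIL.findall(text))),
        "capitalized_phrases": sorted(set(CAPITALIZED_PHRASE.findall(text))),
    }


# Hypothetical sentence standing in for a leaked document.
sample = "Jane Roe wrote to john.doe@example.org about Example Holdings Limited."
entities = extract_entities(sample)

for kind, values in entities.items():
    print(kind, values)
```

Once candidates like these are attached to each document, they can be used as filters, for instance to list every file that mentions a given person or company.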
The tool is not intended only for use by ICIJ’s journalists. ICIJ has developed an open-source version of Datashare that can be implemented locally by any organization that wants to be able to search and analyze its own collection of records.
“The reason why we produce open-source [tools] is because our mission is to produce things that are useful for the public,” Romera said. “Making a tool like Datashare open and available to the public is a way to provide service to our community and to ensure that everything that we produce complies with our philosophy at ICIJ, which is transparency and openness.”
Looking to the future
User experience designer Soline Ledésert joined ICIJ in 2019 to help shape the continued development of Datashare and other ICIJ tools. With a wide range of technical expertise levels among journalists, her role involves striking a fine balance between making core functions as simple as possible while also giving more tech-oriented users a chance to access advanced features.
But, in an attempt to maximize privacy for users, ICIJ avoids using site analytics on tools like Datashare to assess, for example, how many times certain buttons are clicked — even though access to those metrics could make UX updates easier. Instead, in order to understand how users interact with the platforms, Ledésert goes directly to the source.
“I prefer journalists who feel safe and private when they do their searches [rather than] using those analytics,” Ledésert said. Instead, she does user surveys at the end of each project and conducts individual interviews with journalists — especially “extreme” users who are either very happy or very frustrated with the interfaces.
Development of new tools is expected to continue through 2021 and beyond.
Romera cited efforts that began with the Paradise Papers, Mauritius Leaks and Luanda Leaks investigations to make better use of artificial intelligence. For instance, machine learning can help identify clusters of documents — making it easier to find, for example, all of the files associated with incorporating a company. Ledésert pointed to a need for better translation and the possibility of extracting text from audio and video files.
Díaz-Struck said that ICIJ’s core philosophy is about improving the ability of local journalists to contribute their expertise to global investigations.
“The kinds of projects we do would not be possible without these technologies,” she said. “It’s technology with a purpose, it’s data with a purpose, [and] it’s journalism with a purpose.”