From historians and academics to public agencies and government watchdogs, the innovative data tool ICIJ built to power its biggest journalistic investigations is finding a whole new audience of researchers eager to put it to use in other fields.
Datashare, the journalism collaborative’s foundational technology, has been downloaded more than 20,000 times over the last four years by historians, academics, public agencies, government watchdogs and journalists all over the world.
The game-changing software can process millions of documents in many different languages and formats, detecting and extracting key information more efficiently than was previously possible.
“We wouldn’t have been able to do Pandora Papers without it,” said ICIJ’s chief technology officer Pierre Romera Zhang, referring to the landmark 2021 investigation that was the largest journalism collaboration in history.
Romera Zhang understood the power of the technology for the investigative journalists that his team worked with to build the tool, but he didn’t anticipate then that thousands of other researchers would find it useful, too.
ICIJ will never know the full impact of Datashare. For privacy reasons, it doesn’t collect information about who has downloaded the software or what they’re analyzing; that data stays securely on users’ private servers or local computers.
“It’s a big mystery,” said Soline Ledésert, ICIJ’s user experience designer and one of the lead managers on the Datashare project. “It’s useful for any organization that needs to explore documents, especially if they are documents in different formats.”
John Bassett, a PhD student in history at the University of Wisconsin-Madison, said the software holds promise for researchers like him.
“Currently, most historians’ lack of technical expertise means that even if a lucky researcher found a huge new stack of documents, she’d likely have to wait for a big institution to acquire, digitize and republish them before she could ever do something as simple as a keyword search,” said Bassett, who studies the Philippines.
Amateur historians without institutional ties would be completely out of luck, he said.
“The promise of Datashare for someone like me is that it can take in a really wide range of types of documents and make it possible to search and browse,” Bassett said. “That’s a game changer.”
The software uses natural language processing to extract named entities from documents in virtually any format. Artificial intelligence capabilities allow the software to recognize names, locations and email addresses based on context.
For example, Datashare is able to discern that the sentence “Paris Hilton was in London” refers to the socialite and not a hotel in France.
Another key feature of Datashare is its batch search capabilities. For example, if a user wants to search millions of documents for references to members of the U.S. Congress, Datashare allows them to search for all the lawmakers at once. Before Datashare, that kind of research would have required 535 separate searches.
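The idea behind batch search can be illustrated with a minimal Python sketch. This is not Datashare's implementation, which the article does not describe; the function name `batch_search`, the sample documents and the placeholder names are all hypothetical. It only shows how a whole list of names can be checked against every document in one pass instead of one search per name.

```python
import re
from collections import defaultdict


def batch_search(documents, queries):
    """Map each query that was found to the IDs of the documents mentioning it.

    Hypothetical helper for illustration only; it is not part of Datashare.
    """
    if not queries:
        return {}
    # Build one pattern covering every query. Longer names go first so that
    # a longer name wins over a shorter name that is its prefix.
    ordered = sorted(queries, key=len, reverse=True)
    alternation = "|".join(re.escape(q) for q in ordered)
    # Lookarounds keep matches on whole-word boundaries.
    pattern = re.compile(r"(?<!\w)(" + alternation + r")(?!\w)", re.IGNORECASE)
    # Matching is case-insensitive, so map hits back to the original spelling.
    canonical = {q.lower(): q for q in queries}

    hits = defaultdict(list)
    for doc_id, text in documents.items():
        found = {canonical[m.group(1).lower()] for m in pattern.finditer(text)}
        for query in sorted(found):
            hits[query].append(doc_id)
    return dict(hits)


# Placeholder documents and names, invented for this example.
docs = {
    "memo.txt": "Payment approved by Jane Doe and john roe.",
    "invoice.txt": "Invoice issued to Jane Doe.",
    "note.txt": "Nothing relevant here.",
}
names = ["Jane Doe", "John Roe", "Alex Poe"]

results = batch_search(docs, names)
for name, doc_ids in results.items():
    print(name, "->", doc_ids)
```

A list of 535 lawmakers would go through the same single call. Names that never appear, such as "Alex Poe" here, are simply absent from the result.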
Datashare was essential to analyzing the millions of leaked documents in the Pandora Papers, Ledésert said.
Without it, journalists would have had to read every document, and they would have missed mentions of key figures they wouldn’t have thought to search for and never expected to find. Among them were international superstar Shakira, German model Claudia Schiffer, acclaimed Spanish singer Julio Iglesias, British pop-rock icon Elton John, British-Italian actress Monica Bellucci, former Beatle Ringo Starr, Argentine soccer player Angel Di Maria and numerous other high-profile figures.
“We wouldn’t have known that” without Datashare, Ledésert said.
ICIJ shares the technology to fulfill its mission of empowering journalists all over the world to expose wrongdoing in their own regions.
“Providing open-source software is a way to facilitate access to information, and this is what ICIJ wants to do,” Romera Zhang said. “We want to provide more information and more easy access to information so people can become part of the democratic debate.”
ICIJ has also benefited from open-source technology created by others, such as Apache Tika and Tesseract OCR.
“As a developer, you work with a lot of open-source resources,” Romera Zhang said. Publicly sharing the code underpinning Datashare “is a way to give back to the community that really gave us a lot.”
ICIJ’s technology team is developing new features that leverage the power of artificial intelligence to extract documents from massive data sets. Users will be able to query, for example, all invoices in a jumble of millions of different kinds of documents, Romera Zhang said. Or they could ask for all email messages related to creating an offshore entity, and Datashare will deliver.
“The software is never finished. We’re always developing it to meet new needs directly inspired by reporters working on ICIJ investigations,” Ledésert said.
Developers next want to create features that will help users organize information, build their own knowledge bases and recognize connections between data points.
The challenge, Romera Zhang said, is melding many complex features together while still keeping the software easy to use.
“Datashare is in very active development,” he said. “It’s great, but in the future it’s going to be amazing.”