This monster always rears its ugly head just when I think I’m getting what I want: a public dataset that could shed light on some injustice or terror in the world. At first it looks alluring. It’s usually a government database presented in a veritable theme park of graphs and labeled numbers. But hiding just underneath the download button it lurks, silent. The option to download the raw data is not available.
Want to know how many immigrants US Customs and Border Protection agents are encountering at the southwest land border? There’s a visualization for that. Want to check their math? Too bad. But the reality of this game is more insidious. It’s about giving just enough access to exclude.
CBP, that noted paragon of good governance recently caught investigating journalists, would likely claim that it is being transparent by releasing the data as it does. Better yet, the agency could claim that, by visualizing and aggregating the information, it is making the underlying data more accessible to those who don’t know how or don’t want to invest the time to analyze the data themselves. Yet neither claim should preclude the agency from making the underlying numbers available to those who want to dive deeper.
Reporters who know how to analyze large datasets continually use government data to astonishing effect. ProPublica showed where cancer-causing air is at its worst. (Places where people of color predominantly live are disproportionately impacted.) The Markup revealed how people of color are far more likely to have their mortgage application denied, and the Los Angeles Times exposed how LA sheriff’s deputies use minor stops to search bicyclists. (Seven of every 10 stops involve a Latino cyclist.) Not a single one of these investigations would have been possible if government officials got to decide how reporters analyzed the data and what data points were included.
Gatekeepers sometimes argue these shiny barriers protect the privacy of people who are in the database and protect the data from analysts who do not understand it and might draw incorrect conclusions.
There probably are some databases that are so personal that they should not be released publicly. But in the vast majority of instances, these claims do more to protect injustice by shrouding the government’s actions in haze. Would an immigrant seeking a new life in the United States care if more information about their interactions with Border Patrol were made public? Probably not, especially since names and other personally identifiable material could easily be removed before publishing the dataset. Could a data reporter make a mistake? Sure. But it is just as possible that the government is erring, accidentally or otherwise, in its own representation of the data.
Ultimately, in a time of uncertainty and skepticism, data analysis can provide a basis for understanding and, maybe even more importantly, verification. When Mother Jones found accounting data showing how Purdue Pharma spent $115 million funding other organizations—including groups that helped spread its opioid messaging—we converted it into a user-friendly database. But we also released the original PDF and the final spreadsheet, just like many other news organizations do. Anyone can check both documents to verify the accuracy. All I want is to do the same with the CBP dataset and others like it. Instead, the full story hides behind visualizations and aggregation. It’s the illusion of transparency, used as a cudgel against the real thing.