UK Government Data-Sharing Initiative Doesn’t Pass the Data Science Test
In a “world-leading” move, the UK government has created a website intended to enable data scientists to “‘explain or change’ disparities in how people from different backgrounds are treated.”
The website, which purports to be the first of its kind, publishes a variety of summary statistics on topics ranging from criminal justice to transportation. A team of top data scientists at DataRobot looked at numerous links on the website and found lots of highly summarized annual statistics, basically pivot tables by ethnicity. They wanted to analyze the data, but couldn’t.
And here’s why. If you want to look at statistics about “stop and search” events, you could follow the very promising link that reads, “Download the data.” Unfortunately, what you get is very high-level data: annual summary statistics on these types of events. You can learn things like how many stop-and-search occurrences there were for Asians in Bedfordshire in 2006. Interesting? Maybe. Helpful for understanding the real drivers of police behavior? Not even remotely.
Unfortunately, this example is not an anomaly, which is quite a shame. There does seem to be a lot of great data available, but it isn’t useful to the data science and research community if it comes pre-summarized. If the government actually wants to understand the drivers, then it needs to release the actual, un-summarized event data, like a spreadsheet with one row per stop-and-search event, and it needs to include all the details of the event. Where did it occur? What time of day was it? How many people were involved? What was the weather like? How did the person behave? What did the officer say the reason for the stop was? In other words, we need all of the features of the event that might impact the results in order to “explain or change” any ethnic disparities.
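To make “one row per event” concrete, here is a minimal sketch of what a single event-level record might look like. The class name, field names, and values are all hypothetical, chosen only to mirror the questions above; they are not taken from any real government schema or dataset.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class StopSearchEvent:
    """One row per stop-and-search event (hypothetical schema)."""
    occurred_at: datetime    # what time of day was it?
    location: str            # where did it occur?
    people_involved: int     # how many people were involved?
    weather: str             # what was the weather like?
    observed_behavior: str   # how did the person behave?
    stated_reason: str       # what did the officer say the reason was?
    ethnicity: str           # the only dimension the published summaries keep
    outcome: str             # the result we would want to explain


# A made-up placeholder row, for illustration only.
example = StopSearchEvent(
    occurred_at=datetime(2006, 3, 14, 23, 40),
    location="placeholder location",
    people_involved=2,
    weather="rain",
    observed_behavior="placeholder description",
    stated_reason="placeholder reason",
    ethnicity="placeholder group",
    outcome="no further action",
)

# Every feature stays available for modeling, e.g. the hour of day.
print(example.occurred_at.hour)
```

With rows like this, an analyst can build any summary they like, including the published ones. The reverse is not true.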
Releasing data that is pre-summarized by ethnicity presupposes that ethnicity is a primary driver and hides all the other features, making it impossible to do a real, predictive analysis.
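A small sketch shows why the detail cannot be recovered once it has been summarized away. The rows below are synthetic and entirely made up, with abstract group labels “A” and “B”, and the `summarize` helper is a hypothetical stand-in for whatever process produces the published tables. Two event logs with very different hour-of-day patterns collapse to exactly the same annual counts.

```python
from collections import Counter

# Synthetic, made-up event rows: (year, group, hour_of_day).
night_heavy = [(2006, "A", 23), (2006, "A", 23), (2006, "B", 14)]
day_heavy = [(2006, "A", 9), (2006, "A", 15), (2006, "B", 2)]


def summarize(events):
    """Collapse events to annual counts per group, dropping every other column."""
    return Counter((year, group) for year, group, _hour in events)


# The underlying events differ, but the summaries are indistinguishable.
print(night_heavy != day_heavy)                        # True
print(summarize(night_heavy) == summarize(day_heavy))  # True
```

Anyone handed only the summary has no way to tell which of the two logs produced it, so no model built on the summary can say whether time of day matters.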
I was excited when I read that all this detailed government data was being packaged, housed, and released in one central location. Unfortunately, by pre-summarizing the data and eliminating all the detail, the initiative is doomed to fail before it starts.