UK Government Data-Sharing Initiative Doesn’t Pass the Data Science Test

October 11, 2017

In a “world-leading” move, the UK government has created a website intended to enable data scientists to “‘explain or change’ disparities in how people from different backgrounds are treated.”

The website, which purports to be the first of its kind, publishes summary statistics on topics ranging from criminal justice to transportation. A team of top data scientists at DataRobot followed numerous links on the website and found only highly summarized annual statistics, essentially pivot tables by ethnicity. They wanted to analyze the data, but couldn’t.

Here’s why. If you want to look at statistics about “stop and search” events, you can follow the very promising link that reads, “Download the data.” Unfortunately, what you get is very high-level summary data: annual counts of these events. You can learn things like how many stop-and-search occurrences there were for Asians in Bedfordshire in 2006. Interesting? Maybe. Helpful for understanding the real drivers of police behavior? Not even remotely.

Unfortunately, this example is not an anomaly, which is quite a shame. There does seem to be a lot of great data available, but it isn’t useful to the data science and research community if it comes pre-summarized. If the government actually wants to understand the drivers of these disparities, it needs to release the actual, un-summarized event data, such as a spreadsheet with one row per stop-and-search event, including all the details of the event. Where did it occur? What time of day was it? How many people were involved? What was the weather like? How did the person behave? What did the officer say the reason for the stop was? In other words, all of the features of the event that might affect the outcome are the features we need in order to “explain or change” any ethnic disparities.
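To make the contrast concrete, here is a minimal sketch of what "one row per event" could look like and what is lost when it is collapsed into a pivot table. The schema, field names, and rows are all hypothetical and made up for illustration; they are not taken from the government website or any real dataset.

```python
from collections import Counter
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StopSearchEvent:
    # Hypothetical schema: these field names are illustrative assumptions.
    year: int
    police_force: str
    ethnicity: str
    hour_of_day: int
    location: str
    people_involved: int
    stated_reason: str


# Made-up rows purely for illustration; not real data.
events = [
    StopSearchEvent(2006, "Bedfordshire", "Asian", 23, "town centre", 1, "suspected drugs"),
    StopSearchEvent(2006, "Bedfordshire", "Asian", 14, "rail station", 2, "suspected theft"),
    StopSearchEvent(2006, "Bedfordshire", "White", 23, "town centre", 1, "suspected drugs"),
]


def summarise(rows):
    """Collapse events into annual counts by force and ethnicity (the pivot-table form)."""
    return Counter((e.year, e.police_force, e.ethnicity) for e in rows)


summary = summarise(events)

# Columns that exist per event but cannot be recovered from the summary.
event_columns = {f.name for f in fields(StopSearchEvent)}
summary_columns = {"year", "police_force", "ethnicity"}
lost_columns = sorted(event_columns - summary_columns)

print(summary)
print(lost_columns)
```

Once only `summary` is published, every column in `lost_columns` is gone, and no amount of analysis can bring it back.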

Releasing data that is pre-summarized by ethnicity presupposes that ethnicity is a primary driver and hides all the other features, making it impossible to do a real, predictive analysis.
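A small sketch of why the hidden features matter: two very different sets of events can produce exactly the same summary by ethnicity. Both scenarios below are invented toy data, and the choice of hour of day as the hidden feature is only an assumption for illustration.

```python
from collections import Counter

# Two made-up sets of (ethnicity, hour_of_day) events; not real data.
scenario_a = [("Asian", 23), ("Asian", 23), ("White", 23)]  # every stop late at night
scenario_b = [("Asian", 9), ("Asian", 15), ("White", 23)]   # stops spread across the day


def counts_by_ethnicity(rows):
    """The pre-summarized view: counts by ethnicity only."""
    return Counter(ethnicity for ethnicity, _hour in rows)


def counts_by_ethnicity_and_hour(rows):
    """A more detailed view that keeps one extra feature."""
    return Counter(rows)


same_summary = counts_by_ethnicity(scenario_a) == counts_by_ethnicity(scenario_b)
same_detail = counts_by_ethnicity_and_hour(scenario_a) == counts_by_ethnicity_and_hour(scenario_b)

print(same_summary)
print(same_detail)
```

The summaries match while the underlying events do not, so an analyst holding only the summary cannot tell the two situations apart.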

I was excited when I read that all this detailed government data was being packaged, housed, and released in one central location. Unfortunately, by pre-summarizing the data and eliminating all the detail, the initiative is doomed to fail before it starts.

About the author

Greg Michaelson
