Discovering and Curating Data on Data.World

Jessica Yung · Data Science

To solve problems – particularly if you want to use statistical approaches or AI – you need data. Data is evidence or descriptive information. We usually deal with quantitative data or quantitative representations of e.g. text or images because they are easier to handle.

The good news is there’s tons of data out there. The bad news is it’s often hidden in obscure corners of the Internet (or in less-read research papers) or is unusable in its current form. Data.World is one step toward making curated datasets open and accessible. (Older and better-known sites like Kaggle also have this feature, although it isn’t their main focus.) Recent datasets on Data.World include ‘885k tweets from the third presidential debate in 2016’, information on 112,000 incidents of corporate misconduct, and figures on INC.COM’s ranked list of the 5000 fastest-growing private companies in Europe.

Data.World user home page with a list of recommended datasets.

Looking at One Dataset

I had a look at the dataset ‘15 Years of Opioid Overdose Deaths’ this morning, which was featured in Data.World’s email newsletter. It included the deaths (absolute number and crude rate, i.e. number per 100,000 population) by opioid overdose per US state per year from 1999 to 2014. From 2000 to 2014, the rate of deaths from drug overdoses increased by 137%, including a 200% increase in the rate of overdose deaths involving opioids. (Source: CDC.) Unfortunately I could not download the dataset because of a bug, but I look forward to doing that later on.
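To make the two measures above concrete, here is a minimal sketch of how a crude rate and a percentage increase are computed. The function names and the input numbers are placeholders I made up for illustration; they are not taken from the dataset.

```python
def crude_rate(deaths: int, population: int, per: int = 100_000) -> float:
    """Deaths per `per` people, with no adjustment for age or other factors."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths * per / population


def percent_increase(old: float, new: float) -> float:
    """Relative change from `old` to `new`, expressed as a percentage."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100.0


# Placeholder inputs, chosen only to keep the arithmetic easy to follow.
example_rate = crude_rate(deaths=50, population=1_000_000)
example_change = percent_increase(old=2.0, new=6.0)
```

Note that a 200% increase means the rate tripled, not doubled: in the placeholder example, going from 2.0 to 6.0 per 100,000 is a 200% increase.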


The original opioid dataset on Data.World

In the Discussion section, someone had suggested next steps, e.g. looking at opioid prescription rates and profits from opioid drugs over time to see if there was a relationship between those features and the increase in opioid-related deaths.

I decided to follow up on those suggestions. A quick Google search led me to the US government’s article America’s addiction to Opioids and Heroin, which included some data on opioid prescriptions in the US. I could only read off annual data for 1991 to 2013 from the chart, but it was sufficient to potentially add some insight.



I created a new dataset Opioid Prescriptions Dispensed by US Retail Pharmacies from 1991-2013 to temporarily house my data while I waited to be approved as a contributor to the original dataset.


My new dataset.
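Once both datasets are available, the comparison suggested in the discussion could be sketched roughly as follows: keep only the years the two series share (1999 to 2013, given the ranges above) and compute a Pearson correlation. This is a sketch under assumptions, not an analysis I have run. The helper names are my own, and the values are straight-line placeholders, not real prescription or death figures.

```python
from math import sqrt


def overlapping_years(a: dict[int, float], b: dict[int, float]) -> list[int]:
    """Years present in both series, in ascending order."""
    return sorted(a.keys() & b.keys())


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# PLACEHOLDER values only: simple straight lines, NOT real figures.
prescriptions = {year: 100.0 + 5.0 * (year - 1991) for year in range(1991, 2014)}
death_rate = {year: 2.0 + 0.4 * (year - 1999) for year in range(1999, 2015)}

years = overlapping_years(prescriptions, death_rate)
r = pearson([prescriptions[y] for y in years], [death_rate[y] for y in years])
```

With real data, a high correlation between two series that both rise over time would only be a starting point, since it does not by itself show that one causes the other.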

Issue and Feature Suggestion

One issue is that if I wanted to share a dataset, I’d share it on both Data.World and Kaggle, because Data.World datasets are not yet visible to people who are not logged in. This kind of duplication is common wherever multiple services offer the same thing, and it becomes more of a hassle when you update datasets or have discussions based on them.

A feature suggestion based on what I’d just done is having a dedicated section for links to related datasets. You can include that in the discussion section and the dataset summary, but it’s less obvious. This isn’t as necessary now because there aren’t as many interlinked datasets, but it would be useful in future and would prevent people from accidentally constructing new datasets from scratch. Data.World is still in its infancy, but it has great potential to help people get access to the data they need more easily, discover interesting datasets and, through that, solve more problems.
