Document a pragmatic data dictionary in under 10 minutes!
Don’t you wish that all of our data sets came with a nice synopsis to tell us their story? If you are a data scientist, then you can appreciate how much easier it is to work with well-documented data. Practically speaking, it’s rare to find clean, well-documented, and complete data when starting a new project. That generally leads to one of two outcomes:
- You use the data as is, wade through understanding it to push out a semi-complete project and then hope that you remember what it all meant the next time someone asks you to reproduce the product, or…
- You work through the tedious process of developing a data dictionary for your project. But who really has time for that?
Fortunately, there are a number of metadata management tools available, provided your organization has taken the time to acquire and implement them and you have been given the administrative rights to actually use them to their full capability. In my case, neither tends to apply, but I still need to develop the documentation and deliver a usable product. What if we just automated the majority of the process and then used our subject matter experts to fill in the meaning of the data where needed with a simple phone call? A simple Python function, perhaps?
Before we get to that, let’s take a quick look at a recent project I was working on and why a data dictionary might have helped.
This week I ran across a data set on HealthData.gov involving rates of child abuse and maltreatment from 2011 to 2015. As a pediatrician this data set sparked an interest and I decided to take a quick look through it. Of course I started with the metadata and description to see what I could quickly learn about the data.
Both were less than helpful in explaining the data set as a whole:
Initial exploration using pandas .info() and .describe() helped a little:
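For readers following along, here is a minimal sketch of that first pass. The original file is not reproduced here, so a tiny inline stand-in is used with the column names from the data set; the values are illustrative, not taken from HealthData.gov:

```python
import io
import pandas as pd

# A tiny stand-in for the real download; column names match the data set,
# but these two rows are made up for illustration
csv = io.StringIO(
    "State,Year,Table,Measure,Format,Value\n"
    "Alabama,2011,Child Fatalities,Rate per 100000 Children,Rate,2.1\n"
    "National,2015,Child Victims,Rate per 1000 Children,Rate,9.2\n"
)
df = pd.read_csv(csv)

df.info()             # column names, dtypes, and non-null counts
print(df.describe())  # summary statistics for the numeric columns
```

In practice you would replace the stand-in with `pd.read_csv()` pointed at your downloaded file.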
At least the data was clean and I didn’t need to worry about missing values. Before moving on, given that this data set includes a “State” column, I decided to make sure that we had full US representation.
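That check is a one-liner with pandas. A sketch, again using a small stand-in for the loaded frame:

```python
import pandas as pd

# Stand-in for the loaded data set; the real "State" column should contain
# the 50 states plus "District of Columbia", "National", and "Puerto Rico",
# for 53 distinct values in total
df = pd.DataFrame({"State": ["Alabama", "National", "Puerto Rico", "Alabama"]})

print(df["State"].nunique())          # count of distinct entries
print(sorted(df["State"].unique()))   # the entries themselves
```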
Note that this data set includes “District of Columbia”, “National”, and “Puerto Rico” in the “State” column. This provides an interesting way to normalize the data later on, but I will save that for another article. It’s always good to quickly review your data set and ensure that you fully understand the limitations and nuances of the data you are working with.
The “State”, “Year”, “Format”, and “Value” columns all made sense, but the “Table” and “Measure” columns had me confused.
Neither column seemed to make much sense for the type of data set I was anticipating. To get a better understanding, I took a look at the unique “Table” and “Measure” values for the entire set:
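The same `unique()` call works here. A sketch with illustrative values standing in for the real columns:

```python
import pandas as pd

# Stand-in rows; the real columns hold report names and their measures
df = pd.DataFrame({
    "Table":   ["Child Fatalities", "Child Victims", "Child Fatalities"],
    "Measure": ["Rate per 100,000", "Rate per 1,000", "Rate per 100,000"],
})

print(df["Table"].unique())    # one entry per distinct report
print(df["Measure"].unique())  # one entry per distinct aggregated output
```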
Ahhh…now it’s making sense. Each “Table” is a unique report of aggregated values from a State and each “Measure” is the aggregated output from that report.
Perfect! So we have a data set with the aggregated measures of various reports by calendar year for all the states in the US, plus “Puerto Rico”, “District of Columbia”, and a “National Summary” report.
That would have been a lot easier if we just had a simple data dictionary. Why don’t we just make that an automatable process?
Here I’m going to create a class so that I can quickly import it into other projects in the future. The entire code is linked at the end of the article.
Now let’s make the dictionary:
For my use, I’m only looking for the most important information (data type, total number of rows, total number of null values, memory use by column, and a definition of the data). You can add what you want in your own data dictionaries by changing this code.
If you are interested, I used pandas dtypes and columns attributes to get my data types, length of columns for length, the isna() function for summing the null values, and pandas memory_usage() to obtain the memory use by column.
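Putting those pieces together, a minimal sketch of such a class might look like the following. The class and attribute names here are my own reconstruction, not the author’s exact code (which is linked at the end of the article):

```python
import pandas as pd

class DataDict:
    """Build a bare-bones data dictionary for a pandas DataFrame."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def make_dict(self) -> pd.DataFrame:
        # One row of metadata per column in the source frame
        return pd.DataFrame({
            "Data Type": self.df.dtypes,                       # dtype per column
            "Length": len(self.df),                            # total row count
            "Null Count": self.df.isna().sum(),                # missing values
            "Memory (bytes)": self.df.memory_usage(deep=True,  # per-column memory
                                                   index=False),
        })

# Quick demonstration on a toy frame
df = pd.DataFrame({"State": ["Alabama", "National"], "Value": [2.1, None]})
data_dict = DataDict(df).make_dict()
print(data_dict)
```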
Now we need to add meaning to the features in our data set:
This code loops through our current data dictionary, asks the user to define each feature variable, and then transposes the data frame to make it usable, finally returning the completed data dictionary for our data frame. So does it work?
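The loop described above can be sketched as a standalone function. Again, this is my reconstruction under the assumptions stated in the text, not the author’s exact code:

```python
import pandas as pd

def add_definitions(data_dict: pd.DataFrame) -> pd.DataFrame:
    """Prompt the user for a plain-language definition of each feature,
    then transpose so each feature becomes a column."""
    definitions = []
    for column in data_dict.index:
        definitions.append(input(f"Define '{column}': "))
    data_dict["Definition"] = definitions
    return data_dict.T
```

In an interactive session this pauses on each feature; a subject matter expert on the phone can supply the definitions as you type them in.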
Here is the workflow:
Now all you have to do is export your new data dictionary to a PDF, JSON, XML or CSV file and save with your project.
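pandas covers most of those exports directly. A sketch, with hypothetical file names:

```python
import pandas as pd

# Stand-in for the finished, transposed data dictionary
data_dict = pd.DataFrame(
    {"State": ["object", 0, "Reporting state or territory"]},
    index=["Data Type", "Null Count", "Definition"],
)

data_dict.to_csv("data_dictionary.csv")
data_dict.to_json("data_dictionary.json")
# Excel export needs the openpyxl package; from Excel you can save as PDF:
# data_dict.to_excel("data_dictionary.xlsx")
```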
In this article I demonstrated how to quickly explore a data set from HealthData.gov that was lacking useful metadata. Using the insight gained from that exploration, I created a Python class to automate the development of a data dictionary for future data science projects. Finally, I demonstrated the generation of an Excel-derived PDF document that could serve as the baseline metadata documentation for future projects.
You can grab the code related to this article at the following:
Thanks for Reading!