Python for data validation

I spent a good chunk of time today creating a data validation program in Python. Python is a language I know (to some extent) but barely use. Instead, I write scripts in PowerShell, create more complex programs in C#, and analyze data with SQL queries, Galvanize Analytics, or Power Query in Excel. My goal today was to find a way to validate a large number of .csv files, each with a great many columns and of questionable quality. My normal tools and languages would be either too cumbersome or too limited for this, so I reached for Python.

Python is great at importing .csv files, parsing them, modifying them, and outputting a modified copy. Within a few minutes of research, I discovered a Python package called petl, which contains a ready-made data validation pipeline. I just needed to write validation functions, each only one to three lines long, and use them to define constraints, which are simply dictionaries. All those constraints go into a list, which is passed to petl's already-written validate function.
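To make the pattern concrete, here is a minimal sketch of the same idea using only the standard library. It does not call petl and is not the program I wrote; the dictionary keys (name, field, assertion) are chosen to resemble petl's constraint format, and the column names, validation functions, and sample data are all made up for illustration.

```python
import csv
import io


# Hypothetical validation functions: each takes one cell value
# (a string read from the CSV) and returns True or False.
def is_integer(value):
    try:
        int(value)
        return True
    except ValueError:
        return False


def is_not_blank(value):
    return value.strip() != ""


def is_yes_no(value):
    return value in ("Y", "N")


# Constraints are plain dictionaries, collected in a list.
# The field names here are assumptions for the sample data below.
CONSTRAINTS = [
    {"name": "id_is_int", "field": "id", "assertion": is_integer},
    {"name": "name_present", "field": "name", "assertion": is_not_blank},
    {"name": "active_is_flag", "field": "active", "assertion": is_yes_no},
]


def validate_rows(rows, constraints):
    """Check every row against every constraint and collect the failures."""
    problems = []
    # Data rows are numbered from 1; the header row is not counted.
    for row_number, row in enumerate(rows, start=1):
        for constraint in constraints:
            value = row.get(constraint["field"])
            # A missing column (None) counts as a failure too.
            if value is None or not constraint["assertion"](value):
                problems.append({
                    "name": constraint["name"],
                    "row": row_number,
                    "field": constraint["field"],
                    "value": value,
                })
    return problems


# Made-up sample data standing in for a real .csv file.
SAMPLE_CSV = "id,name,active\n1,Ada,Y\nx2,Grace,N\n3,,maybe\n"

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
problems = validate_rows(reader, CONSTRAINTS)

for problem in problems:
    print(problem)
```

For a real file, the io.StringIO wrapper would be replaced by an opened file handle. The appeal of this design is that adding a new check means adding one small function and one dictionary, without touching the loop that does the validating.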

Coding this program has been remarkably efficient, and it has been fun to work in Python again. I particularly love how packages can make the hard parts easy, leaving me more time to spend on my data work rather than on building scaffolding for the program. I will have to look into more uses for Python going forward.