Today I learned a good way to generate dummy data for the data analysis training I am going to deliver next month.

While there are services like dumbdata.com that can produce dummy datasets without requiring any programming at all, those tend not to work for me. My data needs are domain-specific. I don't just need random names and addresses and things like that; I also need specific columns, including some inter-related financial columns, or a dummy dataset won't make sense to my audience.

I have been using Python a lot lately, so naturally I wrote a Python script to generate a table full of randomized but realistic-looking data. First I used petl's dummytable function to create a base table of randomized identifier numbers, dates, data categories, and dollar amounts. To generate realistic values for that table, I used functionality from the Faker package and from the standard random library, including Faker's date_between method and random.choice. Then I used petl's addfield function to add some fields with calculated, inter-related values. Next, I used petl's cut function to re-order a subset of the table columns and prepare them for export. Lastly, I used petl's toxlsx function to export the data to Excel.
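Here is a minimal sketch of what that pipeline looks like. The column names, value ranges, categories, and the flat 7% tax rate are made-up stand-ins for my actual training columns, not anything from a real dataset:

```python
import random
from functools import partial

import petl as etl
from faker import Faker

fake = Faker()

# Base table of randomized values. petl's dummytable takes a row count
# and a list of (field name, callable) pairs; each callable is invoked
# once per row. These names and ranges are illustrative only.
base = etl.dummytable(
    250,
    fields=[
        ('invoice_id', partial(random.randint, 10000, 99999)),
        ('invoice_date', partial(fake.date_between, start_date='-1y', end_date='today')),
        ('category', partial(random.choice, ('Hardware', 'Software', 'Services'))),
        ('amount', partial(random.uniform, 100.0, 5000.0)),
    ],
)

# Calculated, inter-related fields: tax is derived from amount,
# and total is derived from both.
table = etl.addfield(base, 'tax', lambda row: round(row['amount'] * 0.07, 2))
table = etl.addfield(table, 'total', lambda row: round(row['amount'] + row['tax'], 2))

# Re-order a subset of the columns for export.
table = etl.cut(table, 'invoice_id', 'invoice_date', 'category', 'amount', 'tax', 'total')

# Export to Excel (petl's toxlsx needs the openpyxl package installed).
etl.toxlsx(table, 'dummy_data.xlsx')
```

One nice property of this approach is that petl builds lazy pipelines, so nothing is actually generated until toxlsx iterates the table at the very end.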

It was surprisingly easy. Not having to write my own functions to generate random values or pick random entries from value lists made the process far quicker than it otherwise would have been. I wish I had known about these tools the last time I created a data analytics training demo.