I have been experimenting with writing ETL (extract, transform, load) scripts in Python for work. The scripts have been mostly fun and easy to write, but the outcome has been disappointing: unacceptably slow load times.
I usually work with an instance of SQL Server that runs locally on my computer. I use it for data collection, analysis, and reporting, and have never needed a shared server. In that local environment, I have always loaded .csv files into staging tables via BCP or the BULK INSERT command in T-SQL scripts, then used T-SQL queries or stored procedures to load the data into the destination tables. This approach is very fast, but it only does the “L” part of ETL (load). It also doesn’t work for remote databases unless I have access to a filesystem the database server can read.
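For anyone unfamiliar with that workflow, it looks roughly like the sketch below. The table and file names are made up, and I normally run the T-SQL as a plain script rather than from Python; it is wrapped in pyodbc here only to keep the examples in one language.

```python
# Rough sketch of the local-only workflow: BULK INSERT into a staging table,
# then plain T-SQL to move the rows into the destination table.
# The .csv path must be readable by the SQL Server instance itself.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Sandbox;Trusted_Connection=yes;"
)
cur = conn.cursor()

# Load the raw file into a staging table (server-side file path).
cur.execute("""
    BULK INSERT dbo.StagingOrders
    FROM 'C:\\data\\orders.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');
""")

# Then transform and load into the destination table.
cur.execute("""
    INSERT INTO dbo.Orders (OrderId, CustomerId, OrderDate)
    SELECT OrderId, CustomerId, TRY_CONVERT(date, OrderDate)
    FROM dbo.StagingOrders;
""")
conn.commit()
```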
A Python script using a library such as PETL or Pandas, on the other hand, can do the whole thing: read in a file from just about any format, transform the rows and columns in various ways, then load the result into the database.
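To make that concrete, here is roughly what such a pipeline looks like with Pandas and SQLAlchemy. The file, column, and table names are invented for illustration.

```python
# Minimal whole-pipeline sketch: extract, transform, and load with pandas.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:password@myserver/MyDb"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Extract: pandas can read .csv, Excel, JSON, Parquet, and more.
df = pd.read_csv("orders.csv")

# Transform: rename columns, fix types, derive new fields, etc.
df = df.rename(columns={"order_dt": "OrderDate"})
df["OrderDate"] = pd.to_datetime(df["OrderDate"])

# Load: append the rows to the destination table over the network.
df.to_sql("Orders", engine, if_exists="append", index=False)
```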
Lately I have been working with various databases that are hosted somewhere in the cloud. They are the slowest databases I have ever worked with, and I don’t have access to the filesystem they can BULK INSERT from. Writing an ETL script lets me work around that problem easily. By far the hardest part of writing my first ETL scripts with Python was getting a working SQL connection string. (I think that part would have been a cinch if our database were something open source like Postgres.)
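For anyone stuck on the same step: a pyodbc connection string for a remote SQL Server generally looks something like the one below. The server, database, and credentials are placeholders, and the driver name has to match an ODBC driver that is actually installed on the machine.

```python
# Placeholder connection string for a remote SQL Server over ODBC.
# Every value here is an example, not a real server or credential.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.example.com,1433;"
    "DATABASE=MyDb;"
    "UID=etl_user;PWD=secret;"
    "Encrypt=yes;"
)
```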
Unfortunately, my simple ETL script could load only about 5 records per second, which is comically slow. At that rate, our data will take days or weeks to load. I’m not sure what the bottleneck is. It is probably slow network transfer time on top of a hugely sub-optimal data insertion method on the database side. I think that, in the end, it may not be worth figuring out. I should probably try a better tool for the job.
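For what it’s worth, the sub-optimal insertion method I suspect is one INSERT round trip per row, which is roughly what a naive executemany (and, by default, Pandas’ to_sql) amounts to. The usual first suggestion is pyodbc’s fast_executemany flag, which batches the parameter sets; the sketch below is untested against this particular database, so I make no promises.

```python
# Untested sketch: fast_executemany sends rows to the server in batches
# instead of issuing one INSERT round trip per row.
# The connection string, table, and columns are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver.example.com;"
    "DATABASE=MyDb;UID=etl_user;PWD=secret;Encrypt=yes;"
)
cur = conn.cursor()
cur.fast_executemany = True  # batch parameter arrays on the client side

rows = [(1, "2024-01-01"), (2, "2024-01-02")]  # stand-in for the real data
cur.executemany(
    "INSERT INTO dbo.Orders (OrderId, OrderDate) VALUES (?, ?)", rows
)
conn.commit()
```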