Thursday, December 20, 2007

paste

I learned something new recently while manipulating large files from a data provider. I needed to gather several years of daily financial information for several thousand companies. There were about 30 attributes, so the process had to be run about 30 times, producing 30 files. Afterwards it all needed to be loaded into a database. I could have loaded the files individually, but in the end all the data needed to be joined. I actually tried loading it all into the database and letting the database do the join; however, the data sets were so large that the join required more memory than I had. I had a "wouldn't it be great" moment, wondering if there was a way to join the files together in a streaming fashion. The lines were ordered such that line 1 of each file could be joined together in a consistent way (they all belonged to the same security and the same date), and the same held for each subsequent line.

I was able to run the following command after placing all the CSV files in a directory:

paste -d ',' *csv > all_data.csv

After that I had one monster file ready to load into the database - pre-joined and all!
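
To make the line-by-line behaviour concrete, here is a minimal sketch on two tiny files. The file names (price.csv, volume.csv, joined.out) and all the values are made up for illustration; they are not my real data. It assumes, as in my case, that line N of every file describes the same security and date.

```shell
# Hypothetical sample data, written to a scratch directory.
tmpdir=$(mktemp -d)

# First file: security, date, and one attribute.
printf 'AAA,2007-12-19,10.5\nBBB,2007-12-19,20.1\n' > "$tmpdir/price.csv"

# Second file: another attribute for the same rows, in the same order.
printf '1000\n2000\n' > "$tmpdir/volume.csv"

# paste reads one line from each input at a time and writes them side by
# side, separated by the delimiter given with -d, so it never has to hold
# a whole file in memory.
paste -d ',' "$tmpdir/price.csv" "$tmpdir/volume.csv" > "$tmpdir/joined.out"

cat "$tmpdir/joined.out"
# AAA,2007-12-19,10.5,1000
# BBB,2007-12-19,20.1,2000
```

One thing to watch with the glob version: the shell expands *csv before it creates all_data.csv, so the first run is fine, but if you run the command a second time in the same directory the old output file will match the pattern too. Writing the output somewhere else, or under a name that does not end in csv, sidesteps that.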
