Last Friday, an hour before the doors of my customer close for the weekend, a co-worker came to me. He just finished to export 9 CSV files from an Oracle database which he wanted to import into Greenplum such as our customer could start testing on Monday morning.

The problem as exposed was quite simple. He needed a quick solution (less than an hour, coding included) to transform all the date in the source CSV file into a format suitable for Greenplum. While Oracle exported dates in the form of ‘DD/MM/YYYY’, Greenplum was picky enough to expect dates in the form of ‘YYYY-MM-DD’.

Here are the files we needed to import:

Among those files, two are larger than 45Gb.

Since generating new files would probably have taken too long, the first question asked was whether or not the Greenplum import command would accept stdin data through unix pipe, which question luckily answered to yes. So my first suggestion was to cat a file, pipe it to Awk which would reformat the date line by line and finally pipe it to the import tool. The command would look like:

The solution seems workable. Problem, we are no Awk expert and I could not guaranty him an hour would be enough to write and test the command.

Considering how familiar I am with Node those days, I told him that writing a small JavaScript script which interacts with stdin and stdout would be much more secure. This is when we though of using the CSV parser, the first library I wrote for Node a little more than a year ago.

We first had to configure the proxy and install Node on this server which took about 10mn. For this we used NVM. Then, we wrote the following JavaScript:

Note how the usage of the regular expression could have been optimized. But our goal wasn’t to speed the import process, only to make it run as soon as possible.

We already had at our disposal a small bash script which could loop trough the CSV files and launch Greenplum import command. We tweak it a bit such as the final version looked like:

Well, on Monday morning, the CSV parser handled the volume and all the CSV files were ingested into Greenplum.