How Node CSV parser may save your weekend

Last Friday, an hour before the doors of my customer close for the weekend, a co-worker came to me. He just finished to export 9 CSV files from

[Oracle][oracle] which he wanted to import into [Greenplum][green] such as our customer could start testing on Monday morning.

The problem as exposed was quite simple. He needed a quick solution (less than an hour, coding included) to transform all the date in the source CSV file into a format suitable for [Greenplum][green]. While [Oracle][oracle] exported dates in the form of ‘DD/MM/YYYY’, [Greenplum][green] was picky enough to expect dates in the form of ‘YYYY-MM-DD’.

Here are the files we needed to import:

Among those files, two are larger than 45Gb.

Since generating new files would probably have taken too long, the first question asked was whether or not the [Greenplum][green] import command would accept stdin data through unix pipe, which question luckily answered to yes. So my first suggestion was to cat a file, pipe it to [Awk][awk] which would reformat the date line by line and finally pipe it to the import tool. The command would look like:

The solution seems workable. Problem, we are no [Awk][awk] expert and I could not guaranty him an hour would be enough to write and test the command.

Considering how familiar I am with [Node][node] those days, I told him that writing a small JavaScript script which interact with stdin and stdout would be much more secure. This is when we though of using the [CSV parser][csv], the first library I wrote for [Node][node] a little more than a year ago.

We first had to configure the proxy and install [Node][node] on this server which took about 10mn. For this we used [NVM][nvm]. Then, we wrote the following JavaScript:

Note how the usage of the regular expression could have been optimized. But our goal wasn’t to speed the import process, only to make it run as soon as possible.

We alread had at our disposal a small bash script which could loop trough the CSV files and launch the [Greenplum][green] import command. We tweak it a bit such as the final version looked like:

Well, on Monday morning, the [CSV parser][csv] handled the volume and all the CSV files were ingested by [Greenplum][green].

[node]: http://www.nodejs.org Node JavaScript Engine [oracle]: http://www.oracle.com Oracle database [green]: http://www.greenplum.com Greenplum Big Data analytics [awk]: http://www.gnu.org/s/gawk/manual/gawk.html The GNU Awk User’s Guide [csv]: https://github.com/wdavidw/node-csv-parser Node CSV parser [nvm]: https://github.com/creationix/nvm Node Version Manager
By |2018-06-05T22:37:31+00:00December 13th, 2011|Categories: Node.js|0 Comments

About the Author:

Passionate with programming, data and entrepreneurship, I participate in shaping Adaltas to be a team of talented engineers to share our skills and experiences.

Leave A Comment