How Node CSV parser may save your weekend

How Node CSV parser may save your weekend

By David WORMS

Dec 13, 2011

Categories: Hack | Tags: Bash, CSV, Hack, Node.js

Last Friday, an hour before the doors of my customer close for the weekend, a co-worker came to me. He just finished to export 9 CSV files from an Oracle database which he wanted to import into Greenplum such as our customer could start testing on Monday morning.

The problem as exposed was quite simple. He needed a quick solution (less than an hour, coding included) to transform all the date in the source CSV file into a format suitable for Greenplum. While Oracle exported dates in the form of ‘DD/MM/YYYY’, Greenplum was picky enough to expect dates in the form of ‘YYYY-MM-DD’.

Here are the files we needed to import:

-rw-r--r-- 1 ingres ingres  41G 2011-12-09 18:23 a.csv
-rw-r--r-- 1 ingres ingres 418M 2011-12-09 17:22 b.csv
-rw-r--r-- 1 ingres ingres  45G 2011-12-09 18:26 c.csv
-rw-r--r-- 1 ingres ingres 1,8G 2011-12-09 17:37 d.csv
-rw-r--r-- 1 ingres ingres 896M 2011-12-09 17:33 e.csv
-rw-r--r-- 1 ingres ingres 382M 2011-12-09 17:13 f.csv
-rw-r--r-- 1 ingres ingres 7,6G 2011-12-09 18:09 g.csv
-rw-r--r-- 1 ingres ingres 621M 2011-12-09 17:33 h.csv
-rw-r--r-- 1 ingres ingres 3,1G 2011-12-09 17:46 i.csv

Among those files, two are larger than 45Gb.

Since generating new files would probably have taken too long, the first question asked was whether or not the Greenplum import command would accept stdin data through unix pipe, which question luckily answered to yes. So my first suggestion was to cat a file, pipe it to Awk which would reformat the date line by line and finally pipe it to the import tool. The command would look like:

cat a.csv | awk '...' | psql -p $port $database -c "COPY $table FROM STDIN WITH CSV;"

The solution seems workable. Problem, we are no Awk expert and I could not guaranty him an hour would be enough to write and test the command.

Considering how familiar I am with Node those days, I told him that writing a small JavaScript script which interacts with stdin and stdout would be much more secure. This is when we though of using the CSV parser, the first library I wrote for Node a little more than a year ago.

We first had to configure the proxy and install Node on this server which took about 10mn. For this we used NVM. Then, we wrote the following JavaScript:

#!/usr/bin/env node
var csv = require('./csv');

csv()
.fromStream(process.stdin)
.toStream(process.stdout)
.transform(function(data, index){
    for(var i=0; i< data.length; i++){
        if(/(\d{2})\/(\d{2})\/(\d{2})/.test(data[i])){
            data[i] = data[i].replace(/(\d{2})\/(\d{2})\/(\d{2})/, "20$3-$2-$1");
        }
    }
    return data;
});

process.stdin.resume()

Note how the usage of the regular expression could have been optimized. But our goal wasn’t to speed the import process, only to make it run as soon as possible.

We already had at our disposal a small bash script which could loop trough the CSV files and launch Greenplum import command. We tweak it a bit such as the final version looked like:

#!/bin/bash
for csv in $(ls ~/export/*.csv)
do
        table=`echo $csv|cut -d'/' -f 5`;
        table=`echo $table|cut -d'.' -f 1`;
        table=`echo $table| sed 's/\(.*\)../\1/'`;
        cat $csv | transform.js | psql -p 5433 my_database -c "COPY $table FROM STDIN WITH CSV;"
done

Well, on Monday morning, the CSV parser handled the volume and all the CSV files were ingested into Greenplum.

Canada - Morocco - France

International locations

10 rue de la Kasbah
2393 Rabbat
Canada

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and increase the time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.