Node.js CSV version 4 - re-writing and performance

Today, we release a new major version of the Node.js CSV parser project. Version 4 is a complete rewrite of the project, with a focus on performance. It also comes with new functionality as well as some cleanup in the option properties and the exported information. The official website is updated and the changelog contains the full list of changes for this major release.

A massive undertaking

The Node.js CSV project was started on September 25th, 2010. This is quite old in our evolving tech world. Since then, it has survived multiple Node.js evolutions such as the redesign of the Stream APIs. Over the years, the project was maintained with bug fixes, documentation, and support. With the help of the community, incremental features were provided to fit everyone's use cases. The quality of the test suite made us confident enough to get back to the project and dive into the code. However, there was one task that I never had the courage to initiate: rewriting the parser from the ground up to take advantage of the Buffer API and its promise of performance. A few days of holidays gave me the opportunity to start this work.

The re-writing started from a blank project. While there is probably still room for improvement and further optimization, I ran a few benchmarks to measure the performance impact of multiple implementations. This is how I came up with the Resizable Buffer class, which reuses the same internal buffer, resizing it to fit the input data set instead of instantiating a new buffer for each field (a rough sketch of the idea follows the list below). When ready, the next step was to write the parser. The process was broken down into 13 iterations:

  1. Basic Buffer loop
  2. Add __needMoreData
  3. Add __autoDiscoverRowDelimiter
  4. Start working on quote, escape, delimiter, and record_delimiter
  5. Options quote, escape, delimiter and record_delimiter working
  6. Option comment
  7. Options relax_column_count and skip_empty_lines as well as info count, empty_line_count and skipped_line_count
  8. Options skip_lines_with_empty_values, skip_lines_with_error, from, to
  9. Option columns
  10. Option trim
  11. Option relax
  12. Options objname, raw, cast, and cast_date
  13. Rewrite info counters
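
To give an idea of the buffer-reuse approach mentioned above, here is a minimal sketch. The class name, method names, and doubling growth strategy are assumptions for illustration, not the project's actual internals:

    // Minimal sketch of a resizable buffer: the same allocation is reused
    // across fields instead of instantiating a new Buffer for each one.
    // All names and the growth strategy are assumptions for illustration.
    class ResizableBuffer {
      constructor (size = 100) {
        this.length = 0               // bytes currently in use
        this.buf = Buffer.alloc(size) // reused allocation
      }
      append (byte) {
        // Grow only when full, copying the existing content over
        if (this.length === this.buf.length) {
          const next = Buffer.alloc(this.buf.length * 2)
          this.buf.copy(next)
          this.buf = next
        }
        this.buf[this.length++] = byte
      }
      reset () {
        // Make the same allocation available for the next field
        this.length = 0
      }
      toString (encoding = 'utf8') {
        return this.buf.toString(encoding, 0, this.length)
      }
    }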

The implementation no longer uses CoffeeScript and is written directly in JavaScript (ES6). Don't get me wrong, I am still a big fan of CoffeeScript, and we still use it in the tests for its expressiveness. However, I needed fine control over the code, and using JavaScript as the main language will hopefully encourage more contributions.

Breaking changes

Overall, there are no major breaking changes. The modules are the same and their API remains unchanged. There are, however, a few minor breaking changes to take into consideration, such as the rowDelimiter option being renamed to record_delimiter, some previously deprecated options being removed, and the available counters being regrouped into the new info property:

  • Option rowDelimiter is now record_delimiter
  • Normalize error messages as {error type}: {error description}
  • State values are now isolated into the info object
  • count is now info.records
  • lines is now info.lines
  • empty_line_count is now info.empty_lines
  • skipped_line_count is now info.invalid_field_length
  • context.count in the cast function is now context.records
  • Drop support for the deprecated options auto_parse and auto_parse_date
  • In the raw option, the row property is renamed record
  • Option max_limit_on_data_read is now max_record_size
  • Default value of max_record_size is now 0 (unlimited)
  • Drop emission of the record event; use the readable event and this.read() instead

The most impactful breaking change is probably the renaming of the rowDelimiter option to record_delimiter because of how widely it is used. Also, max_record_size is now unlimited by default and must be explicitly defined if needed.
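
As a hedged migration sketch, here is what the rename looks like in practice; the input data is made up for the example:

    const parse = require('csv-parse')

    // Version 3: parse(input, {rowDelimiter: '\n'}, (err, records) => {...})
    // Version 4: the option is named record_delimiter
    parse('a,b\nc,d', { record_delimiter: '\n' }, (err, records, info) => {
      if (err) throw err
      console.log(records)      // [ [ 'a', 'b' ], [ 'c', 'd' ] ]
      console.log(info.records) // 2, previously the `count` property
    })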

New features

This new version also comes with new features. The new information object is a nice addition. It regroups a few counter properties which were previously available directly on the parser instance. Those properties have been renamed to be more expressive. The information object is directly available on the parser instance as info. For callback users, it is exported as the third argument of the callback function. It can also be attached to each record by activating the info option with the value true.
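
Here is a short sketch of the two ways to access the information object described above; the sample input is made up for the example:

    const parse = require('csv-parse')

    // For callback users, info is exported as the third argument
    const parser = parse('a,b\nc,d', (err, records, info) => {
      if (err) throw err
      console.log(info.records) // 2
      console.log(info.lines)   // 2
    })
    // The same object is also available directly on the parser
    // instance, for example as parser.info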

There are 3 new options: info, from_line, and to_line:

  • info: Generate two properties info and record where info is a snapshot of the info object at the time the record was created and record is the parsed array or object; note that it can be used in conjunction with the raw option.
  • from_line: Start handling records from the requested line number.
  • to_line: Stop handling records after the requested line number.

The info option is quite useful for debugging or for giving end users feedback about their mistakes.

The from_line and to_line options respectively filter the first and last lines of a data set; a short sketch follows. Speaking of lines, previous versions of the parser could miscount lines when the line and record delimiters differed. It worked for most users for the simple reason that the two are usually the same. This is fixed with the new release.
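
Here is a sketch combining from_line, to_line, and the info option; the field values are made up for the example:

    const parse = require('csv-parse')

    // Only lines 2 and 3 are handled; with `info: true`, each result wraps
    // the parsed record together with a snapshot of the counters.
    const input = 'h,0\na,1\nb,2\nc,3'
    parse(input, { from_line: 2, to_line: 3, info: true }, (err, results) => {
      if (err) throw err
      for (const { info, record } of results) {
        console.log(record) // [ 'a', '1' ], then [ 'b', '2' ]
      }
    })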

Here is the new feature list extracted from the changelog:

  • new options info, from_line and to_line
  • trim: respect ltrim and rtrim when defined
  • delimiter: may be a Buffer
  • delimiter: handle multiple bytes/characters (see the sketch after this list)
  • callback: export info object as third argument
  • cast: catch error in user functions
  • TypeScript: mark info as readonly with required properties
  • comment_lines: count the number of commented lines with no records
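
For instance, the multi-character and Buffer delimiter support could be used as follows; the input is made up for the example:

    const parse = require('csv-parse')

    // A delimiter made of multiple characters
    parse('key::value\nfoo::bar', { delimiter: '::' }, (err, records) => {
      if (err) throw err
      console.log(records) // [ [ 'key', 'value' ], [ 'foo', 'bar' ] ]
    })

    // The delimiter may also be provided as a Buffer
    parse('key::value', { delimiter: Buffer.from('::') }, (err, records) => {
      if (err) throw err
      console.log(records) // [ [ 'key', 'value' ] ]
    })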

What’s coming next

The source code is backed by an extensive test suite. No test has been removed, and new tests have been added to reinforce the guarantees of the parser. It is, however, possible that some behaviors are not covered by the tests, and in the next few weeks we count on your feedback to fix any issues that come up.

While I am not a big fan of ES6 Promises in the context of the parser, support has been requested multiple times and will come soon. It will also be implemented in the other CSV packages.

Another potential improvement is to extend the error objects with additional information, such as a unique code associated with each type of error. While the messages have already been improved, there is room to normalize them further.

I am also planning to support the Flow static type checker. I have never used it before. It seems appropriate for the package, and it will give me the occasion to try it out.

Finally, I am considering writing a command line tool which will expose all the available options and provide multiple output formats (JSON, JSON Lines, YAML, …).
