Today, we release a new major version of the Node.js CSV parser project. Version 4 is a complete rewrite of the project focusing on performance. It also comes with new features as well as some cleanup of the option properties and the exported information. The official website is updated and the changelog contains the list of changes for this major release.

A massive undertaking

The Node.js CSV project was started on September 25th, 2010. This is quite old in our evolving tech world. Since then, it has survived multiple Node.js evolutions such as the redesign of the Stream APIs. Over the years, the project was maintained with bug fixes, documentation, and support. With the help of the community, incremental features were added to fit everyone's use cases. The quality of the test suite made us confident enough to get back to the project and dive into the code. However, there was one task which I never had the courage to initiate: rewriting the parser from the ground up to take advantage of the Buffer API and its promises of performance. A few days of holidays gave me the opportunity to take on this work.

The rewrite started with a blank new project. While there is probably still room for improvements and further optimizations, I ran a few benchmarks to measure the performance impact of multiple implementations. This is how I came up with the Resizable Buffer class, which reuses the same internal buffer and adjusts it to the input data set instead of instantiating a new buffer for each field. Once it was ready, the next step was to write the parser. The process was broken down into multiple iterations, 13 exactly:

  1. Basic Buffer loop
  2. Add __needMoreData
  3. Add __autoDiscoverRowDelimiter
  4. Start working on quote, escape, delimiter, and record_delimiter
  5. Options quote, escape, delimiter, and record_delimiter working
  6. Option comment
  7. Options relax_column_count and skip_empty_lines as well as info count, empty_line_count, and skipped_line_count
  8. Options skip_lines_with_empty_values, skip_lines_with_error, from, to
  9. Option columns
  10. Option trim
  11. Option relax
  12. Options objname , raw , cast, and cast_date
  13. Rewrite info counters
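
To make the Resizable Buffer idea mentioned above more concrete, here is a minimal sketch. It is not the class shipped with the package: the method names (`append`, `toString`, `reset`) and the doubling growth policy are assumptions chosen for illustration. The point it shows is that one internal buffer is reused from field to field and only grows when a field does not fit.

```javascript
// Hypothetical sketch of a resizable buffer. The names and the growth
// policy are assumptions, not the package's actual implementation.
class ResizableBuffer {
  constructor(size = 16) {
    // Internal storage, reused for every field (size must be at least 1)
    this.buf = Buffer.allocUnsafe(size);
    // Number of bytes currently in use
    this.length = 0;
  }

  append(byte) {
    // Grow only when the current field no longer fits
    if (this.length === this.buf.length) {
      const grown = Buffer.allocUnsafe(this.buf.length * 2);
      this.buf.copy(grown, 0, 0, this.length);
      this.buf = grown;
    }
    this.buf[this.length++] = byte;
  }

  toString(encoding = 'utf8') {
    // Only decode the bytes in use, not the whole allocation
    return this.buf.toString(encoding, 0, this.length);
  }

  reset() {
    // Keep the allocation, forget the content
    this.length = 0;
  }
}

const field = new ResizableBuffer(2);
for (const byte of Buffer.from('hello')) field.append(byte);
const first = field.toString();

field.reset();
for (const byte of Buffer.from('hi')) field.append(byte);
const second = field.toString();
```

After the first field, the internal buffer has grown to fit the data; the second field is written into the same allocation without creating a new buffer.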

The implementation no longer uses CoffeeScript and is written directly in JavaScript (ES6). Don’t get me wrong, I am still a big fan of CoffeeScript and we are still using it in the tests for its expressiveness. However, I needed fine-grained control over the code, and using JavaScript as the main language will hopefully encourage more contributions.

Breaking changes

Overall, there are no major breaking changes. The modules are the same and the API for using them remains unchanged. There are, however, a few minor breaking changes to take into consideration, such as the rowDelimiter option being renamed to record_delimiter, some previously deprecated options being removed, and the available counters being regrouped into the new info property:

  • Option rowDelimiter is now record_delimiter
  • Normalise error messages as {error type}: {error description}
  • State values are now isolated into the info object
  • count is now info.records
  • lines is now info.lines
  • empty_line_count is now info.empty_lines
  • skipped_line_count is now info.invalid_field_length
  • context.count in the cast function is now context.records
  • Drop support for deprecated options auto_parse and auto_parse_date
  • In raw option, the row property is renamed record
  • Option max_limit_on_data_read is now max_record_size
  • Default value of max_record_size is now 0 (unlimited)
  • Drop emission of the record event, use the readable event instead

The most impactful breaking change is probably the renaming of the rowDelimiter option to record_delimiter, because of its popular usage. Also, max_record_size is now unlimited by default and must be explicitly defined if a limit is needed.
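
For code bases that pass the same options object in many places, the renames listed above can be applied mechanically. The following is a minimal sketch of such a helper. `migrateOptions` is a hypothetical function written for this post, not something exported by the package, and it only covers the option renames and removals from the list above.

```javascript
// Hypothetical helper, not part of the package: converts an options
// object written for version 3 into its version 4 equivalent.
const RENAMED_OPTIONS = new Map([
  ['rowDelimiter', 'record_delimiter'],
  ['max_limit_on_data_read', 'max_record_size'],
]);

// Options listed above as no longer supported
const REMOVED_OPTIONS = new Set(['auto_parse', 'auto_parse_date']);

function migrateOptions(v3Options) {
  const v4Options = {};
  for (const [key, value] of Object.entries(v3Options)) {
    if (REMOVED_OPTIONS.has(key)) {
      // Fail loudly instead of silently dropping a setting
      throw new Error(`Option ${key} is no longer supported in version 4`);
    }
    const newKey = RENAMED_OPTIONS.has(key) ? RENAMED_OPTIONS.get(key) : key;
    v4Options[newKey] = value;
  }
  return v4Options;
}

const migrated = migrateOptions({ rowDelimiter: '\r\n', delimiter: ';' });
// migrated is { record_delimiter: '\r\n', delimiter: ';' }
```

Throwing on removed options is a design choice of this sketch; logging a warning would work as well if a hard failure is too strict.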

New features

This new version also comes with new features. The new information object is a nice addition. It groups a few counter properties which were previously available directly from the parser instance. Those properties have been renamed to be more expressive. The information object is directly available from the parser instance as info. For callback users, it is exported as the third argument of the callback function. It can also be made available for each record by activating the info option with the value true.

There are 3 new options: info, from_line, and to_line.

  • info: Generate two properties, info and record, where info is a snapshot of the info object at the time the record was created and record is the parsed array or object; note that it can be used together with the raw option.
  • from_line: Start handling records from the requested line number.
  • to_line: Stop handling records after the requested line number.

The info option is quite useful for debugging or for giving end users some feedback about their mistakes.

The from_line and to_line options respectively filter out the first and last lines of a data set. Speaking of lines, previous versions of the parser were confused when counting lines, mixing up row and record delimiters. It worked for most users for the simple reason that the two are usually the same. This should be fixed with this new release.

Here is the new feature list extracted from the changelog:

  • new options info, from_line, and to_line
  • trim: respect ltrim and rtrim when defined
  • delimiter: may be a Buffer
  • delimiter: handle multiple bytes/characters
  • callback: export info object as third argument
  • cast: catch error in user functions
  • TypeScript: mark info as readonly with required properties
  • comment_lines: count the number of commented lines with no records
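
To illustrate what a delimiter made of multiple bytes implies when scanning a Buffer, here is a minimal sketch. It is not the parser's code: `isDelimiterAt` and `splitFields` are hypothetical names, and quotes, escapes, and record delimiters are deliberately ignored. It only shows that matching must compare every byte of the delimiter and then skip over its full length.

```javascript
// Hypothetical sketch, not the parser's actual implementation.
// Returns true when every byte of `delimiter` matches `chunk` at `pos`.
function isDelimiterAt(chunk, pos, delimiter) {
  if (pos + delimiter.length > chunk.length) return false;
  for (let i = 0; i < delimiter.length; i++) {
    if (chunk[pos + i] !== delimiter[i]) return false;
  }
  return true;
}

// Splits a single record into fields. Quotes and escapes are ignored.
function splitFields(chunk, delimiter) {
  if (delimiter.length === 0) {
    throw new Error('The delimiter must contain at least one byte');
  }
  const fields = [];
  let start = 0; // Where the current field begins
  let pos = 0; // Byte currently being inspected
  while (pos < chunk.length) {
    if (isDelimiterAt(chunk, pos, delimiter)) {
      fields.push(chunk.toString('utf8', start, pos));
      // Skip the whole delimiter, not just one byte
      pos += delimiter.length;
      start = pos;
    } else {
      pos++;
    }
  }
  // Whatever remains after the last delimiter is the last field
  fields.push(chunk.toString('utf8', start));
  return fields;
}

const fields = splitFields(Buffer.from('a::b::c'), Buffer.from('::'));
// fields is [ 'a', 'b', 'c' ]
```

Because the delimiter is itself a Buffer here, the same comparison works whether it came from a string or was passed as a Buffer in the first place.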

What’s coming next

The source code is backed by an extensive test suite. No test has been removed and new tests have been added to reinforce the guarantees of the parser. It is, however, possible that some behaviors are not covered by the tests and, in the next few weeks, we count on your feedback to fix any issues that come up.

While I am not a big fan of ES6 Promises in the context of the parser, the request for support has been made multiple times and it will come soon. It will also be implemented in the other CSV packages.

Another potential improvement is to extend the error objects with additional information such as a unique code associated with each type of error. While the messages have been improved, there is still room to normalize them further.

I am also planning to support the Flow static type checker. I have never used it before. It seems appropriate for the package and it will give me the occasion to try it out.

Finally, I am considering writing a command line tool which will expose all the available options and provide multiple output formats (JSON, JSON Lines, YAML, …).