CSV Game

Introduction

The CSV Game is a collection of example CSV parsing programs with two tests: report the number of fields in a CSV file, and take the sum of the values in a single column. It began when I saw this Rob Miller talk from GopherCon 2014 about Heka, where he claims that Go is so slow at parsing CSV messages that they pass the data over protocol buffers to a LuaJIT process which parses the message and sends the data back over protocol buffers - and that this is quicker than just reading it in Go (14:45 in the video). I could hardly believe this, so I wrote some sample code myself to check it. Sure enough, I found Go to be pretty slow at parsing CSV files.
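
To make the test concrete, here is a minimal sketch of a fieldcount-style program in Go using the standard library's encoding/csv package (the "Golang csv" entry in the timings below); the real variant's command-line handling may differ:

    // fieldcount.go: count the total number of fields in a CSV file.
    // A minimal sketch using Go's standard encoding/csv package; the
    // real variant's command-line handling may differ.
    package main

    import (
        "encoding/csv"
        "fmt"
        "io"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        r := csv.NewReader(f)
        r.FieldsPerRecord = -1 // accept records of varying length

        fields := 0
        for {
            record, err := r.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            fields += len(record)
        }
        fmt.Println(fields)
    }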

I discussed this with some friends and they contributed other versions in various languages, so I've collected them here.

Procedure

  1. Generate the test file using the script in the test directory (a hypothetical Go equivalent is sketched after this list).

  2. For fieldcount, run time csv < /tmp/hello.csv or time csv /tmp/hello.csv, depending on how the implementation takes its input.

  3. For csv-count, run time csv-count 5 /tmp/count.csv, where 5 is the index of the column to sum.

  4. Alternatively, pull down the 901MB Docker image and run the tests using wercker build.
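
The generation script itself lives in the test directory and isn't reproduced here. As an illustration only, a hypothetical Go equivalent might write rows of small integers like this (the output path, row count, and column count are all assumptions):

    // gen.go: write a simple CSV test file of numeric fields.
    // A hypothetical stand-in for the script in the test directory;
    // the output path, row count, and column count are assumptions.
    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
    )

    func main() {
        f, err := os.Create("/tmp/count.csv")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        w := bufio.NewWriter(f)
        defer w.Flush()

        const rows, cols = 1000000, 10
        for i := 0; i < rows; i++ {
            for j := 0; j < cols; j++ {
                if j > 0 {
                    fmt.Fprint(w, ",")
                }
                fmt.Fprint(w, (i+j)%100) // small integer fields
            }
            fmt.Fprintln(w)
        }
    }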

Disclaimer

I don't claim that all of the implementations are representative of idiomatic code.

Contributing

Since I don't claim that all the implementations are representative of idiomatic code, PRs are most certainly welcome! However, keep in mind that I would like to keep the code plausible, so I will be very skeptical of contributions where a parser is configured to drop all features with the intent of gaming the results.

The Tests

There are two tests.

  1. fieldcount: Count the number of fields in the file. This exercises the CSV processing library by forcing it to parse every field. There is a separate run called empty, which runs against an empty file; it is an attempt to tease out the performance of the actual CSV parsing from the startup cost of the runtime (importing modules, loading libraries, instantiating structures, etc.).

  2. csv-count: Take the sum of one of the columns in the file (a minimal Go sketch follows this list). This exercises the CSV parsing library, string-to-integer parsing, and basic maths. I saw textql, which slurps data into SQLite and runs queries on the resulting database; I thought it was a cool idea, but could it possibly be performant? This test would probably be better named csv-summer.
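
As a concrete illustration of the csv-count test, here is a minimal Go sketch along the same lines as the fieldcount one above; whether the column index is zero- or one-based in the real harness is an assumption:

    // csv-count.go: sum the values in one column of a CSV file.
    // A minimal sketch; invoked as: csv-count 5 /tmp/count.csv.
    // Whether the index is zero- or one-based here is an assumption.
    package main

    import (
        "encoding/csv"
        "fmt"
        "io"
        "log"
        "os"
        "strconv"
    )

    func main() {
        col, err := strconv.Atoi(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        f, err := os.Open(os.Args[2])
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        r := csv.NewReader(f)
        sum := 0
        for {
            record, err := r.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            n, err := strconv.Atoi(record[col])
            if err != nil {
                log.Fatal(err)
            }
            sum += n
        }
        fmt.Println(sum)
    }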

Timings

Here are some timings for the fieldcount test, taken from whatever virtual machine/container system Wercker runs builds on.

| Language | Library | Time (s) | Time sans startup (s) |
|----------|---------|----------|-----------------------|
| C++ | csvmonkey | 0.051 | 0.050 |
| Rust | csvcore-reader | 0.065 | 0.064 |
| Rust | csvreader | 0.076 | 0.075 |
| Rust | quick-reader | 0.086 | 0.085 |
| Nim | parsecsv | 0.113 | 0.112 |
| C | libcsv | 0.120 | 0.119 |
| Rust | libcsv-reader | 0.121 | 0.119 |
| C++ | spirit | 0.134 | 0.133 |
| Java | UnivocityCsv | 0.446 | 0.346 |
| Python2 | pandas | 0.604 | 0.398 |
| Python2 | csv | 0.412 | 0.402 |
| C++ | tokenizer | 0.417 | 0.416 |
| Java | JavaCsv | 0.526 | 0.447 |
| Rust | peg-reader | 0.477 | 0.476 |
| Python3 | csv | 0.556 | 0.534 |
| Java | OpenCsv | 0.648 | 0.571 |
| Python-paratext | paratext | 0.651 | 0.584 |
| Scala | MightyCsv | 0.808 | 0.619 |
| Golang | csv | 0.684 | 0.683 |
| Rust | nom-reader | 0.954 | 0.953 |
| Java | CommonsCsv | 1.048 | 0.965 |
| Luajit | libcsv | 0.995 | 0.993 |
| Lua | lpeg | 1.021 | 1.019 |
| Haskell | cassava | 1.058 | 1.056 |
| Rust | lalr-reader | 1.086 | 1.085 |
| Clojure | csv | 2.344 | 1.606 |
| Java | BeanIOCsv | 1.709 | 1.632 |
| Php | csv | 1.789 | 1.779 |
| R | dataframe | 2.049 | 1.966 |
| Julia | dataframe | 3.607 | 2.045 |
| Perl | Text::CSV_XS | 2.137 | 2.114 |
| Java | CSVeedCsv | 6.856 | 6.612 |
| Gawk | regexp | 7.867 | 7.865 |
| Ruby | csv | 9.334 | 9.290 |

Here are some timings for the csv-count test (these are old and haven't been added to the continuous integration).

| Language | Time |
|----------|------|
| C (libcsv) | 0m0.177s |
| Go (Go 1.5) | 0m1.383s |
| Java (OpenCSV) | 0m0.767s |
| Java (UnivocityCSV) | 0m0.627s |
| Lua LPEG | 0m1.437s |
| Luajit FFI | 0m1.486s |
| Ocaml | 0m0.522s |
| Perl (Text::CSV_XS) | 0m2.519s |
| Python 2.7 | 0m1.077s |
| Ruby | 0m11.924s |
| Rust (csv) | 0m0.172s |
| Rust (quick) | 0m0.138s |
| SQLite3 | 0m1.834s |

Notes

The following variants use general parsing libraries for processing the CSV:

  • C++ Boost.Spirit.
  • Lua lpeg (a PEG library).
  • Rust PEG.
  • Rust lalrpop (a LALR parser generator).
  • Rust nom (a parser combinator library).

The Luajit FFI variant uses the C libcsv library through LuaJIT's foreign function interface.

R reads the CSV file into a data frame and takes the product of the dimensions rather than counting each individual record. This may be a bit cheaty. Pandas does this too.

Julia works in a similar fashion to R: it reads the CSV file into an Array{Any,2} and takes the product of the dimensions rather than counting each individual record. Like the R version, this might be a bit cheaty.

SQLite creates a table, imports the CSV file into it, and then runs a query.
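
For illustration, the same slurp-then-query idea could be rendered in Go roughly as follows. This is only a sketch assuming the third-party github.com/mattn/go-sqlite3 driver; the actual variant presumably uses SQLite's own import mechanism, and the column index and file path here are assumptions:

    // sqlite-count.go: slurp a CSV column into SQLite, then query it.
    // A hypothetical Go rendering of the approach; assumes the
    // third-party github.com/mattn/go-sqlite3 driver. The column
    // index and file path are assumptions.
    package main

    import (
        "database/sql"
        "encoding/csv"
        "fmt"
        "io"
        "log"
        "os"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", ":memory:")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        if _, err := db.Exec("CREATE TABLE t (c INTEGER)"); err != nil {
            log.Fatal(err)
        }

        f, err := os.Open("/tmp/count.csv")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        r := csv.NewReader(f)
        for {
            record, err := r.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            // Keep only the column being summed (index 5 here).
            if _, err := db.Exec("INSERT INTO t (c) VALUES (?)", record[5]); err != nil {
                log.Fatal(err)
            }
        }

        var sum int64
        if err := db.QueryRow("SELECT SUM(c) FROM t").Scan(&sum); err != nil {
            log.Fatal(err)
        }
        fmt.Println(sum)
    }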

There is also a Perl 6 version, but unfortunately it takes a very long time to run (minutes). Rakudo/MoarVM/Perl 6 is under active development, with ongoing performance improvements, so I expect this will get much faster in the future. When a breakthrough occurs, let me know and I'd love to add it to the game.

Gawk uses FPAT to delimit the fields. FPAT is a regular expression that describes the contents of a field rather than the separators, which lets gawk cope with quoted fields in the CSV. I wish AWK had a --csv flag that was more performant.

Rudimentary Analysis

Profiling the Go code, I can see that a lot of the time goes to garbage collection. A lot of the time also goes to handling UTF-8.
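
For anyone who wants to reproduce the profile, here is a minimal sketch of CPU profiling with Go's runtime/pprof; the output path is an assumption:

    // profile.go: wrap the parsing loop in a CPU profile.
    // A minimal sketch using Go's runtime/pprof; the output path is
    // an assumption.
    package main

    import (
        "log"
        "os"
        "runtime/pprof"
    )

    func main() {
        out, err := os.Create("/tmp/fieldcount.prof")
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()

        if err := pprof.StartCPUProfile(out); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()

        // ... run the fieldcount loop from earlier here ...
    }

The resulting profile can then be inspected with go tool pprof.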

Slurping data into SQLite for interactive analysis isn't so bad, actually.