CSV Game


The CSV Game is a collection of CSV parsing programs in various languages, put through two tests: report the number of fields in a CSV file, and take the sum of the values in a single column. It began when I saw Rob Miller's talk from GopherCon 2014 about Heka, where he claims that Go is so slow at parsing CSV messages that they pass the data over protocol buffers to a LuaJIT process, which parses the message and sends the data back over protocol buffers, and that this is still quicker than just reading it in Go (14:45 in the video). I could hardly believe this, so I wrote some sample code myself to check. Sure enough, I found Go to be pretty slow at parsing CSV files.

I discussed this with some friends and they contributed other versions in various languages, so I've collected them here.


  1. Generate the test file using the script in the test directory.

  2. Either run `time csv < /tmp/hello.csv` or `time csv /tmp/hello.csv`, depending on how the implementation takes its input.

  3. For csv-count, run `time csv-count 5 /tmp/count.csv`, where 5 is the column to sum.

  4. Alternatively, pull down the 901MB Docker image and run them using `wercker build`.
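The actual generator lives in the test directory; as a rough, hypothetical sketch of what a suitable input file looks like (the real script, row counts, and values may differ):

```python
import csv

def generate(path, rows=1000, cols=10):
    """Write rows x cols small integer fields as a plain CSV file.
    A stand-in for the generator script in the test directory."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for i in range(rows):
            w.writerow([i * cols + j for j in range(cols)])

generate("/tmp/hello.csv")
```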


I don't claim that all of the implementations are representative of idiomatic code.


PRs are most certainly welcome! However, keep in mind that I would like to keep the code plausible, so I will be very skeptical of contributions where a parser is configured to drop all of its features with the intent of gaming the results.

The Tests

There are two tests.

  1. fieldcount: Count the number of fields in the file. This exercises the CSV parsing library by forcing it to parse all the fields. There is a separate run, called empty, which runs against an empty file; it is an attempt to tease apart the cost of the actual CSV parsing from the startup cost of the runtime (importing modules, loading libraries, instantiating structures, etc.).

  2. csv-count: Take the sum of one of the columns in the file. This exercises the CSV parsing library, string-to-integer conversion, and basic maths. I saw textql, which slurps data into SQLite and runs queries on the resulting database. I thought it was a cool idea, but could it possibly be performant? This test would probably be better named csv-summer.
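Both tests boil down to something like the following Python sketch (the function names and structure are illustrative, not the repo's actual code):

```python
import csv

def fieldcount(path):
    """fieldcount: parse every record and count the total number of fields."""
    with open(path, newline="") as f:
        return sum(len(row) for row in csv.reader(f))

def csv_count(path, col):
    """csv-count: sum the integer values of one column (0-indexed here)."""
    with open(path, newline="") as f:
        return sum(int(row[col]) for row in csv.reader(f))
```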


Here are some timings for the fieldcount test, from whatever virtual machine/container system Wercker runs builds on.

| Language | Library | Time (s) | Time sans startup (s) |
| --- | --- | --- | --- |
| C++ | csvmonkey | 0.052 | 0.051 |
| Rust | csvcore-reader | 0.074 | 0.073 |
| Rust | csvreader | 0.084 | 0.083 |
| Rust | quick-reader | 0.095 | 0.094 |
| Nim | parsecsv | 0.111 | 0.110 |
| C++ | spirit | 0.123 | 0.121 |
| Rust | libcsv-reader | 0.124 | 0.122 |
| C | libcsv | 0.123 | 0.122 |
| Java | UnivocityCsv | 0.422 | 0.315 |
| C++ | tokenizer | 0.373 | 0.371 |
| Python2 | csv | 0.385 | 0.375 |
| Python2 | pandas | 0.627 | 0.396 |
| Java | JavaCsv | 0.494 | 0.411 |
| Java | OpenCsv | 0.594 | 0.512 |
| Python3 | csv | 0.558 | 0.530 |
| Python | paratext | 0.634 | 0.548 |
| Scala | MightyCsv | 0.807 | 0.611 |
| Golang | csv | 0.617 | 0.616 |
| Rust | peg-reader | 0.640 | 0.639 |
| Ruby | fastest-csv | 0.928 | 0.889 |
| Java | CommonsCsv | 1.031 | 0.944 |
| Lua | lpeg | 1.000 | 0.998 |
| Luajit | libcsv | 1.025 | 1.023 |
| Rust | nom-reader | 1.049 | 1.048 |
| Crystal | csv | 1.509 | 1.506 |
| Java | BeanIOCsv | 1.674 | 1.593 |
| Clojure | csv | 2.362 | 1.595 |
| Php | csv | 1.794 | 1.784 |
| R | dataframe | 2.072 | 1.976 |
| Perl | Text::CSV_XS | 2.143 | 2.115 |
| Julia | dataframe | 3.290 | 2.509 |
| Haskell | cassava | 3.571 | 3.557 |
| Java | CSVeedCsv | 6.650 | 6.406 |
| Gawk | regexp | 7.977 | 7.975 |
| Ruby | csv | 9.354 | 9.310 |

Here are some timings for the csv-count test (these are old and haven't been added to continuous integration).

| Language (Library) | Time |
| --- | --- |
| C (libcsv) | 0m0.177s |
| Go (Go 1.5) | 0m1.383s |
| Java (OpenCSV) | 0m0.767s |
| Java (UnivocityCSV) | 0m0.627s |
| Lua (LPEG) | 0m1.437s |
| Luajit (FFI) | 0m1.486s |
| Ocaml | 0m0.522s |
| Perl (Text::CSV_XS) | 0m2.519s |
| Python 2.7 | 0m1.077s |
| Ruby | 0m11.924s |
| Rust (csv) | 0m0.172s |
| Rust (quick) | 0m0.138s |
| SQLite3 | 0m1.834s |


The following variants are using general parsing libraries for processing the CSV:

  • C++ Boost.Spirit.
  • Lua lpeg (a PEG library).
  • Rust PEG.
  • Rust lalrpop (a LALR parser).
  • Rust NOM (a parser combinator library).

Luajit FFI is using the C libcsv library through a foreign function interface.

R reads the CSV file into a data frame and takes the product of its dimensions rather than counting each individual record. This may be a bit cheaty. Pandas does this too.
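The shape-product shortcut looks roughly like this in plain Python (a sketch of the idea, not the actual R or pandas code):

```python
import csv

def fieldcount_by_shape(path):
    """Mimic the R/pandas trick: parse into a table, then return
    rows * columns instead of counting each field individually.
    Only correct if every record has the same number of fields,
    which is why it may be a bit cheaty."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return len(rows) * len(rows[0]) if rows else 0
```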

Julia works in a similar fashion to R: it reads the CSV file into an Array{Any,2} and takes the product of its dimensions rather than counting each individual record. Like the R version, this might be a bit cheaty.

SQLite creates a table, imports the CSV file into it, and then runs a query.
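The real variant presumably uses the sqlite3 CLI's .import command; the same table-then-query idea can be sketched with Python's sqlite3 module (table and column names here are made up):

```python
import csv
import sqlite3

def sqlite_column_sum(path, col):
    """Load a CSV into an in-memory SQLite table, then SUM one column.
    INTEGER column affinity converts the string values to numbers."""
    con = sqlite3.connect(":memory:")
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    ncols = len(rows[0])
    coldefs = ", ".join(f"c{i} INTEGER" for i in range(ncols))
    con.execute(f"CREATE TABLE t ({coldefs})")
    placeholders = ", ".join("?" for _ in range(ncols))
    con.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
    (total,) = con.execute(f"SELECT SUM(c{col}) FROM t").fetchone()
    con.close()
    return total
```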

There is also a Perl 6 version, but unfortunately it takes a very long time to run (minutes). Rakudo/MoarVM/Perl 6 is under active development with ongoing performance work, so I expect this will get much faster in the future. When a breakthrough occurs, let me know and I'd love to add it to the game.

Gawk uses FPAT to delimit the fields. FPAT is a regular expression that describes what a field looks like (rather than what separates fields), which is how it copes with quoted fields containing commas. I wish AWK had a --csv flag that was more performant.
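The same match-the-field idea can be shown with Python's re module, using a pattern in the spirit of the FPAT examples in the gawk manual (this simplified version skips empty fields and doesn't handle escaped quotes inside quoted fields):

```python
import re

# A field is either a quoted string or a run of non-comma characters,
# so a comma inside quotes does not split the field.
FIELD = re.compile(r'"[^"]*"|[^,]+')

def split_fields(line):
    """FPAT-style field extraction: find fields, rather than split on commas."""
    return FIELD.findall(line)
```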

Rudimentary Analysis

Profiling the Go code, I can see that a lot of the time goes to garbage collection. A lot of time also goes to handling UTF-8.

Slurping data into SQLite for interactive analysis isn't so bad, actually.