add python pandas

Issue #4 closed
David created an issue

I don't see python pandas.read_csv in here. Would be interesting to see.

Comments (14)

  1. David reporter

    One thing to note is that using bash's "time" will include the startup and module load time. I think if you make the test.csv file bigger you won't have to worry about this as much.

  2. Kenneth Hoste

    I looked into pandas.read_csv; it's slower than using the csv module, probably because of the overhead of creating the dataframe:

    $ time python python2/csvreader.py < test/hello.csv
    5000000
    python python2/csvreader.py < test/hello.csv  0.73s user 0.07s system 93% cpu 0.854 total
    $ time python python2/csvreader_pandas.py < test/hello.csv
    5000000
    python python2/csvreader_pandas.py < test/hello.csv  1.21s user 0.47s system 83% cpu 2.023 total
    
    import pandas
    import sys
    print sum(map(len, pandas.read_csv(sys.stdin, header=None).values))
    
  3. Kenneth Hoste

    The pandas version can be sped up significantly using shape, but that's probably cheating since it's not using the actual data:

    df = pandas.read_csv(sys.stdin, header=None)
    print df.shape[0] * df.shape[1]
    
    $ time python python2/csvreader_pandas_shape.py < test/hello.csv
    5000000
    python python2/csvreader_pandas_shape.py < test/hello.csv  0.72s user 0.22s system 97% cpu 0.962 total
    
  4. Ewan Higgs repo owner

    Thanks @kehoste. Could you make a PR with your shape-calculating version? It loads the values into a dataframe, so it parses the CSV satisfactorily, as the R and Julia versions do, and I think it's an OK solution here.

  5. David reporter

    I think you really need to exclude startup time and import time from the comparisons if you are going to include interpreted languages. The easy way to do this is to run pd.read_csv in IPython with the %time magic.
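
    A minimal sketch of the same idea without IPython, assuming Python 3 and the standard csv module rather than pandas. The names count_fields and time_parse, and the synthetic sample data, are hypothetical and not part of the repository. The clock starts only after the interpreter is up and the imports are done, so neither is included in the measurement.

```python
import csv
import io
import time


def count_fields(stream):
    """Count every field in a CSV stream using the standard csv module."""
    return sum(len(row) for row in csv.reader(stream))


def time_parse(text, repeats=3):
    """Return (field_count, best_wall_seconds) for parsing `text` in-process.

    Interpreter startup and imports have already happened when the clock
    starts, so they are excluded from the measurement.
    """
    best = float("inf")
    count = 0
    for _ in range(repeats):
        stream = io.StringIO(text, newline="")
        start = time.perf_counter()
        count = count_fields(stream)
        best = min(best, time.perf_counter() - start)
    return count, best


if __name__ == "__main__":
    # Synthetic stand-in data (an assumption, not the thread's hello.csv).
    sample = "hello,world,1,2,3\n" * 100_000
    fields, seconds = time_parse(sample)
    print(f"{fields} fields parsed in {seconds:.3f} s (best of 3)")
```

    Taking the best of several repeats mirrors what %timeit reports; swapping count_fields for a pandas-based function would time pandas.read_csv the same way.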

  6. David reporter

    Here are some example timings:

    In [34]: %time np.prod(pd.read_csv('/tmp/hello.csv', header=None).shape)
    CPU times: user 518 ms, sys: 75.6 ms, total: 593 ms
    Wall time: 592 ms
    Out[34]: 5000000
    
    In [35]: %timeit np.prod(pd.read_csv('/tmp/hello.csv', header=None).shape)
    1 loop, best of 3: 594 ms per loop
    
    csv-game/python3$ time ./csvreader.py < /tmp/hello.csv
    5000000
    
    real    0m1.688s
    user    0m1.563s
    sys     0m0.057s
    
    csv-game/c++-tokenizer$ time ./csv < /tmp/hello.csv
    5000000
    
    real    0m1.104s
    user    0m1.047s
    sys     0m0.031s
    
  7. Ewan Higgs repo owner

    I think Python's tools for microbenchmarking are great, but getting something that works across the different languages and runtimes in a consistent manner will be nigh impossible. A better approach may be to time a hello world example that also loads the csv module and take the difference of the results as part of post-processing analysis.

    I'm happy for the build.sh scripts to be extended with a column for job type, and for that job type to include 'hello', for example.

  8. David reporter

    Yes, if you want to keep the API simple, you could just have an optional 'background' timing script that loads things but does not read the csv and subtract the results of timings as you say.
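
    The subtraction described above could be sketched roughly as follows, assuming Python 3. The helper names best_wall_time and net_time, the two inline commands, and the synthetic input are all hypothetical; the repository's own scripts may be organised quite differently.

```python
import subprocess
import sys
import time


def best_wall_time(cmd, repeats=3, stdin_data=None):
    """Run `cmd` `repeats` times and return the best wall-clock time in seconds."""
    # Feed bytes on stdin when given; otherwise attach an empty stdin.
    if stdin_data is not None:
        io_args = {"input": stdin_data}
    else:
        io_args = {"stdin": subprocess.DEVNULL}
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, **io_args)
        best = min(best, time.perf_counter() - start)
    return best


def net_time(full, background):
    """Subtract the background (startup + import) time, clamped at zero."""
    return max(0.0, full - background)


if __name__ == "__main__":
    # 'Background' job: start the interpreter and load modules, read no CSV.
    background_cmd = [sys.executable, "-c", "import csv, sys"]
    # Full job: the same startup cost plus the actual parsing.
    full_cmd = [
        sys.executable,
        "-c",
        "import csv, sys; print(sum(len(r) for r in csv.reader(sys.stdin)))",
    ]
    # Synthetic stand-in data (an assumption, not the thread's hello.csv).
    data = ("hello,world,1,2,3\n" * 100_000).encode()

    background = best_wall_time(background_cmd)
    full = best_wall_time(full_cmd, stdin_data=data)
    print(f"background {background:.3f} s, full {full:.3f} s, "
          f"net {net_time(full, background):.3f} s")
```

    Because both jobs are timed from outside the process, the same approach would apply to any language's entry, which is the appeal of doing the subtraction in post-processing.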

  9. Ewan Higgs repo owner

    I won't be in front of a computer for the rest of the weekend, I think. Could you put in an issue?

    Thanks
