add python pandas

Issue #4 closed

David created an issue 2016-06-17

I don't see python pandas.read_csv in here. Would be interesting to see.

Comments (14)

Ewan Higgs repo owner
- marked as enhancement
Good shout. I thought I had this in there. Maybe I forgot to commit it. :(
- 2016-06-17T07:49:11+00:00
David reporter
One thing to note is that using bash's "time" will include the startup and module load time. I think if you make the test.csv file bigger you won't have to worry about this as much.
- 2016-06-18T10:42:23+00:00

Kenneth Hoste

I looked into pandas.read_csv; it's slower than using the csv module, probably because of the overhead of creating the dataframe:

$ time python python2/csvreader.py < test/hello.csv
5000000
python python2/csvreader.py < test/hello.csv  0.73s user 0.07s system 93% cpu 0.854 total
$ time python python2/csvreader_pandas.py < test/hello.csv
5000000
python python2/csvreader_pandas.py < test/hello.csv  1.21s user 0.47s system 83% cpu 2.023 total

import pandas
import sys
print sum(map(len, pandas.read_csv(sys.stdin, header=None).values))

2016-06-22T07:23:43+00:00

Kenneth Hoste

The pandas version can be sped up significantly using shape, but that's probably cheating since it's not using the actual data:

df = pandas.read_csv(sys.stdin, header=None)
print df.shape[0] * df.shape[1]

$ time python python2/csvreader_pandas_shape.py < test/hello.csv
5000000
python python2/csvreader_pandas_shape.py < test/hello.csv  0.72s user 0.22s system 97% cpu 0.962 total

2016-06-22T07:36:29+00:00

Ewan Higgs repo owner
Thanks @kehoste. Could you make a PR with your shape-calculating version? It loads the values into a dataframe, so it parses the CSV satisfactorily, like the R and Julia versions, and I think it's an OK solution here.
- 2016-06-24T08:58:57+00:00
Ewan Higgs repo owner
Fixed. https://app.wercker.com/#ehiggs/csv-game/build/5776edb3b45c977302017380?step=5776ee0189fefa000121e71f
- 2016-07-01T22:38:31+00:00
Ewan Higgs repo owner
- changed status to resolved
- 2016-07-01T22:38:42+00:00
Ewan Higgs repo owner
- changed status to closed
- 2016-07-01T22:38:59+00:00
David reporter
I think you really need to exclude startup time and import time in the comparisons if you are going to include interpreted languages. The easy way to do this is to run pd.read_csv in IPython with the %time magic.
- 2016-07-02T08:55:30+00:00
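
Outside IPython, the same idea (timing only the parse, in-process, after imports have completed) can be sketched with the standard library's timeit module. This is an illustrative Python 3 sketch under stated assumptions, not code from the repository: the helper name count_fields and the in-memory sample data are hypothetical, and the csv module stands in for pd.read_csv so the example has no third-party dependency.

```python
import csv
import io
import timeit

# Hypothetical sample data: 1000 rows of 5 fields, standing in for test/hello.csv.
SAMPLE = "\n".join(",".join(["hello"] * 5) for _ in range(1000)) + "\n"


def count_fields(text):
    """Parse CSV text and return the total number of fields."""
    return sum(len(row) for row in csv.reader(io.StringIO(text)))


# Interpreter startup and imports have already happened by this point,
# so only the parse itself is timed. Take the best of 3 single runs.
timings = timeit.repeat(lambda: count_fields(SAMPLE), repeat=3, number=1)
best = min(timings)
print("fields: %d, best of 3: %.3f ms" % (count_fields(SAMPLE), best * 1e3))
```

Taking the minimum of several runs, rather than the mean, reduces the influence of unrelated system activity on the reported figure.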

David reporter

Here are some example timings:

In [34]: %time np.prod(pd.read_csv('/tmp/hello.csv', header=None).shape)
CPU times: user 518 ms, sys: 75.6 ms, total: 593 ms
Wall time: 592 ms
Out[34]: 5000000

In [35]: %timeit np.prod(pd.read_csv('/tmp/hello.csv', header=None).shape)
1 loop, best of 3: 594 ms per loop

csv-game/python3$ time ./csvreader.py < /tmp/hello.csv
5000000

real    0m1.688s
user    0m1.563s
sys     0m0.057s

csv-game/c++-tokenizer$ time ./csv < /tmp/hello.csv
5000000

real    0m1.104s
user    0m1.047s
sys     0m0.031s

2016-07-02T09:11:05+00:00

Ewan Higgs repo owner
I think Python's tools for microbenchmarking are great, but getting something that works across the different languages and runtimes in a consistent manner will be nigh impossible. A better approach may be to time a hello world example that also loads the csv module and take the difference of the results as part of post-processing analysis.

I'm happy for the build.sh scripts to be extended with a column for job type, and for that job type to include 'hello', for example.
- 2016-07-02T09:30:05+00:00
David reporter
Yes, if you want to keep the API simple, you could just have an optional 'background' timing script that loads things but does not read the csv and subtract the results of timings as you say.
- 2016-07-02T09:55:13+00:00
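
The subtraction approach discussed above could look roughly like the following Python 3 sketch. Everything named here is an assumption for illustration: the wall_time helper, the two inline scripts standing in for a 'background' script and a real reader, and the small generated input. It is not how the repository's scripts are structured.

```python
import subprocess
import sys
import time


def wall_time(args, stdin_data=b"", runs=3):
    """Return the best wall-clock time in seconds over `runs` executions."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(args, input=stdin_data,
                       stdout=subprocess.DEVNULL, check=True)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical 'background' script: starts the interpreter and loads the
# module, but reads no CSV data.
BACKGROUND = "import csv, sys"
# Hypothetical reader: same imports, plus the actual parse of stdin.
READER = "import csv, sys; print(sum(len(r) for r in csv.reader(sys.stdin)))"

# Small generated input standing in for test/hello.csv.
data = ("hello,world,foo,bar,baz\n" * 1000).encode()

background = wall_time([sys.executable, "-c", BACKGROUND])
full = wall_time([sys.executable, "-c", READER], data)
print("background: %.3fs, full: %.3fs, difference: %.3fs"
      % (background, full, full - background))
```

With an input this small the difference is dominated by noise and may even come out negative; the subtraction only becomes meaningful when the parse takes much longer than process startup, which is the reason for using a larger test file.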
Ewan Higgs repo owner
I won't be in front of a computer for the rest of the weekend, I think. Could you put in an issue?

Thanks
- 2016-07-02T10:10:05+00:00
David reporter
done.
- 2016-07-02T12:45:04+00:00

Assignee: –

Type: enhancement

Priority: major

Status: closed

Votes: 0

Watchers: 1