- marked as enhancement
add python pandas
I don't see Python's pandas.read_csv in here. It would be interesting to see.
Comments (14)
-
repo owner -
reporter One thing to note is that using bash "time" will include the startup and module load time. I think if you make the test.csv file bigger you won't have to worry about this as much.
-
I looked into pandas.read_csv. It's slower than using the csv module, probably because of the overhead of creating the dataframe:
$ time python python2/csvreader.py < test/hello.csv
5000000
python python2/csvreader.py < test/hello.csv  0.73s user 0.07s system 93% cpu 0.854 total
$ time python python2/csvreader_pandas.py < test/hello.csv
5000000
python python2/csvreader_pandas.py < test/hello.csv  1.21s user 0.47s system 83% cpu 2.023 total
import pandas
import sys
print sum(map(len, pandas.read_csv(sys.stdin, header=None).values))
-
The pandas version can be sped up significantly using shape, but that's probably cheating since it's not using the actual data:
import pandas
import sys
df = pandas.read_csv(sys.stdin, header=None)
print df.shape[0] * df.shape[1]
$ time python python2/csvreader_pandas_shape.py < test/hello.csv
5000000
python python2/csvreader_pandas_shape.py < test/hello.csv  0.72s user 0.22s system 97% cpu 0.962 total
-
repo owner Thanks @kehoste. Could you make a PR with your shape-calculating version? It loads the values into a dataframe, so it parses the CSV as satisfactorily as the R and Julia versions do. I think it's an OK solution here.
-
repo owner -
repo owner - changed status to resolved
-
repo owner - changed status to closed
-
reporter I think you really need to exclude startup time and import time in the comparisons if you are going to include interpreted languages. The easy way to do this is to run pd.read_csv in IPython with the %time magic.
-
reporter Here are some example timings:
In [34]: %time np.prod(pd.read_csv('/tmp/hello.csv', header=None).shape)
CPU times: user 518 ms, sys: 75.6 ms, total: 593 ms
Wall time: 592 ms
Out[34]: 5000000
In [35]: %timeit np.prod(pd.read_csv('/tmp/hello.csv', header=None).shape)
1 loop, best of 3: 594 ms per loop
csv-game/python3$ time ./csvreader.py < /tmp/hello.csv
5000000
real 0m1.688s
user 0m1.563s
sys 0m0.057s
csv-game/c++-tokenizer$ time ./csv < /tmp/hello.csv
5000000
real 0m1.104s
user 0m1.047s
sys 0m0.031s
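For runs outside IPython, a rough equivalent is to start the clock inside the script, after the imports have already been paid for. The sketch below is only an illustration: it uses the standard-library csv module rather than pandas, and count_fields, timed_count, and the in-memory sample data are hypothetical stand-ins, not part of csv-game.

```python
import csv
import io
import time


def count_fields(stream):
    """Count every field in a CSV stream using the stdlib csv module."""
    return sum(len(row) for row in csv.reader(stream))


def timed_count(stream):
    """Return (field_count, elapsed_seconds), timing only the parse itself.

    Interpreter startup and module imports happen before the clock starts,
    so they are excluded from the measurement.
    """
    start = time.perf_counter()
    total = count_fields(stream)
    elapsed = time.perf_counter() - start
    return total, elapsed


if __name__ == "__main__":
    # Hypothetical in-memory data standing in for test/hello.csv.
    sample = io.StringIO("hello,world,1,2,3\n" * 1000)
    total, elapsed = timed_count(sample)
    print(total, f"{elapsed:.6f}s")
```

Passing sys.stdin instead of the in-memory sample would match how the existing csvreader scripts receive their input.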
-
repo owner I think Python's tools for microbenchmarking are great, but getting something that works across the different languages and runtimes in a consistent manner will be nigh impossible. A better approach may be to time a hello world example that also loads the csv module and take the difference of the results as part of post-processing analysis.
I'm happy for the build.sh scripts to be extended with a column for job type, and for that job type to include 'hello', for example.
-
reporter Yes, if you want to keep the API simple, you could just have an optional 'background' timing script that loads things but does not read the CSV, and subtract the timings as you say.
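The background-subtraction idea could look roughly like the following. This is a sketch under assumptions, not the csv-game harness: wall_time, parse_only_estimate, and the inline BACKGROUND and FULL jobs are hypothetical, and the data is generated in memory instead of being read from test/hello.csv.

```python
import subprocess
import sys
import time


def wall_time(code, repeats=3):
    """Best-of-N wall time for running `code` in a fresh interpreter."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", code], check=True,
                       stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


# 'Background' job: startup, imports, and building the data, but no parsing.
BACKGROUND = (
    "import csv, io\n"
    "data = io.StringIO('hello,world,1,2,3\\n' * 20000)\n"
)

# Full job: everything the background job does, plus reading the CSV.
FULL = BACKGROUND + "print(sum(len(row) for row in csv.reader(data)))\n"


def parse_only_estimate(repeats=3):
    """Return (full - background, background, full) in seconds."""
    background = wall_time(BACKGROUND, repeats)
    full = wall_time(FULL, repeats)
    return full - background, background, full


if __name__ == "__main__":
    diff, background, full = parse_only_estimate()
    print(f"background {background:.3f}s, full {full:.3f}s, parse {diff:.3f}s")
```

On small inputs the difference can be dominated by noise, which fits the earlier suggestion to use a bigger test file.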
-
repo owner I won't be in front of a computer for the rest of the weekend, I think. Could you put in an issue?
Thanks
-
reporter done.
Good shout. I thought I had this in there. Maybe I forgot to commit it. :(