This library is a thin, fast Python wrapper around libhdfs. It lets you read and write files in HDFS from Python. The other filesystem manipulation calls provided by libhdfs are also supported. Cython is used to wrap the libhdfs calls; libhdfs is in turn a C wrapper around Hadoop's Java API.
Simple usage example::

    import os
    import cyhdfs

    conn = cyhdfs.HDFSConnection('namenode_hostname', 54310, 'hduser')
    test_file = conn.open_file("/tmp/cyhdfs/test.txt", os.O_WRONLY)
    bytes_written = test_file.write("Hello world.")
    test_file.close()
    conn.delete("/tmp/cyhdfs")
    conn.close()

See the example directory in the source for other usage examples.
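
If the write fails partway, the file handle in the example above is never closed. A minimal sketch of one way to guard against that, using only the calls shown in the example (``open_file``, ``write``, ``close``); the helper name ``write_file`` is hypothetical and not part of cyhdfs:

```python
import os


def write_file(conn, path, data):
    """Write data to path and always close the file handle.

    conn is assumed to behave like the HDFSConnection in the usage
    example: open_file(path, flags) returns an object that has
    write() and close(). write_file itself is a hypothetical helper.
    """
    hdfs_file = conn.open_file(path, os.O_WRONLY)
    try:
        # Return whatever write() returns (bytes written in the example).
        return hdfs_file.write(data)
    finally:
        # Runs even if write() raises, so the handle is not leaked.
        hdfs_file.close()
```

With a real connection this would be called as ``write_file(conn, "/tmp/cyhdfs/test.txt", "Hello world.")``, followed by ``conn.close()`` once all work is done.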
The first call to the libhdfs API takes a long time because Java has to
initialize its JVM. Subsequent calls take significantly less time, so take
this into account if you measure performance with timeit.
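
One way to keep the JVM startup out of a measurement is to make one untimed warm-up call first. The sketch below assumes Python 3.3+ (for ``time.perf_counter``); ``time_after_warmup`` is a hypothetical helper, and ``func`` stands for any zero-argument callable that performs a libhdfs call:

```python
import time


def time_after_warmup(func, repeat=5):
    """Return (warmup_seconds, [seconds, ...]) for a zero-argument callable."""
    start = time.perf_counter()
    func()  # first call pays the one-time startup cost (here: JVM init)
    warmup = time.perf_counter() - start

    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return warmup, timings
```

Reporting the warm-up time separately makes the one-time cost visible without letting it skew the per-call numbers.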
I have been able to write to HDFS with this library at 90Mb per second over a 1Gb channel.
Installing on Cloudera 0.20.2-cdh3u3:
Symptoms: compilation fails with the message::

    src/cyhdfs.c: In function '__pyx_pf_6cyhdfs_14HDFSConnection_20delete':
    src/cyhdfs.c:2815: error: too many arguments to function 'hdfsDelete'
    error: command 'cc' failed with exit status 1

The pregenerated .c file is built for 0.20.2-cdh3u5 or higher.
To compile with 0.20.2-cdh3u3::
- first install Cython 0.17.2 or higher
- edit setup.cfg and set hadoop_delete_recursive=0 and hadoop_hflush=1
- delete the src/cyhdfs.c file and the ./build directory
- run python ./setup.py build as usual
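
The last two steps above can be scripted. This is a minimal sketch, assuming it is run from the source root and that setup.cfg has already been edited by hand; the function names ``clean_generated`` and ``rebuild`` are hypothetical and not part of the project:

```python
import os
import shutil
import subprocess
import sys


def clean_generated(root="."):
    """Remove src/cyhdfs.c and the build directory under root, if present."""
    c_file = os.path.join(root, "src", "cyhdfs.c")
    build_dir = os.path.join(root, "build")
    if os.path.exists(c_file):
        os.remove(c_file)
    # ignore_errors=True makes this a no-op when build/ does not exist.
    shutil.rmtree(build_dir, ignore_errors=True)


def rebuild(root="."):
    """Clean stale outputs, then run the usual build from root."""
    clean_generated(root)
    subprocess.check_call([sys.executable, "setup.py", "build"], cwd=root)
```

Deleting the pregenerated .c file forces Cython to regenerate it with the options set in setup.cfg.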