
#HDFS

At the core of Hadoop is HDFS, a distributed, fault-tolerant filesystem. Put simply, files are not stored on one machine but spread across multiple machines, and if one of those machines were to fail, no data would be lost.

Hadoop lets us access files on this special filesystem (which coexists with the local filesystem) using command-line syntax similar to the standard Unix tools.

To list files, there is a command similar to `ls`:

hadoop fs -ls

To cat files, we also use the `hadoop fs` command:

hadoop fs -cat /path/to/file

If we want to copy something from the local filesystem to HDFS, we can use the `-copyFromLocal` command:

hadoop fs -copyFromLocal <local_file> <folder_on_hdfs>

Now we can access this file on HDFS using `-cat`:

hadoop fs -cat <some_file> | less

##File Formats

###Text Files

Text files, plain or compressed, can serve as input to MapReduce jobs. Gzipped files can be processed, but a gzipped file cannot be split across mappers; other compression formats, such as LZO (once indexed), are splittable by Hadoop.
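
As a minimal sketch of what this means in practice (the file and directory names here are hypothetical):

# Compress a local log file and load it into HDFS.
# Hadoop decompresses gzip transparently at read time, but the whole
# file goes to a single mapper because gzip streams are not splittable.
gzip access.log
hadoop fs -mkdir -p logs
hadoop fs -copyFromLocal access.log.gz logs/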

Hadoop can easily handle unstructured text files, but at times it may be useful to have an enforced schema.

###Avro

Avro is a data serialization system. Data can be stored in the Avro format, a compact binary format with the schema attached to the file.
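
As a sketch of what such a schema looks like (the `Gene` record and its fields are hypothetical), an Avro schema is defined in JSON and can be written to a file like this:

# A hypothetical Avro schema describing a "Gene" record.
# The schema travels with the binary data in the resulting .avro file.
cat > gene.avsc <<'EOF'
{
  "type": "record",
  "name": "Gene",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "symbol", "type": "string"},
    {"name": "synonyms", "type": {"type": "array", "items": "string"}}
  ]
}
EOF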


###Parquet

Parquet is a columnar storage format for Hadoop; like Avro, it stores its schema with the data, and its column-oriented layout makes it efficient for queries that read only a few columns.

##Sqoop

Sqoop is a tool for bulk data transfers between structured datastores, such as relational databases, and HDFS.


Basic command structure:

sqoop import \
  --connect jdbc:mysql://krakatoa.mssm.edu/annot_gene \
  --username <> --password <> \
  --fields-terminated-by '\t' \
  --table synonym \
  --target-dir ailun_tables/synonym
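
Before running an import, it can help to verify the JDBC URL and credentials; one way is Sqoop's `list-tables` command, sketched here with the same placeholder credentials:

# List the tables visible to this user; a quick check that the
# connection string and credentials work before a full import.
sqoop list-tables \
  --connect jdbc:mysql://krakatoa.mssm.edu/annot_gene \
  --username <> --password <>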
