# HDFS

At the core of Hadoop is HDFS, a distributed, fault-tolerant filesystem. Put simply, the filesystem is not on one machine but spread across multiple machines, and if one of those machines were to fail, no data would be lost.
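As a toy sketch of where that fault tolerance comes from (a deliberate simplification, not HDFS's actual implementation): a file is split into fixed-size blocks, each block is stored on several nodes, and a read can be served from any surviving replica, so losing one node loses no data.

```python
# Toy model of block replication (illustrative only; node names, block
# size, and replication factor are made up, not real HDFS internals).
BLOCK_SIZE = 4
REPLICATION = 3
NODES = ["node1", "node2", "node3", "node4"]

def split_blocks(data):
    # Chop the file into fixed-size blocks.
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place(blocks):
    # Round-robin placement: block i is replicated on REPLICATION
    # consecutive nodes, so no block lives on only one machine.
    return {
        i: [NODES[(i + r) % len(NODES)] for r in range(REPLICATION)]
        for i in range(len(blocks))
    }

def read_file(blocks, placement, dead_node):
    # Reassemble the file, serving each block from any live replica.
    out = b""
    for i, block in enumerate(blocks):
        live = [n for n in placement[i] if n != dead_node]
        if not live:
            raise IOError("block %d lost" % i)
        out += block
    return out

data = b"hello hdfs, replicated blocks"
blocks = split_blocks(data)
placement = place(blocks)
# Kill one node; the file is still fully readable from the replicas.
recovered = read_file(blocks, placement, dead_node="node2")
```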
Hadoop lets us access files on this special filesystem (which coexists with the local filesystem) using command-line syntax similar to the standard Unix tools.

To list files, we have something similar to `ls`:

```
hadoop fs -ls
```
To cat files, we also use the `hadoop fs` command:

```
hadoop fs -cat /path/to/file
```
If we want to copy something from the local filesystem to HDFS, we can use the `-copyFromLocal` command:

```
hadoop fs -copyFromLocal <local_file> <folder_on_hdfs>
```
Now we can access this file on HDFS using `-cat`:

```
hadoop fs -cat <some_file> | less
```
## File Formats

### Text Files
Text files and compressed files can serve as input to MapReduce jobs. Gzipped files can be processed, but a gzip file cannot be split across multiple mappers; other compression formats, such as LZO (when indexed), are splittable by Hadoop.
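The splittability point can be sketched with a small Python example (a toy illustration, not Hadoop's actual input-split code): from any byte offset in a plain text file, a reader can skip to the next newline and resume reading whole records, whereas a gzip stream can only be decompressed from its beginning.

```python
import gzip
import io

# Build a small plain-text payload and a gzipped copy of it in memory.
text = b"line1\nline2\nline3\n" * 100
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(text)
gz_bytes = buf.getvalue()

# Plain text is splittable: from an arbitrary offset, skip ahead to the
# next newline and start consuming complete lines again.
mid = len(text) // 2
tail = text[mid:]
first_full_line = tail[tail.index(b"\n") + 1:].split(b"\n", 1)[0]

# A gzip stream is not splittable: decompression must start at the gzip
# header, so handing a mapper the second half of the file fails outright.
try:
    gzip.decompress(gz_bytes[len(gz_bytes) // 2:])
    gzip_splittable = True
except (OSError, EOFError):
    gzip_splittable = False
```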
Hadoop can easily handle unstructured text files, but it may be useful to have an enforced schema at times.
### Avro

Avro is a data serialization system. Data can be stored in the Avro format, a compact binary format with a schema attached.
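For illustration, an Avro schema is itself a JSON document that travels with the binary-encoded data, so every reader knows the field names and types. The record below is a hypothetical example (the record and field names are made up, not from this course):

```python
import json

# Hypothetical Avro schema for a gene-synonym record (names are
# illustrative only). "record" is Avro's struct-like type; the union
# ["null", "string"] makes the synonym field optional.
schema = {
    "type": "record",
    "name": "Synonym",
    "fields": [
        {"name": "gene_id", "type": "long"},
        {"name": "symbol", "type": "string"},
        {"name": "synonym", "type": ["null", "string"], "default": None},
    ],
}

# Serialize the schema to the JSON text that would be embedded in an
# Avro data file.
schema_json = json.dumps(schema)
```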
### Parquet

Parquet is a columnar storage format for Hadoop; storing data by column allows efficient compression and lets queries read only the columns they need.
## Sqoop

Sqoop is a tool for bulk data transfers between datastores, such as relational databases and HDFS.
Basic command structure:

```
sqoop import --connect jdbc:mysql://krakatoa.mssm.edu/annot_gene --username <> --password <> --fields-terminated-by '\t' --table synonym --target-dir ailun_tables/synonym
```