Wiki
Clone wikidemeter-course / Impala
#Impala
impala-shell
##Creating a Database and Table
create database mydb; use mydb; create table mytable ( id int, description string );
##Loading Data
insert into mytable (id, description) values (1, 'This is text'), (2, 'This is text'); OR load data inpath '/path/to/files' into mytables;
##Specifying Data Storage
In the command above we were able to create a table, but we also have to state something about how the underlying data is stored. This is done with the ROW FORMAT
command.
create table mytable ( id int, description string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
We specified that the column delimiter was a tab and the underlying files were text. There are alternatives as well.
---EMR Fact Table CREATE TABLE `FACT` ( `FACT_KEY` bigint , `PERSON_KEY` int , `ACCOUNTING_GROUP_KEY` int , `ENCOUNTER_KEY` int , `CAREGIVER_GROUP_KEY` int , `FACILITY_KEY` int , `PROCEDURE_GROUP_KEY` int , `DIAGNOSIS_GROUP_KEY` int , `MATERIAL_GROUP_KEY` int , `ORGANIZATION_GROUP_KEY` int , `PAYOR_GROUP_KEY` int , `OPERATION_KEY` bigint , `KEYWORD_GROUP_KEY` int , `CALENDAR_KEY` int , `TIME_OF_DAY_KEY` int , `META_DATA_KEY` int , `UOM_KEY` int , `DATA_FEED_KEY` int , `DATA_QUALITY_KEY` int , `DATA_STATE_KEY` int , `AUDIT_KEY` bigint , `VALUE` string , `INSERT_DATE_TIME` string , `UPDATE_DATE_TIME` string )ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
---BAM File create table if not exists bam_test ( referenceName string, referenceId int, fstart bigint, fend bigint, mapq int, readName string, sequence string, mateReference string, mateAlignmentStart bigint, cigar string, qual string, recordGroupId string, readPaired boolean, properPair boolean, readMapped boolean, mateMapped boolean, readNegativeStrand boolean, mateNegativeStrand boolean, firstOfPair boolean, secondOfPair boolean, primaryAlignment boolean, failedVendorQualityChecks boolean, duplicateRead boolean, mismatchingPositions string, attributes string ) STORED AS PARQUETFILE;
##External Vs Internal Tables
Traditionally, you can load data into a table with the follwing command:
load data inpath '/path/to/files' into table;
This command will move the files from /path/to/files to an a folder managed by Hive. Important note is that Hive will need write permissions on the folder to move the files. Alternatively, you can create an external table.
create external table mytable (...);
This will leave the files where they are and manage them here. Unlike internal tables, dropping the table will not delete data.
##Querying with Impala
###EMR Queries
SELECT d_person.*, d_encounter.* FROM fact JOIN d_person on(d_person.person_key=fact.person_key) JOIN d_encounter ON (d_encounter.encounter_key = fact.encounter_key) LIMIT 10;
###VCF Queries
create database if not exists vcf_files; use vcf_files; drop table if exists arun_vcf; create table if not exists arun_vcf (CHROM string, POS string, ID string, REF string, ALT string, QUAL string, FILTER string, INFO string, SFORMAT string, GENOTYPE string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; load data inpath '/user/ahujaa01/arun_vcf/my_genome.vcf' into table arun_vcf;
use vcf_files; select distinct name, geneid, genesymbol from clinvar.variant_summary join arun_vcf on ID = concat('rs', clinvar.variant_summary.rsdbsnp) where clinvar.variant_summary.clinicalsignificance = 'pathogenic';
###BAM File Queries
A primary benefit of using Hadoop is the wealth of tools available already. Example: Hadoop-BAM and Adam
use 1000genomes; select count(*) from hg00096_bam; select count(*) from hg00096_bam where duplicateread == True;
Updated