Wiki

Clone wiki

demeter-course / Impala

#Impala

impala-shell

##Creating a Database and Table

create database mydb;
use mydb;

create table mytable
    (
        id int,
        description string
    );

##Loading Data

insert into mytable (id, description) values (1, 'This is text'), (2, 'This is text');

OR

load data inpath '/path/to/files' into mytables;

##Specifying Data Storage

In the command above we were able to create a table, but we also have to state something about how the underlying data is stored. This is done with the ROW FORMAT command.

create table mytable
    (
        id int,
        description string
    )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

We specified that the column delimiter was a tab and the underlying files were text. There are alternatives as well.

---EMR Fact Table

CREATE TABLE `FACT` (
  `FACT_KEY` bigint ,
  `PERSON_KEY` int ,
  `ACCOUNTING_GROUP_KEY` int ,
  `ENCOUNTER_KEY` int ,
  `CAREGIVER_GROUP_KEY` int ,
  `FACILITY_KEY` int ,
  `PROCEDURE_GROUP_KEY` int ,
  `DIAGNOSIS_GROUP_KEY` int ,
  `MATERIAL_GROUP_KEY` int ,
  `ORGANIZATION_GROUP_KEY` int ,
  `PAYOR_GROUP_KEY` int ,
  `OPERATION_KEY` bigint ,
  `KEYWORD_GROUP_KEY` int ,
  `CALENDAR_KEY` int ,
  `TIME_OF_DAY_KEY` int ,
  `META_DATA_KEY` int ,
  `UOM_KEY` int ,
  `DATA_FEED_KEY` int ,
  `DATA_QUALITY_KEY` int ,
  `DATA_STATE_KEY` int ,
  `AUDIT_KEY` bigint ,
  `VALUE` string ,
  `INSERT_DATE_TIME` string ,
  `UPDATE_DATE_TIME` string
)ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
---BAM File
create table if not exists bam_test
(
        referenceName string,
        referenceId int,
        fstart bigint,
        fend bigint,
        mapq int,
        readName string,
        sequence string,
        mateReference string,
        mateAlignmentStart bigint,
        cigar string,
        qual string,
        recordGroupId string,
        readPaired boolean,
        properPair boolean,
        readMapped boolean,
        mateMapped boolean,
        readNegativeStrand boolean,
        mateNegativeStrand boolean,
        firstOfPair boolean,
        secondOfPair boolean,
        primaryAlignment boolean,
        failedVendorQualityChecks boolean,
        duplicateRead boolean,
        mismatchingPositions string,
        attributes string
)
STORED AS PARQUETFILE;

##External Vs Internal Tables

Traditionally, you can load data into a table with the follwing command:

load data inpath '/path/to/files' into table;

This command will move the files from /path/to/files to an a folder managed by Hive. Important note is that Hive will need write permissions on the folder to move the files. Alternatively, you can create an external table.

create external table mytable (...);

This will leave the files where they are and manage them here. Unlike internal tables, dropping the table will not delete data.

##Querying with Impala

###EMR Queries

SELECT d_person.*, d_encounter.*
FROM fact JOIN
d_person on(d_person.person_key=fact.person_key)
JOIN d_encounter ON (d_encounter.encounter_key  = fact.encounter_key)
LIMIT 10;

###VCF Queries

create database if not exists vcf_files;

use vcf_files;
drop table if exists arun_vcf;
create table if not exists arun_vcf
(CHROM string,
POS string,
ID string,
REF string,
ALT string,
QUAL string,
FILTER string,
INFO string,
SFORMAT  string,
GENOTYPE string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

load data inpath '/user/ahujaa01/arun_vcf/my_genome.vcf' into table arun_vcf;
use vcf_files;

select distinct name, geneid, genesymbol 
from clinvar.variant_summary 
join arun_vcf on ID = concat('rs', clinvar.variant_summary.rsdbsnp) 
where clinvar.variant_summary.clinicalsignificance = 'pathogenic';

###BAM File Queries

A primary benefit of using Hadoop is the wealth of tools available already. Example: Hadoop-BAM and Adam

use 1000genomes;
select count(*) from hg00096_bam;
select count(*) from hg00096_bam where duplicateread == True;

Updated