errors creating custom reference genome

Issue #7 new
Stephane Plaisance created an issue

I need to add four reporter genes to the hg38 reference.
I did the following:
* download the hg38 table from UCSC
* download the hg38 2bit file
* create the corresponding fasta
* add my four sequences to the end of the hg38 fasta (the sequences are saved as pseudo-contigs and are in fact just the ORFs of the reporters)

>mTagBFP2_vect

ATGGTGTCTAAGGGCGAAGAGCTGATTAAGGAGAACATGCACATGAAGCTGTACATGGAGGGCACCGTGGACAACCATCACTTCAAGTGCACATCCGAGGGCGAAGGCAAGCCCTACGAGGGCACCCAGACCATGAGAATCAAGGTGGTCGAGGGCGGCCCTCTCCCCTTCGCCTTCGACATCCTGGCTACTAGCTTCCTCTACGGCAGCAAGACCTTCATCAACCACACCCAGGGCATCCCCGACTTCTTCAAGCAGTCCTTCCCTGAGGGCTTCACATGGGAGAGAGTCACCACATACGAAGACGGGGGCGTGCTGACCGCTACCCAGGACACCAGCCTCCAGGACGGCTGCCTCATCTACAACGTCAAGATCAGAGGGGTGAACTTCACATCCAACGGCCCTGTGATGCAGAAGAAAACACTCGGCTGGGAGGCCTTCACCGAGACGCTGTACCCCGCTGACGGCGGCCTGGAAGGCAGAAACGACATGGCCCTGAAGCTCGTGGGCGGGAGCCATCTGATCGCAAACGCCAAGACCACATATAGATCCAAGAAACCCGCTAAGAACCTCAAGATGCCTGGCGTCTACTATGTGGACTACAGACTGGAAAGAATCAAGGAGGCCAACAACGAAACCTACGTCGAGCAGCACGAGGTGGCAGTGGCCAGATACTGCGACCTCCCTAGCAAACTGGGGCACAAGCTTAATctcgagtag

>mClover3_vect

ATGGTGTCTAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTCCGCGGCGAGGGCGAGGGCGATGCCACCAACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCTTCGGCTACGGCGTGGCCTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTCTTTCAAGGACGACGGTACCTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTTCAACAGCCACTACGTCTATATCACGGCCGACAAGCAGAAGAACTGCATCAAGGCTAACTTCAAGATCCGCCACAACGTTGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCCATCAGTCCAAGCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATTACACATGGCATGGACGAGCTGTACAAGTAA

>mScarlet_vect

ATGGTGAGCAAGGGCGAGGCAGTGATCAAGGAGTTCATGCGGTTCAAGGTGCACATGGAGGGCTCCATGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGTGGCCCCCTGCCCTTCTCCTGGGACATCCTGTCCCCTCAGTTCATGTACGGCTCCAGGGCCTTCACCAAGCACCCCGCCGACATCCCCGACTACTATAAGCAGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGCGCCGTGACCGTGACCCAGGACACCTCCCTGGAGGACGGCACCCTGATCTACAAGGTGAAGCTCCGCGGCACCAACTTCCCTCCTGACGGCCCCGTAATGCAGAAGAAGACAATGGGCTGGGAAGCGTCCACCGAGCGGTTGTACCCCGAGGACGGCGTGCTGAAGGGCGACATTAAGATGGCCCTGCGCCTGAAGGACGGCGGCCGCTACCTGGCGGACTTCAAGACCACCTACAAGGCCAAGAAGCCCGTGCAGATGCCCGGCGCCTACAACGTCGACCGCAAGTTGGACATCACCTCCCACAACGAGGACTACACCGTGGTGGAACAGTACGAACGCTCCGAGGGCCGCCACTCCACCGGCGGCATGGACGAGCTGTACAAGTAA

>eGFP_vect

ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCTGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGCGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA

* create a new .2bit from this named hg38_orfs.2bit
* create a small tab-separated table for the four new sequences using bin numbers greater than the largest present bin

100000  mClover3_ORF    mClover3_vect   +       0       720     0       717     1       0,      720,    0       mClover3        none    none    0,
100001  mTagBFP2_ORF    mTagBFP2_vect   +       0       720     0       717     1       0,      720,    0       mTagBFP2        none    none    0,
100002  mScarlet_ORF    mScarlet_vect   +       0       699     0       696     1       0,      699,    0       mScarlet        none    none    0,
100003  eGFP_ORF        eGFP_vect       +       0       720     0       717     1       0,      720,    0       eGFP    none    none    0,

* add this small table to the bottom of the UCSC table to create a new table hg38_orfs.gene_table

...
586     XM_024452505.1  chr22_KI270734v1_random -       138085  153834  138479  150383  13      138085,138742,142193,143613,144748,145003,146639,147623,148115,148413,150349,150986,152880,    138667,138831,142292,143789,144895,145096,146721,147703,148232,148478,150499,151598,153834,     0       LOC102724788    cmpl    cmpl    1,2,2,0,0,0,2,0,0,1,0,-1,-1,
586     XM_017030169.1  chr22_KI270734v1_random -       138085  153836  138479  152899  12      138085,138742,142193,143613,144748,145003,146639,147623,148115,150349,150986,152880,  138667,138831,142292,143789,144895,145096,146721,147703,148232,150499,151021,153836,     0       LOC102724788    cmpl    cmpl    1,2,2,0,0,0,2,0,0,0,1,0,
586     XM_017030167.1  chr22_KI270734v1_random -       138085  153840  138479  152899  13      138085,138742,142193,143613,144748,145003,146639,147623,148115,150349,150986,151337,152880,    138667,138831,142292,143789,144895,145096,146721,147703,148232,150499,151021,151598,153840,     0       LOC102724788    cmpl    cmpl    1,2,2,0,0,0,2,0,0,0,1,1,0,
586     NM_001368249.2  chr22_KI270734v1_random -       138085  161592  138479  161586  14      138085,138742,142193,143613,144748,145003,146639,147623,148115,148413,150349,150986,156288,161313,     138667,138831,142292,143789,144895,145096,146721,147703,148232,148478,150499,151021,156497,161592,      0       LOC102724788    cmpl    cmpl    1,2,2,0,0,0,2,0,0,1,1,2,0,0,
586     XM_006724936.3  chr22_KI270734v1_random -       138085  161594  138479  150995  13      138085,138742,142193,143613,144748,145003,146639,147623,148115,150349,150986,156288,161313,    138667,138831,142292,143789,144895,145096,146721,147703,148232,150499,151021,156497,161594,     0       LOC102724788    cmpl    cmpl    1,2,2,0,0,0,2,0,0,0,0,-1,-1,
586     XR_951398.2     chr22_KI270734v1_random -       138085  161588  161588  161588  14      138085,138742,142193,143613,144748,145003,146639,147623,148115,148413,150349,150986,156288,161313,     138667,138831,142292,143789,144929,145096,146721,147703,148232,148478,150499,151021,156497,161588,      0       LOC102724788    none    none    -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
100000  mClover3_ORF    mClover3_vect   +       0       720     0       717     1       0,      720,    0       mClover3        none    none    0,
100001  mTagBFP2_ORF    mTagBFP2_vect   +       0       720     0       717     1       0,      720,    0       mTagBFP2        none    none    0,
100002  mScarlet_ORF    mScarlet_vect   +       0       699     0       696     1       0,      699,    0       mScarlet        none    none    0,
100003  eGFP_ORF        eGFP_vect       +       0       720     0       717     1       0,      720,    0       eGFP    none    none    0,

* create a new bowtie index from the edited fasta named hg38_orfs

When I run a chopchop query for a hg38 gene name, it works but when I use the names (either the name from col2 or col13) from my new genes it fails with:

The gene name mClover3 (same for or mClover3_ORF) does not exist in file /data/biodata/chopchop_db/hg38_orfs.gene_table. Please try again.

If I use the contig name is also fails but this is expected as it wants a gene name

If I use --targets eGFP_vect:0-720 it also fails

If I run a direct command: twoBitToFa /path/hg38_orfs.2bit -seq=mClover3_vect stdout it works perfectly!

I think there is something wrong with my gene_table but what?

Can you please comment on where my method is wrong.

This would really deserve a tutorial as I followed the repo instructions but this does not work.

Thanks in advance

Comments (5)

  1. Stephane Plaisance reporter

    After more searching, I discovered that I cannot use chopchop close to the edges of a chromosome/contig.

    For instance:

    chopchop.py --targets chrM:80-200 --genome hg38
    Running twoBitToFa failed
    

    although the sequence exists

    twoBitToFa /path/hg38.2bit:chrM:0-200 stdout
    >chrM:0-200
    GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
    TTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTG
    GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATC
    CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTA
    
    twoBitToFa /path/hg38.2bit:chrM:80-200 stdout
    >chrM:80-200
    GATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTG
    TCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT
    ACAGGCGAACATACTTACTA
    

    By contrast:

    chopchop.py --targets chrM:85-200 --genome hg38
    
    Rank    Target sequence Genomic location        Strand  GC content (%)  Self-complementarity    MM0     MM1     MM2     MM3     Efficiency
    1       GCACCTACGTTCAATATTACAGG chrM:163        +       40      0       0       0       0       0       0.00
    2       TCGCCTGTAATATTGAACGTAGG chrM:166        -       40      0       0       0       0       0       0.00
    3       GCGATAGCATTGCGAGACGCTGG chrM:79 +       60      0       0       0       0       0       0.00
    4       TACTGCGACATAGGGTGCTCCGG chrM:104        -       55      0       0       0       0       0       0.00
    5       AAAGACAGATACTGCGACATAGG chrM:113        -       40      0       0       0       0       0       0.00
    6       GATAAATAATAGGATGAGGCAGG chrM:140        -       35      0       0       0       0       0       0.00
    7       ACGTAGGTGCGATAAATAATAGG chrM:150        -       35      0       0       0       0       0       0.00
    8       GTGCGATAAATAATAGGATGAGG chrM:144        -       35      0       0       0       0       1       1.00
    9       GCATTGCGAGACGCTGGAGCCGG chrM:85 +       65      0       0       0       0       1       0.00
    10      AAGACAGATACTGCGACATAGGG chrM:112        -       40      0       0       0       0       3       0.00
    

    The same is observed close to the end of a chromosome.

    Based on this I tried my reporter query with a start coordinate >85 and it now works

    chopchop.py --targets mClover3_vect:85-200 --genome hg38_orfs
    Rank    Target sequence Genomic location        Strand  GC content (%)  Self-complementarity    MM0     MM1     MM2     MM3     Efficiency
    1       GACCACCTTCGGCTACGGCGTGG mClover3_vect:186       +       70      0       0       0       0       0       1.00
    2       CTCGTGACCACCTTCGGCTACGG mClover3_vect:181       +       60      0       0       0       0       0       0.00
    3       AAGTTCAGCGTCCGCGGCGAGGG mClover3_vect:79        +       65      0       0       0       0       0       0.00
    4       CGTAGCCGAAGGTGGTCACGAGG mClover3_vect:180       -       65      0       0       0       0       1       1.00
    5       GTAGCCGAAGGTGGTCACGAGGG mClover3_vect:179       -       60      0       0       0       0       1       0.00
    6       GCAGGCCACGCCGTAGCCGAAGG mClover3_vect:191       -       75      0       0       0       0       1       0.00
    7       GGCCACGCCGTAGCCGAAGGTGG mClover3_vect:188       -       75      0       0       0       0       2       1.00
    8       GGCCACAAGTTCAGCGTCCGCGG mClover3_vect:73        +       65      0       0       0       0       3       1.00
    9       CAAGTTCAGCGTCCGCGGCGAGG mClover3_vect:78        +       70      0       0       0       0       3       1.00
    10      GCCGAAGGTGGTCACGAGGGTGG mClover3_vect:176       -       70      0       0       0       0       4       1.00
    11      CGCCGCGGACGCTGAACTTGTGG mClover3_vect:75        -       70      0       0       0       0       4       1.00
    12      CATCGCCCTCGCCCTCGCCGCGG mClover3_vect:90        -       80      0       0       0       0       4       1.00
    13      CCCACCCTCGTGACCACCTTCGG mClover3_vect:175       +       65      0       0       0       0       7       0.00
    14      AGCGTCCGCGGCGAGGGCGAGGG mClover3_vect:85        +       80      0       0       0       0       7       0.00
    15      GGCGAGGGCGATGCCACCAACGG mClover3_vect:100       +       70      0       0       1       2       3       0.00
    16      CGGTGGTGCAGATGAACTTCAGG mClover3_vect:132       -       55      0       1       0       0       7       0.00
    17      CAGGGTCAGCTTGCCGTTGGTGG mClover3_vect:113       -       65      0       0       1       0       11      1.00
    18      CTTCAGGGTCAGCTTGCCGTTGG mClover3_vect:116       -       60      0       1       1       3       6       0.00
    19      CAGCGTCCGCGGCGAGGGCGAGG mClover3_vect:84        +       85      0       0       0       1       14      1.00
    20      GGGCACGGGCAGCTTGCCGGTGG mClover3_vect:149       -       80      1       1       0       3       10      1.00
    21      GGTGGTGCAGATGAACTTCAGGG mClover3_vect:131       -       50      0       1       0       1       14      0.00
    22      CCGGCAAGCTGCCCGTGCCCTGG mClover3_vect:152       +       80      1       1       0       1       15      0.00
    23      CTGAAGTTCATCTGCACCACCGG mClover3_vect:133       +       50      0       1       0       1       19      0.00
    24      GTGGTCACGAGGGTGGGCCAGGG mClover3_vect:169       -       70      0       1       1       1       18      0.00
    25      ACGAGGGTGGGCCAGGGCACGGG mClover3_vect:163       -       75      0       1       0       1       21      0.00
    26      GGTGGTCACGAGGGTGGGCCAGG mClover3_vect:170       -       75      0       1       1       1       22      0.00
    27      CACGAGGGTGGGCCAGGGCACGG mClover3_vect:164       -       75      0       1       1       4       39      0.00
    

    Can you please explain this behaviour?

    Should I add 100 N’s at both ends of my sequences

    Thanks

  2. Stephane Plaisance reporter

    Adding 1kb of N’s at both ends finally worked for all four queries, I do not know why more than 23 bases (une guide length) are required at both ends but it seems to be a fact.

  3. Kornel Labun

    I suspect all of your pains come from this option, adjust it to 0.

    parser.add_argument("-p", "--padSize", default=-1, type=int, help="Extra bases searched outside the exon. Defaults to the size of the guide RNA for CRISPR and TALEN + maximum spacer for TALENS.")

  4. Stephane Plaisance reporter

    Thanks Kornel, I indeed seems to work for the chrM example:

    chopchop.py --targets chrM:80-200 --genome hg38 --padSize=0
    

    Rank    Target sequence Genomic location        Strand  GC content (%)  Self-complementarity    MM0     MM1     MM2     MM3     Efficiency

    1       TACTGCGACATAGGGTGCTCCGG chrM:104        -       55      0       0       0       0       0       0.00

    2       AAAGACAGATACTGCGACATAGG chrM:113        -       40      0       0       0       0       0       0.00

    3       GATAAATAATAGGATGAGGCAGG chrM:140        -       35      0       0       0       0       0       0.00

    4       ACGTAGGTGCGATAAATAATAGG chrM:150        -       35      0       0       0       0       0       0.00

    5       GCACCTACGTTCAATATTACAGG chrM:163        +       40      0       0       0       0       0       0.00

    6       TCGCCTGTAATATTGAACGTAGG chrM:166        -       40      0       0       0       0       0       0.00

    7       GTGCGATAAATAATAGGATGAGG chrM:144        -       35      0       0       0       0       1       1.00

    8       GCATTGCGAGACGCTGGAGCCGG chrM:85 +       65      0       0       0       0       1       0.00

    9       AAGACAGATACTGCGACATAGGG chrM:112        -       40      0       0       0       0       3       0.00

    I will create the bowtie index without N’s and try this out.
    Best

  5. Stephane Plaisance reporter

    Hi Kornel,

    I was a bit too fast, running chrM from the default genome also fails when starting query at 0..59 with padSize=0. It starts to work from 60 onwards.

    There is more to it than just the padSize I am afraid and it works with my 1kb N flanks when targeting the same sequences.

    You can try the commands below on your own machine as they use the regular hg38 index and table from your side

    # working when starting at 60 or more (here 80)
    $ chopchop.py --targets chrM:80-200 --genome hg38 --padSize=0
    Rank    Target sequence Genomic location        Strand  GC content (%)  Self-complementarity    MM0     MM1     MM2     MM3     Efficiency
    1       TACTGCGACATAGGGTGCTCCGG chrM:104        -       55      0       0       0       0       0       0.00
    2       AAAGACAGATACTGCGACATAGG chrM:113        -       40      0       0       0       0       0       0.00
    3       GATAAATAATAGGATGAGGCAGG chrM:140        -       35      0       0       0       0       0       0.00
    4       ACGTAGGTGCGATAAATAATAGG chrM:150        -       35      0       0       0       0       0       0.00
    5       GCACCTACGTTCAATATTACAGG chrM:163        +       40      0       0       0       0       0       0.00
    6       TCGCCTGTAATATTGAACGTAGG chrM:166        -       40      0       0       0       0       0       0.00
    7       GTGCGATAAATAATAGGATGAGG chrM:144        -       35      0       0       0       0       1       1.00
    8       GCATTGCGAGACGCTGGAGCCGG chrM:85 +       65      0       0       0       0       1       0.00
    9       AAGACAGATACTGCGACATAGGG chrM:112        -       40      0       0       0       0       3       0.00
    
    # not working despite --padSize=0
    $ chopchop.py --targets chrM:0-200 --genome hg38 --padSize=0
    Running twoBitToFa failed
    

  6. Log in to comment