errors creating custom reference genome
I need to add four reporter genes to the hg38 reference.
I did the following:
* download the hg38 table from UCSC
* download the hg38 2bit file
* create the corresponding fasta
* add my four sequences to the end of the hg38 fasta (the sequences are saved as pseudo-contigs and are in fact just the ORFs of the reporters)
>mTagBFP2_vect
ATGGTGTCTAAGGGCGAAGAGCTGATTAAGGAGAACATGCACATGAAGCTGTACATGGAGGGCACCGTGGACAACCATCACTTCAAGTGCACATCCGAGGGCGAAGGCAAGCCCTACGAGGGCACCCAGACCATGAGAATCAAGGTGGTCGAGGGCGGCCCTCTCCCCTTCGCCTTCGACATCCTGGCTACTAGCTTCCTCTACGGCAGCAAGACCTTCATCAACCACACCCAGGGCATCCCCGACTTCTTCAAGCAGTCCTTCCCTGAGGGCTTCACATGGGAGAGAGTCACCACATACGAAGACGGGGGCGTGCTGACCGCTACCCAGGACACCAGCCTCCAGGACGGCTGCCTCATCTACAACGTCAAGATCAGAGGGGTGAACTTCACATCCAACGGCCCTGTGATGCAGAAGAAAACACTCGGCTGGGAGGCCTTCACCGAGACGCTGTACCCCGCTGACGGCGGCCTGGAAGGCAGAAACGACATGGCCCTGAAGCTCGTGGGCGGGAGCCATCTGATCGCAAACGCCAAGACCACATATAGATCCAAGAAACCCGCTAAGAACCTCAAGATGCCTGGCGTCTACTATGTGGACTACAGACTGGAAAGAATCAAGGAGGCCAACAACGAAACCTACGTCGAGCAGCACGAGGTGGCAGTGGCCAGATACTGCGACCTCCCTAGCAAACTGGGGCACAAGCTTAATctcgagtag
>mClover3_vect
ATGGTGTCTAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTCCGCGGCGAGGGCGAGGGCGATGCCACCAACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCTTCGGCTACGGCGTGGCCTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTCTTTCAAGGACGACGGTACCTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTTCAACAGCCACTACGTCTATATCACGGCCGACAAGCAGAAGAACTGCATCAAGGCTAACTTCAAGATCCGCCACAACGTTGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCCATCAGTCCAAGCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATTACACATGGCATGGACGAGCTGTACAAGTAA
>mScarlet_vect
ATGGTGAGCAAGGGCGAGGCAGTGATCAAGGAGTTCATGCGGTTCAAGGTGCACATGGAGGGCTCCATGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGTGGCCCCCTGCCCTTCTCCTGGGACATCCTGTCCCCTCAGTTCATGTACGGCTCCAGGGCCTTCACCAAGCACCCCGCCGACATCCCCGACTACTATAAGCAGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGCGCCGTGACCGTGACCCAGGACACCTCCCTGGAGGACGGCACCCTGATCTACAAGGTGAAGCTCCGCGGCACCAACTTCCCTCCTGACGGCCCCGTAATGCAGAAGAAGACAATGGGCTGGGAAGCGTCCACCGAGCGGTTGTACCCCGAGGACGGCGTGCTGAAGGGCGACATTAAGATGGCCCTGCGCCTGAAGGACGGCGGCCGCTACCTGGCGGACTTCAAGACCACCTACAAGGCCAAGAAGCCCGTGCAGATGCCCGGCGCCTACAACGTCGACCGCAAGTTGGACATCACCTCCCACAACGAGGACTACACCGTGGTGGAACAGTACGAACGCTCCGAGGGCCGCCACTCCACCGGCGGCATGGACGAGCTGTACAAGTAA
>eGFP_vect
ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCTGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGCGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA
* create a new .2bit from this named hg38_orfs.2bit
* create a small tab-separated table for the four new sequences using bin numbers greater than the largest present bin
100000 mClover3_ORF mClover3_vect + 0 720 0 717 1 0, 720, 0 mClover3 none none 0,
100001 mTagBFP2_ORF mTagBFP2_vect + 0 720 0 717 1 0, 720, 0 mTagBFP2 none none 0,
100002 mScarlet_ORF mScarlet_vect + 0 699 0 696 1 0, 699, 0 mScarlet none none 0,
100003 eGFP_ORF eGFP_vect + 0 720 0 717 1 0, 720, 0 eGFP none none 0,
* add this small table to the bottom of the UCSC table to create a new table hg38_orfs.gene_table
...
586 XM_024452505.1 chr22_KI270734v1_random - 138085 153834 138479 150383 13 138085,138742,142193,143613,144748,145003,146639,147623,148115,148413,150349,150986,152880, 138667,138831,142292,143789,144895,145096,146721,147703,148232,148478,150499,151598,153834, 0 LOC102724788 cmpl cmpl 1,2,2,0,0,0,2,0,0,1,0,-1,-1,
586 XM_017030169.1 chr22_KI270734v1_random - 138085 153836 138479 152899 12 138085,138742,142193,143613,144748,145003,146639,147623,148115,150349,150986,152880, 138667,138831,142292,143789,144895,145096,146721,147703,148232,150499,151021,153836, 0 LOC102724788 cmpl cmpl 1,2,2,0,0,0,2,0,0,0,1,0,
586 XM_017030167.1 chr22_KI270734v1_random - 138085 153840 138479 152899 13 138085,138742,142193,143613,144748,145003,146639,147623,148115,150349,150986,151337,152880, 138667,138831,142292,143789,144895,145096,146721,147703,148232,150499,151021,151598,153840, 0 LOC102724788 cmpl cmpl 1,2,2,0,0,0,2,0,0,0,1,1,0,
586 NM_001368249.2 chr22_KI270734v1_random - 138085 161592 138479 161586 14 138085,138742,142193,143613,144748,145003,146639,147623,148115,148413,150349,150986,156288,161313, 138667,138831,142292,143789,144895,145096,146721,147703,148232,148478,150499,151021,156497,161592, 0 LOC102724788 cmpl cmpl 1,2,2,0,0,0,2,0,0,1,1,2,0,0,
586 XM_006724936.3 chr22_KI270734v1_random - 138085 161594 138479 150995 13 138085,138742,142193,143613,144748,145003,146639,147623,148115,150349,150986,156288,161313, 138667,138831,142292,143789,144895,145096,146721,147703,148232,150499,151021,156497,161594, 0 LOC102724788 cmpl cmpl 1,2,2,0,0,0,2,0,0,0,0,-1,-1,
586 XR_951398.2 chr22_KI270734v1_random - 138085 161588 161588 161588 14 138085,138742,142193,143613,144748,145003,146639,147623,148115,148413,150349,150986,156288,161313, 138667,138831,142292,143789,144929,145096,146721,147703,148232,148478,150499,151021,156497,161588, 0 LOC102724788 none none -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
100000 mClover3_ORF mClover3_vect + 0 720 0 717 1 0, 720, 0 mClover3 none none 0,
100001 mTagBFP2_ORF mTagBFP2_vect + 0 720 0 717 1 0, 720, 0 mTagBFP2 none none 0,
100002 mScarlet_ORF mScarlet_vect + 0 699 0 696 1 0, 699, 0 mScarlet none none 0,
100003 eGFP_ORF eGFP_vect + 0 720 0 717 1 0, 720, 0 eGFP none none 0,
* create a new bowtie index from the edited fasta named hg38_orfs
When I run a chopchop query for a hg38 gene name, it works but when I use the names (either the name from col2 or col13) from my new genes it fails with:
The gene name mClover3 (same for or mClover3_ORF) does not exist in file /data/biodata/chopchop_db/hg38_orfs.gene_table. Please try again.
If I use the contig name is also fails but this is expected as it wants a gene name
If I use --targets eGFP_vect
:0-720 it also fails
If I run a direct command: twoBitToFa /path/hg38_orfs.2bit -seq=mClover3_vect stdout it works perfectly!
I think there is something wrong with my gene_table but what?
Can you please comment on where my method is wrong.
This would really deserve a tutorial as I followed the repo instructions but this does not work.
Thanks in advance
Comments (5)
-
reporter -
reporter Adding 1kb of N’s at both ends finally worked for all four queries, I do not know why more than 23 bases (une guide length) are required at both ends but it seems to be a fact.
-
I suspect all of your pains come from this option, adjust it to 0.
parser.add_argument("-p", "--padSize", default=-1, type=int, help="Extra bases searched outside the exon. Defaults to the size of the guide RNA for CRISPR and TALEN + maximum spacer for TALENS.")
-
reporter Thanks Kornel, I indeed seems to work for the chrM example:
chopchop.py --targets chrM:80-200 --genome hg38 --padSize=0
Rank Target sequence Genomic location Strand GC content (%) Self-complementarity MM0 MM1 MM2 MM3 Efficiency
1 TACTGCGACATAGGGTGCTCCGG chrM:104 - 55 0 0 0 0 0 0.00
2 AAAGACAGATACTGCGACATAGG chrM:113 - 40 0 0 0 0 0 0.00
3 GATAAATAATAGGATGAGGCAGG chrM:140 - 35 0 0 0 0 0 0.00
4 ACGTAGGTGCGATAAATAATAGG chrM:150 - 35 0 0 0 0 0 0.00
5 GCACCTACGTTCAATATTACAGG chrM:163 + 40 0 0 0 0 0 0.00
6 TCGCCTGTAATATTGAACGTAGG chrM:166 - 40 0 0 0 0 0 0.00
7 GTGCGATAAATAATAGGATGAGG chrM:144 - 35 0 0 0 0 1 1.00
8 GCATTGCGAGACGCTGGAGCCGG chrM:85 + 65 0 0 0 0 1 0.00
9 AAGACAGATACTGCGACATAGGG chrM:112 - 40 0 0 0 0 3 0.00
I will create the bowtie index without N’s and try this out.
Best -
reporter Hi Kornel,
I was a bit too fast, running chrM from the default genome also fails when starting query at 0..59 with padSize=0. It starts to work from 60 onwards.
There is more to it than just the padSize I am afraid and it works with my 1kb N flanks when targeting the same sequences.
You can try the commands below on your own machine as they use the regular hg38 index and table from your side
# working when starting at 60 or more (here 80) $ chopchop.py --targets chrM:80-200 --genome hg38 --padSize=0 Rank Target sequence Genomic location Strand GC content (%) Self-complementarity MM0 MM1 MM2 MM3 Efficiency 1 TACTGCGACATAGGGTGCTCCGG chrM:104 - 55 0 0 0 0 0 0.00 2 AAAGACAGATACTGCGACATAGG chrM:113 - 40 0 0 0 0 0 0.00 3 GATAAATAATAGGATGAGGCAGG chrM:140 - 35 0 0 0 0 0 0.00 4 ACGTAGGTGCGATAAATAATAGG chrM:150 - 35 0 0 0 0 0 0.00 5 GCACCTACGTTCAATATTACAGG chrM:163 + 40 0 0 0 0 0 0.00 6 TCGCCTGTAATATTGAACGTAGG chrM:166 - 40 0 0 0 0 0 0.00 7 GTGCGATAAATAATAGGATGAGG chrM:144 - 35 0 0 0 0 1 1.00 8 GCATTGCGAGACGCTGGAGCCGG chrM:85 + 65 0 0 0 0 1 0.00 9 AAGACAGATACTGCGACATAGGG chrM:112 - 40 0 0 0 0 3 0.00 # not working despite --padSize=0 $ chopchop.py --targets chrM:0-200 --genome hg38 --padSize=0 Running twoBitToFa failed
- Log in to comment
After more searching, I discovered that I cannot use chopchop close to the edges of a chromosome/contig.
For instance:
although the sequence exists
By contrast:
The same is observed close to the end of a chromosome.
Based on this I tried my reporter query with a start coordinate >85 and it now works
Can you please explain this behaviour?
Should I add 100 N’s at both ends of my sequences
Thanks