Tethya wilhelma genome project
Data for working paper: Francis, WR., M. Eitel, S. Vargas, M. Adamski, SHD. Haddock, S. Krebs, H. Blum, D. Erpenbeck, G. Wörheide (2017) The Genome Of The Contractile Demosponge Tethya wilhelma And The Evolution Of Metazoan Neural Signalling Pathways. DOI: 10.1101/120998
Mills, DB., WR. Francis, S. Vargas, M. Larsen, CPH. Elemans, DE. Canfield, G. Wörheide. 2018. The Last Common Ancestor of Animals Lacked the HIF Pathway and Respired in Low-Oxygen Environments. eLife 7: e31176. DOI: 10.7554/eLife.31176
These data are provided prior to journal publication under a CC BY-NC-SA license; please cite the above references.
Tethya wilhelma is a ball-shaped demosponge that undergoes cyclic contractions and is an emerging laboratory model for many topics, including multicellularity, early-animal evolution, biomineralization, and microbial interactions. The original description by Sara et al. from 2001 can be found here. Technical details and raw data for the Tethya wilhelma sequencing project can be found at the NCBI BioProject PRJNA288690.
For general questions about the genome project or T. wilhelma as a model organism, including requests for (live) specimens, or data usage, please contact: firstname.lastname@example.org
For technical questions about the assembly and/or annotation please contact: email@example.com
This project was seed-funded by the LMUexcellent program (Project MODELSPONGE) to G.W. and D.E. through the German Excellence Initiative, and also benefitted from funding by VILLUM FONDEN (Grant 9278 "Early evolution of multicellular sponges") to G.W.
The full v1 assembly is 125 Mb, made up of 5936 scaffolds with an N50 of 73 kb. Two alphaproteobacteria are associated with this sponge, but most or all bacterial scaffolds have been removed.
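For readers who want to re-check the scaffold count and N50 from the FASTA file, the following is a minimal sketch in Python. It is not the script used by the project; the function names `n50` and `fasta_lengths` are illustrative, and the toy lengths are made up rather than taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def fasta_lengths(lines):
    """Yield the length of each record from an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


if __name__ == "__main__":
    # Toy lengths for illustration only; not T. wilhelma scaffolds.
    toy = [80, 70, 50, 40, 30, 20, 10]
    print(len(toy), "sequences, N50 =", n50(toy))
```

With the assembly file, something like `n50(list(fasta_lengths(open("twilhelma_scaffolds_v1.fasta"))))` should give a value comparable to the N50 reported above, assuming that file is the v1 assembly.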
Strand-specific RNAseq reads were assembled de novo with Trinity (v2014) using the options normalize and trimmomatic, giving 127012 transcripts including splice variants.
GFF of Trinity transcripts was produced by GMAP, using option
Genome-guided transcripts: raw reads were mapped to the genomic scaffolds with Tophat2 and genes were predicted with StringTie, giving 46398 transcripts including splice variants.
~/tophat-2.0.13.Linux_x86_64/tophat2 -p 4 -o tethya_rnaseq_ss --library-type fr-firststrand tethya-0_1 ../rnaseq_reads/Tethya_RNA-Seq_Fastq1_TAGCTT_lane2.fastq.f ../rnaseq_reads/Tethya_RNA-Seq_Fastq2_TAGCTT_lane2.fastq.f
~/stringtie-1.0.2.Linux_x86_64/stringtie tethya_rnaseq_ss/accepted_hits.bam -o tethya_rnaseq_ss_stringtie.gtf -l twi_ss
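A quick way to check the transcript count in the resulting GTF is to count rows whose feature column is "transcript". The sketch below assumes the usual StringTie GTF layout (nine tab-separated columns, with gene_id and transcript_id in the attribute column); the function name and the example IDs are illustrative only.

```python
def count_gtf_features(lines):
    """Return (number of transcripts, number of distinct gene IDs) in GTF lines."""
    transcripts = 0
    genes = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # exon rows and malformed rows are ignored
        transcripts += 1
        # Attributes look like: gene_id "twi_ss.1"; transcript_id "twi_ss.1.1";
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("gene_id "):
                genes.add(attr.split(" ", 1)[1].strip('"'))
    return transcripts, len(genes)


if __name__ == "__main__":
    # Hypothetical rows for illustration; coordinates and IDs are made up.
    example = [
        "# example header\n",
        'scaffold_1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\tgene_id "twi_ss.1"; transcript_id "twi_ss.1.1";\n',
        'scaffold_1\tStringTie\texon\t100\t900\t1000\t+\t.\tgene_id "twi_ss.1"; transcript_id "twi_ss.1.1";\n',
        'scaffold_1\tStringTie\ttranscript\t100\t700\t1000\t+\t.\tgene_id "twi_ss.1"; transcript_id "twi_ss.1.2";\n',
    ]
    print(count_gtf_features(example))
```

Splice variants share a gene_id, so the second number gives the gene count while the first corresponds to the "transcripts including splice variants" figure.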
Transcripts from the StringTie GTF were generated using the script cufflinks_gtf_genome_to_cdna_fasta.pl provided with TransDecoder.
Ab initio gene predictions, 37633 transcripts and proteins with up to 2 splice variants. AUGUSTUS was run as:
~/augustus-3.0.3/bin/augustus --species=Tethya_wilhelma --strand=both --genemodel=atleastone --codingseq=on --protein=on --cds=on --sample=100 --keep_viterbi=true --alternatives-from-sampling=true --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=2 --gff3=on --exonnames=on twilhelma_scaffolds_v1.fasta > tethya-v1_augustus_max2.gff
Because the AUGUSTUS GFF format is non-standard, the output has been reformatted to better conform to the Sequence Ontology GFF3 specifications. The format was changed to remove comments (including the embedded protein sequences) and intron types, add exon types, and change transcript types to mRNA. Proteins and CDS were taken from the GFF using extract_features.py, exon format was generated with
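The reformatting described above can be sketched roughly as follows. This is not the script used for the released files: the function name `clean_augustus_gff` is hypothetical, the example rows are made up, and only three of the listed changes are shown (dropping comments, dropping intron rows, and renaming transcript rows to mRNA). Adding exon rows is left out because the source does not say how that was done.

```python
def clean_augustus_gff(lines):
    """Return cleaned GFF rows from AUGUSTUS output lines.

    - comment lines (including embedded protein sequences) are dropped
    - intron features are dropped
    - 'transcript' features are renamed to 'mRNA'
    """
    cleaned = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a feature row
        if fields[2] == "intron":
            continue
        if fields[2] == "transcript":
            fields[2] = "mRNA"
        cleaned.append("\t".join(fields))
    return cleaned


if __name__ == "__main__":
    # Hypothetical AUGUSTUS-style rows; IDs and coordinates are made up.
    example = [
        "# start gene g1\n",
        "scaffold_1\tAUGUSTUS\tgene\t1\t900\t0.9\t+\t.\tID=g1\n",
        "scaffold_1\tAUGUSTUS\ttranscript\t1\t900\t0.9\t+\t.\tID=g1.t1;Parent=g1\n",
        "scaffold_1\tAUGUSTUS\tintron\t301\t500\t1\t+\t.\tParent=g1.t1\n",
        "scaffold_1\tAUGUSTUS\tCDS\t1\t300\t1\t+\t0\tID=g1.t1.cds;Parent=g1.t1\n",
        "# protein sequence = [MSTK]\n",
    ]
    for row in clean_augustus_gff(example):
        print(row)
```

A complete GFF3 file would also need a `##gff-version 3` header line, which this sketch does not write.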
AUGUSTUS training parameters for T. wilhelma can be found here.
Genes that were better represented by Trinity (due to false breaks or fusions) were replaced. All AUGUSTUS transcripts that covered a region with no mapped RNAseq were kept. Some manual changes were made as well.
PFAM domain annotation
PFAM domain matches were found with hmmscan and mapped onto the StringTie/TransDecoder proteins with pfam2gff.py:
hmmscan --cpu 4 --domtblout twilhelma_stringtie.pfam.tab ~/PfamScan/data/Pfam-A.hmm twilhelma_stringtie_transdecoder_proteins.fasta > twilhelma_stringtie.pfam.log
pfam2gff.py -g twilhelma_stringtie_split_transdecoder.gff -i twilhelma_stringtie.pfam.tab -T > twilhelma_stringtie_split_transdecoder_pfam_domains.gff
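The domain table written by --domtblout is whitespace-delimited with 22 fixed columns followed by a free-text description. As a rough illustration of how such a table can be read (this is not how pfam2gff.py works internally, which is not described here), the sketch below pulls out the domain name, the protein ID, the independent E-value, and the alignment coordinates on the protein. The `max_evalue` cutoff and the example protein ID are assumptions for illustration.

```python
from collections import namedtuple

DomainHit = namedtuple(
    "DomainHit", "domain accession protein i_evalue score ali_from ali_to"
)


def parse_domtblout(lines, max_evalue=1e-3):
    """Return DomainHit records from hmmscan --domtblout lines.

    max_evalue is an illustrative cutoff on the independent E-value,
    not a value taken from the project.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # 22 fixed columns, then the description (which may contain spaces).
        cols = line.split(None, 22)
        if len(cols) < 22:
            continue
        hit = DomainHit(
            domain=cols[0],            # target (Pfam family) name
            accession=cols[1],         # Pfam accession
            protein=cols[3],           # query protein ID
            i_evalue=float(cols[12]),  # independent E-value of this domain
            score=float(cols[13]),     # bit score of this domain
            ali_from=int(cols[17]),    # alignment start on the protein
            ali_to=int(cols[18]),      # alignment end on the protein
        )
        if hit.i_evalue <= max_evalue:
            hits.append(hit)
    return hits


if __name__ == "__main__":
    # Hypothetical row for illustration; the protein ID and values are made up.
    example = [
        "# target name  accession  tlen  query name ...\n",
        "7tm_1 PF00001.20 268 twi_ss.1.1|m.1 - 350 1e-50 170.0 0.1 1 1 "
        "1e-53 2e-50 169.5 0.1 1 268 40 300 38 302 0.95 "
        "7 transmembrane receptor (rhodopsin family)\n",
    ]
    for hit in parse_domtblout(example):
        print(hit)
```

The alignment coordinates are in protein (amino acid) positions; placing them on the genome requires the CDS coordinates from the GFF, which is the step pfam2gff.py is used for above.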
BLASTX to Aque2 proteins
StringTie transcripts aligned to Aque-v2.1 proteins
blastx -query twilhelma_stringtie_transcripts.fasta -db Aqu2.1_Isoforms_proteins.fasta -max_target_seqs 5 -evalue 1e-6 -outfmt 6 -num_threads 8 > twilhelma_stringtie_blastx_v_aqu2.tab
blast2genomegff.py -b twilhelma_stringtie_blastx_v_aqu2.tab -g twilhelma_stringtie_split.gtf -d Aqu2.1_Isoforms_proteins.fasta > twilhelma_stringtie_blastx-v-amphimedon.gff
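The tabular output from -outfmt 6 has twelve tab-separated columns by default (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). As a small illustration of working with that table, the sketch below keeps the highest-scoring hit per transcript. It is independent of blast2genomegff.py; the function name `best_hits` and the example IDs are hypothetical.

```python
def best_hits(lines, max_evalue=1e-6):
    """Return {query: (subject, evalue, bitscore)} keeping the top bit score.

    The default max_evalue mirrors the -evalue 1e-6 used in the blastx
    command above.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 12:
            continue  # not a standard 12-column row
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Hypothetical rows for illustration; IDs and values are made up.
    example = [
        "twi_ss.1.1\tAqu2.1.00001_001\t55.0\t200\t90\t0\t1\t600\t10\t209\t1e-60\t210.0\n",
        "twi_ss.1.1\tAqu2.1.00002_001\t40.0\t150\t90\t2\t1\t450\t5\t154\t1e-20\t95.0\n",
    ]
    for query, hit in best_hits(example).items():
        print(query, hit)
```

Because the search was run with -max_target_seqs 5, each transcript can have several rows in the table; this sketch collapses them to one per transcript.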
Raw sequence data can be found at the NCBI SRA: