
Tethya wilhelma genome project

Data for working paper: Francis, WR., M. Eitel, S. Vargas, M. Adamski, SHD. Haddock, S. Krebs, H. Blum, D. Erpenbeck, G. Wörheide (2017) The Genome Of The Contractile Demosponge Tethya wilhelma And The Evolution Of Metazoan Neural Signalling Pathways. DOI: 10.1101/120998

Mills, DB., WR. Francis, S. Vargas, M. Larsen, CPH. Elemans, DE. Canfield, G. Wörheide. 2018. The Last Common Ancestor of Animals Lacked the HIF Pathway and Respired in Low-Oxygen Environments. eLife 7: e31176. DOI: 10.7554/eLife.31176

These data are provided prior to journal publication under a CC BY-NC-SA license; please cite the above references.

Tethya wilhelma is a ball-shaped demosponge that undergoes cyclic contractions and is an emerging laboratory model for many topics, including multicellularity, early-animal evolution, biomineralization, and microbial interactions. The original description by Sara et al. from 2001 can be found here. Technical details and raw data for the Tethya wilhelma sequencing project can be found at the NCBI BioProject PRJNA288690.

For general questions about the genome project or T. wilhelma as a model organism, including requests for (live) specimens, or data usage, please contact: woerheide@lmu.de

For technical questions about the assembly and/or annotation please contact: wrf@lrz.uni-muenchen.de

This project was seed-funded by the LMUexcellent program (Project MODELSPONGE) to G.W. and D.E. through the German Excellence Initiative, and also benefitted from funding by VILLUM FONDEN (Grant 9278 "Early evolution of multicellular sponges") to G.W.

Genome assembly

The full v1 assembly is 125 Mb, made up of 5936 scaffolds with an N50 of 73 kb. Two alphaproteobacteria are associated with this sponge, but most, if not all, bacterial scaffolds have been removed.
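For readers who want to check summary statistics such as the scaffold count or N50 against a downloaded FASTA file, the following is a minimal sketch in Python. It is not part of the project's pipeline; the function names n50 and fasta_lengths are hypothetical helpers introduced here for illustration, and the file name in the usage comment is taken from the AUGUSTUS command further down.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def fasta_lengths(path):
    """Yield the sequence length of each record in a FASTA file."""
    length = None  # None until the first header line is seen
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line.strip())
    if length is not None:
        yield length


# Usage (assumes the scaffold FASTA is in the working directory):
#   lengths = list(fasta_lengths("twilhelma_scaffolds_v1.fasta"))
#   print(len(lengths), sum(lengths), n50(lengths))
```

Sorting in descending order and accumulating until half of the total is reached is the usual definition of N50; other tools may differ slightly in how they treat ties or gaps.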

Gene sets

Trinity transcriptome

Strand-specific RNAseq reads were assembled de novo with Trinity (v2014) using the normalize and trimmomatic options, yielding 127012 transcripts including splice variants.

The GFF of Trinity transcripts was produced with GMAP, using the option -f 2.

StringTie transcriptome

Genome-guided transcripts: raw reads were mapped to the genomic scaffolds with TopHat2 and genes were predicted with StringTie, yielding 46398 transcripts including splice variants.

~/tophat-2.0.13.Linux_x86_64/tophat2 -p 4 -o tethya_rnaseq_ss --library-type fr-firststrand tethya-0_1 ../rnaseq_reads/Tethya_RNA-Seq_Fastq1_TAGCTT_lane2.fastq.f ../rnaseq_reads/Tethya_RNA-Seq_Fastq2_TAGCTT_lane2.fastq.f
~/stringtie-1.0.2.Linux_x86_64/stringtie tethya_rnaseq_ss/accepted_hits.bam -o tethya_rnaseq_ss_stringtie.gtf -l twi_ss

Transcripts from the StringTie GTF were generated using the script cufflinks_gtf_genome_to_cdna_fasta.pl provided with TransDecoder.
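As a quick sanity check on a StringTie GTF, the number of distinct genes and transcripts can be counted from the gene_id and transcript_id attributes. The sketch below assumes the standard GTF attribute syntax (key "value"; pairs in column 9); count_gtf_ids is a hypothetical helper, not a script from this repository.

```python
import re

# GTF attributes look like: gene_id "twi_ss.1"; transcript_id "twi_ss.1.1";
GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_gtf_ids(lines):
    """Return (number of distinct gene IDs, number of distinct transcript IDs)."""
    genes, transcripts = set(), set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF feature line
        gene = GENE_ID.search(fields[8])
        transcript = TRANSCRIPT_ID.search(fields[8])
        if gene:
            genes.add(gene.group(1))
        if transcript:
            transcripts.add(transcript.group(1))
    return len(genes), len(transcripts)


# Usage (file name taken from the StringTie command above):
#   with open("tethya_rnaseq_ss_stringtie.gtf") as handle:
#       print(count_gtf_ids(handle))
```

Counting distinct IDs rather than "transcript" feature lines keeps the count correct even if a GTF lacks explicit transcript rows.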

AUGUSTUS models

Ab initio gene predictions, 37633 transcripts and proteins with up to 2 splice variants. AUGUSTUS was run as:

~/augustus-3.0.3/bin/augustus --species=Tethya_wilhelma --strand=both --genemodel=atleastone --codingseq=on --protein=on --cds=on --sample=100 --keep_viterbi=true --alternatives-from-sampling=true --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=2 --gff3=on --exonnames=on twilhelma_scaffolds_v1.fasta > tethya-v1_augustus_max2.gff

Because the AUGUSTUS GFF format is non-standard, the output was reformatted to better conform to the Sequence Ontology GFF3 specification. Comments (including the embedded protein sequences) and intron features were removed, exon features were added, and transcript features were changed to mRNA. Proteins and CDS were extracted from the GFF with extract_features.py, and the exon features were generated with reformatgff.py.
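Part of the reformatting described above can be illustrated with a short Python sketch. This is not the code of reformatgff.py or extract_features.py; it only shows the simpler steps (dropping comment lines, dropping intron features, renaming transcript to mRNA) and deliberately does not attempt to add exon features. The function name clean_augustus_gff is hypothetical.

```python
def clean_augustus_gff(lines):
    """Yield simplified GFF lines from AUGUSTUS output.

    Steps shown here (a subset of the reformatting described in the text):
      - drop comment lines, which include the embedded protein sequences
      - drop intron features
      - rename transcript features to mRNA
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip anything that is not a 9-column feature line
        if fields[2] == "intron":
            continue
        if fields[2] == "transcript":
            fields[2] = "mRNA"
        yield "\t".join(fields)


# Usage (input file name taken from the AUGUSTUS command above;
# the output name is a placeholder):
#   with open("tethya-v1_augustus_max2.gff") as src, open("cleaned.gff", "w") as out:
#       for feature in clean_augustus_gff(src):
#           out.write(feature + "\n")
```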

AUGUSTUS training parameters for T. wilhelma can be found here.

Filtered set

Genes that were better represented by Trinity (due to false breaks or fusions) were replaced. All AUGUSTUS transcripts that covered a region with no mapped RNAseq were kept. Some manual changes were made as well.
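The idea of keeping ab initio models that cover regions without transcript evidence can be sketched as a simple interval test. This is an illustration only and not the procedure used to build the filtered set, which also involved manual changes; the function names and the tuple layouts for models and evidence are assumptions made here.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals (1-based, inclusive, as in GFF) overlap."""
    return a_start <= b_end and b_start <= a_end


def unsupported_models(models, evidence):
    """Return names of gene models that overlap no evidence interval.

    models:   iterable of (name, scaffold, start, end)
    evidence: iterable of (scaffold, start, end), e.g. mapped transcripts
    """
    by_scaffold = {}
    for scaffold, start, end in evidence:
        by_scaffold.setdefault(scaffold, []).append((start, end))
    kept = []
    for name, scaffold, start, end in models:
        spans = by_scaffold.get(scaffold, [])
        if not any(overlaps(start, end, s, e) for s, e in spans):
            kept.append(name)
    return kept


# Small in-memory example with made-up coordinates:
models = [("g1.t1", "scaffold_1", 100, 500), ("g2.t1", "scaffold_1", 2000, 2600)]
evidence = [("scaffold_1", 450, 900)]
print(unsupported_models(models, evidence))
```

The linear scan per model is fine for a sketch; for a whole genome an interval tree or a sorted sweep would scale better.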

Annotation tracks

PFAM domain annotation

PFAM domain matches to the StringTie/TransDecoder proteins were found with hmmscan and converted to GFF with pfam2gff.py:

hmmscan --cpu 4 --domtblout twilhelma_stringtie.pfam.tab ~/PfamScan/data/Pfam-A.hmm twilhelma_stringtie_transdecoder_proteins.fasta > twilhelma_stringtie.pfam.log
pfam2gff.py -g twilhelma_stringtie_split_transdecoder.gff -i twilhelma_stringtie.pfam.tab -T > twilhelma_stringtie_split_transdecoder_pfam_domains.gff
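The --domtblout table written by hmmscan is whitespace-delimited, with 22 fixed columns followed by a free-text description. As a rough illustration of how such a table can be read (this is not the code of pfam2gff.py), the sketch below extracts per-domain hits and filters them by independent E-value. The DomainHit tuple, the function name, and the default cutoff of 1e-3 are assumptions chosen for the example.

```python
from collections import namedtuple

DomainHit = namedtuple(
    "DomainHit", "domain accession protein i_evalue score ali_from ali_to"
)


def parse_domtblout(lines, max_i_evalue=1e-3):
    """Yield DomainHit records from hmmscan --domtblout lines.

    Columns used (0-based): 0 target name, 1 target accession, 3 query name,
    12 independent E-value, 13 domain score, 17 and 18 alignment start and
    end on the query protein. The cutoff is an arbitrary example value.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.split(None, 22)  # keep the description as one field
        if len(cols) < 22:
            continue
        hit = DomainHit(
            domain=cols[0],
            accession=cols[1],
            protein=cols[3],
            i_evalue=float(cols[12]),
            score=float(cols[13]),
            ali_from=int(cols[17]),
            ali_to=int(cols[18]),
        )
        if hit.i_evalue <= max_i_evalue:
            yield hit


# Usage (file name taken from the hmmscan command above):
#   with open("twilhelma_stringtie.pfam.tab") as handle:
#       for hit in parse_domtblout(handle):
#           print(hit.protein, hit.domain, hit.ali_from, hit.ali_to)
```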

BLASTX to Aque2 proteins

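StringTie transcripts were aligned with blastx to the Amphimedon queenslandica Aqu2.1 proteins, and the hits were converted to GFF with blast2genomegff.py: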

blastx -query twilhelma_stringtie_transcripts.fasta -db Aqu2.1_Isoforms_proteins.fasta -max_target_seqs 5 -evalue 1e-6 -outfmt 6 -num_threads 8 > twilhelma_stringtie_blastx_v_aqu2.tab
blast2genomegff.py -b twilhelma_stringtie_blastx_v_aqu2.tab -g twilhelma_stringtie_split.gtf -d Aqu2.1_Isoforms_proteins.fasta > twilhelma_stringtie_blastx-v-amphimedon.gff
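The tabular output requested with -outfmt 6 has twelve tab-separated columns by default (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). As a sketch of how the table can be summarized before conversion to GFF, the function below keeps the highest-scoring subject for each query. It is not part of blast2genomegff.py, and the name best_hits is hypothetical.

```python
def best_hits(lines):
    """Return {query: (subject, evalue, bitscore)} for the top hit per query.

    Expects default BLAST -outfmt 6 columns; the hit with the highest
    bitscore is kept, and the first one seen wins in case of a tie.
    """
    best = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a complete 12-column record
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Usage (file name taken from the blastx command above):
#   with open("twilhelma_stringtie_blastx_v_aqu2.tab") as handle:
#       hits = best_hits(handle)
#   print(len(hits), "transcripts with at least one hit")
```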

Raw data

Raw sequence data can be found at the NCBI SRA: