Building phylogeny regression test data

The method below does not work well. Instead, use the utility script that builds a test data set from randomly selected clusters produced by the all-by-all homologize step of the full regression test for the Agalma paper.

We start by selecting the first 60 transcripts from each assembly:

# Write the first 60 transcripts of each annotated Trinity assembly to <sample>.fa
for A in HWI-ST625-54-C026EACXX-6-ATCACG HWI-ST625-54-C026EACXX-6-CAGATC HWI-ST625-54-C026EACXX-6-CTTGTA HWI-ST625-54-C026EACXX-6-GCCAAT HWI-ST625-54-C026EACXX-6-TTAGGC;
do
  fastacrop 0-60 /gpfs/data/cdunn/analyses/rsem/"$A"/3?/assembly_trinity_*.annotated.fa >"$A.fa"
done
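
If fastacrop is not at hand, the same selection can be approximated with awk. This is only a sketch: first_n_records is a hypothetical helper, not part of Agalma, and it assumes that "0-60" means the first 60 FASTA records, as described above.

```shell
# Hypothetical helper: print the first N records of one or more FASTA files.
# Usage: first_n_records N file.fa [more.fa ...]
first_n_records() (
  n="$1"
  shift
  # Count header lines; stop once record N+1 starts, and skip any text
  # that appears before the first header.
  awk -v n="$n" '
    /^>/ { count++ }
    count > n { exit }
    count >= 1 { print }
  ' "$@"
)
```

For example, "first_n_records 60 assembly.fa >subset.fa" would write the first 60 records of a (hypothetical) assembly.fa to subset.fa. Records are counted across all input files, so pass one assembly at a time.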

Then we BLAST these against the JGI Nematostella data set and keep every gene that has a hit:

# Build a nucleotide BLAST database from the JGI gene models
makeblastdb -dbtype nucl -in Nemve_No_ribosomes_FilteredModels_NO_TABS.fas -out JGI_NEMVEC -title JGI_NEMVEC
# Collect the titles of all hits; ids.txt is appended to, so delete any stale copy before rerunning
for A in HWI-ST625-54-C026EACXX-6-ATCACG HWI-ST625-54-C026EACXX-6-CAGATC HWI-ST625-54-C026EACXX-6-CTTGTA HWI-ST625-54-C026EACXX-6-GCCAAT HWI-ST625-54-C026EACXX-6-TTAGGC;
do
  tblastx -query "$A.fa" -outfmt "6 stitle" -db JGI_NEMVEC -evalue 1e-20 >>ids.txt
done
# Write the gene models that had a hit to JGI_NEMVEC.fa
exclude -k -x ids.txt -i Nemve_No_ribosomes_FilteredModels_NO_TABS.fas -o JGI_NEMVEC.fa
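
The final filtering step can also be sketched in awk for cases where the exclude utility is unavailable. This is an illustration under assumptions, not the behavior of exclude itself: keep_listed_records is a hypothetical helper, and it assumes each line of ids.txt matches a FASTA header exactly (everything after the ">"). Duplicate lines in ids.txt are harmless.

```shell
# Hypothetical helper: keep only the FASTA records whose full header
# (without the leading ">") appears as a line in the ID file.
# Usage: keep_listed_records ids.txt input.fa
keep_listed_records() {
  awk '
    NR == FNR { wanted[$0] = 1; next }         # first file: load the IDs
    /^>/ { keep = (substr($0, 2) in wanted) }  # header: decide for this record
    keep { print }                             # print header and sequence lines
  ' "$1" "$2"
}
```

For example, "keep_listed_records ids.txt Nemve_No_ribosomes_FilteredModels_NO_TABS.fas >JGI_NEMVEC.fa" would produce the subset, provided the titles reported by tblastx match the headers character for character.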
