AliSt - Multi-Sequence Alignment Statistics

AliST (ALIgnment STatistics) is a C++ program that computes basic statistics on multiple sequence alignments.

AliST 0.1.0

Usage: AliSt [arguments] or [params=file.txt]
Documentation can be found at https://bitbucket.org/lorenzogatti89/AliSt/
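
The usage line names two invocation styles: options passed directly as key=value arguments, or collected in a file referenced with params=. The sketch below only builds the two argument lists and does not run the program. The executable name `AliST`, the input path `alignment.fa`, and the one-pair-per-line layout of the parameter file are assumptions made for illustration; the option names themselves are taken from this README.

```python
from pathlib import Path


def to_arguments(options):
    """Turn an {option: value} mapping into key=value tokens."""
    return [f"{key}={value}" for key, value in options.items()]


def write_params_file(options, path):
    """Write one key=value pair per line (assumed file layout) and
    return the single argument that points the program at the file."""
    Path(path).write_text("\n".join(to_arguments(options)) + "\n")
    return [f"params={path}"]


options = {
    "alphabet": "DNA",
    "input.sequence.file": "alignment.fa",  # hypothetical input path
    "input.sequence.sites_to_use": "all",
    "showtables": "true",
}

# Style 1: every option on the command line.
command_inline = ["AliST", *to_arguments(options)]
```

Either list can then be handed to `subprocess.run` once the executable is on the PATH.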

Features

The following statistics are currently implemented:

  • Average pairwise sequence identity (PID), 4 methods
  • Average GAP proportion
  • Average non-GAP (CHAR) proportion
  • Completeness scores (sequence, site, pairwise)
  • Incompleteness scores (sequence, site, pairwise)
  • InDel length distribution
  • Percent sequence identity
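
To make the first few statistics concrete, here is a minimal Python sketch on a toy alignment. It is not AliST's implementation: the README does not spell out its four PID methods, so the sketch assumes one common definition (identical non-gap positions divided by the alignment length) and assumes "-" is the gap character.

```python
from itertools import combinations

GAP = "-"  # assumed gap character


def gap_proportion(alignment):
    """Fraction of all alignment cells that are gaps."""
    total = sum(len(seq) for seq in alignment)
    gaps = sum(seq.count(GAP) for seq in alignment)
    return gaps / total


def pairwise_identity(a, b):
    """Identical non-gap positions divided by the alignment length
    (one possible PID definition, chosen here as an assumption)."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != GAP)
    return matches / len(a)


def average_pid(alignment):
    """Mean pairwise identity over all unordered sequence pairs."""
    pairs = list(combinations(alignment, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


toy = ["ACGT-A", "ACG--A", "A-GTTA"]
print(gap_proportion(toy), 1 - gap_proportion(toy), average_pid(toy))
```

The CHAR proportion is simply the complement of the gap proportion, which is why the two averages in the example output further down sum to 1.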

Dependencies

  • bpp-core 2.4.0
  • bpp-seq 2.4.0
  • glog
  • Boost C++ libraries

Compiling

From the repository root (an in-source build is assumed here):

cmake -DCMAKE_BUILD_TYPE=Release -G "CodeBlocks - Unix Makefiles" .
cmake --build . --target AliST -- -j 2

Download precompiled binary

Download Precompiled Binary (Linux)


Options

alphabet={DNA|RNA|Protein|Codon(letter={DNA|RNA},type={Standard|EchinodermMitochondrial|InvertebrateMitochondrial|VertebrateMitochondrial})}
                                                        The alphabet to use when reading sequences. The DNA and RNA alphabets can also take
                                                        an argument: bangAsgap={bool}
                                                        Tells whether the exclamation mark should be treated as a gap character. By default
                                                        it is treated as an unknown character such as 'N' or '?'.
genetic_code={translation table}                        Where 'translation table' specifies the code to use, either as a text description
                                                        or as the NCBI number. The following table gives the names of the currently
                                                        implemented codes with their NCBI numbers:
                                                        Standard                    1
                                                        VertebrateMitochondrial     2
                                                        YeastMitochondrial          3
                                                        MoldMitochondrial           4
                                                        InvertebrateMitochondrial   5
                                                        EchinodermMitochondrial     9
                                                        AscidianMitochondrial       13
                                                        The states of the alphabets are in alphabetical order.


input.sequence.file={path}                              The sequence file to use. (The sequences do not need to be aligned.)
input.sequence.format={format}                          The sequence file format.
input.sequence.sites_to_use={all|nogap|complete}        Tells which sites to use.
input.sequence.remove_stop_codons={boolean}             Removes the sites where there is a stop codon (default: 'yes').
input.sequence.max_gap_allowed=100%                     Specifies the maximum proportion of gaps allowed per site.
input.sequence.max_unresolved_allowed=100%              Specifies the maximum proportion of unresolved states per site.
input.site.selection={list of integers}                 Only considers sites in the given list of positions, in extended format:
                                                        positions separated with ",", and "i:j" for all positions between i and j,
                                                        inclusive.
input.site.selection = {Sample(n={integer} [, replace={true}])}
                                                        Considers {n} random sites, optionally sampled with replacement.
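
The two forms of input.site.selection can be illustrated with a short Python sketch. It mirrors the description above (comma-separated positions, "i:j" ranges with both ends included, and random sampling with or without replacement) but is not the parser AliST uses. The function names are hypothetical, and whether positions are counted from 0 or 1 is not stated in this README; the sampling helper assumes 1-based positions.

```python
import random


def parse_site_selection(spec):
    """Expand a list such as "1,5,10:13" into explicit positions.
    A token "i:j" stands for every position from i to j, inclusive."""
    positions = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if ":" in token:
            start, end = (int(part) for part in token.split(":"))
            positions.extend(range(start, end + 1))
        else:
            positions.append(int(token))
    return positions


def sample_sites(n_sites, n, replace=False, seed=None):
    """Draw n random positions out of 1..n_sites (1-based is an assumption)."""
    rng = random.Random(seed)
    population = range(1, n_sites + 1)
    if replace:
        return rng.choices(population, k=n)  # duplicates possible
    return rng.sample(population, n)         # all positions distinct


print(parse_site_selection("1,5,10:13"))
```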


showtables={true|false}                                 Shows the tables (InDel distribution and pairwise identity matrix) on the terminal.
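
As an illustration of what a per-site threshold such as input.sequence.max_gap_allowed does, the following Python sketch drops every alignment column whose gap fraction exceeds a limit. It is a simplified stand-in, not the filtering code AliST uses: the threshold is passed as a fraction rather than a percentage string, "-" is assumed to be the gap character, and treating the limit as inclusive is also an assumption.

```python
def filter_sites_by_gap(alignment, max_gap_allowed=1.0, gap="-"):
    """Keep only the columns whose gap fraction is at most max_gap_allowed.

    alignment is a list of equal-length strings; the default of 1.0
    (i.e. 100%) keeps every column."""
    n_sequences = len(alignment)
    keep = [
        index
        for index, column in enumerate(zip(*alignment))
        if column.count(gap) / n_sequences <= max_gap_allowed
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]


toy = ["A-C-", "A-CT", "AGC-"]
print(filter_sites_by_gap(toy, max_gap_allowed=0.5))
```

The same pattern would apply to max_unresolved_allowed by counting unresolved characters instead of gaps.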

The following formats are currently supported:

Fasta(extended={bool}, strictNames={bool})              The Fasta format. The argument extended (default: 'no') enables the HUPO-PSI
                                                        extension of the format. The argument strictNames (default: 'no') specifies that
                                                        only the first word in the Fasta header is used as the sequence name, the rest of the
                                                        header being considered as comments.
Mase(siteSelection={chars})                             The Mase format (as read by Seaview and Phylo_win for instance), with an optional site
                                                        selection name.
Phylip(order={interleaved|sequential}, type={classic|extended}, split={spaces|tab})
                                                        The Phylip format, with several variations. The argument order distinguishes between
                                                        sequential and interleaved format, while the argument type distinguishes between the
                                                        plain old Phylip format and the more recent extension allowing for sequence names
                                                        longer than 10 characters, as understood by PAML and PhyML. Finally, the split
                                                        argument specifies the type of character that separates the sequence name from the
                                                        sequence content. The conventional option is to use one (classic) or more (extended)
                                                        spaces, but tabs can also be used instead.
Clustal(extraSpaces={int})                              The Clustal format.
                                                        In its basic set up, sequence names do not have space characters, and one space splits
                                                        the sequence content from its name. The parser can however be configured to allow
                                                        for spaces in the sequence names, providing a minimum number of space characters is
                                                        used to split the content from the name. With extraSpaces set to 5, for instance, the
                                                        sequences are expected to be at least 6 spaces away from their names.
Dcse()                                                  The DCSE alignment format. The secondary structure annotation will be ignored.
Nexus()                                                 The Nexus alignment format. (Only very basic support is provided)
GenBank()                                               The GenBank format for unaligned sequences.
                                                        Very basic support: only retrieves the sequence content for now, all features are
                                                        ignored.
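
The extraSpaces rule of the Clustal reader can be sketched in a few lines of Python. This is an illustration of the rule as described above, not the actual parser: the function name is hypothetical, and it assumes that a run of at least extraSpaces + 1 spaces marks the boundary between name and sequence, so that shorter runs may appear inside the name.

```python
import re


def split_clustal_line(line, extra_spaces=0):
    """Split one alignment line into (name, sequence).

    The name ends at the first run of at least extra_spaces + 1 spaces;
    shorter runs are treated as part of the name."""
    min_spaces = extra_spaces + 1
    pattern = r"^(.*?\S) {%d,}(\S.*)$" % min_spaces
    match = re.match(pattern, line.rstrip())
    if match is None:
        raise ValueError("no name/sequence separator found: %r" % line)
    return match.group(1), match.group(2)


# Default: a single space separates the name from the sequence.
print(split_clustal_line("seq1 ACGT--ACGT"))

# extraSpaces=5: the name may contain spaces, the separator needs 6 or more.
print(split_clustal_line("Homo sapiens" + " " * 6 + "ACGT--ACGT", extra_spaces=5))
```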

Example

alphabet=DNA input.sequence.file=../tests/datasets/aligned_simulated_big_nt.fa input.sequence.sites_to_use=all method.pid=1 showtables=true

The following execution output was generated with the above arguments:

------------------------------------------------------------------------------
AliST 0.1.0
Multi-Sequence Alignment Statistics
Authors: Lorenzo Gatti
Build on commit: refs/heads/master 398b5c15c8145824583aae7b7818da9899799175
On date: 19 Jul 2018, 13:16:25
------------------------------------------------------------------------------
Execution started on:..................: dhcp-wlan-uzh-10-12-146-200.uzh.ch
WARNING!!! Parameter alignment not specified. Default used instead: 0
Aligned sequences......................: yes
Sequence file .........................: ../tests/datasets/aligned_simulated_big_nt.fa
Sequence format .......................: FASTA file
Sites to use...........................: all
PID method.............................: 1

Number of sequences....................: 8
Number of sites........................: 327
Number of pairs........................: 28

Average pairwise Identity (PID)........: 0.45096112
Average GAP proportion.................: 0.42775229
Average CHAR proportion................: 0.57224771

Completeness (C) score align. (Ca).....: 0.001911315
Max C-score for sequences (Cr_max).....: 0.62079511
Min C-score for sequences (Cr_min).....: 0.5382263
Max C-score for sites (Cc_max).........: 1
Min C-score for sites (Cc_min).........: 0.125
Max C-score pairwise (Cij_max).........: 1
Min C-score pairwise (Cij_min).........: 0.55351682
Max I-score pairwise (Iij_max).........: 0.44648318
Min I-score pairwise (Iij_min).........: 0

C-Scores (Cr) for individual sequences.: ../tests/datasets/aligned_simulated_big_nt.cscores_seqs_cr.csv
C-Scores (Cc) for sites................: ../tests/datasets/aligned_simulated_big_nt.cscores_sites_cc.csv
C-Scores (Cij) pairwise................: ../tests/datasets/aligned_simulated_big_nt.cscores_pairwise_cij.csv
I-Scores (Iij) pairwise................: ../tests/datasets/aligned_simulated_big_nt.iscores_pairwise_Iij.csv
C/I-Scores (C/Iij) pairwise............: ../tests/datasets/aligned_simulated_big_nt.table_pairwise_Cij_Iij.csv
Pairwise Identity Matrix (PID).........: ../tests/datasets/aligned_simulated_big_nt.table_pairwise_identities.csv
InDel distribution ....................: ../tests/datasets/aligned_simulated_big_nt.indel_distribution.csv
Summary statistics ....................: ../tests/datasets/aligned_simulated_big_nt.stats.csv


---------- Pairwise Identity Matrix ---------
             F         G         H         A         B         C         D         E
F          1.0   0.69113   0.69113   0.36391   0.37003   0.37309   0.39755    0.5107
G      0.69113       1.0   0.68807   0.31498    0.3211   0.33945   0.36086   0.47095
H      0.69113   0.68807       1.0   0.31193   0.31498   0.32416   0.33639   0.41896
A      0.36391   0.31498   0.31193       1.0   0.81651   0.64526   0.44954   0.39144
B      0.37003    0.3211   0.31498   0.81651       1.0   0.62691   0.43731   0.39755
C      0.37309   0.33945   0.32416   0.64526   0.62691       1.0   0.44954   0.41284
D      0.39755   0.36086   0.33639   0.44954   0.43731   0.44954       1.0   0.40061
E       0.5107   0.47095   0.41896   0.39144   0.39755   0.41284   0.40061       1.0

---------- Indel distribution ---------
class          [1]       [2]       [3]       [4]       [5]       [6]       [7]       [8]       [9]      [10]      [11]      [12]      [13]      [14]      [17]
counts           8         8         8         8         8         8         6         3         4         2         1         1         3         1         1
prop       0.11429   0.11429   0.11429   0.11429   0.11429   0.11429  0.085714  0.042857  0.057143  0.028571  0.014286  0.014286  0.042857  0.014286  0.014286

Total execution time: 0.000000d, 0.000000h, 0.000000m, 0.000000s.
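
Several numbers in the output above can be cross-checked against each other. The Python sketch below copies the values from the report and verifies a few relations that hold in this run: the number of pairs is the number of unordered sequence pairs, the GAP and CHAR proportions sum to 1, the maximum pairwise I-score equals 1 minus the minimum pairwise C-score, the InDel proportions are the counts divided by their total, and the mean of the off-diagonal identities matches the reported average PID up to the rounding of the printed matrix. That AliST defines the quantities this way in general is an inference from this one output, not something the README states.

```python
from math import comb, isclose

n_sequences = 8
n_pairs = comb(n_sequences, 2)  # unordered pairs of sequences

avg_gap, avg_char = 0.42775229, 0.57224771
cij_min, iij_max = 0.55351682, 0.44648318

# InDel distribution: one count per length class, in the order printed.
indel_counts = [8, 8, 8, 8, 8, 8, 6, 3, 4, 2, 1, 1, 3, 1, 1]
indel_props = [count / sum(indel_counts) for count in indel_counts]

# Upper triangle of the pairwise identity matrix (rows F, G, H, A, B, C, D).
upper_triangle = [
    [0.69113, 0.69113, 0.36391, 0.37003, 0.37309, 0.39755, 0.5107],
    [0.68807, 0.31498, 0.3211, 0.33945, 0.36086, 0.47095],
    [0.31193, 0.31498, 0.32416, 0.33639, 0.41896],
    [0.81651, 0.64526, 0.44954, 0.39144],
    [0.62691, 0.43731, 0.39755],
    [0.44954, 0.41284],
    [0.40061],
]
pairwise = [value for row in upper_triangle for value in row]
mean_pid = sum(pairwise) / len(pairwise)

print(n_pairs, len(pairwise))
print(isclose(avg_gap + avg_char, 1.0, abs_tol=1e-8))
print(isclose(1 - cij_min, iij_max, abs_tol=1e-8))
print(round(mean_pid, 5), [round(p, 5) for p in indel_props[:3]])
```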