UGAHash - An algorithm for a distributed accession system.
As the number of biological databases increases, maintaining consistency of the information they contain (e.g. genomic positions, transcript types, and annotations) becomes increasingly complex. This problem is exacerbated by next-generation sequencing (NGS) techniques (e.g. RNA-seq) that do not require prior knowledge for analysis. For example, many guided assemblers can assign accession numbers to transcripts from a reference annotation, while annotating sequences that are absent from the reference as "novel transcripts". Because information exchange among databases is poor, a sequence reported as novel against one reference may already be annotated in another. Furthermore, relationships to nearby or overlapping annotated transcripts become even more complicated when different genome assemblies are used (e.g. hg18, hg19, and hg38 for human). These combined factors make the comparison and identification of newly discovered transcripts a difficult task. To highlight these problems, we surveyed the currently available genomic assemblies and annotations across various databases. To remedy them, we introduce a new algorithm called "UGAHash" as an alternative to "Do-It-Yourself" accession systems, intended to serve as the basis for a decentralized accession system. We also created a web tool to encourage use of UGAHash; it lets researchers generate accessions for newly discovered transcripts and explore annotations and hashes deposited in various past and present public databases. Finally, we speculate about future applications of accession systems based on cryptographic hash algorithms in bioinformatics.
In this repository you will find a Python implementation of the UGAHash algorithm and the code used to generate the figures in the UGAHash paper, in the folders UGAHash_algorithm and analysis_scripts, respectively.
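The code in UGAHash_algorithm is the authoritative implementation. The sketch below is not that algorithm. It only illustrates the general idea of deriving an accession from content with a cryptographic hash, so that independent groups obtain the same identifier without a central registry. The function name sketch_accession, the "DEMO" prefix, the choice of SHA-256, the 16-character truncation, and the normalization rule are all assumptions made for illustration.

```python
import hashlib


def sketch_accession(sequence: str, prefix: str = "DEMO", length: int = 16) -> str:
    """Derive a deterministic identifier from a transcript sequence.

    Hypothetical illustration only; this is NOT the UGAHash algorithm.
    """
    # Assumed normalization: drop all whitespace and ignore letter case,
    # so formatting differences do not change the identifier.
    normalized = "".join(sequence.split()).upper()
    # Hash the normalized sequence; the same input always gives the same digest.
    digest = hashlib.sha256(normalized.encode("ascii")).hexdigest()
    # Truncate the hex digest for readability (assumed length; shorter
    # identifiers carry a higher collision risk).
    return f"{prefix}_{digest[:length]}"


if __name__ == "__main__":
    # Two differently formatted copies of the same sequence share an identifier.
    print(sketch_accession("ACGTACGTTAGC"))
    print(sketch_accession("acgt acgt\ntagc"))
```

Because the identifier depends only on the normalized sequence, two databases that process the same transcript independently would assign it the same accession in this sketch, which is the property a decentralized accession system relies on.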
Weirick, T., John, D., and Uchida, S.: Resolving the problem of multiple accessions of the same transcript deposited across various public databases. Brief Bioinform. [Accepted].