ConvertDb genbank should test the number of sequences in input and output match

Issue #162 new
ssnn created an issue

Add a check to ConvertDb genbank to verify the number of sequences in the output matches the number of sequences in the input. From a warning in docs/examples/genbank.rst:

It is recommended that the number of sequences in the output sqn file be verified against the number of sequences in the input tab or tsv file. From the command line, this can be achieved via grep -c iupacna *.sqn. This step is not necessary if running tbl2asn outside ConvertDb.
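
As a rough illustration of what such a check might look like, here is a minimal Python sketch. It is not ConvertDb's actual code: the function names (count_tab_records, count_sqn_sequences, check_counts) are hypothetical, and it assumes the input file has a single header row and that each sequence in the sqn file contributes exactly one line containing "iupacna", which is the same assumption the grep command above relies on.

```python
import csv


def count_tab_records(tab_path):
    """Count data rows in a tab-delimited file, excluding the header row."""
    with open(tab_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for row in reader if row)  # ignore blank lines


def count_sqn_sequences(sqn_path, marker="iupacna"):
    """Count lines containing the marker, mirroring `grep -c iupacna`."""
    with open(sqn_path, errors="replace") as handle:
        return sum(1 for line in handle if marker in line)


def check_counts(tab_path, sqn_path):
    """Return (input count, output count, whether they match)."""
    n_in = count_tab_records(tab_path)
    n_out = count_sqn_sequences(sqn_path)
    return n_in, n_out, n_in == n_out
```

A caller could emit a warning whenever the third element of the returned tuple is False, rather than failing outright, since the mismatch may come from tbl2asn filtering rather than from an error.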

Comments (3)

  1. Jason Vander Heiden

    Wait… Unless this is referring specifically to the tbl2asn filters. Yes? (I.e., not what passes/fails ConvertDb-genbank, but what passes/fails tbl2asn from the ConvertDb-genbank output.)

    If so, we should be able to check that. Not sure if there is a stdout/stderr message from tbl2asn we can capture, but we can search through the .gbf file for a record count (same sort of thing as @Julian Zhou is doing with grepping the .sqn files).

  2. Julian Zhou

    My impression is that the console log, with the --asn flag on, only tells how many sequences passed the first step (ConvertDb generating fsa and tbl files), but not how many sequences came out of tbl2asn. I say this because in my most recent submission, ConvertDb's console log indicated that all sequences passed, but eventually I got a few to a dozen sequences missing (in a seemingly random fashion) for most of the final sqn files. I only noticed this because the GenBank curator noticed that my files from different re-submissions contained different numbers of sequences. This problem went away when I stopped using the --asn flag and ran tbl2asn outside ConvertDb. We are not sure if this is a platform-specific issue or what. @Hailong Meng is helping to test this on Windows/Linux [issue #163] (I had this problem on Mac).
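
For the .gbf search suggested in the first comment, one possible approach is to count record headers in the GenBank flat file, since each record begins with a line starting with LOCUS. This is only a sketch under that assumption: the function name count_gbf_records is hypothetical, and it does not show how ConvertDb would locate the .gbf file or invoke tbl2asn.

```python
def count_gbf_records(gbf_path):
    """Count GenBank flat file records by their LOCUS header lines."""
    with open(gbf_path, errors="replace") as handle:
        # Only lines that begin with LOCUS count; indented mentions do not.
        return sum(1 for line in handle if line.startswith("LOCUS"))
```

The resulting count could then be compared against the number of sequences ConvertDb reports as passed, to detect records dropped during the tbl2asn step.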
