A few questions and bug reports

Former user Account Deleted


2013-05-28T09:24:40+00:00

Thomas Kluyver

Hi Lior,

Thanks, it's really useful to get some feedback from people trying to use this without me sitting beside them.

Space vs. underscore: It does expect spaces, and there's currently no way to change that. We ran into a similar use case here, with names from a phylogeny that came with underscores, but I thought it was simpler for the user to replace the underscores in a spreadsheet or a script than to add another option to Taxonome. I can revisit that if it's going to be a common request.
var/subsp: when it parses the name, it will handle these as a distinct part, but the fuzzy matching currently treats the whole name as one string, including the subspecific rank. This doesn't give great results for longer names, and I'm thinking of ways to improve it (e.g. by doing fuzzy matching on each part separately).
Memory: I've only tested it with synonymies of ~100k rows. It will use quite a bit of memory to build all the indices, although I'd expect it to cope with more than 1M rows. If you're not already running it on Python 3.3, that should be up to 4x more memory-efficient, so it's definitely worth an upgrade. I think I built the latest Win & Mac binary packages with Python 3.3. Is the synonymy something you can send me to test with?
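
As a rough illustration of the first two points, here is a minimal sketch in plain Python. It is not Taxonome's code: the helper names (`normalise`, `split_parts`, `part_similarity`) and the list of rank markers are assumptions made for the example, and `difflib` stands in for whatever fuzzy matcher is really used. It replaces underscores with spaces before matching, then scores each part of the name separately rather than the whole string.

```python
from difflib import SequenceMatcher

# Hypothetical list of rank markers to ignore when comparing names.
RANK_MARKERS = {"subsp.", "var.", "f."}

def normalise(name):
    """Replace underscores with spaces and collapse repeated whitespace."""
    return " ".join(name.replace("_", " ").split())

def split_parts(name):
    """Split a name into its parts, dropping rank markers such as 'var.'."""
    return [p for p in normalise(name).split() if p.lower() not in RANK_MARKERS]

def part_similarity(a, b):
    """Score two names by the weakest of their part-by-part similarity ratios."""
    parts_a, parts_b = split_parts(a), split_parts(b)
    if not parts_a or len(parts_a) != len(parts_b):
        return 0.0  # different number of parts: treat as no match
    return min(
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(parts_a, parts_b)
    )

print(normalise("Acacia_nilotica_subsp._indica"))
print(part_similarity("Acacia nilotica", "Acacia nilotika"))
```

Taking the minimum over the parts means one badly mismatched epithet cannot be hidden by an exactly matching genus, which is the weakness of scoring a long name as a single string.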

I'm also thinking of a way to store a synonymy dataset in an SQLite database, to get round the limits of what can fit in memory. Lookups would be slightly slower, but SQLite has pretty impressive performance. And it would save time if you're frequently starting Taxonome and loading the same dataset. That would also offer a way to parallelise matching. Drat, now I want to drop what I'm doing and start coding straight away. ;-)
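
To make the idea concrete, a synonymy could be kept in an indexed SQLite table along these lines. This is only a sketch using Python's standard `sqlite3` module: the table and column names and the `build_store` and `lookup` helpers are assumptions for illustration, not the format Taxonome uses, and the two rows are just example data.

```python
import sqlite3

def build_store(path, rows):
    """Create a names table from (name, accepted_name) pairs and index it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS names "
        "(name TEXT NOT NULL, accepted_name TEXT NOT NULL)"
    )
    # The index lets exact lookups avoid scanning the whole table.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_names_name ON names (name)")
    conn.executemany("INSERT INTO names VALUES (?, ?)", rows)
    conn.commit()
    return conn

def lookup(conn, name):
    """Return the accepted names recorded for an exact name match."""
    cur = conn.execute(
        "SELECT accepted_name FROM names WHERE name = ?", (name,)
    )
    return [row[0] for row in cur]

# ":memory:" keeps the example self-contained; a file path would persist
# the data so it need not be reloaded on every run.
conn = build_store(":memory:", [
    ("Acacia nilotica", "Vachellia nilotica"),
    ("Vachellia nilotica", "Vachellia nilotica"),
])
print(lookup(conn, "Acacia nilotica"))
```

With a file path instead of `":memory:"`, the table only has to be built once, which is where the saving on start-up time would come from.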

Matching in the API: I need to update the docs on this, but there's some coverage here: http://taxonome.bitbucket.org/api/taxa.html#matching-and-combining-datasets

Basically, use match_taxa if you want to get a Taxonome dataset (in memory) at the end, and run_match_taxa if you don't. For CSV output, you pass in a list of trackers: CSVTracker is the log output, CSVListMatches is what the GUI calls 'Name mappings', and CSVTaxaTracker is what the GUI calls 'Taxa data with new names'. There are also a couple of different counters.

Best wishes, Thomas

2013-05-28T12:31:17+00:00

Former user Account Deleted

Thanks, this was very useful, and I managed to run the process through the API.
Storing the data in SQLite (or another SQL database) would be a very welcome addition, as I already use SQLite to store my synonymy, and have to extract CSVs each time I want to match names.
Thank you again for your detailed and helpful reply.
Lior

2013-05-30T04:45:03+00:00

Thomas Kluyver

Cool. I started work on an sqlite store last night. Though of course it'll be a different database format to your own synonymy, so you'll still need some kind of script to translate it.

2013-05-30T09:38:31+00:00

Thomas Kluyver

There's now a prototype SqliteTaxonDB checked in. Using my legumes dataset (25k taxa, 50k names), I get about 2 ms to select a name from it with an exact match. Fuzzy matching, which isn't working yet, is bound to be somewhat slower. The plain in-memory TaxonSet now gets an exact match in about 8 µs, down from 30 µs before.

If you decide to try it out, be aware that the data format might change before release.

2013-06-06T13:22:24+00:00

Former user Account Deleted

Sounds great, but we need the fuzzy matching, so whenever this option is available, we'd be happy to try it.

2013-06-09T11:06:21+00:00

Thomas Kluyver

I've just added support for fuzzy matching from an SqliteTaxonDB (this is all in the collectionapi2 branch). I haven't tested its performance yet.
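
For readers wondering what fuzzy matching against a database can look like, here is one possible approach, sketched with the standard library only. It is not how SqliteTaxonDB works: the schema, the `fuzzy_lookup` helper, and the choice to pre-filter candidates by an exactly spelled genus are all simplifying assumptions for the example.

```python
import sqlite3
from difflib import get_close_matches

# Hypothetical schema: each name is stored alongside its genus.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (genus TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_names_genus ON names (genus)")
conn.executemany("INSERT INTO names VALUES (?, ?)", [
    ("Acacia", "Acacia nilotica"),
    ("Acacia", "Acacia senegal"),
    ("Vicia", "Vicia faba"),
])

def fuzzy_lookup(conn, query, cutoff=0.8):
    """Fetch names in the query's genus, then rank them by string similarity."""
    words = query.split()
    if not words:
        return []
    # Assumption: the genus is spelled correctly, so SQL can narrow the
    # candidates and only a small list is compared in Python.
    rows = conn.execute("SELECT name FROM names WHERE genus = ?", (words[0],))
    candidates = [row[0] for row in rows]
    return get_close_matches(query, candidates, n=3, cutoff=cutoff)

print(fuzzy_lookup(conn, "Acacia nilotika"))
```

The trade-off is that a typo in the genus itself returns nothing, so a real implementation would need a looser first-pass filter.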

2013-06-16T20:36:31+00:00
