A few questions and bug reports

Issue #11 new
Former user created an issue

Hello
I have some questions regarding the name matching process:
- I noticed that matching results are a bit different when words are separated by underscores rather than spaces, e.g. Iris acutiloba var bimaculata vs. Iris_acutiloba_var_ bimaculata. Why does this happen, and is there a way to set the separating character (either in the GUI or the API)?
- About matching subspecies and variants, does the program refer specifically to words such as 'var.' or 'subsp.', or does it simply match it like any other part of the name ?
I'm asking because I sometimes get some weird matches. For example, the following two matches have very similar scores, yet one is clearly correct and the other is problematic:
1. Iris tenax subsp. gormanii -> Iris tenax var. gormanii (score 0.73)
2. Iris potaninii var. ionantha -> Iris potaninii var. arenaria (score 0.714)
Do you think there's a way to improve the matching in this case ?
- I'm having trouble when trying to load very large synonymy datasets (over 1M rows). I get out of memory errors both in GUI (Windows machine) and API (Linux machine).
- When using the Python API, how can I do "Match taxa by name", as in the GUI? I saw in the API demonstration that there is a function called 'select', but I'd like a non-interactive process that automatically resolves all names in an input CSV file. I also tried the 'resolve_fuzzy_name' function, but it doesn't yield the same results as the GUI.
Thanks a lot,
Lior

Comments (7)

  1. Thomas Kluyver

    Hi Lior,

    Thanks, it's really useful to get some feedback from people trying to use this without me sitting beside them.

    • Space vs. underscore: It does expect spaces, and there's not currently any way to change that. We ran into a similar use case here, with names from a phylogeny that came with underscores in, but I thought it was simpler for the user to replace the underscores in a spreadsheet or a script than to put another option into Taxonome. I can revisit that if it's going to be a common desire.

    • var/subsp: when it parses the name, it will handle these as a distinct part, but the fuzzy matching currently treats the whole name as one string, including the subspecific rank. This doesn't give great results for longer names, and I'm thinking of ways to improve it (e.g. by doing fuzzy matching on each part separately).

    • Memory: I've only tested it with synonymies of ~100k rows. It will use quite a bit of space to produce all the indices, although I'd expect it should be able to get above 1M. If you're not already running it on Python 3.3, that should be up to 4x more efficient with memory, so it's definitely worth an upgrade. I think I built the latest Win & Mac binary packages with Python 3.3. Is the synonymy something you can send me to test with?
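
To illustrate the two name-handling points above, here is a minimal sketch that is not Taxonome code. It replaces underscores with spaces before matching, and it scores each part of a name separately instead of comparing the whole string. The function names `split_name` and `part_score`, the `RANK_MARKERS` set, and the choice of the standard library's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Rank markers treated as labels rather than as text to fuzzy-match.
# This set is an assumption for illustration, not Taxonome's own list.
RANK_MARKERS = {"var", "subsp", "ssp", "forma"}


def split_name(raw):
    """Normalise separators and split a name into (parts, rank marker)."""
    words = raw.replace("_", " ").split()
    parts, rank = [], None
    for word in words:
        token = word.rstrip(".").lower()
        if token in RANK_MARKERS:
            rank = token
        else:
            parts.append(word.lower())
    return parts, rank


def part_score(a, b):
    """Score two names by their weakest-matching part (0.0 to 1.0)."""
    parts_a, _ = split_name(a)
    parts_b, _ = split_name(b)
    if not parts_a or len(parts_a) != len(parts_b):
        return 0.0
    return min(
        SequenceMatcher(None, x, y).ratio() for x, y in zip(parts_a, parts_b)
    )
```

Under this scheme, "Iris tenax subsp. gormanii" and "Iris tenax var. gormanii" score 1.0 because only the rank marker differs, while "Iris potaninii var. ionantha" against "Iris potaninii var. arenaria" scores much lower because the final epithets are compared on their own. Whether a rank mismatch should also reduce the score is a design choice this sketch leaves open.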

    I'm also thinking of a way to store a synonymy dataset in an SQLite database, to get round the limits of what can fit in memory. Lookups would be slightly slower, but SQLite has pretty impressive performance. And it would save time if you're frequently starting Taxonome and loading the same dataset. That would also offer a way to parallelise matching. Drat, now I want to drop what I'm doing and start coding straight away. ;-)
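
The general idea of keeping a synonymy on disk rather than in memory can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `names` table, its two columns, and the helper names `build_store` and `lookup` are hypothetical and do not reflect the format Taxonome ends up using.

```python
import sqlite3


def build_store(path, rows):
    """Create a name -> accepted-name table from (name, accepted) pairs."""
    conn = sqlite3.connect(path)
    # The PRIMARY KEY gives an index on name, so exact lookups avoid a full scan.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS names ("
        "name TEXT PRIMARY KEY, accepted TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO names VALUES (?, ?)", rows)
    conn.commit()
    return conn


def lookup(conn, name):
    """Return the accepted name for an exact match, or None if absent."""
    row = conn.execute(
        "SELECT accepted FROM names WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

Because `executemany` accepts any iterable, the rows can be streamed from a CSV reader without holding the whole dataset in memory, and the resulting file can be reopened on later runs instead of being rebuilt.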

    Basically, use match_taxa if you want to get a Taxonome dataset (in memory) at the end, and run_match_taxa if you don't. You pass a list of trackers in for the CSV output - CSVTracker is the log output, CSVListMatches is what the GUI calls 'Name mappings', and CSVTaxaTracker is what the GUI calls 'Taxa data with new names'. There's also a couple of different counters.

    Best wishes, Thomas

  2. Former user Account Deleted

    Thanks, this was very useful, and I managed to run the process through the API.
    Storing the data in sqlite (or other sql software) will be a very welcome addition, as I already use sqlite to store my synonymy, and have to extract CSVs each time I want to match names.
    Thank you again for your detailed and helpful reply.
    Lior

  3. Thomas Kluyver

    Cool. I started work on an sqlite store last night. Though of course it'll be a different database format to your own synonymy, so you'll still need some kind of script to translate it.

  4. Thomas Kluyver

    There's now a prototype SqliteTaxonDB checked in. Using my legumes dataset (25k taxa, 50k names), I get about 2ms to select a name from it with an exact match. Fuzzy matching, which isn't working yet, will have to be somewhat slower. The plain in-memory TaxonSet now gets an exact match in about 8us, down from 30us before.

    If you decide to try it out, be aware that the data format might change before release.
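
Per-lookup timings of this kind can be measured with the standard `timeit` module. The sketch below compares an exact lookup in a plain dict with one in an indexed SQLite table; it is a generic harness with a hypothetical `time_lookups` helper and a made-up single-column table, not the benchmark behind the numbers quoted above, and its results will depend on the machine and dataset.

```python
import sqlite3
import timeit


def time_lookups(names, repeats=1000):
    """Return (dict_seconds, sqlite_seconds) per exact lookup of names[0]."""
    table = {n: n for n in names}
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE names (name TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO names VALUES (?)", ((n,) for n in names)
    )
    target = names[0]
    query = "SELECT name FROM names WHERE name = ?"
    # Time the same exact-match lookup against both stores.
    t_dict = timeit.timeit(lambda: table[target], number=repeats)
    t_sql = timeit.timeit(
        lambda: conn.execute(query, (target,)).fetchone(), number=repeats
    )
    conn.close()
    return t_dict / repeats, t_sql / repeats
```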

  5. Former user Account Deleted

    Sounds great, but we need the fuzzy matching, so whenever this option is available, we'd be happy to try it.

  6. Thomas Kluyver

    I've just added support for fuzzy matching from an SqliteTaxonDB (this is all in the collectionapi2 branch). I haven't tested its performance yet.
