Some matching bugs

Issue #15 new
Former user created an issue

Hi, while performing name matching, I encountered some strange cases which I think might indicate bugs:
1. The taxon Erato sodiroi (Hieron.) H. Rob. was not resolved, although my synonymy data contains Erato sodiroi (Hieron.) H.Rob. (the only difference is the space after H.)
2. The same happened with Nasturtium officinale R. Br., which should have been resolved to Nasturtium officinale R.Br. (the only difference is the space after R.)
3. The taxon Microthlaspi granatense (Boiss. & Reut.) F.K. Meyer was not matched, even though Microthlaspi granatense F.K. Meyer appears in the synonymy dataset. I noticed that Taxonome has some mechanism for handling the author name in parentheses, but for some reason it didn't resolve in this case.
Do you have any idea why these taxa were not matched?
Thank you!

Comments (8)

  1. Thomas Kluyver

    Hi Lior,

    How are you loading the taxa? Is this through the GUI, or in a script? Double check that the authority is being recognised as separate from the actual name. If you're working with the GUI, check the options for loading names from the CSV file.

    In my tests, it can handle the differences, so my best guess is that it's getting authority names and binomials confused.

  2. Former user Account Deleted

    Hi, and thanks. We're doing the matching through a script. The authority is in the same field as the actual name. We could separate them if necessary. Do you think the matching will work better this way?
    We set the auth_field parameter to TRUE. Is this the correct value in this case?
    We also tried testing two problematic taxa using the GUI. The results are a bit odd: one was matched with score 1 while the other was not matched at all. Separating the authority and name fields did not change the output. Attached are the CSVs I used as the synonymy dataset, the names-to-match dataset and the name mapping result. Can you take a look?

    Another issue that we came across is the matching of hybrid species. These species are denoted with either an X character or a special × character. In either case, we noticed that the hybrid marker is omitted from the matching results. For example, Citrus × aurantiifolia simply becomes Citrus aurantiifolia (the original name is changed). Is this supposed to happen?
    Thanks again for all the help,
    Lior

  3. Thomas Kluyver

    Script: setting authfield to True should work (not auth_field, but it should fail loudly if you get that name wrong, so I assume that's not the problem). Taxonome is designed to be able to disentangle names and authorities from a single field, but if you can easily put them in two separate fields, it's worth trying that to see whether it's the problem. Can you send me your script, or enough of it to reproduce the problem?

    The datasets you've attached: In the synonymy file, there's an odd character (a non-breaking space) between 'Nasturtium' and 'officinale'. It can handle that correctly, but you need to tell it to use Windows-1252 encoding. There's a dropdown at the bottom of the CSV import dialog. From a script, use the encoding parameter to the open function.
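
    To illustrate the encoding point, here is a minimal sketch in plain Python (this is the built-in open function, not Taxonome's own loader, and the file name and column header are made up for the example). The byte 0xA0 is a non-breaking space in Windows-1252, but on its own it is not valid UTF-8:

```python
import csv
import os
import tempfile

# Write a tiny CSV as Windows-1252 bytes, with a non-breaking space
# (byte 0xA0) between the genus and the epithet.
raw = "Name\r\nNasturtium\u00a0officinale R.Br.\r\n".encode("cp1252")
path = os.path.join(tempfile.mkdtemp(), "synonymy.csv")  # hypothetical file name
with open(path, "wb") as f:
    f.write(raw)

# Reading as UTF-8 fails, because a lone 0xA0 byte is not valid UTF-8.
try:
    with open(path, encoding="utf-8", newline="") as f:
        list(csv.reader(f))
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading as Windows-1252 recovers the non-breaking space as U+00A0.
with open(path, encoding="cp1252", newline="") as f:
    rows = list(csv.reader(f))

name = rows[1][0]
# str.split() with no arguments treats U+00A0 as whitespace, so this
# collapses it to an ordinary space.
normalised = " ".join(name.split())
```

    Once the file is decoded with the right encoding, the non-breaking space is just another whitespace character and can be normalised before matching if needed.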

    Hybrid names: It stores the hybrid status internally, but for efficiency it doesn't refer to it when reconstructing the name as a string. In my area, the hybrid markers are used inconsistently, so ignoring them made sense. I'll have a think about how to include it. If you need that functionality in the meantime, you can use a function like this:

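    def plain_with_hybrid(name):
        """Rebuild a plain name string, keeping the hybrid marker."""
        # Prefix the genus with × when the genus itself is marked as hybrid.
        parts = [('×'+name.g) if name.hybrid=='genus' else name.g]
    
        if name.sp:
            # Prefix the specific epithet with × for a hybrid species.
            parts.append('×'+name.sp if name.hybrid=='species' else name.sp)
    
            # Each subname is a sequence of strings; add its parts in order.
            for sn in name.subnames:
                parts.extend(sn)
    
        return " ".join(parts)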
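
    A quick way to try the function without a loaded dataset is to pass it a stand-in object. This is only a sketch: SimpleNamespace is used here in place of a real parsed name from Taxonome, and the attribute values (including the ('subsp.', 'rubra') subname) are assumptions for illustration.

```python
from types import SimpleNamespace

def plain_with_hybrid(name):
    # Same logic as the function above, repeated so the example runs on its own.
    parts = [('×' + name.g) if name.hybrid == 'genus' else name.g]
    if name.sp:
        parts.append('×' + name.sp if name.hybrid == 'species' else name.sp)
        for sn in name.subnames:
            parts.extend(sn)
    return " ".join(parts)

# Stand-ins for parsed names; real objects would come from Taxonome.
lime = SimpleNamespace(g='Citrus', sp='aurantiifolia', hybrid='species', subnames=[])
crepis = SimpleNamespace(g='Crepis', sp='rubra', hybrid=None,
                         subnames=[('subsp.', 'rubra')])

lime_text = plain_with_hybrid(lime)
crepis_text = plain_with_hybrid(crepis)
```

    Note that the marker is attached directly to the epithet (×aurantiifolia), without the space used in the original example.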
    
  4. Former user Account Deleted

    Thanks. We'll try that and see how it works.
    Another question:
    The name Cnepis rubra L. was not resolved, although Crepis rubra L. appears in the synonymy dataset (note the 'n' instead of 'r' in the original name). Can you explain why the fuzzy matching did not work here? Is it because the mismatch is in the genus name?
    Thanks,
    Lior

  5. Thomas Kluyver

    The fuzzy matching on a TaxonSet assumes that the first three letters are correct, so it checks against everything beginning with 'Cne' in your case. It would be quite slow to compare against every name in the database, so we need some way to pick a subset to do fuzzy comparison against.
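
    The effect of the three-letter assumption can be sketched like this. It is not Taxonome's actual code: difflib stands in for whatever similarity measure is really used, and the function name, name list and cutoff are made up for the example.

```python
import difflib

def prefix_fuzzy_match(query, names, prefix_len=3, cutoff=0.8):
    """Fuzzy-match query only against names sharing its first few letters."""
    prefix = query[:prefix_len].lower()
    # Narrow the search to names with the same prefix, to avoid comparing
    # against every name in the dataset.
    candidates = [n for n in names if n[:prefix_len].lower() == prefix]
    return difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)

names = ["Crepis rubra", "Crepis biennis", "Cneorum tricoccon"]

# A typo after the first three letters is still found.
typo_in_epithet = prefix_fuzzy_match("Crepis rubrra", names)

# A typo within the first three letters selects the wrong subset ('Cne...'),
# so the correct name is never compared.
typo_in_prefix = prefix_fuzzy_match("Cnepis rubra", names)
```

    Here typo_in_epithet contains 'Crepis rubra', while typo_in_prefix is empty, which mirrors what happened with Cnepis rubra.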

    The SqliteTaxonDB prototype uses a different system - it will check against all names with genus 'Cnepis' and all names with the specific epithet 'rubra' for the closest match. That will work better in this case, though it has a different limitation: it can never find a name if both the genus and the species epithet are mis-spelled.
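
    As a rough sketch of that idea (again not the actual SqliteTaxonDB code: in-memory dictionaries stand in for the database, difflib for the similarity measure, and all function names are invented), the candidates are the names that share either the genus or the epithet with the query. The sketch assumes plain binomials without authorities.

```python
import difflib
from collections import defaultdict

def build_indexes(names):
    """Index binomials by genus and by specific epithet."""
    by_genus, by_epithet = defaultdict(set), defaultdict(set)
    for name in names:
        genus, epithet = name.split()[:2]
        by_genus[genus].add(name)
        by_epithet[epithet].add(name)
    return by_genus, by_epithet

def genus_or_epithet_match(query, by_genus, by_epithet, cutoff=0.8):
    """Fuzzy-match against names sharing the query's genus or epithet."""
    genus, epithet = query.split()[:2]
    candidates = by_genus.get(genus, set()) | by_epithet.get(epithet, set())
    return difflib.get_close_matches(query, sorted(candidates), n=1, cutoff=cutoff)

by_genus, by_epithet = build_indexes(
    ["Crepis rubra", "Crepis biennis", "Trifolium rubrum"]
)

# Genus mis-spelled, epithet correct: found through the epithet index.
one_typo = genus_or_epithet_match("Cnepis rubra", by_genus, by_epithet)

# Both parts mis-spelled: neither index has an entry, so nothing is compared.
two_typos = genus_or_epithet_match("Cnepis rubrra", by_genus, by_epithet)
```

    With this scheme one_typo contains 'Crepis rubra', and two_typos is empty, which is the limitation described above.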

  6. Former user Account Deleted

    Hi, long time. We're still working, and have another question.
    We noticed that sometimes the names we are matching are changed during the process. That is, a different name appears in the original name column of the name mapping file. The changed name usually can't be matched, because it's quite different from what appears in the synonymy dataset. Some examples:
    Hieracium caespitosum Dumort. [ H. pratense Tausch] became Hieracium caespitosum D. [. H. P. Tausch], which could not be matched. Hieracium hoppeanum Schultes grex macranthum (Ten.) Zahn became Hieracium hoppeanum S. G. M. (. ). Zahn.
    I understand that it is sometimes necessary to format the name before matching it, but in these examples it seems that the formatting goes a bit wrong. Any idea what happens there?
    Thanks!

  7. Thomas Kluyver

    Hi again Lior. The issue is that it parses the name, and doesn't store the original text, so when it writes the original name back out, it's reconstructed from a structured representation. That mostly works, but it's not expecting another name in square brackets, so it treats "[ H. pratense Tausch]" as all part of the author's name. In that case, you might want to do some preprocessing to strip that part off. I assume it's noting a synonym?
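
    A minimal sketch of that kind of preprocessing, assuming the bracketed part always comes at the end of the name (the function name and the regular expression are made up for the example, and they only handle a single trailing pair of square brackets):

```python
import re

# Matches a trailing "[ ... ]" annotation plus any whitespace around it.
BRACKETED_TAIL = re.compile(r"\s*\[[^\]]*\]\s*$")

def strip_bracketed_synonym(raw_name):
    """Remove a trailing square-bracketed annotation from a name string."""
    return BRACKETED_TAIL.sub("", raw_name).strip()

cleaned = strip_bracketed_synonym(
    "Hieracium caespitosum Dumort. [ H. pratense Tausch]"
)
untouched = strip_bracketed_synonym("Crepis rubra L.")
```

    Running this before the names are loaded leaves 'Hieracium caespitosum Dumort.' for matching, and names without brackets pass through unchanged.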

    grex is more complicated - I've not come across the term before, but it looks like it doesn't always follow normal nomenclature rules. In your example, though, it looks like it's associated with the word following it, in the same way that we'd use 'subsp.' or 'var.'. If that's the case, you can add 'grex' to the set named extranamebits on this line: https://bitbucket.org/taxonome/taxonome/src/f02676ca06825b3cbed40f7078e796311e8baf34/taxonome/taxa/base.py?at=default#cl-16

    I'll have a think about the name matching output - I've run into this as well, and it would be useful to have the real original name kept. But parsing the names correctly is still important to try to match them.
