Tokenizer splits numbers

Issue #39 new
Nicolás Palopoli created an issue

On the CLV_Separin_Fungi motif Browse page (http://slim.icr.ac.uk/articles/browse/?motif_class=CLV_Separin_Fungi), the Word Cloud shows “139” and “889” as relevant terms.

The abstract of “Studies on substrate recognition by the budding yeast separase.” contains the following text: “This motif is found in 1,139 of 5,889 predicted yeast proteins.”

Therefore, the tokenizer seems to be treating the thousands-separator comma as a word boundary, so “1,139” and “5,889” are each split into separate tokens (“1” and “139”, “5” and “889”).
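
To illustrate the suspected behaviour, here is a minimal sketch in Python. I don't know how the site's tokenizer is actually implemented, so `naive_tokenize`, `tokenize` and `TOKEN_RE` are hypothetical names: the first function reproduces the split by breaking on every non-word character, and the second shows one possible fix that matches comma-grouped numbers as a single token.

```python
import re

# Hypothetical token pattern (an assumption, not the site's real tokenizer):
# try a number with thousands separators first, then a plain number,
# then any run of word characters.
TOKEN_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?|\w+")


def naive_tokenize(text):
    """Split on every non-word character, so "1,139" becomes "1" and "139"."""
    return re.findall(r"\w+", text)


def tokenize(text):
    """Keep comma-grouped numbers such as "1,139" together as one token."""
    return TOKEN_RE.findall(text)


sentence = "This motif is found in 1,139 of 5,889 predicted yeast proteins."
print(naive_tokenize(sentence))
print(tokenize(sentence))
```

With the naive version, “139” and “889” show up as standalone tokens, which matches what the Word Cloud displays. With the number-aware pattern they stay inside “1,139” and “5,889”.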

(Also, I’m not sure whether numbers should be used as relevant terms at all.)
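
If we decide numbers should not appear in the Word Cloud, a filter along these lines could be applied after tokenization. This is only a sketch: `is_numeric` and `drop_numeric` are hypothetical helpers, and it assumes tokens arrive as plain strings. Mixed tokens such as “p53” are kept, since they contain letters.

```python
import re

# Assumption: a "numeric" token consists only of digits, commas and dots,
# and contains at least one digit.
NUMERIC_RE = re.compile(r"[\d.,]*\d[\d.,]*")


def is_numeric(token):
    """Return True for tokens like "889", "1,139" or "3.5"."""
    return NUMERIC_RE.fullmatch(token) is not None


def drop_numeric(tokens):
    """Remove purely numeric tokens before computing relevant terms."""
    return [t for t in tokens if not is_numeric(t)]


print(drop_numeric(["motif", "1,139", "889", "p53", "separase"]))
```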
