test_vector_unicode fails in Python 3.5.3 (Debian)

Issue #459 new
Simon McVittie created an issue

test_vector_unicode fails under the current Python version in Debian. I'm fixing this during general QA work, so I don't know anything in particular about Whoosh, only that one of its tests fails:

    =================================== FAILURES ===================================
    _____ test_vector_unicode ______
    Traceback (most recent call last):
      File "/<<PKGBUILDDIR>>/.pybuild/pythonX.Y_3.5/build/tests/test_vectors.py", line 80, in test_vector_unicode
        assert vec[0][0] == u"\u13ac\u13ad\u13ae"
    AssertionError: assert '\uab7c\uab7d\uab7e' == '\u13ac\u13ad\u13ae'
      - \uab7c\uab7d\uab7e
      + \u13ac\u13ad\u13ae
    ============================ pytest-warning summary ============================

The particular text used in that test uses Cherokee letters: for example, the first one used is U+13AC CHEROKEE LETTER GV.

Prior to Unicode 8.0, Cherokee was modelled as not having upper or lower case, but this was later decided to have been incorrect. Unicode 8.0 repurposed the existing Cherokee block U+13A0..U+13FF as upper-case Cherokee to reflect their appearance in existing fonts, and introduced new lower-case versions in the range U+AB70..U+ABBF. For example, U+AB7C CHEROKEE SMALL LETTER GV is the lower-case form of U+13AC.

When this test was written, it was presumably run against a Python version with pre-Unicode 8.0 tables, where the default LowercaseFilter leaves U+13AC intact: u"\u13ac".lower() == u"\u13ac". However, Python 3.5 has Unicode 8.0 tables, which result in u"\u13ac".lower() == u"\uab7c".
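The difference described above can be checked on any interpreter with a few lines of standard-library Python. This is only a sketch for illustration: the helper names (unicode_version, expected_lower_gv) are made up here and are not part of Whoosh or of its test suite.

```python
import unicodedata

GV_UPPER = u"\u13ac"  # CHEROKEE LETTER GV
GV_SMALL = u"\uab7c"  # CHEROKEE SMALL LETTER GV, added in Unicode 8.0


def unicode_version():
    """Return the interpreter's Unicode database version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def expected_lower_gv():
    """Return what lower() should give for U+13AC on this interpreter.

    Hypothetical helper: before Unicode 8.0 the letter is caseless, so
    lower() leaves it unchanged; from 8.0 on it maps to U+AB7C.
    """
    if unicode_version() >= (8, 0, 0):
        return GV_SMALL
    return GV_UPPER


if __name__ == "__main__":
    # Show the Unicode table version and the actual result of lower().
    print(unicodedata.unidata_version, ascii(GV_UPPER.lower()))
```

Running this under the interpreter used for the build shows which Unicode tables it carries and therefore which of the two results the test will see.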

My proposed patch (attached) makes the test Python-version-independent by asserting that the word found in the frequency analysis is the result of lower(), whatever this Python version thinks that is.
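The idea behind that assertion can be sketched without Whoosh. The code below is not the attached patch: toy_lowercase_terms is a hypothetical stand-in for an analyzer that lowercases its tokens, used only to show the shape of a check that does not hard-code either code point.

```python
TEXT = u"\u13ac\u13ad\u13ae"  # the Cherokee word used in the failing test


def toy_lowercase_terms(text):
    """Hypothetical stand-in for a lowercasing analyzer: split, then lower()."""
    return [word.lower() for word in text.split()]


def check_first_term(terms):
    """Compare against lower() as computed by the running interpreter.

    The expected value follows whatever Unicode tables this Python has,
    so the check holds both before and after Unicode 8.0.
    """
    assert terms[0] == TEXT.lower()


if __name__ == "__main__":
    check_first_term(toy_lowercase_terms(TEXT))
```

The trade-off is that such a check no longer pins down a specific code point; it only verifies that the indexed term matches the interpreter's own notion of lower case.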

Comments (2)

  1. Simon McVittie reporter

    I don't know anything in particular about Cherokee either, I'm getting all this from Google :-)

  2. Simon McVittie reporter

    In fact this has already been fixed in 2.7.1 with commit "Fix the analyzer in test_vector_unicode() to not lowercase, since this makes the test fail on some Python versions"; so please ignore this report, unless you think my solution improves test coverage.
