utf8rewind / Changes for 1.2.1

Description

utf8rewind is a cross-platform and open source C library designed to extend the default string handling functions and add support for UTF-8 encoded text.

Download

utf8rewind-1.2.1.zip (3.17 MB)

Clone in Mercurial

hg clone https://bitbucket.org/knight666/utf8rewind utf8rewind

Summary

In this release, we have fixed critical issues in the handling of invalid input that could result in unhandled exceptions or crashes. Users of utf8rewind 1.2.0 are strongly advised to upgrade to 1.2.1 as soon as possible. This release is binary compatible with 1.2.0 and can be used as a drop-in replacement.

Bug fixes

One of the bugs fixed in this release is a crash that could occur if the output buffer for one of the conversion functions (utf8towide, utf8toutf16 or utf8toutf32) had an incorrect length that was not a multiple of two or four. The functions will also no longer read past the end of the input buffer if an incorrect length is specified for it.
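
As an illustration of the kind of bounds check involved, here is a minimal sketch of reading UTF-16 code units from a byte buffer whose size may not be a multiple of two. It is not the code used by utf8rewind: the function name `sketch_read_utf16_units` and the macro `SKETCH_REPLACEMENT` are hypothetical, and the choice to emit one replacement character for a dangling byte is an assumption based on the changelog entries about missing bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_REPLACEMENT 0xFFFD /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical helper, not part of utf8rewind.
   Copies UTF-16 code units (native byte order) out of a byte buffer whose
   size may not be a multiple of two. Returns the number of units written. */
size_t sketch_read_utf16_units(const void *input, size_t input_size,
                               uint16_t *output, size_t output_count)
{
    const unsigned char *src = (const unsigned char *)input;
    size_t written = 0;

    if (input == NULL || output == NULL) {
        return 0;
    }

    while (input_size > 0 && written < output_count) {
        if (input_size < sizeof(uint16_t)) {
            /* Dangling byte: a full unit is never read here. Consume what
               is left and emit a single replacement character instead. */
            output[written++] = SKETCH_REPLACEMENT;
            input_size = 0;
            break;
        }
        /* memcpy avoids unaligned 16-bit loads from the byte buffer. */
        memcpy(&output[written++], src, sizeof(uint16_t));
        src += sizeof(uint16_t);
        input_size -= sizeof(uint16_t);
    }

    assert(written <= output_count);
    return written;
}
```

With a five-byte input, this sketch yields two complete code units followed by one replacement character, and it never touches memory beyond the fifth byte.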

Seeking backwards now deals with overlong or erroneously encoded codepoints correctly.
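
To make the problem concrete, the following is a rough sketch of stepping back one codepoint in a UTF-8 buffer while tolerating stray or excess continuation bytes. It is an assumption about how such a seek could be written, not the implementation of utf8seek; the names `sketch_expected_length` and `sketch_seek_back_one` are hypothetical, as is the decision to return NULL for NULL input.

```c
#include <assert.h>
#include <stddef.h>

/* Length announced by a lead byte, or 0 for a continuation or invalid byte. */
static size_t sketch_expected_length(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Hypothetical helper, not part of utf8rewind.
   Returns the position one codepoint before `cur`, never before `start`. */
const char *sketch_seek_back_one(const char *start, const char *cur)
{
    const unsigned char *s = (const unsigned char *)start;
    const unsigned char *p;
    size_t trailing = 0;
    size_t expected;

    if (start == NULL || cur == NULL) return NULL;
    if (cur <= start) return start;

    /* Step back over at most three continuation bytes (10xxxxxx). */
    p = (const unsigned char *)cur - 1;
    while (p > s && (*p & 0xC0) == 0x80 && trailing < 3) {
        --p;
        ++trailing;
    }
    assert(p >= s);

    /* Accept the lead byte only if it can own the bytes that were skipped. */
    expected = sketch_expected_length(*p);
    if (expected != 0 && trailing < expected) {
        return (const char *)p;
    }

    /* Otherwise the last byte is treated as a single invalid unit. */
    return cur - 1;
}
```

In this sketch the overlong sequence C0 80 is skipped as one unit, whereas a continuation byte that follows an ASCII character is skipped on its own.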

In-depth

The road to this release was longer than expected, and this can be attributed almost entirely to unit testing. Because bugs in the functions were not caught by unit tests, all tests related to these functions had to be examined, reconsidered and, in large part, rewritten. This, as you can imagine, took a considerable amount of time. All in all, just under 750 unit tests were added for this release and countless others were refactored.

While the issues with the functions were found and fixed quickly, a lot of time was put into making sure the bugs could not occur anywhere else. While working on these tests, a number of issues were found when seeking backwards in UTF-8 encoded strings. These issues have also been addressed in this release.

Running the new tests on the previous version shows that a significant number of tests (8.4%) either crashed or returned the wrong results.

Test results for running the 1.2.1 tests on the 1.2.0 implementation

Test count

As you can see in this chart, the number of tests has been increasing with every release:

Total number of tests per release

In fact, this release marks the second-largest increase in the total number of tests, surpassed only by 1.2.0's monstrous 1119 additional tests. Maintaining these tests is starting to become a real issue, especially when they have to be refactored.

Changelog

  • utf16toutf8: Fix crash when input is missing bytes and input length in bytes is not a multiple of two.
  • utf16toutf8: Fix issue where UTF8_ERR_INVALID_DATA would not always be output when an invalid surrogate pair is encountered.
  • utf16toutf8: Fix issue where UTF8_ERR_INVALID_DATA instead of UTF8_ERR_NOT_ENOUGH_SPACE would be returned when an invalid sequence cannot be output to the target buffer.
  • utf16toutf8: Fix issue where sequences with missing bytes would not result in a replacement character.
  • utf16toutf8: Fix issue where sequences with missing bytes could fill up the output buffer with the replacement character.
  • utf16toutf8: Fix issue where UTF8_ERR_INVALID_DATA would not be output on all possible invalid surrogate pairs.
  • utf32toutf8: Fix crash when input is missing bytes and input length in bytes is not a multiple of four.
  • utf32toutf8: Fix issue where UTF8_ERR_INVALID_DATA would not always be output when an invalid surrogate pair is encountered.
  • utf32toutf8: Fix issue where UTF8_ERR_INVALID_DATA instead of UTF8_ERR_NOT_ENOUGH_SPACE would be returned when an invalid sequence cannot be output to the target buffer.
  • utf32toutf8: Fix issue where sequences with missing bytes would not result in a replacement character.
  • utf32toutf8: Fix issue where sequences with missing bytes could fill up the output buffer with the replacement character.
  • utf32toutf8: Fix issue where UTF8_ERR_INVALID_DATA would not be output on all possible invalid surrogate pairs.
  • utf8seek: Fix crash when input start pointer is NULL.
  • utf8seek: Fix issue where a NULL input current pointer would not be considered invalid input.
  • utf8seek: Fix issue where seeking backwards on overlong sequences would not skip over the correct number of continuation bytes.
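
Several of the entries above concern how invalid surrogate pairs and a full target buffer are reported. The sketch below shows one plausible way to order those checks when converting UTF-16 to UTF-8. It is not the library's code: the enum `sketch_error` only mirrors the error names in the changelog, the functions `sketch_put_utf8` and `sketch_utf16_to_utf8` are hypothetical, and continuing after invalid data is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error codes; they only mirror the names in the changelog. */
enum sketch_error { SKETCH_OK = 0, SKETCH_INVALID_DATA, SKETCH_NOT_ENOUGH_SPACE };

/* Writes one codepoint as UTF-8. Returns bytes written, or 0 if it does not fit. */
static size_t sketch_put_utf8(uint32_t cp, char *dst, size_t dst_size)
{
    size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    size_t i;

    if (need > dst_size) return 0;
    if (need == 1) {
        dst[0] = (char)cp;
        return 1;
    }
    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = need - 1; i > 0; --i) {
        dst[i] = (char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* Lead byte: 110xxxxx, 1110xxxx or 11110xxx, depending on the length. */
    dst[0] = (char)((0xF0 << (4 - need)) | cp);
    return need;
}

/* Hypothetical converter, not part of utf8rewind. */
size_t sketch_utf16_to_utf8(const uint16_t *in, size_t in_count,
                            char *out, size_t out_size, enum sketch_error *err)
{
    size_t written = 0;
    size_t i;

    if (err == NULL) return 0;
    *err = SKETCH_OK;
    if (in == NULL || out == NULL) {
        *err = SKETCH_INVALID_DATA;
        return 0;
    }

    for (i = 0; i < in_count; ++i) {
        uint32_t cp = in[i];
        int invalid = 0;
        size_t n;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* A high surrogate must be followed by a low surrogate. */
            if (i + 1 < in_count && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                invalid = 1;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            invalid = 1; /* low surrogate without a preceding high surrogate */
        }
        if (invalid) cp = 0xFFFD;

        /* Running out of space takes precedence over invalid data. */
        n = sketch_put_utf8(cp, out + written, out_size - written);
        if (n == 0) {
            *err = SKETCH_NOT_ENOUGH_SPACE;
            return written;
        }
        written += n;
        if (invalid) *err = SKETCH_INVALID_DATA;
    }

    assert(written <= out_size);
    return written;
}
```

The space check happens before the invalid-data flag is recorded, so a replacement character that does not fit is reported as a space problem rather than a data problem.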
