Str::removeDiacritics converts wrongly

Issue #13 resolved
Rudie Dirkx created an issue

If I hardcode Öé in my PHP file, which is UTF-8 according to both PhpStorm and Sublime Text, it's converted to gibberish instead of Oe. Str::seemsUtf8 doesn't think the string is UTF-8. If I force it to use the UTF-8 conversion, it works perfectly.

The UTF-8 code path, which isn't the one being used, works: https://3v4l.org/2EqNK

The non-UTF-8 code path, which is the one being used, doesn't work: https://3v4l.org/IoUYP

I remember iconv() being decent with simple diacritics, but maybe not: https://3v4l.org/8QDlo

Comments (6)

  1. Mark Penner repo owner

    Interesting. Maybe I'll remove that seemsUtf8 check and replace it with an option if it's not reliable. Thanks for reporting!

  2. Mark Penner repo owner

    Yeah. dev-default. No one other than me seems to be using this lib so I haven't really bothered tagging anything. I suppose I should tag a new release after I fix this :-)

    Str::seemsUtf8 is returning true for "Öé" for me. Would it be possible for you to run this: print_r(unpack("C*", "Öé"))? If the file is UTF-8, I believe you should get back [195,150,195,169]. Getting [214,233] would mean it's ISO-8859-1.
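
The byte check suggested above can be sketched as follows. This is only an illustration: the variable names are made up for the example, the strings are written as explicit byte escapes so the result does not depend on how the source file is saved, and mb_check_encoding assumes ext-mbstring is installed.

```php
<?php
// "Öé" as UTF-8 bytes, and the same text as ISO-8859-1 bytes.
$utf8   = "\xC3\x96\xC3\xA9";
$latin1 = "\xD6\xE9";

// unpack('C*', ...) returns a 1-indexed array of unsigned byte values;
// array_values() renumbers it from 0 so it compares cleanly with a literal.
$utf8Bytes   = array_values(unpack('C*', $utf8));
$latin1Bytes = array_values(unpack('C*', $latin1));

echo implode(',', $utf8Bytes), "\n";   // 195,150,195,169
echo implode(',', $latin1Bytes), "\n"; // 214,233

// mb_check_encoding() validates the raw bytes against the named encoding.
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
```

If the dump of a hardcoded "Öé" shows four bytes, the file really is UTF-8 and the problem lies in the detection, not in the editor.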

    My [new] unit tests are passing for both:

    $this->assertSame("Oe", Str::removeDiacritics(implode('', array_map('chr', [195, 150, 195, 169]))));
    $this->assertSame("Oe", Str::removeDiacritics(implode('', array_map('chr', [214, 233]))));

    Oh... you know what it could be? Maybe there's a bug in \Ptilz\Str::length. Do you have ext-mbstring installed?

    Actually, I shouldn't have been using Str::length at all there -- I'm pretty sure I wanted the byte length (strlen), not the character length of the string. Using the current character encoding to try to detect a character encoding doesn't make much sense :-)
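
To illustrate why the byte length matters here, below is a simplified sketch of a byte-level UTF-8 check. The function looksLikeUtf8 is a hypothetical helper written for this example, not the actual Str::seemsUtf8 code, and it only checks lead and continuation byte patterns (it does not reject overlong forms or surrogates). The loop bound is strlen, so every byte gets inspected regardless of mbstring's internal encoding.

```php
<?php
// Hypothetical helper for illustration only; not the Ptilz implementation.
function looksLikeUtf8(string $bytes): bool
{
    $length = strlen($bytes); // byte count, not character count

    for ($i = 0; $i < $length; $i++) {
        $byte = ord($bytes[$i]);

        if ($byte < 0x80) {
            $continuation = 0;            // 0xxxxxxx: ASCII
        } elseif (($byte & 0xE0) === 0xC0) {
            $continuation = 1;            // 110xxxxx: 2-byte sequence
        } elseif (($byte & 0xF0) === 0xE0) {
            $continuation = 2;            // 1110xxxx: 3-byte sequence
        } elseif (($byte & 0xF8) === 0xF0) {
            $continuation = 3;            // 11110xxx: 4-byte sequence
        } else {
            return false;                 // stray continuation or invalid lead byte
        }

        // Each expected continuation byte must look like 10xxxxxx.
        for ($j = 0; $j < $continuation; $j++) {
            $i++;
            if ($i >= $length || (ord($bytes[$i]) & 0xC0) !== 0x80) {
                return false;
            }
        }
    }

    return true;
}

var_dump(strlen("\xC3\x96\xC3\xA9"));        // int(4): four bytes
var_dump(looksLikeUtf8("\xC3\x96\xC3\xA9")); // bool(true): "Öé" as UTF-8
var_dump(looksLikeUtf8("\xD6\xE9"));         // bool(false): "Öé" as ISO-8859-1
```

With a character count as the loop bound, the same walk over the UTF-8 "Öé" would stop after two of the four bytes, and that count would itself depend on whichever encoding is currently assumed.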

    I just pushed v0.6.0 with a potential fix. You're welcome to give it a whirl. If I can't get removeDiacritics to play nicely I'll split it.

  3. Rudie Dirkx reporter

    I'm getting [195,150,195,169] everywhere. On Windows 7 PHP 7.1. On Ubuntu 16 PHP 7.1. And on 3v4l: https://3v4l.org/BuiZm

    mbstring is enabled everywhere.

    If I don't upgrade to 0.6.0 or dev-default, it works... Str::seemsUtf8 says the string is UTF-8, and the result is good. I don't know what the hell happened...

    After upgrading to 0.6.0, it works as well.

    I don't get it... And I really don't want to. Maybe next year when I don't hate encoding anymore. Thanks! Fixed! I guess...
