Store UTF-8 Strings in a more efficient way internally

Issue #1102 resolved
Lukas Meindl
created an issue

This header-only library could be integrated into CEGUI and might help with this: http://utfcpp.sourceforge.net/

Otherwise ICU library support is an option.

Lastly, more information on the general string rework that is necessary to reduce our memory footprint can be found here: http://wiki.worldforge.org/wiki/Summer_of_Code#CEGUI_Ideas

As suggested by Martin, this could also go into v0-8 if we add it as a compile option and keep the old string.

Comments (150)

  1. Henrik S. Gaßmann

    So you said in pull request #185 that you are currently working on this. How far have you progressed? Is collaboration regarding this still wanted? Also, I would like to point to UTF8++, which is my mostly compatible, optimized C++11 fork of the utf8cpp project you mentioned in the description.

  2. Lukas Meindl reporter

    It is 99% done and should most likely be submitted this weekend. I prefer to stick with everything C++11 provides us with and work with that, rather than import an additional external library, for obvious reasons. I made a String wrapper around std::u32string and I will handle all additional cases using the std conversion functions to UTF-8.

  3. Lukas Meindl reporter

    CEGUI::String by default will handle both UTF-8 and UTF-32 but internally store UTF-32, because we target PC and memory is not our constraint. If you set it in CMake to use std::string then it will only do what std::string does, of course :D

    So what I did was make a String class that accepts char, char32_t, std::string and std::u32string in all methods, constructors and operators, but stores a std::u32string internally. Does that make sense?
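
    Roughly, the idea looks like this - just a sketch of the concept, not the final CEGUI::String interface, and the <codecvt> conversion shown is only one possible implementation:

    #include <codecvt>
    #include <locale>
    #include <string>

    class String
    {
    public:
        String(char asciiChar) : d_data(1, static_cast<char32_t>(asciiChar)) {}
        String(char32_t codePoint) : d_data(1, codePoint) {}
        String(const std::u32string& utf32) : d_data(utf32) {}
        String(const std::string& utf8) : d_data(fromUtf8(utf8)) {}

        //! convert the internal UTF-32 storage to UTF-8 on demand
        std::string toUtf8() const
        {
            std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
            return conv.to_bytes(d_data);
        }

    private:
        static std::u32string fromUtf8(const std::string& utf8)
        {
            std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
            return conv.from_bytes(utf8);
        }

        std::u32string d_data; //!< internal UTF-32 storage
    };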

  4. Henrik S. Gaßmann

    I don't like that approach, or rather UTF-32 in general; it's just inelegant. I would have preferred an approach which simply uses std::string as the backend, with the convention that it only contains UTF-8, because even if you don't want to write UTF-8 parsing yourself or include one of the header-only libraries mentioned, you could use the standard C++11 UTF-8 support as explained here (which might not be overly comfortable, but does the job without wasting 3 bytes per character in most cases)

  5. Lukas Meindl reporter

    Yea, but UTF-8 sucks for editing strings at runtime or counting parts of their characters - think of word wrapping, for example... We won't use UTF-8 for runtime storage, we decided against it. I wanted to go for UTF-8 as well at first, but there are just too many reasons not to go through that trouble. If we never had to edit or count them, then I would say sure.
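
    Just to illustrate the counting point: with UTF-8, even getting the number of code points means walking the whole buffer and skipping continuation bytes (a rough sketch, assuming valid input), whereas with UTF-32 storage it is simply size():

    #include <cstddef>
    #include <string>

    //! Count the code points in a UTF-8 encoded string by skipping continuation
    //! bytes (of the form 10xxxxxx). Malformed input is not validated here.
    std::size_t utf8CodePointCount(const std::string& utf8)
    {
        std::size_t count = 0;
        for (unsigned char byte : utf8)
            if ((byte & 0xC0) != 0x80) // every non-continuation byte starts a code point
                ++count;
        return count;
    }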

  6. Lukas Meindl reporter

    Besides, String until now always contained UTF-32 characters anyway, so my rework is not "worse" than before; in fact it should be better and more conformant to the standard.

    Unfortunately you never explained why storing them like this is inelegant. If it is just a feeling then this is not an argument for it ): It is like saying compressed image data is more elegant than non-compressed image data. Is it? It is not as easy to read or retrieve if you are working on the CPU side.

  7. Lukas Meindl reporter

    Regarding "but does the job without wasting 3 bytes per character in most cases", if you want to use ASCII there is still the option to use std::strings in CEGUI. Again: Aiming for PC so memory size is not really a problem here. We waste much much much more memory by having redundancy with our Property key strings, which we will change in the future.

    PS: If you want to add a CEGUI::String type based on UTF-8 you are free to add this to our library at any time. Be aware that we won't want to link external libs or add someone's source code for this, and be aware of the issues regarding word wrapping etc.; this might not work out of the box since we probably use std functions all over that depend on single-byte code points in std::string.

  8. Henrik S. Gaßmann

    Oh sorry, I thought it was obvious: UTF-32 wastes at least one byte per code point (Unicode is constrained to 21 bits), and for most western languages you waste 3 bytes. Furthermore, I find the design of UTF-8 quite charming... compatibility with ASCII, the well-thought-out way of storing code points that exceed a 7-bit representation, etc. On the other side, UTF-32 doesn't solve problems like combining characters and normalization, so you may need 2 code points for one visible "character" anyway. So UTF-32 is also "dynamic" and suffers from the same complexities as UTF-8 does, but additionally wastes memory. I could implement UTF-8 support, but I would like to know what the chances are that such an addition gets accepted.

  9. Lukas Meindl reporter

    I'm obviously aware of the additional memory it takes up, as I have already referred to that multiple times. I think you are also confusing code points and code units.

  10. Lukas Meindl reporter

    I could implement UTF-8 support, but I would like to know what the chances are that such an addition gets accepted.

    High, if it works ;) Please consider that it will be another optional mode - this means that all UTF-8-String-specific code has to be guarded by preprocessor directives, inside and outside the String class.

  11. Lukas Meindl reporter

    Interesting, I will look into this more.

    Saying it suffers from the same complexities as UTF-8 is still wrong however, since there you have the additional complexity of code unit vs code point and all the troubles associated with that.

  12. Yaron Cohen-Tal

    Just want to point out that GCC 4.9 (which is used e.g. by Debian stable), as well as MSVC 2013, doesn't have the full Unicode support that C++11 provides. For example, MSVC 2013 doesn't support C++11 Unicode literals, and GCC 4.9 doesn't have the <codecvt> header. So if we want to support these compilers (and I think we should), this should be taken into account.

  13. Lukas Meindl reporter

    I knew about the literals (which btw don't affect us except for the samples), but GCC not supporting codecvt is news to me. This branch requires C++11, and <codecvt> is a non-optional part of C++11, source: http://en.cppreference.com/w/cpp/locale/codecvt

    So if they don't implement it then there is not much I can do for its users ): Not using codecvt is not an option because there is no better way - or rather, no other way at all - to convert.

  14. Henrik S. Gaßmann

    Indeed there are some common characters which were assigned their own code point, but this is only a subset of the possible combinations. Furthermore both representations (single code point vs multiple code points) are valid and thus might be provided by the user and have to be handled.

    Saying it suffers from the same complexities as UTF-8 is still wrong however, since there you have the additional complexity of code unit vs code point and all the troubles associated with that.

    What do you mean with additional complexity?

  15. Henrik S. Gaßmann

    Not using codecvt is not an option because there is no better way - or rather, no other way at all - to convert.

    Write the algorithms yourself - there is a clear standard, so it's not that hard (though unit tests are advised 😋 )
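
    For illustration, the encoding side boils down to something like this (a minimal sketch with no validation of surrogates or out-of-range values - exactly the corner cases the unit tests should cover):

    #include <string>

    //! Encode a UTF-32 string as UTF-8. Assumes every input code point is valid.
    std::string encodeUtf8(const std::u32string& in)
    {
        std::string out;
        for (char32_t cp : in)
        {
            if (cp < 0x80)                       // 1 byte:  0xxxxxxx
                out += static_cast<char>(cp);
            else if (cp < 0x800)                 // 2 bytes: 110xxxxx 10xxxxxx
            {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
            else if (cp < 0x10000)               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
            else                                 // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }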

  16. Yaron Cohen-Tal

    Not using codecvt is not an option because there is no better way - or rather, no other way at all - to convert.

    What do you mean? What do we do in branch "v0-8" then? Don't we use our own algorithms? And of course, we could also use an external lib, as said in the opening of this issue.

  17. Henrik S. Gaßmann

    PS: If you want to add a CEGUI::String type based on UTF-8 you are free to add this to our library at any time. Be aware that we won't want to link external libs or add someone's source code for this, and be aware of the issues regarding word wrapping etc.; this might not work out of the box since we probably use std functions all over that depend on single-byte code points in std::string.

    Switching from a fixed-width encoding to a variable-width encoding is quite some work indeed, but I think it's well worth the effort. But I wouldn't want to add yet another string type; I would like to use std::string as the backend, if that's OK. And I would like to integrate some of the UTF8++ code: it's licensed under the Boost Software License, so putting it into a separate header and retaining the license notice at the top of that header file completely satisfies the conditions of the license. Would that be acceptable? Are there any more requirements besides correctness and making it a compile-time option (and probably not causing too many #if's)?

    EDIT: My current workload won't allow me to provide an API Design proposal within the next 6 days.

  18. Yaron Cohen-Tal

    I gotta admit, that document has some good arguments, especially in "3. Counting coded characters or code points is important.". Still, UTF-32 is simpler and faster, and I doubt that strings are the memory bottleneck in many real-world programs, even on hand-held devices. So it's a really close contest between the two, imo.

  19. Lukas Meindl reporter

    Just to clarify this: we won't restart the discussion here about whether UTF-32, UTF-8 or even UTF-16 should be used exclusively, because we already discussed this internally among the developers not too long ago and came to a conclusion back then after a while of discussing. The conclusion was that there will be a UTF-32 based String for sure, and that it will be our default for now (as before). Of course, if you show us this can be done better, bug-free and more efficiently in UTF-8, we will gladly make that our default and maybe also throw UTF-32 away, who knows. The whole thing is not a political decision; it is based on what we believe is most likely to work well and efficiently and on what has worked in the past - there were also past experiences and considerations behind this, so it is not an entirely new discussion either.

    Regarding the link: I would like to point out that there are different use-cases for UTF encodings and that our use-case is a very specific one. Simply referring to links, which btw probably everyone who looked this up has already seen, is not helpful in this context ): If you want to point something specific out, do it by writing it down here so we can discuss it.

    Not the RAM storage itself is the problem. The bottlenecks are the CPU cache size and bandwidth of the CPU<->RAM bridge.

    Did you try it out yourself? Are there benchmarks? Did you try it using a UTF-8 String class inside CEGUI? This sounds like FUD to me. Without a lot of data to back it up this is not really helping us right now, is it?

    So if you really wanna know about UTF-8 performance benefits you will unfortunately have to create a String class and benchmark it using proper scenarios, and then it should be easy to convince anyone of your point, if it turns out to be true after all. The benchmark, of course, in order to be relevant, has to include String manipulation as we perform it regularly in CEGUI: word wrapping, multiple Strings that change every frame and are part of glyph rendering (which requires code points, not code units, obviously), etc.

    Last but not least: NO! I am not going to reimplement C++11 features in the default branch just because an old GCC compiler does not support this feature. This is lunacy! I know it COULD be done, but it makes absolutely no sense. Our default branch requires C++11 and the codecvt feature is a C++11 feature that is not optional, so we WILL rely on its existence unless (almost) NONE of the current popular compilers support it - which is not the case. That said, the default branch is also not meant for production, it is a development branch. The default branch (1.0) release will not happen in the foreseeable future considering our backlog. Therefore, if the C++11 feature will most likely be present in all compilers in the near future, it should definitely be used, in my opinion.

    @Henrik S. Gaßmann I don't know what exactly you mean by using std::string as your backend. Acting like every std::string is a UTF-8 string is fine as long as this is only done if this String is chosen in CMake. Users should be able to choose between the regular ASCII std::string, your UTF-8 std::string and finally the UTF-32 string. This will also allow us to compare them easily. You can use the preprocessor directives whenever you want to add UTF-8-string-specific code somewhere. The other String choices should of course remain untouched by this.

    So my suggestion is:

    #define CEGUI_STRING_CLASS_STD 1
    #define CEGUI_STRING_CLASS_UTF8 2
    #define CEGUI_STRING_CLASS_UTF32 3    // (formerly defined as CEGUI_STRING_CLASS_UNICODE)
    

    And then you can do whatever you want with the CEGUI_STRING_CLASS_UTF8 case. You may typedef String to std::string in that case if you want. But like I said, I would not want to remove the regular STD version.
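
    Selection could then look roughly like this (just a sketch, assuming the defines above; CEGUI_STRING_CLASS as the switch set from CMake is illustrative, not a final name):

    #include <string>

    // Hypothetical compile-time selection of the String type, driven from CMake.
    #if CEGUI_STRING_CLASS == CEGUI_STRING_CLASS_STD
    namespace CEGUI { typedef std::string String; }  // plain std::string, single-byte code points
    #elif CEGUI_STRING_CLASS == CEGUI_STRING_CLASS_UTF8
    namespace CEGUI { typedef std::string String; }  // std::string holding UTF-8 by convention
    #elif CEGUI_STRING_CLASS == CEGUI_STRING_CLASS_UTF32
    namespace CEGUI { class String; }                // the std::u32string-backed wrapper class
    #endif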

  20. Yaron Cohen-Tal

    Look, GCC 4.9 is not that old - it was released on April 22, 2014. It is used by Debian stable, and will be for about the next year and 5 months. It is also used by the latest stable Android NDK, and I think support for hand-held devices is becoming more and more important. C++11 is not fully supported by GCC 4.9 or MSVC 2013, and not even by MSVC 2015. I don't think we should blindly say we support only fully-C++11-compliant compilers. I think we should define exactly which compiler versions we support, and I suggest MSVC 2013, GCC 4.9 and probably Clang 3.5.

    Perhaps I could "grab" the <codecvt> code that u need from the GCC 5.1 sources, and make it usable with GCC 4.9.

    Of course, all of the above depends on when CEGUI 1.x is released. If it's gonna be more than, say, a year and a half, I'd say we go for GCC 5.2, MSVC 2015 and probably Clang 3.7.

  21. Lukas Meindl reporter

    Seriously, is there really a good argument for going way out of our way to do this if it is already supported in all updated compilers, and soon also in Ubuntu and Debian with the next stable releases? I am not convinced yet. Whatever we do, we would have to revert it to codecvt usage later anyway. This doesn't sound worth it to me.

  22. Christopher Beck

    Just want to comment -- C++11 is a bit different from old versions of C++ in the following sense. In the past, e.g. with the C++98 and C++03 standards, gcc, clang and almost every other compiler were very close together in their levels of compliance and largely conformed to the standards, with msvc being the obvious red-headed stepchild.

    In C++11 an incredible number of totally new features were added and compilers vary WIDELY in their conformance and which ones they implement. Most of them implement the important ones but lots of the obscure ones are pretty hit or miss.

    This leads to comments like this on cmake mailing lists and such:

    https://cmake.org/pipermail/cmake/2013-February/053636.html

    that usually testing for "C++11" support in a compiler is not particularly useful, you need to check for individual features.

    I also think that telling users they can't use CEGUI unless they build gcc 5 from source may seriously limit who will use it. Debian Stable is at gcc 4.9, Ubuntu Trusty is at gcc 4.8, and travis-ci still runs on Ubuntu Precise, which means you would be backporting gcc 5 to a system that ships with gcc 4.6... it will just make it significantly harder for most people running linux nowadays to build the lib.

    Whether you guys want to make a list of specific compiler versions you support, or, a specific subset C++11 features that you require and use, I mean that's up to you, but I would suggest that requiring full, 100% C++11 compliance is not a very practically-oriented goalpost. I think you would end up throwing out all but the bleeding edge versions of most if not all major compilers.

    It sucks if everyone on the planet has to scramble to find little UTF-8 encoder algorithms or roll their own, but that's more or less what everyone has been doing for the past 5-10 years. If #include <codecvt> is really portable and correct and works everywhere then yeah, that's clearly the way to go, but if GCC doesn't even have it until this year or something then IMO it seems safer to stay away from it until it's more mature. I mean, is the v0-8 UTF-8 stuff known to be broken or something? Is there a particular rush to get rid of it?

    Here's another post from a year and a half ago on the topic: https://github.com/chrismanning/jbson/issues/1

    I really like the design and quality of implementation of your library, and would like to use it in some of my projects. But the fact that Clang is currently the only C++14 compiler with the codecvt header included in the standard library makes it difficult to use jbson in any of my projects. Would it be possible to support a restricted subset of the functionality in the case that codecvt is not available?

  23. Lukas Meindl reporter

    I will put my changes into a branch and leave it there until Debian and Ubuntu officially support it.

    Regarding C++11 features: like I said, if they are supported by all major compilers in the foreseeable future then imo they are alright to use. We can just keep them in a branch until they are really supported. Of course that is not always optimal.

  24. Yaron Cohen-Tal

    Lukas:

    So you can do it.

    Well, we need to be accurate about out terms. You've asked:

    You can't upgrade GCC on Debian stable to a newer version?

    "Upgrading" a Debian stable package means installing a newer version of that package from the Debian stable repositories. In practice, I think this will always be the same version (even minor), plus only critical bug fixes. In that sense, GCC can't be "upgraded" to >4.9.x on Debian stable.

    Now, building GCC from source is possible, but it is a subtle and error-prone process, which may lead to a not-so-stable result. GCC has many dependencies: glibc, zlib, binutils, cloog-isl, gmp, isl, mpc, and mpfr. Not every combination of versions of those is valid with a specific version of GCC. GCC has many configuration options. Is building GCC possible? Yes. Is it something I'd use in a production build? Probably not!

    How bad would it be to be able to use either <codecvt> or our own implementations (which we have today)? That is, with "#if"-s of course.

  25. Paul Turner

    Without wanting to get tied up in a big debate about anything, I will just say a couple of things, which kind of reiterate what Lukas has already touched on. If anyone responds to what I put here, don't be surprised when I don't reply.

    First of all, this is the development branch; our 'we can do anything and it doesn't matter' code. This is where we can move the code towards our goals and not have to worry about anything, such as it not working on current compilers or what have you. Anybody using this version of the code has to accept the fact that we'll do things that seem 'terrible' from the outside, no ifs, no buts. That's just the way it works.

    As far as adding additional dependencies, including taking something and integrating it into the codebase where it will need to be maintained, this is something to avoid wherever possible, especially where the needs can be met by language features (regardless of current implementation levels). To those people who feel differently, perhaps your time is better spent getting those features implemented for compiler 'x'? That would benefit far more people in the long run.

    Finally, please remember that the odds of the project releasing this code as stable during the period when many of the compiler-related points mentioned here remain relevant are infinitesimally small; seriously, the chances are almost zero. Even when people had a lot of time to work on CEGUI, progress was pretty slow overall, and currently the guys have very limited CEGUI time.

    Basically, you have to look at the bigger picture.

  26. Lukas Meindl reporter

    Thanks for your input, Paul.

    @Yaron Cohen-Tal What will the #if be checking for? A new CMake variable is not an option. We could check against compilers, but that's also not too fun given the number we support. We could make it MSVC12+ only... But I would still hate carrying legacy code around just temporarily.

    Edit (clarification): In some months or so, in case it gets added, this legacy support code will HAVE to be removed. This is definitely not going into the 1.0 release. I understand that some of you guys won't be able to work on CEGUI default as easily anymore if I add the codecvt dependency, and I do appreciate your contributions, so I am considering adding this legacy support temporarily.

  27. Yaron Cohen-Tal

    Paul:

    First of all, this is the development branch; our 'we can do anything and it doesn't matter' code. This is where we can move the code towards our goals and not have to worry about anything, such as it not working on current compilers or what have you. Anybody using this version of the code has to accept the fact that we'll do things that seem 'terrible' from the outside, no ifs, no buts. That's just the way it works.

    Ok, but CEGUI 1.x will be released someday (right?..), and when it is, we do want extensive compiler support. So again, the question is when it's expected to be released.

    Paul:

    To those people who feel differently, perhaps your time is better spent getting those features implemented for compiler 'x'? That would benefit far more people in the long run.

    <codecvt> is already implemented in GCC 5.1, but not merged into GCC 4.9, probably because GCC 4.9 now accepts only bug fixes.

    Lukas: It's easy to check the "<codecvt>" availability automatically from CMake, and incorporate it into "Config.h".

  28. Lukas Meindl reporter

    Ok, but CEGUI 1.x will be released someday (right?..), and when it is, we do want extensive compiler support. So again, the question is when it's expected to be released.

    Like you said, newer GCC versions already have codecvt, so this support will exist, right?

    We clearly don't want extensive support for old compilers. That is the entire point of requiring C++11 in the first place, and it is the only way we can modernise the library and get rid of extra dependencies and redundant code. Now, I know there are features of C++11 that barely exist anywhere yet and those should not be relied on. But from what I see, codecvt is in all modern compilers by now, and just because Debian and Ubuntu have not been updated yet, there is no reason to say that GCC is no longer supported. Honestly, this issue ticket was the first time I heard that a modern major compiler did not have it.

    There is a clear reason to use codecvt over an external library or home-brewed solution: it will be universally available, reliable, requires no dependencies, won't randomly break in newer versions, and will work wherever it is available.

    Also, in this context I want to add that we definitely, definitely, definitely won't depend on boost.locale or anything like that, before anybody suggests it. And there is simply no good solution in the STL to work around codecvt; codecvt IS the solution.

    So my suggestion is that either I make my changes a branch off default and don't merge it until Debian has a new latest stable (which will most likely have a GCC supporting codecvt), or somebody patches preprocessor directives on top of my commits, which will all have to be marked deprecated so we can remove them before the 1.0 release.

    That said, it seems like Ubuntu does have a GCC 5.1 available? http://askubuntu.com/questions/618474/how-to-install-the-latest-gcurrently-5-1-in-ubuntucurrently-14-04

  29. Lukas Meindl reporter

    Lukas: It's easy to check the "<codecvt>" availability automatically from CMake, and incorporate it into "Config.h".

    Are you willing to make a patch? Be aware that most likely it will be reverted sooner or later since we don't want to have legacy code and directives all over the code.

  30. Yaron Cohen-Tal

    Lukas:

    That said, it seems like Ubuntu does have a GCC 5.1 available?

    I didn't say anything about Ubuntu, only Debian and Android...

    Lukas:

    So my suggestion is either I make my changes a branch off from default and merge it until Debian has a new latest stable (which will most likely have a GCC supporting codecvt) or somebody will patch preprocessor directives on top of my commits that will all have to be marked deprecated so we can removed them before 1.0 Release.

    The next Debian stable (now called "Debian testing") will have at least GCC 5.2, because it already does.

    I suggest that before you push your changes, you let me have a look at them, so I can be smarter about how I think it's best to proceed.

  31. Lukas Meindl reporter

    @Henrik S. Gaßmann Yes, I am updating all usages of sscanf to stringstreams in the process, so everything takes a bit longer. I am mostly done with this though, only a few sscanf's and printf's left ;)

    Also: http://stackoverflow.com/questions/33708892/why-is-there-no-definition-for-stdregex-traitschar32-t-and-thus-no-stdbas

    I am considering UTF-8 a bit more again, mostly because I don't remember what the actual troubles with it were (I know there are a couple..).

    Also, about the above link: we already use PCRE, and PCRE actually supports UTF-8, UTF-16 and UTF-32 for regexes. Currently my implementation transforms to UTF-8 for PCRE, since ours is set up for that.

  32. Henrik S. Gaßmann

    we already use PCRE, and PCRE actually supports UTF-8, UTF-16 and UTF-32 for regexes. Currently my implementation transforms to UTF-8 for PCRE, since ours is set up for that.

    Do you consider replacing PCRE with c++11 regexes?

  33. Lukas Meindl reporter

    @Henrik S. Gaßmann Yea, but the question is whether it is possible ;) see the related links in the SO question I posted above ;) not even UTF-8 is properly supported. Honestly, I do not understand the solution provided by rici:

    but you can use an external preprocessor with a Unicode database to create a byte-oriented regex from a regex with explicitly marked unicode codepoints.

  34. Henrik S. Gaßmann

    @Lukas Meindl rici proposes using a preprocessor which converts any unicode regex string to its byte-representation equivalent, so the comparison and matching don't need to be unicode-aware, as they happen on a byte-by-byte basis. This is definitely not an option - this isn't a project like Qt which can simply force its users to run another preprocessor on their sources.

  35. Henrik S. Gaßmann

    You might note that this doesn't solve the normalisation problem at all... which raises a more general question: is there any convention regarding unicode normalisation for CEGUI APIs?

  36. Lukas Meindl reporter

    Yea, I thought this wouldn't be a solution for us. Do you have any idea how we could make it work? I don't.

    Regarding the existence of a CEGUI normalisation convention: not that I know of; this entire topic was simply neglected until now, and the decisions were made years before I came. No one ever seemed to complain either. Also, afaik the normalisation depends on the locale. It seems to have worked for most people? Maybe some people just never complained to us despite having problems?

  37. Lukas Meindl reporter

    I have a stupid linker issue left before I can even start testing and no idea what the cause is - some problem with the dll export of a static codecvt converter... I am heavily delayed due to this.

  38. Henrik S. Gaßmann

    Also afaik the normalisation depends on the locale.

    I'm pretty sure that the stl isn't even aware that a concept like normalisation exists.

    EDIT: looks like rici agrees with me:

    The C++ standard library does not implement any Unicode normalization algorithm

  39. Lukas Meindl reporter

    Yea, but anyway, normalisation isn't really our main issue right now. The issue is whether regex can handle UTF-8, or UTF-32, at ALL. The wstring regex does not seem like an option to me. It seems messy and not really consistent across platforms, not really something anybody wants to have in a library ;)

  40. Henrik S. Gaßmann

    Different normalisation forms become problematic during string comparisons, so as long as no code points affected by normalisation are used, or everyone agrees to only use a certain normalisation form, everything will work well... otherwise you might duplicate certain map entries or be unable to find them even though they exist, and stuff like that.

  41. Lukas Meindl reporter

    @Henrik S. Gaßmann We will discuss ASCII-only regex support internally. The issue I see with this is that we have supported UTF-8 regexes so far via PCRE. A step back is worse than no step forward. We will make a decision internally.

    @Henrik S. Gaßmann Are you still interested in implementing UTF-8 support? Imo it might not actually be too difficult, and some of my UTF-32 rework could be reused. First we need to get that to work though ;)

  42. Henrik S. Gaßmann

    @Lukas Meindl I took a further look at the regex stuff: It looks like you will have to revise your regex strings a bit if you want to use c++11 regexes, because c++11 doesn't directly support the perl regex syntax (afaik posix extended regex syntax slightly differs from the perl flavor). Furthermore I want to add that if the regex engine is used to validate/process user input you will have to provide unicode support for obvious i18n/l10n reasons. From this I conclude that replacing pcre with c++11 regexes isn't a viable option. Why don't you abstract the regex functionality and provide multiple implementations? I also want to point out that PCRE2 was released this year...

    Regarding the utf8 implementation: I still want to write an utf8 implementation, but I'm busy and can only afford to work on this during the weekend.

  43. Lukas Meindl reporter

    @Henrik S. Gaßmann Yes, we would have to revise them, I was of course aware of that ;) but it is not really much work for us. For users who depend on Perl regexes, on the other hand, it might be a bit annoying.

    Furthermore I want to add that if the regex engine is used to validate/process user input you will have to provide unicode support for obvious i18n/l10n reasons.

    What do you mean? I thought you said earlier that 7 bits was enough for everyone. If you were being sarcastic then I really did not get that. For example, when it comes to number-related regexes I believe 7 bits is entirely enough; I am not aware of any place that uses something other than Arabic numerals, and even Roman numerals would work with 7 bits :D Afaik no one in regular life uses Hebrew numerals. If anyone uses archaic numerals then this is not worth supporting.

    What we could do is provide an interface for plugging in regex checking, or abstract it like you said. For our Spinner we could switch to using std::regex. This whole process would allow us to drop PCRE as a dependency. However, I currently have no clue what such an abstraction or interface should look like.
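
    Maybe something along these lines (a pure sketch with made-up names, not an existing CEGUI interface):

    #include <string>

    //! Hypothetical abstraction so the regex backend (PCRE, std::regex, ...) is pluggable.
    class RegexMatcher
    {
    public:
        virtual ~RegexMatcher() {}
        //! set the regular expression to match against
        virtual void setRegexString(const std::string& regex) = 0;
        //! return true if the given text matches the stored expression
        virtual bool matchRegex(const std::string& text) const = 0;
    };

    // Each backend then provides its own implementation, e.g.:
    // class PCRERegexMatcher : public RegexMatcher { /* wraps pcre_compile / pcre_exec */ };
    // class StdRegexMatcher  : public RegexMatcher { /* wraps std::regex / std::regex_match */ };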

    Cool, take your time.

  44. Paul Turner

    Regular expression support is already done via an abstracted interface in order to allow alternatives to the default PCRE. If I recall correctly it's not currently as friendly as our support for customisation elsewhere, though the work to make it so should be very minimal.

  45. Henrik S. Gaßmann

    Also linking breaks here right now

    @Lukas Meindl If you mean that std::codecvt<...>::id can't be resolved, you will want to look at this and that. I skimmed over the source code in xlocale and was able to confirm that our issue is directly related to those two. I guess you'll have to wait at least until the next VS Update.

  46. Henrik S. Gaßmann

    Anyways I just discovered this:

    namespace CEGUI
    {
    class String
    {
        [...]
        //! The UTF-8 / UTF-32 standard conversion facet
        static std::wstring_convert<std::codecvt<char32_t, char, std::mbstate_t>, char32_t> s_utf8Converter;
        [...]
    };
    }
    

    Which is pure evil and definitely a source of headaches in every multi-threaded environment utilizing your string class in combination with UTF-8 conversion.

  47. Henrik S. Gaßmann

    which is why I know about its support

    Oh, I knew that it doesn't work with VS2013 - it was one of the reasons I switched to VS2015 RC ASAP (please note that I tend to work with bleeding-edge stuff and consider most stuff that is older than a year as legacy 😉). I just didn't remember that you wanted to provide VS2013 support... But as I said, with a conditionally defined std::mutex it will even work on that legacy thing, at the cost of some performance...

  48. Lukas Meindl reporter

    If it is reasonable we would like to support as much as makes sense. Currently we don't use threads inside CEGUI, so it makes little sense to drop everything. I would prefer to only support VS2015 and GCC 5.1+, but what will our users think of that?

  49. Lukas Meindl reporter

    Another option for MSVC 2013 is to use TLS (Thread Local Storage).

    Not a fan of that idea at all. Besides, this has drawbacks too; I already looked into it. The point of working on default is not having to go out of your way for exactly such features, which are provided in all newer compilers; not having to make dirty workarounds to support older software; and not having to use any legacy stuff. All of this we will need to do once 1.0 is released anyway, and we already do it in our 0.8 support, which limits modernisation of the library a lot.

    I prefer modernisation over legacy support in 1.0

  50. Henrik S. Gaßmann

    Another option for MSVC 2013 is to use TLS (Thread Local Storage).

    Not exactly sure whether you mean the OS feature or __declspec(thread). On the VS2013 platform the latter only works with PODs and thus isn't an option. However, using the OS feature is quite some work, especially if you don't want to leak the object on thread destruction.

  51. Lukas Meindl reporter

    Well, you might not use them, but what about me the user who is using threads, your String class and things like boost locale or gettext?!

    Then you would access CEGUI from one thread only and everything is fine. Is there a serious benefit or use-case for accessing CEGUI from multiple different threads in alternation?

    (Info: I am not against making CEGUI compatible with multi-threading; I actually pushed the printf/sscanf replacement by stringstreams forward partially due to this, and in SharedStringstream.h you will find comments indicating that I want to add thread_local wherever needed ;) But figuring out the use-cases in a clear manner will help us get this thing right (or give up the idea).)

  52. Henrik S. Gaßmann

    Then you would access CEGUI from one thread only and everything is fine.

    You missed my point. Usually people don't want to use different string classes within the same project, so they will probably stick with yours if possible, and whenever this isn't possible (an external library requiring UTF-8 or whatever), they will convert from your UTF-32 string to something adequate (most likely UTF-8) and back; they will also probably use your toUtf8 method when they want to save a string to a file. None of this is guaranteed to happen within the same thread context, thus data races will occur.

  53. Lukas Meindl reporter

    I didn't miss your point. If they use CEGUI classes of any type, shape or name outside the one thread that is supposed to exclusively deal with CEGUI then I might consider this a potential design problem of the software in question. Except, of course, if we explicitly allow multi-threaded usage of CEGUI classes (which would be the optimum anyway, but then goodbye VS2013).

    Btw, what I meant by making the variable file-static is making it global but only inside one file (making it static inside a cpp file). I assume that won't help with multi-threading, am I right? We still need it to be thread_local for that purpose. But it might solve the other issue. I'm installing RC1 btw.

  54. Henrik S. Gaßmann

    I didn't miss your point. If they use CEGUI classes of any type, shape or name outside the one thread that is supposed to exclusively deal with CEGUI then I might consider this a potential design problem of the software in question.

    I generally agree with you, but you advertise your string class as unicode-compliant whereas std::string isn't, so guess what happens in smaller projects which don't want to use ICU/boost.locale just to add some basic unicode support. I mean, you can easily solve this with a big red warning in the documentation; but seriously, making a core class like String not thread-safe is imho negligent.

    I m installing RC1

    VS2015 RC1? If so, you should note that VS2015 RTM superseded the RC1.

    what I meant by making the variable file-static

    I did understand the technical aspect, but I wasn't sure which variable you meant 😉, I guess the wstring converter. If my assumptions are correct this won't solve your link error and of course you are right, this doesn't help with the thread safety issue.

  55. Henrik S. Gaßmann

    Sorry, looks like I misunderstood you, the VS2015 Update 1 RC is of course newer than the VS2015 RTM version, I initially thought you meant the VS2015 RC ^^

    EDIT: I will probably wait until VS2015 Update 1 RTM gets automatically shipped with the updater, because after skimming through the changelog I don't feel like really needing this update.

  56. Lukas Meindl reporter

    I did understand the technical aspect, but I wasn't sure which variable you meant 😉, I guess the wstring converter. If my assumptions are correct this won't solve your link error and of course you are right, this doesn't help with the thread safety issue.

    The converter, of course; the converter is the only thing in String that is (now) making it unsafe for threading, and making it thread_local is the simple and logically right thing to do with it. I agree with you on that, no doubt. And yes, this would probably solve absolutely nothing 🐌

    Well, clearly I should have said UPDATE 1 RC and not just RC1, which was incorrect.

    Anyway, I will see if the Update 1 RC fixes the linker issue; if not, then we have a problem in any case.

  57. Henrik S. Gaßmann

    Same shit with std::codecvt_utf8<char32_t>

    Well, that's just a logical consequence of

    template< class Elem, unsigned long Maxcode = 0x10ffff, std::codecvt_mode Mode = (std::codecvt_mode)0 >
    class codecvt_utf8 : public std::codecvt<Elem, char, std::mbstate_t>;
    

    And this: http://stackoverflow.com/questions/30765256/linker-error-using-vs-2015-rc-cant-find-symbol-related-to-stdcodecvt

    doesn't even seem to be covered by the C++ standard (see cppreference.com); as per the standard, codecvt should only be specialized for the various char types, but definitely not for any of the integer types.

  58. Lukas Meindl reporter

    doesn't even seem to be covered by the C++ standard (see cppreference.com); as per the standard, codecvt should only be specialized for the various char types, but definitely not for any of the integer types.

    Yes, I know; no idea where that guy got the idea that this workaround would work... On the other hand, the more I look, the more sources I find saying it works. I will try again this evening.

  59. Henrik S. Gaßmann

    Yes, I know; no idea where that guy got the idea that this workaround would work... On the other hand, the more I look, the more sources I find saying it works. I will try again this evening.

    In an isolated test case it does work, at least with VS2015; the more interesting question, however, is whether this works with GCC and Clang.

  60. Lukas Meindl reporter

    I would assume it doesn't; I would use preprocessor defines to apply the workaround only on VS2015 for this purpose. I am not a fan of this whole issue...

  61. Lukas Meindl reporter

    I tried with VS2012:

    • char32_t works fine in codecvt

    • file-static and class-static codecvt triggers an assert, they do not get initialised properly

    • creating a new codecvt on every conversion works but not an option

    • oddly enough, function-static codecvt works fine unlike class-static ones

    Any ideas?

  62. Henrik S. Gaßmann

    with VS2012 char32_t works fine in codecvt

    Yeah, because until VS2015 there wasn't a distinct char32_t type, but something like:

    typedef uint32_t char32_t;
    

    Which makes char16_t and char32_t completely unusable with streams.


    file-static and class-static codecvt triggers an assert, they do not get initialised properly

    oddly enough, function-static codecvt works fine unlike class-static ones

    I guess that the initialization depends on other library functionality which isn't guaranteed to be available before you enter main() (function static objects are initialized on first call, i.e. within the scope of main()). If I had VS2012 installed I would have taken a look at the implementation, but I don't want to pollute my PC with yet another VS installation.


    Any ideas?

    Plenty actually, but you won't like them 😉.

  63. Lukas Meindl reporter

    function static objects are initialized on first call, i.e. within the scope of main()

    I know. I should have been more specific: why would wstring_convert depend on anything that wouldn't be initialised on the first call to its constructor? Anyway, I am actually okay with a (thread_local) function-local static variable as well; after all, we get the same performance gain through that. It just doesn't feel as "right", but whatever works, works, right?
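
    I.e. something like this (a sketch; it assumes thread_local is available, which rules out VS2013):

    #include <codecvt>
    #include <locale>
    #include <string>

    //! One converter per thread, constructed lazily on first use: cheap repeated
    //! conversions without data races between threads.
    static std::string convertToUtf8(const std::u32string& utf32)
    {
        thread_local std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
        return converter.to_bytes(utf32);
    }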

  64. Henrik S. Gaßmann

    it just doesn't feel as "right"

    For me it doesn't make a difference, the concept of a static codecvt itself makes me feel bad. I mean preserving state between two independent conversions is just sick (yes, I know this is required for streams where you have to partially consume the input, but in all other cases this doesn't really seem to be a well suited behaviour)

  65. Lukas Meindl reporter

    @Henrik S. Gaßmann Yes, but if we do many thousands of string conversions we don't want to construct a wstring_convert every time, do we? That, to me, just seems sick as well :D

    I mean preserving state between two independent conversions is just sick

    Why is it sick? Is there any side effect or anything we need to worry about? Is there any difference between destroying the old converter, making a new one, converting and then looking at the new one's state, versus just using the old one and then looking at its state after the conversion?

    From the docu:

    The conversion state may be explicitly set in the constructor and is updated by all conversion operations.

    I can't find any good argument why this would affect us.

    @Yaron Cohen-Tal Yes, it can take a significant time, just like stringstreams are so much faster if you don't reconstruct them, which is why I made the SharedStringstream class. Personally, I have not tested this for wstring_convert (shame on me), but I based this on discussions I found on the internet. I can profile it quickly... brb

    @Henrik S. Gaßmann must be getting sick of the SharedStringstream class as well ;D Unfortunately I could not find a better way to solve this than using a static variable there as well, paired with helper functions. Feel free to find us a better solution.

  66. Henrik S. Gaßmann

    Does constructing "String::s_utf8Converter" really take a significant time?

    Implementation-dependent. Anyway, I think that if you issue so many calls to a unicode conversion utility during program execution that the construction of the converter state becomes significant, the software design is most likely flawed at some point.

  67. Lukas Meindl reporter

    Implementation-dependent. Anyway, I think that if you issue so many calls to a unicode conversion utility during program execution that the construction of the converter state becomes significant, the software design is most likely flawed at some point.

    You make a lot of assumptions there. Do you know how wstring_convert is built?

  68. Henrik S. Gaßmann

    Why is it sick? Is there any side effect or anything we need to worry about? Is there any difference between destroying the old converter, making a new one, converting and then looking at the new one's state, versus just using the old one and then looking at its state after the conversion? From the docu:

    The conversion state may be explicitly set in the constructor and is updated by all conversion operations.

    From cppreference do_out:

    The effect on state is deliberately unspecified.

  69. Lukas Meindl reporter

    Lol, VS2013's wstring_convert gives me range errors whenever I input a UTF-8 string as a char array. This is golden. I can store it as a std::string btw, so my input isn't crazy for sure.

  70. Lukas Meindl reporter

    Ok, I couldn't solve the issue and just used an ASCII string instead; the results should be comparable in any case. For 1000 conversions I measured 0.001s when using the static converter and 0.037s when reconstructing it each time.
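
    A measurement along these lines reproduces the comparison (a sketch, not the exact benchmark code used here):

    #include <chrono>
    #include <codecvt>
    #include <iostream>
    #include <locale>
    #include <string>

    int main()
    {
        const std::u32string input(64, 'a'); // plain ASCII test content
        const int iterations = 1000;

        // reuse one converter for all conversions
        auto t0 = std::chrono::high_resolution_clock::now();
        {
            std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
            for (int i = 0; i < iterations; ++i)
                conv.to_bytes(input);
        }
        // reconstruct the converter for every conversion
        auto t1 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < iterations; ++i)
        {
            std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
            conv.to_bytes(input);
        }
        auto t2 = std::chrono::high_resolution_clock::now();

        std::cout << "shared converter:        " << std::chrono::duration<double>(t1 - t0).count() << "s\n"
                  << "reconstructed converter: " << std::chrono::duration<double>(t2 - t1).count() << "s\n";
    }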

  71. Lukas Meindl reporter

    @Henrik S. Gaßmann I forgot to reply to this:

    Yeah, because until VS2015 there wasn't a distinct char32_t type, but something like: typedef uint32_t char32_t; which makes char16_t and char32_t completely unusable with streams.

    char16_t- and char32_t-based streams are not supported in C++11, as you probably know.

    But: in VS2013 char32_t is simply unsigned int. Fun times. And yes, it is an atrocity.

  72. Henrik S. Gaßmann

    You make a lot of assumptions there. Do you know how wstring_convert is built?

    I didn't when I wrote my last comment, but I educated myself and have to tell you that there is (at least in the VS2015 library) some crazy stuff going on. E.g. look at the codecvt<char32_t, char, ...> constructor:

    _BEGIN_LOCINFO(_Lobj)
        _Init(_Lobj);
    _END_LOCINFO()
    

    This means that if we construct a codecvt we will always change the locale to "C" (and construct multiple _Yarns btw) with _BEGIN_LOCINFO; do nothing with it during _Init(); and just destruct the whole thing with _END_LOCINFO(), which also switches back to the locale installed before... I think this unnecessary BS causes the performance degradation you experienced; the actual codecvt and wstring_convert constructions are trivial.


    For 1000 conversions I measured 0.001s when using the static converter and 0.037s when reconstructing it each time.

    Did you enable optimisations?

  73. Lukas Meindl reporter

    I figured out why VS2013 fails on my UTF-8 strings such as "fdsfsdfä中文": it internally converts them to some other format. I inspected the char array it produces and it is not UTF-8. According to this:

    http://www.nubaria.com/en/blog/?p=289

    We need the char literals like u8 for this to work. Fun times for everyone!

    Btw: my file encoding was UTF-8 the whole time, of course, and the compiler setting was set to Unicode. Damn Visual Studio!

    EDIT: Trying the test in VS2015 now with the "workaround" (herpderp)

    EDIT2: u8 before the string does not allow me to just write UTF-8 characters into the file and have them interpreted correctly even in VS 2015 (wtf srsly?)

  74. Henrik S. Gaßmann

    @Lukas Meindl Oh, you were using string literals... well, while writing the unit tests for utf8++ I had similar issues; I settled on loading the larger strings from a UTF-8 encoded file at runtime and converted the smaller ones with this tool to their code point representation - definitely not human-readable, but it works well.

  75. Lukas Meindl reporter

    @Henrik S. Gaßmann Thanks so much for the link, I was about to look for exactly such a tool. I think I will convert all the unicode samples to use the non-readable representations. I remember some people already had issues with the encoded text in the files in v0-8. Using this seems like a better solution, as also mentioned in the article I referenced in the last comment ;)

  76. Lukas Meindl reporter

    @Henrik S. Gaßmann http://utf8everywhere.org/ says UTF-8 without BOM makes it work (not just "UTF-8" encoding on its own). I tried it and it works.

    I just remembered this is not news to me; I already worked with this when I developed the newer CEGUI samples. I actually do not know why some people had issues with this ;)

    EDIT: http://stackoverflow.com/questions/5406172/utf-8-without-bom <-- should not be an issue unless people save it, maybe that is what happened in those cases...

    EDIT2: this guy had issues and it wasn't Visual Studio; the operating system converted the encoding of the files (WTF!): http://cegui.org.uk/forum/viewtopic.php?t=6769

  77. Henrik S. Gaßmann

    Yea I saw that while searching for these things but unfortunately we can't assume everyone has this :D


    USE THE ForceUTF8 VISUAL STUDIO EXTENSION IF YOU DARE TO TOUCH OUR HOLY SOURCE FILES


    Put this on top of your docs and you should be fine xD

  78. Lukas Meindl reporter

    @Henrik S. Gaßmann It is better to make it in a way that will run by default everywhere ;D even if it is unreadable. If they want to make it readable they will find ways on their own. Btw, your converter doesn't offer the \x representation; I'm looking for one that has it...

    EDIT: Can't find one, but the \x representation gets realllllllllyyy long. I am reconsidering this.

  79. Lukas Meindl reporter

    @Henrik S. Gaßmann I agree and disagree. XML seems overkill for this task; instead I will just shorten each text to a few characters or lines. I agree that baking them in is bad, so I will use escaped code units, such as \xE2\x88\x83y. What do you think of this approach?
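
    I.e. the samples would carry the text like this (a small illustration; the identifier names are made up):

    // U+2203 (the "there exists" sign) followed by 'y', written as escaped UTF-8 code
    // units - this compiles identically regardless of how the compiler interprets the
    // source file encoding:
    const char* const existsY = "\xE2\x88\x83y";

    // the readable equivalent only survives if the file is reliably stored and read as UTF-8:
    // const char* const existsY = "∃y";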

  80. Henrik S. Gaßmann

    XML seems overkill for this task

    @Lukas Meindl It is, but you have decent XML support as part of CEGUI, so it would be the go-to solution for the runtime string-loading approach. But I guess runtime loading is way beyond the samples' scope... But I wouldn't like to shorten the existing strings... Instead I would store them in a separate file as escaped code units and use the preprocessor to include them... This way the example files don't get polluted and the example itself won't be downgraded...

    EDIT: Someone should really write a tool for unicode code point representation conversion specifically suited for C++ programmers...

  81. Lukas Meindl reporter

    And then people look at it and are like "wtf, I can't just use plain text UTF-8 in CEGUI?". I guess I could add an explanation at the top of that file, though, to explain why the char strings look so weird...

  82. Henrik S. Gaßmann

    wtf, I can't just use plain text UTF-8 in CEGUI?

    If they think that way, they will sooner or later be bitten by C++'s bad unicode support anyway and have to educate themselves. Dealing with unicode, i18n and l10n in general without properly educating yourself is a bad idea anyway...

  83. Lukas Meindl reporter

    I changed the source code strings to escaped hexadecimal UTF-8 code units. I have problems with the current implementation of String using UTF-32 though, and VS2013 is a pain in the ass since I can't see the code points in the debugger.

  84. Lukas Meindl reporter

    @Henrik S. Gaßmann I don't know, because in VS2013 I can't look at the UTF-32 string and I just gave up debugging at some point. I am now rewriting our own old conversion function to see what I get; if it is messed up again I will look into it in VS2015.

    Btw, writing this conversion function is pretty easy - I don't really understand why the whole codecvt crap is necessary (ok, except for the locales, but I just stick with the default implementation...)

  85. Henrik S. Gaßmann

    writing this conversion function is pretty easy

    Indeed, I have written my own, too 😉. Don't forget to unit test your implementation, there are some nasty corner cases...

    ok, except for the locales

    AFAIK CEGUI doesn't make use of locales (which aren't really useful anyway (without an extension like boost.locale)).

  86. Lukas Meindl reporter

    Well, there is std::locale, and yes, we haven't really used them so far, so why would we suddenly care now?

    I am not planning on writing unit tests, since I am taking the original CEGUI implementation, cleaning it up without changing what happens, and then comparing it with the implementation of UTF8-CPP and other projects. I know this is not optimal, but hell, I just don't have the time right now. Do you have any specific unit tests you would urge me to do?

  87. Lukas Meindl reporter

    I pushed a new commit to the branch. It works now! The converter is based on our old converter. We could probably add checks for invalid code units and code points, just to be more conformant with the Unicode standard. I will look into that.
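
    For example, the code point side of the check could be as simple as this (a sketch, not the committed code):

    //! A code point is encodable if it lies within the Unicode range and is not a
    //! UTF-16 surrogate value (U+D800..U+DFFF), which must never appear in UTF-32/UTF-8.
    inline bool isValidCodePoint(char32_t codePoint)
    {
        return codePoint <= 0x10FFFF &&
               !(codePoint >= 0xD800 && codePoint <= 0xDFFF);
    }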

  88. Lukas Meindl reporter

    There are still some issues with console and log output that I have to look at. It is probably partially due to the stringstream-for-printf replacement that I did.

    @Henrik S. Gaßmann If you want to add UTF-8 support this would be a good point to begin working on it. I will be happy to assist and code on it with you if you want.

  89. Lukas Meindl reporter

    Some more issues with the XML parsers popped up and are now resolved. I might look into a UTF-8 implementation soon. It seems to make sense to do it now, as I already know the ropes.
