text / Data / Text / Encoding.hs

Author / Commit Message
Bryan O'Sullivan
Correct the documentation for streaming decoding
Bryan O'Sullivan
streamDecodeUtf8With: accumulate undecoded chunks correctly. We had previously gotten the accounting and reporting wrong if an incomplete input was fed in over the course of several continuations, such that we'd report only the incomplete input seen by the most recent continuation. This fixes gh-70.
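The situation described in this commit message can be reproduced by splitting one multi-byte character across several chunks. The following is a minimal sketch, not code from the repository: it uses `streamDecodeUtf8` and the `Decoding` constructor `Some` from `Data.Text.Encoding`, while `feedChunks` and `euroInPieces` are hypothetical names introduced here for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (Decoding(..), streamDecodeUtf8)

-- | Feed chunks one at a time through the streaming decoder. For each
-- chunk, record the text decoded so far by that step and the undecoded
-- leftover bytes that the step reports.
feedChunks :: [B.ByteString] -> [(T.Text, B.ByteString)]
feedChunks = go streamDecodeUtf8
  where
    go _ []       = []
    go k (c : cs) =
      case k c of
        Some decoded leftover k' -> (decoded, leftover) : go k' cs

-- | U+20AC (EURO SIGN) is E2 82 AC in UTF-8. Splitting it into three
-- one-byte chunks leaves the input incomplete across two continuations.
euroInPieces :: [B.ByteString]
euroInPieces = map B.singleton [0xE2, 0x82, 0xAC]
```

Loading this in GHCi and evaluating `feedChunks euroInPieces` shows the leftover reported at each step; per the commit message, the leftover after the second chunk should cover both pending bytes rather than only the most recent one. Once the third chunk arrives, the decoded pieces concatenate to the single character and nothing is left over.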
Bryan O'Sullivan
Tidy up imports
Bryan O'Sullivan
Drop a redundant import
Bryan O'Sullivan
Drop the old pure-Haskell implementation of encodeUtf8
Bryan O'Sullivan
Drop the Builder-based encodeUtf8 implementation. While it is very cool indeed, it is slower than the new C code under all circumstances, sometimes by a factor of two or more.
Bryan O'Sullivan
encodeUtf8_1: so long, it's been nice knowing you! Since encodeUtf8_2 wins under all circumstances, there's no reason to keep the intermediate version around.
Bryan O'Sullivan
encodeUtf8_2: cap the number of wasted bytes at 2x. This has the odd side effect of improving tiny-string performance from 20% slower than encodeUtf8_1 to about 5% faster. Never stop being weird, GHC optimizer!
Bryan O'Sullivan
encodeUtf8_2: a C-based encoding function. Not surprisingly, this is a lot faster than encodeUtf8_1 and the Builder-based rewrite under almost all circumstances. It's slower on tiny inputs (20%), but roughly twice as fast as encodeUtf8_1 on longer inputs.
Simon Meier
Improve small string performance for UTF-8 encoding to bytestrings. On a 5 byte string the conversion of strict text to a strict bytestring is still a factor 2x slower than the custom 'encodeUtf8_1' routine. However, this is much better than the factor 4.5x that we started with. I attribute the slowdown to the more expensive startup cost for the bytestring-builder-based solution. Note that this startup cost is shared in case a small string is encoded as part of a…
Bryan O'Sullivan
encodeUtf8_1: get my arithmetic right :-(
Bryan O'Sullivan
Export both encodeUtf8 variants
Bryan O'Sullivan
Drop now-redundant imports
Bryan O'Sullivan
encodeUtf8_1: drop an unnecessary type signature. The value that was having too general a type inferred is now a pointer, so inference doesn't accidentally overgeneralize.
Bryan O'Sullivan
encodeUtf8_1: drop a loop induction variable. This helps performance quite a bit! Now encoding Japanese text is 2x faster than encodeUtf8, as opposed to 30% faster before. Not bad!
Bryan O'Sullivan
Drop unused import
Bryan O'Sullivan
encodeUtf8_1: make available with both bytestring versions
Bryan O'Sullivan
encodeUtf8_1: a little cosmetic work
Bryan O'Sullivan
encodeUtf8_1: refactor the last loop body. This requires a bit more torturing to maintain performance. For some unknown reason, doing the same refactoring on go4 decreases performance on russian-small.txt by half!
Bryan O'Sullivan
encodeUtf8_1: refactor another loop body
Bryan O'Sullivan
encodeUtf8_1: refactor loop body
Bryan O'Sullivan
encodeUtf8_1: massively rework internals. The goal here is to avoid a buffer size check on every iteration, instead only doing one the first time we encounter some input that's larger than the buffer we preallocated. This helps performance rather a lot: we don't regress on the smallest inputs, but we are up to 35% faster than the previous version of encodeUtf8 on larger inputs.
Bryan O'Sullivan
encodeUtf8_1: hoist ensure up a level
Bryan O'Sullivan
encodeUtf8_1: refactor go to accept a pointer parameter
Bryan O'Sullivan
encodeUtf8_1: hoist poke8 up a level
Bryan O'Sullivan
Duplicate encodeUtf8 as encodeUtf8_1 temporarily
Bryan O'Sullivan
Merge pull request #63 from meiersi/polish-text-bytestring-builder-integration: Polish UTF-8 bytestring builder support
Simon Meier
Add back 'ensure 1' to avoid overflowing an output buffer. The counter-example for the existing code is a string of length '2*n' that starts with 'n' characters with codepoints in the range (0x7F, 0x7FF) and ends with 'n' ASCII characters. All 'n' ASCII characters will be written after the end of the output buffer.
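The counter-example above can be checked by counting bytes. This is a minimal sketch, assuming (as the "Drop some special-casing for ASCII" commit message in this log states) that the output buffer initially holds one byte per code unit of input. `counterExample` and `sizes` are hypothetical names, and U+00E9 is just one arbitrary character from the two-byte range.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- | Hypothetical helper: n characters that each need two bytes in UTF-8
-- (U+00E9 lies between 0x7F and 0x7FF), followed by n ASCII characters.
counterExample :: Int -> T.Text
counterExample n =
  T.replicate n (T.singleton '\x00E9') `T.append` T.replicate n (T.singleton 'a')

-- | (number of characters in the input, number of bytes in its UTF-8 encoding).
sizes :: Int -> (Int, Int)
sizes n =
  let t = counterExample n
  in (T.length t, B.length (encodeUtf8 t))
```

For `n = 4`, `sizes 4` evaluates to `(8, 12)`: the four two-byte characters alone fill an 8-byte buffer, so under the stated assumption the four ASCII bytes that follow would land past its end unless the encoder checks capacity first.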
Simon Meier
Polish UTF-8 bytestring builder support
- adjust function names to 'encodeUtf8Builder' and 'encodeUtf8BuilderEscaped'
- expose the same conversion to builders for both lazy and strict text
- ensure 'Escaped' versions are inlined to allow specialization for specific escaping primitives
- fix some Haddock references
- add Haddock comment about bytestring >= 0.10.4.0 dependency
- remove stream-to-builder encoding functions. There is no d…
Bryan O'Sullivan
Drop some special-casing for ASCII during UTF-8 encoding. I somehow forgot that we allocate the initial ByteString to contain the same number of bytes as the Text contains code units. This means that we never need to ensure that the ByteString is big enough, nor (with this observation) does a special-cased ASCII-only loop help performance.