Fast (de)compression of MLV files

Issue #29 new
Andrew Baldwin repo owner created an issue

mlv_dump supports LZMA compression of MLV frames, as well as 16bit padding and delta encoding of frames. This is capable of reducing frames to typically 60-70% of their original size. Unfortunately compression is quite slow (~1 sec/frame) and the resulting files can no longer be played back in real time.

There should be an alternative compression format which achieves similar gains with real-time compression, while retaining the capability for real time playback. In fact, since size-on-disk would be reduced, playback might even be easier.

A good candidate would be something like:

- Padding to 16bit
- Bayer-aware differencing of same-colour pixels in a row
- Encode full green channel, plus red-green or red-blue hues/deltas
- Entropy encoding with the new fast/efficient Finite State Encoder (~200MB/s/i7core encode speed)

This compression format should be added to mlv_dump for archiving and could be implemented in MlRawViewer for playback.

Comments (25)

  1. g3gg0

    this sounds indeed interesting and i would implement it too.

    do you have experience with this FSE? is there some standard implementation for it?

    i guess it's best for both to use the same implementation in C and link to it as a library.

    BR, g3gg0

  2. Andrew Baldwin reporter

    FSE is a very new Huffman alternative (a few weeks old!) available as a BSD-licensed C file. I already made a small python binding for playing with it, and it is very fast.

    See: https://bitbucket.org/baldand/pyfse and the links from there.

    I tried some experiments as described above, using per pixel differences instead of whole frame deltas, and could quite easily get to 70% compression ratio with a very simple row-based approach that would still allow random access of frames.

    I think compression speed would be 100-200MB/s on 1 core of an i7, so perfect for compressing while importing, or even for compressing a large existing archive (I have ~4TB already!). And decompression should be even faster, which should help with real-time playback enormously as the bottleneck is still disk read speed, and that would be reduced.

  3. g3gg0

    wow it is really small.

    i will check it this weekend. any plans about the exact coding that would fit best? or simply like that?

    input bayer
        R G R G R G R G R G R G 
        G B G B G B G B G B G B 
    
    pixel groups
        [R G1] R G R G R G R G R G 
        [G2 B] G B G B G B G B G B 
    
    pixel-delta: 
        rG = G1
        dR = 0x8000 + R - rG
        dB = 0x8000 + B - rG
        dG = 0x8000 + G2 - rG
    
    new pixel groups
        [rG dR dG dB] rG dR dG dB rG dR dG dB 
    

    why not use deltas between frames afterwards? in static scenes, you also reduce rG to the magnitude of pixel noise.
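
    The pixel-delta step sketched above can be illustrated in C (a minimal sketch; the function name `encode_group` is mine, and the 0x8000 offset keeps the deltas in unsigned 16bit range as described):

    ```c
    #include <stdint.h>

    /* Encode one 2x2 Bayer group [R G1 / G2 B] into [rG dR dG dB]:
       green G1 is the reference, the other three samples become
       0x8000-offset deltas against it. */
    static void encode_group(uint16_t r, uint16_t g1,
                             uint16_t g2, uint16_t b,
                             uint16_t out[4])
    {
        uint16_t rG = g1;
        out[0] = rG;
        out[1] = 0x8000 + r  - rG; /* dR */
        out[2] = 0x8000 + g2 - rG; /* dG */
        out[3] = 0x8000 + b  - rG; /* dB */
    }
    ```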

  4. Andrew Baldwin reporter

    If I understand it right, that's not quite what I had in mind.

    I think it might be more compressible to code the deltas between the same colour pixels in different groups. This could be done along a row, so you need to decode a whole row starting from x=0 (left).

    In addition, the hues (difference to green) should be coded as deltas to each other rather than deltas between blue or red since hue often stays quite uniform.

    As an example in a GB row:

    Original:

    G0, B0, G1, B1, G2, B2, ...
    

    Replace with deltas, apart from the first 2 pixels. Range is now 14bit + 1 sign bit.

    G0, G0-B0=dB0, G1-G0, G1-B1=dB1, G2-G1, G2-B2=dB2, ...
    

    Second round replacement. Hue values are now 15bit + 1 sign bit.

    G0, dB0, G1-G0, dB0-dB1, G2-G1, dB1-dB2
    

    To increase the reuse of symbols for small values, invert negative values and set their high bit e.g.

    0 => 0x0000
    1 => 0x0001
    255 => 0x00FF
    256 => 0x0100
    -1 => 0x8001
    -2 => 0x8002
    -255 => 0x80FF
    -256 => 0x8100
    

    Now, code the high bytes as one FSE symbol stream (it should have mostly values 0x0000 and 0x8000, plus small number of other values -> very compressible). Code the low bytes as a second FSE symbol stream (not so compressible, but small positive and negative values share same symbols, e.g. +1 and -1 both use 0x01, which helps).

    To reverse, first decode the two symbol streams and reassemble the 16bit values. Convert the sign bit back to the normal signed integer representation. Run through the row once in sequence to convert delta values back to G and dB. Run through the dB values a second time to convert back to B. Same for RG rows.

    The first values of the row could also be coded vertically as a delta starting from the top left corner value.

    I suppose delta between frames only helps for sure (beyond the above method of delta between adjacent pixels in areas of similar colour) when you are on a steady tripod. Otherwise it can be just another source of entropy, increasing the potential range to be coded by 1 bit.

    Did you do a study of how much it helps for different types of input data when you added that to mlv_dump?
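
    The sign mapping and the byte-plane split from the scheme above can be sketched in C. This is a hedged illustration, not the thread's actual code: the names `map_sign`, `unmap_sign` and `encode_gb_row` are mine, and for simplicity the predictor here is just the previous same-colour pixel, without the second hue-delta round.

    ```c
    #include <stdint.h>

    /* Sign mapping from the table above: negative values are
       inverted and get the high bit set, so +1 -> 0x0001 and
       -1 -> 0x8001 share the same low byte. */
    static uint16_t map_sign(int32_t v)
    {
        return (v >= 0) ? (uint16_t)v : (uint16_t)(0x8000 | -v);
    }

    static int32_t unmap_sign(uint16_t v)
    {
        return (v & 0x8000) ? -(int32_t)(v & 0x7FFF) : (int32_t)v;
    }

    /* Delta-code one GB row against the previous same-colour pixel,
       splitting results into a high-byte plane and a low-byte plane,
       ready for two separate FSE streams. The first two pixels are
       stored as-is. */
    static void encode_gb_row(const uint16_t *row, int width,
                              uint8_t *high, uint8_t *low)
    {
        for (int x = 0; x < width; x++) {
            uint16_t coded = (x < 2) ? row[x]
                           : map_sign((int32_t)row[x] - row[x - 2]);
            high[x] = coded >> 8;
            low[x]  = coded & 0xFF;
        }
    }
    ```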

  5. g3gg0

    hey,

    about the horizontal deltas: yeah, while driving home i also thought of that :) then frame deltas won't be necessary as we already reduced redundant data. i also had vertical delta encoding in mind, but that cannot be parallelized as well. guess we can sacrifice those few bits :)

    why not encode like: G0, B0-G0, G1-G0, B1-B0, G2-G1, ... on very fine structures, this would produce higher numbers. but for areas with similar colors, it should be more efficient.

    the delta encoding looks better, especially when splitting hi and lo bytes. good point :) so we have noise in the highest bit and in the lower bits. rearranging the bits so that the negative flag is in the lower byte could help. then sensor noise is encoded in the low byte only (-65..64), and larger numbers go into the high byte.

  6. Andrew Baldwin reporter

    I had a quick go at implementing something like this. So far only the compression, but I'll add also the decompression soon. It's called "bayz".

    https://bitbucket.org/baldand/mlrawviewer/commits/32cbc0ec7ce546efb3445bc42f02d780be19a321

    In a quick test on some quite dynamic raw data, it was giving me slightly better compression than mlv_dump's LZMA on 16bit, but with much quicker compression times. Build the test program and give it a .RAW file (not MLV) as first param - it will compress the first few frames and give stats.

  7. g3gg0

    i added two helpers. one encodes the numbers a bit more efficiently (at least on the test video). does it work for yours too?

    u16 encode(int val)
    {
        u16 encoded = 0;
    
        /* positive values dont get encoded, lowest bit is zero */
        if(val >= 0)
        {
            encoded = val<<1;
        }
        else
        {
            /* negative values get 0x80 set, offset to squeeze one bit more into (we dont need two representations of zero) */
            val = -val + 1;
            encoded = 0x0080;
            encoded |= (val & 0xFF80) << 1;
            encoded |= (val & 0x007F);
        }
        return encoded;
    }
    
    void split_bytes(u16 value, u8 *high, u8 *low)
    {
        *high = value >> 8;
        *low = value & 0xFF;
    }
    
  8. Andrew Baldwin reporter

    I integrated the patch and tried to write a decode function, but couldn't get it to work so it's currently defaulting to the original encoder.

    Surely +0x60 and -0x41 encode to the same value? 0x60 -> 0xC0, -0x41 -> 0xC0. Positive encoding should leave 0x80 == 0? Or did I miss something?

    Also, I added the full decoder now and it seems to be working.

  9. g3gg0

    you are right, i forgot the positive case :)

    i've added three variants.

    1) putting the sign in the highest bit

    2) putting the sign in the uppermost bit of lower byte

    3) putting the sign in the lowest bit

    variant 3 has slightly more compression with my test video

    u16 encode(int val)
    {
        u16 encoded = 0;
    
        if(0)
        {
            return (val>=0)?val:(0x8000|(-val));
        }
    
        if(0)
        {
        /* positive values dont get encoded, bit 0x80 stays zero */
            if(val < 0)
            {
                /* negative values get 0x80 set, offset to squeeze one bit more into (we dont need two representations of zero) */
                val = -val - 1;
                encoded |= 0x0080;
            }
    
            encoded |= (val & 0xFF80) << 1;
            encoded |= (val & 0x007F);
        }
    
        if(1)
        {
            /* positive values dont get encoded, lowest bit is zero */
            if(val < 0)
            {
                /* negative values get lowest bit set, offset to squeeze one bit more into (we dont need two representations of zero) */
                val = -val - 1;
                encoded |= 0x0001;
            }
    
            encoded |= val << 1;
        }
    
        return encoded;
    }
    

    that should be the right one. it's just about 2% data reduction, but that sums up ;)
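
    For reference, a matching decoder for variant 3 could look like this (a minimal sketch, assuming the `u16` typedef from the snippets above; the function name `decode` is mine):

    ```c
    #include <stdint.h>

    typedef uint16_t u16;

    /* Inverse of variant 3: the lowest bit carries the sign, and
       negative magnitudes were stored offset by one. */
    static int decode(u16 encoded)
    {
        int val = encoded >> 1;
        if (encoded & 1)
            val = -(val + 1);
        return val;
    }
    ```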

  10. Andrew Baldwin reporter

    Great! Integrated and working.

    2 & 3 were quite close for me on different tests. Sometimes 2 would be a few bytes ahead, sometimes 3...

    I set 3 as the default.

  11. Andrew Baldwin reporter

    Now I also removed the hue as a predictor. It seems it wasn't helping as I get consistently a few % better compression in all cases with using just the previous same colour on the line as the predictor.

  12. g3gg0

    very cool :)

    One thing - do you mind changing the code to omit pointer advancing and dereference directly using array[pos] instead of array++? Then we can do a for loop with index "x" from 0...width-1.

    Now when we also apply encode() to the reference values (the first ones written), we can simply do: (untested code, just written straight away)

    int last_g = 0;
    int last_b = 0; /* or set to r[0] to store the first B relative to G */
    for(int x = 0; x < width; x += 2)
    {
        int g = r[x];
        int b = r[x+1];
        split_bytes(encode(g - last_g), &high[x], &low[x]);
        split_bytes(encode(b - last_b), &high[x+1], &low[x+1]);
        last_g = g;
        last_b = b;
    }
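
    A decode loop reversing the sketch above could look like this (an untested sketch like the original, assuming encode() is the variant-3 sign-in-lowest-bit mapping; the names `decode_val` and `decode_gb_row` are mine):

    ```c
    #include <stdint.h>

    /* Merge the byte planes, undo the variant-3 sign mapping. */
    static int decode_val(uint8_t hi, uint8_t lo)
    {
        uint16_t enc = ((uint16_t)hi << 8) | lo;
        int v = enc >> 1;
        return (enc & 1) ? -(v + 1) : v;
    }

    /* Integrate the deltas back into absolute G and B values,
       mirroring the encode loop above. */
    static void decode_gb_row(const uint8_t *high, const uint8_t *low,
                              uint16_t *r, int width)
    {
        int last_g = 0;
        int last_b = 0;
        for (int x = 0; x < width; x += 2) {
            last_g += decode_val(high[x], low[x]);
            last_b += decode_val(high[x + 1], low[x + 1]);
            r[x]     = (uint16_t)last_g;
            r[x + 1] = (uint16_t)last_b;
        }
    }
    ```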
    
  13. Andrew Baldwin reporter

    Done (though there is still pointer incrementing between the inner loops).

    Also, I now made the first 2 columns code vertically with green estimated from the nearest sample 1 row up, and the R/B estimated from 2 rows up. Seems to save a few more bytes.

    I notice that with high ISO (very noisy) raw files, the overall size can still increase slightly compared to the packed 14bit version (1-5%). In those cases it would be best just to store an unencoded frame.

  14. Andrew Baldwin reporter

    I added some benchmark code that times encoding the same 16bit unpacked frame 30 times. On a 2.8GHz laptop i7 (one thread) I'm getting 160MB/s. On a 1GHz AMD C-50, I'm getting 40MB/s. That doesn't include the 14bit->16bit unpacking step.

    Next steps:

    • It needs a checksum to validate the content is good
    • Encoding could be broken into multiple sub-rectangles, e.g. 4 x 4 = 16. That would allow fully multi-threaded encode and decode, and in the case of a single-bit corruption in a file (detected by the per-sub-rectangle checksum), only that sub-rectangle would be lost.
    • Lossy compression with optional quantization step. My thinking is that for bright areas, the low bits of each delta step can be thrown away with no visual impact in post-processing, but lots of bit saving. It would need subsequent deltas to be corrected to keep each pixel close to correct. This would need at least the black level to be supplied to the encoder so the importance of each bit can be estimated. Lossy bitstreams would be decoded identically to lossless, they just have fewer bits in (fewer quantized delta values).
  15. g3gg0

    another idea, about noisy footage: is it of any help to blur the previous image (or line/previous pixels, to stay randomly accessible), efficiently compress that as reference and store the deltas (noise)?

  16. g3gg0

    tried that by averaging the last 2 pixels of each colour and it only improved compression with your pseudo-noise test frame (1.3% less = 103.897%); for the others it caused a negative effect.

    checksum -> i would rather add a verify step. but we can also spend those 4 bytes.

    rectangle -> any advantages over "block"ifying e.g. 32 lines? grouping lines would be much easier to handle. parallelizing is very easy - just store the first pixel absolute and every line can be decoded separately.

    lossy -> log encoding :)

  17. g3gg0

    okay some other idea - what about debayering, H.264 / JPEG encoding every single frame with quite low quality, storing this along with the frame and using this as reference for every single frame.

    then we only store the difference between the encoded image and the real raw bayer data, which might be compressible much better.

    this also provides an already playable sidecar video :)

  18. Andrew Baldwin reporter

    I had a first go at a lossy compression option using an optional quantizer.

    By giving e.g. "2 2" as additional args, the compression ratio can go from 60% (lossless) down to 30-40% (lossy), while still looking almost the same. Artifacts, where visible, are mostly on vertical edges. There are some problems in very bright areas at the highest compression levels.

    Lossy and lossless bitstreams are decoded with exactly the same decoder at the same speed.

  19. Andrew Baldwin reporter

    Some progress. The FSE lib now has a new U16 API which should be faster and better for this use case, once the bugs are out of that. I hope to soon rework the bayz compression to use that mode.

    After that, and once it all seems to be working ok, I will make a pull request against mlv_dump adding it as a new alternative compression format (and then also add compressed MLV support to MlRawViewer).

  20. Andrew Baldwin reporter

    I'm rethinking the approach here (now that I worked on DNG exports).

    It might make more sense to try to reproduce the lossless JPEG compression being used by DNG. Pros/cons:

    Pros:

    • It's a 20 year old standard documented in the original JPEG spec
    • It's used by DNG as the lossless compression option
    • It supports up to 16bit data, lossless
    • It supports multiple components to help deal with CFA pattern (not sure how well?)
    • Compression level will probably be quite similar to lossless bayz, since it also uses a DPCM approach with multiple predictors
    • It might also take an extra quantizer stage for slightly improved compression with controlled per-pixel error
    • Compressed blocks in MLV could be copied as-is into DNGs
    • ML could write compressed DNGs from all current tools, or in camera
    • It's also used in Canon's CR2 format

    Cons:

    • I can't find an existing suitable library to start from.

    dcraw includes a very compact/terse decoder which is used for DNG and CR2 reading. IJG libjpeg does not support it. There was a mythical patch adding it to libjpeg62, but I can't find it. There was an open source implementation from Cornell, no longer online. ImageMagick has it in its JPEG code, which is tied into the rest of ImageMagick. The DNG SDK includes one based on some of these old implementations, but under Adobe's license.

    Maybe the best thing here would be to make a new small library just for lossless jpeg encode/decode. It should be ARM friendly, suitable for multithreaded operation, and produce data for DNGs that can be decoded with dcraw and Adobe tools.

  21. Andrew Baldwin reporter

    Those are all for JPEG-LS (based on a patented HP algorithm), which is a different/later standard from the original (and for a long time not widely used) lossless mode of JPEG.

    But it's the original JPEG lossless that DNG is using.

    The original JPEG lossless mode is much simpler even than normal JPEG. It's just a predictor based on 3 nearby pixels, plus Huffman coding.
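
    For reference, the seven predictors of the original JPEG lossless mode (ITU-T T.81, Annex H) can be sketched as follows, where a is the sample to the left, b the one above, and c the one above-left (function name `predict` is mine):

    ```c
    #include <stdint.h>

    /* The seven predictors from the original lossless JPEG mode
       (selection values 1-7). The encoder entropy-codes the
       difference sample - predict(sel, a, b, c). */
    static int predict(int sel, int a, int b, int c)
    {
        switch (sel) {
        case 1: return a;
        case 2: return b;
        case 3: return c;
        case 4: return a + b - c;
        case 5: return a + ((b - c) >> 1);
        case 6: return b + ((a - c) >> 1);
        case 7: return (a + b) >> 1;
        default: return 0; /* selection 0: no prediction */
        }
    }
    ```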
