Experiment with Morton ordering of voxel data

Issue #61 open
David Williams created an issue

Each PagedVolume::Chunk currently stores its data in 'linear' order, but using Morton order instead would bring a couple of advantages:

  • Locality of access: On average, the neighbors of a voxel should be closer together in memory.
  • Ease of downsampling: A crude approach to downsampling is to simply average together groups of eight voxels into a single voxel. When Morton encoding is used, these groups of eight voxels always lie adjacent in memory, meaning that the downsampling can be done with a single pass over the data in the order that it is stored in memory (see the sketch after this list).
  • Some algorithms need random access: Operations like raycasting (which may be used heavily for ambient occlusion calculation) can sample the neighbors in any/every direction and so may benefit from Morton ordering.
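
As a rough illustration of the downsampling point above (a sketch only, not existing PolyVox code, and assuming simple density-style voxels that can be averaged arithmetically):

    #include <cstdint>
    
    // With Morton order, every group of eight consecutive voxels forms a 2x2x2 block,
    // so halving the resolution is a single linear pass over the source data. The
    // destination index i / 8 is then also in Morton order for the smaller volume.
    void downsampleByAveraging(const uint8_t* src, uint8_t* dst, uint32_t srcVoxelCount)
    {
        for (uint32_t i = 0; i < srcVoxelCount; i += 8)
        {
            uint32_t sum = 0;
            for (uint32_t j = 0; j < 8; ++j)
            {
                sum += src[i + j];
            }
            dst[i / 8] = static_cast<uint8_t>(sum / 8);
        }
    }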

There are also disadvantages:

  • Index calculation is harder: For a given (x, y, z) position we need to compute the index into the chunk's array. This is easy with the linear approach (a couple of adds and multiplies) but requires complex bit shifting in the Morton case. In principle the improved locality is supposed to make up for the computational overhead. Rather than doing the bit shifting every time, a lookup table is generally considered to be the fastest approach (see the sketch after this list).
  • Some algorithms need linear access: The surface extractors work one slice at a time and in general I think this makes a lot of sense. This means they may be better suited to a linear layout. That said, profiling has indicated that normal calculation is actually one of the bottlenecks and in this case Morton ordering may help. The MC extractor needs to be optimised anyway as it can touch the same voxel multiple times, and also calls the convertToDensity() function multiple times. It should probably create a slice of densities, fill it once, and use that for both vertex and normal generation.
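
To make the index calculation difference concrete, a minimal sketch (not PolyVox code) of the two approaches, with the Morton version written as a naive bit-interleave rather than the lookup-table approach discussed in the comments:

    #include <cstdint>
    
    // Linear layout: a couple of adds and multiplies (x varies fastest in memory).
    uint32_t linearIndex(uint32_t x, uint32_t y, uint32_t z, uint32_t sideLength)
    {
        return x + y * sideLength + z * sideLength * sideLength;
    }
    
    // Morton layout (naive version): interleave the bits of x, y and z.
    // sideLengthPower is log2 of the chunk side length.
    uint32_t mortonIndex(uint32_t x, uint32_t y, uint32_t z, uint32_t sideLengthPower)
    {
        uint32_t index = 0;
        for (uint32_t bit = 0; bit < sideLengthPower; ++bit)
        {
            index |= ((x >> bit) & 1u) << (3 * bit + 0);
            index |= ((y >> bit) & 1u) << (3 * bit + 1);
            index |= ((z >> bit) & 1u) << (3 * bit + 2);
        }
        return index;
    }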

Comments (18)

  1. Matt Williams

    I did start looking at this a while back. I started writing an issue which I never submitted, but I'll paste the text below:


    I know this is something you've been thinking about for a while now and so I thought I'd start doing some preliminary tests to see how much of a change it would be and how much of a performance improvement it could be.

    I've done some preliminary tests which I will push to a branch. This work isn't really intended for usage, more to just save the tests so far.

    • Leave chunk order alone (they're unordered anyway) and just implement inside chunks.
    • Changes are confined to PagedVolume::Chunk and PagedVolume::Sampler
    • I've had to add some extra API to enable better encapsulation
    • Initial tests show worse performance but this is likely due to lots of explicit z-y-x loops which are now very cache-inefficient.

    I propose to do this work from the outside-in. Something like:

    1. Add operator++() to Sampler to move in memory order and add functions to get the current sampler's position
    2. Remove all pointer arithmetic from the Sampler and add functions to Chunk to do the transforms (like in the feature/morton-order-tests branch)

    These two steps just add indirection and so will make stuff slower (though in my tests only a few percent). Once the data layout is encapsulated like this, it is possible to change the data layout of Chunks easily. So far this would be a mostly API safe change and all it would do is make things a little slower (indirection and cache misses).

    3. Switch all our z-y-x loops to use the iterator protocol with operator++. Hopefully this will resolve the slowdown and we will be faster than the original.

    A snippet of the important part of this branch, followed by the full patch of the work I did, is now at https://bitbucket.org/snippets/volumesoffun/jn7y

    I implemented this in PolyVox by putting an indirection layer inside PagedVolume::Chunk. I then changed every access of the data to go through a call to Chunk::convertCoordinates (there's lots of direct access inside the samplers).

    The main reason I post this is that I found that using a lookup table for the Morton conversion helped speed things up. Alternatively, it's possible to generate the lookup table at compile time with some template magic (or even better some constexpr magic in the future) as shown at http://ideone.com/wHEFTu

    Like I say, I found a big slowdown coming from the many explicit z-y-x loops in the code which became cache-inefficient. I guess it just needs some thought as to the best way to handle it.

  2. David Williams reporter

    Thanks Matt, that's really useful stuff.

    Let's start by talking about a test case. Over the last couple of weeks I've done quite a lot of profiling to see where the time goes when reading a voxel. There are three stages: first we identify the chunk which contains the requested voxel, then we compute the (linear) index of the voxel in that chunk, and then we read it. All three of these stages contribute to the run time and they are all worth optimizing.

    For the purpose of Morton codes, we are currently only discussing how the data is ordered in each chunk. This affects stage two and three above due to different index calculation and different memory locality. Therefore I have made a test case which stores a single chunk and does tens of millions of local/random reads from it. This is not representative of the real system behaviour but does let us test linear/morton performance in isolation.

    I'll also say there are many variables, such as chunk size, degree of locality of accesses, 64 vs 32 bit systems, etc. But I do have some preliminary results.

    Firstly I implemented the bit-twiddling approach as described on the ForceFlow blog post:

    http://www.forceflow.be/2013/10/07/morton-encodingdecoding-through-bit-interleaving-implementations/

    This gave a performance which was roughly four times slower than using the linear memory layout - not a good result. I then tested the look up table approach which he suggests, and for which you gave the tables in the code snippet you linked. Performance here was much better but still roughly 40% slower than linear memory access.

    However, we can do better than this because we can limit the range of input values. The lookup tables are only for converting a single byte, and the ForceFlow blog applies these repeatedly to convert 32-bit inputs into a 64-bit output. But for our purposes it is perfectly reasonable to limit the side length of the chunks to 256 (the default value is only 32), and in this case we can use each lookup table just once, and without all the shifting. It is as simple as:

    uint32_t index = morton256_x[uXPos] | morton256_y[uYPos] | morton256_z[uZPos];
    

    At least, I think this is valid. With the above approach access to the Morton encoded data is about 10% faster than with the linear layout. Success!

    It may be that we can do better still. 256 is actually quite big for a chunk size (I think) and if it were e.g. 32 then we'd only need 32 entries in each lookup table, reducing the cache pressure. This is where your suggested compile-time generation of the tables could be valuable but I haven't looked into it yet.
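
    For reference, a rough sketch of how per-axis tables for an arbitrary (power-of-two) side length could be built at start-up. The helper below is hypothetical and only illustrates the idea behind the morton256_x/y/z tables from the ForceFlow post:

    #include <cstdint>
    #include <vector>
    
    // Build a lookup table for one axis. 'shift' is 0 for x, 1 for y, 2 for z
    // (the position of that axis within each group of three interleaved bits).
    std::vector<uint32_t> buildMortonTable(uint32_t sideLength, uint32_t shift)
    {
        std::vector<uint32_t> table(sideLength);
        for (uint32_t i = 0; i < sideLength; ++i)
        {
            uint32_t spread = 0;
            for (uint32_t bit = 0; (1u << bit) < sideLength; ++bit)
            {
                spread |= ((i >> bit) & 1u) << (3 * bit + shift);
            }
            table[i] = spread;
        }
        return table;
    }
    
    // Usage for a side length of 32:
    //   auto mortonX = buildMortonTable(32, 0);
    //   auto mortonY = buildMortonTable(32, 1);
    //   auto mortonZ = buildMortonTable(32, 2);
    //   uint32_t index = mortonX[x] | mortonY[y] | mortonZ[z];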

    So, random access is better, but next up is the question of samplers. As you identify, we have lost the ability to use simple pointer arithmetic on the current voxel to find the neighbours which are above, below, or next to it. This was useful functionality, used for filtering, normal calculation, etc. However, even without this I think the samplers can be useful because they also (can/could) keep track of the current chunk, and as mentioned earlier this is also a significant part of accessing a given voxel. Perhaps we can also keep track of the 'split up' versions of the x, y, and z positions to make them faster to combine (no need to consult all the lookup tables). I haven't done much with this yet.

    Additionally, there are tricks to combine Morton keys, such as http://stackoverflow.com/a/18529061 (see comments) and http://stackoverflow.com/a/9377178 Again, I'm not sure if this is useful.

    Sorry for such a long post! I also should have told you what I was working on before diving in and taking over this. But in summary Morton keys appear to be beneficial for random access but I'm not entirely sure of the effect on the samplers yet.

  3. Matt Williams

    Great work with the benchmarking so far. I'm glad that we can get an improvement at least on this simple test case.

    For the peeking samplers, I've just been playing with the binary numbers and there's a quite simple way to do the pointer arithmetic. For a given (x,y,z) which has a Morton position p; if we want to peek at (x+1,y,z) then the amount by which p must increase, Δp, is dependent only on x and is independent of y and z. This same logic holds for peeking in y and z.

    Looking at just x and for a block size of 256^3, Δp can be found in a lookup table 255 elements long (calculated at http://ideone.com/wHEFTu):

    std::array<uint32_t, 255> dp = {1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,3511,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,28087,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,3511,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,224695,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,3511,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,28087,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,3511,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,1797559,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,3511,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,28087,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,3511,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,224695,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,3511,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,28087,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,3511,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1};
    

    and so we can find the new Morton position with:

    uint32_t positionForpx0y0z(uint8_t x, uint32_t p)
    {
        return p + dp[x];
    }
    

    The values of Δp for y and z are 2 and 4 times as large respectively so we can just do:

    // Note: the table is indexed by the coordinate being stepped, so for a +y peek
    // we pass the current y, and for a +z peek the current z.
    uint32_t positionFor0xpy0z(uint8_t y, uint32_t p)
    {
        return p + 2*dp[y];
    }
    
    uint32_t positionFor0x0ypz(uint8_t z, uint32_t p)
    {
        return p + 4*dp[z];
    }
    

    and we can do similar logic for the negative peeking.

    We could compute this table at compile time since it follows quite a simple binary pattern but that can wait until later.
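
    For what it's worth, a small sketch (again not branch code) of how the Δp table could be generated, which also makes the 'simple binary pattern' explicit: each entry is just the difference between the spread-bit forms of x+1 and x.

    #include <array>
    #include <cstdint>
    
    // Spread the bits of an 8-bit value to positions 0, 3, 6, ... (the x component
    // of a 3D Morton code).
    uint32_t spreadBits(uint32_t v)
    {
        uint32_t result = 0;
        for (uint32_t bit = 0; bit < 8; ++bit)
        {
            result |= ((v >> bit) & 1u) << (3 * bit);
        }
        return result;
    }
    
    // dp[x] is the memory distance from (x, y, z) to (x + 1, y, z) in Morton order.
    std::array<uint32_t, 255> buildDeltaTable()
    {
        std::array<uint32_t, 255> dp;
        for (uint32_t x = 0; x < 255; ++x)
        {
            dp[x] = spreadBits(x + 1) - spreadBits(x);
        }
        return dp;
    }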

    I expect that using this lookup table rather than just recomputing the Morton position of (x+1,y,z) is faster but I guess that is something we will have to test. Looking at the table above, you can see how the Morton order democratises the three directions. Previously, adjacent x would always be adjacent in memory but y would be chunkSideLength away and z would be chunkSideLength^2 away. With Morton encoding, 50% of adjacent x lookups are adjacent in memory, 50% of adjacent y lookups are 2 positions away in memory and for z 50% are only 4 away. The worst case is worse with Morton but by that point you're probably outside the cache anyway.

    The big question is whether doing the table lookup and multiplication is much slower than the previous method of just multiplying by chunkSideLength once or twice and whether that consumes any improvement due to cache locality.

    Finally, it's possible that it makes a difference in which order you sample the local 26 neighbours when running an extractor. I guess you would want to grab those nearest first to delay the first cache miss/invalidation as long as possible. Perhaps it's just a non-issue though.

  4. David Williams reporter

    Wow, this is a really great observation, and could make the samplers nice and fast. Did you find it somewhere or did you come up with it yourself? A source would be handy if there are any diagrams, etc to make it clearer.

    Is the lookup table different for different chunk sizes? My gut instinct is that it would be, but running your code the 32 entry table is just a shorter version of the 256 entry table.

    Next, are you sure the same process works when moving in a negative direction? I half-feel that maybe a separate table is needed here.

    The multiplication by two and four can be replaced by shifts, but the compiler will probably do that if it helps.

    It seems to depend on knowing the current (x,y,z) position of the voxel in the chunk. Actually the sampler currently tracks the position in the volume, but this can be changed if required.

    I quickly tried to implement this in the peekVoxel1px0py0pz() method on the features/morton-encoding branch:

    template <typename VoxelType>
    VoxelType PagedVolume<VoxelType>::Sampler::peekVoxel1px0py0pz(void) const
    {
        if(CAN_GO_POS_X(this->mXPosInVolume) )
        {
            const int32_t uXChunk = this->mXPosInVolume >> this->mVolume->m_uChunkSideLengthPower;
            const uint16_t uXPosInChunk = static_cast<uint16_t>(this->mXPosInVolume - (uXChunk << this->mVolume->m_uChunkSideLengthPower));
            return *(mCurrentVoxel + dp[uXPosInChunk]);
            //return *(mCurrentVoxel + 1);
        }
        return this->mVolume->getVoxel(this->mXPosInVolume+1,this->mYPosInVolume,this->mZPosInVolume);
    }
    

    Unfortunately it crashes as I think adding 'dp[uXPosInChunk]' pushes the pointer out of bounds. Did I misunderstand how to use it?

    Ok, I've just had a quick look so far but will try to let this sink in. Great stuff!

  5. Matt Williams

    This was my own observation. The network was down at Uni this morning so I was passing the time :) I think that if we write a blog post about this stuff (or at least some docs) then it would indeed be useful to have a diagram of it. It's easy enough to demonstrate in 2D and then the conversion to 3D was just done in code. I will try to put together a visual explanation.

    The table does change a little for different chunk sizes but due to the topology of the Morton order, if you halve the chunk length then you can just use the first half of the table (we just grab the first octant of the volume which in Morton-space is a single contiguous block). Thus, for a chunk size of 32, I think you can just grab the first 31 entries as:

    std::array<uint32_t, 31> dp = {1,7,1,55,1,7,1,439,1,7,1,55,1,7,1,3511,1,7,1,55,1,7,1,439,1,7,1,55,1,7,1};
    

    For negative directions, there is a slight change needed as dp[x] is the memory distance from (x,y,z) to (x+1,y,z). To reverse this you need to use the memory delta from the lower of the two x-positions:

    uint32_t positionFornx0y0z(uint8_t x, uint32_t p)
    {
        return p - dp[x-1];
    }
    

    As seen in the link to my patch earlier, in my experiments with Morton I added:

    std::shared_ptr<Chunk> mCurrentChunk;
    Vector3DUint16 mPosInChunk;
    

    to the PagedVolumeSampler since I found they were indeed very useful to cache.

    For testing this, I suggest comparing my method to the standard "recompute Morton position based on xyz". Perhaps print them both in the function body. Assuming that CAN_GO_POS_X is still a valid check, I can't immediately see why it would go beyond the bounds. Perhaps there's just an off-by-one error somewhere.

  6. David Williams reporter

    I got it to work - the issue was simply that I hadn't adjusted the initial index to be in Morton order in the Sampler::setPosition() function.

    I've added the use of your delta lookup table to the peekVoxel1nx0py0pz() and peekVoxel1px0py0pz() methods (peeking in positive and negative x), and I will add the other 25(!) soon. Then we will be able to see how the performance compares to the raw 'fall back to Volume::getVoxel()' approach and also to the linear version.

    Once that is done we will hopefully know that Morton is faster in all cases, but there are further optimizations which can be made, and which could probably have also been made to the linear version but I didn't get around to them. For example, now that we are tracking the position inside the chunk it is easier to check if we are on a chunk edge (that is, we can simplify the CAN_GO_POS_X(), etc. macros). In fact, even if we are on a chunk edge perhaps the index in the chunk is still valid (or can be made valid with modulus?), so we would only need to compute the new chunk pointer rather than doing a whole trip through Volume::getVoxel() (which computes from scratch both the chunk pointer and the index).

    There also might be potential for a templatized peek<x,y,z> function to peek arbitrary distances, or to remove the 27 separate functions altogether. We'll have to think about this though and make sure it doesn't affect performance.

    Let's get the code comparable in structure to the linear version first, check if the performance is better, and then go to town with clever tricks like this :-)

  7. David Williams reporter

    I implemented all the other peeking functions last night. Overall performance is very close to the linear version but perhaps 1-2% slower (though as mentioned it's faster for random access). I guess this is not surprising - for example when peeking in the positive x direction the Morton version has to do the array lookup and add the result whereas the linear version just adds 1.

    Nonetheless, the Morton version should scale better with larger chunk sizes and has a number of other benefits (the compression, ease of downsampling, etc.) so I'm still in favour of running with it. I will make some more optimizations and may still beat the linear performance, but the extra optimizations may also have been applicable to the linear case anyway.

    Also, I realized we can probably do the index offset trick without the lookup table. Essentially we can take the 'split-up' x component (for example), 'un-split' it, add one, and then split it again. I believe the lookup table just stores the result of this operation for each x? There is also a trick which lets us add one directly in Morton space, without doing the whole unsplit/split process (sketched below). It still needs some logic though, so I suspect the lookup table will be faster.
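
    For the record, a minimal sketch of that 'add one without unsplitting' idea. The mask assumes x occupies bits 0, 3, 6, ... as in the dp table above, and this is the standard carry-ripple trick rather than anything from the branch:

    #include <cstdint>
    
    // Component mask for a 32-bit 3D Morton code: x sits in bits 0, 3, 6, ...
    // (y would be bits 1, 4, 7, ... and z bits 2, 5, 8, ...).
    const uint32_t xMask = 0x49249249;
    
    // Move from (x,y,z) to (x+1,y,z) directly in Morton space: force every non-x
    // bit to one so that adding 1 carries straight through to the next x bit,
    // then recombine with the untouched y and z bits.
    uint32_t incrementX(uint32_t mortonPos)
    {
        const uint32_t withCarry = (mortonPos | ~xMask) + 1;
        return (withCarry & xMask) | (mortonPos & ~xMask);
    }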

  8. David Williams reporter

    Re my previous comment, I think we should stick with the lookup table for now. Doing it without the lookup table is a little more complicated, and we've seen in other places that a small lookup table is faster than even a few bit shifts. This is a bit surprising (given that the general advice is that computation is faster than memory lookup) but it is what we have seen so far.

  9. Matt Williams

    I agree it's hard to know which would win out of a short lookup table and a simple calculation. I think we might win in this case due to the size of the table being small enough to fit in a very local cache or even a register. We should perhaps test it at some point though.

    Have you compared between the methods for running a full surface extraction or would that not be fair/possible due to the different memory layout?

    I've come up with a way of generating the lookup tables at compile time but it depends on constexpr so we can't use it on MSVC. For now we'll just use the static version and keep this in mind for the future.

    However, it's possible that the delta lookup table can be compressed down to 5 entries for a side length of 32 (there are only 5 unique entries in it) which might relieve some cache pressure. It will need some tests to see if it's worth it though (it would require a tad more computation at run-time). For example, if the lookup table is condensed to become

    std::array<uint32_t, 5> dp = {1,7,55,439,3511};
    

    Then we can do

    return p + dp[__builtin_ffs(~x) - 1]; // on GCC/Clang (__builtin_ffs is 1-based, so subtract 1)
    
    unsigned long index; _BitScanForward(&index, ~x); return p + dp[index]; // on MSVC (the index comes back via the out-parameter)
    

    But whether this is useful depends on how NOTing the index and calling ffs() compares to a more direct lookup of a larger table.

  10. David Williams reporter

    The surface extraction tests run at the same speed (as do most of the other tests). The Morton version might be slightly faster overall, but not enough to be sure. The real benefit here will come from the compression and easier subsampling of the Morton version, as well as the fact that we can now optimize a bit further. For extractor improvements I think we really have to optimize the extractor itself, but it is good to understand the performance characteristics of the underlying volume.

    Generating the lookup tables at compile time is indeed interesting (as long as it really is as fast), but as you say it's not supported by VS2013 unless you use the CTP version. Something to keep in mind for the future though.

    Those intrinsics look cool! I doubt if they are faster due to the extra logic but at some point we can try it and see. However, first I will optimize other aspects of the sampler as far as I can to see where that takes us, and it should also make any performance improvement due to intrinsics more noticeable.

  11. David Williams reporter

    The new (Morton-order) sampler needs to keep track of the position inside the chunk. It also still tracks the position in the volume, though possibly that can be removed now. We only apply our clever offset/lookup-table logic when we are not at the edge of the chunk, and tracking the position inside the chunk allows us to test this more easily.

    This commit uses this for testing whether we can peek in the positive x/y/z direction, and the sampler tests now run about 30% faster! This means that the actual voxel lookup is possibly not even the bottleneck, but instead it may be the logic which determines if we are at a chunk edge. Perhaps this is why Morton vs. linear didn't make much difference.

    The crazy thing is, I actually tried applying this optimization in the negative x/y/z direction as well and it actually gets slower again! I really can't understand why. I guess there are some strange compiler/processor optimizations going on.

    Anyway, this means that the Morton vs. linear comparison for the sampler is perhaps not completely meaningful. But I think we have to let that go, because trying to apply the position-in-chunk-tracking changes to the develop branch is more work, and I'm losing focus on what we are trying to prove :-) I think we should accept that Morton order is better in principle and run with that, and then try to make further optimizations (e.g. to these if statements) as we find them.

  12. Matt Williams

    Be aware with running the test suite for benchmarking that you can't rely on the total time it takes to run the test binary (i.e. what is done when you run CTest via make test or the equivalent in VS). The QBENCHMARK macro will run the test a number of times in order to get an accurate timing for it. This means sometimes you can make an algorithm faster which causes QBENCHMARK to need to run the algorithm e.g. 64 instead of 32 times, making the test run slower overall. Running, e.g. the TestSurfaceExtractor binary directly will print out the information about how many times the test ran and the average timings per run.
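
    To illustrate (a hypothetical, stripped-down test class rather than the actual PolyVox tests), the block inside QBENCHMARK is what gets repeated until Qt Test is happy with the timing accuracy:

    #include <QtTest>
    
    class BenchmarkExample : public QObject
    {
        Q_OBJECT
    private slots:
        void someBenchmark()
        {
            QBENCHMARK
            {
                // This block may run e.g. 32 or 64 times, so the binary's total
                // run time is not a reliable comparison; use the reported
                // per-iteration timings instead.
                volatile int sum = 0;
                for (int i = 0; i < 1000; ++i) { sum = sum + i; }
            }
        }
    };
    
    QTEST_MAIN(BenchmarkExample)
    #include "benchmarkexample.moc"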

    If that's not the cause then I agree it's possibly caused by some strange optimisation.

    ... I just tested it myself by adding

    #define CAN_GO_NEG_X(val)  (this->m_uXPosInChunk > 0)
    #define CAN_GO_NEG_Y(val)  (this->m_uYPosInChunk > 0)
    #define CAN_GO_NEG_Z(val)  (this->m_uZPosInChunk > 0)
    

    and it gave slight speed improvements (a few percent I guess).

    It is indeed hard to get a completely fair test of the new method since some algorithms have to change along the way. We've found that Morton encoding can be just as good as linear ordering and it opens the doors to some further improvements.

    I agree though that, given we've found Morton is in general at least as good, we should commit to it and make improvements as they come. With the current feature/morton-encoding branch I'm seeing speed improvements of ~20-40% for the surface extractors compared to develop.

  13. David Williams reporter

    Thanks for the tip but I'm already running the tests individually and looking at the average runtime (though actually only a single iteration is being performed anyway). The results are indeed strange (especially if you are seeing different results - are you testing on GCC/Clang and 32/64 bit?). But it could be due to strange inlining rules which change as we enlarge/shrink the function, or other stuff like that. Maybe it varies per compiler, or with other settings.

    There's also the question of what exactly we are optimizing for. For example, I'm currently simplifying the if(...) statements in the peek...() functions by bit-packing several flags into a single variable, so only one test is needed. But this also complicates the logic of moving the iterator. For a task like filtering (which the test emulates) this is a good tradeoff because we move the iterator once and then sample many neighbours. But with something like raycasting we would move the iterator, sample only the current voxel once, and not look at the neighbours at all. Here the speed of moving the iterator is more important. Perhaps it should have two modes of operation - I'm not sure yet.
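
    As a rough sketch of what I mean by bit-packing the flags (illustrative only, not the actual change):

    #include <cstdint>
    
    // One bit per direction, refreshed whenever the sampler moves, so a peek in a
    // diagonal direction needs a single AND and compare instead of several tests.
    enum BoundaryFlags : uint8_t
    {
        CanGoPosX = 0x01,
        CanGoPosY = 0x02,
        CanGoPosZ = 0x04
    };
    
    bool canPeek(uint8_t packedFlags, uint8_t required)
    {
        return (packedFlags & required) == required;
    }
    
    // e.g. peeking at (x+1, y+1, z) needs both the +x and +y bits:
    //   if (canPeek(flags, CanGoPosX | CanGoPosY)) { /* fast path */ }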

    Ideally of course the profiling should be driven by our actual use cases, but in practice it's easier to profile against isolated tests. I'll get it working as well as I can on the current tests but we can also come back to raw performance again in the future (after thread safety perhaps).

  14. David Williams reporter

    Ok, my clever flag trick from the previous post didn't really help - it seems sometimes simpler is just better. So my point regarding optimizing for moving vs. sampling the neighbours isn't so relevant any more. But I have further sped up the sampler with some more caching.

    I think I'm ready to wrap this up now. The Morton ordering seems to be working correctly and performance is better than the develop branch. The tests which are very heavy on sampler access show a 40-50% increase, but practical algorithms such as the Marching Cubes extractor show only a few % increase. Still, we get the benefits of Morton ordering for compression and subsampling, and again the benefits should improve with bigger chunks and larger voxels.

    There are a couple of things to keep in mind for the future but which I don't think are worth investigating now:

    • The sampler could be modified to track the current chunk position and current index in the chunk. At the moment, when the 'if' in the peek...() functions fails we resort to volume->getVoxel(), which internally computes from scratch the index and the chunk. But we could avoid needing to recompute the index if our lookup table would cause the index to wrap around when peeking outside a chunk. This would be fiddly though as the lookup table for a small chunk would no longer be a subset of the lookup table for a large chunk, meaning we'd probably have to use compile-time generation of the lookup tables to cover all chunk sizes. Also, I don't think the index calculation is actually that expensive - finding the correct chunk in the chunk map appears to be more costly.

    • Every time we peek...() we first check if we are allowed to peek in the given direction. However, in some cases we might already know that we can do this. For example, if the sampler is being moved forward by the operator++() (or some other controller class) then maybe it could keep track of this. The peek...() functions could have a template parameter as to whether they even need to perform the boundary check. But this can all get rather complex, and it's not clear if we actually have a use case for it.

  15. David Williams reporter
    • changed status to open

    I'm opening this up again because our recent blog post has generated some useful discussion, particularly with regard to whether it's better to compute the Morton offset on the fly rather than to use a lookup table. It's probably worth testing this:

    http://www.volumesoffun.com/implementing-morton-ordering-for-chunked-voxel-data/#comments

    Also, we've improved the chunk lookup time in PolyVox by replacing the std::map with a custom hash table, so the index calculation/lookup time may now be more significant.

    I don't think I'll look at this soon, but it probably is worth coming back to in the future.

  16. David Williams reporter

    After more tests, I'm also concerned that the expected compression benefits are not really present. I tested a couple of datasets from Cubiquity and in both cases the compression ratio is actually worse for the Morton case (with the city dataset from Build & Shoot it's nearly two times worse). I'm guessing that real world datasets are mostly aligned with the main axis (straight walls, flat ground, etc) and so don't work as well with Morton encoding.

    This isn't a fatal problem in itself, as we can still store the data in linear order on disk and swap it to be in Morton order in memory. But we're losing one of the expected benefits here, so the Morton access really has to be faster for it to be worthwhile. If the only true benefit of Morton ordering was the faster downsampling then we could instead just swap to Morton order for that task (though the cost of swapping and then downsampling would probably be greater than just downsampling the linear version). It also depends which operation is more common out of loading from disk vs. downsampling, and I don't have a clear answer to that.

    Overall, as long as accessing the Morton ordered data really is faster then it's probably still worth using it. But I'm no longer quite so convinced, with all the comments on Reddit and the blog about how Morton ordering is not worth it. Let's run with it for now and get other stuff in place, but be prepared to test linear ordering again in the future.
