possible speedup for transfer shapekeys and other intensive computations

Issue #1775 resolved
Alessandro Padovani created an issue

blender 3.6.4, diffeomorphic 1.7.2.1848

I noticed that when we transfer shapekeys my task manager shows about 30% cpu used, which probably means python is using a single core, since my cpu has 4 cores. I just started learning some python and the blender api so I know very little, but it seems it is possible for python to use all the cpu cores.

That would be brilliant since now transfer shapekeys is a very slow process. Let us know.

https://superfastpython.com/python-use-all-cpu-cores/

Comments (42)

  1. Alessandro Padovani reporter

    For example, if I understand correctly from the above article, when we transfer shapekeys to multiple clothing items, we could use one task per clothing item, this way python should use all the available cores.

    # create the process pool (assumes a top-level task function defined elsewhere)
    from concurrent.futures import ProcessPoolExecutor

    with ProcessPoolExecutor(8) as exe:
        # perform the calculations across the worker processes
        results = exe.map(task, range(1, 50000))
    

  2. Rakete

    I’d be interested in finding out what is actually slow about the morph loading process, and what could potentially be done to speed it up.

    That said, does the ProcessPoolExecutor thing actually work @Padone? Python is not particularly good at concurrent processing, and I think that ProcessPoolExecutor would try to spawn multiple python processes. From some googling, that seems to be impossible inside the blender python interpreter. Better would be something that uses threads, but that doesn’t look likely to work either. You’d also have to think about thread safety when interacting with the C parts of blender then; it is already quite easy to crash blender in my experience, and I assume concurrency would amplify that problem quite a bit. Maybe newer/upcoming blender versions have better support for concurrency?

    I am thinking that you’ll most likely be stuck with only the linear processing that is allowed through the blender python api. The only way I can see that you could use concurrency is by writing something like an external tool and distributing that with diffeo. So for example if you could pre-process the morphs/shapekeys/whatever such that the majority of work is already done and then blender only needs to put it somewhere, then you could put that pre-processing into some golang program (just pipe it in) that does it concurrently, and then you only read the output of that program and use that output in blender.
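    The external-tool idea above can be sketched from the python side. This is only a sketch under assumptions: the helper here is a stand-in inline script (a real deployment would ship a compiled go binary next to diffeo), and the output format is invented for illustration.

```python
import json
import subprocess
import sys

def preprocess_morphs(paths):
    """Feed morph file paths to an external helper on stdin and read the
    pre-digested result back as JSON on stdout."""
    # Stand-in helper: echoes the paths back as JSON. A real helper would
    # parse the .dsf files concurrently and emit only what diffeo needs.
    helper = [sys.executable, "-c",
              "import sys, json; "
              "print(json.dumps({'files': sys.stdin.read().split()}))"]
    proc = subprocess.run(helper, input="\n".join(paths),
                          capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)

result = preprocess_morphs(["a.dsf", "b.dsf"])
```

    The python side then only has to parse one small JSON blob instead of every morph file.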

  3. Alessandro Padovani reporter

    If you’re interested you can look at “transfer.py”, any optimization there would work especially in inner loops.

  4. Thomas Larsson repo owner

    The scanned morph database was an attempt to preprocess the morphs so loading could be sped up later. Unfortunately the scan time is too large if you have many daz assets, at least for my taste, so in practice I don’t use it. To load multiple morphs in parallel seems difficult to me. The files must be read from the disk and parsed with python’s gzip and json modules, and I don’t think that disk access can easily be made parallel. Another bottleneck is the creation of vertex groups and shapekeys, where values are assigned in a slow python loop. I don’t see how to avoid that.
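    For reference, the read-and-parse step described above boils down to something like this (a sketch, not the plugin’s actual loader; it assumes .dsf/.duf files are JSON that may or may not be gzip-compressed):

```python
import gzip
import json

def load_dsf(path):
    """Load a DAZ .dsf/.duf file: try gzip first, fall back to plain text."""
    try:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            return json.load(f)
    except OSError:  # not gzip-compressed after all
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
```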

  5. Rakete

    Especially the reading of files and parsing them with gzip and json should benefit a lot from concurrency. IO bound tasks usually spend a lot of time waiting, so you can leverage that by doing the processing while waiting for IO. I have a pretty large daz asset library and have written some go tools that do things like “parse a scene and get all morphs, then find the morph files for those, then parse those morphs and find all referenced morphs in those, etc.”, so basically something that lets me determine all morphs used in a scene, and I can do that in around 6 seconds in a >1TB daz asset library (I’ve cut a lot of corners to achieve this though). Problem with this would be that I assume you can’t do it in python (at least not as fast), but I am thinking it might be possible to write a small go executable that processes all the morph files and then outputs only what diffeo needs, and that could then be parsed by python.

    The thing I am unsure about is loading vertex groups into blender, though. I assume the only way to do this is to gather all the data, then insert that data into blender in a linear fashion. What would be good is if you could do “read file, insert file vertex groups into blender”, and then do that in parallel for all the morphs. But I assume blender just won’t let you do that.
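    Within python itself, the IO-bound part at least can be overlapped with threads, since file reads release the GIL while waiting on the disk; the blender insertion would still have to happen serially on the main thread afterwards. A rough sketch, where parse_file is a stand-in for the real parser:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def parse_file(path):
    # Stand-in for reading and parsing one morph file.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def parse_all(paths, workers=8):
    """Parse many files concurrently; results come back in input order,
    ready to be inserted into blender serially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_file, paths))
```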

  6. Alessandro Padovani reporter

    Thomas, by reading transfer.py I see that you compute the bounding box for the target by scanning the whole vertex set. This is unnecessary since we already have the bounding box in the object properties. We only need to compute the bounding box for the shapekey. So this could be optimized, unless I miss something.

    def outsideBox(self, src, trg, hskey):
        eps = self.eps
        hverts = [v.index for v in src.data.vertices if (hskey.data[v.index].co - v.co).length > eps]
        for j in range(3):
            xclo = [v.co[j] for v in trg.data.vertices]
            # xkey = [hskey.data[vn].co[j] for vn in hverts]
            xkey = [src.data.vertices[vn].co[j] for vn in hverts]
            if xclo and xkey:
                minclo = min(xclo)
                maxclo = max(xclo)
                minkey = min(xkey)
                maxkey = max(xkey)
                if minclo > maxkey or maxclo < minkey:
                    return True
        return False
    

    I was thinking of something like the below. Let me know.

        # get min max for target
        min_xclo = [trg.location]
        max_xclo = [trg.location + trg.dimensions]
    

  7. Alessandro Padovani reporter

    p.s. Also, it is not needed to recompute the shapekey bounding box for every target. We can compute it once then store it with the shapekey and reuse it for all the targets.

  8. Thomas Larsson repo owner

    The bounding box is now only computed once for each shapekey. It doesn’t give a dramatic improvement, but speeds up transfer to top and shorts by some 10%. If I include the hair cap the improvement becomes negligible, because the bounding box doesn’t discard the shapekeys, so a full transfer must be attempted.

  9. Alessandro Padovani reporter

    Commit 48ce4c3.

    Thank you Thomas for your effort. With the new commit I go from 135 to 125 seconds with the test in #1772. As noted above we don’t need to compute the bounding box for the target since it’s already there in the object properties. So I added this optimization and it goes from 125 to 115 seconds.

    I am new to python and the blender api so please verify, but it seems to work fine here. Let me know.

    def outsideBox(self, src, trg, box):
        tverts = [trg.matrix_world @ Vector(v) for v in trg.bound_box]
        for j,side in enumerate(box):
            xclo = [v[j] for v in tverts]
            if xclo and (min(xclo) > side[1] or max(xclo) < side[0]):
                return True
        return False
    

  10. Alessandro Padovani reporter

    p.s. Another optimization is possible when the target is outside the bounding box of the source, which may happen if a figure comes with extra items. In this case we can use the objects’ bounding boxes and nothing needs to be computed. Let me know.
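    In both cases the check reduces to an axis-aligned interval overlap test; with two precomputed boxes in the same space it needs no per-vertex work at all. A minimal sketch, assuming the same (min, max)-per-axis box layout used in these comments:

```python
def boxesDisjoint(box_a, box_b):
    """True if two axis-aligned boxes do not overlap.
    Each box is a list of (min, max) pairs, one per axis."""
    return any(amax < bmin or amin > bmax
               for (amin, amax), (bmin, bmax) in zip(box_a, box_b))
```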

  11. Thomas Larsson repo owner

    No, that doesn’t seem to work. I translated the rig in the test file, and then nothing was transferred. However, there is another optimization: the bounding box only has to be computed once for each object. That almost cut down the time to half when transferring to top and shorts.

  12. Alessandro Padovani reporter

    That is because the bounding box in my code doesn’t follow the axes rotation. That is, obj.bound_box is in local space. But this is not an issue in the common case since figures are imported with transformations applied. To get the bounding box aligned with some axes you have to multiply with the axes matrix.

    We should really avoid to compute the bounding box for the target since it’s already there. Let me know.

    p.s. Commit 806de30 is amazing, the test time here goes from 125 to 59 seconds.

    p.p.s. Optimizing the box function doesn’t seem to get any better; it’s always 59 seconds in the test scene. But I guess this is because transferred morphs take most of the time anyway. In the case where there are only a few transferred morphs the optimization should matter a lot. The idea is to skip unused morphs as fast as possible. Let me know.

    def computeObjectBox(self, ob):
        box = []
        verts = [ob.matrix_world @ Vector(v) for v in ob.bound_box]
        for j in range(3):
            coords = [v[j] for v in verts]
            box.append((min(coords), max(coords)))
        return box
    

    p.p.p.s. Then this is only for the bounding boxes, aka skipping morphs. I didn’t check the code for the actual transfers so maybe we can optimize there too, or so I hope.

  13. Alessandro Padovani reporter

    bug. div2.

    I see div2 morphs are transferred as well. This should not happen since div2 are only meaningful for HD, so in this case we use vendor morphs or nothing. That is, div2 must not be transferred and we can always skip them. If the target doesn’t provide the HD morph then it doesn’t follow. Let me know.

  14. Thomas Larsson repo owner

    No, the div2 shapekeys are non-zero even for the base mesh, and have to be transferred to eyebrows and lashes. eJCMAfraid_HD_div2 visualized between 0 and 5 mm:

  15. Alessandro Padovani reporter

    Ok, this means we transfer base deformations for HD morphs, even if the target is not HD. It may make sense as an attempt to use non-HD items with a HD figure, though personally I’m not too convinced. Thank you for looking into this.

    p.s. I definitely don’t like that div2 expressions are transferred to clothing items. But I understand there’s no way to avoid that if we keep div2.

  16. Thomas Larsson repo owner

    In the last commit Easy import and Import standard morphs have an option to ignore HD morphs. The div2 morphs are ignored and no shapekeys are generated for face units, expressions and visemes. The option doesn’t affect other morph types like facs and jcms, since the shapekey is everything there.

  17. Alessandro Padovani reporter

    Commit f870947.

    We are already able to ignore div2 if we want to, by not selecting them in manual import, apart from the bug reported below. I wasn’t talking of not importing div2, but of not transferring them, under the idea that div2 should always be vendor morphs since they’re for HD. Then your observation for eyelashes is correct, since eyelashes don’t have HD morphs, as well as FACS which rely mainly on HD morphs and thus div2 for the base mesh.

    bug. div2. It seems ignoring div2 works for easy import but not for manual import. In manual import some div2 are always imported, even if we deselect them.

    steps:

    1. import G8
    2. import units or expressions deselecting div2

  18. Thomas Larsson repo owner

    Having an option to ignore the div2 morphs is the only way to use easy import and get a light-weight character with expressions. Even if you disable shapekeys globally the div2 morphs are loaded, and the corresponding object and armature properties and drivers are generated. And you might want to easy import jcms but not the expression shapekeys (pJCMs but not eJCMs). This option is analogous to the Ignore fingers options when you import jcms.

    If you don’t want to transfer some morphs to clothes, you can disable the Transfer to clothes option and do the transfer manually.

    That some div2 shapekeys are created even if they are disabled is not a bug, or at least not a bug in the plugin. In some cases the connection between the base and div2 morphs is made in the base file, and then the div2 file is loaded as a missing morph. I think Xin at some stage had the opposite problem; since the connection wasn’t made in the div2 file he didn’t find the base morph.

  19. Alessandro Padovani reporter

    Ok, but it is peculiar that we can disable div2 for easy import but not for manual import. If we disable div2 in easy import then div2 are not loaded as missing morphs. If we deselect div2 in manual import as shown above then they are.

  20. Alessandro Padovani reporter

    Personally I am satisfied with the optimizations in the transfer code, it works great now. G9 is still slow as hell but that’s not related to transfer alone. If @Rakete has nothing to add we can close as resolved.

  21. Rakete

    I was actually planning on trying to optimize the morph loading code a bit. I tried replacing json with ujson and orjson already, but it had no effect. But I also wanted to try out if asyncio can somehow be used.

    I tried to measure where the code spends time, and it does seem a good chunk is just loading the json. I imagine that could be optimized; building drivers, for example, I assume couldn’t be.

    Perfs {'read': 1.594147100004193, 'load': 27.3830600999936, 'parseMorph': 30.858307900012733, 'make_single': 6.748371699985, 'build': 24.224261400000614}
    

    “load” in there is this code:

    load_t1 = perf_counter()
    struct,msg,jsonerr = loadFromString(string)
    if jsonerr:
        string = smashString(string, jsonerr)
        if string:
            struct,msg,jsonerr = loadFromString(string)
    if msg and not silent:
        reportError(msg, trigger=(1,5))
    load_t2 = perf_counter()
    GS.perfs["load"] = GS.perfs.get("load", 0) + load_t2 - load_t1
    

    So just that loadFromString guy uses up 27 seconds on its own.

    The measuring is kind of flawed though, I assume that “parseMorph” time for example includes the 27 seconds of the “load” code. So in effect it should only be 3 seconds. Not sure though.
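    One way to make the measuring less flawed is to track exclusive time, i.e. subtract nested sections from their parent. A sketch of such a timer (the perfs dict mirrors the one used above; names are illustrative):

```python
from contextlib import contextmanager
from time import perf_counter

perfs = {}    # accumulated exclusive time per key
_stack = []   # frames of [key, time spent in nested timed() sections]

@contextmanager
def timed(key):
    """Accumulate time per key, excluding time spent in nested timed()
    sections, so e.g. 'parseMorph' would no longer include 'load'."""
    _stack.append([key, 0.0])
    t0 = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - t0
        _, child_time = _stack.pop()
        perfs[key] = perfs.get(key, 0.0) + elapsed - child_time
        if _stack:
            # charge our whole elapsed time to the enclosing frame
            _stack[-1][1] += elapsed
```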

  22. Rakete

    Just playing with it right now, and I just noticed that loadJson is repeatedly called for certain files, and it reads them every time. So I just added a simple cache dict to GS and put the struct in there associated with the filename, then just reuse it when something tries to load it again; that cut off ~20-30 seconds from the morph loading process for me in a quick test.

    Perfs {'read': 0.15702350003448373, 'load': 0.3401700000013079, 'load2': 0.00016119998144858982, 'parseMorph': 1.4387110000097891, 'make_single': 7.645069100017281, 'build': 24.940205800001422}
    Repeats 567
    Repeats E:/My DAZ Connect Library/data/cloud/1_42071/data/daz 3d/genesis 8/female 8_1/genesis8_1female.dsf 140
    Repeats E:/My DAZ Connect Library/data/cloud/1_42071/data/daz 3d/genesis 8/female 8_1/morphs/daz 3d/facs/facs_bs_browdownright_div2.dsf 2
    [... many more files are loaded twice]
    

    That is after the change, “load” and “load2” are the same as just “load” before. You can see how genesis8_1female.dsf is loaded 140 times! And now it is only loaded once. That eliminates almost all the time spent in loadJson as far as I can tell.

    I just put this:

    def loadJson(filepath, mustOpen=False, silent=False):
        if filepath in GS.repeats:
            GS.repeats[filepath] += 1
            return GS.cache[filepath]
    

    at the top of loadJson, and this:

        GS.repeats[filepath] = 1
        GS.cache[filepath] = struct
    
        return struct
    

    at the bottom, and this:

    class GlobalSettings:
    
        def __init__(self):
            self.perfs = {}
            self.repeats = {}
            self.cache = {}
    

    into GlobalSettings.

    This should also help when loading morphs first, and then loading more morphs again after, since you don’t have to clear the cache and it will just stay in the GS instance. Maybe there is a better place for it (like a dedicated global).
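    As a side note, a possible alternative to the hand-rolled dict is functools.lru_cache; keying on the file’s mtime as well would avoid serving stale data when a file is rewritten on disk. A sketch under assumptions (the function names are stand-ins, not the plugin’s real loadJson):

```python
import json
import os
from functools import lru_cache

@lru_cache(maxsize=None)
def _load_json_cached(filepath, mtime):
    # mtime is part of the cache key only; editing the file on disk
    # changes it and so invalidates the cached entry automatically.
    with open(filepath, "r", encoding="utf-8") as f:
        return json.load(f)

def loadJsonCached(filepath):
    return _load_json_cached(filepath, os.path.getmtime(filepath))
```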

  23. Rakete

    Here is another one: the code spends ~24s in “build”, that is this code:

    self.buildDrivers()
    self.buildSumDrivers()
    self.buildRestDrivers()
    if self.isJcm:
        self.optimizeJcmDrivers()
    self.correctScaleParents()
    

    and most of the time is spent in self.buildSumDrivers(). So looking at that function it is three nested loops, and of course I thought most of the time is spent inside the most nested loop. But when I tried to measure it, I noticed almost no time is spent in the innermost loop, even in the second innermost loop almost no time is spent, but 99% of the time is spent only in the outermost loop. Which does almost nothing, it consists only of the second innermost loop, and a print:

    i_loop_t1 = perf_counter()
    for bdata in self.sumdrivers.values():
        j_loop_t1 = perf_counter()
        for channel,cdata in bdata.items():
            [...]
        j_loop_t2 = perf_counter()
        GS.perfs["buildSumDriversJ"] = GS.perfs.get("buildSumDriversJ", 0) + j_loop_t2 - j_loop_t1
        printName(" +", bname) # <- one lonely print
    i_loop_t2 = perf_counter()
    GS.perfs["buildSumDriversI"] = GS.perfs.get("buildSumDriversI", 0) + i_loop_t2 - i_loop_t1
    

    So, what is print? That is actually IO, and it blocks (depends on the terminal I guess), so what happens when I remove the print? Another ~20 seconds cut off the morph loading for me.

    Without print:

     'buildSumDriversI': 0.6377954999989015
    

    with print:

     'buildSumDriversI': 22.191953100002138
    

    Actually, I also changed the self.sumdrivers.items() to self.sumdrivers.values(), because I don’t need the bname anymore. Maybe that causes it? Though I do think it is probably the print, you can find stuff online suggesting print can be quite harmful to performance, and who knows how blender implements it, maybe there are a bunch of low hanging optimization fruits in the code by just eliminating prints from loops? Or maybe it only affects me when I have the python console open in blender while looking at the output?

    EDIT: Oh wait, it is actually not print, but printName, which does more than just print. So maybe you already noticed that particular problem and I am just re-discovering it.

    EDIT2: Yeah, absolutely I can reduce morph loading times just by not doing anything in printName. Loading my pose with a bunch of morphs (FACS, Expression) did go from ~70 seconds at the start, to now ~10 seconds. Eliminating printName(and newLine) makes a difference of ~20 seconds.
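    If per-item console output turns out to be the culprit on some setups, one cheap mitigation is to buffer the names and do a single write per batch instead of one print per iteration. A sketch (printName itself does more than this, so this is only the output side):

```python
import io
import sys

def report_names(names, prefix=" + ", out=sys.stdout):
    """Emit one write() for the whole batch instead of one print per item,
    so a slow terminal is hit once rather than once per driver."""
    buf = io.StringIO()
    for name in names:
        buf.write(f"{prefix}{name}\n")
    out.write(buf.getvalue())
```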

  24. Alessandro Padovani reporter

    diffeomorphic 1.7.2.1859, blender 3.6.4

    That is brilliant, thank you Rakete. In my test with a G8F figure with basic wear and toulouse hair the loading time went from 51 seconds to 25 seconds, that’s double the speed. @Thomas Larsson let us know what you think.

    p.s. in load_json.py we also have to add “from .settings import GS“.

    p.p.s. Victoria 9 with basic clothes and pixie hair takes 230 seconds here even with the new optimizations, vs 25 seconds for G8F. Luckily I don’t use G9.

  25. Rakete

    How does that 230 seconds compare to without optimizations? Maybe with these optimizations using ujson or orjson would get you a noticeable speedup for genesis 9, if that loading time is caused mainly by loading very large json files.

    I tried ujson and orjson again with my optimizations, and for my test pose it only resulted in a relatively minor improvement again (but at least there seemed to be one this time around).

  26. Alessandro Padovani reporter

    Well Victoria 9 without the new optimizations just takes forever, it’s 513 seconds here, luckily I don’t use G9. Thank you again for your nice work, now we wait for Thomas to get in.

  27. Xin

    Haven’t looked at this closely, but using Numba or Cython for preprocessing could perhaps work quite well for some tasks. Cython is used by quite a few Blender addons to isolate intensive tasks in native code. But if no libraries are needed and the code is mostly loops, then Numba is the better choice.

    This won’t be as fast as C++ but it can get quite close and is easier to use and maintain. You can use almost python-style syntax (Numba is almost the same as python), and you don’t need to worry as much about the details of how the data is transferred from python to native code.

  28. Thomas Larsson repo owner

    The json files that are loaded repeatedly are the definitions of the main genesis figures. Those json files are now cached, which results in a nice speedup. Other files are not cached, since most of those are only loaded once and caching them would eat up a lot of memory. I checked easy import with facs, facs expressions, jcms and flexions.

    Without caching:

    Facs loaded in 27.0 seconds
    Facsexpr loaded in 6.4 seconds
    Jcms loaded in 11.1 seconds
    Flexions loaded in 2.4 seconds
    File D:\home\bugs\genesis\G8\g8f-basic.duf loaded in 67.077 seconds
    

    With caching:

    Facs loaded in 4.9 seconds
    Facsexpr loaded in 0.3 seconds
    Jcms loaded in 2.5 seconds
    Flexions loaded in 0.6 seconds
    File D:\home\bugs\genesis\G8\g8f-basic.duf loaded in 29.560 seconds
    

  29. Thomas Larsson repo owner

    The printName and newLine functions don’t make a difference on my system. Have you tried to turn off Show In Terminal setting? This still prints something in the terminal when loading morphs manually, but not during easy import.

  30. Alessandro Padovani reporter

    commit cb0796a, blender 3.6.4, windows 10 22H2

    Works great here thank you Thomas for the fix and Rakete for finding this out.

    I can confirm that print makes no difference here: I tried disabling “Show In Terminal”, commenting out the prints in load_morph.py, and closing the terminal, and there’s no difference. If it makes a difference for Rakete then we may turn off “Show In Terminal” by default and place a warning in the tooltip that it may slow down the import process.

    possible bug. transfer to meshes without faces. With G9 I noticed that the importer spends quite some time transferring to “pixie cut”, that is the dforce hair. Apart from it being a hair mesh detected as clothing, which is not good, it is also a mesh without faces. So I wonder if we can avoid transferring to meshes without faces, since it makes little sense. Let me know.

    p.s. This means that for animals the dforce fur would not follow the skin morphs for example, but once converted to blender particles it will follow the emitter. I’m not sure whether it’s the same for curves in 4.0. Maybe we can have an option in the global settings to transfer to meshes without faces, off by default, so the user can choose if needed.

  31. Alessandro Padovani reporter

    @Xin Thank you for pointing out numba and cython. As for numba, it seems it needs to be installed separately since it’s a jit, and the user may not want to install python extensions for diffeomorphic. As for cython, it seems it can be used to compile some python code into a wrapped c++ dll, which could be wonderful for modules where performance is critical.

    Thomas let us know what you think.

    p.s. @Xin would you be able, for example, to use cython to wrap load_json.py and load_morph.py into c++ dlls ? I guess that would be the definitive solution here.

  32. Thomas Larsson repo owner

    Transfer shapekeys now ignores meshes without faces.

    There is one further optimization. When the plugin reads an asset, in this case a morph or a formula, it also reads its parent around line 306 of asset.py. The parent of a morph is the figure, which is why the G8.1F file was loaded for each morph.

    Getting the parent is important when we import a scene, but in that case the parent has already been loaded and cached, so the operation costs almost nothing. When we load morphs that cache is cleared before each morph, so the parent must be reimported. Even if the file is cached, the data is parsed for each morph.

    However, the parent is never used when we load morphs, only when we load scenes. So in the last commit the parent is not loaded when morphs are imported.

  33. Thomas Larsson repo owner

    I don’t want to use C wrappers or non-standard python modules unless there is a huge performance gain. Both because there will be a nuisance for the user to set up the dependencies, and because I’m not competent to handle it. With the last improvements I think that the performance is quite acceptable.

  34. Alessandro Padovani reporter

    Commit db669ed works fine, thank you Thomas for your effort and letting us know what you think.

    @Xin let us know for the wrapper, I’m looking into it myself now that I’m starting to learn some python and blender api. I did manage to write my own scripts to rig figures but I’m currently quite a noob.

  35. Rakete

    My old code had a problem anyway in that it cached just everything, poses included. Meaning if I saved a pose to a file, then changed it and saved it again to the same file, I couldn’t load the new version. The new code does not have that problem.

    Also agree that it is not worth adding more dependencies when they don’t improve the performance substantially, the json replacements were very underwhelming.

  36. Xin

    Alessandro Padovani, I would have to look closely to know for sure, but in general I would say yes, I don’t see why not.

    Another thing that maybe Thomas could try is to vectorize all operations with numpy, since I’m not sure that’s the case right now (I think it uses python lists sometimes, and operations on those are slow).
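    As an illustration of the difference, the “which vertices moved” test from outsideBox can be written either way; the vectorized form avoids the per-vertex python loop entirely. This is a sketch with plain arrays standing in for blender’s vertex data:

```python
import numpy as np

def moved_indices_loop(base, key, eps):
    # python-loop version: one iteration per vertex
    out = []
    for i, (b, k) in enumerate(zip(base, key)):
        if sum((kc - bc) ** 2 for kc, bc in zip(k, b)) ** 0.5 > eps:
            out.append(i)
    return out

def moved_indices_numpy(base, key, eps):
    # vectorized version: one norm over the whole (N, 3) array
    base, key = np.asarray(base, float), np.asarray(key, float)
    return np.nonzero(np.linalg.norm(key - base, axis=1) > eps)[0].tolist()
```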

  37. Alessandro Padovani reporter

    Thank you Xin for your reply.

    Numpy is used to transfer morphs in transfer.py. I had a look at cython, but it seems you have to refactor the code to get a good speed gain, and there are also issues compiling dlls since that part is not complete. So it’s not easily usable, or at least it’s not the python-to-c compiler I expected. In general I agree with Thomas that we have decent speed now, which doesn’t mean it can’t be improved of course.

    Personally I don’t work with G9 or HD figures so this is minor to me.

  38. Thomas Larsson repo owner

    Numpy is used in some places, in particular for the transfer of shapekeys. However, to use it we must transfer from Blender’s internal data structures and back. The gain in using numpy has to be weighed against the cost of such overhead. In particular I haven’t found a way to write to a shapekey or vertex group without doing a slow python loop.
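    One possible exception worth testing: blender’s property collections expose a bulk foreach_set/foreach_get that assigns a whole flat sequence in a single call, which should avoid the per-element python loop for shape key coordinates at least (vertex group weights go through a different API). A hedged sketch, with skey standing in for a bpy shape key:

```python
import numpy as np

def write_shapekey_coords(skey, coords):
    """Assign an (N, 3) array of coordinates to a shape key in one call,
    via the bulk foreach_set API instead of a per-vertex python loop."""
    flat = np.ascontiguousarray(coords, dtype=np.float32).ravel()
    skey.data.foreach_set("co", flat)
```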

  39. Alessandro Padovani reporter

    That I noted too. Also considering that shape keys in blender always cover the whole mesh, unlike vertex groups. I tried to skip all zero weights in the transfer process, that is, zero weights are not necessary to transfer. But that didn’t speed up anything so I guess it’s a minor optimization.

    p.s. Now I realize that the shapekey list is already without zeroes that’s why I didn’t get anything. Sorry it is difficult for me to read the code.

  40. Xin

    An old version of the HD morphs addon used Blender structures to load shape keys directly without the API, and it was very fast. But it was a big problem that it depended on Blender’s C/C++ code since that code changes a lot and is not documented, so it becomes very annoying to maintain.

    So I agree that any move of code to Cython should be easy to maintain otherwise it’s not worth it.
