possible speedup for transfer shapekeys and other intensive computations

Issue #1775 resolved
Alessandro Padovani created an issue

blender 3.6.4, diffeomorphic 1.7.2.1848

I noticed that when we transfer shapekeys my task manager shows about 30% cpu used, which probably means python is using a single core, since my cpu has 4 cores. I just started learning some python and the blender api so I know very little, but it seems it is possible for python to use all the cpu cores.

That would be brilliant since now transfer shapekeys is a very slow process. Let us know.

https://superfastpython.com/python-use-all-cpu-cores/

Comments (42)

  1. Alessandro Padovani reporter

    For example, if I understand correctly from the above article, when we transfer shapekeys to multiple clothing items, we could use one task per clothing item, this way python should use all the available cores.

    # create the process pool (assumes a top-level task function defined elsewhere)
    from concurrent.futures import ProcessPoolExecutor

    with ProcessPoolExecutor(8) as exe:
        # perform the calculations across the worker processes
        results = exe.map(task, range(1, 50000))
    

  2. Rakete

    I’d be interested in finding out what is actually slow about the morph loading process, and what could potentially be done to speed it up.

    That said, does the ProcessPoolExecutor thing actually work @Padone? Python is not particularly good at concurrent processing, and I think that ProcessPoolExecutor would try to spawn multiple python processes. From some googling, that seems to be impossible inside the blender python interpreter. Better would be something that uses threads, but that doesn’t look likely to work either. You’d also have to think about thread safety when interacting with the C parts of blender then; it is already quite easy to crash blender in my experience, and I assume concurrency would amplify that problem quite a bit. Maybe newer/upcoming blender versions have better support for concurrency?

    I am thinking that you’ll most likely be stuck with only the linear processing that is allowed through the blender python api. The only way I can see that you could use concurrency is by writing something like an external tool and distributing that with diffeo. So for example if you could pre-process the morphs/shapekeys/whatever such that the majority of work is already done and then blender only needs to put it somewhere, then you could put that pre-processing into some golang program (just pipe it in) that does it concurrently, and then you only read the output of that program and use that output in blender.
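    The external-tool idea above can be sketched from the python side. This is only a sketch under assumptions: the helper here is a stand-in inline script (a real deployment would ship a compiled go binary next to diffeo), and the output format is invented for illustration.

```python
import json
import subprocess
import sys

def preprocess_morphs(paths):
    """Feed morph file paths to an external helper on stdin and read the
    pre-digested result back as JSON on stdout."""
    # Stand-in helper: echoes the paths back as JSON. A real helper would
    # parse the .dsf files concurrently and emit only what diffeo needs.
    helper = [sys.executable, "-c",
              "import sys, json; "
              "print(json.dumps({'files': sys.stdin.read().split()}))"]
    proc = subprocess.run(helper, input="\n".join(paths),
                          capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)

result = preprocess_morphs(["a.dsf", "b.dsf"])
```

    The python side then only has to parse one small JSON blob instead of every morph file.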

  3. Alessandro Padovani reporter

    If you’re interested you can look at “transfer.py”, any optimization there would work especially in inner loops.

  4. Thomas Larsson repo owner

    The scanned morph database was an attempt to preprocess the morphs so loading could be sped up later. Unfortunately the scan time is too large if you have many daz assets, at least for my taste, so in practice I don’t use it. To load multiple morphs in parallel seems difficult to me. The files must be read from the disk and parsed with python’s gzip and json modules, and I don’t think that disk access can easily be made parallel. Another bottleneck is the creation of vertex groups and shapekeys, where values are assigned in a slow python loop. I don’t see how to avoid that.
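    For reference, the read-and-parse step described above boils down to something like this (a sketch, not the plugin’s actual loader; it assumes .dsf/.duf files are JSON that may or may not be gzip-compressed):

```python
import gzip
import json

def load_dsf(path):
    """Load a DAZ .dsf/.duf file: try gzip first, fall back to plain text."""
    try:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            return json.load(f)
    except OSError:  # not gzip-compressed after all
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
```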

  5. Rakete

    Especially the reading of files and parsing them with gzip and json should benefit a lot from concurrency. IO bound tasks usually spend a lot of time waiting, so you can leverage that by doing the processing while waiting for IO. I have a pretty large daz asset library and have written some go tools that do things like “parse a scene and get all morphs, then find the morph files for those, then parse those morphs and find all referenced morphs in those, etc.”, so basically something that lets me determine all morphs used in a scene, and I can do that in around 6 seconds in a >1TB daz asset library (I’ve cut a lot of corners to achieve this though). Problem with this would be that I assume you can’t do it in python (at least not as fast), but I am thinking it might be possible to write a small go executable that processes all the morph files and then outputs only what diffeo needs, and that could then be parsed by python.

    The thing I am unsure about is loading vertex groups into blender, though. I assume the only way to do this is to gather all the data, then insert that data into blender in a linear fashion. What would be good is if you could do “read file, insert file vertex groups into blender”, and then do that in parallel for all the morphs. But I assume blender just won’t let you do that.
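    Within python itself, the IO-bound part at least can be overlapped with threads, since file reads release the GIL while waiting on the disk; the blender insertion would still have to happen serially on the main thread afterwards. A rough sketch, where parse_file is a stand-in for the real parser:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def parse_file(path):
    # Stand-in for reading and parsing one morph file.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def parse_all(paths, workers=8):
    """Parse many files concurrently; results come back in input order,
    ready to be inserted into blender serially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_file, paths))
```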

  6. Alessandro Padovani reporter

    Thomas, by reading transfer.py I see that you compute the bounding box for the target by scanning the whole vertex set. This is unnecessary since we already have the bounding box in the object properties. We only need to compute the bounding box for the shapekey. So this could be optimized, unless I miss something.

    def outsideBox(self, src, trg, hskey):
        eps = self.eps
        hverts = [v.index for v in src.data.vertices if (hskey.data[v.index].co - v.co).length > eps]
        for j in range(3):
            xclo = [v.co[j] for v in trg.data.vertices]
            # xkey = [hskey.data[vn].co[j] for vn in hverts]
            xkey = [src.data.vertices[vn].co[j] for vn in hverts]
            if xclo and xkey:
                minclo = min(xclo)
                maxclo = max(xclo)
                minkey = min(xkey)
                maxkey = max(xkey)
                if minclo > maxkey or maxclo < minkey:
                    return True
        return False
    

    I was thinking of something like the below. Let me know.

        # get min max for target
        min_xclo = [trg.location]
        max_xclo = [trg.location + trg.dimensions]
    

  7. Alessandro Padovani reporter

    p.s. Also, it is not needed to recompute the shapekey bounding box for every target. We can compute it once then store it with the shapekey and reuse it for all the targets.

  8. Thomas Larsson repo owner

    The bounding box is now only computed once for each shapekey. It doesn’t give a dramatic improvement, but speeds up transfer to top and shorts by some 10%. If I include the hair cap the improvement becomes negligible, because the bounding box doesn’t discard the shapekeys, so a full transfer must be attempted.

  9. Alessandro Padovani reporter

    Commit 48ce4c3.

    Thank you Thomas for your effort. With the new commit I go from 135 to 125 seconds with the test in #1772. As noted above we don’t need to compute the bounding box for the target since it’s already there in the object properties. So I added this optimization and it goes from 125 to 115 seconds.

    I am new to python and the blender api so please verify, but it seems to work fine here. Let me know.

    def outsideBox(self, src, trg, box):
        tverts = [trg.matrix_world @ Vector(v) for v in trg.bound_box]
        for j,side in enumerate(box):
            xclo = [v[j] for v in tverts]
            if xclo and (min(xclo) > side[1] or max(xclo) < side[0]):
                return True
        return False
    

  10. Alessandro Padovani reporter

    p.s. Another optimization is possible when the target is outside the bounding box of the source, which may happen if a figure comes with extra items. In this case we can use the objects’ bounding boxes and nothing needs to be computed. Let me know.
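    In both cases the check reduces to an axis-aligned interval overlap test; with two precomputed boxes in the same space it needs no per-vertex work at all. A minimal sketch, assuming the same (min, max)-per-axis box layout used in these comments:

```python
def boxesDisjoint(box_a, box_b):
    """True if two axis-aligned boxes do not overlap.
    Each box is a list of (min, max) pairs, one per axis."""
    return any(amax < bmin or amin > bmax
               for (amin, amax), (bmin, bmax) in zip(box_a, box_b))
```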

  11. Thomas Larsson repo owner

    No, that doesn’t seem to work. I translated the rig in the test file, and then nothing was transferred. However, there is another optimization: the bounding box only has to be computed once for each object. That almost cut down the time to half when transferring to top and shorts.

  12. Alessandro Padovani reporter

    That is because the bounding box in my code doesn’t follow the axes rotation. That is, obj.bound_box is in local space. But this is not an issue in the common case since figures are imported with transformations applied. To get the bounding box aligned with some axes you have to multiply with the axes matrix.

    We should really avoid to compute the bounding box for the target since it’s already there. Let me know.

    p.s. Commit 806de30 is amazing, the test time here goes from 125 to 59 seconds.

    p.p.s. Optimizing the box function doesn’t seem to get any better; it’s always 59 seconds in the test scene. But I guess this is because transferred morphs take most of the time anyway. In the case where there are only a few transferred morphs the optimization should matter a lot. The idea is to skip unused morphs as fast as possible. Let me know.

    def computeObjectBox(self, ob):
        box = []
        verts = [ob.matrix_world @ Vector(v) for v in ob.bound_box]
        for j in range(3):
            coords = [v[j] for v in verts]
            box.append((min(coords), max(coords)))
        return box
    

    p.p.p.s. Then this is only for the bounding boxes, aka skipping morphs. I didn’t check the code for the actual transfers so maybe we can optimize there too, or so I hope.

  13. Alessandro Padovani reporter

    bug. div2.

    I see div2 morphs are transferred as well. This should not happen since div2 are only meaningful for HD, so in this case we use vendor morphs or nothing. That is, div2 must not be transferred and we can always skip them. If the target doesn’t provide the HD morph then it doesn’t follow. Let me know.

  14. Thomas Larsson repo owner

    No, the div2 shapekeys are non-zero even for the base mesh, and have to be transferred to eyebrows and lashes. eJCMAfraid_HD_div2 visualized between 0 and 5 mm:

  15. Alessandro Padovani reporter

    Ok, this means we transfer base deformations for HD morphs, even if the target is not HD. It may make sense as an attempt to use non-HD items with a HD figure, though personally I’m not too convinced. Thank you for looking into this.

    p.s. I definitely don’t like that div2 expressions are transferred to clothing items. But I understand there’s no way to avoid that if we keep div2.

  16. Thomas Larsson repo owner

    In the last commit Easy import and Import standard morphs have an option to ignore HD morphs. The div2 morphs are ignored and no shapekeys are generated for face units, expressions and visemes. The option doesn’t affect other morph types like facs and jcms, since the shapekey is everything there.

  17. Alessandro Padovani reporter

    Commit f870947.

    We are already able to ignore div2 if we want to, by not selecting them in manual import, apart from the bug reported below. I wasn’t talking of not importing div2, but of not transferring them, under the idea that div2 should always be vendor morphs since they’re for HD. Then your observation for eyelashes is correct, since eyelashes don’t have HD morphs, as well as FACS which rely mainly on HD morphs and thus div2 for the base mesh.

    bug. div2. It seems ignoring div2 works for easy import but not for manual import. In manual import some div2 are always imported, even if we deselect them.

    steps:

    1. import G8
    2. import units or expressions deselecting div2

  18. Thomas Larsson repo owner

    Having an option to ignore the div2 morphs is the only way to use easy import and get a light-weight character with expressions. Even if you disable shapekeys globally the div2 morphs are loaded, and the corresponding object and armature properties and drivers are generated. And you might want to easy import jcms but not the expression shapekeys (pJCMs but not eJCMs). This option is analogous to the Ignore fingers options when you import jcms.

    If you don’t want to transfer some morphs to clothes, you can disable the Transfer to clothes option and do the transfer manually.

    That some div2 shapekeys are created even if they are disabled is not a bug, or at least not a bug in the plugin. In some cases the connection between the base and div2 morphs is made in the base file, and then the div2 file is loaded as a missing morph. I think Xin at some stage had the opposite problem; since the connection wasn’t made in the div2 file he didn’t find the base morph.

  19. Alessandro Padovani reporter

    Ok, but it is peculiar that we can disable div2 for easy import but not for manual import. If we disable div2 in easy import then div2 are not loaded as missing morphs. If we deselect div2 in manual import as shown above then they are.

  20. Alessandro Padovani reporter

    Personally I am satisfied with the optimizations in the transfer code, it works great now. G9 is still slow as hell but that’s not related to transfer alone. If @Rakete has nothing to add we can close as resolved.

  21. Rakete

    I was actually planning on trying to optimize the morph loading code a bit. I tried replacing json with ujson and orjson already, but it had no effect. But I also wanted to try out if asyncio can somehow be used.

    I tried to measure where the code spends time, and it does seem a good chunk is just loading the json. I imagine that could be optimized; building drivers, for example, I assume couldn’t be.

    Perfs {'read': 1.594147100004193, 'load': 27.3830600999936, 'parseMorph': 30.858307900012733, 'make_single': 6.748371699985, 'build': 24.224261400000614}
    

    “load” in there is this code:

    load_t1 = perf_counter()
    struct,msg,jsonerr = loadFromString(string)
    if jsonerr:
        string = smashString(string, jsonerr)
        if string:
            struct,msg,jsonerr = loadFromString(string)
    if msg and not silent:
        reportError(msg, trigger=(1,5))
    load_t2 = perf_counter()
    GS.perfs["load"] = GS.perfs.get("load", 0) + load_t2 - load_t1
    

    So just that loadFromString guy uses up 27 seconds on its own.

    The measuring is kind of flawed though, I assume that “parseMorph” time for example includes the 27 seconds of the “load” code. So in effect it should only be 3 seconds. Not sure though.
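    One way to make the measuring less flawed is to track exclusive time, i.e. subtract nested sections from their parent. A sketch of such a timer (the perfs dict mirrors the one used above; names are illustrative):

```python
from contextlib import contextmanager
from time import perf_counter

perfs = {}    # accumulated exclusive time per key
_stack = []   # frames of [key, time spent in nested timed() sections]

@contextmanager
def timed(key):
    """Accumulate time per key, excluding time spent in nested timed()
    sections, so e.g. 'parseMorph' would no longer include 'load'."""
    _stack.append([key, 0.0])
    t0 = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - t0
        _, child_time = _stack.pop()
        perfs[key] = perfs.get(key, 0.0) + elapsed - child_time
        if _stack:
            # charge our whole elapsed time to the enclosing frame
            _stack[-1][1] += elapsed
```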

  22. Rakete

    Just playing with it right now, and I just noticed that loadJson is repeatedly called for certain files, and it reads them every time. So I just added a simple cache dict to GS and put the struct in there associated with the filename, then just reuse it when something tries to load it again; that cut off ~20-30 seconds from the morph loading process for me in a quick test.

    Perfs {'read': 0.15702350003448373, 'load': 0.3401700000013079, 'load2': 0.00016119998144858982, 'parseMorph': 1.4387110000097891, 'make_single': 7.645069100017281, 'build': 24.940205800001422}
    Repeats 567
    Repeats E:/My DAZ Connect Library/data/cloud/1_42071/data/daz 3d/genesis 8/female 8_1/genesis8_1female.dsf 140
    Repeats E:/My DAZ Connect Library/data/cloud/1_42071/data/daz 3d/genesis 8/female 8_1/morphs/daz 3d/facs/facs_bs_browdownright_div2.dsf 2
    [... many more files are loaded twice]
    

    That is after the change, “load” and “load2” are the same as just “load” before. You can see how genesis8_1female.dsf is loaded 140 times! And now it is only loaded once. That eliminates almost all the time spent in loadJson as far as I can tell.

    I just put this:

    def loadJson(filepath, mustOpen=False, silent=False):
        if filepath in GS.repeats:
            GS.repeats[filepath] += 1
            return GS.cache[filepath]
    

    at the top of loadJson, and this:

        GS.repeats[filepath] = 1
        GS.cache[filepath] = struct
    
        return struct
    

    at the bottom, and this:

    class GlobalSettings:
    
        def __init__(self):
            self.perfs = {}
            self.repeats = {}
            self.cache = {}
    

    into GlobalSettings.

    This should also help when loading morphs first, and then loading more morphs again after, since you don’t have to clear the cache and it will just stay in the GS instance. Maybe there is a better place for it (like a dedicated global).
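    As a side note, a possible alternative to the hand-rolled dict is functools.lru_cache; keying on the file’s mtime as well would avoid serving stale data when a file is rewritten on disk. A sketch under assumptions (the function names are stand-ins, not the plugin’s real loadJson):

```python
import json
import os
from functools import lru_cache

@lru_cache(maxsize=None)
def _load_json_cached(filepath, mtime):
    # mtime is part of the cache key only; editing the file on disk
    # changes it and so invalidates the cached entry automatically.
    with open(filepath, "r", encoding="utf-8") as f:
        return json.load(f)

def loadJsonCached(filepath):
    return _load_json_cached(filepath, os.path.getmtime(filepath))
```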

  23. Rakete

    Here is another one: the code spends ~24s in “build”, that is this code:

    self.buildDrivers()
    self.buildSumDrivers()
    self.buildRestDrivers()
    if self.isJcm:
        self.optimizeJcmDrivers()
    self.correctScaleParents()
    

    and most of the time is spent in self.buildSumDrivers(). So looking at that function it is three nested loops, and of course I thought most of the time is spent inside the most nested loop. But when I tried to measure it, I noticed almost no time is spent in the innermost loop, even in the second innermost loop almost no time is spent, but 99% of the time is spent only in the outermost loop. Which does almost nothing, it consists only of the second innermost loop, and a print:

    i_loop_t1 = perf_counter()
    for bdata in self.sumdrivers.values():
        j_loop_t1 = perf_counter()
        for channel,cdata in bdata.items():
            [...]
        j_loop_t2 = perf_counter()
        GS.perfs["buildSumDriversJ"] = GS.perfs.get("buildSumDriversJ", 0) + j_loop_t2 - j_loop_t1
        printName(" +", bname) # <- one lonely print
    i_loop_t2 = perf_counter()
    GS.perfs["buildSumDriversI"] = GS.perfs.get("buildSumDriversI", 0) + i_loop_t2 - i_loop_t1
    

    So, what is print? That is actually IO, and it blocks (depends on the terminal I guess), so what happens when I remove the print? Another ~20 seconds cut off the morph loading for me.

    Without print:

     'buildSumDriversI': 0.6377954999989015
    

    with print:

     'buildSumDriversI': 22.191953100002138
    

    Actually, I also changed the self.sumdrivers.items() to self.sumdrivers.values(), because I don’t need the bname anymore. Maybe that causes it? Though I do think it is probably the print, you can find stuff online suggesting print can be quite harmful to performance, and who knows how blender implements it, maybe there are a bunch of low hanging optimization fruits in the code by just eliminating prints from loops? Or maybe it only affects me when I have the python console open in blender while looking at the output?

    EDIT: Oh wait, it is actually not print, but printName, which does more than just print. So maybe you already noticed that particular problem and I am just re-discovering it.

    EDIT2: Yeah, absolutely I can reduce morph loading times just by not doing anything in printName. Loading my pose with a bunch of morphs (FACS, Expression) did go from ~70 seconds at the start, to now ~10 seconds. Eliminating printName(and newLine) makes a difference of ~20 seconds.
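    If per-item console output turns out to be the culprit on some setups, one cheap mitigation is to buffer the names and do a single write per batch instead of one print per iteration. A sketch (printName itself does more than this, so this is only the output side):

```python
import io
import sys

def report_names(names, prefix=" + ", out=sys.stdout):
    """Emit one write() for the whole batch instead of one print per item,
    so a slow terminal is hit once rather than once per driver."""
    buf = io.StringIO()
    for name in names:
        buf.write(f"{prefix}{name}\n")
    out.write(buf.getvalue())
```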

  24. Alessandro Padovani reporter

    diffeomorphic 1.7.2.1859, blender 3.6.4

    That is brilliant, thank you Rakete. In my test with a G8F figure with basic wear and toulouse hair the loading time went from 51 seconds to 25 seconds, that’s double the speed. @Thomas Larsson let us know what you think.

    p.s. in load_json.py we also have to add “from .settings import GS“.

    p.p.s. Victoria 9 with basic clothes and pixie hair takes 230 seconds here even with the new optimizations, vs 25 seconds for G8F. Luckily I don’t use G9.

  25. Rakete

    How does that 230 seconds compare to without optimizations? Maybe with these optimizations using ujson or orjson would get you a noticeable speedup for genesis 9, if that loading time is caused mainly by loading very large json files.

    I tried ujson and orjson again with my optimizations, and for my test pose it only resulted in a relatively minor improvement again (but at least there seemed to be one this time around).

  26. Alessandro Padovani reporter

    Well Victoria 9 without the new optimizations just takes forever, it’s 513 seconds here, luckily I don’t use G9. Thank you again for your nice work, now we wait for Thomas to get in.

  27. Xin

    Haven’t looked at this closely, but using Numba or Cython for preprocessing could perhaps work quite well for some tasks. Cython is used by quite a few Blender addons to isolate intensive tasks in native code. But if no libraries are needed and the code is mostly loops, then Numba is the better choice.

    This won’t be as fast as C++ but it can get quite close and is easier to use and maintain. You can use almost python-style syntax (Numba is almost the same as python), and you don’t need to worry as much about the details of how the data is transferred from python to native code.

  28. Thomas Larsson repo owner

    The json files that are loaded repeatedly are the definitions of the main genesis figures. Those json files are now cached, which results in a nice speedup. Other files are not cached, since most of those are only loaded once and caching them would eat up a lot of memory. I checked easy import with facs, facs expressions, jcms and flexions.

    Without caching:

    Facs loaded in 27.0 seconds
    Facsexpr loaded in 6.4 seconds
    Jcms loaded in 11.1 seconds
    Flexions loaded in 2.4 seconds
    File D:\home\bugs\genesis\G8\g8f-basic.duf loaded in 67.077 seconds
    

    With caching:

    Facs loaded in 4.9 seconds
    Facsexpr loaded in 0.3 seconds
    Jcms loaded in 2.5 seconds
    Flexions loaded in 0.6 seconds
    File D:\home\bugs\genesis\G8\g8f-basic.duf loaded in 29.560 seconds
    

  29. Thomas Larsson repo owner

    The printName and newLine functions don’t make a difference on my system. Have you tried to turn off Show In Terminal setting? This still prints something in the terminal when loading morphs manually, but not during easy import.

  30. Alessandro Padovani reporter

    commit cb0796a, blender 3.6.4, windows 10 22H2

    Works great here thank you Thomas for the fix and Rakete for finding this out.

    I can confirm that print makes no difference here: I tried disabling “Show In Terminal”, commenting out the prints in load_morph.py, and closing the terminal, and there’s no difference. If it makes a difference for Rakete then we may turn off “Show In Terminal” by default and place a warning in the tooltip that it may slow down the import process.

    possible bug. transfer to meshes without faces. With G9 I noticed that the importer spends quite some time transferring to “pixie cut”, that is the dforce hair. Apart from it being a hair mesh detected as clothing, which is not good, it is also a mesh without faces. So I wonder if we can avoid transferring to meshes without faces, since it makes little sense. Let me know.

    p.s. This means that for animals the dforce fur would not follow the skin morphs for example, but once converted to blender particles it will follow the emitter. I’m not sure whether it’s the same for curves in 4.0. Maybe we can have an option in the global settings to transfer to meshes without faces, off by default, so the user can choose if needed.

  31. Alessandro Padovani reporter

    @Xin Thank you for pointing out numba and cython. As for numba, it seems it needs to be installed separately since it’s a jit, and the user may not want to install python extensions for diffeomorphic. As for cython, it seems it can be used to compile some python code into a wrapped c++ dll, which could be wonderful for modules where performance is critical.

    Thomas let us know what you think.

    p.s. @Xin would you be able, for example, to use cython to wrap load_json.py and load_morph.py into c++ dlls ? I guess that would be the definitive solution here.

  32. Thomas Larsson repo owner

    Transfer shapekeys now ignores meshes without faces.

    There is one further optimization. When the plugin reads an asset, in this case a morph or a formula, it also reads its parent around line 306 of asset.py. The parent of a morph is the figure, which is why the G8.1F file was loaded for each morph.

    Getting the parent is important when we import a scene, but in that case the parent has already been loaded and cached, so the operation costs almost nothing. When we load morphs that cache is cleared before each morph, so the parent must be reimported. Even if the file is cached, the data is parsed for each morph.

    However, the parent is never used when we load morphs, only when we load scenes. So in the last commit the parent is not loaded when morphs are imported.

  33. Thomas Larsson repo owner

    I don’t want to use C wrappers or non-standard python modules unless there is a huge performance gain. Both because there will be a nuisance for the user to set up the dependencies, and because I’m not competent to handle it. With the last improvements I think that the performance is quite acceptable.

  34. Alessandro Padovani reporter

    Commit db669ed works fine, thank you Thomas for your effort and letting us know what you think.

    @Xin let us know for the wrapper, I’m looking into it myself now that I’m starting to learn some python and blender api. I did manage to write my own scripts to rig figures but I’m currently quite a noob.

  35. Rakete

    My old code had a problem anyway in that it cached just everything, poses included. Meaning if I saved a pose to a file, then changed it and saved it again to the same file, I couldn’t load the new version. The new code does not have that problem.

    Also agree that it is not worth adding more dependencies when they don’t improve the performance substantially, the json replacements were very underwhelming.

  36. Xin

    Alessandro Padovani, I would have to look closely to know for sure, but in general I would say yes, I don’t see why not.

    Another thing that maybe Thomas could try is to vectorize all operations with numpy, since I’m not sure that’s the case right now (I think it uses python lists sometimes, and operations on those are slow).
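    As an illustration of the difference, the “which vertices moved” test from outsideBox can be written either way; the vectorized form avoids the per-vertex python loop entirely. This is a sketch with plain arrays standing in for blender’s vertex data:

```python
import numpy as np

def moved_indices_loop(base, key, eps):
    # python-loop version: one iteration per vertex
    out = []
    for i, (b, k) in enumerate(zip(base, key)):
        if sum((kc - bc) ** 2 for kc, bc in zip(k, b)) ** 0.5 > eps:
            out.append(i)
    return out

def moved_indices_numpy(base, key, eps):
    # vectorized version: one norm over the whole (N, 3) array
    base, key = np.asarray(base, float), np.asarray(key, float)
    return np.nonzero(np.linalg.norm(key - base, axis=1) > eps)[0].tolist()
```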

  37. Alessandro Padovani reporter

    Thank you Xin for your reply.

    Numpy is used to transfer morphs in transfer.py. I had a look at cython, but it seems you have to refactor the code to get a good speed gain, and there are also issues compiling dlls since that part is not complete. So it’s not easily usable, or at least it’s not the python-to-c compiler I expected. In general I agree with Thomas that we have decent speed now, which doesn’t mean it can’t be improved of course.

    Personally I don’t work with G9 or HD figures so this is minor to me.

  38. Thomas Larsson repo owner

    Numpy is used in some places, in particular for the transfer of shapekeys. However, to use it we must transfer from Blender’s internal data structures and back. The gain in using numpy has to be weighed against the cost of such overhead. In particular I haven’t found a way to write to a shapekey or vertex group without doing a slow python loop.
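    One possible exception worth testing: blender’s property collections expose a bulk foreach_set/foreach_get that assigns a whole flat sequence in a single call, which should avoid the per-element python loop for shape key coordinates at least (vertex group weights go through a different API). A hedged sketch, with skey standing in for a bpy shape key:

```python
import numpy as np

def write_shapekey_coords(skey, coords):
    """Assign an (N, 3) array of coordinates to a shape key in one call,
    via the bulk foreach_set API instead of a per-vertex python loop."""
    flat = np.ascontiguousarray(coords, dtype=np.float32).ravel()
    skey.data.foreach_set("co", flat)
```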

  39. Alessandro Padovani reporter

    That I noted too. Also considering that shape keys in blender always cover the whole mesh, unlike vertex groups. I tried to skip all zero weights in the transfer process, that is, zero weights are not necessary to transfer. But that didn’t speed up anything so I guess it’s a minor optimization.

    p.s. Now I realize that the shapekey list is already without zeroes that’s why I didn’t get anything. Sorry it is difficult for me to read the code.

  40. Xin

    An old version of the HD morphs addon used Blender structures to load shape keys directly without the API, and it was very fast. But it was a big problem that it depended on Blender’s C/C++ code since that code changes a lot and is not documented, so it becomes very annoying to maintain.

    So I agree that any move of code to Cython should be easy to maintain otherwise it’s not worth it.
