Importing JIT-compiled modules on large cluster runs is a bottleneck

Issue #410 invalid
Johan Hake created an issue

JIT compilation in parallel is done only on the rank 0 MPI process. However, once the module is compiled, all the other processes try to import it. When np is large (~1000), this becomes a bottleneck.
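
A minimal sketch (not dolfin's actual code) of the flow described above, using mpi4py; build_module and load_module are placeholder stand-ins for the instant machinery, not real API:

```python
# Rank 0 JIT-compiles, everyone waits at a barrier, then every rank imports
# the module from the shared cache directory.
from mpi4py import MPI


def build_module(signature):
    """Stand-in for the JIT compilation performed on rank 0."""


def load_module(signature):
    """Stand-in for importing the compiled module from the shared cache."""


def jit(signature):
    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        build_module(signature)   # only rank 0 compiles
    comm.Barrier()                # all other ranks wait here
    # With np ~ 1000, every rank now reads the same module from the shared
    # filesystem at once -- the bottleneck this issue is about.
    return load_module(signature)
```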

Comments (13)

  1. Johan Hake reporter

    Possible solution: let more than one process JIT compile and store the module. For example, 50 processes could share the same instant cache dir. Then add logic so that JITing only occurs on processes with:

    rank % 50 == 0

    Such logic needs to be followed by a global barrier.

    We should probably add mpi4py as a dependency for instant and let instant handle this logic.
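
    A rough sketch of this idea with mpi4py; build_module and load_module are passed in as stand-ins for the instant compile/import calls, and the cache-dir naming is illustrative:

    ```python
    # Sketch only: one rank per group of 50 compiles into a cache dir shared
    # by its group, followed by a global barrier before anyone imports.
    from mpi4py import MPI

    GROUP_SIZE = 50


    def jit_in_groups(signature, build_module, load_module):
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        group = rank // GROUP_SIZE
        cache_dir = "instant_cache_group_%d" % group   # shared within the group

        if rank % GROUP_SIZE == 0:
            build_module(signature, cache_dir)         # one compiler per group

        comm.Barrier()   # no rank imports before the compilations have finished
        return load_module(signature, cache_dir)
    ```

    A per-group communicator (comm.Split(group)) could replace the global barrier so fast groups do not wait for slow ones, at the cost of a bit more setup.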

  2. Prof Garth Wells

    Is this more generally a Python import problem? We (@chris_richardson and I) see that importing Python modules is the most expensive part of a single elliptic equation solve.

  3. Johan Hake reporter

    Yes, I think so. Andy Terrell warned us about this in a thread a couple of years ago, where he explained it. Not sure where that thread is now, though.

    What you describe is the same problem we see on our local cluster. I am pretty sure, though I need to nail down the numbers, that this is because all processes try to import the same module that was compiled on the rank 0 process. It might be that reading the Python module is the actual bottleneck. I do not think the MPI communication is much of a bottleneck, but it might be as well.

  4. Prof Garth Wells

    @chris_richardson has tried some approaches that help (but I think do not completely resolve the issue). I'm sure he'll comment.

    Some background: some modules, e.g. SymPy, import a large number of modules (FIAT imports SymPy). Apparently Python can load pure Python modules from memory/a virtual file system, but not modules that require a compiled library.

  5. Chris Richardson

    There are probably a lot of factors involved. Typical HPC filesystems (e.g. LustreFS) are optimised for accessing large datasets from multiple processes. They have problems with accessing many small files from many processes. NFS is probably even worse.

    Is this just a problem with JIT modules, or do you also see slow start-up times? i.e. how long does it take for a job that is just from dolfin import *?

    In the case of large pure Python libraries (e.g. sympy) you can zip the files up and use the Python zipimporter, which gives a clear performance improvement. Unfortunately, you cannot include .so modules in the zip, so those always have to be loaded separately. I haven't really looked closely at the JIT modules, as they tend to be smaller - but I guess they can add up, if you have a lot of them. My only advice would be to limit the number of files - if we can compress all the pure Python into one file, that would help...
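
    For reference, a minimal sketch of that approach; the archive path and the build command are assumptions, not part of any existing setup:

    ```python
    import sys

    # Assumes the pure-Python part of sympy was packed beforehand, e.g.
    #   cd /path/to/site-packages && zip -r /work/myuser/sympy.zip sympy
    # Putting the archive on sys.path lets Python's built-in zipimporter
    # resolve the package from a single file.
    sys.path.insert(0, "/work/myuser/sympy.zip")

    import sympy  # one archive read instead of hundreds of small file opens;
                  # compiled .so extensions still have to live on the filesystem
    ```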

  6. Chris Richardson

    I'm not clear that your solution will work, either... I think it will impact the filesystem in a similar way. I don't think it is a problem for multiple processes to read the same file at the same time; it is just the sheer number of filesystem accesses. I guess it may be system dependent.

  7. Johan Hake reporter

    I have to ask @johannes_ring for the actual numbers for the pure dolfin import.

    If it is not only the JIT (or the JIT part is negligible), we need to change the title and description of this issue.

    Aren't sympy/numpy installed on each compute node? Is the filesystem access then still a problem? I guess on many clusters we are also not allowed to install FEniCS on the compute nodes, and we probably take a hit when all processes import dolfin from the filesystem on the user nodes.

  8. Chris Richardson

    On a large HPC system, there are no "local" files on compute nodes. Even /tmp may be memory resident. Typically the files are distributed on "OSS" or "OST" object storage servers/targets and the location is stored in a metadata server. When 1000 cores all ask for sympy (~300 files), it triggers 300,000 requests to the metadata server, followed by the same number to the respective OSS/OST servers... If we can compress the library and use the zipimporter, that reduces to 1000 requests. In theory, we could use MPI to distribute the pure Python code and write our own importer hook, reducing it to just one read.
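
    One possible shape of that last idea, sketched with mpi4py: rank 0 reads a zip bundle of the pure-Python code once, broadcasts the bytes, and every rank writes it to node-local scratch and imports from there. The bundle path, scratch location and file names are assumptions.

    ```python
    import os
    import sys
    from mpi4py import MPI


    def broadcast_bundle(zip_path, scratch="/tmp"):
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        # Only rank 0 touches the shared filesystem.
        data = None
        if rank == 0:
            with open(zip_path, "rb") as f:
                data = f.read()

        # Every other rank receives the archive over MPI, not the filesystem.
        data = comm.bcast(data, root=0)

        # Write to node-local scratch (rank in the name avoids collisions when
        # several ranks share a node) and let the zipimporter take over.
        local = os.path.join(scratch, "py_bundle_%d.zip" % rank)
        with open(local, "wb") as f:
            f.write(data)
        sys.path.insert(0, local)


    broadcast_bundle("/work/shared/sympy.zip")
    import sympy  # now resolved from the node-local copy
    ```

    A real import hook (sys.meta_path/importlib) could keep the archive in memory and skip the local write entirely; the sketch above just reuses the existing zipimporter.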

  9. Johan Hake reporter

    Ok, then I got the cause of this all wrong, but good to have an issue reported.

    Even though dolfin on its own does not drag in that many Python modules, some of its dependencies do. So I guess it is still a problem for dolfin.

    What would such a custom importer look like? Any links to someone who uses one?

  10. Johan Hake reporter

    We do not have any proof that importing JIT modules on large clusters takes significant time. I will open another issue addressing the cost of just importing dolfin on large clusters.
