Slow simulation NS2D over a biperiodic space?

Issue #38 new
Pierre Augier
created an issue

Hi,

I am Pierre Augier, one of the developer of fluidsim (https://bitbucket.org/fluiddyn/fluidsim).

In order to compare our performance with other pseudo-spectral CFD codes, we tried Dedalus.

I suspect that we are not doing it the right way (see our script https://bitbucket.org/fluiddyn/fluidsim/src/master/bench/dedalus/ns2d_rot.py) because Dedalus is much (approximately 30 times on my computer) slower than the other codes.

Since we are going to include these comparisons in an article, we would like to get the best of Dedalus. Is there something that I can do to get better performance with Dedalus for this very simple case (NS2D over a biperiodic space, 10 time steps)?

I tried with 512**2 and 1024**2 and I got similar results.

Comments (9)

  1. Keaton Burns

    Hi Pierre,

    Thanks for your interest in comparing Dedalus to fluidsim -- it looks like a great project.

    It looks like the version of the script in your repository is currently throwing a singular matrix error because the gauge of the streamfunction isn't specified, so I've modified the equations a bit to set it to zero (script attached). With those changes, and the default Dedalus settings, I'm seeing a baseline time running the script serially with n=256 of T0 = 4.39 seconds on my laptop. There are three major improvements I'd recommend making:

    1) Dedalus lazily constructs the required transform and transposes plans the first time they are required, which is typically during the first timestep. This means the first timestep should usually be considered as a startup cost, and not indicative of the simulation speed. If I simply copy your main loop to run 10 startup iterations, and then time the following 10, I get a time of T1 = 4.25 seconds. Note this startup cost should become less important at higher resolutions, but maybe more important in parallel (due to transpose planning).

    2) The most important thing for improving performance is to set the "STORE_LU" option to True in the Dedalus configuration file. This will store and re-use the LU factorization of the LHS matrices when the timestep is unchanged from the previous iteration. It is currently off by default (which we should probably change), because the LU factorization library wrapped in Scipy can have an enormous memory footprint, and we were leaning towards stability over speed for the default settings. Changing this flag, I get a time of T2 = 1.38 seconds.

    3) Finally, I noticed you're using the RK443 timestepper. This is a 4-stage 3rd order Runge-Kutta method, which will be evaluating the RHS expressions and solving the LHS matrices 4 times per iteration. If you're using the same method for other codes, that's ok, but otherwise it's probably most fair to pick timesteppers with the same number of solves per iteration. A good substitute might be SBDF3, which is a 3rd order multistep method that only uses one solve per iteration. Switching to SBDF3, I get a time of T3 = 0.38 seconds.

    I'd also point out that Dedalus doesn't implement any fully explicit timesteppers -- they are all IMEX schemes, which may make comparisons to fully explicit codes a little tricky, since you're trading off speed-per-iteration for stability with larger timesteps. From our previous comparisons, we very roughly expect Dedalus to be 2-4x slower than other implicitly-timestepped Fourier pseudospectral codes -- I think it's fair to say that our focus so far has been optimizing for bounded domains with Chebyshev methods.

    Best, -Keaton

  2. Pierre Augier reporter

    Hi Keaton,

    Thank you for your nice answer.

    For simplicity and to be fair with all codes, I will simply compare the elapsed time for 10 RK4 time steps. All codes implement a Runge-Kutta 4 scheme and it is often a good and simple choice for real life simulations.

    For the considered case (NS2D, Fourier-Fourier, RK4), Dedalus is indeed quite slow (~ 15 time slower than fluidsim). Of course I'm going to point out that Dedalus is very versatile and that it has been more optimized for bounded domains with Chebyshev methods.

  3. Keaton Burns

    Hi Pierre,

    I'm not sure it's the right comparison -- if I understand correctly, the other codes are implementing the classic 4-stage explicit RK4 method, correct? Our RK443 is NOT this scheme. It is a 4-stage, third-order mixed implicit-explicit scheme described in Ascher 1997. The first is fully explicit but the second is performing four implicit matrix solves per iteration. They are very different schemes with very different stability properties, with the IMEX scheme allowing for much larger timestep sizes in practice.

    Since the codes do not implement comparable methods, perhaps a better test of performance is to compute the time necessary to compute a particular solution within a given accuracy, allowing for different timesteps between different integrators? We'd be happy to help set this up if you're interested.

    Best, -Keaton

  4. Pierre Augier reporter

    Ok I understand your point. Dedalus does not also implement the classic RK4 method ? Or the classical RK2 method ?

    I can't download the article (Elsevier) so I can't really study this RK443 scheme. Are the equations summarized in the documentation of Dedalus or in another open document that I could get? How do you choose the value of the time step for this scheme? Is it based on a CFL coefficient?

    Note that the linear terms are treated fully implicitly in some of the other codes (exact integration).

    Time stepping is a complicated subject (and there is also the issue of phase shifting which changes everything!), so it is not simple to compare the performance of different schemes. This is why I would prefer to compare the raw performance of the codes with a standard and simple time stepping method.

  5. Keaton Burns

    Hi Pierre, sorry for the delay, I was wrapping up my thesis and then took some time off! Currently, we just implement IMEX schemes, so no fully explicit methods or exponential-explicit methods, since these aren't practical for the matrices that come from Chebyshev discretizations. The tableaus of the implemented schemes are listed in the timesteppers.py module, and the general form for both the IMEX RK schemes and IMEX multistep schemes are listed in the class docstrings there.

    For a fluid simulation, the timestep is usually based on a CFL coefficient when the viscous terms are integrated implicitly. In practice we find that the maximum stable safety factor can vary by a substantial amount for different integrators depending on the equation set, which is why we took the approach of just implementing a range of options and letting the user test and pick the best option for their specific equations.

    We've thought a bit about implementing some exponential timesteppers which should speed things up for constant-coefficeint, fully-Fourier problems, but haven't gotten around to this yet since we're all primarily using Chebyshev discretizations in our research. This would also be a welcome pull-request if anyone reading would like to take a crack at it!

  6. Keaton Burns

    Hi Pierre, another big thing to check -- are you trying to compare to other spectral codes using 512 x 512 dealiased modes or a 512 x 512 grid? In Dedalus, the "resolution" of the bases corresponds to dealiased modes, and dealiasing is done by padding the modes by 3/2 before transforming, so these Dedalus simulations correspond to a grid size of 768 x 768. If you're comparing to other codes which start with a 512 x 512 grid and apply a 2/3 truncation to dealias, then the right comparison would be to set the Dedalus basis resolution to 341, and the dealias keyword to 512/341 to end up on a 512 x 512 grid.

  7. Log in to comment