Slow NS2D simulation over a biperiodic space?

Issue #38 (new)
Pierre Augier created an issue

Hi,

I am Pierre Augier, one of the developers of fluidsim (https://bitbucket.org/fluiddyn/fluidsim).

In order to compare our performance with other pseudo-spectral CFD codes, we tried Dedalus.

I suspect that we are not doing it the right way (see our script https://bitbucket.org/fluiddyn/fluidsim/src/master/bench/dedalus/ns2d_rot.py), because Dedalus is much slower (approximately 30 times slower on my computer) than the other codes.

Since we are going to include these comparisons in an article, we would like to get the best out of Dedalus. Is there something I can do to get better performance with Dedalus for this very simple case (NS2D over a biperiodic space, 10 time steps)?

I tried with 512**2 and 1024**2 and I got similar results.
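
To be explicit about what we measure: it is just the wall-clock time of the main loop over 10 time steps, something like the sketch below, where step() stands in for one time step of whichever code is being benchmarked (the real Dedalus version is in the linked ns2d_rot.py).

```python
import time

def time_n_steps(step, n_steps=10):
    """Wall-clock time of n_steps calls to step(), a stand-in for one time
    step of whichever code is being benchmarked."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step()
    return time.perf_counter() - start
```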

Comments (7)

  1. Keaton Burns

    Hi Pierre,

    Thanks for your interest in comparing Dedalus to fluidsim -- it looks like a great project.

    The version of the script in your repository is currently throwing a singular-matrix error because the gauge of the streamfunction isn't specified, so I've modified the equations a bit to set it to zero (script attached). With those changes and the default Dedalus settings, running the script serially with n=256, I see a baseline time of T0 = 4.39 seconds on my laptop. There are three major improvements I'd recommend making (a rough sketch combining them follows the list):

    1) Dedalus lazily constructs the required transform and transpose plans the first time they are needed, which is typically during the first timestep. This means the first timestep should usually be considered a startup cost, not indicative of the simulation speed. If I simply copy your main loop to run 10 startup iterations and then time the following 10, I get a time of T1 = 4.25 seconds. Note this startup cost should become less important at higher resolutions, but maybe more important in parallel (due to transpose planning).

    2) The most important thing for improving performance is to set the "STORE_LU" option to True in the Dedalus configuration file. This will store and re-use the LU factorization of the LHS matrices when the timestep is unchanged from the previous iteration. It is currently off by default (which we should probably change), because the LU factorization library wrapped in Scipy can have an enormous memory footprint, and we were leaning towards stability over speed for the default settings. Changing this flag, I get a time of T2 = 1.38 seconds.

    3) Finally, I noticed you're using the RK443 timestepper. This is a 4-stage 3rd order Runge-Kutta method, which will be evaluating the RHS expressions and solving the LHS matrices 4 times per iteration. If you're using the same method for other codes, that's ok, but otherwise it's probably most fair to pick timesteppers with the same number of solves per iteration. A good substitute might be SBDF3, which is a 3rd order multistep method that only uses one solve per iteration. Switching to SBDF3, I get a time of T3 = 0.38 seconds.
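
    For reference, here is a rough sketch of what those three changes might look like together in a minimal NS2D setup. The attached script is the authoritative version; the variable names, viscosity, timestep, and initial condition below are only placeholders, and STORE_LU still has to be enabled in the configuration file.

    ```python
    import time

    import numpy as np
    from dedalus import public as de

    # Remember to set STORE_LU = True in the Dedalus configuration file first
    # (point 2 above); it is not changed programmatically in this sketch.

    n = 256
    nu = 1e-3   # placeholder viscosity
    dt = 1e-3   # placeholder (fixed) timestep

    x_basis = de.Fourier('x', n, interval=(0, 2*np.pi), dealias=3/2)
    y_basis = de.Fourier('y', n, interval=(0, 2*np.pi), dealias=3/2)
    domain = de.Domain([x_basis, y_basis], grid_dtype=np.float64)

    # Vorticity-streamfunction formulation: viscosity on the LHS (implicit),
    # advection on the RHS (explicit).
    problem = de.IVP(domain, variables=['psi', 'w'])
    problem.parameters['nu'] = nu
    problem.substitutions['u'] = "-dy(psi)"
    problem.substitutions['v'] = "dx(psi)"
    problem.add_equation("dt(w) - nu*(dx(dx(w)) + dy(dy(w))) = -u*dx(w) - v*dy(w)")
    problem.add_equation("w - dx(dx(psi)) - dy(dy(psi)) = 0",
                         condition="(nx != 0) or (ny != 0)")
    # Gauge: pin the mean mode of psi, which removes the singular-matrix error.
    problem.add_equation("psi = 0", condition="(nx == 0) and (ny == 0)")

    # 3) One implicit solve per iteration with the SBDF3 multistep scheme.
    solver = problem.build_solver(de.timesteppers.SBDF3)

    # Smooth placeholder initial condition (w = laplacian of psi).
    x, y = domain.grid(0), domain.grid(1)
    solver.state['psi']['g'] = np.sin(x) * np.sin(y)
    solver.state['w']['g'] = -2 * np.sin(x) * np.sin(y)

    # 1) Warm-up iterations so transform/transpose planning is not timed.
    for _ in range(10):
        solver.step(dt)

    start = time.time()
    for _ in range(10):
        solver.step(dt)
    print('10 timed iterations:', time.time() - start, 'seconds')
    ```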

    I'd also point out that Dedalus doesn't implement any fully explicit timesteppers -- they are all IMEX schemes, which may make comparisons to fully explicit codes a little tricky, since you're trading off speed-per-iteration for stability with larger timesteps. From our previous comparisons, we very roughly expect Dedalus to be 2-4x slower than other implicitly-timestepped Fourier pseudospectral codes -- I think it's fair to say that our focus so far has been optimizing for bounded domains with Chebyshev methods.

    Best, -Keaton

  2. Pierre Augier reporter

    Hi Keaton,

    Thank you for your nice answer.

    For simplicity, and to be fair to all codes, I will simply compare the elapsed time for 10 RK4 time steps. All the codes implement a Runge-Kutta 4 scheme, and it is often a good and simple choice for real-life simulations.

    For the considered case (NS2D, Fourier-Fourier, RK4), Dedalus is indeed quite slow (~15 times slower than fluidsim). Of course, I'm going to point out that Dedalus is very versatile and that it has been optimized more for bounded domains with Chebyshev methods.

  3. Keaton Burns

    Hi Pierre,

    I'm not sure it's the right comparison -- if I understand correctly, the other codes are implementing the classic 4-stage explicit RK4 method, correct? Our RK443 is NOT this scheme. It is a 4-stage, third-order mixed implicit-explicit scheme described in Ascher et al. 1997. The former is fully explicit, while the latter performs four implicit matrix solves per iteration. They are very different schemes with very different stability properties, with the IMEX scheme allowing for much larger timestep sizes in practice.

    Since the codes do not implement comparable methods, perhaps a better test of performance is to measure the time needed to compute a particular solution to within a given accuracy, allowing different timesteps for different integrators. We'd be happy to help set this up if you're interested.
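
    Concretely, such a comparison might look something like the sketch below, where run(scheme, dt, t_end) is a placeholder wrapping whichever code and scheme is being tested and returning the final vorticity field, and reference is a solution computed once beforehand with a very small timestep (nothing here is specific to Dedalus).

    ```python
    import time

    import numpy as np

    def cheapest_run_within_tolerance(run, schemes, dts, t_end, reference, tol=1e-6):
        """Time each (scheme, dt) pair and return the fastest run whose relative
        max error against the reference solution is below tol."""
        results = []
        for scheme in schemes:
            for dt in dts:
                start = time.perf_counter()
                w = run(scheme, dt, t_end)          # final vorticity field
                elapsed = time.perf_counter() - start
                error = np.max(np.abs(w - reference)) / np.max(np.abs(reference))
                results.append((scheme, dt, elapsed, error))
        accurate = [r for r in results if r[3] <= tol]
        return min(accurate, key=lambda r: r[2]) if accurate else None
    ```

    That way an IMEX scheme can pass the tolerance with a much larger timestep than a fully explicit one, and the wall-clock comparison stays fair.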

    Best, -Keaton

  4. Pierre Augier reporter

    OK, I understand your point. Doesn't Dedalus also implement the classic RK4 method, or the classical RK2 method?

    I can't download the article (Elsevier), so I can't really study this RK443 scheme. Are the equations summarized in the Dedalus documentation or in another open document that I could get? How do you choose the value of the time step for this scheme? Is it based on a CFL coefficient?

    Note that the linear terms are treated fully implicitly in some of the other codes (exact integration).
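
    To illustrate what I mean by exact integration: in spectral space the viscous term can be absorbed with an integrating factor, so the RK4 stages only advance the nonlinear term. Here is a rough sketch of one such step (the generic trick, not fluidsim's actual implementation; nonlinear() is a placeholder for the dealiased advection term in spectral space).

    ```python
    import numpy as np

    def ifrk4_step(w_hat, dt, nu, ksq, nonlinear):
        """One classic RK4 step of dw_hat/dt = -nu*ksq*w_hat + N(w_hat), with
        the viscous term integrated exactly via the substitution
        v_hat = exp(nu*ksq*t) * w_hat (integrating factor)."""
        E = np.exp(-nu * ksq * dt)       # exact viscous decay over a full step
        E2 = np.exp(-nu * ksq * dt / 2)  # ... over half a step
        N1 = nonlinear(w_hat)
        N2 = nonlinear(E2 * (w_hat + 0.5 * dt * N1))
        N3 = nonlinear(E2 * w_hat + 0.5 * dt * N2)
        N4 = nonlinear(E * w_hat + dt * E2 * N3)
        return E * w_hat + dt / 6 * (E * N1 + 2 * E2 * (N2 + N3) + N4)
    ```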

    Time stepping is a complicated subject (and there is also the issue of phase shifting, which changes everything!), so it is not simple to compare the performance of different schemes. This is why I would prefer to compare the raw performance of the codes with a standard and simple time-stepping method.
