Slow simulation NS2D over a biperiodic space?
Hi,
I am Pierre Augier, one of the developers of fluidsim (https://bitbucket.org/fluiddyn/fluidsim).
In order to compare our performance with other pseudospectral CFD codes, we tried Dedalus.
I suspect that we are not doing it the right way (see our script https://bitbucket.org/fluiddyn/fluidsim/src/master/bench/dedalus/ns2d_rot.py) because Dedalus is much slower than the other codes (approximately 30 times slower on my computer).
Since we are going to include these comparisons in an article, we would like to get the best of Dedalus. Is there something that I can do to get better performance with Dedalus for this very simple case (NS2D over a biperiodic space, 10 time steps)?
I tried with 512**2 and 1024**2 and I got similar results.
Comments (13)


Hi Pierre,
Thanks for your interest in comparing Dedalus to fluidsim; it looks like a great project.
It looks like the version of the script in your repository is currently throwing a singular matrix error because the gauge of the streamfunction isn't specified, so I've modified the equations a bit to set it to zero (script attached). With those changes, and the default Dedalus settings, I'm seeing a baseline time running the script serially with n=256 of T0 = 4.39 seconds on my laptop. There are three major improvements I'd recommend making:
1) Dedalus lazily constructs the required transform and transpose plans the first time they are needed, which is typically during the first timestep. This means the first timestep should usually be considered a startup cost, and not indicative of the simulation speed. If I simply copy your main loop to run 10 startup iterations, and then time the following 10, I get a time of T1 = 4.25 seconds. Note this startup cost should become less important at higher resolutions, but maybe more important in parallel (due to transpose planning).
2) The most important thing for improving performance is to set the "STORE_LU" option to True in the Dedalus configuration file. This will store and reuse the LU factorization of the LHS matrices when the timestep is unchanged from the previous iteration. It is currently off by default (which we should probably change), because the LU factorization library wrapped in Scipy can have an enormous memory footprint, and we were leaning towards stability over speed for the default settings. Changing this flag, I get a time of T2 = 1.38 seconds.
3) Finally, I noticed you're using the RK443 timestepper. This is a 4-stage, 3rd-order Runge-Kutta method, which will be evaluating the RHS expressions and solving the LHS matrices 4 times per iteration. If you're using the same method for other codes, that's ok, but otherwise it's probably most fair to pick timesteppers with the same number of solves per iteration. A good substitute might be SBDF3, which is a 3rd-order multistep method that only uses one solve per iteration. Switching to SBDF3, I get a time of T3 = 0.38 seconds.
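The warm-up advice in point 1 can be sketched generically as follows. This is a minimal sketch and not the attached script: `step` is a hypothetical stand-in for one solver iteration, and the 10/10 split simply mirrors the numbers quoted above.

```python
import time


def time_after_warmup(step, n_startup=10, n_timed=10):
    """Run n_startup untimed iterations, then time n_timed iterations.

    `step` is a hypothetical zero-argument callable standing in for one
    solver iteration (for example, a call that advances the solver by dt).
    """
    # Startup loop: lazily built plans and caches are created here.
    for _ in range(n_startup):
        step()

    # Timed loop: only these iterations count towards the benchmark.
    start = time.perf_counter()
    for _ in range(n_timed):
        step()
    return time.perf_counter() - start


if __name__ == "__main__":
    calls = []
    elapsed = time_after_warmup(lambda: calls.append(1))
    print(f"{len(calls)} calls in total, timed part took {elapsed:.6f} s")
```

Keeping the two loops separate makes it easy to report the startup cost and the steady-state cost independently.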
I'd also point out that Dedalus doesn't implement any fully explicit timesteppers; they are all IMEX schemes, which may make comparisons to fully explicit codes a little tricky, since you're trading off speed-per-iteration for stability with larger timesteps. From our previous comparisons, we very roughly expect Dedalus to be 2-4x slower than other implicitly-timestepped Fourier pseudospectral codes. I think it's fair to say that our focus so far has been optimizing for bounded domains with Chebyshev methods.
Best,
Keaton 
 attached ns2d_rot.py
ns2d_rot.py, updated to set streamfunction gauge, use SBDF3, and separate startup loops from timing loops.

 attached dedalus.cfg
Configuration file with STORE_LU set to True.
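For readers without access to the attachment, the change amounts to a single option. The sketch below parses an illustrative fragment with Python's standard `configparser`. The `[linear algebra]` section name is an assumption based on the usual layout of the Dedalus configuration file and is not stated in this thread, so check it against the `dedalus.cfg` shipped with your installation.

```python
import configparser

# Illustrative fragment only. The section name is an assumption; only the
# STORE_LU option itself is named in this thread.
CFG_TEXT = """
[linear algebra]
STORE_LU = True
"""

config = configparser.ConfigParser()
config.read_string(CFG_TEXT)

# Option names are case-insensitive in configparser by default.
store_lu = config["linear algebra"].getboolean("STORE_LU")
print("STORE_LU =", store_lu)
```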

Hi Keaton,
Thank you for your nice answer.
For simplicity and to be fair with all codes, I will simply compare the elapsed time for 10 RK4 time steps. All codes implement a Runge-Kutta 4 scheme and it is often a good and simple choice for real-life simulations.
For the considered case (NS2D, Fourier-Fourier, RK4), Dedalus is indeed quite slow (~15 times slower than fluidsim). Of course I'm going to point out that Dedalus is very versatile and that it has been more optimized for bounded domains with Chebyshev methods.

Hi Pierre,
I'm not sure it's the right comparison. If I understand correctly, the other codes are implementing the classic 4-stage explicit RK4 method, correct? Our RK443 is NOT this scheme. It is a 4-stage, third-order mixed implicit-explicit scheme described in Ascher 1997. The first is fully explicit but the second is performing four implicit matrix solves per iteration. They are very different schemes with very different stability properties, with the IMEX scheme allowing for much larger timestep sizes in practice.
Since the codes do not implement comparable methods, perhaps a better test of performance is to compute the time necessary to compute a particular solution within a given accuracy, allowing for different timesteps between different integrators? We'd be happy to help set this up if you're interested.
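To illustrate why implicit treatment of stiff linear terms permits larger timesteps, here is a minimal sketch on the scalar test equation dy/dt = -lam * y. It uses first-order forward and backward Euler, which are much simpler than either classic RK4 or RK443 and serve only as stand-ins; the values of `lam` and `dt` are arbitrary choices for the demonstration.

```python
def forward_euler(y0, lam, dt, n_steps):
    """Fully explicit update for dy/dt = -lam * y."""
    y = y0
    for _ in range(n_steps):
        y = y + dt * (-lam * y)
    return y


def backward_euler(y0, lam, dt, n_steps):
    """Implicit update: solve y_new = y_old - dt * lam * y_new each step."""
    y = y0
    for _ in range(n_steps):
        y = y / (1.0 + dt * lam)
    return y


# Stiff decay rate with a timestep five times above the forward Euler
# stability limit dt < 2 / lam.
lam, dt, n_steps = 1000.0, 0.01, 10
y_explicit = forward_euler(1.0, lam, dt, n_steps)
y_implicit = backward_euler(1.0, lam, dt, n_steps)

print("explicit:", y_explicit)  # grows in magnitude: unstable
print("implicit:", y_implicit)  # decays towards zero: stable
```

The explicit update is cheaper per step but blows up at this timestep, while the implicit update pays for a solve and remains stable, which is the trade-off described above.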
Best,
Keaton 
Ok, I understand your point. Does Dedalus not also implement the classic RK4 method? Or the classical RK2 method?
I can't download the article (Elsevier) so I can't really study this RK443 scheme. Are the equations summarized in the documentation of Dedalus or in another open document that I could get? How do you choose the value of the time step for this scheme? Is it based on a CFL coefficient?
Note that the linear terms are treated fully implicitly in some of the other codes (exact integration).
Time stepping is a complicated subject (and there is also the issue of phase shifting which changes everything!), so it is not simple to compare the performance of different schemes. This is why I would prefer to compare the raw performance of the codes with a standard and simple time stepping method.

Hi Pierre, sorry for the delay, I was wrapping up my thesis and then took some time off! Currently, we just implement IMEX schemes, so no fully explicit methods or exponential-explicit methods, since these aren't practical for the matrices that come from Chebyshev discretizations. The tableaus of the implemented schemes are listed in the timesteppers.py module, and the general forms of both the IMEX RK schemes and IMEX multistep schemes are listed in the class docstrings there.
For a fluid simulation, the timestep is usually based on a CFL coefficient when the viscous terms are integrated implicitly. In practice we find that the maximum stable safety factor can vary by a substantial amount for different integrators depending on the equation set, which is why we took the approach of just implementing a range of options and letting the user test and pick the best option for their specific equations.
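As a rough illustration of a CFL-based timestep, the sketch below computes dt from a maximum velocity and the grid spacing. The function name, the default safety factor, and the example numbers are hypothetical and are not taken from Dedalus or from the benchmark script; as noted above, the largest stable safety factor depends on the integrator and the equation set.

```python
import math


def cfl_timestep(max_speed, grid_spacing, safety=0.5, max_dt=float("inf")):
    """Return dt = safety * grid_spacing / max_speed, capped at max_dt."""
    if max_speed <= 0.0:
        # No advection: the CFL condition imposes no limit.
        return max_dt
    return min(safety * grid_spacing / max_speed, max_dt)


# Hypothetical example: a 2*pi periodic box with 512 grid points and a
# maximum speed of 1.
dx = 2.0 * math.pi / 512
print("dt =", cfl_timestep(max_speed=1.0, grid_spacing=dx))
```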
We've thought a bit about implementing some exponential timesteppers which should speed things up for constant-coefficient, fully-Fourier problems, but haven't gotten around to this yet since we're all primarily using Chebyshev discretizations in our research. This would also be a welcome pull request if anyone reading would like to take a crack at it!

Hi Pierre, another big thing to check: are you trying to compare to other spectral codes using 512 x 512 dealiased modes or a 512 x 512 grid? In Dedalus, the "resolution" of the bases corresponds to dealiased modes, and dealiasing is done by padding the modes by 3/2 before transforming, so these Dedalus simulations correspond to a grid size of 768 x 768. If you're comparing to other codes which start with a 512 x 512 grid and apply a 2/3 truncation to dealias, then the right comparison would be to set the Dedalus basis resolution to 341, and the dealias keyword to 512/341 to end up on a 512 x 512 grid.
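The mode and grid counts quoted above follow from simple arithmetic, sketched here. The helper names are hypothetical and are not part of Dedalus or fluidsim.

```python
def grid_size_from_modes(n_modes, dealias=1.5):
    """Grid points per dimension when the modes are padded by `dealias`."""
    return round(n_modes * dealias)


def modes_from_grid(n_grid):
    """Modes kept per dimension after a 2/3 truncation of an n_grid grid."""
    return (2 * n_grid) // 3


# 512 dealiased modes padded by 3/2 give the grid actually transformed.
print(grid_size_from_modes(512))

# A 512-point grid with 2/3 truncation keeps this many modes.
n_modes = modes_from_grid(512)
print(n_modes)

# Padding those modes by 512/341 lands back on a 512-point grid.
print(grid_size_from_modes(n_modes, dealias=512 / n_modes))
```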

Ok, I took a closer look at the script, and noticed that there are also big improvements we can make to the problem formulation. In Dedalus, only Chebyshev problems need to be reduced to first order; higher-order derivatives are fine with Fourier bases. This means all of the diagnostic equations here can actually be replaced with substitution rules relating rot, u, and v to psi. Making these changes also speeds up the code quite a bit, in addition to compensating for the different dealiasing strategies. Current timings on my laptop look like:
FluidSim:
512^2 grid: 0.56 sec
1024^2 grid: 2.76 sec
Old Dedalus script:
512^2 modes: 5.73 sec
1024^2 modes: 26.93 sec
Updated Dedalus script:
512^2 grid: 1.19 sec
1024^2 grid: 6.78 sec
I'll post this over on the FluidSim issue as well.
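For reference, the relative cost on matched grids can be computed directly from the timings listed above, and comes out at roughly 2.1x and 2.5x. The old-script timings are left out because they were measured on 512^2 and 1024^2 modes rather than grid points.

```python
# Timings in seconds, copied from the comment above.
fluidsim = {512: 0.56, 1024: 2.76}
dedalus_updated = {512: 1.19, 1024: 6.78}

# Ratio of updated Dedalus time to FluidSim time at each grid size.
ratios = {n: dedalus_updated[n] / fluidsim[n] for n in fluidsim}
for n, ratio in ratios.items():
    print(f"{n}^2 grid: updated Dedalus takes {ratio:.1f}x the FluidSim time")
```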

I confirm the nice improvement for Dedalus! Here are my measurements:
augier3pi@meige8pcpa79:~/Dev/fluidsim/bench/dedalus$ time python ns2d_rot_faster.py
2018-10-24 10:17:24,379 pencil 0/1 INFO :: Building pencil matrix 1/171 (~1%) Elapsed: 0s, Remaining: 38s, Rate: 4.5e+00/s
2018-10-24 10:17:28,216 pencil 0/1 INFO :: Building pencil matrix 18/171 (~11%) Elapsed: 4s, Remaining: 35s, Rate: 4.4e+00/s
2018-10-24 10:17:32,193 pencil 0/1 INFO :: Building pencil matrix 36/171 (~21%) Elapsed: 8s, Remaining: 30s, Rate: 4.5e+00/s
2018-10-24 10:17:34,197 pencil 0/1 INFO :: Building pencil matrix 45/171 (~26%) Elapsed: 10s, Remaining: 28s, Rate: 4.5e+00/s
2018-10-24 10:17:36,202 pencil 0/1 INFO :: Building pencil matrix 54/171 (~32%) Elapsed: 12s, Remaining: 26s, Rate: 4.5e+00/s
2018-10-24 10:17:40,211 pencil 0/1 INFO :: Building pencil matrix 72/171 (~42%) Elapsed: 16s, Remaining: 22s, Rate: 4.5e+00/s
2018-10-24 10:17:44,248 pencil 0/1 INFO :: Building pencil matrix 90/171 (~53%) Elapsed: 20s, Remaining: 18s, Rate: 4.5e+00/s
2018-10-24 10:17:48,339 pencil 0/1 INFO :: Building pencil matrix 108/171 (~63%) Elapsed: 24s, Remaining: 14s, Rate: 4.5e+00/s
2018-10-24 10:17:52,393 pencil 0/1 INFO :: Building pencil matrix 126/171 (~74%) Elapsed: 28s, Remaining: 10s, Rate: 4.5e+00/s
2018-10-24 10:17:54,160 pencil 0/1 INFO :: Building pencil matrix 134/171 (~78%) Elapsed: 30s, Remaining: 8s, Rate: 4.5e+00/s
2018-10-24 10:17:56,405 pencil 0/1 INFO :: Building pencil matrix 144/171 (~84%) Elapsed: 32s, Remaining: 6s, Rate: 4.5e+00/s
2018-10-24 10:18:00,536 pencil 0/1 INFO :: Building pencil matrix 162/171 (~95%) Elapsed: 36s, Remaining: 2s, Rate: 4.5e+00/s
2018-10-24 10:18:02,561 pencil 0/1 INFO :: Building pencil matrix 171/171 (~100%) Elapsed: 38s, Remaining: 0s, Rate: 4.5e+00/s
Starting startup loop...
Run time for startup loop: 2.106695
Starting main time loop...
Run time for main loop: 1.607797
real 0m43.280s
user 0m42.668s
sys 0m0.820s
augier3pi@meige8pcpa79:~/Dev/fluidsim/bench/dedalus$ time fluidsim-bench 512 -d 2 -s ns2d -it 10
nh = (512, 512); Lh = (8, 8)
running a benchmark simulation... done.
10 time steps computed in 0.51 s
results benchmarks saved in /tmp/fluidsim_bench/result_bench_ns2d_512x512_np=1_default_20181024_10332516769.json
Cleaning up simulation.
real 0m2.603s
user 0m2.132s
sys 0m0.600s

Great, thanks for taking another look!

 changed status to resolved