So. Context first. I'm in the process of migrating a pretty big realtime 3D suite from Awesomium to CEF for our in-engine and general web view needs, and so far it's looking pretty good. The only real problem is the positively abysmal frame rate of offscreen rendering - a topic that has come up here time and time again but was never satisfyingly solved.
So I did some research, written up in story form because it's fun and because it clarifies how I got here.
Case in point: This nice thing here. It renders at a perfectly steady 60fps in Chrome and other browsers, and CPU/GPU are seriously bored while showing it (i7 5820, AMD FirePro W9100, Win10 64bit). Yet when using CEF and offscreen rendering, I only get a very stuttery and unstable 20 to 30fps out of it.
To eliminate all other error sources I've written a minimal test case that just opens a windowless browser on the page above and, in the OnPaint callback, dumps the dirty rectangle sizes and the elapsed time between calls - see attachment. And lo and behold - after some random numbers while the page is loading, it "stabilizes" at 30 to 70ms between calls (dirty rect omitted for brevity, but it's 522x446):
239, 3917, 17, 275, 35, 121, 79, 17, 94, 55, 22, 194, 41, 65, 36, 257, 117, 19, 206, 233, 34, 159, 30, 182, 354, 33, 44, 54, 63, 45, 58, 55, 62, 49, 54, 46, 51, 49, 49, 77, 40, 51, 64, 69, 50, 55, 86, 71, 37, 54, 32, 48, 50, 50, 55, 45, 50, 50, 66, 64, 56, 47,
So yep, that's too slow and unstable as hell. But here's the kicker (the test case doesn't show this, but you can see it in the render output): The page has an FPS meter on it, and it shows a perfect 60 - or in fact whatever you set in browser_settings.windowless_frame_rate.
And if I squint hard enough, the millisecond deltas in that dump kind of cluster around multiples of 17ms. So could it be that actually, Chromium itself renders at a nice 60fps but CEF randomly only gives me every second or third or fourth frame?
So, off into the source (I'm using the Win64 Spotify build of CEF 3.2840.1515.g1b7ab74 plus the PDBs, which let me debug), setting a few breakpoints and trying to make sense of it all. After some hours of digging I found this in CefCopyFrameGenerator::GenerateCopyFrame():
  // Don't attempt to generate a frame while one is currently in-progress.
  if (frame_in_progress_)
    return;
  frame_in_progress_ = true;
Ehrm. One breakpoint and 20 seconds later I had the proof: More than two thirds of frames that come out of OnSwapCompositorFrame get thrown away because another frame is "in progress".
Now to be very clear and to avoid the usual "but GPU readback is slow" replies - I've been writing 3D engines on PCs and consoles for 15 years now, and trust me, no sensible amount of GPU readback or IPC or whatever can possibly exceed the 16 milliseconds one has got per frame - especially not for a measly 522x446 rectangle.
So what could be the holdup? What in the world could be the reason that a frame takes more than 16ms to arrive back at the CPU? (the frame_in_progress_ logic itself seems to do fine, otherwise it would just stop rendering at one point)
And then it hit me: Latency.
Now, what GPU drivers do is try to keep the GPU busy. The easiest way to do that is to queue up a ton of commands before the GPU even starts rendering, with the result that the GPU easily runs one, two, three frames behind the CPU and the image arrives on screen somewhat later. Fine for noninteractive stuff and not-too-action-oriented games, not fine for anything that needs short reaction times - but a fact of life for realtime 3D devs (and there are ways around it, more on that below).
So, if I may make an educated guess at what happens: OnSwapCompositorFrame gets called when the compositor has finished its work on the CPU. At this point all commands are in the buffer, but the frame isn't actually fully rendered yet - the GPU is still doing its thing. Now InternalGenerateCopyFrame() calls cc::CopyOutputRequest::CreateRequest(), which adds a readback command to the command buffer and registers a callback, and everybody goes on with their lives.
Sixteen milliseconds later: the GPU still isn't finished compositing that frame, because there really was so much other stuff to do and the driver was really chill about it anyway. But on the CPU the compositor has just finished queuing up the commands for the next frame and calls... OnSwapCompositorFrame. Which calls GenerateCopyFrame(). Which hits the code snippet above and goes "wait, there's still a frame in flight, let's exit". And BOOM - CEF just threw away a perfectly good frame of animation.
Some time later, that first frame finally arrives at the CPU and gets handled by OnPaint(), and the whole cycle starts over. The end result: Chromium renders at full frame rate, but only a stuttery version of it arrives at the client. Exactly as observed. *lights pipe*
Now, luckily there's a few ways to address that issue. I'll just outline them here because you're probably way faster at fixing than I would be.
The easy way (actually not a bad way): at the end of InternalGenerateCopyFrame(), add a GPU flush. No idea what that looks like in the Chromium gl or gpu subsystems, but there should be a means to force the GL driver to flush all pending commands to the GPU and render them right now. Of course this makes the GPU stall now and then and degrades overall perf by a few percent, but it fixes the latency for OSR applications. And one could argue that responsiveness is way more important than a few percent of rendering perf in a web browser. :)
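In pseudocode (I haven't checked what the actual flush entry point in the gpu subsystem is, so the last line is a placeholder for whatever call gets the command stream submitted to the driver):

```
InternalGenerateCopyFrame(damage_rect):
    issue cc::CopyOutputRequest with the readback callback   // existing
    flush the GL command stream to the driver                // new: make the
                                                             // GPU start on this
                                                             // frame NOW instead
                                                             // of queuing ahead
```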
The hard way: embrace the fact that there can be more than one frame with a readback pending. This probably means wiring the dirty rectangle list through the whole callback chain and replacing the frame_in_progress_ logic with a queue, to at least prevent several OnPaint() callbacks from running at once; possibly it also means double or triple buffering the CPU-side image and merging the dirty rectangles of several frames. But this would be the most elegant solution - with the drawback that it doesn't actually fix the latency for the user, so please add the GPU flush anyway as a setting.
If I'm right this should fix most of CEF's performance problems with OSR. So, to quote my favourite AI: Thank you for helping us help you help us all :)
Bonus question time! Half serious because it'd mean a lot of work for everyone, but it would be awesome: Why transfer the image to the CPU at all if the next thing I do is reupload it to the GPU anyway? How hard would it be to add, say, an API where you either get a shared surface handle in a callback or specify your own shared surface for CEF to render into? Restricting pixel formats etc. or forcing clients to double buffer to avoid stalls would be fine. This would be a pro level "know exactly what you're doing" API. (Shouldn't be a problem under Windows with Chromium using ANGLE and thus Direct3D under the hood anyway, no idea about Linux or Mac tho)
Also, currently CEF clocks offscreen rendering with its CefBeginFrameTimer class - any plans on exposing that functionality to the user? I'd really like to let Chromium render in lockstep with our app to get guaranteed silky smooth 60fps (even with a frame of delay or two). :)