<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng" type="xml"?>
<?oxygen SCHSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng"?>
<appendix xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
    <?dbhtml filename="Optimization.html" ?>
    <para>This appendix is not intended to be a detailed view of possible graphics optimizations.
        Instead, it is a high-level view of important information for optimizing rendering
        applications. There are also no source code samples for this.</para>
        <title>Finding the Bottleneck</title>
        <para>The absolute best tool to have in your repertoire for optimizing your rendering is
            finding out why your rendering is slow.</para>
        <para>GPUs are designed as a pipeline. Each stage in the pipeline is functionally
            independent from the other. A vertex shader can be computing some number of vertices,
            while the clipping and rasterization are working on other triangles, while the fragment
            shader is working on fragments generated by other triangles.</para>
        <para>However, a vertex generated by the vertex shader cannot pass to the rasterizer if the
            rasterizer is busy. Similarly, the rasterizer cannot generate more fragments if all of
            the fragment shaders are in use. Therefore, the overall performance of the GPU can only
            be the performance of the slowest step in the pipeline.</para>
        <para>This means that, in order to actually make the GPU faster, you must find the
            particular stage of the pipeline that is the slowest. This stage is referred to as the
                <glossterm>bottleneck</glossterm>. Until you know what the bottleneck is, the
            most you can do is guess at why things are slower than you expect. And
            making major code changes based purely on a guess is probably not something you should
            do, at least not until you have a lot of experience with the GPU(s) in question.</para>
        <para>It should also be noted that bottlenecks are not consistent throughout the rendering
            of a single frame. Some parts of it can be CPU bound, others can be fragment shader
            bound, etc. Thus, attempt to find particular sections of rendering that likely have the
            same problem before trying to find the bottleneck.</para>
            <title>Measuring Performance</title>
            <para>The performance statistic most people quote is frames per second
                (<acronym>FPS</acronym>). While this is useful when
                talking to the lay person, a graphics programmer does not use FPS as their standard
                performance metric. A target frame rate is the overall goal, but when measuring the
                actual performance of a piece of rendering code, the more useful metric is simply
                time. This is usually measured in milliseconds (ms).</para>
            <para>If you are attempting to maintain 60fps, that translates to having 16.67
                milliseconds to spend performing all rendering tasks.</para>
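            <para>As a quick check of that arithmetic, the following sketch converts a target frame
                rate into a per-frame time budget. The name <function>frameBudgetMs</function> is
                made up for this illustration.</para>

```cpp
#include <cassert>

// Hypothetical helper: time available per frame, in milliseconds,
// for a given target frame rate.
double frameBudgetMs(double framesPerSecond)
{
    assert(framesPerSecond > 0.0);
    return 1000.0 / framesPerSecond;
}
```

            <para>Halving the frame rate doubles the budget: a 30fps target leaves about 33.33
                milliseconds per frame.</para>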
            <para>One thing that confounds performance metrics is the fact that the GPU is both
                pipelined and asynchronous. When running regular code, if you call a function,
                you're usually assured that the action the function took has completed when it
                returns. When you issue a rendering call (any <function>glDraw*</function>
                function), not only is it likely that rendering has not completed by the time it has
                returned, it is very possible that rendering has not even
                    <emphasis>started</emphasis>. Not even doing a buffer swap will ensure that the
                GPU has finished, as GPUs can wait to actually perform the buffer swap until
                later.</para>
            <para>If you specifically want to time the GPU, then you must force the GPU to finish
                its work. To do that in OpenGL, you call a function cleverly titled
                    <function>glFinish</function>. It will return sometime after the GPU finishes.
                Note that it does not guarantee that it returns immediately after, only at some
                point after the GPU has finished all of its commands. So it is a good idea to give
                the GPU a healthy workload before calling <function>glFinish</function>, to minimize
                the difference between the time you measure and the time the GPU actually
                took.</para>
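            <para>One way to structure such a measurement is sketched below. The helper
                    <function>timeGpuWorkMs</function> is hypothetical. It takes the finish call as
                a parameter (in a real program, a lambda that calls <function>glFinish</function>)
                so that the sketch compiles without an OpenGL context or loader.</para>

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: measures how long `submitWork` takes once the GPU has
// drained its queue. `finish` stands in for glFinish; it is passed in as a
// callable so the sketch does not depend on an OpenGL context or loader.
template <typename SubmitWork, typename Finish>
double timeGpuWorkMs(SubmitWork submitWork, Finish finish, int repetitions)
{
    assert(repetitions > 0);
    using Clock = std::chrono::steady_clock;

    finish();                        // Drain anything queued before timing starts.
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < repetitions; ++i)
        submitWork();                // Give the GPU a healthy workload.
    finish();                        // Block until the GPU has completed all of it.
    const Clock::time_point end = Clock::now();

    const std::chrono::duration<double, std::milli> elapsed = end - start;
    return elapsed.count() / repetitions;   // Average milliseconds per repetition.
}
```

            <para>Repeating the work several times before the final finish call is one way to
                provide the healthy workload mentioned above. The repetition count is a tuning
                value, not something the text prescribes.</para>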
            <para>You will also want to turn vertical synchronization, or vsync, off. There is a
                certain point during which a graphics chip is able to swap the front and back
                framebuffers with a guarantee of not causing half of the displayed image to be from
                one buffer and half from another. The latter eventuality is called
                    <glossterm>tearing</glossterm>, and having vsync enabled avoids that. However,
                you don't care about tearing; you want to know about performance. So you need to
                turn off any form of vsync.</para>
            <para>Vsync is controlled by the window-system specific extensions
                    <literal>GLX_EXT_swap_control</literal> and
                    <literal>WGL_EXT_swap_control</literal>. They both do the same thing and have
                similar APIs. The <function>wgl/glxSwapInterval</function> functions take an integer
                that tells how many vsyncs to wait between swaps. If you pass 0, then it will swap
                immediately.</para>
            <title>Possible Bottlenecks</title>
            <para>There are several potential bottlenecks that a section of rendering code can have.
                We will list those and the ways of determining if it is the bottleneck. You should
                test these in the order presented below.</para>
                <title>Fragment Processing</title>
                <para>This is probably the easiest to find. The quantity of fragment processing you
                    have depends entirely on the number of fragments the various triangles are
                    rasterized to. Therefore, simply increase the resolution. If you increase the
                    resolution by 2x the number of pixels (double either the width or height), and
                    the time to render doubles, then you are fragment processing bound.</para>
                <para>Note that rendering time will go up when you increase the resolution. What you
                    are interested in is whether it goes up linearly with the number of fragments
                    rendered. If the render time only goes up by 1.2x with a 2x increase in number
                    of fragments, then the code was not fragment processing bound.</para>
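                <para>The comparison can be written down as a small calculation. In this sketch the
                    function names and the 0.8 threshold are assumptions made for illustration; the
                    text only says that time should grow roughly in step with the fragment
                    count.</para>

```cpp
#include <cassert>

// Hypothetical helper: what fraction of the extra pixels showed up as extra
// render time? 1.0 means time grew exactly in step with the pixel count.
double fragmentScalingFraction(double baseTimeMs, double scaledTimeMs, double pixelRatio)
{
    assert(baseTimeMs > 0.0 && pixelRatio > 1.0);
    const double timeRatio = scaledTimeMs / baseTimeMs;
    return (timeRatio - 1.0) / (pixelRatio - 1.0);
}

// The 0.8 threshold is an arbitrary assumption for this sketch, not a rule
// from the text; pick whatever tolerance suits your measurements.
bool looksFragmentBound(double baseTimeMs, double scaledTimeMs, double pixelRatio)
{
    return fragmentScalingFraction(baseTimeMs, scaledTimeMs, pixelRatio) >= 0.8;
}
```

                <para>With a 2x pixel increase, a render time that goes from 10 ms to 20 ms gives a
                    fraction of 1.0, while 10 ms to 12 ms gives roughly 0.2.</para>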
                <title>Vertex Processing</title>
                <para>If you are not fragment processing bound, then there's a good chance you are
                    vertex processing bound. After ruling out fragment processing, simply turn off
                    all fragment processing. If this does not increase your performance
                    significantly (there will generally be some change), then you were vertex
                    processing bound.</para>
                <para>To turn off fragment processing, simply
                        <function>glEnable</function>(<literal>GL_CULL_FACE</literal>) and set
                        <function>glCullFace</function> to <literal>GL_FRONT_AND_BACK</literal>.
                    That will cause the clipping system to cull all triangles before rasterization.
                    Obviously, nothing will be rendered, but your performance timings will be for
                    vertex processing alone.</para>
                <para>A CPU bottleneck means that the GPU is being starved; it is consuming data
                    faster than the CPU is providing it. You don't really test for CPU bottlenecks
                    per se; they are discovered by process of elimination. If nothing else is
                    bottlenecking the GPU, then the CPU clearly is not giving it enough stuff to
                    do.</para>
            <title>Unfixable Bottlenecks</title>
            <para>It is entirely possible that you cannot fix a bottleneck. Maybe there's simply no
                way to avoid a vertex-processing heavy section of your renderer. Perhaps you need
                all of that fragment processing in a certain area of rendering.</para>
            <para>If there is some bottleneck that cannot be optimized away, then turn it to your
                advantage. If you have a CPU bottleneck, then render more detailed models. If you
                have a vertex-shader bottleneck, improve your lighting by adding some
                fragment-shader complexity. And so forth. Just make sure that you don't increase
                complexity to the point where you move the bottleneck.</para>
        <title>Core Optimizations</title>
            <title>State Changes</title>
            <para>This rule is designed to decrease CPU bottlenecks. The rule itself is simple:
                minimize the number of state changes. Actually doing it is a complex exercise in
                graphics engine design.</para>
            <para>What is a state change? Any OpenGL function that changes the state of the current
                context is a state change. This includes any function that changes the state of
                objects bound to the current context.</para>
            <para>What you should do is gather all of the things you need to render and sort them
                based on state changes. Objects with similar state will be rendered one after the
                other. But not all state changes are equal to one another; some state changes are
                more expensive than others.</para>
            <para>Vertex array state, for example, is generally considered quite expensive. Try to
                group many objects that have the same vertex attribute data formats in the same
                buffer objects. Use <function>glDrawElementsBaseVertex</function> to help when using
                indexed rendering.</para>
            <para>The currently bound texture state is also somewhat expensive. Program state is
                analogous to this.</para>
            <para>Global state, such as face culling, blending, etc, are generally considered less
                expensive. You should still only change it when necessary, but buffer object and
                texture state are much more important in state sorting.</para>
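            <para>A minimal sketch of state sorting follows, assuming each draw can be described by
                a few integer handles. The <type>DrawItem</type> struct and its fields are
                hypothetical, and the relative order of program and texture in the sort key is a
                choice made here, not something the text fixes.</para>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical per-draw record. The integer ids stand in for whatever handles
// a renderer uses for its vertex array, program, and texture objects.
struct DrawItem
{
    unsigned vertexArrayId;   // Most expensive to change in this sketch.
    unsigned programId;
    unsigned textureId;
    bool     blendEnabled;    // Cheaper global state; least significant key.
};

// Orders draws so that items sharing expensive state end up adjacent.
void sortByState(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
        [](const DrawItem& a, const DrawItem& b)
        {
            return std::tie(a.vertexArrayId, a.programId, a.textureId, a.blendEnabled)
                 < std::tie(b.vertexArrayId, b.programId, b.textureId, b.blendEnabled);
        });
}

// Counts how often the vertex array would have to be rebound when drawing
// the items in order (the first bind counts as one change).
int countVertexArrayChanges(const std::vector<DrawItem>& items)
{
    int changes = 0;
    for (std::size_t i = 0; i < items.size(); ++i)
        if (i == 0 || items[i].vertexArrayId != items[i - 1].vertexArrayId)
            ++changes;
    return changes;
}
```

            <para>Putting the most expensive state in the most significant position of the key means
                it changes least often across the sorted list.</para>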
            <para>There are also certain tricky states that can hurt you. For example, it is best to
                avoid changing the direction of the depth test once you have cleared the depth
                buffer and started rendering to it. This is for reasons having to do with specific
                hardware optimizations of depth buffering.</para>
            <para>It is less well-understood how important uniform state is, or how uniform buffer
                objects compare with traditional uniform values.</para>
            <title>Object Culling</title>
            <para>The fastest object is one not drawn. And there's no point in drawing something
                that isn't seen.</para>
            <para>The simplest form of object culling is frustum culling: choosing not to render
                objects that are entirely outside of the view frustum. Determining that an object is
                off screen is a CPU task. You generally have to represent each object as a sphere or
                camera-space box; then you test the sphere or box to see if it is partially within
                the view space.</para>
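            <para>The sphere version of this test can be sketched as follows. The types here are
                hypothetical, and the frustum is assumed to be given as six planes with unit-length
                normals pointing toward the inside.</para>

```cpp
#include <array>
#include <cassert>

struct Vec3 { float x, y, z; };

// Plane in the form dot(normal, p) + d = 0, with the normal assumed to be
// unit length and pointing toward the inside of the frustum.
struct Plane { Vec3 normal; float d; };

struct Sphere { Vec3 center; float radius; };

inline float signedDistance(const Plane& plane, const Vec3& p)
{
    return plane.normal.x * p.x + plane.normal.y * p.y + plane.normal.z * p.z + plane.d;
}

// Returns true if the sphere is entirely outside at least one plane, which
// means the object can be skipped. This test is conservative: a sphere near a
// frustum corner may be reported as visible even though it is not.
bool isOutsideFrustum(const std::array<Plane, 6>& frustum, const Sphere& sphere)
{
    for (const Plane& plane : frustum)
        if (signedDistance(plane, sphere.center) < -sphere.radius)
            return true;
    return false;
}
```

            <para>Erring on the side of drawing is the safe direction for a culling test: a false
                "visible" costs some performance, while a false "outside" makes objects
                disappear.</para>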
            <para>There are also a number of techniques for determining whether the view to
                certain objects is obstructed by other objects. Portals, BSPs, and a variety of
                other techniques involve preprocessing terrain to determine visibility sets.
                Therefore, it can be known that, when the camera is in a certain region of the
                world, objects in certain other regions cannot be visible, even if they are within
                the view frustum.</para>
            <para>A level beyond that involves using something called occlusion queries. This is a
                way to render an object with the GPU and then ask how many fragments of that object
                were rasterized. It is generally preferred to render simple test objects, such that
                if any part of the test object is visible, then the real object will be visible.
                Color masks (with <function>glColorMask</function>) are used to prevent writing the
                fragment shader outputs of the test object to the framebuffer.</para>
            <para>Occlusion queries in OpenGL are objects that have state. They are created with the
                    <function>glGenQueries</function> function. To start rendering a test object for
                occlusion queries, the object generated from <function>glGenQueries</function> is
                passed to the <function>glBeginQuery</function> function, along with the mode of
                    <literal>GL_SAMPLES_PASSED</literal>. All rendering commands between
                    <function>glBeginQuery</function> and the corresponding
                    <function>glEndQuery</function> are part of the test object. If all of the
                fragments of the object were discarded (via depth buffer or something else), then
                the query failed. If even one fragment was rendered, then it passed.</para>
            <para>This can be used with conditional rendering. Conditional rendering allows a series
                of rendering commands, bracketed by the
                    <function>glBeginConditionalRender</function>/<function>glEndConditionalRender</function>
                functions, to cause rendering of an object to happen or not happen based on the
                status of an occlusion query object. If the occlusion query passed, then the
                rendering commands will be executed. If it did not, then they will not be.</para>
            <para>Of course, conditional rendering can cause pipeline stalls; OpenGL still requires
                that operations execute in-order, even conditional ones. So all later operations
                will be held up if a conditional render is waiting for its occlusion query to
                finish. To avoid this, you can specify <literal>GL_QUERY_NO_WAIT</literal> when
                beginning the conditional render. This will cause OpenGL to render if the query has
                not completed before this conditional render is ready to be rendered.</para>
            <title>Model LOD</title>
            <para>When a model is far away, it does not need to look as detailed. Therefore, one can
                substitute less detailed models for more detailed ones. This is commonly referred to
                as Level of Detail (<acronym>LOD</acronym>).</para>
            <para>Of course in modern rendering, detail means more than just the number of polygons
                in a mesh. It can often mean what shader to use, what textures to use with it, etc.
                So while meshes will often have LODs, so will shaders. Textures have their own
                built-in LODing mechanism in mip-mapping. But it is often the case that low-LOD
                shaders (those used from far away) do not need as many textures as the closer LOD
                shaders. You might be able to get away with per-vertex lighting for distant models,
                while you need per-fragment lighting for those close up.</para>
            <para>The general problem is how to deal with the transitions between LOD levels. If you
                change them too close to the camera, then the user will notice the pop. If you do
                them too far away, you lose much of the performance impact. Finding a good
                middle-ground is key.</para>
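            <para>The simplest selection scheme is a list of switch distances, sketched below. The
                function name and the threshold list are hypothetical; choosing the actual distances
                is the tuning problem described above.</para>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Picks a level of detail from the camera distance. `switchDistances` holds
// the distances (ascending) at which the renderer moves to the next, less
// detailed level; level 0 is the most detailed.
std::size_t selectLod(float distance, const std::vector<float>& switchDistances)
{
    assert(std::is_sorted(switchDistances.begin(), switchDistances.end()));
    std::size_t level = 0;
    while (level < switchDistances.size() && distance >= switchDistances[level])
        ++level;
    return level;
}
```

            <para>With switch distances of 10, 50 and 200, a model 60 units away would use level 2.
                The same index could select a mesh, a shader, or both.</para>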
            <para>For any texture that represents a surface property of an object, strongly consider
                giving it mipmaps. This includes bump maps, diffuse textures, specular textures,
                etc. This is primarily for performance reasons.</para>
            <para>When you fetch a texel from a texture, the texture unit hardware will usually
                fetch the neighboring texels at the mip LOD(s) in question. These texels will be
                stored in local memory called the texture cache. This means that, when the next
                fragment on the surface comes along, the texels it needs will likely already be in
                the cache. But this only works for texels that are near each other.</para>
            <para>When an object is far from the camera or angled sharply relative to the view, then
                the two texture coordinates for two neighboring fragments can be quite different
                from one another. When fetching from a low mipmap (remember: 0 is the biggest
                mipmap), then the two fragments will get texels that are far apart. Neither
                fragment's fetch will find its texels already in the cache.</para>
            <para>But if they are fetching from a high mipmap, then the large texture coordinate
                difference between them translates into a small texel-space difference. With proper
                mipmapping, neighboring fragments can feed on the cache and do fewer memory
                accesses. This speeds up texturing performance.</para>
            <para>This also means that biasing the mipmap LOD lower (to larger mipmaps) can cause
                serious performance problems in addition to aliasing.</para>
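            <para>To put rough numbers on this, the sketch below computes how far apart, in texels,
                two neighboring fragments land at a given mipmap level. The function names are made
                up, and the model ignores filtering and considers only one axis.</para>

```cpp
#include <algorithm>
#include <cassert>

// Width in texels of a given mip level; level 0 is the full-size image and
// each level halves the width, down to a minimum of 1.
unsigned mipWidth(unsigned baseWidth, unsigned level)
{
    assert(level < 32);
    return std::max(1u, baseWidth >> level);
}

// Distance in texels between the fetches of two neighboring fragments whose
// texture coordinates differ by `uvDelta` (in normalized [0, 1] units).
float texelDistance(float uvDelta, unsigned baseWidth, unsigned level)
{
    return uvDelta * static_cast<float>(mipWidth(baseWidth, level));
}
```

            <para>For a 1024-texel-wide texture and a coordinate difference of 0.125, the fetches are
                128 texels apart at level 0 but only 4 texels apart at level 5.</para>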
        <title>Vertex Format</title>
        <para>Vertex attributes stored in buffer objects can be of a surprisingly large number of
            formats. These tutorials generally used 32-bit floating-point data, but that is far from
            the best case.</para>
        <para>The <glossterm>vertex format</glossterm> specifically refers to the set of values
            given to the <function>glVertexAttribPointer</function> calls that describe how each
            attribute is aligned in the buffer object.</para>
            <title>Attribute Formats</title>
            <para>Each attribute should take up as little room as possible. This is for performance
                reasons, but it also saves memory. For buffer objects, these are usually one and the
                same. The less data you have stored in memory, the faster it gets to the vertex
                shader.</para>
            <para>Attributes can be stored in normalized integer formats, just like textures. This
                is most useful for colors and texture coordinates. For example, to have an attribute
                that is stored in 4 unsigned normalized bytes, you can use this:</para>
            <programlisting language="cpp">glVertexAttribPointer(index, 4, GL_UNSIGNED_BYTE, GL_TRUE, 0, offset);</programlisting>
            <para>If you want to store a normal as a normalized signed short, you can use
                this:</para>
            <programlisting language="cpp">glVertexAttribPointer(index, 3, GL_SHORT, GL_TRUE, 0, offset);</programlisting>
            <para>There are also a few specialized formats. <literal>GL_HALF_FLOAT</literal> can be
                used for 16-bit floating-point types. This is useful for when you need values
                outside of [-1, 1], but don't need the full precision of a 32-bit float.</para>
            <para>Non-normalized integers can be used as well. These map in GLSL directly to
                floating-point values, so a non-normalized value of 16 maps to a GLSL value of
                16.0.</para>
            <para>The best thing about all of these formats is that they cost
                    <emphasis>nothing</emphasis> in performance to use. They are all silently
                converted into floating-point values for consumption by the vertex shader, with no
                performance lost.</para>
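            <para>The conversions themselves are simple. The sketch below mirrors on the CPU what
                the hardware does for a few formats; the function names are made up. The signed rule
                shown is the one used by OpenGL 4.2 and later, and older versions define the signed
                mapping slightly differently.</para>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Unsigned normalized byte: 0 maps to 0.0 and 255 maps to 1.0.
float unormByteToFloat(std::uint8_t value)
{
    return static_cast<float>(value) / 255.0f;
}

// Signed normalized short, using the rule from OpenGL 4.2 and later:
// 32767 maps to 1.0, and both -32767 and -32768 map to -1.0.
float snormShortToFloat(std::int16_t value)
{
    return std::max(static_cast<float>(value) / 32767.0f, -1.0f);
}

// Non-normalized integers simply become the equivalent float.
float intToFloat(std::int16_t value)
{
    return static_cast<float>(value);
}
```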
            <title>Interleaved Attributes</title>
            <para>Attributes do not all have to come from the same buffer object; multiple
                attributes can come from multiple buffers. However, where possible, this should be
                avoided. Furthermore, attributes in the same buffer should be interleaved with one
                another whenever possible.</para>
            <para>Consider an array of structs in C++:</para>
            <programlisting>struct Vertex
{
  float position[3];
  GLubyte color[4];
  GLushort texCoord[2];
};

Vertex vertArray[20];</programlisting>
            <para>The byte offset of <varname>color</varname> in the <type>Vertex</type> struct is
                12. That is, from the beginning of the <type>Vertex</type> struct, the
                    <varname>color</varname> variable starts 12 bytes in. The
                    <varname>texCoord</varname> variable starts 16 bytes in.</para>
            <para>If we did a memcpy between <varname>vertArray</varname> and a buffer object, and
                we wanted to set the attributes to pull from this data, we could do so using the
                stride and offsets to position things properly.</para>
            <programlisting>glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 20, (void*)0);
glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE, 20, (void*)12);
glVertexAttribPointer(2, 2, GL_UNSIGNED_SHORT, GL_TRUE, 20, (void*)16);</programlisting>
            <para>The fifth argument is the stride. The stride is the number of bytes from the
                beginning of one instance of this attribute to the beginning of another. The stride
                here is set to <literal>sizeof</literal>(<type>Vertex</type>). C++ defines that the
                size of a struct represents the byte offset between separate instances of that
                struct in an array. So that is our stride.</para>
            <para>The offsets represent where in the buffer object the first element is. These match
                the offsets in the struct. If we had loaded this data to a location past the front
                of our buffer object, we would need to offset these values by the beginning of where
                we uploaded our data to.</para>
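            <para>Rather than counting bytes by hand, the compiler can confirm the offsets and
                stride. In this sketch, <type>std::uint8_t</type> and <type>std::uint16_t</type>
                stand in for <type>GLubyte</type> and <type>GLushort</type> so that no OpenGL header
                is needed, and <function>attributeOffset</function> is a hypothetical helper.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for the Vertex struct above. std::uint8_t and std::uint16_t are
// assumed to match GLubyte and GLushort, so no OpenGL header is needed.
struct Vertex
{
    float         position[3];
    std::uint8_t  color[4];
    std::uint16_t texCoord[2];
};

// These hold on common platforms where float is 4 bytes; if a compiler lays
// the struct out differently, the build fails instead of rendering garbage.
static_assert(offsetof(Vertex, position) == 0,  "position offset");
static_assert(offsetof(Vertex, color)    == 12, "color offset");
static_assert(offsetof(Vertex, texCoord) == 16, "texCoord offset");
static_assert(sizeof(Vertex)             == 20, "stride");

// Byte offset of attribute data for the vertex at `index`, when the array was
// uploaded `baseOffset` bytes into the buffer object.
std::size_t attributeOffset(std::size_t baseOffset, std::size_t index, std::size_t memberOffset)
{
    return baseOffset + index * sizeof(Vertex) + memberOffset;
}
```

            <para>Passing <literal>sizeof</literal>(<type>Vertex</type>) and
                    <literal>offsetof</literal> results to the attribute calls, instead of literal
                numbers, keeps them correct if the struct changes later.</para>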
            <para>There are certain gotchas when deciding how data gets packed like this. First, it
                is a good idea to keep every attribute on a 4-byte alignment. This may mean
                introducing explicit padding (empty space) into your structures. Some hardware will
                have massive slowdowns if things aren't aligned to four bytes.</para>
            <para>Next, it is a good idea to keep the size of any interleaved vertex data restricted
                to multiples of 32 bytes in size. Violating this is not as bad as violating the
                4-byte alignment rule, but one can sometimes get sub-optimal performance if the
                total size of interleaved vertex data is, for example, 48 bytes. Or 20 bytes, as in
                our example.</para>
            <title>Packing Suggestions</title>
            <para>If the smallest vertex data size is what you need, consider these packing
                suggestions.</para>
            <para>Colors generally do not need to be more than 3-4 bytes in size. One byte per
                component.</para>
            <para>Texture coordinates, particularly those clamped to the [0, 1] range, almost never
                need more than 16-bit precision. So use unsigned shorts.</para>
            <para>Normals should be stored in the signed 2_10_10_10 format whenever possible.
                Normals generally do not need that much precision, especially since you're going to
                normalize them anyway. This format was specifically devised for normals, so use
                it.</para>
            <para>Positions are the trickiest to work with, because the needs vary so much. If you
                are willing to modify your vertex shaders and put some work into it, you can often
                use 16-bit signed normalized shorts.</para>
            <para>The key to this is a special scale/translation matrix. When you are preparing your
                data, in an offline tool, you take the floating-point positions of a model and
                determine the model's maximum extents in all three axes. This forms a bounding box
                around the model. The center of the box is the center of your new model, and you
                apply a translation to move the points to this center. Then you apply a non-uniform
                scale to transform the points from their extent range to the [-1, 1] range of signed
                normalized values. You save the offset and the scales you used as part of your mesh
                data (not to be stored in the buffer object).</para>
            <para>When it comes time to render the model, you simply reverse the transformation. You
                build a scale/translation matrix that undoes what was done to get the positions into
                the signed-normalized range. Note that this matrix should not be applied to the
                normals, because the normals were not compressed this way. A full matrix multiply is
                even overkill for this transformation; a scale+translation can be done with a simple
                vector multiply and add.</para>
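            <para>The offline step and its reversal might look like the following sketch. All of the
                type and function names are hypothetical, and
                    <function>decodePosition</function> runs on the CPU here only to show the
                arithmetic a vertex shader would perform.</para>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Saved alongside the mesh (not in the buffer object): decoded = packed * scale + offset.
struct Dequantize { Vec3 scale; Vec3 offset; };

struct PackedPosition { std::int16_t x, y, z; };

// Maps a value in [-1, 1] to a signed normalized 16-bit integer.
inline std::int16_t toSnorm16(float v)
{
    v = std::max(-1.0f, std::min(1.0f, v));
    return static_cast<std::int16_t>(std::lround(v * 32767.0f));
}

// Offline step: find the bounding box, then translate and scale into [-1, 1].
Dequantize quantizePositions(const std::vector<Vec3>& in, std::vector<PackedPosition>& out)
{
    assert(!in.empty());
    Vec3 lo = in[0], hi = in[0];
    for (const Vec3& p : in)
    {
        lo = { std::min(lo.x, p.x), std::min(lo.y, p.y), std::min(lo.z, p.z) };
        hi = { std::max(hi.x, p.x), std::max(hi.y, p.y), std::max(hi.z, p.z) };
    }
    const Vec3 center = { (lo.x + hi.x) * 0.5f, (lo.y + hi.y) * 0.5f, (lo.z + hi.z) * 0.5f };
    // Half extents; a flat axis gets 1 to avoid dividing by zero.
    auto half = [](float a, float b) { return (b - a) > 0.0f ? (b - a) * 0.5f : 1.0f; };
    const Vec3 scale = { half(lo.x, hi.x), half(lo.y, hi.y), half(lo.z, hi.z) };

    out.clear();
    for (const Vec3& p : in)
        out.push_back({ toSnorm16((p.x - center.x) / scale.x),
                        toSnorm16((p.y - center.y) / scale.y),
                        toSnorm16((p.z - center.z) / scale.z) });
    return { scale, center };
}

// What the vertex shader would do per vertex: one multiply and one add.
Vec3 decodePosition(const PackedPosition& p, const Dequantize& d)
{
    const float k = 1.0f / 32767.0f;   // Normalized conversion done by OpenGL.
    return { p.x * k * d.scale.x + d.offset.x,
             p.y * k * d.scale.y + d.offset.y,
             p.z * k * d.scale.z + d.offset.z };
}
```

            <para>The worst-case error per axis is about the half extent divided by 65534, so whether
                16 bits is enough depends on the size of the model and how closely it is
                viewed.</para>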
        <title>Vertex Caching</title>
        <title>Shaders and Performance</title>
        <para>GPUs gain quite a bit of their performance because, by and large, once you tell them
            what to do, they go do their stuff without further intervention. As the programmer, you
            do not care that a frame has not yet completed. All you are interested in is that the
            user sees the frame when it is ready.</para>
        <para>There are certain things that the user can do which will cause this perfect
            asynchronous activity to come to a screeching halt. These are called synchronization
            events.</para>
        <para>OpenGL is defined to allow asynchronous behavior; commands that you give do not have
            to be completed when the function ends (for the most part). However, OpenGL defines this
            by saying that if there is asynchronous behavior, the user <emphasis>cannot</emphasis>
            be made aware of it. That is, if you call glDrawArrays, the effect of this command
            should be based solely on the current state of OpenGL. This means that, if the
            glDrawArrays command is executed later, the OpenGL implementation must do whatever it
            takes to prevent later changes from impacting the results.</para>
        <para>Therefore, if you make a <function>glDrawArrays</function> call that pulls from some
            buffer object, and then immediately call <function>glBufferSubData</function> on that
            buffer object, the OpenGL implementation may have to pause the CPU in the
                <function>glBufferSubData</function> call until the
                <function>glDrawArrays</function> has at least finished vertex processing. However,
            the implementation may also simply copy the data you are trying to transfer into some
            memory it allocates, to be uploaded to the buffer once the
                <function>glDrawArrays</function> completes. There is no way to be sure which will
            happen.</para>
        <para>Synchronization events usually include changing data objects. That is, changing the
            contents of buffer objects or textures. Usually, changing simple state of objects, like
            what attributes a VAO provides or texture parameters, does not cause synchronization
            issues. Changing global OpenGL state also does not cause synchronization
            issues.</para>
        <para>There are ways to allow you to modify data objects that still let the GPU be
            asynchronous. But any discussion of these is well beyond the bounds of this book. Just
            be aware that data objects that are in active use should probably not have their data