Commits

Jason McKesson  committed 3d49a79

Added beginnings of Optimization appendix.

  • Parent commits e14e188


Files changed (4)

File Documents/Optimization.xml

+<?xml version="1.0" encoding="UTF-8"?>
+<?oxygen RNGSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng" type="xml"?>
+<?oxygen SCHSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng"?>
+<appendix xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
+    <?dbhtml filename="Optimization.html" ?>
+    <title>Optimizations</title>
+    <para>This appendix is not intended to be a detailed view of possible graphics optimizations.
+        Instead, it is a high-level view of important information for optimizing rendering
+        applications. There are also no source code samples for this.</para>
+    <section>
+        <title>Finding the Bottleneck</title>
+        <para>The single most important tool to have in your repertoire for optimizing your
+            rendering is knowing why your rendering is slow.</para>
+        <para>GPUs are designed as a pipeline. Each stage in the pipeline is functionally
+            independent of the others. A vertex shader can be computing some number of vertices,
+            while clipping and rasterization work on previously processed triangles, while the
+            fragment shader works on fragments generated by still earlier triangles.</para>
+        <para>However, a vertex generated by the vertex shader cannot pass to the rasterizer if the
+            rasterizer is busy. Similarly, the rasterizer cannot generate more fragments if all of
+            the fragment shaders are in use. Therefore, the overall performance of the GPU can only
+            be the performance of the slowest step in the pipeline.</para>
+        <para>This means that, in order to actually make the GPU faster, you must find the
+            particular stage of the pipeline that is the slowest. This stage is referred to as the
+                <glossterm>bottleneck</glossterm>. Until you know what the bottleneck is, the most
+            you can do is guess at why things are slower than you expect. And making major code
+            changes based purely on a guess is rarely wise. At least, not until you have a lot of
+            experience with the GPU(s) in question.</para>
+        <para>It should also be noted that bottlenecks are not consistent throughout the rendering
+            of a single frame. Some parts of a frame can be CPU bound, others can be fragment shader
+            bound, and so on. Thus, isolate particular sections of rendering that likely share the
+            same limitation, then find the bottleneck of each section separately.</para>
+        <section>
+            <title>Measuring Performance</title>
+            <para>The statistic most commonly cited when people talk about performance is frames
+                per second (<acronym>FPS</acronym>). While this is useful when talking to the lay
+                person, a graphics programmer does not use FPS as their standard
+                performance metric. It is the overall goal, but when measuring the actual
+                performance of a piece of rendering code, the more useful metric is simply time.
+                This is usually measured in milliseconds (ms).</para>
+            <para>If you are attempting to maintain 60fps, that translates to having 16.67
+                milliseconds to spend performing all rendering tasks.</para>
+            <para>One thing that confounds performance metrics is the fact that the GPU is both
+                pipelined and asynchronous. When running regular code, if you call a function,
+                you're usually assured that the action the function took has completed when it
+                returns. When you issue a rendering call (any <function>glDraw*</function>
+                function), not only is it likely that rendering has not completed by the time it has
+                returned, it is very possible that rendering has not even
+                    <emphasis>started</emphasis>. Not even doing a buffer swap will ensure that the
+                GPU has finished, as GPUs can wait to actually perform the buffer swap until
+                later.</para>
+            <para>If you specifically want to time the GPU, then you must force the GPU to finish
+                its work. To do that in OpenGL, you call a function cleverly titled
+                    <function>glFinish</function>. It will return sometime after the GPU finishes.
+                Note that it does not guarantee that it returns immediately after, only at some
+                point after the GPU has finished all of its commands. So it is a good idea to give
+                the GPU a healthy workload before calling finish, to minimize the difference between
+                the time you measure and the time the GPU actually took.</para>
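+            <para>As a minimal sketch, a GPU timing measurement might look like the following. The
+                    <function>GetTimeInMs</function> and <function>RenderScene</function> functions
+                are hypothetical; substitute your platform's high-resolution timer and your own
+                rendering code.</para>
+            <programlisting language="cpp">glFinish();                               //Drain any previously queued work.
+double start = GetTimeInMs();
+RenderScene();                            //Issue the rendering commands being measured.
+glFinish();                               //Wait until the GPU has finished them.
+double gpuTimeMs = GetTimeInMs() - start;</programlisting>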
+            <para>You will also want to turn vertical synchronization, or vsync, off. There is a
+                certain point during a display refresh when a graphics chip can swap the front and
+                back framebuffers without half of the displayed image coming from one buffer and
+                half from the other. Displaying parts of two different buffers at once is called
+                    <glossterm>tearing</glossterm>, and having vsync enabled avoids it. However,
+                when measuring performance, you don't care about tearing; you want to know about
+                performance. So you need to turn off any form of vsync.</para>
+            <para>Vsync is controlled by the window-system specific extensions
+                    <literal>GLX_EXT_swap_control</literal> and
+                    <literal>WGL_EXT_swap_control</literal>. They both do the same thing and have
+                similar APIs. The <function>wglSwapIntervalEXT</function> and
+                    <function>glXSwapIntervalEXT</function> functions take an integer that tells how
+                many vsync intervals to wait between swaps. If you pass 0, then the swap will happen
+                immediately.</para>
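+            <para>As a sketch, turning off vsync might look like the following. These are extension
+                functions, so they must be loaded like any other extension function; the
+                    <literal>dpy</literal> and <literal>drawable</literal> parameters stand in for
+                the current X display and drawable.</para>
+            <programlisting language="cpp">#ifdef _WIN32
+    wglSwapIntervalEXT(0);                   //WGL_EXT_swap_control
+#else
+    glXSwapIntervalEXT(dpy, drawable, 0);    //GLX_EXT_swap_control
+#endif</programlisting>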
+        </section>
+        <section>
+            <title>Possible Bottlenecks</title>
+            <para>There are several potential bottlenecks that a section of rendering code can have.
+                We will list those and the ways of determining if it is the bottleneck. You should
+                test these in the order presented below.</para>
+            <section>
+                <title>Fragment Processing</title>
+                <para>This is probably the easiest bottleneck to find. The quantity of fragment
+                    processing you have depends entirely on the number of fragments the various
+                    triangles are rasterized to. Therefore, simply increase the resolution. If you
+                    increase the resolution to 2x the number of pixels (double either the width or
+                    the height) and the time to render doubles, then you are fragment processing
+                    bound.</para>
+                <para>Note that rendering time will go up when you increase the resolution. What you
+                    are interested in is whether it goes up linearly with the number of fragments
+                    rendered. If the render time only goes up by 1.2x with a 2x increase in number
+                    of fragments, then the code was not fragment processing bound.</para>
+            </section>
+            <section>
+                <title>Vertex Processing</title>
+                <para>If you are not fragment processing bound, then there's a good chance you are
+                    vertex processing bound. After ruling out fragment processing, simply turn off
+                    all fragment processing. If this does not increase your performance
+                    significantly (there will generally be some change), then you were vertex
+                    processing bound.</para>
+                <para>To turn off fragment processing, simply
+                        <function>glEnable</function>(<literal>GL_CULL_FACE</literal>) and set
+                        <function>glCullFace</function> to <literal>GL_FRONT_AND_BACK</literal>.
+                    That will cause the clipping system to cull all triangles before rasterization.
+                    Obviously, nothing will be rendered, but your performance timings will be for
+                    vertex processing alone.</para>
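+                <para>In code, that is simply:</para>
+                <programlisting language="cpp">glEnable(GL_CULL_FACE);
+glCullFace(GL_FRONT_AND_BACK);    //Cull all triangles; no fragments are generated.</programlisting>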
+            </section>
+            <section>
+                <title>CPU</title>
+                <para>A CPU bottleneck means that the GPU is being starved; it is consuming data
+                    faster than the CPU is providing it. You don't really test for CPU bottlenecks
+                    per se; they are discovered by process of elimination. If nothing else is
+                    bottlenecking the GPU, then the CPU clearly is not giving it enough work to
+                    do.</para>
+            </section>
+        </section>
+        <section>
+            <title>Unfixable Bottlenecks</title>
+            <para>It is entirely possible that you cannot fix a bottleneck. Maybe there's simply no
+                way to avoid a vertex-processing heavy section of your renderer. Perhaps you need
+                all of that fragment processing in a certain area of rendering.</para>
+            <para>If there is some bottleneck that cannot be optimized away, then turn it to your
+                advantage. If you have a CPU bottleneck, then render more detailed models. If you
+                have a vertex-shader bottleneck, improve your lighting by adding some
+                fragment-shader complexity. And so forth. Just make sure that you don't increase
+                complexity to the point where you move the bottleneck.</para>
+        </section>
+    </section>
+    <section>
+        <title>Core Optimizations</title>
+        <para/>
+        <section>
+            <title>State Changes</title>
+            <para>This rule is designed to decrease CPU bottlenecks. The rule itself is simple:
+                minimize the number of state changes. Actually doing it is a complex exercise in
+                graphics engine design.</para>
+            <para>What is a state change? Any OpenGL function that changes the state of the current
+                context is a state change. This includes any function that changes the state of
+                objects bound to the current context.</para>
+            <para>What you should do is gather all of the things you need to render and sort them
+                based on state changes. Objects with similar state will be rendered one after the
+                other. But not all state changes are equal to one another; some state changes are
+                more expensive than others.</para>
+            <para>Vertex array state, for example, is generally considered quite expensive. Try to
+                group many objects that have the same vertex attribute data formats in the same
+                buffer objects. Use <function>glDrawElementsBaseVertex</function> to help when using
+                indexed rendering.</para>
+            <para>The currently bound texture state is also somewhat expensive, as is the currently
+                bound program state.</para>
+            <para>Global state, such as face culling, blending, and the like, is generally
+                considered less expensive. You should still only change it when necessary, but
+                buffer object and texture state are much more important in state sorting.</para>
+            <para>There are also certain tricky states that can hurt you. For example, it is best to
+                avoid changing the direction of the depth test once you have cleared the depth
+                buffer and started rendering to it. This is for reasons having to do with specific
+                hardware optimizations of depth buffering.</para>
+            <para>It is less well-understood how important uniform state is, or how uniform buffer
+                objects compare with traditional uniform values.</para>
+        </section>
+        <section>
+            <title>Object Culling</title>
+            <para>The fastest object is one not drawn. And there's no point in drawing something
+                that isn't seen.</para>
+            <para>The simplest form of object culling is frustum culling: choosing not to render
+                objects that are entirely outside of the view frustum. Determining that an object is
+                off screen is a CPU task. You generally have to represent each object as a sphere or
+                camera-space box; then you test the sphere or box to see if it is partially within
+                the view space.</para>
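+            <para>A sketch of the sphere test follows. The <type>Plane</type> type is hypothetical,
+                holding a plane in camera space; the six frustum planes are assumed to have their
+                normals pointing inward.</para>
+            <programlisting language="cpp">bool SphereInFrustum(const Plane frustum[6], const glm::vec3 &amp;center, float radius)
+{
+    for(int plane = 0; plane &lt; 6; plane++)
+    {
+        //Signed distance from the sphere's center to the plane.
+        float dist = glm::dot(frustum[plane].normal, center) + frustum[plane].d;
+        if(dist &lt; -radius)
+            return false;    //Entirely outside this plane; cull the object.
+    }
+    return true;             //At least partially within the frustum.
+}</programlisting>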
+            <para>There are also a number of techniques for determining whether the view to certain
+                objects is obstructed by other objects. Portals, BSPs, and a variety of other
+                techniques involve preprocessing terrain to determine visibility sets. With these,
+                it can be known that, when the camera is in a certain region of the world, objects
+                in certain other regions cannot be visible, even if they are within the view
+                frustum.</para>
+            <para>A level beyond that involves using something called occlusion queries. This is a
+                way to render an object with the GPU and then ask how many fragments of that object
+                were rasterized. It is generally preferred to render simple test objects, such that
+                if any part of the test object is visible, then the real object will be visible.
+                Color masks (with <function>glColorMask</function>) are used to prevent writing the
+                fragment shader outputs of the test object to the framebuffer.</para>
+            <para>Occlusion queries in OpenGL are objects that have state. They are created with the
+                    <function>glGenQueries</function> function. To start rendering a test object for
+                occlusion queries, the object generated from <function>glGenQueries</function> is
+                passed to the <function>glBeginQuery</function> function, along with the mode of
+                    <literal>GL_SAMPLES_PASSED</literal>. All rendering commands between
+                    <function>glBeginQuery</function> and the corresponding
+                    <function>glEndQuery</function> are part of the test object. If all of the
+                fragments of the object were discarded (via depth buffer or something else), then
+                the query failed. If even one fragment was rendered, then it passed.</para>
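+            <para>A minimal sketch of these calls follows. The
+                    <function>DrawBoundingBox</function> function is hypothetical; depth writes are
+                also masked off here so that the test object does not affect the depth
+                buffer.</para>
+            <programlisting language="cpp">GLuint query;
+glGenQueries(1, &amp;query);
+
+glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);    //Do not write colors.
+glDepthMask(GL_FALSE);                                  //Do not write depth.
+
+glBeginQuery(GL_SAMPLES_PASSED, query);
+DrawBoundingBox();                                      //Render the simple test object.
+glEndQuery(GL_SAMPLES_PASSED);
+
+glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
+glDepthMask(GL_TRUE);</programlisting>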
+            <para>This can be used with conditional rendering. Conditional rendering allows a series
+                of rendering commands, bracketed by
+                    <function>glBeginConditionalRender</function>/<function>glEndConditionalRender</function>
+                functions, to cause rendering of an object to happen or not happen based on the
+                status of an occlusion query object. If the occlusion query passed, then the
+                rendering commands will be executed. If it did not, then they will not be.</para>
+            <para>Of course, conditional rendering can cause pipeline stalls; OpenGL still requires
+                that operations execute in-order, even conditional ones. So all later operations
+                will be held up if a conditional render is waiting for its occlusion query to
+                finish. To avoid this, you can specify <literal>GL_QUERY_NO_WAIT</literal> when
+                beginning the conditional render. This will cause OpenGL to render if the query has
+                not completed before this conditional render is ready to be rendered.</para>
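+            <para>Continuing the sketch above, the conditional render would look like this;
+                    <function>DrawRealObject</function> is again hypothetical:</para>
+            <programlisting language="cpp">glBeginConditionalRender(query, GL_QUERY_NO_WAIT);
+DrawRealObject();    //Executed only if the query passed (or has not yet finished).
+glEndConditionalRender();</programlisting>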
+        </section>
+        <section>
+            <title>Model LOD</title>
+            <para>When a model is far away, it does not need to look as detailed. Therefore, one can
+                substitute less detailed models for more detailed ones. This is commonly referred to
+                as Level of Detail (<acronym>LOD</acronym>).</para>
+            <para>Of course in modern rendering, detail means more than just the number of polygons
+                in a mesh. It can often mean what shader to use, what textures to use with it, etc.
+                So while meshes will often have LODs, so will shaders. Textures have their own
+                built-in LOD mechanism in mipmapping. But it is often the case that low-LOD
+                shaders (those used from far away) do not need as many textures as the closer LOD
+                shaders. You might be able to get away with per-vertex lighting for distant models,
+                while you need per-fragment lighting for those close up.</para>
+            <para>The general problem is how to deal with the transitions between LOD levels. If you
+                change them too close to the camera, then the user will notice the pop. If you do
+                them too far away, you lose much of the performance impact. Finding a good
+                middle-ground is key.</para>
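+            <para>As a sketch, LOD selection is often just a distance comparison; the thresholds
+                here are arbitrary and would be tuned per model:</para>
+            <programlisting language="cpp">int SelectLOD(float distanceToCamera)
+{
+    if(distanceToCamera &lt; 20.0f) return 0;    //Full detail.
+    if(distanceToCamera &lt; 60.0f) return 1;    //Medium detail.
+    return 2;                                    //Lowest detail.
+}</programlisting>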
+        </section>
+        <section>
+            <title>Mipmapping</title>
+            <para>For any texture that represents a surface property of an object, strongly consider
+                giving it mipmaps. This includes bump maps, diffuse textures, specular textures,
+                etc. This is primarily for performance reasons.</para>
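+            <para>Mipmaps are usually generated offline, but as a sketch, OpenGL can build the
+                chain for a bound texture itself and then filter with it:</para>
+            <programlisting language="cpp">glBindTexture(GL_TEXTURE_2D, textureObject);
+glGenerateMipmap(GL_TEXTURE_2D);    //Build the full mipmap chain.
+glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);</programlisting>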
+            <para>When you fetch a texel from a texture, the texture unit hardware will usually
+                fetch the neighboring texels at the mip level(s) in question. These texels will be
+                stored in local memory called the texture cache. This means that, when the next
+                fragment on the surface comes along, that texel will already be in the cache. But
+                this only works for texels that are near each other.</para>
+            <para>When an object is far from the camera or angled sharply relative to the view, the
+                texture coordinates of two neighboring fragments can be quite different from one
+                another. When fetching from a low mipmap level (remember: level 0 is the biggest
+                mipmap), the two fragments will get texels that are far apart, so neither fetch
+                benefits from the texels the other brought into the cache.</para>
+            <para>But if they are fetching from a higher mipmap level, then the large texture
+                coordinate difference between them translates into a small texel-space difference.
+                With proper mipmapping, neighboring fragments can feed off of the cache and do fewer
+                memory accesses. This speeds up texturing performance.</para>
+            <para>This also means that biasing the mipmap LOD lower (to larger mipmaps) can cause
+                serious performance problems in addition to aliasing.</para>
+        </section>
+    </section>
+    <section>
+        <title>Vertex Format</title>
+        <para>Vertex attributes stored in buffer objects can be of a surprisingly large number of
+            formats. These tutorials generally used 32-bit floating-point data, but that is far from
+            the best case.</para>
+        <para>The <glossterm>vertex format</glossterm> specifically refers to the set of values
+            given to the <function>glVertexAttribPointer</function> calls that describe how each
+            attribute is laid out in the buffer object.</para>
+        <section>
+            <title>Attribute Formats</title>
+            <para>Each attribute should take up as little room as possible. This is for performance
+                reasons, but it also saves memory. For buffer objects, these are usually one and the
+                same. The less data you have stored in memory, the faster it gets to the vertex
+                shader.</para>
+            <para>Attributes can be stored in normalized integer formats, just like textures. This
+                is most useful for colors and texture coordinates. For example, to have an attribute
+                that is stored in 4 unsigned normalized bytes, you can use this:</para>
+            <programlisting language="cpp">glVertexAttribPointer(index, 4, GLubyte, GLtrue, 0, offset);</programlisting>
+            <para>If you want to store a normal as a normalized signed short, you can use
+                this:</para>
+            <programlisting language="cpp">glVertexAttribPointer(index, 3, GLushort, GLtrue, 0, offset);</programlisting>
+            <para>There are also a few specialized formats. <literal>GL_HALF_FLOAT</literal> can be
+                used for 16-bit floating-point types. This is useful for when you need values
+                outside of [-1, 1], but don't need the full precision of a 32-bit float.</para>
+            <para>Non-normalized integers can be used as well. These map in GLSL directly to
+                floating-point values, so a non-normalized value of 16 maps to a GLSL value of
+                16.0.</para>
+            <para>The best thing about all of these formats is that they cost
+                    <emphasis>nothing</emphasis> in performance to use. They are all silently
+                converted into floating-point values for consumption by the vertex shader, with no
+                performance lost.</para>
+        </section>
+        <section>
+            <title>Interleaved Attributes</title>
+            <para>Attributes do not all have to come from the same buffer object; multiple
+                attributes can come from multiple buffers. However, where possible, this should be
+                avoided. Furthermore, attributes in the same buffer should be interleaved with one
+                another whenever possible.</para>
+            <para>Consider an array of structs in C++:</para>
+            <programlisting>struct Vertex
+{
+  float position[3];
+  GLubyte color[4];
+  GLushort texCoord[2];
+};
+
+Vertex vertArray[20];</programlisting>
+            <para>The byte offset of <varname>color</varname> in the <type>Vertex</type> struct is
+                12. That is, from the beginning of the <type>Vertex</type> struct, the
+                    <varname>color</varname> variable starts 12 bytes in. The
+                    <varname>texCoord</varname> variable starts 16 bytes in.</para>
+            <para>If we did a memcpy between <varname>vertArray</varname> and a buffer object, and
+                we wanted to set the attributes to pull from this data, we could do so using the
+                stride and offsets to position things properly.</para>
+            <programlisting>glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 20, 0);
+glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE, 20, 12);
+glVertexAttribPointer(2, 2, GL_UNSIGNED_SHORT, GL_TRUE, 20, 16);</programlisting>
+            <para>The fifth argument is the stride. The stride is the number of bytes from the
+                beginning of one instance of this attribute to the beginning of another. The stride
+                here is set to <literal>sizeof</literal>(<type>Vertex</type>). C++ defines that the
+                size of a struct represents the byte offset between separate instances of that
+                struct in an array. So that is our stride.</para>
+            <para>The offsets represent where in the buffer object the first element is. These match
+                the offsets in the struct. If we had loaded this data to a location past the front
+                of our buffer object, we would need to add that starting location to these
+                offsets.</para>
+            <para>There are certain gotchas when deciding how data gets packed like this. First, it
+                is a good idea to keep every attribute on a 4-byte alignment. This may mean
+                introducing explicit padding (empty space) into your structures. Some hardware will
+                have massive slowdowns if things aren't aligned to four bytes.</para>
+            <para>Next, it is a good idea to keep the size of any interleaved vertex data restricted
+                to multiples of 32 bytes in size. Violating this is not as bad as violating the
+                4-byte alignment rule, but one can sometimes get sub-optimal performance if the
+                total size of interleaved vertex data is, for example, 48 bytes. Or 20 bytes, as in
+                our example.</para>
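+            <para>One way to satisfy both suggestions with our example <type>Vertex</type> is
+                explicit padding; the padding bytes are never read:</para>
+            <programlisting>struct PaddedVertex
+{
+  float position[3];     //12 bytes
+  GLubyte color[4];      //4 bytes
+  GLushort texCoord[2];  //4 bytes
+  GLubyte padding[12];   //Pads the total from 20 to 32 bytes.
+};</programlisting>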
+        </section>
+        <section>
+            <title>Packing Suggestions</title>
+            <para>If the smallest vertex data size is what you need, consider these packing
+                techniques.</para>
+            <para>Colors generally do not need to be more than 3-4 bytes in size. One byte per
+                component.</para>
+            <para>Texture coordinates, particularly those clamped to the [0, 1] range, almost never
+                need more than 16-bit precision. So use unsigned shorts.</para>
+            <para>Normals should be stored in the signed 2_10_10_10 format whenever possible.
+                Normals generally do not need that much precision, especially since you're going to
+                normalize them anyway. This format was specifically devised for normals, so use
+                it.</para>
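+            <para>As a sketch, setting up such a normal attribute looks like this (the size must be
+                4 for this type; it requires OpenGL 3.3 or ARB_vertex_type_2_10_10_10_rev):</para>
+            <programlisting language="cpp">glVertexAttribPointer(normalIndex, 4, GL_INT_2_10_10_10_REV, GL_TRUE, stride, offset);</programlisting>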
+            <para>Positions are the trickiest to work with, because their needs vary so much. If you
+                are willing to modify your vertex shaders and put some work into it, you can often
+                use 16-bit signed normalized shorts.</para>
+            <para>The key to this is a special scale/translation matrix. When you are preparing your
+                data, in an offline tool, you take the floating-point positions of a model and
+                determine the model's maximum extents in all three axes. This forms a bounding box
+                around the model. The center of the box is the center of your new model, and you
+                apply a translation to move the points to this center. Then you apply a non-uniform
+                scale to transform the points from their extent range to the [-1, 1] range of signed
+                normalized values. You save the offset and the scales you used as part of your mesh
+                data (not to be stored in the buffer object).</para>
+            <para>When it comes time to render the model, you simply reverse the transformation. You
+                build a scale/translation matrix that undoes what was done to get them into the
+                signed-normalized range. Note that this matrix should not be applied to the normals,
+                because the normals were not compressed this way. A full matrix multiply is even
+                overkill for this transformation; a scale+translation can be done with a simple
+                vector multiply and add.</para>
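+            <para>As a sketch of the offline half of this process, using GLM types and assuming
+                    <varname>positions</varname> holds the mesh's positions:</para>
+            <programlisting language="cpp">//Compute the model's bounding box.
+glm::vec3 bboxMin = positions[0], bboxMax = positions[0];
+for(size_t vert = 1; vert &lt; positions.size(); vert++)
+{
+    bboxMin = glm::min(bboxMin, positions[vert]);
+    bboxMax = glm::max(bboxMax, positions[vert]);
+}
+
+glm::vec3 center = (bboxMax + bboxMin) * 0.5f;        //Store with the mesh data.
+glm::vec3 halfExtent = (bboxMax - bboxMin) * 0.5f;    //Store with the mesh data.
+
+//Each position is remapped to [-1, 1] as (pos - center) / halfExtent and stored
+//as signed normalized shorts. At render time, the reverse transform is just:
+//    position = storedValue * halfExtent + center;</programlisting>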
+        </section>
+    </section>
+    <section>
+        <title>Vertex Caching</title>
+        <para/>
+    </section>
+    <section>
+        <title>Shaders and Performance</title>
+        <para/>
+    </section>
+    <section>
+        <title>Synchronization</title>
+        <para>GPUs gain quite a bit of their performance because, by and large, once you tell them
+            what to do, they go do their stuff without further intervention. As the programmer, you
+            do not care that a frame has not yet completed. All you are interested in is that the
+            user sees the frame when it is ready.</para>
+        <para>There are certain things that the user can do which will cause this perfect
+            asynchronous activity to come to a screeching halt. These are called synchronization
+            events.</para>
+        <para>OpenGL is defined to allow asynchronous behavior; commands that you give do not have
+            to be completed when the function ends (for the most part). However, OpenGL defines this
+            by saying that if there is asynchronous behavior, the user <emphasis>cannot</emphasis>
+            be made aware of it. That is, if you call <function>glDrawArrays</function>, the effect
+            of this command should be based solely on the current state of OpenGL. This means that,
+            if the <function>glDrawArrays</function> command is executed later, the OpenGL
+            implementation must do whatever it takes to prevent later changes from impacting the
+            results.</para>
+        <para>Therefore, if you make a <function>glDrawArrays</function> call that pulls from some
+            buffer object, and then immediately call <function>glBufferSubData</function> on that
+            buffer object, the OpenGL implementation may have to pause the CPU in the
+                <function>glBufferSubData</function> call until the
+                <function>glDrawArrays</function> has at least finished vertex processing. However,
+            the implementation may also simply copy the data you are trying to transfer into some
+            memory it allocates, to be uploaded to the buffer once the
+                <function>glDrawArrays</function> completes. There is no way to be sure which will
+            happen.</para>
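+        <para>That pattern looks like the following; whether the
+                <function>glBufferSubData</function> call stalls the CPU is up to the
+            implementation:</para>
+        <programlisting language="cpp">glBindBuffer(GL_ARRAY_BUFFER, bufferObject);
+glDrawArrays(GL_TRIANGLES, 0, vertexCount);            //May actually execute much later.
+glBufferSubData(GL_ARRAY_BUFFER, 0, size, newData);    //May have to wait for the draw.</programlisting>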
+        <para>Synchronization events usually include changing data objects. That is, changing the
+            contents of buffer objects or textures. Usually, changing simple state of objects, like
+            what attributes a VAO provides or texture parameters, does not cause synchronization
+            issues. Changing global OpenGL state also does not cause synchronization
+            problems.</para>
+        <para>There are ways to modify data objects that still let the GPU run asynchronously.
+            But any discussion of these is well beyond the bounds of this book. Just
+            be aware that data objects that are in active use should probably not have their data
+            modified.</para>
+    </section>
+</appendix>

File Documents/Tutorial Documents.xpr

         </options>
     </meta>
     <projectTree name="Tutorial%20Documents.xpr">
+        <folder name="Appendices">
+            <file name="History%20of%20Graphics%20Hardware.xml"/>
+            <file name="Optimization.xml"/>
+        </folder>
         <folder name="Basics">
             <file name="Basics/Tutorial%2000.xml"/>
             <file name="Basics/Tutorial%2001.xml"/>
         </folder>
         <file name="Building%20the%20Tutorials.xml"/>
         <file name="cssDoc.txt"/>
-        <file name="History%20of%20Graphics%20Hardware.xml"/>
         <file name="meshFormat.rnc"/>
         <file name="Outline.xml"/>
         <file name="preface.xml"/>

File Documents/Tutorials.xml

                 from Phong lighting to reflections to HDR and blooming.</para>
         </partintro>
     </part>
-    <xi:include href="History of Graphics Hardware.xml"></xi:include>
+    <xi:include href="Optimization.xml"/>
+    <xi:include href="History of Graphics Hardware.xml"/>
 </book>

File Documents/preface.xml

             advanced concept. It teaches programmable rendering for beginning graphics programmers,
             from the ground up.</para>
         <para>This book also covers some important material that is often neglected or otherwise
-            relegated to <quote>advanced</quote> concepts. These concepts are not really advanced,
+            relegated to <quote>advanced</quote> concepts. These concepts are not truly advanced,
             but they are often ignored by most introductory material because they do not work with
             the fixed function pipeline.</para>
         <para>This book is first and foremost about learning how to be a graphics programmer.
             lessons that this book teaches. If you already know graphics and are in need of a book
             that teaches modern OpenGL programming, this is not it. It may be useful to you in that
             capacity, but that is not this book's main thrust.</para>
+        <para>This book is intended to teach you how to be a graphics programmer. It is not aimed at
+            any particular graphics field; it is designed to cover most of the basics of 3D
+            rendering. So if you want to be a game developer, a CAD program designer, do some
+            computer visualization, or any number of things, this book can still be an asset for
+            you.</para>
+        <para>This does not mean that it covers everything there is to know about 3D graphics.
+            Hardly. It
+            tries to provide a sound foundation for your further exploration in whatever field of 3D
+            graphics you are interested in.</para>
+        <para>One topic this book does not cover in depth is optimization. The reason for this is
+            simply that serious optimization is an advanced topic. Optimizations can often be
+            platform-specific, different for different kinds of hardware. They can also be
+            API-specific, as some APIs have different optimization needs. Optimizations may be
+            mentioned here and there, but it is simply too complex a subject for a beginning
+            graphics programmer. There is an appendix covering optimization
+            opportunities, but it only provides a fairly high-level look.</para>
     </section>
     <section>
         <title>What You Need</title>