Basic Optimization

Optimization is far too large a subject to cover adequately in a mere appendix. Optimizations tend to be specific to particular algorithms, and they usually involve tradeoffs with memory: one can often make something run faster by spending more memory on it. And even then, optimizations should only be made once proper profiling has determined where performance is actually lacking. This appendix will instead cover the most basic optimizations. These are not guaranteed to improve performance in any particular program, but they almost never hurt, and they are relatively easy to implement. Think of these as the default standard practice you should start from before performing real optimizations. For the sake of clarity, most of the code in this book did not use these practices, so many of them will be new. Do as I say, not as I do.
Textures

There are various techniques you can use to improve the performance of texture accesses.
Image Formats

The smaller the data, the faster it can be fetched into a shader. As with vertex formats, try to use the smallest format that you can get away with; what you can get away with tends to be defined by what you are trying to store in the texture.

Normals

Textures containing normals can use GL_RGB10_A2_SNORM, which is the texture equivalent to the 10-bit signed normalized format we used for attribute normals. However, this can be made more precise if the normals are for a tangent-space normal map. Since tangent-space normals always have a positive Z coordinate, and since the normals are normalized, the actual Z value can be computed from the other two. So you only need to store two values; GL_RG16_SNORM is sufficient for these needs. To compute the third value, do this:

vec2 norm2d = texture(tangentBumpTex, texCoord).xy;
vec3 tanSpaceNormal = vec3(norm2d, sqrt(1.0 - dot(norm2d, norm2d)));

Obviously this costs some performance, so it is a question of how much precision you actually need. On the plus side, using this method means that you will not have to normalize the tangent-space normal fetched from the texture.

The GL_RG16_SNORM format can be made even smaller with texture compression. The GL_COMPRESSED_SIGNED_RG_RGTC2 compressed texture format is a two-channel signed normalized format. It takes up only 8 bits per texel.

Floating-point Intensity

There are two unorthodox formats for floating-point textures, both of which have important uses. The GL_R11F_G11F_B10F format is potentially a good format to use for HDR render targets. As the name suggests, it takes up only 32 bits per texel. The downside is the relative loss of precision compared to GL_RGB16F (as well as the complete loss of a destination alpha). They can store approximately the same magnitude of values, but the smaller format loses some precision. This may or may not impact the overall visual quality of the scene; it should be fairly simple to test which is better.
The GL_RGB9_E5 format is used for input floating-point textures. If you have a texture that represents light intensity in HDR situations, this format can be quite handy. The way it works is that each of the RGB colors gets 9 bits for its value, but they all share the same exponent. This has to do with how floating-point numbers work, but what it boils down to is that the values have to be relatively close to one another in magnitude. They do not have to be that close; there is still some leeway. Values that are too small relative to larger ones become zero. This is oftentimes an acceptable tradeoff, depending on the particular magnitudes in question. This format is useful for textures that are generated offline by tools; you cannot render to a texture in this format.

Colors

Storing colors that are clamped to [0, 1] can be done with good precision with GL_RGBA8 or GL_SRGB8_ALPHA8 as needed. However, compressed texture formats are also available. The S3TC formats are good choices if the compression artifacts are not too noticeable. There are sRGB versions of the S3TC formats as well. The difference among the various S3TC formats is how much alpha you need. The choices are as follows:

GL_COMPRESSED_RGB_S3TC_DXT1_EXT: No alpha.

GL_COMPRESSED_RGBA_S3TC_DXT1_EXT: Binary alpha; either zero or one for each texel. The RGB color for any texel with a zero alpha will also be zero.

GL_COMPRESSED_RGBA_S3TC_DXT3_EXT: 4 bits of alpha per texel.

GL_COMPRESSED_RGBA_S3TC_DXT5_EXT: Alpha is compressed in an S3TC-style block, much like RG texture compression.

If an image needs a varying alpha, the primary choice is between DXT3 and DXT5. DXT5 has the potential for better results, but if the alpha does not compress well with the S3TC algorithm, the results will be rather worse than DXT3.
Use Mipmaps Often

Mipmapping improves performance when textures are mapped to regions that are larger in texel space than in window space; that is, when texture minification happens. Mipmapping improves performance because it keeps the locality of texture accesses near each other. Texture hardware is optimized for accessing regions of textures, so improving the locality of texture data helps performance. How much this matters depends on how the texture is mapped to the surface. Static mappings with explicit texture coordinates, or with linear computation based on surface properties, can use mipmapping to improve locality of texture access. For more unusual mappings, or for pure lookup tables, mipmapping may not help locality at all. Ultimately, mipmaps are most likely to help performance when the texture in question represents some characteristic of a surface and is therefore mapped directly to that surface. So diffuse textures, normal maps, specular maps, and other surface characteristics are all very likely to gain some performance from using mipmaps. Projective lights are less likely to gain from this, as it depends on the geometry they are projected onto.
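The memory cost of mipmapping is modest: for a 2D texture, the full chain adds roughly one third to the base level's texel count. A small C sketch (the helper names are mine) that counts mip levels and the total texels in a chain:

```c
/* Number of mipmap levels for a 2D texture of the given base dimensions. */
static int mip_level_count(int width, int height)
{
    int levels = 1;
    while (width > 1 || height > 1) {
        width  = width  > 1 ? width  / 2 : 1;
        height = height > 1 ? height / 2 : 1;
        ++levels;
    }
    return levels;
}

/* Total texels across the whole mip chain, down to the 1x1 level. */
static long long mip_chain_texels(int width, int height)
{
    long long total = 0;
    for (;;) {
        total += (long long)width * height;
        if (width == 1 && height == 1)
            break;
        width  = width  > 1 ? width  / 2 : 1;
        height = height > 1 ? height / 2 : 1;
    }
    return total;
}
```

A 256x256 texture has 9 levels and 87381 total texels, about 1.33x the 65536 texels of the base level alone.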
Object Optimizations

These optimizations all have to do with the concept of objects. An object, for the purpose of this discussion, is a combination of a mesh, program, uniform data, and set of textures used to render some specific thing in the world.
Object Culling

A virtual world consists of many objects, and the more objects we draw, the longer rendering takes. One major optimization is also a very simple one: render only what must be rendered. There is no point in drawing an object in the world that is not actually visible. Thus, the task is, for each object, to detect whether it would be visible; if it is not, it is not rendered. This process is called visibility culling or object culling.

As a first pass, we can say that objects that are not within the view frustum are not visible. This is called frustum culling, for obvious reasons. Determining that an object is off screen is generally a CPU task. Each object is represented by a simple volume, such as a sphere or camera-space box. These volumes are used because they are relatively easy to test against the view frustum; if a volume is within the frustum, then the corresponding object is considered visible.

Of course, this only boils the scene down to the objects in front of the camera. Objects that are entirely occluded by other objects will still be rendered. There are a number of techniques for detecting whether objects obstruct the view of other objects. Portals, BSPs, and a variety of other techniques involve preprocessing certain static terrain to determine visibility sets. It can then be known that, when the camera is in a certain region of the world, objects in certain other regions cannot be visible even if they are within the view frustum.

A more fine-grained solution involves a hardware feature called occlusion queries. This is a way to render an object and then ask how many fragments of that object were actually rasterized. If even one fragment passed the depth test (assuming all possible occluding surfaces have been rendered), then the object is visible and must be rendered. It is generally preferred to render simple test objects, such that if any part of the test object is visible, then the real object will be visible.
Drawing a test object is much faster than drawing a complex hierarchical model with specialized skinning vertex shaders. Write masks (set with glColorMask and glDepthMask) are used to prevent writing the fragment shader outputs of the test object to the framebuffer. Thus, the test object is only tested against the depth buffer, not actually rendered.

Occlusion queries in OpenGL are objects that have state. They are created with the glGenQueries function. To start rendering a test object for an occlusion query, the object generated by glGenQueries is passed to the glBeginQuery function, along with the mode GL_SAMPLES_PASSED. All rendering commands between glBeginQuery and the corresponding glEndQuery are part of the test object. If all of the fragments of the object were discarded (by the depth test or something else), then the query failed. If even one fragment was rendered, then it passed.

This can be used with a concept called conditional rendering, which is exactly what it says: rendering an object conditionally. It allows a series of rendering commands, bracketed by the glBeginConditionalRender/glEndConditionalRender functions, to execute or not based on the status of an occlusion query object. If the occlusion query passed, then the rendering commands will be executed; if it did not, then they will not be.

Of course, conditional rendering can cause pipeline stalls; OpenGL still requires that operations execute in order, even conditional ones. So all later operations will be held up if a conditional render is waiting for its occlusion query to finish. To avoid this, you can specify GL_QUERY_NO_WAIT when beginning the conditional render. This causes OpenGL to render the object anyway if the query has not completed by the time the conditional render is ready to execute. To gain the maximum benefit from this, it is best to issue the conditional renders well after the test objects they are conditioned on.
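The full sequence can be sketched as follows. This assumes a valid OpenGL context; DrawTestVolume and DrawRealModel are placeholder functions for your own rendering code, not OpenGL calls:

```c
/* Sketch: test a complex object's visibility with an occlusion query,
   then draw it with conditional rendering. */
GLuint query;
glGenQueries(1, &query);

/* Render the cheap test volume without touching the framebuffer. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);
glBeginQuery(GL_SAMPLES_PASSED, query);
DrawTestVolume();                 /* e.g. the object's bounding box */
glEndQuery(GL_SAMPLES_PASSED);
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

/* ...render other things here, giving the query time to complete... */

/* Draw the real object only if a test-volume fragment passed the depth test.
   GL_QUERY_NO_WAIT draws unconditionally if the result is not yet ready. */
glBeginConditionalRender(query, GL_QUERY_NO_WAIT);
DrawRealModel();
glEndConditionalRender();
```

Note the write masks are restored before the conditional render; forgetting to do so would mask out the real object as well.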
Model LOD

When a model is far away, it does not need to look as detailed, since most of the details will be lost due to lack of resolution. Therefore, one can substitute less detailed models for more detailed ones. This is commonly referred to as Level of Detail (LOD). Of course, in modern rendering, detail means more than just the number of polygons in a mesh. It can often mean which shader to use, which textures to use with it, and so forth. So while meshes will often have LODs, so will shaders. Textures have their own built-in LOD mechanism in mipmapping. But it is often the case that low-LOD shaders (those used from far away) do not need as many textures as the closer-LOD shaders. You might be able to get away with per-vertex lighting for distant models, while you need per-fragment lighting for those close up.

The visual problem is how to deal with the transitions between LOD levels. If you switch LODs too close to the camera, the user will notice a pop; if you switch too far away, you lose much of the performance gain from rendering a low-detail mesh. Finding a good middle ground is key.
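A minimal sketch of LOD selection, assuming each model carries an ascending array of switch distances. The function name and threshold scheme are illustrative, not from the text; tuning those distances is exactly the middle-ground search described above:

```c
/* Pick a LOD index from the camera distance. Index 0 is the most detailed
   model; switch_dist holds num_lods - 1 ascending switch distances. */
static int select_lod(float distance, const float *switch_dist, int num_lods)
{
    int lod;
    for (lod = 0; lod < num_lods - 1; ++lod) {
        if (distance < switch_dist[lod])
            break;
    }
    return lod;
}
```

With switch distances {10, 50, 200}, an object 75 units away would use LOD 2, and anything beyond 200 units falls through to the least detailed model.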
State Changes

OpenGL has three kinds of functions: those that actually do rendering, those that retrieve information from OpenGL, and those that modify some information stored in OpenGL. The vast majority of OpenGL functions are the latter. OpenGL's information is generally called state, and needlessly changing state can be expensive. Therefore, this optimization rule is to minimize, as best as possible, the number of state changes. For simple scenes, this can be trivial; in a complicated, data-driven environment, it can be exceedingly complex.

The general idea is to gather up a list of all objects that need to be rendered (after culling non-visible objects and performing any LOD work), then sort them by their shared state. Objects that use the same program share program state, for example. If you render the objects in state order, you will minimize the number of changes to OpenGL state. The three most important pieces of state to sort by are the ones that change most frequently: programs (and their associated uniforms), textures, and VAO state. Global state, such as face culling, blending, and so forth, is less expensive to handle because it does not change as often; generally, all meshes use the same culling parameters, viewport settings, depth comparison state, and so forth.

Minimizing vertex array state changes generally requires more than just sorting; it requires changing how mesh data is stored. This book usually gives every mesh its own VAO, which represents its own separate state. This is certainly very convenient, but it can work against performance if the CPU is a bottleneck. To avoid this, try to group meshes that have the same vertex data format into the same buffer objects and VAOs. This makes it possible to render several objects, with several different glDraw* commands, all using the same VAO state. glDrawElementsBaseVertex is very useful for this purpose when rendering with indexed data. The fewer VAO binds, the better.
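The state sort itself can be as simple as an ordinary comparison sort over the visible-object list. A sketch, assuming a minimal Object struct whose fields are illustrative GL object names; sorting by program first, then texture, then VAO puts objects with equal state next to each other:

```c
#include <stdlib.h>

/* Illustrative per-object render state; ids as returned by GL creation calls. */
typedef struct {
    unsigned program;
    unsigned texture;
    unsigned vao;
} Object;

/* Order by the most expensive state first, so equal state becomes adjacent. */
static int compare_objects(const void *a, const void *b)
{
    const Object *x = a;
    const Object *y = b;
    if (x->program != y->program) return x->program < y->program ? -1 : 1;
    if (x->texture != y->texture) return x->texture < y->texture ? -1 : 1;
    if (x->vao != y->vao)         return x->vao < y->vao ? -1 : 1;
    return 0;
}

static void sort_by_state(Object *objects, size_t count)
{
    qsort(objects, count, sizeof(Object), compare_objects);
}
```

When iterating the sorted list, you then bind a program, texture, or VAO only when it differs from the previous object's.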
There is less information on how harmful uniform state changes are to performance, or on the performance difference between changing in-program uniforms and buffer-based uniforms. Be advised that state sorting cannot help with blended objects, because blending correctness requires sorting them by depth; it is therefore best to minimize the number of objects that need blending. There are also certain tricky states that can hurt, depending on hardware. For example, it is best to avoid changing the direction of the depth test once you have cleared the depth buffer and started rendering to it. This is for reasons having to do with specific hardware optimizations of depth buffering.
Finding the Bottleneck

The absolute best tool to have in your repertoire for optimizing your rendering is finding out why your rendering is slow. GPUs are designed as a pipeline, and each stage in the pipeline is functionally independent of the others. A vertex shader can be computing some number of vertices, while clipping and rasterization are working on other triangles, while the fragment shader is working on fragments generated by still other triangles. However, a vertex generated by the vertex shader cannot pass to the rasterizer if the rasterizer is busy; similarly, the rasterizer cannot generate more fragments if all of the fragment shaders are in use. Therefore, the overall performance of the GPU can only be the performance of the slowest step in the pipeline.

This means that, in order to actually make the GPU faster, you must find the particular stage of the pipeline that is the slowest. This stage is referred to as the bottleneck. Until you know what the bottleneck is, the most you can do is guess at why things are slower than you expect, and making major code changes based purely on a guess is unwise; at least, not until you have a lot of experience with the GPU(s) in question.

It should also be noted that bottlenecks are not consistent throughout the rendering of a single frame. Some parts of it can be CPU bound, others fragment-shader bound, and so on. Thus, attempt to isolate particular sections of rendering that likely share the same problem before trying to find the bottleneck.
Measuring Performance

The most common statistic you see when most people talk about performance is frames per second (FPS). While this is useful when talking to the layperson, a graphics programmer does not use FPS as the standard performance metric. FPS is the overall goal, but when measuring the actual performance of a piece of rendering code, the more useful metric is simply time, usually measured in milliseconds (ms). If you are attempting to maintain 60fps, that translates to having 16.67 milliseconds in which to perform all rendering tasks.

One thing that confounds performance measurement is the fact that the GPU is both pipelined and asynchronous. When running regular code, if you call a function, you are usually assured that the actions the function took have all completed when it returns. When you issue a rendering call (any glDraw* function), not only is it likely that rendering has not completed by the time the call returns, it is very likely that rendering has not even started. Not even doing a buffer swap will ensure that the GPU has finished, as GPUs can defer the actual buffer swap until later.

If you specifically want to time the GPU, then you must force the GPU to finish its work. To do that in OpenGL, you call a function cleverly titled glFinish. It will return sometime after the GPU finishes. Note that it does not guarantee that it returns immediately after, only at some point after the GPU has finished all of its commands. So it is a good idea to give the GPU a healthy workload before calling glFinish, to minimize the difference between the time you measure and the time the GPU actually took.

You will also want to turn vertical synchronization, or vsync, off. There is a certain point during which a graphics chip is able to swap the front and back framebuffers with a guarantee of not causing half of the displayed image to be from one buffer and half from another.
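The 16.67ms figure is just the target frame rate converted into a time budget. A trivial helper (the function name is mine) makes the arithmetic explicit:

```c
/* Convert a target frame rate into a per-frame time budget in milliseconds.
   At 60fps there are about 16.67ms for all CPU and GPU work in a frame. */
static double frame_budget_ms(double frames_per_second)
{
    return 1000.0 / frames_per_second;
}
```

Thinking in milliseconds rather than FPS also makes costs additive: a 2ms shadow pass plus a 5ms lighting pass is 7ms of a 16.67ms budget, a relationship FPS numbers obscure.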
The latter eventuality is called tearing, and having vsync enabled avoids it. However, when profiling you do not care about tearing; you want to know about performance. So you need to turn off any form of vsync. Vsync is controlled by the window-system-specific extensions GLX_EXT_swap_control and WGL_EXT_swap_control. They both do the same thing and have similar APIs. The wglSwapIntervalEXT/glXSwapIntervalEXT functions take an integer that tells how many vsyncs to wait between swaps; if you pass 0, buffer swaps happen immediately.
Possible Bottlenecks

There are several potential bottlenecks that a section of rendering code can have. We will list them along with ways of determining whether each is the bottleneck. You should test these in the order presented below.
Fragment Processing

This is probably the easiest bottleneck to find. The quantity of fragment processing depends entirely on the number of fragments the various triangles are rasterized to. Therefore, simply increase the resolution. If you increase the resolution to 2x the number of pixels (double either the width or the height) and the time to render doubles, then you are fragment-processing bound. Note that rendering time will always go up somewhat when you increase the resolution; what you are interested in is whether it goes up linearly with the number of fragments rendered. If the rendering time only goes up by 1.2x with a 2x increase in the number of fragments, then the code was not entirely fragment-processing bound.
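This scaling check can be reduced to a single ratio. The helper below is an illustrative formulation of the heuristic in the text, not a standard tool: it returns the fraction of the time increase explained by the fragment increase, where a value near 1.0 means fully fragment-processing bound:

```c
/* Ratio of the observed time increase to the fragment-count increase.
   1.0 means time scaled linearly with fragments (fragment bound);
   values well below 1.0 mean something else dominates. */
static double fragment_bound_factor(double time_before, double time_after,
                                    double frags_before, double frags_after)
{
    return (time_after / time_before) / (frags_after / frags_before);
}
```

Using the numbers from the text: doubling fragments and seeing 2x the time gives a factor of 1.0, while seeing only 1.2x the time gives 0.6, indicating the code was not entirely fragment bound.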
Vertex Processing

If you are not fragment-processing bound, then there is a good chance you are vertex-processing bound. After ruling out fragment processing, simply turn off rasterization entirely. If this does not increase your performance significantly (there will generally be some change), then you were vertex-processing bound. To turn off rasterization, call glEnable(GL_RASTERIZER_DISCARD). This will cause all fragments to be discarded. Obviously, nothing will be rendered, but all of the steps before rasterization will still be executed. Therefore, your performance timings will be for vertex processing alone.
CPU

A CPU bottleneck means that the GPU is being starved: it is consuming data faster than the CPU is providing it. You do not really test for CPU bottlenecks per se; they are discovered by process of elimination. If nothing else is bottlenecking the GPU, then the CPU clearly is not giving it enough work to do.
Unfixable Bottlenecks

It is entirely possible that you cannot fix a bottleneck. Maybe there is simply no way to avoid a vertex-processing-heavy section of your renderer. Perhaps you need all of that fragment processing in a certain area of rendering. If there is some bottleneck that cannot be optimized away, then turn it to your advantage by increasing the complexity of the other stages in the pipeline. If you have an unfixable CPU bottleneck, then render more detailed models. If you have a vertex-shader bottleneck, improve your lighting by adding some fragment-shader complexity. And so forth. Just make sure that you do not increase complexity to the point where you move the bottleneck and make things slower.
Vertex Format

Vertex attributes stored in buffer objects can be of a surprisingly large number of formats. These tutorials generally used 32-bit floating-point data, but that is far from the best choice. The vertex format specifically refers to the set of values given to the glVertexAttribPointer calls that describe how each attribute is laid out in the buffer object.
Attribute Formats

Each attribute should take up as little room as possible. This is for performance reasons, but it also saves memory; for buffer objects, these are usually one and the same. The less data there is, the faster it gets to the vertex shader.

Attributes can be stored in normalized integer formats, just like textures. This is most useful for colors and texture coordinates. For example, to have an attribute that is stored in 4 unsigned normalized bytes, you can use this:

glVertexAttribPointer(index, 4, GL_UNSIGNED_BYTE, GL_TRUE, 0, offset);

If you want to store a normal as normalized signed shorts, you can use this:

glVertexAttribPointer(index, 3, GL_SHORT, GL_TRUE, 0, offset);

There are also a few specialized formats. GL_HALF_FLOAT can be used for 16-bit floating-point types. This is useful for when you need values outside of [-1, 1] but do not need the full precision of a 32-bit float.

Non-normalized integers can be used as well. These map in GLSL directly to floating-point values, so a non-normalized value of 16 maps to a GLSL value of 16.0.

The best thing about all of these formats is that they cost nothing in performance to use. They are all silently converted into floating-point values for consumption by the vertex shader, with no performance lost.
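The normalized-integer conversion is easy to reproduce on the CPU side, which is useful when packing vertex data in a tool. A small C sketch (the helper names are mine) showing how a [0, 1] float color component maps to the GL_UNSIGNED_BYTE + GL_TRUE storage above, and how the vertex shader will see it:

```c
/* Encode a [0, 1] float as an 8-bit unsigned normalized value,
   as stored by a GL_UNSIGNED_BYTE attribute with normalization on. */
static unsigned char float_to_unorm8(float f)
{
    if (f < 0.0f) f = 0.0f;
    if (f > 1.0f) f = 1.0f;
    return (unsigned char)(f * 255.0f + 0.5f); /* round to nearest */
}

/* Decode back to float, the way the vertex shader receives the value. */
static float unorm8_to_float(unsigned char u)
{
    return (float)u / 255.0f;
}
```

The round trip loses at most half a step (about 0.002), which is far below what color precision requires.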
Interleaved Attributes

Attributes do not all have to come from the same buffer object; multiple attributes can come from multiple buffers. However, where possible, this should be avoided. Furthermore, attributes in the same buffer should be interleaved with one another whenever possible. Consider an array of structs in C++:

struct Vertex
{
    float position[3];
    GLubyte color[4];
    GLushort texCoord[2];
};

Vertex vertArray[20];

The byte offset of color in the Vertex struct is 12. That is, from the beginning of the Vertex struct, the color variable starts 12 bytes in. The texCoord variable starts 16 bytes in. If we did a memcpy between vertArray and a buffer object, and we wanted to set the attributes to pull from this data, we could do so using the stride and offsets to position things properly.

glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 20, 0);
glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE, 20, 12);
glVertexAttribPointer(3, 2, GL_UNSIGNED_SHORT, GL_TRUE, 20, 16);

The fifth argument is the stride: the number of bytes from the beginning of one instance of this attribute to the beginning of the next. The stride here is set to sizeof(Vertex), which is 20. C++ defines that the size of a struct represents the byte offset between separate instances of that struct in an array, so that is our stride.

The offsets represent where in the buffer object the first element is. These match the offsets in the struct. If we had loaded this data to a location past the front of our buffer object, we would need to offset these values by the position where we uploaded our data.

There are certain gotchas when deciding how data gets packed like this. First, it is a good idea to keep every attribute on a 4-byte alignment. This may mean introducing explicit padding (empty space) into your structures. Some hardware will have massive slowdowns if attributes are not aligned to four bytes.
Next, it is a good idea to keep the size of any interleaved vertex data restricted to multiples of 32 bytes in size. Violating this is not as bad as violating the 4-byte alignment rule, but one can sometimes get sub-optimal performance if the total size of interleaved vertex data is, for example, 48 bytes. Or 20 bytes, as in our example.
Packing Suggestions

If the smallest vertex data size is what you need, consider these packing techniques.

Colors generally do not need to be more than 3-4 bytes in size: one byte per component.

Texture coordinates, particularly those clamped to the [0, 1] range, almost never need more than 16-bit precision. So use unsigned shorts.

Normals should be stored in the signed 2_10_10_10 format whenever possible. Normals generally do not need that much precision, especially since you are going to normalize them anyway. This format was specifically devised for normals, so use it.

Positions are the trickiest to work with, because the needs vary so much. If you are willing to modify your vertex shaders and put some work into it, you can often use 16-bit signed normalized shorts. The key to this is a special scale/translation matrix. When you are preparing your data in an offline tool, you take the floating-point positions of a model and determine the model's maximum extents in all three axes. This forms a bounding box around the model. The center of the box becomes the center of your new model, and you apply a translation to move the points to this center. Then you apply a non-uniform scale to transform the points from their extent range to the [-1, 1] range of signed normalized values. You save the offset and the scales you used as part of your mesh data (not stored in the buffer object).

When it comes time to render the model, you simply reverse the transformation: you build a scale/translation matrix that undoes what was done to get the positions into the signed-normalized range. Note that this matrix should not be applied to the normals, because the normals were not compressed this way. A full matrix multiply is overkill for this transformation; a scale plus translation can be done with a simple vector multiply and add.
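The position-compression steps above can be sketched per axis in C. The AxisQuant struct and function names are illustrative; center and half_extent are the values your offline tool would save alongside the mesh:

```c
#include <math.h>

/* Per-axis quantization parameters saved by the offline tool:
   the bounding-box center and half the box's extent along this axis. */
typedef struct { float center, half_extent; } AxisQuant;

/* Offline step: recenter, scale into [-1, 1], and round to a signed short. */
static short quantize_position(float p, AxisQuant q)
{
    float n = (p - q.center) / q.half_extent;  /* now in [-1, 1] */
    return (short)(n >= 0.0f ? n * 32767.0f + 0.5f
                             : n * 32767.0f - 0.5f);
}

/* Render-time step: the per-axis scale + translation the shader applies
   (in practice folded into the model-to-camera matrix). */
static float dequantize_position(short s, AxisQuant q)
{
    return ((float)s / 32767.0f) * q.half_extent + q.center;
}
```

The round-trip error is bounded by half a quantization step, i.e. half_extent / 32767, which is why the technique works best when the bounding box is tight around the model.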
Vertex Caching