<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng" type="xml"?>
<?oxygen SCHSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng"?>
<appendix xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
    <?dbhtml filename="Basic Optimization.html" ?>
    <title>Basic Optimization</title>
    <para>Optimization is far too large a subject to cover adequately in a mere appendix.
        Optimizations tend to be specific to particular algorithms, and they usually involve
        tradeoffs with memory: one can often make something run faster by taking up more memory.
        And even then, optimizations should only be made when proper profiling has determined
        where performance is actually lacking.</para>
    <para>This appendix will instead cover the most basic optimizations. These are not guaranteed to
        improve performance in any particular program, but they almost never hurt. They are also
        things you can implement relatively easily. Think of these as the default standard practice
        you should start with before performing real optimizations. For the sake of clarity, most of
        the code in this book did not use these practices, so many of them will be new.</para>
    <para>Do as I say, not as I do.</para>
    <section>
        <title>Vertex Format</title>
        <para>Interleave vertex attribute arrays for objects where possible. Obviously, if you need
            to overwrite certain attributes frequently while other attributes remain static, then
            you will need to separate that data. But unless you have some specific need to do so,
            interleave your vertex data.</para>
        <para>Equally importantly, try to use the smallest vertex data possible. Small data means
            that GPU caches are more efficient; they store more vertex attributes per cache line.
            This means fewer direct memory accesses, which improves the speed at which vertex
            shaders receive their attributes. In this book, the vertex data was almost always
            32-bit floats. You should only use 32-bit floats when you absolutely need that much
            precision.</para>
        <para>The biggest key to this is the use of normalized integer values for attributes. As a
            reminder for how this works, here is the definition of
                <function>glVertexAttribPointer</function>:</para>
        <funcsynopsis>
            <funcprototype>
                <funcdef>void <function>glVertexAttribPointer</function></funcdef>
                <paramdef>GLuint <parameter>index</parameter></paramdef>
                <paramdef>GLint <parameter>size</parameter></paramdef>
                <paramdef>GLenum <parameter>type</parameter></paramdef>
                <paramdef>GLboolean <parameter>normalized</parameter></paramdef>
                <paramdef>GLsizei <parameter>stride</parameter></paramdef>
                <paramdef>GLvoid *<parameter>pointer</parameter></paramdef>
            </funcprototype>
        </funcsynopsis>
        <para>If <varname>type</varname> is an integer type, like
                <literal>GL_UNSIGNED_BYTE</literal>, then setting <varname>normalized</varname> to
                <literal>GL_TRUE</literal> means that OpenGL interprets the integer value as
            normalized. It will automatically convert the integer 255 to 1.0, and so forth. If the
            normalization flag is false instead, then it will convert the integers directly to
            floats: 255 becomes 255.0, etc. Signed values can be normalized as well;
                <literal>GL_BYTE</literal> with normalization will map 127 to 1.0, -128 to -1.0,
            etc.</para>
        <formalpara>
            <title>Colors</title>
            <para>Color values are commonly stored as 4 unsigned normalized bytes. This is far
                smaller than using 4 32-bit floats, but the loss of precision is almost always
                negligible. To send 4 unsigned normalized bytes, use:</para>
        </formalpara>
        <programlisting language="cpp">glVertexAttribPointer(#, 4, GL_UNSIGNED_BYTE, GL_TRUE, ...);</programlisting>
        <para>The best part is that all of this is free; it costs no actual performance. Note
            however that 32-bit integers cannot be normalized.</para>
        <para>Sometimes, color values need higher precision than 8 bits, but less than 16 bits. If
            a color is in the linear RGB colorspace, it is often desirable to give it greater than
            8-bit precision. If the alpha of the color is negligible or non-existent, then a special
                <varname>type</varname> can be used. This type is
                <literal>GL_UNSIGNED_INT_2_10_10_10_REV</literal>. It takes 32-bit unsigned
            normalized integers and pulls the four components of the attributes out of each integer.
            This type can only be used with normalization:</para>
        <programlisting language="cpp">glVertexAttribPointer(#, 4, GL_UNSIGNED_BYTE, GL_TRUE, ...);</programlisting>
        <para>The most significant 2 bits of each integer are the Alpha. The next 10 bits are the
            Blue, then Green, and finally Red. Make note of the fact that it is reversed. On a
            typical compiler that allocates bit-fields starting at the least significant bit, it is
            equivalent to this bitfield struct in C:</para>
        <programlisting language="cpp">struct RGB10_A2
{
  unsigned int red      : 10;
  unsigned int green    : 10;
  unsigned int blue     : 10;
  unsigned int alpha    : 2;
};</programlisting>
        <formalpara>
            <title>Normals</title>
            <para>Another attribute where precision isn't of paramount importance is normals. If the
                normals are normalized, as they always should be, the coordinates are always going
                to be on the [-1, 1] range. So signed normalized integers are appropriate here. 8
                bits of precision are sometimes enough, but 10-bit precision is going to be an
                improvement. 16-bit precision, <literal>GL_SHORT</literal>, may be overkill, so
                stick with <literal>GL_INT_2_10_10_10_REV</literal> (the signed version of the
                above). Because this format provides 4 values, you will need to use 4 as the size of
                the attribute, but you can still use <type>vec3</type> in the shader as the normal's
                input variable.</para>
        </formalpara>
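        <para>As an illustration, here is one way one might pack a floating-point normal into the
            signed 2_10_10_10 layout on the CPU. The helper functions here are hypothetical, made
            up for this sketch; they are not part of any OpenGL or GLM API:</para>
        <programlisting language="cpp">//Hypothetical helper: quantizes a [-1, 1] float to a 10-bit signed field.
GLuint PackSigned10(float value)
{
    float clamped = value &lt; -1.0f ? -1.0f : (value > 1.0f ? 1.0f : value);
    GLint quantized = (GLint)(clamped * 511.0f);
    return ((GLuint)quantized) &amp; 0x3FF; //Keep the low 10 bits (two's complement).
}

//Hypothetical helper: X in bits 0-9, Y in 10-19, Z in 20-29; the 2-bit W is left 0.
GLuint PackNormal(const glm::vec3 &amp;normal)
{
    return PackSigned10(normal.x) |
        (PackSigned10(normal.y) &lt;&lt; 10) |
        (PackSigned10(normal.z) &lt;&lt; 20);
}</programlisting>
        <para>The packed integers would then be given to
                <function>glVertexAttribPointer</function> with a size of 4, a
                <varname>type</varname> of <literal>GL_INT_2_10_10_10_REV</literal>, and
            normalization set to <literal>GL_TRUE</literal>.</para>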
        <formalpara>
            <title>Texture Coordinates</title>
            <para>Two-dimensional texture coordinates do not typically need 32-bits of precision. 8
                and 10-bit precision are usually not good enough, but 16-bit unsigned normalized
                integers are often sufficient. If texture coordinates range outside of [0, 1], then
                normalization will not be sufficient. In these cases, there is an alternative to
                32-bit floats: 16-bit floats.</para>
        </formalpara>
        <para>The hardest part of dealing with 16-bit floats is that C/C++ does not deal with them
            very well. Unlike virtually every other type used in buffer objects, there is no native
            16-bit float type. Even the 10-bit format can be built using bit selectors in structs,
            as above. Generating a 16-bit float from a 32-bit float requires care, as well as an
            understanding of how floating-point values work.</para>
        <para>This is where the GLM math library comes in handy. It provides the
                <type>glm::thalf</type> type, which represents a 16-bit floating-point value. It
            has overloaded operators, so that it can be used like a regular <type>float</type>. GLM
            also provides <type>glm::hvec</type> and <type>glm::hmat</type> types for vectors and
            matrices, respectively.</para>
        <formalpara>
            <title>Positions</title>
            <para>In general, positions are the least likely attribute to be easily optimized
                without consequence. 16-bit floats can be used, but these are restricted to a range
                of approximately [-6550.4, 6550.4]. They also lack some precision, which may be
                necessary depending on the size and detail of the object in model space.</para>
        </formalpara>
        <para>If 16-bit floats are insufficient, a certain form of compression can be used. The
            process is as follows:</para>
        <orderedlist>
            <listitem>
                <para>When loading the mesh data, find the bounding volume of the mesh in model
                    space. To do this, find the maximum and minimum values in the X, Y and Z
                    directions independently. This represents a box in model space that
                    contains all of the vertices. This box is defined by two 3D vectors: the
                    maximum vector (containing the max X, Y and Z values), and the minimum vector.
                    These are named <varname>max</varname> and <varname>min</varname>.</para>
            </listitem>
            <listitem>
                <para>Compute the center point of this region:</para>
                <programlisting language="cpp">glm::vec3 center = (max + min) / 2.0f;</programlisting>
            </listitem>
            <listitem>
                <para>Compute half of the size (width, height, depth) of the region:</para>
                <programlisting language="cpp">glm::vec3 halfSize = (max - min) / 2.0f;</programlisting>
            </listitem>
            <listitem>
                <para>For each position in the mesh, compute a normalized version by subtracting the
                    center from it, then dividing it by half the size. As follows:</para>
                <programlisting language="cpp">glm::vec3 newPosition = (position - center) / halfSize;</programlisting>
            </listitem>
            <listitem>
                <para>For each new position, convert it to a signed, normalized integer by
                    multiplying it by 32767:</para>
                <programlisting language="cpp">short normX = (short)(newPosition.x * 32767.0f);
short normY = (short)(newPosition.y * 32767.0f);
short normZ = (short)(newPosition.z * 32767.0f);</programlisting>
                <para>These three coordinates are then stored as the new position data in the buffer
                    object. A consolidated sketch of the whole process appears after this
                    list.</para>
            </listitem>
            <listitem>
                <para>Keep the <varname>center</varname> and <varname>halfSize</varname> variables
                    stored with your mesh data. When computing the model-space to camera-space
                    matrix for that mesh, add one final matrix to the top. This matrix will perform
                    the inverse operation from the one that we used to compute the normalized
                    values:</para>
                <programlisting language="cpp">matrixStack.Translate(center);
matrixStack.Scale(halfSize);</programlisting>
                <para>This final matrix should <emphasis>not</emphasis> be applied to the normal's
                    matrix. Compute the normal matrix <emphasis>before</emphasis> applying the final
                    step above. This means that if you were not using a separate matrix for normals
                    (because you had no non-uniform scales in your model-to-camera matrix), you will
                    need to use one now, which may make your data bigger or make your shader run
                    slightly slower.</para>
            </listitem>
        </orderedlist>
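        <para>Putting the steps above together, a mesh-loading tool might do something like the
            following sketch. The <type>CompressedMesh</type> structure is hypothetical, made up
            purely for this example:</para>
        <programlisting language="cpp">//Hypothetical container for the compressed mesh data.
struct CompressedMesh
{
    std::vector&lt;short> positions; //3 signed normalized shorts per vertex.
    glm::vec3 center;
    glm::vec3 halfSize;
};

void CompressPositions(const std::vector&lt;glm::vec3> &amp;positions, CompressedMesh &amp;mesh)
{
    //Step 1: find the model-space bounding box.
    glm::vec3 min = positions[0];
    glm::vec3 max = positions[0];
    for(size_t i = 1; i &lt; positions.size(); ++i)
    {
        min = glm::min(min, positions[i]);
        max = glm::max(max, positions[i]);
    }

    //Steps 2 and 3: center and half-size of the bounding box.
    mesh.center = (max + min) / 2.0f;
    mesh.halfSize = (max - min) / 2.0f;

    //Steps 4 and 5: normalize each position into [-1, 1], then quantize.
    for(size_t i = 0; i &lt; positions.size(); ++i)
    {
        glm::vec3 newPosition = (positions[i] - mesh.center) / mesh.halfSize;
        mesh.positions.push_back((short)(newPosition.x * 32767.0f));
        mesh.positions.push_back((short)(newPosition.y * 32767.0f));
        mesh.positions.push_back((short)(newPosition.z * 32767.0f));
    }
}</programlisting>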
        <formalpara>
            <title>Alignment</title>
            <para>One additional rule you should always follow is this: make sure that all
                attributes begin on a 4-byte boundary. This applies even to attributes that are
                smaller than 4 bytes, such as a 3-vector of 8-bit values. While OpenGL will allow
                you to use arbitrary alignments, hardware may have problems making it work. So if
                you make your 3D position data 16-bit floats or 16-bit signed normalized integers,
                you will still waste 2 bytes from every position. You may want to try making your
                position values 4-dimensional and putting something useful in the W component.</para>
        </formalpara>
    </section>
    <section>
        <title>Textures</title>
        <para>There are various techniques you can use to improve the performance of texture
            accesses.</para>
        <section>
            <title>Image Formats</title>
            <para>The smaller the data, the faster it can be fetched into a shader. As with vertex
                formats, try to use the smallest format that you can get away with. As with vertex
                formats, what you can get away with tends to be defined by what you are trying to
                store in the texture.</para>
            <formalpara>
                <title>Normals</title>
                <para>Textures containing normals can use <literal>GL_RGB10_A2_SNORM</literal>,
                    which is the texture equivalent to the 10-bit signed normalized format we used
                    for attribute normals. However, this can be made more precise if the normals are
                    for a tangent-space normal map. Since the tangent-space normals always have a
                    positive Z coordinate, and since the normals are normalized, the actual Z value
                    can be computed from the other two. So you only need to store 2 values;
                        <literal>GL_RG16_SNORM</literal> is sufficient for these needs. To compute
                    the third value, do this:</para>
            </formalpara>
            <programlisting language="glsl">vec2 norm2d = texture(tangentBumpTex, texCoord).xy;
vec3 tanSpaceNormal = vec3(norm2d, sqrt(1.0 - dot(norm2d, norm2d)));</programlisting>
            <para>Obviously this costs some performance, so it's a question of how much precision
                you actually need. On the plus side, using this method means that you will not have
                to normalize the tangent-space normal fetched from the texture.</para>
            <para>The <literal>GL_RG16_SNORM</literal> format can be made even smaller with texture
                compression. The <literal>GL_COMPRESSED_SIGNED_RG_RGTC2</literal> compressed texture
                format is a 2-channel signed normalized format. It only takes up 8 bits per
                pixel.</para>
            <formalpara>
                <title>Floating-point Intensity</title>
                <para>There are two unorthodox formats for floating-point textures, both of which
                    have important uses. The <literal>GL_R11F_G11F_B10F</literal> format is
                    potentially a good format to use for HDR render targets. As the name suggests,
                    it takes up only 32-bits. The downside is the relative loss of precision
                    compared to <literal>GL_RGB16F</literal> (as well as the complete loss of a
                    destination alpha). They can store approximately the same magnitude of values,
                    but the smaller format loses some precision. This may or may not impact the
                    overall visual quality of the scene. It should be fairly simple to test to see
                    which is better.</para>
            </formalpara>
            <para>The <literal>GL_RGB9_E5</literal> format is used for input floating-point
                textures. If you have a texture that represents light intensity in HDR situations,
                this format can be quite handy. The way it works is that each of the RGB colors gets
                9 bits for its value, but they all share the same exponent. This has to do with
                how floating-point numbers work, but what it boils down to is that the values have
                to be relatively close to one another in magnitude. They do not have to be that
                close; there's still some leeway. Values that are too small relative to larger ones
                become zero. This is oftentimes an acceptable tradeoff, depending on the particular
                magnitudes in question.</para>
            <para>This format is useful for textures that are generated offline by tools. You cannot
                render to a texture in this format.</para>
            <formalpara>
                <title>Colors</title>
                <para>Storing colors that are clamped to [0, 1] can be done with good precision with
                        <literal>GL_RGBA8</literal> or <literal>GL_SRGB8_ALPHA8</literal> as needed.
                    However, compressed texture formats are available. The S3TC formats are good
                    choices if the compression artifacts are not too noticeable. There are sRGB
                    versions of the S3TC formats as well.</para>
            </formalpara>
            <para>The differences among the various S3TC formats come down to how much alpha you
                need. The choices are as follows:</para>
            <glosslist>
                <glossentry>
                    <glossterm>GL_COMPRESSED_RGB_S3TC_DXT1_EXT</glossterm>
                    <glossdef>
                        <para>No alpha.</para>
                    </glossdef>
                </glossentry>
                <glossentry>
                    <glossterm>GL_COMPRESSED_RGBA_S3TC_DXT1_EXT</glossterm>
                    <glossdef>
                        <para>Binary alpha. Either zero or one for each texel. The RGB color for any
                            texel with a zero alpha will also be zero.</para>
                    </glossdef>
                </glossentry>
                <glossentry>
                    <glossterm>GL_COMPRESSED_RGBA_S3TC_DXT3_EXT</glossterm>
                    <glossdef>
                        <para>4-bits of alpha per pixel.</para>
                    </glossdef>
                </glossentry>
                <glossentry>
                    <glossterm>GL_COMPRESSED_RGBA_S3TC_DXT5_EXT</glossterm>
                    <glossdef>
                        <para>Alpha is compressed in an S3TC block, much like RG texture
                            compression.</para>
                    </glossdef>
                </glossentry>
            </glosslist>
            <para>If an image needs to have a varying alpha, the primary difference will be between
                DXT3 and DXT5. DXT5 has the potential for better results, but if the alpha does not
                compress well with the S3TC algorithm, the results will be rather worse than
                DXT3.</para>
        </section>
        <section>
            <title>Use Mipmaps Often</title>
            <para>Mipmapping improves performance when textures are mapped to regions that are
                larger in texel space than in window space. That is, when texture minification
                happens. Mipmapping improves performance because it keeps the locality of texture
                accesses near each other. Texture hardware is optimized for accessing regions of
                textures, so improving locality of texture data will help performance.</para>
            <para>How much this matters depends on how the texture is mapped to the surface. Static
                mapping with explicit texture coordinates, or with linear computation based on
                surface properties, can use mipmapping to improve locality of texture access. For
                more unusual mappings or for pure-lookup tables, mipmapping may not help locality at
                all.</para>
            <para>Ultimately, mipmaps are more likely to help performance when the texture in
                question represents some characteristic of a surface, and is therefore mapped
                directly to that surface. So diffuse textures, normal maps, specular maps, and other
                surface characteristics are all very likely to gain some performance from using
                mipmaps. Projective lights are less likely to gain from this, as it depends on the
                geometry that they are projected onto.</para>
        </section>
    </section>
    <section>
        <?dbhtml filename="Optimize Core.html"?>
        <title>Object Optimizations</title>
        <para>These optimizations all have to do with the concept of objects. An object, for the
            purpose of this discussion, is a combination of a mesh, program, uniform data, and set
            of textures used to render some specific thing in the world.</para>
        <section>
            <title>Object Culling</title>
            <para>A virtual world consists of many objects. The more objects we draw, the longer
                rendering takes.</para>
            <para>One major optimization is also a very simple one: render only what must be
                rendered. There is no point in drawing an object in the world that is not actually
                visible. Thus, the task here is to, for each object, detect whether it would be
                visible; if it is not, then it is not rendered. This process is called visibility
                culling or object culling.</para>
            <para>As a first pass, we can say that objects that are not within the view frustum are
                not visible. This is called frustum culling, for obvious reasons. Determining that
                an object is off screen is generally a CPU task. Each object must be represented by
                a simple volume, such as a sphere or camera-space box. These objects are used
                because they are relatively easy to test against the view frustum; if they are
                within the frustum, then the corresponding object is considered visible.</para>
            <para>Of course, this only boils the scene down to the objects in front of the camera.
                Objects that are entirely occluded by other objects will still be rendered. There
                are a number of techniques for detecting whether objects obstruct the view of other
                objects. Portals, BSPs, and a variety of other techniques involve preprocessing
                certain static terrain to determine visibility sets. Therefore it can be known that,
                when the camera is in a certain region of the world, objects in certain other
                regions cannot be visible even if they are within the view frustum.</para>
            <para>A more fine-grained solution involves using a hardware feature called occlusion
                queries. This is a way to render an object and then ask how many fragments of that
                object were actually rasterized. If even one fragment passed the depth test
                (assuming all possible occluding surfaces have been rendered), then the object is
                visible and must be rendered.</para>
            <para>It is generally preferred to render simple test objects, such that if any part of
                the test object is visible, then the real object will be visible. Drawing a test
                object is much faster than drawing a complex hierarchical model with specialized
                skinning vertex shaders. Write masks (set with <function>glColorMask</function> and
                    <function>glDepthMask</function>) are used to prevent writing the fragment
                shader outputs of the test object to the framebuffer. Thus, the test object is only
                tested against the depth buffer, not actually rendered.</para>
            <para>Occlusion queries in OpenGL are objects that have state. They are created with the
                    <function>glGenQueries</function> function. To start rendering a test object for
                occlusion queries, the object generated from <function>glGenQueries</function> is
                passed to the <function>glBeginQuery</function> function, along with the mode of
                    <literal>GL_SAMPLES_PASSED</literal>. All rendering commands between
                    <function>glBeginQuery</function> and the corresponding
                    <function>glEndQuery</function> are part of the test object. If all of the
                fragments of the object were discarded (via depth buffer or something else), then
                the query failed. If even one fragment was rendered, then it passed.</para>
            <para>This can be used with a concept called conditional rendering. This is exactly what
                it says: rendering an object conditionally. It allows a series of rendering
                commands, bracketed by
                    <function>glBeginConditionalRender</function>/<function>glEndConditionalRender</function>
                functions, to cause the execution of those rendering commands to happen or not
                happen based on the status of an occlusion query object. If the occlusion query
                passed, then the rendering commands will be executed. If it did not, then they will
                not be.</para>
            <para>Of course, conditional rendering can cause pipeline stalls; OpenGL still requires
                that operations execute in-order, even conditional ones. So all later operations
                will be held up if a conditional render is waiting for its occlusion query to
                finish. To avoid this, you can specify <literal>GL_QUERY_NO_WAIT</literal> when
                beginning the conditional render. This will cause OpenGL to render if the query has
                not completed before this conditional render is ready to be rendered. To gain the
                maximum benefit from this, it is best to render the conditional objects well after
                the test objects they are conditioned on.</para>
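            <para>As a sketch of how these pieces fit together, the following uses only the OpenGL
                calls described above; <function>DrawBoundingBox</function> and
                    <function>DrawRealObject</function> are hypothetical stand-ins for your own
                rendering code:</para>
            <programlisting language="cpp">GLuint query;
glGenQueries(1, &amp;query);

//Render the cheap test object with all writes masked off, so it is
//only tested against the depth buffer.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);
glBeginQuery(GL_SAMPLES_PASSED, query);
DrawBoundingBox();      //Hypothetical: draws the simple test volume.
glEndQuery(GL_SAMPLES_PASSED);
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

//...render other objects here, giving the query time to complete...

//Draw the real object only if some sample of the test object passed.
glBeginConditionalRender(query, GL_QUERY_NO_WAIT);
DrawRealObject();       //Hypothetical: draws the expensive object.
glEndConditionalRender();</programlisting>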
        </section>
        <section>
            <title>Model LOD</title>
            <para>When a model is far away, it does not need to look as detailed, since most of the
                details will be lost due to lack of resolution. Therefore, one can substitute less
                detailed models for more detailed ones. This is commonly referred to as Level of
                Detail (<acronym>LOD</acronym>).</para>
            <para>Of course in modern rendering, detail means more than just the number of polygons
                in a mesh. It can often mean what shader to use, what textures to use with it, etc.
                So while meshes will often have LODs, so will shaders. Textures have their own
                built-in LODing mechanism in mip-mapping. But it is often the case that low-LOD
                shaders (those used from far away) do not need as many textures as the closer LOD
                shaders. You might be able to get away with per-vertex lighting for distant models,
                while you need per-fragment lighting for those close up.</para>
            <para>The problem with this visually is how to deal with the transitions between LOD
                levels. If you change them too close to the camera, then the user will notice a pop.
                If you do them too far away, you lose much of the performance gain from rendering a
                low-detail mesh far away. Finding a good middle-ground is key.</para>
        </section>
        <section>
            <title>State Changes</title>
            <para>OpenGL has three kinds of functions: those that actually do rendering, those that
                retrieve information from OpenGL, and those that modify some information stored in
                OpenGL. The vast majority of OpenGL functions are the latter. OpenGL's information
                is generally called <quote>state,</quote> and needlessly changing state can be
                expensive.</para>
            <para>Therefore, this optimization rule is to, as best as possible, minimize the number
                of state changes. For simple scenes, this can be trivial. But in a complicated,
                data-driven environment, this can be exceedingly complex.</para>
            <para>The general idea is to gather up a list of all objects that need to be rendered
                (after culling non-visible objects and performing any LOD work), then sort them
                based on their shared state. Objects that use the same program share program state,
                for example. By doing this, if you render the objects in state order, you will
                minimize the number of changes to OpenGL state.</para>
            <para>The three most important pieces of state to sort by are the ones that change most
                frequently: programs (and their associated uniforms), textures, and VAO state.
                Global state, such as face culling, blending, etc., is less of a concern, because it
                does not change as often. Generally, all meshes use the same culling parameters,
                viewport settings, depth comparison state, and so forth.</para>
            <para>Minimizing vertex array state changes generally requires more than just sorting;
                it requires changing how mesh data is stored. This book usually gives every mesh its
                own VAO, which represents its own separate state. This is certainly very convenient,
                but it can work against performance if the CPU is a bottleneck.</para>
            <para>To avoid this, try to group meshes that have the same vertex data formats in the
                same buffer objects and VAOs. This makes it possible to render several objects, with
                several different <function>glDraw*</function> commands, all using the same VAO
                state. <function>glDrawElementsBaseVertex</function> is very useful for this purpose
                when rendering with indexed data. The fewer VAO binds, the better.</para>
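            <para>For example, two meshes that share a VAO and buffer objects could be drawn
                back-to-back without any rebinding. The counts and offsets here are purely
                illustrative:</para>
            <programlisting language="cpp">glBindVertexArray(sharedVAO); //One VAO for every mesh in this group.

//First mesh: indices begin at byte offset 0, vertices at base vertex 0.
glDrawElementsBaseVertex(GL_TRIANGLES, firstIndexCount, GL_UNSIGNED_SHORT,
    (void*)0, 0);

//Second mesh: its indices and vertices live further along in the same buffers.
glDrawElementsBaseVertex(GL_TRIANGLES, secondIndexCount, GL_UNSIGNED_SHORT,
    (void*)(firstIndexCount * sizeof(GLushort)), firstVertexCount);</programlisting>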
            <para>There is less information on how harmful uniform state changes are to performance,
                or the performance difference between changing in-program uniforms and buffer-based
                uniforms.</para>
            <para>Be advised that state sorting cannot help when dealing with blending, because
                correct blending requires rendering in back-to-front depth order. Thus, it is best
                to avoid blending where possible.</para>
            <para>There are also certain tricky states that can hurt, depending on hardware. For
                example, it is best to avoid changing the direction of the depth test once you have
                cleared the depth buffer and started rendering to it. This is for reasons having to
                do with specific hardware optimizations of depth buffering.</para>
        </section>
    </section>
    <section>
        <title>Finding the Bottleneck</title>
        <para>The absolute best tool to have in your repertoire for optimizing your rendering is
            finding out why your rendering is slow.</para>
        <para>GPUs are designed as a pipeline. Each stage in the pipeline is functionally
            independent from the others. A vertex shader can be computing some number of vertices,
            while the clipping and rasterization are working on other triangles, while the fragment
            shader is working on fragments generated by other triangles.</para>
        <para>However, a vertex generated by the vertex shader cannot pass to the rasterizer if the
            rasterizer is busy. Similarly, the rasterizer cannot generate more fragments if all of
            the fragment shaders are in use. Therefore, the overall performance of the GPU can only
            be the performance of the slowest step in the pipeline.</para>
        <para>This means that, in order to actually make the GPU faster, you must find the
            particular stage of the pipeline that is the slowest. This stage is referred to as the
                <glossterm>bottleneck</glossterm>. Until you know what the bottleneck is, the most
            you can do is guess at why things are slower than you expect. And making major code
            changes based purely on a guess is rarely wise. At least, not until you have a lot of
            experience with the GPU(s) in question.</para>
        <para>It should also be noted that bottlenecks are not consistent throughout the rendering
            of a single frame. Some parts of it can be CPU bound, others can be fragment shader
            bound, etc. Thus, attempt to find particular sections of rendering that likely have the
            same problem before trying to find the bottleneck.</para>
        <section>
            <title>Measuring Performance</title>
            <para>The most common statistic you see when people talk about performance is frames
                per second (<acronym>FPS</acronym>). While this is useful when talking to the lay
                person, a graphics programmer does not use FPS as their standard performance metric.
                FPS is the overall goal, but when measuring the actual performance of a piece of
                rendering code, the more useful metric is simply time. This is usually measured in
                milliseconds (ms).</para>
            <para>If you are attempting to maintain 60fps, that translates to having 16.67
                milliseconds to spend performing all rendering tasks.</para>
            <para>One thing that confounds performance metrics is the fact that the GPU is both
                pipelined and asynchronous. When running regular code, if you call a function,
                you're usually assured that the actions the function took have all completed when it
                returns. When you issue a rendering call (any <function>glDraw*</function>
                function), not only is it likely that rendering has not completed by the time it has
                returned, it is very likely that rendering has not even
                <emphasis>started</emphasis>. Not even doing a buffer swap will ensure that the GPU
                has finished, as GPUs can wait to actually perform the buffer swap until
                later.</para>
            <para>If you specifically want to time the GPU, then you must force the GPU to finish
                its work. To do that in OpenGL, you call a function cleverly titled
                    <function>glFinish</function>. It will return sometime after the GPU finishes.
                Note that it does not guarantee that it returns immediately after, only at some
                point after the GPU has finished all of its commands. So it is a good idea to give
                the GPU a healthy workload before calling finish, to minimize the difference between
                the time you measure and the time the GPU actually took.</para>
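            <para>A minimal sketch of such a measurement, assuming a hypothetical high-resolution
                timer <function>GetTimeInMs</function> and a <function>RenderScene</function>
                function that issues the commands being measured:</para>
            <programlisting language="cpp">double startTime = GetTimeInMs(); //Hypothetical platform timer.

RenderScene(); //Hypothetical: issues all of the rendering commands to measure.
glFinish();    //Blocks until the GPU has finished every queued command.

double gpuTimeMs = GetTimeInMs() - startTime;</programlisting>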
            <para>You will also want to turn vertical synchronization, or vsync, off. There is a
                certain point during which a graphics chip is able to swap the front and back
                framebuffers with a guarantee of not causing half of the displayed image to be from
                one buffer and half from another. The latter eventuality is called
                    <glossterm>tearing</glossterm>, and having vsync enabled avoids that. However,
                you do not care about tearing; you want to know about performance. So you need to
                turn off any form of vsync.</para>
            <para>Vsync is controlled by the window-system specific extensions
                    <literal>GLX_EXT_swap_control</literal> and
                    <literal>WGL_EXT_swap_control</literal>. They both do the same thing and have
                similar APIs. The <function>glXSwapIntervalEXT</function> and
                    <function>wglSwapIntervalEXT</function> functions take an integer that tells how
                many vsyncs to wait between swaps. If you pass 0, then the swap will happen
                immediately.</para>
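            <para>On Windows, for example, turning vsync off looks like this (after loading the
                extension function pointer):</para>
            <programlisting language="cpp">wglSwapIntervalEXT(0); //0 means swap immediately, without waiting for vsync.</programlisting>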
        </section>
        <section>
            <title>Possible Bottlenecks</title>
            <para>There are several potential bottlenecks that a section of rendering code can have.
                We will list them along with ways of determining whether each is the bottleneck. You
                should test these in the order presented below.</para>
            <section>
                <title>Fragment Processing</title>
                <para>This is probably the easiest to find. The quantity of fragment processing you
                    have depends entirely on the number of fragments the various triangles are
                    rasterized to. Therefore, simply increase the resolution. If you increase the
                    resolution by 2x the number of pixels (double either the width or height), and
                    the time to render doubles, then you are fragment processing bound.</para>
                <para>Note that rendering time will go up when you increase the resolution. What you
                    are interested in is whether it goes up linearly with the number of fragments
                    rendered. If the rendering time only goes up by 1.2x with a 2x increase in
                    number of fragments, then the code was not entirely fragment processing
                    bound.</para>
            </section>
            <section>
                <title>Vertex Processing</title>
                <para>If you are not fragment processing bound, then there's a good chance you are
                    vertex processing bound. After ruling out fragment processing, simply turn off
                    all fragment processing. If this does not increase your performance
                    significantly (there will generally be some change), then you were vertex
                    processing bound.</para>
                <para>To turn off fragment processing, simply call
                        <function>glEnable</function>(<literal>GL_RASTERIZER_DISCARD</literal>).
                    This will cause all fragments to be discarded. Obviously, nothing will be
                    rendered, but all of the steps before rasterization will still be executed.
                    Therefore, your performance timings will be for vertex processing alone.</para>
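                <para>A sketch of the measurement, with <function>RenderScene</function> again
                    standing in for the code under test:</para>
                <programlisting language="cpp">glEnable(GL_RASTERIZER_DISCARD);
RenderScene(); //All vertex processing runs, but every fragment is discarded.
glFinish();
glDisable(GL_RASTERIZER_DISCARD);</programlisting>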
            </section>
            <section>
                <title>CPU</title>
                <para>A CPU bottleneck means that the GPU is being starved; it is consuming data
                    faster than the CPU is providing it. You do not really test for CPU bottlenecks
                    per se; they are discovered by process of elimination. If nothing else is
                    bottlenecking the GPU, then the CPU clearly is not giving it enough work to
                    do.</para>
            </section>
        </section>
        <section>
            <title>Unfixable Bottlenecks</title>
            <para>It is entirely possible that you cannot fix a bottleneck. Maybe there's simply no
                way to avoid a vertex-processing heavy section of your renderer. Perhaps you need
                all of that fragment processing in a certain area of rendering.</para>
            <para>If there is some bottleneck that cannot be optimized away, then turn it to your
                advantage by increasing the complexity of the other stages in the pipeline. If you
                have an unfixable CPU bottleneck, then render more detailed models. If you have a
                vertex-shader bottleneck, improve your lighting by adding some fragment-shader
                complexity. And so forth. Just make sure that you do not increase complexity to the
                point where you move the bottleneck and make things slower.</para>
        </section>
    </section>
    <section>
        <?dbhtml filename="Optimize Vertex Format.html" ?>
        <title>Vertex Format</title>
        <para>Vertex attributes stored in buffer objects can be of a surprisingly large number of
            formats. These tutorials generally used 32-bit floating-point data, but that is far from
            the best case.</para>
        <para>The <glossterm>vertex format</glossterm> specifically refers to the set of values
            given to the <function>glVertexAttribPointer</function> calls that describe how each
            attribute is aligned in the buffer object.</para>
        <section>
            <title>Attribute Formats</title>
            <para>Each attribute should take up as little room as possible. This is for performance
                reasons, but it also saves memory. For buffer objects, these are usually one and the
                same. The less data you have stored in memory, the faster it gets to the vertex
                shader.</para>
            <para>Attributes can be stored in normalized integer formats, just like textures. This
                is most useful for colors and texture coordinates. For example, to have an attribute
                that is stored in 4 unsigned normalized bytes, you can use this:</para>
            <programlisting language="cpp">glVertexAttribPointer(index, 4, GLubyte, GLtrue, 0, offset);</programlisting>
            <para>If you want to store a normal as a normalized signed short, you can use
                this:</para>
            <programlisting language="cpp">glVertexAttribPointer(index, 3, GLushort, GLtrue, 0, offset);</programlisting>
            <para>There are also a few specialized formats. <literal>GL_HALF_FLOAT</literal> can be
                used for 16-bit floating-point types. This is useful for when you need values
                outside of [-1, 1], but do not need the full precision or range of a 32-bit
                float.</para>
            <para>Non-normalized integers can be used as well. These map in GLSL directly to
                floating-point values, so a non-normalized value of 16 maps to a GLSL value of
                16.0.</para>
            <para>The best thing about all of these formats is that they cost
                    <emphasis>nothing</emphasis> in performance to use. They are all silently
                converted into floating-point values for consumption by the vertex shader, with no
                performance lost.</para>
        </section>
        <section>
            <title>Interleaved Attributes</title>
            <para>Attributes do not all have to come from the same buffer object; multiple
                attributes can come from multiple buffers. However, where possible, this should be
                avoided. Furthermore, attributes in the same buffer should be interleaved with one
                another whenever possible.</para>
            <para>Consider an array of structs in C++:</para>
            <programlisting language="cpp">struct Vertex
{
  float position[3];
  GLubyte color[4];
  GLushort texCoord[2];
};

Vertex vertArray[20];</programlisting>
            <para>The byte offset of <varname>color</varname> in the <type>Vertex</type> struct is
                12. That is, from the beginning of the <type>Vertex</type> struct, the
                    <varname>color</varname> variable starts 12 bytes in. The
                    <varname>texCoord</varname> variable starts 16 bytes in.</para>
            <para>If we did a memcpy between <varname>vertArray</varname> and a buffer object, and
                we wanted to set the attributes to pull from this data, we could do so using the
                stride and offsets to position things properly.</para>
            <programlisting language="cpp">glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 20, 0);
glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE, 20, 12);
glVertexAttribPointer(2, 2, GL_UNSIGNED_SHORT, GL_TRUE, 20, 16);</programlisting>
            <para>The fifth argument is the stride. The stride is the number of bytes from the
                beginning of one instance of this attribute to the beginning of another. The stride
                here is set to <literal>sizeof</literal>(<type>Vertex</type>). C++ defines that the
                size of a struct represents the byte offset between separate instances of that
                struct in an array. So that is our stride.</para>
            <para>The offsets represent where in the buffer object the first element is. These match
                the offsets in the struct. If we had loaded this data to a location past the front
                of our buffer object, we would need to offset these values by the byte offset at
                which we uploaded our data.</para>
            <para>There are certain gotchas when deciding how data gets packed like this. First, it
                is a good idea to keep every attribute on a 4-byte alignment. This may mean
                introducing explicit padding (empty space) into your structures. Some hardware will
                have massive slowdowns if things are not aligned to four bytes.</para>
            <para>Next, it is a good idea to keep the size of any interleaved vertex data restricted
                to a multiple of 32 bytes. Violating this is not as bad as violating the
                4-byte alignment rule, but one can sometimes get sub-optimal performance if the
                total size of interleaved vertex data is, for example, 48 bytes. Or 20 bytes, as in
                our example.</para>
        </section>
        <section>
            <title>Packing Suggestions</title>
            <para>If the smallest vertex data size is what you need, consider these packing
                techniques.</para>
            <para>Colors generally do not need to be more than 3-4 bytes in size. One byte per
                component.</para>
            <para>Texture coordinates, particularly those clamped to the [0, 1] range, almost never
                need more than 16-bit precision. So use unsigned shorts.</para>
            <para>Normals should be stored in the signed 2_10_10_10 format whenever possible.
                Normals generally do not need that much precision, especially since you're going to
                normalize them anyway. This format was specifically devised for normals, so use
                it.</para>
            <para>Positions are the trickiest to work with, because the needs vary so much. If you
                are willing to modify your vertex shaders and put some work into it, you can often
                use 16-bit signed normalized shorts.</para>
            <para>The key to this is a special scale/translation matrix. When you are preparing your
                data, in an offline tool, you take the floating-point positions of a model and
                determine the model's maximum extents in all three axes. This forms a bounding box
                around the model. The center of the box is the center of your new model, and you
                apply a translation to move the points to this center. Then you apply a non-uniform
                scale to transform the points from their extent range to the [-1, 1] range of signed
                normalized values. You save the offset and the scales you used as part of your mesh
                data (not to be stored in the buffer object).</para>
            <para>When it comes time to render the model, you simply reverse the transformation. You
                build a scale/translation matrix that undoes what was done to get them into the
                signed-normalized range. Note that this matrix should not be applied to the normals,
                because the normals were not compressed this way. A full matrix multiply is
                overkill for this transformation; a scale+translation can be done with a simple
                vector multiply and add.</para>
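            <para>In the vertex shader, that multiply and add is a single line. The uniform and
                input names in this sketch are assumptions, not names used elsewhere in this
                book:</para>
            <programlisting language="glsl">uniform vec3 modelCenter;       //The saved translation offset.
uniform vec3 modelHalfSize;     //The saved non-uniform scale.
uniform mat4 modelToClipMatrix; //The usual transform chain.

in vec3 position; //Signed normalized shorts, converted to [-1, 1] floats by OpenGL.

void main()
{
    //Undo the compression with one multiply and add, then transform as usual.
    vec3 decompressed = position * modelHalfSize + modelCenter;
    gl_Position = modelToClipMatrix * vec4(decompressed, 1.0);
}</programlisting>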
        </section>
    </section>
    <section>
        <?dbhtml filename="Optimize Vertex Cache.html" ?>
        <title>Vertex Caching</title>
        <para/>
    </section>
    <section>
        <?dbhtml filename="Optimize Shaders.html" ?>
        <title>Shaders and Performance</title>
        <para/>
    </section>
    <section>
        <?dbhtml filename="Optimize Sync.html" ?>
        <title>Synchronization</title>
        <para>GPUs gain quite a bit of their performance because, by and large, once you tell them
            what to do, they go do their stuff without further intervention. As the programmer, you
            do not care that a frame has not yet completed. All you are interested in is that the
            user sees the frame when it is ready.</para>
        <para>There are certain things that the user can do which will cause this perfect
            asynchronous activity to come to a screeching halt. These are called synchronization
            events.</para>
        <para>OpenGL is defined to allow asynchronous behavior; commands that you give do not have
            to be completed when the function ends (for the most part). However, OpenGL defines this
            by saying that if there is asynchronous behavior, the user <emphasis>cannot</emphasis>
            be made aware of it. That is, if you call <function>glDrawArrays</function>, the effect
            of this command should be based solely on the current state of OpenGL. This means that,
            if the <function>glDrawArrays</function> command is executed later, the OpenGL
            implementation must do whatever it takes to prevent later changes from impacting the
            results.</para>
        <para>Therefore, if you make a <function>glDrawArrays</function> call that pulls from some
            buffer object, and then immediately call <function>glBufferSubData</function> on that
            buffer object, the OpenGL implementation may have to pause the CPU in the
                <function>glBufferSubData</function> call until the
                <function>glDrawArrays</function> has at least finished vertex processing. However,
            the implementation may also simply copy the data you are trying to transfer into some
            memory it allocates, to be uploaded to the buffer once the
                <function>glDrawArrays</function> completes. There is no way to be sure which will
            happen.</para>
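        <para>As a concrete sketch of that scenario, assuming <varname>bufferObject</varname>
            feeds the attributes of the currently bound VAO:</para>
        <programlisting language="cpp">glDrawArrays(GL_TRIANGLES, 0, vertexCount); //Queued; may not have executed yet.

//The implementation must keep this update from affecting the draw above. It
//may stall here until the draw no longer needs the old data, or it may copy
//newData aside and perform the upload later. There is no way to be sure which.
glBindBuffer(GL_ARRAY_BUFFER, bufferObject);
glBufferSubData(GL_ARRAY_BUFFER, 0, dataSize, newData);</programlisting>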
        <para>Synchronization events usually include changing data objects. That is, changing the
            contents of buffer objects or textures. Usually, changing simple state of objects, like
            what attributes a VAO provides or texture parameters, does not cause synchronization
            issues. Changing global OpenGL state also does not cause synchronization
            problems.</para>
        <para>There are ways to allow you to modify data objects that still let the GPU be
            asynchronous. But any discussion of these is well beyond the bounds of this book. Just
            be aware that data objects that are in active use should probably not have their data
            modified.</para>
    </section>
</appendix>