<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng" type="xml"?>
<?oxygen SCHSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng"?>
<appendix xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
    <?dbhtml filename="Basic Optimization.html" ?>
    <title>Basic Optimization</title>
    <para>Optimization is far too large a subject to cover adequately in a mere appendix.
        Optimizations tend to be specific to particular algorithms, and they usually involve
        tradeoffs with memory: one can often make something run faster by taking up more memory.
        Even then, optimizations should only be made when proper profiling has determined where
        performance is actually lacking.</para>
    <para>This appendix will instead cover the most basic optimizations. These are not guaranteed to
        improve performance in any particular program, but they almost never hurt. They are also
        things you can implement relatively easily. Think of these as the default standard practice
        you should start with before performing real optimizations. For the sake of clarity, most of
        the code in this book did not use these practices, so many of them will be new.</para>
    <section>
        <title>Vertex Format</title>
        <para>Interleave vertex arrays for objects where possible. Obviously, if you need to
            overwrite some vertex data frequently while other data remains static, then you will
            need to separate that data. But unless you have some specific need to do so, interleave
            your vertex data.</para>
        <para>Equally importantly, use the smallest vertex data possible. In the tutorials, the
            vertex data was almost always 32-bit floats. You should only use 32-bit floats when you
            absolutely need that much precision.</para>
        <para>The biggest key to this is the use of normalized integer values for attributes. Here
            is the definition of <function>glVertexAttribPointer</function>:</para>
        <funcsynopsis>
            <funcprototype>
                <funcdef>void <function>glVertexAttribPointer</function></funcdef>
                <paramdef>GLuint <parameter>index</parameter></paramdef>
                <paramdef>GLint <parameter>size</parameter></paramdef>
                <paramdef>GLenum <parameter>type</parameter></paramdef>
                <paramdef>GLboolean <parameter>normalized</parameter></paramdef>
                <paramdef>GLsizei <parameter>stride</parameter></paramdef>
                <paramdef>GLvoid *<parameter>pointer</parameter></paramdef>
            </funcprototype>
        </funcsynopsis>
        <para>If <varname>type</varname> is an integer type, like
                <literal>GL_UNSIGNED_BYTE</literal>, then setting <varname>normalized</varname> to
                <literal>GL_TRUE</literal> will mean that OpenGL interprets the integer value as
            normalized. It will automatically convert the integer 255 to 1.0, and so forth. If the
            normalization flag is false instead, then it will convert the integers directly to
            floats: 255 becomes 255.0, etc. Signed values can be normalized as well;
                <literal>GL_BYTE</literal> with normalization will map 127 to 1.0, -128 to -1.0,
            etc.</para>
        <formalpara>
            <title>Colors</title>
            <para>Color values are commonly stored as 4 unsigned normalized bytes. This is far
                smaller than using 4 32-bit floats, but the loss of precision is almost always
                negligible. To send 4 unsigned normalized bytes, use:</para>
        </formalpara>
        <programlisting language="cpp">glVertexAttribPointer(#, 4, GL_UNSIGNED_BYTE, GL_TRUE, ...);</programlisting>
        <para>The best part is that all of this is free; it costs no actual performance. Note
            however that 32-bit integers cannot be normalized.</para>
        <para>Sometimes, color values need higher precision than 8 bits, but less than 16 bits. If a
            color is a linear RGB color, it is often desirable to give it greater than 8-bit
            precision. If the alpha of the color is negligible or non-existent, then a special
                <varname>type</varname> can be used. This type is
                <literal>GL_UNSIGNED_INT_2_10_10_10_REV</literal>. It takes 32-bit unsigned
            normalized integers and pulls the four components of the attribute out of each integer.
            This type can only be used with normalization:</para>
        <programlisting language="cpp">glVertexAttribPointer(#, 4, GL_UNSIGNED_INT_2_10_10_10_REV, GL_TRUE, ...);</programlisting>
        <para>The most significant 2 bits of each integer are the alpha. The next 10 bits are the
            blue, then green, and finally red. It is equivalent to this struct in C, assuming the
            compiler packs bit-fields starting at the least significant bit (as most little-endian
            compilers do):</para>
        <programlisting language="cpp">struct RGB10_A2
{
  unsigned int red      : 10;
  unsigned int green    : 10;
  unsigned int blue     : 10;
  unsigned int alpha    : 2;
};</programlisting>
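        <para>For illustration, here is one way such a value could be packed on the CPU. This is a
            sketch that assumes the input floats are already clamped to [0, 1]; the helper name is
            hypothetical:</para>
        <programlisting language="cpp">// Hypothetical helper: packs four [0, 1] floats into the
// GL_UNSIGNED_INT_2_10_10_10_REV bit layout described above.
GLuint PackUnorm2_10_10_10(float r, float g, float b, float a)
{
  GLuint red   = (GLuint)(r * 1023.0f + 0.5f);  // 10 bits: 0-1023
  GLuint green = (GLuint)(g * 1023.0f + 0.5f);
  GLuint blue  = (GLuint)(b * 1023.0f + 0.5f);
  GLuint alpha = (GLuint)(a * 3.0f + 0.5f);     // 2 bits: 0-3
  return (alpha &lt;&lt; 30) | (blue &lt;&lt; 20) | (green &lt;&lt; 10) | red;
}</programlisting>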
        <formalpara>
            <title>Normals</title>
            <para>Another attribute where precision isn't of paramount importance is normals. If the
                normals are normalized, and they always should be, the coordinates are always going
                to be on the [-1, 1] range. So signed normalized integers are appropriate here.
                8-bits of precision are sometimes enough, but 10-bit precision is going to be an
                improvement. 16-bit precision, <literal>GL_SHORT</literal>, may be overkill, so
                stick with <literal>GL_INT_2_10_10_10_REV</literal>. Because this format provides 4
                values, you will still need to use 4 as the size of the attribute, but you can still
                use <type>vec3</type> in the shader as the normal's input variable.</para>
        </formalpara>
        <formalpara>
            <title>Texture Coordinates</title>
            <para>Two-dimensional texture coordinates do not typically need 32-bits of precision. 8
                and 10-bit precision are usually not good enough, but 16-bit unsigned normalized
                integers are often sufficient. If texture coordinates range outside of [0, 1], then
                normalization will not be sufficient. In these cases, there is an alternative to
                32-bit floats: 16-bit floats.</para>
        </formalpara>
        <para>The hardest part of dealing with 16-bit floats is that C and C++ do not deal with
            them very well. Unlike virtually every other type used here, there is no native 16-bit
            float type. Even the 10-bit format can be built using bit-fields in structs, as above.
            Generating a 16-bit float from a 32-bit float requires care, as well as an
            understanding of how floating-point values work. The details of that are beyond the
            scope of this work, however.</para>
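        <para>That said, the core of the conversion can be sketched briefly. The version below is
            an illustration only: it truncates the mantissa and ignores NaNs, infinities, and
            denormals, all of which a robust converter must handle:</para>
        <programlisting language="cpp">#include &lt;cstdint&gt;
#include &lt;cstring&gt;

// Rough float -> half sketch; not a production-quality converter.
uint16_t FloatToHalf(float value)
{
  uint32_t bits;
  std::memcpy(&amp;bits, &amp;value, sizeof(bits));             // view the float's bits
  uint16_t sign     = (uint16_t)((bits >> 16) &amp; 0x8000);
  int32_t  exponent = (int32_t)((bits >> 23) &amp; 0xFF) - 127 + 15;  // rebias
  uint16_t mantissa = (uint16_t)((bits >> 13) &amp; 0x3FF);  // top 10 mantissa bits
  if (exponent &lt;= 0)  return sign;                       // underflow: signed zero
  if (exponent >= 31) return sign | 0x7C00;              // overflow: infinity
  return sign | (uint16_t)(exponent &lt;&lt; 10) | mantissa;
}</programlisting>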
        <formalpara>
            <title>Positions</title>
            <para>In general, positions are the least likely attribute to be easily optimized
                without consequence. 16-bit floats can be used, but these are restricted to a range
                of approximately [-65504, 65504]. They also lack some precision, which may be
                necessary depending on the size and detail of the object in model space.</para>
        </formalpara>
        <para>If 16-bit floats are insufficient, there are things that can be done. The process is
            as follows:</para>
        <orderedlist>
            <listitem>
                <para>When loading the mesh data, find the bounding volume of the mesh in model
                    space. To do this, find the maximum and minimum values in the X, Y and Z
                    directions independently. This represents a rectangle in model space that
                    contains all of the vertices. This rectangle is defined by two vectors: the
                    maximum vector (containing the max X, Y and Z values), and the minimum vector.
                    These are named <varname>max</varname> and <varname>min</varname>.</para>
            </listitem>
            <listitem>
                <para>Compute the center point of this region:</para>
                <programlisting language="cpp">glm::vec3 center = (max + min) / 2.0f;</programlisting>
            </listitem>
            <listitem>
                <para>Compute half of the size (width, height, depth) of the region:</para>
                <programlisting language="cpp">glm::vec3 halfSize = (max - min) / 2.0f;</programlisting>
            </listitem>
            <listitem>
                <para>For each position in the mesh, compute a normalized version by subtracting the
                    center from it, then dividing it by half the size. As follows:</para>
                <programlisting language="cpp">glm::vec3 newPosition = (position - center) / halfSize;</programlisting>
            </listitem>
            <listitem>
                <para>For each new position, convert it to a signed, normalized integer by
                    multiplying it by 32767:</para>
                <programlisting language="cpp">short normX = (short)(newPosition.x * 32767.0f);
short normY = (short)(newPosition.y * 32767.0f);
short normZ = (short)(newPosition.z * 32767.0f);</programlisting>
                <para>These three coordinates are then stored as the new position data in the buffer
                    object.</para>
            </listitem>
            <listitem>
                <para>Keep the <varname>center</varname> and <varname>halfSize</varname> variables
                    stored with your mesh data. When computing the model-space to camera-space
                    matrix for that mesh, add one final matrix to the top. This matrix will perform
                    the inverse operation from the one that we used to compute the normalized
                    values:</para>
                <programlisting language="cpp">matrixStack.Translate(center);
matrixStack.Scale(halfSize);</programlisting>
                <para>This final matrix should <emphasis>not</emphasis> be applied to the normal's
                    matrix. Compute the normal matrix <emphasis>before</emphasis> applying the final
                    step above. If you were not previously using a separate matrix for normals
                    (because you had no non-uniform scales in your model-to-camera matrix), you will
                    need to use one now. This may make your data bigger or make your shader run
                    slightly slower.</para>
            </listitem>
        </orderedlist>
        <formalpara>
            <title>Alignment</title>
            <para>One additional rule you should always follow is this: make sure that all
                attributes begin on a 4-byte boundary. This holds even for attributes that are
                smaller than 4 bytes, such as a 3-vector of 8-bit values. While OpenGL will allow
                you to use arbitrary alignments, hardware may have problems making it work. So if
                you make your position data 16-bit floats or signed normalized integers, you will
                still waste 2 bytes from every position. You may want to try making your position
                values 4-dimensional and using the last value for something useful.</para>
        </formalpara>
    </section>
    <section>
        <title>Image Formats</title>
        <para>As with vertex formats, try to use the smallest format that you can get away with.
            Also, as with vertex formats, what you can get away with tends to be defined by what you
            are trying to store in the texture.</para>
        <formalpara>
            <title>Normals</title>
            <para>Textures containing normals can use <literal>GL_RGB10_A2_SNORM</literal>, which is
                the texture equivalent to the 10-bit signed normalized format we used for attribute
                normals. However, this can be made more precise if the normals are for a
                tangent-space bump map. Since the tangent-space normals always have a positive Z
                coordinate, and since the normals are normalized, the actual Z value can be computed
                from the other two. So you only need to store 2 values;
                    <literal>GL_RG16_SNORM</literal> is sufficient for these needs. To compute the
                third value, do this:</para>
        </formalpara>
        <programlisting language="glsl">vec2 norm2d = texture(tangentBumpTex, texCoord).xy;
vec3 tanSpaceNormal = sqrt(1.0 - dot(norm2d, norm2d));</programlisting>
        <para>Obviously this costs some performance, so the added precision may not be worthwhile.
            On the plus side, you will not have to do any normalization of the tangent-space
            normal.</para>
        <para>The <literal>GL_RG16_SNORM</literal> format can be made even smaller with texture
            compression. The <literal>GL_COMPRESSED_SIGNED_RG_RGTC2</literal> compressed texture
            format is a 2-channel signed normalized format. It takes up only 8 bits per
            pixel.</para>
        <formalpara>
            <title>Floating-point Intensity</title>
            <para>There are two unorthodox formats for floating-point textures, both of which have
                important uses. The <literal>GL_R11F_G11F_B10F</literal> format is potentially a
                good format to use for HDR render targets. As the name suggests, it takes up only
                32-bits. The downside is the relative loss of precision compared to
                    <literal>GL_RGB16F</literal>. They can store approximately the same magnitude of
                values, but the smaller format loses some precision. This may or may not impact the
                overall visual quality of the scene. It should be fairly simple to test to see which
                is better.</para>
        </formalpara>
        <para>The <literal>GL_RGB9_E5</literal> format is used for input floating-point textures. If
            you have a texture that represents light intensity in HDR situations, this format can be
            quite handy. The way it works is that each of the RGB colors gets 9 bits for its
            value, but they all share the same exponent. This has to do with how floating-point
            numbers work, but what it boils down to is that the values have to be relatively close
            to one another in magnitude. They do not have to be that close; there's still some
            leeway. Values that are too small relative to larger ones become zero. This is
            oftentimes an acceptable tradeoff, depending on the particular magnitude in
            question.</para>
        <para>This format is useful for textures that are generated offline by tools. You cannot
            render to a texture in this format.</para>
        <formalpara>
            <title>Colors</title>
            <para>Storing colors that are clamped to [0, 1] can be done with good precision with
                    <literal>GL_RGBA8</literal> or <literal>GL_SRGB8_ALPHA8</literal> as needed.
                However, compressed texture formats are available. The S3TC formats are good choices
                if the compression works reasonably well for the texture. There are sRGB versions of
                the S3TC formats as well.</para>
        </formalpara>
        <para>The difference among the various S3TC formats is how much alpha you need. The choices
            are as follows:</para>
        <glosslist>
            <glossentry>
                <glossterm>GL_COMPRESSED_RGB_S3TC_DXT1_EXT</glossterm>
                <glossdef>
                    <para>No alpha.</para>
                </glossdef>
            </glossentry>
            <glossentry>
                <glossterm>GL_COMPRESSED_RGBA_S3TC_DXT1_EXT</glossterm>
                <glossdef>
                    <para>Binary alpha. Either zero or one for each texel. The RGB color for any
                        alpha of zero will also be zero.</para>
                </glossdef>
            </glossentry>
            <glossentry>
                <glossterm>GL_COMPRESSED_RGBA_S3TC_DXT3_EXT</glossterm>
                <glossdef>
                    <para>4 bits of alpha per pixel.</para>
                </glossdef>
            </glossentry>
            <glossentry>
                <glossterm>GL_COMPRESSED_RGBA_S3TC_DXT5_EXT</glossterm>
                <glossdef>
                    <para>Alpha is compressed in its own S3TC block, much like a single channel of
                        RG texture compression.</para>
                </glossdef>
            </glossentry>
        </glosslist>
        <para>If a variable alpha matters for a texture, the primary difference will be between DXT3
            and DXT5. DXT5 has the potential for better results, but if the alpha does not compress
            well with the S3TC algorithm, the results will be rather worse.</para>
    </section>
    <section>
        <title>Textures</title>
        <para>Mipmapping improves performance when textures are mapped to regions that are larger in
            texel space than in window space. That is, when texture minification happens. Mipmapping
            improves performance because it keeps the locality of texture accesses near each other.
            Texture hardware is optimized for accessing regions of textures, so improving locality
            of texture data will help performance.</para>
        <para>How much this matters depends on how the texture is mapped to the surface. Static
            mapping with explicit texture coordinates, or with linear computation based on surface
            properties, can use mipmapping to improve locality of texture access. For more unusual
            mappings or for pure-lookup tables, mipmapping may not help locality at all.</para>
        <para/>
    </section>
    <section>
        <title>Finding the Bottleneck</title>
        <para>The absolute best tool to have in your repertoire for optimizing your rendering is
            finding out why your rendering is slow.</para>
        <para>GPUs are designed as a pipeline. Each stage in the pipeline is functionally
            independent from the others. A vertex shader can be computing some number of vertices,
            while the clipping and rasterization are working on other triangles, while the fragment
            shader is working on fragments generated by other triangles.</para>
        <para>However, a vertex generated by the vertex shader cannot pass to the rasterizer if the
            rasterizer is busy. Similarly, the rasterizer cannot generate more fragments if all of
            the fragment shaders are in use. Therefore, the overall performance of the GPU can only
            be the performance of the slowest step in the pipeline.</para>
        <para>This means that, in order to actually make the GPU faster, you must find the
            particular stage of the pipeline that is the slowest. This stage is referred to as the
                <glossterm>bottleneck</glossterm>. Until you know what the bottleneck is, the most
            you can do is guess at why things are slower than you expect. And making major code
            changes based purely on a guess is unwise, at least until you have a lot of experience
            with the GPU(s) in question.</para>
        <para>It should also be noted that bottlenecks are not consistent throughout the rendering
            of a single frame. Some parts of it can be CPU bound, others can be fragment shader
            bound, etc. Thus, attempt to find particular sections of rendering that likely have the
            same problem before trying to find the bottleneck.</para>
        <section>
            <title>Measuring Performance</title>
            <para>The most common performance statistic people cite when talking about
                performance is frames per second (<acronym>FPS</acronym>). While this is useful when
                talking to the lay person, a graphics programmer does not use FPS as their standard
                performance metric. It is the overall goal, but when measuring the actual
                performance of a piece of rendering code, the more useful metric is simply time.
                This is usually measured in milliseconds (ms).</para>
            <para>If you are attempting to maintain 60fps, that translates to having 16.67
                milliseconds to spend performing all rendering tasks.</para>
            <para>One thing that confounds performance metrics is the fact that the GPU is both
                pipelined and asynchronous. When running regular code, if you call a function,
                you're usually assured that the action the function took has completed when it
                returns. When you issue a rendering call (any <function>glDraw*</function>
                function), not only is it likely that rendering has not completed by the time it has
                returned, it is very possible that rendering has not even
                    <emphasis>started</emphasis>. Not even doing a buffer swap will ensure that the
                GPU has finished, as GPUs can wait to actually perform the buffer swap until
                later.</para>
            <para>If you specifically want to time the GPU, then you must force the GPU to finish
                its work. To do that in OpenGL, you call a function cleverly titled
                    <function>glFinish</function>. It will return sometime after the GPU finishes.
                Note that it does not guarantee that it returns immediately after, only at some
                point after the GPU has finished all of its commands. So it is a good idea to give
                the GPU a healthy workload before calling finish, to minimize the difference between
                the time you measure and the time the GPU actually took.</para>
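            <para>A minimal timing harness might look like the following sketch, where
                    <function>RenderScene</function> is a hypothetical stand-in for the rendering
                code being measured:</para>
            <programlisting language="cpp">#include &lt;chrono&gt;

// Measurement only: glFinish stalls the CPU, so never do this in shipping code.
auto start = std::chrono::high_resolution_clock::now();
RenderScene();   // issue the rendering commands under test
glFinish();      // block until the GPU has executed all of them
auto end = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration&lt;double, std::milli&gt;(end - start).count();</programlisting>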
            <para>You will also want to turn vertical synchronization, or vsync, off. There is a
                certain point during which a graphics chip is able to swap the front and back
                framebuffers with a guarantee of not causing half of the displayed image to be from
                one buffer and half from another. The latter eventuality is called
                    <glossterm>tearing</glossterm>, and having vsync enabled avoids that. However,
                you do not care about tearing; you want to know about performance. So you need to
                turn off any form of vsync.</para>
            <para>Vsync is controlled by the window-system specific extensions
                    <literal>GLX_EXT_swap_control</literal> and
                    <literal>WGL_EXT_swap_control</literal>. They both do the same thing and have
                similar APIs. The <function>wglSwapIntervalEXT</function> and
                    <function>glXSwapIntervalEXT</function> functions take an integer
                that tells how many vsyncs to wait between swaps. If you pass 0, then it will swap
                immediately.</para>
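            <para>For example, on Windows, once the extension's function pointer has been loaded,
                turning vsync off is a single call (the GLX version is analogous):</para>
            <programlisting language="cpp">// Assumes WGL_EXT_swap_control is supported and wglSwapIntervalEXT was loaded
// through an extension loader. An interval of 0 means swap immediately.
wglSwapIntervalEXT(0);</programlisting>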
        </section>
        <section>
            <title>Possible Bottlenecks</title>
            <para>There are several potential bottlenecks that a section of rendering code can have.
                We will list those and the ways of determining if it is the bottleneck. You should
                test these in the order presented below.</para>
            <section>
                <title>Fragment Processing</title>
                <para>This is probably the easiest to find. The quantity of fragment processing you
                    have depends entirely on the number of fragments the various triangles are
                    rasterized to. Therefore, simply increase the resolution. If you increase the
                    resolution by 2x the number of pixels (double either the width or height), and
                    the time to render doubles, then you are fragment processing bound.</para>
                <para>Note that rendering time will go up when you increase the resolution. What you
                    are interested in is whether it goes up linearly with the number of fragments
                    rendered. If the render time only goes up by 1.2x with a 2x increase in number
                    of fragments, then the code was not fragment processing bound.</para>
            </section>
            <section>
                <title>Vertex Processing</title>
                <para>If you are not fragment processing bound, then there's a good chance you are
                    vertex processing bound. After ruling out fragment processing, simply turn off
                    all fragment processing. If this does not increase your performance
                    significantly (there will generally be some change), then you were vertex
                    processing bound.</para>
                <para>To turn off fragment processing, simply
                        <function>glEnable</function>(<literal>GL_CULL_FACE</literal>) and set
                        <function>glCullFace</function> to <literal>GL_FRONT_AND_BACK</literal>.
                    That will cause the clipping system to cull all triangles before rasterization.
                    Obviously, nothing will be rendered, but your performance timings will be for
                    vertex processing alone.</para>
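                <para>In code, the test is just two state changes:</para>
                <programlisting language="cpp">// Cull both front- and back-facing triangles: vertices are still processed,
// but no triangle survives to be rasterized or fragment shaded.
glEnable(GL_CULL_FACE);
glCullFace(GL_FRONT_AND_BACK);</programlisting>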
            </section>
            <section>
                <title>CPU</title>
                <para>A CPU bottleneck means that the GPU is being starved; it is consuming data
                    faster than the CPU is providing it. You do not really test for CPU bottlenecks
                    per se; they are discovered by process of elimination. If nothing else is
                    bottlenecking the GPU, then the CPU clearly is not giving it enough stuff to
                    do.</para>
            </section>
        </section>
        <section>
            <title>Unfixable Bottlenecks</title>
            <para>It is entirely possible that you cannot fix a bottleneck. Maybe there's simply no
                way to avoid a vertex-processing heavy section of your renderer. Perhaps you need
                all of that fragment processing in a certain area of rendering.</para>
            <para>If there is some bottleneck that cannot be optimized away, then turn it to your
                advantage. If you have a CPU bottleneck, then render more detailed models. If you
                have a vertex-shader bottleneck, improve your lighting by adding some
                fragment-shader complexity. And so forth. Just make sure that you do not increase
                complexity to the point where you move the bottleneck.</para>
        </section>
    </section>
    <section>
        <?dbhtml filename="Optimize Core.html" ?>
        <title>Core Optimizations</title>
        <para/>
        <section>
            <title>State Changes</title>
            <para>This rule is designed to decrease CPU bottlenecks. The rule itself is simple:
                minimize the number of state changes. Actually doing it is a complex exercise in
                graphics engine design.</para>
            <para>What is a state change? Any OpenGL function that changes the state of the current
                context is a state change. This includes any function that changes the state of
                objects bound to the current context.</para>
            <para>What you should do is gather all of the things you need to render and sort them
                based on state changes. Objects with similar state will be rendered one after the
                other. But not all state changes are equal to one another; some state changes are
                more expensive than others.</para>
            <para>Vertex array state, for example, is generally considered quite expensive. Try to
                group many objects that have the same vertex attribute data formats in the same
                buffer objects. Use <function>glDrawElementsBaseVertex</function> to help when using
                indexed rendering.</para>
            <para>The currently bound texture state is also somewhat expensive. Program state is
                analogous to this.</para>
            <para>Global state, such as face culling, blending, etc, is generally considered less
                expensive to change. You should still only change it when necessary, but buffer
                object and texture state are much more important in state sorting.</para>
            <para>There are also certain tricky states that can hurt you. For example, it is best to
                avoid changing the direction of the depth test once you have cleared the depth
                buffer and started rendering to it. This is for reasons having to do with specific
                hardware optimizations of depth buffering.</para>
            <para>It is less well-understood how important uniform state is, or how uniform buffer
                objects compare with traditional uniform values.</para>
        </section>
        <section>
            <title>Object Culling</title>
            <para>The fastest object is one not drawn. And there's no point in drawing something
                that is not seen.</para>
            <para>The simplest form of object culling is frustum culling: choosing not to render
                objects that are entirely outside of the view frustum. Determining that an object is
                off screen is a CPU task. You generally have to represent each object as a sphere or
                camera-space box; then you test the sphere or box to see if it is partially within
                the view space.</para>
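            <para>As a sketch of the sphere case, assuming the six frustum planes are stored as
                normalized, inward-facing plane equations (the <type>Plane</type> struct here is
                hypothetical):</para>
            <programlisting language="cpp">struct Plane
{
  glm::vec3 normal;  // unit length, pointing into the frustum
  float d;           // plane equation: dot(normal, p) + d = 0
};

// Returns false only when the sphere is entirely behind some plane.
bool SphereInFrustum(const Plane planes[6], const glm::vec3 &amp;center, float radius)
{
  for(int i = 0; i &lt; 6; ++i)
  {
    if(glm::dot(planes[i].normal, center) + planes[i].d &lt; -radius)
      return false;  // completely outside this plane: cull it
  }
  return true;       // at least partially within the view space
}</programlisting>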
            <para>There are also a number of techniques for determining whether the view of certain
                objects is obstructed by other objects. Portals, BSPs, and a variety of other
                techniques involve preprocessing terrain to determine visibility sets. It can then
                be known that, when the camera is in a certain region of the world, objects in
                certain other regions cannot be visible, even if they are within the view
                frustum.</para>
            <para>A level beyond that involves using something called occlusion queries. This is a
                way to render an object with the GPU and then ask how many fragments of that object
                were rasterized. It is generally preferred to render simple test objects, such that
                if any part of the test object is visible, then the real object will be visible.
                Color masks (with <function>glColorMask</function>) are used to prevent writing the
                fragment shader outputs of the test object to the framebuffer.</para>
            <para>Occlusion queries in OpenGL are objects that have state. They are created with the
                    <function>glGenQueries</function> function. To start rendering a test object for
                occlusion queries, the object generated from <function>glGenQueries</function> is
                passed to the <function>glBeginQuery</function> function, along with the mode of
                    <literal>GL_SAMPLES_PASSED</literal>. All rendering commands between
                    <function>glBeginQuery</function> and the corresponding
                    <function>glEndQuery</function> are part of the test object. If all of the
                fragments of the object were discarded (via depth buffer or something else), then
                the query failed. If even one fragment was rendered, then it passed.</para>
            <para>This can be used with conditional rendering. Conditional rendering allows a series
                of rendering commands, bracketed by
                    <function>glBeginConditionalRender</function>/<function>glEndConditionalRender</function>
                functions, to cause rendering of an object to happen or not happen based on the
                status of an occlusion query object. If the occlusion query passed, then the
                rendering commands will be executed. If it did not, then they will not be.</para>
            <para>Of course, conditional rendering can cause pipeline stalls; OpenGL still requires
                that operations execute in-order, even conditional ones. So all later operations
                will be held up if a conditional render is waiting for its occlusion query to
                finish. To avoid this, you can specify <literal>GL_QUERY_NO_WAIT</literal> when
                beginning the conditional render. This will cause OpenGL to render if the query has
                not completed before this conditional render is ready to be rendered.</para>
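            <para>Putting the pieces together, the whole pattern looks like the following sketch,
                where <function>DrawBoundingBox</function> and <function>DrawRealObject</function>
                are hypothetical stand-ins for your own rendering code:</para>
            <programlisting language="cpp">GLuint query;
glGenQueries(1, &amp;query);

// Render the cheap test object with all framebuffer writes masked off.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);
glBeginQuery(GL_SAMPLES_PASSED, query);
DrawBoundingBox();
glEndQuery(GL_SAMPLES_PASSED);
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

// Render the real object only if some samples of the test object passed.
glBeginConditionalRender(query, GL_QUERY_NO_WAIT);
DrawRealObject();
glEndConditionalRender();</programlisting>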
        </section>
        <section>
            <title>Model LOD</title>
            <para>When a model is far away, it does not need to look as detailed. Therefore, one can
                substitute less detailed models for more detailed ones. This is commonly referred to
                as Level of Detail (<acronym>LOD</acronym>).</para>
            <para>Of course in modern rendering, detail means more than just the number of polygons
                in a mesh. It can often mean what shader to use, what textures to use with it, etc.
                So just as meshes will often have LODs, so will shaders. Textures have their own
                built-in LOD mechanism in mipmapping. But it is often the case that low-LOD
                shaders (those used from far away) do not need as many textures as the closer LOD
                shaders. You might be able to get away with per-vertex lighting for distant models,
                while you need per-fragment lighting for those close up.</para>
            <para>The general problem is how to deal with the transitions between LOD levels. If you
                change them too close to the camera, then the user will notice the pop. If you do
                them too far away, you lose much of the performance impact. Finding a good
                middle-ground is key.</para>
        </section>
        <section>
            <title>Mipmapping</title>
            <para>For any texture that represents a surface property of an object, strongly consider
                giving it mipmaps. This includes bump maps, diffuse textures, specular textures,
                etc. This is primarily for performance reasons.</para>
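            <para>If a texture is loaded without pre-built mipmaps, OpenGL can generate them for
                you. A minimal sketch, assuming <varname>texture</varname> is a loaded 2D
                texture:</para>
            <programlisting language="cpp">glBindTexture(GL_TEXTURE_2D, texture);
glGenerateMipmap(GL_TEXTURE_2D);  // build the full mipmap chain from level 0
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);</programlisting>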
            <para>When you fetch a texel from a texture, the texture unit hardware will usually
                fetch the neighboring texels at the mip LOD(s) in question. These texels will be
                stored in local memory called the texture cache. This means that, when the next
                fragment on the surface comes along, that texel will already be in the cache. But
                this only works for texels that are near each other.</para>
            <para>When an object is far from the camera or angled sharply relative to the view, then
                the two texture coordinates for two neighboring fragments can be quite different
                from one another. When fetching from a low mipmap (remember: 0 is the biggest
                mipmap), then the two fragments will get texels that are far apart. Neither one will
                fetch texels near each other.</para>
            <para>But if they are fetching from a high mipmap, then the large texture coordinate
                difference between them translates into a small texel-space difference. With proper
                mipmapping, neighboring fragments can be fed from the cache and do fewer memory
                accesses. This speeds up texturing performance.</para>
            <para>This also means that biasing the mipmap LOD lower (to larger mipmaps) can cause
                serious performance problems in addition to aliasing.</para>
        </section>
    </section>
    <section>
        <?dbhtml filename="Optimize Vertex Format.html" ?>
        <title>Vertex Format</title>
        <para>Vertex attributes stored in buffer objects can be of a surprisingly large number of
            formats. These tutorials generally used 32-bit floating-point data, but that is far from
            the best case.</para>
        <para>The <glossterm>vertex format</glossterm> specifically refers to the set of values
            given to the <function>glVertexAttribPointer</function> calls that describe how each
            attribute is aligned in the buffer object.</para>
        <section>
            <title>Attribute Formats</title>
            <para>Each attribute should take up as little room as possible. This is for performance
                reasons, but it also saves memory. For buffer objects, these are usually one and the
                same. The less data you have stored in memory, the faster it gets to the vertex
                shader.</para>
            <para>Attributes can be stored in normalized integer formats, just like textures. This
                is most useful for colors and texture coordinates. For example, to have an attribute
                that is stored in 4 unsigned normalized bytes, you can use this:</para>
            <programlisting language="cpp">glVertexAttribPointer(index, 4, GLubyte, GLtrue, 0, offset);</programlisting>
            <para>If you want to store a normal as a normalized signed short, you can use
                this:</para>
            <programlisting language="cpp">glVertexAttribPointer(index, 3, GLushort, GLtrue, 0, offset);</programlisting>
            <para>There are also a few specialized formats. <literal>GL_HALF_FLOAT</literal> can be
                used for 16-bit floating-point types. This is useful for when you need values
                outside of [-1, 1], but do not need the full precision or range of a 32-bit
                float.</para>
            <para>Non-normalized integers can be used as well. These map in GLSL directly to
                floating-point values, so a non-normalized value of 16 maps to a GLSL value of
                16.0.</para>
            <para>The best thing about all of these formats is that they cost
                    <emphasis>nothing</emphasis> in performance to use. They are all silently
                converted into floating-point values for consumption by the vertex shader, with no
                performance lost.</para>
        </section>
        <section>
            <title>Interleaved Attributes</title>
            <para>Attributes do not all have to come from the same buffer object; multiple
                attributes can come from multiple buffers. However, where possible, this should be
                avoided. Furthermore, attributes in the same buffer should be interleaved with one
                another whenever possible.</para>
            <para>Consider an array of structs in C++:</para>
            <programlisting language="cpp">struct Vertex
{
  float position[3];
  GLubyte color[4];
  GLushort texCoord[2];
};

Vertex vertArray[20];</programlisting>
            <para>The byte offset of <varname>color</varname> in the <type>Vertex</type> struct is
                12. That is, from the beginning of the <type>Vertex</type> struct, the
                    <varname>color</varname> variable starts 12 bytes in. The
                    <varname>texCoord</varname> variable starts 16 bytes in.</para>
            <para>If we did a memcpy between <varname>vertArray</varname> and a buffer object, and
                we wanted to set the attributes to pull from this data, we could do so using the
                stride and offsets to position things properly.</para>
            <programlisting language="cpp">glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 20, 0);
glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE, 20, 12);
glVertexAttribPointer(3, 2, GL_UNSIGNED_SHORT, GL_TRUE, 20, 16);</programlisting>
            <para>The fifth argument is the stride. The stride is the number of bytes from the
                beginning of one instance of this attribute to the beginning of another. The stride
                here is set to <literal>sizeof</literal>(<type>Vertex</type>). C++ defines that the
                size of a struct represents the byte offset between separate instances of that
                struct in an array. So that is our stride.</para>
            <para>The offsets represent where in the buffer object the first element is. These match
                the offsets in the struct. If we had loaded this data to a location past the front
                of our buffer object, we would need to offset these values by the beginning of where
                we uploaded our data to.</para>
            <para>There are certain gotchas when deciding how data gets packed like this. First, it
                is a good idea to keep every attribute on a 4-byte alignment. This may mean
                introducing explicit padding (empty space) into your structures. Some hardware will
                have massive slowdowns if things are not aligned to four bytes.</para>
            <para>Next, it is a good idea to keep the size of any interleaved vertex data restricted
                to multiples of 32 bytes in size. Violating this is not as bad as violating the
                4-byte alignment rule, but one can sometimes get sub-optimal performance if the
                total size of interleaved vertex data is, for example, 48 bytes. Or 20 bytes, as in
                our example.</para>
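            <para>As an illustrative sketch, the example vertex above could be padded out to 32
                bytes with an explicit padding member:</para>
            <programlisting language="cpp">struct PaddedVertex
{
  float position[3];     // 12 bytes
  GLubyte color[4];      //  4 bytes, 16 total
  GLushort texCoord[2];  //  4 bytes, 20 total
  GLubyte padding[12];   // explicit padding out to 32 bytes
};</programlisting>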
        </section>
        <section>
            <title>Packing Suggestions</title>
            <para>If the smallest vertex data size is what you need, consider these packing
                techniques.</para>
            <para>Colors generally do not need to be more than 3-4 bytes in size. One byte per
                component.</para>
            <para>Texture coordinates, particularly those clamped to the [0, 1] range, almost never
                need more than 16-bit precision. So use unsigned shorts.</para>
            <para>Normals should be stored in the signed 2_10_10_10 format whenever possible.
                Normals generally do not need that much precision, especially since you're going to
                normalize them anyway. This format was specifically devised for normals, so use
                it.</para>
            <para>Positions are the trickiest to work with, because the needs vary so much. If you
                are willing to modify your vertex shaders and put some work into it, you can often
                use 16-bit signed normalized shorts.</para>
            <para>The key to this is a special scale/translation matrix. When you are preparing your
                data, in an offline tool, you take the floating-point positions of a model and
                determine the model's maximum extents in all three axes. This forms a bounding box
                around the model. The center of the box is the center of your new model, and you
                apply a translation to move the points to this center. Then you apply a non-uniform
                scale to transform the points from their extent range to the [-1, 1] range of signed
                normalized values. You save the offset and the scales you used as part of your mesh
                data (not to be stored in the buffer object).</para>
            <para>When it comes time to render the model, you simply reverse the transformation. You
                build a scale/translation matrix that undoes what was done to get them into the
                signed-normalized range. Note that this matrix should not be applied to the normals,
                because the normals were not compressed this way. A full matrix multiply is even
                overkill for this transformation; a scale+translation can be done with a simple
                vector multiply and add.</para>
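            <para>With GLM, building the decode matrix is a one-liner. This is a sketch, assuming
                    <varname>center</varname> and <varname>halfSize</varname> hold the values saved
                by the offline tool:</para>
            <programlisting language="cpp">#include &lt;glm/glm.hpp&gt;
#include &lt;glm/gtc/matrix_transform.hpp&gt;

// Undo the offline compression: scale from [-1, 1] back to the original
// extents, then translate back to the original center.
glm::mat4 decode = glm::translate(glm::mat4(1.0f), center) *
                   glm::scale(glm::mat4(1.0f), halfSize);</programlisting>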
        </section>
    </section>
    <section>
        <?dbhtml filename="Optimize Vertex Cache.html" ?>
        <title>Vertex Caching</title>
        <para/>
    </section>
    <section>
        <?dbhtml filename="Optimize Shaders.html" ?>
        <title>Shaders and Performance</title>
        <para/>
    </section>
    <section>
        <?dbhtml filename="Optimize Sync.html" ?>
        <title>Synchronization</title>
        <para>GPUs gain quite a bit of their performance because, by and large, once you tell them
            what to do, they go do their stuff without further intervention. As the programmer, you
            do not care that a frame has not yet completed. All you are interested in is that the
            user sees the frame when it is ready.</para>
        <para>There are certain things that the user can do which will cause this perfect
            asynchronous activity to come to a screeching halt. These are called synchronization
            events.</para>
        <para>OpenGL is defined to allow asynchronous behavior; commands that you give do not have
            to be completed when the function ends (for the most part). However, OpenGL defines this
            by saying that if there is asynchronous behavior, the user <emphasis>cannot</emphasis>
            be made aware of it. That is, if you call <function>glDrawArrays</function>, the effect
            of this command should be based solely on the current state of OpenGL. This means that,
            if the <function>glDrawArrays</function> command is executed later, the OpenGL
            implementation must do whatever it takes to prevent later changes from impacting the
            results.</para>
        <para>Therefore, if you make a <function>glDrawArrays</function> call that pulls from some
            buffer object, and then immediately call <function>glBufferSubData</function> on that
            buffer object, the OpenGL implementation may have to pause the CPU in the
                <function>glBufferSubData</function> call until the
                <function>glDrawArrays</function> has at least finished vertex processing. However,
            the implementation may also simply copy the data you are trying to transfer into some
            memory it allocates, to be uploaded to the buffer once the
                <function>glDrawArrays</function> completes. There is no way to be sure which will
            happen.</para>
        <para>Synchronization events usually include changing data objects. That is, changing the
            contents of buffer objects or textures. Usually, changing simple state of objects, like
            what attributes a VAO provides or texture parameters, does not cause synchronization
            issues. Changing global OpenGL state also does not cause synchronization
            problems.</para>
        <para>There are ways to allow you to modify data objects that still let the GPU be
            asynchronous. But any discussion of these is well beyond the bounds of this book. Just
            be aware that data objects that are in active use should probably not have their data
            modified.</para>
    </section>
</appendix>