<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng" type="xml"?>
<?oxygen SCHSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng"?>
<appendix xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
    <?dbhtml filename="History of Graphics Hardware.html" ?>
    <info>
        <title>History of PC Graphics Hardware</title>
        <subtitle>A Programmer's View</subtitle>
    </info>
    <para>For those of you who had the good fortune of not being graphics programmers during the
        formative years of consumer graphics hardware, what follows is a brief history. Hopefully,
        it will give you some perspective on what has changed in the last 15 years or so, as well
        as an idea of how grateful you should be that you never had to suffer through the early
        days.</para>
    <section>
        <title>Voodoo Magic</title>
        <para>In the years 1995 and 1996, a number of graphics cards were released. Graphics
            processing via specialized hardware on PC platforms was nothing new. What was new about
            these cards was their ability to do 3D rasterization.</para>
        <para>The most popular of these for that era was the Voodoo Graphics card from 3Dfx
            Interactive. It was fast, powerful for its day, and provided high quality rendering
            (again, for its day).</para>
        <para>The functionality of this card was quite bare-bones from a modern perspective.
            Obviously there was no concept of shaders of any kind. Indeed, it did not even have
            vertex transformation; the Voodoo Graphics pipeline began with clip-space values. This
            required the CPU to do vertex transformations. This hardware was effectively just a
            triangle rasterizer.</para>
        <para>That being said, it was quite good for its day. As inputs to its rasterization
            pipeline, it took vertex inputs of a 4-dimensional clip-space position (though the
            actual space was not necessarily the same as OpenGL's clip-space), a single RGBA color,
            and a single three-dimensional texture coordinate. The hardware did not support 3D
            textures; the extra component was in case the user wanted to do projective
            texturing.</para>
        <para>The texture coordinate was used to map into a single texture. The texture coordinate
            and color interpolation was perspective-correct; in those days, that was a significant
            selling point. The venerable PlayStation 1 could not do perspective-correct
            interpolation.</para>
        <para>The value fetched from the texture could be combined with the interpolated color using
            one of three math functions: addition, multiplication, or linear interpolation based on
            the texture's alpha value. The alpha of the output was controlled with a separate math
            function, thus allowing the user to generate the alpha with different math than the RGB
            portion of the output color. This was the sum total of its fragment processing.</para>
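        <para>Expressed in modern GLSL terms, the entire fragment stage amounts to selecting one of
            a few fixed combine functions. The sketch below is purely illustrative; the hardware had
            no shading language, worked in fixed-point rather than floating-point, and the function
            names here are invented for clarity.</para>
        <programlisting language="glsl">// Illustrative GLSL sketch of the Voodoo Graphics color combine.
// tex is the single texture lookup; color is the interpolated vertex color.
// The hardware selected one of these as fixed state, not at runtime.
vec3 combineAdd(vec4 tex, vec4 color)  { return tex.rgb + color.rgb; }
vec3 combineMul(vec4 tex, vec4 color)  { return tex.rgb * color.rgb; }
vec3 combineLerp(vec4 tex, vec4 color) { return mix(color.rgb, tex.rgb, tex.a); }
// The output alpha was produced by a separately selected function.</programlisting>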
        <para>It had framebuffer blending support. Its framebuffer could even support a
            destination alpha value, though you had to give up having a depth buffer to get it.
            Probably not a good tradeoff. Outside of that issue, its blending support was superior
            even to OpenGL 1.1. It could use different source and destination factors for the alpha
            component than the RGB component; the old GL 1.1 forced the RGB and A to be blended with
            the same factors.</para>
        <para>The blending was even performed with full 24-bit color precision and then downsampled
            to the 16-bit precision of the output upon writing.</para>
        <para>From a modern perspective, spoiled with our full programmability, this all looks
            incredibly primitive. And, to some degree, it is. But compared to the pure CPU solutions
            to 3D rendering of the day, the Voodoo Graphics card was a monster.</para>
        <para>It's interesting to note that the simplicity of the fragment processing stage owes as
            much to the lack of inputs as anything else. When the only values you have to work with
            are the color from a texture lookup and the per-vertex interpolated color, there really
            is not all that much you can do with them. Indeed, as we will see in the next phases of
            hardware, increases in the complexity of the fragment processor were a reaction to
            increases in the number of inputs <emphasis>to</emphasis> the fragment processor. When
            you have more data to work with, you need more complex operations to make that data
            useful.</para>
    </section>
    <section>
        <?dbhtml filename="History TNT.html" ?>
        <title>Dynamite Combiners</title>
        <para>The next phase of hardware came, not from 3Dfx, but from a new company, NVIDIA. While
            3Dfx's Voodoo II was much more popular than NVIDIA's product, the NVIDIA Riva TNT
            (released in 1998) was more interesting in terms of what it brought to the table for
            programmers. Voodoo II was purely a performance improvement; TNT was the next step in
            the evolution of graphics hardware.</para>
        <para>Like other graphics cards of the day, the TNT hardware had no vertex processing.
            Vertex data was in clip-space, as normal, so the CPU had to do all of the transformation
            and lighting. Where the TNT shone was in its fragment processing. The power of the TNT
            was in its name; TNT stands for <acronym>T</acronym>wi<acronym>N</acronym>
            <acronym>T</acronym>exel. It could access two textures at once. And while the
            Voodoo II could do that as well, the TNT had much more flexibility in its fragment
            processing pipeline.</para>
        <para>In order to accommodate two textures, the vertex input was expanded. Two textures meant
            two texture coordinates, since each texture coordinate was directly bound to a
            particular texture. While they were doubling things up, NVIDIA also allowed for two
            per-vertex colors. The idea here has to do with lighting equations.</para>
        <para>For regular diffuse lighting, the CPU-computed color would simply be dot(N, L),
            possibly with attenuation applied. Indeed, it could be any complicated diffuse lighting
            function, since it was all on the CPU. This diffuse light intensity would be multiplied
            by the texture, which represented the diffuse absorption of the surface at that
            point.</para>
        <para>This becomes less useful if you want to add a specular term. The specular absorption
            and diffuse absorption are not necessarily the same, after all. And while you may not
            need to have a specular texture, you do not want to add the specular component to the
            diffuse component <emphasis>before</emphasis> you multiply by their respective colors.
            You want to do the addition afterwards.</para>
        <para>This is simply not possible if you have only one per-vertex color. But it becomes
            possible if you have two. One color is the diffuse lighting value. The other color is
            the specular component. We multiply the first color by the diffuse color from the
            texture, then add the second color as the specular reflectance.</para>
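        <para>In modern GLSL terms, the computation the two colors make possible is simply the
            following (a sketch with hypothetical names; on the TNT this was configured as
            fixed-function state rather than written as a shader):</para>
        <programlisting language="glsl">// Sketch of the two-color idea. diffuseLight and specularLight are the two
// per-vertex colors computed on the CPU; diffuseTex is the sampled texture.
// The multiply by the texture happens before the specular add, not after.
vec3 shade(vec3 diffuseLight, vec3 specularLight, vec3 diffuseTex)
{
    return diffuseLight * diffuseTex + specularLight;
}</programlisting>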
        <para>Which brings us nicely to fragment processing. The TNT's fragment processor had 5
            inputs: 2 colors sampled from textures, 2 colors interpolated from vertices, and a
            single <quote>constant</quote> color. The latter, in modern parlance, is the equivalent
            of a shader uniform value.</para>
        <para>That's a lot of potential inputs. The solution NVIDIA came up with to produce a final
            color was a bit of fixed functionality that we will call the texture environment. It is
            directly analogous to the OpenGL 1.1 fixed-function pipeline, but with extensions for
            multiple textures and some TNT-specific features.</para>
        <para>The idea is that each texture has an environment. The environment is a specific math
            function, such as addition, subtraction, multiplication, and linear interpolation. The
            operands to this function could be taken from any of the fragment inputs, as well as a
            constant zero color value.</para>
        <para>It can also use the result from the previous environment as one of its arguments.
            Textures and environments are numbered, from zero to one (two textures, two
            environments). The first one executes, followed by the second.</para>
        <para>If you look at it from a hardware perspective, what you have is a two-opcode assembly
            language. The available registers for the language are two vertex colors, a single
            uniform color, two texture colors, and a zero register. There is also a single temporary
            register to hold the output from the first opcode.</para>
        <para>Graphics programmers, by this point, had gotten used to multipass-based algorithms.
            After all, until TNT, that was the only way to apply multiple textures to a single
            surface. And even with TNT, it had a pretty confining limit of two textures and two
            opcodes.</para>
        <para>This was powerful, but quite limited. Two opcodes really was not enough.</para>
        <para>The TNT cards also provided something else: 32-bit framebuffers and depth buffers.
            While the Voodoo cards used high-precision math internally, they still wrote to 16-bit
            framebuffers, using a technique called dithering to make them look like higher
            precision. But dithering was nothing compared to actual high precision framebuffers. And
            it did nothing for the depth buffer artifacts that a 16-bit depth buffer gave
            you.</para>
        <para>While the original TNT could do 32-bit, it lacked the memory and overall performance
            to really show it off. That had to wait for the TNT2. Combined with product delays and
            some poor strategic moves by 3Dfx, NVIDIA became one of the dominant players in the
            consumer PC graphics card market. And that was cemented by their next card, which had
            real power behind it.</para>
        <sidebar>
            <title>Tile-Based Rendering</title>
            <para>While all of this was going on, a small company called PowerVR released its Series
                2 graphics chip. PowerVR's approach to rendering was fundamentally different from
                the standard rendering pipeline.</para>
            <para>They used what they called a <quote>deferred, tile-based renderer.</quote> The
                idea is that they store all of the clip-space triangles in a buffer. Then, they sort
                this buffer based on which triangles cover which areas of the screen. The output
                screen is divided into a number of tiles of a fixed size. Say, 8x8 in size.</para>
            <para>For each tile, the hardware finds the triangles that are within that tile's area.
                Then it does all the usual scan conversion tricks and so forth. It even
                automatically does per-pixel depth sorting for blending, which remains something of
                a selling point (no more having to manually sort blended objects). After rendering
                that tile, it moves on to the next. These operations can of course be executed in
                parallel; you can have multiple tiles being rasterized at the same time.</para>
            <para>The idea behind this is to avoid having large image buffers. You only need a few
                8x8 depth buffers, so you can use very fast, on-chip memory for it. Rather than
                having to deal with caches, DRAM, and large bandwidth memory channels, you just have
                a small block of memory where you do all of your logic. You still need memory for
                textures and the output image, but your bandwidth needs can be devoted solely to
                textures.</para>
            <para>For a time, these cards were competitive with the other graphics chip makers.
                However, the tile-based approach simply did not scale well with resolution or
                geometry complexity. Also, they missed the geometry processing bandwagon, which
                really hurt their standing. They fell farther and farther behind the other major
                players, until they stopped making desktop parts altogether.</para>
            <para>However, they may ultimately have the last laugh; unlike 3Dfx and so many others,
                PowerVR still exists. They provided the GPU for the Sega Dreamcast console. And
                while that console was a market failure, it did show where PowerVR's true strength
                lay: embedded platforms.</para>
            <para>Embedded platforms tend to play to their tile-based renderer's strengths. Memory,
                particularly high-bandwidth memory, eats up power; having less memory means
                longer-lasting mobile devices. Embedded devices tend to use smaller resolutions,
                which their platform excels at. And with low resolutions, you are not trying to push
                nearly as much geometry.</para>
            <para>Thanks to these facts, PowerVR graphics chips power the vast majority of mobile
                platforms that have any 3D rendering in them. Just about every iPhone, Droid, iPad,
                or similar device is running PowerVR technology. And that's a growth market these
                days.</para>
        </sidebar>
    </section>
    <section>
        <?dbhtml filename="History GeForce.html" ?>
        <title>Vertices and Registers</title>
        <para>The next stage in the evolution of graphics hardware again came from NVIDIA. While
            3Dfx released competing cards, they were again behind the curve. The NVIDIA GeForce 256
            (not to be confused with the GeForce GT250, a much more modern card), released in 1999,
            provided something truly new: a vertex processing pipeline.</para>
        <para>The OpenGL API has always defined a vertex processing pipeline (it was fixed-function
            in those days rather than shader-based). And NVIDIA implemented it in their TNT-era
            drivers on the CPU. But only with the GeForce 256 was this actually implemented in
            hardware. And NVIDIA essentially built the entire OpenGL fixed-function vertex
            processing pipeline directly into the GeForce hardware.</para>
        <para>This was primarily a performance win. While it was important for the progress of
            hardware, a less-well-known improvement of the early GeForce hardware was more important
            to its future.</para>
        <para>In the fragment processing pipeline, the texture environment stages were removed. In
            their place was a more powerful mechanism, what NVIDIA called <quote>register
                combiners.</quote></para>
        <para>The GeForce 256 provided 2 regular combiner stages. Each of these stages represented
            up to four independent opcodes that operated over the register set. The opcodes could
            result in multiple outputs, which could be written to two temporary registers.</para>
        <para>What is interesting is that the register values are no longer limited to color values.
            Instead, they are signed values, in the range [-1, 1]; they have 9 bits of precision or
            so. While the initial color or texture values are on [0, 1], the actual opcodes
            themselves can perform operations that generate negative values. Opcodes can even
            scale/bias their inputs, which allows them to turn unsigned colors into signed
            values.</para>
        <para>Because of this, the GeForce 256 was the first hardware to be able to do functional
            bump mapping, without hacks or tricks. A single register combiner stage could do 2
            3-vector dot-products at a time. Textures could store normals by compressing them to a
            [0, 1] range. The light direction could either be a constant or interpolated per-vertex
            in texture space.</para>
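        <para>In GLSL terms, the per-fragment math that a single combiner stage made possible looks
            roughly like the sketch below. This is only an illustration; the real hardware had no
            shading language and worked with roughly 9-bit signed fixed-point values.</para>
        <programlisting language="glsl">// Sketch of combiner-era bump mapping. normalTex is a texture-space normal
// compressed into [0, 1]; lightDir is a texture-space light direction,
// either constant or interpolated per-vertex.
float bumpDiffuse(vec3 normalTex, vec3 lightDir)
{
    vec3 n = normalTex * 2.0 - 1.0;      // the scale/bias step: [0, 1] back to [-1, 1]
    return max(dot(n, lightDir), 0.0);   // one of the combiner's 3-vector dot products
}</programlisting>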
        <para>Now granted, this still was a primitive form of bump mapping. There was no way to
            correct for texture-space values with binormals and tangents. But this was at least
            something. And it really was the first step towards programmability; it showed that
            textures could truly represent values other than colors.</para>
        <para>There was also a single final combiner stage. This was a much more limited stage than
            the regular combiner stages. It could do a linear interpolation operation and an
            addition; this was designed specifically to implement OpenGL's fixed-function fog and
            specular computations.</para>
        <para>The register file consisted of two temporary registers, two per-vertex colors, two
            texture colors, two uniform values, the zero register, and a few other values used for
            OpenGL fixed-function fog operations. The color and texture registers were even
            writeable, if you needed more temporaries.</para>
        <para>There were a few other sundry additions to the hardware. Cube textures first came onto
            the scene. Combined with the right texture coordinate computations (now in hardware),
            you could have reflective surfaces much more easily. Anisotropic filtering and
            multisampling also appeared at this time. The limits were relatively small; anisotropic
            filtering was limited to 4x, while the maximum number of samples was restricted to two.
            Compressed texture formats also appeared on the scene.</para>
        <para>What we see thus far as we take steps towards true programmability is that increased
            complexity in fragment processing starts pushing for other needs. The addition of a dot
            product allows lighting computations to take place per-fragment. But you cannot have full
            texture-space bump mapping because of the lack of a normal/binormal/tangent matrix to
            transform vectors to texture space. Cubemaps allow you to do arbitrary reflections, but
            computing reflection directions per-vertex requires interpolating reflection normals,
            which does not work very well over large polygons.</para>
        <para>This era also saw the introduction of something called a rectangle texture. This
            texture type is something of an odd duck that still remains to this day. It was a way of
            creating a texture of arbitrary size; until then, textures were limited to powers of two
            in size (though the two sizes did not have to be the same). The texture coordinates for
            rectangle textures are not normalized; they are given in texel values.</para>
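        <para>This difference is still visible in GLSL today: rectangle textures use a dedicated
            sampler type, and their coordinates are given in texels rather than on the [0, 1] range.
            A minimal sketch:</para>
        <programlisting language="glsl">// Sampling a rectangle texture: coordinates are in texels, not normalized.
uniform sampler2DRect image;

vec4 fetchTexel(vec2 texelCoord)   // e.g. vec2(512.0, 384.0)
{
    return texture(image, texelCoord);
}</programlisting>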
        <sidebar>
            <title>The GPU Divide</title>
            <para>When NVIDIA released the GeForce 256, they coined the term <quote>Graphics
                    Processing Unit</quote> or <acronym>GPU</acronym>. Until this point, graphics
                chips were called exactly that: graphics chips. The term GPU was intended by NVIDIA
                to differentiate the GeForce from all of its competition, including the final cards
                from 3Dfx.</para>
            <para>Because the term was so reminiscent of the term CPU, it took over. Every graphics
                chip is a GPU now, even ones released before the term came to exist.</para>
            <para>In truth, the term GPU never really made much sense until the next stage, where
                the first cards with actual programmability came onto the scene.</para>
        </sidebar>
    </section>
    <section>
        <?dbhtml filename="History Radeon8500.html" ?>
        <title>Programming at Last</title>
        <para>How do you define a demarcation between non-programmable graphics chips and
            programmable ones? We have seen that, even in the humble TNT days, there were a couple
            of user-defined opcodes with several possible input values.</para>
        <para>One way is to consider what programming is. Programming is not simply a mathematical
            operation; programming needs conditional logic. Therefore, it is not unreasonable to say
            that something is not truly programmable until there is the possibility of some form of
            conditional logic.</para>
        <para>And it is at this point that conditional logic first truly appears. It appears first
            in the <emphasis>vertex</emphasis> pipeline rather than the fragment pipeline. This seems
            odd until one realizes how crucial fragment operations are to overall performance. It
            therefore makes sense to introduce heavy programmability in the less
            performance-critical areas of hardware first.</para>
        <para>The GeForce 3, released in 2001 (a mere 3 years after the TNT), was the first hardware
            to provide this level of programmability. While GeForce 3 hardware did indeed have the
            fixed-function vertex pipeline, it also had a very flexible programmable pipeline.
            Retaining the fixed-function hardware was a performance need; the vertex shader was not
            as fast as the fixed-function path. It should be noted that the original Xbox's GPU,
            designed in tandem with the GeForce 3, eschewed the fixed functionality altogether in
            favor of having multiple vertex shaders that could compute several vertices at a time.
            This was eventually adopted for later GeForces.</para>
        <para>Vertex shaders were pretty powerful, even in their first incarnation. While there was
            no conditional branching, there was conditional logic, the equivalent of the ?:
            operator. These vertex shaders exposed up to 128 <type>vec4</type> uniforms, up to 16
                <type>vec4</type> inputs (still the modern limit), and could write 6
                <type>vec4</type> outputs. Two of the outputs, intended for colors, had lower
            precision than the others. There was a hard limit of 128 opcodes. These vertex shaders
            brought full swizzling support and a plethora of math operations.</para>
        <para>The GeForce 3 also added up to two more textures, for a total of four textures per
            triangle. They were hooked directly into certain per-vertex outputs, because the
            per-fragment pipeline did not have real programmability yet.</para>
        <para>At this point, the holy grail of programmability at the fragment level was dependent
            texture access. That is, being able to access a texture, do some arbitrary computations
            on it, and then access another texture with the result. The GeForce 3 had some
            facilities for that, but they were not very good ones.</para>
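        <para>In modern GLSL, a dependent texture access is trivial to express; the difficulty in
            this era was getting the hardware to do it at all. The sketch below shows the general
            idea, with hypothetical samplers and an arbitrary bit of intermediate math:</para>
        <programlisting language="glsl">// A dependent texture access: the result of one lookup, after some
// computation, becomes the coordinate for a second lookup.
uniform sampler2D indirectionMap;
uniform sampler2D dataMap;

vec4 dependentFetch(vec2 texCoord)
{
    vec2 offset = texture(indirectionMap, texCoord).xy * 2.0 - 1.0;
    return texture(dataMap, texCoord + 0.05 * offset);
}</programlisting>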
        <para>The GeForce 3 used 8 register combiner stages instead of the 2 that the earlier cards
            used. Their register files were extended to support two extra texture colors and a few
            more tricks. But the main change was something that, in OpenGL terminology, would be
            called <quote>texture shaders.</quote></para>
        <para>What texture shaders allowed the user to do was, instead of accessing a texture,
            perform a computation with that texture unit's texture coordinate. This was much like
            the old texture environment functionality, except that it operated only on texture
            coordinates. The textures were arranged in a sequence. And instead of accessing a
            texture, you could perform a computation between that texture unit's coordinate and
            possibly the coordinate from the previous texture shader operation, if there was
            one.</para>
        <para>It was not very flexible functionality. It did allow for full texture-space bump
            mapping, though. While the 8 register combiners were enough to do a full matrix
            multiply, they were not powerful enough to normalize the resulting vector. However, you
            could normalize a vector by accessing a special cubemap. The values of this cubemap
            represented a normalized vector in the direction of the cubemap's given texture
            coordinate.</para>
        <para>But using that required spending a total of 3 texture shader stages, which meant you
            got a bump map and a normalization cubemap only; there was no room for a diffuse map in
            that pass. It also did not perform very well; the texture shader functions were quite
            expensive.</para>
        <para>True programmability came to the fragment shader from ATI, with the Radeon 8500,
            released in late 2001.</para>
        <para>The 8500's fragment shader architecture was pretty straightforward, and in terms of
            programming, it is not too dissimilar to modern shader systems. Texture coordinates
            would come in. They could either be used to fetch from a texture or be given directly as
            inputs to the processing stage. Up to 6 textures could be used at once. Then, up to 8
            opcodes, including a conditional operation, could be used. After that, the hardware
            would repeat the process using registers written by the opcodes. Those registers could
            feed texture accesses from the same group of textures used in the first pass. And then
            another 8 opcodes would generate the output color.</para>
        <para>It also had strong, but not full, swizzling support in the fragment shader. Register
            combiners had very little support for swizzling.</para>
        <para>This era of hardware was also the first to allow 3D textures. Though that was as much
            a memory concern as anything else, since 3D textures take up lots of memory which was
            not available on earlier cards. Depth comparison texturing was also made
            available.</para>
        <para>While the 8500 was a technological marvel, it was a flop in the market compared to the
            GeForce 3 &amp; 4. Indeed, this is a recurring theme of these eras: the card with the
            more programmable hardware often tends to lose in its first iteration.</para>
        <sidebar>
            <title>API Hell</title>
            <para>This era is notable in what it did to graphics APIs. Consider the hardware
                differences between the 8500 and the GeForce 3/4 in terms of fragment
                processing.</para>
            <para>On the Direct3D front, things were not the best. Direct3D 8 promised a unified
                shader development pipeline. That is, you could write a shader according to their
                specifications and it would work on any D3D 8 hardware. And this was effectively
                true. For vertex shaders, at least.</para>
            <para>However, the D3D 8.0 pixel shader pipeline was nothing more than NVIDIA's register
                combiners and texture shaders. There was no real abstraction of capabilities; the
                D3D 8.0 pixel shaders simply took NVIDIA's hardware and made a shader language out
                of it.</para>
            <para>To provide support for the 8500's expanded fragment processing feature-set, there
                was D3D 8.1. This version altered the pixel shader pipeline to match the
                capabilities of the Radeon 8500. Fortunately, the 8500 would accept 8.0 shaders just
                fine, since it was capable of doing everything the GeForce 3 could do. But no one
                would mistake either shader specification for any kind of real abstraction.</para>
            <para>Things were much worse on the OpenGL front. At least in D3D, you used the same
                basic C++ API to provide shaders; the shaders themselves may have been different,
                but the base API was the same. Not so in OpenGL land.</para>
            <para>NVIDIA and ATI released entirely separate proprietary extensions for specifying
                fragment shaders. NVIDIA's extensions built on the register combiner extension they
                released with the GeForce 256. They were completely incompatible. And worse, they
                were not even string-based.</para>
            <para>Imagine having to call a C++ function to write every opcode of a shader. Now
                imagine having to call <emphasis>three</emphasis> functions to write each opcode.
                That's what using those APIs was like.</para>
            <para>Things were better on vertex shaders. NVIDIA initially released a vertex shader
                extension, as did ATI. NVIDIA's was string-based, but ATI's version was like their
                fragment shader. Fortunately, this state of affairs did not last long; the OpenGL
                ARB came along with their own vertex shader extension. This was not GLSL, but an
                assembly-like language based on NVIDIA's extension.</para>
            <para>It would take much longer for the fragment shader disparity to be worked
                out.</para>
        </sidebar>
    </section>
    <section>
        <?dbhtml filename="History GeForceFX.html" ?>
        <title>Dependency</title>
        <para>The Radeon 9700 was the 8500's successor. It improved on the 8500 somewhat. The vertex
            shader gained real conditional branching logic. Some of the limits were also relaxed;
            the number of available outputs and uniforms increased. The fragment shader's
            architecture remained effectively the same; the 9700 simply increased the limits. There
            were 8 textures available and 16 opcodes, and it could perform 4 passes over this
            set.</para>
        <para>The GeForce FX, released in 2003, was a substantial improvement, both over the GeForce
            3/4 and over the 9700 in terms of fragment processing. NVIDIA took a different approach
            to their fragment shaders; their fragment processor worked not entirely unlike modern
            shader processors do.</para>
        <para>It read an instruction, which could be a math operation, conditional branch (they had
            actual branches in fragment shading), or texture lookup instruction. It then executed
            that instruction. The texture lookup could be from a set of 8 textures. And then it
            repeated this process on the next instruction. It was doing math computations in a way
            not entirely unlike a traditional CPU.</para>
        <para>There was no real concept of a dependent texture access for the GeForce FX. The inputs
            to the fragment pipeline were simply the texture coordinates and colors from the vertex
            stage. If you used a texture coordinate to access a texture, it was fine with that. If
            you did some computations with them and then accessed a texture, it was just as fine
            with that. It was completely generic.</para>
        <para>It also failed in the marketplace. This was due primarily to its lateness and its poor
            performance in high-precision computation operations. The FX was optimized for doing
            16-bit math computations in its fragment shader; while it <emphasis>could</emphasis> do
            32-bit math, it was half as fast when doing this. But Direct3D 9's shaders did not allow
            the user to specify the precision of computations; the specification required at least
            24 bits of precision. To match this, NVIDIA had no choice but to force 32-bit math on
            all D3D 9 applications, making them run much slower than their ATI counterparts (the
            9700 always used 24-bit precision math).</para>
        <para>Things were no better in OpenGL land. The two competing unified fragment processing
            APIs, GLSL and an assembly-like fragment shader, did not have precision specifications
            either. Only NVIDIA's proprietary extension for fragment shaders provided that, and
            developers were less likely to use it. Especially with the head start that the 9700
            gained in the market by the FX being released late.</para>
        <para>It performed so poorly in the market that NVIDIA dropped the FX name for the next
            hardware revision. The GeForce 6 improved its 32-bit performance to the point where it
            was competitive with the ATI equivalents.</para>
        <para>This level of hardware saw the gaining of a number of different features. sRGB
            textures and framebuffers appeared, as did floating-point textures. Blending support for
            floating-point framebuffers was somewhat spotty; some hardware could do it only for
            16-bit floating-point, some could not do it at all. The restriction to power-of-two
            texture sizes was also lifted, to varying degrees. None of ATI's hardware of this era
            fully supported this when used with mipmapping, but NVIDIA's hardware from the GeForce 6
            and above did.</para>
        <para>The ability to access textures from vertex shaders was also introduced in this series
            of hardware. Vertex texture accesses used a separate list of textures from those bound
            for fragment shaders. Only four textures could be accessed from a vertex shader, while 8
            textures were normal for fragment shaders.</para>
        <para>Render to texture also became generally available at this time, though this was more
            of an API issue (neither OpenGL nor Direct3D allowed textures to be used as render
            targets before this point) than hardware functionality. That is not to say that hardware
            had no role to play. Textures are often not stored as linear arrays of memory the way
            they are loaded with <function>glTexImage</function>. They are usually stored in a
            swizzled format, where 2D or 3D blocks of texture data are stored sequentially. Thus,
            rendering to a texture required either the ability to render directly to swizzled
            formats or the ability to read textures that are stored in unswizzled formats.</para>
        <para>More than just render to texture was introduced. What was also introduced was the
            ability to render to multiple textures or buffers at one time. The number of renderable
            buffers was generally limited to 4 across all hardware platforms.</para>
        <sidebar>
            <title>Rise of the Compilers</title>
            <para>Microsoft put their foot down after the fiasco with D3D 8's fragment shaders. They
                wanted a single standard that all hardware makers would support. While this led to
                the FX's performance failings, it also meant that compilers were becoming very
                important to shader performance.</para>
            <para>In order to have a real abstraction, you need compilers that are able to take the
                abstract language and map it to very different kinds of hardware. With Direct3D and
                OpenGL providing standards for shading languages, compiler quality started to become
                vital for performance.</para>
            <para>OpenGL moved whole-heartedly, and perhaps incautiously, into the realm of
                compilers when the OpenGL ARB embraced GLSL, a C-style language. They developed this
                language to the exclusion of all others.</para>
            <para>In Direct3D land, Microsoft developed the High-Level Shading Language, HLSL. But
                the base shading languages used by Direct3D 9 were still the assembly-like shading
                languages. HLSL was compiled by a Microsoft-developed compiler into the assembly
                languages, which were fed to Direct3D.</para>
            <para>With compilers and semi-real languages with actual logic constructs, a new field
                started to arise: general-purpose GPU programming, or <acronym>GPGPU</acronym>. The
                idea was to use a GPU to do non-rendering tasks. It started around this era, but the
                applications were limited due to the nature of the hardware. Only fairly recently,
                with the advent of special languages and APIs (OpenCL, for example) that are
                designed for GPGPU tasks, has GPGPU really come into its own. Indeed, in the most
                recent hardware era, hardware makers have added features to GPUs that have
                somewhat... dubious uses in the field of graphics, but substantial uses in GPGPU
                tasks.</para>
        </sidebar>
    </section>
    <section>
        <?dbhtml filename="History Unified.html" ?>
        <title>Modern Unification</title>
        <para>Welcome to the modern era. All of the examples in this book are designed on and for
            this era of hardware, though some of them could run on older ones with some alteration.
            The release of the Radeon HD 2000 and GeForce 8000 series cards in 2006 represented
            unification in more ways than one.</para>
        <para>With the prior generations, fragment hardware had certain platform-specific
            peculiarities. While the API kinks were mostly ironed out with the development of proper
            shading languages, there were still differences in the behavior of hardware. While 4
            dependent texture accesses were sufficient for most applications, naive use of shading
            languages could get you in trouble on ATI hardware.</para>
        <para>With this generation, neither side offered any real functionality difference.
            There are still differences between the hardware lines, and certainly in terms of
            performance. But the functionality differences have never been more blurred than they
            were with this revision.</para>
        <para>Another form of unification was that both NVIDIA and ATI moved to a unified shader
            architecture. In all prior generations, fragment shaders and vertex shaders were
            fundamentally different hardware. Even when they started doing the same kinds of things,
            such as accessing textures, they were both using different physical hardware to do so.
            This led to some inefficiencies.</para>
        <para>Deferred rendering probably gives the most explicit illustration of the problem. The
            first pass, the creation of the g-buffers, is a very vertex-shader-intensive activity.
            While the fragment shader can be somewhat complex, doing several texture fetches to
            compute various material parameters, the vertex shader is where much of the real work is
            done. Lots of vertices come through the shader, and if there are any complex
            transformations, they will happen here.</para>
        <para>The second pass is a <emphasis>very</emphasis> fragment-shader-intensive pass. Each
            light layer is composed of exactly 4 vertices, which can be provided directly in
            clip-space. From then on, the fragment shader does all of the work. It performs
            all of the complex lighting calculations necessary for the various rendering techniques.
            Four vertices generate literally millions of fragments, depending on the rendering
            resolution.</para>
        <para>In prior hardware generations, in the first pass, there would be fragment shaders
            going to waste, as they would process fragments faster than the vertex shaders could
            deliver triangles. In the second pass, the reverse happens, only even more so: four
            vertex shader executions, and then all of those vertex shader units would sit completely
            idle. All of those parallel computational units would go to waste.</para>
        <para>Both NVIDIA and ATI devised hardware such that the computational elements were
            separated from their particular kind of computations. All shader hardware could be used
            for vertices, fragments, or geometry shaders (new in this generation). This would be
            changed on demand, based on the resource load. This makes deferred rendering in
            particular much more efficient; the second pass is able to use almost all of the
            available shader resources for lighting operations.</para>
        <para>This unified shader approach also means that every shader stage has essentially the
            same capabilities. The standard for the maximum texture count is 16, which is plenty
            for doing just about anything. This applies equally to all shader types, so
            vertex shaders have the same number of textures available as fragment shaders.</para>
        <para>This smoothed out a great many things. Shaders gained quite a few new features.
            Uniform buffers became available. Shaders could perform computations directly on integer
            values. Unlike every generation before, all of these features were parceled out to all
            types of shaders equally.</para>
        <para>Along with unified shaders came a long list of various and sundry improvements to
            non-shader hardware. These include, but are not limited to:</para>
        <itemizedlist>
            <listitem>
                <para>Floating-point blending was worked out fully. Hardware of this era supports
                    full 32-bit floating point blending, though for performance reasons you're still
                    advised to use the lowest precision you can get away with.</para>
            </listitem>
            <listitem>
                <para>Arbitrary texture swizzling as a direct part of texture sampling parameters,
                    rather than in the shader itself.</para>
            </listitem>
            <listitem>
                <para>Integer texture formats, to complement the shader's ability to use integer
                    values.</para>
            </listitem>
            <listitem>
                <para>Array textures.</para>
            </listitem>
        </itemizedlist>
        <para>Various other limitations were expanded as well.</para>
        <sidebar>
            <title>Post-Modern</title>
            <para>This was not the end of hardware evolution; there has been hardware released in
                recent years. The Radeon HD 5000 and GeForce 400 series and above have increased
                rendering features. They're just not as big of a difference compared to what came
                before.</para>
            <para>One of the biggest new features in this hardware is tessellation, the ability to
                take triangles output from a vertex shader and split them into new triangles based
                on arbitrary (mostly) shader logic. This sounds like what geometry shaders can do,
                but it is different.</para>
            <para>Tessellation is actually something that ATI toyed around with for years. The
                Radeon 8500 had tessellation support with something they called PN triangles. This
                was very automated and not particularly useful. The entire Radeon HD 2000-4000 line of cards
                included tessellation features as well. These were pre-vertex shader, while the
                current version comes post-vertex shader.</para>
            <para>In the older form, the vertex shader would serve double duty. An incoming triangle
                would be broken down into many triangles. The vertex shader would then have to
                compute the per-vertex attributes for each of the new triangles, based on the old
                attributes and which vertex in the new series of vertices is being computed. Then it
                would do its normal transformation and other operations on those attributes.</para>
            <para>The current form introduces two new shader stages. The first, immediately after
                the vertex shader, controls how much tessellation happens on a particular primitive.
                The tessellation happens, splitting the single primitive into multiple primitives.
                The next stage determines how to compute the new positions, normals, etc. of the
                primitive, based on the values of the primitive being tessellated. The geometry
                shader still exists; it is executed after the final tessellation shader
                stage.</para>
            <para>Another feature is the ability to have a shader arbitrarily read
                    <emphasis>and</emphasis> write to images in textures. This is not merely
                sampling from a texture; it uses a different interface (no filtering), and it means
                very different things. This form of image data access breaks many of the rules
                around OpenGL, and it is very easy to use the feature wrongly.</para>
            <para>These are not covered in this book for a few reasons. First, there is not as much
                hardware out there that supports it (though this is increasing daily). Sticking to
                OpenGL 3.3 meant casting a wider net; requiring OpenGL 4.2 would have meant fewer
                people could run those tutorials.</para>
            <para>Second, these features are quite complicated to use. Any discussion of
                tessellation would require discussing tessellation algorithms, which are all quite
                complicated. Any discussion of image reading/writing would require talking about
                shader hardware at a level of depth that is well beyond the beginner level. These
                are useful features, to be sure, but they are also very complex features.</para>
        </sidebar>
    </section>
</appendix>