Commits

Jason McKesson committed 4f25b36 Merge

Merge

Files changed (4)

Documents/History of Graphics Hardware.xml

-<?xml version="1.0" encoding="UTF-8"?>
-<?oxygen RNGSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng" type="xml"?>
-<?oxygen SCHSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng"?>
-<appendix xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
-    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
-    <?dbhtml filename="History of Graphics Hardware.html" ?>
-    <info>
-        <title>History of PC Graphics Hardware</title>
-        <subtitle>A Programmer's View</subtitle>
-    </info>
-    <para>For those of you who had the good fortune of not being graphics programmers during the
-        formative years of the development of consumer graphics hardware, what follows is a brief
-        history. Hopefully, it will give you some perspective on what has changed in the last 15
-        years or so, as well as an idea of how grateful you should be that you never had to suffer
-        through the early days.</para>
-    <section>
-        <title>Voodoo Magic</title>
-        <para>In the years 1995 and 1996, a number of graphics cards were released. Graphics
-            processing via specialized hardware on PC platforms was nothing new. What was new about
-            these cards was their ability to do 3D rasterization.</para>
-        <para>The most popular of these for that era was the Voodoo Graphics card from 3Dfx
-            Interactive. It was fast, powerful for its day, and provided high quality rendering
-            (again, for its day).</para>
-        <para>The functionality of this card was quite bare-bones from a modern perspective.
-            Obviously there was no concept of shaders of any kind. Indeed, it did not even have
-            vertex transformation; the Voodoo Graphics pipeline begins with clip-space values. This
-            required the CPU to do vertex transformations. This hardware was effectively just a
-            triangle rasterizer.</para>
-        <para>That being said, it was quite good for its day. As inputs to its rasterization
-            pipeline, it took vertex inputs of a 4-dimensional clip-space position (though the
-            actual space was not necessarily the same as OpenGL's clip-space), a single RGBA color,
-            and a single three-dimensional texture coordinate. The hardware did not support 3D
-            textures; the extra component was in case the user wanted to do projective
-            texturing.</para>
-        <para>The texture coordinate was used to map into a single texture. The texture coordinate
-            and color interpolation was perspective-correct; in those days, that was a significant
-            selling point. The venerable Playstation 1 could not do perspective-correct
-            interpolation.</para>
-        <para>The value fetched from the texture could be combined with the interpolated color using
-            one of three math functions: addition, multiplication, or linear interpolation based on
-            the texture's alpha value. The alpha of the output was controlled with a separate math
-            function, thus allowing the user to generate the alpha with different math than the RGB
-            portion of the output color. This was the sum total of its fragment processing.</para>
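-        <para>In modern GLSL terms, that per-fragment math amounts to something like the
-            following sketch. The names are purely illustrative; the hardware exposed no shading
-            language of any kind, and only one of the commented RGB combine modes could be
-            selected at a time.</para>
-        <programlisting>#version 330
-// A sketch of the Voodoo Graphics fragment math; all names are illustrative.
-in vec4 interpColor;   // the single interpolated per-vertex RGBA color
-in vec3 texCoord;      // the third component existed only for projective texturing
-uniform sampler2D colorTex;
-out vec4 outputColor;
-
-void main()
-{
-    vec4 texel = texture(colorTex, texCoord.xy);
-    // one of the three selectable RGB combine modes:
-    vec3 rgb = texel.rgb * interpColor.rgb;                 // multiplication
-    // vec3 rgb = texel.rgb + interpColor.rgb;              // addition
-    // vec3 rgb = mix(interpColor.rgb, texel.rgb, texel.a); // lerp by the texture's alpha
-    // the output alpha comes from its own, separately selected function
-    outputColor = vec4(rgb, texel.a * interpColor.a);
-}</programlisting>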
-        <para>It even had framebuffer blending support. Its framebuffer could also support a
-            destination alpha value, though you had to give up having a depth buffer to get it.
-            Probably not a good tradeoff. Outside of that issue, its blending support was superior
-            even to OpenGL 1.1. It could use different source and destination factors for the alpha
-            component than the RGB component; the old GL 1.1 forced the RGB and A to be blended with
-            the same factors.</para>
-        <para>The blending was even performed with full 24-bit color precision and then downsampled
-            to the 16-bit precision of the output upon writing.</para>
-        <para>From a modern perspective, spoiled with our full programmability, this all looks
-            incredibly primitive. And, to some degree, it is. But compared to the pure CPU solutions
-            to 3D rendering of the day, the Voodoo Graphics card was a monster.</para>
-        <para>It's interesting to note that the simplicity of the fragment processing stage owes as
-            much to the lack of inputs as anything else. When the only values you have to work with
-            are the color from a texture lookup and the per-vertex interpolated color, there really
-            is not all that much you could do with them. Indeed, as we will see in the next phases of
-            hardware, increases in the complexity of the fragment processor were a reaction to
-            increasing the number of inputs <emphasis>to</emphasis> the fragment processor. When you
-            have more data to work with, you need more complex operations to make that data
-            useful.</para>
-    </section>
-    <section>
-        <?dbhtml filename="History TNT.html" ?>
-        <title>Dynamite Combiners</title>
-        <para>The next phase of hardware came, not from 3Dfx, but from a new company, NVIDIA. While
-            3Dfx's Voodoo II was much more popular than NVIDIA's product, the NVIDIA Riva TNT
-            (released in 1998) was more interesting in terms of what it brought to the table for
-            programmers. Voodoo II was purely a performance improvement; TNT was the next step in
-            the evolution of graphics hardware.</para>
-        <para>Like other graphics cards of the day, the TNT hardware had no vertex processing.
-            Vertex data was in clip-space, as normal, so the CPU had to do all of the transformation
-            and lighting. Where the TNT shone was in its fragment processing.</para>
-        <para>The power of the TNT is in its name; TNT stands for
-                <acronym>T</acronym>wi<acronym>N</acronym>
-            <acronym>T</acronym>exel. Where other graphics cards could only allow a triangle to use
-            a single texture, the TNT allowed it to use two.</para>
-        <para>This meant that its vertex input data was expanded. Two textures meant two texture
-            coordinates, since each texture coordinate was directly bound to a particular texture.
-            While they were at it, they also allowed for two per-vertex colors. The
-            idea here has to do with lighting equations.</para>
-        <para>For regular diffuse lighting, the CPU-computed color would simply be dot(N, L),
-            possibly with attenuation applied. Indeed, it could be any complicated diffuse lighting
-            function, since it was all on the CPU. This diffuse light intensity would be multiplied
-            by the texture, which represented the diffuse absorption of the surface at that
-            point.</para>
-        <para>This becomes less useful if you want to add a specular term. The specular absorption
-            and diffuse absorption are not necessarily the same, after all. And while you may not
-            need to have a specular texture, you do not want to add the specular component to the
-            diffuse component <emphasis>before</emphasis> you multiply by their respective colors.
-            You want to do the addition afterwards.</para>
-        <para>This is simply not possible if you have only one per-vertex color. But it becomes
-            possible if you have two. One color is the diffuse lighting value. The other color is
-            the specular component. We multiply the first color by the diffuse color from the
-            texture, then add the second color as the specular reflectance.</para>
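-        <para>In modern GLSL terms, the intended math looks something like the following sketch.
-            The names are illustrative; the TNT itself had no shaders of any kind.</para>
-        <programlisting>#version 330
-// A sketch of the two-color lighting trick described above.
-in vec4 diffuseLight;    // first per-vertex color: CPU-computed diffuse intensity
-in vec4 specularLight;   // second per-vertex color: CPU-computed specular intensity
-in vec2 texCoord;
-uniform sampler2D diffuseTex;
-out vec4 outputColor;
-
-void main()
-{
-    // multiply the diffuse term by the texture first,
-    // then add the specular term afterwards
-    outputColor = diffuseLight * texture(diffuseTex, texCoord) + specularLight;
-}</programlisting>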
-        <para>Which brings us nicely to fragment processing. The TNT's fragment processor had 5
-            inputs: 2 colors sampled from textures, 2 colors interpolated from vertices, and a
-            single <quote>constant</quote> color. The latter, in modern parlance, is the equivalent
-            of a shader uniform value.</para>
-        <para>That's a lot of potential inputs. The solution NVIDIA came up with to produce a final
-            color was a bit of fixed functionality that we will call the texture environment. It is
-            directly analogous to the OpenGL 1.1 fixed-function pipeline, but with extensions for
-            multiple textures and some TNT-specific features.</para>
-        <para>The idea is that each texture has an environment. The environment is a specific math
-            function, such as addition, subtraction, multiplication, and linear interpolation. The
-            operands to this function could be taken from any of the fragment inputs, as well as a
-            constant zero color value.</para>
-        <para>It can also use the result from the previous environment as one of its arguments.
-            Textures and environments are numbered, from zero to one (two textures, two
-            environments). The first one executes, followed by the second.</para>
-        <para>If you look at it from a hardware perspective, what you have is a two-opcode assembly
-            language. The available registers for the language are two vertex colors, a single
-            uniform color, two texture colors, and a zero register. There is also a single temporary
-            register to hold the output from the first opcode.</para>
-        <para>Graphics programmers, by this point, had gotten used to multipass-based algorithms.
-            After all, until TNT, that was the only way to apply multiple textures to a single
-            surface. And even with TNT, it had a pretty confining limit of two textures and two
-            opcodes.</para>
-        <para>This was powerful, but quite limited. Two opcodes really was not enough.</para>
-        <para>The TNT cards also provided something else: 32-bit framebuffers and depth buffers.
-            While the Voodoo cards used high-precision math internally, they still wrote to 16-bit
-            framebuffers, using a technique called dithering to make them look like higher
-            precision. But dithering was nothing compared to actual high precision framebuffers. And
-            it did nothing for the depth buffer artifacts that a 16-bit depth buffer gave
-            you.</para>
-        <para>While the original TNT could do 32-bit, it lacked the memory and overall performance
-            to really show it off. That had to wait for the TNT2. Combined with product delays and
-            some poor strategic moves by 3Dfx, NVIDIA became one of the dominant players in the
-            consumer PC graphics card market. And that was cemented by their next card, which had
-            real power behind it.</para>
-        <sidebar>
-            <title>Tile-Based Rendering</title>
-            <para>While all of this was going on, a small company called PowerVR released its Series
-                2 graphics chip. PowerVR's approach to rendering was fundamentally different from
-                the standard rendering pipeline.</para>
-            <para>They used what they called a <quote>deferred, tile-based renderer.</quote> The
-                idea is that they store all of the clip-space triangles in a buffer. Then, they sort
-                this buffer based on which triangles cover which areas of the screen. The output
-                screen is divided into a number of tiles of a fixed size. Say, 8x8 in size.</para>
-            <para>For each tile, the hardware finds the triangles that are within that tile's area.
-                Then it does all the usual scan conversion tricks and so forth. It even
-                automatically does per-pixel depth sorting for blending, which remains something of
-                a selling point (no more having to manually sort blended objects). After rendering
-                that tile, it moves on to the next. These operations can of course be executed in
-                parallel; you can have multiple tiles being rasterized at the same time.</para>
-            <para>The idea behind this is to avoid having large image buffers. You only need a few 8x8
-                depth buffers, so you can use very fast, on-chip memory for it. Rather than having
-                to deal with caches, DRAM, and large bandwidth memory channels, you just have a
-                small block of memory where you do all of your logic. You still need memory for
-                textures and the output image, but your bandwidth needs can be devoted solely to
-                textures.</para>
-            <para>For a time, these cards were competitive with the other graphics chip makers.
-                However, the tile-based approach simply did not scale well with resolution or
-                geometry complexity. Also, they missed the geometry processing bandwagon, which
-                really hurt their standing. They fell farther and farther behind the other major
-                players, until they stopped making desktop parts altogether.</para>
-            <para>However, they may ultimately have the last laugh; unlike 3Dfx and so many others,
-                PowerVR still exists. They provided the GPU for the Sega Dreamcast console. And
-                while that console was a market failure, it did show where PowerVR's true strength
-                lay: embedded platforms.</para>
-            <para>Embedded platforms tend to play to their tile-based renderer's strengths. Memory,
-                particularly high-bandwidth memory, eats up power; having less memory means
-                longer-lasting mobile devices. Embedded devices tend to use smaller resolutions,
-                which their platform excels at. And with low resolutions, you are not trying to push
-                nearly as much geometry.</para>
-            <para>Thanks to these facts, PowerVR graphics chips power the vast majority of mobile
-                platforms that have any 3D rendering in them. Just about every iPhone, Droid, iPad,
-                or similar device is running PowerVR technology. And that's a growth market these
-                days.</para>
-        </sidebar>
-    </section>
-    <section>
-        <?dbhtml filename="History GeForce.html" ?>
-        <title>Vertices and Registers</title>
-        <para>The next stage in the evolution of graphics hardware again came from NVIDIA. While
-            3Dfx released competing cards, they were again behind the curve. The NVIDIA GeForce 256
-            (not to be confused with the GeForce GTS 250, a much more modern card), released in 1999,
-            provided something truly new: a vertex processing pipeline.</para>
-        <para>The OpenGL API has always defined a vertex processing pipeline (it was fixed-function
-            in those days rather than shader-based). And NVIDIA implemented it in their TNT-era
-            drivers on the CPU. But only with the GeForce 256 was this actually implemented in
-            hardware. And NVIDIA essentially built the entire OpenGL fixed-function vertex
-            processing pipeline directly into the GeForce hardware.</para>
-        <para>This was primarily a performance win. While it was important for the progress of
-            hardware, a less-well-known improvement of the early GeForce hardware was more important
-            to its future.</para>
-        <para>In the fragment processing pipeline, the texture environment stages were removed. In
-            their place was a more powerful mechanism, what NVIDIA called <quote>register
-                combiners.</quote></para>
-        <para>The GeForce 256 provided 2 regular combiner stages. Each of these stages represented
-            up to four independent opcodes that operated over the register set. The opcodes could
-            result in multiple outputs, which could be written to two temporary registers.</para>
-        <para>What is interesting is that the register values are no longer limited to color values.
-            Instead, they are signed values, on the range [-1, 1]; they have 9 bits of precision or
-            so. While the initial color or texture values are on [0, 1], the actual opcodes
-            themselves can perform operations that generate negative values. Opcodes can even
-            scale/bias their inputs, which allows them to turn unsigned colors into signed
-            values.</para>
-        <para>Because of this, the GeForce 256 was the first hardware to be able to do functional
-            bump mapping, without hacks or tricks. A single register combiner stage could do 2
-            3-vector dot-products at a time. Textures could store normals by compressing them to a
-            [0, 1] range. The light direction could either be a constant or interpolated per-vertex
-            in texture space.</para>
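-        <para>In modern GLSL terms, what a single combiner stage could compute looks something
-            like this sketch. Again, the names are illustrative; the actual hardware had no
-            shading language.</para>
-        <programlisting>#version 330
-// A sketch of register-combiner-style bump mapping.
-in vec2 texCoord;
-in vec3 lightDirTS;           // light direction, interpolated in texture space
-uniform sampler2D normalMap;  // normals stored compressed into the [0, 1] range
-out vec4 outputColor;
-
-void main()
-{
-    // scale/bias: expand the stored [0, 1] values back to [-1, 1]
-    vec3 normal = texture(normalMap, texCoord).rgb * 2.0 - 1.0;
-    // one of the combiner's 3-vector dot products
-    float diffuse = max(dot(normal, lightDirTS), 0.0);
-    outputColor = vec4(vec3(diffuse), 1.0);
-}</programlisting>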
-        <para>Now granted, this still was a primitive form of bump mapping. There was no way to
-            correct for texture-space values with binormals and tangents. But this was at least
-            something. And it really was the first step towards programmability; it showed that
-            textures could truly represent values other than colors.</para>
-        <para>There was also a single final combiner stage. This was a much more limited stage than
-            the regular combiner stages. It could do a linear interpolation operation and an
-            addition; this was designed specifically to implement OpenGL's fixed-function fog and
-            specular computations.</para>
-        <para>The register file consisted of two temporary registers, two per-vertex colors, two
-            texture colors, two uniform values, the zero register, and a few other values used for
-            OpenGL fixed-function fog operations. The color and texture registers were even
-            writeable, if you needed more temporaries.</para>
-        <para>There were a few other sundry additions to the hardware. Cube textures first came onto
-            the scene. Combined with the right texture coordinate computations (now in hardware),
-            you could have reflective surfaces much more easily. Anisotropic filtering and
-            multisampling also appeared at this time. The limits were relatively small; anisotropic
-            filtering was limited to 4x, while the maximum number of samples was restricted to two.
-            Compressed texture formats also appeared on the scene.</para>
-        <para>What we see thus far as we take steps towards true programmability is that increased
-            complexity in fragment processing starts pushing for other needs. The addition of a dot
-            product allows lighting computations to take place per-fragment. But you cannot have full
-            texture-space bump mapping because of the lack of a normal/binormal/tangent matrix to
-            transform vectors to texture space. Cubemaps allow you to do arbitrary reflections, but
-            computing reflection directions per-vertex requires interpolating reflection normals,
-            which does not work very well over large polygons.</para>
-        <para>This also saw the introduction of something called a rectangle texture. This was
-            something of an odd duck that still remains to this day. It was a way of creating a
-            texture of arbitrary size; until then, textures were limited to powers of two in size
-            (though the sizes did not have to be the same). The texture coordinates were not
-            normalized either; they were in texel-space values.</para>
-        <sidebar>
-            <title>The GPU Divide</title>
-            <para>When NVIDIA released the GeForce 256, they coined the term <quote>Graphics
-                    Processing Unit</quote> or <acronym>GPU</acronym>. Until this point, graphics
-                chips were called exactly that: graphics chips. The term GPU was intended by NVIDIA
-                to differentiate the GeForce from all of its competition, including the final cards
-                from 3Dfx.</para>
-            <para>Because the term was so reminiscent of CPUs, it took over. Every graphics
-                chip is a GPU now, even ones released before the term came to exist.</para>
-            <para>In truth, the term GPU never really made much sense until the next stage, where
-                the first cards with actual programmability came onto the scene.</para>
-        </sidebar>
-    </section>
-    <section>
-        <?dbhtml filename="History Radeon8500.html" ?>
-        <title>Programming at Last</title>
-        <para>How do you define a demarcation between non-programmable graphics chips and
-            programmable ones? We have seen that, even in the humble TNT days, there were a couple
-            of user-defined opcodes with several possible input values.</para>
-        <para>One way is to consider what programming is. Programming is not simply a mathematical
-            operation; programming needs conditional logic. Therefore, it is not unreasonable to say
-            that something is not truly programmable until there is the possibility of some form of
-            conditional logic.</para>
-        <para>And it is at this point where conditional logic first truly appears. It appears first in the
-                <emphasis>vertex</emphasis> pipeline rather than the fragment pipeline. This seems
-            odd until one realizes how crucial fragment operations are to overall performance. It
-            therefore makes sense to introduce heavy programmability in the less
-            performance-critical areas of hardware first.</para>
-        <para>The GeForce 3, released in 2001 (a mere 3 years after the TNT), was the first hardware
-            to provide this level of programmability. While GeForce 3 hardware did indeed have the
-            fixed-function vertex pipeline, it also had a very flexible programmable pipeline.
-            Retaining the fixed-function hardware was a performance need; the vertex shader was not
-            as fast as the fixed-function one. It should be noted that the original Xbox's GPU,
-            designed in tandem with the GeForce 3, eschewed the fixed-functionality altogether in
-            favor of having multiple vertex shaders that could compute several vertices at a time.
-            This was eventually adopted for later GeForces.</para>
-        <para>Vertex shaders were pretty powerful, even in their first incarnation. While there was
-            no conditional branching, there was conditional logic, the equivalent of the ?:
-            operator. These vertex shaders exposed up to 128 <type>vec4</type> uniforms, up to 16
-                <type>vec4</type> inputs (still the modern limit), and could output 6
-                <type>vec4</type> outputs. Two of the outputs, intended for colors, were lower
-            precision than the others. There was a hard limit of 128 opcodes. These vertex shaders
-            brought full swizzling support and a plethora of math operations.</para>
-        <para>The GeForce 3 also added up to two more textures, for a total of four textures per
-            triangle. They were hooked directly into certain per-vertex outputs, because the
-            per-fragment pipeline did not have real programmability yet.</para>
-        <para>At this point, the holy grail of programmability at the fragment level was dependent
-            texture access. That is, being able to access a texture, do some arbitrary computations
-            on it, and then access another texture with the result. The GeForce 3 had some
-            facilities for that, but they were not very good ones.</para>
-        <para>The GeForce 3 used 8 register combiner stages instead of the 2 that the earlier cards
-            used. Their register files were extended to support two extra texture colors and a few
-            more tricks. But the main change was something that, in OpenGL terminology, would be
-            called <quote>texture shaders.</quote></para>
-        <para>What texture shaders did was allow the user, instead of accessing a texture, to
-            perform a computation on that texture unit's coordinates. This was much like the old texture
-            environment functionality, except only for texture coordinates. The textures were
-            arranged in a sequence. And instead of accessing a texture, you could perform a
-            computation between that texture unit's coordinate and possibly the coordinate from the
-            previous texture shader operation, if there was one.</para>
-        <para>It was not very flexible functionality. It did allow for full texture-space bump
-            mapping, though. While the 8 register combiners were enough to do a full matrix
-            multiply, they were not powerful enough to normalize the resulting vector. However, you
-            could normalize a vector by accessing a special cubemap. The values of this cubemap
-            represented a normalized vector in the direction of the cubemap's given texture
-            coordinate.</para>
-        <para>But using that required spending a total of 3 texture shader stages. Which meant you
-            got a bump map and a normalization cubemap only; there was no room for a diffuse map in
-            that pass. It also did not perform very well; the texture shader functions were quite
-            expensive.</para>
-        <para>True programmability came to the fragment shader from ATI, with the Radeon 8500,
-            released in late 2001.</para>
-        <para>The 8500's fragment shader architecture was pretty straightforward, and in terms of
-            programming, it is not too dissimilar to modern shader systems. Texture coordinates
-            would come in. They could either be used to fetch from a texture or be given directly as
-            inputs to the processing stage. Up to 6 textures could be used at once. Then, up to 8
-            opcodes, including a conditional operation, could be used. After that, the hardware
-            would repeat the process using registers written by the opcodes. Those registers could
-            feed texture accesses from the same group of textures used in the first pass. And then
-            another 8 opcodes would generate the output color.</para>
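-        <para>The important new ability here is the dependent texture access. In modern GLSL
-            terms it looks something like the following sketch; the texture names and the
-            distortion math are purely illustrative.</para>
-        <programlisting>#version 330
-// A sketch of a dependent texture access: the result of the first phase's
-// arithmetic feeds the texture coordinate used in the second phase.
-in vec2 texCoord;
-uniform sampler2D offsetMap;   // e.g. a distortion or bump texture
-uniform sampler2D colorTex;
-out vec4 outputColor;
-
-void main()
-{
-    // first phase: fetch, then some arithmetic on the fetched value
-    vec2 offset = texture(offsetMap, texCoord).rg * 2.0 - 1.0;
-    // second phase: the computed value is used to access another texture
-    outputColor = texture(colorTex, texCoord + 0.05 * offset);
-}</programlisting>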
-        <para>It also had strong, but not full, swizzling support in the fragment shader. Register
-            combiners had very little support for swizzling.</para>
-        <para>This era of hardware was also the first to allow 3D textures. Though that was as much
-            a memory concern as anything else, since 3D textures take up lots of memory which was
-            not available on earlier cards. Depth comparison texturing was also made
-            available.</para>
-        <para>While the 8500 was a technological marvel, it was a flop in the market compared to the
-            GeForce 3 &amp; 4. Indeed, this is a recurring theme of these eras: the card with the
-            more programmable hardware often tends to lose in its first iteration.</para>
-        <sidebar>
-            <title>API Hell</title>
-            <para>This era is notable in what it did to graphics APIs. Consider the hardware
-                differences between the 8500 and the GeForce 3/4 in terms of fragment
-                processing.</para>
-            <para>On the Direct3D front, things were not the best. Direct3D 8 promised a unified
-                shader development pipeline. That is, you could write a shader according to their
-                specifications and it would work on any D3D 8 hardware. And this was effectively
-                true. For vertex shaders, at least.</para>
-            <para>However, the D3D 8.0 pixel shader pipeline was nothing more than NVIDIA's register
-                combiners and texture shaders. There was no real abstraction of capabilities; the
-                D3D 8.0 pixel shaders simply took NVIDIA's hardware and made a shader language out
-                of it.</para>
-            <para>To provide support for the 8500's expanded fragment processing feature-set, there
-                was D3D 8.1. This version altered the pixel shader pipeline to match the
-                capabilities of the Radeon 8500. Fortunately, the 8500 would accept 8.0 shaders just
-                fine, since it was capable of doing everything the GeForce 3 could do. But no one
-                would mistake either shader specification for any kind of real abstraction.</para>
-            <para>Things were much worse on the OpenGL front. At least in D3D, you used the same
-                basic C++ API to provide shaders; the shaders themselves may have been different,
-                but the base API was the same. Not so in OpenGL land.</para>
-            <para>NVIDIA and ATI released entirely separate proprietary extensions for specifying
-                fragment shaders. NVIDIA's extensions built on the register combiner extension they
-                released with the GeForce 256. They were completely incompatible. And worse, they
-                were not even string-based.</para>
-            <para>Imagine having to call a C++ function to write every opcode of a shader. Now
-                imagine having to call <emphasis>three</emphasis> functions to write each opcode.
-                That's what using those APIs was like.</para>
-            <para>Things were better on vertex shaders. NVIDIA initially released a vertex shader
-                extension, as did ATI. NVIDIA's was string-based, but ATI's version was like their
-                fragment shader. Fortunately, this state of affairs did not last long; the OpenGL
-                ARB came along with their own vertex shader extension. This was not GLSL, but an
-                assembly-like language based on NVIDIA's extension.</para>
-            <para>It would take much longer for the fragment shader disparity to be worked
-                out.</para>
-        </sidebar>
-    </section>
-    <section>
-        <?dbhtml filename="History GeForceFX.html" ?>
-        <title>Dependency</title>
-        <para>The Radeon 9700 was the 8500's successor. It improved on the 8500 somewhat. The vertex
-            shader gained real conditional branching logic. Some of the limits were also relaxed;
-            the number of available outputs and uniforms increased. The fragment shader's
-            architecture remained effectively the same; the 9700 simply increased the limits. There
-            were 8 textures available and 16 opcodes, and it could perform 4 passes over this
-            set.</para>
-        <para>The GeForce FX, released in 2003, was a substantial improvement, both over the GeForce
-            3/4 and over the 9700 in terms of fragment processing. NVIDIA took a different approach
-            to their fragment shaders; their fragment processor worked not entirely unlike modern
-            shader processors do.</para>
-        <para>It read an instruction, which could be a math operation, conditional branch (they had
-            actual branches in fragment shading), or texture lookup instruction. It then executed
-            that instruction. The texture lookup could be from a set of 8 textures. And then it
-            repeated this process on the next instruction. It was doing math computations in a way
-            not entirely unlike a traditional CPU.</para>
-        <para>There was no real concept of a dependent texture access for the GeForce FX. The inputs
-            to the fragment pipeline were simply the texture coordinates and colors from the vertex
-            stage. If you used a texture coordinate to access a texture, it was fine with that. If
-            you did some computations with them and then accessed a texture, it was just as fine
-            with that. It was completely generic.</para>
-        <para>It also failed in the marketplace. This was due primarily to its lateness and its poor
-            performance in high-precision computation operations. The FX was optimized for doing
-            16-bit math computations in its fragment shader; while it <emphasis>could</emphasis> do
-            32-bit math, it was half as fast when doing this. But Direct3D 9's shaders did not allow
-            the user to specify the precision of computations; the specification required at least
-            24 bits of precision. To match this, NVIDIA had no choice but to force 32-bit math on
-            all D3D 9 applications, making them run much slower than their ATI counterparts (the
-            9700 always used 24-bit precision math).</para>
-        <para>Things were no better in OpenGL land. The two competing unified fragment processing
-            APIs, GLSL and an assembly-like fragment shader, did not have precision specifications
-            either. Only NVIDIA's proprietary extension for fragment shaders provided that, and
-            developers were less likely to use it. Especially with the head start that the 9700
-            gained in the market by the FX being released late.</para>
-        <para>It performed so poorly in the market that NVIDIA dropped the FX name for the next
-            hardware revision. The GeForce 6 improved its 32-bit performance to the point where it
-            was competitive with the ATI equivalents.</para>
-        <para>This level of hardware gained a number of different features. sRGB
-            textures and framebuffers appeared, as did floating-point textures. Blending support for
-            floating-point framebuffers was somewhat spotty; some hardware could do it only for
-            16-bit floating-point, some could not do it at all. The restriction of power-of-two
-            texture sizes was also lifted, to varying degrees. None of ATI's hardware of this era
-            fully supported this when used with mipmapping, but NVIDIA's hardware from the GeForce 6
-            and above did.</para>
-        <para>The ability to access textures from vertex shaders was also introduced in this series
-            of hardware. Vertex texture accesses use a separate list of textures from those bound
-            for fragment shaders. Only four textures could be accessed from a vertex shader, while 8
-            textures was normal for fragment shaders.</para>
-        <para>Render to texture also became generally available at this time, though this was more
-            of an API issue (neither OpenGL nor Direct3D allowed textures to be used as render
-            targets before this point) than hardware functionality. That is not to say that hardware
-            had no role to play. Textures are often not stored as linear arrays of memory the way
-            they are loaded with <function>glTexImage</function>. They are usually stored in a
-            swizzled format, where 2D or 3D blocks of texture data are stored sequentially. Thus,
-            rendering to a texture required either the ability to render directly to swizzled
-            formats or the ability to read textures that are stored in unswizzled formats.</para>
-        <para>More than just render to texture was introduced. What was also introduced was the
-            ability to render to multiple textures or buffers at one time. The number of renderable
-            buffers was generally limited to 4 across all hardware platforms.</para>
-        <sidebar>
-            <title>Rise of the Compilers</title>
-            <para>Microsoft put their foot down after the fiasco with D3D 8's fragment shaders. They
-                wanted a single standard that all hardware makers would support. While this led to
-                the FX's performance failings, it also meant that compilers were becoming very
-                important to shader performance.</para>
-            <para>In order to have a real abstraction, you need compilers that are able to take the
-                abstract language and map it to very different kinds of hardware. With Direct3D and
-                OpenGL providing standards for shading languages, compiler quality started to become
-                vital for performance.</para>
-            <para>OpenGL moved whole-heartedly, and perhaps incautiously, into the realm of
-                compilers when the OpenGL ARB embraced GLSL, a C-style language. They developed this
-                language to the exclusion of all others.</para>
-            <para>In Direct3D land, Microsoft developed the High-Level Shading Language, HLSL. But
-                the base shading languages used by Direct3D 9 were still the assembly-like shading
-                languages. HLSL was compiled by a Microsoft-developed compiler into the assembly
-                languages, which were fed to Direct3D.</para>
-            <para>With compilers and semi-real languages with actual logic constructs, a new field
-                started to arise: general-purpose GPU programming, or <acronym>GPGPU</acronym>. The idea was
-                to use a GPU to do non-rendering tasks. It started around this era, but the
-                applications were limited due to the nature of hardware. Only fairly recently, with
-                the advent of special languages and APIs (OpenCL, for example) that are designed for
-                GPGPU tasks, has GPGPU started to really move into its own. Indeed, in the most
-                recent hardware era, hardware makers have added features to GPUs that have
-                somewhat... dubious uses in the field of graphics, but substantial uses in GPGPU
-                tasks.</para>
-        </sidebar>
-    </section>
-    <section>
-        <?dbhtml filename="History Unified.html" ?>
-        <title>Modern Unification</title>
-        <para>Welcome to the modern era. All of the examples in this book are designed on and for
-            this era of hardware, though some of them could run on older ones. The release of the
-            Radeon HD 2000 and GeForce 8000 series cards in 2006 represented unification in more
-            ways than one.</para>
-        <para>With the prior generations, fragment hardware had certain platform-specific
-            peculiarities. While the API kinks were mostly ironed out with the development of proper
-            shading languages, there were still differences in the behavior of hardware. While 4
-            dependent texture accesses were sufficient for most applications, naive use of shading
-            languages could get you in trouble on ATI hardware.</para>
-        <para>With this generation, neither side really offered any real functionality difference.
-            There are still differences between the hardware lines, and certainly in terms of
-            performance. But the functionality differences have never been more blurred than they
-            were with this revision.</para>
-        <para>Another form of unification was that both NVIDIA and ATI moved to a unified shader
-            architecture. In all prior generations, fragment shaders and vertex shaders were
-            fundamentally different hardware. Even when they started doing the same kinds of things,
-            such as accessing textures, they were both using different physical hardware to do so.
-            This led to some inefficiencies.</para>
-        <para>Deferred rendering probably gives the most explicit illustration of the problem. The
-            first pass, the creation of the g-buffers, is a very vertex-shader-intensive activity.
-            While the fragment shader can be somewhat complex, doing several texture fetches to
-            compute various material parameters, the vertex shader is where much of the real work is
-            done. Lots of vertices come through the shader, and if there are any complex
-            transformations, they will happen here.</para>
-        <para>The second pass is a <emphasis>very</emphasis> fragment shader intensive pass. Each
-            light layer is composed of exactly 4 vertices, which can be provided directly
-            in clip-space. From then on, the fragment shader does the heavy lifting. It performs
-            all of the complex lighting calculations necessary for the various rendering techniques.
-            Four vertices generate literally millions of fragments, depending on the rendering
-            resolution.</para>
-        <para>In prior hardware generations, in the first pass, there would be fragment shaders
-            going to waste, as they would process fragments faster than the vertex shaders could
-            deliver triangles. In the second pass, the reverse happens, only more so: four
-            vertex shader executions, and then all of that vertex shader hardware would sit
-            idle. All of those parallel computational units would go to waste.</para>
-        <para>Both NVIDIA and ATI devised hardware such that the computational elements were
-            no longer tied to one particular kind of computation. All shader hardware could be used
-            for vertices, fragments, or geometry shaders (new in this generation). This would be
-            changed on demand, based on the resource load. This makes deferred rendering in
-            particular much more efficient; the second pass is able to use almost all of the
-            available shader resources for lighting operations.</para>
-        <para>This unified shader approach also means that every shader stage has essentially the
-            same capabilities. The standard for the maximum texture count is 16, which is plenty
-            for doing just about anything. This is applied equally to all shader types, so
-            vertex shaders have the same number of textures available as fragment shaders.</para>
-        <para>This smoothed out a great many things. Shaders gained quite a few new features.
-            Uniform buffers became available. Shaders could perform computations directly on integer
-            values. Unlike every generation before, all of these features were parceled out to all
-            types of shaders equally.</para>
-        <para>Along with unified shaders came a long list of various and sundry improvements to
-            non-shader hardware. These include, but are not limited to:</para>
-        <itemizedlist>
-            <listitem>
-                <para>Floating-point blending was worked out fully. Hardware of this era supports
-                    full 32-bit floating point blending, though for performance reasons you're still
-                    advised to use the lowest precision you can get away with.</para>
-            </listitem>
-            <listitem>
-                <para>Arbitrary texture swizzling as a direct part of texture sampling parameters,
-                    rather than in the shader itself.</para>
-            </listitem>
-            <listitem>
-                <para>Integer texture formats, to complement the shader's ability to use integer
-                    values.</para>
-            </listitem>
-            <listitem>
-                <para>Array textures.</para>
-            </listitem>
-        </itemizedlist>
-        <para>Various other limitations were expanded as well.</para>
-        <sidebar>
-            <title>Tessellation</title>
-            <para>This was not the end of hardware evolution; there has been hardware released in
-                recent years. The Radeon HD 5000 and GeForce 400 series and above have increased
-                rendering features. They're just not as big a difference compared to what came
-                before.</para>
-            <para>The biggest new feature in this hardware is tessellation, the ability to take
-                triangles output from a vertex shader and split them into new triangles based on
-                arbitrary (mostly) shader logic. This sounds like what geometry shaders can do, but
-                it is different.</para>
-            <para>Tessellation is actually something that ATI toyed around with for years. The
-                Radeon 9700 had tessellation support with something they called PN triangles. This
-                was very automated and not particularly useful. The entire line of Radeon HD 2000-4000 cards
-                included tessellation features as well. These were pre-vertex shader, while the
-                current version comes post-vertex shader.</para>
-            <para>In the older form, the vertex shader would serve double duty. An incoming triangle
-                would be broken down into many triangles. The vertex shader would then have to
-                compute the per-vertex attributes for each of the new triangles, based on the old
-                attributes and which vertex in the new series of vertices is being computed. Then it
-                would do its normal transformation and other operations on those attributes.</para>
-            <para>The current form introduces two new shader stages. The first, immediately after
-                the vertex shader, controls how much tessellation happens on a particular primitive.
-                The tessellation happens, splitting the single primitive into multiple primitives.
-                The next stage determines how to compute the new positions, normals, etc. of the
-                primitive, based on the values of the primitive being tessellated. The geometry
-                shader still exists; it is executed after the final tessellation shader
-                stage.</para>
-            <para>Tessellation is not covered in this book for a few reasons. First, there is not as
-                much hardware out there that supports it. Sticking to OpenGL 3.3 meant casting a
-                wider net; requiring OpenGL 4.1 (which includes tessellation) would have meant fewer
-                people could run those tutorials.</para>
-            <para>Second, tessellation is not that important. That's not to say that it is not
-                useful or a worthwhile feature. But it really is not something that matters a
-                great deal.</para>
-        </sidebar>
-    </section>
-</appendix>
+<?xml version="1.0" encoding="UTF-8"?>
+<?oxygen RNGSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng" type="xml"?>
+<?oxygen SCHSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng"?>
+<appendix xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
+    <?dbhtml filename="History of Graphics Hardware.html" ?>
+    <info>
+        <title>History of PC Graphics Hardware</title>
+        <subtitle>A Programmer's View</subtitle>
+    </info>
+    <para>For those of you who had the good fortune of not being graphics programmers during the
+        formative years of the development of consumer graphics hardware, what follows is a brief
+        history. Hopefully, it will give you some perspective on what has changed in the last 15
+        years or so, as well as an idea of how grateful you should be that you never had to suffer
+        through the early days.</para>
+    <section>
+        <title>Voodoo Magic</title>
+        <para>In the years 1995 and 1996, a number of graphics cards were released. Graphics
+            processing via specialized hardware on PC platforms was nothing new. What was new about
+            these cards was their ability to do 3D rasterization.</para>
+        <para>The most popular of these for that era was the Voodoo Graphics card from 3Dfx
+            Interactive. It was fast, powerful for its day, and provided high quality rendering
+            (again, for its day).</para>
+        <para>The functionality of this card was quite bare-bones from a modern perspective.
+            Obviously there was no concept of shaders of any kind. Indeed, it did not even have
+            vertex transformation; the Voodoo Graphics pipeline begins with clip-space values. This
+            required the CPU to do vertex transformations. This hardware was effectively just a
+            triangle rasterizer.</para>
+        <para>That being said, it was quite good for its day. As inputs to its rasterization
+            pipeline, it took vertex inputs of a 4-dimensional clip-space position (though the
+            actual space was not necessarily the same as OpenGL's clip-space), a single RGBA color,
+            and a single three-dimensional texture coordinate. The hardware did not support 3D
+            textures; the extra component was in case the user wanted to do projective
+            texturing.</para>
+        <para>The texture coordinate was used to map into a single texture. The texture coordinate
+            and color interpolation was perspective-correct; in those days, that was a significant
+            selling point. The venerable Playstation 1 could not do perspective-correct
+            interpolation.</para>
+        <para>The value fetched from the texture could be combined with the interpolated color using
+            one of three math functions: addition, multiplication, or linear interpolation based on
+            the texture's alpha value. The alpha of the output was controlled with a separate math
+            function, thus allowing the user to generate the alpha with different math than the RGB
+            portion of the output color. This was the sum total of its fragment processing.</para>
+        <para>It even had framebuffer blending support. Its framebuffer could also support a
+            destination alpha value, though you had to give up having a depth buffer to get it.
+            Probably not a good tradeoff. Outside of that issue, its blending support was superior
+            even to OpenGL 1.1. It could use different source and destination factors for the alpha
+            component than the RGB component; the old GL 1.1 forced the RGB and A to be blended with
+            the same factors.</para>
+        <para>The blending was even performed with full 24-bit color precision and then downsampled
+            to the 16-bit precision of the output upon writing.</para>
+        <para>From a modern perspective, spoiled with our full programmability, this all looks
+            incredibly primitive. And, to some degree, it is. But compared to the pure CPU solutions
+            to 3D rendering of the day, the Voodoo Graphics card was a monster.</para>
+        <para>It's interesting to note that the simplicity of the fragment processing stage owes as
+            much to the lack of inputs as anything else. When the only values you have to work with
+            are the color from a texture lookup and the per-vertex interpolated color, there really
+            is not all that much you could do with them. Indeed, as we will see in the next phases of
+            hardware, increases in the complexity of the fragment processor were a reaction to
+            increasing the number of inputs <emphasis>to</emphasis> the fragment processor. When you
+            have more data to work with, you need more complex operations to make that data
+            useful.</para>
+    </section>
+    <section>
+        <?dbhtml filename="History TNT.html" ?>
+        <title>Dynamite Combiners</title>
+        <para>The next phase of hardware came, not from 3Dfx, but from a new company, NVIDIA. While
+            3Dfx's Voodoo II was much more popular than NVIDIA's product, the NVIDIA Riva TNT
+            (released in 1998) was more interesting in terms of what it brought to the table for
+            programmers. Voodoo II was purely a performance improvement; TNT was the next step in
+            the evolution of graphics hardware.</para>
+        <para>Like other graphics cards of the day, the TNT hardware had no vertex processing.
+            Vertex data was in clip-space, as normal, so the CPU had to do all of the transformation
+            and lighting. Where the TNT shone was in its fragment processing.</para>
+        <para>The power of the TNT is in its name; TNT stands for
+                <acronym>T</acronym>wi<acronym>N</acronym>
+            <acronym>T</acronym>exel. Where other graphics cards could only allow a triangle to use
+            a single texture, the TNT allowed it to use two.</para>
+        <para>This meant that its vertex input data was expanded. Two textures meant two texture
+            coordinates, since each texture coordinate was directly bound to a particular texture.
+            While they were at it, they also allowed for two per-vertex colors. The
+            idea here has to do with lighting equations.</para>
+        <para>For regular diffuse lighting, the CPU-computed color would simply be dot(N, L),
+            possibly with attenuation applied. Indeed, it could be any complicated diffuse lighting
+            function, since it was all on the CPU. This diffuse light intensity would be multiplied
+            by the texture, which represented the diffuse absorption of the surface at that
+            point.</para>
+        <para>This becomes less useful if you want to add a specular term. The specular absorption
+            and diffuse absorption are not necessarily the same, after all. And while you may not
+            need to have a specular texture, you do not want to add the specular component to the
+            diffuse component <emphasis>before</emphasis> you multiply by their respective colors.
+            You want to do the addition afterwards.</para>
+        <para>This is simply not possible if you have only one per-vertex color. But it becomes
+            possible if you have two. One color is the diffuse lighting value. The other color is
+            the specular component. We multiply the first color by the diffuse color from the
+            texture, then add the second color as the specular reflectance.</para>
+        <para>Which brings us nicely to fragment processing. The TNT's fragment processor had 5
+            inputs: 2 colors sampled from textures, 2 colors interpolated from vertices, and a
+            single <quote>constant</quote> color. The latter, in modern parlance, is the equivalent
+            of a shader uniform value.</para>
+        <para>That's a lot of potential inputs. The solution NVIDIA came up with to produce a final
+            color was a bit of fixed functionality that we will call the texture environment. It is
+            directly analogous to the OpenGL 1.1 fixed-function pipeline, but with extensions for
+            multiple textures and some TNT-specific features.</para>
+        <para>The idea is that each texture has an environment. The environment is a specific math
+            function, such as addition, subtraction, multiplication, and linear interpolation. The
+            operands to this function could be taken from any of the fragment inputs, as well as a
+            constant zero color value.</para>
+        <para>It can also use the result from the previous environment as one of its arguments.
+            Textures and environments are numbered, from zero to one (two textures, two
+            environments). The first one executes, followed by the second.</para>
+        <para>If you look at it from a hardware perspective, what you have is a two-opcode assembly
+            language. The available registers for the language are two vertex colors, a single
+            uniform color, two texture colors, and a zero register. There is also a single temporary
+            register to hold the output from the first opcode.</para>
+        <para>Graphics programmers, by this point, had gotten used to multipass-based algorithms.
+            After all, until TNT, that was the only way to apply multiple textures to a single
+            surface. And even with TNT, it had a pretty confining limit of two textures and two
+            opcodes.</para>
+        <para>This was powerful, but quite limited. Two opcodes really was not enough.</para>
+        <para>The TNT cards also provided something else: 32-bit framebuffers and depth buffers.
+            While the Voodoo cards used high-precision math internally, they still wrote to 16-bit
+            framebuffers, using a technique called dithering to make them look like higher
+            precision. But dithering was nothing compared to actual high precision framebuffers. And
+            it did nothing for the depth buffer artifacts that a 16-bit depth buffer gave
+            you.</para>
+        <para>While the original TNT could do 32-bit, it lacked the memory and overall performance
+            to really show it off. That had to wait for the TNT2. Combined with product delays and
+            some poor strategic moves by 3Dfx, NVIDIA became one of the dominant players in the
+            consumer PC graphics card market. And that was cemented by their next card, which had
+            real power behind it.</para>
+        <sidebar>
+            <title>Tile-Based Rendering</title>
+            <para>While all of this was going on, a small company called PowerVR released its Series
+                2 graphics chip. PowerVR's approach to rendering was fundamentally different from
+                the standard rendering pipeline.</para>
+            <para>They used what they called a <quote>deferred, tile-based renderer.</quote> The
+                idea is that they store all of the clip-space triangles in a buffer. Then, they sort
+                this buffer based on which triangles cover which areas of the screen. The output
+                screen is divided into a number of tiles of a fixed size. Say, 8x8 in size.</para>
+            <para>For each tile, the hardware finds the triangles that are within that tile's area.
+                Then it does all the usual scan conversion tricks and so forth. It even
+                automatically does per-pixel depth sorting for blending, which remains something of
+                a selling point (no more having to manually sort blended objects). After rendering
+                that tile, it moves on to the next. These operations can of course be executed in
+                parallel; you can have multiple tiles being rasterized at the same time.</para>
+            <para>The idea behind this is to avoid having large image buffers. You only need a few 8x8
+                depth buffers, so you can use very fast, on-chip memory for it. Rather than having
+                to deal with caches, DRAM, and large bandwidth memory channels, you just have a
+                small block of memory where you do all of your logic. You still need memory for
+                textures and the output image, but your bandwidth needs can be devoted solely to
+                textures.</para>
+            <para>For a time, these cards were competitive with the other graphics chip makers.
+                However, the tile-based approach simply did not scale well with resolution or
+                geometry complexity. Also, they missed the geometry processing bandwagon, which
+                really hurt their standing. They fell farther and farther behind the other major
+                players, until they stopped making desktop parts altogether.</para>
+            <para>However, they may ultimately have the last laugh; unlike 3Dfx and so many others,
+                PowerVR still exists. They provided the GPU for the Sega Dreamcast console. And
+                while that console was a market failure, it did show where PowerVR's true strength
+                lay: embedded platforms.</para>
+            <para>Embedded platforms tend to play to their tile-based renderer's strengths. Memory,
+                particularly high-bandwidth memory, eats up power; having less memory means
+                longer-lasting mobile devices. Embedded devices tend to use smaller resolutions,
+                which their platform excels at. And with low resolutions, you are not trying to push
+                nearly as much geometry.</para>
+            <para>Thanks to these facts, PowerVR graphics chips power the vast majority of mobile
+                platforms that have any 3D rendering in them. Just about every iPhone, Droid, iPad,
+                or similar device is running PowerVR technology. And that's a growth market these
+                days.</para>
+        </sidebar>
+    </section>
+    <section>
+        <?dbhtml filename="History GeForce.html" ?>
+        <title>Vertices and Registers</title>
+        <para>The next stage in the evolution of graphics hardware again came from NVIDIA. While
+            3Dfx released competing cards, they were again behind the curve. The NVIDIA GeForce 256
+            (not to be confused with the GeForce GT250, a much more modern card), released in 1999,
+            provided something truly new: a vertex processing pipeline.</para>
+        <para>The OpenGL API has always defined a vertex processing pipeline (it was fixed-function
+            in those days rather than shader-based). And NVIDIA implemented it in their TNT-era
+            drivers on the CPU. But only with the GeForce 256 was this actually implemented in
+            hardware. And NVIDIA essentially built the entire OpenGL fixed-function vertex
+            processing pipeline directly into the GeForce hardware.</para>
+        <para>This was primarily a performance win. And while that win was important for the
+            progress of hardware, a less well-known improvement of the early GeForce hardware was
+            more important to its future.</para>
+        <para>In the fragment processing pipeline, the texture environment stages were removed. In
+            their place was a more powerful mechanism, what NVIDIA called <quote>register
+                combiners.</quote></para>
+        <para>The GeForce 256 provided 2 regular combiner stages. Each of these stages represented
+            up to four independent opcodes that operated over the register set. The opcodes could
+            result in multiple outputs, which could be written to two temporary registers.</para>
+        <para>What is interesting is that the register values are no longer limited to color values.
+            Instead, they are signed values, on the range [-1, 1]; they have 9 bits of precision or
+            so. While the initial color or texture values are on [0, 1], the actual opcodes
+            themselves can perform operations that generate negative values. Opcodes can even
+            scale/bias their inputs, which allows them to turn unsigned colors into signed
+            values.</para>
+        <para>Because of this, the GeForce 256 was the first hardware to be able to do functional
+            bump mapping, without hacks or tricks. A single register combiner stage could do 2
+            3-vector dot-products at a time. Textures could store normals by compressing them to a
+            [0, 1] range. The light direction could either be a constant or interpolated per-vertex
+            in texture space.</para>
+        <para>Now granted, this still was a primitive form of bump mapping. There was no way to
+            correct for texture-space values with binormals and tangents. But this was at least
+            something. And it really was the first step towards programmability; it showed that
+            textures could truly represent values other than colors.</para>
+        <para>There was also a single final combiner stage. This was a much more limited stage than
+            the regular combiner stages. It could do a linear interpolation operation and an
+            addition; this was designed specifically to implement OpenGL's fixed-function fog and
+            specular computations.</para>
+        <para>The register file consisted of two temporary registers, two per-vertex colors, two
+            texture colors, two uniform values, the zero register, and a few other values used for
+            OpenGL fixed-function fog operations. The color and texture registers were even
+            writeable, if you needed more temporaries.</para>
+        <para>There were a few other sundry additions to the hardware. Cube textures first came onto
+            the scene. Combined with the right texture coordinate computations (now in hardware),
+            you could have reflective surfaces much more easily. Anisotropic filtering and
+            multisampling also appeared at this time. The limits were relatively small; anisotropic
+            filtering was limited to 4x, while the maximum number of samples was restricted to two.
+            Compressed texture formats also appeared on the scene.</para>
+        <para>What we see thus far as we take steps towards true programmability is that increased
+            complexity in fragment processing starts pushing for other needs. The addition of a dot
+            product allows lighting computations to take place per-fragment. But you cannot have full
+            texture-space bump mapping because of the lack of a normal/binormal/tangent matrix to
+            transform vectors to texture space. Cubemaps allow you to do arbitrary reflections, but
+            computing reflection directions per-vertex requires interpolating reflection normals,
+            which does not work very well over large polygons.</para>
+        <para>This also saw the introduction of something called a rectangle texture. This texture
+            type is something of an odd duck that remains to this day. It was a way of creating a
+            texture of arbitrary size; until then, textures were limited to powers of two in size
+            (though the two sizes did not have to be the same). The texture coordinates for
+            rectangle textures are not normalized; they are expressed in texel values.</para>
+        <sidebar>
+            <title>The GPU Divide</title>
+            <para>When NVIDIA released the GeForce 256, they coined the term <quote>Geometry
+                    Processing Unit</quote> or <acronym>GPU</acronym>. Until this point, graphics
+                chips were called exactly that: graphics chips. The term GPU was intended by NVIDIA
+                to differentiate the GeForce from all of its competition, including the final cards
+                from 3Dfx.</para>
+            <para>Because the term was so reminiscent of CPUs, it took over. Every graphics chip is
+                a GPU now, even ones released before the term came to exist.</para>
+            <para>In truth, the term GPU never really made much sense until the next stage, where
+                the first cards with actual programmability came onto the scene.</para>
+        </sidebar>
+    </section>
+    <section>
+        <?dbhtml filename="History Radeon8500.html" ?>
+        <title>Programming at Last</title>
+        <para>How do you define a demarcation between non-programmable graphics chips and
+            programmable ones? We have seen that, even in the humble TNT days, there were a couple
+            of user-defined opcodes with several possible input values.</para>
+        <para>One way is to consider what programming is. Programming is not simply a mathematical
+            operation; programming needs conditional logic. Therefore, it is not unreasonable to say
+            that something is not truly programmable until there is the possibility of some form of
+            conditional logic.</para>
+        <para>And it is at this point that conditional logic first truly appears. It appears first in the
+                <emphasis>vertex</emphasis> pipeline rather than the fragment pipeline. This seems
+            odd until one realizes how crucial fragment operations are to overall performance. It
+            therefore makes sense to introduce heavy programmability in the less
+            performance-critical areas of hardware first.</para>
+        <para>The GeForce 3, released in 2001 (a mere 3 years after the TNT), was the first hardware
+            to provide this level of programmability. While GeForce 3 hardware did indeed have the
+            fixed-function vertex pipeline, it also had a very flexible programmable pipeline.
+            Retaining the fixed-function hardware was a performance need; the vertex shader was not
+            as fast as the fixed-function one. It should be noted that the original Xbox's GPU,
+            designed in tandem with the GeForce 3, eschewed the fixed-functionality altogether in
+            favor of having multiple vertex shaders that could compute several vertices at a time.
+            This was eventually adopted for later GeForces.</para>
+        <para>Vertex shaders were pretty powerful, even in their first incarnation. While there was
+            no conditional branching, there was conditional logic, the equivalent of the ?:
+            operator. These vertex shaders exposed up to 128 <type>vec4</type> uniforms, up to 16
+                <type>vec4</type> inputs (still the modern limit), and could output 6
+                <type>vec4</type> outputs. Two of the outputs, intended for colors, were lower
+            precision than the others. There was a hard limit of 128 opcodes. These vertex shaders
+            brought full swizzling support and a plethora of math operations.</para>
+        <para>The GeForce 3 also added up to two more textures, for a total of four textures per
+            triangle. They were hooked directly into certain per-vertex outputs, because the
+            per-fragment pipeline did not have real programmability yet.</para>
+        <para>At this point, the holy grail of programmability at the fragment level was dependent
+            texture access. That is, being able to access a texture, do some arbitrary computations
+            on it, and then access another texture with the result. The GeForce 3 had some
+            facilities for that, but they were not very good ones.</para>
+        <para>The GeForce 3 used 8 register combiner stages instead of the 2 that the earlier cards
+            used. Their register files were extended to support two extra texture colors and a few
+            more tricks. But the main change was something that, in OpenGL terminology, would be
+            called <quote>texture shaders.</quote></para>
+        <para>What texture shaders did was allow the user to, instead of accessing a texture,
+            perform a computation using that texture's texture unit. This was much like the old
+            texture environment functionality, except that it operated only on texture coordinates.
+            The textures were arranged in a sequence. And instead of accessing a texture, you could
+            perform a computation between that texture unit's coordinate and, if there was one, the
+            coordinate from the previous texture shader operation.</para>
+        <para>It was not very flexible functionality. It did allow for full texture-space bump
+            mapping, though. While the 8 register combiners were enough to do a full matrix
+            multiply, they were not powerful enough to normalize the resulting vector. However, you
+            could normalize a vector by accessing a special cubemap. The values of this cubemap
+            represented a normalized vector in the direction of the cubemap's given texture
+            coordinate.</para>
+        <para>But using that required spending a total of 3 texture shader stages, which meant you
+            got a bump map and a normalization cubemap only; there was no room for a diffuse map in
+            that pass. It also did not perform very well; the texture shader functions were quite
+            expensive.</para>
+        <para>True programmability came to the fragment shader from ATI, with the Radeon 8500,
+            released in late 2001.</para>
+        <para>The 8500's fragment shader architecture was pretty straightforward, and in terms of
+            programming, it is not too dissimilar to modern shader systems. Texture coordinates
+            would come in. They could either be used to fetch from a texture or be given directly as
+            inputs to the processing stage. Up to 6 textures could be used at once. Then, up to 8
+            opcodes, including a conditional operation, could be used. After that, the hardware
+            would repeat the process using registers written by the opcodes. Those registers could
+            feed texture accesses from the same group of textures used in the first pass. And then
+            another 8 opcodes would generate the output color.</para>
+        <para>It also had strong, but not full, swizzling support in the fragment shader. Register
+            combiners had very little support for swizzling.</para>
+        <para>This era of hardware was also the first to allow 3D textures. Though that was as much
+            a memory concern as anything else, since 3D textures take up lots of memory which was
+            not available on earlier cards. Depth comparison texturing was also made
+            available.</para>
+        <para>While the 8500 was a technological marvel, it was a flop in the market compared to the
+            GeForce 3 &amp; 4. Indeed, this is a recurring theme of these eras: the card with the
+            more programmable hardware often tends to lose in its first iteration.</para>
+        <sidebar>
+            <title>API Hell</title>
+            <para>This era is notable in what it did to graphics APIs. Consider the hardware
+                differences between the 8500 and the GeForce 3/4 in terms of fragment
+                processing.</para>
+            <para>On the Direct3D front, things were not the best. Direct3D 8 promised a unified
+                shader development pipeline. That is, you could write a shader according to their
+                specifications and it would work on any D3D 8 hardware. And this was effectively
+                true. For vertex shaders, at least.</para>
+            <para>However, the D3D 8.0 pixel shader pipeline was nothing more than NVIDIA's register
+                combiners and texture shaders. There was no real abstraction of capabilities; the
+                D3D 8.0 pixel shaders simply took NVIDIA's hardware and made a shader language out
+                of it.</para>
+            <para>To provide support for the 8500's expanded fragment processing feature-set, there
+                was D3D 8.1. This version altered the pixel shader pipeline to match the
+                capabilities of the Radeon 8500. Fortunately, the 8500 would accept 8.0 shaders just
+                fine, since it was capable of doing everything the GeForce 3 could do. But no one
+                would mistake either shader specification for any kind of real abstraction.</para>
+            <para>Things were much worse on the OpenGL front. At least in D3D, you used the same
+                basic C++ API to provide shaders; the shaders themselves may have been different,
+                but the base API was the same. Not so in OpenGL land.</para>
+            <para>NVIDIA and ATI released entirely separate proprietary extensions for specifying
+                fragment shaders. NVIDIA's extensions built on the register combiner extension they
+                released with the GeForce 256. They were completely incompatible. And worse, they
+                were not even string-based.</para>
+            <para>Imagine having to call a C++ function to write every opcode of a shader. Now
+                imagine having to call <emphasis>three</emphasis> functions to write each opcode.
+                That's what using those APIs was like.</para>
+            <para>Things were better on vertex shaders. NVIDIA initially released a vertex shader
+                extension, as did ATI. NVIDIA's was string-based, but ATI's version was like their
+                fragment shader. Fortunately, this state of affairs did not last long; the OpenGL
+                ARB came along with their own vertex shader extension. This was not GLSL, but an
+                assembly-like language based on NVIDIA's extension.</para>
+            <para>It would take much longer for the fragment shader disparity to be worked
+                out.</para>
+        </sidebar>
+    </section>
+    <section>
+        <?dbhtml filename="History GeForceFX.html" ?>
+        <title>Dependency</title>
+        <para>The Radeon 9700 was the 8500's successor. It improved on the 8500 somewhat. The vertex
+            shader gained real conditional branching logic. Some of the limits were also relaxed;
+            the number of available outputs and uniforms increased. The fragment shader's
+            architecture remained effectively the same; the 9700 simply increased the limits. There
+            were 8 textures available and 16 opcodes, and it could perform 4 passes over this
+            set.</para>
+        <para>The GeForce FX, released in 2003, was a substantial improvement, both over the GeForce
+            3/4 and over the 9700 in terms of fragment processing. NVIDIA took a different approach
+            to their fragment shaders; their fragment processor worked not entirely unlike modern
+            shader processors do.</para>
+        <para>It read an instruction, which could be a math operation, conditional branch (they had
+            actual branches in fragment shading), or texture lookup instruction. It then executed
+            that instruction. The texture lookup could be from a set of 8 textures. And then it
+            repeated this process on the next instruction. It was doing math computations in a way
+            not entirely unlike a traditional CPU.</para>
+        <para>There was no real concept of a dependent texture access for the GeForce FX. The inputs
+            to the fragment pipeline were simply the texture coordinates and colors from the vertex
+            stage. If you used a texture coordinate to access a texture, it was fine with that. If
+            you did some computations with them and then accessed a texture, it was just as fine
+            with that. It was completely generic.</para>
+        <para>It also failed in the marketplace. This was due primarily to its lateness and its poor
+            performance in high-precision computation operations. The FX was optimized for doing
+            16-bit math computations in its fragment shader; while it <emphasis>could</emphasis> do
+            32-bit math, it was half as fast when doing this. But Direct3D 9's shaders did not allow
+            the user to specify the precision of computations; the specification required at least
+            24-bits of precision. To match this, NVIDIA had no choice but to force 32-bit math on
+            all D3D 9 applications, making them run much slower than their ATI counterparts (the
+            9700 always used 24-bit precision math).</para>
+        <para>Things were no better in OpenGL land. The two competing unified fragment processing
+            APIs, GLSL and an assembly-like fragment shader, did not have precision specifications
+            either. Only NVIDIA's proprietary extension for fragment shaders provided that, and
+            developers were less likely to use it. Especially with the head start that the 9700
+            gained in the market by the FX being released late.</para>
+        <para>It performed so poorly in the market that NVIDIA dropped the FX name for the next
+            hardware revision. The GeForce 6 improved its 32-bit performance to the point where it
+            was competitive with the ATI equivalents.</para>
+        <para>This generation of hardware also gained a number of new features. sRGB
+            textures and framebuffers appeared, as did floating-point textures. Blending support for
+            floating-point framebuffers was somewhat spotty; some hardware could do it only for
+            16-bit floating-point, some could not do it at all. The restriction to power-of-two
+            texture sizes was also lifted, to varying degrees. None of ATI's hardware of this era
+            fully supported this when used with mipmapping, but NVIDIA's hardware from the GeForce 6
+            and above did.</para>
+        <para>The ability to access textures from vertex shaders was also introduced in this series
+            of hardware. Vertex texture accesses used a separate list of textures from those bound
+            for fragment shaders. Only four textures could be accessed from a vertex shader, while 8
+            textures were the norm for fragment shaders.</para>
+        <para>Render to texture also became generally available at this time, though this was more
+            of an API issue (neither OpenGL nor Direct3D allowed textures to be used as render
+            targets before this point) than hardware functionality. That is not to say that hardware
+            had no role to play. Textures are often not stored as linear arrays of memory the way
+            they are loaded with <function>glTexImage</function>. They are usually stored in a
+            swizzled format, where 2D or 3D blocks of texture data are stored sequentially. Thus,
+            rendering to a texture required either the ability to render directly to swizzled
+            formats or the ability to read textures that are stored in unswizzled formats.</para>
+        <para>More than just render to texture was introduced; also new was the ability to render
+            to multiple textures or buffers at one time. The number of renderable buffers was
+            generally limited to 4 across all hardware platforms.</para>
+        <sidebar>
+            <title>Rise of the Compilers</title>
+            <para>Microsoft put their foot down after the fiasco with D3D 8's fragment shaders. They
+                wanted a single standard that all hardware makers would support. While this led to
+                the FX's performance failings, it also meant that compilers were becoming very
+                important to shader performance.</para>
+            <para>In order to have a real abstraction, you need compilers that are able to take the
+                abstract language and map it to very different kinds of hardware. With Direct3D and
+                OpenGL providing standards for shading languages, compiler quality started to become
+                vital for performance.</para>
+            <para>OpenGL moved whole-heartedly, and perhaps incautiously, into the realm of
+                compilers when the OpenGL ARB embraced GLSL, a C-style language. They developed this
+                language to the exclusion of all others.</para>
+            <para>In Direct3D land, Microsoft developed the High-Level Shading Language, HLSL. But
+                the base shading languages used by Direct3D 9 were still the assembly-like shading
+                languages. HLSL was compiled by a Microsoft-developed compiler into the assembly
+                languages, which were fed to Direct3D.</para>
+            <para>With compilers and semi-real languages with actual logic constructs, a new field
+                started to arise: general-purpose GPU computing, or <acronym>GPGPU</acronym>. The idea was
+                to use a GPU to do non-rendering tasks. It started around this era, but the
+                applications were limited due to the nature of hardware. Only fairly recently, with
+                the advent of special languages and APIs (OpenCL, for example) that are designed for
+                GPGPU tasks, has GPGPU started to really move into its own. Indeed, in the most
+                recent hardware era, hardware makers have added features to GPUs that have
+                somewhat... dubious uses in the field of graphics, but substantial uses in GPGPU
+                tasks.</para>
+        </sidebar>
+    </section>
+    <section>
+        <?dbhtml filename="History Unified.html" ?>
+        <title>Modern Unification</title>
+        <para>Welcome to the modern era. All of the examples in this book are designed on and for
+            this era of hardware, though some of them could run on older ones. The release of the
+            Radeon HD 2000 and GeForce 8000 series cards in 2006 represented unification in more
+            ways than one.</para>
+        <para>With the prior generations, fragment hardware had certain platform-specific
+            peculiarities. While the API kinks were mostly ironed out with the development of proper
+            shading languages, there were still differences in the behavior of hardware. While 4
+            dependent texture accesses were sufficient for most applications, naive use of shading
+            languages could get you in trouble on ATI hardware.</para>
+        <para>With this generation, neither side offered any real functionality difference.
+            There are still differences between the hardware lines, and certainly in terms of
+            performance. But the functionality differences have never been more blurred than they
+            were with this revision.</para>
+        <para>Another form of unification was that both NVIDIA and ATI moved to a unified shader
+            architecture. In all prior generations, fragment shaders and vertex shaders were
+            fundamentally different hardware. Even when they started doing the same kinds of things,
+            such as accessing textures, they were both using different physical hardware to do so.
+            This led to some inefficiencies.</para>
+        <para>Deferred rendering probably gives the most explicit illustration of the problem. The
+            first pass, the creation of the g-buffers, is a very vertex-shader-intensive activity.
+            While the fragment shader can be somewhat complex, doing several texture fetches to
+            compute various material parameters, the vertex shader is where much of the real work is
+            done. Lots of vertices come through the shader, and if there are any complex
+            transformations, they will happen here.</para>
+        <para>The second pass is a <emphasis>very</emphasis> fragment shader intensive pass. Each
+            light layer is composed of exactly 4 vertices, which can be provided directly in
+            clip-space. From then on, the fragment shader does all of the work, performing the
+            complex lighting calculations necessary for the various rendering techniques. Four
+            vertices generate literally millions of fragments, depending on the rendering
+            resolution.</para>
+        <para>In prior hardware generations, in the first pass, there would be fragment shaders
+            going to waste, as they would process fragments faster than the vertex shaders could
+            deliver triangles. In the second pass, the reverse happens, only even more so: after a
+            mere four vertex shader executions, all of that vertex processing hardware would sit
+            completely idle. All of those parallel computational units would go to waste.</para>
+        <para>Both NVIDIA and ATI devised hardware such that the computational elements were
+            separated from their particular kind of computations. All shader hardware could be used
+            for vertices, fragments, or geometry shaders (new in this generation). This would be
+            changed on demand, based on the resource load. This makes deferred rendering in
+            particular much more efficient; the second pass is able to use almost all of the
+            available shader resources for lighting operations.</para>
+        <para>This unified shader approach also means that every shader stage has essentially the
+            same capabilities. The standard for the maximum texture count is 16, which is plenty
+            for doing just about anything. This is applied equally to all shader types, so
+            vertex shaders have the same number of textures available as fragment shaders.</para>
+        <para>This smoothed out a great many things. Shaders gained quite a few new features.
+            Uniform buffers became available. Shaders could perform computations directly on integer
+            values. Unlike every generation before, all of these features were parceled out to all
+            types of shaders equally.</para>
+        <para>Along with unified shaders came a long list of various and sundry improvements to
+            non-shader hardware. These include, but are not limited to:</para>
+        <itemizedlist>
+            <listitem>
+                <para>Floating-point blending was worked out fully. Hardware of this era supports
+                    full 32-bit floating point blending, though for performance reasons you're still
+                    advised to use the lowest precision you can get away with.</para>
+            </listitem>
+            <listitem>
+                <para>Arbitrary texture swizzling as a direct part of texture sampling parameters,
+                    rather than in the shader itself.</para>
+            </listitem>
+            <listitem>
+                <para>Integer texture formats, to complement the shader's ability to use integer
+                    values.</para>
+            </listitem>
+            <listitem>
+                <para>Array textures.</para>
+            </listitem>
+        </itemizedlist>
+        <para>Various other limitations were expanded as well.</para>
+        <sidebar>
+            <title>Tessellation</title>
+            <para>This was not the end of hardware evolution; there has been hardware released in
+                more recent years. The Radeon HD 5000 and GeForce GT 400 series and above have
+                increased rendering features; they are just not as big of a difference compared to
+                what came before.</para>
+            <para>The biggest new feature in this hardware is tessellation, the ability to take
+                triangles output from a vertex shader and split them into new triangles based on
+                arbitrary (mostly) shader logic. This sounds like what geometry shaders can do, but
+                it is different.</para>
+            <para>Tessellation is actually something that ATI toyed around with for years. The
+                Radeon 9700 had tessellation support with something they called PN triangles. This
+                was very automated and not particularly useful. The entire Radeon HD 2000-4000 cards
+                included tessellation features as well. These were pre-vertex shader, while the
+                current version comes post-vertex shader.</para>
+            <para>In the older form, the vertex shader would serve double duty. An incoming triangle
+                would be broken down into many triangles. The vertex shader would then have to
+                compute the per-vertex attributes for each of the new triangles, based on the old
+                attributes and which vertex in the new series of vertices is being computed. Then it
+                would do its normal transformation and other operations on those attributes.</para>
+            <para>The current form introduces two new shader stages. The first, immediately after
+                the vertex shader, controls how much tessellation happens on a particular primitive.
+                The tessellation happens, splitting the single primitive into multiple primitives.
+                The next stage determines how to compute the new positions, normals, etc of the
+                primitive, based on the values of the primitive being tessellated. The geometry
+                shader still exists; it is executed after the final tessellation shader
+                stage.</para>
+            <para>Tessellation is not covered in this book for a few reasons. First, there is not as
+                much hardware out there that supports it. Sticking to OpenGL 3.3 meant casting a
+                wider net; requiring OpenGL 4.1 (which includes tessellation) would have meant fewer
+                people could run those tutorials.</para>
+            <para>Second, tessellation is not that important to the material covered here. That is
+                not to say that it is unimportant or not a worthwhile feature. But it really is not
+                something that matters a great deal for these tutorials.</para>
+        </sidebar>
+    </section>
+</appendix>

Documents/Optimization.xml

 <?oxygen SCHSchema="http://docbook.org/xml/5.0/rng/docbookxi.rng"?>
 <appendix xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
     xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
-    <?dbhtml filename="Optimization.html" ?>
-    <title>Optimizations</title>
-    <para>This appendix is not intended to be a detailed view of possible graphics optimizations.
-        Instead, it is a high-level view of important information for optimizing rendering
-        applications. There are also no source code samples for this.</para>
+    <?dbhtml filename="Basic Optimization.html" ?>
+    <title>Basic Optimization</title>
+    <para>Optimization is far too large of a subject to cover adequately in a mere appendix.
+        Optimizations tend to be specific to particular algorithms, and they usually involve
+        tradeoffs with memory. That is, one can make something run faster by taking up memory. And
+        even then, optimizations should only be made when one has proper profiling to determine
+        where performance is lacking.</para>
+    <para>This appendix will instead cover the most basic optimizations. These are not guaranteed to
+        improve performance in any particular program, but they almost never hurt. They are also
+        things you can implement relatively easily. Think of these as the default standard practices
+        you should start with before performing real optimizations. For the sake of clarity, most of
+        the code in this book did not use these practices, so many of them will be new.</para>
+    <section>
+        <title>Vertex Format</title>
+        <para>Interleave vertex arrays for objects where possible. Obviously, if you need to
+            overwrite some vertex data frequently while other data remains static, then you will
+            need to separate that data. But unless you have some specific need to do so, interleave
+            your vertex data.</para>
+        <para>Equally importantly, use the smallest vertex data possible. In the tutorials, the
+            vertex data was almost always 32-bit floats. You should only use 32-bit floats when you
+            absolutely need that much precision.</para>
+        <para>The biggest key to this is the use of normalized integer values for attributes. Here
+            is the definition of <function>glVertexAttribPointer</function>:</para>
+        <funcsynopsis>
+            <funcprototype>
+                <funcdef>void <function>glVertexAttribPointer</function></funcdef>
+                <paramdef>GLuint <parameter>index</parameter></paramdef>
+                <paramdef>GLint <parameter>size</parameter></paramdef>
+                <paramdef>GLenum <parameter>type</parameter></paramdef>
+                <paramdef>GLboolean <parameter>normalized</parameter></paramdef>
+                <paramdef>GLsizei <parameter>stride</parameter></paramdef>
+                <paramdef>GLvoid *<parameter>pointer</parameter></paramdef>
+            </funcprototype>
+        </funcsynopsis>
+        <para>If <varname>type</varname> is an integer type, like
+                <varname>GL_UNSIGNED_BYTE</varname>, then setting <varname>normalized</varname> to
+                <literal>GL_TRUE</literal> will mean that OpenGL interprets the integer value as
+            normalized. It will automatically convert the integer 255 to 1.0, and so forth. If the
+            normalization flag is false instead, then it will convert the integers directly to
+            floats: 255 becomes 255.0, etc. Signed values can be normalized as well; GL_BYTE with
+            normalization will map 127 to 1.0, -128 to -1.0, etc.</para>
+        <formalpara>
+            <title>Colors</title>
+            <para>Color values are commonly stored as 4 unsigned normalized bytes. This is far
+                smaller than using 4 32-bit floats, but the loss of precision is almost always
+                negligible. To send 4 unsigned normalized bytes, use:</para>
+        </formalpara>
+        <programlisting language="cpp">glVertexAttribPointer(#, 4, GL_UNSIGNED_BYTE, GL_TRUE, ...);</programlisting>
+        <para>The best part is that all of this is free; it costs no actual performance. Note
+            however that 32-bit integers cannot be normalized.</para>
+        <para>Sometimes, color values need higher precision than 8-bits, but less than 16-bits. If a
+            color is a linear RGB color, it is often desirable to give it greater than 8-bit
+            precision. If the alpha of the color is negligible or non-existent, then a special
+                <varname>type</varname> can be used. This type is
+                <literal>GL_UNSIGNED_INT_2_10_10_10_REV</literal>. It takes 32-bit unsigned
+            normalized integers and pulls the four components of the attributes out of each integer.
+            This type can only be used with normalization:</para>
+        <programlisting language="cpp">glVertexAttribPointer(#, 4, GL_UNSIGNED_BYTE, GL_TRUE, ...);</programlisting>
+        <para>The most significant 2 bits of each integer are the Alpha. The next 10 bits are the
+            Blue, then Green, and finally Red. It is equivalent to this struct in C:</para>
+        <programlisting language="cpp">struct RGB10_A2
+{
+  unsigned int alpha    : 2;
+  unsigned int blue     : 10;
+  unsigned int green    : 10;
+  unsigned int red      : 10;
+};</programlisting>
+        <formalpara>
+            <title>Normals</title>
+            <para>Another attribute where precision isn't of paramount importance is normals. If the
+                normals are normalized, and they always should be, the coordinates are always going
+                to be on the [-1, 1] range. So signed normalized integers are appropriate here.
+                8-bits of precision are sometimes enough, but 10-bit precision is going to be an
+                improvement. 16-bit precision, <literal>GL_SHORT</literal>, may be overkill, so
+                stick with <literal>GL_INT_2_10_10_10_REV</literal>. Because this format provides 4
+                values, you will still need to use 4 as the size of the attribute, but you can still
+                use <type>vec3</type> in the shader as the normal's input variable.</para>
+        </formalpara>
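+        <para>As a purely illustrative sketch (the attribute index, stride, and offset names here
+            are hypothetical, not taken from the tutorials' code), such a normal attribute could be
+            set up like this:</para>
+        <programlisting language="cpp">//Packed 10-bit signed normalized normal. The attribute size is 4, but the shader input can be a vec3.
+glVertexAttribPointer(normalAttrib, 4, GL_INT_2_10_10_10_REV, GL_TRUE, vertexStride, (void*)normalOffset);</programlisting>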
+        <formalpara>
+            <title>Texture Coordinates</title>
+            <para>Two-dimensional texture coordinates do not typically need 32-bits of precision. 8
+                and 10-bit precision are usually not good enough, but 16-bit unsigned normalized
+                integers are often sufficient. If texture coordinates range outside of [0, 1], then
+                normalization will not be sufficient. In these cases, there is an alternative to
+                32-bit floats: 16-bit floats.</para>
+        </formalpara>
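+        <para>For illustration, and again assuming hypothetical attribute index, stride, and offset
+            names, the two options might look like this:</para>
+        <programlisting language="cpp">//Texture coordinates on [0, 1], as 16-bit unsigned normalized integers.
+glVertexAttribPointer(texCoordAttrib, 2, GL_UNSIGNED_SHORT, GL_TRUE, vertexStride, (void*)texCoordOffset);
+
+//Texture coordinates outside [0, 1], as 16-bit floats.
+glVertexAttribPointer(texCoordAttrib, 2, GL_HALF_FLOAT, GL_FALSE, vertexStride, (void*)texCoordOffset);</programlisting>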
+        <para>The hardest part of dealing with 16-bit floats is that C/C++ does not deal with them
+            very well. There is no native 16-bit float type, unlike virtually every other type. Even the
+            10-bit format can be built using bit selectors in structs, as above. Generating a 16-bit
+            float from a 32-bit float requires care, as well as an understanding of how
+            floating-point values work. The details of that are beyond the scope of this work,
+            however.</para>
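+        <para>That said, a minimal sketch of such a conversion is shown below. It assumes the input
+            is an ordinary in-range number; it truncates rather than rounds, flushes values that are
+            too small to zero, and does not handle NaNs or denormals.</para>
+        <programlisting language="cpp">#include &lt;cstdint&gt;
+#include &lt;cstring&gt;
+
+uint16_t FloatToHalf(float value)
+{
+    uint32_t bits;
+    std::memcpy(&amp;bits, &amp;value, sizeof(bits));
+
+    uint32_t sign     = (bits &gt;&gt; 16) &amp; 0x8000;            //Sign bit, moved to bit 15.
+    int32_t  exponent = ((bits &gt;&gt; 23) &amp; 0xFF) - 127 + 15;  //Re-bias the 8-bit exponent to 5 bits.
+    uint32_t mantissa = (bits &gt;&gt; 13) &amp; 0x03FF;             //Keep the top 10 mantissa bits.
+
+    if(exponent &lt;= 0)  return (uint16_t)sign;              //Too small: flush to (signed) zero.
+    if(exponent &gt;= 31) return (uint16_t)(sign | 0x7C00);   //Too large: infinity.
+
+    return (uint16_t)(sign | ((uint32_t)exponent &lt;&lt; 10) | mantissa);
+}</programlisting>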
+        <formalpara>
+            <title>Positions</title>
+            <para>In general, positions are the least likely attribute to be easily optimized
+                without consequence. 16-bit floats can be used, but these are restricted to a range
+                of approximately [-65504, 65504]. They also lack some precision, which may be
+                necessary depending on the size and detail of the object in model space.</para>
+        </formalpara>
+        <para>If 16-bit floats are insufficient, there are things that can be done. The process is
+            as follows:</para>
+        <orderedlist>
+            <listitem>
+                <para>When loading the mesh data, find the bounding volume of the mesh in model
+                    space. To do this, find the maximum and minimum values in the X, Y and Z
+                    directions independently. This represents a box in model space that contains
+                    all of the vertices. This box is defined by two vectors: the
+                    maximum vector (containing the max X, Y and Z values), and the minimum vector.
+                    These are named <varname>max</varname> and <varname>min</varname>.</para>
+            </listitem>
+            <listitem>
+                <para>Compute the center point of this region:</para>
+                <programlisting language="cpp">glm::vec3 center = (max + min) / 2.0f;</programlisting>
+            </listitem>
+            <listitem>
+                <para>Compute half of the size (width, height, depth) of the region:</para>
+                <programlisting language="cpp">glm::vec3 halfSize = (max - min) / 2.0f;</programlisting>
+            </listitem>
+            <listitem>
+                <para>For each position in the mesh, compute a normalized version by subtracting the
+                    center from it, then dividing it by half the size. As follows:</para>
+                <programlisting language="cpp">glm::vec3 newPosition = (position - center) / halfSize;</programlisting>
+            </listitem>
+            <listitem>
+                <para>For each new position, convert it to a signed, normalized integer by
+                    multiplying it by 32767:</para>
+                <programlisting>short normX = (short)(newPosition.x * 32767.0f);
+short normY = (short)(newPosition.y * 32767.0f);
+short normZ = (short)(newPosition.z * 32767.0f);</programlisting>
+                <para>These three coordinates are then stored as the new position data in the buffer
+                    object; a sketch of the corresponding attribute setup follows this list.</para>
+            </listitem>
+            <listitem>
+                <para>Keep the <varname>center</varname> and <varname>halfSize</varname> variables
+                    stored with your mesh data. When computing the model-space to camera-space
+                    matrix for that mesh, add one final matrix to the top. This matrix will perform
+                    the inverse operation from the one that we used to compute the normalized
+                    values:</para>
+                <programlisting language="cpp">matrixStack.Translate(center);
+matrixStack.Scale(halfSize);</programlisting>
+                <para>This final matrix should <emphasis>not</emphasis> be applied to the normal's
+                    matrix. Compute the normal matrix <emphasis>before</emphasis> applying the final
+                    step above. So if you were not using a separate matrix for normals (you did not
+                    have non-uniform scales in your model-to-camera matrix), you will need to use
+                    one now. This may make your data bigger or make your shader run slightly
+                    slower.</para>
+            </listitem>
+        </orderedlist>
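+        <para>Once the positions are stored this way, the attribute can be fed to OpenGL as signed
+            normalized shorts. The attribute index and stride below are hypothetical
+            placeholders:</para>
+        <programlisting language="cpp">glVertexAttribPointer(positionAttrib, 3, GL_SHORT, GL_TRUE, vertexStride, (void*)0);</programlisting>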
+        <formalpara>
+            <title>Alignment</title>
+            <para>One additional rule you should always follow is this: make sure that all
+                attributes begin on a 4-byte boundary. This is true for attributes that are smaller
+                than 4-bytes, such as a 3-vector of 8-bit values. While OpenGL will allow you to use
+                arbitrary alignments, hardware may have problems making it work. So if you make your
+                position data 16-bit floats or signed normalized integers, you will still waste 2
+                bytes from every position. You may want to try making your position values
+                4-dimensional values and using the last value for something useful.</para>
+        </formalpara>
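+        <para>As an illustrative example of such a layout (the member choices are hypothetical, and
+            the GL typedefs come from the OpenGL headers), every attribute below starts on a 4-byte
+            boundary:</para>
+        <programlisting language="cpp">struct PackedVertex
+{
+    GLshort  position[4];   //Signed normalized; the 4th component pads to 8 bytes and can be repurposed.
+    GLuint   normal;        //GL_INT_2_10_10_10_REV packed normal: 4 bytes.
+    GLubyte  color[4];      //Unsigned normalized RGBA: 4 bytes.
+    GLushort texCoord[2];   //Unsigned normalized texture coordinates: 4 bytes.
+};                          //20 bytes per vertex; every attribute is 4-byte aligned.</programlisting>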
+    </section>
+    <section>
+        <title>Image Formats</title>
+        <para>As with vertex formats, try to use the smallest format that you can get away with.
+            Also, as with vertex formats, what you can get away with tends to be defined by what you
+            are trying to store in the texture.</para>
+        <formalpara>
+            <title>Normals</title>
+            <para>Textures containing normals can use <literal>GL_RGB10_A2_SNORM</literal>, which is
+                the texture equivalent to the 10-bit signed normalized format we used for attribute
+                normals. However, this can be made more precise if the normals are for a
+                tangent-space bump map. Since the tangent-space normals always have a positive Z
+                coordinate, and since the normals are normalized, the actual Z value can be computed
+                from the other two. So you only need to store 2 values;
+                    <literal>GL_RG16_SNORM</literal> is sufficient for these needs. To compute the
+                third value, do this:</para>
+        </formalpara>
+        <programlisting language="glsl">vec2 norm2d = texture(tangentBumpTex, texCoord).xy;
+vec3 tanSpaceNormal = vec3(norm2d, sqrt(1.0 - dot(norm2d, norm2d)));</programlisting>
+        <para>Obviously this costs some performance, so the added precision may not be worthwhile.
+            On the plus side, you will not have to do any normalization of the tangent-space
+            normal.</para>
+        <para>The <literal>GL_RG16_SNORM</literal> format can be made even smaller with texture
+            compression. The <literal>GL_COMPRESSED_SIGNED_RG_RGTC2</literal> compressed texture
+            format is a 2-channel signed normalized format. It only takes up 8 bits per texel.</para>
+        <formalpara>
+            <title>Floating-point Intensity</title>
+            <para>There are two unorthodox formats for floating-point textures, both of which have
+                important uses. The <literal>GL_R11F_G11F_B10F</literal> format is potentially a
+                good format to use for HDR render targets. As the name suggests, it takes up only
+                32-bits. The downside is the relative loss of precision compared to
+                    <literal>GL_RGB16F</literal>. They can store approximately the same magnitude of
+                values, but the smaller format loses some precision. This may or may not impact the
+                overall visual quality of the scene. It should be fairly simple to test to see which
+                is better.</para>
+        </formalpara>
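+        <para>For example (with placeholder dimensions), a texture intended as an HDR render target
+            could be allocated with this format as follows; the data pointer is NULL because the
+            image will only ever be rendered to:</para>
+        <programlisting language="cpp">glTexImage2D(GL_TEXTURE_2D, 0, GL_R11F_G11F_B10F, width, height, 0, GL_RGB, GL_FLOAT, NULL);</programlisting>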
+        <para>The <literal>GL_RGB9_E5</literal> format is used for input floating-point textures. If
+            you have a texture that represents light intensity in HDR situations, this format can be
+            quite handy. The way it works is that each of the RGB colors gets 9 bits for their
+            values, but they all share the same exponent. This has to do with how floating-point
+            numbers work, but what it boils down to is that the values have to be relatively close
+            to one another in magnitude. They do not have to be that close; there's still some
+            leeway. Values that are too small relative to larger ones become zero. This is
+            oftentimes an acceptable tradeoff, depending on the particular magnitude in
+            question.</para>
+        <para>This format is useful for textures that are generated offline by tools. You cannot
+            render to a texture in this format.</para>
+        <formalpara>
+            <title>Colors</title>
+            <para>Storing colors that are clamped to [0, 1] can be done with good precision with
+                    <literal>GL_RGBA8</literal> or <literal>GL_SRGB8_ALPHA8</literal> as needed.
+                However, compressed texture formats are available. The S3TC formats are good choices
+                if the compression works reasonably well for the texture. There are sRGB versions of
+                the S3TC formats as well.</para>
+        </formalpara>
+        <para>The difference between the various S3TC formats is how much alpha you need. The choices
+            are as follows:</para>
+        <glosslist>
+            <glossentry>
+                <glossterm>GL_COMPRESSED_RGB_S3TC_DXT1_EXT</glossterm>
+                <glossdef>
+                    <para>No alpha.</para>
+                </glossdef>
+            </glossentry>
+            <glossentry>
+                <glossterm>GL_COMPRESSED_RGBA_S3TC_DXT1_EXT</glossterm>
+                <glossdef>
+                    <para>Binary alpha. Either zero or one for each texel. The RGB color for any
+                        alpha of zero will also be zero.</para>
+                </glossdef>
+            </glossentry>
+            <glossentry>
+                <glossterm>GL_COMPRESSED_RGBA_S3TC_DXT3_EXT</glossterm>
+                <glossdef>
+                    <para>4-bits of alpha per pixel.</para>
+                </glossdef>
+            </glossentry>
+            <glossentry>
+                <glossterm>GL_COMPRESSED_RGBA_S3TC_DXT5_EXT</glossterm>
+                <glossdef>
+                    <para>Alpha is compressed in an S3TC block, much like RG texture
+                        compression.</para>
+                </glossdef>
+            </glossentry>
+        </glosslist>
+        <para>If a variable alpha matters for a texture, the primary difference will be between DXT3
+            and DXT5. DXT5 has the potential for better results, but if the alpha does not compress
+            well with the S3TC algorithm, the results will be rather worse.</para>
+    </section>
+    <section>
+        <title>Textures</title>
+        <para>Mipmapping improves performance when textures are mapped to regions that are larger in
+            texel space than in window space. That is, when texture minification happens. Mipmapping
+            improves performance because it keeps the locality of texture accesses near each other.
+            Texture hardware is optimized for accessing regions of textures, so improving locality
+            of texture data will help performance.</para>
+        <para>How much this matters depends on how the texture is mapped to the surface. Static
+            mapping with explicit texture coordinates, or with linear computation based on surface
+            properties, can use mipmapping to improve locality of texture access. For more unusual
+            mappings or for pure-lookup tables, mipmapping may not help locality at all.</para>
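+        <para>As a brief reminder of the mechanics (not tied to any particular tutorial's code),
+            using mipmaps for a loaded texture just means generating them and picking a mipmap-based
+            minification filter:</para>
+        <programlisting language="cpp">glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
+glGenerateMipmap(GL_TEXTURE_2D);  //Builds the full mip chain for the currently bound texture.</programlisting>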
+        <para/>
+    </section>
     <section>
         <title>Finding the Bottleneck</title>
         <para>The absolute best tool to have in your repertoire for optimizing your rendering is
             <para>If we did a memcpy between <varname>vertArray</varname> and a buffer object, and
                 we wanted to set the attributes to pull from this data, we could do so using the
                 stride and offsets to position things properly.</para>
-            <programlisting>glVertexAttribPointer(0, 3, GLfloat, GLfalse, 20, 0);
-glVertexAttribPointer(1, 3, GLubyte, GLtrue, 20, 12);
-glVertexAttribPointer(3, 3, GLushort, GLtrue, 20, 16);</programlisting>
+            <programlisting>glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 20, 0);
+glVertexAttribPointer(1, 3, GL_UNSIGNED_BYTE, GL_TRUE, 20, 12);
+glVertexAttribPointer(3, 3, GL_UNSIGNED_SHORT, GL_TRUE, 20, 16);</programlisting>
             <para>The fifth argument is the stride. The stride is the number of bytes from the
                 beginning of one instance of this attribute to the beginning of another. The stride
                 here is set to <literal>sizeof</literal>(<type>Vertex</type>). C++ defines that the

Documents/Positioning/Tutorial 04.xml

             </figure>
             <para>What we have are two similar right triangles; the triangle formed by E, R and
                     E<subscript>z</subscript>the origin; and the triangle formed by E, P, and
-                    P<subscript>z</subscript>+E<subscript>z</subscript>. We have the eye position
-                and the position of the unprojected point. To find the location of R, we simply do
-                this:</para>
+                    P<subscript>z</subscript>. We have the eye position and the position of the
+                unprojected point. To find the location of R, we simply do this:</para>
             <equation>
                 <title>Perspective Computation</title>
                 <mediaobject>
                 zFar are positive but refer to negative values).</para>
             <para>The location of the prism has also changed. In the original tutorial, it was
                 located on the 0.75 range in Z. Because camera space has a very different Z from
-                clip space, this had to change. Now, the Z location of the prims is between -1.25
+                clip space, this had to change. Now, the Z location of the prism is between -1.25
                 and -2.75.</para>
             <para>All of this leaves us with this result:</para>
             <figure>

Documents/chunked.css

 
 pre.programlisting
 {
+	max-height: 204pt;
+	overflow: auto;
+
     font-family: consolas, monospace;
     font-size: 12pt;
     margin-left: 5%;