agreed with the most but:

As i know the Pentium and better CPUs work with the 80bit float numbers actually faster than with the 32 and 64bits. Of course i know few languages which make use of the extended format.

There are other few stuff, which i’m not yet sure about.. but the article is really good! Like everything in the _nick_ style ;)

It’s quite surprising that the most elementary rendering operations, like rasterization, are so little documented. Graphics cards take care of most of this, so people have forgotten how it actually works. But there are many reasons left why understanding the inner workings of the graphics pipeline is still useful. Not only for software rendering, but for any situation where this type of operations is required. Knowledge of these algorithm simply make you a better graphics programmer.

Today I present you the theory and implementation of an advanced rasterizer. It is advanced in the sense that it has many nice properties the classical scanline conversion algorithm does not have. The main problem with the old algorithm is that it’s hard to process pixels in parallel. It identifies filled scanlines, but this is only suited for processing one pixel at a time. A much more efficient approach is to process 2x2 pixels together. These are called quads. By sharing some setup cost per quad, and using advanced parallel instructions, this results in a significant speedup. Some of the current graphics hardware also uses quad pixel pipelines.

Our starting point is the half-space function approach. Before I go into details, all you have to know is that a half-space function is positive on one side of a line, and negative on the other. So it splits the screen in half, hence the name. Here’s a small illustration of it, where the edges of a triangle each split the screen in a positive and negative part: half-space.png. Locations where all three half-space functions are positive define the inside of the triangle. When a half-edge function is zero, we’re on an edge. When any of them is negative, we are outside the triangle. So this can be used for rasterization. If we can find formulas for the half-space functions, and evaluate them for each pixel location, we know which pixels need to be filled.

So what formula is positive on one side of a line and negative on the other? Surprisingly, the equation of a line is exactly what we’re looking for. For a segment with starting coordinates (x1, y1) and end coordinates (x2, y2), the equation is:

(x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) = 0

You can easily verify that this equation is true for any point (x, y) on this line (for example the start and end points themselves). But the left hand side also behaves perfectly as a half-space fuction. It is positive for (x, y) coordinates on one side, and negative on the other. It’s zero on the line itself, which effectively defines the line in the above equation. Verifying this behaviour is left as an excercise for the reader. Let’s write some code!

This is the simplest working implementation of the half-space functions algorithm. Before I continue to explain how it can be improved, there are a few things to note. It would be possible to scan the whole screen, to see which pixels are inside the triangle, but this is of course a waste of time. Therefore only the smallest rectangle surrounding the triangle is scanned. You might also have noticed that I swapped some components in the formula. This is because in screen coordinates, the y-axis points downward. It is also very important that the triangle’s vertices are in counter-clockwise order. This makes sure that all the positive sides of the half-spaces point inside the triangle.

Unfortunately, this implementation isn’t fast at all. For every pixel, six multiplications and no less than fifteen subtractions are required. But don’t worry, this can be optimized greatly. Per horizontal line that we scan, only the x component in the formula changes. For the first edge, this means only (y1 - y2) gets added to the half-edge function value of the previous pixel. That’s just an addition and subtraction per pixel! Furthermore, the vertex coordinates are ‘constant’ per triangle, so (y1 - y2) and all other pairs like this only have to be computed once per pixel. Also for stepping in the y direction it’s just an addition of a constant. Let’s implement this:

The C1-C3 variables are the constant part of the half-edge equation and are not recomputed per pixel. The Cy1-Cy3 variables determine the ‘starting value’ of the half-space functions at the top of the bounding rectangle. And finally the Dx1-Dx3 variables are the start values per horizontal scan. Only these are incremented (actually decremented) with the delta values every pixel. So testing whether a pixel is inside the triangle has become really cheap: aside from some setup per horizontal line and per triangle that is negligible, just three subtractions and three compares for zero.

Ok, so this algorithm is starting to show its usefulness. But there still are some serious issues to solve. Here’s an actual image created using this code: first_try.png. If it isn’t clear that this is actually a car, here’s the shaded reference image: reference.png. As you can see, there are many gaps between the polygons. The cause is twofold: precision, and the fill convention.

There are precision issues because floating-point numbers are quite limited. They have 24-bit of precision. If you look at the half-space function, you see that we do several subtractions and multiplications. That’s the perfect recipe for precision problems. What most people would do in such situation is use double-precision floating-point numbers. While this will make the issue unnoticable, it isn’t exactly perfect and also slower. The real solution might sound crazy: use integers! Integers have 32-bit of precision, and do not suffer from precision loss when subtracting. Instead of using a ‘floating-point’, we’ll use fixed-point arithmetic to assure we get sub-pixel accuracy.

The second cause of the gaps is the fill convention. The half-space functions can be positive or negative, but also zero. In this case, the pixel is exactly on the edge. Until now, we’ve ignored this case and treated it as a pixel outside. We could treat it as inside, using >= comparators, but this isn’t correct either. What will happen is that pixels on the edge between two triangles will be drawn twice. While this may sound more acceptable than gaps, it causes some serious artifacts for things like transparency and stenciling.

What we’ll use here is a top-left fill convention, just like DirectX and OpenGL. This means that all edges which are on the left side of the triangle, or on a horizontal top, are treated as inside the triangle. This convention assures that no gaps will occur, nor drawing the same pixel twice. Let’s first see how we can detect whether an edge is ‘top-left’. For humans, it’s really easy to recognise them. Here are a few examples: top-left.png. I’m sure nobody really had a problem to identify the top-left edges himself, it’s really simple. But give yourself a minute to think about how you would detect this using code… Don’t look before you tried it, but here’s the answer: detect.png. The arrows indicate the counter-clockwise order of the vertices. Note how for left edges they all point downward! To detect this in code we merely have to check the Dy1-Dy3 values! The only exception is the top edge. There, Dy# is zero, but Dx# is positive.

Now that we know how to detect top-left edges, we still have to deal with them correctly so pixels on these edges are treated as inside. What we actually want to do is change the Cx# > 0 statements into Cx# >= 0 statements. Testing for top-left edges per pixel is way to slow, and handling all these cases in separate loops is way too complex. But look at what Cx# > 0 fundamentally is. It is ‘true’ when Cx# is 1, 2, 3, etc, and ‘false’ when Cx# is 0, -1, -2, etc. All we want to do is make it true also when Cx# is zero. The solution is too simple for words: all we have to do is add 1 to Cx# beforehand. This can even be done sooner, with the C1-C3 variables! Ok, let’s sum this all up:

Here’s the end result: correct.png. Not only is it now entirely flawless, it’s also faster than the floating-point version! The range of the fixed-point integers is big enough for a 2048x2048 color buffer, with 4 bits of sub-pixel precision. Now we got this fixed, let’s focus on performance again…

This implementation is perfect for small triangles. The setup cost per triangle is really low compared to the scanline conversion algorithm. Unfortunately, it’s not optimal for big triangles. About half of all pixels will be outside the triangle, but we still pay the price of evaluating the half-space functions. Furthermore, until now I’ve only talked about drawing one pixel at a time, while the real benefits of this approach is the possibility for parallelism. I’ll show you how to take advantage of this, and what other benefits we get from it.

What we really want to do is quickly skip pixels outside the triangle. We don’t want to waste any time evaluating half-space functions there. We also don’t want to spend too much time inside the triangle. The really interesting part is around the edges. So what we’ll do is quickly detect whether 8x8 blocks are not covered, partially covered, or fully covered. Not covered or fully covered blocks can quicly be respectively rejected or accepted. Partially covered blocks will be completely scanned. There’s another reason to do this block-based approach: it is very useful for visibility determination algorithms. And an extra benefit is that memory accesses are more localized.

So how do we detect coverage as fast as possible? With the half-space functions, it’s really easy. All we have to do is evaluate the half-space functions in the corners of the blocks. A non-covered block has negative half-space values for the corners for at least one edge. A completely covered block has positive half-space values for all edges. Everything else is a partially covered block. So the final implementation is:

Note that I scan partially covered blocks completely, all 8x8 pixels. This is for consistency with the other blocks so they can be processed by the same visibility algorithm. Also, this is extremely fast when done in an unrolled loop, using assembly instructions. All further optimizations to this algorithm are best done in assembly anyway. So now the rasterizer can output coverage masks for 8x8 pixels. This can then easily be processed by the pixel pipeline(s). It’s easy to process them as 4x4 quads, and many calculations can even be done per block. Everything taken together, there is no reason left to use the old scanline conversion algorithm.

Enjoy!

Nicolas “Nick” Capens