Today I present you the theory and implementation of an advanced rasterizer. It is advanced in the sense that it has many nice properties the classical scanline conversion algorithm does not have. The main problem with the old algorithm is that it's hard to process pixels in parallel. It identifies filled scanlines, but this is only suited for processing one pixel at a time. A much more efficient approach is to process 2x2 pixels together. These are called quads. By sharing some setup cost per quad, and using advanced parallel instructions, this results in a significant speedup. Some of the current graphics hardware also uses quad pixel pipelines.
Our starting point is the half-space function approach. Before I go into details, all you have to know is that a half-space function is positive on one side of a line, and negative on the other. So it splits the screen in half, hence the name. Here's a small illustration of it, where the edges of a triangle each split the screen in a positive and negative part: half-space.png. Locations where all three half-space functions are positive define the inside of the triangle. When a half-edge function is zero, we're on an edge. When any of them is negative, we are outside the triangle. So this can be used for rasterization. If we can find formulas for the half-space functions, and evaluate them for each pixel location, we know which pixels need to be filled.
So what formula is positive on one side of a line and negative on the other? Surprisingly, the equation of a line is exactly what we're looking for. For a segment with starting coordinates (x1, y1) and end coordinates (x2, y2), the equation is:
(x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) = 0
You can easily verify that this equation is true for any point (x, y) on this line (for example the start and end points themselves). But the left hand side also behaves perfectly as a half-space fuction. It is positive for (x, y) coordinates on one side, and negative on the other. It's zero on the line itself, which effectively defines the line in the above equation. Verifying this behaviour is left as an excercise for the reader. Let's write some code!
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
float y1 = v1.y;
float y2 = v2.y;
float y3 = v3.y;
float x1 = v1.x;
float x2 = v2.x;
float x3 = v3.x;
// Bounding rectangle
int minx = (int)min(x1, x2, x3);
int maxx = (int)max(x1, x2, x3);
int miny = (int)min(y1, y2, y3);
int maxy = (int)max(y1, y2, y3);
(char*&)colorBuffer += miny * stride;
// Scan through bounding rectangle
for(int y = miny; y < maxy; y++)
{
for(int x = minx; x < maxx; x++)
{
// When all half-space functions positive, pixel is in triangle
if((x1 - x2) * (y - y1) - (y1 - y2) * (x - x1) > 0 &&
<< (x2 - x3) * (y - y2) - (y2 - y3) * (x - x2) > 0 &&
<< (x3 - x1) * (y - y3) - (y3 - y1) * (x - x3) > 0)
{
colorBuffer[x] = 0x00FFFFFF;<< // White
}
}
(char*&)colorBuffer += stride;
}
}
This is the simplest working implementation of the half-space functions algorithm. Before I continue to explain how it can be improved, there are a few things to note. It would be possible to scan the whole screen, to see which pixels are inside the triangle, but this is of course a waste of time. Therefore only the smallest rectangle surrounding the triangle is scanned. You might also have noticed that I swapped some components in the formula. This is because in screen coordinates, the y-axis points downward. It is also very important that the triangle's vertices are in counter-clockwise order. This makes sure that all the positive sides of the half-spaces point inside the triangle.Unfortunately, this implementation isn't fast at all. For every pixel, six multiplications and no less than fifteen subtractions are required. But don't worry, this can be optimized greatly. Per horizontal line that we scan, only the x component in the formula changes. For the first edge, this means only (y1 - y2) gets added to the half-edge function value of the previous pixel. That's just an addition and subtraction per pixel! Furthermore, the vertex coordinates are 'constant' per triangle, so (y1 - y2) and all other pairs like this only have to be computed once per pixel. Also for stepping in the y direction it's just an addition of a constant. Let's implement this:
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
float y1 = v1.y;
float y2 = v2.y;
float y3 = v3.y;
float x1 = v1.x;
float x2 = v2.x;
float x3 = v3.x;
// Deltas
float Dx12 = x1 - x2;
float Dx23 = x2 - x3;
float Dx31 = x3 - x1;
float Dy12 = y1 - y2;
float Dy23 = y2 - y3;
float Dy31 = y3 - y1;
// Bounding rectangle
int minx = (int)min(x1, x2, x3);
int maxx = (int)max(x1, x2, x3);
int miny = (int)min(y1, y2, y3);
int maxy = (int)max(y1, y2, y3);
(char*&)colorBuffer += miny * stride;
// Constant part of half-edge functions
float C1 = Dy12 * x1 - Dx12 * y1;
float C2 = Dy23 * x2 - Dx23 * y2;
float C3 = Dy31 * x3 - Dx31 * y3;
float Cy1 = C1 + Dx12 * miny - Dy12 * minx;
float Cy2 = C2 + Dx23 * miny - Dy23 * minx;
float Cy3 = C3 + Dx31 * miny - Dy31 * minx;
// Scan through bounding rectangle
for(int y = miny; y < maxy; y++)
{
// Start value for horizontal scan
float Cx1 = Cy1;
float Cx2 = Cy2;
float Cx3 = Cy3;
for(int x = minx; x < maxx; x++)
{
if(Cx1 > 0 && Cx2 > 0 && Cx3 > 0)
{
colorBuffer[x] = 0x00FFFFFF;<< // White
}
Cx1 -= Dy12;
Cx2 -= Dy23;
Cx3 -= Dy31;
}
Cy1 += Dx12;
Cy2 += Dx23;
Cy3 += Dx31;
(char*&)colorBuffer += stride;
}
}
The C1-C3 variables are the constant part of the half-edge equation and are not recomputed per pixel. The Cy1-Cy3 variables determine the 'starting value' of the half-space functions at the top of the bounding rectangle. And finally the Dx1-Dx3 variables are the start values per horizontal scan. Only these are incremented (actually decremented) with the delta values every pixel. So testing whether a pixel is inside the triangle has become really cheap: aside from some setup per horizontal line and per triangle that is negligible, just three subtractions and three compares for zero.Ok, so this algorithm is starting to show its usefulness. But there still are some serious issues to solve. Here's an actual image created using this code: first_try.png. If it isn't clear that this is actually a car, here's the shaded reference image: reference.png. As you can see, there are many gaps between the polygons. The cause is twofold: precision, and the fill convention.
There are precision issues because floating-point numbers are quite limited. They have 24-bit of precision. If you look at the half-space function, you see that we do several subtractions and multiplications. That's the perfect recipe for precision problems. What most people would do in such situation is use double-precision floating-point numbers. While this will make the issue unnoticable, it isn't exactly perfect and also slower. The real solution might sound crazy: use integers! Integers have 32-bit of precision, and do not suffer from precision loss when subtracting. Instead of using a 'floating-point', we'll use fixed-point arithmetic to assure we get sub-pixel accuracy.
The second cause of the gaps is the fill convention. The half-space functions can be positive or negative, but also zero. In this case, the pixel is exactly on the edge. Until now, we've ignored this case and treated it as a pixel outside. We could treat it as inside, using >= comparators, but this isn't correct either. What will happen is that pixels on the edge between two triangles will be drawn twice. While this may sound more acceptable than gaps, it causes some serious artifacts for things like transparency and stenciling.
What we'll use here is a top-left fill convention, just like DirectX and OpenGL. This means that all edges which are on the left side of the triangle, or on a horizontal top, are treated as inside the triangle. This convention assures that no gaps will occur, nor drawing the same pixel twice. Let's first see how we can detect whether an edge is 'top-left'. For humans, it's really easy to recognise them. Here are a few examples: top-left.png. I'm sure nobody really had a problem to identify the top-left edges himself, it's really simple. But give yourself a minute to think about how you would detect this using code... Don't look before you tried it, but here's the answer: detect.png. The arrows indicate the counter-clockwise order of the vertices. Note how for left edges they all point downward! To detect this in code we merely have to check the Dy1-Dy3 values! The only exception is the top edge. There, Dy# is zero, but Dx# is positive.
Now that we know how to detect top-left edges, we still have to deal with them correctly so pixels on these edges are treated as inside. What we actually want to do is change the Cx# > 0 statements into Cx# >= 0 statements. Testing for top-left edges per pixel is way to slow, and handling all these cases in separate loops is way too complex. But look at what Cx# > 0 fundamentally is. It is 'true' when Cx# is 1, 2, 3, etc, and 'false' when Cx# is 0, -1, -2, etc. All we want to do is make it true also when Cx# is zero. The solution is too simple for words: all we have to do is add 1 to Cx# beforehand. This can even be done sooner, with the C1-C3 variables! Ok, let's sum this all up:
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
// 28.4 fixed-point coordinates
const int Y1 = iround(16.0f * v1.y);
const int Y2 = iround(16.0f * v2.y);
const int Y3 = iround(16.0f * v3.y);
const int X1 = iround(16.0f * v1.x);
const int X2 = iround(16.0f * v2.x);
const int X3 = iround(16.0f * v3.x);
// Deltas
const int DX12 = X1 - X2;
const int DX23 = X2 - X3;
const int DX31 = X3 - X1;
const int DY12 = Y1 - Y2;
const int DY23 = Y2 - Y3;
const int DY31 = Y3 - Y1;
// Fixed-point deltas
const int FDX12 = DX12 << 4;
const int FDX23 = DX23 << 4;
const int FDX31 = DX31 << 4;
const int FDY12 = DY12 << 4;
const int FDY23 = DY23 << 4;
const int FDY31 = DY31 << 4;
// Bounding rectangle
int minx = (min(X1, X2, X3) + 0xF) >> 4;
int maxx = (max(X1, X2, X3) + 0xF) >> 4;
int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;
int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;
(char*&)colorBuffer += miny * stride;
// Half-edge constants
int C1 = DY12 * X1 - DX12 * Y1;
int C2 = DY23 * X2 - DX23 * Y2;
int C3 = DY31 * X3 - DX31 * Y3;
// Correct for fill convention
if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;
if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;
if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;
int CY1 = C1 + DX12 * (miny << 4) - DY12 * (minx << 4);
int CY2 = C2 + DX23 * (miny << 4) - DY23 * (minx << 4);
int CY3 = C3 + DX31 * (miny << 4) - DY31 * (minx << 4);
for(int y = miny; y < maxy; y++)
{
int CX1 = CY1;
int CX2 = CY2;
int CX3 = CY3;
for(int x = minx; x < maxx; x++)
{
if(CX1 > 0 && CX2 > 0 && CX3 > 0)
{
colorBuffer[x] = 0x00FFFFFF;
}
CX1 -= FDY12;
CX2 -= FDY23;
CX3 -= FDY31;
}
CY1 += FDX12;
CY2 += FDX23;
CY3 += FDX31;
(char*&)colorBuffer += stride;
}
}
Here's the end result: correct.png. Not only is it now entirely flawless, it's also faster than the floating-point version! The range of the fixed-point integers is big enough for a 2048x2048 color buffer, with 4 bits of sub-pixel precision. Now we got this fixed, let's focus on performance again...This implementation is perfect for small triangles. The setup cost per triangle is really low compared to the scanline conversion algorithm. Unfortunately, it's not optimal for big triangles. About half of all pixels will be outside the triangle, but we still pay the price of evaluating the half-space functions. Furthermore, until now I've only talked about drawing one pixel at a time, while the real benefits of this approach is the possibility for parallelism. I'll show you how to take advantage of this, and what other benefits we get from it.
What we really want to do is quickly skip pixels outside the triangle. We don't want to waste any time evaluating half-space functions there. We also don't want to spend too much time inside the triangle. The really interesting part is around the edges. So what we'll do is quickly detect whether 8x8 blocks are not covered, partially covered, or fully covered. Not covered or fully covered blocks can quicly be respectively rejected or accepted. Partially covered blocks will be completely scanned. There's another reason to do this block-based approach: it is very useful for visibility determination algorithms. And an extra benefit is that memory accesses are more localized.
So how do we detect coverage as fast as possible? With the half-space functions, it's really easy. All we have to do is evaluate the half-space functions in the corners of the blocks. A non-covered block has negative half-space values for the corners for at least one edge. A completely covered block has positive half-space values for all edges. Everything else is a partially covered block. So the final implementation is:
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
// 28.4 fixed-point coordinates
const int Y1 = iround(16.0f * v1.y);
const int Y2 = iround(16.0f * v2.y);
const int Y3 = iround(16.0f * v3.y);
const int X1 = iround(16.0f * v1.x);
const int X2 = iround(16.0f * v2.x);
const int X3 = iround(16.0f * v3.x);
// Deltas
const int DX12 = X1 - X2;
const int DX23 = X2 - X3;
const int DX31 = X3 - X1;
const int DY12 = Y1 - Y2;
const int DY23 = Y2 - Y3;
const int DY31 = Y3 - Y1;
// Fixed-point deltas
const int FDX12 = DX12 << 4;
const int FDX23 = DX23 << 4;
const int FDX31 = DX31 << 4;
const int FDY12 = DY12 << 4;
const int FDY23 = DY23 << 4;
const int FDY31 = DY31 << 4;
// Bounding rectangle
int minx = (min(X1, X2, X3) + 0xF) >> 4;
int maxx = (max(X1, X2, X3) + 0xF) >> 4;
int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;
int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;
// Block size, standard 8x8 (must be power of two)
const int q = 8;
// Start in corner of 8x8 block
minx &= ~(q - 1);
miny &= ~(q - 1);
(char*&)colorBuffer += miny * stride;
// Half-edge constants
int C1 = DY12 * X1 - DX12 * Y1;
int C2 = DY23 * X2 - DX23 * Y2;
int C3 = DY31 * X3 - DX31 * Y3;
// Correct for fill convention
if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;
if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;
if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;
// Loop through blocks
for(int y = miny; y < maxy; y += q)
{
for(int x = minx; x < maxx; x += q)
{
// Corners of block
int x0 = x << 4;
int x1 = (x + q - 1) << 4;
int y0 = y << 4;
int y1 = (y + q - 1) << 4;
// Evaluate half-space functions
bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0;
bool a10 = C1 + DX12 * y0 - DY12 * x1 > 0;
bool a01 = C1 + DX12 * y1 - DY12 * x0 > 0;
bool a11 = C1 + DX12 * y1 - DY12 * x1 > 0;
int a = (a00 << 0) | (a10 << 1) | (a01 << 2) | (a11 << 3);
bool b00 = C2 + DX23 * y0 - DY23 * x0 > 0;
bool b10 = C2 + DX23 * y0 - DY23 * x1 > 0;
bool b01 = C2 + DX23 * y1 - DY23 * x0 > 0;
bool b11 = C2 + DX23 * y1 - DY23 * x1 > 0;
int b = (b00 << 0) | (b10 << 1) | (b01 << 2) | (b11 << 3);
bool c00 = C3 + DX31 * y0 - DY31 * x0 > 0;
bool c10 = C3 + DX31 * y0 - DY31 * x1 > 0;
bool c01 = C3 + DX31 * y1 - DY31 * x0 > 0;
bool c11 = C3 + DX31 * y1 - DY31 * x1 > 0;
int c = (c00 << 0) | (c10 << 1) | (c01 << 2) | (c11 << 3);
// Skip block when outside an edge
if(a == 0x0 || b == 0x0 || c == 0x0) continue;
unsigned int *buffer = colorBuffer;
// Accept whole block when totally covered
if(a == 0xF && b == 0xF && c == 0xF)
{
for(int iy = 0; iy < q; iy++)
{
for(int ix = x; ix < x + q; ix++)
{
buffer[ix] = 0x00007F00;<< // Green
}
(char*&)buffer += stride;
}
}
else<< // Partially covered block
{
int CY1 = C1 + DX12 * y0 - DY12 * x0;
int CY2 = C2 + DX23 * y0 - DY23 * x0;
int CY3 = C3 + DX31 * y0 - DY31 * x0;
for(int iy = y; iy < y + q; iy++)
{
int CX1 = CY1;
int CX2 = CY2;
int CX3 = CY3;
for(int ix = x; ix < x + q; ix++)
{
if(CX1 > 0 && CX2 > 0 && CX3 > 0)
{
buffer[ix] = 0x0000007F;<< // Blue
}
CX1 -= FDY12;
CX2 -= FDY23;
CX3 -= FDY31;
}
CY1 += FDX12;
CY2 += FDX23;
CY3 += FDX31;
(char*&)buffer += stride;
}
}
}
(char*&)colorBuffer += q * stride;
}
}
Note that I scan partially covered blocks completely, all 8x8 pixels. This is for consistency with the other blocks so they can be processed by the same visibility algorithm. Also, this is extremely fast when done in an unrolled loop, using assembly instructions. All further optimizations to this algorithm are best done in assembly anyway. So now the rasterizer can output coverage masks for 8x8 pixels. This can then easily be processed by the pixel pipeline(s). It's easy to process them as 4x4 quads, and many calculations can even be done per block. Everything taken together, there is no reason left to use the old scanline conversion algorithm.
Enjoy!
Nicolas "Nick" Capens














