Advanced Rasterization

Nick 102 Sep 11, 2004 at 18:08

It’s quite surprising that the most elementary rendering operations, like rasterization, are so little documented. Graphics cards take care of most of this, so people have forgotten how it actually works. But there are still many reasons why understanding the inner workings of the graphics pipeline is useful. Not only for software rendering, but for any situation where this type of operation is required. Knowledge of these algorithms simply makes you a better graphics programmer.

Today I present the theory and implementation of an advanced rasterizer. It is advanced in the sense that it has many nice properties the classical scanline conversion algorithm does not have. The main problem with the old algorithm is that it’s hard to process pixels in parallel. It identifies filled scanlines, which only lends itself to processing one pixel at a time. A much more efficient approach is to process 2x2 pixels together. These are called quads. By sharing some of the setup cost per quad, and using parallel SIMD instructions, this results in a significant speedup. Some current graphics hardware also uses quad pixel pipelines.

Our starting point is the half-space function approach. Before I go into details, all you have to know is that a half-space function is positive on one side of a line, and negative on the other. So it splits the screen in half, hence the name. Here’s a small illustration, where the edges of a triangle each split the screen into a positive and a negative part: half-space.png. Locations where all three half-space functions are positive define the inside of the triangle. When a half-space function is zero, we’re on an edge. When any of them is negative, we are outside the triangle. So this can be used for rasterization. If we can find formulas for the half-space functions, and evaluate them for each pixel location, we know which pixels need to be filled.

So what formula is positive on one side of a line and negative on the other? Surprisingly, the equation of a line is exactly what we’re looking for. For a segment with starting coordinates (x1, y1) and end coordinates (x2, y2), the equation is:

(x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) = 0

You can easily verify that this equation holds for any point (x, y) on the line (for example the start and end points themselves). But the left-hand side also behaves perfectly as a half-space function. It is positive for (x, y) coordinates on one side, and negative on the other. It’s zero on the line itself, which is exactly what defines the line in the above equation. Verifying this behaviour is left as an exercise for the reader. Let’s write some code!
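
To make this concrete, here is the left-hand side as a small helper, together with a quick check on an example segment (just an illustration; the function name and the sample coordinates are arbitrary):

// Sketch: the left-hand side of the line equation as a reusable function.
// It is positive on one side of the segment (x1, y1)-(x2, y2), negative on
// the other, and zero exactly on the line.
float halfSpace(float x1, float y1, float x2, float y2, float x, float y)
{
    return (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1);
}

// For the segment (0, 0)-(4, 0): halfSpace(0, 0, 4, 0, 2, 1) returns 4,
// halfSpace(0, 0, 4, 0, 2, -1) returns -4, and both endpoints return 0.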

void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    float y1 = v1.y;
    float y2 = v2.y;
    float y3 = v3.y;

    float x1 = v1.x;
    float x2 = v2.x;
    float x3 = v3.x;

    // Bounding rectangle
    int minx = (int)min(x1, x2, x3);
    int maxx = (int)max(x1, x2, x3);
    int miny = (int)min(y1, y2, y3);
    int maxy = (int)max(y1, y2, y3);

    (char*&)colorBuffer += miny * stride;

    // Scan through bounding rectangle
    for(int y = miny; y < maxy; y++)
    {
        for(int x = minx; x < maxx; x++)
        {
            // When all half-space functions positive, pixel is in triangle
            if((x1 - x2) * (y - y1) - (y1 - y2) * (x - x1) > 0 &&
               (x2 - x3) * (y - y2) - (y2 - y3) * (x - x2) > 0 &&
               (x3 - x1) * (y - y3) - (y3 - y1) * (x - x3) > 0)
            {
                colorBuffer[x] = 0x00FFFFFF; // White
            }
        }

        (char*&)colorBuffer += stride;
    }
}
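
A quick note on the helpers: min and max are called with three arguments here, which the standard library does not provide, so assume small helpers along these lines (a sketch; a template also covers the integer coordinates used further down):

// Sketch: three-argument min/max helpers assumed by the listings.
template<class T> inline T min(T a, T b, T c)
{
    return a < b ? (a < c ? a : c) : (b < c ? b : c);
}

template<class T> inline T max(T a, T b, T c)
{
    return a > b ? (a > c ? a : c) : (b > c ? b : c);
}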

This is the simplest working implementation of the half-space functions algorithm. Before I continue to explain how it can be improved, there are a few things to note. It would be possible to scan the whole screen, to see which pixels are inside the triangle, but this is of course a waste of time. Therefore only the smallest rectangle surrounding the triangle is scanned. You might also have noticed that I swapped some components in the formula. This is because in screen coordinates, the y-axis points downward. It is also very important that the triangle’s vertices are in counter-clockwise order. This makes sure that all the positive sides of the half-spaces point inside the triangle.
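
If the caller cannot guarantee that order, a signed-area test can enforce it before any setup is done. Here is a sketch (it assumes mutable copies of the vertices, since the function above takes const references, and std::swap from <utility>):

// Sketch: make v1, v2, v3 counter-clockwise in screen space (y pointing
// down). With the sign convention of the half-space functions above, the
// expected winding gives a negative value here, so a positive value means
// two vertices have to be swapped.
float area2 = (v2.x - v1.x) * (v3.y - v1.y) - (v3.x - v1.x) * (v2.y - v1.y);

if(area2 > 0) std::swap(v2, v3);   // wrong winding: flip it
// area2 == 0 means a degenerate (zero-area) triangle, which can be rejected.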

Unfortunately, this implementation isn’t fast at all. For every pixel, six multiplications and no less than fifteen subtractions are required. But don’t worry, this can be optimized greatly. Per horizontal line that we scan, only the x component in the formula changes. For the first edge, this means the half-space value of the previous pixel simply changes by a constant, (y1 - y2). That’s just one subtraction per pixel per edge! Furthermore, the vertex coordinates are ‘constant’ per triangle, so (y1 - y2) and all other pairs like this only have to be computed once per triangle. Stepping in the y direction is likewise just an addition of a constant. Let’s implement this:

void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    float y1 = v1.y;
    float y2 = v2.y;
    float y3 = v3.y;

    float x1 = v1.x;
    float x2 = v2.x;
    float x3 = v3.x;

    // Deltas
    float Dx12 = x1 - x2;
    float Dx23 = x2 - x3;
    float Dx31 = x3 - x1;

    float Dy12 = y1 - y2;
    float Dy23 = y2 - y3;
    float Dy31 = y3 - y1;

    // Bounding rectangle
    int minx = (int)min(x1, x2, x3);
    int maxx = (int)max(x1, x2, x3);
    int miny = (int)min(y1, y2, y3);
    int maxy = (int)max(y1, y2, y3);

    (char*&)colorBuffer += miny * stride;

    // Constant part of half-edge functions
    float C1 = Dy12 * x1 - Dx12 * y1;
    float C2 = Dy23 * x2 - Dx23 * y2;
    float C3 = Dy31 * x3 - Dx31 * y3;

    float Cy1 = C1 + Dx12 * miny - Dy12 * minx;
    float Cy2 = C2 + Dx23 * miny - Dy23 * minx;
    float Cy3 = C3 + Dx31 * miny - Dy31 * minx;

    // Scan through bounding rectangle
    for(int y = miny; y < maxy; y++)
    {
        // Start value for horizontal scan
        float Cx1 = Cy1;
        float Cx2 = Cy2;
        float Cx3 = Cy3;

        for(int x = minx; x < maxx; x++)
        {
            if(Cx1 > 0 && Cx2 > 0 && Cx3 > 0)
            {
                colorBuffer[x] = 0x00FFFFFF; // White
            }

            Cx1 -= Dy12;
            Cx2 -= Dy23;
            Cx3 -= Dy31;
        }

        Cy1 += Dx12;
        Cy2 += Dx23;
        Cy3 += Dx31;

        (char*&)colorBuffer += stride;
    }
}

The C1-C3 variables are the constant part of the half-space equations and are not recomputed per pixel. The Cy1-Cy3 variables hold the ‘starting value’ of the half-space functions at the top-left corner of the bounding rectangle. And finally the Cx1-Cx3 variables are the running values per horizontal scan. Only these are incremented (actually decremented) with the delta values every pixel. So testing whether a pixel is inside the triangle has become really cheap: aside from some setup per horizontal line and per triangle that is negligible, just three subtractions and three comparisons against zero.

Ok, so this algorithm is starting to show its usefulness. But there still are some serious issues to solve. Here’s an actual image created using this code: first_try.png. If it isn’t clear that this is actually a car, here’s the shaded reference image: reference.png. As you can see, there are many gaps between the polygons. The cause is twofold: precision, and the fill convention.

There are precision issues because single-precision floating-point numbers are quite limited: they have only 24 bits of mantissa precision. If you look at the half-space function, you see that we do several subtractions and multiplications. That’s the perfect recipe for precision problems. What most people would do in such a situation is use double-precision floating-point numbers. While this makes the issue unnoticeable, it isn’t exact and it’s also slower. The real solution might sound crazy: use integers! Integers give us 32 bits of precision, and do not suffer from precision loss when subtracting. Instead of floating-point, we’ll use fixed-point arithmetic to assure we get sub-pixel accuracy.

The second cause of the gaps is the fill convention. The half-space functions can be positive or negative, but also zero. In this case, the pixel is exactly on the edge. Until now, we’ve ignored this case and treated it as a pixel outside. We could treat it as inside, using >= comparisons, but this isn’t correct either. What would happen is that pixels on the edge shared by two triangles get drawn twice. While this may sound more acceptable than gaps, it causes serious artifacts for things like transparency and stenciling.

What we’ll use here is a top-left fill convention, just like Direct3D (and most OpenGL implementations). This means that all edges which are on the left side of the triangle, or on a horizontal top, are treated as inside the triangle. This convention assures that no gaps occur, and no pixel is drawn twice. Let’s first see how we can detect whether an edge is ‘top-left’. For humans, it’s really easy to recognise them. Here are a few examples: top-left.png. I’m sure nobody had a problem identifying the top-left edges; it’s really simple. But give yourself a minute to think about how you would detect this in code… Don’t look before you’ve tried it, but here’s the answer: detect.png. The arrows indicate the counter-clockwise order of the vertices. Note how for left edges they all point downward! To detect this in code we merely have to check the Dy1-Dy3 values! The only exception is the top edge. There, Dy# is zero, but Dx# is positive.

Now that we know how to detect top-left edges, we still have to deal with them correctly so pixels on these edges are treated as inside. What we actually want is to change the Cx# > 0 tests into Cx# >= 0 tests for those edges. Testing for top-left edges per pixel is way too slow, and handling all these cases in separate loops is way too complex. But look at what Cx# > 0 fundamentally is. It is ‘true’ when Cx# is 1, 2, 3, etc, and ‘false’ when Cx# is 0, -1, -2, etc. All we want is to also make it true when Cx# is zero. The solution is too simple for words: all we have to do is add 1 to Cx# beforehand. This can even be done sooner, with the C1-C3 variables! Ok, let’s sum this all up:

void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    // 28.4 fixed-point coordinates
    const int Y1 = iround(16.0f * v1.y);
    const int Y2 = iround(16.0f * v2.y);
    const int Y3 = iround(16.0f * v3.y);

    const int X1 = iround(16.0f * v1.x);
    const int X2 = iround(16.0f * v2.x);
    const int X3 = iround(16.0f * v3.x);

    // Deltas
    const int DX12 = X1 - X2;
    const int DX23 = X2 - X3;
    const int DX31 = X3 - X1;

    const int DY12 = Y1 - Y2;
    const int DY23 = Y2 - Y3;
    const int DY31 = Y3 - Y1;

    // Fixed-point deltas
    const int FDX12 = DX12 << 4;
    const int FDX23 = DX23 << 4;
    const int FDX31 = DX31 << 4;

    const int FDY12 = DY12 << 4;
    const int FDY23 = DY23 << 4;
    const int FDY31 = DY31 << 4;

    // Bounding rectangle
    int minx = (min(X1, X2, X3) + 0xF) >> 4;
    int maxx = (max(X1, X2, X3) + 0xF) >> 4;
    int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;
    int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;

    (char*&)colorBuffer += miny * stride;

    // Half-edge constants
    int C1 = DY12 * X1 - DX12 * Y1;
    int C2 = DY23 * X2 - DX23 * Y2;
    int C3 = DY31 * X3 - DX31 * Y3;

    // Correct for fill convention
    if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;
    if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;
    if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;

    int CY1 = C1 + DX12 * (miny << 4) - DY12 * (minx << 4);
    int CY2 = C2 + DX23 * (miny << 4) - DY23 * (minx << 4);
    int CY3 = C3 + DX31 * (miny << 4) - DY31 * (minx << 4);

    for(int y = miny; y < maxy; y++)
    {
        int CX1 = CY1;
        int CX2 = CY2;
        int CX3 = CY3;

        for(int x = minx; x < maxx; x++)
        {
            if(CX1 > 0 && CX2 > 0 && CX3 > 0)
            {
                colorBuffer[x] = 0x00FFFFFF;
            }

            CX1 -= FDY12;
            CX2 -= FDY23;
            CX3 -= FDY31;
        }

        CY1 += FDX12;
        CY2 += FDX23;
        CY3 += FDX31;

        (char*&)colorBuffer += stride;
    }
}

Here’s the end result: correct.png. Not only is it now entirely flawless, it’s also faster than the floating-point version! The range of the fixed-point integers is big enough for a 2048x2048 color buffer, with 4 bits of sub-pixel precision. Now that we’ve got this fixed, let’s focus on performance again…
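
As a quick sanity check on that range claim: with 2048 pixels and 4 sub-pixel bits, the 28.4 coordinates stay below 2048 * 16 = 32768 = 2^15. Each half-space constant has the form DY * X - DX * Y, so each product stays below 2^15 * 2^15 = 2^30 and their difference below 2^31, which just fits in a signed 32-bit integer.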

This implementation is perfect for small triangles. The setup cost per triangle is really low compared to the scanline conversion algorithm. Unfortunately, it’s not optimal for big triangles. About half of all pixels in the bounding rectangle are outside the triangle, but we still pay the price of evaluating the half-space functions there. Furthermore, until now I’ve only talked about drawing one pixel at a time, while the real benefit of this approach is the possibility of parallelism. I’ll show you how to take advantage of this, and what other benefits we get from it.

What we really want to do is quickly skip pixels outside the triangle. We don’t want to waste any time evaluating half-space functions there. We also don’t want to spend too much time inside the triangle. The really interesting part is around the edges. So what we’ll do is quickly detect whether 8x8 blocks are not covered, partially covered, or fully covered. Non-covered and fully covered blocks can quickly be rejected or accepted, respectively. Partially covered blocks will be scanned completely. There’s another reason to use this block-based approach: it is very useful for visibility determination algorithms. And an extra benefit is that memory accesses are more localized.

So how do we detect coverage as fast as possible? With the half-space functions, it’s really easy. All we have to do is evaluate the half-space functions at the corners of the blocks. If all four corners are on the negative side of at least one edge, the block is not covered. If all four corners are positive for all three edges, the block is completely covered. Everything else is a partially covered block. So the final implementation is:

void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    // 28.4 fixed-point coordinates
    const int Y1 = iround(16.0f * v1.y);
    const int Y2 = iround(16.0f * v2.y);
    const int Y3 = iround(16.0f * v3.y);

    const int X1 = iround(16.0f * v1.x);
    const int X2 = iround(16.0f * v2.x);
    const int X3 = iround(16.0f * v3.x);

    // Deltas
    const int DX12 = X1 - X2;
    const int DX23 = X2 - X3;
    const int DX31 = X3 - X1;

    const int DY12 = Y1 - Y2;
    const int DY23 = Y2 - Y3;
    const int DY31 = Y3 - Y1;

    // Fixed-point deltas
    const int FDX12 = DX12 << 4;
    const int FDX23 = DX23 << 4;
    const int FDX31 = DX31 << 4;

    const int FDY12 = DY12 << 4;
    const int FDY23 = DY23 << 4;
    const int FDY31 = DY31 << 4;

    // Bounding rectangle
    int minx = (min(X1, X2, X3) + 0xF) >> 4;
    int maxx = (max(X1, X2, X3) + 0xF) >> 4;
    int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;
    int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;

    // Block size, standard 8x8 (must be power of two)
    const int q = 8;

    // Start in corner of 8x8 block
    minx &= ~(q - 1);
    miny &= ~(q - 1);

    (char*&)colorBuffer += miny * stride;

    // Half-edge constants
    int C1 = DY12 * X1 - DX12 * Y1;
    int C2 = DY23 * X2 - DX23 * Y2;
    int C3 = DY31 * X3 - DX31 * Y3;

    // Correct for fill convention
    if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;
    if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;
    if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;

    // Loop through blocks
    for(int y = miny; y < maxy; y += q)
    {
        for(int x = minx; x < maxx; x += q)
        {
            // Corners of block
            int x0 = x << 4;
            int x1 = (x + q - 1) << 4;
            int y0 = y << 4;
            int y1 = (y + q - 1) << 4;

            // Evaluate half-space functions
            bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0;
            bool a10 = C1 + DX12 * y0 - DY12 * x1 > 0;
            bool a01 = C1 + DX12 * y1 - DY12 * x0 > 0;
            bool a11 = C1 + DX12 * y1 - DY12 * x1 > 0;
            int a = (a00 << 0) | (a10 << 1) | (a01 << 2) | (a11 << 3);
    
            bool b00 = C2 + DX23 * y0 - DY23 * x0 > 0;
            bool b10 = C2 + DX23 * y0 - DY23 * x1 > 0;
            bool b01 = C2 + DX23 * y1 - DY23 * x0 > 0;
            bool b11 = C2 + DX23 * y1 - DY23 * x1 > 0;
            int b = (b00 << 0) | (b10 << 1) | (b01 << 2) | (b11 << 3);
    
            bool c00 = C3 + DX31 * y0 - DY31 * x0 > 0;
            bool c10 = C3 + DX31 * y0 - DY31 * x1 > 0;
            bool c01 = C3 + DX31 * y1 - DY31 * x0 > 0;
            bool c11 = C3 + DX31 * y1 - DY31 * x1 > 0;
            int c = (c00 << 0) | (c10 << 1) | (c01 << 2) | (c11 << 3);

            // Skip block when outside an edge
            if(a == 0x0 || b == 0x0 || c == 0x0) continue;

            unsigned int *buffer = colorBuffer;

            // Accept whole block when totally covered
            if(a == 0xF && b == 0xF && c == 0xF)
            {
                for(int iy = 0; iy < q; iy++)
                {
                    for(int ix = x; ix < x + q; ix++)
                    {
                        buffer[ix] = 0x00007F00; // Green
                    }

                    (char*&)buffer += stride;
                }
            }
            else // Partially covered block
            {
                int CY1 = C1 + DX12 * y0 - DY12 * x0;
                int CY2 = C2 + DX23 * y0 - DY23 * x0;
                int CY3 = C3 + DX31 * y0 - DY31 * x0;

                for(int iy = y; iy < y + q; iy++)
                {
                    int CX1 = CY1;
                    int CX2 = CY2;
                    int CX3 = CY3;

                    for(int ix = x; ix < x + q; ix++)
                    {
                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)
                        {
                            buffer[ix] = 0x0000007F; // Blue
                        }

                        CX1 -= FDY12;
                        CX2 -= FDY23;
                        CX3 -= FDY31;
                    }

                    CY1 += FDX12;
                    CY2 += FDX23;
                    CY3 += FDX31;

                    (char*&)buffer += stride;
                }
            }
        }

        (char*&)colorBuffer += q * stride;
    }
}

Note that I scan partially covered blocks completely, all 8x8 pixels. This is for consistency with the other blocks, so they can be processed by the same visibility algorithm. Also, this is extremely fast when done in an unrolled loop, using assembly instructions. All further optimizations to this algorithm are best done in assembly anyway. So now the rasterizer can output coverage masks for 8x8 pixels. These can then easily be processed by the pixel pipeline(s). It’s easy to process such a block as a 4x4 grid of 2x2 quads, and many calculations can even be done per block. Everything taken together, there is no reason left to use the old scanline conversion algorithm.
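
To make the coverage mask idea concrete, the partial-block loop could just as well collect a 64-bit mask instead of writing colors directly. This is only a sketch of the idea; the mask layout and the integer type are illustrative:

// Sketch: build a 64-bit coverage mask for one 8x8 block. Bit (iy * 8 + ix)
// is set when pixel (x + ix, y + iy) is inside the triangle. CY1-CY3 and the
// FDX/FDY deltas are the same quantities as in the listing above.
unsigned __int64 coverage = 0;   // or unsigned long long

for(int iy = 0; iy < q; iy++)
{
    int CX1 = CY1;
    int CX2 = CY2;
    int CX3 = CY3;

    for(int ix = 0; ix < q; ix++)
    {
        if(CX1 > 0 && CX2 > 0 && CX3 > 0)
        {
            coverage |= (unsigned __int64)1 << (iy * 8 + ix);
        }

        CX1 -= FDY12;
        CX2 -= FDY23;
        CX3 -= FDY31;
    }

    CY1 += FDX12;
    CY2 += FDX23;
    CY3 += FDX31;
}

// A fully covered block simply has all 64 bits set, and the pixel pipeline
// can then consume the block quad by quad using this mask.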

Enjoy!

Nicolas “Nick” Capens

187 Replies


Mihail121 102 Sep 11, 2004 at 19:21

agreed with the most but:

As i know the Pentium and better CPUs work with the 80bit float numbers actually faster than with the 32 and 64bits. Of course i know few languages which make use of the extended format.

There are other few stuff, which i’m not yet sure about.. but the article is really good! Like everything in the _nick_ style ;)

Nick 102 Sep 11, 2004 at 19:45

@Mihail121

As i know the Pentium and better CPUs work with the 80bit float numbers actually faster than with the 32 and 64bits. Of course i know few languages which make use of the extended format.

I’m really happy with the performance of my current version, and I don’t think 80-bit floating-point can beat it. Anyway, I’m going to optimize all of it with MMX, which will already probably make it twice as fast… ;)

Noor 101 Sep 11, 2004 at 19:52

this can be turned in an article :yes:

Mihail121 102 Sep 11, 2004 at 19:54

Bah… too bad there ain’t MMX/3DNow! on those damn nokias….

z80 101 Sep 11, 2004 at 20:42

Good work, Nick! Very interesting! :yes:

john 102 Sep 11, 2004 at 21:16

Impressive work Nick!
Could you provide some figures that reveals the performance (and compare them with traditional rasterizers)?
How do graphic chips currently do this? I wonder if graphic chip vendors would benifit if your algorithm was implemented in hardware.
Have you read the book by Andre LaMoth on:
Tricks of the 3D Game Programming Gurus-Advanced 3D Graphics and Rasterization. What is your recommendation for using it as a learning reference for software rendering?

Nick 102 Sep 12, 2004 at 07:50

@john

Impressive work Nick!

I hope I haven’t upped the standard too much. ;)

Could you provide some figures that reveals the performance (and compare them with traditional rasterizers)?

For this car model, which has only small polygons, performance is equivalent when using the 8x8 block C++ version. For rendering a skybox (big polygons), performance is a bit lower (it beats the traditional rasterizer when using 16x16 blocks though). But let’s put these mediocre results in the right perspective. This algorithm enables parallel pixel processing. For example, stencil shadows. When an 8x8 block is completely inside the polygon, we can process this block with 64-bit MMX instructions in a flash. So although rasterization is sometimes a bit slower, the classical rasterizer cannot achieve this fillrate when processing one pixel at a time. Also, the C++ version presented here has sub-optimal code for detecting block coverage. I am very sure the MMX version will be able to do it many times faster (it has a pmaddwd instruction which does -four- multiplications in parallel and adds pairs; merely two of these instructions are needed to test one edge). Either way, even if it needs some tweaking, the advantages far outweigh the disadvantages.
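
To sketch that idea with intrinsics rather than raw assembly (this is only an illustration, not the actual MMX code; it uses 128-bit SSE2 registers, so one pmaddwd-style instruction covers all four corners of an edge, and it assumes the deltas and the 28.4 corner coordinates fit in signed 16-bit values):

#include <emmintrin.h> // SSE2

// Sketch: evaluate one edge (DX, DY, C stand for e.g. DX12, DY12, C1) at the
// four corners x0/x1, y0/y1 of a block with a single multiply-add.
__m128i coeff   = _mm_setr_epi16(DX, -DY, DX, -DY, DX, -DY, DX, -DY);
__m128i corners = _mm_setr_epi16(y0, x0, y0, x1, y1, x0, y1, x1);

// pmaddwd: DX * y - DY * x for each corner, as four 32-bit values
__m128i e = _mm_add_epi32(_mm_set1_epi32(C), _mm_madd_epi16(coeff, corners));

// Four sign tests at once: 0x0 means fully outside, 0xF all corners inside
int a = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpgt_epi32(e, _mm_setzero_si128())));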

If you want numbers, I suggest using swShader, and plugging this code into it.

How do graphic chips currently do this? I wonder if graphic chip vendors would benifit if your algorithm was implemented in hardware.

A lot of hardware has quad processing pipelines. It’s quite probable that they already use this algorithm or a variant more optimal for hardware implementation.

Have you read the book by Andre LaMoth on:
Tricks of the 3D Game Programming Gurus-Advanced 3D Graphics and Rasterization. What is your recommendation for using it as a learning reference for software rendering?

I bought it. It was a waste of my money. It didn’t teach me anything new and on every page I was able to either spot an error or I know a fundamentally better approach. And it’s a big book but that’s only because the whole code is listed in it.

Anyway, it isn’t all bad. I’m sure that for someone who has no experience with software rendering it’s quite good. He writes in a ‘popular’ way so it’s easy to read and understand. It nicely bundles all the primary problems someone has to solve to write a software renderer. Since there are no other books like this, it is automatically the best you can get. I would have scratched the ‘Advanced’ out of the subtitle though…

Mihail121 102 Sep 12, 2004 at 10:10

Agreed with Nick on LaMothe’s book: 75% of it is code, a wrong story about lightmaps, no correct organization, no nothing!

anubis 101 Sep 12, 2004 at 13:02

I bought it. It was a waste of my money. It didn’t teach me anything new and on every page I was able to either spot an error or I know a fundamentally better approach. And it’s a big book but that’s only because the whole code is listed in it.

i was surprised how many errors (i mean real mathematical errors, not typos) it contains. this has to be confusing for somebody without any experience.

davepermen 101 Sep 12, 2004 at 19:00

very nice done, nick. i think this shows again, premature optimisation isn’t everything, optimizing the algorithm is what counts.

now rastericing works on a completely different level somehow, it’s not about the individual pixel anymore, but really just the triangle.

it reminds me much of raytracing, this algorithm. the logic is the same, just with tons of restrictions on what rays you trace :D but it’s the same, with the same optimisation possiblities. process full blocks of rays in parallel, skip blocks that are useless, etc..

it all sounds familiar.. can’t wait to see another performance boost in sw-shader :D

davepermen 101 Sep 12, 2004 at 19:57

hehe.. you know what? your rastericer would rock quite a bit on this system:

dscn4140.jpg

i mean.. yeah! :D 16 parallel processors (actually 4 processors with 2 cores and 2 hyperthreads each). there, your parallelisation will get onto a next, higher level..

yehah! :D

SnprBoB86 101 Sep 12, 2004 at 22:51

Very interesting article. Confirmed how I thought this worked, and then extended on it with optimizations that are very nice to know. Thanks.

@davepermen:
Where did you get that screen? I would be interested to know who has the cash for that many processors, but not for a monitor large enough to show all the CPU usage graphs ;)

davepermen 101 Sep 13, 2004 at 04:48

Intel Developer Forum report at anandtech (just right-click and check the file properties to verify).

they aren’t that expensive actually. both amd and intel start to show off their new multicore cpu’s, for home and servers. this one is actually an ITANIUM dual core at 1.7ghz each, i think. each core has 2 hyperthreads, just as the newer p4 have, and the board is a 4x itanium board.. hence 4x2x2 hyperthreads, 4x2 cores, 4 cpus.

right now, it’s just a “we have it” proof demo. but it’s planned for next year to start shipping. ITANIUMs, Xeons, and Pentiums, all with dual cores.

AMD on it’s side will show off Opterons with multicore (2x, 4x), and athlon 64, too.. should be available next year, too.. (they showed off a first sample, too).

so yes, it’s some years till every home end user has one. but it’s right today important to start thinking about designing with multi-threading, to support hyperthreading by today, multicore by tomorrow, and multi-cpu for the highest-end-systems in the future.

hehe :D

Nick 102 Sep 13, 2004 at 18:32

I can’t wait for a 4 GHz dual-core 64-bit (double amount of registers), Hyper-Threading CPU with 1200 MHz dual-channel FSB and build-in memory controller. ;)

Mihail121 102 Sep 13, 2004 at 18:56

And when i think that only 12 years has passed since the graphical ‘wonders’ of Wolf3D…

davepermen 101 Sep 13, 2004 at 20:50

@Nick

I can’t wait for a 4 GHz dual-core 64-bit (double amount of registers), Hyper-Threading CPU with 1200 MHz dual-channel FSB and build-in memory controller. ;)

hehe..

will you one day consider writing some raytracing-stuff? i mean, your renderer gets more and more into the right direction anyways :D

you’ll have to at least write a protocoll how each and every ‘tile’ gets submitted to the right cpu.

btw, if you’re quick, you’ll be the first one with an api that wrapps well onto ps3! :D

john 102 Sep 13, 2004 at 21:09

not only does Nick’s swShader beat quake’s software rendering and other , but it beats the crap out of DirectX’s REF rastersizer. There’s no point in comparing against quake’s s/w anyways. Good job Nick! You’re going to be a millionare someday. You’re pretty much garenteed a job at any company you apply for. What’s the point of life after that? hehe (j/k) As far as I know, there isn’t a single commercial or free product which emulates shaders except for swShader (isn’t that right?).

I guess there wasn’t any incentive for Microsoft to optimize their software renderer since they expect hardware vendors to implement everything in hardware, and the software part is only used as a reference model to compare against.

anubis 101 Sep 13, 2004 at 21:29

You’re pretty much garenteed a job at any company you apply for. What’s the point of life after that?

american ?

dk 158 Sep 13, 2004 at 23:37

For those who haven’t noticed, the code spotlight submissions are categorized now B)

anubis 101 Sep 13, 2004 at 23:56

nice… maybe the categories should be separated more clearly?

Nick 102 Sep 14, 2004 at 08:20

@john

not only does Nick’s swShader beat quake’s software rendering and other , but it beats the crap out of DirectX’s REF rastersizer. There’s no point in comparing against quake’s s/w anyways. Good job Nick!

Thanks for the compliment, but I would nuance this a little. It doesn’t beat the Quake I renderer at all. In its context (a Pentium 100 MHz), the Quake I renderer is brilliant. Before that, there was nothing that came close. It did everything right: perspective correction, fill convention, prestepping, mipmapping, lightmapping. And all that at 4-6 clock cycles per pixel (inner loop)! Although my renderer has ten times more features, I have GHz processors with MMX and SSE and I’m still stuck at 20 FPS. :blush: Well, in its own context that ain’t bad at all… But anyway, it’s Quake I’s renderer that really inspired me, and at some points it still does.

You’re going to be a millionare someday. You’re pretty much garenteed a job at any company you apply for. What’s the point of life after that? hehe (j/k)

Money is not everything. Although I could certainly use some more to keep paying for my education, there’s more to life than this. In the end it’s all about living a happy life. And making the people around you happy is very important to feel good yourself. So I would never do a job just for the money, I would do it if I feel good about it and my career doesn’t make me neglect other people. Combining a good career and happiness is hard, but I’m trying…

As far as I know, there isn’t a single commercial or free product which emulates shaders except for swShader (isn’t that right?).

Mesa 3D, the OpenGL ‘reference rasterizer’, supports them too. I haven’t tested it, but I’ve heard it’s not too bad when only using a few simple shaders. But swShader is probably the only product where shaders can be categorized as real-time and efficient. Last week I was able to speed up my Per-Pixel Lighting demo from 34 to 42 FPS on my Pentium M 1.4 GHz laptop. Now I’m aiming at 50 FPS. Although I’m very excited about every improvement, I do realize that the actual usefulness of it all is very low. You can’t really run a semi-modern game with swShader. Even the cheapest integrated graphics chip from three years ago beats me at fillrate. It’s an uphill battle I’ll never win.

I guess there wasn’t any incentive for Microsoft to optimize their software renderer since they expect hardware vendors to implement everything in hardware, and the software part is only used as a reference model to compare against.

They did have some attempts at software rendering, like the RGB Emulation and Ramp Emulation. I don’t know if they ever evolved further, but I think that they quickly realized that the days of 2D graphics cards are totally over. A computer with a graphics card that can’t draw a triangle with a single texture is really hard to find. That’s almost all RGB and Ramp could do. :huh:

Mihail121 102 Sep 14, 2004 at 13:39

Hey John, check out Muli3D, it also supports shaders in easy and understanable way!

Nick 102 Sep 14, 2004 at 16:27

@Mihail121

Hey John, check out Muli3D, it also supports shaders in easy and understanable way!

It’s nice to see Muli3D has evolved! Just a couple of months ago it had sub-pixel precision issues, and now it shows the most advanced features.

I wouldn’t really say it has shader support though. :blush: You can derive a class from a PixelShader class, and implement the execute() function. This allows you to simply write the ‘shader’ in C++. In that sense every software renderer would have shader support even before a single line of it is implemented.

Anyway, it seems like it could become an open-source ‘reference rasterizer’…

Nick 102 Sep 16, 2004 at 14:28

The article is now also available on the swShader site:

swShader Documentation

If there’s any part that can be improved, like things that are unclear and need more explanation, please let me know!

davepermen 101 Sep 16, 2004 at 21:40

can’t wait for a first public 0.5.0 online :D espencially with HT support :D

Nick 102 Sep 16, 2004 at 23:10

@davepermen

can’t wait for a first public 0.5.0 online :D espencially with HT support :D

Unfortunately it won’t have HT support, nor 64-bit extensions (specifically the extra registers). I’m still stuck with a Celeron 1.2 GHz as my main system. :blink:

I don’t have the money for an upgrade, but even then it would be a dilemma. The really interesting processors with a combination of all the latest technology will only be available and affordable in 2006 or so. Realistically, I’m looking forward to nForce 4 motherboards with PCI-Express and socket 939. I can then have an Athlon 64, and next year upgrade to dual-core. Dual Xeon with EMT has also crossed my mind…

davepermen 101 Sep 17, 2004 at 04:48

well, you can try to implement multi-threading anyways, if you want.. i can tell you then, if it works.. at work, we have some p4 with ht..

64bit, well, yeah, i have to wait for as well… it will help you much, thanks to the registers.. it will help much thanks to the huge memory space, too… performance should be great on them. oh, and, native 64bit integer math would lead to much bigger possible render targets, native, too.. (with subpixels taken care of).

i’d be happy to spend you some pc or such, but, well.. i’m on a 2.7ghz celeron myself. intels proof that mhz don’t mather really at all… :(

davepermen 101 Sep 23, 2004 at 21:31

hey, nick, i’ve read you can enable the 64bit registers even on 32bit os’ !! and possibly even on semprons!

that’ll rock. semprons are hell cheap..

dk 158 Sep 27, 2004 at 04:18

Nick, I added a link to this Code Spotlight submission to the articles section since it deserves to be an article of its own.

I hope you have more of these to share with us :wacko:

Nick 102 Sep 27, 2004 at 14:33

Thanks!

I have plans to write about projection and clipping. That’s another stage in the rendering pipeline that is usually handled by the hardware, but again deeper knowledge of it is useful beyond software rendering.

cdgray 101 Sep 28, 2004 at 03:40

hey Nick: which profiler do you use to test the efficiency of your code? I would like to know where the bottleneck is in the code, to know where to start optimizations.

How much assembly (and MMX, SIMD, etc.) should one know in order to implement an efficient software renderer? I know little assembly (but i’m an expert in C/C++) and I’m wondering if assembly is a prerequisite for such things.

btw, I personally use DevPartner and it’s great, especially for memory leaks.

Mihail121 102 Sep 28, 2004 at 06:05

Hey Nick and DevMaster staff, the article is missing something! Can’t these image links be opened directly inside the article? I mean.. i would like to see the schemes directly, not to click around!

ector 101 Sep 28, 2004 at 06:57

Very interesting algorithm, Nick..

Small right-edge optimization to your original simple algorithm, should work in the blocked one too:

bool done = false;
for(int x = minx; x < maxx; x++)
{
   if(CX1 > 0 && CX2 > 0 && CX3 > 0)
   {
     colorBuffer[x] = 0x00FFFFFF;
     done=true;
   }
   else
     if (done) break;

   CX1 -= FDY12;
   CX2 -= FDY23;
   CX3 -= FDY31;
}

Haven’t done any benchmarking to see if getting rid of, on average, half the time wasted outside the triangle is worth the extra check.. could easily see it being worth it in an ARM implementation on GBA for example, because of its cheap conditionals. Or you could spend some extra code size to split the inner loop into two: the first only steps forward until it finds the triangle, the second fills and exits as soon as the condition becomes false.
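
For illustration, the split-loop variant could look roughly like this (untested sketch, reusing the variable names from Nick's listing):

int x = minx;

// Phase 1: step right until the first covered pixel (or the end of the row)
while(x < maxx && !(CX1 > 0 && CX2 > 0 && CX3 > 0))
{
    CX1 -= FDY12;
    CX2 -= FDY23;
    CX3 -= FDY31;
    x++;
}

// Phase 2: fill until we step outside the triangle again
while(x < maxx && CX1 > 0 && CX2 > 0 && CX3 > 0)
{
    colorBuffer[x] = 0x00FFFFFF;
    CX1 -= FDY12;
    CX2 -= FDY23;
    CX3 -= FDY31;
    x++;
}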

Anyhow, I haven’t given it much thought yet, but how would you efficiently interpolate things like texture coordinates and colors over the triangle? How to do it is obvious with the scanline algorithm, but here I guess a little bit of line equation trickery is needed and it’s too early in the morning for me to work it out :wacko:

Nick 102 Sep 28, 2004 at 11:18

@cdgray

hey Nick: which profiler do you use to test the efficiency of your code? I would like to know where the bottleneck is in the code, to know where to start optimizations.

I’m not using any profiler. Mostly I just disable portions of the code, to see how much that influences performance. Or I use simple timers. For the dynamically generated code in my software renderer it’s simple: less instructions is better. :D

How much assembly (and MMX, SIMD, etc.) should one know in order to implement an efficient software renderer? I know little assembly (but i’m an expert in C/C++) and I’m wondering if assembly is a prerequisite for such things.

It depends on your definition of efficient. MMX and SIMD give me a speedup of 2 to 4. In my new parallel renderer it will probably be close to 4 everywhere. But dynamic compilation of assembly pipelines is definitely the biggest optimization. It eliminates hundreds of state checks per pixel, while still allowing hundreds of states. The best alternative I know, completely in C++, is to do something like deferred rendering. First sample all the textures, then do all the blending, then the interpolation, etc. This also prevents checking the states for every pixel, but memory bandwidth becomes a problem since you constantly read data, modify it a little and write it back. In a dynamically compiled pipeline everything is processed per pixel and the important data can stay in registers.

btw, I personally use DevPartner and it’s great, especially for memory leaks.

I once tried using it but I didn’t find it very useful. My home-made tools can do just the same thing, more focussed. I love VTune, but unfortunately still can’t afford it.

Nick 102 Sep 28, 2004 at 11:30

@ector

Small right-edge optimization to your original simple algorithm, should work in the blocked one too…

It’s slightly worth it in C++, but not for the assembly implementation. It’s already really fast to determine coverage for a block. The extra checks and extra register use are not worth it.

Indeed on a GBA it could be very useful, though I doubt if this parallel rasterizer actually has use there at all?

Anyhow, I haven’t given it much thought yet, but how would you efficiently interpolate things like texture coordinates and colors over the triangle?

This is quite easy. First determine gradients du/dx, dv/dx, du/dy and dv/dy. Then for every pixel you can determine how far it is from the first vertex: dx = x - x0, dy = y - y0 (note that x0 and y0 are floating-point, x and y are integer). Then the texture coordinates are simply: u = u0 + du/dx * dx + du/dy * dy.

Once you got the coordinates for one pixel you obviously only have to add du/dx and/or du/dy to get the texture coordinates for the neighboring pixels. So I do the above calculation once per block, then use this linear interpolation for the other pixels in the block.
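
As a rough sketch of that per-block setup (illustrative names only, not my actual code):

// Sketch: linear interpolation of a texture coordinate u over one 8x8 block.
// dudx and dudy are the gradients, (x0f, y0f) is the (floating-point)
// position of the first vertex and u0 its texture coordinate; (x, y) is the
// top-left pixel of the block.
float u = u0 + dudx * (x - x0f) + dudy * (y - y0f);

for(int iy = 0; iy < 8; iy++)
{
    float uRow = u;

    for(int ix = 0; ix < 8; ix++)
    {
        // ... sample the texture at uRow (and the equivalent v) here ...

        uRow += dudx;   // step one pixel to the right
    }

    u += dudy;   // step one pixel down
}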

ector 101 Sep 28, 2004 at 21:39

@Nick

Indeed on a GBA it could be very useful, though I doubt if this parallel rasterizer actually has use there at all?

Well it’s a way of rasterizing subpixel correct triangles with cheap setup and no divisions (GBA has no hardware divider).. and your blocked algorithm could be very interesting when rasterizing to tilemaps, you could fill a block instantly with just a map index set :) That would only be useful for really big flat triangles though.

This is quite easy. First determine gradients du/dx, dv/dx, du/dy and dv/dy. Then for every pixel you can determine how far it is from the first vertex: dx = x - x0, dy = y - y0 (note that x0 and y0 are floating-point, x and y are integer). Then the texture coordinates are simply: u = u0 + du/dx * dx + du/dy * dy.
Once you got the coordinates for one pixel you obviously only have to add du/dx and/or du/dy to get the texture coordinates for the neighboring pixels. So I do the above calculation once per block, then use this linear interpolation for the other pixels in the block.

Yeah, of course. Realized that’s exactly how you would do it immediately after posting. Thanks for putting it into words though for the enjoyment of other readers :)

Ravyne 101 Oct 15, 2004 at 03:08

Nick:
I’m new to these boards but have been very interested in software rasterization for a long while. I have my own project in the early stages; hopefully I’ll find more time to work on it, but I’m back in school now.

Anyhow, in Graphics class today I was thinking about the technique in your article. I had a moment of potential inspiration and ended up with somewhat of a branch from your technique; I wonder what your thoughts might be on it.

What I propose is that, given a triangle, we create the smallest square bounding box with power-of-two dimensions along both axes that encloses the three points. In the case of a 2x2 box, the half-space values are immediately calculated for each corner and each of the 4 pixels is lit accordingly. Otherwise, the box is subdivided into 4 equal, square portions recursively. The half-space values are calculated for the corners and used to determine whether the box is completely lit, rejected or partially lit, as in your technique. Partially lit boxes are further subdivided until they form a 2x2 box, and the calculated half-space values are then used to light each of the 4 pixels accordingly.

This technique has the advantage that the largest possible boxes are accepted/rejected as early as possible. 2x2 and 4x4 boxes produce 4 and 16 half-space checks respectively, and 8x8 boxes produce 80 half-space checks; boxes larger than 8x8 are where the benefit kicks in, as the likelihood of finding large boxes to accept/reject increases. Large triangles become quite efficient. I also believe that this technique can lend itself to parallel processing as well.

I would be glad to hear your input on this technique; perhaps it’s already been done and I am unaware of it. Did you make a decision to choose 8x8 boxes for other benefits, like per-box lighting calculations perhaps? How well do you think this technique might perform? If it interests you I would be glad to discuss it further.

Should this idea be something new (and more importantly, a good idea) I’d call it “Rasterization by recursive bilateral reduction.” Also, aside from triangles, this approach, and yours as well I believe, can be made to apply to any convex n-gon.

Nick 102 Oct 15, 2004 at 08:04

Hi Ravyne,

Theoretically your algorithm has a lot of value. Indeed it would be most efficient of all, especially for gigantic resolutions. However, I have tried similar techniques in practice and they all turned out to have more or less the same performance as the 8x8 block method. The reason is that, although the recursive algorithms are algorithmically more efficient, they are computationally more expensive. To make a close analogy; it’s like using quick-sort for three elements. :wink:

Like you say, it only starts to really work for boxes larger than 8x8. But for big polygons, the rasterization cost is relatively small compared to the time spent in the pixel pipelines. An 8x8 block is skipped in around 10 clock cycles or so, while shading 8x8 pixels with one texture takes well over 1,000 clock cycles. So in this perspective I think it’s not all that useful to look for more efficient rasterization.

But the primary reason why I’m still using 8x8 blocks is actually the implementation complexity. As you can see in the article, it is relatively compact and straightforward to implement. All other attempts added some complexity, and this comes at the cost of flexibility. Also, I’d like to keep the setup cost for tiny polygons as low as possible. A recursive approach would optimize for big polygons but make small polygons less efficient. So currently I’m keeping the 8x8 blocks approach, because it still gives me the freedom to refine it afterwards, if necessary.

Another reason is that consistently using 8x8 blocks makes it easy to implement a visibility system without complications. I already had to deal with situations where the render target’s dimensions are not a multiple of 8. Using other block sizes would only make this worse, while I believe it wouldn’t add much benefit to the visibility algorithm. In 640x480 mode, there are ‘only’ 80x60 blocks anyway.

The most successful alternative so far is to have a specialized rasterizer for tiny and huge polygons. Tiny polygons often fall into a 2x2 pixel block, so they can be passed to the pixel pipelines directly. This occurs frequently for models with a large number of polygons, far away. For huge polygons several 8x8 blocks can be skipped at once by computing exact start and end values of the polygon edges per row. This requires a slow division though, so I haven’t seen much performance increase from it at medium resolutions, unfortunately.

But your idea is definitely valuable for big resolutions. For non-real-time rendering with anti-aliasing I believe an approach like that becomes a necessity. Either way, I’ll keep you updated on any new findings that apply to my software renderer, and I wish you good luck with your project!

Mario 101 Nov 05, 2004 at 20:44

Greetingz

It’s needless to say that your article precisely describes a fast and efficient method for such a basic routine as triangle rasterization. Simply, nothing to add, nothing to remove. However, it wouldn’t be smart to use it instead of existing solutions, unless those solutions don’t exist. The majority of programmers rather avoid diving into low-level code, which generally is implemented as a hardware service. As a matter of fact, the presented algorithm was used in compact asm routines in simple 80x86 3D intros/games, but it was largely forgotten once hardware dedicated to accelerating graphics took over. Nowadays, modern 3D applications aren’t even similar to those from the past.

The discussion about optimizing the basics isn’t as productive as it used to be. Algorithmic analyses and theoretical complexity evaluations don’t produce many interesting results anymore. The only tool a programmer can use is a profiler (or other time-based feedback, of course), whose role is to find bottlenecks; these can only be found empirically. My experience has taught me that analysing code isn’t a good method for determining time complexity. What usually looks inefficient is often fast. Simply, I believe in the 80/20 theory.

Despite the fact that the mentioned algorithm is satisfying, it doesn’t change anything.

Best regards
- Mario

Nick 102 Nov 06, 2004 at 08:50

Hi Mario,

I agree that my article has close to zero practical use nowadays. The real reason I wrote it and posted it here is explained in the first paragraph. These kinds of algorithms are forgotten or simply unknown by many, while I think they are interesting for all graphics programmers. In the worst case, it’s just a small math exercise, showing that there are alternatives to the classic scanline-based approach.

Imagine that someone would like to optimize his terrain renderer by identifying regions that are inside/outside the frustum. Although this would be done in 3D, the algorithm would actually be equivalent to rasterization. Terrain vertices are the pixels, frustum planes are the half-space functions. The inside/outside test would also use the same approach, and just like here it can be optimized by working on bigger regions first.

So, the article’s primary function is not to get everybody to start writing his own software renderer. As you say, this is not productive, and the market for software renderers is tiny or next to non-existent. But it’s still very useful as a learning experience. I also don’t fully agree with your opinion about algorithmic optimization. It’s clear that computing coverage of 8x8 blocks, 64 pixels at a time, is a good shortcut to avoid rasterizing the whole bounding rectangle. I forgot to mention this in the article, but the size of the optimal block was determined experimentally. I did mention that further optimizations are best done at the assembly level, implying that further profiling would be required.

Kind regards,

Nicolas

Nick 102 Nov 24, 2004 at 15:58

Bug & fix:

A couple of days ago I found a bug in my renderer. When drawing a very detailed terrain, there were some 8x8 pixel blocks flashing on the horizon. After some analysis I discovered that it’s a bug in this rasterizer when using very small triangles.

The explanation turned out to be very obvious: the half-space function (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) >= 0 assumes that x2 - x1 and y2 - y1 are never both zero. However, this does happen with tiny triangles, since the fixed-point precision is limited to 1/16 of a pixel. When both deltas are zero, the half-space function (corrected for the top-left fill convention) becomes true for every pixel, resulting in a whole 8x8 block being filled while actually none of its pixels should be.

Fixing this is very straightforward. For every edge we have to check that the deltas aren’t both zero. We compute them already anyway, so there’s virtually no extra work. Tiny triangles like this don’t have to be drawn at all. They fall ‘between’ pixels, so skipping them won’t even cause gaps in high-detail geometry. For the interested: a few more triangles can be rejected this way…
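
In terms of the listing in the article, the check boils down to something like this (just a sketch):

// Sketch: reject triangles with a degenerate edge, right after computing the
// deltas. If both deltas of any edge are zero, two vertices snapped to the
// same 1/16-pixel position and the corrected half-space test would accept
// every pixel.
if((DX12 | DY12) == 0 || (DX23 | DY23) == 0 || (DX31 | DY31) == 0)
{
    return;   // zero-area (sub-pixel) triangle: draw nothing
}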

davepermen 101 Nov 29, 2004 at 00:29

hey, you need an athlon64 to test your work on?..

i got one now:D

Kettlebell 101 Dec 19, 2004 at 08:24

Just a question about the code Nick.

What is your implementation of ‘iround’?

Cheers,
Tim

Nick 102 Dec 21, 2004 at 11:00

Hi Kettlebell,

iround() just performs a round to integer operation. I implemented it like this:

// Rounds to the nearest integer: fistp uses the FPU's current rounding
// mode, which is round to nearest by default.
inline int iround(float x)
{
  int t;

  __asm
  {
    fld   x   // push the float onto the FPU stack
    fistp t   // pop it to memory as a rounded integer
  }

  return t;
}
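
For compilers without MSVC-style inline assembly, something along these lines should behave the same way (just a sketch; it also converts using the current rounding mode, round to nearest by default):

#include <xmmintrin.h> // SSE

// Sketch of a portable alternative to the inline assembly version above.
inline int iround(float x)
{
  return _mm_cvt_ss2si(_mm_set_ss(x));
}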
Kettlebell 101 Dec 22, 2004 at 15:34

Many thanks Nick.

Previously I was doing an (int) cast on all the floats, and I’ve since discovered that it’s a very inefficient way to be doing it (on the microsoft compiler).

I’ve integrated your code to replace my scanline algorithm, and I can see a very clear performance boost. Awesome, awesome work.

Looking forward to any future articles.

Tim

Nick 102 Dec 23, 2004 at 02:59

@Kettlebell

Previously I was doing an (int) cast on all the floats, and I’ve since discovered that it’s a very inefficient way to be doing it (on the microsoft compiler).

All x86 compilers have this suboptimal behaviour. The problem is that the standard rounding mode for x86 is round to nearest, while an (int) cast requires rounding towards zero. So the cast requires two rounding mode changes, which is a rather slow operation.

I’ve integrated your code to replace my scanline algorithm, and I can see a very clear performance boost. Awesome, awesome work.

Great! What kind of project are you working on?

Looking forward to any future articles.

There will definitely come more…

Kettlebell 101 Dec 24, 2004 at 05:56

@Nick

Great! What kind of project are you working on?

I’m working on my own software renderer. It’s quite an undertaking, but I’ve so far managed to put in clipping, directional and point lights and flat shading (made a lot better thanks to your help).

I’m now working on putting in Gouraud shading. Since I’m here I have a quick question for you Nick. Earlier in the thread you made reference to calculating du/dx, du/dy etc. for interpolation of texture coordinates.

I’ve read through part 1 of Chris Hecker’s tutorial on perspective texture mapping. Though he didn’t mention it explicitly in the tutorial, a sorting of the y-values is required prior to the calculation of the triangle gradients, right?

Thanks for the help, have a Merry Christmas!

Tim

efortier 101 Dec 25, 2004 at 21:52

I agree that my article has close to zero practical use nowadays.

I beg to differ :). Most PocketPC devices available today (with the exception of the new Axim and a few others) have no 3D chip. Nor do the Zodiac gaming device and many other platforms.

And processing 8x8 blocks also has another hidden advantage, especially on devices with no video memory and/or 3D chip: it makes better use of the memory cache. Processing pixels in sequence, like the scanline method, does not use the cache efficiently (on PocketPC devices at least) but using blocks does. I suspect that processing 16x16 blocks on some devices would yield even better results than 8x8, unless the polygons are very small.

It’s true that people are starting to forget about software rasterizers. Nowadays it’s all about shaders, normal mapping, etc. I’m having a real hard time learning it all.

Your article really inspired me and I decided to start working on a new engine to replace my old one (based on LaMothe’s book). The technique you presented is much clearer than the tedious setup you need to do with the scanline approach.

–Eric

Nick 102 Dec 26, 2004 at 10:02

@Kettlebell

I’m working on my own software renderer. It’s quite an undertaking, but I’ve so far managed to put in clipping, directional and point lights and flat shading (made a lot better thanks to your help).

Cool! Is it for educational purposes or do you intend to achieve something more than that? Either way good luck!

I’m now working on putting in gouraud shading. Since I’m here I have a quick question for you Nick. Earlier on in the thread you made reference to calculating du/dx, du/dy etc for interpolation of texture coordinates. I’ve read through part 1 of Chris Heckler’s tutorial on perspective texture mapping. Though he didn’t mention it explicitly in the tutorial, a sorting of the y-values is required prior to the calculation of the triangle gradients right?

No, sorting is no longer required. Vertices have to be passed in counter-clockwise order. This guarantees the half-space functions are oriented correctly, and also allows to consistently compute the gradients.

efortier 101 Dec 27, 2004 at 02:55

No, sorting is no longer required. Vertices have to be passed in counter-clockwise order. This guarantees the half-space functions are oriented correctly, and also allows to consistently compute the gradients.

If it’s not too much to ask, would it be possible to see an example of the right way to find the gradients? I’ve been at this fulltime for the past two days, and I can’t seem to find the right way to get those gradients…

Man, you really have to try your hands at 3D stuff to learn to appreciate the likes of you, Nick ( and Carmack too I guess :) ).

:wacko:

my head spins…

–Eric

03aee5b93473d83466aef86f725797f9
0
Kettlebell 101 Dec 27, 2004 at 08:33

@efortier

If it’s not too much to ask, would it be possible to see an example of the right way to find the gradients? I’ve been at this fulltime for the past two days, and I can’t seem to find the right way to get those gradients…

Hi Eric, have you checked out the tutorials by Chris Hecker?

If not, here’s a link:

http://www.d6.com/users/checker/misctech.htm

Check out the perspective correct texture mapping series, part1.

Inside you’ll find a discussion of the triangle gradients. His explanation of the gradients makes sense, but I get messed up when I draw a different triangle. Maybe you could lend us a hand Nick? :)

The formulas are still in there, and it was pretty easy to integrate 1/z buffering into Nick’s rasterizer. All of this sure beats Lamothe’s tedious scanline implementation.

Good luck mate,
Tim

03aee5b93473d83466aef86f725797f9
0
Kettlebell 101 Dec 27, 2004 at 08:49

@Nick

Cool! Is it for educational purposes or do you intend to achieve something more than that? Either way good luck!

Only for educational purposes. I’m using Lamothe’s book. He’s got some of the most hideous code I’ve ever seen, which is probably a good thing since it forces you to implement everything on your own. But the bits where he’s explaining a concept are easy enough for a beginner like myself to grasp.

He could have easily cut the page count in half though.

3dc8148cb234ef1b680283f477d9b814
0
efortier 101 Dec 27, 2004 at 09:14

Hi Eric, have you checked out the tutorials by Chris Hecker?

Thanks Tim, I’ll check it out. I need to implement an affine mapper, but I suppose it can’t hurt.

He could have easily cut the page count in half though.

Why the hell does he list page after page of prototypes?? A few listings I understand, but function names? What’s the use?

I also emailed him a few times with errors in his book. Some of them are pretty apparent (typos), but some others are insidious (math stuff).

The last time I emailed him a list of errors, he didn’t even send a thank-you reply (I had sent him around 15 errors I found!) or ask for any clarification…

–Eric

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Dec 27, 2004 at 17:24

@efortier

If it’s not too much to ask, would it be possible to see an example of the right way to find the gradients? I’ve been at this fulltime for the past two days, and I can’t seem to find the right way to get those gradients…

The easiest and most robust way is to use the plane equation.

Let’s say we want to interpolate some component z linearly across the polygon (note that z stands for any interpolant). We can visualize this as a plane going through the x, y and z positions of the triangle, in 3D space. Now, the equation of a 3D plane is generally:

A * x + B * y + C * z + D = 0

From this we can derive that:

z = -A / C * x - B / C * y - D / C

Note that for every step in the x-direction, z increments by -A / C, and likewise it increments by -B / C for every step in the y-direction. So these are the gradients we’re looking for to perform linear interpolation. In the plane equation (A, B, C) is the normal vector of the plane. It can easily be computed with a cross product.

Now that we have the gradients, let’s call them dz/dx (which is -A / C) and dz/dy (which is -B / C), we can easily compute z everywhere on the triangle. We know the z value in all three vertex positions. Let’s call the one of the first vertex z0, and its position coordinates (x0, y0). Then a generic z value of a point (x, y) can be computed as:

z = z0 + dz/dx * (x - x0) + dz/dy * (y - y0)

Once you’ve computed the z value for the center of the starting pixel this way, you can easily add dz/dx to get the z value for the next pixel, or dz/dy for the pixel below (with the y-axis going down).
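
In code the evaluation is just a one-liner. Here's a minimal sketch (placeholder names, not the actual renderer code):

float interpolate(float z0, float x0, float y0,   // first vertex and its value
                  float dzdx, float dzdy,         // gradients from the plane normal
                  float x, float y)               // point to evaluate (e.g. a pixel center)
{
    return z0 + dzdx * (x - x0) + dzdy * (y - y0);
}

Of course you only evaluate this once per triangle (or once per block); after that, stepping to a neighboring pixel is a single addition of dz/dx or dz/dy.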

I hope this helps! Don’t hesitate to ask for further clarification.

3dc8148cb234ef1b680283f477d9b814
0
efortier 101 Dec 27, 2004 at 17:33

Thanks Nick, that makes sense. What I still don’t understand, given the context of the half-space rasterizing technique you presented, is how to get the proper deltas (dx,dy,du,dv) if the vertices are not y-sorted. I should mention that I am implementing an affine mapper (first).

LaMothe sorts the vertices and then checks for flat-top and flat-bottom. I am unable to come up with the right way to find the deltas if you drop these two steps.

Any suggestions?

Thanks again,

–Eric

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Dec 27, 2004 at 22:58

@efortier

LaMothe sorts the vertices and then checks for flat-top and flat-bottom. I am unable to come up with the right way to find the deltas if you drop these two steps.

Computing the (A, B, C) normal of the plane takes care of everything. It’s done with a cross product. Component-wise it looks like this:

A = (z3 - z1) * (y2 - y1) - (z2 - z1) * (y3 - y1)
B = (x3 - x1) * (z2 - z1) - (x2 - x1) * (z3 - z1)
C = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)

See, the cross product is independent of which two edge vectors of the triangle you choose; only the sign depends on the order. So as long as the vertices are in counter-clockwise order it doesn’t matter how you number them.
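
Putting the whole setup together, here's a minimal sketch in C++ (placeholder names, not the actual renderer code):

struct Gradients
{
    float dzdx;   // change of z per step in the x-direction (= -A / C)
    float dzdy;   // change of z per step in the y-direction (= -B / C)
};

// z stands for any interpolant (color, u, v, 1/w, ...), given at the three vertices
Gradients setupGradients(float x1, float y1, float z1,
                         float x2, float y2, float z2,
                         float x3, float y3, float z3)
{
    // normal (A, B, C) of the plane through the three (x, y, z) points
    float A = (z3 - z1) * (y2 - y1) - (z2 - z1) * (y3 - y1);
    float B = (x3 - x1) * (z2 - z1) - (x2 - x1) * (z3 - z1);
    float C = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1);

    Gradients g;
    g.dzdx = -A / C;   // C is non-zero for a non-degenerate screen-space triangle
    g.dzdy = -B / C;
    return g;
}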

Good luck!

3dc8148cb234ef1b680283f477d9b814
0
efortier 101 Dec 28, 2004 at 04:01

Hi Nick,

Thanks for the explanation. I got the affine texture mapping working! Here’s what I do, let me know if I understood correctly:

  • Use counter-clockwise polygon ordering

  • Transform visible polygons from object space all the way through to screen space

  • Using the plane equation you showed above, replace z1-z3 with the value I want to interpolate (u, v etc)

  • dx is then -A / C, and dy = -B / C

  • in the half-space loop, use dx and dy to interpolate my value

  • x1-x3 and y1-y3 used in the rasterizer should be in screen space

How does it look? That is very straightforward, and adding new values to interpolate is easier than before.

My problem is that I don’t have a very strong math background, so I can’t seem to think in terms of math when trying to solve a problem, but instead in terms of language and programming…

Thanks for your help. I’ll be benchmarking both the scanline and half-space rasterizer on the PocketPC devices I have soon, and I’ll post my findings.

–Eric

03aee5b93473d83466aef86f725797f9
0
Kettlebell 101 Dec 29, 2004 at 03:24

That was so easy to understand, thanks Nick.

You know, you should probably consider writing a book on all of this. You are very good at explaining things clearly.

Tim

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Dec 29, 2004 at 09:27

@Kettlebell

That was so easy to understand, thanks Nick. You know, you should probably consider writing a book on all of this. You are very good at explaining things clearly.

Thanks! I’ve already written articles for ShaderX 2 and ShaderX 3. :wink:

D41f3b80a53f833c1ab04e2901133c6f
0
Vertexmania 101 Apr 15, 2006 at 20:56

Nice algorithm, but how do you handle visible surface determination?
How can I calculate the z-value of every pixel and compare it to the z-buffer?

Sometimes when a triangle is getting clipped it can become a more complex polygon like a quad. How do you handle this?

2b97deded6213469bcd87b65cce5d014
0
Mihail121 102 Apr 15, 2006 at 22:22

@Vertexmania

Sometimes when a triangle is getting clipped it can become a more complex polygon like a quad. How do you handle this?

Simple, my friend! There shouldn’t be any problem at all implementing both the standard rasterization routine (scanline) and the one discussed here (half-space); then you would use the new algorithm only for those primitives that are still triangles after the clipping phase - the scanline rasterizer can handle the rest.

Another way to do it would be to break the clipped polygon into triangles - a straightforward process!

A third possibility is to extend the half-space rasterization function to polygons with more than 3 vertices.

Hope that helps, cheers!

B6f2d859eef6e0cc35e5feaa35af0e22
0
Madis731 101 Mar 29, 2007 at 14:18

Hi, I turned this code (the final one) into assembly and wow was it painful :D
There were some problems making it take advantage of MMX/SSE, but I did it to some extent. Anyway I will first comment your code a bit:

void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    // 28.4 fixed-point coordinates
    const int Y1 = iround(16.0f * v1.y);
    const int Y2 = iround(16.0f * v2.y);
    const int Y3 = iround(16.0f * v3.y);

    const int X1 = iround(16.0f * v1.x);
    const int X2 = iround(16.0f * v2.x);
    const int X3 = iround(16.0f * v3.x);

    // Deltas
/*
  This was really easy. I fit 6 words into XMMs and could do all the
  subtraction and shifting.
*/
    const int DX12 = X1 - X2;
    const int DX23 = X2 - X3;
    const int DX31 = X3 - X1;

    const int DY12 = Y1 - Y2;
    const int DY23 = Y2 - Y3;
    const int DY31 = Y3 - Y1;

    // Fixed-point deltas
    const int FDX12 = DX12 << 4;
    const int FDX23 = DX23 << 4;
    const int FDX31 = DX31 << 4;

    const int FDY12 = DY12 << 4;
    const int FDY23 = DY23 << 4;
    const int FDY31 = DY31 << 4;

    // Bounding rectangle
/*
  These four went in 2 clock cycles :)
*/
    int minx = (min(X1, X2, X3) + 0xF) >> 4;
    int maxx = (max(X1, X2, X3) + 0xF) >> 4;
    int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;
    int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;

    // Block size, standard 8x8 (must be power of two)
    const int q = 8;
/*
  I found it very interesting to vary this dynamically. There were situations
  where 4 or 16 was better. Imagine triangles ranging from a 4x4px bounding box
  up to 500x500 in size...
*/
    // Start in corner of 8x8 block
    minx &= ~(q - 1); // This and later is all in 64-bit x86 ASM without SSE
    miny &= ~(q - 1);

    (char*&)colorBuffer += miny * stride;

    // Half-edge constants
    int C1 = DY12 * X1 - DX12 * Y1; // SSE only has instructions for WORDs
    int C2 = DY23 * X2 - DX23 * Y2; // but this needs at least DWORDs and
    int C3 = DY31 * X3 - DX31 * Y3; // doing 6 multiplies with IMUL is faster
    // than converting it to SSE and doing 2*3 MULUDQs with additional sign-
    // checking because UDQ is UNsigned :(

    // Correct for fill convention
    if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;
    if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;
    if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;

    // Loop through blocks
    for(int y = miny; y < maxy; y += q)
    {
        for(int x = minx; x < maxx; x += q)
        {
            // Corners of block
            int x0 = x << 4;
            int x1 = (x + q - 1) << 4;
            int y0 = y << 4;
            int y1 = (y + q - 1) << 4;

            // Evaluate half-space functions
/*
  This one here is very painful and even in case of a right triangle sitting in
  one half of the bounding box it takes 24 IMULs for every 8x8 pixel region
  empty or not. One or two optimizations are possible, though. You cast these
  bools to ints here, but actually you could just add these results without
  shifting (a00+a10+a01+a11) because you only check whether its all 0 or
  all 1. Then it becomes 0 or 4. Only loss is that you don't know which ones
  where 1 and which ones 0 (may be necessary with masking). The definite
  one optimization is that you don't have to wait until c to check for 0, but
  you can to it after every four bools. Bail out of the loop quickly after a==0
  so you will cut off these painful 16 IMULs, 16 ADDs/SUBs, 8 CMPs, 4 SHLs
  and 4 ORs. That on average will improve throughput by 10-35% depending
  on the triangles.
*/
            bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0;
            bool a10 = C1 + DX12 * y0 - DY12 * x1 > 0;
            bool a01 = C1 + DX12 * y1 - DY12 * x0 > 0;
            bool a11 = C1 + DX12 * y1 - DY12 * x1 > 0;
            int a = (a00 << 0) | (a10 << 1) | (a01 << 2) | (a11 << 3);
    
            bool b00 = C2 + DX23 * y0 - DY23 * x0 > 0;
            bool b10 = C2 + DX23 * y0 - DY23 * x1 > 0;
            bool b01 = C2 + DX23 * y1 - DY23 * x0 > 0;
            bool b11 = C2 + DX23 * y1 - DY23 * x1 > 0;
            int b = (b00 << 0) | (b10 << 1) | (b01 << 2) | (b11 << 3);
    
            bool c00 = C3 + DX31 * y0 - DY31 * x0 > 0;
            bool c10 = C3 + DX31 * y0 - DY31 * x1 > 0;
            bool c01 = C3 + DX31 * y1 - DY31 * x0 > 0;
            bool c11 = C3 + DX31 * y1 - DY31 * x1 > 0;
            int c = (c00 << 0) | (c10 << 1) | (c01 << 2) | (c11 << 3);

            // Skip block when outside an edge
/*
  This one can be cleverly done with
  #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
  hint: search for "null" at
  http://www.azillionmonkeys.com/qed/asmexample.html
  but I didn't need it at all because of the optimizations above
*/
            if(a == 0x0 || b == 0x0 || c == 0x0) continue;

            unsigned int *buffer = colorBuffer;

            // Accept whole block when totally covered
/*
  By fusing a,b,c together earlier I could make here only one test which is
  if(c==0xFFF) - that is when c=a<<8|b<<4|c
*/
            if(a == 0xF && b == 0xF && c == 0xF)
            {
                for(int iy = 0; iy < q; iy++)
                {
                    for(int ix = x; ix < x + q; ix++)
                    {
                        buffer[ix] = 0x00007F00; // Green
/*
  This part here gave a really good boost. I switched to SSE for a second
  here and shuffled the color to all parts of the 128-bit register. It could
  write 16 bytes per clock and with some macro magic, I made it calculate
  from the q how much of these it needed. The inner loop is too tight to use
  it at all so I left it out :)
*/
                    }

                    (char*&)buffer += stride;
                }
            }
            else // Partially covered block
            {
                int CY1 = C1 + DX12 * y0 - DY12 * x0;
                int CY2 = C2 + DX23 * y0 - DY23 * x0;
                int CY3 = C3 + DX31 * y0 - DY31 * x0;

                for(int iy = y; iy < y + q; iy++)
                {
                    int CX1 = CY1;
                    int CX2 = CY2;
                    int CX3 = CY3;

                    for(int ix = x; ix < x + q; ix++)
                    {
                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)
                        {
                            buffer[ix] = 0x0000007F; // Blue
                        }

                        CX1 -= FDY12;
                        CX2 -= FDY23;
                        CX3 -= FDY31;
                    }

                    CY1 += FDX12;
                    CY2 += FDX23;
                    CY3 += FDX31;

                    (char*&)buffer += stride;
                }
            }
        }

        (char*&)colorBuffer += q * stride;
    }
}

Now I will post the code that is not really ready yet, but already works. I will find some way to put some more SSE into it. The first try was over 400 lines of code, but some SSE magic shrank it to about 340 lines. My aim is a line count less than or equal to yours :D

In ASM syntax comments are ; while // and /**/ don’t mean anything.
I made a string replace on the code, so try to understand ;)

Engine_Triangle:
/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
;  Description: Draws a filled triangle.
;
;  In:     xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
;          rsi = colour
;
;  Out:    All changed
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;*/
pshuflw xmm0,xmm0,01001110b //Switch front/back
Q=4
PREC=4
    psllw     xmm0,PREC           //xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
    movdqa    xmm5,xmm0
    pextrw    efx,xmm0,0
    pextrw    eex,xmm0,1
    pextrw    edx,xmm0,2
    pextrw    ecx,xmm0,3
    pextrw    ebx,xmm0,4
    pextrw    eax,xmm0,5
    pshufd    xmm1,xmm0,11010010b //xmm1 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0
    movdqa    xmm2,xmm0
    movdqa    xmm3,xmm1
    pxor      xmm4,xmm4
    punpckhwd xmm0,xmm4
    punpckhwd xmm1,xmm4
    punpcklwd xmm2,xmm4
    punpcklwd xmm3,xmm4
    psubd     xmm0,xmm1           //xmm0 = 00  |  00  | dX12 | dY12
    psubd     xmm2,xmm3           //xmm1 = dX23| dY23 | dX31 | dY31
    movdqa    xmm1,xmm0
    movdqa    xmm3,xmm2
    pslld     xmm1,PREC           //xmm2 = 00  |  00  |fdX12 |fdY12
    pslld     xmm3,PREC           //xmm3 =fdX23|fdY23 |fdX31 |fdY31
    movq      qword[DY12],xmm0
    movdqu    dqword[DY31],xmm2
    movq      qword[FDY12],xmm1
    movdqu    dqword[FDY31],xmm3
                                  //xmm5 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
    pshufd    xmm6,xmm5,11001001b //xmm6 = 00 | 00 | X2 | Y2 | X0 | Y0 | X1 | Y1
    pshufd    xmm7,xmm5,11010010b //xmm7 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0
    movdqa    xmm8,xmm5
    movdqa    xmm9,xmm6
    pmaxsw    xmm8,xmm6
    pmaxsw    xmm8,xmm7
    pminsw    xmm9,xmm5
    pminsw    xmm9,xmm7
    paddw     xmm8,dqword[xmm_8x15]
    psraw     xmm8,4
    pextrw    ehx,xmm8,1
    pextrw    ejx,xmm8,0
    paddw     xmm9,dqword[xmm_8x15]
    psraw     xmm9,4
    pextrw    egx,xmm9,1
    pextrw    eix,xmm9,0
    // Block size, standard 8x8 (must be power of two)
//    const int q = 8//

    // Start in corner of 8x8 block
//    minx &= ~(q - 1)//
//    miny &= ~(q - 1)//
    and       egx,not(Q-1)
    and       eix,not(Q-1)

    imul      ebp,eix,BUFW*4    //Set up pointer to backbuffer
    add       ebp,backbuffer

    imul      eax,[DY12]        //Constant part of half-edge functions
    imul      ebx,[DX12]
    imul      ecx,[DY23]
    imul      edx,[DX23]
    imul      eex,[DY31]
    imul      efx,[DX31]
    sub       eax,ebx
    sub       ecx,edx
    sub       eex,efx
    mov       [C1],eax
    mov       [C2],ecx
    mov       [C3],eex

    cmp       [DY12],0           //Correct for fill convention
    jl        .incC1
    jne       .noincC1
    cmp       [DX12],0
    jle       .noincC1
  .incC1:
    add       [C1],1
  .noincC1:
    cmp       [DY23],0
    jl        .incC2
    jne       .noincC2
    cmp       [DX23],0
    jle       .noincC2
  .incC2:
    add       [C2],1
  .noincC2:
    cmp       [DY31],0
    jl        .incC3
    jne       .noincC3
    cmp       [DX31],0
    jle       .noincC3
  .incC3:
    add       [C3],1
  .noincC3:

// Loop through blocks
  .forouter:
    push      rgx
  .forinner:
// Corners of block
    mov       eax,egx
    shl       eax,PREC
    mov       ebx,eix
    shl       ebx,PREC
    lea       esi,[egx+Q-1]
    shl       esi,PREC
    lea       edi,[eix+Q-1]
    shl       edi,PREC

// Evaluate half-space functions
    xor       ecx,ecx
    mov       ekx,ebx
    imul      ekx,[DX12]
    mov       elx,eax
    imul      elx,[DY12]
    sub       ekx,elx
    add       ekx,[C1]
    jle       @f
    add       ecx,1
  @@:
    mov       ekx,ebx
    imul      ekx,[DX12]
    mov       elx,esi
    imul      elx,[DY12]
    sub       ekx,elx
    add       ekx,[C1]
    jle       @f
    add       ecx,1
  @@:
    mov       ekx,edi
    imul      ekx,[DX12]
    mov       elx,eax
    imul      elx,[DY12]
    sub       ekx,elx
    add       ekx,[C1]
    jle       @f
    add       ecx,1
  @@:
    mov       ekx,edi
    imul      ekx,[DX12]
    mov       elx,esi
    imul      elx,[DY12]
    sub       ekx,elx
    add       ekx,[C1]
    jle       @f
    add       ecx,1
  @@:
    shl       ecx,4
    jz        .prolog_inner_end

    mov       ekx,ebx
    imul      ekx,[DX23]
    mov       elx,eax
    imul      elx,[DY23]
    sub       ekx,elx
    add       ekx,[C2]
    jle       @f
    add       ecx,1
  @@:
    mov       ekx,ebx
    imul      ekx,[DX23]
    mov       elx,esi
    imul      elx,[DY23]
    sub       ekx,elx
    add       ekx,[C2]
    jle       @f
    add       ecx,1
  @@:
    mov       ekx,edi
    imul      ekx,[DX23]
    mov       elx,eax
    imul      elx,[DY23]
    sub       ekx,elx
    add       ekx,[C2]
    jle       @f
    add       ecx,1
  @@:
    mov       ekx,edi
    imul      ekx,[DX23]
    mov       elx,esi
    imul      elx,[DY23]
    sub       ekx,elx
    add       ekx,[C2]
    jle       @f
    add       ecx,1
  @@:
    shl       ecx,4
    test      ecx,0F0h
    jz        .prolog_inner_end

    mov       ekx,ebx
    imul      ekx,[DX31]
    mov       elx,eax
    imul      elx,[DY31]
    sub       ekx,elx
    add       ekx,[C3]
    jle       @f
    add       ecx,1
  @@:
    mov       ekx,ebx
    imul      ekx,[DX31]
    mov       elx,esi
    imul      elx,[DY31]
    sub       ekx,elx
    add       ekx,[C3]
    jle       @f
    add       ecx,1
  @@:
    mov       ekx,edi
    imul      ekx,[DX31]
    mov       elx,eax
    imul      elx,[DY31]
    sub       ekx,elx
    add       ekx,[C3]
    jle       @f
    add       ecx,1
  @@:
    mov       ekx,edi
    imul      ekx,[DX31]
    mov       elx,esi
    imul      elx,[DY31]
    sub       ekx,elx
    add       ekx,[C3]
    jle       @f
    add       ecx,1
  @@:
    test      ecx,0Fh
    jz        .prolog_inner_end
    //ecx=a shl 8 + b shl 4 + c

    mov       efx,ebp
    cmp       ecx,0444h
    jne       .not_totally_covered
// Totally inside
    mov       ekx,Q
  @@:
    mov       ecx,00FFFFFFh
    movd      xmm0,ecx
    pshufd    xmm0,xmm0,00000000b // broadcast
    times     Q/4 movdqu [efx+egx*4+(%-1)*16],xmm0
    add       efx,BUFW*4
    sub       ekx,1
    ja        @b
    jmp       .prolog_inner_end
  .not_totally_covered:
//            else// <<  Partially covered block
//            {
//                int CY1 = C1 + DX12 * y0 - DY12 * x0//
//                int CY2 = C2 + DX23 * y0 - DY23 * x0//
//                int CY3 = C3 + DX31 * y0 - DY31 * x0//
    mov       ekx,[DX12]
    mov       elx,[DY12]
    imul      ekx,ebx
    imul      elx,eax
    sub       ekx,elx
    add       ekx,[C1]
    mov       [CY1],ekx

    mov       ekx,[DX23]
    mov       elx,[DY23]
    imul      ekx,ebx
    imul      elx,eax
    sub       ekx,elx
    add       ekx,[C2]
    mov       [CY2],ekx

    mov       ekx,[DX31]
    mov       elx,[DY31]
    imul      ekx,ebx
    imul      elx,eax
    sub       ekx,elx
    add       ekx,[C3]
    mov       [CY3],ekx
//                for(int iy = y// iy < y + q// iy++)
//                    int CX1 = CY1//
//                    int CX2 = CY2//
//                    int CX3 = CY3//
    mov       elx,Q
    mov       eax,[CY1]
    mov       ebx,[CY2]
    mov       ecx,[CY3]

    lea       eex,[efx+egx*4]
  .partial_loopy:
    push      rax rbx rcx
//                    for(int ix = x// ix < x + q// ix++)
    xor       ekx,ekx
  .partial_loopx:
//                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)
    test      eax,eax
    jle       @f
    test      ebx,ebx
    jle       @f
    test      ecx,ecx
    jle       @f
//                            buffer[ix] = 0x0000007F//<< // Blue
    mov       dword[eex+ekx*4],0FFFFFFh
  @@:
    sub       eax,[FDY12]
    sub       ebx,[FDY23]
    sub       ecx,[FDY31]
//                    }
    add       ekx,1
    cmp       ekx,Q
    jc        .partial_loopx
    pop       rcx rbx rax
    add       eax,[FDX12]
    add       ebx,[FDX23]
    add       ecx,[FDX31]
//                    (char*&)buffer += stride//
    add       eex,BUFW*4
//                }
    sub       elx,1
    ja        .partial_loopy

  .prolog_inner_end:
    add       egx,Q
    cmp       egx,ehx
    jc        .forinner
    pop       rgx

//        (char*&)colorBuffer += q*stride//
    add       ebp,BUFW*4*Q
    add       eix,Q
    cmp       eix,ejx
    jc        .forouter
//    }
    ret
99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Mar 29, 2007 at 22:34

Neat. :yes:

I noticed some opportunity to eliminate jumps. Modern CPUs can suffer greatly from jumps that are badly predictable, so you should try to avoid them if possible, especially short jumps. For example:

if(DY12 < 0) C1++;

Could be implemented as:

lea ebx, [eax+1]   ; C1 in eax
cmp DY12, 0
cmovl eax, ebx

Lots of other tricks exist to convert apparently conditional code into fast arithmetic code…
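
One such trick, sketched in C++ for clarity (my illustration, not code from the thread; it assumes a 32-bit int and an arithmetic right shift for negative values, which holds on the x86 targets discussed here):

int fillConventionBias(int C1, int DY12)
{
    // (DY12 >> 31) is -1 when DY12 < 0 and 0 otherwise,
    // so this is the branchless form of: if(DY12 < 0) C1++;
    return C1 - (DY12 >> 31);
}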

B6f2d859eef6e0cc35e5feaa35af0e22
0
Madis731 101 Mar 30, 2007 at 07:20

Sorry, I didn’t see it at first :$
My logic failed me at first, but I think it’s okay now - tested and benchmarked:

    mov       ekx,[C1]
    lea       elx,[ekx+1]
    cmp       [DY12],0
    cmovl     ekx,elx //jl        .incC1   --   in comments, its the old code
    cmovne    elx,[C1] //jne       .noincC1  --  that got replaced
    cmp       [DX12],0
    cmovle    ekx,elx //jle       .noincC1
//  .incC1:
//    add       [C1],1
//  .noincC1:
    mov       [C1],ekx 

The code didn’t reduce in size, but it stayed the same while pickin’ up
the pace ;) the previous maximum of about 70M clocks/frame is now 60M clocks/frame. And
this is only in the emulator, and this is just the OUTER loop that we optimized.
A lot more speed could be cranked out of this with some thinking.
Maybe:
bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0;
into
bool a00 = DX12 * y0 - DY12 * x0 > -C1; // Only faster in SSE?
Gentlemen - put your thinking caps on! :D

…and I think the general rule is: “Avoid short jumps if the code from the jump
to the destination takes fewer clocks than the penalty of the jump”

1378938c53637c141789d6df667b77c6
0
Nico 101 Mar 30, 2007 at 12:28

This kind of tiled traversal using halfspace functions is pretty common in hardware, and you can possibly use a simple optimisation that we use there to get rid of the multiple distance function evaluations used in the ‘block inside triangle’ test.

In order to determine if a box is entirely on one side of a halfplane it is not necessary to determine the distance for all four corner points - it is sufficient to determine the distance to one corner.
But depending on the sign of your halfplane normal a different corner has to be used. This is a win because you are going to test many boxes against the same halfplane.

I don’t have an illustration for this, but it should be pretty obvious that for any halfplane there is always one corner of a box that will have a minimal distance to the plane - regardless of the relative positioning or size.

Personally I like to express the change of reference corner points as an offset to the top left corner distance, calculated during triangle setup from the edge normals :

// find box test offsets for each edge
int eo1 = 0;
if (FDY12 < 0) eo1 -= FDY12 << BLOCKSIZE;
if (FDX12 > 0) eo1 += FDX12 << BLOCKSIZE;

int eo2 = 0;
if (FDY23 < 0) eo2 -= FDY23 << BLOCKSIZE;
if (FDX23 > 0) eo2 += FDX23 << BLOCKSIZE;

int eo3 = 0;
if (FDY31 < 0) eo3 -= FDY31 << BLOCKSIZE;
if (FDX31 > 0) eo3 += FDX31 << BLOCKSIZE;

Using these offset values (one per edge) the test for ‘block is fully outside triangle’ becomes essentially the same as the simple ‘pixel is fully inside triangle’ test.

// test if a single pixel is inside the triangle
if (CX1 > 0 && CX2 > 0 && CX3 > 0) ...

// test if a block of n*n pixels with top left corner at the current position overlaps the triangle (i.e. is not fully outside)
if ((CX1+eo1) > 0 && (CX2+eo2) > 0 && (CX3+eo3) > 0) ...

So just three adds instead of 21 muls and whatnot ;)

Note that checking for exclusion and checking for inclusion requires different offsets (one can be easily derived from the other) but in hardware we usually differentiate into covered and not-covered only.
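
To make the derivation concrete, here is one way to get the inclusion offset from the exclusion offset, continuing the snippet above in the same notation (the ei1 name is mine, not part of the original setup code):

// max(0, a) + min(0, a) == a for each of the two per-axis terms above, so the
// offset to the block's minimum corner value follows directly from eo1:
int ei1 = ((FDX12 - FDY12) << BLOCKSIZE) - eo1;

// block fully outside edge 1 when:     CX1 + eo1 <= 0
// block fully covered by edge 1 when:  CX1 + ei1 >  0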

B91eae75cd6245bd8074bd0c3f1cc495
0
Nils_Pipenbrinck 101 Mar 30, 2007 at 13:32

From a quick view it seems like you’re using global variables. While this is not that bad, the code size grows quite a bit (the address operand field will need at least 4 bytes).

If you can free up one register (I know - good joke) you could put your locals into a temporary structure and address them from there (see the sketch after this list). This has two benefits:

  • Your code will become thread-safe.
  • Your code size will shrink, and this will reduce the instruction decoding time. That does not make a difference if your code takes longer to execute than to fetch, but once your code is tight enough, instruction decoding time will be a factor to care about.
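
A small sketch of what I mean (hypothetical names, not actual code from this thread):

struct TriangleState
{
    int C1, C2, C3;
    int DX12, DX23, DX31, DY12, DY23, DY31;
    int FDX12, FDX23, FDX31, FDY12, FDY23, FDY31;
};

void rasterize(const TriangleState &s)
{
    // every access is now one base register plus a small offset instead of a
    // 4-byte absolute address, and each thread can own its own TriangleState
}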

Oh - and by all means, get rid of the branches. If it takes 5 or 6 instructions to do something without a branch, it’s still a saving. You have lots of branches placed close to each other, and this will confuse the branch prediction logic. I don’t have the exact numbers anymore, but it can only identify branches that are several instructions apart. I think it was something around 64 or 32 bytes or so.

If you have one branch that is likely taken and another that is unlikely taken close to each other, you’ll get a costly branch misprediction for each branch.

Btw, hi nico *g*

B6f2d859eef6e0cc35e5feaa35af0e22
0
Madis731 101 Mar 30, 2007 at 21:47

I feel ashamed :$ optimizing guru as I am, I posted you unoptimized code. What was I thinking? :)

Anyway, I’m on my way to better and better code - like you, I started discovering some clever and some not very clever spots to crush.

Just a side-note here, I’m aiming at Core 2 here (that is what I’m writing and testing on) and I’ve got all my clock-knowledge from Agner Fog’s manuals and
picked up a few tricks from http://azillionmonkeys.com/qed/tech.shtml

The weirdest thing is that I didn’t discover the possibilities in “Evaluate half-space functions” until today, when I looked at the lines and saw that
we actually need 12 multiplies instead of 24. I must be blind :P Anyway, that
translated nicely to SSE, the saddest fact being that it doesn’t support signed
DWORD multiplies. If I get over that tiny problem, I will have completely different
code to show you.

@Nils: I want to argue a bit if I may. I agree with you on the thread safety,
but there are other ways around it if I really don’t like locals :). The stack is
really becoming a problem on 64-bit CPUs because you can’t push anything
besides 64-bit registers, and accessing 32-bit or 16-bit data
from this 8-byte aligned stack means too much additional SSE work - I mean all
the packing and shuffling. Of course, like with other things, there are ways
around it.

Another reason why I like globals so much is that they are cache-friendly :)
Really! They never move ;) and I can fit like 8 DWORDs in one cache-line.

About the branches - I’m still sorry about posting this code - I will edit this
immediately not to confuse anyone again :$ Three posts back I copied the
replacement code for the nasty branches and I don’t have them anymore.

Here it is (still without full SSE handling) with only 300 lines of code:

Engine_Triangle:
/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
;  Description: Draws a filled triangle.
;
;  In:     xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
;          rsi = colour
;
;  Out:    All changed
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;*/
    pshuflw xmm0,xmm0,01001110b //Switch front/back
Q = 4
PREC = 4
//    psllw     xmm0,PREC           //xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
    movdqa    xmm5,xmm0
    pextrw    efx,xmm0,0
    pextrw    eex,xmm0,1
    pextrw    edx,xmm0,2
    pextrw    ecx,xmm0,3
    pextrw    ebx,xmm0,4
    pextrw    eax,xmm0,5
    pshufd    xmm1,xmm0,11010010b //xmm1 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0
    movdqa    xmm2,xmm0
    movdqa    xmm3,xmm1
    pxor      xmm4,xmm4
    punpckhwd xmm0,xmm4
    punpckhwd xmm1,xmm4
    punpcklwd xmm2,xmm4
    punpcklwd xmm3,xmm4
    psubd     xmm0,xmm1           //xmm0 = 00  |  00  | dX12 | dY12
    psubd     xmm2,xmm3           //xmm1 = dX23| dY23 | dX31 | dY31
    movdqa    xmm1,xmm0
    movdqa    xmm3,xmm2
    pslld     xmm1,PREC           //xmm2 = 00  |  00  |fdX12 |fdY12
    pslld     xmm3,PREC           //xmm3 =fdX23|fdY23 |fdX31 |fdY31
    movq      qword[DY12],xmm0
    movdqu    dqword[DY31],xmm2
    movq      qword[FDY12],xmm1
    movdqu    dqword[FDY31],xmm3
                                  //xmm5 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
    pshufd    xmm6,xmm5,11001001b //xmm6 = 00 | 00 | X2 | Y2 | X0 | Y0 | X1 | Y1
    pshufd    xmm7,xmm5,11010010b //xmm7 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0
    movdqa    xmm8,xmm5
    movdqa    xmm9,xmm6
    pmaxsw    xmm8,xmm6
    pmaxsw    xmm8,xmm7
    pminsw    xmm9,xmm5
    pminsw    xmm9,xmm7
    paddw     xmm8,dqword[xmm_8x15]
    psraw     xmm8,4
    pextrw    ehx,xmm8,1
    pextrw    ejx,xmm8,0
    paddw     xmm9,dqword[xmm_8x15]
    psraw     xmm9,4
    pextrw    egx,xmm9,1
    pextrw    eix,xmm9,0
    // Block size, standard 8x8 (must be power of two)
//    const int q = 8//

    // Start in corner of 8x8 block
//    minx &= ~(q - 1)//
//    miny &= ~(q - 1)//
    and       egx,not(Q-1)
    and       eix,not(Q-1)

    imul      ebp,eix,BUFW*4    //Set up pointer to backbuffer
    add       ebp,backbuffer

    imul      eax,[DY12]        //Constant part of half-edge functions
    imul      ebx,[DX12]
    imul      ecx,[DY23]
    imul      edx,[DX23]
    imul      eex,[DY31]
    imul      efx,[DX31]
    sub       eax,ebx
    sub       ecx,edx
    sub       eex,efx
    mov       [C1],eax
    mov       [C2],ecx
    mov       [C3],eex

    mov       ekx,[C1]          //Correct for fill convention
    lea       elx,[ekx+1]
    cmp       [DY12],0
    cmovl     ekx,elx
    cmovne    elx,[C1]
    cmp       [DX12],0
    cmovle    ekx,elx
    mov       [C1],ekx

    mov       ekx,[C2]
    lea       elx,[ekx+1]
    cmp       [DY23],0
    cmovl     ekx,elx
    cmovne    elx,[C2]
    cmp       [DX23],0
    cmovle    ekx,elx
    mov       [C2],ekx

    mov       ekx,[C3]
    lea       elx,[ekx+1]
    cmp       [DY31],0
    cmovl     ekx,elx
    cmovne    elx,[C3]
    cmp       [DX31],0
    cmovle    ekx,elx
    mov       [C3],ekx

// Loop through blocks
  .forouter:
    push      rgx
  .forinner:
// Corners of block
    mov       eax,egx
    shl       eax,PREC
    mov       ebx,eix
    shl       ebx,PREC
    lea       esi,[egx+Q-1]
    shl       esi,PREC
    lea       edi,[eix+Q-1]
    shl       edi,PREC

// Evaluate half-space functions

    mov       ekx,ebx
    shl       rkx,32
    add       rkx,rax
    mov       elx,edi
    shl       rlx,32
    add       rlx,rsi

    movq      xmm0,rkx              // xmm0 = 00 | 00 | x0 | y0
    movq      xmm1,rlx              // xmm1 = 00 | 00 | x1 | y1
   punpcklqdq xmm0,xmm1             // xmm0 = y1 | x1 | y0 | x0
    movd      xmm1,[C1]
    pshufd    xmm1,xmm1,00000000b   // xmm1 = C1 | C1 | C1 | C1
    movq      xmm4,qword[DY12]
    pshufd    xmm4,xmm4,01000100b   // xmm4 = DX12 | DY12 | DX12 | DY12
    pshufd    xmm5,xmm4,10110001b   // xmm5 = DY12 | DX12 | DY12 | DX12
    pmuludq   xmm4,xmm0             // xmm4 = OF | DY12*x1 | OF | DY12*x0
    pshufd    xmm4,xmm4,10001000b   // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0
    pshufd    xmm0,xmm0,10110001b   // xmm0 = x1 | y1 | x0 | y0
    pmuludq   xmm5,xmm0             // xmm5 = OF | DX12*y1 | OF | DX12*y0
    pshufd    xmm5,xmm5,10100000b   // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0
    psubd     xmm5,xmm4             // xmm5 = DX12*(y0...1)-DY12*(x0...1)
    paddd     xmm5,xmm1
    pcmpgtd   xmm5,dqword[xmm_zero]
    packssdw  xmm5,xmm5
    packsswb  xmm5,xmm5
    pmovmskb  edx,xmm5
    and       edx,0Fh
    xor       ecx,ecx
    add       ecx,edx
    jz        .prolog_inner_end
    shl       ecx,4

    movd      xmm2,[C2]
    pshufd    xmm2,xmm2,00000000b   // xmm2 = C2 | C2 | C2 | C2
    movq      xmm4,qword[DY23]
    pshufd    xmm4,xmm4,01000100b   // xmm4 = DX12 | DY12 | DX12 | DY12
    pshufd    xmm5,xmm4,10110001b   // xmm5 = DY12 | DX12 | DY12 | DX12
    pshufd    xmm0,xmm0,10110001b   // xmm0 = y1 | x1 | y0 | x0
    pmuludq   xmm4,xmm0             // xmm4 = OF | DY12*x1 | OF | DY12*x0
    pshufd    xmm4,xmm4,10001000b   // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0
    pshufd    xmm0,xmm0,10110001b   // xmm0 = x1 | y1 | x0 | y0
    pmuludq   xmm5,xmm0             // xmm5 = OF | DX12*y1 | OF | DX12*y0
    pshufd    xmm5,xmm5,10100000b   // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0
    psubd     xmm5,xmm4             // xmm5 = DX12*(y0...1)-DY12*(x0...1)
    paddd     xmm5,xmm2
    pcmpgtd   xmm5,dqword[xmm_zero]
    packssdw  xmm5,xmm5
    packsswb  xmm5,xmm5
    pmovmskb  edx,xmm5
    and       edx,0Fh
    add       ecx,edx
    test      ecx,0Fh
    jz        .prolog_inner_end
    shl       ecx,4

    movd      xmm3,[C3]
    pshufd    xmm3,xmm3,00000000b   // xmm3 = C3 | C3 | C3 | C3
    movq      xmm4,qword[DY31]
    pshufd    xmm4,xmm4,01000100b   // xmm4 = DX12 | DY12 | DX12 | DY12
    pshufd    xmm5,xmm4,10110001b   // xmm5 = DY12 | DX12 | DY12 | DX12
    pshufd    xmm0,xmm0,10110001b   // xmm0 = y1 | x1 | y0 | x0
    pmuludq   xmm4,xmm0             // xmm4 = OF | DY12*x1 | OF | DY12*x0
    pshufd    xmm4,xmm4,10001000b   // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0
    pshufd    xmm0,xmm0,10110001b   // xmm0 = x1 | y1 | x0 | y0
    pmuludq   xmm5,xmm0             // xmm5 = OF | DX12*y1 | OF | DX12*y0
    pshufd    xmm5,xmm5,10100000b   // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0
    psubd     xmm5,xmm4             // xmm5 = DX12*(y0...1)-DY12*(x0...1)
    paddd     xmm5,xmm3
    pcmpgtd   xmm5,dqword[xmm_zero]
    packssdw  xmm5,xmm5
    packsswb  xmm5,xmm5
    pmovmskb  edx,xmm5
    and       edx,0Fh
    add       ecx,edx
    test      ecx,0Fh
    jz        .prolog_inner_end

    mov       efx,ebp
    cmp       ecx,0FFFh
    jne       .not_totally_covered
// Totally inside
    mov       ekx,Q
  @@:
    mov       ecx,00FFFFFFh
    movd      xmm0,ecx
    pshufd    xmm0,xmm0,00000000b // broadcast
    times     Q/4 movdqu [efx+egx*4+(%-1)*16],xmm0
    add       efx,BUFW*4
    sub       ekx,1
    ja        @b
    jmp       .prolog_inner_end
  .not_totally_covered:
//            else// <<  Partially covered block
//            {
//                int CY1 = C1 + DX12 * y0 - DY12 * x0//
//                int CY2 = C2 + DX23 * y0 - DY23 * x0//
//                int CY3 = C3 + DX31 * y0 - DY31 * x0//
    mov       ekx,[DX12]
    mov       elx,[DY12]
    imul      ekx,ebx
    imul      elx,eax
    sub       ekx,elx
    add       ekx,[C1]
    mov       [CY1],ekx

    mov       ekx,[DX23]
    mov       elx,[DY23]
    imul      ekx,ebx
    imul      elx,eax
    sub       ekx,elx
    add       ekx,[C2]
    mov       [CY2],ekx

    mov       ekx,[DX31]
    mov       elx,[DY31]
    imul      ekx,ebx
    imul      elx,eax
    sub       ekx,elx
    add       ekx,[C3]
    mov       [CY3],ekx
//                for(int iy = y// iy < y + q// iy++)
//                    int CX1 = CY1//
//                    int CX2 = CY2//
//                    int CX3 = CY3//
    mov       elx,Q
    mov       eax,[CY1]
    mov       ebx,[CY2]
    mov       ecx,[CY3]

    lea       eex,[efx+egx*4]
  .partial_loopy:
    push      rax rbx rcx
//                    for(int ix = x// ix < x + q// ix++)
    xor       ekx,ekx
  .partial_loopx:
//                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)
    test      eax,eax
    jle       @f
    test      ebx,ebx
    jle       @f
    test      ecx,ecx
    jle       @f
//                            buffer[ix] = 0x0000007F//<< // Blue
    mov       dword[eex+ekx*4],0FFFFFFh
  @@:
    sub       eax,[FDY12]
    sub       ebx,[FDY23]
    sub       ecx,[FDY31]
//                    }
    add       ekx,1
    cmp       ekx,Q
    jc        .partial_loopx
    pop       rcx rbx rax
    add       eax,[FDX12]
    add       ebx,[FDX23]
    add       ecx,[FDX31]
//                    (char*&)buffer += stride//
    add       eex,BUFW*4
//                }
    sub       elx,1
    ja        .partial_loopy

  .prolog_inner_end:
    add       egx,Q
    cmp       egx,ehx
    jc        .forinner
    pop       rgx

//        (char*&)colorBuffer += q*stride//
    add       ebp,BUFW*4*Q
    add       eix,Q
    cmp       eix,ejx
    jc        .forouter
//    }
    ret
E3a1db864249a05e4952ac91cb55418d
0
rarefluid 101 Jun 11, 2007 at 15:08

Maybe I’m missing something here but the halfspace function should be linear per edge/line. So you could:
- evaluate the halfspace functions by calculating the halfspace value at the 4 corners of the bounding box
- calculate a per-X and per-Y constant
- add those to the corner values when you step to the next 8x8 block

There should be no need for the multiplications in the outer loop, which should save 6 MULs and 6 ADDs/SUBs per block.

And sorry for reanimating the dead ;)

B91eae75cd6245bd8074bd0c3f1cc495
0
Nils_Pipenbrinck 101 Jun 11, 2007 at 16:40

@rarefluid

Maybe I’m missing something here but the halfspace function should be linear per edge/line.

That’s right, they are linear and you get all kinds of cool shortcuts here. If you really want to optimize performance read Nico’s post. He showed that it’s sufficient to evaluate each edge distance at a single corner of the block instead of at all four.

E3a1db864249a05e4952ac91cb55418d
0
rarefluid 101 Jun 11, 2007 at 18:31

What I understood was that you only need to check one corner depending on the edge normal, i.e. if the block is above and left of a line that has an angle of 45°, check the bottom-right corner of the box… I don’t understand how this can be independent of the position of the block…
And you will probably need to do some jumps that might slow things down again.

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jun 11, 2007 at 20:51

I’ve analysed Nico’s suggestion, and I think this picture tells the whole story: boundary.

The colored triangle represents the pixels we want to fill. The squares are coverage blocks (the extreme cases). The fat lines are the boundaries we need to test the top-left corner of a coverage block with. If the top-left corner is inside the boundary, the block covers part of the triangle.

Note that the boundary is not a triangle, but that’s actually not a problem. We’ll only check its bounding box for covered blocks anyway. So only the three edges of the boundary that are parallel to the triangle’s edges need to be computed.

That’s only a one-time cost, so block coverage is practically as fast as point coverage! :worthy:

Here’s the equivalent picture for testing whether a block is fully covered by only looking at the top-left corner: boundary2.

@rarefluid

And you will probably need to do some jumps that might slow things down again.

You can eliminate jumps with conditional moves, or using AND and OR operations.

E3a1db864249a05e4952ac91cb55418d
0
rarefluid 101 Jun 11, 2007 at 22:00

I think I can grasp the idea now…
But I don’t really understand how this can yield speedups when NOT exploiting parallelism, in contrast to a conventional scanline renderer, ‘cause it necessarily processes more pixels. There is no sorting needed though.

…Sorry. I read page 1 again and found some hints… :$

Couldn’t one do a Bresenham-like movement of the blocks to save some tests? Move the block down and onto the triangle boundary. Or am I completely wrong here? Why not combine the conventional scanline and parallel approach?
…maybe I should implement this and stop bothering you guys… ;)

B91eae75cd6245bd8074bd0c3f1cc495
0
Nils_Pipenbrinck 101 Jun 11, 2007 at 22:48

@rarefluid

Couldn’t one do a Bresenham-like movement of the blocks to save some tests? Move the block down and onto the triangle boundary. Or am I completely wrong here? Why not combine the conventional scanline and parallel approach?
…maybe I should implement this and stop bothering you guys… ;)

No. You aren’t wrong. You can as well use a simple flood fill style algorithm.

Once you’ve found a block that covers at least a part of the triangle (the middle one might be a good choice) fill into all adjacent, covered blocks. This will be very fast. The only drawback is that you need some kind of stack space to store your state, but that’s usually not a problem.

On hardware designs you avoid such things, but in a software implementation it is no big deal.

Btw, during my latest work on rasterization I found out that on modern machines you get a nice performance gain from first scanning the geometry and generating block/trapezoid/scanline data for your polygons into a dedicated memory buffer. Once the buffer is full, render it. This is up to twice as fast compared to the “render the block immediately” style.

I think the effects of the cache make a huge difference here.
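
A hypothetical sketch of the two-pass idea (the names and structure are mine, not code from this thread):

#include <vector>

struct BlockJob { int x, y; bool fullyCovered; };

std::vector<BlockJob> jobs;   // the dedicated memory buffer

void scanTriangle(/* triangle setup */)
{
    // pass 1: run only the coverage tests and record the accepted blocks,
    // e.g. jobs.push_back({x, y, covered});
}

void flushJobs()
{
    // pass 2: walk the buffer sequentially and do the actual pixel writes
    for (const BlockJob &b : jobs) { /* fill the 8x8 block at (b.x, b.y) */ }
    jobs.clear();
}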

124e341afb359425b55ff1c2006e3379
0
powers 101 Jun 22, 2007 at 15:35

Many thanks to Nick for the good work.

It’s my first time learning how a rasterizer works, so I have some questions.
1. Why do we use 28.4 fixed-point coordinates and not another fixed-point format, such as 24.8?
2. I don’t know why we have to add 0xF to min(X1, X2, X3) and max(X1, X2, X3). Additionally, why do we do the bitwise operations when we compute the fixed-point deltas and the bounding rectangle?
3. I also don’t know how to get the corners of a block.
4. What does “stride” mean here?
5. Finally, what exactly is a “block”? Is it just a rectangle?
Thanks a lot!!!

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jun 23, 2007 at 09:48

@powers

Why do we use 28.4 fixed-point coordinates and not another fixed-point format, such as 24.8?

The numbers are multiplied, so a 24.8 format would result in a 48.16 format. But that doesn’t fit within a 32-bit value and would be chopped to a 16.16 format, where the upper 16 bits correspond to only 8 bits before the multiplication. So you’d only have a resolution of 256 pixels.

The 28.4 format gives us 12 bit resolution, or 4096 pixels. 4 bits of sub-pixel precision is also plenty in practice. So it’s the best compromise. If you really need a higher resolution, anti-aliasing or more sub-pixel precision, you’ll have to use 64-bit numbers or another approach.

I don’t know why we have to add 0xF to min(X1, X2, X3) and max(X1, X2, X3).

Adding 0xF and shifting right 4 bits is the equivalent of a ceil() function for the 28.4 fixed-point format. The first pixel to be filled is the one strictly to the right of the first edge, so that’s why a ceil() operation is used (the fill convention for pixels exactly on the edge is taken care of a few lines lower).
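
To make that concrete, here are the relevant conversions from the code again, with a worked number:

const int X1 = iround(16.0f * v1.x);        // float -> 28.4 fixed-point (multiply by 2^4)
int minx = (min(X1, X2, X3) + 0xF) >> 4;    // 28.4 -> integer, rounding up (ceil)

// Example: x = 3.25 becomes the 28.4 value 52, and (52 + 15) >> 4 = 4 = ceil(3.25).
// A whole coordinate like x = 3.0 (value 48) gives (48 + 15) >> 4 = 3, so it stays put.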

Additionally, why do we do the bitwise operations when we compute the fixed-point deltas and the bounding rectangle?

You mean the shift operations? The bounding rectangle is really in integer coordinates, not in fixed-point coordinates.

I also don’t know how to get the corners of a block.

They are in fixed-point format again, so that’s why there is a shift left by 4 bits. They have to be in fixed-point because testing whether a point is inside or outside an edge requires that precision (also note that the C values are in fixed-point).

What does “stride” mean here?

It’s the number of bytes between two rows in the color buffer. So it takes you from one scanline to the next.

Finally, what exactly is a “block”? Is it just a rectangle?

It’s a square of 8x8 pixels. Sometimes in rasterization it’s also called a “tile”. The main advantage is that with tiles 64 pixels can be skipped at once when the whole tile is either fully outside or fully inside the triangle. Only tiles that cross an edge require the per-pixel tests.

7a1d0b51a2de5e0d11e8ac26263b3a5a
0
Optimus 101 Jul 07, 2007 at 22:13

Interesting approach! I like it when I read something completely new and different on 3D graphics. Usually there are several tutorials out there on the subject, but they all explain the same old classic algorithm in the predictable way you’ve read and coded yourself before. Now that’s something unique! And I am still wondering if it works well and fast enough (I have to try it first ;)

B6f2d859eef6e0cc35e5feaa35af0e22
0
Madis731 101 Jul 09, 2007 at 15:06

The Q is best left at 8, because then 64 pixels will be rendered fast enough that you don’t notice the slowdown from the MULs and DIVs you stumble on along the way :) Unoptimized it will work at the same pace as scan-conversion, but with a bit of optimization you’ll have something fast.

I’m trying to find out if there’s a way to dynamically change it between 4, 8 and 16 for different triangles. I think it will be too complicated and might even make it slower, but let’s see.

0e340ed4245585a6b9ad132136ec3742
0
supagu 101 Aug 22, 2007 at 05:33

@Nick

This is quite easy. First determine gradients du/dx, dv/dx, du/dy and dv/dy. Then for every pixel you can determine how far it is from the first vertex: dx = x - x0, dy = y - y0 (note that x0 and y0 are floating-point, x and y are integer). Then the texture coordinates are simply: u = u0 + du/dx * dx + du/dy * dy. Once you got the coordinates for one pixel you obviously only have to add du/dx and/or du/dy to get the texture coordinates for the neighboring pixels. So I do the above calculation once per block, then use this linear interpolation for the other pixels in the block.

can you further explain how this is working?
what is du, dx, etc…

Also, how does this compare, performance-wise, to scanline rasterisation once you have colours, texture coords etc. all in place, assuming you don’t want to convert it to assembly?

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Aug 22, 2007 at 06:23

@supagu

can you further explain how this is working?
what is du, dx, etc…

Triangle Scan Conversion using 2D Homogeneous Coordinates explains an efficient way to compute interpolants (section 4).

Also, how does this compare, performance-wise, to scanline rasterisation once you have colours, texture coords etc. all in place, assuming you don’t want to convert it to assembly?

It has advantages and disadvantages. Scanline rasterization only computes endpoints of the scanline, while this method computes coverage for every pixel individually. The advantage is that it translates very well to SIMD assembly. But with SISD code you’re better off with scanline rasterization. Another advantage is that you can easily determine coverage of whole blocks of pixels, which is useful for visibility algorithms.
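
To give an idea of how well it maps to SIMD, here's a rough SSE2 intrinsics sketch (just an illustration reusing the CX/FDY names from the listings above, not code from the article) that computes a coverage mask for four horizontally adjacent pixels at once:

#include <emmintrin.h>   // SSE2

// CX1/CX2/CX3 are the edge values at pixel x; each step to the right subtracts FDY.
static __m128i coverage4(int CX1, int CX2, int CX3,
                         int FDY12, int FDY23, int FDY31)
{
    __m128i e1 = _mm_set_epi32(CX1 - 3 * FDY12, CX1 - 2 * FDY12, CX1 - FDY12, CX1);
    __m128i e2 = _mm_set_epi32(CX2 - 3 * FDY23, CX2 - 2 * FDY23, CX2 - FDY23, CX2);
    __m128i e3 = _mm_set_epi32(CX3 - 3 * FDY31, CX3 - 2 * FDY31, CX3 - FDY31, CX3);

    __m128i zero = _mm_setzero_si128();
    __m128i m = _mm_and_si128(_mm_cmpgt_epi32(e1, zero), _mm_cmpgt_epi32(e2, zero));
    return _mm_and_si128(m, _mm_cmpgt_epi32(e3, zero));   // all-ones lanes are inside
}

The resulting mask can then be used to write or blend four pixels at once without any per-pixel branches.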

So it really depends on other aspects of your renderer whether this algorithm is to be preferred or not.

E3a1db864249a05e4952ac91cb55418d
0
rarefluid 101 Dec 03, 2007 at 11:33

I finally implemented the algorithm with the proposed improvement of only evaluating one corner per edge. Though it is quite logical, the implementation puzzled me quite a bit, so I’m posting it here for people who are as confused as I was…
Block size is still 8.

     ...
     ...
     if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;

    // find box test offsets for each edge
    int eo1 = 0;
    if (FDY12 < 0) eo1 -= FDY12 << 3; //meaning y1 is above y2, block size is 8 ==> shl 3
    if (FDX12 > 0) eo1 += FDX12 << 3; //meaning x1 is right of x2

    int eo2 = 0;
    if (FDY23 < 0) eo2 -= FDY23 << 3; //meaning y2 is above y3
    if (FDX23 > 0) eo2 += FDX23 << 3; //meaning x2 is right of x3

    int eo3 = 0;
    if (FDY31 < 0) eo3 -= FDY31 << 3; //meaning y3 is above y1
    if (FDX31 > 0) eo3 += FDX31 << 3; //meaning x3 is right of x1

    //these are the offsets of the bottom-right block corner
    int ei1 = (DX12 << 7) - (DY12 << 7) - eo1; //block size is 8 ==> shl 3 + 4
    int ei2 = (DX23 << 7) - (DY23 << 7) - eo2;
    int ei3 = (DX31 << 7) - (DY31 << 7) - eo3;

    // Loop through blocks
    for(int y = miny; y < maxy; y += q) {
    unsigned char filledFlag = 0; //set once we have hit a (partially) filled block on this row of blocks

        for(int x = minx; x < maxx; x += q) {
            // Corners of block
            int x0 = x << 4;
            int y0 = y << 4;

            int CX1 = C1 + DX12 * y0 - DY12 * x0;
            int CX2 = C2 + DX23 * y0 - DY23 * x0;
            int CX3 = C3 + DX31 * y0 - DY31 * x0;

            // Skip block when outside an edge
            if((CX1+eo1) < 0 || (CX2+eo2) < 0 || (CX3+eo3) < 0) {
        unsigned int *cbuffer = tempBuffer;
                if (filledFlag & 0x80) {
                    //we have hit a filled block before, but now we've hit an empty one. we can skip to the next line.
                    break;
                }
                else {
                    //we have NOT hit a filled block before, so we just skip to the next one.
                    continue;
                }
            }
            //Accept whole block when totally covered
            if((CX1+ei1) > 0 && (CX2+ei2) > 0 && (CX3+ei3) > 0) {
                unsigned int *cbuffer = tempBuffer;
                for(int iy = 0; iy < q; iy++) {
                    for(int ix = x; ix < x + q; ix++) {
                        cbuffer[ix] = 0x00007F00; // Green
                    }
                    cbuffer = (unsigned int *)((char *)cbuffer + target->bytesPerLine);
                }
            }
            else {
                // Partially covered block
                unsigned int *cbuffer = tempBuffer;
                int CY1 = CX1;
                int CY2 = CX2;
                int CY3 = CX3;
                for(int iy = y; iy < y + q; iy++) {
                    CX1 = CY1;
                    CX2 = CY2;
                    CX3 = CY3;
                    for(int ix = x; ix < x + q; ix++) {
                        if(CX1 > 0 && CX2 > 0 && CX3 > 0) {
                            cbuffer[ix] = 0x0000007F; // Blue
                        }
                        CX1 -= FDY12;
                        CX2 -= FDY23;
                        CX3 -= FDY31;
                    }
                    CY1 += FDX12;
                    CY2 += FDX23;
                    CY3 += FDX31;
                    cbuffer = (unsigned int *)((char *)cbuffer + target->bytesPerLine);
                }
            }
            filledFlag = 0x80;
        }
        tempBuffer = (unsigned int *)((char *)tempBuffer + q * target->bytesPerLine);
    }
};

Though there are far fewer calculations going on, the code is not much faster (as in blazing fast) than the original version, because eo1/eo2/eo3 are kept in memory.
The code to calculate and test for “fully inside” (ei1, ei2, ei3) slows down the algorithm a lot too, probably because of the lack of registers. There might be another way of calculating those offsets from the eoX variables.

A slight improvement is the filledFlag variable that skips to the next line of blocks when we have encountered a filled and an empty block in sequence (because that means the rest of the line is empty…). I tried that mechanism for partially filled blocks too (skipping lines of pixels), but noticed no improvement…

comments and flames are welcome ;)

E3a1db864249a05e4952ac91cb55418d
0
rarefluid 101 Dec 04, 2007 at 17:00

I have to check, but I guess on an architecture with many registers this might be really fast, ‘cause almost everything is precalculated and kept in registers instead of memory.

Another improvement might be to convert the per-block edge value calculation into incremental interpolation as well:

    ...
    int CY1 = C1 + DX12 * (miny << 4) - DY12 * (minx << 4);
    int CY2 = C2 + DX23 * (miny << 4) - DY23 * (minx << 4);
    int CY3 = C3 + DX31 * (miny << 4) - DY31 * (minx << 4);
    // Loop through blocks
    for(int y = miny; y < maxy; y += q) {

        int CX1 = CY1;
        int CX2 = CY2;
        int CX3 = CY3;
    unsigned char filledFlag = 0; //set once we have hit a (partially) filled block on this row of blocks

        for(int x = minx; x < maxx; x += q) {
           // Skip block when outside an edge
            if((CX1+eo1) < 0 || (CX2+eo2) < 0 || (CX3+eo3) < 0) {
        unsigned int *cbuffer = tempBuffer;
                if (filledFlag & 0x80) {
                    //we have hit a filled block before, but now we've hit an empty one. we can skip to the next line.
                    break;
                }
                else {
                    //we have NOT hit a filled block before, so we just skip to the next one.
                    goto skip;
                }
            }
            //Accept whole block when totally covered
            if((CX1+ei1) > 0 && (CX2+ei2) > 0 && (CX3+ei3) > 0) {
                unsigned int *cbuffer = tempBuffer;
                for(int iy = 0; iy < q; iy++) {
                    for(int ix = x; ix < x + q; ix++) {
                        cbuffer[ix] = 0x00007F00; // Green
                    }
                    cbuffer = (unsigned int *)((char *)cbuffer + target->bytesPerLine);
                }
            }
            else {
                // Partially covered block
                unsigned int *cbuffer = tempBuffer;
                for(int iy = y; iy < y + q; iy++) {
                    int cx1 = CX1;
                    int cx2 = CX2;
                    int cx3 = CX3;
                    for(int ix = x; ix < x + q; ix++) {
                        if(cx1 > 0 && cx2 > 0 && cx3 > 0) {
                            cbuffer[ix] = 0x0000007F; // Blue
                        }
                        cx1 -= FDY12;
                        cx2 -= FDY23;
                        cx3 -= FDY31;
                    }
                    CX1 += FDX12;
                    CX2 += FDX23;
                    CX3 += FDX31;
                    cbuffer = (unsigned int *)((char *)cbuffer + target->bytesPerLine);
                }
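                // undo the q row steps taken above (the << 3 shifts below assume q == 8)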
                CX1 -= (FDX12 << 3);
                CX2 -= (FDX23 << 3);
                CX3 -= (FDX31 << 3);
            }
            filledFlag = 0x80;
skip:
            CX1 -= (FDY12 << 3);
            CX2 -= (FDY23 << 3);
            CX3 -= (FDY31 << 3);
        }
        CY1 += (FDX12 << 3);
        CY2 += (FDX23 << 3);
        CY3 += (FDX31 << 3);
        tempBuffer = (unsigned int *)((char *)tempBuffer + q * target->bytesPerLine);
    }
};

No multiplications in the inner loops anymore! This gives a slight improvement, which again might be bigger if we could keep more of the values in registers…

I’m really keen to hear your suggestions for improving the speed further algorithmically, rather than by converting to SSE or other such things…

E3a1db864249a05e4952ac91cb55418d
0
rarefluid 101 Jan 02, 2008 at 16:46

Dunno if anyone still reads this, but I found an interesting piece about perspective-correct interpolation from Nick :) in another forum:

I think I did find a more practical result though: To compute the classical matrix inverse, a division by the determinant is required. This takes 1 reciprocal, 12 multiplications and 2 additions. I believe this can be eliminated altogether. Without dividing by the determinant d, instead of computing u/w and 1/w we’d be computing d*u/w and d/w. Per pixel we’re computing (d*u/w)/(d/w) = u.

This should be quite ideal for this algorithm…
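A minimal sketch of that idea (the names are mine, and u/w and 1/w per vertex are assumed to be available from the projection step):

// Set up interpolation of d * (v/w) instead of v/w, skipping the division by the determinant d.
struct Gradients { float base, dx, dy; };   // value at (x1, y1) and its screen-space gradients

Gradients setupScaled(float v1, float v2, float v3,               // v/w at the three vertices
                      float x1, float y1, float x2, float y2, float x3, float y3)
{
    float d = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1);      // determinant (2x signed area)
    Gradients g;
    g.dx   = (v2 - v1) * (y3 - y1) - (v3 - v1) * (y2 - y1);       // d * d(v/w)/dx
    g.dy   = (v3 - v1) * (x2 - x1) - (v2 - v1) * (x3 - x1);       // d * d(v/w)/dy
    g.base = v1 * d;                                              // d * (v/w) at vertex 1
    return g;
}

// Per pixel, interpolate U = base + dx*(x - x1) + dy*(y - y1) for d*u/w, and W likewise for d/w.
// Then u = U / W: the determinant cancels out, so the 1/d reciprocal and the multiplications
// by it never appear in the setup.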

F25d39eb6d3230a1ebc73eedea4c0b27
0
Jamie 101 Apr 08, 2008 at 13:55

Hi,

http://winden.wordpress.com/category/coding/

A very interesting article about an old rendering method: for perspective mapping you only need 2 divisions per polygon, and if you have the vertices cached it’s only one division.

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Apr 09, 2008 at 10:37

@Jamie

A very interesting article about an old rendering method: for perspective mapping you only need 2 divisions per polygon, and if you have the vertices cached it’s only one division.

Turning multiple divisions into one is a very basic math trick:

x = a / b
y = c / d

becomes

x = ad / bd
y = cb / db

The denominators are equal so you only have to compute 1 / bd once.
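In code, the trick is simply this (a hedged sketch, not from any particular renderer):

// Two divisions with one reciprocal: x = a/b and y = c/d share the denominator b*d.
inline void divideBoth(float a, float b, float c, float d, float &x, float &y)
{
    float r = 1.0f / (b * d);   // single reciprocal
    x = (a * d) * r;            // equals a / b
    y = (c * b) * r;            // equals c / d
}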

However, this has several implications. The first is that if d or b is zero you get a division by zero for both. With triangles, one or two vertices can have w = 0 (but not all three). Secondly, computers have limited precision and so b * (1 / bd) is of lower precision than 1 / d. And it gets worse when doing it for three divisions at once. And thirdly, you quickly end up with many more multiplications and data movement, making the gains much less worthwhile.

By the way, gradient setup for perspective correct interpolation can even be done with no division at all…

F25d39eb6d3230a1ebc73eedea4c0b27
0
Jamie 101 Apr 09, 2008 at 13:47

I tested the optimization for the first 2 divisions in my scanline perspective correction; I don’t see a difference, but as you said I have not tested it with 3 divisions.

For the w = 0 case, you can clip the triangle against the near plane.

One thing that is really good is that he computes all the edges and gradients from the 3D coordinates, so he gets good precision.

Of course you don’t need any division for the perspective gradient setup.

D3f8df9f744b91535ab178d829fb8fc1
0
n_o_p 101 Dec 28, 2008 at 20:15

@Nick

So how do we detect coverage as fast as possible? With the half-space functions, it’s really easy. All we have to do is evaluate the half-space functions in the corners of the blocks. A non-covered block has negative half-space values for the corners for at least one edge. A completely covered block has positive half-space values for all edges. Everything else is a partially covered block.

Doesn’t that give false positives (flagging a non-covered block as a partially covered block)?

+-----+
 \   /
  \ /
   v             
+------+
|      |
|      |
+------+

This block is detected as partially covered even though it’s not covered at all. It might not be a big problem here (except for some wasted cycles), but it should be noted.

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Dec 29, 2008 at 13:16

@n.o.p

Doesn’t that give false positives (flagging a non-covered block as partially covered block)?

No, the block in your image is outside of the bounding box of the triangle, so it’s never going to be tested in the first place. :happy:

Aa25abeff9c53f82f51a99f93a8d04bd
0
Bernie 101 Jan 18, 2009 at 00:14

Very interesting thread! I have some questions ..

Currently I’m working on accurate scan conversion functions for both pixel-centers (e.g. rasterizing triangles) and pixel-coverage (e.g. which tiles touch this triangle). However I want to use floating-point math, and avoid unnecessary work such as looping over all pixels in the bounding box. I’m not trying to optimize the performance with assembly at this stage - instead I want to optimize the algorithm so I can then implement it on AltiVec and SPU later.

So far I have the pixel-center algorithm working quite well - it handles the edge cases perfectly and requires only a couple multiplies and a floorf() and ceilf() per scanline. The way I do it is to first rearrange the vertices so that vertex A has the highest y-coord and then if the winding is clockwise I swap B and C. This gives me one of two configurations which I handle separately:

(sorry bad ASCII art)

. A
. /|
. B |
. \|
. C

or

. A
. |\
. | B
. |/
. C

In the first configuration vertex B is to the left of edge AC … I scan along edge AB (left) and AC (right) for the upper half and then along BC (left) and AC (right) for the lower half.

In the second configuration vertex B is to the right of edge AC … I scan along edge AC (left) and AB (right) for the upper half and then along AC (left) and BC (right) for the lower half.

Since I’m only considering pixel centers here, the “upper” and “lower” parts of the triangle are well-defined by simply comparing the y-coord of the center of the span to the y-coord of vertex B, so things are fairly simple.

Now I’m working on the pixel-coverage scan converter (here a “pixel” can be considered a tile, it makes no difference) and the problem seems a lot more complex. Again I’m looping over the scanlines just like the pixel-center version, and for each scanline I determine the furthest extent {xmin,xmax} of the triangle over the scanline’s area. Calling floorf(xmin), ceilf(xmax) gives me the actual extent in integer values. The problem is that for some scanlines I have to consider multiple edges on the left side and the right side …

My approach so far is to process groups of spans. First the “top” span (which includes vertex A), then the “upper” spans (between A and B but not including either), then the “middle” span (B), then the “lower” spans (between B and C but not including either) and finally the “bottom” span (C). Plus, the “upper+top” case where both A and B lie on the same span has to be considered separately, as does the “lower+bottom” case, and the trivial case where all vertices lie on the same span. In each of these 8 cases there is a different equation for getting {xmin,xmax}, but it’s always just a couple of multiplies and adds.

The y-extents of the spans are as follows:

“top” span will be {floorf(Ay), floorf(Ay)+1}
“upper” spans will be between {floorf(Ay)+1, floorf(By)}
“middle” span will be {floorf(By), ceilf(By)}
“lower” spans will be between {ceilf(By), ceilf(Cy)-1}
“bottom” span will be {ceilf(Cy)-1, ceilf(Cy)}

Note that some of these extents cover zero area or even negative area .. e.g. if By is an integer then the middle span will not exist. But also if vertices A and B (or vertices B and C) lie on the same span then the middle span must be skipped because it would otherwise overlap.

Has anyone implemented anything like this or does anyone know of any references to this kind of solution?

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 19, 2009 at 13:48

Hi Bernie,

You should be able to compute pixel-coverage (or other tiles) by resizing the triangle with the dimensions of the tile and then using any point-coverage algorithm. Have a look at post #72 for some details. I hope that helps.

Cheers,

Nick

Aa25abeff9c53f82f51a99f93a8d04bd
0
Bernie 101 Jan 19, 2009 at 18:43

@Nick

Hi Bernie,

You should be able to compute pixel-coverage (or other tiles) by resizing the triangle with the dimensions of the tile and then using any point-coverage algorithm. Have a look at post #72 for some details. I hope that helps.

Cheers,

Nick

i considered that, but it seemed more complex because the expanded triangle is no longer really a triangle due to clipping (especially when the triangle vertices are very acute). also, my solution does not give false-positives. thanks though!

Aa25abeff9c53f82f51a99f93a8d04bd
0
Bernie 101 Jan 19, 2009 at 19:02

in case anyone wants to see my pixel coverage code … i’m pretty sure there are no remaining boundary issues.

INLINE void TriangleSetup(float &Ax, float &Ay, float &Bx, float &By, float &Cx, float &Cy)
{
    float temp;

    if (Ay > By || Ay >= Cy) // rotate triangle so that Ay is on top
    {
        temp = Ax; Ax = Bx; Bx = Cx; Cx = temp;
        temp = Ay; Ay = By; By = Cy; Cy = temp;

        if (Ay > By || Ay >= Cy)
        {
            temp = Ax; Ax = Bx; Bx = Cx; Cx = temp;
            temp = Ay; Ay = By; By = Cy; Cy = temp;
        }
    }

    if ((Bx - Ax)*(Cy - Ay) > (Cx - Ax)*(By - Ay)) // flip vertices B,C so that triangle is counter-clockwise
    {
        temp = Bx; Bx = Cx; Cx = temp;
        temp = By; By = Cy; Cy = temp;
    }
}

void RasterizeCoverage(SpanList &spans, float Ax, float Ay, float Bx, float By, float Cx, float Cy)
{
    // rasterize pixel "coverage" (e.g. determining which tiles overlap with a triangle)

    // debug colors ...
    const V4 spanColor_TRIVIAL(0.8f, 0.8f, 0.8f, 1.0f); // white
    const V4 spanColor_TOP    (0.8f, 0.0f, 0.0f, 1.0f); // red
    const V4 spanColor_TOP1   (0.8f, 0.8f, 0.0f, 1.0f); // red-green
    const V4 spanColor_UPPER  (0.4f, 0.4f, 0.4f, 1.0f); // red-gray (red)
    const V4 spanColor_MIDDLE (0.0f, 0.8f, 0.0f, 1.0f); // green
    const V4 spanColor_LOWER  (0.4f, 0.4f, 0.5f, 1.0f); // blue-gray
    const V4 spanColor_BOTTOM1(0.0f, 0.8f, 0.8f, 1.0f); // blue-green
    const V4 spanColor_BOTTOM (0.0f, 0.0f, 0.8f, 1.0f); // blue

    TriangleSetup(Ax, Ay, Bx, By, Cx, Cy);

    ASSERT(IsNormal(Ax) && IsNormal(Ay)); // INF or NaN input are not supported, also denormals are not supported
    ASSERT(IsNormal(Bx) && IsNormal(By));
    ASSERT(IsNormal(Cx) && IsNormal(Cy));

    float Ay0 = Floor(Ay);

    float xmin = Min(Ax, Bx, Cx); // triangle bounds
    float xmax = Max(Ax, Bx, Cx);
    float ymax = Max(By, Cy);

    int iy = (int)Ay0; // index of first span

    // first check for trivial cases
    {
        float range = Min(Ceil(xmax) - Floor(xmin), Ceil(ymax) - Ay0);

        if (range == 0.0f) // entire triangle spans 0 rows or columns - zero coverage
        {
            return; // done
        }
        else if (range == 1.0f) // entire triangle spans 1 row or column
        {
            while (Ay0 < ymax)
            {
                spans.AddSpan(xmin, xmax, iy, spanColor_TRIVIAL);

                ++iy, ++Ay0;
            }

            return; // done
        }
    }

    bool flip = (By > Cy);

    if (flip)
    {
        float temp;

        // negate x-coords, swap B and C
        Ax = -Ax;
        temp = -Bx; Bx = -Cx; Cx = temp;
        temp =  By; By =  Cy; Cy = temp;

        // negate and swap xmin,xmax
        temp = -xmin; xmin = -xmax; xmax = temp;
    }

    ASSERT(By <= Cy);

    float dAB = (Bx - Ax)/(By - Ay); // divide by zero is possible, but the results will not be used ..
    float dBC = (Cx - Bx)/(Cy - By);
    float dCA = (Ax - Cx)/(Ay - Cy);

    float dAB_pos = Max(0.0f, dAB);
    float dBC_pos = Max(0.0f, dBC);
    float dCA_pos = Max(0.0f, dCA);
    float dAB_neg = Min(0.0f, dAB);
    float dBC_neg = Min(0.0f, dBC);
    float dCA_neg = Min(0.0f, dCA);

    float By0      = Floor(By);
    float By1      = Ceil (By);
    float Cy1_sub1 = Ceil (Cy) - 1.0f;

    float y1 = Ay0 + 1.0f; // lower edge of first span

    // TOP SPAN ....... [floor(Ay),   floor(Ay)+1]
    // UPPER SPANS .... [floor(Ay)+1, floor(By)  ] 
    // MIDDLE SPAN .... [floor(By),   ceil (By)  ]
    // LOWER SPANS .... [ceil (By),   ceil (Cy)-1]
    // BOTTOM SPANS ... [ceil (Cy)-1, ceil (Cy)  ]

    if (By1 == y1) // top half is 1 span [Ay0..By1], will include or touch (A) and (B) but will not touch (C)
    {
        ASSERT(IsFinite(dBC_neg));
        ASSERT(IsFinite(dCA_pos));
        ASSERT(y1 == (float)(iy + 1));

        float x0 = (y1 - By)*dBC_neg + Min(Ax, Bx);
        float x1 = (y1 - Ay)*dCA_pos + Ax;

        ASSERT(xmin <= x0 && x0 < x1 && x1 <= xmax);

        spans.AddSpan(x0, x1, iy, spanColor_TOP1);
        ++iy, ++y1;

        By1 = By0; // skip middle span
    }
    else if (By1 > y1) // top half is multiple spans ..
    {
        // top span [Ay0..Ay0+1], will include or touch (A) on top but will not touch (B) or (C)
        {
            ASSERT(IsFinite(dAB_neg));
            ASSERT(IsFinite(dCA_pos));
            ASSERT(y1 == (float)(iy + 1));

            float x0 = (y1 - Ay)*dAB_neg + Ax;
            float x1 = (y1 - Ay)*dCA_pos + Ax; 

            ASSERT(xmin <= x0 && x0 < x1 && x1 <= xmax);

            spans.AddSpan(x0, x1, iy, spanColor_TOP);
            ++iy, ++y1;
        }

        while (y1 <= By0) // upper spans [Ay0+1..By0], may touch (B) on bottom but will not touch (A) ... may touch (C) if Cy=By is an integer
        {
            ASSERT(IsFinite(dAB));
            ASSERT(IsFinite(dCA));
            ASSERT(y1 == (float)(iy + 1));

            float x0 = ((y1 - By)*dAB - dAB_pos) + Bx; // must = Bx exactly when y = By and dAB <= 0
            float x1 = ((y1 - Cy)*dCA - dCA_neg) + Cx; // must = Cx exactly when y = Cy and dCA >= 0

            ASSERT(xmin <= x0 && x0 < x1 && x1 <= xmax);

            spans.AddSpan(x0, x1, iy, spanColor_UPPER);
            ++iy, ++y1;
        }
    }

    float y0 = y1 - 1.0f; // now we track the upper edge of the span

    if (By0 == Cy1_sub1) // bottom half is 1 span [By0..Cy1], will include or touch (B) and (C) but will not touch (A)
    {
        ASSERT(IsFinite(dAB_pos));
        ASSERT(IsFinite(dCA_neg));
        ASSERT(y0 == (float)(iy + 0));

        float x0 = (y0 - By)*dAB_pos + Min(Bx, Cx);
        float x1 = (y0 - Cy)*dCA_neg + Cx;

        ASSERT(xmin <= x0 && x0 < x1 && x1 <= xmax);

        spans.AddSpan(x0, x1, iy, spanColor_BOTTOM1);
        //++iy, ++y0;
    }
    else if (By0 < Cy1_sub1) // bottom half is multiple spans ..
    {
        if (By0 < By1) // middle span [By0..By1], will include or touch (B) on top or bottom but will not touch (A) or (C) 
        {
            ASSERT(IsFinite(dAB_pos));
            ASSERT(IsFinite(dBC_neg));
            ASSERT(IsFinite(dCA));
            ASSERT(y0 == (float)(iy + 0));
            ASSERT(y1 == (float)(iy + 1));

            ASSERT(dAB_pos*dBC_neg == 0.0f); // NOTE: dAB_pos and dBC_neg cannot simultaneously be non-zero due to CCW winding

            float x0  = ((y0 - By)*dAB_pos + (y1 - By)*dBC_neg) + Bx;
            float x1  = ((y0 - Ay)*dCA     +           dCA_pos) + Ax; // all four of these (x1,x1b,x1c,x1d) are valid ...
            float x1b = ((y1 - Ay)*dCA     -           dCA_neg) + Ax;
            float x1c = ((y0 - Cy)*dCA     +           dCA_pos) + Cx;
            float x1d = ((y1 - Cy)*dCA     -           dCA_neg) + Cx;

            ASSERT(xmin <= x0 && x0 < x1 && x1 <= xmax);

            spans.AddSpan(x0, x1, iy, spanColor_MIDDLE);
            ++iy, ++y0;
        }

        while (y0 < Cy1_sub1) // lower spans [By1..Cy1-1], may touch (B) on top but will not touch (C) ... may touch (A) if Ay=By is an integer
        {
            ASSERT(IsFinite(dBC));
            ASSERT(IsFinite(dCA));
            ASSERT(y0 == (float)(iy + 0));

            float x0 = ((y0 - By)*dBC + dBC_neg) + Bx; // must = Bx exactly when y = By and dBC >= 0
            float x1 = ((y0 - Ay)*dCA + dCA_pos) + Ax; // must = Ax exactly when y = Ay and dCA <= 0

            ASSERT(xmin <= x0 && x0 < x1 && x1 <= xmax);

            spans.AddSpan(x0, x1, iy, spanColor_LOWER);
            ++iy, ++y0;
        }

        // bottom span [Cy1-1..Cy1], will include or touch (C) on bottom but will not touch (A) or (B)
        {
            ASSERT(IsFinite(dBC_pos));
            ASSERT(IsFinite(dCA_neg));
            ASSERT(y0 == (float)(iy + 0));

            float x0 = (y0 - Cy)*dBC_pos + Cx; // must = Cx exactly when y = Cy
            float x1 = (y0 - Cy)*dCA_neg + Cx; // must = Cx exactly when y = Cy

            ASSERT(xmin <= x0 && x0 < x1 && x1 <= xmax);

            spans.AddSpan(x0, x1, iy, spanColor_BOTTOM);
            //++iy, ++y0;
        }
    }

    if (flip) // negate and swap spans
    {
        for (int i = 0; i < spans.mSpanCount; ++i)
        {
            Span &s = spans.mSpans[i];
            float x = -s.mX0;
            s.mX0   = -s.mX1;
            s.mX1   = x;
        }
    }
}

...

e.g.:

SpanList spans;
RasterizeCoverage(spans, ax, ay, bx, by, cx, cy);

for (int i = 0; i < spans.mSpanCount; ++i)
{
    const Span &s = spans.mSpans[i];
    int x0 = (int)Floor(s.mX0);
    int x1 = (int)Ceil (s.mX1);
    int y = s.mY;
    
    for (int x = x0; x < x1; ++x)
    {
        // pixel {x,y} is covered by triangle {ax,ay,bx,by,cx,cy}
    }
}

A1974ba0bcf2630484aab68ee3e7e2f1
0
Frogblast 101 Jan 20, 2009 at 07:26

@Nick

No, the block in your image is outside of the bounding box of the triangle, so it’s never going to be tested in the first place. :happy:

Unfortunately, the answer isn’t quite so easy. I’ve just run into the following example: there really is an intersection between the quad and the triangle, even though no sample locations are covered. This can occur at the corners of the triangle.

corner.png

I haven’t been able to find a way to reduce this false positive. Unfortunately, when dealing with very small triangles, this can become a significant expense. For example, in one scene I have an average triangle size of 11 pixels and a block size of 2x2 pixels. About 30% of my blocks are determined to be partially covered even though no samples are covered. As the block size increases, the problem worsens. At 4x4, 55% of all partially covered blocks contain no covered samples.

Does anyone have some insight on how to catch some of these cases before I spend time testing every single sample point?

Aa25abeff9c53f82f51a99f93a8d04bd
0
Bernie 101 Jan 20, 2009 at 08:12

@Frogblast

Unfortunately, the answer isn’t quite so easy. I’ve just run into the following example: there really is an intersection between the quad and the triangle, even though no sample locations are covered. This can occur at the corners of the triangle.

Although I’ve gone with a completely different solution, I think in your case you could test whether all four corners of the quad are outside of the *same* plane. If this isn’t the case, *and* if the quad intersects the bounding box of the triangle, then the quad cannot be rejected.

This is basically the same as testing whether two convex 2D polygons intersect by considering each polygon edge as a potential separating-plane.

So for each corner you would construct a 3-bit flag indicating which of the 3 edges (planes) fails the distance check. If the bitwise-AND of these four 3-bit values is nonzero then the quad is outside. If the bitwise-OR of these values is zero, then the quad is trivially inside. Otherwise, it is partially covered.
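A minimal sketch of that classification (helper names are mine; same sign convention as elsewhere in the thread, i.e. a positive half-space value means inside):

// Build a 3-bit outcode for one corner: bit i set means the corner is outside edge i.
inline int cornerOutcode(int e1, int e2, int e3)
{
    return (e1 < 0 ? 1 : 0) | (e2 < 0 ? 2 : 0) | (e3 < 0 ? 4 : 0);
}

// Combine the outcodes of the four corners of a quad/tile.
// Returns 0 = rejected, 1 = fully covered, 2 = partially covered.
inline int classifyQuad(int c00, int c10, int c01, int c11)
{
    if (c00 & c10 & c01 & c11)        return 0; // all corners outside the same edge
    if ((c00 | c10 | c01 | c11) == 0) return 1; // no corner outside any edge
    return 2;
}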

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 20, 2009 at 09:05

@Frogblast

I’ve just run into the following example: there really is an intersection between the quad and the triangle, even though no sample locations are covered. This can occur at the corners of the triangle.

Indeed, that type of false positive does happen. It’s actually unavoidable. We really want to know the coverage of the pixel centers, but the tile test determines intersection between the triangle and a rectangular shape.

Unfortunately, when dealing with very small triangles, this can become a significant expense. For example, in one scene I have an average triangle size of 11 pixels and a block size of 2x2 pixels. About 30% of my blocks are determined to be partially covered even though no samples are covered. As the block size increases, the problem worsens. At 4x4, 55% of all partially covered blocks contain no covered samples.

How significant is it really? Rasterization is only a fraction of the total graphics pipeline. Besides, I think it’s just something we have to accept; you can’t have higher detail geometry without additional cost. Graphics hardware struggles with this too and recent generations dedicate more logic to rasterization than before as average triangle sizes approach single digit pixel counts. In fact DirectX 11 will up the demands even more as it introduces standardized tessellation. But that’s still nothing compared to the shading demands.

All we can do is ensure that our test tiles cover just the pixel centers, not the entire pixels, and maximize the use of vector instructions. SSE2 can compute the coverage of 8 pixels in just a handful of instructions (less than 1 clock cycle per pixel on modern CPUs).
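Purely as an illustration (this is not Nick's code, and it handles 4 pixels with 32-bit values rather than 8 with 16-bit ones), an SSE2 coverage mask could look like this:

#include <emmintrin.h> // SSE2 intrinsics

// Coverage mask for 4 horizontally adjacent pixels; cx1/cx2/cx3 are the half-space values
// at the leftmost pixel, fdy* the per-pixel steps (same sign convention as the thread's code).
inline int coverageMask4(int cx1, int cx2, int cx3, int fdy12, int fdy23, int fdy31)
{
    __m128i e1 = _mm_setr_epi32(cx1, cx1 - fdy12, cx1 - 2 * fdy12, cx1 - 3 * fdy12);
    __m128i e2 = _mm_setr_epi32(cx2, cx2 - fdy23, cx2 - 2 * fdy23, cx2 - 3 * fdy23);
    __m128i e3 = _mm_setr_epi32(cx3, cx3 - fdy31, cx3 - 2 * fdy31, cx3 - 3 * fdy31);

    __m128i zero   = _mm_setzero_si128();
    __m128i inside = _mm_and_si128(_mm_and_si128(_mm_cmpgt_epi32(e1, zero),
                                                 _mm_cmpgt_epi32(e2, zero)),
                                   _mm_cmpgt_epi32(e3, zero));

    // Bit i of the result is set when pixel i is inside all three half-spaces.
    return _mm_movemask_ps(_mm_castsi128_ps(inside));
}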

Anyway, are you really using 2x2 tiles? That’s actually useless, since the cost of testing the tile is equivalent to the cost of just testing the four pixels (and the latter obviously gives you 0% false positives).

A1974ba0bcf2630484aab68ee3e7e2f1
0
Frogblast 101 Jan 20, 2009 at 17:09

@Nick

Anyway, are you really using 2x2 tiles? That’s actually useless, since the cost of testing the tile is equivalent to the cost of just testing the four pixels (and the latter obviously gives you 0% false positives).

For now, I’m just trying to get it to work with any block size. I haven’t yet settled on a final block size. I would intuitively expect the proportion of false positives to decrease as the blocks get larger (fewer block edges for a triangle corner to happen to land on), so I need to figure that out before increasing it further.
@Nick

SSE2 can compute the coverage of 8 pixels in just a handful of instructions (less than 1 clock cycle per pixel on modern CPUs).

The best I’ve come up with is 6 instructions for 4 pixels using SSE4.1, using 32bit line equation results (also should execute in 6 instructions, but I haven’t yet verified that). Are you using 16bit values for your line equation results?

A8433b04cb41dd57113740b779f61acb
0
Reedbeta 167 Jan 20, 2009 at 17:32

@Frogblast

The best I’ve come up with is 6 instructions for 4 pixels using SSE4.1, using 32bit line equation results (also should execute in 6 instructions, but I haven’t yet verified that). Are you using 16bit values for your line equation results?

I think he means that with pipelining, you can get the results for a new pixel each clock cycle. Not that there is only one cycle of latency. ;)

340bf64ac6abda6e40f7e860279823cb
0
_oisyn 101 Mar 09, 2009 at 16:23

@Nick

Anyway, are you really using 2x2 tiles? That’s actually useless, since the cost of testing the tile is equivalent to the cost of just testing the four pixels (and the latter obviously gives you 0% false positives).

Actually, testing the four corners is too much in the first place. Per plane you only need the two outer-most corners, depending on the orientation of the plane. By describing the box as a center b.c and an extents vector b.x and the plane by normal p.n and distance p.d to world origin, you’ll only need to test:

float c = dot(p.n, b.c);
float x = dot(abs(p.n), b.x);
if (c + x < p.d)
    // box is outside for this plane

where abs(·) is the component-wise absolute value function for a vector
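A self-contained version of that test might look like this (struct and function names are mine, not .oisyn's):

#include <cmath>

struct Vec2  { float x, y; };
struct Box   { Vec2 c; Vec2 e; };    // center and half-extents (.oisyn's b.c and b.x)
struct Plane { Vec2 n; float d; };   // points with dot(n, p) >= d are on the inside

inline float dot(const Vec2 &a, const Vec2 &b) { return a.x * b.x + a.y * b.y; }

inline bool boxOutsidePlane(const Box &b, const Plane &p)
{
    float c = dot(p.n, b.c);                                       // center distance along the normal
    float x = std::fabs(p.n.x) * b.e.x + std::fabs(p.n.y) * b.e.y; // dot(abs(n), extents)
    return c + x < p.d;                                            // even the most-inside corner is outside
}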

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Mar 10, 2009 at 07:20

@.oisyn

Actually, testing the four corners is too much in the first place.

I know, I was merely illustrating that a 2x2 tile doesn’t win you anything as it’s equivalent to determining coverage of each and every pixel.

Per plane you only need the two outer-most corners…

Actually it can be done with just one point coverage test per tile, by adjusting the half-space inequalities, as described and illustrated in a post above.

340bf64ac6abda6e40f7e860279823cb
0
_oisyn 101 Mar 10, 2009 at 11:01

@Nick

Actually it can be done with just one point coverage test per tile, by adjusting the half-space inequalities, as described and illustrated in a post above.

I was talking in the generic sense. For testing an arbitrary box you’ll need to test two points rather than 4. But if you were going to optimize the code for this particular case, it is apparent that b.x is constant (as all boxes have the same size) and therefore the computed value of ‘x’ is constant for every plane. Which effectively leaves you with the center point test ;). This is basically the same as adjusting the planes themselves (as c + x < p.d <=> c < p.d - x).

3c6e32f84407eded0083b087e5ea37e9
0
sdiverdi 101 Apr 30, 2009 at 01:05

Thanks for the great article! I have a couple of questions about the fixed point math from the original post in this thread. I’m trying to understand each of the basic steps here…

const int Y1 = iround(16.0f * v1.y);

This sets up Y1 as a 28.4 fixed point, with nearest rounding from the float value.

const int DX12 = X1 - X2;

Since X1 and X2 are fixed point, doesn’t that mean DX12 is also a 28.4 fixed point?

const int FDX12 = DX12 << 4;

In that case, doesn’t that mean FDX12 is actually 24.8 fixed point?

int minx = (min(X1, X2, X3) + 0xF) >> 4;

Here, minx is in integer coords, rounded up.

int C1 = DY12 * X1 - DX12 * Y1;

X1, Y1, DX12, and DY12 are all 28.4 fixed point. Doesn’t that make C1 a 24.8 fixed point value (multiplying two 28.4s yields a 24.8, and subtraction doesn’t change format)?

int CY1 = C1 + DX12 * (miny << 4) - DY12 * (minx << 4);

Okay, C1 is 24.8, DX12 is 28.4, (miny << 4) is 28.4, so after the multiply that becomes 24.8, so CY1 is 24.8.

int CX1 = CY1;

CX1 -= FDY12;

Hmm, so CX1 is 24.8, and FDY12 is 24.8, so this is consistent!

So I guess that all makes sense and is consistent, I just wanted to check because the description as using 28.4 fixed point, and the comments in the code, didn’t jive with what the code _actually_ appeared to be doing. Do I understand correctly? Thanks!

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Apr 30, 2009 at 06:09

Hi sdiverdi, That’s totally correct. I agree it’s not entirely clear from the article if you don’t check it for yourself. I didn’t want to clutter the essence of the algorithm with an explanation of the fixed-point math and just provided working code. :happy:

One little addition: Multiplying two 28.4 format numbers actually gives you a 56.8 format result. Since that doesn’t fit into 32 bits you only keep a 24.8 format result; the upper bits are lost. The only input format that is guaranteed to fit in 32 bits after multiplication is 12.4, and this is the reason why, as mentioned in the article, the implementation is limited to a 2048x2048 resolution. The highest bit is the sign bit, so you only have 11 bits for the integer part of the coordinates. But that can be extended in various ways…
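A small worked example of that bookkeeping (the numbers are just for illustration):

// 28.4 fixed point: value * 16, rounded to nearest (iround as in the original code).
int a = iround(16.0f * 100.5f);    // 100.5  -> 1608
int b = iround(16.0f *  20.25f);   //  20.25 ->  324
int p = a * b;                     // 1608 * 324 = 520992, now in x.8 format
// Interpreted as 24.8: 520992 / 256 = 2035.125, which is indeed 100.5 * 20.25.
// For the product to be guaranteed to fit in a signed 32-bit int, the inputs must fit
// in 12.4, i.e. |coordinate| < 2048 -- hence the 2048x2048 limit.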

B6f2d859eef6e0cc35e5feaa35af0e22
0
Madis731 101 May 07, 2009 at 11:47

Just wanted to point out the SSE4.2 PCMPxSTRx and BLENDVxx instructions!
If anyone is up for it, I am willing to race you to the solution. I have some not-very-optimal solutions in my head and I will start putting them on paper soon…

Maybe there are far more clever solutions using other instructions ;) but maybe I’m just not that good…

EDIT: I just realized that MASKMOVDQU isn’t that bad after all. BLENDVxx+MOVxxx will warm up the memory before writing, but it’s still slower and has the side effect of accessing memory that isn’t needed. PCMPxSTRc would need PCKxxxx and UNPCKxxxx instructions, which make it slower, so we’re better off with just PCMPGTD+MASKMOVDQU. Back to the drawing board…

My first try:

/*
 xmm_1_2_3_4 is 16-byte-aligned 4-packed-dword integer
 FDYxx and CXx are defined like this:
  align 16
    FDY12 dd ?
    FDY23 dd ?
    FDY31 dd ?
  align 16
    CX1    dd ?
    CX2    dd ?
    CX3    dd ?
*/

//   37 instr. / 35.637 clocks, CPI = 0.96 :)
//   8 pixels / 4.45 clk/px
    mov         eax,0FFh ;blue
    mov         edi,buffer
    movdqa      xmm0,[xmm_1_2_3_4]
    movd        xmm10,eax
    pshufd      xmm1,[FDY12],00000000b
    pshufd      xmm2,[FDY12],01010101b ;virtual FDY23
    pshufd      xmm3,[FDY12],10101010b ;virtual FDY31
    pmulld      xmm1,xmm0
    pmulld      xmm2,xmm0
    pmulld      xmm3,xmm0
    pxor        xmm0,xmm0
    pshufd      xmm10,xmm10,0
    pshufd      xmm4,[CX1],00000000b
    pshufd      xmm5,[CX1],01010101b ;virtual CX2
    pshufd      xmm6,[CX1],10101010b ;virtual CX3
    psubd       xmm4,xmm1
    psubd       xmm5,xmm2
    psubd       xmm6,xmm3
    movdqa      xmm7,xmm4
    movdqa      xmm8,xmm5
    movdqa      xmm9,xmm6
    psubd       xmm4,xmm1
    psubd       xmm5,xmm2
    psubd       xmm6,xmm3
    pcmpgtd     xmm4,xmm0
    pcmpgtd     xmm5,xmm0
    pcmpgtd     xmm6,xmm0
    pcmpgtd     xmm7,xmm0
    pcmpgtd     xmm8,xmm0
    pcmpgtd     xmm9,xmm0
    pand        xmm4,xmm5
    pand        xmm4,xmm6
    pand        xmm7,xmm8
    pand        xmm7,xmm9
    maskmovdqu  xmm10,xmm4
    add         edi,16
    maskmovdqu  xmm10,xmm7

B6f2d859eef6e0cc35e5feaa35af0e22
0
Madis731 101 May 10, 2009 at 16:27

Weird that after some time I can’t edit my post. Anyway - I was talking about the partially covered block’s loop. The most critical part is the X-direction there:

                    for(int ix = x; ix < x + q; ix++)
                    {
                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)
                        {
                            buffer[ix] = 0x0000007F; // Blue
                        }

                        CX1 -= FDY12;
                        CX2 -= FDY23;
                        CX3 -= FDY31;
                    }

Actually I managed to render 8 pixels in one pass with SSE. It only needs two stores as opposed to 8 with the traditional approach.

I conducted a test and some normalized values are here (units are CPU kiloclocks in two triangle rendering batches):
ORIG = 280 / 300 ; 8 x DWORD move in a loop
MASK = 220 / 260 ; 2 x MASKMOVDQU after ~40 LOC
BLND = 173 / 185 ; 2 x MOVDQA after ~50 LOC

Masked is 13%-21% faster than the original
PBLENDVB technique gains 38% on average compared to the original.
I’m still optimizing, but SSE4.1 has some tricks up its sleeve after all.
Next I will try to take the whole (Q=8) 8x8 block at a time - branchless.

Success! :)

B91eae75cd6245bd8074bd0c3f1cc495
0
Nils_Pipenbrinck 101 May 10, 2009 at 20:42

Btw, it may be worth trying out the following change:

                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)

Replace with:

                        if((CX1 > 0) & (CX2 > 0) & (CX3 > 0))

This eliminates two conditional branches for the cost of a couple of extra instructions (SETs and ANDs). In my experience it speeds up code a little bit if the values reside in registers in the first place.

The reduction from three branch-prediction slots to one may make a difference as well.

Cheers,
Nils

B6f2d859eef6e0cc35e5feaa35af0e22
0
Madis731 101 May 11, 2009 at 09:40

EDIT: Some notes
Why I think this topic is hot: Larrabee (due in 2010) deals with the same problem. Cached here: http://nap.koduleht.net/Madis/Programming/Graphics%20manuals/Larrabee/forsyth-larrabee-graphics-architecture.pdf
(see p. 23!)

I don’t think this SSE code can be improved any more with “early exits”, which is what you suggested, if I understood correctly.

I’ve managed to learn the Intel Compiler a bit in the process and I’m really pleased with the flexibility of this product. There were a few things I needed to do to get it to apply all those assembly tricks in its code.
First the relevant part of the modified source, then the explanations:

// compile with "/Ox /QxS /Qvec-report3"
// /QxS is necessary for the SSE4.1
// /Qvec-report3 is for me to see what it does with the loops
                for(int iy = y; iy < y + q; iy++) {
                    int cx123[16];
                    for(int i=0;i<16;i++)
                      cx123[i]=(CX1-i*FDY12 >0) &
                               (CX2-i*FDY23 >0) &
                               (CX3-i*FDY31 >0);
                    #pragma vector always
                    for(int ix = x; ix < x + q; ix++) {
                        if(cx123[ix-x])
                            cbuffer[ix] = 0x0000007F; // Blue
                    }
                    CX1 += FDX12;
                    CX2 += FDX23;
                    CX3 += FDX31;
                    cbuffer = (unsigned int *)((char *)cbuffer + 1280*4);
                }

First I packed the three comparisons into one array cx123[16] (q == 16 at the time of testing) and the loop itself got a lot simpler. The array cx123 is filled in with SSE (the default with the Intel Compiler).

Second, I turned on Qvec-report3 to see what it vectorizes and what not. That revealed a “condition may protect exception” false assumption (well, maybe not :)), explained here: http://software.intel.com/en-us/forums/intel-c-compiler/topic/46566/ . Then I added the #pragma vector always to the code.

The third change needed was the /QxS switch, because earlier CPUs don’t know about SSE4.1 and thus the general optimization didn’t find the path to BLENDVB.

Results {tri(<200,100>,<100,400>,<500,400>); CPU=T9300; 1000 loop minimum clocks}:
1) 912075 - Naive simplest solution;
2) 469238 - Original floating-point code;
3) 231737 - Original non-float code;
4) 186475 - SSE4.1 variant (no hand-coded assembly yet!)
Even compared to the 3rd variant, the final implementation gains 20% :)
Compared to the naive approach it’s almost 5x as fast.

B91eae75cd6245bd8074bd0c3f1cc495
0
Nils_Pipenbrinck 101 May 11, 2009 at 13:20

Good work, Madis..

Nice to see that there are still some people out there who are interested in software rendering.

What’s your long term goal btw? I’m just curious. :-)

B6f2d859eef6e0cc35e5feaa35af0e22
0
Madis731 101 May 11, 2009 at 14:11

Actually I’m generally interested in optimization, and 2D/3D graphics provides most of the training ground. I always get fired up when optimization, and especially SSE, is involved.
That is why I’m interested in this. I’m actually using the same tiling approach in my future 3D engine for http://menuetos.net/
which has no graphics-driver support as of yet.

When products (such as Larrabee) use the same approach, I know I’m on the right track :) this drives my enthusiasm.

PS. Sorry about littering the thread over 5 years or so ;) I usually do my littering around Flat assembler board.

B91eae75cd6245bd8074bd0c3f1cc495
0
Nils_Pipenbrinck 101 May 11, 2009 at 20:15

@Madis731

PS. Sorry about littering the thread over 5 years or so ;) I usually do my littering around Flat assembler board.

You’re welcome!

I always like to read a rant or another about low-level things..:yes:

Aa5c94ea3db083fe435ecf385d3ae8d7
0
primer 101 Jul 13, 2009 at 15:04

Hi All,

I have implemented the algorithm detailed in this thread for my own half-space rasteriser, although I am still confused about the top-left fill convention implementation in the example source provided in the article.

As I see it you seem to be incrementing the constant part of the half-space equation by 1 if we are on a top or left edge. Is this really the correct thing to do? Incrementing a 24.8 fixed-point number by 1 would just increase its value by 0.00390625, or 1/256 (the 8 bits being the fractional part), wouldn’t it?

Am I just being a fixed-point dumbass or am I correct? Any explanation would be great ;)

Many thanks!

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jul 21, 2009 at 12:31

@primer

As I see it you seem to be incrementing the constant part of the half-space equation by 1 if we are on a top or left edge. Is this really the correct thing to do? Incrementing a 24.8 fixed-point number by 1 would just increase its value by 0.00390625, or 1/256 (the 8 bits being the fractional part), wouldn’t it?

Yes, but fixed-point numbers are discrete, so there are no values in between each 1/256 interval (for x.8 format). For rasterization this means that you’re able to determine exactly whether a point is on the edge, or not. And for the fill convention you can decide whether you treat those as covered or not covered, just by adding 1.

6ab56178803a7116affd2cb9ffc5fbb1
0
hoglet 101 Sep 16, 2009 at 12:36

Nick,

A question on your top-left fill convention if I may? I have some experience of the classical scan-line algorithm and the need for a (top-left) fill convention.

What happens in the case where two triangles share a vertex and that vertex lies exactly on a pixel centre?

Say you have a triangle fan of a square where the vertices of both triangles are ordered counter-clockwise. Consider the top-left square vertex (shared by the two triangles) marked by the asterisk. Inward facing edge normals as per your convention.

*---+
|\  |
| \ |
|  \|
+---+

Please forgive the ASCII, Nick! The half-space tests for the left and right edges of the lower-left triangle would be 0, and the half-space test for its bottom edge would be > 0.

The half-space tests for the top and left edges for the upper-right triangle would be 0 and the half-space test for the right edge would be > 0. So, I was wondering…does your approach correctly fill the pixel for the upper-right triangle only as the classical scan-line approach would?

Given your deserved solid reputation in this area I suspect that my understanding is flawed and not your rasterizer! I’m stumped :-(

I know this post goes back a while, but please Nick - if you’re tuning in then enlighten me!

Thank you VERY much - Hoglet.

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Sep 16, 2009 at 13:12

Hi hoglet,

Yes, the case you depicted is handled correctly by the top-left fill convention. For the upper-right triangle, the top edge uses an X >= 0 comparison (X being the distance). Since the distance to that edge is 0, it’s inside that half-space. For its left edge it also uses an X >= 0 comparison and the distance to that edge is 0, so it’s also inside the half-space of the left edge. And finally the right edge uses an X > 0 comparison and the distance is clearly larger than 0. So the pixel represented by the asterisk is covered by this upper-right triangle.

For the lower-left triangle, the right edge also uses an X > 0 comparison, but this time the distance to that edge is 0. So the pixel is considered outside that half-space, and hence the pixel is not covered by that triangle. So there’s no risk of drawing it twice. Note that this triangle’s left edge uses an X >= 0 comparison and the bottom edge uses an X > 0 comparison, as dictated by the fill convention, and the pixel is on the inside of both, but it has to test positive on all three half-spaces to be inside the triangle.

Making top and left edges use an X >= 0 comparison and making right and bottom edges use an X > 0 comparison is done by simply adding 1 to X for top and left edges, and then comparing (X + 1) > 0. That’s because (X + 1) > 0 is the same as X >= 0, for integer numbers. So in the code you’ll only see comparisons of the form X' > 0, with some X's being the distance plus 1.

I hope this makes things a little clearer. :happy:
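In code, the bias can be applied in the setup roughly like this (treat the exact top/left classification as an assumption that depends on your winding order, not as a quote from the article):

// Add 1 to the constant term of top and left edges so that their test becomes
// (X + 1) > 0, i.e. X >= 0, while right and bottom edges keep the strict X > 0 test.
if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;   // edge 1-2 is a top or left edge
if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;   // edge 2-3
if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;   // edge 3-1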

6ab56178803a7116affd2cb9ffc5fbb1
0
hoglet 101 Sep 16, 2009 at 13:45

Nick!

Yup - that makes perfect sense. I’m very grateful to you for pointing out what, perhaps, I should’ve seen in the first place!

I’ll PM you!

Cheers,

Hoglet.

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Nov 28, 2009 at 17:16

Hi guys. To start with, sorry for my bad English :D

I am developing a project http://sourceforge.net/projects/phenomenon/
where a unit named GeSource is a software renderer based on Nick’s rasterization algorithm.

What the unit actualy does:
-output thru DirectDraw (gdi is so damn slow, daaaaaaamn slow omg)
-basic linear setup and interpolation of z-value
-hierarchical rasterizing thru nodes (not zbuffer for now, just z-buffer & coverage variables-prototypes )
-triangle_boundingbox vs node_boundingbox optimizations
-fast hierarchical z-buffer clear
-fast hierarchical z-buffer initialization and destruction
-some parts optimized for SSE

But… this unit is in the Pascal language, with comments in Slovak :-\
I hope the source will help someone…. ;-)

OpenIDEA aka OpenSource rules :D

0042e38e2d1ba68e2b87b984868468e7
0
stowelly 101 Feb 02, 2010 at 18:14

@Nick

It’s slightly worth it in C++, but not for the assembly implementation. It’s already really fast to determine coverage for a block. The extra checks and using more registers is not worth it.

Indeed on a GBA it could be very useful, though I doubt if this parallel rasterizer actually has use there at all?

This is quite easy. First determine gradients du/dx, dv/dx, du/dy and dv/dy. Then for every pixel you can determine how far it is from the first vertex: dx = x - x0, dy = y - y0 (note that x0 and y0 are floating-point, x and y are integer). Then the texture coordinates are simply: u = u0 + du/dx * dx + du/dy * dy.

Once you got the coordinates for one pixel you obviously only have to add du/dx and/or du/dy to get the texture coordinates for the neighboring pixels. So I do the above calculation once per block, then use this linear interpolation for the other pixels in the block.

Hi, I’m having some problems implementing this; could anyone please tell me where I am going wrong?

“First determine gradients du/dx, dv/dx, du/dy and dv/dy”

float dx = (MS_LIBS::max(vec[0].x, vec[1].x, vec[2].x)) - (MS_LIBS::min(vec[0].x, vec[1].x, vec[2].x));
float dy = (MS_LIBS::max(vec[0].y, vec[1].y, vec[2].y)) - (MS_LIBS::min(vec[0].y, vec[1].y, vec[2].y));
float dz = (MS_LIBS::max(vec[0].z, vec[1].z, vec[2].z)) - (MS_LIBS::min(vec[0].z, vec[1].z, vec[2].z));

du = texture->m_width;
dv = texture->m_height;

float du_dx = du/dx;
float dv_dx = dv/dx;
float du_dy = du/dy;
float dv_dy = dv/dy;

MS_LIBS::VECTOR_3 start_point = vec[0];

“Then for every pixel you can determine how far it is from the first vertex: dx = x - x0, dy = y - y0”

at the end of the x loop

dx = x - start_point.x;

at the end of the y loop

dy = y - start_point.y;

“Then the texture coordinates are simply: u = u0 + du/dx * dx + du/dy * dy.”

and before drawing the pixel

float u = uvs[0].m_u + (du_dx * dx) + (du_dy * dy);
float v = uvs[0].m_v + (dv_dx * dx) + (dv_dy * dy);

I’ve obviously missed some fundamental point here, but I would be very grateful if someone could point me in the right direction, please!

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Feb 03, 2010 at 01:32

Hi stowelly. Unfortunately your code for determining the gradients isn’t correct. Probably the best source for a fast and robust algorithm is Triangle Scan Conversion using 2D Homogeneous Coordinates (section 4). I hope that helps.
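For reference, a minimal affine sketch of a plane-equation gradient setup (my own illustration; for perspective correctness you would feed it u/w and 1/w instead of u, as the paper describes):

// Gradients of an attribute u over the triangle (x1,y1,u1), (x2,y2,u2), (x3,y3,u3).
void attributeGradients(float x1, float y1, float u1,
                        float x2, float y2, float u2,
                        float x3, float y3, float u3,
                        float &dudx, float &dudy)
{
    float det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1);  // 2x signed triangle area
    dudx = ((u2 - u1) * (y3 - y1) - (u3 - u1) * (y2 - y1)) / det;
    dudy = ((u3 - u1) * (x2 - x1) - (u2 - u1) * (x3 - x1)) / det;
}
// Then u at pixel (x, y) is u1 + dudx * (x - x1) + dudy * (y - y1).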

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Feb 14, 2010 at 12:01

Hi guys. I have a problem with texture coordinates. My work is based on Nick’s rasterizer. The problem is not about interpolation but about a fill convention for texture coordinates that is compatible with Nick’s rasterizer. It’s about the texture position: it must stay between 0 and 0.999999 in a triangle with u coordinates of, for example, (0,0)-(0,1)-(1,1), without a “CLAMP” or “REPEAT” function. The same goes for the v coordinate. I experimented with Chris Hecker’s rounding rules, and like Chris with rotating bitmaps and setting Dx_Du & Dx_Dv rules, but without result. Chris’s rounding rules only fit his classic scanline-based, top-to-bottom, left-to-right, two-subtriangle algorithm. Please, if someone knows the solution to this problem, HELP !! .. or better.. Nick, you “THE MASTER OF MODERN RASTERIZING”, HEEEEEEEEEEEELP!!! ???? :-( ???

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Feb 14, 2010 at 12:21

I almost forgot:
- I implemented a Zmax function in my GeSource library, so with the n-level hierarchical z-buffer not only small but also big triangles can be rejected nicely.
- I implemented 8x8 tile caching for texture memory and for render targets (texture = render target), plus conversion from linear to 8x8-tile memory for textures, and from 8x8-tile back to linear for converting the rendered framebuffer to a linear representation for display. Switching a texture to render-target mode will work without conversion (not implemented yet). 8x8 tiling helps the CPU cache, but in the current form of my hierarchical drawing routines it doesn’t help that much.

PS: My library is not a second prototype of SW-Shader. It will be a hard-coded deferred renderer without much flexibility. Flexibility = lower speed, except for SW-Shader aka SwiftShader. :-)

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Feb 15, 2010 at 05:37

@jiti

It’s about the texture position: it must stay between 0 and 0.999999 in a triangle with u coordinates of, for example, (0,0)-(0,1)-(1,1), without a “CLAMP” or “REPEAT” function.

That’s not possible. Mathematically the coordinates range from 0.0 to 1.0, irrespective of the fill convention. And due to rounding it could go slightly outside that range.

But why can’t you use a clamp or repeat function?

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Feb 16, 2010 at 00:25

Thanks for the reaction, Nick :), but I mean a texture coordinate rounding rule that depends on the rotation of the texture; the rounding rules are similar to the two fill conventions for triangle rasterization (top-left & bottom-right).
Chris talks about them in his articles. I can use the two functions (clamp & repeat) for masking when the texture sample pointer goes outside the range of the texture image (0..1 normalized coordinates), but I am a bit pedantic about the precision of texture mapping. I want to do texture mapping the way the GPUs do it. OK, floating-point precision is a problem, a problem without a solution. But the texture mapping problem inside the triangle does have a solution. I drew a small diagram of the problem to better visualize what I mean: http://img3.imageshack.us/img3/596/textureproblem.png The problem is in the 2 ellipses, as you can see.
We have a texture of 2x2 random colors. The texture coordinates are not normalized.

In the 0-degree diagram: when we step horizontally from left to right for the u coordinate, we never reach the value “2”, because the last pixels on the right edge of the quad are rejected by the top-left fill convention of the triangle rasterization algorithm.
It’s similar for the v coordinate: we step from top to bottom but we never reach the value “2” on the bottom-edge pixels, because the pixels lying on the edge are rejected by the fill convention. This is the good case !!!

Now the opposite.. yep, the problem (bad) case:
Now we rotate the texture by 180 degrees.
We see (in the red ellipses) that we start sampling at position “2” for both the u and v coordinates. Without the repeat & clamp functions, we would be sampling outside the texture image. This is the point of my problem.

As I said, Chris found a solution, but only for his algorithm. Now I want to find the rounding-rules algorithm for your 3-edge-check rasterization algorithm, and it must be compatible with your fill convention. Thanks in advance for your help :)

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Feb 16, 2010 at 00:42

Ok, i repeat now but.. SORRY FOR MY ENGLISH !! :D

0042e38e2d1ba68e2b87b984868468e7
0
stowelly 101 Feb 16, 2010 at 19:25

@Nick

Hi stowelly. Unfortunately your code for determining the gradients isn’t correct. Probably the best source for a fast and robust algorithm is Triangle Scan Conversion using 2D Homogeneous Coordinates (section 4). I hope that helps.

thanks, that makes a lot of sense.

I now have textured polys :)

I now have another question regarding this process. I need to utilise a Z buffer, and I’m not entirely sure how I can get from the X,Y in screen space back to the triangle’s relative Z coord in order to do these checks. Any ideas?

thanks

0042e38e2d1ba68e2b87b984868468e7
0
stowelly 101 Feb 16, 2010 at 19:27

jiti

I had the same problem as you with the mesh I was using.

Some textures are mapped using texture wrapping etc.

Picture it as a sheet containing multiple copies of the same texture.

So if you had -0.4 that would actually translate to 0.6; likewise, if you had 1.2 that would be 0.2, if that makes sense. I’m sure there are other texture addressing modes such as mirroring etc., but this was the solution to my problem.

0042e38e2d1ba68e2b87b984868468e7
0
stowelly 101 Feb 16, 2010 at 21:05

@stowelly

thanks, that makes a lot of sense.

I now have textured polys :)

I now have another question regarding this process. I need to utilise a Z buffer, and I’m not entirely sure how I can get from the X,Y in screen space back to the triangle’s relative Z coord in order to do these checks. Any ideas?

thanks

You can ignore this post. It’s just the same as finding u and v :) I was thinking it was a lot harder than it actually is.

thanks

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Feb 17, 2010 at 00:05

@jiti

http://img3.imageshack.us/img3/596/textureproblem.png The problem is in 2 elipses as you can see.

Thanks for the image. That does make things clearer.

This situation can’t be avoided though. Even with a GPU if you perfectly align the left edge on the centers of the pixels, and a coordinate on that edge is 1.0, you WILL be sampling at 0.0 instead when using ‘repeat’ addressing.

This isn’t actually a problem when using bilinear filtering instead of point filtering. And if you really want to use point filtering, use ‘clamp’ addressing instead. When the quad is supposed to be aligned to the pixels, make sure you’re subtracting 0.5 from the x and y coordinates so that you cover the entire pixels and the sample locations (at pixel centers) are off the edges.

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Feb 17, 2010 at 06:37

Yes stowelly, the “repeat” function does the wrapping. The formula u0 = u - floor(u) does the wrap; in this example it’s for the u coordinate.
So, as you said, -0.4 will be wrapped to 0.6 and 1.2 to 0.2.
So if you send normalized texture coordinates (0,0)-(5,5)-(5,0), the texture will be repeated 5 times across the triangle. Not good for the cache, but good for memory. :)
One tip:
On the CPU the SSE floor function is a cycle-eater if you do it twice or more per pixel (for u & v). I think it’s better to make all u & v coordinates positive and, in the main loop of the rasterizer, use truncation instead of the floor calculation.. I mean the SSE truncation instruction. It’s one SSE instruction versus the n instructions of a floor function, and floor and trunc give the same result for positive numbers. Just be sure you send positive texture coordinates. But for flexibility just use the standard u - floor(u): lower speed but more flexibility :)
Yes, the z coordinate is handled the same way as u & v.. These are all interpolants, so all are calculated with the same formula. Don’t forget perspective-correct interpolation: good for the z-buffer, good for textures, good for human eyes :D and not so much work for the CPU thanks to SSE :)
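For what it's worth, a minimal sketch of both variants described above:

#include <cmath>

// 'Repeat' addressing: wrap a texture coordinate into [0, 1).
inline float wrapRepeat(float u)
{
    return u - std::floor(u);          // -0.4 -> 0.6, 1.2 -> 0.2
}

// The cheaper variant: if u is guaranteed non-negative, truncation equals floor.
inline float wrapRepeatPositive(float u)
{
    return u - (float)(int)u;          // only valid for u >= 0
}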

stowelly, my problem, though, is implementing rounding rules for texture mapping based on the rotation of the texture, as Chris Hecker does in his scanline algorithm, and implementing them in Nick’s 3-edge-check algorithm. The two rasterization algorithms have different behavior and different rasterization rules.

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Feb 17, 2010 at 07:09

Thanks Nick :).. now I understand. Yes, bilinear filtering can mask it, that’s true. Nowadays software rasterizers must be able to do lighting, perspective correction and, yep, bilinear or better trilinear filtering and shaders. I don’t really need point sampling; as I said, I’m just pedantic in the sense of “what the GPUs can do, my rasterizer should be able to do too” :D. In software I can do post-processing without polygon rasterization and texel filtering, unlike the GPUs. The CPU is more flexible than the GPU.
Now another topic:

Nick, first question: What do you think about a hierarchical z-buffer.. Is it a good idea? And which is better, an n-level z-buffer or a fixed-level one.. for example a 2- or 3-level hierarchical z-buffer? And how big should the tiles be for good performance: 2x2, 4x4, or, like I have, 8x8 per level? What do you think?
..and second question: Is it a good idea to tile the texture image in memory? I use 8x8-pixel tiles for everything in my rasterizer (z-buffer, texture,
render target). How are textures stored in the video memory of graphics cards?

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Feb 19, 2010 at 01:10

@jiti

What do you think about hierarchical z-buffer.. Is it a good idea?

It depends. For modern applications with complex shaders the per-pixel arithmetic work is so high that the time (and bandwidth) spent on reading and writing the z-buffer becomes less significant.

Either way I would advise to focus on adding functionality first before optimizing things.

..and second question: Is it good idea to tileing texture image in memory.. i use 8x8 pixel tiles for all things in my rasterizer (zbuffer, texture,
rendertarget). How are textures stored in videomemory of graphic cards?

Graphics cards store them tiled, as far as I know. But for a software renderer the overhead of ‘swizzling’ the texel address is typically higher than the average memory access latency. Today’s CPUs have huge caches and can handle the access patterns quite well.

Have you implemented mipmapping yet?

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Feb 20, 2010 at 01:21

Nick,
Mipmapping is in my plan. I know that mipmapping will help the cache. It’s good for both CPU & GPU.

If I understand correctly: on today’s CPUs, when I rasterize, it’s not required to tile the texture? I read the old FATMAP documentation about tiled texture mapping, and that approach gave a boost. But that was in the days when the 486 and 586 were the best CPUs. Now there are new CPUs with new behavior, new caching systems and so on, so the rules may have changed.

I experimented. When I use a linearly addressed z-buffer, I have to jump in address on every scanline of a tile: the next scanline begins by adding the z-buffer pitch. To explain: when I draw a tile and walk in the x direction, the writes are linear, like normal writes to system memory, but when I go to the next scanline I calculate a new address by adding the z-buffer pitch. Such big jumps can evict data from the CPU cache (cache pollution), because the pitch can be a large value. For example, at 1024x768 with a 32-bit floating-point z-buffer the pitch is 1024*4 = 4096 bytes, and for an 8x8 block I do 8 pitch increments, which means that starting at offset 0 I end at 8*1024*4 = 32768 bytes. So I’d need roughly a 32 KB cache for good performance when doing 8x8-tile z-buffering on a linearly laid-out z-buffer. And z-buffering isn’t the only memory access: I’m also accessing other variables in memory while drawing the tile, because I’m programming in a high-level language whose optimizations are not that good, like those of other compilers (hand-made asm optimization is better).

So I changed the z-buffer to a tiled representation: when I walk the z-buffer in the x or y direction I just increment the z-buffer pointer by 4 (the size of a 32-bit float), so an 8x8 tile needs only 8*8*4 = 256 bytes. That’s a huge difference, 32 KB versus 256 bytes. I use the same scheme for the color buffer. The color buffer can be used as a texture without conversion: just set the color-buffer pointer as the texture map and you’re done.

With this system I need only two conversions. When I load an image from disk it’s in linear representation; I convert it to 8x8 tiles and set the pointer to the converted image as the texture. When I want to see the output of the drawing, I have to convert the color buffer from 8x8 tiles back to a linear representation. The image-to-texture conversion is done offline, so it doesn’t harm drawing performance, and the color-buffer conversion to a viewable representation happens only once per frame; I made the 8x8-to-linear conversion routine quite fast, so it doesn’t reduce performance much.

I think tiling is a performance boost, a nice memory strategy. GPUs use it, so it should be good for CPU caches too… but what about tile addressing through texture coordinates? I mean calculating the memory position from the texture coordinates; that calculation is a bit slow, I think… though its cost can be offset by the z-buffer rejection phase, which is sped up by tiling. Or? What do you think, guys, or you, Nick?
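As a generic illustration of the addressing involved (this is not jiti's Pascal code):

// Index of pixel (x, y) in an 8x8-tiled buffer of 32-bit values.
// Assumes the buffer width is a multiple of 8 and tiles are stored row-major.
inline unsigned int tiledIndex(unsigned int x, unsigned int y, unsigned int widthInTiles)
{
    unsigned int tileX = x >> 3, tileY = y >> 3;   // which tile
    unsigned int inX   = x & 7,  inY   = y & 7;    // position inside the tile
    return ((tileY * widthInTiles + tileX) << 6)   // 64 values per tile
         + (inY << 3) + inX;
}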

Another topic, about the cache:
Storing material attributes (like diffuse, normal, specularity, glossiness..) in different slots or different memory locations, as GPUs do, or as some software renderers do just to be compatible with GPUs, is a bad idea. I think on GPUs there is a little help called texture arrays. On the CPU side I think an interleaved representation combined with tiling will help: a texture whose texels carry many attributes, like the G-buffers in deferred shading. Such a texture can be read with just one memory access (on the CPU, into a 128-bit SSE register) and with just one set of texture coordinates. The bottleneck is memory bandwidth, and the attribute textures (color, normal, ..) must have the same resolution. What do you think, guys, or you, Nick?

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Feb 20, 2010 at 01:26

My English is bad, very bad, but I think it's understandable :D

66f3d0b0f3375d0be0648d5752aa2f07
0
Pixar 101 Mar 16, 2010 at 07:33

Despite the age of the article I want to revive it… Can anyone say how this rasterizer could be implemented in a C# software renderer where we can't use pointers (we can, but we won't, because it is unreasonable to use unsafe code just to make this work)?

I've written a simple scanline triangle filler, but it costs me a drop of 30 to 100 fps (depending on the area I fill) every time I call this function. That's the hardest part of it:
for (int y = y1; y <= y3; y++)
{
    ixs = (int)(xs + 0.5f);
    ixe = (int)(xe + 0.5f);
    for (i = ixs; i <= ixe; i++)
        Buffer[y * Pitch_Div_4 + i] = color;

    xs += dx_left; xe += dx_right;
}
Buffer is an “int” array whose size is calculated as ScreenHeight * (MemoryPitch >> 2).

Later, when all rasterization is done, I copy the Buffer array into video memory like this: DataRect.Data.WriteRange<int>(Buffer).

Oh, I didn't say yet that I'm using SlimDX to access the video memory directly. So DataRect is the “DataRectangle” containing the data I get after I lock my surface. I should describe it more closely, because for some reason locking/unlocking costs about 120 fps.
I declared a Device class which works as a layer between me and SlimDX. Inside it there is a method which locks the render target:
public DataRectangle LockSurface()
{
    surface = device.GetRenderTarget(0);
    return surface.LockRectangle(LockFlags.None);
}
These two lines cost me 100 fps. It sucks…

So, I've given as much information as I can. I want to know why my code is so slow. I thought about changing my scanline function to the one described in this article. I also thought about changing SlimDX to GDI. But I don't want to do all this irrationally; I want to know why something is worth changing and what should stay.
Thanks in advance.

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Mar 17, 2010 at 00:55

@Pixar

I've written a simple scanline triangle filler, but it costs me a drop of 30 to 100 fps (depending on the area I fill) every time I call this function.

You shouldn't be using FPS as an absolute performance metric. If things previously ran at 10,000 FPS, a drop of 100 FPS is unnoticeable. If you were at 101 FPS, a drop of 100 FPS is disastrous. And it can't drop by 30 to 100 FPS “every time” you call it, since that would eventually result in a negative FPS.

Instead you should just look at the time each operation takes to get a better assessment of the performance.

My favorite metric is actually the number of clock cycles per pixel (for a pixel-heavy scene). If it's on the order of 100 clock cycles, it means that on a 3 GHz single-core CPU at 800x600 resolution you could achieve 60 FPS!

So how much time does your project spend on rasterization each frame? Where does the rest of the time go?

66f3d0b0f3375d0be0648d5752aa2f07
0
Pixar 101 Mar 18, 2010 at 13:55

@Nick

You shouldn't be using FPS as an absolute performance metric. If things previously ran at 10,000 FPS, a drop of 100 FPS is unnoticeable. If you were at 101 FPS, a drop of 100 FPS is disastrous. And it can't drop by 30 to 100 FPS “every time” you call it, since that would eventually result in a negative FPS.

Instead you should just look at the time each operation takes to get a better assessment of the performance.

My favorite metric is actually the number of clock cycles per pixel (for a pixel-heavy scene). If it's on the order of 100 clock cycles, it means that on a 3 GHz single-core CPU at 800x600 resolution you could achieve 60 FPS!

So how much time does your project spend on rasterization each frame? Where does the rest of the time go?

Hmm… what is “the number of clock cycles per pixel” and how is it measured? Also I don’t know how I should measure “the time each operation takes” :)

By the way, I tried 3 different approaches in my filling method: 1. filling an int[] array and then copying it to the surface; 2. filling each pixel in a loop like DataRect.Data.Write(color); 3. filling memory with a method similar to memset, but which fills memory 4 bytes at a time (code is below). All these approaches give the same FPS! I am disappointed. I thought the third technique had to be the fastest…

//////////////////////////////////////////////////////////////////////////////////
IN C++:
extern "C" __declspec(dllexport) void MemSetDWORD(int* ptr, int c, int n)
{
    _asm
    {
        CLD              ; clear direction flag: copy forward
        MOV EAX, c       ; color goes here
        MOV ECX, n       ; number of DWORDs goes here
        MOV EDI, ptr     ; address of the line to fill
        REP STOSD        ; send the pentium on its way
    } // end asm
}

IN C#:
[DllImport(“testfill.dll”)]
public static unsafe extern void MemSetDWORD(int* ptr, int c, int n);

for (temp_y = y1; temp_y <= y3; temp_y++, DataPointer += Pitch_Div_4)
{
    MemSetDWORD(DataPointer + (int)xs, color, (int)(xe - xs + 1));
    xs += dx_left; xe += dx_right;
}
///////////////////////////////////////////////////////////////

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Mar 19, 2010 at 00:09

@Pixar

Hmm… what is “the number of clock cycles per pixel” and how is it measured? Also I don’t know how I should measure “the time each operation takes” :)

With a stopwatch. :ninja:

Seriously now, you should Google for “C# code timing”.

The fundamental computations made by the CPU are synchronized to an internal clock. This clock runs at a certain frequency (typically around 3 GHz these days). One interval is called a clock cycle. So if you know the number of clock cycles an operation takes you know how fast or slow your implementation is, pretty much independent of the CPU.

You can either compute the number of clock cycles per pixel by timing the rendering of a certain number of pixels and taking the clock frequency of your CPU into account, or you could use a profiler. An excellent free profiler is CodeAnalyst.
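In C# the usual timing tool is System.Diagnostics.Stopwatch; the cycles-per-pixel figure then follows from the measured time and the clock frequency. For example, at 100 cycles per pixel an 800x600 frame costs 480,000 * 100 = 48 million cycles, which a 3 GHz core can repeat about 62 times per second. A minimal C++ sketch of the first option (the 3 GHz constant and the function names are assumptions):

    #include <chrono>
    #include <cstdio>

    // Rough sketch: time the rendering of a known number of pixels and
    // convert the elapsed time to clock cycles per pixel.
    void reportCyclesPerPixel(void (*renderFrame)(), long long pixelsPerFrame, int frames)
    {
        const double cpuHz = 3.0e9;   // assumed 3 GHz; plug in your own CPU's frequency

        auto start = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < frames; i++)
            renderFrame();
        auto stop = std::chrono::high_resolution_clock::now();

        double seconds = std::chrono::duration<double>(stop - start).count();
        double cyclesPerPixel = seconds * cpuHz / (double(pixelsPerFrame) * frames);
        printf("%.1f cycles per pixel\n", cyclesPerPixel);
    }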

By the way, I tried 3 different approaches in my filling method: 1. filling an int[] array and then copying it to the surface; 2. filling each pixel in a loop like DataRect.Data.Write(color); 3. filling memory with a method similar to memset, but which fills memory 4 bytes at a time (code is below). All these approaches give the same FPS! I am disappointed. I thought the third technique had to be the fastest…

Clearly your bottleneck is elsewhere. A profiler will tell you exactly where the most time (or clock cycles) is being spent, so you can concentrate on that.

Anyway, I have to seriously warn you that optimizing too early is futile. Make sure you implement all the functionality you want first, and then use your profiling results to determine the bottlenecks / hotspots. Profiling early on is usually a waste of time since the results will typically change radically as you near the end of the project.

Good luck!

26cdc3924f691215caa9f72b78734495
0
alexpolt 101 Apr 14, 2010 at 15:40

Nick, thanks for a very helpful topic. And sorry for reviving this old post again.

You can also rasterize by recursion: divide every edge in half (and all the interpolants - light, uv etc.), then form two new triangles by connecting the midpoint to the vertex opposite the halved edge, and fill them separately. Nice recursion, cheap operations ( >> 1 ), great parallelization, but on a PC it turned out to be slow due to the recursion and the memory usage.

And one remark on interpolating various values across the triangle.
There are two ways: you can either compute a plane equation for each value, or you can compute just two plane equations (for two of the vertices, with the remaining vertex at 0) whose values run from 0 to 1. The final value then looks like: z = z0 + dzdx1 * (x-x0) + dzdy1 * (y-y0) + dzdx2 * (x-x0) + dzdy2 * (y-y0);
That way you can calculate many interpolants across the triangle.
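One way to read that remark (a sketch of my interpretation, not alexpolt's code): set up two linear weight planes that are 1 at one vertex and 0 at the other two, and reuse them for every attribute.

    // Two weight planes, set up once per triangle from the edge functions,
    // normalized so each is 1 at "its" vertex and 0 at the other two.
    struct Plane { float value0, dx, dy; };   // value at (x0, y0) plus x/y gradients

    // Interpolate any per-vertex attribute (a0, a1, a2) at pixel (x, y).
    inline float interp(float a0, float a1, float a2,
                        const Plane& w1, const Plane& w2,
                        float x, float y, float x0, float y0)
    {
        float l1 = w1.value0 + w1.dx * (x - x0) + w1.dy * (y - y0);
        float l2 = w2.value0 + w2.dx * (x - x0) + w2.dy * (y - y0);
        return a0 + l1 * (a1 - a0) + l2 * (a2 - a0);
    }

The same two weights serve depth, colors, texture coordinates and so on, which is the point of the remark: only two planes have to be set up per triangle, no matter how many interpolants there are.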

32f6d05912ce22b4a6ae9ad85a7e45af
0
renton79 101 Apr 18, 2010 at 19:31

Hi

Nick, first of all, great work here. I have implemented it in my project and it works beautifully; there were a few typos but nothing serious. I have one question though: have you got anything similar for line rasterization? Or maybe a hint on how I could easily create a fast algorithm?

Dave

9a3116910021121a0a86d1169d01ca60
0
jiti 101 May 14, 2010 at 04:52

Hi Nick. I read the paper about 2D homogeneous rasterization. I nearly understand the rasterization, but all the calculations need to be done on floating-point numbers. How do they handle the fill convention? It needs finite precision, like integers. Thanks in advance for the answer.

About the hierarchical z-buffer technique: it's problematic because the whole pyramid, with zmax and zmin info, tile info and subtile pointers on every level, is just too much data for the CPU cache. So it's slow. First I programmed it with a recursive algorithm (using a call-ret subroutine system), then with per-level throughput checking, but both algorithms were slow. So I went back to basics and did a 2-level hierarchical z-buffer (zmin-zmax per tile), and later 3 levels (64x64 tiles with 8x8 subtiles) like in the Larrabee rasterization paper. But I still need to compare the speed of the 3-level against the 2-level z-buffer. Now I know why other developers don't use the full hierarchical z-buffer pyramid.

1507dcea53ec1a6e6a6bb2d0e017d34a
0
Geri 101 May 25, 2010 at 17:26

hey Nick, do you have some benchmarks of your software D3D renderer?

(if you have some and want to show them, but don't want to post them here, please send the results to me over MSN.)

i just want to compare them to the performance of my code.

greetings!

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 May 25, 2010 at 22:18

@Geri

hey Nick, do you have some benchmarks of your software D3D renderer?

What benchmark numbers do you need exactly? Maybe it’s easier to just run it yourself: http://www.transgaming.com/business/swiftshader/

1507dcea53ec1a6e6a6bb2d0e017d34a
0
Geri 101 May 26, 2010 at 10:17

It's fast (sometimes a bit faster than mine, sometimes a lot faster, sometimes equal or slower). How many CPU cores can it use?

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 May 27, 2010 at 11:16

@Geri

How many CPU cores can it use?

In theory up to 16. Depending on the application I get about 70% CPU usage on my 8-threaded Core i7. For a windowed game you can open [url]http://localhost:8080/swiftconfig[/url] in a browser and change some of the settings.

1507dcea53ec1a6e6a6bb2d0e017d34a
0
Geri 101 May 28, 2010 at 21:09

Why don't you try to sell it to Linux users, as a DLL for Wine, for people who have problems with 3D acceleration?

9a3116910021121a0a86d1169d01ca60
0
jiti 101 May 30, 2010 at 17:36

Nobody wants to answer my question about 2D homogeneous rasterization and the fill convention? :(

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 May 30, 2010 at 19:23

@jiti

Hi Nick. I read the paper about 2D homogeneous rasterization. I nearly understand the rasterization, but all the calculations need to be done on floating-point numbers. How do they handle the fill convention? It needs finite precision, like integers. Thanks in advance for the answer.

Hi jiti. The paper doesn’t discuss the fill convention. But in the first post of this thread I explain how to implement it correctly using fixed-point numbers. If it’s not clear how it’s done, try looking for tutorials on fixed-point math.

About the hierarchical z-buffer technique: it's problematic because the whole pyramid, with zmax and zmin info, tile info and subtile pointers on every level, is just too much data for the CPU cache. So it's slow. First I programmed it with a recursive algorithm (using a call-ret subroutine system), then with per-level throughput checking, but both algorithms were slow. So I went back to basics and did a 2-level hierarchical z-buffer (zmin-zmax per tile), and later 3 levels (64x64 tiles with 8x8 subtiles) like in the Larrabee rasterization paper. But I still need to compare the speed of the 3-level against the 2-level z-buffer. Now I know why other developers don't use the full hierarchical z-buffer pyramid.

It’s not easy to get a speedup from a hierarchical z-buffer in software. Unlike in hardware, there’s a real overhead in terms of clock cycles. Also, hardware uses it both to reduce overdraw and to reduce memory bandwidth. But nowadays many applications use a depth-only pass so overdraw is already minimal, and a software renderer doesn’t really benefit from a reduction in memory bandwidth (the number of clock cycles spent per pixel is typically quite high so bandwidth isn’t the bottleneck).

Anyway, it depends on the application, and you’ll need to experiment to see what works best.

9a3116910021121a0a86d1169d01ca60
0
jiti 101 May 30, 2010 at 20:37

Nick, thanks for the answer. :)

Yes Nick, development is all about experiments… As you see, I experimented with the full z-buffer pyramid and I learned new things. :)

I understand your algorithm with the fill convention. The rule is that it must be done with integer numbers because of the finite precision (I experimented with it, and only integers worked well).
But how can your integer fill-convention algorithm be combined with 2D homogeneous rasterization?
The 2D homogeneous rasterization needs its calculations done in floating point, but the fill convention doesn't work with floating-point numbers.

So the question is: HOW CAN I COMBINE THEM? :(

0ab1e9a89c4e0018f85b0edea50cd024
0
thebigT 101 May 31, 2010 at 06:16

jiti, I toyed around with 2D homogeneous rasterisation last year after reading that paper. You find the triangle edge gradients in 2DH space as the article describes, using floating-point math, then convert the values to a fixed-point format for rasterisation. You need to take your screen scaling / mapping into account in the same way you do for other surface interpolants like depth and texture values. I gave up on the technique because in extreme cases the gradients would overflow the 32-bit fixed-point integer. If anyone has had success with this method I would be interested to hear; it was tantalisingly elegant.

9a3116910021121a0a86d1169d01ca60
0
jiti 101 May 31, 2010 at 14:53

@thebigT

jiti, I toyed around with 2D homogeneous rasterisation last year after reading that paper. You find the triangle edge gradients in 2DH space as the article describes, using floating-point math, then convert the values to a fixed-point format for rasterisation. You need to take your screen scaling / mapping into account in the same way you do for other surface interpolants like depth and texture values. I gave up on the technique because in extreme cases the gradients would overflow the 32-bit fixed-point integer. If anyone has had success with this method I would be interested to hear; it was tantalisingly elegant.

The gradient values can only overflow (in the integer case) if the area of the triangle is smaller than 1. So we only need to draw triangles whose area is greater than or equal to 1; the area is already computed when we check whether the triangle is back-facing (in 2D). It's logical that triangles with an area smaller than 1 can't be seen. :)

Yes TheBigT, I found the description in the paper of the conversion from float to fixed-point numbers, but if I do it per tile for the edge interpolants, the fill convention won't work, because the edge interpolants were interpolated through floating-point numbers. I can't mix the precision of floating-point and integer numbers in the edge interpolants. If the triangle is huge, the edge interpolants have big values; they can't be converted to integers because they don't fit in a 32-bit integer, as you say, and some precision is lost in floating point. Nick's fill convention needs exact, bit-wise precision, so only fixed-point numbers can be used.
If we don't find a good solution for a working fill convention, then 2D homogeneous rasterization and clipping is useless for me and I'll go back to standard 3D polygon-plane clipping. :(

A pseudo algorithm would help me understand it better.

0ab1e9a89c4e0018f85b0edea50cd024
0
thebigT 101 Jun 02, 2010 at 02:36

The only suggestion I can make is to try using 64-bit integers. I didn't want to make the jump to 64-bit in my program, so I returned to conventional screen-space rasterisation.

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Jun 02, 2010 at 11:01

@thebigT

The only suggestion I can make is to try using 64-bit integers. I didn't want to make the jump to 64-bit in my program, so I returned to conventional screen-space rasterisation.

Adding more bits doesn't make sense. You know, 2D homogeneous rasterization is about detecting whether a 3D point in space lies inside a convex polyhedron built from the frustum clipping planes and the planes constructed from the triangle edges. The position of the point is interpolated across the triangle and we detect whether it is inside or outside the polyhedron - similar to building a triangle from lines and detecting whether a 2D point is in the same half-space of all 3 lines (the tutorial of this thread). The space of the polyhedron, or the size of the triangle, can be huge or very small, so we operate with true positions in space (no normalization tricks, no numbers between -1 and 1) and need variable precision, so we can only use floating-point numbers. That's the only number type flexible enough for these calculations. As you said, with a fixed number of bits we can easily overflow an integer, or lose all precision if the number is too small. The problem arises when we divide by z (or w).

Does anyone have a working combination of 2D homogeneous rasterization with a fill convention? Is there any trick? Or do we close this problem without a solution? :(

Damn, my head hurts!!

340bf64ac6abda6e40f7e860279823cb
0
_oisyn 101 Jun 26, 2010 at 01:19

Digging up an old thread. I'm currently in the process of researching whether it's feasible for us to do occlusion queries using software-based z-buffer rendering. We don't want to use hardware occlusion queries as the GPU isn't really designed for quick queries - the latency is too high, and it's usually busy doing other work (in most of our use cases, the GPU is still busy rendering the previous frame while we're in the process of culling for the next). I too have implemented a rudimentary triangle rasterizer and have used some of the tips in this topic :). Of course, my job is a bit easier as I only need to fill a low resolution zbuffer - no color interpolation or texture fetches are necessary ;)

I’m currently using floating point math, which is enough precision for the resolution I need (probably about 256 pixels wide, perhaps 512). How does integer SSE math compare to float SSE?

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Jun 26, 2010 at 09:29

.oisyn, I think integer SSE math is a bit faster or about the same as floating point on modern CPUs, but the SSE instruction set is optimized more for floating-point operations than for integers, so better use floating-point numbers if you don't need the fill convention, fixed-point math, color packing/unpacking or other integer-specific operations. On the other hand, with integer math you can use the MMX registers, for example just for storing data (so you don't touch memory and the CPU cache), or to do integer math in parallel, which can speed things up too. It's your choice…

9a3116910021121a0a86d1169d01ca60
0
jiti 101 Jun 26, 2010 at 09:48

Jiti will be renamed to Herrcoolness

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Jun 26, 2010 at 10:56

Ok guys… I uploaded a new version of my Phenomenon engine on SourceForge: https://sourceforge.net/projects/phenomenon/. There is no z-buffer yet, but it's a question of time and of copy&paste from my old n-level hierarchical z-buffer code. As I wrote in the news on the project page, there will be no full pyramidal z-buffer, just 1 or 2 levels. I recoded the whole triangle drawing system. Compared to Nick's algorithm the subpixel precision is 11 bits instead of 4, and I don't use Nick's basic rectangle traversal but my own, based on edge detection and block scanlining, which works better for triangle drawing. I think all graphics-card developers use similar algorithms, judging by the many pictures I've seen of triangle traversal paths. Checking whether a tile is inside, partially inside or outside is optimized with trivial accept and reject tests based on Intel's document about Larrabee rasterization. Finally, I work in the negative half-spaces of the lines, not the positive ones as in Nick's algorithm, because of the sign bit of integers, which I use in logic operations. The triangle rasterization code, without pixel drawing, is nicely fast and eats almost none of the CPU time. :) And it's not even assembler-optimized. :w00t: Later tests will show more. ;)
I wish I could write an article of similar quality to Nick's :worthy:, as part 2 of “advanced rasterization”, with my research and code (Pascal/asm) snippets. There is one problem: my English is horrible… :wallbash:
Oh, and I coded texture fetching with bilinear interpolation and clamp/repeat options. No mip-mapping yet.

The graphic odyssey continues…. :cool:
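For reference, a small sketch of the trivial accept/reject tile test mentioned above, in the positive-half-space convention of the original article rather than the negative one Herrcoolness uses (the names are mine): for each edge E(x, y) = A*x + B*y + C, the signs of A and B pick the tile corner where E is largest and the one where it is smallest, and testing those two corners classifies the whole tile at once.

    enum TileClass { TILE_OUTSIDE, TILE_INSIDE, TILE_PARTIAL };

    // Classify a tile of size tileW x tileH against one edge function.
    // A tile is fully inside the triangle only if it is TILE_INSIDE for
    // all three edges; TILE_OUTSIDE for any edge rejects it completely.
    inline TileClass classifyTile(float A, float B, float C,
                                  float tileX, float tileY, float tileW, float tileH)
    {
        // Corner where the edge function is largest ("accept" corner)
        float acceptX = tileX + (A >= 0 ? tileW : 0);
        float acceptY = tileY + (B >= 0 ? tileH : 0);
        // Corner where the edge function is smallest ("reject" corner)
        float rejectX = tileX + (A >= 0 ? 0 : tileW);
        float rejectY = tileY + (B >= 0 ? 0 : tileH);

        if(A * acceptX + B * acceptY + C <= 0) return TILE_OUTSIDE;  // even the best corner fails
        if(A * rejectX + B * rejectY + C >  0) return TILE_INSIDE;   // even the worst corner passes
        return TILE_PARTIAL;                                         // the edge crosses the tile
    }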

340bf64ac6abda6e40f7e860279823cb
0
_oisyn 101 Jun 26, 2010 at 13:19

@jiti

.oisyn, I think integer SSE math is a bit faster or about the same as floating point on modern CPUs, but the SSE instruction set is optimized more for floating-point operations than for integers, so better use floating-point numbers if you don't need the fill convention

Nonsense, floats work perfectly with fill convention. You just have to use the right epsilon when increasing the halfplane values for the topleft rule.

I snap my vertex coords on a grid by adding and subtracting 16777216/subpixelgridsize, and then I add 1/subpixelgridsize to the halfplane value for top-left edges. I've currently defined subpixelgridsize as 16, which corresponds to the 4 fractional bits in Nick's solution, but I'm still experimenting with it. 64 gives smoother results for slow moving polygons, but it takes away precision and I might get into trouble for polygons way larger than the screen, so I'll probably have to clip at some point. Then again, it's for visibility determination only, so I can afford to be off a pixel or two.
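A sketch of how that snapping can look in code. The constant here is 2^23/gridsize rather than the 2^24/gridsize from the post above; the exact value you need depends on your coordinate range and the rounding mode, so treat it as an assumption to verify.

    // Snap float coordinates to a 1/GRID subpixel grid and bias top-left
    // edges, as described above (constants are assumptions).
    static const float GRID = 16.0f;                 // subpixel grid
    static const float SNAP = 8388608.0f / GRID;     // 2^23 / GRID

    inline float snapToGrid(float v)
    {
        // Adding and subtracting SNAP forces v to be rounded to a multiple
        // of 1/GRID, because at the magnitude of (v + SNAP) the float
        // spacing is exactly 1/GRID for screen-sized coordinates.
        return (v + SNAP) - SNAP;
    }

    // Bias the edge constant by one subpixel step for top-left edges, so a
    // pixel lying exactly on such an edge passes a "> 0" test (fill convention).
    inline float edgeConstant(float C, bool topLeft)
    {
        return topLeft ? C + 1.0f / GRID : C;
    }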

Also, I need to be able to run on 360 and PS3 as well. I know the cell SPU’s have vector int arithmetic, but I’m not so sure about AltiVec in the PowerPC.

Anyway, currently, on an Intel Core 2 I’m able to push 5000 on-screen cubes within 7 million cycles (2.33ms on a single 3GHz core) using a 256x144 zbuffer. When quadrupling the buffer size to 512x288 it goes up to 9.8ms.

cubes.png

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Jun 26, 2010 at 14:02

@.oisyn

Nonsense, floats work perfectly with fill convention. You just have to use the right epsilon when increasing the halfplane values for the topleft rule.

I snap my vertex coords on a grid by adding and subtracting 16777216/subpixelgridsize, and then I add 1/subpixelgridsize to the halfplane value for top-left edges. I've currently defined subpixelgridsize as 16, which corresponds to the 4 fractional bits in Nick's solution, but I'm still experimenting with it. 64 gives smoother results for slow moving polygons, but it takes away precision and I might get into trouble for polygons way larger than the screen, so I'll probably have to clip at some point. Then again, it's for visibility determination only, so I can afford to be off a pixel or two.

Also, I need to be able to run on 360 and PS3 as well. I know the cell SPU’s have vector int arithmetic, but I’m not so sure about AltiVec in the PowerPC.

Ok, I'll take that into account… The rounding idea can help me a lot, thanks. Check out my project on SourceForge; maybe some ideas from my source code can help you…

340bf64ac6abda6e40f7e860279823cb
0
_oisyn 101 Jun 27, 2010 at 21:10

I thought of another thing worth considering. Rather than implementing a >= using > and some offset in the case of top-left edges, you could use >= in general and implement the > by using a negative offset in the case of bottom-right edges. That way, for very large polygons, you'll never get potential holes. You'll only get potential overdraw, but that doesn't really matter that much.
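In the fixed-point style of the original article the two formulations differ only in where the bias goes (a tiny sketch, names mine; exact integer math makes them equivalent, the difference is only in how overflow misbehaves):

    // Variant A (the article): bias top-left edges by +1 and test  C + bias > 0
    inline int edgeBiasA(bool topLeft) { return topLeft ? 1 : 0; }

    // Variant B (the suggestion above): test  C + bias >= 0  and bias the
    // non-top-left edges by -1 instead; for huge polygons this can only
    // produce a pixel of overdraw, never a hole.
    inline int edgeBiasB(bool topLeft) { return topLeft ? 0 : -1; }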

A1974ba0bcf2630484aab68ee3e7e2f1
0
Frogblast 101 Jun 28, 2010 at 05:52

@.oisyn

I thought of another thing worth considering. Rather than implementing a >= using > and some offset in the case of top-left edges, you could use >= in general and implement the > by using a negative offset in the case of bottom-right edges. That way, for very large polygons, you'll never get potential holes. You'll only get potential overdraw, but that doesn't really matter that much.

It doesn’t really matter for opaque geometry, but it definitely does matter for blended primitives.

340bf64ac6abda6e40f7e860279823cb
0
_oisyn 101 Jun 28, 2010 at 07:56

Ah yes of course, that’s true. I didn’t realize that as I don’t have to deal with translucencies ;)

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Jul 03, 2010 at 09:24

Ok guys, I removed the SSE4 instructions from the “texturesampler_bilinear” procedure because they caused small problems on AMD processors (I have an Intel) :happy: and replaced them with a faster lookup table that calculates the in-tile addresses for the bilinear sample fetch. FPS jumped from 26 to 32; with point sampling it is about 36-37 fps, so it's a nice speedup. :yes:
I compared the “tile bilinear sampler” against a “linear (the standard in-memory image representation on PC) bilinear sampler” and the speed stayed almost the same… the linear representation of the texture was a bit slower, because it is not cache friendly. Tiled textures are good for big textures: with a high-resolution texture the speed doesn't drop as fast as with the linear (standard) representation. Of course the linear calculation of the sample address from the texture coordinates is much simpler, but the cache pollution is much bigger and causes much bigger slowdowns. :geek:

https://sourceforge.net/projects/phenomenon/

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Jul 22, 2010 at 23:56

New update. I added per-pixel mip-mapping. To show how it works I created 2 demos: one where you can see the mip-map levels, and a second with normal drawing. There is a noisy pattern at the mip-map level boundaries. The reason is that I use the “RCPPS” SSE instruction, which is not as precise as “DIVPS”. With “DIVPS” I get sharp edges at the mip-map boundaries, but that instruction is slower than “RCPPS”. When the mip-map levels are not colored, the noisy pattern is not visible - see the non-colored mip-map demo. ;) Waiting for your feedback guys ;)

https://sourceforge.net/projects/phenomenon/

Ceee4d1295c32a0c1c08a9eae8c9459d
0
v71 105 Jul 23, 2010 at 06:55

I am very interested in occlusion culling. I wrote some renderers in the past, but at that time I had difficulties making the whole thing run at a decent speed.
I had abandoned the idea in favour of a PVS calculation system, which I am starting to rethink again.
My main problem was basically that I had to render everything twice, even though at a lower resolution for the occlusion buffer.
Are things mature enough to work on it again? Isn't realtime ray tracing approaching fast? Will all the accumulated expertise be wasted in 3-4 years?
Basically, I mean: is it worth writing a 'manual' occlusion culling system right now?
Gurus, please respond…

340bf64ac6abda6e40f7e860279823cb
0
_oisyn 101 Jul 23, 2010 at 10:15

@Herrcoolness

The reason is.. i use the “RCPPS” SSE instruction which is not so precise, as when i use “DIVPS” . Using “DIVPS” i get sharp edges on the mip-map boundary’s, but this instruction is more slower then the “RCPPS”.

https://sourceforge.net/projects/phenomenon/

Try using RCPPS followed by one or two iterations of a newton-raphson division (if x is an approximation of 1/d (which is given by the rcpps instruction), x*(2 - d*x) is a better one).
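For completeness, the same x*(2 - d*x) refinement written with SSE intrinsics (a minimal sketch; the hand-written asm a few posts below does the equivalent in registers):

    #include <xmmintrin.h>  // SSE

    // RCPPS gives roughly 12 correct bits; one Newton-Raphson step
    // x1 = x0 * (2 - d * x0) roughly doubles the number of correct bits.
    inline __m128 reciprocal_nr(__m128 d)
    {
        __m128 x0  = _mm_rcp_ps(d);                                  // approximate 1/d
        __m128 two = _mm_set1_ps(2.0f);
        return _mm_mul_ps(x0, _mm_sub_ps(two, _mm_mul_ps(d, x0)));   // x0 * (2 - d*x0)
    }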

340bf64ac6abda6e40f7e860279823cb
0
_oisyn 101 Jul 23, 2010 at 10:31

@v71

I am very interested in occlusion culling. I wrote some renderers in the past, but at that time I had difficulties making the whole thing run at a decent speed.
I had abandoned the idea in favour of a PVS calculation system, which I am starting to rethink again.
My main problem was basically that I had to render everything twice, even though at a lower resolution for the occlusion buffer.
Are things mature enough to work on it again? Isn't realtime ray tracing approaching fast? Will all the accumulated expertise be wasted in 3-4 years?
Basically, I mean: is it worth writing a 'manual' occlusion culling system right now?
Gurus, please respond…

I'm not sure what you mean by the raytracing argument, but occlusion queries using software rendering *are* feasible and are already being used in AAA titles (DICE is using it in its engine for Battlefield: Bad Company 2). We are soon going to research this subject as well, and as you can read a few posts back I already made a rudimentary singlethreaded implementation on PC. Of course, as with all occlusion culling systems, their effectiveness depends on the kind of environment you have. While z-buffer based occlusion queries are pretty decent all-round, for indoor environments with clearly separated rooms and corridors a cell and portal system could provide better culling. But for large dynamic open-world environments, I reckon that z-buffer based queries are both the best culling system and the easiest to set up (since you can mostly use the actual modeled geometry for occlusion, although you probably still want to insert low-poly occlusion-only meshes here and there and not render high-poly or skinned meshes, to speed things up).

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Jul 24, 2010 at 10:43

@.oisyn

Try using RCPPS followed by one or two iterations of a newton-raphson division (if x is an approximation of 1/d (which is given by the rcpps instruction), x*(2 - d*x) is a better one).

Thnx for the tip ;)

Ceee4d1295c32a0c1c08a9eae8c9459d
0
v71 105 Jul 24, 2010 at 22:09

Sorry for my messy post before, I'll try to explain better.
For a software rasterizer you basically have to write 2 renderers: one using OpenGL or DirectX and the other running entirely on the CPU.
I mean everything: vertex rotation, perspective division, and a fast triangle filler.
I know that it is sufficient to use the z-buffer, a lower screen resolution and other optimizations, but I am asking myself: is it worth writing a system like this? Isn't ray tracing approaching fast?
Even if ray tracing won't be used to render a complete scene with lighting, will the new multicore GPU boards allow us to write a visibility system running entirely on hardware in a matter of 2-3 years?

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Jul 25, 2010 at 05:39

@v71

Sorry for my messy post before, I'll try to explain better.
For a software rasterizer you basically have to write 2 renderers: one using OpenGL or DirectX and the other running entirely on the CPU.
I mean everything: vertex rotation, perspective division, and a fast triangle filler.
I know that it is sufficient to use the z-buffer, a lower screen resolution and other optimizations, but I am asking myself: is it worth writing a system like this? Isn't ray tracing approaching fast?
Even if ray tracing won't be used to render a complete scene with lighting, will the new multicore GPU boards allow us to write a visibility system running entirely on hardware in a matter of 2-3 years?

Ray tracing is good for drawing high quality images, for physics collisions, or for AI. Better to use a scanline triangle filler or a tile triangle filler: it's fast, memory friendly, coherent, and it uses a lower amount of math. Triangle rasterizers evolved away from ray tracing precisely for faster triangle drawing. So why would you use ray tracing for the occlusion pass, when you can use a fast triangle rasterizer?

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Jul 25, 2010 at 16:52

@.oisyn

Try using RCPPS followed by one or two iterations of a newton-raphson division (if x is an approximation of 1/d (which is given by the rcpps instruction), x*(2 - d*x) is a better one).

I used this method, which added 4 SSE instructions: 2 muls, 1 mov (constant load), 1 sub.
The speed was the same as using divps. :huh:

340bf64ac6abda6e40f7e860279823cb
0
_oisyn 101 Jul 25, 2010 at 20:35

Hmmm, that's too bad. I think I read it in an Intel optimization manual once, but that was a couple of years back (P4 era) and maybe divps has evolved since then. Or perhaps I'm just mistaken :)

.edit: no, it’s still there: http://www.intel.com/Assets/PDF/manual/248966.pdf
Chapter 6.1:

Use the reciprocal instructions followed by iteration for increased accuracy. These instructions yield reduced accuracy but execute much faster. Note the following:
— If reduced accuracy is acceptable, use them with no iteration.
— If near full accuracy is needed, use a Newton-Raphson iteration.
— If full accuracy is needed, then use divide and square root which provide more accuracy, but slow down performance.

If you google “rcpps newton raphson”, a lot of sites say it's faster as well.

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Jul 26, 2010 at 04:44

@.oisyn

Hmmm, that's too bad. I think I read it in an Intel optimization manual once, but that was a couple of years back (P4 era) and maybe divps has evolved since then. Or perhaps I'm just mistaken :)

.edit: no, it’s still there: http://www.intel.com/Assets/PDF/manual/248966.pdf
Chapter 6.1:

If you google “rcpps newton raphson”, a lot of sites say it's faster as well.

I used :

// xmm1 - input value d
RCPPS xmm0,xmm1    ; xmm0 = x, approximation of 1/d
mulps xmm1,xmm0    ; xmm1 = d*x
mulps xmm1,xmm0    ; xmm1 = d*x*x
addps xmm0,xmm0    ; xmm0 = 2*x
subps xmm0,xmm1    ; xmm0 = 2*x - d*x*x = x*(2 - d*x)
// xmm0 - output value, refined 1/d

Now there is no constant load.
But it's still almost the same speed as:
*DIVPS - 32.50 fps
*RCPPS + Newton-Raphson iteration - 32.40 fps
*RCPPS - 33.70 fps
I have an Intel Core 2 Quad Q8300. It may be true that after 4 years DIVPS got faster. But I will keep the iteration for older CPUs. :) Thanks again for the tip :).

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jul 26, 2010 at 13:14

@Herrcoolness

I have Intel Core 2 Quad Q8300. It may be true. After 4 years the DIVPS can be faster. But i will use the iteration for older CPU’s. :)

Starting from the Core 2 on 45 nm technology, Intel implemented a new radix-16 division unit, which is twice as fast as its predecessor.

divps still has a high latency of maximum 15 cycles, but if you have other instructions that can execute independently then that’s no problem. If instead you use rcpps and a Newton-Raphson iteration the total latency is nearly identical but you’re executing more instructions (while you could have done other work instead).

So indeed, on newer processors it's faster to use divps, and you even get full precision!

340bf64ac6abda6e40f7e860279823cb
0
_oisyn 101 Jul 26, 2010 at 14:26

Nick, do you know any resources where you can get that kind of information? Or is it just a matter of keeping up with latest developments?

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Aug 13, 2010 at 14:46

News, news, news guys. ;) I (re)implemented the hierarchical z-buffer with 3 basic functions: fast tile skipping, standard per-pixel z comparison, and fast z writing without comparing against the old z values in the z-buffer.

I uploaded 2 demos: one with colored debug info and one without the coloring, to see how it normally works.
*black tiles - skipped tiles of the hidden small quad
*green tiles - tiles drawn with the fast-write function (no z comparison) and not compared against the triangle edges
*cyan tiles - tiles drawn with the fast-write function (no z comparison) but compared against the triangle edges
*gray tiles - tiles drawn with the function that compares the z values against the z-buffer and compared against the triangle edges

Next stop… clipping and the transform pipeline… and the first rotated cube? :happy:

https://sourceforge.net/projects/phenomenon/

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Aug 29, 2010 at 14:55

Ok guys. What's new?
The triangle input coordinates are now in NDC (normalized device coordinates), so the x and y positions need to be in the -1..+1 interval. Why? Because that's what graphics cards use, and it helped me solve the problem of resizing the window: the size of the triangles now changes too and stays proportional to the rendering window.

And I added a third texture filtering method for low-end PCs. It's almost as fast as nearest filtering (because of the single texture fetch) but looks almost like bilinear. Yes, you saw this method in Unreal. I found a description of this technique in the old flipcode archive (http://www.flipcode.com/archives/Texturing_As_In_Unreal.shtml)

There are 2 demos:
-one static, to see how fast all 3 techniques are (push 1,2,3 to change the filtering technique)
-and one dynamic, to see the dither-bilinear technique in action (push 1,2,3 to change the filtering technique)

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Sep 18, 2010 at 15:09

News, news, news!! I added a full vertex transformation pipeline and homogeneous clipping of triangles, set up like Direct3D (same orientation, perspective matrix and so on). I created a small demo where you can move the camera and see a big cube with a 2048x2048 texture. About the rasterizer: I reimplemented Nick's rasterizer with fixed-point math because of its numerical stability near the edges of the drawing bounds. Sometimes, after clipping and the homogeneous division, the positions of the triangle's points went slightly outside the screen, which caused an error in the triangle rasterizer.
About the demo;
q,e - moving in y direction
a,d - moving in x direction
w,s - moving in z direction
1,2,3- filtering method
9,0 - vsync on-off

https://sourceforge.net/projects/phenomenon/

2b97deded6213469bcd87b65cce5d014
0
Mihail121 102 Sep 18, 2010 at 16:36
An unhandled exception occurred at $00402437 :
EAccessViolation : Access violation
  $00402437
  $004183A4  DDRAWFLIPWINDOWED,  line 45 of ddrawwindowed.inc
  $0041872C  GS_WNDPROC,  line 144 of gs_screen.inc
  $0041D87E  WNDKEYBPROC,  line 29 of fenomenon_keyboard.pas
  $0042DAA2  WNDMOUSEPROC,  line 62 of fenomenon_mouse.pas
  $7E418734
  $7E418816
  $7E428EA0
  $7E428EEC
  $7C90E473
  $7E4196C7
  $00411530
  $00401D0D

Heap dump by heaptrc unit
97 memory blocks allocated : 13722380/13722720
86 memory blocks freed     : 13616288/13616616
11 unfreed memory blocks : 106092
True heap size : 5373952 (128 used in System startup)
True free heap : 5266832
Should be : 5267016
Call trace for block $0007DEE8 size 64
  $004097E8
  $00408381
  $0041C375
  $004092AE
  $004183A4
  $0041872C
  $0041D87E
  $0042DAA2
Call trace for block $00067158 size 24
  $00408381
  $0041C375
  $004092AE
  $004183A4
  $0041872C
  $0041D87E
  $0042DAA2
  $7E418734
Call trace for block $000670F8 size 16
  $0041C187
  $004092AE
  $004183A4
  $0041872C
  $0041D87E
  $0042DAA2
  $7E418734
  $7E418816
Call trace for block $0011E410 size 391
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
Call trace for block $0011E240 size 391
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
Call trace for block $020299A0 size 1159
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
  $F0F0F0F0
  $F0F0F0F0
  $F0F0F0F0
  $F0F0F0F0
Call trace for block $02028FC0 size 2439
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
  $F0F0F0F0
  $F0F0F0F0
  $F0F0F0F0
  $F0F0F0F0
Call trace for block $020275E0 size 6535
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
Call trace for block $02022400 size 20871
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
Call trace for block $027A0198 size 74119
  $004178D2
  $004119B8
  $00401CD0
  $0040D111
Call trace for block $00116238 size 83
  $004119B8
  $00401CD0
  $0040D111
E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Sep 19, 2010 at 06:55

Stabilized and reuploaded :happy:

https://sourceforge.net/projects/phenomenon/

2b97deded6213469bcd87b65cce5d014
0
Mihail121 102 Sep 19, 2010 at 09:04

AMD Athlon XP 1900+ at 1.6 GHz
512 RAM
GeForce4 MX 440 with 64 MB
PS/2 Mouse + USB Keyboard

An unhandled exception occurred at $00403247 :
EAccessViolation : Access violation
  $00403247
  $0041A4B4  DDRAWFLIPWINDOWED,  line 45 of ddrawwindowed.inc
  $0041A7DE  GS_WNDPROC,  line 131 of gs_screen.inc
  $0041F98E  WNDKEYBPROC,  line 29 of fenomenon_keyboard.pas
  $00432382  WNDMOUSEPROC,  line 62 of fenomenon_mouse.pas
  $7E418734
  $7E418816
  $7E42C03D
  $7E42C228
  $7E42C1D5
  $004122E7
  $0041A820
  $0041F98E
  $00432382
  $7E418734
  $7E418816
  $7E428EA0

Heap dump by heaptrc unit
84 memory blocks allocated : 8439373/8439688
77 memory blocks freed     : 8434761/8435056
7 unfreed memory blocks : 4612
True heap size : 1867776 (80 used in System startup)
True free heap : 1862528
Should be : 1862616
Call trace for block $00085DD8 size 64
  $0040A5F8
  $00409191
  $0041E485
  $0040A0BE
  $0041A4B4
  $0041A7DE
  $0041F98E
  $00432382
Call trace for block $00067068 size 24
  $00409191
  $0041E485
  $0040A0BE
  $0041A4B4
  $0041A7DE
  $0041F98E
  $00432382
  $7E418734
Call trace for block $00067008 size 16
  $0041E297
  $0040A0BE
  $0041A4B4
  $0041A7DE
  $0041F98E
  $00432382
  $7E418734
  $7E418816
Call trace for block $000E96B0 size 147
  $0040DF21
Call trace for block $000A96B8 size 3859
  $00402A45
  $0040DF21
Call trace for block $000A16A0 size 403
  $00402A45
  $0040DF21
Call trace for block $00099698 size 99
  $00402A45
  $0040DF21
E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Sep 19, 2010 at 15:12

@Mihail121

AMD Athlon XP 1900+ at 1.6 GHz
512 RAM
GeForce4 MX 440 with 64 MB
PS/2 Mouse + USB Keyboard

Is your PC SSE2 compatible? What is your desktop resolution? Some of these checks are not implemented in the program yet.

2b97deded6213469bcd87b65cce5d014
0
Mihail121 102 Sep 19, 2010 at 16:20

@Herrcoolness

Is your PC SSE2 compatible? What is your desktop resolution? Some of these checks are not implemented in the program yet.

SSE2 is not supported, MMX and 3DNow! only. Desktop resolution is 1024x768@16.

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Sep 19, 2010 at 18:13

@Mihail121

SSE2 is not supported, MMX and 3DNow! only. Desktop resolution is 1024x768@16.

The engine only runs in 32-bit color mode, and I think some procedures use SSE2 instructions. Set the desktop to the right color depth and check again.

Be4aa2b4115e19843acf3d0d16a8aee2
0
droolz 101 Oct 21, 2010 at 06:50

Hi, I'm very new to graphics programming and am using the original post as a test project for myself.
I'm using it to rasterize UV coordinates, which, unlike screen space, do not require the Y coordinate to be flipped.

I've implemented the first version of the rasterizer (the unoptimised half-space calculation one) and it works fine (by flipping the half-space coordinates back to the original formula).

When I try to implement the second one (the intermediate optimized one) it breaks. I can only think it's because some of the maths needs to change because I flipped the various deltas (float Dx12 = x1 - x2 becomes float Dx12 = x2 - x1; etc).

Does anyone have any pointers on how the rest of the maths should change? I'm really struggling to understand how it works, especially Cy1, Cy2, Cy3.

Many thanks,

Jules

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Apr 08, 2011 at 18:34

Ok folks, I have a new nick on SourceForge and the project name is a little different, because 2 bad things happened:
1. Some idiot (or idiots) attacked the SourceForge servers, and the SF team changed the passwords of all projects for security reasons.
2. I used the e-mail recovery, but my e-mail provider failed horribly: I couldn't get any new mail, and sometimes the page was down… There were many problems.

So I changed the name of the project and my nick. I will now use the SVN system to push updates to the source code.

About the project… so what is new? The per-tile operations are more optimized, I now use 2D homogeneous rasterization, clipping is done in 4D homogeneous space, the vertex transformation path and backface culling are more optimized, texture reading happens in a post-processing pass, and indexing pixels to triangles is optimized with a bit mask and linked lists (it was slow before because I used 64 separate 32-bit pointers per tile, one per pixel, instead of a combination of one 64-bit mask and a single 32-bit pointer).
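A rough sketch of that mask-plus-pointer layout (the structure and names are my guess at what's being described, not the actual engine code):

    #include <cstdint>

    struct Triangle;   // whatever per-triangle data the shading pass needs

    // One entry per triangle that touches an 8x8 tile: a 64-bit coverage
    // mask (one bit per pixel of the tile) plus a single pointer, chained
    // in a linked list, instead of 64 separate per-pixel pointers.
    struct TileEntry
    {
        uint64_t   coverage;   // bit (y*8 + x) set if the triangle owns that pixel
        Triangle*  triangle;   // the triangle those pixels belong to
        TileEntry* next;       // next triangle touching this tile
    };

    // During shading, walking the list once per tile resolves up to 64
    // pixels per entry with a single mask test.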

In the demo (cube field - 2000 objects):
q,w,e,a,s,d - move in all 6 directions
arrows - rotation of the camera

Next steps:
-per-tile mip-mapping
-bilinear texture filtering with one texture fetch - with one “movaps”

cya ;)

https://sourceforge.net/projects/phenomenonngsw/

For the demo, the desktop needs to be in 32-bit color mode and SSE2 instruction support is required.

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Apr 16, 2011 at 12:42

Hi folks. I just updated the engine with per-tile mip-mapping.

I was thinking about one problem: is the rasterizer really fast? I compared it with the s-buffer technique. The s-buffer can reject almost a whole scanline (the full x resolution) at once, but mine can reject at most 64 pixels. When all scanlines of a polygon (triangle) are rejected, the y-loop only needs about 1050 iterations for a fullscreen triangle at 1680x1050, but my rasterizer needs (1680 div 8) * (1050 div 8) tile loops, which is about 27000, so roughly 27x more… GOD DAMN !!

What do I get when I use scanline-based rasterization in conjunction with an s-buffer?
-faster rejection of pixels in the x direction
-the ability to render into a texture
-a more natural memory organization
-fewer pixels wasted because of tile alignment
-…
The problem? Yes: light calculations. With pixels sorted into tiles I could build a bounding box for a tile from all its pixel positions; checking that bbox in world space against a sphere with the maximum radius of the light, I could reject or accept all 64 pixels at once. With scanline-based rasterization I can't do this. Yes, the pixels could be converted, but I think that would be slow. So I am thinking about treating a horizontal span in the s-buffer as a line segment, checking it against the sphere or cone of the light, and seeing how many pixels I need to calculate.

Ok people… see you later… I think… much later. But don't forget, I am working on it ;)

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Jun 11, 2011 at 23:06

Hi folks. After doing some research, Windows crash tests, and comparing the speed of scanline-based rasterization against tile/half-space rasterization, I've come to the conclusion that scanline rasterization is much slower than tile rasterization. Why is that? First, the techniques I used. (You know all these ideas are not mine - I am just a human like you, not an alien master-brain :D - and it's fair to show you the sources of my knowledge, which will help you understand me and my source code):

Scanline based:

-The rasterization algorithm is based on Chris Hecker's floating-point rasterizer (http://chrishecker.com/Miscellaneous_Technical_Articles), modified for SSE and deferred texturing. Guys, if you are new to rasterization, this is the site where you can learn all the basics and tricks.
-The clipping algorithm is from the latest source code of Nick's SwShader 0.3.0 (ftp://nic.funet.fi/pub/sci/graphics/packages/swshader-softwire/sw-shader/swShader-0.3.0.zip). Nick, nice trick to shift the clipping planes from -1<x<1 to 0<x<1; it helps a lot when you want to calculate the clipping flags with SSE (see my source code).
-The transformation code is based on the article at http://www.cortstratton.org/articles/OptimizingForSSE.php
-The s-buffer idea is based on Paul Nettle's “S-buffer FAQ” (http://www.gamedev.net/page/resources/_/reference/programming/140/algorithms-and-data-structures/s-buffer-faq-r668), and the code that saved me from the hell of writing an s-buffer insert routine is based on Bero's software renderer (http://vserver.rosseaux.net/stuff/BeRoSoftRender.zip), which in turn is based on C++ code by The Swine (http://www.luki.webzdarma.cz/eng_07_en.htm).
-The bilinear filtering is based on Nick's SwShader source code, optimized for SSE2 by me.

Tile based:

-The rasterization algorithm is based on Nick's article (http://www.devmaster.net/codespotlight/show.php?id=17), extended to 2D homogeneous rasterization (http://www.cs.unc.edu/~olano/papers/2dh-tri/) with a little help from the source code of the Attila GPU simulator (https://attila.ac.upc.edu/wiki/index.php/Attila_Project), but the normalization code for the edge equations, which converts them from FPU format to fixed-point format, is my own.
-The calculation of the triangle's 2D screen-coverage bounding box is based on the Attila GPU simulator source code too, which in turn is based on the article “Jim Blinn's Corner: Calculating Screen Coverage”.
-The early accept/reject of block-in-triangle is based on Intel's Larrabee article (http://software.intel.com/en-us/articles/rasterization-on-larrabee/).
-The deferred rendering idea is based on the PowerVR technology article (http://www.imgtec.com/factsheets/SDK/PowerVR%20Technology%20Overview.1.0.2e.External.pdf).
-The hierarchical Zmin updating with a coverage mask is based on the article “Two-level hierarchical z-buffer for 3D graphics hardware” (http://www.si2lab.org/publications/cnf/chchen_iscas02.pdf).
-The transformation code is based on the article at http://www.cortstratton.org/articles/OptimizingForSSE.php

Pros & Cons:

Scanline based:
+fewer calculations when computing the address of a pixel
+fewer wasted pixels for render targets and textures
+rasterization is pixel precise
-variable scanline size, so the drawing loop cannot be directly unrolled
-not a cache-friendly representation of render targets and textures for random access (you can't access pixels in the y direction without destroying data in the cache; calculating dx/dy derivatives is problematic)
+rejects more, equal or fewer than 64 pixels at once with the s-buffer
-we need to draw front to back and right to left to gain speed from the s-buffer and to reject pixels without walking the linked list of segments, which hurts the cache
-the cubefield demo renders at 20-25 fps (the cubes are not drawn front to back or right to left, just in random order)

Tile based:
-more calculations when computing the address of a pixel (MMX or SSE instructions can help)
-more wasted pixels (because the texture and render-target data must be aligned to tile sizes)
-rasterization works on whole tiles, and some CPU cycles are wasted on pixels that are not inside the triangle (SSE helps reduce this)
+a constant drawing loop that can be unrolled because of the constant tile size
+a more cache-friendly representation of render targets and textures for random access (we can access pixels in the y direction without destroying data in the cache - good for pixel shaders, dx/dy derivative calculation and multithreading)
-rejects exactly 64 pixels, yes or no, nothing in between
+fast access to the z-buffer through the x,y coordinates; the coarse level is just one floating-point number per tile; a better organised, grid-based structure
+the cubefield demo renders at 44 fps and more

The problem with the s-buffer is that we sometimes need to walk the linked list of segments. That is a slow process compared to the hierarchical z-buffer, where we have direct access to the memory by calculating the address from the x,y coordinates. So the s-buffer is good for scenes without many polygons, because every new polygon adds a new segment to the s-buffer structure, and walking the growing structure really slows the whole process down. Look at Quake 1: it uses this technique, but in the drawing process the s-buffer is only used for the rooms - static structures with a small number of polygons. The monsters and characters are drawn separately, in another pipeline. Without the s-buffer there would be overdraw, which causes further slowdowns.

Another problem of scanline-based rasterization is the need for clipping. Clipping a polygon in 3D is a slow process, and drawing the clipped polygon with the triangle routine is even slower. We need this clipping to calculate the W coordinates, because we can't use W coordinates behind the camera, and we need a perspective division where W can be 0, which can't be computed. In tile-based 2D homogeneous rasterization we don't have this problem, because we directly determine the visible pixels. We don't even need to divide, because we operate in homogeneous space; we only divide the interpolated values by W after checking that the pixel is visible, i.e. that it lies between the triangle edges and the near and far planes.

Anyway, I uploaded the scanline renderer for learning purposes. If something is not clear, just ask me. But I think the FPS numbers say it all. Now I know that the tile renderer is the right way, and I will continue with it. It's deep in the night and I can barely see anything, so… any feedback, any question, any reaction is welcome. Let's get back to the TILE WORLD !!!

https://sourceforge.net/projects/phenomenonngsw/files/old%20scanline%20based%20version/

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Jul 01, 2011 at 21:20

Hi folks. I updated the engine with a new memory representation for textures. The texture is divided into 8x8 tiles like in the old version, but now the tiles are stored as 9x9 pixels. Why? Because I implemented bilinear filtering, and there was a small problem with the right and bottom texels of a tile: to fetch the 3 neighboring texels I had to read colors from other tiles and compute a new address because of the tile change. Now, when encoding the texture into tiles, I write the colors of the neighboring tiles into the extra right and bottom texels. When reading, all 4 texels come from one tile, and computing the addresses of the other 3 texels is very cheap.

I also implemented per-triangle occlusion: we calculate the z coordinate of the triangle nearest to the camera and compare it against the Zmin values of the tiles inside the triangle's bounding box. Zmin is kept in a separate array from the full z-buffer, so walking the Zmin array is fast; it simulates the on-chip coarse z-buffer in hardware. If there is at least one tile whose Zmin is behind the triangle's nearest z, the triangle needs to be rendered; otherwise it can be skipped. Skipping a triangle saves a nice amount of computation and makes drawing faster. The method is described in the article “Method for accelerated triangle occlusion culling” (http://www.freepatentsonline.com/20030043148.pdf), but if you think about the hierarchical z-buffer you get the idea automatically; I got a similar idea when I was programming the n-level z-buffer.
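A sketch of that per-triangle test (the conventions are assumptions: here z grows with distance and the coarse array holds the farthest z already written per tile, so the "Zmin" naming above may reflect a different depth convention):

    // Returns true if the triangle is completely behind the coarse buffer
    // over its whole screen bounding box and can be skipped.
    bool triangleOccluded(const float* coarseZ, int tilesPerRow,
                          int tileX0, int tileY0, int tileX1, int tileY1,
                          float triangleNearestZ)
    {
        for(int ty = tileY0; ty <= tileY1; ty++)
            for(int tx = tileX0; tx <= tileX1; tx++)
                if(triangleNearestZ < coarseZ[ty * tilesPerRow + tx])
                    return false;   // this tile may still show part of the triangle
        return true;                // hidden everywhere: skip the triangle
    }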

http://sourceforge.net/projects/phenomenonngsw/files/current%20halfspace%20based%20version/Phenomenon%20engine%2001-07-2011.rar/download

E3e7f7fc936dae879440f1a2d882b918
0
Herrcoolness 101 Sep 01, 2011 at 17:07

Ok guys, new update of my project.

Every programmer knows that a 64-bit environment brings new possibilities: more memory, more registers, and the extension of the old general registers to 64-bit size. As an assembler programmer I had problems in some functions because of the small number of CPU registers; I had to work around it with memory accesses to temporary variables and constants, which slows the whole function down. Speed is very important for software rendering, so I changed the OS environment to get more horsepower out of my CPU.

Changing from 32-bit to 64-bit brings some problems, because now you strictly need 64-bit device drivers. Some old hardware is no longer supported by its manufacturer with new 64-bit drivers, so you need to buy new hardware… like me. Now the OS is fully functional.

I changed to the 64-bit version of FPC too. I needed to rewrite the source code of the renderer (changing pointers and pointer operations from dword size to qword size). I changed the output from DirectDraw to GDI because I had problems running the program under the 64-bit environment. I optimized some parts for 64 bit, because now I have 15 general purpose and 15 xmm registers. Man, I felt like a kid in a toy shop… so many registers… :) I also changed some variable names, for example in the gradient calculation, to make it clearer what they mean.

I compared the old rectangular traversal algorithm with a recursive algorithm, but the old one is still faster. Anyway, I uploaded the recursive algorithm (http://sourceforge.net/projects/phenomenonngsw/files/current%2064-bit%20version/old%20recursive%20half-space%20version/Phenomenon%20engine%2031-08-2011.rar/download) for study purposes, and added some more comments for better understanding of the code, e.g. in the trivial reject & accept calculations. Cya ;)

http://sourceforge.net/projects/phenomenonngsw/files/current%2064-bit%20version/current%20half-space%20version/Phenomenon%20engine%2001-09-2011.rar/download

215de2bbbaea860d666b2a7d4c4fdf37
0
zbethel 101 Oct 01, 2011 at 19:20

EDIT: I solved it…

It wasn’t the rasterization algorithm. I was converting the vertices to integers earlier in my code and testing if the size of the edges was zero (testing for degenerate triangles)–without first converting to fixed point. Thus, entire triangles weren’t being drawn. How could I be so stupid! :)

==============================================================

Ahem, I know I’m resurrecting an ancient post. But it’s such a great one! :)

I’ve implemented this rasterization method, and it works great, except that I’ve noticed a problem with small triangles. I noticed your post Nick about entire blocks getting included when the x and y deltas were both zero, but I don’t think this is the same problem. Here’s an image of what’s happening:

flicker.png


I tracked it down to a problem with this line:

if(CX1 > 0 && CX2 > 0 && CX3 > 0)
{
    colorBuffer[ix] = 0xffffffff;
}

Looking at the numbers in the debugger, it does not appear that any overflow is happening. Since I pretty much implemented the algorithm verbatim, I was hoping maybe this has been solved already? If you’ve had this issue and know what’s going on, I’d appreciate the heads up. :)

Thanks.