Jump to content


Advanced Rasterization


187 replies to this topic

#61 Vertexmania

    New Member

  • Members
  • Pip
  • 4 posts

Posted 15 April 2006 - 08:56 PM

Nice algorithm, but how do you handle visible surface determination?
How can I calculate the z-value of every pixel and compare it to the z-buffer?

Sometimes when a triangle is getting clipped it can became a more complex polygon like a quad. How do you handle this?

#62 Mihail121

    Senior Member

  • Members
  • PipPipPipPip
  • 1059 posts

Posted 15 April 2006 - 10:22 PM

Vertexmania said:

Sometimes when a triangle is getting clipped it can became a more complex polygon like a quad. How do you handle this?

Simple, my friend! There shouldn't be any problem at all to implement the standard rasterization routine (scanline) and the one discussed here (half-space), then you would use the new algorithm only with those primitives, that are actually triangles after the clipping phase - the scanline-rasterizer can handle the rest.

Another way to do it would be to be brake the clipped polygon in triangles - a straightforward process!

Third possibility is to consider extending the half-space rasterization function to polygons with more than 3 vertices.

Hope that helps, cheers!

#63 Madis731

    New Member

  • Members
  • Pip
  • 8 posts

Posted 29 March 2007 - 02:18 PM

Hi, I turned this code (the final one) into assembly and wow was it painful :D
There were some problems making it take advantage of MMX/SSE, but I did it to some extent. Anyway I will first comment your code a bit:

void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)

{

    // 28.4 fixed-point coordinates

    const int Y1 = iround(16.0f * v1.y);

    const int Y2 = iround(16.0f * v2.y);

    const int Y3 = iround(16.0f * v3.y);


    const int X1 = iround(16.0f * v1.x);

    const int X2 = iround(16.0f * v2.x);

    const int X3 = iround(16.0f * v3.x);


    // Deltas

/*

  This was really easy. I fit 6 words into XMMs and could do all the

  subtraction and shifting.

*/

    const int DX12 = X1 - X2;

    const int DX23 = X2 - X3;

    const int DX31 = X3 - X1;


    const int DY12 = Y1 - Y2;

    const int DY23 = Y2 - Y3;

    const int DY31 = Y3 - Y1;


    // Fixed-point deltas

    const int FDX12 = DX12 << 4;

    const int FDX23 = DX23 << 4;

    const int FDX31 = DX31 << 4;


    const int FDY12 = DY12 << 4;

    const int FDY23 = DY23 << 4;

    const int FDY31 = DY31 << 4;


    // Bounding rectangle

/*

  These four went in 2 clock cycles :)

*/

    int minx = (min(X1, X2, X3) + 0xF) >> 4;

    int maxx = (max(X1, X2, X3) + 0xF) >> 4;

    int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;

    int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;


    // Block size, standard 8x8 (must be power of two)

    const int q = 8;

/*

  I found out this to be very interesting dynamically. There were situations

  where 4 or 16 was better. Imagine triangles from 4x4px bounding in size

  upto 500x500 in size...

*/

    // Start in corner of 8x8 block

    minx &= ~(q - 1); // This and later is all in 64-bit x86 ASM without SSE

    miny &= ~(q - 1);


    (char*&)colorBuffer += miny * stride;


    // Half-edge constants

    int C1 = DY12 * X1 - DX12 * Y1; // SSE only has instructions for WORDs

    int C2 = DY23 * X2 - DX23 * Y2; // but this needs at least DWORDs and

    int C3 = DY31 * X3 - DX31 * Y3; // doing 6 multiplys with IMUL is faster

    // than converting it to SSE and doing 2*3 MULUDQs with additional sign-

    // checking because UDQ is UNsigned :(


    // Correct for fill convention

    if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;

    if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;

    if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;


    // Loop through blocks

    for(int y = miny; y < maxy; y += q)

    {

        for(int x = minx; x < maxx; x += q)

        {

            // Corners of block

            int x0 = x << 4;

            int x1 = (x + q - 1) << 4;

            int y0 = y << 4;

            int y1 = (y + q - 1) << 4;


            // Evaluate half-space functions

/*

  This one here is very painful and even in case of a right triangle sitting in

  one half of the bounding box it takes 24 IMULs for every 8x8 pixel region

  empty or not. One or two optimizations are possible, though. You cast these

  bools to ints here, but actually you could just add these results without

  shifting (a00+a10+a01+a11) because you only check whether its all 0 or

  all 1. Then it becomes 0 or 4. Only loss is that you don't know which ones

  where 1 and which ones 0 (may be necessary with masking). The definite

  one optimization is that you don't have to wait until c to check for 0, but

  you can to it after every four bools. Bail out of the loop quickly after a==0

  so you will cut off these painful 16 IMULs, 16 ADDs/SUBs, 8 CMPs, 4 SHLs

  and 4 ORs. That on average will improve throughput by 10-35% depending

  on the triangles.

*/

            bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0;

            bool a10 = C1 + DX12 * y0 - DY12 * x1 > 0;

            bool a01 = C1 + DX12 * y1 - DY12 * x0 > 0;

            bool a11 = C1 + DX12 * y1 - DY12 * x1 > 0;

            int a = (a00 << 0) | (a10 << 1) | (a01 << 2) | (a11 << 3);

    

            bool b00 = C2 + DX23 * y0 - DY23 * x0 > 0;

            bool b10 = C2 + DX23 * y0 - DY23 * x1 > 0;

            bool b01 = C2 + DX23 * y1 - DY23 * x0 > 0;

            bool b11 = C2 + DX23 * y1 - DY23 * x1 > 0;

            int b = (b00 << 0) | (b10 << 1) | (b01 << 2) | (b11 << 3);

    

            bool c00 = C3 + DX31 * y0 - DY31 * x0 > 0;

            bool c10 = C3 + DX31 * y0 - DY31 * x1 > 0;

            bool c01 = C3 + DX31 * y1 - DY31 * x0 > 0;

            bool c11 = C3 + DX31 * y1 - DY31 * x1 > 0;

            int c = (c00 << 0) | (c10 << 1) | (c01 << 2) | (c11 << 3);


            // Skip block when outside an edge

/*

  This one can be cleverly done with

  #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)

  hint: search for "null" at

  http://www.azillionmonkeys.com/qed/asmexample.html

  but I didn't need it at all because of the optimizations above

*/

            if(a == 0x0 || b == 0x0 || c == 0x0) continue;


            unsigned int *buffer = colorBuffer;


            // Accept whole block when totally covered

/*

  By fusing a,b,c together earlier I could make here only one test which is

  if(c==0xFFF) - that is when c=a<<8|b<<4|c

*/

            if(a == 0xF && b == 0xF && c == 0xF)

            {

                for(int iy = 0; iy < q; iy++)

                {

                    for(int ix = x; ix < x + q; ix++)

                    {

                        buffer[ix] = 0x00007F00;<< // Green

/*

  This part here gave a really good boost. I switched to SSE for a second

  here and shuffled the color to all parts of the 128-bit register. It could

  write 16 bytes per clock and with some macro magic, I made it calculate

  from the q how much of these it needed. The inner loop is too tight to use

  it at all so I left it out :)

*/

                    }


                    (char*&)buffer += stride;

                }

            }

            else<< // Partially covered block

            {

                int CY1 = C1 + DX12 * y0 - DY12 * x0;

                int CY2 = C2 + DX23 * y0 - DY23 * x0;

                int CY3 = C3 + DX31 * y0 - DY31 * x0;


                for(int iy = y; iy < y + q; iy++)

                {

                    int CX1 = CY1;

                    int CX2 = CY2;

                    int CX3 = CY3;


                    for(int ix = x; ix < x + q; ix++)

                    {

                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)

                        {

                            buffer[ix] = 0x0000007F;<< // Blue

                        }


                        CX1 -= FDY12;

                        CX2 -= FDY23;

                        CX3 -= FDY31;

                    }


                    CY1 += FDX12;

                    CY2 += FDX23;

                    CY3 += FDX31;


                    (char*&)buffer += stride;

                }

            }

        }


        (char*&)colorBuffer += q * stride;

    }

}


Now I will post the code that is not really ready yet, but already works. I will find some way to put some more SSE into it. The first try was over 400 lines of code, but some SSE magic made it crush to about 340 lines. My aim is a line count less than or equal to yours :D

In ASM syntax comments are ; and // /**/ don't mean anything
I made string replace on the code so try to understand ;)

Engine_Triangle:

/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;

;  Description: Draws a filled triangle.

;

;  In:     xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2

;          rsi = colour

;

;  Out:    All changed

;

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;*/

pshuflw xmm0,xmm0,01001110b //Esimene/tagumine külg

Q=4

PREC=4

    psllw     xmm0,PREC           //xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2

    movdqa    xmm5,xmm0

    pextrw    efx,xmm0,0

    pextrw    eex,xmm0,1

    pextrw    edx,xmm0,2

    pextrw    ecx,xmm0,3

    pextrw    ebx,xmm0,4

    pextrw    eax,xmm0,5

    pshufd    xmm1,xmm0,11010010b //xmm1 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0

    movdqa    xmm2,xmm0

    movdqa    xmm3,xmm1

    pxor      xmm4,xmm4

    punpckhwd xmm0,xmm4

    punpckhwd xmm1,xmm4

    punpcklwd xmm2,xmm4

    punpcklwd xmm3,xmm4

    psubd     xmm0,xmm1           //xmm0 = 00  |  00  | dX12 | dY12

    psubd     xmm2,xmm3           //xmm1 = dX23| dY23 | dX31 | dY31

    movdqa    xmm1,xmm0

    movdqa    xmm3,xmm2

    pslld     xmm1,PREC           //xmm2 = 00  |  00  |fdX12 |fdY12

    pslld     xmm3,PREC           //xmm3 =fdX23|fdY23 |fdX31 |fdY31

    movq      qword[DY12],xmm0

    movdqu    dqword[DY31],xmm2

    movq      qword[FDY12],xmm1

    movdqu    dqword[FDY31],xmm3

                                  //xmm5 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2

    pshufd    xmm6,xmm5,11001001b //xmm6 = 00 | 00 | X2 | Y2 | X0 | Y0 | X1 | Y1

    pshufd    xmm7,xmm5,11010010b //xmm7 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0

    movdqa    xmm8,xmm5

    movdqa    xmm9,xmm6

    pmaxsw    xmm8,xmm6

    pmaxsw    xmm8,xmm7

    pminsw    xmm9,xmm5

    pminsw    xmm9,xmm7

    paddw     xmm8,dqword[xmm_8x15]

    psraw     xmm8,4

    pextrw    ehx,xmm8,1

    pextrw    ejx,xmm8,0

    paddw     xmm9,dqword[xmm_8x15]

    psraw     xmm9,4

    pextrw    egx,xmm9,1

    pextrw    eix,xmm9,0

    // Block size, standard 8x8 (must be power of two)

//    const int q = 8//


    // Start in corner of 8x8 block

//    minx &= ~(q - 1)//

//    miny &= ~(q - 1)//

    and       egx,not(Q-1)

    and       eix,not(Q-1)


    imul      ebp,eix,BUFW*4    //Set up pointer to backbuffer

    add       ebp,backbuffer


    imul      eax,[DY12]        //Constant part of half-edge functions

    imul      ebx,[DX12]

    imul      ecx,[DY23]

    imul      edx,[DX23]

    imul      eex,[DY31]

    imul      efx,[DX31]

    sub       eax,ebx

    sub       ecx,edx

    sub       eex,efx

    mov       [C1],eax

    mov       [C2],ecx

    mov       [C3],eex


    cmp       [DY12],0           //Correct for fill convention

    jl        .incC1

    jne       .noincC1

    cmp       [DX12],0

    jle       .noincC1

  .incC1:

    add       [C1],1

  .noincC1:

    cmp       [DY23],0

    jl        .incC2

    jne       .noincC2

    cmp       [DX23],0

    jle       .noincC2

  .incC2:

    add       [C2],1

  .noincC2:

    cmp       [DY31],0

    jl        .incC3

    jne       .noincC3

    cmp       [DX31],0

    jle       .noincC3

  .incC3:

    add       [C3],1

  .noincC3:


// Loop through blocks

  .forouter:

    push      rgx

  .forinner:

// Corners of block

    mov       eax,egx

    shl       eax,PREC

    mov       ebx,eix

    shl       ebx,PREC

    lea       esi,[egx+Q-1]

    shl       esi,PREC

    lea       edi,[eix+Q-1]

    shl       edi,PREC


// Evaluate half-space functions

    xor       ecx,ecx

    mov       ekx,ebx

    imul      ekx,[DX12]

    mov       elx,eax

    imul      elx,[DY12]

    sub       ekx,elx

    add       ekx,[C1]

    jle       @f

    add       ecx,1

  @@:

    mov       ekx,ebx

    imul      ekx,[DX12]

    mov       elx,esi

    imul      elx,[DY12]

    sub       ekx,elx

    add       ekx,[C1]

    jle       @f

    add       ecx,1

  @@:

    mov       ekx,edi

    imul      ekx,[DX12]

    mov       elx,eax

    imul      elx,[DY12]

    sub       ekx,elx

    add       ekx,[C1]

    jle       @f

    add       ecx,1

  @@:

    mov       ekx,edi

    imul      ekx,[DX12]

    mov       elx,esi

    imul      elx,[DY12]

    sub       ekx,elx

    add       ekx,[C1]

    jle       @f

    add       ecx,1

  @@:

    shl       ecx,4

    jz        .prolog_inner_end


    mov       ekx,ebx

    imul      ekx,[DX12]

    mov       elx,eax

    imul      elx,[DY12]

    sub       ekx,elx

    add       ekx,[C2]

    jle       @f

    add       ecx,1

  @@:

    mov       ekx,ebx

    imul      ekx,[DX23]

    mov       elx,esi

    imul      elx,[DY23]

    sub       ekx,elx

    add       ekx,[C2]

    jle       @f

    add       ecx,1

  @@:

    mov       ekx,edi

    imul      ekx,[DX23]

    mov       elx,eax

    imul      elx,[DY23]

    sub       ekx,elx

    add       ekx,[C2]

    jle       @f

    add       ecx,1

  @@:

    mov       ekx,edi

    imul      ekx,[DX23]

    mov       elx,esi

    imul      elx,[DY23]

    sub       ekx,elx

    add       ekx,[C2]

    jle       @f

    add       ecx,1

  @@:

    shl       ecx,4

    test      ecx,0F0h

    jz        .prolog_inner_end


    mov       ekx,ebx

    imul      ekx,[DX31]

    mov       elx,eax

    imul      elx,[DY31]

    sub       ekx,elx

    add       ekx,[C3]

    jle       @f

    add       ecx,1

  @@:

    mov       ekx,ebx

    imul      ekx,[DX31]

    mov       elx,esi

    imul      elx,[DY31]

    sub       ekx,elx

    add       ekx,[C3]

    jle       @f

    add       ecx,1

  @@:

    mov       ekx,edi

    imul      ekx,[DX31]

    mov       elx,eax

    imul      elx,[DY31]

    sub       ekx,elx

    add       ekx,[C3]

    jle       @f

    add       ecx,1

  @@:

    mov       ekx,edi

    imul      ekx,[DX31]

    mov       elx,esi

    imul      elx,[DY31]

    sub       ekx,elx

    add       ekx,[C3]

    jle       @f

    add       ecx,1

  @@:

    test      ecx,0Fh

    jz        .prolog_inner_end

    //ecx=a shl 8 + b shl 4 + c


    mov       efx,ebp

    cmp       ekx,0444h

    jne       .not_totally_covered

// Totally inside

    mov       ekx,Q

  @@:

    mov       ecx,00FFFFFFh

    movd      xmm0,ecx

    pshufd    xmm0,xmm0,00000000b // broadcast

    times     Q/4 movdqu [efx+egx*4+(%-1)*16],xmm0

    add       efx,BUFW*4

    sub       ekx,1

    ja        @b

    jmp       .prolog_inner_end

  .not_totally_covered:

//            else// <<  Partially covered block

//            {

//                int CY1 = C1 + DX12 * y0 - DY12 * x0//

//                int CY2 = C2 + DX23 * y0 - DY23 * x0//

//                int CY3 = C3 + DX31 * y0 - DY31 * x0//

    mov       ekx,[DX12]

    mov       elx,[DY12]

    imul      ekx,ebx

    imul      elx,eax

    sub       ekx,elx

    add       ekx,[C1]

    mov       [CY1],ekx


    mov       ekx,[DX23]

    mov       elx,[DY23]

    imul      ekx,ebx

    imul      elx,eax

    sub       ekx,elx

    add       ekx,[C2]

    mov       [CY2],ekx


    mov       ekx,[DX31]

    mov       elx,[DY31]

    imul      ekx,ebx

    imul      elx,eax

    sub       ekx,elx

    add       ekx,[C3]

    mov       [CY3],ekx

//                for(int iy = y// iy < y + q// iy++)

//                    int CX1 = CY1//

//                    int CX2 = CY2//

//                    int CX3 = CY3//

    mov       elx,Q

    mov       eax,[CY1]

    mov       ebx,[CY2]

    mov       ecx,[CY3]


    lea       eex,[efx+egx*4]

  .partial_loopy:

    push      rax rbx rcx

//                    for(int ix = x// ix < x + q// ix++)

    xor       ekx,ekx

  .partial_loopx:

//                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)

    test      eax,eax

    jle       @f

    test      ebx,ebx

    jle       @f

    test      ecx,ecx

    jle       @f

//                            buffer[ix] = 0x0000007F//<< // Blue

    mov       dword[eex+ekx*4],0FFFFFFh

  @@:

    sub       eax,[FDY12]

    sub       ebx,[FDY23]

    sub       ecx,[FDY31]

//                    }

    add       ekx,1

    cmp       ekx,Q

    jc        .partial_loopx

    pop       rcx rbx rax

    add       eax,[FDX12]

    add       ebx,[FDX23]

    add       ecx,[FDX31]

//                    (char*&)buffer += stride//

    add       eex,BUFW*4

//                }

    sub       elx,1

    ja        .partial_loopy


  .prolog_inner_end:

    add       egx,Q

    cmp       egx,ehx

    jc        .forinner

    pop       rgx


//        (char*&)colorBuffer += q*stride//

    add       ebp,BUFW*4*Q

    add       eix,Q

    cmp       eix,ejx

    jc        .forouter

//    }

    ret


#64 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 29 March 2007 - 10:34 PM

Neat. :yes:

I noticed some opportunity to eliminate jumps. Modern CPUs can suffer greatly from jumps that are badly predictable, so you should try to avoid them if possible, especially short jumps. For example:

if(DY12 < 0) C1++;
Could be implemented as:

lea ebx, [eax+1]   ; C1 in eax

cmp DY12

cmovl eax, ebx
Lots of other tricks exist to convert apparently conditional code into fast arithmetic code...

#65 Madis731

    New Member

  • Members
  • Pip
  • 8 posts

Posted 30 March 2007 - 07:20 AM

Sorry, I didn't see it at first :$
My logic failed me at first, but I think its okay now - tested and benchmarked:
    mov       ekx,[C1]
    lea       elx,[ekx+1]
    cmp       [DY12],0
    cmovl     ekx,elx //jl        .incC1   --   in comments, its the old code
    cmovne    elx,[C1] //jne       .noincC1  --  that got replaced
    cmp       [DX12],0
    cmovle    ekx,elx //jle       .noincC1
//  .incC1:
//    add       [C1],1
//  .noincC1:
    mov       [C1],ekx 
The code size didn't reduce in size but it remained the same while pickin' up
the pace ;) about 70M clocks/frame maximum is now 60M clocks/frame. And
this is only in the emulator and this is just the OUTER loop that we optimized.
A lot more speed could be cranked out of this with some thinking.
Maybe:
bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0;
into
bool a00 = DX12 * y0 - DY12 * x0 > -C1; // Only faster in SSE?
Gentlemen - put your thinking caps on! :D

...and I think the general rule is: "Avoid short jumps if the code from the jump
to the destination takes less clocks than the penalty of the jump"

#66 Nico

    New Member

  • Members
  • Pip
  • 4 posts

Posted 30 March 2007 - 12:28 PM

This kind of tiled traversal using halfspace functions is pretty common in hardware, and you can possibly use a simple optimisation that we use there to get rid of the multiple distance function evaluations used in the 'block inside triangle' test.

In order to determine if a box is entierly on one side of a halfplane it is not necessary to determine the distance for all four corner points - it is sufficient to determine the distance to one corner.
But depending on the sign of your halfplane normal a different corner has to be used. This is a win because you are going to test many boxes against the same halfplane.

I don't have an illustration for this, but it should be pretty obvious that for any halfplane there is always one corner of a box that will have a minimal distance to the plane - regardless of the relative positioning or size.

Personally I like to express the change of reference corner points as an offset to the top left corner distance, calculated during triangle setup from the edge normals :


// find box test offsets for each edge

int eo1 = 0;

if (FDY12 < 0) eo1 -= FDY12 << BLOCKSIZE;

if (FDX12 > 0) eo1 += FDX12 << BLOCKSIZE;


int eo2 = 0;

if (FDY23 < 0) eo2 -= FDY23 << BLOCKSIZE;

if (FDX23 > 0) eo2 += FDX23 << BLOCKSIZE;


int eo3 = 0;

if (FDY31 < 0) eo3 -= FDY31 << BLOCKSIZE;

if (FDX31 > 0) eo3 += FDX31 << BLOCKSIZE;

Using these offset values (one per edge) the test for 'block is fully outside triangle' becomes essentially the same as the simple 'pixel is fully inside triangle' test.


// test if a single pixel is inside the triangle

if (CX1 > 0 && CX2 > 0 && CX3 > 0) ...


// test if a block of n*n pixels with top left corner at current position is inside

if ((CX1+eo1) > 0 && (CX2+eo2) > 0 && (CX3+eo3) > 0) ...

So just three add's instead of 21 muls and whatnot ;)

Note that checking for exclusion and checking for inclusion requires different offsets (one can be easily derived from the other) but in hardware we usually differentiate into covered and not-covered only.

#67 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 30 March 2007 - 01:32 PM

From a quick view it seems like you're using global variables. While this is not that bad, the codesize grows quite a bit (the address operand field will need at least 4 bytes).

If you can free up one register (I know - good joke) you could put your locals into a temporary structure and address from there. This has two benefits:

  • Your code will become threadsave.
  • Your code size will be shrink, and this will reduce the instruction decoding time. That does not make a difference if your code takes longer to execute than to fetch, but once you're code is tight enough, instruction decoding time will be factor to care about.

Oh - and by all means, get rid of the branches. If it takes 5 or 6 instructions to do something without a branch it's a saving. You have lots of branches that are placed near to each other, and this will confuse the branch prediction logic. I don't have the exact numbers anymore, but it can only identify branches that are several instructions apart. I think it was something around 64 or 32 bytes or so.

If you have one branch that branches likely, and one that branches branches unlikely near to each other, you'll get a costly branch missprediction for each branch.

Btw, hi nico *g*
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#68 Madis731

    New Member

  • Members
  • Pip
  • 8 posts

Posted 30 March 2007 - 09:47 PM

I feel ashamed :$ optimizing guru as I am, I posted you unoptimized code. What was I thinking? :)

Anyway, I'm on my way to better and better code - like you, I started discovering some clever and some not very clever spots to crush.

Just a side-note here, I'm aiming at Core 2 here (that is what I'm writing and testing on) and I've got all my clock-knowledge from Agner Fog's manuals and
picked up a few tricks from http://azillionmonke.../qed/tech.shtml

The weirdest thing is that I didn't discover the possibilities in "Evaluate half-space functions" until today, when I looked at the lines and saw that
actually we need 12 multiplies instead of 24. I must be blind :P Anyway, that
translated nicely to SSE, the saddest fact being it doesn't support signed
DWORDs. If I will get over that tiny problem, I will have a whole different
code to show you.

@Nils: I want to argue a bit if I may. I agree with you on the thread safety,
but there are other ways around it if I really don't like locals :) and stack is
really becoming a problem under the 64-bit CPUs because you can't put there
anything else besides 64-bit registers and accessing 32-bit or 16-bit data
from this 8-byte aligned stack is really too much additional SSE. I mean all
the packing and shuffling. Of course like with other things, there are ways
around it.

Another reason why I like globals so much is that they are cache-friendly :)
Really! They never move ;) and I can fit like 8 DWORDs in one cache-line.

About the branches - I'm still sorry about posting this code - I will edit this
immediately not to confuse anyone again :$ Three posts back I copied the
replacement code for the nasty branches and I don't have them anymore.

Here it is (still without full SSE handling) with only 300 lines of code:
Engine_Triangle:
/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
;  Description: Draws a filled triangle.
;
;  In:     xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
;          rsi = colour
;
;  Out:    All changed
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;*/
    pshuflw xmm0,xmm0,01001110b //Switch front/back
Q = 4
PREC = 4
//    psllw     xmm0,PREC           //xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
    movdqa    xmm5,xmm0
    pextrw    efx,xmm0,0
    pextrw    eex,xmm0,1
    pextrw    edx,xmm0,2
    pextrw    ecx,xmm0,3
    pextrw    ebx,xmm0,4
    pextrw    eax,xmm0,5
    pshufd    xmm1,xmm0,11010010b //xmm1 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0
    movdqa    xmm2,xmm0
    movdqa    xmm3,xmm1
    pxor      xmm4,xmm4
    punpckhwd xmm0,xmm4
    punpckhwd xmm1,xmm4
    punpcklwd xmm2,xmm4
    punpcklwd xmm3,xmm4
    psubd     xmm0,xmm1           //xmm0 = 00  |  00  | dX12 | dY12
    psubd     xmm2,xmm3           //xmm1 = dX23| dY23 | dX31 | dY31
    movdqa    xmm1,xmm0
    movdqa    xmm3,xmm2
    pslld     xmm1,PREC           //xmm2 = 00  |  00  |fdX12 |fdY12
    pslld     xmm3,PREC           //xmm3 =fdX23|fdY23 |fdX31 |fdY31
    movq      qword[DY12],xmm0
    movdqu    dqword[DY31],xmm2
    movq      qword[FDY12],xmm1
    movdqu    dqword[FDY31],xmm3
                                  //xmm5 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
    pshufd    xmm6,xmm5,11001001b //xmm6 = 00 | 00 | X2 | Y2 | X0 | Y0 | X1 | Y1
    pshufd    xmm7,xmm5,11010010b //xmm7 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0
    movdqa    xmm8,xmm5
    movdqa    xmm9,xmm6
    pmaxsw    xmm8,xmm6
    pmaxsw    xmm8,xmm7
    pminsw    xmm9,xmm5
    pminsw    xmm9,xmm7
    paddw     xmm8,dqword[xmm_8x15]
    psraw     xmm8,4
    pextrw    ehx,xmm8,1
    pextrw    ejx,xmm8,0
    paddw     xmm9,dqword[xmm_8x15]
    psraw     xmm9,4
    pextrw    egx,xmm9,1
    pextrw    eix,xmm9,0
    // Block size, standard 8x8 (must be power of two)
//    const int q = 8//

    // Start in corner of 8x8 block
//    minx &= ~(q - 1)//
//    miny &= ~(q - 1)//
    and       egx,not(Q-1)
    and       eix,not(Q-1)

    imul      ebp,eix,BUFW*4    //Set up pointer to backbuffer
    add       ebp,backbuffer

    imul      eax,[DY12]        //Constant part of half-edge functions
    imul      ebx,[DX12]
    imul      ecx,[DY23]
    imul      edx,[DX23]
    imul      eex,[DY31]
    imul      efx,[DX31]
    sub       eax,ebx
    sub       ecx,edx
    sub       eex,efx
    mov       [C1],eax
    mov       [C2],ecx
    mov       [C3],eex

    mov       ekx,[C1]          //Correct for fill convention
    lea       elx,[ekx+1]
    cmp       [DY12],0
    cmovl     ekx,elx
    cmovne    elx,[C1]
    cmp       [DX12],0
    cmovle    ekx,elx
    mov       [C1],ekx

    mov       ekx,[C2]
    lea       elx,[ekx+1]
    cmp       [DY23],0
    cmovl     ekx,elx
    cmovne    elx,[C2]
    cmp       [DX23],0
    cmovle    ekx,elx
    mov       [C2],ekx

    mov       ekx,[C3]
    lea       elx,[ekx+1]
    cmp       [DY31],0
    cmovl     ekx,elx
    cmovne    elx,[C3]
    cmp       [DX31],0
    cmovle    ekx,elx
    mov       [C3],ekx

// Loop through blocks
  .forouter:
    push      rgx
  .forinner:
// Corners of block
    mov       eax,egx
    shl       eax,PREC
    mov       ebx,eix
    shl       ebx,PREC
    lea       esi,[egx+Q-1]
    shl       esi,PREC
    lea       edi,[eix+Q-1]
    shl       edi,PREC

// Evaluate half-space functions

    mov       ekx,ebx
    shl       rkx,32
    add       rkx,rax
    mov       elx,edi
    shl       rlx,32
    add       rlx,rsi

    movq      xmm0,rkx              // xmm0 = 00 | 00 | x0 | y0
    movq      xmm1,rlx              // xmm1 = 00 | 00 | x1 | y1
   punpcklqdq xmm0,xmm1             // xmm0 = y1 | x1 | y0 | x0
    movd      xmm1,[C1]
    pshufd    xmm1,xmm1,00000000b   // xmm1 = C1 | C1 | C1 | C1
    movq      xmm4,qword[DY12]
    pshufd    xmm4,xmm4,01000100b   // xmm4 = DX12 | DY12 | DX12 | DY12
    pshufd    xmm5,xmm4,10110001b   // xmm5 = DY12 | DX12 | DY12 | DX12
    pmuludq   xmm4,xmm0             // xmm4 = OF | DY12*x1 | OF | DY12*x0
    pshufd    xmm4,xmm4,10001000b   // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0
    pshufd    xmm0,xmm0,10110001b   // xmm0 = x1 | y1 | x0 | y0
    pmuludq   xmm5,xmm0             // xmm5 = OF | DX12*y1 | OF | DX12*y0
    pshufd    xmm5,xmm5,10100000b   // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0
    psubd     xmm5,xmm4             // xmm5 = DX12*(y0...1)-DY12*(x0...1)
    paddd     xmm5,xmm1
    pcmpgtd   xmm5,dqword[xmm_zero]
    packssdw  xmm5,xmm5
    packsswb  xmm5,xmm5
    pmovmskb  edx,xmm5
    and       edx,0Fh
    xor       ecx,ecx
    add       ecx,edx
    jz        .prolog_inner_end
    shl       ecx,4

    movd      xmm2,[C2]
    pshufd    xmm2,xmm2,00000000b   // xmm2 = C2 | C2 | C2 | C2
    movq      xmm4,qword[DY23]
    pshufd    xmm4,xmm4,01000100b   // xmm4 = DX12 | DY12 | DX12 | DY12
    pshufd    xmm5,xmm4,10110001b   // xmm5 = DY12 | DX12 | DY12 | DX12
    pshufd    xmm0,xmm0,10110001b   // xmm0 = y1 | x1 | y0 | x0
    pmuludq   xmm4,xmm0             // xmm4 = OF | DY12*x1 | OF | DY12*x0
    pshufd    xmm4,xmm4,10001000b   // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0
    pshufd    xmm0,xmm0,10110001b   // xmm0 = x1 | y1 | x0 | y0
    pmuludq   xmm5,xmm0             // xmm5 = OF | DX12*y1 | OF | DX12*y0
    pshufd    xmm5,xmm5,10100000b   // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0
    psubd     xmm5,xmm4             // xmm5 = DX12*(y0...1)-DY12*(x0...1)
    paddd     xmm5,xmm2
    pcmpgtd   xmm5,dqword[xmm_zero]
    packssdw  xmm5,xmm5
    packsswb  xmm5,xmm5
    pmovmskb  edx,xmm5
    and       edx,0Fh
    add       ecx,edx
    test      ecx,0Fh
    jz        .prolog_inner_end
    shl       ecx,4

    movd      xmm3,[C3]
    pshufd    xmm3,xmm3,00000000b   // xmm3 = C3 | C3 | C3 | C3
    movq      xmm4,qword[DY31]
    pshufd    xmm4,xmm4,01000100b   // xmm4 = DX12 | DY12 | DX12 | DY12
    pshufd    xmm5,xmm4,10110001b   // xmm5 = DY12 | DX12 | DY12 | DX12
    pshufd    xmm0,xmm0,10110001b   // xmm0 = y1 | x1 | y0 | x0
    pmuludq   xmm4,xmm0             // xmm4 = OF | DY12*x1 | OF | DY12*x0
    pshufd    xmm4,xmm4,10001000b   // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0
    pshufd    xmm0,xmm0,10110001b   // xmm0 = x1 | y1 | x0 | y0
    pmuludq   xmm5,xmm0             // xmm5 = OF | DX12*y1 | OF | DX12*y0
    pshufd    xmm5,xmm5,10100000b   // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0
    psubd     xmm5,xmm4             // xmm5 = DX12*(y0...1)-DY12*(x0...1)
    paddd     xmm5,xmm3
    pcmpgtd   xmm5,dqword[xmm_zero]
    packssdw  xmm5,xmm5
    packsswb  xmm5,xmm5
    pmovmskb  edx,xmm5
    and       edx,0Fh
    add       ecx,edx
    test      ecx,0Fh
    jz        .prolog_inner_end

    mov       efx,ebp
    cmp       ecx,0FFFh
    jne       .not_totally_covered
// Totally inside
    mov       ekx,Q
  @@:
    mov       ecx,00FFFFFFh
    movd      xmm0,ecx
    pshufd    xmm0,xmm0,00000000b // broadcast
    times     Q/4 movdqu [efx+egx*4+(%-1)*16],xmm0
    add       efx,BUFW*4
    sub       ekx,1
    ja        @b
    jmp       .prolog_inner_end
  .not_totally_covered:
//            else// <<  Partially covered block
//            {
//                int CY1 = C1 + DX12 * y0 - DY12 * x0//
//                int CY2 = C2 + DX23 * y0 - DY23 * x0//
//                int CY3 = C3 + DX31 * y0 - DY31 * x0//
    mov       ekx,[DX12]
    mov       elx,[DY12]
    imul      ekx,ebx
    imul      elx,eax
    sub       ekx,elx
    add       ekx,[C1]
    mov       [CY1],ekx

    mov       ekx,[DX23]
    mov       elx,[DY23]
    imul      ekx,ebx
    imul      elx,eax
    sub       ekx,elx
    add       ekx,[C2]
    mov       [CY2],ekx

    mov       ekx,[DX31]
    mov       elx,[DY31]
    imul      ekx,ebx
    imul      elx,eax
    sub       ekx,elx
    add       ekx,[C3]
    mov       [CY3],ekx
//                for(int iy = y// iy < y + q// iy++)
//                    int CX1 = CY1//
//                    int CX2 = CY2//
//                    int CX3 = CY3//
    mov       elx,Q
    mov       eax,[CY1]
    mov       ebx,[CY2]
    mov       ecx,[CY3]

    lea       eex,[efx+egx*4]
  .partial_loopy:
    push      rax rbx rcx
//                    for(int ix = x// ix < x + q// ix++)
    xor       ekx,ekx
  .partial_loopx:
//                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)
    test      eax,eax
    jle       @f
    test      ebx,ebx
    jle       @f
    test      ecx,ecx
    jle       @f
//                            buffer[ix] = 0x0000007F//<< // Blue
    mov       dword[eex+ekx*4],0FFFFFFh
  @@:
    sub       eax,[FDY12]
    sub       ebx,[FDY23]
    sub       ecx,[FDY31]
//                    }
    add       ekx,1
    cmp       ekx,Q
    jc        .partial_loopx
    pop       rcx rbx rax
    add       eax,[FDX12]
    add       ebx,[FDX23]
    add       ecx,[FDX31]
//                    (char*&)buffer += stride//
    add       eex,BUFW*4
//                }
    sub       elx,1
    ja        .partial_loopy

  .prolog_inner_end:
    add       egx,Q
    cmp       egx,ehx
    jc        .forinner
    pop       rgx

//        (char*&)colorBuffer += q*stride//
    add       ebp,BUFW*4*Q
    add       eix,Q
    cmp       eix,ejx
    jc        .forouter
//    }
    ret


#69 rarefluid

    New Member

  • Members
  • PipPip
  • 14 posts

Posted 11 June 2007 - 03:08 PM

Maybe I'm missing something here but the halfspace function should be linear per edge/line. So you could:
- evalutate the halfspace functions by calculating the halfspace-value at the 4 edges of the bounding box
- calculate a per-X and per-Y constant
- add those to the corner points when you step to the next 8x8 cube

there should be no need for the multiplication in the outer loop, which should save 6 MULs and 6 ADD/SUBs per block

And sorry for reanimating the dead ;)

#70 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 11 June 2007 - 04:40 PM

rarefluid said:

Maybe I'm missing something here but the halfspace function should be linear per edge/line.

That's right, they are linear and you get all kinds of cool shortcuts here. If you really want to optimize performance read nicos post. He showed that it's sufficient to just interpolate one edge distance instead of three.
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#71 rarefluid

    New Member

  • Members
  • PipPip
  • 14 posts

Posted 11 June 2007 - 06:31 PM

What I understood was that you only need to check one edge depending on the normal i.e. if block is above and left of a line that has an angle of 45°, check bottom right edge of box... I don't understand how this can be independent of the position of the block...
And you will probably need to do some jumps that might slow things down again.

#72 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 11 June 2007 - 08:51 PM

I've analysed Nico's suggestion, and I think this picture tells the whole story: boundary.

The colored triangle represents the pixels we want to fill. The squares are coverage blocks (the extreme cases). The fat lines are the boundaries we need to test the top-left corner of a coverage block with. If the top-left corner is inside the boundary, the block covers part of the triangle.

Note that the boundary is not a triangle, but that's actually not a problem. We'll only check its boundary box for covered blocks anyway. So only the three edges of the boundary that are parallel to the triangle's edges need to be computed.

That's only a one-time cost, so block coverage is practically as fast as point coverage! :worthy:

Here's the equivalent picture for testing whether a block is fully covered by only looking at the top-left corner: boundary2.

rarefluid said:

And you will probably need to do some jumps that might slow things down again.
You can eliminate jumps with conditional moves, or using AND and OR operations.

#73 rarefluid

    New Member

  • Members
  • PipPip
  • 14 posts

Posted 11 June 2007 - 10:00 PM

I think I can grasp the idea now...
But I don't really understand how this can yield speedups when NOT exploiting parallelism, in contrast to a conventional scanline render, 'cause it actually necessarly processes more pixels. There is no sorting needed though.

...Sorry. I read page 1 again and found some hints... :$

Couldn't one do a bresenham-like movement of the blocks to save some tests? Move block down and onto triangle boundary. Or am I completely wrong here? Why not combine the conventional scanline and parallel aproach?
...maybe I should implement this and stop bothering you guys... ;)

#74 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 11 June 2007 - 10:48 PM

rarefluid said:

Couldn't one do a bresenham-like movement of the blocks to save some tests? Move block down and onto triangle boundary. Or am I completely wrong here? Why not combine the conventional scanline and parallel aproach?
...maybe I should implement this and stop bothering you guys... ;)

No. You aren't wrong. You can as well use a simple flood fill style algorithm.

Once you've found a block that covers at least a part of the triangle (the middle one might be a good choice) fill into all adjecting, covered blocks. This will be very fast. The only drawback is that you need some kind of stack space to store your state, but that's usually not a problem.

On hardware designs you avoid such things, but in a software implementation it is no big deal.

Btw, during my latest work on rasterization I found out that on modern machines you get a nice performance gain from first scanning geometry and generating block/trapezoid/scanline data for your polygons into a dedicated memory buffer. Once the buffer is full render it. This is up to twice as fast compared to the "render the block immediately" style.

I think the effects of the cache makes a huge difference here.
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#75 powers

    New Member

  • Members
  • Pip
  • 1 posts

Posted 22 June 2007 - 03:35 PM

Many thanks for Nick's good job.

It's my first time to learn how rasterizer do, so I have some questions.
1. Why we use 28.4 fixed-point coordinates but not other fixed-point coordinates, such as 24.8 fixed-point coordinates?
2. I don't know why we have to add 0xF to min(X1, X2, X3) and max(X1, X2, X3)? Additionally, why we do the bitwise operations, when we compute fixed-point deltas and bounding rectangle?
3. I also don't know how to get corners of block.
4. What is "stride" means in this job?
5. Finally, how to explain the "block"? It's just like a rectangle?
Thanks a lot!!!

#76 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 23 June 2007 - 09:48 AM

powers said:

Why we use 28.4 fixed-point coordinates but not other fixed-point coordinates, such as 24.8 fixed-point coordinates?
The numbers are multiplied, so a 24.8 format would result in a 48.16 format. But that doesn't fit within a 32-bit value and would be chopped to a 16.16 format, where the upper 16 bits correspond to only 8 bits before the multiplication. So you'd only have a resolution of 256 pixels.

The 28.4 format gives us 12 bit resolution, or 4096 pixels. 4 bits of sub-pixel precision is also plenty in practice. So it's the best compromise. If you really need a higher resolution, anti-aliasing or more sub-pixel precision, you'll have to use 64-bit numbers or another approach.

Quote

I don't know why we have to add 0xF to min(X1, X2, X3) and max(X1, X2, X3)?
Adding 0xF and shifting right 4 bits is the equivalent of a ceil() function for the 28.4 fixed-point format. The first pixel to be filled is the one stricly to the right of the first edge, so that's why a ceil() operation is used (the fill convention for pixels spot on the edge is taken care of a few lines lower).

Quote

Additionally, why we do the bitwise operations, when we compute fixed-point deltas and bounding rectangle?
You mean the shift operations? The bounding rectangle is really in integer coordinates, not in fixed-point coordinates.

Quote

I also don't know how to get corners of block.
They are in fixed-point format again, so that's why there is a shift left by 4 bits. They have to be in fixed-point because testing whether a point is inside or outside an edge requires that precision (also note that the C values are in fixed-point).

Quote

What is "stride" means in this job?
It's the number of bytes between two rows in the color buffer. So it takes you from one scanline to the next.

Quote

Finally, how to explain the "block"? It's just like a rectangle?
It's a square of 8x8 pixels. Sometimes in rasterization it's also called a "tile". The main advantage is that with tiles 64 pixels can be skipped at once when the whole tile is either fully outside or fully inside the triangle. Only tiles that cross an edge require the per-pixel tests.

#77 Optimus

    New Member

  • Members
  • Pip
  • 1 posts

Posted 07 July 2007 - 10:13 PM

Interesting approach! I like it when I read something completely new and different on 3d graphics. Usually, there are several tutorial out there on the subject but all explain the same old classic algorithm via the predictable way you've read and coded yourself before. Now that's something unique! And I am still wondering if it works good and fast enough (I have to try first ;)

#78 Madis731

    New Member

  • Members
  • Pip
  • 8 posts

Posted 09 July 2007 - 03:06 PM

The Q is best left at 8 because then 64 pixels will be rendered fast enough to not notice the slowdown from MULs and DIVs you stumble on on the way there :) Unoptimized will work at the same pace as the scan-conversion, but a bit optimization and you'll have something fast.

I'm trying to find out if there's a way to dynamically change it between 4, 8 and 16 for different triangles. I think it will be too complicated and makes it even slower but lets see.

#79 supagu

    New Member

  • Members
  • PipPip
  • 14 posts

Posted 22 August 2007 - 05:33 AM

Nick said:

This is quite easy. First determine gradients du/dx, dv/dx, du/dy and dv/dy. Then for every pixel you can determine how far it is from the first vertex: dx = x - x0, dy = y - y0 (note that x0 and y0 are floating-point, x and y are integer). Then the texture coordinates are simply: u = u0 + du/dx * dx + du/dy * dy.

Once you got the coordinates for one pixel you obviously only have to add du/dx and/or du/dy to get the texture coordinates for the neighboring pixels. So I do the above calculation once per block, then use this linear interpolation for the other pixels in the block.

can you further explain how this is working?
what is du, dx, etc...

also, how does this compare, performance wise, to scan line rasterisation once you have colours, texture coords etc all in place assuming you dont want to convert it to assembly?

#80 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 22 August 2007 - 06:23 AM

supagu said:

can you further explain how this is working?
what is du, dx, etc...
Triangle Scan Conversion using 2D Homogeneous Coordinates explains an efficient way to compute interpolants (section 4).

Quote

also, how does this compare, performance wise, to scan line rasterisation once you have colours, texture coords etc all in place assuming you dont want to convert it to assembly?
It has advantages and disadvantages. Scanline rasterization only computes endpoints of the scanline, while this method computes coverage for every pixel individually. The advantage is that it translates very well to SIMD assembly. But with SISD code you're better off with scanline rasterization. Another advantage is that you can easily determine coverage of whole blocks of pixels, which is useful for visibility algorithms.

So it really depends on other aspects of your renderer whether this algorithms is to be preferred or not.





2 user(s) are reading this topic

0 members, 2 guests, 0 anonymous users