187 replies to this topic

### #61Vertexmania

New Member

• Members
• 4 posts

Posted 15 April 2006 - 08:56 PM

Nice algorithm, but how do you handle visible surface determination?
How can I calculate the z-value of every pixel and compare it to the z-buffer?

Sometimes when a triangle is getting clipped it can became a more complex polygon like a quad. How do you handle this?

### #62Mihail121

Senior Member

• Members
• 1059 posts

Posted 15 April 2006 - 10:22 PM

Vertexmania said:

Sometimes when a triangle is getting clipped it can became a more complex polygon like a quad. How do you handle this?

Simple, my friend! There shouldn't be any problem at all to implement the standard rasterization routine (scanline) and the one discussed here (half-space), then you would use the new algorithm only with those primitives, that are actually triangles after the clipping phase - the scanline-rasterizer can handle the rest.

Another way to do it would be to be brake the clipped polygon in triangles - a straightforward process!

Third possibility is to consider extending the half-space rasterization function to polygons with more than 3 vertices.

Hope that helps, cheers!

New Member

• Members
• 8 posts

Posted 29 March 2007 - 02:18 PM

Hi, I turned this code (the final one) into assembly and wow was it painful :D
There were some problems making it take advantage of MMX/SSE, but I did it to some extent. Anyway I will first comment your code a bit:

void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)

{

// 28.4 fixed-point coordinates

const int Y1 = iround(16.0f * v1.y);

const int Y2 = iround(16.0f * v2.y);

const int Y3 = iround(16.0f * v3.y);

const int X1 = iround(16.0f * v1.x);

const int X2 = iround(16.0f * v2.x);

const int X3 = iround(16.0f * v3.x);

// Deltas

/*

This was really easy. I fit 6 words into XMMs and could do all the

subtraction and shifting.

*/

const int DX12 = X1 - X2;

const int DX23 = X2 - X3;

const int DX31 = X3 - X1;

const int DY12 = Y1 - Y2;

const int DY23 = Y2 - Y3;

const int DY31 = Y3 - Y1;

// Fixed-point deltas

const int FDX12 = DX12 << 4;

const int FDX23 = DX23 << 4;

const int FDX31 = DX31 << 4;

const int FDY12 = DY12 << 4;

const int FDY23 = DY23 << 4;

const int FDY31 = DY31 << 4;

// Bounding rectangle

/*

These four went in 2 clock cycles :)

*/

int minx = (min(X1, X2, X3) + 0xF) >> 4;

int maxx = (max(X1, X2, X3) + 0xF) >> 4;

int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;

int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;

// Block size, standard 8x8 (must be power of two)

const int q = 8;

/*

I found out this to be very interesting dynamically. There were situations

where 4 or 16 was better. Imagine triangles from 4x4px bounding in size

upto 500x500 in size...

*/

// Start in corner of 8x8 block

minx &= ~(q - 1); // This and later is all in 64-bit x86 ASM without SSE

miny &= ~(q - 1);

(char*&)colorBuffer += miny * stride;

// Half-edge constants

int C1 = DY12 * X1 - DX12 * Y1; // SSE only has instructions for WORDs

int C2 = DY23 * X2 - DX23 * Y2; // but this needs at least DWORDs and

int C3 = DY31 * X3 - DX31 * Y3; // doing 6 multiplys with IMUL is faster

// than converting it to SSE and doing 2*3 MULUDQs with additional sign-

// checking because UDQ is UNsigned :(

// Correct for fill convention

if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;

if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;

if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;

// Loop through blocks

for(int y = miny; y < maxy; y += q)

{

for(int x = minx; x < maxx; x += q)

{

// Corners of block

int x0 = x << 4;

int x1 = (x + q - 1) << 4;

int y0 = y << 4;

int y1 = (y + q - 1) << 4;

// Evaluate half-space functions

/*

This one here is very painful and even in case of a right triangle sitting in

one half of the bounding box it takes 24 IMULs for every 8x8 pixel region

empty or not. One or two optimizations are possible, though. You cast these

bools to ints here, but actually you could just add these results without

shifting (a00+a10+a01+a11) because you only check whether its all 0 or

all 1. Then it becomes 0 or 4. Only loss is that you don't know which ones

where 1 and which ones 0 (may be necessary with masking). The definite

one optimization is that you don't have to wait until c to check for 0, but

you can to it after every four bools. Bail out of the loop quickly after a==0

so you will cut off these painful 16 IMULs, 16 ADDs/SUBs, 8 CMPs, 4 SHLs

and 4 ORs. That on average will improve throughput by 10-35% depending

on the triangles.

*/

bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0;

bool a10 = C1 + DX12 * y0 - DY12 * x1 > 0;

bool a01 = C1 + DX12 * y1 - DY12 * x0 > 0;

bool a11 = C1 + DX12 * y1 - DY12 * x1 > 0;

int a = (a00 << 0) | (a10 << 1) | (a01 << 2) | (a11 << 3);

bool b00 = C2 + DX23 * y0 - DY23 * x0 > 0;

bool b10 = C2 + DX23 * y0 - DY23 * x1 > 0;

bool b01 = C2 + DX23 * y1 - DY23 * x0 > 0;

bool b11 = C2 + DX23 * y1 - DY23 * x1 > 0;

int b = (b00 << 0) | (b10 << 1) | (b01 << 2) | (b11 << 3);

bool c00 = C3 + DX31 * y0 - DY31 * x0 > 0;

bool c10 = C3 + DX31 * y0 - DY31 * x1 > 0;

bool c01 = C3 + DX31 * y1 - DY31 * x0 > 0;

bool c11 = C3 + DX31 * y1 - DY31 * x1 > 0;

int c = (c00 << 0) | (c10 << 1) | (c01 << 2) | (c11 << 3);

// Skip block when outside an edge

/*

This one can be cleverly done with

#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)

hint: search for "null" at

http://www.azillionmonkeys.com/qed/asmexample.html

but I didn't need it at all because of the optimizations above

*/

if(a == 0x0 || b == 0x0 || c == 0x0) continue;

unsigned int *buffer = colorBuffer;

// Accept whole block when totally covered

/*

By fusing a,b,c together earlier I could make here only one test which is

if(c==0xFFF) - that is when c=a<<8|b<<4|c

*/

if(a == 0xF && b == 0xF && c == 0xF)

{

for(int iy = 0; iy < q; iy++)

{

for(int ix = x; ix < x + q; ix++)

{

buffer[ix] = 0x00007F00;<< // Green

/*

This part here gave a really good boost. I switched to SSE for a second

here and shuffled the color to all parts of the 128-bit register. It could

write 16 bytes per clock and with some macro magic, I made it calculate

from the q how much of these it needed. The inner loop is too tight to use

it at all so I left it out :)

*/

}

(char*&)buffer += stride;

}

}

else<< // Partially covered block

{

int CY1 = C1 + DX12 * y0 - DY12 * x0;

int CY2 = C2 + DX23 * y0 - DY23 * x0;

int CY3 = C3 + DX31 * y0 - DY31 * x0;

for(int iy = y; iy < y + q; iy++)

{

int CX1 = CY1;

int CX2 = CY2;

int CX3 = CY3;

for(int ix = x; ix < x + q; ix++)

{

if(CX1 > 0 && CX2 > 0 && CX3 > 0)

{

buffer[ix] = 0x0000007F;<< // Blue

}

CX1 -= FDY12;

CX2 -= FDY23;

CX3 -= FDY31;

}

CY1 += FDX12;

CY2 += FDX23;

CY3 += FDX31;

(char*&)buffer += stride;

}

}

}

(char*&)colorBuffer += q * stride;

}

}



Now I will post the code that is not really ready yet, but already works. I will find some way to put some more SSE into it. The first try was over 400 lines of code, but some SSE magic made it crush to about 340 lines. My aim is a line count less than or equal to yours :D

In ASM syntax comments are ; and // /**/ don't mean anything
I made string replace on the code so try to understand ;)

Engine_Triangle:

/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;

;  Description: Draws a filled triangle.

;

;  In:     xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2

;          rsi = colour

;

;  Out:    All changed

;

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;*/

pshuflw xmm0,xmm0,01001110b //Esimene/tagumine külg

Q=4

PREC=4

psllw     xmm0,PREC           //xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2

movdqa    xmm5,xmm0

pextrw    efx,xmm0,0

pextrw    eex,xmm0,1

pextrw    edx,xmm0,2

pextrw    ecx,xmm0,3

pextrw    ebx,xmm0,4

pextrw    eax,xmm0,5

pshufd    xmm1,xmm0,11010010b //xmm1 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0

movdqa    xmm2,xmm0

movdqa    xmm3,xmm1

pxor      xmm4,xmm4

punpckhwd xmm0,xmm4

punpckhwd xmm1,xmm4

punpcklwd xmm2,xmm4

punpcklwd xmm3,xmm4

psubd     xmm0,xmm1           //xmm0 = 00  |  00  | dX12 | dY12

psubd     xmm2,xmm3           //xmm1 = dX23| dY23 | dX31 | dY31

movdqa    xmm1,xmm0

movdqa    xmm3,xmm2

pslld     xmm1,PREC           //xmm2 = 00  |  00  |fdX12 |fdY12

pslld     xmm3,PREC           //xmm3 =fdX23|fdY23 |fdX31 |fdY31

movq      qword[DY12],xmm0

movdqu    dqword[DY31],xmm2

movq      qword[FDY12],xmm1

movdqu    dqword[FDY31],xmm3

//xmm5 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2

pshufd    xmm6,xmm5,11001001b //xmm6 = 00 | 00 | X2 | Y2 | X0 | Y0 | X1 | Y1

pshufd    xmm7,xmm5,11010010b //xmm7 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0

movdqa    xmm8,xmm5

movdqa    xmm9,xmm6

pmaxsw    xmm8,xmm6

pmaxsw    xmm8,xmm7

pminsw    xmm9,xmm5

pminsw    xmm9,xmm7

psraw     xmm8,4

pextrw    ehx,xmm8,1

pextrw    ejx,xmm8,0

psraw     xmm9,4

pextrw    egx,xmm9,1

pextrw    eix,xmm9,0

// Block size, standard 8x8 (must be power of two)

//    const int q = 8//

// Start in corner of 8x8 block

//    minx &= ~(q - 1)//

//    miny &= ~(q - 1)//

and       egx,not(Q-1)

and       eix,not(Q-1)

imul      ebp,eix,BUFW*4    //Set up pointer to backbuffer

imul      eax,[DY12]        //Constant part of half-edge functions

imul      ebx,[DX12]

imul      ecx,[DY23]

imul      edx,[DX23]

imul      eex,[DY31]

imul      efx,[DX31]

sub       eax,ebx

sub       ecx,edx

sub       eex,efx

mov       [C1],eax

mov       [C2],ecx

mov       [C3],eex

cmp       [DY12],0           //Correct for fill convention

jl        .incC1

jne       .noincC1

cmp       [DX12],0

jle       .noincC1

.incC1:

.noincC1:

cmp       [DY23],0

jl        .incC2

jne       .noincC2

cmp       [DX23],0

jle       .noincC2

.incC2:

.noincC2:

cmp       [DY31],0

jl        .incC3

jne       .noincC3

cmp       [DX31],0

jle       .noincC3

.incC3:

.noincC3:

// Loop through blocks

.forouter:

push      rgx

.forinner:

// Corners of block

mov       eax,egx

shl       eax,PREC

mov       ebx,eix

shl       ebx,PREC

lea       esi,[egx+Q-1]

shl       esi,PREC

lea       edi,[eix+Q-1]

shl       edi,PREC

// Evaluate half-space functions

xor       ecx,ecx

mov       ekx,ebx

imul      ekx,[DX12]

mov       elx,eax

imul      elx,[DY12]

sub       ekx,elx

jle       @f

@@:

mov       ekx,ebx

imul      ekx,[DX12]

mov       elx,esi

imul      elx,[DY12]

sub       ekx,elx

jle       @f

@@:

mov       ekx,edi

imul      ekx,[DX12]

mov       elx,eax

imul      elx,[DY12]

sub       ekx,elx

jle       @f

@@:

mov       ekx,edi

imul      ekx,[DX12]

mov       elx,esi

imul      elx,[DY12]

sub       ekx,elx

jle       @f

@@:

shl       ecx,4

jz        .prolog_inner_end

mov       ekx,ebx

imul      ekx,[DX12]

mov       elx,eax

imul      elx,[DY12]

sub       ekx,elx

jle       @f

@@:

mov       ekx,ebx

imul      ekx,[DX23]

mov       elx,esi

imul      elx,[DY23]

sub       ekx,elx

jle       @f

@@:

mov       ekx,edi

imul      ekx,[DX23]

mov       elx,eax

imul      elx,[DY23]

sub       ekx,elx

jle       @f

@@:

mov       ekx,edi

imul      ekx,[DX23]

mov       elx,esi

imul      elx,[DY23]

sub       ekx,elx

jle       @f

@@:

shl       ecx,4

test      ecx,0F0h

jz        .prolog_inner_end

mov       ekx,ebx

imul      ekx,[DX31]

mov       elx,eax

imul      elx,[DY31]

sub       ekx,elx

jle       @f

@@:

mov       ekx,ebx

imul      ekx,[DX31]

mov       elx,esi

imul      elx,[DY31]

sub       ekx,elx

jle       @f

@@:

mov       ekx,edi

imul      ekx,[DX31]

mov       elx,eax

imul      elx,[DY31]

sub       ekx,elx

jle       @f

@@:

mov       ekx,edi

imul      ekx,[DX31]

mov       elx,esi

imul      elx,[DY31]

sub       ekx,elx

jle       @f

@@:

test      ecx,0Fh

jz        .prolog_inner_end

//ecx=a shl 8 + b shl 4 + c

mov       efx,ebp

cmp       ekx,0444h

jne       .not_totally_covered

// Totally inside

mov       ekx,Q

@@:

mov       ecx,00FFFFFFh

movd      xmm0,ecx

times     Q/4 movdqu [efx+egx*4+(%-1)*16],xmm0

sub       ekx,1

ja        @b

jmp       .prolog_inner_end

.not_totally_covered:

//            else// <<  Partially covered block

//            {

//                int CY1 = C1 + DX12 * y0 - DY12 * x0//

//                int CY2 = C2 + DX23 * y0 - DY23 * x0//

//                int CY3 = C3 + DX31 * y0 - DY31 * x0//

mov       ekx,[DX12]

mov       elx,[DY12]

imul      ekx,ebx

imul      elx,eax

sub       ekx,elx

mov       [CY1],ekx

mov       ekx,[DX23]

mov       elx,[DY23]

imul      ekx,ebx

imul      elx,eax

sub       ekx,elx

mov       [CY2],ekx

mov       ekx,[DX31]

mov       elx,[DY31]

imul      ekx,ebx

imul      elx,eax

sub       ekx,elx

mov       [CY3],ekx

//                for(int iy = y// iy < y + q// iy++)

//                    int CX1 = CY1//

//                    int CX2 = CY2//

//                    int CX3 = CY3//

mov       elx,Q

mov       eax,[CY1]

mov       ebx,[CY2]

mov       ecx,[CY3]

lea       eex,[efx+egx*4]

.partial_loopy:

push      rax rbx rcx

//                    for(int ix = x// ix < x + q// ix++)

xor       ekx,ekx

.partial_loopx:

//                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)

test      eax,eax

jle       @f

test      ebx,ebx

jle       @f

test      ecx,ecx

jle       @f

//                            buffer[ix] = 0x0000007F//<< // Blue

mov       dword[eex+ekx*4],0FFFFFFh

@@:

sub       eax,[FDY12]

sub       ebx,[FDY23]

sub       ecx,[FDY31]

//                    }

cmp       ekx,Q

jc        .partial_loopx

pop       rcx rbx rax

//                    (char*&)buffer += stride//

//                }

sub       elx,1

ja        .partial_loopy

.prolog_inner_end:

cmp       egx,ehx

jc        .forinner

pop       rgx

//        (char*&)colorBuffer += q*stride//

cmp       eix,ejx

jc        .forouter

//    }

ret

### #64Nick

Senior Member

• Members
• 1227 posts

Posted 29 March 2007 - 10:34 PM

Neat. :yes:

I noticed some opportunity to eliminate jumps. Modern CPUs can suffer greatly from jumps that are badly predictable, so you should try to avoid them if possible, especially short jumps. For example:

if(DY12 < 0) C1++;
Could be implemented as:

lea ebx, [eax+1]   ; C1 in eax

cmp DY12

cmovl eax, ebx
Lots of other tricks exist to convert apparently conditional code into fast arithmetic code...

New Member

• Members
• 8 posts

Posted 30 March 2007 - 07:20 AM

Sorry, I didn't see it at first :$My logic failed me at first, but I think its okay now - tested and benchmarked:  mov ekx,[C1] lea elx,[ekx+1] cmp [DY12],0 cmovl ekx,elx //jl .incC1 -- in comments, its the old code cmovne elx,[C1] //jne .noincC1 -- that got replaced cmp [DX12],0 cmovle ekx,elx //jle .noincC1 // .incC1: // add [C1],1 // .noincC1: mov [C1],ekx  The code size didn't reduce in size but it remained the same while pickin' up the pace ;) about 70M clocks/frame maximum is now 60M clocks/frame. And this is only in the emulator and this is just the OUTER loop that we optimized. A lot more speed could be cranked out of this with some thinking. Maybe: bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0; into bool a00 = DX12 * y0 - DY12 * x0 > -C1; // Only faster in SSE? Gentlemen - put your thinking caps on! :D ...and I think the general rule is: "Avoid short jumps if the code from the jump to the destination takes less clocks than the penalty of the jump" ### #66Nico New Member • Members • 4 posts Posted 30 March 2007 - 12:28 PM This kind of tiled traversal using halfspace functions is pretty common in hardware, and you can possibly use a simple optimisation that we use there to get rid of the multiple distance function evaluations used in the 'block inside triangle' test. In order to determine if a box is entierly on one side of a halfplane it is not necessary to determine the distance for all four corner points - it is sufficient to determine the distance to one corner. But depending on the sign of your halfplane normal a different corner has to be used. This is a win because you are going to test many boxes against the same halfplane. I don't have an illustration for this, but it should be pretty obvious that for any halfplane there is always one corner of a box that will have a minimal distance to the plane - regardless of the relative positioning or size. Personally I like to express the change of reference corner points as an offset to the top left corner distance, calculated during triangle setup from the edge normals :  // find box test offsets for each edge int eo1 = 0; if (FDY12 < 0) eo1 -= FDY12 << BLOCKSIZE; if (FDX12 > 0) eo1 += FDX12 << BLOCKSIZE; int eo2 = 0; if (FDY23 < 0) eo2 -= FDY23 << BLOCKSIZE; if (FDX23 > 0) eo2 += FDX23 << BLOCKSIZE; int eo3 = 0; if (FDY31 < 0) eo3 -= FDY31 << BLOCKSIZE; if (FDX31 > 0) eo3 += FDX31 << BLOCKSIZE; Using these offset values (one per edge) the test for 'block is fully outside triangle' becomes essentially the same as the simple 'pixel is fully inside triangle' test.  // test if a single pixel is inside the triangle if (CX1 > 0 && CX2 > 0 && CX3 > 0) ... // test if a block of n*n pixels with top left corner at current position is inside if ((CX1+eo1) > 0 && (CX2+eo2) > 0 && (CX3+eo3) > 0) ... So just three add's instead of 21 muls and whatnot ;) Note that checking for exclusion and checking for inclusion requires different offsets (one can be easily derived from the other) but in hardware we usually differentiate into covered and not-covered only. ### #67Nils Pipenbrinck Senior Member • Members • 597 posts Posted 30 March 2007 - 01:32 PM From a quick view it seems like you're using global variables. While this is not that bad, the codesize grows quite a bit (the address operand field will need at least 4 bytes). If you can free up one register (I know - good joke) you could put your locals into a temporary structure and address from there. This has two benefits: • Your code will become threadsave. • Your code size will be shrink, and this will reduce the instruction decoding time. That does not make a difference if your code takes longer to execute than to fetch, but once you're code is tight enough, instruction decoding time will be factor to care about. Oh - and by all means, get rid of the branches. If it takes 5 or 6 instructions to do something without a branch it's a saving. You have lots of branches that are placed near to each other, and this will confuse the branch prediction logic. I don't have the exact numbers anymore, but it can only identify branches that are several instructions apart. I think it was something around 64 or 32 bytes or so. If you have one branch that branches likely, and one that branches branches unlikely near to each other, you'll get a costly branch missprediction for each branch. Btw, hi nico *g* My music: http://myspace.com/planetarchh <-- my music My stuff: torus.untergrund.net <-- some diy electronic stuff and more. ### #68Madis731 New Member • Members • 8 posts Posted 30 March 2007 - 09:47 PM I feel ashamed :$ optimizing guru as I am, I posted you unoptimized code. What was I thinking? :)

Anyway, I'm on my way to better and better code - like you, I started discovering some clever and some not very clever spots to crush.

Just a side-note here, I'm aiming at Core 2 here (that is what I'm writing and testing on) and I've got all my clock-knowledge from Agner Fog's manuals and
picked up a few tricks from http://azillionmonke.../qed/tech.shtml

The weirdest thing is that I didn't discover the possibilities in "Evaluate half-space functions" until today, when I looked at the lines and saw that
actually we need 12 multiplies instead of 24. I must be blind :P Anyway, that
translated nicely to SSE, the saddest fact being it doesn't support signed
DWORDs. If I will get over that tiny problem, I will have a whole different
code to show you.

@Nils: I want to argue a bit if I may. I agree with you on the thread safety,
but there are other ways around it if I really don't like locals :) and stack is
really becoming a problem under the 64-bit CPUs because you can't put there
anything else besides 64-bit registers and accessing 32-bit or 16-bit data
from this 8-byte aligned stack is really too much additional SSE. I mean all
the packing and shuffling. Of course like with other things, there are ways
around it.

Another reason why I like globals so much is that they are cache-friendly :)
Really! They never move ;) and I can fit like 8 DWORDs in one cache-line.

About the branches - I'm still sorry about posting this code - I will edit this
immediately not to confuse anyone again :$Three posts back I copied the replacement code for the nasty branches and I don't have them anymore. Here it is (still without full SSE handling) with only 300 lines of code: Engine_Triangle: /*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ; ; Description: Draws a filled triangle. ; ; In: xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2 ; rsi = colour ; ; Out: All changed ; ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;*/ pshuflw xmm0,xmm0,01001110b //Switch front/back Q = 4 PREC = 4 // psllw xmm0,PREC //xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2 movdqa xmm5,xmm0 pextrw efx,xmm0,0 pextrw eex,xmm0,1 pextrw edx,xmm0,2 pextrw ecx,xmm0,3 pextrw ebx,xmm0,4 pextrw eax,xmm0,5 pshufd xmm1,xmm0,11010010b //xmm1 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0 movdqa xmm2,xmm0 movdqa xmm3,xmm1 pxor xmm4,xmm4 punpckhwd xmm0,xmm4 punpckhwd xmm1,xmm4 punpcklwd xmm2,xmm4 punpcklwd xmm3,xmm4 psubd xmm0,xmm1 //xmm0 = 00 | 00 | dX12 | dY12 psubd xmm2,xmm3 //xmm1 = dX23| dY23 | dX31 | dY31 movdqa xmm1,xmm0 movdqa xmm3,xmm2 pslld xmm1,PREC //xmm2 = 00 | 00 |fdX12 |fdY12 pslld xmm3,PREC //xmm3 =fdX23|fdY23 |fdX31 |fdY31 movq qword[DY12],xmm0 movdqu dqword[DY31],xmm2 movq qword[FDY12],xmm1 movdqu dqword[FDY31],xmm3 //xmm5 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2 pshufd xmm6,xmm5,11001001b //xmm6 = 00 | 00 | X2 | Y2 | X0 | Y0 | X1 | Y1 pshufd xmm7,xmm5,11010010b //xmm7 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0 movdqa xmm8,xmm5 movdqa xmm9,xmm6 pmaxsw xmm8,xmm6 pmaxsw xmm8,xmm7 pminsw xmm9,xmm5 pminsw xmm9,xmm7 paddw xmm8,dqword[xmm_8x15] psraw xmm8,4 pextrw ehx,xmm8,1 pextrw ejx,xmm8,0 paddw xmm9,dqword[xmm_8x15] psraw xmm9,4 pextrw egx,xmm9,1 pextrw eix,xmm9,0 // Block size, standard 8x8 (must be power of two) // const int q = 8// // Start in corner of 8x8 block // minx &= ~(q - 1)// // miny &= ~(q - 1)// and egx,not(Q-1) and eix,not(Q-1) imul ebp,eix,BUFW*4 //Set up pointer to backbuffer add ebp,backbuffer imul eax,[DY12] //Constant part of half-edge functions imul ebx,[DX12] imul ecx,[DY23] imul edx,[DX23] imul eex,[DY31] imul efx,[DX31] sub eax,ebx sub ecx,edx sub eex,efx mov [C1],eax mov [C2],ecx mov [C3],eex mov ekx,[C1] //Correct for fill convention lea elx,[ekx+1] cmp [DY12],0 cmovl ekx,elx cmovne elx,[C1] cmp [DX12],0 cmovle ekx,elx mov [C1],ekx mov ekx,[C2] lea elx,[ekx+1] cmp [DY23],0 cmovl ekx,elx cmovne elx,[C2] cmp [DX23],0 cmovle ekx,elx mov [C2],ekx mov ekx,[C3] lea elx,[ekx+1] cmp [DY31],0 cmovl ekx,elx cmovne elx,[C3] cmp [DX31],0 cmovle ekx,elx mov [C3],ekx // Loop through blocks .forouter: push rgx .forinner: // Corners of block mov eax,egx shl eax,PREC mov ebx,eix shl ebx,PREC lea esi,[egx+Q-1] shl esi,PREC lea edi,[eix+Q-1] shl edi,PREC // Evaluate half-space functions mov ekx,ebx shl rkx,32 add rkx,rax mov elx,edi shl rlx,32 add rlx,rsi movq xmm0,rkx // xmm0 = 00 | 00 | x0 | y0 movq xmm1,rlx // xmm1 = 00 | 00 | x1 | y1 punpcklqdq xmm0,xmm1 // xmm0 = y1 | x1 | y0 | x0 movd xmm1,[C1] pshufd xmm1,xmm1,00000000b // xmm1 = C1 | C1 | C1 | C1 movq xmm4,qword[DY12] pshufd xmm4,xmm4,01000100b // xmm4 = DX12 | DY12 | DX12 | DY12 pshufd xmm5,xmm4,10110001b // xmm5 = DY12 | DX12 | DY12 | DX12 pmuludq xmm4,xmm0 // xmm4 = OF | DY12*x1 | OF | DY12*x0 pshufd xmm4,xmm4,10001000b // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0 pshufd xmm0,xmm0,10110001b // xmm0 = x1 | y1 | x0 | y0 pmuludq xmm5,xmm0 // xmm5 = OF | DX12*y1 | OF | DX12*y0 pshufd xmm5,xmm5,10100000b // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0 psubd xmm5,xmm4 // xmm5 = DX12*(y0...1)-DY12*(x0...1) paddd xmm5,xmm1 pcmpgtd xmm5,dqword[xmm_zero] packssdw xmm5,xmm5 packsswb xmm5,xmm5 pmovmskb edx,xmm5 and edx,0Fh xor ecx,ecx add ecx,edx jz .prolog_inner_end shl ecx,4 movd xmm2,[C2] pshufd xmm2,xmm2,00000000b // xmm2 = C2 | C2 | C2 | C2 movq xmm4,qword[DY23] pshufd xmm4,xmm4,01000100b // xmm4 = DX12 | DY12 | DX12 | DY12 pshufd xmm5,xmm4,10110001b // xmm5 = DY12 | DX12 | DY12 | DX12 pshufd xmm0,xmm0,10110001b // xmm0 = y1 | x1 | y0 | x0 pmuludq xmm4,xmm0 // xmm4 = OF | DY12*x1 | OF | DY12*x0 pshufd xmm4,xmm4,10001000b // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0 pshufd xmm0,xmm0,10110001b // xmm0 = x1 | y1 | x0 | y0 pmuludq xmm5,xmm0 // xmm5 = OF | DX12*y1 | OF | DX12*y0 pshufd xmm5,xmm5,10100000b // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0 psubd xmm5,xmm4 // xmm5 = DX12*(y0...1)-DY12*(x0...1) paddd xmm5,xmm2 pcmpgtd xmm5,dqword[xmm_zero] packssdw xmm5,xmm5 packsswb xmm5,xmm5 pmovmskb edx,xmm5 and edx,0Fh add ecx,edx test ecx,0Fh jz .prolog_inner_end shl ecx,4 movd xmm3,[C3] pshufd xmm3,xmm3,00000000b // xmm3 = C3 | C3 | C3 | C3 movq xmm4,qword[DY31] pshufd xmm4,xmm4,01000100b // xmm4 = DX12 | DY12 | DX12 | DY12 pshufd xmm5,xmm4,10110001b // xmm5 = DY12 | DX12 | DY12 | DX12 pshufd xmm0,xmm0,10110001b // xmm0 = y1 | x1 | y0 | x0 pmuludq xmm4,xmm0 // xmm4 = OF | DY12*x1 | OF | DY12*x0 pshufd xmm4,xmm4,10001000b // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0 pshufd xmm0,xmm0,10110001b // xmm0 = x1 | y1 | x0 | y0 pmuludq xmm5,xmm0 // xmm5 = OF | DX12*y1 | OF | DX12*y0 pshufd xmm5,xmm5,10100000b // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0 psubd xmm5,xmm4 // xmm5 = DX12*(y0...1)-DY12*(x0...1) paddd xmm5,xmm3 pcmpgtd xmm5,dqword[xmm_zero] packssdw xmm5,xmm5 packsswb xmm5,xmm5 pmovmskb edx,xmm5 and edx,0Fh add ecx,edx test ecx,0Fh jz .prolog_inner_end mov efx,ebp cmp ecx,0FFFh jne .not_totally_covered // Totally inside mov ekx,Q @@: mov ecx,00FFFFFFh movd xmm0,ecx pshufd xmm0,xmm0,00000000b // broadcast times Q/4 movdqu [efx+egx*4+(%-1)*16],xmm0 add efx,BUFW*4 sub ekx,1 ja @b jmp .prolog_inner_end .not_totally_covered: // else// << Partially covered block // { // int CY1 = C1 + DX12 * y0 - DY12 * x0// // int CY2 = C2 + DX23 * y0 - DY23 * x0// // int CY3 = C3 + DX31 * y0 - DY31 * x0// mov ekx,[DX12] mov elx,[DY12] imul ekx,ebx imul elx,eax sub ekx,elx add ekx,[C1] mov [CY1],ekx mov ekx,[DX23] mov elx,[DY23] imul ekx,ebx imul elx,eax sub ekx,elx add ekx,[C2] mov [CY2],ekx mov ekx,[DX31] mov elx,[DY31] imul ekx,ebx imul elx,eax sub ekx,elx add ekx,[C3] mov [CY3],ekx // for(int iy = y// iy < y + q// iy++) // int CX1 = CY1// // int CX2 = CY2// // int CX3 = CY3// mov elx,Q mov eax,[CY1] mov ebx,[CY2] mov ecx,[CY3] lea eex,[efx+egx*4] .partial_loopy: push rax rbx rcx // for(int ix = x// ix < x + q// ix++) xor ekx,ekx .partial_loopx: // if(CX1 > 0 && CX2 > 0 && CX3 > 0) test eax,eax jle @f test ebx,ebx jle @f test ecx,ecx jle @f // buffer[ix] = 0x0000007F//<< // Blue mov dword[eex+ekx*4],0FFFFFFh @@: sub eax,[FDY12] sub ebx,[FDY23] sub ecx,[FDY31] // } add ekx,1 cmp ekx,Q jc .partial_loopx pop rcx rbx rax add eax,[FDX12] add ebx,[FDX23] add ecx,[FDX31] // (char*&)buffer += stride// add eex,BUFW*4 // } sub elx,1 ja .partial_loopy .prolog_inner_end: add egx,Q cmp egx,ehx jc .forinner pop rgx // (char*&)colorBuffer += q*stride// add ebp,BUFW*4*Q add eix,Q cmp eix,ejx jc .forouter // } ret  ### #69rarefluid New Member • Members • 14 posts Posted 11 June 2007 - 03:08 PM Maybe I'm missing something here but the halfspace function should be linear per edge/line. So you could: - evalutate the halfspace functions by calculating the halfspace-value at the 4 edges of the bounding box - calculate a per-X and per-Y constant - add those to the corner points when you step to the next 8x8 cube there should be no need for the multiplication in the outer loop, which should save 6 MULs and 6 ADD/SUBs per block And sorry for reanimating the dead ;) ### #70Nils Pipenbrinck Senior Member • Members • 597 posts Posted 11 June 2007 - 04:40 PM rarefluid said: Maybe I'm missing something here but the halfspace function should be linear per edge/line. That's right, they are linear and you get all kinds of cool shortcuts here. If you really want to optimize performance read nicos post. He showed that it's sufficient to just interpolate one edge distance instead of three. My music: http://myspace.com/planetarchh <-- my music My stuff: torus.untergrund.net <-- some diy electronic stuff and more. ### #71rarefluid New Member • Members • 14 posts Posted 11 June 2007 - 06:31 PM What I understood was that you only need to check one edge depending on the normal i.e. if block is above and left of a line that has an angle of 45°, check bottom right edge of box... I don't understand how this can be independent of the position of the block... And you will probably need to do some jumps that might slow things down again. ### #72Nick Senior Member • Members • 1227 posts • LocationOttawa, Ontario, Canada Posted 11 June 2007 - 08:51 PM I've analysed Nico's suggestion, and I think this picture tells the whole story: boundary. The colored triangle represents the pixels we want to fill. The squares are coverage blocks (the extreme cases). The fat lines are the boundaries we need to test the top-left corner of a coverage block with. If the top-left corner is inside the boundary, the block covers part of the triangle. Note that the boundary is not a triangle, but that's actually not a problem. We'll only check its boundary box for covered blocks anyway. So only the three edges of the boundary that are parallel to the triangle's edges need to be computed. That's only a one-time cost, so block coverage is practically as fast as point coverage! :worthy: Here's the equivalent picture for testing whether a block is fully covered by only looking at the top-left corner: boundary2. rarefluid said: And you will probably need to do some jumps that might slow things down again. You can eliminate jumps with conditional moves, or using AND and OR operations. ### #73rarefluid New Member • Members • 14 posts Posted 11 June 2007 - 10:00 PM I think I can grasp the idea now... But I don't really understand how this can yield speedups when NOT exploiting parallelism, in contrast to a conventional scanline render, 'cause it actually necessarly processes more pixels. There is no sorting needed though. ...Sorry. I read page 1 again and found some hints... :$

Couldn't one do a bresenham-like movement of the blocks to save some tests? Move block down and onto triangle boundary. Or am I completely wrong here? Why not combine the conventional scanline and parallel aproach?
...maybe I should implement this and stop bothering you guys... ;)

### #74Nils Pipenbrinck

Senior Member

• Members
• 597 posts

Posted 11 June 2007 - 10:48 PM

rarefluid said:

Couldn't one do a bresenham-like movement of the blocks to save some tests? Move block down and onto triangle boundary. Or am I completely wrong here? Why not combine the conventional scanline and parallel aproach?
...maybe I should implement this and stop bothering you guys... ;)

No. You aren't wrong. You can as well use a simple flood fill style algorithm.

Once you've found a block that covers at least a part of the triangle (the middle one might be a good choice) fill into all adjecting, covered blocks. This will be very fast. The only drawback is that you need some kind of stack space to store your state, but that's usually not a problem.

On hardware designs you avoid such things, but in a software implementation it is no big deal.

Btw, during my latest work on rasterization I found out that on modern machines you get a nice performance gain from first scanning geometry and generating block/trapezoid/scanline data for your polygons into a dedicated memory buffer. Once the buffer is full render it. This is up to twice as fast compared to the "render the block immediately" style.

I think the effects of the cache makes a huge difference here.
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

### #75powers

New Member

• Members
• 1 posts

Posted 22 June 2007 - 03:35 PM

Many thanks for Nick's good job.

It's my first time to learn how rasterizer do, so I have some questions.
1. Why we use 28.4 fixed-point coordinates but not other fixed-point coordinates, such as 24.8 fixed-point coordinates?
2. I don't know why we have to add 0xF to min(X1, X2, X3) and max(X1, X2, X3)? Additionally, why we do the bitwise operations, when we compute fixed-point deltas and bounding rectangle?
3. I also don't know how to get corners of block.
4. What is "stride" means in this job?
5. Finally, how to explain the "block"? It's just like a rectangle?
Thanks a lot!!!

### #76Nick

Senior Member

• Members
• 1227 posts

Posted 23 June 2007 - 09:48 AM

powers said:

Why we use 28.4 fixed-point coordinates but not other fixed-point coordinates, such as 24.8 fixed-point coordinates?
The numbers are multiplied, so a 24.8 format would result in a 48.16 format. But that doesn't fit within a 32-bit value and would be chopped to a 16.16 format, where the upper 16 bits correspond to only 8 bits before the multiplication. So you'd only have a resolution of 256 pixels.

The 28.4 format gives us 12 bit resolution, or 4096 pixels. 4 bits of sub-pixel precision is also plenty in practice. So it's the best compromise. If you really need a higher resolution, anti-aliasing or more sub-pixel precision, you'll have to use 64-bit numbers or another approach.

Quote

I don't know why we have to add 0xF to min(X1, X2, X3) and max(X1, X2, X3)?
Adding 0xF and shifting right 4 bits is the equivalent of a ceil() function for the 28.4 fixed-point format. The first pixel to be filled is the one stricly to the right of the first edge, so that's why a ceil() operation is used (the fill convention for pixels spot on the edge is taken care of a few lines lower).

Quote

Additionally, why we do the bitwise operations, when we compute fixed-point deltas and bounding rectangle?
You mean the shift operations? The bounding rectangle is really in integer coordinates, not in fixed-point coordinates.

Quote

I also don't know how to get corners of block.
They are in fixed-point format again, so that's why there is a shift left by 4 bits. They have to be in fixed-point because testing whether a point is inside or outside an edge requires that precision (also note that the C values are in fixed-point).

Quote

What is "stride" means in this job?
It's the number of bytes between two rows in the color buffer. So it takes you from one scanline to the next.

Quote

Finally, how to explain the "block"? It's just like a rectangle?
It's a square of 8x8 pixels. Sometimes in rasterization it's also called a "tile". The main advantage is that with tiles 64 pixels can be skipped at once when the whole tile is either fully outside or fully inside the triangle. Only tiles that cross an edge require the per-pixel tests.

### #77Optimus

New Member

• Members
• 1 posts

Posted 07 July 2007 - 10:13 PM

Interesting approach! I like it when I read something completely new and different on 3d graphics. Usually, there are several tutorial out there on the subject but all explain the same old classic algorithm via the predictable way you've read and coded yourself before. Now that's something unique! And I am still wondering if it works good and fast enough (I have to try first

New Member

• Members
• 8 posts

Posted 09 July 2007 - 03:06 PM

The Q is best left at 8 because then 64 pixels will be rendered fast enough to not notice the slowdown from MULs and DIVs you stumble on on the way there :) Unoptimized will work at the same pace as the scan-conversion, but a bit optimization and you'll have something fast.

I'm trying to find out if there's a way to dynamically change it between 4, 8 and 16 for different triangles. I think it will be too complicated and makes it even slower but lets see.

### #79supagu

New Member

• Members
• 14 posts

Posted 22 August 2007 - 05:33 AM

Nick said:

This is quite easy. First determine gradients du/dx, dv/dx, du/dy and dv/dy. Then for every pixel you can determine how far it is from the first vertex: dx = x - x0, dy = y - y0 (note that x0 and y0 are floating-point, x and y are integer). Then the texture coordinates are simply: u = u0 + du/dx * dx + du/dy * dy.

Once you got the coordinates for one pixel you obviously only have to add du/dx and/or du/dy to get the texture coordinates for the neighboring pixels. So I do the above calculation once per block, then use this linear interpolation for the other pixels in the block.

can you further explain how this is working?
what is du, dx, etc...

also, how does this compare, performance wise, to scan line rasterisation once you have colours, texture coords etc all in place assuming you dont want to convert it to assembly?

### #80Nick

Senior Member

• Members
• 1227 posts

Posted 22 August 2007 - 06:23 AM

supagu said:

can you further explain how this is working?
what is du, dx, etc...
Triangle Scan Conversion using 2D Homogeneous Coordinates explains an efficient way to compute interpolants (section 4).

Quote

also, how does this compare, performance wise, to scan line rasterisation once you have colours, texture coords etc all in place assuming you dont want to convert it to assembly?
It has advantages and disadvantages. Scanline rasterization only computes endpoints of the scanline, while this method computes coverage for every pixel individually. The advantage is that it translates very well to SIMD assembly. But with SISD code you're better off with scanline rasterization. Another advantage is that you can easily determine coverage of whole blocks of pixels, which is useful for visibility algorithms.

So it really depends on other aspects of your renderer whether this algorithms is to be preferred or not.

#### 1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users