Advanced Rasterization
#61
Posted 15 April 2006 - 08:56 PM
How can I calculate the z-value of every pixel and compare it to the z-buffer?
Sometimes when a triangle is getting clipped it can became a more complex polygon like a quad. How do you handle this?
#62
Posted 15 April 2006 - 10:22 PM
Vertexmania said:
Simple, my friend! There shouldn't be any problem at all to implement the standard rasterization routine (scanline) and the one discussed here (half-space), then you would use the new algorithm only with those primitives, that are actually triangles after the clipping phase - the scanline-rasterizer can handle the rest.
Another way to do it would be to be brake the clipped polygon in triangles - a straightforward process!
Third possibility is to consider extending the half-space rasterization function to polygons with more than 3 vertices.
Hope that helps, cheers!
#63
Posted 29 March 2007 - 02:18 PM
There were some problems making it take advantage of MMX/SSE, but I did it to some extent. Anyway I will first comment your code a bit:
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
// 28.4 fixed-point coordinates
const int Y1 = iround(16.0f * v1.y);
const int Y2 = iround(16.0f * v2.y);
const int Y3 = iround(16.0f * v3.y);
const int X1 = iround(16.0f * v1.x);
const int X2 = iround(16.0f * v2.x);
const int X3 = iround(16.0f * v3.x);
// Deltas
/*
This was really easy. I fit 6 words into XMMs and could do all the
subtraction and shifting.
*/
const int DX12 = X1 - X2;
const int DX23 = X2 - X3;
const int DX31 = X3 - X1;
const int DY12 = Y1 - Y2;
const int DY23 = Y2 - Y3;
const int DY31 = Y3 - Y1;
// Fixed-point deltas
const int FDX12 = DX12 << 4;
const int FDX23 = DX23 << 4;
const int FDX31 = DX31 << 4;
const int FDY12 = DY12 << 4;
const int FDY23 = DY23 << 4;
const int FDY31 = DY31 << 4;
// Bounding rectangle
/*
These four went in 2 clock cycles :)
*/
int minx = (min(X1, X2, X3) + 0xF) >> 4;
int maxx = (max(X1, X2, X3) + 0xF) >> 4;
int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;
int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;
// Block size, standard 8x8 (must be power of two)
const int q = 8;
/*
I found out this to be very interesting dynamically. There were situations
where 4 or 16 was better. Imagine triangles from 4x4px bounding in size
upto 500x500 in size...
*/
// Start in corner of 8x8 block
minx &= ~(q - 1); // This and later is all in 64-bit x86 ASM without SSE
miny &= ~(q - 1);
(char*&)colorBuffer += miny * stride;
// Half-edge constants
int C1 = DY12 * X1 - DX12 * Y1; // SSE only has instructions for WORDs
int C2 = DY23 * X2 - DX23 * Y2; // but this needs at least DWORDs and
int C3 = DY31 * X3 - DX31 * Y3; // doing 6 multiplys with IMUL is faster
// than converting it to SSE and doing 2*3 MULUDQs with additional sign-
// checking because UDQ is UNsigned :(
// Correct for fill convention
if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;
if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;
if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;
// Loop through blocks
for(int y = miny; y < maxy; y += q)
{
for(int x = minx; x < maxx; x += q)
{
// Corners of block
int x0 = x << 4;
int x1 = (x + q - 1) << 4;
int y0 = y << 4;
int y1 = (y + q - 1) << 4;
// Evaluate half-space functions
/*
This one here is very painful and even in case of a right triangle sitting in
one half of the bounding box it takes 24 IMULs for every 8x8 pixel region
empty or not. One or two optimizations are possible, though. You cast these
bools to ints here, but actually you could just add these results without
shifting (a00+a10+a01+a11) because you only check whether its all 0 or
all 1. Then it becomes 0 or 4. Only loss is that you don't know which ones
where 1 and which ones 0 (may be necessary with masking). The definite
one optimization is that you don't have to wait until c to check for 0, but
you can to it after every four bools. Bail out of the loop quickly after a==0
so you will cut off these painful 16 IMULs, 16 ADDs/SUBs, 8 CMPs, 4 SHLs
and 4 ORs. That on average will improve throughput by 10-35% depending
on the triangles.
*/
bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0;
bool a10 = C1 + DX12 * y0 - DY12 * x1 > 0;
bool a01 = C1 + DX12 * y1 - DY12 * x0 > 0;
bool a11 = C1 + DX12 * y1 - DY12 * x1 > 0;
int a = (a00 << 0) | (a10 << 1) | (a01 << 2) | (a11 << 3);
bool b00 = C2 + DX23 * y0 - DY23 * x0 > 0;
bool b10 = C2 + DX23 * y0 - DY23 * x1 > 0;
bool b01 = C2 + DX23 * y1 - DY23 * x0 > 0;
bool b11 = C2 + DX23 * y1 - DY23 * x1 > 0;
int b = (b00 << 0) | (b10 << 1) | (b01 << 2) | (b11 << 3);
bool c00 = C3 + DX31 * y0 - DY31 * x0 > 0;
bool c10 = C3 + DX31 * y0 - DY31 * x1 > 0;
bool c01 = C3 + DX31 * y1 - DY31 * x0 > 0;
bool c11 = C3 + DX31 * y1 - DY31 * x1 > 0;
int c = (c00 << 0) | (c10 << 1) | (c01 << 2) | (c11 << 3);
// Skip block when outside an edge
/*
This one can be cleverly done with
#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
hint: search for "null" at
http://www.azillionmonkeys.com/qed/asmexample.html
but I didn't need it at all because of the optimizations above
*/
if(a == 0x0 || b == 0x0 || c == 0x0) continue;
unsigned int *buffer = colorBuffer;
// Accept whole block when totally covered
/*
By fusing a,b,c together earlier I could make here only one test which is
if(c==0xFFF) - that is when c=a<<8|b<<4|c
*/
if(a == 0xF && b == 0xF && c == 0xF)
{
for(int iy = 0; iy < q; iy++)
{
for(int ix = x; ix < x + q; ix++)
{
buffer[ix] = 0x00007F00;<< // Green
/*
This part here gave a really good boost. I switched to SSE for a second
here and shuffled the color to all parts of the 128-bit register. It could
write 16 bytes per clock and with some macro magic, I made it calculate
from the q how much of these it needed. The inner loop is too tight to use
it at all so I left it out :)
*/
}
(char*&)buffer += stride;
}
}
else<< // Partially covered block
{
int CY1 = C1 + DX12 * y0 - DY12 * x0;
int CY2 = C2 + DX23 * y0 - DY23 * x0;
int CY3 = C3 + DX31 * y0 - DY31 * x0;
for(int iy = y; iy < y + q; iy++)
{
int CX1 = CY1;
int CX2 = CY2;
int CX3 = CY3;
for(int ix = x; ix < x + q; ix++)
{
if(CX1 > 0 && CX2 > 0 && CX3 > 0)
{
buffer[ix] = 0x0000007F;<< // Blue
}
CX1 -= FDY12;
CX2 -= FDY23;
CX3 -= FDY31;
}
CY1 += FDX12;
CY2 += FDX23;
CY3 += FDX31;
(char*&)buffer += stride;
}
}
}
(char*&)colorBuffer += q * stride;
}
}
Now I will post the code that is not really ready yet, but already works. I will find some way to put some more SSE into it. The first try was over 400 lines of code, but some SSE magic made it crush to about 340 lines. My aim is a line count less than or equal to yours :D
In ASM syntax comments are ; and // /**/ don't mean anything
I made string replace on the code so try to understand ;)
Engine_Triangle:
/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; Description: Draws a filled triangle.
;
; In: xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
; rsi = colour
;
; Out: All changed
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;*/
pshuflw xmm0,xmm0,01001110b //Esimene/tagumine külg
Q=4
PREC=4
psllw xmm0,PREC //xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
movdqa xmm5,xmm0
pextrw efx,xmm0,0
pextrw eex,xmm0,1
pextrw edx,xmm0,2
pextrw ecx,xmm0,3
pextrw ebx,xmm0,4
pextrw eax,xmm0,5
pshufd xmm1,xmm0,11010010b //xmm1 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0
movdqa xmm2,xmm0
movdqa xmm3,xmm1
pxor xmm4,xmm4
punpckhwd xmm0,xmm4
punpckhwd xmm1,xmm4
punpcklwd xmm2,xmm4
punpcklwd xmm3,xmm4
psubd xmm0,xmm1 //xmm0 = 00 | 00 | dX12 | dY12
psubd xmm2,xmm3 //xmm1 = dX23| dY23 | dX31 | dY31
movdqa xmm1,xmm0
movdqa xmm3,xmm2
pslld xmm1,PREC //xmm2 = 00 | 00 |fdX12 |fdY12
pslld xmm3,PREC //xmm3 =fdX23|fdY23 |fdX31 |fdY31
movq qword[DY12],xmm0
movdqu dqword[DY31],xmm2
movq qword[FDY12],xmm1
movdqu dqword[FDY31],xmm3
//xmm5 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
pshufd xmm6,xmm5,11001001b //xmm6 = 00 | 00 | X2 | Y2 | X0 | Y0 | X1 | Y1
pshufd xmm7,xmm5,11010010b //xmm7 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0
movdqa xmm8,xmm5
movdqa xmm9,xmm6
pmaxsw xmm8,xmm6
pmaxsw xmm8,xmm7
pminsw xmm9,xmm5
pminsw xmm9,xmm7
paddw xmm8,dqword[xmm_8x15]
psraw xmm8,4
pextrw ehx,xmm8,1
pextrw ejx,xmm8,0
paddw xmm9,dqword[xmm_8x15]
psraw xmm9,4
pextrw egx,xmm9,1
pextrw eix,xmm9,0
// Block size, standard 8x8 (must be power of two)
// const int q = 8//
// Start in corner of 8x8 block
// minx &= ~(q - 1)//
// miny &= ~(q - 1)//
and egx,not(Q-1)
and eix,not(Q-1)
imul ebp,eix,BUFW*4 //Set up pointer to backbuffer
add ebp,backbuffer
imul eax,[DY12] //Constant part of half-edge functions
imul ebx,[DX12]
imul ecx,[DY23]
imul edx,[DX23]
imul eex,[DY31]
imul efx,[DX31]
sub eax,ebx
sub ecx,edx
sub eex,efx
mov [C1],eax
mov [C2],ecx
mov [C3],eex
cmp [DY12],0 //Correct for fill convention
jl .incC1
jne .noincC1
cmp [DX12],0
jle .noincC1
.incC1:
add [C1],1
.noincC1:
cmp [DY23],0
jl .incC2
jne .noincC2
cmp [DX23],0
jle .noincC2
.incC2:
add [C2],1
.noincC2:
cmp [DY31],0
jl .incC3
jne .noincC3
cmp [DX31],0
jle .noincC3
.incC3:
add [C3],1
.noincC3:
// Loop through blocks
.forouter:
push rgx
.forinner:
// Corners of block
mov eax,egx
shl eax,PREC
mov ebx,eix
shl ebx,PREC
lea esi,[egx+Q-1]
shl esi,PREC
lea edi,[eix+Q-1]
shl edi,PREC
// Evaluate half-space functions
xor ecx,ecx
mov ekx,ebx
imul ekx,[DX12]
mov elx,eax
imul elx,[DY12]
sub ekx,elx
add ekx,[C1]
jle @f
add ecx,1
@@:
mov ekx,ebx
imul ekx,[DX12]
mov elx,esi
imul elx,[DY12]
sub ekx,elx
add ekx,[C1]
jle @f
add ecx,1
@@:
mov ekx,edi
imul ekx,[DX12]
mov elx,eax
imul elx,[DY12]
sub ekx,elx
add ekx,[C1]
jle @f
add ecx,1
@@:
mov ekx,edi
imul ekx,[DX12]
mov elx,esi
imul elx,[DY12]
sub ekx,elx
add ekx,[C1]
jle @f
add ecx,1
@@:
shl ecx,4
jz .prolog_inner_end
mov ekx,ebx
imul ekx,[DX12]
mov elx,eax
imul elx,[DY12]
sub ekx,elx
add ekx,[C2]
jle @f
add ecx,1
@@:
mov ekx,ebx
imul ekx,[DX23]
mov elx,esi
imul elx,[DY23]
sub ekx,elx
add ekx,[C2]
jle @f
add ecx,1
@@:
mov ekx,edi
imul ekx,[DX23]
mov elx,eax
imul elx,[DY23]
sub ekx,elx
add ekx,[C2]
jle @f
add ecx,1
@@:
mov ekx,edi
imul ekx,[DX23]
mov elx,esi
imul elx,[DY23]
sub ekx,elx
add ekx,[C2]
jle @f
add ecx,1
@@:
shl ecx,4
test ecx,0F0h
jz .prolog_inner_end
mov ekx,ebx
imul ekx,[DX31]
mov elx,eax
imul elx,[DY31]
sub ekx,elx
add ekx,[C3]
jle @f
add ecx,1
@@:
mov ekx,ebx
imul ekx,[DX31]
mov elx,esi
imul elx,[DY31]
sub ekx,elx
add ekx,[C3]
jle @f
add ecx,1
@@:
mov ekx,edi
imul ekx,[DX31]
mov elx,eax
imul elx,[DY31]
sub ekx,elx
add ekx,[C3]
jle @f
add ecx,1
@@:
mov ekx,edi
imul ekx,[DX31]
mov elx,esi
imul elx,[DY31]
sub ekx,elx
add ekx,[C3]
jle @f
add ecx,1
@@:
test ecx,0Fh
jz .prolog_inner_end
//ecx=a shl 8 + b shl 4 + c
mov efx,ebp
cmp ekx,0444h
jne .not_totally_covered
// Totally inside
mov ekx,Q
@@:
mov ecx,00FFFFFFh
movd xmm0,ecx
pshufd xmm0,xmm0,00000000b // broadcast
times Q/4 movdqu [efx+egx*4+(%-1)*16],xmm0
add efx,BUFW*4
sub ekx,1
ja @b
jmp .prolog_inner_end
.not_totally_covered:
// else// << Partially covered block
// {
// int CY1 = C1 + DX12 * y0 - DY12 * x0//
// int CY2 = C2 + DX23 * y0 - DY23 * x0//
// int CY3 = C3 + DX31 * y0 - DY31 * x0//
mov ekx,[DX12]
mov elx,[DY12]
imul ekx,ebx
imul elx,eax
sub ekx,elx
add ekx,[C1]
mov [CY1],ekx
mov ekx,[DX23]
mov elx,[DY23]
imul ekx,ebx
imul elx,eax
sub ekx,elx
add ekx,[C2]
mov [CY2],ekx
mov ekx,[DX31]
mov elx,[DY31]
imul ekx,ebx
imul elx,eax
sub ekx,elx
add ekx,[C3]
mov [CY3],ekx
// for(int iy = y// iy < y + q// iy++)
// int CX1 = CY1//
// int CX2 = CY2//
// int CX3 = CY3//
mov elx,Q
mov eax,[CY1]
mov ebx,[CY2]
mov ecx,[CY3]
lea eex,[efx+egx*4]
.partial_loopy:
push rax rbx rcx
// for(int ix = x// ix < x + q// ix++)
xor ekx,ekx
.partial_loopx:
// if(CX1 > 0 && CX2 > 0 && CX3 > 0)
test eax,eax
jle @f
test ebx,ebx
jle @f
test ecx,ecx
jle @f
// buffer[ix] = 0x0000007F//<< // Blue
mov dword[eex+ekx*4],0FFFFFFh
@@:
sub eax,[FDY12]
sub ebx,[FDY23]
sub ecx,[FDY31]
// }
add ekx,1
cmp ekx,Q
jc .partial_loopx
pop rcx rbx rax
add eax,[FDX12]
add ebx,[FDX23]
add ecx,[FDX31]
// (char*&)buffer += stride//
add eex,BUFW*4
// }
sub elx,1
ja .partial_loopy
.prolog_inner_end:
add egx,Q
cmp egx,ehx
jc .forinner
pop rgx
// (char*&)colorBuffer += q*stride//
add ebp,BUFW*4*Q
add eix,Q
cmp eix,ejx
jc .forouter
// }
ret
#64
Posted 29 March 2007 - 10:34 PM
I noticed some opportunity to eliminate jumps. Modern CPUs can suffer greatly from jumps that are badly predictable, so you should try to avoid them if possible, especially short jumps. For example:
if(DY12 < 0) C1++;Could be implemented as:
lea ebx, [eax+1] ; C1 in eax cmp DY12 cmovl eax, ebxLots of other tricks exist to convert apparently conditional code into fast arithmetic code...
#65
Posted 30 March 2007 - 07:20 AM
My logic failed me at first, but I think its okay now - tested and benchmarked:
mov ekx,[C1]
lea elx,[ekx+1]
cmp [DY12],0
cmovl ekx,elx //jl .incC1 -- in comments, its the old code
cmovne elx,[C1] //jne .noincC1 -- that got replaced
cmp [DX12],0
cmovle ekx,elx //jle .noincC1
// .incC1:
// add [C1],1
// .noincC1:
mov [C1],ekx
The code size didn't reduce in size but it remained the same while pickin' upthe pace ;) about 70M clocks/frame maximum is now 60M clocks/frame. And
this is only in the emulator and this is just the OUTER loop that we optimized.
A lot more speed could be cranked out of this with some thinking.
Maybe:
bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0;
into
bool a00 = DX12 * y0 - DY12 * x0 > -C1; // Only faster in SSE?
Gentlemen - put your thinking caps on! :D
...and I think the general rule is: "Avoid short jumps if the code from the jump
to the destination takes less clocks than the penalty of the jump"
#66
Posted 30 March 2007 - 12:28 PM
In order to determine if a box is entierly on one side of a halfplane it is not necessary to determine the distance for all four corner points - it is sufficient to determine the distance to one corner.
But depending on the sign of your halfplane normal a different corner has to be used. This is a win because you are going to test many boxes against the same halfplane.
I don't have an illustration for this, but it should be pretty obvious that for any halfplane there is always one corner of a box that will have a minimal distance to the plane - regardless of the relative positioning or size.
Personally I like to express the change of reference corner points as an offset to the top left corner distance, calculated during triangle setup from the edge normals :
// find box test offsets for each edge int eo1 = 0; if (FDY12 < 0) eo1 -= FDY12 << BLOCKSIZE; if (FDX12 > 0) eo1 += FDX12 << BLOCKSIZE; int eo2 = 0; if (FDY23 < 0) eo2 -= FDY23 << BLOCKSIZE; if (FDX23 > 0) eo2 += FDX23 << BLOCKSIZE; int eo3 = 0; if (FDY31 < 0) eo3 -= FDY31 << BLOCKSIZE; if (FDX31 > 0) eo3 += FDX31 << BLOCKSIZE;
Using these offset values (one per edge) the test for 'block is fully outside triangle' becomes essentially the same as the simple 'pixel is fully inside triangle' test.
// test if a single pixel is inside the triangle if (CX1 > 0 && CX2 > 0 && CX3 > 0) ... // test if a block of n*n pixels with top left corner at current position is inside if ((CX1+eo1) > 0 && (CX2+eo2) > 0 && (CX3+eo3) > 0) ...
So just three add's instead of 21 muls and whatnot ;)
Note that checking for exclusion and checking for inclusion requires different offsets (one can be easily derived from the other) but in hardware we usually differentiate into covered and not-covered only.
#67
Posted 30 March 2007 - 01:32 PM
If you can free up one register (I know - good joke) you could put your locals into a temporary structure and address from there. This has two benefits:
- Your code will become threadsave.
- Your code size will be shrink, and this will reduce the instruction decoding time. That does not make a difference if your code takes longer to execute than to fetch, but once you're code is tight enough, instruction decoding time will be factor to care about.
Oh - and by all means, get rid of the branches. If it takes 5 or 6 instructions to do something without a branch it's a saving. You have lots of branches that are placed near to each other, and this will confuse the branch prediction logic. I don't have the exact numbers anymore, but it can only identify branches that are several instructions apart. I think it was something around 64 or 32 bytes or so.
If you have one branch that branches likely, and one that branches branches unlikely near to each other, you'll get a costly branch missprediction for each branch.
Btw, hi nico *g*
My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
#68
Posted 30 March 2007 - 09:47 PM
Anyway, I'm on my way to better and better code - like you, I started discovering some clever and some not very clever spots to crush.
Just a side-note here, I'm aiming at Core 2 here (that is what I'm writing and testing on) and I've got all my clock-knowledge from Agner Fog's manuals and
picked up a few tricks from http://azillionmonke.../qed/tech.shtml
The weirdest thing is that I didn't discover the possibilities in "Evaluate half-space functions" until today, when I looked at the lines and saw that
actually we need 12 multiplies instead of 24. I must be blind :P Anyway, that
translated nicely to SSE, the saddest fact being it doesn't support signed
DWORDs. If I will get over that tiny problem, I will have a whole different
code to show you.
@Nils: I want to argue a bit if I may. I agree with you on the thread safety,
but there are other ways around it if I really don't like locals :) and stack is
really becoming a problem under the 64-bit CPUs because you can't put there
anything else besides 64-bit registers and accessing 32-bit or 16-bit data
from this 8-byte aligned stack is really too much additional SSE. I mean all
the packing and shuffling. Of course like with other things, there are ways
around it.
Another reason why I like globals so much is that they are cache-friendly :)
Really! They never move ;) and I can fit like 8 DWORDs in one cache-line.
About the branches - I'm still sorry about posting this code - I will edit this
immediately not to confuse anyone again :$ Three posts back I copied the
replacement code for the nasty branches and I don't have them anymore.
Here it is (still without full SSE handling) with only 300 lines of code:
Engine_Triangle:
/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; Description: Draws a filled triangle.
;
; In: xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
; rsi = colour
;
; Out: All changed
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;*/
pshuflw xmm0,xmm0,01001110b //Switch front/back
Q = 4
PREC = 4
// psllw xmm0,PREC //xmm0 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
movdqa xmm5,xmm0
pextrw efx,xmm0,0
pextrw eex,xmm0,1
pextrw edx,xmm0,2
pextrw ecx,xmm0,3
pextrw ebx,xmm0,4
pextrw eax,xmm0,5
pshufd xmm1,xmm0,11010010b //xmm1 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0
movdqa xmm2,xmm0
movdqa xmm3,xmm1
pxor xmm4,xmm4
punpckhwd xmm0,xmm4
punpckhwd xmm1,xmm4
punpcklwd xmm2,xmm4
punpcklwd xmm3,xmm4
psubd xmm0,xmm1 //xmm0 = 00 | 00 | dX12 | dY12
psubd xmm2,xmm3 //xmm1 = dX23| dY23 | dX31 | dY31
movdqa xmm1,xmm0
movdqa xmm3,xmm2
pslld xmm1,PREC //xmm2 = 00 | 00 |fdX12 |fdY12
pslld xmm3,PREC //xmm3 =fdX23|fdY23 |fdX31 |fdY31
movq qword[DY12],xmm0
movdqu dqword[DY31],xmm2
movq qword[FDY12],xmm1
movdqu dqword[FDY31],xmm3
//xmm5 = 00 | 00 | X0 | Y0 | X1 | Y1 | X2 | Y2
pshufd xmm6,xmm5,11001001b //xmm6 = 00 | 00 | X2 | Y2 | X0 | Y0 | X1 | Y1
pshufd xmm7,xmm5,11010010b //xmm7 = 00 | 00 | X1 | Y1 | X2 | Y2 | X0 | Y0
movdqa xmm8,xmm5
movdqa xmm9,xmm6
pmaxsw xmm8,xmm6
pmaxsw xmm8,xmm7
pminsw xmm9,xmm5
pminsw xmm9,xmm7
paddw xmm8,dqword[xmm_8x15]
psraw xmm8,4
pextrw ehx,xmm8,1
pextrw ejx,xmm8,0
paddw xmm9,dqword[xmm_8x15]
psraw xmm9,4
pextrw egx,xmm9,1
pextrw eix,xmm9,0
// Block size, standard 8x8 (must be power of two)
// const int q = 8//
// Start in corner of 8x8 block
// minx &= ~(q - 1)//
// miny &= ~(q - 1)//
and egx,not(Q-1)
and eix,not(Q-1)
imul ebp,eix,BUFW*4 //Set up pointer to backbuffer
add ebp,backbuffer
imul eax,[DY12] //Constant part of half-edge functions
imul ebx,[DX12]
imul ecx,[DY23]
imul edx,[DX23]
imul eex,[DY31]
imul efx,[DX31]
sub eax,ebx
sub ecx,edx
sub eex,efx
mov [C1],eax
mov [C2],ecx
mov [C3],eex
mov ekx,[C1] //Correct for fill convention
lea elx,[ekx+1]
cmp [DY12],0
cmovl ekx,elx
cmovne elx,[C1]
cmp [DX12],0
cmovle ekx,elx
mov [C1],ekx
mov ekx,[C2]
lea elx,[ekx+1]
cmp [DY23],0
cmovl ekx,elx
cmovne elx,[C2]
cmp [DX23],0
cmovle ekx,elx
mov [C2],ekx
mov ekx,[C3]
lea elx,[ekx+1]
cmp [DY31],0
cmovl ekx,elx
cmovne elx,[C3]
cmp [DX31],0
cmovle ekx,elx
mov [C3],ekx
// Loop through blocks
.forouter:
push rgx
.forinner:
// Corners of block
mov eax,egx
shl eax,PREC
mov ebx,eix
shl ebx,PREC
lea esi,[egx+Q-1]
shl esi,PREC
lea edi,[eix+Q-1]
shl edi,PREC
// Evaluate half-space functions
mov ekx,ebx
shl rkx,32
add rkx,rax
mov elx,edi
shl rlx,32
add rlx,rsi
movq xmm0,rkx // xmm0 = 00 | 00 | x0 | y0
movq xmm1,rlx // xmm1 = 00 | 00 | x1 | y1
punpcklqdq xmm0,xmm1 // xmm0 = y1 | x1 | y0 | x0
movd xmm1,[C1]
pshufd xmm1,xmm1,00000000b // xmm1 = C1 | C1 | C1 | C1
movq xmm4,qword[DY12]
pshufd xmm4,xmm4,01000100b // xmm4 = DX12 | DY12 | DX12 | DY12
pshufd xmm5,xmm4,10110001b // xmm5 = DY12 | DX12 | DY12 | DX12
pmuludq xmm4,xmm0 // xmm4 = OF | DY12*x1 | OF | DY12*x0
pshufd xmm4,xmm4,10001000b // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0
pshufd xmm0,xmm0,10110001b // xmm0 = x1 | y1 | x0 | y0
pmuludq xmm5,xmm0 // xmm5 = OF | DX12*y1 | OF | DX12*y0
pshufd xmm5,xmm5,10100000b // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0
psubd xmm5,xmm4 // xmm5 = DX12*(y0...1)-DY12*(x0...1)
paddd xmm5,xmm1
pcmpgtd xmm5,dqword[xmm_zero]
packssdw xmm5,xmm5
packsswb xmm5,xmm5
pmovmskb edx,xmm5
and edx,0Fh
xor ecx,ecx
add ecx,edx
jz .prolog_inner_end
shl ecx,4
movd xmm2,[C2]
pshufd xmm2,xmm2,00000000b // xmm2 = C2 | C2 | C2 | C2
movq xmm4,qword[DY23]
pshufd xmm4,xmm4,01000100b // xmm4 = DX12 | DY12 | DX12 | DY12
pshufd xmm5,xmm4,10110001b // xmm5 = DY12 | DX12 | DY12 | DX12
pshufd xmm0,xmm0,10110001b // xmm0 = y1 | x1 | y0 | x0
pmuludq xmm4,xmm0 // xmm4 = OF | DY12*x1 | OF | DY12*x0
pshufd xmm4,xmm4,10001000b // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0
pshufd xmm0,xmm0,10110001b // xmm0 = x1 | y1 | x0 | y0
pmuludq xmm5,xmm0 // xmm5 = OF | DX12*y1 | OF | DX12*y0
pshufd xmm5,xmm5,10100000b // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0
psubd xmm5,xmm4 // xmm5 = DX12*(y0...1)-DY12*(x0...1)
paddd xmm5,xmm2
pcmpgtd xmm5,dqword[xmm_zero]
packssdw xmm5,xmm5
packsswb xmm5,xmm5
pmovmskb edx,xmm5
and edx,0Fh
add ecx,edx
test ecx,0Fh
jz .prolog_inner_end
shl ecx,4
movd xmm3,[C3]
pshufd xmm3,xmm3,00000000b // xmm3 = C3 | C3 | C3 | C3
movq xmm4,qword[DY31]
pshufd xmm4,xmm4,01000100b // xmm4 = DX12 | DY12 | DX12 | DY12
pshufd xmm5,xmm4,10110001b // xmm5 = DY12 | DX12 | DY12 | DX12
pshufd xmm0,xmm0,10110001b // xmm0 = y1 | x1 | y0 | x0
pmuludq xmm4,xmm0 // xmm4 = OF | DY12*x1 | OF | DY12*x0
pshufd xmm4,xmm4,10001000b // xmm4 = DY12*x1|DY12*x0|DY12*x1|DY12*x0
pshufd xmm0,xmm0,10110001b // xmm0 = x1 | y1 | x0 | y0
pmuludq xmm5,xmm0 // xmm5 = OF | DX12*y1 | OF | DX12*y0
pshufd xmm5,xmm5,10100000b // xmm5 = DX12*y1|DX12*y1|DX12*y0|DX12*y0
psubd xmm5,xmm4 // xmm5 = DX12*(y0...1)-DY12*(x0...1)
paddd xmm5,xmm3
pcmpgtd xmm5,dqword[xmm_zero]
packssdw xmm5,xmm5
packsswb xmm5,xmm5
pmovmskb edx,xmm5
and edx,0Fh
add ecx,edx
test ecx,0Fh
jz .prolog_inner_end
mov efx,ebp
cmp ecx,0FFFh
jne .not_totally_covered
// Totally inside
mov ekx,Q
@@:
mov ecx,00FFFFFFh
movd xmm0,ecx
pshufd xmm0,xmm0,00000000b // broadcast
times Q/4 movdqu [efx+egx*4+(%-1)*16],xmm0
add efx,BUFW*4
sub ekx,1
ja @b
jmp .prolog_inner_end
.not_totally_covered:
// else// << Partially covered block
// {
// int CY1 = C1 + DX12 * y0 - DY12 * x0//
// int CY2 = C2 + DX23 * y0 - DY23 * x0//
// int CY3 = C3 + DX31 * y0 - DY31 * x0//
mov ekx,[DX12]
mov elx,[DY12]
imul ekx,ebx
imul elx,eax
sub ekx,elx
add ekx,[C1]
mov [CY1],ekx
mov ekx,[DX23]
mov elx,[DY23]
imul ekx,ebx
imul elx,eax
sub ekx,elx
add ekx,[C2]
mov [CY2],ekx
mov ekx,[DX31]
mov elx,[DY31]
imul ekx,ebx
imul elx,eax
sub ekx,elx
add ekx,[C3]
mov [CY3],ekx
// for(int iy = y// iy < y + q// iy++)
// int CX1 = CY1//
// int CX2 = CY2//
// int CX3 = CY3//
mov elx,Q
mov eax,[CY1]
mov ebx,[CY2]
mov ecx,[CY3]
lea eex,[efx+egx*4]
.partial_loopy:
push rax rbx rcx
// for(int ix = x// ix < x + q// ix++)
xor ekx,ekx
.partial_loopx:
// if(CX1 > 0 && CX2 > 0 && CX3 > 0)
test eax,eax
jle @f
test ebx,ebx
jle @f
test ecx,ecx
jle @f
// buffer[ix] = 0x0000007F//<< // Blue
mov dword[eex+ekx*4],0FFFFFFh
@@:
sub eax,[FDY12]
sub ebx,[FDY23]
sub ecx,[FDY31]
// }
add ekx,1
cmp ekx,Q
jc .partial_loopx
pop rcx rbx rax
add eax,[FDX12]
add ebx,[FDX23]
add ecx,[FDX31]
// (char*&)buffer += stride//
add eex,BUFW*4
// }
sub elx,1
ja .partial_loopy
.prolog_inner_end:
add egx,Q
cmp egx,ehx
jc .forinner
pop rgx
// (char*&)colorBuffer += q*stride//
add ebp,BUFW*4*Q
add eix,Q
cmp eix,ejx
jc .forouter
// }
ret
#69
Posted 11 June 2007 - 03:08 PM
- evalutate the halfspace functions by calculating the halfspace-value at the 4 edges of the bounding box
- calculate a per-X and per-Y constant
- add those to the corner points when you step to the next 8x8 cube
there should be no need for the multiplication in the outer loop, which should save 6 MULs and 6 ADD/SUBs per block
And sorry for reanimating the dead ;)
#70
Posted 11 June 2007 - 04:40 PM
rarefluid said:
That's right, they are linear and you get all kinds of cool shortcuts here. If you really want to optimize performance read nicos post. He showed that it's sufficient to just interpolate one edge distance instead of three.
My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
#71
Posted 11 June 2007 - 06:31 PM
And you will probably need to do some jumps that might slow things down again.
#72
Posted 11 June 2007 - 08:51 PM
The colored triangle represents the pixels we want to fill. The squares are coverage blocks (the extreme cases). The fat lines are the boundaries we need to test the top-left corner of a coverage block with. If the top-left corner is inside the boundary, the block covers part of the triangle.
Note that the boundary is not a triangle, but that's actually not a problem. We'll only check its boundary box for covered blocks anyway. So only the three edges of the boundary that are parallel to the triangle's edges need to be computed.
That's only a one-time cost, so block coverage is practically as fast as point coverage! :worthy:
Here's the equivalent picture for testing whether a block is fully covered by only looking at the top-left corner: boundary2.
rarefluid said:
#73
Posted 11 June 2007 - 10:00 PM
But I don't really understand how this can yield speedups when NOT exploiting parallelism, in contrast to a conventional scanline render, 'cause it actually necessarly processes more pixels. There is no sorting needed though.
...Sorry. I read page 1 again and found some hints... :$
Couldn't one do a bresenham-like movement of the blocks to save some tests? Move block down and onto triangle boundary. Or am I completely wrong here? Why not combine the conventional scanline and parallel aproach?
...maybe I should implement this and stop bothering you guys... ;)
#74
Posted 11 June 2007 - 10:48 PM
rarefluid said:
...maybe I should implement this and stop bothering you guys... ;)
No. You aren't wrong. You can as well use a simple flood fill style algorithm.
Once you've found a block that covers at least a part of the triangle (the middle one might be a good choice) fill into all adjecting, covered blocks. This will be very fast. The only drawback is that you need some kind of stack space to store your state, but that's usually not a problem.
On hardware designs you avoid such things, but in a software implementation it is no big deal.
Btw, during my latest work on rasterization I found out that on modern machines you get a nice performance gain from first scanning geometry and generating block/trapezoid/scanline data for your polygons into a dedicated memory buffer. Once the buffer is full render it. This is up to twice as fast compared to the "render the block immediately" style.
I think the effects of the cache makes a huge difference here.
My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
#75
Posted 22 June 2007 - 03:35 PM
It's my first time to learn how rasterizer do, so I have some questions.
1. Why we use 28.4 fixed-point coordinates but not other fixed-point coordinates, such as 24.8 fixed-point coordinates?
2. I don't know why we have to add 0xF to min(X1, X2, X3) and max(X1, X2, X3)? Additionally, why we do the bitwise operations, when we compute fixed-point deltas and bounding rectangle?
3. I also don't know how to get corners of block.
4. What is "stride" means in this job?
5. Finally, how to explain the "block"? It's just like a rectangle?
Thanks a lot!!!
#76
Posted 23 June 2007 - 09:48 AM
powers said:
The 28.4 format gives us 12 bit resolution, or 4096 pixels. 4 bits of sub-pixel precision is also plenty in practice. So it's the best compromise. If you really need a higher resolution, anti-aliasing or more sub-pixel precision, you'll have to use 64-bit numbers or another approach.
Quote
Quote
Quote
Quote
Quote
#77
Posted 07 July 2007 - 10:13 PM
#78
Posted 09 July 2007 - 03:06 PM
I'm trying to find out if there's a way to dynamically change it between 4, 8 and 16 for different triangles. I think it will be too complicated and makes it even slower but lets see.
#79
Posted 22 August 2007 - 05:33 AM
Nick said:
Once you got the coordinates for one pixel you obviously only have to add du/dx and/or du/dy to get the texture coordinates for the neighboring pixels. So I do the above calculation once per block, then use this linear interpolation for the other pixels in the block.
can you further explain how this is working?
what is du, dx, etc...
also, how does this compare, performance wise, to scan line rasterisation once you have colours, texture coords etc all in place assuming you dont want to convert it to assembly?
#80
Posted 22 August 2007 - 06:23 AM
supagu said:
what is du, dx, etc...
Quote
So it really depends on other aspects of your renderer whether this algorithms is to be preferred or not.
2 user(s) are reading this topic
0 members, 2 guests, 0 anonymous users











