Jump to content


Advanced Rasterization


187 replies to this topic

#101 sdiverdi

    New Member

  • Members
  • Pip
  • 1 posts

Posted 30 April 2009 - 01:05 AM

Thanks for the great article! I have a couple of questions about the fixed point math from the original post in this thread. I'm trying to understand each of the basic steps here...

const int Y1 = iround(16.0f * v1.y);

This sets up Y1 as a 28.4 fixed point, with nearest rounding from the float value.

const int DX12 = X1 - X2;

Since X1 and X2 are fixed point, doesn't that mean DX12 is also a 28.4 fixed point?

const int FDX12 = DX12 << 4;

In that case, doesn't that mean FDX12 is actually 24.8 fixed point?

int minx = (min(X1, X2, X3) + 0xF) >> 4;

Here, minx is in integer coords, rounded up.

int C1 = DY12 * X1 - DX12 * Y1;

X1, Y1, DX12, and DY12 are all 28.4 fixed point. Doesn't that make C1 a 24.8 fixed point value (multiplying two 28.4s yields a 24.8, and subtraction doesn't change format)?

int CY1 = C1 + DX12 * (miny << 4) - DY12 * (minx << 4);

Okay, C1 is 24.8, DX12 is 28.4, (miny << 4) is 28.4, so after the multiply that becomes 24.8, so CY1 is 24.8.

int CX1 = CY1;

CX1 -= FDY12;

Hmm, so CX1 is 24.8, and FDY12 is 24.8, so this is consistent!

So I guess that all makes sense and is consistent, I just wanted to check because the description as using 28.4 fixed point, and the comments in the code, didn't jive with what the code _actually_ appeared to be doing. Do I understand correctly? Thanks!

#102 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 30 April 2009 - 06:09 AM

Hi sdiverdi, That's totally correct. I agree it's not entirely clear from the article if you don't check it for yourself. I didn't want to clutter the essence of the algorithm with an explanation of the fixed-point math and just provided working code. :happy:

One little addition: Multiplying two 28.4 format numbers actually gives you a 56.8 format result. But since that doesn't fit into 32-bit you only get the 24.8 format. So you lose the upper bits. But actually the only format that is guaranteed to fit in 32-bit after multipliction is 12.4. And this is the reason why, as mentioned in the article, the implementation is limited to a 2048x2048 resolution. The highest bit is the sign bit so you only have 11 bits for integer coordinates. But that can be extended in various ways...

#103 Madis731

    New Member

  • Members
  • Pip
  • 8 posts

Posted 07 May 2009 - 11:47 AM

Just wanted to let you know: the SSE4.2 PCMPxSTRx and BLENDVxx!
If anyone is able, I am willing to race you to the solution. I have some not-very-optimal solutions in my head and I will start putting them on paper soon...

Maybe there are far more clever solutions using other instructions ;) but maybe I'm just not that good...

EDIT: I just realized that MASKMOVDQU isn't that bad afterall. BLENDVxx+MOVxxx will warm up the memory before writing, but its still slower and has a side-effect of acessing not needed memory. PCMPxSTRc would need PCKxxxx and UNPCKxxxx instructions, which make it slower and we will be better off with just PCMPGTD+MASKMOVDQU. Back to the drawing-boards...

My first try:
/*
 xmm_1_2_3_4 is 16-byte-aligned 4-packed-dword integer
 FDYxx and CXx are defined like this:
  align 16
    FDYY12 dd ?
    FDYY23 dd ?
    FDYY31 dd ?
  align 16
    CX1    dd ?
    CX2    dd ?
    CX3    dd ?
*/

//   37 instr. / 35,637 clocks CPI=0.96 :)
//   8 pixels / 4,45 clk/px
    mov         eax,0FFh ;blue
    mov         edi,buffer
    movdqa      xmm0,[xmm_1_2_3_4]
    movd        xmm10,eax
    pshufd      xmm1,[FDY12],00000000b
    pshufd      xmm2,[FDY12],01010101b ;virtual FDY23
    pshufd      xmm3,[FDY12],10101010b ;virtual FDY31
    pmulld      xmm1,xmm0
    pmulld      xmm2,xmm0
    pmulld      xmm3,xmm0
    pxor        xmm0,xmm0
    pshufd      xmm10,xmm10,0
    pshufd      xmm4,[CX1],00000000b
    pshufd      xmm5,[CX1],01010101b ;virtual CX2
    pshufd      xmm6,[CX1],10101010b ;virtual CX3
    psubd       xmm4,xmm1
    psubd       xmm5,xmm2
    psubd       xmm6,xmm3
    movdqa      xmm7,xmm4
    movdqa      xmm8,xmm5
    movdqa      xmm9,xmm6
    psubd       xmm4,xmm1
    psubd       xmm5,xmm2
    psubd       xmm6,xmm3
    pcmpgtd     xmm4,xmm0
    pcmpgtd     xmm5,xmm0
    pcmpgtd     xmm6,xmm0
    pcmpgtd     xmm7,xmm0
    pcmpgtd     xmm8,xmm0
    pcmpgtd     xmm9,xmm0
    pand        xmm4,xmm5
    pand        xmm4,xmm6
    pand        xmm7,xmm8
    pand        xmm7,xmm9
    maskmovdqu  xmm10,xmm4
    add         edi,16
    maskmovdqu  xmm10,xmm7


#104 Madis731

    New Member

  • Members
  • Pip
  • 8 posts

Posted 10 May 2009 - 04:27 PM

Weird that after some time I can't edit my post. Anyway - I was talking about the partially covered block's loop. The most critical part is the X-direction there:
                    for(int ix = x; ix < x + q; ix++)
                    {
                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)
                        {
                            buffer[ix] = 0x0000007F;<< // Blue
                        }

                        CX1 -= FDY12;
                        CX2 -= FDY23;
                        CX3 -= FDY31;
                    }

Actually I managed to render 8 pixels in one pass with SSE. It only needs two stores as opposed to 8 with traditional approach.

I conducted a test and some normalized values are here (units are CPU kiloclocks in two triangle rendering batches):
ORIG = 280 / 300 ; 8 x DWORD move in a loop
MASK = 220 / 260 ; 2 x MASKMOVDQU after ~40 LOC
BLND = 173 / 185 ; 2 x MOVDQA after ~50 LOC

Masked is 13%-21% faster than the original
PBLENDVB technique gains 38% on average compared to the original.
I'm still optimizing but SSE4.1 has some tricks up its sleeve afterall.
Next I will try to take the whole (Q=8) 8x8 block at a time - branchless.

Success! :)

#105 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 10 May 2009 - 08:42 PM

Btw.. It may be worth to try out the following change:

                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)

Replace with:

                        if((CX1 > 0) & (CX2 > 0) & (CX3 > 0))


This eliminates two conditional branches for the cost of a couple of extra instructions (SETs and ANDs). In my experience it speeds up code a little bit if the values reside in registers at the first place.

The reduction from three to one branch-prediction slot may make a difference as well.

Cheers,
Nils
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#106 Madis731

    New Member

  • Members
  • Pip
  • 8 posts

Posted 11 May 2009 - 09:40 AM

EDIT: Some notes
Why I think this topic is hot is that Larrabee (due in 2010) deals with the same problem. Cached here: http://nap.koduleht....rchitecture.pdf
p.23 !

I don't think this SSE code can be improved any more with "early-exits" as this is what you suggested if I understood correctly.

I've managed to learn Intel Compiler a bit in the process and I'm really glad about the flexibility of this product. There were a few things I needed for it to add all those assembly-tricks in its code.
First the part of the modified source and then the explanations:
// compile with "/Ox /QxS /Qvec-report3"
// /QxS is necessary for the SSE4.1
// /Qvec-report3 is for me to see what it does with the loops
                for(int iy = y; iy < y + q; iy++) {
                    int cx123[16];
                    for(int i=0;i<16;i++)
                      cx123[i]=(CX1-i*FDY12 >0) &
                               (CX2-i*FDY23 >0) &
                               (CX3-i*FDY31 >0);
                    #pragma vector always
                    for(int ix = x; ix < x + q; ix++) {
                        if(cx123[ix-x])
                            cbuffer[ix] = 0x0000007F; // Blue
                    }
                    CX1 += FDX12;
                    CX2 += FDX23;
                    CX3 += FDX31;
                    cbuffer = (unsigned int *)((char *)cbuffer + 1280*4);
                }

First I packed the three comparisons in one variable cx123[16] (q==16 at the time of testing) and the loop itself got a lot simpler. The variable cx123 is filled in with SSE (default with Intel Compiler).

Second I turned on the Qvec-report3 to see what it vectorizes and what not. That revealed "condition may protect exception" false presumption (well, maybe not :)) explained here: http://software.inte...er/topic/46566/ . Then I added the #pragma vector always to the code.

Third change needed was the /QxS switch because earlier CPUs don't know about the SSE4.1 and thus general optimization didn't find the path to BLENDVB.

Results {tri(<200,100>,<100,400>,<500,400>); CPU=T9300; 1000 loop minimum clocks}:
1) 912075 - Naive simplest solution;
2) 469238 - Original floating-point code;
3) 231737 - Original non-float code;
4) 186475 - SSE4.1 variant (no hand-coded assembly yet!)
Even comparing to the 3rd variant the final implementation gains 20% :)
Compared to the naive approach its almost 5x as fast.

#107 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 11 May 2009 - 01:20 PM

Good work, Madis..

Nice to see that there are still some people out there who are interested in software rendering.

What's your long term goal btw? I'm just curious. :-)
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#108 Madis731

    New Member

  • Members
  • Pip
  • 8 posts

Posted 11 May 2009 - 02:11 PM

Actually I'm generally interested in optimizations and 2D/3D-graphics provides the most of the training bases. I always get turned on when there's optimizations and especially SSE in question.
That is why I'm interested in this. I'm actually using the same tiling-approach in my to-be 3D-engine for http://menuetos.net/
that has no graphics-driver support as of yet.

When products (such as Larrabee) use the same approach, I know I'm on the right track :) this drives my enthusiasm.

PS. Sorry about littering the thread over 5 years or so ;) I usually do my littering around Flat assembler board.

#109 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 11 May 2009 - 08:15 PM

Madis731 said:

PS. Sorry about littering the thread over 5 years or so ;) I usually do my littering around Flat assembler board.

You're welcome!

I always like to read a rant or another about low-level things..:yes:
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#110 primer

    New Member

  • Members
  • Pip
  • 1 posts

Posted 13 July 2009 - 03:04 PM

Hi All,

I have implemented the algorithm detailed within this thread for my own halfspace rasteriser although I am still confused about the topleft fill convention implementation within the example source provided in the article.

As I see it you seem to be incrementing the halfspace equation constant part by 1 if we are indeed on the top or left edges, is this really the correct thing to do as incrementing by 1 from a fixed point 24;8 number would just increment this value by 0.00390625f or 1/256 (being 8bits fractional part) wouldnt it?

Am I just being a fixed point dumbass or am I correct? Any explaination would be great ;)

Many thanks!

#111 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 21 July 2009 - 12:31 PM

primer said:

As I see it you seem to be incrementing the halfspace equation constant part by 1 if we are indeed on the top or left edges, is this really the correct thing to do as incrementing by 1 from a fixed point 24;8 number would just increment this value by 0.00390625f or 1/256 (being 8bits fractional part) wouldnt it?
Yes, but fixed-point numbers are discrete, so there are no values in between each 1/256 interval (for x.8 format). For rasterization this means that you're able to determine exactly whether a point is on the edge, or not. And for the fill convention you can decide whether you treat those as covered or not covered, just by adding 1.

#112 hoglet

    New Member

  • Members
  • Pip
  • 2 posts

Posted 16 September 2009 - 12:36 PM

Nick,

A question on your top-left fill convention if I may? I have some experience of the classical scan-line algorithm and the need for a (top-left) fill convention.

What happens in the case where two triangles share a vertex and that vertex lies exactly on a pixel centre?

Say you have a triangle fan of a square where the vertices of both triangles are ordered counter-clockwise. Consider the top-left square vertex (shared by the two triangles) marked by the asterisk. Inward facing edge normals as per your convention.

*---
|\ |
| \ |
| \|
----

Please forgive the ascii Nick! The half-space tests for the left and right edges for the lower-left triangle would be 0 and the half-space test for the bottom edge would by > 0.

The half-space tests for the top and left edges for the upper-right triangle would be 0 and the half-space test for the right edge would be > 0. So, I was wondering...does your approach correctly fill the pixel for the upper-right triangle only as the classical scan-line approach would?

Given your deserved solid reputation in this area I suspect that my understanding is flawed and not your rasterizer! I'm stumped :-(

I know this post goes back a while, but please Nick - if you're tuning in then enlighten me!

Thank you VERY much - Hoglet.

#113 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 16 September 2009 - 01:12 PM

Hi hoglet,

Yes, the case you depicted is handled correctly by the top-left fill convention. For the upper-right triangle, the top edge uses a X >= 0 comparison (X being the distance). Since the distance to that edge is 0, it's inside that half-space. For it's left edge it also uses a X >= 0 comparison and the distance is 0 to that edge so it's also inside the half-space of the left edge. And finally the right edge uses a X > 0 comparison and the distance is clearly larger than 0. So the pixel represented by the asterisk is covered by this upper-right triangle.

For the lower-left triangle, the right edge also uses a X > 0 comparison, but this time the distance to that edge is 0. So the pixel is considered outside that half-space. Hence the pixel is not covered by that triangle. So there's no risk of drawing it twice. Note that this triangle's left edge uses a X >= 0 comparison and the bottom edge uses a X > 0 comparison as dictated by the fill convention and the pixel is on the inside of both but it has to test positive on all three half-spaces to be inside the triangle.

Making top and left edges use a X >= 0 comparison and making right and bottom edges use a X > 0 comparison is done by simply adding 1 to X for top and left edges, and then comparing (X + 1) > 0. That's because (X + 1) > 0 is the same as X >= 0, for integer numbers. So in the code you'll only see comparisons of the form X' > 0, with some X's being the distance plus 1.

I hope this makes things a little clearer. :happy:

#114 hoglet

    New Member

  • Members
  • Pip
  • 2 posts

Posted 16 September 2009 - 01:45 PM

Nick!

Yup - that makes perfect sense. I'm very grateful to you for pointing out what, perhaps, I should've seen in the first place!

I'll PM you!

Cheers,

Hoglet.

#115 jiti

    New Member

  • Members
  • PipPip
  • 16 posts

Posted 28 November 2009 - 05:16 PM

Hi guys. At the begin, sorry for my bad english :D

I am developing a project http://sourceforge.n...cts/phenomenon/
where a unit with name GeSource is a software renderer based on Nick's rasterization algorhitm.

What the unit actualy does:
-output thru DirectDraw (gdi is so damn slow, daaaaaaamn slow omg)
-basic linear setup and interpolation of z-value
-hierarchical rasterizing thru nodes (not zbuffer for now, just z-buffer & coverage variables-prototypes )
-triangle_boundingbox vs node_boundingbox optimizations
-fast hierarchical z-buffer clear
-fast hierarchical z-buffer initialization and destruction
-some parts optimized for SSE

But... this unit is in pascal language with slovak language comments :-\
I hope, the source will help someone.... ;-)

OpenIDEA aka OpenSource rules :D

#116 stowelly

    New Member

  • Members
  • Pip
  • 4 posts

Posted 02 February 2010 - 06:14 PM

Nick said:

It's slightly worth it in C++, but not for the assembly implementation. It's already really fast to determine coverage for a block. The extra checks and using more registers is not worth it.

Indeed on a GBA it could be very useful, though I doubt if this parallel rasterizer actually has use there at all?

This is quite easy. First determine gradients du/dx, dv/dx, du/dy and dv/dy. Then for every pixel you can determine how far it is from the first vertex: dx = x - x0, dy = y - y0 (note that x0 and y0 are floating-point, x and y are integer). Then the texture coordinates are simply: u = u0 + du/dx * dx + du/dy * dy.

Once you got the coordinates for one pixel you obviously only have to add du/dx and/or du/dy to get the texture coordinates for the neighboring pixels. So I do the above calculation once per block, then use this linear interpolation for the other pixels in the block.


Hi, im having some problems implementing this, could anyone please tell me where I am going wrong?



"First determine gradients du/dx, dv/dx, du/dy and dv/dy"


float dx = (MS_LIBS::max(vec[0].x, vec[1].x, vec[2].x)) - (MS_LIBS::min(vec[0].x, vec[1].x, vec[2].x));
float dy = (MS_LIBS::max(vec[0].y, vec[1].y, vec[2].y)) - (MS_LIBS::min(vec[0].y, vec[1].y, vec[2].y));
float dz = (MS_LIBS::max(vec[0].z, vec[1].z, vec[2].z)) - (MS_LIBS::min(vec[0].z, vec[1].z, vec[2].z));

du = texture->m_width;
dv = texture->m_height;

float du_dx = du/dx;
float dv_dx = dv/dx;
float du_dy = du/dy;
float dv_dy = dv/dy;


MS_LIBS::VECTOR_3 start_point = vec[0];


"Then for every pixel you can determine how far it is from the first vertex: dx = x - x0, dy = y - y0"

at the end of the x loop

dx = x - start_point.x;

at the end of the y loop

dy = y - start_point.y;

"Then the texture coordinates are simply: u = u0 + du/dx * dx + du/dy * dy."

and before drawing the pixel

float u = uvs[0].m_u + (du_dx * dx) + (du_dy * dy);
float v = uvs[0].m_v + (dv_dx * dx) + (dv_dy * dy);


ive obviously missed some fundimental point here, but would be very greatfull if someone could point me in the right direction please!

#117 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 03 February 2010 - 01:32 AM

Hi stowelly. Unfortunately your code for determining the gradients isn't correct. Probably the best source for a fast and robust algorithm is Triangle Scan Conversion using 2D Homogeneous Coordinates (section 4). I hope that helps.

#118 jiti

    New Member

  • Members
  • PipPip
  • 16 posts

Posted 14 February 2010 - 12:01 PM

Hi guys. I have one problem with texture coordinates. my work is based on Nick's rasterizer. The problem is not about interpolation but about fill convention in texture coordinates compatible with Nick's rasterizer. Its about the texture position. He must stay between 0..0.999999 in triangle with u coordinates for example (0,0)-(0,1)-(1,1)..without "CLAMP" or "REPEAT" function. Similar for v coordinate. I experimented with Chris Checkers rounding rules, or like Chris with rotating bitmaps and setting Dx_Du & Dx_Dv rules, but without result. Chris rounding rules are just compatible with his scanline-based-up2down-left2righ-2subtriangle-classic algorytm. Pls, if someone guys knows the solution for this problem, pls HELP !! .. or better.. Nick, you " THE MASTER OF MODERN RASTERIZING" HEEEEEEEEEEEELP!!! ???? :-( ???

#119 jiti

    New Member

  • Members
  • PipPip
  • 16 posts

Posted 14 February 2010 - 12:21 PM

I almost forgot:
- i implemented Zmax function im my gesource library so with n-level hierarchical z-buffer can be not just small, but big triangles too nicely reject.
- i implemented 8x8 tilecaching for texture memory and for rendertargets (texture=rendertarget) and conversion from linear-2-8x8tile memory for textures and 8x8tile-2-linear for converting rendering framebuffer to linear representation for visualizing the drawing result.Swithcing the texture to rendertarget mode will be without conversion (not implemented now). 8x8 tileing helps for the cpu cache but in momental form of my hierarchical drawing rutines will it not much help.

PS: My library is not a second prototype of SW-shader. It will be hardcoded deffered renderer with not so much flexibilness. Flexibilness=lower speed except SW-Shader aka SwiftShader. :-)

#120 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 15 February 2010 - 05:37 AM

jiti said:

Its about the texture position. He must stay between 0..0.999999 in triangle with u coordinates for example (0,0)-(0,1)-(1,1)..without "CLAMP" or "REPEAT" function.
That's not possible. Mathematically the coordinates range from 0.0 to 1.0, irrespective of the fill convention. And due to rounding it could go slightly outside that range.

But why can't you use a clamp or repeat function?





1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users