187 replies to this topic

### #121jiti

New Member

• Members
• 16 posts

Posted 16 February 2010 - 12:25 AM

Thnx for the reaction Nick :) , but I mean the texture coordinate rounding rule that depends on the rotation of the texture. The rounding rules are similar to the two fill conventions for triangle rasterization (top-left & bottom-right).
Chris talks about them in his articles. I can use the two functions (clamp & repeat) to mask the cases where the texture sample position goes outside the range of the texture image (0..1 in normalized coordinates), but I'm a bit of a pedant about precise texture mapping. I want to do texture mapping exactly like the GPUs do it. OK, floating-point precision is a problem, a problem without a solution. But the problem of mapping the texture onto the triangle does have a solution. I drew a small diagram to better visualize what I mean: http://img3.imagesha...tureproblem.png The problem is in the two ellipses, as you can see.
We have a texture of 2x2 random colors. The texture coordinates are not normalized.

In the 0-degree diagram: when we step horizontally from left to right for the u coordinate, we never reach the value 2, because the last pixel on the right edge of the quad is rejected by the top-left fill convention of the triangle rasterization algorithm.
It's similar for the v coordinate: we step from top to bottom, but we never reach the value 2 on the bottom edge, because the pixels lying on that edge are rejected by the fill convention. This is the good case!!!

Now the opposite.. yep, the problem, the bad case:
we rotate the texture by 180 degrees.
As we see (in the red ellipses), we now start sampling at position 2 for both the u and v coordinates. Without the repeat & clamp functions we would be sampling outside the texture image. This is the point of my problem.

As I said, Chris found the solution, but only for his algorithm. Now I want to find the rounding-rules algorithm for your 3-edge-check rasterization algorithm, and it must be compatible with your fill convention. Thanks in advance for the help :)
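For reference, the fill convention jiti is trying to stay compatible with can be sketched for a 3-edge-check rasterizer by biasing the edge functions. This is only an illustrative sketch: the winding, sign conventions, and all names here are my own assumptions, not Nick's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Edge function in screen space with y pointing down; the vertex order is
// chosen so that all three edge functions are positive inside the triangle.
static int64_t edgeFn(int ax, int ay, int bx, int by, int px, int py) {
    return (int64_t)(px - ax) * (by - ay) - (int64_t)(py - ay) * (bx - ax);
}

// With this edge function and winding, "top" edges run right-to-left
// (dy == 0, dx < 0) and "left" edges run downward (dy > 0).
static bool isTopLeft(int ax, int ay, int bx, int by) {
    int dx = bx - ax, dy = by - ay;
    return (dy == 0 && dx < 0) || (dy > 0);
}

// A pixel exactly on a non-top-left edge gets a -1 bias, so E == 0 there is
// rejected, while E == 0 on a top-left edge is kept. A shared edge between
// two triangles is therefore filled exactly once.
static bool covered(const int vx[3], const int vy[3], int px, int py) {
    for (int i = 0; i < 3; i++) {
        int j = (i + 1) % 3;
        int64_t e = edgeFn(vx[i], vy[i], vx[j], vy[j], px, py);
        int64_t bias = isTopLeft(vx[i], vy[i], vx[j], vy[j]) ? 0 : -1;
        if (e + bias < 0) return false;
    }
    return true;
}
```

The texture rounding question is then about which texel the samples on those kept edges map to.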

### #122jiti

New Member

• Members
• 16 posts

Posted 16 February 2010 - 12:42 AM

OK, I'll say it again, but.. SORRY FOR MY ENGLISH !! :D

### #123stowelly

New Member

• Members
• 4 posts

Posted 16 February 2010 - 07:25 PM

Nick said:

Hi stowelly. Unfortunately your code for determining the gradients isn't correct. Probably the best source for a fast and robust algorithm is Triangle Scan Conversion using 2D Homogeneous Coordinates (section 4). I hope that helps.

thanks, that makes a lot of sense.

I now have textured polys :)

I now have another question regarding this process. I need to use a Z buffer, and I'm not entirely sure how to get from the X,Y in screen space back to the triangle's Z coordinate in order to do these checks. Any ideas?

thanks

### #124stowelly

New Member

• Members
• 4 posts

Posted 16 February 2010 - 07:27 PM

jiti,

I had the same problem as you with the mesh I was using.

Some textures are mapped using texture wrapping etc.

Picture it as a sheet containing multiple copies of the same texture.

So if you had -0.4, that would actually translate to 0.6; likewise, 1.2 would become 0.2, if that makes sense. I'm sure there are other texture addressing modes such as mirroring etc., but this was the solution to my problem.

### #125stowelly

New Member

• Members
• 4 posts

Posted 16 February 2010 - 09:05 PM

stowelly said:

thanks, that makes a lot of sense.

I now have textured polys :)

I now have another question regarding this process. I need to use a Z buffer, and I'm not entirely sure how to get from the X,Y in screen space back to the triangle's Z coordinate in order to do these checks. Any ideas?

thanks

You can ignore this post. It's just the same as finding u and v :) I was thinking it's a lot harder than it actually is.

thanks

### #126Nick

Senior Member

• Members
• 1227 posts

Posted 17 February 2010 - 12:05 AM

jiti said:

http://img3.imagesha...tureproblem.png The problem is in 2 elipses as you can see.
Thanks for the image. That does make things clearer.

This situation can't be avoided though. Even with a GPU if you perfectly align the left edge on the centers of the pixels, and a coordinate on that edge is 1.0, you WILL be sampling at 0.0 instead when using 'repeat' addressing.

This isn't actually a problem when using bilinear filtering instead of point filtering. And if you really want to use point filtering, use 'clamp' addressing instead. When the quad is supposed to be aligned to the pixels, make sure you're subtracting 0.5 from the x and y coordinates so that you cover the entire pixels and the sample locations (at pixel centers) are off the edges.
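Nick's point about 1.0 wrapping to 0.0 under 'repeat' can be seen in a tiny point-sampling sketch. The function names are mine and this mirrors no particular API; it just applies the two addressing modes to a texel lookup.

```cpp
#include <cassert>
#include <cmath>

// 'repeat' addressing: wrap the coordinate into [0, 1) before scaling.
// Note that u = 1.0 wraps to 0.0, i.e. the first texel again.
static int repeatTexel(float u, int size) {
    float wrapped = u - std::floor(u);  // maps 1.0 to 0.0, -0.4 to 0.6
    return (int)(wrapped * size);
}

// 'clamp' addressing: pin the texel index to the valid range, so u = 1.0
// stays on the last texel instead of wrapping around.
static int clampTexel(float u, int size) {
    int t = (int)(u * size);
    if (t < 0) t = 0;
    if (t > size - 1) t = size - 1;
    return t;
}
```

For jiti's 2x2 texture, the edge coordinate 2/2 = 1.0 lands on texel 0 under repeat but on texel 1 under clamp.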

### #127jiti

New Member

• Members
• 16 posts

Posted 17 February 2010 - 06:37 AM

Yes stowelly, the "repeat" function does the wrapping. The formula u0 = u - floor(u) does the wrap; in this example it's for the u coordinate.
So as you said, -0.4 will be wrapped to 0.6 and 1.2 to 0.2.
So if you send normalized texture coordinates (0,0)-(5,5)-(5,0), the texture will be repeated 5x across the triangle. Not good for the cache, but good for memory. :)
One tip:
On the CPU, the SSE floor function is a cycle-eater if you do it twice or more per pixel (for u & v). I think it's better to make all u & v coordinates positive, and in the main loop of the rasterizer, where you do the floor calculation, just use truncation instead.. I mean the SSE truncate instruction. It's one SSE instruction versus the n instructions of a floor function; floor and trunc give the same result for positive numbers. Just be sure you send positive texture coordinates. For flexibility, though, just use the standard u - floor(u): lower speed but more flexibility. :)
Yes, the z coordinate is handled the same as u & v.. these are all interpolants, so all are calculated with the same formula. Don't forget perspective-correct interpolation: good for the z-buffer, good for textures, good for human eyes :D and not so much work for the CPU thanks to SSE :)
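The perspective-correct interpolation recommended above can be sketched along a single scanline span: interpolate u/w and 1/w linearly in screen space, then divide per pixel to recover the true u. A minimal illustration with made-up names, not anyone's actual rasterizer code:

```cpp
#include <cassert>
#include <cmath>

struct Vertex { float u, w; };  // texture coordinate and homogeneous w

// Interpolate u with perspective correction between two span endpoints.
// u/w and 1/w are linear in screen space; their quotient gives the true u.
static float perspectiveU(const Vertex &a, const Vertex &b, float t) {
    float uow = (a.u / a.w) * (1 - t) + (b.u / b.w) * t;    // u/w, linear
    float oow = (1.0f / a.w) * (1 - t) + (1.0f / b.w) * t;  // 1/w, linear
    return uow / oow;                                        // one divide per pixel
}
```

With a = {0, 1} and b = {1, 2}, the screen-space midpoint yields u = 1/3 rather than the linear 1/2, which is exactly the nonlinearity a naive interpolator gets wrong.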

stowelly, but my problem is implementing rounding rules for texture mapping based on the rotation of the texture, as Chris Hecker does in his scanline algorithm, and implementing them in Nick's 3-edge-check algorithm. The two rasterization algorithms have different behavior, different rasterization rules.

### #128jiti

New Member

• Members
• 16 posts

Posted 17 February 2010 - 07:09 AM

Thnx Nick :).. now I understand. Yes, bilinear filtering can mask it, and that's true. Nowadays a software rasterizer must be able to do lighting, perspective correction and, yep, bilinear (or better, trilinear) filtering and shaders. I don't really need point sampling; as I said, I'm just a pedant in the sense of "what the GPUs can do, my rasterizer can do too" :D. In software I can do the postprocessing without polygon rasterizing and texel filtering like the GPUs. The CPU is more flexible than the GPU.
Now another topic:

Nick, first question: what do you think about a hierarchical z-buffer? Is it a good idea? And what is better, an n-level z-buffer or a fixed number of levels, for example a 2- or 3-level hierarchical z-buffer? And how big must the tiles be for good performance: 2x2, 4x4, or, like I have, 8x8 per level? What do you think?
..and second question: is it a good idea to tile the texture image in memory? I use 8x8 pixel tiles for everything in my rasterizer (z-buffer, textures, render target). How are textures stored in the video memory of graphics cards?

### #129Nick

Senior Member

• Members
• 1227 posts

Posted 19 February 2010 - 01:10 AM

jiti said:

What do you think about hierarchical z-buffer.. Is it a good idea?
It depends. For modern applications with complex shaders the per-pixel arithmetic work is so high that the time (and bandwidth) spent on reading and writing the z-buffer becomes less significant.

Either way I would advise to focus on adding functionality first before optimizing things.

Quote

..and second question: Is it good idea to tileing texture image in memory.. i use 8x8 pixel tiles for all things in my rasterizer (zbuffer, texture,
rendertarget). How are textures stored in videomemory of graphic cards?
Graphics cards store them tiled, as far as I know. But for a software renderer the overhead of 'swizzling' the texel address is typically higher than the average memory access latency. Today's CPUs have huge caches and can handle the access patterns quite well.
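To make the 'swizzling' overhead mentioned above concrete, here is a sketch of linear versus 8x8-tiled texel address computation (8x8 being jiti's tile size). The extra shifts and masks are the per-fetch cost; this assumes the texture width is a multiple of 8, and the layout is one plausible tiling scheme, not how any specific GPU stores textures.

```cpp
#include <cassert>

// Plain row-major addressing: one multiply-add per fetch.
static int linearAddr(int x, int y, int width) {
    return y * width + x;
}

// 8x8-tiled addressing: tiles are stored contiguously (64 texels each),
// row-major within the tile and row-major across tiles.
static int tiledAddr(int x, int y, int width) {
    int tileX = x >> 3, tileY = y >> 3;   // which 8x8 tile
    int inX = x & 7, inY = y & 7;         // position inside the tile
    int tilesPerRow = width >> 3;         // width assumed multiple of 8
    return ((tileY * tilesPerRow + tileX) << 6) + (inY << 3) + inX;
}
```

The tiled variant costs two extra shifts, two masks, and a shift-combine per fetch, which is the arithmetic overhead Nick is weighing against the cache behavior.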

Have you implemented mipmapping yet?

### #130jiti

New Member

• Members
• 16 posts

Posted 20 February 2010 - 01:21 AM

Nick,
mipmapping is in my plan. I know that mipmapping will help the cache. It's good for both CPU & GPU.

If I understand correctly: on today's CPUs, when I rasterize, it's not required to tile the texture? I read the old FATMAP documentation about tiled texture mapping, and that approach gave a boost. But that was in the days when the 486 and 586 were the best CPUs. Now there are new CPUs with new behavior, new caching systems and so on. The rules may have changed by now.

Storing material attributes (like diffuse, normal, specularity, glossiness..) in different slots or different memory locations, like the GPUs do it, or like some software renderers do just to be compatible with the GPU, is a bad idea. On GPUs there is a little helper named texture arrays. On the CPU side I think an interleaved representation combined with tiling will help: a texture whose texels carry many attributes, like the G-buffers in deferred shading. Such a texture can be read with just one memory access (on the CPU, one 128-bit SSE register) and with just one set of texture coordinates. The bottleneck is memory bandwidth, and the attribute textures (color, normal, ..) must have the same resolution. What do you think, guys, or you, Nick?

### #131jiti

New Member

• Members
• 16 posts

Posted 20 February 2010 - 01:26 AM

My English is bad, very bad, but I think it's understandable :D

### #132Pixar

New Member

• Members
• 2 posts

Posted 16 March 2010 - 07:33 AM

Despite the age of this article I want to revive it... Can anyone say how this rasterizer could be implemented in a C# software renderer, where we can't use pointers (we can, but we won't, because it's unreasonable to use unsafe code just to make this work)?

I've written a simple scanline triangle filler, but it costs me a drop of 30 to 100 fps (depending on the area I fill) every time I call this function. That's the hardest part of it:
for (int y = y1; y <= y3; y++)
{
ixs = (int)(xs + 0.5f);
ixe = (int)(xe + 0.5f);
for (i = ixs; i <= ixe; i++)
Buffer[y * Pitch_Div_4 + i] = color;

xs += dx_left; xe += dx_right;
}
Buffer is an int array whose size is calculated by the formula ScreenHeight * (MemoryPitch >> 2).

Later, when all rasterization is done, I copy the Buffer array into video memory like this: DataRect.Data.WriteRange<int>(Buffer).

Oh, I didn't say yet that I'm using SlimDX to access the video memory directly. So DataRect is the DataRectangle containing the data I get after locking my surface. I should describe it more closely, because for some reason locking/unlocking costs about 120 fps.
I declared a Device class, which works as a layer between me and SlimDX. Inside it there is a method which locks the render target:
public DataRectangle LockSurface()
{
surface = device.GetRenderTarget(0);
return surface.LockRectangle(LockFlags.None);
}
These two lines take 100 fps from me. It sucks...

So, I've given as much information as I can. I want to know why my code is so slow. I thought about changing my scanline function to the one written in this article. I also thought about switching from SlimDX to GDI. But I don't want to do all this blindly, and I want to know why something is worth changing and something worth leaving alone.

### #133Nick

Senior Member

• Members
• 1227 posts

Posted 17 March 2010 - 12:55 AM

Pixar said:

I've wrote a simple scanline triangle filler, but it costs me from 30 fps to 100 fps (depends on the square I fill) drop every time I call this function.
You shouldn't be using FPS as an absolute performance metric. If things previously ran at 10,000 FPS, a drop of 100 FPS is unnoticeable. If you were at 101 FPS, a drop of 100 FPS is disastrous. And it can't drop by 30 to 100 FPS "every time" you call it, since that would result in a negative FPS if you called it enough times.

Instead you should just look at the time each operation takes to get a better assessment of the performance.

My favorite metric is actually the number of clock cycles per pixel (for a pixel heavy scene). If it's in the order of 100 clock cycles, this means that on a 3 GHz single-core CPU and at 800x600 resolution you could achieve 60 FPS!
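The arithmetic behind that figure is worth spelling out: 800x600 is 480,000 pixels, so 100 cycles per pixel is 48 million cycles per frame, and a 3 GHz core delivers 3 billion cycles per second. A sketch of the conversion (assuming one full-screen pass per frame):

```cpp
#include <cassert>
#include <cmath>

// Convert a cycles-per-pixel budget into frames per second for a given
// clock speed and resolution, assuming every pixel is shaded once.
static double fps(double ghz, int width, int height, double cyclesPerPixel) {
    double cyclesPerFrame = (double)width * height * cyclesPerPixel;
    return ghz * 1e9 / cyclesPerFrame;
}
```

At 3 GHz, 800x600, and 100 cycles/pixel this lands at 62.5 FPS, i.e. just above the 60 FPS Nick quotes.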

So how much time does your project spent on rasterization each frame? Where does the rest of the time go?

### #134Pixar

New Member

• Members
• 2 posts

Posted 18 March 2010 - 01:55 PM

Nick said:

You shouldn't be using FPS as an absolute performance metric. If things previously ran at 10,000 FPS, a drop of 100 FPS is unnoticeable. If you were at 101 FPS, a drop of 100 FPS is disastrous. And it can't drop by 30 to 100 FPS "every time" you call it, since that would result in a negative FPS if you called it enough times.

Instead you should just look at the time each operation takes to get a better assessment of the performance.

My favorite metric is actually the number of clock cycles per pixel (for a pixel heavy scene). If it's in the order of 100 clock cycles, this means that on a 3 GHz single-core CPU and at 800x600 resolution you could achieve 60 FPS!

So how much time does your project spent on rasterization each frame? Where does the rest of the time go?

Hmm... what is "the number of clock cycles per pixel" and how is it measured? Also, I don't know how I should measure "the time each operation takes".

By the way, I tried 3 different approaches in my filling method: 1. filling an int[] array and then copying it to the surface; 2. filling each pixel in a loop like DataRect.Data.Write(color); 3. filling memory using a method similar to memset, but which fills memory 4 bytes at a time (code is below). All these approaches give the same fps! I am disappointed. I thought the third technique would be the fastest...

//////////////////////////////////////////////////////////////////////////////////
IN C++:
extern "C" __declspec(dllexport) void MemSetDWORD(int* ptr, int c, int n)
{
    _asm
    {
        CLD             ; clear direction flag: copy forward
        MOV EAX, c      ; color goes here
        MOV ECX, n      ; number of DWORDs goes here
        MOV EDI, ptr    ; destination address
        REP STOSD       ; send the Pentium on its way
    } // end asm
}

IN C#:
[DllImport("testfill.dll")]
public static unsafe extern void MemSetDWORD(int* ptr, int c, int n);

for (temp_y = y1; temp_y <= y3; temp_y++, DataPointer += Pitch_Div_4)
{
MemSetDWORD(DataPointer + (int)xs, color, (int)(xe - xs + 1));
xs += dx_left; xe += dx_right;
}
///////////////////////////////////////////////////////////////

### #135Nick

Senior Member

• Members
• 1227 posts

Posted 19 March 2010 - 12:09 AM

Pixar said:

Hmm... what is "the number of clock cycles per pixel" and how is it measured? Also I don't know how I should measure "the time each operation takes" :)
With a stopwatch. :ninja:

Seriously now, you should Google for "C# code timing".

The fundamental computations made by the CPU are synchronized to an internal clock. This clock runs at a certain frequency (typically around 3 GHz these days). One interval is called a clock cycle. So if you know the number of clock cycles an operation takes you know how fast or slow your implementation is, pretty much independent of the CPU.

You can either compute the number of clock cycles per pixel by timing the rendering of a certain amount of pixels and taking the clock frequency of your CPU into account, or you could use a profiler. An excellent free profiler is CodeAnalyst.

Quote

By the way, I tried 3 different approaches to my filling method: 1. Filling int[] array and then copying it to the surface 2. Filling each pixel in the loop like this DataRect.Data.Write(color) 3. Filling memory using method similar to memset, but which is filling memory by 4 bytes (code is below). All this approaches give the same fps! I am disappointed. I thought that the third teqnique must be the fastest...
Clearly your bottleneck is elsewhere. A profiler will tell you exactly where the most time (or clock cycles) is being spent, so you can concentrate on that.

Anyway, I have to seriously warn you that optimizing things too early is futile. Make sure you implement all the functionality you want first, and then get your profiling results to determine the bottlenecks / hotspots. Profiling early on is usually a waste of time, since the results will typically change radically as you near the end of the project.

Good luck!

### #136alexpolt

New Member

• Members
• 1 posts

Posted 14 April 2010 - 03:40 PM

Nick, thanks for a very helpful topic. And sorry for reviving this old post yet again.

You can also rasterize by recursion: halve every edge (and all the interpolants: light, uv, etc.), then form two new triangles by extending a new edge opposite to the one that was halved, and
fill them separately. Nice recursion, cheap operations ( >> 1 ), great parallelization, but on the PC it turned out slow due to recursion overhead and memory usage.

And one remark on interpolating various values across the triangle.
There are two ways: you can compute a plane equation for each value, or you can compute just two plane equations for weights that run from 0 to 1 (each weight is 1 at one vertex and 0 at the other two). The final value then looks like: z = z0 + (z1 - z0) * b1(x, y) + (z2 - z0) * b2(x, y).
That way you can calculate many interpolants across the triangle.
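That two-plane-equation idea can be sketched with edge functions: the two weights are ratios of signed areas (barycentric-style), each 1 at one vertex and 0 at the other two, and any attribute is then interpolated with the same two weights. The struct layout and names here are my own illustration of the idea, not alexpolt's code.

```cpp
#include <cassert>
#include <cmath>

struct Tri { float x[3], y[3]; };

// Signed-area-style edge function for the directed edge (a -> b) at point p.
static float edge(float ax, float ay, float bx, float by, float px, float py) {
    return (px - ax) * (by - ay) - (py - ay) * (bx - ax);
}

// b1 is 1 at vertex 1 and 0 at vertices 0 and 2; b2 is 1 at vertex 2 and
// 0 at vertices 0 and 1. Both are plane equations in (x, y), so once their
// gradients are known, every interpolant reuses the same two weights.
static float interp(const Tri &t, const float z[3], float px, float py) {
    float area = edge(t.x[0], t.y[0], t.x[1], t.y[1], t.x[2], t.y[2]);
    float b1 = edge(t.x[2], t.y[2], t.x[0], t.y[0], px, py) / area;  // weight of v1
    float b2 = edge(t.x[0], t.y[0], t.x[1], t.y[1], px, py) / area;  // weight of v2
    return z[0] + (z[1] - z[0]) * b1 + (z[2] - z[0]) * b2;
}
```

In a real rasterizer you would of course precompute the gradients of b1 and b2 once per triangle and step them incrementally, rather than evaluating the edge functions per pixel.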

### #137renton79

New Member

• Members
• 8 posts

Posted 18 April 2010 - 07:31 PM

Hi

Nick, first of all, great work here. I have implemented it in my project and it works beautifully; there were a few typos but nothing serious. I have one question though: have you got anything similar for line rasterization? Or maybe, how could I easily create a fast algorithm for it?

Dave

### #138jiti

New Member

• Members
• 16 posts

Posted 14 May 2010 - 04:52 AM

Hi Nick. I read the paper about 2D homogeneous rasterization. I nearly understand the rasterization, but all the calculations need to be done on floating-point numbers. How do they handle the fill convention? It needs finite precision, like integers have. Thanks in advance for the answer.

On the hierarchical z-buffer technique: it's problematic, because the whole pyramid, with zmax and zmin info, tile info, and subtile pointer info on every level, is just too much data for the CPU cache. So it's slow. First I programmed it with a recursive algorithm (using a call/ret subroutine system), then with per-level throughput checking, but both algorithms were slow. So I went back to basics and did a 2-level hierarchical z-buffer (zmin/zmax per tile), and later 3 levels (64x64 tiles with 8x8 subtiles), like in the Larrabee rasterization paper. But I still need to compare the speed of the 3-level against the 2-level z-buffer. Now I know why other developers don't use the full hierarchical z-buffer pyramid.

### #139Geri

New Member

• Members
• 30 posts

Posted 25 May 2010 - 05:26 PM

Hey Nick, do you have some benchmarks of your software D3D renderer?

(If you have and want to show them, but don't want to post them here, please send the results to me over MSN.)

I just want to compare them to the performance of my code.

Greetings!

### #140Nick

Senior Member

• Members
• 1227 posts

Posted 25 May 2010 - 10:18 PM

Geri said:

Hey Nick, do you have some benchmarks of your software D3D renderer?
What benchmark numbers do you need exactly? Maybe it's easier to just run it yourself: http://www.transgami...ss/swiftshader/
