187 replies to this topic

### #21 anubis

Senior Member

• Members
• 2225 posts

Posted 13 September 2004 - 11:56 PM

nice... maybe the categories should be separated more clearly?
If Prolog is the answer, what is the question?

### #22 Nick

Senior Member

• Members
• 1227 posts

Posted 14 September 2004 - 08:20 AM

john said:

not only does Nick's swShader beat Quake's software rendering and others, but it beats the crap out of DirectX's REF rasterizer. There's no point in comparing against Quake's s/w anyway. Good job Nick!
Thanks for the compliment, but I would nuance this a little. It doesn't beat the Quake I renderer at all. In its context (a Pentium 100 MHz), the Quake I renderer is brilliant. Before that, there was nothing that came close. It did everything right: perspective correction, fill convention, prestepping, mipmapping, lightmapping. And all that at 4-6 clock cycles per pixel (inner loop)! Although my renderer has ten times more features, I have GHz processors with MMX and SSE and I'm still stuck at 20 FPS. :blush: Well, in its own context that ain't bad at all... But anyway, it's Quake I's renderer that really inspired me, and at some points it still does.

Quote

You're going to be a millionaire someday. You're pretty much guaranteed a job at any company you apply for. What's the point of life after that? hehe (j/k)
Money is not everything. Although I could certainly use some more to keep paying for my education, there's more to life than this. In the end it's all about living a happy life. And making the people around you happy is very important to feel good yourself. So I would never do a job just for the money, I would do it if I feel good about it and my career doesn't make me neglect other people. Combining a good career and happiness is hard, but I'm trying...

Quote

As far as I know, there isn't a single commercial or free product which emulates shaders except for swShader (isn't that right?).
Mesa 3D, the OpenGL 'reference rasterizer', supports them too. I haven't tested it, but I've heard it's not too bad when only using a few simple shaders. But swShader is probably the only product where shaders can be categorized as real-time and efficient. Last week I was able to speed up my Per-Pixel Lighting demo from 34 to 42 FPS on my Pentium M 1.4 GHz laptop. Now I'm aiming at 50 FPS. Although I'm very excited about every improvement, I do realize that the actual usefulness of it all is very low. You can't really run a semi-modern game with swShader. Even the cheapest integrated graphics chip from three years ago beats me at fillrate. It's an uphill battle I'll never win.

Quote

I guess there wasn't any incentive for Microsoft to optimize their software renderer since they expect hardware vendors to implement everything in hardware, and the software part is only used as a reference model to compare against.
They did have some attempts at software rendering, like the RGB Emulation and Ramp Emulation. I don't know if they ever evolved further, but I think that they quickly realized that the days of 2D graphics cards are totally over. A computer with a graphics card that can't draw a triangle with a single texture is really hard to find. That's almost all RGB and Ramp could do. :huh:

### #23 Mihail121

Senior Member

• Members
• 1059 posts

Posted 14 September 2004 - 01:39 PM

Hey John, check out Muli3D, it also supports shaders in an easy and understandable way!

### #24 Nick

Senior Member

• Members
• 1227 posts

Posted 14 September 2004 - 04:27 PM

Mihail121 said:

Hey John, check out Muli3D, it also supports shaders in an easy and understandable way!
It's nice to see Muli3D has evolved! Just a couple of months ago it had sub-pixel precision issues, and now it shows the most advanced features.

I wouldn't really say it has shader support though. :blush: You can derive a class from a PixelShader class and implement the execute() function. This allows you to simply write the 'shader' in C++. In that sense, every software renderer would have shader support before even one line of it is implemented.

Anyway, it seems like it could become an open-source 'reference rasterizer'...

### #25 Nick

Senior Member

• Members
• 1227 posts

Posted 16 September 2004 - 02:28 PM

The article is now also available on the swShader site:

If there's any part that can be improved, like things that are unclear and need more explanation, please let me know!

### #26 davepermen

Senior Member

• Members
• 1306 posts

Posted 16 September 2004 - 09:40 PM

can't wait for a first public 0.5.0 online :D especially with HT support :D
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....

### #27 Nick

Senior Member

• Members
• 1227 posts

Posted 16 September 2004 - 11:10 PM

davepermen said:

can't wait for a first public 0.5.0 online :D especially with HT support :D
Unfortunately it won't have HT support, nor 64-bit extensions (specifically the extra registers). I'm still stuck with a Celeron 1.2 GHz as my main system. :blink:

I don't have the money for an upgrade, but even then it would be a dilemma. The really interesting processors with a combination of all the latest technology will only be available and affordable in 2006 or so. Realistically, I'm looking forward to nForce 4 motherboards with PCI-Express and socket 939. I could then have an Athlon 64, and next year upgrade to dual-core. Dual Xeon with EM64T has also crossed my mind...

### #28 davepermen

Senior Member

• Members
• 1306 posts

Posted 17 September 2004 - 04:48 AM

well, you can try to implement multi-threading anyway, if you want.. i can tell you then if it works.. at work, we have some P4s with HT..

64-bit, well, yeah, i have to wait for that as well... it will help you a lot, thanks to the registers.. it will help a lot thanks to the huge memory space, too... performance should be great on them. oh, and, native 64-bit integer math would allow much bigger render targets, too.. (with subpixels taken care of).

i'd be happy to send you some pc or such, but, well.. i'm on a 2.7 GHz Celeron myself. intel's proof that MHz doesn't really matter at all... :(

### #29 davepermen

Senior Member

• Members
• 1306 posts

Posted 23 September 2004 - 09:31 PM

hey, nick, i've read you can enable the 64-bit registers even on 32-bit OSes!! and possibly even on Semprons!

that'll rock. Semprons are hell cheap..

### #30 Dia

DevMaster Staff

• 1120 posts

Posted 27 September 2004 - 04:18 AM

Nick, I added a link to this Code Spotlight submission to the articles section, since it deserves to be an article of its own.

I hope you have more of these to share with us!

### #31 Nick

Senior Member

• Members
• 1227 posts

Posted 27 September 2004 - 02:33 PM

Thanks!

I have plans to write about projection and clipping. That's another stage in the rendering pipeline that is usually handled by the hardware, but again, deeper knowledge of it is useful beyond software rendering.

### #32 cdgray

New Member

• Members
• 21 posts

Posted 28 September 2004 - 03:40 AM

hey Nick: which profiler do you use to test the efficiency of your code? I would like to know where the bottleneck in the code is, to know where to start optimizations.

How much assembly (and MMX, SIMD, etc.) should one know in order to implement an efficient software renderer? I know little assembly (but I'm an expert in C/C++) and I'm wondering if assembly is a prerequisite for such things.

btw, I personally use DevPartner and it's great, especially for memory leaks.

### #33 Mihail121

Senior Member

• Members
• 1059 posts

Posted 28 September 2004 - 06:05 AM

Hey Nick and DevMaster staff, the article is missing something! Can't these image links be opened directly inside the article? I mean.. I would like to see the schemes directly, not have to click around!

### #34 ector

New Member

• Members
• 2 posts

Posted 28 September 2004 - 06:57 AM

Very interesting algorithm, Nick..

A small right-edge optimization to your original simple algorithm; it should work in the blocked one too:


bool done = false;
for(int x = minx; x < maxx; x++)
{
    if(CX1 > 0 && CX2 > 0 && CX3 > 0)
    {
        colorBuffer[x] = 0x00FFFFFF;
        done = true;
    }
    else if(done) break;   // we were inside and just left: past the right edge

    CX1 -= FDY12;
    CX2 -= FDY23;
    CX3 -= FDY31;
}


I haven't done any benchmarking to see if getting rid of, on average, half the time wasted outside the triangle is worth the extra check.. I could easily see it being worth it in an ARM implementation on the GBA, for example, because of its cheap conditionals. Or you could spend some extra code size to split the inner loop into two: the first only steps forward until it finds the triangle, the second fills and exits as soon as the condition becomes false.
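The two-loop split described above could be sketched like this. The names CX1..CX3 (half-space edge-function values at the start of the row) and FDY12..FDY31 (their per-pixel x-steps) follow the article's setup; the `fillRow` function itself is a hypothetical wrapper for illustration, not code from swShader:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the two-loop scanline split. CX1..CX3 are the three edge-function
// values at (minx, y); FDY12..FDY31 are subtracted per pixel, as in the article.
void fillRow(std::vector<uint32_t>& row, int minx, int maxx,
             int CX1, int CX2, int CX3,
             int FDY12, int FDY23, int FDY31)
{
    int x = minx;

    // First loop: step forward until the left edge of the triangle is found.
    for(; x < maxx; x++)
    {
        if(CX1 > 0 && CX2 > 0 && CX3 > 0) break;
        CX1 -= FDY12;
        CX2 -= FDY23;
        CX3 -= FDY31;
    }

    // Second loop: fill, and exit as soon as a pixel fails the test again,
    // since a convex span can only be left once.
    for(; x < maxx; x++)
    {
        if(!(CX1 > 0 && CX2 > 0 && CX3 > 0)) break;  // passed the right edge
        row[x] = 0x00FFFFFF;
        CX1 -= FDY12;
        CX2 -= FDY23;
        CX3 -= FDY31;
    }
}
```

The first loop pays no fill cost while outside the triangle, and the second never scans past the right edge.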

Anyhow, I haven't given it much thought yet, but how would you efficiently interpolate things like texture coordinates and colors over the triangle? How to do it is obvious with the scanline algorithm, but here I guess a little bit of line-equation trickery is needed, and it's too early in the morning for me to work it out.

### #35 Nick

Senior Member

• Members
• 1227 posts

Posted 28 September 2004 - 11:18 AM

cdgray said:

hey Nick: which profiler do you use to test the efficiency of your code? I would like to know where the bottleneck in the code is, to know where to start optimizations.
I'm not using any profiler. Mostly I just disable portions of the code to see how much that influences performance. Or I use simple timers. For the dynamically generated code in my software renderer it's simple: fewer instructions is better. :D

Quote

How much assembly (and MMX, SIMD, etc.) should one know in order to implement an efficient software renderer? I know little assembly (but I'm an expert in C/C++) and I'm wondering if assembly is a prerequisite for such things.
It depends on your definition of efficient. MMX and SIMD give me a speedup of 2 to 4. In my new parallel renderer it will probably be close to 4 everywhere. But dynamic compilation of assembly pipelines is definitely the biggest optimization. It eliminates hundreds of state checks per pixel, while still allowing hundreds of states. The best alternative I know, completely in C++, is to do something like deferred rendering: first sample all the textures, then do all the blending, then the interpolation, etc. This also avoids checking the states for every pixel, but memory bandwidth becomes a problem since you constantly read data, modify it a little and write it back. In a dynamically compiled pipeline everything is processed per pixel and the important data can stay in registers.

Quote

btw, I personally use DevPartner and it's great, especially for memory leaks.
I once tried using it but didn't find it very useful. My home-made tools can do just the same thing, more focused. I love VTune, but unfortunately I still can't afford it.

### #36 Nick

Senior Member

• Members
• 1227 posts

Posted 28 September 2004 - 11:30 AM

ector said:

A small right-edge optimization to your original simple algorithm; it should work in the blocked one too...
It's slightly worth it in C++, but not for the assembly implementation. Determining coverage for a block is already really fast; the extra checks and extra register usage aren't worth it there.

Indeed, on a GBA it could be very useful, though I doubt whether this parallel rasterizer is actually of any use there at all.

Quote

Anyhow, I haven't given it much thought yet, but how would you efficiently interpolate things like texture coordinates and colors over the triangle?
This is quite easy. First determine gradients du/dx, dv/dx, du/dy and dv/dy. Then for every pixel you can determine how far it is from the first vertex: dx = x - x0, dy = y - y0 (note that x0 and y0 are floating-point, x and y are integer). Then the texture coordinates are simply: u = u0 + du/dx * dx + du/dy * dy.

Once you've got the coordinates for one pixel, you obviously only have to add du/dx and/or du/dy to get the texture coordinates for the neighboring pixels. So I do the above calculation once per block, then use this linear interpolation for the other pixels in the block.
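The gradient setup described here can be sketched as follows, assuming the standard plane-equation derivation of du/dx and du/dy from the three (x, y, u) vertex values. The function and struct names are illustrative, not from swShader:

```cpp
#include <cassert>
#include <cmath>

// Screen-space gradients of an attribute u over a triangle.
struct Gradients { float dudx, dudy; };

// Solve the plane through (x0,y0,u0), (x1,y1,u1), (x2,y2,u2) for its x and y
// slopes. The denominator is twice the triangle's signed area (assumed nonzero).
Gradients attributeGradients(float x0, float y0, float u0,
                             float x1, float y1, float u1,
                             float x2, float y2, float u2)
{
    float det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
    Gradients g;
    g.dudx = ((u1 - u0) * (y2 - y0) - (u2 - u0) * (y1 - y0)) / det;
    g.dudy = ((u2 - u0) * (x1 - x0) - (u1 - u0) * (x2 - x0)) / det;
    return g;
}

// Evaluate u at a pixel: u = u0 + du/dx * dx + du/dy * dy, with dx = x - x0,
// dy = y - y0. Done once per block; neighbors then just add dudx or dudy.
float interpolate(float u0, float x0, float y0, const Gradients& g, int x, int y)
{
    return u0 + g.dudx * (x - x0) + g.dudy * (y - y0);
}
```

Note that x0 and y0 stay floating-point while the pixel coordinates are integer, exactly as stated above, which is what gives sub-pixel correct interpolation.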

### #37 ector

New Member

• Members
• 2 posts

Posted 28 September 2004 - 09:39 PM

Nick said:

Indeed, on a GBA it could be very useful, though I doubt whether this parallel rasterizer is actually of any use there at all.
Well, it's a way of rasterizing subpixel-correct triangles with cheap setup and no divisions (the GBA has no hardware divider).. and your blocked algorithm could be very interesting when rasterizing to tilemaps: you could fill a block instantly with just a map index set. That would only be useful for really big flat triangles, though.

Quote

This is quite easy. First determine gradients du/dx, dv/dx, du/dy and dv/dy. Then for every pixel you can determine how far it is from the first vertex: dx = x - x0, dy = y - y0 (note that x0 and y0 are floating-point, x and y are integer). Then the texture coordinates are simply: u = u0 + du/dx * dx + du/dy * dy.
Once you got the coordinates for one pixel you obviously only have to add du/dx and/or du/dy to get the texture coordinates for the neighboring pixels. So I do the above calculation once per block, then use this linear interpolation for the other pixels in the block.
Yeah, of course. I realized that's exactly how you would do it immediately after posting. Thanks for putting it into words for the enjoyment of other readers, though.

### #38 Ravyne

New Member

• Members
• 5 posts

Posted 15 October 2004 - 03:08 AM

Nick:
I'm new to these boards but have been very interested in software rasterization for a long while. I have my own project in the early stages; hopefully I'll find more time to work on it, but I'm back in school now.

Anyhow, in graphics class today I was thinking about the technique in your article. I had a moment of potential inspiration and ended up with somewhat of a branch from your technique; I wonder what your thoughts might be on it.

What I propose is that, given a triangle, you create the smallest square bounding box which is a power of 2 along both axes and inclusive of the three points. In the case of a 2x2 box, the half-spaces are immediately calculated for each corner and each of the 4 pixels is lit accordingly. Otherwise, the box is subdivided recursively into 4 equal, square portions. The half-space values are calculated for the corners and used to determine whether the box is completely lit, rejected, or partially lit, as in your technique. Partially lit boxes are further subdivided until they form a 2x2 box; the calculated half-space values are then used to light each of the 4 pixels accordingly.
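The recursion described above could be sketched like this. This is a hypothetical scalar version, not the proposal itself: each edge is a linear function E(x, y) = A*x + B*y + C that is positive inside, and since a linear function over a box attains its extrema at the corners, the four corner signs classify the whole box exactly:

```cpp
#include <array>
#include <cassert>
#include <vector>

// One triangle edge as a half-space: E(x, y) = A*x + B*y + C, positive inside.
struct Edge { float A, B, C; float eval(float x, float y) const { return A * x + B * y + C; } };

// Recursively rasterize a size x size box at (x, y); size is a power of 2.
void rasterize(std::vector<std::vector<int>>& buf,
               const std::array<Edge, 3>& e, int x, int y, int size)
{
    if(size == 1)   // base case: test the single pixel's center
    {
        float cx = x + 0.5f, cy = y + 0.5f;
        if(e[0].eval(cx, cy) > 0 && e[1].eval(cx, cy) > 0 && e[2].eval(cx, cy) > 0)
            buf[y][x] = 1;
        return;
    }

    bool fullyInside = true;
    for(const Edge& edge : e)
    {
        // Corner signs are exact for a linear function over the box.
        int in = (edge.eval(x, y) > 0) + (edge.eval(x + size, y) > 0)
               + (edge.eval(x, y + size) > 0) + (edge.eval(x + size, y + size) > 0);
        if(in == 0) return;            // whole box outside this edge: reject
        if(in < 4) fullyInside = false;
    }

    if(fullyInside)                     // whole box covered: fill wholesale
    {
        for(int j = y; j < y + size; j++)
            for(int i = x; i < x + size; i++)
                buf[j][i] = 1;
        return;
    }

    int h = size / 2;                   // partial coverage: recurse into quadrants
    rasterize(buf, e, x,     y,     h);
    rasterize(buf, e, x + h, y,     h);
    rasterize(buf, e, x,     y + h, h);
    rasterize(buf, e, x + h, y + h, h);
}
```

Large fully covered or fully rejected quadrants are resolved with just 12 edge evaluations each, which is exactly where the claimed benefit over a fixed 8x8 scan comes from.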

This technique has the advantage that the largest possible boxes are accepted/rejected as early as possible. 2x2 and 4x4 boxes produce 4 and 16 half-space checks respectively, and 8x8 boxes produce 80 half-space checks; boxes larger than 8x8 are where the benefit kicks in, as the likelihood of finding large boxes to accept/reject increases. Large triangles become quite efficient. I also believe that this technique can lend itself to parallel processing as well.

I would be glad to hear your input on this technique; perhaps it's already been done and I am unaware of it. Did you choose 8x8 boxes for other benefits, like per-box lighting calculations perhaps? How well do you think this technique might perform? If it interests you, I would be glad to discuss it further.

Should this idea be something new (and, more importantly, a good idea), I'd call it "rasterization by recursive bilateral reduction." Also, aside from triangles, this approach, and yours as well I believe, can be made to apply to any convex n-gon.

### #39 Nick

Senior Member

• Members
• 1227 posts

Posted 15 October 2004 - 08:04 AM

Hi Ravyne,

Theoretically your algorithm has a lot of value. Indeed, it would be the most efficient of all, especially for gigantic resolutions. However, I have tried similar techniques in practice and they all turned out to have more or less the same performance as the 8x8 block method. The reason is that, although the recursive algorithms are algorithmically more efficient, they are computationally more expensive. To make a close analogy: it's like using quick-sort for three elements. :wink:

Like you say, it only starts to really pay off for boxes larger than 8x8. But for big polygons, the rasterization cost is relatively small compared to the time spent in the pixel pipelines. An 8x8 block is skipped in around 10 clock cycles or so, while shading an 8x8 block of pixels with one texture takes well over 1,000 clock cycles. So from this perspective I don't think it's all that useful to look for more efficient rasterization.

But the primary reason why I'm still using 8x8 blocks is actually the implementation complexity. As you can see in the article, it is relatively compact and straightforward to implement. All other attempts added some complexity, and this comes at the cost of flexibility. Also, I'd like to keep the setup cost for tiny polygons as low as possible. With a recursive approach it would optimize for the big polygons but make small polygons less efficient. So currently I'm keeping the 8x8 blocks approach, because it still gives me the freedom to refine it afterwards, if necessary.

Another reason is that consistently using 8x8 blocks makes it easy to implement a visibility system without complications. I already had to deal with situations where the render target's dimensions are not a multiple of 8. Using other block sizes would only make this worse, while I believe it wouldn't add much benefit to the visibility algorithm. In 640x480 mode, there are 'only' 80x60 blocks anyway.

The most successful alternative so far is to have a specialized rasterizer for tiny and huge polygons. Tiny polygons often fall into a 2x2 pixel block, so they can be passed to the pixel pipelines directly. This occurs frequently for models with a large number of polygons, far away. For huge polygons, several 8x8 blocks can be skipped at once by computing exact start and end values of the polygon edges per row. This requires a slow division though, so unfortunately I haven't seen much performance increase from it at medium resolutions.

But your idea is definitely valuable for big resolutions. For non-real-time rendering with anti-aliasing I believe an approach like that becomes a necessity. Either way, I'll keep you updated on any new findings that apply to my software renderer, and I wish you good luck with your project!

### #40 Mario

New Member

• Members
• 5 posts

Posted 05 November 2004 - 08:44 PM

Greetingz

Needless to say, your article precisely describes a fast and efficient method for such a basic routine as triangle rasterization. Simply, nothing to add, nothing to remove. However, it wouldn't be smart to use it in place of existing solutions. The majority of programmers avoid diving into low-level code, which is generally implemented as a hardware service. As a matter of fact, the presented algorithm was used in compact asm routines in simple 80x86 3D intros/games, but it has been largely forgotten since the development of hardware dedicated to accelerating graphics. Nowadays, modern 3D applications aren't even similar to those from the past.

Discussions about optimizing the basics aren't as productive as they used to be. Algorithmic analysis and theoretical complexity evaluations don't produce many interesting results. The only tool a programmer can use is a profiler (or other time-based feedback, of course), whose role is to find bottlenecks; these can only be found empirically. My experience has taught me that analysing code isn't a good method for determining time complexity. What usually looks inefficient is often fast. Simply, I believe in the 80/20 rule.

Despite the fact that the presented algorithm is satisfying, it doesn't change anything.

Best regards
- Mario
