187 replies to this topic

### #1Nick

Senior Member

• Members
• 1227 posts

Posted 11 September 2004 - 06:08 PM

It's quite surprising that the most elementary rendering operations, like rasterization, are so poorly documented. Graphics cards take care of most of this, so people have forgotten how it actually works. But there are still many reasons why understanding the inner workings of the graphics pipeline is useful. Not only for software rendering, but for any situation where this type of operation is required. Knowledge of these algorithms simply makes you a better graphics programmer.

Today I present you the theory and implementation of an advanced rasterizer. It is advanced in the sense that it has many nice properties the classical scanline conversion algorithm does not have. The main problem with the old algorithm is that it's hard to process pixels in parallel. It identifies filled scanlines, but this is only suited for processing one pixel at a time. A much more efficient approach is to process 2x2 pixels together. These are called quads. By sharing some setup cost per quad, and using advanced parallel instructions, this results in a significant speedup. Some of the current graphics hardware also uses quad pixel pipelines.

Our starting point is the half-space function approach. Before I go into details, all you have to know is that a half-space function is positive on one side of a line, and negative on the other. So it splits the screen in half, hence the name. Here's a small illustration of it, where the edges of a triangle each split the screen into a positive and a negative part: half-space.png. Locations where all three half-space functions are positive define the inside of the triangle. When a half-space function is zero, we're on an edge. When any of them is negative, we are outside the triangle. So this can be used for rasterization. If we can find formulas for the half-space functions, and evaluate them for each pixel location, we know which pixels need to be filled.

So what formula is positive on one side of a line and negative on the other? Surprisingly, the equation of a line is exactly what we're looking for. For a segment with starting coordinates (x1, y1) and end coordinates (x2, y2), the equation is:

(x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) = 0

You can easily verify that this equation is true for any point (x, y) on this line (for example the start and end points themselves). But the left-hand side also behaves perfectly as a half-space function. It is positive for (x, y) coordinates on one side, and negative on the other. It's zero on the line itself, which effectively defines the line in the above equation. Verifying this behaviour is left as an exercise for the reader. Let's write some code!

```cpp
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    float y1 = v1.y;
    float y2 = v2.y;
    float y3 = v3.y;

    float x1 = v1.x;
    float x2 = v2.x;
    float x3 = v3.x;

    // Bounding rectangle
    int minx = (int)min(x1, x2, x3);
    int maxx = (int)max(x1, x2, x3);
    int miny = (int)min(y1, y2, y3);
    int maxy = (int)max(y1, y2, y3);

    (char*&)colorBuffer += miny * stride;

    // Scan through bounding rectangle
    for(int y = miny; y < maxy; y++)
    {
        for(int x = minx; x < maxx; x++)
        {
            // When all half-space functions positive, pixel is in triangle
            if((x1 - x2) * (y - y1) - (y1 - y2) * (x - x1) > 0 &&
               (x2 - x3) * (y - y2) - (y2 - y3) * (x - x2) > 0 &&
               (x3 - x1) * (y - y3) - (y3 - y1) * (x - x3) > 0)
            {
                colorBuffer[x] = 0x00FFFFFF; // White
            }
        }

        (char*&)colorBuffer += stride;
    }
}
```


This is the simplest working implementation of the half-space functions algorithm. Before I continue to explain how it can be improved, there are a few things to note. It would be possible to scan the whole screen, to see which pixels are inside the triangle, but this is of course a waste of time. Therefore only the smallest rectangle surrounding the triangle is scanned. You might also have noticed that I swapped some components in the formula. This is because in screen coordinates, the y-axis points downward. It is also very important that the triangle's vertices are in counter-clockwise order. This makes sure that all the positive sides of the half-spaces point inside the triangle.

Unfortunately, this implementation isn't fast at all. For every pixel, six multiplications and no less than fifteen subtractions are required. But don't worry, this can be optimized greatly. Per horizontal line that we scan, only the x component in the formula changes. For the first edge, this means the half-space function value changes by just (y1 - y2) from the previous pixel. That's a single subtraction per edge per pixel! Furthermore, the vertex coordinates are 'constant' per triangle, so (y1 - y2) and all other pairs like it only have to be computed once per triangle. Stepping in the y direction is likewise just the addition of a constant. Let's implement this:

```cpp
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    float y1 = v1.y;
    float y2 = v2.y;
    float y3 = v3.y;

    float x1 = v1.x;
    float x2 = v2.x;
    float x3 = v3.x;

    // Deltas
    float Dx12 = x1 - x2;
    float Dx23 = x2 - x3;
    float Dx31 = x3 - x1;

    float Dy12 = y1 - y2;
    float Dy23 = y2 - y3;
    float Dy31 = y3 - y1;

    // Bounding rectangle
    int minx = (int)min(x1, x2, x3);
    int maxx = (int)max(x1, x2, x3);
    int miny = (int)min(y1, y2, y3);
    int maxy = (int)max(y1, y2, y3);

    (char*&)colorBuffer += miny * stride;

    // Constant part of half-edge functions
    float C1 = Dy12 * x1 - Dx12 * y1;
    float C2 = Dy23 * x2 - Dx23 * y2;
    float C3 = Dy31 * x3 - Dx31 * y3;

    float Cy1 = C1 + Dx12 * miny - Dy12 * minx;
    float Cy2 = C2 + Dx23 * miny - Dy23 * minx;
    float Cy3 = C3 + Dx31 * miny - Dy31 * minx;

    // Scan through bounding rectangle
    for(int y = miny; y < maxy; y++)
    {
        // Start value for horizontal scan
        float Cx1 = Cy1;
        float Cx2 = Cy2;
        float Cx3 = Cy3;

        for(int x = minx; x < maxx; x++)
        {
            if(Cx1 > 0 && Cx2 > 0 && Cx3 > 0)
            {
                colorBuffer[x] = 0x00FFFFFF; // White
            }

            Cx1 -= Dy12;
            Cx2 -= Dy23;
            Cx3 -= Dy31;
        }

        Cy1 += Dx12;
        Cy2 += Dx23;
        Cy3 += Dx31;

        (char*&)colorBuffer += stride;
    }
}
```


The C1-C3 variables are the constant part of the half-space equations and are not recomputed per pixel. The Cy1-Cy3 variables determine the 'starting value' of the half-space functions at the top of the bounding rectangle. And finally the Cx1-Cx3 variables are the start values per horizontal scan. Only these are incremented (actually decremented) with the delta values every pixel. So testing whether a pixel is inside the triangle has become really cheap: aside from some setup per horizontal line and per triangle that is negligible, just three subtractions and three comparisons against zero.

Ok, so this algorithm is starting to show its usefulness. But there still are some serious issues to solve. Here's an actual image created using this code: first_try.png. If it isn't clear that this is actually a car, here's the shaded reference image: reference.png. As you can see, there are many gaps between the polygons. The cause is twofold: precision, and the fill convention.

There are precision issues because floating-point numbers are quite limited. They have 24 bits of precision. If you look at the half-space function, you see that we do several subtractions and multiplications. That's the perfect recipe for precision problems. What most people would do in such a situation is use double-precision floating-point numbers. While this makes the issue unnoticeable, it isn't exactly perfect and is also slower. The real solution might sound crazy: use integers! Integers have 32 bits of precision and do not suffer from precision loss when subtracting. Instead of floating-point, we'll use fixed-point arithmetic to assure we get sub-pixel accuracy.

The second cause of the gaps is the fill convention. The half-space functions can be positive or negative, but also zero. In this case, the pixel is exactly on the edge. Until now, we've ignored this case and treated it as a pixel outside. We could treat it as inside, using >= comparators, but this isn't correct either. What will happen is that pixels on the edge between two triangles will be drawn twice. While this may sound more acceptable than gaps, it causes some serious artifacts for things like transparency and stenciling.

What we'll use here is a top-left fill convention, just like DirectX. This means that all edges which are on the left side of the triangle, or form a horizontal top, are treated as inside the triangle. This convention assures that no gaps occur, and that no pixel is drawn twice. Let's first see how we can detect whether an edge is 'top-left'. For humans, it's really easy to recognise them. Here are a few examples: top-left.png. I'm sure nobody had trouble identifying the top-left edges; it's really simple. But give yourself a minute to think about how you would detect this in code... Don't look before you've tried it, but here's the answer: detect.png. The arrows indicate the counter-clockwise order of the vertices. Note how for left edges they all point downward! To detect this in code we merely have to check the sign of the Dy12-Dy31 values. The only exception is the top edge: there the Dy value is zero, but the Dx value is positive.

Now that we know how to detect top-left edges, we still have to deal with them correctly so pixels on these edges are treated as inside. What we actually want to do is change the Cx# > 0 statements into Cx# >= 0 statements. Testing for top-left edges per pixel is way too slow, and handling all these cases in separate loops is way too complex. But look at what Cx# > 0 fundamentally is. It is 'true' when Cx# is 1, 2, 3, etc., and 'false' when Cx# is 0, -1, -2, etc. All we want is to make it true when Cx# is zero as well. The solution is too simple for words: since we're now working with integers, all we have to do is add 1 to Cx# beforehand, which turns the > 0 test into an effective >= 0 test for those edges. This can even be done earlier, with the C1-C3 variables! Ok, let's sum this all up:

```cpp
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    // 28.4 fixed-point coordinates
    const int Y1 = iround(16.0f * v1.y);
    const int Y2 = iround(16.0f * v2.y);
    const int Y3 = iround(16.0f * v3.y);

    const int X1 = iround(16.0f * v1.x);
    const int X2 = iround(16.0f * v2.x);
    const int X3 = iround(16.0f * v3.x);

    // Deltas
    const int DX12 = X1 - X2;
    const int DX23 = X2 - X3;
    const int DX31 = X3 - X1;

    const int DY12 = Y1 - Y2;
    const int DY23 = Y2 - Y3;
    const int DY31 = Y3 - Y1;

    // Fixed-point deltas
    const int FDX12 = DX12 << 4;
    const int FDX23 = DX23 << 4;
    const int FDX31 = DX31 << 4;

    const int FDY12 = DY12 << 4;
    const int FDY23 = DY23 << 4;
    const int FDY31 = DY31 << 4;

    // Bounding rectangle
    int minx = (min(X1, X2, X3) + 0xF) >> 4;
    int maxx = (max(X1, X2, X3) + 0xF) >> 4;
    int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;
    int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;

    (char*&)colorBuffer += miny * stride;

    // Half-edge constants
    int C1 = DY12 * X1 - DX12 * Y1;
    int C2 = DY23 * X2 - DX23 * Y2;
    int C3 = DY31 * X3 - DX31 * Y3;

    // Correct for fill convention
    if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;
    if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;
    if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;

    int CY1 = C1 + DX12 * (miny << 4) - DY12 * (minx << 4);
    int CY2 = C2 + DX23 * (miny << 4) - DY23 * (minx << 4);
    int CY3 = C3 + DX31 * (miny << 4) - DY31 * (minx << 4);

    for(int y = miny; y < maxy; y++)
    {
        int CX1 = CY1;
        int CX2 = CY2;
        int CX3 = CY3;

        for(int x = minx; x < maxx; x++)
        {
            if(CX1 > 0 && CX2 > 0 && CX3 > 0)
            {
                colorBuffer[x] = 0x00FFFFFF;
            }

            CX1 -= FDY12;
            CX2 -= FDY23;
            CX3 -= FDY31;
        }

        CY1 += FDX12;
        CY2 += FDX23;
        CY3 += FDX31;

        (char*&)colorBuffer += stride;
    }
}
```


Here's the end result: correct.png. Not only is it now entirely flawless, it's also faster than the floating-point version! The range of the fixed-point integers is big enough for a 2048x2048 color buffer, with 4 bits of sub-pixel precision. Now that we've got this fixed, let's focus on performance again...

This implementation is perfect for small triangles. The setup cost per triangle is really low compared to the scanline conversion algorithm. Unfortunately, it's not optimal for big triangles. About half of all pixels will be outside the triangle, but we still pay the price of evaluating the half-space functions there. Furthermore, until now I've only talked about drawing one pixel at a time, while the real benefit of this approach is the potential for parallelism. I'll show you how to take advantage of this, and what other benefits we get from it.

What we really want to do is quickly skip pixels outside the triangle. We don't want to waste any time evaluating half-space functions there. We also don't want to spend too much time inside the triangle. The really interesting part is around the edges. So what we'll do is quickly detect whether 8x8 blocks are not covered, partially covered, or fully covered. Non-covered blocks can quickly be rejected, and fully covered blocks quickly accepted; partially covered blocks get scanned completely. There's another reason for this block-based approach: it is very useful for visibility determination algorithms. An extra benefit is that memory accesses become more localized.

So how do we detect coverage as fast as possible? With the half-space functions, it's really easy. All we have to do is evaluate the half-space functions at the corners of the blocks. A non-covered block has negative half-space values at all four corners for at least one edge. A completely covered block has positive half-space values at every corner for all edges. Everything else is a partially covered block. So the final implementation is:

```cpp
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    // 28.4 fixed-point coordinates
    const int Y1 = iround(16.0f * v1.y);
    const int Y2 = iround(16.0f * v2.y);
    const int Y3 = iround(16.0f * v3.y);

    const int X1 = iround(16.0f * v1.x);
    const int X2 = iround(16.0f * v2.x);
    const int X3 = iround(16.0f * v3.x);

    // Deltas
    const int DX12 = X1 - X2;
    const int DX23 = X2 - X3;
    const int DX31 = X3 - X1;

    const int DY12 = Y1 - Y2;
    const int DY23 = Y2 - Y3;
    const int DY31 = Y3 - Y1;

    // Fixed-point deltas
    const int FDX12 = DX12 << 4;
    const int FDX23 = DX23 << 4;
    const int FDX31 = DX31 << 4;

    const int FDY12 = DY12 << 4;
    const int FDY23 = DY23 << 4;
    const int FDY31 = DY31 << 4;

    // Bounding rectangle
    int minx = (min(X1, X2, X3) + 0xF) >> 4;
    int maxx = (max(X1, X2, X3) + 0xF) >> 4;
    int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;
    int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;

    // Block size, standard 8x8 (must be power of two)
    const int q = 8;

    // Start in corner of 8x8 block
    minx &= ~(q - 1);
    miny &= ~(q - 1);

    (char*&)colorBuffer += miny * stride;

    // Half-edge constants
    int C1 = DY12 * X1 - DX12 * Y1;
    int C2 = DY23 * X2 - DX23 * Y2;
    int C3 = DY31 * X3 - DX31 * Y3;

    // Correct for fill convention
    if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;
    if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;
    if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;

    // Loop through blocks
    for(int y = miny; y < maxy; y += q)
    {
        for(int x = minx; x < maxx; x += q)
        {
            // Corners of block
            int x0 = x << 4;
            int x1 = (x + q - 1) << 4;
            int y0 = y << 4;
            int y1 = (y + q - 1) << 4;

            // Evaluate half-space functions
            bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0;
            bool a10 = C1 + DX12 * y0 - DY12 * x1 > 0;
            bool a01 = C1 + DX12 * y1 - DY12 * x0 > 0;
            bool a11 = C1 + DX12 * y1 - DY12 * x1 > 0;
            int a = (a00 << 0) | (a10 << 1) | (a01 << 2) | (a11 << 3);

            bool b00 = C2 + DX23 * y0 - DY23 * x0 > 0;
            bool b10 = C2 + DX23 * y0 - DY23 * x1 > 0;
            bool b01 = C2 + DX23 * y1 - DY23 * x0 > 0;
            bool b11 = C2 + DX23 * y1 - DY23 * x1 > 0;
            int b = (b00 << 0) | (b10 << 1) | (b01 << 2) | (b11 << 3);

            bool c00 = C3 + DX31 * y0 - DY31 * x0 > 0;
            bool c10 = C3 + DX31 * y0 - DY31 * x1 > 0;
            bool c01 = C3 + DX31 * y1 - DY31 * x0 > 0;
            bool c11 = C3 + DX31 * y1 - DY31 * x1 > 0;
            int c = (c00 << 0) | (c10 << 1) | (c01 << 2) | (c11 << 3);

            // Skip block when outside an edge
            if(a == 0x0 || b == 0x0 || c == 0x0) continue;

            unsigned int *buffer = colorBuffer;

            // Accept whole block when totally covered
            if(a == 0xF && b == 0xF && c == 0xF)
            {
                for(int iy = 0; iy < q; iy++)
                {
                    for(int ix = x; ix < x + q; ix++)
                    {
                        buffer[ix] = 0x00007F00; // Green
                    }

                    (char*&)buffer += stride;
                }
            }
            else // Partially covered block
            {
                int CY1 = C1 + DX12 * y0 - DY12 * x0;
                int CY2 = C2 + DX23 * y0 - DY23 * x0;
                int CY3 = C3 + DX31 * y0 - DY31 * x0;

                for(int iy = y; iy < y + q; iy++)
                {
                    int CX1 = CY1;
                    int CX2 = CY2;
                    int CX3 = CY3;

                    for(int ix = x; ix < x + q; ix++)
                    {
                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)
                        {
                            buffer[ix] = 0x0000007F; // Blue
                        }

                        CX1 -= FDY12;
                        CX2 -= FDY23;
                        CX3 -= FDY31;
                    }

                    CY1 += FDX12;
                    CY2 += FDX23;
                    CY3 += FDX31;

                    (char*&)buffer += stride;
                }
            }
        }

        (char*&)colorBuffer += q * stride;
    }
}
```



Note that I scan partially covered blocks completely, all 8x8 pixels. This is for consistency with the other blocks, so they can be processed by the same visibility algorithm. Also, this is extremely fast when done in an unrolled loop, using assembly instructions. All further optimizations to this algorithm are best done in assembly anyway. So now the rasterizer can output coverage masks for 8x8 pixels. These can then easily be processed by the pixel pipeline(s). It's easy to process them as a 4x4 grid of 2x2 quads, and many calculations can even be done per block. Everything taken together, there is no reason left to use the old scanline conversion algorithm.

Enjoy!

Nicolas "Nick" Capens

### #2Mihail121

Senior Member

• Members
• 1059 posts

Posted 11 September 2004 - 07:21 PM

agreed with most of it, but:

As far as i know, the Pentium and better CPUs actually work with 80-bit float numbers faster than with 32- and 64-bit ones. Of course i know a few languages which make use of the extended format.

There are a few other things which i'm not yet sure about.. but the article is really good! Like everything in the _nick_ style ;)

### #3Nick

Senior Member

• Members
• 1227 posts

Posted 11 September 2004 - 07:45 PM

Mihail121 said:

As far as i know, the Pentium and better CPUs actually work with 80-bit float numbers faster than with 32- and 64-bit ones. Of course i know a few languages which make use of the extended format.
I'm really happy with the performance of my current version, and I don't think 80-bit floating-point can beat it. Anyway, I'm going to optimize all of it with MMX, which will already probably make it twice as fast... ;)

### #4Noor

Senior Member

• Members
• 503 posts

Posted 11 September 2004 - 07:52 PM

this can be turned into an article :yes:
"What ever happened to happily ever after?"

### #5Mihail121

Senior Member

• Members
• 1059 posts

Posted 11 September 2004 - 07:54 PM

Bah... too bad there ain't MMX/3DNow! on those damn nokias....

### #6z80

Valued Member

• Members
• 104 posts

Posted 11 September 2004 - 08:42 PM

Good work, Nick! Very interesting! :yes:

### #7john

Member

• Members
• 84 posts

Posted 11 September 2004 - 09:16 PM

Impressive work Nick!
Could you provide some figures that reveal the performance (and compare them with traditional rasterizers)?
How do graphics chips currently do this? I wonder if graphics chip vendors would benefit if your algorithm was implemented in hardware.
Have you read the book by André LaMothe:
Tricks of the 3D Game Programming Gurus - Advanced 3D Graphics and Rasterization. What is your recommendation for using it as a learning reference for software rendering?

### #8Nick

Senior Member

• Members
• 1227 posts

Posted 12 September 2004 - 07:50 AM

john said:

Impressive work Nick!
I hope I haven't upped the standard too much. ;)

Quote

Could you provide some figures that reveal the performance (and compare them with traditional rasterizers)?
For this car model, which has only small polygons, performance is equivalent when using the 8x8 block C++ version. For rendering a skybox (big polygons), performance is a bit lower (it beats the traditional rasterizer when using 16x16 blocks though). But let's put these mediocre results in the right perspective. This algorithm enables parallel pixel processing. For example, stencil shadows. When an 8x8 block is completely inside the polygon, we can process this block with 64-bit MMX instructions in a flash. So although rasterization is sometimes a bit slower, the classical rasterizer cannot achieve this fillrate when processing one pixel at a time. Also, the C++ version presented here has sub-optimal code for detecting block coverage. I am very sure the MMX version will be able to do it many times faster (it has a pmaddwd instruction which does -four- multiplications in parallel and adds pairs; merely two of these instructions are needed to test one edge). Either way, even if it needs some tweaking, the advantages far outweigh the disadvantages.

If you want numbers, I suggest using swShader, and plugging this code into it.

Quote

How do graphics chips currently do this? I wonder if graphics chip vendors would benefit if your algorithm was implemented in hardware.
A lot of hardware has quad processing pipelines. It's quite probable that they already use this algorithm or a variant more optimal for hardware implementation.

Quote

Have you read the book by André LaMothe:
Tricks of the 3D Game Programming Gurus - Advanced 3D Graphics and Rasterization. What is your recommendation for using it as a learning reference for software rendering?
I bought it. It was a waste of my money. It didn't teach me anything new, and on every page I was able to either spot an error or think of a fundamentally better approach. And it's a big book, but that's only because the whole code is listed in it.

Anyway, it isn't all bad. I'm sure that for someone who has no experience with software rendering it's quite good. He writes in a 'popular' way so it's easy to read and understand. It nicely bundles all the primary problems someone has to solve to write a software renderer. Since there are no other books like this, it is automatically the best you can get. I would have scratched the 'Advanced' out of the subtitle though...

### #9Mihail121

Senior Member

• Members
• 1059 posts

Posted 12 September 2004 - 10:10 AM

Agreed with Nick on LaMothe's book: 75% of it is code, wrong story about lightmaps, no proper organization, no nothing!

### #10anubis

Senior Member

• Members
• 2225 posts

Posted 12 September 2004 - 01:02 PM

Quote

I bought it. It was a waste of my money. It didn't teach me anything new, and on every page I was able to either spot an error or think of a fundamentally better approach. And it's a big book, but that's only because the whole code is listed in it.

i was surprised how many errors (i mean real mathematical errors, not typos) it contains. this has to be confusing for somebody without any experience.
If Prolog is the answer, what is the question ?

### #11davepermen

Senior Member

• Members
• 1306 posts

Posted 12 September 2004 - 07:00 PM

very nicely done, nick. i think this shows again that premature optimisation isn't everything; optimizing the algorithm is what counts.

now rasterizing works on a completely different level somehow; it's not about the individual pixel anymore, but really just the triangle.

it reminds me much of raytracing, this algorithm. the logic is the same, just with tons of restrictions on what rays you trace :D but it's the same, with the same optimisation possibilities. process full blocks of rays in parallel, skip blocks that are useless, etc..

it all sounds familiar.. can't wait to see another performance boost in sw-shader :D
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....

### #12davepermen

Senior Member

• Members
• 1306 posts

Posted 12 September 2004 - 07:57 PM

hehe.. you know what? your rasterizer would rock quite a bit on this system:

i mean.. yeah! :D 16 parallel processors (actually 4 processors with 2 cores and 2 hyperthreads each). there, your parallelisation will get onto a next, higher level..

yehah! :D
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....

### #13SnprBoB86

Valued Member

• Members
• 112 posts

Posted 12 September 2004 - 10:51 PM

Very interesting article. Confirmed how I thought this worked, and then extended it with optimizations that are very nice to know. Thanks.

@davepermen:
Where did you get that screen? I would be interested to know who has the cash for that many processors, but not for a monitor large enough to show all the CPU usage graphs ;)
Brandon Bloom
http://brandonbloom.name

### #14davepermen

Senior Member

• Members
• 1306 posts

Posted 13 September 2004 - 04:48 AM

Intel Developer Forum report at anandtech (just right-click and check the file properties to verify).

they aren't that expensive actually. both amd and intel are starting to show off their new multicore cpus, for home and servers. this one is actually an ITANIUM dual core at 1.7ghz each, i think. each core has 2 hyperthreads, just as the newer p4 have, and the board is a 4x itanium board.. hence 4x2x2 hyperthreads, 4x2 cores, 4 cpus.

right now, it's just a "we have it" proof demo. but it's planned for next year to start shipping. ITANIUMs, Xeons, and Pentiums, all with dual cores.

AMD on its side will show off Opterons with multicore (2x, 4x), and athlon 64, too.. should be available next year, too.. (they showed off a first sample, too).

so yes, it's some years till every home end user has one. but it's important right now to start thinking about designing with multi-threading, to support hyperthreading today, multicore tomorrow, and multi-cpu for the highest-end systems in the future.

hehe :D
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....

### #15Nick

Senior Member

• Members
• 1227 posts

Posted 13 September 2004 - 06:32 PM

I can't wait for a 4 GHz dual-core 64-bit (double the amount of registers), Hyper-Threading CPU with 1200 MHz dual-channel FSB and built-in memory controller. ;)

### #16Mihail121

Senior Member

• Members
• 1059 posts

Posted 13 September 2004 - 06:56 PM

And when i think that only 12 years have passed since the graphical 'wonders' of Wolf3D...

### #17davepermen

Senior Member

• Members
• 1306 posts

Posted 13 September 2004 - 08:50 PM

Nick said:

I can't wait for a 4 GHz dual-core 64-bit (double the amount of registers), Hyper-Threading CPU with 1200 MHz dual-channel FSB and built-in memory controller. ;)

hehe..

will you one day consider writing some raytracing-stuff? i mean, your renderer moves more and more in the right direction anyway :D

you'll have to at least write a protocol for how each and every 'tile' gets submitted to the right cpu.

btw, if you're quick, you'll be the first one with an api that wraps well onto ps3! :D
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....

### #18john

Member

• Members
• 84 posts

Posted 13 September 2004 - 09:09 PM

not only does Nick's swShader beat quake's software rendering and others, but it beats the crap out of DirectX's REF rasterizer. There's no point in comparing against quake's s/w anyway. Good job Nick! You're going to be a millionaire someday. You're pretty much guaranteed a job at any company you apply for. What's the point of life after that? hehe (j/k) As far as I know, there isn't a single commercial or free product which emulates shaders except for swShader (isn't that right?).

I guess there wasn't any incentive for Microsoft to optimize their software renderer since they expect hardware vendors to implement everything in hardware, and the software part is only used as a reference model to compare against.

### #19anubis

Senior Member

• Members
• 2225 posts

Posted 13 September 2004 - 09:29 PM

Quote

You're pretty much guaranteed a job at any company you apply for. What's the point of life after that?

american ?
If Prolog is the answer, what is the question ?

### #20Dia

DevMaster Staff