0
101 Jan 01, 2008 at 13:30

Let me introduce myself:

My name is Ryan Widi Saputra, I'm from Indonesia.

A month ago I started a 3D software rasterizer in Delphi 6, using a Z-buffer and a deferred-like rendering stage. It has many features based on random jitter.

tridi (my project) has software realtime shadow maps, vertex maps and customizable pixel operations (you can manipulate the span drawing code).
Some of the code uses MMX and SSE, but I don't know how to optimize further. And some code is self-modifying (depending on the rendering features needed).

But I think that even after converting some code to MMX/SSE I can't make my renderer any faster. I don't know why; maybe because of the Z-buffer I use (32-bit), maybe the scanline conversion method.

Currently I'm trying to learn the span-buffer algorithm, but no results so far; I don't understand the basics of it.

Will a span buffer make my renderer faster?

http://sourceforge.net/projects/tridi
The latest demo includes all features (see readme.txt for controls). Btw, it needs SDL.DLL.

Thing is, my code is so bad that without calling the scanline conversion function (by commenting it out) it only gets 190 fps at 640x400, and when rendering is turned on it drops to under 50 fps.

Oh, btw, my rendering stages are:
========= INITIALIZATION ===========
1. Transform all vertices to view projection; mark every vertex that has a negative Z.
2. Clipping; may need to generate new vertex data. Mark all polygons outside the visible region (*A). Mark all vertices used by the visible region (*B).

Now we have (*A) and (*B), ready to process.

—INFORMATION—
The rendering process is used by all rendering methods; the only difference is the pixel shader:
1. Convert the marked vertices (*B) to 16.16 fixed point, and convert Z into 1/Z
2. Call the scanline converter for each polygon and shade using a shader (*S)

The renderer itself is a class/object. In the demo there are a lot of renderers (texture, shadow map, light) that work in series and combine the result.

========= RENDERING STAGE ===========
Stage 1: Texture, using renderer (*R1)
(*S) is scanztex, which draws texture only and writes Z
Result: a texture-only buffer

Stage 2: Lighting
Intro: each light has its own renderer (*RL1…*RLn), whose Z-buffer points to (*R1)'s Z-buffer, so they don't need to write the Z-buffer, only read it.

1. Do lighting for the data (*B) and fill the intensity attribute; when needed it can check shadowing using a shadow map (generated every frame for this light).
2. Call the scanline converter.

All light renderers use the same output buffer; every light contributes intensity to this lightmap.

Stage 3: modulate/combine texture and lightmap.
Stage 4: post-processing, such as HDR, bloom, blur, etc.

My demo rendering stages
My demo has many rendering stages that I cannot combine into one stage, because that would make the pixel shader very complex.

// =========== shadow map stage =============
if frame mod 4 = 0 then begin
  scene.reset;
  // submit geometry that casts shadows
  for i:=0 to high(tanklain) do begin
    tanklain[i].submit;
  end;
  sarea.submit;
  scene.drawscene(false,4); // 1 = texture stage, 2 = lighting stage, 4 = shadow map gen
  resetvis;
end;
// =========== main stage =============
scene.reset;

for i:=0 to high(tanklain) do begin
  tanklain[i].submit;
end;

area.submit;
player.submit;

animsmoke; // submit and animate all smoke objects
animdecals;

scene.drawscene(true,1+2); // texture and lighting

// modulate Texture and Light
scene.modulate;

// =========== particle stage, uses custom shader =============
scene.reset;
animbullets;
animexplosion;
scene.drawscene(false,1); // texture only


Sample picture:

The basics of my renderer: generate the texture render and the light render separately,

and then mix them…

Another cool feature:
free antialiasing, by modifying the camera angle slightly frame by frame and blending the two frames.
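In scalar form the blend is just a per-channel average of the two jittered frames (a sketch; the function name is mine, not from tridi):

```c
#include <assert.h>

/* Blend two frames 50/50 per 8-bit channel: the "free AA" step.
   a and b are the same channel from the jittered and un-jittered
   frame (illustrative names, not tridi's). */
static unsigned char blend_aa(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b + 1) >> 1);   /* rounded average */
}
```

Edge pixels that land on different sides of the half-pixel jitter average out, solid areas stay unchanged.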

Shadow mapping in action (no AA; this image is from before I found free AA)

Shadow mapping optimization (per-vertex check: if partially shadowed, render using the shadow map shader, else render using normal Gouraud - it's the wireframe in this image). In the corner is the shadow map (Z-buffer) generated from the tank cannon's muzzle flash.

#### 27 Replies

0
101 Jan 01, 2008 at 13:45

The shader itself is code that you can write in assembler (or not), but I write all my shaders in assembler. For example:

the smoke shader, where the operation per color channel is:

output = saturate((texture-128)*intensity+output)


Intensity lets the texture blend smoothly from visible (when intensity>0) to invisible (when intensity=0).
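As a plain-C reference of that formula (a sketch; I'm assuming intensity is a 6-bit fixed-point factor in 0..64, matching the `psraw mm3,6` in the MMX code below):

```c
#include <assert.h>

/* Scalar reference for the smoke shader, per color channel.
   intensity: 0 = invisible, 64 = fully visible (6-bit fraction,
   an assumption matching the psraw-by-6 in the assembler version). */
static int smoke_channel(int output, int texture, int intensity)
{
    int d = ((texture - 128) * intensity) >> 6;  /* signed contribution */
    int r = output + d;
    if (r < 0)   r = 0;                          /* saturate, like packuswb */
    if (r > 255) r = 255;
    return r;
}
```

A texel of 128 is neutral; brighter texels lighten the output, darker texels darken it, scaled by intensity.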

procedure smokeshader;
asm
push ebp
// prepare varying data
// EAX: texture EDI:Output ESI:Zbuffer ECX:Pixel Count EBP:Varying data
// Varying data 16:16FixedPoint: [ebp+???] can use MMX/SSE here :)
// ??? = _vz _vi _vu _vv _vr _vg  _vzm _vim _vum _vvm _vrm _vgm
// ??m = d??/ecx = varying step

lea esi,[esi+ecx*4]
lea edi,[edi+ecx*4]

neg ecx
mov _esp,esp

mov esp,[ebp+_vz]

movq mm0,[ebp+_vu] // save _vu and _vv in mm0
movq mm1,[ebp+_vum] // get _vum and _vvm

mov edx,[ebp+_vi] // intensity is not varying, it is the same across the polygon
shr edx,16        // (EDX used here: EAX must stay the texture pointer)
mov word[__c3],dx // __c3 = array of word, temporary data in memory
mov word[__c3+2],dx
mov word[__c3+4],dx
movq mm7,__c3
movq mm5,_zero
movq mm6,__c2 // __c2 = const [127,127,127,0]

pslld mm1,1 // using 2x pixel, cheap loop (1/2pixelcount)
and ecx,not 1
jz @e
@repeat:        // no zbuffer is faster, but ...
{$ifndef nozbuf}
cmp esp,dword[esi+ecx*4]
jg @noz
{$endif}
movq [ebp+_vu],mm0
movzx edx,byte[ebp+_vv+2]
movzx ebx,byte[ebp+_vu+2]

shl edx,10

movd mm2,[edi+ecx*4]
lea edx,[edx+eax]
movd mm3,[edx+ebx*4]

punpcklbw mm2,mm5
punpcklbw mm3,mm5
psubsw mm3,mm6
pmullw mm3,mm7
psraw mm3,6
paddsw mm2,mm3 // output += (texture-127)*intensity
packuswb mm2,mm5
movd [edi+ecx*4],mm2
movd [edi+ecx*4+4],mm2
@noz:
paddd mm0,mm1 // step u,v to the next pixel pair
add ecx,2     // advance two pixels; sets ZF when done
jnz @repeat                 // loop
@e:
mov esp,_esp
pop ebp
end;

0
101 Jan 01, 2008 at 16:09

Hi ryannining,

In my rendering routines the speed is limited by the amount of memory I have to access. So ask yourself if you really need a 32-bit z-buffer, 32-bit textures and 32-bit render targets.

I've seen you're doing deferred shading, so you might get away with an 8-bit render target for the lights and a 16-bit render target for the colors. These can then be mixed into a 32-bit format for display. Give it a try; it should be faster.

In any case you should prefetch some pixels from the z-buffer and frontbuffer for the next line at the start of your scanline loop. This makes accesses to the first pixels a lot faster. If your triangles are small enough, this can easily make a difference of a factor of two.
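The idea, sketched in C (using the GCC/Clang `__builtin_prefetch` builtin; the row pointers and function name are illustrative, not from the original code):

```c
/* At the start of the scanline loop, touch the first pixels of the
   *next* line, so the z-buffer and frame-buffer cache lines are
   already in flight while the current line is being shaded.
   __builtin_prefetch is a GCC/Clang builtin; MSVC has _mm_prefetch. */
static void prefetch_next_line(const unsigned *next_zrow,
                               const unsigned *next_crow)
{
    __builtin_prefetch(next_zrow);       /* z-buffer row, will be read */
    __builtin_prefetch(next_crow, 1);    /* color row, will be written */
}
```

A prefetch is only a hint; it never faults and costs almost nothing when the data is already cached.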

Your assembler code looks pretty good so far. There are some things you may want to do differently:

movq [ebp+_vu],mm0
movzx edx,byte[ebp+_vv+2]
movzx ebx,byte[ebp+_vu+2]


Avoid accessing memory like this. It limits out-of-order execution on modern CPUs, since a sequence like this must be executed in order. I see what you're trying to do here, but you can get the same effect with some shifts and moves. If possible, do the shifts in the MMX unit; this makes a difference on the P4.

Do the first texture access outside the loop, then load the texels for the next loop iteration somewhere in the middle of your processing, i.e. you do the texture fetches always one pixel in advance. In your code I would put them directly below the pmullw instruction, since it has the highest latency.

Your z-buffer test can be improved as well. Z-buffer fails and passes aren't random and often occur in groups of several pixels. You can take advantage of this by making a special loop for just the z-fail case. This second loop can be much simpler, since its only task is to skip z-fail pixels. Once you get a z-pass, recalculate your texture offsets etc. and branch into the ordinary pixel loop.

It's a bit tricky to get right, but it makes a good difference in speed, since the z-fail skipping loop needs just one branch instead of two.
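A C sketch of the two-loop structure (assuming a 1/Z buffer where greater passes, as in tridi; buffer and function names are mine):

```c
#include <assert.h>

/* Z-tested span fill with a separate z-fail skipping loop.
   zbuf holds 1/z, so a larger value is nearer and passes.
   Returns the number of pixels actually shaded. */
static int draw_span(int *zbuf, int *color, int n, int z, int c)
{
    int i = 0, written = 0;
    while (i < n) {
        /* cheap skip loop: one branch per hidden pixel */
        while (i < n && z <= zbuf[i]) i++;
        if (i == n) break;
        /* ordinary pixel loop: a real rasterizer would recompute the
           interpolants here, then shade while the test keeps passing */
        do {
            zbuf[i]  = z;
            color[i] = c;
            written++;
            i++;
        } while (i < n && z > zbuf[i]);
    }
    return written;
}
```

The skip loop does no interpolant stepping at all, which is why it pays off when fails come in runs.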

The biggest performance hog is your memory bandwidth though. Try to make your buffers smaller, even if you need to add some extra instructions to decode the colors.

Hope this helps,
Nils

0
101 Jan 01, 2008 at 16:15

Btw - I like the dithering look.

Since you have a z-buffer: have you considered running an edge detector (Sobel) filter over it and using the detected edges to darken the framebuffer? This gives a very nice "non-photorealistic rendering" look.
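A minimal sketch of that filter (assuming an integer z-buffer; the function name is mine):

```c
#include <stdlib.h>

/* Sobel edge magnitude over a z-buffer at pixel (x,y); strong depth
   discontinuities (silhouettes) give large values, which can then be
   used to darken the framebuffer.  z is a w-wide buffer; (x,y) must
   be at least one pixel away from the border. */
static int sobel_z(const int *z, int w, int x, int y)
{
    #define Z(i,j) z[(y + (j)) * w + (x + (i))]
    int gx = -Z(-1,-1) - 2*Z(-1,0) - Z(-1,1)
             +Z( 1,-1) + 2*Z( 1,0) + Z( 1,1);
    int gy = -Z(-1,-1) - 2*Z(0,-1) - Z(1,-1)
             +Z(-1, 1) + 2*Z(0, 1) + Z(1, 1);
    #undef Z
    return abs(gx) + abs(gy);   /* cheap |gradient| approximation */
}
```

Flat depth regions give zero, so only object outlines get darkened.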

You can do screen space ambient occlusion as well, but it’s difficult to tweak and takes quite a bit of time when done on the cpu.

0
101 Jan 02, 2008 at 00:31

Looks nice, though the demo crashes for me (Unknown Exception).
The free antialiasing isn't really useful if you have a low framerate. I'd rather call it cheap motion blur (it's a simple implementation of that).

Span buffering (or s-buffering) isn't that hard to understand, but it has its limitations. It is very fill-rate efficient, because overdraw is almost zero. This is good if you have many complex operations happening per pixel. I don't see how you could efficiently combine it with antialiasing, though…
Take a look here or here for descriptions of the technique.
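The core idea can be sketched in a few lines of C (my sketch, assuming front-to-back submission and a sorted, non-overlapping span list per scanline; a full s-buffer would also merge the new span into that list):

```c
typedef struct { int x0, x1; } Span;           /* half-open [x0,x1) */

static int sbuf_drawn;                          /* pixels emitted so far */
static void sbuf_count(int x0, int x1) { sbuf_drawn += x1 - x0; }

/* Clip the new span [x0,x1) against the sorted, non-overlapping spans
   already on this scanline; emit() is called only for the parts that
   are still uncovered, so every pixel is shaded at most once.
   Returns the number of pixels emitted. */
static int sbuf_clip(const Span *spans, int nspans, int x0, int x1,
                     void (*emit)(int x0, int x1))
{
    int i, emitted = 0;
    for (i = 0; i < nspans && x0 < x1; i++) {
        if (spans[i].x1 <= x0) continue;       /* span entirely to our left  */
        if (spans[i].x0 >= x1) break;          /* span entirely to our right */
        if (x0 < spans[i].x0) {                /* visible gap before it      */
            emit(x0, spans[i].x0);
            emitted += spans[i].x0 - x0;
        }
        x0 = spans[i].x1;                      /* skip the covered part      */
    }
    if (x0 < x1) { emit(x0, x1); emitted += x1 - x0; }
    return emitted;
}
```

This is where the zero-overdraw property comes from: covered parts of a new span are discarded before any per-pixel work happens.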

0
101 Jan 02, 2008 at 01:12

WOW, thank you!

All my render targets for Z, RGB and light are 32-bit; maybe this is why it's slow. I thought 8-bit was slower in memory read/write - is that true? If it's not, maybe I will use a 16-bit Z-buffer, 32-bit for the light buffer, 8-bit palettized for the texture stage and 32-bit output after modulate.

If my app crashed, well, I don't know; my PC is an AMD Sempron 2600 with 512 MB. Maybe you don't have SDL.DLL?

You can do screen space ambient occlusion as well, but it’s difficult to tweak and takes quite a bit of time when done on the cpu.

Is it post-processing? I did a trick for ambient occlusion in the pixel shader, by comparing the distance between the occluder's and the occluded Z values. The result looks like this, but I removed it from my code. Maybe if I can make the whole rendering process faster I will use ambient occlusion again.

The code looks like this:

        ... (calculate address shadow map on EDX)
movd mm5,[ebp+_vi]
//ambient occlusion
sub ebx,[eax+edx*4]
jns @lit
mov eax,0
sar ebx,7     // the light intensity is reduced depending on distance from the occluder
cmovs ebx,eax
movd mm7,ebx
psrld mm5,mm7 // shift right intensity (why not div? i dont know)

@lit:
psrld mm5,8
pshufw mm5,mm5,0
pmullw mm5,mm1
psrlw mm5,8

// lit color
movd mm4,[edi+ecx*4]
packsswb mm5,mm0
movd [edi+ecx*4],mm5
...


Picture: an early stage of my renderer; no texture, but lots of lighting tricks. In this pic it's ambient occlusion + subdivision.

off

on

The frame rate itself must be high so that fake tricks like the antialiasing work well. On my PC, at 640x480 the demo runs above 30 fps. But if I run it at 320x240 it's more than 100 fps.

Hey, hmm, maybe I can ask for help here to rearrange my assembler code :D, can I? If it's OK I will post the shader code, from zonlyshader to textureshader.

0
101 Jan 02, 2008 at 01:38

Oh, one more thing: the shadow occluders in "tridi" are the polygons backfacing the light source, not the others.

And the shadow map resolution can be 32, 64, 128 or 256 :D

Oh, one more :D - in the demo, all the lights cast shadows (each light has its own shadow map). The light group is actually a limiter on the number of lights per frame: it sorts the lights by intensity and calculates "n" lights (can be modified in code), and every light in a group has the same light color, but the intensity can differ. Shadow map generation is easy and automatic: you only need to create a class based on the light class and override the "transformshadow" method. For example, the muzzle light derives from tpointlight and modifies transformshadow:

function ttankmuzlelight.transformshadow;
var
  r, sc, vl, llx, lly : gxfloat;
begin
  result := sz > oz-300;
  if result then begin
    sc  := sd2/64; // shadowmap size/64
    llx := -(sx - ox);
    lly := -(sy - oy);
    vl  := (sz - oz);
    if vl < 0.001 then vl := 0.001; // occluder above the light
    r   := sqrt(sqr(llx)+sqr(lly)) + 1000;
    vl  := (oz/vl)*sc*80/r;
    sx  := llx*vl + sd2;
    sy  := lly*vl + sd2;  // output in SX and SY
  end;
end;


I created that code by brute trial & error :D - not efficient, but it looks natural. The shadow map is not linear: an occluder near the light covers a bigger amount of pixels than an occluder far away, so the shadow can cover more objects than if it were linear.

And the "free" or "fake" antialiasing is, hmm, how to say it: at first I actually used the camera angle trick to create depth of field, but it failed, because the result flickered too much. Then I tried other camera methods and found this "fake" antialiasing :D

0
102 Jan 02, 2008 at 15:22

Hi Ryan,

@ryannining

A month ago I started a 3D software rasterizer…

That’s pretty impressive for just a month of work!

…in Delphi 6…

Delphi is pretty nice, but you’ll find a lot more code for C++.

Some of the code uses MMX and SSE, but I don't know how to optimize further. And some code is self-modifying (depending on the rendering features needed). But I think that even after converting some code to MMX/SSE I can't make my renderer any faster. I don't know why; maybe because of the Z-buffer I use (32-bit), maybe the scanline conversion method.

Have you tried profiling yet? Often the hot code is not where you expect it.

Thing is, my code is so bad that without calling the scanline conversion function (by commenting it out) it only gets 190 fps at 640x400, and when rendering is turned on it drops to under 50 fps.

That's actually not bad at all. Also remember that you should first implement all the features and only then start really optimizing (after profiling). Otherwise you'll easily end up reimplementing everything.

0
102 Jan 02, 2008 at 16:11

@Nils Pipenbrinck

The biggest performance hog is your memory bandwidth though. Try to make your buffers smaller, even if you need to add some extra instructions to decode the colors.

This might be true for certain architectures, but on PCs you don't easily run out of bandwidth.

Say you have 4.8 GB/s of bandwidth and an 800x600 32-bit image; then you can access it entirely 2,500 times per second. At 50 FPS that's still 50 accesses per pixel. Such a 'shader' would be very long, and you'll be limited by arithmetic performance much sooner.

Another way to look at it: with a 2.4 GHz CPU you can make a 32-bit access every two clock cycles. You typically can't do any useful compression to save bandwidth.
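Spelled out as a quick check of the arithmetic (a sketch; the function name is mine):

```c
/* How many times per second a w*h image of bpp bytes per pixel can be
   fully read or written with a given byte bandwidth. */
static double passes_per_second(double bytes_per_s, int w, int h, int bpp)
{
    return bytes_per_s / ((double)w * h * bpp);
}
```

With 4.8e9 bytes/s and an 800x600x4 target this gives 2,500 full-image passes per second, i.e. 50 touches per pixel at 50 FPS.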

0
101 Jan 02, 2008 at 17:03

I have exactly the opposite experience.

I just ran the benchmark of the graphics library I've been working on for over a year now. It's not optimized for x86, so the rendering routines are far from ideal (just generic C code, no MMX etc.).

Here are the results of one particular test (lots of small Gouraud-shaded triangles):

Alpha8 - 96 megapixels/second
Argb32 - 34 megapixels/second
RGB565 - 67 megapixels/second

I see nearly a 1:1 relationship between memory bandwidth (bytes per pixel) and performance here, and it is the same in all routines except texturing (where the cache coherence problem comes into play).

My machine at home is slow - a 1 GHz Athlon with a 100 MHz FSB - but I see the same figures on a modern P4 as well. Performance always scales with memory bandwidth.

If we really had a 32-bit access every two clock cycles, why would we still need caches? 2 cycles per access would be great. Maybe we get 2 cycles in theory, but in practice the memory access takes much longer.

Btw - it would be interesting to see how much raw memory performance we really get. The GB/s for different RAM chips is well known. You can get a chunk of memory with the cache disabled under Win32 by calling VirtualAlloc with the PAGE_NOCACHE protection constant. It would be interesting to know how much time a memcpy of 10 MB takes and how many cycles it takes per access.

0
101 Jan 02, 2008 at 17:25

whatever -

Now we can all measure how much GB/s we really have. I'm far away from the theoretical 3.2 GB/s that my RAM can do. I get 0.5 GB/s (cached) and 0.09 GB/s (uncached).

#include <windows.h>
#include <stdio.h>
#include <string.h>

void bench (void * data)
{
    int i;

    // do 1 GB of traffic:
    for (i=0; i<1024; i++)
        memset (data, i, 1024*1024);
}

int main (int argc, char **args)
{
    void * mem_cached;
    void * mem_uncached;
    int t1, t2;
    float secs;

    mem_cached   = VirtualAlloc(0, 1024*1024, MEM_COMMIT|MEM_RESERVE,
                                PAGE_READWRITE);
    mem_uncached = VirtualAlloc(0, 1024*1024, MEM_COMMIT|MEM_RESERVE,
                                PAGE_READWRITE|PAGE_NOCACHE);

    // warmup:
    memset (mem_cached, 0, 1024*1024);

    // measure
    t1 = GetTickCount();
    bench (mem_cached);
    t2 = GetTickCount();
    secs = (float)(t2-t1) / 1000.0f;
    printf ("cached gb/s = %f\n", 1.0f / secs);

    // warmup:
    memset (mem_uncached, 0, 1024*1024);

    // measure
    t1 = GetTickCount();
    bench (mem_uncached);
    t2 = GetTickCount();
    secs = (float)(t2-t1) / 1000.0f;
    printf ("uncached gb/s = %f\n", 1.0f / secs);

    VirtualFree (mem_cached, 0, MEM_RELEASE);
    VirtualFree (mem_uncached, 0, MEM_RELEASE);
    return 0;
}


I know GetTickCount() is not really made for this, but it's precise enough to get ballpark numbers.

0
102 Jan 02, 2008 at 21:18

@Nils Pipenbrinck

Here are the results of one particular test (lots of small Gouraud-shaded triangles)

Ah, yes, for plain Gouraud triangles the memory bandwidth can become a bottleneck. But in my experience, as soon as you start doing anything interesting, the arithmetic operations become the bottleneck and bandwidth is close to irrelevant.

Alpha8 - 96 megapixels/second
Argb32 - 34 megapixels/second
RGB565 - 67 megapixels/second

How do you interpolate the components? Argb32 has four times as many components as Alpha8, so you must be doing a lot more arithmetic work per pixel. Rgb565 has three components and might benefit significantly from lower register pressure.

And from another perspective: 34 MP/s is still 70 FPS for an 800x600 image. There is no need to go higher. There is, however, a need to do something more exciting than Gouraud, and then bandwidth is not your primary concern any more.

I see nearly a 1:1 relationship between memory bandwidth (bytes per pixel) and performance here, and it is the same in all routines except texturing (where the cache coherence problem comes into play).

Cache locality is quite OK for texturing. All modern CPUs have automatic prefetching, which is able to predict where the next texture access(es) will land. The non-1:1 relationship you see there will mainly be due to arithmetic operations becoming more of a bottleneck than bandwidth, not cache locality.

By the way, "cache coherency" refers to data integrity between caches; "cache locality" is the word you want here.

If we really had a 32-bit access every two clock cycles, why would we still need caches? 2 cycles per access would be great. Maybe we get 2 cycles in theory, but in practice the memory access takes much longer.

I was talking about throughput, not latency. The time between RAM transactions is only a few CPU cycles; the total round-trip time between a read request and getting the data into the registers can be hundreds of cycles. Caches are there to vastly improve latency and to reduce bandwidth needs.

0
102 Jan 02, 2008 at 22:11

@Nils Pipenbrinck

Now we can all measure how much GB/s we really have. I'm far away from the theoretical 3.2 GB/s that my RAM can do. I get 0.5 GB/s (cached) and 0.09 GB/s (uncached).

I get 34.7 GB/s cached (Core 2 @ 2.4 GHz - 2 MB L2), very close to the theoretical 38.4 GB/s. But instead of using memset I used this:

__asm
{
    xorps xmm0, xmm0
    mov ecx, 1024*1024
    mov edi, data

loopset:
    movaps [edi+0*16], xmm0
    movaps [edi+1*16], xmm0
    movaps [edi+2*16], xmm0
    movaps [edi+3*16], xmm0
    movaps [edi+4*16], xmm0
    movaps [edi+5*16], xmm0
    movaps [edi+6*16], xmm0
    movaps [edi+7*16], xmm0

    add edi, 128    // advance to the next 128 bytes
    sub ecx, 128
    jg loopset
}


Core 2 has a 128-bit bus to the L2 cache, which can only be used fully with SSE.

And I get 2.0 GB/s when using movntps instead of movaps, which writes directly to RAM. This is somewhat less than expected, but not nearly as bad as the numbers you report.

PAGE_NOCACHE might not work the way you expect. It disables write combining (i.e. it won't fully use the 64-bit bus to RAM) and forces the CPU to keep a very strict memory access order (no reads during writes). It might even be implemented by invalidating the cache line after every access, causing no less than 64 bytes to be fetched. PAGE_NOCACHE is only useful for device drivers and some advanced security purposes, and you should use VirtualCopy for optimized copying.

Using movntps is a much friendlier way to bypass the cache (for write operations). It might still have its limitations, though, which would explain why I only get 2.0 GB/s instead of something closer to 5.3 GB/s.

Anyway, I hope you can see that memory bandwidth is not that much of a concern. In fact, many professional benchmarks show that CPUs with a higher FSB are not significantly faster, while a higher clock frequency scales the performance of multimedia applications practically linearly.

I even believe that the only operation that truly benefits from extra bandwidth is copying a large block of memory. As soon as you do some arithmetic between those memory accesses, it becomes the actual bottleneck.

0
101 Jan 03, 2008 at 02:12

That's why I'm afraid to change my buffers to less than 32-bit: I'm afraid it will cause a lot of problems in my MMX code, and that it will be slower.

But maybe I will give it a try.

Btw, I don't have a profiler; to "profile" my renderer I just created an FPS counter, disabled some code, ran it and wrote down the FPS… :D Not a good way, but I think it works.

I have implemented all the features I need, and some ended up with a very different approach from my first design, but all the features I need work. I just don't know where to optimize next.

After profiling my code this way, I think I'm wasting CPU cycles in the scanline conversion: with the scanline conversion disabled the FPS goes up to 330, but when I enable the scanline conversion (with the pixel shader still disabled) it drops to 180 fps.

My scanline conversion looks like this:

type
  Gxint   = Integer;
  GxFloat = single;
  t2dpoint = packed record
    x, y, z,
    i, u, v,
    r, g    : gxint;
  end;

var
  Spans : array [0..500] of t2dpoint;
  // vary is a data record to help the pixel shader
  vary : record
    x ,y ,z ,i ,u ,v ,r ,g  : integer;
    xm,ym,zm,im,um,vm,rm,gm : integer;
    n                       : integer;
    sdq                     : integer;
    rnd                     : integer;
    smap,sbuf,zbuf,tbuf     : pointer;
    lmap                    : pointer;
    lcolor                  : integer;
    width                   : integer;
    texinfo                 : integer;
    flatcolor               : integer;
    polyinfo                : integer;
    flatint                 : integer;
    sdand                   : integer;
  end;

var
  xya1            : pt2dpoint;
  xya2            : pt2dpoint;
  span            : ^t2dpoint;
  t,y1,y2         : integer;
  iy3,i           : gxint;
  awidth          : gxint;
  y3              : gxint;
begin
if xyaa.y>xyab.y then begin
  xya1:=xyab;
  xya2:=xyaa;
end else begin
  xya1:=xyaa;
  xya2:=xyab;
end;

y1:=xya1^.y div 2;
y2:=xya2^.y div 2;
if (y1>clipymaxi) or (y2<clipymini) then exit;
if (y2>clipymaxi) then y2:=clipymaxi;
if y1>=y2 then exit;
i:=(clipymini*2-xya1^.y);
if (i>0) and (clipymini>=y2) then exit;
if (i>0) then y1:=clipymini;

iy3:=(y2-y1);
y3:=(xya2^.y-xya1^.y)+1;
t:=y1;
span:=ptr(integer(@spans)+y1*32); // calculate the start span; a span record is 32 bytes
awidth:=width;
asm
fild  dword [y3]
fdivr  dword [_float1]      // calculate 1/y3
fstp  dword [y3]

mov  eax,xya1
movups xmm6,[eax]
movups xmm7,[eax+16]

mov eax,xya2
movups xmm0,[eax]
movups xmm1,[eax+16]

cvtdq2ps xmm6,xmm6
cvtdq2ps xmm7,xmm7
cvtdq2ps xmm0,xmm0
cvtdq2ps xmm1,xmm1

subps xmm0,xmm6
subps xmm1,xmm7

//    atrm.x:=atrm.x div y3;  // original lots of X86 code
//    atrm.z:=atrm.z div y3;
//    atrm.i:=atrm.i div y3;
//    atrm.u:=atrm.u div y3;
//    atrm.v:=atrm.v div y3;
//    atrm.r:=atrm.r div y3;
//    atrm.g:=atrm.g div y3;

movd xmm3,y3        // SSE replacement
shufps xmm3,xmm3,0
mulps xmm0,xmm3
mulps xmm1,xmm3

cmp i,0
jle @skip

//    if i>0 then begin
//        inc(atrv.x,atrm.x*i);  // original lots of X86 code
//        inc(atrv.z,atrm.z*i);
//        inc(atrv.i,atrm.i*i);
//        inc(atrv.u,atrm.u*i);
//        inc(atrv.v,atrm.v*i);
//        inc(atrv.r,atrm.r*i);
//        inc(atrv.g,atrm.g*i);

movaps xmm4,xmm0       // SSE replacement
movaps xmm5,xmm1

movd xmm3,i
shufps xmm3,xmm3,0
cvtdq2ps xmm3,xmm3
mulps xmm4,xmm3
mulps xmm5,xmm3
@skip:
push edi
push esi

lea edi,[atrv+16]
and edi,not 15
lea eax,[atrm+16]
and eax,not 15
mov esi,span

// fake horizontal antialiasing by add 1/2 step in odd frame
test rframe,1
jz @skiphalf
@skiphalf:

cvttps2dq xmm0,xmm0
cvttps2dq xmm1,xmm1
pslld xmm0,1
pslld xmm1,1

cvttps2dq xmm6,xmm6
cvttps2dq xmm7,xmm7

movaps [eax]   ,xmm0
movaps [eax+16],xmm1

movaps [edi] ,xmm6
movaps [edi+16] ,xmm7

@nextline:
cmp [esi+4],0
jne @callscan
@storeedge:   // only 1 data, store the data in [ESI]
movaps [esi]   ,xmm6
movaps [esi+16],xmm7
mov  [esi+4],1
jmp  @nodraw1
@callscan:    // we have 2 data here, so we must draw
push esi
mov  [esi+4],0
movd ecx,xmm6    // atrv
mov edx,[esi]

sar edx,16      // ebx = edge.x shr detail
sar ecx,16

mov ebx,ecx     // edx = atrv.x shr detail
sub ecx,edx     // ecx = atrv.x - edge.x
jz @nodraw      // zero then dont draw
jg @greater     // x1>x2 then flip the data
neg ecx        // calculate dz,di,du,dv,dr,dg
mov eax,vary   // and store in [vary] records
movaps [eax],xmm6
movaps [eax+16],xmm7

movaps xmm0,[esi]
movaps xmm1,[esi+16]

psubd xmm0,xmm6     // z2-z1, i2-i1 ...
psubd xmm1,xmm7

movaps [eax+32],xmm0
movaps [eax+48],xmm1

jmp @draw
@greater:       // x2>x1
mov eax,vary        // calculate dz,di,du,dv,dr,dg
movaps xmm0,[esi]   // and store in [vary] records
movaps xmm1,[esi+16]
movaps [eax],xmm0
movaps [eax+16],xmm1

movaps xmm2,xmm6
movaps xmm3,xmm7
psubd xmm2,xmm0     // z2-z1, i2-i1 ...
psubd xmm3,xmm1

movaps [eax+32],xmm2
movaps [eax+48],xmm3

mov ebx,edx
@draw:
mov eax,t               // t = current Y coordinate
imul eax,awidth         // y*Width

mov eax,1 shl _fdiv     // calculate EAX=(1 << 12)/ECX
cdq                     // so in shader we can get
div ecx                 //  ??/ECX = ??*EAX >> 12
// i think its faster than ??/ECX
mov edx,ebx             // EDX,EBX = pixel position to help the shader
// ex: Zpixel = Zbuffer + EDX*4 (32bit)
// ex: RGBout = RGBbuffer + EDX*4 (32bit)
@nodraw:
pop esi
@nodraw1:

lea eax,[atrm+16]
//    next edge position
//    align to 16byte
and eax,not 15
inc t
// increase all variable ( 8 variable from x,y,z,i,u,v,r,g)
movaps xmm4,[eax]
movaps xmm5,[eax+16]

dec iy3
jnz @nextline
@e:
pop esi
pop edi
@e2:
emms
end;
end;


That code always works, so I don't have to find the minimum-Y vertex first; I just call it like this:

shadealine(vtx1,vtx2);


Is there a better way to do scanline conversion?

0
101 Jan 03, 2008 at 02:26

Oh, one more: I have the "modulate" code, which can be slow if the resolution is high. Can you help me optimize it?

output = RGB*RGB >> 6 // i use 6 to get the overexposure effect
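For reference, the intended per-channel operation in scalar C (a sketch; the function name is mine):

```c
/* output = texture * light >> 6, saturated.  Using >>6 instead of >>8
   lets products overshoot 255 and clip to white: the overexposure look. */
static int modulate_channel(int tex, int light)
{
    int r = (tex * light) >> 6;
    return r > 255 ? 255 : r;
}
```

One thing worth noting about the MMX version: pmullw keeps the low 16 bits and psraw then treats them as signed, so products above 32767 (e.g. 255*255) go negative and packuswb clips them to zero instead of white; psrlw, or pmulhuw on pre-shifted values, avoids that.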

procedure modulatetexlight(tex,light:pointer;w,h,ambient:integer);
asm
push ebp
mov _esp,esp
mov esp,ambient
movq mm0,_zero
movq mm7,_zero
mov avgint,0
mov ecx,w
imul ecx,h
shr ecx,1
mov eax,tex
mov edx,light
lea eax,[eax+ecx*8]
lea edx,[edx+ecx*8]
neg ecx
@1:

movq mm1,qword [eax+ecx*8]
movq mm2,qword [edx+ecx*8]
movq mm3,qword [eax+ecx*8+4]
movq mm4,qword [edx+ecx*8+4]

punpcklbw mm1,mm0
punpcklbw mm2,mm0
punpcklbw mm3,mm0
punpcklbw mm4,mm0

pmullw mm1,mm2
pmullw mm3,mm4

psraw mm1,6
psraw mm3,6

movq mm6,mm1
packuswb mm1,mm0
packuswb mm3,mm0

movd [eax+ecx*8],mm1
movd [eax+ecx*8+4],mm3
mov [edx+ecx*8],esp
mov [edx+ecx*8+4],esp

psrlw mm6,4
movq datamm1,mm6
movzx ebx,word[datamm1]
movzx ebp,word[datamm1+2]
movzx ebx,word[datamm1+4]

inc ecx
jnz @1
mov esp,_esp
pop ebp
emms
end;

0
165 Jan 03, 2008 at 03:04

Beware of using FPS numbers to measure performance. They can be misleading. What you should really count is the amount of time to render a frame (which is the reciprocal of the framerate).

For example, the drop from 330 to 180 fps seems like a big one, yes? Probably much bigger than a drop of, say, 70 to 60 fps, right?

But if you thought that, you’d be wrong. At 330 fps, you’re taking 3.03 ms to render a frame. At 180 fps, it takes 5.56 ms. So, the scanline conversion is really only adding about 2.5 ms to the time. At 70 fps, you’re taking 14.3 ms to render a frame, while at 60 fps, it’s taking 16.7 ms - which is 2.4 ms longer. So both drops are about the same size in the amount of extra time taken.
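The point is easy to check in code (frame time is just the reciprocal of the frame rate):

```c
/* Convert frames-per-second to frame time in milliseconds. */
static double frame_ms(double fps)
{
    return 1000.0 / fps;
}
```

frame_ms(330) is about 3.03 ms and frame_ms(180) about 5.56 ms, so the "huge" 330-to-180 drop and the "small" 70-to-60 drop both cost roughly 2.5 ms per frame.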

0
101 Jan 03, 2008 at 04:03

Oooh, I see - my fault. Well, I will use render time next time… wow, big mistake! I've been thinking about it all day but couldn't see such a simple mistake, damn!

Btw Nick, hehe, I have seen your SwiftShader; it's damn hot. And I got SoftWire, but I don't know how to use it from Delphi… maybe I should learn C++; I don't know where to start. I'm 27 years old and sometimes I'm lazy about learning new things.

And Nick, I think if you create a fixed-pipeline software renderer it will be very, very fast… maybe it could help the console emulator scene, hehe. I see a lot of graphics hardcore people there creating the graphics engines for their emulators.

Well, maybe I will learn OpenGL or DirectX soon (damn, I don't know how to use them :D)…

0
102 Jan 03, 2008 at 09:45

@ryannining

Btw, I don't have a profiler; to "profile" my renderer I just created an FPS counter, disabled some code, ran it and wrote down the FPS… :D Not a good way, but I think it works.

This is another reason to start using C++. AMD’s CodeAnalyst is a powerful free profiler for C/C++ and assembly, and there are many more.

I have implemented all the features I need, and some ended up with a very different approach from my first design, but all the features I need work. I just don't know where to optimize next.

First get some measurements from a profiler. It will show you exactly where to focus your attention.

On the other hand, your demo runs perfectly smoothly, and if the renderer is feature-complete then I don't see much reason to change anything (unless you're really not focusing on finishing a product but just on gaining experience instead).

0
101 Jan 03, 2008 at 10:53

Hehe, I'm not focusing on a finished product; in fact it's not finished. I must learn how to create AI and pathfinding, but first I want to create graphics that can do great lighting effects (many shadows), post-processing, etc. Or maybe just to show the friends who underestimate my graphics programming knowledge.

I'm doing this because I have had no job for a month… :D

For C++ I have downloaded MS VS Express, but I can't download the PSDK; my internet is too slow (Indonesian internet = meh). Maybe you have a suggestion for how I should learn C++.

Maybe you can give me some code that uses MinGW or anything else to begin with :D

0
102 Jan 03, 2008 at 16:28

@ryannining

movq mm1,qword [eax+ecx*8]
movq mm2,qword [edx+ecx*8]
movq mm3,qword [eax+ecx*8+4]
movq mm4,qword [edx+ecx*8+4]

punpcklbw mm1,mm0
punpcklbw mm2,mm0
punpcklbw mm3,mm0
punpcklbw mm4,mm0

pmullw mm1,mm2
pmullw mm3,mm4


Accessing unaligned data (your third and fourth instructions) can have a significant performance impact. The CPU actually reads two aligned quadwords and then extracts the unaligned quadword from them. So make sure you always access an address that is a multiple of eight (for quadwords) for optimal performance.

All the unpacking costs performance too, so I propose using something like this:

punpcklbw mm1, dword [eax+ecx*8]
punpcklbw mm2, dword [edx+ecx*8]
punpcklbw mm3, dword [eax+ecx*8+4]
punpcklbw mm4, dword [edx+ecx*8+4]

pmulhuw mm1, mm2
pmulhuw mm3, mm4

0
102 Jan 03, 2008 at 16:29

@ryannining

For C++ I have downloaded MS VS Express, but I can't download the PSDK; my internet is too slow (Indonesian internet = meh). Maybe you have a suggestion for how I should learn C++.

0
101 Jan 04, 2008 at 04:35

Wow, I'm impressed by what you have achieved in just one month!! :)
I wrote a couple of these some years ago; you have surpassed them all :)
Mine were all in C++; I could send them to you. I never went down to asm level. I recall the z-buffer being a huge problem in my profiler, and hence I spent a long time trying to figure out a different way… I believe ATI had some really cool way of implementing an alternative to the z-buffer; maybe it was just marketing, but as I recall they did score better in the benchmarks at that time… maybe Nick has some details here? :)

0
101 Jan 04, 2008 at 04:49

I decided to try it on my Intel Mac but it crashed; I had put SDL.dll in the same dir. Does AMD have some funky instructions that you rely on?

0
102 Jan 04, 2008 at 10:11

@noglin

I believe ATI had some really cool way of implementing an alternative to the z-buffer, maybe it was just marketing but as I recall it they did score better in the benchmarks at that time.. maybe nick has some details here? :)

They introduced the hierarchical z-buffer. The basic idea is to keep a low resolution z-buffer which stores the maximum of z-values within tiles of the full resolution z-buffer. This way a whole tile can be discarded with a single test.

It's very effective for hardware, but of less use for software rendering. In software you can simply skip the pixel processing when the depth test fails. In hardware, once a pixel enters the processing pipeline you can't remove it (or more precisely, you can't remove it and insert a visible pixel in its place). Unlike a CPU, the GPU always processes many pixels concurrently, so only when all of those pixels (i.e. a whole tile) fail the depth test can it stop them from entering the pipeline and try another tile.
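For reference, the coarse test can be sketched in C++ like this (a minimal sketch; the 8x8 tile size and the max-depth layout are my assumptions, not ATI's actual design):

```cpp
#include <algorithm>
#include <vector>

// Coarse z-buffer: one entry per 8x8 tile, storing the MAXIMUM depth
// of the full-resolution tile. A polygon whose nearest depth is larger
// (farther) than that maximum cannot be visible anywhere in the tile.
struct HierZBuffer {
    static const int TILE = 8;
    int widthTiles, heightTiles;
    std::vector<float> maxZ;   // per-tile maximum depth

    HierZBuffer(int w, int h)
        : widthTiles(w / TILE), heightTiles(h / TILE),
          maxZ(widthTiles * heightTiles, 1.0f) {}  // far plane = 1.0

    // True if the whole tile can be rejected with a single compare.
    bool tileOccluded(int tx, int ty, float polyMinZ) const {
        return polyMinZ >= maxZ[ty * widthTiles + tx];
    }

    // After rasterizing a tile, lower its max to the new farthest pixel.
    void updateTile(int tx, int ty, float newMaxZ) {
        float &m = maxZ[ty * widthTiles + tx];
        m = std::min(m, newMaxZ);
    }
};
```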

0
101 Jan 05, 2008 at 17:11

I decided to try it on my Intel Mac but it crashed; I had put SDL.dll in the same dir. Does AMD have some funky instructions that you rely on?

I don't know; I will search for the bug, but I don't have an Intel machine here. You can send me the code at ryannining@yahoo.com; I will try to use your code as a base for my renderer.

About the PSDK: I downloaded it via the web installer, but it takes so long and is not resumable. If I download the ISO, well, I'm afraid of my internet bill. Here in Indonesia I only have a 0.5 GB limit per month; over that quota I must pay 5x per MB.

I have some ideas about soft shadows with penumbra/area shadows, but I need some time to implement them.

Btw, about shadow mapping: I don't have experience with DirectX/OpenGL programming, so I want to ask here: is it true that in DX/GL, to create a shadow map, we must transform the geometry ourselves, or is it automatic depending on our light? And how do you create a shadow map for a point light?

Because in my renderer I must supply the shadow transformation myself. With a 128x128 shadow map I cannot expect to cover all the geometry in the demo, so I coded my shadow transformation (just a skew transform) by trial and error until I found an optimal transformation that covers all visible geometry for the current camera setting. If I adjust the camera setting (distance, orientation) I must re-code the shadow transform. This is just for the directional light type. For a point light the shadow map transform is independent of the camera setting.

My point light can only create a shadow map for geometry below the light (180° coverage) :D; I don't know the correct transform for this light type.

function ttankmuzlelight.transformshadow;
var
  r, sc, vl, llx, lly: gxfloat;
begin
  result := sz > oz - 300;          // sx,sy,sz : current vertex position
  if result then begin              // ox,oy,oz : light position
    sc := sd2 / 64;                 // ocx,ocy,ocz : camera position
    llx := -(sx - ox);
    lly := -(sy - oy);
    vl  := (sz - oz);
    if vl < 0.001 then vl := 0.001; // prevent division by zero
    r  := sqrt(sqr(llx) + sqr(lly)) + 1000;
    vl := (oz / vl) * sc * 80 / r;
    sx := llx * vl + sd2;           // output in sx,sy
    sy := lly * vl + sd2;

    result := true;
  end;
end;
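For what it's worth, the trial-and-error constants for a directional light can usually be avoided by fitting the projection to the bounding box of the visible geometry in light space. A hedged C++ sketch of that idea (the Vec3 type and all names are mine, not from tridi):

```cpp
#include <algorithm>
#include <cstddef>

struct Vec3 { float x, y, z; };

// Map world-space x/y (as seen along the light direction, here -z)
// onto shadow-map texels [0, size) so that ALL given points fit.
// The scale/offset are recomputed whenever the camera (and thus the
// visible set) changes, instead of being hand-tuned constants.
struct ShadowFit {
    float scale, offX, offY;

    ShadowFit(const Vec3 *pts, std::size_t n, int mapSize) {
        float minX = pts[0].x, maxX = pts[0].x;
        float minY = pts[0].y, maxY = pts[0].y;
        for (std::size_t i = 1; i < n; ++i) {
            minX = std::min(minX, pts[i].x); maxX = std::max(maxX, pts[i].x);
            minY = std::min(minY, pts[i].y); maxY = std::max(maxY, pts[i].y);
        }
        float extent = std::max(maxX - minX, maxY - minY);
        if (extent <= 0.0f) extent = 1.0f;
        scale = (mapSize - 1) / extent;   // uniform scale keeps texels square
        offX = -minX * scale;
        offY = -minY * scale;
    }

    void toShadowMap(const Vec3 &p, float &sx, float &sy) const {
        sx = p.x * scale + offX;
        sy = p.y * scale + offY;
    }
};
```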

0
101 Jan 05, 2008 at 17:18

Hierarchical z-buffer… hmm, how about if we store a block of 16x1 pixels (horizontal), so we can skip 16 pixels at a time in the scanline shader?

A 2-dimensional tile sounds complex for a software renderer due to the non-parallel pipeline, but 1-dimensional?… It looks promising? Any thoughts? Or is it just like a SPAN BUFFER?
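The 16x1 idea is essentially a one-dimensional hierarchical z-buffer: one max-depth per 16-pixel run, checked before shading the run. A rough C++ sketch (the data layout is my assumption, not tridi code):

```cpp
#include <vector>

const int BLOCK = 16;

// coarseMax[i] = maximum (farthest) depth in pixels [i*16, i*16+16).
// While walking a scanline we can jump 16 pixels at a time whenever
// the incoming span is entirely behind the stored maximum.
int countShadedPixels(const std::vector<float> &coarseMax,
                      float spanNearZ, int spanStart, int spanEnd) {
    int shaded = 0;
    int x = spanStart;
    while (x < spanEnd) {
        int block = x / BLOCK;
        bool wholeBlock = (x % BLOCK == 0) && (x + BLOCK <= spanEnd);
        if (wholeBlock && spanNearZ >= coarseMax[block]) {
            x += BLOCK;            // entire 16-pixel run occluded: skip it
        } else {
            ++shaded;              // would run the per-pixel shader here
            ++x;
        }
    }
    return shaded;
}
```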

0
101 Jan 05, 2008 at 18:09

Latest source available at http://sourceforge.net/projects/tridi
That is the latest source code; you still need the data files from the last binary demo.

Maybe I will rewrite my code from scratch without assembler.

0
101 Jan 11, 2008 at 20:33

I have rewritten some code, especially the main rendering engine. This time no assembler is used in the scanline converter and pixel shader, so maybe someone can translate this code to C++. But I removed some features, such as the jitter filter and colored light.

procedure scansdwlightns;
var
  oo, zz, sc: pinteger;
  t: integer;
  z, zm: integer;   // depth and per-pixel step
  i, im: integer;   // light intensity
  u, um: integer;   // shadow map U
  v, vm: integer;   // shadow map V
  r, rm: integer;   // shadow depth reference
begin
  zz := vary.zbuf;
  inc(zz, vary.dd);
  oo := vary.lmap;
  inc(oo, vary.dd);

  z  := vary.z;
  zm := vary.zm div vary.n;

  i  := vary.i div 128 + 120;  // light intensity
  im := (vary.im div 128) div vary.n;

  u  := vary.u;                // shadow map UV coord
  um := vary.um div vary.n;
  v  := vary.v;
  vm := vary.vm div vary.n;

  r  := vary.r;                // assumed field: r was otherwise uninitialized
  rm := vary.rm div vary.n;

  for t := 1 to vary.n do begin
    if z = zz^ then begin
      sc := vary.smap;
      inc(sc, (u div sdetail16) and vary.sdand +
              ((v div sdetail16) and vary.sdand) shl (vary.sdshl - 2));
      if sc^ > r then
        inc(oo^, i);           // pixel is lit: accumulate intensity
    end;
    z := z + zm;
    i := i + im;
    u := u + um;
    v := v + vm;
    r := r + rm;
    inc(zz);
    inc(oo);
  end;
end;
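Since the point of dropping the assembler was to make a C++ translation possible, here is a rough port of the loop above (the Vary struct, its field names, and the r start value are assumptions read off the Pascal record accesses, not tridi's actual declarations):

```cpp
#include <cstdint>

// Interpolated span state, mirroring the Pascal 'vary' record: each
// value/step pair is walked across n pixels in 16.16 fixed point.
struct Vary {
    int32_t *zbuf, *lmap;          // z-buffer and light-map rows
    const int32_t *smap;           // shadow map
    int dd, n;                     // pixel offset into the row, span length
    int z, zm;                     // depth and total depth delta
    int i, im;                     // light intensity and delta
    int u, um, v, vm;              // shadow-map coords and deltas
    int r, rm;                     // shadow depth reference and delta
    int sdand, sdshl, sdetail16;   // shadow-map mask / shift / texel scale
};

// Port of scansdwlightns: where the span's depth matches the z-buffer
// and the shadow map says "lit", accumulate intensity into the light map.
void scanShadowLight(Vary &va) {
    int32_t *zz = va.zbuf + va.dd;
    int32_t *oo = va.lmap + va.dd;
    int z = va.z,             zm = va.zm / va.n;
    int i = va.i / 128 + 120, im = (va.im / 128) / va.n;  // light intensity
    int u = va.u,             um = va.um / va.n;          // shadow-map UV
    int v = va.v,             vm = va.vm / va.n;
    int r = va.r,             rm = va.rm / va.n;          // va.r is assumed
    for (int t = 0; t < va.n; ++t) {
        if (z == *zz) {
            int idx = ((u / va.sdetail16) & va.sdand) +
                      (((v / va.sdetail16) & va.sdand) << (va.sdshl - 2));
            if (va.smap[idx] > r)
                *oo += i;          // pixel is lit: accumulate intensity
        }
        z += zm; i += im; u += um; v += vm; r += rm;
        ++zz; ++oo;
    }
}
```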

enjoy ….