Need help: Faster Software Rendering

ryannining 101 Jan 01, 2008 at 13:30

Let me introduce myself:

My name is Ryan Widi Saputra, and I'm from Indonesia.

A month ago I started a 3D software rasterizer in Delphi 6, using a Z-buffer and a deferred-like rendering stage. It has many features that are based on random jitter.

tridi (my project) has realtime software shadow maps, vertex maps and customizable pixel operations (you can manipulate the span drawing code).
Some code uses MMX and SSE, but I don't know how to optimize it further. Some code is self-modifying (depending on the rendering features needed).

But even after converting some code to MMX/SSE, I can't make my renderer faster. I don't know why; maybe it is the 32-bit Z-buffer I use, maybe the scanline conversion method.

Currently I am trying to learn the span-buffer algorithm, but with no result so far; I don't understand its basics.

Will a span buffer make my renderer faster?

If you want to see my engine demo, please download it at

http://sourceforge.net/projects/tridi
The latest demo includes all features (see readme.txt for the controls). By the way, it needs SDL.DLL.

The thing is, my code is so bad that even without calling the function that starts scanline conversion (by commenting it out) it only gets 190 fps at 640x400, and with rendering turned on it gets under 50 fps.

By the way, my rendering stages are:
========= INITIALIZATION ===========
1. Transform all vertices to view projection and mark every vertex that has negative z.
2. Clipping, which may need to generate new vertex data. Mark all polygons outside the visible region (*A). Mark all vertices used by the visible region (*B).

Now we have (*A) and (*B) ready to process.

—INFORMATION—
The rendering process is shared by all rendering methods; the only difference is the pixel shader:
1. Convert the marked vertices (*B) to 16.16 fixed point, and convert Z into 1/Z.
2. Call the scanline converter for each polygon and shade using the shader (*S).
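
A minimal C sketch of step 1 above, not the tridi code itself. The helper names and the `scale` parameter are hypothetical, chosen only to illustrate 16.16 fixed point and storing depth as 1/Z:

```c
#include <stdint.h>

#define FIX_SHIFT 16
#define FIX_ONE   (1 << FIX_SHIFT)

/* Convert a float to 16.16 fixed point (truncates toward zero). */
static int32_t to_fixed(float v)
{
    return (int32_t)(v * (float)FIX_ONE);
}

/* Integer part of a non-negative 16.16 value. */
static int32_t fixed_to_int(int32_t f)
{
    return f >> FIX_SHIFT;
}

/* Depth stored as scale/z (assumption: z > 0 after clipping).
   Larger results mean the point is nearer to the camera. */
static int32_t inv_z_fixed(float z, float scale)
{
    return (int32_t)(scale / z);
}
```

Interpolating 1/Z instead of Z across a span is what keeps the depth values linear in screen space.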

The renderer itself is a class/object. The demo has many renderers (texture, shadow map, light) that work in series and combine their results.

========= RENDERING STAGE ===========
Stage 1: Texture, using renderer (*R1)
(*S) is scanztex, which draws texture only and writes Z
result: texture-only buffer

Stage 2: Lighting
Intro: Each light has its own renderer (*RL1…*RLn) whose Z-buffer points to the (*R1) Z-buffer, so the light renderers never write Z; they only read it.

  1. Do lighting for data (*B) and fill the intensity attribute. When needed, this can check shadows using a shadow map (which is generated every frame for this light).
  2. Set the renderer to guroudshader or shadowmapshader.
  3. Call the scanline converter.

All light renderers use the same output buffer; every light contributes intensity to this lightmap.

Stage 3: Modulate/combine texture and lightmap.
Stage 4: Post-processing, such as HDR, bloom, blur, etc.

My demo rendering stages
My demo has many rendering stages that I cannot combine into one, because that would make the pixel shader very complex.

// =========== shadow map stage =============
    if frame mod 4 =0 then begin
        scene.reset;
        scene.addlights(bgroup);
        scene.addlights(fgroup);
        scene.addlights(agroup);
        // submit geometry that casts shadows
        for i:=0 to high(tanklain) do begin
            tanklain[i].submitshadow;
        end;
        sarea.submit;
        player.submitshadow;
        scene.uploadgeom(@vertexs, @faces, wnv, wnf);
        scene.drawscene(false,4); // 1 = stage texture 2 = stage lighting  4 = shadow map gen
        resetvis;
    end;
// =========== main stage =============
        scene.reset;
        scene.addlights(bgroup);
        scene.addlights(fgroup);
        scene.addlights(agroup);

        for i:=0 to high(tanklain) do begin
            tanklain[i].submit;
        end;

        area.submit;
        player.submit;
        player.headke(mousex,mousey);

        animsmoke; // submit and animate all smoke objects
        animdecals;

        scene.uploadgeom(@vertexs, @faces, wnv, wnf);

        scene.drawscene(true,1+2); // texture and lighting

        // modulate Texture and Light
        scene.modulate;

// =========== particle stage use custom shader =============
        scene.reset;
        animbullets;
        animexplosion;
        scene.maingeom.uploadgeom(@vertexs, @faces, wnv, wnf);
        scene.drawscene(false,1); // texture only

Sample picture:

The basis of my renderer is to generate the texture render and the light render separately

lightmap.jpg

albedo.jpg

and then mix them…

result.jpg

Another cool feature:
free antialiasing, by slightly modifying the camera angle from frame to frame and blending two frames.
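
The blending half of this trick can be sketched in C as below. This is only an illustration, assuming packed 8:8:8:8 pixels and a plain 50/50 blend; the function names are hypothetical and the camera jitter itself is not shown:

```c
#include <stdint.h>
#include <stddef.h>

/* Per-channel floor average of two packed 8:8:8:8 pixels.
   The mask stops the shifted bits from leaking into the next channel. */
static uint32_t avg_pixel(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) & 0xFEFEFEFEu) >> 1);
}

/* Blend the previous frame with the current, slightly jittered frame. */
void blend_frames(uint32_t *dst, const uint32_t *prev,
                  const uint32_t *cur, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = avg_pixel(prev[i], cur[i]);
}
```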

anti-alias-2.jpg

Shadow mapping in action (no AA; this image is from before I found the free AA)

engine-5.jpg

Shadow mapping optimization: a per-vertex check; if a polygon is partially shadowed, render it with the shadow map shader, otherwise render it with normal Gouraud (shown as wireframe in this image). In the corner is the shadow map (Z-buffer) generated from the tank cannon's muzzle flash.

engine-6.jpg

27 Replies


ryannining 101 Jan 01, 2008 at 13:45

The shader itself is code that you can write in assembler or not, but I write all my shaders in assembler, for example:

The smoke shader, whose operation per color channel is:

output = saturate((texture-128)*intensity+output)

The intensity lets the texture blend smoothly from visible (when intensity>0) to invisible (when intensity=0).
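
As a scalar reference for the formula above, here is a hedged C sketch of one color channel. It assumes the intensity is a 6-bit fixed-point factor (64 = 1.0), which matches the shift by 6 in the assembler version but is otherwise a guess; the helper names are hypothetical:

```c
#include <stdint.h>

/* Clamp an int to the 0..255 range of one color channel. */
static uint8_t saturate_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* output = saturate((texture - 128) * intensity + output)
   intensity: assumed fixed point with 64 == 1.0.
   Texels above 128 brighten the destination, texels below darken it. */
static uint8_t smoke_channel(uint8_t texel, uint8_t dst, int intensity)
{
    int delta = ((int)texel - 128) * intensity / 64;
    return saturate_u8((int)dst + delta);
}
```

With intensity 0 the destination is returned unchanged, which is the "invisible" end of the transition.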

procedure smokeshader;
asm
    push ebp
// prepare varying data 
// EAX: texture EDI:Output ESI:Zbuffer ECX:Pixel Count EBP:Varying data
// Varying data 16:16FixedPoint: [ebp+???] can use MMX/SSE here :)
// ??? = _vz _vi _vu _vv _vr _vg  _vzm _vim _vum _vvm _vrm _vgm  
// ??m = d??/ecx = varying step

    call prepareshader  
 
    lea esi,[esi+ecx*4]  
    lea edi,[edi+ecx*4] 

    neg ecx
    mov _esp,esp

    mov esp,[ebp+_vz]

    movq mm0,[ebp+_vu] // save _vu and _vv in mm0
    movq mm1,[ebp+_vum] // get _vum and _vvm

    mov eax,[esi+_vi] // intensity is not varying, its same across polygon
    shr eax,16
    mov word[__c3],ax // __c3 = array of word, temporary data in memory
    mov word[__c3+2],ax
    mov word[__c3+4],ax

    movq mm7,__c3
    movq mm5,_zero
    movq mm6,__c2 // __c2 = const [127,127,127,0]


    pslld mm1,1 // using 2x pixel, cheap loop (1/2pixelcount) 
    and ecx,not 1
    jz @e
@repeat:        // no zbuffer is faster, but ...
{$ifndef nozbuf}
    cmp esp,dword[esi+ecx*4]
    jg @noz
{$endif}
        movq [ebp+_vu],mm0
        movzx edx,byte[ebp+_vv+2]
        movzx ebx,byte[ebp+_vu+2]

        shl edx,10

        // additive & subtractive blend
        movd mm2,[edi+ecx*4]
        lea edx,[edx+eax]
        movd mm3,[edx+ebx*4]

        punpcklbw mm2,mm5
        punpcklbw mm3,mm5
        psubsw mm3,mm6
        pmullw mm3,mm7
        psraw mm3,6
        paddsw mm2,mm3
        packuswb mm2,mm5
        movd [edi+ecx*4],mm2
        movd [edi+ecx*4+4],mm2
@noz:
    paddd mm0,mm1
    add ecx,2
    jnz @repeat                 // loop
@e:
    mov esp,_esp
    pop ebp
end;
Nils_Pipenbrinck 101 Jan 01, 2008 at 16:09

Hi ryannining,

In my rendering routines the speed is limited by the amount of memory I have to access. So ask yourself if you really need a 32 bit zbuffer, 32 bit textures and 32 bit rendertargets.

I've seen you're doing deferred shading, so you might get away with an 8-bit rendertarget for the lights and a 16-bit rendertarget for the colors. These can then be mixed into a 32-bit format for display. Give it a try, it should be faster.
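
A rough C sketch of that final mixing step, assuming an RGB565 color buffer and an 8-bit light buffer where 255 means full brightness. Both layouts and the function name are assumptions for illustration only:

```c
#include <stdint.h>

/* Combine one RGB565 color with one 8-bit light value into ARGB32. */
static uint32_t combine_565_light(uint16_t c, uint8_t light)
{
    /* Expand 5/6/5 bits to 8 bits by replicating the top bits. */
    uint32_t r = (c >> 11) & 31; r = (r << 3) | (r >> 2);
    uint32_t g = (c >> 5) & 63;  g = (g << 2) | (g >> 4);
    uint32_t b = c & 31;         b = (b << 3) | (b >> 2);

    /* Scale each channel by light/255. */
    r = r * light / 255;
    g = g * light / 255;
    b = b * light / 255;

    return 0xFF000000u | (r << 16) | (g << 8) | b;
}
```

This reads 3 bytes per pixel instead of 8, at the cost of the extra decode instructions.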

In any case you should prefetch some pixels from the zbuffer and frontbuffer for the next line at the start of your scanline loop. This will make accesses to the first pixels a lot faster. If your triangles are small enough this can easily make a difference of factor two.

Your assembler code looks pretty good so far. There are some things you may want to do differently:

movq [ebp+_vu],mm0
movzx edx,byte[ebp+_vv+2]
movzx ebx,byte[ebp+_vu+2]

Avoid accessing memory like this. It limits the out-of-order execution on modern CPUs, since a store followed by loads from the same location must be executed in order. I see what you're trying to do here, but you can get the same effect with some shifts and moves. If possible do the shifts in the MMX unit; this makes a difference on the P4.
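
In C terms, the shift-based alternative looks roughly like this. It is a scalar sketch, assuming 16.16 fixed-point u and v and a wrapping 256x256 texture (which is what the `shl edx,10` and `ebx*4` addressing in the shader suggests); the function name is hypothetical:

```c
#include <stdint.h>

/* Texel index from 16.16 fixed-point coordinates, using only shifts
   and masks, with no store/reload through memory.
   Assumes a 256x256 texture that wraps in both directions. */
static uint32_t texel_index(uint32_t u, uint32_t v)
{
    uint32_t ui = (u >> 16) & 0xFF;   /* integer part of u, wrapped */
    uint32_t vi = (v >> 16) & 0xFF;   /* integer part of v, wrapped */
    return (vi << 8) | ui;            /* row * 256 + column */
}
```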

Do the first texture access outside the loop, then load the textures for the next loop iteration somewhere in the middle of your processing. e.g. you do the texture fetches always one pixel in advance. In your code I would put them directly below the pmullw instruction since it has the highest latency.

Your zbuffer test can be improved as well. Zbuffer fail/pass aren't random and often occur in groups of several pixels. You can take advantage of this by making a special loop for just the zfail case. This second loop can be much simpler since its only task is to skip zfail pixels. Once you get a zpass case, recalculate your texture offsets etc. and branch into the ordinary pixel loop.

It’s a bit tricky to get this right, but it makes a good difference in speed since you can do your zfail skipping loop with just one branch instead of two.
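
The two-loop structure can be sketched in C as follows. This is a simplified illustration with a flat color instead of texturing, and it assumes a pixel passes when `z <= zbuf[i]` (the same sense as the `cmp`/`jg` in the shader above); the function name is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Draw one span; returns the number of pixels written. */
size_t draw_span_zskip(uint32_t *color, int32_t *zbuf, size_t n,
                       int32_t z, int32_t dz, uint32_t c)
{
    size_t i = 0, written = 0;
    while (i < n) {
        /* Skip loop: only the depth test and the z step. */
        while (i < n && z > zbuf[i]) {
            z += dz;
            i++;
        }
        /* Draw loop: full per-pixel work for a run of passing pixels.
           A textured version would recompute u/v for pixel i here. */
        while (i < n && z <= zbuf[i]) {
            zbuf[i] = z;
            color[i] = c;
            z += dz;
            i++;
            written++;
        }
    }
    return written;
}
```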

The biggest performance hog is your memory bandwidth though. Try to make your buffers smaller even if you need to add some extra instructions to decode the colors.

Hope this helps,
Nils

Nils_Pipenbrinck 101 Jan 01, 2008 at 16:15

Btw - I like the dithering look.

Since you have a zbuffer: have you considered running an edge detector (Sobel) filter over it and using the detected edges to darken the framebuffer? This gives a very nice "non-photorealistic rendering" look.
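
A minimal C sketch of that idea, assuming a 32-bit integer zbuffer, a packed 8:8:8:8 framebuffer of the same size, and a caller-chosen threshold. The names and the "halve the color" darkening are illustrative choices, not taken from any post here:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sobel gradient magnitude (|gx| + |gy|) of the zbuffer at an
   interior pixel (x, y); w is the buffer width in pixels. */
static int64_t sobel_at(const int32_t *z, int w, int x, int y)
{
    const int32_t *p = z + y * w + x;
    int64_t gx = ((int64_t)p[-w + 1] + 2 * (int64_t)p[1] + p[w + 1])
               - ((int64_t)p[-w - 1] + 2 * (int64_t)p[-1] + p[w - 1]);
    int64_t gy = ((int64_t)p[w - 1] + 2 * (int64_t)p[w] + p[w + 1])
               - ((int64_t)p[-w - 1] + 2 * (int64_t)p[-w] + p[-w + 1]);
    return llabs(gx) + llabs(gy);
}

/* Halve the color wherever the depth gradient exceeds the threshold.
   The one-pixel border is left untouched. */
void darken_edges(uint32_t *color, const int32_t *z, int w, int h,
                  int64_t threshold)
{
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++)
            if (sobel_at(z, w, x, y) > threshold) {
                uint32_t *c = &color[y * w + x];
                *c = (*c >> 1) & 0x7F7F7F7Fu;
            }
}
```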

You can do screen space ambient occlusion as well, but it’s difficult to tweak and takes quite a bit of time when done on the cpu.

rarefluid 101 Jan 02, 2008 at 00:31

Looks nice, though the demo crashes on me (Unknown Exception).
The free antialiasing isn’t really useful if you have a low framerate. I’d rather call it cheap motion-blur (that’s a simple implementation of it).

Span buffering (or s-buffering) isn’t that hard to understand, but it has its limitations. It is very fill-rate efficient, because overdraw is almost zero. This is good if you have many complex operations happening per pixel. I don’t see how you could efficiently combine this with antialiasing though….
Take a look here or here for descriptions of the technique.
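
To make the "overdraw is almost zero" point concrete, here is a deliberately simplified C sketch of one scanline of a span buffer. It assumes polygons are submitted front to back, so a new span only needs to be clipped against what is already covered; a full s-buffer also compares depth and can split existing spans. All names and the fixed capacity are hypothetical:

```c
#include <string.h>

#define MAX_SPANS 64

typedef struct { int x0, x1; } Span;            /* half-open [x0, x1) */
typedef struct { Span s[MAX_SPANS]; int n; } SpanLine;  /* sorted, disjoint */

/* Clip [x0, x1) against the covered spans of this scanline, write the
   still-visible pieces to out, then mark the whole range as covered.
   Returns the number of visible pieces, or -1 if the line is full.
   out must have room for every piece (max_out caps what is written). */
int spanline_insert(SpanLine *line, int x0, int x1, Span *out, int max_out)
{
    int n_out = 0, cur = x0, first = 0, last;
    int mx0 = x0, mx1 = x1;
    if (x0 >= x1) return 0;

    /* Skip covered spans that end before the new span starts. */
    while (first < line->n && line->s[first].x1 < x0) first++;

    /* Walk the spans that overlap or touch the new one. */
    for (last = first; last < line->n && line->s[last].x0 <= x1; last++) {
        Span c = line->s[last];
        if (c.x0 > cur && n_out < max_out) out[n_out++] = (Span){cur, c.x0};
        if (c.x1 > cur) cur = c.x1;
        if (c.x0 < mx0) mx0 = c.x0;
        if (c.x1 > mx1) mx1 = c.x1;
    }
    if (cur < x1 && n_out < max_out) out[n_out++] = (Span){cur, x1};

    /* Replace spans [first, last) with one merged span. */
    if (last == first && line->n == MAX_SPANS) return -1;
    memmove(&line->s[first + 1], &line->s[last],
            (size_t)(line->n - last) * sizeof(Span));
    line->s[first] = (Span){mx0, mx1};
    line->n = line->n - (last - first) + 1;
    return n_out;
}
```

Only the returned pieces would be handed to the pixel shader, so a pixel that is already covered is never shaded again.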

ryannining 101 Jan 02, 2008 at 01:12

Wow, thank you!

All my render targets for Z, RGB and light are 32-bit; maybe this is why it's slow. I thought 8-bit memory reads/writes were slower; is that true? If not, maybe I will use a 16-bit Z-buffer, a 32-bit light buffer, an 8-bit palettized buffer for the texture stage, and 32-bit output after modulate.

If my app crashed, well, I don't know why; my PC is an AMD Sempron 2600 with 512 MB. Maybe you don't have SDL.DLL?

You can do screen space ambient occlusion as well, but it’s difficult to tweak and takes quite a bit of time when done on the cpu.

Is it post-processing? I did an ambient occlusion trick in the pixel shader, by comparing the distance between the occluder and the occluded Z value. The result looks like this, but I removed it from my code. If I can make the whole rendering process faster, I will use the ambient occlusion again.

The code looks like this:

        ... (calculate address shadow map on EDX)
        movd mm5,[ebp+_vi]
        add eax,[ebp+_smap]
//ambient occlusion
        sub ebx,[eax+edx*4]
        jns @lit
        mov eax,0
        sar ebx,7     // the light intensity will reduced depends on how far from occluder
        add ebx,10
        cmovs ebx,eax
        movd mm7,ebx
        psrld mm5,mm7 // shift right intensity (why not div ??? i dont know)

@lit:
        psrld mm5,8
        pshufw mm5,mm5,0
        pmullw mm5,mm1
        psrlw mm5,8

        // lit color
        movd mm4,[edi+ecx*4]
        packsswb mm5,mm0
        paddsb mm5,mm4
        movd [edi+ecx*4],mm5
        ...

Picture: an early stage of my renderer, with no textures but lots of lighting tricks; in this picture it's ambient occlusion + subdivision.

a42dd9405a.jpg

b24a558c52.jpg

Shadow map ambient occlusion
off

1235512d88.jpg
on

6ebc9f6a9d.jpg

The frame rate itself must be high for fake tricks such as this antialiasing to work well. On my PC, at 640x480 the demo runs above 30 fps, but at 320x240 it runs at more than 100 fps.

Hmm, maybe I can ask for help here to rearrange my assembler code :D, can I? If it's OK, I will post the shader code from zonlyshader to textureshader.

ryannining 101 Jan 02, 2008 at 01:38

Oh, one more: the shadow occluders in "tridi" are the polygons back-facing the light source, not the others.

And the shadowmap resolution can be 32,64,128 or 256 :D

Oh, one more :D In the demo, all the lights cast shadows (each light has its own shadow map). The light group is actually a limiter on the number of lights per frame: it sorts the lights by intensity and calculates "n" lights (can be modified in code). All lights in a group have the same light color, but the intensity can differ. Shadow map generation is easy and automatic: you only need to create a class based on the light class and override the "transformshadow" method. For example, the muzzle light derives from tpointlight and modifies transformshadow:

function ttankmuzlelight.transformshadow;
    var
        r,sc, vl, llx, lly : gxfloat;
begin
    result:=sz>oz-300;
    if result then begin
        sc:=sd2/64; // check the shadowmap size/64
        llx := -(sx - ox);
        lly := -(sy - oy);
        vl  := (sz - oz);
        if vl<0.001 then vl:=0.001; // if occluder above light
        r:=sqrt(sqr(llx)+sqr(lly))+1000;
        vl:=(oz/vl)*sc*80/r;
        sx  := llx*vl + sd2;
        sy  := lly*vl + sd2;  // output in SX and SY

        result:=true;
    end;
end;

I created that code by brute trial and error :D. It's not efficient, but it looks natural. Its shadow map is not linear: an occluder near the light gets more pixels than an occluder far away, so the shadow can cover more objects than if it were linear.

engine-6.jpg

As for the "free" or "fake" antialiasing: at first I actually used the camera angle trick to create depth of field, but it failed because the result flickered too much. Then I tried other camera methods and found this "fake" antialiasing :D

Nick 102 Jan 02, 2008 at 15:22

Hi Ryan,

@ryannining

A month ago i start a 3d software rasterizer…

That’s pretty impressive for just a month of work!

…using delphi 6…

Delphi is pretty nice, but you’ll find a lot more code for C++.

Some code uses MMX and SSE, but I don't know how to optimize it further. Some code is self-modifying (depending on the rendering features needed). But even after converting some code to MMX/SSE, I can't make my renderer faster. I don't know why; maybe it is the 32-bit Z-buffer I use, maybe the scanline conversion method.

Have you tried profiling yet? Often the hot code is not where you expected it.

The thing is, my code is so bad that even without calling the function that starts scanline conversion (by commenting it out) it only gets 190 fps at 640x400, and with rendering turned on it gets under 50 fps.

That’s actually not bad at all. Also remember that you should first implement all features and only then start really optimizing (after profiling). Otherwise you’ll easily end up reimplementing everything.

Nick 102 Jan 02, 2008 at 16:11

@Nils Pipenbrinck

The biggest performance hog is your memory bandwidth though. Try to make your buffers smaller even if you need to add some extra instructions to decode the colors.

This might be true for certain architectures, but for PCs you don't easily run out of bandwidth.

Say you have 4.8 GB/s bandwidth and an 800x600 32-bit image (1.92 MB), then you can access it entirely 2,500 times per second. At 50 FPS that's still 50 full 32-bit accesses per pixel per frame. Such a 'shader' would be very long, and you'll be limited by arithmetic performance much sooner.

Another way to look at it is that with a 2.4 GHz CPU you can access 32-bit every two clock cycles. You typically can’t do any useful compression to save bandwidth.
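
As a rough check of this arithmetic: 4.8 GB/s spread over 800x600 pixels is 10,000 bytes per pixel per second, which is 2,500 full 32-bit passes over the image per second, or 50 such accesses per pixel per frame at 50 FPS. A small C sketch of the calculation (the function names are made up for illustration, and peak bandwidth is an idealized figure):

```c
/* Full-image passes per second that a given bandwidth allows. */
static double frame_passes_per_second(double bytes_per_second,
                                      int w, int h, int bytes_per_pixel)
{
    return bytes_per_second / ((double)w * h * bytes_per_pixel);
}

/* Bytes that can be transferred per CPU clock cycle. */
static double bytes_per_cycle(double bytes_per_second, double clock_hz)
{
    return bytes_per_second / clock_hz;
}
```

For 4.8 GB/s at 2.4 GHz the second function gives 2 bytes per cycle, i.e. one 32-bit access every two cycles.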

Nils_Pipenbrinck 101 Jan 02, 2008 at 17:03

I have exactly the opposite experience.

I just ran the benchmark of the graphics library I've been working on for over a year now. It's not optimized for x86, so the rendering routines are far from ideal (just generic C code, no MMX etc.).

Here are the results of one particular test (lots of small sized gouraud triangles)

Alpha8 - 96 megapixels/second
Argb32 - 34 megapixels/second
RGB565 - 67 megapixels/second

I see nearly a 1:1 relationship of memory bandwidth (bytes per pixel) vs. performance here, and it is the same in all routines except texturing (where the cache coherence problem comes into play).

My machine at home is slow (1 GHz Athlon with 100 MHz FSB), but I see the same figures on a modern P4 as well. Performance always scales with memory bandwidth.

If we really had a 32-bit access every two clock cycles, why would we still need caches? 2 cycles per access would be great. Maybe we have 2 cycles in theory, but in practice the memory access takes much longer.

Btw - it would be interesting to see how much raw memory performance we really get. The GB/s for different RAM chips is well known. You can get a chunk of memory with disabled cache under win32 by calling VirtualAlloc with the PAGE_NOCACHE protection constant. It would be interesting to know how much time a memcpy of 10 MB takes and how many cycles it takes per access.

Nils_Pipenbrinck 101 Jan 02, 2008 at 17:25

whatever -

Now we can all measure how much GB/s we really have. I'm far away from the theoretical 3.2 GB/s that my RAM can do. What I get is 0.5 GB/s (cached) and 0.09 GB/s (uncached).

#include <windows.h>
#include <stdio.h>
#include <string.h>  // memset

void bench (void * data)
{
  int i;

  // do 1gb of traffic:
  for (i=0; i<1024; i++)
    memset (data, i, 1024*1024);
}

int main (int argc, char **args)
{
  void * mem_cached;
  void * mem_uncached;
  int t1,t2;
  float ms;
  
  mem_cached   = VirtualAlloc(0, 1024*1024, 
    MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
  mem_uncached = VirtualAlloc(0, 1024*1024, 
    MEM_COMMIT|MEM_RESERVE, PAGE_READWRITE|PAGE_NOCACHE);

  // warmup:
  memset (mem_cached, 0, 1024*1024);

  // measure  
  t1 = GetTickCount();
  bench (mem_cached);
  t2 = GetTickCount();
  ms = (float)(t2-t1) / 1000.0f;
  printf ("cached gb/s = %f\n", 1.0f / ms);
  
  // warmup:
  memset (mem_uncached, 0, 1024*1024);

  // measure  
  t1 = GetTickCount();
  bench (mem_uncached);
  t2 = GetTickCount();
  ms = (float)(t2-t1) / 1000.0f;
  printf ("uncached gb/s = %f\n", 1.0f / ms);
  
  // MEM_RELEASE frees the whole allocation and requires a size of 0
  VirtualFree (mem_cached, 0, MEM_RELEASE);
  VirtualFree (mem_uncached, 0, MEM_RELEASE);
  return 0;
}

I know - GetTickCount() is not really made for this, but it is precise enough to get ballpark numbers.

Nick 102 Jan 02, 2008 at 21:18

@Nils Pipenbrinck

Here are the results of one particular test (lots of small sized gouraud triangles)

Ah, yes, for plain gouraud triangles the memory bandwidth can become a bottleneck. But in my experience as soon as you start doing anything interesting the arithmetic operations become the bottleneck and bandwidth is close to irrelevant.

Alpha8 - 96 megapixels/second
Argb32 - 34 megapixels/second
RGB565 - 67 megapixels/second

How do you interpolate the components? Argb32 has four times more components than Alpha8 so you must be doing a lot more arithmetic work per pixel. Rgb565 has three components and might benefit significantly from lower register pressure.

And from another perspective; 34 MP/s is still 70 FPS for an 800x600 image. There is no need to go higher. There is however need to do something more exciting than gouraud, and then bandwidth is not your primary concern any more.

I see nearly a 1:1 relationship of memory bandwidth (bytes per pixel) vs. performance here, and it is the same in all routines except texturing (where the cache coherence problem comes into play).

Cache locality is quite OK for texturing. All modern CPUs have automatic prefetching which is able to predict where the next texture access(es) will land. The non-1:1 relationship you see there will be mainly due to arithmetic operations becoming more of a bottleneck than bandwidth, not cache locality.

By the way, cache coherency refers to integrity.

If we really had a 32-bit access every two clock cycles, why would we still need caches? 2 cycles per access would be great. Maybe we have 2 cycles in theory, but in practice the memory access takes much longer.

I was talking about throughput, not latency. The time between RAM transactions is only a few CPU cycles. The total round trip time between a read request and getting the data into the registers can be hundreds of cycles. Caches are there to vastly improve latency, and reduce bandwidth needs.

Nick 102 Jan 02, 2008 at 22:11

@Nils Pipenbrinck

Now we can all measure how much GB/s we really have. I'm far away from the theoretical 3.2 GB/s that my RAM can do. What I get is 0.5 GB/s (cached) and 0.09 GB/s (uncached).

I get 34.7 GB/s cached (Core 2 @ 2.4 GHz - 2 MB L2), very close to the theoretical 38.4 GB/s. But instead of using memset I used this:

__asm
{
    xorps xmm0, xmm0
    mov ecx, 1024*1024
    mov edi, data

loopset:
    movaps [edi+0*16], xmm0
    movaps [edi+1*16], xmm0
    movaps [edi+2*16], xmm0
    movaps [edi+3*16], xmm0
    movaps [edi+4*16], xmm0
    movaps [edi+5*16], xmm0
    movaps [edi+6*16], xmm0
    movaps [edi+7*16], xmm0

    add edi, 128        // advance to the next 128-byte block
    sub ecx, 128
    jg loopset
}

Core 2 has a 128-bit bus to L2 cache, which can only be used to the fullest with SSE.

And I get 2.0 GB/s when using movntps instead of movaps, which writes directly to RAM. This is somewhat less than expected but not nearly as bad as the numbers you report.

PAGE_NOCACHE might not work the way you expect. It disables write combining (i.e. it won’t fully use the 64-bit bus to RAM), and forces the CPU to keep a very strict memory access order (no reads during writes). It might even be implemented by invalidating the cache line after every access, causing no less than 64 bytes to be fetched. PAGE_NOCACHE is only useful for device drivers and for some advanced security purposes, and you should use VirtualCopy for optimized copying.

Using movntps is a much friendlier way to avoid the cache (for write operations). It might still have its limitations though, which would explain why I only get 2.0 GB/s instead of something closer to 5.3 GB/s.

Anyway, I hope you can see that memory bandwidth is not that much of a concern. In fact, many professional benchmarks show that CPUs with a higher FSB are not significantly faster, while a higher clock frequency scales performance of multimedia applications practically linearly.

I even believe that the only operation that truly benefits from extra bandwidth is copying a large block of memory. As soon as you do some arithmetic operations between those memory accesses, they become the actual bottleneck.

ryannining 101 Jan 03, 2008 at 02:12

That's why I'm afraid to change my buffers to less than 32-bit: I'm afraid it will cause a lot of problems in my MMX code, and I'm afraid it will be slower.

Maybe I will give it a try.

By the way, I don't have a profiler. To "profile" my renderer I just created an FPS counter, then I disable some code, run, and write down the FPS… :D Not a good way, but I think it works.

I have implemented all the features I need, and some ended up very different from my first design, but they all work. I just don't know where to optimize next.

After profiling my code, I think I waste CPU cycles in the scanline conversion: with scanline conversion disabled the FPS goes up to 330 fps, but when I enable it (with the pixel shader disabled) the FPS drops to 180 fps.

My scanline conversion is like this:

Type
    Gxint = Integer;
    GxFloat = single;
    t2dpoint    = packed record
        x, y, z,
        i, u, v,
        r, g    : gxint;
    end;

var 
 
Spans : array [0..500] of t2dpoint;
// vary is a data record to help the pixel shader
     vary : record
        x ,y ,z ,i ,u ,v ,r ,g  : integer;
        xm,ym,zm,im,um,vm,rm,gm : integer;
        n                       : integer;
        sdq                     : integer;
        rnd                     : integer;
        smap,sbuf,zbuf,tbuf     : pointer;
        lmap                    : pointer;
        lcolor                  : integer;
        width                   : integer;
        texinfo                 : integer;
        flatcolor               : integer;
        polyinfo                : integer;
        flatint                 : integer;
        sdand                   : integer;
    end;

procedure trender.shadealine(shader : fshader; xyaa, xyab : pt2dpoint);
var xya1            : pt2dpoint;
    xya2            : pt2dpoint;
    span            : ^t2dpoint;
    t,y1,y2         : integer;
    iy3,i           : gxint;
    awidth          : gxint;
    y3              : gxint;
begin
    if xyaa.y>xyab.y then begin
       xya1:=xyab;
       xya2:=xyaa;
    end else begin
       xya1:=xyaa;
       xya2:=xyab;
    end;

    y1:=xya1^.y div 2;
    y2:=xya2^.y div 2;
    if (y1>clipymaxi) or (y2<clipymini) then exit;
    if (y2>clipymaxi) then y2:=clipymaxi;
    if y1>=y2 then exit;
    i:=(clipxmini*2-xya1^.y);
    if (i>0) and (clipymini>=y2) then exit;
    if (i>0) then y1:=clipymini;

    iy3 :=(y2-y1);
    y3:=(xya2^.y-xya1^.y)+1;
    t:=y1;
    span:=ptr(integer(spans)+y1*32); // calculate the start span, span size is 32byte
    awidth:=width;
asm
    fild  dword [y3]
    fdivr  dword [_float1]      // calculate 1/y3
    fstp  dword [y3]

    mov  eax,xya1
    movups xmm6,[eax]
    movups xmm7,[eax+16]

    mov eax,xya2
    movups xmm0,[eax]
    movups xmm1,[eax+16]

    cvtdq2ps xmm6,xmm6
    cvtdq2ps xmm7,xmm7
    cvtdq2ps xmm0,xmm0
    cvtdq2ps xmm1,xmm1

    subps xmm0,xmm6
    subps xmm1,xmm7

//    atrm.x:=atrm.x div y3;  // original lots of X86 code
//    atrm.z:=atrm.z div y3;
//    atrm.i:=atrm.i div y3;
//    atrm.u:=atrm.u div y3;
//    atrm.v:=atrm.v div y3;
//    atrm.r:=atrm.r div y3;
//    atrm.g:=atrm.g div y3;

    movd xmm3,y3        // SSE replacement
    shufps xmm3,xmm3,0
    mulps xmm0,xmm3
    mulps xmm1,xmm3

    cmp i,0
    jle @skip

//    if i>0 then begin
//        inc(atrv.x,atrm.x*i);  // original lots of X86 code
//        inc(atrv.z,atrm.z*i);
//        inc(atrv.i,atrm.i*i);
//        inc(atrv.u,atrm.u*i);
//        inc(atrv.v,atrm.v*i);
//        inc(atrv.r,atrm.r*i);
//        inc(atrv.g,atrm.g*i);

        movaps xmm4,xmm0       // SSE replacement
        movaps xmm5,xmm1

        movd xmm3,i
        shufps xmm3,xmm3,0
        cvtdq2ps xmm3,xmm3
        mulps xmm4,xmm3
        mulps xmm5,xmm3
        addps xmm6,xmm4
        addps xmm7,xmm5
@skip:
    push edi
    push esi

    lea edi,[atrv+16]
    and edi,not 15
    lea eax,[atrm+16]
    and eax,not 15
    mov esi,span
    
    // fake horizontal antialiasing by add 1/2 step in odd frame
    test rframe,1
    jz @skiphalf
        addps xmm6, xmm0
        addps xmm7, xmm1
@skiphalf:

    cvttps2dq xmm0,xmm0
    cvttps2dq xmm1,xmm1
    pslld xmm0,1
    pslld xmm1,1

    cvttps2dq xmm6,xmm6
    cvttps2dq xmm7,xmm7

    movaps [eax]   ,xmm0
    movaps [eax+16],xmm1

    movaps [edi] ,xmm6
    movaps [edi+16] ,xmm7

@nextline:
      cmp [esi+4],0
      jne @callscan
      @storeedge:   // only 1 data, store the data in [ESI]
                movaps [esi]   ,xmm6
                movaps [esi+16],xmm7
                mov  [esi+4],1
                jmp  @nodraw1
      @callscan:    // we have 2 data here, so we must draw
                push esi
                mov  [esi+4],0
                movd ecx,xmm6    // atrv
                mov edx,[esi]

                sar edx,16      // edx = edge.x shr detail
                sar ecx,16      // ecx = atrv.x shr detail

                mov ebx,ecx     // ebx = atrv.x (start pixel if atrv is the left end)
                sub ecx,edx     // ecx = atrv.x - edge.x
                jz @nodraw      // zero width: don't draw
                jg @greater     // atrv.x > edge.x: span starts at the stored edge
                    neg ecx        // calculate dz,di,du,dv,dr,dg
                    mov eax,vary   // and store in [vary] records
                    movaps [eax],xmm6
                    movaps [eax+16],xmm7

                    movaps xmm0,[esi]
                    movaps xmm1,[esi+16]

                    psubd xmm0,xmm6     // z2-z1, i2-i1 ...
                    psubd xmm1,xmm7

                    movaps [eax+32],xmm0
                    movaps [eax+48],xmm1

                    jmp @draw
                @greater:       // x2>x1
                    mov eax,vary        // calculate dz,di,du,dv,dr,dg
                    movaps xmm0,[esi]   // and store in [vary] records
                    movaps xmm1,[esi+16]
                    movaps [eax],xmm0
                    movaps [eax+16],xmm1

                    movaps xmm2,xmm6
                    movaps xmm3,xmm7
                    psubd xmm2,xmm0     // z2-z1, i2-i1 ...
                    psubd xmm3,xmm1

                    movaps [eax+32],xmm2
                    movaps [eax+48],xmm3

                    mov ebx,edx
            @draw:
                mov eax,t               // t = current Y coordinate
                imul eax,awidth         // y*Width
                add ebx,eax             // EBX = y*Width+x1 = first pixel address

                mov eax,1 shl _fdiv     // calculate EAX=(1 << 12)/ECX
                cdq                     // so in shader we can get
                div ecx                 //  ??/ECX = ??*EAX >> 12
                                        // i think its faster than ??/ECX
                mov edx,ebx             // EDX,EBX = pixel position to help the shader
                                        // ex: Zpixel = Zbuffer + EDX*4 (32bit)
                                        // ex: RGBout = RGBbuffer + EDX*4 (32bit)
                call shader             // call the pixel shader
        @nodraw:
            pop esi
@nodraw1:

// get the step address
      lea eax,[atrm+16]
//    next edge position
      add esi,32
//    align to 16byte
      and eax,not 15
      inc t
// increase all variable ( 8 variable from x,y,z,i,u,v,r,g)
      movaps xmm4,[eax]
      movaps xmm5,[eax+16]
      paddd xmm6,xmm4
      paddd xmm7,xmm5

      dec iy3
      jnz @nextline
@e:
      pop esi
      pop edi
@e2:
      emms
end;
end;
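
The `1 shl _fdiv` step near the end of the listing replaces a per-attribute division by a single reciprocal and a multiply. A hedged C sketch of that idea, assuming `_fdiv` is 12 as the listing's comment suggests; the function names are hypothetical:

```c
#include <stdint.h>

#define FDIV 12  /* assumption: matches the (1 << 12) in the listing's comment */

/* One division per span: reciprocal of the pixel count (count > 0). */
static int32_t span_recip(int32_t count)
{
    return (1 << FDIV) / count;
}

/* Per-attribute step: delta / count approximated as delta * recip >> FDIV.
   A 64-bit intermediate avoids overflow for 16.16 deltas. */
static int32_t step_from_delta(int32_t delta, int32_t recip)
{
    return (int32_t)(((int64_t)delta * recip) / (1 << FDIV));
}
```

The result is only approximate when the count does not divide 4096 evenly; for example a delta of 300 over 3 pixels yields 99 rather than 100, so long spans can drift slightly.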

That code always works, so I don't have to find the minimum-y vertex; I just call it like this:

shadealine(vtx1,vtx2);
shadealine(vtx2,vtx3);
shadealine(vtx3,vtx1);
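
The pairing idea behind these three calls can be sketched in plain C. Each scanline has one slot: the first edge that crosses the line is stored, and the second one triggers the span. This is a simplified illustration with integer coordinates and a division per line (the listing steps the values incrementally instead); all names are hypothetical:

```c
#define MAX_LINES 512

/* One slot per scanline: holds the first edge until the second arrives. */
typedef struct { int pending; int x; } EdgeSlot;

static EdgeSlot slots[MAX_LINES];
static long pixels_drawn;  /* stand-in for calling a pixel shader */

static void emit_span(int y, int xa, int xb)
{
    (void)y;
    pixels_drawn += (xa < xb) ? (xb - xa) : (xa - xb);
}

void shade_edge(int x0, int y0, int x1, int y1)
{
    if (y0 > y1) {                      /* walk every edge top to bottom */
        int t = x0; x0 = x1; x1 = t;
        t = y0; y0 = y1; y1 = t;
    }
    /* Half-open range [y0, y1): horizontal edges do nothing, and each
       scanline inside a triangle is crossed by exactly two edges. */
    for (int y = y0; y < y1; y++) {
        if (y < 0 || y >= MAX_LINES) continue;
        int x = x0 + (x1 - x0) * (y - y0) / (y1 - y0);
        if (slots[y].pending) {         /* second edge: draw the span */
            slots[y].pending = 0;
            emit_span(y, slots[y].x, x);
        } else {                        /* first edge: remember it */
            slots[y].pending = 1;
            slots[y].x = x;
        }
    }
}
```

Because the slot is cleared when the span is drawn, the edges can be submitted in any order and no vertex sorting is needed.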

Is there a better approach to scanline conversion?

ryannining 101 Jan 03, 2008 at 02:26

Oh, one more: I have the "modulate" code, which can be slow at high resolution. Can you help me optimize it?

output = RGB*RGB >> 6 // i use 6 to get the overexposure effect
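
For reference, a scalar C sketch of the formula above, handy as a baseline to compare an optimized version against. It assumes packed 8:8:8:8 pixels and an unsigned product with saturation; it leaves out the ambient reset of the light buffer and the average-intensity accumulation that the assembler routine also performs. The function names are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* One channel: (texture * light) >> 6, clamped to 255.
   With a shift of 6, a light value of 64 leaves the texel unchanged
   and anything brighter overexposes it. */
static uint8_t mod_channel(uint8_t t, uint8_t l)
{
    unsigned v = ((unsigned)t * l) >> 6;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Modulate n packed pixels in place: tex[i] = tex[i] * light[i]. */
void modulate(uint32_t *tex, const uint32_t *light, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t out = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint8_t t = (uint8_t)(tex[i] >> shift);
            uint8_t l = (uint8_t)(light[i] >> shift);
            out |= (uint32_t)mod_channel(t, l) << shift;
        }
        tex[i] = out;
    }
}
```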

procedure modulatetexlight(tex,light:pointer;w,h,ambient:integer);
    asm
        push ebp
        mov _esp,esp
        mov esp,ambient
        movq mm0,_zero
        movq mm7,_zero
        mov avgint,0
        mov ecx,w
        imul ecx,h
        shr ecx,1
        mov eax,tex
        mov edx,light
        lea eax,[eax+ecx*8]
        lea edx,[edx+ecx*8]
        neg ecx
    @1:

            movq mm1,qword [eax+ecx*8]
            movq mm2,qword [edx+ecx*8]
            movq mm3,qword [eax+ecx*8+4]
            movq mm4,qword [edx+ecx*8+4]

            punpcklbw mm1,mm0
            punpcklbw mm2,mm0
            punpcklbw mm3,mm0
            punpcklbw mm4,mm0

            pmullw mm1,mm2
            pmullw mm3,mm4

            psraw mm1,6
            psraw mm3,6

            movq mm6,mm1
            packuswb mm1,mm0
            packuswb mm3,mm0

            movd [eax+ecx*8],mm1
            movd [eax+ecx*8+4],mm3
            mov [edx+ecx*8],esp
            mov [edx+ecx*8+4],esp

            psrlw mm6,4
            movq datamm1,mm6
            movzx ebx,word[datamm1]
            movzx ebp,word[datamm1+2]
            add avgint,ebx
            movzx ebx,word[datamm1+4]
            add avgint,ebp
            add avgint,ebx

            inc ecx
            jnz @1
    mov esp,_esp
    pop ebp
    emms
end;
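
When optimizing a routine like this, a plain scalar reference makes it easier to check that a SIMD version still produces the same pixels. The sketch below is such a reference in C++ for the `output = RGB*RGB >> 6` step, assuming 32-bit pixels with four 8-bit channels; the function names are hypothetical and the `avgint` accumulation is left out. One detail worth checking against it: the product of two 8-bit values can exceed 32767, and an arithmetic shift (`psraw`) treats such words as negative, which `packuswb` then clamps to 0 rather than 255. A logical shift (`psrlw`) avoids that.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Modulate one 8-bit channel: (tex * light) >> 6, saturated to 255.
inline std::uint8_t modulate_channel(std::uint8_t tex, std::uint8_t light) {
    const unsigned p = (unsigned(tex) * unsigned(light)) >> 6;
    return std::uint8_t(std::min(p, 255u));
}

// Modulate `count` 32-bit pixels of `tex` by `light` in place,
// then reset each light pixel to `ambient`, as the assembly version does.
void modulate_tex_light(std::uint32_t* tex, std::uint32_t* light,
                        std::size_t count, std::uint32_t ambient) {
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint32_t t = tex[i];
        const std::uint32_t l = light[i];
        std::uint32_t out = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            const std::uint8_t c =
                modulate_channel((t >> shift) & 0xFF, (l >> shift) & 0xFF);
            out |= std::uint32_t(c) << shift;
        }
        tex[i] = out;
        light[i] = ambient;
    }
}
```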
Reedbeta 167 Jan 03, 2008 at 03:04

Beware of using FPS numbers to measure performance. They can be misleading. What you should really count is the amount of time to render a frame (which is the reciprocal of the framerate).

For example, the drop from 330 to 180 fps seems like a big one, yes? Probably much bigger than a drop of, say, 70 to 60 fps, right?

But if you thought that, you’d be wrong. At 330 fps, you’re taking 3.03 ms to render a frame. At 180 fps, it takes 5.56 ms. So, the scanline conversion is really only adding about 2.5 ms to the time. At 70 fps, you’re taking 14.3 ms to render a frame, while at 60 fps, it’s taking 16.7 ms - which is 2.4 ms longer. So both drops are about the same size in the amount of extra time taken.
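
The arithmetic above is easy to reproduce. A minimal C++ sketch (the function names are made up for illustration):

```cpp
// Time spent on one frame, in milliseconds, for a given frame rate.
double frame_ms(double fps) { return 1000.0 / fps; }

// Extra time per frame when the rate drops from fps_before to fps_after.
double added_ms(double fps_before, double fps_after) {
    return frame_ms(fps_after) - frame_ms(fps_before);
}
```

`added_ms(330, 180)` comes out at roughly 2.5 ms and `added_ms(70, 60)` at roughly 2.4 ms, matching the figures in the post.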

ryannining 101 Jan 03, 2008 at 04:03

Oooh, I see, it's my fault. I will measure render time from now on. Wow, big mistake! I have been thinking about it all day but couldn't see such a simple mistake, damn!

Btw Nick, I have seen your SwiftShader, it's damn HOT. I also got SoftWire, but I don't know how to use it in Delphi. Maybe I should learn C++, but I don't know where to start. I'm 27 years old, and sometimes I'm too lazy to learn new things.

And Nick, I think if you created a fixed-pipeline software renderer it would be very, very fast. Maybe it could help the console emulator scene, hehe. I see a lot of hardcore graphics people there, writing graphics engines for their emulators.

Well, maybe I will learn OpenGL or DirectX soon (damn, I don't know how to use them :D).

Nick 102 Jan 03, 2008 at 09:45

@ryannining

Btw I don't have a profiler. To “profile” my renderer, I just created an FPS counter: I disable some code, run, and write down the FPS… :D Not a good way, but I think it works.

This is another reason to start using C++. AMD’s CodeAnalyst is a powerful free profiler for C/C++ and assembly, and there are many more.

I have implemented all the features I need. Some ended up very different from my first design, but everything I need works. I just don't know where to optimize next.

First get some measurements from a profiler. It will show you exactly where to focus your attention.

On the other hand, your demo runs perfectly smoothly, and if the renderer is feature complete then I don’t see much reason to change anything (unless you’re really not focussing on finishing a product but just gaining experience instead).

ryannining 101 Jan 03, 2008 at 10:53

Hehe, I'm not focusing on a finished product; in fact it's not finished. I still have to learn how to create AI and pathfinding, but first I want to create graphics that can do great lighting effects (many shadows), postprocessing, etc. Or maybe just to show my friends, who underestimate my graphics programming knowledge.

I'm doing this because I have had no job for a month… :D.

For C++ I have downloaded MS VS Express, but I can't download the PSDK; my internet is too slow (Indonesian internet = meh). Maybe you have a suggestion for what I should do to learn C++.

Maybe you can give me some code that uses MinGW or anything else to begin with :D.

Nick 102 Jan 03, 2008 at 16:28

@ryannining

movq mm1,qword [eax+ecx*8]
movq mm2,qword [edx+ecx*8]
movq mm3,qword [eax+ecx*8+4]
movq mm4,qword [edx+ecx*8+4]

punpcklbw mm1,mm0
punpcklbw mm2,mm0
punpcklbw mm3,mm0
punpcklbw mm4,mm0

pmullw mm1,mm2
pmullw mm3,mm4

Accessing unaligned data (your third and fourth instruction) can have a significant performance impact. The CPU actually reads two aligned quadwords and then extracts the unaligned quadword from that. So make sure you always access an address that is a multiple of eight (for quadwords) for optimal performance.

All the unpacking is costing performance too, so I propose to use something like this:

punpcklbw mm1, dword [eax+ecx*8]
punpcklbw mm2, dword [edx+ecx*8]
punpcklbw mm3, dword [eax+ecx*8+4]
punpcklbw mm4, dword [edx+ecx*8+4]

pmulhuw mm1, mm2
pmulhuw mm3, mm4
Nick 102 Jan 03, 2008 at 16:29

@ryannining

For C++ I have downloaded MS VS Express, but I can't download the PSDK; my internet is too slow (Indonesian internet = meh). Maybe you have a suggestion for what I should do to learn C++.

Keep trying to download the Platform SDK. It’s well worth it. Use a download manager if you have to.

noglin 101 Jan 04, 2008 at 04:35

Wow I’m impressed by what you have achieved in just one month!! :)
I wrote a couple some years ago, you have surpassed them all :)
Mine was all in C++, I could send it to you, I never went down to asm level, I recall the z-buffer being a huge problem in my profiler and hence I spent a long time trying to figure out a different way… I believe ATI had some really cool way of implementing an alternative to the z-buffer, maybe it was just marketing but as I recall it they did score better in the benchmarks at that time.. maybe nick has some details here? :)

noglin 101 Jan 04, 2008 at 04:49

I decided to try it on my intel-mac but it crashed, I had put SDL.dll in the same dir, has AMD some funky instructions that you rely on?

Nick 102 Jan 04, 2008 at 10:11

@noglin

I believe ATI had some really cool way of implementing an alternative to the z-buffer, maybe it was just marketing but as I recall it they did score better in the benchmarks at that time.. maybe nick has some details here? :)

They introduced the hierarchical z-buffer. The basic idea is to keep a low resolution z-buffer which stores the maximum of z-values within tiles of the full resolution z-buffer. This way a whole tile can be discarded with a single test.

It’s very effective for hardware, but for software rendering it’s of less use. In software you can literally skip the pixel processing if the depth test fails. With hardware, once a pixel enters the processing pipeline you can’t remove it (or more correctly, you can’t remove it and insert a visible pixel). Unlike a CPU, the GPU always processes many pixels concurrently. So only when all these pixels (i.e. a tile) fail the depth test they can stop them from entering the pipeline and try another tile.
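
To make the tile test concrete, here is a minimal C++ sketch of a coarse depth buffer that stores the maximum depth per tile. It is only an illustration under stated assumptions: `HiZ`, the 8x8 tile size, 32-bit depths and "smaller z is nearer" are all hypothetical choices, not taken from any particular hardware or from the renderer discussed here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Coarse depth buffer: one entry per TILE x TILE block of the full-resolution
// z-buffer, holding the largest (farthest) depth found in that block.
// Assumes smaller z means nearer and the depth test passes when z < stored z.
struct HiZ {
    static constexpr int TILE = 8;
    int tiles_x;
    int tiles_y;
    std::vector<std::uint32_t> max_z;

    HiZ(int width, int height)
        : tiles_x((width + TILE - 1) / TILE),
          tiles_y((height + TILE - 1) / TILE),
          max_z(std::size_t(tiles_x) * std::size_t(tiles_y), 0u) {}

    // Rebuild the coarse buffer from the full-resolution z-buffer.
    void build(const std::vector<std::uint32_t>& zbuf, int width, int height) {
        std::fill(max_z.begin(), max_z.end(), 0u);
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                std::uint32_t& m = max_z[std::size_t(y / TILE) * tiles_x + x / TILE];
                m = std::max(m, zbuf[std::size_t(y) * width + x]);
            }
        }
    }

    // True if geometry whose nearest depth is min_z cannot pass the depth
    // test at any pixel of tile (tx, ty), so the whole tile can be skipped.
    bool tile_hidden(int tx, int ty, std::uint32_t min_z) const {
        return min_z >= max_z[std::size_t(ty) * tiles_x + tx];
    }
};
```

Because depths only get smaller as nearer pixels are written, a stale tile maximum is merely conservative: it may fail to reject a hidden tile, but it never rejects a visible one.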

ryannining 101 Jan 05, 2008 at 17:11

I decided to try it on my intel-mac but it crashed, I had put SDL.dll in the same dir, has AMD some funky instructions that you rely on?

I don't know. I will search for the bug, but I don't have an Intel machine here. You can send me the code at ryannining@yahoo.com; I will try to use your code as a base for my renderer.

About the PSDK: I downloaded it via the web installer, which takes very long and is not resumable. If I download the ISO, well, I'm afraid of my internet bill. Here in Indonesia I only have a 0.5 GB limit per month; over that quota I must pay 5 times as much per MB.

I have some ideas about soft shadows with penumbra / area shadows, but I need some time to implement them.

Btw, about shadow mapping: I don't have experience with DirectX/OpenGL programming, so I want to ask here. Is it true that in DX/GL, to create a shadow map, we must transform the geometry ourselves, or is it done automatically depending on the light? How do you create a shadow map for a point light?

I ask because in my renderer I must supply the shadow transformation myself. With a 128x128 shadow map I cannot expect to cover all the geometry in the demo, so I coded my shadow transformation (just a skew transform) by trial and error until I found one that covers all visible geometry for the current camera setting. If I adjust the camera setting (distance, orientation) I must re-code the shadow transform. This applies only to directional lights. For point lights the shadow map transform is independent of the camera setting.

My point light can only create a shadow map for geometry below the light (180° coverage) :D. I don't know the correct transform for this light type.

function ttankmuzlelight.transformshadow;
    var
        r,sc, vl, llx, lly : gxfloat;
begin
    result:=sz>oz-300;  // sx,sy,sz : current vertex position
    if result then begin  // ox,oy,oz : light position
        sc:=sd2/64;       // ocx,ocy,ocz : camera position 
        llx := -(sx - ox);
        lly := -(sy - oy);
        vl  := (sz - oz);
        if vl<0.001 then vl:=0.001; // to prevent error
        r:=sqrt(sqr(llx)+sqr(lly))+1000;
        vl:=(oz/vl)*sc*80/r;
        sx  := llx*vl + sd2;      // output in sx,sy
        sy  := lly*vl + sd2;

        result:=true;
    end;
end;
ryannining 101 Jan 05, 2008 at 17:18

Hierarchical z-buffer, hmm. How about storing one value per block of 16x1 pixels (horizontal), so we can skip 16 pixels at a time in the scanline shader?

A 2-dimensional tile sounds complex for a software renderer because its pipeline is not parallel, but a 1-dimensional one? It looks promising. Any thoughts? Or is it just like a SPAN BUFFER?

ryannining 101 Jan 05, 2008 at 18:09

Latest source available at http://sourceforge.net/projects/tridi
It contains the source code only; you still need the data files from the last binary demo.

Maybe I will rewrite my code from scratch without using assembler.

ryannining 101 Jan 11, 2008 at 20:33

I have rewritten some code, especially the main rendering engine. This time no assembler is used in the scanline converter and pixel shader, so maybe someone can translate this code to C++. But I removed some features, such as the jitter filter and colored lights.

For example, the shadow map shader code:

procedure scansdwlightns;
var oo,zz,sc:pinteger;
    t:integer;
    z,zm:integer;
    i,im:integer;
    u,um:integer;
    v,vm:integer;
    r,rm:integer;
begin
    zz:=vary.zbuf;
    inc(zz,vary.dd);
    oo:=vary.lmap;
    inc(oo,vary.dd);

    z:=vary.z;
    zm:=vary.zm div vary.n;

    i:=vary.i div 128+120;  // light intensity
    im:=(vary.im div 128)div vary.n;

    u:=vary.u;      // shadow Map UV coord
    um:=vary.um div vary.n;
    v:=vary.v;
    vm:=vary.vm div vary.n;

    r:=vary.r;      // shadow Z
    rm:=vary.rm div vary.n;

    for t:=1 to vary.n do begin
        if z=zz^ then begin
            sc:=vary.smap;
            inc(sc,(u div sdetail16) and vary.sdand+
                   ((v div sdetail16) and vary.sdand) shl (vary.sdshl-2));
            if sc^>r then
                inc(oo^,i);
        end;
        z:=z+zm;
        i:=i+im;
        u:=u+um;
        v:=v+vm;
        r:=r+rm;
        inc(zz);
        inc(oo);
    end;
end;

enjoy ….
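
Since the post invites a C++ translation, here is one possible line-by-line rendering of `scansdwlightns`. It is a sketch under assumptions: `SpanVary` is a hypothetical stand-in for the fields of `vary` that the shader reads, the field meanings in the comments are inferred from the Pascal code rather than confirmed, and `sdetail16` (a global in the original) is passed as a parameter. The `n <= 0` guard is an addition that the original does not have.

```cpp
#include <cstdint>

// Hypothetical stand-in for the fields of `vary` used by the shader.
// Meanings are inferred from the Pascal code, not confirmed by its author.
struct SpanVary {
    std::int32_t* zbuf;        // depth buffer
    std::int32_t* lmap;        // light accumulation buffer
    const std::int32_t* smap;  // shadow map depths
    int dd;                    // index of the first pixel of the span
    int n;                     // number of pixels in the span
    int z, zm;                 // depth: start value and total delta over the span
    int i, im;                 // light intensity
    int u, um, v, vm;          // shadow map coordinates
    int r, rm;                 // depth as seen from the light
    int sdand;                 // coordinate mask (inferred: shadow map size - 1)
    int sdshl;                 // inferred: log2 of the shadow map row size in bytes
};

void scan_shadow_light(const SpanVary& s, int sdetail16) {
    if (s.n <= 0) return;  // guard added for this sketch
    std::int32_t* zz = s.zbuf + s.dd;
    std::int32_t* oo = s.lmap + s.dd;

    int z = s.z, zm = s.zm / s.n;
    int i = s.i / 128 + 120, im = (s.im / 128) / s.n;
    int u = s.u, um = s.um / s.n;
    int v = s.v, vm = s.vm / s.n;
    int r = s.r, rm = s.rm / s.n;

    for (int t = 0; t < s.n; ++t) {
        if (z == *zz) {  // pixel depth matches the z-buffer entry
            const int idx = ((u / sdetail16) & s.sdand) +
                            (((v / sdetail16) & s.sdand) << (s.sdshl - 2));
            if (s.smap[idx] > r) *oo += i;  // shadow map depth is beyond r: add light
        }
        z += zm; i += im; u += um; v += vm; r += rm;
        ++zz;
        ++oo;
    }
}
```

Pascal's `div` and C++'s integer `/` both truncate toward zero, so the per-pixel steps come out the same as in the original.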