New update: I added per-pixel mip-mapping. To show how it works I created two demos: one that colors the mip-map levels, and one with normal drawing. There is a noisy pattern at the mip-map level boundaries. The reason is that I use the "RCPPS" SSE instruction, which is less precise than "DIVPS". With "DIVPS" I get sharp edges at the mip-map boundaries, but that instruction is slower than "RCPPS". When the mip-map levels are not colored, the noisy pattern is not visible; see the demo without mip-map coloring. ;) Waiting for your feedback, guys ;)

https://sourceforge....cts/phenomenon/

I am very interested in occlusion culling. I wrote some renderers in the past, but at that time I had difficulties getting the whole thing to run at a decent speed.
I abandoned the idea in favour of a PVS calculation system, and I am now starting to rethink it.
My main problem was basically that I had to render everything twice, even though the occlusion buffer was at a lower resolution.
Are things mature enough to work on it again? Isn't realtime ray tracing approaching fast? Will all the accumulated competency be wasted in 3-4 years?
Basically, I mean: is it worth writing a 'manual' occlusion culling system right now?

Herrcoolness said:

The reason is that I use the "RCPPS" SSE instruction, which is less precise than "DIVPS". With "DIVPS" I get sharp edges at the mip-map boundaries, but that instruction is slower than "RCPPS".
https://sourceforge....cts/phenomenon/

Try using RCPPS followed by one or two iterations of a newton-raphson division (if x is an approximation of 1/d (which is given by the rcpps instruction), x*(2 - d*x) is a better one).
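
A minimal scalar sketch in C of the refinement step described above. `rcp_approx` is only a stand-in for RCPPS (it truncates the mantissa of an exact reciprocal to simulate a low-precision estimate), and both function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for RCPPS: 1/d with the mantissa truncated to 11 bits, to mimic a
   low-precision hardware estimate (an assumption, not the real instruction). */
float rcp_approx(float d)
{
    float x = 1.0f / d;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFF000u;   /* keep sign, exponent and top 11 mantissa bits */
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* One Newton-Raphson step for the reciprocal: x' = x * (2 - d * x). */
float rcp_refine(float d, float x)
{
    return x * (2.0f - d * x);
}
```

Each step roughly squares the relative error, so a single iteration on a roughly 12-bit estimate already gets close to full single precision.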
v71 said:

I am very interested in occlusion culling. I wrote some renderers in the past, but at that time I had difficulties getting the whole thing to run at a decent speed.
I abandoned the idea in favour of a PVS calculation system, and I am now starting to rethink it.
My main problem was basically that I had to render everything twice, even though the occlusion buffer was at a lower resolution.
Are things mature enough to work on it again? Isn't realtime ray tracing approaching fast? Will all the accumulated competency be wasted in 3-4 years?
Basically, I mean: is it worth writing a 'manual' occlusion culling system right now?
I'm not sure what you mean by the raytracing argument, but occlusion queries using software rendering *are* feasible and are already being used in AAA titles (DICE is using it in its engine for Battlefield: Bad Company 2). We are soon going to research this subject as well, and as you can read a few posts back, I already made a rudimentary single-threaded implementation on PC. Of course, as with all occlusion culling systems, the effectiveness depends on the kind of environment you have. While z-buffer based occlusion queries are pretty decent all-round, for indoor environments with clearly separated rooms and corridors a cell-and-portal system could provide better culling. But for large dynamic open-world environments, I reckon that z-buffer based queries are both the best culling system and the easiest to set up (since you can mostly use the actual modeled geometry for occlusion, although you probably still want to insert low-poly occlusion-only meshes here and there, and skip high-poly or skinned meshes to speed things up).
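
To make the z-buffer based query concrete, here is a rough sketch in C of testing an object's screen-space bounding rectangle against a software occlusion buffer. The buffer layout, the "smaller z is closer" convention and the name `rect_maybe_visible` are assumptions for illustration, not taken from any particular engine.

```c
/* Returns 1 if any pixel of the rectangle [x0,x1] x [y0,y1] could be visible,
   i.e. the object's nearest depth is in front of what the occlusion buffer
   holds there. Smaller z means closer (an assumed convention). */
int rect_maybe_visible(const float *zbuf, int width,
                       int x0, int y0, int x1, int y1, float nearest_z)
{
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (nearest_z < zbuf[y * width + x])
                return 1;   /* at least one pixel is not occluded */
    return 0;               /* fully hidden behind the occluders */
}
```

The occluders are rasterized into `zbuf` first (at low resolution, depth only); objects whose rectangle returns 0 can then be skipped before anything is submitted to the GPU.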
.oisyn said:

Try using RCPPS followed by one or two iterations of a newton-raphson division (if x is an approximation of 1/d (which is given by the rcpps instruction), x*(2 - d*x) is a better one).

Thnx for the tip ;)

Sorry for my messy post before; I'll try to explain better.
For a software rasterizer, basically you have to write two renderers: one using OpenGL or DirectX and the other running entirely on the CPU.
I mean everything: vertex rotation, perspective division, and a fast triangle filler.
I know that it is sufficient to use only the z-buffer, a lower screen resolution, and other optimizations, but I am asking myself: is it worth writing a system like this? Isn't ray tracing approaching fast?
Even if ray tracing won't be used to render a complete scene with lighting, will the new multicore GPU boards allow us to write a visibility system running entirely on hardware within 2-3 years?

v71 said:

Sorry for my messy post before; I'll try to explain better.
For a software rasterizer, basically you have to write two renderers: one using OpenGL or DirectX and the other running entirely on the CPU.
I mean everything: vertex rotation, perspective division, and a fast triangle filler.
I know that it is sufficient to use only the z-buffer, a lower screen resolution, and other optimizations, but I am asking myself: is it worth writing a system like this? Isn't ray tracing approaching fast?
Even if ray tracing won't be used to render a complete scene with lighting, will the new multicore GPU boards allow us to write a visibility system running entirely on hardware within 2-3 years?
Raytracing is good just for drawing high-quality images, or for physics collision or AI. Better to use a scanline or tile-based triangle filler. It's fast, memory friendly, coherent, and it needs less math. Triangle rasterizers evolved from raytracing for faster drawing of triangles. So why would you use raytracing for the occlusion pass when you can use a fast triangle rasterizer?

.oisyn said:

Try using RCPPS followed by one or two iterations of a newton-raphson division (if x is an approximation of 1/d (which is given by the rcpps instruction), x*(2 - d*x) is a better one).

I used this method, which added 4 SSE instructions: 2 muls, 1 mov (constant read), 1 sub.
The speed was the same as using divps. :huh:

Hmmm that's too bad. But I think I read it in an intel optimization manual once, but that was a couple of years back (P4 era) and maybe the divps has evolved since then. Or perhaps I'm just mistaken

.edit: no, it's still there: http://www.intel.com...nual/248966.pdf
Chapter 6.1:

Quote

Use the reciprocal instructions followed by iteration for increased accuracy. These instructions yield reduced accuracy but execute much faster. Note the following:
— If reduced accuracy is acceptable, use them with no iteration.
— If near full accuracy is needed, use a Newton-Raphson iteration.
— If full accuracy is needed, then use divide and square root which provide more accuracy, but slow down performance.

If you google "rcpps newton raphson", a lot of sites say it's faster as well.
.oisyn said:

Hmmm that's too bad. But I think I read it in an intel optimization manual once, but that was a couple of years back (P4 era) and maybe the divps has evolved since then. Or perhaps I'm just mistaken :)

.edit: no, it's still there: http://www.intel.com...nual/248966.pdf
Chapter 6.1:

If you google "rcpps newton raphson", a lot of sites say it's faster as well.

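I used:

// xmm1 - input value d
rcpps xmm0, xmm1   // xmm0 = x, the approximation of 1/d
mulps xmm1, xmm0   // xmm1 = d*x
mulps xmm1, xmm0   // xmm1 = d*x*x
addps xmm0, xmm0   // xmm0 = 2*x (without this step the result would be x - d*x*x)
subps xmm0, xmm1   // xmm0 = 2*x - d*x*x = x*(2 - d*x)
// xmm0 - output value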


Now there is no constant load.
But the speed is still almost the same:
*DIVPS - 32.50 fps
*RCPPS + Newton-Raphson iteration - 32.40 fps
*RCPPS - 33.70 fps

I have an Intel Core 2 Quad Q8300. It may be true that after 4 years DIVPS has become faster. But I will use the iteration for older CPUs. :) Still, thanks for the tip :).

Herrcoolness said:

I have an Intel Core 2 Quad Q8300. It may be true that after 4 years DIVPS has become faster. But I will use the iteration for older CPUs. :)
Starting with the Core 2 on 45 nm technology, Intel implemented a new radix-16 division unit, which is twice as fast as its predecessor.

divps still has a high latency of up to 15 cycles, but if you have other instructions that can execute independently, that's no problem. If you use rcpps and a Newton-Raphson iteration instead, the total latency is nearly identical, but you're executing more instructions (while you could have done other work instead).

So indeed, on newer processors it's faster to use divps, and you even get full precision!

News-news-news, guys. ;) I (re)implemented the hierarchical z-buffer with 3 basic functions: fast tile skip, standard per-pixel z comparison, and fast z writing without comparing against the old values in the z-buffer.

I uploaded 2 demos: one with colored debug info and one without the coloring, to see how it works normally.
*black tiles - skipped tiles of the hidden small quad
*green tiles - tiles drawn with the fast write function (no z comparison) and not compared against the triangle edges
*cyan tiles - tiles drawn with the fast write function (no z comparison) but compared against the triangle edges
*gray tiles - tiles drawn with the function that compares the z-values against the z-buffer, and compared against the triangle edges
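
The three tile paths listed above can be sketched as a per-tile classification. This is only an illustration in C under assumptions: each tile stores the min and max depth already written to it, smaller z is closer, and the names `TileAction` and `classify_tile` are hypothetical (the demo may track less per tile).

```c
typedef enum { TILE_SKIP, TILE_FAST_WRITE, TILE_COMPARE } TileAction;

/* Classify one tile from the depth range already stored in it and the depth
   range the triangle covers inside it. Smaller z is closer (an assumption). */
TileAction classify_tile(float tile_min_z, float tile_max_z,
                         float tri_min_z, float tri_max_z)
{
    if (tri_min_z >= tile_max_z)
        return TILE_SKIP;        /* triangle is entirely behind the tile */
    if (tri_max_z < tile_min_z)
        return TILE_FAST_WRITE;  /* entirely in front: write z, no compare */
    return TILE_COMPARE;         /* ranges overlap: per-pixel z test */
}
```

Whether the triangle edges also need testing inside the tile (green vs. cyan in the demo) is a separate coverage question and is left out of this sketch.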

Next stop ...clipping and transform pipeline ... and first rotated cube? :happy:

https://sourceforge....cts/phenomenon/

Ok guys. What's new?
The triangle input coordinates are now in NDC (normalized device coordinates), so the x and y positions need to be in the [-1, +1] interval. Why? Because this is what graphics cards use, and it helped me solve the problem of the window being resized: the triangles now change size too, in proportion to the rendering window.
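
As a small illustration of why NDC input makes the output scale with the window, here is a sketch of the viewport mapping in C. The name `ndc_to_window` and the choice of +y pointing up in NDC are assumptions.

```c
/* Map NDC x, y in [-1, +1] to pixel coordinates of a width x height window.
   NDC +y is taken to point up while pixel rows grow downward (an assumption). */
void ndc_to_window(float ndc_x, float ndc_y, int width, int height,
                   float *px, float *py)
{
    *px = (ndc_x + 1.0f) * 0.5f * (float)width;
    *py = (1.0f - ndc_y) * 0.5f * (float)height;
}
```

Since `width` and `height` only enter at this last step, resizing the window rescales every triangle without touching the input data.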

I also added a third texture filtering method for low-end PCs. It's almost as fast as nearest filtering (because it needs only 1 texture fetch) but looks almost like bilinear. Yes, you have seen this method in Unreal. I found a description of the technique in the old flipcode archive (http://www.flipcode....In_Unreal.shtml)

There are 2 demos:
-a static one to see how fast all 3 techniques are (press 1, 2, 3 to change the filtering technique)
-a dynamic one to see the dither-bilinear technique in action (press 1, 2, 3 to change the filtering technique)
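
A rough sketch in C of the dithered single-fetch filtering mentioned above: the texel coordinates get a small offset that depends on the pixel's position in a 2x2 screen pattern. The offset values are the kernel commonly quoted from the flipcode article and should be treated as an assumption, as should the name `dither_fetch`; this is not the demo's actual code.

```c
/* Offsets in texels, indexed by [screen_y & 1][screen_x & 1]. The values are
   an assumption based on the commonly quoted Unreal-style kernel. */
static const float dither_u[2][2] = { { 0.25f, 0.50f }, { 0.75f, 0.00f } };
static const float dither_v[2][2] = { { 0.00f, 0.75f }, { 0.50f, 0.25f } };

/* Fetch one texel with a screen-position-dependent offset. u and v are in
   texels and assumed non-negative; tex_w and tex_h must be powers of two so
   that wrapping can be done with a mask. */
unsigned dither_fetch(const unsigned *tex, int tex_w, int tex_h,
                      float u, float v, int screen_x, int screen_y)
{
    int iu = (int)(u + dither_u[screen_y & 1][screen_x & 1]) & (tex_w - 1);
    int iv = (int)(v + dither_v[screen_y & 1][screen_x & 1]) & (tex_h - 1);
    return tex[iv * tex_w + iu];
}
```

Neighbouring pixels land on different texels near texel borders, so the eye blends them into something close to bilinear at the cost of a single fetch per pixel.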

News-news-news!! I added the full vertex transformation pipeline and homogeneous clipping of triangles, based on Direct3D conventions (orientation, perspective matrix and so on). I created a small demo where you can move the camera and see a big cube with a 2048 x 2048 texture. About the rasterizer: I reimplemented Nick's rasterizer with fixed-point math because of its numerical stability near the edges of the drawing bounds. Sometimes, after clipping and homogeneous division, the triangle's vertex positions ended up outside the screen, which caused an error in the triangle rasterizer.
q,e - moving in y direction
a,d - moving in x direction
w,s - moving in z direction
1,2,3- filtering method
9,0 - vsync on-off
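
For anyone curious what homogeneous clipping looks like, here is a minimal sketch in C that clips a convex polygon against the near plane only, using the Direct3D-style clip-space condition z >= 0 before the divide by w. The `Vec4` type and `clip_near` are hypothetical names, and a full clipper would repeat this for the other planes.

```c
typedef struct { float x, y, z, w; } Vec4;

/* Clip a convex polygon against the near plane z >= 0 in clip space (before
   the divide by w). Writes the result to out and returns the vertex count;
   out must have room for at least n + 1 vertices. */
int clip_near(const Vec4 *in, int n, Vec4 *out)
{
    int count = 0;
    for (int i = 0; i < n; ++i) {
        Vec4 a = in[i];
        Vec4 b = in[(i + 1) % n];
        int a_in = a.z >= 0.0f;
        int b_in = b.z >= 0.0f;
        if (a_in)
            out[count++] = a;               /* keep vertices on the inside */
        if (a_in != b_in) {                 /* edge crosses the plane */
            float t = a.z / (a.z - b.z);    /* parameter where z reaches 0 */
            Vec4 p = { a.x + t * (b.x - a.x), a.y + t * (b.y - a.y),
                       0.0f,                  a.w + t * (b.w - a.w) };
            out[count++] = p;
        }
    }
    return count;
}
```

Clipping before the division keeps w away from zero or negative values, which is one way to avoid the wild post-divide positions described in the post.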

https://sourceforge....cts/phenomenon/

AMD Athlon XP 1900+ at 1.6 GHz
512 RAM
GeForce4 MX 440 with 64 MB
PS/2 Mouse + USB Keyboard

Mihail121 said:

AMD Athlon XP 1900+ at 1.6 GHz
512 RAM
GeForce4 MX 440 with 64 MB
PS/2 Mouse + USB Keyboard


An unhandled exception occurred at $00403247 :
EAccessViolation : Access violation
  $00403247
  $0041A4B4  DDRAWFLIPWINDOWED,  line 45 of ddrawwindowed.inc
  $0041A7DE  GS_WNDPROC,  line 131 of gs_screen.inc
  $0041F98E  WNDKEYBPROC,  line 29 of fenomenon_keyboard.pas
  $00432382  WNDMOUSEPROC,  line 62 of fenomenon_mouse.pas
  $7E418734
  $7E418816
  $7E42C03D
  $7E42C228
  $7E42C1D5
  $004122E7
  $0041A820
  $0041F98E
  $00432382
  $7E418734
  $7E418816
  $7E428EA0

Heap dump by heaptrc unit
84 memory blocks allocated : 8439373/8439688
77 memory blocks freed     : 8434761/8435056
7 unfreed memory blocks : 4612
True heap size : 1867776 (80 used in System startup)
True free heap : 1862528
Should be : 1862616

Call trace for block $00085DD8 size 64
  $0040A5F8
  $00409191
  $0041E485
  $0040A0BE
  $0041A4B4
  $0041A7DE
  $0041F98E
  $00432382
Call trace for block $00067068 size 24
  $00409191
  $0041E485
  $0040A0BE
  $0041A4B4
  $0041A7DE
  $0041F98E
  $00432382
  $7E418734
Call trace for block $00067008 size 16
  $0041E297
  $0040A0BE
  $0041A4B4
  $0041A7DE
  $0041F98E
  $00432382
  $7E418734
  $7E418816
Call trace for block $000E96B0 size 147
  $0040DF21
Call trace for block $000A96B8 size 3859
  $00402A45
  $0040DF21
Call trace for block $000A16A0 size 403
  $00402A45
  $0040DF21
Call trace for block $00099698 size 99
  $00402A45
  $0040DF21


Is your PC SSE2 compatible? What is your desktop resolution? I ask because some detections are not implemented in the program.

Herrcoolness said:

