

Advanced Rasterization


187 replies to this topic

#161 Herrcoolness

    New Member

  • Members
  • 19 posts

Posted 22 July 2010 - 11:56 PM

New update: I added per-pixel mip-mapping. To show how it works I created two demos: one that color-codes the mip-map levels, and one with normal drawing. There is a noisy pattern at the mip-map level boundaries. The reason is that I use the "RCPPS" SSE instruction, which is less precise than "DIVPS". Using "DIVPS" I get sharp edges at the mip-map boundaries, but that instruction is slower than "RCPPS". When the mip-map levels are not colored, the noisy pattern is not visible; see the demo without mip-map coloring. ;) Waiting for your feedback, guys ;)
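For readers unfamiliar with per-pixel mip selection, here is a minimal sketch of one common way to pick the level from the texture-coordinate derivatives of a pixel. It is not necessarily how the demo does it: the function name `selectMipLevel` and its parameters are assumptions made for illustration. It also hints at why an imprecise reciprocal shows up as noise: near a level boundary, a small error in the perspective-divided coordinates can flip the integer result.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: picks a mip level from the texel-space footprint of one pixel.
// dudx, dvdx, dudy, dvdy are texture-coordinate derivatives in texels per pixel.
int selectMipLevel(float dudx, float dvdx, float dudy, float dvdy, int levelCount)
{
    assert(levelCount > 0);

    // Squared length of the footprint along the screen x and y axes.
    const float lenX2 = dudx * dudx + dvdx * dvdx;
    const float lenY2 = dudy * dudy + dvdy * dvdy;
    const float rho2  = std::max(lenX2, lenY2);

    // One texel per pixel or less: use the full-resolution level.
    if (rho2 <= 1.0f) return 0;

    // level = floor(log2(rho)) = floor(0.5 * log2(rho^2)); the cast truncates,
    // which equals floor here because the value is positive.
    const int level = static_cast<int>(0.5f * std::log2(rho2));
    return std::min(level, levelCount - 1);
}
```

The hard `floor` is what makes the boundary sharp with an exact divide and speckled with an approximate one.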

https://sourceforge....cts/phenomenon/

#162 v71

    Valued Member

  • Members
  • 353 posts

Posted 23 July 2010 - 06:55 AM

I am very interested in occlusion culling. I wrote a renderer in the past, but at that time I had difficulty getting the whole thing to run at a decent speed.
I abandoned the idea in favour of a PVS calculation system, but I am starting to reconsider.
My main problem was that I had to render everything twice, even though the occlusion buffer was at a lower resolution.
Are things mature enough to work on it again? Isn't real-time ray tracing approaching fast? Will all the accumulated expertise be wasted in 3-4 years?
Basically, I mean: is it worth writing a 'manual' occlusion culling system right now?
Gurus, please respond...

#163 .oisyn

    DevMaster Staff

  • Moderators
  • 1842 posts

Posted 23 July 2010 - 10:15 AM

Herrcoolness said:

The reason is that I use the "RCPPS" SSE instruction, which is less precise than "DIVPS". Using "DIVPS" I get sharp edges at the mip-map boundaries, but that instruction is slower than "RCPPS".
https://sourceforge....cts/phenomenon/

Try using RCPPS followed by one or two iterations of a newton-raphson division (if x is an approximation of 1/d (which is given by the rcpps instruction), x*(2 - d*x) is a better one).
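A minimal scalar sketch of that refinement step follows. It is kept scalar so it runs anywhere; in SSE code the initial estimate would come from RCPPS (the `_mm_rcp_ps` intrinsic). The helper `roughReciprocal` is a hypothetical stand-in that fakes an estimate with a relative error of about 2^-12, and both function names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step for the reciprocal: given x ~= 1/d,
// returns the improved estimate x * (2 - d * x).
float refineReciprocal(float d, float x)
{
    return x * (2.0f - d * x);
}

// Hypothetical stand-in for a low-precision hardware estimate such as RCPPS:
// the exact reciprocal, deliberately perturbed by a relative error of 2^-12.
float roughReciprocal(float d)
{
    assert(d != 0.0f);
    return (1.0f / d) * (1.0f + 1.0f / 4096.0f);
}
```

Each step roughly squares the relative error, so one iteration takes a 12-bit estimate to about 23 bits, and a second one buys nothing more in single precision.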
C++ addict
-
Currently working on: the 3D engine for Tomb Raider.

#164 .oisyn

    DevMaster Staff

  • Moderators
  • 1842 posts

Posted 23 July 2010 - 10:31 AM

v71 said:

I am very interested in occlusion culling. I wrote a renderer in the past, but at that time I had difficulty getting the whole thing to run at a decent speed.
I abandoned the idea in favour of a PVS calculation system, but I am starting to reconsider.
My main problem was that I had to render everything twice, even though the occlusion buffer was at a lower resolution.
Are things mature enough to work on it again? Isn't real-time ray tracing approaching fast? Will all the accumulated expertise be wasted in 3-4 years?
Basically, I mean: is it worth writing a 'manual' occlusion culling system right now?
Gurus, please respond...

I'm not sure what you mean by the raytracing argument, but occlusion queries using software rendering *are* feasible and are already being used in AAA titles (DICE is using them in its engine for Battlefield: Bad Company 2). We are soon going to research this subject as well, and as you can read a few posts back, I already made a rudimentary single-threaded implementation on PC. Of course, as with all occlusion culling systems, the effectiveness depends on the kind of environment you have. While z-buffer based occlusion queries are pretty decent all-round, for indoor environments with clearly separated rooms and corridors a cell and portal system could provide better culling. But for large dynamic open-world environments, I reckon that using z-buffer based queries is both the best culling system and the easiest to set up (since you can mostly use the actual modeled geometry for occlusion, although you probably still want to insert low-poly occlusion-only meshes here and there and skip high-poly or skinned meshes to speed things up).

#165 Herrcoolness

    New Member

  • Members
  • 19 posts

Posted 24 July 2010 - 10:43 AM

.oisyn said:

Try using RCPPS followed by one or two iterations of a newton-raphson division (if x is an approximation of 1/d (which is given by the rcpps instruction), x*(2 - d*x) is a better one).

Thnx for the tip ;)

#166 v71

    Valued Member

  • Members
  • 353 posts

Posted 24 July 2010 - 10:09 PM

Sorry for my messy earlier post; I'll try to explain better.
For a software rasterizer, you basically have to write two renderers: one using OpenGL or DirectX, and the other running entirely on the CPU.
I mean everything: vertex rotation, perspective division, and a fast triangle filler.
I know that it is sufficient to use only the z-buffer, a lower screen resolution, and other optimizations, but I am asking myself: is it worth writing a system like this? Isn't ray tracing approaching fast?
Even if ray tracing won't be used to render a complete scene with lighting, will the new multicore GPU boards allow us to write a visibility system running entirely in hardware within 2-3 years?

#167 Herrcoolness

    New Member

  • Members
  • 19 posts

Posted 25 July 2010 - 05:39 AM

v71 said:

Sorry for my messy earlier post; I'll try to explain better.
For a software rasterizer, you basically have to write two renderers: one using OpenGL or DirectX, and the other running entirely on the CPU.
I mean everything: vertex rotation, perspective division, and a fast triangle filler.
I know that it is sufficient to use only the z-buffer, a lower screen resolution, and other optimizations, but I am asking myself: is it worth writing a system like this? Isn't ray tracing approaching fast?
Even if ray tracing won't be used to render a complete scene with lighting, will the new multicore GPU boards allow us to write a visibility system running entirely in hardware within 2-3 years?

Raytracing is good for drawing high-quality images, or for physics collision or AI. For this, better to use a scanline triangle filler or a tile-based triangle filler. It's fast, memory friendly, coherent, and it uses less math. Triangle rasterizers evolved from raytracing as a faster way to draw triangles. So why would you use raytracing for the occlusion pass when you can use a fast triangle rasterizer?

#168 Herrcoolness

    New Member

  • Members
  • 19 posts

Posted 25 July 2010 - 04:52 PM

.oisyn said:

Try using RCPPS followed by one or two iterations of a newton-raphson division (if x is an approximation of 1/d (which is given by the rcpps instruction), x*(2 - d*x) is a better one).

I used this method, which added 4 SSE instructions: 2 muls, 1 mov (constant read), and 1 sub.
The speed was the same as using divps. :huh:

#169 .oisyn

    DevMaster Staff

  • Moderators
  • 1842 posts

Posted 25 July 2010 - 08:35 PM

Hmmm that's too bad. But I think I read it in an intel optimization manual once, but that was a couple of years back (P4 era) and maybe the divps has evolved since then. Or perhaps I'm just mistaken :)

.edit: no, it's still there: http://www.intel.com...nual/248966.pdf
Chapter 6.1:

Quote

Use the reciprocal instructions followed by iteration for increased accuracy. These instructions yield reduced accuracy but execute much faster. Note the following:
— If reduced accuracy is acceptable, use them with no iteration.
— If near full accuracy is needed, use a Newton-Raphson iteration.
— If full accuracy is needed, then use divide and square root which provide more accuracy, but slow down performance.

If you google on "rcpps newton raphson", a lot of sites are saying it's faster as well.

#170 Herrcoolness

    New Member

  • Members
  • 19 posts

Posted 26 July 2010 - 04:44 AM

.oisyn said:

Hmmm that's too bad. But I think I read it in an intel optimization manual once, but that was a couple of years back (P4 era) and maybe the divps has evolved since then. Or perhaps I'm just mistaken :)

.edit: no, it's still there: http://www.intel.com...nual/248966.pdf
Chapter 6.1:


If you google on "rcpps newton raphson", a lot of sites are saying it's faster as well.

I used:

// xmm1 - input value d
rcpps xmm0, xmm1    // xmm0 = x, an approximation of 1/d
mulps xmm1, xmm0    // xmm1 = d*x
mulps xmm1, xmm0    // xmm1 = d*x*x
addps xmm0, xmm0    // xmm0 = 2*x
subps xmm0, xmm1    // xmm0 = 2*x - d*x*x = x*(2 - d*x)
// xmm0 - output value

Now there is no constant load.
But the speed is still almost the same:
*DIVPS - 32.50 fps
*RCPPS + Newton-Raphson iteration - 32.40 fps
*RCPPS - 33.70 fps
I have an Intel Core 2 Quad Q8300, so it may be true: after 4 years, DIVPS may have become faster. But I will use the iteration for older CPUs. :) Still, thanks for the tip :).
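
As a check on the instruction sequence in this post, the register contents can be traced step by step. The symbols $x_0$ (the RCPPS estimate), $x_1$ (the refined value) and $\varepsilon$ (the relative error of the estimate) are notation introduced here, not taken from the thread.

```latex
\begin{aligned}
\texttt{xmm0} &= x_0 \approx 1/d            && \text{(rcpps)} \\
\texttt{xmm1} &= d\,x_0                     && \text{(mulps)} \\
\texttt{xmm1} &= d\,x_0^{2}                 && \text{(mulps)} \\
\texttt{xmm0} &= 2\,x_0                     && \text{(addps)} \\
\texttt{xmm0} &= 2\,x_0 - d\,x_0^{2} = x_0\,(2 - d\,x_0) = x_1 && \text{(subps)} \\[4pt]
x_0 = \tfrac{1}{d}(1-\varepsilon) \;\Rightarrow\; x_1 &= \tfrac{1}{d}(1-\varepsilon)(1+\varepsilon) = \tfrac{1}{d}\,(1-\varepsilon^{2})
\end{aligned}
```

So the sequence does compute x*(2 - d*x) without loading the constant 2, and one step squares the relative error. With the bound Intel documents for RCPPS, $|\varepsilon| \le 1.5 \cdot 2^{-12}$, the refined result has a relative error of at most about $2.25 \cdot 2^{-24}$ before rounding. That is close to full single precision but not quite there.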

#171 Nick

    Senior Member

  • Members
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 26 July 2010 - 01:14 PM

Herrcoolness said:

I have Intel Core 2 Quad Q8300. It may be true. After 4 years the DIVPS can be faster. But i will use the iteration for older CPU's. :)

Starting from the Core 2 on 45 nm technology, Intel implemented a new radix-16 division unit, which is twice as fast as its predecessor.

divps still has a high latency of maximum 15 cycles, but if you have other instructions that can execute independently then that's no problem. If instead you use rcpps and a Newton-Raphson iteration, the total latency is nearly identical but you're executing more instructions (while you could have done other work instead).

So indeed on newer processors it's faster to use divps, and you even get full precision!

#172 .oisyn

    DevMaster Staff

  • Moderators
  • 1842 posts

Posted 26 July 2010 - 02:26 PM

Nick, do you know any resources where you can get that kind of information? Or is it just a matter of keeping up with latest developments?

#173 Herrcoolness

    New Member

  • Members
  • 19 posts

Posted 13 August 2010 - 02:46 PM

News-news-news guys. ;) So I (re)implemented the hierarchical z-buffer with 3 basic functions: fast tile skip, standard per-pixel z comparison, and fast z writing without comparing against the old z values in the z-buffer.

I uploaded 2 demos: one with colored debug info, and one without the coloring to see how it normally works.
*black tiles - skipped tiles of the hidden small quad
*green tiles - tiles drawn with the fast write function (no z comparison) and not compared against the triangle edges
*cyan tiles - tiles drawn with the fast write function (no z comparison) but compared against the triangle edges
*gray tiles - tiles drawn with the function that compares the z-values against the z-buffer, and compared against the triangle edges
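
The depth half of that tile decision can be sketched roughly as below. This is an illustration under stated assumptions, not the demo's actual code: it assumes smaller z means closer, a strict "less than" depth test, and a per-tile record of the nearest and farthest stored depth. The names `TileDepth`, `DepthPath` and `classifyTile` are made up for the example. Whether a tile also needs edge tests (green versus cyan) is a separate coverage check and is left out.

```cpp
#include <cassert>

// Hypothetical per-tile summary of the z-buffer: nearest and farthest stored depth.
struct TileDepth { float zMin; float zMax; };

// Which depth path a tile takes for the triangle being drawn.
enum class DepthPath { Skip, FastWrite, PerPixelTest };

// triZMin / triZMax: depth range of the triangle over this tile (assumed precomputed).
DepthPath classifyTile(const TileDepth& tile, float triZMin, float triZMax)
{
    assert(tile.zMin <= tile.zMax && triZMin <= triZMax);

    // Every triangle pixel is at or behind the farthest stored depth: nothing can pass.
    if (triZMin >= tile.zMax) return DepthPath::Skip;

    // Every triangle pixel is in front of the nearest stored depth: all pass,
    // so z can be written without reading the old values.
    if (triZMax < tile.zMin) return DepthPath::FastWrite;

    // Ranges overlap: fall back to the per-pixel comparison.
    return DepthPath::PerPixelTest;
}
```

In the colour scheme above, `Skip` would correspond to the black tiles, `FastWrite` to green or cyan, and `PerPixelTest` to gray.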

Next stop ...clipping and transform pipeline ... and first rotated cube? :happy:

https://sourceforge....cts/phenomenon/

#174 Herrcoolness

    New Member

  • Members
  • 19 posts

Posted 29 August 2010 - 02:55 PM

OK guys, what's new?
Now the triangle input coordinates are in NDC (normalized device coordinates), so the x and y positions need to be in the [-1, +1] interval. Why? Because this is what graphics cards use, and it helped me solve the problem of what happens when you change the size of the window. Now the size of the triangles changes too and stays proportional to the rendering window.
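
A minimal sketch of the NDC-to-window mapping that makes this work is shown below. The y flip assumes NDC y points up while pixel rows grow downward (the Direct3D-style convention); the names `Pixel` and `ndcToScreen` are assumptions made for illustration.

```cpp
#include <cassert>

// Window-space position in pixels (fractional, before any rounding or fixed-point snap).
struct Pixel { float x; float y; };

// Maps NDC x, y in [-1, +1] to a window of width x height pixels.
// Assumes NDC y points up and pixel row 0 is at the top of the window.
Pixel ndcToScreen(float ndcX, float ndcY, int width, int height)
{
    assert(width > 0 && height > 0);
    Pixel p;
    p.x = (ndcX + 1.0f) * 0.5f * static_cast<float>(width);
    p.y = (1.0f - ndcY) * 0.5f * static_cast<float>(height);
    return p;
}
```

Because only `width` and `height` change when the window is resized, the same NDC triangle automatically scales with it.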

And I added a third texture filtering method for low-end PCs. It's almost as fast as nearest texture filtering (because it needs only 1 texture fetch) but looks almost like bilinear. Yes, yes, you saw this method in Unreal. I found a description of this technique in the old flipcode archive on the net (http://www.flipcode....In_Unreal.shtml).
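
The idea can be sketched as follows: before the single nearest-texel fetch, the texture coordinates are nudged by a small offset that depends on the parity of the screen pixel, so neighbouring pixels sample slightly different texels and the eye blends them. The offset table below is written from memory of the flipcode article and should be treated as illustrative; the names `TexelCoord`, `wrapTexel` and `ditheredTexel` are made up, and the coordinates are assumed to be in texel units with wrap addressing.

```cpp
#include <cassert>
#include <cmath>

// Integer texel address after dithering and wrapping.
struct TexelCoord { int u; int v; };

// Wraps an integer texel index into [0, size), including negative inputs.
static int wrapTexel(int i, int size)
{
    const int m = i % size;
    return m < 0 ? m + size : m;
}

// u, v: texture coordinates in texel units. screenX, screenY: pixel being shaded.
TexelCoord ditheredTexel(float u, float v, int screenX, int screenY,
                         int texWidth, int texHeight)
{
    assert(texWidth > 0 && texHeight > 0);

    // 2x2 ordered-dither offsets, indexed [screenY & 1][screenX & 1].
    // Values as recalled from the article; treat them as an assumption.
    static const float kOffsetU[2][2] = { { 0.25f, 0.50f }, { 0.75f, 0.00f } };
    static const float kOffsetV[2][2] = { { 0.00f, 0.75f }, { 0.50f, 0.25f } };

    const int row = screenY & 1;
    const int col = screenX & 1;
    const int iu = static_cast<int>(std::floor(u + kOffsetU[row][col]));
    const int iv = static_cast<int>(std::floor(v + kOffsetV[row][col]));
    return { wrapTexel(iu, texWidth), wrapTexel(iv, texHeight) };
}
```

The cost over plain nearest filtering is two additions and a table lookup per pixel, which is why it stays close to nearest-neighbour speed.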

There are 2 demos:
-one static, to see how fast all 3 techniques are (press 1, 2, 3 to change the filtering technique)
-one dynamic, to see the dither-bilinear technique in action (press 1, 2, 3 to change the filtering technique)

#175 Herrcoolness

    New Member

  • Members
  • 19 posts

Posted 18 September 2010 - 03:09 PM

News-news-news!! I added a full vertex transformation pipeline and homogeneous clipping of triangles, based on Direct3D conventions (orientation, perspective matrix and so on). I created a small demo where you can move the camera and see a big cube with a 2048 x 2048 texture. About the rasterizer: I reimplemented Nick's rasterizer with fixed-point math because of its numerical stability near the edges of the drawing bounds. Sometimes, after clipping and homogeneous division, the points of a triangle ended up outside of the screen, which caused an error in the triangle rasterizer.
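
For reference, homogeneous clipping against one plane of a Direct3D-style clip volume (-w <= x <= w, -w <= y <= w, 0 <= z <= w) can be sketched with a Sutherland-Hodgman pass like the one below. This is a generic illustration, not the demo's code; `Vec4`, `ClipPlane`, `planeDistance`, `interpolate` and `clipPolygon` are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec4 { float x, y, z, w; };

enum class ClipPlane { Left, Right, Bottom, Top, Near, Far };

// Signed distance-like value: >= 0 means the vertex is inside the plane.
float planeDistance(const Vec4& v, ClipPlane plane)
{
    switch (plane) {
        case ClipPlane::Left:   return v.w + v.x;
        case ClipPlane::Right:  return v.w - v.x;
        case ClipPlane::Bottom: return v.w + v.y;
        case ClipPlane::Top:    return v.w - v.y;
        case ClipPlane::Near:   return v.z;
        case ClipPlane::Far:    return v.w - v.z;
    }
    return 0.0f;
}

// Linear interpolation of all four homogeneous components.
Vec4 interpolate(const Vec4& a, const Vec4& b, float t)
{
    assert(t >= 0.0f && t <= 1.0f);
    return { a.x + t * (b.x - a.x), a.y + t * (b.y - a.y),
             a.z + t * (b.z - a.z), a.w + t * (b.w - a.w) };
}

// Clips a convex polygon against a single plane; call once per plane.
std::vector<Vec4> clipPolygon(const std::vector<Vec4>& poly, ClipPlane plane)
{
    std::vector<Vec4> out;
    const std::size_t n = poly.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Vec4& cur  = poly[i];
        const Vec4& next = poly[(i + 1) % n];
        const float dc = planeDistance(cur, plane);
        const float dn = planeDistance(next, plane);
        if (dc >= 0.0f) out.push_back(cur);           // keep inside vertices
        if ((dc >= 0.0f) != (dn >= 0.0f))             // edge crosses the plane
            out.push_back(interpolate(cur, next, dc / (dc - dn)));
    }
    return out;
}
```

Because the intersection points are computed in floating point, they can land a hair outside the volume after the divide by w, which is consistent with the stability problem described in the post.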
About the demo:
q,e - moving in y direction
a,d - moving in x direction
w,s - moving in z direction
1,2,3- filtering method
9,0 - vsync on-off

https://sourceforge....cts/phenomenon/

#176 Mihail121

    Senior Member

  • Members
  • 1059 posts

Posted 18 September 2010 - 04:36 PM

An unhandled exception occurred at $00402437 :
EAccessViolation : Access violation
  $00402437
  $004183A4  DDRAWFLIPWINDOWED,  line 45 of ddrawwindowed.inc
  $0041872C  GS_WNDPROC,  line 144 of gs_screen.inc
  $0041D87E  WNDKEYBPROC,  line 29 of fenomenon_keyboard.pas
  $0042DAA2  WNDMOUSEPROC,  line 62 of fenomenon_mouse.pas
  $7E418734
  $7E418816
  $7E428EA0
  $7E428EEC
  $7C90E473
  $7E4196C7
  $00411530
  $00401D0D

Heap dump by heaptrc unit
97 memory blocks allocated : 13722380/13722720
86 memory blocks freed     : 13616288/13616616
11 unfreed memory blocks : 106092
True heap size : 5373952 (128 used in System startup)
True free heap : 5266832
Should be : 5267016
Call trace for block $0007DEE8 size 64
  $004097E8
  $00408381
  $0041C375
  $004092AE
  $004183A4
  $0041872C
  $0041D87E
  $0042DAA2
Call trace for block $00067158 size 24
  $00408381
  $0041C375
  $004092AE
  $004183A4
  $0041872C
  $0041D87E
  $0042DAA2
  $7E418734
Call trace for block $000670F8 size 16
  $0041C187
  $004092AE
  $004183A4
  $0041872C
  $0041D87E
  $0042DAA2
  $7E418734
  $7E418816
Call trace for block $0011E410 size 391
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
Call trace for block $0011E240 size 391
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
Call trace for block $020299A0 size 1159
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
  $F0F0F0F0
  $F0F0F0F0
  $F0F0F0F0
  $F0F0F0F0
Call trace for block $02028FC0 size 2439
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
  $F0F0F0F0
  $F0F0F0F0
  $F0F0F0F0
  $F0F0F0F0
Call trace for block $020275E0 size 6535
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
Call trace for block $02022400 size 20871
  $00417D0C
  $004119B8
  $00401CD0
  $0040D111
Call trace for block $027A0198 size 74119
  $004178D2
  $004119B8
  $00401CD0
  $0040D111
Call trace for block $00116238 size 83
  $004119B8
  $00401CD0
  $0040D111


#177 Herrcoolness

    New Member

  • Members
  • 19 posts

Posted 19 September 2010 - 06:55 AM

Stabilized and reuploaded :happy:
https://sourceforge....cts/phenomenon/

#178 Mihail121

    Senior Member

  • Members
  • 1059 posts

Posted 19 September 2010 - 09:04 AM

AMD Athlon XP 1900+ at 1.6 GHz
512 RAM
GeForce4 MX 440 with 64 MB
PS/2 Mouse + USB Keyboard

An unhandled exception occurred at $00403247 :
EAccessViolation : Access violation
  $00403247
  $0041A4B4  DDRAWFLIPWINDOWED,  line 45 of ddrawwindowed.inc
  $0041A7DE  GS_WNDPROC,  line 131 of gs_screen.inc
  $0041F98E  WNDKEYBPROC,  line 29 of fenomenon_keyboard.pas
  $00432382  WNDMOUSEPROC,  line 62 of fenomenon_mouse.pas
  $7E418734
  $7E418816
  $7E42C03D
  $7E42C228
  $7E42C1D5
  $004122E7
  $0041A820
  $0041F98E
  $00432382
  $7E418734
  $7E418816
  $7E428EA0

Heap dump by heaptrc unit
84 memory blocks allocated : 8439373/8439688
77 memory blocks freed     : 8434761/8435056
7 unfreed memory blocks : 4612
True heap size : 1867776 (80 used in System startup)
True free heap : 1862528
Should be : 1862616
Call trace for block $00085DD8 size 64
  $0040A5F8
  $00409191
  $0041E485
  $0040A0BE
  $0041A4B4
  $0041A7DE
  $0041F98E
  $00432382
Call trace for block $00067068 size 24
  $00409191
  $0041E485
  $0040A0BE
  $0041A4B4
  $0041A7DE
  $0041F98E
  $00432382
  $7E418734
Call trace for block $00067008 size 16
  $0041E297
  $0040A0BE
  $0041A4B4
  $0041A7DE
  $0041F98E
  $00432382
  $7E418734
  $7E418816
Call trace for block $000E96B0 size 147
  $0040DF21
Call trace for block $000A96B8 size 3859
  $00402A45
  $0040DF21
Call trace for block $000A16A0 size 403
  $00402A45
  $0040DF21
Call trace for block $00099698 size 99
  $00402A45
  $0040DF21


#179 Herrcoolness

    New Member

  • Members
  • 19 posts

Posted 19 September 2010 - 03:12 PM

Mihail121 said:

AMD Athlon XP 1900+ at 1.6 GHz
512 RAM
GeForce4 MX 440 with 64 MB
PS/2 Mouse + USB Keyboard


Is your PC SSE2 compatible? What is your desktop resolution? I ask because some detections are not implemented in the program yet.
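
A start-up check along these lines would turn the access violation into a clean refusal on CPUs without SSE2. The sketch is in C++ and relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`; the demo itself appears to be written in Pascal, so this only shows the idea. The function name `cpuSupportsSse2` is an assumption.

```cpp
#include <cassert>

// Returns true when the CPU reports SSE2 support.
// Uses GCC/Clang builtins on x86; on other compilers or architectures it
// conservatively reports false so a caller can fall back to a non-SSE2 path.
bool cpuSupportsSse2()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    const bool hasSse2 = __builtin_cpu_supports("sse2") != 0;
#if defined(__x86_64__)
    assert(hasSse2); // SSE2 is part of the x86-64 baseline
#endif
    return hasSse2;
#else
    return false;
#endif
}
```

Calling this once before creating the window, and showing a message instead of continuing, avoids executing an unsupported instruction in the first place.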

#180 Mihail121

    Senior Member

  • Members
  • 1059 posts

Posted 19 September 2010 - 04:20 PM

Herrcoolness said:

Is your PC SSE2 compatible? What is your desktop resolution? I ask because some detections are not implemented in the program yet.

SSE2 is not supported, MMX and 3DNow! only. Desktop resolution is 1024x768@16.




