

Advanced Rasterization


187 replies to this topic

#141 Geri

    New Member

  • Members
  • PipPip
  • 30 posts

Posted 26 May 2010 - 10:17 AM

It's fast (sometimes a bit faster than mine, sometimes a lot faster, sometimes equal or slower). How many CPU cores can it use?

#142 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 27 May 2010 - 11:16 AM

Geri said:

How many cpu cores can it use?
In theory up to 16. Depending on the application I get about 70% CPU usage on my 8-threaded Core i7. For a windowed game you can open http://localhost:8080/swiftconfig in a browser and change some of the settings.

#143 Geri

    New Member

  • Members
  • PipPip
  • 30 posts

Posted 28 May 2010 - 09:09 PM

Why don't you try to sell it to Linux users, as a DLL for Wine, for people who have problems with 3D acceleration?

#144 jiti

    New Member

  • Members
  • PipPip
  • 16 posts

Posted 30 May 2010 - 05:36 PM

Can nobody answer my question about 2D homogeneous rasterization and the fill convention? :(

#145 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 30 May 2010 - 07:23 PM

jiti said:

Hi Nick. I read the paper about 2D homogeneous rasterization. I nearly understand the rasterization, but all the calculations need to be done with floating-point numbers. How do they handle the fill convention? It needs finite precision, like integers have. Thanks in advance for the answer.
Hi jiti. The paper doesn't discuss the fill convention. But in the first post of this thread I explain how to implement it correctly using fixed-point numbers. If it's not clear how it's done, try looking for tutorials on fixed-point math.
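As a rough illustration of the fixed-point idea, here is a minimal sketch of a top-left fill rule on 28.4 fixed-point half-space functions. It is not the code from the first post: the names (`covers`, `toFixed`, `Vec2`), the winding convention, the y-down screen space, and sampling at integer pixel positions are all assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kSubBits = 4;  // 28.4 fixed point: 4 subpixel bits (assumed)

struct Vec2 { float x, y; };

// Convert a screen coordinate to 28.4 fixed point, rounding to nearest.
inline int32_t toFixed(float v) {
    assert(std::fabs(v) < 1.0e6f);  // assumed: coordinates already clipped to a sane range
    return static_cast<int32_t>(std::lround(v * (1 << kSubBits)));
}

// True when the sample at integer pixel (px, py) is covered by v0 v1 v2.
// Assumed winding: the edge function of each edge is positive inside.
bool covers(Vec2 v0, Vec2 v1, Vec2 v2, int px, int py) {
    const Vec2 v[3] = {v0, v1, v2};
    const int64_t x = static_cast<int64_t>(px) * (1 << kSubBits);
    const int64_t y = static_cast<int64_t>(py) * (1 << kSubBits);
    for (int i = 0; i < 3; ++i) {
        const int64_t ax = toFixed(v[i].x), ay = toFixed(v[i].y);
        const int64_t bx = toFixed(v[(i + 1) % 3].x), by = toFixed(v[(i + 1) % 3].y);
        const int64_t dx = bx - ax, dy = by - ay;
        // With y pointing down: left edges go up (dy < 0), top edges go right.
        const bool topLeft = dy < 0 || (dy == 0 && dx > 0);
        // 24.8 result; exact because everything is an integer.
        const int64_t e = dx * (y - ay) - dy * (x - ax);
        // Adding 1 on top-left edges turns "> 0" into ">= 0" for those edges only.
        if (e + (topLeft ? 1 : 0) <= 0) return false;
    }
    return true;
}
```

Because the arithmetic is exact, a sample that lies on an edge shared by two triangles evaluates to exactly zero for both, and the bias hands it to exactly one of them.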

Quote

About the hierarchical z-buffer technique: it's problematic because the whole pyramid, with zmin and zmax info, tile info, and subtile pointers on every level, is simply too much data for the CPU cache, so it's slow. First I programmed it with a recursive algorithm (using call/ret subroutines), then with per-level throughput checking, but both algorithms were slow. So I went back to basics and did a 2-level hierarchical z-buffer (zmin/zmax per tile), and later a 3-level one (64x64 tiles with 8x8 subtiles) like in the Larrabee rasterization paper. I still need to compare the speed of the 3-level against the 2-level z-buffer. Now I know why other developers don't use the full hierarchical z-buffer pyramid.
It's not easy to get a speedup from a hierarchical z-buffer in software. Unlike in hardware, there's a real overhead in terms of clock cycles. Also, hardware uses it both to reduce overdraw and to reduce memory bandwidth. But nowadays many applications use a depth-only pass so overdraw is already minimal, and a software renderer doesn't really benefit from a reduction in memory bandwidth (the number of clock cycles spent per pixel is typically quite high so bandwidth isn't the bottleneck).

Anyway, it depends on the application, and you'll need to experiment to see what works best.
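For readers who want to experiment, a two-level scheme (full-resolution depth plus one zmax per tile) can be sketched roughly as follows. The class name `TiledDepth`, the 8x8 tile size, the "smaller is nearer, cleared to 1.0" depth convention, and the explicit `refreshTile` step are assumptions for this example, not a description of any renderer mentioned in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Two-level depth buffer: per-pixel depth plus the farthest depth per tile.
// Assumed convention: smaller depth is nearer, buffer cleared to 1.0.
class TiledDepth {
public:
    static constexpr int kTile = 8;

    TiledDepth(int width, int height)
        : w_(width), h_(height),
          tilesX_((width + kTile - 1) / kTile),
          tilesY_((height + kTile - 1) / kTile),
          depth_(static_cast<std::size_t>(width) * height, 1.0f),
          zmax_(static_cast<std::size_t>(tilesX_) * tilesY_, 1.0f) {}

    // Coarse test: a primitive whose nearest depth is not in front of the
    // farthest depth stored in the tile cannot change any pixel of that tile.
    bool tileOccluded(int tx, int ty, float primZMin) const {
        return primZMin >= zmax_[tileIndex(tx, ty)];
    }

    // Per-pixel "less than" depth test and write.
    bool writePixel(int x, int y, float z) {
        assert(x >= 0 && x < w_ && y >= 0 && y < h_);
        float& d = depth_[static_cast<std::size_t>(y) * w_ + x];
        if (z >= d) return false;
        d = z;
        return true;
    }

    // Recompute the tile's zmax after pixels in it were written.
    void refreshTile(int tx, int ty) {
        float farthest = 0.0f;
        const int x1 = std::min((tx + 1) * kTile, w_);
        const int y1 = std::min((ty + 1) * kTile, h_);
        for (int y = ty * kTile; y < y1; ++y)
            for (int x = tx * kTile; x < x1; ++x)
                farthest = std::max(farthest, depth_[static_cast<std::size_t>(y) * w_ + x]);
        zmax_[tileIndex(tx, ty)] = farthest;
    }

private:
    std::size_t tileIndex(int tx, int ty) const {
        return static_cast<std::size_t>(ty) * tilesX_ + tx;
    }
    int w_, h_, tilesX_, tilesY_;
    std::vector<float> depth_, zmax_;
};
```

The `refreshTile` pass is exactly the kind of extra clock-cycle cost discussed above, so whether the coarse test pays off depends on how many tiles it actually rejects.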

#146 jiti

    New Member

  • Members
  • PipPip
  • 16 posts

Posted 30 May 2010 - 08:37 PM

Nick, thanks for the answer. :)

Yes, Nick. Program development is all about experiments... As you can see, I experimented with a full z-buffer pyramid and learned new things. :)

I understand your algorithm with the fill convention. The rule is that it must be done with integers because of their finite precision (I experimented with it, and only integers worked well).
But how can your integer fill-convention algorithm be combined with 2D homogeneous rasterization?
2D homogeneous rasterization needs the calculations to be done in floating point, but the fill convention doesn't work with floating-point numbers.

So the question is: HOW CAN I COMBINE THEM? :(

#147 thebigT

    New Member

  • Members
  • PipPip
  • 18 posts

Posted 31 May 2010 - 06:16 AM

jiti, I toyed around with 2D homogeneous rasterisation last year after reading that paper. You find the triangle edge gradients in 2DH space as the article describes, using floating-point math, then convert the values to fixed-point format for rasterisation. You will need to take into account your screen scaling / mapping in the same way you do for other surface interpolants like depth and texture values. I gave up on the technique because in extreme cases the gradients would overflow the 32-bit fixed-point integer. If anyone has had success with this method I would be interested to hear about it. It was tantalisingly elegant.
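The float-then-fixed flow described above might look roughly like this sketch. It assumes the edge function through two homogeneous vertices (x, y, w) is taken as their cross product, and that the conversion reports failure when a coefficient would not fit in 32 bits. The names (`edgeThrough`, `toFixedEdge`), the choice of fractional bits, and the omission of the screen mapping are assumptions for illustration, not thebigT's code.

```cpp
#include <cmath>
#include <cstdint>

struct Vec3  { float x, y, w; };    // homogeneous vertex (clip-space x, y, w)
struct EdgeF { float a, b, c; };    // E(x, y) = a*x + b*y + c, float coefficients
struct EdgeI { int32_t a, b, c; };  // the same coefficients in fixed point

// Edge function through two homogeneous vertices: their cross product.
// It evaluates to zero at both p and q.
EdgeF edgeThrough(const Vec3& p, const Vec3& q) {
    return { p.y * q.w - p.w * q.y,
             p.w * q.x - p.x * q.w,
             p.x * q.y - p.y * q.x };
}

// Convert to fixed point with fracBits fractional bits.
// Returns false when a coefficient would overflow a 32-bit integer.
bool toFixedEdge(const EdgeF& e, int fracBits, EdgeI& out) {
    const float scale = std::ldexp(1.0f, fracBits);
    const float limit = 2147483520.0f;  // largest float below 2^31
    const float v[3] = { e.a * scale, e.b * scale, e.c * scale };
    for (float f : v) {
        if (!(std::fabs(f) <= limit)) return false;  // also rejects NaN
    }
    out = { static_cast<int32_t>(std::lround(v[0])),
            static_cast<int32_t>(std::lround(v[1])),
            static_cast<int32_t>(std::lround(v[2])) };
    return true;
}
```

The `false` return is the "extreme cases" failure mode: large or nearly degenerate triangles produce coefficients that do not fit, and the caller has to fall back to something else.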

#148 jiti

    New Member

  • Members
  • PipPip
  • 16 posts

Posted 31 May 2010 - 02:53 PM

thebigT said:

jiti, I toyed around with 2D homogeneous rasterisation last year after reading that paper. You find the triangle edge gradients in 2DH space as the article describes, using floating-point math, then convert the values to fixed-point format for rasterisation. You will need to take into account your screen scaling / mapping in the same way you do for other surface interpolants like depth and texture values. I gave up on the technique because in extreme cases the gradients would overflow the 32-bit fixed-point integer. If anyone has had success with this method I would be interested to hear about it. It was tantalisingly elegant.

The gradient values can overflow (with integer calculations) if the area of the triangle is smaller than 1. So we only need to draw triangles whose area is at least 1. The area is already calculated when we check for back-facing (in 2D). It's logical that we can't see triangles with an area smaller than 1. :)

Yes thebigT, I found the description in the paper of the conversion from float to fixed-point numbers, but if I do it per tile for the edge interpolants, the fill convention won't work, because the edge interpolants were interpolated with floating-point numbers. I can't mix the precisions of floating-point and integer numbers in the edge interpolants. If the triangle is huge, the edge interpolants are big numbers that can't be converted to integers because they don't fit in a 32-bit integer, as you say, and some precision is lost in floating point. Nick's fill convention needs bit-exact precision, so only fixed-point numbers can be used.
If we don't find a good solution for a working fill convention, then 2D homogeneous rasterization and clipping is useless for me and I'll go back to standard 3D polygon-plane clipping. :(

A pseudo-algorithm would help me understand it better.

#149 thebigT

    New Member

  • Members
  • PipPip
  • 18 posts

Posted 02 June 2010 - 02:36 AM

The only suggestion I can make is to try using 64-bit integers. I didn't want to make the jump to 64-bit in my program, so I returned to conventional screen-space rasterisation.
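To show what widening buys, here is a small sketch of a screen-space edge function on 28.4 fixed-point inputs evaluated in 64 bits, plus a check for whether the result would still have fit in 32 bits. The function names and the 28.4 format are assumptions for this example. It only illustrates the overflow issue for ordinary edge functions and does not settle whether 64 bits are enough for the 2DH gradients.

```cpp
#include <cstdint>
#include <limits>

// Edge function for the edge a->b evaluated at p. Inputs are 28.4 fixed
// point; the products are formed in 64 bits so they cannot overflow.
int64_t edge64(int32_t ax, int32_t ay, int32_t bx, int32_t by,
               int32_t px, int32_t py) {
    const int64_t dx = static_cast<int64_t>(bx) - ax;
    const int64_t dy = static_cast<int64_t>(by) - ay;
    return dx * (static_cast<int64_t>(py) - ay)
         - dy * (static_cast<int64_t>(px) - ax);
}

// True when the value could have been held in a 32-bit integer.
bool fitsInt32(int64_t v) {
    return v >= std::numeric_limits<int32_t>::min()
        && v <= std::numeric_limits<int32_t>::max();
}
```

With 4 subpixel bits, coordinates around 20000 pixels from the origin already give products far beyond the 32-bit range, which is why guard-band clipping or wider integers come up in this kind of rasterizer.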

#150 jiti

    New Member

  • Members
  • PipPip
  • 16 posts

Posted 02 June 2010 - 11:01 AM

thebigT said:

The only suggestion I can make is to try using 64-bit integers. I didn't want to make the jump to 64-bit in my program, so I returned to conventional screen-space rasterisation.

Adding more bits doesn't make sense. 2D homogeneous rasterization is about detecting whether a 3D point in space is inside the convex polyhedron formed by the frustum clipping planes and the planes constructed from the triangle edges. The position of the point is interpolated across the triangle, and we detect whether the point is inside or outside the polyhedron. It's similar to building a triangle from lines and detecting whether a 2D point is in the same half-space of all 3 lines (the tutorial of this thread). The polyhedron or the triangle can be huge or very small, so we operate with true space positions (no normalization tricks, no numbers between -1 and 1) and need variable precision, so we can only use floating-point numbers. That is the only number type flexible enough for the calculations. As you said, with a fixed number of bits we can easily overflow the integer, or lose all precision if the number is too small. The problem can arise when we divide by z (or w).

Does anyone have a working combination of 2D homogeneous rasterization with the fill convention? Is there any trick? Or do we close this problem without a solution? :(

Damn, my head hurts!!

#151 .oisyn

    DevMaster Staff

  • Moderators
  • 1842 posts

Posted 26 June 2010 - 01:19 AM

Digging up an old thread. I'm currently in the process of researching whether it's feasible for us to do occlusion queries using software based z-buffer rendering. We don't want to use hardware occlusion queries as the GPU isn't really designed for quick queries - the latency is too high, and it's usually busy doing other work (in most of our usecases, the GPU is still busy rendering the previous frame while we're in the process of culling for the next). I too have implemented a rudimentary triangle rasterizer and have used some of the tips in this topic :). Of course, my job is a bit easier as I only need to fill a low resolution zbuffer - no color interpolation and texture fetches are necessary ;)

I'm currently using floating point math, which is enough precision for the resolution I need (probably about 256 pixels wide, perhaps 512). How does integer SSE math compare to float SSE?
C++ addict
-
Currently working on: the 3D engine for Tomb Raider.

#152 jiti

    New Member

  • Members
  • PipPip
  • 16 posts

Posted 26 June 2010 - 09:29 AM

.oisyn, I think integer math with SSE is a bit faster than, or almost the same as, floating point on modern CPUs, but the SSE instruction set is optimized more for floating-point operations than for integers, so it's better to use floating-point numbers if you don't do fill convention, fixed-point math, color packing/unpacking, or other integer-specific operations. On the other hand, with integer math you can use the MMX registers, for example just for storing information (so you don't access memory and the CPU cache) or for integer math, which can speed things up too. It's your choice...

#153 jiti

    New Member

  • Members
  • PipPip
  • 16 posts

Posted 26 June 2010 - 09:48 AM

Jiti will be renamed to Herrcoolness

#154 Herrcoolness

    New Member

  • Members
  • PipPip
  • 19 posts

Posted 26 June 2010 - 10:56 AM

OK guys... I uploaded a new version of my Phenomenon engine to SourceForge: https://sourceforge....cts/phenomenon/. For now there is no z-buffer, but that's a question of time and copy & paste from the old n-level hierarchical z-buffer code. As I wrote in the news on the project's site, there will be no full pyramidal z-buffer, just 1 or 2 levels. I recoded the whole triangle drawing system. Compared to Nick's algorithm, the subpixel precision is 11-bit instead of 4-bit, and I don't use Nick's basic rectangle traversal algorithm but my own, based on edge detection and block scanlining, which is better for triangle drawing. I think all graphics card developers use similar algorithms, as I have seen many pictures showing triangle traversal paths. Checking whether a tile is inside, partially inside, or outside is optimized with trivial accept and reject methods based on Intel's document about Larrabee rasterization. Finally, I work in the negative half-spaces of the lines, not the positive ones as in Nick's algorithm, because of the sign bit of integers, which I use in logic operations. The triangle rasterizing code, without pixel drawing, is nicely fast and eats almost none of the CPU time. :) And it's not even assembler optimized. :w00t: Later tests will show more. ;)
I wish I could write an article of similar quality to Nick's :worthy: , as part 2 of "Advanced Rasterization", with my research and code (Pascal/asm) snippets. There is one problem: my English is horrible... :wallbash:
Oh, and I coded texture fetching with bilinear interpolation, with clamp and repeat options. No mip-mapping yet.
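The trivial accept/reject tile test mentioned above can be sketched as follows. This is only an illustration of the general idea: for each edge function, look at the tile corner where it is largest and the corner where it is smallest. It uses positive half-spaces (inside when E >= 0) rather than the negative ones described in the post, and the names `Edge`, `TileClass` and `classifyTile` are made up for the example.

```cpp
#include <array>
#include <cstdint>

struct Edge { int64_t a, b, c; };  // E(x, y) = a*x + b*y + c, inside when E >= 0

enum class TileClass { Outside, Partial, Inside };

// Classify the square of samples from (x0, y0) to (x0 + extent, y0 + extent).
TileClass classifyTile(const std::array<Edge, 3>& edges,
                       int64_t x0, int64_t y0, int64_t extent) {
    bool allInside = true;
    for (const Edge& e : edges) {
        const int64_t base = e.a * x0 + e.b * y0 + e.c;
        // E is linear, so its extremes over the tile lie at corners chosen
        // by the signs of a and b.
        const int64_t maxE = base + (e.a > 0 ? e.a * extent : 0)
                                  + (e.b > 0 ? e.b * extent : 0);
        const int64_t minE = base + (e.a < 0 ? e.a * extent : 0)
                                  + (e.b < 0 ? e.b * extent : 0);
        if (maxE < 0) return TileClass::Outside;  // trivial reject
        if (minE < 0) allInside = false;          // this edge cuts the tile
    }
    return allInside ? TileClass::Inside : TileClass::Partial;
}
```

Tiles classified as `Inside` can be filled without per-pixel edge tests. `Partial` is conservative: near a vertex, a tile that is actually empty may still end up there and get rejected only at the pixel level.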

The graphics odyssey continues.... :cool:

#155 .oisyn

    DevMaster Staff

  • Moderators
  • 1842 posts

Posted 26 June 2010 - 01:19 PM

jiti said:

.oisyn, I think integer math with SSE is a bit faster than, or almost the same as, floating point on modern CPUs, but the SSE instruction set is optimized more for floating-point operations than for integers, so it's better to use floating-point numbers if you don't do fill convention
Nonsense, floats work perfectly with fill convention. You just have to use the right epsilon when increasing the halfplane values for the topleft rule.

I snap my vertex coords to a grid by adding and subtracting 16777216/subpixelgridsize, and then I add 1/subpixelgridsize to the halfplane value for top-left edges. I've currently defined subpixelgridsize as 16, which corresponds to the 4 bits after the binary point in Nick's solution, but I'm still experimenting with it. 64 gives smoother results for slow-moving polygons, but it takes away precision, so I might get into trouble with polygons way larger than the screen and I'll probably have to clip at some point. Then again, it's for visibility determination only, so I can afford to be off by a pixel or two.

Also, I need to be able to run on 360 and PS3 as well. I know the Cell SPUs have vector int arithmetic, but I'm not so sure about AltiVec in the PowerPC.
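A float version of the fill rule along these lines might look like the sketch below. Several things are assumptions and not taken from the post: the snapping uses explicit rounding (`std::nearbyint`) instead of the add-and-subtract trick, sample positions are assumed to lie on the same subpixel grid, and the names are made up. In this formulation the edge function is a sum of products of two grid-aligned differences, so its smallest nonzero step is 1/(kGrid*kGrid), and that is the bias used here. The post adds 1/subpixelgridsize, which presumably reflects how the halfplane values are scaled there.

```cpp
#include <cmath>

constexpr float kGrid = 16.0f;  // subpixel grid size, as in the post

// Snap to the nearest multiple of 1/kGrid (explicit rounding).
inline float snap(float v) { return std::nearbyint(v * kGrid) / kGrid; }

struct P { float x, y; };

// True when the sample (px, py) is covered. Assumed: y points down, the edge
// functions are positive inside, and px/py are multiples of 1/kGrid.
bool coversFloat(P v0, P v1, P v2, float px, float py) {
    const P v[3] = { {snap(v0.x), snap(v0.y)},
                     {snap(v1.x), snap(v1.y)},
                     {snap(v2.x), snap(v2.y)} };
    const float eps = 1.0f / (kGrid * kGrid);  // smallest step of e below
    for (int i = 0; i < 3; ++i) {
        const P a = v[i];
        const P b = v[(i + 1) % 3];
        const float dx = b.x - a.x, dy = b.y - a.y;
        const bool topLeft = dy < 0.0f || (dy == 0.0f && dx > 0.0f);
        // Exact in float while the coordinates stay small enough for the
        // products to fit in the 24-bit mantissa.
        const float e = dx * (py - a.y) - dy * (px - a.x);
        if (!(e + (topLeft ? eps : 0.0f) > 0.0f)) return false;
    }
    return true;
}
```

The precision caveat in the post shows up in the comment on `e`: once the triangle gets much larger than the screen, the products stop being exact and the bias can no longer be relied on.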

Anyway, currently, on an Intel Core 2 I'm able to push 5000 on-screen cubes within 7 million cycles (2.33ms on a single 3GHz core) using a 256x144 zbuffer. When quadrupling the buffer size to 512x288 it goes up to 9.8ms.


#156 Herrcoolness

    New Member

  • Members
  • PipPip
  • 19 posts

Posted 26 June 2010 - 02:02 PM

.oisyn said:

Nonsense, floats work perfectly with fill convention. You just have to use the right epsilon when increasing the halfplane values for the topleft rule.

I snap my vertex coords to a grid by adding and subtracting 16777216/subpixelgridsize, and then I add 1/subpixelgridsize to the halfplane value for top-left edges. I've currently defined subpixelgridsize as 16, which corresponds to the 4 bits after the binary point in Nick's solution, but I'm still experimenting with it. 64 gives smoother results for slow-moving polygons, but it takes away precision, so I might get into trouble with polygons way larger than the screen and I'll probably have to clip at some point. Then again, it's for visibility determination only, so I can afford to be off by a pixel or two.

Also, I need to be able to run on 360 and PS3 as well. I know the Cell SPUs have vector int arithmetic, but I'm not so sure about AltiVec in the PowerPC.

OK, I'll take it into account... The rounding idea can help me a lot, thanks. Check out my project on SourceForge. Maybe some ideas from my source code can help you....

#157 .oisyn

    DevMaster Staff

  • Moderators
  • 1842 posts

Posted 27 June 2010 - 09:10 PM

I thought of another thing worth considering. Rather than implementing a >= using > and some offset in the case of top-left edges, you could use >= in general and implement the > by using a negative offset in the case of bottom-right edges. That way, for very large polygons, you'll never get potential holes. You'll only get potential overdraw, but that doesn't really matter that much.
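The two formulations can be put side by side in a small sketch. On exact integer edge values they accept exactly the same samples. The difference only appears when the bias is lost to rounding, which the float helper illustrates. The function names are made up for this example, and `biasSurvives` is just a way to show absorption, not part of any rasterizer discussed here.

```cpp
#include <cstdint>

// Variant A: strict test, top-left edges get a positive offset.
inline bool insideStrict(int64_t e, bool topLeft) {
    return e + (topLeft ? 1 : 0) > 0;
}

// Variant B: inclusive test, bottom-right edges get a negative offset.
inline bool insideInclusive(int64_t e, bool topLeft) {
    return e - (topLeft ? 0 : 1) >= 0;
}

// In float, a small bias added to a large halfplane constant can vanish.
inline bool biasSurvives(float c, float eps) {
    const float biased = c + eps;  // rounded to float precision
    return biased != c;
}
```

If the bias vanishes, variant A degrades to `> 0` on every edge, so a sample exactly on a shared edge is drawn by neither triangle (a hole), while variant B degrades to `>= 0` on every edge, so it is drawn by both (overdraw).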

#158 Frogblast

    New Member

  • Members
  • Pip
  • 3 posts

Posted 28 June 2010 - 05:52 AM

.oisyn said:

I thought of another thing worth considering. Rather than implementing a >= using > and some offset in the case of top-left edges, you could use >= in general and implement the > by using a negative offset in the case of bottom-right edges. That way, for very large polygons, you'll never get potential holes. You'll only get potential overdraw, but that doesn't really matter that much.

It doesn't really matter for opaque geometry, but it definitely does matter for blended primitives.

#159 .oisyn

    DevMaster Staff

  • Moderators
  • 1842 posts

Posted 28 June 2010 - 07:56 AM

Ah yes of course, that's true. I didn't realize that as I don't have to deal with translucencies ;)

#160 Herrcoolness

    New Member

  • Members
  • PipPip
  • 19 posts

Posted 03 July 2010 - 09:24 AM

OK guys. I removed the SSE4 instructions from the procedure "texturesampler_bilinear" because they caused small problems on AMD processors (I have an Intel) :happy: and replaced them with a faster lookup table that calculates the in-tile addresses for bilinear sample fetching. The frame rate jumped from 26 to 32 fps. With point sampling it is about 36-37 fps, so it's a nice speedup. :yes:
I compared the "tiled bilinear sampler" against the "linear bilinear sampler" (linear being the standard in-memory image representation on PC) and the speed stayed almost the same. The linear representation of the texture was a bit slower because it is not cache-friendly. Tiled textures are good for big textures: at high resolution the speed doesn't drop as fast as with the linear (standard) representation. Of course, the linear calculation of the sample address from the texture coordinates is much simpler, but the cache pollution is much bigger and causes much bigger slowdowns. :geek:
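To make the comparison concrete, here is a sketch of the two addressing schemes. The 4x4 tile size, the row-major order of tiles and of texels within a tile, and the function names are assumptions for this example. The engine described above uses a lookup table for the in-tile addresses, while this sketch computes them directly.

```cpp
#include <cassert>
#include <cstddef>

constexpr int kTileShift = 2;                // 4x4-texel tiles (assumed size)
constexpr int kTileSize  = 1 << kTileShift;
constexpr int kTileMask  = kTileSize - 1;

// Linear (row-major) layout: vertical neighbours are a full row apart.
inline std::size_t linearIndex(int x, int y, int width) {
    return static_cast<std::size_t>(y) * width + x;
}

// Tiled layout: tiles are stored row-major, and the texels inside each tile
// are row-major too, so a 2x2 bilinear footprint usually stays in one tile.
inline std::size_t tiledIndex(int x, int y, int width) {
    assert(width % kTileSize == 0);  // assumed: width is a multiple of the tile size
    const std::size_t tilesPerRow = static_cast<std::size_t>(width >> kTileShift);
    const std::size_t tile =
        static_cast<std::size_t>(y >> kTileShift) * tilesPerRow + (x >> kTileShift);
    const std::size_t within =
        static_cast<std::size_t>(y & kTileMask) * kTileSize + (x & kTileMask);
    return tile * (kTileSize * kTileSize) + within;
}
```

For an 8-texel-wide texture, moving one texel down changes the linear index by 8 but the tiled index by only 4, and the gap grows with the texture width in the linear case while staying fixed in the tiled one.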

https://sourceforge....cts/phenomenon/




