

Advanced Rasterization


187 replies to this topic

#181 Herrcoolness

    New Member

  • Members
  • PipPip
  • 19 posts

Posted 19 September 2010 - 06:13 PM

Mihail121 said:

SSE2 is not supported, MMX and 3DNow! only. Desktop resolution is 1024x768@16.
The engine runs only in 32-bit color mode, and I think some procedures use SSE2 instructions. Just set the desktop to the right color depth and check again.

#182 droolz

    New Member

  • Members
  • Pip
  • 1 posts

Posted 21 October 2010 - 06:50 AM

Hi, I'm very new to graphics programming and am using the original post as a test project for myself.
I'm using it to rasterize UV coordinates, which unlike screen space do not require the Y coordinate to be flipped.

I've implemented the first version of the rasterizer (the unoptimised half-space calculation one) and it works fine once I flip the half-space deltas back to match the original formula.

When I try to implement the second one (the intermediate optimized one) it breaks. I can only think it's because some of the maths needs to change to match the deltas I flipped (float Dx12 = x1 - x2 becomes float Dx12 = x2 - x1; etc).

Does anyone have any pointers on how the rest of the maths should be changed? I'm really struggling to understand how it works, especially Cy1, Cy2, Cy3.

Many thanks,

Jules
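
One way to look at the question above: the incremental version is only an algebraic expansion of the direct half-space test, so whatever sign convention is chosen for the deltas has to be carried through the constant term and both stepping directions. The sketch below uses floats instead of fixed point, and the names `Edge`, `makeEdge`, `direct` and `incremental` are hypothetical helpers for illustration, not code from the original article.

```cpp
#include <cmath>

// One triangle edge from (x1, y1) to (x2, y2), using the convention
// Dx = x1 - x2, Dy = y1 - y2. Hypothetical helper type.
struct Edge {
    float Dx, Dy; // edge deltas
    float C;      // constant term of the expanded half-space function
};

Edge makeEdge(float x1, float y1, float x2, float y2) {
    Edge e;
    e.Dx = x1 - x2;
    e.Dy = y1 - y2;
    e.C  = e.Dy * x1 - e.Dx * y1;
    return e;
}

// Direct form: Dx * (y - y1) - Dy * (x - x1).
float direct(float x1, float y1, float x2, float y2, float x, float y) {
    return (x1 - x2) * (y - y1) - (y1 - y2) * (x - x1);
}

// Incremental form: evaluate once at (minx, miny), then step.
// Moving one pixel down adds Dx, moving one pixel right subtracts Dy.
float incremental(const Edge& e, int minx, int miny, int x, int y) {
    float Cy = e.C + e.Dx * miny - e.Dy * minx; // value at start of first row
    for (int j = miny; j < y; ++j) Cy += e.Dx;
    float Cx = Cy;
    for (int i = minx; i < x; ++i) Cx -= e.Dy;
    return Cx;
}
```

If both deltas are flipped, every term (the constant and both steps) changes sign together, so the incremental values stay consistent with the direct formula as long as they are all derived from the same Dx and Dy.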

#183 Herrcoolness

    New Member

  • Members
  • PipPip
  • 19 posts

Posted 08 April 2011 - 06:34 PM

OK folks. I have a new nick on SourceForge and the project name is slightly different, because two bad things happened:
1. Some idiot or idiots attacked the SourceForge servers, and the SF team reset the passwords of all projects for security reasons.
2. I used the e-mail recovery, but my e-mail provider failed horribly: I couldn't receive any new mail, sometimes the page was down... There were many problems.

So I changed the name of the project and my nick. From now on I will use SVN to update the source code.

About the project... So what is new? Per-tile operations are more optimized; I am now using 2D homogeneous rasterization; clipping is done in 4D homogeneous space; the vertex transformation path and backface culling are more optimized; texture reading is done in a post-processing pass; and the indexing of pixels to triangles is optimized with a bit mask and linked lists. (It was slow before because I did not use a combination of a 64-bit mask and one 32-bit pointer, but 64 x 32-bit pointers, that is, one 32-bit pointer to the triangle for every pixel.)
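
A rough sketch of the "64-bit mask plus one pointer" idea for an 8x8 tile is shown below. The actual layout used by the engine is not given in the post, so the types `TileNode` and `TileList`, the use of indices instead of raw pointers, and the rule that a newer triangle takes over the pixels in its mask are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical layout: one node per (tile, triangle) pair instead of
// one triangle pointer per pixel.
struct TileNode {
    std::uint64_t mask;     // bit (y * 8 + x) is set if the triangle owns that pixel
    std::uint32_t triangle; // index of the triangle
    std::int32_t  next;     // index of the next node in this tile's list, -1 = end
};

struct TileList {
    std::vector<TileNode> nodes;
    std::int32_t head = -1;

    // Register a triangle that won the depth test on the pixels in `mask`.
    // Those pixels are removed from the masks of older nodes.
    void add(std::uint32_t triangle, std::uint64_t mask) {
        for (std::int32_t i = head; i != -1; i = nodes[i].next)
            nodes[i].mask &= ~mask;
        nodes.push_back({mask, triangle, head});
        head = static_cast<std::int32_t>(nodes.size()) - 1;
    }

    // Return the triangle that owns pixel (x, y) of the tile, or -1 if none.
    std::int64_t owner(int x, int y) const {
        const std::uint64_t bit = std::uint64_t{1} << (y * 8 + x);
        for (std::int32_t i = head; i != -1; i = nodes[i].next)
            if (nodes[i].mask & bit) return nodes[i].triangle;
        return -1;
    }
};
```

With this layout a tile touched by a handful of triangles stores a handful of 16-byte nodes, rather than 64 separate 32-bit entries.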

In the demo (cube field - 2000 objects):
q,w,e,a,s,d - move in all 6 directions
arrows - rotation of the camera

Next steps:
-per-tile mip-mapping
-bilinear texture filtering with one texture fetch, using a single "movaps"

cya ;)

https://sourceforge....phenomenonngsw/

The demo needs the desktop in 32-bit color mode and SSE2 instruction support.

#184 Herrcoolness

    New Member

  • Members
  • PipPip
  • 19 posts

Posted 16 April 2011 - 12:42 PM

Hi folks. I just updated the engine with per-tile mip-mapping.

I was thinking about one problem: is the rasterizer really fast? I compared it with the s-buffer technique. The s-buffer can reject almost all pixels of a scanline across the whole x resolution at once, but mine can reject at most 64 pixels at a time. When all scanlines of a polygon (triangle) are rejected, the y-loop takes just 1050 iterations for a fullscreen triangle at a y resolution of 1050 pixels, but my rasterizer needs (1680 div 8) * (1050 div 8) iterations, which is about 27000, so by comparison it is roughly 26x slower... GOD DAMN!!
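
The iteration counts in this estimate can be checked with a few lines, assuming a 1680x1050 framebuffer and 8x8 tiles (the values implied by the expression above). The function names are hypothetical.

```cpp
// Loop iterations when stepping over whole tiles. Integer division is used
// as in the estimate above, so partially covered edge tiles are ignored.
constexpr int tileIterations(int width, int height, int tile) {
    return (width / tile) * (height / tile);
}

// Loop iterations when every scanline is rejected with a single test.
constexpr int scanlineIterations(int height) {
    return height;
}

static_assert(tileIterations(1680, 1050, 8) == 27510, "210 * 131 tiles");
static_assert(scanlineIterations(1050) == 1050, "one test per scanline");
```

This only counts loop iterations; it says nothing about how much work each iteration does.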

What do I gain if I use scanline-based rasterization in conjunction with an s-buffer?
-faster rejection of pixels in the x direction
-the ability to render to a texture
-more natural memory organization
-fewer pixels wasted because of tile alignment
-...
Problem? Yes, when I do light calculations. With pixels sorted into tiles I can build a bounding box for a tile from all its pixel positions. When I then test this bbox, with its corners in world space, against a sphere whose radius is the maximal range of the light, I can reject or accept the whole 64 pixels at once. In scanline-based rasterization I don't have this ability. The pixels could be converted, but I think that would be slow. So I am thinking about treating a horizontal span in the s-buffer as a line segment, which I check against the sphere or cone of the light to see how many pixels I need to calculate.
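
The per-tile light rejection described above boils down to a box-versus-sphere test. A minimal sketch, assuming an axis-aligned world-space box per tile and a point light with a maximal radius (the types `Vec3` and `Aabb` and the function names are hypothetical):

```cpp
#include <algorithm>

struct Vec3 { float x, y, z; };
struct Aabb { Vec3 min, max; }; // world-space bounds of a tile's 64 pixels

// Squared distance from point c to the closest point of the box
// (zero if c is inside the box).
float squaredDistance(const Aabb& b, const Vec3& c) {
    auto axis = [](float v, float lo, float hi) {
        const float nearest = std::max(lo, std::min(v, hi));
        const float d = v - nearest;
        return d * d;
    };
    return axis(c.x, b.min.x, b.max.x)
         + axis(c.y, b.min.y, b.max.y)
         + axis(c.z, b.min.z, b.max.z);
}

// True if the light's sphere of influence overlaps the tile's bounds,
// in which case the tile's pixels still have to be lit.
bool lightTouchesTile(const Aabb& tileBounds, const Vec3& lightPos, float radius) {
    return squaredDistance(tileBounds, lightPos) <= radius * radius;
}
```

A tile for which `lightTouchesTile` returns false can skip the light for all of its pixels at once.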

OK people, see you later... I think much later. But don't forget, I am working on it ;)

#185 Herrcoolness

    New Member

  • Members
  • PipPip
  • 19 posts

Posted 11 June 2011 - 11:06 PM

Hi folks. After doing some research, Windows crash tests, and comparing the speed of scanline-based rasterization against tile/half-space rasterization, I came to the result that scanline rasterization is much slower than tile rasterization. Why is that? First, here are the techniques I used. (You know that all these ideas are not mine. I am just a human like you and not an alien master-brain :D ... and it's fair to show you the sources of my knowledge, which will help you to understand me and my source code):

Scanline based:

-The rasterization algorithm is based on Chris Hecker's floating-point rasterizer (http://chrishecker.c...hnical_Articles), modified for SSE and deferred texturing. Guys, if you are new to rasterization, this is the site where you can learn all the basics and tricks.
-The clipping algorithm is from the latest source code of Nick's SwShader 0.3.0 (ftp://nic.funet.fi/p...hader-0.3.0.zip). Nick, nice trick to shift the clipping planes from -1<x<1 to 0<x<1. This helps a lot when you want to calculate the clipping flags with SSE (see my source code).
-The transformation code is based on the article at http://www.cortstrat...izingForSSE.php
-The s-buffer idea is based on Paul Nettle's "S-buffer FAQ" (http://www.gamedev.n...nd-data-structures/s-buffer-faq-r668). The source code that saved me from going through the hell of coding an s-buffer insert routine is Bero's software renderer (http://vserver.rosse...oSoftRender.zip), which is based on the C++ code of The Swine (http://www.luki.webz...z/eng_07_en.htm).
-The bilinear filtering is based on Nick's SwShader source code, optimized for SSE2 by me.
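
To illustrate why the shifted clip volume mentioned in the list above is convenient: with 0 <= x, y, z <= w in clip space, the low-side tests compare a coordinate against zero and the high-side tests compare w minus the coordinate against zero, so each group is just a set of sign checks. An SSE version could gather those signs from one register per group with a movemask instruction. The scalar sketch below only shows the logic; the flag names and the assumption that all three axes are shifted are mine, not taken from SwShader.

```cpp
#include <cstdint>

struct Vec4 { float x, y, z, w; }; // clip-space position

// Hypothetical flag layout: low-side planes first, then high-side planes.
enum : std::uint32_t {
    CLIP_X_LO = 1, CLIP_Y_LO = 2,  CLIP_Z_LO = 4,
    CLIP_X_HI = 8, CLIP_Y_HI = 16, CLIP_Z_HI = 32
};

// Outcode for the shifted volume 0 <= x, y, z <= w.
std::uint32_t clipFlags(const Vec4& v) {
    std::uint32_t flags = 0;
    if (v.x < 0.0f)       flags |= CLIP_X_LO; // sign of x
    if (v.y < 0.0f)       flags |= CLIP_Y_LO; // sign of y
    if (v.z < 0.0f)       flags |= CLIP_Z_LO; // sign of z
    if (v.w - v.x < 0.0f) flags |= CLIP_X_HI; // sign of (w - x)
    if (v.w - v.y < 0.0f) flags |= CLIP_Y_HI; // sign of (w - y)
    if (v.w - v.z < 0.0f) flags |= CLIP_Z_HI; // sign of (w - z)
    return flags;
}
```

A triangle whose three outcodes AND to a non-zero value lies entirely outside one plane and can be dropped; if all three are zero, it needs no clipping.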

Tile based:

-The rasterization algorithm is based on Nick's article (http://www.devmaster.../show.php?id=17), extended to 2D homogeneous rasterization (http://www.cs.unc.ed...papers/2dh-tri/) with a little help from the source code of the Attila GPU simulator (https://attila.ac.up...Attila_Project). The normalization code for the edge equations, which helps to convert them from floating-point to fixed-point format, I coded myself.
-The calculation of the triangle's 2D screen coverage bounding box is also based on the Attila GPU simulator source code, which in turn is based on the article "Jim Blinn's Corner: Calculating Screen Coverage".
-The early accept/reject of block-in-triangle idea is based on Intel's Larrabee article (http://software.inte...on-on-larrabee/).
-The deferred rendering idea is based on the PowerVR technology article (http://www.imgtec.co...2e.External.pdf).
-Hierarchical Zmin updating with a coverage mask is based on the article "Two-level hierarchical z-buffer for 3D graphics hardware" (http://www.si2lab.or...hen_iscas02.pdf).
-The transformation code is based on the article at http://www.cortstrat...izingForSSE.php
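
The early accept/reject of a block against a triangle, listed above, can be sketched as follows. For each edge function the block corner where the function is largest decides trivial rejection, and the corner where it is smallest decides trivial acceptance. The integer edge representation, the "inside where E >= 0" convention, and all names are assumptions for illustration, not the engine's code.

```cpp
struct EdgeEq { int a, b, c; }; // E(x, y) = a*x + b*y + c, inside where E >= 0

enum class TileClass { Reject, Accept, Partial };

int evalAt(const EdgeEq& e, int x, int y) { return e.a * x + e.b * y + e.c; }

// Classify the square block of pixels [x0, x0+size) x [y0, y0+size).
TileClass classify(const EdgeEq (&edges)[3], int x0, int y0, int size) {
    const int x1 = x0 + size - 1;
    const int y1 = y0 + size - 1;
    bool allInside = true;
    for (const EdgeEq& e : edges) {
        // Corner where E is largest: if even this one is outside,
        // the whole block is outside this edge.
        if (evalAt(e, e.a >= 0 ? x1 : x0, e.b >= 0 ? y1 : y0) < 0)
            return TileClass::Reject;
        // Corner where E is smallest: if this one is outside,
        // the block is not fully inside this edge.
        if (evalAt(e, e.a >= 0 ? x0 : x1, e.b >= 0 ? y0 : y1) < 0)
            allInside = false;
    }
    return allInside ? TileClass::Accept : TileClass::Partial;
}
```

Only blocks classified as `Partial` need the per-pixel edge tests; accepted blocks can be filled with an unrolled loop and rejected blocks skipped.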

Pros & Cons:

Scanline based:
+fewer calculations when computing the address of a pixel
+fewer wasted pixels for render targets and textures
+rasterization is calculated pixel-precise
-variable scanline length, so we cannot directly unroll the drawing loop
-the layout of render targets and textures is not cache-friendly for random access (we can't access pixels in the y direction without evicting data from the cache, which is a problem for calculating dx/dy derivatives, among other things)
+the s-buffer can reject more than, exactly, or fewer than 64 pixels at once
-we need to draw from front to back and from right to left to benefit from the s-buffer and reject pixels without walking the linked list of segments, which hurts the cache
-demo cube field rendered at 20-25 fps (the cubes are not drawn from front to back or from right to left, just in random order)


Tile based
-more calculations when computing the address of a pixel (MMX or SSE instructions can help)
-more wasted pixels (because we need to align texture and render target data to the tile size)
-rasterization is calculated per tile, and some CPU cycles are wasted on pixels that are not in the triangle (SSE can help to reduce this)
+constant drawing loop, which can be unrolled because of the constant tile size
+more cache-friendly layout of render targets and textures for random access (we can access pixels in the y direction without evicting data from the cache; good for pixel shaders, dx/dy derivative calculation, multithreading)
-rejects exactly 64 pixels, yes or no, nothing in between
+fast access to the z-buffer through the x,y coordinates: just compare against the higher level, which is a single floating-point number; a better organized (grid-based) structure
+demo cube field rendered at 44 fps and more
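
The "more calculations for the pixel address" point can be made concrete by comparing the two layouts. This is a sketch assuming 8x8 tiles stored one after another, row-major inside the tile, and a surface width that is a multiple of 8; the engine's real layout may differ.

```cpp
#include <cstddef>

// Scanline (linear) layout: one multiply and one add.
std::size_t linearIndex(int x, int y, int width) {
    return static_cast<std::size_t>(y) * width + x;
}

// Tiled layout: find the tile, then the position inside it.
// Assumes `width` is a multiple of 8.
std::size_t tiledIndex(int x, int y, int width) {
    const int tilesPerRow = width / 8;
    const int tile   = (y >> 3) * tilesPerRow + (x >> 3); // which 8x8 tile
    const int inTile = ((y & 7) << 3) | (x & 7);          // offset inside the tile
    return static_cast<std::size_t>(tile) * 64 + inTile;
}
```

In the tiled layout a pixel and the one directly below it are 8 entries apart (inside a tile) instead of a full scanline apart, which is where the cache benefit for y-direction access comes from.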

The problem with the s-buffer is that we sometimes need to walk the linked list of segments. This is slow compared to the hierarchical z-buffer, where we have direct access to memory by calculating the address from the x,y coordinates. So the s-buffer is good for scenes without many polygons, because every new polygon adds a new segment to the s-buffer structure, and walking the growing structure really slows down the whole process. See Quake 1, which uses this technique. But there the s-buffer is used only for drawing the rooms, static structures with a small number of polygons. The monsters and characters are drawn separately, in another pipeline. Without the s-buffer, though, there would be overdraw, which can cause more slowdowns.

Another problem in scanline-based rasterization is the need for clipping. Clipping a polygon in 3D is a slow process, and drawing the clipped polygon with the triangle routine is even slower. We need clipping because of the W coordinate: we can't use W coordinates behind the camera, and the perspective division fails when W is equal to 0. In tile-based 2D homogeneous rasterization we don't have this problem, because we directly calculate the visible pixels. We don't even need to divide, because we are operating in homogeneous space. We only divide the interpolated values by W after checking that the pixel is visible, that is, that it lies between the triangle edges and between the near and far planes.
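
As a rough illustration of the 2D homogeneous approach: the three edge functions can be built directly from the undivided (x, y, w) vertex coordinates as cross products, so no per-vertex division by W is needed for the coverage test. This is a floating-point sketch with hypothetical names; it assumes pixel coordinates are expressed in the same space as x/w and y/w, that front-facing triangles have a positive determinant, and it leaves out the near and far plane tests mentioned above.

```cpp
struct HVertex { float x, y, w; };  // undivided homogeneous position
struct EdgeFn  { float a, b, c; };  // E(px, py) = a*px + b*py + c

// Cross product of two (x, y, w) vectors gives the edge function
// that is zero along the line through both vertices.
EdgeFn cross(const HVertex& p, const HVertex& q) {
    return { p.y * q.w - p.w * q.y,
             p.w * q.x - p.x * q.w,
             p.x * q.y - p.y * q.x };
}

struct TriSetup {
    EdgeFn e[3];
    float det; // determinant of the vertex matrix; sign encodes facing
};

TriSetup setupTriangle(const HVertex& v0, const HVertex& v1, const HVertex& v2) {
    TriSetup s;
    s.e[0] = cross(v1, v2); // edge opposite v0
    s.e[1] = cross(v2, v0); // edge opposite v1
    s.e[2] = cross(v0, v1); // edge opposite v2
    s.det  = v0.x * s.e[0].a + v0.y * s.e[0].b + v0.w * s.e[0].c;
    return s;
}

// Coverage test for one pixel position; no division is involved.
bool covers(const TriSetup& s, float px, float py) {
    if (s.det <= 0.0f) return false; // back-facing or degenerate (assumed winding)
    for (const EdgeFn& e : s.e)
        if (e.a * px + e.b * py + e.c < 0.0f) return false;
    return true;
}
```

Dividing an edge value by `det` gives a quantity that can be used for interpolation, and the division by the interpolated W is then only needed for pixels that passed the coverage test.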

Anyway, I uploaded the scanline renderer for learning purposes. If something is not clear, just ask me. But I think the FPS numbers say it all. Now I know that the tile renderer is the right way and I will continue with it. It's deep in the night and I can barely see anything, so... any feedback, question or reaction is welcome. So... let's get back to the TILE WORLD !!!

https://sourceforge....ased%20version/

#186 Herrcoolness

    New Member

  • Members
  • PipPip
  • 19 posts

Posted 01 July 2011 - 09:20 PM

Hi folks. I updated the engine with a new memory representation for textures. The texture is still divided into tiles covering 8x8 texels as in the old version, but each tile is now stored bigger, as 9x9 texels. Why? Because I implemented bilinear filtering, and there was a small problem with the rightmost and bottom texels of a tile. To access the 3 other texels there, I had to read the colors from other tiles and calculate a new address because of the tile change. So now, when I encode the texture into tiles, I write the colors from the neighboring tiles into the extra right and bottom texels. When I read the texels now, I read them all from one tile, and calculating the addresses of the other 3 texels is very cheap.
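
A small sketch of the 9x9 tile encoding described above. It assumes a row-major source texture whose width and height are multiples of 8, and wrap-around (repeat) addressing at the texture border; the names `TiledTexture`, `encode` and `texelIndex` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct TiledTexture {
    int tilesX, tilesY;
    std::vector<std::uint32_t> texels; // 9 * 9 = 81 texels per tile
};

// Each tile covers 8x8 source texels plus one extra column and row
// copied from the right and bottom neighbours.
TiledTexture encode(const std::vector<std::uint32_t>& src, int width, int height) {
    TiledTexture t{width / 8, height / 8, {}};
    t.texels.resize(static_cast<std::size_t>(t.tilesX) * t.tilesY * 81);
    for (int ty = 0; ty < t.tilesY; ++ty)
        for (int tx = 0; tx < t.tilesX; ++tx)
            for (int j = 0; j < 9; ++j)
                for (int i = 0; i < 9; ++i) {
                    const int sx = (tx * 8 + i) % width;  // wrap at the border (assumption)
                    const int sy = (ty * 8 + j) % height;
                    const std::size_t tile = static_cast<std::size_t>(ty) * t.tilesX + tx;
                    t.texels[tile * 81 + j * 9 + i] =
                        src[static_cast<std::size_t>(sy) * width + sx];
                }
    return t;
}

// Index of texel (x, y). Its three bilinear neighbours are always in the
// same tile, at offsets +1 (right), +9 (below) and +10 (below right).
std::size_t texelIndex(const TiledTexture& t, int x, int y) {
    const std::size_t tile = static_cast<std::size_t>(y >> 3) * t.tilesX + (x >> 3);
    return tile * 81 + (y & 7) * 9 + (x & 7);
}
```

The cost is 81 instead of 64 texels per tile, in exchange for never crossing a tile boundary during a bilinear fetch.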

I implemented per-triangle occlusion: we calculate the triangle's nearest Z coordinate to the camera and compare it against the Zmin of the tiles inside the triangle's bounding box. Zmin is stored in an array separate from the full z-buffer, so walking through the Zmin array is fast. It simulates the on-chip coarse z-buffer memory of hardware. If there is at least one tile that lies behind the triangle's nearest Z, the triangle needs to be rendered; otherwise it can be skipped. Skipping triangles saves a nice amount of computation, so drawing is faster. This method is described in the article "Method for accelerated triangle occlusion culling" (http://www.freepaten...0030043148.pdf), but if you think about the hierarchical z-buffer, you get the idea automatically. I had a similar idea when I was programming the n-level z-buffer.
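
The per-triangle test can be sketched like this. The sketch assumes that smaller Z means nearer to the camera and that the coarse array holds, per tile, the farthest depth currently stored in that tile (the role the post's Zmin array plays); the depth convention and all names are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Coarse depth buffer: one value per screen tile.
struct CoarseDepth {
    int tilesX, tilesY;
    std::vector<float> farthest; // farthest depth stored in each tile
};

// Returns true if the triangle could be visible in at least one tile of its
// bounding box [tx0, tx1] x [ty0, ty1] (tile coordinates, inclusive).
bool triangleMayBeVisible(const CoarseDepth& d,
                          int tx0, int ty0, int tx1, int ty1,
                          float triNearestZ) {
    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            if (triNearestZ < d.farthest[static_cast<std::size_t>(ty) * d.tilesX + tx])
                return true; // something in this tile lies behind the triangle
    return false;            // every covered tile is already closer: skip the triangle
}
```

The test is conservative: it may keep a triangle that turns out to be hidden, but it never skips one that has a visible pixel.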

http://sourceforge.n...11.rar/download

#187 Herrcoolness

    New Member

  • Members
  • PipPip
  • 19 posts

Posted 01 September 2011 - 05:07 PM

Ok guys, new update of my project.

Every programmer knows that a 64-bit environment brings new possibilities: more memory, more registers, and general-purpose registers extended to 64 bits. As an assembler programmer, I had problems with some functions because of the small number of CPU registers, so I had to work around it with memory accesses to temporary variables and constants, which slows down the whole function. Speed is very important for software rendering, so I changed the OS environment to get more horsepower out of my CPU.

Changing from 32-bit to 64-bit brings some problems, because now we strictly need 64-bit versions of the device drivers. Some old hardware does not get new 64-bit drivers from its vendor, so we need to buy new hardware... like I did. Now the OS is fully functional.

I changed FPC to the 64-bit version too. I needed to rewrite the source code of the renderer (changing pointers and pointer operations from dword to qword size). I changed the output from DirectDraw to GDI, because I had problems running the program in the 64-bit environment. I optimized some parts for 64-bit, because now I have 16 general-purpose and 16 XMM registers. Man, I was feeling like a kid in a toy shop... so many registers... :) I also renamed some variables, for example in the gradient calculation, to make their meaning clearer.

I compared the old rectangular traversal algorithm with the recursive algorithm, but the old one was still faster. Anyway, I uploaded the recursive algorithm (http://sourceforge.n...11.rar/download) for study purposes. I added some more comments to make the code easier to understand, for example in the trivial reject & accept calculations. Cya ;)

http://sourceforge.n...11.rar/download

#188 zbethel

    Member

  • Members
  • PipPip
  • 50 posts

Posted 01 October 2011 - 07:20 PM

EDIT: I solved it...

It wasn't the rasterization algorithm. I was converting the vertices to integers earlier in my code and testing if the size of the edges was zero (testing for degenerate triangles)--without first converting to fixed point. Thus, entire triangles weren't being drawn. How could I be so stupid! :)
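
The effect described here can be reproduced in a few lines: a sub-pixel triangle collapses to a single point when its vertices are truncated to integers, but keeps a non-zero area in 28.4 fixed point. This is only a sketch; it uses a doubled-area test as the degeneracy check, which may differ from the edge-size test in the original code, and the function names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Convert to 28.4 fixed point (4 fractional bits).
std::int32_t toFixed(float v) {
    return static_cast<std::int32_t>(std::lround(v * 16.0f));
}

// Twice the signed area of the triangle; zero means degenerate.
std::int64_t doubledArea(std::int32_t x1, std::int32_t y1,
                         std::int32_t x2, std::int32_t y2,
                         std::int32_t x3, std::int32_t y3) {
    return static_cast<std::int64_t>(x2 - x1) * (y3 - y1)
         - static_cast<std::int64_t>(x3 - x1) * (y2 - y1);
}

// Degeneracy test after truncating to whole pixels (the buggy order).
bool isDegenerateTruncated(float x1, float y1, float x2, float y2, float x3, float y3) {
    return doubledArea(static_cast<std::int32_t>(x1), static_cast<std::int32_t>(y1),
                       static_cast<std::int32_t>(x2), static_cast<std::int32_t>(y2),
                       static_cast<std::int32_t>(x3), static_cast<std::int32_t>(y3)) == 0;
}

// Degeneracy test after converting to fixed point (sub-pixel precision kept).
bool isDegenerateFixed(float x1, float y1, float x2, float y2, float x3, float y3) {
    return doubledArea(toFixed(x1), toFixed(y1), toFixed(x2), toFixed(y2),
                       toFixed(x3), toFixed(y3)) == 0;
}
```

For a triangle that fits inside one pixel, the truncated test reports it as degenerate and drops it, while the fixed-point test still lets the rasterizer decide whether it covers a pixel center.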

==============================================================

Ahem, I know I'm resurrecting an ancient post. But it's such a great one! :)

I've implemented this rasterization method, and it works great, except that I've noticed a problem with small triangles. I noticed your post, Nick, about entire blocks getting included when the x and y deltas are both zero, but I don't think this is the same problem. Here's an image of what's happening:

Posted Image


I tracked it down to a problem with this line:

if(CX1 > 0 && CX2 > 0 && CX3 > 0)
{
    colorBuffer[ix] = 0xffffffff;
}

Looking at the numbers in the debugger, it does not appear that any overflow is happening. Since I pretty much implemented the algorithm verbatim, I was hoping maybe this has been solved already? If you've had this issue and know what's going on, I'd appreciate the heads up. :)

Thanks.




