Jump to content


Faster sort.


20 replies to this topic

#1 Hawkwind

    Member

  • Members
  • PipPip
  • 92 posts

Posted 19 February 2008 - 09:00 AM

I wrote some code a year or two back for sorting objects before render (particularly semi-transparent stuff).

I'm wondering if anybody has something faster than this ?
Cheers.

//------------------------------------------------------------------------------------
// Sort
// ----
//		Binary radix sort - used for ordering partially transparent entities, etc.
//
//	Sorts an array of CSortEntry.   
//	
//	When adding entries to the array:
//
//			iMaxKey |= key;		//  This is therefore an upper bound on
//						//  unsigned values. 
//      Initially:  iMaxKey = 0;
//
//------------------------------------------------------------------------------------

#pragma warning( push )
#pragma warning( disable : 4731 )	// 4731 =  frame pointer register 'ebp' modified by inline assembly code


struct CSortEntry				//Used for sorting.  Total of 32bits per entry
{
	unsigned short key;
	unsigned short idx;
};


void CGeneric::Sort( CSortEntry *  pArray, int iCount, unsigned int iMaxKey)
{
 
	if( iCount == 0 )			// Early out if required
	 return;


 /*	
	unsigned int iMask = 1;
	while( iMask < iMaxKey )
	{
		
		unsigned int Count1 = 0;
		CSortEntry * inPtr = pArray;
		CSortEntry * outPtr0 = inPtr;			// Output 0 is always to the original array
		CSortEntry * outPtr1 = CSortArrayTemp;		// Output 1 is always to a temporary store	
		for( int i = 0; i < iCount; i++ )
		{
			if( (inPtr->key & iMask)==0)
			{
				*outPtr0++ = *inPtr++;
			}
			else
			{
				*outPtr1++ = *inPtr++;
				Count1++;
			}
		};

			// Weld the two parts together

		memcpy( outPtr0, CSortArrayTemp, (Count1)*sizeof(CSortEntry) );
		
			// Advance mask, to next bit

		iMask += iMask;
	};
*/
						// Use local stack space vars for older compilers not recognising
						// parameter vars in _asm{}
	unsigned int maxMask = iMaxKey;		
	CSortEntry * inPtr = pArray;
	CSortEntry * outPtr1 = CSortArrayTemp;	//  A temporary buffer at least as big as pArray.
	unsigned int totalEntries = iCount;

		_asm
		{
			pusha
			push ebp

			push maxMask
			push totalEntries
			push inPtr
			push outPtr1

			mov  eax, 1			// eax holds the mask
			jmp  L0002

			mov   ebx, [esp + 0 ]
			mov   esi, [esp + 4 ]

L0003a:
			mov   ecx, [esp + 8 ]

			mov  edi, esi

			lea  esi, [esi + ecx*4]
			neg  ecx


				//--------------------------------------------------------------------------
				// Main loop - 
				// ---------
				//  branchless decision on destination buffer
				//--------------------------------------------------------------------------

			
		L0000:
			mov   ebp, [esi + ecx*4]
			test  ebp, eax
			cmovne edx, ebx
			cmove  edx, edi
			mov   [edx], ebp
			lea   edx, [edx + 4]
			cmovne ebx, edx
			cmove edi, edx
			inc  ecx
			jl   L0000
 
				//--------------------------------------------------------------------------
				// End of Main Loop
				//--------------------------------------------------------------------------

			mov  esi, [esp + 0]					// Top of stack  (push ebx)
			sub  ebx, esi		//  Total * 8
			jz   L0004			//  if none then early exit - copy loop won't work for zero!
			add  esi, ebx
			add  edi, ebx
			neg  ebx
			
L0001:
			mov  ecx, [esi + ebx]
			mov  [edi + ebx], ecx
			add  ebx, 4				// Seperate  from jl L001 so that there are no dependencies
			jl   L0001
L0004:		
			add  eax, eax
L0002:
		
				// Interlace the first two instructions at the destination loop label - this breaks up dependencies
				// and is effective since the destination label is always taken except the final time

			 mov   esi, [esp + 4 ]
			cmp  eax,  [esp + 12]   //maxMask
			 mov   ebx, [esp + 0 ]
			jle  L0003a

			add  esp, 16 // Account for top 4 stack dwords
			pop ebp
			popa

		};





};
#pragma warning( pop )


#2 Reedbeta

    DevMaster Staff

  • Administrators
  • 4979 posts
  • LocationBellevue, WA

Posted 19 February 2008 - 05:14 PM

Please use the [code]...[/code] tags when posting code. I've added them for you this time.
reedbeta.com - developer blog, OpenGL demos, and other projects

#3 Hawkwind

    Member

  • Members
  • PipPip
  • 92 posts

Posted 19 February 2008 - 05:15 PM

Thanks. It looks clearer!

#4 CheshireCat

    New Member

  • Members
  • PipPip
  • 19 posts

Posted 19 February 2008 - 05:23 PM

Might be faster apply temporal coherency and use insertion sort on the sorted list from previous frame. I doubt this is bottleneck in your rendering code though.

#5 Hawkwind

    Member

  • Members
  • PipPip
  • 92 posts

Posted 19 February 2008 - 05:56 PM

Cheshire Cat - interesting idea, problem is with a rapidly moving camera I don't know how it would work - how would you know whether something has potentially changed its order in the list ? possibly using bounding spheres ?
I'd be interested to know how this could be done if you've managed it....

#6 CheshireCat

    New Member

  • Members
  • PipPip
  • 19 posts

Posted 19 February 2008 - 06:42 PM

Insertion sort will work regardless how fast the camera moves. It'll just become slower when the initial list becomes less sorted, but the order of transparent objects should be pretty much the same from previous frame. Another nice thing is that when there are relatively small updates to the order it's very cache friendly algorithm.

#7 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 19 February 2008 - 08:28 PM

Hi there.

First off: pusha and popa are 16 bit instructions and only save the lower 16 bit of your 32 bit registers. Use pushad and popad instead (and thank MSVC that it analyzed your code and saved the upper 16 bit for your).

To your soring example: Here is my code.. it is roughly 2.5 times faster than your code on my core2 duo (up to factor 4 for more than half a million items). Main difference is, that I do radix of 8 bits instead of 16 bits. My code expects ints, but since your element-size is 32 bits it will just work. My code sorts for the lower 16 bits unsigned (just like your code does it...)

#include <mmintrin.h>

void radix16_mmx (unsigned int amount, unsigned int * __restrict src, unsigned int * __restrict tmp)
{
  unsigned int i;
  __declspec(align(16)) unsigned int count[512];
  __m64 a0, a1, zero;
  __m64 * cnt64;

  // fill frequency table with zeros:
  cnt64 = (__m64  *) count;
  zero = _mm_setzero_si64();
	
  for (i=0; i<256; i+=8)
  {
    cnt64[i+0]=zero; cnt64[i+1]=zero; cnt64[i+2]=zero; cnt64[i+3]=zero;
    cnt64[i+4]=zero; cnt64[i+5]=zero; cnt64[i+6]=zero; cnt64[i+7]=zero;
  }

  // build byte-frequency table:
  for (i=0; i<amount; i++) 
  {
    unsigned int s = src[i];
    count[2*((s>> 0)&255)+0]++;
    count[2*((s>> 8)&255)+1]++;
  }

  // turn frequency table into offset table:
  // process four dwords per iteration
  a0 = a1 = zero;
  for (i=0; i<256; i++)
  {
    __m64 c0 = cnt64[i*2]; 
    cnt64[i*2+0] = a0; 
    a0 = _m_paddd (a0, c0);
  }

  // sort from lowest byte to highest byte:
  for (i=0; i<amount; i++) tmp[count[((src[i]>> 0)&0xff)*2+0]++] = src[i];
  for (i=0; i<amount; i++) src[count[((tmp[i]>> 8)&0xff)*2+1]++] = tmp[i];
  _m_empty();
}

Btw - I do use MMX for this function, but not for speed but for code-size. The code runs approximately at the same speed if you do it with ordinary integers.

Nils
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#8 z80

    Valued Member

  • Members
  • PipPipPip
  • 104 posts

Posted 19 February 2008 - 08:52 PM

Nils Pipenbrinck said:

Btw - I do use MMX for this function, but not for speed but for code-size. The code runs approximately at the same speed if you do it with ordinary integers.

Heh.. Nice to see somebody optimizing for code-size.. you never see that anymore (except for 4/64k intros maybe.. and embedded devices.. well maybe you do see it -- but still..) everybody always think about speed -- and often where it doesnt make any sense to do so.

#9 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 19 February 2008 - 10:30 PM

I'm an embedded programmer, but I don't care if my application needs a megabyte of code-size or not. Nowadays we don't need to squeeze everything into 64kb anymore (there are exceptions, but ...).

The point is: optimizing for size if you don't sacrifice speed for it *is* optimizing for speef. There exist something called the instruction cache, and it can make a big difference to the overall speed of an application.

The big problem with the i-cache is: you can't really measure the performance gain of small versus bloated code. But it makes a difference. Otherwise the instruction-cache won't be there :-)
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#10 z80

    Valued Member

  • Members
  • PipPipPip
  • 104 posts

Posted 19 February 2008 - 10:44 PM

I didnt really think a lot about the instruction cache eventhough I should have since I know its there. Very good points you bring up there Nils as usual :-)

#11 Hawkwind

    Member

  • Members
  • PipPip
  • 92 posts

Posted 20 February 2008 - 09:30 AM

Nils - Thanks for this. 2.5x is a great improvement, I'm just looking through it so I understand how it works....

#12 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 20 February 2008 - 10:17 AM

To be fair, I run your code with a iMaxKey value of 0xffff. If you have much less keys I wouldn't be surprised if your code wins performance wise.

I could do the same and write a special version that handles the cases where only the lower 8 bits are used as a key, but well - performance matters most when you have to process lots of data.

It would be interesting to see how good std::sort compares to your and my function.

Btw - your asm-code looks pretty good. I like the conditional move to get rid of the branches.

You might get some additional speedup if you make sure that your branch labels aren't as near to each other as they are now. The modern CPU's build a branch history based on the address of the branch. In the worst case you can have two unrelated branches messing up their branch-prediction because one branches likely and the other does not.

The optimization guides at agner.org have more details on this.


About the algorithm itself: It's just plain old radix-sort with a 8-bit radix. You can find the details here: http://www.cubic.org/docs/radix.htm I removed the work done for the upper 16 bits of the key. I restructured the byte-history table building by doing it once for all sort-passes in parallel. That increases the stack-usage by a kilobyte, but enables me to do the mmx/sse trick during the history to index conversion step.
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#13 Hawkwind

    Member

  • Members
  • PipPip
  • 92 posts

Posted 20 February 2008 - 11:45 AM

I don't know much about MMX, I tried compiling your code in Studio2008 but I keep getting a runtime error of stack corruption around count.... do I need to 'enable' amd 'disable' MMX to stop floating point problems ? I read something somewhere about this....

#14 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 20 February 2008 - 09:23 PM

MMX shouldn't be the problem. There a re issues, but you only have to take care of them after you've done MMX stuff. I checked the code, and it should not overflow.


Anyway, here is the non-mmx version:

void radix16 (unsigned int amount, unsigned int * __restrict src, unsigned int * __restrict tmp)
{
  unsigned int i;
  unsigned int a0, a1;
  unsigned int count[512];

  // fill frequency table with zeros:
  memset (count, 0, sizeof (count));

  // build byte-frequency table:
  for (i=0; i<amount; i++) 
  {
    unsigned int s = src[i];
    count[2*((s>> 0)&255)+0]++;
    count[2*((s>> 8)&255)+1]++;
  }

  // turn frequency table into offset table:
  // process four dwords per iteration
  a0 = a1 = 0;
  for (i=0; i<256; i++)
  {
    int c0 = count[i*2+0];
    int c1 = count[i*2+1];
    count [i*2+0] = a0;
    count [i*2+1] = a1;
    a0 += c0;
    a1 += c1;
  }

  // sort from lowest byte to highest byte:
  for (i=0; i<amount; i++) tmp[count[((src[i]>> 0)&0xff)*2+0]++] = src[i];
  for (i=0; i<amount; i++) src[count[((tmp[i]>> 8)&0xff)*2+1]++] = tmp[i];
}

from numbers-of-elements > 1024 it's more than 5 times faster on my core2 btw...
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#15 SamuraiCrow

    Senior Member

  • Members
  • PipPipPipPip
  • 459 posts

Posted 21 February 2008 - 06:17 PM

Also, there's another sort algorithm known as Burst Sort that is in O(n) time. You just insert each element into a trie and then do a traversal of the trie. It's mainly useful for DNA sequencing where each element is constructed of 2-bit genes. The Burst Sort hogs memory like there's no tomorrow but it is a bit faster than the radix sort since there are only 2 passes.

#16 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 21 February 2008 - 11:15 PM

Burst Sort sounds interesting. Do you have a link to a website that gives some details on it? My google search didn't had any good results.

Anyway,

I benchmarked my and Hawkwinds code against the stl-sort.

Results are interesting, and I want to share them with you. To bad that I have misplaced the password for my webspace, so I can't show the cool statistics I've done.

I tested with several amounts from 512 to 65536 in steps of 1024. The performance scales almost linear with the element-count The O(n log n) complexity of the std sort does not even shows up! It nearly looks like a linear sort algorithm. (I checked that, it's merge-sort, so no O(n)).

Here are some numbers, calculated as an average of all test runs, and expressed as cycles per element (I measured using RDTSC):

radix8 sort: 34
radix1 sort: 171
std::sort: 220
clib qsort: 420

The lower ranges - where I measured less than 4000 elements - are interesting as well.

std::sort is faster than radix1 for less than 3000 and faster than radix8 for less than 64 elements.

I would have expected std::sort to perform better, and regarding the clib qsort function: That one is slow as hell. I will not use it anymore.
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#17 starstutter

    Senior Member

  • Members
  • PipPipPipPip
  • 1039 posts

Posted 22 February 2008 - 03:59 AM

maybe this has been mentioned (I didn't read through all the post) but I hear the Radix sort is extremley fast.

Also, a word of advice, unless under special conditions, you only need to resort once a second, if that even. Unless your strapping the camera to a rocket, they just don't move fast enough to need a re-sort every frame.

#18 Hawkwind

    Member

  • Members
  • PipPip
  • 92 posts

Posted 22 February 2008 - 09:43 AM

Nils - I got the sort to compile. Thanks its cool. I must admit that I had
dismissed the idea of using a radix > 2 since I just assumed that the extra
space and copy complexity would make it slower. Guess it proves I should assume nothing. I ditched the built in quicksort long ago since it chugs.
SamuraiCrow - I'll google for Burst sort (not heard of this one). I'm currently
playing around on paper with the coherency-insertion idea and looking at methods of predicting 'potential' order changes in the list.
Cheers.

#19 Mihail121

    Senior Member

  • Members
  • PipPipPipPip
  • 1052 posts

Posted 22 February 2008 - 12:27 PM

SamuraiCrow said:

Also, there's another sort algorithm known as Burst Sort that is in O(n) time. You just insert each element into a trie and then do a traversal of the trie. It's mainly useful for DNA sequencing where each element is constructed of 2-bit genes. The Burst Sort hogs memory like there's no tomorrow but it is a bit faster than the radix sort since there are only 2 passes.

Ehm, wouldn't consider that very effective, since the lower bound of sort algorithms has been found and O(n) is on a computer NOT POSSIBLE. I've heard of some hardware dependant algos, that come close to O(n), but I guess that's not very effective.

#20 Reedbeta

    DevMaster Staff

  • Administrators
  • 4979 posts
  • LocationBellevue, WA

Posted 22 February 2008 - 04:44 PM

Mihail: the lower bound is O(n log n) only for *comparison* sorts - that is, sorts in which calling some less-than function is the only way to compare elements. You can truly do the sort in O(n) time when sorting integers. For instance, radix sort is also O(n). This may seem like dark magic (it did to me when I learned it), but basically it's just taking advantage that integers have bits inside them you can look at and that the number of bits is O(1).

Also, asymptotic time isn't everything. Someone suggested the insertion sort above, and that algorithm is O(n^2) in the worst case...but it could well be the fastest thing in the special case of an almost-sorted list. Also, you might already know that quicksort has O(n^2) worst case time (though O(n log n) average case time), yet it's used over heapsort and mergesort, which have O(n log n) worst case time, because quicksort's faster in terms of *real* time in many situations.
reedbeta.com - developer blog, OpenGL demos, and other projects





1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users