# asm engine

83 replies to this topic

### #21Nodlehs

Valued Member

• Members
• 152 posts

Posted 08 May 2006 - 05:31 PM

Premature optimization is worthless. If you want to write in ASM just for the heck of it, for a challenge, or whatever, go ahead and have a blast; it sounds fun.

Please don't do it for performance reasons. If you want something to 'interest' you more, first read Code Complete; it may prompt you to re-evaluate your outlook on C++ and other languages. Finally, as others have pointed out, maybe find an exciting project instead of an exciting language. This sounds like a lack of motivation more than anything else.

### #22SmokingRope

Valued Member

• Members
• 210 posts

Posted 08 May 2006 - 05:46 PM

### #23Nick

Senior Member

• Members
• 1227 posts

Posted 08 May 2006 - 07:39 PM

SmokingRope said:

Intel Developer's Manuals (The Bible)
MMX/SSE Primers (Nice quick reference too)
x86 Architecture (The alphabet starts with ACDB)

### #24Nils Pipenbrinck

Senior Member

• Members
• 597 posts

Posted 10 May 2006 - 01:06 AM

Integer log2 with *that* much code?

// taken from http://www.stereopsis.com/log2.html
static inline int32 ilog2(float x)
{
    uint32 ix = (uint32&)x;          // reinterpret the float's bits as an integer
    uint32 exp = (ix >> 23) & 0xFF;  // extract the 8-bit biased exponent
    int32 log2 = int32(exp) - 127;   // remove the IEEE 754 bias

    return log2;
}
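
For readers without the poster's int32/uint32 typedefs, here is a self-contained sketch of the same exponent-extraction trick, using memcpy instead of the aliasing cast (the function name is mine, not from the post):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Same idea as the stereopsis snippet: an integer log2 read straight
// from the IEEE 754 biased exponent. memcpy is the strictly portable
// way to reinterpret the float's bits.
static inline int32_t ilog2_float(float x)
{
    uint32_t ix;
    std::memcpy(&ix, &x, sizeof ix);        // bit-copy the float
    uint32_t exp = (ix >> 23) & 0xFF;       // 8-bit biased exponent
    return static_cast<int32_t>(exp) - 127; // remove the bias
}
```

Like the original, it truncates (ilog2_float(6.0f) is 2) and is only meaningful for positive, normal floats.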


### #25Nils Pipenbrinck

Senior Member

• Members
• 597 posts

Posted 10 May 2006 - 01:52 AM

OK, here is something unexpected:

I timed bsr against the float version... guess which one was faster (at least on my 1.6 GHz Athlon). I didn't even try to optimize the float version.


#include <stdio.h>
#include <windows.h>

int bsr_test(void)
{
    int ret = 0;
    _asm {
        mov ecx, 0x00400000
        xor edx, edx
    again:
        bsr eax, ecx
        add edx, eax        ; accumulate so the result is actually used
        dec ecx
        jnz again
        mov ret, edx
    }
    return ret;
}

int flt_test(void)
{
    int ret = 0;
    float temp;
    _asm {
        mov ecx, 0x00400000
        xor edx, edx
    again:
        mov  temp, ecx
        fild temp
        fstp temp
        mov eax, [temp]
        shr eax, 23
        and eax, 255
        sub eax, 127
        add edx, eax        ; accumulate so the result is actually used
        dec ecx
        jnz again
        mov ret, edx
    }
    return ret;
}

int main(void)
{
    LARGE_INTEGER t1, t2, t3, t4;
    int i = 0;
    int j = 0;

    QueryPerformanceCounter(&t1);
    for (int k = 0; k < 10; k++)
        i += bsr_test();
    QueryPerformanceCounter(&t2);

    QueryPerformanceCounter(&t3);
    for (int k = 0; k < 10; k++)
        j += flt_test();
    QueryPerformanceCounter(&t4);

    printf("time bsr = %d, result = %d\n", (int)(t2.QuadPart - t1.QuadPart), i);
    printf("time flt = %d, result = %d\n", (int)(t4.QuadPart - t3.QuadPart), j);

    return 0;
}



### #26Nick

Senior Member

• Members
• 1227 posts

Posted 10 May 2006 - 12:53 PM

Nils Pipenbrinck said:

I timed bsr against the float version... guess which one was faster (at least on my 1.6 GHz Athlon). I didn't even try to optimize the float version.

Yeah, I get the same results on an Athlon 64. Even a simple loop that shifts the value one bit at a time is faster...

I checked the AMD manuals and it turns out the bsr instruction has a high latency and uses a slow decoder. So it was a bad example (at least for Athlon processors). But the results could have been totally opposite. I should have checked. :blush:

### #27juhnu

Valued Member

• Members
• 292 posts

Posted 10 May 2006 - 01:32 PM

Nick said:

Yeah, I get the same results on an Athlon 64. Even a simple loop that shifts the value one bit at a time is faster...

I checked the AMD manuals and it turns out the bsr instruction has a high latency and uses a slow decoder. So it was a bad example (at least for Athlon processors). But the results could have been totally opposite. I should have checked. :blush:

Well, at least it was a good example that hand-written asm is not necessarily 300% faster ;)

### #28.oisyn

DevMaster Staff

• Moderators
• 1842 posts

Posted 10 May 2006 - 02:05 PM

Touché

BTW, anyone care to test my piece of code as well? I could type it in myself, if only I weren't so lazy.
-
Currently working on: the 3D engine for Tomb Raider.

### #29Nick

Senior Member

• Members
• 1227 posts

Posted 10 May 2006 - 07:27 PM

juhnu said:

Well, at least it was a good example that hand-written asm is not necessarily 300% faster ;)
Hehe. :worthy:

Well, it doesn't prove anything either. The most important thing about low-level performance tuning is to profile. That goes for C++ as well. For example, bubble sort can be fast for a small number of elements. And I could swear bsr was fast on older processors (and maybe it still is on Intel processors). :closedeye

### #30Nick

Senior Member

• Members
• 1227 posts

Posted 10 May 2006 - 07:32 PM

.oisyn said:

BTW, anyone care to test my piece of code as well? I could type it in myself, if only I weren't so lazy.
It's about two times slower than bsr on an Athlon 64. Oddly, a branching version is the fastest I've got so far (about two times faster, using random input):

int log2(int value)
{
    int log = 0;

    if(value > 0x0000FFFF) {value = value >> 16; log += 16;}
    if(value > 0x000000FF) {value = value >> 8;  log += 8;}
    if(value > 0x0000000F) {value = value >> 4;  log += 4;}
    if(value > 0x00000003) {value = value >> 2;  log += 2;}
    if(value > 0x00000001) {                     log += 1;}

    return log;
}


Haven't compared with the floating-point version yet...
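
A quick sanity check of the branching version above (the harness is mine; the reference counts shifts one bit at a time, the "simple loop" mentioned earlier in the thread):

```cpp
#include <cassert>

// Nick's branching log2, reindented; valid for value >= 1.
static int log2_branching(int value)
{
    int log = 0;
    if (value > 0x0000FFFF) { value >>= 16; log += 16; }
    if (value > 0x000000FF) { value >>= 8;  log += 8;  }
    if (value > 0x0000000F) { value >>= 4;  log += 4;  }
    if (value > 0x00000003) { value >>= 2;  log += 2;  }
    if (value > 0x00000001) {               log += 1;  }
    return log;
}

// Reference: position of the highest set bit, one shift at a time.
static int log2_reference(int value)
{
    int log = 0;
    while (value >>= 1) ++log;
    return log;
}
```

The two agree for all positive inputs, which is easy to verify exhaustively over a small range.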

### #31Nils Pipenbrinck

Senior Member

• Members
• 597 posts

Posted 10 May 2006 - 10:46 PM

Nick said:

Hehe. :worthy:

Well, it doesn't prove anything either. The most important thing about low-level performance tuning is to profile. That goes for C++ as well. For example, bubble sort can be fast for a small number of elements. And I could swear bsr was fast on older processors (and maybe it still is on Intel processors). :closedeye

This evening I tested the code on a P4 (HyperThreading) machine at work... the bsr version was 20% faster than the float version :)

Back when I did assembly programming on the 386 (when bsr made its debut), it was a horribly slow instruction as well, almost as slow as a multiply, but still the fastest way to get the highest set bit. I've rarely found a good use for it though; I did use it once to detect mipmap levels for a software rasterizer.

On XScale CPUs, however, there's a very important instruction that counts the leading zeros of a dword (almost the same as bsr, just reversed). It's crucial on that CPU, since the XScale doesn't have a hardware divide.
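
The relationship between the two instructions can be sketched in plain C++ (clz32 here is a portable stand-in for the hardware count-leading-zeros; the names are mine): for nonzero x, bsr gives 31 - clz.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for a hardware count-leading-zeros instruction.
static int clz32(uint32_t x)
{
    if (x == 0) return 32;          // convention: all 32 bits are zero
    int n = 0;
    while (!(x & 0x80000000u)) {    // shift left until the top bit is set
        x <<= 1;
        ++n;
    }
    return n;
}

// bsr (index of the highest set bit) expressed through clz, for x != 0.
static int bsr32(uint32_t x)
{
    return 31 - clz32(x);
}
```

On real hardware, clz is also the usual first step of a software divide or reciprocal: it normalizes the operands, which is why it matters so much on a divide-less CPU like the XScale.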

### #32tbp

Valued Member

• Members
• 135 posts

Posted 11 May 2006 - 07:05 AM

Quick note: if you're doing a micro-benchmark that takes only a couple of cycles, you'd better stay clear of system calls like the performance counters (context switch, PIC resolution, >5000 cycles last time I measured... OK, that depends on your HAL, but still...).
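
A common way around that cost is to read the clock once around many iterations, so the timer's own overhead is amortized. A minimal sketch with std::chrono (the kernel and iteration count are placeholders of mine):

```cpp
#include <chrono>

// Time `iters` runs of a kernel with a single pair of clock reads, so
// the (possibly expensive) clock call is amortized across all runs.
template <typename F>
double time_per_iteration_ns(F kernel, int iters)
{
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    for (int i = 0; i < iters; ++i)
        kernel();
    auto t1 = clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

In practice the kernel's result should also be fed into a volatile sink (much as the _asm examples above funnel edx into `ret`), or the optimizer may delete the loop entirely.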

PS: Writing a modern raytracer in asm? :wacko:

### #33Nick

Senior Member

• Members
• 1227 posts

Posted 11 May 2006 - 08:57 AM

tbp said:

PS: Writing a modern raytracer in asm? :wacko:
What do you mean?

### #34tbp

Valued Member

• Members
• 135 posts

Posted 11 May 2006 - 09:13 AM

I mean it doesn't make sense. But we've already had that discussion on the defunct flipcode.

Much like you're better off using _BitScanForward than some inline-asm bsr with MSVC (or anything that follows its silly inline asm mechanism), it's much more productive to use intrinsics if you're trying to write a modern SSE-aware raytracer.

I haven't yet seen any serious discussion about how that asm integrates with the rest (calling conventions, inlining, etc.).

I'm not saying you shouldn't look at the generated code (hence requiring at least some notion of asm) or fix things with ad-hoc asm when the compiler derails.
But those broad 3x speedup claims are just utter bollocks; of course, as someone said: you can write Fortran in any language.
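
Here is a sketch of the intrinsic route tbp describes, using GCC/Clang's __builtin_ctz as the portable cousin of MSVC's _BitScanForward (the compiler choice and function name are mine):

```cpp
#include <cassert>
#include <cstdint>

// Index of the lowest set bit: the job _BitScanForward does on MSVC.
// Written as an intrinsic, the compiler can emit bsf/tzcnt directly,
// inline the call, and allocate registers around it, none of which is
// possible with an opaque inline-asm block.
static int lowest_set_bit(uint32_t x)
{
    assert(x != 0);             // undefined for 0, just like bsf itself
    return __builtin_ctz(x);
}
```

On MSVC the same wrapper would call _BitScanForward instead; the point either way is that the intrinsic stays inside the optimizer's view.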

### #35Nick

Senior Member

• Members
• 1227 posts

Posted 11 May 2006 - 12:40 PM

Intrinsics are still assembly. They just integrate with the compiler pipeline, which performs register allocation. But you still need 'pure' assembly knowledge to make good use of them. And, like you said, sometimes the compiler derails, so writing inline assembly still has its merits. If you write whole loops in inline assembly it doesn't matter how it integrates anyway, and you can control register allocation (and spilling) yourself for maximum control. You can even use __declspec(naked)...

So is assembly useful for raytracers? No doubt about it, using either intrinsics or inline assembly.

And a 3x speedup is definitely possible for some classes of applications. Think of the psadbw instruction, which can compute the sum of absolute differences of 16 bytes in just a couple of clock cycles. This is extremely useful for matching operations, like motion estimation in video encoding (e.g. for a webcam). Also, with Intel Core 2 Duo released this summer, SSE will be four times faster than the FPU in every situation. Furthermore, the extra registers lower pressure on the cache.
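
The psadbw example can be tried directly through its SSE2 intrinsic, _mm_sad_epu8 (a sketch; the helper name is mine and it assumes an SSE2-capable x86 target):

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2; _mm_sad_epu8 is the psadbw intrinsic

// Sum of absolute differences over 16 byte pairs via one psadbw.
// psadbw yields two partial sums, one per 8-byte half; add them.
static int sad16(const uint8_t a[16], const uint8_t b[16])
{
    __m128i va  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i sad = _mm_sad_epu8(va, vb);
    return _mm_cvtsi128_si32(sad) + _mm_extract_epi16(sad, 4);
}
```

The scalar equivalent is 16 subtractions, absolute values and additions, which is where the large speedup in motion estimation comes from.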

### #36tbp

Valued Member

• Members
• 135 posts

Posted 11 May 2006 - 01:44 PM

Nick said:

Intrinsics are still assembly. They just integrate with the compiler pipeline, which performs register allocation. But you still need 'pure' assembly knowledge to make good use of them.
'++x', 'a += b', 'p->', 'm[i]', etc.: the whole (OK, most) of the C language maps directly to "asm" by design (and by extension C++).
So that's not news.
The real problem is indeed:
a) knowing what really flies on a given piece of hardware, and
b) expressing it.

Nick said:

And, like you said, sometimes the compiler derails, so writing inline assembly still has its merits. If you write whole loops in inline assembly it doesn't matter how it integrates anyway, and you can control register allocation (and spilling) yourself for maximum control. You can even use __declspec(naked)...
That's where we disagree. Unless you have access to better asm-inlining mechanisms (i.e. constraints with GCC, which are a freaking pain to deal with), there's no way your asm can properly integrate with the rest of the flow: you *will* step on the toes of the optimizer. You force it to kludge around the registers you clobber (remember, it's the compiler's code that surrounds yours, not the other way around). You enforce a calling convention, naked, which prohibits any inlining or shortcuts through the prologue and epilogue. No dead-store removal can happen. Etc. In fact it's just a big opaque blob of untouchable bits.
So you'd better be sure you're going to reap enough benefit from your enlightened hand-coding to pay for the pessimization happening all over.
I'm not saying it isn't possible, or even probable; just that it's no panacea. And I find it more productive to work with the compiler :)

Nick said:

Also, with Intel Core 2 Duo released this summer, SSE will be four times faster than the FPU in every situation. Furthermore, the extra registers lower pressure on the cache.
Hmmkay. But most compilers already produce excellent scalar SSE code.
Granted, they auto-vectorize like crap; and unless you switch to SoA, your vector code will perform like <bleep> anyway, because fundamentally that's not how things are meant to be done.
And with higher-level code you don't need to rewrite all of it because your register file has doubled, or at least not to the same extent.

Granted, the situation isn't as rosy on the integer side of SSE.
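
The SoA switch tbp mentions is purely a data-layout change. A sketch of the two layouts (the type names are mine):

```cpp
#include <cstddef>
#include <vector>

// AoS: the natural C++ layout. A 128-bit register would have to hold
// x, y, z of a single point plus a wasted lane, and needs shuffles.
struct Vec3 { float x, y, z; };

// SoA: each component contiguous, so hand-written SSE (or an
// auto-vectorizer) can process 4 points per register, no shuffles.
struct Points {
    std::vector<float> x, y, z;
};

// A kernel like this vectorizes trivially over the SoA layout.
static void translate(Points& p, float dx, float dy, float dz)
{
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += dx;
        p.y[i] += dy;
        p.z[i] += dz;
    }
}
```

The cost is ergonomic: a 3-component vector class mapped to SSE keeps the familiar interface but leaves most of the speedup on the table, exactly the syndrome complained about later in the thread.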

### #37Nick

Senior Member

• Members
• 1227 posts

Posted 11 May 2006 - 02:42 PM

tbp said:

That's where we disagree.
Do we? ;) I never said inline assembly is superior or anything. In fact my opinion is to use high-level code as much as possible.

But every kind of code has its benefits, and so does inline assembly. When manual register allocation matters (the programmer usually knows best which data needs fast access and what can be spilled), it's a useful option. And by including the surrounding code in the assembly block (mostly loops and the function prologue/epilogue), the overhead can be minimized. Also, intrinsics for SSE use a general-purpose register for a 16-byte-aligned stack frame, so they have some overhead too. I mostly set up an aligned stack frame myself using ebp.

I absolutely agree it's no panacea at all. And I won't hesitate to use another approach if that's better suited to the situation. Simple as that. In fact my performance-critical application uses extremely little inline assembly. I do use tons of SIMD run-time intrinsics (dynamic code generation) though, because that's what makes my application tick. I'd avoid that as well if it were an option. It always comes down to using the best tool for the job, and inline assembly is definitely in my toolbox.

Quote

But most compilers already produce excellent scalar SSE code.
Indeed, but that doesn't gain us much. There's huge potential in future processors for both integer and floating-point SIMD processing that compilers won't be able to exploit efficiently any time soon. So using assembly, in whichever form is most applicable, is very valuable.

Personally I think it would be most useful to add SIMD support directly to the C++ language, using the syntax from HLSL. So we'd primarily have a float4 type, which automatically takes care of all alignment requirements. Intrinsics can take care of exotic instructions (or sequences of instructions), but they need a proper namespace and a simple syntax, like simd::sad(byte8, byte8) instead of _mm_sad_pu8(__m64, __m64). And HLSL's swizzling and masking syntax would be very useful as well. But now I'm just dreaming awake, I think... :whistle:
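
Nick's simd::sad idea can be approximated today with a thin wrapper namespace over the intrinsics. A sketch, assuming SSE2 and 16-byte operands rather than the MMX byte8 type he mentions (all names here are made up):

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2

namespace simd {
    // Friendlier alias for the vector-of-16-bytes type.
    using byte16 = __m128i;

    inline byte16 load(const uint8_t* p)
    {
        return _mm_loadu_si128(reinterpret_cast<const byte16*>(p));
    }

    // Readable name wrapping psadbw (_mm_sad_epu8): total sum of
    // absolute differences over 16 byte pairs.
    inline int sad(byte16 a, byte16 b)
    {
        byte16 s = _mm_sad_epu8(a, b);  // two partial sums, one per half
        return _mm_cvtsi128_si32(s) + _mm_extract_epi16(s, 4);
    }
}
```

Swizzling and masking syntax, of course, still needs language support; a wrapper only fixes the naming.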

### #38tbp

Valued Member

• Members
• 135 posts

Posted 11 May 2006 - 03:23 PM

Right. I still don't think that going for straight asm is generally worth the cost you pay for annoying the compiler, but of course that's the same kind of broad statement I was whining about to begin with ;)

Nick said:

Also, intrinics for SSE use a general-purpose register for a 16-byte aligned stack frame, so they also have an overhead. I mostly set up an aligned stack frame myself using ebp.
Err, nope: the whole stack is promoted to 16-byte alignment, on Windows/Linux with MSVC/GCC/ICC, under the right conditions.

Like you, I find those intrinsics incredibly verbose; they make my eyes go on strike. It's almost as noisy as Java *ducks*
So I never go out without my macro kit. Macros, now that's bleeding edge!

GCC has the kind of built-in float4 type you're looking for. But that doesn't solve the fundamental problem. I mean, it will only make the 3-component-vector-class-mapped-to-SSE syndrome more prevalent than it already is (with its usual corollary question: why isn't it faster?). On that front, only better auto-vectorization will help, I think.
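
For reference, the GCC built-in tbp is alluding to is the vector_size attribute (a GCC/Clang extension; the typedef name float4 follows Nick's post):

```cpp
// GCC/Clang vector extension: a 16-byte vector of four floats with
// elementwise operators generated by the compiler.
typedef float float4 __attribute__((vector_size(16)));

// a * b + c, element by element; compiles to mulps/addps (or FMA).
static float4 madd(float4 a, float4 b, float4 c)
{
    return a * b + c;
}
```

It gives you the float4 type and the arithmetic, but no swizzles or masks, and it does nothing about the AoS-versus-SoA problem.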

### #39Sephir

New Member

• Members
• 2 posts

Posted 11 May 2006 - 04:05 PM

Well, I'm sorry if I was unclear, or not clear enough for that matter...
As I said, I got tired of C++. As I mentioned, I also code as a hobby, so here are the answers:
Q.: "Do you want asm for performance?"
A.: No. I code in it just to read something other than C++.

Q.: "Why not Java, C#, other crap?"
A.: Well, a hobbyist programmer has every right and privilege to choose the language s/he wants to code in. I choose asm. That's all. Besides, the others are all very much alike, which would kill the whole point of programming in a different language.

Q.: "How long have you been programming?"
A.: C++? Just two years. It was enough, though. I can say it's a perfect tool. But I still got tired of it.

Q.: "Do you want to learn how to program assembly?"
A.: Actually, no. I know what I need. I just wanted something to get me going without the long trek before blitting something useful to the screen. That's why I asked for an engine, or just a simple renderer, not tutorials.
I don't know everything I should, and I'm a master of neither asm nor cpp, but since I don't do it 'for a living' I can actually endure bugs.

cypher543 said:

If you are still interested, here is an open source 3D RTS created entirely in 32bit Assembly: http://www.oby.ro/rts/index.html

Thank you very much, I will check it out. It looks promising.

Anyway, thank you all for your input. It's very nice to see people so active.

### #40.oisyn

DevMaster Staff

• Moderators
• 1842 posts

Posted 11 May 2006 - 04:31 PM

I've been programming C++ myself for over 10 years now, and I'm still not getting tired of it.