asm engine
#21
Posted 08 May 2006 - 05:31 PM
Please don't do it for performance reasons. If you want something to 'interest' you more, first read Code Complete; it may prompt you to re-evaluate your outlook on C++ and other languages. Finally, as others have pointed out, maybe find an exciting project instead of an exciting language. It sounds like a lack of motivation more than anything else.
#22
Posted 08 May 2006 - 05:46 PM
#23
Posted 08 May 2006 - 07:39 PM
SmokingRope said:
MMX/SSE Primers (Nice quick reference too)
x86 Architecture (The alphabet starts with ACDB)
#24
Posted 10 May 2006 - 01:06 AM
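// taken from http://www.stereopsis.com/log2.html
// (int32 and uint32 are assumed to be 32-bit integer typedefs)
static inline int32 ilog2(float x)
{
    uint32 ix = (uint32&)x;           // reinterpret the float's bits as an integer
    uint32 exp = (ix >> 23) & 0xFF;   // extract the 8-bit biased exponent
    int32 log2 = int32(exp) - 127;    // remove the IEEE 754 exponent bias
    return log2;
}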
#25
Posted 10 May 2006 - 01:52 AM
I timed bsr against the float version... guess which one was faster (at least on my 1.6 GHz Athlon). I didn't even try to optimize the float version.
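#include <stdio.h>
#include <windows.h>   // LARGE_INTEGER, QueryPerformanceCounter

// MSVC x86 inline assembly: sums bsr(i) for i = 0x00400000 down to 1.
int bsr_test (void)
{
    int ret = 0;
    __asm {
        mov ecx, 0x00400000
        xor edx, edx
    again:
        bsr eax, ecx          // index of the highest set bit
        add edx, eax
        dec ecx
        jnz again
        mov ret, edx
    }
    return ret;
}

// Same sum, but via int-to-float conversion and the float exponent.
int flt_test (void)
{
    int ret = 0;
    float temp;
    __asm {
        mov ecx, 0x00400000
        xor edx, edx
    again:
        mov temp, ecx         // store the integer in the temp slot
        fild temp             // load it as an integer onto the FPU stack
        fstp temp             // store it back as a float
        mov eax, [temp]
        shr eax, 23           // move the exponent down
        and eax, 255          // mask the 8 exponent bits
        sub eax, 127          // remove the bias
        add edx, eax
        dec ecx
        jnz again
        mov ret, edx
    }
    return ret;
}

int main (void)
{
    LARGE_INTEGER t1,t2,t3;
    int i=0;
    int j=0;
    QueryPerformanceCounter (&t1);
    for (int k=0; k<10; k++)
        i += bsr_test();
    QueryPerformanceCounter (&t2);
    t2.QuadPart -= t1.QuadPart;
    QueryPerformanceCounter (&t1);
    for (int k=0; k<10; k++)
        j += flt_test();
    QueryPerformanceCounter (&t3);
    t3.QuadPart -= t1.QuadPart;
    // Times are raw performance-counter ticks, not seconds.
    printf ("time bsr = %d, result = %d\n", (int)t2.QuadPart,i);
    printf ("time flt = %d, result = %d\n", (int)t3.QuadPart,j);
    return 0;
}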
#26
Posted 10 May 2006 - 12:53 PM
Nils Pipenbrinck said:
I checked the AMD manuals and it turns out the bsr instruction has a high latency and uses a slow decoder. So it was a bad example (at least for Athlon processors). But the results could have been totally opposite. I should have checked. :blush:
#27
Posted 10 May 2006 - 01:32 PM
Nick said:
I checked the AMD manuals and it turns out the bsr instruction has a high latency and uses a slow decoder. So it was a bad example (at least for Athlon processors). But the results could have been totally opposite. I should have checked. :blush:
Well, at least it was a good example that hand-written asm is not necessarily 300% faster ;)
#28
Posted 10 May 2006 - 02:05 PM
BTW, anyone care to test my piece of code as well? I could type it in myself, if only I weren't that lazy.
-
Currently working on: the 3D engine for Tomb Raider.
#29
Posted 10 May 2006 - 07:27 PM
juhnu said:
Well, it doesn't prove anything either. The most important thing about low-level performance tuning is to profile. That counts for C++ as well. For example bubble sort can be fast for a small number of elements. And I could swear bsr was fast on older processors (and maybe it still is on Intel processors). :closedeye
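For anyone who wants to follow the "profile first" advice without MSVC inline asm, here is a minimal portable sketch. The names `best_time_ns` and `sum_log2_loop` are made up for illustration and do not come from the thread; the idea is only to time candidate implementations on the hardware you actually care about.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the thread): calls fn() `repeats` times and
// returns the fastest run in nanoseconds, which filters out some scheduler noise.
template <typename Fn>
std::int64_t best_time_ns(Fn fn, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (ns < best) best = ns;
    }
    return best;
}

// Example workload: sum of floor(log2(i)) for i in [1, n], using a plain shift loop.
inline std::uint32_t sum_log2_loop(std::uint32_t n) {
    std::uint32_t sum = 0;
    for (std::uint32_t i = 1; i <= n; ++i) {
        std::uint32_t v = i;
        while (v >>= 1) ++sum;   // one increment per halving
    }
    return sum;
}
```

Printing the workload's result next to the timing, as the benchmark above does, is worth keeping: it stops the compiler from discarding the loop and doubles as a correctness check between variants.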
#30
Posted 10 May 2006 - 07:32 PM
.oisyn said:
int log2(int value)
{
    int log = 0;
    if(value > 0x0000FFFF) {value = value >> 16; log += 16;}
    if(value > 0x000000FF) {value = value >> 8;  log += 8;}
    if(value > 0x0000000F) {value = value >> 4;  log += 4;}
    if(value > 0x00000003) {value = value >> 2;  log += 2;}
    if(value > 0x00000001) {log += 1;}
    return log;
}
Haven't compared with the floating-point version yet...
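Before timing the two, it may be worth checking that they agree. The sketch below is not from the thread: it restates the binary-search routine on an unsigned input, rewrites the float-exponent trick with `memcpy` instead of the reference cast, and adds a hypothetical `first_mismatch` helper. It assumes IEEE 754 single-precision floats.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Binary-search version from the post, taking an unsigned value.
inline int log2_binary(std::uint32_t value) {
    int log = 0;
    if (value > 0x0000FFFFu) { value >>= 16; log += 16; }
    if (value > 0x000000FFu) { value >>= 8;  log += 8; }
    if (value > 0x0000000Fu) { value >>= 4;  log += 4; }
    if (value > 0x00000003u) { value >>= 2;  log += 2; }
    if (value > 0x00000001u) { log += 1; }
    return log;
}

// Float-exponent version: convert to float, then read the biased exponent.
inline int log2_float(std::uint32_t value) {
    const float f = static_cast<float>(value);
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // well-defined alternative to (uint32&)f
    return static_cast<int>((bits >> 23) & 0xFF) - 127;
}

// Hypothetical helper: first value in [1, limit] where the two disagree, or 0 if none.
inline std::uint32_t first_mismatch(std::uint32_t limit) {
    assert(limit >= 1);
    for (std::uint32_t v = 1; v <= limit; ++v)
        if (log2_binary(v) != log2_float(v)) return v;
    return 0;
}
```

Over the range used in the benchmark (up to 0x00400000) the two agree. Above 2^24 a float can no longer hold every integer exactly, so the conversion may round up to the next power of two and the float version then reports one too many; 0xFFFFFFFF is such a case.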
#31
Posted 10 May 2006 - 10:46 PM
Nick said:
Well, it doesn't prove anything either. The most important thing about low-level performance tuning is to profile. That counts for C++ as well. For example bubble sort can be fast for a small number of elements. And I could swear bsr was fast on older processors (and maybe it still is on Intel processors). :closedeye
This evening I tested the code on a P4 (hyperthreading) machine at work.. the bsr-version was 20% faster than the float version :)
Back when I did assembly programming on the 386 (when bsr had its debut), it was a horribly slow instruction as well. Almost as slow as a multiply, but still the fastest way to get the highest set bit. I've never found a good use for it though. I've used it once to detect mipmap levels for a software rasterizer.
On XScale CPUs, however, there's a very important instruction that counts the leading zeros of a dword (almost the same as bsr, just the reverse). It's damn important for this CPU since the XScale doesn't have a hardware divide.
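To make "almost the same as bsr, just the reverse" concrete: for a nonzero 32-bit value, the index of the highest set bit is 31 minus the number of leading zeros. The sketch below uses a plain loop rather than any real instruction, and the names `clz32` and `highest_bit` are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Portable count-leading-zeros for a 32-bit value; returns 32 for x == 0.
inline int clz32(std::uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    while ((x & 0x80000000u) == 0) {   // shift left until the top bit is set
        x <<= 1;
        ++n;
    }
    return n;
}

// Index of the highest set bit, i.e. what bsr reports for a nonzero input.
inline int highest_bit(std::uint32_t x) {
    assert(x != 0);                    // like bsr, not meaningful for zero
    return 31 - clz32(x);
}
```

So `highest_bit(0x00400000)` is 22 and `clz32(0x00400000)` is 9; either count can be derived from the other with a single subtraction.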
#32
Posted 11 May 2006 - 07:05 AM
PS: Writing a modern raytracer in asm? :wacko:
#33
Posted 11 May 2006 - 08:57 AM
tbp said:
#34
Posted 11 May 2006 - 09:13 AM
Much like you're better off using _BitScanReverse than some inline asm bsr with MSVC (or anything that follows its silly inline asm mechanism), it's much more productive to use intrinsics if you're trying to write a modern SSE-aware raytracer.
Because I haven't yet seen any serious discussion about how that asm integrates with the rest (calling conventions, inlining etc).
I'm not saying that you shouldn't look at the generated code (hence requiring at least some notion of asm) or fix things with ad-hoc asm when the compiler derails.
But those broad 3x speedup claims are just utter bollocks; of course as someone said: You can write Fortran in any language.
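A rough sketch of the intrinsic route, for comparison with the inline asm earlier in the thread. `_BitScanReverse` is the MSVC intrinsic that corresponds to bsr, and `__builtin_clz` is the GCC/Clang builtin; the wrapper name `highest_set_bit` is an assumption made for this example. Unlike an `__asm` block, the compiler can inline and schedule this freely.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Index of the highest set bit of a nonzero 32-bit value.
inline int highest_set_bit(std::uint32_t x) {
    assert(x != 0);                       // result is undefined for zero
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanReverse(&index, x);           // MSVC intrinsic for bsr
    return static_cast<int>(index);
#elif defined(__GNUC__) || defined(__clang__)
    return 31 - __builtin_clz(x);         // count leading zeros, then flip
#else
    int index = 0;                        // plain fallback for other compilers
    while (x >>= 1) ++index;
    return index;
#endif
}
```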
#35
Posted 11 May 2006 - 12:40 PM
So is assembly useful for raytracers? No doubt about it, using either intrinsics or inline assembly.
And a 3x speedup is definitely possible for some classes of applications. Think about the psadbw operation, which can compute the sum of the absolute differences of 16 bytes in just a couple of clock cycles. This is extremely useful for matching operations, like motion estimation in video encoding (cf. webcams). Also, with the Intel Core 2 Duo due for release this summer, SSE will be four times faster than the FPU in every situation. Furthermore, the extra registers lower pressure on the cache.
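As an illustration of the psadbw point, here is a sketch using `_mm_sad_epu8`, the SSE2 intrinsic for psadbw on 128-bit registers, next to a scalar reference. The function names and the demo data are made up for this example, and the SIMD path is only compiled when the compiler reports SSE2 support.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SAD_HAVE_SSE2 1
#endif

// Scalar reference: sum of |a[i] - b[i]| over 16 bytes.
inline unsigned sad16_scalar(const std::uint8_t* a, const std::uint8_t* b) {
    assert(a != nullptr && b != nullptr);
    unsigned sum = 0;
    for (int i = 0; i < 16; ++i)
        sum += static_cast<unsigned>(std::abs(int(a[i]) - int(b[i])));
    return sum;
}

// Same result via psadbw when SSE2 is available, scalar otherwise.
inline unsigned sad16(const std::uint8_t* a, const std::uint8_t* b) {
#if defined(SAD_HAVE_SSE2)
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    const __m128i s = _mm_sad_epu8(va, vb);   // two partial sums, one per 64-bit half
    return static_cast<unsigned>(_mm_cvtsi128_si32(s)) +
           static_cast<unsigned>(_mm_extract_epi16(s, 4));
#else
    return sad16_scalar(a, b);
#endif
}

// Demo data: a = 0..15, b = 15..0, so the differences are |2i - 15|.
inline unsigned sad16_demo(bool scalar_only) {
    std::uint8_t a[16], b[16];
    for (int i = 0; i < 16; ++i) {
        a[i] = static_cast<std::uint8_t>(i);
        b[i] = static_cast<std::uint8_t>(15 - i);
    }
    return scalar_only ? sad16_scalar(a, b) : sad16(a, b);
}
```

The one instruction replaces 16 subtractions, 16 absolute values and 15 additions. How much of that turns into wall-clock gain still has to be measured.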
#36
Posted 11 May 2006 - 01:44 PM
Nick said:
So that's not news.
The real problem is indeed
a) to know what really flies on a given hardware
b) express it
Nick said:
So, you'd better be sure you're going to reap enough benefits from your enlightened hand coding to pay for the pessimization happening all over.
I'm not saying it's not possible or probable, just that it's no panacea. And that I find it more productive to work with the compiler :)
Nick said:
Granted they auto-vectorize like crap; now unless you switch to SoA, your vector code will perform like <bleep> anyway because fundamentally that's not how things are meant to be.
And with higher level code you don't need to rewrite all your code because your register file has doubled, or not to the same extent.
Granted, the situation isn't as rosy with the integer side of SSE.
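For readers unfamiliar with the AoS/SoA distinction, a small sketch follows. The type and function names are illustrative, not from the thread. With a structure of arrays, each loop iteration touches the same field of consecutive elements, which is the layout SIMD units (and auto-vectorizers) prefer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: x, y, z of one vector sit next to each other.
struct Vec3AoS { float x, y, z; };

// Structure of arrays: all x values are contiguous, then all y, then all z.
struct Vec3SoA {
    std::vector<float> x, y, z;
};

// Convert an AoS list into SoA form.
inline Vec3SoA to_soa(const std::vector<Vec3AoS>& in) {
    Vec3SoA out;
    for (const Vec3AoS& v : in) {
        out.x.push_back(v.x);
        out.y.push_back(v.y);
        out.z.push_back(v.z);
    }
    return out;
}

// Element-wise dot products; every operand stream is a contiguous float array.
inline std::vector<float> dot_soa(const Vec3SoA& a, const Vec3SoA& b) {
    assert(a.x.size() == b.x.size());
    const std::size_t n = a.x.size();
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
    return out;
}

// Demo: two pairs of vectors whose dot products are 1 and 15.
inline float dot_demo() {
    const std::vector<Vec3AoS> a = {{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}};
    const std::vector<Vec3AoS> b = {{1.0f, 0.0f, 0.0f}, {1.0f, 1.0f, 1.0f}};
    const std::vector<float> d = dot_soa(to_soa(a), to_soa(b));
    return d[0] + d[1];
}
```

With four floats per SSE register, the SoA loop can process four dot products per iteration with no shuffling, whereas a single xyz vector in one register leaves a lane idle and needs a horizontal add.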
#37
Posted 11 May 2006 - 02:42 PM
tbp said:
But all code has its benefits and so does inline assembly. When manual register allocation matters (the programmer mostly knows best which data needs fast access and what can be spilled) it's a useful option. And by including the surrounding code into the assembly block (mostly loops and function prologue/epilogue) the overhead can be minimized. Also, intrinsics for SSE use a general-purpose register for a 16-byte aligned stack frame, so they also have an overhead. I mostly set up an aligned stack frame myself using ebp.
I absolutely agree it's no panacea at all. And I won't hesitate to use another approach if that's more suited for the situation. Simple as that. In fact my performance critical application uses extremely little inline assembly. I do use tons of SIMD run-time intrinsics (dynamic code generation) though, because that's what makes my application tick. I'd avoid that as well if it was an option. It always comes down to using the best tool for the job, and inline assembly is definitely in my toolbox.
Personally I think it would be most useful to add SIMD support directly to the C++ language, using the syntax from HLSL. So we'd primarily have a float4 type, which automatically takes care of all alignment requirements. Intrinsics can take care of exotic instructions (or sequences of instructions) but they need a proper namespace and simple syntax, like simd::sad(byte8, byte8) instead of _mm_sad_pu8(__m64, __m64). And HLSL swizzling and masking syntax would be very useful as well. But now I'm just daydreaming I think... :whistle:
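A rough idea of what such an interface could look like in plain C++ is sketched below. Everything here is hypothetical: the `simd` namespace, `float4`, `byte8` and `sad` only echo the names in the post, and the bodies are scalar stand-ins where a real library would use SSE instructions.

```cpp
#include <cassert>
#include <cstdint>

namespace simd {   // hypothetical namespace, named after the example in the post

// 16-byte aligned so it could map directly onto an SSE register.
struct alignas(16) float4 {
    float x, y, z, w;
    // One hand-written swizzle; HLSL allows any combination, such as v.wzyx.
    float4 wzyx() const { return float4{w, z, y, x}; }
};

inline float4 operator+(const float4& a, const float4& b) {
    return float4{a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}

inline float4 operator*(const float4& a, const float4& b) {
    return float4{a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w};
}

inline float dot(const float4& a, const float4& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

struct byte8 { std::uint8_t v[8]; };

// Readable name for the sum of absolute differences (scalar stand-in).
inline unsigned sad(const byte8& a, const byte8& b) {
    unsigned sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += (a.v[i] > b.v[i]) ? unsigned(a.v[i] - b.v[i])
                                 : unsigned(b.v[i] - a.v[i]);
    return sum;
}

// Small sanity check of the sketch itself.
inline void self_check() {
    const float4 a{1.0f, 2.0f, 3.0f, 4.0f};
    assert(a.wzyx().x == 4.0f);
    assert(alignof(float4) == 16);
}

} // namespace simd
```

Operator overloading covers the arithmetic, but general swizzles and write masks do not map onto C++ syntax this cleanly, which is presumably why the post asks for language support.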
#38
Posted 11 May 2006 - 03:23 PM
Nick said:
Like you, I find those intrinsics incredibly verbose; they make my eyes go on strike. It's almost as noisy as Java *ducks*
So I never go out without my macro kit. Macros, now that's bleeding edge!
GCC has the kind of builtin float4 type you're looking for. But that doesn't solve the fundamental problem. I mean it will only make the 3-component-vector-class-mapped-to-SSE syndrome more prevalent than it is (with its usual corollary question: why isn't it faster?). On that front only better auto-vectorization will help, I think.
#39
Posted 11 May 2006 - 04:05 PM
As I said, I got tired of C++. As I mentioned, I also code as a hobby, so the answers are:
Q.: "Do you want asm for performance?"
A.: No. I code it just to read something other than C++.
Q.: "Why not Java, C#, other crap?"
A.: Well, a hobbyist programmer has the right and privilege to choose the language s/he wants to code in. I choose asm. That's all. Not only that, but the others are very alike, which would kill the whole point of programming in a different language.
Q.: "How long have you been programming?"
A.: C++? Just 2 years. It was enough though. I can say it's a perfect tool. But I still got tired of it.
Q.: "Want to learn how to program assembly?"
A.: Actually not. I know what I need. I just wanted something to get going without having to go all that long way before blitting something useful to the screen. That's why I asked for an engine or just a simple renderer, not tutorials.
I don't know everything I should and I am not a master of either asm or C++, but since I don't do it 'for a living' I can actually endure bugs.
cypher543 said:
Thank you very much I will check it out. Looks promising.
Anyway, thank you all for your input. Very nice to see people active.
#40
Posted 11 May 2006 - 04:31 PM