0
139 Feb 14, 2011 at 04:51

I’ve been throwing together an SSE math library, and when I tried to write my 4x4 matrix constructor, I got this error using VC++ 2008 Express:

error C2719: 'n3': formal parameter with __declspec(align('16')) won't be aligned

The code is:

struct float4x4
{
__m128 m0;
__m128 m1;
__m128 m2;
__m128 m3;

float4x4 (__m128 n0, __m128 n1, __m128 n2, __m128 n3): m0(n0), m1(n1), m2(n2), m3(n3) {}
// ...snip...
};


The thing I can’t understand is that the error only shows up on n3, the last row of the matrix. My 3x3 matrix does just the same thing and it works fine:

struct float3x3
{
__m128 m0;
__m128 m1;
__m128 m2;

float3x3 (__m128 n0, __m128 n1, __m128 n2): m0(n0), m1(n1), m2(n2) {}
// ...snip...
};


So…what’s the deal? The compiler is smart enough to align 3 parameters properly, but can’t handle 4? ;)

I’m guessing I’m just going to have to declare all these things to take const-references instead of __m128 by value (a bit of Googling has suggested that __m128 by value isn’t considered quite safe in general, although that seems a bit silly to me). Hopefully the references will get optimized away and not *actually* forced into memory all the time…
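A minimal sketch of that const-reference variant, assuming `<xmmintrin.h>` is available. The `last_row_x` helper is hypothetical and only there so the result can be checked; whether the references actually get optimized away still depends on inlining.

```cpp
#include <xmmintrin.h>  // __m128, _mm_storeu_ps

struct float4x4
{
    __m128 m0;
    __m128 m1;
    __m128 m2;
    __m128 m3;

    // Rows are taken by const reference, so no 16-byte-aligned value
    // has to be copied into the argument area of the stack.
    float4x4 (const __m128& n0, const __m128& n1, const __m128& n2, const __m128& n3)
        : m0(n0), m1(n1), m2(n2), m3(n3) {}
};

// Hypothetical helper: reads back lane 0 of the last row via an unaligned store.
inline float last_row_x (const float4x4& m)
{
    float out[4];
    _mm_storeu_ps(out, m.m3);
    return out[0];
}
```

Call sites do not change: temporaries such as the result of `_mm_set_ps(...)` bind to the const references directly.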

#### 22 Replies

0
101 Feb 14, 2011 at 06:23

Well, like MSDN says: “The align __declspec modifier is not permitted on function parameters”, which kind of makes sense, when you think about it.

0
101 Feb 14, 2011 at 07:11

No, it does not make sense at all, when you think about it. Also, it makes even less sense that it’s allowed for the first three parameters (well, for sse datatypes at least), but not for the rest.

@Reedbeta: in my experience, the compiler will not force them to memory. If the function can be inlined and the contents are already in a register at the call site, it will keep them in that register.

0
101 Feb 14, 2011 at 19:40

It makes perfect sense. If you were allowed to align formal parameters, calling conventions would go out the window.

The reason it might be able to handle 3, and not 4, might be because it is able to keep the first 3 __m128s in registers, but when introducing more, it runs out of space, which forces it to use the stack, and hence the alignment error.

You can probably circumvent this, by passing them as references, instead of by value.

0
101 Feb 14, 2011 at 22:34

@Kenneth Gorking

It makes perfect sense. If you were allowed to align formal parameters, calling conventions would go out the window.

Your argument is invalid. The fact is that current calling conventions don’t deal with explicitly aligned parameters. They would remain perfectly compatible if rules for aligned types were added. Since you weren’t allowed to declare such functions before, no code is going to break by allowing them now.

The reason it might be able to handle 3, and not 4, might be because it is able to keep the first 3 __m128s in registers

Ok, wait, first you argue about calling conventions, and then you talk about having them passed in registers? How is that compatible with cdecl or stdcall?

0
101 Feb 15, 2011 at 09:28

@.oisyn

Fact is that current calling conventions don’t deal with explicitly aligned parameters

That’s what I was getting at. Any padding added by alignment would result in the function accessing bad data.

@.oisyn

It would be perfectly compatible by adding rules for aligned types. Since you weren’t allowed to declare such functions before, no code is going to break by allowing them now.

Trying to patch up x86 at this time seems futile. It would probably also be a nightmare-inducing endeavor. Simply switching to x64 would make all this go away.

@.oisyn

Ok, wait, first you argue about calling conventions, and then you talk about having them passed in registers? How is that compatible with cdecl or stdcall?

First, I was speaking generally, then I was addressing the problem at hand. Also, __m128 variables are mapped directly to XMM registers, so your point is moot.

Anyways, after some digging around, I found that the first three __m128 are indeed passed in registers, and the rest go on the stack, hence the compiler error.

0
101 Feb 15, 2011 at 09:53

@Kenneth Gorking

That’s what I was getting at. Any padding added by alignment would result in the function accessing bad data. Trying to patch up x86 at this time seems futile. It would probably also be a nightmare-inducing endeavor.

You don’t have to patch anything. As I said, you currently can’t use aligned parameters at all. This means that current function declarations don’t contain any aligned types, which implies that you will *NEVER* access bad data by introducing padding, simply because padding will not be required for those functions. We’re only talking about functions with __declspec(align(x)) type parameters, and they don’t exist yet.

First, I was speaking generally, then I was addressing the problem at hand. Also, __m128 variables are mapped directly to XMM registers, so your point is moot.

An __m128 variable is just as mapped to a register as an int is.

0
102 Feb 15, 2011 at 13:44

@.oisyn

An __m128 variable is just as mapped to a register as an int is.

I could be very mistaken, but I was under the impression that the Visual C++ compiler treats them as special. It’s a proprietary data format and it’s not well defined. If something works, great, if it doesn’t, oh well.

0
101 Feb 15, 2011 at 13:55

It’s only special in the sense that it’s essentially a user defined type, yet the compiler understands that it can put them in SSE registers as if it were a built-in type. Aside from the fact that it’s a struct, it isn’t much different from an int.
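One difference from an int that matters for this thread is the alignment requirement. A small sketch, assuming a C++11 compiler (VC++ 2008 predates `static_assert` and `alignof`, where `__alignof` would be the closest equivalent); `is_aligned_for_m128` is a hypothetical helper:

```cpp
#include <xmmintrin.h>  // __m128
#include <cstdint>      // std::uintptr_t

// Compile-time checks of the properties discussed above.
static_assert(sizeof(__m128) == 16, "__m128 holds four packed floats");
static_assert(alignof(__m128) == 16, "__m128 requires 16-byte alignment");
static_assert(alignof(int) < alignof(__m128), "int is less strictly aligned");

// Hypothetical helper: does this address satisfy the __m128 alignment?
inline bool is_aligned_for_m128 (const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(__m128) == 0;
}
```

A local `__m128` passes this check because the compiler aligns it itself; the open question in this thread is only what happens to arguments pushed on a 32-bit stack.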

0
102 Feb 15, 2011 at 14:20

@.oisyn

Aside from the fact that it’s a struct, it isn’t much different from an int.

Even as a struct it’s very special. I believe the debugger is capable of showing the symbolic values even when it’s really stored in a register.

0
101 Feb 15, 2011 at 15:14

Doesn’t that also apply to int?

0
101 Feb 15, 2011 at 17:45

pass parameters by reference like this:

float4x4 (const __m128& n0, const __m128& n1, const __m128& n2, const __m128& n3)

This should fix your problem. The compiler cannot align function parameters, and there are reasons for that. You have to understand how the call is translated into machine code. Registers are limited, and there are two ways to pass parameters to a function:
__stdcall (standard calling convention): by stack;
__fastcall (fast calling convention): the first n values in registers, the rest on the stack.
Microsoft and GCC now use __fastcall by default, while Embarcadero C++ Builder uses __stdcall for C++ code. If you compile for a 32-bit platform, only the first 3 are passed in registers; if you compile for 64 bits, only the first 6 on Windows and 8 on Linux. All the parameters that are passed on the stack will not be aligned. Can you guess why?
stack pass:
push value1,
push value2,
call function

function:
push ebp
mov ebp, esp
sub esp, 8

mov esp, ebp
pop ebp
ret 0

So a function is compiled to save the stack pointer and then allocate its local variables on the stack. If you request alignment for a local variable, the compiler adds padding; the stack layout becomes messy, but it is restored at function exit.
If you want to pass aligned data on the stack, the compiler must add padding before calling the function and undo it afterwards by correcting the stack pointer. That means it would have to create n variations of the compiled function depending on the stack padding combination: for 16-byte alignment, 16 combinations. The compiler won’t do this for you. Passing by reference is the only solution that does not rely on specific compiler or platform behaviour.
That way only the address is passed on the stack, and the function fetches the data from where it was already aligned, either globally or in the stack frame of the calling function.
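To make the padding argument concrete, here is a rough sketch of the arithmetic, not of what any compiler actually emits. The function names are hypothetical, and the stack is assumed to grow toward lower addresses as on x86:

```cpp
#include <cstdint>  // std::uintptr_t

// Bytes of padding needed to round `sp` DOWN to a multiple of `alignment`.
// `alignment` must be a power of two.
inline std::uintptr_t padding_to_align_down (std::uintptr_t sp, std::uintptr_t alignment)
{
    return sp & (alignment - 1);
}

// The aligned stack pointer after that padding has been subtracted.
inline std::uintptr_t align_down (std::uintptr_t sp, std::uintptr_t alignment)
{
    return sp - padding_to_align_down(sp, alignment);
}
```

The padding depends on the run-time value of the stack pointer, so a callee that only knows its parameters are at fixed offsets from the incoming stack pointer cannot know where an aligned argument ended up.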

Hope the explanation is clear enough…

0
101 Feb 15, 2011 at 19:44

@.oisyn

An __m128 variable is just as mapped to a register as an int is.

Not when passing formals.
@Sandevil

pass parameters by reference like this: float4x4 (const __m128& n0, const __m128& n1, const __m128& n2, const __m128& n3)

Only parameters 4 and up need to be passed by reference. The first three will go in registers, which saves a pointer indirection.
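A sketch of that mixed signature, assuming the 32-bit VC++ behaviour described in this thread (first three `__m128` arguments in registers); other compilers or targets may pass the by-value rows differently. `sum_of_first_lanes` is a hypothetical helper used only for checking:

```cpp
#include <xmmintrin.h>  // __m128, _mm_add_ss, _mm_cvtss_f32

struct float4x4
{
    __m128 m0;
    __m128 m1;
    __m128 m2;
    __m128 m3;

    // First three rows by value (register candidates), the fourth by
    // const reference so it is never copied to the stack as an argument.
    float4x4 (__m128 n0, __m128 n1, __m128 n2, const __m128& n3)
        : m0(n0), m1(n1), m2(n2), m3(n3) {}
};

// Hypothetical helper: adds lane 0 of all four rows.
inline float sum_of_first_lanes (const float4x4& m)
{
    const __m128 a = _mm_add_ss(m.m0, m.m1);
    const __m128 b = _mm_add_ss(m.m2, m.m3);
    return _mm_cvtss_f32(_mm_add_ss(a, b));
}
```

The trade-off is that the signature now encodes a compiler-specific detail, which is the portability concern raised elsewhere in the thread.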
@Sandevil

Microsoft and GCC now use __fastcall by default

Actually, they use __cdecl.
@Sandevil

If you compile for 64 bits, only the first 6 on Windows and 8 on Linux. All the parameters that are passed on the stack will not be aligned.

The first 4 parameters are passed on the stack, for both systems, and everything on the stack is aligned on a 16-byte boundary. That is also why this problem won’t be an issue on x64 systems.

0
101 Feb 15, 2011 at 20:30

__cdecl comes from the former Borland __fastcall, in which the first 3 parameters are passed in registers. By the way, nowadays they produce the same code.
Windows and Linux do not follow the same calling convention on 64-bit systems, but I must correct myself: it is 4 for Windows and 6 for Linux.

Look here:

http://blog.csdn.net/yiruirui0507/archive/2010/08/20/5827661.aspx

Also, the first ones are passed in registers and the remaining ones on the stack.
Saving the pointer indirection is a good idea, but as I said it will behave differently on different platforms.

0
101 Feb 15, 2011 at 20:33

I forgot: for integers it is 4 vs 6; for float SIMD or integer SIMD (SSE4) it is 4 on Windows and 8 on Linux.

0
101 Feb 15, 2011 at 21:46

@Kenneth Gorking

Not when passing formals.

Exactly, the __m128 type has a set of rules in calling conventions just like any other built-in type.
@Sandevil

__cdecl comes from the former Borland __fastcall, in which the first 3 parameters are passed in registers. By the way, nowadays they produce the same code.

Actually, __cdecl is the x86 way of passing arguments by pushing them on the stack from right to left. Values are returned in their respective register(s), and the caller is responsible for stack cleanup.

Indeed, look there. You will read that all the cdecl parameters are passed on the stack. Perhaps you are confused by the fact that eax, ecx and edx are free for the function to use (ie., their state need not be restored).
Here’s another source: http://en.wikipedia.org/wiki/X86_calling_conventions#cdecl

The cdecl calling convention is used by many C systems for the x86 architecture[1]. In cdecl, function parameters are pushed on the stack in a right-to-left order. Function return values are returned in the EAX register (except for floating point values, which are returned in the x87 register ST0). Registers EAX, ECX, and EDX are available for use in the function.

0
101 Feb 15, 2011 at 22:50

No, I’m not confused. Start from the fact that the first 3 do not give errors because they are passed in registers.
__cdecl and __fastcall are nowadays exactly the same thing. In fact, __fastcall was introduced by Borland, which passes the first 3 in registers and all the rest on the stack from left to right. Microsoft, in its war against the Borland compilers, used a calling convention with 2 in registers and all the rest on the stack, but right to left (the last parameter is pushed first). Borland was faster, so Microsoft moved to 3 in registers and the rest on the stack from right to left. Microsoft won, GCC adopted the same convention, and it is now referred to as __cdecl, but Borland continues to call it __fastcall, and today Microsoft compilers treat __fastcall the same as __cdecl.
So practically they are now different names for the same thing.
I know because if today you use __fastcall in C++ Builder and then use a DLL compiled with Visual C++ with __cdecl, everything is OK. If they were different, a crash would occur. You probably don’t remember the war between Microsoft, Borland and Sybase. I’m not here to start a war, but trust me, I know what I’m saying: __fastcall and __cdecl are today the same thing, like __property (C++ Builder) and __declspec(property) (Microsoft).
In the link you posted, see Microsoft fastcall and Borland fastcall and you will start to understand the compiler war of the past.
As my history teacher used to say, “the winner of a war writes the history”.

0
101 Feb 16, 2011 at 00:40

By the way, this confusion comes from the fact that different compilers use the same name for different meanings, or different identifiers for the same meaning, and most of them are not standard.
A standard for C++ code does not exist, but the de facto standard is to pass the first ‘n’ params in registers, which obviously vary from CPU to CPU (x86 is not the same as POWER7, PowerPC or an ARM CPU).
The standard calling convention for C code (C89, C99) forces compilers to pass all the parameters on the stack.
So, to respect that standard, if you want to use C code in C++ you must declare it this way:
extern "C" {
result functionName(params);
}
With the Microsoft compiler, for a non-member function you can declare a C function as WINAPI, __cdecl (C declaration) or __stdcall (C standard call), avoiding the extern "C" { … }. WINAPI is __cdecl because the OS is written in C and not in C++, and all calls to a Windows API must pass all the parameters on the stack to respect the standard.
Other compilers use the standard, so for the MinGW compiler WINAPI is typically declared as an empty macro and all the OS APIs are included in a header with extern "C" { } for a C++ compiler, or the compiler receives the switch to compile in C mode.
For the OS API, just include the header; the compiler will do the rest.
Usually I try not to use non-standard specifiers if possible, so I generally use only ‘inline’. For everything else I write a macro like:
#define FORCE_INLINE __forceinline
#define DLL_IMPORT __declspec(dllimport)
and so on. This way, porting to another platform is easy.
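A hedged sketch of how such wrapper macros are often made portable. It assumes only MSVC and GCC-compatible compilers need to be told apart, and `mul_add` is a hypothetical function used to show the macro in place:

```cpp
// Assumption: anything that is neither MSVC nor GCC-compatible falls back
// to plain `inline` and no import attribute.
#if defined(_MSC_VER)
    #define FORCE_INLINE __forceinline
    #define DLL_IMPORT   __declspec(dllimport)
#elif defined(__GNUC__)
    #define FORCE_INLINE inline __attribute__((always_inline))
    #define DLL_IMPORT
#else
    #define FORCE_INLINE inline
    #define DLL_IMPORT
#endif

// Hypothetical example function using the wrapper.
FORCE_INLINE int mul_add (int a, int b, int c)
{
    return a * b + c;
}
```

Only this one header has to change when a new compiler is added; the rest of the code keeps using `FORCE_INLINE` and `DLL_IMPORT`.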

What do you think about Intel forcing Microsoft to adopt a C++ calling convention that will benefit only their processors, and also excluding us from using inline assembly on the 64-bit Windows platform?
Knowing that this will penalize AMD, and that Linux has chosen the AMD-proposed calling convention. So AMD has a chance to win for free on Linux web servers, and Intel wins (by paying Microsoft) on the desktop/notebook platform.

0
101 Feb 16, 2011 at 07:20

@Sandevil

No, I’m not confused. Start from the fact that the first 3 do not give errors because they are passed in registers.

You are confused. As has already been mentioned in this thread, this is due to special rules for the __m128 datatype.
@Sandevil

__cdecl and __fastcall are nowadays exactly the same thing

I can assure you, they are not. From MSDN:
__cdecl

• Stack-maintenance responsibility: Calling function pops the arguments from the stack
• Name-decoration convention: Underscore character (_) is prefixed to names, except when exporting __cdecl functions that use C linkage.

__fastcall

• Argument-passing order: The first two DWORD or smaller arguments are passed in ECX and EDX registers; all other arguments are passed right to left.
• Stack-maintenance responsibility: Called function pops the arguments from the stack.
• Name-decoration convention: At sign (@) is prefixed to names; an at sign followed by the number of bytes (in decimal) in the parameter list is suffixed to names.
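The two name-decoration rules quoted above can be written down as a toy model. It covers only C-linkage names on 32-bit targets, and the `decorate_*` function names are hypothetical; C++ names such as ?test1@@YAHHHHHH@Z follow a different mangling scheme.

```cpp
#include <string>

// __cdecl: an underscore is prefixed to the name.
inline std::string decorate_cdecl (const std::string& name)
{
    return "_" + name;
}

// __fastcall: '@' prefix, then '@' and the size of the parameter
// list in bytes (decimal) as a suffix.
inline std::string decorate_fastcall (const std::string& name, unsigned arg_bytes)
{
    return "@" + name + "@" + std::to_string(arg_bytes);
}
```

For a function taking five ints, the parameter list is 20 bytes, so the model produces `@name@20` for __fastcall and `_name` for __cdecl. Two conventions that decorate names differently cannot be linked against each other by accident.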
0
101 Feb 16, 2011 at 08:16

And MS’s __fastcall isn’t even compatible with Borland’s __fastcall.

http://en.wikipedia.org/wiki/X86_calling_conventions#fastcall

Microsoft fastcall
Microsoft or GCC __fastcall convention (aka __msfastcall) passes the first two arguments (evaluated left to right) that fit into ECX and EDX. Remaining arguments are pushed onto the stack from right to left.

Borland fastcall
Evaluating arguments from left to right, it passes three arguments via EAX, EDX, ECX. Remaining arguments are pushed onto the stack, also left to right.

It is the default calling convention of Borland Delphi, where it is known as register.
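A toy model of the two fastcall variants quoted above, assuming every argument is DWORD-sized. It is a reading of the descriptions, not compiler output, and the function names are hypothetical:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Microsoft/GCC fastcall: first two arguments in ECX and EDX, rest on the stack.
inline std::vector<std::string> ms_fastcall_locations (std::size_t n)
{
    static const char* const regs[] = { "ECX", "EDX" };
    std::vector<std::string> where;
    for (std::size_t i = 0; i < n; ++i)
        where.push_back(i < 2 ? regs[i] : "stack");
    return where;
}

// Borland fastcall: first three arguments in EAX, EDX and ECX, rest on the stack.
inline std::vector<std::string> borland_fastcall_locations (std::size_t n)
{
    static const char* const regs[] = { "EAX", "EDX", "ECX" };
    std::vector<std::string> where;
    for (std::size_t i = 0; i < n; ++i)
        where.push_back(i < 3 ? regs[i] : "stack");
    return where;
}
```

For a four-argument call the two models already disagree on the first argument (ECX versus EAX), and the stack arguments are pushed in opposite orders, which the model does not even capture.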

I know because if today you use __fastcall in C++ Builder and then use a DLL compiled with Visual C++ with __cdecl, everything is OK. If they were different, a crash would occur.

You are wrong. This will never just work. The arguments are passed on the stack in the wrong order, and a few of them are passed in registers. And no, a crash would not by definition occur - arguments would simply have wrong values. Of course, if one of them happens to be a pointer which you’re dereferencing, then it might crash. I urge you to try it, and post your original code and your results here. Perhaps you’re remembering it wrong? Or there was a “#define __fastcall __cdecl” somewhere in the code or project settings? Or your functions only used zero parameters, or just one float or something.

0
101 Feb 16, 2011 at 15:38

Console Application Source Code:

#include "stdafx.h"

#pragma inline_depth(0);
#pragma inline_recursion(off);

int test1(const int a, const int b, const int c, const int d, const int e)
{
int result = (((a*b) + c) * d) - e;

return result;
}

int __cdecl test2(const int a, const int b, const int c, const int d, const int e)
{
int result = (((a*b) + c) * d) - e;

return result;
}

int __stdcall test3(const int a, const int b, const int c, const int d, const int e)
{
int result = (((a*b) + c) * d) - e;

return result;
}

int __fastcall test4(const int a, const int b, const int c, const int d, const int e)
{
int result = (((a*b) + c) * d) - e;

return result;
}

__declspec(dllexport) int test5(const int a, const int b, const int c, const int d, const int e)
{
int result = (((a*b) + c) * d) - e;

return result;
}

class test6
{
public:
test6() {};
~test6() {};

int exec_test(const int a, const int b, const int c, const int d, const int e)
{
int result = (((a*b) + c) * d) - e;

return result;
}

int __fastcall exec_test2(const int a, const int b, const int c, const int d, const int e)
{
int result = (((a*b) + c) * d) - e;

return result;
}
};

int _tmain(int argc, _TCHAR* argv[])
{
int a;
int b;
int c;
int d;
int e;
test6 ctest6;

// scanf needs the address of each variable it writes to
scanf("%d",&a);
scanf("%d",&b);
scanf("%d",&c);
scanf("%d",&d);
scanf("%d",&e);

int r1,r2,r3,r4,r5,r6, r7;

r1 = test1(a,b,c,d,e);
r2 = test2(a,b,c,d,e);
r3 = test3(a,b,c,d,e);
r4 = test4(a,b,c,d,e);
r5 = test5(a,b,c,d,e);
r6 = ctest6.exec_test(a,b,c,d,e);
r7 = ctest6.exec_test2(a,b,c,d,e);

printf("%d\n",r1);
printf("%d\n",r2);
printf("%d\n",r3);
printf("%d\n",r4);
printf("%d\n",r5);
printf("%d\n",r6);
printf("%d\n",r7);

char ch;
scanf("%c",&ch);

return 0;
}


Release Assembly Output:

; 80   :
; 81   :    int r1,r2,r3,r4,r5,r6, r7;
; 82   :
; 83   :    r1 = test1(a,b,c,d,e);

mov edx, DWORD PTR _e$[ebp]
mov eax, DWORD PTR _d$[ebp]
push    edx
push    eax
push    ebx
push    edi
mov eax, esi
call    ?test1@@YAHHHHHH@Z          ; test1

; 84   :    r2 = test2(a,b,c,d,e);
; 85   :    r3 = test3(a,b,c,d,e);
; 86   :    r4 = test4(a,b,c,d,e);
; 87   :    r5 = test5(a,b,c,d,e);
; 88   :    r6 = ctest6.exec_test(a,b,c,d,e);
; 89   :    r7 = ctest6.exec_test2(a,b,c,d,e);
; 90   :
; 91   :    printf("%d\n",r1);

push    eax
push    OFFSET ??_C@_03PMGGPEJJ@?$CFd?6?$AA@
call    DWORD PTR __imp__printf
mov ecx, DWORD PTR _e$[ebp]
mov edx, DWORD PTR _d$[ebp]
push    ecx
push    edx
push    ebx
push    edi
mov eax, esi
call    ?test2@@YAHHHHHH@Z          ; test2

; 92   :    printf("%d\n",r2);

push    eax
push    OFFSET ??_C@_03PMGGPEJJ@?$CFd?6?$AA@
call    DWORD PTR __imp__printf
mov eax, DWORD PTR _e$[ebp]
mov ecx, DWORD PTR _d$[ebp]
push    eax
push    ecx
push    ebx
push    edi
mov eax, esi
call    ?test3@@YGHHHHHH@Z          ; test3

; 93   :    printf("%d\n",r3);

push    eax
push    OFFSET ??_C@_03PMGGPEJJ@?$CFd?6?$AA@
call    DWORD PTR __imp__printf
mov edx, DWORD PTR _e$[ebp]
mov eax, DWORD PTR _d$[ebp]
push    edx
push    eax
push    ebx
push    edi
mov eax, esi
call    ?test4@@YIHHHHHH@Z          ; test4

; 94   :    printf("%d\n",r4);

push    eax
push    OFFSET ??_C@_03PMGGPEJJ@?$CFd?6?$AA@
call    DWORD PTR __imp__printf
mov ecx, DWORD PTR _e$[ebp]
mov edx, DWORD PTR _d$[ebp]
push    ecx
push    edx
push    ebx
push    edi
push    esi
call    ?test5@@YAHHHHHH@Z          ; test5

; 95   :    printf("%d\n",r5);

push    eax
push    OFFSET ??_C@_03PMGGPEJJ@?$CFd?6?$AA@
call    DWORD PTR __imp__printf
mov eax, DWORD PTR _e$[ebp]
mov ecx, DWORD PTR _d$[ebp]
push    eax
push    ecx
push    ebx
push    edi
mov eax, esi
call    ?exec_test@test6@@QAEHHHHHH@Z       ; test6::exec_test

; 96   :    printf("%d\n",r6);

push    eax
push    OFFSET ??_C@_03PMGGPEJJ@?$CFd?6?$AA@
call    DWORD PTR __imp__printf
mov edx, DWORD PTR _e$[ebp]
mov eax, DWORD PTR _d$[ebp]
push    edx
push    eax
push    ebx
push    edi
mov eax, esi
call    ?exec_test2@test6@@QAIHHHHHH@Z      ; test6::exec_test2


As you can see, __cdecl and __stdcall generate the same code: all params are pushed right to left, so evaluation is left to right (the last push will be the first to be popped).

In this test case the compiler wasn’t able to optimize __fastcall and it was ignored (__fastcall works like the inline keyword).
By the way, __fastcall will try to pass 3 arguments in registers; if it can’t, it will try with 2; otherwise everything goes on the stack.
But if you generate a DLL, every function that is marked as __declspec(dllexport) will use __stdcall = __cdecl; __fastcall will be ignored.
The only difference between Embarcadero C++ Builder and Microsoft Visual C++ is that Embarcadero, for __fastcall, passes 3 arguments in registers or nothing, in which case everything goes on the stack. But this will create a problem only if you create a library and try static linking, which is impossible anyway because by default they use two different object formats, and GCC uses yet another object format.
So using a DLL from another language will usually never cause a problem.
I think that the MSDN documentation has not been updated for calling conventions.
I used Visual C++ 2010.

0
101 Feb 16, 2011 at 15:46

That’s an awful lot of words just to say you were wrong and we were right :).
Also, cdecl is not the same as stdcall. Name mangling is different, and with stdcall the callee is responsible for stack cleanup (using the retn instruction), while with cdecl the caller is responsible.

I think that the MSDN documentation has not been updated for calling conventions.

What you describe matches the MSDN specification perfectly.

My test, clean and simple. Compiled in release without whole program optimization (otherwise, all bets are off)

#include <iostream>
#include <intrin.h>
#include <stdlib.h>

__declspec(noinline) void __cdecl test1 (int a, int b, int c, int d) { std::cout << a << std::endl; }
__declspec(noinline) void __stdcall test2 (int a, int b, int c, int d) { std::cout << a << std::endl; }
__declspec(noinline) void __fastcall test3 (int a, int b, int c, int d) { std::cout << a << std::endl; }

__declspec(noinline, dllexport) void __cdecl test4 (int a, int b, int c, int d) { std::cout << a << std::endl; }
__declspec(noinline, dllexport) void __stdcall test5 (int a, int b, int c, int d) { std::cout << a << std::endl; }
__declspec(noinline, dllexport) void __fastcall test6 (int a, int b, int c, int d) { std::cout << a << std::endl; }

int main()
{
test1(0, 1, 2, 3);
test2(0, 1, 2, 3);
test3(0, 1, 2, 3);
test4(0, 1, 2, 3);
test5(0, 1, 2, 3);
test6(0, 1, 2, 3);
}


Asm:

int main()
{
test1(0, 1, 2, 3);
00381070  push        3
00381072  push        2
00381074  push        1
00381076  push        0
00381078  call        test4 (381000h)
test2(0, 1, 2, 3);
00381080  push        3
00381082  push        2
00381084  push        1
00381086  push        0
00381088  call        test5 (381020h)
test3(0, 1, 2, 3);
0038108D  push        3
0038108F  push        2
00381091  mov         edx,1
00381096  xor         ecx,ecx
00381098  call        test6 (381050h)
test4(0, 1, 2, 3);
0038109D  push        3
0038109F  push        2
003810A1  push        1
003810A3  push        0
003810A5  call        test4 (381000h)
test5(0, 1, 2, 3);
003810AD  push        3
003810AF  push        2
003810B1  push        1
003810B3  push        0
003810B5  call        test5 (381020h)
test6(0, 1, 2, 3);
003810BA  push        3
003810BC  push        2
003810BE  mov         edx,1
003810C3  xor         ecx,ecx
003810C5  call        test6 (381050h)
}
003810CA  xor         eax,eax
003810CC  ret


dllexport makes no difference whatsoever. __fastcall uses two registers.

0
101 Feb 16, 2011 at 16:28

If you put it this way, you are right and I’m wrong.
Sorry for the posts.
I know that the stack cleanup is different, but the name mangling is not always different.
Look at the assembly code or try it yourself.
But I was saying that:

The parameters are always pushed right to left for both __stdcall and __cdecl, so the parameter passing is the same; if you want to believe that they are different, believe what you want.
If you want to believe that 64-bit Linux and Windows will have the same calling convention, believe it, but GCC is following AMD and Microsoft will follow Intel’s advice.
If you want to believe that __fastcall will try to pass 2 and not 3 values in registers when it can, believe it. By the way, everything started with the 4th parameter, so why was the 3rd passed in a register?
If you want to believe that Microsoft is doing something good for you by removing inline assembly in 64 bits, believe it.
And if you want to demonstrate that I was wrong about everything, write some code and post it.