# SSE: Bizarre alignment error

22 replies to this topic

### #1 Reedbeta

DevMaster Staff

• 5340 posts
• Location: Santa Clara, CA

Posted 14 February 2011 - 04:51 AM

I've been throwing together an SSE math library, and when I tried to write my 4x4 matrix constructor, I got this error using VC++ 2008 Express:

error C2719: 'n3': formal parameter with __declspec(align('16')) won't be aligned

The code is:

    struct float4x4
    {
        __m128 m0;
        __m128 m1;
        __m128 m2;
        __m128 m3;

        float4x4 (__m128 n0, __m128 n1, __m128 n2, __m128 n3): m0(n0), m1(n1), m2(n2), m3(n3) {}
        // ...snip...
    };


The thing I can't understand is that the error only shows up on n3, the last row of the matrix. My 3x3 matrix does just the same thing and it works fine:

    struct float3x3
    {
        __m128 m0;
        __m128 m1;
        __m128 m2;

        float3x3 (__m128 n0, __m128 n1, __m128 n2): m0(n0), m1(n1), m2(n2) {}
        // ...snip...
    };


So...what's the deal? The compiler is smart enough to align 3 parameters properly, but can't handle 4?

I'm guessing I'm just going to have to declare all these things to take const-references instead of __m128 by value (a bit of Googling has suggested that __m128 by value isn't considered quite safe in general, although that seems a bit silly to me). Hopefully the references will get optimized away and not *actually* forced into memory all the time...
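For reference, a minimal sketch of the const-reference workaround described above. It assumes an x86 compiler that provides `<xmmintrin.h>`; the helpers `first_lane` and `make_test_matrix` are hypothetical names added only for illustration, not part of the original library.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_set1_ps, _mm_cvtss_f32

struct float4x4
{
    __m128 m0;
    __m128 m1;
    __m128 m2;
    __m128 m3;

    // Every row is taken by const reference, so no __m128 is passed by value
    // and no aligned formal parameter has to be placed on the stack.
    float4x4(const __m128& n0, const __m128& n1,
             const __m128& n2, const __m128& n3)
        : m0(n0), m1(n1), m2(n2), m3(n3) {}
};

// Hypothetical helper: returns the lowest lane of a row.
inline float first_lane(const __m128& row)
{
    return _mm_cvtss_f32(row);
}

// Hypothetical helper: builds a matrix whose row i is (i, i, i, i).
inline float4x4 make_test_matrix()
{
    return float4x4(_mm_set1_ps(0.0f), _mm_set1_ps(1.0f),
                    _mm_set1_ps(2.0f), _mm_set1_ps(3.0f));
}
```

Whether the references are optimized away depends on inlining at the call site, so this is a sketch of the signature change rather than a statement about the generated code.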
reedbeta.com - developer blog, OpenGL demos, and other projects

### #2 Kenneth Gorking

Senior Member

• Members
• 939 posts

Posted 14 February 2011 - 06:23 AM

Well, like MSDN says: "The align __declspec modifier is not permitted on function parameters", which kind of makes sense, when you think about it.
"Stupid bug! You go squish now!!" - Homer Simpson

### #3 .oisyn

DevMaster Staff

• Moderators
• 1842 posts

Posted 14 February 2011 - 07:11 AM

No, it does not make sense at all, when you think about it. Also, it makes even less sense that it's allowed for the first three parameters (well, for sse datatypes at least), but not for the rest.

@Reedbeta: in my experience, the compiler will not force them to memory. If the function can be inlined and the contents are already in a register at the call site, it will keep them in that register.
-
Currently working on: the 3D engine for Tomb Raider.

### #4 Kenneth Gorking

Senior Member

• Members
• 939 posts

Posted 14 February 2011 - 07:40 PM

It makes perfect sense. If you were allowed to align formal parameters, calling conventions would go out the window.

The reason it might be able to handle 3, and not 4, might be because it is able to keep the first 3 __m128s in registers, but when introducing more, it runs out of space, which forces it to use the stack, and hence the alignment error.

You can probably circumvent this, by passing them as references, instead of by value.

### #5 .oisyn

DevMaster Staff

• Moderators
• 1842 posts

Posted 14 February 2011 - 10:34 PM

Kenneth Gorking said:

It makes perfect sense. If you were allowed to align formal parameters, calling conventions would go out the window.
Your argument is invalid. The fact is that current calling conventions don't deal with explicitly aligned parameters. Adding rules for aligned types would be perfectly compatible. Since you weren't allowed to declare such functions before, no code is going to break by allowing them now.

Kenneth Gorking said:

The reason it might be able to handle 3, and not 4, might be because it is able to keep the first 3 __m128s in registers
Ok, wait, first you argue about calling conventions, and then you talk about having them passed in registers? How is that compatible with cdecl or stdcall?

### #6 Kenneth Gorking

Senior Member

• Members
• 939 posts

Posted 15 February 2011 - 09:28 AM

.oisyn said:

Fact is that current calling conventions don't deal with explicitly aligned parameters
That's what I was getting at. Any padding added by alignment, would result in the function accessing bad data.

.oisyn said:

It would be perfectly compatible by adding rules for aligned types. Since you weren't allowed to declare such functions before, no code is going to break by allowing them now.
Trying to patch up x86 at this time, seems futile. It would probably also be a nightmare inducing endeavor. Simply switching to x64, would make all this go away.

.oisyn said:

Ok, wait, first you argue about calling conventions, and then you talk about having them passed in registers? How is that compatible with cdecl or stdcall?
First, I was speaking generally, then I was addressing the problem at hand. Also, __m128 variables are mapped directly to XMM registers, so your point is moot.

Anyways, after some digging around, I found that the first three __m128 are indeed passed in registers, and the rest go on the stack, hence the compiler error.

### #7 .oisyn

DevMaster Staff

• Moderators
• 1842 posts

Posted 15 February 2011 - 09:53 AM

Kenneth Gorking said:

That's what I was getting at. Any padding added by alignment, would result in the function accessing bad data.

Trying to patch up x86 at this time, seems futile. It would probably also be a nightmare inducing endeavor.
You don't have to patch anything. As said, you currently can't use aligned parameters altogether. This means that current function declarations don't contain any aligned types, which implies that you will *NEVER* access bad data by introducing padding. Simply because padding will not be required for those functions. We're only talking about functions with __declspec(align(x)) type parameters, and they don't exist yet.

Kenneth Gorking said:

First, I was speaking generally, then I was addressing the problem at hand. Also, __m128 variables are mapped directly to XMM registers, so your point is moot.
An __m128 variable is just as mapped to a register as an int is.

### #8 Nick

Senior Member

• Members
• 1227 posts

Posted 15 February 2011 - 01:44 PM

.oisyn said:

An __m128 variable is just as mapped to a register as an int is.
I could be very mistaken, but I was under the impression that the Visual C++ compiler treats them as special. It's a proprietary data format and it's not well defined. If something works, great, if it doesn't, oh well.

### #9 .oisyn

DevMaster Staff

• Moderators
• 1842 posts

Posted 15 February 2011 - 01:55 PM

It's only special in the sense that it's essentially a user defined type, yet the compiler understands that it can put them in SSE registers as if it were a built-in type. Aside from the fact that it's a struct, it isn't much different from an int.
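To make the comparison with int concrete, here is a small sketch that queries the size and alignment of `__m128`. It assumes a C++11 compiler (`alignof`; VC++ 2008 would need `__alignof` instead) and an x86 target; the function and struct names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // __m128

// Size and required alignment of the SSE vector type, in bytes.
inline std::size_t m128_size()      { return sizeof(__m128); }
inline std::size_t m128_alignment() { return alignof(__m128); }

// A hypothetical aggregate of two rows: it inherits the 16-byte
// alignment requirement from its members.
struct row_pair
{
    __m128 a;
    __m128 b;
};

inline std::size_t row_pair_alignment() { return alignof(row_pair); }
```

The 16-byte alignment requirement is the part that an int does not have, and it is what the compiler error in the first post complains about.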

### #10 Nick

Senior Member

• Members
• 1227 posts

Posted 15 February 2011 - 02:20 PM

.oisyn said:

Aside from the fact that it's a struct, it isn't much different from an int.
Even as a struct it's very special. I believe the debugger is capable of showing the symbolic values even when it's really stored in a register.

### #11 .oisyn

DevMaster Staff

• Moderators
• 1842 posts

Posted 15 February 2011 - 03:14 PM

Doesn't that also apply to int?

### #12 Sandevil

New Member

• Members
• 7 posts

Posted 15 February 2011 - 05:45 PM

Pass the parameters by reference, like this:

    float4x4 (const __m128& n0, const __m128& n1, const __m128& n2, const __m128& n3)

This should fix your problem. Function parameters can't be aligned by the compiler, and there are reasons for that. You have to understand how the call is translated into machine code. First, registers are limited, and there are two ways to pass parameters to a function:

• __stdcall (standard calling convention): on the stack;
• __fastcall (fast calling convention): the first n values in registers, the rest on the stack.

Microsoft and GCC now use __fastcall by default, whereas Embarcadero C++ Builder uses __stdcall for C++ code. If you compile for a 32-bit platform, only the first 3 are passed in registers; if you compile for 64 bits, only the first 6 on Windows and 8 on Linux. All the parameters that are passed on the stack will not be aligned. Can you guess why?

Stack pass:

    push value1
    push value2
    call function

    function:
        push ebp          ; save the caller's frame pointer
        mov  ebp, esp     ; set up this function's frame
        sub  esp, 8       ; reserve space for local variables

        ...

        mov  esp, ebp     ; release the locals
        pop  ebp          ; restore the caller's frame pointer
        ret  0

So a function is compiled to save the stack pointer and then allocate its local variables on the stack. If you request alignment for a local variable, it gets padded and the stack becomes a mess, but it is restored at function exit.

If you want to pass aligned data on the stack, the compiler must add padding before calling the function and undo it on exit by correcting the stack pointer. That means it would have to create n variations of the compiled function, depending on the stack padding combination: for 16-byte alignment, 16 combinations. The compiler won't do this for you. Passing by reference is the only solution that doesn't rely on specific compiler or platform behaviour, because only the address is passed on the stack, and the function then fetches the data from where it was previously aligned, either globally or in the stack frame of the calling function.

Hope the explanation is clear enough...
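The pass-by-reference argument can be checked with a short sketch: the caller's local is aligned in the caller's own frame, and the callee only receives its address. This assumes an x86 compiler with `<xmmintrin.h>`; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // __m128, _mm_set1_ps

// Hypothetical helper: true if the referenced vector sits on a 16-byte boundary.
// Only the address travels to this function, never the 16 bytes themselves.
inline bool is_16_byte_aligned(const __m128& v)
{
    return reinterpret_cast<std::uintptr_t>(&v) % 16 == 0;
}

// Hypothetical caller: the local is aligned by the compiler inside this
// function's frame, then handed over by reference.
inline bool local_is_aligned()
{
    __m128 local = _mm_set1_ps(1.0f);
    return is_16_byte_aligned(local);
}
```

The check only confirms that the object being referenced is aligned; it says nothing about how a particular calling convention would place a by-value `__m128`.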

### #13 Kenneth Gorking

Senior Member

• Members
• 939 posts

Posted 15 February 2011 - 07:44 PM

.oisyn said:

An __m128 variable is just as mapped to a register as an int is.
Not when passing formals.

Sandevil said:

pass parameters by reference like this:

float4x4 (const __m128& n0, const __m128& n1, const __m128& n2, const __m128& n3)
Only parameters 4 and up, need to be passed by reference. The first three will go in the registers, and save a pointer indirection.
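A sketch of that mixed signature, with the first three rows by value and only the fourth by const reference. The struct name `float4x4_mixed` is hypothetical, and the claim that the first three land in registers is specific to the 32-bit VC++ behaviour discussed in this thread; other compilers and targets may place them differently.

```cpp
#include <cassert>
#include <xmmintrin.h>  // __m128, _mm_set1_ps, _mm_cvtss_f32

struct float4x4_mixed
{
    __m128 m0;
    __m128 m1;
    __m128 m2;
    __m128 m3;

    // n0..n2 by value (assumed to be passed in XMM registers by 32-bit VC++),
    // n3 by const reference so it never needs an aligned stack slot.
    float4x4_mixed(__m128 n0, __m128 n1, __m128 n2, const __m128& n3)
        : m0(n0), m1(n1), m2(n2), m3(n3) {}
};
```

A call such as `float4x4_mixed m(r0, r1, r2, r3);` looks the same at the call site as the all-by-value version.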

Sandevil said:

Now Microsoft, GCC are using by default __fastcall
Actually, they use __cdecl.

Sandevil said:

If you compile for 64 bits, only the first 6 on Windows and 8 on Linux. All the parameters that are passed on the stack will not be aligned.
The first 4 parameters are passed on the stack, for both systems, and everything on the stack is aligned on a 16-byte boundary. That is also why this problem won't be an issue on x64 systems.

### #14 Sandevil

New Member

• Members
• 7 posts

Posted 15 February 2011 - 08:30 PM

__cdecl comes from the former Borland __fastcall, in which the first 3 parameters are passed in registers. By the way, nowadays they produce the same code.
Windows and Linux do not follow the same calling convention on 64-bit systems, but I must correct myself: it is 4 for Windows and 6 for Linux.

Look here:

http://blog.csdn.net...20/5827661.aspx

Also, the first ones are passed in registers and the rest on the stack.
Saving the pointer indirection is a good idea, but as I said, it will behave differently on different platforms.

### #15 Sandevil

New Member

• Members
• 7 posts

Posted 15 February 2011 - 08:33 PM

I forgot: for integers it is 4 vs 6; for float SIMD or integer SIMD (SSE4) it is 4 on Windows and 8 on Linux.

### #16 .oisyn

DevMaster Staff

• Moderators
• 1842 posts

Posted 15 February 2011 - 09:46 PM

Kenneth Gorking said:

Not when passing formals.
Exactly, the __m128 type has a set of rules in calling conventions just like any other built-in type.

Sandevil said:

__cdecl comes from the former Borland __fastcall, in which the first 3 parameters are passed in registers. By the way, nowadays they produce the same code.
Actually, __cdecl is the x86 way of passing arguments by pushing them on the stack from right to left. Values are returned in their respective register(s), and the caller is responsible for stack cleanup.

Sandevil said:

Look here: http://blog.csdn.net...20/5827661.aspx
Indeed, look there. You will read that all the cdecl parameters are passed on the stack. Perhaps you are confused by the fact that eax, ecx and edx are free for the function to use (i.e., their state need not be restored).
Here's another source: http://en.wikipedia....nventions#cdecl

Quote

The cdecl calling convention is used by many C systems for the x86 architecture[1]. In cdecl, function parameters are pushed on the stack in a right-to-left order. Function return values are returned in the EAX register (except for floating point values, which are returned in the x87 register ST0). Registers EAX, ECX, and EDX are available for use in the function.


### #17 Sandevil

New Member

• Members
• 7 posts

Posted 15 February 2011 - 10:50 PM

No, I'm not confused. Start from the fact that the first 3 do not give errors because they are passed in registers.
__cdecl and __fastcall are nowadays exactly the same thing. In fact, __fastcall was introduced by Borland, which passes the first 3 in registers and all the rest on the stack from left to right. Microsoft, in its war against Borland compilers, used a calling convention with 2 in registers and all the rest on the stack, but right to left (the last parameter is pushed first). Borland was faster, so Microsoft moved to 3 in registers and the rest from right to left on the stack. Microsoft won, GCC adopted the same, and it is now referred to as __cdecl, but Borland continues to call it __fastcall, and today Microsoft compilers treat __fastcall the same as __cdecl.
So practically they are now different names for the same thing.
I know because if you use __fastcall in C++ Builder today and then use a DLL compiled with Visual C++ with __cdecl, everything is OK. If they were different, a crash would occur. You probably don't remember the war between Microsoft, Borland and Sybase. I'm not here to start a war, but trust me, I know what I'm saying: __fastcall and __cdecl are today the same thing, like __property (C++ Builder) and __declspec(property) (Microsoft).
In the link you posted, look at Microsoft fastcall and Borland fastcall and you will start to understand the compiler war of the past.
As my history teacher used to say, "the winner of a war writes the history".

### #18 Sandevil

New Member

• Members
• 7 posts

Posted 16 February 2011 - 12:40 AM

By the way, this confusion comes from the fact that different compilers use the same identifier for different meanings, or different identifiers for the same meaning, and most of them are not standard.
A standard for C++ code does not exist, but the de facto standard is to pass the first 'n' parameters in registers, which obviously vary from CPU to CPU (x86 is not the same as POWER7, PowerPC or an ARM CPU).
The standard calling convention for C code (C89, C99) forces compilers to pass all the parameters on the stack.
So, to get the C behaviour when you want to use C code in C++, you must declare it this way:

    extern "C" {
    result functionName(params);
    };

With the Microsoft compiler, for a non-member function you can declare a C function as WINAPI, __cdecl (C declaration) or __stdcall (C standard call), avoiding the extern "C" { ... }. WINAPI is __cdecl because the OS is written in C and not in C++, and all the calls to a Windows API must pass all the parameters on the stack to respect the standard.
Other compilers use the standard, so for the MinGW compiler WINAPI is typically declared as an empty macro and all the OS APIs are included in a header wrapped in extern "C" { } for a C++ compiler, or the compiler receives the switch to compile in C mode.
For the OS API, just include the header and the compiler will do the rest.
I usually try not to use non-standard specifiers if possible, so I generally use only 'inline'. For everything else I write a macro like:

    #define FORCE_INLINE __forceinline
    #define DLL_IMPORT __declspec(dllimport)

and so on. This way, porting to another platform will be easy.
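A possible shape for such a portability header, as a sketch only. The GCC/Clang branch uses `__attribute__((always_inline))`, and leaving `DLL_IMPORT` empty outside MSVC is an assumption made for this example; `square` is a hypothetical function showing the macro in use.

```cpp
#include <cassert>

// Map compiler-specific keywords to neutral macro names.
#if defined(_MSC_VER)
    #define FORCE_INLINE __forceinline
    #define DLL_IMPORT   __declspec(dllimport)
#elif defined(__GNUC__)  // also defined by Clang
    #define FORCE_INLINE inline __attribute__((always_inline))
    #define DLL_IMPORT   // assumption: nothing needed for this sketch
#else
    #define FORCE_INLINE inline
    #define DLL_IMPORT
#endif

// Hypothetical example: code uses only the neutral macro name.
FORCE_INLINE int square(int x)
{
    return x * x;
}
```

Code written against `FORCE_INLINE` then only needs the header adjusted when a new compiler is added.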

What do you think about Intel forcing Microsoft to adopt a C++ calling convention that will benefit only their processors, and also excluding us from using inline assembly on the 64-bit Windows platform?
Knowing that it will penalize AMD, and that Linux has chosen the AMD-proposed calling convention. So AMD will have a chance to win for free on Linux web servers, and Intel wins (by paying Microsoft) on desktop/notebook platforms.

### #19 Kenneth Gorking

Senior Member

• Members
• 939 posts

Posted 16 February 2011 - 07:20 AM

Sandevil said:

No, I'm not confused. Start from the fact that the first 3 do not give errors because they are passed in registers.
You are confused. As has already been mentioned in this thread, this is due to special rules for the __m128 datatype.

Sandevil said:

__cdecl and __fastcall are nowadays exactly the same thing
I can assure you, they are not. From MSDN:
__cdecl
• Argument-passing order: Right to left
• Stack-maintenance responsibility: Calling function pops the arguments from the stack
• Name-decoration convention: Underscore character (_) is prefixed to names, except when exporting __cdecl functions that use C linkage.
__fastcall
• Argument-passing order: The first two DWORD or smaller arguments are passed in ECX and EDX registers; all other arguments are passed right to left.
• Stack-maintenance responsibility: Called function pops the arguments from the stack.
• Name-decoration convention: At sign (@) is prefixed to names; an at sign followed by the number of bytes (in decimal) in the parameter list is suffixed to names.
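The two name-decoration rules quoted above can be written out as a toy model. This is only an illustration of the quoted text for 32-bit x86 names, not the compiler's actual decoration logic; both function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// __cdecl rule as quoted: an underscore is prefixed to the name.
inline std::string decorate_cdecl(const std::string& name)
{
    return "_" + name;
}

// __fastcall rule as quoted: '@' is prefixed, then '@' and the size of the
// parameter list in bytes (decimal) are appended.
inline std::string decorate_fastcall(const std::string& name, unsigned param_bytes)
{
    return "@" + name + "@" + std::to_string(param_bytes);
}
```

Under this model a function `add` taking two 4-byte ints would be `_add` as __cdecl and `@add@8` as __fastcall, so the two conventions are distinguishable by name alone.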


### #20 .oisyn

DevMaster Staff

• Moderators
• 1842 posts

Posted 16 February 2011 - 08:16 AM

And MS's __fastcall isn't even compatible with Borland's __fastcall.

http://en.wikipedia....ntions#fastcall

Quote

Microsoft fastcall
Microsoft or GCC __fastcall convention (aka __msfastcall) passes the first two arguments (evaluated left to right) that fit into ECX and EDX. Remaining arguments are pushed onto the stack from right to left.

Borland fastcall
Evaluating arguments from left to right, it passes three arguments via EAX, EDX, ECX. Remaining arguments are pushed onto the stack, also left to right.

It is the default calling convention of Borland Delphi, where it is known as register.

Sandevil said:

I know because if you use __fastcall in C++ Builder today and then use a DLL compiled with Visual C++ with __cdecl, everything is OK. If they were different, a crash would occur.
You are wrong. This will never just work. The arguments are passed on the stack in the wrong order, and a few of them are passed in registers. And no, a crash would not by definition occur - arguments would simply have wrong values. Of course, if one of them happens to be a pointer which you're dereferencing, then it might crash. I urge you to try it, and post your original code and your results here. Perhaps you're remembering it wrong? Or there was a "#define __fastcall __cdecl" somewhere in the code or project settings? Or your functions only used zero parameters, or just one float or something.
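To see why the two fastcall variants cannot be mixed, here is a toy model of the placement rules from the Wikipedia quote above. It assumes every argument is a DWORD-sized integer and only records which register or push slot each argument index gets; the type and function names are hypothetical, and no real calls are made.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy description of where arguments end up.
struct Placement
{
    std::vector<std::string> reg;     // reg[i]: register of argument i, "" if on the stack
    std::vector<std::size_t> pushes;  // argument indices in the order they are pushed
};

// Microsoft __fastcall as quoted: first two in ECX, EDX; the rest pushed right to left.
inline Placement microsoft_fastcall(std::size_t argc)
{
    static const char* const regs[] = { "ECX", "EDX" };
    Placement p;
    p.reg.assign(argc, std::string());
    for (std::size_t i = 0; i < argc && i < 2; ++i) p.reg[i] = regs[i];
    for (std::size_t i = argc; i > 2; --i) p.pushes.push_back(i - 1);
    return p;
}

// Borland __fastcall as quoted: first three in EAX, EDX, ECX; the rest pushed left to right.
inline Placement borland_fastcall(std::size_t argc)
{
    static const char* const regs[] = { "EAX", "EDX", "ECX" };
    Placement p;
    p.reg.assign(argc, std::string());
    for (std::size_t i = 0; i < argc && i < 3; ++i) p.reg[i] = regs[i];
    for (std::size_t i = 3; i < argc; ++i) p.pushes.push_back(i);
    return p;
}
```

For five arguments the model puts argument 0 in ECX under the Microsoft rule but in EAX under the Borland rule, and the stack pushes run in opposite orders, which matches the point that a mismatched call would read wrong values rather than necessarily crash.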