Floating point question

Mihail121 102 Feb 06, 2010 at 12:05

Hi, does anyone know why this is the case:

// Given a single precision float flt (23 bit mantissa) with intV and fltV being the integer and the fraction part respectively, so: flt = (float) intV + fltV

float flt = ...

// Do this

flt = flt + (float) (1 << 23) 

// Now somehow the mantissa of flt has become exactly the bitwise representation of intV, so:

(int) flt & (mantissa_mask = (1<<23)-1)

// gives exactly intV, the truncated flt value.

I would be grateful if somebody could provide some info on how this great magic works!

edit: No, sorry, the code seems to round to nearest rather than truncate.

3 Replies

Reedbeta 167 Feb 06, 2010 at 19:08

When you add two floating-point numbers, conceptually speaking they have to be shifted to line up their radix points with each other before you can add. The number (1 << 23) is a 1 followed by 23 bits of zero, so as a float its radix point lies at the very end of its mantissa, and its mantissa is all zeros (the 1 being implied). When you add flt to it, flt is first shifted so that its radix point lies at the same place. Thus the fractional part of flt falls off the end of the mantissa and the integer part is all that is left. Do the add, and you wind up with the integer part of flt in the mantissa field, at least up to rounding.
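A minimal sketch of the effect described above, assuming IEEE 754 single precision, the default round-to-nearest-even mode, and an input in the range 0 <= x < 2^23. The helper name mantissa_after_add is made up for illustration, and memcpy is used to look at the raw bits of the float (a plain (int) cast would convert the value instead of reinterpreting it).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original post): returns the 23-bit
   mantissa field of x + 2^23.
   Assumes IEEE 754 floats, round-to-nearest-even, and 0 <= x < 2^23. */
static uint32_t mantissa_after_add(float x)
{
    assert(x >= 0.0f && x < 8388608.0f);

    /* 8388608.0f is 2^23. In [2^23, 2^24) adjacent floats are exactly 1
       apart, so the fractional part of x is rounded away by the add.
       volatile forces the sum to be stored as a real 32-bit float. */
    volatile float sum = x + 8388608.0f;
    float stored = sum;

    /* Copy the bit pattern into an integer without converting the value. */
    uint32_t bits;
    memcpy(&bits, &stored, sizeof bits);

    /* Keep only the low 23 bits: the stored mantissa without the implied 1. */
    return bits & 0x7FFFFFu;
}
```

Under those assumptions, mantissa_after_add(5.75f) gives 6 and mantissa_after_add(1.25f) gives 1, while mantissa_after_add(2.5f) gives 2 because an exact tie goes to the even neighbour.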

If you want truncation you may be able to switch to round-toward-zero mode (google for floating point rounding modes). Or, if your numbers are all positive, an easy trick is to subtract 0.5 before you round it.
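A rough sketch of the subtract-0.5 variant, again assuming IEEE 754 single precision in round-to-nearest-even mode. The helper name floor_by_half_bias is made up for illustration. Two caveats apply under these assumptions: an input that is already an odd integer lands exactly on a tie after the subtraction and comes out one too low, and inputs below 0.5 are excluded here because the sum can drop below 2^23, where the mask no longer isolates the integer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the thread): subtract 0.5, then apply the
   add-2^23 trick and read back the mantissa field.
   Assumes IEEE 754 floats, round-to-nearest-even, and 0.5 <= x < 2^23. */
static uint32_t floor_by_half_bias(float x)
{
    assert(x >= 0.5f && x < 8388608.0f);

    /* Shifting the value down by 0.5 turns round-to-nearest into an
       approximate round-down for non-integer inputs. */
    volatile float biased = (x - 0.5f) + 8388608.0f;   /* 8388608.0f is 2^23 */
    float stored = biased;

    /* Reinterpret the float's bits as an integer and mask the mantissa. */
    uint32_t bits;
    memcpy(&bits, &stored, sizeof bits);
    return bits & 0x7FFFFFu;
}
```

With these assumptions, floor_by_half_bias(2.75f) and floor_by_half_bias(7.9f) give 2 and 7, but floor_by_half_bias(3.0f) gives 2 rather than 3, which is the tie case mentioned above.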

UnrealSolo 101 May 29, 2010 at 13:30

Another way of thinking about it is to imagine the numbers represented as a bitfield/integer that has enough bits to store them in fixed point, but where you only have a window of bits to operate with. You then add, multiply and divide as with a normal integer, and the result window moves so that the most significant 1 occupies the highest bit in the window. I'm not sure if this makes it easier, but it helps me when doing fixed-point operations.

_oisyn 101 May 30, 2010 at 22:36

Note that a float doesn’t store its most significant 1, as it is always implied. And obviously, the trick will only work for positive numbers between 0 and 2^23 (if you want the next 11 bits as well, you could apply the same trick by adding 2^32, then shifting the resulting int representation by 23 bits and adding them to the lower 23 bits already calculated).