Hi, does anyone know why this is the case:
// Given a single-precision float flt (23-bit mantissa) with intV and fltV being the integer and fractional parts respectively, so: flt = (float) intV + fltV
float flt = ...;
// Do this
flt = flt + (float) (1 << 23);
// Now somehow the mantissa of flt has become exactly the bitwise representation of intV, so:
int bits;
memcpy(&bits, &flt, sizeof bits); // reinterpret the bits; a plain (int) cast would convert the value instead
bits & (mantissa_mask = (1 << 23) - 1)
// gives exactly intV, the truncated flt value.
I would be grateful if somebody could provide some info on how this
great magic works!
edit: No, sorry, the code seems to perform a nearest rounding.
When you add two floating-point numbers, conceptually speaking they have
to be shifted to line up their radix points with each other before you
can add. The number (1 << 23) is a 1 followed by 23 bits of zero, so
as a float its radix point lies at the very end of its mantissa, and its
mantissa is all zeros (the 1 being implied). When you add flt to it, flt
is first shifted so that its radix point lies at the same place. Thus
the fractional part of flt falls off the end of the mantissa and the
integer part is all that is left. Do the add, and you wind up with the
integer part of flt in the mantissa field, at least up to rounding.
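To make the alignment argument concrete, here is a minimal sketch in C. The function name round_via_mantissa is made up for illustration, and the sketch assumes IEEE 754 single precision and the default round-to-nearest-even mode. The volatile store is only there so the sum is really rounded to a float on compilers that keep intermediates in wider registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the thread): round a non-negative float
   to the nearest integer by adding 2^23 and reading the mantissa field. */
uint32_t round_via_mantissa(float x)
{
    /* 8388608 = 2^23. At 8388607.5 and above the sum rounds up to 2^24,
       the exponent changes, and the trick stops working. */
    assert(x >= 0.0f && x < 8388607.5f);

    /* Force the sum to be stored as a 32-bit float. */
    volatile float sum = x + 8388608.0f;
    float stored = sum;

    /* Reinterpret the float's bits as an integer without aliasing issues. */
    uint32_t bits;
    memcpy(&bits, &stored, sizeof bits);

    /* Keep the 23 mantissa bits; the implied leading 1 and the exponent
       are masked away. */
    return bits & ((1u << 23) - 1u);
}
```

For example, round_via_mantissa(5.25f) yields 5 and round_via_mantissa(5.75f) yields 6. Exact halves go to the even neighbour, so 2.5f gives 2 while 3.5f gives 4.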
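If you want truncation you may be able to switch to round-toward-zero
mode (google for floating point rounding modes). Or, if your numbers are
all positive, an easy trick is to subtract 0.5 before you round it. Be
aware that under the default round-to-nearest-even mode this sends exact
odd integers one too low (5.0 becomes 4.5, which rounds to 4).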
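If changing the rounding mode is not an option, one alternative is to round first and then correct any overshoot. This is a different technique from the two suggested above, sketched here under the same assumptions (IEEE 754 single precision, default rounding mode); the name trunc_via_mantissa is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the thread): truncate a non-negative
   float toward zero using the add-2^23 trick plus a correction step. */
uint32_t trunc_via_mantissa(float x)
{
    /* Restrict the input so the sum stays below 2^24. */
    assert(x >= 0.0f && x <= 8388607.0f);

    /* Same trick as before: add 2^23 and read the mantissa field. */
    volatile float sum = x + 8388608.0f;
    float stored = sum;
    uint32_t bits;
    memcpy(&bits, &stored, sizeof bits);
    uint32_t nearest = bits & ((1u << 23) - 1u);

    /* Round-to-nearest overshot exactly when the result exceeds x;
       stepping back by one then gives the truncated value. The cast is
       exact because nearest is below 2^24. */
    if ((float)nearest > x)
        nearest -= 1u;
    return nearest;
}
```

With this, 5.75f gives 5 and 5.0f stays 5, at the cost of one extra comparison.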
Another way of thinking about it is to imagine the numbers represented
as a bitfield/integer that has enough bits to store them in fixed
point, but where you only have a window of bits to operate with. You
then add, multiply, and divide as with normal integers, and the result
window moves so that the most significant 1 occupies the highest bit in
the window. I'm not sure if this makes it easier, but it helps me when
doing fixed point.
Note that a float doesn't store its most significant 1, as it is always
implied. And obviously, the trick will only work for positive numbers
between 0 and 2^23 (if you want the next 11 bits as well, you could apply
the same trick by adding 2^32, then shifting the resulting int
representation by 23 bits and add them to the lower 23 bits already