
Floating point question


3 replies to this topic

#1 Mihail121

    Senior Member

  • Members
  • 1059 posts

Posted 06 February 2010 - 12:05 PM

Hi, does anyone know why this is the case:


// Given a single-precision float flt (23-bit mantissa), with intV and fltV being its integer and fractional parts respectively, so: flt = (float) intV + fltV

float flt = ...

// Do this

flt = flt + (float) (1 << 23) 

// Now somehow the mantissa of flt has become exactly the bitwise representation of intV, so:

(int) flt & (mantissa_mask = (1<<23)-1)

// gives exactly intV, the truncated flt value.


I would be grateful if somebody could provide some info on how this great magic works!

edit: No, sorry, the code seems to round to the nearest integer rather than truncate.

#2 Reedbeta

    DevMaster Staff

  • Administrators
  • 5309 posts
  • LocationSanta Clara, CA

Posted 06 February 2010 - 07:08 PM

When you add two floating-point numbers, conceptually speaking they have to be shifted to line up their radix points with each other before you can add. The number (1 << 23) is a 1 followed by 23 bits of zero, so as a float its radix point lies at the very end of its mantissa, and its mantissa is all zeros (the 1 being implied). When you add flt to it, flt is first shifted so that its radix point lies at the same place. Thus the fractional part of flt falls off the end of the mantissa and the integer part is all that is left. Do the add, and you wind up with the integer part of flt in the mantissa field, at least up to rounding.
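
The explanation above can be sketched in C. This is only an illustration under stated assumptions: float is IEEE 754 single precision, the default round-to-nearest mode is active, and the input lies between 0 and 2^23 - 1. The helper name round_via_magic is made up for this example and does not come from the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns x rounded to the nearest integer, read
   straight out of the mantissa field after adding 2^23.
   Assumes IEEE 754 binary32 and the default round-to-nearest mode. */
static uint32_t round_via_magic(float x)
{
    /* Inputs closer than 1 to 2^23 could round the sum up to 2^24,
       which changes the exponent and clears the mantissa. */
    assert(x >= 0.0f && x <= 8388607.0f);

    /* 8388608.0f is 2^23. The sum lies in [2^23, 2^24), where adjacent
       floats are exactly 1 apart, so the fraction of x is rounded away. */
    float sum = x + 8388608.0f;

    /* Copy the bit pattern out; memcpy avoids pointer-cast aliasing issues. */
    uint32_t bits;
    memcpy(&bits, &sum, sizeof bits);

    /* Keep only the 23 stored mantissa bits. */
    return bits & ((1u << 23) - 1u);
}
```

For example, round_via_magic(5.25f) yields 5 and round_via_magic(5.75f) yields 6. Exact halves go to the even neighbour, so 2.5f gives 2 while 3.5f gives 4.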

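If you want truncation you may be able to switch to round-toward-zero mode (google for floating-point rounding modes). Or, if your numbers are all positive, an easy trick is to subtract 0.5 before you round them, though a value that lands exactly on a tie after the subtraction (a whole number, for instance) can still come out one too low under the default round-to-nearest-even mode.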
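
A rough sketch of the subtract-0.5 variant follows, assuming IEEE 754 single precision and the default round-to-nearest-even mode. The name floor_via_bias is hypothetical. The rounding-mode route would instead go through fesetround(FE_TOWARDZERO) from <fenv.h>, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: approximates truncation of a positive x by
   subtracting 0.5 before the add-2^23 rounding trick.
   Assumes IEEE 754 binary32 and round-to-nearest-even. */
static uint32_t floor_via_bias(float x)
{
    /* Below 0.5 the biased value goes negative and the sum can drop
       under 2^23, so this sketch simply rejects such inputs. */
    assert(x >= 0.5f && x < 8388608.0f);

    float sum = (x - 0.5f) + 8388608.0f;   /* 8388608.0f is 2^23 */

    uint32_t bits;
    memcpy(&bits, &sum, sizeof bits);      /* reinterpret the bit pattern */
    return bits & ((1u << 23) - 1u);       /* stored mantissa bits only */
}
```

With this sketch 3.25f and 3.75f both give 3, and 4.0f gives 4. An odd whole number such as 3.0f gives 2, because 3.0 - 0.5 = 2.5 is an exact tie that rounds to the even neighbour, so the trick is only safe when such ties cannot occur or do not matter.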
reedbeta.com - developer blog, OpenGL demos, and other projects

#3 UnrealSolo

    Member

  • Members
  • 30 posts

Posted 29 May 2010 - 01:30 PM

Another way of thinking about it is to imagine the numbers represented as a bitfield/integer that has enough bits to store them in fixed point, but where you only have a window of bits to operate on. You then add, multiply, and divide as with normal integers, and the result window moves so that the most significant 1 occupies the highest bit in the window. I'm not sure if this makes it easier, but it helps me when doing fixed-point operations.

#4 .oisyn

    DevMaster Staff

  • Moderators
  • 1842 posts

Posted 30 May 2010 - 10:36 PM

Note that a float doesn't store its most significant 1, as it is always implied. And obviously, the trick will only work for positive numbers between 0 and 2^23 (if you want the next 11 bits as well, you could apply the same trick by adding 2^32, then shifting the resulting int representation by 23 bits and adding them to the lower 23 bits already calculated).
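
To see the implied leading 1 in practice, the fields of a float can be pulled apart as below. This is a small sketch assuming IEEE 754 single precision (1 sign bit, 8 exponent bits with a bias of 127, 23 stored mantissa bits). The helper names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers; they assume float is IEEE 754 binary32. */
static uint32_t float_bits(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);   /* raw bit pattern of x */
    return bits;
}

/* Biased exponent: 8 bits above the mantissa (bias 127). */
static uint32_t exponent_field(float x)
{
    return (float_bits(x) >> 23) & 0xFFu;
}

/* Stored mantissa: the low 23 bits; the leading 1 is not stored. */
static uint32_t mantissa_field(float x)
{
    return float_bits(x) & 0x7FFFFFu;
}
```

For 8388608.0f (2^23) the exponent field is 127 + 23 = 150 and the mantissa field is 0, since the only set bit is the implied one. For 8388613.0f (2^23 + 5) the mantissa field is 5, which is exactly what the trick reads back.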
C++ addict
-
Currently working on: the 3D engine for Tomb Raider.




