Jump to content


Could you help me to optimize this?


  • You cannot reply to this topic
19 replies to this topic

#1 Nautilus

    Senior Member

  • Members
  • PipPipPipPip
  • 326 posts

Posted 26 January 2007 - 04:23 PM

Greetings,
I need to optimize some code.
Speed is critical, and I'm out of ideas.
If you are interested, please read.


Bottleneck - Part #1:
Given a 32 bits BitMask, I first need to test if certain bits are ON.
Bits of interest are [0-26].
At least 1 of them is always ON.
Any number of bits can be simultaneously ON.
Any combination using bits [0-26] is valid.
Bits [27-31] are always OFF (therefore ignored).

// Test for early bail out: we check if 1 and only 1 bit is ON.

// Tokens "BIT_0" through "BIT_26" are macro in the form:

//     #define BIT_13 (0x00002000UL)


switch (BitMask) // BitMask is a DWORD.

{

    case BIT_13: return 1;


    case BIT_0:  case BIT_1:

    case BIT_2:  case BIT_3:

    case BIT_4:  case BIT_5:

    case BIT_6:  case BIT_7:

    case BIT_8:  case BIT_9:

    case BIT_10: case BIT_11:

    case BIT_12: case BIT_14:

    case BIT_15: case BIT_16:

    case BIT_17: case BIT_18:

    case BIT_19: case BIT_20:

    case BIT_21: case BIT_22:

    case BIT_23: case BIT_24:

    case BIT_25: case BIT_26: return -1;

}


// At this point, 2 or more bits are ON.

// Last chance for early bail out: if BIT_13 is ON, we return.

if (BitMask & BIT_13) return 0;


// Out of luck. More tests are needed.

// "Bottleneck - Part #2" will continue from here...

I can't think of a better way to write the above. Is there one?


Bottleneck - Part #2:
The tests here take the same BitMask from above and first check if
a specific bit is ON. In case it is, they then check if any bit in a
specific combo of bits is ON.
The combo can hold from 4 to 17 bits, it doesn't matter since I pack
them bits into a single mask, like this:

// Define the combo of bits to use only if BIT_0 is ON.

#define MASK_0 (BIT_14 | BIT_16 | BIT_17 | \

                BIT_22 | BIT_23 | BIT_25 | BIT_26)


if ((BitMask & BIT_0) && (BitMask & MASK_0)) return 0;

In case BIT_0 is OFF, MASK_0 must not be checked.
Therefore I cannot pack BIT_0 in MASK_0, or it could return even
when BIT_0 is OFF.

The real problem is that I need 17 of these if(), and it's not possible
to predict which of them will more likely return most of the times.
Also, there's a rare chance that none of them will return.

The complete list of if() tests is:

#define MASK_0  (0x06C34000UL) // Bits : 14 16 17 22 23 25 26.

#define MASK_1  (0x07E3D000UL) // Bits : 12 14 15 16 17 21 22 23 24 25 26.

#define MASK_2  (0x03619000UL) // Bits : 12 15 16 21 22 24 25.

#define MASK_3  (0x06DB4C00UL) // Bits : 10 11 14 16 17 19 20 22 23 25 26.

#define MASK_4  (0x07FFDE00UL) // Bits :  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26.

#define MASK_5  (0x036D9600UL) // Bits :  9 10 12 15 16 18 19 21 22 24 25.

#define MASK_6  (0x00D84C00UL) // Bits : 10 11 14 19 20 22 23.

#define MASK_7  (0x00FC5E00UL) // Bits :  9 10 11 12 14 18 19 20 21 22 23.

#define MASK_8  (0x006C1600UL) // Bits :  9 10 12 18 19 21 22.

#define MASK_9  (0x06C34000UL) // Bits : 14 16 17 22 23 25 26.

#define MASK_10 (0x07E3D000UL) // Bits : 12 14 15 16 17 21 22 23 24 25 26.

#define MASK_11 (0x03619000UL) // Bits : 12 15 16 21 22 24 25.

#define MASK_12 (0x06DB4000UL) // Bits : 14 16 17 19 20 22 23 25 26.

// MASK_13 doesn't exist.

#define MASK_14 (0x036D4000UL) // Bits : 15 16 18 19 21 22 24 25.

#define MASK_15 (0x00D40000UL) // Bits : 19 20 22 23.

#define MASK_16 (0x00FC0000UL) // Bits : 18 19 20 21 22 23.

#define MASK_17 (0x006C0000UL) // Bits : 18 19 21 22.


if ((BitMask & BIT_0)  && (BitMask & MASK_0))  return 0;

if ((BitMask & BIT_1)  && (BitMask & MASK_1))  return 0;

if ((BitMask & BIT_2)  && (BitMask & MASK_2))  return 0;

if ((BitMask & BIT_3)  && (BitMask & MASK_3))  return 0;

if ((BitMask & BIT_4)  && (BitMask & MASK_4))  return 0;

if ((BitMask & BIT_5)  && (BitMask & MASK_5))  return 0;

if ((BitMask & BIT_6)  && (BitMask & MASK_6))  return 0;

if ((BitMask & BIT_7)  && (BitMask & MASK_7))  return 0;

if ((BitMask & BIT_8)  && (BitMask & MASK_8))  return 0;

if ((BitMask & BIT_9)  && (BitMask & MASK_9))  return 0;

if ((BitMask & BIT_10) && (BitMask & MASK_10)) return 0;

if ((BitMask & BIT_11) && (BitMask & MASK_11)) return 0;

if ((BitMask & BIT_12) && (BitMask & MASK_12)) return 0;

// No check for BIT_13. If it was ON, we'd have returned already.

if ((BitMask & BIT_14) && (BitMask & MASK_14)) return 0;

if ((BitMask & BIT_15) && (BitMask & MASK_15)) return 0;

if ((BitMask & BIT_16) && (BitMask & MASK_16)) return 0;

if ((BitMask & BIT_17) && (BitMask & MASK_17)) return 0;


return -1;

I know: it's ugly.
I need to perform these 17 tests faster.
Any help will be most appreciated.

Thanks In Advance,
Ciao ciao : )
-Nautilus

(readin' this? perhaps you should get out more -- give it a thought)


#2 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1225 posts

Posted 26 January 2007 - 04:49 PM

What exactly is this BitMask? Without a context it's hard to fully optimize it. It's not unlikely that with a totally different approach (storing the information differently) it can be much faster.

Anyway, here's one trick that you might be able to apply:

inline bool isPow2(int x)

{

    return (x & -x) == x;

}

It will tell you whether a number is a power of two. In binary, a power of two has just one bit set to 1. So you can do something like this:

if(ifPow2(BitMask))

{

    return (BitMask == 0x00002000) ? 1 : 0;

}



#3 SmokingRope

    Valued Member

  • Members
  • PipPipPip
  • 210 posts

Posted 26 January 2007 - 07:51 PM

Using the process of elimination would be much better suited to part 2 above. I went through the bitmasks and saw that 14 was a frequent occurence across the various tests so this would be a good place to start.

#define BITMASK_STAGE_ONE (BIT_0 | BIT_1 | BIT_3 | BIT_4 | BIT_6 | ...)

if( (BitMask & BITMASK_STAGE_ONE) && (BitMask & BIT_14) )
{
}
else
{
  if( (BitMask & BIT_2) && (BitMask & MASK_2) ) return 0;
  if( (BitMask & BIT_5) && (BitMask & MASK_5) ) return 0;
  if( (BitMask & BIT_8) && (BitMask & MASK_8) ) return 0;
  if( (BitMask & BIT_11) && (BitMask & MASK_11)) return 0;
  ...
}

If you find a good mode average bit for each set of MASKS which need to be tested you can split the different tests by a central pivot. The ideal conditions would allow you to isolate the only possible bitmask(s) in 3 tests.

Once you've found the mode for all the bitmasks (like i did above) you can repeat the process again, finding the mode for both sets which you just created. Once any resulting set reaches a quantity of 2(or less) or the mode is undefined (i.e. each bit tested has become unique) you will want to test the bitmasks individually. For this reason you shouldn't necessarily choose the actual mode but possibly a runner up, so as to create the most even distribution among the two resultant set of tests.

#4 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 26 January 2007 - 08:40 PM

I think you need to brush up your bittwiddeling skills.

Check out this site: http://www.hackersdelight.org/

He has lots of that stuff collected. Most important you'll find the code here: http://www.hackersdelight.org/HDcode/ Read it!

Your first problem can be solved using the IsPow2 test nick mentioned.

For the second test either use SmokingRopes solution. If you know that of the lower (BIT_X) bits only one is set you can find the lowest bit position and use it as an index into a table containing the masks. The rest of the test is trivial.

Btw, I'd put the conditions for the second test into a table and use a loop. That's more maintainable and I think it could be even faster.

Nils

#5 dave_

    Senior Member

  • Members
  • PipPipPipPip
  • 584 posts

Posted 27 January 2007 - 01:40 AM

I'm really curious now, why is this the bottle neck? Is it an academic challenge? I'm not some kind of rain man, I cant see any pattern in those bits.

When you get this close to the wire, the benefit of each optimization really depends on the platform.

Is there a difference between (BitMask & BIT_2) && (BitMask & MASK_2) and (BitMask & BIT_2|MASK_2)?
It won't help on an optimising compiler... but debug mode maybe (clutching at straws)?

Is it better to look at the numbers like this? 22 is in every row (bitmask & (1<<22)) return 0?
#define MASK_0  (0x06C34000UL) // Bits :             14    16 17             22 23 25    26.
#define MASK_1  (0x07E3D000UL) // Bits :          12 14 15 16 17          21 22 23 24 25 26.
#define MASK_2  (0x03619000UL) // Bits :          12    15 16             21 22    24 25.
#define MASK_3  (0x06DB4C00UL) // Bits :    10 11    14    16 17    19 20    22 23    25 26.
#define MASK_4  (0x07FFDE00UL) // Bits :  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26.
#define MASK_5  (0x036D9600UL) // Bits :  9 10    12    15 16    18 19    21 22    24 25.
#define MASK_6  (0x00D84C00UL) // Bits :    10 11    14             19 20    22 23.
#define MASK_7  (0x00FC5E00UL) // Bits :  9 10 11 12 14          18 19 20 21 22 23.
#define MASK_8  (0x006C1600UL) // Bits :  9 10    12             18 19    21 22.
#define MASK_9  (0x06C34000UL) // Bits :             14    16 17             22 23    25 26.
#define MASK_10 (0x07E3D000UL) // Bits :          12 14 15 16 17          21 22 23 24 25 26.
#define MASK_11 (0x03619000UL) // Bits :          12    15 16             21 22    24 25.
#define MASK_12 (0x06DB4000UL) // Bits :             14    16 17    19 20    22 23    25 26.
// MASK_13 doesn't exist.
#define MASK_14 (0x036D4000UL) // Bits :                15 16    18 19    21 22    24 25.
#define MASK_15 (0x00D40000UL) // Bits :                            19 20    22 23.
#define MASK_16 (0x00FC0000UL) // Bits :                         18 19 20 21 22 23.
#define MASK_17 (0x006C0000UL) // Bits :                         18 19    21 22.

Occurances of bits?

9 4
10 6
11 4
12 8
14 9
15 7
16 11
17 7
18 7
19 11
20 7
21 11
22 17
23 11
24 7
25 11
26 7

>14 are all prime number of bits... probably just coincidence...
Unless you include the extra bit for the BIT_X macro, then its bits 14-17 that aren't prime...
And there are similar patterns horizontally, weird eh?
Then again its typical thing of people to see patterns when there aren't any
</ramble>

#6 Reedbeta

    DevMaster Staff

  • Administrators
  • 4979 posts
  • LocationBellevue, WA

Posted 27 January 2007 - 02:03 AM

dave_ said:

Is there a difference between (BitMask & BIT_2) && (BitMask & MASK_2) and (BitMask & BIT_2|MASK_2)?

They're not the same, as the first one tests if BIT_2 is set and at least one of the bits in MASK_2, while the second one tests if BIT_2 is set or at least one of the bits in MASK_2 is set.
reedbeta.com - developer blog, OpenGL demos, and other projects

#7 SmokingRope

    Valued Member

  • Members
  • PipPipPip
  • 210 posts

Posted 27 January 2007 - 02:18 AM

I thought I should clarify, the solution i provided is an application of divide and conquer.

Divide And Conquer

#8 Nautilus

    Senior Member

  • Members
  • PipPipPipPip
  • 326 posts

Posted 27 January 2007 - 02:50 AM

@ dave_:
Yes, bit 22 appears in every row.
If you see it like a table and swap rows with columns, then
it's bit 4's turn to appear in every row.
Nothing would change.
[edit]
And yes, there is a pattern. I didn't mention it, but the 17 if()
tests are originated from a long list of pairs of bits to check.
Half the pairs were redundant, so I discarded them.
[/edit]



@ Nick and Nils:
The IsPow2() suggestion is great to handle Part #1,
which is now revised in:
// Test for early bail out: we check if 1 and only 1 bit is ON.
#define BIT_13 (0x00002000UL)

if (BitMask == BIT_13) return 1;
if ((BitMask & -BitMask) == BitMask) return -1; // <-- Thanks Nick =)

// At this point, 2 or more bits are ON.
// Last chance for early bail out: if BIT_13 is ON, we return.
if (BitMask & BIT_13) return 0;

// Out of luck. More tests are needed.
// "Bottleneck - Part #2" will continue from here...



@ SmokingRope:
I was just about to ask... :p

Ciao ciao : )
-Nautilus

(readin' this? perhaps you should get out more -- give it a thought)


#9 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 27 January 2007 - 12:40 PM

Bottleneck Part 2:

The idea is to not try all possible 18 conditions but to skip those where the first test (test against BIT_?) would fail. This should reduce the numbes of tests from 18 to n, where n is the population count of the lowest 18 bits.

We need a helper to find the lowest bit set. This one works in c but is kinda slow. Fortunately the x86 has an instruction that does exactly the same: BSR, so you can replace this code with a simple assembler pragma.


int ntz(unsigned src)

// returns the number of trailing zero bits 

// of an integer

{

  unsigned x;

  unsigned bit4, bit3, bit2, bit1, bit0;

  if (src == 0) return 32;

  x = src & -src;

  bit4 = (src & 0xffff) ? 0 : 16;

  bit3 = (x & 0xff00ff00) ? 8 : 0;

  bit2 = (x & 0xf0f0f0f0) ? 4 : 0;

  bit1 = (x & 0xcccccccc) ? 2 : 0;

  bit0 = (x & 0xaaaaaaaa) ? 1 : 0;

  return bit4 + bit3 + bit2 + bit1 + bit0;

}

2nd: Check all bits in a loop. I might be off by one here and there. E.g. using <= 17 instead of <=18, but I'm sure you'll get the idea.

I've moved the MASK_?? constants into an array called MaskTable, so I can index into them (not shown here)


    int WorkMask = BitMask;

  // Bottleneck 2:

  // Several bits below the 17th are set.

  // process them one by one, skipping all 0 bits.

  do {

    // get the number of trailing zeros.

    // this is identical to the id of the first bit set:

    int index = ntz(WorkMask);


    // are we in the range of valid bits?

    if (index < 18)

    {

      // check against mask (from a table)

      if (BitMask & MaskTable[index])

      {

        // We have a hit!

        return 0;

      } else {

        // Miss. Unset the bit, so we don't try it again

        WorkMask &= ~(1<<index);


        // Optional: Do a fast reject test. 

        // if none of the lowest 17 bits is set we can stop here.

        if (WorkMask < ((1<<17)-1))

          return -1;

      }

    } else return -1;

  } while (1);


This code - as is - is most probably slower than your straight forward version, but if you implement it as an inline assembler pragma an unroll it, it might be faster than the original version. It all depends on how many bits below the 18th are set.

#10 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 27 January 2007 - 12:48 PM

hm. this can be optimized even further... by processing two bits at once. And some of the compares can be optimized as well.

I'll do that later :-)

#11 Nautilus

    Senior Member

  • Members
  • PipPipPipPip
  • 326 posts

Posted 27 January 2007 - 06:15 PM

I tried to implement SmokingRope's idea. I came up with different
splittings, trying to maintain a 50%-50% branching as much as possible.
But I ended up with (I think) too many if() in all cases.
Still much less than the original 17, though.
It was worth to try (and I learned something new : ).

However, in the past hours I've read some documents explaining
what's behind the mechanism known as Branch Prediction, and the
performance cost that can occur if a branch is often mis-predicted.
In my specific case it can get quite ugly since, more or less, all the
bits have the same chances to be set or not, and the branches can
be as many as 17.
In other words, there's no way I can rearrange my code so that
branching will take the same direction most of the times.

So I think a good idea would be to completely eliminate branching.
The end result would cost more operations, but the whole should be
executed many times faster than with all those if().

First attempt, very simple.
Only 1 if() left, and 16 added logical OR "||".
End result is correct anyway, and the if() jumps at the "return 0;"
as soon as one side of any || evaluates to True.
But I don't like much the idea =/
if ( ((BitMask & BIT_0) && (BitMask & MASK_0)) ||
     ((BitMask & BIT_1) && (BitMask & MASK_1)) ||
     ((BitMask & BIT_2) && (BitMask & MASK_2)) ||
     ((BitMask & BIT_3) && (BitMask & MASK_3)) ||
     ...
     ((BitMask & BIT_17) && (BitMask & MASK_17))
   ) return 0;

return -1;   

Here is a variant of the above:
BOOL Bool  = ((BitMask & BIT_0) && (BitMask & MASK_0));
     Bool |= ((BitMask & BIT_1) && (BitMask & MASK_1));
     Bool |= ((BitMask & BIT_2) && (BitMask & MASK_2));
     Bool |= ((BitMask & BIT_3) && (BitMask & MASK_3));
     ...
     Bool |= ((BitMask & BIT_17) && (BitMask & MASK_17));

return (Bool) ? 0 : -1;
I like this more, and looks a lot cleaner.
It still uses only 1 branching operator ()?:, which really can't
be avoided.

In the above snippet I setup a dedicated variable, Bool. I could as well
reuse BitMask itself, since it contains 5 unused bits [27-31].
I could modify the function so that the used bits are shifted left by 5
positions, and the range of bits to check would become [5-31].
In that way I could write something like:
// Assuming the macros BIT_?? and MASK_?? use now bits [5-31]
// instead of bits [0-26].
BitMask |= ((BitMask & BIT_0) && (BitMask & MASK_0));
BitMask |= ((BitMask & BIT_1) && (BitMask & MASK_1));
BitMask |= ((BitMask & BIT_2) && (BitMask & MASK_2));
BitMask |= ((BitMask & BIT_3) && (BitMask & MASK_3));
...
BitMask |= ((BitMask & BIT_17) && (BitMask & MASK_17));

return (BitMask & 1) ? 0 : -1;
Actually I like it. But I'm not sure if it's really a good idea to execute
all those instructions every time.
Sounds like a big waste if we consider that I could return already after
the first line, if it'd evaluate to True.



@ Nils:
Your new idea is much more clever than mine : )

Ciao ciao : )
-Nautilus

(readin' this? perhaps you should get out more -- give it a thought)


#12 Razor

    Member

  • Members
  • PipPip
  • 46 posts

Posted 27 January 2007 - 11:24 PM

Nautilus said:

Here is a variant of the above:

BOOL Bool  = ((BitMask & BIT_0) && (BitMask & MASK_0));

     Bool |= ((BitMask & BIT_1) && (BitMask & MASK_1));

     Bool |= ((BitMask & BIT_2) && (BitMask & MASK_2));

     Bool |= ((BitMask & BIT_3) && (BitMask & MASK_3));

     ...

     Bool |= ((BitMask & BIT_17) && (BitMask & MASK_17));


return (Bool) ? 0 : -1;

I like this more, and looks a lot cleaner.
It still uses only 1 branching operator ()?:, which really can't
be avoided.
That won't be a branch will it? With any luck it should be a conditional move. And you certainly could avoid it. return (Bool > 0) - 1; should work, and I bet there are other ways.

#13 Nautilus

    Senior Member

  • Members
  • PipPipPipPip
  • 326 posts

Posted 28 January 2007 - 12:29 AM

Razor said:

That won't be a branch will it? With any luck it should be a conditional move. [...]
Oops, you're right.
My head is spinning. Please have mercy, and pretend you
don't notice all my blasphemies.

Ciao ciao : )
-Nautilus

(readin' this? perhaps you should get out more -- give it a thought)


#14 Nautilus

    Senior Member

  • Members
  • PipPipPipPip
  • 326 posts

Posted 29 January 2007 - 01:00 AM

I try hard but can't find how to eliminate branching and keeping
instructions carried out at a minimum.
Every time I look at the produced assembly code, I still see lot of
jumps.

Even the && operator produces jumps. They could be avoided by
using * instead of &&. Assembly becomes much more compact, but
every multiplication costs so many cycles...
I'm talking of this:

// * instead of &&. No jumps at all. Very few instructions.

// End result is correct... but the multiplication is a tax.

BOOL Bool  = ((BitMask & BIT_0) * (BitMask & MASK_0));

     Bool |= ((BitMask & BIT_1) * (BitMask & MASK_1));

     Bool |= ((BitMask & BIT_2) * (BitMask & MASK_2));

     Bool |= ((BitMask & BIT_3) * (BitMask & MASK_3));

     ...

     Bool |= ((BitMask & BIT_17) * (BitMask & MASK_17));


return (Bool) ? 0 : -1;


The expression (BitMask & BIT_??) always produces either a
power of 2, or Zero. So I have attempted to remove the multiplication
by swapping the operands and introducing << and >> to do something
clever.
But I can't get it right:

Bool |= ((BitMask & MASK_??) << ((BitMask & BIT_??) >> 1));

...

The above miserably fails when (BitMask & BIT_??) == Zero.
It also fails when BIT_?? is any bit past the 6th, because valid bits are
lost, and I may lose the one and only bit ON obtained from
(BitMask & MASK_??).
Just a stupid idea. I'll spare you the many others I had.

This is a plain defeat :surrender

My most sincere thanks to everyone who spent his time to contribute. It was very kind of you.

Ciao ciao : )
-Nautilus

(readin' this? perhaps you should get out more -- give it a thought)


#15 DrunkenCoder

    Member

  • Members
  • PipPip
  • 97 posts

Posted 29 January 2007 - 02:43 PM

You could use an 8Mb LUT to solve it with just one lookup :) If you have the memory for that.

#16 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 29 January 2007 - 04:31 PM

DrunkenCoder said:

You could use an 8Mb LUT to solve it with just one lookup :) If you have the memory for that.

A cache miss can easily take 100 or more cycles these days... large LUTs are almost always slower than code.

#17 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 29 January 2007 - 05:44 PM

Branchless version, except for the return.


#define MASK_0  (0x06C34000UL)

#define MASK_1  (0x07E3D000UL)

#define MASK_2  (0x03619000UL)

#define MASK_3  (0x06DB4C00UL)

#define MASK_4  (0x07FFDE00UL)


#define BitAsMask(A,B) (((A)<<(31-(B)))>>31)


int btest (int mask)

{

  int b = (BitAsMask(mask, 0) & mask & MASK_0) | 

          (BitAsMask(mask, 1) & mask & MASK_1) | 

          (BitAsMask(mask, 2) & mask & MASK_2) | 

          (BitAsMask(mask, 3) & mask & MASK_3);


  return b ? 0 : -1;

}

Note: Mask must be declared as a signed integer, otherwise the BitAsMask macro does not work.

The trick is to move bit B into the highest bit (the sign bit of the integer) and then shift the result down by 31. This replicates bit B into all bits of the resulting integer, e.g. it generates a 0 if bit B is not set and 0xffffffff if bit B is set. This result is interweaved into the test using a binary and operator.

The code can now be rewritten to take advantage of paralellism. All you need to do is to break the dependency chain of the results. This is done by splitting the result into two subterms and merging them at the end.

I think a clever compiler already does this on it's own, but you'll never now.

int btest (int mask)

{

  int b = (BitAsMask(mask, 0) & mask & MASK_0);

  int c = (BitAsMask(mask, 1) & mask & MASK_1);

  b |=    (BitAsMask(mask, 2) & mask & MASK_2);

  c |=    (BitAsMask(mask, 3) & mask & MASK_3);

  b |=    (BitAsMask(mask, 4) & mask & MASK_4);

  c |=    (BitAsMask(mask, 5) & mask & MASK_5);

  // ...


  return (b|c) ? 0 : -1;

}

Paralellism can be increased as long as you have free cpu-registers and the execution units of the CPU are not filled up.

With a bit of clever bit twiddeling the final branch for the return statement can be done branchless as well:


b |= c;

b |= b >> 1; 

b |= b >> 2;

b |= b >> 4;

b |= b >> 8;

b |= b >> 16;

b = (b<<31)>>31;

return ~b;



#18 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 29 January 2007 - 06:46 PM

btw:


  int b = (BitAsMask(mask, 0) & mask & MASK_0) | 

        (BitAsMask(mask, 1) & mask & MASK_1) | 

        (BitAsMask(mask, 2) & mask & MASK_2) | 

        (BitAsMask(mask, 3) & mask & MASK_3);

Is the same as:


  int b = (BitAsMask(mask, 0) & MASK_0) | 

        (BitAsMask(mask, 1) & MASK_1) | 

        (BitAsMask(mask, 2) & MASK_2) | 

        (BitAsMask(mask, 3) & MASK_3);

  b &= mask;

The later one is about 10% faster. I don't know why the compiler didn't optimized that out. Looks like the compilers I've tried don't do a good job at global subexpression elimination for boolean expressions.

#19 Nils Pipenbrinck

    Senior Member

  • Members
  • PipPipPipPip
  • 597 posts

Posted 29 January 2007 - 07:56 PM

Btw. I've implemented the bit-scan version I've mentioned the days before in assembler. On a P4 this version rocks away all C-codes by about factor 4. In the worst case (passing (1<<17)-1 as a mask-parameter) it's still twice as fast.

The compiler can't even inline this thing, I use slow bit-scan and bit-clear instructions and I branch a lot, but it owns the c code. Algorithmic optimizations rule! :yes:

int mask_table[18] = 
{
  0x06C34000, 0x07E3D000, 0x03619000, 0x06DB4C00,
  0x07FFDE00, 0x036D9600, 0x00D84C00, 0x00FC5E00, 
  0x006C1600, 0x06C34000, 0x07E3D000, 0x03619000,
  0x06DB4000, 0x00000000, 0x036D4000, 0x00D40000,
  0x00FC0000, 0x006C0000
};

int test (int BitMask)
{
  int r;
  _asm 
  {
    mov   eax, BitMask          // load mask-parameter
    lea   edi, mask_table       // ptr to mask-table
    mov   ebx, eax              // get copy of mask parameter
    and   ebx, ((1 << 18)-1)    // mask out low 18 bits
    jz    GotFalse              // no bit set? early out
    bsr   ecx, ebx              // get a 1 bit (this instruction does take considerable time...)
NextBit:
    mov   edx, [edi + ecx*4]    // load the mask value for this bit
    btr   ebx, ecx              // clear that bit (rund in parallel with the upcomming branch, so it costs effectively nothing)
    and   edx, eax              // does the mask match?
    jnz   GotTrue               // jep. return true
    bsr   ecx, ebx              // get the next 1 bit (runs in parallel with the upcomming branch, so this costs effectively nothing)
    and   ebx, ebx              // still bits left?
    jnz   NextBit               // jep, then check them
GotFalse:
    mov   [r], -1               // load return value
    jmp   EndTest               // and exit
GotTrue:
    mov   [r], 0                // load return value
EndTest:
  }
  return r;
}


#20 Nautilus

    Senior Member

  • Members
  • PipPipPipPip
  • 326 posts

Posted 29 January 2007 - 11:57 PM

:blink: Nils... :surprise:

You just nuked my brain.
>> B.S.O.D. and waiting for reset...

I was aware that shifting the sign bit would cause a fill
(had to fight against a bug caused by it once), but in all
honesty I'd have never thought it could be used to solve
my problem. That's sheer brilliance!

The very last branch is fine.
When the flow goes to Part #2 it's rare enough that none of
the tested pairs of bits is found ON.
If the function is gonna return "-1", it will likely be the
last return from Part #1.

I owe you big. I can't thank you enough!
I :worthy: to you.

Best regards,
Ciao ciao : )
-Nautilus

(readin' this? perhaps you should get out more -- give it a thought)






1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users