Strange slowdown of pixel operations using MSVC

hellhound_01 104 Jan 19, 2012 at 15:07 c++ algorithm fixed-point floating-point

Hi,

I’ve implemented some setPixel color operations for native-endian and non-native-endian pixel formats. With GCC
everything runs fine. With MSVC, some operations on SHORT and HALF_FLOAT values run noticeably slower,
and I can’t explain why.

Here is the fragment of my setPixelColor method:

brPixelFormatInfo info = gfxUtils.getPixelFormatInfo(m_format);
unsigned int nativeColor = 0;
// if pixel format is native first calculate the native color before
// entering loop to set color value.
if(true==info.isNativeEndian()){
    nativeColor = this->getNativePixelColor(color);
}
unsigned int bytesPerPixel = this->getBytesPerPixel();
for(unsigned int y = rect.getY(); y < rect.getY()+ rect.getHeight(); y++)
{
        // byte offset of the first pixel of this row inside the rectangle
        unsigned int byteIndex = (this->getBytesPerRow() * y) + rect.getX() * bytesPerPixel;
        for (unsigned int x = rect.getX(); x < rect.getX()+rect.getWidth(); x++)
        {
            if(true==info.isNativeEndian())
            {  
                [...]
            }
            else{
                switch(m_format)
                {
                    // 32bit float value formats
                    case PF_FLOAT32_RGB:
                    {
                        m_data[byteIndex]    = (unsigned char)color.getRed();
                        m_data[byteIndex + 1] = (unsigned char)color.getGreen();
                        m_data[byteIndex + 2] = (unsigned char)color.getBlue();
                        break;
                    }
                    case PF_SHORT_RGB:
                    {
                        // fixed-point channel values, each explicitly initialized
                        unsigned int red = 0, green = 0, blue = 0;
                        brPixelFormatInfo info = gfxUtils.getPixelFormatInfo(PF_SHORT_RGB);
                        brPixelFormatInfo::RGBA_BITS bits = info.getBitValues();
                        brColor::RGBA rgba = color.getRGBA();
                        red   = gfxUtils.convertColorToFixedPoint(rgba.m_red, bits.m_red);
                        green = gfxUtils.convertColorToFixedPoint(rgba.m_green, bits.m_green);
                        blue  = gfxUtils.convertColorToFixedPoint(rgba.m_blue, bits.m_blue);

                        m_data[byteIndex]     = (unsigned char)red;
                        m_data[byteIndex + 1] = (unsigned char)green;
                        m_data[byteIndex + 2] = (unsigned char)blue;
                        break;
                    }
                    // half float precision values
                    case PF_FLOAT16_R:
                    {
                        brColor::RGBA rgba = color.getRGBA();
                        m_data[byteIndex] = (unsigned char)gfxUtils.convertColorToHalfFloat(rgba.m_red);
                        break;
                    }
                    default:
                        throw brCore::brIllegalStateException(
                        "[brImage]::setPixelColor: Invalid pixel format!");
                }
            }
            byteIndex += bytesPerPixel;
        }
   }

And here is my convert to fixed point method:

unsigned int brGraphicsUtils::convertColorToFixedPoint(float color, unsigned int bits) const
{
   unsigned int fixed = 0;
    if(color <= 0.0f){
        fixed = 0;
    }
    else if (color >= 1.0f){
        fixed =  (1U<<bits)-1U;
    }
    else{
        fixed = (unsigned int)(color * (1U<<bits));
    }
    return fixed;
}

Nothing special… The strange thing is that when I step through my sources line by line in the debugger,
everything is fast enough, but when I step over the getPixelColor method as a whole, it takes several seconds
before the call returns and the unit test continues …

I’ve checked the sources again and again, and they look correct. To investigate, I’ve added some timestamp calls
to the SHORT_RGB handling:

  • in: 2012-Jan-19 15:44:52.664127
  • start convert: 2012-Jan-19 15:44:52.669127
  • end convert: 2012-Jan-19 15:44:52.690127
  • start writing: 2012-Jan-19 15:44:52.695127
  • end writing: 2012-Jan-19 15:44:52.700127
  • out: 2012-Jan-19 15:44:52.705127

It looks to me as if my operations are OK, but writing the data takes some milliseconds. But why
only with MSVC?
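For comparison, the timestamps above could also be taken with a monotonic clock, which avoids any formatting or logging cost inside the measured region. This is only a minimal sketch; `measureMicroseconds` is a hypothetical helper and not part of the code in this thread, and it assumes a C++11 compiler.

```cpp
#include <chrono>

// Hypothetical helper (not from the original sources): runs a callable once
// and returns the elapsed wall time in microseconds, using a monotonic clock.
template <typename Fn>
long long measureMicroseconds(Fn&& fn)
{
    const auto start = std::chrono::steady_clock::now();
    fn();  // the code under measurement; no logging happens in between
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

Wrapping only the conversion or only the write in such a call would show whether the time is spent in the operation itself or in whatever surrounds it.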

Does anyone have an idea what could be wrong and how I could speed up these elementary operations?
Thanks for any hint.

Best regards,
Hellhound

7 Replies

Kenneth_Gorking 101 Jan 19, 2012 at 15:54

I would start by moving the tests out of your inner loop. Testing isNativeEndian() and m_format for every single pixel is just a waste of time.

GCC might be optimizing this for you, which is why you are seeing the speed difference. You would have to examine the assembly output to be sure though…
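A rough sketch of what moving the tests out of the inner loop could look like. The names here (`PixelFormat`, `fillRect`) are hypothetical stand-ins and not the real brImage API; the point is only that the format switch runs once per call instead of once per pixel.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the thread's pixel formats.
enum class PixelFormat { ByteRGB, ByteR };

// Fills a rectangle of an interleaved byte image with one color.
void fillRect(std::vector<std::uint8_t>& data, std::size_t bytesPerRow,
              PixelFormat format,
              std::size_t x0, std::size_t y0, std::size_t width, std::size_t height,
              std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    // Resolve the format once, before entering either loop.
    const std::uint8_t pixel[3] = {r, g, b};
    std::size_t bytesPerPixel = 0;
    switch (format) {
        case PixelFormat::ByteRGB: bytesPerPixel = 3; break;
        case PixelFormat::ByteR:   bytesPerPixel = 1; break;
        default: throw std::invalid_argument("fillRect: invalid pixel format");
    }

    for (std::size_t y = y0; y < y0 + height; ++y) {
        // Byte offset of the first pixel of this row inside the rectangle.
        std::size_t byteIndex = bytesPerRow * y + x0 * bytesPerPixel;
        for (std::size_t x = 0; x < width; ++x) {
            // The inner loop only stores bytes; no format test per pixel.
            for (std::size_t c = 0; c < bytesPerPixel; ++c) {
                data[byteIndex + c] = pixel[c];
            }
            byteIndex += bytesPerPixel;
        }
    }
}
```

The same idea applies to the color conversion: the converted channel values depend only on the color and the format, so they can be computed once before the loops as well.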

Kenneth_Gorking 101 Jan 19, 2012 at 16:00

Another thing is the convertColorToFixedPoint function. It seems that values outside the 0..1 range are invalid? Maybe it would be better to catch them with an assert, and lose the two conditionals per pixel…
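Something along these lines, as a sketch only. `convertColorToFixedPointChecked` is a hypothetical name, and note one deliberate difference from the posted function: without the clamp, multiplying by `1U << bits` would overflow the bit width for an input of exactly 1.0, so this version scales by the maximum value and rounds. It also assumes at most 16 bits per channel so the float arithmetic stays exact enough.

```cpp
#include <cassert>

// Hypothetical variant of convertColorToFixedPoint: out-of-range input is
// treated as a programming error (checked in debug builds only) rather than
// being clamped on every call.
unsigned int convertColorToFixedPointChecked(float color, unsigned int bits)
{
    assert(bits > 0 && bits <= 16);           // assumption: SHORT-sized channels
    assert(color >= 0.0f && color <= 1.0f);   // caller must pass normalized color
    const unsigned int maxValue = (1U << bits) - 1U;
    // Scale by maxValue so that 1.0 maps to the largest representable value,
    // and add 0.5 to round to nearest instead of truncating.
    return static_cast<unsigned int>(color * static_cast<float>(maxValue) + 0.5f);
}
```

Keep in mind that `assert` compiles to nothing when `NDEBUG` is defined, so a release build would silently accept out-of-range values.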

hellhound_01 104 Jan 19, 2012 at 19:39

It’s not the test alone: if I run my demo on MSVC, the initialization of those formats at startup takes many seconds.
At first it looks like a deadlock or hang, but after more than 10 seconds it is running. With GCC the startup takes less than
2 seconds…

I implemented those tests to figure out what’s wrong and why such simple operations take so long with
MSVC…

Color values out of range are clamped to the min/max color values. Asserts may be an option, but those two conditionals
are not the reason for the slowdown.

Kenneth_Gorking 101 Jan 19, 2012 at 19:47

Maybe you should try using CodeAnalyst or VTune (depending on your CPU); they should be able to pinpoint exactly where the time is spent. I am using CodeAnalyst, and it is a great tool for finding hotspots and slowdowns.

hellhound_01 104 Jan 20, 2012 at 06:44

I’ve tested the demo source with VTune and found that my debug logging takes too much CPU
time. It looks like GCC optimizes the string operations, whereas MSVC does not …

http://j18.img-up.ne…rofile9r2h1.jpg

Thanks for the hint about the analyzer. VTune looks good, but is too expensive ($800 for a single-user
license). Do you know a good free, non-proprietary alternative to VTune? CodeAnalyst looks good,
but if I understood correctly, I get less information from it on an Intel CPU …

Stainless 151 Jan 20, 2012 at 10:36

Outputting debug strings on Windows can be very, very, very slow.

If you are running a Visual Studio plugin and having the debug output captured by VS in the Output window, a single print string can take as much as 175 milliseconds.

Logging to a file is a lot faster. I know that’s counterintuitive. :)
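A minimal sketch of what file logging could look like, using only the standard library. `FileLogger` is a hypothetical class, not something from the code in this thread. The assumption is that messages are left in the stream buffer and not flushed one by one, since flushing after every message would give back much of the gain.

```cpp
#include <fstream>
#include <string>

// Hypothetical minimal logger: writes one line per message to a file and
// relies on the stream's own buffering instead of flushing per message.
class FileLogger {
public:
    explicit FileLogger(const std::string& path)
        : m_out(path.c_str(), std::ios::out | std::ios::trunc) {}

    bool isOpen() const { return m_out.is_open(); }

    // Appends the message and a newline; '\n' is used instead of std::endl
    // because std::endl would force a flush on every call.
    void log(const std::string& message) { m_out << message << '\n'; }

    // Explicit flush for the moments where the data must be on disk.
    void flush() { m_out.flush(); }

private:
    std::ofstream m_out;  // closed (and flushed) automatically on destruction
};
```

The trade-off is that buffered messages can be lost if the program crashes before a flush, so it may be worth calling `flush()` at a few well-chosen points.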

Kenneth_Gorking 101 Jan 21, 2012 at 11:10

@hellhound_01

I’ve tested the demo source with VTune and found that my debug logging takes too much CPU
time. It looks like GCC optimizes the string operations, whereas MSVC does not …

Well, do you really need to output all this info? Dumping every single conversion into a log seems a bit overkill… If you absolutely must, then maybe a rewrite of your toString() function would be needed. Since you are dealing with 16-bit values here (I’m assuming from the name), you could precompute all the string representations into a lookup table, and use the half-value to retrieve the string representation.
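Roughly like this, as a sketch. `HalfStringTable` is a hypothetical name, and to keep the example short it formats the raw 16-bit pattern as a decimal integer; a real table would store whatever your toString() produces for the decoded half value. It assumes C++11 for `std::to_string`.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical cache: one preformatted string for each of the 65536 possible
// 16-bit values, built once up front.
class HalfStringTable {
public:
    HalfStringTable() : m_strings(65536) {
        for (unsigned int v = 0; v < 65536; ++v) {
            // Placeholder formatting: the raw bit pattern as a decimal number.
            m_strings[v] = std::to_string(v);
        }
    }

    // Lookup is a plain array index; no formatting happens per call.
    const std::string& toString(std::uint16_t value) const {
        return m_strings[value];
    }

private:
    std::vector<std::string> m_strings;
};
```

The table costs some memory and a one-time build, so it only pays off if the strings are requested many times.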
@hellhound_01

Thanks for the hint about the analyzer. VTune looks good, but is too expensive ($800 for a single-user
license). Do you know a good free, non-proprietary alternative to VTune? CodeAnalyst looks good,
but if I understood correctly, I get less information from it on an Intel CPU …

Yes, VTune is for Intel CPUs, and CodeAnalyst is for AMD CPUs. The reason they work as well as they do is that they are hooked directly into the CPU driver, where they can access counters and what-not. This is also why you can’t use them on other vendors’ CPUs for anything but rudimentary timing. I don’t know of any alternatives that work on both CPUs, sorry.