Recently, I've been having quite some performance problem with using a large (~~500 MB) std::vector<unsigned char>, which is repeatedly getting resized, cleared, and dropped in a tight loop.
Weirdly, the problem comes from the STL that accompanies Visual Studio 2008.
In particular, using std::vector::resize() is slow, since it uses a "normal" inner loop for default-constructing (zero-initializing) the new vector entries. On my machine, resizing() the vector takes around 8 seconds, while allocating the memory with std::vector::reserve() or copying an entire new dataset into the vector with one of the optimized RTL functions takes an order of magnitude less time.
After some digging in the STL, I noticed that there is an optimized _fill_n method that uses memset for unsigned-char type elements. Yet this method does not get called by the compiler on template resolution, but instead, the vector is resized with the loop-based, default version of _fill_n, i.e. this
// TEMPLATE FUNCTION fill_n
template<class _OutIt, class _Diff, class _Ty> inline
void __CLRCALL_OR_CDECL _Fill_n(_OutIt _First, _Diff _Count, const _Ty& _Val, _Range_checked_iterator_tag)
{
// copy _Val _Count times through [_First, ...)
for (; 0 < _Count; --_Count, ++_First)
*_First = _Val;
}
with:
_OutIt == unsigned char*
_Diff == unsigned int
_Ty = unsigned char
is preferred over the non-template
inline void __CLRCALL_OR_CDECL _Fill_n
(unsigned char *_First,
size_t _Count, int _Val, _Range_checked_iterator_tag)
{
// copy unsigned char _Val _Count times through [_First, ...)
::memset(_First, _Val, _Count);
}
a few lines further down.
The reason, I think, is that the compiler decides to prefer a template over implicitly casting any parameters (also makes sense when considering classes with custom casts), but then again, the optimized methods won't probably ever be used (target pointer type != value type...)
For other methods, such as vector copying, the correct, optimized version using memcpy is used for certain scalar types.
Is this explanation correct ?
Is this a compiler bug or a problem in the STL ?
Any ideas on how to fix the misbehaviour ?
(Apart from designing a better algorithm, or using a custom, malloc-ed container...)
Thank you for your time,
Cheers,
- Wernaeh












