NS_CopySegmentToBuffer() is basically just a wrapper around memcpy(), and a lot of data passes through it. The copy would be faster if the source and dest pointers were 16-byte aligned. At this writing the dest pointer is simply base+offset with no regard to alignment. Most contemporary x86 implementations of memcpy() use SIMD (SSE2, etc.) instructions to copy data in 16-byte blocks, which performs better than integer-sized copies. The CRT that ships with VS2005/SP1 is one that provides SSE2 copying. There's a kink in Microsoft's implementation, though: both source and destination pointers MUST have the same alignment. They apparently didn't think the gains from unaligned SIMD memory accesses were worth supporting, so it's both pointers aligned (allowing adjustment to a 16-byte boundary) or nothing. My knowledge of C++ is too sketchy to do the work myself, but I don't think the changes to ensure alignment would be that extensive. See file: C:\Program Files\Microsoft Visual Studio 8\VC\crt\src\intel\memcpy.asm
1) Is this actually showing up in profiles? 2) Does memcpy not handle copying up to alignment and then using SSE? I'd think it does. 3) Making the alignment work out would mean changing all consumers of this function, right? And possibly various upstream consumers. Could be done; but is it worth it (see comment 1).
1. NS_CopySegmentToBuffer() itself is not featured in profiling, but memcpy(), where the real work is done, is certainly prominent. How much of that memcpy() CPU use can be attributed to NS_CopySegmentToBuffer() is unknown. 2. If the two pointers have the same alignment, i.e. the same offset within a 16-byte block (src % 16 == dst % 16), then memcpy() will adjust up to 16-byte alignment, then use SIMD copying thereafter. For example, if src=0xXXXXXX04 and dst=0xXXXXXX14 then integer copying is used until src=0xXXXXXX10 and dst=0xXXXXXX20 (both 16-byte aligned). If we start with, say, src=0xXXXXXX04 and dst=0xXXXXXX18 then SIMD will not be used, because no amount of adjustment brings both pointers to a 16-byte boundary at the same time. 3. Is it worth it? Depends on how much work has to be done to ensure alignment. This seems like a cheap way to improve performance (because Microsoft has already written the SIMD copying code). If pointer alignment would be unduly disruptive, maybe it's not so cheap.
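To make the condition in point 2 concrete, here is a rough C++ sketch of the check as I understand it. The helper names (MisalignmentOf, CanCoAlign, HeadBytes) are made up for illustration and are not part of the CRT or of any Mozilla code; the 16-byte figure is the SSE2 block size discussed above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration only; not a CRT or Mozilla API.
constexpr std::uintptr_t kSimdAlign = 16;  // SSE2 block size assumed above

// Offset of an address within its 16-byte block (0..15).
inline std::uintptr_t MisalignmentOf(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) & (kSimdAlign - 1);
}

// True when both pointers reach a 16-byte boundary after the same
// number of bytes, i.e. src % 16 == dst % 16.
inline bool CanCoAlign(const void* src, const void* dst) {
  return MisalignmentOf(src) == MisalignmentOf(dst);
}

// Number of leading bytes to copy before p is 16-byte aligned (0..15).
inline std::size_t HeadBytes(const void* p) {
  return static_cast<std::size_t>(
      (kSimdAlign - MisalignmentOf(p)) & (kSimdAlign - 1));
}
```

With the addresses from the example, src ending in 04 and dst ending in 14 can be co-aligned after 12 head bytes, while src ending in 04 and dst ending in 18 cannot. A caller that controls the dest offset could use a check like this to decide whether padding the offset is worthwhile.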
Wouldn't any decent profiler be able to give a hierarchical profile to tell you whether memcpy was being called under NS_CopySegmentToBuffer?