While working on call-threaded dispatch, I accidentally observed that adding NOPs, and even additional functional instructions, could speed things up by 3-7%. I think it might have been from alignment changes, although I really don't know. (I suspect some of it is from increased path lengths between control flow ops.) I do know that Intel recommends aligning branch targets on 16-byte boundaries, because then the first fetch at the target brings in a full 16-byte block of instructions that can be executed. (Conversely, if the branch target is 15 bytes past a 16-byte boundary, the first fetch brings in only 1 byte that will be executed and 15 "wasted" bytes.) Regular SM might be able to benefit from this. With GCC, you can use asm(".align N") to request alignment. On my setup that aligns on 2^N-byte boundaries, but some assembler targets read N as a byte count instead; .p2align N is the unambiguous power-of-two form. In my experiments, I got the best results with .align 3 (8-byte boundaries), but my benchmark is very small and I am currently using an unreliable experimental system. Presumably it would be easier to get definitive measurements with trunk SM.
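To make the fetch arithmetic above concrete, here is a minimal sketch in C. It assumes the 16-byte fetch block quoted above, and the function names and the FETCH_BLOCK constant are hypothetical, chosen only for illustration. It models how many bytes of the first fetch are usable for a given branch-target address, and how much padding a GNU as .p2align N directive (which pads to a 2^N-byte boundary) would have to insert. It says nothing about whether the padding actually pays off; that still has to be measured.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fetch-block size in bytes (the 16-byte figure from the comment). */
enum { FETCH_BLOCK = 16 };

/* Bytes of the first fetch that lie at or after the branch target. */
unsigned useful_bytes_in_first_fetch(uintptr_t target)
{
    assert((FETCH_BLOCK & (FETCH_BLOCK - 1)) == 0); /* power of two */
    return FETCH_BLOCK - (unsigned)(target & (FETCH_BLOCK - 1));
}

/* Bytes of the first fetch that precede the target and are never executed. */
unsigned wasted_bytes_in_first_fetch(uintptr_t target)
{
    return (unsigned)(target & (FETCH_BLOCK - 1));
}

/* Padding needed to move target up to the next 2^log2_align boundary,
 * i.e. what a ".p2align log2_align" directive would have to insert. */
unsigned padding_for_p2align(uintptr_t target, unsigned log2_align)
{
    uintptr_t mask = ((uintptr_t)1 << log2_align) - 1;
    return (unsigned)((0 - target) & mask); /* unsigned wraparound is defined */
}
```

For a target at 0x100F (15 bytes past a 16-byte boundary), useful_bytes_in_first_fetch returns 1 and wasted_bytes_in_first_fetch returns 15, matching the example in the comment. padding_for_p2align(0x100F, 4) returns 1, and padding_for_p2align(0x1003, 3) returns 5.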
Really want an owner for this (thanks, dmandelin, for filing it) -- Igor, are you around this week? Cc'ing others who might be able to help. /be