Chips with SSE4 are now shipping, and once we upgrade to VC2008 we get access to the new intrinsics (http://blogs.msdn.com/vcblog/archive/2007/10/18/new-intrinsic-support-in-visual-studio-2008.aspx). Of particular note in SSE4 are the CRC32 calculation and certain 128-bit string-compare ops. Not sure whether either would be of interest for NSS or JS. It's also worth noting that newer processors are greatly reducing the latency of SSE instructions (http://www.hardwaresecrets.com/fullimage.php?image=6762), which can have a big impact on certain ops. Just wanted to get this on the radar for Moz2 and beyond.
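To make the CRC32 point concrete, here is a minimal sketch of how the instruction might sit behind a runtime check with a portable fallback. The function names (crc32c, crc32c_sw, crc32c_hw) and the HAVE_SSE42_PATH macro are hypothetical, and the syntax assumes GCC or Clang (the target attribute and __builtin_cpu_supports); under VC2008 the same _mm_crc32_u8 intrinsic should be available, but the feature check would go through __cpuid instead. One thing to keep in mind: the SSE4.2 instruction computes CRC-32C (the Castagnoli polynomial), not the CRC-32 used by zlib, so it is not a drop-in replacement for existing zlib-style checksums.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang on x86. Other compilers use only the fallback. */
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>  /* SSE4.2 intrinsics, including _mm_crc32_u8 */
#define HAVE_SSE42_PATH 1
#endif

/* Portable bitwise CRC-32C (reflected polynomial 0x82F63B78). */
static uint32_t crc32c_sw(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#ifdef HAVE_SSE42_PATH
/* Hardware path: one CRC32 instruction per input byte. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;
}
#endif

/* Uses the hardware path only when the running CPU reports SSE4.2. */
uint32_t crc32c(const void *data, size_t n)
{
#ifdef HAVE_SSE42_PATH
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw((const uint8_t *)data, n);
#endif
    return crc32c_sw((const uint8_t *)data, n);
}
```

Calling crc32c("123456789", 9) should give 0xE3069283, the standard CRC-32C check value, on either path. A production version would feed the instruction 4 or 8 bytes at a time (_mm_crc32_u32 / _mm_crc32_u64); the byte-wise loop is only there to keep the sketch short.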
I'd like to make a few general comments on SIMD so that folks understand the ramifications of using it. The customer base will always have a mix of processors, so there has to be a cost/benefit analysis of whether SIMD code will help more than it hurts. For loops that do the same thing over and over, it can make sense to ship multiple code paths for different processors, because the cost of deciding which path to run is small compared to the performance gain. For small improvements, the cost of choosing among multiple instruction sets can cancel out the gain. One way to get the best performance on each processor is to put out a separate kit per processor. Many of the unofficial builders have done this in the past; it is rather exhausting work, but it avoids the cost of switching between code paths at runtime. I think SSE2 is a comfortable baseline: the Pentium 4 was released in late 2000, so all recently sold computers should be SSE2-capable. SSE3 doesn't really add much. SSSE3 is useful, but it isn't available in any AMD processor and, on the Intel side, only in Core 2 Duo processors. Intel will release SSE4 in two parts, with some functionality in Penryn and the remainder in Nehalem (late 2008): Penryn will support SSE4.1, but CRC32 support arrives with SSE4.2 in Nehalem.
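The "multiple sets of code" approach for hot loops can be sketched roughly as follows: a scalar baseline, an SSE2 variant, and a function pointer resolved once so the feature check is not repeated per call. Everything here is illustrative rather than taken from the tree: the byte-summing kernel and the names (sum_bytes, sum_bytes_scalar, sum_bytes_sse2, pick_sum_bytes, HAVE_SSE2_PATH) are hypothetical, and the detection again assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang on x86. Other compilers use only the scalar loop. */
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <emmintrin.h>  /* SSE2 intrinsics */
#define HAVE_SSE2_PATH 1
#endif

typedef uint64_t (*sum_bytes_fn)(const uint8_t *, size_t);

/* Baseline scalar loop; runs on any processor. */
static uint64_t sum_bytes_scalar(const uint8_t *p, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}

#ifdef HAVE_SSE2_PATH
/* SSE2 loop: 16 bytes per iteration, summed with PSADBW against zero. */
__attribute__((target("sse2")))
static uint64_t sum_bytes_sse2(const uint8_t *p, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    /* The scalar loop handles the tail of fewer than 16 bytes. */
    return lanes[0] + lanes[1] + sum_bytes_scalar(p + i, n - i);
}
#endif

/* Picks the best implementation the running CPU supports. */
static sum_bytes_fn pick_sum_bytes(void)
{
#ifdef HAVE_SSE2_PATH
    if (__builtin_cpu_supports("sse2"))
        return sum_bytes_sse2;
#endif
    return sum_bytes_scalar;
}

/* Resolved on the first call and reused, so the check is paid once.
   Not synchronized; a real build would resolve this during startup. */
uint64_t sum_bytes(const uint8_t *p, size_t n)
{
    static sum_bytes_fn impl;  /* NULL until the first call */
    if (!impl)
        impl = pick_sum_bytes();
    return impl(p, n);
}
```

Even with the pointer cached, each call still pays for an indirect call, which is why this pattern fits long loops and not tiny helpers: for a few bytes of work the dispatch overhead can match or exceed what the SIMD path saves.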
FYI - SSE3 is supported by pretty much all 90nm AMD chips.
There is an existing project for this, sse4_crc32: https://github.com/Voxer/sse4_crc32