Open
Bug 165975
Opened 22 years ago
Updated 2 years ago
MMX version of nsID::Equals
Categories
(Core :: XPCOM, enhancement, P3)
Core
XPCOM
Tracking
NEW
People
(Reporter: alecf, Unassigned)
References
Details
(Keywords: perf)
Attachments
(1 file)
1.16 KB, patch
Why am I so obsessed with nsID::Equals? Because we use it a whole lot. Here's a patch that uses MMX to do a fast vector comparison of 128-bit IIDs.
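The patch itself is only attached, not shown inline. A minimal sketch of the idea, assuming a 16-byte ID stored contiguously in memory (the `IDLike` struct and `EqualsMMX` name below are illustrative stand-ins, not Mozilla's actual nsID API), might look like:

```cpp
#include <mmintrin.h>  // MMX intrinsics (x86/x86-64 only)
#include <cstring>

// Hypothetical stand-in for nsID: 128 bits of contiguous data.
struct IDLike {
  unsigned char bytes[16];
};

// Compare two 128-bit IDs with two 64-bit MMX compares.
bool EqualsMMX(const IDLike& a, const IDLike& b) {
  __m64 a0, a1, b0, b1;
  std::memcpy(&a0, a.bytes, 8);
  std::memcpy(&a1, a.bytes + 8, 8);
  std::memcpy(&b0, b.bytes, 8);
  std::memcpy(&b1, b.bytes + 8, 8);

  // pcmpeqd: each 32-bit lane becomes all-ones where the lanes are equal.
  __m64 eq = _mm_and_si64(_mm_cmpeq_pi32(a0, b0), _mm_cmpeq_pi32(a1, b1));

  // Extract both 32-bit halves; every lane must be 0xFFFFFFFF.
  int lo = _mm_cvtsi64_si32(eq);
  int hi = _mm_cvtsi64_si32(_mm_srli_si64(eq, 32));

  _mm_empty();  // emms: clear MMX state so x87 floating point works again
  return lo == -1 && hi == -1;
}
```

Note the `_mm_empty()` at the end: MMX registers alias the x87 floating-point stack, which is the mode-switch penalty raised later in this bug.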
Reporter
Comment 1•22 years ago
Here's the patch.. reviews?
Comment 3•22 years ago
Comment on attachment 97429 [details] [diff] [review]: use MMX when mmintrin.h is available. MMX is an extension of the Intel instruction set. Do we care about (or is there such a thing as) XP_WIN without MMX? Sorry for my ignorance. Other than that the patch looks good. How much does this speed up this function?
This sure doesn't look like it'll work on my P133 laptop :-) Actually, I have a better example: my parents' computer is a P200; we got it right before MMX came out. They really use it (well, I really use my laptop, but ...).
Reporter
Comment 5•22 years ago
timeless: we haven't decided if we're going to do this, and in fact I haven't even tested what kind of difference this makes. It might make no difference at all. It's just an idea I want to get some eyeballs on. There may be other such MMX accelerations we can do.
Reporter
Updated•22 years ago
Are you planning on run-time MMX detection? I would think that if Windows runs, so should Mozilla. There's a penalty for the context switch between MMX and floating point; I'm not sure it's worth it for short functions.
Reporter
Comment 7•22 years ago
OK, since this seems to confuse the heck out of people, let me help everyone out:
- This patch is a test. I'm going to do performance testing to make sure it is actually faster. It doesn't matter if anyone "is sure it's worth it" - the numbers will tell us one way or another.
- I'm not making any decisions at the moment about runtime testing for MMX, requiring an MMX system, etc. That's a larger decision beyond the scope of this bug.
- Checking in the above patch would change the set of platforms that Mozilla runs on, and clearly that decision needs to be made by a larger group.
- If we end up doing runtime testing for MMX and then using it when it's available, there may be some performance implications. Again, we'll only use MMX if it actually proves to be faster.

I'm going to do the right thing. Trust me.
Reporter
Comment 8•22 years ago
OK, so as someone predicted, I'm hitting a snag switching to MMX mode. Pushing this out while I work on more important things :)
Target Milestone: mozilla1.2alpha → mozilla1.2beta
Reporter
Updated•22 years ago
Target Milestone: mozilla1.2beta → mozilla1.3beta
Reporter
Comment 9•22 years ago
Adding James Rose for some advice from Intel...
Reporter
Updated•22 years ago
Target Milestone: mozilla1.3beta → mozilla1.4alpha
Reporter
Comment 10•21 years ago
mass moving lower risk 1.4alpha stuff to 1.4beta
Target Milestone: mozilla1.4alpha → mozilla1.4beta
Reporter
Updated•21 years ago
Target Milestone: mozilla1.4beta → mozilla1.5alpha
Comment 11•20 years ago
Close this as Wontfix?
Comment 12•20 years ago
I think that in general, speeding up this function is a worthy cause. This patch looks stale anyway. You might squeeze out some more performance by doing an assembler rep-loop compare on the IDs rather than the current individual struct-element compares. VC++ will inline memcmp to pretty much that at -O2, but I don't know if we've moved to -O2, or whether we could enable it for such time-critical code on a case-by-case basis. Using MMX might be an even better alternative, but I don't know if Alec is interested in continuing to pursue that. If neither he nor anyone else is, then maybe FUTURE or WONTFIX it.
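The memcmp alternative described above is straightforward; a sketch, using an illustrative `IDLike` type rather than the actual nsID declaration:

```cpp
#include <cstring>

// Illustrative stand-in for nsID: 128 bits of contiguous data.
struct IDLike {
  unsigned char bytes[16];
};

// One 16-byte memcmp instead of field-by-field comparison.
// At -O2, MSVC (and modern GCC/Clang) will typically expand this
// into a few wide loads and compares rather than a library call.
inline bool EqualsMemcmp(const IDLike& a, const IDLike& b) {
  return std::memcmp(a.bytes, b.bytes, sizeof(a.bytes)) == 0;
}
```

This relies on the struct having no padding, which holds for a plain 16-byte array.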
Reporter
Comment 13•20 years ago
I think it's worth pursuing... but I'll FUTURE it to get it off the 1.5alpha milestone :)
Target Milestone: mozilla1.5alpha → Future
Comment 14•20 years ago
The information below is for Pentium 4 processors. The problem with an MMX approach here is that the latency of the movq instruction is 6, and it looks like you need to do two or four of them depending on the data. The parallel compare has a latency of 2. I'm not sure what _m_to_int generates, but I'd guess a pack instruction with a latency of 2. The regular MOV and CMP instructions have latencies of 0.5 to 1 depending on the Pentium processor, so you have a lot to overcome if you want to use MMX (or SSE and SSE2 instructions, for that matter).

BTW, I wrote a very short SSE2 routine to do this in maybe seven assembler instructions and didn't see any performance benefit (subjectively speaking). Routines that typically benefit more are those that can better exploit parallelism or where more data is involved. You can overcome the latency issues by using multiple registers so that instructions can be processed in parallel, thereby reducing overall latency.

You can get latency information from the IA-32 Optimization Guide, in the back of the manual. It may be that the times are a lot lower on AMD processors; I'm getting one soon and am reading a bit of their optimization guide from time to time. One thing that helps when examining SIMD code written with intrinsics is showing the generated assembly code as well.
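The "very short SSE2 routine" mentioned above isn't attached to the bug; a sketch of what such a routine commonly looks like with intrinsics (the `IDLike` type and function name are illustrative assumptions) is:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Illustrative stand-in for nsID: 128 bits of contiguous data.
struct IDLike {
  unsigned char bytes[16];
};

// Single 128-bit compare: unaligned load, pcmpeqd, pmovmskb, test mask.
inline bool EqualsSSE2(const IDLike& a, const IDLike& b) {
  __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.bytes));
  __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.bytes));
  __m128i eq = _mm_cmpeq_epi32(va, vb);     // all-ones lanes where equal
  return _mm_movemask_epi8(eq) == 0xFFFF;   // every byte lane matched
}
```

Unlike MMX, SSE2 uses its own register file, so no emms-style mode switch is needed afterward.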
Reporter
Comment 15•20 years ago
Michael, you sound like quite the expert here! (And at the very least more of an expert than anyone else on the bug.) I'm trying to understand exactly what you're saying, but it sounds like you're saying that moving the data out of memory and into the registers is slow. But wouldn't we have to do that in order to make this comparison anyway? Maybe I'm just not getting this... :) Do you have a suggestion for making this work? This was the first time I had written any kind of assembly in over 5 years, let alone SIMD instructions :)
Comment 16•20 years ago
(In reply to comment #15)
> I'm trying to understand exactly what you're saying, but it sounds
> like you're saying that moving the data out of memory and into the
> registers is slow? But wouldn't we have to do that in order
> to make this comparison anyway? Maybe I'm just not getting this.. :)

See Appendix C of http://developer.intel.com/design/pentium4/manuals/248966.htm. Some instructions take longer than others. The traditional x86 instructions cmp and mov are very fast compared to movq, movdqa, movdqu, movaps and movups. If it takes six clocks to move two doublewords using an MMX instruction but one to two clocks to move two doublewords into the eax/ebx/ecx/edx registers, you're better off using traditional x86 instructions.

> Do you have a suggestion for making this work? This was the first time that I
> had written any kind of assembly in over 5 years, let alone SIMD instructions :)

It's hard to get the mathematics to add up to a win with SIMD here, at least given the latencies on the Pentium 4. I have some latency information on the Pentium 3 in a book at home. It may be that future processors will become more efficient with SIMD instructions, which could make for a win there.
Updated•18 years ago
QA Contact: scc → xpcom
Comment 17•17 years ago
Has anyone considered adding a reference equality check to nsID::Equals? Seems like an easier way to get a speedup.
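A sketch of that reference-equality fast path, assuming pointer-based callers (the type and names are illustrative, not the actual XPCOM signature):

```cpp
#include <cstring>

// Illustrative stand-in for nsID: 128 bits of contiguous data.
struct IDLike {
  unsigned char bytes[16];
};

// Check identity first: if both sides are the same object in memory,
// skip the 16-byte comparison entirely. XPCOM code frequently compares
// against well-known static IID constants, where this can hit often.
inline bool EqualsWithFastPath(const IDLike* a, const IDLike* b) {
  if (a == b) {
    return true;  // same object, trivially equal
  }
  return std::memcmp(a->bytes, b->bytes, sizeof(a->bytes)) == 0;
}
```

Whether this wins depends on how often the same-pointer case actually occurs, which is exactly the metric requested in the next comments.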
Comment 18•17 years ago
Just a note: Intel and AMD have been putting a fair amount of work into SIMD latencies, and it may be worthwhile to consider this again. The problem is legacy systems: it wouldn't be worthwhile to use the hardware advantages of newer processors if it meant slowing down those with older processors.
Comment 19•17 years ago
Regarding the reference-equality check, it would be interesting to see some metrics on how common that case is. As for the hardware solutions, would the boolean check cause any performance issues? Essentially you could check for the hardware features on startup and set a boolean flag if they're available. Alec's comment #7 would still be the rule, I would think.
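The startup check sketched above could look like this on GCC/Clang, using the compiler-provided `<cpuid.h>` helper (the flag and function names are illustrative; Mozilla's actual detection code is not shown in this bug):

```cpp
#include <cpuid.h>  // GCC/Clang wrapper for the CPUID instruction

static bool gHasSSE2 = false;

// Run once at startup: CPUID leaf 1 reports feature bits in EDX,
// where bit 23 is MMX and bit 26 is SSE2. Cache the result in a
// global flag so the hot path only pays for one predictable branch.
void DetectSIMDSupport() {
  unsigned int eax, ebx, ecx, edx;
  if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
    gHasSSE2 = (edx >> 26) & 1;
  }
}
```

As comment 20 notes, even that single branch can hurt in a routine this small and hot, which is the trade-off being debated.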
Comment 20•17 years ago
On an inlined routine this small, yes, the check could hurt more than it helps.
Updated•13 years ago
OS: Windows 2000 → All
Hardware: x86 → All
Updated•12 years ago
Assignee: alecf → nobody
Severity: normal → enhancement
Target Milestone: Future → ---
Updated•12 years ago
Status: ASSIGNED → NEW
Updated•2 years ago
Severity: normal → S3