Open Bug 165975 Opened 22 years ago Updated 2 years ago

MMX version of nsID::Equals

Categories: Core :: XPCOM, enhancement, P3
People: (Reporter: alecf, Unassigned)

Keywords: perf

Attachments: (1 file)

Why am I so obsessed with nsID::Equals? Because we use it a whole lot.

Here's a patch that uses MMX to do a fast vector comparison of 128-bit IIDs. Reviews?
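A rough sketch of what such an MMX comparison might look like (the flat `ID128` layout, function name, and intrinsic choices are illustrative assumptions for this sketch, not the attached patch; the real nsID is a struct with m0/m1/m2/m3 members):

```cpp
#include <mmintrin.h>  // MMX intrinsics (mmintrin.h, as the patch title says)
#include <cstdint>

// Hypothetical flat view of a 128-bit IID as four 32-bit words.
struct ID128 {
    uint32_t w[4];
};

// Two 64-bit MMX loads per ID, two parallel dword compares, one AND.
bool EqualsMMX(const ID128& a, const ID128& b) {
    const __m64* pa = reinterpret_cast<const __m64*>(a.w);
    const __m64* pb = reinterpret_cast<const __m64*>(b.w);
    // pcmpeqd: each 32-bit lane becomes all-ones where equal, zero where not.
    __m64 eq0 = _mm_cmpeq_pi32(pa[0], pb[0]);
    __m64 eq1 = _mm_cmpeq_pi32(pa[1], pb[1]);
    __m64 both = _mm_and_si64(eq0, eq1);
    // All four lanes matched iff both 32-bit halves of 'both' are all-ones.
    bool result = (_mm_cvtsi64_si32(both) == -1) &&
                  (_mm_cvtsi64_si32(_mm_srli_si64(both, 32)) == -1);
    _mm_empty();  // emms: clear MMX state so the x87 FPU can be used again
    return result;
}
```

Note the `_mm_empty()` at the end: MMX aliases the x87 register file, which is the MMX/floating-point context-switch cost raised later in this bug.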
oops, over to me.
Assignee: dougt → alecf
Comment on attachment 97429 [details] [diff] [review]
use MMX when mmintrin.h is available

MMX is an extension of the Intel instruction set. Do we care about (or is there such a thing as) XP_WIN without MMX? Sorry for my ignorance.

Other than that the patch looks good.  How much does this speed up this
function?
This sure doesn't look like it'll work on my P133 laptop :-)

Actually, I have a better example: my parents' computer is a P200 we got right before MMX came out. They really use it (well, I really use my laptop, but...).
timeless: we haven't decided if we're going to do this, and in fact I haven't even tested what kind of difference this makes. It might make no difference at all. It's just an idea I want to get some eyeballs on. There may be other such MMX accelerations that we can do.
Depends on: 165969
Status: NEW → ASSIGNED
Keywords: perf
Priority: -- → P3
Target Milestone: --- → mozilla1.2alpha
Are you planning on run-time MMX detection? I would think that if Windows runs, so should Mozilla.

There's a penalty for the context switch between MMX and floating point. I'm not sure it's worth it for short functions.
OK, since this seems to confuse the heck out of people, let me help everyone out:
- this patch is a test. I'm going to do performance testing to make sure it is actually faster. It doesn't matter if anyone "is sure it's worth it" - the numbers will tell us one way or another.
- I'm not making any decisions at the moment about runtime testing for MMX, requiring an MMX system, etc. - that's a larger decision beyond the scope of this bug. Checking in the above patch would change the set of platforms that Mozilla runs on, and clearly that decision needs to be made by a larger group.
- if we end up doing runtime testing for MMX and then using it if it's available, there may be some performance implications. Again, we'll only use MMX if it actually proves to be faster.

I'm going to do the right thing. Trust me.
OK, so like someone predicted, I'm hitting a snag switching to MMX mode. Pushing this out while I work on more important things :)
Target Milestone: mozilla1.2alpha → mozilla1.2beta
Target Milestone: mozilla1.2beta → mozilla1.3beta
Adding James Rose for some advice from Intel...
Target Milestone: mozilla1.3beta → mozilla1.4alpha
mass moving lower risk 1.4alpha stuff to 1.4beta
Target Milestone: mozilla1.4alpha → mozilla1.4beta
Target Milestone: mozilla1.4beta → mozilla1.5alpha
Close this as Wontfix?
I think in general speeding up this function is a worthy cause. Looks like this patch is stale anyway. You might squeeze out some more performance if you did an assembler rep-loop compare on the IDs rather than the current individual struct-element compares. VC++ will inline memcmp to pretty much that when you go to -O2, but I don't know if we've moved to -O2, or if we could for such time-critical code on a case-by-case basis.

Using MMX might be an even better alternative, but I don't know if Alec is interested in continuing to pursue that. If neither he nor anyone else is, then maybe FUTURE or WONTFIX it.
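The memcmp suggestion could be sketched like this (the `ID` struct mirrors the field layout of the real nsID, but `EqualsMemcmp` is an illustrative name, not actual Mozilla code):

```cpp
#include <cstring>
#include <cstdint>

// nsID-like struct: 4 + 2 + 2 + 8 = 16 bytes with no internal padding,
// which is what makes a whole-struct memcmp valid here.
struct ID {
    uint32_t m0;
    uint16_t m1, m2;
    uint8_t  m3[8];
};

// At -O2, VC++ and GCC expand a small fixed-size memcmp inline into wide
// compares rather than emitting a library call.
inline bool EqualsMemcmp(const ID& a, const ID& b) {
    return std::memcmp(&a, &b, sizeof(ID)) == 0;
}
```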
I think it's worth pursuing... but I'll future it to get it off the 1.5alpha milestone :)
Target Milestone: mozilla1.5alpha → Future
The information below is for Pentium 4 processors:

The problem with an MMX approach here is that the latency of the movq instruction is 6, and it looks like you need two or four of them depending on the data. The parallel compare has a latency of 2. I'm not sure what _m_to_int generates, but I'd guess a pack instruction with a latency of 2.

The regular MOV and CMP instructions have latencies of 0.5 to 1 depending on the
Pentium processor so you have a lot to overcome if you want to use MMX (or SSE
and SSE2 instructions for that matter). BTW, I wrote a very short SSE2 routine
to do this in maybe seven assembler instructions and didn't see any performance
benefit (subjectively speaking).

Routines that typically benefit more are those that can better exploit parallelism, or where there is more data involved. You can overcome the latency issues by using multiple registers so that instructions can be processed in parallel, thereby reducing overall latency.

You can get latency information in the IA-32 Optimization Guide in the back of
the manual.

It may be that the latencies are a lot lower on AMD processors. I'm getting one soon and am reading a bit of their optimization guide from time to time.

One thing that helps when examining SIMD code using intrinsics is if you
show the generated assembly code as well.
Michael, you sound like quite the expert here! (And at the very least more of an expert than anyone else on the bug.) I'm trying to understand exactly what you're saying, but it sounds like you're saying that moving the data out of memory and into the registers is slow? But wouldn't we have to do that in order to make this comparison anyway? Maybe I'm just not getting this.. :)

Do you have a suggestion for making this work? This was the first time that I had written any kind of assembly in over 5 years, let alone SIMD instructions :)
(In reply to comment #15)
> I'm trying to understand exactly what you're saying, but it sounds 
> like you're saying that moving the data out of memory and into the 
> registers is slow? But wouldn't we have to do that in order
> to make this comparison anyway? Maybe I'm just not getting this.. :)

See Appendix C of http://developer.intel.com/design/pentium4/manuals/248966.htm

Some instructions take longer than others. The traditional x86 instructions cmp and mov are very fast compared to movq, movdqa, movdqu, movaps and movups. If it takes six clocks to move two doublewords using an MMX instruction but one to two clocks to move two doublewords into the eax/ebx/ecx/edx registers, you're better off using traditional x86 instructions.

> Do you have a suggestion for making this work? This was the first time that I
> had written any kind of assembly in over 5 years, let alone SIMD instructions :)

It's hard to get the mathematics to add up where you get a win with SIMD here.
At least given the latencies on the Pentium 4. I have some latency information
on the Pentium 3 in a book at home. It may be that future processors will become
more efficient with SIMD instructions which could make for a win there.
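For reference, something along the lines of the "very short SSE2 routine" mentioned above might look like this (a reconstruction under assumptions, not Michael's actual code; the flat `ID128` layout and names are illustrative):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical flat view of a 128-bit IID as four 32-bit words.
struct ID128 {
    uint32_t w[4];
};

// One unaligned 16-byte load per ID, one parallel byte compare, one mask test.
inline bool EqualsSSE2(const ID128& a, const ID128& b) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.w));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.w));
    __m128i eq = _mm_cmpeq_epi8(va, vb);     // 0xFF per byte where equal
    return _mm_movemask_epi8(eq) == 0xFFFF;  // all 16 bytes matched
}
```

As the comment above notes, the movdqu latency can eat the benefit of the parallel compare on short data like this, which matches the "no subjective performance benefit" observation.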
Blocks: 203448
QA Contact: scc → xpcom
Has anyone considered adding a reference equality check to nsID::Equals?  Seems like an easier way to get a speedup.
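A sketch of that suggestion (names are illustrative; the real nsID::Equals compares the members directly, without an identity fast path):

```cpp
#include <cstring>
#include <cstdint>

// nsID-like struct (mirrors the real nsID field layout).
struct ID {
    uint32_t m0;
    uint16_t m1, m2;
    uint8_t  m3[8];
};

// Hypothetical fast path: if both sides are the same object, skip the
// field compare entirely. Whether this pays off depends on how often
// Equals is called on aliased pointers, hence the request for metrics.
inline bool EqualsWithIdentityCheck(const ID* self, const ID* other) {
    if (self == other)
        return true;  // same object, trivially equal
    return self->m0 == other->m0 &&
           self->m1 == other->m1 &&
           self->m2 == other->m2 &&
           std::memcmp(self->m3, other->m3, sizeof(self->m3)) == 0;
}
```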
Just a note: Intel and AMD have been putting a fair amount of work into SIMD latencies, and it may be worthwhile considering this again. The problem is with legacy systems. It wouldn't be worthwhile to use the hardware advantages of newer processors if it meant slowing down those with older processors.
Regarding the reference equality check, it would be interesting to see some metrics on how common that occurrence is.

As far as hardware solutions, would the boolean check cause any performance issues? Essentially you could check for the hardware features on startup and set the boolean flag if they're available. Alec's comment #7 would still be the rule, I would think.
On an inlined routine this small, yes, the check could hurt more than it helps.
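For illustration, the startup-flag idea might be sketched like this for GCC-style compilers (`DetectMMX` and `kHaveMMX` are hypothetical names; CPUID leaf 1, EDX bit 23 is the documented MMX feature bit):

```cpp
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>  // GCC/Clang helper for the CPUID instruction
#endif

// Query the CPU once; non-x86 targets simply report no MMX.
static bool DetectMMX() {
#if defined(__i386__) || defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (edx & (1u << 23)) != 0;  // CPUID.1:EDX bit 23 = MMX
#else
    return false;
#endif
}

// One check at startup; hot paths then test only this cached flag.
static const bool kHaveMMX = DetectMMX();
```

The branch on the cached flag is what the next comment is weighing: on a function this small and this inlined, even a well-predicted test-and-jump can cost more than it saves.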
OS: Windows 2000 → All
Hardware: x86 → All
Assignee: alecf → nobody
Severity: normal → enhancement
Target Milestone: Future → ---
Status: ASSIGNED → NEW
Severity: normal → S3