Open Bug 165975 Opened 22 years ago Updated 2 years ago

MMX version of nsID::Equals

Categories: Core :: XPCOM, enhancement, P3
People: (Reporter: alecf, Unassigned)

Keywords: perf

Attachments: (1 file)

Why am I so obsessed with nsID::Equals? Because we use it a whole lot.

Here's a patch that uses MMX to do a fast vector comparison of 128-bit IIDs. Reviews?
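A rough sketch of what such an MMX comparison might look like (the flat `ID128` layout, function name, and intrinsic choices are illustrative assumptions for this sketch, not the attached patch; the real nsID is a struct with m0/m1/m2/m3 members):

```cpp
#include <mmintrin.h>  // MMX intrinsics (mmintrin.h, as the patch title says)
#include <cstdint>

// Hypothetical flat view of a 128-bit IID as four 32-bit words.
struct ID128 {
    uint32_t w[4];
};

// Two 64-bit MMX loads per ID, two parallel dword compares, one AND.
bool EqualsMMX(const ID128& a, const ID128& b) {
    const __m64* pa = reinterpret_cast<const __m64*>(a.w);
    const __m64* pb = reinterpret_cast<const __m64*>(b.w);
    // pcmpeqd: each 32-bit lane becomes all-ones where equal, zero where not.
    __m64 eq0 = _mm_cmpeq_pi32(pa[0], pb[0]);
    __m64 eq1 = _mm_cmpeq_pi32(pa[1], pb[1]);
    __m64 both = _mm_and_si64(eq0, eq1);
    // All four lanes matched iff both 32-bit halves of 'both' are all-ones.
    bool result = (_mm_cvtsi64_si32(both) == -1) &&
                  (_mm_cvtsi64_si32(_mm_srli_si64(both, 32)) == -1);
    _mm_empty();  // emms: clear MMX state so the x87 FPU can be used again
    return result;
}
```

Note the `_mm_empty()` at the end: MMX aliases the x87 register file, which is the MMX/floating-point context-switch cost raised later in this bug.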
oops, over to me.
Assignee: dougt → alecf
Comment on attachment 97429 [details] [diff] [review]
use MMX when mmintrin.h is available

MMX is an extension of the Intel instruction set. Do we care about (or is there such a thing as) XP_WIN without MMX? Sorry for my ignorance.

Other than that the patch looks good.  How much does this speed up this
function?
This sure doesn't look like it'll work on my P133 laptop :-)

Actually, I have a better example: my parents' computer is a P200 we got right before MMX came out. They really use it (well, I really use my laptop, but...).
timeless: we haven't decided if we're going to do this, and in fact I haven't even tested what kind of difference this makes. It might make no difference at all. It's just an idea I want to get some eyeballs on. There may be other such MMX accelerations that we can do.
Depends on: 165969
Status: NEW → ASSIGNED
Keywords: perf
Priority: -- → P3
Target Milestone: --- → mozilla1.2alpha
Are you planning on run-time MMX detection? I would think that if Windows runs, so should Mozilla.

There's a penalty for the context switch between MMX and floating point. I'm not sure it's worth it for short functions.
OK, since this seems to confuse the heck out of people, let me help everyone out:
- this patch is a test. I'm going to do performance testing to make sure it is actually faster. It doesn't matter if anyone "is sure it's worth it" - the numbers will tell us one way or another.
- I'm not making any decisions at the moment about runtime testing for MMX, requiring an MMX system, etc. - that's a larger decision beyond the scope of this bug. Checking in the above patch would change the set of platforms that Mozilla runs on, and clearly that decision needs to be made by a larger group.
- if we end up doing runtime testing for MMX and then using it if it's available, there may be some performance implications. Again, we'll only use MMX if it actually proves to be faster.

I'm going to do the right thing. Trust me.
OK, so like someone predicted, I'm hitting a snag switching to MMX mode. Pushing this out while I work on more important things :)
Target Milestone: mozilla1.2alpha → mozilla1.2beta
Target Milestone: mozilla1.2beta → mozilla1.3beta
Adding James Rose for some advice from Intel...
Target Milestone: mozilla1.3beta → mozilla1.4alpha
mass moving lower risk 1.4alpha stuff to 1.4beta
Target Milestone: mozilla1.4alpha → mozilla1.4beta
Target Milestone: mozilla1.4beta → mozilla1.5alpha
Close this as Wontfix?
I think in general speeding up this function is a worthy cause. Looks like this patch is stale anyway. You might squeeze out some more performance if you did an assembler rep-loop compare on the IDs rather than the current individual struct-element compares. VC++ will inline memcmp to pretty much that when you go to -O2, but I don't know if we've moved to -O2, or if we could for such time-critical code on a case-by-case basis.

Using MMX might be an even better alternative, but I don't know if Alec is interested in continuing to pursue that. If neither he nor anyone else is, then maybe FUTURE or WONTFIX it.
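The memcmp suggestion could be sketched like this (the `ID` struct mirrors the field layout of the real nsID, but `EqualsMemcmp` is an illustrative name, not actual Mozilla code):

```cpp
#include <cstring>
#include <cstdint>

// nsID-like struct: 4 + 2 + 2 + 8 = 16 bytes with no internal padding,
// which is what makes a whole-struct memcmp valid here.
struct ID {
    uint32_t m0;
    uint16_t m1, m2;
    uint8_t  m3[8];
};

// At -O2, VC++ and GCC expand a small fixed-size memcmp inline into wide
// compares rather than emitting a library call.
inline bool EqualsMemcmp(const ID& a, const ID& b) {
    return std::memcmp(&a, &b, sizeof(ID)) == 0;
}
```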
I think it's worth pursuing... but I'll future it to get it off the 1.5alpha milestone :)
Target Milestone: mozilla1.5alpha → Future
The information below is for Pentium 4 processors:

The problem with an MMX approach here is that the latency of the movq instruction is 6, and it looks like you need two or four of them depending on the data. The parallel compare has a latency of 2. I'm not sure what _m_to_int generates, but I'd guess a pack instruction with a latency of 2.

The regular MOV and CMP instructions have latencies of 0.5 to 1 depending on the
Pentium processor so you have a lot to overcome if you want to use MMX (or SSE
and SSE2 instructions for that matter). BTW, I wrote a very short SSE2 routine
to do this in maybe seven assembler instructions and didn't see any performance
benefit (subjectively speaking).

Routines that typically benefit more are those that can better exploit parallelism, or where there is more data involved. You can overcome the latency issues by using multiple registers so that instructions can be processed in parallel, thereby reducing overall latency.

You can get latency information in the IA-32 Optimization Guide in the back of
the manual.

It may be that the latencies are a lot lower on AMD processors. I'm getting one soon and am reading a bit of their optimization guide from time to time.

One thing that helps when examining SIMD code using intrinsics is if you
show the generated assembly code as well.
Michael, you sound like quite the expert here! (And at the very least more of an expert than anyone else on the bug.) I'm trying to understand exactly what you're saying, but it sounds like you're saying that moving the data out of memory and into the registers is slow? But wouldn't we have to do that in order to make this comparison anyway? Maybe I'm just not getting this.. :)

Do you have a suggestion for making this work? This was the first time that I had written any kind of assembly in over 5 years, let alone SIMD instructions :)
(In reply to comment #15)
> I'm trying to understand exactly what you're saying, but it sounds 
> like you're saying that moving the data out of memory and into the 
> registers is slow? But wouldn't we have to do that in order
> to make this comparison anyway? Maybe I'm just not getting this.. :)

See Appendix C of http://developer.intel.com/design/pentium4/manuals/248966.htm

Some instructions take longer than others. The traditional x86 instructions cmp and mov are very fast compared to movq, movdqa, movdqu, movaps and movups. If it takes six clocks to move two doublewords using an MMX instruction but one to two clocks to move two doublewords into the eax/ebx/ecx/edx registers, you're better off using traditional x86 instructions.

> Do you have a suggestion for making this work? This was the first time that I
> had written any kind of assembly in over 5 years, let alone SIMD instructions :)

It's hard to get the mathematics to add up where you get a win with SIMD here.
At least given the latencies on the Pentium 4. I have some latency information
on the Pentium 3 in a book at home. It may be that future processors will become
more efficient with SIMD instructions which could make for a win there.
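For reference, something along the lines of the "very short SSE2 routine" mentioned above might look like this (a reconstruction under assumptions, not Michael's actual code; the flat `ID128` layout and names are illustrative):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical flat view of a 128-bit IID as four 32-bit words.
struct ID128 {
    uint32_t w[4];
};

// One unaligned 16-byte load per ID, one parallel byte compare, one mask test.
inline bool EqualsSSE2(const ID128& a, const ID128& b) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.w));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.w));
    __m128i eq = _mm_cmpeq_epi8(va, vb);     // 0xFF per byte where equal
    return _mm_movemask_epi8(eq) == 0xFFFF;  // all 16 bytes matched
}
```

As the comment above notes, the movdqu latency can eat the benefit of the parallel compare on short data like this, which matches the "no subjective performance benefit" observation.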
Blocks: 203448
QA Contact: scc → xpcom
Has anyone considered adding a reference equality check to nsID::Equals?  Seems like an easier way to get a speedup.
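A sketch of that suggestion (names are illustrative; the real nsID::Equals compares the members directly, without an identity fast path):

```cpp
#include <cstring>
#include <cstdint>

// nsID-like struct (mirrors the real nsID field layout).
struct ID {
    uint32_t m0;
    uint16_t m1, m2;
    uint8_t  m3[8];
};

// Hypothetical fast path: if both sides are the same object, skip the
// field compare entirely. Whether this pays off depends on how often
// Equals is called on aliased pointers, hence the request for metrics.
inline bool EqualsWithIdentityCheck(const ID* self, const ID* other) {
    if (self == other)
        return true;  // same object, trivially equal
    return self->m0 == other->m0 &&
           self->m1 == other->m1 &&
           self->m2 == other->m2 &&
           std::memcmp(self->m3, other->m3, sizeof(self->m3)) == 0;
}
```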
Just a note: Intel and AMD have been putting a fair amount of work into SIMD latencies, and it may be worthwhile considering this again. The problem is with legacy systems. It wouldn't be worthwhile to use the hardware advantages of newer processors if it meant slowing down those with older processors.
Regarding the reference equality check, it would be interesting to see some metrics on how common that occurrence is.

As far as hardware solutions, would the boolean check cause any performance issues? Essentially you could check for the hardware features on startup and set the boolean flag if they're available. Alec's comment #7 would still be the rule, I would think.
On an inlined routine this small, yes, the check could hurt more than it helps.
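For illustration, the startup-flag idea might be sketched like this for GCC-style compilers (`DetectMMX` and `kHaveMMX` are hypothetical names; CPUID leaf 1, EDX bit 23 is the documented MMX feature bit):

```cpp
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>  // GCC/Clang helper for the CPUID instruction
#endif

// Query the CPU once; non-x86 targets simply report no MMX.
static bool DetectMMX() {
#if defined(__i386__) || defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (edx & (1u << 23)) != 0;  // CPUID.1:EDX bit 23 = MMX
#else
    return false;
#endif
}

// One check at startup; hot paths then test only this cached flag.
static const bool kHaveMMX = DetectMMX();
```

The branch on the cached flag is what the next comment is weighing: on a function this small and this inlined, even a well-predicted test-and-jump can cost more than it saves.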
OS: Windows 2000 → All
Hardware: x86 → All
Assignee: alecf → nobody
Severity: normal → enhancement
Target Milestone: Future → ---
Status: ASSIGNED → NEW
Severity: normal → S3