164580 - speed up nsID::Equals

Reporter

Description

•

23 years ago

I know that the structure is listed as frozen, but I'm hoping we can update Equals() - VC++ generates some wacky code for this: 74 inline PRBool Equals(const nsID& other) const { 75 return (PRBool) 76 ((((PRUint32*) &m0)[0] == ((PRUint32*) &other.m0)[0]) && 77 (((PRUint32*) &m0)[1] == ((PRUint32*) &other.m0)[1]) && 78 (((PRUint32*) &m0)[2] == ((PRUint32*) &other.m0)[2]) && 79 (((PRUint32*) &m0)[3] == ((PRUint32*) &other.m0)[3])); 80 } it does this wierd thing where it keeps re-adding to m0's base address for each comparison..according to vtune's timer sampling, nsID::Equals is called with the 2nd most time spent in it in all of gkcontent.dll (its inline, thats why it shows up there) changing this to a memcmp() moved this function to the 3rd most time, a slight improvement. patch forthcoming.

Alec Flett

Reporter

Comment 1

•

23 years ago

Attached patch use memcmp() instead — Details — Splinter Review

I think that eventually we should try to come up with an MMX-based solution, because vector comparisons are the P6's bread and butter.

Alec Flett

Reporter

Updated

•

23 years ago

Status: NEW → ASSIGNED

Priority: -- → P1

Target Milestone: --- → mozilla1.2alpha

Alec Flett

Reporter

Updated

•

23 years ago

Whiteboard: fix in hand

Bradley Baetz (:bbaetz)

Comment 2

•

23 years ago

Let the compiler optmisise |memcpy| for you - thats its job. (Not to mention the alignment issues) I don't think that changing the frozen file is a problem - the vtable remains the same, and the method is self contained, so there aren't issues with older callers even though the routine is |inline| No compiler will add trailing (or internal) padding to that struct layout (which will affect sizeof), will they? G++ on linux doesn't, though. I'm reasonably certain that the ccurrent casting of the PRUint16s to PRUint32s breaks the C aliasing rules. (PRUInt8 is ok, because they're char*s, which are allowed). This means that this code is better, I think. g++'s asm output looks better, too. r=bbaetz if you're sure that none of the ports are going to break by your implict casting of a const nsIID* to a const void* for memcmp

Doug Turner (:dougt)

Comment 3

•

23 years ago

Comment on attachment 96657 [details] [diff] [review] use memcmp() instead what Bradley said. r=dougt

Attachment #96657 - Flags: review+

Darin Fisher

Comment 4

•

23 years ago

Comment on attachment 96657 [details] [diff] [review] use memcmp() instead sr=darin

Attachment #96657 - Flags: superreview+

Alec Flett

Reporter

Comment 5

•

23 years ago

yep, everything you said is right, that might explain the wierd assembly for the old code that I saw on windows! Padding should be ok, because of the order of elements in the structure (largest to smallest) and I think that the old code would have bonked on padding anyway. fix checked in, I'm curious if this makes us measurably faster in any way...

Status: ASSIGNED → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Alec Flett

Reporter

Comment 6

•

23 years ago

wow, so this actually slowed us down on linux. Go figure. According to VTune, we're faster on windows. so I #ifdef'ed this out for XP_UNIX.. I don't know about mac so I left mac with memcpy() since its appearently a toss up.

Bradley Baetz (:bbaetz)

Comment 7

•

23 years ago

alecf: Oops. I tested with -O2, and linux doesn't optimise memcpy with just -O... Hows the use-g++-3.2 project coming? :)

Johnny Stenback (:jst)

Comment 8

•

22 years ago

FYI, this didn't speed us up at all, at least not on windows. This bug caused the DOM perf regression bug 186133. I say we back this out.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Doug Turner (:dougt)

Comment 9

•

22 years ago

if this gets back out, please mention this bug in a comment.

David Bradley

Comment 10

•

22 years ago

Just a note for Windows. memcmp/memcpy get inlined at -O2 and become just a few bytes of assembler, but we're compiling at -O1, so that's probably accounts for the decline on Windows.

Michael Moy

Comment 11

•

21 years ago

(In reply to comment #10) > Just a note for Windows. memcmp/memcpy get inlined at -O2 and become just a few > bytes of assembler, but we're compiling at -O1, so that's probably accounts for > the decline on Windows. You could always just take the -O2 output assembler code and inline assembler it. I had a look at the memcpy inlined code at -O2 and spent some time with the Athlon 64 Optimization Guide looking at the rep cmpsd latency. It's 11 + 10/3 * cycles which gives us a max of 23 cycles. Latancy for the current -O2 generated code is at least 24 (4 compares of 2 memory lookups @ 3 cycles per memory lookup). There's the cost for the jmps and some potential branching costs. I don't know what the data looks like so I don't know if there are more hits or more misses or what the statistics are for individual doublewords so I won't know if camparison reordering would provide a win or not. Some stats on the nature of the data could be used to improve performance by putting more common differences earlier in the comparison tree.

Michael Moy

Comment 12

•

21 years ago

I ran statistics on the structure in comparison and fuond that matches and non-matches are generally about even with a variance of about 12% either way. Statistics on the four doubleword compares shows that the first doubleword match finds a non-match 99% to 100% of the time when there is a non-match of the structure. So in general, when there is a non-match, the algorithm finds it in the first compare and it's probably unlikely that this part can be improved. The question is can the comparison of the other three doublewords be improved? Some possibilities (but latencies would have to be checked): - Use instruction parallelism to do multiple loads in parallel so that data that we may need later will be available immediately on the later compares. Need registers to hold these though and there may be some push/pop cost for using additional registers. - See if the cmove instruction can be used to get rid of branches. - See if we can use SSE2 instructions to compare a quadword while the ALU compares the other doubleword in parallel and then merge the results at the end. The last is only good with SSE2 and the middle with SSE and SSE2. The first would work with any Pentium. I don't know when instruction parallelism was added though. I'm pretty sure that it's in the Pentium 3.

Michael Moy

Comment 13

•

20 years ago

I did some benchmarking and found that VS2005 (Beta) memcpy took a little under 14 seconds (on a relative basis), the current nsID.h routine takes about 10.8 seconds to do all four compares, and an SSE2 implementation takes 8 seconds. But the SSE2 implementation requires that the structures be aligned on quadword boundaries. A little testing showed that this structure is doubleword aligned. The SSE2 implementation on doubleword aligned data is similar to memcmp. Now the current implementation doesn't actuall take the whole time as it will terminate if there is an inequality on the first or second or third. I'd guess that the overall cost based on profiling I did a while ago would be about 75% of the 10.8 seconds. The biggest problem in using an SIMD solution is finding a way to set EFLAGS inexpensively. So far the cheapest way I've found is to do a packed compare, then a mov byte mask to a register and then a compare on the register to set EFLAGS. The move byte mask is about the fastest way to get data from an SIMD register to a GPR. Doing it with MMX registers doesn't provide a win as you only get 8 bytes. The SIMD 16 bytes provides a decent win.

David Bradley

Comment 14

•

20 years ago

Which version of memcpy were you benchmarking with? The inline version or the function veresion?

Michael Moy

Comment 15

•

20 years ago

I compiled with -O2. I suppose that I could have tried with -GL. My tests were with long strings, not the 16 bytes here. I have walked through the memcmp generated code before and it isn't simple code. What's needed in nsID is simple code. What's generated with the current algorithm is very efficient if compiled -O2. One of the reasons why the unofficial builds run much faster.

Michael Moy

Comment 16

•

20 years ago

Some rather interesting tests indicate that the first byte and the last byte appear to determine uniqueness between nsIDs. I ran a build commenting out the two conditionals in the middle and didn't run into any problems while visiting a few dozen different sites. I don't think a change like this would get approved but it's interesting that this might work.

Michael Moy

Comment 17

•

20 years ago

Attached patch nsID.h patch to speed up ::equals — Details — Splinter Review

This patch has been in use in unofficial builds for several months and I haven't had any reported problems from the change. Comments can be added later if this patch is up for consideration.

Michael Moy

Updated

•

20 years ago

Attachment #174351 - Flags: review?(dougt)

Darin Fisher

Comment 18

•

20 years ago

Comment on attachment 174351 [details] [diff] [review] nsID.h patch to speed up ::equals no way man! ;-) even if all the UUIDs in use by mozilla today can be uniquely identified by the first and last 32-bits, that's not sufficient. mozilla is an extensible platform. people can create new UUIDs, so there is no way for you to know up front that testing only those bits is sufficient. this is just begging for a crash.

Attachment #174351 - Flags: review?(dougt) → review-

Michael Moy

Comment 19

•

20 years ago

Okay. I kind of expected that. But I'll leave it in my unofficial builds given that the .4% overall speedup that it provides. (In reply to comment #18) > (From update of attachment 174351 [details] [diff] [review] [edit]) > no way man! ;-) even if all the UUIDs in use by mozilla today can be uniquely > identified by the first and last 32-bits, that's not sufficient. mozilla is an > extensible platform. people can create new UUIDs, so there is no way for you > to know up front that testing only those bits is sufficient. this is just > begging for a crash.

David Bradley

Comment 20

•

20 years ago

.4% doesn't seem like much compared to the potential to crash. I'm not sure how MSCOM lays out UUID's but there's various parts of Moz that deals with them, IIRC. I know the scripting ActiveX logic does, and I imagine the Moz embedded as an ActiveX control does as well.

David Bradley

Comment 21

•

20 years ago

.4% doesn't seem like much compared to the potential to crash. I'm not sure how MSCOM lays out UUID's but there's various parts of Moz that deals with them, IIRC. I know the scripting ActiveX logic does, and I imagine the Moz embedded as an ActiveX control does as well.

David Bradley

Comment 22

•

20 years ago

Would it be worth changing the order of evaluation. If the second and third bytes aren't unique, testing the first and fourth first might yield a boost without sacrificing the entire compare.

Michael Moy

Comment 23

•

20 years ago

(In reply to comment #21) > .4% doesn't seem like much compared to the potential to crash. I'm not sure how > MSCOM lays out UUID's but there's various parts of Moz that deals with them, > IIRC. I know the scripting ActiveX logic does, and I imagine the Moz embedded as > an ActiveX control does as well. Well, the shortened code has been out there in Seamonkey, Firefox and Thunderbird for several months and is in Moox' builds which get a pretty wide distribution and I haven't heard of any crashes yet. When there's a problem with one of my performance patches, I usually hear about it pretty quickly.

Michael Moy

Comment 24

•

20 years ago

(In reply to comment #22) > Would it be worth changing the order of evaluation. If the second and third > bytes aren't unique, testing the first and fourth first might yield a boost > without sacrificing the entire compare. See post #12 above. The code (compiled with -O2) actually works pretty well for non-matches. But for matches, you have to check everything that you want to check to confirm a match so I don't think that changing the order will make that big a difference. The easiest and best (IMO) way to get a performance boost for this routine (absent skipping the two middle doublewords) is to compile Mozilla -O2. This will provide benefits elsewhere too. Unofficial builders have been using -O2 for over a year with no problems. When we all have 64-bit systems (I already have one) and 64-bit Mozilla, we can just change the code to do two quadword compares instead of four doubleword compares. I don't know if the Linux 64-bit x86 compiler is smart enough to do that yet but it will be a piece of cake to do on Windows (64-bit).

Michael Moy

Comment 25

•

20 years ago

Just a three-month checkup. This patch is being used in my OSX builds since release 1.0.3 and is being used by most (if not all) of the other unofficial Windows builders. It's been picked up by a few of the unofficial Linux builders as well. I haven't heard of any problems, crash or otherwise, with this code change.

Michael Moy

Comment 26

•

20 years ago

Attached patch ::equals patch for Windows 64-bit (obsolete) — Details — Splinter Review

A patch that works on Firefox for Windows XP 64-bit edition. Modifications could also be done for 64-bit Linux but I don't have such a platform.

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

19 years ago

QA Contact: scc → xpcom

mmc

Comment 27

•

17 years ago

Could this be resurrected?

Ryan VanderMeulen [:RyanVM]

Updated

•

17 years ago

OS: Windows 2000 → All

Hardware: PC → All

Target Milestone: mozilla1.2alpha → Future

Steve Snyder

Comment 28

•

15 years ago

Attached patch Use NSPR definitions instead of compiler-specific defs — Details — Splinter Review

Assignee: alecf → swsnyder

use memcmp() instead 23 years ago Alec Flett 719 bytes, patch	dougt : review+ darin.moz : superreview+	Details \| Diff \| Splinter Review
nsID.h patch to speed up ::equals 20 years ago Michael Moy 710 bytes, patch	darin.moz : review-	Details \| Diff \| Splinter Review
::equals patch for Windows 64-bit 20 years ago Michael Moy 826 bytes, patch		Details \| Diff \| Splinter Review
Use NSPR definitions instead of compiler-specific defs 15 years ago Steve Snyder 758 bytes, patch		Details \| Diff \| Splinter Review
Use 64-bit words with no guard 14 years ago Justin Lebar (not reading bugmail) 1.40 KB, patch	benjamin : review+	Details \| Diff \| Splinter Review
Backout 14 years ago Justin Lebar (not reading bugmail) 1.40 KB, patch	glandium : review+	Details \| Diff \| Splinter Review