Last Comment Bug 164580 - speed up nsID::Equals
: speed up nsID::Equals
Status: RESOLVED WONTFIX
fix in hand
:
Product: Core
Classification: Components
Component: XPCOM (show other bugs)
: Trunk
: All All
: P1 normal with 2 votes (vote)
: ---
Assigned To: Justin Lebar (not reading bugmail)
:
Mentors:
Depends on: 659546
Blocks: 657011
  Show dependency treegraph
 
Reported: 2002-08-25 23:59 PDT by Alec Flett
Modified: 2013-06-03 07:50 PDT (History)
24 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
use memcmp() instead (719 bytes, patch)
2002-08-26 00:01 PDT, Alec Flett
dougt: review+
darin.moz: superreview+
Details | Diff | Review
nsID.h patch to speed up ::equals (710 bytes, patch)
2005-02-14 21:22 PST, Michael Moy
darin.moz: review-
Details | Diff | Review
::equals patch for Windows 64-bit (826 bytes, patch)
2005-08-13 18:10 PDT, Michael Moy
no flags Details | Diff | Review
Use NSPR definitions instead of compiler-specific defs (758 bytes, patch)
2010-10-14 12:02 PDT, Steve Snyder
no flags Details | Diff | Review
Use 64-bit words with no guard (1.40 KB, patch)
2011-05-16 07:44 PDT, Justin Lebar (not reading bugmail)
benjamin: review+
Details | Diff | Review
Backout (1.40 KB, patch)
2011-10-16 12:28 PDT, Justin Lebar (not reading bugmail)
mh+mozilla: review+
Details | Diff | Review

Description Alec Flett 2002-08-25 23:59:30 PDT
I know that the structure is listed as frozen, but I'm hoping we can update
Equals() - VC++ generates some wacky code for this:

 74   inline PRBool Equals(const nsID& other) const {
 75     return (PRBool)
 76       ((((PRUint32*) &m0)[0] == ((PRUint32*) &other.m0)[0]) &&
 77        (((PRUint32*) &m0)[1] == ((PRUint32*) &other.m0)[1]) &&
 78        (((PRUint32*) &m0)[2] == ((PRUint32*) &other.m0)[2]) &&
 79        (((PRUint32*) &m0)[3] == ((PRUint32*) &other.m0)[3]));
 80   }

it does this wierd thing where it keeps re-adding to m0's base address for each
comparison..according to vtune's timer sampling, nsID::Equals is called with the
2nd most time spent in it in all of gkcontent.dll (its inline, thats why it
shows up there)

changing this to a memcmp() moved this function to the 3rd most time, a slight
improvement. patch forthcoming.
Comment 1 Alec Flett 2002-08-26 00:01:28 PDT
Created attachment 96657 [details] [diff] [review]
use memcmp() instead

I think that eventually we should try to come up with an MMX-based solution,
because vector comparisons are the P6's bread and butter.
Comment 2 Bradley Baetz (:bbaetz) 2002-08-26 06:53:49 PDT
Let the compiler optmisise |memcpy| for you - thats its job. (Not to mention the
alignment issues)

I don't think that changing the frozen file is a problem - the vtable remains
the same, and the method is self contained, so there aren't issues with older
callers even though the routine is |inline|

No compiler will add trailing (or internal) padding to that struct layout (which
will affect sizeof), will they? G++ on linux doesn't, though.

I'm reasonably certain that the ccurrent casting of the PRUint16s to PRUint32s
breaks the C aliasing rules. (PRUInt8 is ok, because they're char*s, which are
allowed). This means that this code is better, I think. g++'s asm output looks
better, too.

r=bbaetz if you're sure that none of the ports are going to break by your
implict casting of a const nsIID* to a const void* for memcmp
Comment 3 Doug Turner (:dougt) 2002-08-26 10:18:25 PDT
Comment on attachment 96657 [details] [diff] [review]
use memcmp() instead

what Bradley said.

r=dougt
Comment 4 Darin Fisher 2002-08-26 10:58:10 PDT
Comment on attachment 96657 [details] [diff] [review]
use memcmp() instead

sr=darin
Comment 5 Alec Flett 2002-08-26 13:58:44 PDT
yep, everything you said is right, that might explain the wierd assembly for the
old code that I saw on windows!
Padding should be ok, because of the order of elements in the structure (largest
to smallest) and I think that the old code would have bonked on padding anyway. 

fix checked in, I'm curious if this makes us measurably faster in any way...
Comment 6 Alec Flett 2002-08-26 22:03:12 PDT
wow, so this actually slowed us down on linux. Go figure. According to VTune,
we're faster on windows.

so I #ifdef'ed this out for XP_UNIX.. I don't know about mac so I left mac with
memcpy() since its appearently a toss up.
Comment 7 Bradley Baetz (:bbaetz) 2002-08-27 01:08:26 PDT
alecf: Oops. I tested with -O2, and linux doesn't optimise memcpy with just -O...

Hows the use-g++-3.2 project coming? :)
Comment 8 Johnny Stenback (:jst, jst@mozilla.com) 2003-01-30 17:52:42 PST
FYI, this didn't speed us up at all, at least not on windows. This bug caused
the DOM perf regression bug 186133. I say we back this out.
Comment 9 Doug Turner (:dougt) 2003-01-30 18:03:08 PST
if this gets back out, please mention this bug in a comment.
Comment 10 David Bradley 2003-02-10 20:22:10 PST
Just a note for Windows. memcmp/memcpy get inlined at -O2 and become just a few
bytes of assembler, but we're compiling at -O1, so that's probably accounts for
the decline on Windows.
Comment 11 Michael Moy 2004-08-10 20:21:44 PDT
(In reply to comment #10)
> Just a note for Windows. memcmp/memcpy get inlined at -O2 and become just a few
> bytes of assembler, but we're compiling at -O1, so that's probably accounts for
> the decline on Windows.

You could always just take the -O2 output assembler code and inline assembler it.

I had a look at the memcpy inlined code at -O2 and spent some time with the
Athlon 64 Optimization Guide looking at the rep cmpsd latency. It's 11 + 10/3 *
cycles which gives us a max of 23 cycles. Latancy for the current -O2
generated code is at least 24 (4 compares of 2 memory lookups @ 3 cycles per
memory lookup). There's the cost for the jmps and some potential branching
costs.

I don't know what the data looks like so I don't know if there are more hits
or more misses or what the statistics are for individual doublewords so I
won't know if camparison reordering would provide a win or not. Some stats
on the nature of the data could be used to improve performance by putting
more common differences earlier in the comparison tree.
Comment 12 Michael Moy 2004-08-11 06:24:38 PDT
I ran statistics on the structure in comparison and fuond that matches and
non-matches are generally about even with a variance of about 12% either way.

Statistics on the four doubleword compares shows that the first doubleword match
finds a non-match 99% to 100% of the time when there is a non-match of the
structure.

So in general, when there is a non-match, the algorithm finds it in the first
compare and it's probably unlikely that this part can be improved. The question
is can the comparison of the other three doublewords be improved?

Some possibilities (but latencies would have to be checked):

- Use instruction parallelism to do multiple loads in parallel so that data
  that we may need later will be available immediately on the later compares.
  Need registers to hold these though and there may be some push/pop cost for
  using additional registers.
- See if the cmove instruction can be used to get rid of branches.
- See if we can use SSE2 instructions to compare a quadword while the ALU
  compares the other doubleword in parallel and then merge the results at the
  end.

The last is only good with SSE2 and the middle with SSE and SSE2. The first
would work with any Pentium. I don't know when instruction parallelism was
added though. I'm pretty sure that it's in the Pentium 3.
Comment 13 Michael Moy 2004-12-10 12:10:03 PST
I did some benchmarking and found that VS2005 (Beta) memcpy took a little
under 14 seconds (on a relative basis), the current nsID.h routine takes about
10.8 seconds to do all four compares, and an SSE2 implementation takes 8 seconds.
But the SSE2 implementation requires that the structures be aligned on quadword
boundaries. A little testing showed that this structure is doubleword aligned.

The SSE2 implementation on doubleword aligned data is similar to memcmp. Now the
current implementation doesn't actuall take the whole time as it will terminate
if there is an inequality on the first or second or third. I'd guess that the
overall cost based on profiling I did a while ago would be about 75% of the 10.8
seconds.

The biggest problem in using an SIMD solution is finding a way to set EFLAGS
inexpensively. So far the cheapest way I've found is to do a packed compare,
then a mov byte mask to a register and then a compare on the register to set
EFLAGS. The move byte mask is about the fastest way to get data from an SIMD
register to a GPR. Doing it with MMX registers doesn't provide a win as you
only get 8 bytes. The SIMD 16 bytes provides a decent win.
Comment 14 David Bradley 2004-12-10 12:17:20 PST
Which version of memcpy were you benchmarking with? The inline version or the
function veresion?
Comment 15 Michael Moy 2004-12-10 15:16:34 PST
I compiled with -O2. I suppose that I could have tried with -GL. My tests were
with long strings, not the 16 bytes here.

I have walked through the memcmp generated code before and it isn't simple code.
What's needed in nsID is simple code. What's generated with the current algorithm
is very efficient if compiled -O2. One of the reasons why the unofficial builds
run much faster.
Comment 16 Michael Moy 2004-12-12 20:39:22 PST
Some rather interesting tests indicate that the first byte and the last byte
appear to determine uniqueness between nsIDs. I ran a build commenting out the
two conditionals in the middle and didn't run into any problems while visiting
a few dozen different sites. I don't think a change like this would get approved
but it's interesting that this might work.
Comment 17 Michael Moy 2005-02-14 21:22:46 PST
Created attachment 174351 [details] [diff] [review]
nsID.h patch to speed up ::equals

This patch has been in use in unofficial builds for several months and I
haven't
had any reported problems from the change. Comments can be added later if this
patch is up for consideration.
Comment 18 Darin Fisher 2005-02-15 00:10:06 PST
Comment on attachment 174351 [details] [diff] [review]
nsID.h patch to speed up ::equals

no way man! ;-)  even if all the UUIDs in use by mozilla today can be uniquely
identified by the first and last 32-bits, that's not sufficient.  mozilla is an
extensible platform.  people can create new UUIDs, so there is no way for you
to know up front that testing only those bits is sufficient.  this is just
begging for a crash.
Comment 19 Michael Moy 2005-02-15 04:36:41 PST
Okay. I kind of expected that. But I'll leave it in my unofficial builds given
that the .4% overall speedup that it provides.

(In reply to comment #18)
> (From update of attachment 174351 [details] [diff] [review] [edit])
> no way man! ;-)  even if all the UUIDs in use by mozilla today can be uniquely
> identified by the first and last 32-bits, that's not sufficient.  mozilla is an
> extensible platform.  people can create new UUIDs, so there is no way for you
> to know up front that testing only those bits is sufficient.  this is just
> begging for a crash.
Comment 20 David Bradley 2005-02-15 05:38:42 PST
.4% doesn't seem like much compared to the potential to crash. I'm not sure how 
MSCOM lays out UUID's but there's various parts of Moz that deals with them, 
IIRC. I know the scripting ActiveX logic does, and I imagine the Moz embedded 
as an ActiveX control does as well.
Comment 21 David Bradley 2005-02-15 05:39:20 PST
.4% doesn't seem like much compared to the potential to crash. I'm not sure how
MSCOM lays out UUID's but there's various parts of Moz that deals with them,
IIRC. I know the scripting ActiveX logic does, and I imagine the Moz embedded as
an ActiveX control does as well.
Comment 22 David Bradley 2005-02-15 05:41:42 PST
Would it be worth changing the order of evaluation. If the second and third
bytes aren't unique, testing the first and fourth first might yield a boost
without sacrificing the entire compare.
Comment 23 Michael Moy 2005-02-15 06:04:07 PST
(In reply to comment #21)
> .4% doesn't seem like much compared to the potential to crash. I'm not sure how
> MSCOM lays out UUID's but there's various parts of Moz that deals with them,
> IIRC. I know the scripting ActiveX logic does, and I imagine the Moz embedded as
> an ActiveX control does as well.

Well, the shortened code has been out there in Seamonkey, Firefox and Thunderbird
for several months and is in Moox' builds which get a pretty wide distribution
and I haven't heard of any crashes yet. When there's a problem with one of my
performance patches, I usually hear about it pretty quickly.
Comment 24 Michael Moy 2005-02-15 06:14:26 PST
(In reply to comment #22)
> Would it be worth changing the order of evaluation. If the second and third
> bytes aren't unique, testing the first and fourth first might yield a boost
> without sacrificing the entire compare.

See post #12 above. The code (compiled with -O2) actually works pretty well for
non-matches. But for matches, you have to check everything that you want to
check to confirm a match so I don't think that changing the order will make that
big a difference.

The easiest and best (IMO) way to get a performance boost for this routine
(absent skipping the two middle doublewords) is to compile Mozilla -O2. This
will provide benefits elsewhere too. Unofficial builders have been using -O2
for over a year with no problems.

When we all have 64-bit systems (I already have one) and 64-bit Mozilla, we
can just change the code to do two quadword compares instead of four doubleword
compares. I don't know if the Linux 64-bit x86 compiler is smart enough to do
that yet but it will be a piece of cake to do on Windows (64-bit).
Comment 25 Michael Moy 2005-05-15 22:50:52 PDT
Just a three-month checkup. This patch is being used in my OSX builds since
release 1.0.3 and is being used by most (if not all) of the other unofficial
Windows builders. It's been picked up by a few of the unofficial Linux builders
as well. I haven't heard of any problems, crash or otherwise, with this code
change.
Comment 26 Michael Moy 2005-08-13 18:10:40 PDT
Created attachment 192645 [details] [diff] [review]
::equals patch for Windows 64-bit

A patch that works on Firefox for Windows XP 64-bit edition. Modifications
could also be done for 64-bit Linux but I don't have such a platform.
Comment 27 mmc 2008-08-07 00:53:47 PDT
Could this be resurrected?
Comment 28 Steve Snyder 2010-10-14 12:02:17 PDT
Created attachment 483220 [details] [diff] [review]
Use NSPR definitions instead of compiler-specific defs
Comment 29 Justin Lebar (not reading bugmail) 2011-05-16 07:19:23 PDT
Comment on attachment 483220 [details] [diff] [review]
Use NSPR definitions instead of compiler-specific defs

>+#ifdef HAVE_LONG_LONG /* let NSPR do the heavy lifting */
>+    return
>+      ((((PRUint64*) &m0)[0] == ((PRUint64*) &other.m0)[0]) &&
>+       (((PRUint64*) &m0)[1] == ((PRUint64*) &other.m0)[1]));
>+#else

If HAVE_LONG_LONG is not defined, then PRUint64 is

  typedef struct {
    PRUint32 lo, hi;
    PRUint32 hi, lo;
  } PRUInt64;

Using this fake structure should be the same as the current PRUint32 code, right?  If so, can we get rid of the ifdef and just use PRUint64?
Comment 30 Justin Lebar (not reading bugmail) 2011-05-16 07:20:37 PDT
Whoops; the struct is just:

  typedef struct {
    PRUint32 lo, hi;
  } PRUint64;

on a little-endian system.
Comment 31 Justin Lebar (not reading bugmail) 2011-05-16 07:39:33 PDT
Oh, we need the guards to ensure that == is actually defined on the type.

I'm still not convinced the guard is necessary -- we have plenty of operations that will fail if == isn't defined on PRUint64.
Comment 32 Justin Lebar (not reading bugmail) 2011-05-16 07:44:52 PDT
Created attachment 532634 [details] [diff] [review]
Use 64-bit words with no guard
Comment 33 Benjamin Smedberg [:bsmedberg] 2011-05-16 09:25:08 PDT
Comment on attachment 532634 [details] [diff] [review]
Use 64-bit words with no guard

Justin's theory is that this will compile down to basically the same code on 32-bit systems... this needs to be tested somehow: testing against m-c isn't really ideal.
Comment 34 Justin Lebar (not reading bugmail) 2011-05-16 18:47:15 PDT
I have try builds coming in at http://tbpl.mozilla.org/?tree=Try&pusher=jlebar@mozilla.com&rev=193a3dd250c7

I'll disassemble the opt-32 builds and make sure NS_TableDrivenQI looks right.
Comment 35 Justin Lebar (not reading bugmail) 2011-05-17 08:03:03 PDT
Binaries are here:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-193a3dd250c7/
Comment 36 Justin Lebar (not reading bugmail) 2011-05-17 13:17:34 PDT
I managed to disassemble a Win32 PGO build.  I've included the code for
nsID::Equals below, but the gist is: It's basically the same before and after
this patch, as we expected.

The only real difference is the placement of the return 0 basic block (the
second block in the original code, and the third block in the modified code).
As I understand CPU architecture, the branch predictor should keep us from
caring where this block lives.

Nightly nsID::Equals:

  xul!nsID::Equals
     mov     edx,dword ptr [ecx]
     cmp     edx,dword ptr [eax]
     je      xul!nsID::Equals+0x9 (66ff0139)

  xul!nsID::Equals+0x6
     xor     eax,eax
     ret

  xul!nsID::Equals+0x9
     mov     edx,dword ptr [ecx+4]
     cmp     edx,dword ptr [eax+4]
     jne     xul!nsID::Equals+0x6 (66ff0136)

  xul!nsID::Equals+0x11
     mov     edx,dword ptr [ecx+8]
     cmp     edx,dword ptr [eax+8]
     jne     xul!nsID::Equals+0x6 (66ff0136)

  xul!nsID::Equals+0x19
     mov     ecx,dword ptr [ecx+0Ch]
     cmp     ecx,dword ptr [eax+0Ch]
     jne     xul!nsID::Equals+0x6 (66ff0136)

  xul!nsID::Equals+0x21
     mov     eax,1
     ret


After change:

  xul!nsID::Equals
     mov     edx,dword ptr [ecx]
     cmp     edx,dword ptr [eax]
     jne     xul!nsID::Equals+0xe (67c4ee7e)

  xul!nsID::Equals+0x6
     mov     edx,dword ptr [ecx+4]
     cmp     edx,dword ptr [eax+4]
     je      xul!nsID::Equals+0x11 (67c4ee81)

  xul!nsID::Equals+0xe
     xor     eax,eax
     ret

  xul!nsID::Equals+0x11
     mov     edx,dword ptr [ecx+8]
     cmp     edx,dword ptr [eax+8]
     jne     xul!nsID::Equals+0xe (67c4ee7e)

  xul!nsID::Equals+0x19
     mov     ecx,dword ptr [ecx+0Ch]
     cmp     ecx,dword ptr [eax+0Ch]
     jne     xul!nsID::Equals+0xe (67c4ee7e)

  xul!nsID::Equals+0x21
     mov     eax,1
     ret
Comment 37 Justin Lebar (not reading bugmail) 2011-05-17 13:32:06 PDT
Having done that due diligence, I've pushed to m-c.

http://hg.mozilla.org/mozilla-central/rev/9ad5c138c2d5

We still need to watch out for regressions, but I expect it to be OK.
Comment 38 Justin Lebar (not reading bugmail) 2011-10-16 10:06:12 PDT
Glandium and I have been discussing the potential problems with this patch in bug 660335.  The main issue is that unaligned 64-bit loads and stores can cause problems on x86-64.

The following silly microbenchmark is faster on Mac 10.6 (64-bit) with this patch than without.  There's no significant performance difference on Linux-64 (non-PGO), although interestingly, my Linux hardware (Core i7) is half the speed of my Mac hardware (Core i5).

Mac 10.6, before patch:
  45ms (first run)
  36ms (other runs)

Mac 10.6, after patch:
  39ms (first run)
  31ms (other runs)

Since this patch really only applies to x86-64 (x86-32 and ARM don't have 64-bit loads), I'm going to see if we can write some SSE2 code here.  We can use the explicitly unaligned 128-bit load, and just use 32-bit operations everywhere else.
Comment 39 Justin Lebar (not reading bugmail) 2011-10-16 10:34:44 PDT
Here's the microbenchmark:

        mozilla::TimeStamp start = mozilla::TimeStamp::Now();
        for (int i = 0; i < 100000; i++) {
            nsCOMPtr<nsISupports> t = do_QueryInterface(static_cast<nsIDocShell*>(this));
            nsCOMPtr<nsIDocShell> d = do_QueryInterface(t);
            nsCOMPtr<nsIDocument> doc = do_GetInterface(d);
            nsCOMPtr<nsIObserver> obs = do_QueryInterface(t);
            nsCOMPtr<nsINode> node = do_QueryInterface(doc);
            nsCOMPtr<nsISHEntry> shentry = do_QueryInterface(doc);
            nsCOMPtr<nsIDOMDocument> domdoc = do_QueryInterface(doc);
            nsCOMPtr<nsISHEntry> shentry2 = do_QueryInterface(t);
        }
        mozilla::TimeStamp end = mozilla::TimeStamp::Now();
        printf("Benchmark took %0.fms\n", (end - start).ToMilliseconds());
Comment 40 Justin Lebar (not reading bugmail) 2011-10-16 11:21:34 PDT
SSE2 with movdqu is about the same speed as before the patch (32-bit reads).

Since this makes only a (pretty small) difference on Mac and none on Linux, I'm tempted to back it out...
Comment 41 Justin Lebar (not reading bugmail) 2011-10-16 12:28:50 PDT
Created attachment 567348 [details] [diff] [review]
Backout
Comment 42 Mike Hommey [:glandium] 2011-10-27 06:11:08 PDT
Comment on attachment 567348 [details] [diff] [review]
Backout

Review of attachment 567348 [details] [diff] [review]:
-----------------------------------------------------------------

> and causes problems on x86-64.

It can (and does) cause problems on other 64 bits architectures, but does it cause problems on x86-64? (maybe performance hit on non-aligned 64-bits reads, though I'm not quite sure)
Comment 43 Justin Lebar (not reading bugmail) 2011-10-27 07:23:20 PDT
Thanks, Mike.

Backed out on inbound.  When this lands on m-c, this bug should be marked WONTFIX.

https://hg.mozilla.org/integration/mozilla-inbound/rev/cdd529811cb9
Comment 44 Ed Morley [:emorley] 2011-10-28 04:36:47 PDT
Merge of backout:
https://hg.mozilla.org/mozilla-central/rev/cdd529811cb9

Marking WONTFIX per comment 43.

Note You need to log in before you can comment on or make changes to this bug.