1137963 - Consider using a spin lock for TraceRefCnt

Assignee

Description

•

9 years ago

We're seeing very high overhead of the prlock used in nsTraceRefCnt.cpp in debug crashtests on Mac.

Using a spin lock cuts the time for debug crashtests on 10.10 down by a factor of 4, from 45min to 11min.

Markus Stange [:mstange]

Assignee

Comment 1

•

9 years ago

Attached patch patch (obsolete) — Details — Splinter Review

Markus Stange [:mstange]

Assignee

Comment 2

•

9 years ago

I fired off two try pushes with debug tests on all platforms.
Without patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=430cdd99c7f0
With patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=362310d7cb6b

Nathan Froyd [:froydnj]

Comment 3

•

9 years ago

Is it the PR_Lock overhead that's causing problems, or just that the platform's locks (pthread_mutex_t &co) are too expensive?

Markus Stange [:mstange]

Assignee

Comment 4

•

9 years ago

The platform's lock. The profiles at http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://tests.themasta.com/hostedprofiles/profiles-10.10.zip show almost all the time being spent in __psynch_mutexwait and __psynch_mutexdrop.

Nathan Froyd [:froydnj]

Comment 5

•

9 years ago

Lovely.  Does OS X happen to support pthread_spinlock_t, out of curiosity?

Markus Stange [:mstange]

Assignee

Comment 6

•

9 years ago

Apparently not, but it looks like it has a replacement called OSSpinLock: https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/spinlock.3.html

Jeff Muizelaar [:jrmuizel]

Comment 7

•

9 years ago

(In reply to Markus Stange [:mstange] from comment #0)
> We're seeing very high overhead of the prlock used in nsTraceRefCnt.cpp in
> debug crashtests on Mac.
> 
> Using a spin lock cuts the time for debug crashtests on 10.10 down by a
> factor of 4, from 45min to 11min.

What difference does it make on older versions?

Phil Ringnalda (:philor)

Comment 8

•

9 years ago

An not-so-interesting 16min to 12min for crashtests on 10.8 and 10.6, but mochitest-3, which has been worrying me more than crashtests, went from 75min to 35 min on 10.8 (10.6 hit an intermittent assertion, retriggered). 10.10 m-3 is going to be difficult to measure, because while the one with the patch finished in 35 minutes, the one without is currently at 116 minutes, and will time out when it hits the suite maxtime.

Looks like a wash on Windows, and slightly worse on Linux, but I'm in love with the results on Mac.

(not currently active) Ted Mielczarek

Comment 9

•

9 years ago

Nice find!

Phil Ringnalda (:philor)

Comment 10

•

9 years ago

Unfortunately, the 10.6 mochitest-3 assertion looks real and non-intermittent.

Joel Maher ( :jmaher ) (UTC -8)

Comment 11

•

9 years ago

good stuff!  Looking forward for this to land.

Markus Stange [:mstange]

Assignee

Comment 12

•

9 years ago

(In reply to Phil Ringnalda (:philor) from comment #10)
> Unfortunately, the 10.6 mochitest-3 assertion looks real and
> non-intermittent.

I've filed bug 1138604 about this and am looking at it now.

Andrew McCreight [:mccr8]

Comment 13

•

9 years ago

I think on TreeHerder we only use the bloat log to track when objects are created and destroyed.  If the overhead here is so dramatic, we might be able to have some kind of lighter mode that would only record stuff for Ctor() and Dtor().  I think we don't generally care about the actual number of addrefs and releases.

Markus Stange [:mstange]

Assignee

Comment 14

•

9 years ago

Attached patch patch (obsolete) — Details — Splinter Review

This is the last patch plus a PR_Sleep(PR_INTERVAL_NO_WAIT), and has had a green mochitest-3 10.6 debug run on try.

Attachment #8570800 - Attachment is obsolete: true

Attachment #8572172 - Flags: review?(nfroyd)

Kim Moir [:kmoir] ET

Updated

•

9 years ago

Depends on: 1125998

Kim Moir [:kmoir] ET

Updated

•

9 years ago

No longer depends on: 1125998

Kim Moir [:kmoir] ET

Updated

•

9 years ago

Blocks: 1125998

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Comment 15

•

9 years ago

Comment on attachment 8572172 [details] [diff] [review]
patch

Waldo might be able to move this along.

Attachment #8572172 - Flags: review?(jwalden+bmo)

Kim Moir [:kmoir] ET

Updated

•

9 years ago

Blocks: 1140246

Jeff Walden [:Waldo]

Comment 16

•

9 years ago

Comment on attachment 8572172 [details] [diff] [review]
patch

Review of attachment 8572172 [details] [diff] [review]:
-----------------------------------------------------------------

Woo for lock-free code!

I don't have a sense for whether immediately yielding on these things, versus trying to acquire a bunch of times, and only if those bunches of times fail fall back to yielding before retrying, is faster.  I'm simply going to assume this is better.  At the very least, tho, it is *correct*.

If you're not in particular rush for this to land this week or anything, I'd mildly prefer another person look at this as well -- because it's been a few months since I looked at anything atomic so maybe I've gone partially stupid (quite likely at the edges, at least!), and because I suggested an optimization, and because atomics can always use more eyes.  But this is basic enough I'm fine landing and getting froydnj's look post-landing, too -- I probably haven't forgotten *that* much.  :-)

::: xpcom/base/nsTraceRefcnt.cpp
@@ +33,5 @@
>  #endif
>  
>  #include "mozilla/BlockingResourceBase.h"
>  #include "mozilla/PoisonIOInterposer.h"
> +#include "mozilla/Atomics.h"

Alphabetize.

@@ +77,5 @@
> +// We use a spin lock instead of a regular mutex because this lock is usually
> +// only held for a very short time, and gets grabbed at a very high frequency
> +// (~100000 times per second). On Mac, the overhead of using a regular lock
> +// is very high, see bug 1137963.
> +static mozilla::Atomic<bool> gTraceLogLocked(false);

Remove the (false) here.  That suggests it's *necessary* to do that, but it's not -- the relevant member variable is initialized by the class as mValue(), which direct-initializes the field, which performs value initialization, which zero-initializes it.  I'd rather have these uses never specify a false, so that people who read it get used to its semantics.

Additionally.  That this is effectively a lock, that users acquire to do protected work, then release for the next user to use, suggests you can use a more efficient memory ordering: ReleaseAcquire, which doesn't force threads to sync up with each other to totally order atomic operations on *this* variable with atomic operations on unrelated memory in other threads.  That won't force threads to sync up quite so often, which theoretically will make acquiring and releasing faster.  (In practice x86 doesn't take advantage of this optimization opportunity, but ARM does.)

static mozilla::Atomic<bool, mozilla::ReleaseAcquire> gTraceLogLocked;

@@ +83,5 @@
> +  while (!gTraceLogLocked.compareExchange(false, true)) {                     \
> +    PR_Sleep(PR_INTERVAL_NO_WAIT); /* yield */                                \
> +  }
> +#define UNLOCK_TRACELOG()                                                     \
> +  do { gTraceLogLocked = false; } while(0)

Is there some reason these macros can't be made into inline functions, avoiding all the backslashing and syntactic-nicety nonsense?

Attachment #8572172 - Flags: review?(jwalden+bmo) → review+

Markus Stange [:mstange]

Assignee

Comment 17

•

9 years ago

(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #16)
> I don't have a sense for whether immediately yielding on these things,
> versus trying to acquire a bunch of times, and only if those bunches of
> times fail fall back to yielding before retrying, is faster.

Yeah, me neither. I just checked that it doesn't make crashtests slower by multiple minutes, and that seems to be the case.

> If you're not in particular rush for this to land this week or anything, I'd
> mildly prefer another person look at this as well

Now that Andrew has already made the tests a lot faster going a different route (bug 1138616), I'm not in a rush to land this one.

> Additionally.  That this is effectively a lock, that users acquire to do
> protected work, then release for the next user to use, suggests you can use
> a more efficient memory ordering: ReleaseAcquire, which doesn't force
> threads to sync up with each other to totally order atomic operations on
> *this* variable with atomic operations on unrelated memory in other threads.
> That won't force threads to sync up quite so often, which theoretically will
> make acquiring and releasing faster.  (In practice x86 doesn't take
> advantage of this optimization opportunity, but ARM does.)
> 
> static mozilla::Atomic<bool, mozilla::ReleaseAcquire> gTraceLogLocked;

Thanks, I wanted to ask about this, but forgot. I didn't even find the place where we get the default value for the ordering policy if you don't specify any (which I did before).

> @@ +83,5 @@
> > +  while (!gTraceLogLocked.compareExchange(false, true)) {                     \
> > +    PR_Sleep(PR_INTERVAL_NO_WAIT); /* yield */                                \
> > +  }
> > +#define UNLOCK_TRACELOG()                                                     \
> > +  do { gTraceLogLocked = false; } while(0)
> 
> Is there some reason these macros can't be made into inline functions,
> avoiding all the backslashing and syntactic-nicety nonsense?

Probably not. I just assumed there was a reason for these to be macros in the old code, but I don't see one, so I'll make these inline functions.

Emanuel Hoogeveen [:ehoogeveen]

Comment 18

•

9 years ago

(In reply to Markus Stange [:mstange] from comment #17)
> Thanks, I wanted to ask about this, but forgot. I didn't even find the place
> where we get the default value for the ordering policy if you don't specify
> any (which I did before).

The default is the most cautious variant, SequentiallyConsistent, since the others require more thought to say whether they are good enough for your situation.

Jeff Walden [:Waldo]

Comment 19

•

9 years ago

(In reply to Markus Stange [:mstange] from comment #17)
> I didn't even find the place
> where we get the default value for the ordering policy if you don't specify
> any (which I did before).

It's on the declaration of mozilla::Atomic.

http://hg.mozilla.org/integration/mozilla-inbound/annotate/16f240fd86de/mfbt/Atomics.h#l1001

The declaration looks this way because Atomic<T> is only defined for particular sorts of T, through specializations targeting those particular sorts.  The default must go on the declaration, even tho that leaves things a little unclear to someone looking for it at the location of some specialization.

Nathan Froyd [:froydnj]

Comment 20

•

9 years ago

Comment on attachment 8572172 [details] [diff] [review]
patch

Review of attachment 8572172 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine to me.  If you take Waldo's advice to make {LOCK,UNLOCK}_TRACELOG inline functions, please go one step further and make them part of an RAII class.

::: xpcom/base/nsTraceRefcnt.cpp
@@ +76,4 @@
>  
> +// We use a spin lock instead of a regular mutex because this lock is usually
> +// only held for a very short time, and gets grabbed at a very high frequency
> +// (~100000 times per second). On Mac, the overhead of using a regular lock

Nit: please say "On some versions of OS X...", since my understanding is that it's only a problem on 10.10 (and maybe 10.9?).  Have we filed a bug with Apple about this?  Surely we can't be the only people running into this.

Attachment #8572172 - Flags: review?(nfroyd) → review+

Andrew McCreight [:mccr8]

Comment 21

•

9 years ago

(In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #20)
> Nit: please say "On some versions of OS X...", since my understanding is
> that it's only a problem on 10.10 (and maybe 10.9?).  Have we filed a bug
> with Apple about this?  Surely we can't be the only people running into this.

I think it is a problem on earlier versions, too.  My patch to avoid getting the lock on every addref and release sped up 10.6 debug by ~20% on a number of tests, but had no real impact on Linux.  It is just much worse in 10.10.

Nathan Froyd [:froydnj]

Comment 22

•

9 years ago

(In reply to Andrew McCreight [:mccr8] from comment #21)
> (In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #20)
> > Nit: please say "On some versions of OS X...", since my understanding is
> > that it's only a problem on 10.10 (and maybe 10.9?).  Have we filed a bug
> > with Apple about this?  Surely we can't be the only people running into this.
> 
> I think it is a problem on earlier versions, too.  My patch to avoid getting
> the lock on every addref and release sped up 10.6 debug by ~20% on a number
> of tests, but had no real impact on Linux.  It is just much worse in 10.10.

OK!  That can stay as-is, then.

Kim Moir [:kmoir] ET

Comment 23

•

9 years ago

Looking forward to seeing this land.  Any idea on the timeframe for that?  We are anxious to move debug tests to 10.10 in bug 1125998 and bug 1140246.

Flags: needinfo?(mstange)

Markus Stange [:mstange]

Assignee

Comment 24

•

9 years ago

I'll finish this up within the next few days. But I think you can go ahead with turning on tests for 10.10 since bug 1138616 has landed in the meantime which should also have made them much faster.

Flags: needinfo?(mstange)

Andrew McCreight [:mccr8]

Comment 25

•

9 years ago

(In reply to Markus Stange [:mstange] from comment #24)
> I'll finish this up within the next few days. But I think you can go ahead
> with turning on tests for 10.10 since bug 1138616 has landed in the meantime
> which should also have made them much faster.

Well, that caused a frequent orange, so it may need to get backed out.  Though this patch may just cause the same problem.

Markus Stange [:mstange]

Assignee

Comment 26

•

9 years ago

Oh. :-(

Kim Moir [:kmoir] ET

Comment 27

•

9 years ago

So it looks like the bug 1138616 is still in production and the intermittents are being tracked in bug 1122249. I'm going to ask relops to reimage more 10.10 machines in bug 1140246.

Kim Moir [:kmoir] ET

Updated

•

9 years ago

Blocks: 1139002

Markus Stange [:mstange]

Assignee

Comment 28

•

9 years ago

Attached patch patch — Details — Splinter Review

With review comments addressed.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=94eb8844378e

Using an RAII class for the lock caused a shutdown hang and turned all debug tests red, which was the reason why I didn't land this sooner. I've made a small change that fixes the deadlock.

The problem was in nsTraceRefCnt::DumpStatistics, which has an nsTArray<...> entries, and we were holding the lock while that array was destroyed, and gLogging was no longer NoLogging, so the array destructor logging deadlocked.

The order before my patch was:

LOCK()
gLogging = NoLogging
create array
gLogging = OnlyBloatLogging
UNLOCK()
destroy array <-- destructor logging here

The bad version of my patch had this order:

LOCK()
gLogging = NoLogging
create array
gLogging = OnlyBloatLogging
destroy array <-- destructor logging deadlocks here
UNLOCK()

And with AutoRestore for gLogging (current patch):

LOCK()
gLogging = NoLogging
create array
destroy array <-- destructor doesn't try to log
gLogging = OnlyBloatLogging
UNLOCK()

The version before the patch looks wrong to me, since we count the destruction of the array, but not its creation. But since that only happens after we've already dumped the statistics table, it probably didn't matter.

Attachment #8572172 - Attachment is obsolete: true

Emanuel Hoogeveen [:ehoogeveen]

Comment 29

•

9 years ago

Did you mean to set any flags on that and/or land it?

Markus Stange [:mstange]

Assignee

Comment 30

•

9 years ago

I meant to land it as soon as the try results were in, which they are now.

https://hg.mozilla.org/integration/mozilla-inbound/rev/b83f949ac00d

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 31

•

9 years ago

https://hg.mozilla.org/mozilla-central/rev/b83f949ac00d

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

status-firefox39: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla39

patch 9 years ago Markus Stange [:mstange] 2.00 KB, patch		Details \| Diff \| Splinter Review
patch 9 years ago Markus Stange [:mstange] 2.34 KB, patch	froydnj : review+ Waldo : review+	Details \| Diff \| Splinter Review
patch 9 years ago Markus Stange [:mstange] 9.01 KB, patch		Details \| Diff \| Splinter Review