Closed Bug 1137963 Opened 5 years ago Closed 5 years ago

Consider using a spin lock for TraceRefCnt

Categories

(Core :: XPCOM, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox39 --- fixed

People

(Reporter: mstange, Assigned: mstange)

References

Details

Attachments

(1 file, 2 obsolete files)

We're seeing very high overhead of the prlock used in nsTraceRefCnt.cpp in debug crashtests on Mac.

Using a spin lock cuts the time for debug crashtests on 10.10 down by a factor of 4, from 45min to 11min.
Attached patch patch (obsolete) — Splinter Review
I fired off two try pushes with debug tests on all platforms.
Without patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=430cdd99c7f0
With patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=362310d7cb6b
Is it the PR_Lock overhead that's causing problems, or just that the platform's locks (pthread_mutex_t &co) are too expensive?
The platform's lock. The profiles at http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://tests.themasta.com/hostedprofiles/profiles-10.10.zip show almost all the time being spent in __psynch_mutexwait and __psynch_mutexdrop.
Lovely.  Does OS X happen to support pthread_spinlock_t, out of curiosity?
Apparently not, but it looks like it has a replacement called OSSpinLock: https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/spinlock.3.html
(In reply to Markus Stange [:mstange] from comment #0)
> We're seeing very high overhead of the prlock used in nsTraceRefCnt.cpp in
> debug crashtests on Mac.
> 
> Using a spin lock cuts the time for debug crashtests on 10.10 down by a
> factor of 4, from 45min to 11min.

What difference does it make on older versions?
An not-so-interesting 16min to 12min for crashtests on 10.8 and 10.6, but mochitest-3, which has been worrying me more than crashtests, went from 75min to 35 min on 10.8 (10.6 hit an intermittent assertion, retriggered). 10.10 m-3 is going to be difficult to measure, because while the one with the patch finished in 35 minutes, the one without is currently at 116 minutes, and will time out when it hits the suite maxtime.

Looks like a wash on Windows, and slightly worse on Linux, but I'm in love with the results on Mac.
Unfortunately, the 10.6 mochitest-3 assertion looks real and non-intermittent.
good stuff!  Looking forward for this to land.
(In reply to Phil Ringnalda (:philor) from comment #10)
> Unfortunately, the 10.6 mochitest-3 assertion looks real and
> non-intermittent.

I've filed bug 1138604 about this and am looking at it now.
I think on TreeHerder we only use the bloat log to track when objects are created and destroyed.  If the overhead here is so dramatic, we might be able to have some kind of lighter mode that would only record stuff for Ctor() and Dtor().  I think we don't generally care about the actual number of addrefs and releases.
Attached patch patch (obsolete) — Splinter Review
This is the last patch plus a PR_Sleep(PR_INTERVAL_NO_WAIT), and has had a green mochitest-3 10.6 debug run on try.
Attachment #8570800 - Attachment is obsolete: true
Attachment #8572172 - Flags: review?(nfroyd)
Depends on: 1125998
No longer depends on: 1125998
Blocks: 1125998
See Also: → 1137397
Comment on attachment 8572172 [details] [diff] [review]
patch

Waldo might be able to move this along.
Attachment #8572172 - Flags: review?(jwalden+bmo)
Blocks: 1140246
Comment on attachment 8572172 [details] [diff] [review]
patch

Review of attachment 8572172 [details] [diff] [review]:
-----------------------------------------------------------------

Woo for lock-free code!

I don't have a sense for whether immediately yielding on these things, versus trying to acquire a bunch of times, and only if those bunches of times fail fall back to yielding before retrying, is faster.  I'm simply going to assume this is better.  At the very least, tho, it is *correct*.

If you're not in particular rush for this to land this week or anything, I'd mildly prefer another person look at this as well -- because it's been a few months since I looked at anything atomic so maybe I've gone partially stupid (quite likely at the edges, at least!), and because I suggested an optimization, and because atomics can always use more eyes.  But this is basic enough I'm fine landing and getting froydnj's look post-landing, too -- I probably haven't forgotten *that* much.  :-)

::: xpcom/base/nsTraceRefcnt.cpp
@@ +33,5 @@
>  #endif
>  
>  #include "mozilla/BlockingResourceBase.h"
>  #include "mozilla/PoisonIOInterposer.h"
> +#include "mozilla/Atomics.h"

Alphabetize.

@@ +77,5 @@
> +// We use a spin lock instead of a regular mutex because this lock is usually
> +// only held for a very short time, and gets grabbed at a very high frequency
> +// (~100000 times per second). On Mac, the overhead of using a regular lock
> +// is very high, see bug 1137963.
> +static mozilla::Atomic<bool> gTraceLogLocked(false);

Remove the (false) here.  That suggests it's *necessary* to do that, but it's not -- the relevant member variable is initialized by the class as mValue(), which direct-initializes the field, which performs value initialization, which zero-initializes it.  I'd rather have these uses never specify a false, so that people who read it get used to its semantics.

Additionally.  That this is effectively a lock, that users acquire to do protected work, then release for the next user to use, suggests you can use a more efficient memory ordering: ReleaseAcquire, which doesn't force threads to sync up with each other to totally order atomic operations on *this* variable with atomic operations on unrelated memory in other threads.  That won't force threads to sync up quite so often, which theoretically will make acquiring and releasing faster.  (In practice x86 doesn't take advantage of this optimization opportunity, but ARM does.)

static mozilla::Atomic<bool, mozilla::ReleaseAcquire> gTraceLogLocked;

@@ +83,5 @@
> +  while (!gTraceLogLocked.compareExchange(false, true)) {                     \
> +    PR_Sleep(PR_INTERVAL_NO_WAIT); /* yield */                                \
> +  }
> +#define UNLOCK_TRACELOG()                                                     \
> +  do { gTraceLogLocked = false; } while(0)

Is there some reason these macros can't be made into inline functions, avoiding all the backslashing and syntactic-nicety nonsense?
Attachment #8572172 - Flags: review?(jwalden+bmo) → review+
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #16)
> I don't have a sense for whether immediately yielding on these things,
> versus trying to acquire a bunch of times, and only if those bunches of
> times fail fall back to yielding before retrying, is faster.

Yeah, me neither. I just checked that it doesn't make crashtests slower by multiple minutes, and that seems to be the case.

> If you're not in particular rush for this to land this week or anything, I'd
> mildly prefer another person look at this as well

Now that Andrew has already made the tests a lot faster going a different route (bug 1138616), I'm not in a rush to land this one.

> Additionally.  That this is effectively a lock, that users acquire to do
> protected work, then release for the next user to use, suggests you can use
> a more efficient memory ordering: ReleaseAcquire, which doesn't force
> threads to sync up with each other to totally order atomic operations on
> *this* variable with atomic operations on unrelated memory in other threads.
> That won't force threads to sync up quite so often, which theoretically will
> make acquiring and releasing faster.  (In practice x86 doesn't take
> advantage of this optimization opportunity, but ARM does.)
> 
> static mozilla::Atomic<bool, mozilla::ReleaseAcquire> gTraceLogLocked;

Thanks, I wanted to ask about this, but forgot. I didn't even find the place where we get the default value for the ordering policy if you don't specify any (which I did before).

> @@ +83,5 @@
> > +  while (!gTraceLogLocked.compareExchange(false, true)) {                     \
> > +    PR_Sleep(PR_INTERVAL_NO_WAIT); /* yield */                                \
> > +  }
> > +#define UNLOCK_TRACELOG()                                                     \
> > +  do { gTraceLogLocked = false; } while(0)
> 
> Is there some reason these macros can't be made into inline functions,
> avoiding all the backslashing and syntactic-nicety nonsense?

Probably not. I just assumed there was a reason for these to be macros in the old code, but I don't see one, so I'll make these inline functions.
(In reply to Markus Stange [:mstange] from comment #17)
> Thanks, I wanted to ask about this, but forgot. I didn't even find the place
> where we get the default value for the ordering policy if you don't specify
> any (which I did before).

The default is the most cautious variant, SequentiallyConsistent, since the others require more thought to say whether they are good enough for your situation.
(In reply to Markus Stange [:mstange] from comment #17)
> I didn't even find the place
> where we get the default value for the ordering policy if you don't specify
> any (which I did before).

It's on the declaration of mozilla::Atomic.

http://hg.mozilla.org/integration/mozilla-inbound/annotate/16f240fd86de/mfbt/Atomics.h#l1001

The declaration looks this way because Atomic<T> is only defined for particular sorts of T, through specializations targeting those particular sorts.  The default must go on the declaration, even tho that leaves things a little unclear to someone looking for it at the location of some specialization.
Comment on attachment 8572172 [details] [diff] [review]
patch

Review of attachment 8572172 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine to me.  If you take Waldo's advice to make {LOCK,UNLOCK}_TRACELOG inline functions, please go one step further and make them part of an RAII class.

::: xpcom/base/nsTraceRefcnt.cpp
@@ +76,4 @@
>  
> +// We use a spin lock instead of a regular mutex because this lock is usually
> +// only held for a very short time, and gets grabbed at a very high frequency
> +// (~100000 times per second). On Mac, the overhead of using a regular lock

Nit: please say "On some versions of OS X...", since my understanding is that it's only a problem on 10.10 (and maybe 10.9?).  Have we filed a bug with Apple about this?  Surely we can't be the only people running into this.
Attachment #8572172 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #20)
> Nit: please say "On some versions of OS X...", since my understanding is
> that it's only a problem on 10.10 (and maybe 10.9?).  Have we filed a bug
> with Apple about this?  Surely we can't be the only people running into this.

I think it is a problem on earlier versions, too.  My patch to avoid getting the lock on every addref and release sped up 10.6 debug by ~20% on a number of tests, but had no real impact on Linux.  It is just much worse in 10.10.
(In reply to Andrew McCreight [:mccr8] from comment #21)
> (In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #20)
> > Nit: please say "On some versions of OS X...", since my understanding is
> > that it's only a problem on 10.10 (and maybe 10.9?).  Have we filed a bug
> > with Apple about this?  Surely we can't be the only people running into this.
> 
> I think it is a problem on earlier versions, too.  My patch to avoid getting
> the lock on every addref and release sped up 10.6 debug by ~20% on a number
> of tests, but had no real impact on Linux.  It is just much worse in 10.10.

OK!  That can stay as-is, then.
Looking forward to seeing this land.  Any idea on the timeframe for that?  We are anxious to move debug tests to 10.10 in bug 1125998 and bug 1140246.
Flags: needinfo?(mstange)
I'll finish this up within the next few days. But I think you can go ahead with turning on tests for 10.10 since bug 1138616 has landed in the meantime which should also have made them much faster.
Flags: needinfo?(mstange)
(In reply to Markus Stange [:mstange] from comment #24)
> I'll finish this up within the next few days. But I think you can go ahead
> with turning on tests for 10.10 since bug 1138616 has landed in the meantime
> which should also have made them much faster.

Well, that caused a frequent orange, so it may need to get backed out.  Though this patch may just cause the same problem.
Oh. :-(
So it looks like the bug 1138616 is still in production and the intermittents are being tracked in bug 1122249. I'm going to ask relops to reimage more 10.10 machines in bug 1140246.
Blocks: 1139002
Attached patch patchSplinter Review
With review comments addressed.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=94eb8844378e

Using an RAII class for the lock caused a shutdown hang and turned all debug tests red, which was the reason why I didn't land this sooner. I've made a small change that fixes the deadlock.

The problem was in nsTraceRefCnt::DumpStatistics, which has an nsTArray<...> entries, and we were holding the lock while that array was destroyed, and gLogging was no longer NoLogging, so the array destructor logging deadlocked.

The order before my patch was:

LOCK()
gLogging = NoLogging
create array
gLogging = OnlyBloatLogging
UNLOCK()
destroy array <-- destructor logging here

The bad version of my patch had this order:

LOCK()
gLogging = NoLogging
create array
gLogging = OnlyBloatLogging
destroy array <-- destructor logging deadlocks here
UNLOCK()

And with AutoRestore for gLogging (current patch):

LOCK()
gLogging = NoLogging
create array
destroy array <-- destructor doesn't try to log
gLogging = OnlyBloatLogging
UNLOCK()

The version before the patch looks wrong to me, since we count the destruction of the array, but not its creation. But since that only happens after we've already dumped the statistics table, it probably didn't matter.
Attachment #8572172 - Attachment is obsolete: true
Did you mean to set any flags on that and/or land it?
I meant to land it as soon as the try results were in, which they are now.

https://hg.mozilla.org/integration/mozilla-inbound/rev/b83f949ac00d
https://hg.mozilla.org/mozilla-central/rev/b83f949ac00d
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
You need to log in before you can comment on or make changes to this bug.