Closed Bug 645263 Opened 10 years ago Closed 10 years ago

dangerous deadlock detection data divergence

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: dbaron, Assigned: cjones)

References

Details

Attachments

(7 files, 3 obsolete files)

6.65 KB, patch
benjamin
: superreview+
Details | Diff | Splinter Review
15.03 KB, patch
benjamin
: superreview+
Details | Diff | Splinter Review
29.84 KB, patch
benjamin
: review+
Details | Diff | Splinter Review
481.95 KB, patch
dbaron
: review+
Details | Diff | Splinter Review
50.05 KB, patch
cajbir
: review+
Details | Diff | Splinter Review
2.66 KB, patch
ehsan
: review+
Details | Diff | Splinter Review
1.46 KB, patch
dbaron
: review+
Details | Diff | Splinter Review
After investigating bug 594666, I've noticed a bunch of problems with our deadlock detection code (start from bug 594666 comment 365) for the relevant history.

In particular:

 * we have separate deadlock detection code in BlockingResourceBase.cpp and in nsAutoLock.cpp; they don't detect deadlocks between the different types of locks

 * each of these files lives in xpcom/glue, which means that there's a separate copy of their data structures inside and outside of xpcom glue.  This means that we don't detect deadlocks between a lock used by one user of xpcom glue and a lock used by another (e.g., a lock used in the firefox binary and a lock used in libxul, if my memory of glue usage is correct)


We should:

 * move the data structure out of xpcom glue and maintain one copy per process

 * either unify the structure or rapidly migrate nsAutoLock/nsAutoMonitor users (and raw PRLock/PRMonitor users) to mozilla::Mutex
The first part of this work is basically what bug 493080 covers.  (The old deadlock detector has other problems too that made us want to kill it off, like not actually being thread safe.)  The second part isn't filed; that problem was known when we created the new deadlock detector, but not considered especially serious.  The work following bug 642381 would allow us to fix that, and to additionally check for deadlocks across SM/Gecko.

The work from bug 493080 obviously fizzled out.  There's some momentum now from bug 594666, so let's try to use this new hotter bug to finally kill off nsAutoLock.
Depends on: 594666
Duplicate of this bug: 486606
Duplicate of this bug: 493080
Duplicate of this bug: 493112
Duplicate of this bug: 493116
Duplicate of this bug: 493132
Duplicate of this bug: 493138
Duplicate of this bug: 493146
This patch was made slightly-not-quite-entirely-trivial by the custom AutoExit helper.
Assignee: nobody → jones.chris.g
Attachment #522608 - Flags: review?(chris.double)
This patch was made slightly-not-quite-entirely-trivial by the AutoAsyncCallLock helper.
Attachment #522609 - Flags: superreview?(benjamin)
This patch was made slightly-not-quite-entirely-trivial by the nsAutoMonitor-compatibility-mode helper.
Attachment #522610 - Flags: superreview?(benjamin)
Aaand the rest is straight rewrite :).
Attachment #522611 - Flags: review?(dbaron)
Not sure if this is semi-officially part of our SDK, but this would feel really good.
Attachment #522612 - Flags: review?(benjamin)
Attachment #522609 - Flags: superreview?(benjamin) → superreview+
Attachment #522610 - Flags: superreview?(benjamin) → superreview+
Attachment #522612 - Flags: review?(benjamin) → review+
Comment on attachment 522611 [details] [diff] [review]
part 4: Migrate everything else to mozilla:: sync primitives

Not quite ready yet.
Attachment #522611 - Flags: review?(dbaron)
Fixed some windows- and mac-only callers.
Attachment #522611 - Attachment is obsolete: true
Attachment #522915 - Flags: review?(dbaron)
You should also fix up the one user of nsAutoCMonitor at some point-- probably to just use nsAutoMonitor.
Hmm, looks like an unused test.  I wonder if we should just remove the test.
Comment on attachment 522915 [details] [diff] [review]
part 4: Migrate everything else to mozilla:: sync primitives, v2

XPCAutoLock should probably become a typedef for MonitorAutoEnter
at some point.  Doesn't have to be now, though.  Though it'd be nice.

In nsJar::Open and OpenInner, I think you should use another mechanism for
enforcing the already-open check.  Add a boolean if you need one.

>+      mMonitor("nsCacheService.mMoniton"),

s/n"/r"/

nsHostResolver.h:
>-    PRLock       *mLock;
>-    PRCondVar    *mIdleThreadCV; // non-null if idle thread
>+    Mutex         mLock;
>+    CondVar       mIdleThreadCV; // non-null if idle thread

comment is no longer true

nsHttpActivityDistributor::Init should be removed, and it should
just use NS_GENERIC_FACTORY_CONSTRUCTOR (no _INIT).

Maybe leave the blank lines before and after the first (new) block
in nsKeygenThread::Run?

In nsHTTPListener::FreeLoadGroup you should leave the block that you
removed; you're changing the scope of the autolock.

>+      //Free the lock and return - no error - just implies nothing to schedule

Drop the "Free the lock and "

Also, you don't need to introduce a new scope and reindent in this
method, since you unlock at the very end.  (Having the return inside the
lock scope is a sign you don't need it!)

In nsNSSShutdownList you should probably either remove the indent
in those methods or make an explicit scope to match what the indent
used to show.  Same for nsNSSActivityState (where you already
did make the explicit scope in some places).

In nsProtectedAuthThread::GetSlot, don't introduce extra spaces before
=.

>+            mQueueCond.Wait(PR_MillisecondsToInterval(10000)_);

You've got an extra underscore, which says you didn't compile on
Android.

In nsExceptionService, don't get too cute -- use a variable name
with letters in it.


You shouldn't remove FreeAutoLockStatics calls in this patch.

>+    mLock("nsRecyclingAllocatior.mLock"),

s/tior/tor/

nsPipe::GetInputStream and GetOutputStream need another mechanism
of testing for initialization.

>+        : mCounterLock("ProxyTex.mCounterLock")
>+        , mEvilMonitor("ProxyTex.mEvilMonitor")

s/Tex/Test/

Please add MOZ_COUNT_CTOR and MOZ_COUNT_DTOR macros to
mozilla::Mutex, mozilla::Monitor, and mozilla::CondVar to ensure
you're not leaking locks anywhere.

Make sure you get green try runs before landing.

r=dbaron with that
Attachment #522915 - Flags: review?(dbaron) → review+
(Removing FreeAutoLockStatics in patch 5 is great, though.)
Thanks much.  This can't have been an enjoyable review :).

(In reply to comment #18)
> Comment on attachment 522915 [details] [diff] [review]
> part 4: Migrate everything else to mozilla:: sync primitives, v2
> 
> XPCAutoLock should probably become a typedef for MonitorAutoEnter
> at some point.  Doesn't have to be now, though.  Though it'd be nice.

Yes, I didn't understand the fork there; maybe it predated nsAutoMonitor?  Bug 556214 might be a good starting point for cleaning this up, since the re-entrancy semantics needs to be made clear there.

> In nsJar::Open and OpenInner, I think you should use another mechanism for
> enforcing the already-open check.  Add a boolean if you need one.

Done.  I added a bool because I wasn't sure how to generally check already-open in OpenInner.

> >+      mMonitor("nsCacheService.mMoniton"),
> 
> s/n"/r"/

Fixed in bug 646259, which will be folded into this before pushing.

> nsHostResolver.h:
> >-    PRLock       *mLock;
> >-    PRCondVar    *mIdleThreadCV; // non-null if idle thread
> >+    Mutex         mLock;
> >+    CondVar       mIdleThreadCV; // non-null if idle thread
> 
> comment is no longer true

D'oh, I went to check whether that comment was true in the old PRCondVar impl before nuking (it wasn't), then I forgot to nuke.  Fixed.

> nsHttpActivityDistributor::Init should be removed, and it should
> just use NS_GENERIC_FACTORY_CONSTRUCTOR (no _INIT).

Done.

> Maybe leave the blank lines before and after the first (new) block
> in nsKeygenThread::Run?

OK.

> In nsHTTPListener::FreeLoadGroup you should leave the block that you
> removed; you're changing the scope of the autolock.
> 
> >+      //Free the lock and return - no error - just implies nothing to schedule
> 
> Drop the "Free the lock and "

Done.

> Also, you don't need to introduce a new scope and reindent in this
> method, since you unlock at the very end.  (Having the return inside the
> lock scope is a sign you don't need it!)

Hm, I'm not sure there's a good solution here.  I tried to only add new locking scopes where needed, and I was inconsistent here :S.

In v2, I reindented the sites I was 99.9% sure were strangely formatted just because of explicit lock/unlock, not because, say, pushing a value onto the stack while the mutex was held was found to cause perf issues (i.e., none).

> In nsNSSShutdownList you should probably either remove the indent
> in those methods or make an explicit scope to match what the indent
> used to show.  Same for nsNSSActivityState (where you already
> did make the explicit scope in some places).

Changed per above.

> In nsProtectedAuthThread::GetSlot, don't introduce extra spaces before
> =.

Fixed.

> >+            mQueueCond.Wait(PR_MillisecondsToInterval(10000)_);
> 
> You've got an extra underscore, which says you didn't compile on
> Android.

I did, locally, but only opt, which is also only what tinderbox does :S (bug 581290).  Debug builds were broken for months until recently.

> In nsExceptionService, don't get too cute -- use a variable name
> with letters in it.

The problem here was that the static variable is named "lock", the same as our (apparent) convention for RAII auto-locks.  Renamed static variable to sLock to match style guide.

> You shouldn't remove FreeAutoLockStatics calls in this patch.

Hm yeah, I had the kill-nsAutoLock patch before this in series so that users wouldn't compile.  Will rejigger and fix.  (I'm going to fold all these patches before pushing, but that change will help rebasing if a backout is needed.)

> >+    mLock("nsRecyclingAllocatior.mLock"),
> 
> s/tior/tor/

Whups.

> nsPipe::GetInputStream and GetOutputStream need another mechanism
> of testing for initialization.

Nice catch!  Tryserver found that too :).  Another place where I forgot to change the is-init check after looking at the .cpp.  Fixed with another new bool member introduced to match existing semantics.

> >+        : mCounterLock("ProxyTex.mCounterLock")
> >+        , mEvilMonitor("ProxyTex.mEvilMonitor")
> 
> s/Tex/Test/

Fixed.

> Please add MOZ_COUNT_CTOR and MOZ_COUNT_DTOR macros to
> mozilla::Mutex, mozilla::Monitor, and mozilla::CondVar to ensure
> you're not leaking locks anywhere.
> 
> Make sure you get green try runs before landing.

Yep!
Comment on attachment 522612 [details] [diff] [review]
part 5: Remove nsAutoLock.*

>diff --git a/xpcom/glue/Makefile.in b/xpcom/glue/Makefile.in
>--- a/xpcom/glue/Makefile.in
>+++ b/xpcom/glue/Makefile.in
>@@ -65,17 +65,16 @@ CPPSRCS		= \
> 		nsStringAPI.cpp \
> 		GenericModule.cpp \
> 		$(NULL)
> 
> # TODO nsAutoLock.h should die soon

Kill this line too.
Rebased, smaller now.
Attachment #522608 - Attachment is obsolete: true
Attachment #522608 - Flags: review?(chris.double)
Attachment #523088 - Flags: review?(chris.double)
Something about these tests is making for very slow.  The new deadlock detector was explicitly designed to fix scalability problems in the old one, and in testing was much better, so there's something very wrong here.  Not sure whether the problem lies in the detector, the tests, or the code being tested, but I'd rather fix whichever off the critical path to landing this.
Attachment #523091 - Flags: review?(ehsan)
These patches are looking good on tryserver, except the xpcshell tests didn't finish within 2 hours on windows :/.  Most likely due to capturing backtraces being slow.
Attachment #523180 - Flags: review?(dbaron)
Part 7 seems to fix timeout on tryserver.  Looks like this is ready to go as soon as the reviews come in.
Comment on attachment 523091 [details] [diff] [review]
part 6: Make editor test be nicer to deadlock detector

These tests are stupid, and I will kill them without mercy in the future, but in the meantime, you're free to do whatever you need to them in order to get this landed.  :-)
Attachment #523091 - Flags: review?(ehsan) → review+
Attachment #523088 - Flags: review?(chris.double) → review+
(In reply to comment #23)
> Created attachment 523091 [details] [diff] [review]
> part 6: Make editor test be nicer to deadlock detector
> 
> Not sure
> whether the problem lies in the detector, the tests, or the code being tested,
> but I'd rather fix whichever off the critical path to landing this.

The main problem here is that
 - the test is external API
 - it defines XPCOM classes
 - by the nature of the tests, there are a gerjillion AddRef/Release of the COM-y things defined under the external API
 - AddRef/Release check NS_IsCycleCollectorThread() in debug builds
 - ... aaand NS_IsCycleCollectorThread() is hahahaomglollerskates slow for external-API code: http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsCycleCollectorUtils.cpp#53

Most of the time in my profile was in tracemalloc.  Running before/after with NS_TRACE_MALLOC_DISABLE_STACKS=1 took about the same amount of time; new detector was ~10% slower.  The test was *much* slower with the patches here and stacks enabled; too slow for my patience.  The old deadlock detector only checked lock ordering in the nsAutoLock/nsAutoMonitor constructors (which is a bug), and the new detector checks on all Mutex::Lock/Monitor::Enter calls.  My guess is that something was using raw PR_* operations or nsAuto*::lock/Enter, which weren't getting ordering checks, and they're now being checked (which incurs grabbing a backtrace).  Didn't confirm this.

Anyway, the takeaway is
 - NS_IsCycleCollectorThread() is ridiculous for external-API code.  Not sure how much anyone cares.
 - TestTXMgr is chewing a ton of CPU cycles in debug builds; not sure what we gain from that.
Fixed a style inconsistency and a grammar-o.
Attachment #523180 - Attachment is obsolete: true
Attachment #523180 - Flags: review?(dbaron)
Attachment #523237 - Flags: review?(dbaron)
Comment on attachment 523237 [details] [diff] [review]
part 7: Disable tracemalloc backtraces for xpcshell tests, v1.1

r=dbaron

I should have done this in bug 549561.
Attachment #523237 - Flags: review?(dbaron) → review+
http://hg.mozilla.org/mozilla-central/rev/4beec31b9ea9
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Was this announced anywhere, particularly to existing nsAutoLock consumers?
Increased frequency of intermittent orange bug 618052, backed out

http://hg.mozilla.org/mozilla-central/rev/e03c3a6df3cb

Sigh.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The first round of OS X debug tests was green on tryserver, with parts 1-6.  I pushed a second time with part 7 added, and just ran on win/OS X 10.6, and saw bug 618052 on 10.6.  That's very puzzling.
(In reply to comment #31)
> Was this announced anywhere, particularly to existing nsAutoLock consumers?

It was deprecated 2 years ago [1], but this bug wasn't announced beforehand.  There was an effort to kill off nsAutoLock that began about 2 years ago and fizzled out because of review neglect, so I tried to move fast here.

[1] http://blog.mozilla.com/cjones/2009/05/04/farewell-prlock-i-hardly-knew-thee/
(In reply to comment #27)
>  - TestTXMgr is chewing a ton of CPU cycles in debug builds; not sure what we
> gain from that.

Not much.  => bug 647064.
Sadly,

-    self.env["NS_TRACE_MALLOC_DISABLE_STACKS"] = "1"
+    if not (sys.platform == 'osx' or sys.platform == "darwin"):
+      # XXX automation.py has this odd special case; without it, bug
+      # 618052 seems to be exacerbated
+      self.env["NS_TRACE_MALLOC_DISABLE_STACKS"] = "1"

made this pass both OS X debug xpcshell runs on tryserver.  Ah well; guess it's good to be consistent with automation.py until the two are unified.
Relanded with, ahem, correction to make automation.py and runxpcshelltests.py agree.

http://hg.mozilla.org/mozilla-central/rev/1a89509e25e4
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Depends on: 647150
Amazingly and perhaps frighteningly, this got an all-green.
Blocks: 647000
(In reply to comment #35)
> (In reply to comment #31)
> > Was this announced anywhere, particularly to existing nsAutoLock consumers?
> It was deprecated 2 years ago
I guess that constitutes fair warning.
Blocks: 647003
You need to log in before you can comment on or make changes to this bug.