Closed
Bug 645263
Opened 14 years ago
Closed 14 years ago
dangerous deadlock detection data divergence
Categories
(Core :: XPCOM, defect)
Core
XPCOM
Tracking
()
RESOLVED
FIXED
People
(Reporter: dbaron, Assigned: cjones)
References
Details
Attachments
(7 files, 3 obsolete files)
6.65 KB,
patch
|
benjamin
:
superreview+
|
Details | Diff | Splinter Review |
15.03 KB,
patch
|
benjamin
:
superreview+
|
Details | Diff | Splinter Review |
29.84 KB,
patch
|
benjamin
:
review+
|
Details | Diff | Splinter Review |
481.95 KB,
patch
|
dbaron
:
review+
|
Details | Diff | Splinter Review |
50.05 KB,
patch
|
cajbir
:
review+
|
Details | Diff | Splinter Review |
2.66 KB,
patch
|
ehsan.akhgari
:
review+
|
Details | Diff | Splinter Review |
1.46 KB,
patch
|
dbaron
:
review+
|
Details | Diff | Splinter Review |
After investigating bug 594666, I've noticed a bunch of problems with our deadlock detection code (start from bug 594666 comment 365) for the relevant history.
In particular:
* we have separate deadlock detection code in BlockingResourceBase.cpp and in nsAutoLock.cpp; they don't detect deadlocks between the different types of locks
* each of these files lives in xpcom/glue, which means that there's a separate copy of their data structures inside and outside of xpcom glue. This means that we don't detect deadlocks between a lock used by one user of xpcom glue and a lock used by another (e.g., a lock used in the firefox binary and a lock used in libxul, if my memory of glue usage is correct)
We should:
* move the data structure out of xpcom glue and maintain one copy per process
* either unify the structure or rapidly migrate nsAutoLock/nsAutoMonitor users (and raw PRLock/PRMonitor users) to mozilla::Mutex
Updated•14 years ago
|
Assignee | ||
Comment 1•14 years ago
|
||
The first part of this work is basically what bug 493080 covers. (The old deadlock detector has other problems too that made us want to kill it off, like not actually being thread safe.) The second part isn't filed; that problem was known when we created the new deadlock detector, but not considered especially serious. The work following bug 642381 would allow us to fix that, and to additionally check for deadlocks across SM/Gecko.
The work from bug 493080 obviously fizzled out. There's some momentum now from bug 594666, so let's try to use this new hotter bug to finally kill off nsAutoLock.
Depends on: 594666
Assignee | ||
Comment 9•14 years ago
|
||
This patch was made slightly-not-quite-entirely-trivial by the custom AutoExit helper.
Assignee: nobody → jones.chris.g
Attachment #522608 -
Flags: review?(chris.double)
Assignee | ||
Comment 10•14 years ago
|
||
This patch was made slightly-not-quite-entirely-trivial by the AutoAsyncCallLock helper.
Attachment #522609 -
Flags: superreview?(benjamin)
Assignee | ||
Comment 11•14 years ago
|
||
This patch was made slightly-not-quite-entirely-trivial by the nsAutoMonitor-compatibility-mode helper.
Attachment #522610 -
Flags: superreview?(benjamin)
Assignee | ||
Comment 12•14 years ago
|
||
Aaand the rest is straight rewrite :).
Attachment #522611 -
Flags: review?(dbaron)
Assignee | ||
Comment 13•14 years ago
|
||
Not sure if this is semi-officially part of our SDK, but this would feel really good.
Attachment #522612 -
Flags: review?(benjamin)
Updated•14 years ago
|
Attachment #522609 -
Flags: superreview?(benjamin) → superreview+
Updated•14 years ago
|
Attachment #522610 -
Flags: superreview?(benjamin) → superreview+
Updated•14 years ago
|
Attachment #522612 -
Flags: review?(benjamin) → review+
Assignee | ||
Comment 14•14 years ago
|
||
Comment on attachment 522611 [details] [diff] [review]
part 4: Migrate everything else to mozilla:: sync primitives
Not quite ready yet.
Attachment #522611 -
Flags: review?(dbaron)
Assignee | ||
Comment 15•14 years ago
|
||
Fixed some windows- and mac-only callers.
Attachment #522611 -
Attachment is obsolete: true
Attachment #522915 -
Flags: review?(dbaron)
Reporter | ||
Comment 16•14 years ago
|
||
You should also fix up the one user of nsAutoCMonitor at some point-- probably to just use nsAutoMonitor.
Assignee | ||
Comment 17•14 years ago
|
||
Hmm, looks like an unused test. I wonder if we should just remove the test.
Reporter | ||
Comment 18•14 years ago
|
||
Comment on attachment 522915 [details] [diff] [review]
part 4: Migrate everything else to mozilla:: sync primitives, v2
XPCAutoLock should probably become a typedef for MonitorAutoEnter
at some point. Doesn't have to be now, though. Though it'd be nice.
In nsJar::Open and OpenInner, I think you should use another mechanism for
enforcing the already-open check. Add a boolean if you need one.
>+ mMonitor("nsCacheService.mMoniton"),
s/n"/r"/
nsHostResolver.h:
>- PRLock *mLock;
>- PRCondVar *mIdleThreadCV; // non-null if idle thread
>+ Mutex mLock;
>+ CondVar mIdleThreadCV; // non-null if idle thread
comment is no longer true
nsHttpActivityDistributor::Init should be removed, and it should
just use NS_GENERIC_FACTORY_CONSTRUCTOR (no _INIT).
Maybe leave the blank lines before and after the first (new) block
in nsKeygenThread::Run?
In nsHTTPListener::FreeLoadGroup you should leave the block that you
removed; you're changing the scope of the autolock.
>+ //Free the lock and return - no error - just implies nothing to schedule
Drop the "Free the lock and "
Also, you don't need to introduce a new scope and reindent in this
method, since you unlock at the very end. (Having the return inside the
lock scope is a sign you don't need it!)
In nsNSSShutdownList you should probably either remove the indent
in those methods or make an explicit scope to match what the indent
used to show. Same for nsNSSActivityState (where you already
did make the explicit scope in some places).
In nsProtectedAuthThread::GetSlot, don't introduce extra spaces before
=.
>+ mQueueCond.Wait(PR_MillisecondsToInterval(10000)_);
You've got an extra underscore, which says you didn't compile on
Android.
In nsExceptionService, don't get too cute -- use a variable name
with letters in it.
You shouldn't remove FreeAutoLockStatics calls in this patch.
>+ mLock("nsRecyclingAllocatior.mLock"),
s/tior/tor/
nsPipe::GetInputStream and GetOutputStream need another mechanism
of testing for initialization.
>+ : mCounterLock("ProxyTex.mCounterLock")
>+ , mEvilMonitor("ProxyTex.mEvilMonitor")
s/Tex/Test/
Please add MOZ_COUNT_CTOR and MOZ_COUNT_DTOR macros to
mozilla::Mutex, mozilla::Monitor, and mozilla::CondVar to ensure
you're not leaking locks anywhere.
Make sure you get green try runs before landing.
r=dbaron with that
Attachment #522915 -
Flags: review?(dbaron) → review+
Reporter | ||
Comment 19•14 years ago
|
||
(Removing FreeAutoLockStatics in patch 5 is great, though.)
Assignee | ||
Comment 20•14 years ago
|
||
Thanks much. This can't have been an enjoyable review :).
(In reply to comment #18)
> Comment on attachment 522915 [details] [diff] [review]
> part 4: Migrate everything else to mozilla:: sync primitives, v2
>
> XPCAutoLock should probably become a typedef for MonitorAutoEnter
> at some point. Doesn't have to be now, though. Though it'd be nice.
Yes, I didn't understand the fork there; maybe it predated nsAutoMonitor? Bug 556214 might be a good starting point for cleaning this up, since the re-entrancy semantics needs to be made clear there.
> In nsJar::Open and OpenInner, I think you should use another mechanism for
> enforcing the already-open check. Add a boolean if you need one.
Done. I added a bool because I wasn't sure how to generally check already-open in OpenInner.
> >+ mMonitor("nsCacheService.mMoniton"),
>
> s/n"/r"/
Fixed in bug 646259, which will be folded into this before pushing.
> nsHostResolver.h:
> >- PRLock *mLock;
> >- PRCondVar *mIdleThreadCV; // non-null if idle thread
> >+ Mutex mLock;
> >+ CondVar mIdleThreadCV; // non-null if idle thread
>
> comment is no longer true
D'oh, I went to check whether that comment was true in the old PRCondVar impl before nuking (it wasn't), then I forgot to nuke. Fixed.
> nsHttpActivityDistributor::Init should be removed, and it should
> just use NS_GENERIC_FACTORY_CONSTRUCTOR (no _INIT).
Done.
> Maybe leave the blank lines before and after the first (new) block
> in nsKeygenThread::Run?
OK.
> In nsHTTPListener::FreeLoadGroup you should leave the block that you
> removed; you're changing the scope of the autolock.
>
> >+ //Free the lock and return - no error - just implies nothing to schedule
>
> Drop the "Free the lock and "
Done.
> Also, you don't need to introduce a new scope and reindent in this
> method, since you unlock at the very end. (Having the return inside the
> lock scope is a sign you don't need it!)
Hm, I'm not sure there's a good solution here. I tried to only add new locking scopes where needed, and I was inconsistent here :S.
In v2, I reindented the sites I was 99.9% sure were strangely formatted just because of explicit lock/unlock, not because, say, pushing a value onto the stack while the mutex was held was found to cause perf issues (i.e., none).
> In nsNSSShutdownList you should probably either remove the indent
> in those methods or make an explicit scope to match what the indent
> used to show. Same for nsNSSActivityState (where you already
> did make the explicit scope in some places).
Changed per above.
> In nsProtectedAuthThread::GetSlot, don't introduce extra spaces before
> =.
Fixed.
> >+ mQueueCond.Wait(PR_MillisecondsToInterval(10000)_);
>
> You've got an extra underscore, which says you didn't compile on
> Android.
I did, locally, but only opt, which is also only what tinderbox does :S (bug 581290). Debug builds were broken for months until recently.
> In nsExceptionService, don't get too cute -- use a variable name
> with letters in it.
The problem here was that the static variable is named "lock", the same as our (apparent) convention for RAII auto-locks. Renamed static variable to sLock to match style guide.
> You shouldn't remove FreeAutoLockStatics calls in this patch.
Hm yeah, I had the kill-nsAutoLock patch before this in series so that users wouldn't compile. Will rejigger and fix. (I'm going to fold all these patches before pushing, but that change will help rebasing if a backout is needed.)
> >+ mLock("nsRecyclingAllocatior.mLock"),
>
> s/tior/tor/
Whups.
> nsPipe::GetInputStream and GetOutputStream need another mechanism
> of testing for initialization.
Nice catch! Tryserver found that too :). Another place where I forgot to change the is-init check after looking at the .cpp. Fixed with another new bool member introduced to match existing semantics.
> >+ : mCounterLock("ProxyTex.mCounterLock")
> >+ , mEvilMonitor("ProxyTex.mEvilMonitor")
>
> s/Tex/Test/
Fixed.
> Please add MOZ_COUNT_CTOR and MOZ_COUNT_DTOR macros to
> mozilla::Mutex, mozilla::Monitor, and mozilla::CondVar to ensure
> you're not leaking locks anywhere.
>
> Make sure you get green try runs before landing.
Yep!
Comment on attachment 522612 [details] [diff] [review]
part 5: Remove nsAutoLock.*
>diff --git a/xpcom/glue/Makefile.in b/xpcom/glue/Makefile.in
>--- a/xpcom/glue/Makefile.in
>+++ b/xpcom/glue/Makefile.in
>@@ -65,17 +65,16 @@ CPPSRCS = \
> nsStringAPI.cpp \
> GenericModule.cpp \
> $(NULL)
>
> # TODO nsAutoLock.h should die soon
Kill this line too.
Assignee | ||
Comment 22•14 years ago
|
||
Rebased, smaller now.
Attachment #522608 -
Attachment is obsolete: true
Attachment #522608 -
Flags: review?(chris.double)
Attachment #523088 -
Flags: review?(chris.double)
Assignee | ||
Comment 23•14 years ago
|
||
Something about these tests is making for very slow. The new deadlock detector was explicitly designed to fix scalability problems in the old one, and in testing was much better, so there's something very wrong here. Not sure whether the problem lies in the detector, the tests, or the code being tested, but I'd rather fix whichever off the critical path to landing this.
Attachment #523091 -
Flags: review?(ehsan)
Assignee | ||
Comment 24•14 years ago
|
||
These patches are looking good on tryserver, except the xpcshell tests didn't finish within 2 hours on windows :/. Most likely due to capturing backtraces being slow.
Attachment #523180 -
Flags: review?(dbaron)
Assignee | ||
Comment 25•14 years ago
|
||
Part 7 seems to fix timeout on tryserver. Looks like this is ready to go as soon as the reviews come in.
Comment 26•14 years ago
|
||
Comment on attachment 523091 [details] [diff] [review]
part 6: Make editor test be nicer to deadlock detector
These tests are stupid, and I will kill them without mercy in the future, but in the meantime, you're free to do whatever you need to them in order to get this landed. :-)
Attachment #523091 -
Flags: review?(ehsan) → review+
Updated•14 years ago
|
Attachment #523088 -
Flags: review?(chris.double) → review+
Assignee | ||
Comment 27•14 years ago
|
||
(In reply to comment #23)
> Created attachment 523091 [details] [diff] [review]
> part 6: Make editor test be nicer to deadlock detector
>
> Not sure
> whether the problem lies in the detector, the tests, or the code being tested,
> but I'd rather fix whichever off the critical path to landing this.
The main problem here is that
- the test is external API
- it defines XPCOM classes
- by the nature of the tests, there are a gerjillion AddRef/Release of the COM-y things defined under the external API
- AddRef/Release check NS_IsCycleCollectorThread() in debug builds
- ... aaand NS_IsCycleCollectorThread() is hahahaomglollerskates slow for external-API code: http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsCycleCollectorUtils.cpp#53
Most of the time in my profile was in tracemalloc. Running before/after with NS_TRACE_MALLOC_DISABLE_STACKS=1 took about the same amount of time; new detector was ~10% slower. The test was *much* slower with the patches here and stacks enabled; too slow for my patience. The old deadlock detector only checked lock ordering in the nsAutoLock/nsAutoMonitor constructors (which is a bug), and the new detector checks on all Mutex::Lock/Monitor::Enter calls. My guess is that something was using raw PR_* operations or nsAuto*::lock/Enter, which weren't getting ordering checks, and they're now being checked (which incurs grabbing a backtrace). Didn't confirm this.
Anyway, the takeaway is
- NS_IsCycleCollectorThread() is ridiculous for external-API code. Not sure how much anyone cares.
- TestTXMgr is chewing a ton of CPU cycles in debug builds; not sure what we gain from that.
Assignee | ||
Comment 28•14 years ago
|
||
Fixed a style inconsistency and a grammar-o.
Attachment #523180 -
Attachment is obsolete: true
Attachment #523180 -
Flags: review?(dbaron)
Attachment #523237 -
Flags: review?(dbaron)
Reporter | ||
Comment 29•14 years ago
|
||
Comment on attachment 523237 [details] [diff] [review]
part 7: Disable tracemalloc backtraces for xpcshell tests, v1.1
r=dbaron
I should have done this in bug 549561.
Attachment #523237 -
Flags: review?(dbaron) → review+
Assignee | ||
Comment 30•14 years ago
|
||
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 31•14 years ago
|
||
Was this announced anywhere, particularly to existing nsAutoLock consumers?
Assignee | ||
Comment 32•14 years ago
|
||
Increased frequency of intermittent orange bug 618052, backed out
http://hg.mozilla.org/mozilla-central/rev/e03c3a6df3cb
Sigh.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 33•14 years ago
|
||
The first round of OS X debug tests was green on tryserver, with parts 1-6. I pushed a second time with part 7 added, and just ran on win/OS X 10.6, and saw bug 618052 on 10.6. That's very puzzling.
Assignee | ||
Comment 34•14 years ago
|
||
I wonder if this is relevant: http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#590. Siiigh.
Assignee | ||
Comment 35•14 years ago
|
||
(In reply to comment #31)
> Was this announced anywhere, particularly to existing nsAutoLock consumers?
It was deprecated 2 years ago [1], but this bug wasn't announced beforehand. There was an effort to kill off nsAutoLock that began about 2 years ago and fizzled out because of review neglect, so I tried to move fast here.
[1] http://blog.mozilla.com/cjones/2009/05/04/farewell-prlock-i-hardly-knew-thee/
Comment 36•14 years ago
|
||
(In reply to comment #27)
> - TestTXMgr is chewing a ton of CPU cycles in debug builds; not sure what we
> gain from that.
Not much. => bug 647064.
Assignee | ||
Comment 37•14 years ago
|
||
Sadly,
- self.env["NS_TRACE_MALLOC_DISABLE_STACKS"] = "1"
+ if not (sys.platform == 'osx' or sys.platform == "darwin"):
+ # XXX automation.py has this odd special case; without it, bug
+ # 618052 seems to be exacerbated
+ self.env["NS_TRACE_MALLOC_DISABLE_STACKS"] = "1"
made this pass both OS X debug xpcshell runs on tryserver. Ah well; guess it's good to be consistent with automation.py until the two are unified.
Assignee | ||
Comment 38•14 years ago
|
||
Relanded with, ahem, correction to make automation.py and runxpcshelltests.py agree.
http://hg.mozilla.org/mozilla-central/rev/1a89509e25e4
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 39•14 years ago
|
||
Amazingly and perhaps frighteningly, this got an all-green.
Comment 40•14 years ago
|
||
(In reply to comment #35)
> (In reply to comment #31)
> > Was this announced anywhere, particularly to existing nsAutoLock consumers?
> It was deprecated 2 years ago
I guess that constitutes fair warning.
You need to log in
before you can comment on or make changes to this bug.
Description
•