Closed Bug 1370644 Opened 7 years ago Closed 7 years ago

Deadlock detector seems to pay no attention to the fact that a lock gets unlocked

Categories

(Core :: XPCOM, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla56
Performance Impact high
Tracking Status
firefox55 --- fixed
firefox56 --- fixed

People

(Reporter: ehsan.akhgari, Assigned: erahm)

References

Details

Attachments

(1 file)

See bug 1370448.  Let me try to walk you through the reported potential deadlock:

12:34:48     INFO -  PID 4684 | === Cyclical dependency starts at
12:34:48     INFO -  PID 4684 | --- Mutex : TimerThread.mMonitor calling context

This lock is held in TimerThread::FindNextFireTimeForCurrentThread() which is being triggered from https://hg.mozilla.org/integration/mozilla-inbound/rev/16798a4167a1#l1.40 (sorry for linking to inbound, this code hasn't landed.)

What happens here is like this:

* xpcshell js code calls nsIThread::hasPendingEvents()
* That gets to https://hg.mozilla.org/integration/mozilla-inbound/rev/16798a4167a1#l1.83
* That gets to https://hg.mozilla.org/integration/mozilla-inbound/rev/16798a4167a1#l1.40
* That gets to https://searchfox.org/mozilla-central/rev/d441cb24482c2e5448accaf07379445059937080/xpcom/threads/MainThreadIdlePeriod.cpp#33
* That gets to https://searchfox.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#596

The lock is held and released using MonitorAutoLock.  The deadlock detector clearly knows that the lock isn't being held since it's not printing "(currently acquired)" here!

12:34:48     INFO -  PID 4684 |   [stack trace unavailable]
12:34:48     INFO -  PID 4684 | --- Next dependency:
12:34:48     INFO -  PID 4684 | --- Mutex : nsSocketTransportService::mLock (currently acquired)

https://hg.mozilla.org/integration/mozilla-inbound/rev/551c2fc04c72#l1.12

12:34:48     INFO -  PID 4684 |  calling context
12:34:48     INFO -  PID 4684 |   [stack trace unavailable]
12:34:48     INFO -  PID 4684 | --- Next dependency:
12:34:48     INFO -  PID 4684 | --- Mutex : nsTimerImpl::mMutex (currently acquired)

https://searchfox.org/mozilla-central/rev/d441cb24482c2e5448accaf07379445059937080/xpcom/threads/nsTimerImpl.cpp#327

12:34:48     INFO -  PID 4684 |  calling context
12:34:48     INFO -  PID 4684 |   [stack trace unavailable]
12:34:48     INFO -  PID 4684 | === Cycle completed at
12:34:48     INFO -  PID 4684 | --- Mutex : TimerThread.mMonitor calling context

https://searchfox.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#547


It would be nice to teach the deadlock detector to be gentle and keep silent here and not complain without a reason.
Bug 1371704 is posing a threat to backout bug 1368286 due to this bug for the beta merge to go on smoothly, and bug 1368286 is pretty important to not get backed out for us to not starve our idle queue.  As a result of that I'm going to bump up the priority of this to [qf:p1]...
Whiteboard: [qf:p1]
I'll take a look at this.
Assignee: nobody → erahm
So just to be clear, this is more of warning. The deadlock detector thinks you're setting yourself up for issues because you're acquiring locks in a different order than it expects. It's definitely possible there's a logic error and this is a false positive, I'm going to dig into the code and try to figure out what's going on.

If I recall correctly, it's something like the first time you acquired lock C you already held lock A and B. The next time you you acquired C you held lock just lock F (please don't argue the validity of this statement, I still need to look back at the logic).

I think a super quick solution is to only use |NS_ERROR| if |maybeImminent == true| [1] and otherwise use an |NS_WARNING|.

[1] http://searchfox.org/mozilla-central/rev/a798ee4fc323f9387b7576dbed177859d29d09b7/xpcom/threads/BlockingResourceBase.cpp#308
This modifies the logic in |CheckAcquisition| to only call |NS_ERROR| if we're
really going to deadlock. Instead, if we detect a suspicious cycle, we just use
an |NS_WARNING|. This means that we'll still output warning text in debug
builds, but we won't cause the process to abort.

Ehsan, can you test if this fixes your intermittents?

MozReview-Commit-ID: 71mFInWwbDY
Attachment #8876245 - Flags: review?(nfroyd)
Attachment #8876245 - Flags: feedback?(ehsan)
I also plan on following up on whether or not our logic is correct, hence the current patch being part 1. I think what we have is good enough to unblock Ehsan for now though.
I made a try push to test whether this will fix the permaorange we saw on beta55: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8cb95f92fe8621f69715a6f6ab6ebf977a3061a3
Comment on attachment 8876245 [details] [diff] [review]
Part 1: Only use NS_ERROR for an imminent failure

Review of attachment 8876245 [details] [diff] [review]:
-----------------------------------------------------------------

Nobody ever pays attention to warnings, but...
Attachment #8876245 - Flags: review?(nfroyd) → review+
Comment on attachment 8876245 [details] [diff] [review]
Part 1: Only use NS_ERROR for an imminent failure

Review of attachment 8876245 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like this indeed fixes the issue.  Thanks!
Attachment #8876245 - Flags: feedback?(ehsan) → feedback+
https://hg.mozilla.org/integration/mozilla-inbound/rev/091c5613b5cfe1e448bd2c93cd2bb4149701ba9f
Bug 1370644 - Part 1: Only use NS_ERROR for an imminent failure. r=froydnj
Backed out for failing GTest XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest | Death test: Sanity3_Child():

https://hg.mozilla.org/integration/mozilla-inbound/rev/25b03b08f551411d7618232028584d66498d5bba

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=091c5613b5cfe1e448bd2c93cd2bb4149701ba9f&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=106042392&repo=mozilla-inbound

[task 2017-06-10T08:13:50.589511Z] 08:13:50     INFO -  TEST-START | XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest
[task 2017-06-10T08:13:50.589780Z] 08:13:50     INFO -  [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:836:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 15 threads.
[task 2017-06-10T08:13:50.589977Z] 08:13:50     INFO -  Program /home/worker/workspace/build/application/firefox/firefox (pid = 1203) received signal 11.
[task 2017-06-10T08:13:50.590163Z] 08:13:50     INFO -  Stack:
[task 2017-06-10T08:13:50.590389Z] 08:13:50     INFO -  #01: mozilla::OffTheBooksMutex::~OffTheBooksMutex [xpcom/threads/Mutex.h:57]
[task 2017-06-10T08:13:50.590601Z] 08:13:50     INFO -  #02: Sanity3_Child [xpcom/tests/gtest/TestDeadlockDetector.cpp:146]
[task 2017-06-10T08:13:50.590835Z] 08:13:50     INFO -  #03: XPCOMDeadlockDetectorTest_XPCOMSanity3DeathTest_Test::TestBody [xpcom/tests/gtest/TestDeadlockDetector.cpp:174]
[task 2017-06-10T08:13:50.591042Z] 08:13:50     INFO -  #04: testing::Test::Run [testing/gtest/gtest/src/gtest.cc:2481]
[task 2017-06-10T08:13:50.591261Z] 08:13:50     INFO -  #05: testing::TestInfo::Run [testing/gtest/gtest/src/gtest.cc:2656]
[task 2017-06-10T08:13:50.591475Z] 08:13:50     INFO -  #06: testing::TestCase::Run [testing/gtest/gtest/src/gtest.cc:2774]
[task 2017-06-10T08:13:50.591703Z] 08:13:50     INFO -  #07: testing::internal::UnitTestImpl::RunAllTests [testing/gtest/gtest/src/gtest.cc:4647]
[task 2017-06-10T08:13:50.591919Z] 08:13:50     INFO -  #08: testing::UnitTest::Run [testing/gtest/gtest/src/gtest.cc:4260]
[task 2017-06-10T08:13:50.592132Z] 08:13:50     INFO -  #09: mozilla::RunGTestFunc [xpcom/base/nsCOMPtr.h:402]
[task 2017-06-10T08:13:50.592348Z] 08:13:50     INFO -  #10: XREMain::XRE_mainStartup [toolkit/xre/nsAppRunner.cpp:3818]
[task 2017-06-10T08:13:50.592558Z] 08:13:50     INFO -  #11: XREMain::XRE_main [toolkit/xre/nsAppRunner.cpp:4735]
[task 2017-06-10T08:13:50.592766Z] 08:13:50     INFO -  #12: XRE_main [toolkit/xre/nsAppRunner.cpp:4844]
[task 2017-06-10T08:13:50.592997Z] 08:13:50     INFO -  #13: do_main [browser/app/nsBrowserApp.cpp:237]
[task 2017-06-10T08:13:50.593229Z] 08:13:50     INFO -  #14: main [browser/app/nsBrowserApp.cpp:312]
[task 2017-06-10T08:13:50.593455Z] 08:13:50     INFO -  #15: libc.so.6 + 0x20830
[task 2017-06-10T08:13:50.593721Z] 08:13:50     INFO -  #16: _start
[task 2017-06-10T08:13:50.593901Z] 08:13:50     INFO -  Sleeping for 0 seconds.
[task 2017-06-10T08:13:50.594196Z] 08:13:50     INFO -  Type 'gdb /home/worker/workspace/build/application/firefox/firefox 1203' to attach your debugger to this thread.
[task 2017-06-10T08:13:50.594403Z] 08:13:50     INFO -  Done sleeping...
[task 2017-06-10T08:13:50.594641Z] 08:13:50  WARNING -  TEST-UNEXPECTED-FAIL | XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest | Death test: Sanity3_Child()
[task 2017-06-10T08:13:50.594983Z] 08:13:50     INFO -      Result: died but not with expected error.
[task 2017-06-10T08:13:50.595344Z] 08:13:50     INFO -    Expected: ###!!! ERROR: Potential deadlock detected.*=== Cyclical dependency starts at.*--- Mutex : dd.sanity3.m1.*--- Next dependency:.*--- Mutex : dd.sanity3.m2.*--- Next dependency:.*--- Mutex : dd.sanity3.m3.*--- Next dependency:.*--- Mutex : dd.sanity3.m4.*=== Cycle completed at.*--- Mutex : dd.sanity3.m1.*###!!! ASSERTION: Potential deadlock detected.*
[task 2017-06-10T08:13:50.595466Z] 08:13:50     INFO -  Actual msg:
[task 2017-06-10T08:13:50.595736Z] 08:13:50     INFO -  [  DEATH   ] [1203] WARNING: OOPDeinit() without successful OOPInit(): file /home/worker/workspace/build/src/toolkit/crashreporter/nsExceptionHandler.cpp, line 3531
[task 2017-06-10T08:13:50.595937Z] 08:13:50     INFO -  [  DEATH   ] ###!!! ERROR: Potential deadlock detected:
[task 2017-06-10T08:13:50.596167Z] 08:13:50     INFO -  [  DEATH   ] === Cyclical dependency starts at
[task 2017-06-10T08:13:50.596358Z] 08:13:50     INFO -  [  DEATH   ] --- Mutex : dd.sanity3.m1 calling context
[task 2017-06-10T08:13:50.596621Z] 08:13:50     INFO -  [  DEATH   ]   [stack trace unavailable]
[task 2017-06-10T08:13:50.596864Z] 08:13:50     INFO -  [  DEATH   ]
[task 2017-06-10T08:13:50.597111Z] 08:13:50     INFO -  [  DEATH   ] --- Next dependency:
[task 2017-06-10T08:13:50.597324Z] 08:13:50     INFO -  [  DEATH   ] --- Mutex : dd.sanity3.m2 calling context
[task 2017-06-10T08:13:50.597542Z] 08:13:50     INFO -  [  DEATH   ]   [stack trace unavailable]
[task 2017-06-10T08:13:50.597757Z] 08:13:50     INFO -  [  DEATH   ]
[task 2017-06-10T08:13:50.597968Z] 08:13:50     INFO -  [  DEATH   ] --- Next dependency:
[task 2017-06-10T08:13:50.598210Z] 08:13:50     INFO -  [  DEATH   ] --- Mutex : dd.sanity3.m3 calling context
[task 2017-06-10T08:13:50.598408Z] 08:13:50     INFO -  [  DEATH   ]   [stack trace unavailable]
[task 2017-06-10T08:13:50.598625Z] 08:13:50     INFO -  [  DEATH   ]
[task 2017-06-10T08:13:50.598857Z] 08:13:50     INFO -  [  DEATH   ] --- Next dependency:
[task 2017-06-10T08:13:50.599099Z] 08:13:50     INFO -  [  DEATH   ] --- Mutex : dd.sanity3.m4 (currently acquired)
[task 2017-06-10T08:13:50.599297Z] 08:13:50     INFO -  [  DEATH   ]  calling context
[task 2017-06-10T08:13:50.599524Z] 08:13:50     INFO -  [  DEATH   ]   [stack trace unavailable]
[task 2017-06-10T08:13:50.599741Z] 08:13:50     INFO -  [  DEATH   ]
[task 2017-06-10T08:13:50.599975Z] 08:13:50     INFO -  [  DEATH   ] === Cycle completed at
[task 2017-06-10T08:13:50.600166Z] 08:13:50     INFO -  [  DEATH   ] --- Mutex : dd.sanity3.m1 calling context
[task 2017-06-10T08:13:50.600427Z] 08:13:50     INFO -  [  DEATH   ]   [stack trace unavailable]
[task 2017-06-10T08:13:50.600640Z] 08:13:50     INFO -  [  DEATH   ]
[task 2017-06-10T08:13:50.600887Z] 08:13:50     INFO -  [  DEATH   ] Deadlock may happen for some other execution
[task 2017-06-10T08:13:50.601114Z] 08:13:50     INFO -  [  DEATH   ]
[task 2017-06-10T08:13:50.601340Z] 08:13:50     INFO -  [  DEATH   ] [1203] WARNING: Potential deadlock detected:
[task 2017-06-10T08:13:50.601563Z] 08:13:50     INFO -  [  DEATH   ] Cyclical dependency starts at
[task 2017-06-10T08:13:50.601796Z] 08:13:50     INFO -  [  DEATH   ] Mutex : dd.sanity3.m1
[task 2017-06-10T08:13:50.602022Z] 08:13:50     INFO -  [  DEATH   ] Next dependency:
[task 2017-06-10T08:13:50.602240Z] 08:13:50     INFO -  [  DEATH   ] Mutex : dd.sanity3.m2
[task 2017-06-10T08:13:50.602463Z] 08:13:50     INFO -  [  DEATH   ] Next dependency:
[task 2017-06-10T08:13:50.602692Z] 08:13:50     INFO -  [  DEATH   ] Mutex : dd.sanity3.m3
[task 2017-06-10T08:13:50.602930Z] 08:13:50     INFO -  [  DEATH   ] Next dependency:
[task 2017-06-10T08:13:50.603154Z] 08:13:50     INFO -  [  DEATH   ] Mutex : dd.sanity3.m4 (currently acquired)
[task 2017-06-10T08:13:50.603375Z] 08:13:50     INFO -  [  DEATH   ] Cycle completed at
[task 2017-06-10T08:13:50.603584Z] 08:13:50     INFO -  [  DEATH   ] Mutex : dd.sanity3.m1
[task 2017-06-10T08:13:50.603812Z] 08:13:50     INFO -  [  DEATH   ] Deadlock may happen for some other execution
[task 2017-06-10T08:13:50.604031Z] 08:13:50     INFO -  [  DEATH   ]
[task 2017-06-10T08:13:50.604274Z] 08:13:50     INFO -  [  DEATH   ] : file /home/worker/workspace/build/src/xpcom/threads/BlockingResourceBase.cpp, line 308
[task 2017-06-10T08:13:50.604487Z] 08:13:50     INFO -  [  DEATH   ] Assertion failure: !mOwningThread (destroying a still-owned lock!), at /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/Mutex.h:57
[task 2017-06-10T08:13:50.604759Z] 08:13:50     INFO -  [  DEATH   ] #01: Sanity3_Child [xpcom/tests/gtest/TestDeadlockDetector.cpp:146]
[task 2017-06-10T08:13:50.605096Z] 08:13:50     INFO -  [  DEATH   ] #02: XPCOMDeadlockDetectorTest_XPCOMSanity3DeathTest_Test::TestBody [xpcom/tests/gtest/TestDeadlockDetector.cpp:174]
[task 2017-06-10T08:13:50.605316Z] 08:13:50     INFO -  [  DEATH   ] #03: testing::Test::Run [testing/gtest/gtest/src/gtest.cc:2481]
[task 2017-06-10T08:13:50.605523Z] 08:13:50     INFO -  [  DEATH   ] #04: testing::TestInfo::Run [testing/gtest/gtest/src/gtest.cc:2656]
[task 2017-06-10T08:13:50.605799Z] 08:13:50     INFO -  [  DEATH   ] #05: testing::TestCase::Run [testing/gtest/gtest/src/gtest.cc:2774]
[task 2017-06-10T08:13:50.606031Z] 08:13:50     INFO -  [  DEATH   ] #06: testing::internal::UnitTestImpl::RunAllTests [testing/gtest/gtest/src/gtest.cc:4647]
[task 2017-06-10T08:13:50.606269Z] 08:13:50     INFO -  [  DEATH   ] #07: testing::UnitTest::Run [testing/gtest/gtest/src/gtest.cc:4260]
[task 2017-06-10T08:13:50.606512Z] 08:13:50     INFO -  [  DEATH   ] #08: mozilla::RunGTestFunc [xpcom/base/nsCOMPtr.h:402]
[task 2017-06-10T08:13:50.606751Z] 08:13:50     INFO -  [  DEATH   ] #09: XREMain::XRE_mainStartup [toolkit/xre/nsAppRunner.cpp:3818]
[task 2017-06-10T08:13:50.606972Z] 08:13:50     INFO -  [  DEATH   ] #10: XREMain::XRE_main [toolkit/xre/nsAppRunner.cpp:4735]
[task 2017-06-10T08:13:50.607201Z] 08:13:50     INFO -  [  DEATH   ] #11: XRE_main [toolkit/xre/nsAppRunner.cpp:4844]
[task 2017-06-10T08:13:50.607405Z] 08:13:50     INFO -  [  DEATH   ] #12: do_main [browser/app/nsBrowserApp.cpp:237]
[task 2017-06-10T08:13:50.607700Z] 08:13:50     INFO -  [  DEATH   ] #13: main [browser/app/nsBrowserApp.cpp:312]
[task 2017-06-10T08:13:50.607897Z] 08:13:50     INFO -  [  DEATH   ] #14: libc.so.6 + 0x20830
[task 2017-06-10T08:13:50.608162Z] 08:13:50     INFO -  [  DEATH   ] #15: _start
[task 2017-06-10T08:13:50.608417Z] 08:13:50     INFO -  [  DEATH   ]  @ /home/worker/workspace/build/src/xpcom/tests/gtest/TestDeadlockDetector.cpp:174
[task 2017-06-10T08:13:50.608673Z] 08:13:50  WARNING -  TEST-UNEXPECTED-FAIL | XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest | test completed (time: 8ms)
Flags: needinfo?(erahm)
I pushed a patch that disables Sanity3DeathTest temporarily to try:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=daf2ff8ea44bb5612b50c02b18f11553704868e4
Second patch is green on try, I'm gonna land it after checking with erahm.
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c38ea3bf8704
Part 1: Only use NS_ERROR for an imminent failure. r=froydnj
Just to note: Ehsan updated the patch to disable the Sanity3 and TwoThreads death tests. This makes sense since we've changed the potential deadlock code to just be a warning. As a follow-up when we have more time to look into this we can probably just re-enable those as non-death tests.
Flags: needinfo?(erahm)
https://hg.mozilla.org/mozilla-central/rev/c38ea3bf8704
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Performance Impact: --- → P1
Whiteboard: [qf:p1]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: