Closed
Bug 1331049
Opened 7 years ago
Closed 7 years ago
DeadlockDetector death test block for 90s each and cause gtest failure due to timeout on osx debug TC build
Categories
(Testing :: General, defect)
Testing
General
Tracking
(firefox57 fixed)
RESOLVED
FIXED
mozilla57
Tracking | Status | |
---|---|---|
firefox57 | --- | fixed |
People
(Reporter: jmaher, Assigned: jmaher)
References
Details
Attachments
(3 files)
3.57 KB,
text/plain
|
Details | |
948 bytes,
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
878 bytes,
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
test: BB runtime, BBB runtime, diff (seconds) XPCOMDeadlockDetectorTest.XPCOMSanity2DeathTest: 0:00:02,0:01:35,93 XPCOMDeadlockDetectorTest.XPCOMContentionNoDeadlock: 0:00:01,0:01:32,91 XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest: 0:00:01,0:01:38,97 storage_DeadlockDetectorTest.storage_Sanity3DeathTest: 0:00:01,0:01:33,92 storage_DeadlockDetectorTest.storage_TwoThreadsDeathTest: 0:00:01,0:01:33,92 storage_DeadlockDetectorTest.storage_Sanity2DeathTest: 0:00:02,0:01:33,91 XPCOMDeadlockDetectorTest.XPCOMTwoThreadsDeathTest: 0:00:02,0:01:32,90 storage_DeadlockDetectorTest.storage_ContentionNoDeadlock: 0:00:01,0:01:33,92 XPCOMDeadlockDetectorTest.XPCOMSanity4DeathTest: 0:00:01,0:01:31,90 storage_DeadlockDetectorTest.storage_Sanity4DeathTest: 0:00:01,0:01:33,92 there are 10 tests that have a 90 second delay vs the other tests, and I believe this makes up the large majority of our time differences. Sadly we are running ~1500 tests out of the almost 4500 tests which we run typically, so there might be more hiding out there.
Comment 1•7 years ago
|
||
erahm: wcosta and jmaher are standing up tests on the taskcluster mac builds, which are cross-compiled on Linux. Most of the tests seem to be running fine, but the gtests are timing out and it seems to be because the DeadlockDetectorTests are taking way longer on the cross-builds (see comment 0). It seems to be very close to 90 seconds in every case, which makes me think there's something fundamentally wrong here. Can you take a look at this?
Flags: needinfo?(erahm)
Comment 2•7 years ago
|
||
Joel, can you get me the output of these tests? A couple of thoughts: #1 - These tests should only be running on debug builds, is it possible the optimization settings are different b/w x-compiler and native? #2 - Most of those are death tests, we mess with the gdb sleep duration to make them fail quickly, it's possible that's not working w/ the x-compiler
Flags: needinfo?(erahm) → needinfo?(jmaher)
Assignee | ||
Comment 3•7 years ago
|
||
I am going to ask :ted this since I am not too familiar with the build specifics
Flags: needinfo?(jmaher) → needinfo?(ted)
Comment 4•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #0) > test: BB runtime, BBB runtime, diff (seconds) > XPCOMDeadlockDetectorTest.XPCOMSanity2DeathTest: 0:00:02,0:01:35,93 > XPCOMDeadlockDetectorTest.XPCOMContentionNoDeadlock: 0:00:01,0:01:32,91 > XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest: 0:00:01,0:01:38,97 > storage_DeadlockDetectorTest.storage_Sanity3DeathTest: 0:00:01,0:01:33,92 > storage_DeadlockDetectorTest.storage_TwoThreadsDeathTest: 0:00:01,0:01:33,92 > storage_DeadlockDetectorTest.storage_Sanity2DeathTest: 0:00:02,0:01:33,91 > XPCOMDeadlockDetectorTest.XPCOMTwoThreadsDeathTest: 0:00:02,0:01:32,90 > storage_DeadlockDetectorTest.storage_ContentionNoDeadlock: 0:00:01,0:01:33,92 > XPCOMDeadlockDetectorTest.XPCOMSanity4DeathTest: 0:00:01,0:01:31,90 > storage_DeadlockDetectorTest.storage_Sanity4DeathTest: 0:00:01,0:01:33,92 > > there are 10 tests that have a 90 second delay vs the other tests, and I > believe this makes up the large majority of our time differences. > > Sadly we are running ~1500 tests out of the almost 4500 tests which we run > typically, so there might be more hiding out there. I am confused, is this causing Gtest to fail? Logs suggest the intermittent bug 1237167 is permanent for x-compiled builds.
Flags: needinfo?(jmaher)
Assignee | ||
Comment 5•7 years ago
|
||
this is causing gtest to fail because our runtime is longer than the 20 minutes allocated for the job. Typically this job takes <13 minutes, so 20 minutes would appear to be more than enough time. Looking at the other gtest bugs on file, I see most are on windows, and some instances on OSX. Looking at the logs, I am not finding these specific tests mentioned, and most of the logs are for builds, not specific test tasks- I don't understand why we have gtests run different ways.
Flags: needinfo?(jmaher)
Comment 6•7 years ago
|
||
(In reply to Eric Rahm [:erahm] from comment #2) > #1 - These tests should only be running on debug builds, is it possible > the optimization settings are different b/w x-compiler and native? We've been running the cross-compiled builds for a while now, I just picked an arbitrary inbound revision that had both types of mac debug builds: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=74790f4d30de03105799064332f44d045e40723e Here's the native buildbot build log: https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1484659765/mozilla-inbound-macosx64-debug-bm84-build1-build7693.txt.gz and here's the cross-compiled taskcluster build log: https://public-artifacts.taskcluster.net/VO6Vlx5PQEe7H7q8F0YxDg/0/public/logs/live_backing.log These are dep builds, so picking out a source file that actually got compiled we see: buildbot: 05:51:58 INFO - /usr/local/bin/ccache /builds/slave/m-in-m64-d-0000000000000000000/build/src/clang/bin/clang++ -std=gnu++11 -o Unified_cpp_dom_presentation0.o -c -fvisibility=hidden -fvisibility-inlines-hidden -DDEBUG=1 -DTRACING=1 -DOS_POSIX=1 -DOS_MACOSX=1 -DSTATIC_EXPORTABLE_JS_API -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/dom/presentation -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/dom/presentation -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/dom/base -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/ipc/ipdl/_ipdlheaders -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/ipc/chromium/src -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/ipc/glue -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/dist/include -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/dist/include/nspr -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/dist/include/nss -fPIC -DMOZILLA_CLIENT -include /builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/mozilla-config.h -MD -MP -MF .deps/Unified_cpp_dom_presentation0.o.pp -Qunused-arguments -Qunused-arguments -Wall -Wc++11-compat -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wloop-analysis -Wc++11-compat-pedantic -Wc++14-compat -Wc++14-compat-pedantic -Wc++1z-compat -Wimplicit-fallthrough -Werror=non-literal-null-conversion -Wstring-conversion -Wthread-safety -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-unknown-warning-option -Wno-return-type-c-linkage -fno-exceptions -fno-strict-aliasing -stdlib=libc++ -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -O3 -fno-omit-frame-pointer -Werror /builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/dom/presentation/Unified_cpp_dom_presentation0.cpp taskcluster: [task 2017-01-17T13:40:42.153341Z] 13:40:42 INFO - /home/worker/workspace/build/src/sccache2/sccache /home/worker/workspace/build/src/clang/bin/clang++ -target x86_64-apple-darwin10 -mlinker-version=136 -B /home/worker/workspace/build/src/cctools/bin -isysroot /home/worker/workspace/build/src/MacOSX10.7.sdk -std=gnu++11 -o Unified_cpp_dom_presentation0.o -c -fvisibility=hidden -fvisibility-inlines-hidden -DDEBUG=1 -DTRACING=1 -DOS_POSIX=1 -DOS_MACOSX=1 -DSTATIC_EXPORTABLE_JS_API -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -I/home/worker/workspace/build/src/dom/presentation -I/home/worker/workspace/build/src/obj-firefox/dom/presentation -I/home/worker/workspace/build/src/dom/base -I/home/worker/workspace/build/src/obj-firefox/ipc/ipdl/_ipdlheaders -I/home/worker/workspace/build/src/ipc/chromium/src -I/home/worker/workspace/build/src/ipc/glue -I/home/worker/workspace/build/src/obj-firefox/dist/include -I/home/worker/workspace/build/src/obj-firefox/dist/include/nspr -I/home/worker/workspace/build/src/obj-firefox/dist/include/nss -fPIC -DMOZILLA_CLIENT -include /home/worker/workspace/build/src/obj-firefox/mozilla-config.h -MD -MP -MF .deps/Unified_cpp_dom_presentation0.o.pp -Qunused-arguments -Qunused-arguments -Wall -Wc++11-compat -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wloop-analysis -Wc++11-compat-pedantic -Wc++14-compat -Wc++14-compat-pedantic -Wc++1z-compat -Wimplicit-fallthrough -Werror=non-literal-null-conversion -Wstring-conversion -Wthread-safety -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-unknown-warning-option -Wno-return-type-c-linkage -fno-exceptions -fno-strict-aliasing -stdlib=libc++ -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -Xclang -load -Xclang /home/worker/workspace/build/src/obj-firefox/build/clang-plugin/libclang-plugin.dylib -Xclang -add-plugin -Xclang moz-check -O3 -fno-omit-frame-pointer -Werror /home/worker/workspace/build/src/obj-firefox/dom/presentation/Unified_cpp_dom_presentation0.cpp Doing some fiddly local comparison it looks like the only differences are: a) The taskcluster build uses our clang static analysis plugin b) The taskcluster build passes `-B .../cctools/bin` c) The taskcluster build explicitly passes -isysroot d) The taskcluster build passes -mlinker-version=136 e) The taskcluster build explicitly passes `-target x86_64-apple-darwin10`. None of those seem like they ought to make any real difference. For reference, the main difference in the in-tree mozconfigs is that the buildbot builds include this mozconfig: https://dxr.mozilla.org/mozilla-central/source/build/macosx/local-mozconfig.common where the taskcluster builds include this one: https://dxr.mozilla.org/mozilla-central/source/build/macosx/cross-mozconfig.common > #2 - Most of those are death tests, we mess with the gdb sleep duration to > make them fail quickly, it's possible that's not working w/ the x-compiler This is entirely possible, I don't have a working Mac at the moment to test. Although I note that the default gdb sleep duration is 300 seconds: https://dxr.mozilla.org/mozilla-central/rev/3e275d37a06236981bff399b7d7aa0646be3fee7/toolkit/xre/nsSigHandlers.cpp#43 ...and these tests are taking ~90 seconds running against the taskcluster builds. jmaher: erahm asked in comment 2 but maybe you missed it, can you link the test logs you were using for comparison in comment 0? I closed that tab and don't have them handy.
Flags: needinfo?(ted) → needinfo?(jmaher)
Assignee | ||
Comment 7•7 years ago
|
||
an example of a the failing gtest (taskcluster) is: https://archive.mozilla.org/pub/firefox/try-builds/wcosta@mozilla.com-fdb545d3f298790a8406edf0155dbf0d90557ecf/try-macosx64-debug/try_yosemite_r7-debug_test-gtest-bm135-tests1-macosx-build201.txt.gz an example of a passing gtest (buildbot) is: https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1484664924/mozilla-central_yosemite_r7-debug_test-gtest-bm108-tests1-macosx-build38.txt.gz
Flags: needinfo?(jmaher)
Comment 8•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #7) > an example of a the failing gtest (taskcluster) is: > https://archive.mozilla.org/pub/firefox/try-builds/wcosta@mozilla.com- > fdb545d3f298790a8406edf0155dbf0d90557ecf/try-macosx64-debug/try_yosemite_r7- > debug_test-gtest-bm135-tests1-macosx-build201.txt.gz Alright, we can see that |ContentionNoDeadlock| is actually only taking 819ms, so the timeouts are all limited to death tests: > XPCOMDeadlockDetectorTest.XPCOMContentionNoDeadlock | test completed (time: 819ms) Unfortunately there's no useful output in the logs. > an example of a passing gtest (buildbot) is: > https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central- > macosx64-debug/1484664924/mozilla-central_yosemite_r7-debug_test-gtest-bm108- > tests1-macosx-build38.txt.gz And it looks like for our baseline we're expecting < 2s for the death tests.
Comment 9•7 years ago
|
||
Also of note: - The tests *are* passing, so the output in the child process we were expecting was found - The non-deathtests are fine, so it's not deadlock detector itself being slow
Assignee | ||
Comment 10•7 years ago
|
||
the good news is if we remove the 20 minute restriction, this test passes in ~27 minutes with no errors :) Even better news- there are no other tests which are taking extra time.
Comment 11•7 years ago
|
||
Wander, is there a way for me to test a patch against the t-c build? I just want force logging of the output of the deathtest.
Flags: needinfo?(wcosta)
Comment 12•7 years ago
|
||
(In reply to Eric Rahm [:erahm] from comment #11) > Wander, is there a way for me to test a patch against the t-c build? I just > want force logging of the output of the deathtest. You can apply my patch in bug 1330310, then if you push to try with "-p macosx64 -b d -u gtest -t none", tests will run against tc x-build. Does this work for you?
Flags: needinfo?(wcosta)
Comment 13•7 years ago
|
||
Everything ran fine when attempting to repro locally with the t-c build and just the DeadlockDetector tests:
> (venv)[ericrahm@erahm-25043 ~/dev/tmp/gtest_tc_hangs]$ GTEST_FILTER='XPCOMDeadlock*' python -u gtest/rungtests.py --xre-path=application/NightlyDebug.app/Contents/Resources --cwd=gtest --symbols-path=symbols --utility-path=bin application/NightlyDebug.app/Contents/MacOS/firefox
> gtest INFO | Running gtest
> [62356] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /home/worker/workspace/build/src/xpcom/base/nsTraceRefcnt.cpp, line 172
> [62356] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /home/worker/workspace/build/src/xpcom/base/nsTraceRefcnt.cpp, line 172
> [62356] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /home/worker/workspace/build/src/xpcom/base/nsTraceRefcnt.cpp, line 172
> Running GTest tests...
> Setting up crash reporting
> TEST-INFO | GTest unit test starting
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMSanityDeathTest
>
> [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 24 threads.
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMSanityDeathTest | test completed (time: 104ms)
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMSanity2DeathTest
>
> [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 24 threads.
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMSanity2DeathTest | test completed (time: 105ms)
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest
>
> [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 24 threads.
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest | test completed (time: 107ms)
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMSanity4DeathTest
>
> [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 24 threads.
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMSanity4DeathTest | test completed (time: 104ms)
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMTwoThreadsDeathTest
>
> [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 24 threads.
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMTwoThreadsDeathTest | test completed (time: 27ms)
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMContentionNoDeadlock
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMContentionNoDeadlock | test completed (time: 517ms)
> TEST-PASS | GTest unit test: passed
> Passed: 6
> Failed: 0
It's possible there's a bad interaction with other tests, I'll do a full run.
Comment 14•7 years ago
|
||
The death tests ran fine in a full run as well. I've requested a loaner in bug 1331739, we'll see if I can repro there once I get the loaner.
Assignee: nobody → erahm
Comment 15•7 years ago
|
||
Here's a script to setup and run the gtests from one of wcosta's builds.
Comment 16•7 years ago
|
||
Try build that purposefully fails the deadlock detector death tests so we get full output: https://treeherder.mozilla.org/#/jobs?repo=try&revision=da54c923e6a50426fbf2a207c49d25cd1822201e
Updated•7 years ago
|
Attachment #8827695 -
Attachment mime type: application/x-shellscript → text/plain
Comment 17•7 years ago
|
||
(In reply to Eric Rahm [:erahm] from comment #16) > Try build that purposefully fails the deadlock detector death tests so we > get full output: > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=da54c923e6a50426fbf2a207c49d25cd1822201e The tests are still taking 90+ seconds here, and I don't see anything immediately suspicious in the output. There's just a big time gap between the [WARNING] about death tests and threads and the failure message. 18:26:14 INFO - TEST-START | XPCOMDeadlockDetectorTest.XPCOMSanityDeathTest 18:26:14 INFO - [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 18 threads. 18:27:49 WARNING - TEST-UNEXPECTED-FAIL | XPCOMDeadlockDetectorTest.XPCOMSanityDeathTest | Death test: Sanity_Child()
Comment 18•7 years ago
|
||
I've got it reproduced on the loaner, so there's that at least. Still not sure what's going on, I'll dig into the gtest code next.
Comment 19•7 years ago
|
||
I switched over to 'threadsafe' tests (fork and exec) but that didn't help. I can see the output being dumped to a tmp file immediately, so it's definitely not a test startup issue. Possibly an issue in our crash handler? Next step is to test out my lldb-fu to see if I can see where it's blocked.
Comment 20•7 years ago
|
||
llbd is unable to attach to the child process, attaching to parent process show's it waiting on the pipe [1]. The idea here is if the test finishes (this is a failure for a death test) a message is passed back via the pipe saying things didn't go as planned, on proper failure (the test aborts) the pipe is expected to close due to the child process aborting. So the question is why is the pipe not closing for 90s on the t-c machine? A few possibilities: - The ah **** handler is still pausing, this shouldn't happen as there shouldn't be a signal handler on mac [2] but maybe |LINUX| is defined (note: not |XP_LINUX|) due to the x-compile - Our crash handler is somehow getting in the way, we explicitly disable it in the tests but the gtest main does enable it [3] if an env var is set, which it is in automation [4] - There's some setting on our test machine that doesn't close the pipe's fd immediately when the child exits [1] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/testing/gtest/gtest/src/gtest-death-test.cc#389-395 [2] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/toolkit/xre/nsSigHandlers.cpp#45-48 [3] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/testing/gtest/mozilla/GTestRunner.cpp#94-115 [4] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/testing/gtest/rungtests.py#92
Comment 21•7 years ago
|
||
Ted said he could check if |__linux__| is defined by the x-compiler (which is what NSRP uses to define |LINUX|).
Flags: needinfo?(ted)
Comment 22•7 years ago
|
||
(In reply to Eric Rahm [:erahm] from comment #20) > llbd is unable to attach to the child process |top| shows it as 'stuck' if that means anything to anyone. > A few possibilities: > - The ah **** handler is still pausing, this shouldn't happen as there > shouldn't be a signal handler on mac [2] but maybe |LINUX| is defined (note: > not |XP_LINUX|) due to the x-compile This one's a biiiiig stretch considering I couldn't repro locally (although I'm testing on 10.12, not 10.10). > - Our crash handler is somehow getting in the way, we explicitly disable it > in the tests but the gtest main does enable it [3] if an env var is set, > which it is in automation [4] I tested with |MOZ_CRASHREPORTER=0| and still ran into issues. > - There's some setting on our test machine that doesn't close the pipe's fd > immediately when the child exits A simple test program doesn't show this behavior, I also tested waitpid after crashing the child with segv and it was fine.
Comment 23•7 years ago
|
||
|sample| indicates the death test is stuck in |mozalloc_abort| [1]: > 16 Call graph: > 17 9437 Thread_2190046 DispatchQueue_1: com.apple.main-thread (serial) > 18 + 9437 start (in firefox) + 52 [0x10fe0d3d4] > 19 + 9437 start (in firefox) + 2855 [0x10fe0dec7] > 20 + 9437 XRE_main (in XUL) + 259 [0x1143be173] > 21 + 9437 XREMain::XRE_main(int, char**, mozilla::XREAppData const&) (in XUL) + 577 [0x1143bdb61] > 22 + 9437 XREMain::XRE_mainStartup(bool*) (in XUL) + 252 [0x1143b767c] > 23 + 9437 mozilla::RunGTestFunc(int*, char**) (in XUL) + 888 [0x115225048] > 24 + 9437 testing::UnitTest::Run() (in XUL) + 110 [0x11521a07e] > 25 + 9437 testing::internal::UnitTestImpl::RunAllTests() (in XUL) + 1305 [0x11521a5a9] > 26 + 9437 testing::TestCase::Run() (in XUL) + 483 [0x115214703] > 27 + 9437 testing::TestInfo::Run() (in XUL) + 768 [0x1152142d0] > 28 + 9437 testing::Test::Run() (in XUL) + 737 [0x115213661] > 29 + 9437 XPCOMDeadlockDetectorTest_XPCOMSanityDeathTest_Test::TestBody() (in XUL) + 313 [0x114ecceb9] > 30 + 9437 Sanity_Child() (in XUL) + 49 [0x114eccd11] > 31 + 9437 mozilla::OffTheBooksMutex::Lock() (in XUL) + 14 [0x11059abee] > 32 + 9437 mozilla::BlockingResourceBase::CheckAcquire() (in XUL) + 390 [0x11059a5f6] > 33 + 9437 NS_DebugBreak (in XUL) + 1272 [0x1104b3078] > 34 + 9437 Abort(char const*) (in XUL) + 9 [0x1104b3289] > 35 + 9437 mozalloc_abort(char const*) (in libmozglue.dylib) + 98 [0x10fe16af2] We can also see that |MOZ_CRASH| has been called in the logs, so it's pretty clear we're hanging somewhere in that macro: > 18:27:49 INFO - [ DEATH ] [1984] WARNING: OOPDeinit() without successful OOPInit(): file /home/worker/workspace/build/src/toolkit/crashreporter/nsExceptionHandler.cpp, line 3561 > 18:27:49 INFO - [ DEATH ] ###!!! ERROR: Potential deadlock detected: > 18:27:49 INFO - [ DEATH ] === Cyclical dependency starts at > 18:27:49 INFO - [ DEATH ] --- Mutex : dd.sanity.m1 (currently acquired) > 18:27:49 INFO - [ DEATH ] calling context > 18:27:49 INFO - [ DEATH ] [stack trace unavailable] > 18:27:49 INFO - [ DEATH ] > 18:27:49 INFO - [ DEATH ] === Cycle completed at > 18:27:49 INFO - [ DEATH ] --- Mutex : dd.sanity.m1 (currently acquired) > 18:27:49 INFO - [ DEATH ] calling context > 18:27:49 INFO - [ DEATH ] [stack trace unavailable] > 18:27:49 INFO - [ DEATH ] > 18:27:49 INFO - [ DEATH ] ###!!! Deadlock may happen NOW! > 18:27:49 INFO - [ DEATH ] > 18:27:49 INFO - [ DEATH ] 1984] ###!!! ASSERTION: Potential deadlock detected: > 18:27:49 INFO - [ DEATH ] Cyclical dependency starts at > 18:27:49 INFO - [ DEATH ] Mutex : dd.sanity.m1 (currently acquired) > 18:27:49 INFO - [ DEATH ] Cycle completed at > 18:27:49 INFO - [ DEATH ] Mutex : dd.sanity.m1 (currently acquired) > 18:27:49 INFO - [ DEATH ] > 18:27:49 INFO - [ DEATH ] ###!!! Deadlock may happen NOW! > 18:27:49 INFO - [ DEATH ] > 18:27:49 INFO - [ DEATH ] : 'Error', file /home/worker/workspace/build/src/xpcom/glue/BlockingResourceBase.cpp, line 307 > 18:29:04 INFO - [ DEATH ] #01: PLDHashTable::Add(void const*, mozilla::fallible_t const&) [xpcom/glue/PLDHashTable.cpp:499] > 18:29:04 INFO - [ DEATH ] #02: Promise_static_all(JSContext*, unsigned int, JS::Value*) [js/public/RootingAPI.h:795] > 18:29:04 INFO - [ DEATH ] #03: Promise_static_all(JSContext*, unsigned int, JS::Value*) [js/src/builtin/Promise.cpp:1399] > 18:29:04 INFO - [ DEATH ] #04: DetachArrayBuffer(JSContext*, unsigned int, JS::Value*) [mfbt/Assertions.h:166] > 18:29:04 INFO - [ DEATH ] #05: FindPath(JSContext*, unsigned int, JS::Value*) [js/src/builtin/TestingFunctions.cpp:2859] > 18:29:04 INFO - [ DEATH ] #06: FindPath(JSContext*, unsigned int, JS::Value*) [mfbt/Assertions.h:166] > 18:29:04 INFO - [ DEATH ] #07: AllocationMarker(JSContext*, unsigned int, JS::Value*) [mfbt/Assertions.h:166] > 18:29:04 INFO - [ DEATH ] #08: GetConstructorName(JSContext*, unsigned int, JS::Value*) [mfbt/Assertions.h:166] > 18:29:04 INFO - [ DEATH ] #09: mozilla::Vector<JS::GCVector<JS::Value, 0ul, js::TempAllocPolicy>, 0ul, js::TempAllocPolicy>::~Vector() [mfbt/Assertions.h:166] > 18:29:04 INFO - [ DEATH ] #10: LoadExtensionDirectories(nsINIParser_internal&, char const*, nsCOMArray<nsIFile>&, NSLocationType) [toolkit/xre/nsXREDirProvider.cpp:716] > 18:29:04 INFO - [ DEATH ] #11: GetNativeWindowPointerFromDOMWindow(mozIDOMWindowProxy*, NSWindow**) [xpcom/glue/nsCOMPtr.h:742] > 18:29:04 INFO - [ DEATH ] #12: nsNativeAppSupportCocoa::ReOpen() [xpcom/glue/nsCOMPtr.h:377] > 18:29:04 INFO - [ DEATH ] #13: main [browser/app/nsBrowserApp.cpp:319] > 18:29:04 INFO - [ DEATH ] [1984] ###!!! ASSERTION: Potential deadlock detected: > 18:29:04 INFO - [ DEATH ] Cyclical dependency starts at > 18:29:04 INFO - [ DEATH ] Mutex : dd.sanity.m1 (currently acquired) > 18:29:04 INFO - [ DEATH ] Cycle completed at > 18:29:04 INFO - [ DEATH ] Mutex : dd.sanity.m1 (currently acquired) > 18:29:04 INFO - [ DEATH ] > 18:29:04 INFO - [ DEATH ] ###!!! Deadlock may happen NOW! > 18:29:04 INFO - [ DEATH ] > 18:29:04 INFO - [ DEATH ] : 'Error', file /home/worker/workspace/build/src/xpcom/glue/BlockingResourceBase.cpp, line 307 > 18:29:04 INFO - [ DEATH ] Hit MOZ_CRASH() at /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33 > 18:29:04 INFO - [ DEATH ] @ /home/worker/workspace/build/src/xpcom/tests/gtest/TestDeadlockDetector.cpp:110 It looks like dmajor made some changes in crash behavior (particularly in debug) in bug 1324093, so I'm doing a new try run with those changes in hopes that things magically work: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fc0780816b448a0d34c533c9e304e653c02b2374 [1] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/memory/mozalloc/mozalloc_abort.cpp#22
Comment 24•7 years ago
|
||
(In reply to Eric Rahm [:erahm] from comment #21) > Ted said he could check if |__linux__| is defined by the x-compiler (which > is what NSRP uses to define |LINUX|). Ted checked, it's not defined.
Flags: needinfo?(ted)
Comment 25•7 years ago
|
||
summary |
Ted do you have any recommendations at this point? Quick summary: - I can only reproduce on a loaner (it's 10.10, my laptop is 10.12) - I was not able to attach to the hung process with |lldb|, but |sample| was able to profile it - The death tests *are* passing, but they get hung for 90s in |mozalloc_abort| [1] - We know they hit MOZ_CRASH so the hang is presumably in that macro somewhere - Our method of aborting is to assign a value to a null ptr [2], my best guess is a segv signal handler is somehow involved [1] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/memory/mozalloc/mozalloc_abort.cpp#22 [2] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/mfbt/Assertions.h#232
Flags: needinfo?(ted)
Summary: gtests fail on osx debug TC build → DeadlockDetector death test block for 90s each and cause gtest failure due to timeout on osx debug TC build
Comment 26•7 years ago
|
||
I think the next thing I would test is if crashing the build with a normal MOZ_CRASH produces the same hang before exiting. The simplest way to do that is so just evaluate `Cc["@mozilla.org/xpcom/debug;1"].getService(Ci.nsIDebug2).abort("foo", 1)` in a chrome JS context. You can do it in the browser console but you have to futz around and enable the text input there. I guess the simplest way to automate would be to use the Python Marionette client, something like: ``` from marionette_driver.marionette import Marionette client = Marionette(bin='/path/to/firefox') client.start_session() client.execute_script('Components.classes["@mozilla.org/xpcom/debug;1"].getService(Components.interfaces.nsIDebug2).abort("foo", 1)', sandbox='system') ```
Flags: needinfo?(ted)
Assignee | ||
Comment 27•7 years ago
|
||
have we run out of options here?
Comment 28•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #27) > have we run out of options here? I can take a look at Ted's suggestions this week.
Flags: needinfo?(erahm)
Comment 29•7 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #26) > I think the next thing I would test is if crashing the build with a normal > MOZ_CRASH produces the same hang before exiting. The simplest way to do that > is so just evaluate > `Cc["@mozilla.org/xpcom/debug;1"].getService(Ci.nsIDebug2).abort("foo", 1)` > in a chrome JS context. You can do it in the browser console but you have to > futz around and enable the text input there. > > I guess the simplest way to automate would be to use the Python Marionette > client, something like: > ``` > from marionette_driver.marionette import Marionette > client = Marionette(bin='/path/to/firefox') > client.start_session() > client.execute_script('Components.classes["@mozilla.org/xpcom/debug;1"]. > getService(Components.interfaces.nsIDebug2).abort("foo", 1)', > sandbox='system') > ``` It crashes promptly, although I'm not sure if it has all the same env vars set.
Flags: needinfo?(erahm)
Comment 30•7 years ago
|
||
I added some sleeps so I could attach to the process before it attempts to crash but still could not attach. I'm PTO for a week, so I'm going to abdicate this. Ted can you take over? I should note there was some confusion over breakpad functionality on mac, signal handlers might not be used [1], so that might be a red herring. They do send mach messages with no timeout [2], so it's possible there's something weird going on there. [1] https://dxr.mozilla.org/mozilla-central/rev/1d025ac534a6333a8170a59a95a8a3673d4028ee/toolkit/crashreporter/google-breakpad/src/client/mac/handler/exception_handler.cc#630 [2] https://dxr.mozilla.org/mozilla-central/rev/1d025ac534a6333a8170a59a95a8a3673d4028ee/toolkit/crashreporter/google-breakpad/src/client/mac/handler/exception_handler.cc#803-806
Assignee: erahm → nobody
Flags: needinfo?(ted)
Assignee | ||
Comment 31•7 years ago
|
||
checking in here, this is still the one test blocking using OSX builds from taskcluster- ted, any thoughts?
Comment 32•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #10) > the good news is if we remove the 20 minute restriction, this test passes in > ~27 minutes with no errors :) Even better news- there are no other tests > which are taking extra time. Can we just bump the runtime for this test for now to unblock the migration, and keep this bug open to pursue a fix? erahm has spent a lot of time investigating (and I spent some time as well) and there's no obvious easy solution yet, but if we can trade some extra runtime for green tests that seems like the best call for now since the migration is pretty important.
Flags: needinfo?(ted)
Assignee | ||
Comment 33•7 years ago
|
||
I don't see any indication in this bug that increasing the time limit from 90s -> XXs will let this pass. I do think that is a realistic solution though!
Comment 34•7 years ago
|
||
...I quoted your comment 10 in my previous comment where you said that the test passes in 27 minutes if we bump the current limit. :)
Assignee | ||
Comment 35•7 years ago
|
||
oh, face palm- thanks ted!
Assignee | ||
Comment 36•7 years ago
|
||
Attachment #8835444 -
Flags: review?(ted)
Comment 37•7 years ago
|
||
Comment on attachment 8835444 [details] [diff] [review] extend the timeout for gtest from 20 to 30 minutes Review of attachment 8835444 [details] [diff] [review]: ----------------------------------------------------------------- I guess if you really wanted to get fancy you could set this timeout only on OS X, but this should be fine in general unless we hit a rash of hangs elsewhere.
Attachment #8835444 -
Flags: review?(ted) → review+
Comment 38•7 years ago
|
||
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/d8276bdfb12b DeadlockDetector death test fails on osx tc build, extend timeout to 30 minutes. r=ted
Comment 39•7 years ago
|
||
Let's leave this bug open to track actually fixing the root cause.
Keywords: leave-open
Comment 40•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/d8276bdfb12b
Comment 41•7 years ago
|
||
FWIW I'm fine with disabling these tests on your franken-builds, there's nothing particularly platform specific about deadlock detection so as long as they're running on linux I'm not too worried. It would be nice to figure out what's going on but it's low priority for me.
Comment 42•7 years ago
|
||
I just realized that this is almost certainly the same root cause as bug 1338651. As haik said in bug 1338651 comment 177: > This probably affects the Mac crash reporter tool too. The root cause here is that our cross-compiled builds are built from a path starting with /home/, vs. /Users/ on the native builds we used to have. When we don't strip our builds (so that profiling works) there are STABS entries in the symbol table containing those paths, and that makes some Mac system library take way longer to read the symbol tables from our binaries, presumably because it's trying to stat all of those paths and /home/ is special on Mac.
Comment 43•7 years ago
|
||
I was curious as to whether fixing bug 1338651 fixed this, and ekyle helped me poke at ActiveData to answer that question. The answer is yes! Here's a graph of Mac debug gtest task runtimes on inbound, you can see the dip on August 29th when that patch landed: https://docs.google.com/spreadsheets/d/1HtdXwwZef1SnSoEDNCCtTXeC_fn2u6bLzSaLlY4cmyw/edit?usp=sharing Here's the ActiveData query I used to get that data: https://activedata.allizom.org/tools/query.html#query_id=6xgJw1kN (Thanks, Kyle!)
Comment 44•7 years ago
|
||
We can back out the timeout increase patch and close this bug!
Flags: needinfo?(jmaher)
Assignee | ||
Comment 45•7 years ago
|
||
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Flags: needinfo?(jmaher)
Attachment #8907181 -
Flags: review?(ted)
Comment 46•7 years ago
|
||
Comment on attachment 8907181 [details] [diff] [review] reduce the timeout to 20 minutes Review of attachment 8907181 [details] [diff] [review]: ----------------------------------------------------------------- Thanks!
Attachment #8907181 -
Flags: review?(ted) → review+
Updated•7 years ago
|
Keywords: leave-open
Comment 47•7 years ago
|
||
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/fab94e5b9c9c reduce max timeout from 30 minutes to 20 minutes. r=ted
Comment 48•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/fab94e5b9c9c
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox57:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
You need to log in
before you can comment on or make changes to this bug.
Description
•