Closed Bug 1331049 Opened 3 years ago Closed 2 years ago

DeadlockDetector death test block for 90s each and cause gtest failure due to timeout on osx debug TC build

Categories

(Testing :: General, defect)

defect
Not set

Tracking

(firefox57 fixed)

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: jmaher, Assigned: jmaher)

References

Details

Attachments

(3 files)

test: BB runtime, BBB runtime, diff (seconds)
XPCOMDeadlockDetectorTest.XPCOMSanity2DeathTest: 0:00:02,0:01:35,93
XPCOMDeadlockDetectorTest.XPCOMContentionNoDeadlock: 0:00:01,0:01:32,91
XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest: 0:00:01,0:01:38,97
storage_DeadlockDetectorTest.storage_Sanity3DeathTest: 0:00:01,0:01:33,92
storage_DeadlockDetectorTest.storage_TwoThreadsDeathTest: 0:00:01,0:01:33,92
storage_DeadlockDetectorTest.storage_Sanity2DeathTest: 0:00:02,0:01:33,91
XPCOMDeadlockDetectorTest.XPCOMTwoThreadsDeathTest: 0:00:02,0:01:32,90
storage_DeadlockDetectorTest.storage_ContentionNoDeadlock: 0:00:01,0:01:33,92
XPCOMDeadlockDetectorTest.XPCOMSanity4DeathTest: 0:00:01,0:01:31,90
storage_DeadlockDetectorTest.storage_Sanity4DeathTest: 0:00:01,0:01:33,92

there are 10 tests that have a 90 second delay vs the other tests, and I believe this makes up the large majority of our time differences.

Sadly we are running ~1500 tests out of the almost 4500 tests which we run typically, so there might be more hiding out there.
erahm: wcosta and jmaher are standing up tests on the taskcluster mac builds, which are cross-compiled on Linux. Most of the tests seem to be running fine, but the gtests are timing out and it seems to be because the DeadlockDetectorTests are taking way longer on the cross-builds (see comment 0). It seems to be very close to 90 seconds in every case, which makes me think there's something fundamentally wrong here. Can you take a look at this?
Flags: needinfo?(erahm)
Joel, can you get me the output of these tests?

A couple of thoughts:
  #1 - These tests should only be running on debug builds, is it possible the optimization settings are different b/w x-compiler and native?
  #2 - Most of those are death tests, we mess with the gdb sleep duration to make them fail quickly, it's possible that's not working w/ the x-compiler
Flags: needinfo?(erahm) → needinfo?(jmaher)
I am going to ask :ted this since I am not too familiar with the build specifics
Flags: needinfo?(jmaher) → needinfo?(ted)
(In reply to Joel Maher ( :jmaher) from comment #0)
> test: BB runtime, BBB runtime, diff (seconds)
> XPCOMDeadlockDetectorTest.XPCOMSanity2DeathTest: 0:00:02,0:01:35,93
> XPCOMDeadlockDetectorTest.XPCOMContentionNoDeadlock: 0:00:01,0:01:32,91
> XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest: 0:00:01,0:01:38,97
> storage_DeadlockDetectorTest.storage_Sanity3DeathTest: 0:00:01,0:01:33,92
> storage_DeadlockDetectorTest.storage_TwoThreadsDeathTest: 0:00:01,0:01:33,92
> storage_DeadlockDetectorTest.storage_Sanity2DeathTest: 0:00:02,0:01:33,91
> XPCOMDeadlockDetectorTest.XPCOMTwoThreadsDeathTest: 0:00:02,0:01:32,90
> storage_DeadlockDetectorTest.storage_ContentionNoDeadlock: 0:00:01,0:01:33,92
> XPCOMDeadlockDetectorTest.XPCOMSanity4DeathTest: 0:00:01,0:01:31,90
> storage_DeadlockDetectorTest.storage_Sanity4DeathTest: 0:00:01,0:01:33,92
> 
> there are 10 tests that have a 90 second delay vs the other tests, and I
> believe this makes up the large majority of our time differences.
> 
> Sadly we are running ~1500 tests out of the almost 4500 tests which we run
> typically, so there might be more hiding out there.

I am confused, is this causing Gtest to fail? Logs suggest the intermittent bug 1237167 is permanent for x-compiled builds.
Flags: needinfo?(jmaher)
this is causing gtest to fail because our runtime is longer than the 20 minutes allocated for the job.  Typically this job takes <13 minutes, so 20 minutes would appear to be more than enough time.

Looking at the other gtest bugs on file, I see most are on windows, and some instances on OSX.  Looking at the logs, I am not finding these specific tests mentioned, and most of the logs are for builds, not specific test tasks- I don't understand why we have gtests run different ways.
Flags: needinfo?(jmaher)
(In reply to Eric Rahm [:erahm] from comment #2)
>   #1 - These tests should only be running on debug builds, is it possible
> the optimization settings are different b/w x-compiler and native?

We've been running the cross-compiled builds for a while now, I just picked an arbitrary inbound revision that had both types of mac debug builds:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=74790f4d30de03105799064332f44d045e40723e

Here's the native buildbot build log:
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1484659765/mozilla-inbound-macosx64-debug-bm84-build1-build7693.txt.gz

and here's the cross-compiled taskcluster build log:
https://public-artifacts.taskcluster.net/VO6Vlx5PQEe7H7q8F0YxDg/0/public/logs/live_backing.log

These are dep builds, so picking out a source file that actually got compiled we see:
buildbot:
05:51:58     INFO -  /usr/local/bin/ccache /builds/slave/m-in-m64-d-0000000000000000000/build/src/clang/bin/clang++ -std=gnu++11 -o Unified_cpp_dom_presentation0.o -c  -fvisibility=hidden -fvisibility-inlines-hidden -DDEBUG=1 -DTRACING=1 -DOS_POSIX=1 -DOS_MACOSX=1 -DSTATIC_EXPORTABLE_JS_API -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/dom/presentation -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/dom/presentation -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/dom/base -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/ipc/ipdl/_ipdlheaders -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/ipc/chromium/src -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/ipc/glue -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/dist/include  -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/dist/include/nspr -I/builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/dist/include/nss       -fPIC  -DMOZILLA_CLIENT -include /builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/mozilla-config.h -MD -MP -MF .deps/Unified_cpp_dom_presentation0.o.pp -Qunused-arguments  -Qunused-arguments -Wall -Wc++11-compat -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wloop-analysis -Wc++11-compat-pedantic -Wc++14-compat -Wc++14-compat-pedantic -Wc++1z-compat -Wimplicit-fallthrough -Werror=non-literal-null-conversion -Wstring-conversion -Wthread-safety -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-unknown-warning-option -Wno-return-type-c-linkage -fno-exceptions -fno-strict-aliasing -stdlib=libc++ -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe  -g -O3 -fno-omit-frame-pointer  -Werror   /builds/slave/m-in-m64-d-0000000000000000000/build/src/obj-firefox/dom/presentation/Unified_cpp_dom_presentation0.cpp

taskcluster:
[task 2017-01-17T13:40:42.153341Z] 13:40:42     INFO -  /home/worker/workspace/build/src/sccache2/sccache /home/worker/workspace/build/src/clang/bin/clang++ -target x86_64-apple-darwin10 -mlinker-version=136 -B /home/worker/workspace/build/src/cctools/bin -isysroot /home/worker/workspace/build/src/MacOSX10.7.sdk -std=gnu++11 -o Unified_cpp_dom_presentation0.o -c  -fvisibility=hidden -fvisibility-inlines-hidden -DDEBUG=1 -DTRACING=1 -DOS_POSIX=1 -DOS_MACOSX=1 -DSTATIC_EXPORTABLE_JS_API -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -I/home/worker/workspace/build/src/dom/presentation -I/home/worker/workspace/build/src/obj-firefox/dom/presentation -I/home/worker/workspace/build/src/dom/base -I/home/worker/workspace/build/src/obj-firefox/ipc/ipdl/_ipdlheaders -I/home/worker/workspace/build/src/ipc/chromium/src -I/home/worker/workspace/build/src/ipc/glue -I/home/worker/workspace/build/src/obj-firefox/dist/include  -I/home/worker/workspace/build/src/obj-firefox/dist/include/nspr -I/home/worker/workspace/build/src/obj-firefox/dist/include/nss       -fPIC  -DMOZILLA_CLIENT -include /home/worker/workspace/build/src/obj-firefox/mozilla-config.h -MD -MP -MF .deps/Unified_cpp_dom_presentation0.o.pp -Qunused-arguments  -Qunused-arguments -Wall -Wc++11-compat -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wloop-analysis -Wc++11-compat-pedantic -Wc++14-compat -Wc++14-compat-pedantic -Wc++1z-compat -Wimplicit-fallthrough -Werror=non-literal-null-conversion -Wstring-conversion -Wthread-safety -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-unknown-warning-option -Wno-return-type-c-linkage -fno-exceptions -fno-strict-aliasing -stdlib=libc++ -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe  -g -Xclang -load -Xclang /home/worker/workspace/build/src/obj-firefox/build/clang-plugin/libclang-plugin.dylib -Xclang -add-plugin -Xclang moz-check -O3 -fno-omit-frame-pointer  -Werror   /home/worker/workspace/build/src/obj-firefox/dom/presentation/Unified_cpp_dom_presentation0.cpp

Doing some fiddly local comparison it looks like the only differences are:
a) The taskcluster build uses our clang static analysis plugin
b) The taskcluster build passes `-B .../cctools/bin`
c) The taskcluster build explicitly passes -isysroot
d) The taskcluster build passes -mlinker-version=136
e) The taskcluster build explicitly passes `-target  x86_64-apple-darwin10`.

None of those seem like they ought to make any real difference. For reference, the main difference in the in-tree mozconfigs is that the buildbot builds include this mozconfig:
https://dxr.mozilla.org/mozilla-central/source/build/macosx/local-mozconfig.common

where the taskcluster builds include this one:
https://dxr.mozilla.org/mozilla-central/source/build/macosx/cross-mozconfig.common

>   #2 - Most of those are death tests, we mess with the gdb sleep duration to
> make them fail quickly, it's possible that's not working w/ the x-compiler

This is entirely possible, I don't have a working Mac at the moment to test. Although I note that the default gdb sleep duration is 300 seconds:
https://dxr.mozilla.org/mozilla-central/rev/3e275d37a06236981bff399b7d7aa0646be3fee7/toolkit/xre/nsSigHandlers.cpp#43

...and these tests are taking ~90 seconds running against the taskcluster builds.

jmaher: erahm asked in comment 2 but maybe you missed it, can you link the test logs you were using for comparison in comment 0? I closed that tab and don't have them handy.
Flags: needinfo?(ted) → needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #7)
> an example of a the failing gtest (taskcluster) is:
> https://archive.mozilla.org/pub/firefox/try-builds/wcosta@mozilla.com-
> fdb545d3f298790a8406edf0155dbf0d90557ecf/try-macosx64-debug/try_yosemite_r7-
> debug_test-gtest-bm135-tests1-macosx-build201.txt.gz

Alright, we can see that |ContentionNoDeadlock| is actually only taking 819ms, so the timeouts are all limited to death tests:

> XPCOMDeadlockDetectorTest.XPCOMContentionNoDeadlock | test completed (time: 819ms)

Unfortunately there's no useful output in the logs.

> an example of a passing gtest (buildbot) is:
> https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-
> macosx64-debug/1484664924/mozilla-central_yosemite_r7-debug_test-gtest-bm108-
> tests1-macosx-build38.txt.gz

And it looks like for our baseline we're expecting < 2s for the death tests.
Also of note:
  - The tests *are* passing, so the output in the child process we were expecting was found
  - The non-deathtests are fine, so it's not deadlock detector itself being slow
the good news is if we remove the 20 minute restriction, this test passes in ~27 minutes with no errors :)  Even better news- there are no other tests which are taking extra time.
Wander, is there a way for me to test a patch against the t-c build? I just want force logging of the output of the deathtest.
Flags: needinfo?(wcosta)
(In reply to Eric Rahm [:erahm] from comment #11)
> Wander, is there a way for me to test a patch against the t-c build? I just
> want force logging of the output of the deathtest.

You can apply my patch in bug 1330310, then if you push to try with "-p macosx64 -b d -u gtest -t none", tests will run against tc x-build. Does this work for you?
Flags: needinfo?(wcosta)
Everything ran fine when attempting to repro locally with the t-c build and just the DeadlockDetector tests:

> (venv)[ericrahm@erahm-25043 ~/dev/tmp/gtest_tc_hangs]$ GTEST_FILTER='XPCOMDeadlock*' python -u gtest/rungtests.py --xre-path=application/NightlyDebug.app/Contents/Resources --cwd=gtest --symbols-path=symbols --utility-path=bin application/NightlyDebug.app/Contents/MacOS/firefox
> gtest INFO | Running gtest
> [62356] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /home/worker/workspace/build/src/xpcom/base/nsTraceRefcnt.cpp, line 172
> [62356] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /home/worker/workspace/build/src/xpcom/base/nsTraceRefcnt.cpp, line 172
> [62356] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /home/worker/workspace/build/src/xpcom/base/nsTraceRefcnt.cpp, line 172
> Running GTest tests...
> Setting up crash reporting
> TEST-INFO | GTest unit test starting
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMSanityDeathTest
> 
> [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 24 threads.
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMSanityDeathTest | test completed (time: 104ms)
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMSanity2DeathTest
> 
> [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 24 threads.
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMSanity2DeathTest | test completed (time: 105ms)
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest
> 
> [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 24 threads.
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMSanity3DeathTest | test completed (time: 107ms)
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMSanity4DeathTest
> 
> [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 24 threads.
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMSanity4DeathTest | test completed (time: 104ms)
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMTwoThreadsDeathTest
> 
> [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 24 threads.
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMTwoThreadsDeathTest | test completed (time: 27ms)
> TEST-START | XPCOMDeadlockDetectorTest.XPCOMContentionNoDeadlock
> TEST-PASS | XPCOMDeadlockDetectorTest.XPCOMContentionNoDeadlock | test completed (time: 517ms)
> TEST-PASS | GTest unit test: passed
> Passed: 6
> Failed: 0

It's possible there's a bad interaction with other tests, I'll do a full run.
Depends on: 1331739
The death tests ran fine in a full run as well. I've requested a loaner in bug 1331739, we'll see if I can repro there once I get the loaner.
Assignee: nobody → erahm
Here's a script to setup and run the gtests from one of wcosta's builds.
Try build that purposefully fails the deadlock detector death tests so we get full output:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=da54c923e6a50426fbf2a207c49d25cd1822201e
Attachment #8827695 - Attachment mime type: application/x-shellscript → text/plain
(In reply to Eric Rahm [:erahm] from comment #16)
> Try build that purposefully fails the deadlock detector death tests so we
> get full output:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=da54c923e6a50426fbf2a207c49d25cd1822201e

The tests are still taking 90+ seconds here, and I don't see anything immediately suspicious in the output. There's just a big time gap between the [WARNING] about death tests and threads and the failure message.

18:26:14     INFO -  TEST-START | XPCOMDeadlockDetectorTest.XPCOMSanityDeathTest
18:26:14     INFO -  [WARNING] /home/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:789:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 18 threads.
18:27:49  WARNING -  TEST-UNEXPECTED-FAIL | XPCOMDeadlockDetectorTest.XPCOMSanityDeathTest | Death test: Sanity_Child()
I've got it reproduced on the loaner, so there's that at least. Still not sure what's going on, I'll dig into the gtest code next.
 I switched over to 'threadsafe' tests (fork and exec) but that didn't help. I can see the output being dumped to a tmp file immediately, so it's definitely not a test startup issue. Possibly an issue in our crash handler? Next step is to test out my lldb-fu to see if I can see where it's blocked.
llbd is unable to attach to the child process, attaching to parent process show's it waiting on the pipe [1]. The idea here is if the test finishes (this is a failure for a death test) a message is passed back via the pipe saying things didn't go as planned, on proper failure (the test aborts) the pipe is expected to close due to the child process aborting.

So the question is why is the pipe not closing for 90s on the t-c machine?

A few possibilities:
- The ah crap handler is still pausing, this shouldn't happen as there shouldn't be a signal handler on mac [2] but maybe |LINUX| is defined (note: not |XP_LINUX|) due to the x-compile
- Our crash handler is somehow getting in the way, we explicitly disable it in the tests but the gtest main does enable it [3] if an env var is set, which it is in automation [4]
- There's some setting on our test machine that doesn't close the pipe's fd immediately when the child exits

[1] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/testing/gtest/gtest/src/gtest-death-test.cc#389-395
[2] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/toolkit/xre/nsSigHandlers.cpp#45-48
[3] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/testing/gtest/mozilla/GTestRunner.cpp#94-115
[4] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/testing/gtest/rungtests.py#92
Ted said he could check if |__linux__| is defined by the x-compiler (which is what NSRP uses to define |LINUX|).
Flags: needinfo?(ted)
(In reply to Eric Rahm [:erahm] from comment #20)
> llbd is unable to attach to the child process

|top| shows it as 'stuck' if that means anything to anyone.

> A few possibilities:
> - The ah crap handler is still pausing, this shouldn't happen as there
> shouldn't be a signal handler on mac [2] but maybe |LINUX| is defined (note:
> not |XP_LINUX|) due to the x-compile

This one's a biiiiig stretch considering I couldn't repro locally (although I'm testing on 10.12, not 10.10).

> - Our crash handler is somehow getting in the way, we explicitly disable it
> in the tests but the gtest main does enable it [3] if an env var is set,
> which it is in automation [4]

I tested with |MOZ_CRASHREPORTER=0| and still ran into issues.

> - There's some setting on our test machine that doesn't close the pipe's fd
> immediately when the child exits

A simple test program doesn't show this behavior, I also tested waitpid after crashing the child with segv and it was fine.
|sample| indicates the death test is stuck in |mozalloc_abort| [1]:

> 16 Call graph:
> 17     9437 Thread_2190046   DispatchQueue_1: com.apple.main-thread  (serial)
> 18     + 9437 start  (in firefox) + 52  [0x10fe0d3d4]
> 19     +   9437 start  (in firefox) + 2855  [0x10fe0dec7]
> 20     +     9437 XRE_main  (in XUL) + 259  [0x1143be173]
> 21     +       9437 XREMain::XRE_main(int, char**, mozilla::XREAppData const&)  (in XUL) + 577  [0x1143bdb61]
> 22     +         9437 XREMain::XRE_mainStartup(bool*)  (in XUL) + 252  [0x1143b767c]
> 23     +           9437 mozilla::RunGTestFunc(int*, char**)  (in XUL) + 888  [0x115225048]
> 24     +             9437 testing::UnitTest::Run()  (in XUL) + 110  [0x11521a07e]
> 25     +               9437 testing::internal::UnitTestImpl::RunAllTests()  (in XUL) + 1305  [0x11521a5a9]
> 26     +                 9437 testing::TestCase::Run()  (in XUL) + 483  [0x115214703]
> 27     +                   9437 testing::TestInfo::Run()  (in XUL) + 768  [0x1152142d0]
> 28     +                     9437 testing::Test::Run()  (in XUL) + 737  [0x115213661]
> 29     +                       9437 XPCOMDeadlockDetectorTest_XPCOMSanityDeathTest_Test::TestBody()  (in XUL) + 313  [0x114ecceb9]
> 30     +                         9437 Sanity_Child()  (in XUL) + 49  [0x114eccd11]
> 31     +                           9437 mozilla::OffTheBooksMutex::Lock()  (in XUL) + 14  [0x11059abee]
> 32     +                             9437 mozilla::BlockingResourceBase::CheckAcquire()  (in XUL) + 390  [0x11059a5f6]
> 33     +                               9437 NS_DebugBreak  (in XUL) + 1272  [0x1104b3078]
> 34     +                                 9437 Abort(char const*)  (in XUL) + 9  [0x1104b3289]
> 35     +                                   9437 mozalloc_abort(char const*)  (in libmozglue.dylib) + 98  [0x10fe16af2]

We can also see that |MOZ_CRASH| has been called in the logs, so it's pretty clear we're hanging somewhere in that macro:

> 18:27:49     INFO -  [  DEATH   ] [1984] WARNING: OOPDeinit() without successful OOPInit(): file /home/worker/workspace/build/src/toolkit/crashreporter/nsExceptionHandler.cpp, line 3561
> 18:27:49     INFO -  [  DEATH   ] ###!!! ERROR: Potential deadlock detected:
> 18:27:49     INFO -  [  DEATH   ] === Cyclical dependency starts at
> 18:27:49     INFO -  [  DEATH   ] --- Mutex : dd.sanity.m1 (currently acquired)
> 18:27:49     INFO -  [  DEATH   ]  calling context
> 18:27:49     INFO -  [  DEATH   ]   [stack trace unavailable]
> 18:27:49     INFO -  [  DEATH   ]
> 18:27:49     INFO -  [  DEATH   ] === Cycle completed at
> 18:27:49     INFO -  [  DEATH   ] --- Mutex : dd.sanity.m1 (currently acquired)
> 18:27:49     INFO -  [  DEATH   ]  calling context
> 18:27:49     INFO -  [  DEATH   ]   [stack trace unavailable]
> 18:27:49     INFO -  [  DEATH   ]
> 18:27:49     INFO -  [  DEATH   ] ###!!! Deadlock may happen NOW!
> 18:27:49     INFO -  [  DEATH   ]
> 18:27:49     INFO -  [  DEATH   ] 1984] ###!!! ASSERTION: Potential deadlock detected:
> 18:27:49     INFO -  [  DEATH   ] Cyclical dependency starts at
> 18:27:49     INFO -  [  DEATH   ] Mutex : dd.sanity.m1 (currently acquired)
> 18:27:49     INFO -  [  DEATH   ] Cycle completed at
> 18:27:49     INFO -  [  DEATH   ] Mutex : dd.sanity.m1 (currently acquired)
> 18:27:49     INFO -  [  DEATH   ]
> 18:27:49     INFO -  [  DEATH   ] ###!!! Deadlock may happen NOW!
> 18:27:49     INFO -  [  DEATH   ]
> 18:27:49     INFO -  [  DEATH   ] : 'Error', file /home/worker/workspace/build/src/xpcom/glue/BlockingResourceBase.cpp, line 307
> 18:29:04     INFO -  [  DEATH   ] #01: PLDHashTable::Add(void const*, mozilla::fallible_t const&) [xpcom/glue/PLDHashTable.cpp:499]
> 18:29:04     INFO -  [  DEATH   ] #02: Promise_static_all(JSContext*, unsigned int, JS::Value*) [js/public/RootingAPI.h:795]
> 18:29:04     INFO -  [  DEATH   ] #03: Promise_static_all(JSContext*, unsigned int, JS::Value*) [js/src/builtin/Promise.cpp:1399]
> 18:29:04     INFO -  [  DEATH   ] #04: DetachArrayBuffer(JSContext*, unsigned int, JS::Value*) [mfbt/Assertions.h:166]
> 18:29:04     INFO -  [  DEATH   ] #05: FindPath(JSContext*, unsigned int, JS::Value*) [js/src/builtin/TestingFunctions.cpp:2859]
> 18:29:04     INFO -  [  DEATH   ] #06: FindPath(JSContext*, unsigned int, JS::Value*) [mfbt/Assertions.h:166]
> 18:29:04     INFO -  [  DEATH   ] #07: AllocationMarker(JSContext*, unsigned int, JS::Value*) [mfbt/Assertions.h:166]
> 18:29:04     INFO -  [  DEATH   ] #08: GetConstructorName(JSContext*, unsigned int, JS::Value*) [mfbt/Assertions.h:166]
> 18:29:04     INFO -  [  DEATH   ] #09: mozilla::Vector<JS::GCVector<JS::Value, 0ul, js::TempAllocPolicy>, 0ul, js::TempAllocPolicy>::~Vector() [mfbt/Assertions.h:166]
> 18:29:04     INFO -  [  DEATH   ] #10: LoadExtensionDirectories(nsINIParser_internal&, char const*, nsCOMArray<nsIFile>&, NSLocationType) [toolkit/xre/nsXREDirProvider.cpp:716]
> 18:29:04     INFO -  [  DEATH   ] #11: GetNativeWindowPointerFromDOMWindow(mozIDOMWindowProxy*, NSWindow**) [xpcom/glue/nsCOMPtr.h:742]
> 18:29:04     INFO -  [  DEATH   ] #12: nsNativeAppSupportCocoa::ReOpen() [xpcom/glue/nsCOMPtr.h:377]
> 18:29:04     INFO -  [  DEATH   ] #13: main [browser/app/nsBrowserApp.cpp:319]
> 18:29:04     INFO -  [  DEATH   ] [1984] ###!!! ASSERTION: Potential deadlock detected:
> 18:29:04     INFO -  [  DEATH   ] Cyclical dependency starts at
> 18:29:04     INFO -  [  DEATH   ] Mutex : dd.sanity.m1 (currently acquired)
> 18:29:04     INFO -  [  DEATH   ] Cycle completed at
> 18:29:04     INFO -  [  DEATH   ] Mutex : dd.sanity.m1 (currently acquired)
> 18:29:04     INFO -  [  DEATH   ]
> 18:29:04     INFO -  [  DEATH   ] ###!!! Deadlock may happen NOW!
> 18:29:04     INFO -  [  DEATH   ]
> 18:29:04     INFO -  [  DEATH   ] : 'Error', file /home/worker/workspace/build/src/xpcom/glue/BlockingResourceBase.cpp, line 307
> 18:29:04     INFO -  [  DEATH   ] Hit MOZ_CRASH() at /home/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33
> 18:29:04     INFO -  [  DEATH   ]  @ /home/worker/workspace/build/src/xpcom/tests/gtest/TestDeadlockDetector.cpp:110

It looks like dmajor made some changes in crash behavior (particularly in debug) in bug 1324093, so I'm doing a new try run with those changes in hopes that things magically work: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fc0780816b448a0d34c533c9e304e653c02b2374

[1] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/memory/mozalloc/mozalloc_abort.cpp#22
(In reply to Eric Rahm [:erahm] from comment #21)
> Ted said he could check if |__linux__| is defined by the x-compiler (which
> is what NSRP uses to define |LINUX|).

Ted checked, it's not defined.
Flags: needinfo?(ted)
Ted do you have any recommendations at this point?

Quick summary:
  - I can only reproduce on a loaner (it's 10.10, my laptop is 10.12)
  - I was not able to attach to the hung process with |lldb|, but |sample| was able to profile it
  - The death tests *are* passing, but they get hung for 90s in |mozalloc_abort| [1]
  - We know they hit MOZ_CRASH so the hang is presumably in that macro somewhere
  - Our method of aborting is to assign a value to a null ptr [2], my best guess is a segv signal handler is somehow involved

[1] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/memory/mozalloc/mozalloc_abort.cpp#22
[2] http://searchfox.org/mozilla-central/rev/30fcf167af036aeddf322de44a2fadd370acfd2f/mfbt/Assertions.h#232
Flags: needinfo?(ted)
Summary: gtests fail on osx debug TC build → DeadlockDetector death test block for 90s each and cause gtest failure due to timeout on osx debug TC build
I think the next thing I would test is if crashing the build with a normal MOZ_CRASH produces the same hang before exiting. The simplest way to do that is so just evaluate `Cc["@mozilla.org/xpcom/debug;1"].getService(Ci.nsIDebug2).abort("foo", 1)` in a chrome JS context. You can do it in the browser console but you have to futz around and enable the text input there.

I guess the simplest way to automate would be to use the Python Marionette client, something like:
```
from marionette_driver.marionette import Marionette
client = Marionette(bin='/path/to/firefox')
client.start_session()
client.execute_script('Components.classes["@mozilla.org/xpcom/debug;1"].getService(Components.interfaces.nsIDebug2).abort("foo", 1)', sandbox='system')
```
Flags: needinfo?(ted)
have we run out of options here?
(In reply to Joel Maher ( :jmaher) from comment #27)
> have we run out of options here?

I can take a look at Ted's suggestions this week.
Flags: needinfo?(erahm)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #26)
> I think the next thing I would test is if crashing the build with a normal
> MOZ_CRASH produces the same hang before exiting. The simplest way to do that
> is so just evaluate
> `Cc["@mozilla.org/xpcom/debug;1"].getService(Ci.nsIDebug2).abort("foo", 1)`
> in a chrome JS context. You can do it in the browser console but you have to
> futz around and enable the text input there.
> 
> I guess the simplest way to automate would be to use the Python Marionette
> client, something like:
> ```
> from marionette_driver.marionette import Marionette
> client = Marionette(bin='/path/to/firefox')
> client.start_session()
> client.execute_script('Components.classes["@mozilla.org/xpcom/debug;1"].
> getService(Components.interfaces.nsIDebug2).abort("foo", 1)',
> sandbox='system')
> ```

It crashes promptly, although I'm not sure if it has all the same env vars set.
Flags: needinfo?(erahm)
I added some sleeps so I could attach to the process before it attempts to crash but still could not attach. I'm PTO for a week, so I'm going to abdicate this. Ted can you take over?

I should note there was some confusion over breakpad functionality on mac, signal handlers might not be used [1], so that might be a red herring. They do send mach messages with no timeout [2], so it's possible there's something weird going on there.

[1] https://dxr.mozilla.org/mozilla-central/rev/1d025ac534a6333a8170a59a95a8a3673d4028ee/toolkit/crashreporter/google-breakpad/src/client/mac/handler/exception_handler.cc#630
[2] https://dxr.mozilla.org/mozilla-central/rev/1d025ac534a6333a8170a59a95a8a3673d4028ee/toolkit/crashreporter/google-breakpad/src/client/mac/handler/exception_handler.cc#803-806
Assignee: erahm → nobody
Flags: needinfo?(ted)
checking in here, this is still the one test blocking using OSX builds from taskcluster- ted, any thoughts?
(In reply to Joel Maher ( :jmaher) from comment #10)
> the good news is if we remove the 20 minute restriction, this test passes in
> ~27 minutes with no errors :)  Even better news- there are no other tests
> which are taking extra time.

Can we just bump the runtime for this test for now to unblock the migration, and keep this bug open to pursue a fix? erahm has spent a lot of time investigating (and I spent some time as well) and there's no obvious easy solution yet, but if we can trade some extra runtime for green tests that seems like the best call for now since the migration is pretty important.
Flags: needinfo?(ted)
I don't see any indication in this bug that increasing the time limit from 90s -> XXs will let this pass.  I do think that is a realistic solution though!
...I quoted your comment 10 in my previous comment where you said that the test passes in 27 minutes if we bump the current limit. :)
oh, face palm- thanks ted!
Comment on attachment 8835444 [details] [diff] [review]
extend the timeout for gtest from 20 to 30 minutes

Review of attachment 8835444 [details] [diff] [review]:
-----------------------------------------------------------------

I guess if you really wanted to get fancy you could set this timeout only on OS X, but this should be fine in general unless we hit a rash of hangs elsewhere.
Attachment #8835444 - Flags: review?(ted) → review+
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d8276bdfb12b
DeadlockDetector death test fails on osx tc build, extend timeout to 30 minutes. r=ted
Let's leave this bug open to track actually fixing the root cause.
Keywords: leave-open
FWIW I'm fine with disabling these tests on your franken-builds, there's nothing particularly platform specific about deadlock detection so as long as they're running on linux I'm not too worried. It would be nice to figure out what's going on but it's low priority for me.
I just realized that this is almost certainly the same root cause as bug 1338651. As haik said in bug 1338651 comment 177:
> This probably affects the Mac crash reporter tool too.

The root cause here is that our cross-compiled builds are built from a path starting with /home/, vs. /Users/ on the native builds we used to have. When we don't strip our builds (so that profiling works) there are STABS entries in the symbol table containing those paths, and that makes some Mac system library take way longer to read the symbol tables from our binaries, presumably because it's trying to stat all of those paths and /home/ is special on Mac.
I was curious as to whether fixing bug 1338651 fixed this, and ekyle helped me poke at ActiveData to answer that question. The answer is yes!

Here's a graph of Mac debug gtest task runtimes on inbound, you can see the dip on August 29th when that patch landed:
https://docs.google.com/spreadsheets/d/1HtdXwwZef1SnSoEDNCCtTXeC_fn2u6bLzSaLlY4cmyw/edit?usp=sharing

Here's the ActiveData query I used to get that data:
https://activedata.allizom.org/tools/query.html#query_id=6xgJw1kN

(Thanks, Kyle!)
We can back out the timeout increase patch and close this bug!
Flags: needinfo?(jmaher)
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Flags: needinfo?(jmaher)
Attachment #8907181 - Flags: review?(ted)
Comment on attachment 8907181 [details] [diff] [review]
reduce the timeout to 20 minutes

Review of attachment 8907181 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8907181 - Flags: review?(ted) → review+
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fab94e5b9c9c
reduce max timeout from 30 minutes to 20 minutes. r=ted
https://hg.mozilla.org/mozilla-central/rev/fab94e5b9c9c
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
You need to log in before you can comment on or make changes to this bug.