Closed Bug 1419196 Opened 5 years ago Closed 5 years ago

gtest Jemalloc.Arenas intermittently hangs test suite

Categories

(Testing :: General, defect)

defect
Not set
normal

Tracking

(firefox59 fixed)

RESOLVED FIXED
mozilla59
Tracking Status
firefox59 --- fixed

People

(Reporter: gbrown, Assigned: glandium)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Bug 1115253 collects a variety of gtest failures in which the test harness times out after not seeing any new output for a long time. Some of those failures occur following lots of stack dumps from Jemalloc.Arenas.


An example:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=146195132&lineNumber=19958

[task 2017-11-20T20:34:34.517Z] 20:34:34     INFO -  TEST-START | Jemalloc.Arenas
[task 2017-11-20T20:34:34.517Z] 20:34:34     INFO -  [WARNING] /builds/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:836:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 125 threads.
[task 2017-11-20T20:34:34.518Z] 20:34:34     INFO -  Program /builds/worker/workspace/build/application/firefox/firefox (pid = 10642) received signal 11.
[task 2017-11-20T20:34:34.518Z] 20:34:34     INFO -  Stack:
[task 2017-11-20T20:34:34.520Z] 20:34:34     INFO -  #01: Allocator<MozJemallocBase>::moz_arena_malloc [memory/build/mozjemalloc.cpp:4195]
[task 2017-11-20T20:34:34.520Z] 20:34:34     INFO -  #02: Jemalloc_Arenas_Test::TestBody [memory/gtest/TestJemalloc.cpp:287]
[task 2017-11-20T20:34:34.520Z] 20:34:34     INFO -  #03: testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> [testing/gtest/gtest/src/gtest.cc:2460]
[task 2017-11-20T20:34:34.522Z] 20:34:34     INFO -  #04: testing::Test::Run [testing/gtest/gtest/src/gtest.cc:2477]
[task 2017-11-20T20:34:34.522Z] 20:34:34     INFO -  #05: testing::TestInfo::Run [testing/gtest/gtest/src/gtest.cc:2658]
[task 2017-11-20T20:34:34.523Z] 20:34:34     INFO -  #06: testing::TestCase::Run [testing/gtest/gtest/src/gtest.cc:2775]
[task 2017-11-20T20:34:34.523Z] 20:34:34     INFO -  #07: testing::internal::UnitTestImpl::RunAllTests [testing/gtest/gtest/src/gtest.cc:4649]
[task 2017-11-20T20:34:34.524Z] 20:34:34     INFO -  #08: testing::UnitTest::Run [testing/gtest/gtest/src/gtest.cc:3877]
[task 2017-11-20T20:34:34.524Z] 20:34:34     INFO -  #09: mozilla::RunGTestFunc [xpcom/base/nsCOMPtr.h:424]
[task 2017-11-20T20:34:34.524Z] 20:34:34     INFO -  #10: XREMain::XRE_mainStartup [toolkit/xre/nsAppRunner.cpp:3888]
[task 2017-11-20T20:34:34.525Z] 20:34:34     INFO -  #11: XREMain::XRE_main [toolkit/xre/nsAppRunner.cpp:4833]
[task 2017-11-20T20:34:34.525Z] 20:34:34     INFO -  #12: XRE_main [toolkit/xre/nsAppRunner.cpp:4943]
[task 2017-11-20T20:34:34.525Z] 20:34:34     INFO -  #13: do_main [browser/app/nsBrowserApp.cpp:231]
[task 2017-11-20T20:34:34.525Z] 20:34:34     INFO -  #14: main [browser/app/nsBrowserApp.cpp:306]
[task 2017-11-20T20:34:34.526Z] 20:34:34     INFO -  Sleeping for 0 seconds.
[task 2017-11-20T20:34:34.526Z] 20:34:34     INFO -  Type 'gdb /builds/worker/workspace/build/application/firefox/firefox 10642' to attach your debugger to this thread.
[task 2017-11-20T20:34:34.526Z] 20:34:34     INFO -  Done sleeping...
[task 2017-11-20T20:34:34.527Z] 20:34:34     INFO -  [WARNING] /builds/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:836:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 124 threads.
[task 2017-11-20T20:34:34.534Z] 20:34:34     INFO -  Program /builds/worker/workspace/build/application/firefox/firefox (pid = 10643) received signal 11.
[task 2017-11-20T20:34:34.535Z] 20:34:34     INFO -  Stack:
[task 2017-11-20T20:34:34.535Z] 20:34:34     INFO -  #01: Allocator<MozJemallocBase>::moz_arena_malloc [memory/build/mozjemalloc.cpp:4195]
[task 2017-11-20T20:34:34.535Z] 20:34:34     INFO -  #02: Jemalloc_Arenas_Test::TestBody [memory/gtest/TestJemalloc.cpp:290]
[task 2017-11-20T20:34:34.536Z] 20:34:34     INFO -  #03: testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> [testing/gtest/gtest/src/gtest.cc:2460]
[task 2017-11-20T20:34:34.536Z] 20:34:34     INFO -  #04: testing::Test::Run [testing/gtest/gtest/src/gtest.cc:2477]
[task 2017-11-20T20:34:34.537Z] 20:34:34     INFO -  #05: testing::TestInfo::Run [testing/gtest/gtest/src/gtest.cc:2658]
[task 2017-11-20T20:34:34.537Z] 20:34:34     INFO -  #06: testing::TestCase::Run [testing/gtest/gtest/src/gtest.cc:2775]
[task 2017-11-20T20:34:34.537Z] 20:34:34     INFO -  #07: testing::internal::UnitTestImpl::RunAllTests [testing/gtest/gtest/src/gtest.cc:4649]
[task 2017-11-20T20:34:34.538Z] 20:34:34     INFO -  #08: testing::UnitTest::Run [testing/gtest/gtest/src/gtest.cc:3877]
[task 2017-11-20T20:34:34.538Z] 20:34:34     INFO -  #09: mozilla::RunGTestFunc [xpcom/base/nsCOMPtr.h:424]
[task 2017-11-20T20:34:34.539Z] 20:34:34     INFO -  #10: XREMain::XRE_mainStartup [toolkit/xre/nsAppRunner.cpp:3888]
[task 2017-11-20T20:34:34.539Z] 20:34:34     INFO -  #11: XREMain::XRE_main [toolkit/xre/nsAppRunner.cpp:4833]
[task 2017-11-20T20:34:34.539Z] 20:34:34     INFO -  #12: XRE_main [toolkit/xre/nsAppRunner.cpp:4943]
[task 2017-11-20T20:34:34.540Z] 20:34:34     INFO -  #13: do_main [browser/app/nsBrowserApp.cpp:231]
[task 2017-11-20T20:34:34.540Z] 20:34:34     INFO -  #14: main [browser/app/nsBrowserApp.cpp:306]
[task 2017-11-20T20:34:34.541Z] 20:34:34     INFO -  Sleeping for 0 seconds.
[task 2017-11-20T20:34:34.541Z] 20:34:34     INFO -  Type 'gdb /builds/worker/workspace/build/application/firefox/firefox 10643' to attach your debugger to this thread.
[task 2017-11-20T20:34:34.542Z] 20:34:34     INFO -  Done sleeping...

...

[task 2017-11-20T20:34:35.811Z] 20:34:35     INFO -  Program /builds/worker/workspace/build/application/firefox/firefox (pid = 10753) received signal 11.
[task 2017-11-20T20:34:35.811Z] 20:34:35     INFO -  Stack:
[task 2017-11-20T20:34:35.812Z] 20:34:35     INFO -  #01: Allocator<MozJemallocBase>::moz_arena_realloc [memory/build/malloc_decls.h:49]
[task 2017-11-20T20:34:35.812Z] 20:34:35     INFO -  #02: Jemalloc_Arenas_Test::TestBody [memory/gtest/TestJemalloc.cpp:306]
[task 2017-11-20T20:34:35.812Z] 20:34:35     INFO -  #03: testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> [testing/gtest/gtest/src/gtest.cc:2460]
[task 2017-11-20T20:34:35.812Z] 20:34:35     INFO -  #04: testing::Test::Run [testing/gtest/gtest/src/gtest.cc:2477]
[task 2017-11-20T20:34:35.813Z] 20:34:35     INFO -  #05: testing::TestInfo::Run [testing/gtest/gtest/src/gtest.cc:2658]
[task 2017-11-20T20:34:35.813Z] 20:34:35     INFO -  #06: testing::TestCase::Run [testing/gtest/gtest/src/gtest.cc:2775]
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  #07: testing::internal::UnitTestImpl::RunAllTests [testing/gtest/gtest/src/gtest.cc:4649]
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  #08: testing::UnitTest::Run [testing/gtest/gtest/src/gtest.cc:3877]
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  #09: mozilla::RunGTestFunc [xpcom/base/nsCOMPtr.h:424]
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  #10: XREMain::XRE_mainStartup [toolkit/xre/nsAppRunner.cpp:3888]
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  #11: XREMain::XRE_main [toolkit/xre/nsAppRunner.cpp:4833]
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  #12: XRE_main [toolkit/xre/nsAppRunner.cpp:4943]
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  #13: do_main [browser/app/nsBrowserApp.cpp:231]
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  #14: main [browser/app/nsBrowserApp.cpp:306]
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  Sleeping for 0 seconds.
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  Type 'gdb /builds/worker/workspace/build/application/firefox/firefox 10753' to attach your debugger to this thread.
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  Done sleeping...
[task 2017-11-20T20:34:35.816Z] 20:34:35     INFO -  [WARNING] /builds/worker/workspace/build/src/testing/gtest/gtest/src/gtest-death-test.cc:836:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 124 threads.
[task 2017-11-20T20:39:35.870Z] 20:39:35    ERROR -  gtest TEST-UNEXPECTED-FAIL | gtest | timed out after 1200 seconds
[task 2017-11-20T20:39:36.662Z] 20:39:36    ERROR - Return code: 1
[task 2017-11-20T20:39:36.662Z] 20:39:36    ERROR - No tests run or test summary not found
[task 2017-11-20T20:39:36.663Z] 20:39:36     INFO - TinderboxPrint: gtest-gtest<br/><em class="testfail">T-FAIL</em>
[task 2017-11-20T20:39:36.663Z] 20:39:36    ERROR - # TBPL FAILURE #
[task 2017-11-20T20:39:36.664Z] 20:39:36  WARNING - setting return code to 2
[task 2017-11-20T20:39:36.664Z] 20:39:36    ERROR - The gtest suite: gtest ran with return status: FAILURE
:glandium -- I see you have modified TestJemalloc.cpp recently. Can you have a look at this failure?
Blocks: 1115253
Flags: needinfo?(mh+mozilla)
Those are not failures. That's how gtest death tests appear on debug builds, unfortunately.
I don't think that's causing new timeouts. As in, this didn't change bug 1115253.
Flags: needinfo?(mh+mozilla)
OK, but all the current Linux timeouts are definitely happening after Jemalloc.Arenas starts and before it completes.
Summary: gtest Jemalloc.Arenas intermittently fails and hangs test suite → gtest Jemalloc.Arenas intermittently hangs test suite
So, after some battling with the gtest test harness, I was finally able to get a stack trace from the freeze (and I think the harness should deal with that on its own by default because we can't do anything about those timeout issues without such traces).

Unfortunately, it's not /very/ helpful. All it says is that the child process of the death test froze...

https://treeherder.mozilla.org/logviewer.html#?job_id=147052039&repo=try&lineNumber=21096
Some of the timeouts in bug 1115253 are *also* due to death tests, like:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=139441795&lineNumber=3074

TestJemalloc.cpp has added, in the past 2 weeks, a *lot* of death tests, increasing the likelihood of this happening (and making it very likely that if it happens, it happens in TestJemalloc).
I finally was able to get a stack trace of the death test process, and it looks like this:

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f2457140efe in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f24413d09d0) at ../nptl/pthread_mutex_lock.c:135
#2  0x00007f24559ceda3 in PR_Lock (lock=lock@entry=0x7f24413d09d0)
    at /builds/worker/workspace/build/src/nsprpub/pr/src/pthreads/ptsynch.c:176
#3  0x00007f2444a85cfa in mozilla::DeadlockDetector<mozilla::BlockingResourceBase>::PRAutoLock::PRAutoLock (
    aLock=0x7f24413d09d0, this=<synthetic pointer>)
    at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/DeadlockDetector.h:113
#4  mozilla::DeadlockDetector<mozilla::BlockingResourceBase>::Remove (this=0x7f24413f9610, aResource=0x7f24413a3ad8)
    at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/DeadlockDetector.h:182
#5  0x00007f24449e7fd7 in mozilla::OffTheBooksMutex::~OffTheBooksMutex (this=0x7f24413a3ab0, __in_chrg=<optimized out>)
    at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/Mutex.h:55
#6  0x00007f24476a6451 in CrashReporter::UnsetExceptionHandler ()
    at /builds/worker/workspace/build/src/toolkit/crashreporter/nsExceptionHandler.cpp:2043
#7  0x00007f244768e602 in nsXULAppInfo::SetEnabled (this=<optimized out>, aEnabled=<optimized out>)
    at /builds/worker/workspace/build/src/toolkit/xre/nsAppRunner.cpp:1245
#8  0x00007f2447b1eead in DisableCrashReporter () at /builds/worker/workspace/build/src/memory/gtest/TestJemalloc.cpp:35
#9  0x00007f2447b1faad in Jemalloc_Arenas_Test::TestBody (this=<optimized out>)

So what's happening is that disabling the crash reporter in the death test forked process (which is what all death tests are doing), deletes a lock:

https://dxr.mozilla.org/mozilla-central/rev/960f50c2e0a991ab2ab313132e69fb2c96cb7866/toolkit/crashreporter/nsExceptionHandler.cpp#2043

That lock is an OffTheBooksMutex:

https://dxr.mozilla.org/mozilla-central/rev/960f50c2e0a991ab2ab313132e69fb2c96cb7866/xpcom/threads/Mutex.h#36

There's nothing interesting in the destructor for that class, but it inherits from a couple classes, and the interesting bits are in BlockingResourceBase:

https://dxr.mozilla.org/mozilla-central/rev/960f50c2e0a991ab2ab313132e69fb2c96cb7866/xpcom/threads/BlockingResourceBase.cpp#229

Which sends us to DeadLockDetector::Remove on a singleton:
https://dxr.mozilla.org/mozilla-central/rev/960f50c2e0a991ab2ab313132e69fb2c96cb7866/xpcom/threads/DeadlockDetector.h#180

The problem is that, in the parent process, another thread was adding or removing something in the same singleton of DeadLockDetector, holding the lock. When the death test forked in a different thread, all the other threads in the child process "died", and the state was still that the lock was held. Then DisableCrashReporter ends up trying to get that same lock and deadlocked.

Ironic that it's the dead lock detector that causes a dead lock.

One way to deal with this would be to add a pthread_atfork that unlocks or reinitializes the deadlock detector's lock in the child process after a fork. OTOH, there are possibly other locks that could be involved in deadlocks depending on what the death tests do. We can't use pthread_atfork to handle them all...

Come to think of it, maybe the right thing to do is for the gtest harness to setup a pthread_atfork that does what all death test users are currently doing manually, plus a few additional things necessary to avoid dead locks such as this one. That would also solve bug 1393711.
Assignee: nobody → mh+mozilla
A simpler option, which would actually be safer, albeit slower, would be to force the --gtest_death_test_style=threadsafe.
FWIW, without the patch, I could get at least 1 dead-lock for 100 gtest runs (but often more). With the patch applied, I got none in 200 gtest runs.
Comment on attachment 8931577 [details]
Bug 1419196 - Use the "threadsafe" variant of gtest death tests.

https://reviewboard.mozilla.org/r/202704/#review208478

This seems like the right thing to do. Unfortunately you'll also have to modify the mach command because it doesn't share a common implementation with the code used in CI (that's filed as bug 1337013):
https://dxr.mozilla.org/mozilla-central/rev/da90245d47b17c750560dedb5cbe1973181166e3/python/mozbuild/mozbuild/mach_commands.py#560
Attachment #8931577 - Flags: review?(ted) → review+
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/afa417c34354
Use the "threadsafe" variant of gtest death tests. r=ted
https://hg.mozilla.org/mozilla-central/rev/afa417c34354
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
You need to log in before you can comment on or make changes to this bug.