browser_aboutRestartRequired_buildid.js fail on artifact try builds
Categories
(Testing :: Mochitest, defect)
Tracking
(firefox101 fixed)
Tracking | Status | |
---|---|---|
firefox101 | --- | fixed |
People
(Reporter: standard8, Assigned: gerard-majax)
References
Details
Attachments
(2 files)
browser_aboutRestartRequired_buildid.js and browser_autoSubmitRequest.js are both failing on try builds using artifacts.
I'm not sure if this needs fixing in artifact builds, or if this is a bug with the tests. I believe this has been around for quite a while, but it looks like no-one has filed it.
Example log:
https://treeherder.mozilla.org/logviewer?job_id=368409732&repo=try&lineNumber=18598
TEST-UNEXPECTED-FAIL | browser/base/content/test/tabcrashed/browser_aboutRestartRequired_buildid.js | Timed out waiting oop-browser-crashed (false-positive) event
browser/base/content/test/tabcrashed/browser_autoSubmitRequest.js | uncaught exception - ReferenceError: info is not defined at getEventPromise/</<@chrome://mochitests/content/browser/browser/base/content/test/tabcrashed/head.js:183:9
Comment hidden (Intermittent Failures Robot) |
Comment 2•3 years ago
|
||
Hmm, thanks for reporting this bug.
I'm going to send this to the Mochitest folks, as they'll know better for how to start digging into this.
Once there's build-specific information, then we should be able to help out over here :)
Comment hidden (Intermittent Failures Robot) |
Comment 4•3 years ago
|
||
The severity field is not set for this bug.
:ahal, could you have a look please?
For more information, please visit auto_nag documentation.
Comment hidden (Intermittent Failures Robot) |
Comment 6•3 years ago
|
||
Testing :: Mochitest is more about the test harness itself. Imo, this is more on the owners of the test itself to fix. But that is likely Standard8's team, so we've now come full circle :P
Comment 7•3 years ago
|
||
:jld, looks like you reviewed this test addition in bug 1651133 - any idea why it's failing on artifact builds? It adds a lot of noise to trypushes (as it fails on multiple platforms etc.)
Comment 8•3 years ago
|
||
It timed out waiting for an oop-browser-crashed
, from the deliberate crash that originates here. Just from looking at those ifdefs, that will fail if Gecko isn't built with ENABLE_TESTS
, or if an opt build of Gecko is being used but the test conditions are evaluated with debug
being true (the test is skip-if = !debug
because it's not expected to work in that case).
But, looking at the run from comment #5, the only build consumed by its Taskcluster job should have ENABLE_TESTS
and MOZ_DEBUG
both defined, according to the config.status
artifact.
So I don't know why it's failing on artifact builds yet.
Assignee | ||
Comment 9•3 years ago
|
||
Looking at https://treeherder.mozilla.org/logviewer?job_id=370972231&repo=try&lineNumber=15690-15876 I dont see anything showing that we killed any process at all ?
Especially highlighing dom/base/nsFrameLoader.cpp
https://treeherder.mozilla.org/logviewer?job_id=370972231&repo=try&lineNumber=15728-15729 seems we dont hit the false-positive code path. Is it possible setting the env var is a bit racy and we get tricked ?
Assignee | ||
Comment 10•3 years ago
|
||
I fear this is racy and related to that https://searchfox.org/mozilla-central/rev/766773ca7e3495a3f1ba365f47dffefdea29e3ee/ipc/glue/MessageChannel.cpp#877
I've changed that to 1
locally, and a $ MOZ_CHAOSMODE=0 BUILD_DEBUG=1 ./mach test --headless --repeat=400 --run-until-failure browser/base/content/test/tabcrashed/browser_aboutRestartRequired_b* 2>&1 | tee repro_timeout_oop-browser-crashed.log
does repro the issue ; however, putting back the default value does not repro :[
Assignee | ||
Comment 11•3 years ago
|
||
However for browser/base/content/test/tabcrashed/browser_autoSubmitRequest.js
I have not worked on that so I dont know. I'm going to morph this bug into dealing with the changes I am responsible for and file another for this.
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Updated•3 years ago
|
Reporter | ||
Comment 12•3 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #11)
However for
browser/base/content/test/tabcrashed/browser_autoSubmitRequest.js
I have not worked on that so I dont know. I'm going to morph this bug into dealing with the changes I am responsible for and file another for this.
That's fine, thank you for looking.
Assignee | ||
Comment 13•3 years ago
|
||
I'm finding some source of intermittence because of a NS_ERROR_FILE_NOT_FOUND
on platform.ini
check. Wondering how much it could be because of bug 1737304 ?
Assignee | ||
Comment 14•3 years ago
|
||
Hm no, bug 1721343 is much prior
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 16•3 years ago
|
||
So, it seems there were races all over. I've rewrote the tests in a (hopefully) less racy way, and it seems to have fixed it for linux and windows ; however I've been tracking down continuing failures on macOS.
Until recently, all I could tell for sure was that:
- we would properly reach the call of
callback_
at https://searchfox.org/mozilla-central/rev/6da1ebe13b260efabd88eb98dec5fa8ee65987b2/toolkit/crashreporter/breakpad-client/mac/handler/exception_handler.cc#474-475 - we would never reach the called function, according to
NS_WARNING
placed at https://searchfox.org/mozilla-central/rev/dc09246dfbfd8dafeb6d55ebee18a6294d525443/toolkit/crashreporter/nsExceptionHandler.cpp#1783-1799
I had to use NS_WARNING
since:
- issue is hard to repro
- when the process gets into this "frozen" state,
lldb
refuses to act because it complains the process is already being debugged (likely expected, since we are in the middle of a minidump)
Then I learnt about sample
tool on macOS, which unveil some interesting informations. Specifically:
7205 Thread_1251641: Breakpad ExceptionHandler
+ 7205 thread_start (in libsystem_pthread.dylib) + 15 [0x7ff80fec9f6b]
+ 7205 _pthread_start (in libsystem_pthread.dylib) + 125 [0x7ff80fece4e1]
+ 7205 google_breakpad::ExceptionHandler::WaitForMessage(void*) (in XUL) + 987 [0x12ea3510b] exception_handler.cc:731
+ 7205 google_breakpad::ExceptionHandler::WriteMinidumpWithException(int, long long, long long, __darwin_ucontext*, unsigned int, unsigned int, bool, bool) (in XUL) + 956 [0x12ea347dc] exception_handler.cc:496
+ 7205 CrashReporter::ChildMinidumpCallback(char const*, char const*, void*, mozilla::phc::AddrInfo const*, bool) (in XUL) + 17 [0x12ea2c6d1] nsExceptionHandler.cpp:1819
+ 7205 CrashReporter::PrepareChildExceptionTimeAnnotations(mozilla::phc::AddrInfo const*) (in XUL) + 142 [0x12ea2ef6e] nsExceptionHandler.cpp:1663
+ 7205 NS_IsMainThread (in XUL) + 13 [0x128c8b2ad] nsThreadManager.cpp:219
+ 7205 tlv_get_addr (in libdyld.dylib) + 296 [0x7ff80fed5564]
+ 7205 dyld4::RuntimeState::_instantiateTLVs(unsigned long) (in dyld) + 175 [0x11754c91f]
+ 7205 _malloc_zone_malloc (in libsystem_malloc.dylib) + 125 [0x7ff80fd01abb]
+ 7205 replace_malloc(unsigned long) (in libmozglue.dylib) + 108 [0x10c40354c] PHC.cpp:1138
+ 7205 Allocator<MozJemallocBase>::malloc(unsigned long) (in libmozglue.dylib) + 52 [0x10c3e7084] malloc_decls.h:51
+ 7205 arena_t::Malloc(unsigned long, bool) (in libmozglue.dylib) + 159 [0x10c3e089f] mozjemalloc.cpp:3060
+ 7205 _OSSpinLockLockYield (in libsystem_platform.dylib) + 80 [0x7ff80fee652f]
+ 7205 syscall_thread_switch (in libsystem_kernel.dylib) + 10 [0x7ff80fe91a9a]
This call stacks shows we are indeed correctly reaching the ChildMinidumpCallback
and then we hit some memory allocatiion because of NS_IsMainThread()
being called in https://searchfox.org/mozilla-central/rev/2e328e8040a4b4419647aabdc536a3f61bafc384/toolkit/crashreporter/nsExceptionHandler.cpp#1275-1295
I could confirm with a push nuking the call to NS_IsMainThread()
that things are better: https://treeherder.mozilla.org/jobs?repo=try&revision=41464cf3c81856a269fe3a2411838aecc1f66c8f
We used to have a ~25% rate of intermittence being reported on this specific test, this push has three timeout which seems to be known intermittents.
Assignee | ||
Comment 18•3 years ago
|
||
So:
- with the offending
NS_IsMainThread()
: https://treeherder.mozilla.org/jobs?repo=try&revision=b1054b5b22802d1b4b3ef245b48f7022287259f6 - with an alternative
NS_IsMainThreadSafe()
: https://treeherder.mozilla.org/jobs?repo=try&revision=e185fc460a6c62ff049e5ed15f57de1d1513d62d
There is no doubt.
Assignee | ||
Comment 19•3 years ago
|
||
Assignee | ||
Comment 20•3 years ago
|
||
Thread local variables might be doing lazy allocation in our back, so
querying for NS_IsMainThread() within an exception handler would trigger
memory allocation which in turn will deadlock our process.
Depends on D144953
Updated•3 years ago
|
Comment 21•3 years ago
|
||
Comment 22•3 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/63083b88a46d
https://hg.mozilla.org/mozilla-central/rev/6921abcd7429
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 28•3 years ago
|
||
Unfortunately I'm still seeing a lot of this on recent trypushes, e.g. https://treeherder.mozilla.org/jobs?repo=try&revision=219e53402d03efd78c1771d6e5b935991e9707cc&selectedTaskRun=cnAlP6AUQFSu3UaUB0zEng.0 .
Is there a follow-up, and/or should I file one?
Assignee | ||
Comment 29•3 years ago
|
||
As you can see in those failure it seems to be failing on the new assertion that verifies platform.ini content. That seems to be a new error, and i am not aware of it... Also i am on PTO for a few days so i cant have a look.
Assignee | ||
Comment 30•3 years ago
|
||
That would suggest something removes or changes the content of platform.ini under us see https://searchfox.org/mozilla-central/source/browser/base/content/test/tabcrashed/head.js#179
This is something i would hit locally if you rerun mach build we end up with inconsistent data around. I have no idea why it would happens there. Maybe something introduced some race ?
Description
•