Closed Bug 1756407 Opened 2 years ago Closed 2 years ago

browser_aboutRestartRequired_buildid.js fail on artifact try builds

Categories

(Testing :: Mochitest, defect)

defect

Tracking

(firefox101 fixed)

RESOLVED FIXED
101 Branch
Tracking Status
firefox101 --- fixed

People

(Reporter: standard8, Assigned: gerard-majax)

References

Details

Attachments

(2 files)

browser_aboutRestartRequired_buildid.js and browser_autoSubmitRequest.js are both failing on try builds using artifacts.

I'm not sure if this needs fixing in artifact builds, or if this is a bug with the tests. I believe this has been around for quite a while, but it looks like no-one has filed it.

Example log:

https://treeherder.mozilla.org/logviewer?job_id=368409732&repo=try&lineNumber=18598

TEST-UNEXPECTED-FAIL | browser/base/content/test/tabcrashed/browser_aboutRestartRequired_buildid.js | Timed out waiting oop-browser-crashed (false-positive) event

browser/base/content/test/tabcrashed/browser_autoSubmitRequest.js | uncaught exception - ReferenceError: info is not defined at getEventPromise/</<@chrome://mochitests/content/browser/browser/base/content/test/tabcrashed/head.js:183:9

Hmm, thanks for reporting this bug.
I'm going to send this to the Mochitest folks, as they'll know better for how to start digging into this.
Once there's build-specific information, then we should be able to help out over here :)

Component: General → Mochitest
Product: Firefox Build System → Testing

The severity field is not set for this bug.
:ahal, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(ahal)

Testing :: Mochitest is more about the test harness itself. Imo, this is more on the owners of the test itself to fix. But that is likely Standard8's team, so we've now come full circle :P

Severity: -- → S4
Flags: needinfo?(ahal)

:jld, looks like you reviewed this test addition in bug 1651133 - any idea why it's failing on artifact builds? It adds a lot of noise to trypushes (as it fails on multiple platforms etc.)

Flags: needinfo?(jld)

It timed out waiting for an oop-browser-crashed event, from the deliberate crash that originates here. Just from looking at those ifdefs, that will fail if Gecko isn't built with ENABLE_TESTS, or if an opt build of Gecko is being used but the test conditions are evaluated with debug being true (the test is skip-if = !debug because it's not expected to work in that case).
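For reference, the manifest guard described above would look roughly like this (an illustrative sketch, not the exact browser.ini entry for this test):

```ini
[browser_aboutRestartRequired_buildid.js]
# The deliberate crash path only exists in ENABLE_TESTS/debug builds,
# so the test is skipped elsewhere.
skip-if = !debug
```

The artifact-build failure mode suspected here is a mismatch between this condition and the build actually consumed: the harness evaluates debug as true while the downloaded binaries were built differently.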

But, looking at the run from comment #5, the only build consumed by its Taskcluster job should have ENABLE_TESTS and MOZ_DEBUG both defined, according to the config.status artifact.

So I don't know why it's failing on artifact builds yet.

Flags: needinfo?(jld)

Looking at https://treeherder.mozilla.org/logviewer?job_id=370972231&repo=try&lineNumber=15690-15876 I don't see anything showing that we killed any process at all?

Especially highlighting dom/base/nsFrameLoader.cpp https://treeherder.mozilla.org/logviewer?job_id=370972231&repo=try&lineNumber=15728-15729, it seems we don't hit the false-positive code path. Is it possible that setting the env var is a bit racy and we get tricked?

I fear this is racy and related to https://searchfox.org/mozilla-central/rev/766773ca7e3495a3f1ba365f47dffefdea29e3ee/ipc/glue/MessageChannel.cpp#877

I've changed that value to 1 locally, and running MOZ_CHAOSMODE=0 BUILD_DEBUG=1 ./mach test --headless --repeat=400 --run-until-failure browser/base/content/test/tabcrashed/browser_aboutRestartRequired_b* 2>&1 | tee repro_timeout_oop-browser-crashed.log does repro the issue; however, putting back the default value does not repro :[

However, for browser/base/content/test/tabcrashed/browser_autoSubmitRequest.js, I have not worked on that test so I don't know. I'm going to morph this bug into dealing with the changes I am responsible for and file another bug for the rest.

Flags: needinfo?(standard8)
Blocks: 1762549
No longer blocks: 1762549
Summary: browser_aboutRestartRequired_buildid.js and browser_autoSubmitRequest.js fail on artifact try builds → browser_aboutRestartRequired_buildid.js fail on artifact try builds
Assignee: nobody → lissyx+mozillians

(In reply to Alexandre LISSY :gerard-majax from comment #11)

However, for browser/base/content/test/tabcrashed/browser_autoSubmitRequest.js, I have not worked on that test so I don't know. I'm going to morph this bug into dealing with the changes I am responsible for and file another bug for the rest.

That's fine, thank you for looking.

Flags: needinfo?(standard8)

I'm finding some source of intermittence because of an NS_ERROR_FILE_NOT_FOUND on the platform.ini check. Wondering how much of it could be because of bug 1737304?

Hm no, bug 1721343 is much older.

So, it seems there were races all over. I've rewritten the tests in a (hopefully) less racy way, and it seems to have fixed things on Linux and Windows; however, I've been tracking down continuing failures on macOS.

Until recently, all I could tell for sure was that:

I had to use NS_WARNING since:

  • issue is hard to repro
  • when the process gets into this "frozen" state, lldb refuses to act because it complains the process is already being debugged (likely expected, since we are in the middle of a minidump)

Then I learnt about the sample tool on macOS, which unveiled some interesting information. Specifically:

    7205 Thread_1251641: Breakpad ExceptionHandler
    + 7205 thread_start  (in libsystem_pthread.dylib) + 15  [0x7ff80fec9f6b]
    +   7205 _pthread_start  (in libsystem_pthread.dylib) + 125  [0x7ff80fece4e1]
    +     7205 google_breakpad::ExceptionHandler::WaitForMessage(void*)  (in XUL) + 987  [0x12ea3510b]  exception_handler.cc:731
    +       7205 google_breakpad::ExceptionHandler::WriteMinidumpWithException(int, long long, long long, __darwin_ucontext*, unsigned int, unsigned int, bool, bool)  (in XUL) + 956  [0x12ea347dc]  exception_handler.cc:496
    +         7205 CrashReporter::ChildMinidumpCallback(char const*, char const*, void*, mozilla::phc::AddrInfo const*, bool)  (in XUL) + 17  [0x12ea2c6d1]  nsExceptionHandler.cpp:1819
    +           7205 CrashReporter::PrepareChildExceptionTimeAnnotations(mozilla::phc::AddrInfo const*)  (in XUL) + 142  [0x12ea2ef6e]  nsExceptionHandler.cpp:1663
    +             7205 NS_IsMainThread  (in XUL) + 13  [0x128c8b2ad]  nsThreadManager.cpp:219
    +               7205 tlv_get_addr  (in libdyld.dylib) + 296  [0x7ff80fed5564]
    +                 7205 dyld4::RuntimeState::_instantiateTLVs(unsigned long)  (in dyld) + 175  [0x11754c91f]
    +                   7205 _malloc_zone_malloc  (in libsystem_malloc.dylib) + 125  [0x7ff80fd01abb]
    +                     7205 replace_malloc(unsigned long)  (in libmozglue.dylib) + 108  [0x10c40354c]  PHC.cpp:1138
    +                       7205 Allocator<MozJemallocBase>::malloc(unsigned long)  (in libmozglue.dylib) + 52  [0x10c3e7084]  malloc_decls.h:51
    +                         7205 arena_t::Malloc(unsigned long, bool)  (in libmozglue.dylib) + 159  [0x10c3e089f]  mozjemalloc.cpp:3060
    +                           7205 _OSSpinLockLockYield  (in libsystem_platform.dylib) + 80  [0x7ff80fee652f]
    +                             7205 syscall_thread_switch  (in libsystem_kernel.dylib) + 10  [0x7ff80fe91a9a]

This call stack shows we are indeed correctly reaching ChildMinidumpCallback, and then we hit a memory allocation because of NS_IsMainThread() being called in https://searchfox.org/mozilla-central/rev/2e328e8040a4b4419647aabdc536a3f61bafc384/toolkit/crashreporter/nsExceptionHandler.cpp#1275-1295

I could confirm with a push removing the call to NS_IsMainThread() that things are better: https://treeherder.mozilla.org/jobs?repo=try&revision=41464cf3c81856a269fe3a2411838aecc1f66c8f

We used to have a ~25% intermittence rate reported on this specific test; this push has three timeouts which seem to be known intermittents.

Thread-local variables might be doing lazy allocation behind our back, so
querying NS_IsMainThread() within an exception handler could trigger a
memory allocation which in turn will deadlock our process.

Depends on D144953

Attachment #9274234 - Attachment description: Bug 1756407 - Make browser_aboutRestartRequired_buildid.js more robust r?gsvelto! → Bug 1756407 - Make about:restartrequired tests more robust r?gsvelto!
Pushed by alissy@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/63083b88a46d
Make about:restartrequired tests more robust r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/6921abcd7429
Avoid NS_IsMainThread() during exception handling r=gsvelto
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 101 Branch

Unfortunately I'm still seeing a lot of this on recent trypushes, e.g. https://treeherder.mozilla.org/jobs?repo=try&revision=219e53402d03efd78c1771d6e5b935991e9707cc&selectedTaskRun=cnAlP6AUQFSu3UaUB0zEng.0 .

Is there a follow-up, and/or should I file one?

Flags: needinfo?(lissyx+mozillians)

As you can see in those failures, it seems to be failing on the new assertion that verifies the platform.ini content. That seems to be a new error, and I am not aware of it... Also, I am on PTO for a few days so I can't have a look.

Flags: needinfo?(lissyx+mozillians)

That would suggest something removes or changes the content of platform.ini under us, see https://searchfox.org/mozilla-central/source/browser/base/content/test/tabcrashed/head.js#179

This is something I would hit locally: if you rerun mach build, we end up with inconsistent data around. I have no idea why it would happen there. Maybe something introduced a race?

Blocks: 1812314