Closed Bug 1568571 Opened 4 years ago Closed 4 years ago

Perma [Tier 2] <xpcshell test> | application crashed [@ NS_DispatchToMainThread(already_AddRefed<nsIRunnable>&&, unsigned int)]

Categories

(WebExtensions :: General, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: crash, intermittent-failure, regression)

Crash Data

Filed by: malexandru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=258073255&repo=mozilla-central
Full log: https://queue.taskcluster.net/v1/task/WfitDgTSSz2_Q7JmiijhNQ/runs/0/artifacts/public/logs/live_backing.log


[task 2019-07-24T11:26:12.803Z] 11:26:12 WARNING - PROCESS-CRASH | xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_shutdown_cleanup.js | application crashed [@ NS_DispatchToMainThread(already_AddRefed<nsIRunnable>&&, unsigned int)]
[task 2019-07-24T11:26:12.803Z] 11:26:12 INFO - Crash dump filename: /tmp/xpc-other-9cVstT/78abb918-4332-a10d-3efd-0a013b474774.dmp
[task 2019-07-24T11:26:12.804Z] 11:26:12 INFO - Operating system: Linux
[task 2019-07-24T11:26:12.804Z] 11:26:12 INFO - 0.0.0 Linux 4.4.0-1014-aws #14taskcluster1-Ubuntu SMP Tue Apr 3 10:27:00 UTC 2018 x86_64
[task 2019-07-24T11:26:12.804Z] 11:26:12 INFO - CPU: amd64
[task 2019-07-24T11:26:12.805Z] 11:26:12 INFO - family 6 model 62 stepping 4
[task 2019-07-24T11:26:12.805Z] 11:26:12 INFO - 4 CPUs
[task 2019-07-24T11:26:12.805Z] 11:26:12 INFO - GPU: UNKNOWN
[task 2019-07-24T11:26:12.806Z] 11:26:12 INFO - Crash reason: SIGSEGV /SEGV_MAPERR
[task 2019-07-24T11:26:12.807Z] 11:26:12 INFO - Crash address: 0x0
[task 2019-07-24T11:26:12.807Z] 11:26:12 INFO - Process uptime: not available
[task 2019-07-24T11:26:12.807Z] 11:26:12 INFO - Thread 1 (crashed)
[task 2019-07-24T11:26:12.807Z] 11:26:12 INFO - 0 xpcshell!mozalloc_abort [mozalloc_abort.cpp:5585edba8fdb83267722e62241952272c8e24a8f : 33 + 0x1d]
[task 2019-07-24T11:26:12.808Z] 11:26:12 INFO - rax = 0x000055cc6bafc740 rdx = 0x000055cc6ba753d6
[task 2019-07-24T11:26:12.808Z] 11:26:12 INFO - rcx = 0x00007fcd16ff02dd rbx = 0x00007fcd172be700
[task 2019-07-24T11:26:12.809Z] 11:26:12 INFO - rsi = 0x00007fcd172bf770 rdi = 0x00007fcd172be540
[task 2019-07-24T11:26:12.809Z] 11:26:12 INFO - rbp = 0x00007fcd0e87e540 rsp = 0x00007fcd0e87e530
[task 2019-07-24T11:26:12.810Z] 11:26:12 INFO - r8 = 0x00007fcd172bf770 r9 = 0x00007fcd0e881700
[task 2019-07-24T11:26:12.810Z] 11:26:12 INFO - r10 = 0x0000000000000058 r11 = 0x00007fcd16f667a0
[task 2019-07-24T11:26:12.811Z] 11:26:12 INFO - r12 = 0x00007fff99d72ad7 r13 = 0x0000000000000001
[task 2019-07-24T11:26:12.811Z] 11:26:12 INFO - r14 = 0x00007fcd172be700 r15 = 0x00007fcd0e67e160
[task 2019-07-24T11:26:12.812Z] 11:26:12 INFO - rip = 0x000055cc6ba033d0
[task 2019-07-24T11:26:12.812Z] 11:26:12 INFO - Found by: given as instruction pointer in context
[task 2019-07-24T11:26:12.812Z] 11:26:12 INFO - 1 libxul.so!NS_DebugBreak [nsDebugImpl.cpp:5585edba8fdb83267722e62241952272c8e24a8f : 429 + 0x9]
[task 2019-07-24T11:26:12.813Z] 11:26:12 INFO - rbx = 0x00007fcd0e87e790 rbp = 0x00007fcd0e87e9d0
[task 2019-07-24T11:26:12.813Z] 11:26:12 INFO - rsp = 0x00007fcd0e87e550 r12 = 0x00007fff99d72ad7
[task 2019-07-24T11:26:12.813Z] 11:26:12 INFO - r13 = 0x0000000000000001 r14 = 0x00007fcd172be700
[task 2019-07-24T11:26:12.814Z] 11:26:12 INFO - r15 = 0x00007fcd0e67e160 rip = 0x00007fcd1f3f1c8b
[task 2019-07-24T11:26:12.814Z] 11:26:12 INFO - Found by: call frame info
[task 2019-07-24T11:26:12.814Z] 11:26:12 INFO - 2 libxul.so!NS_DispatchToMainThread(already_AddRefed<nsIRunnable>&&, unsigned int) [nsThreadUtils.cpp:5585edba8fdb83267722e62241952272c8e24a8f : 242 + 0x2c]
[task 2019-07-24T11:26:12.815Z] 11:26:12 INFO - rbx = 0x00007fcd0e87e9f0 rbp = 0x00007fcd0e87ea30
[task 2019-07-24T11:26:12.815Z] 11:26:12 INFO - rsp = 0x00007fcd0e87e9e0 r12 = 0x00000000c1f30001
[task 2019-07-24T11:26:12.816Z] 11:26:12 INFO - r13 = 0x00007fcd0e87e9e8 r14 = 0x0000000000000000
[task 2019-07-24T11:26:12.816Z] 11:26:12 INFO - r15 = 0x00007fcd0e67c840 rip = 0x00007fcd1f50f754
[task 2019-07-24T11:26:12.816Z] 11:26:12 INFO - Found by: call frame info
[task 2019-07-24T11:26:12.816Z] 11:26:12 INFO - 3 libxul.so!NS_DispatchToMainThread(nsIRunnable*, unsigned int) [nsThreadUtils.cpp:5585edba8fdb83267722e62241952272c8e24a8f : 258 + 0x24]
[task 2019-07-24T11:26:12.817Z] 11:26:12 INFO - rbx = 0x00007fcd0e87ea58 rbp = 0x00007fcd0e87ea90
[task 2019-07-24T11:26:12.817Z] 11:26:12 INFO - rsp = 0x00007fcd0e87ea40 r12 = 0x00007fcd0e87ea60
[task 2019-07-24T11:26:12.817Z] 11:26:12 INFO - r13 = 0x0000000000000000 r14 = 0x00007fcd0e87ead0
[task 2019-07-24T11:26:12.817Z] 11:26:12 INFO - r15 = 0x00007fcd0e87eca0 rip = 0x00007fcd1f50f834
[task 2019-07-24T11:26:12.817Z] 11:26:12 INFO - Found by: call frame info
[task 2019-07-24T11:26:12.817Z] 11:26:12 INFO - 4 libxul.so!mozilla::ipc::MessagePump::ScheduleWork() [MessagePump.cpp:5585edba8fdb83267722e62241952272c8e24a8f : 125 + 0x11]

Summary: Perma [Tier 2] toolkit/components/extensions/test/xpcshell/test_ext_shutdown_cleanup.js | application crashed [@ NS_DispatchToMainThread(already_AddRefed<nsIRunnable>&&, unsigned int)] → Perma [Tier 2] <xpcshell test> | application crashed [@ NS_DispatchToMainThread(already_AddRefed<nsIRunnable>&&, unsigned int)]
Flags: needinfo?(nfroyd)
Whiteboard: [stockwell needswork:owner]

Update:
There have been 30 failures within the last 7 days:

  • 1 failure on linux x64 quantumrender debug
  • 29 failures on linux x64 ccov debug

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=264487216&repo=mozilla-central&lineNumber=2865

It looks like these are all ccov failures.

Flags: needinfo?(nfroyd) → needinfo?(mcastelluccio)

There is a failed assertion at:

[task 2019-09-01T23:16:32.718Z] 23:16:32 INFO - PID 32506 | \x07[Parent 32506, Gecko_IOThread] ###!!! ASSERTION: Failed NS_DispatchToMainThread() in shutdown; leaking: 'false', file /builds/worker/workspace/build/src/xpcom/threads/nsThreadUtils.cpp, line 243

A bit before that:

[task 2019-09-01T23:16:32.715Z] 23:16:32 INFO - PID 32506 | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object

Maybe the error is indirectly causing the assertion to fail. I see the failure is always in a test under toolkit/components/extensions/test/xpcshell/, those tests are using PromiseTestUtils (https://searchfox.org/mozilla-central/rev/250f5cc9fb8bdcbb6b23d2a06acfd48addb2f99b/toolkit/components/extensions/test/xpcshell/head.js#36).

My guess is that this is a timing issue that only manifests on ccov builds because they are slower and so the timing is different than normal builds.

Flags: needinfo?(mcastelluccio)

Marco, any new insight on this?

Flags: needinfo?(mcastelluccio)

Nope :(
This should likely be investigated by who wrote those tests.

Flags: needinfo?(mcastelluccio)

Over the last 7 days there are 48 failures on this bug. These happen on: windows10-64, linux64-ccov

Here is the most recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=271935166&repo=mozilla-central&lineNumber=2643

:froydnj I cannot need info you so leaving you pinged. :)

Over the last 7 days there are 43 failures on this bug. These happen on: windows10-64, linux64-ccov.

Any updates on this?

Flags: needinfo?(nfroyd)

I don't know what the issue is, but per comment 14, this looks like a test-specific issue, and not something related to XPCOM.

Component: XPCOM → General
Flags: needinfo?(nfroyd)
Product: Core → WebExtensions

Jim can you assign someone to take a look?

Flags: needinfo?(jmathies)

By looking to the failure logs, it seems that the crash is consistently happening after the test file is actually exiting.
On the other end, the particular test file that is failing is not always the same one, and so I doubt that the underlying issue is specific to something that the particular test is doing (but maybe it may be true for some of the tests), as an example one of the test file that is failing is test_ext_activityLog.js... which doesn't do really much on its own:

And so if this kind of failure is only happening on WebExtensions xpcshell tests, then it may be related to something that the WebExtensions framework does and not what the particular test file is doing.

Something special related to the WebExtension framework and these xpcshell test is that we spawn an extension process where the test extension runs (and in some other tests we also spawn some additional content process to mock webpages running in Firefox tabs),
and so I'm wondering if that could be related to the same failure happening in different tests from this test suite.

Also the logs collected for the failures tracked by orangefactor seems to suggest that there is still something during the shutdown that is trying to dispatch a message (which sounds definitely wrong):

\x07[Parent 30881, Gecko_IOThread] ###!!! ASSERTION: Failed NS_DispatchToMainThread() in shutdown; leaking: 'false', file /builds/worker/workspace/build/src/xpcom/threads/nsThreadUtils.cpp, line 243
INFO -  PID 30881 | #01: NS_DispatchToMainThread(already_AddRefed<nsIRunnable>&&, unsigned int) [xpcom/threads/nsThreadUtils.cpp:242]
INFO -  PID 30881 | #02: NS_DispatchToMainThread(nsIRunnable*, unsigned int) [xpcom/threads/nsThreadUtils.cpp:258]
INFO -  PID 30881 | #03: mozilla::ipc::MessagePump::ScheduleWork() [ipc/glue/MessagePump.cpp:125]
INFO -  PID 30881 | #04: MessageLoop::PostTask_Helper(already_AddRefed<nsIRunnable>, int) [ipc/chromium/src/base/message_loop.cc:402]
INFO -  PID 30881 | #05: MessageLoop::PostTask(already_AddRefed<nsIRunnable>) [ipc/chromium/src/base/message_loop.cc:346]
INFO -  PID 30881 | #06: mozilla::ipc::MessageChannel::OnChannelConnected(int) [ipc/glue/MessageChannel.cpp:2455]
INFO -  PID 30881 | #07: mozilla::ipc::ProcessLink::OnChannelConnected(int) [ipc/glue/MessageLink.cpp:328]
INFO -  PID 30881 | #08: IPC::Channel::ChannelImpl::ProcessIncomingMessages() [ipc/chromium/src/chrome/common/ipc_channel_posix.cc:571]
INFO -  PID 30881 | #09: IPC::Channel::ChannelImpl::OnFileCanReadWithoutBlocking(int) [ipc/chromium/src/chrome/common/ipc_channel_posix.cc:822]
INFO -  PID 30881 | #10: base::MessagePumpLibevent::OnLibeventNotification(int, short, void*) [ipc/chromium/src/base/message_pump_libevent.cc:244]
INFO -  PID 30881 | #11: event_process_active_single_queue [ipc/chromium/src/third_party/libevent/event.c:1639]
INFO -  PID 30881 | #12: event_base_loop [ipc/chromium/src/third_party/libevent/event.c:1961]
INFO -  PID 30881 | #13: base::MessagePumpLibevent::Run(base::MessagePump::Delegate*) [ipc/chromium/src/base/message_pump_libevent.cc:337]
INFO -  PID 30881 | #14: MessageLoop::RunInternal() [ipc/chromium/src/base/message_loop.cc:315]
INFO -  PID 30881 | #15: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:289]
INFO -  PID 30881 | #16: base::Thread::ThreadMain() [ipc/chromium/src/base/thread.cc:195]
INFO -  PID 30881 | #17: ThreadFunc [ipc/chromium/src/base/platform_thread_posix.cc:42]
INFO -  PID 30881 | #18: libpthread.so.0 + 0x76ba

But I'm not sure if it is something that may be actually triggered by the WebExtensions internals, the stacktrace suggests that the failure is triggered from the ipc layer, and so maybe someone that works on that layer may be able to help us to shade some light on the underlying reason (or suggest a strategy to get some additional detail that could help us with that).

As a side note, I tried to reproduce this failure locally (on a local linux64 ccov debug build) to double-check if I could give it a deeper look from gdb, but I've been able to see these xpcshell test to fail consistently in the ccov debug build, but I've got a different failing assertion and stacktrace (which looks to have in common with the failures collected on the build infrastructure that there is some unexpected ipc activity happening during the shutdown):

0x00007fffc698f6fb in mozilla::ipc::MessageChannel::WillDestroyCurrentMessageLoop (this=0x7fffa3b91120)
    at /zfs-ubik/mozlab/shared/mozilla-central/ipc/glue/MessageChannel.cpp:732
732       MOZ_CRASH("MessageLoop destroyed before MessageChannel that's bound to it");
(gdb) bt
#0  0x00007fffc698f6fb in mozilla::ipc::MessageChannel::WillDestroyCurrentMessageLoop() (this=0x7fffa3b91120)
    at /zfs-ubik/mozlab/shared/mozilla-central/ipc/glue/MessageChannel.cpp:732
#1  0x00007fffc66a17a8 in MessageLoop::~MessageLoop() (this=0x7fffa76d5020) at /zfs-ubik/mozlab/shared/mozilla-central/ipc/chromium/src/base/message_loop.cc:253
#2  0x00007fffc448ab50 in MessageLoopForUI::~MessageLoopForUI() (this=0x7fffa76d5020) at /zfs-ubik/mozlab/shared/mozilla-central/ipc/chromium/src/base/message_loop.h:459
#3  0x00007fffc448abb4 in MessageLoopForUI::~MessageLoopForUI() (this=0x7fffa76d5020) at /zfs-ubik/mozlab/shared/mozilla-central/ipc/chromium/src/base/message_loop.h:459
#4  0x00007fffc446dd5b in mozilla::ShutdownXPCOM(nsIServiceManager*) (aServMgr=0x0) at /zfs-ubik/mozlab/shared/mozilla-central/xpcom/build/XPCOMInit.cpp:791
#5  0x00007fffc446cb5d in NS_ShutdownXPCOM(nsIServiceManager*) (aServMgr=0x0) at /zfs-ubik/mozlab/shared/mozilla-central/xpcom/build/XPCOMInit.cpp:564
#6  0x00007fffc8542352 in XRE_XPCShellMain(int, char**, char**, XREShellData const*) (argc=24, argv=0x7fffffffcca0, envp=0x7fffffffcd68, aShellData=0x7fffffffcb60)
    at /zfs-ubik/mozlab/shared/mozilla-central/js/xpconnect/src/XPCShellImpl.cpp:1423
#7  0x00007fffdc23f71c in mozilla::BootstrapImpl::XRE_XPCShellMain(int, char**, char**, XREShellData const*) (this=0x7fffa76026d0, argc=31, argv=0x7fffffffcc68, envp=0x7fffffffcd68, aShellData=0x7fffffffcb60) at /zfs-ubik/mozlab/shared/mozilla-central/toolkit/xre/Bootstrap.cpp:54
#8  0x000055555561c260 in main(int, char**, char**) (argc=31, argv=0x7fffffffcc68, envp=0x7fffffffcd68)
    at /zfs-ubik/mozlab/shared/mozilla-central/js/xpconnect/shell/xpcshell.cpp:66
(gdb) p this->mName
$1 = 0x7fffba12614b "PContentParent"
(gdb) p this->mIsCrossProcess
$2 = true

The MessageChannel that is triggering this failure for me locally has "PContentParent" as its name and it is a cross process message channel.

I also noticed that if I introduce a little latency at the end of the test, the assertion failure isn't triggered anymore, and so there seems to be a race.

Hi Marco,
I was wondering: is there something that our internal "CodeCoverage"-related components are doing over ipc that may be able to trigger these kind of failures? (maybe triggered by these tests by spawning the child processes, see comment 31 and comment 32 for some details I tried to collect about what I observed)

Flags: needinfo?(jmathies) → needinfo?(mcastelluccio)

this seems to have stopped on November 15th (a week ago). I am not sure if this was disabled, or something else changed. Possibly this is related to running ccov on opt vs debug?

Whiteboard: [stockwell needswork:owner]

Yeah, I'm waiting for things to settle, but probably the opt switch (or the GCC 7 switch) fixed this.

See Also: → 1603854

I think the assertion is debug-only, so the switch to opt should effectively avoid it.

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(mcastelluccio)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.