Closed Bug 1704887 Opened 3 years ago Closed 3 years ago

Crash in [@ mozilla::ThreadEventTarget::OnDelayedRunnableRan] (MOZ_DIAGNOSTIC_ASSERT)

Categories

(Core :: XPCOM, defect, P2)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
90 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox87 --- unaffected
firefox88 --- unaffected
firefox89 --- fixed
firefox90 --- fixed

People

(Reporter: kbrosnan, Assigned: KrisWright)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(4 files)

Crash report: https://crash-stats.mozilla.org/report/index/c1394414-ed25-4933-b7dd-ee1c90210412

MOZ_CRASH Reason: MOZ_DIAGNOSTIC_ASSERT(false) (mScheduledDelayedRunnables.RemoveElement(aRunnable))

Top 10 frames of crashing thread:

0 libxul.so mozilla::ThreadEventTarget::OnDelayedRunnableRan xpcom/threads/ThreadEventTarget.cpp:154
1 libxul.so mozilla::DelayedRunnable::Notify xpcom/threads/DelayedRunnable.cpp:61
2 libxul.so nsTimerEvent::Run xpcom/threads/TimerThread.cpp:252
3 libxul.so mozilla::RunnableTask::Run xpcom/threads/TaskController.cpp:470
4 libxul.so mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:754
5 libxul.so mozilla::TaskController::ProcessPendingMTTask xpcom/threads/TaskController.cpp:393
6 libxul.so mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal xpcom/threads/nsThreadUtils.h:534
7 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1159
8 libxul.so NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:548
9 libxul.so mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:87

Filed from https://github.com/mozilla-mobile/fenix/issues/18972

Looks like an XPCOM bug to me?

Component: General → XPCOM
Product: GeckoView → Core

It looks like this means that we're trying to remove a runnable, but it does not exist in the array.

It looks like this assert might have been added in bug 1695580.

Flags: needinfo?(apehrson)

I tried reasoning about possible paths leading up to this assert. I am guessing this happens because the timer fires and dispatches the notification to the target before DelayedDispatch() does its actual dispatch of the runnable. That dispatch comes from a different thread (TimerThread) and this seems like a legit race.

The stack reveals that this happens on a ThreadEventTarget, so at least there is no tail dispatch involved. That should increase the rate significantly. I should be able to write a gtest for this.

It makes sense that this is happening mostly on Android, though it shows a bit on Windows too.

Assignee: nobody → apehrson
Severity: -- → S3
Flags: needinfo?(apehrson)
OS: Android → All
Priority: -- → P2
Has Regression Range: --- → yes

DelayedDispatch, in all current implementations, will set up a timer sync and
then Dispatch() a runnable. Since the timer is set up before the Dispatch, there
is a theoretical chance that the timer fires and dispatches a TimerEvent to the
target thread before DelayedDispatch managed to do so. When this happens the
internal DelayedDispatch runnable exits early, i.e., in practice it never runs.

The chance increases dramatically if the Dispatch() to the target in question is
tail dispatched, since the time between DelayedDispatch and the tail could be
non-trivial.

This patch removes the assert that checks that all DelayedRunnables that have
run have also been scheduled, since per the above no such guarantee exists.

This silences a lint error on the previous patches in the series.

Pushed by kwright@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/34e875f85b88
Add gtest for TaskQueue impl. r=KrisWright
https://hg.mozilla.org/integration/autoland/rev/bde076bfca7e
Remove assertions in OnDelayedRunnableRan. r=KrisWright,necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/211dcc933514
Discard results of RemoveElement for DelayedRunnable r=necko-reviewers,xpcom-reviewers,mccr8,valentin
Assignee: apehrson → nobody
Summary: Crash in [@ mozilla::ThreadEventTarget::OnDelayedRunnableRan] → Crash in [@ mozilla::ThreadEventTarget::OnDelayedRunnableRan] (MOZ_DIAGNOSTIC_ASSERT)
See Also: → 1703978
Crash Signature: [@ mozilla::ThreadEventTarget::OnDelayedRunnableRan] → [@ mozilla::ThreadEventTarget::OnDelayedRunnableRan] [@ nsThread::OnDelayedRunnableRan]
Crash Signature: [@ mozilla::ThreadEventTarget::OnDelayedRunnableRan] [@ nsThread::OnDelayedRunnableRan] → [@ mozilla::ThreadEventTarget::OnDelayedRunnableRan] [@ nsThread::OnDelayedRunnableRan] [@ mozilla::net::nsStreamTransportService::OnDelayedRunnableRan]
Crash Signature: [@ mozilla::ThreadEventTarget::OnDelayedRunnableRan] [@ nsThread::OnDelayedRunnableRan] [@ mozilla::net::nsStreamTransportService::OnDelayedRunnableRan] → [@ mozilla::ThreadEventTarget::OnDelayedRunnableRan] [@ nsThread::OnDelayedRunnableRan] [@ mozilla::net::nsStreamTransportService::OnDelayedRunnableRan]

I've worked on the gtests for this bug a bit more - looks like on some machines the call to monitor.NotifyAll() causes the outer monitor wait to exit before the inner monitor wait does - resulting in a UAF on the mutex that presents as a hang (sound familiar?). I think to solve this we can create two monitors - one that makes the gtest wait until the async steps complete, and an inner one that waits for the timer notification in the test pools.

Flags: needinfo?(kwright)

In the gtest for these patches, we can encounter a hang in the following steps:

  • Monitor waits in the main thread (for async steps to finish)
  • Offthread, monitor waits for the timer to fire
  • Timer notifies the waiting monitors
  • Monitor on the main thread continues first, wrapping up the test. It assumes all async steps are finished.
  • The offthread monitor wait follows, but at this point the main thread has dereferenced the monitor. Because of this race, we run into a UAF and hang on the offthread monitor.

My solution for this is to use two monitors, and notify the outer one after we have actually completed all of the async steps. This should avoid a race between the inner monitor.wait() and the free at the end of the tests.

Tentative try push for the fix: https://treeherder.mozilla.org/jobs?repo=try&revision=307960b2899b320ef5a82d276b86d633a9653941

Assignee: nobody → kwright
Status: NEW → ASSIGNED
Pushed by kwright@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a36d2b07df6f
Add gtest for TaskQueue impl. r=KrisWright
https://hg.mozilla.org/integration/autoland/rev/9c7bed3cd8ef
Remove assertions in OnDelayedRunnableRan. r=KrisWright,necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/74d88e756177
Discard results of RemoveElement for DelayedRunnable r=necko-reviewers,xpcom-reviewers,mccr8,valentin
https://hg.mozilla.org/integration/autoland/rev/40cd1b4fc1d3
Fix a UAF in the DelayedRunnable gtests. r=xpcom-reviewers,nika
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: