Closed Bug 1319278 Opened 3 years ago Closed 3 years ago

Crash in OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xrealloc | nsTArray_base<T>::EnsureCapacity<T> | mozilla::dom::workers::WorkerPrivateParent<T>::DispatchControlRunnable

Categories

(Core :: DOM: Workers, defect, critical)

Unspecified
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox-esr45 --- wontfix
firefox51 --- wontfix
firefox52 + wontfix
firefox53 + wontfix
firefox54 --- fixed

People

(Reporter: erahm, Assigned: bkelly)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Keywords: crash, Whiteboard: [MemShrink:P2])

Crash Data

Attachments

(5 files, 9 obsolete files)

2.28 KB, patch
baku
: review+
Details | Diff | Splinter Review
7.88 KB, patch
baku
: review+
Details | Diff | Splinter Review
4.42 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
3.02 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
5.95 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is 
report bp-f04a6930-7ca1-4b34-9cc1-1eedd2161122.
=============================================================

This is an OOM large where |mozilla::dom::workers::WorkerPrivateParent<T>::DispatchControlRunnable| is pushing a runnable into it's queue and is OOMing due to an attempt to resize to 100's of MiB, indicating there 10's of millions of events queued. Almost all crashes are on Android.

The few stacks I've looked at all appear to involve a timer event being pushed into the queue. I'm guessing somehow the worker has become unresponsive but a timer is still spinning and pushing events.

There are a few things that can be done here:
  1) Make the allocation fallible
  2) Add memory reporting of the queue size
  3) Figure out what timer is spamming the queue
Ben, do you have any ideas here? Would your recent throttling work help?
Flags: needinfo?(bkelly)
No, the throttling is for runnables coming from the worker thread to the main thread.  This is a timer runnable being dispatched from the timer thread to the worker thread.

Possible theories:

1. The WorkerPrivate is bogus when the timer fires and we're operating on corrupt memory.
2. The worker is frozen, but we keep adding control runnables for some reason.  Are these all timer runnables?
3. We legitimately have all these runnables and the worker Queue class is just not that great.  It should probably use nsEventQueue instead of nsTArray internally.
4. The worker thread is not getting a chance to run because its at low priority compared to other threads.

Andrea, do you have any time to look at this?
Flags: needinfo?(bkelly) → needinfo?(amarchesini)
(In reply to Ben Kelly [:bkelly] from comment #2)
> 1. The WorkerPrivate is bogus when the timer fires and we're operating on
> corrupt memory.

This seems unlikely as we're consistently increasing the array size (vs blowing up on some address of e5e5e5e5).

> 2. The worker is frozen, but we keep adding control runnables for some
> reason.  Are these all timer runnables?

The stacks I've checked are all dying when adding a timer runnable, so that that seems about right.

> 3. We legitimately have all these runnables and the worker Queue class is
> just not that great.  It should probably use nsEventQueue instead of
> nsTArray internally.

The Queue class is definitely a bit over-engineered, a nsDeque could be subbed as well.

> 4. The worker thread is not getting a chance to run because its at low
> priority compared to other threads.

That's definitely possible, but we're talking about millions of events. If we just say naively we're doubling the array size, an allocation of 160MiB fails, so we were going from 80MiB on a 32-bit system (these are mostly fennec), that's ~20 million events already queued. With an uptime of ~1 hour that's ~6,000 events per second so something seems horribly off.

I guess one thing to do is audit whether or not the mutex is held when accessing the Queue (the class is misleading, in it's default instantiation it is not thread safe).
Well, we regularly fire these timer runnable to execute GC.  It may just often be the final straw.  We don't really know what the contents of the control runnable queue are, though.
So the dom/workers is using TimerThreadEventTarget() which has its own interesting Dispatch() implementation:

https://dxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#1569

The interesting thing here is that we run the nsTimerEvent runnable's Run() immediately.  This allows the nsTimerImpl to reschedule period timers before we actually execute any work on the worker thread:

https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#498

And we actually use TYPE_REPEATING_SLACK for garbage collection timers:

https://dxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#4812

This is bad because its conceivable garbage collections could take longer than our interval of 1 second.  In these cases we will build up GC timer events and potentially never catch up.

I think we should probably use ONE_SHOT timers for GC's and manually reschedule after we execute work on the actual worker thread.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Flags: needinfo?(amarchesini)
Note, the 1 second GC falling behind is even more believable given this is largely a fennec crash.
This probably impacts our service worker stability on android devices.
Actually, I think we should re-write TimerThreadEventTarget.  That is probably not something we want to uplift, though.

To mitigate the problem on release channels I'll add a separate patch where we use a pref set the worker GC timer intervals.  If we increase these intervals I think it should reduce the number of OOMs.
Depends on: 1330432
Whiteboard: [MemShrink] → [MemShrink:P2]
This makes worker GC use the new event target.  I still need to fix the ServiceWorkerPrivate usage of TimerThreadEventTarget and then remove all the old code.
I think I just need a cleanup patch now to remove the newly unused code.
Hit a failure in last try because I did not call the default WorkerRunnable::Cancel() in my override.  Lets try again.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fd031c0a8841a7b5cedf524639295566628c6ca
Attachment #8829635 - Attachment is obsolete: true
Comment on attachment 8829655 [details] [diff] [review]
P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku

This adds new nsIEventTarget implementation that wraps a given nsIRunnable and dispatches it as a WorkerControlRunnable.

This is different from TimerThreadEventTarget in an important way.  The new event target does not run the source runnable until the worker thread can actually execute it.  The TimerThreadEventTarget uses a "dummy" callback that runs immediately.  This differences allows nsTimer's backpressure mechanism to work.  If the worker thread is slow to respond then nsTimer will wait to schedule the next interval.

The goal here is to avoid the OOM by not continuing to dispatch GC runnables to the worker thread while the worker thread is too busy to process any of them.
Attachment #8829655 - Flags: review?(amarchesini)
Comment on attachment 8829636 [details] [diff] [review]
P2 Make the WorkerPrivate create a WorkerControlEventTarget. r=baku

Add the new WorkerControlEventTarget in a WorkerPrivate.  We will use it later patches.
Attachment #8829636 - Flags: review?(amarchesini)
Comment on attachment 8829637 [details] [diff] [review]
P3 Make worker GC use new WorkerControlEventTarget. r=baku

Modify worker GC to use the new event target.  The stale worker gc bits will be removed in a later patch.
Attachment #8829637 - Flags: review?(amarchesini)
Comment on attachment 8829638 [details] [diff] [review]
P4 Make ServiceWorkerPrivate use new WorkerControlEventTarget instead of TimeThreadEventTarget. r=baku

Make ServiceWorkerPrivate use the new WorkerControlEventTarget.  In addition, I modified this to use WorkerHolder instead of busy count.
Attachment #8829638 - Flags: review?(amarchesini)
Comment on attachment 8829640 [details] [diff] [review]
P5 Removed unused TimerThreadEventTarget. r=baku

Remove all the orphaned code.  This deletes TimerThreadEventTarget and the various WorkerPrivate members associated with it.
Attachment #8829640 - Flags: review?(amarchesini)
BTW, I see various other crashes that look like TimerThreadEventTarget.  I'm hopeful these patches will fix those issues as well.
Comment on attachment 8829655 [details] [diff] [review]
P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku

Review of attachment 8829655 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/WorkerPrivate.cpp
@@ +1615,5 @@
> +{
> +  mozilla::Mutex mMutex;
> +  WorkerPrivate* mWorkerPrivate;
> +
> +  ~WorkerControlEventTarget();

= default;

@@ +1637,5 @@
> +WorkerControlEventTarget::WorkerControlEventTarget(WorkerPrivate* aWorkerPrivate)
> +  : mMutex("WorkerControlEventTarget")
> +  , mWorkerPrivate(aWorkerPrivate)
> +{
> +  MOZ_DIAGNOSTIC_ASSERT(mWorkerPrivate);

Why not only having all these methods inline?

@@ +1663,5 @@
> +class WrappedControlRunnable final : public WorkerControlRunnable
> +{
> +  nsCOMPtr<nsIRunnable> mInner;
> +
> +  ~WrappedControlRunnable()

= default;
Attachment #8829655 - Flags: review?(amarchesini) → review+
Attachment #8829636 - Flags: review?(amarchesini) → review+
Attachment #8829637 - Flags: review?(amarchesini) → review+
Attachment #8829638 - Flags: review?(amarchesini) → review+
Attachment #8829640 - Flags: review?(amarchesini) → review+
Attachment #8830750 - Attachment description: Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku → P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6e1f8f0c4d94
P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/e20b4c55bd6a
P2 Make the WorkerPrivate create a WorkerControlEventTarget. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/e881b33252de
P3 Make worker GC use new WorkerControlEventTarget. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/6196560e5967
P4 Make ServiceWorkerPrivate use new WorkerControlEventTarget instead of TimeThreadEventTarget. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/d4f4afc096fd
P5 Removed unused TimerThreadEventTarget. r=baku
[Tracking Requested - why for this release]:

I want to let this bake on nightly for a few days, but we should consider uplifting it to fix the crashes.  Its is more risky, though, than a nullptr fix.
Blocks: 1216175
No longer depends on: 1330432
Might be worth to document somewhere in Worker code that WorkerControlEventTarget behaves as it behaves with timers because of nsTimerEvent's implementation
(In reply to Olli Pettay [:smaug] from comment #34)
> Might be worth to document somewhere in Worker code that
> WorkerControlEventTarget behaves as it behaves with timers because of
> nsTimerEvent's implementation

Well, it doesn't really.  It behaves per the nsIEventTarget contract now.  When you dispatch a runnable it doesn't call Run() until it actually runs on the target.

It was the old code I removed that was doing something non-standard and hidden.  This broke one of the features of nsTimer because it expected standard nsIEventTarget behavior.
Tracking for 52/53, let's see how this goes in nightly; fingers crossed :)
I've noticed a few worker related test timeouts this morning.  I don't know if they are related to this, bug 1216175, or something else, but we should watch them.

Bug 1334378
Bug 1334379
Bug 1334383
Ben, any conclusions from the test timeouts or results on crash-stats? Would you like to try at least an aurora uplift?
Flags: needinfo?(bkelly)
Of those 3 bugs only bug 1334383 has triggered more than.  It does seem it could be caused by this landing, though.

I don't think we should uplift until we can look at that.
Depends on: 1334383
Flags: needinfo?(bkelly)
[Tracking Requested - why for this release]:

Do we still want to track this for FF52 and FF53?  It appears the fix in this bug created some low frequency assertion crashes in bug 1334383.  I think we can fix these for FF54, but they might be risky to uplift.

+froydnj since this decision depends on our ability to uplift xpcom timer thread changes.
Depends on: 1338341
Ben, if it turns out this is still an issue can you request uplift?
Do you need help from QA?
Flags: needinfo?(bkelly)
I don't think its safe to uplift this patch.  We need to fix some fallout from the change in trunk and I doubt those changes could be easily uplifted either.
Flags: needinfo?(bkelly)
You need to log in before you can comment on or make changes to this bug.