Closed Bug 1319278 Opened 8 years ago Closed 7 years ago

Crash in OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xrealloc | nsTArray_base<T>::EnsureCapacity<T> | mozilla::dom::workers::WorkerPrivateParent<T>::DispatchControlRunnable

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla54

Tracking Flags:

Tracking

Status

firefox-esr45

---

wontfix

firefox51

---

wontfix

firefox52

wontfix

firefox53

wontfix

firefox54

---

fixed

People

(Reporter: erahm, Assigned: bkelly)

References

(Depends on 1 open bug)

Details

(Keywords: crash, Whiteboard: [MemShrink:P2])

Crash Data

Attachments

(5 files, 9 obsolete files)

P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 4.34 KB, patch		Details \| Diff \| Splinter Review
P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 4.64 KB, patch		Details \| Diff \| Splinter Review
P2 Make the WorkerPrivate create a WorkerControlEventTarget. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 2.28 KB, patch		Details \| Diff \| Splinter Review
P3 Make worker GC use new WorkerControlEventTarget. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 3.02 KB, patch		Details \| Diff \| Splinter Review
P4 Make ServiceWorkerPrivate use new WorkerControlEventTarget instead of TimeThreadEventTarget. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 7.05 KB, patch		Details \| Diff \| Splinter Review
P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 4.54 KB, patch		Details \| Diff \| Splinter Review
P2 Make the WorkerPrivate create a WorkerControlEventTarget. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 2.28 KB, patch	baku : review+	Details \| Diff \| Splinter Review
P3 Make worker GC use new WorkerControlEventTarget. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 3.02 KB, patch	baku : review+	Details \| Diff \| Splinter Review
P4 Make ServiceWorkerPrivate use new WorkerControlEventTarget instead of TimeThreadEventTarget. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 7.88 KB, patch	baku : review+	Details \| Diff \| Splinter Review
P5 Removed unused TimerThreadEventTarget. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 5.98 KB, patch	baku : review+	Details \| Diff \| Splinter Review
P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 4.66 KB, patch	baku : review+	Details \| Diff \| Splinter Review
P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 4.42 KB, patch	bkelly : review+	Details \| Diff \| Splinter Review
P3 Make worker GC use new WorkerControlEventTarget. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 3.02 KB, patch	bkelly : review+	Details \| Diff \| Splinter Review
P5 Removed unused TimerThreadEventTarget. r=baku 7 years ago Ben Kelly [:bkelly, not reviewing] 5.95 KB, patch	bkelly : review+	Details \| Diff \| Splinter Review

Eric Rahm [:erahm]

Reporter

Description

•

8 years ago

This bug was filed from the Socorro interface and is 
report bp-f04a6930-7ca1-4b34-9cc1-1eedd2161122.
=============================================================

This is an OOM large where |mozilla::dom::workers::WorkerPrivateParent<T>::DispatchControlRunnable| is pushing a runnable into it's queue and is OOMing due to an attempt to resize to 100's of MiB, indicating there 10's of millions of events queued. Almost all crashes are on Android.

The few stacks I've looked at all appear to involve a timer event being pushed into the queue. I'm guessing somehow the worker has become unresponsive but a timer is still spinning and pushing events.

There are a few things that can be done here:
  1) Make the allocation fallible
  2) Add memory reporting of the queue size
  3) Figure out what timer is spamming the queue

Andrew Overholt [:overholt]

Comment 1

•

8 years ago

Ben, do you have any ideas here? Would your recent throttling work help?

Flags: needinfo?(bkelly)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 2

•

8 years ago

No, the throttling is for runnables coming from the worker thread to the main thread.  This is a timer runnable being dispatched from the timer thread to the worker thread.

Possible theories:

1. The WorkerPrivate is bogus when the timer fires and we're operating on corrupt memory.
2. The worker is frozen, but we keep adding control runnables for some reason.  Are these all timer runnables?
3. We legitimately have all these runnables and the worker Queue class is just not that great.  It should probably use nsEventQueue instead of nsTArray internally.
4. The worker thread is not getting a chance to run because its at low priority compared to other threads.

Andrea, do you have any time to look at this?

Flags: needinfo?(bkelly) → needinfo?(amarchesini)

Eric Rahm [:erahm]

Reporter

Comment 3

•

8 years ago

(In reply to Ben Kelly [:bkelly] from comment #2)
> 1. The WorkerPrivate is bogus when the timer fires and we're operating on
> corrupt memory.

This seems unlikely as we're consistently increasing the array size (vs blowing up on some address of e5e5e5e5).

> 2. The worker is frozen, but we keep adding control runnables for some
> reason.  Are these all timer runnables?

The stacks I've checked are all dying when adding a timer runnable, so that that seems about right.

> 3. We legitimately have all these runnables and the worker Queue class is
> just not that great.  It should probably use nsEventQueue instead of
> nsTArray internally.

The Queue class is definitely a bit over-engineered, a nsDeque could be subbed as well.

> 4. The worker thread is not getting a chance to run because its at low
> priority compared to other threads.

That's definitely possible, but we're talking about millions of events. If we just say naively we're doubling the array size, an allocation of 160MiB fails, so we were going from 80MiB on a 32-bit system (these are mostly fennec), that's ~20 million events already queued. With an uptime of ~1 hour that's ~6,000 events per second so something seems horribly off.

I guess one thing to do is audit whether or not the mutex is held when accessing the Queue (the class is misleading, in it's default instantiation it is not thread safe).

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 4

•

8 years ago

Well, we regularly fire these timer runnable to execute GC.  It may just often be the final straw.  We don't really know what the contents of the control runnable queue are, though.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 5

•

7 years ago

So the dom/workers is using TimerThreadEventTarget() which has its own interesting Dispatch() implementation:

https://dxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#1569

The interesting thing here is that we run the nsTimerEvent runnable's Run() immediately.  This allows the nsTimerImpl to reschedule period timers before we actually execute any work on the worker thread:

https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#498

And we actually use TYPE_REPEATING_SLACK for garbage collection timers:

https://dxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#4812

This is bad because its conceivable garbage collections could take longer than our interval of 1 second.  In these cases we will build up GC timer events and potentially never catch up.

I think we should probably use ONE_SHOT timers for GC's and manually reschedule after we execute work on the actual worker thread.

Assignee: nobody → bkelly

Status: NEW → ASSIGNED

Flags: needinfo?(amarchesini)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 6

•

7 years ago

Note, the 1 second GC falling behind is even more believable given this is largely a fennec crash.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 7

•

7 years ago

This probably impacts our service worker stability on android devices.

Blocks: ServiceWorkers-stability

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 8

•

7 years ago

Actually, I think we should re-write TimerThreadEventTarget.  That is probably not something we want to uplift, though.

To mitigate the problem on release channels I'll add a separate patch where we use a pref set the worker GC timer intervals.  If we increase these intervals I think it should reduce the number of OOMs.

Ben Kelly [:bkelly, not reviewing]

Assignee

Updated

•

7 years ago

Depends on: 1330432

Eric Rahm [:erahm]

Reporter

Updated

•

7 years ago

Whiteboard: [MemShrink] → [MemShrink:P2]

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 9

•

7 years ago

Attached patch P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku (obsolete) — Details — Splinter Review

First step adding the new event target.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 10

•

7 years ago

Attached patch P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku (obsolete) — Details — Splinter Review

Attachment #8826342 - Attachment is obsolete: true

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 11

•

7 years ago

Attached patch P2 Make the WorkerPrivate create a WorkerControlEventTarget. r=baku (obsolete) — Details — Splinter Review

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 12

•

7 years ago

Attached patch P3 Make worker GC use new WorkerControlEventTarget. r=baku (obsolete) — Details — Splinter Review

This makes worker GC use the new event target.  I still need to fix the ServiceWorkerPrivate usage of TimerThreadEventTarget and then remove all the old code.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 13

•

7 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=88631be90b1d28a7bfb51cf49bb7a5a176817980

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 14

•

7 years ago

Attached patch P4 Make ServiceWorkerPrivate use new WorkerControlEventTarget instead of TimeThreadEventTarget. r=baku (obsolete) — Details — Splinter Review

I think I just need a cleanup patch now to remove the newly unused code.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 15

•

7 years ago

Attached patch P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku (obsolete) — Details — Splinter Review

Attachment #8828509 - Attachment is obsolete: true

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 16

•

7 years ago

Attached patch P2 Make the WorkerPrivate create a WorkerControlEventTarget. r=baku — Details — Splinter Review

Attachment #8828510 - Attachment is obsolete: true

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 17

•

7 years ago

Attached patch P3 Make worker GC use new WorkerControlEventTarget. r=baku (obsolete) — Details — Splinter Review

Attachment #8828512 - Attachment is obsolete: true

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 18

•

7 years ago

Attached patch P4 Make ServiceWorkerPrivate use new WorkerControlEventTarget instead of TimeThreadEventTarget. r=baku — Details — Splinter Review

Attachment #8828978 - Attachment is obsolete: true

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 19

•

7 years ago

Attached patch P5 Removed unused TimerThreadEventTarget. r=baku (obsolete) — Details — Splinter Review

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f761b505634de8c9d7db7e3a95e5dd47019e076

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 20

•

7 years ago

Attached patch P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku (obsolete) — Details — Splinter Review

Hit a failure in last try because I did not call the default WorkerRunnable::Cancel() in my override.  Lets try again.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fd031c0a8841a7b5cedf524639295566628c6ca

Attachment #8829635 - Attachment is obsolete: true

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 21

•

7 years ago

Comment on attachment 8829655 [details] [diff] [review]
P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku

This adds new nsIEventTarget implementation that wraps a given nsIRunnable and dispatches it as a WorkerControlRunnable.

This is different from TimerThreadEventTarget in an important way.  The new event target does not run the source runnable until the worker thread can actually execute it.  The TimerThreadEventTarget uses a "dummy" callback that runs immediately.  This differences allows nsTimer's backpressure mechanism to work.  If the worker thread is slow to respond then nsTimer will wait to schedule the next interval.

The goal here is to avoid the OOM by not continuing to dispatch GC runnables to the worker thread while the worker thread is too busy to process any of them.

Attachment #8829655 - Flags: review?(amarchesini)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 22

•

7 years ago

Comment on attachment 8829636 [details] [diff] [review]
P2 Make the WorkerPrivate create a WorkerControlEventTarget. r=baku

Add the new WorkerControlEventTarget in a WorkerPrivate.  We will use it later patches.

Attachment #8829636 - Flags: review?(amarchesini)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 23

•

7 years ago

Comment on attachment 8829637 [details] [diff] [review]
P3 Make worker GC use new WorkerControlEventTarget. r=baku

Modify worker GC to use the new event target.  The stale worker gc bits will be removed in a later patch.

Attachment #8829637 - Flags: review?(amarchesini)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 24

•

7 years ago

Comment on attachment 8829638 [details] [diff] [review]
P4 Make ServiceWorkerPrivate use new WorkerControlEventTarget instead of TimeThreadEventTarget. r=baku

Make ServiceWorkerPrivate use the new WorkerControlEventTarget.  In addition, I modified this to use WorkerHolder instead of busy count.

Attachment #8829638 - Flags: review?(amarchesini)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 25

•

7 years ago

Comment on attachment 8829640 [details] [diff] [review]
P5 Removed unused TimerThreadEventTarget. r=baku

Remove all the orphaned code.  This deletes TimerThreadEventTarget and the various WorkerPrivate members associated with it.

Attachment #8829640 - Flags: review?(amarchesini)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 26

•

7 years ago

BTW, I see various other crashes that look like TimerThreadEventTarget.  I'm hopeful these patches will fix those issues as well.

Andrea Marchesini [:baku]

Comment 27

•

7 years ago

Comment on attachment 8829655 [details] [diff] [review]
P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku

Review of attachment 8829655 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/WorkerPrivate.cpp
@@ +1615,5 @@
> +{
> +  mozilla::Mutex mMutex;
> +  WorkerPrivate* mWorkerPrivate;
> +
> +  ~WorkerControlEventTarget();

= default;

@@ +1637,5 @@
> +WorkerControlEventTarget::WorkerControlEventTarget(WorkerPrivate* aWorkerPrivate)
> +  : mMutex("WorkerControlEventTarget")
> +  , mWorkerPrivate(aWorkerPrivate)
> +{
> +  MOZ_DIAGNOSTIC_ASSERT(mWorkerPrivate);

Why not only having all these methods inline?

@@ +1663,5 @@
> +class WrappedControlRunnable final : public WorkerControlRunnable
> +{
> +  nsCOMPtr<nsIRunnable> mInner;
> +
> +  ~WrappedControlRunnable()

= default;

Attachment #8829655 - Flags: review?(amarchesini) → review+

Andrea Marchesini [:baku]

Updated

•

7 years ago

Attachment #8829636 - Flags: review?(amarchesini) → review+

Andrea Marchesini [:baku]

Updated

•

7 years ago

Attachment #8829637 - Flags: review?(amarchesini) → review+

Andrea Marchesini [:baku]

Updated

•

7 years ago

Attachment #8829638 - Flags: review?(amarchesini) → review+

Andrea Marchesini [:baku]

Updated

•

7 years ago

Attachment #8829640 - Flags: review?(amarchesini) → review+

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 28

•

7 years ago

Attached patch P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku — Details — Splinter Review

Attachment #8829655 - Attachment is obsolete: true

Attachment #8830750 - Flags: review+

Ben Kelly [:bkelly, not reviewing]

Assignee

Updated

•

7 years ago

Attachment #8830750 - Attachment description: Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku → P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 29

•

7 years ago

Attached patch P3 Make worker GC use new WorkerControlEventTarget. r=baku — Details — Splinter Review

Attachment #8829637 - Attachment is obsolete: true

Attachment #8830752 - Flags: review+

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 30

•

7 years ago

Attached patch P5 Removed unused TimerThreadEventTarget. r=baku — Details — Splinter Review

Attachment #8829640 - Attachment is obsolete: true

Attachment #8830754 - Flags: review+

Pulsebot

Comment 31

•

7 years ago

Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6e1f8f0c4d94
P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/e20b4c55bd6a
P2 Make the WorkerPrivate create a WorkerControlEventTarget. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/e881b33252de
P3 Make worker GC use new WorkerControlEventTarget. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/6196560e5967
P4 Make ServiceWorkerPrivate use new WorkerControlEventTarget instead of TimeThreadEventTarget. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/d4f4afc096fd
P5 Removed unused TimerThreadEventTarget. r=baku

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 32

•

7 years ago

[Tracking Requested - why for this release]:

I want to let this bake on nightly for a few days, but we should consider uplifting it to fix the crashes.  Its is more risky, though, than a nullptr fix.

status-firefox51: --- → wontfix

status-firefox52: --- → affected

status-firefox53: --- → affected

status-firefox54: --- → affected

status-firefox-esr45: --- → wontfix

tracking-firefox52: --- → ?

tracking-firefox53: --- → ?

Ben Kelly [:bkelly, not reviewing]

Assignee

Updated

•

7 years ago

Blocks: 1216175

Ben Kelly [:bkelly, not reviewing]

Assignee

Updated

•

7 years ago

No longer depends on: 1330432

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 33

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/6e1f8f0c4d94
https://hg.mozilla.org/mozilla-central/rev/e20b4c55bd6a
https://hg.mozilla.org/mozilla-central/rev/e881b33252de
https://hg.mozilla.org/mozilla-central/rev/6196560e5967
https://hg.mozilla.org/mozilla-central/rev/d4f4afc096fd

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox54: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla54

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 34

•

7 years ago

Might be worth to document somewhere in Worker code that WorkerControlEventTarget behaves as it behaves with timers because of nsTimerEvent's implementation

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 35

•

7 years ago

(In reply to Olli Pettay [:smaug] from comment #34)
> Might be worth to document somewhere in Worker code that
> WorkerControlEventTarget behaves as it behaves with timers because of
> nsTimerEvent's implementation

Well, it doesn't really.  It behaves per the nsIEventTarget contract now.  When you dispatch a runnable it doesn't call Run() until it actually runs on the target.

It was the old code I removed that was doing something non-standard and hidden.  This broke one of the features of nsTimer because it expected standard nsIEventTarget behavior.

Julien Cristau [:jcristau]

Comment 36

•

7 years ago

Tracking for 52/53, let's see how this goes in nightly; fingers crossed :)

tracking-firefox52: ? → +

tracking-firefox53: ? → +

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 37

•

7 years ago

I've noticed a few worker related test timeouts this morning.  I don't know if they are related to this, bug 1216175, or something else, but we should watch them.

Bug 1334378
Bug 1334379
Bug 1334383

Liz Henry (:lizzard) (relman/hg->git project)

Comment 38

•

7 years ago

Ben, any conclusions from the test timeouts or results on crash-stats? Would you like to try at least an aurora uplift?

Flags: needinfo?(bkelly)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 39

•

7 years ago

Of those 3 bugs only bug 1334383 has triggered more than.  It does seem it could be caused by this landing, though.

I don't think we should uplift until we can look at that.

Depends on: 1334383

Flags: needinfo?(bkelly)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 40

•

7 years ago

[Tracking Requested - why for this release]:

Do we still want to track this for FF52 and FF53?  It appears the fix in this bug created some low frequency assertion crashes in bug 1334383.  I think we can fix these for FF54, but they might be risky to uplift.

+froydnj since this decision depends on our ability to uplift xpcom timer thread changes.

tracking-firefox52: + → ?

tracking-firefox53: + → ?

Ben Kelly [:bkelly, not reviewing]

Assignee

Updated

•

7 years ago

Depends on: 1338341

Liz Henry (:lizzard) (relman/hg->git project)

Comment 41

•

7 years ago

Ben, if it turns out this is still an issue can you request uplift?
Do you need help from QA?

tracking-firefox52: ? → +

tracking-firefox53: ? → +

Flags: needinfo?(bkelly)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 42

•

7 years ago

I don't think its safe to uplift this patch.  We need to fix some fallout from the change in trunk and I doubt those changes could be easily uplifted either.

Flags: needinfo?(bkelly)

Julien Cristau [:jcristau]

Updated

•

7 years ago

status-firefox52: affected → wontfix

status-firefox53: affected → wontfix

You need to log in before you can comment on or make changes to this bug.