Bug 1216175 - Opened 4 years ago, Updated 5 months ago

GC handling in workers is broken | crash in OOM | large | NS_ABORT_OOM | mozilla::dom::CallbackObject::Init

Categories

(Core :: DOM: Workers, defect, P2, critical)

x86_64
Windows 8.1
defect

Tracking


REOPENED

People

(Reporter: jujjyl, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, Whiteboard: DWS_NEXT)

Attachments

(4 files, 7 obsolete files)

This bug was filed from the Socorro interface and is report bp-55e6e5b0-125f-4454-a4be-c54252151019.
=============================================================

The call stack performs a large 3GB allocation and OOMs without handling possible failure. Unfortunately there is no consistent repro.
Hmm, mJSHolders tries to do some large allocation when we're close to OOM anyway?

Does the testcase perhaps create tons of setTimeout calls?


mccr8, we've talked at some point about splitting mJSHolders into several hashtables so that
black marking could be done in several steps. Something like
switch (reinterpret_cast<uintptr_t>(ptr) % 3) {
  case 0: mJSHolders0.Put(ptr); break;
  case 1: mJSHolders1.Put(ptr); break;
  case 2: mJSHolders2.Put(ptr); break;
}
That might or might not help here, by reducing the risk of any single hashtable growing very large.
(We could add more JS holder hashtables dynamically.)
If we're really doing a 3GB allocation here, splitting up the table into a few parts won't help. ;)
Jukka, can you try to figure out what this test case is doing? If that crash report is right, it seems like it is creating hundreds of millions of DOM callbacks...
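To make that concrete, something like the following hypothetical worker snippet would do it; it is purely illustrative and not taken from the actual page. Each pending timeout holds a callback alive, and the crash signature suggests these end up as CallbackObjects in the JS holders table.

// Hypothetical pattern only: schedule callbacks far faster than they can run,
// so millions of pending timeouts pile up in the worker.
function spam() {
  for (var i = 0; i < 100000; ++i) {
    setTimeout(function () {}, 60000);
  }
  setTimeout(spam, 0);
}
spam();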
I built current mozilla-central from source, attached Visual Studio to it, and waited for the crash. The call stack looks like this:

https://dl.dropboxusercontent.com/u/40949268/dump/callstack_1.png
https://dl.dropboxusercontent.com/u/40949268/dump/callstack_2.png
https://dl.dropboxusercontent.com/u/40949268/dump/callstack_3.png
https://dl.dropboxusercontent.com/u/40949268/dump/callstack_4.png
https://dl.dropboxusercontent.com/u/40949268/dump/callstack_5.png

How does that call trace look to you guys?

The page I'm seeing this on exhibits the problem systematically after pretty close to 1000 seconds of running. It is a large Emscripten-compiled application with some handwritten JS code on the side, though something about this particular Emscripten application must be unique, since other Emscripten pages don't have the problem.

Thanks Andrew for the pointer on checking setTimeout()s. Let me debug the setTimeout()s to see if the page is doing something bad there.
OS: Windows NT → Windows 8.1
Hardware: Unspecified → x86_64
Version: unspecified → Trunk
Yeah, it looks like the application must be calling setTimeout in workers an excessive number of times. (Or else there's an underflow bug in the hashtable.)
(In reply to Andrew McCreight [:mccr8] from comment #4)
> If we're really doing a 3GB allocation here, splitting up the table into a
> few parts won't help. ;)
Well, I was actually wondering if there is first some huge allocation and then some rather excessive setTimeout calls. But sure, sounds like a web app bug anyway.
(In reply to Olli Pettay [:smaug] from comment #8)
> Well, I was actually wondering if there is first some huge allocation and
> then some rather excessive setTimeout calls. But sure, sounds like a web app
> bug anyway.

If you look at the crash report, the "OOM Allocation Size" field is "3120562176"; that is the size of the allocation we were attempting when we crashed, so the JS holder table really is trying to grow itself to 3GB.
It's likely that the page is doing something rogue (I'm still trying to figure out what), but this should not be dismissed as a web app bug: it hard-crashes the browser in non-e10s, and in e10s it crashes all tabs, so whatever the page is doing, it is an effective denial-of-service attack.

I do see a few setTimeout()s used in the workers that the page is spawning, but statically analyzing them, the logic looks ok. I'm now implementing console.log() for the page's workers to get some visibility into them.
There are plenty of web app bugs which can't really be handled in browsers without breaking something.
Leaks for example - web apps tend to leak rather often.
Sure, I understand that not all OOM-related denial-of-service vectors are easily fixable, but we should still recognize when such DoS vectors exist. The ideal behavior would be to crash only the offending tab on OOM, free up all the memory that tab had leaked, and leave the other tabs alive.

In any case, we still don't know whether this is a web app bug or not. I'm debugging it further to diagnose.
Attached file settimeouts.zip
Given that the web worker on the page in question (sorry for not being able to link it directly in here, it is a nonpublic demo governed by Mozilla Partner NDA) is using setTimeout(0) to drive a continuous event loop, I tried to reproduce that scenario as a standalone test case, and it looks like I'm seeing similar behavior.

The Emscripten page is growing in memory consumption at about 10MB/sec (consistent with the crash happening after roughly 1000 seconds of running), and the crash occurs when it reaches about 13 GB of consumed memory. In about:memory, this memory is listed in the content process under heap-unclassified.

A standalone test case of the same setTimeout() behavior is shown in the attached .zip file. In that file, the main thread is simulated to be under heavy load by repeatedly calling performance.now() (not sure how relevant that is, it did seem to exaggerate the behavior), and the main thread spawns web workers, which run a busy loop of setTimeout() callbacks. These setTimeout() calls keep growing the used memory size up until the memory size reaches 13GB (on my system with 16GB of RAM). Once the memory consumption reaches close to 100% of total system RAM (as observed in Windows task manager), there is momentary large ~1GB/sec disk activity, after which memory usage goes down by about 1-3GB, and begins to grow again.

In the .zip sample, the OOM crash does not occur, but it does look like such a chain of setTimeout()s either leaks, or consumes a lot of memory which is GCd/CCd in an overly relaxed manner(?). Could you take a peek at the .zip file :smaug?
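Roughly, the attached test case is structured like this (a simplified sketch; the exact file names, worker count, and timing constants here are illustrative rather than copied from the .zip):

// main.js: spawn a few workers, then keep the main thread busy,
// as the test does with repeated performance.now() calls.
var workers = [];
for (var i = 0; i < 4; ++i) {
  workers.push(new Worker('worker.js'));
}
setInterval(function () {
  var start = performance.now();
  while (performance.now() - start < 50) {
    // Spin to simulate heavy main-thread load.
  }
}, 100);

// worker.js: drive a continuous event loop with setTimeout(0),
// so the worker keeps registering new timeout callbacks forever.
function tick() {
  setTimeout(tick, 0);
}
tick();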
Flags: needinfo?(bugs)
If I manually click the GC button on the about:memory page when the browser is at 12GB memory usage, the memory usage drops to 4GB, so at least it looks like this is not a direct memory leak; the GC has simply not decided to run in between?
Running the .zip test page in 32-bit Firefox stable did exhibit an OOM crash like this: https://crash-stats.mozilla.com/report/index/bb18caf8-b045-4fab-b568-58f7d2151020
Attached file simpler testcase
So, this is fun. In this case we never trigger a GC from the GC timers in workers.
We trigger the periodic timer, but then we don't have anything else in the event loop, so
we cancel it and start the idle timer; then the setTimeout timer runs and we do the same again
(starting the periodic timer cancels the idle timer...).
Attached patch wip (obsolete) — Splinter Review
This helps with the setTimeout case, but I still consider any SetGCTimerMode(NoTimer); call a bug.
I wonder whether to deal with those later, when I finally manage to get saner GC/CC scheduling for both the main thread and workers.
Flags: needinfo?(bugs)
The current setup in workers seems to be totally ad hoc, and even though we cancel some
timer, the relevant GarbageCollectRunnable can already be in the control queue, so cancelling doesn't
actually guarantee that anything is cancelled.
Component: DOM → DOM: Workers
Comment on attachment 8679734 [details] [diff] [review]
simpler

We need to not cancel the timers all the time, so we need to keep at least
the idle timer running.

hadDebuggerOrNormalRunnables is there so that we don't restart the idle timer all the time.


Btw, the periodic timer runs quite rarely, with or without the patch. We could possibly get rid of it.
Attachment #8679734 - Flags: review?(khuey)
Comment on attachment 8679734 [details] [diff] [review]
simpler

whoever gets a chance to review this first ;)
Attachment #8679734 - Flags: review?(amarchesini)
Comment on attachment 8679734 [details] [diff] [review]
simpler

Review of attachment 8679734 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/WorkerPrivate.cpp
@@ +4578,4 @@
>    // periodically (PERIODIC_GC_TIMER_DELAY_SEC) while the worker is running.
>    // Once the worker goes idle we set a short (IDLE_GC_TIMER_DELAY_SEC) timer to
> +  // run a shrinking GC.
> +  mPeriodicGCTimer = do_CreateInstance(NS_TIMER_CONTRACTID);

you removed MOZ_ASSERT().

@@ +4583,5 @@
>      new GarbageCollectRunnable(this, false, false);
> +  nsCOMPtr<nsIEventTarget> target = new TimerThreadEventTarget(this, runnable);
> +  MOZ_ALWAYS_TRUE(NS_SUCCEEDED(mPeriodicGCTimer->SetTarget(target)));
> +
> +  mIdleGCTimer = do_CreateInstance(NS_TIMER_CONTRACTID);

an MOZ_ASSERT() here too?
Attachment #8679734 - Flags: review?(amarchesini) → review+
Yes, I removed the MOZ_ASSERT very much on purpose. It is useless to assert that something is non-null if we're going to crash (safely) immediately after that anyway.
Attachment #8679734 - Flags: review?(khuey)
bizarre, I'm not seeing any worker usage in that test.
pushing to try again, since I don't really know what could cause the issue
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1ffe4a4ff2a1
Blocks: 1254240
I need to upload the patch again to try
See Also: → 1333035
I harshed your patches good in bug 1319278.  I want to uplift that, so maybe you can just depend on it for all branches.
Depends on: 1319278
Bug 1319278 changed the underlying code quite a bit, so I need to rebase the patch on top of that.
Attached patch worker_gc_scheduling_6.diff (obsolete) — Splinter Review
The patch isn't that different, but I think a new review doesn't hurt.
Basically it isn't using TimerThreadEventTarget anymore but WorkerControlEventTarget
(because of https://bugzilla.mozilla.org/show_bug.cgi?id=1319278#c24)

Let's see what tryserver thinks about it
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f312b423f34479da9b21fc3d8f08ae0288739ac7
Attachment #8830967 - Flags: review?(amarchesini)
Attachment #8830967 - Flags: review?(amarchesini) → review+
Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/15a5f1ecca37
ensure GC/CC are run in workers, r=baku
Huh, devtools.

I have no idea how to fix this.
Perhaps I could try a longer timeout for the test.
Hmm, but it is usually a very fast test.
FYI, I did notice some other worker intermittents after the landing of this and bug 1319278:

 Bug 1334378
 Bug 1334379
 Bug 1334383

I'm not sure if they are related, but it makes me a bit nervous to uplift.

Perhaps 5 seconds is too often to run a shrinking GC?  Or maybe the shrinking GC does not play well with worker termination?  The service worker tests probably test the termination paths more than other tests.
Flags: needinfo?(bugs)
v7 uses 10 seconds, but I doubt that is the reason.
The link you gave, bug 1003730, sounds like a much more likely reason.

But I re-triggered c2 several times.
Flags: needinfo?(bugs)
Attached patch v8 (obsolete) — Splinter Review
This passes the previously failing test, at least locally.
The key change is:

       while (mControlQueue.IsEmpty() &&
              !(debuggerRunnablesPending = !mDebuggerQueue.IsEmpty()) &&
              !(normalRunnablesPending = NS_HasPendingEvents(mThread))) {
+        if (previousStatus == Closing) {
+          // Nothing to do, let the code below to kill us.
+          break;
+        }
         WaitForWorkerEvents();

That fixes .close() handling. Right now .close() works somewhat accidentally, because we have a GC
timer scheduled and it spins the event loop, so we get out of that while loop.
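For context, the .close() pattern this affects is just a worker shutting itself down from its own global scope; an illustrative sketch (not taken from any test in the tree):

// worker.js (illustrative): the worker closes itself once it has replied.
onmessage = function (e) {
  postMessage('done');
  // close() should wind the worker down even with nothing else queued;
  // previously this only worked because a pending GC timer happened to
  // spin the event loop out of the wait.
  close();
};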
Attachment #8835147 - Flags: review?(amarchesini)
Attachment #8679060 - Attachment is obsolete: true
Attachment #8679734 - Attachment is obsolete: true
Attachment #8830081 - Attachment is obsolete: true
Attachment #8830967 - Attachment is obsolete: true
Attachment #8831338 - Attachment is obsolete: true
Attachment #8835147 - Flags: review?(amarchesini) → review+
And now there are some new test failures :/
Attached patch v9 (obsolete) — Splinter Review
This may fix bug 1137403.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f571403da29eb2f2142b298a146c9ad607c0753c

The change from the previous patch is that postMessage no longer throws if the other side is already closed.
Attachment #8835147 - Attachment is obsolete: true
Attached patch v10 — Splinter Review
yet another variant, trying to change the code less.
Won't ask for review until I get good enough tryserver results.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7ea7d08c8b7321cefc373e5a7b52ee05517d6f02
Attachment #8835601 - Attachment is obsolete: true
(In reply to Olli Pettay [:smaug] from comment #48)
> Created attachment 8835701 [details] [diff] [review]
> v10
> 
> yet another variant, trying to change the code less.
> Won't ask review until I get good enough tryserver results.
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=7ea7d08c8b7321cefc373e5a7b52ee05517d6f02

That run looks sort of green... are you planning on investigating the remaining failures?
Flags: needinfo?(bugs)
Yes, once I have time. I did spend several days looking at those failures but couldn't figure out a setup which doesn't break any tests.
If anyone else has time, feel free to take a look.
Flags: needinfo?(bugs)
Priority: -- → P2
:smaug how's this bug looking?
Flags: needinfo?(bugs)
I guess I should find time to get back to this.
I never managed to get green try runs, but since then many things have changed.
rebased

remote: Follow the progress of your build on Treeherder:
remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f049eaa0d9a695e11e03bc2d03e08e74bac742c
remote: recorded changegroup in replication log in 0.018s
Flags: needinfo?(bugs)
Looks like I had pushed from a broken m-i base, so here is a new push:
remote: Follow the progress of your build on Treeherder:
remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=188cb07a1a46362feae2941eaccf8e67aa023029
remote: recorded changegroup in replication log in 0.018s
Marion, it would be great if you could find someone to look at this. Clearly the tryserver push isn't green enough.

Comment 17 explains the issue.
Assignee: bugs → nobody
Flags: needinfo?(mdaly)
Summary: crash in OOM | large | NS_ABORT_OOM | mozilla::dom::CallbackObject::Init → GC handling in workers is broken | crash in OOM | large | NS_ABORT_OOM | mozilla::dom::CallbackObject::Init
Duplicate of this bug: 1451250
Perry, talk to Asuth and see if this may be up your alley?
Flags: needinfo?(mdaly) → needinfo?(perry)
This is definitely something we should pursue landing.  I'm really hoping :smaug's second push in comment 54 is also on top of bad commits, because the MessageChannel assertions are super odd.  :perry, I'd suggest starting with a fresh try build on top of a known clean mozilla-central commit.
Assignee: nobody → perry
Status: NEW → ASSIGNED
Flags: needinfo?(perry)
As far as I know, it is not on top of a bad commit.
Haven't gotten to this yet but it's on the todo list.

(resetting so we can assign to someone with more cycles)

Assignee: perry → nobody
Status: ASSIGNED → NEW
Whiteboard: DWS_NEXT

I had a better recap that got lost in tab bankruptcy. I think the appropriate summary, which I'm now stealing from bug 1514723, is: the test failures seem to be that the patch exacerbates some platform thread-shutdown issues. nsThread thinks the thread is shut down because it messaged back, so Gecko shutdown continues, but the thread wasn't actually shut down (we don't join on the thread); when the thread later hits full pthread-level shutdown and the thread-local-storage RAII machinery involving PBackground goes to clean up, assertions start exploding.

The next step was going to be to reproduce under rr since what was happening was more than a little complex.

Not a regression. We've had this GC/CC handling in workers basically forever.

Keywords: regression

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → WORKSFORME
Status: RESOLVED → REOPENED
Crash Signature: [@ OOM | large | NS_ABORT_OOM | mozilla::dom::CallbackObject::Init]
Resolution: WORKSFORME → ---

This bug is about a specific test case that fails, not failures on crash-stats.
