Bug 617569 (Closed) · Opened 14 years ago · Closed 13 years ago

High CPU load and memory leak with web workers

Categories

(Core :: JavaScript Engine, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

Status: RESOLVED WORKSFORME
Tracking Status: blocking2.0 --- .x+

People

(Reporter: igor, Assigned: bent.mozilla)


Details

(Keywords: testcase, Whiteboard: [cib-workers][softblocker][MemShrink:P2])

See bug 590533 comment 21.

For the test case, see bug 590533 comment 6 and attachment 493718 (bzip tar archive).
Assignee: general → igor
Whiteboard: [cib-workers]
Whiteboard: [cib-workers] → [cib-workers][softblocker]
Twice now, I ran the testcase on an osx64 debug build, watched it get up to 3 GB, navigated away, watched it stay at 3 GB allocated but then eventually drop back down to 200 MB (though that takes about 10 minutes).  Does this correspond to any worker-related timers that you know of, Ben?
Ben explained that workers get suspended when you navigate away, so it makes sense that they hold their resources until they get kicked out of the cache.  Tools -> Clear History (Everything) causes everything to be released within about 30 seconds.  So that's not the leak.

What does seem to be a leak is that letting this run for a while uses up all available memory and then crashes.
I played with it some more; a few observations:

1. If I let the test run, memory keeps growing until it crashes (at about 4.2 GB) in DOMWorkerErrorReporter, called by js_ReportOutOfMemory.

2. If I create a new tab or otherwise muck with the browser (while keeping the test running in the other tab), the GC kicks in and memory goes back down to a few hundred MB (and then starts rising again).

3. After finally getting Shark to give me the profile I wanted, I can see that all four workers spend the majority of their time in JS_YieldRequest and other forms of blocking on the GC (see the request-model sketch after this comment).

4. All this allocation seems to be from stubs::CreateThis (presumably for the V3 vector objects that are created in the render loop).

So point 2 seems to indicate that GC is still getting somewhat blocked.

As for Point 3, will per-compartment GC allow the threads not to block each other?
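
For context on point 3: in the threadsafe JSAPI of this era, each thread wraps its JS activity in a request, and a global GC cannot proceed until every other thread has parked at a yield point. A minimal sketch, assuming a dedicated JSContext per worker; JS_BeginRequest, JS_YieldRequest, and JS_EndRequest are real JSAPI calls of the period, while MoreWorkToDo and RunSomeWorkerJS are hypothetical stand-ins for the worker machinery:

#include "jsapi.h"

// Hypothetical stand-ins for the real worker machinery.
bool MoreWorkToDo();
void RunSomeWorkerJS(JSContext *cx);

void WorkerThreadLoop(JSContext *cx)
{
    JS_BeginRequest(cx);        // this thread may now touch GC things
    while (MoreWorkToDo()) {
        RunSomeWorkerJS(cx);    // one unit of worker work
        // A global GC started on any thread must wait until every
        // other thread parks at a yield point like this one; with four
        // busy workers, this is where the profile shows the time going.
        JS_YieldRequest(cx);
    }
    JS_EndRequest(cx);
}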
(In reply to comment #3)
> So point 2 seems to indicate that GC is still getting somewhat blocked.

This is because the main thread is not running any JS, so we never get a chance to schedule the GC. That suggests posting an event to the main thread when a worker wants a GC, but we need an extra callback to post that event promptly.

> 
> As for Point 3, will per-compartment GC allow the threads not to block each
> other?

No - the proposed implementation would block all the threads in the same way as the current GC does.
(In reply to comment #4) Blocking all threads is a temporary implementation kludge we did to accelerate the FF4 schedule. It will be removed after FF4.
Depends on: 616927
So should we implement a workaround GC-scheduling runnable in the meantime?
I don't understand comment 6. How does that help?
Well, it sounds like memory usage climbs because lots of stuff is happening on workers, where GC is not allowed. The main thread doesn't know that it needs to trigger a GC since it isn't running JS. To work around that, we could modify the GC callback to post a "MaybeGC" runnable when it gets invoked off the main thread; a sketch follows.
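
A minimal sketch of that workaround, assuming the Gecko APIs of this era: JS_SetGCCallback, JS_MaybeGC, NS_DispatchToMainThread, NS_IsMainThread, and nsRunnable are real, while MaybeGCRunnable and GetMainThreadJSContext are hypothetical. A real patch would also throttle the posts so a busy worker doesn't flood the main-thread event queue:

#include "jsapi.h"
#include "nsThreadUtils.h"

// Hypothetical accessor for the main thread's JSContext.
JSContext *GetMainThreadJSContext();

// On the main thread, let the engine GC if its heuristics warrant it.
class MaybeGCRunnable : public nsRunnable
{
public:
    NS_IMETHOD Run()
    {
        if (JSContext *cx = GetMainThreadJSContext())
            JS_MaybeGC(cx);  // no-op unless the heuristics say "collect"
        return NS_OK;
    }
};

// Installed with JS_SetGCCallback(cx, WorkerAwareGCCallback): when a GC
// begins on a non-main (worker) thread, nudge the main thread, which is
// the only place full collections get scheduled.
static JSBool
WorkerAwareGCCallback(JSContext *cx, JSGCStatus status)
{
    if (status == JSGC_BEGIN && !NS_IsMainThread())
        NS_DispatchToMainThread(new MaybeGCRunnable());
    return JS_TRUE;
}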
Softblockers that require a beta are likely no longer in scope for 4.0. Sounds like this needs to move to FF 4.0.x or be made a hardblocker.

I think I hit this bug, or something like it, 3 or 4 times a day (see bug 629703).  The only thing I can do is go get coffee and wait for the pegged CPU to go back down after 4 or 5 minutes.  It's pretty painful.

Anyone come up with other ideas in the last month?
Not a hard blocker; workers are still reasonably rare, so .x sounds good to me. Posting a runnable with MaybeGC sounds pretty good to me too.
blocking2.0: betaN+ → .x
Is there an easy way for me to detect whether I've got a page loaded in one of many tabs that might have web workers running?
This might also affect how much we talk about and evangelize web workers in the 4.0 rollout.  cc blizard.
I am actually out of blockers, so I will jump on this one and ask for approval. I can probably do something that will only trigger with workers, which should be low risk.
I ran this with a TM tip build, fresh browser instance, 70 MB in use. When I start the test case, memory use goes to 600 MB and stays there. After navigating away, memory use drops to 378 MB and stays flat. After navigating around a bit more, it drops to 120 MB.

There are a couple of bugs here:
1. Is 600 MB of memory use really necessary for this test case?
2. Why do we drop only to 378 MB, and why does it stay flat there until I navigate around more and close the tab?

None of these seem like a hard blocker.
Alright, I can reproduce this. I think it's related to the chunk release code, which also explains one of smaug's bugs. Every time we GC, we age the chunks; 3 GCs or so bring the memory use down to 74 MB.
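
For reference, an illustrative sketch of the aging policy being described; the real chunk code in the js/src GC differs in detail, and every name below is hypothetical. An empty chunk is kept cached until it has survived a few GCs, which is why it takes about 3 collections for the memory to drain back down:

#include <vector>

static const unsigned MAX_EMPTY_CHUNK_AGE = 3;

struct Chunk {
    unsigned liveArenas;  // arenas still holding live GC things
    unsigned age;         // consecutive GCs survived while empty
};

void ReleaseChunkToOS(Chunk *c);  // hypothetical: unmap the chunk

// Called at the end of each GC.
void ExpireChunksAfterGC(std::vector<Chunk*> &chunks)
{
    std::vector<Chunk*> keep;
    for (size_t i = 0; i < chunks.size(); i++) {
        Chunk *c = chunks[i];
        if (c->liveArenas != 0) {
            c->age = 0;             // in use again: reset the clock
            keep.push_back(c);
        } else if (++c->age >= MAX_EMPTY_CHUNK_AGE) {
            ReleaseChunkToOS(c);    // old and empty: give it back
        } else {
            keep.push_back(c);      // empty but young: keep it cached
        }
    }
    chunks.swap(keep);
}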
Assignee: igor → gal
I have a patch. This is actually a dup of bug 631733, and I will fix it there. I am leaving this one open for confirmation.

In related news, I noticed that GCs in a debug build take roughly a second for this test case, so that's another reason to leave this open. Our GC performance on this test case blows.
I looked at the test case and the profile. The code really, really wants a generational collector; it creates a ridiculous amount of temporary object garbage. Background finalization might help us hide the bad pause times until we get a generational collector.
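
To illustrate why a generational collector suits this workload (a sketch under my own assumptions, not SpiderMonkey code): short-lived temporaries like those V3 vectors would come from a bump-allocated nursery, and a minor collection in which they have all died is little more than a pointer reset, instead of a full-heap mark and sweep:

#include <cstddef>
#include <cstdint>

// Minimal bump-allocated nursery; every name here is hypothetical.
class Nursery {
    uint8_t *start, *cur, *end;
public:
    Nursery(uint8_t *buf, size_t size)
      : start(buf), cur(buf), end(buf + size) {}

    // Allocation is just a pointer bump: ideal for render-loop temporaries.
    void *allocate(size_t bytes) {
        if (bytes > size_t(end - cur))
            return nullptr;          // nursery full: run a minor GC
        void *p = cur;
        cur += bytes;
        return p;
    }

    // Minor GC: survivors would be copied to the tenured heap (omitted);
    // when everything died young, reclaiming is a single pointer reset.
    void resetAfterMinorGC() { cur = start; }
};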
Bug 654265 looks very similar to this, and has what I think is a simpler testcase. The leak there only happens when CPU load is very high (when the worker runs without pausing).
No longer blocks: mlk2.0
No longer blocks: mlk-fx5+
With bug 649537, this test page behaves quite normally: memory usage stays stable, no crashes.
Whiteboard: [cib-workers][softblocker] → [cib-workers][softblocker][MemShrink:P2]
Depends on: new-web-workers
No longer depends on: 616927, GenerationalGC, 590533
Ben, sounds like this can simply be closed once your new worker stuff lands.
Assignee: gal → bent.mozilla
Totally fixed by bug 649537.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME