Bug 617569 (Closed) · Opened 14 years ago · Closed 13 years ago

High CPU load and memory leak with web workers

Categories

(Core :: JavaScript Engine, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

Status: RESOLVED WORKSFORME
Tracking Status: blocking2.0 --- .x+

People

(Reporter: igor, Assigned: bent.mozilla)


Details

(Keywords: testcase, Whiteboard: [cib-workers][softblocker][MemShrink:P2])

See bug 590533 comment 21.

For the test case, see bug 590533 comment 6 and attachment 493718 (bzip tar archive).
Assignee: general → igor
Whiteboard: [cib-workers]
Whiteboard: [cib-workers] → [cib-workers][softblocker]
Twice now, I ran the testcase on an osx64 debug build, watched it get up to 3 GB, navigated away, watched it stay at 3 GB allocated but then eventually drop back down to 200 MB (though that takes about 10 minutes).  Does this correspond to any worker-related timers that you know of, Ben?
Ben explained that workers get suspended when you navigate away, so it makes sense that they hold their resources until they get kicked out of the cache.  Tools -> Clear History (Everything) causes everything to be released within about 30 seconds.  So that's not the leak.

What does seem to be a leak is that letting this run for a while uses up all available memory and then crashes.
I played with it some more; a few observations:

1. If I let the test run, memory keeps growing until it crashes (at about 4.2 GB) in DOMWorkerErrorReporter, called by js_ReportOutOfMemory.

2. If I create a new tab or otherwise muck with the browser (while keeping the test running in the other tab), the GC kicks in and memory goes back down to a few hundred MB (and then starts rising again).

3. After finally getting Shark to give me the profile I wanted, I can see that all four workers spend the majority of their time in JS_YieldRequest and other forms of blocking on the GC (see the request-model sketch after this comment).

4. All this allocation seems to be from stubs::CreateThis (presumably for the V3 vector objects that are created in the render loop).

So point 2 seems to indicate that GC is still getting somewhat blocked.

As for Point 3, will per-compartment GC allow the threads not to block each other?
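
For context on point 3: in the threadsafe JSAPI of this era, each thread wraps its JS activity in a request, and a global GC cannot proceed until every other thread has parked at a yield point. A minimal sketch, assuming a dedicated JSContext per worker; JS_BeginRequest, JS_YieldRequest, and JS_EndRequest are real JSAPI calls of the period, while MoreWorkToDo and RunSomeWorkerJS are hypothetical stand-ins for the worker machinery:

#include "jsapi.h"

// Hypothetical stand-ins for the real worker machinery.
bool MoreWorkToDo();
void RunSomeWorkerJS(JSContext *cx);

void WorkerThreadLoop(JSContext *cx)
{
    JS_BeginRequest(cx);        // this thread may now touch GC things
    while (MoreWorkToDo()) {
        RunSomeWorkerJS(cx);    // one unit of worker work
        // A global GC started on any thread must wait until every
        // other thread parks at a yield point like this one; with four
        // busy workers, this is where the profile shows the time going.
        JS_YieldRequest(cx);
    }
    JS_EndRequest(cx);
}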
(In reply to comment #3)
> So point 2 seems to indicate that GC is still getting somewhat blocked.

This is because the main thread is not running any JS, so we never get a chance to schedule the GC. That suggests posting an event to the main thread when a worker wants a GC, but we need an extra callback to post that event promptly.

> 
> As for Point 3, will per-compartment GC allow the threads not to block each
> other?

No - the proposed implementation would block all the threads in the same way as the current GC does.
(In reply to comment #4) Blocking all threads is a temporary implementation kludge we did to accelerate the FF4 schedule. It will be removed after FF4.
Depends on: 616927
So should we implement a workaround GC-scheduling runnable in the meantime?
I don't understand comment 6. How does that help?
Well, it sounds like memory usage climbs because lots of stuff is happening on workers, where GC is not allowed. The main thread doesn't know that it needs to trigger a GC since it isn't running JS. To work around that, we could modify the GC callback to post a "MaybeGC" runnable when it gets invoked off the main thread; a sketch follows.
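
A minimal sketch of that workaround, assuming the Gecko APIs of this era: JS_SetGCCallback, JS_MaybeGC, NS_DispatchToMainThread, NS_IsMainThread, and nsRunnable are real, while MaybeGCRunnable and GetMainThreadJSContext are hypothetical. A real patch would also throttle the posts so a busy worker doesn't flood the main-thread event queue:

#include "jsapi.h"
#include "nsThreadUtils.h"

// Hypothetical accessor for the main thread's JSContext.
JSContext *GetMainThreadJSContext();

// On the main thread, let the engine GC if its heuristics warrant it.
class MaybeGCRunnable : public nsRunnable
{
public:
    NS_IMETHOD Run()
    {
        if (JSContext *cx = GetMainThreadJSContext())
            JS_MaybeGC(cx);  // no-op unless the heuristics say "collect"
        return NS_OK;
    }
};

// Installed with JS_SetGCCallback(cx, WorkerAwareGCCallback): when a GC
// begins on a non-main (worker) thread, nudge the main thread, which is
// the only place full collections get scheduled.
static JSBool
WorkerAwareGCCallback(JSContext *cx, JSGCStatus status)
{
    if (status == JSGC_BEGIN && !NS_IsMainThread())
        NS_DispatchToMainThread(new MaybeGCRunnable());
    return JS_TRUE;
}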
Softblockers that require a beta are likely no longer in scope for 4.0. Sounds like this needs to move to FF 4.0.x or be made a hardblocker.

I think I hit this bug, or something like it, 3 or 4 times a day (see bug 629703).  The only thing I can do is go get coffee and wait for the pegged CPU to go back down after 4 or 5 minutes.  It's pretty painful.

Anyone come up with other ideas in the last month?
Not a hard blocker; workers are still reasonably rare, so .x sounds good to me. Posting a runnable with MaybeGC sounds pretty good to me too.
blocking2.0: betaN+ → .x
Is there an easy way for me to detect whether I've got a page loaded in one of many tabs that might have web workers running?
This might also affect how much we talk about and evangelize web workers in the 4.0 rollout.  cc blizard.
I am actually out of blockers, so I will jump on this one and ask for approval. I can probably do something that will only trigger with workers, which should be low risk.
I ran this with a TM tip build, fresh browser instance, 70 MB in use. When I start the test case, memory use goes to 600 MB and stays there. After navigating away, memory use drops to 378 MB and stays flat. After navigating around a bit more, it drops to 120 MB.

There are a couple of bugs here:
1. Is 600 MB of memory use really necessary for this test case?
2. Why do we drop only to 378 MB, and why does it stay flat there until I navigate around more and close the tab?

None of these seem like a hard blocker.
Alright, I can reproduce this. I think it's related to the chunk release code, which also explains one of smaug's bugs. Every time we GC, we age the chunks; 3 GCs or so bring the memory use down to 74 MB.
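
For reference, an illustrative sketch of the aging policy being described; the real chunk code in the js/src GC differs in detail, and every name below is hypothetical. An empty chunk is kept cached until it has survived a few GCs, which is why it takes about 3 collections for the memory to drain back down:

#include <vector>

static const unsigned MAX_EMPTY_CHUNK_AGE = 3;

struct Chunk {
    unsigned liveArenas;  // arenas still holding live GC things
    unsigned age;         // consecutive GCs survived while empty
};

void ReleaseChunkToOS(Chunk *c);  // hypothetical: unmap the chunk

// Called at the end of each GC.
void ExpireChunksAfterGC(std::vector<Chunk*> &chunks)
{
    std::vector<Chunk*> keep;
    for (size_t i = 0; i < chunks.size(); i++) {
        Chunk *c = chunks[i];
        if (c->liveArenas != 0) {
            c->age = 0;             // in use again: reset the clock
            keep.push_back(c);
        } else if (++c->age >= MAX_EMPTY_CHUNK_AGE) {
            ReleaseChunkToOS(c);    // old and empty: give it back
        } else {
            keep.push_back(c);      // empty but young: keep it cached
        }
    }
    chunks.swap(keep);
}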
Assignee: igor → gal
I have a patch. This is actually a dup of bug 631733, and I will fix it there. I am leaving this one open for confirmation.

In related news, I noticed that GCs in a debug build take roughly a second for this test case, so that's another reason to leave this open. Our GC performance on this test case blows.
I looked at the test case and the profile. The code really, really wants a generational collector; it creates a ridiculous amount of temporary object garbage. Background finalization might help us hide the bad pause times until we get a generational collector.
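
To illustrate why a generational collector suits this workload (a sketch under my own assumptions, not SpiderMonkey code): short-lived temporaries like those V3 vectors would come from a bump-allocated nursery, and a minor collection in which they have all died is little more than a pointer reset, instead of a full-heap mark and sweep:

#include <cstddef>
#include <cstdint>

// Minimal bump-allocated nursery; every name here is hypothetical.
class Nursery {
    uint8_t *start, *cur, *end;
public:
    Nursery(uint8_t *buf, size_t size)
      : start(buf), cur(buf), end(buf + size) {}

    // Allocation is just a pointer bump: ideal for render-loop temporaries.
    void *allocate(size_t bytes) {
        if (bytes > size_t(end - cur))
            return nullptr;          // nursery full: run a minor GC
        void *p = cur;
        cur += bytes;
        return p;
    }

    // Minor GC: survivors would be copied to the tenured heap (omitted);
    // when everything died young, reclaiming is a single pointer reset.
    void resetAfterMinorGC() { cur = start; }
};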
Bug 654265 looks very similar to this, and has what I think is a simpler testcase. The leak there only happens when CPU load is very high (when the worker runs without pausing).
No longer blocks: mlk2.0
No longer blocks: mlk-fx5+
With bug 649537, this test page behaves quite normally: memory usage stays stable, no crashes.
Whiteboard: [cib-workers][softblocker] → [cib-workers][softblocker][MemShrink:P2]
Depends on: new-web-workers
No longer depends on: 616927, GenerationalGC, 590533
Ben, sounds like this can simply be closed once your new worker stuff lands.
Assignee: gal → bent.mozilla
Totally fixed by bug 649537.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME