service workers that fail to install due to 404 during Cache.addAll() are leaked forever

RESOLVED FIXED in Firefox 47

Status

defect
RESOLVED FIXED
Opened: 3 years ago
Last modified: 3 years ago

People

(Reporter: bkelly, Assigned: bkelly)

Tracking

Version: 48 Branch
Target Milestone: mozilla49
Points: ---

Firefox Tracking Flags

(firefox46 wontfix, firefox47 fixed, firefox48 fixed, firefox49 fixed, firefox-esr45 wontfix)

Details

(Whiteboard: [MemShrink:P2] btpp-active)

Attachments

(1 attachment, 5 obsolete attachments)

The service worker currently on my blog has a bug that causes it to reject the install event's waitUntil() promise.  I noticed today that when this happens we don't immediately kill the worker thread:

├─────19.65 MB (01.96%) -- workers
│     ├───9.97 MB (00.99%) ++ workers(telegram.org)/worker(js/lib/crypto_worker.js, 0x1ca385af000)
│     ├───6.81 MB (00.68%) ++ workers(chrome)
│     └───2.88 MB (00.29%) -- workers(blog.wanderview.com)
│         ├──1.44 MB (00.14%) ++ worker(https://blog.wanderview.com/sw.js, 0x1ca3d242800)
│         └──1.44 MB (00.14%) ++ worker(https://blog.wanderview.com/sw.js, 0x1cbc3224000)

We should be cleaning these up immediately.  I think they probably get cleaned up eventually by the long timer kill mechanism, but I need to verify that.
It's been over an hour and the workers still show in about:memory.  It seems we leak them forever.
Summary: service workers that fail to install are left alive → service workers that fail to install are leaked forever
Whiteboard: [MemShrink]
This appears to be fixed in Nightly 49.  It reproduces in Aurora 48, though.
I take it back.  This does happen in Nightly, but only in non-e10s mode.  I think it might be an interaction with devtools' about:debugging.
Whiteboard: [MemShrink] → [MemShrink] btpp-active
Whiteboard: [MemShrink] btpp-active → [MemShrink:P2] btpp-active
Turns out this is a bug in PromiseWorkerProxy.  If a proxy is actively holding a worker alive with a WorkerFeature, it will leak the thread forever when you call WorkerPrivate::Terminate().

When I trigger this error I get this in the console:

[3688] WARNING: A runnable was posted to a worker that is already shutting down!: file c:/devel/mozilla-central/dom/workers/WorkerPrivate.cpp, line 2397
[3688] WARNING: Could not dispatch fetch response end: file c:/devel/mozilla-central/dom/fetch/Fetch.cpp, line 355

Here a fetch() is running and holding a PromiseWorkerProxy.  This is most likely due to the Cache.addAll() call being run in the service worker script.  When one of those addAll() requests fails, it rejects the install event's waitUntil() immediately.  The failed install then causes ServiceWorkerPrivate to call WorkerPrivate::Terminate() even though other parts of the addAll() are still running.
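
For illustration, here is a minimal sketch of the kind of install handler that hits this path.  The cache name and asset URLs are hypothetical, not the actual contents of the sw.js on the blog:

  // Hypothetical sw.js illustrating the failure mode described above.  If any
  // URL in the list 404s, cache.addAll() rejects, the install event's
  // waitUntil() promise rejects, and the service worker gets terminated while
  // the remaining fetch() calls are still in flight and still holding their
  // PromiseWorkerProxy objects.
  self.addEventListener('install', event => {
    event.waitUntil(
      caches.open('blog-static-v1').then(cache => {
        return cache.addAll([
          '/index.html',
          '/css/site.css',
          '/js/app.js',
          '/images/missing.png'  // a 404 here rejects addAll() immediately
        ]);
      })
    );
  });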

The result is that the WorkerPrivate gets a status of Terminating.  The PromiseWorkerProxy is still alive, however, and can only be cleaned up on the worker thread.  No WorkerRunnable can be scheduled on the worker thread at this point because of this check:

https://dxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#2395

This means there is no way for the code using the PromiseWorkerProxy to clean it up at this point and the WorkerFeature will keep the thread alive forever.

The solution is simple.  We just make PromiseWorkerProxy clean itself up when the worker status reaches a point where runnables can't be dispatched any more.
Attachment #8748962 - Flags: review?(khuey)
Actually, it seems like it should clean up on Closing to avoid getting into this situation, unless there is some mechanism that guarantees a worker will progress from Closing to Terminating.
Attachment #8748962 - Attachment is obsolete: true
Attachment #8748962 - Flags: review?(khuey)
Attachment #8748967 - Flags: review?(khuey)
An alternative solution is to make ServiceWorkerPrivate use WorkerPrivate::Kill() instead of Terminate().  I'm proposing this patch, though, since the PromiseWorkerProxy leak seems like it would be a problem for non-service-worker code as well.
It seems this might be a problem for these other places in the code as well:

  dom/workers/ScriptLoader.cpp
  dom/indexedDB/IDBTransaction.cpp
  dom/cache/Feature.cpp
  dom/base/WebSocket.cpp
  dom/workers/XMLHttpRequest.cpp
  dom/notification/Notification.cpp

A quick glance suggests that many, if not all, require executing a runnable on the worker thread to release the feature.  For example, IDB needs to receive a PBackground message that executes FireCompleteOrAbortEvents() on the worker thread.

If you think it's appropriate, I can make patches to switch these over to Closing as well.
Or maybe IPC uses control runnables so they can still be dispatched or something.  If that's the case, then maybe I should make Fetch use a control runnable.
I spoke with Kyle on IRC and he indicated that fetch() should be using a control runnable here.  I can't just always use a control runnable, though, because in the non-shutting-down case we don't want to clean up the promise proxy before our other runnable resolves the promise.

So we just fire a control runnable to clean up if the non-control runnable dispatch fails.
Attachment #8748967 - Attachment is obsolete: true
Attachment #8748967 - Flags: review?(khuey)
Attachment #8749212 - Flags: review?(amarchesini)
The ContinueConsumeBodyRunnable does not have this problem because FetchBodyFeature::Notify() cleans things up on the lowest worker Closing status.
Comment on attachment 8749212 [details] [diff] [review]
Make fetch() use a control runnable to cleanup if the worker thread is shutting down. r=baku

Review of attachment 8749212 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/fetch/Fetch.cpp
@@ +301,5 @@
> +class WorkerFetchResponseEndBase
> +{
> +  RefPtr<PromiseWorkerProxy> mPromiseProxy;
> +public:
> +  WorkerFetchResponseEndBase(PromiseWorkerProxy* aPromiseProxy)

explicit

@@ +339,5 @@
> +  Cancel() override
> +  {
> +    // Execute Run anyway to make sure we cleanup our promise proxy to avoid
> +    // leaking the worker thread
> +    Run();

same here. Call WorkerRun() instead.

@@ +362,5 @@
> +    WorkerRunInternal(aWorkerPrivate);
> +    return true;
> +  }
> +
> +  // Control runnable cancel already calls Run().

Correct, it calls Run() but that will not call WorkerRun().
You must call WorkerRunInternal(aWorkerPrivate) by yourself.
Attachment #8749212 - Flags: review?(amarchesini) → review-
Fixed the missing explicit keyword.

I also removed the MOZ_ALWAYS_TRUE() on the control runnable dispatch.  I realized that it can actually fail to dispatch if the worker thread is canceled or killed since that will cause PromiseWorkerProxy to give up its feature early.

As discussed on IRC I left the calls to Run().  They should work since they are called before WorkerRunnable::Cancel(), which sets the flag necessary for IsCanceled() to return true.
Attachment #8749212 - Attachment is obsolete: true
Attachment #8749296 - Flags: review?(amarchesini)
Attachment #8749296 - Flags: review?(amarchesini) → review+
This is a relatively small change and prevents erratic memory leaks, so I think we should uplift it.  Note that I marked it as affecting esr45 since you can trigger this on dedicated workers.
See Also: → 1270594
Summary: service workers that fail to install are leaked forever → service workers that fail to install due to 404 during Cache.addAll() are leaked forever
https://hg.mozilla.org/mozilla-central/rev/d657d71ec3e9
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Comment on attachment 8749332 [details] [diff] [review]
Make fetch() use a control runnable to cleanup if the worker thread is shutting down. r=baku

Approval Request Comment
[Feature/regressing bug #]: Fetch API and Cache API used in workers.  In practice this mostly means service workers, but developers can also use these APIs in normal web workers.
[User impact if declined]: Memory leak.
[Describe test coverage new/current, TreeHerder]: Existing mochitests and wpt tests for regression testing.  It was not possible to write a test for this particular leak, however.
[Risks and why]: Minimal.  This is a very isolated change to the usage of these APIs on workers.
[String/UUID change made/needed]: None.
Attachment #8749332 - Flags: approval-mozilla-beta?
Attachment #8749332 - Flags: approval-mozilla-aurora?
On further reflection I don't think we should uplift this to ESR45.  It's not a security issue, and the common leak involves service workers, which are disabled there.  In theory a site could use the Fetch API or Cache API on a normal web worker and trigger this, but it's highly unlikely.
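
For reference, a minimal sketch of how a plain dedicated worker could exercise the same code path; the file names and URLs are hypothetical:

  // Hypothetical worker.js.  The Cache API and Fetch API are available in
  // dedicated workers too, so pending addAll()/fetch() calls here also hold
  // PromiseWorkerProxy objects.  If the worker is terminated while these
  // requests are still in flight, the same leak could be hit.
  caches.open('worker-cache')
    .then(cache => cache.addAll(['/data/one.json', '/data/two.json']))
    .catch(err => console.error('addAll failed:', err));

  fetch('/data/three.json')
    .then(response => response.json())
    .then(data => postMessage(data))
    .catch(err => console.error('fetch failed:', err));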
Comment on attachment 8749332 [details] [diff] [review]
Make fetch() use a control runnable to cleanup if the worker thread is shutting down. r=baku

Fixing a memory leak in service workers; let's uplift to Aurora 48 and let it bake for a bit before deciding whether to uplift to Beta or not.
Attachment #8749332 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8749332 [details] [diff] [review]
Make fetch() use a control runnable to cleanup if the worker thread is shutting down. r=baku

Baked on Nightly and Aurora for a few days, looks safe to uplift to Beta47.
Attachment #8749332 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I pushed a fix for the beta build bustage:

remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/2991f214d4f4
Blocks: 1271493
Unfortunately, the patch in this bug fixed the leak in my local opt-DEBUG builds, but there is still a leak in opt/pgo builds.  I'm going to keep investigating that in bug 1271493.