Closed Bug 1281040 Opened 3 years ago Closed 3 years ago

service worker is not resumed if debugger window is closed

Categories

(DevTools :: Debugger, defect, P1)

48 Branch
defect

Tracking

(firefox48 wontfix, firefox49 fixed, firefox-esr45 disabled, firefox50 fixed, firefox51 fixed)

RESOLVED FIXED
Firefox 51

People

(Reporter: bkelly, Assigned: ejpbruel)

References

(Blocks 1 open bug)

Details

(Keywords: regression)

Attachments

(3 files, 2 obsolete files)

STR:

1) Register a service worker
2) Open the site to get the service worker running
3) Open the debugger window for the service worker
4) Set a breakpoint in the fetch event handler
5) Reload the site
6) Observe the breakpoint being triggered
7) Close the debugger window while the worker is paused, without clicking continue
8) Observe that the worker is stuck until browser exit

This is a pretty big footgun for people using the debugger feature.
Eddy, can we get a quick fix for this?  It's a really big footgun for people trying to use the worker debugger and seems like it might be relatively easy to fix.
Flags: needinfo?(ejpbruel)
(In reply to Ben Kelly [:bkelly] from comment #1)
> Eddy, can we get a quick fix for this?  It's a really big footgun for people
> trying to use the worker debugger and seems like it might be relatively easy
> to fix.

Ok. Will take a look at this tomorrow.
Flags: needinfo?(ejpbruel)
Assignee: nobody → ejpbruel
Marking this as P1 because bkelly indicated that this is a pretty big footgun for people trying to debug service workers, and it looks like this is low hanging fruit.
Priority: -- → P1
This patch should fix the issue. I only tested it against normal workers, but since service workers use the same code, it should work there as well.
Attachment #8766757 - Flags: review?(jlong)
I think I see sandbox compartments being leaked by devtools when I open and close a worker debugger as well:

│  ├──103 (13.05%) ── [System Principal], [anonymous sandbox] (from: resource://gre/modules/commonjs/toolkit/loader.js -> resource://devtools/shared/builtin-modules.js:183) [103]
│  ├──103 (13.05%) ── [System Principal], [anonymous sandbox] (from: resource://gre/modules/commonjs/toolkit/loader.js -> resource://devtools/shared/builtin-modules.js:209) [103]
│  ├──103 (13.05%) ── [System Principal], [anonymous sandbox] (from: resource://gre/modules/commonjs/toolkit/loader.js -> resource://devtools/shared/builtin-modules.js:220) [103]
│  ├──103 (13.05%) ── [System Principal], [anonymous sandbox] (from: resource://gre/modules/commonjs/toolkit/loader.js -> resource://devtools/shared/builtin-modules.js:227) [103]
│  ├──103 (13.05%) ── [System Principal], Addon-SDK (from: resource://gre/modules/commonjs/toolkit/loader.js:249) [103]
│  ├───21 (02.66%) ── [System Principal], outOfProcessTabChildGlobal [21]
│  └────8 (01.01%) ── [System Principal], [anonymous sandbox] (from: resource://gre/modules/commonjs/toolkit/loader.js -> resource://devtools/server/actors/child-process.js:35) [8]

Should this be fixed by this patch as well?  Or should I write a separate bug?
Flags: needinfo?(ejpbruel)
(In reply to Ben Kelly [:bkelly] from comment #5)
> I think I see sandbox compartments being leaked by devtools when I open and
> close a worker debugger as well:
> [...]
> Should this be fixed by this patch as well?  Or should I write a separate
> bug?

It's very possible that we're still leaking stuff even if we detach from the worker client, so I would file a separate bug for this.

Did you happen to try my patch with a service worker? Did it resolve the issue for you?
Flags: needinfo?(ejpbruel)
(In reply to Eddy Bruel [:ejpbruel] from comment #6)
> Did you happen to try my patch with a service worker? Did it resolve the
> issue for you?

No.  I don't have a build handy.  I'll start one now.
Comment on attachment 8766757 [details] [diff] [review]
Make sure we detach from worker client when target is destroyed.

Review of attachment 8766757 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not too familiar with this area of the code, but this makes sense. Especially since I've frequently seen `this._extraPools is null` errors.
Attachment #8766757 - Flags: review?(jlong) → review+
Unfortunately this patch does not fix the compartment leak.  I filed bug 1283587.
James, can we get this into 50 and uplifted to 48/49 before the next merge?
Flags: needinfo?(jlong)
I already r+ it, eddy can you land it?
Flags: needinfo?(jlong) → needinfo?(ejpbruel)
Looks like it just needs a try push and then we can push
(In reply to James Long (:jlongster) from comment #12)
> Looks like it just needs a try push and then we can push

I was planning on taking a day off tomorrow, but I can try to land this if it means we can make the uplift.
Flags: needinfo?(ejpbruel)
I'm happy to push it then, it's a tiny patch :)
(In reply to James Long (:jlongster) from comment #15)
> I'm happy to push it then, it's a tiny patch :)

Thanks, that would be great. If we can get a green try push for it tonight, you can push it tomorrow.
ni? myself to remind me
Flags: needinfo?(jlong)
There are several very suspicious timeout failures for the worker tests, so it must be related to this. It's failing pretty consistently. I don't think I should push this yet.
Flags: needinfo?(jlong)
Going to look into these test failures. Sorry for the delay; I was on PTO last week.
I figured out the issue. The root of the problem is that calling postMessage on
a WorkerDebugger is inherently racy: when a worker shuts down, there is a small
interval during which no more runnables can be dispatched to the worker, but
the worker debugger has not yet been closed. As a result, calling postMessage on
a WorkerDebugger can fail, even if you check that the WorkerDebugger is not yet
closed.
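The race can be illustrated with a small mock. Note that `MockWorkerDebugger`, `beginShutdown`, and `acceptsRunnables` are made-up names for illustration only, not the real Gecko WorkerDebugger API; the point is just the window where `isClosed` is still false but dispatch already fails:

```javascript
// Illustrative mock of the shutdown race; names are hypothetical.
class MockWorkerDebugger {
  constructor() {
    this.isClosed = false;
    this.acceptsRunnables = true;
  }
  beginShutdown() {
    // The worker stops accepting runnables first...
    this.acceptsRunnables = false;
  }
  finishShutdown() {
    // ...and only later is the debugger marked closed.
    this.isClosed = true;
  }
  postMessage(msg) {
    if (!this.acceptsRunnables) {
      throw new Error("worker is shutting down");
    }
  }
}

const dbg = new MockWorkerDebugger();
dbg.beginShutdown();

// The isClosed check passes, yet postMessage still throws:
let threw = false;
if (!dbg.isClosed) {
  try {
    dbg.postMessage("cleanup");
  } catch (e) {
    threw = true;
  }
}
console.log(`postMessage threw despite isClosed being false: ${threw}`);
```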

When we detach from a worker actor, we close the connection to the corresponding
worker server by closing the transport for that connection. The transport
responds to this by removing itself as a listener on the WorkerDebugger that
represents the connection, and then calling its onClosed hook.

One place where this onClosed hook is used is here:
https://dxr.mozilla.org/mozilla-central/source/devtools/server/main.js?q=server%2Fmain.js%3A843&redirect_type=direct#863

In this hook we send a message to the server requesting it to clean up its end
of the connection. If the WorkerDebugger is already closed, we don’t need to
send this message, but as explained above, this check is actually racy, so the
call to postMessage fails.

When this happens, the resulting exception propagates up all the way to here:
https://dxr.mozilla.org/mozilla-central/source/devtools/server/actors/worker.js#192

This causes us to exit from the call to _detach early, so the WorkerActor is
never fully detached. Detaching from the worker a second time causes the
transport to remove itself as a listener from the WorkerDebugger again. Since it
already removed itself as a listener during our previous attempt, this throws
an error.

The failing tests are set up so that they destroy the toolbox first, and then
close the worker client (note that closing an attached worker client will
automatically detach from it).

Previously, destroying the toolbox did not cause us to detach from the worker
client, but with this patch applied it does. Consequently, these tests try to
detach from the worker client twice. That shouldn’t be a problem, but if the
first attempt to detach fails, due to the race condition above, the second
attempt to detach will cause an error.
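The double-detach failure can be sketched as follows. All names here (`MockDebugger`, `detach`, `failAfterRemove`) are hypothetical stand-ins for the real transport/actor code; the sketch only demonstrates that removing a listener twice throws when the first detach attempt aborted partway:

```javascript
// Hypothetical sketch of why a second detach throws after a failed first one.
class MockDebugger {
  constructor() {
    this.listeners = new Set();
  }
  addListener(l) {
    this.listeners.add(l);
  }
  removeListener(l) {
    if (!this.listeners.has(l)) {
      throw new Error("removeListener: not a registered listener");
    }
    this.listeners.delete(l);
  }
}

function detach(dbg, listener, { failAfterRemove = false } = {}) {
  dbg.removeListener(listener);
  if (failAfterRemove) {
    // Simulates the racy postMessage throwing before detach completes.
    throw new Error("postMessage failed");
  }
}

const mockDbg = new MockDebugger();
const listener = {};
mockDbg.addListener(listener);

// First detach: the listener is removed, but the detach itself fails.
try { detach(mockDbg, listener, { failAfterRemove: true }); } catch (e) {}

// Second detach: removeListener now throws, which is the observed error.
let secondError = null;
try { detach(mockDbg, listener); } catch (e) { secondError = e.message; }
console.log(secondError);
```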

The solution here is to wrap the call to postMessage in the onClosed hook in a
try...catch block. The only time this call can throw an error is if the worker
is already shutting down, in which case we don't care whether the message
arrives anyway.
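A minimal sketch of that guard, assuming a hook shape loosely modeled on the one in devtools/server/main.js (the mock debugger object and the message format here are illustrative, not the real implementation):

```javascript
// Hedged sketch of the fix: swallow postMessage failures in onClosed so a
// shutdown race no longer aborts the detach.
function makeOnClosedHook(dbg, connectionId) {
  return () => {
    if (!dbg.isClosed) {
      try {
        // Ask the worker-side server to clean up its end of the connection.
        dbg.postMessage(JSON.stringify({
          type: "disconnect",
          id: connectionId,
        }));
      } catch (e) {
        // The worker is already shutting down, so the message would have
        // had no receiver anyway; this failure is safe to ignore.
      }
    }
  };
}

// Even when postMessage fails mid-shutdown, the hook completes and the
// rest of the detach can proceed:
const racyDbg = {
  isClosed: false,
  postMessage() { throw new Error("worker is shutting down"); },
};
makeOnClosedHook(racyDbg, "conn1")();
console.log("onClosed completed without throwing");
```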
New patch with test failures addressed. This probably needs to be re-reviewed.
Attachment #8766757 - Attachment is obsolete: true
Attachment #8776954 - Flags: review?(jlong)
Ran into memory leaks on try. I suspect the cause is that we don't wait for the toolbox to be fully closed before shutting down the test. Fixed that issue. Additionally split up the browser_dbg_worker-console test into three smaller ones because I kept running into timeouts.
Attachment #8776954 - Attachment is obsolete: true
Attachment #8776954 - Flags: review?(jlong)
Attachment #8777482 - Flags: review?(jlong)
Comment on attachment 8777482 [details] [diff] [review]
Make sure we detach from worker client when target is destroyed.

Review of attachment 8777482 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine, not a whole lot different from last review. I can't remember if the tests were split up before, but I had a few nits.

::: devtools/client/debugger/test/mochitest/browser_dbg_worker-console-01.js
@@ +3,5 @@
> +
> +var TAB_URL = EXAMPLE_URL + "doc_WorkerActor.attachThread-tab.html";
> +var WORKER_URL = "code_WorkerActor.attachThread-worker.js";
> +
> +function* initWorkerDebugger(TAB_URL, WORKER_URL) {

Why isn't this in head.js?

I guess this code shouldn't be reviewed too much, since you are just splitting the tests up. But this function at least is duplicated 3 times...

::: devtools/client/debugger/test/mochitest/browser_dbg_worker-console-03.js
@@ +1,2 @@
> +// Check to make sure that a worker can be attached to a toolbox
> +// and that the console works.

This comment is the same in all 3 test files, can you go into a little more detail what each test does?

::: devtools/server/main.js
@@ +861,5 @@
>            transport.ready();
>            transport.hooks = {
>              onClosed: () => {
>                if (!dbg.isClosed) {
> +                // If the worker happens to be shutting down, there is a small

Can you explain why the worker would be shutting down sometimes, and not other times?
Attachment #8777482 - Flags: review?(jlong) → review+
This patch is turning out to be harder than I expected :-(

The tests are still leaking, because we fail to destroy the worker toolbox properly. When we try to destroy the toolbox, we hit an error in the webconsole client. There is some asynchronous bug there that causes us to destroy things either twice or in the wrong order. I am not yet sure what the problem is exactly.

I suffered a system crash today that wiped out my entire local git repository, so I probably won't be able to look into this again until Monday.
I think I finally figured out why those tests were failing. Here is another try run with those issues hopefully addressed:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4345ac00752d
Pushed by ejpbruel@mozilla.com:
https://hg.mozilla.org/integration/fx-team/rev/c0e4639e98b0
Make sure we detach from worker client when target is destroyed. r=jlongster
https://hg.mozilla.org/mozilla-central/rev/c0e4639e98b0
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 51
Eddy, do you feel comfortable uplifting this to aurora or beta?
Flags: needinfo?(ejpbruel)
(In reply to Ben Kelly [:bkelly] from comment #30)
> Eddy, do you feel comfortable uplifting this to aurora or beta?

I think we can do that yeah.
Flags: needinfo?(ejpbruel)
Attached patch: Uplift request (Splinter Review)

Approval Request Comment
[Feature/regressing bug #]:
No clear candidate.

[User impact if declined]:
Service workers are not resumed if you close the debugger while they are paused. This is a pretty big footgun for people using the debugger.

[Describe test coverage new/current, TreeHerder]:
browser_dbg_worker-console-01.js
browser_dbg_worker-console-02.js
browser_dbg_worker-window.js

[Risks and why]: 
Applying this patch caused some test failures because it changes the order in which certain events happen during shutdown. Those changes are all well defined, however, and the patch should have addressed all of them.

[String/UUID change made/needed]:
N/A
Attachment #8781563 - Flags: review+
Attachment #8781563 - Flags: approval-mozilla-beta?
Attachment #8781563 - Flags: approval-mozilla-aurora?
Keywords: regression
Comment on attachment 8781563 [details] [diff] [review]
Uplift request

Lots of new automated tests (yay!), stabilized on Nightly for a week, Aurora50+
Attachment #8781563 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8781563 [details] [diff] [review]
Uplift request

We want service workers to keep on working for webdevs. Please uplift for the beta 6 build!
Attachment #8781563 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Needs rebasing for Beta.
Flags: needinfo?(ejpbruel)
This patch should do the trick for beta.
Flags: needinfo?(ejpbruel)
Attachment #8783575 - Flags: checkin?(ryanvm)
Version: unspecified → 48 Branch
Product: Firefox → DevTools