Closed Bug 1232931 Opened 4 years ago Closed 2 years ago

Intermittent browser_service_workers_timeout.js | Test timed out

Categories

(DevTools :: about:debugging, defect, P3)

defect

Tracking

(firefox46 affected, firefox47 affected)

RESOLVED FIXED
Tracking Status
firefox46 --- affected
firefox47 --- affected

People

(Reporter: philor, Assigned: ochameau)

References

Details

(Keywords: intermittent-failure)

Attachments

(5 files)

The issue here is that nsIServiceWorkerManager.getWorkerByID() is throwing when it can't find the worker.  The devtools code, though, expects null to be returned instead:

  https://dxr.mozilla.org/mozilla-central/source/devtools/server/actors/worker.js#134
  https://dxr.mozilla.org/mozilla-central/source/devtools/server/actors/worker.js#67

This just changes the service worker manager code to return null instead of throwing.

  https://treeherder.mozilla.org/#/jobs?repo=try&revision=baf31f0fddb5
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Attachment #8708385 - Flags: review?(poirot.alex)
Comment on attachment 8708385 [details] [diff] [review]
Return null instead of throwing if swm.getWorkerByID() cannot find the worker. r=ochameau

Review of attachment 8708385 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8708385 - Flags: review?(poirot.alex) → review+
Hmm, I got a failed trigger in try after landing this in comment 8.  Seems there is more to fix.
Keywords: leave-open
Alexander, can you investigate what is going on here?  My quick fix does not seem to have solved the issue.  Seems to trigger most often on mac 10.6 debug dt6.
Flags: needinfo?(poirot.alex)
I'm not sure the try had the right syntax in comment 14.  I added manually triggered the mac 10.6 dt6 job.  Let's see if that helps.
Oh! Yes, it looks like mochitest-dt only runs on 10.10 which need explicit flag :/
Oh! I didn't knew about this treeherder option, looks fantastic, if that works!!
Flags: needinfo?(poirot.alex)
It doesn't solve the intermittent, but ease running the tests multiple times.
Attachment #8709427 - Flags: review?(janx)
Comment on attachment 8709427 [details] [diff] [review]
Allow running browser_service_workers*.js more than once

Review of attachment 8709427 [details] [diff] [review]:
-----------------------------------------------------------------

Works for me. One nit, also please submit it to try.

::: devtools/client/aboutdebugging/test/browser_service_workers.js
@@ +45,1 @@
>    // Use message manager to work with e10s

Nit: The empty line before "// Finally" should probably before "// Use" instead.
Attachment #8709427 - Flags: review?(janx) → review+
Going to pass this to Alexander for now since he's actively investigating it.
Assignee: bkelly → poirot.alex
Ok, so I finally narrowed it down to WorkerDebugge.initialize.
We first get this warning:
  [1074] WARNING: A debugger runnable was posted to a worker that is already shutting down!: file /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/WorkerPrivate.cpp, line 2399
And then, connectToWorker throws on aDbg.initialize() with this exception:
  NS_ERROR_FAILURE - nsIWorkerDebugger.initialize
  resource://devtools/server/main.js :: DebuggerServer.connectToWorker/< :: line 785

Eddy, any random guess?
Unfortunately, this exception happens intermitently on OSX debug builds on devtools/client/aboutdebugging/test/browser_service_workers_timeout.js
Would you happen to be on OSX?
Flags: needinfo?(ejpbruel)
Note that there is a bunch of warnings happening on the previous test:
07:32:23     INFO -  [1074] WARNING: '!registration', file /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/ServiceWorkerManager.cpp, line 999
07:32:23     INFO -  [1074] WARNING: 'NS_FAILED(rv)', file /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/ServiceWorkerManager.cpp, line 1205
07:32:23     INFO -  [1074] WARNING: 'NS_FAILED(aStatus)', file /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/ServiceWorkerManager.cpp, line 372
07:32:23     INFO -  [1074] WARNING: ServiceWorkerJob failed with error: NS_ERROR_DOM_ABORT_ERR: file /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/ServiceWorkerManager.cpp, line 378
Also, in the previous test, there is very few logs, as if it breaks early, but it still reports the test succeed... Pushing a new try run to figure this out.
(In reply to Alexandre Poirot [:ochameau] from comment #23)
> Ok, so I finally narrowed it down to WorkerDebugge.initialize.
> We first get this warning:
>   [1074] WARNING: A debugger runnable was posted to a worker that is already
> shutting down!: file
> /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/
> WorkerPrivate.cpp, line 2399
> And then, connectToWorker throws on aDbg.initialize() with this exception:
>   NS_ERROR_FAILURE - nsIWorkerDebugger.initialize
>   resource://devtools/server/main.js :: DebuggerServer.connectToWorker/< ::
> line 785
> 
> Eddy, any random guess?
> Unfortunately, this exception happens intermitently on OSX debug builds on
> devtools/client/aboutdebugging/test/browser_service_workers_timeout.js
> Would you happen to be on OSX?

This looks like a race condition.

Right before it shuts down, a worker unregisters itself with the WorkerDebuggerManager. The WorkerDebuggerManager then closes the WorkerDebugger for that worker. Once the WorkerDebugger is closed, it can no longer be used. Any attempts to do so will result in an NS_ERROR_UNEXPECTED being thrown.

To make sure we don't attempt to use the WorkerDebugger after it is closed, the WorkerDebugger notifies its listeners when it closes, and also sets it isClosed property to true. Closing the WorkerDebugger, notifying the listeners, and setting the isClosed property all happen on the main thread, so these steps are not racy.

I've double checked the implementation of WorkerActor, and it looks like we are correctly using the onClosed handler and isClosed property to make sure we don't access the WorkerDebugger after it is closed.

However, there is a small window where the worker has started to to shut down, and should no longer be processing runnables, but has not unregistered itself with the WorkerDebuggerManager yet. We could maybe unregister the worker sooner (right before it starts to shut down), but that would not solve the problem, because unregistration happens on the parent thread. If the parent thread is not the main thread, we have to send a runnable to the main thread, because the WorkerDebuggerManager can only be accessed there.

While this runnable is underway, we can still interact with the WorkerDebugger, such as trying to initialize it. This will not result in an NS_ERROR_UNEXPECTED, because the WorkerDebugger has not yet been closed. It will, however, result in an NS_ERROR_FAILURE, because the worker is already shutting down.

We either need a way to make the onClosed event non-race prone, but that seems impossible at first glance. If we cannot do that, then the best we can do is add an explicit error check in the code that calls initialize.

Hope that helps!
Flags: needinfo?(ejpbruel)
https://hg.mozilla.org/integration/fx-team/rev/ca345be438249e308318379eca7eec7d6013ae93
Bug 1232931 - Allow running browser_service_workers_timeout.js more than once. r=janx
Blocks: 1212797
Last try seems to say it only happens when the two about:debugging test regarding service worker are run.
browser_service_workers.js and browser_service_workers_timeout.js
Both instanciates the same service worker, so I highly suspect the first test to not wait correctly for full service worker cleanup before running the second one.
But with my patch, we are now waiting for registration.unregister() resolution before proceeding.
(And before that patch, we were already waiting for nsIWorkerDebuggerManager's onUnregister callback to be called in the main process)
I don't know what else we could wait for?
This race happens when using e10s... which seems to easily introduce such race.
(In reply to Alexandre Poirot [:ochameau] from comment #33)
> Last try seems to say it only happens when the two about:debugging test
> regarding service worker are run.
> browser_service_workers.js and browser_service_workers_timeout.js
> Both instanciates the same service worker, so I highly suspect the first
> test to not wait correctly for full service worker cleanup before running
> the second one.
> But with my patch, we are now waiting for registration.unregister()
> resolution before proceeding.
> (And before that patch, we were already waiting for
> nsIWorkerDebuggerManager's onUnregister callback to be called in the main
> process)
> I don't know what else we could wait for?
> This race happens when using e10s... which seems to easily introduce such
> race.

FWIW, The onUnregister callback on nsIWorkerDebuggerManager is called very late during the worker shutdown process: the RuntimeService will have already nulled out the worker's thread, and the WorkerPrivate is no longer safe to be touched. The RuntimeService then schedules the worker for deletion, which will cause a WorkerFinishedRunnable to be dispatched. This runnable will disable the debugger for the WorkerPrivate, which will cause the worker to be unregistered.

Since the worker always has to be unregistered on the main thread (because that's where the WorkerDebuggerManager lives), we always have to dispatch *some* runnable to the main thread to unregister the worker after it has shut down. Because the WorkerDebugger is not closed until after the worker is unregistered, there will always be some window where the worker has already shut down, but not unregistered (since the runnable is still underway).

Assuming that your problem is the one you mentioned in comment 26, i.e. that we're posting a runnable to a worker that has already shut down (I thought that was only supposed to be a warning, not an error), perhaps we should close the WorkerDebugger before the worker starts to shut down? We could at the very least set the isClosed flag without dispatching a runnable (using the WorkerDebugger's mutex to keep it thread safe).

The only problem with that solution would be that it breaks the invariant that the isClosed flag will not be set until after the onClosed notification, which can cause the server to access a WorkerDebugger that has already been closed.

Not sure how helpful any of this is, but there ya go.
Looks like I finally got it:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=fc2edbeb8b71&selectedJob=15864659

We can prevent all these platform exception and toolbox init failure
by waiting for the SW registration promise (navigator.serviceWorker.register(...)) to finish before proceeding.
I forgot about this horrible setTimeout I used.
Waiting nsIWorkerDebuggerManager's onRegister call isn't enough.
That's what we were waiting for in addition to this horrible timeout.

Thanks Eddy for your help here!
Attachment #8711621 - Flags: review?(janx)
Comment on attachment 8711621 [details] [diff] [review]
Wait for service worker registration before attaching a toolbox to it

Review of attachment 8711621 [details] [diff] [review]:
-----------------------------------------------------------------

Nice! Looks much cleaner this way, and the all-green try makes me happy. Thanks!
Attachment #8711621 - Flags: review?(janx) → review+
https://hg.mozilla.org/integration/fx-team/rev/48571bf11be832658eabe82402739b4f42428d98
Bug 1232931: Wait for service worker registration before attaching a toolbox to it. r=janx
Does this change also resolve the unhandled promise rejection[1] in rows 7 - 16?

https://docs.google.com/spreadsheets/d/1ZKXW9HqnxJNdeEVG7s33_TGSV94PdUJCsQsZe1GGqjY/edit#gid=0
Flags: needinfo?(poirot.alex)
I don't know. I've never seen any of these exceptions so far, except the test timed out.
But I tried on linux, with the promise patch and don't see any exception/failure.
These two patches should have fixed the test timed out failure.
Flags: needinfo?(poirot.alex)
Looks like this intermittent is gone \o/

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1232931&startday=2016-01-18&endday=2016-01-28&tree=all
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Apparently you shouldn't challenge it to fail again that way.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Ok. So it changed again. Still failing but at another spot and differently.
We call WorkerActor.onAttach which return {error: 'closed'} as dbg.isClosed is true.
It is important to note that this is on non-e10s runs.
I imagine it is still due to the previous test registering the exact same service worker.
I imagine the previous registration, is being unregistered during previous test and it still in process of cleaning things up while the second test registers a new instance and fails.
I realized we were not waiting for navigator.register() promise in the first test.
It may help...
Using distinct workers is most likely going to help a lot!
Ok. So using two distinct service worker scripts doesn't help.
This is surprising!
Unfortunately, using some log statements in the C++ part seems to prevent the intermittent...
I'll try to get in touch with a MacOS hardware to try to reproduce it locally.
Pushing to try doesn't scale here.

My last findings were that the worker is terminated, ServiceWorkerPrivate::TerminateWorker is being called. It see two WorkerDebugger constructor being called and one is immediately destroyed.
When the test pass, we only get one WorkerDebugger constructor call.
I now suspect that we get two workers being instianciated. One installing and one active, the first being destroyed immediately and used by about:debugging, while the other one stay alive but is ignored by about:debugging.

about:debugging uses nsIWorkerDebuggerManager.getWorkerDebuggerEnumerator(). Does it returns distincts WorkerDebugger instances, one for each worker state: installing/active/... ?
Wait. I'm having hard time to reproduce it on try.
Since Feb. 1st, it only reproduce on some try run or aurora.
Wouldn't we just need to uplift??
Or did we disabled osx 10.6 on this date?
A try run with an empty commit on top of mozilla-central:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=21765759ab3d
(In reply to Alexandre Poirot [:ochameau] from comment #54)
> about:debugging uses nsIWorkerDebuggerManager.getWorkerDebuggerEnumerator().
> Does it returns distincts WorkerDebugger instances, one for each worker
> state: installing/active/... ?

Yes it does. Each service worker registration can have up to three service workers associated with it at any given time: an installing, a waiting, and an active service worker. Each of those service workers runs in its own thread, and therefore gets it's own WorkerPrivate. And since a WorkerDebugger just represents each WorkerPrivate on the main thread, you get multiple WorkerDebuggers as well.
Comment on attachment 8709427 [details] [diff] [review]
Allow running browser_service_workers*.js more than once

This intermittent doesn't reproduces on mozilla-central:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=21765759ab3d
Whereas we still see occurances on aurora:
  https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1232931&startday=2016-02-08&endday=2016-02-14&tree=all

I think my two js patches really fixed it.
Or at least very drastically reduced it!

This is just test script tweaks.
Attachment #8709427 - Flags: approval-mozilla-aurora?
Comment on attachment 8711621 [details] [diff] [review]
Wait for service worker registration before attaching a toolbox to it

Same.
Attachment #8711621 - Flags: approval-mozilla-aurora?
Comment on attachment 8709427 [details] [diff] [review]
Allow running browser_service_workers*.js more than once

Test fixes, ok to uplift to aurora. Sounds promising!
Attachment #8709427 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8711621 [details] [diff] [review]
Wait for service worker registration before attaching a toolbox to it

Another test only change, fine to uplift to aurora
Attachment #8711621 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Component: Developer Tools → Developer Tools: about:debugging
Priority: -- → P3
Has not failed in 2 years, I think we can close this.
Status: REOPENED → RESOLVED
Closed: 4 years ago2 years ago
Resolution: --- → FIXED
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Product: Firefox → DevTools
You need to log in before you can comment on or make changes to this bug.