Closed
Bug 1232931
Opened 9 years ago
Closed 7 years ago
Intermittent browser_service_workers_timeout.js | Test timed out
Categories
(DevTools :: about:debugging, defect, P3)
DevTools
about:debugging
Tracking
(firefox46 affected, firefox47 affected)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: ochameau)
References
Details
(Keywords: intermittent-failure)
Attachments
(5 files)
2.07 KB,
patch
|
ochameau
:
review+
|
Details | Diff | Splinter Review |
3.76 KB,
patch
|
janx
:
review+
lizzard
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
3.07 KB,
patch
|
janx
:
review+
lizzard
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
2.50 KB,
patch
|
Details | Diff | Splinter Review | |
5.45 KB,
patch
|
Details | Diff | Splinter Review |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 6•9 years ago
|
||
The issue here is that nsIServiceWorkerManager.getWorkerByID() is throwing when it can't find the worker. The devtools code, though, expects null to be returned instead:
https://dxr.mozilla.org/mozilla-central/source/devtools/server/actors/worker.js#134
https://dxr.mozilla.org/mozilla-central/source/devtools/server/actors/worker.js#67
This just changes the service worker manager code to return null instead of throwing.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=baf31f0fddb5
Assignee | ||
Comment 7•9 years ago
|
||
Comment on attachment 8708385 [details] [diff] [review]
Return null instead of throwing if swm.getWorkerByID() cannot find the worker. r=ochameau
Review of attachment 8708385 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks!
Attachment #8708385 -
Flags: review?(poirot.alex) → review+
Comment 9•9 years ago
|
||
Hmm, I got a failed trigger in try after landing this in comment 8. Seems there is more to fix.
Keywords: leave-open
Comment 10•9 years ago
|
||
Alexander, can you investigate what is going on here? My quick fix does not seem to have solved the issue. Seems to trigger most often on mac 10.6 debug dt6.
Flags: needinfo?(poirot.alex)
Comment hidden (Intermittent Failures Robot) |
Comment 12•9 years ago
|
||
bugherder |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 14•9 years ago
|
||
Comment 15•9 years ago
|
||
I'm not sure the try had the right syntax in comment 14. I added manually triggered the mac 10.6 dt6 job. Let's see if that helps.
Assignee | ||
Comment 16•9 years ago
|
||
Oh! Yes, it looks like mochitest-dt only runs on 10.10 which need explicit flag :/
Oh! I didn't knew about this treeherder option, looks fantastic, if that works!!
Flags: needinfo?(poirot.alex)
Assignee | ||
Comment 17•9 years ago
|
||
Assignee | ||
Comment 18•9 years ago
|
||
It doesn't solve the intermittent, but ease running the tests multiple times.
Attachment #8709427 -
Flags: review?(janx)
Comment 19•9 years ago
|
||
Comment on attachment 8709427 [details] [diff] [review]
Allow running browser_service_workers*.js more than once
Review of attachment 8709427 [details] [diff] [review]:
-----------------------------------------------------------------
Works for me. One nit, also please submit it to try.
::: devtools/client/aboutdebugging/test/browser_service_workers.js
@@ +45,1 @@
> // Use message manager to work with e10s
Nit: The empty line before "// Finally" should probably before "// Use" instead.
Attachment #8709427 -
Flags: review?(janx) → review+
Comment 20•9 years ago
|
||
Going to pass this to Alexander for now since he's actively investigating it.
Assignee: bkelly → poirot.alex
Assignee | ||
Comment 21•9 years ago
|
||
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 23•9 years ago
|
||
Ok, so I finally narrowed it down to WorkerDebugge.initialize.
We first get this warning:
[1074] WARNING: A debugger runnable was posted to a worker that is already shutting down!: file /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/WorkerPrivate.cpp, line 2399
And then, connectToWorker throws on aDbg.initialize() with this exception:
NS_ERROR_FAILURE - nsIWorkerDebugger.initialize
resource://devtools/server/main.js :: DebuggerServer.connectToWorker/< :: line 785
Eddy, any random guess?
Unfortunately, this exception happens intermitently on OSX debug builds on devtools/client/aboutdebugging/test/browser_service_workers_timeout.js
Would you happen to be on OSX?
Flags: needinfo?(ejpbruel)
Assignee | ||
Comment 24•9 years ago
|
||
Note that there is a bunch of warnings happening on the previous test:
07:32:23 INFO - [1074] WARNING: '!registration', file /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/ServiceWorkerManager.cpp, line 999
07:32:23 INFO - [1074] WARNING: 'NS_FAILED(rv)', file /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/ServiceWorkerManager.cpp, line 1205
07:32:23 INFO - [1074] WARNING: 'NS_FAILED(aStatus)', file /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/ServiceWorkerManager.cpp, line 372
07:32:23 INFO - [1074] WARNING: ServiceWorkerJob failed with error: NS_ERROR_DOM_ABORT_ERR: file /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/ServiceWorkerManager.cpp, line 378
Assignee | ||
Comment 25•9 years ago
|
||
Also, in the previous test, there is very few logs, as if it breaks early, but it still reports the test succeed... Pushing a new try run to figure this out.
Comment 26•9 years ago
|
||
(In reply to Alexandre Poirot [:ochameau] from comment #23)
> Ok, so I finally narrowed it down to WorkerDebugge.initialize.
> We first get this warning:
> [1074] WARNING: A debugger runnable was posted to a worker that is already
> shutting down!: file
> /builds/slave/try-m64-d-00000000000000000000/build/src/dom/workers/
> WorkerPrivate.cpp, line 2399
> And then, connectToWorker throws on aDbg.initialize() with this exception:
> NS_ERROR_FAILURE - nsIWorkerDebugger.initialize
> resource://devtools/server/main.js :: DebuggerServer.connectToWorker/< ::
> line 785
>
> Eddy, any random guess?
> Unfortunately, this exception happens intermitently on OSX debug builds on
> devtools/client/aboutdebugging/test/browser_service_workers_timeout.js
> Would you happen to be on OSX?
This looks like a race condition.
Right before it shuts down, a worker unregisters itself with the WorkerDebuggerManager. The WorkerDebuggerManager then closes the WorkerDebugger for that worker. Once the WorkerDebugger is closed, it can no longer be used. Any attempts to do so will result in an NS_ERROR_UNEXPECTED being thrown.
To make sure we don't attempt to use the WorkerDebugger after it is closed, the WorkerDebugger notifies its listeners when it closes, and also sets it isClosed property to true. Closing the WorkerDebugger, notifying the listeners, and setting the isClosed property all happen on the main thread, so these steps are not racy.
I've double checked the implementation of WorkerActor, and it looks like we are correctly using the onClosed handler and isClosed property to make sure we don't access the WorkerDebugger after it is closed.
However, there is a small window where the worker has started to to shut down, and should no longer be processing runnables, but has not unregistered itself with the WorkerDebuggerManager yet. We could maybe unregister the worker sooner (right before it starts to shut down), but that would not solve the problem, because unregistration happens on the parent thread. If the parent thread is not the main thread, we have to send a runnable to the main thread, because the WorkerDebuggerManager can only be accessed there.
While this runnable is underway, we can still interact with the WorkerDebugger, such as trying to initialize it. This will not result in an NS_ERROR_UNEXPECTED, because the WorkerDebugger has not yet been closed. It will, however, result in an NS_ERROR_FAILURE, because the worker is already shutting down.
We either need a way to make the onClosed event non-race prone, but that seems impossible at first glance. If we cannot do that, then the best we can do is add an explicit error check in the code that calls initialize.
Hope that helps!
Updated•9 years ago
|
Flags: needinfo?(ejpbruel)
Assignee | ||
Comment 27•9 years ago
|
||
Assignee | ||
Comment 28•9 years ago
|
||
https://hg.mozilla.org/integration/fx-team/rev/ca345be438249e308318379eca7eec7d6013ae93
Bug 1232931 - Allow running browser_service_workers_timeout.js more than once. r=janx
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 30•9 years ago
|
||
Assignee | ||
Comment 31•9 years ago
|
||
Comment 32•9 years ago
|
||
bugherder |
Assignee | ||
Comment 33•9 years ago
|
||
Last try seems to say it only happens when the two about:debugging test regarding service worker are run.
browser_service_workers.js and browser_service_workers_timeout.js
Both instanciates the same service worker, so I highly suspect the first test to not wait correctly for full service worker cleanup before running the second one.
But with my patch, we are now waiting for registration.unregister() resolution before proceeding.
(And before that patch, we were already waiting for nsIWorkerDebuggerManager's onUnregister callback to be called in the main process)
I don't know what else we could wait for?
This race happens when using e10s... which seems to easily introduce such race.
Comment 34•9 years ago
|
||
(In reply to Alexandre Poirot [:ochameau] from comment #33)
> Last try seems to say it only happens when the two about:debugging test
> regarding service worker are run.
> browser_service_workers.js and browser_service_workers_timeout.js
> Both instanciates the same service worker, so I highly suspect the first
> test to not wait correctly for full service worker cleanup before running
> the second one.
> But with my patch, we are now waiting for registration.unregister()
> resolution before proceeding.
> (And before that patch, we were already waiting for
> nsIWorkerDebuggerManager's onUnregister callback to be called in the main
> process)
> I don't know what else we could wait for?
> This race happens when using e10s... which seems to easily introduce such
> race.
FWIW, The onUnregister callback on nsIWorkerDebuggerManager is called very late during the worker shutdown process: the RuntimeService will have already nulled out the worker's thread, and the WorkerPrivate is no longer safe to be touched. The RuntimeService then schedules the worker for deletion, which will cause a WorkerFinishedRunnable to be dispatched. This runnable will disable the debugger for the WorkerPrivate, which will cause the worker to be unregistered.
Since the worker always has to be unregistered on the main thread (because that's where the WorkerDebuggerManager lives), we always have to dispatch *some* runnable to the main thread to unregister the worker after it has shut down. Because the WorkerDebugger is not closed until after the worker is unregistered, there will always be some window where the worker has already shut down, but not unregistered (since the runnable is still underway).
Assuming that your problem is the one you mentioned in comment 26, i.e. that we're posting a runnable to a worker that has already shut down (I thought that was only supposed to be a warning, not an error), perhaps we should close the WorkerDebugger before the worker starts to shut down? We could at the very least set the isClosed flag without dispatching a runnable (using the WorkerDebugger's mutex to keep it thread safe).
The only problem with that solution would be that it breaks the invariant that the isClosed flag will not be set until after the onClosed notification, which can cause the server to access a WorkerDebugger that has already been closed.
Not sure how helpful any of this is, but there ya go.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 37•9 years ago
|
||
Looks like I finally got it:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fc2edbeb8b71&selectedJob=15864659
We can prevent all these platform exception and toolbox init failure
by waiting for the SW registration promise (navigator.serviceWorker.register(...)) to finish before proceeding.
I forgot about this horrible setTimeout I used.
Waiting nsIWorkerDebuggerManager's onRegister call isn't enough.
That's what we were waiting for in addition to this horrible timeout.
Thanks Eddy for your help here!
Attachment #8711621 -
Flags: review?(janx)
Comment 38•9 years ago
|
||
Comment on attachment 8711621 [details] [diff] [review]
Wait for service worker registration before attaching a toolbox to it
Review of attachment 8711621 [details] [diff] [review]:
-----------------------------------------------------------------
Nice! Looks much cleaner this way, and the all-green try makes me happy. Thanks!
Attachment #8711621 -
Flags: review?(janx) → review+
Assignee | ||
Comment 39•9 years ago
|
||
https://hg.mozilla.org/integration/fx-team/rev/48571bf11be832658eabe82402739b4f42428d98
Bug 1232931: Wait for service worker registration before attaching a toolbox to it. r=janx
Comment 40•9 years ago
|
||
bugherder |
Does this change also resolve the unhandled promise rejection[1] in rows 7 - 16?
https://docs.google.com/spreadsheets/d/1ZKXW9HqnxJNdeEVG7s33_TGSV94PdUJCsQsZe1GGqjY/edit#gid=0
Flags: needinfo?(poirot.alex)
Assignee | ||
Comment 42•9 years ago
|
||
I don't know. I've never seen any of these exceptions so far, except the test timed out.
But I tried on linux, with the promise patch and don't see any exception/failure.
These two patches should have fixed the test timed out failure.
Flags: needinfo?(poirot.alex)
Assignee | ||
Comment 43•9 years ago
|
||
Looks like this intermittent is gone \o/
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1232931&startday=2016-01-18&endday=2016-01-28&tree=all
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 44•9 years ago
|
||
Apparently you shouldn't challenge it to fail again that way.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 46•9 years ago
|
||
Ok. So it changed again. Still failing but at another spot and differently.
We call WorkerActor.onAttach which return {error: 'closed'} as dbg.isClosed is true.
It is important to note that this is on non-e10s runs.
I imagine it is still due to the previous test registering the exact same service worker.
I imagine the previous registration, is being unregistered during previous test and it still in process of cleaning things up while the second test registers a new instance and fails.
Assignee | ||
Comment 47•9 years ago
|
||
Assignee | ||
Comment 48•9 years ago
|
||
Assignee | ||
Comment 49•9 years ago
|
||
I realized we were not waiting for navigator.register() promise in the first test.
It may help...
Assignee | ||
Comment 50•9 years ago
|
||
Using distinct workers is most likely going to help a lot!
Assignee | ||
Comment 51•9 years ago
|
||
Assignee | ||
Comment 52•9 years ago
|
||
Assignee | ||
Comment 53•9 years ago
|
||
Assignee | ||
Comment 54•9 years ago
|
||
Ok. So using two distinct service worker scripts doesn't help.
This is surprising!
Unfortunately, using some log statements in the C++ part seems to prevent the intermittent...
I'll try to get in touch with a MacOS hardware to try to reproduce it locally.
Pushing to try doesn't scale here.
My last findings were that the worker is terminated, ServiceWorkerPrivate::TerminateWorker is being called. It see two WorkerDebugger constructor being called and one is immediately destroyed.
When the test pass, we only get one WorkerDebugger constructor call.
I now suspect that we get two workers being instianciated. One installing and one active, the first being destroyed immediately and used by about:debugging, while the other one stay alive but is ignored by about:debugging.
about:debugging uses nsIWorkerDebuggerManager.getWorkerDebuggerEnumerator(). Does it returns distincts WorkerDebugger instances, one for each worker state: installing/active/... ?
Assignee | ||
Comment 55•9 years ago
|
||
Comment hidden (Intermittent Failures Robot) |
Updated•9 years ago
|
status-firefox47:
--- → affected
Assignee | ||
Comment 57•9 years ago
|
||
Wait. I'm having hard time to reproduce it on try.
Since Feb. 1st, it only reproduce on some try run or aurora.
Wouldn't we just need to uplift??
Or did we disabled osx 10.6 on this date?
Assignee | ||
Comment 58•9 years ago
|
||
A try run with an empty commit on top of mozilla-central:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=21765759ab3d
Comment 59•9 years ago
|
||
(In reply to Alexandre Poirot [:ochameau] from comment #54)
> about:debugging uses nsIWorkerDebuggerManager.getWorkerDebuggerEnumerator().
> Does it returns distincts WorkerDebugger instances, one for each worker
> state: installing/active/... ?
Yes it does. Each service worker registration can have up to three service workers associated with it at any given time: an installing, a waiting, and an active service worker. Each of those service workers runs in its own thread, and therefore gets it's own WorkerPrivate. And since a WorkerDebugger just represents each WorkerPrivate on the main thread, you get multiple WorkerDebuggers as well.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 61•9 years ago
|
||
Comment on attachment 8709427 [details] [diff] [review]
Allow running browser_service_workers*.js more than once
This intermittent doesn't reproduces on mozilla-central:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=21765759ab3d
Whereas we still see occurances on aurora:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1232931&startday=2016-02-08&endday=2016-02-14&tree=all
I think my two js patches really fixed it.
Or at least very drastically reduced it!
This is just test script tweaks.
Attachment #8709427 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 62•9 years ago
|
||
Comment on attachment 8711621 [details] [diff] [review]
Wait for service worker registration before attaching a toolbox to it
Same.
Attachment #8711621 -
Flags: approval-mozilla-aurora?
Comment 63•9 years ago
|
||
Comment on attachment 8709427 [details] [diff] [review]
Allow running browser_service_workers*.js more than once
Test fixes, ok to uplift to aurora. Sounds promising!
Attachment #8709427 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 64•9 years ago
|
||
Comment on attachment 8711621 [details] [diff] [review]
Wait for service worker registration before attaching a toolbox to it
Another test only change, fine to uplift to aurora
Attachment #8711621 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment hidden (Intermittent Failures Robot) |
Comment 66•9 years ago
|
||
Comment 67•9 years ago
|
||
bugherder |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•9 years ago
|
Component: Developer Tools → Developer Tools: about:debugging
Priority: -- → P3
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 75•7 years ago
|
||
Has not failed in 2 years, I think we can close this.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 7 years ago
Resolution: --- → FIXED
Comment 76•7 years ago
|
||
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Updated•6 years ago
|
Product: Firefox → DevTools
You need to log in
before you can comment on or make changes to this bug.
Description
•