Closed Bug 1386613 Opened 7 years ago Closed 7 years ago

Fix race conditions in about:debugging tests

Categories

(DevTools :: about:debugging, enhancement, P2)

enhancement

Tracking

(firefox59 fixed)

RESOLVED FIXED
Firefox 59
Tracking Status
firefox59 --- fixed

People

(Reporter: jdescottes, Assigned: jdescottes)

References

Details

Attachments

(3 files)

From Bug 1386412: (In reply to Alexandre Poirot [:ochameau] from comment #4) > Comment on attachment 8892832 [details] > Bug 1386412 - fix race conditions in > browser_service_workers_multi_content_process.js; > > https://reviewboard.mozilla.org/r/163830/#review169184 > > Thanks for chasing this intermittent. > > It looks like there migh be other suspicious usages of waitForMutation: > http://searchfox.org/mozilla-central/source/devtools/client/aboutdebugging/ > test/browser_page_not_found.js#24 > http://searchfox.org/mozilla-central/source/devtools/client/aboutdebugging/ > test/browser_service_workers_push_service.js#97 > http://searchfox.org/mozilla-central/source/devtools/client/aboutdebugging/ > test/browser_service_workers_status.js#46 > http://searchfox.org/mozilla-central/source/devtools/client/aboutdebugging/ > test/browser_service_workers_timeout.js#66 > > That makes me wonder if we should warn from eslint usages of "yield > waitForXXXX"... > but unfortunately, some usages are valid. > > ::: > devtools/client/aboutdebugging/test/ > browser_service_workers_multi_content_process.js:30 > (Diff revision 1) > > + let onMutation = waitForMutation(serviceWorkersElement, { childList: true }); > > + > > + let swTab = yield addTab(TAB_URL, { background: true }); > > > > - yield waitForMutation(serviceWorkersElement, { childList: true }); > > + info("Wait for service worker to appear in the list"); > > + yield onMutation; > > If that's what caused this intermittent, it looks like there is other > potential intermittents: > http://searchfox.org/mozilla-central/source/devtools/client/aboutdebugging/ > test/browser_service_workers.js#20 > http://searchfox.org/mozilla-central/source/devtools/client/aboutdebugging/ > test/browser_service_workers_fetch_flag.js#18 > http://searchfox.org/mozilla-central/source/devtools/client/aboutdebugging/ > test/browser_service_workers_timeout.js#23 I actually think that we should stop relying on waitForMutation in about:debugging tests in favor of waitUntil (checking that the expected DOM structure appears/disappears). Several reasons: - waitForMutation exposes us to race conditions - waitForMutation not only asserts the DOM but also how it's generated. This doesn't seem wise since we rely on React for about:debugging's UI and therefore do not generate the DOM ourselves. We previously had to update our tests when upgrading react because the framework was no longer recreating an element in order to update an attribute.
Severity: normal → enhancement
Priority: -- → P3
Blocks: 1419771
Let's address this as it's blocking a worker refactor on Bug 1419771
Assignee: nobody → jdescottes
Status: NEW → ASSIGNED
Priority: P3 → P2
With the patches attached here, the about:debugging tests should be much more stable. When testing on top of Bug 1419771, I still get intermittent failures for browser_service_workers_push.js though. See https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8f0f1e618580b7ef6a2e0b78068023fb6b9ac9a
That test looks a little race-prone; one of the logs is showing: SW not subscribed to push: AbortError: Error retrieving push subscription. Which suggests the permission addition at https://searchfox.org/mozilla-central/source/devtools/client/aboutdebugging/test/service-workers/push-sw.html#10 is racing the subsequent call to invoke subscribe. Other push tests seem to instead use an async helper like https://searchfox.org/mozilla-central/source/dom/push/test/test_utils.js#165 like https://searchfox.org/mozilla-central/source/dom/push/test/test_data.html#46 does that helps ensure the permission has been propagated before doing race-y things. (Also, that test waits for activation which is nice, but shouldn't be required for this test since the SW skips waiting and claims the existing page to make it controlled.)
Thanks :asuth, I've fixed the test based on your suggestion. The intermittent remains however. I rarely reproduce locally, so I'm trying to narrow it down as much as possible. I'll post more details later.
I'm still investigating the intermittent when based on top of :baku's patches from Bug 1419771, but I think we should go ahead and review what I have already. If nothing else it makes the existing about:debugging tests much more stable. Here is a try run on all platforms, with aboutdebugging tests retriggered 10 times: https://treeherder.mozilla.org/#/jobs?repo=try&revision=870036fc6e99698a76948ada6c4c9f87685d2dc7 (have a similar one already green, but I missed some platforms: https://treeherder.mozilla.org/#/jobs?repo=try&revision=22c7a5a98c8760162503c77f7adaf2eb2fe05c63 + did an additional nit change after submitting it)
Comment on attachment 8934700 [details] Bug 1386613 - stop using waitForMutations in aboutdebugging tests (serviceworkers); https://reviewboard.mozilla.org/r/205594/#review212056 Looks good overall, but instead of passing 100ms everywhere in all calls to waitUntil, I think it would be better to change the default value and remove all the `, 100`. Feel free to followup on this as it may have an impact on other tests. ::: devtools/client/aboutdebugging/test/browser_service_workers_multi_content_process.js:56 (Diff revision 3) > yield pushPref("dom.ipc.processCount", 2); > yield onWarningRestored; > + > + // Update the reference to the debugButton, as the previous DOM element might have been > + // deleted. > + debugButton = getDebugButton(serviceWorkerContainer); That's surprising. Was that working before? Or we were just asserting against a detached element? Is it react re-creating the button everytime or are we doing something special around it? (I would expect it only to be enabled/disabled) ::: devtools/client/aboutdebugging/test/browser_service_workers_push_service.js:67 (Diff revision 3) > > - // Wait for the service-workers list to update. > - yield onMutation; > + info("Wait until the service worker appears in about:debugging"); > + yield waitUntilServiceWorkerContainer(SERVICE_WORKER, document); > > // Check that the service worker appears in the UI. > assertHasTarget(true, document, "service-workers", SERVICE_WORKER); Wait, do I get comment 14 right? Are you saying this kind of test still fails? It looks like assertHasTarget is redundant with waitUntilServiceWorkerContainer. The only difference is that assertHasTarget isn't a "waitUntil" like function and assert immediately. So may be waitUntilServiceWorkerContainer times out, but it would be quite surprising if assertHasTarget fails. At the end, I think you can get rid of all assertHasTarget following waitUntilServiceWorkerContainer calls. (there is some other occurences) ::: devtools/client/aboutdebugging/test/browser_service_workers_start.js (Diff revision 3) > - yield waitForMutation(targetElement, { childList: true }); > - } else { > - // Check that there is no Debug button when the SW is already shut down. > - ok(!targetElement.querySelector(".debug-button"), "No debug button when " + > - "the worker is already killed"); > - } Should we keep asserting that the .debug-button isn't in DOM once the .start-button is? ::: devtools/client/aboutdebugging/test/service-workers/push-sw.html:10 (Diff revision 3) > <title>Service worker push test</title> > </head> > <body> > <script type="text/javascript"> > "use strict"; > -SpecialPowers.addPermission("desktop-notification", true, document); > +// The registration promise is expected as a global by unregisterServiceWorker. s/by unregisterServiceWorker/by head.js's unregisterServiceWorker/ ::: devtools/client/aboutdebugging/test/service-workers/push-sw.html:11 (Diff revision 3) > </head> > <body> > <script type="text/javascript"> > "use strict"; > -SpecialPowers.addPermission("desktop-notification", true, document); > -var sw = navigator.serviceWorker.register("push-sw.js"); > +// The registration promise is expected as a global by unregisterServiceWorker. > +var sw = null; This looks racy. Otherwise, used like this, there is no point in using a promise. `sw` could directly be `registration` and not the promise of it. var sw = (async function () { await new Promise( ...pushPermission... ); try { let registration = ...register(push-sw.js); ... } catch(e) {} return registration; })();
Attachment #8934700 - Flags: review?(poirot.alex) → review+
Comment on attachment 8934701 [details] Bug 1386613 - stop using waitForMutations in aboutdebugging tests (addons); https://reviewboard.mozilla.org/r/205596/#review212078 ::: devtools/client/aboutdebugging/test/head.js:262 (Diff revision 3) > + > +function getAddonCount(document) { > + const addonListContainer = getAddonList(document); > + let addonElements = addonListContainer.querySelectorAll(".target"); > + // TODO: jdescottes: Why return -1 here? > + return addonElements ? addonElements.length : -1; You should drop this comment or write a realy one. We don't have to return -1, we can return 0 as well if that makes it easier to follow. It just need to be not strictly greater than 0.
Attachment #8934701 - Flags: review?(poirot.alex) → review+
Comment on attachment 8934702 [details] Bug 1386613 - stop using waitForMutations in aboutdebugging tests (remaining tests); https://reviewboard.mozilla.org/r/205598/#review212096
Attachment #8934702 - Flags: review?(poirot.alex) → review+
See Also: → 1424088
Thanks for the reviews! (In reply to Alexandre Poirot [:ochameau] from comment #15) > Comment on attachment 8934700 [details] > Bug 1386613 - stop using waitForMutations in aboutdebugging tests > (serviceworkers); > > https://reviewboard.mozilla.org/r/205594/#review212056 > > Looks good overall, but instead of passing 100ms everywhere in all calls to > waitUntil, > I think it would be better to change the default value and remove all the `, > 100`. > Feel free to followup on this as it may have an impact on other tests. > Ok will follow up in Bug 1424088 > ::: > devtools/client/aboutdebugging/test/ > browser_service_workers_multi_content_process.js:56 > (Diff revision 3) > > yield pushPref("dom.ipc.processCount", 2); > > yield onWarningRestored; > > + > > + // Update the reference to the debugButton, as the previous DOM element might have been > > + // deleted. > > + debugButton = getDebugButton(serviceWorkerContainer); > > That's surprising. Was that working before? Or we were just asserting > against a detached element? > Is it react re-creating the button everytime or are we doing something > special around it? (I would expect it only to be enabled/disabled) > This test is intermittent today (albeit with a low frequency of failures). But in my tests the instance of debug button initially captured was removed from the DOM tree. I suppose that the previous mutation-based test was running faster, and possibly checked the button before a re-render of the whole list of targets? This behavior was also intermittent when I tested, so that means that normally the button is only disabled/enabled, but from time to time, the DOM for the button was destroyed. > ::: > devtools/client/aboutdebugging/test/browser_service_workers_push_service.js: > 67 > (Diff revision 3) > > > > - // Wait for the service-workers list to update. > > - yield onMutation; > > + info("Wait until the service worker appears in about:debugging"); > > + yield waitUntilServiceWorkerContainer(SERVICE_WORKER, document); > > > > // Check that the service worker appears in the UI. > > assertHasTarget(true, document, "service-workers", SERVICE_WORKER); > > Wait, do I get comment 14 right? > Are you saying this kind of test still fails? > It looks like assertHasTarget is redundant with > waitUntilServiceWorkerContainer. > The only difference is that assertHasTarget isn't a "waitUntil" like > function and assert immediately. It is still intermittent (when using the patches from the other bug) but not for anything like that. The failures I get now are because the click on the "push" button don't trigger the push listener in the service worker. I tracked down the push until ServiceWorkerPrivate.cpp and everything seems successful, I just don't know why the service worker doesn't receive it. But the test itself seems robust. > > So may be waitUntilServiceWorkerContainer times out, > but it would be quite surprising if assertHasTarget fails. > > At the end, I think you can get rid of all assertHasTarget following > waitUntilServiceWorkerContainer calls. (there is some other occurences) > > ::: devtools/client/aboutdebugging/test/browser_service_workers_start.js > (Diff revision 3) > > - yield waitForMutation(targetElement, { childList: true }); > > - } else { > > - // Check that there is no Debug button when the SW is already shut down. > > - ok(!targetElement.querySelector(".debug-button"), "No debug button when " + > > - "the worker is already killed"); > > - } > > Should we keep asserting that the .debug-button isn't in DOM once the > .start-button is? > > ::: devtools/client/aboutdebugging/test/service-workers/push-sw.html:10 > (Diff revision 3) > > <title>Service worker push test</title> > > </head> > > <body> > > <script type="text/javascript"> > > "use strict"; > > -SpecialPowers.addPermission("desktop-notification", true, document); > > +// The registration promise is expected as a global by unregisterServiceWorker. > > s/by unregisterServiceWorker/by head.js's unregisterServiceWorker/ > > ::: devtools/client/aboutdebugging/test/service-workers/push-sw.html:11 > (Diff revision 3) > > </head> > > <body> > > <script type="text/javascript"> > > "use strict"; > > -SpecialPowers.addPermission("desktop-notification", true, document); > > -var sw = navigator.serviceWorker.register("push-sw.js"); > > +// The registration promise is expected as a global by unregisterServiceWorker. > > +var sw = null; > > This looks racy. Which part exactly? If the registration/push subscription is still racy that could help me track down the remaining intermittent. > Otherwise, used like this, there is no point in using a promise. `sw` could > directly be `registration` and not the promise of it. That's correct, I mostly kept it as a promise to stick to the old implementation as much as possible. I can update that. > > var sw = (async function () { > await new Promise( ...pushPermission... ); > try { > let registration = ...register(push-sw.js); > ... > } catch(e) {} > return registration; > })(); (In reply to Alexandre Poirot [:ochameau] from comment #16) > Comment on attachment 8934701 [details] > Bug 1386613 - stop using waitForMutations in aboutdebugging tests (addons); > > https://reviewboard.mozilla.org/r/205596/#review212078 > > ::: devtools/client/aboutdebugging/test/head.js:262 > (Diff revision 3) > > + > > +function getAddonCount(document) { > > + const addonListContainer = getAddonList(document); > > + let addonElements = addonListContainer.querySelectorAll(".target"); > > + // TODO: jdescottes: Why return -1 here? > > + return addonElements ? addonElements.length : -1; > > You should drop this comment or write a realy one. > We don't have to return -1, we can return 0 as well if that makes it easier > to follow. > It just need to be not strictly greater than 0. Ah thanks for catching that! I was just puzzled by this because I don't even see how addonElements could be falsy. querySelectorAll always returns a NodeList, even if there's no match.
(In reply to Julian Descottes [:jdescottes][:julian] from comment #18) > Ok will follow up in Bug 1424088 Thanks! > This test is intermittent today (albeit with a low frequency of failures). > But in my tests the instance of debug button initially captured was removed > from the DOM tree. I suppose that the previous mutation-based test was > running faster, and possibly checked the button before a re-render of the > whole list of targets? This behavior was also intermittent when I tested, so > that means that normally the button is only disabled/enabled, but from time > to time, the DOM for the button was destroyed. Ok, so it doesn't look like weird react behavior but more us updating the page in an unexpected way as if we were updating the current panel/page on about:debugging. Or having the worker being removed and re-added... That may be something to keep in mind for other intermittents. > It is still intermittent (when using the patches from the other bug) but not > for anything like that. The failures I get now are because the click on the > "push" button don't trigger the push listener in the service worker. I > tracked down the push until ServiceWorkerPrivate.cpp and everything seems > successful, I just don't know why the service worker doesn't receive it. Ah ok, may be bkelly can help track what's in ServiceWorkerPrivate... I also already looked into this some time ago for the netmonitor, so you can elaborate here if you find anything suspect. > > This looks racy. > > Which part exactly? If the registration/push subscription is still racy that > could help me track down the remaining intermittent. It is about unregisterServiceWorker and sw variable only. It is most likely not the race you are chasing. Because the mochitest script waits for SW registration in a previous test step. If you ignore the browser_*.js script and only look at unregisterServiceWorker and push-sw.html, your patch looks racy. If unregisterServiceWorker is called during push-sw.html load, it may try to access `sw` while it is still null. This comment is close to be a nit.
(In reply to Alexandre Poirot [:ochameau] from comment #19) > > It is still intermittent (when using the patches from the other bug) but not > > for anything like that. The failures I get now are because the click on the > > "push" button don't trigger the push listener in the service worker. I > > tracked down the push until ServiceWorkerPrivate.cpp and everything seems > > successful, I just don't know why the service worker doesn't receive it. > > Ah ok, may be bkelly can help track what's in ServiceWorkerPrivate... > I also already looked into this some time ago for the netmonitor, > so you can elaborate here if you find anything suspect. This is probably the multi-e10s issue where the push notification is sent to the first (web) content process in the linked list of all (web) content processes, which isn't necessarily the content process hosting your test tab. As a result, a second instance of your test service worker is likely being spun up in a process that is not where the test page is, and it can't see your test page's client to postMessage to it. If you're talking about the test with a fake push manager, then the problem is likely the SW is being spun up in the parent process instead of the/any content process. Mitigations look like: - explicitly set preferences so there's only one live content process - Use the process selector mechanism ("@mozilla.org/ipc/processselector;1") to ensure that your test tab gets allocated in the first process, regardless of how many there are. - Instead of using postMessage to communicate from the ServiceWorker, use the BroadcastChannel mechanism which is cross-process. (Note that if I understand the weirdness with the message manager used to perceive the postMessage message, the push-sw.html page will have to bridge the BroadcastChannel to something the test can perceive. - Wait a little and :bkelly's Client overhauls will start having the clients API work cross-process. - Wait a little longer and we'll have fixed the multi-e10s issue so there will only ever be one instance of the SW.
(In reply to Andrew Sutherland [:asuth] from comment #20) > (In reply to Alexandre Poirot [:ochameau] from comment #19) > > > It is still intermittent (when using the patches from the other bug) but not > > > for anything like that. The failures I get now are because the click on the > > > "push" button don't trigger the push listener in the service worker. I > > > tracked down the push until ServiceWorkerPrivate.cpp and everything seems > > > successful, I just don't know why the service worker doesn't receive it. > > > > Ah ok, may be bkelly can help track what's in ServiceWorkerPrivate... > > I also already looked into this some time ago for the netmonitor, > > so you can elaborate here if you find anything suspect. > > This is probably the multi-e10s issue where the push notification is sent to > the first (web) content process in the linked list of all (web) content > processes, which isn't necessarily the content process hosting your test > tab. As a result, a second instance of your test service worker is likely > being spun up in a process that is not where the test page is, and it can't > see your test page's client to postMessage to it. If you're talking about > the test with a fake push manager, then the problem is likely the SW is > being spun up in the parent process instead of the/any content process. > > Mitigations look like: > - explicitly set preferences so there's only one live content process We already force "dom.ipc.processCount" to 1 before starting each test (see enableServiceWorkerDebugging at https://searchfox.org/mozilla-central/rev/f5f1c3f294f89cfd242c3af9eb2c40d19d5e04e7/devtools/client/aboutdebugging/test/head.js#453-466). We then call Services.ppmm.releaseCachedProcesses(); Is there still a chance that we might get tabs assigned to different content processes? I will add some logs to check ppmm.childCount when the test fails. > - Use the process selector mechanism ("@mozilla.org/ipc/processselector;1") > to ensure that your test tab gets allocated in the first process, regardless > of how many there are. > - Instead of using postMessage to communicate from the ServiceWorker, use > the BroadcastChannel mechanism which is cross-process. (Note that if I > understand the weirdness with the message manager used to perceive the > postMessage message, the push-sw.html page will have to bridge the > BroadcastChannel to something the test can perceive. I verified that postMessage was still working between the service worker and the page. I broadcasted a message every second from the sw code and was able to receive them in the test. > - Wait a little and :bkelly's Client overhauls will start having the clients > API work cross-process. > - Wait a little longer and we'll have fixed the multi-e10s issue so there > will only ever be one instance of the SW. Once these patches have landed I will finish isolating a reduced test case to investigate the remaining intermittent. And again I don't have any failure with these patches on top of the latest central. Only when applied on top of Bug 1419771.
Comment on attachment 8934700 [details] Bug 1386613 - stop using waitForMutations in aboutdebugging tests (serviceworkers); https://reviewboard.mozilla.org/r/205594/#review212056 > Wait, do I get comment 14 right? > Are you saying this kind of test still fails? > It looks like assertHasTarget is redundant with waitUntilServiceWorkerContainer. > The only difference is that assertHasTarget isn't a "waitUntil" like function and assert immediately. > > So may be waitUntilServiceWorkerContainer times out, > but it would be quite surprising if assertHasTarget fails. > > At the end, I think you can get rid of all assertHasTarget following waitUntilServiceWorkerContainer calls. (there is some other occurences) Removed unnecessary assertHasTarget > This looks racy. > Otherwise, used like this, there is no point in using a promise. `sw` could directly be `registration` and not the promise of it. > > var sw = (async function () { > await new Promise( ...pushPermission... ); > try { > let registration = ...register(push-sw.js); > ... > } catch(e) {} > return registration; > })(); Fixed
Pushed by jdescottes@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5cbdcea46ed0 stop using waitForMutations in aboutdebugging tests (serviceworkers);r=ochameau https://hg.mozilla.org/integration/autoland/rev/c9c877119f3f stop using waitForMutations in aboutdebugging tests (addons);r=ochameau https://hg.mozilla.org/integration/autoland/rev/24bd9eb8d545 stop using waitForMutations in aboutdebugging tests (remaining tests);r=ochameau
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 59
Product: Firefox → DevTools
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: