Open Bug 1615164 Opened 5 years ago Updated 5 months ago

Intermittent dom/serviceworkers/test/test_abrupt_completion.html | Correct message handler - got "handler-after-throw", expected "handler-before-throw"

Categories

(Core :: DOM: Service Workers, defect, P2)

defect

Tracking

()

People

(Reporter: intermittent-bug-filer, Assigned: asuth)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, leave-open, Whiteboard: [stockwell disabled])

Attachments

(2 files, 2 obsolete files)

The priority flag is not set for this bug.
:edenchuang, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(echuang)
Flags: needinfo?(echuang)
Priority: -- → P3

Most of the failures in the last 7 days are from try pushes, after a bunch of retriggers: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-27&endday=2020-03-05&tree=all&bug=1615164
Masayuki, could you please in the future star the failures on try as expected fail and bug no in the comment section? Otherwise it triggers whitebord tags as [stockwell disable-recommended] for no valid reason. Thank you.

Flags: needinfo?(masayuki)
Whiteboard: [stockwell disable-recommended]

(In reply to Cosmin Sabou [:CosminS] from comment #9)

Most of the failures in the last 7 days are from try pushes, after a punch of retriggers: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-27&endday=2020-03-05&tree=all&bug=1615164
Masayuki, could you please in the future star the failures on try as expected fail and bug no in the comment section? Otherwise it triggers whitebord tags as [stockwell disable-recommended] for no valid reason. Thank you.

I see, just marking it as pinned is okay? Without that, I cannot confirm that whether an intermittent failure which I try to fix is still reproducible or not.

Flags: needinfo?(masayuki) → needinfo?(csabou)

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900) from comment #10)

(In reply to Cosmin Sabou [:CosminS] from comment #9)

Most of the failures in the last 7 days are from try pushes, after a punch of retriggers: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-27&endday=2020-03-05&tree=all&bug=1615164
Masayuki, could you please in the future star the failures on try as expected fail and bug no in the comment section? Otherwise it triggers whitebord tags as [stockwell disable-recommended] for no valid reason. Thank you.

I see, just marking it as pinned is okay? Without that, I cannot confirm that whether an intermittent failure which I try to fix is still reproducible or not.

Was not questioning the re-triggers but the classifications. After pinning a fail, from the classification drop down list you can choose expected fail and add bug no in the comment section below or you can just not classify them at all, it's no problem with not classifying them on try. Thank you.

Flags: needinfo?(csabou)

you can just not classify them at all, it's no problem with not classifying them on try.

Thank you, that's what I want to know.

The too many intermittent failure in tryserver must be caused by my stars.

Whiteboard: [stockwell disable-recommended]
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

There were 31 failures in the past week on linux1804-64-shippable-qr opt, linux1804-64-shippable opt, linux1804-64-qr debug, linux1804-64-qr opt.

Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=305284560&repo=mozilla-central&lineNumber=5963

[task 2020-06-06T00:04:19.524Z] 00:04:19 INFO - TEST-START | dom/serviceworkers/test/test_abrupt_completion.html
[task 2020-06-06T00:04:20.779Z] 00:04:20 INFO - TEST-INFO | started process screentopng
[task 2020-06-06T00:04:21.091Z] 00:04:21 INFO - TEST-INFO | screentopng: exit 0
[task 2020-06-06T00:04:21.091Z] 00:04:21 INFO - Buffered messages logged at 00:04:20
[task 2020-06-06T00:04:21.092Z] 00:04:21 INFO - add_task | Entering test setup
[task 2020-06-06T00:04:21.093Z] 00:04:21 INFO - TEST-PASS | dom/serviceworkers/test/test_abrupt_completion.html | ServiceWorker is activated
[task 2020-06-06T00:04:21.093Z] 00:04:21 INFO - add_task | Leaving test setup
[task 2020-06-06T00:04:21.094Z] 00:04:21 INFO - add_task | Entering test testMessageHandler
[task 2020-06-06T00:04:21.094Z] 00:04:21 INFO - Buffered messages finished
[task 2020-06-06T00:04:21.095Z] 00:04:21 INFO - TEST-UNEXPECTED-FAIL | dom/serviceworkers/test/test_abrupt_completion.html | Correct message handler - got "handler-after-throw", expected "handler-before-throw"
[task 2020-06-06T00:04:21.095Z] 00:04:21 INFO - SimpleTest.is@SimpleTest/SimpleTest.js:383:14
[task 2020-06-06T00:04:21.095Z] 00:04:21 INFO - testMessageHandler/</navigator.serviceWorker.onmessage@dom/serviceworkers/test/test_abrupt_completion.html:107:9
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - EventHandlerNonNulltestMessageHandler/<@dom/serviceworkers/test/test_abrupt_completion.html:106:5
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - testMessageHandler@dom/serviceworkers/test/test_abrupt_completion.html:105:9
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - async
nextTick/<@SimpleTest/SimpleTest.js:2084:34
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - asyncnextTick@SimpleTest/SimpleTest.js:2109:11
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - setTimeout handler
SimpleTest_setTimeoutShim@SimpleTest/SimpleTest.js:788:41
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - add_task@SimpleTest/SimpleTest.js:2039:17
[task 2020-06-06T00:04:21.097Z] 00:04:21 INFO - @dom/serviceworkers/test/test_abrupt_completion.html:71:9
[task 2020-06-06T00:04:21.098Z] 00:04:21 INFO - add_task | Leaving test testMessageHandler
[task 2020-06-06T00:04:21.098Z] 00:04:21 INFO - add_task | Entering test testFetchHandler
[task 2020-06-06T00:09:19.525Z] 00:09:19 INFO - Not taking screenshot here: see the one that was previously logged
[task 2020-06-06T00:09:19.526Z] 00:09:19 INFO - TEST-UNEXPECTED-FAIL | dom/serviceworkers/test/test_abrupt_completion.html | Test timed out.
[task 2020-06-06T00:09:19.526Z] 00:09:19 INFO - SimpleTest.ok@SimpleTest/SimpleTest.js:299:16
[task 2020-06-06T00:09:19.526Z] 00:09:19 INFO - reportError@SimpleTest/TestRunner.js:128:22
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - TestRunner._checkForHangs@SimpleTest/TestRunner.js:150:18
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handlerTestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handler
TestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handlerTestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handler
TestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handlerTestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handler
TestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.528Z] 00:09:19 INFO - setTimeout handlerTestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.528Z] 00:09:19 INFO - setTimeout handler
TestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.528Z] 00:09:19 INFO - setTimeout handlerTestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.528Z] 00:09:19 INFO - setTimeout handler
TestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.529Z] 00:09:19 INFO - TestRunner.runTests/<@SimpleTest/TestRunner.js:420:16

Jens, could you take a look, please?

Flags: needinfo?(jstutte)
Whiteboard: [stockwell unknown] → [stockwell unknown][stockwell needswork:owner]

Hi Perry, this test has been added for famous bug 1603484 (Walmart Grocery service worker fails to launch after initial run). It had some previous failure peak in March and now it is coming back. From what I understand reading the test, this can only happen if importScripts does not throw at all? There seem to be no hint in the log what may have happened to it, though.
Are there any recent changes you are aware of that might be related with importScripts?

Flags: needinfo?(jstutte) → needinfo?(perry)

As far as I know, there haven't been any changes to importScripts that might affect this. It seems like the worker script isn't exiting when a top-level exception is thrown: https://searchfox.org/mozilla-central/rev/4bb2401ecbfce89af06fb2b4d0ea3557682bd8ff/testing/web-platform/tests/workers/support/abrupt-completion.js#18, which causes the unexpected "handler-after-throw" string to be set that's reported by the intermittent failure. The exception handling not behaving normally is extremely strange.

Flags: needinfo?(perry)

Hmmm, I thought it was this expected throw on importScripts that unexpectedly did not throw (but I don't remember, if I just searched for the handler string and found that one). Unfortunately the set message handler in both cases would be the same, so it is difficult to understand from the log, what really happens here. But both alternatives ("importScripts that does not throw on error sometimes" or "explicit throw does not throw sometimes" look similarly scary enough to be investigated further. I suspect though, that this is one of those things that could only be debugged well with rr/pernosco.
As it is not rare enough to be ignored: Andrew, can you try to set up a (long) test loop for this under rr (I have only an AMD CPU under Linux here and I think, Perry is on Mac) ?

Flags: needinfo?(bugmail)
Severity: normal → S3

So the test is doing the following where Date.now() is being used in an attempt to generate a value that will be different the next time the top-level script is evaluated:

importScripts(`empty.js?${Date.now()}`);

Since Date.now() has only millisecond granularity, this was probably more risk than we should have landed in the first place, but there are also a number of reasons for which Firefox may further reduce the precision for privacy reasons, as https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/now#Examples covers nicely.

Probably the best first step is to also just add a Math.random() or similar to the mix to reduce the chance of accidental reuse of the same timestamp for any reason. It appears that we initialize our per-realm RNG from cryptographically secure seeds, although there is a fallback to timestamp-based initialization (which can be impacted by "fuzzyfox" mitigations) if we are unable to generate the random numbers, but that depends on system affordances failing quite extensively (like errors reading from /dev/urandom). Quick trace:

Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)

My try push ended up still failing with a reproduction under pernosco at https://pernos.co/debug/ssNhDU_d-ff50MxrRj-ITA/index.html which I'm looking into.

Andrew, any updates on this? It's on the disable recommended list with:

Do you think we can go ahead and disable the test until a fix is in place?

Flags: needinfo?(bugmail)
Priority: P3 → P2

It seems from the pernosco trace, that we have a long delay during the processing of the previous event to the one failing here. This leads to a long delay between the dispatching of the LoadAllScripts from the Worker Thread and the actual processing of the ScriptLoaderRunnableevent on the Isolated Web Content thread, which is causing a time out problem then, it seems.
I have no idea, what actually might cause this delay within nsBaseAppShell::OnProcessNextEvent, though, as I interrupted the investigation at this point.

So for the delay, the appshell does an inversion of control so the native event loop is in charge until there's a Gecko Event Loop event. When an event is scheduled against the main thread we expect the dispatch to call nsBaseAppShell::OnDispatchedEvent https://searchfox.org/mozilla-central/rev/5a4aaccb28665807a6fd49cf48367d47fbb5a19a/widget/nsBaseAppShell.cpp#195 which should then dispatch a native event to wake up the native event loop and allow XPCOM event processing to resume.

A good next step is likely to determine if we're failing to schedule a native event for the worker's event, perhaps due to all of our throttled event queues.

Flags: needinfo?(bugmail)
Attachment #9163020 - Attachment is obsolete: true
Attachment #9163020 - Attachment is obsolete: false
Whiteboard: [stockwell disable-recommended] → [stockwell disable]
Keywords: leave-open
Whiteboard: [stockwell disable] → [stockwell disabled]
Attachment #9163020 - Attachment is obsolete: true
Pushed by dluca@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/26f2d33b95f6 disable test_abrupt_completion.html on linux platforms. r=jmaher
Attachment #9155928 - Attachment is obsolete: true

There are 65 total failures in the last 7 days on

  • windows10-64-2004-shippable-qr opt
  • windows10-64-2004-qr debug and opt
  • windows10-32-2004-qr debug
  • android-em-7-0-x86_64-qr debug

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=358535786&repo=autoland&lineNumber=6940

[task 2021-11-19T01:28:27.213Z] 01:28:27     INFO - TEST-START | dom/serviceworkers/test/test_abrupt_completion.html
[task 2021-11-19T01:28:27.272Z] 01:28:27     INFO - TEST-INFO | started process screenshot
[task 2021-11-19T01:28:27.354Z] 01:28:27     INFO - TEST-INFO | screenshot: exit 0
[task 2021-11-19T01:28:27.364Z] 01:28:27     INFO - Buffered messages logged at 01:28:27
[task 2021-11-19T01:28:27.365Z] 01:28:27     INFO - add_task | Entering test setup
[task 2021-11-19T01:28:27.365Z] 01:28:27     INFO - TEST-PASS | dom/serviceworkers/test/test_abrupt_completion.html | ServiceWorker is activated 
[task 2021-11-19T01:28:27.365Z] 01:28:27     INFO - add_task | Leaving test setup
[task 2021-11-19T01:28:27.366Z] 01:28:27     INFO - add_task | Entering test testMessageHandler
[task 2021-11-19T01:28:27.366Z] 01:28:27     INFO - Buffered messages finished
[task 2021-11-19T01:28:27.367Z] 01:28:27     INFO - TEST-UNEXPECTED-FAIL | dom/serviceworkers/test/test_abrupt_completion.html | Correct message handler - got "handler-after-throw", expected "handler-before-throw"
[task 2021-11-19T01:28:27.367Z] 01:28:27     INFO -     SimpleTest.is@SimpleTest/SimpleTest.js:500:14
[task 2021-11-19T01:28:27.367Z] 01:28:27     INFO -     testMessageHandler/</navigator.serviceWorker.onmessage@dom/serviceworkers/test/test_abrupt_completion.html:104:9
[task 2021-11-19T01:28:27.368Z] 01:28:27     INFO - add_task | Leaving test testMessageHandler
[task 2021-11-19T01:28:27.368Z] 01:28:27     INFO - add_task | Entering test testFetchHandler
[task 2021-11-19T01:28:27.368Z] 01:28:27     INFO - TEST-PASS | dom/serviceworkers/test/test_abrupt_completion.html | Fetch succeeded and didn't result in a NetworkError 
[task 2021-11-19T01:28:27.369Z] 01:28:27     INFO - TEST-PASS | dom/serviceworkers/test/test_abrupt_completion.html | Correct response text 
[task 2021-11-19T01:28:27.369Z] 01:28:27     INFO - add_task | Leaving test testFetchHandler
[task 2021-11-19T01:28:27.371Z] 01:28:27     INFO - GECKO(3580) | MEMORY STAT | vsize 2103732MB | vsizeMaxContiguous 75177305MB | residentFast 57MB | heapAllocated 7MB
[task 2021-11-19T01:28:27.371Z] 01:28:27     INFO - TEST-OK | dom/serviceworkers/test/test_abrupt_completion.html | took 102ms

Andrew are you by any chance still working on this?

Flags: needinfo?(bugmail)
Whiteboard: [stockwell disabled] → [stockwell disabled][stockwell needswork:owner]

Hi Eden, moving ni to you, once you have some time left please take a look.

Flags: needinfo?(bugmail) → needinfo?(echuang)

There have been 98 total failures in the last 7 days, recent failure log.
Affected platforms are:

  • android-em-7-0-x86_64-lite-qr
  • android-em-7-0-x86_64-qr
  • windows10-32-2004-qr
  • windows10-32-2004-shippable-qr
  • windows10-64-2004-qr
  • windows10-64-2004-shippable-qr
Pushed by apavel@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/aa66642c9d65 disable test_abrupt_completion.html on win10 2004 r=intermittent-reviewers,gbrown DONTBUILD
Whiteboard: [stockwell disabled][stockwell unknown] → [stockwell disabled]
Flags: needinfo?(echuang)

My attempt to get pernosco traces off the fixed-condprof try push didn't work out, but in reviewing https://phabricator.services.mozilla.com/D177427 and thinking about how we could use a cleaner way to force the shutdown of specific ServiceWorkers, I think the startAndStopServiceWorker helper could be the problem. I think we can probably also have the ServiceWorker capture a generation identifier when it starts up which could also be conveyed with the progress indicator. We could sample that value before forcing the SW to shut down in order to be able to distinguish a failure to shutdown the worker (test bug) from an actual failure of what's being tested. I'll implement a new helper method on nsIServiceWorkerInfo and perhaps state exposure as a separate commit so that we can also use the mechanism in https://phabricator.services.mozilla.com/D177427.

Bulk closing intermittent bugs for DOM LWS that have not seen new instances since at least 5 months.

Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → INCOMPLETE

Taking back the bulk change.

Status: RESOLVED → UNCONFIRMED
Ever confirmed: false
Resolution: INCOMPLETE → ---
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: