Open Bug 1615164 Opened 7 months ago Updated 2 days ago

Intermittent dom/serviceworkers/test/test_abrupt_completion.html | Correct message handler - got "handler-after-throw", expected "handler-before-throw"

Categories

(Core :: DOM: Service Workers, defect, P2)

defect

Tracking

()

ASSIGNED

People

(Reporter: intermittent-bug-filer, Assigned: asuth)

Details

(Keywords: intermittent-failure, leave-open, Whiteboard: [stockwell disabled])

Attachments

(2 files, 1 obsolete file)

The priority flag is not set for this bug.
:edenchuang, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(echuang)
Flags: needinfo?(echuang)
Priority: -- → P3

Most of the failures in the last 7 days are from try pushes, after a bunch of retriggers: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-27&endday=2020-03-05&tree=all&bug=1615164
Masayuki, could you please in the future star the failures on try as expected fail and bug no in the comment section? Otherwise it triggers whitebord tags as [stockwell disable-recommended] for no valid reason. Thank you.

Flags: needinfo?(masayuki)
Whiteboard: [stockwell disable-recommended]

(In reply to Cosmin Sabou [:CosminS] from comment #9)

Most of the failures in the last 7 days are from try pushes, after a punch of retriggers: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-27&endday=2020-03-05&tree=all&bug=1615164
Masayuki, could you please in the future star the failures on try as expected fail and bug no in the comment section? Otherwise it triggers whitebord tags as [stockwell disable-recommended] for no valid reason. Thank you.

I see, just marking it as pinned is okay? Without that, I cannot confirm that whether an intermittent failure which I try to fix is still reproducible or not.

Flags: needinfo?(masayuki) → needinfo?(csabou)

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900) from comment #10)

(In reply to Cosmin Sabou [:CosminS] from comment #9)

Most of the failures in the last 7 days are from try pushes, after a punch of retriggers: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-27&endday=2020-03-05&tree=all&bug=1615164
Masayuki, could you please in the future star the failures on try as expected fail and bug no in the comment section? Otherwise it triggers whitebord tags as [stockwell disable-recommended] for no valid reason. Thank you.

I see, just marking it as pinned is okay? Without that, I cannot confirm that whether an intermittent failure which I try to fix is still reproducible or not.

Was not questioning the re-triggers but the classifications. After pinning a fail, from the classification drop down list you can choose expected fail and add bug no in the comment section below or you can just not classify them at all, it's no problem with not classifying them on try. Thank you.

Flags: needinfo?(csabou)

you can just not classify them at all, it's no problem with not classifying them on try.

Thank you, that's what I want to know.

The too many intermittent failure in tryserver must be caused by my stars.

Whiteboard: [stockwell disable-recommended]
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

There were 31 failures in the past week on linux1804-64-shippable-qr opt, linux1804-64-shippable opt, linux1804-64-qr debug, linux1804-64-qr opt.

Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=305284560&repo=mozilla-central&lineNumber=5963

[task 2020-06-06T00:04:19.524Z] 00:04:19 INFO - TEST-START | dom/serviceworkers/test/test_abrupt_completion.html
[task 2020-06-06T00:04:20.779Z] 00:04:20 INFO - TEST-INFO | started process screentopng
[task 2020-06-06T00:04:21.091Z] 00:04:21 INFO - TEST-INFO | screentopng: exit 0
[task 2020-06-06T00:04:21.091Z] 00:04:21 INFO - Buffered messages logged at 00:04:20
[task 2020-06-06T00:04:21.092Z] 00:04:21 INFO - add_task | Entering test setup
[task 2020-06-06T00:04:21.093Z] 00:04:21 INFO - TEST-PASS | dom/serviceworkers/test/test_abrupt_completion.html | ServiceWorker is activated
[task 2020-06-06T00:04:21.093Z] 00:04:21 INFO - add_task | Leaving test setup
[task 2020-06-06T00:04:21.094Z] 00:04:21 INFO - add_task | Entering test testMessageHandler
[task 2020-06-06T00:04:21.094Z] 00:04:21 INFO - Buffered messages finished
[task 2020-06-06T00:04:21.095Z] 00:04:21 INFO - TEST-UNEXPECTED-FAIL | dom/serviceworkers/test/test_abrupt_completion.html | Correct message handler - got "handler-after-throw", expected "handler-before-throw"
[task 2020-06-06T00:04:21.095Z] 00:04:21 INFO - SimpleTest.is@SimpleTest/SimpleTest.js:383:14
[task 2020-06-06T00:04:21.095Z] 00:04:21 INFO - testMessageHandler/</navigator.serviceWorker.onmessage@dom/serviceworkers/test/test_abrupt_completion.html:107:9
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - EventHandlerNonNulltestMessageHandler/<@dom/serviceworkers/test/test_abrupt_completion.html:106:5
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - testMessageHandler@dom/serviceworkers/test/test_abrupt_completion.html:105:9
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - async
nextTick/<@SimpleTest/SimpleTest.js:2084:34
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - asyncnextTick@SimpleTest/SimpleTest.js:2109:11
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - setTimeout handler
SimpleTest_setTimeoutShim@SimpleTest/SimpleTest.js:788:41
[task 2020-06-06T00:04:21.096Z] 00:04:21 INFO - add_task@SimpleTest/SimpleTest.js:2039:17
[task 2020-06-06T00:04:21.097Z] 00:04:21 INFO - @dom/serviceworkers/test/test_abrupt_completion.html:71:9
[task 2020-06-06T00:04:21.098Z] 00:04:21 INFO - add_task | Leaving test testMessageHandler
[task 2020-06-06T00:04:21.098Z] 00:04:21 INFO - add_task | Entering test testFetchHandler
[task 2020-06-06T00:09:19.525Z] 00:09:19 INFO - Not taking screenshot here: see the one that was previously logged
[task 2020-06-06T00:09:19.526Z] 00:09:19 INFO - TEST-UNEXPECTED-FAIL | dom/serviceworkers/test/test_abrupt_completion.html | Test timed out.
[task 2020-06-06T00:09:19.526Z] 00:09:19 INFO - SimpleTest.ok@SimpleTest/SimpleTest.js:299:16
[task 2020-06-06T00:09:19.526Z] 00:09:19 INFO - reportError@SimpleTest/TestRunner.js:128:22
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - TestRunner._checkForHangs@SimpleTest/TestRunner.js:150:18
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handlerTestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handler
TestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handlerTestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handler
TestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handlerTestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.527Z] 00:09:19 INFO - setTimeout handler
TestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.528Z] 00:09:19 INFO - setTimeout handlerTestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.528Z] 00:09:19 INFO - setTimeout handler
TestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.528Z] 00:09:19 INFO - setTimeout handlerTestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.528Z] 00:09:19 INFO - setTimeout handler
TestRunner._checkForHangs@SimpleTest/TestRunner.js:184:15
[task 2020-06-06T00:09:19.529Z] 00:09:19 INFO - TestRunner.runTests/<@SimpleTest/TestRunner.js:420:16

Jens, could you take a look, please?

Flags: needinfo?(jstutte)
Whiteboard: [stockwell unknown] → [stockwell unknown][stockwell needswork:owner]

Hi Perry, this test has been added for famous bug 1603484 (Walmart Grocery service worker fails to launch after initial run). It had some previous failure peak in March and now it is coming back. From what I understand reading the test, this can only happen if importScripts does not throw at all? There seem to be no hint in the log what may have happened to it, though.
Are there any recent changes you are aware of that might be related with importScripts?

Flags: needinfo?(jstutte) → needinfo?(perry)

As far as I know, there haven't been any changes to importScripts that might affect this. It seems like the worker script isn't exiting when a top-level exception is thrown: https://searchfox.org/mozilla-central/rev/4bb2401ecbfce89af06fb2b4d0ea3557682bd8ff/testing/web-platform/tests/workers/support/abrupt-completion.js#18, which causes the unexpected "handler-after-throw" string to be set that's reported by the intermittent failure. The exception handling not behaving normally is extremely strange.

Flags: needinfo?(perry)

Hmmm, I thought it was this expected throw on importScripts that unexpectedly did not throw (but I don't remember, if I just searched for the handler string and found that one). Unfortunately the set message handler in both cases would be the same, so it is difficult to understand from the log, what really happens here. But both alternatives ("importScripts that does not throw on error sometimes" or "explicit throw does not throw sometimes" look similarly scary enough to be investigated further. I suspect though, that this is one of those things that could only be debugged well with rr/pernosco.
As it is not rare enough to be ignored: Andrew, can you try to set up a (long) test loop for this under rr (I have only an AMD CPU under Linux here and I think, Perry is on Mac) ?

Flags: needinfo?(bugmail)
Severity: normal → S3

So the test is doing the following where Date.now() is being used in an attempt to generate a value that will be different the next time the top-level script is evaluated:

importScripts(`empty.js?${Date.now()}`);

Since Date.now() has only millisecond granularity, this was probably more risk than we should have landed in the first place, but there are also a number of reasons for which Firefox may further reduce the precision for privacy reasons, as https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/now#Examples covers nicely.

Probably the best first step is to also just add a Math.random() or similar to the mix to reduce the chance of accidental reuse of the same timestamp for any reason. It appears that we initialize our per-realm RNG from cryptographically secure seeds, although there is a fallback to timestamp-based initialization (which can be impacted by "fuzzyfox" mitigations) if we are unable to generate the random numbers, but that depends on system affordances failing quite extensively (like errors reading from /dev/urandom). Quick trace:

Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)

My try push ended up still failing with a reproduction under pernosco at https://pernos.co/debug/ssNhDU_d-ff50MxrRj-ITA/index.html which I'm looking into.

Andrew, any updates on this? It's on the disable recommended list with:

Do you think we can go ahead and disable the test until a fix is in place?

Flags: needinfo?(bugmail)
Priority: P3 → P2

It seems from the pernosco trace, that we have a long delay during the processing of the previous event to the one failing here. This leads to a long delay between the dispatching of the LoadAllScripts from the Worker Thread and the actual processing of the ScriptLoaderRunnableevent on the Isolated Web Content thread, which is causing a time out problem then, it seems.
I have no idea, what actually might cause this delay within nsBaseAppShell::OnProcessNextEvent, though, as I interrupted the investigation at this point.

So for the delay, the appshell does an inversion of control so the native event loop is in charge until there's a Gecko Event Loop event. When an event is scheduled against the main thread we expect the dispatch to call nsBaseAppShell::OnDispatchedEvent https://searchfox.org/mozilla-central/rev/5a4aaccb28665807a6fd49cf48367d47fbb5a19a/widget/nsBaseAppShell.cpp#195 which should then dispatch a native event to wake up the native event loop and allow XPCOM event processing to resume.

A good next step is likely to determine if we're failing to schedule a native event for the worker's event, perhaps due to all of our throttled event queues.

Flags: needinfo?(bugmail)
Attachment #9163020 - Attachment is obsolete: true
Attachment #9163020 - Attachment is obsolete: false
Whiteboard: [stockwell disable-recommended] → [stockwell disable]
Keywords: leave-open
Whiteboard: [stockwell disable] → [stockwell disabled]
Attachment #9163020 - Attachment is obsolete: true
Pushed by dluca@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/26f2d33b95f6
disable test_abrupt_completion.html on linux platforms. r=jmaher
You need to log in before you can comment on or make changes to this bug.