Open Bug 1274773 Opened 6 years ago Updated 2 years ago

Intermittent test_multiple_register_during_service_activation.html | could not register for push notification | could not register for push notification

Categories

(Core :: DOM: Push Notifications, defect, P2)

defect

Tracking

()

Tracking Status
firefox49 --- affected

People

(Reporter: aryx, Unassigned)

Details

(Keywords: intermittent-failure, leave-open, regression, Whiteboard: [retriggered][stockwell disabled])

Attachments

(2 files)

https://treeherder.mozilla.org/logviewer.html#?job_id=28357533&repo=mozilla-inbound

16:55:04     INFO -  273 INFO TEST-START | dom/push/test/test_multiple_register_during_service_activation.html
16:55:05     INFO -  TEST-INFO | started process screentopng
16:55:06     INFO -  TEST-INFO | screentopng: exit 0
16:55:06     INFO -  274 INFO TEST-PASS | dom/push/test/test_multiple_register_during_service_activation.html | getEndpoint should return null when app not subscribed.
16:55:06     INFO -  275 INFO TEST-UNEXPECTED-FAIL | dom/push/test/test_multiple_register_during_service_activation.html | could not register for push notification
16:55:06     INFO -      subscribe/<@dom/push/test/test_multiple_register_during_service_activation.html:53:9
16:55:06     INFO -      promise callback*subscribe@dom/push/test/test_multiple_register_during_service_activation.html:48:12
16:55:06     INFO -      promise callback*runTest/<@dom/push/test/test_multiple_register_during_service_activation.html:96:7
16:55:06     INFO -      promise callback*runTest@dom/push/test/test_multiple_register_during_service_activation.html:94:5
16:55:06     INFO -      promise callback*@dom/push/test/test_multiple_register_during_service_activation.html:106:3
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Priority: P3 → P5

There are a few things going on in those logs.

"handleHelloReply: Unexpected state 3 (expected STATE_WAITING_FOR_HELLO)" is interesting, but shouldn't cause issues. The mock WebSocket in the parent sends an async message to a chrome script in the child process. Each child sets up and tears down the mock, but those form a queue in the parent. Likely a hello intended for a previous test gets sent to the next one.

These warnings don't look good, though:

13:48:11 INFO - GECKO(5076) | ++DOMWINDOW == 14 (0BE6B800) [pid = 4316] [serial = 14] [outer = 07C3B700]
13:48:11 INFO - GECKO(5076) | [Child 4316, Main Thread] WARNING: '!window', file z:/build/build/src/dom/cache/CacheStorage.cpp, line 572
13:48:12 INFO - GECKO(5076) | [Child 4316, Main Thread] WARNING: '!window', file z:/build/build/src/dom/cache/CacheStorage.cpp, line 572
13:48:12 INFO - GECKO(5076) | [Child 4316, Main Thread] WARNING: '!window', file z:/build/build/src/dom/cache/CacheStorage.cpp, line 572
13:48:12 INFO - GECKO(5076) | [Child 4316, Main Thread] WARNING: A runnable was posted to a worker that is already shutting down!: file z:/build/build/src/dom/workers/WorkerPrivate.cpp, line 1352
13:48:12 INFO - GECKO(5076) | [Child 4316, Main Thread] WARNING: A runnable was posted to a worker that is already shutting down!: file z:/build/build/src/dom/workers/WorkerPrivate.cpp, line 1352
13:48:12 INFO - GECKO(5076) | [Parent 4740, Main Thread] WARNING: '!window', file z:/build/build/src/dom/cache/CacheStorage.cpp, line 572
13:48:12 INFO - GECKO(5076) | [Child 4944, Main Thread] WARNING: '!window', file z:/build/build/src/dom/cache/CacheStorage.cpp, line 572
13:48:12 INFO - GECKO(5076) | [Child 2916, Main Thread] WARNING: '!window', file z:/build/build/src/dom/cache/CacheStorage.cpp, line 572
13:48:12 INFO - GECKO(5076) | [Parent 4740, QuotaManager IO] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80630001: file z:/build/build/src/storage/mozStorageConnection.cpp, line 695
13:48:12 INFO - GECKO(5076) | [Parent 4740, QuotaManager IO] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80630001: file z:/build/build/src/storage/mozStorageService.cpp, line 665
13:48:12 INFO - GECKO(5076) | [Parent 4740, QuotaManager IO] WARNING: Received NS_ERROR_STORAGE_BUSY when attempting to open database '2918063365piupsah.sqlite', retrying for up to 10 seconds: file z:/build/build/src/dom/indexedDB/ActorsParent.cpp, line 4082

([Child 2916, Main Thread] WARNING: 'NS_FAILED(aStatus)', file z:/build/build/src/dom/serviceworkers/ServiceWorkerUnregisterJob.cpp, line 31 is also a push error—we fail to unsubscribe when the worker is unregistered. If push storage is borked, that makes sense).

From that point, the test fails, which leaves the service worker registered, which causes cascading failures for the remaining push tests.

Here's why I think bug 1524655 exacerbated this: since we now connect unconditionally, we're reading and writing to IDB on each test. Previously, we'd only read, and skip writing if there were no push subscriptions stored locally. Now, we go through the full handshake, which means we run dropUnexpiredRegistrations, which opens a readwrite transaction.

I'm not sure if this is a push bug, a quota manager bug, or an IDB bug, or something else entirely. Asuth, do those errors look familiar? Is there a way for us to avoid busy errors here, or at least to see what's causing issues?

If this analysis is right, we can work around the issue by checking if we have any subscriptions, and skipping the write, after we open the connection...but that feels like a hack, and introduces an extra read on push startup just to work around this test failure.

Flags: needinfo?(lina) → needinfo?(bugmail)

The warnings seem sufficiently harmless. The busy IDB warning on open is expected if you do anything that may cause the WAL to need to checkpoint at close. It's possible IDB shouldn't be saying stuff like that because it's so expected. If it's resulting in an actual error being exposed to (chrome) content, that is a problem, however.

The CacheStorage.cpp warnings do suggest there might be some test logic that's operating out of a frame script or JSM that maybe shouldn't be, or that the test logic is being triggered after the window has already begun closing, suggesting the test should be more ordered about shutdown.

Just quickly looking at one of the test errors about being unable to subscribe, I think the registerServiceWorker() logic wants to wait for the SW to become activated. There's a helper registerAndWaitForActive at https://searchfox.org/mozilla-central/source/dom/serviceworkers/test/utils.js#33 that can be used. We also have just manually scattered calls to the waitForState helper at the top of the file throughout our tests too.

Without this change, the SW may only be locally in the 'installing' state. The other processes (including the parent) will not have heard about the registration at that point; propagation only occurs when an installing ServiceWorker becomes activated. (This changes under the e10s parent-intercept refactor to be more sane, of course. Hooray!)

Flags: needinfo?(bugmail)

Over the last 7 days there are 61 failures on this bug. These happen on : linux32, linux64, linux64-pgo-qr, osx-10-10, windows10-64, windows7-32

Here is the latest failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=235984130&repo=autoland&lineNumber=25563

Flags: needinfo?(overholt)

We're going to have the e10s parent-intercept refactor landing soon enough and since that'll change things here, let's wait to work on this until after that lands.

Flags: needinfo?(overholt)
Priority: -- → P2
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]

There are 73 total failures in the last 7 days on linux32, linux64, osx-10-10, windows10-64 and windows7-32 all build types

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=237265086&repo=autoland&lineNumber=5513

[task 2019-04-01T06:44:48.037Z] 06:44:48 INFO - TEST-START | dom/push/test/test_multiple_register_during_service_activation.html
[task 2019-04-01T06:44:48.094Z] 06:44:48 INFO - GECKO(2683) | [Parent 2683, Main Thread] WARNING: Cannot set transparency mode on non-popup windows.: file /builds/worker/workspace/build/src/widget/gtk/nsWindow.cpp, line 4110
[task 2019-04-01T06:44:48.190Z] 06:44:48 INFO - GECKO(2683) | ++DOMWINDOW == 13 (0x7f51ca27a400) [pid = 2762] [serial = 22] [outer = 0x7f51ca2b2c40]
[task 2019-04-01T06:44:48.287Z] 06:44:48 INFO - GECKO(2683) | [Parent 2683, Main Thread] WARNING: Cannot set transparency mode on non-popup windows.: file /builds/worker/workspace/build/src/widget/gtk/nsWindow.cpp, line 4110
[task 2019-04-01T06:44:48.507Z] 06:44:48 INFO - GECKO(2683) | --DOMWINDOW == 12 (0x7f51ca9d9c00) [pid = 2762] [serial = 17] [outer = (nil)] [url = http://mochi.test:8888/tests/SimpleTest/iframe-between-tests.html]
[task 2019-04-01T06:44:48.509Z] 06:44:48 INFO - GECKO(2683) | --DOMWINDOW == 11 (0x7f51ca9d6c00) [pid = 2762] [serial = 15] [outer = (nil)] [url = http://mochi.test:8888/tests/SimpleTest/iframe-between-tests.html]
[task 2019-04-01T06:44:48.547Z] 06:44:48 INFO - GECKO(2683) | [Child 2762, Main Thread] WARNING: '!window', file /builds/worker/workspace/build/src/dom/cache/CacheStorage.cpp, line 572
[task 2019-04-01T06:44:48.565Z] 06:44:48 INFO - GECKO(2683) | [Child 2762, Main Thread] WARNING: '!window', file /builds/worker/workspace/build/src/dom/cache/CacheStorage.cpp, line 572
[task 2019-04-01T06:44:48.687Z] 06:44:48 INFO - GECKO(2683) | console.error: PushServiceWebSocket:
[task 2019-04-01T06:44:48.689Z] 06:44:48 INFO - GECKO(2683) | handleHelloReply: Unexpected state
[task 2019-04-01T06:44:48.689Z] 06:44:48 INFO - GECKO(2683) | 3
[task 2019-04-01T06:44:48.689Z] 06:44:48 INFO - GECKO(2683) | (expected STATE_WAITING_FOR_HELLO)
[task 2019-04-01T06:44:48.714Z] 06:44:48 INFO - TEST-INFO | started process screentopng
[task 2019-04-01T06:44:49.484Z] 06:44:49 INFO - TEST-INFO | screentopng: exit 0
[task 2019-04-01T06:44:49.484Z] 06:44:49 INFO - Buffered messages logged at 06:44:48
[task 2019-04-01T06:44:49.485Z] 06:44:49 INFO - TEST-PASS | dom/push/test/test_multiple_register_during_service_activation.html | getEndpoint should return null when app not subscribed.
[task 2019-04-01T06:44:49.487Z] 06:44:49 INFO - Buffered messages finished
[task 2019-04-01T06:44:49.487Z] 06:44:49 INFO - TEST-UNEXPECTED-FAIL | dom/push/test/test_multiple_register_during_service_activation.html | could not register for push notification
[task 2019-04-01T06:44:49.488Z] 06:44:49 INFO - SimpleTest.ok@SimpleTest/SimpleTest.js:275:18
[task 2019-04-01T06:44:49.490Z] 06:44:49 INFO - subscribe/<@dom/push/test/test_multiple_register_during_service_activation.html:53:9
[task 2019-04-01T06:44:49.490Z] 06:44:49 INFO - promise callbacksubscribe@dom/push/test/test_multiple_register_during_service_activation.html:49:8
[task 2019-04-01T06:44:49.491Z] 06:44:49 INFO - setupMultipleSubscriptions@dom/push/test/test_multiple_register_during_service_activation.html:65:7
[task 2019-04-01T06:44:49.492Z] 06:44:49 INFO - runTest/</<@dom/push/test/test_multiple_register_during_service_activation.html:97:20
[task 2019-04-01T06:44:49.493Z] 06:44:49 INFO - promise callback
runTest/<@dom/push/test/test_multiple_register_during_service_activation.html:97:10
[task 2019-04-01T06:44:49.493Z] 06:44:49 INFO - promise callbackrunTest@dom/push/test/test_multiple_register_during_service_activation.html:95:6
[task 2019-04-01T06:44:49.494Z] 06:44:49 INFO - @dom/push/test/test_multiple_register_during_service_activation.html:106:58
[task 2019-04-01T06:44:49.495Z] 06:44:49 INFO - promise callback
@dom/push/test/test_multiple_register_during_service_activation.html:106:48
[task 2019-04-01T06:44:49.496Z] 06:44:49 INFO - Not taking screenshot here: see the one that was previously logged
[task 2019-04-01T06:44:49.502Z] 06:44:49 INFO - TEST-UNEXPECTED-FAIL | dom/push/test/test_multiple_register_during_service_activation.html | could not register for push notification

Andrew are there any updates here? Should we disable this?

Flags: needinfo?(overholt)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

If you'd like to disable until we land bug 1231208, I think that'd be ok.

Flags: needinfo?(overholt)

Okay Andrew.

Cosmin, please take a look at the above comments.

Flags: needinfo?(csabou)
Flags: needinfo?(csabou)
Keywords: leave-open
Whiteboard: [stockwell needswork:owner] → [stockwell disabled]
Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/846780f84ca9
Disable test_multiple_register_during_service_activation.html on linux, mac and windows for frequent failures. r=jmaher

Hi :perry, should we re-anable this since SW e10s is enabled in Nightly quite for a while now?

Flags: needinfo?(perry)

Yeah, that sounds like a good idea.

Flags: needinfo?(perry)

The test was disabled 8 months ago. Re-enabling to see if this intermittent
still exists.

Assignee: nobody → perry
Pushed by pjiang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fcbd865d22bc
re-enable test_multiple_register_during_service_activation.html r=asuth

Perry: Could you please take a look at this bug? There a big spike in failures since you re-enabled the test in the last patch.

Retrigger link:
https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=e10s-7&tochange=fcbd865d22bc4301fd560663249e23e922191a89&fromchange=9160b261701469d3692cf1662fb82b99088b7380

Flags: needinfo?(perry)
Whiteboard: [stockwell disabled] → [stockwell disabled] [retriggered]
Whiteboard: [stockwell disabled] [retriggered] → [retriggered][stockwell needswork:owner]

Joel, hi, do we backout?

There are currently 117 failures in the last 3 days.

Flags: needinfo?(jmaher)

yeah, I vote for a backout; I don't see evidence that this test was passing on try prior to re-enabling it.

Flags: needinfo?(jmaher)
Whiteboard: [retriggered][stockwell needswork:owner] → [retriggered][stockwell disabled]

Don't think I'll be able to get to this in the near future.

Assignee: perry → nobody
Flags: needinfo?(perry)
You need to log in before you can comment on or make changes to this bug.