Closed Bug 1164432 Opened 9 years ago Closed 8 years ago

Intermittent dom/push/test/test_multiple_register_different_scope.html,test_register.html,test_unregister.html | Test timed out. - expected PASS

Categories

(Core :: DOM: Notifications, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox45 --- wontfix
firefox46 --- affected
firefox47 --- affected
firefox48 --- fixed
firefox-esr38 --- unaffected
firefox-esr45 --- unaffected

People

(Reporter: nigelb, Assigned: wchen)

References

()

Details

(Keywords: intermittent-failure)

Attachments

(2 files)

builder 	Ubuntu VM 12.04 x64 fx-team opt test mochitest-push
buildid 	20150513050443
builduid 	4c039d2240c14b9aa1fc2eddfc4bd00b
results 	warnings (1)
revision 	f5d4f0d1fed915fe3f334c60dae4f8ea619b51c0
slave 	tst-linux64-spot-612
starttime 	Wed May 13 2015 17:53:15 GMT+0530 (IST)

https://treeherder.mozilla.org/logviewer.html#?job_id=3063908&repo=fx-team

05:25:35 INFO - -*- Push.js: Push Constructor
05:25:35 INFO - MEMORY STAT vsize after test: 912162816
05:25:35 INFO - MEMORY STAT vsizeMaxContiguous not supported in this build configuration.
05:25:35 INFO - MEMORY STAT residentFast after test: 179785728
05:25:35 INFO - MEMORY STAT heapAllocated after test: 82436768
05:25:35 INFO - 2 INFO TEST-OK | dom/push/test/test_has_permissions.html | took 1377ms
05:25:35 INFO - 3 INFO TEST-START | dom/push/test/test_multiple_register.html
05:25:36 INFO - MEMORY STAT vsize after test: 974409728
05:25:36 INFO - MEMORY STAT residentFast after test: 201318400
05:25:36 INFO - MEMORY STAT heapAllocated after test: 86086312
05:25:36 INFO - 4 INFO TEST-OK | dom/push/test/test_multiple_register.html | took 762ms
05:25:36 INFO - 5 INFO TEST-START | dom/push/test/test_permissions.html
05:25:36 INFO - MEMORY STAT vsize after test: 980901888
05:25:36 INFO - MEMORY STAT residentFast after test: 192638976
05:25:36 INFO - MEMORY STAT heapAllocated after test: 88559816
05:25:36 INFO - 6 INFO TEST-OK | dom/push/test/test_permissions.html | took 175ms
05:25:36 INFO - 7 INFO TEST-START | dom/push/test/test_register.html
05:25:51 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
05:31:04 INFO - Xlib: extension "RANDR" missing on display ":0".
05:31:04 INFO - TEST-INFO | screentopng: exit 0
05:31:04 INFO - 8 INFO TEST-PASS | dom/push/test/test_register.html | successful registered for push notification
05:31:04 INFO - 9 INFO TEST-UNEXPECTED-FAIL | dom/push/test/test_register.html | Test timed out. - expected PASS
05:31:05 INFO - MEMORY STAT vsize after test: 913612800
05:31:05 INFO - MEMORY STAT residentFast after test: 170483712
05:31:05 INFO - MEMORY STAT heapAllocated after test: 60760008
05:31:05 INFO - 10 INFO TEST-OK | dom/push/test/test_register.html | took 328475ms
05:31:05 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
05:31:05 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
05:31:05 INFO - 1431520265701 Browser.Experiments.Experiments TRACE Experiments #0::uninit: started
05:31:05 INFO - 1431520265701 Browser.Experiments.Experiments TRACE Experiments #0::uninit: finished with _loadTask
05:31:05 INFO - 1431520265702 Browser.Experiments.Experiments TRACE Experiments #0::uninit: no previous shutdown
05:31:05 INFO - 1431520265702 Browser.Experiments.Experiments TRACE Experiments #0::Unregistering instance with Addon Manager.
05:31:05 INFO - 1431520265706 Browser.Experiments.Experiments TRACE Experiments #0::Removing install listener from add-on manager.
05:31:05 INFO - 1431520265706 Browser.Experiments.Experiments TRACE Experiments #0::Removing addon listener from add-on manager.
05:31:05 INFO - 1431520265707 Browser.Experiments.Experiments TRACE Experiments #0::Finished unregistering with addon manager.
05:31:05 INFO - 1431520265707 Browser.Experiments.Experiments TRACE Experiments #0::Unregistering previous experiment add-on provider.
05:31:05 INFO - 1431520265711 Browser.Experiments.Experiments INFO Experiments #0::Completed uninitialization.
05:31:05 INFO - 1431520265712 Browser.Experiments.Experiments TRACE PreviousExperimentProvider #0::shutdown()
05:31:05 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
05:31:05 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
05:31:05 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel closing: too late to send/recv, messages will be lost
05:31:06 INFO - TEST-INFO | Main app process: exit 0
05:31:06 INFO - runtests.py | Application ran for: 0:05:37.353190
05:31:06 INFO - zombiecheck | Reading PID log: /tmp/tmpx8BVIrpidlog
05:31:06 INFO - Stopping web server
05:31:06 INFO - Stopping web socket server
05:31:06 INFO - Stopping ssltunnel
05:31:06 INFO - WARNING | leakcheck | refcount logging is off, so leaks can't be detected!
05:31:06 INFO - runtests.py | Running tests: end.
05:31:06 INFO - 11 INFO TEST-START | Shutdown
05:31:06 INFO - 12 INFO Passed: 11
05:31:06 WARNING - 13 INFO Failed: 1
Doug, Nikhil - Do you guys know what's going on? This is intermittent and I can see it across trees.
Flags: needinfo?(nsm.nikhil)
Flags: needinfo?(dougt)
I did some retriggers on aurora to see if this is something related to a recent push on trunk vs an infrastructure issue:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=d9093db18dc6&filter-searchStr=push
jgriffin: do we do anything special for these machines such that they can't access the public push server?
Flags: needinfo?(dougt) → needinfo?(jgriffin)
I wonder if the public server had a hiccup?  There is always the neverending talk of restricting our test machines from external network access, that hasn't been implemented yet.  We do fill our network though, so it is quite likely that we have dropped packets or really long round trips.

These failures occurred over a small window in time, quite possibly this is something unrelated to code.
We deployed a new version of the Push server yesterday afternoon. This version fixed some SSL bugs that had prevented other clients from connecting properly, so its currently taking on higher load than expected, as well as dealing with a client reconnect bug from FxOS 2.2. We're currently putting some patches into place to deal with the existing client bug that will smooth out operation.
(In reply to Doug Turner (:dougt) from comment #28)
> jgriffin: do we do anything special for these machines such that they can't
> access the public push server?

No, external network access is not blocked.
Flags: needinfo?(jgriffin)
Inactive; closing (see bug 1180138).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Something seems sick here lately.
Status: RESOLVED → REOPENED
Flags: needinfo?(dougt)
Resolution: WORKSFORME → ---
Last production push to https://push.services.mozilla.com was on 6/8/15. Local testing against that server is showing no issues, nor are we seeing any anomalies on server monitoring. (see https://app.datadoghq.com/dash/45471/autopush?live=true&page=0&is_auto=false&from_ts=1435435742237&to_ts=1438027742237&tile_size=m)

Can we have some clarification for what function or test may be causing the timeout?
I'm not so sure this is server-related. Looking at the most recent set of logs, I don't see any entries for "Socket error {errorCode}", "Request timeout: Removing {channelId}", "shutdownWS()", or "reconnectAfterBackoff()". I'd expect these for socket errors.

In comment 33 and comment 35, it looks like the tests time out waiting for a response from the parent process: I don't see any `PushService:*:KO` messages at all. Assuming that explanation is accurate, it doesn't explain why the parent doesn't respond.

In comment 34, comment 36, and comment 38, I see the following log lines:

> -*- PushClient.js: receiveMessage(): {"target":{},"name":"PushService:Register:KO","sync":false,"json":{"requestID":"{c6e0d656-3d80-4b20-9fb4-9aab676cfb01}","error":"NetworkError"},"data":{"requestID":"{c6e0d656-3d80-4b20-9fb4-9aab676cfb01}","error":"NetworkError"},"objects":{}}

And:

> -*- PushClient.js: receiveMessage(): {"target":{},"name":"PushService:Register:KO","sync":false,"json":{"requestID":"{adda18d1-f448-4a1a-8cfa-8e4466b611e4}","error":"Service not active"},"data":{"requestID":"{adda18d1-f448-4a1a-8cfa-8e4466b611e4}","error":"Service not active"},"objects":{}}

`PushServiceWebSocket.jsm` never actually rejects with `NetworkError` (`PushServiceHttp2.jsm` does, but I don't think the Mochitests tests exercise it).

My guess is the `NetworkError` is coming from here: https://github.com/mozilla/gecko-dev/blob/master/dom/push/PushService.jsm#L815-819 Maybe the client state is out of sync?

Comment 37 is odd:

> -*- PushClient.js: receiveMessage(): {"target":{},"name":"PushService:Unregister:KO","sync":false,"json":{"requestID":"{df033678-65b5-4745-8f0c-7586342c8326}"},"data":{"requestID":"{df033678-65b5-4745-8f0c-7586342c8326}"},"objects":{}}

So, `unregister` failed (https://github.com/mozilla/gecko-dev/blob/master/dom/push/PushService.jsm#L1019-1022). But there's no "unregister(): Actual error {error}" anywhere in that log!

I'll try to dig into this during the week.
ryan,

There may be a few things going on that may not be code related (e.g. a network error between the build net and AWS). Are you seeing similar issues with FxA or other remote AWS hosted services? For that matter, are there other remote hosted services you're testing against or are we the lucky first?
Flags: needinfo?(ryanvm)
In general, we specifically forbid touching outside services due to issues like these. The push tests are explicitly Tier 2 because they do. So no, I'm not seeing issues elsewhere, nor would I be in a position to do so.
Flags: needinfo?(ryanvm)
Summary: Intermittent dom/push/test/test_register.html | Test timed out. - expected PASS → Intermittent dom/push/test/test_register.html,test_unregister.html | Test timed out. - expected PASS
Summary: Intermittent dom/push/test/test_register.html,test_unregister.html | Test timed out. - expected PASS → Intermittent dom/push/test/test_multiple_register_different_scope.html,test_register.html,test_unregister.html | Test timed out. - expected PASS
disabling test_try_registering_offline_disabled.html.  Waiting on results:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=048e17379faf
Flags: needinfo?(nsm.nikhil)
Assignee: nobody → dougt
Flags: needinfo?(dougt)
Attachment #8654998 - Flags: review?(kcambridge)
Comment on attachment 8654998 [details] [diff] [review]
bug_1164432_disable_test_try_registering_offline_disabled

Review of attachment 8654998 [details] [diff] [review]:
-----------------------------------------------------------------

Let's give it a go. Thanks, Doug!
Attachment #8654998 - Flags: review?(kcambridge) → review+
Keywords: checkin-needed
url:        https://hg.mozilla.org/integration/mozilla-inbound/rev/2b215c6d87376404b2ecc1c618a98a560c5cb0cb
changeset:  2b215c6d87376404b2ecc1c618a98a560c5cb0cb
user:       Doug Turner <doug.turner@gmail.com>
date:       Mon Aug 31 12:28:04 2015 -0700
description:
Bug 1164432.  Try disabling test_try_registering_offline_disabled due to too many intermittent failures. r=kitcambridge
See Also: → 1188314
hmm. i was not expecting comment 153.
https://hg.mozilla.org/mozilla-central/rev/2b215c6d8737
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla43
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla43 → ---
Blocks: 1206969
Here we have 2 problems:

one is a duplicate of bug 1212593 - the most failure after 6 Oct, are the same as that bug

the other problem is test is timing out while waiting to unregister - the same as problem 3 in comment
https://bugzilla.mozilla.org/show_bug.cgi?id=1159963#c56
(In reply to Carsten Book [:Tomcat] from comment #156)
> https://hg.mozilla.org/mozilla-central/rev/2b215c6d8737

Maybe worth reverting this along with whatever changes need to be made to test_try_registering_offline_disabled.html to support the new mocked environment?
Flags: needinfo?(wchen)
This patch updates the test to use the mock push server and fixes a bug in the mock to be able to handle this test file.
Assignee: dougt → wchen
Flags: needinfo?(wchen)
Attachment #8732050 - Flags: review?(kcambridge)
Attachment #8732050 - Flags: review?(kcambridge) → review+
Backed out for failures.
https://hg.mozilla.org/integration/mozilla-inbound/rev/5bee898930bc

https://treeherder.mozilla.org/logviewer.html#?job_id=24165335&repo=mozilla-inbound
20:50:03     INFO -  19 INFO TEST-START | dom/push/test/test_try_registering_offline_disabled.html
20:50:03     INFO -  ++DOMWINDOW == 18 (0x7f73be4c6800) [pid = 2003] [serial = 41] [outer = 0x7f73bf92a400]
20:50:04     INFO -  TEST-INFO | started process screentopng
20:50:04     INFO -  TEST-INFO | screentopng: exit 0
20:50:04     INFO -  20 INFO TEST-PASS | dom/push/test/test_try_registering_offline_disabled.html | getEndpoint should return null when app not subscribed.
20:50:04     INFO -  21 INFO TEST-UNEXPECTED-FAIL | dom/push/test/test_try_registering_offline_disabled.html | successful registered for push notification
20:50:04     INFO -      subscribeFail/</<@dom/push/test/test_try_registering_offline_disabled.html:60:11
20:50:04     INFO -      promise callback*subscribeFail/<@dom/push/test/test_try_registering_offline_disabled.html:58:7
20:50:04     INFO -      subscribeFail@dom/push/test/test_try_registering_offline_disabled.html:57:12
20:50:04     INFO -      promise callback*runTest1/<@dom/push/test/test_try_registering_offline_disabled.html:135:7
20:50:04     INFO -      promise callback*runTest1@dom/push/test/test_try_registering_offline_disabled.html:133:12
20:50:04     INFO -      promise callback*@dom/push/test/test_try_registering_offline_disabled.html:291:3
Failures were e10s-only from the looks of it, in case that helps.
The test wasn't e10s friendly but it's a simple fix.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=abaca3cbbc51
https://hg.mozilla.org/mozilla-central/rev/c41ffc5b9b3c
Status: REOPENED → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
This still fails on Aurora. Not sure if you want to spend time looking into it or not since the test was previously disabled anyway. I'll leave that up to you :)
https://treeherder.mozilla.org/logviewer.html#?job_id=2180392&repo=mozilla-aurora

https://hg.mozilla.org/releases/mozilla-aurora/rev/332ce77a41c7
>    function changePushServerConnectionEnabled(enable) {
>      debug("changePushServerConnectionEnabled");
> -    SpecialPowers.pushPrefEnv({"set": [["dom.push.connection.enabled", enable]]},
> -                              null);
> +    SpecialPowers.setBoolPref("dom.push.connection.enabled", enable);
>    }

I'm curious about this change too :)
You need to log in before you can comment on or make changes to this bug.