Closed
Bug 1180129
Opened 10 years ago
Closed 10 years ago
Loop clients aren't receiving push notifications (no joined conversation notifications / no direct calls)
Categories
(Cloud Services :: Operations: Miscellaneous, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: standard8, Assigned: oremj)
Details
I've just had it reported to me that loop clients aren't receiving push notifications.
I have reproduced this locally with 38.0.1 and nightly build (42.0a1).
I'm trying to narrow down more information at the moment.
Updated • 10 years ago
QA Contact: alexandra.lucinet
Comment 1 • 10 years ago (Reporter)
The production loop-server uses https://push1.push.hello.firefox.com/ and isn't working.
The development loop-server uses https://push.services.mozilla.com/ and works fine.
Comment 2 • 10 years ago (Reporter)
More testing reveals it is the push servers that aren't working.
I turned on debug logging for Loop (loop.debug.loglevel set to "All") to get the push URL, then ran:
curl -X PUT -d "version=123456799" https://updates-push1.push.hello.firefox.com/update/NqW6soJDKnfpjJTHrT5yI3AA7FtF-6Km3vvMSSK24kKNZ55ylyj6KIHyA3GKDYNcn6-dToq6Q0KH7237ALA8PpjrKs5qX_ecERQzEXIRJEGfKMHoQQ==
With the development push servers, this causes the client to log that the notification has been received; with the production servers, nothing happens.
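For anyone who wants to repeat this check from a script rather than curl, here is a minimal Python sketch of the same PUT. The function names (build_update_request, send_update) are made up for illustration, and the endpoint is whatever push URL the Loop debug log prints. Note that a 2xx status only tells you the server accepted the update, not that the client was notified, so the client log still has to be checked.

```python
import urllib.request
from urllib.parse import urlencode


def build_update_request(endpoint: str, version: int) -> urllib.request.Request:
    """Build (but do not send) a form-encoded PUT carrying a new version number.

    Mirrors `curl -X PUT -d "version=..." <endpoint>`; names here are illustrative.
    """
    body = urlencode({"version": version}).encode("ascii")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def send_update(endpoint: str, version: int, timeout: float = 10.0) -> int:
    """Send the update and return the HTTP status code.

    A 2xx status means the server accepted the request; it does not prove
    that the notification reached the client.
    """
    request = build_update_request(endpoint, version)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling send_update(push_url, 123456799) with the push URL from the debug log should behave like the curl command above; bump the version on each call so the client sees a change.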
Comment 3 • 10 years ago
$ http PUT "https://updates-push1.push.hello.firefox.com/update/NqW6soJDKnfpjJTHrT5yI3AA7FtF-6Km3vvMSSK24kKNZ55ylyj6KIHyA3GKDYNcn6-dToq6Q0KH7237ALA8PpjrKs5qX_ecERQzEXIRJEGfKMHoQQ==" version=123456789 -v --form

PUT /update/NqW6soJDKnfpjJTHrT5yI3AA7FtF-6Km3vvMSSK24kKNZ55ylyj6KIHyA3GKDYNcn6-dToq6Q0KH7237ALA8PpjrKs5qX_ecERQzEXIRJEGfKMHoQQ== HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 17
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Host: updates-push1.push.hello.firefox.com
User-Agent: HTTPie/0.9.2

version=123456789

HTTP/1.1 202 Accepted
Connection: keep-alive
Content-Length: 2
Content-Type: application/json
Date: Fri, 03 Jul 2015 08:34:49 GMT

{}
Comment 4 • 10 years ago (Reporter)
I've sent out pages to both Hello and SimplePush, but have had no response as yet.
Updated • 10 years ago (Reporter)
Flags: needinfo?(tblow)
Flags: needinfo?(oremj)
Flags: needinfo?(bobm)
Summary: Loop clients aren't receiving push notifications → Loop clients aren't receiving push notifications (no joined conversation notifications / no direct calls)
Comment 5 • 10 years ago (Assignee)
Timeline:
2346 - service started flapping; PagerDuty started sending alerts
0530 - my phone exploded with about 100 text messages
0535 - started working on the problem; the service looked overloaded, so I bumped the node count
0545 - noticed the etcd cluster was not working
0612 - etcd cluster fixed
0615 - reset connections; service became overloaded with reconnections
0630 - dropped node count back down to 18
0650 - spun up new cluster
0655 - dropped node count down to 10
0711 - node count back up to 20
0715 - started shutting down old cluster
0725 - resolved
Flags: needinfo?(tblow)
Flags: needinfo?(oremj)
Flags: needinfo?(bobm)
Updated • 10 years ago (Assignee)
Assignee: nobody → oremj
Comment 6 • 10 years ago (Assignee)
Issues:
Because the service was flapping, the PagerDuty incidents kept opening and closing, so they never reached the escalation state that would have paged the secondary (Benson).
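One way to catch this kind of flapping is to escalate on how often an incident opens, not on how long a single incident stays open. The sketch below is only an illustration of that idea: FlapDetector, its thresholds, and the timestamps are hypothetical, and it does not use or represent any PagerDuty API.

```python
from collections import deque


class FlapDetector:
    """Signal escalation when an alert opens too many times in a sliding window.

    Hypothetical helper: thresholds are illustrative, not tuned values.
    """

    def __init__(self, max_opens: int = 3, window_seconds: float = 1800.0):
        self.max_opens = max_opens
        self.window_seconds = window_seconds
        self._opens = deque()  # timestamps (seconds) of recent incident opens

    def record_open(self, timestamp: float) -> bool:
        """Record an incident open; return True if the secondary should be paged."""
        self._opens.append(timestamp)
        # Discard opens that have fallen out of the sliding window.
        while self._opens and timestamp - self._opens[0] > self.window_seconds:
            self._opens.popleft()
        return len(self._opens) >= self.max_opens
```

With these defaults, the third open within 30 minutes returns True even if every individual incident auto-resolved within seconds, which is the case that slipped through here.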
This is the second time etcd has randomly failed, and it typically takes 20+ minutes to fix. We need to either move loop push to autopush or make DynamoDB the backend for pushgo.
It took a long time for all of the clients to reconnect to the cluster. This is likely due to the fan-out method loop push uses to notify clients. The situation seemed worse with more nodes.
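Separately from the fan-out question, a common client-side mitigation for reconnect storms like the one at 0615 is exponential backoff with random jitter, so that clients dropped at the same moment do not all come back at once. The following is a generic sketch of that technique; it is not how the Loop clients or pushgo actually behave, and reconnect_delay and its parameters are hypothetical.

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 300.0, rng=random) -> float:
    """Return a randomized delay in seconds before reconnect attempt `attempt` (0-based).

    The upper bound doubles with each attempt up to `cap`; the delay is drawn
    uniformly between 0 and that bound ("full jitter").
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

A client would sleep for reconnect_delay(attempt) before each retry and reset attempt to 0 after a successful connection. The defaults are placeholders; real values would depend on how many clients the cluster can absorb per second.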
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED