Closed Bug 1180129 Opened 10 years ago Closed 10 years ago

Loop clients aren't receiving push notifications (no joined conversation notifications / no direct calls)

Categories

(Cloud Services :: Operations: Miscellaneous, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: standard8, Assigned: oremj)

Details

I've just had it reported to me that Loop clients aren't receiving push notifications. I have reproduced this locally with 38.0.1 and a Nightly build (42.0a1). I'm trying to narrow it down further at the moment.
QA Contact: alexandra.lucinet
The production loop-server uses https://push1.push.hello.firefox.com/ and isn't working.
The development loop-server uses https://push.services.mozilla.com/ and works fine.
More testing reveals it is the push servers that aren't working. I'm turning on debugging for Loop (via loop.debug.loglevel -> "All") and getting the push URL. Then I'm doing:

curl -X PUT -d "version=123456799" https://updates-push1.push.hello.firefox.com/update/NqW6soJDKnfpjJTHrT5yI3AA7FtF-6Km3vvMSSK24kKNZ55ylyj6KIHyA3GKDYNcn6-dToq6Q0KH7237ALA8PpjrKs5qX_ecERQzEXIRJEGfKMHoQQ==

With the development push servers, this logs that the notification has been received; with the production servers, nothing happens.
$ http PUT "https://updates-push1.push.hello.firefox.com/update/NqW6soJDKnfpjJTHrT5yI3AA7FtF-6Km3vvMSSK24kKNZ55ylyj6KIHyA3GKDYNcn6-dToq6Q0KH7237ALA8PpjrKs5qX_ecERQzEXIRJEGfKMHoQQ==" version=123456789 -v --form
PUT /update/NqW6soJDKnfpjJTHrT5yI3AA7FtF-6Km3vvMSSK24kKNZ55ylyj6KIHyA3GKDYNcn6-dToq6Q0KH7237ALA8PpjrKs5qX_ecERQzEXIRJEGfKMHoQQ== HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 17
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Host: updates-push1.push.hello.firefox.com
User-Agent: HTTPie/0.9.2

version=123456789

HTTP/1.1 202 Accepted
Connection: keep-alive
Content-Length: 2
Content-Type: application/json
Date: Fri, 03 Jul 2015 08:34:49 GMT

{}
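For anyone who wants to script the same check, here is a minimal Python sketch of the form-encoded PUT that the curl and HTTPie commands above send. The function names `build_update_request` and `send_update` are made up for illustration, and the endpoint is whatever push URL the Loop debug log prints. Note that the production server answered 202 Accepted in the trace above while clients still saw nothing, so the status code alone does not show that a notification was delivered.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_update_request(endpoint: str, version: int) -> Request:
    """Build the PUT with a form-encoded 'version' field, as in the curl command.

    Hypothetical helper: 'endpoint' is the push URL taken from the Loop debug log.
    """
    body = urlencode({"version": version}).encode("ascii")
    return Request(
        endpoint,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def send_update(endpoint: str, version: int, timeout: float = 10.0) -> int:
    """Send the update and return the HTTP status code.

    A 2xx status only means the server accepted the request; whether the
    client was notified still has to be checked in the client's own log.
    """
    with urlopen(build_update_request(endpoint, version), timeout=timeout) as resp:
        return resp.status
```

Calling `send_update(push_url, 123456789)` against each environment and comparing what the client logs reproduces the manual test.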
I've sent out pages to both Hello and SimplePush, but no response as yet.
Flags: needinfo?(tblow)
Flags: needinfo?(oremj)
Flags: needinfo?(bobm)
Summary: Loop clients aren't receiving push notifications → Loop clients aren't receiving push notifications (no joined conversation notifications / no direct calls)
Timeline:
2346 - service started flapping, pagerduty started sending alerts
0530 - my phone exploded with about 100 text messages
0535 - started working on the problem, looked overloaded, bumped the node count
0545 - noticed etcd cluster was not working
0612 - etcd cluster fixed
0615 - reset connections, service became overloaded with reconns
0630 - dropped node count back down to 18
0650 - spun up new cluster
0655 - dropped node count down to 10
0711 - node count back up to 20
0715 - started shutting down old cluster
0725 - resolved
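To put numbers on the timeline above, here is a small Python sketch that computes elapsed minutes between two of its entries. It assumes all times are in the same timezone and that the incident spans a single night, so an end time that is numerically earlier than the start is taken to be on the following day. `minutes_between` is an illustrative helper, not part of any tooling mentioned in this bug.

```python
def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both given as 'HHMM'.

    Assumes end is less than 24 hours after start, so wrapping past
    midnight is handled with a modulo over one day.
    """
    def to_minutes(hhmm: str) -> int:
        return int(hhmm[:2]) * 60 + int(hhmm[2:])

    return (to_minutes(end) - to_minutes(start)) % (24 * 60)


# First alert to resolution, first response to resolution,
# and etcd noticed broken to etcd fixed.
print(minutes_between("2346", "0725"))  # 459 (7 h 39 min)
print(minutes_between("0530", "0725"))  # 115
print(minutes_between("0545", "0612"))  # 27
```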
Flags: needinfo?(tblow)
Flags: needinfo?(oremj)
Flags: needinfo?(bobm)
Assignee: nobody → oremj
Issues:
- Since the service was flapping, the pagerduty incidents kept opening and closing, so they never reached the escalation state that would have paged the secondary (Benson).
- This is the second time etcd has randomly failed, and it typically takes 20+ minutes to fix. We need to either move loop push to autopush or make dynamodb the backend for pushgo.
- It took a long time for all of the clients to reconnect to the cluster. This is likely due to the fan-out method loop push uses to notify clients. The situation seemed worse with more nodes.
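One possible way to catch the flapping case described above is to escalate on how often an alert opens, rather than on how long a single incident stays open. The following is a rough Python sketch of that idea only. `FlapDetector`, its thresholds, and the timestamps are all hypothetical, and it does not use or represent any PagerDuty API or configuration.

```python
from collections import deque


class FlapDetector:
    """Escalate when an alert opens too many times within a sliding window.

    Hypothetical sketch: max_opens and window_seconds are illustrative
    defaults, not values taken from the incident.
    """

    def __init__(self, max_opens: int = 3, window_seconds: float = 1800.0) -> None:
        self.max_opens = max_opens
        self.window_seconds = window_seconds
        self._opens = deque()  # timestamps (seconds) of recent incident openings

    def record_open(self, timestamp: float) -> bool:
        """Record an incident opening; return True if the secondary should be paged."""
        self._opens.append(timestamp)
        # Drop openings that have fallen out of the sliding window.
        while self._opens and timestamp - self._opens[0] > self.window_seconds:
            self._opens.popleft()
        return len(self._opens) >= self.max_opens


detector = FlapDetector(max_opens=3, window_seconds=600)
for t in (0, 120, 240):
    # Each incident may auto-resolve quickly, but the openings still accumulate.
    should_escalate = detector.record_open(t)
print(should_escalate)  # True: three openings within ten minutes
```

With a rule like this, an alert that repeatedly opens and resolves would still escalate, even though no single incident stays open long enough to hit a time-based escalation policy.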
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED