Closed Bug 1180129 Opened 4 years ago Closed 4 years ago

Loop clients aren't receiving push notifications (no joined conversation notifications / no direct calls)

Categories

(Cloud Services :: Operations, task, blocker)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: standard8, Assigned: oremj)

Details

I've just had it reported to me that loop clients aren't receiving push notifications.

I have reproduced this locally with 38.0.1 and nightly build (42.0a1).

I'm trying to narrow down more information at the moment.
QA Contact: alexandra.lucinet
The production loop-server uses https://push1.push.hello.firefox.com/ and isn't working.

The development loop-server uses https://push.services.mozilla.com/ and works fine.
More testing reveals it is the push servers that aren't working.

I'm turning on debugging for Loop (via loop.debug.loglevel -> "All"), and getting the push url. Then I'm doing:

curl -X PUT -d "version=123456799" https://updates-push1.push.hello.firefox.com/update/NqW6soJDKnfpjJTHrT5yI3AA7FtF-6Km3vvMSSK24kKNZ55ylyj6KIHyA3GKDYNcn6-dToq6Q0KH7237ALA8PpjrKs5qX_ecERQzEXIRJEGfKMHoQQ==

With the development push servers, this causes the client to log that the notification has been received. With the production servers, nothing is logged.
$ http PUT "https://updates-push1.push.hello.firefox.com/update/NqW6soJDKnfpjJTHrT5yI3AA7FtF-6Km3vvMSSK24kKNZ55ylyj6KIHyA3GKDYNcn6-dToq6Q0KH7237ALA8PpjrKs5qX_ecERQzEXIRJEGfKMHoQQ==" version=123456789 -v --form
PUT /update/NqW6soJDKnfpjJTHrT5yI3AA7FtF-6Km3vvMSSK24kKNZ55ylyj6KIHyA3GKDYNcn6-dToq6Q0KH7237ALA8PpjrKs5qX_ecERQzEXIRJEGfKMHoQQ== HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 17
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Host: updates-push1.push.hello.firefox.com
User-Agent: HTTPie/0.9.2

version=123456789

HTTP/1.1 202 Accepted
Connection: keep-alive
Content-Length: 2
Content-Type: application/json
Date: Fri, 03 Jul 2015 08:34:49 GMT

{}
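For repeated checks, the manual PUT above could be scripted. The following is only a minimal sketch using the Python standard library; the function names `build_update_request` and `send_update` are made up for illustration, and the endpoint must be the push URL taken from the Loop debug log. Note that, as the transcript above suggests, a 202 response only means the push server accepted the request; whether the client actually received the notification still has to be confirmed in the Loop log.

```python
import urllib.parse
import urllib.request


def build_update_request(endpoint: str, version: int) -> urllib.request.Request:
    """Build the form-encoded PUT that bumps the version on a push endpoint.

    `endpoint` is the full push URL copied from the Loop debug log.
    """
    body = urllib.parse.urlencode({"version": version}).encode("ascii")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def send_update(endpoint: str, version: int, timeout: float = 10.0) -> int:
    """Send the update and return the HTTP status code (202 in the transcript above).

    A non-2xx response raises urllib.error.HTTPError.
    """
    request = build_update_request(endpoint, version)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_update(push_url, 123456789)` with the push URL from the debug log should mirror the curl and HTTPie invocations above.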
I've sent out pages to both Hello and SimplePush, but no response as yet.
Flags: needinfo?(tblow)
Flags: needinfo?(oremj)
Flags: needinfo?(bobm)
Summary: Loop clients aren't receiving push notifications → Loop clients aren't receiving push notifications (no joined conversation notifications / no direct calls)
Timeline:

2346 - service started flapping, pagerduty started sending alerts
0530 - my phone exploded with about 100 text messages
0535 - started working on the problem, looked overloaded, bumped the node count
0545 - noticed etcd cluster was not working
0612 - etcd cluster fixed
0615 - reset connections, service became overloaded with reconns
0630 - dropped node count back down to 18
0650 - spun up new cluster
0655 - dropped node count down to 10
0711 - node count back up to 20
0715 - started shutting down old cluster
0725 - resolved
Flags: needinfo?(tblow)
Flags: needinfo?(oremj)
Flags: needinfo?(bobm)
Assignee: nobody → oremj
Issues:

Because the service was flapping, the PagerDuty incidents kept opening and closing, so none of them stayed open long enough to reach the escalation state that would have paged the secondary (Benson).
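The escalation gap described above can be illustrated with a small simulation. This is a sketch under assumed numbers: the 30-minute escalation delay and the flapping pattern (an incident opening every 10 minutes and auto-resolving after 4) are hypothetical and are not taken from the actual PagerDuty configuration or alert history.

```python
from typing import List, Tuple

# An incident is (opened_minute, resolved_minute), measured from the first alert.
Incident = Tuple[int, int]


def escalated(incidents: List[Incident], escalation_delay: int) -> List[Incident]:
    """Return the incidents that stayed open long enough to escalate."""
    return [
        (opened, resolved)
        for opened, resolved in incidents
        if resolved - opened >= escalation_delay
    ]


def total_open_minutes(incidents: List[Incident]) -> int:
    """Sum the time spent with an incident open, across all incidents."""
    return sum(resolved - opened for opened, resolved in incidents)


# Hypothetical flapping pattern over two hours: opens every 10 minutes,
# auto-resolves 4 minutes later.
flapping = [(start, start + 4) for start in range(0, 120, 10)]

# Hypothetical sustained outage over the same two hours.
sustained = [(0, 120)]

ESCALATION_DELAY = 30  # minutes; assumed value

print(escalated(flapping, ESCALATION_DELAY))   # no flapping incident escalates
print(total_open_minutes(flapping))            # yet the service alerted for a long time
print(escalated(sustained, ESCALATION_DELAY))  # a single long incident would escalate
```

Under these assumptions the flapping service spends 48 minutes in an alerting state without ever triggering an escalation, while one continuous incident of the same overall span would have paged the secondary.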

This is the second time etcd has randomly failed, and it typically takes 20+ minutes to fix. We need to either move loop push to autopush or make DynamoDB the backend for pushgo.

It took a long time for all of the clients to reconnect to the cluster. This is likely due to the fan-out method loop push uses to notify clients. The situation seemed to get worse with more nodes.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED