Recover clients from a broken push endpoint

RESOLVED FIXED in Firefox 57

Status

()

Firefox for Android
Firefox Accounts
P1
normal
RESOLVED FIXED
7 months ago
2 months ago

People

(Reporter: Grisha, Assigned: eoger)

Tracking

unspecified
Firefox 57
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox57 fixed)

Details

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(1 attachment)

(Reporter)

Description

7 months ago
Mark's phone got into an interesting state. It has a good FxA device registration with a valid autopush channel. However, push Send Tabs aren't working at all. When manually pushing to its endpoint, however, I'm getting a 410 error:

{"code": 410, "errno": 999, "error": "", "more_info": "http://autopush.readthedocs.io/en/latest/http.html#error-codes", "message": "GCM recipient not available."}

1) It would be good to figure out how a device can get into this state, and
2) We need a decent strategy to recover from errors like these.
3) How many other clients are affected by this? My assumption is that the % is high enough to warrant this work.

Recovering itself is likely pretty straightforward. Get fresh GCM token, obtain new UAID/chid from autopush, and write them to our FxA device record.

To determine that we're in a bad state, we could take a few approaches:

a) FxA server knows of 410, as it should see them while handling `devices/notify` calls. We should be able to set a "push registration invalid" flag on the device record. With changes in Bug 1351805, we're actually fetching device list on every sync, and so we will see that flag very soon, and will be able to recover. Alternatively we can clear push registration info from the device record, and have the clients check for their presence and act accordingly. I'd much prefer an explicit flag, however.

b) If we wanted to follow a client-only solution, it will look like posting to our own endpoint, and determining if the post failed. This will have to be done periodically - and infrequently - and so the recovery could take days or weeks. Moreso, it will waste bunch of 
clients' efforts as our assumption is that most of the time everything is good.

c) Get autopush to add an endpoint to check if client's state is valid. This will likely involve actually pushing to that endpoint on the server side, and so it suffers from all of the negatives from the option (b).

I think FxA server changes coupled with client changes (option a) should do the trick.


P.S.: We should have a Push component for fennec bugs.

Comment 1

7 months ago
> FxA server knows of 410, as it should see them while handling `devices/notify` calls.

The server does indeed see them, and IIRC it removes the now-known-to-be-invalid push URL from the device record when it sees one.  A device might already be able to infer whether it's in a bad state: look for its own record in the device list, see whether the push parameters match the ones it has locally, and if they're missing or different then it should re-register itself.

I'm also +1 to an explicit flag if it'll make that easier or more reliable.

> If we wanted to follow a client-only solution

IMHO we should try to do fewer client-only solutions as a rule, especially for things like "help the client detect it's in a bad state".  Easily-updatable server-side smarts FTW in these cases :-)

> it will look like posting to our own endpoint, and determining if the post failed.

Note that this is also true for the server-side solution, we'll only discover that the push endpoint has gone bad when we try to push something to it, and fail.  I suppose we could try to implement some automated freshness-checking on the server side, which might be better than doing it on the client.
(Reporter)

Comment 2

7 months ago
From IRC talking to Ben, "out of XXXXk in the past 24 hours using the firefox fxa VAPID key, 410's were 4% of all bridged endpoint deliveries". So I would say this is definitely high enough to care.

Ryan, I'm happy to push as much logic to the server side as is feasible :-)

I'm not sure if freshness checks are necessary - but if we can perform them without too much server costs, that will only help with getting higher reliability of the service.

I've missed the clearing of push info on errors while perusing server code. That's great! I do think that adding an explicit flag would be very nice, as it will let us figure out how many devices are affected, makes the state very explicit, etc. And if we do have a flag, it seems that clearing push info wouldn't be necessary (maybe it'll be useful for debugging?) - unless there _are_ existing clients who depend on that behaviour.

It also highlights that our clients (at least Android) are not checking their own device state, and so we have divergences like these. Bug 1351805 is nice in that it gets us closer an end-to-end mechanism to keep the system healthy, and we should get iOS to follow suite.

Comment 3

7 months ago
> out of XXXXk in the past 24 hours using the firefox fxa VAPID key

Side note, but wooo, I'm glad to see that VAPID key being useful!

> I've missed the clearing of push info on errors while perusing server code.

For reference: https://github.com/mozilla/fxa-auth-server/blob/master/lib/push.js#L386

> I do think that adding an explicit flag would be very nice

In the past we've also talked about using a flag for other things, e.g. the device needing to re-auth itself due to a password change or reset.  Without wanting to pile on to this bug, we should consider whether there are other places where such status flags might be useful to the client.
(Reporter)

Updated

7 months ago
Depends on: 1351805

Comment 4

7 months ago
fxa-auth-server bug for adding the new flag: https://github.com/mozilla/fxa-auth-server/issues/1873
(Reporter)

Updated

4 months ago
See Also: → bug 1389175
Comment hidden (mozreview-request)
(Assignee)

Comment 6

3 months ago
Adding tests in a 2nd iteration if the approach in this patch makes sense.
Assignee: nobody → eoger
Status: NEW → ASSIGNED
Priority: -- → P1
(Reporter)

Comment 7

3 months ago
Is the server-side support for this ready?
Flags: needinfo?(eoger)

Comment 8

3 months ago
Barring deployment problems, we should have the server-side support out in train-95 sometime next week; adding Bug 1397244 as a blocker for tracking purposes.
Depends on: 1397244
Flags: needinfo?(eoger)
(Reporter)

Comment 9

3 months ago
mozreview-review
Comment on attachment 8900425 [details]
Bug 1359279 - Renew GCM token/Push registration/FxA Registration on push registration expired.

https://reviewboard.mozilla.org/r/171770/#review182528

This looks generally OK, but see my semi-paranoid comments below.

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/devices/FxAccountDevice.java:33
(Diff revision 1)
>    public final String pushAuthKey;
> +  public final Boolean pushEndpointExpired;
>  
>    public FxAccountDevice(String name, String id, String type, Boolean isCurrentDevice,
>                           Long lastAccessTime, String pushCallback, String pushPublicKey,
> -                         String pushAuthKey) {
> +                         String pushAuthKey, Boolean pushEndpointExpired) {

You never check if this is `null`. Should you? You're certainly passing nulls into the constructor. I'd be in favor removing this from constructor's signature.

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/devices/FxAccountDeviceListUpdater.java:54
(Diff revision 1)
>          final ContentValues[] insertValues = new ContentValues[result.length];
>  
>          final long now = System.currentTimeMillis();
>          for (int i = 0; i < result.length; i++) {
>              final FxAccountDevice fxADevice = result[i];
> +            if (fxADevice.isCurrentDevice && fxADevice.pushEndpointExpired) {

Here, we just fetched device list, encountered our own record, and we see that it has a bad push endpoint.

Ideally, this should be the point at which we initiate our re-registration.

Otherwise, we're essentially waiting until the next sync to do this (which could be a while), and in the meantime our push won't work.

But, see my note below about rate limiting this...

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/devices/FxAccountDeviceListUpdater.java:64
(Diff revision 1)
>              deviceValues.put(RemoteDevices.TYPE, fxADevice.type);
>              deviceValues.put(RemoteDevices.NAME, fxADevice.name);
>              deviceValues.put(RemoteDevices.IS_CURRENT_DEVICE, fxADevice.isCurrentDevice);
>              deviceValues.put(RemoteDevices.DATE_CREATED, now);
>              deviceValues.put(RemoteDevices.DATE_MODIFIED, now);
> -            // TODO: Remove that line once FxA sends lastAccessTime all the time.
> +            deviceValues.put(RemoteDevices.LAST_ACCESS_TIME, fxADevice.lastAccessTime);

I assume there were server-side changes to justify this?

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/devices/FxAccountDeviceListUpdater.java:121
(Diff revision 1)
>      }
> +
> +    // Updates the list of remote devices, and also renews our push registration if the list provider
> +    // tells us it's expired.
> +    public void updateAndMaybeRenewRegistration(final Context context) {
> +        this.update();

Ah, we're using a synchronous client. Perhaps that's worth calling out in a comment, here and where we actually get the synch client above.

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/devices/FxAccountDeviceListUpdater.java:122
(Diff revision 1)
> +
> +    // Updates the list of remote devices, and also renews our push registration if the list provider
> +    // tells us it's expired.
> +    public void updateAndMaybeRenewRegistration(final Context context) {
> +        this.update();
> +        if (this.localDevicePushEndpointExpired) {

did you mean to say "not expired"?

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/devices/FxAccountDeviceListUpdater.java:129
(Diff revision 1)
> +        }
> +        ThreadUtils.postToBackgroundThread(new Runnable() {
> +            @Override
> +            public void run() {
> +                try {
> +                    FxAccountDeviceListUpdater.this.renewPushRegistration(context);

I'm worried if these two operations are racy. We need for autopush registration to finish before we can re-register our device. Can you double-check that PushService.onRefresh is entirely synchronous? Otherwise, we'd need a callback mechanism.

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/devices/FxAccountDeviceListUpdater.java:140
(Diff revision 1)
> +        });
> +    }
> +
> +    private void renewPushRegistration(Context context) throws ClassNotFoundException, NoSuchMethodException,
> +                                                InvocationTargetException, IllegalAccessException {
> +        final Class<?> pushService = Class.forName("org.mozilla.gecko.push.PushService");

What about `PushService.getInstance(context).onRefresh();` ?

::: mobile/android/services/src/main/java/org/mozilla/gecko/fxa/devices/FxAccountDeviceListUpdater.java:144
(Diff revision 1)
> +                                                InvocationTargetException, IllegalAccessException {
> +        final Class<?> pushService = Class.forName("org.mozilla.gecko.push.PushService");
> +        final Method getInstance = pushService.getMethod("getInstance", Context.class);
> +        final Object instance = getInstance.invoke(null, context);
> +        final Method onRefresh = pushService.getMethod("onRefresh");
> +        onRefresh.invoke(instance);

Roughly, this clears out stored GCM token from our internal cache, asks for a new one, and kicks off autopush registration.

It's not guaranteed this will actually successfully repair our broken client, we just hope that it will. E.g., it's possible we'll get the same bad GCM token again.

And so we need to make a choice of how we want to handle a case where our recovery didn't work. Server-side flag will be set in response to a user action (right?), and so it's unlikely we'll be attempting this recovery very frequently.

But, let's be cautious and rate limit the repair operation. It's an expensive thing to do, and if something goes bonkers on the server side we don't want clients to go into a loop - _especially_ important if you refactor around my suggestion of kicking off recovery as soon as we determine we're broken.
Attachment #8900425 - Flags: review?(gkruglov)
(Reporter)

Updated

3 months ago
See Also: → bug 1398293
(Assignee)

Comment 10

2 months ago
mozreview-review-reply
Comment on attachment 8900425 [details]
Bug 1359279 - Renew GCM token/Push registration/FxA Registration on push registration expired.

https://reviewboard.mozilla.org/r/171770/#review182528

> You never check if this is `null`. Should you? You're certainly passing nulls into the constructor. I'd be in favor removing this from constructor's signature.

The fxa servers actually already respond with that field, I think this is ok.

> Here, we just fetched device list, encountered our own record, and we see that it has a bad push endpoint.
> 
> Ideally, this should be the point at which we initiate our re-registration.
> 
> Otherwise, we're essentially waiting until the next sync to do this (which could be a while), and in the meantime our push won't work.
> 
> But, see my note below about rate limiting this...

Like we discussed on our IRC conversation, the re-registration happens immediately after that refresh since everything is synchronous.

> I assume there were server-side changes to justify this?

Yup, it landed a while ago

> I'm worried if these two operations are racy. We need for autopush registration to finish before we can re-register our device. Can you double-check that PushService.onRefresh is entirely synchronous? Otherwise, we'd need a callback mechanism.> What about `PushService.getInstance(context).onRefresh();` ?

I wish... But we're not in the same jar sadly :(
Comment hidden (mozreview-request)
(Reporter)

Comment 12

2 months ago
mozreview-review
Comment on attachment 8900425 [details]
Bug 1359279 - Renew GCM token/Push registration/FxA Registration on push registration expired.

https://reviewboard.mozilla.org/r/171770/#review186638

Stamp!
Attachment #8900425 - Flags: review?(gkruglov) → review+
Comment hidden (mozreview-request)

Comment 14

2 months ago
Pushed by eoger@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/53e599ac1ed7
Renew GCM token/Push registration/FxA Registration on push registration expired. r=Grisha
(Assignee)

Updated

2 months ago
Depends on: 1401749
Backed out for Android linting failure:

https://hg.mozilla.org/integration/autoland/rev/e0d1b299d389229ca7d5f66dcb13663824314583

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=53e599ac1ed77ef24763069261996f916255edc2&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=132313628&repo=autoland
Flags: needinfo?(eoger)
Comment hidden (mozreview-request)

Comment 17

2 months ago
Pushed by eoger@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4536d1692a1f
Renew GCM token/Push registration/FxA Registration on push registration expired. r=Grisha
https://hg.mozilla.org/mozilla-central/rev/4536d1692a1f
Status: ASSIGNED → RESOLVED
Last Resolved: 2 months ago
status-firefox57: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 57
(Assignee)

Updated

2 months ago
Flags: needinfo?(eoger)
You need to log in before you can comment on or make changes to this bug.