Open Bug 1486010 Opened 7 years ago Updated 1 month ago

[WebPush] Firefox keeps returning an expired endpoint

Categories

(Core :: DOM: Push Subscriptions, defect, P2)

61 Branch
defect

UNCONFIRMED

People

(Reporter: collimarco91, Unassigned)

Details

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15

Steps to reproduce:

I am the founder of Pushpad. Some time ago I subscribed to web push notifications here: https://pushpad.xyz/demo

After many weeks of inactivity the subscription was expired and when I tried to send a notification to that endpoint it returned 410 Gone. This is the expected behavior.

However the problem is that the browser continues to return the old (expired) endpoint every time you call pushManager.subscribe().

Note: I could fix the issue manually by revoking the permission and granting it again: that generated a new valid endpoint and notifications started working again.

Tested on Firefox 61 on macOS 10.13.

Actual results:

The browser continues to return the old (expired) endpoint every time you call pushManager.subscribe().

Expected results:

The browser must return a valid endpoint (not expired) when you call pushManager.subscribe().
Component: Untriaged → DOM: Push Notifications
Product: Firefox → Core
Andrew, do you know who might be able to look at this?
Flags: needinfo?(overholt)
Priority: -- → P2
Maybe JR?
Flags: needinfo?(overholt) → needinfo?(jrconlin)
Huh. The autopush server should be returning a new URL. Since it's encoded, it should be distinct regardless of the values being used. I'm not sure, but this might be an issue with the local client caching the expired server URL. Lina might be able to provide more guidance.
Flags: needinfo?(jrconlin) → needinfo?(lina)
Hi Marco! Could you please post:

* Your `dom.push.userAgentID` from about:config.
* If calling `pushSubscription.unsubscribe()`, followed by `pushManager.subscribe()`, also works, or if you need to revoke and grant the permission again?
* If you remember opening https://pushpad.xyz/demo between the time you subscribed to push, and when you noticed the endpoint was expired? (To rule out the quota kicking in).
* If you ever get a `pushsubscriptionchange` event when visiting the page?
* If this happened for any other push subscriptions, or https://pushpad.xyz/demo specifically?

One case where you might get an expired endpoint is if the subscription exceeds its quota...but then calling `subscribe()` will reject with an error, not return the old endpoint.

Another case is if the server deleted the endpoint without notifying the client. In theory, the server resets all `userAgentIDs` after dropping endpoints, though JR is right: if the client's and server's list of channels doesn't match, we'll end up with problems.

Thanks!
Flags: needinfo?(lina) → needinfo?(collimarco91)
> * Your `dom.push.userAgentID` from about:config.

I don't know if I can post it publicly, so I have sent it to your email address (lina@mozilla.com)

> * If calling `pushSubscription.unsubscribe()`, followed by `pushManager.subscribe()`, also works, or if you need to revoke and grant the permission again?

The only thing that I have tried was to revoke / grant permission and it worked.

> * If you remember opening https://pushpad.xyz/demo between the time you subscribed to push, and when you noticed the endpoint was expired?

Yes, I have probably opened it other times.

> * If you ever get a `pushsubscriptionchange` event when visiting the page?

I don't know and unfortunately I cannot reproduce that now. In any case note that that page didn't handle the pushsubscriptionchange event at the time of the bug. Instead, the page calls pushManager.subscribe every time you visit it, so that any new endpoint will be sent to the application server.

> * If this happened for any other push subscriptions, or https://pushpad.xyz/demo specifically?

I don't know. It seems that it has happened once for that specific subscription.

> One case where you might get an expired endpoint is if the subscription exceeds its quota...but then calling `subscribe()` will reject with an error, not return the old endpoint.

No, I am 100% sure that it was not raising an exception and that the old subscription / endpoint was returned.
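Because the page re-sends whatever `pushManager.subscribe()` returns on every visit, the application server is in a position to notice this bug: the browser reports an endpoint that the push service has already answered with 410 Gone. A minimal server-side sketch in Python, assuming a hypothetical in-memory `SubscriptionStore` (none of these names come from Pushpad or Firefox):

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class SubscriptionStore:
    """Hypothetical in-memory store kept by the application server."""
    active: Set[str] = field(default_factory=set)
    gone: Set[str] = field(default_factory=set)

    def mark_gone(self, endpoint: str) -> None:
        """Record that the push service answered 404/410 for this endpoint."""
        self.active.discard(endpoint)
        self.gone.add(endpoint)

    def register(self, endpoint: str) -> str:
        """Handle an endpoint reported by the page after subscribe().

        Returns "stale" when the browser handed back an endpoint that is
        already known to be dead, otherwise "ok".
        """
        if endpoint in self.gone:
            return "stale"
        self.active.add(endpoint)
        return "ok"
```

On "stale" the page could try `pushSubscription.unsubscribe()` followed by `pushManager.subscribe()`. Whether that is enough, or whether the permission has to be revoked and granted again, has not been confirmed in this bug.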
Flags: needinfo?(collimarco91)

I encountered this issue again on Firefox v68.8.1 for Android.

pushManager.subscribe() returns endpoints that are already expired.

This issue still affects Firefox v85 on Android...

Severity: normal → S3

Still occurs in current releases (e.g. 116). pushsubscriptionchange works in general, but some clients get what is probably an old endpoint from PushManager.getSubscription. When you try to send a push message to such an endpoint immediately, you will get HTTP 410 errno 106 "No such subscription". Clients may miss notifications and there seems to be no complete workaround. The whole process of expiration feels flawed and cumbersome, also because of bug 1497429.

I just saw this on stable Firefox 126.0.1. The workaround of removing the notification permission and then granting it again worked.

  • Your dom.push.userAgentID from about:config.

431a5c3fb3624a558722c894b69e5b7e

Full response for future searchers:

{
    "code": 410,
    "errno": 106,
    "error": "Gone",
    "message": "No such subscription",
    "more_info": "http://autopush.readthedocs.io/en/latest/http.html#error-codes"
}

The endpoint was https://updates.push.services.mozilla.com/wpush/v2/gAAAAABklLpauzCCcydiAwZYl4PIO8PHY8vUwwsUGg9dNZ-t_t0B-jVceHFkz2NRRlJ0LnyIJrTrIR4STdWVIeWDBd5KDtm2IlyvxmEj9uWwN94_wWdsZn4_Ru9W2pe2dUjNkKDd517OO50KAhW0R2Kv1eDv9NDp0cn8Qn-L9lFjQq6gRqWZ1Tc.
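For publishers who hit the same response, a minimal Python sketch of parsing an error body shaped like the one quoted above. The `PushError` class and `parse_push_error` function are hypothetical names for illustration, and the field handling assumes only the keys visible in that response:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PushError:
    code: int
    errno: int
    message: str

    @property
    def endpoint_is_dead(self) -> bool:
        # 404 and 410 both mean the endpoint is no longer valid.
        return self.code in (404, 410)


def parse_push_error(body: str) -> PushError:
    """Parse a JSON error body with "code", "errno" and "message" keys."""
    data = json.loads(body)
    return PushError(
        code=int(data["code"]),
        errno=int(data["errno"]),
        message=str(data.get("message", "")),
    )
```

Logging `errno` alongside `code` keeps the service-specific reason (here 106) available when debugging later.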

Is this for notifications to an android device?

Do we have a way to actively check whether a given endpoint is still valid from the client side (without triggering push messages)?

Flags: needinfo?(jrconlin)

Not really?
Technically, that's what the daily check-in is for. It's more complicated for mobile devices because what can happen is that a given endpoint could be invalidated by FCM.

In theory:

  1. A device registers with FCM and gets a registration token. It then uses that registration token to register with Autopush, which stores the registration token as part of the UAID record for that device.
  2. Firefox requests a new endpoint, which is generated by Autopush and tied to the UAID record (note, Autopush doesn't check to see if that original FCM registration token is still valid, because there's really no way for us to do that either).
  3. A third party site sends a notification to Autopush, which fetches the registration token from the UAID record, and tries to deliver the message to the remote device using FCM. FCM may return with a 404 flavor of "Nope" which will cause Autopush to drop the UAID record.
  4. The mobile device would do its "daily check-in", find out that the server has no records for it, and then re-register.
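The four steps above can be modelled with a toy simulation. This is only a sketch: `FakeFCM` and `FakeAutopush` are made-up stand-ins, the endpoint URL is left unencrypted for readability, and the status codes returned by `deliver` are chosen for illustration rather than taken from the Autopush source.

```python
import uuid
from typing import Dict, Set


class FakeFCM:
    """Stand-in for FCM: knows which registration tokens are still valid."""

    def __init__(self) -> None:
        self.valid_tokens: Set[str] = set()

    def register(self) -> str:
        token = uuid.uuid4().hex
        self.valid_tokens.add(token)
        return token

    def send(self, token: str) -> int:
        return 200 if token in self.valid_tokens else 404


class FakeAutopush:
    """Stand-in for Autopush: maps UAID -> FCM registration token."""

    def __init__(self, fcm: FakeFCM) -> None:
        self.fcm = fcm
        self.uaids: Dict[str, str] = {}

    def register(self, fcm_token: str) -> str:      # step 1
        uaid = uuid.uuid4().hex
        self.uaids[uaid] = fcm_token
        return uaid

    def new_endpoint(self, uaid: str) -> str:       # step 2
        # The FCM token is deliberately not validated here.
        return f"https://push.example/wpush/{uaid}/{uuid.uuid4().hex}"

    def deliver(self, endpoint: str) -> int:        # step 3
        uaid = endpoint.split("/")[-2]
        token = self.uaids.get(uaid)
        if token is None:
            return 410
        if self.fcm.send(token) == 404:
            del self.uaids[uaid]   # FCM said "nope": drop the UAID record
            return 410
        return 201

    def check_in(self, uaid: str) -> bool:          # step 4
        """True if the server still has a record for this UAID."""
        return uaid in self.uaids
```

In this model the device keeps handing out the old endpoint between the failed delivery and the next check-in, which is one window in which a site could observe an expired endpoint.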

Some fun failure cases I can think of:

  1. I don't believe that there's a way for devices to tell FCM, "Hey, could I get a different registration ID? I think this one is broken." I know that (in theory) when Firefox on Android is restarted, there's a function call that fetches the latest registration ID, and that the browser is supposed to call the bridge token update if that ever changes. (IIRC, there's a different event that can be sent by the OS to indicate that the FCM registration token changed. Pretty sure the same code handles that as well, but :saschanaz is the expert of that domain.)
  2. On a somewhat similar note, it could well be that there's some pretty horrific churn going on for that device, where we get an FCM registration token, set things up, the token is later revoked for some reason, the message fails in transit, the UAID is removed (more messages fail in transit) and then the device registers the new token and the cycle begins again. This is the fun of complex systems.

Desktop is a bit different, since it doesn't use a bridge, but fundamentally, it works much the same. A big difference there is that the browser connects directly to the Autopush server and identifies itself using that dom.push.userAgentID string. If that changes, then the endpoints break and messages will wind up going into storage until they die. Eventually the server will determine that the UAID is dead, and all the associated endpoints will return 410 and then eventually 404. (I'll note that a lot of publishers don't bother paying attention to those values and will continue to send notifications to dead endpoints. We have fail2ban rules in place for publishers that do that, so please drop any endpoint that returns 404 or 410, please? Thanks!) The UAID dance on desktop is a bit different as well. It's also done at start up, but it's a bit more of a negotiation than it is on mobile.
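For publishers, the request above (drop any endpoint that returns 404 or 410) could look roughly like this. It is a sketch under assumptions: `send` is a caller-supplied function that performs the real Web Push request and returns the HTTP status code, and `send_and_prune` is a hypothetical name.

```python
from typing import Callable, Iterable, List, Tuple

DEAD_STATUSES = {404, 410}


def send_and_prune(
    endpoints: Iterable[str],
    send: Callable[[str], int],
) -> Tuple[List[str], List[str]]:
    """Send to each endpoint and return (kept, dropped).

    Endpoints answering 404 or 410 go to `dropped` and should be deleted
    from the publisher's database; everything else is kept, including
    transient failures such as 5xx.
    """
    kept: List[str] = []
    dropped: List[str] = []
    for endpoint in endpoints:
        status = send(endpoint)
        (dropped if status in DEAD_STATUSES else kept).append(endpoint)
    return kept, dropped
```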

The other thing to note is that this feels like a really weird edge case. Generally we see around 22-ish 404/410s every 5 minutes. We handle something like 250K messages for FCM alone in that same period. That is absolutely not to dismiss or diminish the importance of this, but it does indicate that something very odd is afoot here.
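To put those two figures side by side, a quick calculation in Python. Both inputs are the approximate numbers quoted above, and the 404/410 count is not limited to FCM, so the result is only a rough ratio:

```python
failures_per_window = 22        # 404/410 responses per 5 minutes (approximate)
messages_per_window = 250_000   # FCM messages handled in the same 5 minutes

failure_rate = failures_per_window / messages_per_window
print(f"{failure_rate:.4%}")    # 0.0088%
```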

Flags: needinfo?(jrconlin)

But that's for UAID and not each push endpoint. If something happens and Firefox gets a new UAID but somehow couldn't update the endpoint from the DB then the client will remain broken until the next UAID change or resubscription.

Unfortunately the endpoints do not accept HEAD requests 🥺

(But desktop drops all push subscriptions on UAID change, not sure that's the case on Fenix because somehow it ended up with a totally separate implementation 🤔 https://searchfox.org/mozilla-central/rev/38e462fe13ea42ae6cc391fb36e8b9e82e842b00/dom/push/PushServiceWebSocket.sys.mjs#152-162,636-650)

Right, but if desktop gets a new UAID, all the prior endpoints are invalid. The UAID is how the server identifies a recipient device, there's no other ID that is used. During the HELLO part of desktop's registration, it gets back a UAID, which may be different than the UAID it requested. That's because the server identified a problem with the prior UAID for some reason and invalidated it. The problem there is that the new UAID will have no subscriptions, which should trigger a resubscription event.

It's not so much that Autopush drops all push subscriptions on a UAID change, as those subscriptions are immediately unavailable to the desktop because they will never be sent by the server. We don't chain UAIDs, so we don't know what a given device's prior UAID values were. As far as the server is concerned a new UAID is a brand new device that never existed before.
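The client side of that HELLO exchange can be sketched as follows. This is a hypothetical model, not Firefox's actual code: `PushClient` and `on_hello` are invented names, and the only behavior taken from this thread is that a changed UAID invalidates every existing endpoint and should lead to resubscription.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PushClient:
    """Hypothetical desktop client state."""
    uaid: Optional[str] = None
    # Maps a site scope to the endpoint issued for it.
    subscriptions: Dict[str, str] = field(default_factory=dict)

    def on_hello(self, server_uaid: str) -> List[str]:
        """Apply the UAID from the server's HELLO reply.

        Returns the scopes that need a resubscription event because their
        endpoints were tied to the previous UAID.
        """
        if server_uaid == self.uaid:
            return []
        stale_scopes = list(self.subscriptions)
        self.subscriptions.clear()   # old endpoints can never receive again
        self.uaid = server_uaid
        return stale_scopes
```

A client that adopts the new UAID but skips or only partly completes the `clear()` step would keep returning endpoints the server will never deliver to.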

Yeah, and from the client point of view, if there's a bug where a UAID update ends up not fully clearing the push information, then after that it remains broken with no automatic way to recover. I'd like to have some way to recover, or some way to get some telemetry about how widespread (or narrowspread... 🙂) this issue is.

Something like comment #11 would enable it, what do you think?

Maybe... It's certainly worth considering.
I'm having a brief argument with myself about any possible security/cost problems with having the UAIDs be poll-able like that. I think the security thing is probably not much of a problem since it doesn't really leak any info and anyone polling that data would stand out like a sore thumb. The cost shouldn't be much of a problem either, particularly if we add in some rules about polling per second.

I'll write up a separate ticket for this.
For folk without Jira access, the summary of that ticket is:

Per discussion in https://bugzilla.mozilla.org/show_bug.cgi?id=1486010#c11 ,

It would be useful if there was an Autoendpoint URL that allowed for external UAs to poll and see if a given UAID is viable or not.

e.g. HEAD /v1/check/{uaid} which would return a JSON formatted response containing {"status":status_value} where status_value is 200 or either 404 or 410 depending on the stored UAID status. An error would be returned as a JSON formatted response containing {"status": 500, "error": error_code, "errno": error_number, "message": descriptive_text}, with updates to the Autopush documentation to describe these errors and resolutions.

This endpoint would need to be coordinated with the UA client team.

Let me know if that sounds good or if you need any changes.
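For discussion, here is how a client might classify the JSON body proposed in the ticket summary. The endpoint does not exist yet, so everything here is an assumption based only on the `{"status": ...}` shape described above; `UaidStatus` and `interpret_check_response` are hypothetical names.

```python
import json
from enum import Enum


class UaidStatus(Enum):
    VALID = "valid"
    GONE = "gone"
    ERROR = "error"


def interpret_check_response(body: str) -> UaidStatus:
    """Classify the JSON body proposed for /v1/check/{uaid}."""
    status = json.loads(body).get("status")
    if status == 200:
        return UaidStatus.VALID
    if status in (404, 410):
        return UaidStatus.GONE
    # Anything else, including the proposed status 500 error body.
    return UaidStatus.ERROR
```

A `GONE` result would be the client's cue to drop its local push records and re-register, while `ERROR` should probably leave local state untouched.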

Hmm, I was thinking about push endpoint URLs, but probably the same problem applies to UAID itself. So yes, having that would be nice! (But would be nice to have one for push endpoint URL too)

Actually, wouldn't a broken UAID be reissued on restart?

It depends?
For various reasons, the server will try to defer to whatever UAID is presented by the client. If there's a problem with that UAID record (e.g. there was an error in the record, or the UAID doesn't exist in our database), then a new UAID is assigned. We don't "tombstone" UAID records for a lot of reasons. (There is the very, very remote chance that two different users could be assigned the same UAID, but that's literally a 1:5.3×10^36 chance of happening.) The server is also reasonably aggressive about dropping UAIDs that are no longer valid for whatever reason (Device request, Bridge returned a 404, some errors with the UAID record, too many pending messages (150, which can indicate that a client may have been offline too long)).
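The collision figure is consistent with UAIDs being random (version 4) UUIDs, which carry 122 random bits. That is an assumption on my part, not something stated in this thread, but the arithmetic and the UAID posted earlier in this bug both fit:

```python
import uuid

# Assumption: UAIDs are random (version 4) UUIDs with 122 random bits.
RANDOM_BITS = 122
possible_uaids = 2 ** RANDOM_BITS
print(f"{possible_uaids:.1e}")   # 5.3e+36

# The UAID posted earlier in this bug parses as a version 4 UUID.
sample = uuid.UUID("431a5c3fb3624a558722c894b69e5b7e")
print(sample.version)            # 4
```

Note that this is the chance of one freshly generated UAID matching one specific existing UAID.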

So, that leaves potential edge cases:

  1. A device reconnects with a previously bad UAID that isn't cleared for some reason. (As with a lot of the server errors, these wouldn't really be "one-off" sorts of things. We would see huge numbers of errors around this.)
  2. A device reconnects with a previously bad UAID and does not change to the new UAID. (I would think this device would generate a bunch of dead endpoints since the device would never "collect" messages for endpoints that it previously registered. We do see some of those, but they're on the order of maybe 100 out of 150,000 messages, and it's split pretty evenly between desktop and mobile.)
  3. Something unknown. (I'm pessimistic by nature, so there's always room for some weird edge case we never knew about)

As for having a "Is this Push Endpoint still valid?" That's a bit harder for the server to figure out, believe it or not.
The push endpoint is basically the UAID + ChannelID + PublicKey hash that's encrypted up. Checking to see if the UAID is valid is pretty easy. Checking to see if the ChannelID is valid isn't, since technically all ChannelIDs are valid because the UA is the one that determines whether or not they are. Plus, the UA would need to know if the endpoint was signed or not and send along the VAPID public key so that the key signature could be validated as well, and that's going to be a bit messier for the UA to deal with. (Note that encryption also complicates things, because it will never return the same result even if you feed it the same inputs. The server could pull the UAID record and make sure that the ChannelID matches up, but we also kinda/sorta already do that with the "Daily Checkin" call that mobile does.)
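To illustrate why "same inputs, different result" matters, here is a toy randomized cipher in Python. It is NOT the scheme Autopush uses and is not secure; it only shows that when each encryption draws a fresh nonce, two tokens for the same UAID + ChannelID cannot be compared as strings and have to be decrypted before they can be matched against a record.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key and nonce (toy only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy randomized encryption: a fresh nonce on every call."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def toy_decrypt(key: bytes, token: bytes) -> bytes:
    """Recover the plaintext using the nonce stored in the first 16 bytes."""
    nonce, body = token[:16], token[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))
```

Encrypting the same payload twice yields two different tokens that both decrypt to the same payload, so any validity check has to happen on the server, which holds the key.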

Interesting. When we return 410 from a push endpoint, is that basically based on UAID being invalid? 👀

We return a 410 to the endpoint if either the User has been deleted or the Subscription has been deleted, so it's not always a case that the UAID is invalid. (For mobile, though, it's even more likely that the User has been dropped, since most mobile devices seem to only have one subscription, but even then, it's not a 100% guarantee)

The only thing that a 410 or 404 should do is to consider that endpoint no longer valid and no more messages should be delivered to that endpoint.

(In reply to JR Conlin [:jrconlin,:jconlin] from comment #22)

> The only thing that a 410 or 404 should do is to consider that endpoint no longer valid and no more messages should be delivered to that endpoint.

(which is the exact information we need!)

An idea: When PushManager.subscribe is called when there's an existing push subscription, the client may decide to send a push message to the endpoint and confirm whether the corresponding message comes back. For now it can serve for a failure rate telemetry. If we see the rate is significant, we may make the client actively try resubscription.

I'm not sure that will work, partly because of the registered endpoint problem.
You would need to POST to the endpoint, with a valid VAPID header that includes the public key, and the UA is not going to be able to do that.

I'm currently working on a fix that allows the client to poll the server with a HEAD request to see if the UAID is still valid: https://github.com/mozilla-services/autopush-rs/commit/9b8c10686b189e200ebdb4b6608a825a48b96760 That might be a better solution, but I've got a few things on my plate right now, so it's not top priority.
