Record telemetry to better understand the push state of our device, and of devices we are trying to send commands to
Categories
(Firefox :: Firefox Accounts, enhancement, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox79 | --- | fixed |
People
(Reporter: markh, Assigned: markh)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
47 bytes, text/x-phabricator-request
1.92 KB, text/plain (chutten: data-review+)
My understanding of the push subscription mechanics is:
-
A push subscription isn't expected to expire after any fixed amount of time, but nevertheless, it does sometimes expire.
-
Expiry is generally noticed only after FxA tries to send a push message to the device (typically on behalf of another fxa device). In other words, we notice as we fail to deliver a push message. We will still "queue" the command, so will deliver it in the future when the target device explicitly checks, but notification of the command isn't delivered via push.
-
We will then set a special flag on the expired device. This device should notice the flag and recreate its subscription.
This means that once a device has an expired flag, at least one command has already failed to be delivered in a timely manner, and deliveries will keep failing until the target device actually notices the flag is set and updates its subscription.
So the outcome of the above is that we should:
-
Poll for missed commands as soon as we notice the flag is true - that's going to be done as part of bug 1632384.
-
Check the flag more often - we currently don't poll this at any regular interval. We do check the flag whenever we update the FxA device list, but that only happens when the user right-clicks on a page or interacts with the "send tab" UI - there's no regular check, and it seems like there probably should be?
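As an illustration of the two outcomes above, here is a minimal sketch of what a client could do when it refreshes its own device record. It is not the actual Firefox code: `DeviceRecord`, its field names, and the `resubscribe` and `poll_commands` callbacks are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DeviceRecord:
    """Hypothetical view of this device's FxA record; field names are assumptions."""
    device_id: str
    push_callback: Optional[str]   # push endpoint URL, or None if never registered
    push_endpoint_expired: bool    # the "special flag" set after a failed delivery


def handle_device_refresh(
    own: DeviceRecord,
    resubscribe: Callable[[], str],
    poll_commands: Callable[[], List[str]],
) -> List[str]:
    """Renew an expired subscription and fetch any queued commands.

    Returns the commands recovered by polling, or an empty list when the
    subscription is not flagged as expired.
    """
    if not own.push_endpoint_expired:
        return []
    # Renew first so that future commands are delivered via push again.
    own.push_callback = resubscribe()
    own.push_endpoint_expired = False
    # The flag implies at least one push was missed, so fetch what is queued.
    return poll_commands()
```

Running the same check on a timer, rather than only when the device list happens to be refreshed, would correspond to the "check the flag more often" suggestion.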
Comment 1•5 years ago
Push is kind of at the mercy of whatever bridge system we use. For the most part, if a user has deactivated a device, disabled push permission to firefox, or otherwise explicitly broken the bridge connection, then Push will get a 40* error back when it tries to send a new subscription update to that device. Push reflects that to the Subscription provider immediately.
It's also possible for a bridge device identifier (what we use to communicate to the individual User Agent) to change. There are potentially two ways that this update can be reported back to Autopush: 1) the client sends an .update() message with the new device ID while handling a push identifier update event or 2) Autopush gets an identifier change notice while sending a push message. That seems to be a feature that exists only in documentation, since I don't think we've seen one of those in the wild.
Ideally, with newer browser instances, clients should be calling the .verify() routine which compares the local list of channel IDs with whatever the server has, and if there's a discrepancy, the client forces all subscribers to get a new subscription. Potentially, any system could send their own "test ping" message to their web app once a day/week, verify it was handled, and if not, regenerate the subscription message.
tl;dr: All push systems are "best effort" delivery systems.
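The .verify() comparison described above can be sketched roughly as follows. This is an assumption-laden illustration rather than the real routine: `verify_subscriptions` and `resubscribe_all` are hypothetical names, and treating any difference between the two sets as a discrepancy is one possible reading of "compares the local list of channel IDs with whatever the server has".

```python
from typing import Callable, Iterable, Set


def verify_subscriptions(
    local_channel_ids: Iterable[str],
    server_channel_ids: Iterable[str],
    resubscribe_all: Callable[[], None],
) -> bool:
    """Compare local and server channel IDs; force re-subscription on mismatch.

    Returns True when both sides agree, False when a discrepancy was found
    and every subscriber was asked to get a new subscription.
    """
    local: Set[str] = set(local_channel_ids)
    server: Set[str] = set(server_channel_ids)
    if local == server:
        return True
    # Any difference means at least one subscription can no longer be trusted.
    resubscribe_all()
    return False
```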
Assignee | ||
Comment 2•5 years ago
Thanks JR!
(In reply to JR Conlin [:jrconlin,:jconlin] from comment #1)
Push is kind of at the mercy of whatever bridge system we use. For the most part, if a user has deactivated a device, disabled push permission to firefox, or otherwise explicitly broken the bridge connection, then Push will get a 40* error back when it tries to send a new subscription update to that device. Push reflects that to the Subscription provider immediately.
That's interesting - but if I'm reading it correctly, it is in response to an explicit user action.
It's also possible for a bridge device identifier (what we use to communicate to the individual User Agent) to change. There are potentially two ways that this update can be reported back to Autopush: 1) the client sends an .update() message with the new device ID while handling a push identifier update event or 2) Autopush gets an identifier change notice while sending a push message. That seems to be a feature that exists only in documentation, since I don't think we've seen one of those in the wild.
And this doesn't seem like something we routinely do.
However, anecdotally we see devices with their subscription being reported as being "expired" even though there's no reason to believe either of the above happened - is that just a 3rd category of "and sometimes when trying to send a message to a device we get a failure reason that implies it should renew the subscription, but we have no idea why that happens"? Or do you believe that actually is one of the 2 scenarios above?
Ideally, with newer browser instances, clients should be calling the .verify() routine which compares the local list of channel IDs with whatever the server has, and if there's a discrepancy, the client forces all subscribers to get a new subscription.
None of our browsers do that now, right? That sounds worthwhile - is that something clients should/could implement today without server-side changes?
Potentially, any system could send their own "test ping" message to their web app once a day/week, verify it was handled, and if not, regenerate the subscription message.
I chatted about that with Ryan and he expressed concern about this approach for our mobile platforms - do you think it could be made to work on those platforms, or would you see this as a desktop-only option?
Comment 3•5 years ago
Ideally, with newer browser instances, clients should be calling the .verify() routine
None of our browsers do that now, right?
Fenix does this on application startup FWIW, although I think there are currently a few issues with it.
This is filed as a "Firefox / Firefox Accounts" client bug, but for completeness: ideally this would be better handled in a generic manner by the push infrastructure code (and that's how it is structured on Fenix).
IIUC what we're expecting to happen on Desktop, is that the push infra will detect that its push subscriptions need to be updated, it will trigger a "push subscription changed" observer notification, and the FxA client code will observe it here and re-register its device record with an updated subscription.
I chatted about that with Ryan and he expressed concern about this approach for our mobile platforms
The concern is specifically for Firefox on iOS, where IIUC the system requires that we show some UI in response to every push notification (as a security measure to prevent apps from abusing push notifications in order to run in the background).
Comment 4•5 years ago
(In reply to Mark Hammond [:markh] [:mhammond] from comment #2)
Push is kind of at the mercy of whatever bridge system we use. For the most part, if a user has deactivated a device, disabled push permission to firefox, or otherwise explicitly broken the bridge connection, then Push will get a 40* error back when it tries to send a new subscription update to that device. Push reflects that to the Subscription provider immediately.
That's interesting - but if I'm reading it correctly, it is in response to an explicit user action.
Yes. The system does not notify us of any errors outside of the normal publication transaction. (e.g. Android's FCM doesn't notify us that a device is no longer accessible without trying to push something.)
However, anecdotally we see devices with their subscription being reported as being "expired" even though there's no reason to believe either of the above happened - is that just a 3rd category of "and sometimes when trying to send a message to a device we get a failure reason that implies it should renew the subscription, but we have no idea why that happens"? Or do you believe that actually is one of the 2 scenarios above?
We really don't have a tremendous amount of insight here. FCM reports an error trying to send a message, but isn't terribly helpful in noting what kind of error it is. If we get a 404, we treat it as "this device is no longer valid", we mark the endpoint as expired and report that back to the publisher. That pretty much breaks the push bridge if the mobile device either never gets its own native subscription update request, or never checks if the push subscription is no longer valid.
Desktop is very different since we have first hand knowledge of any connections from a user agent. If you're seeing the same sort of things on desktop, there's a bit more concern.
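The 404 handling mentioned above could look something like the sketch below. Only the 404 case comes from this thread; the `Endpoint` class, the function name, the returned labels, and the handling of every other status code are assumptions made for illustration, not Autopush's actual behaviour.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical server-side record for one push endpoint."""
    url: str
    expired: bool = False


def record_bridge_response(endpoint: Endpoint, status: int) -> str:
    """Classify a bridge (e.g. FCM) response and update the endpoint state.

    Returns a label that would be reported back to the publisher.
    """
    if 200 <= status < 300:
        return "delivered"
    if status == 404:
        # Per the thread: a 404 is treated as "this device is no longer valid".
        endpoint.expired = True
        return "expired"
    # Assumption: other errors leave the endpoint untouched.
    return "failed"
```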
Ideally, with newer browser instances, clients should be calling the .verify() routine which compares the local list of channel IDs with whatever the server has, and if there's a discrepancy, the client forces all subscribers to get a new subscription.
None of our browsers do that now, right? That sounds worthwhile - is that something clients should/could implement today without server-side changes?
Fenix does, but as Ryan pointed out, there may still be a few bugs in the system, since it's a new feature, and there are still a few moving parts outside of our control (e.g. the native push identifier).
Potentially, any system could send their own "test ping" message to their web app once a day/week, verify it was handled, and if not, regenerate the subscription message.
I chatted about that with Ryan and he expressed concern about this approach for our mobile platforms - do you think it could be made to work on those platforms, or would you see this as a desktop-only option?
Like I said, desktop is special since we have first hand knowledge when a UA connects. It opens a websocket connection directly to our endpoint servers, provides directly identifying info and then sits and waits for incoming messages. It's a LOT more straight-forward than what we have to do for mobile.
Mobile platforms have their own restrictions and issues. As Ryan notes, iOS requires user action for incoming push messages. FCM doesn't. We've also kicked around the idea of using the "bridge" as more of a "wake up" service, which would trigger the mobile devices to establish a WebSocket connection back to our servers. That would make mobile devices work the same as desktop, but it's a significant amount of work that's not been prioritized.
Assignee | ||
Comment 5•5 years ago
Thanks JR!
(In reply to JR Conlin [:jrconlin,:jconlin] from comment #4)
Desktop is very different since we have first hand knowledge of any connections from a user agent. If you're seeing the same sort of things on desktop, there's a bit more concern.
So considering just desktop, it sounds like you are saying that desktop should never(-ish) observe its own end-point being marked as expired? And therefore, polling for missed commands when we observe the expired flag is going to be useless as we should never see that state?
(If that's correct, I'll probably just transform this bug into collecting telemetry around expiry, just to prove this and/or see if there is anything of concern we need to dig further into.)
Comment 6•5 years ago
(In reply to Mark Hammond [:markh] [:mhammond] from comment #5)
So considering just desktop, it sounds like you are saying that desktop should never(-ish) observe its own end-point being marked as expired? And therefore, polling for missed commands when we observe the expired flag is going to be useless as we should never see that state?
Correct. We have tighter control over sensing desktop connectivity since it's not "second hand" like it is with mobile. That said, there's value in coding to the least common denominator.
(If that's correct, I'll probably just transform this bug into collecting telemetry around expiry, just to prove this and/or see if there is anything of concern we need to dig further into.)
<thumbs up emoji>
Assignee | ||
Comment 7•5 years ago
So as threatened, I'm subverting this bug. My plan is:
-
Extend the existing telemetry for how many commands we find via "polling". I may well propose doing this more often in bug 1644598, but we might as well track it regardless.
-
Track how often desktop finds itself with either an expired subscription, or without a subscription at all. Bug 1645742 should allow Firefox to repair itself if it hasn't a subscription, but this should be rare. And given the discussion above, finding itself with an "expired" state should also be rare.
-
Whenever we are sending a command to another device, record its push state - one of ["ok", "expiredCallback", "noCallback"]. This can help inform us if @rfkelly's idea of some special ui/notification/action is worthwhile when the target device is not "ok", because they are probably going to have a bad time (I think that was a slack convo, so no bug exists for that yet - this can help us see if it's worthwhile)
I've a patch for this, so I'll just ask for review rather than asking if it makes sense - let's kill 2 birds with 1 stone. Once I've r+ I'll request data review.
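To make the third item of the plan concrete, here is a rough sketch of how a target device's push state might be classified and counted. The three labels come from the plan above; the function names, the parameter names, the use of a `Counter` in place of real telemetry scalars, and the choice to check for a missing callback before the expired flag are all assumptions, not the contents of the actual patch.

```python
from collections import Counter
from typing import Optional


def push_state(push_callback: Optional[str], push_endpoint_expired: bool) -> str:
    """Return one of "ok", "expiredCallback" or "noCallback" for a device."""
    if not push_callback:
        # Assumption: a missing callback takes precedence over the expired flag.
        return "noCallback"
    if push_endpoint_expired:
        return "expiredCallback"
    return "ok"


def record_send(
    counts: Counter,
    push_callback: Optional[str],
    push_endpoint_expired: bool,
) -> None:
    """Increment the counter for the target device's push state."""
    counts[push_state(push_callback, push_endpoint_expired)] += 1
```

A high share of sends to devices that are not "ok" would suggest the special UI idea is worth pursuing.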
Assignee | ||
Comment 8•5 years ago
Depends on D79783
Comment 9•5 years ago
(I think that was a slack convo, so no bug exists for that yet)
Worse: it was in Jira.
Assignee | ||
Comment 10•5 years ago
Comment on attachment 9156890 [details]
Bug 1644598 - record/extend telemetry about the push state of FxA devices. r?rfkelly
What questions will you answer with this data?
How often the "push" mechanism is effective for sending commands between FxA connected devices.
Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements? Some example responses:
We have anecdotal evidence that the push mechanism is unreliable. We wish to understand the actual performance in a better way.
List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories found on the Mozilla wiki.
We want to record:
All items are "Category 1 “Technical data”" and are being tracked in bug 1644598
-
Whenever we find that this device has an invalid push subscription, what is invalid about it.
-
Whenever we find that a device we are trying to send a command (ie, tab) to has an invalid push subscription, what is invalid about it.
-
How often our fallback for "polling" for commands is effective, because every time it is, it means push was not effective. Note that this is a renewal from bug 1496638.
How long will this data be collected? Choose one of the following:
Until Firefox 85.
What populations will you measure?
Which release channels?
All are scalars declared as release_channel_collection: opt-out
Which countries?
Which locales?
All
Any other filters? Please describe in detail below.
In practice, this will only be recorded for FxA users. However, no identifiers are recorded.
If this data collection is default on, what is the opt-out mechanism for users?
Standard Firefox mechanisms.
Please provide a general description of how you will analyze this data.
In redash.
Where do you intend to share the results of your analysis?
Internally
Is there a third-party tool (i.e. not Telemetry) that you are proposing to use for this data collection?
No
Comment 11•5 years ago
Data collection reviews should be attached to bugs so they're easier for Stewards to find (and don't need to clutter up bug comments quite as much).