1644598 - Record telemetry to better understand the push state of our device, and of devices we are trying to send commands to

Assignee

Description

•

5 years ago

My understanding of the push subscription mechanics is:

A push subscription isn't expected to expire after any fixed amount of time, but nevertheless, it does sometimes expire.
Expiry is generally noticed only after FxA tries to send a push message to the device (typically on behalf of another fxa device). In other words, we notice as we fail to deliver a push message. We will still "queue" the command, so will deliver it in the future when the target device explicitly checks, but notification of the command isn't delivered via push.
We will then set a special flag on the expired device. This device should notice the flag and recreate its subscription.

This means that once a device has an expired flag, at least one command has already failed to be delivered in a timely manner, and that will continue until the target device actually notices the flag is set and updates its subscription.

So the outcome of the above is that we should:

Poll for missed commands as soon as we notice the flag is true - that's going to be done as part of bug 1632384.
Check the flag more often - we currently don't poll this at any regular interval. We do check the flag whenever we update the FxA device list, but that only happens when the user right-clicks on a page or interacts with the "send tab" UI. There's no regular check, and it seems like there probably should be?
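To make the mechanics above concrete, here is a minimal simulation in Python. It is only an illustration, not FxA or Firefox code: the `Device` class, its `push_endpoint_expired` flag, and the `send_command` and `check_own_record` helpers are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical stand-in for an FxA device record."""
    name: str
    subscription_valid: bool = True       # would a push to this device succeed?
    push_endpoint_expired: bool = False   # flag set after a failed push
    queued: list = field(default_factory=list)    # commands waiting on the server
    received: list = field(default_factory=list)  # commands the device has fetched

    def fetch_commands(self) -> list:
        """Fetch everything currently queued for this device."""
        fetched, self.queued = self.queued, []
        self.received.extend(fetched)
        return fetched


def send_command(target: Device, command: str) -> bool:
    """Queue a command and try to notify the target via push.

    Returns True if the push notification was delivered.
    """
    target.queued.append(command)  # the command is queued either way
    if not target.subscription_valid:
        # Expiry is only noticed here, when delivery fails.
        target.push_endpoint_expired = True
        return False
    target.fetch_commands()  # push arrived, so the device fetches right away
    return True


def check_own_record(device: Device) -> list:
    """What a device does when it looks at its own record.

    Returns the commands that were missed while push was broken.
    """
    if not device.push_endpoint_expired:
        return []
    # Recreate the subscription, clear the flag, then poll for missed commands.
    device.subscription_valid = True
    device.push_endpoint_expired = False
    return device.fetch_commands()


phone = Device("phone", subscription_valid=False)
delivered = send_command(phone, "open tab")
missed = check_own_record(phone)
```

In this sketch the first command is only seen once `check_own_record` runs, which is why the frequency of that check matters.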

JR Conlin [:jrconlin,:jconlin]

Comment 1

•

5 years ago

Push is kind of at the mercy of whatever bridge system we use. For the most part, if a user has deactivated a device, disabled push permission to firefox, or otherwise explicitly broken the bridge connection, then Push will get a 40* error back when it tries to send a new subscription update to that device. Push reflects that to the Subscription provider immediately.

It's also possible for a bridge device identifier (what we use to communicate to the individual User Agent) to change. There are potentially two ways that this update can be reported back to Autopush: 1) the client sends an .update() message with the new device ID while handling a push identifier update event or 2) Autopush gets an identifier change notice while sending a push message. That seems to be a feature that exists only in documentation, since I don't think we've seen one of those in the wild.

Ideally, with newer browser instances, clients should be calling the .verify() routine which compares the local list of channel IDs with whatever the server has, and if there's a discrepancy, the client forces all subscribers to get a new subscription. Potentially, any system could send their own "test ping" message to their web app once a day/week, verify it was handled, and if not, regenerate the subscription message.

tl;dr: All push systems are "best effort" delivery systems.
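The `.verify()` comparison described above could look roughly like the following Python sketch. This is not the actual Autopush or client API; the function name and the use of plain sets of channel IDs are assumptions made for illustration.

```python
def verify_subscriptions(local_channel_ids, server_channel_ids):
    """Return the set of channel IDs that need a fresh subscription.

    If the local and server lists agree, nothing needs to be done.
    On any discrepancy, every local subscriber is asked to resubscribe,
    matching the "force all subscribers" behaviour described above.
    """
    local = set(local_channel_ids)
    server = set(server_channel_ids)
    if local == server:
        return set()
    return local
```

For example, a client holding channels `a` and `b` while the server only knows `a` would ask both subscribers to resubscribe, not just the owner of `b`.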

Mark Hammond [:markh] [:mhammond]

Assignee

Comment 2

•

5 years ago

Thanks JR!

(In reply to JR Conlin [:jrconlin,:jconlin] from comment #1)

Push is kind of at the mercy of whatever bridge system we use. For the most part, if a user has deactivated a device, disabled push permission to firefox, or otherwise explicitly broken the bridge connection, then Push will get a 40* error back when it tries to send a new subscription update to that device. Push reflects that to the Subscription provider immediately.

That's interesting - but if I'm reading it correctly, it is in response to an explicit user action.

It's also possible for a bridge device identifier (what we use to communicate to the individual User Agent) to change. There are potentially two ways that this update can be reported back to Autopush: 1) the client sends an .update() message with the new device ID while handling a push identifier update event or 2) Autopush gets an identifier change notice while sending a push message. That seems to be a feature that exists only in documentation, since I don't think we've seen one of those in the wild.

And this doesn't seem like something we routinely do.

However, anecdotally we see devices with their subscription being reported as being "expired" even though there's no reason to believe either of the above happened - is that just a 3rd category of "and sometimes when trying to send a message to a device we get a failure reason that implies it should renew the subscription, but we have no idea why that happens"? Or do you believe that actually is one of the 2 scenarios above?

Ideally, with newer browser instances, clients should be calling the .verify() routine which compares the local list of channel IDs with whatever the server has, and if there's a discrepancy, the client forces all subscribers to get a new subscription.

None of our browsers do that now, right? That sounds worthwhile - is that something clients should/could implement today without server-side changes?

Potentially, any system could send their own "test ping" message to their web app once a day/week, verify it was handled, and if not, regenerate the subscription message.

I chatted about that with Ryan and he expressed concern about this approach for our mobile platforms - do you think it could be made to work on those platforms, or would you see this as a desktop-only option?

Flags: needinfo?(jrconlin)

Ryan Kelly [:rfkelly]

Comment 3

•

5 years ago

Ideally, with newer browser instances, clients should be calling the .verify() routine
None of our browsers do that now, right?

Fenix does this on application startup FWIW, although I think there are currently a few issues with it.

This is filed as a "Firefox / Firefox Accounts" client bug, but for completeness: ideally this would be better handled in a generic manner by the push infrastructure code (and that's how it is structured on Fenix).

IIUC what we're expecting to happen on Desktop, is that the push infra will detect that its push subscriptions need to be updated, it will trigger a "push subscription changed" observer notification, and the FxA client code will observe it here and re-register its device record with an updated subscription.

I chatted about that with Ryan and he expressed concern about this approach for our mobile platforms

The concern is specifically for Firefox on iOS, where IIUC the system requires that we show some UI in response to every push notification (as a security measure to prevent apps from abusing push notifications in order to run in the background).

JR Conlin [:jrconlin,:jconlin]

Comment 4

•

5 years ago

(In reply to Mark Hammond [:markh] [:mhammond] from comment #2)

Push is kind of at the mercy of whatever bridge system we use. For the most part, if a user has deactivated a device, disabled push permission to firefox, or otherwise explicitly broken the bridge connection, then Push will get a 40* error back when it tries to send a new subscription update to that device. Push reflects that to the Subscription provider immediately.

That's interesting - but if I'm reading it correctly, it is in response to an explicit user action.

Yes. The system does not notify us of any errors outside of the normal publication transaction. (e.g. Android's FCM doesn't notify us that a device is no longer accessible without trying to push something.)

However, anecdotally we see devices with their subscription being reported as being "expired" even though there's no reason to believe either of the above happened - is that just a 3rd category of "and sometimes when trying to send a message to a device we get a failure reason that implies it should renew the subscription, but we have no idea why that happens"? Or do you believe that actually is one of the 2 scenarios above?

We really don't have a tremendous amount of insight here. FCM reports an error trying to send a message, but isn't terribly helpful in noting what kind of error it is. If we get a 404, we treat it as "this device is no longer valid", we mark the endpoint as expired and report that back to the publisher. That pretty much breaks the push bridge if the mobile device either never gets its own native subscription update request, or never checks if the push subscription is no longer valid.

Desktop is very different since we have first hand knowledge of any connections from a user agent. If you're seeing the same sort of things on desktop, there's a bit more concern.
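As a rough illustration of the 404 handling described above, here is a Python sketch. It is not Autopush code: the `Endpoint` class, the `handle_bridge_response` function, and the `notify_publisher` callback are hypothetical, and the treatment of status codes other than 404 is an assumption, since only the 404 case is described here.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical record of a push endpoint reached through a bridge."""
    url: str
    expired: bool = False


def handle_bridge_response(endpoint, status_code, notify_publisher):
    """Classify the bridge's response to a push attempt.

    A 404 is treated as "this device is no longer valid": the endpoint is
    marked expired and the publisher is told. Everything else is an
    assumption made for this sketch.
    """
    if status_code == 404:
        endpoint.expired = True
        notify_publisher(endpoint.url)
        return "expired"
    if 200 <= status_code < 300:
        return "delivered"
    return "error"  # the bridge gives little detail about other failures
```

Note that nothing in this flow runs until someone tries to publish, which matches the point that the bridge does not report errors outside the normal publication transaction.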

Ideally, with newer browser instances, clients should be calling the .verify() routine which compares the local list of channel IDs with whatever the server has, and if there's a discrepancy, the client forces all subscribers to get a new subscription.

None of our browsers do that now, right? That sounds worthwhile - is that something clients should/could implement today without server-side changes?

Fenix does, but as Ryan pointed out, there may still be a few bugs in the system, since it's a new feature, and there are still a few moving parts outside of our control (e.g. the native push identifier).

Potentially, any system could send their own "test ping" message to their web app once a day/week, verify it was handled, and if not, regenerate the subscription message.

I chatted about that with Ryan and he expressed concern about this approach for our mobile platforms - do you think it could be made to work on those platforms, or would you see this as a desktop-only option?

Like I said, desktop is special since we have first hand knowledge when a UA connects. It opens a websocket connection directly to our endpoint servers, provides directly identifying info and then sits and waits for incoming messages. It's a LOT more straight-forward than what we have to do for mobile.

Mobile platforms have their own restrictions and issues. As Ryan notes, iOS requires user action for incoming push messages. FCM doesn't. We've also kicked around the idea of using the "bridge" as more of a "wake up" service which would trigger the mobile devices to then establish a WebSocket connection back to our servers, which would make mobile devices work the same as desktop, but that's a significant amount of work that's not been prioritized.

Flags: needinfo?(jrconlin)

Mark Hammond [:markh] [:mhammond]

Assignee

Comment 5

•

5 years ago

Thanks JR!

(In reply to JR Conlin [:jrconlin,:jconlin] from comment #4)

Desktop is very different since we have first hand knowledge of any connections from a user agent. If you're seeing the same sort of things on desktop, there's a bit more concern.

So considering just desktop, it sounds like you are saying that desktop should never(-ish) observe its own end-point being marked as expired? And therefore, polling for missed commands when we observe the expired flag is going to be useless as we should never see that state?

(If that's correct, I'll probably just transform this bug into collecting telemetry around expiry, just to prove this and/or see if there is anything of concern we need to dig further into.)

Flags: needinfo?(jrconlin)

JR Conlin [:jrconlin,:jconlin]

Comment 6

•

5 years ago

(In reply to Mark Hammond [:markh] [:mhammond] from comment #5)

So considering just desktop, it sounds like you are saying that desktop should never(-ish) observe its own end-point being marked as expired? And therefore, polling for missed commands when we observe the expired flag is going to be useless as we should never see that state?

Correct. We have tighter control over sensing desktop connectivity since it's not "second hand" like it is with mobile. That said, there's value in coding to the least common denominator.

(If that's correct, I'll probably just transform this bug into collecting telemetry around expiry, just to prove this and/or see if there is anything of concern we need to dig further into.)

Flags: needinfo?(jrconlin)

Mark Hammond [:markh] [:mhammond]

Assignee

Comment 7

•

5 years ago

So as threatened, I'm subverting this bug. My plan is:

Extend the existing telemetry for how many commands we find via "polling". I may well propose doing this more often in bug 1644598, but we might as well track it regardless.
Track how often desktop finds itself with either an expired subscription, or without a subscription at all. Bug 1645742 should allow Firefox to repair itself if it hasn't a subscription, but this should be rare. And given the discussion above, finding itself with an "expired" state should also be rare.
Whenever we are sending a command to another device, record its push state - one of ["ok", "expiredCallback", "noCallback"]. This can help inform us if @rfkelly's idea of some special ui/notification/action is worthwhile when the target device is not "ok", because they are probably going to have a bad time (I think that was a slack convo, so no bug exists for that yet - this can help us see if it's worthwhile)
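The third item could be sketched as follows in Python. This is not the attached patch: the device-record field names `pushCallback` and `pushEndpointExpired` are assumptions about what a record might contain, and `tally_push_states` is a hypothetical helper. Only the three state names come from the plan above.

```python
from collections import Counter


def push_state(device: dict) -> str:
    """Classify a target device as "ok", "expiredCallback" or "noCallback".

    Assumes (for illustration only) that the record carries a
    `pushCallback` URL and a `pushEndpointExpired` boolean.
    """
    if not device.get("pushCallback"):
        return "noCallback"
    if device.get("pushEndpointExpired"):
        return "expiredCallback"
    return "ok"


def tally_push_states(devices) -> Counter:
    """Count how many target devices fall into each push state."""
    return Counter(push_state(d) for d in devices)
```

A tally like this, recorded each time a command is sent, would show how often the target is likely to "have a bad time" because its state is not "ok".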

I've a patch for this, so I'll just ask for review rather than asking if it makes sense - let's kill 2 birds with 1 stone. Once I've r+ I'll request data review.

Summary: Check for push subscription expiry more often → Record telemetry to better understand the push state of our device, and of devices we are trying to send commands to

Mark Hammond [:markh] [:mhammond]

Assignee

Comment 8

•

5 years ago

Attached file Bug 1644598 - record/extend telemetry about the push state of FxA devices. r?rfkelly — Details

Depends on D79783

Ryan Kelly [:rfkelly]

Comment 9

•

5 years ago

(I think that was a slack convo, so no bug exists for that yet)

Worse: it was in Jira.

Mark Hammond [:markh] [:mhammond]

Assignee

Comment 10

•

5 years ago

Comment on attachment 9156890 [details]
Bug 1644598 - record/extend telemetry about the push state of FxA devices. r?rfkelly

What questions will you answer with this data?

How often the "push" mechanism is effective for sending commands between FxA connected devices.

Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements? Some example responses:

We have anecdotal evidence that the push mechanism is unreliable. We wish to understand the actual performance in a better way.

List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories found on the Mozilla wiki.

We want to record:

All items are "Category 1 “Technical data”" and are being tracked in bug 1644598

Whenever we find that this device has an invalid push subscription, what is invalid about it.
Whenever we find that a device we are trying to send a command (ie, tab) to has an invalid push subscription, what is invalid about it.
How often our fallback for "polling" for commands is effective, because every time it is, it means push was not effective. Note that this is a renewal from bug 1496638.

How long will this data be collected? Choose one of the following:

Until Firefox 85.

What populations will you measure?
Which release channels?

All are scalars declared as release_channel_collection: opt-out

Which countries?
Which locales?

All

Any other filters? Please describe in detail below.

In practice, this will only be recorded for FxA users. However, no identifiers are recorded.

If this data collection is default on, what is the opt-out mechanism for users?

Standard Firefox mechanisms.

Please provide a general description of how you will analyze this data.

In redash.

Where do you intend to share the results of your analysis?

Internally

Is there a third-party tool (i.e. not Telemetry) that you are proposing to use for this data collection?

No

Attachment #9156890 - Flags: data-review?(chutten)

Chris H-C :chutten|PTO (back Oct 23)

Comment 11

•

5 years ago

Attached file data collection review — Details

Data collection reviews should be attached to bugs so they're easier for Stewards to find (and don't need to clutter up bug comments quite as much).

Attachment #9157406 - Flags: data-review?(chutten)

Chris H-C :chutten|PTO (back Oct 23)

Updated

•

5 years ago

Attachment #9156890 - Flags: data-review?(chutten)

Chris H-C :chutten|PTO (back Oct 23)

Comment 12

•

5 years ago

Comment on attachment 9157406 [details]
data collection review

PRELIMINARY NOTES:

In future requests please add a little more detail about the data collections. For "what is invalid about [an invalid push subscription]" please explain whether this is an error code or a string error and what it might contain. In this case I note that the options are `"ok", "expiredCallback", "noCallback"`.

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is Telemetry so is documented in its definitions file [Scalars.yaml](https://hg.mozilla.org/mozilla-central/file/tip/toolkit/components/telemetry/Scalars.yaml) and the [Probe Dictionary](https://telemetry.mozilla.org/probe-dictionary/).

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

No. This collection will expire in Firefox 85.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

Yes. :rfkelly is responsible for renewing or removing the collection before it expires in Firefox 85.

---

Result: datareview+

Attachment #9157406 - Flags: data-review?(chutten) → data-review+

Pulsebot

Comment 13

•

5 years ago

Pushed by mhammond@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5cd8ce0bfd95 record/extend telemetry about the push state of FxA devices. r=rfkelly

Cristina Coroiu [:ccoroiu]

Comment 14

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/5cd8ce0bfd95

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

status-firefox79: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → Firefox 79

Mark Hammond [:markh] [:mhammond]

Assignee

Updated

•

5 years ago

Blocks: 1649055

telemetry-probes

Updated

•

5 years ago

Bug 1644598 - record/extend telemetry about the push state of FxA devices. r?rfkelly (5 years ago, Mark Hammond [:markh] [:mhammond], 47 bytes, text/x-phabricator-request)
data collection review (5 years ago, Chris H-C :chutten|PTO (back Oct 23), 1.92 KB, text/plain, chutten|PTO: data-review+)


Categories

(Firefox :: Firefox Accounts, enhancement, P2)


People

(Reporter: markh, Assigned: markh)

References

(Blocks 1 open bug)

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files)
