Closed Bug 1582317 Opened 5 years ago Closed 5 years ago

Add which servers a user is syncing to (Nodes vs Spanner)

Categories

(Firefox :: Sync, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 73
Tracking Status
firefox72 --- fixed
firefox73 --- fixed

People

(Reporter: adavis, Assigned: markh)

References

Details

Attachments

(3 files)

With the deployment of the spanner backend starting at the end of October, we should monitor if there are sync issues with users that we've moved over.

At this time, we don't have the means to detect and track in sync telemetry if a user is syncing to our current node infrastructure or if they are syncing to the new GCP spanner back-end.

I'm not quite sure how we would do this but I'm starting the conversation.

Things we will want to look at:

  • sync errors
  • validation data

We'll need to be careful about reducing the size of a user's anonymity set here - e.g. if there are only 100 users currently allocated to a node then we know that any telemetry pings including that node must correspond to one of those 100 users. I've no sense of how much of a problem this might be in practice, especially if the metrics ping already includes the hashed fxa uid.

At this time, we don't have the means to detect and track in sync telemetry if a user is syncing to our current node infrastructure
or if they are syncing to the new GCP spanner back-end.

One minimalist option would be to include the full node name in the sync ping. This would let us compare "users on the spanner node" versus "user on some other node". Mark, thoughts?

It would also be great to make sure that when we do this, we ensure that we're updating the metrics that funnel into amplitude specifically.

(In reply to Ryan Kelly [:rfkelly] (PTO until October 16th) from comment #1)

We'll need to be careful about reducing the size of a user's anonymity set here - e.g. if there are only 100 users currently allocated to a node then we know that any telemetry pings including that node must correspond to one of those 100 users. I've no sense of how much of a problem this might be in practice, especially if the metrics ping already includes the hashed fxa uid.

At this time, we don't have the means to detect and track in sync telemetry if a user is syncing to our current node infrastructure
or if they are syncing to the new GCP spanner back-end.

One minimalist option would be to include the full node name in the sync ping. This would let us compare "users on the spanner node" versus "user on some other node". Mark, thoughts?

I could see this being a potential privacy issue assuming we planned to keep users indefinitely on a multitude of storage options. Right now, the main goal here is to get folks moved over from our current infra onto Spanner as quickly as possible. Although, that might (and likely will) take longer than anticipated, supporting two sync backends is not the long term strategy here. Given that, I'd lean in favor of finding a way to do this as it's going to help increase our confidence level that we're safe to continue migrating users (having a way to compare A to B is going to be critical for confidence levels here).

It's a really valid point that this level of tracking will slightly decrease anonymity (even with hashed keys/etc). I think that the benefit of getting
onto durable sync more quickly however makes a blip of potentially decreased anonymity worthwhile.

That being said, really all that matters is Spanner vs old nodes here. We don't need any more granularity than that. If there's a way to grab only that (and lump smaller subgroups of users together), then perhaps there is a way to do this than helps ensure we're doing everything we can to protect users' privacy? I'm all for that.

It would also be great to make sure that when we do this, we ensure that we're updating the metrics that funnel into amplitude specifically.

It's not obvious to me how this would be represented in the amplitude data schema or what amplitude graphs we'd produce using it; :tublitzed or :adavis do you have a specific chart in mind for this case?

One minimalist option would be to include the full node name in the sync ping.
This would let us compare "users on the spanner node" versus "user on some other node". Mark, thoughts?

Whoops, turns out I didn't actually ni? Mark for thoughts here...

Flags: needinfo?(markh)

(In reply to Ryan Kelly [:rfkelly] from comment #6)

One minimalist option would be to include the full node name in the sync ping.
This would let us compare "users on the spanner node" versus "user on some other node". Mark, thoughts?

Whoops, turns out I didn't actually ni? Mark for thoughts here...

I have no problem with that but it's not clear the privacy implications mentioned above can be dismissed. Other possibilities:

  • We store a pref with the name of the spanner node (I'm assuming there's only going to be 1?) and we record a new bool which indicates whether it is that node or not.

  • Generalizing that even more, the token-server could return a new field expose purely for us to write to telemetry. You could then imagine this being capable of being used in other interesting ways in the future.

But if we can dismiss privacy concerns, the full node name does make sense.

CC Thom who is currently touching the back-end scala code - if we can make a decision here soon, he might be able to incorporate the new field at the same time.

Flags: needinfo?(markh)

We don't really want to know which host they are on, we just want to know if they are on the new server or the old. I think we should just record that boolean value.

Generalizing that even more, the token-server could return a new field expose purely for us to write to telemetry.
You could then imagine this being capable of being used in other interesting ways in the future.

I quite like this idea, it seems inline with the way we already return a per-user id to be used for metrics purposes.

(In reply to Ryan Kelly [:rfkelly] from comment #5)

It would also be great to make sure that when we do this, we ensure that we're updating the metrics that funnel into amplitude specifically.

It's not obvious to me how this would be represented in the amplitude data schema or what amplitude graphs we'd produce using it; :tublitzed or :adavis do you have a specific chart in mind for this case?

Sure, this one on Sync MAU for example, or this one by device count - Ie, some of the higher level charts around sync usage...would be great if we could then start ensuring that we're not seeing anything significantly abnormal with a smaller subset of Durable Sync users before we start enabling it more widely.

Sure, this one on Sync MAU for example, or this one by device count

Thanks. Sounds like we'd like to get this piece of data into amplitude as a "user property" somehow. Leif, did we ever manage to get any sync client telemetry events flowing through into amplitude? I recall an effort to get some send-tab events in there, but that it may have been abandoned due to data volume concerns.

Flags: needinfo?(loines)

the token-server could return a new field expose purely for us to write to telemetry.

Strawfox proposal: return a new "node_type" field as part of the tokenserver response data here, and have tokenserver set it to some short string value according to our needs. The client code just grabs this field out and includes it in the sync ping in the same way that it does with hashed_fxa_uid. We can then segment sync-ping telemetry analysis by this field (once it rides the trains to release, of course).

I'm not entirely sure how to get this value into amplitude, because the user's node assignment happens "downstream" of the FxA events that currently make their way into amplitude. For example, the cert_signed event that is used for the "Sync MAU" graph above, is emitted by the FxA server and happens before the client can talk to the tokenserver and get a node assignment. We could consider trying to send some events from the tokenserver backend into amplitude, but I'm not sure what that would look like from an infrastructure perspective.

Note: Including just the domain name of the node in a sync ping would be sufficient to distinguish between spanner and current production.

We could consider trying to send some events from the tokenserver backend into amplitude,
but I'm not sure what that would look like from an infrastructure perspective.

Bob, what would it look like to try to pipe some metrics from the tokenserver backend through our amplitude ingestion pipeline? I recall the logging and metrics setup in tokenserver being quite different from how it's done in FxA, but also that there was some consolidation work happening around the logging pipeline.

Flags: needinfo?(bobm)

IIUC, https://github.com/mozilla/fxa-amplitude-send/blob/master/sync-events.js is some code which is taking stuff from the client pings and getting it into the amplitude pipeline - so maybe this is what we could use to get the data from comment 12 into amplitude via the client?

Attached file GitHub Pull Request

What do you think about adding something like this on the tokenserver, and having clients include it in the sync telemetry ping?

Attachment #9103853 - Flags: feedback?(markh)
Attachment #9103853 - Flags: feedback?(dpreston)

(In reply to Ryan Kelly [:rfkelly] from comment #14)

Bob, what would it look like to try to pipe some metrics from the tokenserver backend through our amplitude ingestion pipeline? I recall the logging and metrics setup in tokenserver being quite different from how it's done in FxA, but also that there was some consolidation work happening around the logging pipeline.

I don't know enough about the FxA and Amplitude ingestion pipeline to say. Tokenserver is using an older version of our logging infra, which I believe is because we use the Tokenserver logs metrics for counting Sync users and such. There is consolidation work going on, and we could move it over or run both pipelines if need be. Bringing :jrgm in for clarity.

Flags: needinfo?(bobm) → needinfo?(jrgm)

(In reply to Mark Hammond [:markh] from comment #15)

IIUC, https://github.com/mozilla/fxa-amplitude-send/blob/master/sync-events.js is some code which is taking stuff from the client pings and getting it into the amplitude pipeline - so maybe this is what we could use to get the data from comment 12 into amplitude via the client?

I think using this code's approach is the correct answer here. Also, the client already knows what server it is connecting to, because it is connecting to the server. So I propose we have the client do "let using_spanner = (server == [the spanner hostname])" and send the boolean "using_spanner" to amplitude, to protect the privacy of those metrics as discussed.

I believe we don't have to touch any servers to do this, which is probably good. If we want the servers to be able to pass through some data to amplitude at some point, we can do that, but let's have a use case for that ready before doing it.

IIUC, https://github.com/mozilla/fxa-amplitude-send/blob/master/sync-events.js is some code which is taking stuff from the client
pings and getting it into the amplitude pipeline

Poking around in amplitude, it seems that all "sync - *" events apart from "tab_received" and "tab_sent" are disabled due to data volume. If we want to go this route we'll have to find an appropriately low-volume event that is reliably emitted for all users, to add to the enabled list. See also comments about amplitude data volume in Bug 1582256 Comment 7 and below. If we have an event in the sync ping for "this client just got a new node assigment" that may be appropriate.

Flags: needinfo?(loines)

So I propose we have the client do "let using_spanner = (server == [the spanner hostname])" and send the boolean
"using_spanner" to amplitude, to protect the privacy of those metrics as discussed.

I don't have any particular objection to this FWIW, apart from a kind of "ick, but...yeah" gut reaction; bonus points if [the spanner hostname] can come from a pref or something that's easy to update off-train, as Mark suggests in Comment 7.

Let's commit to this approach in order to move the bug forward.

If we want the servers to be able to pass through some data to amplitude at some point, we can do that,
but let's have a use case for that ready before doing it.

I do think it's worth pursuing this longer-term though - if we want to measure sync-related metrics in amplitude, we should be able to send metrics from the sync backend into amplitude in a straightforward way. It's a shorter path from source to destination, increasing reliability and coverage of the metrics (for example, FxA users with telemetry disabled still show up in amplitude thanks to server-side metrics).

(In reply to Donovan Preston [:fzzzy] from comment #18)

So I propose we have the client do "let using_spanner = (server == [the spanner hostname])" and send the boolean "using_spanner" to amplitude, to protect the privacy of those metrics as discussed.

Is there any possibility that there will end up being multiple spanner hosts/names for operational reasons? Eg, I'm thinking of the Russian users who find themselves not able to access certain hosts, so Bob reassigns them. Is it possible there will be similar operational "emergencies" that cause us to do something similar?

If so, the obvious concern is that having Firefox know what the canonical names were, but not what they are, isn't going to work very well.

(In reply to Ryan Kelly [:rfkelly] from comment #20)

If we want the servers to be able to pass through some data to amplitude at some point, we can do that,
but let's have a use case for that ready before doing it.

I do think it's worth pursuing this longer-term though

I agree - there's always a chicken-and-egg issue here - we can end up with a use-case, but because we can't update already released versions of Firefox, we end up doing something ugly, hacky, and not 100% effective - then never get back to adding the capability to Firefox, so rinse-and-repeat a few months later. This is the same basic consideration as above - Firefox knowing what the canonical spanner names were a few months ago isn't as useful as knowing what they are today.

(In reply to Ryan Kelly [:rfkelly] from comment #16)

What do you think about adding something like this on the tokenserver, and having clients include it in the sync telemetry ping?

FWIW, I do think that's worthwhile, although I'd be inclined to generalize it even more - eg, a top-level telemetry (or similar) object, which the client opaquely whacks in the sync ping.

Another option is that there's a special header we look for in any storage request?

If it helps at all here to clarify how to implement, I've put this list together of what we want to track here - You'll see that the initial set isn't huge. But at the moment, it'd be better to have something than not have anything at all, so I tried to keep this list small.

Attachment #9103853 - Flags: feedback?(markh)
Attachment #9103853 - Flags: feedback?(dpreston)
Attachment #9103853 - Attachment is obsolete: true

From the meeting today, I believe we voluntold Donovan as the decision-maker here, so ni? him for next steps :-)

Also, since this is changing what Firefox includes in telemetry ping, a reminder that we should get data collection review for this change.

Flags: needinfo?(jrgm) → needinfo?(dpreston)

Ok, I really don't think I am the right person to make the decision, but why not :-)

Let's add a new field to the tokenserver response (or a header -- whoever does the implementation can decide what makes sense). Then, let's log this value in all telemetry events sent by the client.

We want to do it this way because we'd rather not hardcode the logic for calculating the value in the client, since this will require riding the trains, and we don't see any reason to add it to the syncserver responses, as it should not change over the lifetime of the user's login session.

Flags: needinfo?(dpreston)

(In reply to Donovan Preston [:fzzzy] from comment #25)

whoever does the implementation can decide what makes sense)

Which begs the obvious question - who is going to do the implementation? ;)

Attachment #9103853 - Attachment is obsolete: false

I'm happy to help wrangle the tokenserver change, and have un-obsoleted my previous PR to that effect:

https://github.com/mozilla-services/tokenserver/pull/149

That just returns a node_type field as part of the tokenserver response. Mark, does that sound OK to you, or were you thinking something more general?

Flags: needinfo?(markh)

That sounds fine to me - we can generalize later if a good reason comes up.

Flags: needinfo?(markh)
Comment on attachment 9103853 [details] [review]
GitHub Pull Request

I've updated the PR here and marked it as ready for review:

  https://github.com/mozilla-services/tokenserver/pull/149

Unfortunately I can't seem to tag you for review in github :fzzzy, so setting the feedback flag here instead. Could you please take a look?
Attachment #9103853 - Flags: feedback?(dpreston)

Given the tokenserver change will land on github, I see no reason not to use this bug for the client changes. Patch forthcoming so I can start the data-review process.

Assignee: nobody → markh
Status: NEW → ASSIGNED
Attached file data-review.txt

Hi Nicole,
I'm flagging you for data review seeing you did the one in bug 1595954, but obviously feel free to delegate as necessary.

Attachment #9112411 - Flags: data-review?(nshadowen)

ni? myself for a deploy bug that will include this change, but it'll have to wait for the current pending deploy (which involves a db migration, so I don't want to add anything more to that one).

Flags: needinfo?(rfkelly)
Comment on attachment 9112411 [details]
data-review.txt

It's been a week, so I'm shifting the data-review request to :chutten.
Attachment #9112411 - Flags: data-review?(nshadowen) → data-review?(chutten)

Asking for compelteness: is it safe to land this client-side change before the server-side change gets deployed? I expect so from looking at the patch, but wanted to be sure.

Flags: needinfo?(rfkelly) → needinfo?(markh)

It is, yes.

Flags: needinfo?(markh)

(In reply to Mark Hammond [:markh] from comment #32)

Created attachment 9112411 [details]
data-review.txt

Hi Nicole,
I'm flagging you for data review seeing you did the one in bug 1595954, but obviously feel free to delegate as necessary.

thanks for your patience, I was a bit backed up from the holiday.

Could you please indicate the proposed category of data collection for measurement: server-supplied "node type" string in the existing sync ping, using descriptions found on Mozilla wiki here: https://wiki.mozilla.org/Firefox/Data_Collection?
Also, is there (or will there be) documentation on this data set described in public somewhere? And where would that be?

Flags: needinfo?(markh)

(In reply to Nicole Shadowen from comment #37)

Could you please indicate the proposed category of data collection for measurement: server-supplied "node type" string in the existing sync ping, using descriptions found on Mozilla wiki here: https://wiki.mozilla.org/Firefox/Data_Collection?

"Technical data”

Also, is there (or will there be) documentation on this data set described in public somewhere? And where would that be?

This is adding 1 field to the existing "sync ping". The only public documentation I'm aware of for that ping is https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/data/sync-ping.rst

Flags: needinfo?(markh)
Attachment #9112411 - Flags: data-review?(chutten) → data-review?(nshadowen)

(In reply to Mark Hammond [:markh] from comment #38)

(In reply to Nicole Shadowen from comment #37)

Could you please indicate the proposed category of data collection for measurement: server-supplied "node type" string in the existing sync ping, using descriptions found on Mozilla wiki here: https://wiki.mozilla.org/Firefox/Data_Collection?

"Technical data”

Also, is there (or will there be) documentation on this data set described in public somewhere? And where would that be?

This is adding 1 field to the existing "sync ping". The only public documentation I'm aware of for that ping is https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/data/sync-ping.rst

Great, thanks. To receive data-review+ the measurement needs complete and public (at the time data begins to be collected) documentation. As far as I can tell, the Data Review can be successfully completed once the new field is added to the existing sync-ping documentation.

Flags: needinfo?(markh)

Comment on attachment 9112410 [details]
Bug 1582317 - record the sync storage node type in the sync telemetry ping. r?tcsc

Beta/Release Uplift Approval Request

  • User impact if declined: The release of durable sync will be delayed, which increases the risk of lost sync data. We'd like to get this in before the release as it will help us to monitor durable sync (compared to existing sync storage) as we start to route users over to it to ensure we're not seeing anything unusual.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This is marked as low risk because it's a non-user facing change to add a new optional field (syncNodeType) to the sync ping. If the field isn't set for some reason, it's not required, so even if we ran into trouble with finding the syncNodeType on the token server (which is where this info comes from), the sync ping would not fail, we'd just not pass this value.
  • String changes made/needed:
Attachment #9112410 - Flags: approval-mozilla-beta?

:jcristau - I just flagged this for uplift approval for 72. Let me know if you need any more details from me.

Flags: needinfo?(jcristau)

We can make a server tag for this at any stage, but from meeting last week, it sounds like we'll wait for some other pending tokenserver changes before doing so.

Flags: needinfo?(jcristau)

:rfk - just so I'm clear though, there's two changes here though, right? The desktop change here and the tokenserver change here (which I agree, let's wait until the other incoming tokenserver change is in before we deploy tokenserver).

The bit I'm hoping to uplift into 72 specifically is the desktop change, as we'd like to get it live before 73 (which doesn't go out until 2/11) so as not to delay the durable sync release, which is slated to start Jan 13.

So, we still need uplift for 72 here, unless I'm missing something obvious :)

Flags: needinfo?(rfkelly)

Sorry, yes, you are correct - my comment was meant to be totally independent of yours, but reading back I can see how it reads as a reply to your uplift request. The landing of the client change and any uplift can proceed independently of the tokenserver deploy.

Flags: needinfo?(rfkelly)

(In reply to Nicole Shadowen from comment #39)

Great, thanks. To receive data-review+ the measurement needs complete and public (at the time data begins to be collected) documentation. As far as I can tell, the Data Review can be successfully completed once the new field is added to the existing sync-ping documentation.

The patch in this bug adds that documentation.

Flags: needinfo?(markh) → needinfo?(nshadowen)
Comment on attachment 9112411 [details]
data-review.txt

DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
     Yes. This collection is Telemetry so is documented in the following file [sync-ping.rst](https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/data/sync-ping.rst).
Is there a control mechanism that allows the user to turn the data collection on and off?
     Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
     Yes, markh is responsible.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
     Category 1, Technical.
Is the data collection request for default-on or default-off?
     Default on for all channels.
Does the instrumentation include the addition of any new identifiers?
     No. 
Is the data collection covered by the existing Firefox privacy notice?
     Yes.
Does there need to be a check-in in the future to determine whether to renew the data?
     No. This collection is permanent.
Does the data collection use a third-party collection tool? 
     No, telemetry only.
Flags: needinfo?(nshadowen)
Attachment #9112411 - Flags: data-review?(nshadowen) → data-review+

:markh - we just got approval for uplift here to 72 from :jcristau - can you make sure this one gets merged to the right spot for that release? Julien told me he'd like to see it hit trunk soon since 72 is coming up.

:jcristau - any other flags we need to mark on this bug for uplift, or are we good to go after this hits trunk?

Flags: needinfo?(markh)
Flags: needinfo?(jcristau)

The uplift request flag is already set on this patch. I'm mainly waiting to see it in nightly for a day or two before I do anything with it.

Flags: needinfo?(jcristau)
Pushed by mhammond@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/246a4db81402
record the sync storage node type in the sync telemetry ping. r=tcsc
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 73

Comment on attachment 9112410 [details]
Bug 1582317 - record the sync storage node type in the sync telemetry ping. r?tcsc

add telemetry to support sync backend change, approved for 72.0b9

Attachment #9112410 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9103853 - Flags: feedback?(dpreston) → feedback+

Mark, should we be receiving pings with this measure from nightly/beta at this point? I do not see any here and I also do not see any in my own personal pings (nightly 73).

I also filed this issue so that the pipeline schema gets updated

(In reply to Leif Oines [:loines] from comment #53)

Mark, should we be receiving pings with this measure from nightly/beta at this point?

As discussed in slack, it seems some config values need to be added to the prod token server before clients will get and report this value.

Flags: needinfo?(markh)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: