Closed Bug 1748813 Opened 4 years ago Closed 2 years ago

Weishi360 assumed to cause new client_ids to be created on updates

Categories

(External Software Affecting Firefox :: Telemetry, defect)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1845338

People

(Reporter: RT, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fidedi-ope])

Context:

We started noticing new client_id creation spikes around update launch days in China and in Russia with https://sql.telemetry.mozilla.org/queries/82863#205344
So far our analysis is as follows:

  • There seems to be a correlation with the background updater launch, given the smaller spikes noticed while the BUA was slowly rolled out. On the CN line, small spikes precede a large spike on Nov 3rd (94.0 and 94.0.1) and then another on Nov 23rd (94.0.2). The smaller spikes happen on release dates and seem proportional to the BUA roll-out at the time of the update: 10% BUA was Aug 31st and we saw a 220k spike on Sept 8th with the 92 release; 25% BUA was Sept 28th and we saw a 348k spike on Oct 8th with 93; 100% BUA was Oct 26th and we saw a 1.1M spike on Nov 3rd with 94.
  • We also see high volumes of Telemetry "deletion-request" pings from CN clients around releases (https://bugzilla.mozilla.org/show_bug.cgi?id=1741252) which may have a similar or related cause.
  • We assume that this is caused by hitting a condition that triggers a client_id creation, referred to as spurious client_id creation (https://bugzilla.mozilla.org/show_bug.cgi?id=1700188 - sometimes new profiles can't write their client_id to disk or read it from disk and we create a new one).
  • This was reproduced once by yliu on the China team with Weishi 360 installed, and we assume that this Chinese AV may be causing the condition for the behavior to happen. Weishi has 1.6M Firefox MAU in China per https://sql.telemetry.mozilla.org/queries/83118/source
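
To make the "proportional to the BUA roll-out" observation in the first bullet easier to check, here is a rough sketch that scales each spike by the roll-out fraction in effect at the time. The figures are the ones quoted above; the names (ROLLOUT_SPIKES, spike_per_full_rollout, spikes_grow_with_rollout) are illustrative only and do not come from any Mozilla code.

```python
# Release, BUA roll-out fraction, and spike size as quoted in the first bullet.
# All names here are illustrative assumptions, not existing Mozilla code.
ROLLOUT_SPIKES = [
    ("92", 0.10, 220_000),
    ("93", 0.25, 348_000),
    ("94", 1.00, 1_100_000),
]


def spike_per_full_rollout(fraction, spike):
    """Scale a spike to what a 100% roll-out would imply if growth were linear."""
    return spike / fraction


def spikes_grow_with_rollout(rows):
    """True when a larger roll-out fraction always comes with a larger spike."""
    ordered = sorted(rows, key=lambda row: row[1])
    sizes = [spike for _, _, spike in ordered]
    return all(a < b for a, b in zip(sizes, sizes[1:]))


if __name__ == "__main__":
    for version, fraction, spike in ROLLOUT_SPIKES:
        print(version, round(spike_per_full_rollout(fraction, spike)))
    print("monotonic:", spikes_grow_with_rollout(ROLLOUT_SPIKES))
```

On these three data points the spikes grow with the roll-out but not linearly, so "proportional" is probably best read loosely.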

Even though users seem unaffected, this is an issue because it messes up our telemetry in these countries. We also need to understand and fix the source of the behavior to ensure it won't happen again through alternate sources.

The ask is to reproduce the issue, identify what's causing the behavior and fix it.

My best guess from the data and code about what mechanism is causing this is the data upload pref being changed to false. That's the only way we'd be getting "deletion-request" pings for these clients.

This pref is exposed to the UI in about:preferences#privacy (the "Allow Firefox to send technical and interaction data to Mozilla" one) and has the creative internal name of datareporting.healthreport.uploadEnabled.

This is a distinct cause from the "spurious client_id problems" documented in bug 1700188, though there's nothing stopping these clients from having both. The counts of spurious client_ids (at least as measured by the methods of bug 1700188) were relatively stable over time, including over updates, once again suggesting that these at-update sudden spikes are not (fully) explained by "spurious client_ids".

See Also: → 1700188

Attempted to reproduce the issue by updating to the latest Firefox version from versions 83, 85, 93 and 95, using Chinese locale builds with 360 Weishi AV installed.
The clientID field from about:telemetry#general-data-tab did not change in any of the updates.

The severity field is not set for this bug.
:toshi, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(tkikuchi)

Amir, can someone on your team please help us understand whether the BUA could somehow flip datareporting.healthreport.uploadEnabled?

Flags: needinfo?(ahabibi)
Severity: -- → S4
Flags: needinfo?(tkikuchi)

Can we dig into the client_id creation code for clues as to why a new ID was created in these cases? As this impacts our new profile data every time we release, I think it should be treated with higher severity.

Flags: needinfo?(chutten)
Flags: needinfo?(ahabibi)
Whiteboard: [fidedi-ope]

Huh, that's weird, the needinfo should've been sending me mail. Whoops.

(In reply to Vicky Chin [:vchin] from comment #5)

Can we dig into the client_id creation code for clues as to why a new ID was created in these cases? As this impacts our new profile data every time we release, I think it should be treated with higher severity.

If the data upload pref is turned off, we send off a "deletion-request" ping to delete all data associated with the client_id and set the client_id to a known value (the string c0ffee repeated several times). This ensures we delete the data that profile's previously sent us, and that if we accidentally send more data while the upload pref is off (we shouldn't, but we're being careful), it'll be marked so we can discard it. If the data upload pref is later turned back on, we have no idea (nor any desire to have any idea) of what any previous client_id values are, so we generate a new one.

Or, in short, this is on purpose. Bounce your pref, rotate the client_id. Happens in both Telemetry and Glean. This behaviour of client_id was most recently amended in conjunction with Legal as part of Project Shredder back in... 2017 is when we changed how "deletion-request" pings worked, I think. 2019's when Project Shredder stood up and tweaked things slightly.
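
For readers who want the rotation spelled out, here is a minimal toy model of the behaviour described above. It is not Firefox or Glean code: the Profile class and its methods are hypothetical, and the exact canary string is my assumption of what "c0ffee repeated several times" looks like in UUID form.

```python
import uuid

# Assumed canary value: "c0ffee" repeated, laid out like a UUID.
CANARY_CLIENT_ID = "c0ffeec0-ffee-c0ff-eec0-ffeec0ffeec0"


class Profile:
    """Toy model of a profile's client_id handling (hypothetical, not Firefox code)."""

    def __init__(self):
        self.upload_enabled = True
        self.client_id = str(uuid.uuid4())
        self.sent_pings = []

    def set_upload_enabled(self, enabled):
        if enabled == self.upload_enabled:
            return  # No pref change, so nothing happens to the client_id.
        self.upload_enabled = enabled
        if not enabled:
            # Request deletion of everything sent under the old id, then mask
            # the id so any accidental later data can be recognised and dropped.
            self.sent_pings.append(("deletion-request", self.client_id))
            self.client_id = CANARY_CLIENT_ID
        else:
            # The previous id is deliberately unknown, so a fresh one is made.
            self.client_id = str(uuid.uuid4())


if __name__ == "__main__":
    profile = Profile()
    before = profile.client_id
    profile.set_upload_enabled(False)
    profile.set_upload_enabled(True)
    print(before != profile.client_id)
```

In this model a single off/on bounce of the pref yields exactly one "deletion-request" ping and one new client_id, which is the pairing seen in the spikes.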

Flags: needinfo?(chutten)

One inexpensive but slow way to determine if background updates are impacting this would be to disable background updates for the users we believe are impacted. That is, we could disable BU when the application locale is zh-CN and when we witness Weishi AV installed. This change would have to ride the trains: we no longer, as of Bug 1703302, have the ability to target BU with a Normandy pref-flip -- assuming that we could target users with specific AVs installed. (Removing this functionality was not done without consideration: there were many technical difficulties supporting Normandy for roll-out. Nonetheless, it's a gap that we might try to address at some point.) The advantage of this is that the technical work is simple and it should be clear, after a release cycle to roll out the change and one more release cycle to witness impacts, if background updates are impacting these spurious client IDs. The disadvantage is that we have to wait at least two release cycles for data to come in.

romain: we haven't a lot of great ideas for how to investigate this. Should we try what I suggest?

Flags: needinfo?(rtestard)

(In reply to Romain Testard [:RT] from comment #4)

Amir, can someone on your team please help us understand whether the BUA could somehow flip datareporting.healthreport.uploadEnabled?

I thought about this some weeks ago, when the question was first raised, and don't see how this could be happening. The background update task creates a new temporary profile. That profile doesn't participate in the regular telemetry system, so it shouldn't be creating spurious client IDs directly. (It does submit Glean telemetry, so it could be creating new Glean IDs, but it does so consistently, so we shouldn't see differences correlated to regions or AV providers.) The background update task does lock the user's default browsing profile directory in order to fish the client ID from the datareporting directory: see this helper. That lock is narrowly scoped and should be released very quickly, but there is I/O happening that could slow things down.

It's just hard to see how any of this could create spurious client IDs. I suppose a background task could lock the profile, Firefox could be started (how -- by hundreds of thousands of users?), Firefox could find that the profile is locked and somehow a new profile created, creating a spurious client ID? That seems unlikely.

One avenue of investigation we might be able to pursue: we have two mostly independent telemetry systems at play here. The background update tasks use Glean with a single, shared Glean configuration. A single Firefox installation should have a fixed Glean ID, regardless of the Firefox client ID that it includes in the background update ping. If we were to see the same Glean ID with "many" Firefox client IDs, that would at least suggest that these are truly spurious IDs. I tried to start this analysis but have been stymied. Somehow the Glean ID appears to be NULL for almost all backgroundupdate pings: https://sql.telemetry.mozilla.org/queries/83892/source. :chutten, could you advise? Is the Glean client_id stripped in some way? If so, why are there any non-NULL Glean client_id values?
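
The shape of that analysis, sketched in Python rather than SQL. The field names glean_client_id and firefox_client_id are placeholders I made up for illustration, not the real column names, and the threshold for "many" is arbitrary.

```python
from collections import defaultdict


def firefox_ids_per_glean_id(pings):
    """Count distinct Firefox client IDs seen alongside each Glean ID.

    `pings` is an iterable of dicts with hypothetical keys
    "glean_client_id" and "firefox_client_id".
    """
    seen = defaultdict(set)
    for ping in pings:
        glean_id = ping.get("glean_client_id")
        firefox_id = ping.get("firefox_client_id")
        if glean_id is None or firefox_id is None:
            continue  # NULL ids cannot be attributed to an installation.
        seen[glean_id].add(firefox_id)
    return {glean_id: len(ids) for glean_id, ids in seen.items()}


def suspicious_installations(pings, threshold=3):
    """Glean IDs paired with at least `threshold` distinct Firefox client IDs."""
    counts = firefox_ids_per_glean_id(pings)
    return sorted(g for g, n in counts.items() if n >= threshold)
```

Rows with a NULL Glean ID are skipped, so with the data as it currently looks this would return almost nothing.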

Flags: needinfo?(chutten)

(In reply to Nick Alexander :nalexander [he/him] from comment #7)

One inexpensive but slow way to determine if background updates are impacting this would be to disable background updates for the users we believe are impacted. That is, we could disable BU when the application locale is zh-CN and when we witness Weishi AV installed. This change would have to ride the trains: we no longer, as of Bug 1703302, have the ability to target BU with a Normandy pref-flip -- assuming that we could target users with specific AVs installed. (Removing this functionality was not done without consideration: there were many technical difficulties supporting Normandy for roll-out. Nonetheless, it's a gap that we might try to address at some point.) The advantage of this is that the technical work is simple and it should be clear, after a release cycle to roll out the change and one more release cycle to witness impacts, if background updates are impacting these spurious client IDs. The disadvantage is that we have to wait at least two release cycles for data to come in.

romain: we haven't a lot of great ideas for how to investigate this. Should we try what I suggest?

I believe this is something we could do through a telemetry analysis, i.e. filter out Weishi360 users from https://sql.telemetry.mozilla.org/queries/82863#205344 to work out whether the spikes go away. At least if the spikes don't go away after filtering out Weishi, we know it's unlikely to be them.
This gets a little beyond my telemetry ability, but is this something you could help us with, Chris? I.e.:
1. Find correlations between registered AV and spiking new profiles.
2. Find correlations between injected DLLs and spiking new profiles, if (1) does not show obvious results.
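
A minimal sketch of the "filter out one AV and see if the spike survives" check, assuming the query results can be exported as (day, list of registered AV names, new client count) rows. That row shape and the helper names are assumptions for illustration; the real query output will look different.

```python
from statistics import median


def daily_totals(rows, exclude_av=None):
    """Sum new-client counts per day, optionally skipping rows that list exclude_av."""
    totals = {}
    for day, av_names, count in rows:
        totals.setdefault(day, 0)  # Keep the day even if every row is excluded.
        if exclude_av is not None and exclude_av in av_names:
            continue
        totals[day] += count
    return totals


def spike_ratio(totals, release_day):
    """Release-day count divided by the median of all other days."""
    baseline = median(v for day, v in totals.items() if day != release_day)
    if baseline == 0:
        raise ValueError("baseline is zero; ratio is undefined")
    return totals[release_day] / baseline
```

If the ratio stays high after excluding the AV, that AV is unlikely to be the cause; if it drops towards 1, the AV (or something correlated with it) deserves a closer look. The same helper could be reused for step 2 by passing injected DLL names instead of AV names.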

Flags: needinfo?(rtestard)

(In reply to Nick Alexander :nalexander [he/him] from comment #8)

:chutten, could you advise? Is the Glean client_id stripped in some way? If so, why are there any non-NULL Glean client_id values?

https://searchfox.org/mozilla-central/source/toolkit/mozapps/update/pings.yaml#25 forbids the client_id from being included in the "background-update" ping. As to why there are two pings with non-NULL Glean client_ids... I have no idea. I suppose it's heartening that there's only been 2 since Oct 15 that have them. It should be impossible for the Glean SDK to assemble a ping with a client_id if it asks for it to not be. I guess it's possible that those two pings were sent by something that wasn't a Glean SDK... but they appear well-formed according to the values in client_info and ping_info. They claim to come from 95.0.2 build 20211218203254 which looks right to me. They are two sequential pings (seq values 122 and 123) and they describe two intervals of time suggesting that they weren't using their computer between Dec 23 and 29. That lines up with the winter holiday period and is consistent with them having been received from the business arm of Telecom Italia (according to the sending IP address).

So, uh... no idea.

spurious client_ids

I must reiterate that these aren't spurious client_ids, though, they're legit. They're sending "deletion-request" pings via Firefox Telemetry, meaning that they're flipping the value of the data upload pref while Firefox is running. We looked at this in bug 1741252 and also noticed that this pref flip must be happening before Glean has been initialized. See for example how spikey the volume is on Telemetry and how not-spikey it is on Glean.

I believe this is something we could do through a telemetry analysis, i.e. filter out Weishi360 users from https://sql.telemetry.mozilla.org/queries/82863#205344 to work out whether the spikes go away.

I encourage anyone measuring new profiles in Firefox Telemetry to use the "new-profile" ping. It was built for such analyses, whereas the (internal, complicated, endlessly-infuriating) client_id was not. It is not subject to these spikes (no doubt it has its own problems, though) and also isn't subject to the "spurious client_id" problem (which this is not).

If you wish to measure new profiles in Glean, I recommend getting in touch with someone who understands the profile life cycle and working with them to instrument something very specific to your analysis purpose (perhaps also in a shape of a "new-profile" ping). Using the client_id for purposes beyond defeating pseudo-replication bias on shortish time scales is likely not going to work out well.

To your specific question: I know of antivirus instrumentation in the Environment that is Windows 8+ (Weishi 360/360 Safeguard says it supports all of the Big Three desktop OSes): environment.system.sec.antivirus
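
As a rough illustration of reading that field, assuming a ping has already been parsed into nested Python dicts and that the antivirus entry is a list of product-name strings (or missing). The helper names are hypothetical, and the exact product string a given AV registers is not something I'm asserting here; pass in whatever the data shows.

```python
def antivirus_names(ping):
    """Return the list at environment.system.sec.antivirus, or [] if absent."""
    node = ping
    for key in ("environment", "system", "sec", "antivirus"):
        if not isinstance(node, dict):
            return []
        node = node.get(key)
    return list(node) if isinstance(node, list) else []


def has_antivirus(ping, needle):
    """Case-insensitive substring match against the registered AV names."""
    needle = needle.lower()
    return any(needle in name.lower() for name in antivirus_names(ping))
```

Because the field is only populated on Windows 8+, an empty list means "not reported", not "no AV installed".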

Flags: needinfo?(chutten)

For reference an issue with similar impact (regeneration of client_id) happened on iOS where Glean could not load data from disk (or does not persist it), leading to regenerating the client ID - see https://github.com/mozilla-mobile/focus-ios/pull/3595

Can you check the data for these and see if they have distribution IDs?

I want to see if this is happening with our regular Chinese builds or specifically MozillaOnline builds.

It's almost all MozillaOnline in China: https://sql.telemetry.mozilla.org/queries/93674/source#231753

I ran a couple of quick checks. MozillaOnline definitely uses our update server (that was a question), and I did an update myself with a mozillaonline build and the cachedClientID didn't change, nor did the update channel.

So I guess all we can do is wait and see. I honestly don't know if it will fix this.

With no spike from Fx116, the evidence is consistent with this bug having the same cause as bug 1845338. And with that fixed, we might be able to close this out.

See Also: → 1845338

(In reply to Chris H-C :chutten from comment #15)

With no spike from Fx116, the evidence is consistent with this bug having the same cause as bug 1845338. And with that fixed, we might be able to close this out.

That would be a pleasant side effect! Never let a good crisis go to waste and all that.

I re-ran Romain's query and it seems the new profile level from mozillaonline has subsided and currently stays around 100k (as of 10/31/2023).

Good enough for me to call this a dupe.

Status: NEW → RESOLVED
Closed: 2 years ago
Duplicate of bug: 1845338
Resolution: --- → DUPLICATE