Open Bug 1495152 Opened 6 years ago Updated 2 years ago

Shield study participants not getting prefs flipped

Categories

(Shield :: Shield Study, defect)

Not set
normal

Tracking

(Not tracked)

People

(Reporter: jrmuizel, Unassigned)

References

(Blocks 1 open bug)

Details

About 26-27% of participants in the webrender shield study are not getting the appropriate pref flipped.

i.e.

["experiments"]["prefflip-webrender-v1-2-1492568"]["branch"] = "enabled"/"disabled"

but 

'gfx.webrender.all.qualified' not in p['settings']['userPrefs']
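
The mismatch can be stated as a filter over ping dicts. A minimal sketch, assuming pings shaped like the fields quoted above (the toy data here is hand-made, not real telemetry):

```python
EXPERIMENT = "prefflip-webrender-v1-2-1492568"
PREF = "gfx.webrender.all.qualified"

def enrolled(p):
    # Participant if the experiment appears in the ping's experiments map.
    return EXPERIMENT in p.get("experiments", {})

def pref_missing(p):
    # The pref never showed up among the user-set prefs.
    return PREF not in p.get("settings", {}).get("userPrefs", {})

def mismatch_rate(pings):
    participants = [p for p in pings if enrolled(p)]
    return sum(pref_missing(p) for p in participants) / len(participants)

# Toy data: 3 enrolled pings, 1 of them missing the pref.
pings = [
    {"experiments": {EXPERIMENT: {"branch": "enabled"}},
     "settings": {"userPrefs": {PREF: True}}},
    {"experiments": {EXPERIMENT: {"branch": "disabled"}},
     "settings": {"userPrefs": {PREF: False}}},
    {"experiments": {EXPERIMENT: {"branch": "disabled"}},
     "settings": {"userPrefs": {}}},
]
print(mismatch_rate(pings))  # 1 of the 3 toy participants lacks the pref
```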
Blocks: 1492568
I'm still not entirely sure what I've seen in bug 1477380 comment 35.
Could you check if they still have app.normandy.startupExperimentPrefs.gfx.webrender.all.qualified;true/false (shown in bold in about:config, i.e. a user-set pref)?
Like this: https://hg.mozilla.org/mozilla-central/rev/21fd70d87432 Just to be sure.
If it exists, then participants could have gfx.webrender.all.qualified;true/false as a visible default pref (probably not overwritable) in both branches.
Michael, what can we do to make progress on this?
Flags: needinfo?(mcooper)
In my mind, we are getting conflicting results from Telemetry. Parts of Telemetry (the events sent by Normandy) are saying the pref is changed, whereas other parts (the pref reporter) are saying the pref is not set. To me, this indicates we can't answer this question by digging into Telemetry more, since it has already been shown to be not completely trustworthy here (though maybe I'm missing something). Both of the systems involved are pretty well tested by this point, so I'm not willing to point fingers at one or the other.

I think we should try to reproduce this in a controlled environment. By setting up machines that the Normandy recipe in question matches, we can see if Normandy is actually setting the pref, and if not, why. If the data we are getting from Telemetry is to be believed, this should be pretty common (about 1 in 4). Once we do this we can look for evidence of explicit errors, race conditions, checking whether restarts fix the problem, etc. Consistent STR would be amazing.

If we can't reproduce this in a controlled environment, we'll have to theorize about other ways this can go wrong, and add extra telemetry to try and track down what is happening. I don't have any theories about this right now though.

If setting up machines that match the existing recipe is difficult, I can make a custom recipe on the stage Normandy environment that is easier to match.
Flags: needinfo?(mcooper)
Looked at this for a while. I've not been able to repro the pref set failure on my machine.

I note however that Normandy checks on init whether the value it thinks it stored in the pref matches what it stored in its local data store:

https://searchfox.org/mozilla-central/rev/65f9687eb192f8317b4e02b0b791932eff6237cc/toolkit/components/normandy/lib/PreferenceExperiments.jsm#184

So if Normandy was incorrectly setting the pref, it should be unenrolling people from the study there on a subsequent session. Maybe it immediately re-enrolls them later on in the session, and fails to set the pref again?
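
For reference, the logic at that line amounts to something like the following (a Python paraphrase for illustration only, not the actual JSM code; the function and argument names are made up):

```python
def startup_check(stored_value, live_value, stop):
    """On init, compare the pref value Normandy recorded in its local
    store against the live pref; on mismatch, unenroll with reason
    "user-preference-changed-sideload"."""
    if live_value != stored_value:
        stop("user-preference-changed-sideload")
        return False  # experiment stopped
    return True       # still enrolled

events = []
still_enrolled = startup_check(
    stored_value=False,  # Normandy believes it set the pref to false
    live_value=True,     # but the live pref reads true (the default)
    stop=lambda reason: events.append(("unenroll", reason)),
)
```

So if the pref really were wrong session after session, we would expect this path to fire on restart and an "unenroll" event to appear in telemetry.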

When we unenroll, PreferenceExperiments.stop() sends an "unenroll" telemetry event, which I assume gets transmitted back to us somehow (?), and if we're hitting that case it should have reason "user-preference-changed-sideload". Are we seeing that in our telemetry data?

I've looked through the code that decides what value to set the pref to ( https://github.com/mozilla/normandy/blob/091bc5fc1df48867b97788ede09500fc9fc007df/client/actions/preference-experiment/index.js#L68 ) but it looks correct.

So basically, we have a contradiction: WebRender is on because the pref value isn't set (it should be false for branch=disabled), yet Normandy thinks the study is active. Normandy should be checking that the pref is set as expected and unenrolling if it isn't, but it does not appear to be doing so.

Michael: Do we have any unenroll messages in telemetry PreferenceExperiments? Thanks!
Flags: needinfo?(mcooper)
> Do we have any unenroll messages in telemetry PreferenceExperiments?

I looked at the telemetry events for this recipe[0]. There have been about 6k enrollment events and 524 unenrollment events. Of those unenrollment events, 497 are "recipe-not-seen". Since the recipe is still active[1], that usually means the user changed in such a way that the filter no longer matched. For most recipes this is due to users switching channels or versions of Firefox, but in this case that isn't true. I suspect it might be that users' eligibility for WebRender changes over time.

The unenrollment reason "user-preference-changed" happened 12 times, and "user-preference-changed-sideload" happened 15 times. Combined, that is 27 events, about 0.5% of enrollment events. I think these are rare enough to rule out the hypothesis in comment 5.
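
For the record, the two possible denominators give quite different percentages (counts copied from the query results above):

```python
# Counts from the telemetry query quoted above.
enrollments = 6000       # "about 6k enrollment events"
unenrollments = 524
pref_changed = 12 + 15   # "user-preference-changed" + the sideload variant

print(pref_changed)                                  # 27
print(round(pref_changed / unenrollments * 100, 1))  # 5.2  (% of unenrollments)
print(round(pref_changed / enrollments * 100, 2))    # 0.45 (% of enrollments)
```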

> Maybe it immediately re-enrolls them later on in the session

If this is true, that's a bug. Most Normandy actions, including preference experiments, are one-time only, and aren't supposed to re-enroll users after they've been unenrolled. I looked for direct evidence of this in the databricks notebook above. Only 1% of users had an invalid sequence of events, defined here as more than one enroll or unenroll event for the same client id. This matches the background rate of weirdness we see in almost all Normandy telemetry.

[0]: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/38580/command/38589
[1]: https://normandy.cdn.mozilla.net/api/v3/recipe/587/history/?format=json
Flags: needinfo?(mcooper)
Depends on: 1500230
We ran a test on dummy preferences, in bug 1500230. All of the tests showed the expected output in Telemetry. The study ran for a little under four days, and enrolled approximately 75K users across all branches. Of these users, only 20 (about 0.025%) did not have the expected telemetry.

From this I conclude that the problem is not simply Normandy or Telemetry, but either WebRender or an interaction of WebRender with one or both of the others. This isn't really a solution to the original problem, but it is nice to know that some of the foundational parts of our testing infrastructure are behaving as expected.
The incorrect reporting rate was about 20% over the past 2 weeks or so:
https://gist.github.com/cpearce/00d26c4a00b0681e827014f408e2efd7
I spent a bunch of time today pulling out individual pings from the experiment results and trying to see if there was any reason they'd have the pref or not. I didn't come up with anything interesting yet on that front. However I noticed that there are cases where:
  (a) the experiment is active, and
  (b) the user is on the "disabled" branch, and
  (c) the gfx.webrender.all.qualified pref is set to false, and
  (d) the compositor being used is webrender, which is unexpected.

This wasn't an isolated case, it occurred in 400 out of the 6347 pings in my sample (100% of disabled-branch experiment pings with submission date of 20181020) which seems kind of significant. I'm not sure what would cause this unless they forced webrender on by setting gfx.webrender.{enabled,all} or MOZ_WEBRENDER=1 in the env. gfx.webrender.all should get reported in the telemetry, and it's not set (at least in the ping I was inspecting) so it might be one of the other two, or a bug in the code.
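
Conditions (a)-(d) translate directly into a filter over pings. A sketch, reusing the ping fields quoted earlier in this bug (the "compositor" key is a stand-in for wherever the ping reports the active compositor):

```python
EXPERIMENT = "prefflip-webrender-v1-2-1492568"
PREF = "gfx.webrender.all.qualified"

def contradictory(p):
    exp = p.get("experiments", {}).get(EXPERIMENT)
    return (exp is not None                               # (a) experiment active
            and exp.get("branch") == "disabled"           # (b) disabled branch
            and p.get("settings", {})                     # (c) pref set to false
                 .get("userPrefs", {}).get(PREF) is False
            and p.get("compositor") == "webrender")       # (d) yet WR compositing

pings = [
    {"experiments": {EXPERIMENT: {"branch": "disabled"}},
     "settings": {"userPrefs": {PREF: False}}, "compositor": "webrender"},
    {"experiments": {EXPERIMENT: {"branch": "disabled"}},
     "settings": {"userPrefs": {PREF: False}}, "compositor": "d3d11"},
]
print(sum(contradictory(p) for p in pings))  # 1
```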
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #9)
> I spent a bunch of time today pulling out individual pings from the
> experiment results and trying to see if there was any reason they'd have the
> pref or not. I didn't come up with anything interesting yet on that front.
> However I noticed that there are cases where:
>   (a) the experiment is active, and
>   (b) the user is on the "disabled" branch, and
>   (c) the gfx.webrender.all.qualified pref is set to false, and
>   (d) the compositor being used is webrender, which is unexpected.
> 
> This wasn't an isolated case, it occurred in 400 out of the 6347 pings in my
> sample (100% of disabled-branch experiment pings with submission date of
> 20181020) which seems kind of significant. I'm not sure what would cause
> this unless they forced webrender on by setting gfx.webrender.{enabled,all}
> or MOZ_WEBRENDER=1 in the env. gfx.webrender.all should get reported in the
> telemetry, and it's not set (at least in the ping I was inspecting) so it
> might be one of the other two, or a bug in the code.

I would expect to see this case when a user has enrolled in the experiment, but has not yet restarted Firefox. Is it possible to query the events associated with the experiment for each (or some) of these clients? Then we could see when they enrolled in the study. If it is recently, then this isn't a surprise. If it was well before that ping was sent, or we have evidence that they have restarted, then it is surprising.
I pulled all the pings for one specific client chosen at random from the disabled branch who had webrender enabled and the pref not set, and compared the buildIDs. I saw buildIDs increasing across the pings, implying they're restarting to update, but still experiencing this. I didn't pull pings for more than one client however.
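
The buildID check can be sketched as follows (assuming the pings are already grouped per client and sorted by submission date; "buildId" and the literal values are made-up stand-ins):

```python
def restarted_for_updates(client_pings):
    """Strictly increasing buildIDs across a client's pings imply the
    client has restarted at least once to pick up each update."""
    ids = [p["buildId"] for p in client_pings]
    # buildIDs are timestamps like YYYYMMDDhhmmss, so string order works.
    return all(a < b for a, b in zip(ids, ids[1:]))

client_pings = [
    {"buildId": "20181005214703"},
    {"buildId": "20181012103609"},
    {"buildId": "20181019235648"},
]
print(restarted_for_updates(client_pings))  # True
```
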
Can you share the id of that client? I would like to see the telemetry events specifically, because they can sometimes reveal more details.
(In reply to Michael Cooper [:mythmon] from comment #10)
> I would expect to see this case when a user has enrolled in the experiment,
> but has not yet restarted Firefox.

Makes sense.

> Is it possible to query the events
> associated with the experiment for each (or some) of these clients? Then we
> could see when they enrolled in the study. If it is recently, then this
> isn't a surprise. If it was well before that ping was sent, or we have
> evidence that they have restarted, then it is surprising.

I will look into this, thanks for the pointer. One clientId that has this is 2491447e-fb44-4c1b-b922-08228b622d32, in case you want to dig more as well.

(In reply to Chris Pearce (:cpearce) from comment #11)
> I pulled all the pings for one specific client chosen at random from the
> disabled branch who had webrender enabled and the pref not set, and compared
> the buildIDs. I saw buildIDs increasing across the pings, implying they're
> restarting to update, but still experiencing this. I didn't pull pings for
> more than one client however.

Not sure if this was a response to my comment or a general comment for this bug. But: if they're on the disabled branch with the pref not set, then they will fall back to the default behaviour (pref is set to true) and so having webrender enabled is "expected" in that case (modulo this bug). The issue with this user is that the pref didn't get set. And I believe that the experiment code will just attempt to set the pref once, so the only restart that would be relevant here is the one immediately after the experiment getting installed.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #13)
> (In reply to Chris Pearce (:cpearce) from comment #11)
> > I pulled all the pings for one specific client chosen at random from the
> > disabled branch who had webrender enabled and the pref not set, and compared
> > the buildIDs. I saw buildIDs increasing across the pings, implying they're
> > restarting to update, but still experiencing this. I didn't pull pings for
> > more than one client however.
> 
> Not sure if this was a response to my comment or a general comment for this
> bug. But: if they're on the disabled branch with the pref not set, then they
> will fall back to the default behaviour (pref is set to true) and so having
> webrender enabled is "expected" in that case (modulo this bug). The issue
> with this user is that the pref didn't get set. And I believe that the
> experiment code will just attempt to set the pref once, so the only restart
> that would be relevant here is the one immediately after the experiment
> getting installed.


Normandy checks on startup whether the pref was set, and unenrolls from the study if the pref isn't set.

So the fact that the pref is persistently not set across multiple sessions but Normandy keeps reporting that the user is in the study is weird; on every restart, Normandy should be unenrolling.

The pref is reported as not being set in telemetry, but Normandy is behaving as if the pref is set, and WR is behaving as if the pref isn't set.
(In reply to Michael Cooper [:mythmon] from comment #12)
> Can you share the id of that client? I would like to see the telemetry
> events specifically, because they can sometimes reveal more details.

Unfortunately I didn't save that analysis. But another client ID like this is 123c61ba-dd10-4079-ab60-40bb4e65dea0.
(In reply to Chris Pearce (:cpearce) from comment #14)
> The pref is reported as not being set in telemetry, but Normandy is behaving
> as if the pref is set, and WR is behaving as if the pref isn't set.

Ah, gotcha. Thanks for the explanation, I missed that Normandy unenrolls users in that scenario.
We can thank the telemetry gods for small favours: the probability of the pref failing to be set is similar between the disabled and enabled branches of the study, so we can probably filter out users where the pref isn't set without biasing the analysis. https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/42887/command/42897
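
The branch comparison boils down to this (toy rows; the real query lives in the Databricks notebook linked above):

```python
from collections import defaultdict

def failure_rate_by_branch(rows):
    # branch -> [pings missing the pref, total pings]
    counts = defaultdict(lambda: [0, 0])
    for r in rows:
        counts[r["branch"]][1] += 1
        if r["pref_missing"]:
            counts[r["branch"]][0] += 1
    return {b: missing / total for b, (missing, total) in counts.items()}

rows = [
    {"branch": "enabled",  "pref_missing": False},
    {"branch": "enabled",  "pref_missing": True},
    {"branch": "disabled", "pref_missing": False},
    {"branch": "disabled", "pref_missing": True},
]
print(failure_rate_by_branch(rows))  # {'enabled': 0.5, 'disabled': 0.5}
```

Similar rates on both branches mean dropping pref-missing users removes clients roughly uniformly, which is why the filtering shouldn't bias the analysis.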

Any chance there's a race condition on startup?