Closed Bug 1224725 Opened 6 years ago Closed 5 years ago

Switch Foxfooders on aries to the correct dogfood build and the IMEI whitelisting.

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nhirata, Assigned: janx)

References

Details

(Whiteboard: [step-in-progress=1G])

spin off of bug 1217490/ bug 1224357

We need to coordinate what the plan is on switching dogfooders to the correct build if they are on the nightly OTA build.
What was suggested in bug 1217490 comment 15 was:

> find a way to get all the wrongly-updated dogfood-devices back to the "dogfood" channel.
> Maybe we can we push a "nightly" build that detects dogfood-devices, and fixes the channel
> back to "dogfood" if it does?

But before attempting to address this problem, we need to wait for points 1) and 2) from that comment to be addressed (otherwise, devices back on "dogfood" could accidentally switch to "nightly" again).

Thus, marking this bug as depending on bug 1217490 rather than blocking it.
No longer blocks: 1217490
Depends on: 1217490
In bug 1217490 comment 28, you indicate that only foxfood devices use the IMEI hash in the URL. Maybe there is a better way to switch all foxfood devices on "nightly" back to "dogfood" channel.

Instead of landing a patch that tries to detect a foxfood device and switches the channel from the inside, maybe we can:

1. Make the OTA server detect requests for "nightly" channel that have an IMEI hash in the URL.
2. If there is an IMEI hash, and it matches our whitelist, make the OTA server send a "dogfood" OTA instead of a "nightly" OTA.

Is such a thing even possible?
Flags: needinfo?(nhirata.bugzilla)
I think that's possible.  We would have to coordinate efforts between releng and taskcluster, though.  Alexandre might know best how to proceed with this, although he's not on either team.

The other option that I think might be on the table is to create a dogfood app: people who volunteer agree to the license, and the app then picks up the IMEI, switches the device to the dogfood channel, and whitelists that device.
Flags: needinfo?(nhirata.bugzilla) → needinfo?(lissyx+mozillians)
I think option one would be the best but I have no idea if that is possible. Maybe Chris does know?
Flags: needinfo?(lissyx+mozillians) → needinfo?(catlee)
Following an IRC discussion with :gerard-majax, I'll summarize the issue with additional information.

The problem: Many dogfood devices (normally with "app.update.channel" setting set to "dogfood") received a "nightly" update by mistake that made them switch channels (their "app.update.channel" setting is now "nightly", so they're requesting "nightly" OTAs now). We'd like to switch these dogfood devices on "nightly" back to the "dogfood" channel.

Our options:


1. All devices on "nightly" channel today send an additional "IMEI" URL parameter in their requests for OTA updates. Regular devices that were on nightly before have "default" as value for that parameter, but affected dogfood devices have an IMEI hash there.

A regular device's OTA request (should receive a "nightly" OTA) looks like:
https://aus5.mozilla.org/update/5/B2G/44.0a1/20151023030241/flame/en-US/nightly/Boot2Gecko%202.5.0.0-prerelease%20(SDK%2019)/default/default/default/update.xml?force=1

An affected dogfood device's OTA request (should receive a "dogfood" OTA) looks like:
https://aus5.mozilla.org/update/5/B2G/44.0a1/20151023104059/aries/en-US/nightly/Boot2Gecko%202.5.0.0-prerelease%20(SDK%2019)/default/default/4158cc2883[...]6df6093550/update.xml?force=1

We can see that devices normally on "nightly" send "default", but that dogfood devices send a long series of digits and letters (I shortened the hash in my example).

Based on that, maybe we can make the following change to the OTA server:
- For affected dogfood devices requesting an OTA on "nightly" with an IMEI hash, send a "dogfood" OTA instead of "nightly".
- For devices that are on "nightly" normally, send a "nightly" OTA.
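
To make the distinction concrete, here is a minimal sketch (not the real OTA server code) that splits an update URL of the shape shown above and checks the last path field. The field names are assumptions, assigned purely by position in the two example URLs, and `wanted_ota` is a hypothetical helper:

```python
from urllib.parse import urlsplit, unquote

# Hypothetical field names, assigned purely by position in the example URLs.
FIELDS = ("product", "version", "build_id", "device", "locale",
          "channel", "os_version", "dist", "dist_version", "imei_hash")


def parse_update_request(url):
    """Split an update URL of the shape shown above into named fields."""
    parts = urlsplit(url).path.strip("/").split("/")
    # Expected shape: update / 5 / <10 fields> / update.xml
    if len(parts) != 13 or parts[0] != "update" or parts[-1] != "update.xml":
        raise ValueError("unexpected update URL shape")
    return dict(zip(FIELDS, (unquote(p) for p in parts[2:-1])))


def wanted_ota(request):
    """Return the OTA flavour the device should ideally receive."""
    # Regular nightly devices send "default"; affected dogfood devices
    # send an IMEI hash in the last field.
    if request["channel"] == "nightly" and request["imei_hash"] != "default":
        return "dogfood"
    return request["channel"]
```

With the regular "flame" request above, `wanted_ota` returns "nightly"; a "nightly" request carrying any non-"default" value in the last field returns "dogfood".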


2. We could also fix this problem in a more 'brutal' way: affected dogfood devices have a Gaia version that was built with the "DOGFOOD=1" build flag, causing the setting "debug.performance_data.dogfooding" to be `true`. Maybe we could land a patch in Nightly, that checks for this setting, and forces the channel back to "dogfood" if the setting is `true`, and send an update with that patch to both regular and affected devices.
An option like 1. is preferable (:gerard-majax insisted on it), but :catlee seemed to say that sending a different OTA to devices based on the presence of an IMEI hash in the request URL is not possible.

If that's really impossible, maybe we can send different OTAs based on:
- The IMEI hash is in the dogfooders whitelist?
- The build ID is recognized to be the faulty build that we accidentally sent to dogfooders, i.e. "20151023104059"?
On IRC, :catlee explained that the problem with option 1. is that we can't currently send two different OTAs on the same update channel based on some conditions. We can only send a single OTA on any given channel, and choose to accept or deny requests.
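
A rough model of that constraint, as I understand it from the IRC discussion: each channel points to at most one build, and a rule only decides whether a request is accepted or denied. The `Channel` class and its field names are made up for illustration and are not the update server's actual data model:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Channel:
    """One update channel: at most one build, plus an accept/deny rule."""
    name: str
    build_id: Optional[str] = None  # None models a frozen channel (no OTA)
    accepts: Callable[[dict], bool] = lambda request: True

    def respond(self, request):
        """Return the build ID to offer, or None when no update is served."""
        if self.build_id is None or not self.accepts(request):
            return None
        return self.build_id


# Hypothetical whitelist entry and rule, for illustration only.
whitelist = {"hypothetical-imei-hash"}
nightly = Channel("nightly", build_id="some-dogfood-build",
                  accepts=lambda r: r["imei_hash"] in whitelist)
```

The point is that `respond` can never pick between two builds: it either returns the channel's single build or nothing, which is why the plan below has to repurpose the whole "nightly" channel for a while.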

So, option 1. becomes the following steps:
A. Currently, both "dogfood" and "nightly" channels are frozen (no OTAs).
B. We need to make sure that "dogfood" channel will receive proper "dogfood" builds (not "nightly"!).
C. We can modify "nightly" channel to only serve updates to whitelisted-IMEI requests (effectively serving only dogfood devices on the "nightly" channel, but denying regular "nightly" devices to protect them).
D. Once that's done, we can prepare a good "dogfood" OTA on both "dogfood" and "nightly" channels, and unfreeze both channels.
E. Regular "nightly" users won't receive any OTAs, but dogfood devices on "nightly" will start migrating back to "dogfood" channel. After a week, we'll have a look at the data, and decide if every dogfooder is back, or if we need to serve the "please come back" OTA longer.
F. Once every dogfooder is back and we decide to stop the transition, we freeze "nightly" channel again (the "dogfood" channel continues to serve dogfood OTAs)
G. We prepare a valid "nightly" build for the remaining, regular nightly devices on that channel.
H. Once we're sure that the OTA ready for "nightly" is good (and not "dogfood"!), we unfreeze "nightly" again.

Problem solved.

I will also add an entirely different option:

3. Reach out to every dogfooder that we lost to "nightly" channel, and ask them to please go to their Settings app, at the bottom of the Developer menu, and manually change their channel from "nightly" back to "dogfood". Then repeat until every lost dogfooder came back to "dogfood".

If we still go with option 1., I suggest we reach out anyway during step 1E. to accelerate the transition.

:catlee, do you think option 1. is too much work and risk to be worth trying?
A clarification about step 1D.: a good dogfood OTA we could send is build ID 20151116140807 (from :jlorenzo on IRC).
Just something to remember:
* a lot of dogfooders might not be using their phones anymore and are likely still on the "dogfood" channel
* they might not have applied the FOTA that was requested a few weeks ago

In other words, you might have a lot of dogfooders who won't see the update because they're not using their phones (hence not *all* dogfooders will end up on the right channel).

Thanks for figuring this out!
This doesn't sound too risky. The risk is that you end up leaving some dogfooders behind on the "nightly" channel, but you have plans to address that. Because we're relying on the IMEI whitelist, the risk that non-dogfood users get migrated over is minimal.

I need some help pinning down the build you want for step 1D). Which branch/revision is that from?
Flags: needinfo?(catlee)
> * they might not have applied the FOTA that was requested a few weeks ago

That's true. Many foxfooders that got a device were warned not to update until further notice. So many are on the dogfood channel and wait.
(In reply to Chris AtLee [:catlee] from comment #10)
> This doesn't sound too risky. The risk is that you end up leaving some
> dogfooders behind on the "nightly" channel, but you have plans to address
> that. Because we're relying on the IMEI whitelist, the risk that non-dogfood
> users get migrated over is minimal.

Thanks Chris. While we send out these updates, we'll also reach out to dogfooders to ensure nobody gets left behind on the "nightly" channel indefinitely (or at least they'll know how to come back).

Also, I'd like to ask for a small change to 1C.: In addition to checking if the IMEI is whitelisted, is it possible to also look at the build ID? A portion of the affected devices (at least 25% of the 80 devices handed out at MozFest) don't send their IMEI in their update requests anymore (it is set to "default" so they won't appear to be whitelisted), but they're still on the bad dogfood OTA (build ID "20151023104059").

Is it possible to have the rule go like this?
1C. Modify the "nightly" update channel to distribute the latest "dogfood" OTA only to these devices that have a whitelisted IMEI *and/or* have the build ID "20151023104059"
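
Expressed as a predicate, the requested rule would look roughly like this. The build ID comes from this bug; the function name and the whitelist contents are hypothetical:

```python
FAULTY_BUILD_ID = "20151023104059"  # the bad OTA mentioned in this bug


def should_serve_dogfood_fix(build_id, imei_hash, whitelist):
    """Accept a "nightly" request if the IMEI hash is whitelisted
    and/or the device is still on the faulty build."""
    return imei_hash in whitelist or build_id == FAULTY_BUILD_ID


# A device that no longer sends its IMEI ("default") but is still on the
# faulty build is accepted even with an empty whitelist.
print(should_serve_dogfood_fix("20151023104059", "default", set()))
```

A regular nightly device (build ID 20151023030241, IMEI field "default") matches neither condition and is denied.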

> I need some help pinning down the build you want for step 1D). Which
> branch/revision is that from?

I will get back to you with a proper branch/revision, because we'd also like to fix 1222527 before that OTA.
Flags: needinfo?(catlee)
Johan, once bug 1226906 has been merged (fixing bug 1222527), can you give Chris a good "dogfood" build including that patch so Chris can send it as an OTA update on both "nightly" and "dogfood" channels?

(As agreed, both "dogfood" and "nightly" channels will only work for dogfood devices temporarily, but "nightly" should resume normal operations after about a week, once enough dogfood devices have transitioned back to "dogfood").
Flags: needinfo?(jlorenzo)
Quick update: We opted for using Option 1, and using the faulty build ID to detect dogfooder devices who should receive the "dogfood" OTA from the "nightly" channel (there are a few edge cases to that plan, but we can mitigate).

I'll follow up tomorrow with more details, and an updated summary of what Option 1 means exactly today.
Thanks Jan. Please keep William and me in the loop on release timing, preferably with a 2 day heads up, so we can give a heads up to foxfooders on what to expect.
So again, here are our options to solve this bug (basically comment 7 with follow-up changes applied):

Common steps:
A. Currently, "nightly" channel is frozen (no OTAs), and last OTA there was Build ID 20151023030241 (not the same as Option 1's Build ID).
B. We need to make sure that "dogfood" and "dogfood-latest" channels will serve proper "DOGFOOD=1" builds (not "nightly" builds!).

Option 1 (preferred):
C. We can modify "nightly" channel to *only serve updates to dogfooder devices* (denying regular "nightly" devices to protect them). We do this by looking for *BUILD ID 20151023104059* in update requests (see complete request URLs in comment 5, non-dogfooder devices have 20151023030241). No whitelist involved here.
D. Once that's done, we'll prepare a good "dogfood" OTA to be served on "dogfood" channel. When we're sure it's a good DOGFOOD build, we make "nightly" channel point to the same build (only serving dogfooder devices).
E. Regular "nightly" users won't receive any OTAs, but dogfood devices on "nightly" will start migrating back to "dogfood" channel thanks to the OTA. After a week, we'll have a look at the data, and decide if sufficient dogfooders are back, or if we need to continue serving the "please come back" OTA.
F. Once most dogfooders are back and we decide to stop the transition, we freeze "nightly" channel again (the "dogfood" channel continues to serve regular dogfood OTAs)
G. We prepare a valid "nightly" build for the remaining regular nightly devices on that channel.
H. Once we're sure that the OTA ready for "nightly" is good (and not a "dogfood" build!), we unfreeze "nightly" again.

Option 2 (bad):
C. We can land a patch that attempts to detect if a device is a dogfooder device, and force "app.update.channel" pref and setting back to "dogfood" if it is.
D. Back out that patch once every dogfooder is back.
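
For reference, the logic of Option 2 boils down to something like the sketch below. It only models the decision on a plain dictionary of settings; an actual patch would live in Gaia/Gecko rather than Python, and the function name is hypothetical. The two setting names are the ones mentioned earlier in this bug:

```python
def fix_update_channel(settings):
    """Return a copy of the settings with the channel forced back to
    "dogfood" when the device was built with DOGFOOD=1."""
    fixed = dict(settings)
    if fixed.get("debug.performance_data.dogfooding") is True:
        fixed["app.update.channel"] = "dogfood"
    return fixed
```

Regular nightly devices (where the dogfooding setting is absent or false) keep their channel unchanged, but they would still have to download and install the patched build, which is part of why this option is considered bad.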

Option 3 (fall back):
C. Reach out to every dogfooder that we lost to "nightly" channel, and ask them to please go to their Settings app, at the bottom of the Developer menu, and manually change their channel from "nightly" back to "dogfood". Then repeat these emails until every lost dogfooder came back to "dogfood". They all signed a document where they agreed to actively dogfood, so if they ignore/refuse emails, we could ask them to return the device.

Option 4 (legal concerns due to non-dogfooder IMEI reporting):
C. Land a patch that makes every device on "nightly" report its IMEI hash in OTA requests.
D. Then, modify "nightly" channel to only serve whitelisted devices (which have IMEIs that we know are dogfooders), and serve a DOGFOOD=1 build to them.
E. Dogfooder devices transition back to "dogfood" channel. When we're satisfied with the transition, we undo step D. and back out the patch from step C.

(In reply to Brian King [:kinger] from comment #15)
> Thanks Jan. Please keep William and me in the loop on release timing,
> preferably with a 2 day heads up, so we can give a heads up to foxfooders on
> what to expect.

Will do! I'm already keeping William in the loop, I'll start pinging you too.
The public announcement causes an issue where there are people that will want to rollback on a build.  I don't think 1 is viable any more for legal reasons if they find a way to "roll back" the buildid.
To note, any messaging that goes out should go ONLY to the people in the foxfood program, and it should mention that they should not distribute that knowledge/email to the public.

The foxfood program collects internal metrics that participants signed up for, which presents a legal issue for people who accidentally end up going from nightly to foxfood builds.  For people who purposefully switch themselves to foxfood builds, we cannot do anything about this.
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #17)
> The public announcement causes an issue where there are people that will
> want to rollback on a build.  I don't think 1 is viable any more for legal
> reasons if they find a way to "roll back" the buildid.

As noted on the email thread, that announcement wasn't public and was only sent to Z3C foxfooders who had signed the Foxfooding Terms & Conditions (so no legal worry here).

We'll be ready to move forward with Option 1 as soon as a new dogfood build is available on the "dogfood" channel, and we can verify it's really a DOGFOOD=1 build.
We've provisionally executed step 1C (see comment 16) - :catlee changed the B2G "nightly" channel rules to only serve requests with Build ID 20151023104059.

You can simulate an OTA request on that channel by visiting this link:
https://aus5.mozilla.org/update/5/B2G/44.0a1/20151023104059/aries/en-US/nightly/Boot2Gecko%202.5.0.0-prerelease%20%28SDK%2019%29/default/default/default/update.xml?force=1

(Right now, no build is available. We're waiting on QA to greenlight a new dogfood build, needed for step 1D.)
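
If you'd rather generate such test links than edit them by hand, here is a small sketch that rebuilds the URL above from its parts. The parameter names are my own labels for the path positions, and the defaults are simply the values from the link above:

```python
from urllib.parse import quote


def build_update_url(build_id, device, channel, imei_hash="default",
                     version="44.0a1", locale="en-US",
                     os_version="Boot2Gecko 2.5.0.0-prerelease (SDK 19)"):
    """Assemble an update-check URL with the same shape as the link above."""
    fields = ["B2G", version, build_id, device, locale, channel,
              os_version, "default", "default", imei_hash]
    # Percent-encode each field on its own so spaces and parentheses
    # in the OS version cannot break the path.
    path = "/".join(quote(field, safe="") for field in fields)
    return f"https://aus5.mozilla.org/update/5/{path}/update.xml?force=1"


print(build_update_url("20151023104059", "aries", "nightly"))
```

Swapping in build ID 20151023030241 produces the request a regular nightly device would send, which the current "nightly" rules should deny.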

A note on how to communicate about this problem with foxfooders:
- No need to talk about update channels or Build IDs. They're confusing.
- Right now, foxfooders should know that we found a solution, and that the next OTA will fix everything (regardless of Build ID or channel).
- Once we reach step 1E and give the green light, please say something like "We've resolved the update issue! Please install the next OTA you'll see."
Flags: needinfo?(catlee)
Whiteboard: [last-completed-step=1C]
Assignee: nobody → janx
Status: NEW → ASSIGNED
We now have a good DOGFOOD=1 build that we'd like to make available as an OTA update:

https://tools.taskcluster.net/task-inspector/#Do8_llXIRXCZ6uao3hfVvQ/0

It was fully validated by QA, and we double and triple checked it.

Chris, to begin step 1D, can we make this build available as an OTA on the "dogfood" channel? (Maybe let's do "dogfood" first, then make sure we got it right this time, and then publish it on "nightly"?)
Flags: needinfo?(jlorenzo)
Flags: needinfo?(catlee)
We have now published this OTA on "dogfood-test", "dogfood" and "nightly" channels (on "nightly", we currently only serve dogfooder devices which have Build ID 20151023104059).

We've verified that what we serve is actually a DOGFOOD=1 build, and I'm seeing it on my affected dogfooder device. (I'm currently installing the OTA as an additional check to verify that everything works as expected.)

Before we reach out to dogfooders, let's verify that installing the OTA on an actual affected device really solves the issue as we expect.

Then, let's reach out to foxfooders tomorrow (Friday): William and Brian, I'll give you the green light. Please refer to the communication guidelines in comment 20.
Flags: needinfo?(wquiviger)
Flags: needinfo?(catlee)
Flags: needinfo?(bking)
Whiteboard: [last-completed-step=1C] → [last-completed-step=1D]
(In reply to Jan Keromnes [:janx] from comment #22)
> Before we reach out to dogfooders, let's verify that installing the OTA on
> an actual affected device really solves the issue as we expect.

I've got one such device, if you need help with that.

> Then, let's reach out to foxfooders tomorrow (Friday): William and Brian,
> I'll give you the green light. Please refer to the communication guidelines in
> comment 20.

Got it.
Flags: needinfo?(bking)
Quick update: The OTA installed on an affected device works as expected, and transitions a lost foxfooder on "nightly" channel back to the "dogfood" channel.

That means we did it!

Thanks a lot to everyone who helped fix this.

Now let's wait a little to build up confidence, and tomorrow let's reach out to foxfooders and ask them to install the OTA.

Lost foxfooder devices should start migrating back to the "dogfood" channel (and foxfooders already on "dogfood" should stay there). Let's watch as this happens, and re-evaluate this issue in a week (after the Orlando All Hands).

(In reply to Brian King [:kinger] from comment #23)
> (In reply to Jan Keromnes [:janx] from comment #22)
> > Before we reach out to dogfooders, let's verify that installing the OTA on
> > an actual affected device really solves the issue as we expect.
> 
> I've got one such device, if you need help with that.

If you want, you can independently confirm the fix by installing the OTA you see, and verify that it solves the problem.
(In reply to Jan Keromnes [:janx] from comment #24) 
> If you want, you can independently confirm the fix by installing the OTA you
> see, and verify that it solves the problem.

I'm not being offered the update.
> Quick update: The OTA installed on an affected device works as expected, and transitions a lost foxfooder on "nightly" channel back to the "dogfood" channel.

I tried this on a non-affected device and the update channel remained as "dogfood". Just a confirmation that this also works for non-affected devices.
(In reply to Brian King [:kinger] from comment #25)
> I'm not being offered the update.

Devices usually check for new updates once every 24h. To force a check manually, you can go to Settings > Device Information > Check Now. Hopefully you should then see the OTA notification.

(In reply to Nikos Roussos [:nikos] from comment #26)
> I tried this on a non-affected device and the update channel remained as
> "dogfood". Just a confirmation that this also works for non-affected devices.

Very cool, thanks Nikos!
We know that the dogfood devices have started migrating back to the "dogfood" channel.

Let's keep an eye on the data, and decide when we're happy about the results before ending the migration phase (expected in about a week from now).
Whiteboard: [last-completed-step=1D] → [step-in-progress=1E]
Blocks: 1219918
Update: Rob told us that 328 devices are running the new OTA build, and have is_dogfood = True.

This further confirms that the transition plan is working great!

Let's leave the dogfood-fix OTA on "nightly" a few more days, then early next week let's implement steps 1F, 1G and 1H to return the "nightly" channel to normal operations (see also bug 1219918).
Flags: needinfo?(wquiviger)
Hi Chris, happy new year! We'd like to finish this up (only 3 simple steps left, see comment 16).

1F. When you get a chance, could you please make the "nightly" B2G update channel point to no build at all?

You can verify that no build is available with [0] (which currently still serves our dogfood-fix OTA with Build ID 20151130130150).

[0] https://aus5.mozilla.org/update/5/B2G/44.0a1/20151023104059/aries/en-US/nightly/Boot2Gecko%202.5.0.0-prerelease%20%28SDK%2019%29/default/default/default/update.xml?force=1
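
To check the response of [0] without reading the XML by eye, something like the sketch below could help. It assumes the response is an `<updates>` document with one `<update>` element per offered build carrying a `buildID` attribute, and that an empty `<updates>` means no build is offered; the sample documents are hand-written for illustration, not captured server output:

```python
import xml.etree.ElementTree as ET


def offered_build_ids(update_xml):
    """List the build IDs offered in an update.xml response (assumed layout)."""
    root = ET.fromstring(update_xml)
    return [update.get("buildID") for update in root.iter("update")]


# Hand-written samples (assumed layout, not real server output).
no_update = '<?xml version="1.0"?><updates></updates>'
dogfood_fix = ('<?xml version="1.0"?><updates>'
               '<update type="minor" buildID="20151130130150">'
               '<patch type="complete" URL="https://example.invalid/update.mar"/>'
               '</update></updates>')

print(offered_build_ids(no_update))
print(offered_build_ids(dogfood_fix))
```

Once step 1F is done, the list for [0] should come back empty.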
Flags: needinfo?(catlee)
Whiteboard: [step-in-progress=1E] → [step-in-progress=1F]
Update: The dogfood fix OTA has been successfully removed from the "nightly" channel, which is now free to resume regular nightly builds (thanks Chris).
Flags: needinfo?(catlee)
Whiteboard: [step-in-progress=1F] → [step-in-progress=1G]
This was fixed, and now ironically we're switching people off of dogfood=1 to dogfood=0, see bug 1248791; phone sunsetting.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: General Automation → General