verify the win10 instances specified by OCC match what product expects

RESOLVED FIXED

Status

Taskcluster
Worker
a year ago
8 months ago

People

(Reporter: dustin, Assigned: grenade)

Tracking

(Blocks: 1 bug)

Details

(Reporter)

Description

a year ago
Joel brought this up in the meeting yesterday -- we should double-check that the Windows 10 that we're using to run tests in EC2 matches what Firefox Product wants to test on.

Comment 1

10 months ago
Joel, do you know if this has been verified since originally reported?
Flags: needinfo?(jmaher)
I don't know. My main request was that we run what most of our users are running; at the time, that was Windows 10 with the 1607 update. I suspect that is still the desired version and set of updates, but we should confirm. I believe our images were updated, although I don't have a way to verify this.

:ryanvm, could you help us figure out what specific version of windows 10 we should be running?
Flags: needinfo?(jmaher) → needinfo?(ryanvm)
Yes, I'm very glad we're having this discussion. Thanks for the ping, Joel.

First off, we have data! Be aware that the data can lag by up to a week, though, since it relies on a dataset that only gets updated on the weekends.
https://sql.telemetry.mozilla.org/dashboard/windows-10-user-strata

To answer the immediate question, nearly 85% of our users are currently on the Anniversary update (build 1607), so yes, we should definitely be using that or the Creators update (build 1703) now, depending on how far off we are from getting them enabled in production.

However, the bigger question here is what we should do with respect to adapting our Win10 CI machines to Microsoft's more aggressive update policies going forward.

What we know:
* Microsoft now ships monthly cumulative updates that are more than just security updates. The vast majority of users are on the current one within 1-2 weeks of Patch Tuesday.
* Microsoft throttled the Anniversary update for ~3 months (where we sat around 20-25% usage), then they unthrottled and we rapidly went up to ~65%. It has since steadily risen to the current 85% level.
* Rumor has it that they intend to ship the Creators update on a much more aggressive schedule (unthrottled by end of April). I haven't seen this confirmed by any "official" sources, though. Our Telemetry data will prove or disprove it soon enough.

So ultimately, I think there's one main question we need guidance from Product (hi Jeff!) on: What usage threshold should we set for switching CI over to a newer Win10 release? I do believe that ultimately we need to be less conservative about OS updates than we've historically been in the interest of testing what our users are actually on.

Once we know the answer to that, we can figure out what threshold we should set for starting to test a new release. We should also watch the uptake rate of the Creators update. If it's as fast as rumored, we might just want to start testing the new version immediately upon release under the assumption that we'll be over the critical threshold by the time everything is greened up.

I would also suggest we try to pick up new cumulative updates on some regular basis as well, since they can legitimately affect core functionality we care about in Firefox (like Direct2D and media playback stacks) and we know that Microsoft has been very successful in getting their users updated quickly after release. I recognize that doing so every month isn't likely to be practical, so maybe every 3 months or so would work?
Flags: needinfo?(ryanvm) → needinfo?(jgriffiths)
I would love to update once/year; that would be a huge step forward from our current model. Maybe we could find a way to get to twice/year?  There is a large cost with updating, as we have to test and fix stuff, then update our VMs and, more importantly, the hardware.  Just testing on the new hardware requires a pool of machines that only run the new updates after they have been applied.  I suspect that, with the work being done to make this more cloud-based, we can download a .msi or .exe of the updates and apply it dynamically for testing purposes.

Since we will be focusing on getting win10 running a bit more in Q2, I would like to make sure we start with something good instead of something 6+ months outdated.
(Reporter)

Comment 5

10 months ago
I suspect that when we are at the point where a task can specify an image to run in for Windows, similar to how it can for Docker, this will be a bit easier -- we can "smear" the upgrade over time as we are with the ubuntu1604 update.
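
One way to picture "smearing" an upgrade is a percentage rollout: each task is deterministically assigned to the old or new image, and the percentage is raised over time. The following is only a minimal sketch of that idea, not how Taskcluster does it; `pick_image`, the image names, and the hashing scheme are all hypothetical.

```python
import hashlib


def pick_image(task_id: str, rollout_percent: int,
               old_image: str = "win10-1607",
               new_image: str = "win10-1703") -> str:
    """Route roughly `rollout_percent` percent of tasks to the new image.

    Hypothetical helper: the image names are placeholders, and hashing
    the task id is an assumed way to make the choice repeatable.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return new_image if bucket < rollout_percent else old_image
```

Because the bucket depends only on the task id, a task that lands on the new image at 30% stays on it at 60%, so raising the percentage only ever moves tasks forward.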

I also suspect that going to a faster update cadence will identify tests that are too specific to version, and after a few iterations we'll be down to just tests that actually identify *problems* with the new version.
Focusing again on the short term, if we're talking Q2 for getting Win10 tests running in production, we should plan to go right to build 1703 then. I have very little doubt that the majority of our users will be on it by the end of Q2. I don't see why we'd want to be behind the curve from the start. Though I guess we're already two years behind the curve here anyway...

For the longer term, I think that we should figure out what we can do to lower the costs of doing these updates, and we'll be better off in the long run for doing so. FWIW, my hope/belief is that most of the cost of doing the updates will be up-front with getting a small pool of machines imaged for testing on Try. Seems unlikely that these updates are going to present as many compatibility issues as a major (i.e. 8->10) update would have had. And if they do, I think we'd like to know about it!
(Assignee)

Comment 7

10 months ago
just a note that we are using the "current version" build 10.0.14393 (https://buildfeed.net/) for taskcluster Windows 10 workers currently.

https://public-artifacts.taskcluster.net/OdfD8hzHRo-eev8odxaHrw/0/public/logs/live_backing.log

you can always check current version info in taskcluster windows workers by creating a task like this: https://tools.taskcluster.net/task-inspector/#OdfD8hzHRo-eev8odxaHrw/
ok, build 10.0.14393 is 1607; we should strive for 1703 before hacking too much.
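
For anyone translating the build number in a worker log into a release name, here is a small sketch of the lookup. The function name `win10_version` is made up for illustration; the table covers only the Windows 10 releases up to the Creators update.

```python
# Windows 10 build numbers and the release versions they correspond to.
BUILD_TO_VERSION = {
    10240: "1507",
    10586: "1511",
    14393: "1607",  # Anniversary update
    15063: "1703",  # Creators update
}


def win10_version(version_string: str) -> str:
    """Map a string such as '10.0.14393' to a release label, else 'unknown'."""
    parts = version_string.strip().split(".")
    if len(parts) < 3 or parts[0] != "10":
        return "unknown"
    try:
        build = int(parts[2])
    except ValueError:
        return "unknown"
    return BUILD_TO_VERSION.get(build, "unknown")
```

On a Windows worker, the input would typically come from `platform.version()` in the Python standard library; that is an assumption about how a task might collect it, not what the linked task does.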

I am glad Dustin mentioned that once we do this a few times and understand the issues and fix the common test cases it will be easier- I had overlooked that concept!  This seems valid and reasonable.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #3)
...
> So ultimately, I think there's one main question we need guidance from
> Product (hi Jeff!) on: What usage threshold should we set for switching CI
> over to a newer Win10 release? 

Do we have data/graphs showing Win10 update uptake? It would be useful to know what the, er, viral spread of an update looks like across our population. Generally how I think about this is, if we think the new patch level of win10 will be deployed to most users in the next 6 weeks, we should update.
Flags: needinfo?(jgriffiths)
Unfortunately, the dashboard doesn't give us trending over time (it was on my original wishlist, but I was told it wasn't possible with the longitudinal dataset), so all I can really do is keep an eye on it myself.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #10)
> Unfortunately, the dashboard doesn't give us trending over time (it was on
> my original wishlist, but I was told it wasn't possible with the
> longitudinal dataset), so all I can really do is keep an eye on it myself.

I'd go with your gut feel for now, providing the guidance that we need to stay ahead of the deployment by enough of a margin to be able to act. For example:

* if 50% of users get the new version within 5 weeks
* if it takes typically 2-3 weeks to react to specific patch-level issues
* we should start deploying the patch to some test machines to get some results within 1-2 weeks

Make sense?
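
The arithmetic behind that example can be written out as a small sketch. It assumes the simplest possible model: the latest acceptable start is the time to majority uptake minus the time needed to react. `latest_start_week` is a hypothetical name, and the inputs are the example figures above, not measured data.

```python
def latest_start_week(weeks_to_majority: float, weeks_to_react: float) -> float:
    """Latest week after release to begin testing and still finish reacting
    before most users have the update. Never returns a negative week."""
    return max(0.0, float(weeks_to_majority) - float(weeks_to_react))


# 50% of users within 5 weeks, 2-3 weeks to react to issues:
print(latest_start_week(5, 3))  # 2.0 (slow reaction: start by week 2)
print(latest_start_week(5, 2))  # 3.0 (fast reaction: start by week 3)
```

Under the slower 3-week reaction time, testing has to begin by week 2, which is consistent with the suggestion to have results within 1-2 weeks.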
Thanks, Jeff. Yeah, I think that further solidifies the point about needing to be able to react more quickly in the future. Maybe even start testing in the late stages of the Insider cycle as things get closer to RTM.
(In reply to Dustin J. Mitchell [:dustin] from comment #5)
> I suspect that when we are at the point where a task can specify an image to
> run in for Windows, similar to how it can for Docker, this will be a bit
> easier -- we can "smear" the upgrade over time as we are with the ubuntu1604
> update.

Yeah, this would be super helpful!

> I also suspect that going to a faster update cadence will identify tests
> that are too specific to version, and after a few iterations we'll be down
> to just tests that actually identify *problems* with the new version.

Also this, yeah. I can believe we'd have a lot fewer issues taking a green set of tests on Win 10 Anniversary Edition and updating to Win 10 Creators update than we're having making the leap from Windows 7/8 all the way to Windows 10.
as a note, let's ensure we update to build 1703; that is what we are using for the hardware machines in buildbot that we are standing up this week and next week.
(Assignee)

Updated

9 months ago
Depends on: 1360503
(Assignee)

Updated

8 months ago
Depends on: 1366811
(Assignee)

Comment 15

8 months ago
all tc win 10 instances are now running build 1703
Assignee: nobody → rthijssen
Status: NEW → RESOLVED
Last Resolved: 8 months ago
Resolution: --- → FIXED