Closed Bug 1086915 Opened 10 years ago Closed 9 years ago

Group Policy refresh intervals for releng Windows machines

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/558] )

In https://bugzilla.mozilla.org/show_bug.cgi?id=1082535 a concern about the periodic refresh interval was raised. 

Currently a client will check in with a DC for an update periodically during the day, approximately once an hour. During this check for a refresh, packages are copied and made ready to install on the next boot; registry changes and security changes are picked up and take effect on the next boot; and files are copied, deleted, or updated. In the case of files, if a file is in use it is locked and doesn't change until the lock is released.
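
The refresh behavior described above can be sketched as a small model. This is only an illustration of the description, not how Group Policy is actually implemented: the names `Client`, `Change`, and `Kind` and the example targets are hypothetical. It assumes that file operations apply at refresh time unless the file is locked, and that package, registry, and security changes wait for the next boot.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    FILE = "file"
    PACKAGE = "package"
    REGISTRY = "registry"
    SECURITY = "security"


@dataclass(frozen=True)
class Change:
    kind: Kind
    target: str  # hypothetical identifier: a path, package name, or key


@dataclass
class Client:
    locked_files: set = field(default_factory=set)
    applied: list = field(default_factory=list)
    waiting_for_boot: list = field(default_factory=list)
    waiting_for_unlock: list = field(default_factory=list)

    def refresh(self, changes):
        """Periodic check-in: decide what applies now and what is deferred."""
        for change in changes:
            if change.kind is not Kind.FILE:
                # Packages are staged; registry and security changes are
                # picked up but take effect only on the next boot.
                self.waiting_for_boot.append(change)
            elif change.target in self.locked_files:
                # A file in use is locked and is left unchanged for now.
                self.waiting_for_unlock.append(change)
            else:
                self.applied.append(change)

    def release_lock(self, path):
        """Simplification: apply a deferred file change as soon as it unlocks."""
        self.locked_files.discard(path)
        still_waiting = []
        for change in self.waiting_for_unlock:
            if change.target == path:
                self.applied.append(change)
            else:
                still_waiting.append(change)
        self.waiting_for_unlock = still_waiting

    def boot(self):
        """Next boot: everything that was staged takes effect."""
        self.applied.extend(self.waiting_for_boot)
        self.waiting_for_boot = []


if __name__ == "__main__":
    client = Client(locked_files={"in_use.dll"})
    client.refresh([
        Change(Kind.FILE, "free.txt"),
        Change(Kind.FILE, "in_use.dll"),
        Change(Kind.PACKAGE, "example-package"),
    ])
    print("applied during the job:", [c.target for c in client.applied])
    print("deferred to next boot:", [c.target for c in client.waiting_for_boot])
```

The point of the model is that anything landing in `applied` during `refresh` happens while a build or test job may be running, which is the concern raised in the other bug.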

As for the processes themselves, it is all done in background processes, with priority given to foreground processes.

In addition, GPO will be replaced with Puppet in the near future.

For reference, comment 83 from that bug:
(In reply to Chris Cooper [:coop] from comment #83)
> (In reply to Mark Cornmesser [:markco] from comment #82)
> > It does at boot as well. 
> > 
> > A client will check with the server periodically for updates, as well as at
> > boot. When it checks for an update and picks up a task such as a file copy,
> > it will begin executing the task. Other tasks, such as a package install, a
> > security change, a registry change, and others, will not take effect or
> > start until the next boot.
> 
> That's...unexpected.
> 
> Generally we don't want puppet or GPO services to run while machines are
> doing builds and/or tests. The possible slowdown could cause a machine to
> unexpectedly time out during routine activity, and invoke a flurry of
> activity from sheriffs, buildduty, and devs as sheriffs back out patches and
> close trees.
> 
> Is there a way to *only* run GPO changes on reboot? I think that's how
> everyone in releng thought the system was already working.
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/557]
What should our course of action be with regard to GPO refresh intervals?
Assignee: relops → mcornmesser
Flags: needinfo?(coop)
Flags: needinfo?(arich)
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/558]
(In reply to Mark Cornmesser [:markco] from comment #0)
> Currently a client will check in with a DC for an update periodically during
> the day, approximately once an hour. During this check for a refresh, packages
> are copied and made ready to install on the next boot; registry changes and
> security changes are picked up and take effect on the next boot; and files are
> copied, deleted, or updated. In the case of files, if a file is in use it is
> locked and doesn't change until the lock is released.

Didn't this exact process cause some of the problems with the VS2013 deployment? Machines ended up in a partially-installed state as parts of VS were incrementally installed while machines continued to be rebooted via buildbot job control.

I'd prefer to turn off any background activity, especially on test machines, because it can lead to wobble in performance numbers.

If it's hard to disable the background activity, maybe we switch to Puppet more quickly (see below).

> In addition GPO will be replaced with Puppet in the near future.

I'm all for doubling-down on Puppet support for Windows and ditching GPO altogether. GPO is a black box to releng. Getting Windows platform configs into version control adds essentially a whole team of people who can help you and Q write and deploy patches. Seems like a win.

What are the caveats with this plan? 

Will GPO be going away entirely, or will we still need it at a base level?

Could we start deploying new packages with Puppet until we have a chance to convert the existing GPO objects?
Flags: needinfo?(coop)
To my understanding, with VS2013 the installer launched an update, and in some cases a reboot occurred while the update was running and kept it from completing.

With Puppet, GPO will be taken out of the picture completely. However, MDT will still play a role. MDT will handle creating a base image including Visual Studio, SDKs, Windows Updates, and a few other items. MDT will also be used to create a deploy image, which will include the Puppet run and any items we can't support within Puppet yet, such as specific user account settings that need to be changed directly in the registry. The majority of the configuration will be in Puppet.

> Could we start deploying new packages with Puppet until we have a chance to convert the existing GPO objects?

It is possible. I don't know what the disadvantages would be off the top of my head. I will need to chat it out with Q and Dustin. 

> I'd prefer to turn off any background activity, especially on test machines, because it can lead to wobble in performance numbers.

Q: what are the reasons we do not want to disable the periodic updates?
Flags: needinfo?(q)
Correct. The VS2013 issue would have been a problem with Puppet or any other solution as long as systems were issued a reboot command (in this case via SSH, due to nanny processes and the machines still being enabled in slave alloc).

From what I have seen, there have been some challenges converting from native control methods in Windows to Puppet, which takes a very *nix/POSIX approach. However, Dustin and Mark are far more familiar with those issues.

The ability to force a background process was explicitly turned back on after some SSH loop "oopses" put machines in a bad state. We currently need a method to control machines outside of our questionable SSH loops and nanny processes. Once we have Puppet functional, SSH redone, and possibly MIG in play, the GPO check could easily go away. Until then, we should do our best to make sure item-level targeting is applied to the few preferences that can apply outside of boot time, so that they do not trigger unless forced to do so.
Flags: needinfo?(q)
Coop: we never intend to deploy packages with puppet, only create new images to deploy with a method like openstack or cloud managers like azure/aws. The nodes won't have any ongoing management and making changes means deploying a new image.
Flags: needinfo?(arich)
So we have three levels of manageability under discussion:

 (a) what we thought was in place: config management runs on boot
 (b) what we have: periodic config management runs (+ on boot)
 (c) nothing - code-specified images deployed without any config management

I'm hearing Q say that (a) isn't enough on Windows due to inadequate break-fix tools, and Amy say that (c) is the ultimate goal.  Those don't seem compatible!

Also, given that we're not doing OpenStack in the foreseeable future (3-6 mo), is it practical to continue to plan to use OpenStack to deploy Windows images?
I think we're getting off topic for this bug, but I'll address the questions in comment 6. Any time we need to change things on windows, the future option is to redeploy. That's what option C is, just like for linux spot images.

We can also deploy images using WDS instead of openstack, so the choice to delay looking at openstack until it's less buggy does not mean we have no choices for image deployment. What it does mean is that image deployment is not as turnkey and (with the current infrastructure) not at the same scale.  However, the direction we are moving in for builders is to put almost all of it in the cloud anyway, where we will have the flexibility and speed of deployment that we'd lack on physical hardware.

The bigger issue is test machines, since we will always have a significant portion of hosts that must remain physical. Here we will see more impact of not having a smoother reimaging process in house.

The whole reason we were asked to move towards option C was because people did not like GPO and the closed windows way of managing windows machines.  Unfortunately there is no good substitute for using microsoft tools for ongoing management because of the closed nature of windows and their style of machine management.  We will never have a solution like puppet on windows or OS X where knowledge is easily shared and spread because windows is fundamentally different.

For future management, the possibilities are:

1) what we aim for now, ongoing management using windows tools that conform to microsoft methodologies
2) no ongoing management; changes are deployed as reimages with new images only (where we were told to focus because of the needs of the cloud and frustration with the current closed system). This aligns with option C.
3) using non-windows tools for ongoing management of windows. Despite using a tool like puppet, this will not really remove the barrier to understanding how to manage windows systems, and additionally will not mesh well with windows best practices, procedures, etc. As with OPSI, I expect attempting to manage machines this way (to the depth at which we manage them) will be complex and error prone.

If we want to implement option 3, we certainly can, but I want to make sure that we make that decision consciously knowing that it's going to be more complex and require more man hours than either of the other two options.
FWIW, the surprise on RelEng's part is that on mac/linux puppet must complete successfully before buildbot may start up.
(In reply to Amy Rich [:arich] [:arr] from comment #7)

> The whole reason we were asked to move towards option C was because people
> did not like GPO and the closed windows way of managing windows machines.
> Unfortunately there is no good substitute for using microsoft tools for
> ongoing management because of the closed nature of windows and their style
> of machine management.  We will never have a solution like puppet on windows
> or OS X where knowledge is easily shared and spread because windows is
> fundamentally different. 

If the main problem we have with GPO is that RelEng doesn't know how it works, or how to contribute changes to it, but it does do what we need (within reason), perhaps we just need to invest some time and effort in RelEng to learn about it. When we understand it better, maybe then we can address gaps or work out ways to make windows management more transparent, and build tools around it where necessary to allow configs to be managed in source control etc?
Getting all of group policy, domain configuration, and MDT task sequences into a version control system with proper diffs, a review process, automatic deployment, and a way to replicate externally is a substantial investment in infrastructure and development, and even if we executed that perfectly and efficiently and had the manpower to maintain it, we would still end up with completely separate systems for configuring POSIX hosts and Windows hosts.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED