Closed Bug 1590791 Opened 5 years ago Closed 4 years ago

Write an automated restore method for Win 10 hardware

Categories

(Infrastructure & Operations :: RelOps: Windows OS, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

Assignee: nobody → mcornmesser
Blocks: 1573864

The between-task time on Windows hardware workers climbs to 20 minutes or more by 3 to 4 weeks after deployment. This is caused by a delay in the user environment being completely loaded. The exact reason is still undetermined, but it is most likely related to thousands of user accounts being created and then deleted. Historically this has been dealt with by mass manual redeploys of hardware workers. Doing a system restore seems to remedy this. Part of the system restore is a reset of the registry.

This involves adding two functions to the bootstrap script:

Function set-restore_point {
  param (
    # No trailing backslash, so the child path below joins cleanly
    [string] $mozilla_key = "HKLM:\SOFTWARE\Mozilla",
    [string] $ronin_key = "$mozilla_key\ronin_puppet",
    # MM is the month; lowercase mm would be the minutes
    [string] $date = (Get-Date -Format "yyyy/MM/dd-HH:mm"),
    [int32] $max_boots
  )
  begin {
    Write-Log -message ('{0} :: begin - {1:o}' -f $($MyInvocation.MyCommand.Name), (Get-Date).ToUniversalTime()) -severity 'DEBUG'
  }
  process {
    # Remove all existing restore points, then create a fresh one labeled "default"
    vssadmin delete shadows /all /quiet
    powershell.exe -Command Checkpoint-Computer -Description "default"

    if(!(Test-Path $ronin_key)) {
      New-Item -Path HKLM:\SOFTWARE -Name Mozilla -Force
      New-Item -Path HKLM:\SOFTWARE\Mozilla -Name ronin_puppet -Force
    }

    # -Force overwrites a value left over from an earlier run instead of failing
    New-ItemProperty -Path "$ronin_key" -Name "restorable" -PropertyType String -Value yes -Force
    New-ItemProperty -Path "$ronin_key" -Name "reboot_count" -PropertyType Dword -Value 0 -Force
    New-ItemProperty -Path "$ronin_key" -Name "last_restore_point" -PropertyType String -Value $date -Force
    New-ItemProperty -Path "$ronin_key" -Name "restore_needed" -PropertyType String -Value false -Force
    New-ItemProperty -Path "$ronin_key" -Name "max_boots" -PropertyType Dword -Value $max_boots -Force
  }
  end {
    Write-Log -message ('{0} :: end - {1:o}' -f $($MyInvocation.MyCommand.Name), (Get-Date).ToUniversalTime()) -severity 'DEBUG'
  }
}

Set-restore_point, which should be called prior to Puppet during bootstrap, deletes any other restore points and sets a new restore point with the label "default". The label matters later, so that the checkpoint can be selected. After each restore a new restore point is needed because it is possible that restore points will disappear. By deleting the existing restore points and creating a new one labeled "default", the restore function can find it programmatically regardless of the restore point's sequence number.
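
As a rough sketch of why the label makes the lookup independent of the sequence number, the selection can be written as a small filter over restore point objects. Select-DefaultRestorePoint is a hypothetical helper name, not something in the bootstrap module, and the sample objects below are made-up stand-ins for what Get-ComputerRestorePoint returns:

```powershell
# Hypothetical helper (not part of the bootstrap module): pick the restore
# point by its label rather than by a hard-coded sequence number.
function Select-DefaultRestorePoint {
  param (
    [object[]] $RestorePoints = @(),
    [string] $Label = 'default'
  )
  # Match on Description; if several share the label, take the newest
  # (highest SequenceNumber).
  $RestorePoints |
    Where-Object { $_.Description -eq $Label } |
    Sort-Object -Property SequenceNumber |
    Select-Object -Last 1
}

# Made-up stand-ins for restore point objects, for illustration only
$points = @(
  [pscustomobject]@{ SequenceNumber = 7;  Description = 'Windows Update' },
  [pscustomobject]@{ SequenceNumber = 12; Description = 'default' }
)
$chosen = Select-DefaultRestorePoint -RestorePoints $points
$chosen.SequenceNumber
```

On a worker the same helper could presumably be fed live data with Select-DefaultRestorePoint -RestorePoints (Get-ComputerRestorePoint).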

Explanation of the registry settings:
restorable: Needed for decision making in the bootstrap script and the maintain system script. If set to yes, these scripts will perform the necessary actions for the automated restore.
reboot_count: See the maintain system script change below.
last_restore_point: Informational; serves no function other than providing data for troubleshooting.
restore_needed: See the Start-Restore function below.
max_boots: The maximum number of boots before the restore will occur.
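
For troubleshooting, the values above can be read back in one call. This is only a sketch: Get-RestoreState is a hypothetical helper name, and it assumes the key path and value names used by the function above.

```powershell
# Hypothetical troubleshooting helper: returns the restore-related values,
# or $null when the ronin_puppet key does not exist (for example right
# after a restore has reset the registry).
function Get-RestoreState {
  param (
    [string] $Key = 'HKLM:\SOFTWARE\Mozilla\ronin_puppet'
  )
  if (-not (Test-Path -Path $Key)) {
    return $null
  }
  Get-ItemProperty -Path $Key |
    Select-Object -Property restorable, reboot_count, max_boots, restore_needed, last_restore_point
}

Get-RestoreState | Format-List
```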

Function Start-Restore {
  param (
    [string] $ronin_key = "HKLM:\SOFTWARE\Mozilla\ronin_puppet",
    [int32] $boots = (Get-ItemProperty $ronin_key).reboot_count,
    [int32] $max_boots = (Get-ItemProperty $ronin_key).max_boots,
    [string] $restore_needed = (Get-ItemProperty $ronin_key).restore_needed,
    [string] $checkpoint_date = (Get-ItemProperty $ronin_key).last_restore_point
  )
  begin {
    Write-Log -message ('{0} :: begin - {1:o}' -f $($MyInvocation.MyCommand.Name), (Get-Date).ToUniversalTime()) -severity 'DEBUG'
  }
  process {
    if (($boots -ge $max_boots) -or ($restore_needed -notlike "false")) {
        if ($boots -ge $max_boots) {
            Write-Log -message  ('{0} :: System has reached the maximum number of reboots set at HKLM:\SOFTWARE\Mozilla\ronin_puppet\max_boots. Attempting restore.' -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
        }
        # elseif chain so the catch-all only fires for unrecognized reasons
        if ($restore_needed -eq "gw_bad_config") {
            Write-Log -message  ('{0} :: Generic_worker has failed to start multiple times. Attempting restore.' -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
        } elseif ($restore_needed -eq "puppetize_failed") {
            Write-Log -message  ('{0} :: Node has failed to Puppetize multiple times. Attempting restore.' -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
        } elseif ($restore_needed -notlike "false") {
            Write-Log -message  ('{0} :: Restore requested for an unrecognized reason; restore_needed equals {1}.' -f $($MyInvocation.MyCommand.Name), ($restore_needed)) -severity 'DEBUG'
        }
        Stop-ScheduledTask -TaskName maintain_system

        Write-Log -message  ('{0} :: Removing Generic-worker directory.' -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
        Stop-Process -Name generic-worker -Force -ErrorAction SilentlyContinue
        Remove-Item -Recurse -Force $env:systemdrive\generic-worker
        Remove-Item -Recurse -Force $env:systemdrive\mozilla-build
        Remove-Item -Recurse -Force $env:ALLUSERSPROFILE\puppetlabs\ronin
        Remove-Item -Path $env:windir\temp\* -Recurse -Force
        Write-Log -message  ('{0} :: Removing generic-worker service and ronin_puppet registry key.' -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
        # sc.exe, because plain "sc" is an alias for Set-Content in Windows PowerShell
        sc.exe delete "generic-worker"
        # Remove-Item deletes the whole key; Remove-ItemProperty has no -Recurse
        Remove-Item -Path $ronin_key -Recurse -Force
        # OpenSSH will need to be addressed; it fails after restore
        # For now commented out of the roles manifests
        #sc.exe delete sshd
        #sc.exe delete ssh-agent
        #Remove-Item -Recurse -Force $env:ALLUSERSPROFILE\ssh
        Write-Log -message  ('{0} :: Initiating system restore from {1}.' -f $($MyInvocation.MyCommand.Name), ($checkpoint_date)) -severity 'DEBUG'
        # Find the restore point by label; take the newest if more than one matches
        $RestoreNumber = (Get-ComputerRestorePoint | Where-Object {$_.Description -eq "default"} | Select-Object -Last 1)
        Restore-Computer -RestorePoint $RestoreNumber.SequenceNumber

    } else {
        Write-Log -message  ('{0} :: Restore is not needed.' -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
    }
  }
  end {
    Write-Log -message ('{0} :: end - {1:o}' -f $($MyInvocation.MyCommand.Name), (Get-Date).ToUniversalTime()) -severity 'DEBUG'
  }
}

For this to work, the bootstrap cleanup function must not be called in the bootstrap script, and the last function called must be Start-Restore. The bootstrap script will continue to run on each boot; if the restore_needed registry value is false, the script will exit. If the registry value is anything but false, it will begin the restore process. Currently, if there is a permanent failure with generic-worker or Puppetizing, those wrapper scripts will set the restore_needed value so that the restore is initiated and we can log the reason why.

It will also begin the restore once reboot_count reaches max_boots.

For the node to Puppetize correctly post restore, the generic-worker service needs to be stopped and removed. The generic-worker, mozilla-build, and Ronin Puppet ProgramData directories also need to be removed.

Because the restore resets the registry, and the bootstrap script relies on registry settings to determine which stage it is at, the reboot following the restore will trigger the bootstrap script to start anew.
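
The two triggers can be summarized as a small side-effect-free check. This is a sketch only: Get-RestoreReason and the 'max_boots_reached' and 'unknown' labels are hypothetical names chosen for illustration, while 'gw_bad_config' and 'puppetize_failed' are the values the wrapper scripts set.

```powershell
# Hypothetical helper: returns the reason a restore should start, or $null
# when no restore is needed. It mirrors the condition in Start-Restore:
# anything other than "false" in restore_needed, or reboot_count at or
# above max_boots.
function Get-RestoreReason {
  param (
    [int] $Boots,
    [int] $MaxBoots,
    [string] $RestoreNeeded = 'false'
  )
  if ($RestoreNeeded -notlike 'false') {
    # An empty value still counts as "not false"; label it for the log
    if ($RestoreNeeded) { return $RestoreNeeded }
    return 'unknown'
  }
  if ($Boots -ge $MaxBoots) {
    return 'max_boots_reached'
  }
  return $null
}

Get-RestoreReason -Boots 3   -MaxBoots 150                                 # no restore
Get-RestoreReason -Boots 150 -MaxBoots 150                                 # boot limit reached
Get-RestoreReason -Boots 3   -MaxBoots 150 -RestoreNeeded 'gw_bad_config'  # wrapper-script failure
```

Keeping the decision separate from the destructive cleanup steps makes it possible to exercise the logic without touching the registry or restore points.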

The bootstrapping script will look like:

$workerType = 'gecko-t-win10-64-hb'
$src_Organisation = 'markcor'
$src_Repository = 'ronin_puppet'
$src_Revision = 'restore2'
$image_provisioner = 'mdt'
$max_boots = 150

# Ensuring scripts can run uninhibited
Set-ExecutionPolicy unrestricted -force  -ErrorAction SilentlyContinue

If(test-path 'HKLM:\SOFTWARE\Mozilla\ronin_puppet') {
  $stage =  (Get-ItemProperty -path "HKLM:\SOFTWARE\Mozilla\ronin_puppet").bootstrap_stage
}
If(!(test-path 'HKLM:\SOFTWARE\Mozilla\ronin_puppet')) {
  Setup-Logging
  Install-BootstrapModule -src_Organisation $src_Organisation -src_Repository $src_Repository -src_Revision $src_Revision
  Bootstrap-schtasks -workerType $workerType -src_Organisation $src_Organisation -src_Repository $src_Repository -src_Revision $src_Revision -image_provisioner $image_provisioner
  Set-restore_point -max_boots $max_boots
  Set-RoninRegOptions  -workerType $workerType -src_Organisation $src_Organisation -src_Repository $src_Repository -src_Revision $src_Revision -image_provisioner $image_provisioner
  Install-Prerequ
  shutdown @('-r', '-t', '0', '-c', 'Reboot; Prerequisites in place, logging setup, and registry setup', '-f', '-d', '4:5')
}
If (($stage -eq 'setup') -or ($stage -eq 'inprogress')){
  Install-BootstrapModule -src_Organisation $src_Organisation -src_Repository $src_Repository -src_Revision $src_Revision
  Ronin-PreRun
  Bootstrap-Puppet
}
If ($stage -eq 'complete') {
  Import-Module bootstrap
  Start-Restore
}

To track the reboot count, the following needs to be added to the maintain system script:

$reboot_count_exists = Get-ItemProperty HKLM:\SOFTWARE\Mozilla\ronin_puppet reboot_count -ErrorAction SilentlyContinue
If ($null -ne $reboot_count_exists) {
  $previous_boots = (Get-ItemProperty -path "HKLM:\SOFTWARE\Mozilla\ronin_puppet").reboot_count
  $new_count = $previous_boots + 1
  Set-ItemProperty -Path HKLM:\SOFTWARE\Mozilla\ronin_puppet -name reboot_count -value $new_count -force
}

In order to test this I have been running pushes against the beta nodes since the beginning of December:
https://treeherder.mozilla.org/#/jobs?repo=try&author=mcornmesser%40mozilla.com&searchStr=Bug%2C1590791&fromchange=10479abbb4e16051a43d33ac7749c83cba98f2e9

jmaher: Could you check out https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1dcc6cff72808e9264a76ee987995a010b5330e and see if there is anything troubling with the test results? For comparison, https://treeherder.mozilla.org/#/jobs?repo=try&revision=45f71f86d5f8c85cf95faa0ff979f011b03add38 is the same repo and tests run on the production nodes.

Flags: needinfo?(jmaher)

I have done some retriggers and am waiting on the final results to come in. I might need more retriggers over the weekend, but should have an answer by EOD Monday, worst case scenario.

Thanks for working on this; it sounds very useful to have.

this looks pretty good:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=45f71f86d5f8c85cf95faa0ff979f011b03add38&newProject=try&newRevision=b1dcc6cff72808e9264a76ee987995a010b5330e

in fact the noise metric (not really scientific, but an indicator) has decreased in both talos and raptor.

Flags: needinfo?(jmaher)

Previously ms-016 - ms-045 and ms-316 - ms-345 had been deployed with this function. Most of ms-316+ have been in use since 2020-01-14, and their between-task time is 2 to 4 minutes. The typical between-task time without the restore is 20 - 30 minutes for that life span.

Nodes ms-361 - ms-390 will be deployed today.

This has been deployed to all of the Win moonshot nodes.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED