Bug 1258736 (Closed) - Opened 8 years ago, Closed 8 years ago

[16/03/22] Update all mozmill-ci machines for OS and software updates

Categories: Mozilla QA Graveyard :: Infrastructure
Type: defect
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED

People: Reporter: whimboo, Assigned: whimboo

References

Details

We have a fair bit to cover this time, given that this is not a planned update.

* OS update (as usual)
* Update of git (CVE-2016-2324 and CVE-2016-2315)

We also have to update all templates for at least:
* NTP settings (bug 1257171)
* Latest Java + finally disable update checks? (seems to be kinda hard)
* virtualenv (bug 1241811 - maybe disable pip checks to stop the noise)
* Check others by inspecting the installation instructions of the platform

Chris, I wonder if I should simply update the templates for Linux and Windows and we re-deploy all machines? I'm not sure how complicated that is for you and how long it would take. If we go this route, maybe it should be done similarly in the future?
Flags: needinfo?(cknowles)
Sadly there's nothing "simple" here.  But it's a good idea to hash out how we want to handle this.

Redeploying isn't easy. In a perfect world we'd do it kinda AWS-like: take down all of the VMs from the old template, redeploy from an updated template, reset the IP/DHCP mapping and boot. That will take anywhere from 30-60 minutes, with balancing the storage and compute and other things happening in the background. However, you're probably going to want (and have in the past wanted) to keep only half of the machines down at a time for continuity-of-service reasons, and the time this takes doesn't scale with the number of machines we deploy, but rather with the number of sessions we deploy them in.

The normal "linux sysadmin" approach would be to update the existing machines, also update the templates (in case we need to redeploy due to failure or growth), and then go about your merry way. Some thoughts about this: once the templates have been turned on and their temporary names shared, it's all under your control - and certainly all production pieces are under your control. No real coordination needed.

Now, this looks like a relatively large update, so I completely understand the desire to redeploy. The main thing here is that this is a real investment of time. If we're doing this about once a year, it feels like a good test of the templates and might be a good idea. If we're talking more frequently than that, I'd recommend a blended approach: redeploy once a year, and outside of that take the more "linux sysadmin" approach of just updating the live boxes.

Of course, you've seen the "linux sysadmin" qualifier in there - Windows and its licensing can make the decision even harder. It may be simpler to not redeploy those at all.

So ... to sum up:

1) Redeploying requires time and coordination. If we want to update constantly, this will likely become onerous for us, and we'd like to do what we can to get out of your path so you can self-manage.

2) Updating the existing machines and the templates still requires a smidgen of coordination (turning on the templates), but it isn't on the critical path, and other than that the schedule and timing are entirely in your control.

3) A blended approach of doing the redeploys (1) once a year, for example, and otherwise keeping things up to date using (2) might be the "best" option.

Let us know your thoughts.
Flags: needinfo?(cknowles)
Thank you for the detailed feedback, Chris! As I can see, it's a tough task to do on a more regular basis. Given that we have to do security updates for 0-day exploits and the like more often, a re-deploy doesn't seem to be the way to go. I don't really want to put such a burden on your shoulders.

So for the current situation we mostly have to do it because of the git CVEs. But until bug 1258528 has been fully investigated I would not get started, especially not for the Windows machines.

Also git is only in use for the functional tests of Firefox 38.0esr builds. Those will reach EOL in about 10 weeks. Then it won't be needed anymore.

Florin and Kairo, how important are the functional test results of Firefox 38.0esr for your sign-off on releases? Do you check them at all? If not I would stop running them; update tests would still have to be done with mozmill.
Flags: needinfo?(kairo)
Flags: needinfo?(florin.mezei)
(In reply to Henrik Skupin (:whimboo) from comment #2)
> Florin and Kairo, how important are the functional test results of Firefox
> 38.0esr for your sign-off on releases? Do you check them at all? If not I
> would stop running them; update tests would still have to be done with
> mozmill.

We still check them as part of our sign-off checklist (e.g. https://wiki.mozilla.org/Releases/Firefox_38_ESR/Test_Plan#Checklist_12). However, as you can see, we get results from both Jenkins and Treeherder. Would this only affect the results we get in Jenkins? Would we still get the results in Treeherder?
Flags: needinfo?(florin.mezei)
I see. So the tests will be necessary and we cannot stop them. FYI the results in Treeherder and Jenkins are exactly the same. Note that you only get test results in Jenkins when jobs get aborted or Java exceptions happen; everything else goes directly to Treeherder.

Meanwhile I'm thinking about using Chocolatey to install and update packages on Windows. So if I go this way it will be easy to add the Git package for updates. I will check next week.
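For reference, here's a minimal sketch of what such a Chocolatey-based step could look like, assuming choco is already installed on the Windows slaves. The helper script and the package list are only illustrative, not something mozmill-ci actually ships:

# chocolatey_update.py - hypothetical helper, not part of mozmill-ci
import subprocess

# Packages we would manage via Chocolatey on the Windows slaves (illustrative list).
PACKAGES = ["git"]

def ensure_packages(packages):
    """Install each package if missing, otherwise upgrade it to the latest version."""
    for name in packages:
        # "choco upgrade" also installs a package that is not present yet,
        # so a single command covers both install and update.
        subprocess.check_call(["choco", "upgrade", name, "-y"])

if __name__ == "__main__":
    ensure_packages(PACKAGES)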
Flags: needinfo?(kairo)
Btw. pip version checks have been disabled via mozmill-ci's runtests.py, and that works perfectly.
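For the record, the mechanism is simply to tell pip to skip its self-update check. A minimal sketch of how that can be done from a wrapper script, assuming pip is invoked as a subprocess (the actual runtests.py change may look different):

# Sketch only: suppress pip's "new version available" notice for a subprocess call.
import os
import subprocess

env = os.environ.copy()
# pip honors this environment variable and skips its outdated-version check.
env["PIP_DISABLE_PIP_VERSION_CHECK"] = "1"

# Example invocation; the requirements file name is illustrative.
subprocess.check_call(["pip", "install", "-r", "requirements.txt"], env=env)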
There is also an issue with Mercurial, so I will have to update it on our Jenkins master machines. It's not in use on the slaves.
Hm, given that Mana is down for maintenance I have no access to our ESX documentation for updating hosts. I think that due to all the current security issues we might want to hold off on updating our hosts until all of that has been resolved.

This also means that I won't be able to update any of our machines before I leave, so the updates will have to wait until May unless someone else can cover them.
I have updated both the staging and production masters with the latest security fixes. I won't do any work on the slaves for now.
Blocks: 1273425
As bug 1273425 suggests we want to remove Flash from all of our slave machines. So the following items remain for me:

(In reply to Henrik Skupin (:whimboo) from comment #0)
* OS update (as usual)
* Java update
* Update of git (CVE-2016-2324 and CVE-2016-2315)
* Removal of Flash

> We also have to update all templates for at least:
> * NTP settings (bug 1257171)
> * Latest Java + finally disable update checks? (seems to be kinda hard)
> * virtualenv (bug 1241811 - maybe disable pip checks to stop the noise)
> * Check others by inspecting the installation instructions of the platform

Given the complexity with the templates I will completely skip that section. I don't want to rely on the templates anymore. They will still exist as long as we have those VMs, just in case we ever need to reset a VM, but otherwise I will work towards getting our tests running via AMIs on AWS.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Updates have all been done for staging. If no regressions are found I will continue with upgrading the production slaves later today or tomorrow.
Just to note... I did not upgrade git this time. Its usage is currently limited to esr38, which we will drop really soon. At that point we can completely remove the git installations.
All OS X and Ubuntu boxes have been updated on production. Now the Windows nodes are next.
All remaining Windows machines have been updated. So we are done with this quarterly update.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Mozilla QA → Mozilla QA Graveyard