Closed Bug 1200800 Opened 9 years ago Closed 9 years ago

Create 2 Ubuntu 14.04 64bit VMs to replace mm-ci-staging.qa.scl3.mozilla.com and mm-ci-production.qa.scl3.mozilla.com

Categories

(Infrastructure & Operations :: Virtualization, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: cknowles)

Details

(Whiteboard: [qa-automation-blocked][vm-create:2][vm-delete:2])

In bug 1200139 we see constant crashes of Java because no more memory is available. The reason seems to be that we reach the 32-bit boundary. Sadly, both machines were initially set up with a 32-bit OS. We should change that to 64-bit OSes.
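To put a number on the 32-bit boundary mentioned above, here is a minimal sketch of the arithmetic. The helper names are hypothetical and not part of mozmill-ci; the point is only that a process with 32-bit pointers can address at most 4 GiB of virtual memory, however much RAM the VM has, and the usable amount is lower in practice.

```python
import struct


def address_space_gib(pointer_bits: int) -> float:
    """Theoretical virtual address space, in GiB, for a given pointer width."""
    return 2 ** pointer_bits / 2 ** 30


def interpreter_pointer_bits() -> int:
    """Pointer width of the running Python interpreter, in bits."""
    # struct.calcsize("P") is the size in bytes of a native pointer (void *).
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("32-bit limit (GiB):", address_space_gib(32))
    print("This interpreter uses", interpreter_pointer_bits(), "bit pointers")
```

On the 8GB production box this would mean a single 32-bit process cannot use even half of the installed memory, which is consistent with the crashes described, though the exact cause is only assumed here.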

For the specs please use the same as what we have for the current machines. For mm-ci-production this will be: 8GB, 2 CPU, 16G /, 50G /data. I don't have the specs handy for staging, but those will be easy to find.

Chris, it would be great if you could do this soon, given that a crash on each of the last few days has brought our CI system down. Thanks!
Flags: needinfo?(cknowles)
Whiteboard: [qa-automation-blocked]
Assuming I can get a quick answer to this, I should be able to at least start on that tomorrow.

The question is this ... am I destroying the current mm-ci-{staging,production} and spinning up new ones, or am I creating new ones with new names (mm-ci-{staging,production}-new?) and then we'll clean up and reconcile later?
Flags: needinfo?(cknowles)
Chris also asked me on IRC and I gave a reply there. Just to sync up here... we need two new VMs, and we cannot replace the current ones right now. I need some time to set up the servers, and the service would not be available during that time. That means we will have to swap machines once the new VMs are ready.
Alright, those boxes exist.  Not puppeted, per your usual, generated from your templates. 

mm-ci-production-new.qa.scl3.mozilla.com as specced
mm-ci-staging-new.qa.scl3.mozilla.com same drive geometry.  1 CPU and 2G RAM to match its predecessor.  

I have not added to Nagios, as these are temporary as named.
Assignee: server-ops-virtualization → cknowles
Whiteboard: [qa-automation-blocked] → [qa-automation-blocked][vm-create:2]
So I configured both VMs for mozmill-ci and it seems to work fine. 

Chris, I would like to replace the staging machine before the weekend so that we have some test results from over the weekend to make a decision about production most likely early next week.

Best would be to coordinate the switch (IP change, DNS name change) via IRC. I will be around the whole day tomorrow. Please let me know when it would work best for you. Thanks!
I should be available in the morning 0700-ish Eastern ... (any earlier than that and you're likely running into non-caffeinated me, which has all sorts of risks.)

I'll ping on IRC.
Pinged, did the switcheroo. We now have mm-ci-staging-old.qa.scl3 and mm-ci-staging.qa.scl3 (the VM created in this bug).

:whimboo verified that things look acceptable - will monitor and make sure all is well, before scheduling the prod cutover.  From IRC, potentially Mon/Tues - though I'm flexible.

VMs renamed in inventory, vsphere (migrated datastores to make the names "real"), spreadsheets.

Let me know if you need anything else here.  Thanks!
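The naming side of the switcheroo can be sketched as an ordered rename plan. This is only an illustration under assumptions: `swap_plan` is a hypothetical helper, it models hostnames only (not the IP and DNS changes coordinated on IRC), and it does not talk to inventory or vsphere.

```python
from typing import List, Tuple


def swap_plan(base: str, domain: str = "qa.scl3.mozilla.com") -> List[Tuple[str, str]]:
    """Return ordered (current_name, new_name) renames for swapping in a replacement VM."""
    return [
        # Step 1: move the existing box out of the way so the name is free.
        (f"{base}.{domain}", f"{base}-old.{domain}"),
        # Step 2: promote the temporary "-new" box to the real name.
        (f"{base}-new.{domain}", f"{base}.{domain}"),
    ]


if __name__ == "__main__":
    for current, new in swap_plan("mm-ci-staging"):
        print(f"{current} -> {new}")
```

The order matters: the old box has to give up the name before the replacement can take it.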
I did a check of the new staging VM and it is all working perfectly. I cannot see anything that has broken since we swapped those VMs. I think we can replace production when Chris is back and QA doesn't have to run any tests. So I hope it will be tomorrow.
Status: NEW → ASSIGNED
I will be around on Tuesday, though I have a dentist appointment ~0845 Eastern - but other than that I should be available.  Let me know timings on when QA is able to let us have this.
Robert, will there be any beta build to test tomorrow or do we have enough time for the transition to the new box? Thanks.
Flags: needinfo?(kairo)
I was out yesterday - but yes, today we are running update tests. Usually we do every Tuesday and Friday.
Flags: needinfo?(kairo)
Thanks Robert. Chris, so we will do it tomorrow then! It should give us enough time until Friday.
Alright Henrik, I'll ping you in the AM and we can work on getting it switched over.
Alright, switched the prod machines, and we now have mm-ci-production.qa.scl3 (the new prod box) and mm-ci-production-old.qa.scl3.

I've updated inventory, spreadsheets, and done storage migration to make the names stick.

Per IRC conversation with :whimboo - will reapproach on Monday 9/14 to determine if the -old machines can be removed.
We didn't run any tests for the latest build, 41.0b9, last Friday. This happened for both staging and production. I don't think that this is related to our replacement here, but rather to a Pulse problem. I will have to figure that out first before we can finally close this bug.
Alright, moved the reminder to reapproach about decom to later.  Let me know if you need anything from me on this.
Depends on: 1204488
Ok, it turns out that the underlying issue reported in my last comment is not related to this host replacement. So we are fine here. I don't see why we should keep the old boxes around any longer.
No longer depends on: 1204488
Alright, the -old VMs are powered off.  No Nagios, no Puppet, no NFS; since they've changed IP addresses, really nothing is left ...

Will keep these down for a week, and then destroy, unless you let me know that you need them back.
Alright, a week has passed, no screaming.  The -old boxes are deleted from disk and removed from inventory.  They were never puppeted by that name, never RHN'd, and had no NFS and no backups - they're all gone.  Closing things out.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [qa-automation-blocked][vm-create:2] → [qa-automation-blocked][vm-create:2][vm-delete:2]
Thanks Chris.