Closed Bug 1472720 Opened 7 years ago Closed 6 years ago

Multiple t-yosemite-r7/mac minis in a bad state

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bcrisan, Unassigned)

References

Details

Hello,

Machines that are still missing:

MDC1: 251, 257, 260, 272, 280, 298, 303, 307, 315, 327, 336, 344, 345, 349, 352, 353, 373, 375, 394, 399, 426, 436, 440, 449
MDC2: 093, 098, 099, 127, 172, 183, 189, 214, 222, 225, 229

All of them are t-yosemite-r7-*** and almost all of them are missing from Taskcluster.

The last one missing from MDC1 (449) takes jobs but burns them with exceptions (the bug for it, 1461914, has been reopened).
I need to know what "missing from taskcluster" means, and we need to understand why this is happening. That's a lot of systems, and there's definitely a problem here; if there's something wrong with those systems, we need to know what it is and set up monitoring for it, that isn't someone manually looking through lists of hosts in TC.
Hello,

I used this script [1] to bring those machines back into Taskcluster and then rebooted them from the UI (as :dhouse suggested to us in this comment [2]).

[1] - https://github.com/davehouse/relops-infra/blob/quarantine_nonexisting/quarantine_tc.py
[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1467949#c4

(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #0)
> Hello,
>
> Machines that are still missing:
>
> >MDC1: 251, 257, 260, 272, 280, 298, 303, 307, 315, 327, 336, 344, 345, 349, 352, 353, 373, 375, 394, 399, 426, 436, 440, 449,

MDC1:
251, 257, 272, 303, 307, 315, 336, 344, 345, 352, 353, 373, 375, 399, 436, 440 - are successfully running tasks
280 - rebooted but still without tasks (ssh does not work)
260 - BUG: 1472841
298 - BUG: 1472855
327 - BUG: 1472861
349 - BUG: 1472865
394 - BUG: 1395960
426 - BUG: 1472868

> >MDC2: 093,098,099,127, 172, 183, 189, 214, 222, 225, 229

MDC2:
098, 099, 172, 183, 214, 222, 225 - are successfully running tasks
229 - rebooted but still without tasks (ssh does not work)
093 - BUG: 1472880
127 - BUG: 1472878
189 - BUG: 1472682

> All of them are t-yosemite-r7-*** and almost all of them are missing from
> taskcluster.
>
> The last one missing from MDC1 (449) takes jobs but burns them with
> exceptions (the bug for it has been reopened, 1461914)

I've re-imaged t-yosemite-r7-449. It looks like it is running jobs. I'll keep monitoring it.
(In reply to Kendall Libby [:fubar] from comment #1)
> I need to know what "missing from taskcluster" means, and we need to
> understand why this is happening. That's a lot of systems, and there's
> definitely a problem here; if there's something wrong with those systems, we
> need to know what it is and set up monitoring for it, that isn't someone
> manually looking through lists of hosts in TC.

A machine suddenly goes into a locked state and/or stops communicating with the TC service, so TC removes it from its lists (which is why the machines are missing). Currently we do manual checks on all moonshots across all platforms, watching for missing numbers (e.g. 1, 2, 3, 5, 6, 7...: 4 is missing). Unless stated otherwise, we know that the MS machine numbers should run, for example, from 1 to 100 without skipping any. When we find missing MSs we start with a reboot, which can fix the issue; if that doesn't help, we re-image them.

For the past 3 weeks (give or take) we have probably re-imaged more than 400 times and also done hundreds of restarts. We definitely have an issue here: on a usual day we expect to find anywhere between 20 and 50 machines in this state (across all 3 OSes).

As for what the issue is, I know dhouse is working with the devs to fix it, but currently anything that changes (such as OCC) ends up with us or dhouse needing to re-image the machines.
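The manual "watch for missing numbers" check described above can be sketched in a few lines. This is only an illustration, not the script ciduty uses: the function names, the `t-yosemite-r7-` default prefix, and the assumption that worker numbers are zero-padded to three digits are all hypothetical choices made for the example, and the list of seen numbers is assumed to come from whatever source lists the workers currently known to TC.

```python
def find_missing(seen, first, last):
    """Return the worker numbers in [first, last] that do not appear in `seen`.

    `seen` is any iterable of integers (assumed to be the numbers of the
    workers currently listed); duplicates and ordering do not matter.
    """
    expected = set(range(first, last + 1))
    return sorted(expected - set(seen))


def format_host(number, prefix="t-yosemite-r7-", width=3):
    """Build a hostname from a worker number (hypothetical naming helper)."""
    return f"{prefix}{number:0{width}d}"


if __name__ == "__main__":
    # Example from the comment above: 1, 2, 3, 5, 6, 7 are present, 4 is not.
    seen = [1, 2, 3, 5, 6, 7]
    for n in find_missing(seen, 1, 7):
        print(format_host(n))
```

The expected range has to be supplied by hand, which matches the caveat above that the numbering is only known to be contiguous unless stated otherwise.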
Blocks: 1467949
(In reply to Danut Labici [:dlabici] from comment #3)
>
> Suddenly the machine goes in a locked state and/or doesn't communicate with
> the TC service anymore so TC is removing them from the lists (which is why
> they are missing). Currently we are doing manual checks on all moonshots
> across all platforms, watching for missing numbers (eg: 1,2,3,5,6,7... 4 is
> missing) we know for sure (if not state otherwise) that the number of MS
> machines should go for example from 1 to 100 without skipping numbers.

> For the past 3 weeks (give or take) we have probably re-imaged more than 400
> times and also did hundreds of restarts.
> We definitely have an issue here and on a usual day we expect to find
> anywhere between 20 to 50 machines (on all 3 OSes)

I've asked Dave to file a couple of bugs and continue working with you folks; I'd like ciduty and relops to start investigating more deeply why the machines go offline. It started relatively recently, and we're seeing it on both the minis and the moonshot hardware; that says to me that it's something in CI, maybe a job or something. It will probably be worth talking to :jmaher and seeing if he has any thoughts on job changes, etc.

Dave is also going to talk to the TC folks about understanding what the process is around "removing" systems from TC and finding a better process for hardware. I think it makes sense for AWS, where there's a provisioner tracking instances, but less so for hardware. At the very least we should be getting notifications when this happens, rather than ciduty folks manually paging through and comparing notes.
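One way to get notifications instead of manually comparing notes is to diff two consecutive snapshots of the worker list and report only what changed. The sketch below assumes the two snapshots are plain lists of hostnames gathered elsewhere; `diff_snapshots` and its result keys are hypothetical names for this example, and how the snapshots are fetched or how the notification is delivered is deliberately left out.

```python
def diff_snapshots(previous, current):
    """Compare two worker-list snapshots.

    Returns a dict with the hosts that disappeared since the previous
    snapshot ("newly_missing") and the hosts that came back ("recovered").
    """
    previous, current = set(previous), set(current)
    return {
        "newly_missing": sorted(previous - current),
        "recovered": sorted(current - previous),
    }


if __name__ == "__main__":
    # Hostnames taken from this bug, used purely as sample data.
    before = ["t-yosemite-r7-251", "t-yosemite-r7-257", "t-yosemite-r7-260"]
    after = ["t-yosemite-r7-251", "t-yosemite-r7-260", "t-yosemite-r7-272"]
    changes = diff_snapshots(before, after)
    for host in changes["newly_missing"]:
        print(f"ALERT: {host} is no longer listed")
    for host in changes["recovered"]:
        print(f"OK: {host} is listed again")
```

Reporting only the difference keeps the alert volume tied to state changes rather than to the total number of machines that are down.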
> Dave is also going to talk to the TC folks about understanding what the
> process is around "removing" systems from TC and finding a better process
> for hardware.

Don't get us wrong, but the "missing" part is not actually a bad thing for us: it tells us that the machine isn't doing its job.

> I think it makes sense for AWS where there's a provisioner
> tracking instances, but less so for hardware. At the very least we should be
> getting notifications for when this happens, rather than ciduty folks
> manually paging through and comparing notes.

We have a script (it's pretty rough and unpolished at the moment, and dlabici is currently working on it) that checks for missing machines, using the Taskcluster API to get them. A small but important point: I think that if we change Taskcluster so that it no longer removes those machines, we will no longer be able to use the script to detect the ones with problems.
Depends on: 1473589
The problems were resolved or tracked in machine bugs. Closing this since all of the machines above are in a working state.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard