Closed
Bug 1472720
Opened 7 years ago
Closed 6 years ago
Multiple t-yosemite-r7/mac minis in a bad state
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bcrisan, Unassigned)
References
Details
Hello,
Machines that are still missing:
>MDC1: 251, 257, 260, 272, 280, 298, 303, 307, 315, 327, 336, 344, 345, 349, 352, 353, 373, 375, 394, 399, 426, 436, 440, 449,
>MDC2: 093,098,099,127, 172, 183, 189, 214, 222, 225, 229
All of them are t-yosemite-r7-*** machines, and almost all of them are missing from Taskcluster.
The last one missing from MDC1 (449) takes jobs but burns them with exceptions (the bug for it, 1461914, has been reopened).
Comment 1•7 years ago
I need to know what "missing from taskcluster" means, and we need to understand why this is happening. That's a lot of systems, and there's definitely a problem here; if there's something wrong with those systems, we need to know what it is and set up monitoring for it that isn't someone manually looking through lists of hosts in TC.
Comment 2•7 years ago
Hello,
I used this script [1] to bring those machines back into Taskcluster and then rebooted them from the UI (as dhouse suggested in this comment: [2]).
[1] - https://github.com/davehouse/relops-infra/blob/quarantine_nonexisting/quarantine_tc.py
[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1467949#c4
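For context, a minimal sketch of the quarantine-lifting idea (hypothetical, not the quarantine_tc.py linked above). It assumes the taskcluster Python client, credentials that allow quarantining workers (omitted here), the current Firefox CI rootUrl, and that these minis report under the releng-hardware provisioner as the gecko-t-osx-1010 worker type:

# Hypothetical sketch, not the quarantine_tc.py script linked above.
import datetime
import taskcluster

# Credentials are omitted; quarantineWorker requires an authorized client.
queue = taskcluster.Queue({"rootUrl": "https://firefox-ci-tc.services.mozilla.com"})

def lift_quarantine(worker_group, worker_id):
    # Setting quarantineUntil to a time in the past clears the quarantine,
    # so the worker can claim tasks again.
    queue.quarantineWorker(
        "releng-hardware", "gecko-t-osx-1010", worker_group, worker_id,
        {"quarantineUntil": taskcluster.stringDate(
            datetime.datetime.utcnow() - datetime.timedelta(hours=1))})

lift_quarantine("mdc1", "t-yosemite-r7-449")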
(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #0)
> Hello,
>
> Machines that are still missing:
>
> >MDC1: 251, 257, 260, 272, 280, 298, 303, 307, 315, 327, 336, 344, 345, 349, 352, 353, 373, 375, 394, 399, 426, 436, 440, 449,
MDC1: 251, 257, 272, 303, 307, 315, 336, 344, 345, 352, 353, 373, 375, 399, 436, 440 - are successfully running tasks
280 - rebooted but still without tasks (ssh does not work)
260 - BUG: 1472841
298 - BUG: 1472855
327 - BUG: 1472861
349 - BUG: 1472865
394 - BUG: 1395960
426 - BUG: 1472868
> >MDC2: 093,098,099,127, 172, 183, 189, 214, 222, 225, 229
MDC2: 098, 099, 172, 183, 214, 222, 225 - are successfully running tasks
229 - rebooted but still without tasks (ssh does not work)
093 - BUG: 1472880
127 - BUG: 1472878
189 - BUG: 1472682
> All of them are t-yosemite-r7-*** and almost all of them are missing from
> taskcluster.
>
> The last one missing from MDC1 (449) takes jobs but burn them with
> exceptions (the bug for it has been reopened, 1461914)
I've re-imaged t-yosemite-r7-449. It looks like it is running jobs. I'll keep monitoring.
Comment 3•7 years ago
(In reply to Kendall Libby [:fubar] from comment #1)
> I need to know what "missing from taskcluster" means, and we need to
> understand why this is happening. That's a lot of systems, and there's
> definitely a problem here; if there's something wrong with those systems, we
> need to know what it is and set up monitoring for it, that isn't someone
> manually looking through lists of hosts in TC.
Suddenly a machine goes into a locked state and/or stops communicating with the TC service, so TC removes it from the lists (which is why they are missing). Currently we are doing manual checks on all moonshots across all platforms, watching for missing numbers (e.g. 1, 2, 3, 5, 6, 7... 4 is missing); we know for sure (unless stated otherwise) that the MS machine numbers should run, for example, from 1 to 100 without skipping any.
When we find missing MSs we start with a reboot, which may fix the issue; if that doesn't help, we re-image them.
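For illustration, a small sketch of that "missing number" check (a hypothetical helper, not our actual tooling), assuming the hostnames are numbered contiguously per datacenter:

# Hypothetical sketch of the "missing number" check described above.
# Assumes contiguous numbering within a range, e.g. t-yosemite-r7-001..100.
def find_missing(reported_hosts, first, last, prefix="t-yosemite-r7-"):
    # Return the numbers in [first, last] with no reported worker.
    seen = {int(h[len(prefix):]) for h in reported_hosts if h.startswith(prefix)}
    return [n for n in range(first, last + 1) if n not in seen]

# Example: three workers visible for MDC1; 253 would be flagged as missing.
print(find_missing(["t-yosemite-r7-251", "t-yosemite-r7-252",
                    "t-yosemite-r7-254"], 251, 254))   # -> [253]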
For the past 3 weeks (give or take) we have probably re-imaged machines more than 400 times and also done hundreds of restarts.
We definitely have an issue here; on a usual day we expect to find anywhere between 20 and 50 machines in this state (across all 3 OSes).
As for what the issue is, I know dhouse is working with the devs to fix this, but currently anything that changes (such as OCC) ends up with us or dhouse needing to re-image the machines.
Comment 4•7 years ago
(In reply to Danut Labici [:dlabici] from comment #3)
>
> Suddently the machine goes in a locked state and/or doesn't communicate with
> the TC service anymore so TC is removing them from the lists (which is why
> they are missing). Currently we are doing manual checks on all moonshots
> across all platforms, watching for missing numbers (eg: 1,2,3,5,6,7... 4 is
> missing) we know for sure (if not state otherwise) that the number of MS
> machines should go for example from 1 to 100 without skipping numbers.
> For the past 3 weeks (give or take) we have probably re-imaged more than 400
> times and also did hundred of restarts.
> We definitely have an issue here and on a usual day we expect to find
> anywhere between 20 to 50 machines (on all 3 OSes)
I've asked Dave to file a couple of bugs and continue working with you folks; I'd like ciduty and relops to start investigating more deeply why the machines go offline. It started relatively recently, and we're seeing it on both the minis and the moonshot hardware; that says to me that it's something in CI, maybe a job or something. It will probably be worth talking to :jmaher and seeing if he has any thoughts on job changes, etc.
Dave is also going to talk to the TC folks about understanding what the process is around "removing" systems from TC and finding a better process for hardware. I think it makes sense for AWS where there's a provisioner tracking instances, but less so for hardware. At the very least we should be getting notifications for when this happens, rather than ciduty folks manually paging through and comparing notes.
Reporter
Comment 5•7 years ago
> Dave is also going to talk to the TC folks about understanding what the
> process is around "removing" systems from TC and finding a better process
> for hardware.
Don't get us wrong, but the missing part is not actually a bad thing; for us, it means that the machine isn't doing its job.
> I think it makes sense for AWS where there's a provisioner
> tracking instances, but less so for hardware. At the very least we should be
> getting notifications for when this happens, rather than ciduty folks
> manually paging through and comparing notes.
We have a script (it's in a pretty rough and unpolished state atm, and dlabici is working on it as we speak) that checks for missing machines, using Taskcluster's API to list them.
Small but important mention here: I think that if we modify Taskcluster so it no longer removes those machines, we won't be able to use the script to detect the ones with problems anymore.
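For reference, a rough sketch of how such a check can pull the worker list from Taskcluster (hypothetical, not dlabici's script); it assumes the taskcluster Python client and the same releng-hardware / gecko-t-osx-1010 names as in the earlier sketch:

# Hypothetical sketch of pulling the worker list the check compares against.
# A worker absent from this list would be treated as "missing".
import taskcluster

queue = taskcluster.Queue({"rootUrl": "https://firefox-ci-tc.services.mozilla.com"})

def list_worker_ids(provisioner_id="releng-hardware",
                    worker_type="gecko-t-osx-1010"):
    # Follow continuationToken pagination until the full list is collected.
    worker_ids, token = [], None
    while True:
        query = {"continuationToken": token} if token else {}
        result = queue.listWorkers(provisioner_id, worker_type, query=query)
        worker_ids.extend(w["workerId"] for w in result["workers"])
        token = result.get("continuationToken")
        if not token:
            return worker_ids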
Reporter
Comment 6•6 years ago
The problems were resolved or are tracked in individual machine bugs.
Closing this since all of the machines above are in a working state.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard