1472720 - Multiple t-yosemite-r7/mac minis in a bad state

Bogdan Crisan [:bcrisan] (EEST - GMT + 3)

Reporter

Description

•

7 years ago

Hello,

Machines that are still missing:

MDC1: 251, 257, 260, 272, 280, 298, 303, 307, 315, 327, 336, 344, 345, 349, 352, 353, 373, 375, 394, 399, 426, 436, 440, 449
MDC2: 093, 098, 099, 127, 172, 183, 189, 214, 222, 225, 229

All of them are t-yosemite-r7-*** and almost all of them are missing from Taskcluster.

The last one missing from MDC1 (449) takes jobs but burns them with exceptions (the bug for it, 1461914, has been reopened).

Kendall Libby [:fubar] (he/him)

Comment 1

•

7 years ago

I need to know what "missing from taskcluster" means, and we need to understand why this is happening. That's a lot of systems, and there's definitely a problem here; if there's something wrong with those systems, we need to know what it is and set up monitoring for it, that isn't someone manually looking through lists of hosts in TC.

Radu Iman[:riman]

Comment 2

•

7 years ago

Hello,

I used this script [1] to bring those machines back into Taskcluster and then rebooted them from the UI (as :dhouse suggested in this comment [2]).

[1] - https://github.com/davehouse/relops-infra/blob/quarantine_nonexisting/quarantine_tc.py
[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1467949#c4

(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #0)
> Hello,
>
> Machines that are still missing:
>
> MDC1: 251, 257, 260, 272, 280, 298, 303, 307, 315, 327, 336, 344, 345, 349, 352, 353, 373, 375, 394, 399, 426, 436, 440, 449

MDC1:
251, 257, 272, 303, 307, 315, 336, 344, 345, 352, 353, 373, 375, 399, 436, 440 - are successfully running tasks
280 - rebooted but still without tasks (ssh does not work)
260 - BUG: 1472841
298 - BUG: 1472855
327 - BUG: 1472861
349 - BUG: 1472865
394 - BUG: 1395960
426 - BUG: 1472868

> MDC2: 093, 098, 099, 127, 172, 183, 189, 214, 222, 225, 229

MDC2:
098, 099, 172, 183, 214, 222, 225 - are successfully running tasks
229 - rebooted but still without tasks (ssh does not work)
093 - BUG: 1472880
127 - BUG: 1472878
189 - BUG: 1472682

> All of them are t-yosemite-r7-*** and almost all of them are missing from
> taskcluster.
>
> The last one missing from MDC1 (449) takes jobs but burn them with
> exceptions (the bug for it has been reopened, 1461914)

I've re-imaged t-yosemite-r7-449. It looks like it is running jobs. I'll keep monitoring it.

Danut Labici [:dlabici]

Comment 3

•

7 years ago

(In reply to Kendall Libby [:fubar] from comment #1)
> I need to know what "missing from taskcluster" means, and we need to
> understand why this is happening. That's a lot of systems, and there's
> definitely a problem here; if there's something wrong with those systems, we
> need to know what it is and set up monitoring for it, that isn't someone
> manually looking through lists of hosts in TC.

Suddenly the machine goes into a locked state and/or stops communicating with the TC service, so TC removes it from the lists (which is why the machines are missing).

Currently we are doing manual checks on all moonshots across all platforms, watching for missing numbers (e.g. 1, 2, 3, 5, 6, 7... 4 is missing). Unless stated otherwise, we know that the MS machine numbers should run, for example, from 1 to 100 without skipping any.

When we find missing MSs we start with a reboot, which could fix the issue; if that doesn't help, we re-image them.

For the past 3 weeks (give or take) we have probably re-imaged more than 400 times and also did hundreds of restarts. We definitely have an issue here, and on a usual day we expect to find anywhere between 20 and 50 machines in this state (across all 3 OSes).

As for what the issue is, I know dhouse is working with the devs to fix this, but currently anything that changes (such as OCC) will end up with us or dhouse needing to re-image the machines.
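The manual "watch for missing numbers" check described above can be sketched in a few lines of Python. This is only an illustration, not the script ciduty actually uses: the function names `missing_ids` and `hostnames` are hypothetical, the three-digit zero padding is an assumption based on numbers such as 093 in this bug, and the list of seen workers is hard-coded rather than fetched from Taskcluster.

```python
def missing_ids(seen, first, last):
    """Return the worker numbers in [first, last] that are absent from `seen`."""
    present = set(seen)
    return [n for n in range(first, last + 1) if n not in present]


def hostnames(numbers, prefix="t-yosemite-r7-"):
    """Format worker numbers as hostnames (zero padding to 3 digits is assumed)."""
    return [f"{prefix}{n:03d}" for n in numbers]


# Example from the comment above: 1, 2, 3, 5, 6, 7 are present, 4 is missing.
seen = [1, 2, 3, 5, 6, 7]
gaps = missing_ids(seen, 1, 7)
print(gaps)             # [4]
print(hostnames(gaps))  # ['t-yosemite-r7-004']
```

The expected range has to be supplied by hand, since a pool is only known to be contiguous "unless stated otherwise".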

Zsolt Fay [:zfay]

Updated

•

7 years ago

Blocks: 1467949

Kendall Libby [:fubar] (he/him)

Comment 4

•

7 years ago

(In reply to Danut Labici [:dlabici] from comment #3)
> Suddenly the machine goes in a locked state and/or doesn't communicate with
> the TC service anymore so TC is removing them from the lists (which is why
> they are missing). Currently we are doing manual checks on all moonshots
> across all platforms, watching for missing numbers (eg: 1,2,3,5,6,7... 4 is
> missing) we know for sure (if not stated otherwise) that the number of MS
> machines should go for example from 1 to 100 without skipping numbers.

> For the past 3 weeks (give or take) we have probably re-imaged more than 400
> times and also did hundreds of restarts.
> We definitely have an issue here and on a usual day we expect to find
> anywhere between 20 to 50 machines (on all 3 OSes)

I've asked Dave to file a couple of bugs and continue working with you folks; I'd like ciduty and relops to start investigating more deeply why the machines go offline. It started relatively recently, and we're seeing it on both the minis and the moonshot hardware; that says to me that it's something in CI, maybe a job or something. It will probably be worth talking to :jmaher and seeing if he has any thoughts on job changes, etc.

Dave is also going to talk to the TC folks about understanding what the process is around "removing" systems from TC and finding a better process for hardware. I think it makes sense for AWS, where there's a provisioner tracking instances, but less so for hardware. At the very least we should be getting notifications when this happens, rather than ciduty folks manually paging through and comparing notes.

Bogdan Crisan [:bcrisan] (EEST - GMT + 3)

Reporter

Comment 5

•

7 years ago

> Dave is also going to talk to the TC folks about understanding what the
> process is around "removing" systems from TC and finding a better process
> for hardware.

Don't get us wrong, but the "missing" part is not actually a bad thing for us: it tells us that the machine isn't doing its job.

> I think it makes sense for AWS where there's a provisioner
> tracking instances, but less so for hardware. At the very least we should be
> getting notifications for when this happens, rather than ciduty folks
> manually paging through and comparing notes.

We have a script (it's pretty rough and unpolished at the moment, and Dlabici is working on it as we speak) that checks for missing machines, using Taskcluster's API to get the list.

Small but important mention here: I think that if we modify Taskcluster so that it no longer removes those machines, we will not be able to use the script anymore to detect the ones with problems.
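One way a check could keep working even if Taskcluster stopped removing idle machines is to look at how recently each worker was active, not only at whether it is listed. The sketch below is hypothetical throughout: the `classify` function, the `last_seen` mapping, and the 6-hour threshold are assumptions for illustration, and no real Taskcluster API call is shown. The mapping would have to be built from whatever worker data the API actually returns.

```python
from datetime import datetime, timedelta, timezone


def classify(expected, last_seen, now, max_idle=timedelta(hours=6)):
    """Sort expected hosts into missing, stale, or ok.

    expected:  iterable of hostnames that should exist
    last_seen: dict of hostname -> datetime of last known activity (assumed input)
    max_idle:  arbitrary illustrative threshold, not a value from this bug
    """
    report = {"missing": [], "stale": [], "ok": []}
    for host in sorted(expected):
        ts = last_seen.get(host)
        if ts is None:
            report["missing"].append(host)   # not listed at all
        elif now - ts > max_idle:
            report["stale"].append(host)     # listed, but idle for too long
        else:
            report["ok"].append(host)
    return report


now = datetime(2018, 7, 3, 12, 0, tzinfo=timezone.utc)
expected = ["t-yosemite-r7-280", "t-yosemite-r7-303", "t-yosemite-r7-449"]
last_seen = {
    "t-yosemite-r7-303": now - timedelta(minutes=10),
    "t-yosemite-r7-449": now - timedelta(days=1),
}
print(classify(expected, last_seen, now))
```

With an approach like this, a machine that stays listed but stops taking work would land in the "stale" bucket instead of going unnoticed.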

:dhouse

Updated

•

7 years ago

Depends on: 1473589

Bogdan Crisan [:bcrisan] (EEST - GMT + 3)

Reporter

Comment 6

•

6 years ago

The problems were either resolved or are tracked in the individual machine bugs. Closing this since all of the machines above are in a working state.

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

5 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Tracking

(Not tracked)

People

(Reporter: bcrisan, Unassigned)

Security

(public)
