Closed
Bug 1467949
Opened 7 years ago
Closed 5 years ago
OSX nodes stop functioning
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: zfay, Unassigned)
Using this bug to keep track of any faulty OSX machines
Comment 1•7 years ago
Changed summary; moonshots are only Windows or Linux, and OSX is only on Mac minis. :-)
Summary: Moonshot OSX nodes stop functioning → OSX nodes stop functioning
Reporter
Comment 2•7 years ago
After checking the OSX machines today I noticed a few were missing:
MDC1: 251, 253, 257, 260, 272, 280, 298, 303, 306, 307, 310, 315, 327, 336, 344, 345, 349, 352, 353, 357, 373, 375, 380, 393, 394, 399, 415, 426, 436, 449, 460, 464
MDC2: 071, 072, 073, 074, 077, 080, 093, 119, 127, 130, 139, 156, 165, 166, 172, 183, 187, 189, 190, 192, 196, 200, 214, 222, 223, 225, 229, 231, 232, 235
Reporter
Comment 3•7 years ago
Managed to ssh into and reimage the following OSX machines: 136,139,156,166,210,223,235,253,306,310,357,380,393,415,460,464
Still got a bunch of them that we can't even ssh into. @dhouse got any info that could help out here?
(In reply to Zsolt Fay [:zsoltfay] from comment #3)
> Managed to ssh into and reimage the following OSX machines:
> 136,139,156,166,210,223,235,253,306,310,357,380,393,415,460,464
>
> Still got a bunch of them that we can't even ssh into. @dhouse got any info
> that could help out here?
From what I've seen, if we cannot ssh into them then we can try cycling their power through the PDU (which is what roller tries when ssh fails). If they are not visible in the Taskcluster worker explorer, then you can create them (through the Taskcluster API; see below) or create a bug for DCOps to physically reboot and reimage/netboot the machines.
Here is a version of the quarantine script that will add/define a worker if it is missing (https://github.com/davehouse/relops-infra/blob/quarantine_nonexisting/quarantine_tc.py). You can call this script like:
`python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc2 t-yosemite-r7-449`
Then the worker explorer will show the machine and you can reboot it from there.
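The steps above can be sketched as a small decision helper. This is only one reading of the advice in this comment, not a description of what roller or any existing tool does; the function name `suggest_actions` and the wording of each step are made up for illustration.

```python
from typing import List


def suggest_actions(ssh_reachable: bool, visible_in_worker_explorer: bool) -> List[str]:
    """Return recovery steps in the order to try them (illustrative only)."""
    if ssh_reachable:
        # Machines we can still log into can be rebooted or reimaged directly.
        return ["reboot or reimage over ssh"]

    # ssh failed: the first fallback is a power cycle through the PDU.
    actions = ["power-cycle through the PDU"]

    if not visible_in_worker_explorer:
        # The worker is unknown to Taskcluster: define it so the worker
        # explorer can reboot it, or hand the machine over to DCOps.
        actions.append(
            "define the worker with quarantine_tc.py --enable, "
            "then reboot it from the worker explorer"
        )
        actions.append(
            "otherwise file a DCOps bug for a physical reboot and reimage/netboot"
        )
    return actions


for step in suggest_actions(ssh_reachable=False, visible_in_worker_explorer=False):
    print(step)
```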
(In reply to Dave House [:dhouse] from comment #4)
> `python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g
> mdc2 t-yosemite-r7-449`
In my example, I put the wrong group. #449 is in mdc1 so it should be:
`python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc1 t-yosemite-r7-449`
(In reply to Dave House [:dhouse] from comment #5)
> (In reply to Dave House [:dhouse] from comment #4)
> > `python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g
> > mdc2 t-yosemite-r7-449`
>
> In my example, I put the wrong group. #449 is in mdc1 so it should be:
> `python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g
> mdc1 t-yosemite-r7-449`
I ran the script for a few of the machines that you listed but were not able to ssh into:
```
[david@george relops-infra]$ python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc1 t-yosemite-r7-{394,399,426,436,449}
t-yosemite-r7-394 not quarantined
t-yosemite-r7-399 not quarantined
t-yosemite-r7-426 not quarantined
t-yosemite-r7-436 not quarantined
t-yosemite-r7-449 not quarantined
```
Reporter
Comment 7•7 years ago
I've done a series of checks, reboots and re-images to my previously noted machines and to the ones you de-quarantined. Many of the machines have been successfully re-booted and re-imaged and are now taking jobs.
I've tried using your script on the ones that were still unreachable, t-yosemite-r7-{260, 280, 298, 327, 349, 426} from mdc1 and {093, 127, 189, 229} from mdc2, followed by a reboot. Rebooting does no good and it's worth noting we can't ssh into these machines either.
Should we file one major bug for the above mentioned workers under DCOps?
Flags: needinfo?(dhouse)
Depends on: t-yosemite-r7-239
(In reply to Zsolt Fay [:zsoltfay] from comment #7)
> I've done a series of checks, reboots and re-images to my previously noted
> machines and to the ones you de-quarantined. Many of the machines have been
> successfully re-booted and re-imaged and are now taking jobs.
>
>
> I've tried using your script on the ones that were still unreachable,
> t-yosemite-r7-{260, 280, 298, 327, 349, 426} from mdc1 and {093, 127, 189,
> 229} from mdc2, followed by a reboot. Rebooting does no good and it's worth
> noting we can't ssh into these machines either.
>
> Should we file one major bug for the above mentioned workers under DCOps?
One major bug with DCOps would be fine since there are so many right now.
Flags: needinfo?(dhouse)
Comment 9•7 years ago
Worked on:
MDC2
t-yosemite-r7-083.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-086.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-089.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-091.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-095.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-121.test.releng.mdc2.mozilla.com Quarantined
t-yosemite-r7-122.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-124.test.releng.mdc2.mozilla.com Missing from TC and unreachable. The last logs were sent on Jul 01 from plugin-container, which was unable to create a connection because the sandbox denied the right to look up an Apple service and the corresponding process was not able to talk to launchservicesd.
t-yosemite-r7-130.test.releng.mdc2.mozilla.com Missing from TC and unreachable
t-yosemite-r7-160.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-162.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-177.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-184.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-193.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-201.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-234.test.releng.mdc2.mozilla.com Rebooted
MDC1
t-yosemite-r7-239.test.releng.mdc1.mozilla.com Rebooted (did not appear in TC), netbooted and reimaged. Rechecked after 27 minutes, reappeared in TC.
t-yosemite-r7-241.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-246.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-295.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-296.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-299.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-300.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-305.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-308.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-316.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-319.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-322.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-332.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-334.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-356.test.releng.mdc1.mozilla.com Alive but throws: Stdio forwarding request failed: Session open refused by peer
t-yosemite-r7-357.test.releng.mdc1.mozilla.com is unreachable
t-yosemite-r7-358.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-362.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-371.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-380.test.releng.mdc1.mozilla.com - Loaned to Dragrom
t-yosemite-r7-381.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-391.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-396.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-407.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-414.test.releng.mdc1.mozilla.com Alive but locks up while trying to ssh into it. Tried again later but it was still in the same state. Meanwhile checked the logs: sshd is running, with pflog right after (the prompt looks locked and after a while shows "Connection closed by UNKNOWN port 65535").
t-yosemite-r7-415.test.releng.mdc1.mozilla.com Rebooted, doesn't appear in TC, reimaged
t-yosemite-r7-417.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-425.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-439.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-441.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-442.test.releng.mdc1.mozilla.com Rebooted, doesn't appear in TC. Checked the logs: a lot of ReportCrash and FileStatsAgent activity is happening, and it also doesn't let me ssh into it.
t-yosemite-r7-444.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-455.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-464.test.releng.mdc1.mozilla.com Rebooted
Updated•7 years ago
Blocks: t-yosemite-r7-415
Reporter
Comment 10•7 years ago
Checked the OSX pool today; here's the outcome:
Rebooted t-yosemite-r7-{237,243,272,285,289,292,336,342,363,373,392,402,407,425,430,440,470} from MDC1
and t-yosemite-r7-{025,031,037,055,082,086,183,225} from MDC2.
All of these are now working properly.
Also have re-imaged the following:
t-yosemite-r7-322.test.releng.mdc1.mozilla.com
t-yosemite-r7-331.test.releng.mdc1.mozilla.com
t-yosemite-r7-354.test.releng.mdc1.mozilla.com
t-yosemite-r7-377.test.releng.mdc1.mozilla.com
t-yosemite-r7-416.test.releng.mdc1.mozilla.com
t-yosemite-r7-443.test.releng.mdc1.mozilla.com
These ones need to be checked ^
Comment 11•7 years ago
When rebooting these machines, if they have not taken jobs for hours, please also check if the worker processes are running.
I have created bug 1475689 because when I checked 186 today (Zsolt pointed it out as one that had not taken work in one day), I found that the worker processes were not running and that, from the logs, the last task had failed with a timeout. Something broke in how we are handling that failure since it should have automatically rebooted.
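A check like this could be scripted roughly as follows. This is a sketch under stated assumptions: the process name `generic-worker` is a placeholder, since this bug does not name the worker process, and the helper names are made up. It relies only on standard behaviour: `pgrep` exits 0 when at least one process matches and 1 when none do, and `ssh` exits 255 when the connection itself fails.

```python
import shlex
from typing import List


def worker_check_command(host: str, pattern: str) -> List[str]:
    """Build an ssh command that lists remote processes matching `pattern`."""
    # The remote command is a single string, so quote the pattern for the remote shell.
    remote = "pgrep -fl " + shlex.quote(pattern)
    return ["ssh", "-o", "ConnectTimeout=10", host, remote]


def interpret_exit_status(status: int) -> str:
    """Map the exit status of the ssh/pgrep call to a short verdict."""
    if status == 0:
        return "worker process running"
    if status == 1:
        return "no worker process found"
    if status == 255:
        return "ssh failed"
    return "pgrep error"


# "generic-worker" is an assumed process name; substitute the real one.
cmd = worker_check_command(
    "t-yosemite-r7-239.test.releng.mdc1.mozilla.com", "generic-worker"
)
print(" ".join(cmd))
```

Running `cmd` with `subprocess.run(cmd).returncode` and passing the result to `interpret_exit_status` would separate "machine is up but the worker is dead" from "machine is unreachable".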
Reporter
Comment 12•7 years ago
Rebooted all the lazy workers from today:
t-yosemite-r7-{253, 272, 313, 314, 328, 400, 459, 101, 131, 180, 181, 187, 194, 225, 229, 230}
Out of which
t-yosemite-r7-{272, 101, 180, 187, 194, 225, 229, 230, 314, 459} need a re-image.
@bcrisan will follow this up. Please let us know how far you've gotten.
Flags: needinfo?(bcrisan)
Comment 13•7 years ago
t-yosemite-r7-272 has open bug 1472845 and finishes jobs as exceptions; it is quarantined at the moment.
t-yosemite-r7-101 is ok; it has done a few tasks and all of them are green.
t-yosemite-r7-180, 187, 194, 225, 229, 230, 314, 459 are ok; all of them have done a few tasks and the majority of them are green.
Flags: needinfo?(bcrisan)
Reporter
Comment 14•7 years ago
Re-imaged missing OSX machines. All have taken jobs since:
t-yosemite-r7-094.test.releng.mdc1.mozilla.com
t-yosemite-r7-115.test.releng.mdc1.mozilla.com
t-yosemite-r7-159.test.releng.mdc1.mozilla.com
t-yosemite-r7-175.test.releng.mdc1.mozilla.com
t-yosemite-r7-306.test.releng.mdc1.mozilla.com
t-yosemite-r7-308.test.releng.mdc1.mozilla.com
t-yosemite-r7-338.test.releng.mdc1.mozilla.com
t-yosemite-r7-343.test.releng.mdc1.mozilla.com
t-yosemite-r7-374.test.releng.mdc1.mozilla.com
t-yosemite-r7-391.test.releng.mdc1.mozilla.com
t-yosemite-r7-455.test.releng.mdc1.mozilla.com
Updated•7 years ago
Blocks: t-yosemite-r7-350
Updated•7 years ago
Blocks: t-yosemite-r7-146
Updated•6 years ago
Blocks: t-yosemite-r7-284
Updated•6 years ago
Blocks: t-yosemite-r7-267
Updated•5 years ago
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX