Closed Bug 1467949 Opened 7 years ago Closed 5 years ago

OSX nodes stop functioning

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: zfay, Unassigned)


Using this bug to keep track of any faulty OSX machines
Changed summary; moonshots are only windows or linux, and OSX is only on mac minis. :-)
Summary: Moonshot OSX nodes stop functioning → OSX nodes stop functioning
After checking the OSX machines today I noticed a few were missing:
MDC1: 251, 253, 257, 260, 272, 280, 298, 303, 306, 307, 310, 315, 327, 336, 344, 345, 349, 352, 353, 357, 373, 375, 380, 393, 394, 399, 415, 426, 436, 449, 460, 464
MDC2: 071, 072, 073, 074, 077, 080, 093, 119, 127, 130, 139, 156, 165, 166, 172, 183, 187, 189, 190, 192, 196, 200, 214, 222, 223, 225, 229, 231, 232, 235
Managed to ssh into and reimage the following OSX machines: 136,139,156,166,210,223,235,253,306,310,357,380,393,415,460,464 Still got a bunch of them that we can't even ssh into. @dhouse got any info that could help out here?
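For bookkeeping, the missing list and the reimaged list can be diffed to see which reported machines are still outstanding. This is only a minimal Python sketch: the numbers are copied from the two comments above, and the helper `parse_ids` and all variable names are illustrative, not part of any existing tooling.

```python
# Machine numbers copied from the comments above; all names here are illustrative.
MISSING_MDC1 = ("251, 253, 257, 260, 272, 280, 298, 303, 306, 307, 310, 315, "
                "327, 336, 344, 345, 349, 352, 353, 357, 373, 375, 380, 393, "
                "394, 399, 415, 426, 436, 449, 460, 464")
MISSING_MDC2 = ("071, 072, 073, 074, 077, 080, 093, 119, 127, 130, 139, 156, "
                "165, 166, 172, 183, 187, 189, 190, 192, 196, 200, 214, 222, "
                "223, 225, 229, 231, 232, 235")
REIMAGED = "136,139,156,166,210,223,235,253,306,310,357,380,393,415,460,464"


def parse_ids(text):
    """Turn a comma- or space-separated list of machine numbers into a set of ints."""
    return {int(token) for token in text.replace(",", " ").split()}


missing = parse_ids(MISSING_MDC1) | parse_ids(MISSING_MDC2)
reimaged = parse_ids(REIMAGED)

# Reported missing and not yet reimaged.
still_missing = sorted(missing - reimaged)
# Reimaged but absent from both missing lists.
not_reported = sorted(reimaged - missing)

print("still missing:", ", ".join(f"{n:03d}" for n in still_missing))
print("reimaged but not reported:", ", ".join(f"{n:03d}" for n in not_reported))
```

The second list is a quick consistency check: any number that was reimaged without having been reported missing shows up there.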
(In reply to Zsolt Fay [:zsoltfay] from comment #3)
> Managed to ssh into and reimage the following OSX machines:
> 136,139,156,166,210,223,235,253,306,310,357,380,393,415,460,464
>
> Still got a bunch of them that we can't even ssh into. @dhouse got any info
> that could help out here?

From what I've seen, if we cannot ssh into them, we can try cycling their power through the PDU (which is what roller tries when ssh fails). If they are not visible in the Taskcluster worker explorer, you can either create them through the Taskcluster API (see below) or file a bug for DCOps to physically reboot and reimage/netboot the machines.

Here is a version of the quarantine script that will add/define a worker if it is missing (https://github.com/davehouse/relops-infra/blob/quarantine_nonexisting/quarantine_tc.py). You can call the script like this:
`python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc2 t-yosemite-r7-449`
Then the worker explorer will show the machine and you can reboot it from there.
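The triage order described in this comment can be written down as a small decision function. This is a rough sketch of my reading of the comment, not an existing tool: the function name `next_step` and its two boolean inputs are hypothetical, and the returned strings simply paraphrase the options mentioned in the thread.

```python
def next_step(ssh_ok, visible_in_worker_explorer):
    """Suggest the next action for one machine, following the order in the comment.

    Both arguments are booleans the operator fills in by hand; nothing here
    contacts the machine or Taskcluster.
    """
    if ssh_ok:
        # Reachable machines were rebooted or reimaged over ssh in this thread.
        return "reboot or reimage over ssh"
    if visible_in_worker_explorer:
        # ssh fails but the worker is still known to Taskcluster.
        return "cycle power through the PDU or reboot from the worker explorer"
    # ssh fails and the worker is not listed at all.
    return "define the worker with quarantine_tc.py --enable, or file a DCOps bug"


print(next_step(ssh_ok=False, visible_in_worker_explorer=False))
```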
(In reply to Dave House [:dhouse] from comment #4)
> `python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g
> mdc2 t-yosemite-r7-449`

In my example, I put the wrong group. #449 is in mdc1, so it should be:
`python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc1 t-yosemite-r7-449`
(In reply to Dave House [:dhouse] from comment #5)
> (In reply to Dave House [:dhouse] from comment #4)
> > `python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g
> > mdc2 t-yosemite-r7-449`
>
> In my example, I put the wrong group. #449 is in mdc1 so it should be:
> `python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g
> mdc1 t-yosemite-r7-449`

I ran the script for a few of the machines that you listed but were not able to ssh into:
```
[david@george relops-infra]$ python quarantine_tc.py --enable -p releng-hardware -w gecko-t-osx-1010 -g mdc1 t-yosemite-r7-{394,399,426,436,449}
t-yosemite-r7-394 not quarantined
t-yosemite-r7-399 not quarantined
t-yosemite-r7-426 not quarantined
t-yosemite-r7-436 not quarantined
t-yosemite-r7-449 not quarantined
```
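Since the group has to match the machine's data center, it can help to build the command line per group before running it. The sketch below only assembles and prints the command and never executes it. It uses just the flags shown in the examples above (`--enable`, `-p`, `-w`, `-g`); the helper names and the parameter names `provisioner` and `worker_type` are my own guesses at what `-p` and `-w` mean, not something the script documents here.

```python
import re
import shlex


def expand_workers(pattern):
    """Expand one shell-style brace group, e.g. 'host-{1,2}' -> ['host-1', 'host-2'].

    Unlike the shell, this tolerates spaces and empty items inside the braces.
    """
    match = re.fullmatch(r"(.*)\{([^{}]*)\}(.*)", pattern)
    if match is None:
        return [pattern]
    prefix, body, suffix = match.groups()
    return [f"{prefix}{item.strip()}{suffix}" for item in body.split(",") if item.strip()]


def quarantine_command(group, workers,
                       provisioner="releng-hardware",
                       worker_type="gecko-t-osx-1010"):
    """Build (but do not run) the argv list for quarantine_tc.py as used above."""
    return ["python", "quarantine_tc.py", "--enable",
            "-p", provisioner, "-w", worker_type, "-g", group, *workers]


cmd = quarantine_command("mdc1", expand_workers("t-yosemite-r7-{394,399,426,436,449}"))
print(shlex.join(cmd))
```

Expanding the names in Python rather than in the shell avoids surprises when a list is pasted with spaces after the commas, which bash would not treat as a single brace expansion.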
Depends on: 1472720
I've done a series of checks, reboots and re-images on my previously noted machines and on the ones you de-quarantined. Many of the machines have been successfully rebooted and re-imaged and are now taking jobs.

I've tried using your script on the ones that were still unreachable, t-yosemite-r7-{260, 280, 298, 327, 349, 426} from mdc1 and {093, 127, 189, 229} from mdc2, followed by a reboot. Rebooting does no good, and it's worth noting that we can't ssh into these machines either.

Should we file one major bug for the above-mentioned workers under DCOps?
Flags: needinfo?(dhouse)
Depends on: 1473589
Depends on: t-yosemite-r7-239
(In reply to Zsolt Fay [:zsoltfay] from comment #7)
> I've done a series of checks, reboots and re-images to my previously noted
> machines and to the ones you de-quarantined. Many of the machines have been
> successfully re-booted and re-imaged and are now taking jobs.
>
> I've tried using your script on the ones that were still unreachable
> t-yosemite-r7-{260, 280, 298, 327, 349, 426} from mdc1 and {093, 127, 189,
> 229} from mdc2, followed by a reboot. Rebooting does no good and it's worth
> noting we can't ssh into these machines either.
>
> Should we file one major bug for the above mentioned workers under DCOps?

One major bug with DCOps would be fine since there are so many right now.
Flags: needinfo?(dhouse)
Worked on:

MDC2
t-yosemite-r7-083.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-086.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-089.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-091.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-095.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-121.test.releng.mdc2.mozilla.com Quarantined
t-yosemite-r7-122.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-124.test.releng.mdc2.mozilla.com Missing from TC and unreachable. The last logs were sent on Jul 01 from plugin-container, which was unable to create a connection because the sandbox denied the right to look up an Apple service and the corresponding process was not able to talk to launchservicesd.
t-yosemite-r7-130.test.releng.mdc2.mozilla.com Missing from TC and unreachable
t-yosemite-r7-160.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-162.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-177.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-184.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-193.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-201.test.releng.mdc2.mozilla.com Rebooted
t-yosemite-r7-234.test.releng.mdc2.mozilla.com Rebooted

MDC1
t-yosemite-r7-239.test.releng.mdc1.mozilla.com Rebooted (did not appear in TC), netbooted and reimaged. Rechecked after 27 minutes, reappeared in TC.
t-yosemite-r7-241.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-246.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-295.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-296.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-299.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-300.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-305.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-308.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-316.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-319.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-322.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-332.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-334.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-356.test.releng.mdc1.mozilla.com Alive but throws: Stdio forwarding request failed: Session open refused by peer
t-yosemite-r7-357.test.releng.mdc1.mozilla.com is unreachable
t-yosemite-r7-358.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-362.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-371.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-380.test.releng.mdc1.mozilla.com - Loaned to Dragrom
t-yosemite-r7-381.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-391.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-396.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-407.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-414.test.releng.mdc1.mozilla.com Alive but locked while trying to ssh into it; tried again later but it was still in the same state. Meanwhile checked the logs: sshd is running, with pflog right after (the prompt looks locked and after a while shows "Connection closed by UNKNOWN port 65535")
t-yosemite-r7-415.test.releng.mdc1.mozilla.com Rebooted, doesn't appear in TC, reimaged
t-yosemite-r7-417.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-425.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-439.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-441.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-442.test.releng.mdc1.mozilla.com Rebooted, doesn't appear in TC. Checked the logs: a lot of ReportCrash and FileStatsAgent activity is happening. It also doesn't let me ssh into it.
t-yosemite-r7-444.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-455.test.releng.mdc1.mozilla.com Rebooted
t-yosemite-r7-464.test.releng.mdc1.mozilla.com Rebooted
Checked the osx pool today, here's the outcome:

Rebooted t-yosemite-r7-{237,243,272,285,289,292,336,342,363,373,392,402,407,425,430,440,470} from MDC1 and t-yosemite-r7-{025,031,037,055,082,086,183,225} from mdc2. All of these are now working properly.

Also have re-imaged the following:
t-yosemite-r7-322.test.releng.mdc1.mozilla.com
t-yosemite-r7-331.test.releng.mdc1.mozilla.com
t-yosemite-r7-354.test.releng.mdc1.mozilla.com
t-yosemite-r7-377.test.releng.mdc1.mozilla.com
t-yosemite-r7-416.test.releng.mdc1.mozilla.com
t-yosemite-r7-443.test.releng.mdc1.mozilla.com
These ones need to be checked ^
When rebooting these machines, if they have not taken jobs for hours, please also check if the worker processes are running. I have created bug 1475689 because when I checked 186 today (Zsolt pointed it out as one that had not taken work in one day), I found that the worker processes were not running and that, from the logs, the last task had failed with a timeout. Something broke in how we are handling that failure since it should have automatically rebooted.
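To decide which machines deserve that extra check, one option is to flag workers whose last task is older than some threshold. The sketch below is only an illustration: the function name `idle_workers`, the six-hour threshold, and the sample timestamps are all made up for the example, and the last-task times would have to come from wherever you read them (for instance, copied by hand from the worker explorer).

```python
from datetime import datetime, timedelta


def idle_workers(last_task_times, now, threshold=timedelta(hours=6)):
    """Return the names of workers whose last task is more than `threshold` before `now`.

    `last_task_times` maps a worker name to the datetime of its last task.
    The six-hour default is an arbitrary illustrative choice.
    """
    return sorted(name for name, last in last_task_times.items()
                  if now - last > threshold)


# Hypothetical sample values, invented purely to exercise the function.
now = datetime(2018, 7, 13, 12, 0)
sample = {
    "t-yosemite-r7-186": now - timedelta(days=1),      # no work for a day
    "t-yosemite-r7-241": now - timedelta(minutes=20),  # recently active
}

for name in idle_workers(sample, now):
    print(f"{name}: idle, check whether the worker processes are running")
```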
Rebooted all the lazy workers from today: t-yosemite-r7-{253, 272, 313, 314, 328, 400, 459, 101, 131, 180, 181, 187, 194, 225, 229, 230}

Of these, t-yosemite-r7-{272, 101, 180, 187, 194, 225, 229, 230, 314, 459} need a re-image. @bcrisan will follow this up. Please let us know how far you've gotten.
Flags: needinfo?(bcrisan)
t-yosemite-r7-272 has open bug 1472845 and finishes jobs as exceptions; it has been quarantined for the moment.
t-yosemite-r7-101 is ok; it has done a few tasks and all of them are green.
t-yosemite-r7-{180, 187, 194, 225, 229, 230, 314, 459} are ok; all of them have done a few tasks and the majority of them are green.
Flags: needinfo?(bcrisan)
Re-imaged missing osx machines. All have taken jobs since:
t-yosemite-r7-094.test.releng.mdc1.mozilla.com
t-yosemite-r7-115.test.releng.mdc1.mozilla.com
t-yosemite-r7-159.test.releng.mdc1.mozilla.com
t-yosemite-r7-175.test.releng.mdc1.mozilla.com
t-yosemite-r7-306.test.releng.mdc1.mozilla.com
t-yosemite-r7-308.test.releng.mdc1.mozilla.com
t-yosemite-r7-338.test.releng.mdc1.mozilla.com
t-yosemite-r7-343.test.releng.mdc1.mozilla.com
t-yosemite-r7-374.test.releng.mdc1.mozilla.com
t-yosemite-r7-391.test.releng.mdc1.mozilla.com
t-yosemite-r7-455.test.releng.mdc1.mozilla.com
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX