Closed Bug 1472682 (t-yosemite-r7-189) Opened 7 years ago Closed 6 years ago

[MDC2] t-yosemite-r7-189 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: relops-bug-generator, Assigned: dhouse)

References

Details

(Whiteboard: crashes)

No description provided.
Depends on: 1472683
Depends on: 1473358
Summary: t-yosemite-r7-189.test.releng.mdc2.mozilla.com. problem tracking → [MDC2] t-yosemite-r7-189 problem tracking
I've quarantined this machine because it was running 3 tasks at once and has many tasks resolved as exceptions (claim expired; timeouts). From the logs, it looks like it rebooted during tasks (random reboot?).
reporting logs to papertrail
shows active tasks
has completed N tasks in the last few hours
https://papertrailapp.com/groups/1223184?filter=t-yosemite-r7-189
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-189
I've kicked off the reimage:
```
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ssh root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com
This host is set to follow security level "low"
Unauthorized access prohibited
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# ps -ef|grep worker
28 253 1 0 10:01AM ?? 0:00.00 /bin/bash /usr/local/bin/run-generic-worker.sh run --config /etc/generic-worker.config
28 255 253 0 10:01AM ?? 0:00.13 /usr/local/bin/generic-worker run --config /etc/generic-worker.config
28 256 253 0 10:01AM ?? 0:00.01 logger -t generic-worker -s
0 344 340 0 10:03AM ttys000 0:00.00 grep worker
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# ls -ltr /Users/cltbld/tasks/task*/logs/
ls: /Users/cltbld/tasks/task*/logs/: No such file or directory
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# ls -ltr /Users/cltbld/tasks
total 0
drwxr-xr-x 3 cltbld staff 102 Jul 25 10:01 task_1532538101
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# ls -ltr /Users/cltbld/tasks/task*
total 0
drwxr-xr-x 2 cltbld staff 68 Jul 25 10:01 generic-worker
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# ls -ltr /Users/cltbld/tasks/task*/gen*
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# /usr/sbin/bless --netboot --server bsdp://10.51.56.16; reboot
Connection to t-yosemite-r7-189.test.releng.mdc2.mozilla.com closed by remote host.
Connection to t-yosemite-r7-189.test.releng.mdc2.mozilla.com closed.
```
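For anyone repeating the check in the session above, a rough sketch of filtering `ps -ef` output for generic-worker processes might look like this. The function name `worker_processes` is hypothetical, and the 8-column layout (UID PID PPID C STIME TTY TIME CMD) is assumed from the transcript rather than taken from any relops tooling.

```python
from typing import List


def worker_processes(ps_output: str) -> List[str]:
    """Return the command column of `ps -ef` lines that mention generic-worker.

    Assumes 8 whitespace-separated columns (UID PID PPID C STIME TTY TIME CMD),
    as in the session pasted above. The grep process itself is skipped.
    """
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # keep the full command as one field
        if len(fields) < 8:
            continue  # header, blank, or truncated line
        cmd = fields[7]
        if "generic-worker" in cmd and not cmd.startswith("grep "):
            found.append(cmd)
    return found
```

Fed the four `ps` lines from the session, this would return the wrapper script, the worker binary, and the logger, and drop the `grep worker` line.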
Deploystudio success email received
puppet cert email received
worker started taking tasks (and has completed 5 since then)
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-189
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I rebooted this worker because it had not worked in the last 4 days. After the reboot it ran jobs, but they resolved as exceptions. I tried to reimage it, but the SSH connection is not working:
Stdio forwarding request failed: Session open refused by peer
ssh_exchange_identification: Connection closed by remote host
I've quarantined this machine.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
After a reboot, there is a short window to log in before the machine reboots again. I've watched the logs and don't see any errors or other entries showing why it is rebooting. I've kicked off a netboot for reimaging. When it comes back up, we'll see whether the reboot loop continues.
Depends on: 1479600
deploystudio success email received
puppet cert email received
But I don't think it completed puppetizing:
no response to ping or ssh
no logs in papertrail for 2 hours
I turned the power off and then back on. Still nothing. I've reopened the bug with DCOps to check the machine again.
Machine has issues again. SSH unresponsive and does not appear in taskcluster.
I tried to connect to it today, but the problem persists:
Stdio forwarding request failed: Session open refused by peer
ssh_exchange_identification: Connection closed by remote host
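The two error lines alone do not say whether the host is down or whether port 22 accepts connections but sshd never answers. A minimal sketch of a probe that could separate those cases is below. The function name `probe_ssh` and its return labels are hypothetical, and it assumes it is run from a host with direct network access to the worker (for example the jump host).

```python
import socket


def probe_ssh(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Classify a host as 'unreachable', 'no-banner', or 'ok'.

    'unreachable': the TCP connection could not be opened at all.
    'no-banner':   TCP connected, but no 'SSH-' identification line arrived.
    'ok':          the server sent an SSH identification banner.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    with conn:
        conn.settimeout(timeout)
        try:
            banner = conn.recv(256)
        except OSError:  # timeout or reset while waiting for the banner
            return "no-banner"
    return "ok" if banner.startswith(b"SSH-") else "no-banner"
```

For example, `probe_ssh("t-yosemite-r7-189.test.releng.mdc2.mozilla.com")` returning "unreachable" would point at power, boot, or network rather than at sshd.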
Depends on: 1472510
I've asked DCOps to physically check this machine since it has become unresponsive 3-4 times in 3-4 months.
Depends on: 1505968
Depends on: 1512019
Whiteboard: crashes
Depends on: 1517348
Depends on: 1517779

This machine's display is not working. We'll leave it in quarantine for now.
It is off warranty.

Assignee: nobody → dhouse

I've removed this machine from releng nagios and roller.

Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard