Closed
Bug 1472682
(t-yosemite-r7-189)
Opened 7 years ago
Closed 6 years ago
[MDC2] t-yosemite-r7-189 problem tracking
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: relops-bug-generator, Assigned: dhouse)
Details
(Whiteboard: crashes)
No description provided.
Updated•7 years ago
Summary: t-yosemite-r7-189.test.releng.mdc2.mozilla.com. problem tracking → [MDC2] t-yosemite-r7-189 problem tracking
I've quarantined this machine because it was running 3 tasks at once and many of its tasks resolved as exceptions (claim expired; timeouts). From the logs, it looks like it rebooted during tasks (random reboot?).
reporting logs to papertrail
shows active tasks
has completed N tasks in the last few hours
https://papertrailapp.com/groups/1223184?filter=t-yosemite-r7-189
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-189
I've kicked off the reimage:
```
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ssh root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com
This host is set to follow security level "low"
Unauthorized access prohibited
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# ps -ef|grep worker
28 253 1 0 10:01AM ?? 0:00.00 /bin/bash /usr/local/bin/run-generic-worker.sh run --config /etc/generic-worker.config
28 255 253 0 10:01AM ?? 0:00.13 /usr/local/bin/generic-worker run --config /etc/generic-worker.config
28 256 253 0 10:01AM ?? 0:00.01 logger -t generic-worker -s
0 344 340 0 10:03AM ttys000 0:00.00 grep worker
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# ls -ltr /Users/cltbld/tasks/task*/logs/
ls: /Users/cltbld/tasks/task*/logs/: No such file or directory
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# ls -ltr /Users/cltbld/tasks
total 0
drwxr-xr-x 3 cltbld staff 102 Jul 25 10:01 task_1532538101
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# ls -ltr /Users/cltbld/tasks/task*
total 0
drwxr-xr-x 2 cltbld staff 68 Jul 25 10:01 generic-worker
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# ls -ltr /Users/cltbld/tasks/task*/gen*
[root@t-yosemite-r7-189.test.releng.mdc2.mozilla.com ~]# /usr/sbin/bless --netboot --server bsdp://10.51.56.16; reboot
Connection to t-yosemite-r7-189.test.releng.mdc2.mozilla.com closed by remote host.
Connection to t-yosemite-r7-189.test.releng.mdc2.mozilla.com closed.
```
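The reimage step in the session above comes down to one command run as root on the worker: point the machine at the netboot server with `bless`, then reboot. Below is a minimal sketch of wrapping that step so the command can be reviewed before it runs. The function names `reimage_cmd` and `reimage_worker` and the `DRY_RUN` variable are hypothetical; only the `bless` invocation and the server address come from the log above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any releng tooling.

# Print the command that is run on the worker.
# $1: BSDP (netboot) server address
reimage_cmd() {
    printf '/usr/sbin/bless --netboot --server bsdp://%s; reboot\n' "$1"
}

# Print (default) or run the reimage over ssh.
# $1: worker hostname, $2: BSDP server address
reimage_worker() {
    host=$1
    server=$2
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: show what would be executed, touch nothing.
        printf 'ssh root@%s "%s"\n' "$host" "$(reimage_cmd "$server")"
    else
        ssh "root@$host" "$(reimage_cmd "$server")"
    fi
}
```

With `DRY_RUN` left at its default, `reimage_worker t-yosemite-r7-189.test.releng.mdc2.mozilla.com 10.51.56.16` only prints the ssh command; setting `DRY_RUN=0` would actually run it.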
Deploystudio success email received
puppet cert email received
worker started taking tasks (and has completed 5 since then)
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-189
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment 4•7 years ago
I rebooted this worker because it had not been working for the last 4 days. After the reboot it picked up jobs, but they resolved as exceptions. I then tried to reimage it, but the SSH connection is not working.
Stdio forwarding request failed: Session open refused by peer
ssh_exchange_identification: Connection closed by remote host
I've quarantined this machine.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
After a reboot, there is a short window to log in before the machine reboots again. I've watched the logs and I don't see any errors or other entries showing why it is rebooting.
I've kicked off a netboot for reimaging. When it comes back up, we'll see whether it continues the reboot loop.
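When the logs give no cause, one way to at least confirm a reboot loop is to count recent boot records. This is a rough sketch, assuming `last reboot` prints one line per boot that begins with the word `reboot` (the usual format on macOS and most Linux systems). `count_reboots` is a hypothetical helper name.

```shell
#!/bin/sh
# Count boot records read from stdin (hypothetical helper).
# grep -c prints 0 but exits non-zero when nothing matches, so the
# exit status is masked to keep the helper usable in pipelines.
count_reboots() {
    grep -c '^reboot' || true
}

# On the worker, during the short login window:
#   last reboot | count_reboots
```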
deploystudio success email received
puppet cert email received
But I don't think it completed puppetizing:
no response to ping, ssh
no logs in papertrail for 2 hours
I turned off the power, and then turned it back on. Still nothing.
I've reopened the bug with DCOps to check the machine again.
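The ping and ssh checks described above can be sketched as a small script. The names `classify` and `check_host` are hypothetical, and the 5-second ssh timeout is an arbitrary choice. `BatchMode=yes` keeps ssh from prompting for a password, so the check fails quickly instead of hanging.

```shell
#!/bin/sh
# Hypothetical reachability check for a worker.

# Map the ping and ssh exit codes to a short state string.
# $1: ping exit code, $2: ssh exit code
classify() {
    if [ "$1" -ne 0 ]; then
        echo "down"        # no response to ping
    elif [ "$2" -ne 0 ]; then
        echo "ping-only"   # answers ping, but ssh fails
    else
        echo "up"
    fi
}

# $1: worker hostname
check_host() {
    host=$1
    ping -c 1 "$host" >/dev/null 2>&1
    ping_rc=$?
    ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true >/dev/null 2>&1
    ssh_rc=$?
    classify "$ping_rc" "$ssh_rc"
}
```

A host that prints `down` after a power cycle, as this one would have, is a candidate for a physical check rather than another reimage attempt.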
Comment 7•6 years ago
Machine has issues again.
SSH is unresponsive and the machine does not appear in Taskcluster.
Comment 8•6 years ago
I tried to connect to it today, but the problem persists:
Stdio forwarding request failed: Session open refused by peer
ssh_exchange_identification: Connection closed by remote host
I've asked DCOps to physically check this machine since it has become unresponsive 3-4 times in 3-4 months.
Comment 10•6 years ago
This machine's display is not working. We'll leave it in quarantine for now.
It is off warranty.
Assignee: nobody → dhouse
Comment 11•6 years ago
I've removed this machine from releng nagios and roller.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard