Closed Bug 1472861 (t-yosemite-r7-327) Opened 7 years ago Closed 6 years ago

[MDC1] t-yosemite-r7-327 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: relops-bug-generator, Unassigned)

References

Details

No description provided.
Depends on: 1472862
Depends on: 1473358
Summary: t-yosemite-r7-327.test.releng.mdc1.mozilla.com. problem tracking → [MDC1] t-yosemite-r7-327 problem tracking
Machine was reimaged by :zfay and I took over the check stage. Machine is not taking jobs, so I went ahead with a second re-image.
The machine went back to the "Stdio forwarding request failed" issue.
The machine appears in Taskcluster but it does not take new jobs. I tried to log in to it but received the following error:
Stdio forwarding request failed: Session open refused by peer
ssh_exchange_identification: Connection closed by remote host
This machine is being investigated for hardware issues through the DCOps dependency, bug 1472862.
The last time it was reimaged by QTS, the machine did not come back up (and no DeployStudio mail arrived either).
No logs, no ping, no ssh:
https://papertrailapp.com/groups/1223184?filter=t-yosemite-r7-327
```
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ping t-yosemite-r7-327.test.releng.mdc1.mozilla.com
PING t-yosemite-r7-327.test.releng.mdc1.mozilla.com (10.49.56.111) 56(84) bytes of data.
^C
--- t-yosemite-r7-327.test.releng.mdc1.mozilla.com ping statistics ---
80 packets transmitted, 0 received, 100% packet loss, time 79928ms
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ host t-yosemite-r7-327.test.releng.mdc1.mozilla.com
t-yosemite-r7-327.test.releng.mdc1.mozilla.com has address 10.49.56.111
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ssh root@t-yosemite-r7-327.test.releng.mdc1.mozilla.com
ssh: connect to host t-yosemite-r7-327.test.releng.mdc1.mozilla.com port 22: Connection timed out
```
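The manual checks in this comment (DNS lookup, then an ssh connection attempt) could be scripted for repeat triage. The following is a minimal Python sketch under stated assumptions, not the tooling that was actually used here: the function names `resolve`, `tcp_port_open`, `classify`, and `triage` and the label strings are hypothetical. It does not send ICMP pings, and an open port 22 does not guarantee that an ssh login will succeed.

```python
import socket


def resolve(host):
    """Return the IPv4 address for host, or None if the DNS lookup fails."""
    try:
        return socket.gethostbyname(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def tcp_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts and refused connections
        return False


def classify(resolves, ssh_open):
    """Map probe results onto a rough triage label (labels are hypothetical)."""
    if not resolves:
        return "dns-missing"
    if not ssh_open:
        return "unreachable"
    return "ssh-port-open"


def triage(host):
    """Run both probes against host and return the triage label."""
    address = resolve(host)
    ssh_open = tcp_port_open(host) if address else False
    return classify(address is not None, ssh_open)
```

With the results shown in this comment (the name resolves, the connection to port 22 times out), `classify(True, False)` gives the "unreachable" label. Calling `triage(...)` with the worker's hostname would run the live probes from whatever host the script runs on.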
Reimaged and it is up and working correctly.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Did not take tasks for 4 days; ssh was unresponsive. Rebooted from Taskcluster, ssh came back alive, started the reimage process, and the successful termination message came shortly after.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Seems like we're hitting the stdio issue when trying to ssh into the machine. Looks alive as it responds to ping.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Rebooted and it looks good:
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-327
The machine was repeatedly crashing:
https://papertrailapp.com/systems/t-yosemite-r7-327/events
```
Nov 07 21:29:59 t-yosemite-r7-327 com.apple.xpc.launchd: (com.apple.ReportCrash.Root[20004]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash.DirectoryService
Nov 07 21:29:59 t-yosemite-r7-327 com.apple.xpc.launchd: (com.apple.opendirectoryd[20001]): Service exited due to signal: Segmentation fault: 11
Nov 07 21:29:59 t-yosemite-r7-327 com.apple.xpc.launchd: (com.apple.configd[20002]): Service exited due to signal: Segmentation fault: 11
Nov 07 21:29:59 t-yosemite-r7-327 com.apple.xpc.launchd: (com.apple.Kerberos.digest-service[20003]): Service exited due to signal: Segmentation fault: 11
Nov 07 21:29:59 t-yosemite-r7-327 com.apple.xpc.launchd: (com.apple.ReportCrash.Root[20004]): Service exited due to signal: Segmentation fault: 11
```
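To see which services are crashing in a log excerpt like the one in this comment, the launchd lines can be tallied per service. This is a minimal sketch, assuming the line format shown here; `count_signal_exits`, `EXIT_RE`, and `SAMPLE` are illustrative names and are not part of any Papertrail or launchd tooling.

```python
import re
from collections import Counter

# Matches "(service[pid]): Service exited due to signal: <name>: <number>".
EXIT_RE = re.compile(
    r"\((?P<service>[^\[\]()]+)\[(?P<pid>\d+)\]\): "
    r"Service exited due to signal: (?P<signal>.+?): (?P<number>\d+)\s*$"
)


def count_signal_exits(lines):
    """Count signal exits per (service, signal name) pair."""
    counts = Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            counts[(match.group("service"), match.group("signal"))] += 1
    return counts


# Three lines taken from the excerpt in this comment.
SAMPLE = [
    "Nov 07 21:29:59 t-yosemite-r7-327 com.apple.xpc.launchd: "
    "(com.apple.ReportCrash.Root[20004]): Endpoint has been activated through "
    "legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): "
    "com.apple.ReportCrash.DirectoryService",
    "Nov 07 21:29:59 t-yosemite-r7-327 com.apple.xpc.launchd: "
    "(com.apple.opendirectoryd[20001]): Service exited due to signal: "
    "Segmentation fault: 11",
    "Nov 07 21:29:59 t-yosemite-r7-327 com.apple.xpc.launchd: "
    "(com.apple.configd[20002]): Service exited due to signal: "
    "Segmentation fault: 11",
]

for (service, signal_name), n in count_signal_exits(SAMPLE).items():
    print(f"{service}: {signal_name} x{n}")
```

Several unrelated system services exiting on the same signal within the same second is the pattern the tally makes visible; the sketch only counts lines and does not diagnose a cause.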
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → FIXED

High exception/busted rate on the worker and one of the jobs terminated with:

16:59:17     INFO - GECKO(1170) | [Child 1176, Main Thread] WARNING: No active window: file /builds/worker/workspace/build/src/js/xpconnect/src/XPCJSContext.cpp, line 662
16:59:17     INFO - GECKO(1170) | [Child 1176, Main Thread] WARNING: No active window: file /builds/worker/workspace/build/src/js/xpconnect/src/XPCJSContext.cpp, line 662
16:59:17     INFO - GECKO(1170) | ++DOMWINDOW == 2 (0x10b2ed800) [pid = 1176] [serial = 4] [outer = 0x10b244020]
16:59:18     INFO - checking window state
16:59:18     INFO - TEST-START | toolkit/components/thumbnails/test/browser_thumbnails_bg_no_alert.js
16:59:18     INFO - GECKO(1170) | ++DOMWINDOW == 10 (0x11e6f6000) [pid = 1172] [serial = 16] [outer = 0x127f43020]
16:59:18     INFO - GECKO(1170) | --DOMWINDOW == 0 (0x1232b5c00) [pid = 1181] [serial = 2] [outer = 0x0] [url = about:blank]
[taskcluster 2019-06-03T16:59:20.219Z]    Exit Code: -1
[taskcluster 2019-06-03T16:59:20.219Z]    User Time: 1m37.351634s
[taskcluster 2019-06-03T16:59:20.219Z]  Kernel Time: 29.672282s
[taskcluster 2019-06-03T16:59:20.219Z]    Wall Time: 13m26.359194598s
[taskcluster 2019-06-03T16:59:20.219Z]       Result: FAILED
[taskcluster 2019-06-03T16:59:20.219Z] === Task Finished ===
[taskcluster 2019-06-03T16:59:20.219Z] Task Duration: 13m26.359383164s
[taskcluster 2019-06-03T16:59:20.939Z] Uploading artifact public/logs/localconfig.json from file logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2020-06-02T16:05:36.361Z
[taskcluster 2019-06-03T16:59:21.579Z] Uploading artifact public/test_info/manifests.list from file build/blobber_upload_dir/manifests.list with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2020-06-02T16:05:36.361Z
[taskcluster 2019-06-03T16:59:22.103Z] Uploading artifact public/test_info/mochitest-browser-chrome-chunked_errorsummary.log from file build/blobber_upload_dir/mochitest-browser-chrome-chunked_errorsummary.log with content encoding "gzip", mime type "text/plain" and expiry 2020-06-02T16:05:36.361Z
[taskcluster 2019-06-03T16:59:22.471Z] Uploading artifact public/test_info/mochitest-browser-chrome-chunked_raw.log from file build/blobber_upload_dir/mochitest-browser-chrome-chunked_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2020-06-02T16:05:36.361Z
[taskcluster 2019-06-03T16:59:23.061Z] Uploading artifact public/test_info/system-info.log from file build/blobber_upload_dir/system-info.log with content encoding "gzip", mime type "text/plain" and expiry 2020-06-02T16:05:36.361Z
[taskcluster:error] signal: illegal instruction

The worker has been quarantined and is under investigation.
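When many task logs from one worker need checking, the outcome lines can be pulled out mechanically. The following is a rough sketch that assumes the `[taskcluster ...]` line layout seen in the log above; `summarize`, the regular expressions, and `SAMPLE` are hypothetical and do not come from Taskcluster itself.

```python
import re

# Patterns assume the "[taskcluster <timestamp>]" prefix shown in the log above.
EXIT_RE = re.compile(r"^\[taskcluster [^\]]+\]\s+Exit Code:\s*(-?\d+)\s*$")
RESULT_RE = re.compile(r"^\[taskcluster [^\]]+\]\s+Result:\s*(\w+)\s*$")
ERROR_RE = re.compile(r"^\[taskcluster:error\]\s*(.*)$")


def summarize(log_text):
    """Extract the exit code, result, and error lines from a task log."""
    summary = {"exit_code": None, "result": None, "errors": []}
    for raw in log_text.splitlines():
        line = raw.strip()
        match = EXIT_RE.match(line)
        if match:
            summary["exit_code"] = int(match.group(1))
            continue
        match = RESULT_RE.match(line)
        if match:
            summary["result"] = match.group(1)
            continue
        match = ERROR_RE.match(line)
        if match:
            summary["errors"].append(match.group(1))
    return summary


# Lines copied from the log above.
SAMPLE = """\
[taskcluster 2019-06-03T16:59:20.219Z]    Exit Code: -1
[taskcluster 2019-06-03T16:59:20.219Z]    User Time: 1m37.351634s
[taskcluster 2019-06-03T16:59:20.219Z]       Result: FAILED
[taskcluster 2019-06-03T16:59:20.219Z] === Task Finished ===
[taskcluster:error] signal: illegal instruction
"""

print(summarize(SAMPLE))
```

Grouping such summaries by worker would show whether errors like `signal: illegal instruction` cluster on a single machine, which is the kind of evidence that supports quarantining it.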

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Type: task → defect

The machine seems to be up and running and taking jobs.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-327
We will close the bug for now. If the problem persists, we will reopen this bug.

Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard