Closed Bug 1479065 (t-yosemite-r7-350) Opened 7 years ago Closed 7 years ago

[MDC1] t-yosemite-r7-350 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: riman, Unassigned)

I tried to connect to this machine via SSH, but it returned this:
Stdio forwarding request failed: Session open refused by peer
ssh_exchange_identification: Connection closed by remote host
Last job was an exception (the exit code 69 reported in bug 1478525), but that did not cause generic-worker to stop running. It was running normally, and then shut down for "maintenance". Something caused it to power down.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-350
```
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:02 Querying queue to get latest status for task agIZgairQDa_6RGnTvDh_w...
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:02 Latest status: Errored
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:02 Resolving task agIZgairQDa_6RGnTvDh_w ...
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:02 Not updating status of task agIZgairQDa_6RGnTvDh_w run 0 from Errored to Failed. This is because you can only update to status Failed if the previous status was one of: [Claimed Reclaimed Aborted]
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:02 Saving file file-caches.json (absolute path: /Users/cltbld/file-caches.json)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:02 Saving file directory-caches.json (absolute path: /Users/cltbld/directory-caches.json)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:02 goroutine 1 [running]:
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: runtime/debug.Stack(0x421b16d00, 0x142081c, 0x155bae3)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/.gimme/versions/go1.10.2.src/src/runtime/debug/stack.go:24 +0xa7
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: main.HandleCrash(0x14aff80, 0x421f52080)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:570 +0x26
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: main.RunWorker.func1(0x421b17df0)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:589 +0x52
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: panic(0x14aff80, 0x421f52080)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/.gimme/versions/go1.10.2.src/src/runtime/panic.go:502 +0x229
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: main.(*TaskRun).Run.func1(0x421dde028, 0x4219e4a00)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1086 +0xc5
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: panic(0x14aff80, 0x421f52080)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/.gimme/versions/go1.10.2.src/src/runtime/panic.go:502 +0x229
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: panic(0x14aff80, 0x421fa4050)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/.gimme/versions/go1.10.2.src/src/runtime/panic.go:502 +0x229
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: main.(*TaskRun).uploadArtifact(0x4219e4a00, 0x1604580, 0x421e940f0, 0x0)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/artifacts.go:467 +0x1057
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: main.(*TaskRun).uploadLog(0x4219e4a00, 0x155e471, 0x1c, 0x421d24fc0, 0x1f, 0x1)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/artifacts.go:411 +0x12f
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: main.(*TaskRun).Run.func2(0x421dde028, 0x4219e4a00, 0x421dde030)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1101 +0xe6
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: panic(0x14aff80, 0x421f52080)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/.gimme/versions/go1.10.2.src/src/runtime/panic.go:502 +0x229
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: main.(*TaskRun).uploadArtifact(0x4219e4a00, 0x1604580, 0x421f85bc0, 0x1c)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/artifacts.go:467 +0x1057
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: main.(*TaskRun).Run.func4(0x4219e4a00, 0x421dde028)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1194 +0x23c
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: main.(*TaskRun).Run(0x4219e4a00, 0x421af80a0)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1255 +0x17cb
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: main.RunWorker(0x0)
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:671 +0xd4f
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: main.main()
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:400 +0x608
Jul 25 17:36:02 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:02 *********** PANIC occurred! ***********
Jul 25 17:36:03 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:02 WORKER EXCEPTION due to response code 401 from Queue when uploading artifact &main.S3Artifact{BaseArtifact:(*main.BaseArtifact)(0x421ddc400), Path:"logs/localconfig.json", ContentEncoding:""} with CreateArtifact payload {"contentType":"application/json","expires":"2019-07-25T19:59:05.115Z","storageType":"s3"}
Jul 25 17:36:04 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:04 Exiting worker with exit code 69
...
Jul 25 17:36:05 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:05 Removing task directory '/Users/cltbld/tasks/task_1532549048'...
...
Jul 25 17:36:10 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:10 All features initialised.
Jul 25 17:36:10 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 17:36:10 Created dir: /Users/cltbld/tasks/task_1532565370/generic-worker
...
Jul 25 20:59:48 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/25 20:59:48 Disk available: 234181001216 bytes
Jul 25 20:59:52 t-yosemite-r7-350.test.releng.mdc1.mozilla.com powerd: PID 50(powerd) TimedOut InternalPreventSleep "com.apple.powermanagement.acwakelinger" 00:00:45 id:0xd0000013d [System: SRPrevSleep kCPU]
Jul 25 20:59:52 t-yosemite-r7-350.test.releng.mdc1.mozilla.com powerd: Summary- [System: No Assertions] Using AC
Jul 25 20:59:52 t-yosemite-r7-350.test.releng.mdc1.mozilla.com configd: store_notifier: changedKeys <array> {
Jul 25 20:59:52 t-yosemite-r7-350.test.releng.mdc1.mozilla.com configd: 0 : State:/IOKit/SystemPowerCapabilities
Jul 25 20:59:52 t-yosemite-r7-350.test.releng.mdc1.mozilla.com configd: }
Jul 25 20:59:52 t-yosemite-r7-350.test.releng.mdc1.mozilla.com configd: store_notifier: powerkey 0
Jul 25 20:59:52 t-yosemite-r7-350.test.releng.mdc1.mozilla.com powerd: Entering Sleep state due to 'Maintenance Sleep': Using AC TCPKeepAlive=inactive
Jul 25 20:59:52 t-yosemite-r7-350.test.releng.mdc1.mozilla.com configd: SCNC Controller: pm_ConnectionHandler capabilities = 0x0, sleeping = 0 and DarkWake = 1.
Jul 25 20:59:52 t-yosemite-r7-350.test.releng.mdc1.mozilla.com configd: SCNC Controller: pm_ConnectionHandler going to sleep, delay = 0.
Jul 25 20:59:52 t-yosemite-r7-350.test.releng.mdc1.mozilla.com airportd: _configureScanOffloadParameters: Unable to configure scan offloading on en1 (Device power is off)
```
papertrail logs "last seen 1 day ago"
https://papertrailapp.com/groups/1223184?filter=t-yosemite-r7-350
no ping
no ssh (times out)
```
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ssh root@t-yosemite-r7-350.test.releng.mdc1.mozilla.com
ssh: connect to host t-yosemite-r7-350.test.releng.mdc1.mozilla.com port 22: Connection timed out
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ping t-yosemite-r7-350.test.releng.mdc1.mozilla.com
PING t-yosemite-r7-350.test.releng.mdc1.mozilla.com (10.49.56.134) 56(84) bytes of data.
^C
--- t-yosemite-r7-350.test.releng.mdc1.mozilla.com ping statistics ---
233 packets transmitted, 0 received, 100% packet loss, time 232765ms
```
pdu shows power on, but I think the mini may be shut down while the outlet is still on
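For anyone triaging a similar worker, the two events that matter in the log above (the generic-worker exit code and the powerd sleep entry) can be pulled out of a syslog dump mechanically. This is only a rough sketch: it assumes one log entry per line in the `Mon DD HH:MM:SS host process: message` layout shown above, and the function name `key_events` and the event labels are made up for illustration, not part of generic-worker or papertrail.

```python
import re

# Syslog prefix as seen in this bug: "Jul 25 17:36:04 <host> <process>: <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<proc>[^:\s]+): (?P<msg>.*)$"
)
# Message patterns copied from the log lines quoted in this bug.
EXIT_RE = re.compile(r"Exiting worker with exit code (\d+)")
SLEEP_RE = re.compile(r"Entering Sleep state due to '([^']+)'")


def key_events(lines):
    """Return (timestamp, kind, detail) tuples for worker exits and sleep entries.

    The kinds "worker-exit" and "sleep" are hypothetical labels for this sketch.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # elision markers ("...") and malformed lines are skipped
        message = match["msg"]
        if match["proc"] == "generic-worker":
            hit = EXIT_RE.search(message)
            if hit:
                events.append((match["stamp"], "worker-exit", hit.group(1)))
        elif match["proc"] == "powerd":
            hit = SLEEP_RE.search(message)
            if hit:
                events.append((match["stamp"], "sleep", hit.group(1)))
    return events
```

Run against the excerpt above, this would report the exit with code 69 at 17:36:04 and the 'Maintenance Sleep' entry at 20:59:52, more than three hours apart, which matches the observation that the exit itself did not stop the worker.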
I rebooted through roller. Minimal logs appeared in papertrail, but there is still no response to ping or ssh.
```
Jul 27 12:18:19 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/27 12:18:22 Error: Post https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-osx-1010: read tcp 10.49.56.134:52016->184.72.216.59:443: read: operation timed out
Jul 27 12:18:20 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/27 12:18:24 No task claimed. Idle for 42h42m13.995790122s (will exit if no task claimed in 53h17m46.004209878s). 1 more tasks to run before exiting.
Jul 27 12:18:25 t-yosemite-r7-350.test.releng.mdc1.mozilla.com generic-worker: 2018/07/27 12:18:29 Disk available: 234206502912 bytes
```
```
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ssh root@t-yosemite-r7-350.test.releng.mdc1.mozilla.com
^C
[dhouse@rejh2.srv.releng.mdc1.mozilla.com ~]$ ping t-yosemite-r7-350.test.releng.mdc1.mozilla.com
PING t-yosemite-r7-350.test.releng.mdc1.mozilla.com (10.49.56.134) 56(84) bytes of data.
^C
--- t-yosemite-r7-350.test.releng.mdc1.mozilla.com ping statistics ---
63 packets transmitted, 0 received, 100% packet loss, time 62194ms
```
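The "No task claimed" line is worth a quick arithmetic check. The sketch below parses the two Go-style durations from that line, adds them, and also counts the idle time back from the worker's own timestamp (12:18:24 on Jul 27). The helper name `parse_go_duration` is hypothetical, it only handles the `h`/`m`/`s` units that appear in this log, and reading the sum as a configured idle timeout is an inference from this one line rather than a confirmed setting.

```python
import re
from datetime import datetime, timedelta
from decimal import Decimal

_UNIT_SECONDS = {"h": Decimal(3600), "m": Decimal(60), "s": Decimal(1)}
_PART_RE = re.compile(r"(\d+(?:\.\d+)?)([hms])")


def parse_go_duration(text):
    """Convert a duration such as '42h42m13.995790122s' to seconds (Decimal).

    Only the h/m/s units seen in this log are supported; anything else
    (for example 'ms' or 'ns') raises ValueError.
    """
    parts = _PART_RE.findall(text)
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(Decimal(number) * _UNIT_SECONDS[unit] for number, unit in parts)


# Values copied from the log line quoted above.
idle = parse_go_duration("42h42m13.995790122s")
remaining = parse_go_duration("53h17m46.004209878s")

# Idle time plus time remaining, expressed in hours.
total_hours = (idle + remaining) / Decimal(3600)

# Worker-side timestamp of the log line, minus the idle time.
idle_since = datetime(2018, 7, 27, 12, 18, 24) - timedelta(seconds=float(idle))

if __name__ == "__main__":
    print(f"idle + remaining = {total_hours} hours")
    print(f"idle since about {idle_since:%Y-%m-%d %H:%M:%S}")
```

The two durations add up to exactly 96 hours, and counting back lands at about 17:36:10 on Jul 25, the same second as the "All features initialised" entry in the earlier log. That would be consistent with the same generic-worker process still running after the roller reboot, though the logs here do not settle that on their own.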
I manually powered off the machine (pdu power off), waited a few seconds, and then powered it back on. It briefly responded to ping and ssh (which prompted for a password), but then stopped responding to both. The same logs appeared in papertrail and then stopped.
After another reboot, ping and ssh stayed up and the machine looks normal. I've removed the quarantine to see if it has any trouble running tasks.
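Since the machine came up briefly after the first power cycle and then dropped off again, a single successful ping or ssh is not enough to call it healthy. A minimal sketch of a repeated check is below; it only tests that the ssh port accepts TCP connections (not that login works), and the function names, check count, and interval are arbitrary choices for illustration.

```python
import socket
import time


def tcp_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def stays_reachable(host, port=22, checks=5, interval=10.0, timeout=3.0):
    """Probe the port `checks` times, `interval` seconds apart.

    Returns False on the first failed probe, so a host that answers
    briefly and then drops off is reported as not staying up.
    """
    for attempt in range(checks):
        if not tcp_reachable(host, port, timeout):
            return False
        if attempt < checks - 1:
            time.sleep(interval)
    return True
```

For example, `stays_reachable("t-yosemite-r7-350.test.releng.mdc1.mozilla.com", checks=30, interval=10.0)` would watch the ssh port for roughly five minutes before reporting the host as stable.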
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Depends on: 1490458
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard