Closed Bug 1474267 (t-yosemite-r7-121) Opened 7 years ago Closed 7 years ago

[MDC2] t-yosemite-r7-121 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rmutter, Unassigned)

References

Details

Attachments

(1 file)

No description provided.
Depends on: 1473358
Summary: t-yosemite-r7-121.test.releng.mdc1.mozilla.com. problem tracking → t-yosemite-r7-121.test.releng.mdc2.mozilla.com. problem tracking
Alias: t-yosemite-r7-121
Depends on: 1474449
I've asked DCOps for a physical netboot/reimage. I tried snmp reboot calls and the machine did not come back up. Before the reboot it responded to ping but did not accept ssh authentication. The last deploystudio email/log I found for this machine was from May 24th (along with a puppet certificate and first-run log).
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-121
https://papertrailapp.com/groups/1223184/events?q=t-yosemite-r7-121
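As an aside, this kind of worker-state check can be scripted instead of done through the worker-explorer page. Below is a minimal sketch, assuming the 2018-era queue.taskcluster.net base URL and the Queue getWorker response fields (quarantineUntil, recentTasks); the URL and field names are assumptions and would need adjusting for the current Taskcluster rootUrl.
```
# Sketch: check a hardware worker's state via the (2018-era) Taskcluster Queue API.
# The endpoint path and response fields (quarantineUntil, recentTasks) are assumptions
# based on the Queue "getWorker" route; adjust the base URL for the current rootUrl.
import requests

QUEUE = "https://queue.taskcluster.net/v1"
PROVISIONER = "releng-hardware"
WORKER_TYPE = "gecko-t-osx-1010"
WORKER_GROUP = "mdc2"
WORKER_ID = "t-yosemite-r7-121"

def check_worker():
    url = (f"{QUEUE}/provisioners/{PROVISIONER}/worker-types/{WORKER_TYPE}"
           f"/workers/{WORKER_GROUP}/{WORKER_ID}")
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    worker = resp.json()
    print("quarantined until:", worker.get("quarantineUntil", "not quarantined"))
    for task in worker.get("recentTasks", []):
        print("recent task:", task.get("taskId"), "run", task.get("runId"))

if __name__ == "__main__":
    check_worker()
```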
A deploystudio log for the reimage came through, the puppet cert was set up, and the puppet run shows a failure:
```
Mon Jul 09 18:09:29 -0700 2018 Puppet (err): Could not find command 'generic-worker'
Mon Jul 09 18:09:29 -0700 2018 /Stage[main]/Generic_worker/Exec[create gpg key]/returns (err): change from notrun to 0 failed: Could not find command 'generic-worker'
Mon Jul 09 18:11:48 -0700 2018 /Stage[main]/Mig::Agent::Daemon/Exec[restart mig] (err): Failed to call refresh: /bin/kill -s 2 $(/usr/local/bin/mig-agent -q=pid); /usr/local/bin/mig-agent returned 1 instead of one of [0]
Mon Jul 09 18:11:48 -0700 2018 /Stage[main]/Mig::Agent::Daemon/Exec[restart mig] (err): /bin/kill -s 2 $(/usr/local/bin/mig-agent -q=pid); /usr/local/bin/mig-agent returned 1 instead of one of [0]
```
I found that puppet completes without problems on this machine, and that the "Could not find command 'generic-worker'" error also appears on other machines that are working properly, so I do not think it caused the problems here. I let this machine run one task after the reimage; it failed with a timeout exception within the task. I manually rebooted it over ssh and removed the quarantine to let it try another task and see whether that also fails.
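For reference, the quarantine change mentioned here can also be made through the Queue's quarantineWorker endpoint. The sketch below uses the taskcluster Python client; the client options, the required scope, the placeholder credentials, and the convention that a quarantineUntil timestamp in the past lifts the quarantine are all assumptions on my part rather than something recorded in this bug.
```
# Sketch: lift a worker's quarantine with the taskcluster Python client.
# Assumes 2018-era client defaults and credentials with the appropriate
# quarantine-worker scope; credentials below are placeholders.
import datetime
import taskcluster

queue = taskcluster.Queue({
    "credentials": {
        "clientId": "...",      # placeholder
        "accessToken": "...",   # placeholder
    },
})

# Assumption: setting quarantineUntil to a time in the past un-quarantines the worker.
past = (datetime.datetime.utcnow()
        - datetime.timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%S.000Z")

queue.quarantineWorker(
    "releng-hardware",      # provisionerId
    "gecko-t-osx-1010",     # workerType
    "mdc2",                 # workerGroup
    "t-yosemite-r7-121",    # workerId
    {"quarantineUntil": past},
)
```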
Summary: t-yosemite-r7-121.test.releng.mdc2.mozilla.com. problem tracking → [MDC2] t-yosemite-r7-121 problem tracking
It seems the machine had problems again, so I went ahead and reimaged it. :zfay, can you please check whether the machine took jobs? Thanks.
Flags: needinfo?(zfay)
This is still a faulty machine and re-imaging will not fix the problem. Its latest tests mostly failed. I'm sure the machine won't take jobs anytime soon since it's in quarantine.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-121
:dhouse, please let us know if you have any updates regarding this worker.
Flags: needinfo?(zfay)
There are tasks in the queue right now, so I have un-quarantined this worker to see if it can take and complete a task without failure: https://papertrailapp.com/groups/1223184/events?q=t-yosemite-r7-121&focus=954512856919994368
```
Jul 13 12:40:21 t-yosemite-r7-121.test.releng.mdc2.mozilla.com generic-worker: 2018/07/13 12:40:21 Running task https://tools.taskcluster.net/task-inspector/#Zmgsp2VlTvSbsJupmbUyLg/0
```
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-121
Python is crashing. I don't know exactly what is happening to cause this or if this is unusual (I expect so):
```
Jul 13 12:44:38 t-yosemite-r7-121 kernel: process python[8289] caught causing excessive wakeups. Observed wakeups rate (per sec): 989; Maximum permitted wakeups rate (per sec): 150; Observation period: 300 seconds; Task lifetime number of wakeups: 45543
Jul 13 12:44:38 t-yosemite-r7-121 com.apple.xpc.launchd: (com.apple.ReportCrash[8311]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com ReportCrash: Invoking spindump for pid=8289 wakeups_rate=989 duration=46 because of excessive wakeups
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com SubmitDiagInfo: Couldn't load config file from on-disk location. Falling back to default location. Reason: Won't serialize in _readDictionaryFromJSONData due to nil object
Jul 13 12:44:38 t-yosemite-r7-121 kernel: CODE SIGNING: cs_invalid_page(0x105dd2000): p=8312[spindump] final status 0x2000000, allowing (remove VALID) page
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com spindump: Error grabbing microstackshots: 53
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com spindump: No microstackshots found
```
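If the spindump report written for this wakeups event needed to be inspected, something like the sketch below could pull the most recent diagnostic reports off the host. /Library/Logs/DiagnosticReports is the standard macOS location for system reports; whether this particular spindump lands there, and the ssh access pattern, are assumptions.
```
# Sketch: list recent diagnostic reports on the worker after the wakeups/spindump event.
# Assumptions: passwordless ssh to the host and that the report is written to
# /Library/Logs/DiagnosticReports (the usual macOS system-report location).
import subprocess

HOST = "t-yosemite-r7-121.test.releng.mdc2.mozilla.com"

def list_recent_reports():
    # Most recently modified reports first, so the new spindump (if written) shows up at the top.
    cmd = ["ssh", HOST, "ls -lt /Library/Logs/DiagnosticReports | head -n 20"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(out.stdout)

if __name__ == "__main__":
    list_recent_reports()
```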
(In reply to Dave House [:dhouse] from comment #7)
> Python is crashing. I don't know exactly what is happening to cause this or
> if this is unusual (I expect so):

Maybe that was part of the test. I vnc'd to the machine and watched it take and run another task. It looks normal as it fires off Firefox NightlyDebug and performs actions. With the Activity viewer up, the load looks normal. I'll wait to see if it fails this one I'm watching.
Assignee: nobody → dhouse
The task I watched completed and the worker rebooted. On the next boot, the machine did not start the worker. I rebooted it, and on the following boot it started the worker and took a job.
The next task I watched also completed and the machine rebooted. Again, on that next boot the worker did not start. I've left this worker un-quarantined to see if it recovers by some miracle, or until I return to checking it on Monday and can reboot it to watch the next task.
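To catch the "worker did not start" state right after a boot, a quick look at launchd and the system log over ssh would look roughly like the sketch below. The generic-worker job label used in the grep is a guess; the actual plist/label on these workers isn't recorded in this bug.
```
# Sketch: after a reboot, check whether launchd has a generic-worker job loaded
# and grab the tail of the system log. The grep pattern is a guess at the job
# label; adjust to the real plist name used on these workers.
import subprocess

HOST = "t-yosemite-r7-121.test.releng.mdc2.mozilla.com"

def check_worker_started():
    check = "launchctl list | grep -i generic-worker || echo 'generic-worker job not loaded'"
    out = subprocess.run(["ssh", HOST, check], capture_output=True, text=True)
    print(out.stdout.strip())
    # Also look at recent syslog lines for the worker starting (or failing to).
    log = subprocess.run(["ssh", HOST, "tail -n 50 /var/log/system.log"],
                         capture_output=True, text=True)
    print(log.stdout)

if __name__ == "__main__":
    check_worker_started()
```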
Ouch, I saw the machine not taking jobs in TC and that it wasn't quarantined, so I went ahead with a re-image. In any case, I'm waiting for the reimage to finish and for the machine to start taking jobs.
(In reply to Danut Labici [:dlabici] from comment #12)
> Ouch, I saw the machine not taking jobs in TC and that it wasn't
> quarantined, so I went ahead with a re-image. In any case, I'm waiting for
> the reimage to finish and for the machine to start taking jobs.

No worries. The behavior repeats on this machine after reimages, so I'll keep researching and testing on it.
Well, now that I've said it repeats, it is working fine this morning. So I'll look to see whether another machine has the problem (or whether it reappears later on this host).
Assignee: dhouse → nobody
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard