Closed Bug 1474267 (t-yosemite-r7-121) Opened 7 years ago Closed 7 years ago

[MDC2] t-yosemite-r7-121 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rmutter, Unassigned)

References

Details

Attachments

(1 file)

No description provided.
Depends on: 1473358
Summary: t-yosemite-r7-121.test.releng.mdc1.mozilla.com. problem tracking → t-yosemite-r7-121.test.releng.mdc2.mozilla.com. problem tracking
Alias: t-yosemite-r7-121
Depends on: 1474449
I've asked DCOps for a physical netboot/reimage. I tried snmp reboot calls and the machine did not come back up. Before the reboot it responded to ping but did not accept ssh authentication. The last deploystudio email/log I found for this machine was from May 24th (along with a puppet certificate and first-run log).
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-121
https://papertrailapp.com/groups/1223184/events?q=t-yosemite-r7-121
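As an aside, this kind of worker-state check can be scripted instead of done through the worker-explorer page. Below is a minimal sketch, assuming the 2018-era queue.taskcluster.net base URL and the Queue getWorker response fields (quarantineUntil, recentTasks); the URL and field names are assumptions and would need adjusting for the current Taskcluster rootUrl.
```
# Sketch: check a hardware worker's state via the (2018-era) Taskcluster Queue API.
# The endpoint path and response fields (quarantineUntil, recentTasks) are assumptions
# based on the Queue "getWorker" route; adjust the base URL for the current rootUrl.
import requests

QUEUE = "https://queue.taskcluster.net/v1"
PROVISIONER = "releng-hardware"
WORKER_TYPE = "gecko-t-osx-1010"
WORKER_GROUP = "mdc2"
WORKER_ID = "t-yosemite-r7-121"

def check_worker():
    url = (f"{QUEUE}/provisioners/{PROVISIONER}/worker-types/{WORKER_TYPE}"
           f"/workers/{WORKER_GROUP}/{WORKER_ID}")
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    worker = resp.json()
    print("quarantined until:", worker.get("quarantineUntil", "not quarantined"))
    for task in worker.get("recentTasks", []):
        print("recent task:", task.get("taskId"), "run", task.get("runId"))

if __name__ == "__main__":
    check_worker()
```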
A deploystudio log for the reimage came through, the puppet cert was set up, and the puppet run shows a failure:
```
Mon Jul 09 18:09:29 -0700 2018 Puppet (err): Could not find command 'generic-worker'
Mon Jul 09 18:09:29 -0700 2018 /Stage[main]/Generic_worker/Exec[create gpg key]/returns (err): change from notrun to 0 failed: Could not find command 'generic-worker'
Mon Jul 09 18:11:48 -0700 2018 /Stage[main]/Mig::Agent::Daemon/Exec[restart mig] (err): Failed to call refresh: /bin/kill -s 2 $(/usr/local/bin/mig-agent -q=pid); /usr/local/bin/mig-agent returned 1 instead of one of [0]
Mon Jul 09 18:11:48 -0700 2018 /Stage[main]/Mig::Agent::Daemon/Exec[restart mig] (err): /bin/kill -s 2 $(/usr/local/bin/mig-agent -q=pid); /usr/local/bin/mig-agent returned 1 instead of one of [0]
```
I found that puppet completes without problems on this machine, and that the "Could not find command 'generic-worker'" error also appears on other machines that are working properly, so I do not think it caused the problems here. I let this machine run one task after the reimage; it failed with a timeout exception within the task. I manually rebooted it over ssh and removed the quarantine to let it try another task and see whether that also fails.
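For reference, the quarantine change mentioned here can also be made through the Queue's quarantineWorker endpoint. The sketch below uses the taskcluster Python client; the client options, the required scope, the placeholder credentials, and the convention that a quarantineUntil timestamp in the past lifts the quarantine are all assumptions on my part rather than something recorded in this bug.
```
# Sketch: lift a worker's quarantine with the taskcluster Python client.
# Assumes 2018-era client defaults and credentials with the appropriate
# quarantine-worker scope; credentials below are placeholders.
import datetime
import taskcluster

queue = taskcluster.Queue({
    "credentials": {
        "clientId": "...",      # placeholder
        "accessToken": "...",   # placeholder
    },
})

# Assumption: setting quarantineUntil to a time in the past un-quarantines the worker.
past = (datetime.datetime.utcnow()
        - datetime.timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%S.000Z")

queue.quarantineWorker(
    "releng-hardware",      # provisionerId
    "gecko-t-osx-1010",     # workerType
    "mdc2",                 # workerGroup
    "t-yosemite-r7-121",    # workerId
    {"quarantineUntil": past},
)
```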
Summary: t-yosemite-r7-121.test.releng.mdc2.mozilla.com. problem tracking → [MDC2] t-yosemite-r7-121 problem tracking
It seems the machine had problems again, so I went ahead and reimaged it. :zfay, can you please check whether the machine took jobs? Thanks.
Flags: needinfo?(zfay)
This is still a faulty machine and re-imaging will not fix the problem. Its latest tests mostly failed. I'm sure the machine won't take jobs anytime soon since it's in quarantine.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-121
:dhouse, please let us know if you have any updates regarding this worker.
Flags: needinfo?(zfay)
There are tasks in the queue right now, so I have un-quarantined this worker to see if it can take and complete a task without failure: https://papertrailapp.com/groups/1223184/events?q=t-yosemite-r7-121&focus=954512856919994368
```
Jul 13 12:40:21 t-yosemite-r7-121.test.releng.mdc2.mozilla.com generic-worker: 2018/07/13 12:40:21 Running task https://tools.taskcluster.net/task-inspector/#Zmgsp2VlTvSbsJupmbUyLg/0
```
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-121
Python is crashing. I don't know exactly what is happening to cause this or if this is unusual (I expect so):
```
Jul 13 12:44:38 t-yosemite-r7-121 kernel: process python[8289] caught causing excessive wakeups. Observed wakeups rate (per sec): 989; Maximum permitted wakeups rate (per sec): 150; Observation period: 300 seconds; Task lifetime number of wakeups: 45543
Jul 13 12:44:38 t-yosemite-r7-121 com.apple.xpc.launchd: (com.apple.ReportCrash[8311]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com ReportCrash: Invoking spindump for pid=8289 wakeups_rate=989 duration=46 because of excessive wakeups
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com SubmitDiagInfo: Couldn't load config file from on-disk location. Falling back to default location. Reason: Won't serialize in _readDictionaryFromJSONData due to nil object
Jul 13 12:44:38 t-yosemite-r7-121 kernel: CODE SIGNING: cs_invalid_page(0x105dd2000): p=8312[spindump] final status 0x2000000, allowing (remove VALID) page
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com spindump: Error grabbing microstackshots: 53
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com spindump: No microstackshots found
```
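If the spindump report written for this wakeups event needed to be inspected, something like the sketch below could pull the most recent diagnostic reports off the host. /Library/Logs/DiagnosticReports is the standard macOS location for system reports; whether this particular spindump lands there, and the ssh access pattern, are assumptions.
```
# Sketch: list recent diagnostic reports on the worker after the wakeups/spindump event.
# Assumptions: passwordless ssh to the host and that the report is written to
# /Library/Logs/DiagnosticReports (the usual macOS system-report location).
import subprocess

HOST = "t-yosemite-r7-121.test.releng.mdc2.mozilla.com"

def list_recent_reports():
    # Most recently modified reports first, so the new spindump (if written) shows up at the top.
    cmd = ["ssh", HOST, "ls -lt /Library/Logs/DiagnosticReports | head -n 20"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(out.stdout)

if __name__ == "__main__":
    list_recent_reports()
```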
(In reply to Dave House [:dhouse] from comment #7)
> Python is crashing. I don't know exactly what is happening to cause this or
> if this is unusual (I expect so):

Maybe that was part of the test. I vnc'd to the machine and watched it take and run another task. It looks normal as it fires off Firefox NightlyDebug and performs actions. With the Activity viewer up, the load looks normal. I'll wait to see if it fails this one I'm watching.
Assignee: nobody → dhouse
The task I watched completed and the worker rebooted. On the next boot, the machine did not start the worker. I rebooted it, and on the following boot it started the worker and took a job.
The next task I watched also completed and the machine rebooted. Again, on that next boot the worker did not start. I've left this worker un-quarantined to see if it recovers by some miracle, or until I return to checking it on Monday and can reboot it to watch the next task.
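To catch the "worker did not start" state right after a boot, a quick look at launchd and the system log over ssh would look roughly like the sketch below. The generic-worker job label used in the grep is a guess; the actual plist/label on these workers isn't recorded in this bug.
```
# Sketch: after a reboot, check whether launchd has a generic-worker job loaded
# and grab the tail of the system log. The grep pattern is a guess at the job
# label; adjust to the real plist name used on these workers.
import subprocess

HOST = "t-yosemite-r7-121.test.releng.mdc2.mozilla.com"

def check_worker_started():
    check = "launchctl list | grep -i generic-worker || echo 'generic-worker job not loaded'"
    out = subprocess.run(["ssh", HOST, check], capture_output=True, text=True)
    print(out.stdout.strip())
    # Also look at recent syslog lines for the worker starting (or failing to).
    log = subprocess.run(["ssh", HOST, "tail -n 50 /var/log/system.log"],
                         capture_output=True, text=True)
    print(log.stdout)

if __name__ == "__main__":
    check_worker_started()
```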
Ouch, I saw the machine not taking jobs in TC and that it wasn't quarantined, so I went ahead with a re-image. In any case, I'm waiting for the reimage to finish and for the machine to start taking jobs.
(In reply to Danut Labici [:dlabici] from comment #12)
> Ouch, I saw the machine not taking jobs in TC and that it wasn't
> quarantined, so I went ahead with a re-image. In any case, I'm waiting for
> the reimage to finish and for the machine to start taking jobs.

No worries. The behavior repeats on this machine after reimages, so I'll keep researching and testing on it.
Well, now that I've said it repeats, it is working fine this morning. So I'll look to see whether another machine has the problem (or whether it reappears later on this host).
Assignee: dhouse → nobody
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard