Closed
Bug 1474267
(t-yosemite-r7-121)
Opened 7 years ago
Closed 7 years ago
[MDC2] t-yosemite-r7-121 problem tracking
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rmutter, Unassigned)
Attachments (1 file): 8.60 KB, text/plain. No description provided.
Reporter
Updated•7 years ago
Summary: t-yosemite-r7-121.test.releng.mdc1.mozilla.com. problem tracking → t-yosemite-r7-121.test.releng.mdc2.mozilla.com. problem tracking
I've asked DCOps for a physical netboot/reimage. Before I tried a reboot, the machine responded to ping but did not accept auth for ssh. I then tried snmp calls to reboot it, and the machine did not come back up.
The last deploystudio email/log I found for this machine was from May 24th (along with a puppet certificate and first-run log).
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-121
https://papertrailapp.com/groups/1223184/events?q=t-yosemite-r7-121
A deploystudio log for the reimage came through and the puppet cert was set up, but the puppet run shows a failure:
```
Mon Jul 09 18:09:29 -0700 2018 Puppet (err): Could not find command 'generic-worker'
Mon Jul 09 18:09:29 -0700 2018 /Stage[main]/Generic_worker/Exec[create gpg key]/returns (err): change from notrun to 0 failed: Could not find command 'generic-worker'
Mon Jul 09 18:11:48 -0700 2018 /Stage[main]/Mig::Agent::Daemon/Exec[restart mig] (err): Failed to call refresh: /bin/kill -s 2 $(/usr/local/bin/mig-agent -q=pid); /usr/local/bin/mig-agent returned 1 instead of one of [0]
Mon Jul 09 18:11:48 -0700 2018 /Stage[main]/Mig::Agent::Daemon/Exec[restart mig] (err): /bin/kill -s 2 $(/usr/local/bin/mig-agent -q=pid); /usr/local/bin/mig-agent returned 1 instead of one of [0]
```
I found that puppet completes without problems on this machine, and that the "Could not find command 'generic-worker'" appears on other machines that are working properly. So I do not think that caused the problems on this machine.
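The comparison described above (checking whether an error also shows up on machines that work) can be sketched as a small script. This is only a minimal sketch, assuming the puppet log line layout shown in the excerpt above; the function names `puppet_errors` and `errors_unique_to` are hypothetical helpers, not part of puppet or any existing tooling.

```python
import re

# Assumed layout, taken from the excerpt above:
# "<weekday> <month> <day> <time> <utc offset> <year> <source> (err): <message>"
ERR_LINE = re.compile(
    r"^\w{3} \w{3} \d{2} [\d:]{8} [+-]\d{4} \d{4} "
    r"(?P<source>.+?) \(err\): (?P<message>.*)$"
)


def puppet_errors(log_text):
    """Return the set of (source, message) pairs for every (err) line."""
    errors = set()
    for line in log_text.splitlines():
        match = ERR_LINE.match(line.strip())
        if match:
            errors.add((match.group("source"), match.group("message")))
    return errors


def errors_unique_to(suspect_log, healthy_logs):
    """Errors seen on the suspect host but on none of the healthy hosts.

    Timestamps are ignored, so the same error logged at different times
    on different hosts still counts as the same error.
    """
    seen_elsewhere = set()
    for log_text in healthy_logs:
        seen_elsewhere |= puppet_errors(log_text)
    return puppet_errors(suspect_log) - seen_elsewhere
```

Calling `errors_unique_to(log_from_this_machine, [log_from_a_working_machine])` returns an empty set when every error on this machine also occurs on a working one, which is the situation described for the 'generic-worker' message.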
I let this machine run one task after the reimage. It failed with a timeout exception within the task. I manually rebooted it over ssh and removed the quarantine so that it can try another task, to see whether that one also has problems.
Reporter
Updated•7 years ago
Summary: t-yosemite-r7-121.test.releng.mdc2.mozilla.com. problem tracking → [MDC2] t-yosemite-r7-121 problem tracking
Comment 4•7 years ago
Seems that the machine had problems again.
I went ahead and reimaged it.
:zfay can you please check if the machine took jobs?
thanks.
Flags: needinfo?(zfay)
Comment 5•7 years ago
This is still a faulty machine and re-imaging will not fix the problem. Its latest tests have mostly failed.
I'm sure the machine won't take jobs anytime soon since it's in quarantine.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-121
:dhouse please let us know if you have any updates regarding this worker
Flags: needinfo?(zfay)
There are tasks in the queue right now, so I have un-quarantined this worker to see if it can take and complete a task without failure:
https://papertrailapp.com/groups/1223184/events?q=t-yosemite-r7-121&focus=954512856919994368
```
Jul 13 12:40:21 t-yosemite-r7-121.test.releng.mdc2.mozilla.com generic-worker: 2018/07/13 12:40:21 Running task https://tools.taskcluster.net/task-inspector/#Zmgsp2VlTvSbsJupmbUyLg/0
```
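To check from a log export whether the worker actually picked up a task after a boot, the "Running task" line above can be matched mechanically. The following is a rough sketch, assuming the generic-worker log line format shown in the excerpt; `tasks_started` is a hypothetical helper name, and the regular expression is only fitted to that one sample line.

```python
import re

# Fitted to the sample line above: the task ID and run number follow the
# "#" at the end of the task-inspector URL, separated by "/".
RUNNING_TASK = re.compile(
    r"generic-worker: .*Running task \S*#(?P<task_id>[A-Za-z0-9_-]+)/(?P<run_id>\d+)"
)


def tasks_started(log_lines):
    """Return (task_id, run_id) for each 'Running task' line found."""
    started = []
    for line in log_lines:
        match = RUNNING_TASK.search(line)
        if match:
            started.append((match.group("task_id"), int(match.group("run_id"))))
    return started
```

An empty result for the lines logged after a reboot would suggest that the worker did not start or did not claim a task during that boot.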
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc2/t-yosemite-r7-121
Python is crashing. I don't know exactly what is happening to cause this or if this is unusual (I expect so):
```
Jul 13 12:44:38 t-yosemite-r7-121 kernel: process python[8289] caught causing excessive wakeups. Observed wakeups rate (per sec): 989; Maximum permitted wakeups rate (per sec): 150; Observation period: 300 seconds; Task lifetime number of wakeups: 45543
Jul 13 12:44:38 t-yosemite-r7-121 com.apple.xpc.launchd: (com.apple.ReportCrash[8311]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com ReportCrash: Invoking spindump for pid=8289 wakeups_rate=989 duration=46 because of excessive wakeups
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com SubmitDiagInfo: Couldn't load config file from on-disk location. Falling back to default location. Reason: Won't serialize in _readDictionaryFromJSONData due to nil object
Jul 13 12:44:38 t-yosemite-r7-121 kernel: CODE SIGNING: cs_invalid_page(0x105dd2000): p=8312[spindump] final status 0x2000000, allowing (remove VALID) page
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com spindump: Error grabbing microstackshots: 53
Jul 13 12:44:38 t-yosemite-r7-121.test.releng.mdc2.mozilla.com spindump: No microstackshots found
```
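The kernel line in the excerpt above carries the observed and permitted wakeup rates, which are easier to compare across tasks or machines once extracted. Here is a minimal sketch, assuming the message wording stays exactly as in that line; `parse_wakeups` is a hypothetical helper, not an existing tool.

```python
import re

# Fitted to the kernel message shown above.
WAKEUPS = re.compile(
    r"process (?P<name>\S+)\[(?P<pid>\d+)\] caught causing excessive wakeups\. "
    r"Observed wakeups rate \(per sec\): (?P<observed>\d+); "
    r"Maximum permitted wakeups rate \(per sec\): (?P<limit>\d+)"
)


def parse_wakeups(line):
    """Extract process name, pid, and wakeup rates from a kernel log line.

    Returns None when the line is not an excessive-wakeups report.
    """
    match = WAKEUPS.search(line)
    if match is None:
        return None
    observed = int(match.group("observed"))
    limit = int(match.group("limit"))
    return {
        "process": match.group("name"),
        "pid": int(match.group("pid")),
        "observed": observed,
        "limit": limit,
        # How many times over the permitted rate the process was observed.
        "ratio": observed / limit,
    }
```

Running this over the same log search for several workers would show whether the report is specific to this machine or also appears on workers that complete their tasks.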
(In reply to Dave House [:dhouse] from comment #7)
> Python is crashing. I don't know exactly what is happening to cause this or
Maybe that was part of the test. I vnc'd to the machine and watched it take and run another task. It looks normal as it is firing off Firefox NightlyDebug and performing actions. With Activity Monitor up, the load looks normal.
I'll wait to see if it fails this one I'm watching.
The task I watched completed and the worker rebooted. On the next boot, the machine did not start the worker. I rebooted it and it started the worker and took a job on the next boot.
Comment 10•7 years ago
Comment 11•7 years ago
That next task that I watched also completed and the machine rebooted. Again at that next boot, the worker did not start.
I've left this worker un-quarantined to see if it recovers by some miracle, or for when I return to checking it on Monday and can reboot it to watch the next task.
Comment 12•7 years ago
Ouch, I saw the machine not taking jobs in TC and that it wasn't quarantined, so I went ahead with a re-image.
In any case, waiting for the reimage to process and machine to start taking jobs.
Comment 13•7 years ago
(In reply to Danut Labici [:dlabici] from comment #12)
> Ouch, I saw the machine not taking jobs in TC and that it wasn't
> quarantined, so I went ahead with a re-image.
> In any case, waiting for the reimage to process and machine to start taking
> jobs.
No worries. The behavior repeats on this machine after reimages. So I'll keep researching and testing on it.
Comment 14•7 years ago
Well, now that I said it repeats, it is working fine this morning. So, I'll look to see if another machine has the problem (or if it reappears later on this host).
Assignee: dhouse → nobody
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard