Bug 1230942: Tasks failing with multiple claim-expired runs
Status: CLOSED (RESOLVED INVALID)
Opened 9 years ago, closed 8 years ago
Component: Taskcluster :: General (defect)
Tracking: Not tracked
Reporter: armenzg; Assignee: Unassigned
Description (Reporter: armenzg, 9 years ago):
https://tools.taskcluster.net/task-inspector/#XRvaOuGcTvOOpeycJ5yXrw/
https://tools.taskcluster.net/task-inspector/#ACaEAEeRRqWV_1KNryEluA/0

Try to load any of the artifacts of any of the runs. You can also see it on Treeherder:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b06014e0ba00&filter-searchStr=cpp
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b06014e0ba00&filter-searchStr=lucid
Updated 9 years ago
Summary: Jobs with no logs and unclear reason for failures → Tasks failing with multiple claim-expired runs
Comment 1 (Reporter, 9 years ago):
In those pushes we switched to the 'desktop-test' worker type.
Comment 2 (9 years ago):
I'm attempting to search back from 3 days ago, but Papertrail is a little slow at doing so. Do you have any recent runs (within the last day or so) where this happened? Also, if this was an issue with Docker starting up, there is a commit to docker-worker [1] that will land soon-ish to capture errors when a container fails to start.

[1] https://github.com/taskcluster/docker-worker/commit/c4338db6a38884b8a3532f8a905f6aaa235b5587
Comment 3 (9 years ago):
Never mind, I found the error in the logs. For some reason the module we use for checking disk space failed to spawn a process, which crashed the entire worker. A guard is needed around this [1] so that a failure to spawn a process from the disk-space module doesn't bring down the whole worker. That said, the spawn failed with ENOMEM, which probably would have caused other things to fail soon anyway. The worker was also flirting with the edge of having no disk space left (around 36 MB free).

This instance completed a few tasks before crashing. It looks like the tasks running on that instance type may just be a bit much for it. I'm going to try to pull some stats out of InfluxDB to see if we can get worker stats from around 3 days ago, but I'm not sure that's possible at the moment. I'll update the bug if I come up with anything.

Relevant Papertrail log: https://papertrailapp.com/systems/151208103/events?r=609570480520998926-609574311849930754
[1] https://github.com/taskcluster/docker-worker/blob/master/lib/stats/host_metrics.js#L45
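The guard comment 3 asks for could look like the minimal sketch below: wrap the periodic metrics probe so that a thrown spawn error (e.g. ENOMEM) is reported rather than allowed to kill the worker process. The function name `safeProbe` and its shape are hypothetical, not docker-worker's actual API; the real fix would live around the `host_metrics.js` line linked above.

```javascript
// Hypothetical guard sketch (names are illustrative, not docker-worker's API):
// run a metrics probe so that a failure to spawn a child process, such as
// ENOMEM, is handed to an error reporter instead of crashing the worker.
function safeProbe(probe, reportError) {
  try {
    return probe();        // e.g. spawn `df` and parse the free-space column
  } catch (err) {
    reportError(err);      // log the failure for later diagnosis
    return null;           // skip this reading; the worker keeps running
  }
}

// Usage: a disk-space check that may throw under memory pressure.
// const freeBytes = safeProbe(() => checkDiskSpace('/'), (e) => console.error(e));
```

The point of returning `null` rather than rethrowing is that one missed disk-space sample is harmless, while an uncaught exception in the stats loop takes down every task the worker is running.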
Comment 4 (9 years ago):
OK, so I'm guessing that the *.medium instance type is just too small. This seems like reasonable behavior for trying to fit a large peg into a small hole. Releng uses *.medium for its EC2 instances, but that's probably already pretty tight -- loading a desktop environment entails a lot of processes!
Comment 6 (Reporter, 8 years ago):
Let's close it. If it happens again I will file it again.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(armenzg)
Resolution: --- → INVALID