Closed Bug 1230942 Opened 9 years ago Closed 8 years ago

Tasks failing with multiple claim-expired runs

Categories

(Taskcluster :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: armenzg, Unassigned)

Details

Summary: Jobs with no logs and unclear reason for failures → Tasks failing with multiple claim-expired runs
In those pushes we switched to the 'desktop-test' worker type.
I'm trying to search back from 3 days ago, but Papertrail is a little slow at that. Do you have any recent runs (say, within the last day) where this happened?

Also, if this was an issue with Docker starting up, there is a docker-worker commit [1] that will land soon-ish and tries to capture errors when a container fails to start.

[1] https://github.com/taskcluster/docker-worker/commit/c4338db6a38884b8a3532f8a905f6aaa235b5587
Never mind, I found the error in the logs. For some reason the module we use for checking disk space failed to spawn a process, and that crashed the entire worker. It looks like a guard is needed around this [1] so that a failed spawn from the disk space module doesn't take down the whole worker. However, the spawn failed with ENOMEM, which probably would have caused other things to fail soon anyway. The worker was also close to running out of disk space (around 36 MB free).

This instance completed a few tasks before crashing. It looks like the tasks running on that instance type may simply be too heavy for it. I'm going to try to pull worker stats from around 3 days ago out of Influx, but I'm not sure that's possible at the moment. I'll update the bug if I come up with anything.

Relevant papertrail log link: 
https://papertrailapp.com/systems/151208103/events?r=609570480520998926-609574311849930754

[1] https://github.com/taskcluster/docker-worker/blob/master/lib/stats/host_metrics.js#L45
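
For illustration, here is a minimal sketch of the kind of guard suggested above. It is written in Python rather than docker-worker's JavaScript and is not the actual host_metrics.js code: `free_disk_mb`, `run_command`, and the `Runner` type are hypothetical names, and it assumes a POSIX `df` is available. The point is only that a failed spawn (for example ENOMEM) is caught and reported as "no measurement" instead of propagating and killing the worker.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Hypothetical type: anything that runs a command and returns its stdout.
Runner = Callable[[Sequence[str]], str]


def run_command(argv: Sequence[str]) -> str:
    """Spawn a process and return its stdout; raises OSError if the spawn fails."""
    result = subprocess.run(
        argv, capture_output=True, text=True, check=True, timeout=10
    )
    return result.stdout


def free_disk_mb(path: str = "/", runner: Runner = run_command) -> Optional[float]:
    """Return free space in MB for `path`, or None if it cannot be measured."""
    try:
        output = runner(["df", "-Pk", path])
    except (OSError, subprocess.SubprocessError) as exc:
        # OSError covers spawn failures such as ENOMEM or ENOENT;
        # SubprocessError covers non-zero exit codes and timeouts.
        print(f"disk space check skipped: {exc}")
        return None

    lines = output.strip().splitlines()
    try:
        # POSIX `df -P` prints a header line, then one line per filesystem;
        # the 4th column is the available space in 1024-byte blocks.
        available_kb = int(lines[1].split()[3])
    except (IndexError, ValueError):
        return None
    return available_kb / 1024
```

A caller would treat `None` as "metric unavailable for this interval" and keep running. As noted above, this would not have fixed the underlying memory pressure; it only keeps a metrics failure from being the thing that crashes the worker.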
OK, so I'm guessing that the *.medium instance type is just too small. This seems like the expected result of trying to fit a large peg into a small hole.

Releng uses *.medium for EC2 instances, but that's probably already pretty tight -- loading a desktop environment entails a lot of processes!
Does this require any more work?
Flags: needinfo?(armenzg)
Let's close it.
If it happens again I will file it again.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(armenzg)
Resolution: --- → INVALID