Closed Bug 1230942 Opened 9 years ago Closed 8 years ago

Tasks failing with multiple claim-expired runs

Tracking

(Not tracked)

Status:

RESOLVED INVALID

People

(Reporter: armenzg, Unassigned)

Details

Armen [:armenzg]

Reporter

Description

9 years ago

https://tools.taskcluster.net/task-inspector/#XRvaOuGcTvOOpeycJ5yXrw/
https://tools.taskcluster.net/task-inspector/#ACaEAEeRRqWV_1KNryEluA/0

Try to load any of the artifacts of any of the runs.

You can also see it on Treeherder:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b06014e0ba00&filter-searchStr=cpp
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b06014e0ba00&filter-searchStr=lucid

Dustin J. Mitchell [:dustin] (he/him)

Updated

9 years ago

Summary: Jobs with no logs and unclear reason for failures → Tasks failing with multiple claim-expired runs

Armen [:armenzg]

Reporter

Comment 1

9 years ago

In those pushes we switched to the 'desktop-test' worker type.

Greg Arndt [:garndt]

Comment 2

9 years ago

I'm trying to search back to 3 days ago, but Papertrail is slow at that. Do you have any more recent runs (say, from the last day) where this happened?

Also, if this was an issue with Docker starting up, there is a commit to docker-worker [1] that should land soon and tries to capture errors when a container fails to start.

[1] https://github.com/taskcluster/docker-worker/commit/c4338db6a38884b8a3532f8a905f6aaa235b5587

Greg Arndt [:garndt]

Comment 3

9 years ago

Never mind, I found the error in the logs. For some reason, the module we use for checking disk space failed to spawn a process, and that crashed the entire worker. It looks like a guard is needed around this [1] so that a failure to spawn a process from the disk space module doesn't take down the whole worker. However, the reason it couldn't spawn a process was ENOMEM, which probably would have caused other things to fail soon anyway. The worker was also close to running out of disk space (around 36 MB free).

This instance completed a few tasks before crashing. It looks like the tasks running on that instance type may be a bit too heavy for it. I'm going to try to pull worker stats from around 3 days ago out of Influx, but I'm not sure that's possible at the moment. I'll update the bug if I find anything.

Relevant papertrail log link: 
https://papertrailapp.com/systems/151208103/events?r=609570480520998926-609574311849930754

[1] https://github.com/taskcluster/docker-worker/blob/master/lib/stats/host_metrics.js#L45
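
For illustration only, the kind of guard described above could look roughly like the sketch below. This is not the docker-worker code: `runGuarded` is a hypothetical helper name, and the sketch assumes the disk space check runs an external command through Node's `child_process.spawn`. Depending on the failure, Node may throw from `spawn` synchronously or emit an `'error'` event on the child, so the sketch handles both and reports `null` instead of crashing.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper (not from docker-worker): run a command and resolve
// with its stdout, or with null if it cannot be spawned or exits non-zero.
function runGuarded(command, args = []) {
  return new Promise((resolve) => {
    let child;
    try {
      child = spawn(command, args, { stdio: ['ignore', 'pipe', 'ignore'] });
    } catch (err) {
      // Some spawn failures are thrown synchronously.
      resolve(null);
      return;
    }

    // Without a listener, an 'error' event on the child is unhandled
    // and would bring down the whole process.
    child.on('error', () => resolve(null));

    let stdout = '';
    if (child.stdout) {
      // stdout may be missing if the spawn failed before pipes were set up.
      child.stdout.setEncoding('utf8');
      child.stdout.on('data', (chunk) => { stdout += chunk; });
    }

    // Resolving twice is harmless; only the first call takes effect.
    child.on('close', (code) => resolve(code === 0 ? stdout : null));
  });
}
```

A caller collecting host metrics could treat a `null` result as "skip this sample" and log a warning, so that a transient failure to spawn costs one data point rather than the worker.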

Dustin J. Mitchell [:dustin] (he/him)

Comment 4

9 years ago

OK, so I'm guessing that the *.medium instance type is just too small.  This seems like reasonable behavior from trying to fit a large peg into a small hole.

Releng uses *.medium for EC2 instances, but that's probably already pretty tight -- loading a desktop environment entails a lot of processes!

Selena Deckelmann :selenamarie :selena

Comment 5

8 years ago

Does this require any more work?

Flags: needinfo?(armenzg)

Armen [:armenzg]

Reporter

Comment 6

8 years ago

Let's close it.
If it happens again I will file it again.

Status: NEW → RESOLVED

Closed: 8 years ago

Flags: needinfo?(armenzg)

Resolution: --- → INVALID


Categories

(Taskcluster :: General, defect)
