Bug 1431950 (Closed) - Trees closed - failed jobs being retried as claim-expired
Opened 6 years ago; closed 6 years ago

Categories: Taskcluster :: General, enhancement
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: philor; Assignee: bstack

Description
Starting sometime during yesterday's closure for bug 1431742 (no way for me to guess whether it's because of that, or just something that broke while we were closed for it), in some but not all sorts of jobs, failures are being retried as claim-expired rather than reporting the failure. The two clearest examples I have are push-caused bustage in the Windows mingw build,
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=6d97ae42700e4086584d74cde222fa2f870780dd&filter-searchStr=0688b6df80237d68c0a2245abb49e6dc884cd710&group_state=expanded
and the wdspec test job, which shows it nicely because it has a crap-load of intermittent failures:
https://treeherder.mozilla.org/#/jobs?repo=autoland&fromchange=9687e3d987bea9abf622c6e67ea10c77afa15391&group_state=expanded&filter-resultStatus=retry&filter-resultStatus=runnable&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-searchStr=wdspec&tochange=e41964b766df5a2e08ed161073ddfb085562732c&selectedJob=157495631
Comment 1 • 6 years ago (Reporter)
I closed inbound, autoland, central, beta, and release for this, since an intermittent failure hitting up to 80% of the time would be completely hidden from sheriff view by retrying until the job turned green or failed six times in a row (and given how the mingw builds are showing, even a permaorange wouldn't necessarily be clearly permaorange).
Comment 2 • 6 years ago (Reporter)
Looks like for tests it affects all flavors of Linux, and Android 4.3 but not Android 6.0; for builds, it affects all browser builds on Linux (including Android and OS X cross-compiles and Windows mingw), but not spidermonkey builds on Linux. I don't know what set that implies.

The start time must be before early afternoon Pacific on 2018-01-19, since https://treeherder.mozilla.org/#/jobs?repo=try&revision=09fcfe939806359ac5f506f95cccb1385e121a77&filter-searchStr=build&group_state=expanded shows retries just after noon, but the next earlier push which would have shown it and didn't was at 03:56, which doesn't make a very narrow window.
Comment 3 • 6 years ago (Reporter)
Closed esr52 as well, since it turns out to have just enough tier-1 Linux taskcluster jobs to also be affected.
Comment 4 • 6 years ago (Assignee)
It appears that the worker is failing to report state due to a bug introduced to the workers yesterday. From the logs of i-0a201f1e0fb9be3fa [0] (papertrail link while it lasts [1]):

Jan 20 13:45:11 docker-worker.aws-provisioner.us-east-1d.ami-29557d53.m4-4xlarge.i-0a201f1e0fb9be3fa docker-worker: {"type":"task error","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0a201f1e0fb9be3fa","workerGroup":"us-east-1","workerType":"gecko-3-b-linux","workerNodeType":"m4.4xlarge","taskId":"B8RyrbRKThewqMptCCR4Lg","message":"TypeError: Cannot read property 'purgeCaches' of undefined","stack":"TypeError: Cannot read property 'purgeCaches' of undefined\n at Task.completeRun (/home/ubuntu/docker_worker/src/lib/task.js:592:58)\n at Task.start (/home/ubuntu/docker_worker/src/lib/task.js:748:61)\n at <anonymous>\n at process._tickDomainCallback (internal/process/next_tick.js:228:7)","err":{}}

It is likely that [2] was deployed for the first time during yesterday's outage. This will probably require us to fix the worker and roll out new ones. NI-ing :wcosta for help with that.

[0] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-3-b-linux/workers/us-east-1/i-0a201f1e0fb9be3fa
[1] https://papertrailapp.com/systems/1528076352/events?focus=891488781910302723
[2] https://github.com/taskcluster/docker-worker/commit/d4f687d37b8bf1e74235762fb074725853af427f
Flags: needinfo?(wcosta)
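For illustration, here is a minimal standalone sketch of that failure mode, assuming the optional payload block is named onExitStatus as in the error message (the payload contents below are purely hypothetical):

    // Hypothetical repro of the TypeError in the log above: reading a nested
    // property of an optional payload block that the task never set.
    const payload = {image: 'ubuntu:16.04', command: ['true']}; // no onExitStatus block

    // Throws: TypeError: Cannot read property 'purgeCaches' of undefined
    console.log(payload.onExitStatus.purgeCaches);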
Updated • 6 years ago (Assignee)
Assignee: nobody → bstack
Status: NEW → ASSIGNED
Comment 5 • 6 years ago (Assignee)
https://github.com/taskcluster/docker-worker/pull/361
Comment 6 • 6 years ago
I'll also note that we should always aim to set additionalProperties: false whenever possible, for example here:
https://github.com/taskcluster/docker-worker/blob/3dc8e7b70fb40ef0f23017516ef4d4c1c52153bd/schemas/payload.json#L209

A good way to avoid the issue we have now is to specify a default value in the JSON schema:

    "default": {"purgeCaches": [], "retry": []}

and then set useDefaults: true when validating with ajv, so the defaults get injected. Small stuff like that saves us a lot of bugs.
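A minimal sketch of that suggestion using ajv's useDefaults option (the schema shape below is illustrative, not the actual docker-worker payload schema):

    // Sketch: with useDefaults, ajv injects the declared default when the
    // optional onExitStatus block is missing, so downstream code never sees
    // undefined.
    const Ajv = require('ajv');

    const schema = {
      type: 'object',
      additionalProperties: false,
      properties: {
        onExitStatus: {
          type: 'object',
          additionalProperties: false,
          properties: {
            purgeCaches: {type: 'array', items: {type: 'integer'}},
            retry: {type: 'array', items: {type: 'integer'}},
          },
          default: {purgeCaches: [], retry: []},
        },
      },
    };

    const ajv = new Ajv({useDefaults: true});
    const validate = ajv.compile(schema);

    const payload = {};                // task payload with no onExitStatus
    validate(payload);                 // passes and mutates payload in place
    console.log(payload.onExitStatus); // { purgeCaches: [], retry: [] }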
Comment 7 • 6 years ago
Commits pushed to master at https://github.com/taskcluster/docker-worker

https://github.com/taskcluster/docker-worker/commit/506fc1efe9a0cfcc4d64f730c5def8aa22d544b3
Bug 1431950 - check for onExistStatus before purgeCaches check

https://github.com/taskcluster/docker-worker/commit/befe4fbc51e9ac7aec5e31f774facfe882517c7c
Merge pull request #361 from taskcluster/bug-1431950
Bug 1431950 - check for onExitStatus before purgeCaches check
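For context, the guard those commit messages describe amounts to something like the sketch below (shouldPurgeCaches is a hypothetical helper, not the actual code in src/lib/task.js):

    // Sketch of the guard: only consult purgeCaches when the optional
    // onExitStatus block is actually present in the task payload.
    function shouldPurgeCaches(payload, exitCode) {
      const onExitStatus = payload.onExitStatus;
      return Boolean(onExitStatus &&
                     Array.isArray(onExitStatus.purgeCaches) &&
                     onExitStatus.purgeCaches.includes(exitCode));
    }

    console.log(shouldPurgeCaches({}, 1));                                 // false: missing block no longer throws
    console.log(shouldPurgeCaches({onExitStatus: {purgeCaches: [1]}}, 1)); // true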
Comment 8 • 6 years ago (Assignee)
We have landed the changes. However, gps has warned us of bug 1431742 comment 15. Given that, I have created a branch [0] from the last known good commit before the per-second billing changes (which is also the commit that is running in production now) and cherry-picked my changes on top of it. I will be creating an AMI from that and deploying it shortly.
Comment 9 • 6 years ago (Assignee)
And [0] from my last comment was supposed to be https://github.com/taskcluster/docker-worker/tree/fun-time-saturday
Comment 10 • 6 years ago
I saw the message this morning; I am working on making the master branch stable again.
Flags: needinfo?(wcosta)
Updated • 6 years ago (Assignee)
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Blocks: tc-stability