Closed Bug 1431950 Opened 6 years ago Closed 6 years ago

Trees closed - failed jobs being retried as claim-expired

Categories

(Taskcluster :: General, enhancement)

Type: enhancement
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: bstack)

References

Details

Starting sometime during yesterday's closure for bug 1431742 (I have no way to guess whether it's because of that, or just something that broke while we were closed for it), failures in some but not all sorts of jobs are being retried as claim-expired rather than being reported as failures.

The two clearest examples I have are push-caused bustage in the Windows mingw build, https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=6d97ae42700e4086584d74cde222fa2f870780dd&filter-searchStr=0688b6df80237d68c0a2245abb49e6dc884cd710&group_state=expanded and the wdspec test job, which shows it nicely because it has a crap-load of intermittent failures, https://treeherder.mozilla.org/#/jobs?repo=autoland&fromchange=9687e3d987bea9abf622c6e67ea10c77afa15391&group_state=expanded&filter-resultStatus=retry&filter-resultStatus=runnable&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-searchStr=wdspec&tochange=e41964b766df5a2e08ed161073ddfb085562732c&selectedJob=157495631
I closed inbound, autoland, central, beta, and release for this, since up to an 80% intermittent failure would be completely hidden from sheriff view by retrying until it turned green or failed six times in a row (and given how the mingw builds are showing, even permaorange wouldn't necessarily be clearly permaorange).
Looks like for tests it affects all flavors of Linux and Android 4.3 but not Android 6.0; for builds, it affects all browser builds on Linux (including the Android and OS X cross-compiles and Windows mingw) but not the spidermonkey builds on Linux. Dunno what set that implies.

Starting time must be "before early afternoon Pacific on 2018-01-19" since https://treeherder.mozilla.org/#/jobs?repo=try&revision=09fcfe939806359ac5f506f95cccb1385e121a77&filter-searchStr=build&group_state=expanded shows retries just after noon, but the next earlier push which would have shown it and didn't was at 03:56, which doesn't make a very narrow window.
Blocks: 1431742
Closed esr52 as well, since it turns out to have just enough tier-1 Linux taskcluster jobs to also be affected.
It appears that the worker is failing to report state due to a bug introduced to the workers yesterday.

From the logs of i-0a201f1e0fb9be3fa [0] (papertrail link while it lasts [1]):



Jan 20 13:45:11 docker-worker.aws-provisioner.us-east-1d.ami-29557d53.m4-4xlarge.i-0a201f1e0fb9be3fa docker-worker: {"type":"task error","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0a201f1e0fb9be3fa","workerGroup":"us-east-1","workerType":"gecko-3-b-linux","workerNodeType":"m4.4xlarge","taskId":"B8RyrbRKThewqMptCCR4Lg","message":"TypeError: Cannot read property 'purgeCaches' of undefined","stack":"TypeError: Cannot read property 'purgeCaches' of undefined\n    at Task.completeRun (/home/ubuntu/docker_worker/src/lib/task.js:592:58)\n    at Task.start (/home/ubuntu/docker_worker/src/lib/task.js:748:61)\n    at <anonymous>\n    at process._tickDomainCallback (internal/process/next_tick.js:228:7)","err":{}} 



It is likely that [2] was deployed for the first time during yesterday's outage. This will probably require us to fix the worker and roll out new ones. NI-ing :wcosta for help with that.


[0] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-3-b-linux/workers/us-east-1/i-0a201f1e0fb9be3fa
[1] https://papertrailapp.com/systems/1528076352/events?focus=891488781910302723
[2] https://github.com/taskcluster/docker-worker/commit/d4f687d37b8bf1e74235762fb074725853af427f
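
For illustration, a minimal sketch of the failure mode and the kind of guard that would avoid it. This is not the actual docker-worker source; payload, onExitStatus, and exitCode are assumed names based on the error message above, and the real field names in the worker may differ.

// Sketch of the failure seen in Task.completeRun, not the real worker code.
// A task payload may legitimately omit the optional onExitStatus block.
function shouldPurgeCaches(payload, exitCode) {
  // Broken pattern: assumes onExitStatus is always present, so this throws
  // "TypeError: Cannot read property 'purgeCaches' of undefined" for tasks
  // that do not define it:
  //   return payload.onExitStatus.purgeCaches.includes(exitCode);

  // Guarded pattern: confirm the block exists before dereferencing it.
  return Boolean(payload.onExitStatus &&
                 Array.isArray(payload.onExitStatus.purgeCaches) &&
                 payload.onExitStatus.purgeCaches.includes(exitCode));
}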
Flags: needinfo?(wcosta)
Assignee: nobody → bstack
Status: NEW → ASSIGNED
I also note that we should always aim to set additionalProperties: false whenever possible, for example here:
https://github.com/taskcluster/docker-worker/blob/3dc8e7b70fb40ef0f23017516ef4d4c1c52153bd/schemas/payload.json#L209

A good way to avoid the issue we have now is to specify a default value in the JSON schema:
  "default": {"purgeCaches": [], "retry": []}
and then set useDefaults: true when validating with ajv, so the defaults get injected. Small stuff like that saves us a lot of bugs.
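
A hedged sketch of both suggestions (not the worker's real validation code; the schema here is simplified and the property names are illustrative):

// Sketch only: a simplified payload schema plus ajv validation with defaults.
const Ajv = require('ajv');

const schema = {
  type: 'object',
  additionalProperties: false,             // reject unexpected top-level keys
  properties: {
    onExitStatus: {
      type: 'object',
      additionalProperties: false,
      properties: {
        purgeCaches: {type: 'array', items: {type: 'integer'}},
        retry: {type: 'array', items: {type: 'integer'}}
      },
      // Injected when a task omits onExitStatus, so downstream code can always
      // read payload.onExitStatus.purgeCaches without guarding.
      default: {purgeCaches: [], retry: []}
    }
  }
};

const ajv = new Ajv({useDefaults: true});  // inject schema defaults while validating
const validate = ajv.compile(schema);

const payload = {};                        // a task payload with no onExitStatus
validate(payload);
console.log(payload.onExitStatus);         // { purgeCaches: [], retry: [] }
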
Commits pushed to master at https://github.com/taskcluster/docker-worker

https://github.com/taskcluster/docker-worker/commit/506fc1efe9a0cfcc4d64f730c5def8aa22d544b3
Bug 1431950 - check for onExistStatus before purgeCaches check

https://github.com/taskcluster/docker-worker/commit/befe4fbc51e9ac7aec5e31f774facfe882517c7c
Merge pull request #361 from taskcluster/bug-1431950

Bug 1431950 - check for onExitStatus before purgeCaches check
We have landed the changes. However, gps has warned us about bug 1431742 comment 15. Given that, I have created a branch [0] from the last known good commit before the per-second billing changes (which is also the commit currently running in production) and cherry-picked my changes on top of it. I will be creating an AMI from that and deploying it shortly.
And [0] from my last comment was supposed to be https://github.com/taskcluster/docker-worker/tree/fun-time-saturday
I saw the message this morning; I am working on making the master branch stable again.
Flags: needinfo?(wcosta)
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED