Closed Bug 1289836 Opened 8 years ago Closed 7 years ago

Intermittent [taskcluster:error] Task was aborted because states could not be created successfully. Error calling 'link' for localLiveLog

Categories

(Taskcluster :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [docker-link-failure][stockwell infra][fennec-scouting])

Whiteboard: [docker-link-failure]
No longer blocks: 1307493
Depends on: 1307493
It looks like bug 1307510 and bug 1307509 are the bugs that need to get fixed first here. I believe there is some desire to get those fixed in the short term.
It looks like many of the recent errors came from docker crashing and upstart restarting it. Stack traces in the docker logs show a number of race errors. Some GitHub issues tie this to running docker 1.10, which is the version we run. Since that version is no longer supported, we will be looking at upgrading docker to 1.12.6.
Depends on: 1332490
Today the new docker worker was deployed in bug 1332490. If there are no other side effects, we should know in a few days whether this is helping or what the next source of the problem is.
The new error we are hitting:

https://public-artifacts.taskcluster.net/OPmqa67-Ss6dstdlXlfcDw/0/public/logs/live_backing.log

[taskcluster:error] Task was aborted because states could not be created successfully. Error calling 'link' for localLiveLog : (HTTP code 404) no such container - invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:258: applying cgroup configuration for process caused \\\"open /sys/fs/cgroup/cpuset/docker/cpuset.cpus: no such file or directory\\\"\"\n" 

Looks to be related to https://github.com/docker/docker/issues/9654
Whiteboard: [docker-link-failure] → [docker-link-failure][stockwell infra]
Whiteboard: [docker-link-failure][stockwell infra] → [docker-link-failure][stockwell infra][fennec-scouting]
25 failures in the last 4 days:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1289836

:garndt, is this something for the taskcluster team, or does this lie elsewhere?
Flags: needinfo?(garndt)
Looping back to this one: I think retrying the docker container startup might be the solution here, at least for live log. It might also be a good solution for both proxies. There are times when docker just will not start a container (maybe a race with something else).
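For illustration only, the retry idea could be sketched roughly as below. This is not the actual docker-worker change; `start_with_retry`, `ContainerStartError`, and the attempt and delay values are hypothetical names and numbers chosen for the example, and the `start` callable stands in for whatever actually starts the live log or proxy container.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ContainerStartError(Exception):
    """Hypothetical error standing in for a failed container start."""


def start_with_retry(
    start: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call start() up to `attempts` times, backing off between failures.

    Only ContainerStartError is retried; any other exception propagates
    immediately. The last failure is re-raised once attempts run out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return start()
        except ContainerStartError:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A caller would wrap the real start call, for example `start_with_retry(lambda: start_live_log_container())`, where `start_live_log_container` is again a placeholder. Retrying only a specific error type keeps genuine configuration errors from being masked by the retry loop.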
Flags: needinfo?(garndt)
Code to retry starting up a container should roll out sometime this week.
Bug 1307493 has been rolled out, which should alleviate most of these issues. No failures have been logged in 9 days, so closing this out.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED