Long running tasks triggered from bugbug hook fail too often because of claim-expired
Categories
(Taskcluster :: Workers, defect)
Tracking
(Not tracked)
People
(Reporter: marco, Assigned: wcosta)
References
(Blocks 1 open bug)
Details
Tasks triggered from https://tools.taskcluster.net/hooks/project-releng/services-testing-bugbug%2Fdata take (IIRC) around 4-5 hours to finish (sometimes more, depending on network conditions, but usually less than 10 hours).
Very often, though, they fail because of claim-expired and are automatically rerun many times. They do sometimes succeed, but most of the time they don't (indeed, they often run out of retries and I have to manually retrigger).
The task is using the releng-svc workerType (shared with a few other tasks), which has a few different instance type options.
The process which runs in the task is network and IO heavy.
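For context, a run is resolved claim-expired when the worker stops renewing its claim before the claim's takenUntil deadline. A minimal sketch of the worker-side reclaim loop, assuming the taskcluster-client `queue.reclaimTask` method (the `keepClaimAlive`/`isDone` names are illustrative, not actual docker-worker code):

```javascript
// Sketch of the claim/reclaim cycle whose failure produces claim-expired.
// `queue.reclaimTask` is the real taskcluster-client method name; everything
// else here is a hypothetical model.
async function keepClaimAlive(queue, taskId, runId, takenUntil, isDone) {
  while (!isDone()) {
    // Sleep until shortly (here: 60s) before the claim's deadline. If the
    // worker hangs or dies past takenUntil without reclaiming, the queue
    // marks the run claim-expired and schedules a rerun.
    const msLeft = new Date(takenUntil).getTime() - Date.now();
    await new Promise(resolve => setTimeout(resolve, Math.max(0, msLeft - 60_000)));
    if (isDone()) break;
    const claim = await queue.reclaimTask(taskId, runId);
    takenUntil = claim.takenUntil; // fresh deadline granted by the queue
  }
}
```

A long-running task like this one simply needs many successful reclaims in a row; any crash, reboot, or sustained hang of the worker between two reclaims resolves the run as claim-expired.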
Comment 1•5 years ago
Last lines in the log for that worker are:
Apr 03 14:48:03 docker-worker.aws-provisioner.us-east-1a.ami-0d2e14f2a00a42c94.c4-xlarge.i-055853c783ca2302d docker-worker: {"type":"[alert-operator] error retrieving credentials from provisioner","source":"host/aws","stack":"Error: Unknown Server Error\n at /home/ubuntu/docker_worker/node_modules/taskcluster-client/src/client.js:374:21\n at <anonymous>\n at process._tickDomainCallback (internal/process/next_tick.js:228:7)"}
Apr 03 14:48:03 docker-worker.aws-provisioner.us-east-1a.ami-0d2e14f2a00a42c94.c4-xlarge.i-055853c783ca2302d docker-worker: {"type":"read secrets","source":"host/aws"}
Apr 03 14:48:03 docker-worker.aws-provisioner.us-east-1a.ami-0d2e14f2a00a42c94.c4-xlarge.i-055853c783ca2302d rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1457" x-info="http://www.rsyslog.com"] exiting on signal 15.
It had already handled a good deal of work [according to the queue](https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/releng-svc/workers/us-east-1/i-055853c783ca2302d). Do you have any idea what's happening here, Wander?
Reporter
Comment 2•5 years ago
I've tried using a beefier worker type, so that it would finish earlier and not run into this problem. It succeeded: https://tools.taskcluster.net/groups/fjHHiqIOQZu9H8UbrPFJuQ/tasks/fjHHiqIOQZu9H8UbrPFJuQ.
Assignee
Comment 3•5 years ago
(In reply to Brian Stack [:bstack] from comment #1)
> Last lines in the log for that worker are: [...]
>
> It had already handled a good deal of work [according to the queue](https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/releng-svc/workers/us-east-1/i-055853c783ca2302d). Do you have any idea what's happening here, Wander?
It might be some network issue or maybe something with the provisioner service. I am going to add exponential backoff to this request.
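A generic sketch of what exponential backoff around the credentials request could look like (the `withBackoff` helper and the usage line are hypothetical illustrations, not the actual docker-worker patch):

```javascript
// Hypothetical retry helper: retries `fn` with exponentially growing delays.
// Not the actual docker-worker change; a minimal model of the idea.
async function withBackoff(fn, { retries = 5, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: propagate
      // Wait baseMs * 2^attempt before retrying (a real implementation
      // would usually add jitter, omitted here for brevity).
      await new Promise(resolve => setTimeout(resolve, baseMs * 2 ** attempt));
    }
  }
}

// e.g. (hypothetical call site):
// const secrets = await withBackoff(() => provisioner.getSecret(token));
```

This only helps if the "Unknown Server Error" is transient; if the credentials are genuinely gone (see the later comments), no amount of retrying will recover them.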
Reporter
Comment 4•5 years ago
The hook changed to https://tools.taskcluster.net/hooks/project-relman/bugbug, but I'm still seeing the same problem: https://tools.taskcluster.net/groups/V8R9mT1kQPG1V9yyodBOVg/tasks/Y2Ww3hzoQJyMuEVowJcC7g/runs/5.
The worker type is releng-svc-compute.
Comment 5•5 years ago
(In reply to Wander Lairson Costa [:wcosta] from comment #3)
> It might be some network issue or maybe something with the provisioner service. I am going to add exponential backoff to this request.
Hi Wander, did you add the exponential backoff?
It would be great if we could log the full HTTP requests/responses from the client to see what happened. I wonder whether "Error: Unknown Server Error" means there was an HTTP 500 error, or something else.
Another thought - is it possible that the worker is trying to fetch its credentials after it has already deleted them from the provisioner? For example, if the worker is rebooting for any reason, perhaps the worker is starting up a second time, and when it tries to retrieve the credentials from the provisioner, they have already been deleted from the first run of the worker.
Comment 6•5 years ago
Would also be useful to see the queue and provisioner log messages for their communication with this workerId during this time.
Comment 7•5 years ago
> Another thought - is it possible that the worker is trying to fetch its credentials after it has already deleted them from the provisioner? For example, if the worker is rebooting for any reason, perhaps the worker is starting up a second time, and when it tries to retrieve the credentials from the provisioner, they have already been deleted from the first run of the worker.
I didn't paste enough logs above to show this, but I believe that is what is happening.
Assignee
Comment 8•5 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #5)
> Hi Wander, did you add the exponential backoff?
>
> Another thought - is it possible that the worker is trying to fetch its credentials after it has already deleted them from the provisioner? For example, if the worker is rebooting for any reason, perhaps the worker is starting up a second time, and when it tries to retrieve the credentials from the provisioner, they have already been deleted from the first run of the worker.
Oh no, I realized in another bug that this was caused by a different error that made the worker restart; after restarting, it would fail to retrieve its credentials (since retrieval is a one-time operation). Newer versions of docker-worker have this fixed, as the worker no longer restarts in case of failure.
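The one-shot behaviour described here can be modelled as a secret store whose entries are deleted on first read. This is a simplified illustration of the failure mode, not the provisioner's actual implementation (the `makeSecretStore`, `put`, and `take` names are hypothetical):

```javascript
// Illustrative model of why a restarted worker cannot re-fetch credentials:
// the provisioner deletes the secret after the first retrieval, so a second
// fetch (e.g. after an unexpected reboot) fails.
function makeSecretStore() {
  const secrets = new Map();
  return {
    // Provisioner side: stash credentials under a one-time token.
    put: (token, value) => secrets.set(token, value),
    // Worker side: first call returns the secret and removes it; any later
    // call throws, which is what a rebooted worker runs into.
    take: (token) => {
      if (!secrets.has(token)) throw new Error('secret already consumed');
      const value = secrets.get(token);
      secrets.delete(token);
      return value;
    },
  };
}
```

Under this model, exponential backoff cannot help the second fetch; the only fix is what newer docker-worker does, namely not restarting (and thus never attempting a second fetch).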
Reporter
Comment 9•5 years ago
It seems like I'm still seeing this pretty often. For example, https://tools.taskcluster.net/groups/X5xFuJ3MR0a8lBwYUKe1QA/tasks/eBFgCeMKSlCAaobczS2XFA/runs/0 had two runs so far failing with claim-expired.
What could be the cause?
Assignee
Comment 10•5 years ago
It feels like the instance restarted due to OOM. I have set the worker type capacity to 1; hopefully that will prevent the OOM.
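If I understand correctly, this refers to the per-instance-type `capacity` field in the aws-provisioner-v1 workerType definition, which controls how many tasks an instance claims concurrently. A fragment like the following (values illustrative, field names assumed from the workerType schema) would limit each instance to one task at a time:

```json
{
  "instanceTypes": [
    {
      "instanceType": "c4.xlarge",
      "capacity": 1,
      "utility": 1
    }
  ]
}
```

With capacity 1, concurrent tasks can no longer compete for the instance's memory, which is presumably the OOM trigger being avoided here.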
Reporter
Comment 12•5 years ago
Yes, thanks!