Closed Bug 1541667 Opened 5 years ago Closed 5 years ago

Long running tasks triggered from bugbug hook fail too often because of claim-expired

Categories

(Taskcluster :: Workers, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: marco, Assigned: wcosta)

References

(Blocks 1 open bug)

Details

Tasks triggered from https://tools.taskcluster.net/hooks/project-releng/services-testing-bugbug%2Fdata take (IIRC) around 4-5 hours to finish (sometimes more, depending on network conditions, but usually less than 10 hours).

Very often, though, they fail because of claim-expired and get automatically rerun many times. They do sometimes succeed, but most of the time they don't (indeed, they often run out of retries and I have to retrigger them manually).

The task is using the releng-svc workerType (shared with a few other tasks), which has a few different instance type options.

The process which runs in the task is network and IO heavy.

Last lines in the log for that worker are:

Apr 03 14:48:03 docker-worker.aws-provisioner.us-east-1a.ami-0d2e14f2a00a42c94.c4-xlarge.i-055853c783ca2302d docker-worker: {"type":"[alert-operator] error retrieving credentials from provisioner","source":"host/aws","stack":"Error: Unknown Server Error\n    at /home/ubuntu/docker_worker/node_modules/taskcluster-client/src/client.js:374:21\n    at <anonymous>\n    at process._tickDomainCallback (internal/process/next_tick.js:228:7)"}
Apr 03 14:48:03 docker-worker.aws-provisioner.us-east-1a.ami-0d2e14f2a00a42c94.c4-xlarge.i-055853c783ca2302d docker-worker: {"type":"read secrets","source":"host/aws"}
Apr 03 14:48:03 docker-worker.aws-provisioner.us-east-1a.ami-0d2e14f2a00a42c94.c4-xlarge.i-055853c783ca2302d rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1457" x-info="http://www.rsyslog.com"] exiting on signal 15. 

It had already handled a good deal of work [according to the queue](https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/releng-svc/workers/us-east-1/i-055853c783ca2302d). Do you have any idea what's happening here, Wander?

Flags: needinfo?(wcosta)

I've tried using a beefier worker type, so that it would finish earlier and not run into this problem. It succeeded: https://tools.taskcluster.net/groups/fjHHiqIOQZu9H8UbrPFJuQ/tasks/fjHHiqIOQZu9H8UbrPFJuQ.

(In reply to Brian Stack [:bstack] from comment #1)

> It had already handled a good deal of work according to the queue. Do you have
> any idea what's happening here, Wander?

It might be some network issue or maybe something with the provisioner service. I am going to add exponential backoff to this request.
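
A minimal sketch of the kind of retry I mean (the helper name and parameters are illustrative, not the actual docker-worker change):

// Illustrative retry helper: exponential backoff with jitter around an async call.
async function withBackoff(fn, {retries = 5, baseDelayMs = 1000} = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) {
        throw err; // give up after the last retry
      }
      // 1s, 2s, 4s, ... multiplied by a random factor in [0.5, 1.0)
      const delayMs = baseDelayMs * 2 ** attempt * (0.5 + Math.random() / 2);
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Hypothetical use around the call that failed in the log above:
// const secrets = await withBackoff(() => provisioner.getSecret(token));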

Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
Flags: needinfo?(wcosta)

(In reply to Wander Lairson Costa [:wcosta] from comment #3)

> It might be some network issue or maybe something with the provisioner
> service. I am going to add exponential backoff to this request.

Hi Wander, did you add the exponential backoff?

It would be great if we could log the full HTTP requests/responses from the client to see what happened. I wonder if "Error: Unknown Server Error" means there was an HTTP 500 error, or something else.
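
For example, something roughly like this around the failing call would surface the status code and body (sketch only; the error properties read here are assumptions about what the client attaches, not confirmed):

// Sketch: wrap an API call so a failure logs more than "Unknown Server Error".
async function loggedCall(label, fn) {
  try {
    return await fn();
  } catch (err) {
    console.error(JSON.stringify({
      type: `${label} failed`,
      statusCode: err.statusCode, // HTTP status, if the client attached one
      body: err.body,             // response payload, if the client attached one
      message: err.message,
    }));
    throw err;
  }
}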

Another thought - is it possible that the worker is trying to fetch its credentials after it has already deleted them from the provisioner? For example, if the worker is rebooting for any reason, perhaps the worker is starting up a second time, and when it tries to retrieve the credentials from the provisioner, they have already been deleted from the first run of the worker.

Would also be useful to see the queue and provisioner log messages for their communication with this workerId during this time.

> Another thought - is it possible that the worker is trying to fetch its
> credentials after it has already deleted them from the provisioner?

I didn't paste enough logs above to show this, but I believe that is what is happening.

(In reply to Pete Moore [:pmoore][:pete] from comment #5)

> Another thought - is it possible that the worker is trying to fetch its
> credentials after it has already deleted them from the provisioner? For
> example, if the worker is rebooting for any reason, perhaps the worker is
> starting up a second time, and when it tries to retrieve the credentials
> from the provisioner, they have already been deleted from the first run of
> the worker.

Oh no, I realized in another bug that this was caused by another error making the worker restart; the restarted worker would then fail to retrieve the credentials (since they can only be fetched once). Newer versions of docker-worker have this fixed, as the worker no longer restarts in case of failure.
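
To illustrate why the restart breaks it (a self-contained toy model, not the real docker-worker or provisioner code):

// Toy model: the secret can be fetched exactly once, so a worker process
// that restarts and asks again gets an error instead of credentials.
let secretConsumed = false;

async function fetchOneTimeSecret() {
  if (secretConsumed) {
    throw new Error('secret already consumed'); // what the restarted worker runs into
  }
  secretConsumed = true;
  return {credentials: '...'};
}

async function main() {
  await fetchOneTimeSecret(); // first boot: succeeds
  await fetchOneTimeSecret(); // simulated restart: fails, like the log above
}

main().catch(err => console.error(err.message));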

Flags: needinfo?(wcosta)

It seems like I'm still seeing this pretty often. For example, https://tools.taskcluster.net/groups/X5xFuJ3MR0a8lBwYUKe1QA/tasks/eBFgCeMKSlCAaobczS2XFA/runs/0 has had two runs fail with claim-expired so far.

What could be the cause?

It feels like the instance restarted due to OOM. I set the worker type capacity to 1; hopefully that will prevent the OOM.

:marco can we close this bug?

Flags: needinfo?(mcastelluccio)

Yes, thanks!

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(mcastelluccio)
Resolution: --- → FIXED