Closed Bug 1541667 Opened 5 years ago Closed 5 years ago

Long running tasks triggered from bugbug hook fail too often because of claim-expired

Categories

(Taskcluster :: Workers, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: marco, Assigned: wcosta)

References

(Blocks 1 open bug)

Details

Tasks triggered from https://tools.taskcluster.net/hooks/project-releng/services-testing-bugbug%2Fdata take (IIRC) around 4-5 hours to finish (sometimes more, depending on network conditions, but usually less than 10 hours).

Very often, though, they fail because of claim-expired and get automatically rerun many times. They do sometimes succeed, but most of the time they don't (indeed, they often run out of retries and I have to retrigger them manually).

The task is using the releng-svc workerType (shared with a few other tasks), which has a few different instance type options.

The process which runs in the task is network and IO heavy.

Last lines in the log for that worker are:

Apr 03 14:48:03 docker-worker.aws-provisioner.us-east-1a.ami-0d2e14f2a00a42c94.c4-xlarge.i-055853c783ca2302d docker-worker: {"type":"[alert-operator] error retrieving credentials from provisioner","source":"host/aws","stack":"Error: Unknown Server Error\n    at /home/ubuntu/docker_worker/node_modules/taskcluster-client/src/client.js:374:21\n    at <anonymous>\n    at process._tickDomainCallback (internal/process/next_tick.js:228:7)"}
Apr 03 14:48:03 docker-worker.aws-provisioner.us-east-1a.ami-0d2e14f2a00a42c94.c4-xlarge.i-055853c783ca2302d docker-worker: {"type":"read secrets","source":"host/aws"}
Apr 03 14:48:03 docker-worker.aws-provisioner.us-east-1a.ami-0d2e14f2a00a42c94.c4-xlarge.i-055853c783ca2302d rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1457" x-info="http://www.rsyslog.com"] exiting on signal 15. 

It had already handled a good deal of work [according to the queue](https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/releng-svc/workers/us-east-1/i-055853c783ca2302d). Do you have any idea what's happening here, Wander?

Flags: needinfo?(wcosta)

I've tried using a beefier worker type, so that it would finish earlier and not run into this problem. It succeeded: https://tools.taskcluster.net/groups/fjHHiqIOQZu9H8UbrPFJuQ/tasks/fjHHiqIOQZu9H8UbrPFJuQ.

(In reply to Brian Stack [:bstack] from comment #1)

> It had already handled a good deal of work according to the queue. Do you have
> any idea what's happening here, Wander?

It might be some network issue or maybe something with the provisioner service. I am going to add exponential backoff to this request.
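
A minimal sketch of the kind of retry I mean (the helper name and parameters are illustrative, not the actual docker-worker change):

// Illustrative retry helper: exponential backoff with jitter around an async call.
async function withBackoff(fn, {retries = 5, baseDelayMs = 1000} = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) {
        throw err; // give up after the last retry
      }
      // 1s, 2s, 4s, ... multiplied by a random factor in [0.5, 1.0)
      const delayMs = baseDelayMs * 2 ** attempt * (0.5 + Math.random() / 2);
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Hypothetical use around the call that failed in the log above:
// const secrets = await withBackoff(() => provisioner.getSecret(token));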

Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
Flags: needinfo?(wcosta)

(In reply to Wander Lairson Costa [:wcosta] from comment #3)

> It might be some network issue or maybe something with the provisioner
> service. I am going to add exponential backoff to this request.

Hi Wander, did you add the exponential backoff?

It would be great if we could log the full HTTP requests/responses from the client to see what happened. I wonder if "Error: Unknown Server Error" means there was an HTTP 500 error, or something else.
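
For example, something roughly like this around the failing call would surface the status code and body (sketch only; the error properties read here are assumptions about what the client attaches, not confirmed):

// Sketch: wrap an API call so a failure logs more than "Unknown Server Error".
async function loggedCall(label, fn) {
  try {
    return await fn();
  } catch (err) {
    console.error(JSON.stringify({
      type: `${label} failed`,
      statusCode: err.statusCode, // HTTP status, if the client attached one
      body: err.body,             // response payload, if the client attached one
      message: err.message,
    }));
    throw err;
  }
}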

Another thought - is it possible that the worker is trying to fetch its credentials after it has already deleted them from the provisioner? For example, if the worker is rebooting for any reason, perhaps the worker is starting up a second time, and when it tries to retrieve the credentials from the provisioner, they have already been deleted from the first run of the worker.

Would also be useful to see the queue and provisioner log messages for their communication with this workerId during this time.

> Another thought - is it possible that the worker is trying to fetch its
> credentials after it has already deleted them from the provisioner?

I didn't paste enough logs above to show this, but I believe that is what is happening.

(In reply to Pete Moore [:pmoore][:pete] from comment #5)

> Another thought - is it possible that the worker is trying to fetch its
> credentials after it has already deleted them from the provisioner? For
> example, if the worker is rebooting for any reason, perhaps the worker is
> starting up a second time, and when it tries to retrieve the credentials
> from the provisioner, they have already been deleted from the first run of
> the worker.

Oh no, I realized in another bug that this was caused by another error making the worker restart; the restarted worker would then fail to retrieve the credentials (since they can only be fetched once). Newer versions of docker-worker have this fixed, as the worker no longer restarts in case of failure.
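
To illustrate why the restart breaks it (a self-contained toy model, not the real docker-worker or provisioner code):

// Toy model: the secret can be fetched exactly once, so a worker process
// that restarts and asks again gets an error instead of credentials.
let secretConsumed = false;

async function fetchOneTimeSecret() {
  if (secretConsumed) {
    throw new Error('secret already consumed'); // what the restarted worker runs into
  }
  secretConsumed = true;
  return {credentials: '...'};
}

async function main() {
  await fetchOneTimeSecret(); // first boot: succeeds
  await fetchOneTimeSecret(); // simulated restart: fails, like the log above
}

main().catch(err => console.error(err.message));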

Flags: needinfo?(wcosta)

It seems like I'm still seeing this pretty often. For example, https://tools.taskcluster.net/groups/X5xFuJ3MR0a8lBwYUKe1QA/tasks/eBFgCeMKSlCAaobczS2XFA/runs/0 has had two runs fail with claim-expired so far.

What could be the cause?

It feels like the instance restarted due to OOM. I set the worker type capacity to 1; hopefully that will prevent the OOM.

:marco can we close this bug?

Flags: needinfo?(mcastelluccio)

Yes, thanks!

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(mcastelluccio)
Resolution: --- → FIXED