Bug 1457374 - Try decision tasks have stopped running
Opened 7 years ago, closed 7 years ago
Categories: Taskcluster :: Operations and Service Requests, task, P2
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: bstack, Assigned: wcosta
Sheriffs have let us know that beginning with [0], try decision tasks seem to have stopped running. I've checked some task logs [1] and worker papertrail logs [2] but don't see what the cause would be. I've terminated all of the workers as of 5 minutes ago to see if they come back healthy. It seems like some are picking up jobs now.
Can someone in a European timezone take up this investigation (if any is still possible at that point) and try to track down what caused this? I am getting sleeeepy.
p.s. Can we give sheriffs the power to terminate workers? I can't think of a reason they should need to involve us.
[0] https://treeherder.mozilla.org/#/jobs?repo=try&revision=50ddbdb9ba0f919bf270374a3e8d0193da2c2115
[1] https://tools.taskcluster.net/groups/WlcEMWBPTHyQxbtg3QnmRw/tasks/WlcEMWBPTHyQxbtg3QnmRw/runs/0/logs/public%2Flogs%2Fcertified.log
[2] https://papertrailapp.com/systems/1818681471/events
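For reference, a rough sketch of doing the termination by hand with the AWS CLI (this assumes the provisioner tags instances with the worker type in the Name tag, which may not be exact; the EC2 console works just as well):
# list running gecko-1-decision instances in one region (repeat per region)
aws ec2 describe-instances --region us-west-2 \
  --filters "Name=tag:Name,Values=gecko-1-decision" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text
# terminate a specific worker (placeholder instance id)
aws ec2 terminate-instances --region us-west-2 --instance-ids i-0123456789abcdef0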
Comment 1•7 years ago
(In reply to Brian Stack [:bstack] from comment #0)
> Sheriffs have let us know that beginning with [0], try decision tasks seem
> to have stopped running. I've checked some task logs [1] and worker
> papertrail logs [2] but don't see what the cause would be. I've terminated
> all of the workers as of 5 minutes ago to see if they come back healthy. It
> seems like some are picking up jobs now.
We were just in a situation where we had a running capacity of 20 and 17 pending jobs. I've terminated all instances of the decision worker type. Let's see if that improves the situation.
> Can someone in Europe time take up this investigation (if any is possible at
> that point) to try to track down what caused this? I am getting sleeeepy.
It seems like this is either a decision task or docker-worker bug. Maybe Dustin or Wander could take a look to find the underlying cause?
> p.s. Can we give sheriffs the power to terminate workers? I can't think of a
> reason they should need to involve us.
Totally, that's intended to happen. There's a small amount of complexity around granting granular scopes per worker type, but we should be in the clear to give sheriffs scopes to manage terminations for all tracked instances.
Flags: needinfo?(wcosta)
Flags: needinfo?(dustin)
Comment 2•7 years ago
The only thing I can think of that might be related is bug 1456326, but that seems unlikely.
It looks like there are 10 hosts running, and only one of them (i-0440c57f887b15ac2) is actually doing any work:
https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-1-decision/workers/us-west-2/i-0440c57f887b15ac2
Picking one of the others (i-02b41dcd48f98f30a, us-west-2), which the queue helpfully says last finished a job seven hours ago, I see:
Apr 27 05:28:49 docker-worker.aws-provisioner.us-west-2a.ami-6cadc014.r3-xlarge.i-02b41dcd48f98f30a rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1527" x-info="http://www.rsyslog.com"] exiting on signal 15.
which suggests it did really try to shut down about 7 hours ago. I don't see that host in the AWS console at all.
Similarly for i-06be2694cb20ae890 in us-east-1 - last ran a task about 3 hours ago, last logs are from 3 hours ago showing it shutting down, and it's not in the console.
So, I wonder if something is missing instance terminations in ec2-manager and thus suggesting to aws-provisioner that it is at maxCapacity when it is not?
Flags: needinfo?(dustin) → needinfo?(jhford)
Comment 3•7 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #2)
> The only thing I can think of that might be related is bug 1456326, but that
> seems unlikely
I'm not sure about all the details of that bug, but rolling out a new docker image to decision tasks seems like it could be very related to decision tasks failing? The timeframe is a reasonably close fit, at the least.
> It looks like there are 10 hosts running, and only one of them
> (i-0440c57f887b15ac2) is actually doing any work
All of the machines that I see as being alive in the EC2-Manager are also alive in the EC2 Console. This suggests the problem is something on the machine or in the docker-image.
> https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/
> gecko-1-decision/workers/us-west-2/i-0440c57f887b15ac2
>
> Picking one of the others (i-02b41dcd48f98f30a, us-west-2), which the queue
> helpfully says last finished a job seven hours ago, i see
>
> Apr 27 05:28:49
> docker-worker.aws-provisioner.us-west-2a.ami-6cadc014.r3-xlarge.i-
> 02b41dcd48f98f30a rsyslogd: [origin software="rsyslogd" swVersion="7.4.4"
> x-pid="1527" x-info="http://www.rsyslog.com"] exiting on signal 15.
>
> which suggests it did really try to shut down about 7 hours ago. I don't
> see that host in the AWS console at all.
Things disappear from the console about an hour after they terminate. EC2-Manager keeps a record of all of our terminations, their codes and reasons as well as when they're received.
ec2-manager::HEROKU_POSTGRESQL_CRIMSON=> select code, terminated, "lastEvent" from terminations where id = 'i-02b41dcd48f98f30a';
code | terminated | lastEvent
----------------------------------+------------------------+----------------------------
Client.InstanceInitiatedShutdown | 2018-04-27 05:29:44+00 | 2018-04-27 05:30:04.567+00
(1 row)
This query shows that the instance in question received its CloudWatch instance state change message within ~1 minute of the log message showing the initiation of shutdown.
The full lifecycle of this instance is as follows:
ec2-manager::HEROKU_POSTGRESQL_CRIMSON=> select * from cloudwatchlog where id = 'i-02b41dcd48f98f30a';
region | id | state | generated | received
-----------+---------------------+---------------+------------------------+-------------------------------
us-west-2 | i-02b41dcd48f98f30a | pending | 2018-04-27 05:02:36+00 | 2018-04-27 05:02:52.588759+00
us-west-2 | i-02b41dcd48f98f30a | running | 2018-04-27 05:02:52+00 | 2018-04-27 05:03:16.149308+00
us-west-2 | i-02b41dcd48f98f30a | shutting-down | 2018-04-27 05:29:14+00 | 2018-04-27 05:29:31.165386+00
us-west-2 | i-02b41dcd48f98f30a | terminated | 2018-04-27 05:29:44+00 | 2018-04-27 05:30:01.179159+00
(4 rows)
This machine was alive for 27 minutes.
> Similarly for i-06be2694cb20ae890 in us-east-1 - last ran a task about 3
> hours ago, last logs are from 3 hours ago showing it shutting down, and it's
> not in the console.
Like the previous example, it looks like this instance was terminated and updated in the EC2-Manager. Timestamps are a little less clear on this one, but it looks like the state was updated within 30s of the API becoming aware of the termination, which is a reasonable ballpark given the time information I can see.
ec2-manager::HEROKU_POSTGRESQL_CRIMSON=> select code, terminated, "lastEvent" from terminations where id = 'i-06be2694cb20ae890';
code | terminated | lastEvent
----------------------------------+------------------------+----------------------------
Client.InstanceInitiatedShutdown | 2018-04-27 09:50:11+00 | 2018-04-27 09:50:46.146+00
(1 row)
This machine had a lifecycle of:
ec2-manager::HEROKU_POSTGRESQL_CRIMSON=> select * from cloudwatchlog where id = 'i-06be2694cb20ae890';
region | id | state | generated | received
-----------+---------------------+---------------+------------------------+-------------------------------
us-east-1 | i-06be2694cb20ae890 | pending | 2018-04-27 09:21:30+00 | 2018-04-27 09:21:45.688614+00
us-east-1 | i-06be2694cb20ae890 | running | 2018-04-27 09:21:39+00 | 2018-04-27 09:21:55.048105+00
us-east-1 | i-06be2694cb20ae890 | shutting-down | 2018-04-27 09:49:41+00 | 2018-04-27 09:49:56.934271+00
us-east-1 | i-06be2694cb20ae890 | terminated | 2018-04-27 09:50:11+00 | 2018-04-27 09:50:26.934808+00
(4 rows)
which is 28 minutes.
> So, I wonder if something is missing instance terminations in ec2-manager
> and thus suggesting to aws-provisioner that it is at maxCapacity when it is
> not?
I'm confident that is not the case. The provisioner and EC2-Manager seem to have valid state, and this issue seems to be limited to a single worker type as far as I can tell. Were there any indications that EC2-Manager or AWS-Provisioner weren't tracking state correctly?
Just to double-check, I've verified that all of the instances recorded as running at the time I wrote this comment:
ec2-manager::HEROKU_POSTGRESQL_CRIMSON=> select id, region from instances where "workerType" = 'gecko-1-decision';
id | region
---------------------+-----------
i-0a342e804b683419f | us-east-1
i-06a9d4c5351e7c28c | us-west-2
i-0643b6fc2ace21af4 | us-east-1
i-0b4129cd1001b58f6 | us-west-1
i-0b427159b9a541e42 | us-east-1
i-08f2dae06deb21439 | us-east-1
i-0215f95bceceb3549 | us-west-2
i-0440c57f887b15ac2 | us-west-2
i-0bbd468490492f073 | us-west-1
i-0a05cb2a876bc15e6 | us-west-2
(10 rows)
are alive and showing in the 'running' state in the AWS console.
Given that the two noted examples of bad instances had run times of just about 30 minutes, I wonder if something related to that time period is at play here. That sounds like it could be related to docker-worker startup or misconfiguration of the payload. Either way, this seems to be something going wrong on the instance itself, not in the instance management code.
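As a follow-up, a query along these lines (assuming the cloudwatchlog schema shown above) should show whether other instances are also dying around the 30-minute mark:
-- lifetime of every instance that has both a 'pending' and a 'terminated' event
-- (restricting this to gecko-1-decision would need a join against wherever
-- worker type history is recorded, which I'm not sure of)
select p.id,
       t.generated - p.generated as lifetime
  from cloudwatchlog p
  join cloudwatchlog t on t.id = p.id
 where p.state = 'pending'
   and t.state = 'terminated'
 order by lifetime;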
Flags: needinfo?(jhford) → needinfo?(dustin)
Comment 4•7 years ago
> Were there any indications that EC2-Manager or AWS-Provisioner weren't tracking state correctly?
The indication I saw was that aws-provisioner showed a running capacity of 20, but only one instance was still running. But you're right -- it could be that the other 19 started and never called claimWork. Indeed, picking one that you identified:
https://papertrailapp.com/systems/1819498441/events
shows a system renewing its DHCP lease, but with no logging from docker-worker.
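If one of those silent hosts is still reachable, a quick check would be something like the following (assuming upstart manages docker-worker on this AMI; the job name and log path are guesses on my part):
initctl list | grep -i docker
sudo tail -n 50 /var/log/upstart/docker-worker.log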
Flags: needinfo?(dustin)
Comment 5•7 years ago
Was there a docker-worker upgrade?
Comment 6•7 years ago
I've noticed on https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types that gecko-1-decision shows 18 pending tasks on that view, but if you open it there are no pending tasks.
Can someone check the AWS console for Taskcluster?
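For what it's worth, the pending count can also be read straight from the queue, assuming the standard pendingTasks endpoint applies to this worker type:
curl https://queue.taskcluster.net/v1/pending/aws-provisioner-v1/gecko-1-decision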
Updated•7 years ago
Severity: normal → blocker
Updated•7 years ago
Flags: needinfo?(pmoore)
Flags: needinfo?(dustin)
Flags: needinfo?(bstack)
Comment 7•7 years ago
Hi. I've just killed all instances of that worker type, which should get us back to accepting jobs, but is only a temporary fix.
Flags: needinfo?(pmoore)
Comment 8•7 years ago
Actually, looking at the log from Comment 4 shows me that the DHCP logs are really just extraneous output from the DHCP Daemon.
The real error looks like this:
<snip>
kernel: [ 13.780758] init: docker main process ended, respawning
kernel: [ 13.780865] init: docker post-start process (1721) terminated with status 1
kernel: [ 13.794456] init: docker main process (1752) terminated with status 1
kernel: [ 13.794463] init: docker main process ended, respawning
kernel: [ 13.794573] init: docker post-start process (1753) terminated with status 1
kernel: [ 13.808304] init: docker main process (1784) terminated with status 1
kernel: [ 13.808311] init: docker main process ended, respawning
kernel: [ 13.808441] init: docker post-start process (1785) terminated with status 1
kernel: [ 13.822379] init: docker main process (1816) terminated with status 1
kernel: [ 13.822385] init: docker main process ended, respawning
kernel: [ 13.822499] init: docker post-start process (1817) terminated with status 1
kernel: [ 13.836680] init: docker main process (1848) terminated with status 1
kernel: [ 13.836687] init: docker main process ended, respawning
kernel: [ 13.836795] init: docker post-start process (1849) terminated with status 1
kernel: [ 13.850872] init: docker main process (1880) terminated with status 1
kernel: [ 13.850879] init: docker respawning too fast, stopped
kernel: [ 13.851024] init: docker post-start process (1881) terminated with status 1
That suggests to me that there's something wrong with docker on this machine. Did a docker-worker deployment happen along with the new docker image?
Greg or Dustin, do you think you could land a change to revert us to the old docker image as a diagnostic?
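If one of the broken hosts is still up, a hedged way to dig into why the docker daemon keeps dying (this assumes Ubuntu with upstart, which matches the "init: docker ..." lines above; the upstart log path is the usual default but is an assumption here):
sudo status docker
sudo tail -n 100 /var/log/upstart/docker.log
dmesg | grep -i docker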
Flags: needinfo?(gps)
Comment 9 (Assignee)•7 years ago
For reasons I cannot understand, the volume mounting was missing on the decision worker types, causing the docker daemon to fail to start. I believe this might be related to bug 1448407, just don't ask me how.
Anyway, it seems that the problem is fixed; I can see decision tasks running again. I also don't understand why gecko-[23]-decision were not affected, but I added volume mounting to them as well.
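For context, the "volume mounting" here presumably refers to the ephemeral/instance-store block device that docker-worker uses for docker storage. As a sketch only (the real fix lives in the aws-provisioner worker type definition, and the exact placement there is an assumption on my part), launching such an instance by hand would need a block device mapping along these lines:
# sketch only, not meant to be run as-is; AMI and instance type taken from the logs above
aws ec2 run-instances \
  --image-id ami-6cadc014 \
  --instance-type r3.xlarge \
  --block-device-mappings '[{"DeviceName":"/dev/sdb","VirtualName":"ephemeral0"}]'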
Flags: needinfo?(wcosta)
Flags: needinfo?(gps)
Flags: needinfo?(dustin)
Flags: needinfo?(bstack)
Updated•7 years ago
Assignee: nobody → wcosta
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•6 years ago
Component: Operations → Operations and Service Requests