Tasks on servo-docker-worker stuck in a "pending" state

Status: RESOLVED FIXED
Type: task
Priority: --
Severity: blocker
Opened: 10 months ago
Last updated: 5 months ago
Reporter: SimonSapin
Assignee: Unassigned

In an attempt to reduce the time needed to compile all of Servo, I changed the definition at [1] to:

* Use only the c5.4xlarge EC2 instance type instead of smaller instance types. (I don’t remember the exact previous set.)
* Change `restrictCpu` from true to false. (If I understand correctly, this corresponds to [2] and causes each job to have only one CPU core available.)

Since then, newly-scheduled tasks stay in a "pending" state and never seem to start running. See for example https://tools.taskcluster.net/groups/E6tTuGOwTWetFXSGlVEvZw/tasks/E6tTuGOwTWetFXSGlVEvZw/runs/0 (pending for 17 hours).

[3] shows 12 instances running, which is the configured maxCapacity. I suspect they are running as far as EC2 is concerned, but the docker-worker software in them did not boot correctly and never got around to claiming tasks from the queue. [4] lists some errors, but has no informative detail. (The Message column shows "----- HIDDEN -----" for me, and I do not have access to the AWS console to try to find out more.) Are there other relevant logs?

Random guess: are the AMIs listed in [1] not appropriate for this new EC2 instance type?

Other random guess: I tried reverting the restrictCpu change, but I suspect this only applies to new workers, and I don’t have the ec2-manager:manage-resources:servo-docker-worker scope needed to hit the red Terminate button on [3].


[1] https://tools.taskcluster.net/aws-provisioner/servo-docker-worker/view
[2] https://github.com/taskcluster/docker-worker/blob/ab26dffa3cf5544d96c8009f29bfee38307a6943/config.yml#L51-L53
[3] https://tools.taskcluster.net/aws-provisioner/servo-docker-worker/resources
[4] https://tools.taskcluster.net/aws-provisioner/servo-docker-worker/health
Depends on: 1492123
(In reply to Simon Sapin (:SimonSapin) from comment #0)
> In an attempt to reduce the time needed to compile all of Servo, I changed
> the definition at [1] to:
> 
> * Use only the c5.4xlarge EC2 instance type instead of smaller instance
> types. (I don’t remember the exact previous set.)

Seems to be m3/r3 .xlarge, based on the database.

> * Change `restrictCpu` from true to false. (If I understand correctly,
> this corresponds to [2] and causes each job to have only one CPU core
> available.)

Not quite, but I can see where the confusion is.  What it really means is that the worker will run the same number of jobs concurrently as the number of CPU cores.  That means a quad-core machine will run four jobs concurrently.

What you'd need to do is set the capacity to 1 and remove restrictCpu, I think.  Wander can confirm this, as he's more knowledgeable about docker-worker internals.  From a provisioning standpoint, this is the correct behaviour for what you want.  If you add multiple instanceTypes to the worker type, you can use the utility factor to balance slower and faster machines; it's a weight applied in a calculation of relative prices.
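
To make that concrete, here's a minimal sketch of what the instanceTypes section of the worker type definition might look like, assuming the usual aws-provisioner fields (instanceType, capacity, utility); the numbers are only illustrative:

  "instanceTypes": [
    {
      "instanceType": "c5.4xlarge",
      "capacity": 1,
      "utility": 2
    },
    {
      "instanceType": "c4.4xlarge",
      "capacity": 1,
      "utility": 1.8
    }
  ]

Here capacity: 1 makes each machine claim a single task, and the utility values weight the relative spot prices so that a faster type wins unless it's proportionally more expensive.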

> Since then, newly-scheduled tasks stay in a "pending" state and never seem
> to start running. See for example
> https://tools.taskcluster.net/groups/E6tTuGOwTWetFXSGlVEvZw/tasks/
> E6tTuGOwTWetFXSGlVEvZw/runs/0 (pending for 17 hours).
> 
> [3] shows 12 instances running, which is the configured maxCapacity. I
> suspect they are running as far as EC2 is concerned, but the docker-worker
> software in them did not boot correctly and never got around to claiming
> tasks from the queue. [4] lists some errors, but has no informative detail.
> (The Message column shows "----- HIDDEN -----" for me, and I do not have
> access to the AWS console to try to find out more.) Are there other
> relevant logs?

There are.  We intentionally hide the error messages because they are free-form strings from EC2, and we aren't sure they are safe to display verbatim.

I looked in the database with this query:

select region, "workerType", "instanceType", called, code, message from awsrequests where method = 'runInstances' and error and "workerType" = 'servo-docker-worker' order by called desc;

And since Sep-17, the only errors requesting instances are for c5.4xlarge and they're all failing with the code "InsufficientInstanceCapacity" and the message "There is no Spot capacity available that matches your request".  That might mean the price isn't set high enough, or that there just aren't enough physical machines, or something else entirely.

I also looked at our termination history (i.e., why the machines we launched were shut down) for this worker type with this query:

select * from terminations where "workerType" = 'servo-docker-worker' and "launched" > now() - interval '2 days' and "instanceType" = 'c5.4xlarge';

And since Sep-17, every single machine that has terminated had a termination reason of 'Server.SpotInstanceTermination: Spot instance termination'.  Both lived for a total of 1.5 hours.  I'm not sure how many builds that is, but I doubt it's enough to make an appreciable difference.  Currently, you have these instances running:

  region   | instanceType | count
-----------+--------------+-------
 us-east-1 | c5.4xlarge   |     6
 us-west-1 | c5.4xlarge   |     1
 us-west-2 | c5.4xlarge   |     5

So you should have capacity for 192 jobs (16 vCPUs * 12 instances), which is likely not what you actually want.

My guess is that there's a shortage of this instance type in the configured regions.  If you can use other instanceTypes, e.g. c4.4xlarge and other c5 sizes, it might be worth adding them into the mix.  It might also be worthwhile to add other regions, like eu-central-1 and/or us-east-2.  The maxPrice configured should be well within allowed parameters: it's set to $8 and your machine costs ~$0.30, so I suspect it's a lack of availability.
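
For reference, adding a region would look roughly like the sketch below, assuming the usual shape of the worker type's regions list; each region needs its own region-specific AMI, and the id here is just a placeholder:

  "regions": [
    {
      "region": "eu-central-1",
      "launchSpec": {
        "ImageId": "ami-0000000000example"
      }
    }
  ]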


> Random guess: are the AMIs listed in [1] not appropriate for this new EC2
> instance type?

It's possible, but I don't know a lot about debugging Docker Worker.  Wander, can you verify whether these instances are starting up and accepting jobs?  Here are some relevant instances in each region for looking up logs:

  region   |         id
-----------+---------------------
 us-east-1 | i-06d6a21e93286e686
 us-east-1 | i-074b29ae157fdecc4
 us-east-1 | i-0793ba48cef9a0194
 us-east-1 | i-08e0989cbae58010d
 us-west-1 | i-013f1d73d5fd7f216
 us-west-1 | i-08476e582e5d0d067
 us-west-1 | i-09454f5ba52fc9dc6
 us-west-1 | i-0d370fa8d7717cd80
 us-west-2 | i-03e123de74a3fa48c
 us-west-2 | i-04542f972161a4c42
 us-west-2 | i-0949b5eb68f5bfd88
 us-west-2 | i-0a9850a244710a16e

> Other random guess: I tried reverting the restrictCpu change but I suspect
> this only applies to new workers, and I don’t have the
> ec2-manager:manage-resources:servo-docker-worker scope needed to hit the red
> Terminate button on [3].

It does require new machines, but restrictCpu is probably not what you want.  I am happy to click this button, but I agree we should get you the scope to do it yourself.
Flags: needinfo?(wcosta)
The EBS volume configuration was missing from the provisioner configuration. The tasks run now.
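
For context, the missing piece would be something along these lines in the worker type's launchSpec, using the standard EC2 block device mapping fields; the device name and volume size here are just illustrative:

  "launchSpec": {
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/xvdb",
        "Ebs": {
          "VolumeSize": 120,
          "VolumeType": "gp2",
          "DeleteOnTermination": true
        }
      }
    ]
  }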
Status: NEW → RESOLVED
Closed: 10 months ago
Flags: needinfo?(wcosta)
Resolution: --- → FIXED
restrictCpu is indeed not what I want, either with its actual meaning or the one I incorrectly guessed.

Wander gave me access to the terminate button in bug 1492123, thanks.

I suspect that lack of availability at the EC2 level is not the (only) problem. Before I terminated them earlier today, https://tools.taskcluster.net/aws-provisioner/servo-docker-worker/resources showed 12 instances (which is maxCapacity) that had been running for 15 to 17 hours (which matches when Dustin pushed the Terminate All button for me yesterday).
Component: Service Request → Operations and Service Requests