Trees closed for backlog on Windows builds

RESOLVED FIXED

Status

Product: Release Engineering
Component: Buildduty
Priority: P1
Severity: major
Status: RESOLVED FIXED
Reported: 6 months ago
Last modified: 6 months ago

People

(Reporter: aselagea, Assigned: garndt)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

6 months ago
Closed m-i and autoland since some Windows builds have been pending for more than 5 hours.
Priority: -- → P1
(Reporter)

Comment 1

6 months ago
From #buildduty:

Fri 08:46:42 UTC [7321] [moc] nagios1.private.releng.scl3.mozilla.com:Pending builds is CRITICAL: CRITICAL Pending builds: 442 on gecko-3-b-win2012 (http://m.mozilla.org/Pending+builds)

I looked at https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/ and we don't seem to spawn new workers fast enough when load is high.
e.g. when we reached 300 pending tasks, the number of running workers was at 38. Two hours later, the backlog was at 442, but the number of running workers had only reached 40.
I checked the recent pricing history for c4.4xlarge (gecko-3-b-win2012) and noticed it was sitting at ~$0.19 most of the time, which rules out the hypothesis that we can't spawn new workers because the spot price rose above our bid.

Also, is there a way to see if the existing running workers are actually "running" tasks?
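The nagios alert quoted above carries the two numbers that matter for this kind of triage (pending count and worker type). A minimal sketch of pulling them out programmatically — `parse_pending_alert` is a hypothetical helper, not part of the actual buildduty tooling:

```python
import re

# Matches the "Pending builds: <count> on <worker-type>" fragment of a
# nagios pending-builds alert line.
ALERT_RE = re.compile(r"Pending builds: (\d+) on (\S+)")

def parse_pending_alert(line):
    """Return (pending_count, worker_type) extracted from a nagios alert line."""
    m = ALERT_RE.search(line)
    if m is None:
        raise ValueError("not a pending-builds alert: %r" % line)
    return int(m.group(1)), m.group(2)

# The alert line quoted in this comment:
alert = ("Fri 08:46:42 UTC [7321] [moc] nagios1.private.releng.scl3.mozilla.com:"
         "Pending builds is CRITICAL: CRITICAL Pending builds: 442 on "
         "gecko-3-b-win2012 (http://m.mozilla.org/Pending+builds)")
print(parse_pending_alert(alert))  # (442, 'gecko-3-b-win2012')
```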
(Assignee)

Comment 2

6 months ago
Not 100% sure yet, but it appears that we are hitting some EBS limits within us-east-1 (even though our limit is very high and it doesn't seem like we're going over it).

For example:
Instance: i-0007b1fe01a076abf
Termination reason: Client.VolumeLimitExceeded: Volume limit exceeded

It's possible that volume snapshots are contributing to this limit. I have opened a service request with AWS to get assistance with us-east-1, and have also removed that region from the provisioned regions for gecko-3-b-win2012 and gecko-t-win10-64.
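A sketch of how one might scan for terminations like the one above. The record shape mirrors boto3's EC2 `describe_instances` output, but the data here is illustrative (the second instance ID is made up); in practice you would page through `describe_instances` with a filter on terminated instances rather than hand-build the list:

```python
def volume_limit_terminations(instances):
    """Return IDs of instances whose termination reason was the EBS volume limit."""
    return [
        inst["InstanceId"]
        for inst in instances
        if "Client.VolumeLimitExceeded" in inst.get("StateReason", {}).get("Message", "")
    ]

# Illustrative records shaped like boto3 describe_instances output.
sample = [
    {"InstanceId": "i-0007b1fe01a076abf",   # from this comment
     "StateReason": {"Message": "Client.VolumeLimitExceeded: Volume limit exceeded"}},
    {"InstanceId": "i-0123456789abcdef0",   # hypothetical, normal shutdown
     "StateReason": {"Message": "Client.UserInitiatedShutdown: User initiated shutdown"}},
]
print(volume_limit_terminations(sample))  # ['i-0007b1fe01a076abf']
```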
(Assignee)

Comment 3

6 months ago
AWS case id: 4290484321
(Assignee)

Comment 4

6 months ago
AWS has informed us that there is a hard 5000 limit on the number of EBS volumes.  We were oversubscribed and running almost 5800.

They will initiate a request to increase it to 7000 while we work with them to understand why we are hitting these limits during non-peak hours. Something is using more volumes than usual, I think.
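The arithmetic from this comment, as a trivial headroom check (figures are the ones stated above, not queried live; with boto3 one would count volumes by paging through `describe_volumes`):

```python
def volume_headroom(volume_count, limit):
    """Return remaining EBS volume headroom; negative means oversubscribed."""
    return limit - volume_count

# Figures from this comment: ~5800 volumes against the hard 5000 limit,
# then against the requested 7000 limit.
print(volume_headroom(5800, 5000))  # -800: oversubscribed
print(volume_headroom(5800, 7000))  # 1200 volumes of headroom after the increase
```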
(Assignee)

Comment 5

6 months ago
AWS has increased the limit to 10k volumes for our account and we are now getting instances. There is still a backlog, but I suspect it will clear up as instances come online.

Support also noticed we were getting close to some other limits, so those were increased as well.
(Assignee)

Updated

6 months ago
See Also: → bug 1391603
(Assignee)

Updated

6 months ago
Assignee: nobody → garndt
(Reporter)

Comment 6

6 months ago
Trees reopened since the backlog on gecko-3-b-win2012 cleared up.

As more and more Windows builds completed, they generated a large number of test tasks on gecko-t-win7-32 (~3.5k atm). We'll need to keep an eye on this during the next few hours.
Severity: blocker → major
(Reporter)

Comment 7

6 months ago
Per IRC:

"<RyanVM> closed autoland for mass servo bustage
17:30:33 and inbound for windows clipboard bustage"
(Assignee)

Comment 8

6 months ago
Going to keep some trees closed until the test backlog clears up.  The limit issue that blocked builders also caused testers to be backlogged.  Once builds started to complete, they added to an already growing test backlog.
(Assignee)

Comment 9

6 months ago
Trees have been reopened after this issue was resolved. Limits on the AWS side have been increased, which should reduce the risk of this happening again anytime soon.
Status: NEW → RESOLVED
Last Resolved: 6 months ago
Resolution: --- → FIXED