Closed Bug 1391564 Opened 7 years ago Closed 7 years ago

Trees closed for backlog on Windows builds

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aselagea, Assigned: garndt)

References

Details

Closed m-i and autoland since some Windows builds have been pending for more than 5 hours.
Priority: -- → P1
From #buildduty:

Fri 08:46:42 UTC [7321] [moc] nagios1.private.releng.scl3.mozilla.com:Pending builds is CRITICAL: CRITICAL Pending builds: 442 on gecko-3-b-win2012 (http://m.mozilla.org/Pending+builds)

I looked at https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/ and we don't seem to spawn new workers fast enough when the load is high.
E.g. when we reached 300 pending tasks, the number of running workers was 38. Two hours later, the backlog was at 442, but the number of running workers had only reached 40.
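To make the scale-up lag concrete, here is a minimal sketch using only the two data points quoted above; the helper name is hypothetical, not a provisioner API.

```python
# Rough sanity check of the observations above: how much did the backlog
# grow per additional running worker? Numbers are taken from this comment.

def scale_up_lag(samples):
    """Given (pending, running) samples over time, return pending tasks
    added per additional running worker (inf if no workers were added)."""
    (p0, r0), (p1, r1) = samples[0], samples[-1]
    delta_pending = p1 - p0
    delta_running = r1 - r0
    return delta_pending / delta_running if delta_running else float("inf")

# First observation: 300 pending / 38 running.
# Two hours later: 442 pending / 40 running.
lag = scale_up_lag([(300, 38), (442, 40)])
print(lag)  # 71.0 -- the backlog grew 71 tasks for each new worker
```

A healthy scale-up would drive this ratio toward (or below) zero as capacity catches up with demand.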
I checked the recent pricing history for c4.4xlarge (gecko-3-b-win2012) and noticed it was sitting at ~$0.19 most of the time. That rules out the hypothesis that we can't spawn new workers because the spot price rose above our bid.
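The spot-price check above can be sketched as a pure function over a sample of recent prices; the prices are illustrative, and in practice they would come from the EC2 spot price history (e.g. `aws ec2 describe-spot-price-history --instance-types c4.4xlarge`). The bid value here is an assumption for illustration only.

```python
# Hedged sketch of the spot-price check: could recent spot prices for
# c4.4xlarge have exceeded our bid and blocked new instances?

def outbid(spot_prices, max_bid):
    """Return True if any recent spot price exceeded our bid."""
    return any(p > max_bid for p in spot_prices)

recent = [0.19, 0.19, 0.20, 0.19]   # observed ~$0.19 most of the time
print(outbid(recent, max_bid=0.50))  # False -> pricing isn't the blocker
```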

Also, is there a way to see whether the existing running workers are actually running tasks?
I'm not 100% sure yet, but it appears that we are hitting some EBS limits within us-east-1 (even though our limit is very high and it doesn't seem like we're going over it).

For example:
Instance: i-0007b1fe01a076abf
Termination reason: Client.VolumeLimitExceeded: Volume limit exceeded
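A quick way to confirm this hypothesis is to tally termination reasons across the affected instances; the sketch below is illustrative, using the one instance quoted above as sample data (in practice the reasons would come from the EC2 API or provisioner logs).

```python
# Sketch: group instance termination reasons by their error code to see
# how many terminations trace back to Client.VolumeLimitExceeded.

def count_reasons(terminations):
    """terminations: iterable of (instance_id, reason_string).
    Returns a dict mapping the reason's error code to a count."""
    counts = {}
    for _instance_id, reason in terminations:
        code = reason.split(":")[0]
        counts[code] = counts.get(code, 0) + 1
    return counts

sample = [("i-0007b1fe01a076abf",
           "Client.VolumeLimitExceeded: Volume limit exceeded")]
print(count_reasons(sample))  # {'Client.VolumeLimitExceeded': 1}
```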

It's possible that the volume snapshots are contributing to this limit. I have opened a service request with AWS to get assistance, and have also removed us-east-1 from the provisioned regions for gecko-3-b-win2012 and gecko-t-win10-64.
AWS case id: 4290484321
AWS has informed us that there is a hard limit of 5,000 EBS volumes. We were oversubscribed, running almost 5,800.
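The oversubscription check reduces to simple arithmetic on the figures quoted in this comment. In a real script the volume count would come from boto3 (roughly `len(boto3.client("ec2", region_name="us-east-1").describe_volumes()["Volumes"])`, shown here only as a comment since it needs credentials).

```python
# Minimal sketch of the limit check AWS support walked us through:
# compare the account's EBS volume count against the hard limit.
# Figures are the ones reported in this bug.

HARD_LIMIT = 5000   # the hard limit AWS quoted (later raised to 10k)
observed = 5800     # "running almost 5800" per the support case

over_by = max(0, observed - HARD_LIMIT)
print(over_by)  # 800 volumes over the limit
```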

They will initiate a request to increase the limit to 7,000 while we work with them to understand why we are hitting these limits during off-peak hours. I think something is using more volumes than usual.
AWS has increased the limit to 10k volumes for our account and we are now getting instances. There is still a backlog, but I suspect it will clear up as instances come online.

Support also noticed we were getting close to some other limits, so they increased those as well.
See Also: → 1391603
Assignee: nobody → garndt
Trees reopened since the backlog on gecko-3-b-win2012 cleared up.

As more and more Windows builds completed, they generated a large number of pending tests on gecko-t-win7-32 (~3.5k at the moment). We'll need to keep an eye on this over the next few hours.
Severity: blocker → major
Per IRC:

"<RyanVM> closed autoland for mass servo bustage
17:30:33 and inbound for windows clipboard bustage"
Going to keep some trees closed until the test backlog clears up. The limit issue that blocked the builders also caused the testers to become backlogged: once builds started to complete, they fed into an already growing test backlog.
Trees have been reopened now that this issue is resolved. Limits on the AWS side have been increased, which should reduce the risk of this happening again anytime soon.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard