Bug 1122582 (Closed): Opened 9 years ago, Closed 9 years ago

Linux build jobs falling behind

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: All
OS: Linux
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Unassigned)

Details

11:02:21 AM - RyanVM: hrm, Android builds pending for 45+min on this push? https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/654e15b4dc9b
11:02:47 AM - RyanVM: ASAN builds are backing up on inbound too

Trees closed.
11:18:33 AM - RyanVM|sheriffduty: also affects SM jobs, Valgrind, Hazard Analysis
11:18:41 AM - RyanVM|sheriffduty: in addition to the previously-mentioned Android/ASAN jobs
11:19:06 AM - RyanVM|sheriffduty: and linux32/64 PGO
11:19:18 AM - RyanVM|sheriffduty: and B2G device images
We seem to be caught up. Trees reopened at 09:32 PT. I'll be keeping an eye on pending jobs once pushes start picking up again.
After looking at many different areas, I feel this is just a capacity issue (or perhaps us not utilising our full capacity).

This tells me two things off the bat:
* It's too darned hard to identify when we're "at capacity" right now
* There is an urgent need for capacity bumps (or for fixing things so we can use our current capacity limits, if that's the real issue)


I'll leave it for catlee to add any parting conclusions.

===

[12:29:22]	Callek	RyanVM|sheriffduty: catlee: I'm going to put *my* stake in the ground that this closure is due to capacity
[12:29:49]	Callek	if we're unable to hit our capacity limits for some reason, that's a *different* (but equally valuable-to-fix) bug than expanding capacity itself
[12:30:12]	catlee	I think you're kind of right
[12:30:16]	Callek	ulfr: I'm curious if you have any insight here
[12:30:22]	catlee	1/2 of our slaves aren't usable
[12:30:33]	catlee	I think
[12:30:42]	catlee	also, we've greatly increased our jacuzzi load
[12:31:08]	catlee	leaving very few (of the already reduced) names available for non-jacuzzi jobs
[12:31:10]	Callek	catlee: maybe we need to move jacuzzi-allocator to relengapi, to get better throughput now?
[12:31:17]	catlee	unrelated
[12:31:22]	Callek	oooo you mean load by terms of number of hosts used
[12:31:25]	Callek	got it
[12:31:43]	catlee	we have 498 names usable by aws_watch_pending
[12:32:02]	catlee	but 698 in slavealloc/buildbot-configs
[12:32:06]	catlee	we have 344 names in jacuzzis
[12:32:19]	catlee	leaving 154 slaves to do all non-jacuzzi'ed jobs
[12:32:48]	RyanVM|sheriffduty	Callek: I've reopened the trees, but I'm leaving the bug open for whatever parting thoughts you want to add
[12:32:58]	catlee	well, that's my theory
[12:33:09]	catlee	testing it now
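To make the arithmetic above explicit: of the 698 names in slavealloc/buildbot-configs, only 498 are usable by aws_watch_pending, and 344 of those are reserved for jacuzzis, leaving 154 names for all non-jacuzzi jobs. A minimal sketch of that bookkeeping (function and variable names are hypothetical; the numbers come straight from the log above, not from the real aws_watch_pending code):

# Back-of-the-envelope capacity split using the counts quoted in the IRC log.
def capacity_split(total_in_slavealloc, usable_by_watch_pending, in_jacuzzis):
    """Return (names unusable by watch_pending, names left for non-jacuzzi jobs)."""
    unusable = total_in_slavealloc - usable_by_watch_pending
    non_jacuzzi = usable_by_watch_pending - in_jacuzzis
    return unusable, non_jacuzzi

unusable, non_jacuzzi = capacity_split(
    total_in_slavealloc=698,
    usable_by_watch_pending=498,
    in_jacuzzis=344,
)
print("unusable names:", unusable)          # 200 (roughly the 199 aws-us-* entries found later)
print("non-jacuzzi capacity:", non_jacuzzi) # 154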
There are also existing bugs about capacity:

[11:35:46]	rail	there is a bug..
[11:35:49]	rail	https://bugzilla.mozilla.org/show_bug.cgi?id=1090568
[11:36:05]	rail	blocked by https://bugzilla.mozilla.org/show_bug.cgi?id=1090139
We discussed this a bit at the buildduty stand-up meeting today. We came up with a few ideas:

* first, we need better data/monitoring so we can better diagnose where the problem lies. The buildduty team is going to scramble to try to get some of this monitoring in place next week so we can make a better decision re: next steps.
* we should, at a minimum, replace the build/try capacity we lost by disabling the hardware machines in bug 1106922. This will amount to adding ~20 instances per pool. That *should* be enough to mitigate build capacity issues.
* once better monitoring is in place, we can make a similar assessment about test jobs. A naive look at the pending graphs suggests we run at full capacity on the tst-linux64 nodes for about 6 hrs of the day (see the sketch below for one way to derive that figure). Is this OK, or would adding, say, 100 more test nodes let us clear this load more quickly at peak?

I'll link bugs as they're filed.
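For reference, the "full capacity for about 6 hrs of the day" figure could be derived from running/pending samples roughly like this. This is a hedged sketch with made-up data; the real monitoring would pull its samples from buildapi, graphite, or whatever the buildduty team ends up using:

# Rough sketch: hours per day a pool spends saturated.
# Assumes per-minute (running, pending) samples for one pool are already
# available; how they are collected is out of scope here.
def hours_at_capacity(samples, pool_size, sample_minutes=1):
    """samples: iterable of (running, pending) tuples covering one day."""
    saturated = sum(1 for running, pending in samples
                    if running >= pool_size and pending > 0)
    return saturated * sample_minutes / 60.0

# Fake example: a 100-node pool saturated for 6 of 24 hours.
fake_day = [(100, 50)] * 360 + [(80, 0)] * 1080
print(hours_at_capacity(fake_day, pool_size=100))  # 6.0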
One issue that I suspected on Friday but just confirmed was that we have quite a number of build slaves on slavealloc with their region set to "aws-us-east-1" or "aws-us-west-2". aws_watch_pending only cares about machines in "us-east-1" or "us-west-2". I moved bld-linux64-spot-200 from "aws-us-east-1" to "us-east-1", and it got used eventually.

We have 698 bld-linux64-spot machines in slavealloc; 199 of those are in aws-us-* and are therefore unusable.

I propose we s/aws-us-(.*)/us-\1/ on these slavealloc entries.
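For illustration, that substitution is the same transformation as the snippet below; the actual change would be applied to the slavealloc entries themselves, and the example region list here is made up:

import re

# Strip the "aws-" prefix so aws_watch_pending recognizes the region.
def fix_region(region):
    return re.sub(r"^aws-us-(.*)$", r"us-\1", region)

for region in ["aws-us-east-1", "aws-us-west-2", "us-east-1", "scl3"]:
    print(region, "->", fix_region(region))
# aws-us-east-1 -> us-east-1
# aws-us-west-2 -> us-west-2
# us-east-1 -> us-east-1  (unchanged)
# scl3 -> scl3            (unchanged)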
(In reply to Chris AtLee [:catlee] from comment #6)
> One issue that I suspected on Friday but just confirmed was that we have
> quite a number of build slaves on slavealloc with their region set to
> "aws-us-east-1" or "aws-us-west-2". aws_watch_pending only cares about
> machines in "us-east-1" or "us-west-2". I moved bld-linux64-spot-200 from
> "aws-us-east-1" to "us-east-1", and it got used eventually.
> 
> We have 698 bld-linux64-spot machines in slavealloc, 199 of those are in
> aws-us-* and are therefore unusable.
> 
> I propose we s/aws-us-(.*)/us-\1/ on these slavealloc entries.

I've updated the entries in slavealloc, and removed the offending entries in the datacenters table. We now only have 3 datacenters in slavealloc (scl3, us-east-1, us-west-2) and all slaves and masters match up against one of them.
(In reply to Chris Cooper [:coop] from comment #5)
> * we should, at a minimum, replace the build/try capacity we lost by
> disabling the hardware machines in bug 1106922. This will amount to adding
> ~20 instances per pool. That *should* be enough to mitigate build capacity
> issues.

With the instances recovered in comment #7, I think we can forgo this step for now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard