Bug 1122582 (Closed): Opened 9 years ago, Closed 9 years ago

Linux build jobs falling behind

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: All
OS: Linux
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Unassigned)

Details

11:02:21 AM - RyanVM: hrm, Android builds pending for 45+min on this push? https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/654e15b4dc9b
11:02:47 AM - RyanVM: ASAN builds are backing up on inbound too

Trees closed.
11:18:33 AM - RyanVM|sheriffduty: also affects SM jobs, Valgrind, Hazard Analysis
11:18:41 AM - RyanVM|sheriffduty: in addition to the previously-mentioned Android/ASAN jobs
11:19:06 AM - RyanVM|sheriffduty: and linux32/64 PGO
11:19:18 AM - RyanVM|sheriffduty: and B2G device images
We seem to be caught up. Trees reopened at 09:32 PT. I'll be keeping an eye on pending jobs once pushes start picking up again.
After looking at many different areas, I feel this is just a capacity issue (or perhaps us not utilising our full capacity).

This tells me two things off the bat:
* It's too darned hard to identify when we're "at capacity" right now
* There is an urgent need for capacity bumps (or for fixing things so we can use our current capacity limits, if that's the real issue)


I'll leave it for catlee to add any parting conclusions.

===

[12:29:22]	Callek	RyanVM|sheriffduty: catlee: I'm going to put *my* stake in the ground that this closure is due to capacity
[12:29:49]	Callek	if we're unable to hit our capacity limits for some reason, that's a *different* (but equally valuable-to-fix) bug than expanding capacity itself
[12:30:12]	catlee	I think you're kind of right
[12:30:16]	Callek	ulfr: I'm curious if you have any insight here
[12:30:22]	catlee	1/2 of our slaves aren't usable
[12:30:33]	catlee	I think
[12:30:42]	catlee	also, we've greatly increased our jacuzzi load
[12:31:08]	catlee	leaving very few (of the already reduced) names available for non-jacuzzi jobs
[12:31:10]	Callek	catlee: maybe we need to move jacuzzi-allocator to relengapi, to get better throughput now?
[12:31:17]	catlee	unrelated
[12:31:22]	Callek	oooo you mean load by terms of number of hosts used
[12:31:25]	Callek	got it
[12:31:43]	catlee	we have 498 names usable by aws_watch_pending
[12:32:02]	catlee	but 698 in slavealloc/buildbot-configs
[12:32:06]	catlee	we have 344 names in jacuzzis
[12:32:19]	catlee	leaving 154 slaves to do all non-jacuzzi'ed jobs
[12:32:48]	RyanVM|sheriffduty	Callek: I've reopened the trees, but I'm leaving the bug open for whatever parting thoughts you want to add
[12:32:58]	catlee	well, that's my theory
[12:33:09]	catlee	testing it now
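To make the arithmetic above explicit: of the 698 names in slavealloc/buildbot-configs, only 498 are usable by aws_watch_pending, and 344 of those are reserved for jacuzzis, leaving 154 names for all non-jacuzzi jobs. A minimal sketch of that bookkeeping (function and variable names are hypothetical; the numbers come straight from the log above, not from the real aws_watch_pending code):

# Back-of-the-envelope capacity split using the counts quoted in the IRC log.
def capacity_split(total_in_slavealloc, usable_by_watch_pending, in_jacuzzis):
    """Return (names unusable by watch_pending, names left for non-jacuzzi jobs)."""
    unusable = total_in_slavealloc - usable_by_watch_pending
    non_jacuzzi = usable_by_watch_pending - in_jacuzzis
    return unusable, non_jacuzzi

unusable, non_jacuzzi = capacity_split(
    total_in_slavealloc=698,
    usable_by_watch_pending=498,
    in_jacuzzis=344,
)
print("unusable names:", unusable)          # 200 (roughly the 199 aws-us-* entries found later)
print("non-jacuzzi capacity:", non_jacuzzi) # 154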
There are also existing bugs about capacity:

[11:35:46]	rail	there is a bug..
[11:35:49]	rail	https://bugzilla.mozilla.org/show_bug.cgi?id=1090568
[11:36:05]	rail	blocked by https://bugzilla.mozilla.org/show_bug.cgi?id=1090139
We discussed this a bit at the buildduty stand-up meeting today. We came up with a few ideas:

* first, we need better data/monitoring so we can better diagnose where the problem lies. The buildduty team is going to scramble to try to get some of this monitoring in place next week so we can make a better decision re: next steps.
* we should, at a minimum, replace the build/try capacity we lost by disabling the hardware machines in bug 1106922. This will amount to adding ~20 instances per pool. That *should* be enough to mitigate build capacity issues.
* once better monitoring is in place, we can make a similar assessment about test jobs. A naive look at the pending graphs suggests we run at full capacity on the tst-linux64 nodes for about 6 hrs of the day (see the sketch below for one way to derive that figure). Is this OK, or would adding, say, 100 more test nodes let us clear this load more quickly at peak?

I'll link bugs as they're filed.
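For reference, the "full capacity for about 6 hrs of the day" figure could be derived from running/pending samples roughly like this. This is a hedged sketch with made-up data; the real monitoring would pull its samples from buildapi, graphite, or whatever the buildduty team ends up using:

# Rough sketch: hours per day a pool spends saturated.
# Assumes per-minute (running, pending) samples for one pool are already
# available; how they are collected is out of scope here.
def hours_at_capacity(samples, pool_size, sample_minutes=1):
    """samples: iterable of (running, pending) tuples covering one day."""
    saturated = sum(1 for running, pending in samples
                    if running >= pool_size and pending > 0)
    return saturated * sample_minutes / 60.0

# Fake example: a 100-node pool saturated for 6 of 24 hours.
fake_day = [(100, 50)] * 360 + [(80, 0)] * 1080
print(hours_at_capacity(fake_day, pool_size=100))  # 6.0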
One issue that I suspected on Friday but just confirmed was that we have quite a number of build slaves on slavealloc with their region set to "aws-us-east-1" or "aws-us-west-2". aws_watch_pending only cares about machines in "us-east-1" or "us-west-2". I moved bld-linux64-spot-200 from "aws-us-east-1" to "us-east-1", and it got used eventually.

We have 698 bld-linux64-spot machines in slavealloc; 199 of those are in aws-us-* and are therefore unusable.

I propose we s/aws-us-(.*)/us-\1/ on these slavealloc entries.
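For illustration, that substitution is the same transformation as the snippet below; the actual change would be applied to the slavealloc entries themselves, and the example region list here is made up:

import re

# Strip the "aws-" prefix so aws_watch_pending recognizes the region.
def fix_region(region):
    return re.sub(r"^aws-us-(.*)$", r"us-\1", region)

for region in ["aws-us-east-1", "aws-us-west-2", "us-east-1", "scl3"]:
    print(region, "->", fix_region(region))
# aws-us-east-1 -> us-east-1
# aws-us-west-2 -> us-west-2
# us-east-1 -> us-east-1  (unchanged)
# scl3 -> scl3            (unchanged)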
(In reply to Chris AtLee [:catlee] from comment #6)
> One issue that I suspected on Friday but just confirmed was that we have
> quite a number of build slaves on slavealloc with their region set to
> "aws-us-east-1" or "aws-us-west-2". aws_watch_pending only cares about
> machines in "us-east-1" or "us-west-2". I moved bld-linux64-spot-200 from
> "aws-us-east-1" to "us-east-1", and it got used eventually.
> 
> We have 698 bld-linux64-spot machines in slavealloc, 199 of those are in
> aws-us-* and are therefore unusable.
> 
> I propose we s/aws-us-(.*)/us-\1/ on these slavealloc entries.

I've updated the entries in slavealloc, and removed the offending entries in the datacenters table. We now only have 3 datacenters in slavealloc (scl3, us-east-1, us-west-2) and all slaves and masters match up against one of them.
(In reply to Chris Cooper [:coop] from comment #5)
> * we should, at a minimum, replace the build/try capacity we lost by
> disabling the hardware machines in bug 1106922. This will amount to adding
> ~20 instances per pool. That *should* be enough to mitigate build capacity
> issues.

With the instances recovered in comment #7, I think we can forgo this step for now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard