Bug 1034034 (task) - Wait times > 16h, pending builds > 7000, ix jobs ~3000 reftest/crashtest/jsreftest, 10.8 ~10h behind

Status: RESOLVED FIXED (opened 10 years ago; closed 10 years ago)
Component: Infrastructure & Operations Graveyard :: CIDuty
Severity: normal
Tracking: not tracked
Reporter: pmoore
Assignee: unassigned

No description provided.
Just had a quick meeting with kmoir, coop, and RyanVM. It seems the main culprits are:

1) Far too many unnecessary try jobs running (common problem) - RyanVM is nuking them currently
2) New Android 2.3 tests are currently running on ix machines, since they cannot run in Amazon yet, and the ix pool is not able to handle the load at the moment

kmoir is going to work with gbrown to try to get jobs migrated to Amazon as soon as possible, to reduce load on the ix machines, and RyanVM is going to continue nuking try jobs.

Other points:
* We might want to consider stricter measures to solve problem 1) above - e.g. limiting the number of pending try jobs allowed per user at any time, so you have to kill old try jobs before running new ones
* Some jobs pending on try have already landed in mozilla-central - we might want to consider adding build API methods to search for these types of pending jobs, so that they could be killed
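The second point above could be sketched as a small matching helper: compare pending try jobs against revisions that have already landed, and collect the ones that are now redundant. This is purely illustrative; the job shape and field names are invented, and buildapi's real schema differs.

```python
# Hypothetical sketch: find pending try jobs whose revision has already
# landed on mozilla-central, so they can be cancelled. The job dicts and
# field names here are invented for illustration only.

def jobs_to_cancel(pending_try_jobs, landed_revisions):
    """Return the pending try jobs whose revision has already landed."""
    landed = set(landed_revisions)
    return [job for job in pending_try_jobs if job["revision"] in landed]

pending = [
    {"id": 1, "revision": "abc123", "builder": "linux64-ix reftest"},
    {"id": 2, "revision": "def456", "builder": "mtnlion crashtest"},
]
stale = jobs_to_cancel(pending, ["abc123"])
print([job["id"] for job in stale])  # → [1]
```

In a real deployment the `landed_revisions` input would come from querying the mozilla-central pushlog, and the cancellation itself would go through buildapi; both are out of scope for this sketch.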
(In reply to Pete Moore [:pete][:pmoore] from comment #1)
> Other points:

* Switching some more platforms/job types on Try to be non-default, so people have to explicitly request them using trychooser, rather than getting them as part of "-p all" or "-u all".
We briefly discussed using the per-platform logic on Try, though we'd need a way of overriding that behavior to avoid the potential pitfalls of doing so. But that would probably take care of a lot of the cases we're seeing.
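The "non-default on Try" idea, with an explicit override, could look roughly like this: `-p all` expands only to platforms marked default, while naming a platform explicitly still schedules it. The platform names and the default set below are invented for illustration; the real trychooser/buildbotcustom logic is more involved.

```python
# Hypothetical sketch of non-default platforms on Try: "-p all" excludes
# non-default platforms, but an explicit platform list overrides that.
# Platform names and the default set are invented for illustration.

DEFAULT_PLATFORMS = {"linux64", "macosx64", "win32"}
ALL_PLATFORMS = DEFAULT_PLATFORMS | {"linux64-ix", "android-2.3", "win64"}

def platforms_for(request):
    """Expand a trychooser-style '-p' value into concrete platforms."""
    if request == "all":
        # Non-default platforms are skipped unless requested by name.
        return sorted(DEFAULT_PLATFORMS)
    wanted = set(request.split(","))
    return sorted(wanted & ALL_PLATFORMS)  # explicit request wins
```

For example, `platforms_for("all")` would omit `linux64-ix`, while `platforms_for("linux64-ix,win64")` would schedule both explicitly named platforms.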
Depends on: 1034055
(In reply to Ed Morley [:edmorley UTC+0] from comment #4)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #3)
> > We briefly discussed using the per-platform logic on Try, though we'd need a
> > way of overriding that behavior
> 
> try non-default already allows a way to override - support is added both in
> buildbotcustom (or wherever) and trychooser :-)

Ah comment 3 is talking about something else (looking at what files changed when deciding what jobs to schedule) - ignore comment 4.
The situation really hasn't gotten any better over the course of the day. The US has a holiday tomorrow, so I'm going to suggest that non-US folks meet up early EST tomorrow to come up with some solutions for reducing/coping with load.
* Implement bug 1034055, then deploy and test it on a branch. This should help move jobs from IX slaves to AWS. I think we can start deploying it on Tue/Wed next week.

* As a result we will need more linux64 test masters. Can be done any time. See bug 1011488 for the latest example.

Not sure if killing some tests is an option... (as usual :( )
(In reply to Chris Cooper [:coop] from comment #6)
> The situation really hasn't gotten any better over the course of the day.
> The US is on holidays tomorrow, so I'm going to suggest that non-US folks
> meet up early EST tomorrow to come up with some solutions for
> reducing/coping with load.

Here's what we came up with as short-term solutions this morning:

* linux64-ix load:
** migrate 2.3 emulator jobs to new AWS slavetype
*** kmoir is working on this in bug 1034055
*** will likely require setup of new buildbot-masters

* mtnlion:
** re-image 10 builders from build/try pool
*** also re-image builders that are currently in staging or need repair <- that capacity is currently unused

* w864:
** resurrect/repair 7 slaves that are currently disabled

* try usage:
** coop to collect current information sources re: try usage (newsgroup posts, blog posts, wiki pages, MDN articles) and synthesize best practices info for using this shared resource.
*** present best practices at Monday MoFo call
*** send best practices synopsis to all try high scores users with >1000 hours every week
*** translate best practices into programmatic checks for trychooser
*** part of this is killing jobs you don't need any more
**** this should also be communicated to project branch owners

Some longer term solutions:

* reduce # of tests run per check-in
** this is the most important thing, but not surprisingly it's a big, cross-group project to decide "what testing is representative?"
** backfill tooling still required
*** arbitrary job capability gets us part way

* set limit on # of concurrent pushes per user

* automatically cancel try jobs that have already landed on inbound/m-c
** devs should be canceling these jobs themselves

I'll be tackling the short-term solutions as buildduty today.
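The per-user limit proposed above could be sketched as a simple gate: reject a new push while the user already has too many jobs pending, forcing them to cancel old jobs first. The cap value and the job shape here are invented for illustration; no such policy number was decided in this bug.

```python
# Hypothetical sketch of a per-user cap on pending try jobs. The limit
# value and the job dict shape are invented for illustration.
from collections import Counter

PENDING_LIMIT = 50  # assumed cap, not a real policy number

def can_push(user, pending_jobs):
    """Allow a push only if the user is under the pending-job cap."""
    counts = Counter(job["user"] for job in pending_jobs)
    return counts[user] < PENDING_LIMIT
```

A real implementation would also need to decide where to enforce this (push hook vs. scheduler) and how to surface the rejection to the developer, neither of which was settled in this discussion.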
(In reply to Chris Cooper [:coop] from comment #8)
> * try usage:
> ** coop to collect current information sources re: try usage (newsgroup
> posts, blog posts, wiki pages, MDN articles) and synthesize best practices
> info for using this shared resource.
> *** present best practices at Monday MoFo call

This is probably a good starting point and something we've already been linking people to:
https://wiki.mozilla.org/Sheriffing/How:To:Recommended_Try_Practices
(In reply to Chris Cooper [:coop] from comment #8)
> * mtnlion:
> ** re-image 10 builders from build/try pool
> *** also re-image builders that are currently in staging or need repair <-
> that capacity is currently unused

Bug 1034715 filed.

> * w864:
> ** resurrect/repair 7 slaves that are currently disabled

This is bug 1004813.
 
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #9)
> This is probably a good starting point and something we've already been
> linking people to:
> https://wiki.mozilla.org/Sheriffing/How:To:Recommended_Try_Practices

That's a good start. I'll plan on (re)presenting it at the MoFo call on Monday, and then doing the other follow-up work listed above.
I mentioned that we have spare ix capacity in a couple of meetings, but I see that the info didn't make it into bugzilla.
We have ~100 windows ix builders that we can turn off and repurpose due to recent seamicro builder additions. 

Running emulators on AWS is inefficient, as those VMs do not expose the vmx bit in /proc/cpuinfo. Why aren't we looking at re-purposing existing ix builders?
Flags: needinfo?(laura)
See Also: → 1035269
Depends on: 1035304
(In reply to Taras Glek (:taras) from comment #11)
> I mentioned that we have spare ix capacity in a couple of meetings, but I
> see that the info didn't make it into bugzilla.
> We have ~100 windows ix builders that we can turn off and repurpose due to
> recent seamicro builder additions. 
> 
> Running emulators on aws is inefficient as those vms do not expose vmx bit
> in /proc/cpuinfo. Why aren't we looking at re-purposing existing ix builders?

The builders have a different config than our existing tester ix platforms, so they're not drop-in replacements: https://bugzilla.mozilla.org/show_bug.cgi?id=1034055#c21
Flags: needinfo?(laura)
Pete, I think this bug can be closed now?
Closed per comment 13.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard