Closed
Bug 1034034
Opened 10 years ago
Closed 10 years ago
Wait times > 16h, pending builds > 7000, ix jobs ~3000 reftest/crashtest/jsreftest, 10.8 ~10h behind
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: pmoore, Unassigned)
Comment 1•10 years ago
Just had a quick meeting with kmoir, coop and RyanVM. It seems the main culprits are:

1) Far too many unnecessary try jobs running (common problem) - RyanVM is nuking them currently
2) New Android 2.3 tests are currently running on ix machines, since they cannot run in Amazon yet, and the ix pool is not able to handle the load at the moment

kmoir is going to work with gbrown to try to get jobs migrated to Amazon as soon as possible, to reduce load on ix machines, and RyanVM is going to continue nuking try jobs.

Other points:

* We might want to consider stricter measures to solve problem 1) above - e.g. limiting the number of pending try jobs allowed per user at any time, so you have to kill old try jobs before running new ones
* Some jobs pending on try have already landed in mozilla-central - we might want to consider creating build API methods to search for these types of pending jobs, so that they could be killed
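To make the per-user limit idea concrete, here is a minimal sketch in Python. It is not an existing build API: the job records (dicts with a `user` key), the function names and the limit of 50 are all hypothetical, chosen only to illustrate the check.

```python
from collections import Counter

# Hypothetical cap on pending try jobs per user; the bug does not name a number.
MAX_PENDING_PER_USER = 50


def users_over_limit(pending_jobs, limit=MAX_PENDING_PER_USER):
    """Return {user: pending_count} for users with more than `limit` pending jobs.

    `pending_jobs` is assumed to be an iterable of dicts with a "user" key.
    """
    counts = Counter(job["user"] for job in pending_jobs)
    return {user: n for user, n in counts.items() if n > limit}


def may_push(user, pending_jobs, limit=MAX_PENDING_PER_USER):
    """Return True if `user` is still below the pending-job limit."""
    pending = sum(1 for job in pending_jobs if job["user"] == user)
    return pending < limit
```

A scheduler could call `may_push` before accepting a new try push, and buildduty could use `users_over_limit` to see who to contact first.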
Comment 2•10 years ago
(In reply to Pete Moore [:pete][:pmoore] from comment #1)
> Other points:

* Switching some more platforms/job types on Try to be non-default, so people have to explicitly request them using trychooser, rather than getting them as part of "-p all" or "-u all".
Comment 3•10 years ago
We briefly discussed using the per-platform logic on Try, though we'd need a way of overriding that behavior to avoid the potential pitfalls of doing so. But that would probably take care of a lot of the cases we're seeing.
Comment hidden (obsolete)
Comment 5•10 years ago
(In reply to Ed Morley [:edmorley UTC+0] from comment #4)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #3)
> > We briefly discussed using the per-platform logic on Try, though we'd need a
> > way of overriding that behavior
>
> try non-default already allows a way to override - support is added both in
> buildbotcustom (or wherever) and trychooser :-)

Ah comment 3 is talking about something else (looking at what files changed when deciding what jobs to schedule) - ignore comment 4.
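As a rough illustration of scheduling based on which files changed, with an override, here is a small Python sketch. The prefix-to-platform table, the platform names and `platforms_for_push` are assumptions made up for this example; they are not taken from buildbotcustom or trychooser.

```python
# Hypothetical mapping from source path prefixes to the platforms they affect.
PATH_PREFIX_TO_PLATFORMS = {
    "mobile/android/": {"android"},
    "widget/cocoa/": {"macosx64"},
    "widget/windows/": {"win32", "win64"},
}

# Hypothetical full platform set used when we cannot narrow things down.
ALL_PLATFORMS = {"linux", "linux64", "macosx64", "win32", "win64", "android"}


def platforms_for_push(changed_files, override_all=False):
    """Pick platforms to schedule from the files a push touches.

    Falls back to every platform when the override is set, when no files
    are listed, or when any file is outside the known prefixes.
    """
    if override_all or not changed_files:
        return set(ALL_PLATFORMS)
    selected = set()
    for path in changed_files:
        for prefix, platforms in PATH_PREFIX_TO_PLATFORMS.items():
            if path.startswith(prefix):
                selected |= platforms
                break
        else:
            # Unknown path: be conservative and schedule everything.
            return set(ALL_PLATFORMS)
    return selected
```

The conservative fallback is one way to limit the pitfalls mentioned in comment 3: a push is only narrowed when every touched file is known to be platform-specific, and `override_all` lets the pusher opt out entirely.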
Comment 6•10 years ago
The situation really hasn't gotten any better over the course of the day. The US is on holidays tomorrow, so I'm going to suggest that non-US folks meet up early EST tomorrow to come up with some solutions for reducing/coping with load.
Comment 7•10 years ago
* Implement, deploy and test bug 1034055 on a branch. This should help move jobs from IX slaves to AWS. I think we can start deploying it on Tue/Wed next week.
* As a result we will need more linux64 test masters. Can be done any time. See bug 1011488 for the latest example.

Not sure if killing some tests is an option... (as usual :( )
Comment 8•10 years ago
(In reply to Chris Cooper [:coop] from comment #6)
> The situation really hasn't gotten any better over the course of the day.
> The US is on holidays tomorrow, so I'm going to suggest that non-US folks
> meet up early EST tomorrow to come up with some solutions for
> reducing/coping with load.

Here's what we came up with as short-term solutions this morning:

* linux64-ix load:
** migrate 2.3 emulator jobs to new AWS slavetype
*** kmoir is working on this in bug 1034055
*** will likely require setup of new buildbot-masters
* mtnlion:
** re-image 10 builders from build/try pool
*** also re-image builders that are currently in staging or need repair <- that capacity is currently unused
* w864:
** resurrect/repair 7 slaves that are currently disabled
* try usage:
** coop to collect current information sources re: try usage (newsgroup posts, blog posts, wiki pages, MDN articles) and synthesize best practices info for using this shared resource.
*** present best practices at Monday MoFo call
*** send best practices synopsis to all try high scores users with >1000 hours every week
*** translate best practices into programmatic checks for trychooser
*** part of this is killing jobs you don't need any more
**** this should also be communicated to project branch owners

Some longer term solutions:

* reduce # of tests run per check-in
** this is the most important thing, but not surprisingly it's a big, cross-group project to decide "what testing is representative?"
** backfill tooling still required
*** arbitrary job capability gets us part way
* set limit on # of concurrent pushes per user
* automatically cancel try jobs that have already landed on inbound/m-c
** devs should be canceling these jobs themselves

I'll be tackling the short-term solutions as buildduty today.
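The idea of automatically cancelling try jobs whose changes have already landed could look roughly like the Python sketch below. It assumes, hypothetically, that each pending job record carries an `id` and the list of changeset hashes it was pushed with, and that a list of changesets landed on inbound/m-c is available from somewhere; neither is an existing build API call. It also only catches pushes whose hashes are unchanged on landing, so rebased patches would be missed.

```python
def find_landed_try_jobs(pending_try_jobs, landed_changesets):
    """Return ids of pending try jobs whose changesets have all landed.

    `pending_try_jobs`: iterable of dicts with "id" and "changesets" keys
    (hypothetical record shape). Jobs with no changesets are never returned.
    """
    landed = set(landed_changesets)
    return [
        job["id"]
        for job in pending_try_jobs
        if job["changesets"] and set(job["changesets"]) <= landed
    ]
```

The returned ids are only candidates. Whether to cancel them automatically or just notify the developer is a policy choice this bug leaves open.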
Comment 9•10 years ago
(In reply to Chris Cooper [:coop] from comment #8)
> * try usage:
> ** coop to collect current information sources re: try usage (newsgroup
> posts, blog posts, wiki pages, MDN articles) and synthesize best practices
> info for using this shared resource.
> *** present best practices at Monday MoFo call

This is probably a good starting point and something we've already been linking people to:
https://wiki.mozilla.org/Sheriffing/How:To:Recommended_Try_Practices
Comment 10•10 years ago
(In reply to Chris Cooper [:coop] from comment #8)
> * mtnlion:
> ** re-image 10 builders from build/try pool
> *** also re-image builders that are currently in staging or need repair <-
> that capacity is currently unused

Bug 1034715 filed.

> * w864:
> ** resurrect/repair 7 slaves that are currently disabled

This is bug 1004813.

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #9)
> This is probably a good starting point and something we've already been
> linking people to:
> https://wiki.mozilla.org/Sheriffing/How:To:Recommended_Try_Practices

That's a good start. I'll plan on (re)presenting it at the MoFo call on Monday, and then doing the other follow-up work listed above.
Comment 11•10 years ago
I mentioned that we have spare ix capacity in a couple of meetings, but I see that the info didn't make it into bugzilla. We have ~100 windows ix builders that we can turn off and repurpose due to recent seamicro builder additions. Running emulators on aws is inefficient as those vms do not expose vmx bit in /proc/cpuinfo. Why aren't we looking at re-purposing existing ix builders?
Flags: needinfo?(laura)
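For reference, the vmx check mentioned in comment 11 can be sketched in a few lines of Python. The function names are made up for this example. It only looks for the `vmx` flag named in the comment, in the `flags` lines of `/proc/cpuinfo` on Linux.

```python
def has_vmx(cpuinfo_text):
    """Return True if any "flags" line in the given cpuinfo text lists vmx."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags" and "vmx" in value.split():
            return True
    return False


def host_has_vmx(path="/proc/cpuinfo"):
    """Read cpuinfo from `path` (Linux only) and check it for the vmx flag."""
    with open(path) as f:
        return has_vmx(f.read())
```

Running `host_has_vmx()` on a candidate slave shows whether the flag is exposed there, which is the property comment 11 says the AWS VMs lack.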
Comment 12•10 years ago
(In reply to Taras Glek (:taras) from comment #11)
> I mentioned that we have spare ix capacity in a couple of meetings, but I
> see that the info didn't make it into bugzilla.
> We have ~100 windows ix builders that we can turn off and repurpose due to
> recent seamicro builder additions.
>
> Running emulators on aws is inefficient as those vms do not expose vmx bit
> in /proc/cpuinfo. Why aren't we looking at re-purposing existing ix builders?

The builders have a different config than our existing tester ix platforms, so they're not drop-in replacements:
https://bugzilla.mozilla.org/show_bug.cgi?id=1034055#c21
Updated•10 years ago
Flags: needinfo?(laura)
Comment 13•10 years ago
Pete, I think this bug can be closed now?
Comment 14•10 years ago
closed per c#13
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard