Bug 1034034 (task) - Wait times > 16h, pending builds > 7000, ix jobs ~3000 reftest/crashtest/jsreftest, 10.8 ~10h behind

Status: RESOLVED FIXED (opened 10 years ago; closed 10 years ago)
Component: Infrastructure & Operations Graveyard :: CIDuty
Severity: normal
Tracking: not tracked
Reporter: pmoore
Assignee: unassigned

No description provided.
Just had a quick meeting with kmoir, coop, and RyanVM. It seems the main culprits are:

1) Far too many unnecessary try jobs running (common problem) - RyanVM is nuking them currently
2) New Android 2.3 tests are currently running on ix machines, since they cannot run in Amazon yet, and the ix pool is not able to handle the load at the moment

kmoir is going to work with gbrown to try to get jobs migrated to Amazon as soon as possible, to reduce load on the ix machines, and RyanVM is going to continue nuking try jobs.

Other points:
* We might want to consider stricter measures to solve problem 1) above - e.g. limiting the number of pending try jobs allowed per user at any time, so you have to kill old try jobs before running new ones
* Some jobs pending on try have already landed in mozilla-central - we might want to consider adding build API methods to search for these types of pending jobs, so that they could be killed
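The second point above could be sketched as a small matching helper: compare pending try jobs against revisions that have already landed, and collect the ones that are now redundant. This is purely illustrative; the job shape and field names are invented, and buildapi's real schema differs.

```python
# Hypothetical sketch: find pending try jobs whose revision has already
# landed on mozilla-central, so they can be cancelled. The job dicts and
# field names here are invented for illustration only.

def jobs_to_cancel(pending_try_jobs, landed_revisions):
    """Return the pending try jobs whose revision has already landed."""
    landed = set(landed_revisions)
    return [job for job in pending_try_jobs if job["revision"] in landed]

pending = [
    {"id": 1, "revision": "abc123", "builder": "linux64-ix reftest"},
    {"id": 2, "revision": "def456", "builder": "mtnlion crashtest"},
]
stale = jobs_to_cancel(pending, ["abc123"])
print([job["id"] for job in stale])  # → [1]
```

In a real deployment the `landed_revisions` input would come from querying the mozilla-central pushlog, and the cancellation itself would go through buildapi; both are out of scope for this sketch.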
(In reply to Pete Moore [:pete][:pmoore] from comment #1)
> Other points:

* Switching some more platforms/job types on Try to be non-default, so people have to explicitly request them using trychooser, rather than getting them as part of "-p all" or "-u all".
We briefly discussed using the per-platform logic on Try, though we'd need a way of overriding that behavior to avoid the potential pitfalls of doing so. But that would probably take care of a lot of the cases we're seeing.
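The "non-default on Try" idea, with an explicit override, could look roughly like this: `-p all` expands only to platforms marked default, while naming a platform explicitly still schedules it. The platform names and the default set below are invented for illustration; the real trychooser/buildbotcustom logic is more involved.

```python
# Hypothetical sketch of non-default platforms on Try: "-p all" excludes
# non-default platforms, but an explicit platform list overrides that.
# Platform names and the default set are invented for illustration.

DEFAULT_PLATFORMS = {"linux64", "macosx64", "win32"}
ALL_PLATFORMS = DEFAULT_PLATFORMS | {"linux64-ix", "android-2.3", "win64"}

def platforms_for(request):
    """Expand a trychooser-style '-p' value into concrete platforms."""
    if request == "all":
        # Non-default platforms are skipped unless requested by name.
        return sorted(DEFAULT_PLATFORMS)
    wanted = set(request.split(","))
    return sorted(wanted & ALL_PLATFORMS)  # explicit request wins
```

For example, `platforms_for("all")` would omit `linux64-ix`, while `platforms_for("linux64-ix,win64")` would schedule both explicitly named platforms.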
Depends on: 1034055
(In reply to Ed Morley [:edmorley UTC+0] from comment #4)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #3)
> > We briefly discussed using the per-platform logic on Try, though we'd need a
> > way of overriding that behavior
> 
> try non-default already allows a way to override - support is added both in
> buildbotcustom (or wherever) and trychooser :-)

Ah comment 3 is talking about something else (looking at what files changed when deciding what jobs to schedule) - ignore comment 4.
The situation really hasn't gotten any better over the course of the day. The US has a holiday tomorrow, so I'm going to suggest that non-US folks meet up early EST tomorrow to come up with some solutions for reducing/coping with load.
* Implement bug 1034055, then deploy and test it on a branch. This should help move jobs from IX slaves to AWS. I think we can start deploying it on Tue/Wed next week.

* As a result we will need more linux64 test masters. Can be done any time. See bug 1011488 for the latest example.

Not sure if killing some tests is an option... (as usual :( )
(In reply to Chris Cooper [:coop] from comment #6)
> The situation really hasn't gotten any better over the course of the day.
> The US is on holidays tomorrow, so I'm going to suggest that non-US folks
> meet up early EST tomorrow to come up with some solutions for
> reducing/coping with load.

Here's what we came up with as short-term solutions this morning:

* linux64-ix load:
** migrate 2.3 emulator jobs to new AWS slavetype
*** kmoir is working on this in bug 1034055
*** will likely require setup of new buildbot-masters

* mtnlion:
** re-image 10 builders from build/try pool
*** also re-image builders that are currently in staging or need repair <- that capacity is currently unused

* w864:
** resurrect/repair 7 slaves that are currently disabled

* try usage:
** coop to collect current information sources re: try usage (newsgroup posts, blog posts, wiki pages, MDN articles) and synthesize best practices info for using this shared resource.
*** present best practices at Monday MoFo call
*** send best practices synopsis to all try high scores users with >1000 hours every week
*** translate best practices into programmatic checks for trychooser
*** part of this is killing jobs you don't need any more
**** this should also be communicated to project branch owners

Some longer term solutions:

* reduce # of tests run per check-in
** this is the most important thing, but not surprisingly it's a big, cross-group project to decide "what testing is representative?"
** backfill tooling still required
*** arbitrary job capability gets us part way

* set limit on # of concurrent pushes per user

* automatically cancel try jobs that have already landed on inbound/m-c
** devs should be canceling these jobs themselves

I'll be tackling the short-term solutions as buildduty today.
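The per-user limit proposed above could be sketched as a simple gate: reject a new push while the user already has too many jobs pending, forcing them to cancel old jobs first. The cap value and the job shape here are invented for illustration; no such policy number was decided in this bug.

```python
# Hypothetical sketch of a per-user cap on pending try jobs. The limit
# value and the job dict shape are invented for illustration.
from collections import Counter

PENDING_LIMIT = 50  # assumed cap, not a real policy number

def can_push(user, pending_jobs):
    """Allow a push only if the user is under the pending-job cap."""
    counts = Counter(job["user"] for job in pending_jobs)
    return counts[user] < PENDING_LIMIT
```

A real implementation would also need to decide where to enforce this (push hook vs. scheduler) and how to surface the rejection to the developer, neither of which was settled in this discussion.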
(In reply to Chris Cooper [:coop] from comment #8)
> * try usage:
> ** coop to collect current information sources re: try usage (newsgroup
> posts, blog posts, wiki pages, MDN articles) and synthesize best practices
> info for using this shared resource.
> *** present best practices at Monday MoFo call

This is probably a good starting point and something we've already been linking people to:
https://wiki.mozilla.org/Sheriffing/How:To:Recommended_Try_Practices
(In reply to Chris Cooper [:coop] from comment #8)
> * mtnlion:
> ** re-image 10 builders from build/try pool
> *** also re-image builders that are currently in staging or need repair <-
> that capacity is currently unused

Bug 1034715 filed.

> * w864:
> ** resurrect/repair 7 slaves that are currently disabled

This is bug 1004813.
 
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #9)
> This is probably a good starting point and something we've already been
> linking people to:
> https://wiki.mozilla.org/Sheriffing/How:To:Recommended_Try_Practices

That's a good start. I'll plan on (re)presenting it at the MoFo call on Monday, and then doing the other follow-up work listed above.
I mentioned that we have spare ix capacity in a couple of meetings, but I see that the info didn't make it into bugzilla.
We have ~100 windows ix builders that we can turn off and repurpose due to recent seamicro builder additions. 

Running emulators on AWS is inefficient, as those VMs do not expose the vmx bit in /proc/cpuinfo. Why aren't we looking at re-purposing existing ix builders?
Flags: needinfo?(laura)
See Also: → 1035269
Depends on: 1035304
(In reply to Taras Glek (:taras) from comment #11)
> I mentioned that we have spare ix capacity in a couple of meetings, but I
> see that the info didn't make it into bugzilla.
> We have ~100 windows ix builders that we can turn off and repurpose due to
> recent seamicro builder additions. 
> 
> Running emulators on aws is inefficient as those vms do not expose vmx bit
> in /proc/cpuinfo. Why aren't we looking at re-purposing existing ix builders?

The builders have a different config than our existing tester ix platforms, so they're not drop-in replacements: https://bugzilla.mozilla.org/show_bug.cgi?id=1034055#c21
Flags: needinfo?(laura)
Pete, I think this bug can be closed now?
Closed per comment 13.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard