Rebalance t-xp32-ix pool to provide more capacity on talos-linux64-ix

Status

RESOLVED WONTFIX
Reported: 2 years ago
Last modified: 4 months ago

People

(Reporter: alin.selagea, Unassigned)

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
XP testing has been disabled starting with Firefox 53 (bug 1310836). Considering that:
    - the load on this pool is pretty low at the moment
    - we enabled talos tests on linux64-stylo (bug 1338871)
I think we can go ahead and rebalance the XP pool a bit to help with the talos load.
(Reporter)

Comment 1

2 years ago
@Kim: do you have any suggestions on how many XP machines to keep?
Flags: needinfo?(kmoir)

Comment 2

2 years ago
Why don't you try what you suggested in channel?

> judging by the load I noticed during the past week, I think keeping 10 machines out of 40 would be fine for our testing needs on XP
Flags: needinfo?(kmoir)
(Reporter)

Comment 3

2 years ago
All right then.

If there's no reason to choose a different subset (e.g. rack position), let's re-image (mapping sketched below):
    --> t-xp32-ix-011 - t-xp32-ix-040
to
    --> talos-linux64-ix-090 - talos-linux64-ix-119

Two things are needed before proceeding to re-image these:
    - move the machines to the appropriate VLAN
    - update Nagios and inventory entries  

I'm happy to file bugs for both of them, but I'll need confirmation from RelOps that the above range is OK.

cc-ing :arr here.
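
The re-image proposed above maps t-xp32-ix-011..040 onto talos-linux64-ix-090..119, i.e. a fixed offset of 79 per host. Below is a minimal Python sketch of that mapping, assuming three-digit zero-padded host numbers as quoted in the comment; it is only an illustration, not an existing RelOps tool.

    # Mapping sketch: t-xp32-ix-011..040 -> talos-linux64-ix-090..119.
    # Assumes three-digit zero-padded host numbers, as in the ranges quoted above.
    OFFSET = 90 - 11  # talos-linux64-ix-090 corresponds to t-xp32-ix-011

    def new_hostname(old: str) -> str:
        """Return the proposed talos-linux64-ix name for a t-xp32-ix host."""
        prefix, num = old.rsplit("-", 1)
        if prefix != "t-xp32-ix" or not 11 <= int(num) <= 40:
            raise ValueError(f"host {old} is outside the proposed range")
        return f"talos-linux64-ix-{int(num) + OFFSET:03d}"

    if __name__ == "__main__":
        for i in range(11, 41):
            old = f"t-xp32-ix-{i:03d}"
            print(f"{old} -> {new_hostname(old)}")

Running it prints the thirty proposed host renames, which could also serve as a checklist for the VLAN and inventory updates mentioned above.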
(Reporter)

Comment 4

2 years ago
well, I guess I should have used a ni for visibility ^^
Flags: needinfo?(arich)
(Reporter)

Comment 5

2 years ago
Maybe Van could unblock us here?
Flags: needinfo?(vle)

Comment 6

2 years ago
> I'm happy to file bugs for both of them, but I'll need confirmation from RelOps that the above range is OK.

I'm not RelOps, but I can move the machines to the correct VLAN if you need that done.
Flags: needinfo?(vle)
Comment 7

I've sent email to catlee and joel to discuss whether we should do this, and discussion is ongoing. We will not have additional capacity once we start the move, and we are discussing disabling tests instead.
Flags: needinfo?(arich)
(Reporter)

Comment 8

a year ago
Do we have any updates here?
Flags: needinfo?(arich)
Comment 9

a year ago
I am no longer seeing alerts for linux now that PI has disabled a bunch of unneeded jobs. jmaher: I recommend that we use these for w10 instead; what do you think?
Flags: needinfo?(arich) → needinfo?(jmaher)
Comment 10

a year ago
I am fine with using these for win10. I am curious what load has been reduced on linux hardware; as far as I know this is just talos, and we have only been adding new tests recently, not removing them.
Flags: needinfo?(jmaher)
Comment 11

a year ago
I recall you saying that you found some tests that you disabled a while back? If not, disregard. I haven't seen any pending-jobs alerts for talos-linux64 since 2017-02-14T19:46:16-0500. :kim, :aselagea, are you still seeing issues with linux64 somewhere?

Regardless, based on userbase population, it's far less important than w10, so my suggestion is that the hardware be used there. I'll agree to whatever recommendations PI has, though, since they're driving the test platform requirements.
(Reporter)

Comment 12

a year ago
Created attachment 8864521 [details]
talos_pending.PNG

Here's the evolution of the number of pending talos jobs over the past 2 months. The spikes likely correspond to US/Canada working hours, while the larger gaps correspond to the weekends.

If we avoid doing multiple rebuilds for these jobs, I think the current pool of machines can handle the load. In that case, we should be able to use the XP machines for Windows 10 tests.
Comment 13

a year ago
I am curious about the data on multiple rebuilds for jobs. Is this done automatically? If so, I can support turning that off. We do a lot of this manually to bisect or verify performance changes; that will always happen (and it ends up being during working hours).
(Reporter)

Comment 14

a year ago
(In reply to Joel Maher ( :jmaher) from comment #13)
> I am curious about the data on multiple rebuilds for jobs. Is this done
> automatically? If so, I can support turning that off. We do a lot of this
> manually to bisect or verify performance changes; that will always happen
> (and it ends up being during working hours).

I was referring to the try pushes that use the "--rebuild" option. Such pushes can sometimes cause backlog, especially if they occur consecutively. If they also include talos jobs, then we'll probably have backlog on linux64-ix as well.

There's bug 1359736 filed for a high number of pending jobs caused by pushes that use that option. We're discussing the possibility of reducing the maximum multiplication factor (currently 20). If you have any input there, please feel free to add it. :-)
Comment 15

a year ago
We only run talos on linux64-ix, and it is rare for people to run talos on try; when they do and use --rebuild, it is typically for a reason, not just by accident. I do agree to reduce it from 20; 10 seems like a better upper bound, although I really wish people wouldn't do foolish things like -p all -u all in general, let alone with --rebuild :)
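
To make the --rebuild multiplication concrete, here is a rough back-of-the-envelope sketch. The suite count, per-job runtime, and pool size below are assumptions for illustration only (the pool size is merely inferred from the proposed new range starting at talos-linux64-ix-090); only the current cap of 20 and the proposed 10 come from the comments above.

    # Rough estimate of the load queued by one linux64 talos try push with --rebuild.
    TALOS_SUITES = 12         # assumed number of talos suites scheduled per push
    AVG_MINUTES_PER_JOB = 30  # assumed average runtime of a single talos job
    POOL_SIZE = 89            # assumed size of the existing talos-linux64-ix pool

    def queued_machine_hours(rebuild_factor: int) -> float:
        """Machine-hours queued by one talos try push using --rebuild."""
        return TALOS_SUITES * rebuild_factor * AVG_MINUTES_PER_JOB / 60.0

    for factor in (1, 10, 20):
        hours = queued_machine_hours(factor)
        minutes_of_pool = hours / POOL_SIZE * 60
        print(f"--rebuild {factor:2d}: ~{hours:5.1f} machine-hours "
              f"(~{minutes_of_pool:.0f} min of whole-pool capacity)")

Even under these rough assumptions, a single --rebuild 20 push ties up more than an hour of whole-pool capacity, which is consistent with the backlog concern discussed above.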
(Reporter)

Comment 16

a year ago
Okay, I'll 'wontfix' the bug in this case. Thanks for all the feedback here.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → WONTFIX

Updated

4 months ago
Product: Release Engineering → Infrastructure & Operations