Closed Bug 740853 Opened 12 years ago Closed 12 years ago

Shuffle the tegras attached to foopies to reduce load

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86_64
Windows 7

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: Callek)

Details

(Whiteboard: [capacity][foopy])

Attachments

(3 files, 1 obsolete file)

We saw a few buildbot "connection lost" messages for tegra-190. Each one kills the running job, causes current and future jobs on that tegra to fail, and is a pain overall.

The likely cause is an overloaded foopy. This foopy has quite a lot of tegras attached, so let's try to balance them out a bit.

Bear just stopped tegra-190 to try to reclaim some sanity in the short term.
Priority: -- → P2
Whiteboard: [buildduty][capacity][foopy]
In bug 740860 I show that we have more than 10 tegras per foopy (except foopy07).

I will reduce the number of tegras per foopy to 10 and see how we do on Monday/Tuesday.

In bug 737415 I suspected that this is indeed one reason why we lose connections.

On another note, are *all* foopies the same rev?
Assignee: nobody → armenzg
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #1)
> In bug 740860 I show that we have more than 10 tegras per foopy (except
> foopy07).
> 
> I will reduce the number of tegras per foopy to 10 and see how we do on
> Monday/Tuesday.
> 
> In bug 737415 I suspected that this is indeed one reason why we lose
> connections.

We can't reduce the tegra count to that level; it would remove too many tegras from the pool.

I'm not saying this doesn't need doing: something has increased the end-to-end time for tegras and is also causing them to drop faster.

> 
> On another note, are *all* foopies the same rev?

No, we have 2 generations of foopies. All of the newer ones got assigned more tegras for that reason.
I will be fixing this next week, but I want to at least lay down the plan now.
FTR we can determine the foopies that are having connection lost by looking at the twistd.log on the buildbot masters. Look for error.ConnectionLost or RemoteCommand.interrupt (perhaps we have to ignore the reboot.py ones since it seems we always do). The IP of the foopy is the first one appearing on the line.
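
A rough sketch of that log scan, in case it helps. It only assumes what is said above: the interesting lines contain error.ConnectionLost or RemoteCommand.interrupt, reboot.py lines are skipped, and the foopy IP is the first IP on the line. The function name and the sample lines are made up for illustration; real twistd.log lines will look different.

```python
import re
from collections import Counter

# Markers named in the comment above.
MARKERS = ("error.ConnectionLost", "RemoteCommand.interrupt")
# Simple IPv4 pattern; it does not validate octet ranges.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def count_lost_connections(lines, ignore=("reboot.py",)):
    """Count matching lines per foopy IP (first IPv4 address on the line)."""
    counts = Counter()
    for line in lines:
        if not any(marker in line for marker in MARKERS):
            continue
        if any(token in line for token in ignore):
            continue  # reboot.py seems to always trigger these
        match = IPV4.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


# Made-up sample lines, NOT copied from a real twistd.log.
sample = [
    "2012-03-30 10:00:01 [Broker,12,10.250.48.1] error.ConnectionLost seen, peer 10.250.49.9",
    "2012-03-30 10:00:05 [Broker,13,10.250.48.2] RemoteCommand.interrupt reboot.py",
    "2012-03-30 10:01:00 [Broker,12,10.250.48.1] RemoteCommand.interrupt verify.py",
    "2012-03-30 10:02:00 [Broker,14,10.250.48.3] command finished",
]
lost = count_lost_connections(sample)
for ip, hits in lost.most_common():
    print(ip, hits)
```

For a real master, pass an open file instead of the sample list, e.g. `count_lost_connections(open("twistd.log"))`; the foopies with the highest counts are the ones to look at first.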

I am choosing which tegras to disconnect by grabbing from the bottom of the list for each foopy.

I don't know which ones are the newer foopies, but here are their versions:
foopy[07-11] - 10.7.0 (386)
foopy[12-17] - 10.4.1 (386)
foopy[18-20,21-24] - 11.2.0 (64-bit)

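I will bring foopies 7-11 to 11 tegras each and foopies 12-24 to 13 each. The list below shows the current count for each foopy, followed by the tegras I will disconnect from it.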

foopy07 contains 10 tegras: find one tegra to move to it
foopy08 contains 13 tegras: tegra-201,tegra-202
foopy09 contains 13 tegras: tegra-064,tegra-065
foopy10 contains 13 tegras: tegra-078,tegra-203
foopy11 contains 13 tegras: tegra-090,tegra-091
foopy12 contains 13 tegras: no changes
foopy13 contains 13 tegras: no changes
foopy14 contains 17 tegras: tegra-156,tegra-157,tegra-194,tegra-195
foopy15 contains 14 tegras: tegra-146
foopy16 contains 15 tegras: tegra-171,tegra-172
foopy17 contains 16 tegras: tegra-198,tegra-204,tegra-205
foopy18 contains 14 tegras: tegra-219
foopy19 contains 14 tegras: tegra-287
foopy20 contains 14 tegras: tegra-286
foopy22 contains 14 tegras: tegra-261
foopy23 contains 14 tegras: tegra-275
foopy24 contains 12 tegras: find one tegra to move to it

This means that we are reducing the pool by a net 21 tegras out of a total of 232, which is less than 10%, and we end up with a known standard distribution.
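
For anyone who wants to double check the arithmetic, here is a small sketch. The counts are copied from the list above and the targets from the plan (11 for foopy07-11, 13 for foopy12-24); the helper name is made up.

```python
# Current tegra counts per foopy, copied from the list above.
current = {
    "foopy07": 10, "foopy08": 13, "foopy09": 13, "foopy10": 13, "foopy11": 13,
    "foopy12": 13, "foopy13": 13, "foopy14": 17, "foopy15": 14, "foopy16": 15,
    "foopy17": 16, "foopy18": 14, "foopy19": 14, "foopy20": 14, "foopy22": 14,
    "foopy23": 14, "foopy24": 12,
}


def target_for(foopy):
    """Target from the plan: 11 tegras on foopy07-11, 13 on foopy12-24."""
    number = int(foopy[len("foopy"):])
    return 11 if 7 <= number <= 11 else 13


# Positive delta = tegras to disconnect; negative = tegras to move in.
deltas = {name: count - target_for(name) for name, count in current.items()}
total = sum(current.values())
net_reduction = sum(deltas.values())
print("%d of %d tegras removed (%.1f%%)"
      % (net_reduction, total, 100.0 * net_reduction / total))
```

The negative deltas for foopy07 and foopy24 are the two "find one tegra to move to it" entries, which is why the net figure is lower than the number of tegras named in the list.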
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #3)
> foopy13 contains 13 tegras: no changes

Suggest dropping tegra-110 and bringing a different one over to here [110 is staging]

> foopy23 contains 14 tegras: tegra-275

suggest tegra-268 instead [it is also staging]
Priority: P2 → P3
Assignee: armenzg → bear
Whiteboard: [buildduty][capacity][foopy] → [capacity][foopy]
Summary: Shuffle the tegras attached to foopy17 to reduce load → Shuffle the tegras attached to foopies to reduce load
Attachment #620874 - Flags: review?(bugspam.Callek)
Comment on attachment 620874 [details] [diff] [review]
update tegras.json and foopies.sh

There are some minor mistakes here; I'm going to fix them up and aim it back at you, though.
Attachment #620874 - Flags: review?(bugspam.Callek) → review-
This fixes the dashboard as discussed [needs testing]
Assignee: bear → bugspam.Callek
Status: NEW → ASSIGNED
Attachment #621056 - Flags: review?(bear)
Attachment #621056 - Flags: review?(bear) → review+
After my latest patch here (about to attach):

$ python tegras_per_foopy.py
None contains 20 tegras
foopy07 contains 11 tegras
foopy08 contains 11 tegras
foopy09 contains 11 tegras
foopy10 contains 11 tegras
foopy11 contains 11 tegras
foopy12 contains 13 tegras
foopy13 contains 13 tegras
foopy14 contains 13 tegras
foopy15 contains 14 tegras
foopy16 contains 13 tegras
foopy17 contains 13 tegras
foopy18 contains 13 tegras
foopy19 contains 13 tegras
foopy20 contains 13 tegras
foopy22 contains 13 tegras
foopy23 contains 13 tegras
foopy24 contains 13 tegras
We have 232 tegras in 18 foopies which means a ratio of 12 tegras per foopy

Looks like we'll need to (a) update this script, and (b) pull one of those 14 from foopy15 as well ;-) (not sure how we missed this earlier)
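
A minimal sketch of the kind of change meant by (a): keep tegras with no foopy out of the per-foopy counts, so that "None" is not treated as a foopy in the summary line. This is not the actual patch. It assumes tegras.json maps each tegra name to a dict with a "foopy" key that may be null or the string "None"; the function name and the tiny inventory are made up. The ratio uses integer division, which is consistent with the output above (232 // 18 is 12).

```python
from collections import Counter


def tegras_per_foopy(tegras):
    """Return (per-foopy counts, number of tegras with no foopy)."""
    counts = Counter()
    unassigned = 0
    for info in tegras.values():
        foopy = info.get("foopy")
        if foopy in (None, "None"):
            unassigned += 1  # do not treat "None" as a foopy
        else:
            counts[foopy] += 1
    return counts, unassigned


# Tiny made-up inventory standing in for the parsed tegras.json.
tegras = {
    "tegra-001": {"foopy": "foopy07"},
    "tegra-002": {"foopy": "foopy07"},
    "tegra-003": {"foopy": "foopy08"},
    "tegra-004": {"foopy": None},
}
counts, unassigned = tegras_per_foopy(tegras)
for foopy in sorted(counts):
    print("%s contains %d tegras" % (foopy, counts[foopy]))
assigned = sum(counts.values())
ratio = assigned // len(counts)
print("We have %d tegras in %d foopies which means a ratio of %d tegras per foopy"
      % (assigned, len(counts), ratio))
print("%d tegras are not attached to any foopy" % unassigned)
```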
This updates both of those files (fixing some minor misses in your first version), and also adds a "_comment" key to tegras.json, only in the places that warrant it.

I plan to use that in my upcoming patch for the tegras_per_foopy, but am ok if you would rather have _no_ comments here. (We can't use traditional JS comments in json of course)
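
To illustrate the idea (a sketch only, not the upcoming patch): because JSON has no comment syntax, the note travels as an ordinary key that consumers may read or ignore. The entries and the note text below are made up, and the "foopy": "None" convention for unassigned tegras is an assumption.

```python
import json
from collections import Counter

# Made-up fragment in the assumed tegras.json layout.
raw = """
{
  "tegra-010": {"foopy": "foopy07"},
  "tegra-011": {"foopy": "None", "_comment": "on loan (made-up note)"},
  "tegra-012": {"foopy": "None"}
}
"""
tegras = json.loads(raw)

# Group tegras that have no foopy by their optional "_comment" value.
notes = Counter(
    info.get("_comment", "(With no Comment)")
    for info in tegras.values()
    if info.get("foopy") in (None, "None")
)
for note, count in sorted(notes.items()):
    print("%3d %s" % (count, note))
```

Code that does not care about the note keeps working unchanged, since it only looks up the keys it already knows.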
Attachment #620874 - Attachment is obsolete: true
Attachment #621059 - Flags: review?(bear)
This updates tegras-per-foopy to account for "None" and expands its usefulness. The output of this script with the current patches applied is:

Justin@ORION /d/sources/build-tools/buildfarm/mobile
$ python tegras_per_foopy.py
PRODUCTION:
  foopy07 contains 11 tegras
  foopy08 contains 11 tegras
  foopy09 contains 11 tegras
  foopy10 contains 11 tegras
  foopy11 contains 11 tegras
  foopy12 contains 13 tegras
  foopy13 contains 13 tegras
  foopy14 contains 13 tegras
  foopy15 contains 14 tegras
  foopy16 contains 13 tegras
  foopy17 contains 13 tegras
  foopy18 contains 13 tegras
  foopy19 contains 13 tegras
  foopy20 contains 13 tegras
  foopy22 contains 13 tegras
  foopy23 contains 13 tegras
  foopy24 contains 13 tegras
We have 212 tegras in 17 foopies which means a ratio of 12 tegras per foopy

STAGING:
  foopy05 contains 7 tegras
  foopy06 contains 16 tegras
We have 23 tegras in 2 foopies which means a ratio of 11 tegras per foopy

UNASSIGNED
   5 With Comment: Bug 749637: Assigned to Sec-Team
  15 (With no Comment)
Attachment #621071 - Flags: review?(armenzg)
Comment on attachment 621071 [details] [diff] [review]
Update tegras-per-foopy

I am not sure how comfortable I feel about having to duplicate the slavealloc's notes into tegras.json but AFAIK there is no alternative.

This is great. Thanks Callek.
Attachment #621071 - Flags: review?(armenzg) → review+
Comment on attachment 621059 [details] [diff] [review]
Update tegras.json and foopies.sh

with the tweak of updating sec team loaner list
Attachment #621059 - Flags: review?(bear) → review+
We finished this shuffling on Saturday
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard