Closed Bug 1024091 Opened 10 years ago Closed 10 years ago

address high pending count in in-house Linux64 test pool

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Assigned: kmoir)

References

Details

Attachments

(9 files, 3 obsolete files)

6.47 KB, patch
mozilla
: review+
Details | Diff | Splinter Review
1.51 KB, patch
mozilla
: review+
Details | Diff | Splinter Review
6.19 KB, patch
Details | Diff | Splinter Review
33.52 KB, text/plain
Details
2.15 KB, patch
mozilla
: review+
kmoir
: checked-in+
Details | Diff | Splinter Review
8.01 KB, patch
mozilla
: review+
nthomas
: checked-in+
Details | Diff | Splinter Review
117.03 KB, patch
Details | Diff | Splinter Review
2.15 KB, patch
nthomas
: checked-in+
Details | Diff | Splinter Review
1.51 KB, patch
Callek
: review+
philor
: checked-in+
Details | Diff | Splinter Review
When bug 1020970 (Schedule all Android 2.3 armv6 tests, except mochitest-gl, on all trunk trees and make them ride the trains) was enabled in production this morning, the pending count for linux64 tests became really high (currently at 764 for that test platform).

There are three ways to reduce this load that I can think of:
1) reallocate some same-rev ix machines from the build pool to the talos-linux64-ix pool
2) temporarily reduce the number of branches the tests run on to reduce wait times
3) buy more ix machines

I'll look at option 1) first since it's the lowest-cost solution.

As an aside, gbrown spent months trying to get these tests running on AWS without success so it's not an option to run them there.
Assignee: nobody → kmoir
Are talos-linux32-ix the same spec? Despite that pool seeming like it's already tiny by comparison, my gut feeling is that it's actually too big, and basically never has any pending jobs.
Assuming (and given how wildly outdated and wrong about other things it is, that's an actual assumption) the wait time emails are correct about the ubuntu32_hw and ubuntu64_hw pools, we were already doing 70-80% no wait for linux64, and even on busy days doing 100% no wait for linux32.
(In reply to Phil Ringnalda (:philor) from comment #1)
> Are talos-linux32-ix the same spec? Despite that pool seeming like it's
> already tiny by comparison, my gut feeling is that it's actually too big,
> and basically never has any pending jobs.

I feel the same way. I think we could start by moving 10 slaves and see how the numbers look.
So according to inventory, talos-linux64-ix-* are iX Systems - iX21X4 2U Neutron, and talos-linux32-ix-* are IX Systems - IX22X4 Four Node 2U, which seems to indicate a different hardware rev.
gbrown made this suggestion via email 

"The main problem with running on aws was that reftests (plain and js-reftests, and crashtests) were really slow. Should we re-consider running mochitests/robocop/xpcshell on aws? Recall https://bugzilla.mozilla.org/show_bug.cgi?id=992969#c6."

I'm not sure how to do this with our dictionary definition for tests; I'll have to think about it.
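For context, a minimal sketch of the shape that test dictionary is assumed to have; the suite names and options below are illustrative assumptions, not copied from config.py:

# Illustrative sketch only; suite names and options are assumptions, not the
# real config.py contents.
ANDROID_2_3_MOZHARNESS_DICT = (
    ('mochitest-1', {'extra_args': ['--test-suite', 'mochitest-1']}),
    ('plain-reftest-1', {'extra_args': ['--test-suite', 'plain-reftest-1']}),
    ('jsreftest-1', {'extra_args': ['--test-suite', 'jsreftest-1']}),
    ('xpcshell', {'extra_args': ['--test-suite', 'xpcshell']}),
)
# Each entry is (suite_name, suite_options), which is why the patch further
# down routes suites by suite[0].startswith(...) into either the ix or AWS dict.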
Was talking to rail this morning about this.  He said we could try a new instance type that is not currently used.  However, in bug 980519 gbrown discusses the types of slaves he ran tests on, and none of them had better results than ix.  Not sure if this is still the case or if there are newer instance types.

We know now that the other in-house hardware pools have different hardware revs and their machines cannot be moved over. So the options appear to be:

1) run only selected tests on ix and the others on AWS (not sure how to implement this, but I will investigate)
2) reduce the number of branches we run tests on (not sure if this is really feasible given that we want to disable tegras)
3) buy more ix machines of the same hardware rev (will ask IT if that hardware rev is still available for purchase)
The t-w864-ix* machines are listed as the same model (iX Systems - iX21X4 2U Neutron) in inventory but the wait counts for this pool don't make me think we should reallocate them.
Blocks: 1020970
Attached patch bug1024091.patchSplinter Review
This patch separates the tests out so the problematic tests identified in comment 5 are run on ix and the others are run on AWS.  I've tested it on my dev-master and verified that the correct slave class is connected to the builder.  Also, test-masters runs fine and the builder diff is correct. I wrote the patch to run on ash as a test; if we can verify it there, I'll write another patch to enable it on trunk if all works out.
Attachment #8440828 - Flags: review?(aki)
add new slave class definition to puppet to avoid scheduler duplicate
Attachment #8440829 - Flags: review?(aki)
Comment on attachment 8440829 [details] [diff] [review]
bug1024091puppet.patch

I think armv6 is sufficient, but you're still below the ubuntu64_vm-b2g-emulator-jb name length.
Attachment #8440829 - Flags: review?(aki) → review+
Comment on attachment 8440828 [details] [diff] [review]
bug1024091.patch

>+#kim2

You probably don't need this comment anymore :)

>+for suite in ANDROID_2_3_MOZHARNESS_DICT:
>+    if suite[0].startswith('mochitest-gl'):
>+        continue
>+    if suite[0].startswith('plain-reftest'):
>+        continue
>+    if suite[0].startswith('crashtest'):
>+        continue
>+    if suite[0].startswith('jsreftest'):
>+        continue
>+    ANDROID_2_3_ARMV6_AWS_DICT['opt_unittest_suites'].append(suite)
>+
>+for suite in ANDROID_2_3_MOZHARNESS_DICT:
>+    if suite[0].startswith('mochitest-gl'):
>+        continue
>+    if suite[0].startswith('plain-reftest'):
>+        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)
>+    if suite[0].startswith('crashtest'):
>+        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)
>+    if suite[0].startswith('jsreftest'):
>+        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)

You can probably do this in a single pass:

for suite in ANDROID_2_3_MOZHARNESS_DICT:
    if suite[0].startswith('mochitest-gl'):
        continue
    elif suite[0].startswith('plain-reftest'):
        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)
    elif suite[0].startswith('crashtest'):
        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)
    elif suite[0].startswith('jsreftest'):
        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)
    else:
        ANDROID_2_3_ARMV6_AWS_DICT['opt_unittest_suites'].append(suite)

Which will save a bit of time on reconfigs.
Attachment #8440828 - Flags: review?(aki) → review+
updated patch with aki's suggestions
Attached patch bug1024091trunk-2.patch (obsolete) — Splinter Review
patch to enable tests on trunk
Attached file builder1024091.diff
builder diff
So there is a problem with this approach.  If you look at ash, the AWS jobs are still pending after three hours.

I think the problem is that the builders exist on the master for both AWS and ix tests.  However, masters are segregated by function, i.e. non-AWS hosts don't serve AWS slaves and vice versa. So we now have the case where the AWS jobs are pending on an in-house master, since they share the same builder name.

http://buildbot-master103.srv.releng.scl3.mozilla.com:8201/builders/Android%202.3%20Emulator%20Armv6%20ash%20opt%20test%20mochitest-1

So I think that adding
      "android-armv6"

to limit_mobile_platforms for the applicable AWS masters would fix this.  Aki, is this the right approach?
Flags: needinfo?(aki)
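To illustrate the idea, a sketch of the effect of such a per-master platform allow-list; the function name, builder-dict shape, and filtering logic here are assumptions, not the actual buildbot master code:

# Sketch only: how a per-master platform allow-list determines which builders
# a master instantiates. Names and data shapes are assumptions.
def builders_for_master(all_builders, limit_mobile_platforms=None):
    """Keep only builders whose platform is in this master's allow-list."""
    if not limit_mobile_platforms:
        return list(all_builders)
    return [b for b in all_builders
            if b.get('platform') in limit_mobile_platforms]

# With 'android-armv6' absent from the AWS linux64 masters' list, those masters
# never create the Armv6 builders, so the AWS-targeted jobs can only pend on the
# in-house masters, which never see an AWS slave.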
(In reply to Kim Moir [:kmoir] from comment #15)
> So there is a problem with this approach.  If you look at ash, the AWS jobs
> are still pending after three hours.
> 
> I think the problem is that the builders exist on the master for both AWS
> and ix tests.  However, masters are segregated by function i.e. non AWS
> hosts don't serve AWS slaves and vice versa. So we have the case now where
> the AWS jobs are pending on an inhouse master. Since they share the same
> builder name.
> 
> http://buildbot-master103.srv.releng.scl3.mozilla.com:8201/builders/
> Android%202.3%20Emulator%20Armv6%20ash%20opt%20test%20mochitest-1
> 
> So I think that adding
>       "android-armv6"
> 
> to limit_mobile_platforms for the applicable aws masters would fix this. 
> Aki is this the right approach?

Hmmm.

So this will add tegras to the AWS masters.  However,

a) if we have some sort of other thing that keeps the tegras from attaching to those masters (slavealloc? foopy hardcodes?) and
b) this helps us EOL the tegras so there's only a relatively small window that this is an issue,

I think that wfm.
Flags: needinfo?(aki)
patch to enable armv6 on aws linux64 masters
So Callek, looking at slavealloc and bug 888835, it's not clear to me whether the tegras are locked to masters.  Is this the case?

See comment 16 for the context of this.  I'm trying to get some of the armv6 2.3 emulator tests running on AWS and some on ix machines.  The problem is that the jobs just sit queued because android-armv6 is not enabled as a mobile platform on the AWS Linux64 masters. I don't want tegras attaching to the AWS masters.
Flags: needinfo?(bugspam.Callek)
So, the tegras are not "locked" to any masters; however, they are part of the "tests-tegra" pool.

Which means they only get assigned to masters that are assigned to said pool, which additionally means they won't get assigned to the AWS masters.

(Only two masters are in the tests-tegra pool: bm88 and bm99.) So as long as you don't add new masters to the tests-tegra slavealloc pool, we're golden.
Flags: needinfo?(bugspam.Callek)
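For reference, a rough sketch of the pool-based assignment described above; the dict layout, the second pool entry, and the helper are illustrative assumptions, not slavealloc's actual schema or API:

# Illustrative sketch of slavealloc-style pool assignment; not slavealloc's
# actual schema or API.
POOL_MASTERS = {
    'tests-tegra': ['bm88', 'bm99'],          # the only masters in this pool
    'tests-linux64-aws': ['bm115', 'bm116'],  # hypothetical example entries
}

def eligible_masters(slave_pool):
    # A slave is only ever assigned a master from its own pool, so tegras
    # (pool 'tests-tegra') can never end up on the AWS masters.
    return POOL_MASTERS.get(slave_pool, [])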
Attachment #8441810 - Flags: review?(aki)
Attached patch bug1024091trunk-3.patch (obsolete) — Splinter Review
enable on trunk if the patch on the AWS masters works (will deploy that first). Also changed the name as per RyanVM's suggestion here: https://bugzilla.mozilla.org/show_bug.cgi?id=1023948#c5
Attachment #8441500 - Attachment is obsolete: true
Attachment #8442084 - Flags: review?(aki)
:gbrown: So the regular Android (non-armv6) 2.3 tests all run on ix machines too.  I looked at some bugs from when we implemented this and saw that some of the problems with running them on AWS were long running times.  Would it be possible to run a subset of these tests on AWS too?  This would also reduce the load on our Linux64 ix pool.
Flags: needinfo?(gbrown)
Yes, that should be possible. mochitest/robocop/xpcshell tests for both Android 2.3 and Android 2.3 armv6 are theoretically okay on aws. (We run into trouble with reftests, which run much longer on aws.)
Flags: needinfo?(gbrown)
Attachment #8441810 - Flags: review?(aki) → review+
Attachment #8442084 - Flags: review?(aki) → review+
Attachment #8441810 - Flags: checked-in+
So splitting tests over two types of machines worked on ash.  I have a new patch that does this for the Android 2.3 tests that are now riding the trains, in conjunction with enabling 2.3 on Armv6 to ride the trains with the same scenario.  However, I'm debugging a problem with splitting the pool for the Android 2.3 tests before I ask for review.  When completed, this patch will let us enable Armv6 on 2.3 to ride the trains and also decrease the impact that the Android 2.3 tests have on the ix pool.
Attached patch bug1024091-2_3.patch (obsolete) — Splinter Review
puppet patch to run 2.3 tests on aws, vm_android_2_3 wasn't being used
Attachment #8443208 - Flags: review?(aki)
patch to enable 2.3 armv6 to ride the trains and to split 2.3 tests to run on AWS and ix slaves.  Tested on my master, will attach builder diff
Attachment #8442084 - Attachment is obsolete: true
Attachment #8443209 - Flags: review?(aki)
builder diff
Comment on attachment 8443209 [details] [diff] [review]
bug1024091june19.patch

The weird thing about this patch is that I had to stop my master and then
 rm -rf master/*_ubuntu64_hw_mobile_test*
 rm -rf master/*_ubuntu64_vm_mobile_test*

on my master before it would show all the builders associated with the new vm pool for Android 2.3.

Not sure if my master was in a weird state or what.
Comment on attachment 8443208 [details] [diff] [review]
bug1024091-2_3.patch

Hmm.
Attachment #8443208 - Flags: review?(aki) → review+
Attachment #8443208 - Attachment is obsolete: true
Comment on attachment 8443209 [details] [diff] [review]
bug1024091june19.patch

Looks like this is the same thing with a 1 char indentation fix?
Attachment #8443209 - Flags: review?(aki) → review+
Depends on: 1023948
re comment 29: yes, sorry, I didn't realize that this was very similar to the patch you had already approved.  I thought the earlier patch didn't have the part that splits the Android 2.3 tests between aws and ix as well.
Attachment #8442084 - Flags: checked-in+
Attachment #8443209 - Flags: checked-in+
Comment on attachment 8443209 [details] [diff] [review]
bug1024091june19.patch

Backed out until bug 1028293 is addressed
Attachment #8443209 - Flags: checked-in+ → checked-in-
Depends on: 1028293
This landed and was backed out again.

Was causing reconfig issues (NOT checkconfig) like:

[buildbot-master115.srv.releng.usw2.mozilla.com] out: 2014-06-24 13:02:45-0700 [Broker,83757,10.132.157.100] Unhandled Error
[buildbot-master115.srv.releng.usw2.mozilla.com] out:   Traceback (most recent call last):
[buildbot-master115.srv.releng.usw2.mozilla.com] out:     File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
[buildbot-master115.srv.releng.usw2.mozilla.com] out:       d = self.portal.login(self, mind, IPerspective)
[buildbot-master115.srv.releng.usw2.mozilla.com] out:     File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/cred/portal.py", line 116, in login
[buildbot-master115.srv.releng.usw2.mozilla.com] out:       ).addCallback(self.realm.requestAvatar, mind, *interfaces
[buildbot-master115.srv.releng.usw2.mozilla.com] out:     File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py", line 260, in addCallback
[buildbot-master115.srv.releng.usw2.mozilla.com] out:       callbackKeywords=kw)
[buildbot-master115.srv.releng.usw2.mozilla.com] out:     File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
[buildbot-master115.srv.releng.usw2.mozilla.com] out:       self._runCallbacks()
[buildbot-master115.srv.releng.usw2.mozilla.com] out:   --- <exception caught here> ---
[buildbot-master115.srv.releng.usw2.mozilla.com] out:     File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
[buildbot-master115.srv.releng.usw2.mozilla.com] out:       self.result = callback(self.result, *args, **kw)
[buildbot-master115.srv.releng.usw2.mozilla.com] out:     File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/buildbot-0.8.2_hg_11716f9bbdeb_production_0.8-py2.7.egg/buildbot/master.py", line 498, in requestAvatar
[buildbot-master115.srv.releng.usw2.mozilla.com] out:       p = self.botmaster.getPerspective(mind, avatarID)
[buildbot-master115.srv.releng.usw2.mozilla.com] out:     File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/buildbot-0.8.2_hg_11716f9bbdeb_production_0.8-py2.7.egg/buildbot/master.py", line 317, in getPerspective
[buildbot-master115.srv.releng.usw2.mozilla.com] out:       sl = self.slaves[slavename]
[buildbot-master115.srv.releng.usw2.mozilla.com] out:   exceptions.KeyError: 'tst-linux64-spot-345'

and 

twistd.log:2014-06-24 12:41:08-0700 [-] configuration update failed
twistd.log-2014-06-24 12:41:08-0700 [-] Unhandled Error
twistd.log-     Traceback (most recent call last):
twistd.log-       File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_8a9e33843c3f_production_0.8-py2.7.egg/buildbot/master.py", line 1151, in loadConfig_Builders
twistd.log-         d = self.botmaster.setBuilders(sortedAllBuilders)
twistd.log-       File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_8a9e33843c3f_production_0.8-py2.7.egg/buildbot/master.py", line 294, in setBuilders
twistd.log-         d.addCallback(_add)
twistd.log-       File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 260, in addCallback
twistd.log-         callbackKeywords=kw)
twistd.log-       File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
twistd.log-         self._runCallbacks()
twistd.log-     --- <exception caught here> ---
twistd.log-       File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
twistd.log-         self.result = callback(self.result, *args, **kw)
twistd.log-       File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_8a9e33843c3f_production_0.8-py2.7.egg/buildbot/master.py", line 289, in _add
twistd.log-         assert slavename in self.slaves
twistd.log-     exceptions.AssertionError:

Sadly this error wasn't hitting during checkconfig, so I'm not sure what the heck caused it.
So this caused problems during the reconfig: it failed on the test masters serving this platform.

I didn't see this in staging.  

So I think the strategy to test this more thoroughly in staging is:
* set up a master as an AWS Linux64 test master with armv6
* revert all patches and rm -rf master/*_ubuntu64_hw_mobile_test* and master/*_ubuntu64_vm_mobile_test* so new builder directories are created
* export builders
* apply patches
* checkconfig, reconfig
* export new builders and compare

repeat the steps, but as a master serving Linux64 with armv6

see if there are errors
I did a lot of testing this morning and it appears a rolling restart of the relevant masters is required to implement this patch.  I don't know why.  If I apply the patch and run checkconfig and reconfig, it errors; if I then make stop and make start, it works fine and the new builders appear.  Will talk to buildduty to see when we can schedule this.
Talked to Callek, this reconfig will be scheduled for 10am EST tomorrow.
This is in production, will look at pending counts after today's excitement dies down.

Basically, the reconfig we did on Tuesday that failed added a lot of builder state to the scheduling masters.  So when Callek did the rolling reconfig today, the pending count for 2.3 jobs went too high and consumed much of the Linux AWS test capacity.  The sheriffs closed the trees.  Nick analyzed the db and saw that there were many jobs from changesets that were two days old but not scheduled until today. Nick cleaned up the db.  Callek endured a very long rolling reconfig to implement this.  Thanks to Nick and Callek for their help.
So the pending counts for Linux64 look okay today after this change, as do the wait times.  This doesn't remove the need to order additional ix machines to handle our increasing load, but for today it seems okay.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
As a clarification, the wait times for this pool aren't great (65-70% no wait), but enabling the 2.3 tests for Armv6 didn't cause them to change significantly because of the way the jobs were split between AWS and ix.
Depends on: 1031856
Since

    if 'mozilla-central' in BRANCHES:
        BRANCHES['mozilla-central']['gecko_version'] = 33

that had the (I presume, and hope, unintentional) effect of running them on all trunk trees *except* mozilla-central, so that next mergeday they would start running on mozilla-central, and then in two mergedays they would start running on mozilla-aurora.
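For illustration, roughly the gating behavior being described; this is a sketch with assumed names and default handling, not the actual buildbot-configs code:

# Sketch only; the constant, helper, and default handling are assumptions.
MIN_GECKO_FOR_ARMV6_2_3 = 34  # attachment 8448515 effectively lowers this to 33

def rides_the_trains(branch_config, trunk_gecko_version=34):
    # Trunk trees without an explicit gecko_version are treated as the current
    # trunk version, so they pass a >= 34 gate today, while mozilla-central
    # (pinned to 33 above) does not -- hence "all trunk trees *except*
    # mozilla-central" until the next mergeday.
    gecko = branch_config.get('gecko_version', trunk_gecko_version)
    return gecko >= MIN_GECKO_FOR_ARMV6_2_3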
Attachment 8441810 [details] [diff] enabled android-armv6 on the 5 us-east-1 masters, but not the matching set in us-west-2. Over the weekend we had pending jobs when there were spot instances only in us-west-2.

Landed https://hg.mozilla.org/build/tools/rev/3c6648f8f701 to fix that, then did an update and reconfig on bm53/54/68/115/116.
Attachment #8447777 - Flags: checked-in+
I'm not at all sure I like the existence of a gecko_version that's higher than the actual gecko_version, purely to allow this sort of job, which absolutely should not be visible the way it's running, but just getting these running on mozilla-central like they should have been is good enough for me.
Attachment #8448515 - Flags: review?(kmoir)
Attachment #8448515 - Flags: review?(kmoir) → review+
Comment on attachment 8448515 [details] [diff] [review]
Run them on the current, 33, trunk train, not the 34 train in 9 weeks

https://hg.mozilla.org/build/buildbot-configs/rev/1712f3dd46a8
Attachment #8448515 - Flags: checked-in+
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard