Closed
Bug 1024091
Opened 10 years ago
Closed 10 years ago
address high pending count in in-house Linux64 test pool
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: kmoir, Assigned: kmoir)
References
Details
Attachments
(9 files, 3 obsolete files)
6.47 KB, patch | mozilla: review+ | Details | Diff | Splinter Review
1.51 KB, patch | mozilla: review+ | Details | Diff | Splinter Review
6.19 KB, patch | Details | Diff | Splinter Review
33.52 KB, text/plain | Details
2.15 KB, patch | mozilla: review+, kmoir: checked-in+ | Details | Diff | Splinter Review
8.01 KB, patch | mozilla: review+, nthomas: checked-in+ | Details | Diff | Splinter Review
117.03 KB, patch | Details | Diff | Splinter Review
2.15 KB, patch | nthomas: checked-in+ | Details | Diff | Splinter Review
1.51 KB, patch | Callek: review+, philor: checked-in+ | Details | Diff | Splinter Review
When bug 1020970 (Schedule all Android 2.3 armv6 tests, except mochitest-gl, on all trunk trees and make them ride the trains) was enabled in production this morning, the pending count for linux64 builds became really high (currently at 764 for that test platform). There are three ways I can think of to reduce this load:
1) reallocate some same-rev ix machines from the build pool to the talos-linux64-ix pool
2) temporarily reduce the number of branches the tests run on, to reduce wait times
3) buy more ix machines
I'll look at option 1) first since it's the lowest-cost solution. As an aside, gbrown spent months trying to get these tests running on AWS without success, so running them there is not an option.
Assignee
Updated•10 years ago
Assignee: nobody → kmoir
Comment 1•10 years ago
Are talos-linux32-ix the same spec? Despite that pool seeming like it's already tiny by comparison, my gut feeling is that it's actually too big, and basically never has any pending jobs.
Comment 2•10 years ago
Assuming (and given how wildly outdated and wrong about other things it is, that's an actual assumption) the wait time emails are correct about the ubuntu32_hw and ubuntu64_hw pools, we were already doing 70-80% no wait for linux64, and even on busy days doing 100% no wait for linux32.
Comment 3•10 years ago

(In reply to Phil Ringnalda (:philor) from comment #1)
> Are talos-linux32-ix the same spec? Despite that pool seeming like it's
> already tiny by comparison, my gut feeling is that it's actually too big,
> and basically never has any pending jobs.

I feel the same way. I think we could start by moving 10 slaves and see how the numbers look.
Assignee
Comment 4•10 years ago

So according to inventory, talos-linux64-ix-* are iX Systems - iX21X4 2U Neutron and talos-linux32-ix-* are IX Systems - IX22X4 Four Node 2U, which seems to indicate a different hardware rev.
Assignee
Comment 5•10 years ago

gbrown made this suggestion via email: "The main problem with running on aws was that reftests (plain and js-reftests, and crashtests) were really slow. Should we re-consider running mochitests/robocop/xpcshell on aws? Recall https://bugzilla.mozilla.org/show_bug.cgi?id=992969#c6." I'm not sure how to do this with our dictionary definition for tests; will have to think about it.
Assignee
Comment 6•10 years ago

Was talking to rail this morning about this. He said we could try a new instance type that is not currently used. However, in bug 980519 gbrown discusses the types of slaves that he ran tests on, and none of them had better results than ix. Not sure if this is still the case or if there are newer instance types. We know now that other in-house hardware pools have different hardware revs and cannot be moved over. So the options appear to be:
1) run only selected tests on ix and the others on AWS (not sure how to implement this, but will investigate)
2) reduce the number of branches we run tests on (not sure if this is really feasible given that we want to disable tegras)
3) buy more ix machines of the same hardware rev (will ask IT if that hardware rev is still available for purchase)
Assignee
Comment 7•10 years ago
The t-w864-ix* machines are listed as the same model (iX Systems - iX21X4 2U Neutron) in inventory but the wait counts for this pool don't make me think we should reallocate them.
Assignee
Comment 8•10 years ago

This patch separates the tests out so the problematic tests identified in comment 5 are run on ix and the others are run on AWS. I've tested it on my dev-master and verified that the correct slave class is connected to the builder. Also, test-masters runs fine and the builder diff is correct. I wrote the patch to run on ash for a test; if we can verify it there, I'll write another patch to enable it on trunk if all works out.
Attachment #8440828 - Flags: review?(aki)
Assignee
Comment 9•10 years ago

Add new slave class definition to puppet, to avoid a scheduler duplicate.
Attachment #8440829 - Flags: review?(aki)
Comment 10•10 years ago

Comment on attachment 8440829 [details] [diff] [review]
bug1024091puppet.patch

I think armv6 is sufficient, but you're still below the ubuntu64_vm-b2g-emulator-jb name length.
Attachment #8440829 - Flags: review?(aki) → review+
Comment 11•10 years ago

Comment on attachment 8440828 [details] [diff] [review]
bug1024091.patch

>+#kim2
You probably don't need this comment anymore :)

>+for suite in ANDROID_2_3_MOZHARNESS_DICT:
>+    if suite[0].startswith('mochitest-gl'):
>+        continue
>+    if suite[0].startswith('plain-reftest'):
>+        continue
>+    if suite[0].startswith('crashtest'):
>+        continue
>+    if suite[0].startswith('jsreftest'):
>+        continue
>+    ANDROID_2_3_ARMV6_AWS_DICT['opt_unittest_suites'].append(suite)
>+
>+for suite in ANDROID_2_3_MOZHARNESS_DICT:
>+    if suite[0].startswith('mochitest-gl'):
>+        continue
>+    if suite[0].startswith('plain-reftest'):
>+        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)
>+    if suite[0].startswith('crashtest'):
>+        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)
>+    if suite[0].startswith('jsreftest'):
>+        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)

You can probably do this in a single pass:

for suite in ANDROID_2_3_MOZHARNESS_DICT:
    if suite[0].startswith('mochitest-gl'):
        continue
    elif suite[0].startswith('plain-reftest'):
        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)
    elif suite[0].startswith('crashtest'):
        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)
    elif suite[0].startswith('jsreftest'):
        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)
    else:
        ANDROID_2_3_ARMV6_AWS_DICT['opt_unittest_suites'].append(suite)

Which will save a bit of time on reconfigs.
Attachment #8440828 - Flags: review?(aki) → review+
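The single-pass split from comment 11 can be sketched as a standalone, runnable example. The dict names mirror the ones quoted in the review; the suite list here is illustrative (not the real mozharness config), and the three ix-bound prefixes are condensed into one tuple-based startswith check:

```python
# Minimal sketch of the suite split from comment 11. The suite list is
# illustrative; the dict names mirror the quoted buildbot-configs patch.
ANDROID_2_3_MOZHARNESS_DICT = (
    ('mochitest-1', {}),
    ('mochitest-gl', {}),
    ('plain-reftest-1', {}),
    ('crashtest-1', {}),
    ('jsreftest-1', {}),
    ('robocop-1', {}),
    ('xpcshell', {}),
)
ANDROID_2_3_ARMV6_IX_DICT = {'opt_unittest_suites': []}
ANDROID_2_3_ARMV6_AWS_DICT = {'opt_unittest_suites': []}

# Slow suites (reftests, crashtests, jsreftests) stay on ix hardware;
# mochitest-gl is skipped entirely; everything else goes to AWS.
for suite in ANDROID_2_3_MOZHARNESS_DICT:
    if suite[0].startswith('mochitest-gl'):
        continue
    elif suite[0].startswith(('plain-reftest', 'crashtest', 'jsreftest')):
        ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites'].append(suite)
    else:
        ANDROID_2_3_ARMV6_AWS_DICT['opt_unittest_suites'].append(suite)

print([s[0] for s in ANDROID_2_3_ARMV6_IX_DICT['opt_unittest_suites']])
# → ['plain-reftest-1', 'crashtest-1', 'jsreftest-1']
```

A single loop keeps the two dicts consistent by construction, which is why aki notes it also shaves a little time off reconfigs.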
Assignee
Comment 12•10 years ago
updated patch with aki's suggestions
Assignee
Comment 13•10 years ago
patch to enable tests on trunk
Assignee
Comment 14•10 years ago
builder diff
Assignee
Comment 15•10 years ago

So there is a problem with this approach. If you look at ash, the AWS jobs are still pending after three hours.

I think the problem is that the builders exist on the master for both AWS and ix tests. However, masters are segregated by function, i.e. non-AWS masters don't serve AWS slaves and vice versa. So we now have the case where the AWS jobs are pending on an in-house master, since the AWS and ix tests share the same builder name.

http://buildbot-master103.srv.releng.scl3.mozilla.com:8201/builders/Android%202.3%20Emulator%20Armv6%20ash%20opt%20test%20mochitest-1

So I think that adding "android-armv6" to limit_mobile_platforms for the applicable AWS masters would fix this. Aki, is this the right approach?
Flags: needinfo?(aki)
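The effect of the per-master platform limit described in comment 15 can be illustrated with a small sketch. Note the `builders_for_master` helper and the builder dicts below are hypothetical stand-ins for the real buildbot-configs structures; only the `limit_mobile_platforms` name and the builder name come from this bug:

```python
# Hypothetical sketch of master-side platform filtering (comment 15).
# builders_for_master and the dicts are illustrative, not the real API.
def builders_for_master(all_builders, limit_mobile_platforms):
    """Return only the builders whose platform this master may serve."""
    return [b for b in all_builders if b['platform'] in limit_mobile_platforms]

all_builders = [
    {'name': 'Android 2.3 Emulator Armv6 ash opt test mochitest-1',
     'platform': 'android-armv6'},
    {'name': 'Android 2.3 Emulator ash opt test mochitest-1',
     'platform': 'android'},
]

# Without 'android-armv6' in the AWS master's list, the armv6 builder never
# attaches there, so its jobs sit pending on an in-house master instead.
before = builders_for_master(all_builders, ['android'])
after = builders_for_master(all_builders, ['android', 'android-armv6'])
print(len(before), len(after))  # → 1 2
```

This is why the fix is a master-config change rather than a scheduling change: the jobs were scheduled fine, but no AWS master was willing to own the builder.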
Comment 16•10 years ago

(In reply to Kim Moir [:kmoir] from comment #15)
> So there is a problem with this approach. If you look at ash, the AWS jobs
> are still pending after three hours.
>
> I think the problem is that the builders exist on the master for both AWS
> and ix tests. However, masters are segregated by function i.e. non AWS
> hosts don't serve AWS slaves and vice versa. So we have the case now where
> the AWS jobs are pending on an inhouse master. Since they share the same
> builder name.
>
> http://buildbot-master103.srv.releng.scl3.mozilla.com:8201/builders/
> Android%202.3%20Emulator%20Armv6%20ash%20opt%20test%20mochitest-1
>
> So I think that adding "android-armv6" to limit_mobile_platforms for the
> applicable aws masters would fix this. Aki is this the right approach?

Hmm. So this will add tegras to the AWS masters. However, if a) we have some sort of other thing that keeps the tegras from attaching to those masters (slavealloc? foopy hardcodes?) and b) this helps us EOL the tegras so there's only a relatively small window that this is an issue, I think that wfm.
Flags: needinfo?(aki)
Assignee
Comment 17•10 years ago
patch to enable armv6 on aws linux64 masters
Assignee
Comment 18•10 years ago

So Callek, looking at slavealloc and bug 888835, it's not clear to me that the tegras are locked to masters. Is this the case? See comment #16 for the context. I'm trying to get some armv6 with 2.3 emulator tests running on AWS and some on ix machines. The problem is that the AWS machines just queue the jobs because android armv6 is not enabled as a mobile platform on the AWS Linux64 masters. I don't want tegras attaching to the AWS masters.
Flags: needinfo?(bugspam.Callek)
Comment 19•10 years ago

So, the tegras are not "locked" to any masters; however, they are part of the "tests-tegra" pool, which means they only get assigned to masters that are in that pool, which in turn means they won't get assigned to the AWS masters. (Only two masters are in the tests-tegra pool: bm88 and bm99.) So as long as you don't add new masters to the tests-tegra slavealloc pool, we're golden.
Flags: needinfo?(bugspam.Callek)
Assignee
Updated•10 years ago
Attachment #8441810 - Flags: review?(aki)
Assignee
Comment 20•10 years ago

Enable on trunk if the patch on the AWS masters works (will deploy that first). Also changed the name as per RyanVM's suggestion here: https://bugzilla.mozilla.org/show_bug.cgi?id=1023948#c5
Attachment #8441500 - Attachment is obsolete: true
Assignee
Updated•10 years ago
Attachment #8442084 - Flags: review?(aki)
Assignee
Comment 21•10 years ago

:gbrown: So the regular Android (non-armv6) 2.3 tests all run on ix machines too. I looked at some bugs from when we implemented this and saw that some of the problems with running them on AWS were long running times. Would it be possible to run a subset of these tests on AWS too? This would also reduce the load on our Linux64 ix pool.
Flags: needinfo?(gbrown)
Comment 22•10 years ago
Yes, that should be possible. mochitest/robocop/xpcshell tests for both Android 2.3 and Android 2.3 armv6 are theoretically okay on aws. (We run into trouble with reftests, which run much longer on aws.)
Flags: needinfo?(gbrown)
Updated•10 years ago
Attachment #8441810 - Flags: review?(aki) → review+
Updated•10 years ago
Attachment #8442084 - Flags: review?(aki) → review+
Assignee
Updated•10 years ago
Attachment #8441810 - Flags: checked-in+
Assignee
Comment 23•10 years ago

So splitting tests over two types of machines worked on ash. I have a new patch to do this for the Android 2.3 tests that are now riding the trains, in conjunction with enabling 2.3 on Armv6 to ride the trains with the same scenario. However, I'm debugging a problem with splitting the pool for the Android 2.3 tests before I ask for review. When completed, this patch will enable us to have Armv6 on 2.3 ride the trains and also decrease the impact that the Android 2.3 tests have on the ix pool.
Assignee
Comment 24•10 years ago

Puppet patch to run 2.3 tests on AWS; vm_android_2_3 wasn't being used.
Attachment #8443208 - Flags: review?(aki)
Assignee
Comment 25•10 years ago

Patch to enable 2.3 armv6 to ride the trains and to split 2.3 tests to run on AWS and ix slaves. Tested on master; will attach builder diff.
Attachment #8442084 - Attachment is obsolete: true
Assignee
Updated•10 years ago
Attachment #8443209 - Flags: review?(aki)
Assignee
Comment 26•10 years ago
builder diff
Assignee
Comment 27•10 years ago

Comment on attachment 8443209 [details] [diff] [review]
bug1024091june19.patch

The weird thing about this patch is that I had to stop my master and then run

rm -rf master/*_ubuntu64_hw_mobile_test*
rm -rf master/*_ubuntu64_vm_mobile_test*

on my master before it would show all the builders associated with the new vm pool for Android 2.3. Not sure if my master was in a weird state or what.
Comment 28•10 years ago

Comment on attachment 8443208 [details] [diff] [review]
bug1024091-2_3.patch

Hmm.
Attachment #8443208 - Flags: review?(aki) → review+
Updated•10 years ago
Attachment #8443208 - Attachment is obsolete: true
Comment 29•10 years ago

Comment on attachment 8443209 [details] [diff] [review]
bug1024091june19.patch

Looks like this is the same thing with a 1 char indentation fix?
Attachment #8443209 - Flags: review?(aki) → review+
Assignee
Comment 30•10 years ago

Re comment 29: yes, sorry, I didn't realize that this was very similar to the patch you had already approved. I thought the earlier patch didn't also have the part to split the Android 2.3 tests between aws and ix.
Assignee
Updated•10 years ago
Attachment #8442084 - Flags: checked-in+
Assignee
Updated•10 years ago
Attachment #8443209 - Flags: checked-in+
Assignee
Comment 31•10 years ago

Comment on attachment 8443209 [details] [diff] [review]
bug1024091june19.patch

Backed out until bug 1028293 is addressed.
Attachment #8443209 - Flags: checked-in+ → checked-in-
Comment 33•10 years ago
This landed and was backed out again. Was causing reconfig issues (NOT checkconfig) like:

[buildbot-master115.srv.releng.usw2.mozilla.com] out: 2014-06-24 13:02:45-0700 [Broker,83757,10.132.157.100] Unhandled Error
Traceback (most recent call last):
  File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
    d = self.portal.login(self, mind, IPerspective)
  File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/cred/portal.py", line 116, in login
    ).addCallback(self.realm.requestAvatar, mind, *interfaces
  File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py", line 260, in addCallback
    callbackKeywords=kw)
  File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/buildbot-0.8.2_hg_11716f9bbdeb_production_0.8-py2.7.egg/buildbot/master.py", line 498, in requestAvatar
    p = self.botmaster.getPerspective(mind, avatarID)
  File "/builds/buildbot/tests1-linux64/lib/python2.7/site-packages/buildbot-0.8.2_hg_11716f9bbdeb_production_0.8-py2.7.egg/buildbot/master.py", line 317, in getPerspective
    sl = self.slaves[slavename]
exceptions.KeyError: 'tst-linux64-spot-345'

and in twistd.log:

2014-06-24 12:41:08-0700 [-] configuration update failed
2014-06-24 12:41:08-0700 [-] Unhandled Error
Traceback (most recent call last):
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_8a9e33843c3f_production_0.8-py2.7.egg/buildbot/master.py", line 1151, in loadConfig_Builders
    d = self.botmaster.setBuilders(sortedAllBuilders)
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_8a9e33843c3f_production_0.8-py2.7.egg/buildbot/master.py", line 294, in setBuilders
    d.addCallback(_add)
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 260, in addCallback
    callbackKeywords=kw)
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_8a9e33843c3f_production_0.8-py2.7.egg/buildbot/master.py", line 289, in _add
    assert slavename in self.slaves
exceptions.AssertionError:

Sadly this error wasn't hitting during checkconfig, so I'm not sure what the heck caused it.
Assignee
Comment 34•10 years ago

So this caused problems: the reconfig failed on the test masters serving this platform, which I didn't see in staging. So I think the strategy to test this more thoroughly in staging is:
* set up a master as an AWS Linux64 test master with armv6
* revert all patches and rm -rf master/*_ubuntu64_hw_mobile_test* and master/*_ubuntu64_vm_mobile_test* so new builder directories are created
* export builders
* apply patches
* checkconfig, reconfig
* export new builders and compare
Then repeat the steps as a master serving Linux64 with armv6 and see if there are errors.
Assignee
Comment 35•10 years ago

I did a lot of testing this morning and it appears a rolling restart of the relevant masters is required to implement this patch. I don't know why, but if I apply the patch and run checkconfig and reconfig, it errors. If I then make stop and make start, it works fine and the new builders appear. Will talk to buildduty to see when we can schedule this.
Assignee
Comment 36•10 years ago
Talked to Callek, this reconfig will be scheduled for 10am EST tomorrow.
Assignee
Comment 37•10 years ago

This is in production; will look at pending counts after today's excitement dies down. Basically, the reconfig we did on Tuesday that failed added a lot of builder state to the scheduling masters. So when Callek did the rolling reconfig today, the pending count went up too high for 2.3 jobs and consumed much of the Linux AWS test capacity. The sheriffs closed the trees. Nick analyzed the db, saw that there were many jobs from changesets that were two days old but not scheduled until today, and cleaned up the db. Callek endured a very long rolling reconfig to implement this. Thanks to Nick and Callek for their help.
Assignee
Comment 38•10 years ago

So the pending counts for Linux64 look okay today after this change, as do the wait times. This doesn't remove the need to order additional ix machines to handle our increasing capacity needs, but today it seems okay.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee
Comment 39•10 years ago

As a clarification, the wait times for this pool aren't great (65-70%), but enabling 2.3 tests for Armv6 didn't change them significantly because of the way the jobs were split between AWS and ix.
Comment 40•10 years ago

Since

if 'mozilla-central' in BRANCHES:
    BRANCHES['mozilla-central']['gecko_version'] = 33

that had the (I presume, and hope, unintentional) effect of running them on all trunk trees *except* mozilla-central, so that next mergeday they would start running on mozilla-central, and then in two mergedays they would start running on mozilla-aurora.
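The mergeday effect described above can be sketched abstractly. The threshold-style gate below (enable a suite on any branch whose gecko_version is at or above a cutoff) is an assumption about how "riding the trains" is checked, not the literal buildbot-configs code, and the branch versions are illustrative:

```python
# Hedged sketch of version-gated scheduling (comment 40). Assumption: a
# suite rides the trains on branches with gecko_version >= a threshold.
BRANCHES = {
    'mozilla-central': {'gecko_version': 33},
    'mozilla-aurora':  {'gecko_version': 32},
    'cedar':           {'gecko_version': 34},  # hypothetical trunk tree
}

def branches_running(branches, enable_from_version):
    """Branches on which a suite gated at enable_from_version would run."""
    return sorted(b for b, cfg in branches.items()
                  if cfg['gecko_version'] >= enable_from_version)

# Gating at 34 while mozilla-central is pinned to 33 skips m-c until the
# next merge bumps its version; gating at 33 picks it up immediately.
print(branches_running(BRANCHES, 34))  # → ['cedar']
print(branches_running(BRANCHES, 33))  # → ['cedar', 'mozilla-central']
```

This is the off-by-one-train behaviour the follow-up patch ("Run them on the current, 33, trunk train, not the 34 train in 9 weeks") corrects.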
Comment 41•10 years ago

Comment on attachment 8443209 [details] [diff] [review]
bug1024091june19.patch

FTR, the relanding was at http://hg.mozilla.org/build/buildbot-configs/rev/94783aba6009
Attachment #8443209 - Flags: checked-in- → checked-in+
Comment 42•10 years ago

Attachment 8441810 [details] [diff] enabled android-armv6 on the 5 us-east-1 masters, but not the matching set in us-west-2. Over the weekend we had pending jobs when there were spot instances only in us-west-2. Landed https://hg.mozilla.org/build/tools/rev/3c6648f8f701 to fix that, then did an update and reconfig on bm53/54/68/115/116.
Attachment #8447777 - Flags: checked-in+
Comment 43•10 years ago
I'm not at all sure I like the existence of a gecko_version that's higher than the actual gecko_version, purely to allow this sort of job which absolutely should not be visible the way it's running, but just getting these running on mozilla-central like they should have is good enough for me.
Attachment #8448515 - Flags: review?(kmoir)
Updated•10 years ago
Attachment #8448515 - Flags: review?(kmoir) → review+
Comment 44•10 years ago
Comment on attachment 8448515 [details] [diff] [review]
Run them on the current, 33, trunk train, not the 34 train in 9 weeks

https://hg.mozilla.org/build/buildbot-configs/rev/1712f3dd46a8
Attachment #8448515 - Flags: checked-in+
Updated•6 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard