Closed
Bug 1296329
Opened 8 years ago
Closed 8 years ago
volume of autoland commits is causing high pending counts
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: kmoir, Unassigned)
References
Details
Attachments
(3 files, 1 obsolete file)
1.40 KB, patch | jmaher: review+ | kmoir: checked-in+ | Details | Diff | Splinter Review
1.81 KB, patch | jmaher: review+ | kmoir: checked-in+ | Details | Diff | Splinter Review
2.81 KB, patch | kmoir: checked-in+ | Details | Diff | Splinter Review
The volume of autoland commits is causing high pending counts. Possible solutions (more than one may apply):

1) give autoland the same priority as try; currently it is at 4, should it have the same priority as try at 5? (this won't reduce overall load but will make jobs on other branches get picked up faster)
2) coalesce builds on autoland so we don't have situations where we have 10 commits in ~10 minutes spawning massive amounts of jobs
3) increase the frequency at which seta skips jobs for that branch
Comment 1•8 years ago
can we get some basic numbers related to the number of commits on autoland vs inbound vs fx-team in the last month or week(s)? I think for SETA we could change things; I am not sure how buildbot could have different skipcounts per branch, but that is a possibility- if there are seta changes, it would be good to know.
Comment 2•8 years ago
Is the total number of commits on (inbound + autoland) higher than the historical number of commits on inbound alone? My impression is that autoland is just a more automated version of inbound, so I would expect those to be similar (with, perhaps, more backing-out going on on autoland since it happens automatically).
Comment 3•8 years ago
autoland lands all patches for a bug as a single commit, on inbound/fx-team it is common to put a few patches from various bugs together into a single commit. I wonder if there is a significant bump in the number of commits due to this change.
Comment 4•8 years ago
Is this possibly related to autoland batch landing a bunch of pending commits after a tree closure is resolved?
Reporter
Comment 5•8 years ago
buildapi is broken so getting push numbers is difficult: https://secure.pub.build.mozilla.org/buildapi/reports/pushes (bug 1293659). Anyway, I looked at the hg logs for the two repos: m-i has about 1400 commits since the beginning of August, while autoland has about 2300 commits in the same time period. autoland was enabled as a repo at the end of May (bug 1267712). Also, my second suggestion in the first comment is a bit misleading: builds on autoland are coalesced the same way they are on other integration branches; try is the only one where they aren't merged. Not sure if it is an option to coalesce them more aggressively.
Comment 6•8 years ago
pushes by month:

autoland
2016-07: 562
2016-08: 550

inbound
2016-01: 1488
2016-02: 1495
2016-03: 1805
2016-04: 1484
2016-05: 1758
2016-06: 1323
2016-07: 1060
2016-08: 633

fx-team
2016-01: 489
2016-02: 579
2016-03: 552
2016-04: 558
2016-05: 455
2016-06: 370
2016-07: 345
2016-08: 159
Comment 7•8 years ago
Giving it the same priority as try isn't an option: autoland is just another mozilla-inbound, where you trigger your push by pressing a button instead of typing a command. Builds *aren't* coalesced, only tests are. We did that long ago because we used to routinely have hours-long tree closures where we coalesced the builds from pushes A, B, C, and D, only ran the build on D, then got a busted test on D, or perhaps on F when the tests from D, E, and F also coalesced, and then we had to sit closed while builds ran on A, B, and C to get tests on them to see which one was at fault.
Comment 8•8 years ago
how did :kmoir get such different numbers from :gps? Even if we are off by a day or so that wouldn't account for such a difference.
Reporter
Comment 9•8 years ago
jmaher: I quickly looked at the hg logs like this:

hg log --limit 2300 --template '{node|short} | {date|isodatesec}\n'

and looked at the last date associated with the commit; obviously that wasn't the right approach.
Comment 10•8 years ago
Dates in commits themselves can't be trusted. I looked at the pushlog data. The dates in the pushlog come from the server and can be trusted. I also only counted pushes, not commit count, as pushes with multi-part series can skew results otherwise.
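Counting pushes per month from pushlog data, rather than trusting commit dates, can be sketched like this. The data shape loosely follows the hgweb `json-pushes` output, and the sample entries are invented for illustration:

```python
from collections import Counter
from datetime import datetime, timezone

def pushes_per_month(pushlog):
    """Count pushes per calendar month using the server-side push date.

    Each value is a push record whose 'date' field is a Unix timestamp
    set by the hg server, so unlike author dates it can be trusted.
    Counting pushes (not changesets) avoids skew from multi-part series.
    """
    counts = Counter()
    for push in pushlog.values():
        month = datetime.fromtimestamp(push["date"], tz=timezone.utc).strftime("%Y-%m")
        counts[month] += 1
    return dict(counts)

# Invented sample loosely shaped like a json-pushes response
sample = {
    "1001": {"date": 1467369600, "changesets": ["abc123"]},            # 2016-07
    "1002": {"date": 1469961600, "changesets": ["def456", "789abc"]},  # 2016-07
    "1003": {"date": 1470096000, "changesets": ["bbb222"]},            # 2016-08
}
print(pushes_per_month(sample))  # {'2016-07': 2, '2016-08': 1}
```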
Comment 11•8 years ago
ok, this is a bit more understandable. If we look at Greg's data, we see most of the load from inbound/fx-team being migrated to autoland; it isn't as though we have a large net gain in commits.

sum of (autoland/inbound/fx-team):
April: 2042
May: 2213
June: 1693
July: 1967 (all 3 branches start here)
August: 1342 (2/3 of month so far)

overall I don't see autoland as causing a 10%+ uptick in commits- we do have extra merges into m-c at the rate of probably 1/day. I know autoland didn't have SETA previously- this was fixed 2 days ago, maybe we are seeing aggregate data prior to SETA working- or possibly SETA isn't working properly.
Reporter
Comment 12•8 years ago
SETA is working, I just looked at the test scheduler logs. Looking at the logs from a time where seta was enabled on both m-i and autoland, tests are being skipped on both branches, and the number of skipped tests line up with the number of pushes. Also, the seta data for autoland and m-i is identical - so the same tests should be skipped.
Reporter
Comment 13•8 years ago
Joel pinged me in #releng to look at this this morning given the high number of windows xp pending counts. I have confirmed by looking at the logs that winxp (xp_ix platform) jobs are being skipped by seta for talos, opt and debug. So seta is working. One thing I noticed is that they are hitting the time limit pretty quickly, so not as many tests are being skipped; in fact, in today's logs they don't hit more than 4 skips before they are all triggered. My explanation for this behaviour is that there are 122 xp_ix machines and 2600+ pending jobs. The number of pending jobs continues to climb because seta is not skipping them fast enough to stop triggering them after we hit the timeout. Given this, my suggestion is to increase the timeout for winxp. Currently it is at 7200s or 2 hours, but given the backlog this will not get any better.

2016-08-18 21:53:35-0700 [-] tests-autoland-xp_ix-debug-unittest-14-7200: skipping with 4/14 important changes since only 7098/7200s have elapsed
2016-08-18 21:54:34-0700 [-] tests-autoland-xp_ix-debug-unittest-14-7200: skipping with 4/14 important changes since only 7157/7200s have elapsed
2016-08-18 21:55:17-0700 [-] tests-autoland-xp_ix-debug-unittest-14-7200: triggering with 4/14 important changes since 7200s have elapsed
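The gating visible in those log lines can be modeled as a small predicate. This is a simplified model, and the parameter names are mine, not buildbot's; the builder name encodes skip_count=14 and timeout=7200s:

```python
def should_trigger(num_important, elapsed_s, skip_count=14, timeout_s=7200):
    """Simplified model of the SETA scheduler decision logged above:
    trigger the full job set once the accumulated important changes
    reach skip_count OR timeout_s has elapsed since the last full run;
    otherwise skip (schedule only the reduced set)."""
    return num_important >= skip_count or elapsed_s >= timeout_s

# Replaying the three log lines: skip at 7098s and 7157s, trigger at 7200s
print(should_trigger(4, 7098))  # False
print(should_trigger(4, 7157))  # False
print(should_trigger(4, 7200))  # True
```

With a deep backlog, elapsed time hits 7200s long before 14 important changes accumulate, which is why the logs show full triggers after only 4 skips.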
Comment 14•8 years ago
Does SETA prevent coalescing? I wouldn't expect it to be a problem if it schedules a full set on several pushes, since I'd expect coalescing to mean that we wouldn't actually run them.
Comment 15•8 years ago
usually we don't see jobs scheduled and pending in treeherder if they are coalesced.
Comment 16•8 years ago
That sounds like something other than coalescing, or at least what I mean by coalescing. When I talk about it, I'm talking about the thing where a slave becomes free to take a job at the priority of autoland, so buildbot looks at what the oldest job pending for that slave is, finds that it's WinXP opt crashtest, and then since there are six pending WinXP opt crashtest jobs it tells the slave to run on the most recent of those, and coalesces away the five previous ones. You will most certainly see them as pending in treeherder, since coalescing doesn't happen until the second that the slave is available and assigned to run something. You're even quite likely to see them as still pending even after they aren't going to run, since the whole operation of getting data into treeherder and updating it is a tangle and quite often they'll stay shown as pending until after you refresh the whole treeherder page.
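The coalescing described above can be sketched as a toy model. The data shapes and names here are mine; real buildbot does this inside its request-claiming logic:

```python
def claim_with_coalescing(pending, max_merge=3):
    """Toy model of buildbot request coalescing as described above.

    A freed slave takes the oldest pending request, merges up to
    max_merge pending requests for the same builder, actually runs the
    most recent request of that batch, and discards the older ones.
    Requests stay 'pending' right up until this moment, which is why
    treeherder shows them as pending even when they will never run.
    """
    if not pending:
        return None, pending
    oldest = min(pending, key=lambda r: r["submitted"])
    batch = sorted((r for r in pending if r["builder"] == oldest["builder"]),
                   key=lambda r: r["submitted"])[:max_merge]
    run = batch[-1]  # newest of the merged batch
    remaining = [r for r in pending if r not in batch]
    return run, remaining

# Six pending WinXP opt crashtest requests, merge limit 3:
pending = [{"builder": "WinXP opt crashtest", "submitted": t} for t in range(1, 7)]
run, left = claim_with_coalescing(pending)
print(run["submitted"], len(left))  # 3 3
```

Note that with a merge limit of 3 the six requests are worked off in two batches rather than all at once, which matters later in this bug.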
Reporter
Comment 17•8 years ago
further discussion in irc this afternoon:

3:22 PM <jmaher> kmoir: ok, it looks like when the pending jobs go to get picked up, we do the magical coalescing that SETA recommended
3:22 PM <jmaher> from my previous link: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=windows%20xp&tochange=d929d9aa782933499db42c017259be104db0fb2b&group_state=expanded
3:22 PM <jmaher> the oldest push has many jobs missing and very few pending, before it had a full set of jobs
3:23 PM <kmoir> huh
3:23 PM <jmaher> kmoir: that makes me wonder about the https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=windows%20xp&tochange=d929d9aa782933499db42c017259be104db0fb2b&group_state=expanded dashboard, that is a dashboard of pre-coalesced jobs
3:24 PM <jmaher> kmoir: I specifically remember M(+28) pending on all the jobs
3:24 PM <kmoir> yes, I saw that too
3:24 PM stephend|lunch → stephend
3:25 PM <kmoir> just as a sanity check, earlier I checked the test scheduler logs for different platforms and they all seem to be skipping at similar rates
3:25 PM <jmaher> I had thought (maybe what I recall seeing in the past) that we would only have pending what was supposed to run and not have pending coalesced jobs
3:25 PM <jmaher> kmoir: ok-
3:25 PM <kmoir> yes, I recall that too, it is very confusing
3:25 PM <jmaher> possibly with the high backlog, we end up in an odd state of how we display jobs in treeherder
3:26 PM <jmaher> well, the good news is that the SETA integration seems to be working- possibly we just need to be patient or look at other areas to optimize our load
Comment 18•8 years ago
Kyle has done some work on metrics, see:
https://docs.google.com/spreadsheets/d/1WwTUUj9hV1iaQziviCLcpGt6xg70vEGSKtMmOOsJvoA/edit#gid=0
http://activedata.allizom.org/tools/query.html#query_id=DCTbrO+O

These seem to indicate that we don't have an increase in overall job counts; if that's correct, then increasing SETA coalescing might paper over the problem to some extent without addressing the root cause. Do we have a way of detecting how much time Windows testers are spending between jobs? Could that have increased recently?
Reporter
Comment 19•8 years ago
Andrei or Alin, could one of you please look at the data that jgriffin has provided as well our sources of information and see if we can find the root cause for the pending counts?
Flags: needinfo?(aselagea)
Flags: needinfo?(aobreja)
Comment 20•8 years ago
I don't think we can have a precise measure of the time spent by the XP testers between jobs. These machines get stuck pretty often and they don't take any jobs until rebooted (that is usually done by me or Andrei during EU day and :philor during EU night), so the wait times largely depend on human intervention. I've extracted some data from the buildrequests table in order to see how many XP jobs are marked as "complete" since June 1 for autoland, mozilla-inbound and try: https://docs.google.com/spreadsheets/d/1-1K-R3owQ5nUJ8sQ15mQ4Dfr73vLygMa44mwrrHQFyI From what I can see, the number of jobs completed on autoland has increased during the past two months; some spikes on m-i can also be noticed.
Flags: needinfo?(aselagea)
Flags: needinfo?(aobreja)
Comment 21•8 years ago
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #20) > I don't think we can have a precise measure of the time spent by the XP > testers between jobs. These machines get stuck pretty often and they don't > take any jobs until rebooted (that is usually done by me or Andrei during EU For someone who hasn't done this in the past, do we know why that happens? Is there a bug about this issue? I feel it's weird that we need manual intervention here.
Comment 22•8 years ago
There have been multiple bugs over the life of our Windows buildbot testing era, bug 972867 being both one of the more recent and being characteristic of their lack of findings. The slaverebooter tool existed in large part because of this problem, but it turned out to cause other problems and nobody wanted to fix it, so now we're just waiting for the cloud to save us. But this has always been the case, so it's not the reason we have a problem we didn't have a few months ago. Nor is time between jobs: conveniently, we have some WinXP slaves that have just moved back to WinXP after a few months of being Win7 slaves, so you can look at things like https://secure.pub.build.mozilla.org/buildapi/recent/t-xp32-ix-130 and see that if anything the time between jobs now is lower than it was in March.
Reporter
Comment 23•8 years ago
re comment 21: I have asked Alin to document the issues he is seeing with winxp machines not completing jobs/hanging in a bug, so we can implement automation so these machines are not out of service for multiple hours.
Comment 24•8 years ago
We already wrote that automation once, it's called slaverebooter, so if you can find its corpse (I thought it lived on github) it'll possibly help skip some of its troubles, though not the one that caused us to just give up and shut it off: if you don't either let it poke masters to see whether a slave is connected or just blindly reboot even if a slave might be doing a job, and instead rely on doing a graceful shutdown and waiting for that to have clearly failed based on how much longer it has been, your automation will be far too slow. Slaverebooter would wind up taking things out for ten hours, while I typically reboot WinXP slaves after 3, 2.5 if I'm in a hurry.
Comment 25•8 years ago
Sorry for the delay, I've filed bug 1298001 and attached some screenshots to exemplify the issue. Not sure of what improvements we should pursue here though (maybe some runner changes?)
Reporter
Comment 26•8 years ago
12:56 PM <jmaher> kmoir: coul we experiment with turning off SETA for winxp and see if 'queue collapsing' (slang: coalescing), would do a better job?
12:56 PM <dustin> we have a half-dozen tools that adjust scheduling
Attachment #8785387 - Flags: review?(jmaher)
Comment 27•8 years ago
Comment on attachment 8785387 [details] [diff] [review]
bug1296329.patch

Review of attachment 8785387 [details] [diff] [review]:
-----------------------------------------------------------------

thanks, that was easy
Attachment #8785387 - Flags: review?(jmaher) → review+
Reporter
Updated•8 years ago
Attachment #8785387 - Flags: checked-in+
Comment 28•8 years ago
Checked the logs from bm81; it looks like the XP jobs are indeed no longer skipped by SETA. I updated the google docs sheet with data up until August 29 for try, autoland and m-i, but I'm not seeing any clear change after the last patch yet (maybe more days are needed for that).
Comment 29•8 years ago
A single damning log line that tells us where all our problems lie:

2016-09-02 17:51:31-0700 [-] mergeRequests: Windows 8 64-bit mozilla-inbound opt test mochitest-media: exceeded limit (3)

In https://hg.mozilla.org/build/buildbotcustom/annotate/515987988fd4b9855661c5733708fa8fb24aa0b9/misc.py#l2228 we think we are setting builderMergeLimits for Win8 mochitest-media to 7 based on https://hg.mozilla.org/build/buildbot-configs/annotate/a2b95fc7a8522a99e8d2d1761a8f0261072c48e5/mozilla-tests/config_seta.py#l45 but we aren't, we're getting the default from https://hg.mozilla.org/build/buildbotcustom/annotate/2d29e0ad789e/misc.py#l399 (was 3, now 8 in prod as of a few minutes ago) instead.

So what happens is, we have 6 Win8 builds sendchange, and we schedule just a few tests on each of them, then the 7th sendchanges, and we schedule every test on all 7 builds, thinking we'll coalesce all of them, but instead we coalesce the first 3, then the second 3, then do the last one (very, very roughly speaking, since there are generally also other previous pending jobs, and more added afterward, possibly including a 14th build and a whole new full set before we get to that last one). And now that the default is 8, we should instead coalesce all 7, maybe even picking up a stray that was already pending, and get pretty much what we want.

Before we turn SETA scheduling back on for WinXP, we should probably look at what sort of backlog we get coalescing 8, since I suspect we never actually saw either coalescing 7 or coalescing 14, and didn't actually need bug 1175568 so much as we just needed to not be coalescing only 3. Probably we almost never actually get 14 and hit 2 hours first, other than maybe in the first rush after a long closure, but actually skipping over 14 pushes is a very very wide gap.
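The failure mode here is a silent dict-lookup fallback: the per-builder merge limit is intended to come from the SETA config, but a key mismatch quietly yields the default. A hypothetical reduction of the problem (names are illustrative, not the actual misc.py code):

```python
DEFAULT_MERGE_LIMIT = 8  # buildbot's fallback; was 3 until the prod change noted above

# Intended per-builder overrides, keyed by exact builder name (hypothetical data)
builder_merge_limits = {
    "Windows 8 64-bit mozilla-inbound opt test mochitest-media": 7,
}

def merge_limit(builder_name):
    # If builder_name does not match the dict key exactly (renamed tree,
    # different separator, trailing whitespace...), this silently falls
    # back to the default instead of the intended per-builder value.
    return builder_merge_limits.get(builder_name, DEFAULT_MERGE_LIMIT)

print(merge_limit("Windows 8 64-bit mozilla-inbound opt test mochitest-media"))  # 7
print(merge_limit("Windows 8 64-bit autoland opt test mochitest-media"))         # 8
```

The lookup never raises, so nothing flags the mismatch; only the "exceeded limit (3)" log line gave it away.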
Comment 30•8 years ago
oh, great find :philor. This goes to the original issue where we saw confusion on large sets of pending revisions. I would be happy to see SETA for winxp again soon- maybe let's see how this works for win8 for a week and discuss winxp next week?
Comment 31•8 years ago
I think this is probably fallout from bug 1293554, where we disabled SETA for the test masters. When that happened, they stopped using the SETA configs to determine how many jobs to merge together, and fell back to the default of 3.
Reporter
Comment 32•8 years ago
Does this mean that we should revert the work in bug 1293554 to disable seta on the test masters or is it sufficient to increase the number of jobs to be coalesced in bug 1299378?
Flags: needinfo?(catlee)
Comment 33•8 years ago
I think we should revert the work in bug 1293554. Either that, or re-enable XP SETA and rely on the increased number of jobs to coalesce that was landed in bug 1299378. philor, what's your preference?
Flags: needinfo?(catlee) → needinfo?(philringnalda)
Comment 34•8 years ago
1. Reenable SETA for WinXP
2. Switch seta_config.py to skip 8 for both WinXP and non-WinXP, to get rid of both the weirdness of 7 skipped and 8 coalesced and the risk of sometimes (though it would only actually happen during a daytime reopen after an extended tree closure) actually skipping 14 in WinXP instead of skipping 2 hours, which is too much also
3. While we wait for enough load to evaluate the results (won't happen this week, probably won't happen during merge week, the week after is more likely), figure out what we actually want to measure and how to measure it to determine success or failure. The right metric probably resembles "average number of jobs coalesced per push by OS, per tree" but it's not something I'm going to come up with in 2 minute bursts of thought between starrings.
Flags: needinfo?(philringnalda)
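The metric sketched in point 3 (average number of jobs coalesced per push, by OS and tree) could be computed roughly like this; the record shape is assumed for illustration, not an actual buildbot or treeherder schema:

```python
from collections import defaultdict

def avg_coalesced_per_push(job_records):
    """Average number of requests coalesced per push, keyed by (os, tree).

    Each record is assumed to carry 'os', 'tree', 'push_id' and
    'coalesced' (how many pending requests were merged into that job).
    """
    sums = defaultdict(int)
    pushes = defaultdict(set)
    for rec in job_records:
        key = (rec["os"], rec["tree"])
        sums[key] += rec["coalesced"]
        pushes[key].add(rec["push_id"])
    return {key: sums[key] / len(pushes[key]) for key in sums}

records = [
    {"os": "winxp", "tree": "autoland", "push_id": 1, "coalesced": 3},
    {"os": "winxp", "tree": "autoland", "push_id": 2, "coalesced": 1},
    {"os": "win8", "tree": "mozilla-inbound", "push_id": 9, "coalesced": 6},
]
print(avg_coalesced_per_push(records))
# {('winxp', 'autoland'): 2.0, ('win8', 'mozilla-inbound'): 6.0}
```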
Comment 35•8 years ago
I had built a dashboard to track the effects of SETA: http://alertmanager.allizom.org/dailyjobs.html it isn't 100% perfect and might need some updating- I am actually planning to get rid of it soon, possibly it can help us out!
Reporter
Comment 36•8 years ago
patch to address the suggestions in comment 34
Attachment #8789381 - Flags: review?(jmaher)
Comment 37•8 years ago
Comment on attachment 8789381 [details] [diff] [review]
bug1296329-2.patch

Review of attachment 8789381 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla-tests/config_seta.py
@@ +31,5 @@
>     "android-4-3-armv7-api15": ("android-api-15", ["ubuntu64_vm_armv7_mobile", "ubuntu64_vm_armv7_large"]),
> }
>
> # platforms and tests to exclude from configs because they are deprecated or lacking data
> +platform_exclusions = ["android-2-3-armv7-api9", "Android 4.2 x86 Emulator", "Android 4.3 armv7 API 15+"]

in the future we should examine android 4.3 as an option, I do seta calculations for it.
Attachment #8789381 - Flags: review?(jmaher) → review+
Reporter
Updated•8 years ago
Attachment #8789381 - Flags: checked-in+
Reporter
Comment 38•8 years ago
patch to fix issues so seta is enabled for Android. The seta data changed a bit in format and some platforms are no longer relevant so I cleaned that up. It worked in staging.
Attachment #8791735 - Flags: review?(jmaher)
Comment 39•8 years ago
Comment on attachment 8791735 [details] [diff] [review]
bug1296329android.patch

Review of attachment 8791735 [details] [diff] [review]:
-----------------------------------------------------------------

this gets us working now, feel free to ignore my suggestion or do it later.

::: mozilla-tests/config_seta.py
@@ +83,2 @@
>     continue
> +  if platform_filter == "desktop" and "Android" in platform:

with the introduction of taskcluster we can have random platform names- could we do:

if platform_filter == ... and 'ndroid' in platform:

this ignores the A|a question at the beginning.
Attachment #8791735 - Flags: review?(jmaher) → review+
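The case-tolerant check suggested in the review above reduces to a substring test; a minimal sketch (the function names are mine, not from config_seta.py):

```python
def is_android(platform):
    # Matches "Android 4.3 armv7 API 15+", "android-4-3-armv7-api15", etc.,
    # sidestepping the uppercase/lowercase 'A' question entirely.
    return "ndroid" in platform

def desktop_platforms(platforms):
    # Keep only the non-Android platform names for the desktop filter.
    return [p for p in platforms if not is_android(p)]

print(desktop_platforms(["Android 4.3 armv7 API 15+", "win8", "android-api-15", "xp_ix"]))
# ['win8', 'xp_ix']
```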
Reporter
Comment 40•8 years ago
sure I could do that but this is for scheduling jobs on buildbot only, we don't reduce the taskcluster jobs that are scheduled via this code.
Comment 41•8 years ago
I agree- just beware we will have a very random set of build and test names showing up in the API. I do try to sanitize it as much as possible. Good point though- you only care about Android; just make sure the logic handles scenarios with odd data.
Reporter
Comment 42•8 years ago
jmaher: good point, I landed this patch which incorporated your suggestion
Attachment #8791735 - Attachment is obsolete: true
Attachment #8791818 - Flags: checked-in+
Reporter
Comment 43•8 years ago
I verified that the seta test scheduler logs are skipping Android jobs. Seta just has data for Android 4.3.
Comment 44•8 years ago
The TaskCluster platform names aren't random -- at least, no more random than the existing chaos between the half-dozen ways we define platforms.
Comment 45•8 years ago
Even easier way to measure whether we're done here: come back two months later and see whether anyone even remembers this bug existed.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard