Closed Bug 1296329 Opened 8 years ago Closed 8 years ago

volume of autoland commits is causing high pending counts

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Unassigned)

References

Details

Attachments

(3 files, 1 obsolete file)

The volume of autoland commits is causing high pending counts

Possible solutions (more than one may apply):
1) Give autoland the same priority as try. It is currently at priority 4; should it be set to 5, the same as try? (This won't reduce overall load, but jobs on other branches would get picked up faster.)
2) Coalesce builds on autoland so we don't have situations where 10 commits in ~10 minutes spawn a massive number of jobs.
3) Increase the frequency at which SETA skips jobs for that branch.
Can we get some basic numbers on the number of commits on autoland vs. inbound vs. fx-team over the last month or few weeks?

I think we could change things for SETA. I am not sure how buildbot could have different skip counts per branch, but that is a possibility; if there are SETA changes, it would be good to know.
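For illustration only, a per-branch skip count could look something like the sketch below: a mapping from branch to SETA settings that the scheduler consults, with a fallback default. The SKIP_COUNTS name, the numbers, and the seta_settings helper are all hypothetical, not the actual buildbot-configs layout.

    # Hypothetical per-branch SETA settings: skip low-value test jobs until
    # either skip_count important pushes or skip_timeout seconds have accumulated.
    SKIP_COUNTS = {
        "mozilla-inbound": {"skip_count": 7,  "skip_timeout": 7200},
        "autoland":        {"skip_count": 14, "skip_timeout": 7200},  # skip more aggressively
        "fx-team":         {"skip_count": 7,  "skip_timeout": 7200},
    }

    def seta_settings(branch):
        """Return the SETA settings for a branch, falling back to a default."""
        return SKIP_COUNTS.get(branch, {"skip_count": 7, "skip_timeout": 7200})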
Is the total number of commits on (inbound + autoland) higher than the historical number of commits on inbound alone?  My impression is that autoland is just a more automated version of inbound, so I would expect those to be similar (with, perhaps, more backing-out going on on autoland since it happens automatically).
Autoland lands all the patches for a bug as a single commit; on inbound/fx-team it is common to put a few patches from various bugs together into a single commit. I wonder if there is a significant bump in the number of commits due to this change.
Is this possibly related to autoland batch landing a bunch of pending commits after a tree closure is resolved?
Buildapi is broken, so getting push numbers is difficult:

https://secure.pub.build.mozilla.org/buildapi/reports/pushes

See bug 1293659.

Anyway, I looked at the hg logs for the two repos:
m-i has about 1400 commits since the beginning of August;
autoland has about 2300 commits in the same time period.

autoland was enabled as a repo at the end of May (bug 1267712)

Also, my second suggestion in the first comment is a bit misleading: builds on autoland are coalesced the same way they are on other integration branches; try is the only branch where they aren't merged. Not sure whether it is an option to coalesce them more aggressively.
pushes by month

autoland

    562 2016-07
    550 2016-08

inbound

   1488 2016-01
   1495 2016-02
   1805 2016-03
   1484 2016-04
   1758 2016-05
   1323 2016-06
   1060 2016-07
    633 2016-08

fx-team

    489 2016-01
    579 2016-02
    552 2016-03
    558 2016-04
    455 2016-05
    370 2016-06
    345 2016-07
    159 2016-08
Giving it the same priority as try isn't an option: autoland is just another mozilla-inbound, where you trigger your push by pressing a button instead of typing a command.

Builds *aren't* coalesced, only tests are. We did that long ago because we used to routinely have hours-long tree closures where we coalesced the builds from pushes A, B, C, and D, only ran the build on D, then got a busted test on D, or perhaps on F when the tests from D, E, and F also coalesced, and then we had to sit closed while builds ran on A, B, and C to get tests on them to see which one was at fault.
How did :kmoir get such different numbers from :gps? Even if we are off by a day or so, that wouldn't account for such a difference.
jmaher: I quickly looked at the hg logs like this:

 hg log --limit 2300 --template '{node|short} | {date|isodatesec}\n'

and took the last date associated with a commit. Obviously that wasn't the right approach.
Dates in commits themselves can't be trusted.

I looked at the pushlog data. The dates in the pushlog come from the server and can be trusted. I also only counted pushes, not commit count, as pushes with multi-part series can skew results otherwise.
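To make the counting method concrete, here is a rough sketch of tallying pushes per month from the pushlog's json-pushes endpoint. The endpoint and its startdate/enddate parameters are standard hg.mozilla.org pushlog features, but the script itself is illustrative, not necessarily what gps ran.

    import collections
    import datetime
    import json
    import urllib.request

    def pushes_per_month(repo_url, startdate, enddate):
        # json-pushes returns {push_id: {"date": epoch, "changesets": [...], ...}};
        # the push date comes from the server, so it can be trusted.
        url = "%s/json-pushes?startdate=%s&enddate=%s" % (repo_url, startdate, enddate)
        with urllib.request.urlopen(url) as resp:
            pushes = json.load(resp)
        counts = collections.Counter()
        for push in pushes.values():
            month = datetime.datetime.utcfromtimestamp(push["date"]).strftime("%Y-%m")
            counts[month] += 1
        return counts

    # Example: pushes to autoland in July and August 2016.
    print(pushes_per_month("https://hg.mozilla.org/integration/autoland",
                           "2016-07-01", "2016-09-01"))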
OK, this is a bit more understandable. If we look at Greg's data, we see most of the load from inbound/fx-team being migrated to autoland; it isn't as though we have a large net gain in commits.


Sum of pushes (autoland + inbound + fx-team):
April:   2042
May:     2213
June:    1693
July:    1967 (all 3 branches start here)
August:  1342 (2/3 of month so far)

Overall I don't see autoland causing a 10%+ uptick in commits; we do have extra merges into m-c at a rate of probably one per day.

I know autoland didn't have SETA previously; that was fixed 2 days ago. Maybe we are seeing aggregate data from before SETA was working, or possibly SETA isn't working properly.
SETA is working; I just looked at the test scheduler logs. Looking at logs from a time when SETA was enabled on both m-i and autoland, tests are being skipped on both branches, and the number of skipped tests lines up with the number of pushes. Also, the SETA data for autoland and m-i is identical, so the same tests should be skipped.
Joel pinged me in #releng to look at this this morning, given the high number of Windows XP pending jobs.

I have confirmed by looking at the logs that winxp (xp_ix platform) jobs are being skipped by SETA for talos, opt and debug. So SETA is working.

One thing I noticed is that they hit the time limit pretty quickly, so not as many tests are being skipped; in today's logs they don't accumulate more than 4 skips before they are all triggered. My explanation for this behaviour is that there are only 122 xp_ix machines and 2600+ pending jobs: the number of pending jobs keeps climbing because SETA hits the timeout before it has skipped enough jobs to stop triggering them. Given this, my suggestion is to increase the timeout for winxp. It is currently 7200s (2 hours), but given the backlog this will not get any better on its own.

2016-08-18 21:53:35-0700 [-] tests-autoland-xp_ix-debug-unittest-14-7200: skipping with 4/14 important changes since only 7098/7200s have elapsed
2016-08-18 21:54:34-0700 [-] tests-autoland-xp_ix-debug-unittest-14-7200: skipping with 4/14 important changes since only 7157/7200s have elapsed
2016-08-18 21:55:17-0700 [-] tests-autoland-xp_ix-debug-unittest-14-7200: triggering with 4/14 important changes since 7200s have elapsed
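For readers unfamiliar with those log lines, here is a minimal sketch of the skip-or-trigger decision they imply: a builder named like tests-autoland-xp_ix-debug-unittest-14-7200 skips jobs until it has seen 14 important changes or 7200 seconds have elapsed, whichever comes first. The function below is illustrative only; the real logic lives in buildbotcustom.

    import time

    def should_trigger(important_changes, skip_count, last_run_ts, timeout, now=None):
        # Returns (trigger?, log message) mirroring the scheduler log format above.
        now = now if now is not None else time.time()
        elapsed = int(now - last_run_ts)
        if important_changes >= skip_count:
            return True, "triggering with %d/%d important changes" % (important_changes, skip_count)
        if elapsed >= timeout:
            return True, "triggering with %d/%d important changes since %ds have elapsed" % (
                important_changes, skip_count, timeout)
        return False, "skipping with %d/%d important changes since only %d/%ds have elapsed" % (
            important_changes, skip_count, elapsed, timeout)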
Does SETA prevent coalescing? I wouldn't expect it to be a problem if it schedules a full set on several pushes, since I'd expect coalescing to mean that we wouldn't actually run them.
Usually we don't see jobs scheduled and pending in treeherder if they are coalesced.
That sounds like something other than coalescing, or at least what I mean by coalescing.

When I talk about it, I'm talking about the thing where a slave becomes free to take a job at the priority of autoland, so buildbot looks at what the oldest job pending for that slave is, finds that it's WinXP opt crashtest, and then since there are six pending WinXP opt crashtest jobs it tells the slave to run on the most recent of those, and coalesces away the five previous ones. You will most certainly see them as pending in treeherder, since coalescing doesn't happen until the second that the slave is available and assigned to run something. You're even quite likely to see them as still pending even after they aren't going to run, since the whole operation of getting data into treeherder and updating it is a tangle and quite often they'll stay shown as pending until after you refresh the whole treeherder page.
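As a deliberately simplified picture of that behaviour, here is a sketch under the assumption that pending requests are queued oldest-first; the class and function names are made up for illustration and are not buildbot's.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Request:
        builder: str                          # e.g. "WinXP opt crashtest"
        push_id: int
        coalesced_into: Optional[int] = None  # push it was merged into, if any

    def take_job(pending: List[Request]) -> Optional[Request]:
        if not pending:
            return None
        # The builder of the oldest pending request decides what the slave runs;
        # the newest request for that builder is what actually gets built.
        builder = pending[0].builder
        same = [r for r in pending if r.builder == builder]
        newest = same[-1]
        for r in same[:-1]:
            r.coalesced_into = newest.push_id  # still shows as pending in treeherder until updated
        return newest

    # Six pending WinXP opt crashtest jobs: only the newest actually runs.
    queue = [Request("WinXP opt crashtest", i) for i in range(1, 7)]
    print(take_job(queue).push_id)  # -> 6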
Further discussion in IRC this afternoon:

3:22 PM <jmaher> kmoir: ok, it looks like when the pending jobs go to get picked up, we do the magical coalescing that SETA recommended
3:22 PM <jmaher> from my previous link: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=windows%20xp&tochange=d929d9aa782933499db42c017259be104db0fb2b&group_state=expanded
3:22 PM <jmaher> the oldest push has many jobs missing and very few pending, before it had a full set of jobs
3:23 PM <kmoir> huh
3:23 PM <jmaher> kmoir: that makes me wonder about the https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=windows%20xp&tochange=d929d9aa782933499db42c017259be104db0fb2b&group_state=expanded dashboard, that is a dashboard of pre-coalesced jobs
3:24 PM <jmaher> kmoir: I specifically remember M(+28) pending on all the jobs
3:24 PM <kmoir> yes, I saw that too
3:25 PM <kmoir> just as a sanity check, earlier I checked the test scheduler logs for different platforms and they all seem to be skipping at similar rates
3:25 PM <jmaher> I had thought (maybe what I recall seeing in the past) that we would only have pending what was supposed to run and not have pending coalesced jobs
3:25 PM <jmaher> kmoir: ok-
3:25 PM <kmoir> yes, I recall that too, it is very confusing
3:25 PM <jmaher> possibly with the high backlog, we end up in an odd state of how we display jobs in treeherder
3:26 PM <jmaher> well, the good news is that the SETA integration seems to be working- possibly we just need to be patient or look at other areas to optimize our load
Kyle has done some work on metrics, see:

https://docs.google.com/spreadsheets/d/1WwTUUj9hV1iaQziviCLcpGt6xg70vEGSKtMmOOsJvoA/edit#gid=0

http://activedata.allizom.org/tools/query.html#query_id=DCTbrO+O

These seem to indicate that we don't have an increase in overall job counts; if that's correct, then increasing SETA coalescing might paper over the problem to some extent without addressing the root cause.

Do we have a way of detecting how much time Windows testers are spending between jobs? Could that have increased recently?
Andrei or Alin, could one of you please look at the data that jgriffin has provided, as well as our sources of information, and see if we can find the root cause of the high pending counts?
Flags: needinfo?(aselagea)
Flags: needinfo?(aobreja)
I don't think we can have a precise measure of the time spent by the XP testers between jobs. These machines get stuck pretty often and they don't take any jobs until rebooted (that is usually done by me or Andrei during EU day and :philor during EU night). So the wait times largely depend on human intervention.

I've extracted some data from the buildrequests table in order to see how many XP jobs are marked as "complete" since June 1 for autoland, mozilla-inbound and try.
https://docs.google.com/spreadsheets/d/1-1K-R3owQ5nUJ8sQ15mQ4Dfr73vLygMa44mwrrHQFyI

From what I can see, the number of jobs completed on autoland has increased during the past two months; some spikes on m-i can also be noticed.
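A rough sketch of the kind of buildrequests query behind that spreadsheet is below. The buildername, complete and complete_at columns are standard buildbot schema; the database name, connection details and the LIKE patterns used to pick out XP builders per branch are assumptions, not verified against the production database.

    import pymysql  # assumed client; any MySQL library would do

    QUERY = """
    SELECT
      CASE
        WHEN buildername LIKE '%autoland%'        THEN 'autoland'
        WHEN buildername LIKE '%mozilla-inbound%' THEN 'mozilla-inbound'
        WHEN buildername LIKE '%try%'             THEN 'try'
      END AS branch,
      DATE(FROM_UNIXTIME(complete_at)) AS day,
      COUNT(*) AS completed_jobs
    FROM buildrequests
    WHERE complete = 1
      AND complete_at >= UNIX_TIMESTAMP('2016-06-01')
      AND buildername LIKE '%XP%'
    GROUP BY branch, day
    ORDER BY day, branch
    """

    def completed_xp_jobs(host, user, password, db="buildbot"):
        # Run the report query and return (branch, day, completed_jobs) rows.
        conn = pymysql.connect(host=host, user=user, password=password, db=db)
        try:
            with conn.cursor() as cur:
                cur.execute(QUERY)
                return cur.fetchall()
        finally:
            conn.close()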
Flags: needinfo?(aselagea)
Flags: needinfo?(aobreja)
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #20)
> I don't think we can have a precise measure of the time spent by the XP
> testers between jobs. These machines get stuck pretty often and they don't
> take any jobs until rebooted (that is usually done by me or Andrei during EU

As someone who hasn't done this in the past: do we know why that happens? Is there a bug on file about this issue? It seems weird that we need manual intervention here.
There have been multiple bugs over the life of our Windows buildbot testing era, bug 972867 being one of the more recent and characteristic of their lack of findings. The slaverebooter tool existed in large part because of this problem, but it turned out to cause other problems and nobody wanted to fix it, so now we're just waiting for the cloud to save us.

But this has always been the case, so it's not the reason we have a problem we didn't have a few months ago.

Nor is time between jobs: conveniently, we have some WinXP slaves that have just moved back to WinXP after a few months of being Win7 slaves, so you can look at things like https://secure.pub.build.mozilla.org/buildapi/recent/t-xp32-ix-130 and see that if anything the time between jobs now is lower than it was in March.
Re: comment 21:

I have asked Alin to file a bug documenting the issues he is seeing with winxp machines hanging or not completing jobs, so we can implement automation that keeps these machines from being out of service for multiple hours.
We already wrote that automation once; it's called slaverebooter. If you can find its corpse (I thought it lived on github) it will possibly help you skip some of its troubles, though not the one that caused us to just give up and shut it off: if you rely on doing a graceful shutdown and waiting for that to have clearly failed based on how long it has been, instead of either letting it poke masters to see whether a slave is connected or just blindly rebooting even if a slave might be doing a job, your automation will be far too slow. Slaverebooter would wind up taking things out for ten hours, while I typically reboot WinXP slaves after 3 hours, 2.5 if I'm in a hurry.
Depends on: 1298001
Sorry for the delay; I've filed bug 1298001 and attached some screenshots to illustrate the issue. Not sure what improvements we should pursue here, though (maybe some runner changes?).
Attached patch bug1296329.patch — Splinter Review
12:56 PM <jmaher> kmoir: could we experiment with turning off SETA for winxp and see if 'queue collapsing' (slang: coalescing) would do a better job?
12:56 PM <dustin> we have a half-dozen tools that adjust scheduling
Attachment #8785387 - Flags: review?(jmaher)
Comment on attachment 8785387 [details] [diff] [review]
bug1296329.patch

Review of attachment 8785387 [details] [diff] [review]:
-----------------------------------------------------------------

thanks, that was easy
Attachment #8785387 - Flags: review?(jmaher) → review+
Attachment #8785387 - Flags: checked-in+
Checked the logs from bm81; it looks like the XP jobs are indeed no longer skipped by SETA. I updated the Google Docs sheet with data up until August 29 for try, autoland and m-i, but I'm not seeing any clear change after the last patch yet (maybe more days are needed for that).
Depends on: 1299377
Depends on: 1299378
A single damning log line that tells us where all our problems lie:

2016-09-02 17:51:31-0700 [-] mergeRequests: Windows 8 64-bit mozilla-inbound opt test mochitest-media: exceeded limit (3)

In https://hg.mozilla.org/build/buildbotcustom/annotate/515987988fd4b9855661c5733708fa8fb24aa0b9/misc.py#l2228 we think we are setting builderMergeLimits for Win8 mochitest-media to 7 based on https://hg.mozilla.org/build/buildbot-configs/annotate/a2b95fc7a8522a99e8d2d1761a8f0261072c48e5/mozilla-tests/config_seta.py#l45 but we aren't, we're getting the default from https://hg.mozilla.org/build/buildbotcustom/annotate/2d29e0ad789e/misc.py#l399 (was 3, now 8 in prod as of a few minutes ago) instead.

So what happens is, we have 6 Win8 builds sendchange, and we schedule just a few tests on each of them, then the 7th sendchanges, and we schedule every test on all 7 builds, thinking we'll coalesce all of them, but instead we coalesce the first 3, then the second 3, then do the last one (very, very roughly speaking, since there are generally also other previous pending jobs, and more added afterward, possibly including a 14th build and a whole new full set before we get to that last one). And now that the default is 8, we should instead coalesce all 7, maybe even picking up a stray that was already pending, and get pretty much what we want.
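A deliberately simplified sketch of that failure mode, with made-up names (the real code is buildbotcustom's mergeRequests/builderMergeLimits handling): if the per-builder limit dict never gets populated, every builder silently falls back to the global default.

    DEFAULT_MERGE_LIMIT = 8   # was 3 until the prod change mentioned above

    def merge_limit(buildername, builder_merge_limits):
        # If the SETA-derived per-builder limits (e.g. {"... mochitest-media": 7})
        # never reach this point, every builder quietly gets the default.
        return builder_merge_limits.get(buildername, DEFAULT_MERGE_LIMIT)

    def merge_requests(buildername, pending_count, builder_merge_limits=None):
        limit = merge_limit(buildername, builder_merge_limits or {})
        # Coalesce at most `limit` pending requests into a single run.
        return min(pending_count, limit)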

Before we turn SETA scheduling back on for WinXP, we should probably look at what sort of backlog we get coalescing 8, since I suspect we never actually saw either coalescing 7 or coalescing 14, and didn't actually need bug 1175568 so much as we just needed to not be coalescing only 3. Probably we almost never actually get 14 and hit 2 hours first, other than maybe in the first rush after a long closure, but actually skipping over 14 pushes is a very very wide gap.
Oh, great find, :philor. This goes back to the original issue where we saw confusion around large sets of pending revisions.

I would be happy to see SETA for winxp again soon; maybe let's see how this works for win8 for a week and discuss winxp next week?
I think this is probably fallout from bug 1293554, where we disabled SETA for the test masters. When that happened, they stopped using the SETA configs to determine how many jobs to merge together, and fell back to the default of 3.
Does this mean that we should revert the work in bug 1293554 that disabled SETA on the test masters, or is it sufficient to increase the number of jobs to be coalesced in bug 1299378?
Flags: needinfo?(catlee)
I think we should revert the work in bug 1293554. Either that, or re-enable XP SETA and rely on the increased number of jobs to coalesce that was landed in bug 1299378.

philor, what's your preference?
Flags: needinfo?(catlee) → needinfo?(philringnalda)
1. Re-enable SETA for WinXP.
2. Switch seta_config.py to skip 8 for both WinXP and non-WinXP, to get rid of both the weirdness of skipping 7 while coalescing 8 and the risk of actually skipping 14 pushes on WinXP instead of skipping for 2 hours (which is also too much); in practice that would only happen during a daytime reopen after an extended tree closure, but 14 skipped pushes is a very, very wide gap.
3. While we wait for enough load to evaluate the results (it won't happen this week, and probably won't happen during merge week; the week after is more likely), figure out what we actually want to measure, and how to measure it, to determine success or failure. The right metric probably resembles "average number of jobs coalesced per push, by OS, per tree" (a rough sketch of one way to compute that follows below), but it's not something I'm going to come up with in 2-minute bursts of thought between starrings.
Flags: needinfo?(philringnalda)
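For illustration only, here is a minimal sketch of computing that "average jobs coalesced per push, by OS and tree" metric. The job-record fields (tree, os, push_id, coalesced) are hypothetical; real data would have to come from buildapi, ActiveData or treeherder.

    from collections import defaultdict

    # Hypothetical job records: {"tree": ..., "os": ..., "push_id": ..., "coalesced": bool}
    def avg_coalesced_per_push(jobs):
        coalesced = defaultdict(int)
        pushes = defaultdict(set)
        for job in jobs:
            key = (job["tree"], job["os"])
            pushes[key].add(job["push_id"])
            if job["coalesced"]:
                coalesced[key] += 1
        return {key: coalesced[key] / len(pushes[key]) for key in pushes}

    sample = [
        {"tree": "autoland", "os": "winxp", "push_id": 1, "coalesced": True},
        {"tree": "autoland", "os": "winxp", "push_id": 1, "coalesced": False},
        {"tree": "autoland", "os": "winxp", "push_id": 2, "coalesced": True},
    ]
    print(avg_coalesced_per_push(sample))  # {('autoland', 'winxp'): 1.0}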
I had built a dashboard to track the effects of SETA:
http://alertmanager.allizom.org/dailyjobs.html

It isn't 100% perfect and might need some updating. I was actually planning to get rid of it soon, but possibly it can help us out here!
patch to address the suggestions in comment 34
Attachment #8789381 - Flags: review?(jmaher)
Comment on attachment 8789381 [details] [diff] [review]
bug1296329-2.patch

Review of attachment 8789381 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla-tests/config_seta.py
@@ +31,5 @@
>                    "android-4-3-armv7-api15": ("android-api-15", ["ubuntu64_vm_armv7_mobile", "ubuntu64_vm_armv7_large"]),
>                    }
>  
>  # platforms and tests to exclude from configs because they are deprecated or lacking data
> +platform_exclusions = ["android-2-3-armv7-api9", "Android 4.2 x86 Emulator", "Android 4.3 armv7 API 15+"]

In the future we should examine Android 4.3 as an option; I do SETA calculations for it.
Attachment #8789381 - Flags: review?(jmaher) → review+
Attachment #8789381 - Flags: checked-in+
Attached patch bug1296329android.patch (obsolete) — Splinter Review
Patch to fix issues so SETA is enabled for Android. The SETA data changed a bit in format and some platforms are no longer relevant, so I cleaned that up. It worked in staging.
Attachment #8791735 - Flags: review?(jmaher)
Comment on attachment 8791735 [details] [diff] [review]
bug1296329android.patch

Review of attachment 8791735 [details] [diff] [review]:
-----------------------------------------------------------------

this gets us working now, feel free to ignore my suggestion or do it later.

::: mozilla-tests/config_seta.py
@@ +83,2 @@
>              continue
> +        if platform_filter == "desktop" and "Android" in platform:

With the introduction of taskcluster we can have fairly arbitrary platform names. Could we do:

if platform_filter == ... and 'ndroid' in platform:

This sidesteps the A|a capitalization question at the beginning.
Attachment #8791735 - Flags: review?(jmaher) → review+
Sure, I could do that, but this is for scheduling jobs on buildbot only; we don't reduce the taskcluster jobs that are scheduled via this code.
I agree, but be aware that we will have a very random set of build and test names showing up in the API; I do try to sanitize it as much as possible. Good point, though: you only care about Android, so just make sure the logic handles scenarios with odd data.
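For what it's worth, a minimal sketch of the kind of case-insensitive check being discussed (the patch that actually landed may differ):

    def is_android(platform):
        # Match Android platforms regardless of how the platform name is capitalized.
        return "android" in platform.lower()

    for p in ("Android 4.3 armv7 API 15+", "android-4-3-armv7-api15", "Windows XP 32-bit"):
        print(p, is_android(p))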
jmaher: good point. I landed this patch, which incorporates your suggestion.
Attachment #8791735 - Attachment is obsolete: true
Attachment #8791818 - Flags: checked-in+
I verified in the SETA test scheduler logs that Android jobs are being skipped. SETA only has data for Android 4.3.
The TaskCluster platform names aren't random -- at least, no more random than the existing chaos between the half-dozen ways we define platforms.
Even easier way to measure whether we're done here: come back two months later and see whether anyone even remembers this bug existed.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard