Closed Bug 1296329 Opened 8 years ago Closed 8 years ago

volume of autoland commits is causing high pending counts

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Unassigned)

References

Details

Attachments

(3 files, 1 obsolete file)

The volume of autoland commits is causing high pending counts

Possible solutions (more than one may apply):
1) Give autoland the same priority as try. It is currently at priority 4; should it be set to 5, the same as try? (This won't reduce overall load, but jobs on other branches would get picked up faster.)
2) Coalesce builds on autoland so we don't have situations where 10 commits in ~10 minutes spawn a massive number of jobs.
3) Increase the frequency at which SETA skips jobs for that branch.
Can we get some basic numbers on the number of commits on autoland vs. inbound vs. fx-team over the last month or few weeks?

I think we could change things for SETA. I am not sure how buildbot could have different skip counts per branch, but that is a possibility; if there are SETA changes, it would be good to know.
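For illustration only, a per-branch skip count could look something like the sketch below: a mapping from branch to SETA settings that the scheduler consults, with a fallback default. The SKIP_COUNTS name, the numbers, and the seta_settings helper are all hypothetical, not the actual buildbot-configs layout.

    # Hypothetical per-branch SETA settings: skip low-value test jobs until
    # either skip_count important pushes or skip_timeout seconds have accumulated.
    SKIP_COUNTS = {
        "mozilla-inbound": {"skip_count": 7,  "skip_timeout": 7200},
        "autoland":        {"skip_count": 14, "skip_timeout": 7200},  # skip more aggressively
        "fx-team":         {"skip_count": 7,  "skip_timeout": 7200},
    }

    def seta_settings(branch):
        """Return the SETA settings for a branch, falling back to a default."""
        return SKIP_COUNTS.get(branch, {"skip_count": 7, "skip_timeout": 7200})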
Is the total number of commits on (inbound + autoland) higher than the historical number of commits on inbound alone?  My impression is that autoland is just a more automated version of inbound, so I would expect those to be similar (with, perhaps, more backing-out going on on autoland since it happens automatically).
Autoland lands all the patches for a bug as a single commit; on inbound/fx-team it is common to put a few patches from various bugs together into a single commit. I wonder if there is a significant bump in the number of commits due to this change.
Is this possibly related to autoland batch landing a bunch of pending commits after a tree closure is resolved?
Buildapi is broken, so getting push numbers is difficult:

https://secure.pub.build.mozilla.org/buildapi/reports/pushes

See bug 1293659.

Anyway, I looked at the hg logs for the two repos:
m-i has about 1400 commits since the beginning of August;
autoland has about 2300 commits in the same time period.

autoland was enabled as a repo at the end of May (bug 1267712)

Also, my second suggestion in the first comment is a bit misleading: builds on autoland are coalesced the same way they are on other integration branches; try is the only branch where they aren't merged. Not sure whether it is an option to coalesce them more aggressively.
pushes by month

autoland

    562 2016-07
    550 2016-08

inbound

   1488 2016-01
   1495 2016-02
   1805 2016-03
   1484 2016-04
   1758 2016-05
   1323 2016-06
   1060 2016-07
    633 2016-08

fx-team

    489 2016-01
    579 2016-02
    552 2016-03
    558 2016-04
    455 2016-05
    370 2016-06
    345 2016-07
    159 2016-08
Giving it the same priority as try isn't an option: autoland is just another mozilla-inbound, where you trigger your push by pressing a button instead of typing a command.

Builds *aren't* coalesced, only tests are. We did that long ago because we used to routinely have hours-long tree closures where we coalesced the builds from pushes A, B, C, and D, only ran the build on D, then got a busted test on D, or perhaps on F when the tests from D, E, and F also coalesced, and then we had to sit closed while builds ran on A, B, and C to get tests on them to see which one was at fault.
How did :kmoir get such different numbers from :gps? Even if we are off by a day or so, that wouldn't account for such a difference.
jmaher: I quickly looked at the hg logs like this:

 hg log --limit 2300 --template '{node|short} | {date|isodatesec}\n'

and took the last date associated with a commit. Obviously that wasn't the right approach.
Dates in commits themselves can't be trusted.

I looked at the pushlog data. The dates in the pushlog come from the server and can be trusted. I also only counted pushes, not commit count, as pushes with multi-part series can skew results otherwise.
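To make the counting method concrete, here is a rough sketch of tallying pushes per month from the pushlog's json-pushes endpoint. The endpoint and its startdate/enddate parameters are standard hg.mozilla.org pushlog features, but the script itself is illustrative, not necessarily what gps ran.

    import collections
    import datetime
    import json
    import urllib.request

    def pushes_per_month(repo_url, startdate, enddate):
        # json-pushes returns {push_id: {"date": epoch, "changesets": [...], ...}};
        # the push date comes from the server, so it can be trusted.
        url = "%s/json-pushes?startdate=%s&enddate=%s" % (repo_url, startdate, enddate)
        with urllib.request.urlopen(url) as resp:
            pushes = json.load(resp)
        counts = collections.Counter()
        for push in pushes.values():
            month = datetime.datetime.utcfromtimestamp(push["date"]).strftime("%Y-%m")
            counts[month] += 1
        return counts

    # Example: pushes to autoland in July and August 2016.
    print(pushes_per_month("https://hg.mozilla.org/integration/autoland",
                           "2016-07-01", "2016-09-01"))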
OK, this is a bit more understandable. If we look at Greg's data, we see most of the load from inbound/fx-team being migrated to autoland; it isn't as though we have a large net gain in commits.


Sum of pushes (autoland + inbound + fx-team):
April:   2042
May:     2213
June:    1693
July:    1967 (all 3 branches start here)
August:  1342 (2/3 of month so far)

Overall I don't see autoland causing a 10%+ uptick in commits; we do have extra merges into m-c at a rate of probably one per day.

I know autoland didn't have SETA previously; that was fixed 2 days ago. Maybe we are seeing aggregate data from before SETA was working, or possibly SETA isn't working properly.
SETA is working; I just looked at the test scheduler logs. Looking at logs from a time when SETA was enabled on both m-i and autoland, tests are being skipped on both branches, and the number of skipped tests lines up with the number of pushes. Also, the SETA data for autoland and m-i is identical, so the same tests should be skipped.
Joel pinged me in #releng to look at this this morning, given the high number of Windows XP pending jobs.

I have confirmed by looking at the logs that winxp (xp_ix platform) jobs are being skipped by SETA for talos, opt and debug. So SETA is working.

One thing I noticed is that they hit the time limit pretty quickly, so not as many tests are being skipped; in today's logs they don't accumulate more than 4 skips before they are all triggered. My explanation for this behaviour is that there are only 122 xp_ix machines and 2600+ pending jobs: the number of pending jobs keeps climbing because SETA hits the timeout before it has skipped enough jobs to stop triggering them. Given this, my suggestion is to increase the timeout for winxp. It is currently 7200s (2 hours), but given the backlog this will not get any better on its own.

2016-08-18 21:53:35-0700 [-] tests-autoland-xp_ix-debug-unittest-14-7200: skipping with 4/14 important changes since only 7098/7200s have elapsed
2016-08-18 21:54:34-0700 [-] tests-autoland-xp_ix-debug-unittest-14-7200: skipping with 4/14 important changes since only 7157/7200s have elapsed
2016-08-18 21:55:17-0700 [-] tests-autoland-xp_ix-debug-unittest-14-7200: triggering with 4/14 important changes since 7200s have elapsed
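For readers unfamiliar with those log lines, here is a minimal sketch of the skip-or-trigger decision they imply: a builder named like tests-autoland-xp_ix-debug-unittest-14-7200 skips jobs until it has seen 14 important changes or 7200 seconds have elapsed, whichever comes first. The function below is illustrative only; the real logic lives in buildbotcustom.

    import time

    def should_trigger(important_changes, skip_count, last_run_ts, timeout, now=None):
        # Returns (trigger?, log message) mirroring the scheduler log format above.
        now = now if now is not None else time.time()
        elapsed = int(now - last_run_ts)
        if important_changes >= skip_count:
            return True, "triggering with %d/%d important changes" % (important_changes, skip_count)
        if elapsed >= timeout:
            return True, "triggering with %d/%d important changes since %ds have elapsed" % (
                important_changes, skip_count, timeout)
        return False, "skipping with %d/%d important changes since only %d/%ds have elapsed" % (
            important_changes, skip_count, elapsed, timeout)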
Does SETA prevent coalescing? I wouldn't expect it to be a problem if it schedules a full set on several pushes, since I'd expect coalescing to mean that we wouldn't actually run them.
Usually we don't see jobs scheduled and pending in treeherder if they are coalesced.
That sounds like something other than coalescing, or at least what I mean by coalescing.

When I talk about it, I'm talking about the thing where a slave becomes free to take a job at the priority of autoland, so buildbot looks at what the oldest job pending for that slave is, finds that it's WinXP opt crashtest, and then since there are six pending WinXP opt crashtest jobs it tells the slave to run on the most recent of those, and coalesces away the five previous ones. You will most certainly see them as pending in treeherder, since coalescing doesn't happen until the second that the slave is available and assigned to run something. You're even quite likely to see them as still pending even after they aren't going to run, since the whole operation of getting data into treeherder and updating it is a tangle and quite often they'll stay shown as pending until after you refresh the whole treeherder page.
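As a deliberately simplified picture of that behaviour, here is a sketch under the assumption that pending requests are queued oldest-first; the class and function names are made up for illustration and are not buildbot's.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Request:
        builder: str                          # e.g. "WinXP opt crashtest"
        push_id: int
        coalesced_into: Optional[int] = None  # push it was merged into, if any

    def take_job(pending: List[Request]) -> Optional[Request]:
        if not pending:
            return None
        # The builder of the oldest pending request decides what the slave runs;
        # the newest request for that builder is what actually gets built.
        builder = pending[0].builder
        same = [r for r in pending if r.builder == builder]
        newest = same[-1]
        for r in same[:-1]:
            r.coalesced_into = newest.push_id  # still shows as pending in treeherder until updated
        return newest

    # Six pending WinXP opt crashtest jobs: only the newest actually runs.
    queue = [Request("WinXP opt crashtest", i) for i in range(1, 7)]
    print(take_job(queue).push_id)  # -> 6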
Further discussion in IRC this afternoon:

3:22 PM <jmaher> kmoir: ok, it looks like when the pending jobs go to get picked up, we do the magical coalescing that SETA recommended
3:22 PM <jmaher> from my previous link: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=windows%20xp&tochange=d929d9aa782933499db42c017259be104db0fb2b&group_state=expanded
3:22 PM <jmaher> the oldest push has many jobs missing and very few pending, before it had a full set of jobs
3:23 PM <kmoir> huh
3:23 PM <jmaher> kmoir: that makes me wonder about the https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=windows%20xp&tochange=d929d9aa782933499db42c017259be104db0fb2b&group_state=expanded dashboard, that is a dashboard of pre-coalesced jobs
3:24 PM <jmaher> kmoir: I specifically remember M(+28) pending on all the jobs
3:24 PM <kmoir> yes, I saw that too
3:25 PM <kmoir> just as a sanity check, earlier I checked the test scheduler logs for different platforms and they all seem to be skipping at similar rates
3:25 PM <jmaher> I had thought (maybe what I recall seeing in the past) that we would only have pending what was supposed to run and not have pending coalesced jobs
3:25 PM <jmaher> kmoir: ok-
3:25 PM <kmoir> yes, I recall that too, it is very confusing
3:25 PM <jmaher> possibly with the high backlog, we end up in an odd state of how we display jobs in treeherder
3:26 PM <jmaher> well, the good news is that the SETA integration seems to be working- possibly we just need to be patient or look at other areas to optimize our load
Kyle has done some work on metrics, see:

https://docs.google.com/spreadsheets/d/1WwTUUj9hV1iaQziviCLcpGt6xg70vEGSKtMmOOsJvoA/edit#gid=0

http://activedata.allizom.org/tools/query.html#query_id=DCTbrO+O

These seem to indicate that we don't have an increase in overall job counts; if that's correct, then increasing SETA coalescing might paper over the problem to some extent without addressing the root cause.

Do we have a way of detecting how much time Windows testers are spending between jobs? Could that have increased recently?
Andrei or Alin, could one of you please look at the data that jgriffin has provided, as well as our sources of information, and see if we can find the root cause of the high pending counts?
Flags: needinfo?(aselagea)
Flags: needinfo?(aobreja)
I don't think we can have a precise measure of the time spent by the XP testers between jobs. These machines get stuck pretty often and they don't take any jobs until rebooted (that is usually done by me or Andrei during EU day and :philor during EU night). So the wait times largely depend on human intervention.

I've extracted some data from the buildrequests table in order to see how many XP jobs are marked as "complete" since June 1 for autoland, mozilla-inbound and try.
https://docs.google.com/spreadsheets/d/1-1K-R3owQ5nUJ8sQ15mQ4Dfr73vLygMa44mwrrHQFyI

From what I can see, the number of jobs completed on autoland has increased during the past two months; some spikes on m-i can also be noticed.
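A rough sketch of the kind of buildrequests query behind that spreadsheet is below. The buildername, complete and complete_at columns are standard buildbot schema; the database name, connection details and the LIKE patterns used to pick out XP builders per branch are assumptions, not verified against the production database.

    import pymysql  # assumed client; any MySQL library would do

    QUERY = """
    SELECT
      CASE
        WHEN buildername LIKE '%autoland%'        THEN 'autoland'
        WHEN buildername LIKE '%mozilla-inbound%' THEN 'mozilla-inbound'
        WHEN buildername LIKE '%try%'             THEN 'try'
      END AS branch,
      DATE(FROM_UNIXTIME(complete_at)) AS day,
      COUNT(*) AS completed_jobs
    FROM buildrequests
    WHERE complete = 1
      AND complete_at >= UNIX_TIMESTAMP('2016-06-01')
      AND buildername LIKE '%XP%'
    GROUP BY branch, day
    ORDER BY day, branch
    """

    def completed_xp_jobs(host, user, password, db="buildbot"):
        # Run the report query and return (branch, day, completed_jobs) rows.
        conn = pymysql.connect(host=host, user=user, password=password, db=db)
        try:
            with conn.cursor() as cur:
                cur.execute(QUERY)
                return cur.fetchall()
        finally:
            conn.close()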
Flags: needinfo?(aselagea)
Flags: needinfo?(aobreja)
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #20)
> I don't think we can have a precise measure of the time spent by the XP
> testers between jobs. These machines get stuck pretty often and they don't
> take any jobs until rebooted (that is usually done by me or Andrei during EU

As someone who hasn't done this in the past: do we know why that happens? Is there a bug on file about this issue? It seems weird that we need manual intervention here.
There have been multiple bugs over the life of our Windows buildbot testing era, bug 972867 being one of the more recent and characteristic of their lack of findings. The slaverebooter tool existed in large part because of this problem, but it turned out to cause other problems and nobody wanted to fix it, so now we're just waiting for the cloud to save us.

But this has always been the case, so it's not the reason we have a problem we didn't have a few months ago.

Nor is time between jobs: conveniently, we have some WinXP slaves that have just moved back to WinXP after a few months of being Win7 slaves, so you can look at things like https://secure.pub.build.mozilla.org/buildapi/recent/t-xp32-ix-130 and see that if anything the time between jobs now is lower than it was in March.
Re: comment 21:

I have asked Alin to file a bug documenting the issues he is seeing with winxp machines hanging or not completing jobs, so we can implement automation that keeps these machines from being out of service for multiple hours.
We already wrote that automation once; it's called slaverebooter. If you can find its corpse (I thought it lived on github) it will possibly help you skip some of its troubles, though not the one that caused us to just give up and shut it off: if you rely on doing a graceful shutdown and waiting for that to have clearly failed based on how long it has been, instead of either letting it poke masters to see whether a slave is connected or just blindly rebooting even if a slave might be doing a job, your automation will be far too slow. Slaverebooter would wind up taking things out for ten hours, while I typically reboot WinXP slaves after 3 hours, 2.5 if I'm in a hurry.
Depends on: 1298001
Sorry for the delay; I've filed bug 1298001 and attached some screenshots to illustrate the issue. Not sure what improvements we should pursue here, though (maybe some runner changes?).
Attached patch bug1296329.patch — Splinter Review
12:56 PM <jmaher> kmoir: could we experiment with turning off SETA for winxp and see if 'queue collapsing' (slang: coalescing) would do a better job?
12:56 PM <dustin> we have a half-dozen tools that adjust scheduling
Attachment #8785387 - Flags: review?(jmaher)
Comment on attachment 8785387 [details] [diff] [review]
bug1296329.patch

Review of attachment 8785387 [details] [diff] [review]:
-----------------------------------------------------------------

thanks, that was easy
Attachment #8785387 - Flags: review?(jmaher) → review+
Attachment #8785387 - Flags: checked-in+
Checked the logs from bm81; it looks like the XP jobs are indeed no longer skipped by SETA. I updated the Google Docs sheet with data up until August 29 for try, autoland and m-i, but I'm not seeing any clear change after the last patch yet (maybe more days are needed for that).
Depends on: 1299377
Depends on: 1299378
A single damning log line that tells us where all our problems lie:

2016-09-02 17:51:31-0700 [-] mergeRequests: Windows 8 64-bit mozilla-inbound opt test mochitest-media: exceeded limit (3)

In https://hg.mozilla.org/build/buildbotcustom/annotate/515987988fd4b9855661c5733708fa8fb24aa0b9/misc.py#l2228 we think we are setting builderMergeLimits for Win8 mochitest-media to 7 based on https://hg.mozilla.org/build/buildbot-configs/annotate/a2b95fc7a8522a99e8d2d1761a8f0261072c48e5/mozilla-tests/config_seta.py#l45 but we aren't, we're getting the default from https://hg.mozilla.org/build/buildbotcustom/annotate/2d29e0ad789e/misc.py#l399 (was 3, now 8 in prod as of a few minutes ago) instead.

So what happens is, we have 6 Win8 builds sendchange, and we schedule just a few tests on each of them, then the 7th sendchanges, and we schedule every test on all 7 builds, thinking we'll coalesce all of them, but instead we coalesce the first 3, then the second 3, then do the last one (very, very roughly speaking, since there are generally also other previous pending jobs, and more added afterward, possibly including a 14th build and a whole new full set before we get to that last one). And now that the default is 8, we should instead coalesce all 7, maybe even picking up a stray that was already pending, and get pretty much what we want.
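A deliberately simplified sketch of that failure mode, with made-up names (the real code is buildbotcustom's mergeRequests/builderMergeLimits handling): if the per-builder limit dict never gets populated, every builder silently falls back to the global default.

    DEFAULT_MERGE_LIMIT = 8   # was 3 until the prod change mentioned above

    def merge_limit(buildername, builder_merge_limits):
        # If the SETA-derived per-builder limits (e.g. {"... mochitest-media": 7})
        # never reach this point, every builder quietly gets the default.
        return builder_merge_limits.get(buildername, DEFAULT_MERGE_LIMIT)

    def merge_requests(buildername, pending_count, builder_merge_limits=None):
        limit = merge_limit(buildername, builder_merge_limits or {})
        # Coalesce at most `limit` pending requests into a single run.
        return min(pending_count, limit)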

Before we turn SETA scheduling back on for WinXP, we should probably look at what sort of backlog we get coalescing 8, since I suspect we never actually saw either coalescing 7 or coalescing 14, and didn't actually need bug 1175568 so much as we just needed to not be coalescing only 3. Probably we almost never actually get 14 and hit 2 hours first, other than maybe in the first rush after a long closure, but actually skipping over 14 pushes is a very very wide gap.
Oh, great find, :philor. This goes back to the original issue where we saw confusion around large sets of pending revisions.

I would be happy to see SETA for winxp again soon; maybe let's see how this works for win8 for a week and discuss winxp next week?
I think this is probably fallout from bug 1293554, where we disabled SETA for the test masters. When that happened, they stopped using the SETA configs to determine how many jobs to merge together, and fell back to the default of 3.
Does this mean that we should revert the work in bug 1293554 that disabled SETA on the test masters, or is it sufficient to increase the number of jobs to be coalesced in bug 1299378?
Flags: needinfo?(catlee)
I think we should revert the work in bug 1293554. Either that, or re-enable XP SETA and rely on the increased number of jobs to coalesce that was landed in bug 1299378.

philor, what's your preference?
Flags: needinfo?(catlee) → needinfo?(philringnalda)
1. Re-enable SETA for WinXP.
2. Switch seta_config.py to skip 8 for both WinXP and non-WinXP, to get rid of both the weirdness of skipping 7 while coalescing 8 and the risk of actually skipping 14 pushes on WinXP instead of skipping for 2 hours (which is also too much); in practice that would only happen during a daytime reopen after an extended tree closure, but 14 skipped pushes is a very, very wide gap.
3. While we wait for enough load to evaluate the results (it won't happen this week, and probably won't happen during merge week; the week after is more likely), figure out what we actually want to measure, and how to measure it, to determine success or failure. The right metric probably resembles "average number of jobs coalesced per push, by OS, per tree" (a rough sketch of one way to compute that follows below), but it's not something I'm going to come up with in 2-minute bursts of thought between starrings.
Flags: needinfo?(philringnalda)
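For illustration only, here is a minimal sketch of computing that "average jobs coalesced per push, by OS and tree" metric. The job-record fields (tree, os, push_id, coalesced) are hypothetical; real data would have to come from buildapi, ActiveData or treeherder.

    from collections import defaultdict

    # Hypothetical job records: {"tree": ..., "os": ..., "push_id": ..., "coalesced": bool}
    def avg_coalesced_per_push(jobs):
        coalesced = defaultdict(int)
        pushes = defaultdict(set)
        for job in jobs:
            key = (job["tree"], job["os"])
            pushes[key].add(job["push_id"])
            if job["coalesced"]:
                coalesced[key] += 1
        return {key: coalesced[key] / len(pushes[key]) for key in pushes}

    sample = [
        {"tree": "autoland", "os": "winxp", "push_id": 1, "coalesced": True},
        {"tree": "autoland", "os": "winxp", "push_id": 1, "coalesced": False},
        {"tree": "autoland", "os": "winxp", "push_id": 2, "coalesced": True},
    ]
    print(avg_coalesced_per_push(sample))  # {('autoland', 'winxp'): 1.0}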
I had built a dashboard to track the effects of SETA:
http://alertmanager.allizom.org/dailyjobs.html

It isn't 100% perfect and might need some updating. I was actually planning to get rid of it soon, but possibly it can help us out here!
patch to address the suggestions in comment 34
Attachment #8789381 - Flags: review?(jmaher)
Comment on attachment 8789381 [details] [diff] [review]
bug1296329-2.patch

Review of attachment 8789381 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla-tests/config_seta.py
@@ +31,5 @@
>                    "android-4-3-armv7-api15": ("android-api-15", ["ubuntu64_vm_armv7_mobile", "ubuntu64_vm_armv7_large"]),
>                    }
>  
>  # platforms and tests to exclude from configs because they are deprecated or lacking data
> +platform_exclusions = ["android-2-3-armv7-api9", "Android 4.2 x86 Emulator", "Android 4.3 armv7 API 15+"]

In the future we should examine Android 4.3 as an option; I do SETA calculations for it.
Attachment #8789381 - Flags: review?(jmaher) → review+
Attachment #8789381 - Flags: checked-in+
Attached patch bug1296329android.patch (obsolete) — Splinter Review
Patch to fix issues so SETA is enabled for Android. The SETA data changed a bit in format and some platforms are no longer relevant, so I cleaned that up. It worked in staging.
Attachment #8791735 - Flags: review?(jmaher)
Comment on attachment 8791735 [details] [diff] [review]
bug1296329android.patch

Review of attachment 8791735 [details] [diff] [review]:
-----------------------------------------------------------------

this gets us working now, feel free to ignore my suggestion or do it later.

::: mozilla-tests/config_seta.py
@@ +83,2 @@
>              continue
> +        if platform_filter == "desktop" and "Android" in platform:

With the introduction of taskcluster we can have fairly arbitrary platform names. Could we do:

if platform_filter == ... and 'ndroid' in platform:

This sidesteps the A|a capitalization question at the beginning.
Attachment #8791735 - Flags: review?(jmaher) → review+
Sure, I could do that, but this is for scheduling jobs on buildbot only; we don't reduce the taskcluster jobs that are scheduled via this code.
I agree, but be aware that we will have a very random set of build and test names showing up in the API; I do try to sanitize it as much as possible. Good point, though: you only care about Android, so just make sure the logic handles scenarios with odd data.
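For what it's worth, a minimal sketch of the kind of case-insensitive check being discussed (the patch that actually landed may differ):

    def is_android(platform):
        # Match Android platforms regardless of how the platform name is capitalized.
        return "android" in platform.lower()

    for p in ("Android 4.3 armv7 API 15+", "android-4-3-armv7-api15", "Windows XP 32-bit"):
        print(p, is_android(p))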
jmaher: good point. I landed this patch, which incorporates your suggestion.
Attachment #8791735 - Attachment is obsolete: true
Attachment #8791818 - Flags: checked-in+
I verified in the SETA test scheduler logs that Android jobs are being skipped. SETA only has data for Android 4.3.
The TaskCluster platform names aren't random -- at least, no more random than the existing chaos between the half-dozen ways we define platforms.
Even easier way to measure whether we're done here: come back two months later and see whether anyone even remembers this bug existed.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard