Closed Bug 1386264 Opened 4 years ago Closed 4 years ago
very high pending counts for macosx tests
See https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010 - 11284 jobs. It looks like jobs on inbound are completing in a reasonable timeframe, but try is very backed up. E.g. this try run from last night is still pending on macosx tests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5c52319b670c4ea0a0115ea486762431af61a371&exclusion_profile=false I can see this applies to other pushes from Monday as well.
:aobreja is already looking at this and touched base with folks in #taskcluster. Assigning this to him as he's got the knowledge already.
Assignee: nobody → aobreja
Status: NEW → ASSIGNED
Priority: -- → P1
oh okay, thanks that is fantastic. I looked for an existing bug but couldn't find one.
Adding some more context from IRC:

<&garndt> hi aobreja|buildduty !
14:03:33 <aobreja|buildduty> garndt: Hi
14:04:47 <&garndt> let me see where these jobs are pending
14:05:42 <&garndt> btw, there are still 15 machines not taking jobs... 1 of them I think is wcosta's loaner (0200)
14:05:46 <&garndt> https://www.irccloud.com/pastebin/bL8XE6Tr/
14:07:02 <aobreja|buildduty> I'll have a look at all 15
We don't know for sure what caused this huge backlog. Machines running on taskcluster are taking jobs; there were 15 machines that weren't. Some of those were unreachable and I opened bugs with DCOps for them; the rest were restarted or re-imaged and began taking jobs. So at this point we have ~12k pending jobs and ~350 machines running in taskcluster.
I don't see any OS X 10.10 jobs in any state other than "pending" since: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dcd486b3eef98f07527705b06a5826b1004f6a3c&filter-searchStr=os%20x%2010.10 All OS X 10.10 jobs for today are shown as "pending": https://treeherder.mozilla.org/#/jobs?repo=try&filter-searchStr=os%20x%2010.10&fromchange=9fb288b0726ef68d8b73c4251d16d97c662dc0b8&selectedJob=119780260
<•arr> kmoir: Callek: not sure if one of you would be good choices to help debug, but mtabara is working on the release right now and we still have a massive OS X backlog without knowing the root cause
12:21 PM <•kmoir> arr: I can look
12:21 PM <•arr> kmoir: thanks!
12:21 PM <•Callek> I have some errands to run in about 30-40 minutes (has to be today), so not in the best position right now to look
12:21 PM <•Callek> kmoir: thanks!
12:22 PM <•Callek> kmoir: fwiw, my initial suspicion was relating to https://hg.mozilla.org/mozilla-central/filelog/default/taskcluster/taskgraph/try_option_syntax.py johan's and my own pushes, but mihai looked and didn't see anything that suggested that was at fault...
12:23 PM — mtabara would definitely love a double check on ^
12:24 PM <mtabara> I looked at the last 20-30 try pushes and all macosx tests running were cases where we: a) "u all" b) "macosx" was set as platform and then some tests were enumerated under "-u X, Y, Z"
12:24 PM <mtabara> that's how I checked
12:26 PM <•kmoir> I looked at that earlier and came to a similar conclusion
12:26 PM <•kmoir> didn't see any mac tests running that shouldn't be given the try syntax
12:27 PM <•Callek> alternate thought, are we running *more* tests on beta than we were last week? due to the uplift?
12:28 PM <•Callek> (or release, or both)
12:28 PM <•Callek> their scheduling priority could be impacting longer waits on try
12:28 PM <•kmoir> right we are running more tests, that seems a large number for beta, also there are more stylo tests enabled
12:29 PM <•kmoir> i wonder if there are just enough macs for the trunk branches + beta and then try has a lower priority and doesn't get machines, will look into this
12:31 PM <•kmoir> for instance, until recently we didn't have all these stylo tests running on inbound, they only ran on m-c
12:32 PM <•kmoir> which significantly increased the number of tests running on macosx
12:39 PM <mtabara> out of curiosity, how can we look into jobs per branch specification in terms of pending?
12:41 PM <•kmoir> actually the stylo tests are here just under the wrong header
12:52 PM <•kmoir> https://irccloud.mozilla.com/file/CLuI874J/Screen%20Shot%202017-08-01%20at%2012.51.58%20PM.png
12:52 PM <•kmoir> going to ask if we need that in the quantum meeting I have in a few min
12:54 PM <mtabara> what do you mean by wrong header?
1:10 PM <•kmoir> they are at the bottom, instead of grouped with the other mac tests
1:10 PM <•kmoir> from #taskcluster https://www.irccloud.com/pastebin/FwoJzij4/
1:11 PM <•kmoir> I talked to the stylo team in the meeting and I will turn off mac tests on trunk - minus autoland, m-c
I found that the macosx stylo tests are not in the seta data, either for buildbot or for taskcluster. (Of course, the ones on trunk have been migrated to taskcluster; I was just doing a sanity check.) See https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=taskscluster So for macosx64-stylo debug and macosx64-stylo opt builds on autoland and m-i, it looks like all the tests run on every push. I don't know if this is a recent change. The macosx stylo tests were changed about a week ago to run off the regular build with a preference enabled in the tests; see bug 1374748 for details of this change. Armen, is this expected behaviour? What do we need to do to get the macosx64-stylo* jobs enabled in seta for taskcluster?
When adding the macOS Stylo tests, I only updated run-on-projects where there were already exceptions for linux64-stylo. We may want to add an exception for all test types Stylo uses to skip .*-stylo/.* test platforms on mozilla-inbound.
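For context, run-on-projects exceptions of the kind described above live in taskcluster/ci/test/tests.yml as per-platform overrides. A minimal sketch of the shape (illustrative only, mirroring the existing linux64-stylo entries, not the actual patch):

```yaml
# Sketch of a run-on-projects exception -- the surrounding keys and exact
# project lists here are illustrative, patterned on the linux64-stylo lines.
run-on-projects:
    by-test-platform:
        linux64-stylo/.*: ['mozilla-central', 'try']
        macosx64-stylo/.*: ['autoland', 'mozilla-central', 'try']
        default: built-projects
```

Platforms matching a `by-test-platform` regex get the listed projects; everything else falls through to `default`.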
Yes, I'm testing a patch now, as discussed at the end of comment #6.
Diff of the jobs with `./mach taskgraph target -p parameters.yml` from m-i. I also applied the patch to m-c and there were no changes.
Comment on attachment 8892618 [details] [diff] [review] bug1386264.patch disable macosx stylo tests on inbound - I'm not quite clear on the logic behind disabling on inbound but not autoland. I would have guessed we would want to disable on both. However, if we want to keep autoland stylo tests, I think this patch is correct. - I thought about the 'try' in run-on-projects causing the builds to always run on Try, but aiui the try syntax target task method ignores run-on-projects anyway, and filters by comment. I think this looks good.
Attachment #8892618 - Flags: review?(aki) → review+
We might be hitting the 2-week grace period for new tasks on SETA. See also: https://bugzilla.mozilla.org/show_bug.cgi?id=1386405
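The 2-week grace period means SETA treats any task first seen within the last 14 days as high priority, so it gets scheduled on every push regardless of failure history. A rough sketch of that behavior (the function and argument names are my own for illustration, not SETA's actual code):

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(days=14)

def effective_priority(stored_priority, first_seen, now):
    """Return the priority SETA-style scheduling would use.

    Tasks first seen within the grace period are forced to priority 1
    ("run on every push") regardless of stored priority; older tasks
    keep whatever priority the failure data assigned.
    """
    if now - first_seen < GRACE_PERIOD:
        return 1  # always scheduled while inside the grace period
    return stored_priority

# A task added 3 days ago is still run on every push, even if SETA
# would otherwise deprioritize it to 5 ("run every 5th push"):
now = datetime(2017, 8, 1)
assert effective_priority(5, datetime(2017, 7, 29), now) == 1
assert effective_priority(5, datetime(2017, 6, 1), now) == 5
```

This would explain why the newly-added macosx64-stylo tasks run on every push even once they appear in the seta data.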
Comment on attachment 8892618 [details] [diff] [review] bug1386264.patch disable macosx stylo tests on inbound Review of attachment 8892618 [details] [diff] [review]: ----------------------------------------------------------------- ::: taskcluster/ci/test/tests.yml @@ +1321,5 @@ > run-on-projects: > by-test-platform: > linux64-stylo-sequential/.*: ['mozilla-central','try'] > linux64-stylo/.*: ['mozilla-central', 'try'] > + macosx64-stylo/.*: ['autoland', 'mozilla-central', 'try'] All the Talos tests could be restricted further to just m-c and try, like the Linux lines above them. (We aren't actually running Talos for macOS Stylo yet, but hope to soon, so this is nice prep work.)
Comment on attachment 8892618 [details] [diff] [review] bug1386264.patch disable macosx stylo tests on inbound Review of attachment 8892618 [details] [diff] [review]: ----------------------------------------------------------------- ::: taskcluster/ci/test/tests.yml @@ +1816,5 @@ > + run-on-projects: > + by-test-platform: > + macosx64-stylo/.*: ['autoland', 'mozilla-central', 'try'] > + default: built-projects > + mozharness: This line is duplicated and probably needs to be removed.
Good catch :)
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/mozilla-inbound/rev/1466ea7d584f very high pending counts for macosx tests r=aki
Pushed by email@example.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/1c5f2189d049 very high pending counts for macosx tests r=aki DONTBUILD
aki: the quantum/stylo team asked me to keep the tests on autoland, which is why they are not included.
jryans: I'll update the talos tests.
Looks like the tests aren't scheduled on m-i now: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1466ea7d584f9e7f8dd40cb99a6b4cb88d561ea6 I was talking to Dustin in #taskcluster and he mentioned that try tasks have a lifetime of one day, which should bring our pending count down as they expire, since try is the bulk of the pending count. If we can't get seta addressed tomorrow, perhaps we should consider disabling the jobs on autoland temporarily.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Dustin, is there a way for me to look at the pending queue of macosx tasks to see when they will expire? I still see our pending counts increasing here, now at 16K: https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010
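For reference, the pending-count URL above returns a small JSON blob with the count rather than a task list. A sketch of polling it (the `pendingTasks` field name is my assumption about the response shape; verify against the queue API docs):

```python
import json
from urllib.request import urlopen

PENDING_URL = ("https://queue.taskcluster.net/v1/pending/"
               "releng-hardware/gecko-t-osx-1010")

def parse_pending(payload):
    """Extract the pending count from the queue's JSON response.

    Assumes a response shaped like {"provisionerId": ...,
    "workerType": ..., "pendingTasks": <int>} -- the field name is
    an assumption, not confirmed from this bug.
    """
    return json.loads(payload)["pendingTasks"]

def current_pending(url=PENDING_URL):
    """Fetch the live count (network call; fails offline)."""
    with urlopen(url) as resp:
        return parse_pending(resp.read())

# Offline check against a canned response:
sample = ('{"provisionerId": "releng-hardware", '
          '"workerType": "gecko-t-osx-1010", "pendingTasks": 16023}')
assert parse_pending(sample) == 16023
```

Polling this in a loop is how you would chart whether the backlog is growing or draining; the endpoint does not expose per-task deadlines.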
There is not an easy way to show what's in the queue at the moment, but here are the counts of pending tasks per hour in which they were scheduled, to give an idea of when some might drop off or get picked up by a worker. I do not think we're going to hit a point where one large chunk drops off, because we are completing the oldest try jobs first. We're racing between completing them and the deadline being hit. For instance, the oldest pending job is only 22 hours old. Assuming the deadline is 24 hours after creation and we keep completing 1k per hour, we'll complete enough per hour to stay ahead of them hitting their deadline.

      scheduled      | count
---------------------+-------
 2017-08-01 04:00:00 |   491
 2017-08-01 05:00:00 |   199
 2017-08-01 06:00:00 |   164
 2017-08-01 07:00:00 |   673
 2017-08-01 08:00:00 |   704
 2017-08-01 09:00:00 |   994
 2017-08-01 10:00:00 |  1383
 2017-08-01 11:00:00 |   306
 2017-08-01 12:00:00 |   327
 2017-08-01 13:00:00 |   285
 2017-08-01 14:00:00 |   656
 2017-08-01 15:00:00 |   773
 2017-08-01 16:00:00 |    98
 2017-08-01 17:00:00 |   217
 2017-08-01 18:00:00 |  1383
 2017-08-01 19:00:00 |   612
 2017-08-01 20:00:00 |  1518
 2017-08-01 21:00:00 |    95
 2017-08-01 22:00:00 |   197
 2017-08-01 23:00:00 |   947
 2017-08-02 00:00:00 |  1107
 2017-08-02 01:00:00 |  1083
 2017-08-02 02:00:00 |  1386
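The race described here (oldest-first completion vs. a 24h deadline) can be sketched as a check over the hourly buckets. The numbers below come from this comment; the function itself and its uniform-rate assumption are mine, for illustration:

```python
def backlog_stays_ahead(buckets, rate_per_hour, deadline_hours=24):
    """Check whether oldest-first completion beats the deadline.

    buckets: list of (age_hours, count) pairs, any order.
    Since the oldest tasks run first, every task in a bucket of age
    `a` -- plus everything older -- must finish within
    (deadline_hours - a) hours at the given completion rate.
    """
    cumulative = 0
    for age, count in sorted(buckets, reverse=True):  # oldest first
        cumulative += count
        if cumulative > rate_per_hour * (deadline_hours - age):
            return False  # this bucket would expire before we reach it
    return True

# Oldest buckets from the table above: at ~1k completions/hour the
# queue stays ahead of the 24h deadline...
assert backlog_stays_ahead([(22, 491), (21, 199), (20, 164)], 1000)
# ...but a much slower pool would start expiring tasks:
assert not backlog_stays_ahead([(22, 491), (21, 199), (20, 164)], 200)
```

This matches the comment's conclusion: no large chunk expires at once as long as the completion rate keeps the cumulative backlog under each bucket's remaining time.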
When I looked, everything I could see running was non-try. I created a table containing all tasks from that workerType created in the last 36 hours - 60937 of them, with 15760 pending.

taskcluster-task-analysis::DATABASE=> select project, count(*) from pendingmac group by project;
 project  | count
----------+-------
 oak      |   294
 cedar    |   468
 pine     |  2196
 date     |    87
 try      | 11926
 graphics |   597
 autoland |   192
(7 rows)

So as expected, just about everything is try at this point, since it's lowest priority. Right now, at about 0300 UTC:

taskcluster-task-analysis::DATABASE=> select min(created) from pendingmac where project='try';
           min
-------------------------
 2017-08-01 03:48:57.504
(1 row)

So the oldest is not quite 24 hours old yet. Looking at tasks a bit over 24 hours old, they seem to have all completed. So I think try is running about 23.5 hours behind right now, meaning that deadline expiration isn't helping to reduce the pending counts. Example push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ea4170e0089193869762150165b0c2f9f0123cf ...and I just saw I'm going to midair with greg, so I'll leave it there.
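The same per-project breakdown can be reproduced from raw task records without a database. A sketch, assuming each record is a dict with `project` and `state` keys (the record shape here is illustrative, not the actual `pendingmac` schema):

```python
from collections import Counter

def pending_by_project(tasks):
    """Group pending task records by project, like the SQL above.

    Each task is assumed to be a dict carrying 'project' and 'state';
    real records would come from the queue or treeherder APIs.
    """
    return Counter(t["project"] for t in tasks if t["state"] == "pending")

# Tiny worked example:
tasks = [
    {"project": "try", "state": "pending"},
    {"project": "try", "state": "pending"},
    {"project": "autoland", "state": "running"},  # not pending: excluded
    {"project": "oak", "state": "pending"},
]
counts = pending_by_project(tasks)
assert counts["try"] == 2
assert counts["oak"] == 1
assert "autoland" not in counts
```

`Counter.most_common()` then gives the same ordering the SQL `group by` plus a sort would.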
Oh, and what are we running right now?

taskcluster-task-analysis::DATABASE=> select project, count(*) from recentmac where state='running' group by project;
     project     | count
-----------------+-------
 autoland        |   311
 mozilla-inbound |    24
(2 rows)

So every try push is just adding to the backlog. Note that the sum there is 335, which is pretty close to the number of macs we have, so I think just about all of the hardware is busy.
The queue this morning GMT is ~17K: https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010 I'm wondering whether there's any way we can actually clear the existing jobs from try, or modify that expiration timeframe somehow. It sounds like the try jobs that mostly populate the queue only get their fair share of the machine pool once they are inches away from being expired. Do we have a contingency plan for situations like this? Something to clear/kill existing try jobs or the like?
Reassigning this to :kmoir as she's been driving the things forward in the past day. Thanks Kim!
Assignee: aobreja → kmoir
Patch to temporarily disable stylo tests on autoland until seta is fixed, and to change the awsy tests to not run on trunk by default.
Comment on attachment 8892924 [details] [diff] [review] bug1386264autoland.patch change awsy stylo tests to not run on trunk by default + disable stylo mac tests on autoland; will talk to the stylo team before re-enabling if we can't get seta fixed to reduce load soon
Attachment #8892924 - Flags: review?(aki)
kmoir: This will run awsy on both opt and debug? erahm: Do we want debug?
I can change it to opt only if that works better
(In reply to Bob Clary [:bc:] from comment #30) > kmoir: This will run awsy on both opt and debug? erahm: Do we want debug? In your change to add AWSY, it was only added to opt test platforms, so I believe :kmoir's change alone won't add debug: https://hg.mozilla.org/integration/mozilla-inbound/diff/fa83b1463e2b/taskcluster/ci/test/test-platforms.yml
Attachment #8892924 - Flags: review?(aki) → review+
(In reply to Bob Clary [:bc:] from comment #30) > kmoir: This will run awsy on both opt and debug? erahm: Do we want debug? Just opt, but per jryans it sounds like this will work as-is.
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/mozilla-inbound/rev/bcdde389bfac very high pending counts for macosx tests r=aki DONTBUILD
Disabled awsy stylo tests from running by default on most branches. We'll see how the load goes down now that seta is enabled for macosx stylo tests on autoland. Right now it's at around 16-17K.
Attachment #8893059 - Flags: checked-in+
We seem to be chewing through the backlog, at least. Down to 11K tests at the moment.
Pending counts are now down to 1100, which is a more normal range. Going to close this bug but leave the larger tracking bug open, as it tracks some longer-term work to make more efficient use of the test pools.
Status: REOPENED → RESOLVED
Closed: 4 years ago → 4 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard