Closed
Bug 1386264
Opened 7 years ago
Closed 7 years ago
very high pending counts for macosx tests
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P1)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(firefox56 fixed)
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
firefox56 | --- | fixed |
People
(Reporter: kmoir, Assigned: kmoir)
References
Details
Attachments
(4 files)
13.55 KB, patch | mozilla: review+
3.31 KB, text/plain
15.64 KB, patch | mozilla: review+
1.37 KB, patch | kmoir: checked-in+
See https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010: 11284 jobs are pending.
It looks like jobs on inbound are completing in a reasonable timeframe, but try is very backed up.
For example, this try run from last night is still pending on macosx tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5c52319b670c4ea0a0115ea486762431af61a371&exclusion_profile=false
I can see this applies to other pushes from Monday as well.
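For anyone wanting to watch this number from a script, here is a minimal sketch. The URL is the one quoted above; the JSON field name `pendingTasks` and the alert threshold are assumptions made for illustration, not something stated in this bug. The sketch only parses a response body, so it runs without network access.

```python
import json

# URL quoted in this bug; shown only for reference, it is not fetched here.
PENDING_URL = "https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010"

# Hypothetical alert threshold, chosen between the "normal" (~1100) and
# "very high" (~11000) counts mentioned in this bug.
BACKLOG_THRESHOLD = 5000


def pending_count(body: str) -> int:
    """Extract the pending count from a JSON response body.

    Assumes the body is an object with a numeric `pendingTasks` field.
    """
    payload = json.loads(body)
    return int(payload["pendingTasks"])


def is_backlogged(count: int, threshold: int = BACKLOG_THRESHOLD) -> bool:
    """Return True when the pending count exceeds the threshold."""
    return count > threshold


# Sample body shaped like the assumed response, using the count from this bug.
sample = '{"provisionerId": "releng-hardware", "workerType": "gecko-t-osx-1010", "pendingTasks": 11284}'
print(pending_count(sample), is_backlogged(pending_count(sample)))
```

In a live setting the body would come from an HTTP GET of PENDING_URL, for example with urllib.request.urlopen.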
Comment 1•7 years ago
|
||
:aobreja is already looking at this and touched base with folks in #taskcluster.
Assigning this to him as he's got the knowledge already.
Assignee: nobody → aobreja
Status: NEW → ASSIGNED
Priority: -- → P1
Assignee | ||
Comment 2•7 years ago
|
||
oh okay, thanks that is fantastic. I looked for an existing bug but couldn't find one.
Comment 3•7 years ago
|
||
Adding some more context from IRC
&garndt> hi aobreja|buildduty !
14:03:33 <aobreja|buildduty> garndt:Hi
14:04:47 <&garndt> let me see where these jobs are pending
14:05:42 <&garndt> btw, there are still 15 machines not taking jobs...1 of them I think is wcosta's loaner (0200)
14:05:46 <&garndt> https://www.irccloud.com/pastebin/bL8XE6Tr/
14:07:02 <aobreja|buildduty> I'll have a look on all 15
Comment 4•7 years ago
|
||
I don't know for sure what caused this huge backlog. Machines running on taskcluster are taking jobs. There were 15 machines that didn't: some of them were unreachable and I opened bugs with DCOps for them, and the rest were restarted or re-imaged and began taking jobs.
So at this point we have ~12k pending jobs and ~350 machines running in taskcluster.
Comment 5•7 years ago
|
||
I don't see any OS X 10.10 job in a state other than "pending" since:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dcd486b3eef98f07527705b06a5826b1004f6a3c&filter-searchStr=os%20x%2010.10
All jobs for today for OS X 10.10 are shown as "pending":
https://treeherder.mozilla.org/#/jobs?repo=try&filter-searchStr=os%20x%2010.10&fromchange=9fb288b0726ef68d8b73c4251d16d97c662dc0b8&selectedJob=119780260
Assignee | ||
Comment 6•7 years ago
|
||
kmoir: Callek: not sure if one of you would be good choices to help debug, but mtabara is working on the release right now and we still have massive OS X backlog without knowing the root cause
12:21 PM
<•kmoir> Kim Moir arr: I can look
12:21 PM
<•arr> kmoir: thanks!
12:21 PM
<•Callek> Justin Wood I have some errands to run in about 30-40 minutes (has to be today), so not in the best position right now to look
12:21 PM kmoir: thanks!
12:22 PM kmoir: fwiw, my initial suspicion was related to https://hg.mozilla.org/mozilla-central/filelog/default/taskcluster/taskgraph/try_option_syntax.py (johan's and my own pushes), but mihai looked and didn't see anything suggesting that was at fault...
12:23 PM
— mtabara would definitely love a double check on ^
12:24 PM
<mtabara> I looked at the last 20-30 try pushes and all macosx tests running were cases where a) "-u all" was used, or b) "macosx" was set as platform and then some tests were enumerated by name under "-u X, Y, Z"
12:24 PM that's how I checked
12:26 PM
<•kmoir> Kim Moir I looked at that earlier and came to a similar conclusion
12:26 PM didn't see any mac tests running that shouldn't be given the try syntax
12:26 PM spacurar|buildduty → spacurar|afk
12:27 PM
<•Callek> Justin Wood alternate thought, are we running *more* tests on beta than we were last week? due to the uplift?
12:28 PM (or release, or both)
12:28 PM their scheduling priority could be impacting longer waits on try
12:28 PM
<•kmoir> Kim Moir right we are running more tests, that seems a large number for beta, also there are more stylo tests enabled
12:28 PM → leeroybot joined ⇐ leeroybot2 quit
12:29 PM
<•kmoir> Kim Moir i wonder if there are just enough macs for the trunk branches + beta and then try has a lower priority and doesn't get machines, will look into this
12:31 PM for instance, until recently we didn't have all these stylo tests running on inbound, they only ran on m-c
12:32 PM which significantly increased the number of tests running on macosx
12:39 PM
<mtabara> out of curiosity, how can we look into jobs per branch specification in terms of pending?
12:41 PM
<•kmoir> Kim Moir actually the stylo tests are here just under the wrong header
12:52 PM https://irccloud.mozilla.com/file/CLuI874J/Screen%20Shot%202017-08-01%20at%2012.51.58%20PM.png (Screen Shot 2017-08-01 at 12.51.58 PM.png, 216.26KB, image/png)
12:52 PM going to ask if we need that in the quantum meeting I have in few min
12:54 PM
<mtabara> what do you mean by wrong header?
1:10 PM
<•kmoir> Kim Moir they are at the bottom, instead of grouped with the other mac tests
1:10 PM from #taskcluster https://www.irccloud.com/pastebin/FwoJzij4/
1:11 PM I talked to the stylo team in the meeting and I will turn off the mac tests on trunk, except on autoland and m-c
Assignee | ||
Comment 7•7 years ago
|
||
I found that macosx stylo tests are not in the seta data in taskcluster, either for buildbot or for taskcluster. (Of course, the ones on trunk have been migrated to taskcluster, I was just doing a sanity check) See
https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot
https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=taskscluster
So for macosx64-stylo debug and macosx64-stylo opt builds on autoland and m-i, it looks like all the tests run on every push. I don't know if this is a recent change. The macosx stylo tests were changed about a week ago to run off the regular build with a preference enabled in the tests. See bug 1374748 for details of this change.
Armen, is this expected behaviour? What do we need to do to get the macosx64-stylo* jobs enabled in seta for taskcluster?
Flags: needinfo?(armenzg)
Comment 8•7 years ago
When adding the macOS Stylo tests, I only updated run-on-projects where there were already exceptions for linux64-stylo.
We may want to add an exception for all test types Stylo uses to skip .*-stylo/.* test platforms on mozilla-inbound.
Assignee | ||
Comment 9•7 years ago
|
||
Yes, I'm testing a patch now, as discussed at the end of comment #6.
Assignee | ||
Comment 10•7 years ago
|
||
Assignee | ||
Comment 11•7 years ago
|
||
Diff of the jobs produced by mach taskgraph target -p parameters.yml, using parameters from m-i.
I also applied the patch to m-c and there were no changes.
Assignee | ||
Updated•7 years ago
|
Attachment #8892618 -
Attachment description: bug1386264.patch → bug1386264.patch disable macosx stylo tests on inbound
Attachment #8892618 -
Flags: review?(aki)
Comment 12•7 years ago
|
||
Comment on attachment 8892618 [details] [diff] [review]
bug1386264.patch disable macosx stylo tests on inbound
- I'm not quite clear on the logic behind disabling on inbound but not autoland. I would have guessed we would want to disable on both. However, if we want to keep autoland stylo tests, I think this patch is correct.
- I thought about the 'try' in run-on-projects causing the builds to always run on Try, but aiui the try syntax target task method ignores run-on-projects anyway, and filters by comment.
I think this looks good.
Attachment #8892618 -
Flags: review?(aki) → review+
Comment 13•7 years ago
|
||
We might be hitting the 2-week grace period for new tasks on SETA.
See also: https://bugzilla.mozilla.org/show_bug.cgi?id=1386405
Flags: needinfo?(armenzg)
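To illustrate why a grace period would make brand-new tasks run on every push, here is a rough sketch of a SETA-style scheduling decision. Only the two-week grace period comes from the comment above; the function name, the "low value" flag, the skip interval, and the example dates are all hypothetical and are not taken from SETA's actual implementation.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(weeks=2)  # the 2-week grace period mentioned above
SKIP_INTERVAL = 5                  # hypothetical: run low-value tasks every Nth push


def should_schedule(first_seen: datetime, now: datetime,
                    low_value: bool, push_number: int) -> bool:
    """Decide whether a task runs on this push (illustrative only)."""
    # New tasks have no failure history yet, so they always run.
    if now - first_seen < GRACE_PERIOD:
        return True
    # Tasks considered valuable always run.
    if not low_value:
        return True
    # Low-value tasks only run on every SKIP_INTERVAL-th push.
    return push_number % SKIP_INTERVAL == 0


# Hypothetical dates: a task first seen one week before the decision.
now = datetime(2017, 8, 1)
print(should_schedule(datetime(2017, 7, 25), now, low_value=True, push_number=3))
```

Under these assumptions, a test platform added about a week earlier would be scheduled on every push regardless of its value, which matches the load pattern discussed in this bug.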
Comment 14•7 years ago
Comment on attachment 8892618 [details] [diff] [review]
bug1386264.patch disable macosx stylo tests on inbound
Review of attachment 8892618 [details] [diff] [review]:
-----------------------------------------------------------------
::: taskcluster/ci/test/tests.yml
@@ +1321,5 @@
> run-on-projects:
> by-test-platform:
> linux64-stylo-sequential/.*: ['mozilla-central','try']
> linux64-stylo/.*: ['mozilla-central', 'try']
> + macosx64-stylo/.*: ['autoland', 'mozilla-central', 'try']
All the Talos tests could be restricted further to just m-c and try, like the Linux lines above them. (We aren't actually running Talos for macOS Stylo yet, but hope to soon, so this is nice prep work.)
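As a rough illustration of how a by-test-platform table like the one quoted in the review selects projects, here is a small sketch. It is not the taskgraph code: the table merges the entries quoted in the review hunks in this bug, the helper names are made up, and the symbolic default (built-projects) is deliberately left unresolved.

```python
import re

# Entries copied from the review hunks in this bug; merging them into one
# table is an assumption made for illustration.
RUN_ON_PROJECTS = {
    "linux64-stylo-sequential/.*": ["mozilla-central", "try"],
    "linux64-stylo/.*": ["mozilla-central", "try"],
    "macosx64-stylo/.*": ["autoland", "mozilla-central", "try"],
    "default": "built-projects",
}


def resolve_by_test_platform(table, test_platform):
    """Return the value of the first regex key that fully matches."""
    for pattern, value in table.items():
        if pattern != "default" and re.fullmatch(pattern, test_platform):
            return value
    return table["default"]


def runs_on(table, test_platform, project):
    """True/False for explicit project lists, None for symbolic values."""
    value = resolve_by_test_platform(table, test_platform)
    if isinstance(value, str):
        # Symbolic values such as "built-projects" are not resolved here.
        return None
    return project in value


print(runs_on(RUN_ON_PROJECTS, "macosx64-stylo/opt", "mozilla-inbound"))
print(runs_on(RUN_ON_PROJECTS, "macosx64-stylo/opt", "autoland"))
```

With this table, macosx64-stylo tests are selected for autoland, mozilla-central and try but not for mozilla-inbound, which is the effect the patch is after.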
Comment 15•7 years ago
Comment on attachment 8892618 [details] [diff] [review]
bug1386264.patch disable macosx stylo tests on inbound
Review of attachment 8892618 [details] [diff] [review]:
-----------------------------------------------------------------
::: taskcluster/ci/test/tests.yml
@@ +1816,5 @@
> + run-on-projects:
> + by-test-platform:
> + macosx64-stylo/.*: ['autoland', 'mozilla-central', 'try']
> + default: built-projects
> + mozharness:
The line is duplicated, probably needs to be removed.
Comment 16•7 years ago
|
||
Good catch :)
Comment 17•7 years ago
|
||
Pushed by kmoir@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1466ea7d584f
very high pending counts for macosx tests r=aki
Comment 18•7 years ago
|
||
Pushed by kmoir@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1c5f2189d049
very high pending counts for macosx tests r=aki DONTBUILD
Assignee | ||
Comment 19•7 years ago
|
||
aki: the quantum/stylo team asked me to keep the tests on autoland, this is why they are not included
jryans: I'll update the talos tests
Assignee | ||
Comment 20•7 years ago
|
||
Looks like the tests aren't scheduled in m-i now
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1466ea7d584f9e7f8dd40cb99a6b4cb88d561ea6
I was talking to Dustin in #taskcluster and he mentioned that try tasks have a lifetime of one day. Since try tasks make up the bulk of the pending count, it should come down as they expire.
If we can't get seta addressed tomorrow perhaps we should consider disabling the jobs on autoland temporarily.
Comment 21•7 years ago
|
||
bugherder |
Assignee | ||
Updated•7 years ago
|
Assignee | ||
Comment 22•7 years ago
|
||
Dustin, is there a way for me to look at the pending queues of tasks for macosx to look at when they will expire? I still see our pending counts increasing here, now at 16K
https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010
Flags: needinfo?(dustin)
Comment 23•7 years ago
|
||
There is not an easy way to show what's in the queue at the moment, but here are the counts of pending tasks, grouped by the hour in which they were scheduled, to give an idea of when some might drop off or get picked up by a worker. I do not think we're going to hit a point where one large chunk drops off, because we are completing the oldest try jobs first. We're racing between completing them and their deadline being hit. For instance, the oldest pending job was created only 22 hours ago. Assuming the deadline is 24 hours after creation and we keep completing 1k per hour, we'll complete enough per hour to stay ahead of them hitting their deadline.
scheduled | count
---------------------+-------
2017-08-01 04:00:00 | 491
2017-08-01 05:00:00 | 199
2017-08-01 06:00:00 | 164
2017-08-01 07:00:00 | 673
2017-08-01 08:00:00 | 704
2017-08-01 09:00:00 | 994
2017-08-01 10:00:00 | 1383
2017-08-01 11:00:00 | 306
2017-08-01 12:00:00 | 327
2017-08-01 13:00:00 | 285
2017-08-01 14:00:00 | 656
2017-08-01 15:00:00 | 773
2017-08-01 16:00:00 | 98
2017-08-01 17:00:00 | 217
2017-08-01 18:00:00 | 1383
2017-08-01 19:00:00 | 612
2017-08-01 20:00:00 | 1518
2017-08-01 21:00:00 | 95
2017-08-01 22:00:00 | 197
2017-08-01 23:00:00 | 947
2017-08-02 00:00:00 | 1107
2017-08-02 01:00:00 | 1083
2017-08-02 02:00:00 | 1386
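The race between completion and deadline can be sketched with a small simulation. The counts are copied from the table; the 22-hour age of the oldest bucket, the 24-hour deadline, and the 1k per hour rate are the figures from the comment. The simulation assumes oldest-first completion and, unrealistically, no new tasks arriving, so it only illustrates the arithmetic.

```python
from collections import deque

# Pending counts per scheduled hour, oldest first
# (2017-08-01 04:00 through 2017-08-02 02:00).
COUNTS = [491, 199, 164, 673, 704, 994, 1383, 306, 327, 285, 656, 773,
          98, 217, 1383, 612, 1518, 95, 197, 947, 1107, 1083, 1386]


def drain(counts, oldest_age_h, rate_per_h, deadline_h):
    """Simulate hourly draining; return (completed, expired, hours)."""
    # Each bucket is [age in hours, remaining tasks]; buckets are 1 hour apart.
    queue = deque([oldest_age_h - i, n] for i, n in enumerate(counts))
    completed = expired = hours = 0
    while queue:
        hours += 1
        capacity = rate_per_h
        # Workers pick up the oldest tasks first.
        while queue and capacity > 0:
            take = min(capacity, queue[0][1])
            queue[0][1] -= take
            capacity -= take
            completed += take
            if queue[0][1] == 0:
                queue.popleft()
        # One hour passes; anything at or past its deadline expires.
        for bucket in queue:
            bucket[0] += 1
        while queue and queue[0][0] >= deadline_h:
            expired += queue.popleft()[1]
    return completed, expired, hours


print(drain(COUNTS, oldest_age_h=22, rate_per_h=1000, deadline_h=24))
```

Under these assumptions nothing expires and the backlog in the table drains in 16 hours; in practice new try pushes keep arriving, so the real queue shrinks more slowly or grows.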
Comment 24•7 years ago
|
||
When I looked, everything I could see running was non-try.
I created a table containing all tasks from that workerType created in the last 36 hours - 60937 of them, with 15760 pending.
taskcluster-task-analysis::DATABASE=> select project, count(*) from pendingmac group by project;
project | count
----------+-------
oak | 294
cedar | 468
pine | 2196
date | 87
try | 11926
graphics | 597
autoland | 192
(7 rows)
So as expected, just about everything is try at this point, since it's lowest priority.
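As a quick check of the numbers in that query, here is a short sketch that recomputes the total and the share of pending tasks belonging to try. The counts are copied from the output above; the helper name is made up.

```python
# Pending tasks per project, copied from the query output above.
PENDING_BY_PROJECT = {
    "oak": 294,
    "cedar": 468,
    "pine": 2196,
    "date": 87,
    "try": 11926,
    "graphics": 597,
    "autoland": 192,
}

TOTAL_PENDING = sum(PENDING_BY_PROJECT.values())


def share(project: str) -> float:
    """Fraction of all pending tasks that belong to the given project."""
    return PENDING_BY_PROJECT[project] / TOTAL_PENDING


print(TOTAL_PENDING, f"{share('try'):.1%}")
```

The total matches the 15760 pending tasks mentioned above, and try accounts for roughly three quarters of them.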
Right now, at about 0300 UTC,
taskcluster-task-analysis::DATABASE=> select min(created) from pendingmac where project='try';
min
-------------------------
2017-08-01 03:48:57.504
(1 row)
so not quite 24 hours old yet. Looking at tasks a bit over 24 hours, they seem to have all completed. So I think try is running about 23.5 hours behind right now, meaning that deadline expiration isn't helping to reduce the pending counts. Example push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ea4170e0089193869762150165b0c2f9f0123cf
..and I just saw I'm going to midair with greg so I'll leave it there.
Comment 25•7 years ago
|
||
Oh, and what are we running right now?
taskcluster-task-analysis::DATABASE=> select project, count(*) from recentmac where state='running' group by project;
project | count
-----------------+-------
autoland | 311
mozilla-inbound | 24
(2 rows)
so every try push is just adding to the backlog. Note that the sum there is 335, which is pretty close to the number of macs we have, so I think just about all of the hardware is busy.
Flags: needinfo?(dustin)
Comment 26•7 years ago
|
||
The queue this morning (GMT) is at ~17K [1].
I'm wondering whether there's any way we can clear the existing jobs from try, or modify that expiration timeframe somehow. It sounds like the try jobs that make up most of the queue are inches away from expiring by the time they actually get their fair share of the machine pool.
Do we have a contingency plan for situations like this, such as clearing or killing existing try jobs?
[1]: https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010
Comment 27•7 years ago
|
||
Reassigning this to :kmoir as she's been driving things forward over the past day.
Thanks Kim!
Assignee: aobreja → kmoir
Assignee | ||
Comment 28•7 years ago
|
||
Patch to temporarily disable stylo tests on autoland until seta is fixed, and to stop the awsy tests from running on trunk by default.
Assignee | ||
Comment 29•7 years ago
|
||
Comment on attachment 8892924 [details] [diff] [review]
bug1386264autoland.patch
Change awsy stylo tests to not run on trunk by default, and disable stylo mac tests on autoland. I will talk to the stylo team before enabling this if we can't get seta fixed to reduce load soon.
Attachment #8892924 -
Flags: review?(aki)
Comment 30•7 years ago
|
||
kmoir: This will run awsy on both opt and debug? erahm: Do we want debug?
Flags: needinfo?(kmoir)
Flags: needinfo?(erahm)
Assignee | ||
Comment 31•7 years ago
|
||
I can change it to opt only if that works better
Flags: needinfo?(kmoir)
Comment 32•7 years ago
(In reply to Bob Clary [:bc:] from comment #30)
> kmoir: This will run awsy on both opt and debug? erahm: Do we want debug?
In your change to add AWSY[1] it was only added to opt test platforms, so I believe :kmoir's change alone won't add debug.
[1]: https://hg.mozilla.org/integration/mozilla-inbound/diff/fa83b1463e2b/taskcluster/ci/test/test-platforms.yml
Updated•7 years ago
|
Attachment #8892924 -
Flags: review?(aki) → review+
Comment 33•7 years ago
|
||
(In reply to Bob Clary [:bc:] from comment #30)
> kmoir: This will run awsy on both opt and debug? erahm: Do we want debug?
Just opt, but per jryans it sounds like this will work as-is.
Flags: needinfo?(erahm)
Comment 34•7 years ago
|
||
Pushed by kmoir@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bcdde389bfac
very high pending counts for macosx tests r=aki DONTBUILD
Assignee | ||
Comment 35•7 years ago
|
||
Disabled awsy stylo tests from running by default on most branches. We will see how the load goes down now that seta is enabled for macosx stylo tests on autoland. Right now it's at around 16-17K.
Attachment #8893059 -
Flags: checked-in+
Comment 36•7 years ago
|
||
bugherder |
Comment 37•7 years ago
|
||
We seem to be chewing through the backlog, at least. Down to 11K tests at the moment.
Assignee | ||
Comment 38•7 years ago
|
||
Pending counts are now down to 1100, which is a more normal range. Going to close this bug but leave the larger tracking bug open, as it tracks some longer-term work to make more efficient use of test pools.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•7 years ago
|
Keywords: leave-open
Updated•7 years ago
|
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard