very high pending counts for macosx tests

RESOLVED FIXED

Status

task
P1
normal
RESOLVED FIXED
2 years ago
Last year

People

(Reporter: kmoir, Assigned: kmoir)

Tracking

unspecified

Firefox Tracking Flags

(firefox56 fixed)

Details

Attachments

(4 attachments)

See https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010 (11284 jobs pending)

It looks like jobs on inbound are completing in a reasonable timeframe but try is very backed up.

e.g. this try run from last night is still pending on macosx tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5c52319b670c4ea0a0115ea486762431af61a371&exclusion_profile=false


I can see this applies to other pushes from Monday as well
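For anyone wanting to watch the pending count over time rather than reloading the URL above, here is a minimal sketch. It assumes the endpoint returns JSON with a numeric `pendingTasks` field (the field name is an assumption, not confirmed in this bug); the helper names are hypothetical.

```python
import json
import urllib.request

# Root of the queue API as used in the URL quoted in this bug.
QUEUE_ROOT = "https://queue.taskcluster.net/v1"


def pending_url(provisioner_id, worker_type, root=QUEUE_ROOT):
    """Build the pending-count URL for a provisioner/workerType pair."""
    return f"{root}/pending/{provisioner_id}/{worker_type}"


def parse_pending_count(body):
    """Extract the pending count from a JSON response body.

    Assumes the response carries a 'pendingTasks' field (an assumption).
    """
    return int(json.loads(body)["pendingTasks"])


def fetch_pending_count(provisioner_id, worker_type):
    """Fetch the current pending count over HTTP (needs network access)."""
    url = pending_url(provisioner_id, worker_type)
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_pending_count(response.read().decode("utf-8"))
```

Calling `fetch_pending_count("releng-hardware", "gecko-t-osx-1010")` periodically and logging the result would show whether the backlog is growing or draining.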
:aobreja is already looking at this and touched base with folks in #taskcluster.
Assigning this to him as he's got the knowledge already.
Assignee: nobody → aobreja
Status: NEW → ASSIGNED
Priority: -- → P1
oh okay, thanks that is fantastic.  I looked for an existing bug but couldn't find one.
Adding some more context from IRC

&garndt> hi aobreja|buildduty !
14:03:33 <aobreja|buildduty> garndt:Hi
14:04:47 <&garndt> let me see where these jobs are pending
14:05:42 <&garndt> btw, there are still 15 machines not taking jobs...1 of them I think is wcosta's loaner (0200)
14:05:46 <&garndt> https://www.irccloud.com/pastebin/bL8XE6Tr/
14:07:02 <aobreja|buildduty> I'll have a look on all 15
I don't know for sure what caused this huge backlog. The machines running on taskcluster are taking jobs. There were 15 machines that weren't; some of them were unreachable and I opened bugs with DCOps for them, and the rest were restarted or re-imaged and began taking jobs.
So at this point we have ~12k pending jobs and ~350 machines running in taskcluster.
kmoir: Callek: not sure if one of you would be good choices to help debug, but mtabara is working on the release right now and we still have massive OS X backlog without knowing the root cause
12:21 PM 
<•kmoir> Kim Moir arr: I can look
12:21 PM 
<•arr> kmoir: thanks!
12:21 PM 
<•Callek> Justin Wood I have some errands to run in about 30-40 minutes (has to be today), so not in the best position right now to look
12:21 PM kmoir: thanks!
12:22 PM kmoir: fwiw, my initial suspicion was relating to https://hg.mozilla.org/mozilla-central/filelog/default/taskcluster/taskgraph/try_option_syntax.py johan's and my own pushes, but mihai looked and didn't see anything suggesting that was at fault...
12:23 PM 
— mtabara would definitely love a double check on ^
12:24 PM 
<mtabara> I looked at the last 20-30 try pushes and all macosx tests running were cases where either: a) "-u all" was used, or b) "macosx" was set as platform and some tests were enumerated by name under "-u X, Y, Z"
12:24 PM that's how I checked
12:26 PM 
<•kmoir> Kim Moir I looked at that earlier and came to a similar conclusion
12:26 PM didn't see any mac tests running that shouldn't be given the try syntax
12:26 PM spacurar|buildduty → spacurar|afk
12:27 PM 
<•Callek> Justin Wood alternate thought, are we running *more* tests on beta than we were last week? due to the uplift?
12:28 PM (or release, or both)
12:28 PM their scheduling priority could be impacting longer waits on try
12:28 PM 
<•kmoir> Kim Moir right we are running more tests, that seems a large number for beta, also there are more stylo tests enabled 
12:28 PM → leeroybot joined  ⇐ leeroybot2 quit  
12:29 PM 
<•kmoir> Kim Moir i wonder if there are just enough macs for the trunk branches + beta and then try has a lower priority and doesn't get machines, will look into this
12:31 PM for instance, until recently we didn't have all these stylo tests running on inbound, they only ran on m-c
12:32 PM which significantly increased the number of tests running on macosx
12:39 PM 
<mtabara> out of curiosity, how can we look into jobs per branch specification in terms of pending?
12:41 PM 
<•kmoir> Kim Moir actually the stylo tests are here just under the wrong header 
12:52 PM https://irccloud.mozilla.com/file/CLuI874J/Screen%20Shot%202017-08-01%20at%2012.51.58%20PM.png (Screen Shot 2017-08-01 at 12.51.58 PM.png, 216.26KB, image/png)
12:52 PM going to ask if we need that in the quantum meeting I have in few min
12:54 PM 
<mtabara> what do you mean by wrong header?
1:10 PM 
<•kmoir> Kim Moir they are at the bottom, instead of grouped with the other mac tests
1:10 PM from #taskcluster https://www.irccloud.com/pastebin/FwoJzij4/
1:11 PM I talked to the stylo team in the meeting and I will turn off the mac tests on trunk, except on autoland and m-c
I found that macosx stylo tests are not in the seta data in taskcluster, either for buildbot or for taskcluster. (Of course, the ones on trunk have been migrated to taskcluster, I was just doing a sanity check) See

https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot
https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=taskscluster

So for macosx64-stylo debug and macosx64-stylo opt builds on autoland and m-i, it looks like all the tests run on every push.  I don't know if this is a recent change.  The macosx stylo tests running off the regular build with a preference enabled in the tests was changed about a week ago.  See bug 1374748 for details of this change.

Armen, is this expected behaviour?  What do we need to do to get the macosx64-stylo* jobs enabled in seta for taskcluster?
Flags: needinfo?(armenzg)
When adding the macOS Stylo tests, I only updated run-on-projects where there were already exceptions for linux64-stylo.  

We may want to add an exception for all test types Stylo uses to skip .*-stylo/.* test platforms on mozilla-inbound.
Yes, I'm testing a patch now, as discussed at the end of comment #6.
Posted file bug1386264tc.diff
Diff of the jobs generated with mach taskgraph target -p parameters.yml from m-i.

I also applied the patch to m-c and there were no changes.
Attachment #8892618 - Attachment description: bug1386264.patch → bug1386264.patch disable macosx stylo tests on inbound
Attachment #8892618 - Flags: review?(aki)
Comment on attachment 8892618 [details] [diff] [review]
bug1386264.patch disable macosx stylo tests on inbound

- I'm not quite clear on the logic behind disabling on inbound but not autoland. I would have guessed we would want to disable on both. However, if we want to keep autoland stylo tests, I think this patch is correct.

- I thought about the 'try' in run-on-projects causing the builds to always run on Try, but aiui the try syntax target task method ignores run-on-projects anyway, and filters by comment.

I think this looks good.
Attachment #8892618 - Flags: review?(aki) → review+
See Also: → 1386405
We might be hitting the 2-week grace period for new tasks on SETA.
See also: https://bugzilla.mozilla.org/show_bug.cgi?id=1386405
Flags: needinfo?(armenzg)
Depends on: 1386405
Comment on attachment 8892618 [details] [diff] [review]
bug1386264.patch disable macosx stylo tests on inbound

Review of attachment 8892618 [details] [diff] [review]:
-----------------------------------------------------------------

::: taskcluster/ci/test/tests.yml
@@ +1321,5 @@
>      run-on-projects:
>          by-test-platform:
>              linux64-stylo-sequential/.*: ['mozilla-central','try']
>              linux64-stylo/.*: ['mozilla-central', 'try']
> +            macosx64-stylo/.*: ['autoland', 'mozilla-central', 'try']

All the Talos tests could be restricted further to just m-c and try, like the Linux lines above them.  (We aren't actually running Talos for macOS Stylo yet, but hope to soon, so this is nice prep work.)
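To make the effect of the `by-test-platform` keys above easier to see, here is a simplified sketch of how such a regex-keyed mapping could be resolved for a given test platform. This is an illustration only, not the actual taskgraph implementation: the function name is hypothetical, the matching rule (full regex match, then `default`) is an assumption, and the mapping combines the lines quoted in this review with the `default: built-projects` entry from the patch.

```python
import re

# Mapping assembled from the run-on-projects lines quoted in this bug.
RUN_ON_PROJECTS = {
    "linux64-stylo-sequential/.*": ["mozilla-central", "try"],
    "linux64-stylo/.*": ["mozilla-central", "try"],
    "macosx64-stylo/.*": ["autoland", "mozilla-central", "try"],
    "default": "built-projects",
}


def resolve_by_test_platform(mapping, test_platform):
    """Return the value whose regex key fully matches test_platform.

    Falls back to the 'default' entry when no pattern matches
    (assumed behaviour, for illustration only).
    """
    for pattern, value in mapping.items():
        if pattern != "default" and re.fullmatch(pattern, test_platform):
            return value
    return mapping["default"]


if __name__ == "__main__":
    for platform in ("macosx64-stylo/opt", "linux64-stylo/debug", "macosx64/opt"):
        print(platform, "->", resolve_by_test_platform(RUN_ON_PROJECTS, platform))
```

Under these assumptions, a `macosx64-stylo/*` platform resolves to a list that omits mozilla-inbound, which is the intent of the patch, while regular `macosx64/*` platforms fall through to the default.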
Comment on attachment 8892618 [details] [diff] [review]
bug1386264.patch disable macosx stylo tests on inbound

Review of attachment 8892618 [details] [diff] [review]:
-----------------------------------------------------------------

::: taskcluster/ci/test/tests.yml
@@ +1816,5 @@
> +    run-on-projects:
> +        by-test-platform:
> +            macosx64-stylo/.*: ['autoland', 'mozilla-central', 'try']
> +            default: built-projects
> +    mozharness:

The line is duplicated, probably needs to be removed.
Good catch :)
Pushed by kmoir@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1466ea7d584f
very high pending counts for macosx tests r=aki
Pushed by kmoir@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1c5f2189d049
very high pending counts for macosx tests r=aki DONTBUILD
aki: the quantum/stylo team asked me to keep the tests on autoland, this is why they are not included

jryans: I'll update the talos tests
Looks like the tests aren't scheduled in m-i now

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1466ea7d584f9e7f8dd40cb99a6b4cb88d561ea6

I was talking to Dustin in #taskcluster and he mentioned that try tasks have a lifetime of one day. Since try tasks make up the bulk of the pending count, it should come down as they expire.

If we can't get seta addressed tomorrow perhaps we should consider disabling the jobs on autoland temporarily.
https://hg.mozilla.org/mozilla-central/rev/1466ea7d584f
https://hg.mozilla.org/mozilla-central/rev/1c5f2189d049
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Dustin, is there a way for me to inspect the pending queue of macosx tasks to see when they will expire?  I still see our pending counts increasing here, now at 16K

https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010
Flags: needinfo?(dustin)
There is not an easy way to show what's in the queue at the moment, but here are the counts of pending tasks, grouped by the hour in which they were scheduled, to give an idea of when some might drop off or get picked up by a worker.  I do not think we're going to hit a point where one large chunk drops off, because we are completing the oldest try jobs first.  We're racing between completing them and their deadlines being hit.  For instance, the oldest pending job was created only 22 hours ago.  Assuming the deadline is 24 hours after creation and we keep completing 1k per hour, we'll complete enough per hour to keep ahead of them hitting their deadline.

   scheduled         | count
---------------------+-------
 2017-08-01 04:00:00 |   491
 2017-08-01 05:00:00 |   199
 2017-08-01 06:00:00 |   164
 2017-08-01 07:00:00 |   673
 2017-08-01 08:00:00 |   704
 2017-08-01 09:00:00 |   994
 2017-08-01 10:00:00 |  1383
 2017-08-01 11:00:00 |   306
 2017-08-01 12:00:00 |   327
 2017-08-01 13:00:00 |   285
 2017-08-01 14:00:00 |   656
 2017-08-01 15:00:00 |   773
 2017-08-01 16:00:00 |    98
 2017-08-01 17:00:00 |   217
 2017-08-01 18:00:00 |  1383
 2017-08-01 19:00:00 |   612
 2017-08-01 20:00:00 |  1518
 2017-08-01 21:00:00 |    95
 2017-08-01 22:00:00 |   197
 2017-08-01 23:00:00 |   947
 2017-08-02 00:00:00 |  1107
 2017-08-02 01:00:00 |  1083
 2017-08-02 02:00:00 |  1386
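The race between completion and deadline expiry described above can be sketched as a small simulation. This is a rough model under stated assumptions, not how the queue actually schedules work: tasks are drained strictly oldest-first, no new tasks arrive, every task in an hourly bucket expires exactly `deadline_hours` after the bucket's hour, and the whole hourly completion rate goes to these tasks. The function name and the integer-hour clock are hypothetical.

```python
from collections import deque


def simulate_fifo_drain(buckets, now, rate, deadline_hours=24):
    """Simulate draining hourly buckets of pending tasks, oldest first.

    buckets: iterable of (scheduled_hour, count); hours are integers on a
             shared clock (e.g. hours since 2017-08-01 00:00).
    now:     current hour on the same clock.
    rate:    tasks completed per hour.
    Returns (completed, expired).
    """
    queue = deque([hour, count] for hour, count in sorted(buckets))
    completed = expired = 0
    t = now
    while queue:
        # Drop every bucket whose deadline has passed at time t.
        while queue and queue[0][0] + deadline_hours <= t:
            expired += queue.popleft()[1]
        # Spend this hour's capacity on the oldest remaining tasks.
        capacity = rate
        while queue and capacity > 0:
            taken = min(capacity, queue[0][1])
            queue[0][1] -= taken
            capacity -= taken
            completed += taken
            if queue[0][1] == 0:
                queue.popleft()
        t += 1
    return completed, expired


if __name__ == "__main__":
    # First four rows of the table above; hour 27 is 2017-08-02 03:00.
    oldest = [(4, 491), (5, 199), (6, 164), (7, 673)]
    print(simulate_fifo_drain(oldest, now=27, rate=1000))
    print(simulate_fifo_drain(oldest, now=27, rate=100))
```

Lowering `rate` shows how quickly the picture changes once the share of machines available to these tasks drops, since tasks then start hitting their deadline instead of running.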
When I looked, everything I could see running was non-try.

I created a table containing all tasks from that workerType created in the last 36 hours - 60937 of them, with 15760 pending.

taskcluster-task-analysis::DATABASE=> select project, count(*) from pendingmac group by project;
 project  | count 
----------+-------
 oak      |   294
 cedar    |   468
 pine     |  2196
 date     |    87
 try      | 11926
 graphics |   597
 autoland |   192
(7 rows)

So as expected, just about everything is try at this point, since it's lowest priority.
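As a quick cross-check of the per-project counts above, the following sketch sums them and computes each project's share of the pending queue. The counts are copied from the query output; the helper name is hypothetical.

```python
# Pending task counts per project, copied from the query output above.
PENDING_BY_PROJECT = {
    "oak": 294,
    "cedar": 468,
    "pine": 2196,
    "date": 87,
    "try": 11926,
    "graphics": 597,
    "autoland": 192,
}


def pending_shares(counts):
    """Return each project's share of the total pending count, in percent."""
    total = sum(counts.values())
    return {project: 100.0 * n / total for project, n in counts.items()}


if __name__ == "__main__":
    print("total pending:", sum(PENDING_BY_PROJECT.values()))
    shares = pending_shares(PENDING_BY_PROJECT)
    for project, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{project:10s} {share:5.1f}%")
```

The total matches the 15760 pending tasks mentioned above, and try accounts for roughly three quarters of them.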

Right now, at about 0300 UTC,

taskcluster-task-analysis::DATABASE=> select min(created) from pendingmac where project='try';
           min           
-------------------------
 2017-08-01 03:48:57.504
(1 row)

so not quite 24 hours old yet.  Looking at tasks a bit over 24 hours, they seem to have all completed.  So I think try is running about 23.5 hours behind right now, meaning that deadline expiration isn't helping to reduce the pending counts.  Example push:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ea4170e0089193869762150165b0c2f9f0123cf

..and I just saw I'm going to midair with greg so I'll leave it there.
Oh, and what are we running right now?

taskcluster-task-analysis::DATABASE=> select project, count(*) from recentmac where state='running' group by project;
     project     | count 
-----------------+-------
 autoland        |   311
 mozilla-inbound |    24
(2 rows)

so every try push is just adding to the backlog.  Note that the sum there is 335, which is pretty close to the number of macs we have, so I think just about all of the hardware is busy.
Flags: needinfo?(dustin)
Queue this morning GMT is ~17K[1]

I'm wondering whether there is any way we can actually clear the existing jobs from try, or modify that expiration timeframe somehow. It sounds like the try jobs that make up most of the queue only get their fair share of the machine pool when they are inches away from expiring.

Do we have a contingency plan for situations like this? To clear/kill existing try jobs or something similar?

[1]: https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010
Blocks: 1386625
No longer depends on: 1386405
Reassigning this to :kmoir as she's been driving things forward over the past day.
Thanks Kim!
Assignee: aobreja → kmoir
Patch to temporarily disable stylo tests on autoland until seta is fixed, and to stop the awsy tests from running on trunk by default
Comment on attachment 8892924 [details] [diff] [review]
bug1386264autoland.patch

change awsy stylo tests to not run on trunk by default + disable stylo mac tests on autoland, will talk to stylo team before enabling if we can't get seta fixed to reduce load soon
Attachment #8892924 - Flags: review?(aki)
kmoir: This will run awsy on both opt and debug? erahm: Do we want debug?
Flags: needinfo?(kmoir)
Flags: needinfo?(erahm)
I can change it to opt only if that works better
Flags: needinfo?(kmoir)
(In reply to Bob Clary [:bc:] from comment #30)
> kmoir: This will run awsy on both opt and debug? erahm: Do we want debug?

In your change to add AWSY[1] it was only added to opt test platforms, so I believe :kmoir's change alone won't add debug.

[1]: https://hg.mozilla.org/integration/mozilla-inbound/diff/fa83b1463e2b/taskcluster/ci/test/test-platforms.yml
Attachment #8892924 - Flags: review?(aki) → review+
(In reply to Bob Clary [:bc:] from comment #30)
> kmoir: This will run awsy on both opt and debug? erahm: Do we want debug?

Just opt, but per jryans it sounds like this will work as-is.
Flags: needinfo?(erahm)
Pushed by kmoir@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bcdde389bfac
very high pending counts for macosx tests r=aki DONTBUILD
Disabled awsy stylo tests from running by default on most branches.  Will see how the load goes down now that seta is enabled for macosx stylo tests on autoland. Right now the pending count is at around 16-17K.
Attachment #8893059 - Flags: checked-in+
We seem to be chewing through the backlog, at least. Down to 11K tests at the moment.
Pending counts are now down to 1100 which is a more normal range.  Going to close this bug but leave the larger tracking bug open as it tracks some longer term work to make a more efficient use of test pools.
Status: REOPENED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Keywords: leave-open
Product: Release Engineering → Infrastructure & Operations