1386264 - very high pending counts for macosx tests

Assignee

Description

7 years ago

see https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010 	11284 jobs

It looks like jobs on inbound are completing in a reasonable timeframe but try is very backed up.

For example, this try run from last night is still pending on macosx tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5c52319b670c4ea0a0115ea486762431af61a371&exclusion_profile=false


I can see this applies to other pushes from Monday as well
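For reference, the pending figure quoted above can also be read programmatically. This is a minimal sketch, assuming the endpoint returns a JSON object with a `pendingTasks` field (that field name is an assumption and is not confirmed in this bug); `parse_pending` and `fetch_pending` are illustrative helper names.

```python
import json
from urllib.request import urlopen

# URL quoted in the description of this bug.
PENDING_URL = "https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010"


def parse_pending(body):
    """Extract the pending count from a response body.

    Assumes a JSON object with a "pendingTasks" integer field; the field
    name is an assumption and is not confirmed anywhere in this bug.
    """
    return int(json.loads(body)["pendingTasks"])


def fetch_pending(url=PENDING_URL, timeout=10):
    """Fetch the current pending count (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_pending(response.read().decode("utf-8"))


# Offline example using the figure from the description.
sample = '{"workerType": "gecko-t-osx-1010", "pendingTasks": 11284}'
print(parse_pending(sample))
```

Polling `fetch_pending()` at intervals would show whether the backlog is growing or shrinking, although it says nothing about which branch the pending tasks belong to.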

Mihai Tabara [:mtabara]⌚️GMT

Comment 1

7 years ago

:aobreja is already looking at this and touched base with folks in #taskcluster.
Assigning this to him as he's got the knowledge already.

Assignee: nobody → aobreja

Status: NEW → ASSIGNED

Priority: -- → P1

Kim Moir [:kmoir] ET

Assignee

Comment 2

7 years ago

oh okay, thanks that is fantastic.  I looked for an existing bug but couldn't find one.

Mihai Tabara [:mtabara]⌚️GMT

Comment 3

7 years ago

Adding some more context from IRC

<&garndt> hi aobreja|buildduty !
14:03:33 <aobreja|buildduty> garndt:Hi
14:04:47 <&garndt> let me see where these jobs are pending
14:05:42 <&garndt> btw, there are still 15 machines not taking jobs...1 of them I think is wcosta's loaner (0200)
14:05:46 <&garndt> https://www.irccloud.com/pastebin/bL8XE6Tr/
14:07:02 <aobreja|buildduty> I'll have a look on all 15

Andrei Obreja [:aobreja NOT AVAILABLE][:buildduty]

Comment 4

7 years ago

I don't know for sure what caused this huge backlog. Machines running on taskcluster are taking jobs. There were 15 machines which didn't: some of them were unreachable, and I opened bugs with DCOps for them; the rest were restarted or re-imaged and began taking jobs.
So at this point we have ~12k pending jobs and ~350 machines running in taskcluster.

Andrei Obreja [:aobreja NOT AVAILABLE][:buildduty]

Comment 5

7 years ago

I don't see any job in a state other than "pending" for OS X 10.10 since:
 https://treeherder.mozilla.org/#/jobs?repo=try&revision=dcd486b3eef98f07527705b06a5826b1004f6a3c&filter-searchStr=os%20x%2010.10

All jobs for today for OS X 10.10 are shown as "pending":
https://treeherder.mozilla.org/#/jobs?repo=try&filter-searchStr=os%20x%2010.10&fromchange=9fb288b0726ef68d8b73c4251d16d97c662dc0b8&selectedJob=119780260

Kim Moir [:kmoir] ET

Assignee

Comment 6

7 years ago

kmoir: Callek: not sure if one of you would be good choices to help debug, but mtabara is working on the release right now and we still have massive OS X backlog without knowing the root cause
12:21 PM 
<•kmoir> Kim Moir arr: I can look
12:21 PM 
<•arr> kmoir: thanks!
12:21 PM 
<•Callek> Justin Wood I have some errands to run in about 30-40 minutes (has to be today), so not in the best position right now to look
12:21 PM kmoir: thanks!
12:22 PM kmoir: fwiw, my initial suspicion was relating to https://hg.mozilla.org/mozilla-central/filelog/default/taskcluster/taskgraph/try_option_syntax.py johan's and my own pushes, but mihai looked and didn't see anything suggesting that was at fault...
12:23 PM 
— mtabara would definitely love a double check on ^
12:24 PM 
<mtabara> I looked at the last 20-30 try pushes and all macosx tests running were cases where: a) "-u all" was used, or b) "macosx" was set as platform and some tests were explicitly enumerated under "-u X, Y, Z"
12:24 PM that's how I checked
12:26 PM 
<•kmoir> Kim Moir I looked at that earlier and came to a similar conclusion
12:26 PM didn't see any mac tests running that shouldn't be given the try syntax
12:26 PM spacurar|buildduty → spacurar|afk
12:27 PM 
<•Callek> Justin Wood alternate thought, are we running *more* tests on beta than we were last week? due to the uplift?
12:28 PM (or release, or both)
12:28 PM their scheduling priority could be impacting longer waits on try
12:28 PM 
<•kmoir> Kim Moir right we are running more tests, that seems a large number for beta, also there are more stylo tests enabled 
12:28 PM → leeroybot joined  ⇐ leeroybot2 quit  
12:29 PM 
<•kmoir> Kim Moir i wonder if there are just enough macs for the trunk branches + beta and then try has a lower priority and doesn't get machines, will look into this
12:31 PM for instance, until recently we didn't have all these stylo tests running on inbound, they only ran on m-c
12:32 PM which significantly increased the number of tests running on macosx
12:39 PM 
<mtabara> out of curiosity, how can we look into jobs per branch specification in terms of pending?
12:41 PM 
<•kmoir> Kim Moir actually the stylo tests are here just under the wrong header 
12:52 PM https://irccloud.mozilla.com/file/CLuI874J/Screen%20Shot%202017-08-01%20at%2012.51.58%20PM.png (Screen Shot 2017-08-01 at 12.51.58 PM.png, 216.26KB, image/png)
12:52 PM going to ask if we need that in the quantum meeting I have in few min
12:54 PM 
<mtabara> what do you mean by wrong header?
1:10 PM 
<•kmoir> Kim Moir they are at the bottom, instead of grouped with the other mac tests
1:10 PM from #taskcluster https://www.irccloud.com/pastebin/FwoJzij4/
1:11 PM I talked to the stylo team in the meeting and I will turn off the stylo mac tests on trunk, except on autoland and m-c

Kim Moir [:kmoir] ET

Assignee

Comment 7

7 years ago

I found that macosx stylo tests are not in the seta data in taskcluster, either for buildbot or for taskcluster. (Of course, the ones on trunk have been migrated to taskcluster, I was just doing a sanity check) See

https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=buildbot
https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=taskscluster

So for macosx64-stylo debug and macosx64-stylo opt builds on autoland and m-i, it looks like all the tests run on every push.  I don't know if this is a recent change.  The macosx stylo tests running off the regular build with a preference enabled in the tests was changed about a week ago.  See bug 1374748 for details of this change.

Armen, is this expected behaviour?  What do we need to do to get the macosx64-stylo* jobs enabled in seta for taskcluster?

Flags: needinfo?(armenzg)

J. Ryan Stinnett [:jryans] (Use needinfo, replies may be slow)

Comment 8

7 years ago

When adding the macOS Stylo tests, I only updated run-on-projects where there were already exceptions for linux64-stylo.  

We may want to add an exception for all test types Stylo uses to skip .*-stylo/.* test platforms on mozilla-inbound.

Kim Moir [:kmoir] ET

Assignee

Comment 9

7 years ago

Yes, I'm testing a patch now, as discussed at the end of comment 6.

Kim Moir [:kmoir] ET

Assignee

Comment 10

7 years ago

Attached patch bug1386264.patch disable macosx stylo tests on inbound — Details — Splinter Review

Kim Moir [:kmoir] ET

Assignee

Comment 11

7 years ago

Attached file bug1386264tc.diff — Details

Diff of the jobs generated with mach taskgraph target -p parameters.yml from m-i, before and after the patch.

I also applied the patch to m-c and there were no changes.

Kim Moir [:kmoir] ET

Assignee

Updated

7 years ago

Attachment #8892618 - Attachment description: bug1386264.patch → bug1386264.patch disable macosx stylo tests on inbound

Attachment #8892618 - Flags: review?(aki)

Aki Sasaki (not active)

Comment 12

7 years ago

Comment on attachment 8892618 [details] [diff] [review]
bug1386264.patch disable macosx stylo tests on inbound

- I'm not quite clear on the logic behind disabling on inbound but not autoland. I would have guessed we would want to disable on both. However, if we want to keep autoland stylo tests, I think this patch is correct.

- I thought about the 'try' in run-on-projects causing the builds to always run on Try, but aiui the try syntax target task method ignores run-on-projects anyway, and filters by comment.

I think this looks good.

Attachment #8892618 - Flags: review?(aki) → review+

Armen [:armenzg]

Updated

7 years ago

Comment 13

7 years ago

We might be hitting the 2-week grace period for new tasks on SETA.
See also: https://bugzilla.mozilla.org/show_bug.cgi?id=1386405

Flags: needinfo?(armenzg)

Kim Moir [:kmoir] ET

Assignee

Updated

7 years ago

Depends on: 1386405

J. Ryan Stinnett [:jryans] (Use needinfo, replies may be slow)

Comment 14

7 years ago

Comment on attachment 8892618 [details] [diff] [review]
bug1386264.patch disable macosx stylo tests on inbound

Review of attachment 8892618 [details] [diff] [review]:
-----------------------------------------------------------------

::: taskcluster/ci/test/tests.yml
@@ +1321,5 @@
>      run-on-projects:
>          by-test-platform:
>              linux64-stylo-sequential/.*: ['mozilla-central','try']
>              linux64-stylo/.*: ['mozilla-central', 'try']
> +            macosx64-stylo/.*: ['autoland', 'mozilla-central', 'try']

All the Talos tests could be restricted further to just m-c and try, like the Linux lines above them.  (We aren't actually running Talos for macOS Stylo yet, but hope to soon, so this is nice prep work.)

J. Ryan Stinnett [:jryans] (Use needinfo, replies may be slow)

Comment 15

7 years ago

Comment on attachment 8892618 [details] [diff] [review]
bug1386264.patch disable macosx stylo tests on inbound

Review of attachment 8892618 [details] [diff] [review]:
-----------------------------------------------------------------

::: taskcluster/ci/test/tests.yml
@@ +1816,5 @@
> +    run-on-projects:
> +        by-test-platform:
> +            macosx64-stylo/.*: ['autoland', 'mozilla-central', 'try']
> +            default: built-projects
> +    mozharness:

The line is duplicated, probably needs to be removed.
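To make the effect of the `by-test-platform` block in the hunk above concrete, here is a rough Python sketch of how such a regex-keyed mapping could resolve for a given test platform. `resolve_by_test_platform` is a hypothetical helper that only mirrors the idea (whole-string regex match, falling back to `default`); it is not the actual taskgraph code.

```python
import re

# Mapping copied from the reviewed hunk in taskcluster/ci/test/tests.yml.
RUN_ON_PROJECTS = {
    "macosx64-stylo/.*": ["autoland", "mozilla-central", "try"],
    "default": "built-projects",
}


def resolve_by_test_platform(mapping, test_platform):
    """Return the value of the first key whose regex matches the whole platform.

    Hypothetical helper for illustration only; the real lookup lives in
    taskgraph and may differ in its details.
    """
    for pattern, value in mapping.items():
        if pattern != "default" and re.fullmatch(pattern, test_platform):
            return value
    return mapping["default"]


# macOS Stylo platforms get the explicit project list (no mozilla-inbound).
print(resolve_by_test_platform(RUN_ON_PROJECTS, "macosx64-stylo/opt"))
# Anything else falls through to the default.
print(resolve_by_test_platform(RUN_ON_PROJECTS, "linux64/debug"))
```

Under this reading, leaving `mozilla-inbound` out of the list is what stops the macOS Stylo tests from being scheduled there.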

Aki Sasaki (not active)

Comment 16

7 years ago

Good catch :)

Pulsebot

Comment 17

7 years ago

Pushed by kmoir@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1466ea7d584f
very high pending counts for macosx tests r=aki

Pulsebot

Comment 18

7 years ago

Pushed by kmoir@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1c5f2189d049
very high pending counts for macosx tests r=aki DONTBUILD

Kim Moir [:kmoir] ET

Assignee

Comment 19

7 years ago

aki: the quantum/stylo team asked me to keep the tests on autoland, this is why they are not included

jryans: I'll update the talos tests

Kim Moir [:kmoir] ET

Assignee

Comment 20

7 years ago

Looks like the tests aren't scheduled in m-i now

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1466ea7d584f9e7f8dd40cb99a6b4cb88d561ea6

I was talking to Dustin in #taskcluster and he mentioned that try tasks have a lifetime of one day. Since try tasks make up the bulk of the pending count, it should come down as they expire.

If we can't get seta addressed tomorrow perhaps we should consider disabling the jobs on autoland temporarily.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 21

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/1466ea7d584f
https://hg.mozilla.org/mozilla-central/rev/1c5f2189d049

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox56: --- → fixed

Resolution: --- → FIXED

Kim Moir [:kmoir] ET

Assignee

Updated

7 years ago

Status: RESOLVED → REOPENED

Keywords: leave-open

Resolution: FIXED → ---

Kim Moir [:kmoir] ET

Assignee

Comment 22

7 years ago

Dustin, is there a way for me to look at the pending queues of tasks for macosx to see when they will expire?  I still see our pending counts increasing here, now at 16K

https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010

Flags: needinfo?(dustin)

Greg Arndt [:garndt]

Comment 23

7 years ago

There is not an easy way to show what's in the queue at the moment, but here are the counts of pending tasks per hour that they were scheduled, to give an idea of when some might drop off or get picked up by a worker.  I do not think we're going to hit a point where one large chunk drops off, because we are completing the oldest try jobs first.  We're racing between completing them and their deadline being hit.  For instance, the oldest pending job is only 22 hours old.  Assuming the deadline is 24 hours after creation and we keep completing 1k per hour, we'll complete enough per hour to keep ahead of them hitting their deadline.

   scheduled         | count
---------------------+-------
 2017-08-01 04:00:00 |   491
 2017-08-01 05:00:00 |   199
 2017-08-01 06:00:00 |   164
 2017-08-01 07:00:00 |   673
 2017-08-01 08:00:00 |   704
 2017-08-01 09:00:00 |   994
 2017-08-01 10:00:00 |  1383
 2017-08-01 11:00:00 |   306
 2017-08-01 12:00:00 |   327
 2017-08-01 13:00:00 |   285
 2017-08-01 14:00:00 |   656
 2017-08-01 15:00:00 |   773
 2017-08-01 16:00:00 |    98
 2017-08-01 17:00:00 |   217
 2017-08-01 18:00:00 |  1383
 2017-08-01 19:00:00 |   612
 2017-08-01 20:00:00 |  1518
 2017-08-01 21:00:00 |    95
 2017-08-01 22:00:00 |   197
 2017-08-01 23:00:00 |   947
 2017-08-02 00:00:00 |  1107
 2017-08-02 01:00:00 |  1083
 2017-08-02 02:00:00 |  1386
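
The race between completion and deadline can be checked roughly against the table above. This is a back-of-the-envelope sketch, not a model of the real queue: it assumes a 24-hour deadline, a constant completion rate, strictly oldest-first processing, no newly arriving tasks, and a "current" time of 03:00 UTC on 2017-08-02. `simulate_expired` is an illustrative name.

```python
from datetime import datetime, timedelta

# Hourly pending counts copied from the table above (scheduled hour, count).
PENDING_BY_HOUR = [
    ("2017-08-01 04:00:00", 491),
    ("2017-08-01 05:00:00", 199),
    ("2017-08-01 06:00:00", 164),
    ("2017-08-01 07:00:00", 673),
    ("2017-08-01 08:00:00", 704),
    ("2017-08-01 09:00:00", 994),
    ("2017-08-01 10:00:00", 1383),
    ("2017-08-01 11:00:00", 306),
    ("2017-08-01 12:00:00", 327),
    ("2017-08-01 13:00:00", 285),
    ("2017-08-01 14:00:00", 656),
    ("2017-08-01 15:00:00", 773),
    ("2017-08-01 16:00:00", 98),
    ("2017-08-01 17:00:00", 217),
    ("2017-08-01 18:00:00", 1383),
    ("2017-08-01 19:00:00", 612),
    ("2017-08-01 20:00:00", 1518),
    ("2017-08-01 21:00:00", 95),
    ("2017-08-01 22:00:00", 197),
    ("2017-08-01 23:00:00", 947),
    ("2017-08-02 00:00:00", 1107),
    ("2017-08-02 01:00:00", 1083),
    ("2017-08-02 02:00:00", 1386),
]


def simulate_expired(buckets, now, rate_per_hour, deadline_hours=24):
    """Count tasks that hit their deadline if workers drain oldest-first.

    Simplifications (assumptions, not measured behaviour): every task in a
    bucket shares the bucket's start time, no new tasks arrive, and the
    completion rate is constant.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    # Each entry is [deadline, remaining]; sorting puts the oldest first.
    queue = sorted(
        [datetime.strptime(ts, fmt) + timedelta(hours=deadline_hours), count]
        for ts, count in buckets
    )
    expired = 0
    clock = now
    while any(remaining for _, remaining in queue):
        budget = rate_per_hour  # tasks the pool can finish in this hour
        for entry in queue:
            done = min(budget, entry[1])
            entry[1] -= done
            budget -= done
        clock += timedelta(hours=1)
        for entry in queue:
            if entry[0] <= clock and entry[1]:
                expired += entry[1]  # deadline passed before a worker got to it
                entry[1] = 0
    return expired


now = datetime(2017, 8, 2, 3, 0)  # assumed "current" time, not stated with the table
print(simulate_expired(PENDING_BY_HOUR, now, rate_per_hour=1000))
print(simulate_expired(PENDING_BY_HOUR, now, rate_per_hour=0))
```

Under these simplifications a rate of 1000 tasks per hour leaves nothing to expire, which is in line with the estimate above. New try pushes and higher-priority branches taking the machines would change the picture.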

Dustin J. Mitchell [:dustin] (he/him)

Comment 24

7 years ago

When I looked, everything I could see running was non-try.

I created a table containing all tasks from that workerType created in the last 36 hours - 60937 of them, with 15760 pending.

taskcluster-task-analysis::DATABASE=> select project, count(*) from pendingmac group by project;
 project  | count 
----------+-------
 oak      |   294
 cedar    |   468
 pine     |  2196
 date     |    87
 try      | 11926
 graphics |   597
 autoland |   192
(7 rows)

So as expected, just about everything is try at this point, since it's lowest priority.
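
As a quick check of the breakdown above, the per-project counts can be summed and try's share computed. A small sketch using only the numbers from the query output:

```python
# Pending counts per project, copied from the query output above.
PENDING_BY_PROJECT = {
    "oak": 294,
    "cedar": 468,
    "pine": 2196,
    "date": 87,
    "try": 11926,
    "graphics": 597,
    "autoland": 192,
}

total = sum(PENDING_BY_PROJECT.values())
try_share = PENDING_BY_PROJECT["try"] / total

# The total should match the pending figure quoted earlier in this comment.
print(total)
print(f"try share of pending: {try_share:.1%}")
```

The total matches the 15760 pending tasks mentioned above, with try accounting for roughly three quarters of them.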

Right now, at about 0300 UTC,

taskcluster-task-analysis::DATABASE=> select min(created) from pendingmac where project='try';
           min           
-------------------------
 2017-08-01 03:48:57.504
(1 row)

so not quite 24 hours old yet.  Looking at tasks a bit over 24 hours, they seem to have all completed.  So I think try is running about 23.5 hours behind right now, meaning that deadline expiration isn't helping to reduce the pending counts.  Example push:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ea4170e0089193869762150165b0c2f9f0123cf

..and I just saw I'm going to midair with greg so I'll leave it there.

Dustin J. Mitchell [:dustin] (he/him)

Comment 25

7 years ago

Oh, and what are we running right now?

taskcluster-task-analysis::DATABASE=> select project, count(*) from recentmac where state='running' group by project;
     project     | count 
-----------------+-------
 autoland        |   311
 mozilla-inbound |    24
(2 rows)

so every try push is just adding to the backlog.  Note that the sum there is 335, which is pretty close to the number of macs we have, so I think just about all of the hardware is busy.

Flags: needinfo?(dustin)

Mihai Tabara [:mtabara]⌚️GMT

Comment 26

7 years ago

Queue this morning GMT is ~17K[1]

I'm wondering whether there's a way we can actually clear the existing jobs from try, or modify that expiration timeframe somehow. It sounds like the try jobs that make up most of the queue are only inches away from expiring by the time they finally get their share of the machine pool.

Do we have a contingency plan for situations like this? To clear/kill existing try jobs or something similar?

[1]: https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010

Kim Moir [:kmoir] ET

Assignee

Updated

7 years ago

Blocks: 1386625

Kim Moir [:kmoir] ET

Assignee

Updated

7 years ago

No longer depends on: 1386405

Mihai Tabara [:mtabara]⌚️GMT

Comment 27

7 years ago

Reassigning this to :kmoir as she's been driving things forward over the past day.
Thanks Kim!

Assignee: aobreja → kmoir

Kim Moir [:kmoir] ET

Assignee

Comment 28

7 years ago

Attached patch bug1386264autoland.patch — Details — Splinter Review

Patch to temporarily disable stylo tests on autoland until seta is fixed, and to stop the awsy tests from running on trunk by default.

Kim Moir [:kmoir] ET

Assignee

Comment 29

7 years ago

Comment on attachment 8892924 [details] [diff] [review]
bug1386264autoland.patch

Change awsy stylo tests to not run on trunk by default, and disable stylo mac tests on autoland. I will talk to the stylo team before re-enabling them if we can't get seta fixed to reduce load soon.

Attachment #8892924 - Flags: review?(aki)

Bob Clary [:bc] (inactive)

Comment 30

7 years ago

kmoir: This will run awsy on both opt and debug? erahm: Do we want debug?

Flags: needinfo?(kmoir)

Flags: needinfo?(erahm)

Kim Moir [:kmoir] ET

Assignee

Comment 31

7 years ago

I can change it to opt only if that works better

Flags: needinfo?(kmoir)

J. Ryan Stinnett [:jryans] (Use needinfo, replies may be slow)

Comment 32

7 years ago

(In reply to Bob Clary [:bc:] from comment #30)
> kmoir: This will run awsy on both opt and debug? erahm: Do we want debug?

In your change to add AWSY[1] it was only added to opt test platforms, so I believe :kmoir's change alone won't add debug.

[1]: https://hg.mozilla.org/integration/mozilla-inbound/diff/fa83b1463e2b/taskcluster/ci/test/test-platforms.yml

Aki Sasaki (not active)

Updated

7 years ago

Attachment #8892924 - Flags: review?(aki) → review+

Eric Rahm [:erahm]

Comment 33

7 years ago

(In reply to Bob Clary [:bc:] from comment #30)
> kmoir: This will run awsy on both opt and debug? erahm: Do we want debug?

Just opt, but per jryans it sounds like this will work as-is.

Flags: needinfo?(erahm)

Pulsebot

Comment 34

7 years ago

Pushed by kmoir@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bcdde389bfac
very high pending counts for macosx tests r=aki DONTBUILD

Kim Moir [:kmoir] ET

Assignee

Comment 35

7 years ago

Attached patch bug1386264awsy.patch — Details — Splinter Review

Disabled awsy stylo tests from running by default on most branches.  Will see how the load goes down now that seta is enabled for macosx stylo tests on autoland. Right now the pending count is at around 16-17K.

Attachment #8893059 - Flags: checked-in+

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 36

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/bcdde389bfac

Amy Rich [:arr] [:arich]

Comment 37

7 years ago

We seem to be chewing through the backlog, at least. Down to 11K tests at the moment.

Kim Moir [:kmoir] ET

Assignee

Comment 38

7 years ago

Pending counts are now down to 1100, which is a more normal range.  Going to close this bug but leave the larger tracking bug open, as it tracks some longer term work to make more efficient use of test pools.

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

Resolution: --- → FIXED

Kim Moir [:kmoir] ET

Assignee

Updated

7 years ago

Keywords: leave-open

BMO Automation

Updated

6 years ago

Product: Release Engineering → Infrastructure & Operations

BMO Automation

Updated

4 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

bug1386264.patch disable macosx stylo tests on inbound, 7 years ago, Kim Moir [:kmoir] ET, 13.55 KB, patch, mozilla : review+
bug1386264tc.diff, 7 years ago, Kim Moir [:kmoir] ET, 3.31 KB, text/plain
bug1386264autoland.patch, 7 years ago, Kim Moir [:kmoir] ET, 15.64 KB, patch, mozilla : review+
bug1386264awsy.patch, 7 years ago, Kim Moir [:kmoir] ET, 1.37 KB, patch, kmoir : checked-in+