Closed Bug 1469720 Opened 6 years ago Closed 6 years ago

Investigate tools to find long-running tasks approaching or exceeding their max-run-time


(Testing :: General, enhancement, P3)

Version 3


(firefox63 fixed)

Tracking Status
firefox63 --- fixed


(Reporter: gbrown, Assigned: gbrown)


(Blocks 1 open bug)



(3 files, 1 obsolete file)

Can we detect particular causes of bug 1411358? Can we predict new instances and/or warn when a task's run-time is approaching its max-run-time?
Attached file
This is a simple prototype that uses the treeherder-client api to retrieve the duration of recent, completed jobs and uses the job's details to find the associated task definition on taskcluster. max-run-time is retrieved from the task definition and compared to the job times.
would activedata recipe be a better solution?  Any solution is better than none- ideally getting alerts or reports would be ideal.
Attached file thjobtimes-output1.txt
Sample output from the prototype.

Some issues to note:

 - This took 38 minutes to run and made thousands of requests to treeherder and taskcluster.
 - Some jobs identified as potentially troublesome are probably not actionable:

test-linux64-asan/opt-cppunit :: [210, 538, 3669, 208, 1674, 517, 738, 210] vs 3600 seconds (5 - 101%)

On the other hand, the prototype does find specific issues, like this known issue that got me started on this idea:

test-linux64-ccov/debug-web-platform-tests-e10s-2 :: [7544, 7711, 7204, 7636, 7543, 7439, 7515] vs 7200 seconds (100 - 107%)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #2)
> would activedata recipe be a better solution?  Any solution is better than
> none- ideally getting alerts or reports would be ideal.

There is an existing recipe for Task Durations, but it doesn't seem accurate. I've had trouble in the past getting reliable durations from ActiveData; that's probably fixable with effort. A bigger problem may be relating task durations to max-run-time. Still, ActiveData is definitely an alternative to consider.
(In reply to Geoff Brown [:gbrown] from comment #4)
> A bigger problem may be
> relating task durations to max-run-time.

Or not...

Priority: -- → P3
This is pretty good:

			"name":"average duration"
			"name":"max duration"

""	"average duration"	"max duration"	"maxRunTime"	"count"
"test-android-em-4.3-arm7-api-16/opt-geckoview-junit"	1461.6925822226206	2241.7509999275208	3600	225
"test-android-em-4.3-arm7-api-16/opt-marionette-4"	724.364691273478	1771.4939999580383	3600	149
"test-android-em-4.3-arm7-api-16/debug-mochitest-26"	1833.9672837724556	2930.5209999084473	7200	148
"test-android-em-4.3-arm7-api-16/debug-marionette-6"	1001.4378639419062	2096.792999982834	3600	147
"test-android-em-4.3-arm7-api-16/debug-mochitest-clipboard"	1052.1142465725336	1928.7569999694824	3600	146
"test-android-em-4.3-arm7-api-16/debug-marionette-8"	1000.6796206852486	2330.8700001239777	3600	145

:ekyle -- A few (non-urgent) questions for you:

1. When I change the start_time date to something more recent, say, 2018-07-01, my counts go to 0. Why?

2. Can I relate the duration to the maxRunTime in a query? I'd really like to select these numbers but restrict rows to something like "where average(duration) > 90% maxRunTime". (Obviously I can do that filtering with post-processing, but wonder if the query itself can do it.)

3. I am struggling to define what I really want here. Generally, I am interested in tasks when duration approaches maxRunTime, but I want to ignore infrequent cases where maxRunTime is much higher than normal (sometimes we get a slow machine, or a bad changeset is checked in for a while, then backed out). average vs percentile vs ...?

4. Any other observations or insights that might help here?
Flags: needinfo?(klahnakoski)
Performing "where average(duration) > 90% maxRunTime" is not possible with ActiveData yet. Comparing columns (in this case duration and maxRunTime), over many rows is often difficult because of the distributed nature of the data; all the data must be sent to a single node to perform the comparison. I have plans to allow such queries, and new Elasticsearch might be able to accept them, but that has been a P4 for a couple years now.

An alternative is to perform the comparison at the document level; with hope it exposes information that you can still take action on. You can consider counting the number of records that go above your threshold: As the average duration passes the 90% maxRunTime, this count should spike:

> {"gt":[{"div":["action.duration","task.maxRunTime"]},0.9]},

This can give you a short list of tasks (test-android-em-4.3-arm7-api-16/debug-mochitest-48) that may be getting close to the maxRunTime. Unfortunately, the aggregates are not helpful because they only aggregate over the few records that we are counting.

This next query is much like yours, I added a "tasks_gt_90" column that does the same count as the previous query. These aggregates are over the whole population. Notice that I used "median" on the maxRunTime in an attempt to exclude the anomalous values. I also chose the 70th percentile for duration; high enough to include most jobs, but lower than max (100th percentile) to exclude anomalies; 90th percentile may also work fine. There is "median percent" to report the median runtime in terms of maxRuntime as a percent.

Looks like test-android-em-4.3-arm7-api-16/debug-mochitest-48 is getting close.

> {
>     "from":"task",
>     "groupby":[""],
>     "limit":1000,
>     "select":[
>         {
>             "value":"action.duration",
>             "aggregate":"percentile",
>             "percentile":0.7,
>             "name":"70th"
>         },
>         {
>             "value":"task.maxRunTime",
>             "aggregate":"median",
>             "name":"maxRunTime"
>         },
>         {"aggregate":"count"},
>         {
>             "value":{
>                 "when":{"gt":[{"div":["action.duration","task.maxRunTime"]},0.9]},
>                 "then":1
>             },
>             "aggregate":"sum",
>             "name":"tasks_gt_90"
>         },
>         {
>             "value":{"div":["action.duration","task.maxRunTime"]},
>             "aggregate":"median",
>             "name":"median percent"
>         }
>     ],
>     "where":{"and":[
>         {"gte":{"":{"date":"2018-06-28"}}},
>         {"eq":{"state":"completed"}},
>         {"eq":{"kind":"test"}},
>         {"in":{"build.branch":["autoland","mozilla-inbound","mozilla-central"]}},
>         {"regexp":{"":"test-android-em.*"}}
>     ]}
> }
Flags: needinfo?(klahnakoski)
You may also want to look at using a Jupyter notebook (or Mozilla's own Iodide notebook) for this type of analysis: They use python, can make network calls (to reach ActiveData), and allow a step-by-step post processing.
This seems to do just what I want, but I want to test it out a bit more...

./mach test-info long-running-tasks

Long running test tasks on mozilla-central,mozilla-inbound,autoland between 2018-07-11 and 2018-07-18
test-android-em-4.3-arm7-api-16/debug-crashtest-8      : 2 of 184 runs (1.1%) exceeded 90% of max-run-time (3600 s)
test-linux64-qr/debug-reftest-e10s-4                   : 2 of 184 runs (1.1%) exceeded 90% of max-run-time (3600 s)
test-linux64-qr/debug-crashtest-e10s                   : 2 of 185 runs (1.1%) exceeded 90% of max-run-time (3600 s)
test-android-em-4.3-arm7-api-16/opt-crashtest-3        : 1 of 184 runs (0.5%) exceeded 90% of max-run-time (3600 s)
test-linux32/debug-reftest-e10s-4                      : 1 of 184 runs (0.5%) exceeded 90% of max-run-time (3600 s)
test-linux32/debug-reftest-no-accel-e10s-3             : 1 of 184 runs (0.5%) exceeded 90% of max-run-time (3600 s)
test-linux32/debug-reftest-no-accel-e10s-4             : 1 of 184 runs (0.5%) exceeded 90% of max-run-time (3600 s)
test-windows10-64/debug-web-platform-tests-e10s-11     : 1 of 185 runs (0.5%) exceeded 90% of max-run-time (7200 s)
test-android-em-4.3-arm7-api-16/debug-crashtest-5      : 1 of 186 runs (0.5%) exceeded 90% of max-run-time (3600 s)
test-linux64-qr/debug-reftest-e10s-1                   : 1 of 186 runs (0.5%) exceeded 90% of max-run-time (3600 s)
test-linux32/debug-web-platform-tests-e10s-4           : 1 of 187 runs (0.5%) exceeded 90% of max-run-time (7200 s)
test-android-em-4.3-arm7-api-16/opt-marionette-2       : 1 of 190 runs (0.5%) exceeded 90% of max-run-time (3600 s)
test-android-em-4.3-arm7-api-16/opt-mochitest-23       : 1 of 191 runs (0.5%) exceeded 90% of max-run-time (7200 s)
test-android-em-4.3-arm7-api-16/opt-robocop-1          : 1 of 191 runs (0.5%) exceeded 90% of max-run-time (3600 s)
test-android-em-4.3-arm7-api-16/debug-reftest-41       : 1 of 192 runs (0.5%) exceeded 90% of max-run-time (7200 s)
See Also: → 1478811
Attachment #8993050 - Attachment is obsolete: true
This works great for me and has been helpful in addressing issues contributing to bug 1411358. I want to have a look at adding this to adr in the future, but for now, this is the most convenient implementation for my use.
Attachment #8998197 - Flags: review?(jmaher)
Comment on attachment 8998197 [details] [diff] [review]
implement ekyle's excellent query as 'mach test-info long-tasks'

Review of attachment 8998197 [details] [diff] [review]:

I don't see any problems with the code- I would prefer to see this as an activedata recipe instead of a mach command.  The ideal state of ADR is that we can have it shown on a dashboard which increases the visibility to others- in this state, we need someone to remember to run the mach command which requires someone to have a clone, know the command exists, and understand the output.
Attachment #8998197 - Flags: review?(jmaher) → review+
Pushed by
Add a 'mach test-info' subcommand to find long-running tasks; r=jmaher
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
You need to log in before you can comment on or make changes to this bug.