Closed
Bug 1469720
Opened 6 years ago
Closed 6 years ago
Investigate tools to find long-running tasks approaching or exceeding their max-run-time
Categories
(Testing :: General, enhancement, P3)
Tracking
(firefox63 fixed)
RESOLVED
FIXED
mozilla63
Tracking | Status | |
---|---|---|
firefox63 | --- | fixed |
People
(Reporter: gbrown, Assigned: gbrown)
References
(Blocks 1 open bug)
Details
Attachments
(3 files, 1 obsolete file)
Can we detect particular causes of bug 1411358? Can we predict new instances and/or warn when a task's run-time is approaching its max-run-time?
Assignee | ||
Comment 1•6 years ago
|
||
This is a simple prototype that uses the treeherder-client api to retrieve the duration of recent, completed jobs and uses the job's details to find the associated task definition on taskcluster. max-run-time is retrieved from the task definition and compared to the job times.
Comment 2•6 years ago
|
||
would activedata recipe be a better solution? Any solution is better than none- ideally getting alerts or reports would be ideal.
Assignee | ||
Comment 3•6 years ago
|
||
Sample output from the thjobtimes.py prototype. Some issues to note: - This took 38 minutes to run and made thousands of requests to treeherder and taskcluster. - Some jobs identified as potentially troublesome are probably not actionable: test-linux64-asan/opt-cppunit :: [210, 538, 3669, 208, 1674, 517, 738, 210] vs 3600 seconds (5 - 101%) On the other hand, the prototype does find specific issues, like this known issue that got me started on this idea: test-linux64-ccov/debug-web-platform-tests-e10s-2 :: [7544, 7711, 7204, 7636, 7543, 7439, 7515] vs 7200 seconds (100 - 107%)
Assignee | ||
Comment 4•6 years ago
|
||
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #2) > would activedata recipe be a better solution? Any solution is better than > none- ideally getting alerts or reports would be ideal. There is an existing recipe for Task Durations, but it doesn't seem accurate. I've had trouble in the past getting reliable durations from ActiveData; that's probably fixable with effort. A bigger problem may be relating task durations to max-run-time. Still, ActiveData is definitely an alternative to consider.
Assignee | ||
Comment 5•6 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #4) > A bigger problem may be > relating task durations to max-run-time. Or not... { "from":"task", "select":["run.name","action.duration","task.maxRunTime"] }
Assignee | ||
Updated•6 years ago
|
Priority: -- → P3
Assignee | ||
Comment 6•6 years ago
|
||
This is pretty good: { "from":"task", "groupby":["run.name"], "limit":1000, "select":[ { "value":"action.duration", "aggregate":"average", "name":"average duration" }, { "value":"action.duration", "aggregate":"max", "name":"max duration" }, { "value":"task.maxRunTime", "aggregate":"min", "name":"maxRunTime" }, {"aggregate":"count"} ], "where":{"and":[ {"gte":{"task.run.start_time":{"date":"2018-06-28"}}}, {"eq":{"state":"completed"}}, {"eq":{"kind":"test"}}, {"in":{"build.branch":["autoland","mozilla-inbound","mozilla-central"]}}, {"regexp":{"run.name":"test-android-em.*"}} ]} } "run.name" "average duration" "max duration" "maxRunTime" "count" "test-android-em-4.3-arm7-api-16/opt-geckoview-junit" 1461.6925822226206 2241.7509999275208 3600 225 "test-android-em-4.3-arm7-api-16/opt-marionette-4" 724.364691273478 1771.4939999580383 3600 149 "test-android-em-4.3-arm7-api-16/debug-mochitest-26" 1833.9672837724556 2930.5209999084473 7200 148 "test-android-em-4.3-arm7-api-16/debug-marionette-6" 1001.4378639419062 2096.792999982834 3600 147 "test-android-em-4.3-arm7-api-16/debug-mochitest-clipboard" 1052.1142465725336 1928.7569999694824 3600 146 "test-android-em-4.3-arm7-api-16/debug-marionette-8" 1000.6796206852486 2330.8700001239777 3600 145 ... :ekyle -- A few (non-urgent) questions for you: 1. When I change the start_time date to something more recent, say, 2018-07-01, my counts go to 0. Why? 2. Can I relate the duration to the maxRunTime in a query? I'd really like to select these numbers but restrict rows to something like "where average(duration) > 90% maxRunTime". (Obviously I can do that filtering with post-processing, but wonder if the query itself can do it.) 3. I am struggling to define what I really want here. Generally, I am interested in tasks when duration approaches maxRunTime, but I want to ignore infrequent cases where maxRunTime is much higher than normal (sometimes we get a slow machine, or a bad changeset is checked in for a while, then backed out). average vs percentile vs ...? 4. Any other observations or insights that might help here?
Flags: needinfo?(klahnakoski)
Comment 7•6 years ago
|
||
Performing "where average(duration) > 90% maxRunTime" is not possible with ActiveData yet. Comparing columns (in this case duration and maxRunTime), over many rows is often difficult because of the distributed nature of the data; all the data must be sent to a single node to perform the comparison. I have plans to allow such queries, and new Elasticsearch might be able to accept them, but that has been a P4 for a couple years now. An alternative is to perform the comparison at the document level; with hope it exposes information that you can still take action on. You can consider counting the number of records that go above your threshold: As the average duration passes the 90% maxRunTime, this count should spike: > {"gt":[{"div":["action.duration","task.maxRunTime"]},0.9]}, https://activedata.allizom.org/tools/query.html#query_id=c5PIolmZ This can give you a short list of tasks (test-android-em-4.3-arm7-api-16/debug-mochitest-48) that may be getting close to the maxRunTime. Unfortunately, the aggregates are not helpful because they only aggregate over the few records that we are counting. This next query is much like yours, I added a "tasks_gt_90" column that does the same count as the previous query. These aggregates are over the whole population. Notice that I used "median" on the maxRunTime in an attempt to exclude the anomalous values. I also chose the 70th percentile for duration; high enough to include most jobs, but lower than max (100th percentile) to exclude anomalies; 90th percentile may also work fine. There is "median percent" to report the median runtime in terms of maxRuntime as a percent. https://activedata.allizom.org/tools/query.html#query_id=2d2yBOR2 Looks like test-android-em-4.3-arm7-api-16/debug-mochitest-48 is getting close. > { > "from":"task", > "groupby":["run.name"], > "limit":1000, > "select":[ > { > "value":"action.duration", > "aggregate":"percentile", > "percentile":0.7, > "name":"70th" > }, > { > "value":"task.maxRunTime", > "aggregate":"median", > "name":"maxRunTime" > }, > {"aggregate":"count"}, > { > "value":{ > "when":{"gt":[{"div":["action.duration","task.maxRunTime"]},0.9]}, > "then":1 > }, > "aggregate":"sum", > "name":"tasks_gt_90" > }, > { > "value":{"div":["action.duration","task.maxRunTime"]}, > "aggregate":"median", > "name":"median percent" > } > ], > "where":{"and":[ > {"gte":{"task.run.start_time":{"date":"2018-06-28"}}}, > {"eq":{"state":"completed"}}, > {"eq":{"kind":"test"}}, > {"in":{"build.branch":["autoland","mozilla-inbound","mozilla-central"]}}, > {"regexp":{"run.name":"test-android-em.*"}} > ]} > }
Flags: needinfo?(klahnakoski)
Comment 8•6 years ago
|
||
You may also want to look at using a Jupyter notebook (or Mozilla's own Iodide notebook) for this type of analysis: They use python, can make network calls (to reach ActiveData), and allow a step-by-step post processing. https://github.com/iodide-project/iodide
Assignee | ||
Comment 9•6 years ago
|
||
This seems to do just what I want, but I want to test it out a bit more... ./mach test-info long-running-tasks Long running test tasks on mozilla-central,mozilla-inbound,autoland between 2018-07-11 and 2018-07-18 test-android-em-4.3-arm7-api-16/debug-crashtest-8 : 2 of 184 runs (1.1%) exceeded 90% of max-run-time (3600 s) test-linux64-qr/debug-reftest-e10s-4 : 2 of 184 runs (1.1%) exceeded 90% of max-run-time (3600 s) test-linux64-qr/debug-crashtest-e10s : 2 of 185 runs (1.1%) exceeded 90% of max-run-time (3600 s) test-android-em-4.3-arm7-api-16/opt-crashtest-3 : 1 of 184 runs (0.5%) exceeded 90% of max-run-time (3600 s) test-linux32/debug-reftest-e10s-4 : 1 of 184 runs (0.5%) exceeded 90% of max-run-time (3600 s) test-linux32/debug-reftest-no-accel-e10s-3 : 1 of 184 runs (0.5%) exceeded 90% of max-run-time (3600 s) test-linux32/debug-reftest-no-accel-e10s-4 : 1 of 184 runs (0.5%) exceeded 90% of max-run-time (3600 s) test-windows10-64/debug-web-platform-tests-e10s-11 : 1 of 185 runs (0.5%) exceeded 90% of max-run-time (7200 s) test-android-em-4.3-arm7-api-16/debug-crashtest-5 : 1 of 186 runs (0.5%) exceeded 90% of max-run-time (3600 s) test-linux64-qr/debug-reftest-e10s-1 : 1 of 186 runs (0.5%) exceeded 90% of max-run-time (3600 s) test-linux32/debug-web-platform-tests-e10s-4 : 1 of 187 runs (0.5%) exceeded 90% of max-run-time (7200 s) test-android-em-4.3-arm7-api-16/opt-marionette-2 : 1 of 190 runs (0.5%) exceeded 90% of max-run-time (3600 s) test-android-em-4.3-arm7-api-16/opt-mochitest-23 : 1 of 191 runs (0.5%) exceeded 90% of max-run-time (7200 s) test-android-em-4.3-arm7-api-16/opt-robocop-1 : 1 of 191 runs (0.5%) exceeded 90% of max-run-time (3600 s) test-android-em-4.3-arm7-api-16/debug-reftest-41 : 1 of 192 runs (0.5%) exceeded 90% of max-run-time (7200 s)
Assignee | ||
Updated•6 years ago
|
Attachment #8993050 -
Attachment is obsolete: true
Assignee | ||
Comment 10•6 years ago
|
||
This works great for me and has been helpful in addressing issues contributing to bug 1411358. I want to have a look at adding this to adr in the future, but for now, this is the most convenient implementation for my use.
Attachment #8998197 -
Flags: review?(jmaher)
Comment 11•6 years ago
|
||
Comment on attachment 8998197 [details] [diff] [review] implement ekyle's excellent query as 'mach test-info long-tasks' Review of attachment 8998197 [details] [diff] [review]: ----------------------------------------------------------------- I don't see any problems with the code- I would prefer to see this as an activedata recipe instead of a mach command. The ideal state of ADR is that we can have it shown on a dashboard which increases the visibility to others- in this state, we need someone to remember to run the mach command which requires someone to have a clone, know the command exists, and understand the output.
Attachment #8998197 -
Flags: review?(jmaher) → review+
Comment 12•6 years ago
|
||
Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/43fea7881cff Add a 'mach test-info' subcommand to find long-running tasks; r=jmaher
Comment 13•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/43fea7881cff
Status: NEW → RESOLVED
Closed: 6 years ago
status-firefox63:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
You need to log in
before you can comment on or make changes to this bug.
Description
•