Closed Bug 1433761 Opened 7 years ago Closed 7 years ago

ensure that max-run-time is working from taskcluster -> talos

Categories

(Testing :: Talos, enhancement)

enhancement
Not set
normal

Tracking

(firefox60 fixed)

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: jmaher, Assigned: jmaher)

Details

(Whiteboard: [PI:March])

Attachments

(1 file, 1 obsolete file)

I noticed in https://treeherder.mozilla.org/#/jobs?repo=try&revision=04c9f43eecf38d7cdbc538fd00b7b3b1ab1cf9b3 that the orange 'o' and 'd' jobs timed out at an hour despite max-run-time being set to something lower (specifically in the patch). We are running on the new moonshot hardware and using a hardware-based taskcluster-worker. I would like to determine whether this worker supports max-run-time. If it doesn't, I would like to pass max-run-time to talos via a command-line argument so that talos can die gracefully.

Right now we have a few areas for timeouts:
1) taskcluster worker (I suspect this isn't supported)
2) mozharness
3) talos harness
4) firefox process stdout monitor

Looking at those 4 areas, we should focus on the talos harness and the taskcluster worker.

:dustin, could you help determine, or find the right person who could figure out, whether the hardware taskcluster-worker we are using on the new moonshot hardware for linux supports max-run-time?
Flags: needinfo?(dustin)
I think Jonas is the right person.
Flags: needinfo?(dustin) → needinfo?(jopsen)
taskcluster-worker has a "maxruntime" plugin you can activate in its configuration:
> maxruntime:
>   maxRunTime: '1 hour'
>   perTaskLimit: 'allow|forbid|require'
See: https://docs.taskcluster.net/reference/workers/taskcluster-worker/docs/configuration

If you specify a config to the worker, you can print the payload schema too:
> taskcluster-worker schema payload <config.yml>
But the docs will also let you infer most of the properties available: https://docs.taskcluster.net/reference/workers/taskcluster-worker/docs/configuration
Note that if a property is allowed, then usually it's also used by something :)

For example, the following plugin config:
https://github.com/taskcluster/taskcluster-worker/blob/843ae341ff9e205df9dda0d50ad91d0a69539de7/examples/packet-config.yml#L56-L58
enforces a max-run-time of 4 hours, but allows tasks to specify their own task.payload.maxRunTime property, like:
task.payload.maxRunTime = '3 hours 5 min'
However, tasks cannot specify task.payload.maxRunTime = '6 h', because the plugin was configured with a max of 4 hours.
Flags: needinfo?(jopsen)
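The limit rules described in the comment above can be sketched in a few lines of Python. This is only an illustration, not the plugin's actual code: the function name `effective_max_run_time` is hypothetical, durations are plain seconds rather than strings like '3 hours 5 min', and the meanings of 'forbid' and 'require' are assumptions inferred from the option names (the comment only spells out 'allow').

```python
from typing import Optional


def effective_max_run_time(worker_limit: int,
                           per_task_limit: str,
                           task_value: Optional[int]) -> int:
    """Return the run-time limit, in seconds, to enforce for one task.

    worker_limit   -- the maxRunTime from the worker's plugin config
    per_task_limit -- 'allow', 'forbid' or 'require'
    task_value     -- task.payload.maxRunTime, or None if the task omits it
    """
    if per_task_limit not in ("allow", "forbid", "require"):
        raise ValueError("unknown perTaskLimit: %r" % per_task_limit)

    if task_value is None:
        # Assumption: 'require' means every task must carry its own limit.
        if per_task_limit == "require":
            raise ValueError("task must set payload.maxRunTime")
        return worker_limit

    # Assumption: 'forbid' means tasks may not carry their own limit.
    if per_task_limit == "forbid":
        raise ValueError("task may not set payload.maxRunTime")

    # From the comment: a task may lower the limit but never exceed the
    # worker's configured maximum.
    if task_value > worker_limit:
        raise ValueError("payload.maxRunTime exceeds the worker maximum")
    return task_value


HOUR = 3600
# A 4 hour worker limit with a task asking for 3 hours 5 min.
print(effective_max_run_time(4 * HOUR, "allow", 3 * HOUR + 5 * 60))
```

With a 4 hour worker limit, a task asking for 6 hours raises an error in this sketch, which mirrors the '6 h' example in the comment.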
thanks for the great information Jonas! :dhouse/:markco, can you ensure the taskcluster workers that we are installing on the new moonshots use the perTaskLimit: 'allow' option?
Flags: needinfo?(mcornmesser)
Flags: needinfo?(dhouse)
Joel, here's what I'm seeing for linux:
```
[root@ms1-5 ~]# grep -C3 -i maxruntime /etc/taskcluster-worker.yml
reboot:
  maxLifeCycle: '96 hours'
  allowTaskReboots: true
# tasks can never take more than 96 hours (but typically are limited by their own maxRunTime)
maxruntime:
  maxRunTime: '96 hours'
  perTaskLimit: allow
watchdog: {}
logprefix:
```
And that is defined, hardcoded, in releng-puppet:
https://hg.mozilla.org/build/puppet/file/tip/modules/taskcluster_worker/templates/taskcluster-worker.yml.erb#l28
https://hg.mozilla.org/build/puppet/file/tip/modules/taskcluster_worker/manifests/init.pp#l13
Flags: needinfo?(dhouse)
Here is a case where the maxruntime in taskcluster terminated the job at 1200 seconds (i.e. 20 minutes):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f7141e72bc1f2241a0e68b70af00554433e36d1d
The max-run-time is defined in-tree:
https://searchfox.org/mozilla-central/source/taskcluster/ci/test/talos.yml#506

Unfortunately this didn't time out for linux (windows is buildbot in that try job). Based on the above config, I would expect |perTaskLimit: allow| to use the in-tree max-run-time: 1200, but it doesn't seem to.

:wcosta- can you get the maxRunTime information from the osx worker so we can compare what that is and why it is working?
:jonasfj, can you think of any features in a worker that might ignore this?
Flags: needinfo?(wcosta)
Flags: needinfo?(jopsen)
osx uses generic-worker, and I think gw honors the task configuration. 303 :pmoore to confirm it.
Flags: needinfo?(wcosta) → needinfo?(pmoore)
(In reply to Wander Lairson Costa [:wcosta] from comment #6)
> osx uses generic-worker, and I think gw honors the task configuration. 303
> :pmoore to confirm it.

Yes, generic-worker uses the same format for maximum runtime as docker-worker, which is why it works on macOS (our gecko macOS testers run generic-worker). See https://docs.taskcluster.net/reference/workers/generic-worker/payload for the generic-worker payload format.
Flags: needinfo?(pmoore)
If jmaher is talking about the "linux x64 opt | sp" line, taken from the treeherder link above:
taskId: F2yiXvgZR6OeB5uTOAr1pg
https://tools.taskcluster.net/groups/BMpDNACcRQqaZ8u3n2QAHQ/tasks/F2yiXvgZR6OeB5uTOAr1pg/details
then clearly the task.payload doesn't contain maxRunTime. I'm guessing the in-tree transforms are missing some tweaks.

The @payload_builder('docker-worker') contains:
> if 'max-run-time' in worker:
>     payload['maxRunTime'] = worker['max-run-time']
https://searchfox.org/mozilla-central/rev/a5abf843f8fac8530aa5de6fb40e16547bb4a47a/taskcluster/taskgraph/transforms/task.py#819-820

And the @payload_builder('native-engine') seems to be missing something like that, see:
https://searchfox.org/mozilla-central/rev/a5abf843f8fac8530aa5de6fb40e16547bb4a47a/taskcluster/taskgraph/transforms/task.py#1091-1111

Probably, the in-tree logic will have to be tweaked.
Flags: needinfo?(jopsen)
Attached patch maxruntime.patch (obsolete) — Splinter Review
With the attached patch, I have proven that adding payload['maxRunTime'] to the talos jobs' native-engine worker yields a timeout as expected:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cecfc0465098009a5d1ea65a243376639c7cba8c

The problem is that when I try to proxy the value from the test definition in talos.yml so that it is set in the worker object, I get decision task failures:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1dfe919b9fbecc6b84696451effe6ea6db49f661
so you can see in the above patch where that is commented out.

:jonasfj, could you help find someone more familiar with this workflow who could help me get the value from the test -> payload?
Flags: needinfo?(mcornmesser) → needinfo?(jopsen)
Probably we have to add max-run-time to the schema as well, like:
> # the maximum time to run, in seconds
> Required('max-run-time'): int,
in:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/transforms/task.py?q=taskcluster%2Ftaskgraph%2Ftransforms%2Ftask.py&redirect_type=direct#456-486
where the schema for @payload_builder('native-engine') is defined... (this isn't super intuitive)

Then I think something like:
> if 'max-run-time' in worker:
>     payload['maxRunTime'] = worker['max-run-time']
in:
> @payload_builder('native-engine')
> def build_macosx_engine_payload(...
will do the trick.

I suspect this is what's missing. Otherwise, we might have to ask dustin :)
Flags: needinfo?(jopsen)
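The suggestion above can be tried in isolation with a small stand-alone sketch. It is not the in-tree code: `build_native_engine_payload_sketch` is a hypothetical name, the real builder is registered with @payload_builder and has a different signature, and the `command` and `env` handling here is an assumption added only to make the example self-contained. The only part taken from the comments is the two-line copy of worker['max-run-time'] into payload['maxRunTime'].

```python
from typing import Any, Dict


def build_native_engine_payload_sketch(worker: Dict[str, Any]) -> Dict[str, Any]:
    """Build a worker payload from a 'worker' dict (hypothetical helper)."""
    payload = {
        # Assumption: these two fields are passed straight through.
        "command": worker.get("command", []),
        "env": worker.get("env", {}),
    }

    # Same two lines the docker-worker builder is quoted as using: the
    # in-tree key is 'max-run-time', the payload key is 'maxRunTime'.
    if "max-run-time" in worker:
        payload["maxRunTime"] = worker["max-run-time"]

    return payload


worker = {"command": ["./test-linux.sh"], "max-run-time": 1200}
print(build_native_engine_payload_sketch(worker))
```

Note that the value is written into the payload only, which is the part of the task that the worker reads.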
I am not figuring this out despite a handful of pushes to try yesterday. Here is my latest one:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=480e6daba872ae431172318ed3720683ecaed7d6

I believe I put in the maxruntime for native-engine in the right places. :dustin- if you have any quick ideas I am happy to try them out, otherwise maybe we can schedule some time to discuss this in more detail in the coming weeks.
Flags: needinfo?(dustin)
Looking at that decision task, I see that it failed with "data should not have additional properties". The schema-validation library doesn't tell you which properties are additional, which is a bit annoying. But at a guess, there's some property of the task that shouldn't be there.

The decision task creates 50 tasks in parallel, so it's a bit hard to tell which one caused the error. However, since you're working on talos I just scrolled back a bit until I saw a log line about creating a talos task - test-linux64-qr/opt-talos-chrome-e10s. Looking in task-graph.json:

  "NuQRdTIQTN6FQxLfZEGNCA": {
    "attributes": {
      "always_target": false,
      "build_platform": "linux64",
      "build_type": "opt",
      "e10s": true,
      "kind": "test",
      "run_on_projects": [
        "mozilla-central",
        "try"
      ],
      "shipping_phase": null,
      "shipping_product": null,
      "talos_try_name": "chromez-e10s",
      "test_chunk": "1",
      "test_platform": "linux64-qr/opt",
      "unittest_flavor": "talos",
      "unittest_suite": "talos"
    },
    "dependencies": {
      "build": "CmspfrVOTv6NnuzYo48kuA"
    },
    "kind": "test",
    "label": "test-linux64-qr/opt-talos-chrome-e10s",
    "optimization": {
      "skip-unless-schedules": [
        "talos",
        "linux"
      ]
    },
    "task": {
      "created": { "relative-datestamp": "0 seconds" },
      "deadline": { "relative-datestamp": "1 day" },
      "dependencies": [
        "CmspfrVOTv6NnuzYo48kuA"
      ],
      "expires": { "relative-datestamp": "14 days" },
      "extra": {
        "chunks": { "current": 1, "total": 1 },
        "index": { "rank": 0 },
        "parent": "HzRti4Q7RIaKOtxU6qqsKw",
        "suite": { "flavor": "talos", "name": "talos" },
        "treeherder": {
          "collection": { "opt": true },
          "groupName": "Talos performance tests with e10s",
          "groupSymbol": "T-e10s",
          "jobKind": "test",
          "machine": { "platform": "linux64-qr" },
          "symbol": "c",
          "tier": 2
        }
      },
      "maxRunTime": 180,
      "metadata": {
        "description": "Talos chrome ([Treeherder push](https://treeherder.mozilla.org/#/jobs?repo=try&revision=480e6daba872ae431172318ed3720683ecaed7d6))",
        "name": "test-linux64-qr/opt-talos-chrome-e10s",
        "owner": "jmaher@mozilla.com",
        "source": "https://hg.mozilla.org/try/file/480e6daba872ae431172318ed3720683ecaed7d6/taskcluster/ci/test"
      },
      "payload": {
        "artifacts": [
          {
            "expires": { "relative-datestamp": "14 days" },
            "name": "public/logs",
            "path": "workspace/build/upload/logs",
            "type": "directory"
          },
          {
            "expires": { "relative-datestamp": "14 days" },
            "name": "public/test",
            "path": "artifacts",
            "type": "directory"
          },
          {
            "expires": { "relative-datestamp": "14 days" },
            "name": "public/test_info",
            "path": "workspace/build/blobber_upload_dir",
            "type": "directory"
          }
        ],
        "command": [
          "./test-linux.sh",
          "--no-read-buildbot-config",
          "--installer-url=https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/target.tar.bz2",
          "--test-packages-url=https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/target.test_packages.json",
          "--suite=chromez-e10s",
          "--add-option",
          "--webServer,localhost",
          "--add-option",
          "--webServer,localhost",
          "--use-talos-json",
          "--branch-name",
          "try",
          "--enable-webrender",
          "--download-symbols=ondemand"
        ],
        "context": "https://hg.mozilla.org/try/raw-file/480e6daba872ae431172318ed3720683ecaed7d6/taskcluster/scripts/tester/test-linux.sh",
        "env": {
          "GECKO_HEAD_REPOSITORY": "https://hg.mozilla.org/try",
          "GECKO_HEAD_REV": "480e6daba872ae431172318ed3720683ecaed7d6",
          "MOZHARNESS_CONFIG": "talos/linux_config.py",
          "MOZHARNESS_SCRIPT": "talos_script.py",
          "MOZHARNESS_URL": "https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/mozharness.zip",
          "MOZILLA_BUILD_URL": "https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/target.tar.bz2",
          "MOZ_AUTOMATION": "1",
          "MOZ_HIDE_RESULTS_TABLE": "1",
          "MOZ_NODE_PATH": "/usr/local/bin/node",
          "MOZ_NO_REMOTE": "1",
          "NEED_XVFB": "false",
          "NO_EM_RESTART": "1",
          "NO_FAIL_ON_TEST_ERRORS": "1",
          "XPCOM_DEBUG_BREAK": "warn"
        },
        "maxRunTime": 180
      },
      "priority": "very-low",
      "provisionerId": "releng-hardware",
      "routes": [
        "tc-treeherder.v2.try.480e6daba872ae431172318ed3720683ecaed7d6.255607"
      ],
      "scopes": [],
      "tags": {
        "createdForUser": "jmaher@mozilla.com",
        "kind": "test",
        "label": "test-linux64-qr/opt-talos-chrome-e10s",
        "os": "linux",
        "worker-implementation": "native-engine"
      },
      "workerType": "gecko-t-linux-talos"
    },
    "task_id": "NuQRdTIQTN6FQxLfZEGNCA"
  },

and indeed, I see maxRunTime in there twice -- once in `task.payload` (below XPCOM_DEBUG_BREAK) and once at the `task` level (right above `metadata`). The option is an instruction to the worker, so it is included in the payload, but is not allowed in the enclosing task definition.

So I think https://hg.mozilla.org/try/rev/0855a4ff58eec6a4116d4ea7be709948b571f355#l3.51

+    if 'max-run-time' in worker:
+        task_def['maxRunTime'] = worker['max-run-time']

is what is causing the issue.
Flags: needinfo?(dustin)
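Since the schema-validation error does not name the offending property, a quick way to find it is to diff the task's top-level keys against the keys that are expected. The sketch below does that in plain Python. `ASSUMED_TASK_KEYS` is an assumption: it is simply the set of top-level `task` keys visible in the pasted task minus `maxRunTime`, not the queue's real schema, and `additional_properties` is a hypothetical helper name.

```python
from typing import Any, Dict, List

# Assumption: top-level keys seen in the pasted task definition, with
# 'maxRunTime' left out. The authoritative list is the queue's schema.
ASSUMED_TASK_KEYS = frozenset({
    "created", "deadline", "dependencies", "expires", "extra",
    "metadata", "payload", "priority", "provisionerId", "routes",
    "scopes", "tags", "workerType",
})


def additional_properties(task: Dict[str, Any]) -> List[str]:
    """Return the top-level keys of a task that are not in the assumed set."""
    return sorted(set(task) - ASSUMED_TASK_KEYS)


# Trimmed-down version of the task from the comment above.
bad_task = {
    "maxRunTime": 180,                # task level: rejected
    "metadata": {"name": "test-linux64-qr/opt-talos-chrome-e10s"},
    "payload": {"maxRunTime": 180},   # payload level: read by the worker
    "workerType": "gecko-t-linux-talos",
}
good_task = {k: v for k, v in bad_task.items() if k != "maxRunTime"}

print(additional_properties(bad_task))   # ['maxRunTime']
print(additional_properties(good_task))  # []
```

Only top-level keys are checked, so the `maxRunTime` inside `payload` is left alone, which matches the diagnosis that the payload copy is fine and the task-level copy is the problem.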
Assignee: nobody → jmaher
Attachment #8949776 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8955515 - Flags: review?(dustin)
Whiteboard: [PI:March]
Attachment #8955515 - Flags: review?(dustin) → review+
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/00fc5ded3b48 ensure that max-run-time is working from taskcluster -> talos. r=dustin
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60