Closed
Bug 1433761
Opened 7 years ago
Closed 7 years ago
ensure that max-run-time is working from taskcluster -> talos
Categories
(Testing :: Talos, enhancement)
Tracking
(firefox60 fixed)
RESOLVED
FIXED
mozilla60
Tracking | Status | |
---|---|---|
firefox60 | --- | fixed |
People
(Reporter: jmaher, Assigned: jmaher)
Details
(Whiteboard: [PI:March])
Attachments
(1 file, 1 obsolete file)
2.80 KB, patch | dustin: review+
I noticed in:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=04c9f43eecf38d7cdbc538fd00b7b3b1ab1cf9b3
that the orange 'o' and 'd' jobs timed out at an hour despite setting max-run-time to something lower (specifically in the patch). We are running on the new moonshot hardware and using a hardware-based taskcluster-worker. I would like to determine whether this worker supports max-run-time.
If it doesn't, I would like to pass the max-run-time to talos via a command-line argument and then die gracefully. Right now we have a few places where a timeout can be enforced:
1) taskcluster worker (I suspect this isn't supported)
2) mozharness
3) talos harness
4) firefox process stdout monitor
Looking at those 4 areas, we should focus on the talos harness and the taskcluster worker.
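As an illustration of the harness-level option (area 3), a minimal Python sketch of "die gracefully" might look like the following. It assumes the harness launches its workload as a child process; `run_with_max_run_time`, `max_run_time` and `grace_period` are hypothetical names, not existing talos options.

```python
import subprocess
import sys


def run_with_max_run_time(cmd, max_run_time, grace_period=5):
    """Run cmd and return (exit_code, timed_out).

    max_run_time and grace_period are in seconds. Both parameter names
    are hypothetical and only serve this sketch.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=max_run_time), False
    except subprocess.TimeoutExpired:
        # Ask the child to exit first so it can flush logs and clean up.
        proc.terminate()
        try:
            proc.wait(timeout=grace_period)
        except subprocess.TimeoutExpired:
            # Still running after the grace period: force it down.
            proc.kill()
            proc.wait()
        return proc.returncode, True


if __name__ == "__main__":
    # A child that would sleep for 30 seconds is stopped after about 1.
    code, timed_out = run_with_max_run_time(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        max_run_time=1,
    )
    print("timed out:", timed_out)
```

Terminating before killing gives the child a chance to upload logs, which is the point of handling the timeout in the harness rather than only in the worker.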
:dustin, could you help determine, or find the right person who could figure out, whether the hardware taskcluster-worker we are using on the new moonshot hardware for linux supports max-run-time?
Flags: needinfo?(dustin)
Comment 1•7 years ago
I think Jonas is the right person.
Flags: needinfo?(dustin) → needinfo?(jopsen)
Comment 2•7 years ago
taskcluster-worker has a "maxruntime" plugin you can activate in its configuration.
> maxruntime:
>   maxRunTime: '1 hour'
>   perTaskLimit: 'allow|forbid|require'
See:
https://docs.taskcluster.net/reference/workers/taskcluster-worker/docs/configuration
If you specify config to the worker you can print the payload schema too:
> taskcluster-worker schema payload <config.yml>
But the docs will also let you infer most of the properties available:
https://docs.taskcluster.net/reference/workers/taskcluster-worker/docs/configuration
Note, if a property is allowed, then usually it's also used by something :)
Example, the following plugin config:
https://github.com/taskcluster/taskcluster-worker/blob/843ae341ff9e205df9dda0d50ad91d0a69539de7/examples/packet-config.yml#L56-L58
enforces a max-run-time of 4 hours, but allows tasks to specify their own task.payload.maxRunTime property, like:
task.payload.maxRunTime = '3 hours 5 min'
However, tasks cannot specify task.payload.maxRunTime = '6 h' because the plugin was configured to a max of 4 hours.
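To make the 'allow' behaviour concrete, here is a rough Python model of the rule described above. It is not the worker's code: the duration parser only handles the spellings seen in this comment, the reading of 'forbid' and 'require' is an assumption based on their names, and rejecting an over-limit task with an error is likewise assumed.

```python
import re

# Seconds per unit, covering only the spellings used in this thread
# ('1 hour', '3 hours 5 min', '6 h'). The real worker may accept others.
_UNIT_SECONDS = {
    "s": 1, "sec": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hour": 3600, "hours": 3600,
}


def parse_duration(text):
    """Convert a string such as '3 hours 5 min' to seconds."""
    parts = re.findall(r"(\d+)\s*([a-zA-Z]+)", text)
    if not parts:
        raise ValueError("unrecognised duration: %r" % (text,))
    total = 0
    for amount, unit in parts:
        if unit.lower() not in _UNIT_SECONDS:
            raise ValueError("unknown unit %r in %r" % (unit, text))
        total += int(amount) * _UNIT_SECONDS[unit.lower()]
    return total


def effective_max_run_time(plugin_config, payload):
    """Return the run-time limit in seconds that would apply to a task.

    plugin_config mirrors the 'maxruntime' block of the worker config;
    payload mirrors task.payload. Semantics of 'forbid' and 'require'
    are assumed, not taken from the worker source.
    """
    limit = parse_duration(plugin_config["maxRunTime"])
    mode = plugin_config["perTaskLimit"]
    requested = payload.get("maxRunTime")

    if mode == "forbid" and requested is not None:
        raise ValueError("tasks may not set maxRunTime on this worker")
    if mode == "require" and requested is None:
        raise ValueError("tasks must set maxRunTime on this worker")
    if requested is None:
        return limit

    # Accept either a duration string or a plain number of seconds.
    seconds = requested if isinstance(requested, int) else parse_duration(requested)
    if seconds > limit:
        raise ValueError("task asks for %ds, worker allows %ds" % (seconds, limit))
    return seconds
```

With a plugin limit of '4 hours' and 'allow', a payload asking for '3 hours 5 min' resolves to 11100 seconds, a payload with no maxRunTime falls back to 14400, and '6 h' is rejected.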
Flags: needinfo?(jopsen)
Assignee | ||
Comment 3•7 years ago
thanks for the great information Jonas!
:dhouse/:markco, can you ensure the taskcluster workers that we are installing on the new moonshots use the perTaskLimit: 'allow' option?
Flags: needinfo?(mcornmesser)
Flags: needinfo?(dhouse)
Comment 4•7 years ago
Joel, here's what I'm seeing for linux:
```
[root@ms1-5 ~]# grep -C3 -i maxruntime /etc/taskcluster-worker.yml
reboot:
  maxLifeCycle: '96 hours'
  allowTaskReboots: true
# tasks can never take more than 96 hours (but typically are limited by their own maxRunTime)
maxruntime:
  maxRunTime: '96 hours'
  perTaskLimit: allow
watchdog: {}
logprefix:
```
And that is defined, hardcoded, in releng-puppet:
https://hg.mozilla.org/build/puppet/file/tip/modules/taskcluster_worker/templates/taskcluster-worker.yml.erb#l28
https://hg.mozilla.org/build/puppet/file/tip/modules/taskcluster_worker/manifests/init.pp#l13
Flags: needinfo?(dhouse)
Assignee | ||
Comment 5•7 years ago
here is a case where the maxruntime in taskcluster terminated the job in 1200 seconds (i.e. 20 minutes):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f7141e72bc1f2241a0e68b70af00554433e36d1d
the max-run-time is defined in-tree:
https://searchfox.org/mozilla-central/source/taskcluster/ci/test/talos.yml#506
Unfortunately this didn't time out for linux (windows is buildbot in that try job). Based on the above config, I would expect |perTaskLimit: allow| to make the worker honor the in-tree max-run-time: 1200, but that doesn't seem to be happening.
:wcosta- can you get the maxRunTime information from the osx worker so we can compare what that is and why it is working?
:jonasfj, can you think of any features in a worker that might ignore this?
Flags: needinfo?(wcosta)
Flags: needinfo?(jopsen)
Comment 6•7 years ago
osx uses generic-worker, and I think gw honors the task configuration. 303 :pmoore to confirm it.
Flags: needinfo?(wcosta) → needinfo?(pmoore)
Comment 7•7 years ago
(In reply to Wander Lairson Costa [:wcosta] from comment #6)
> osx uses generic-worker, and I think gw honors the task configuration. 303
> :pmoore to confirm it.
Yes, generic-worker uses the same format for maximum runtime as docker-worker, which is why it works on macOS (our gecko macOS testers run generic-worker).
See https://docs.taskcluster.net/reference/workers/generic-worker/payload for generic-worker payload format.
Flags: needinfo?(pmoore)
Comment 8•7 years ago
If jmaher is talking about the "linux x64 opt | sp" line, taken from the treeherder link above:
taskId: F2yiXvgZR6OeB5uTOAr1pg
https://tools.taskcluster.net/groups/BMpDNACcRQqaZ8u3n2QAHQ/tasks/F2yiXvgZR6OeB5uTOAr1pg/details
Then clearly the task.payload doesn't contain maxRunTime; I'm guessing the in-tree transforms are missing some tweaks.
The @payload_builder('docker-worker') contains:
> if 'max-run-time' in worker:
>     payload['maxRunTime'] = worker['max-run-time']
https://searchfox.org/mozilla-central/rev/a5abf843f8fac8530aa5de6fb40e16547bb4a47a/taskcluster/taskgraph/transforms/task.py#819-820
And the @payload_builder('native-engine') seems to be missing something like that, see:
https://searchfox.org/mozilla-central/rev/a5abf843f8fac8530aa5de6fb40e16547bb4a47a/taskcluster/taskgraph/transforms/task.py#1091-1111
Probably, the in-tree logic will have to be tweaked.
Flags: needinfo?(jopsen)
Assignee | ||
Comment 9•7 years ago
with the attached patch, I have proven that adding payload['maxRunTime'] to the talos jobs native-engine worker yields a timeout as expected:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cecfc0465098009a5d1ea65a243376639c7cba8c
the problem is when I try to proxy the value from the test definition in talos.yml to be set in the worker object, I get decision task failures:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1dfe919b9fbecc6b84696451effe6ea6db49f661
so you can see in the above patch where that is commented out.
:jonasfj, could you help find someone more familiar with this workflow who could help me get the value from the test -> payload.
Flags: needinfo?(mcornmesser) → needinfo?(jopsen)
Comment 10•7 years ago
Probably we have to add max-run-time to the schema as well, like:
> # the maximum time to run, in seconds
> Required('max-run-time'): int,
in:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/transforms/task.py?q=taskcluster%2Ftaskgraph%2Ftransforms%2Ftask.py&redirect_type=direct#456-486
where the schema for @payload_builder('native-engine') is defined...
(this isn't super intuitive)
Then I think something like:
> if 'max-run-time' in worker:
>     payload['maxRunTime'] = worker['max-run-time']
in:
> @payload_builder('native-engine')
> def build_macosx_engine_payload(...
will do the trick.
I suspect this is what's missing. Otherwise, we might have to ask dustin :)
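Put together, the suggestion amounts to roughly the following. This is a self-contained sketch, not the in-tree code: the real schema uses the `Required(...)` form quoted above, whereas this uses a plain check, and `validate_native_engine_worker` and `build_native_engine_payload` are hypothetical names.

```python
def validate_native_engine_worker(worker):
    """Stand-in for the schema entry: 'max-run-time' is a required int."""
    if "max-run-time" not in worker:
        raise ValueError("worker is missing required key 'max-run-time'")
    if not isinstance(worker["max-run-time"], int):
        raise ValueError("'max-run-time' must be an integer number of seconds")


def build_native_engine_payload(worker, task_def):
    """Fill task_def['payload'] from the worker description."""
    validate_native_engine_worker(worker)
    payload = {
        "command": worker.get("command", []),
        "env": worker.get("env", {}),
    }
    # The limit is an instruction to the worker, so it is copied into
    # the payload under the camelCase name the worker reads.
    payload["maxRunTime"] = worker["max-run-time"]
    task_def["payload"] = payload
    return task_def


if __name__ == "__main__":
    task = build_native_engine_payload(
        {"max-run-time": 1200, "command": ["./test-linux.sh"]}, {}
    )
    print(task["payload"]["maxRunTime"])
```

Validating before building keeps a missing or mistyped value from silently producing a task with no limit.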
Flags: needinfo?(jopsen)
Assignee | ||
Comment 11•7 years ago
I am not figuring this out despite a handful of pushes to try yesterday- here is my latest one:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=480e6daba872ae431172318ed3720683ecaed7d6
I believe I put in the maxruntime for native-engine in the right places.
:dustin- if you have any quick ideas I am happy to try them out, otherwise maybe we can schedule some time to discuss this in more detail in the coming weeks.
Flags: needinfo?(dustin)
Comment 12•7 years ago
Looking at that decision task, I see that it failed with the "data should not have additional properties" error. The schema-validation library doesn't tell you which properties are additional, which is a bit annoying. But at a guess, there's some property of the task that shouldn't be there. The decision task creates 50 tasks in parallel, so it's a bit hard to tell which one caused the error. However, since you're working on talos I just scrolled back a bit until I saw a log line about creating a talos task - test-linux64-qr/opt-talos-chrome-e10s. Looking in task-graph.json:
"NuQRdTIQTN6FQxLfZEGNCA": {
  "attributes": {
    "always_target": false,
    "build_platform": "linux64",
    "build_type": "opt",
    "e10s": true,
    "kind": "test",
    "run_on_projects": [
      "mozilla-central",
      "try"
    ],
    "shipping_phase": null,
    "shipping_product": null,
    "talos_try_name": "chromez-e10s",
    "test_chunk": "1",
    "test_platform": "linux64-qr/opt",
    "unittest_flavor": "talos",
    "unittest_suite": "talos"
  },
  "dependencies": {
    "build": "CmspfrVOTv6NnuzYo48kuA"
  },
  "kind": "test",
  "label": "test-linux64-qr/opt-talos-chrome-e10s",
  "optimization": {
    "skip-unless-schedules": [
      "talos",
      "linux"
    ]
  },
  "task": {
    "created": {
      "relative-datestamp": "0 seconds"
    },
    "deadline": {
      "relative-datestamp": "1 day"
    },
    "dependencies": [
      "CmspfrVOTv6NnuzYo48kuA"
    ],
    "expires": {
      "relative-datestamp": "14 days"
    },
    "extra": {
      "chunks": {
        "current": 1,
        "total": 1
      },
      "index": {
        "rank": 0
      },
      "parent": "HzRti4Q7RIaKOtxU6qqsKw",
      "suite": {
        "flavor": "talos",
        "name": "talos"
      },
      "treeherder": {
        "collection": {
          "opt": true
        },
        "groupName": "Talos performance tests with e10s",
        "groupSymbol": "T-e10s",
        "jobKind": "test",
        "machine": {
          "platform": "linux64-qr"
        },
        "symbol": "c",
        "tier": 2
      }
    },
    "maxRunTime": 180,
    "metadata": {
      "description": "Talos chrome ([Treeherder push](https://treeherder.mozilla.org/#/jobs?repo=try&revision=480e6daba872ae431172318ed3720683ecaed7d6))",
      "name": "test-linux64-qr/opt-talos-chrome-e10s",
      "owner": "jmaher@mozilla.com",
      "source": "https://hg.mozilla.org/try/file/480e6daba872ae431172318ed3720683ecaed7d6/taskcluster/ci/test"
    },
    "payload": {
      "artifacts": [
        {
          "expires": {
            "relative-datestamp": "14 days"
          },
          "name": "public/logs",
          "path": "workspace/build/upload/logs",
          "type": "directory"
        },
        {
          "expires": {
            "relative-datestamp": "14 days"
          },
          "name": "public/test",
          "path": "artifacts",
          "type": "directory"
        },
        {
          "expires": {
            "relative-datestamp": "14 days"
          },
          "name": "public/test_info",
          "path": "workspace/build/blobber_upload_dir",
          "type": "directory"
        }
      ],
      "command": [
        "./test-linux.sh",
        "--no-read-buildbot-config",
        "--installer-url=https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/target.tar.bz2",
        "--test-packages-url=https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/target.test_packages.json",
        "--suite=chromez-e10s",
        "--add-option",
        "--webServer,localhost",
        "--add-option",
        "--webServer,localhost",
        "--use-talos-json",
        "--branch-name",
        "try",
        "--enable-webrender",
        "--download-symbols=ondemand"
      ],
      "context": "https://hg.mozilla.org/try/raw-file/480e6daba872ae431172318ed3720683ecaed7d6/taskcluster/scripts/tester/test-linux.sh",
      "env": {
        "GECKO_HEAD_REPOSITORY": "https://hg.mozilla.org/try",
        "GECKO_HEAD_REV": "480e6daba872ae431172318ed3720683ecaed7d6",
        "MOZHARNESS_CONFIG": "talos/linux_config.py",
        "MOZHARNESS_SCRIPT": "talos_script.py",
        "MOZHARNESS_URL": "https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/mozharness.zip",
        "MOZILLA_BUILD_URL": "https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/target.tar.bz2",
        "MOZ_AUTOMATION": "1",
        "MOZ_HIDE_RESULTS_TABLE": "1",
        "MOZ_NODE_PATH": "/usr/local/bin/node",
        "MOZ_NO_REMOTE": "1",
        "NEED_XVFB": "false",
        "NO_EM_RESTART": "1",
        "NO_FAIL_ON_TEST_ERRORS": "1",
        "XPCOM_DEBUG_BREAK": "warn"
      },
      "maxRunTime": 180
    },
    "priority": "very-low",
    "provisionerId": "releng-hardware",
    "routes": [
      "tc-treeherder.v2.try.480e6daba872ae431172318ed3720683ecaed7d6.255607"
    ],
    "scopes": [],
    "tags": {
      "createdForUser": "jmaher@mozilla.com",
      "kind": "test",
      "label": "test-linux64-qr/opt-talos-chrome-e10s",
      "os": "linux",
      "worker-implementation": "native-engine"
    },
    "workerType": "gecko-t-linux-talos"
  },
  "task_id": "NuQRdTIQTN6FQxLfZEGNCA"
},
and indeed, I see maxRunTime in there twice -- once in `task.payload` (below XPCOM_DEBUG_BREAK) and once at the `task` level (right above `metadata`). The option is an instruction to the worker, so it is included in the payload, but is not allowed in the enclosing task definition.
So I think https://hg.mozilla.org/try/rev/0855a4ff58eec6a4116d4ea7be709948b571f355#l3.51
+    if 'max-run-time' in worker:
+        task_def['maxRunTime'] = worker['max-run-time']
is what is causing the issue.
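Since the validator does not name the offending property, a small helper can narrow it down. The sketch below compares a task definition's top-level keys against a known set; `KNOWN_TASK_KEYS` is an assumption built from the keys visible in the dump above, not the queue's actual schema, so adjust it before relying on it.

```python
# Top-level keys seen under "task" in the dump above, minus the stray
# maxRunTime. This is an assumed allow-list, not the queue's schema.
KNOWN_TASK_KEYS = frozenset([
    "created", "deadline", "dependencies", "expires", "extra",
    "metadata", "payload", "priority", "provisionerId", "routes",
    "scopes", "tags", "workerType",
])


def unexpected_task_keys(task_def, known=KNOWN_TASK_KEYS):
    """Return the sorted top-level keys of task_def that are not in known."""
    return sorted(set(task_def) - set(known))


if __name__ == "__main__":
    task = {
        "created": {"relative-datestamp": "0 seconds"},
        "maxRunTime": 180,               # misplaced: task level
        "metadata": {},
        "payload": {"maxRunTime": 180},  # correct: inside the payload
    }
    print(unexpected_task_keys(task))
```

Only top-level keys are checked, so the copy inside `payload` is left alone, which matches the distinction drawn above.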
Flags: needinfo?(dustin)
Assignee | ||
Comment 13•7 years ago
Assignee: nobody → jmaher
Attachment #8949776 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8955515 - Flags: review?(dustin)
Assignee | ||
Updated•7 years ago
Whiteboard: [PI:March]
Updated•7 years ago
Attachment #8955515 - Flags: review?(dustin) → review+
Comment 14•7 years ago
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/00fc5ded3b48
ensure that max-run-time is working from taskcluster -> talos. r=dustin
Comment 15•7 years ago
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox60: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla60