Closed Bug 1433761 Opened 2 years ago Closed 2 years ago

ensure that max-run-time is working from taskcluster -> talos


(Testing :: Talos, enhancement)

Not set


(firefox60 fixed)

Tracking Status
firefox60 --- fixed


(Reporter: jmaher, Assigned: jmaher)


(Whiteboard: [PI:March])


(1 file, 1 obsolete file)

I noticed in:

that the orange 'o' and 'd' jobs timed out at an hour even though the patch sets max-run-time to something lower.  We are running on the new moonshot hardware and using a hardware-based taskcluster-worker.  I would like to determine whether this worker supports max-run-time.

If it doesn't, I would like to pass the max-run-time to talos via a commandline argument and then die gracefully.  Right now we have a few areas for timeout:
1) taskcluster worker (I suspect this isn't supported)
2) mozharness
3) talos harness
4) firefox process stdout monitor

Looking at those 4 areas, we should focus on the talos harness and the taskcluster worker.
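
For the "pass the limit to the harness and die gracefully" idea above, a minimal sketch, assuming the harness launches the browser as a child process. The function name `run_with_deadline` and the `grace` parameter are hypothetical and are not part of talos or mozharness:

```python
import subprocess
import sys


def run_with_deadline(cmd, max_run_time, grace=5):
    """Run cmd and return (exit_code, timed_out).

    max_run_time and grace are in seconds. Hypothetical helper, not talos code.
    """
    proc = subprocess.Popen(cmd)
    try:
        # Normal case: the child finishes before the deadline.
        return proc.wait(timeout=max_run_time), False
    except subprocess.TimeoutExpired:
        # Ask the child to exit first so it can flush logs and clean up.
        proc.terminate()
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            # The child ignored the polite request; force it down.
            proc.kill()
            proc.wait()
        return proc.returncode, True
```

The harness could then report a timeout failure itself instead of waiting for an outer layer to kill the whole job.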

:dustin, could you help determine, or find the right person who could figure out, whether the hardware taskcluster-worker we are using on the new moonshot hardware for linux supports max-run-time?
Flags: needinfo?(dustin)
I think Jonas is the right person..
Flags: needinfo?(dustin) → needinfo?(jopsen)
taskcluster-worker has a "maxruntime" plugin you can activate in its configuration.

> maxruntime:
>   maxRunTime: '1 hour'
>   perTaskLimit: 'allow|forbid|require'


If you specify config to the worker you can print the payload schema too:
>  taskcluster-worker schema payload <config.yml>

But the docs will also let you infer most of the properties available:

Note, if a property is allowed, then usually it's also used by something :)

For example, a plugin config with maxRunTime set to '4 hours' and perTaskLimit set to 'allow' enforces a max-run-time of 4 hours, but allows tasks to specify their own task.payload.maxRunTime property, like:
    task.payload.maxRunTime = '3 hours 5 min'
However, tasks cannot specify task.payload.maxRunTime = '6 h' because the plugin was configured with a max of 4 hours.
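
The rule described above can be sketched in a few lines of Python. This is not the plugin's actual code: the function name is hypothetical, durations are plain seconds rather than strings like '3 hours 5 min', and the behaviour for 'forbid' and 'require' is an assumption inferred from the option names.

```python
def effective_max_run_time(worker_limit, per_task_limit, task_limit=None):
    """Return the limit in seconds that would be enforced, or raise ValueError.

    worker_limit: maxRunTime from the worker config, in seconds.
    per_task_limit: 'allow', 'forbid' or 'require'.
    task_limit: task.payload.maxRunTime in seconds, or None if unset.
    """
    if per_task_limit not in ("allow", "forbid", "require"):
        raise ValueError("unknown perTaskLimit: %r" % (per_task_limit,))
    if per_task_limit == "forbid":
        # Assumed meaning: tasks may not carry their own limit at all.
        if task_limit is not None:
            raise ValueError("task may not set maxRunTime")
        return worker_limit
    if task_limit is None:
        # Assumed meaning of 'require': the task must carry its own limit.
        if per_task_limit == "require":
            raise ValueError("task must set maxRunTime")
        return worker_limit
    if task_limit > worker_limit:
        # A task can lower the limit but never raise it above the worker's.
        raise ValueError("task maxRunTime exceeds the worker limit")
    return task_limit
```

With a 4 hour worker limit and 'allow', a task asking for 3 hours 5 min gets its own limit, a task asking for 6 hours is rejected, and a task that sets nothing falls back to 4 hours.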
Flags: needinfo?(jopsen)
thanks for the great information Jonas!

:dhouse/:markco, can you ensure the taskcluster workers that we are installing on the new moonshots use the perTaskLimit: 'allow' option?
Flags: needinfo?(mcornmesser)
Flags: needinfo?(dhouse)
Joel, here's what I'm seeing for linux:

[root@ms1-5 ~]# grep -C3 -i maxruntime /etc/taskcluster-worker.yml
      maxLifeCycle: '96 hours'
      allowTaskReboots: true
    # tasks can never take more than 96 hours (but typically are limited by their own maxRunTime)
      maxRunTime: '96 hours'
      perTaskLimit: allow
    watchdog: {}

And that is defined, hardcoded, in releng-puppet:
Flags: needinfo?(dhouse)
here is a case where the maxruntime in taskcluster terminated the job in 1200 seconds (i.e. 20 minutes):

the max-run-time is defined in-tree:

Unfortunately this didn't time out on linux (windows is on buildbot in that try job).  Based on the above config, I would expect |perTaskLimit: allow| to pick up the in-tree max-run-time: 1200, but it doesn't seem to.

:wcosta- can you get the maxRunTime information from the osx worker so we can compare what that is and why it is working?

:jonasfj, can you think of any features in a worker that might ignore this?
Flags: needinfo?(wcosta)
Flags: needinfo?(jopsen)
osx uses generic-worker, and I think gw honors the task configuration. 303 :pmoore to confirm it.
Flags: needinfo?(wcosta) → needinfo?(pmoore)
(In reply to Wander Lairson Costa [:wcosta] from comment #6)
> osx uses generic-worker, and I think gw honors the task configuration. 303
> :pmoore to confirm it.

Yes, generic-worker uses the same format for maximum runtime as docker-worker, which is why it works on macOS (our gecko macOS testers run generic-worker).

See for generic-worker payload format.
Flags: needinfo?(pmoore)
If jmaher is talking about the "linux x64 opt | sp" line, taken from the treeherder link above:
  taskId: F2yiXvgZR6OeB5uTOAr1pg

Then clearly the task.payload doesn't contain maxRunTime. I'm guessing the in-tree transforms are missing some tweaks.
The @payload_builder('docker-worker') contains:
>    if 'max-run-time' in worker:
>        payload['maxRunTime'] = worker['max-run-time']

And the @payload_builder('native-engine') seems to be missing something like that, see:

Probably, the in-tree logic will have to be tweaked.
Flags: needinfo?(jopsen)
Attached patch maxruntime.patch (obsolete) — Splinter Review
with the attached patch, I have proven that adding payload['maxRunTime'] to the talos jobs native-engine worker yields a timeout as expected:

the problem is when I try to proxy the value from the test definition in talos.yml to be set in the worker object, I get decision task failures:

so you can see in the above patch where that is commented out.

:jonasfj, could you help find someone more familiar with this workflow who could help me get the value from the test -> payload.
Flags: needinfo?(mcornmesser) → needinfo?(jopsen)
Probably we have to add max-run-time to the schema as well, like:
>        # the maximum time to run, in seconds
>        Required('max-run-time'): int,

where the schema for @payload_builder('native-engine') is defined...
(this isn't super intuitive)

Then I think something like:
>     if 'max-run-time' in worker:
>        payload['maxRunTime'] = worker['max-run-time']
inside:
>     @payload_builder('native-engine')
>     def build_macosx_engine_payload(...
will do the trick.

I suspect this is what's missing. Otherwise, we might have to ask dustin :)
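
A self-contained sketch of that idea, for illustration only. The registry, the two-argument builder signature and the `command`/`env` fields are simplified stand-ins and not the in-tree taskgraph code; only the `max-run-time` to `maxRunTime` copy comes from the snippets above.

```python
# Hypothetical stand-in for the in-tree registry of payload builders.
payload_builders = {}


def payload_builder(name):
    """Register a function that builds task_def['payload'] for one worker type."""
    def wrap(func):
        payload_builders[name] = func
        return func
    return wrap


@payload_builder("native-engine")
def build_native_engine_payload(task, task_def):
    worker = task["worker"]
    payload = task_def["payload"] = {
        "command": worker["command"],
        "env": worker.get("env", {}),
    }
    # The limit is an instruction to the worker, so it goes in the payload,
    # not in the enclosing task definition.
    if "max-run-time" in worker:
        payload["maxRunTime"] = worker["max-run-time"]
```

Calling `payload_builders["native-engine"](task, task_def)` with a worker that sets `max-run-time: 1200` leaves `task_def["payload"]["maxRunTime"]` at 1200 and adds nothing at the task level.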
Flags: needinfo?(jopsen)
I am not figuring this out despite a handful of pushes to try yesterday- here is my latest one:

I believe I put in the maxruntime for native-engine in the right places.

:dustin- if you have any quick ideas I am happy to try them out, otherwise maybe we can schedule some time to discuss this in more detail in the coming weeks.
Flags: needinfo?(dustin)
Looking at that decision task, I see that it failed with the error "data should not have additional properties".  The schema-validation library doesn't tell you which properties are additional, which is a bit annoying.  But at a guess, there's some property of the task that shouldn't be there.  The decision task creates 50 tasks in parallel, so it's a bit hard to tell which one caused the error.  However, since you're working on talos, I just scrolled back a bit until I saw a log line about creating a talos task: test-linux64-qr/opt-talos-chrome-e10s.  Looking in task-graph.json:

    "attributes": {
      "always_target": false,
      "build_platform": "linux64",
      "build_type": "opt",
      "e10s": true,
      "kind": "test",
      "run_on_projects": [ ... ],
      "shipping_phase": null,
      "shipping_product": null,
      "talos_try_name": "chromez-e10s",
      "test_chunk": "1",
      "test_platform": "linux64-qr/opt",
      "unittest_flavor": "talos",
      "unittest_suite": "talos"
    },
    "dependencies": {
      "build": "CmspfrVOTv6NnuzYo48kuA"
    },
    "kind": "test",
    "label": "test-linux64-qr/opt-talos-chrome-e10s",
    "optimization": {
      "skip-unless-schedules": [ ... ]
    },
    "task": {
      "created": {
        "relative-datestamp": "0 seconds"
      },
      "deadline": {
        "relative-datestamp": "1 day"
      },
      "dependencies": [ ... ],
      "expires": {
        "relative-datestamp": "14 days"
      },
      "extra": {
        "chunks": {
          "current": 1,
          "total": 1
        },
        "index": {
          "rank": 0
        },
        "parent": "HzRti4Q7RIaKOtxU6qqsKw",
        "suite": {
          "flavor": "talos",
          "name": "talos"
        },
        "treeherder": {
          "collection": {
            "opt": true
          },
          "groupName": "Talos performance tests with e10s",
          "groupSymbol": "T-e10s",
          "jobKind": "test",
          "machine": {
            "platform": "linux64-qr"
          },
          "symbol": "c",
          "tier": 2
        }
      },
      "maxRunTime": 180,
      "metadata": {
        "description": "Talos chrome ([Treeherder push](",
        "name": "test-linux64-qr/opt-talos-chrome-e10s",
        "owner": "",
        "source": ""
      },
      "payload": {
        "artifacts": [
          {
            "expires": {
              "relative-datestamp": "14 days"
            },
            "name": "public/logs",
            "path": "workspace/build/upload/logs",
            "type": "directory"
          },
          {
            "expires": {
              "relative-datestamp": "14 days"
            },
            "name": "public/test",
            "path": "artifacts",
            "type": "directory"
          },
          {
            "expires": {
              "relative-datestamp": "14 days"
            },
            "name": "public/test_info",
            "path": "workspace/build/blobber_upload_dir",
            "type": "directory"
          }
        ],
        "command": [ ... ],
        "context": "",
        "env": {
          "GECKO_HEAD_REPOSITORY": "",
          "GECKO_HEAD_REV": "480e6daba872ae431172318ed3720683ecaed7d6",
          "MOZHARNESS_CONFIG": "talos/",
          "MOZHARNESS_SCRIPT": "",
          "MOZHARNESS_URL": "",
          "MOZILLA_BUILD_URL": "",
          "MOZ_AUTOMATION": "1",
          "MOZ_HIDE_RESULTS_TABLE": "1",
          "MOZ_NODE_PATH": "/usr/local/bin/node",
          "MOZ_NO_REMOTE": "1",
          "NEED_XVFB": "false",
          "NO_EM_RESTART": "1",
          "NO_FAIL_ON_TEST_ERRORS": "1",
          "XPCOM_DEBUG_BREAK": "warn"
        },
        "maxRunTime": 180
      },
      "priority": "very-low",
      "provisionerId": "releng-hardware",
      "routes": [ ... ],
      "scopes": [],
      "tags": {
        "createdForUser": "",
        "kind": "test",
        "label": "test-linux64-qr/opt-talos-chrome-e10s",
        "os": "linux",
        "worker-implementation": "native-engine"
      },
      "workerType": "gecko-t-linux-talos"
    },
    "task_id": "NuQRdTIQTN6FQxLfZEGNCA"

and indeed, I see maxRunTime in there twice -- once in `task.payload` (below XPCOM_DEBUG_BREAK) and once at the `task` level (right above `metadata`).  The option is an instruction to the worker, so it is included in the payload, but is not allowed in the enclosing task definition.

So I think
+    if 'max-run-time' in worker:
+        task_def['maxRunTime'] = worker['max-run-time']

is what is causing the issue.
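
Since the validation error doesn't name the offending property, a small check like the following could be run over each task definition to narrow it down. This is a sketch only: the function name is hypothetical, and the allowed key set is an assumption built from the keys seen in the dump above plus a few standard queue fields, not the queue's actual schema.

```python
# Assumed set of top-level task definition keys; not the real queue schema.
ALLOWED_TASK_KEYS = {
    "provisionerId", "workerType", "schedulerId", "taskGroupId",
    "dependencies", "requires", "routes", "priority", "retries",
    "created", "deadline", "expires", "scopes", "payload",
    "metadata", "tags", "extra",
}


def unexpected_task_keys(task_def):
    """Return the top-level keys of task_def that are not in the allowed set."""
    return sorted(set(task_def) - ALLOWED_TASK_KEYS)
```

For the talos task above this would report `maxRunTime`, because the copy inside `payload` is not a top-level key and is left alone.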
Flags: needinfo?(dustin)
Assignee: nobody → jmaher
Attachment #8949776 - Attachment is obsolete: true
Attachment #8955515 - Flags: review?(dustin)
Whiteboard: [PI:March]
Attachment #8955515 - Flags: review?(dustin) → review+
Pushed by
ensure that max-run-time is working from taskcluster -> talos. r=dustin
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60