Closed Bug 1433761 Opened 2 years ago Closed 2 years ago

ensure that max-run-time is working from taskcluster -> talos

Categories

(Testing :: Talos, enhancement)


Tracking

(firefox60 fixed)

RESOLVED FIXED
mozilla60

People

(Reporter: jmaher, Assigned: jmaher)

Details

(Whiteboard: [PI:March])

Attachments

(1 file, 1 obsolete file)

I noticed in:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=04c9f43eecf38d7cdbc538fd00b7b3b1ab1cf9b3

that the orange 'o' and 'd' jobs timed out at an hour despite max-run-time being set to something lower (specifically in the patch).  We are running on the new moonshot hardware and using a hardware-based taskcluster-worker.  I would like to determine whether this worker supports max-run-time.

If it doesn't, I would like to pass the max-run-time to talos via a commandline argument and then die gracefully.  Right now we have a few areas for timeout:
1) taskcluster worker (I suspect this isn't supported)
2) mozharness
3) talos harness
4) firefox process stdout monitor

Looking at those 4 areas, we should focus on the talos harness and the taskcluster worker.

:dustin, could you help determine, or find the right person who could figure out, whether the hardware taskcluster-worker we are using on the new moonshot hardware for linux supports max-run-time?
Flags: needinfo?(dustin)
I think Jonas is the right person..
Flags: needinfo?(dustin) → needinfo?(jopsen)
taskcluster-worker has a "maxruntime" plugin you can activate in its configuration.

> maxruntime:
>   maxRunTime: '1 hour'
>   perTaskLimit: 'allow|forbid|require'

See:
  https://docs.taskcluster.net/reference/workers/taskcluster-worker/docs/configuration

If you pass a config file to the worker, it can print the payload schema too:
>  taskcluster-worker schema payload <config.yml>

But the docs will also let you infer most of the properties available:
  https://docs.taskcluster.net/reference/workers/taskcluster-worker/docs/configuration

Note, if a property is allowed, then usually it's also used by something :)

Example, the following plugin config:
  https://github.com/taskcluster/taskcluster-worker/blob/843ae341ff9e205df9dda0d50ad91d0a69539de7/examples/packet-config.yml#L56-L58
enforces a max-run-time of 4 hours, but allows tasks to specify their own task.payload.maxRunTime property, like:
    task.payload.maxRunTime = '3 hours 5 min'
  However, tasks cannot specify task.payload.maxRunTime = '6 h' because the plugin was configured to a max of 4 hours.
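
The capping behaviour described above can be sketched in Python. This is only an illustration, not taskcluster-worker's actual code: the function name `effective_max_run_time` is hypothetical, durations are plain seconds rather than strings like '3 hours 5 min', and the meanings assumed for 'forbid' (tasks may not set a limit) and 'require' (tasks must set one) are inferred from the option names. Rejecting an over-limit task, rather than clamping it, is also an assumption.

```python
def effective_max_run_time(worker_limit, per_task_limit, task_limit=None):
    """Return the limit (in seconds) that would apply to a task.

    worker_limit   -- the plugin's configured maxRunTime, in seconds
    per_task_limit -- 'allow', 'forbid' or 'require'
    task_limit     -- task.payload.maxRunTime in seconds, or None if unset
    """
    if per_task_limit not in ('allow', 'forbid', 'require'):
        raise ValueError('unknown perTaskLimit: %r' % (per_task_limit,))
    # Assumed semantics: 'forbid' means the task may not set its own limit.
    if per_task_limit == 'forbid' and task_limit is not None:
        raise ValueError('task may not specify maxRunTime')
    # Assumed semantics: 'require' means the task must set its own limit.
    if per_task_limit == 'require' and task_limit is None:
        raise ValueError('task must specify maxRunTime')
    if task_limit is None:
        return worker_limit
    # The worker-level value is an upper bound on what a task may ask for.
    if task_limit > worker_limit:
        raise ValueError('task maxRunTime exceeds the worker maximum')
    return task_limit


# With a 4 hour worker maximum, a task asking for 3 hours 5 min is accepted.
print(effective_max_run_time(4 * 3600, 'allow', 3 * 3600 + 5 * 60))
```

Under these assumptions a task asking for 6 hours against the same 4 hour maximum would raise an error instead of running.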
Flags: needinfo?(jopsen)
thanks for the great information Jonas!

:dhouse/:markco, can you ensure the taskcluster workers that we are installing on the new moonshots use the perTaskLimit: 'allow' option?
Flags: needinfo?(mcornmesser)
Flags: needinfo?(dhouse)
Joel, here's what I'm seeing for linux:

```
[root@ms1-5 ~]# grep -C3 -i maxruntime /etc/taskcluster-worker.yml
    reboot:
      maxLifeCycle: '96 hours'
      allowTaskReboots: true
    # tasks can never take more than 96 hours (but typically are limited by their own maxRunTime)
    maxruntime:
      maxRunTime: '96 hours'
      perTaskLimit: allow
    watchdog: {}
    logprefix:
```

And that is defined, hardcoded, in releng-puppet:
https://hg.mozilla.org/build/puppet/file/tip/modules/taskcluster_worker/templates/taskcluster-worker.yml.erb#l28
https://hg.mozilla.org/build/puppet/file/tip/modules/taskcluster_worker/manifests/init.pp#l13
Flags: needinfo?(dhouse)
here is a case where the maxruntime in taskcluster terminated the job in 1200 seconds (i.e. 20 minutes):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f7141e72bc1f2241a0e68b70af00554433e36d1d

the max-run-time is defined in-tree:
https://searchfox.org/mozilla-central/source/taskcluster/ci/test/talos.yml#506

unfortunately this didn't time out for linux (windows is buildbot in that try job).  So based on the above config, I would expect that |perTaskLimit: allow| would use the in-tree max-run-time:1200, but it doesn't seem to.

:wcosta- can you get the maxRunTime information from the osx worker so we can compare what that is and why it is working?

:jonasfj, can you think of any features in a worker that might ignore this?
Flags: needinfo?(wcosta)
Flags: needinfo?(jopsen)
osx uses generic-worker, and I think gw honors the task configuration. 303 :pmoore to confirm it.
Flags: needinfo?(wcosta) → needinfo?(pmoore)
(In reply to Wander Lairson Costa [:wcosta] from comment #6)
> osx uses generic-worker, and I think gw honors the task configuration. 303
> :pmoore to confirm it.

Yes, generic-worker uses the same format for maximum runtime as docker-worker, which is why it works on macOS (our gecko macOS testers run generic-worker).

See https://docs.taskcluster.net/reference/workers/generic-worker/payload for generic-worker payload format.
Flags: needinfo?(pmoore)
if jmaher is talking about the "linux x64 opt | sp" line, taken from the treeherder link above:
  taskId: F2yiXvgZR6OeB5uTOAr1pg
https://tools.taskcluster.net/groups/BMpDNACcRQqaZ8u3n2QAHQ/tasks/F2yiXvgZR6OeB5uTOAr1pg/details

Then clearly the task.payload doesn't contain maxRunTime; I'm guessing the in-tree transforms are missing some tweaks.
The @payload_builder('docker-worker') contains:
>    if 'max-run-time' in worker:
>        payload['maxRunTime'] = worker['max-run-time']
https://searchfox.org/mozilla-central/rev/a5abf843f8fac8530aa5de6fb40e16547bb4a47a/taskcluster/taskgraph/transforms/task.py#819-820

And the @payload_builder('native-engine') seems to be missing something like that, see:
https://searchfox.org/mozilla-central/rev/a5abf843f8fac8530aa5de6fb40e16547bb4a47a/taskcluster/taskgraph/transforms/task.py#1091-1111

Probably, the in-tree logic will have to be tweaked.
Flags: needinfo?(jopsen)
Attached patch maxruntime.patch (obsolete) — Splinter Review
with the attached patch, I have proven that adding payload['maxRunTime'] to the native-engine worker payload for the talos jobs yields a timeout as expected:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cecfc0465098009a5d1ea65a243376639c7cba8c

the problem is when I try to proxy the value from the test definition in talos.yml to be set in the worker object, I get decision task failures:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1dfe919b9fbecc6b84696451effe6ea6db49f661

so you can see in the above patch where that is commented out.

:jonasfj, could you help find someone more familiar with this workflow who could help me get the value from the test -> payload?
Flags: needinfo?(mcornmesser) → needinfo?(jopsen)
Probably we have to add max-run-time to the schema as well, like:
>        # the maximum time to run, in seconds
>        Required('max-run-time'): int,
in:
  https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/transforms/task.py?q=taskcluster%2Ftaskgraph%2Ftransforms%2Ftask.py&redirect_type=direct#456-486

where the schema for @payload_builder('native-engine') is defined...
(this isn't super intuitive)

Then I think something like:
>     if 'max-run-time' in worker:
>        payload['maxRunTime'] = worker['max-run-time']
in:
>     @payload_builder('native-engine')
>     def build_macosx_engine_payload(...
will do the trick.

I suspect this is what's missing. Otherwise, we might have to ask dustin :)
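
Putting the two suggestions together, a self-contained sketch might look like the following. It is not the in-tree code: the `payload_builder` registry here is a simplified stand-in for the decorator in taskcluster/taskgraph/transforms/task.py, the builder name `build_native_engine_payload` and the `command`/`env` worker fields are illustrative, and the schema entry for 'max-run-time' is left out because the real one is declared with the in-tree schema helpers.

```python
# Simplified stand-in for the in-tree payload builder registry (assumption).
payload_builders = {}


def payload_builder(name):
    """Register a function that fills in task_def['payload'] for a worker."""
    def wrap(func):
        payload_builders[name] = func
        return func
    return wrap


@payload_builder('native-engine')
def build_native_engine_payload(worker, task_def):
    # Illustrative subset of the fields a native-engine payload carries.
    payload = {
        'command': worker['command'],
        'env': worker.get('env', {}),
    }
    # The limit is an instruction to the worker, so it belongs in the payload.
    if 'max-run-time' in worker:
        payload['maxRunTime'] = worker['max-run-time']
    task_def['payload'] = payload


worker = {
    'implementation': 'native-engine',
    'command': ['./test-linux.sh'],
    'max-run-time': 1200,  # seconds, as in the talos.yml definition
}
task_def = {}
payload_builders[worker['implementation']](worker, task_def)
print(task_def['payload']['maxRunTime'])
```

The value ends up only under `task_def['payload']`; nothing is written at the top level of the task definition.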
Flags: needinfo?(jopsen)
I am not figuring this out despite a handful of pushes to try yesterday- here is my latest one:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=480e6daba872ae431172318ed3720683ecaed7d6

I believe I put in the maxruntime for native-engine in the right places.

:dustin- if you have any quick ideas I am happy to try them out, otherwise maybe we can schedule some time to discuss this in more detail in the coming weeks.
Flags: needinfo?(dustin)
Looking at that decision task, I see that it failed with the "data should not have additional properties" error.  The schema-validation library doesn't tell you which properties are additional, which is a bit annoying.  But at a guess, there's some property of the task that shouldn't be there.  The decision task creates 50 tasks in parallel, so it's a bit hard to tell which one caused the error.  However, since you're working on talos I just scrolled back a bit until I saw a log line about creating a talos task - test-linux64-qr/opt-talos-chrome-e10s.  Looking in task-graph.json:

  "NuQRdTIQTN6FQxLfZEGNCA": {
    "attributes": {
      "always_target": false,
      "build_platform": "linux64",
      "build_type": "opt",
      "e10s": true,
      "kind": "test",
      "run_on_projects": [
        "mozilla-central",
        "try"
      ],
      "shipping_phase": null,
      "shipping_product": null,
      "talos_try_name": "chromez-e10s",
      "test_chunk": "1",
      "test_platform": "linux64-qr/opt",
      "unittest_flavor": "talos",
      "unittest_suite": "talos"
    },
    "dependencies": {
      "build": "CmspfrVOTv6NnuzYo48kuA"
    },
    "kind": "test",
    "label": "test-linux64-qr/opt-talos-chrome-e10s",
    "optimization": {
      "skip-unless-schedules": [
        "talos",
        "linux"
      ]
    },
    "task": {
      "created": {
        "relative-datestamp": "0 seconds"
      },
      "deadline": {
        "relative-datestamp": "1 day"
      },
      "dependencies": [
        "CmspfrVOTv6NnuzYo48kuA"
      ],
      "expires": {
        "relative-datestamp": "14 days"
      },
      "extra": {
        "chunks": {
          "current": 1,
          "total": 1
        },
        "index": {
          "rank": 0
        },
        "parent": "HzRti4Q7RIaKOtxU6qqsKw",
        "suite": {
          "flavor": "talos",
          "name": "talos"
        },
        "treeherder": {
          "collection": {
            "opt": true
          },
          "groupName": "Talos performance tests with e10s",
          "groupSymbol": "T-e10s",
          "jobKind": "test",
          "machine": {
            "platform": "linux64-qr"
          },
          "symbol": "c",
          "tier": 2
        }
      },
      "maxRunTime": 180,
      "metadata": {
        "description": "Talos chrome ([Treeherder push](https://treeherder.mozilla.org/#/jobs?repo=try&revision=480e6daba872ae431172318ed3720683ecaed7d6))",
        "name": "test-linux64-qr/opt-talos-chrome-e10s",
        "owner": "jmaher@mozilla.com",
        "source": "https://hg.mozilla.org/try/file/480e6daba872ae431172318ed3720683ecaed7d6/taskcluster/ci/test"
      },
      "payload": {
        "artifacts": [
          {
            "expires": {
              "relative-datestamp": "14 days"
            },
            "name": "public/logs",
            "path": "workspace/build/upload/logs",
            "type": "directory"
          },
          {
            "expires": {
              "relative-datestamp": "14 days"
            },
            "name": "public/test",
            "path": "artifacts",
            "type": "directory"
          },
          {
            "expires": {
              "relative-datestamp": "14 days"
            },
            "name": "public/test_info",
            "path": "workspace/build/blobber_upload_dir",
            "type": "directory"
          }
        ],
        "command": [
          "./test-linux.sh",
          "--no-read-buildbot-config",
          "--installer-url=https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/target.tar.bz2",
          "--test-packages-url=https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/target.test_packages.json",
          "--suite=chromez-e10s",
          "--add-option",
          "--webServer,localhost",
          "--add-option",
          "--webServer,localhost",
          "--use-talos-json",
          "--branch-name",
          "try",
          "--enable-webrender",
          "--download-symbols=ondemand"
        ],
        "context": "https://hg.mozilla.org/try/raw-file/480e6daba872ae431172318ed3720683ecaed7d6/taskcluster/scripts/tester/test-linux.sh",
        "env": {
          "GECKO_HEAD_REPOSITORY": "https://hg.mozilla.org/try",
          "GECKO_HEAD_REV": "480e6daba872ae431172318ed3720683ecaed7d6",
          "MOZHARNESS_CONFIG": "talos/linux_config.py",
          "MOZHARNESS_SCRIPT": "talos_script.py",
          "MOZHARNESS_URL": "https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/mozharness.zip",
          "MOZILLA_BUILD_URL": "https://queue.taskcluster.net/v1/task/CmspfrVOTv6NnuzYo48kuA/artifacts/public/build/target.tar.bz2",
          "MOZ_AUTOMATION": "1",
          "MOZ_HIDE_RESULTS_TABLE": "1",
          "MOZ_NODE_PATH": "/usr/local/bin/node",
          "MOZ_NO_REMOTE": "1",
          "NEED_XVFB": "false",
          "NO_EM_RESTART": "1",
          "NO_FAIL_ON_TEST_ERRORS": "1",
          "XPCOM_DEBUG_BREAK": "warn"
        },
        "maxRunTime": 180
      },
      "priority": "very-low",
      "provisionerId": "releng-hardware",
      "routes": [
        "tc-treeherder.v2.try.480e6daba872ae431172318ed3720683ecaed7d6.255607"
      ],
      "scopes": [],
      "tags": {
        "createdForUser": "jmaher@mozilla.com",
        "kind": "test",
        "label": "test-linux64-qr/opt-talos-chrome-e10s",
        "os": "linux",
        "worker-implementation": "native-engine"
      },
      "workerType": "gecko-t-linux-talos"
    },
    "task_id": "NuQRdTIQTN6FQxLfZEGNCA"
  },

and indeed, I see maxRunTime in there twice -- once in `task.payload` (below XPCOM_DEBUG_BREAK) and once at the `task` level (right above `metadata`).  The option is an instruction to the worker, so it is included in the payload, but is not allowed in the enclosing task definition.

So I think https://hg.mozilla.org/try/rev/0855a4ff58eec6a4116d4ea7be709948b571f355#l3.51
+    if 'max-run-time' in worker:
+        task_def['maxRunTime'] = worker['max-run-time']

is what is causing the issue.
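
Since the validation error does not name the offending property, a small helper can narrow it down. The sketch below is a rough aid, not part of taskgraph: `ALLOWED_TASK_KEYS` is simply the set of top-level `task` keys visible in the definition quoted above (minus maxRunTime) and stands in for the real queue schema, and `additional_properties` is a hypothetical name.

```python
# Top-level keys seen in the task definition quoted above, minus maxRunTime.
# Assumption: this set stands in for the real queue schema.
ALLOWED_TASK_KEYS = {
    'created', 'deadline', 'dependencies', 'expires', 'extra', 'metadata',
    'payload', 'priority', 'provisionerId', 'routes', 'scopes', 'tags',
    'workerType',
}


def additional_properties(task_graph):
    """Map each task label to its sorted top-level keys that are not allowed."""
    problems = {}
    for entry in task_graph.values():
        extra = sorted(set(entry['task']) - ALLOWED_TASK_KEYS)
        if extra:
            problems[entry['label']] = extra
    return problems


# Trimmed-down version of the entry from task-graph.json.
task_graph = {
    'NuQRdTIQTN6FQxLfZEGNCA': {
        'label': 'test-linux64-qr/opt-talos-chrome-e10s',
        'task': {
            'maxRunTime': 180,               # flagged: not a task-level field
            'payload': {'maxRunTime': 180},  # fine: part of the worker payload
            'workerType': 'gecko-t-linux-talos',
        },
    },
}
print(additional_properties(task_graph))
```

The same function could be pointed at the full graph loaded with `json.load` from the decision task's task-graph.json artifact.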
Flags: needinfo?(dustin)
Assignee: nobody → jmaher
Attachment #8949776 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8955515 - Flags: review?(dustin)
Whiteboard: [PI:March]
Attachment #8955515 - Flags: review?(dustin) → review+
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/00fc5ded3b48
ensure that max-run-time is working from taskcluster -> talos. r=dustin
https://hg.mozilla.org/mozilla-central/rev/00fc5ded3b48
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60