Closed Bug 1056384 Opened 10 years ago Closed 6 years ago

Enforce max runtime at the task level to account for pulls / kills / etc.

Categories

(Taskcluster :: Workers, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jonasfj, Unassigned)

Details

(Whiteboard: [good first bug][lang=js])

Attachments

(1 file)

I'm a bit torn on this one. Mostly I'm worried that docker-worker won't clean up properly.
But I'm not sure it makes sense for `maxRunTime` to count against the time required to download the docker image. Ideally, `maxRunTime` should only be used to provide hard upper limits for task sanity; if you want your process to stop after x seconds, murder it yourself.
So this might be the desired semantics; if so, it should be reflected in the documentation, i.e. the description field in the JSON schema.

Note, pulling an image can take up to several minutes, so this is not very fair to tasks... as only some tasks need to fetch the docker image.

If we wanted to restrict the maximum time we're allowed to spend fetching the image for a task, I think this should be a docker-worker configuration constant, not a per-task option.

Example task:
http://docs.taskcluster.net/tools/task-inspector/#HPR6sVVQQSKEmMIt9045uw/0


@lightsofapollo,
IMO, the desired semantics are for `maxRunTime` to cover only the time the container is running, specifically excluding the time it takes to:
 - pull the docker image
 - upload artifacts and logs
 - do the various clean-up we have to do afterwards
I.e. `maxRunTime` should be strictly limited to container run-time, and nothing else.
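
Something like this is what I have in mind; a rough sketch where `pullImage`, `runContainer`, `killContainer`, `uploadArtifacts` and `cleanup` are made-up placeholders for the worker's real phases, not actual docker-worker functions:

    // Sketch only: the timer is armed strictly around the container-run phase.
    async function executeTask(task) {
      await pullImage(task.payload.image);       // not counted against maxRunTime

      const timer = setTimeout(() => {
        killContainer(task);                     // hard upper limit on run-time only
      }, task.payload.maxRunTime * 1000);
      try {
        await runContainer(task);                // the only phase the timer covers
      } finally {
        clearTimeout(timer);
      }

      await uploadArtifacts(task);               // not counted
      await cleanup(task);                       // not counted
    }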

Do you agree?
Flags: needinfo?(jlal)
Whiteboard: [good first bug][lang=javascript]
This is easy to change (note I did not say fix). maxRunTime should be a high upper bound to kill long-running tasks which may have done some crazy shit that prevents the task from ever completing... It does not make sense to separate out one of the most non-deterministic parts (download failures) into another timeout (given what I know about our current infrastructure and how Travis implements similar features).

Downloads may currently take a long time (minutes); this is actually a bug in our previous registry implementation: we could download 5 GB in ~3 min, while S3 downloads the same data in 30 s. I don't want to optimize pulls, which will be getting faster and faster (in the success case) and slower or broken if something is fucked up.

Another way to look at this would be implementing pull timeouts and failing the task after some period of time (e.g. if you have a long-running emulator build, its maxRunTime is going to be high but the pull timeout may be lower), rather than trying to separate out the timeout logic.
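
Rough sketch of what I mean (names like `pullTimeout` and `pullImage` are made up for illustration, not the worker's actual config or code):

    // Hypothetical worker-level setting, independent of task.payload.maxRunTime.
    const config = { pullTimeout: 10 * 60 * 1000 };  // 10 minutes

    // Generic helper: reject if `promise` doesn't settle within `ms`.
    function withTimeout(promise, ms, message) {
      let timer;
      const timeout = new Promise((resolve, reject) => {
        timer = setTimeout(() => reject(new Error(message)), ms);
      });
      return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
    }

    // Fail the task (rather than letting it hang) if the pull takes too long;
    // `pullImage` is a placeholder for the worker's pull step.
    async function pullWithTimeout(task) {
      return withTimeout(pullImage(task.payload.image), config.pullTimeout,
                         'docker pull exceeded pullTimeout');
    }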

Also, we should investigate what docker does at a low level for socket timeouts; the current logic will correctly handle docker pull errors, so the work here may be done for us by docker (it really should be, in any case).
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(jlal)
Resolution: --- → WONTFIX
Haha, so my above comment is useless, as this actually works as you wanted above =/ I need to double-check that the docker code does handle pull timeouts correctly; otherwise we have another kind of bug :)
Resolution: WONTFIX → FIXED
From the example task:
http://docs.taskcluster.net/tools/task-inspector/#HPR6sVVQQSKEmMIt9045uw/0

It seems to me that maxRunTime includes the time necessary to pull the docker image.
My main concern was that docker-worker didn't clean up correctly.

My secondary concern was that the semantics were in conflict with the description of the property in the JSON schema.
I don't really care whether or not maxRunTime includes time spent downloading the docker image, as long as it's documented correctly. The name implies that it doesn't include it, but if the description in the JSON schema says otherwise, that is also a fair definition.

Either way, reopened until docker-worker and documentation match up.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [good first bug][lang=javascript] → [good first bug][lang=js]
From load testing, when there are bugs the current maxRunTime is only good enough to detect simple in-task errors... I think maxRunTime should mean what it sounds like (at the task level) and enforce dropping tasks after the specified time to account for: container errors (run away), infra timeouts, etc...
Summary: docker-worker: Time necessary to pull docker image counts towards `maxRunTime` → docker-worker: Enforce max runtime at the task level to account for pulls / kills / etc.
> ... (at the task level) ...
> account for: container errors (run away),  infra timeouts, etc...
How is this different from what deadline does? It's just relative and harder to enforce.
Note, docker-worker is expected to kill tasks past the task deadline; the fact that it doesn't is another bug :)

maxRunTime should remain a property in task.payload specific to the docker-worker.
It may cover the following parts:
 A) Setup:
    - docker pull
    - launch proxy
    - create log
    - etc...
 B) Task evaluation:
    - docker run
 C) Extraction and cleanup
    - upload of artifacts
    - creation of task-graphs
    - reportCompleted
    - docker rm

As I've said before, I think it's most predictable if it only covers (B).
Also, (A) and (C) could be covered by smarter things, like a max number of artifacts and a max number of artifact bytes, as well as a max number of task-graph entries.
I'm talking about large max numbers just to ensure sanity. Again, artifacts as a list would have allowed for this limitation :)

I'm also okay with maxRunTime covering (A), (B) and (C),
as long as these semantics are documented in the payload JSON schema.
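
E.g. the payload schema entry could spell it out along these lines (just a sketch of possible wording, not the actual docker-worker schema):

    // Illustrative only; not the real docker-worker payload schema.
    const maxRunTimeSchema = {
      type: 'integer',
      title: 'Maximum run time in seconds',
      description:
        'Hard upper limit covering (A) image pull and setup, (B) container ' +
        'execution, and (C) artifact upload and cleanup. The task is killed ' +
        'and resolved as failed if this limit is exceeded.',
      minimum: 1,
    };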

Anyway, from a user perspective it seems to me that covering only (B) is the most useful,
as it basically allows you to kill a task that runs a few minutes longer than expected.
If (A) and (C) are included you'll have much higher overhead, and users should not be
expected to have a clue about how long (A) and (C) take. In particular, (A) can take anywhere from 300 ms to 3 min, which is a footgun for intermittent errors: 2 min will be sufficient for most decision tasks (or just small tasks), but not if they have to do `docker pull <something-big>`.
Anything the user supplies (this includes the image!) should be covered by the timeout... Kills/cleanup should actually be handled outside of the task success true/false workflow... Artifact upload should also be included.

maxRunTime is the equivalent of our 2-hour limit on buildbot, which includes everything I just spoke of... For just (B), it's easy to do that from the harness or whatever script you're running (we are not trying to catch those cases).
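
I.e. something like this in the task definition gives you a (B)-only limit with no worker changes (illustrative payload, not a real task):

    // Hypothetical payload: the harness enforces its own per-command limit with
    // coreutils `timeout`, while maxRunTime stays a coarse end-to-end bound.
    const payload = {
      image: 'ubuntu:14.04',
      command: ['/bin/bash', '-c', 'timeout 3600 ./run-tests.sh'],
      maxRunTime: 7200,  // covers pull + run + artifact upload
    };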
Okay, I'll buy that argument.
Let it cover (A), (B) and (C).

Note, from my test it already covers (A) and (B) at least.

And the talk about task-level meant run-level on the worker, right? :)
Right :) You make a good point about deadline... basically we should shorten maxRunTime to the deadline if it's near, OR reject the task and mark it as failed...
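
Something along these lines (sketch only; field names illustrative, error handling omitted):

    // Clamp the effective timeout to the task deadline, or refuse the task if
    // the deadline has already passed.
    function effectiveRunTime(task) {
      const requested = task.payload.maxRunTime * 1000;
      const untilDeadline = new Date(task.deadline) - Date.now();
      if (untilDeadline <= 0) {
        throw new Error('task is already past its deadline; resolve as failed');
      }
      return Math.min(requested, untilDeadline);
    }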
Assignee: nobody → garndt
Resetting the assigned-to field as this has been idle for a few months; do let me know if I was too fast here, and take the bug again if there are plans to work on this!
Assignee: garndt → nobody
Status: REOPENED → NEW
Assignee: nobody → garndt
Attached file Worker pull 78
This should abort if artifact upload is taking too long and max runtime is reached.
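
Roughly the behaviour, as a sketch with made-up helper names (`uploadArtifacts` is a placeholder, not the actual diff):

    // Abort the artifact upload if the overall maxRunTime budget is exhausted.
    async function uploadWithinRunTime(task, startedAt) {
      const remaining = task.payload.maxRunTime * 1000 - (Date.now() - startedAt);
      if (remaining <= 0) {
        throw new Error('maxRunTime exceeded before artifact upload');
      }
      let timer;
      const deadline = new Promise((resolve, reject) => {
        timer = setTimeout(
          () => reject(new Error('maxRunTime exceeded during artifact upload')),
          remaining);
      });
      try {
        return await Promise.race([uploadArtifacts(task), deadline]);
      } finally {
        clearTimeout(timer);
      }
    }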
Attachment #8604337 - Flags: review?(jlal)
Comment on attachment 8604337 [details]
Worker pull 78

Formal r+. Note that we may need to update the task definitions for some tasks now that we changed how max runtime works...
Attachment #8604337 - Flags: review?(jlal) → review+
Component: TaskCluster → Docker-Worker
Product: Testing → Taskcluster
Summary: docker-worker: Enforce max runtime at the task level to account for pulls / kills / etc. → Enforce max runtime at the task level to account for pulls / kills / etc.
I'm going to unassign myself from this for now. Some max runtime stuff has landed within docker-worker, and we might also rethink this in a taskcluster-worker era. It was discussed in PDX that max runtime would only account for the task-execution piece of the task, rather than the end-to-end stuff that a user could not account for.

I will also re-classify this bug under "Worker".
Assignee: garndt → nobody
Mentor: jlal
Component: Docker-Worker → Worker
Status: NEW → RESOLVED
Closed: 10 years ago → 6 years ago
Resolution: --- → WONTFIX
Component: Worker → Workers