Bugzilla

Reporter

Updated

•

10 years ago

Whiteboard: [good first bug][lang=javascript]

Comment 1

•

10 years ago

This is easy to change (note I did not say fix) maxRunTime should be a high upper bounds to kill long running tasks which may have done some crazy shit that prevents the task from ever completing... It does not make sense to separate out one of the most non-deterministic parts (download failures) into another timeout (given what I know about our current infrastructure and how travis implements similar features).

Downloads now may take a long time (minutes) this is actually a bug in our previous registry implementation we could download 5gig in 3min~ with s3 downloading the same data in 30s. I don't want to optimize pulls which will be getting faster and faster (in the success case) and slower or broken if something is fucked up.

Another way to look at this would be implementing pull timeouts and failing the task after some period of time (like if you have a long running emulator build its maxruntime is going to be high but pull timeout may be lower) rather then trying to separate out timeout logic.

Also we should investigate what docker does at a low level for socket timeouts the current logic will correctly handle docker pull errors so the work here may be done for us by docker (it really should be in any case).

Status: NEW → RESOLVED

Closed: 10 years ago

Flags: needinfo?(jlal)

Resolution: --- → WONTFIX

Comment 2

•

10 years ago

Haha- so my above comment is useless as this actually works as you wanted above =/ I need to double check that the docker code does handle pull timeouts correctly otherwise we have another kind of bug :)

Resolution: WONTFIX → FIXED

Reporter

Comment 3

•

10 years ago

From the example task:
http://docs.taskcluster.net/tools/task-inspector/#HPR6sVVQQSKEmMIt9045uw/0

I seems to me that maxRunTime includes time necessary to pull the docker image.
My main concern was that docker-worker didn't clean-up correctly.

Secondary concern was that the semantics were in conflict with the description of the property from JSON schema.
I don't really care whether or not maxRunTime include time spent downloading docker image. As long as it's documented correctly. The name implies that it it doesn't include it, but if the description in the JSON schema, says otherwise that is also a fair definition.

Either way, reopened until docker-worker and documentation match up.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Josh Matthews [:jdm]

Updated

•

10 years ago

Whiteboard: [good first bug][lang=javascript] → [good first bug][lang=js]

Comment 4

•

10 years ago

From load testing when there are bugs the current maxRunTime is only good enough to detect simple in task errors... I think maxRunTime should mean what it sounds like (at the task level) and enforce dropping tasks after the specified time to account for: container errors (run away),  infra timeouts, etc...

Summary: docker-worker: Time necessary to pull docker image counts towards `maxRunTime` → docker-worker: Enforce max runtime at the task level account for pulls / kills / etc..

Reporter

Comment 5

•

10 years ago

> ... (at the task level) ...
> account for: container errors (run away),  infra timeouts, etc...
How is this different from what deadline does? It's just relative and harder to enforce.
Note, docker-worker is expected to kill tasks past task deadline, the fact that it doesn't is another bug :)

maxRunTime should remain a property in task.payload specific to the docker-worker.
It maybe cover following parts:
 A) Setup:
    - docker pull
    - launch proxy
    - create log
    - etc...
 B) Task evaluation:
    - docker run
 C) Extraction and cleanup
    - upload of artifacts
    - creation of task-graphs
    - reportCompleted
    - docker rm

As I've said before I think it's most predictable if it only covers B.
Also (A) and (C) could be covered by smarter things, like max number of artifacts and max number of artifact bytes... As well as max number of task-graph entries.
I'm talking about large max numbers just to ensure sanity. Again artifacts as a list would have allowed for this limitation :)

I'm also okay, with maxRunTime covering (A), (B) and (C).
As long as these semantics is documented in the payload JSON schema.

Anyways, from a user-perspective it seems to me that covering on (B) is the most useful.
As it basically allows you to kill a task that runs a few minutes longer than expected,
if (A) and (C) are included you'll have much higher overhead. And users should not be
expected to have clue about how long (A) and (C) takes. In particular (A) can take anywhere from 300ms to 3min, which is a foot gun for intermittent errors. As 2min will be sufficient for most decision tasks (or just small tasks), but not if they have to do "docker pull <something-big>".

Comment 6

•

10 years ago

Anything the user supplies (this includes the image!) should be covered in the timeout... Kills/cleanup should actually be handled outside of the task success true/false workflow... Artifact upload should also be included.

maxRunTime is the equivalent of our 2 hour limit on buildbot which includes everything I just spoke of... For just (b) here its easy to do that from the harness or whatever script your running (we are not trying to catch those cases).

Reporter

Comment 7

•

10 years ago

Okay, I'll buy that argument.
Let it cover A,B and C.

Note, from my test it already covers A and B at least.

And the talk about task-level was run-level on the worker, right :)

Comment 8

•

10 years ago

Right :) You make a good point about deadline... basically we should shorten the maxRunTime to deadline if its near OR reject the task and mark it as failed...

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

10 years ago

Assignee: nobody → garndt

Comment 9

•

9 years ago

resetting the assignedto field as this has been idle for a few months, do let me know if I was too fast here and take the bug again if there are plans to work on this!

Assignee: garndt → nobody

Status: REOPENED → NEW

Updated

•

9 years ago

Assignee: nobody → garndt

Comment 10

•

9 years ago

Attached file Worker pull 78 — Details

This should abort if artifact upload is taking too long and max runtime his reached.

Attachment #8604337 - Flags: review?(jlal)

Pete Moore [:pmoore][:pete]

Comment 11

•

9 years ago

Comment on attachment 8604337 [details]
Worker pull 78

Formal r+ note that we may need to update the task definitions for some tasks now that we changed how max runtime works...

Attachment #8604337 - Flags: review?(jlal) → review+

Updated

•

9 years ago

Component: TaskCluster → Docker-Worker

Product: Testing → Taskcluster

John Ford [:jhford] CET/CEST Berlin Time

Updated

•

9 years ago

Summary: docker-worker: Enforce max runtime at the task level account for pulls / kills / etc.. → Enforce max runtime at the task level account for pulls / kills / etc..

Comment 12

•

8 years ago

I'm going to unassign myself from this for now.  Some max runtime stuff has landed within docker-worker and also we might rethink this in a taskcluster-worker era.  It was discussed in PDX that max runtime would only account for the task execution piece of the task rather than the end-to-end stuff that a user could not account for.

I will also re-classify this bug under "worker"

Updated

•

8 years ago

Assignee: garndt → nobody

Updated

•

8 years ago

Mentor: jlal