
Make "Aborting task - max run time exceeded!" a Treeherder-parseable message

Status: NEW
Product: Taskcluster
Component: Generic-Worker
Opened 9 months ago, updated a month ago
Reporter: philor
Assignee: pmoore
Blocks: 2 bugs

(Reporter)

Description

9 months ago
When a job exceeds the max run time, like with bug 1312255, the logged output is

[taskcluster 2017-01-25T23:31:13.765Z] Aborting task - max run time exceeded!
[taskcluster 2017-01-25T23:31:13.787Z]   Exit Code: 0
[taskcluster 2017-01-25T23:31:13.787Z]   User Time: 0s
[taskcluster 2017-01-25T23:31:13.787Z] Kernel Time: 0s
[taskcluster 2017-01-25T23:31:13.787Z]   Wall Time: 2h54m16.0809961s
[taskcluster 2017-01-25T23:31:13.787Z] Peak Memory: 1687552
[taskcluster 2017-01-25T23:31:13.787Z]      Result: IDLENESS_LIMIT_EXCEEDED

No part of this is a message that Treeherder knows to highlight as a failure (or, equally possible, some part of it is, but the "[taskcluster 2017-01-25T23:31:13.765Z]" prefix isn't something Treeherder knows to strip out). The result is a failure with no suggested bug to star it as, and not even a line to search for bugs mentioning.
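To illustrate the second possibility, here is a minimal sketch of stripping the worker prefix before matching. The prefix shape is taken from the log excerpt above; the function names and the abort pattern are hypothetical and are not Treeherder's actual parser.

```python
import re

# Prefix shape assumed from the log excerpt in this bug:
# "[taskcluster 2017-01-25T23:31:13.765Z] "
WORKER_PREFIX = re.compile(
    r"^\[taskcluster \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z\]\s*"
)

# Hypothetical failure pattern, for illustration only.
MAX_RUN_TIME_ABORT = re.compile(r"^Aborting task - max run time exceeded!")


def strip_worker_prefix(line: str) -> str:
    """Remove a leading "[taskcluster <timestamp>]" marker, if present."""
    return WORKER_PREFIX.sub("", line, count=1)


def is_max_run_time_abort(line: str) -> bool:
    """True if the line, once the prefix is stripped, is the abort message."""
    return MAX_RUN_TIME_ABORT.match(strip_worker_prefix(line)) is not None
```

With something like this, the first line of the excerpt would be recognised, while the "Exit Code" and "Result" lines would not.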
(Reporter)

Updated

4 months ago
Blocks: 1374170
(Assignee)

Updated

4 months ago
Blocks: 1372229
Component: General → Generic-Worker
Pete, this appears to originate with "max run time exceeded", but the IDLENESS_LIMIT_EXCEEDED has confused people into thinking the task ran too long without output. I can't quite make out what that means from the source, but perhaps the message could be changed?
Flags: needinfo?(pmoore)
I ran into this failure not being flagged when working on bug 1380081.

The `IDLENESS_LIMIT_EXCEEDED` part of the result was very confusing, and led a few of us in #taskcluster on a wild goose chase trying to figure out whether the process was shut down due to a hard time limit or due to some sort of "idleness" limit being exceeded.

The answer was that a hard limit had been hit - it would be really awesome if the error message could be cleaned up to make that clear, and shown in the treeherder UI :-).
(Assignee)

Updated

2 months ago
Assignee: nobody → pmoore
Flags: needinfo?(pmoore)
(Assignee)

Comment 3

a month ago
I started a discussion with Ed Morley about logging format here: http://logs.glob.uno/?c=mozilla%23treeherder#c139994

I'll work through the options with Ed to see if we can come to a logging pattern that makes sense. I definitely like having the timestamp in there. It would feel like a shame to move the timestamp out of the square brackets, as it then isn't clear whether the timestamp was added by the worker or is part of the standard out of the task. But we'll clean this up.

The IDLENESS_LIMIT_EXCEEDED message is indeed confusing and inappropriate - I will fix this too. It comes originally from the third-party library we forked, but it is internal state that we shouldn't display. Apologies for the confusion this message caused.

https://github.com/contester/runlib/blob/90fe2e89f927e36e634e8e61cdc3d45b1fd26877/runexe/runexe_results.go#L38
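As a rough sketch of the idea (not generic-worker's actual code, which is not written in Python), the summary line could be built from what the worker itself knows about why it stopped the task, rather than echoing the library's internal verdict. The function name, parameters and output wording here are all hypothetical.

```python
from typing import Optional


def format_result(abort_reason: Optional[str], exit_code: int) -> str:
    """Build the "Result:" line without exposing library-internal state.

    abort_reason is a hypothetical value set by the worker when it kills
    the task itself, e.g. "max run time exceeded"; None means the task
    ran to completion on its own.
    """
    if abort_reason is not None:
        return f"Result: ABORTED ({abort_reason})"
    if exit_code == 0:
        return "Result: SUCCEEDED"
    return f"Result: FAILED (exit code {exit_code})"
```

For the case in this bug, `format_result("max run time exceeded", 0)` would give a line that names the hard limit instead of an "idleness" limit.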
(Assignee)

Comment 4

a month ago
It occurred to me, we probably don't want to auto-handle these failures at all, since there is not a common root cause.

When a task exceeds its maximum run time, this indicates some real problem occurred, or the task wasn't given enough time to run. What went wrong (in the case it isn't a simple not-given-enough-time situation) will be entirely task-specific.

It is trivial to bump the maximum run time of a task, so if that is the cause of a particular failure, it is very quick and easy to resolve. If that is not the root cause, the failure can potentially be anything.

Therefore I'm not sure that it would make sense to have this automatically starred as a general-purpose "task took too long" bug, as that bug would then be associated with a myriad of tasks all performing different operations, and failing for different reasons.

Comment 5

a month ago
I think it would still be useful to make this error appear in the Treeherder error summary. Just because an error appears there doesn't mean it has to match a bug (and in fact there is a blacklist for terms that shouldn't be searched in the bugscache).
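A minimal sketch of that distinction, assuming a simple substring blacklist: every error line is kept for display, but blacklisted ones get no bug search term. The blacklist entry and function names are hypothetical, and Treeherder's real blacklist may work differently.

```python
from typing import List, Optional, Tuple

# Hypothetical blacklist entry, for illustration only.
SEARCH_BLACKLIST = {"max run time exceeded"}


def bug_search_term(error_line: str) -> Optional[str]:
    """Return the term to look up in the bug cache, or None to skip the search."""
    lowered = error_line.lower()
    if any(term in lowered for term in SEARCH_BLACKLIST):
        return None
    return error_line


def summarize(error_lines: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each displayed error line with its (possibly absent) search term."""
    return [(line, bug_search_term(line)) for line in error_lines]
```

Under this sketch the abort message would still show up in the summary, just without any suggested bugs attached.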