Open Bug 1769298 Opened 3 years ago Updated 3 years ago

Long-running job incorrectly described as not starting

Categories

(Taskcluster :: General, defect)

Tracking

(Not tracked)

People

(Reporter: sfink, Unassigned)

Details

I attempted to speed up the hazard job and pushed the changes to try. My optimization attempt failed, and I'm guessing the jobs there hit a timeout. The interface shows multiple blue jobs, each one labeled "Duration: Not started (queued for 145 minutes)". But those jobs did start and ran for quite a while, then apparently gave up and started a new job. While they're running, I can see their live log updating properly. After they turn blue, clicking on live.log returns an error, which is unfortunate because it would have told me which compile is taking so long. (In this case I know, since it's bug 1767612, but often I don't.)

This might be because they're build jobs, and the thing that's taking a long time is a single compile?

Hm... come to think of it, this is probably a Taskcluster bug, given that the retry is triggering.

Component: Treeherder → General
Product: Tree Management → Taskcluster
Version: --- → unspecified

Actually this might be two bugs: (1) they should not be getting retried, and (2) the log file should be accessible. I'm not sure what component the latter issue would be in, but maybe it wouldn't matter if they weren't getting retried?

...now I'm no longer sure what's going on. The 4th attempt succeeded in only 69 minutes, which suggests that my earlier changes did work and that the earlier attempts ran on machines that were shut down and reclaimed, so everything is working as expected?

If so, then perhaps it's a Treeherder problem after all: it still shouldn't say the job wasn't started, since I know it was. (That applies to all 3 of the blue ones.)

The worker executing the task stops responding / has been lost (spot instance terminated by cloud service provider, high memory consumption caused the operating system to perform actions which effectively halted the task execution, ...).

Would you regard Duration: unknown (communication with worker lost) as helpful?

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #3)

> The worker executing the task stops responding / has been lost (spot instance terminated by cloud service provider, high memory consumption caused the operating system to perform actions which effectively halted the task execution, ...).
>
> Would you regard Duration: unknown (communication with worker lost) as helpful?

Yes, that would be much better.

It's the "Not started" part that I find confusing, especially since most of the time I'm checking a slow job on the following day and so have no way to know that it ever started.

I would also like the logs to be available. I'm no longer sure whether these jobs should be retried or not, since you pointed out that this could be the result of running the machine out of memory or something. I was assuming that they should be timing out, but OOM is definitely possible.
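
To make the proposed wording concrete, here is a minimal sketch of how a label could be chosen so that a run which did start is never reported as "Not started". This is not Treeherder's actual code. The function name duration_label, its parameters, and the set of reasons treated as "worker lost" are all assumptions for illustration; the field names are only loosely modeled on what a task run status might report.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: resolution reasons that mean the worker went away mid-run.
WORKER_LOST_REASONS = {"claim-expired", "worker-shutdown"}


def duration_label(
    state: str,
    reason_resolved: Optional[str],
    scheduled: datetime,
    started: Optional[datetime],
    resolved: Optional[datetime],
    now: datetime,
) -> str:
    """Pick a duration label for a single run (hypothetical helper)."""
    if started is None:
        # No worker ever picked up the run, so "Not started" is accurate.
        queued = int(((resolved or now) - scheduled).total_seconds() // 60)
        return f"Duration: Not started (queued for {queued} minutes)"

    if state == "exception" and reason_resolved in WORKER_LOST_REASONS:
        # The run started, but the worker disappeared before reporting back,
        # so the real running time cannot be trusted.
        return "Duration: unknown (communication with worker lost)"

    if resolved is None:
        # Started and not yet resolved: report elapsed time so far.
        elapsed = int((now - started).total_seconds() // 60)
        return f"Duration: running for {elapsed} minutes"

    minutes = int((resolved - started).total_seconds() // 60)
    return f"Duration: {minutes} minutes"


if __name__ == "__main__":
    t0 = datetime(2000, 1, 1, 0, 0)  # arbitrary timestamp for the demo
    print(
        duration_label(
            "exception",
            "claim-expired",
            scheduled=t0,
            started=t0 + timedelta(minutes=2),
            resolved=t0 + timedelta(minutes=145),
            now=t0 + timedelta(days=1),
        )
    )
```

The key point of the sketch is the ordering of the checks: whether the run ever started is decided first, and only runs that never started get the "Not started" label.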
