Closed Bug 1922641 Opened 1 month ago Closed 27 days ago

Frequently failing jobs ending in claim expired / worker shutdown / intermittent task

Categories

(Tree Management :: Treeherder, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tszentpeteri, Assigned: ahal)

Details

(Whiteboard: [stockwell disable-recommended])

Attachments

(1 file)

Update:
@relsre was pinged to check the workers. It sounds like many jobs are failing due to a potential OOM issue when the workers run the tests. One of the logs is this one: https://firefox-ci-tc.services.mozilla.com/tasks/YeWJdYO0St-EOco2LYldEg/runs/0/logs/public/logs/live.log. You should be able to see the worker node info at the beginning of the log and the error towards the bottom of the log.

I’m seeing many claim expired exceptions for this worker pool, well before the services release.
Seeing tons of errors at https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-t%2Ft-linux-xlarge-noscratch-gcp/errors:
The zone 'projects/fxci-production-level1-workers/zones/us-central1-b' does not have enough resources available to fulfill the request. '(resource type:compute)'.

Flags: needinfo?(aerickson)

This is more likely to be spot preemption than OOM; sometimes we detect it and retry the task automatically, but not always.

:yarik and :mboris,

Yes, this does seem to be due to spot instance shutdowns. More details about this in https://mozilla.slack.com/archives/CKBFXRD1T/p1727982186900009.

This doesn't seem to be a new thing, but it seems to be hitting autoland jobs hard and the sheriffs are noticing.

Could worker-manager take preemption shutdown events into account (I think right now it only considers whether instances couldn't be launched due to lack of capacity) and try different zones, or track which zones are having issues and use others?

I guess we could attack this at the taskgraph layer by having autoland jobs go to a non-spot worker pool.

We can add more zones to the configs, but I'm not sure it will help.

Thoughts?

Thanks!

Flags: needinfo?(ykurmyza)
Flags: needinfo?(mboris)
Flags: needinfo?(aerickson)

Adding zones would be beneficial, but so would adding additional instance types that are sufficient to run these tasks on (in the hope that the new instance types aren't in as high demand, resulting in fewer spot terminations). The other option, as you mentioned Andy, is going with an on-demand instance type.

The worker manager specific question is good for Yarik.

Flags: needinfo?(mboris)

Yeah, I think we need to include more regions/zones to avoid this.
At the moment workers do not communicate preemption events back to worker manager, so it doesn't know what is happening on that side.
We could monitor task exceptions with the worker-shutdown reason, but I'm not sure how we would use that number if we started tracking it.
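
(For reference, a rough sketch of what counting those exceptions from the outside could look like, assuming the standard taskcluster Python client against the firefox-ci deployment; where the task IDs come from, e.g. a Treeherder or index query, is left out here.)

```python
# Hypothetical sketch, not existing tooling: count runs resolved as
# "worker-shutdown" for a set of task IDs via the Taskcluster Queue API.
import taskcluster

queue = taskcluster.Queue(
    {"rootUrl": "https://firefox-ci-tc.services.mozilla.com"}
)

def count_worker_shutdowns(task_ids):
    shutdowns = 0
    for task_id in task_ids:
        # Each run in the status carries state and reasonResolved.
        runs = queue.status(task_id)["status"].get("runs", [])
        shutdowns += sum(
            1
            for run in runs
            if run.get("state") == "exception"
            and run.get("reasonResolved") == "worker-shutdown"
        )
    return shutdowns
```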

In the feature that is being developed, worker manager will be able to react to quota errors by trying to provision less in that region and preferring other regions for some time.

Flags: needinfo?(ykurmyza)

Tracking work to add more zones/regions for these workers in https://mozilla-hub.atlassian.net/browse/RELOPS-1103.

I don't think we should add more zones/regions here.

What I suspect might be happening is that, since the deploy of tc 72.0.1 (just a few hours before this bug was filed), Treeherder doesn't classify tasks as retried, so they bubble up to the sheriffs. That deploy included the fix for https://github.com/taskcluster/taskcluster/issues/7174, which changed how/when tc sends pulse messages for retried tasks, so that looks like a likely cause of the increase in tasks that show up as "exception" instead of "retry".

Component: Workers → Treeherder
Product: Taskcluster → Tree Management
Version: unspecified → ---

So currently Treeherder looks back at runId - 1 when it gets the event for reruns:
https://github.com/mozilla/treeherder/blob/fb59b00868b6d90083531891beca53a477107403/treeherder/etl/taskcluster_pulse/handler.py#L297-L300

But now it's going to need to look forward at runId + 1 when it gets task-exception events. So basically, when you get a task-exception event, inspect runId + 1: if it has reasonCreated = "retry" (or task-retry?), classify the current run as retry; otherwise classify the current run as exception.
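
A minimal sketch of that lookup, as a hypothetical helper rather than the actual Treeherder patch; it assumes the handler has the Taskcluster task status (with its full runs list) available when the task-exception message arrives:

```python
# Hypothetical helper, not the actual Treeherder code. `task_status` is the
# Taskcluster status object from the pulse message, whose "runs" entries
# carry "reasonCreated".

def classify_exception_run(task_status, run_id):
    """Classify a run that ended in exception.

    If a later run (run_id + 1) exists and was created as a retry, report
    the current run as "retry" so it doesn't surface to sheriffs; otherwise
    keep it as "exception".
    """
    runs = task_status.get("runs", [])
    next_run = runs[run_id + 1] if run_id + 1 < len(runs) else None
    if next_run and next_run.get("reasonCreated") in ("retry", "task-retry"):
        return "retry"
    return "exception"
```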

I don't have a treeherder dev environment set up, but I'll take an initial stab at a patch.

Assignee: nobody → ahal
Status: NEW → ASSIGNED