Frequently failing jobs ending up as claim expired / worker shutdown / intermittent task
Categories: Tree Management :: Treeherder, defect
Tracking: Not tracked
People: Reporter: tszentpeteri, Assigned: ahal
Whiteboard: [stockwell disable-recommended]
Attachments: 1 file
Comment 1•1 month ago
Update:
@relsre was pinged to check the workers. It sounds like many jobs are failing due to a potential OOM issue when the workers run the tests. One of the logs is this one: https://firefox-ci-tc.services.mozilla.com/tasks/YeWJdYO0St-EOco2LYldEg/runs/0/logs/public/logs/live.log You should be able to see the worker node info at the beginning of the log, and the error towards the bottom of the log.
I'm seeing many claim-expired exceptions for this worker pool, well before the services release.
Seeing tons of errors at https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-t%2Ft-linux-xlarge-noscratch-gcp/errors:
The zone 'projects/fxci-production-level1-workers/zones/us-central1-b' does not have enough resources available to fulfill the request. '(resource type:compute)'.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 5•1 month ago
This is more likely to be spot preemption than OOM; sometimes we detect it and retry the task automatically, but not always.
Comment 6•1 month ago
:yarik and :mboris,
Yes, this does seem to be due to spot instance shutdowns. More details about this in https://mozilla.slack.com/archives/CKBFXRD1T/p1727982186900009.
This doesn't seem to be a new thing, but it seems to be hitting autoland jobs hard and the sheriffs are noticing.
Could worker-manager consider preemption shutdown events (I think right now it only considers instances that couldn't be launched due to lack of capacity) and try different zones, or track which zones are having issues and use others?
I guess we could attack this at the taskgraph layer by having autoland jobs go to a non-spot worker pool.
We can add more zones to the configs, but I'm not sure it will help.
Thoughts?
Thanks!
Comment 7•1 month ago
Adding zones would be beneficial, but so would adding additional instance types that are sufficient to run these tasks on (in the hope that the new instance types aren't in as high demand, so fewer spot terminations). The other option, as you mentioned, Andy, is going with an on-demand instance type.
The worker manager specific question is good for Yarik.
Comment hidden (Intermittent Failures Robot)
Comment 9•1 month ago
Yeah, I think we need to include more regions/zones to avoid this.
At the moment workers do not communicate preemption events back to worker-manager, so it doesn't know what is happening on that side.
We could monitor task exceptions with the worker-shutdown reason, but I'm not sure how we would use that number if we started tracking it.
In the feature that is being developed, worker-manager would be able to react to quota errors by trying to provision less in that region and preferring other regions for some time; a rough sketch of that idea follows.
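To illustrate the shape of that behaviour (this is only a sketch under assumptions, not worker-manager's actual implementation; the class name, penalty window, and zone list below are all hypothetical): a zone that returns quota/resource errors gets temporarily deprioritized, and other zones are preferred until the penalty expires.

```python
# Rough sketch only: illustrates "provision less in a failing region and prefer
# others for some time". Names and numbers here are hypothetical.
import time

PENALTY_SECONDS = 15 * 60  # back a zone off for 15 minutes after a quota error


class ZonePreferences:
    def __init__(self, zones):
        self.zones = list(zones)
        self._penalized_until = {}  # zone -> unix timestamp until which it is deprioritized

    def record_quota_error(self, zone):
        # Called when provisioning in `zone` fails with "not enough resources".
        self._penalized_until[zone] = time.time() + PENALTY_SECONDS

    def pick_zone(self):
        # Prefer zones that are not currently penalized; if every zone is backed off,
        # fall back to the one whose penalty expires soonest.
        now = time.time()
        healthy = [z for z in self.zones if self._penalized_until.get(z, 0) <= now]
        if healthy:
            return healthy[0]
        return min(self.zones, key=lambda z: self._penalized_until.get(z, 0))


prefs = ZonePreferences(["us-central1-a", "us-central1-b", "us-central1-c"])
prefs.record_quota_error("us-central1-b")
print(prefs.pick_zone())  # -> "us-central1-a"
```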
Comment hidden (Intermittent Failures Robot)
Comment 11•1 month ago
This is severely impacting the trees, so a solution here would be greatly appreciated. https://treeherder.mozilla.org/jobs?repo=mozilla-central&group_state=expanded&resultStatus=exception&fromchange=b48e31d47d1f562424c7693ad93f74ed39251edb&tochange=ba301423863e02a50279faf39bb566fa3945007
Comment 12•1 month ago
Tracking work to add more zones/regions for these workers in https://mozilla-hub.atlassian.net/browse/RELOPS-1103.
Comment 13•1 month ago
I don't think we should add more zones/regions here.
What I suspect might be happening is that since the deploy of tc 72.0.1 (just a few hours before this bug was filed), Treeherder doesn't classify tasks as retried, and thus they bubble up to the sheriffs. That deploy included the fix for https://github.com/taskcluster/taskcluster/issues/7174, which changed how/when tc sends pulse messages for retried tasks, so that looks like a likely cause of the increase in tasks that show up as "exception" instead of "retry".
Assignee
Comment 14•1 month ago
So currently Treeherder looks back at runId - 1 when it gets the event for reruns:
https://github.com/mozilla/treeherder/blob/fb59b00868b6d90083531891beca53a477107403/treeherder/etl/taskcluster_pulse/handler.py#L297-L300
But now, it's going to need to look forward at runId + 1 when it gets task-exception events. So basically, when you get a task-exception event, inspect runId + 1. If it has reasonCreated = "retry" (or task-retry?), then classify the current run as retry. Otherwise classify the current run as exception.
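A minimal sketch of that forward-looking check (not the actual Treeherder handler code; the helper name is hypothetical, and the field names assume the Taskcluster task status payload shape, i.e. a runs list whose entries carry reasonCreated / reasonResolved):

```python
# Sketch of the proposed "look forward" classification. Hypothetical helper,
# not the real handler.py code.

def classify_exception_run(task_status: dict, run_id: int) -> str:
    """Classify a run that ended in task-exception as 'retry' or 'exception'."""
    runs = task_status.get("runs", [])
    next_run = runs[run_id + 1] if run_id + 1 < len(runs) else None
    # If Taskcluster already created a follow-up run as an automatic retry,
    # surface the current run as "retry" so it doesn't bubble up to sheriffs.
    if next_run and next_run.get("reasonCreated") in ("retry", "task-retry"):
        return "retry"
    return "exception"


# Example: run 0 was preempted (worker-shutdown) and Taskcluster scheduled run 1 as a retry.
status = {
    "runs": [
        {"runId": 0, "state": "exception", "reasonResolved": "worker-shutdown"},
        {"runId": 1, "state": "pending", "reasonCreated": "retry"},
    ]
}
assert classify_exception_run(status, 0) == "retry"
```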
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 17•1 month ago
I don't have a treeherder dev environment set up, but I'll take an initial stab at a patch.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 23•28 days ago
Comment hidden (Intermittent Failures Robot)
Comment 25•27 days ago
Comment hidden (Intermittent Failures Robot)