Open Bug 1866612 Opened 1 year ago Updated 11 months ago

Intermittent linux asan jsreftest jobs fail as exceptions with "claim_expired" after multiple retries

Categories

(Taskcluster :: Workers, defect)

Tracking

(Not tracked)

People

(Reporter: CosminS, Unassigned)

References

(Regression)

Details

(Keywords: intermittent-failure, regression)

First noticed here: https://treeherder.mozilla.org/jobs?repo=autoland&revision=7dc6122ebd68b46a9d4320c40c86f20c2b78984a&selectedTaskRun=ZkyCwM6BR-eR9O9vxqOEfg.5&searchStr=Linux%2C18.04%2Cx64%2CWebRender%2Casan%2Copt%2CReftests%2Ctest-linux1804-64-asan-qr%2Fopt-jsreftest%2CJ1. It probably comes from the changes in Bug 1847258.
The jobs keep retrying until they end up as an exception with claim_expired.
A similar issue for browser-chrome tests, Bug 1859204, was fixed by increasing the RAM of the machines in https://hg.mozilla.org/integration/autoland/rev/fb7b6fc608af4116fcc99a1003dfafc5bb78818a.

Flags: needinfo?(dpalmeiro)

Set release status flags based on info from the regressing bug 1847258

Bug 1865910 only affects code inside #ifdef JS_ION_PERF, which is off in all CI builds.

Flags: needinfo?(mstange.moz)

Added some retriggers in this range in case it could be from changes in Bug 1852098.

On initial inspection, this appears to be either:

  1. a Docker Worker issue, or
  2. related to https://github.com/taskcluster/taskcluster/issues/6682

If it is 1) this should be resolved when Docker Worker workers have been replaced with Generic Worker workers in fxci (Docker Worker is no longer supported)
@aerickson - do you have a bug/issue tracking that work?

If this is 2), the SRE team are investigating excessive HTTP 502 errors which seem to be the root cause of many of the claim expired issues we have been seeing.
@wezhou - do you have a bug/issue tracking that work?

Flags: needinfo?(wezhou)
Flags: needinfo?(aerickson)

(In reply to Pete Moore [:pmoore][:pete] from comment #6)

> If it is 1) this should be resolved when Docker Worker workers have been replaced with Generic Worker workers in fxci (Docker Worker is no longer supported)
> @aerickson - do you have a bug/issue tracking that work?

We're tracking that in https://mozilla-hub.atlassian.net/browse/RELOPS-528.

Flags: needinfo?(aerickson)

(In reply to Pete Moore [:pmoore][:pete] from comment #6)

> If this is 2), the SRE team are investigating excessive HTTP 502 errors which seem to be the root cause of many of the claim expired issues we have been seeing.
> @wezhou - do you have a bug/issue tracking that work?

Here is the ticket: https://mozilla-hub.atlassian.net/browse/SVCSE-1609

Flags: needinfo?(wezhou)
Component: JavaScript Engine: JIT → Workers
Product: Core → Taskcluster
Flags: needinfo?(dpalmeiro)
Summary: Frequent linux asan jsreftest jobs fail as exceptions with "claim_expired" after multiple retries → Intermittent linux asan jsreftest jobs fail as exceptions with "claim_expired" after multiple retries