Open Bug 1859204 Opened 4 months ago Updated 2 months ago

Frequent Linux 18.04 x64 WebRender asan opt browser-chrome jobs fail as exceptions with "claim_expired"

Categories

(Taskcluster :: Workers, defect)

defect

Tracking

(Not tracked)

REOPENED

People

(Reporter: imoraru, Unassigned)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: intermittent-failure, regression)

I think this was somehow regressed by the changes from bug 1855321.

  • backfill range and retriggers
    From the backfill range and retriggers we can see that this appeared when the bug first landed; the failures disappeared when it was backed out and reappeared when the bug re-landed.

Hi Chris! Can you please take a look at this?
Thank you!

The Bugbug bot thinks this bug should belong to the 'Core::Graphics: WebRender' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Graphics: WebRender
Product: WebExtensions → Core

:cfallin, since you are the author of the regressor, bug 1855321, could you take a look?

For more information, please visit BugBot documentation.

Flags: needinfo?(chris)
Component: Graphics: WebRender → Untriaged
Product: Core → WebExtensions

:cfallin, since you are the author of the regressor, bug 1855321, could you take a look?

Unfortunately, no, I can't: the link to the log tells me

NetworkError when attempting to fetch resource.
An error occurred attempting to load the provided log.
Please check the URL and ensure it is reachable.

Without logs, I can't really do much more.

In any case, my change was (i) ifdef'd out by default, (ii) limited to the JavaScript engine, and (iii) unrelated to software WebRender. I would be very surprised if it were related. Any more information showing how it is related (and, ideally, a way for me to reproduce) would be helpful!

Flags: needinfo?(chris)

Yes, unfortunately this kind of failure does not have a log. I was surprised to see that the backfills pointed at that bug as the culprit; that is why I asked whether it could somehow cause this.

Perhaps Aryx or Luca will have additional insight into this.

Flags: needinfo?(lgreco)
Flags: needinfo?(aryx.bugmail)

I think this is a new instance of bug 1759288.

Blocks: 1759288

I'm not sure I can add much: trying to look at the failures linked from comment 5 didn't work (the Treeherder view doesn't seem to load for them), and the bug Jan linked in comment 6 suggests this may be an infra issue, so it may not even be a WebExtensions test failure that is actually being hit.

Clearing my pending needinfo for now, but feel free to add it back if we get a link to some WebExtensions test failures or some other issue that looks like it's on the WebExtensions side of things.

Flags: needinfo?(lgreco)
Component: Untriaged → Workers
Product: WebExtensions → Taskcluster

This is becoming more and more frequent and has now reached our disable-recommended queue.
Pete, can you help us redirect this bug?
Thank you.

Flags: needinfo?(pmoore)

The recent failures are for tasks which ran the tests listed in browser/components/sessionstore/test/browser.toml. There have been no modifications to that test folder in the last week apart from ESLint changes, while the frequent failures started 3 days ago.

Andreas, could you investigate? One of the tests causes the test machine to become unresponsive (often from an OOM-like situation), and it stops communicating with the worker manager.

Flags: needinfo?(pmoore) → needinfo?(afarre)

Do you have a link to logs where this happens? If the only lead is that it happens in browser/components/sessionstore/test/browser.toml, I have no real place to start. Is there a way I can reproduce this? If so, is there a way to chunk this particular folder to see if that reproduces it?

Flags: needinfo?(afarre) → needinfo?(aryx.bugmail)

(In reply to Andreas Farre [:farre] from comment #18)

Do you have a link to logs where this happens? If the only lead is that it happens in browser/components/sessionstore/test/browser.toml, I have no real place to start. Is there a way I can reproduce this? If so, is there a way to chunk this particular folder to see if that reproduces it?

There are no logs available because the worker becomes unresponsive, stops communicating with the Taskcluster instance, and does not upload logs.
A Try push which only requests this folder should reproduce the issue. There are many tests in the manifest, and I hope a domain expert can identify what causes the issue. Interactive workers fail to run any browser-chrome tests on Linux ASan.

Flags: needinfo?(aryx.bugmail)

Similar to bug 1863773, the timeout happens after the last test in the file has finished.

See Also: → 1863673

The issue has spread: 5 of 16 Linux ASan browser-chrome chunks now fail or are likely to fail. Interactive debugging is not possible because of bug 1862426.

cc yarik - although I would be surprised if increasing RAM helps.

Flags: needinfo?(ykurmyza)

I think there are multiple causes of claim_expired:

  1. Workers dying for an unknown reason while running a task
  2. Worker-runner <> worker miscommunication (the worker shutting itself down while asking for more work)
  3. Network responses that never reach workers (which was discovered last week)

I believe that increasing RAM might help with the first case, if OOM errors were happening there (we've seen quite a lot of similar issues on CommunityTC in fuzzing).
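All three failure modes above end the same way at the queue: the worker stops renewing its claim on the task before the claim's deadline, so the run is resolved as an exception with reason "claim-expired". A minimal illustrative sketch of that timing mechanism (not the real Taskcluster queue implementation; the class, method names, and the short reclaim window are made up for this example):

```python
import time

RECLAIM_WINDOW = 0.2  # seconds; hypothetical, real claims last far longer


class Run:
    """Toy model of one task run as the queue sees it."""

    def __init__(self):
        self.taken_until = time.monotonic() + RECLAIM_WINDOW
        self.state = "running"
        self.reason = None

    def reclaim(self):
        # A healthy worker calls this periodically to extend its deadline.
        if time.monotonic() > self.taken_until:
            raise RuntimeError("claim-expired")
        self.taken_until = time.monotonic() + RECLAIM_WINDOW

    def resolve_if_expired(self):
        # Queue side: an unresponsive worker (OOM, dead VM, lost network)
        # never reclaims, so the run becomes an exception.
        if self.state == "running" and time.monotonic() > self.taken_until:
            self.state = "exception"
            self.reason = "claim-expired"


run = Run()
run.reclaim()                    # worker alive: deadline extended
time.sleep(RECLAIM_WINDOW * 2)   # worker hangs and never reclaims again
run.resolve_if_expired()
print(run.state, run.reason)     # exception claim-expired
```

Note that the queue cannot distinguish the three causes: whether the worker died, shut itself down, or lost network, all it observes is a missed reclaim, which is why these failures surface without logs.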

Flags: needinfo?(ykurmyza)
See Also: → 1866612

No more failures here since the RAM increase. For the jsreftest exceptions I've filed bug 1866612.

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED

This kind of failure has resurfaced here. Both jobs run on t-linux-large-gcp machines and are mochitest-plain.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [stockwell disable-recommended]