Open Bug 1809667 Opened 2 years ago Updated 21 hours ago

[meta] Intermittent [taskcluster:error] Task aborted - max run time exceeded

Categories

(Firefox Build System :: Task Configuration, defect)

defect

Tracking

(Not tracked)

REOPENED

People

(Reporter: aryx, Assigned: jmaher)

References

(Depends on 6 open bugs)

Details

(Keywords: intermittent-failure, leave-open, meta, Whiteboard: [stockwell infra][stockwell needswork:owner])

Attachments

(2 files, 2 obsolete files)

This is a meta bug for tasks which fail to complete within the set time limit. Until a specific issue has been identified, failures will be classified against this bug.

Bug 1589796 was the old bug tracking this, but we wanted to start fresh without the activity from all the dependent bugs that have already been resolved.
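
For context, the error in the summary comes from the worker enforcing the task's run-time limit. A minimal, illustrative task payload (all values below are made up, not taken from any task in this bug) showing where that limit lives:

```python
# Illustrative sketch only, not a real task definition from this bug.
# The worker enforces maxRunTime (in seconds) and aborts the task once
# it is exceeded, producing the error this meta bug tracks:
# "[taskcluster:error] Task aborted - max run time exceeded"
task_payload = {
    "image": "example/test-image:latest",              # hypothetical image
    "command": ["/bin/bash", "-c", "./run-tests.sh"],  # hypothetical command
    "maxRunTime": 5400,                                # 90 minutes
}
```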

Summary: Intermittent [taskcluster:error] Task aborted - max run time exceeded → [meta] Intermittent [taskcluster:error] Task aborted - max run time exceeded
Assignee: nobody → afinder
Severity: -- → S3
Priority: -- → P2

This is a meta bug; I assume you most likely wanted to assign yourself to bug 1809652?

Assignee: afinder → nobody
Severity: S3 → --
Priority: P2 → --
Whiteboard: [stockwell disable-recommended]
Depends on: 1813390
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
See Also: → 1821918
Depends on: 1821945
See Also: → 1824521
See Also: → 1825956

Update:

There have been 60 failures within the last 7 days:

  • 25 failures on Android 11.0 Samsung A51 AArch64 Shippable opt
  • 2 failures on Android 13.0 Pixel5 AArch64 Shippable opt
  • 1 failure on OS X 10.15 WebRender debug
  • 9 failures on Windows 11 x86 22H2 WebRender debug
  • 3 failures on Windows 11 x64 22H2 asan WebRender opt
  • 3 failures on Windows 11 x64 22H2 CCov WebRender opt
  • 15 failures on Windows 11 x64 22H2 WebRender opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=415684493&repo=autoland&lineNumber=2605

Andrew, do you know if someone is actively working on this?
Thank you.

Flags: needinfo?(ahal)
Whiteboard: [stockwell infra] → [stockwell infra][stockwell needswork:owner]

(In reply to Natalia Csoregi [:nataliaCs] from comment #65)

> [...]
> Andrew, do you know if someone is actively working on this?
> Thank you.

You basically want to check which test suites are failing, and as can be seen it's mostly browsertime. So I'm forwarding the needinfo to Greg.

Flags: needinfo?(ahal) → needinfo?(gmierz2)

I'm seeing a lot of issues related to fetching artifacts, such as this one (I saw another task that was having issues with our internal PyPI mirror too):

[task 2023-05-13T20:21:21.840Z] 20:08:02     INFO -  raptor-mitmproxy Info: downloading certutil binary (hostutils)
[task 2023-05-13T20:21:21.840Z] 20:08:02     INFO -  raptor-mitmproxy Info: downloading: https://hg.mozilla.org/integration/autoland/raw-file/c2e4de2178a51678ed49f1151b972c0bc96ac9fd/testing/config/tooltool-manifests/linux64/hostutils.manifest to /builds/task_168400695744236/workspace/testing/mozproxy/hostutils.manifest
[task 2023-05-13T20:21:21.840Z] 20:08:03     INFO -  raptor-mitmproxy Info: b'INFO - File host-utils-108.0a1.en-US.linux-x86_64.tar.gz not present in local cache folder /builds/tooltool_cache'
[task 2023-05-13T20:21:21.840Z] 20:08:03     INFO -  raptor-mitmproxy Info: b"INFO - Attempting to fetch from 'http://localhost:8099/tooltool.mozilla-releng.net/'..."
[task 2023-05-13T20:21:21.840Z] 20:21:18     INFO -  raptor-mitmproxy Info: b'INFO - File host-utils-108.0a1.en-US.linux-x86_64.tar.gz fetched from http://localhost:8099/tooltool.mozilla-releng.net/ as /builds/task_168400695744236/workspace/testing/mozproxy/tmptabn_if0'
[task 2023-05-13T20:21:21.840Z] 20:21:19     INFO -  raptor-mitmproxy Info: b'INFO - File integrity verified, renaming tmptabn_if0 to host-utils-108.0a1.en-US.linux-x86_64.tar.gz'

It looks like it has stopped now, though (the last batch of failures was on the 13th). I'll redirect to :aerickson if this starts again.
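
The stall above shows up as a ~13-minute gap between adjacent log timestamps (20:08:03 to 20:21:18 for the tooltool fetch). A rough sketch for surfacing such gaps in a downloaded task log (the file name and threshold are assumptions, not anything specified in this bug):

```python
import re
from datetime import datetime, timedelta

# Match the mozharness timestamp after the "[task ...]" prefix, e.g.
# "[task 2023-05-13T20:21:21.840Z] 20:08:02     INFO - ..."
TS = re.compile(r"\]\s+(\d{2}:\d{2}:\d{2})\s+INFO")

def find_stalls(lines, threshold=timedelta(minutes=5)):
    """Yield (gap, line_before, line_after) for large timestamp jumps.

    Ignores midnight wrap-around; good enough for eyeballing one log.
    """
    prev_time = prev_line = None
    for line in lines:
        m = TS.search(line)
        if not m:
            continue
        t = datetime.strptime(m.group(1), "%H:%M:%S")
        if prev_time is not None and t - prev_time >= threshold:
            yield t - prev_time, prev_line, line
        prev_time, prev_line = t, line

with open("live_backing.log") as f:  # a locally saved copy of the task log
    for gap, before, after in find_stalls(f):
        print(f"{gap} gap:\n  {before.strip()}\n  {after.strip()}")
```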

Flags: needinfo?(gmierz2)
Severity: -- → N/A

Update

There was a total of 66 failures in the last 7 days:

  • 1 failure on Linux 18.04 x64 WebRender Shippable opt
  • 2 failures on OS X 10.15 WebRender debug
  • 5 failures on OS X 10.15 WebRender Shippable opt
  • 2 failures on Windows 10 x64 WebRender Shippable opt
  • 9 failures on Windows 11 x86 22H2 WebRender debug
  • 10 failures on Windows 11 x64 22H2 asan WebRender opt
  • 37 failures on Windows 11 x64 22H2 WebRender debug/opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=417959757&repo=mozilla-central&lineNumber=118617

We had 70 occurrences in the past 7 days.

Depends on: 1858236

Looks like something in browser-chrome jobs on Windows debug has pushed them over the edge regarding max run time. The spike is here: https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2023-10-06&endday=2023-10-13&tree=trunk&failurehash=all&bug=1809667 and in https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2023-10-06&endday=2023-10-13&tree=trunk&failurehash=all&bug=1799052.
As a timeline, I noticed here could be a starting point: the top push has 4 jobs hitting the max run time, the bottom one none. Backfills are inconclusive, and I noticed the jobs were already running close to the 80-minute mark, so something just made them topple over. Now they're constantly failing on tree.
Filed Bug 1859059 as a band-aid fix to increase the number of chunks from 7 to 8 for now.
Joel, could you investigate this further? Feel free to revert the chunk number if it's not the way to go. Thank you.

LE: Looks like Linux Wayland is also affected, e.g.
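
For context on why an extra chunk helps: test manifests are distributed across chunks, and the most loaded chunk determines whether a job brushes against its max run time. A toy model of that effect (this is not the actual taskgraph chunking code, and all runtimes are made up):

```python
import heapq

def slowest_chunk(manifest_minutes, chunks):
    """Greedy longest-first assignment: each manifest goes onto the
    currently lightest chunk; returns the heaviest chunk's total."""
    loads = [0.0] * chunks
    heapq.heapify(loads)
    for m in sorted(manifest_minutes, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + m)
    return max(loads)

# Made-up per-manifest runtimes in minutes (~154 minutes of work total).
runtimes = [22, 18, 15, 14, 12, 11, 10, 9, 8, 8, 7, 6, 5, 5, 4]
print(slowest_chunk(runtimes, 7))  # 24.0 -> slowest chunk with 7 chunks
print(slowest_chunk(runtimes, 8))  # 22.0 -> one more chunk shaves the worst case
```

Note that with real runtimes a single long manifest can dominate the slowest chunk no matter how many chunks exist, which is why splitting or disabling the manifest itself (as done later in this bug) is sometimes the only fix.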

Flags: needinfo?(jmaher)
See Also: → 1859059

Two patterns here:

  1. Windows browser-chrome MSIX: one long-running manifest.
  2. mochitest-plain on Linux opt: increase the total chunks.
Flags: needinfo?(jmaher)
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a8f92dd682f7 reduce win/msix and linux/mochitest task timeouts. r=aryx
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 120 Branch

Hi, we still have this: https://treeherder.mozilla.org/logviewer?job_id=433219837&repo=autoland and this: https://treeherder.mozilla.org/logviewer?job_id=433223927&repo=mozilla-central. Are these included in this bug? Or should we create another bug for them?

Flags: needinfo?(jmaher)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 120 Branch → ---

I only fixed a few cases here; I assume what I fixed will reduce this, but we might need another visit next week.

Flags: needinfo?(jmaher)
See Also: → 1863344

Comment on attachment 9359467 [details]
Bug 1809667 - reduce win/msix and linux/mochitest task timeouts. r=aryx!

Beta/Release Uplift Approval Request

  • User impact if declined: n/a
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): not risky, just fixes CI to get tasks green again.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9359467 - Flags: approval-mozilla-beta?
Duplicate of this bug: 1863344

:jmaher, I think this patch is already in beta; is there something else we have to uplift?

Flags: needinfo?(jmaher)

OK, let me move back to bug 1863344; this seems like the same problem, yet under different circumstances, so a different solution is needed.

Flags: needinfo?(jmaher)

Comment on attachment 9359467 [details]
Bug 1809667 - reduce win/msix and linux/mochitest task timeouts. r=aryx!

Already in beta (Fx120); see comment 167 & comment 168.

Attachment #9359467 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
No longer duplicate of this bug: 1863344
See Also: → 1870500

There are a lot of Windows ASAN timeouts for browser-chrome tests. Maybe these need an extra chunk?

Flags: needinfo?(jmaher)

I am submitting a Phabricator request to disable toolkit/components/antitracking/test/browser/browser-blocking.toml on debug due to its 20+ minute runtimes on all platforms.

I would prefer that the anti-tracking team make these tests faster. Keeping coverage on opt/shippable should help reduce the risk.

Flags: needinfo?(jmaher)
Keywords: leave-open
Depends on: 1884982

Comment on attachment 9390805 [details]
Bug 1809667 - disable toolkit/components/antitracking/test/browser/browser-blocking on debug due to extremely long runtime. r=aryx

Revision D204382 was moved to bug 1884982. Setting attachment 9390805 [details] to obsolete.

Attachment #9390805 - Attachment is obsolete: true

Joel, there are also a lot of Windows debug mochitest-browser-chrome-7 jobs that are timing out. Maybe you could have a look at those as well?

Flags: needinfo?(jmaher)

Henrik, those are the same issues with anti-tracking; I expect the volume to drop starting today.

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #248)

> Henrik, those are the same issues with anti-tracking; I expect the volume to drop starting today.

I don't think so. I had a look at some of these timing-out jobs, and the test that you disabled doesn't appear in the logs. So I'm fairly sure something else is causing these timeouts.

Thanks for pushing back; this is the other anti-tracking manifest (25-40 minute runtime). I will disable it for now.

If we cannot make the tests faster, ideally we can reduce the number of tests and split them into smaller manifest groups.

Depends on: 1900824
Pushed by archaeopteryx@coole-files.de: https://hg.mozilla.org/integration/autoland/rev/ff605131b329 Increase duration limit for Jit tests on Android debug to 1 hour. r=jmaher DONTBUILD

The Mn asan jobs are failing because the Win2k enrollment tests take quite a lot of time due to the high number of Firefox restarts, with most of that time spent in startup and shutdown.

Joel, given that those jobs run for close to 90 minutes, which is quite high, maybe it's time to split them up into two chunks?
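
A back-of-the-envelope sketch of why restart overhead dominates and why two chunks roughly halve the wall time (all figures below are made up; the bug gives no exact numbers):

```python
# Hypothetical figures; ignores per-chunk setup cost such as machine
# provisioning and artifact downloads.
restarts = 120            # assumed number of Firefox restarts in the job
startup_shutdown_s = 35   # assumed per-restart overhead under asan
test_work_s = 10          # assumed useful work per test

total_min = restarts * (startup_shutdown_s + test_work_s) / 60
print(f"one chunk:  ~{total_min:.0f} min")            # ~90 min, near the limit
print(f"two chunks: ~{total_min / 2:.0f} min each")   # ~45 min each
```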

Flags: needinfo?(jmaher)

:whimboo, yes, this looks like a good idea. I am on PTO this week; happy to dig into this next week, or to find time for a review this week if you write the patch and test it out.

Flags: needinfo?(jmaher)

Comment on attachment 9422271 [details]
Bug 1809667 - split marionette asan into two chunks. r=whimboo

Revision D220903 was moved to bug 1916456. Setting attachment 9422271 [details] to obsolete.

Attachment #9422271 - Attachment is obsolete: true
Duplicate of this bug: 1926876