Open Bug 1963532 (memory-allocation-issues) Opened 22 days ago Updated 10 days ago

[meta] [taskcluster:error] task aborted due to sustained memory usage above 90%

Categories

(Testing :: General, defect, P5)


Tracking

(Not tracked)

People

(Reporter: intermittent-bug-filer, Unassigned)

Details

(Keywords: intermittent-failure, meta, Whiteboard: [stockwell infra])

Attachments

(1 file)

Filed by: amarc [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=506091839&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/R_rz82CORLeXWpkJFiikxw/runs/0/artifacts/public/logs/live_backing.log


[task 2025-04-30T07:32:18.090Z] 07:32:18     INFO -  PID 9620 | Cycle 1(15): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6737.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf)
[task 2025-04-30T07:32:18.638Z] 07:32:18     INFO -  PID 9620 | drawTriangle@resource://pdf.js/build/pdf.mjs:6163:23
[task 2025-04-30T07:32:18.639Z] 07:32:18     INFO -  PID 9620 | drawFigure@resource://pdf.js/build/pdf.mjs:6196:21
[task 2025-04-30T07:32:18.639Z] 07:32:18     INFO -  PID 9620 | _createMeshCanvas@resource://pdf.js/build/pdf.mjs:6249:17
[task 2025-04-30T07:32:18.639Z] 07:32:18     INFO -  PID 9620 | getPattern@resource://pdf.js/build/pdf.mjs:6278:41
[task 2025-04-30T07:32:18.639Z] 07:32:18     INFO -  PID 9620 | shadingFill@resource://pdf.js/build/pdf.mjs:8063:29
[task 2025-04-30T07:32:18.639Z] 07:32:18     INFO -  PID 9620 | executeOperatorList@resource://pdf.js/build/pdf.mjs:7098:20
[task 2025-04-30T07:32:18.639Z] 07:32:18     INFO -  PID 9620 | createPatternCanvas@resource://pdf.js/build/pdf.mjs:6383:14
[task 2025-04-30T07:32:18.639Z] 07:32:18     INFO -  PID 9620 | getPattern@resource://pdf.js/build/pdf.mjs:6477:41
[task 2025-04-30T07:32:18.639Z] 07:32:18     INFO -  PID 9620 | fill@resource://pdf.js/build/pdf.mjs:7557:33
[task 2025-04-30T07:32:18.640Z] 07:32:18     INFO -  PID 9620 | eoFill@resource://pdf.js/build/pdf.mjs:7583:10
[task 2025-04-30T07:32:18.640Z] 07:32:18     INFO -  PID 9620 | constructPath@resource://pdf.js/build/pdf.mjs:7516:13
[task 2025-04-30T07:32:18.640Z] 07:32:18     INFO -  PID 9620 | executeOperatorList@resource://pdf.js/build/pdf.mjs:7098:20
[task 2025-04-30T07:32:18.640Z] 07:32:18     INFO -  PID 9620 | _next@resource://pdf.js/build/pdf.mjs:11713:37
[task 2025-04-30T07:32:18.640Z] 07:32:18     INFO -  PID 9620 | @resource://pdf.js/build/pdf.mjs:11703:14
[task 2025-04-30T07:32:18.640Z] 07:32:18     INFO -  PID 9620 |
[task 2025-04-30T07:32:21.824Z] 07:32:21     INFO -  PID 9620 | Cycle 1(1): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:32:26.224Z] 07:32:26     INFO -  PID 9620 | Cycle 1(2): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:32:30.273Z] 07:32:30     INFO -  PID 9620 | Cycle 1(3): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:32:35.468Z] 07:32:35     INFO -  PID 9620 | Cycle 1(4): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:32:38.906Z] 07:32:38     INFO -  PID 9620 | Cycle 1(5): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:32:43.032Z] 07:32:43     INFO -  PID 9620 | Cycle 1(6): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:32:46.425Z] 07:32:46     INFO -  PID 9620 | Cycle 1(7): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:32:50.874Z] 07:32:50     INFO -  PID 9620 | Cycle 1(8): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:32:55.074Z] 07:32:55     INFO -  PID 9620 | Cycle 1(9): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:32:59.741Z] 07:32:59     INFO -  PID 9620 | Cycle 1(10): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:33:02.992Z] 07:33:02     INFO -  PID 9620 | Cycle 1(11): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:33:07.160Z] 07:33:07     INFO -  PID 9620 | Cycle 1(12): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[task 2025-04-30T07:33:10.521Z] 07:33:10     INFO -  PID 9620 | Cycle 1(13): loaded http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6741.pdf (next: http://127.0.0.1:51210/tests/pdfpaint/pdfs/issue6769.pdf)
[taskcluster:warn 2025-04-30T07:33:15.463Z] Sustained memory usage above 90%!
[taskcluster:warn 2025-04-30T07:33:15.470Z] Aborting task to prevent OOM issues...
[taskcluster:error] Aborting task...
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 10512 (child process of PID 6972) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 7180 (child process of PID 6972) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 3352 (child process of PID 6972) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 10540 (child process of PID 6972) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 1832 (child process of PID 6972) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 2464 (child process of PID 6972) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 7380 (child process of PID 6972) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 2876 (child process of PID 6972) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 2740 (child process of PID 6972) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 2588 (child process of PID 6972) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 6972 (child process of PID 9620) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 9620 (child process of PID 11168) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 11168 (child process of PID 8356) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 4128 (child process of PID 3192) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 8356 (child process of PID 3192) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 3192 (child process of PID 10152) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 8976 (child process of PID 6152) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 10152 (child process of PID 6152) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] SUCCESS: The process with PID 6152 (child process of PID 3644) has been terminated.
[taskcluster 2025-04-30T07:33:16.034Z] 
[taskcluster 2025-04-30T07:33:16.036Z] === Task Finished ===
[taskcluster 2025-04-30T07:33:16.036Z] Task Duration: 13m1.5768913s
[taskcluster 2025-04-30T07:33:16.436Z] Uploading artifact public/test_info/pdfpaint_errorsummary.log from file C:\task_174598618418551\build\blobber_upload_dir\pdfpaint_errorsummary.log with content encoding "gzip", mime type "text/plain" and expiry 2026-04-30T04:38:14.412Z
[taskcluster 2025-04-30T07:33:16.441Z] Uploading artifact public/logs/localconfig.json from file C:\task_174598618418551\logs\localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2026-04-30T04:38:14.412Z
[taskcluster 2025-04-30T07:33:16.761Z] Uploading link artifact public/logs/live.log to artifact public/logs/live_backing.log with expiry 2026-04-30T04:38:14.412Z
[taskcluster:error] task aborted due to sustained memory usage above 90%
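
(For context on the mechanism behind the abort above: the worker samples overall system memory and kills the task once usage stays above the threshold for a sustained period. The Python sketch below illustrates that general idea only; it is not the actual generic-worker implementation, which is written in Go, and the sample interval and sustain window are assumed values.)

    # Illustrative sketch of a sustained-memory watchdog, assuming psutil.
    # The 90% threshold matches the log above; the polling interval and
    # sustain window are assumptions, not the real generic-worker settings.
    import time
    import psutil

    THRESHOLD_PERCENT = 90.0   # abort threshold (from the log message)
    SAMPLE_INTERVAL_S = 1.0    # assumed polling interval
    SUSTAIN_SECONDS = 10.0     # assumed window that counts as "sustained"

    def watch_memory(abort_task):
        above_since = None
        while True:
            used = psutil.virtual_memory().percent
            if used > THRESHOLD_PERCENT:
                if above_since is None:
                    above_since = time.monotonic()
                elif time.monotonic() - above_since >= SUSTAIN_SECONDS:
                    print(f"Sustained memory usage above {THRESHOLD_PERCENT:.0f}%!")
                    print("Aborting task to prevent OOM issues...")
                    abort_task()
                    return
            else:
                above_since = None
            time.sleep(SAMPLE_INTERVAL_S)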

Summary: Intermittent [taskcluster:error] task aborted due to sustained memory usage above 90% → Frequent [taskcluster:error] task aborted due to sustained memory usage above 90%

:amarc, do you know if it's possible to find a culprit commit for this?

Flags: needinfo?(amarc)

Looks like a talos test in the attached log (pdfpaint).

Component: mozperftest → Talos

This doesn't look tree related: the bug was filed today and fails frequently, but we have pushes from the 25th showing this failure on retriggers here.
That points to either something in how the performance tests are set up or something going on with the workers.

Flags: needinfo?(mcornmesser)
Flags: needinfo?(kshampur)
Whiteboard: [stockwell infra]

We had a change in the worker configuration earlier this week: https://github.com/mozilla-platform-ops/worker-images/pull/273.

I will roll that pool back to the last config and deploy this afternoon.

The rollback config is going out now. Can you let me know if those tests come back green?

Flags: needinfo?(mcornmesser) → needinfo?(csabou)

Sure, will see how it goes.

Flags: needinfo?(csabou)
Flags: needinfo?(mcornmesser)

It looks like we have this happening on two different pools: -hw and -hw-ref. I will roll back the config for the ref pool now.

Flags: needinfo?(mcornmesser) → needinfo?(csabou)

It looks like those tests are passing now. https://firefox-ci-tc.services.mozilla.com/tasks/ZzDm-RLYSQy4arxtCFzXuQ

jmoss: FYI

pmoore: This seems to have happened when we started using GW 83.5.7 on Windows 11 hardware workers. Not sure if it was the cause, but it was a significant difference between the 2 different configs.

jmaher: This may have been a condition that wasn't picked up by the integration testing.

Flags: needinfo?(pmoore)
Flags: needinfo?(jmoss)
Flags: needinfo?(jmaher)

Yes, can confirm lots of green runs on retriggers. Thanks for the fix.

Flags: needinfo?(csabou)
Flags: needinfo?(kshampur)

This is more than likely caused by an update to generic-worker: a feature called DisableOOMProtection (more here) was added in generic-worker version 83.4.0. We are going to disable this worker config option on the hardware pools, as the feature impacts tasks such as this one.
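
(For illustration only: disabling a generic-worker feature for a pool is normally a small change in the pool's workerConfig. The fragment below is a hypothetical sketch; the key name, casing, and nesting are assumptions derived from the feature name mentioned above, not the actual change that was deployed.)

    # Hypothetical worker-pool config fragment (expressed as a Python dict).
    # The key name and nesting are assumptions; the real pool definitions
    # live in the worker-images / CI config repos, not in this bug.
    worker_pool_config = {
        "workerConfig": {
            "genericWorker": {
                "config": {
                    # Assumed boolean named after the feature discussed above.
                    "disableOOMProtection": True,
                },
            },
        },
    }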

Flags: needinfo?(jmoss)
Flags: needinfo?(jmaher)

Thanks for the quick resolution :markco, and :jmoss!

Flags: needinfo?(pmoore)

Hi all,

Please note, this feature was created to detect tasks that use excessive memory resources and are at risk of failing intermittently due to limited available memory.

Disabling the feature may appear to improve the situation, but may hide deeper problems.

I would recommend at least investigating whether the tests can be adjusted to use less memory, rather than just disabling the worker protections.

Please note the excessive memory usage may originate from a long time ago, and not be related to a recent change.

If the current worker memory protections are too aggressive, there is also the opportunity to tune them. Please advise if this is desired. Thanks!

Hi Cosmin, Alexandru,

Can we make this bug a meta-bug to track the separate memory allocation issues?

It would be good to create dedicated bugs for each test suite/platform, and to NI the owners of those suites on the specific bugs (@jmaher can probably help identify the folks).

Jonathan/Mark, are you happy to email out (to fx-all@m.o) with your plans for when the feature will be rolled out, and advise about this meta bug for reporting issues?

I think that should cover everything, so that moving forward we should hopefully have fewer intermittents due to excessive memory consumption.

Let me know if I can assist in any way.

Thanks!

Flags: needinfo?(mcornmesser)
Flags: needinfo?(jmoss)
Flags: needinfo?(csabou)
Flags: needinfo?(amarc)

Hi Pete, happy to make this a meta bug for memory allocation issues.
Want to double check: are we sure this is an issue with the tests rather than the workers? The latest failures here are all on the same worker, R5CY128X71B, and most of them are on different test suites. Is that the only worker on that platform, gecko-t-lambda-perf-a55, that has this protective feature enabled? If not, it points to a worker issue rather than a test suite issue, in my opinion.
This started with only tests running on win11-64-24h2-hw-ref and win11-64-24h2-hw failing with task aborted due to sustained memory usage above 90%.

Alias: memory-allocation-issues
Component: Talos → General
Flags: needinfo?(pmoore)
Flags: needinfo?(csabou)
Flags: needinfo?(amarc)
Summary: Frequent [taskcluster:error] task aborted due to sustained memory usage above 90% → [meta] [please file individual bugs per test suite/platform] [taskcluster:error] task aborted due to sustained memory usage above 90%

Oh, this is a great call out, and it changes everything. It suggests that particular worker is bad, and the feature does not need to be disabled. That makes our lives a lot simpler! It also provides a new insight into these issues: I don't think we had considered before that they might be localised to individual workers. Perhaps something else is running on the worker and eating up memory? Any ideas, Mark or Jonathan?

Flags: needinfo?(pmoore)
Summary: [meta] [please file individual bugs per test suite/platform] [taskcluster:error] task aborted due to sustained memory usage above 90% → [meta] [taskcluster:error] task aborted due to sustained memory usage above 90%

Only releng-hardware/win11-64-24h2-hw-ref and releng-hardware/win11-64-24h2-hw have this generic-worker feature disabled. The nodes seen here for those two pools span all of the workers in releng-hardware/win11-64-24h2-hw-ref and releng-hardware/win11-64-24h2-hw. I believe the lambda perf pool is mobile phones.

Flags: needinfo?(jmoss)

If the lambda failures are all on a specific worker id (i.e. R5CY128X71B), then we can quarantine/investigate/fix that worker. This is why we keep track of bugs in a big bucket and look for trends periodically.
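
(As a reference for the quarantine step: a suspect worker can be quarantined through the Queue API's quarantineWorker endpoint so it stops claiming tasks while it is investigated. A rough Python sketch follows; the provisioner id, worker group, and quarantine duration are placeholders, and the call needs credentials with the appropriate quarantine scope.)

    # Rough sketch: quarantine a single suspect worker. Provisioner id and
    # worker group are placeholders; adjust them to the real pool.
    import datetime
    import taskcluster

    queue = taskcluster.Queue({
        "rootUrl": "https://firefox-ci-tc.services.mozilla.com",
        # credentials with the queue quarantine-worker scope go here
    })

    quarantine_until = (
        datetime.datetime.utcnow() + datetime.timedelta(days=7)
    ).strftime("%Y-%m-%dT%H:%M:%S.000Z")

    queue.quarantineWorker(
        "proj-autophone",            # provisionerId (placeholder)
        "gecko-t-lambda-perf-a55",   # workerType, from the comment above
        "lambda",                    # workerGroup (placeholder)
        "R5CY128X71B",               # workerId
        {"quarantineUntil": quarantine_until},
    )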

t-nuc12-003 and t-nuc12-014 seem to hit this error more frequently

Can we adjust this to be 95% memory instead of 90%?

NI'ing Andrew for the mobile one (R5CY128X71B) and for the t-linux-docker workers: 2489782782095831361 and 7978773178479849871.

Flags: needinfo?(aerickson)

Pinging :jcristau for the t-linux-docker workers.

Flags: needinfo?(jcristau)

(In reply to :jmoss from comment #27)

Pinging :jcristau for the t-linux-docker workers.

Seems we should disable the feature there.

Flags: needinfo?(jcristau)

(In reply to Cosmin Sabou [:CosminS] from comment #25)

Looks like it's hitting t-linux-docker workers; 2489782782095831361 and 7978773178479849871.

Looking closer at https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/t-linux-docker/workers/us-west1-a/2489782782095831361?sortBy=started&sortDirection=desc, it first happened for https://firefox-ci-tc.services.mozilla.com/tasks/HQPq-eEVRoKY7fq5xaOTYA/runs/0 (source-test-mozlint-android-focus). That does legitimately use up a lot of memory.
Then it hit the following tasks as they were just starting up. The reason is that the container from HQPq-eEVRoKY7fq5xaOTYA was never actually killed, so it was still running and using up memory in the background. That's similar to https://github.com/taskcluster/taskcluster/issues/7678, except not on timeout.
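
(Sketch of the kind of guard that would prevent this class of leak: reap any containers left over from a previous task before starting the next one. This is an illustration of the idea only, not the actual docker-worker/taskcluster fix, and it assumes the task id is encoded in the container name.)

    # Illustrative only: remove containers left behind by earlier (aborted)
    # tasks before claiming new work, so they stop consuming memory.
    import subprocess

    def reap_stale_containers(current_task_id):
        # List running containers with their names. The assumption here is
        # that the worker puts the task id in the container name.
        out = subprocess.run(
            ["docker", "ps", "--format", "{{.ID}} {{.Names}}"],
            check=True, capture_output=True, text=True,
        ).stdout
        for line in out.splitlines():
            container_id, name = line.split(maxsplit=1)
            if current_task_id not in name:
                # Force-remove anything left over from a prior task.
                subprocess.run(["docker", "rm", "-f", container_id], check=True)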

Flags: needinfo?(mcornmesser)

I don't believe that there's too much information I can provide.

I noticed one device that got hit by this at Lambdatest, the only place I'm running a recently released g-w (it's the one mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1963532#c26). Otherwise this is very minor for my workers, because they don't run recent clients.

Clearing NI, but please feel free to re-NI if I'm wrong.

Flags: needinfo?(aerickson)

It looks like I ran into this issue (or something related) when pushing to Try:
https://treeherder.mozilla.org/jobs?repo=try&revision=75c34adc30f4320ec83b4e5fd2b65c1a2a868988

Note that this was a clean Try push, no actual patch was applied.

Attached is a memory usage graph for https://treeherder.mozilla.org/jobs?repo=try&revision=75c34adc30f4320ec83b4e5fd2b65c1a2a868988&selectedTaskRun=IOtHjd-9RY-4INJMiPepiQ.0
It only has one data point every 60s, so it's not definitive.

That said, there's probably room for improvement here to kill tasks less aggressively.
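
(If finer-grained data would help, a small sampler run alongside the test, locally or uploaded as a Try artifact, can log memory more often than once per minute. Minimal sketch below; the 5-second interval, one-hour duration, and output path are arbitrary choices.)

    # Minimal sketch: record system memory usage at 5-second granularity
    # while reproducing the run, to get more detail than the 60 s graph.
    import csv
    import time
    import psutil

    with open("memory_samples.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "used_percent", "available_mb"])
        for _ in range(720):   # ~1 hour at 5 s intervals
            vm = psutil.virtual_memory()
            writer.writerow([time.time(), vm.percent, vm.available // (1024 * 1024)])
            f.flush()
            time.sleep(5)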
