Frequent win asan gtest TEST-UNEXPECTED-FAIL | gtest | test failed with return code 1 after AddressSanitizer: out of memory:
Categories
(Core :: Audio/Video, defect, P2)
Tracking
| Tracking | Status | |
|---|---|---|
| firefox-esr115 | --- | unaffected |
| firefox-esr128 | --- | wontfix |
| firefox134 | --- | wontfix |
| firefox135 | --- | wontfix |
| firefox136 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: truber)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell disable-recommended])
Attachments
(1 file)
Filed by: nfay [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=464000603&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/A7D29CRkTJSZMGk-L4QrKQ/runs/0/artifacts/public/logs/live_backing.log
[task 2024-06-25T22:21:34.833Z] 22:21:34 INFO - TEST-START | GeckoProfiler.FeatureCombinations
[task 2024-06-25T22:24:09.908Z] 22:24:09 INFO - =================================================================
[task 2024-06-25T22:24:09.910Z] 22:24:09 ERROR - ==2512==ERROR: AddressSanitizer: out of memory: allocator is trying to allocate 0x200000 bytes
[task 2024-06-25T22:24:09.912Z] 22:24:09 INFO - ==2512==FATAL: AddressSanitizer: internal allocator is out of memory trying to allocate 0x70 bytes
[task 2024-06-25T22:24:11.868Z] 22:24:11 INFO - gtest INFO | gtest | process wait complete, returncode=1
[task 2024-06-25T22:24:11.871Z] 22:24:11 INFO - mozcrash checking D:\task_171935297474541\build\tests\gtest for minidumps...
[task 2024-06-25T22:24:11.880Z] 22:24:11 WARNING - gtest TEST-UNEXPECTED-FAIL | gtest | test failed with return code 1
[task 2024-06-25T22:24:11.881Z] 22:24:11 INFO - gtest INFO | rungtests.py exits with code 1
[task 2024-06-25T22:24:11.961Z] 22:24:11 INFO - Return code: 1
[task 2024-06-25T22:24:11.972Z] 22:24:11 ERROR - No tests run or test summary not found
[task 2024-06-25T22:24:11.973Z] 22:24:11 INFO - TinderboxPrint: gtest-gtest<br/><em class="testfail">T-FAIL</em>
[task 2024-06-25T22:24:11.975Z] 22:24:11 WARNING - setting return code to 2
Comment 27•1 year ago
Is there a larger instance type we can use for these Windows ASAN GTest jobs? Seems like throwing more RAM at this problem may be our best option.
Comment 29•1 year ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #27)
> Is there a larger instance type we can use for these Windows ASAN GTest jobs? Seems like throwing more RAM at this problem may be our best option.
For sure. We would have to create an additional pool for it, which is simple, but some in-tree changes will also be needed.
Do we have an idea of what kind of resources will be needed?
Comment 30•1 year ago
No clue, sorry. I don't even know how much RAM the current instances have. I guess I'd naively say to "just" double whatever the current amount is, but that's not based on any hard data whatsoever.
Comment 32•1 year ago
We have provisioned gecko-t/win11-64-24h2-large, which is based on Standard_E8ads_v5 and has 64 GB of RAM.
Comment 37•1 year ago
:jmaher will work on this when/as we move to 24H2 (currently blocked on bug 1926680)
Comment 47•1 year ago
Switching to -large workers with 24H2 in bug 1940980.
Comment 56•1 year ago
The test we hit this on most commonly is ImageDecoders.CorruptAVIFSingleChunk. The file contains a chunk that claims to be 4 GB long, and the test expects the file to be reported as corrupt. We try to allocate 4 GB of memory to hold the chunk's data; the test expects that allocation to fail, the error to propagate, and the image to be reported as corrupt. Instead, the program is terminated due to OOM:
==6084==ERROR: AddressSanitizer: out of memory: allocator is trying to allocate 0xfffb0012 bytes
ASAN gives us a helpful hint though:
==6084==HINT: if you don't care about these errors you may set allocator_may_return_null=1
allocator_may_return_null seems like it should return null for these fallible allocations, which is what we want for this test. I see that allocator_may_return_null is indeed set here:
So we shouldn't need VMs with a large amount of memory for this to work, and I'm not sure what is going wrong. There is another definition of __asan_default_options in the tree that doesn't set allocator_may_return_null.
decoder, you originally set this; do you know what might be going wrong?
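The behaviour the test relies on can be illustrated with a minimal C++ sketch. It assumes a simplified decoder: `DecodeStatus`, `AllocFn`, `NothrowAlloc` and `PrepareChunk` are hypothetical names, not Gecko's image-decoder API, and `std::nothrow` stands in for Gecko's own fallible allocators. The allocator is injectable only so the failure path can be exercised without actually requesting 4 GB. Under ASan, the null-return branch is reached only when `allocator_may_return_null=1` is in effect; with the default setting, ASan aborts inside the allocator with the out-of-memory error quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical result type for this sketch.
enum class DecodeStatus { Ok, Corrupt };

// Hypothetical allocator hook: returns nullptr on failure instead of throwing.
using AllocFn = uint8_t* (*)(size_t);

uint8_t* NothrowAlloc(size_t aLen) { return new (std::nothrow) uint8_t[aLen]; }

// Allocates a buffer for a chunk whose header claims `aClaimedLen` bytes.
// A failed fallible allocation is reported as a corrupt image rather than
// terminating the process. Under ASan this branch is only reachable when
// allocator_may_return_null=1; otherwise the allocator itself aborts.
DecodeStatus PrepareChunk(uint64_t aClaimedLen, AllocFn aAlloc = NothrowAlloc) {
  if (aClaimedLen > SIZE_MAX) {
    return DecodeStatus::Corrupt;  // length cannot be represented as size_t
  }
  std::unique_ptr<uint8_t[]> buf(aAlloc(static_cast<size_t>(aClaimedLen)));
  if (!buf) {
    return DecodeStatus::Corrupt;  // propagate the allocation failure
  }
  // A real decoder would now copy the chunk data into `buf`.
  return DecodeStatus::Ok;
}
```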
Comment 57•1 year ago
:truber, could you check whether the other __asan_default_options is ever used here? I wasn't aware we had another definition that might be interfering.
Comment 58•1 year ago
I don't understand the build system there, but looking at the fuzzing-asan builds, I don't see sanitizer_options.cc used for either Linux or Windows.
However, AsanOptions.cpp isn't used on Windows either. Overriding weak definitions seems to be supported on Windows now, but only in the main executable, not in DLLs [1].
[1] https://reviews.llvm.org/D28596 (last paragraph of c0)
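Given that limitation, one option is to define the hook in the test executable itself. The sketch below assumes that approach and is not necessarily what the landed patch does. `__asan_default_options` is ASan's documented hook for compiled-in default options (declared in `<sanitizer/asan_interface.h>`); flags set through the `ASAN_OPTIONS` environment variable still take precedence over it. `HasDefaultOption` is a hypothetical helper for checking the configured string. In a build without ASan, no runtime ever calls the hook.

```cpp
#include <cassert>
#include <cstring>

// ASan's hook for compiled-in default options. Per the comment above, on
// Windows it has to live in the main executable to take effect; a definition
// that exists only in a DLL is not picked up.
extern "C" const char* __asan_default_options() {
  // Multiple flags would be separated by ':'. ASAN_OPTIONS overrides these.
  return "allocator_may_return_null=1";
}

// Hypothetical helper: true if `aFlag` appears in the compiled-in defaults.
// A plain substring match is enough for this sketch.
bool HasDefaultOption(const char* aFlag) {
  return std::strstr(__asan_default_options(), aFlag) != nullptr;
}
```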
Comment 59•1 year ago
Comment 62•1 year ago
The severity field for this bug is set to S4. However, the following bug duplicate has higher severity:
- Bug 1923579: S3
:truber, could you consider increasing the severity of this bug to S3?
For more information, please visit BugBot documentation.
Comment 64•1 year ago
Comment 65•1 year ago
Backed out for causing Linux cppunit perma-failures.
Backout link: https://hg.mozilla.org/integration/autoland/rev/8eb4ee83e729cc252dec03c431592e3b70001680
Failure log -> AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs operator delete)
TEST-START | TestUniquePtr
[task 2025-01-24T18:01:17.606Z] 18:01:17 INFO - PID 1274 |
[task 2025-01-24T18:01:17.607Z] 18:01:17 INFO - =================================================================
[task 2025-01-24T18:01:17.607Z] 18:01:17 ERROR - ==1274==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs operator delete) on 0x50f000000040
[task 2025-01-24T18:01:17.607Z] 18:01:17 INFO - #0 0x557b3b35566d in operator delete(void*) /builds/worker/fetches/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:143:3
[task 2025-01-24T18:01:17.608Z] 18:01:17 INFO - #1 0x557b3b357663 in DeleteIntFunction /builds/worker/checkouts/gecko/mfbt/tests/TestUniquePtr.cpp:325:3
[task 2025-01-24T18:01:17.608Z] 18:01:17 INFO - #2 0x557b3b357663 in reset /builds/worker/workspace/obj-build/dist/include/mozilla/UniquePtr.h:302:7
[task 2025-01-24T18:01:17.608Z] 18:01:17 INFO - #3 0x557b3b357663 in TestFunctionReferenceDeleter /builds/worker/checkouts/gecko/mfbt/tests/TestUniquePtr.cpp:354:8
[task 2025-01-24T18:01:17.608Z] 18:01:17 INFO - #4 0x557b3b357663 in main /builds/worker/checkouts/gecko/mfbt/tests/TestUniquePtr.cpp:587:8
[task 2025-01-24T18:01:17.608Z] 18:01:17 INFO - #5 0x7fe68b1d7b96 in __libc_start_main /tmp/glibc/csu/../csu/libc-start.c:310
[task 2025-01-24T18:01:17.609Z] 18:01:17 INFO - #6 0x557b3b277598 in _start (/builds/worker/workspace/build/tests/cppunittest/TestUniquePtr+0xc7598) (BuildId: 00c433fc9c7ed39469f39dd5c79ca32d9f0d4d92)
[task 2025-01-24T18:01:17.609Z] 18:01:17 INFO - 0x50f000000040 is located 0 bytes inside of 168-byte region [0x50f000000040,0x50f0000000e8)
[task 2025-01-24T18:01:17.609Z] 18:01:17 INFO - allocated by thread T0 here:
[task 2025-01-24T18:01:17.609Z] 18:01:17 INFO - #0 0x557b3b354f1d in operator new[](unsigned long) /builds/worker/fetches/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:89:3
[task 2025-01-24T18:01:17.609Z] 18:01:17 INFO - #1 0x557b3b357616 in TestFunctionReferenceDeleter /builds/worker/checkouts/gecko/mfbt/tests/TestUniquePtr.cpp:351:39
[task 2025-01-24T18:01:17.609Z] 18:01:17 INFO - #2 0x557b3b357616 in main /builds/worker/checkouts/gecko/mfbt/tests/TestUniquePtr.cpp:587:8
[task 2025-01-24T18:01:17.609Z] 18:01:17 INFO - #3 0x7fe68b1d7b96 in __libc_start_main /tmp/glibc/csu/../csu/libc-start.c:310
[task 2025-01-24T18:01:17.610Z] 18:01:17 INFO - SUMMARY: AddressSanitizer: alloc-dealloc-mismatch /builds/worker/checkouts/gecko/mfbt/tests/TestUniquePtr.cpp:325:3 in DeleteIntFunction
[task 2025-01-24T18:01:17.610Z] 18:01:17 INFO - ==1274==HINT: if you don't care about these errors you may set ASAN_OPTIONS=alloc_dealloc_mismatch=0
[task 2025-01-24T18:01:17.610Z] 18:01:17 INFO - ==1274==ABORTING
[task 2025-01-24T18:01:17.610Z] 18:01:17 INFO - mozcrash checking /tmp/tmpg8dj1sp9 for minidumps...
[task 2025-01-24T18:01:17.610Z] 18:01:17 WARNING - TEST-UNEXPECTED-FAIL | TestUniquePtr | test failed with return code 1
[task 2025-01-24T18:01:17.610Z] 18:01:17 INFO - TEST-INFO took 40ms
[task 2025-01-24T18:01:17.610Z] 18:01:17 INFO - TEST-START | TestUtf8
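The report flags a pairing mismatch in the test itself: the 168-byte region came from `operator new[]` (TestUniquePtr.cpp:351) but `DeleteIntFunction` releases it with `operator delete` (TestUniquePtr.cpp:325). As a hedged sketch of the correct pairing, the version below uses the standard `std::unique_ptr` with a function-reference deleter rather than MFBT's `UniquePtr`; `DeleteIntArray`, `IntArrayPtr`, `gArraysFreed` and `Example` are hypothetical names, not the code from TestUniquePtr.cpp.

```cpp
#include <cassert>
#include <memory>

// Counts arrays released by the deleter, so the behaviour can be checked.
static int gArraysFreed = 0;

// Deleter passed by function reference. The storage comes from new[], so it
// must be released with delete[]; calling plain `delete` here is the kind of
// alloc-dealloc-mismatch ASan reported above.
void DeleteIntArray(int* aPtr) {
  delete[] aPtr;
  ++gArraysFreed;
}

// std::unique_ptr accepts an lvalue reference to a function as its deleter type.
using IntArrayPtr = std::unique_ptr<int, void (&)(int*)>;

void Example() {
  IntArrayPtr ptr(new int[8](), DeleteIntArray);
  assert(ptr.get() != nullptr);
  ptr.reset();  // runs DeleteIntArray, which uses the matching delete[]
}
```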
Comment 66•1 year ago
Thanks. It looks like a bunch of tests pull in mozglue too.
Comment 67•1 year ago
Comment 69•1 year ago
bugherder
Comment 70•1 year ago
I looked at failures recently classified as this bug, and most of them do not look like this bug. I'd encourage filing a new bug and classifying failures in this job (gtest asan win) against it so investigation can happen there, because it looks like we did fix the issue that was causing almost all of the failures here. The new failures were either masked by this issue, or are new because the ASan options we wanted are now actually applied, which surfaced new failures.
Comment 71•1 year ago
As of late Friday, January 24th, these tests started running on -large instances with an updated version of Windows. I suspect the failure rate will drop significantly.
Comment 73•1 year ago
I don't think we should need to increase the instance size. The OOM was the result of a configuration error: large allocations were supposed to return null, but instead they crashed the program. The code change has hopefully fixed that.
Comment 74•1 year ago
I am on PTO all week, so if you want to reduce the size, please fix the remaining issues and remove https://searchfox.org/mozilla-central/source/taskcluster/kinds/test/compiled.yml#55
Comment 78•1 year ago
FYI, I did try to decrease the size of the machines, but some other gtests allocate a lot of memory without intending to OOM, such as TestIDNA.BenchUrlPunycodeMixed, which allocates 50000 URIs.
So both the patch to fix the asan behaviour and the worker size increase were needed here.