Open Bug 1797413 Opened 2 years ago Updated 2 years ago

OOM due to unconstrained memory usage

Categories

(Core :: DOM: Workers, defect, P3)


Tracking


Tracking Status
firefox108 --- affected

People

(Reporter: tsmith, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Keywords: csectype-oom, testcase)

Attachments

(1 file)

Attached file testcase.html

Found while fuzzing m-c 20221020-ca2873779214 (--enable-address-sanitizer --enable-fuzzing)

To reproduce via Grizzly Replay:

$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch -a --fuzzing -n firefox
$ ASAN_OPTIONS=hard_rss_limit_mb=6144 python -m grizzly.replay ./firefox/firefox testcase.html

NOTE: Set a reasonable memory limit via ASAN_OPTIONS=hard_rss_limit_mb=# to avoid system OOMs.

This test case does not trigger an OOM on Chrome.

HEAP PROFILE at RSS 12315Mb
Live Heap Allocations: 10520719985 bytes in 2145730 chunks; quarantined: 100213077 bytes in 32622 chunks; 44843 other chunks; total chunks: 2223195; showing top 90% (at most 20 unique contexts)
8664686592 byte(s) (82%) in 2115402 allocation(s)
    #0 0x56346038cf3e in __interceptor_malloc /builds/worker/fetches/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:69:3
    #1 0x7f3eeb76538c in js_arena_malloc /builds/worker/workspace/obj-build/dist/include/js/Utility.h:366:10
    #2 0x7f3eeb76538c in js_pod_arena_malloc<char> /builds/worker/workspace/obj-build/dist/include/js/Utility.h:576:26
    #3 0x7f3eeb76538c in maybe_pod_arena_malloc<char> /builds/worker/workspace/obj-build/dist/include/js/AllocPolicy.h:33:12
    #4 0x7f3eeb76538c in pod_arena_malloc<char> /builds/worker/workspace/obj-build/dist/include/js/AllocPolicy.h:46:12
    #5 0x7f3eeb76538c in pod_malloc<char> /builds/worker/workspace/obj-build/dist/include/js/AllocPolicy.h:72:12
    #6 0x7f3eeb76538c in mozilla::BufferList<js::SystemAllocPolicy>::AllocateSegment(unsigned long, unsigned long) /builds/worker/workspace/obj-build/dist/include/mozilla/BufferList.h:399:33
    #7 0x7f3eeb7501ed in mozilla::BufferList<js::SystemAllocPolicy>::AllocateBytes(unsigned long, unsigned long*) /builds/worker/workspace/obj-build/dist/include/mozilla/BufferList.h:469:16
    #8 0x7f3ef1a9b107 in mozilla::BufferList<js::SystemAllocPolicy>::WriteBytes(char const*, unsigned long) /builds/worker/workspace/obj-build/dist/include/mozilla/BufferList.h:436:18
    #9 0x7f3ef7f762c9 in AppendBytes /builds/worker/workspace/obj-build/dist/include/js/StructuredClone.h:538:21
    #10 0x7f3ef7f762c9 in writeArray<unsigned char> /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:989:12
    #11 0x7f3ef7f762c9 in writeBytes /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:1004:10
    #12 0x7f3ef7f762c9 in JSStructuredCloneWriter::writeArrayBuffer(JS::Handle<JSObject*>) /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:1397:14
    #13 0x7f3ef7f74cad in JSStructuredCloneWriter::startWrite(JS::Handle<JS::Value>) /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:2062:16
    #14 0x7f3ef7f73939 in JSStructuredCloneWriter::writeTypedArray(JS::Handle<JSObject*>) /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:1352:8
    #15 0x7f3ef7f745d7 in JSStructuredCloneWriter::startWrite(JS::Handle<JS::Value>) /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:2107:16
    #16 0x7f3ef7f95e2d in JS_WriteTypedArray(JSStructuredCloneWriter*, JS::Handle<JS::Value>) /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:4092:13
    #17 0x7f3eef529c6c in mozilla::dom::ImageData::WriteStructuredClone(JSContext*, JSStructuredCloneWriter*) const /builds/worker/checkouts/gecko/dom/canvas/ImageData.cpp:137:10
    #18 0x7f3eed444de6 in mozilla::dom::StructuredCloneHolder::WriteFullySerializableObjects(JSContext*, JSStructuredCloneWriter*, JS::Handle<JSObject*>) /builds/worker/checkouts/gecko/dom/base/StructuredCloneHolder.cpp:492:12
    #19 0x7f3eed449b49 in mozilla::dom::StructuredCloneHolder::CustomWriteHandler(JSContext*, JSStructuredCloneWriter*, JS::Handle<JSObject*>, bool*) /builds/worker/checkouts/gecko/dom/base/StructuredCloneHolder.cpp:1130:10
    #20 0x7f3ef7f752ab in JSStructuredCloneWriter::startWrite(JS::Handle<JS::Value>) /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:2124:10
    #21 0x7f3ef7f6ae22 in JSStructuredCloneWriter::write(JS::Handle<JS::Value>) /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:2393:42
    #22 0x7f3ef7f69823 in WriteStructuredClone(JSContext*, JS::Handle<JS::Value>, JSStructuredCloneData*, JS::StructuredCloneScope, JS::CloneDataPolicy const&, JSStructuredCloneCallbacks const*, void*, JS::Value const&) /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:751:10
    #23 0x7f3ef7f93104 in JS_WriteStructuredClone(JSContext*, JS::Handle<JS::Value>, JSStructuredCloneData*, JS::StructuredCloneScope, JS::CloneDataPolicy const&, JSStructuredCloneCallbacks const*, void*, JS::Handle<JS::Value>) /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:3839:10
    #24 0x7f3ef7f94b12 in JSAutoStructuredCloneBuffer::write(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::CloneDataPolicy const&, JSStructuredCloneCallbacks const*, void*) /builds/worker/checkouts/gecko/js/src/vm/StructuredClone.cpp:3971:13
    #25 0x7f3eed442e5e in mozilla::dom::StructuredCloneHolderBase::Write(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::CloneDataPolicy const&) /builds/worker/checkouts/gecko/dom/base/StructuredCloneHolder.cpp:274:17
    #26 0x7f3eed443c4c in mozilla::dom::StructuredCloneHolder::Write(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::CloneDataPolicy const&, mozilla::ErrorResult&) /builds/worker/checkouts/gecko/dom/base/StructuredCloneHolder.cpp:361:35
    #27 0x7f3ef2130536 in mozilla::dom::Worker::PostMessage(JSContext*, JS::Handle<JS::Value>, mozilla::dom::Sequence<JSObject*> const&, mozilla::ErrorResult&) /builds/worker/checkouts/gecko/dom/workers/Worker.cpp:140:13
    #28 0x7f3eeeac5517 in mozilla::dom::Worker_Binding::postMessage(JSContext*, JS::Handle<JSObject*>, void*, JSJitMethodCallArgs const&) /builds/worker/workspace/obj-build/dom/bindings/WorkerBinding.cpp:629:32
    #29 0x7f3eef28b46f in bool mozilla::dom::binding_detail::GenericMethod<mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::ThrowExceptions>(JSContext*, unsigned int, JS::Value*) /builds/worker/checkouts/gecko/dom/bindings/BindingUtils.cpp:3287:13
    #30 0x7f3ef9b22de3 in CallJSNative /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:459:13
    #31 0x7f3ef9b22de3 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:547:12
    #32 0x7f3ef9b11769 in InternalCall /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:614:10
    #33 0x7f3ef9b11769 in CallFromStack /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:619:10
    #34 0x7f3ef9b11769 in Interpret(JSContext*, js::RunState&) /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:3375:16
    #35 0x7f3ef9af6d6e in js::RunScript(JSContext*, js::RunState&) /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:431:13
    #36 0x7f3ef9b22f05 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:579:13
    #37 0x7f3ef9b249ae in InternalCall /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:614:10
    #38 0x7f3ef9b249ae in js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:646:8
    #39 0x7f3ef8113275 in JS::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) /builds/worker/checkouts/gecko/js/src/vm/CallAndConstruct.cpp:117:10
    #40 0x7f3eeee857c9 in mozilla::dom::EventListener::HandleEvent(mozilla::dom::BindingCallContext&, JS::Handle<JS::Value>, mozilla::dom::Event&, mozilla::ErrorResult&) /builds/worker/workspace/obj-build/dom/bindings/EventListenerBinding.cpp:62:8
    #41 0x7f3eefd18964 in void mozilla::dom::EventListener::HandleEvent<mozilla::dom::EventTarget*>(mozilla::dom::EventTarget* const&, mozilla::dom::Event&, mozilla::ErrorResult&, char const*, mozilla::dom::CallbackObject::ExceptionHandling, JS::Realm*) /builds/worker/workspace/obj-build/dist/include/mozilla/dom/EventListenerBinding.h:65:12
Flags: in-testsuite?
Severity: -- → S3
Priority: -- → P3

IIUC this is happening deep inside JSStructuredCloneWriter, and it seems to me that the whole call path propagates errors for a possible OOM, so the behavior depends on what mozilla::BufferList<js::SystemAllocPolicy> actually means in this system context. Moving tentatively to JS for further triage.

Component: DOM: Workers → JavaScript Engine

(removing severity to make it go through the triage process again)

Severity: S3 → --
Priority: P3 → --

I will look into this. It might be that our clone of a huge ArrayBuffer uses more memory than Chrome's, though that would be unexpected. It might be that they schedule a GC sooner, so less memory piles up. Or it might be that we don't read out the postMessage messages as promptly, so more of them pile up.

Flags: needinfo?(sphink)

Can you clarify what's going on in frame 0 of the backtrace?

I understand the situation here to potentially be:

  • SystemAllocPolicy is probably fallible and the JS code seems to know how to handle that. This is the basis for Jens moving the bug to the JS engine. (It would be very nice if AllocPolicy.h could state very clearly where it fits within the infallible (crash on failure to allocate) or fallible (return null on failure to allocate) spectrum within the file, potentially referencing the canonical location of appropriate docs.)
  • We are using ASAN's malloc which is infallibly crashing here rather than returning null so it's impossible for the JS code to do the right thing?

I'm a little confused about what's going on with the ImageData size too. https://bugzilla.mozilla.org/show_bug.cgi?id=1715316#c0 implies that JS_CHECK_LARGE_ALLOC should be failing to allocate the 1,682 MiB ImageData allocation (width * height * 4) by way of js_arena_malloc. Of course, if we're using ASAN's allocators instead of the JS allocators, then that mechanism wouldn't be in play, I guess?

To :sfink, I understood :jstutte's movement of the bug to primarily be about the code seeming to want to do fallible allocations, but the behavior not being fallible here. It's definitely the case that this test case is absolutely asking for more memory than the process can provide under the 6 GiB limit, given that we're talking about at least 19.7 GiB of ImageData buffers (11 worker copies + 1 source window copy; ~2x as much if the deserialization can't simply take ownership of the typed array buffer). new Worker("") will end up trying to load whatever the base URL is, which will fail because the MIME type won't be JS. Notably, our worker script loading must currently involve the main thread, so there is a 0% chance of us being able to discard any of those message payloads because the test case never yields control flow. I understand Chrome does not have this limitation, so they have a chance to discard the messages very quickly, especially if their allocator does something like notifying memory pressure, then waiting a bit and trying again. They could also be doing clever stuff with copy-on-write, or leveraging that the ImageData is initialized by spec to "transparent black", which I believe should be all zeroes, which also allows for not needing to dirty the underlying memory pages and other cleverness.
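
For concreteness, a minimal sketch of the pattern described above (hypothetical; the dimensions and worker count below are stand-ins chosen to match the quoted sizes, not values read from the attached testcase):

// Hypothetical reconstruction of the testcase pattern, not the attachment itself.
// 20480 * 21504 * 4 bytes = 1680 MiB per ImageData copy (close to the
// 1,682 MiB figure above).
const img = new ImageData(20480, 21504);
for (let i = 0; i < 11; i++) {
  // new Worker("") resolves against the base URL; the load will fail on MIME
  // type, but script loading currently bounces through the main thread, which
  // this loop never yields, so none of the queued clones can be discarded.
  new Worker("").postMessage(img);
}
// 11 worker clones + 1 source copy = 12 * ~1.64 GiB ~= 19.7 GiB.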

Flags: needinfo?(twsmith)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #4)

Can you clarify what's going on in frame 0 of the backtrace?

This is a heap profile created when the specified memory limit is exceeded. A heap profile is used instead of a crash backtrace because it makes the issues uniquely identifiable and bucketable. We are working to make the large number of OOMs we encounter enumerable and hopefully addressable; previously we had one HUGE OOM bucket. That said, if we remove the hard limit or use a soft RSS limit (allow fallible allocations) and collect a backtrace instead of a profile, we are back where we started, with a large opaque bucket of OOMs.

Flags: needinfo?(twsmith)

Thanks for the clarification there and on Matrix! On Matrix, Tyson also indicated there is no "allocation size limit" in effect now, with the change to having ASAN stop the process when it crosses the memory limit. So this is not actionable from the JS structured clone logic, because JS is not receiving any null pointers that it can use to generate an error. I'm removing the needinfo on :sfink.

I've filed bug 1797639 to track the enhancement so that worker loading does not depend on the main thread. This has been on our mental roadmap, but I don't think there was a bug explicitly tracking the enhancement. As posed, this bug is not actionable without first implementing bug 1797639 or defining some kind of new resource-utilization back-pressure or limiting mechanism. Since the test case here is generating large amounts of doomed data, arguably the worker script loading enhancement is the right fix, but we could imagine variants of this test case where the problem would need a back-pressure/resource-limiting fix; for example, if the Worker script were a legal script that would load, and especially if it tried reading the MessageEvent's data. I believe we may be unusual among engines in eagerly performing the structured deserialization, whereas other engines may not experience additional memory use because they don't bother deserializing if the data property is not touched.
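
To illustrate that variant, a hypothetical worker script (not part of this testcase) that would force deserialization even in an engine that is otherwise lazy about it:

// worker.js (hypothetical): a legal script that loads and reads the payload.
onmessage = (e) => {
  // Touching e.data forces the structured deserialization. A lazy engine pays
  // this cost only here; an engine that deserializes eagerly pays it for every
  // queued message, whether or not the data property is ever read.
  const pixels = e.data.data; // the ImageData's Uint8ClampedArray
  postMessage(pixels.length);
};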

Bug Triage Policy Note:

In general, most worker-related OOMs of this form, where a test case creates a worker and sends messages to it without yielding control flow, should be marked as dependent on bug 1797639 for the time being. They can be re-triaged once it's fixed.

Component: JavaScript Engine → DOM: Workers
Depends on: 1797639
Flags: needinfo?(sphink)

Ah, thanks for the detailed commentary. I hadn't really looked at the test case yet.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #4)

Can you clarify what's going on in frame 0 of the backtrace?

I understand the situation here to potentially be:

  • SystemAllocPolicy is probably fallible and the JS code seems to know how to handle that. This is the basis for Jens moving the bug to the JS engine.

Yes, they are all fallible.

(It would be very nice if AllocPolicy.h could state very clearly where it fits within the infallible (crash on failure to allocate) or fallible (return null on failure to allocate) spectrum within the file, potentially referencing the canonical location of appropriate docs.)

Fair point. Nearly all allocation within SpiderMonkey is fallible. We specifically mark regions that are infallible and explicitly crash.

We mostly complain to each other about the other properties of allocation: does it just return nullptr? Or does it set an exception on the context? Or does it report the error without setting an exception? Can fuzzers trigger synthetic OOMs? Should the memory be tracked for triggering GCs? At what granularity? It's kind of a mess.

But you are right, the fact that SpiderMonkey allocation is (almost) all fallible should be described in AllocPolicy.h and MallocProvider.h.

  • We are using ASAN's malloc which is infallibly crashing here rather than returning null so it's impossible for the JS code to do the right thing?

Ah! Yes, if we're sitting on top of an infallible allocator, then none of the OOM handling can ever run.
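
As a script-level illustration of what that fallibility normally buys (a hedged sketch; exact sizes and failure modes are environment-dependent): with a fallible allocator underneath, a data block that cannot be allocated surfaces to script as a catchable RangeError, whereas ASAN's allocator with its default allocator_may_return_null=0 aborts the process before any of that error propagation can run.

try {
  // Hypothetical size; if the engine cannot satisfy it (over an implementation
  // cap, or because the fallible allocation returns null), script sees a
  // RangeError rather than a process crash.
  const huge = new ArrayBuffer(8 * 1024 ** 3); // 8 GiB
} catch (e) {
  console.log("allocation failed fallibly:", e);
}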

I'm a little confused about what's going on with the ImageData size too. https://bugzilla.mozilla.org/show_bug.cgi?id=1715316#c0 implies that JS_CHECK_LARGE_ALLOC should be failing to allocate the 1,682 MiB ImageData allocation (width * height * 4) by way of js_arena_malloc.

Only if the env var MOZ_FUZZ_LARGE_ALLOC_LIMIT is set to something. Does Grizzly or something set this?

Of course if we're using ASAN's allocators instead of the JS allocators, then that mechanism wouldn't be in play, I guess?

The large alloc limit would return nullptr first (or crash, if MOZ_FUZZ_CRASH_ON_LARGE_ALLOC is set).

To :sfink, I understood :jstutte's movement of the bug to primarily be about the code seeming to want to do fallible allocations, but the behavior not being fallible here. It's definitely the case that this test case is absolutely asking for more memory than the process can provide with 6GiB limit given that we're talking about at least 19.7GiB of ImageData buffers (11 worker copies + 1 source window copy; ~2x as much if the deserialization can't simply take ownership of the typed array buffer). new Worker("") will end up trying to load whatever the base URL is, which will fail because the MIME type won't be JS. Notably, our worker script loading must currently involve the main thread so there is a 0% chance of us being able to discard any of those message payloads because the test case does not yield control flow.

Thanks, that makes perfect sense.

I understand Chrome does not have this limitation, so they have a chance to discard the messages very quickly, especially if their allocator does something like notifying memory pressure, then waiting a bit and trying again. They could also be doing clever stuff with copy-on-write or leveraging that the ImageData is initialized by spec to "transparent black" which I believe should be all zeroes which also allows for not needing to dirty the underlying memory pages and other cleverness.

Is that the heart of the bug here? Firefox crashes and Chrome doesn't? Firefox crashes here because (as :asuth says) the user-controllable allocation size exceeds ASAN's limit, and ASAN is configured to crash if that happens. If the ASAN limit is removed, it may not crash, though it may bring down the overall system.

I guess I'm uncertain as to what the objective is here. Is it:

  • to understand why Firefox crashes but Chrome doesn't with the same memory limit? (Are you running Chrome under ASAN with the same limit?) That might be useful to know, though as :asuth describes above, there are many potential optimizations behind that. It may or may not be something we want to replicate.

  • to allow the fuzzer to make it past these sorts of user-controlled huge allocations? It appears that setting MOZ_FUZZ_LARGE_ALLOC_LIMIT to something well below ASAN's limit would accomplish that.

  • to break down the huge OOM bucket by enumerating and categorizing large allocations? It seems like MOZ_FUZZ_LARGE_ALLOC_LIMIT could help with that within the SpiderMonkey codebase, perhaps by augmenting it with an annotation mechanism that says something like "user-controlled ArrayBuffer allocation; make it fail but continue running rather than crashing". (Or just a single MOZ_KNOWN_LARGE_ALLOC macro if the bucket id isn't needed.)

Or most likely, I'm confused and missing something obvious here.

I'm colliding with :asuth's previous comment that clarifies what's going on. I'll add this comment anyway in case it adds anything useful.

(In reply to Steve Fink [:sfink] [:s:] from comment #7)

I guess I'm uncertain as to what the objective is here. Is it:

  • to understand why Firefox crashes but Chrome doesn't with the same memory limit? (Are you running Chrome under ASAN with the same limit?) That might be useful to know, though as :asuth describes above, there are many potential optimizations behind that. It may or may not be something we want to replicate.

This is more of a side note than anything. But if Chrome is unaffected, it raises the question of why.

  • to allow the fuzzer to make it past these sorts of user-controlled huge allocations? It appears that setting MOZ_FUZZ_LARGE_ALLOC_LIMIT to something well below ASAN's limit would accomplish that.

Yes, assuming by "make it past" you mean avoid an OOM rather than recover and continue. MOZ_FUZZ_LARGE_ALLOC_LIMIT does look like it might help avoid this; this test case happens to make a few larger allocations rather than a lot of small ones. The other thing to mention is that previously setting ASAN_OPTIONS=max_allocation_size_mb=512 (now 12GB) had the unexpected side effect of missing other security issues, and MOZ_FUZZ_LARGE_ALLOC_LIMIT might have the same side effect, but I will look into it. Thanks!
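
For reference, a sketch of how that might be wired into the comment 0 invocation (the MOZ_FUZZ_LARGE_ALLOC_LIMIT value is a placeholder; its expected units are defined by the bug 1715316 implementation and are not restated here):

$ MOZ_FUZZ_LARGE_ALLOC_LIMIT=# ASAN_OPTIONS=hard_rss_limit_mb=6144 python -m grizzly.replay ./firefox/firefox testcase.html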

  • to break down the huge OOM bucket by enumerating and categorizing large allocations? It seems like MOZ_FUZZ_LARGE_ALLOC_LIMIT could help with that within the SpiderMonkey codebase, perhaps by augmenting it with an annotation mechanism that says something like "user-controlled ArrayBuffer allocation; make it fail but continue running rather than crashing". (Or just a single MOZ_KNOWN_LARGE_ALLOC macro if the bucket id isn't needed.)

Large allocations are not as frequent and are mostly handled by fallible allocations; for the most part they are not fuzz blockers.

The goal at the moment is to identify the unique ways the fuzzers find to trigger memory-usage issues that negatively impact fuzzing, and to address them. Using heap profiles is the most effective method of bucketing we have at the moment. It enables our tools to automatically reduce the test case and create actionable(?) bugs. I say that because I understand in some cases there isn't much that can be done. But hopefully the issue is either truly a bug or points to a potential optimization, which seems to be the case here.

:asuth and :sfink, thank you for your suggestions and feedback. I hope to make this process as smooth as possible as we continue.

Two questions (without really understanding every detail):

  1. Is this actionable for Workers beyond bug 1797639 (which I understand to be the actionable part for us)?
  2. So, is this bug still in the right component and does it need our triage?
Flags: needinfo?(twsmith)

(In reply to Jens Stutte [:jstutte] from comment #9)

  1. Is this actionable for Workers beyond bug 1797639 (which I understand to be the actionable part for us)?

I believe bug 1797639 covers what is needed.

  2. So, is this bug still in the right component and does it need our triage?

It seems to be in the right place, but :asuth would know for sure.

Flags: needinfo?(twsmith)
Severity: -- → S3
Priority: -- → P3
