Closed Bug 1539133 Opened 2 years ago Closed 1 year ago

Crash [@ pages_decommit] (MOZ_CRASH) with OOM due to large ArrayBuffer

Categories

(Core :: Memory Allocator, defect, P2)

x86_64
Linux
defect

Tracking


RESOLVED FIXED
mozilla72
Tracking Status
firefox-esr68 --- wontfix
firefox68 --- wontfix
firefox70 --- wontfix
firefox71 --- wontfix
firefox72 --- fixed

People

(Reporter: decoder, Assigned: gcp)

Details

(4 keywords, Whiteboard: [jsbugmon:update])

Crash Data

Attachments

(2 files)

The following testcase crashes on mozilla-central revision 4a692c812a3f (build with --enable-posix-nspr-emulation --enable-valgrind --enable-gczeal --disable-tests --disable-profiling --enable-debug --enable-optimize, run with --fuzzing-safe --cpu-count=2 --ion-offthread-compile=off):

function testValue() {
  var ab = new ArrayBuffer(0x100000);
  var dbg = new testValue;
  var target = {};
}
testValue();

Backtrace:

received signal SIGSEGV, Segmentation fault.
#0  pages_decommit (aSize=1048576, aAddr=0x7feff9900000) at memory/build/mozjemalloc.cpp:1285
#1  0x000055555587827e in pages_decommit (aSize=1048576, aAddr=0x7feff9900000) at memory/build/mozjemalloc.cpp:3580
#2  arena_t::PallocHuge (this=<optimized out>, aSize=<optimized out>, aAlignment=<optimized out>, aZero=<optimized out>) at memory/build/mozjemalloc.cpp:3573
#3  0x000055555587a81b in BaseAllocator::calloc (aSize=1, aNum=1048576, this=<synthetic pointer>) at memory/build/mozjemalloc.cpp:3931
#4  Allocator<MozJemallocBase>::moz_arena_calloc (arg2=1, arg1=1048576, aArenaId=aArenaId@entry=10204206196954226593) at memory/build/malloc_decls.h:39
#5  moz_arena_calloc (arg1=arg1@entry=10204206196954226593, arg2=arg2@entry=1048576, arg3=arg3@entry=1) at memory/build/malloc_decls.h:117
#6  0x00005555559bf114 in js_arena_calloc (size=1, nmemb=1048576, arena=10204206196954226593) at dist/include/js/Utility.h:368
#7  js_pod_arena_calloc<unsigned char> (numElems=1048576, arena=10204206196954226593) at dist/include/js/Utility.h:573
#8  js::MallocProvider<JSContext>::maybe_pod_calloc<unsigned char> (arena=10204206196954226593, numElems=1048576, this=0x7ffff5f17000) at js/src/vm/MallocProvider.h:62
#9  JSContext::pod_callocCanGC<unsigned char> (arena=10204206196954226593, numElems=1048576, this=0x7ffff5f17000) at js/src/vm/JSContext.h:223
#10 AllocateArrayBufferContents (nbytes=1048576, cx=0x7ffff5f17000) at js/src/vm/ArrayBufferObject.cpp:450
#11 js::ArrayBufferObject::createZeroed (cx=<optimized out>, cx@entry=0x7ffff5f17000, nbytes=1048576, proto=proto@entry=...) at js/src/vm/ArrayBufferObject.cpp:1244
#12 0x00005555559bf47c in js::ArrayBufferObject::class_constructor (cx=0x7ffff5f17000, argc=<optimized out>, vp=<optimized out>) at js/src/vm/ArrayBufferObject.cpp:441
#13 0x00001527e8fdbdd6 in ?? ()
#14 0x0000000000000000 in ?? ()
rax	0x555557c20240	93825032913472
rbx	0x100000	1048576
rcx	0x7ffff6c1c2dd	140737333281501
rdx	0x555556aa9fd6	93825014603734
rsi	0x7ffff6eeb770	140737336227696
rdi	0x7ffff6eea540	140737336223040
rbp	0x7fffffeaeff0	140737486974960
rsp	0x7fffffeaeff0	140737486974960
r8	0x7ffff6eeb770	140737336227696
r9	0x7ffff7fe6cc0	140737354034368
r10	0x58	88
r11	0x7ffff6b927a0	140737332717472
r12	0x200000	2097152
r13	0x7ff38ec93680	140684049331840
r14	0x7fffffeaf020	140737486975008
r15	0x7feff9800000	140668659826688
rip	0x55555581e3fc <pages_decommit(size_t, void*)+38>
=> 0x55555581e3fc <pages_decommit(size_t, void*)+38>:	movl   $0x0,0x0
   0x55555581e407 <pages_decommit(size_t, void*)+49>:	ud2
Component: JavaScript Engine → Memory Allocator

memory/build/mozjemalloc.cpp:1285 is a MOZ_CRASH on error returned by mmap... why is mmap failing?

Flags: needinfo?(choller)

(In reply to Mike Hommey [:glandium] from comment #1)

memory/build/mozjemalloc.cpp:1285 is a MOZ_CRASH on error returned by mmap... why is mmap failing?

How would I find that out?

Flags: needinfo?(choller) → needinfo?(mh+mozilla)
JSBugMon: Bisection requested, result:
autoBisect shows this is probably related to the following changeset:

The first bad revision is:
changeset:   https://hg.mozilla.org/mozilla-central/rev/becf3a27c2bb
user:        Gian-Carlo Pascutto
date:        Wed Mar 20 10:45:15 2019 +0000
summary:     Bug 1529922 - Add guard pages following huge allocations. r=glandium

This iteration took 532.923 seconds to run.
Whiteboard: [jsbugmon:update,bisect] → [jsbugmon:update]

Start with strace, and see what the error code is?

Flags: needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #4)

Start with strace, and see what the error code is?

The only thing I see in strace is

mmap(0x7f649bb00000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
Flags: needinfo?(mh+mozilla)

That doesn't match the size in the pages_decommit call in your crash stack trace. A more complete strace, along with the stack trace you get from that specific run, would be useful.

Or a complete STR (comment 0 says "run with --fuzzing-safe..." but... run what?)

Flags: needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #6)

That doesn't match the size in the pages_decommit call in your crash stack trace. A more complete strace, along with the stack trace you get from that specific run, would be useful.

Or a complete STR (comment 0 says "run with --fuzzing-safe..." but... run what?)

This bug was reassigned from the JS engine component. All the bugs we file there are reproduced in the JS shell. I will also attach the full strace log.

Flags: needinfo?(mh+mozilla)
Attached file log.gz

Full strace log

Considering how many calls to mmap and munmap there are in this log, I'm tempted to say you're exhausting /proc/sys/vm/max_map_count. It also seems you're allocating a lot of memory.

From this log, it also looks like the JS engine allocating 1MB chunks makes bug 1529922 a large waste of address space. In this particular instance, it makes the exhaustion of memory-related system resources happen faster. You were probably already close to exhaustion before.

Flags: needinfo?(mh+mozilla)

It also seems you're allocating a lot of memory.

A crude calculation says close to 70GB, which implies you were likely mapping more than 35GB before bug 1529922.

Flags: needinfo?(gpascutto)

So the summary is that:

a) We have a clownshoes issue because the code is allocating a 1MB ArrayBuffer, which happens to 1) go through to our allocator unmodified, because ArrayBuffers are special, and 2) hit the worst case there.

b) The code is allocating 35GB. Due to the clownshoes, this doubles to 70GB, i.e. roughly 70,000 1MB mappings, which is over the ~65K mapping limit (/proc/sys/vm/max_map_count) that Linux defaults to.

The clownshoes are unfortunate, but I don't see a simple (or complicated) way around it. Unless there's evidence that this worsens performance in the field, I think this is a WONTFIX from my side. A workaround would be to make the buffer 4096 bytes smaller, or to allocate a bit less in a single run.

If you need to keep the testcase unmodified, I guess you'll want to look into whether the kernel settings on the CI machines can be changed.

Flags: needinfo?(gpascutto)

The problem is not so much that we crash, but rather that there is no visible message/indicator that this is an unhandled out-of-memory crash. In the JS engine, we have the contract that all intended crashes due to out-of-memory emit something on stderr so we can filter it out easily.

Since this is a regular MOZ_CRASH in non-JS code, there are two problems:

  1. In opt builds, the messages are omitted (something I would very much like to change, even if it is only for fuzzing builds, because it makes triaging and signatures a lot easier).

  2. The message itself doesn't have a clear indicator/prefix that this is an OOM situation that can't be easily recovered.

So in the scope of the bug, i.e. issue (2), it's just a matter of changing the MOZ_CRASH() into MOZ_CRASH("OOM: No more mappings available").

By the way, comment 1 says that this happens on the decommit call. In theory we could just ignore the failure there, but then an attacker could disable guard pages by nearly exhausting memory. Probably also not what we want.

glandium, it seems that using the DECOMMIT logic for normal allocations would mean we're liable to hit this max_map_count much faster, because we'll now be DECOMMIT-ing pages in the middle of a normal chunk. Is this an argument for going back to doing guard regions with mprotect instead?

Flags: needinfo?(mh+mozilla)

Is this an argument for going back to doing guard regions with mprotect instead?

That makes no practical difference:

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(int argc, char *argv[]) {
  /* Reserve 2GB of address space, then make every other page
     inaccessible, splitting the region into more and more mappings
     until the vm.max_map_count limit is hit. With an argument, the
     holes are punched via mmap(MAP_FIXED) (the decommit approach);
     without one, via mprotect (the guard-region approach). */
  char *ptr = mmap(NULL, 2UL * 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
  int i;
  for (i = 0; ; i++) {
    int fail;
    if (argc > 1) {
      fail = (mmap(ptr + i * 8192, 4096, PROT_NONE,
                   MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0) == MAP_FAILED);
    } else {
      fail = (mprotect(ptr + i * 8192, 4096, PROT_NONE) != 0);
    }
    if (fail) {
      printf("%d\n", i); /* number of splits before hitting the limit */
      break;
    }
  }
  return 0;
}
$ ./test
32754
$ ./test foo
32755
Flags: needinfo?(mh+mozilla)

Hey, is anyone working on this and what is the priority?

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(mh+mozilla) → needinfo?(choller)

gcp, are you going to take care of fixing part 2 as you noted in comment 13?

Flags: needinfo?(choller) → needinfo?(gpascutto)
Priority: -- → P2
Assignee: nobody → gpascutto
Flags: needinfo?(gpascutto)
Pushed by gpascutto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d65ec56af997
Explicitly say not being able to mmap is likely OOM. r=decoder,glandium
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72

While copy-pasting this code to debug another issue, I found out that the log message is missing a trailing newline.
