Created attachment 8496528 [details]
stack

The upcoming testcase crashes js opt shell on m-c changeset 6a63bcb6e0d3 intermittently with --no-sse3 --no-asmjs --ion-gvn=pessimistic --ion-licm=off --ion-check-range-analysis --ion-eager --no-threads at js::jit::ExecutablePool::toggleAllCodeAsAccessible

(gdb) bt 5
#0  js::jit::ExecutablePool::toggleAllCodeAsAccessible (this=<optimized out>, accessible=accessible@entry=false) at /home/gkwong/trees/mozilla-central/js/src/jit/ExecutableAllocatorPosix.cpp:90
#1  0x00007f2fc5f4f5d4 in js::jit::ExecutableAllocator::toggleAllCodeAsAccessible (this=<optimized out>, accessible=accessible@entry=false) at /home/gkwong/trees/mozilla-central/js/src/jit/ExecutableAllocator.cpp:73
#2  0x00007f2fc5f5000a in ensureIonCodeProtected (rt=0x7f2fc8eaf860, this=0x7f2fc8eaf860) at /home/gkwong/trees/mozilla-central/js/src/jit/Ion.cpp:377
#3  js::jit::RequestInterruptForIonCode (rt=rt@entry=0x7f2fc8e57fb0, mode=mode@entry=JSRuntime::RequestInterruptAnyThread) at /home/gkwong/trees/mozilla-central/js/src/jit/Ion.cpp:480
#4  0x00007f2fc6216d6f in JSRuntime::requestInterrupt (this=0x7f2fc8e57fb0, mode=JSRuntime::RequestInterruptAnyThread) at /home/gkwong/trees/mozilla-central/js/src/vm/Runtime.cpp:551
(More stack frames follow...)

My configure flags are:

AR=ar sh /home/gkwong/trees/mozilla-central/js/src/configure --disable-debug --enable-optimize --enable-nspr-build --enable-more-deterministic --with-ccache --enable-gczeal --enable-debug-symbols --disable-tests

Jan, is it possible for you to take a look at this hard-to-reproduce bug?
I've emailed Jan more information along with the coredump. This bug is one of the causes of a particular rare hard-to-reproduce Linux crash, so it'll be nice to have this fixed. (CC'ing more JIT gurus)
I can reproduce this on Linux. We hit the MOZ_CRASH in ExecutablePool::toggleAllCodeAsAccessible:

    if (mprotect(begin, size, flags))
        MOZ_CRASH();

errno is ENOMEM, which apparently means:

- Addresses in the range [addr, addr+len-1] are invalid for the address space of the process, or specify one or more pages that are not mapped.
- Internal kernel structures could not be allocated.

Will see if I can find out more.
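For illustration only, here is a minimal sketch of checking the mprotect result and keeping errno instead of crashing outright. The helper names try_protect and protect_unmapped_page are hypothetical and are not SpiderMonkey code; the second function reproduces the "pages that are not mapped" flavor of ENOMEM on Linux by protecting a page that has just been unmapped.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper (not SpiderMonkey code): calls mprotect and returns 0 on
// success or the errno value on failure, logging the reason to stderr.
int try_protect(void* begin, std::size_t size, int flags) {
    if (mprotect(begin, size, flags) == 0)
        return 0;
    int err = errno;  // save errno before another call can overwrite it
    std::fprintf(stderr, "mprotect(%p, %zu) failed: %s\n",
                 begin, size, std::strerror(err));
    return err;
}

// Hypothetical demo: map one page, unmap it, then try to change its
// protection. On Linux the range is no longer mapped, so ENOMEM is expected.
// Returns -1 if the initial mmap itself fails.
int protect_unmapped_page() {
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    std::size_t len = static_cast<std::size_t>(page_size);
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    munmap(p, len);
    return try_protect(p, len, PROT_NONE);
}
```

A caller could branch on the returned value (for example, treating ENOMEM like an OOM condition) rather than aborting; whether that is feasible at this call site is a separate question.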
Looks like we're running out of virtual memory (mappings). When mprotect fails and we crash, the process is using 127.8 TB (!) virtual memory. Virtual memory is used like this (thousands of those):

Start Addr     End Addr       Size         Offset  objfile
0x2f5002e000   0x2f50030000   0x2000       0x0
0x2f50030000   0x305002f000   0xfffff000   0x0
0x305002f000   0x3050031000   0x2000       0x0
0x3050031000   0x3150030000   0xfffff000   0x0
0x3150030000   0x3150032000   0x2000       0x0
0x3150032000   0x3250031000   0xfffff000   0x0
...

$ cat /proc/sys/vm/max_map_count
65530
$ wc -l /proc/2680/maps
65531 /proc/2680/maps

It's probably caused by SharedArrayBuffer - when I run the test on OS X we spend a lot of time in mmap under SharedArrayBufferObject::New...
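The same comparison can be made from inside a process. The sketch below is Linux-only; it counts the lines of a maps file (one line per mapping) and reads the limit from /proc/sys/vm/max_map_count. The function names are hypothetical, and both readers return -1 when the file cannot be read.

```cpp
#include <fstream>
#include <istream>
#include <string>

// Counts lines in a stream; each line of /proc/<pid>/maps is one mapping.
long count_lines(std::istream& in) {
    long n = 0;
    std::string line;
    while (std::getline(in, line))
        ++n;
    return n;
}

// Number of mappings listed in the given maps file, or -1 if it cannot be
// opened. Defaults to the calling process.
long count_mappings(const std::string& maps_path = "/proc/self/maps") {
    std::ifstream in(maps_path);
    if (!in)
        return -1;
    return count_lines(in);
}

// Per-process mapping limit as reported by the kernel, or -1 if unavailable.
long read_max_map_count(const std::string& path = "/proc/sys/vm/max_map_count") {
    std::ifstream in(path);
    long limit = -1;
    if (!(in >> limit))
        return -1;
    return limit;
}
```

A process could compare count_mappings() against read_max_map_count() before reserving more regions, which is roughly what the cat and wc -l commands above do by hand.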
There's a known - somewhat controversial - issue with SharedArrayBuffer that's logged as bug 1068684: we reserve 4GB regions to make asm.js happy on 64-bit systems. It turns out that on some platforms, certain kinds of address space turbulence lead to crashes in munprotect [sic]; this is described in bug 1008613 comment 8 and is further discussed on bug 1068684. It is possible that the problem here with mprotect is similar. cc'ing Nick, who'll be curious to know this.
I doubt this is security-sensitive, since we hit MOZ_CRASH().
Jan/Gary, is there a test case that you can attach?
(In reply to Lars T Hansen [:lth] from comment #7)
> Jan/Gary, is there a test case that you can attach?

I've forwarded the files to you. Also, opening up as per comment 6.
The crash reproduces easily on my 14.10 Ubuntu (Linux 3.16) AMD-64 system. If I change the code for SharedArrayBuffer so that it does not allocate a 4GB region, but only what it needs for the array, the test case runs to completion as expected.

We can probably spin this a couple of ways:

- GC is not happening soon enough to recover scarce system resources. The crash in fact happens as a sweep is ongoing; had it finished, that sweep might have cleared up enough dead arrays to make the crash not happen. (At the time of the crash, no shared arrays had been freed.) However we don't know that, and in any case nothing prevents the program from holding on to more arrays and crashing anyway.

- There needs to be a special allocator for the asm.js use case so that programs that want to use shared memory "casually" or from plain JS don't easily run into this problem. (Bug 1068684 argues exactly that.) However that won't prevent the problem from happening with asm.js, possibly.

- There needs to be a way for the asm.js use case to explicitly deallocate the memory, to reduce the risk of resource exhaustion. Not sure what the complexity of that is; usability is probably poor.

- It is in any case inappropriate to crash. Instead we should recover (insert details here) and abort the script in some reasonable way, a la OOM aborts. Not sure how we'll do that yet: we reach the crasher through handling an interrupt, so recovery and abort may or may not be possible.

I'll try to dig a little deeper into the exact cause of the crash to see if there are other clues.
http://stackoverflow.com/questions/8799481/single-process-maximum-possible-memory-in-x64-linux: "The Debian port documentation for the AMD64 port specifically mentions that the per-process virtual address space limit is 128TiB (twice the physical memory limit)".

If my ad-hoc accounting is right there are roughly (varies by one or two) 32714 SharedArrayRawBuffers live when we crash, each is 4GB+4KB in size. That yields a total of 34303147978 pages just for SharedArrayRawBuffer data, leaving a "mere" 56590390 pages for everything else, including shared libraries, code, and data. While that's about 216GB (56590390 pages of 4KB each) it's plain that we are bumping up against system limits and the error code (ENOMEM, signifying either a bad address or the exhaustion of internal system resources) is appropriate.

After grubbing through the kernel sources it seems clear to me that the process of changing protections needs some temporary page table space, so if we're up against the limit it is probably easy to exhaust it.

Additionally there is another limit, the maximum number of mappings, which is 65530 on my system. It's possible we're bumping up against that but it seems less likely, there would have to be an unreasonable number of individual mappings for code and data pages for us to have another ~33000 mappings.

Whether it's munmap, mprotect, or some other call that fails is probably somewhat arbitrary and dynamically determined.
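The page accounting above can be checked mechanically. This is only a sketch of the arithmetic, assuming 4KB pages, a 128TiB (2^47 byte) address space, and buffers of exactly 4GB+4KB; the constant and function names are made up for the example.

```cpp
#include <cstdint>

constexpr std::uint64_t kPageSize = 4096;                        // assumed 4KB pages
constexpr std::uint64_t kAddressSpace = std::uint64_t{1} << 47;  // 128TiB
constexpr std::uint64_t kBufferSize =
    (std::uint64_t{1} << 32) + kPageSize;                        // 4GB + 4KB

constexpr std::uint64_t pages(std::uint64_t bytes) {
    return bytes / kPageSize;
}

// Pages consumed by `live_buffers` SharedArrayRawBuffers.
constexpr std::uint64_t buffer_pages(std::uint64_t live_buffers) {
    return live_buffers * pages(kBufferSize);
}

// Pages left for everything else. Only meaningful while the buffers still
// fit in the address space; otherwise the subtraction wraps around.
constexpr std::uint64_t remaining_pages(std::uint64_t live_buffers) {
    return pages(kAddressSpace) - buffer_pages(live_buffers);
}

static_assert(buffer_pages(32714) == 34303147978ULL, "buffer page count");
static_assert(remaining_pages(32714) == 56590390ULL, "leftover page count");
```

The leftover 56590390 pages correspond to 231794237440 bytes, or a little under 216GiB.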
(In reply to Lars T Hansen [:lth] from comment #10)
> Additionally there is another limit, the maximum number of mappings, which
> is 65530 on my system. It's possible we're bumping up against that but it
> seems less likely, there would have to be an unreasonable number of
> individual mappings for code and data pages for us to have another ~33000
> mappings.

See comment 4, I thought we used 2 mappings per SharedArrayBuffer?
(In reply to Jan de Mooij [:jandem] from comment #11)
> (In reply to Lars T Hansen [:lth] from comment #10)
> > Additionally there is another limit, the maximum number of mappings, which
> > is 65530 on my system. It's possible we're bumping up against that but it
> > seems less likely, there would have to be an unreasonable number of
> > individual mappings for code and data pages for us to have another ~33000
> > mappings.
>
> See comment 4, I thought we used 2 mappings per SharedArrayBuffer?

Ah, interesting - I don't know, I'd have to dig into the kernel. There's a single allocation per SharedArrayRawBuffer, but parts of it are protected one way and parts are protected another way (accessible/inaccessible), so there might be two mappings in the kernel tables.

I see your point about the number being at the limit. A pretty curious coincidence to reach both limits at the same time.
I ran another experiment. This allocates the amount needed for each allocation, plus one page; it then sets the protections for the extra page to something else than the rest. We crash in short order with the same symptoms as before, but the virtual size is just over 1GB, and /proc/pid/maps indicates that we ran out of mappings. The number of live objects is 32710.
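The effect this experiment relies on, that giving part of one allocation a different protection makes the kernel track it as more than one mapping, can be observed directly on Linux. The sketch below is not the code from the experiment; the function names are hypothetical. It maps a few pages read/write, makes a sub-range inaccessible, and reports how many entries /proc/self/maps gained.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Counts newline characters in /proc/self/maps (one per mapping) using raw
// read() calls, so that counting does not itself allocate memory.
// Returns -1 on error.
long count_own_mappings() {
    int fd = open("/proc/self/maps", O_RDONLY);
    if (fd < 0)
        return -1;
    char buf[4096];
    long lines = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        for (ssize_t i = 0; i < n; ++i)
            if (buf[i] == '\n')
                ++lines;
    }
    close(fd);
    return n < 0 ? -1 : lines;
}

// Maps `total_pages` pages read/write, makes pages [first, first + count)
// inaccessible, and returns how many mappings the mprotect call added.
// Returns -1 if any step fails.
long mappings_added_by_mprotect(std::size_t total_pages, std::size_t first,
                                std::size_t count) {
    assert(first + count <= total_pages);
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* base = mmap(nullptr, total_pages * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;
    long before = count_own_mappings();
    int rc = mprotect(static_cast<char*>(base) + first * page, count * page,
                      PROT_NONE);
    long after = count_own_mappings();
    munmap(base, total_pages * page);
    if (rc != 0 || before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Protecting the middle page of a three-page region, mappings_added_by_mprotect(3, 1, 1), should add two mappings, because the region is split into three. Protecting only a trailing page, as in the experiment, would typically add one, though the kernel may merge the result with a neighbouring mapping that has matching protections.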
And yet one more experiment. I added 40MB to each allocation, so as to reach the VM limit before the mappings limit. In this case the GC eventually kicks in and collects all the objects and the test case runs to completion. (I don't know if I would count on this, but it's a good sign - and it ought to work, I guess, if the GC does not run out of memory while in a critical state.)
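Rough arithmetic shows why the two limits sit so close together, and why padding each allocation changes which one is hit first. The sketch assumes a 128TiB address space, a limit of 65530 mappings, two mappings per buffer, and that the 40MB was added on top of the 4GB+4KB reservation; it ignores the mappings and address space used by the rest of the process. All names are invented for the example.

```cpp
#include <cstdint>

// How many buffers fit before each limit is reached.
struct BufferLimits {
    std::uint64_t by_address_space;
    std::uint64_t by_mappings;
};

constexpr BufferLimits buffers_until_limit(
        std::uint64_t bytes_per_buffer,
        std::uint64_t mappings_per_buffer,
        std::uint64_t address_space = std::uint64_t{1} << 47,  // 128TiB
        std::uint64_t max_map_count = 65530) {
    return BufferLimits{address_space / bytes_per_buffer,
                        max_map_count / mappings_per_buffer};
}

constexpr std::uint64_t kGiB = std::uint64_t{1} << 30;
constexpr std::uint64_t kMiB = std::uint64_t{1} << 20;
constexpr std::uint64_t kKiB = std::uint64_t{1} << 10;

// Original layout: 4GB + 4KB per buffer, two mappings each.
constexpr BufferLimits kOriginal = buffers_until_limit(4 * kGiB + 4 * kKiB, 2);
// Padded layout (assumed): an extra 40MB per buffer.
constexpr BufferLimits kPadded =
    buffers_until_limit(4 * kGiB + 40 * kMiB + 4 * kKiB, 2);

static_assert(kOriginal.by_address_space == 32767, "address space limit");
static_assert(kOriginal.by_mappings == 32765, "mapping limit");
static_assert(kPadded.by_address_space == 32451, "padded address space limit");
```

With the original layout the two limits differ by only two buffers (32765 versus 32767), which matches the observation that both were reached at nearly the same time. With the padding, address space runs out at about 32451 buffers, before the mapping table fills.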
Created attachment 8498892 [details] [diff] [review] Code for experiments (see the #defines that control behavior)
Note: This should actually be a problem on 32-bit Linux too, where we'll run up against the limit on the number of mappings in short order. It may be necessary to use a different allocation strategy on 32-bit Linux than one mapping per shared memory segment, or to institute the same 1000-per-process limit that we're talking about on bug 1068684. At least that would make programs behave similarly across platforms. (I'll run some tests when I have a moment.)
Grabbing a reasonable-looking number off the 'net it appears that the maximum per-process address space on Linux-32 is 3GB (including space for shared libraries and everything else). The smallest SAB is 8KB. 8KB * 64K mappings = 512MB, which means that exhausting the mappings table before the virtual address space or physical memory is easy on 32-bit Linux. And that's assuming the mappings table on a 32-bit system is as large as it is on a 64-bit system.
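The 32-bit estimate can be written out the same way. This sketch assumes one mapping per 8KB buffer, a table of 64K (65536) mappings, and a 3GB address space, as in the comment above; the names are illustrative only.

```cpp
#include <cstdint>

// Address space consumed by the time the mapping table is full, assuming
// every mapping is a minimum-size buffer.
constexpr std::uint64_t bytes_at_mapping_limit(std::uint64_t bytes_per_mapping,
                                               std::uint64_t max_mappings) {
    return bytes_per_mapping * max_mappings;
}

constexpr std::uint64_t kSmallestSab = 8 * 1024;          // 8KB
constexpr std::uint64_t kMappingTable = 64 * 1024;        // 64K mappings
constexpr std::uint64_t kLinux32Space = 3ULL << 30;       // 3GB

static_assert(bytes_at_mapping_limit(kSmallestSab, kMappingTable) ==
                  512ULL * 1024 * 1024,
              "512MB used when the mapping table fills");
static_assert(bytes_at_mapping_limit(kSmallestSab, kMappingTable) <
                  kLinux32Space,
              "mapping table fills before the address space does");
```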
The supposed fix for the problem reported in this bug has landed: https://bugzilla.mozilla.org/show_bug.cgi?id=1068684#c27. Gary, can you re-test and close if things look OK? (I have tested locally and found it to work fine.)
(In reply to Lars T Hansen [:lth] from comment #18)
> The supposed fix for the problem reported in this bug has landed:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1068684#c27. Gary, can you
> re-test and close if things look OK? (I have tested locally and found it to
> work fine.)

I'd say we should just go ahead and resolve this as FIXED by bug 1068684, the fuzzers will find new bugs as they pop up again. Moreover you've retested on your end. :)
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED