Closed Bug 1517412 Opened 5 years ago Closed 5 years ago

wasm jit-tests failing with OOM on some ARM64 hardware

Categories

(Core :: JavaScript: WebAssembly, enhancement, P1)

ARM64
Android
enhancement

Tracking

RESOLVED FIXED
mozilla66
Tracking Status
firefox66 --- fixed

People

(Reporter: lth, Assigned: lth)

Details

(Whiteboard: [arm64:m3])

Attachments

(2 files)

See bug 1513231 comment 5, bug 1513231 comment 7.

These tests fail with OOM:

jit-test/tests/wasm/atomic.js arm64
jit-test/tests/wasm/baseline-abs-addr-opt.js arm64
jit-test/tests/wasm/bce.js arm64
jit-test/tests/wasm/memory.js arm64

Bob writes:

"I backed out those skips and did a full try run with rebuild 20 which was definitely overkill but guaranteed to get everything:
<https://treeherder.mozilla.org/#/jobs?repo=try&revision=879e771626c04bace4d2c0a57e957ae6d1efa38b>.

Looks like all "out of memory":
<https://taskcluster-artifacts.net/AmQku0U-QymQOVSq7ETc7A/0/public/logs/live_backing.log>

baseline-abs-addr-opt.js, memory.js expect a RuntimeError but the others have uncaught out of memory exceptions."

We should have no OOM failures on these tests, so it's a high priority to figure this out.  There are several reasons why OOM may be happening:

- it could be an artifact of our less-than-perfect mmap replacement on ARM64 (bug 1441473, bug 1502733)
- it could be that we run out of virtual address space because of some OS limitation without noticing (bug 1507759)
- we could be seeing limitations similar to those we see on Windows 7 re the number of mappings (bug 1230168)
- there could be generic GC scheduling issues that prevent speedy reaping of memories (these tests allocate a lot of memories)
FWIW, as noted in bug 1502733, it appears that AArch64 on Linux is (by default) limited to 39-bit addresses. That means we have room for about 127 4GiB regions. I don't know to what extent wasm relies on these regions or how much these tests stress that limit, though.
Well, in addition to that we allocate 6GiB regions for wasm, further reducing the number of available regions even if they are perfectly packed.  The number is actually so low that we may eventually see real wasm content that bumps up against the limit.

All that said, these tests do not OOM on my ARM64 Linux dev system, so there's something additionally problematic about Android here.

I suppose we could be running out of physical backing store, if the device does not support paging and the GC does not reclaim unused wasm mappings quickly enough.  My dev system has a lot of RAM and additionally supports paging.
(In comment #0 the win7 bug is actually bug 1230162.)
Well, I can (somewhat unsurprisingly) provoke a similar error on x64 Linux by setting the amount of virtual memory to 16GB, so I think the first two orders of business are to determine whether there are VM limits on the device (i.e., ulimit -v) and whether the OOM failure occurs in any of the buffer mapping operations.  We already have some logging that can probably be enabled for the test.
Adding [arm64:m3] whiteboard tag because this wasm bug doesn't block publishing ARM64 Fennec Nightly builds, but it might block shipping to Beta and Release.
Whiteboard: [arm64:m3]
Alright, some hard data:

There's no limit (in the sense of ulimit) for virtual memory on the device; I cat'd /proc/{pid}/limits at the time of the OOM and it's unremarkable.

The device is not starved for memory (27K pages free according to /proc/vmstat at the time of the OOM).

It's almost certain that we're running out of virtual address space:  If Emanuel is correct in saying that the device can accommodate 127 4GB mappings, then it can accommodate at most 2/3*127 = 84+2/3 6GB mappings.  A log from a failed run shows that it fails on the 85th allocation.  This seems pretty conclusive.  (I didn't log deallocations, so it's not completely conclusive.)
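
(Worked check of that arithmetic, as a small standalone C++ sketch; it assumes the full 39-bit user address space is usable, so the counts come out one higher than the practical 127/84 figures above.)

  // Rough capacity of a 39-bit address space, matching the figures above.
  // Assumption: all 2^39 bytes are usable; in practice other mappings shave
  // a little off, hence ~127 and ~84 in the discussion.
  #include <cstdint>
  #include <cstdio>

  int main() {
    const uint64_t addressSpace = uint64_t(1) << 39;  // 512 GiB
    const uint64_t gib = uint64_t(1) << 30;
    std::printf("4GiB regions: %llu\n",
                (unsigned long long)(addressSpace / (4 * gib)));  // 128
    std::printf("6GiB regions: %llu\n",
                (unsigned long long)(addressSpace / (6 * gib)));  // 85
    return 0;
  }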

The next thing to try would therefore be a last-ditch GC before OOM'ing after a huge allocation, at least as an experiment, to see if this will free up enough memory to let us run.
We already have a mechanism that can sort-of deal with this situation, namely the liveBufferCount mechanism in ArrayBufferObject.cpp and the MaximumLiveMappedBuffers constant.  However, on Android we'd end up setting that constant to something in the range 50-80, which is very low: it is a per-process limit, and on Android we probably won't have full process separation, so it's not clear how appropriate the mechanism is.

It seems probable that we'll want to implement explicit bounds checking for ARM64-on-Android to deal with the VM limit, so as not to create a situation where content suddenly fails.  Or we'll need to tie into some tab-unloading mechanism.  (I have 10-50 open tabs in Firefox on my Android devices; I don't know if this is typical but I won't be surprised if it is, since the UI hides tabs and tends to make dead tabs common.)
Unsurprisingly, with a last-ditch GC before signaling OOM the test cases all pass: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e62640bffdccb0de7a7006cee7102ad3ad442d6f&selectedJob=219934930.  It's risky to land that as our fix; less so to land it under an ARM64 ifdef probably.  But the reason we have the liveBufferCount mechanism in the first place is so that we don't need to have this kind of last-ditch GC...

I'll run a final experiment with MaximumLiveMappedBuffers = 40.  Normally this should let the tests pass though it would be easy to write a test that makes it fail.  It may be appropriate enough for Android as a short-term fix.
To close that loop: I set MaximumLiveMappedBuffers = 40 and then the GC trigger mechanism parameters to 20,20,20 (it's obvious when you look at the code but I'll attach a patch too).  This was insufficient to fix the problem; presumably GC did not have time to finish by the time we tried to allocate the memory that caused the OOM.  (Of course, now the OOM limit is 40 memories and not 84, so that makes the problem harder.)

Supposedly Windows on ARM64 has acres of memory available, see bug 1502733 comment 27, though practical verification would be nice.  I will attempt a try run to verify.

So in the short term, we can implement a point-fix for either Android on ARM64 (fairly safe) or Linux on ARM64 (slightly edgier) where either:

(a) we set the GC control parameters to realistic, small values /and/ in addition do a last-ditch GC if allocation should fail.
(b) we just do a last-ditch GC if allocation should fail, since the GC control we have right now assumes a fair amount of headroom to give the GC time to do its thing.

To mitigate the risks inherent in the last-ditch GC we might disable such GCs for the runtime in question if a GC fails to reclaim enough memory.  Or disable it for a time.  None of the options are great here.
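
For concreteness, the last-ditch variant (option (b)) is roughly this shape; ReserveWasmRegion and RunLastDitchGC are hypothetical stand-ins, not the actual SpiderMonkey entry points:

  // Illustrative sketch only: ReserveWasmRegion and RunLastDitchGC are
  // hypothetical stand-ins for the real allocation and GC entry points in
  // ArrayBufferObject.cpp.
  #include <cstddef>

  struct JSContext;                       // opaque here
  void* ReserveWasmRegion(size_t bytes);  // hypothetical mmap/VirtualAlloc wrapper
  void RunLastDitchGC(JSContext* cx);     // hypothetical synchronous, full GC

  // Option (b): if the huge-memory reservation fails, do one synchronous GC to
  // reap dead wasm memories, then retry exactly once before reporting OOM.
  void* AllocateWasmHugeMemory(JSContext* cx, size_t reservationSize) {
    if (void* p = ReserveWasmRegion(reservationSize)) {
      return p;
    }
    RunLastDitchGC(cx);
    return ReserveWasmRegion(reservationSize);  // nullptr here means real OOM
  }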

We may also try to move our memory use down from 6GB to something closer to 4GB; this will generally help.

In the long term we must support non-huge memory for wasm on ARM64, and opt into that either at compile time if we think it likely that the platform will always want it (Android) or at run-time if we think it may vary with the configuration (Linux generally; for example, my ARM64 development system has not had this problem, IIRC).  But that's bug 1507765.
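
For context, a conceptual sketch of what the explicit-bounds-check alternative means for each memory access (hypothetical helper names, not the actual generated code; with huge memory the check is simply omitted because the 4GiB+guard reservation absorbs any 32-bit index):

  // Conceptual sketch of an explicitly bounds-checked wasm load.
  // Trap() is a hypothetical stand-in for raising a wasm RuntimeError.
  #include <cstdint>
  #include <cstring>

  [[noreturn]] void Trap();  // hypothetical

  uint32_t LoadU32(const uint8_t* base, uint64_t memoryLength, uint32_t index) {
    // Explicit check, needed when we cannot reserve a huge guarded region.
    if (uint64_t(index) + sizeof(uint32_t) > memoryLength) {
      Trap();  // out of bounds
    }
    uint32_t value;
    std::memcpy(&value, base + index, sizeof(value));
    return value;
  }
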
There's actually some interesting behavior here... if I just set MaximumLiveMappedBuffers to 40 then the failing test (wasm/bce.js) runs to completion, albeit slowly; in this case we will trigger a bunch of stop-the-world collections as we keep hitting that limit.  But if I set the GC control parameters to reasonable values to trigger GC so as to avoid an expensive stop-the-world collection, I run out of memory.  (All this on x64 with the ulimit for virtual memory set to 4TB, which is what Android uses.)  Superficially it could look like the stop-the-world request is not honored if another GC is underway.
The GC control parameters actually need to have certain relationships among their values that are not well documented and which I did not get right the first time.  If we set:

  MaximumLiveMappedBuffers = 75
  StartTriggeringAtLiveBufferCount = 15
  StartSyncFullGCAtLiveBufferCount = 60 
  AllocatedBuffersPerTrigger = 15

then the first incremental GC will be triggered with 30 memories live and we'll trigger a nonincremental one every time the allocator finds that more than 60 are live; plus, the nonincremental policy is unlikely to interfere too much with the incremental policy unless the allocation rate is very high.
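
To make those relationships concrete, a rough sketch of the allocation-time policy as I understand it (illustrative only, not the actual ArrayBufferObject.cpp code; the Request* functions are hypothetical stand-ins):

  // Illustrative sketch of the allocation-time policy with the values above.
  static const int MaximumLiveMappedBuffers = 75;          // hard cap: fail the allocation
  static const int StartTriggeringAtLiveBufferCount = 15;  // begin incremental GCs
  static const int StartSyncFullGCAtLiveBufferCount = 60;  // fall back to sync full GCs
  static const int AllocatedBuffersPerTrigger = 15;        // allocations between triggers

  void RequestIncrementalGC();         // hypothetical
  void RequestNonIncrementalFullGC();  // hypothetical

  // Returns false if the allocation must be reported as OOM.
  bool OnMappedBufferAllocation(int liveBufferCount, int& allocatedSinceLastTrigger) {
    if (liveBufferCount >= MaximumLiveMappedBuffers) {
      return false;
    }
    if (liveBufferCount >= StartSyncFullGCAtLiveBufferCount) {
      RequestNonIncrementalFullGC();   // stop-the-world collection
    } else if (liveBufferCount >= StartTriggeringAtLiveBufferCount &&
               ++allocatedSinceLastTrigger >= AllocatedBuffersPerTrigger) {
      // With 15/15 this first fires at roughly 30 live memories.
      RequestIncrementalGC();
      allocatedSinceLastTrigger = 0;
    }
    return true;
  }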

Now if we also set the vlimit to 4TB (this is on x64 Linux), then (a) it runs wasm/bce.js to completion and (b) 2/3 of all triggered GCs are incremental GCs, "only" 1/3 are full GCs.

We can presumably scale these parameters; 75 does not leave us that much headroom if the hard limit is 84.  Reducing the size of the reservation on ARM64 to closer to 4GB would make 75 a more reasonable value for sure.

The test case appears to have many more concurrently live memories than I would expect from the code; this is probably because they are allocated during an incremental GC and hence allocated black.  Probably not much to be done about that; we'll fall back on a nonincremental GC quickly if things get out of hand.
This is green on try.  It's the minimal, ultra-conservative fix: Android-only.  For the generic linux case I'll want to do some more testing.  I'll also look into reducing the size of the HUGE_MEMORY reservation for arm64.
Attachment #9034442 - Flags: review?(luke)
(In reply to Lars T Hansen [:lth] from comment #13)
> Created attachment 9034442 [details] [diff] [review]
> bug1517412-too-many-memories.patch

Don't forget to remove the comments like  // skip arm64 due to bug 1513231
(In reply to Lars T Hansen [:lth] from comment #9)
> Supposedly Windows on ARM64 has acres of memory available, see bug 1502733
> comment 27, though practical verification will be nice.

If you're talking about comment 28, that was just me trying the function out on my desktop and confirming that I have 47-bit pointers on x86_64 - I don't know what the limits are on Windows ARM64. Since we aren't seeing crashes related to 48-bit pointers there right now as far as I know, Windows ARM64 might also have 39-bit addresses - but it's also possible that VirtualAlloc just isn't giving us addresses with the highest bits set (since it allocates addresses in increasing order).
(In reply to Bob Clary [:bc:] from comment #14)
> (In reply to Lars T Hansen [:lth] from comment #13)
> > Created attachment 9034442 [details] [diff] [review]
> > bug1517412-too-many-memories.patch
> 
> Don't forget to remove the comments like  // skip arm64 due to bug 1513231

Thanks, I had missed one - updated locally.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #15)
> (In reply to Lars T Hansen [:lth] from comment #9)
> > Supposedly Windows on ARM64 has acres of memory available, see bug 1502733
> > comment 27, though practical verification will be nice.
> 
> If you're talking about comment 28, that was just me trying the function out
> on my desktop and confirming that I have 47-bit pointers on x86_64 - I don't
> know what the limits are on Windows ARM64. Since we aren't seeing crashes
> related to 48-bit pointers there right now as far as I know, Windows ARM64
> might also have 39-bit addresses - but it's also possible that VirtualAlloc
> just isn't giving us addresses with the highest bits set (since it allocates
> addresses in increasing order).

Ah, my bad.  Well, with this patch we can opt in for Windows fairly easily once we figure it out.

It doesn't look like there are any tests being run on win10-aarch64 on integration, though, at least ./mach try fuzzy does not have anything plausible (even trying various patterns), nor do I see anything on taskcluster for normal CI testing.

Bob/Chris, any idea how I might go about doing a jit-test run on win-aarch64 hardware without having a system locally?  I can probably synthesize something to be run in a browser if we can find a volunteer to test...
Flags: needinfo?(cpeterson)
Flags: needinfo?(bob)
No, sorry but maybe jmaher can help.
Flags: needinfo?(bob) → needinfo?(jmaher)
win/arm64 - that is the future, hopefully something in Q1. Right now we don't even know if the tests and tools they depend on run on win/arm64, and we are awaiting laptops to be delivered to validate that.  I expect some progress by the end of January; right now we have android/arm64 as our only option.
Flags: needinfo?(jmaher)
Comment on attachment 9034442 [details] [diff] [review]
bug1517412-too-many-memories.patch

Review of attachment 9034442 [details] [diff] [review]:
-----------------------------------------------------------------

Makes sense; thanks for tracking this down!
Attachment #9034442 - Flags: review?(luke) → review+

(In reply to Lars T Hansen [:lth] from comment #17)

> It doesn't look like there are any tests being run on win10-aarch64 on
> integration, though, at least ./mach try fuzzy does not have anything
> plausible (even trying various patterns), nor do I see anything on
> taskcluster for normal CI testing.
> 
> Bob/Chris, any idea how I might go about doing a jit-test run on win-aarch64
> hardware without having a system locally? I can probably synthesize
> something to be run in a browser if we can find a volunteer to test...

I don't think win-aarch64 is built or tested in automation yet. I'm not involved with the win-aarch64 project, so I don't know much more.

I recommend asking :froydnj or someone in the #arm64 Slack channel.

Flags: needinfo?(cpeterson) → needinfo?(nfroyd)
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/81ddd47b2cae
Adapt wasm buffer management parameters for android-on-arm64.  r=luke

Nathan + David, some questions, cf. comment 17:

  • Are you able to run jit-tests locally on a Win+ARM64 device (js shell test)? If so, can you try to run jit-test/tests/wasm/memory.js to see if it OOMs?
  • If not, can you try to figure out from probing the system or from asking people you know or in other ways how much virtual address space each process actually gets on Win-on-ARM64?
  • Do we know the pagesize for these devices?
  • If all else fails, get back to me and I will attempt to construct a test that we might run in a browser to figure this out
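
In case it helps, a minimal probe for the page-size and address-space questions above (plain Win32 GetSystemInfo; just a sketch to compile and run on such a device):

  // Minimal Win32 probe: page size, allocation granularity, and the highest
  // usable application address (an upper bound on per-process address space).
  #include <windows.h>
  #include <cstdio>

  int main() {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    std::printf("page size:              0x%lx\n", si.dwPageSize);
    std::printf("allocation granularity: 0x%lx\n", si.dwAllocationGranularity);
    std::printf("max application addr:   %p\n", si.lpMaximumApplicationAddress);
    return 0;
  }
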
Flags: needinfo?(dmajor)
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
  • Are you able to run jit-tests locally on a Win+ARM64 device (js shell
    test)? If so, can you try to run jit-test/tests/wasm/memory.js to see if
    it OOMs?

$ jit_test.py --tbpl .../js.exe wasm\memory.js
[8|0|0|0] 100% ======================================================>| 3.3s
PASSED ALL

  • Do we know the pagesize for these devices?

Here's the struct that js::gc::InitMemorySubsystem() receives from GetSystemInfo(). 4k pages, 64k allocation granularity, 47-bit pointers.

js!_SYSTEM_INFO
+0x000 dwOemId : 0xc
+0x000 wProcessorArchitecture : 0xc
+0x002 wReserved : 0
+0x004 dwPageSize : 0x1000
+0x008 lpMinimumApplicationAddress : 0x0000000000010000 Void
+0x010 lpMaximumApplicationAddress : 0x00007ffffffeffff Void
+0x018 dwActiveProcessorMask : 0xff
+0x020 dwNumberOfProcessors : 8
+0x024 dwProcessorType : 0
+0x028 dwAllocationGranularity : 0x10000
+0x02c wProcessorLevel : 0x803
+0x02e wProcessorRevision : 0x70c

Flags: needinfo?(dmajor)

Looks like dmajor took care of this one (thanks!).

Flags: needinfo?(nfroyd)

Thanks indeed! I think we'll assume we're good on Windows until we learn otherwise through other channels.

(In reply to David Major [:dmajor] from comment #25)

> Here's the struct that js::gc::InitMemorySubsystem() receives from GetSystemInfo(). 4k pages, 64k allocation granularity, 47-bit pointers.

Interesting, so that's identical to x64. Assuming the struct isn't lying, that must mean they use 4 translation tables and have 48-bit addresses internally but simply don't use the most significant bit. It might be interesting to use the logic at [1] and see if it's actually telling the truth, though there's certainly no harm in taking Windows at its word so long as the reported value isn't larger than the actual value. Something for a rainy day when we have Windows ARM64 testing in automation maybe.

[1] https://hg.mozilla.org/mozilla-central/file/e4ac2508e8ed/js/src/gc/Memory.cpp#l376
