Closed Bug 1459761 Opened 5 years ago Closed 5 years ago

WebAssembly.Memory objects are slow to GC

Categories

(Core :: JavaScript Engine: JIT, defect)


Tracking


RESOLVED FIXED
mozilla62
Tracking Status
firefox62 --- fixed

People

(Reporter: jstpierre, Assigned: sfink)

References

Details

Attachments

(3 files, 2 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0
Build ID: 20180501102731

Steps to reproduce:

WebAssembly.Memory objects appear to be much slower to GC than their TypedArray / ArrayBuffer counterparts, and they block the main thread while doing so. This means that running multiple small WASM functions with separate memories is slow.

Performance test case: http://jsbin.com/xusadanehu/edit?js,console
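
For reference, here is a minimal sketch in the spirit of the linked test case (a hedged reconstruction, not the exact jsbin contents; the 100-iteration count matches the "GC'ing 100 times" observation below):

    function time(label, alloc) {
        const t0 = performance.now();
        for (let i = 0; i < 100; i++)
            alloc();
        console.log(label, (performance.now() - t0).toFixed(1), "ms");
    }

    time("ArrayBuffer       ", () => new ArrayBuffer(64 * 1024));
    time("WebAssembly.Memory", () => new WebAssembly.Memory({ initial: 1 })); // 1 page = 64KiB
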
The problem is that in the WebAssembly.Memory case, we're doing a GC for *every* allocation. So the time difference is because we're GC'ing 100 times vs zero times.

And that is happening because

    cx->updateMallocCounter(buffer->mappedSize());

and buffer->mappedSize() is 6442516480 bytes (for the 64KB memory region), which comes from

    mappedSize = wasm::ComputeMappedSize(maxSize.valueOr(numBytes));

I'm guessing this isn't the best value to be using for updating the malloc counter.
Looks like the updateMallocCounter line is from bug 1284156.
Flags: needinfo?(luke)
The problem is that, if we don't use the mappedSize -- if instead we use the memory length -- then one can easily allocate many TB of virtual address space mappings before there is a GC.  Maybe there is a compromise though so we can updateMallocCounter(Max(buffer->byteLength(), X)) where X is chosen to reliably produce a GC every N Memories.

Steve: any thoughts on what a good X would be?
Flags: needinfo?(luke) → needinfo?(sphink)
I forgot to add: to avoid exhausting the 47-bit address space (or meaningfully reducing the randomization of ASLR), we limit the number of simultaneously-alive 6gb virtual reservations to MaximumLiveMappedBuffers which is currently 1,000.  Thus, in a (usually test) workload that churns memories, you need to make sure to GC every, say, 100 memories.  If there isn't a good X for updateMallocCounter() (say, because there is a moving high-water-mark), then I guess we could use a separate wasm-memory-specific counter.
My use case here is allocating a lot of small WASM instances, having them do a tiny function on a private heap, and return quickly. Even after this bug is fixed, it sounds like SpiderMonkey will still expect a small number of large, growable Memory objects, and will probably GC after N memory objects for some sensible number of N. Would it be wise to redesign my code to use this pattern instead, or do you plan on supporting this usage pattern as performantly as the other one?
(In reply to Jasper St. Pierre from comment #5)

> My use case here is allocating a lot of small WASM instances, having them do
> a tiny function on a private heap, and return quickly.

Last time this issue came up the conclusion was "don't do that", for precisely this reason.  Some of our test suites take much longer to run on 64-bit than on 32-bit because they allocate huge numbers of wasm memories.

> Even after this bug
> is fixed, it sounds like SpiderMonkey will still expect a small number of
> large, growable Memory objects, and will probably GC after N memory objects
> for some sensible number of N. Would it be wise to redesign my code to use
> this pattern instead, or do you plan on supporting this usage pattern as
> performantly as the other one?

In principle, if this turns out to be an important pattern for many, we could support it by compiling a module that has a small, known upper memory limit in a slightly different way, so as to allow the module's instances to use smaller memory allocations.  There would however be a performance implication - memory accesses would be bounds checked.  So it might be one of those cases where it would be best if we didn't have to guess which way to compile it.  Also, one can always recompile in response to actual use, but that has downsides.  Either way this requires some thought + engineering effort.
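
For reference, a "small, known upper memory limit" corresponds at the JS API level to a Memory whose maximum is declared up front. A minimal sketch (the one-page maximum is just an illustration):

    // 64KiB now, and can never grow: an engine could in principle back this with
    // a plain malloc'd buffer plus bounds checks instead of a 6GiB reservation.
    const mem = new WebAssembly.Memory({ initial: 1, maximum: 1 });
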
I'm more than happy to "don't do that" (and in fact, did, in my project just now). This is just my first time using WebAssembly, and there isn't much documentation on preferred integration patterns and best practices.

That said, there are advantages to this sort of model. One reason I want to do this is because I want to spin up a bunch of Web Workers that run one-off tasks like this. Due to the way Web Workers are currently spec'd, the easiest way to implement a cancellable job is to spin up a worker per "job type" and terminate it when the job is cancelled. While this no longer blocks the main thread, it is still more expensive than necessary, and I can't imagine cross-compartment GC is simple.

Additionally, the idea of an ever-growing persistent memory pool that's not shrinkable makes me feel bad, since I've been trying to care about the memory footprint of my web page. I'll eventually get used to it I guess.
(In reply to Jasper St. Pierre from comment #7)

> That said, there are advantages to this sort of model [many small instances]. 

No question about that - when we discussed this previously here, I found myself advocating for this model too.

I don't think we know how wasm will be used either.  Of course if we make it very expensive to create an instance then we force a specific pattern, but it's hard to force the web to do things one specific way :)
For the many-small-memories use case, one design extension I'm increasingly thinking makes sense is allowing memories (in their memory type [1], so either via JS API constructor or wasm memory section) to declare byte allocation granularity.  (We've talked about the other way too, declaring a bigger-than-64kb page granularity to take advantage of huge pages, so maybe the general feature is declaring your own non-default page size.)  Sub-4kb page size pretty much prevents mmap tricks, so the natural implementation is "just malloc/realloc as much as I tell you".
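
As a purely hypothetical sketch of what such a declaration might look like via the JS API constructor (no pagesize option exists in the spec today, so this does not run anywhere):

    // Hypothetical: request 4KiB allocation granularity instead of 64KiB wasm pages.
    const mem = new WebAssembly.Memory({ initial: 1, maximum: 4, pagesize: 4096 });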

In the meantime, I think V8 also has the same impl of Memory, so I would attempt to reuse memories like you were saying.  Workers are also pretty expensive to start up, so having a pool of Workers+Memories makes sense.  I'd also recommend holding onto and reusing Module objects to avoid recompiling each time.

[1] https://webassembly.github.io/spec/core/syntax/types.html#memory-types
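
A sketch of the reuse pattern suggested above, assuming a hypothetical job.wasm that imports its memory as env.memory and exports a run() function (Module objects are structured-cloneable, so the compiled module can be posted to workers):

    // Main thread: compile once, keep a pool of long-lived workers.
    const modulePromise = WebAssembly.compileStreaming(fetch("job.wasm"));
    const pool = Array.from({ length: 4 }, () => new Worker("job-worker.js"));

    function runJob(worker, input) {
        return modulePromise.then(module => new Promise(resolve => {
            worker.onmessage = e => resolve(e.data);
            worker.postMessage({ module, input });
        }));
    }

    // job-worker.js: one long-lived Memory, reused across jobs.
    const memory = new WebAssembly.Memory({ initial: 1, maximum: 1 });
    onmessage = async e => {
        const instance = await WebAssembly.instantiate(e.data.module, { env: { memory } });
        postMessage(instance.exports.run(e.data.input));
    };
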
Page size granularity would be amazing to have! 4kb pages would be super nice for my use case. As for keeping Workers around -- yeah, I do, but killing jobs through terminate() means I have to start up new ones to replace the old ones.

That said, regardless of what the preferred usage model should be, there's still a Firefox bug in here: the triggering of an emergency GC upon every WebAssembly.Instance or WebAssembly.Memory allocation.
Agreed and thanks for reporting; we should keep the bug open to GC less frequently (I'm thinking every 100).
Thinking about it more, this is simply a different heuristic than malloc bytes; it's more related to the MaximumLiveMappedBuffers=1000 limit than anything, so I think I'll just use a per-process atomic counter to trigger GCs.
Flags: needinfo?(sphink)
Attached patch lazier-gc (obsolete) — Splinter Review
This removes all the GCs in the testcase in comment 0.  Running the test repeatedly, I can observe the occasional TOO_MUCH_WASM_MEMORY-reason GC.
Assignee: nobody → luke
Attachment #8975100 - Flags: review?(lhansen)
Attachment #8975100 - Flags: review?(jcoppeard)
Attached patch lazier-gc (obsolete) — Splinter Review
Oops, I should probably add the new GCReason into one of the reserved slots so I don't mess up the telemetry numbering.
Attachment #8975100 - Attachment is obsolete: true
Attachment #8975100 - Flags: review?(lhansen)
Attachment #8975100 - Flags: review?(jcoppeard)
Attachment #8975101 - Flags: review?(lhansen)
Attachment #8975101 - Flags: review?(jcoppeard)
Shouldn't the heuristic only be changed for WASM_HUGE_MEMORY memories?
I think there's value in having consistent behavior across platforms, so we (already) enforce the MaximumLiveMappedBuffers limit on all platforms, regardless of WASM_HUGE_MEMORY.  Since we enforce that limit everywhere, the GC-triggering should happen everywhere.
Attached patch lazier-gcSplinter Review
Re-reading the patch just now, I see that I accidentally deleted the malloc accounting in the createFromNewRawBuffer() case; instead I should have switched it from counting virtual/mapped size to the accessible/byteLength size.
Attachment #8975101 - Attachment is obsolete: true
Attachment #8975101 - Flags: review?(lhansen)
Attachment #8975101 - Flags: review?(jcoppeard)
Attachment #8975139 - Flags: review?(lhansen)
Attachment #8975139 - Flags: review?(jcoppeard)
Comment on attachment 8975139 [details] [diff] [review]
lazier-gc

Review of attachment 8975139 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/ArrayBufferObject.cpp
@@ -1198,5 @@
>                  size_t nAllocated = nbytes;
>                  if (contents.kind() == MAPPED)
>                      nAllocated = JS_ROUNDUP(nbytes, js::gc::SystemPageSize());
> -                else if (contents.kind() == WASM)
> -                    nAllocated = contents.wasmBuffer()->allocatedBytes();

Did you mean to remove this malloc accounting too?
Attachment #8975139 - Flags: review?(jcoppeard) → review+
Comment on attachment 8975139 [details] [diff] [review]
lazier-gc

Review of attachment 8975139 [details] [diff] [review]:
-----------------------------------------------------------------

This seems reasonable.  We'll still want something better, long-term, to allow the user program to control this behavior, but it seems like this should reduce the likelihood of stop-the-world collections.
Attachment #8975139 - Flags: review?(lhansen) → review+
(In reply to Jon Coppeard (:jonco) from comment #18)
> > -                else if (contents.kind() == WASM)
> > -                    nAllocated = contents.wasmBuffer()->allocatedBytes();
> 
> Did you mean to remove this malloc accounting too?

In this case, removing the 'else if' just causes nAllocated to fall back to the byteLength, which is what we want and matches the other malloc accounting change.
Pushed by lwagner@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5c0629a808fe
Baldr: relax WebAssembly.Memory GC heuristic (r=lth,jonco)
https://hg.mozilla.org/mozilla-central/rev/5c0629a808fe
Status: UNCONFIRMED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
Blocks: 1291954
So the reason TSAN fails (and probably ASAN too) is that the SAN tools use a lot of vmem for internal bookkeeping, so it's actually the mmap() that's failing (around 600 live buffers).  That suggests a different Max under #if defined(TSAN) or defined(ASAN).

Now in theory, since we start triggering GCs at 200 live buffers, we shouldn't hit 600 live, but I guess the incremental GCs are just not completing fast enough for high-throughput WebAssembly.Memory allocation.  I'm not sure what the right fix is, but here's a patch that works and is no worse than the original state which did a sync GC on every allocation: it synchronously performs a full GC once we hit (Max - 100) live buffers.

It looks like we have a bunch of distributed machinery for escalating too-much-malloc GCs into full GCs as we get beyond an upper threshold (and with updateMallocCounter(6gb) we previously hit this every time), so it wasn't clear how I could reuse that for escalating gcreason::TOO_MUCH_WASM_MEMORY GCs, but I'm quite happy to hear if there is a better way.
Flags: needinfo?(luke)
Attachment #8976379 - Flags: review?(jcoppeard)
I'm not sure if this is a good idea, nor whether it's the best way to do it.

It seems like normally you'd want to trigger an incremental GC at one threshold, and only go nonincremental when things get bad enough. I assume that in actual practice you wouldn't expect lots of these memory objects to be created rapidly, and that you'd have time to do incremental GCs?

This does fix the test, though.
Attachment #8976380 - Flags: review?(jcoppeard)
Assignee: luke → sphink
Comment on attachment 8976380 [details] [diff] [review]
Make TOO_MUCH_WASM_MEMORY GCs nonincremental

Whoops, looks like luke got something first that's a lot less blunt.
Attachment #8976380 - Flags: review?(jcoppeard)
Comment on attachment 8976379 [details] [diff] [review]
further-tweak-heuristic

Review of attachment 8976379 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/ArrayBufferObject.cpp
@@ +847,5 @@
>      maybeSharedObject.set(object);
>  
> +    // See MaximumLiveMappedBuffers comment above.
> +    if (liveBufferCount > StartSyncFullGCAtLiveBufferCount) {
> +        JS::PrepareForFullGC(cx);

I don't think this does what you want. A full GC just means all-zones; it can still be incremental. You'd be better off with a nonincremental collection of just the current zone, probably.

I don't think we have an API for this, really. There's JS::DisableIncrementalGC but that's permanent.

I think we'd either need to add some sort of API for this, or add some sort of WAY_TOO_MUCH_WASM_MEMORY and make it nonincremental as I did in my patch.
Ah, thanks; I did indeed want a non-incremental GC.

> You'd be better off with a nonincremental collection of just the current zone, probably.
>
> I don't think we have an API for this, really. There's JS::DisableIncrementalGC but that's permanent.

I'd be happy (maybe even happier) with the whole GCRuntime; is there an API for that?

Also, question about the existing `cx->runtime()->gc.triggerGC()` call (added by the first patch): will it trigger an incremental collection in *all* zones?  If not *all*, it seems likely that the zone I care about (cx->zone()) could be missed (or perhaps GC'd the first time, but not the next, which, iirc Steve, is what you saw?).
Comment on attachment 8976379 [details] [diff] [review]
further-tweak-heuristic

Review of attachment 8976379 [details] [diff] [review]:
-----------------------------------------------------------------

This is fine to make the tests pass etc.

I think the best eventual solution here is for the GC to keep track of virtual bytes allocated for wasm memory and trigger full GCs based on that.  This would be like we do for malloc memory except would have knowledge of frees as well.

::: js/src/vm/ArrayBufferObject.cpp
@@ +109,4 @@
>  static const int32_t MaximumLiveMappedBuffers = 1000;
> +#endif
> +static const int32_t StartTriggeringAtLiveBufferCount = 100;
> +static const int32_t StartSyncFullGCAtLiveBufferCount = MaximumLiveMappedBuffers - 100;

nit: we usually call these non-incremental GCs rather than synchronous GCs.

@@ +848,5 @@
>  
> +    // See MaximumLiveMappedBuffers comment above.
> +    if (liveBufferCount > StartSyncFullGCAtLiveBufferCount) {
> +        JS::PrepareForFullGC(cx);
> +        JS::GCForReason(cx, GC_NORMAL, JS::gcreason::TOO_MUCH_WASM_MEMORY);

GCForReason does actually do a non-incremental GC - it calls GCRuntime::gc() which calls collect() passing true for the nonincrementalByAPI parameter.  So this works.
Attachment #8976379 - Flags: review?(jcoppeard) → review+
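
Putting the thresholds together, an illustrative model (plain JS, not the actual engine code) of the heuristic as landed:

    let liveBufferCount = 0;
    const MaximumLiveMappedBuffers = 1000;
    const StartTriggeringAtLiveBufferCount = 100;
    const StartSyncFullGCAtLiveBufferCount = MaximumLiveMappedBuffers - 100;

    function requestIncrementalGC() {}  // stands in for gc.triggerGC(TOO_MUCH_WASM_MEMORY)
    function nonIncrementalFullGC() {}  // stands in for PrepareForFullGC + GCForReason(...)

    function onWasmBufferAllocated() {
        if (liveBufferCount >= MaximumLiveMappedBuffers)
            throw new Error("too many live 6GiB mappings"); // allocation refused at the hard cap
        liveBufferCount++;
        if (liveBufferCount > StartSyncFullGCAtLiveBufferCount)
            nonIncrementalFullGC();     // close to the cap: collect now, non-incrementally
        else if (liveBufferCount > StartTriggeringAtLiveBufferCount)
            requestIncrementalGC();     // plenty of headroom: an incremental GC suffices
    }

    function onWasmBufferFinalized() {
        liveBufferCount--;
    }
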
Pushed by lwagner@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/04d361bb1343
Baldr: trigger non-incremental, full GC close to live buffer limit (r=jonco)