Open
Bug 936168
Opened 11 years ago · Updated 10 months ago
Optimize away construction of temporary ArrayBufferView objects
Categories: Core :: JavaScript Engine, defect
Tracking: NEW
People: Reporter: jgilbert, Unassigned, NeedInfo
References: Blocks 1 open bug
Attachments: 2 files
One problem encountered by otherwise garbage-free asm.js code is that in order to pass sub-buffers of data to WebGL, they need to create temporary ArrayBufferView objects.
Example:
gl.bufferData(gl.ARRAY_BUFFER, new Uint8Array(HEAP.buffer, offset, len), gl.STATIC_DRAW);
If it's possible to identify when such a constructor is immediately being passed into IDL bindings, would it be possible to immediately free it?
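As a sketch of the pattern in question, and of one application-level mitigation, here is a minimal example. The names (`HEAP`, `gl`, the helper functions) are assumptions for illustration, not code from this bug:

```javascript
// Hypothetical asm.js-style upload helper. Each call allocates a short-lived
// view object whose only job is to describe an (offset, len) region:
function uploadNaive(gl, HEAP, offset, len) {
  gl.bufferData(gl.ARRAY_BUFFER,
                new Uint8Array(HEAP.buffer, offset, len),
                gl.STATIC_DRAW);
}

// One mitigation available to applications today: memoize views keyed by
// (offset, len) so repeated uploads of the same region reuse one view object
// instead of creating a fresh temporary per call.
const viewCache = new Map();
function getView(HEAP, offset, len) {
  const key = offset + ":" + len;
  let view = viewCache.get(key);
  if (view === undefined) {
    view = new Uint8Array(HEAP.buffer, offset, len);
    viewCache.set(key, view);
  }
  return view;
}
```

This only helps when the same regions recur; the engine-side optimization proposed here would cover the general case.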
Comment 1•11 years ago
The variation of this that I've been thinking about is having the DOM binding provide additional info saying that it takes a view directly as (HEAP, offset, len) components. (This avoids the overhead of building and extracting this information.) Back in the old TM days, a single native would have a list of "traceable native" specializations, each taking different argument types, so this would be like that. Eric: do you know if we do anything similar atm with DOM bindings?
Also: Alon do you have any indication how much garbage is being created from this source? If it's a few objects per frame, then I don't think it's a huge deal; we will get periodic GCs regardless.
Comment 2•11 years ago
The amount depends on the codebase. We will need 1 (or perhaps 2 or 3) per GL call (draw and/or data upload), but how many of those calls happen is highly dependent on how optimized the app is. It could be anywhere from dozens per frame to hundreds.
Comment 3•11 years ago
> Eric: do you know if we do anything similar atm with DOM bindings?
Right now we do not, but we've thought about doing it. I think that would be quite nice in fact; that might also allow us to compile an idempotent version of getElementsByTagName or whatnot if the argument is a known string...
I'm definitely very interested in calling methods that need to do less work based on static TI information as opposed to just redoing all the dynamic checking.
Comment 4•11 years ago
(In reply to Boris Zbarsky [:bz] from comment #3)
> Right now we do not, but we've thought about doing it. I think that would
> be quite nice in fact; that might also allow us to compile an idempotent
> version of getElementsByTagName or whatnot if the argument is a known
> string...
Wow, yeah, just with simple constant-prop we could turn getElementsByTagName("td") to getElementsByTagName_td. Think this would help any DOM synthetic benchmarks?
Comment 5•11 years ago
Turning getElementsByTagName("td") to getElementsByTagName_td seems like it would require a lot of C++ getElementsByTagName_* methods, right?
I was thinking something even simpler: getElementsByTagName_argKnownToBeSomethingThatWontHaveSideEffectsWhenYouToStringIt (e.g. the arg is known to be an actual string). Then it could be idempotent, which allows things like loop-hoisting (which would sure help synthetic benchmarks) and more importantly allows CSE of things like the (also idempotent by the time we convert window to WebIDL) "document" getter across getElementsByTagName calls.
On the synthetic benchmark front, I would hope that something like this (from Dromaeo):
test( "getElementById (not in document)", function(){
    for ( var i = 0; i < num * 30; i++ ) {
        ret = document.getElementById("testA");
        ret = document.getElementById("testB");
        ret = document.getElementById("testC");
        ret = document.getElementById("testD");
        ret = document.getElementById("testE");
        ret = document.getElementById("testF");
    }
});
would all end up loop-hoisted when we're done. Loop-hoisting something like this:
test( "getElementsByTagName(p)", function(){
    for ( var i = 0; i < num; i++ ) {
        var elems = document.getElementsByTagName("p");
        ret = elems[elems.length-1].nodeType;
    }
});
would take more work; we'd need to teach the (nonexistent so far, afaict) fast path for that [elems.length-1] bit that it's also idempotent.
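To make the hoisting idea concrete, here is a hand-written sketch of the transformation the JIT could perform when the argument is a known constant and the callee is known idempotent. The `document` object here is a stand-in stub (an assumption for a self-contained example), not a real DOM:

```javascript
// Stand-in for a DOM document; getElementsByTagName is idempotent here.
const document = {
  _paragraphs: [{ nodeType: 1 }, { nodeType: 1 }],
  getElementsByTagName(tag) { return tag === "p" ? this._paragraphs : []; },
};

// What the benchmark does: re-evaluate the lookup every iteration.
function hotLoopNaive(num) {
  let ret;
  for (let i = 0; i < num; i++) {
    const elems = document.getElementsByTagName("p");
    ret = elems[elems.length - 1].nodeType;
  }
  return ret;
}

// What a JIT could emit: with a constant argument and an idempotent callee,
// both the call and the [elems.length-1] access move out of the loop.
function hotLoopHoisted(num) {
  const elems = document.getElementsByTagName("p");
  const last = elems[elems.length - 1];
  let ret;
  for (let i = 0; i < num; i++) {
    ret = last.nodeType;
  }
  return ret;
}
```

The second version is only legal because nothing in the loop can invalidate the hoisted results, which is exactly the guarantee the "idempotent" binding annotation would provide.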
I feel like I'm hijacking this bug now.... ;)
Comment 6•11 years ago
(In reply to Boris Zbarsky [:bz] from comment #5)
> Turning getElementsByTagName("td") to getElementsByTagName_td seems like it
> would require a lot of C++ getElementsByTagName_* methods, right?
I was assuming it'd be Bindings-generated and that there would actually be some advantage to having static knowledge of which tag you are looking up, but I don't really know what I'm talking about :)
Comment 7•11 years ago
> I was assuming it'd be Bindings-generated
Sure, but it'd still increase binary size and whatnot.
There _are_ some cases where we'd want to specialize like that; the
style.top = someVar + "px";
case being able to call a setter passing it just the someVar number and avoiding the string ops is another example. Or similar with canvas colors and "rgb(" + r + "," + g + "," + b + ")" bits. roc and I talked to jandem and ejpbruel and efaust for a bit about that one at the summit.... So in general we definitely want this sort of thing; just need to figure out what the right way is to give the JIT the data it needs to call the right thing with the right arguments.
Comment 8•10 years ago
Profiling the Humble Bundle games, a considerable number of these occur in each of them. Mostly they come from GL usage, but large memcpys, string marshalling between C++ and JS, Emscripten filesystem operations, and SDL pixel canvas operations also generate them. The general pattern is something like "foo(HEAPU32.subarray(a, b));"
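The per-call allocation in that pattern is easy to demonstrate: every `subarray()` call returns a brand-new view object, even for identical bounds. A minimal sketch (the `HEAPU32` name mirrors Emscripten's heap views; the sizes are arbitrary):

```javascript
// Stand-in for Emscripten's heap view (an assumption for this example).
const HEAPU32 = new Uint32Array(1024);

// Two subarray() calls with identical bounds still produce two distinct
// short-lived objects, which is the garbage source described above:
const a = HEAPU32.subarray(0, 16);
const b = HEAPU32.subarray(0, 16);
// a !== b, yet both alias the same underlying ArrayBuffer (a.buffer === b.buffer),
// so the objects themselves carry no information the engine couldn't synthesize.
```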
Comment 9•10 years ago
Jukka: any rough count on the number of these produced per frame (or second)?
Comment 10•10 years ago
Here's a very rough count:
FTL: 1000 allocs/second
Superhexagon: 1500 allocs/second
Zenbound 2: 3500 allocs/second
Osmos: 17000 allocs/second
Voxatron demo: 400 allocs/second
Osmos contains a lot of dynamically rendered sprites compared to other games, which shows up in the numbers.
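For context, converting those per-second rates to per-frame counts at an assumed 60 fps (the frame rate is my assumption, not measured here):

```javascript
// Rates reported above, in temporary view allocations per second.
const allocsPerSecond = {
  FTL: 1000,
  Superhexagon: 1500,
  "Zenbound 2": 3500,
  Osmos: 17000,
  "Voxatron demo": 400,
};

// Per-frame counts at 60 fps.
const allocsPerFrame = {};
for (const [game, rate] of Object.entries(allocsPerSecond)) {
  allocsPerFrame[game] = Math.round(rate / 60);
}
// Osmos works out to roughly 283 temporary views per frame; the others
// are in the tens per frame.
```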
Comment 11•10 years ago
Those are big numbers (esp Osmos), but not so astoundingly large that it's clearly a problem. I don't really know our thresholds, though. Steve: do you know how many typed array views it takes to trigger a minor collection and thus what sort of impact optimizing away all these view creations would make?
Flags: needinfo?(sphink)
Comment 12•10 years ago
(In reply to Luke Wagner [:luke] from comment #11)
> Those are big numbers (esp Osmos), but not so astoundingly large that it's
> clearly a problem. I don't really know our thresholds, though. Steve: do
> you know how many typed array views it takes to trigger a minor collection
> and thus what sort of impact optimizing away all these view creations would
> make?
I would like to say that minor GC should be mostly irrelevant here, because at the time of the minor GC all of these temporaries are dead, and we don't spend any time on dead objects during nursery collections. I mean, there are second-order effects from possibly tenuring *other* objects prematurely due to too-frequent minor GCs, but I wouldn't expect those to be too significant.
And note that these temporaries are not free. There is a cost in allocating the object, filling in its fields, guarding on its type, extracting stuff back out, etc. Or stuff like that; I don't really know all that goes on.
However: in the specific case of these views, dead objects *do* cost you during the nursery collection. There is a weak pointer from a buffer to its views if it has more than one (and these temporaries will make the buffer have more than one.) There is a compartment-wide table mapping buffer objects to lists of views that will need to be swept, which takes time linear in the number of views. I don't think it should be a lot of time -- it's a hashtable lookup, a vector traversal, and a bunch of element swapping to sweep the dead temporaries. But it *is* time, and if there are enough of them, it might hurt. Knowing the time spent in minor GCs would help.
I still don't think the number of objects required to trigger a minor collection is relevant, because if these are all temporaries then the cost is linear in the total number of temporary views. But on desktop, it looks like the initial nursery size is 16MB, and at 64 bytes per view object (?), that's 256k temporaries before you'd trigger a nursery collection from the temporaries alone.
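The "256k" figure follows from a back-of-envelope calculation using the two numbers in the comment (the 64-byte per-view size is the comment's own guess):

```javascript
const nurseryBytes = 16 * 1024 * 1024; // initial desktop nursery size cited above
const bytesPerView = 64;               // assumed size of one view object
const viewsPerMinorGC = nurseryBytes / bytesPerView;
// 262144 temporaries (~"256k") before views alone would fill the nursery
// and trigger a minor collection.
```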
needinfo? terrence to check how much I'm lying.
Flags: needinfo?(sphink) → needinfo?(terrence)
Comment 13•10 years ago
Ah, so IIUC your point correctly: if we *were* to be emitting so many temporary views as to trigger many minor GCs per frame (which, if 256k is our limit, we aren't), they'd be super-fast b/c nothing was living (except the unlucky one live when the minor gc happened).
Comment 14•10 years ago
(In reply to Luke Wagner [:luke] from comment #13)
> Ah, so IIUC your point correctly: if we *were* to be emitting so many
> temporary views as to trigger many minor GCs per frame (which, if 256k is
> our limit, we aren't), they'd be super-fast b/c nothing was living (except
> the unlucky one live when the minor gc happened).
That's the main point, but the secondary point is: these might *not* be super-fast minor GCs because of the weak ref table (InnerViewTable). I'm not sure how expensive that can be.
Comment 15•10 years ago
Attached is a minorgc profiling log for a 10 minute session playing Osmos through the first few levels, notably including the first massive gravity level animating thousands of droplets around the screen constantly.
tl;dr: we're getting ~1 minor GC per second, taking roughly 200-300us, promoting from 0.5-1.5% of the minor heap. After warmup the minor heap stabilizes to 1MiB. Summary: GGC is operating nominally.
Gameplay was generally smooth as silk except for the odd GC hiccup every 30 seconds or so. One odd thing I do see in the log is that we're occasionally firing two GC's almost back to back. We should find out what's up with that. I'll grab a full GC log next so we can get a peek at long-term behavior.
Flags: needinfo?(terrence)
Comment 16•10 years ago
A random GC from halfway through the log.
GC(T+192.413s) =================================================================
Reason: CC_WAITING
Incremental: yes
Zones Collected: 7 of 7
Compartments Collected: 230 of 230
MinorGCs since last GC: 13
MMU 20ms:39.8%; 50ms:43.1%
SCC Sweep Total (MaxPause): 4.858ms (4.728ms)
HeapSize: 44.668 MiB
Chunk Delta (magnitude): +0 (0)
---- Slice 0 ----
Reason: CC_WAITING
Reset: no
Page Faults: 0
Pause: 10.651ms (@ 0.000ms)
Begin Callback: 0.001ms
Wait Background Thread: 0.001ms
Mark Discard Code: 0.439ms
Purge: 0.849ms
Mark: 8.820ms
Other: 4.819ms
Mark Roots: 4.001ms
---- Slice 1 ----
Reason: REFRESH_FRAME
Reset: no
Page Faults: 0
Pause: 10.324ms (@ 34.431ms)
Mark: 10.125ms
---- Slice 2 ----
Reason: REFRESH_FRAME
Reset: no
Page Faults: 0
Pause: 10.231ms (@ 56.670ms)
Mark: 10.039ms
---- Slice 3 ----
Reason: REFRESH_FRAME
Reset: no
Page Faults: 0
Pause: 8.144ms (@ 77.627ms)
Wait Background Thread: 0.002ms
Mark: 7.948ms
Other: 7.935ms
Mark Delayed: 0.013ms
---- Slice 4 ----
Reason: REFRESH_FRAME
Reset: no
Page Faults: 0
Pause: 12.040ms (@ 96.584ms)
Wait Background Thread: 0.001ms
Mark: 0.011ms
Mark Delayed: 0.009ms
Sweep: 11.372ms
Other: 0.272ms
Mark During Sweeping: 1.465ms
Mark Incoming Black Pointers: 0.001ms
Mark Weak: 0.272ms
Mark Incoming Gray Pointers: 0.002ms
Mark Gray: 1.167ms
Mark Gray and Weak: 0.021ms
Finalize Start Callback: 0.179ms
Sweep Atoms: 2.084ms
Sweep Compartments: 4.502ms
Sweep Discard Code: 0.459ms
Sweep Inner Views: 0.015ms
Sweep Cross Compartment Wrappers: 0.355ms
Sweep Base Shapes: 0.460ms
Sweep Initial Shapes: 0.692ms
Sweep Type Objects: 0.361ms
Sweep Breakpoints: 0.002ms
Sweep Regexps: 0.065ms
Sweep Miscellaneous: 0.549ms
Discard Analysis: 3.653ms
Discard TI: 0.912ms
Free TI Arena: 0.001ms
Sweep Types: 2.735ms
Sweep Object: 0.351ms
Sweep String: 0.086ms
Sweep Script: 0.762ms
Sweep Shape: 0.805ms
Sweep JIT code: 0.110ms
Finalize End Callback: 0.437ms
Deallocate: 0.319ms
End Callback: 0.007ms
---- Totals ----
Total Time: 51.390
Max Pause: 12.040
Begin Callback: 0.001ms
Wait Background Thread: 0.004ms
Mark Discard Code: 0.439ms
Purge: 0.849ms
Mark: 36.943ms
Other: 32.920ms
Mark Roots: 4.001ms
Mark Delayed: 0.022ms
Sweep: 11.372ms
Other: 0.272ms
Mark During Sweeping: 1.465ms
Mark Incoming Black Pointers: 0.001ms
Mark Weak: 0.272ms
Mark Incoming Gray Pointers: 0.002ms
Mark Gray: 1.167ms
Mark Gray and Weak: 0.021ms
Finalize Start Callback: 0.179ms
Sweep Atoms: 2.084ms
Sweep Compartments: 4.502ms
Sweep Discard Code: 0.459ms
Sweep Inner Views: 0.015ms
Sweep Cross Compartment Wrappers: 0.355ms
Sweep Base Shapes: 0.460ms
Sweep Initial Shapes: 0.692ms
Sweep Type Objects: 0.361ms
Sweep Breakpoints: 0.002ms
Sweep Regexps: 0.065ms
Sweep Miscellaneous: 0.549ms
Discard Analysis: 3.653ms
Discard TI: 0.912ms
Free TI Arena: 0.001ms
Sweep Types: 2.735ms
Sweep Object: 0.351ms
Sweep String: 0.086ms
Sweep Script: 0.762ms
Sweep Shape: 0.805ms
Sweep JIT code: 0.110ms
Finalize End Callback: 0.437ms
Deallocate: 0.319ms
End Callback: 0.007ms
Comment 17•10 years ago
A more in-depth analysis of the long-lived heap:
GC invocations by reason:
Reason: CC_WAITING 11
Reason: MAYBEGC 9
Reason: DOM_WORKER 8
Reason: DESTROY_RUNTIME 2
Reason: DESTROY_CONTEXT 2
Reason: ALLOC_TRIGGER 1
Reason: SET_NEW_DOCUMENT 1
Reason: API 1
Utilization:
Very low. After browser startup, we appear to allocate very slowly -- most GC's are looking at much, much less than 1MiB of new data and none actually result in any chunks being freed. Towards the end of the demo, allocation appears to totally level off and we're not adding anything to the GC heap, although we do see a malloc-based trigger in that timeframe. The growth does appear to be slow enough that you could probably finish the full game without actually needing to GC.
Latency:
10ms while gaming. Outside of the non-incremental startup and shutdown GC's, the max GC related pause is 13.689ms. I did not see Andrew's 40ms second slice; perhaps that happens on a different game. On the plus side, we're doing pretty good at hitting 10ms; however, we should be budgeting games for 5ms slices. I guess that heuristic is not being hit for these games for some reason.
Comment 18•10 years ago
Actually, looks like I'm remembering incorrectly. The default is 10ms slices on desktop and 30ms on b2g. That will double if we are in high-frequency mode, but is generally going to be 10ms. When triggered via gecko's slice timer (100ms intervals) with reason==INTER_SLICE_GC, gecko manually passes a 40ms budget, overriding the 10ms default. Between the traces here and bug 1084672, it looks like NotifyDidPaint is getting called for these traces, but may not be called correctly in other situations.
Updated•2 years ago
Severity: normal → S3
Comment 19•10 months ago
Does this still make any sense ten years on? Or should we just close this one?
Blocks: sm-js-perf
Flags: needinfo?(sphink)