Open Bug 936168 Opened 11 years ago Updated 10 months ago

Optimize away construction of temporary ArrayBufferView objects

Categories

(Core :: JavaScript Engine, defect)



People

(Reporter: jgilbert, Unassigned, NeedInfo)

References

(Blocks 1 open bug)


Attachments

(2 files)

One problem encountered by otherwise garbage-free asm.js code is that, in order to pass sub-buffers of data to WebGL, it has to create temporary ArrayBufferView objects. Example:

  gl.bufferData(gl.ARRAY_BUFFER, new ArrayBufferView(HEAP, offset, len), gl.STATIC_DRAW);

If it's possible to identify when such a constructor result is immediately being passed into IDL bindings, would it be possible to free it immediately?
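The pattern can be sketched with plain typed arrays (`HEAP` and `uploadVertexData` are hypothetical stand-ins for the asm.js heap and a WebGL entry point); each call site allocates a fresh view object even though the underlying memory is shared:

```javascript
// Large backing buffer, standing in for the asm.js heap.
const HEAP = new Uint8Array(new ArrayBuffer(1 << 20));

// Hypothetical consumer standing in for gl.bufferData: it only
// reads the bytes of the view it is handed.
function uploadVertexData(view) {
  return view.byteLength;
}

// Each call builds a short-lived view over a slice of the heap.
// The views share memory with HEAP, but each view *object* is a
// fresh GC allocation that dies as soon as the call returns.
const a = new Uint8Array(HEAP.buffer, 0, 64);
const b = new Uint8Array(HEAP.buffer, 0, 64);

console.assert(a !== b);                // two distinct temporary objects
console.assert(a.buffer === b.buffer);  // over the same underlying memory
console.assert(uploadVertexData(a) === 64);
```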
The variation of this that I've been thinking about is having the DOM binding provide additional info saying that it takes a view directly as (HEAP, offset, len) components. (This avoids the overhead of building and extracting this information.) Back in the old TM days, a single native would have a list of "traceable native" specializations, each taking different argument types, so this would be like that.

Eric: do you know if we do anything similar atm with DOM bindings?

Also: Alon, do you have any indication how much garbage is being created from this source? If it's a few objects per frame, then I don't think it's a huge deal; we will get periodic GCs regardless.
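The components idea can be illustrated with a toy pair of functions (both hypothetical, not real binding code): one that takes a view object, and a specialization that takes the heap plus (offset, len) directly, doing the same work without allocating a temporary view:

```javascript
// View-taking API: the caller must allocate a temporary view object.
function sum(view) {
  let s = 0;
  for (const x of view) s += x;
  return s;
}

// Hypothetical components-taking specialization: same work, no
// temporary object; it reads the bounded range out of the heap directly.
function sumRaw(heap, offset, len) {
  let s = 0;
  for (let i = offset; i < offset + len; i++) s += heap[i];
  return s;
}

const HEAP = Uint32Array.from({ length: 16 }, (_, i) => i);

// Same result, but the second call allocates no view.
console.assert(sum(HEAP.subarray(4, 8)) === sumRaw(HEAP, 4, 4)); // both 22
```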
The amount depends on the codebase. We will need 1 (or perhaps 2 or 3) per GL call (draw and/or data upload), but how many of those calls happen is highly dependent on how optimized the app is. It could be anywhere from dozens per frame to hundreds.
> Eric: do you know if we do anything similar atm with DOM bindings?

Right now we do not, but we've thought about doing it. I think that would be quite nice, in fact; that might also allow us to compile an idempotent version of getElementsByTagName or whatnot if the argument is a known string... I'm definitely very interested in calling methods that need to do less work based on static TI information, as opposed to just redoing all the dynamic checking.
(In reply to Boris Zbarsky [:bz] from comment #3)
> Right now we do not, but we've thought about doing it. I think that would
> be quite nice in fact; that might also allow us to compile an idempotent
> version of getElementsByTagName or whatnot if the argument is a known
> string...

Wow, yeah, just with simple constant-prop we could turn getElementsByTagName("td") into getElementsByTagName_td. Think this would help any DOM synthetic benchmarks?
Turning getElementsByTagName("td") into getElementsByTagName_td seems like it would require a lot of C++ getElementsByTagName_* methods, right?

I was thinking something even simpler: getElementsByTagName_argKnownToBeSomethingThatWontHaveSideEffectsWhenYouToStringIt (e.g. the arg is known to be an actual string). Then it could be idempotent, which allows things like loop-hoisting (which would sure help synthetic benchmarks) and, more importantly, allows CSE of things like the (also idempotent by the time we convert window to WebIDL) "document" getter across getElementsByTagName calls.

On the synthetic benchmark front, I would hope that something like this (from Dromaeo):

  test( "getElementById (not in document)", function(){
    for ( var i = 0; i < num * 30; i++ ) {
      ret = document.getElementById("testA");
      ret = document.getElementById("testB");
      ret = document.getElementById("testC");
      ret = document.getElementById("testD");
      ret = document.getElementById("testE");
      ret = document.getElementById("testF");
    }
  });

would all end up loop-hoisted when we're done. Loop-hoisting something like this:

  test( "getElementsByTagName(p)", function(){
    for ( var i = 0; i < num; i++ ) {
      var elems = document.getElementsByTagName("p");
      ret = elems[elems.length-1].nodeType;
    }
  });

would take more work; we'd need to teach the (nonexistent so far, afaict) fast path for that [elems.length-1] bit that it's also idempotent.

I feel like I'm hijacking this bug now.... ;)
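The loop-hoisting point can be sketched in plain JS (the `getById` table stand-in is hypothetical; a real engine would prove idempotence from binding metadata plus type information, not from user code):

```javascript
// Stand-in for an idempotent DOM lookup: same argument, same result,
// no side effects.
const table = { testA: { id: "testA" } };
function getById(id) { return table[id] ?? null; }

// What the benchmark loop does: repeat the lookup every iteration.
function naive(n) {
  let ret;
  for (let i = 0; i < n; i++) ret = getById("testA");
  return ret;
}

// What loop-hoisting would produce: one lookup, reused each iteration.
function hoisted(n) {
  const cached = getById("testA");
  let ret;
  for (let i = 0; i < n; i++) ret = cached;
  return ret;
}

console.assert(naive(100) === hoisted(100)); // same object either way
```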
(In reply to Boris Zbarsky [:bz] from comment #5)
> Turning getElementsByTagName("td") to getElementsByTagName_td seems like it
> would require a lot of C++ getElementsByTagName_* methods, right?

I was assuming it'd be Bindings-generated and that there would actually be some advantage to having static knowledge of which tag you are looking up, but I don't really know what I'm talking about :)
> I was assuming it'd be Bindings-generated

Sure, but it'd still increase binary size and whatnot.

There _are_ some cases where we'd want to specialize like that; another example is the style.top = someVar + "px"; case, where we could call a setter passing it just the someVar number and avoid the string ops. Or similarly with canvas colors and the "rgb(" + r + "," + g + "," + b + ")" bits. roc and I talked to jandem and ejpbruel and efaust for a bit about that one at the summit....

So in general we definitely want this sort of thing; we just need to figure out the right way to give the JIT the data it needs to call the right thing with the right arguments.
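A rough sketch of the style.top idea, with plain objects standing in for CSSStyleDeclaration (`setTopGeneric` and `setTopPx` are hypothetical names, not real binding entry points): the generic path builds a string and then has to parse the number back out, while a specialized path would receive the number directly:

```javascript
// Generic path: the caller concatenates, the setter re-parses.
function setTopGeneric(style, value) {
  // value is e.g. "42px"; the setter must recover the number from it.
  style.top = parseFloat(value);
}

// Hypothetical specialized path: the JIT recognizes someVar + "px"
// at the call site and passes the number straight through, skipping
// both the string allocation and the parse.
function setTopPx(style, n) {
  style.top = n;
}

const s1 = {}, s2 = {};
const someVar = 42;
setTopGeneric(s1, someVar + "px");
setTopPx(s2, someVar);
console.assert(s1.top === s2.top); // same end state, no string ops
```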
Profiling the Humble Bundle games, a considerable number of these allocations occur in each of them. Mostly they come from GL usage, but large memcpys, string marshalling between C++ and JS, Emscripten filesystem operations, and SDL pixel canvas operations generate these as well. The general pattern is something like "foo(HEAPU32.subarray(a, b));"
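The `subarray` pattern behaves the same way as explicit view construction: each call returns a new view object over the shared buffer, so every such call site produces garbage. A minimal illustration:

```javascript
const HEAPU32 = new Uint32Array(256);

// subarray() returns a new view object on every call, even for
// identical bounds; only the underlying buffer is shared.
const v1 = HEAPU32.subarray(4, 8);
const v2 = HEAPU32.subarray(4, 8);

console.assert(v1 !== v2);               // a fresh object per call...
console.assert(v1.buffer === v2.buffer); // ...over the same memory
console.assert(v1.length === 4);
```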
Jukka: any rough count on the number of these produced per frame (or second)?
Here's a very rough count:

FTL: 1000 allocs/second
Superhexagon: 1500 allocs/second
Zenbound 2: 3500 allocs/second
Osmos: 17000 allocs/second
Voxatron demo: 400 allocs/second

Osmos contains a lot of dynamically rendered sprites compared to the other games, which shows up in the numbers.
Those are big numbers (esp Osmos), but not so astoundingly large that it's clearly a problem. I don't really know our thresholds, though. Steve: do you know how many typed array views it takes to trigger a minor collection and thus what sort of impact optimizing away all these view creations would make?
Flags: needinfo?(sphink)
(In reply to Luke Wagner [:luke] from comment #11)
> Those are big numbers (esp Osmos), but not so astoundingly large that it's
> clearly a problem. I don't really know our thresholds, though. Steve: do
> you know how many typed array views it takes to trigger a minor collection
> and thus what sort of impact optimizing away all these view creations would
> make?

I would like to say that minor GC should be mostly irrelevant here, because at the time of the minor GC all of these temporaries are dead, and we don't spend any time on dead objects during nursery collections. I mean, there are second-order effects from possibly tenuring *other* objects prematurely due to too-frequent minor GCs, but I wouldn't expect those to be too significant. And note that these temporaries are not free: there is a cost in allocating the object, filling in its fields, guarding on its type, extracting stuff back out, etc. Or stuff like that; I don't really know all that goes on.

However: in the specific case of these views, dead objects *do* cost you during the nursery collection. There is a weak pointer from a buffer to its views if it has more than one (and these temporaries will make the buffer have more than one). There is a compartment-wide table mapping buffer objects to lists of views that will need to be swept, which takes time linear in the number of views. I don't think it should be a lot of time -- it's a hashtable lookup, a vector traversal, and a bunch of element swapping to sweep the dead temporaries. But it *is* time, and if there are enough of them, it might hurt. Knowing the time spent in minor GCs would help.

I still don't think the number of objects required to trigger a minor collection is relevant, because if these are all temporaries then the cost is linear in the total number of temporary views.
But on desktop, it looks like the initial nursery size is 16MB, and at 64 bytes per view object (?), that's 256k temporaries before you'd trigger a nursery collection from the temporaries alone. needinfo? terrence to check how much I'm lying.
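The arithmetic in that estimate, spelled out (both constants are the assumed values from the comment, not measured figures):

```javascript
// Assumed: 16MB initial nursery, 64 bytes per view object.
const nurseryBytes = 16 * 1024 * 1024;
const viewBytes = 64;

// Number of dead temporary views it would take to fill the nursery
// and trigger a minor collection on their own.
const viewsPerMinorGC = nurseryBytes / viewBytes;
console.assert(viewsPerMinorGC === 256 * 1024); // 256k temporaries
```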
Flags: needinfo?(sphink) → needinfo?(terrence)
Ah, so IIUC your point correctly: if we *were* to be emitting so many temporary views as to trigger many minor GCs per frame (which, if 256k is our limit, we aren't), they'd be super-fast b/c nothing was living (except the unlucky one live when the minor gc happened).
(In reply to Luke Wagner [:luke] from comment #13)
> Ah, so IIUC your point correctly: if we *were* to be emitting so many
> temporary views as to trigger many minor GCs per frame (which, if 256k is
> our limit, we aren't), they'd be super-fast b/c nothing was living (except
> the unlucky one live when the minor gc happened).

That's the main point, but the secondary point is: these might *not* be super-fast minor GCs because of the weak ref table (InnerViewTable). I'm not sure how expensive that can be.
Attached file js_minorgc_times.txt
Attached is a minor GC profiling log for a 10-minute session playing Osmos through the first few levels, notably including the first massive gravity level, which animates thousands of droplets around the screen constantly.

tl;dr: we're getting ~1 minor GC per second, taking roughly 200-300us and promoting 0.5-1.5% of the minor heap. After warmup the minor heap stabilizes at 1MiB.

Summary: GGC is operating nominally. Gameplay was generally smooth as silk except for the odd GC hiccup every 30 seconds or so. One odd thing I do see in the log is that we're occasionally firing two GCs almost back to back. We should find out what's up with that. I'll grab a full GC log next so we can get a peek at long-term behavior.
Flags: needinfo?(terrence)
Attached file moz_gctimer.txt
A random GC from halfway through the log:

GC(T+192.413s) =================================================================
  Reason: CC_WAITING
  Incremental: yes
  Zones Collected: 7 of 7
  Compartments Collected: 230 of 230
  MinorGCs since last GC: 13
  MMU 20ms:39.8%; 50ms:43.1%
  SCC Sweep Total (MaxPause): 4.858ms (4.728ms)
  HeapSize: 44.668 MiB
  Chunk Delta (magnitude): +0 (0)
  ---- Slice 0 ----
    Reason: CC_WAITING
    Reset: no
    Page Faults: 0
    Pause: 10.651ms (@ 0.000ms)
      Begin Callback: 0.001ms
      Wait Background Thread: 0.001ms
      Mark Discard Code: 0.439ms
      Purge: 0.849ms
      Mark: 8.820ms
        Other: 4.819ms
        Mark Roots: 4.001ms
  ---- Slice 1 ----
    Reason: REFRESH_FRAME
    Reset: no
    Page Faults: 0
    Pause: 10.324ms (@ 34.431ms)
      Mark: 10.125ms
  ---- Slice 2 ----
    Reason: REFRESH_FRAME
    Reset: no
    Page Faults: 0
    Pause: 10.231ms (@ 56.670ms)
      Mark: 10.039ms
  ---- Slice 3 ----
    Reason: REFRESH_FRAME
    Reset: no
    Page Faults: 0
    Pause: 8.144ms (@ 77.627ms)
      Wait Background Thread: 0.002ms
      Mark: 7.948ms
        Other: 7.935ms
        Mark Delayed: 0.013ms
  ---- Slice 4 ----
    Reason: REFRESH_FRAME
    Reset: no
    Page Faults: 0
    Pause: 12.040ms (@ 96.584ms)
      Wait Background Thread: 0.001ms
      Mark: 0.011ms
        Mark Delayed: 0.009ms
      Sweep: 11.372ms
        Other: 0.272ms
        Mark During Sweeping: 1.465ms
          Mark Incoming Black Pointers: 0.001ms
          Mark Weak: 0.272ms
          Mark Incoming Gray Pointers: 0.002ms
          Mark Gray: 1.167ms
          Mark Gray and Weak: 0.021ms
        Finalize Start Callback: 0.179ms
        Sweep Atoms: 2.084ms
        Sweep Compartments: 4.502ms
          Sweep Discard Code: 0.459ms
          Sweep Inner Views: 0.015ms
          Sweep Cross Compartment Wrappers: 0.355ms
          Sweep Base Shapes: 0.460ms
          Sweep Initial Shapes: 0.692ms
          Sweep Type Objects: 0.361ms
          Sweep Breakpoints: 0.002ms
          Sweep Regexps: 0.065ms
          Sweep Miscellaneous: 0.549ms
          Discard Analysis: 3.653ms
            Discard TI: 0.912ms
            Free TI Arena: 0.001ms
            Sweep Types: 2.735ms
        Sweep Object: 0.351ms
        Sweep String: 0.086ms
        Sweep Script: 0.762ms
        Sweep Shape: 0.805ms
        Sweep JIT code: 0.110ms
        Finalize End Callback: 0.437ms
      Deallocate: 0.319ms
      End Callback: 0.007ms
  ---- Totals ----
    Total Time: 51.390
    Max Pause: 12.040
    Begin Callback: 0.001ms
    Wait Background Thread: 0.004ms
    Mark Discard Code: 0.439ms
    Purge: 0.849ms
    Mark: 36.943ms
      Other: 32.920ms
      Mark Roots: 4.001ms
      Mark Delayed: 0.022ms
    Sweep: 11.372ms
      Other: 0.272ms
      Mark During Sweeping: 1.465ms
        Mark Incoming Black Pointers: 0.001ms
        Mark Weak: 0.272ms
        Mark Incoming Gray Pointers: 0.002ms
        Mark Gray: 1.167ms
        Mark Gray and Weak: 0.021ms
      Finalize Start Callback: 0.179ms
      Sweep Atoms: 2.084ms
      Sweep Compartments: 4.502ms
        Sweep Discard Code: 0.459ms
        Sweep Inner Views: 0.015ms
        Sweep Cross Compartment Wrappers: 0.355ms
        Sweep Base Shapes: 0.460ms
        Sweep Initial Shapes: 0.692ms
        Sweep Type Objects: 0.361ms
        Sweep Breakpoints: 0.002ms
        Sweep Regexps: 0.065ms
        Sweep Miscellaneous: 0.549ms
        Discard Analysis: 3.653ms
          Discard TI: 0.912ms
          Free TI Arena: 0.001ms
          Sweep Types: 2.735ms
      Sweep Object: 0.351ms
      Sweep String: 0.086ms
      Sweep Script: 0.762ms
      Sweep Shape: 0.805ms
      Sweep JIT code: 0.110ms
      Finalize End Callback: 0.437ms
    Deallocate: 0.319ms
    End Callback: 0.007ms
A more in-depth analysis of the long-lived heap.

GC invocations by reason:
  CC_WAITING: 11
  MAYBEGC: 9
  DOM_WORKER: 8
  DESTROY_RUNTIME: 2
  DESTROY_CONTEXT: 2
  ALLOC_TRIGGER: 1
  SET_NEW_DOCUMENT: 1
  API: 1

Utilization: very low. After browser startup we appear to allocate very slowly -- most GCs are looking at much, much less than 1MiB of new data, and none actually result in any chunks being freed. Towards the end of the demo, allocation appears to level off entirely and we're not adding anything to the GC heap, although we do see a malloc-based trigger in that timeframe. Growth appears to be slow enough that you could probably finish the full game without actually needing to GC.

Latency: ~10ms while gaming. Outside of the non-incremental startup and shutdown GCs, the max GC-related pause is 13.689ms. I did not see Andrew's 40ms second slice; perhaps that happens on a different game. On the plus side, we're doing pretty well at hitting 10ms; however, we should be budgeting games for 5ms slices. I guess that heuristic is not being hit for these games for some reason.
Actually, looks like I'm remembering incorrectly. The default is 10ms slices on desktop and 30ms on b2g. That will double if we are in high-frequency mode, but is generally going to be 10ms. When triggered via gecko's slice timer (100ms intervals) with reason==INTER_SLICE_GC, gecko manually passes a 40ms budget, overriding the 10ms default. Between the traces here and bug 1084672, it looks like NotifyDidPaint is getting called for these traces, but may not be called correctly in other situations.
Severity: normal → S3

Does this still make any sense ten years on? Or should we just close this one?

Blocks: sm-js-perf
Flags: needinfo?(sphink)
