On a sample run of speedometer, we run IonBuilder 32797 times, on 996 unique scripts (same file/line/column). Here is a rough breakdown of what happens with those IonBuilders: 2146 have their compilations cancelled before finishing - 1234 are cancelled by a nursery GC - 754 are cancelled by GC mark/sweep actions - 192 are cancelled by TI constraints on script observed types 1925 are cancelled while linking on the main thread due to TI constraints 15738 of the remaining IonScripts destroyed by GC 12695 IonScripts are invalidated after being created - 9060 are invalidated due to TI constraints -- 7772 are due to new observed types on scripts -- 1308 are due to changes to heap type set properties - 2120 are unhandled accesses on idempotent ICs - 633 are shape guard failures - 365 are replaced by a newly compiled ion script - 281 had too many bailouts - 197 had a bounds check failure On workloads like GMail, IonBuilder is up to 4% of our main thread time in JS (as measured previously using the tracelogger), and it might be even more on speedometer. Doing lots of Ion compilations has fallout in other areas, like more time spent in baseline execution, and more work done on background threads. It seems like there is a lot of fat to trim here. Mainly, it would be great to reduce the number of IonScripts destroyed by GC when running this test suite and similar websites that do lots of JS but are not playing animations and hitting our preserve-jit-code triggers. Lesser targets would include seeing if we can be more precise when cancelling compilations during minor GCs (if any IonBuilder has nursery pointers than all builders in the runtime end up being cancelled on the next minor GC). I'm also curious about what is causing all the invalidations due to unhandled accesses in idempotent ICs. Longer range, the way in which we keep track of type sets is pretty simple and it would be interesting to see if more precisely capturing the assumptions we make about certain observed type sets would reduce the number of TI related invalidations performed. An example is that if we have no information about definite properties or property types on a property access, then we fall back to shapes and observed types and the exact group of the accessed object is no longer pertinent to whether the code will behave correctly.
Hmm, while there are 996 unique script locations Ion compiled on speedometer, there are 21k different scripts, presumably due to lots of iframes etc. loading similar content. Preserving jitcode during GC doesn't make much of a difference in the number of Ion compilations performed, but invalidations and cancellations would still be good to address.
Thanks for looking into this! See also bug 1382650, I think Ion currently inlines too much code too soon.
So, with the experimental work in bug 1384334 we can reduce the number of invalidations caused by type inference from ~11500 to ~7700 (a 1/3 reduction) without affecting the quality of the generated code. We might be able to do a little better on that front, but not much; generated code has a very tight dependence on the types which are used, in particular accessed objects. Another option to consider is to decrease the quality of the generated code, in order to A) reduce the number of invalidations, B) make compilation faster, and C) make it easier to reuse ion compilations between different scripts with the same location (same file/line/column), including caching ion code on disk for this purpose (see bug 1384334). To get some idea of the impact of this idea, I deoptimized ion so that it doesn't get any information about objects from TI: type barriers and inline caches are used everywhere, and no inlining is performed. Running in this way, about the only thing ion depends on is the known MIR type of all argument and observed value type sets, and the bug 1384334 work is able to leverage that to reduce the number of TI invalidations as a result. I did a sample run with these changes, and of 35937 ion compilations performed only 2290 were invalidated due to TI changes, an 80% reduction from the existing state of things. Moreover, 33442 of the compilations have the same MIR types that were used in an earlier compilation of that script location, and another 1270 have MIR types that are a subset of those used in an earlier compilation. These should be able to reuse the code from those earlier compilations (a new IonScript and JitCode would need to be created, but the data from the other compilation could be copied in and fixed up a little bit, which is much easier when you don't bake in data from TI or shape pointers, etc). Assuming that works out, only 1225 compilations would need IonBuilder to run. I think that it's worth pursuing a strategy like this: we can spend much less time ion compiling and with fewer TI invalidations we will spend more time in ion and less in baseline. Because the ion code generated here will be considerably slower than full ion, this would fit in as part of a tiered compilation: leave the interpreter and baseline as they are, when the code gets pretty warm do a basic ion compile (or reuse another basic ion compile), and when it gets pretty hot (hotter than our current ion compile thresholds) do a full ion compile. A nice thing about this is that the optimizations which basic ion compiles perform are flexible. Because the basic compile can be invalidated, it can do things like optimizing property accesses more if it wants, or perform a limited amount of inlining. The trade/offs are different from full ion, since basic ion cares more about reducing compilation time and avoiding invalidation, but there is a lot of room for tuning. It might also be worth considering reducing the use count threshold for basic ion compilation, which will further reduce the time spent in baseline. I experimented with a few different thresholds on speedometer to see how the number of compilations, TI invalidations, and reused compilations were affected. UseCount IonCompilations NonReusedCompilations TypeInferenceInvalidations 1000 35937 1225 2290 500 50347 2149 3832 250 68471 2913 5106 100 97445 3505 8824 At least on this workload it would be conceivable to use a lower threshold, and interesting to experiment with when this stuff is further developed.
Created attachment 8909504 [details] [diff] [review] WIP I suppose I should post an update to this bug. Ion ICs are too slow to really be usable as part of a tier in between baseline and full Ion, and making them faster has its own performance issues (see bug 1394006). I've been working on what is essentially a rewrite of how we keep track of information about type sets and objects during Ion compilations, in a way that will allow any Ion compilation to potentially be reused at another time and place. The main change to the Ion compiler is that instead of pointing to GC things directly (other than atoms and symbols) we need a layer analogous to the temporary type sets which can be used to keep track of both where the GC thing came from and what Ion is assuming about it, so that those dependencies can be checked (and analogous GC things determined) when reusing the IonScript for compiling a different script. This is mostly done; there are lots of superficial changes required to IonBuilder and related code, and there are lots of substantial changes required to TI internals, which are still in progress. Unfortunately, pretty much all of this work is required in order to accurately measure how many Ion compilations we'll ultimately be able to perform by reusing an existing compilation.
Assignee: nobody → bhackett1024
Created attachment 8912861 [details] [diff] [review] WIP This finally has enough pieces filled in to start testing in the browser, to see how often we can reuse compilations and to try tuning IonBuilder to improve that metric.
Attachment #8909504 - Attachment is obsolete: true
Created attachment 8916686 [details] [diff] [review] WIP This patch fixes various issues that showed up when testing in the browser. With this patch, on a sample run of speedometer we do 21797 ion compilations: - 3766 are on scripts whose script data has not been successfully ion compiled before. (This seems high to me, since there are ~1000 different script data which we should see, there might be some lingering bugs here.) - 6499 are on scripts whose script data has been ion compiled before, but none of the previous compilations have matching type / baseline data and the compilation needs to be repeated. - 11532 are on scripts whose script data has been ion compiled before, and enough data matches up between the two compilations that the previous one should be able to be reused. These measurements do not quite allow apples-to-apples comparisons with those in earlier patches. Mainly, some Ion optimizations are turned off which affect type dependencies and reduce the number of invalidations; these are mostly minor ones, but two important ones --- inlining and inline baseline-shape-based property accesses --- need to be turned back on. I'll work on that next.
Attachment #8912861 - Attachment is obsolete: true
status-firefox57: --- → wontfix
Priority: -- → P2
Created attachment 8917838 [details] [diff] [review] WIP This patch allows Ion to inline at monomorphic call sites and to inline monomorphic shape accesses from baseline, and still allow those compilations to be transplanted (reused for another compilation). This reduces the fraction of compilations that can be transplanted from earlier compilations (the third category in the above comment) from 53% to 20%, but it shouldn't be hard to get that number back up. I also measured the number of type inference based invalidations we're doing now, and compared to comment 0 it has gone down by about half, ~11k to 5.5k.
Attachment #8916686 - Attachment is obsolete: true
Created attachment 8918358 [details] [diff] [review] WIP This patch fixes some bugs and makes some refinements to the constraints used for transplanting, bringing the fraction of Ion compilations that can be transplanted from earlier compilations up to 67%. Since there are fewer Ion compilations being performed now (due mainly to the reduced number of TI invalidations), in comparison to the baseline numbers in comment 0 we should be able to get an overall 77% reduction in the number of Ion compilations performed. I'll look to see if there are more easy wins here, but I think these numbers are good enough that it's worth it to start on the next phase, actually performing these compilation transplants. Also, I tested this stuff on gmail and facebook sessions, and it has benefits on those sites as well. On gmail, 48% of 1015 Ion compilations can be transplanted from earlier ones, and on facebook 40% of 2503 Ion compilations can be as well.
Attachment #8917838 - Attachment is obsolete: true
Created attachment 8918390 [details] [diff] [review] Part 1 - Fine grained type constraints and compilation comparison logic This fixes a small bug which was causing more than two thousand compilations on speedometer to be invalidated as soon as they were finished (subset constraints were being added on type sets that were never subsets of each other). With this fixed, 77% of the 21667 Ion compilations being performed on speedometer can be transplanted from an earlier compilation, representing a potential 85% reduction in the number of Ion compilations performed from the initial numbers in comment 0.
Attachment #8918358 - Attachment is obsolete: true
Created attachment 8923611 [details] [diff] [review] Part 2 WIP This patch adds all the infrastructure and logic needed to transplant Ion compilations. Mainly, the MacroAssembler keeps track of which code offsets refer to GC things or immediate pointers that will need to be changed when transplanting, and we keep that information around and use it to fix up the jitcode during the transplant. Things still line up pretty well with the estimates provided by previous patches: on a sample run of speedometer we transplant 71% of the ~22k Ion compilations performed, for an 81% reduction in the number of compilations performed vs. the original baseline. I'll start doing some measurements of how performance is affected by this patch tomorrow.
Can someone please add "PERF" key word? Thank you.
OK, so I measured the amounts of time spent on speedometer in the various parts of the Ion compilation pipeline. Without transplanting (i.e. trunk, but with the same set of optimizations disabled as would be for transplantable compilations): IonBuilder: 21491 runs, 3978.64 ms [Main thread time spent running IonBuilder for an Ion compilation] CompileBackEnd: 18885 runs, 20633.69 ms [Off thread time spent compiling down to assembly] With transplanting: IonBuilder: 6351 runs, 1003.30 ms CompileBackEnd: 4368 runs, 2461.03 ms Transplant: 22338 runs, 1686.50 ms [Main thread time spent attempting to transplant compilations and generate the resulting IonScripts] So, transplanting saves 75% of IonBuilder time and 88% of off thread compilation time, but the time spent transplanting eats up 56% of the gains from the reduced IonBuilder time. The benchmark score with transplanting is about 2% better. Another issue is that the optimizations that have been changed/disabled for transplanting hurt the benchmark score a lot, reducing it by ~20% (I think a lot of this is the new type barrier ICs, which need to be restructured to avoid the huge perf cost of the indirect jumps used by Ion ICs). I think the best way forward here is to restore all the optimizations that have been disabled for transplanting, so that we can transplant any Ion compilation and don't need to deal with new compilation heuristics, tiered Ion compilation, or other confounding factors in this bug. After that, the cost of doing transplants needs to be addressed as well, there are some significant speedups that can be made there I think. I was also curious to see what effect IonBuilder and backend compilation have on the benchmark score in a normal trunk build. It's not perfect, but these can be approximated by modifying the benchmark to measure time with Date.now instead of window.performance.now, and displacing the result of Date.now by the time spent in IonBuilder and backend compilation (which needs to be forced to happen on the main thread). This gives a maximum possible speedup of 8% from eliminating the cost of both these things. I'm hoping with transplanting to get to a 4-5% speedup, maybe more on lower end machines (where the cost of off thread compilation has a larger impact).
(In reply to Worcester12345 from comment #11) > Can someone please add "PERF" key word? Thank you. still
You need to log in before you can comment on or make changes to this bug.