Closed Bug 593931 Opened 14 years ago Closed 14 years ago

inline js_GetPropertyHelper() and friends more aggressively

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: n.nethercote, Assigned: n.nethercote)

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 2 obsolete files)

I noticed in string-fasta that js_GetProperty() is called a *lot*.  

Furthemore, js_GetProperty() calls js_GetPropertyHelper(), which calls js_LookupPropertyWithFlags() and js_NativeGet().  None of these calls are inlined, and Cachegrind tells me that the call overhead for each of them is exceeding the amount of work going on within the functions themselves.

Just as an experiment I created copies of js_GetPropertyHelper(), js_LookupPropertyWithFlags() and js_NativeGet(), marked them JS_ALWAYS_INLINE, and ensured that the hot path would call these inlined versions.  (The patch is attached.)  The results were much better than I'd hoped.  Cachegrind says:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|    95.193    94.541 (1.007x) |    25.159    25.123 (1.001x) | 3d-cube
|    57.410    57.410 (------) |    23.099    23.099 (------) | 3d-morph
|   116.739   116.486 (1.002x) |    44.065    44.049 (------) | 3d-raytrace
|    73.043    70.055 (1.043x) |    15.220    15.052 (1.011x) | access-binary-
|   133.642   133.643 (------) |    82.806    82.806 (------) | access-fannkuc
|    37.526    37.529 (------) |    16.700    16.700 (------) | access-nbody
|    61.690    61.692 (------) |    28.901    28.901 (------) | access-nsieve
|    16.349    16.350 (------) |     3.249     3.249 (------) | bitops-3bit-bi
|    45.911    45.912 (------) |    32.701    32.701 (------) | bitops-bits-in
|    24.779    24.781 (------) |    12.019    12.019 (------) | bitops-bitwise
|    63.619    63.621 (------) |    38.301    38.301 (------) | bitops-nsieve-
|    49.008    49.009 (------) |    19.438    19.438 (------) | controlflow-re
|    47.619    47.607 (------) |     5.964     5.964 (------) | crypto-md5
|    30.330    30.331 (------) |     6.809     6.809 (------) | crypto-sha1
|    97.604    97.613 (------) |    17.295    17.295 (------) | date-format-to
|    83.587    81.923 (1.020x) |     9.767     9.672 (1.010x) | date-format-xp
|    54.274    54.275 (------) |    31.292    31.292 (------) | math-cordic
|    29.549    29.795 (0.992x) |     6.232     6.462 (0.964x) | math-partial-s
|    31.019    31.025 (------) |    13.319    13.319 (------) | math-spectral-
|    58.528    58.528 (------) |    34.592    34.592 (------) | regexp-dna
|    39.745    39.747 (------) |     9.408     9.408 (------) | string-base64
|   118.456   106.890 (1.108x) |    25.728    25.027 (1.028x) | string-fasta
|   139.147   139.575 (0.997x) |    17.703    17.733 (0.998x) | string-tagclou
|   180.727   175.309 (1.031x) |    21.897    21.590 (1.014x) | string-unpack-
|    58.414    58.414 (------) |    11.706    11.706 (------) | string-validat
-------
|  1743.920  1722.075 (1.013x) |   553.382   552.318 (1.002x) | all

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|  1966.223  1966.107 (------) |  1735.163  1735.157 (------) | v8-crypto
|  1557.612  1553.712 (1.003x) |   958.631   957.417 (1.001x) | v8-deltablue
|  1812.821  1738.561 (1.043x) |   531.263   527.090 (1.008x) | v8-earley-boye
|  1460.158  1432.332 (1.019x) |   293.350   291.761 (1.005x) | v8-raytrace
|   873.751   873.738 (------) |   374.289   374.288 (------) | v8-regexp
|  1213.670  1210.363 (1.003x) |  1162.236  1159.012 (1.003x) | v8-richards
|  1321.328  1306.473 (1.011x) |   131.794   131.562 (1.002x) | v8-splay
-------
| 10205.566 10081.289 (1.012x) |  5186.730  5176.290 (1.002x) | all

Measurements on my laptop suggest that the timing improvements are even better, giving perhaps a 5-8ms win for SS on the AWFY machines.  The timings for V8 were less conclusive, maybe a 30ms win;  this surprised me because I thought property accesses were more important in V8.

Now obviously these functions are (surprisingly!) large, they each have multiple callers, and just duplicating them is bad.  But there must be some way to rearrange them so the hot cases can be inlined to avoid the calling overhead.  js_NativeGet() looks like the easiest, it almost always returns after the shape->hasDefaultGetter() test.

Furthermore, js_GetProperty() is mostly called from this function:

    JSBool getProperty(JSContext *cx, jsid id, js::Value *vp) {
        js::PropertyIdOp op = getOps()->getProperty;
        return (op ? op : js_GetProperty)(cx, this, id, vp);
    }

and I can't help wondering if this could be done better as well.  Eg. could 'op' be made to point to js_GetProperty() rather than being NULL?  Then the test wouldn't be necessary.  The same could be done with lots of the other object ops.
Attached file Sunspider timings (obsolete) —
Comment on attachment 473875 [details]
Sunspider timings

Sorry, those sunspider timings were meant to go on bug 595048.
Attachment #473875 - Attachment is obsolete: true
(In reply to comment #0)
> Created attachment 472564 [details] [diff] [review]
> patch (against TM 52850:9d7c084a9a56)
> 
> I noticed in string-fasta that js_GetProperty() is called a *lot*. 

But this is addressed by other open bugs:

bug 580138 (bhackett-owned with partial patch)
bug 560989 (unowned, dup bait -- gal!)

Not sayin' we don't need to clean up js_GetPropertyHelper (and much in jsobj.cpp for that matter) but if we want fasta wins, this may be the wrong path. The issue is that infernal for-in loop and the use of the enumerated property name in [] to index the "next" property. Lotta predictability, not quite a PIC use case (mega-morphic in general), still looks winnable.

/be
(In reply to comment #3)
> (In reply to comment #0)
> > Created attachment 472564 [details] [diff] [review] [details]
> > patch (against TM 52850:9d7c084a9a56)
> > 
> > I noticed in string-fasta that js_GetProperty() is called a *lot*. 
> 
> But this is addressed by other open bugs:
> 
> bug 580138 (bhackett-owned with partial patch)
> bug 560989 (unowned, dup bait -- gal!)
> 
> Not sayin' we don't need to clean up js_GetPropertyHelper (and much in
> jsobj.cpp for that matter) but if we want fasta wins, this may be the wrong
> path.

Good to know, but the inlining helps with some other cases too :)
Yeah, my studies of JSC and conversations with Olliej support inlining hot paths. The NATIVE_GET and NATIVE_SET ugly C macros in jsinterp.cpp date from an earlier inlining win.

Don't want to overrotate on fasta's account, 'sall. It should get a more targeted fix (is getting, when bhackett returns from his search for the elusive GPO).

/be
This patch creates always-inline variants of js_GetPropertyHelper(),
js_LookupPropertyWithFlags() and js_NativeGet().

Cachegrind results:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|   101.241   100.576 (1.007x) |    47.530    47.494 (1.001x) | 3d-cube
|    47.191    47.185 (------) |    25.565    25.565 (------) | 3d-morph
|   108.085   107.784 (1.003x) |    43.385    43.368 (------) | 3d-raytrace
|    44.014    41.022 (1.073x) |    14.868    14.699 (1.011x) | access-binary-
|   112.003   111.996 (------) |    92.713    92.713 (------) | access-fannkuc
|    38.313    38.292 (1.001x) |    17.191    17.190 (------) | access-nbody
|    48.416    48.410 (------) |    28.185    28.185 (------) | access-nsieve
|    16.376    16.370 (------) |     3.250     3.250 (------) | bitops-3bit-bi
|    45.760    45.754 (------) |    32.524    32.524 (------) | bitops-bits-in
|    24.795    24.789 (------) |    12.020    12.020 (------) | bitops-bitwise
|    52.237    52.231 (------) |    37.968    37.968 (------) | bitops-nsieve-
|    30.672    30.666 (------) |    17.541    17.541 (------) | controlflow-re
|    47.544    47.543 (------) |     6.368     6.368 (------) | crypto-md5
|    30.179    30.171 (------) |     7.046     7.046 (------) | crypto-sha1
|    78.958    78.943 (------) |    22.471    22.471 (------) | date-format-to
|    80.704    80.262 (1.006x) |     9.774     9.741 (1.003x) | date-format-xp
|    54.165    54.155 (------) |    31.075    31.075 (------) | math-cordic
|    29.387    29.626 (0.992x) |     6.237     6.468 (0.964x) | math-partial-s
|    31.085    31.075 (------) |    13.416    13.416 (------) | math-spectral-
|    58.289    58.283 (------) |    34.590    34.590 (------) | regexp-dna
|    39.784    39.776 (------) |     9.414     9.414 (------) | string-base64
|    95.634    95.620 (------) |    24.845    24.845 (------) | string-fasta
|   121.490   121.475 (------) |    17.444    17.444 (------) | string-tagclou
|   174.195   168.893 (1.031x) |    21.614    21.316 (1.014x) | string-unpack-
|    52.214    52.206 (------) |     8.604     8.604 (------) | string-validat
-------
|  1562.742  1553.117 (1.006x) |   585.648   585.326 (1.001x) | all

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|  1791.482  1791.358 (------) |  1628.001  1627.995 (------) | v8-crypto
|  1640.548  1637.368 (1.002x) |  1109.452  1109.117 (------) | v8-deltablue
|  1216.334  1140.729 (1.066x) |   528.420   524.162 (1.008x) | v8-earley-boye
|  1246.646  1218.230 (1.023x) |   306.273   304.668 (1.005x) | v8-raytrace
|   913.893   913.752 (------) |   374.975   374.968 (------) | v8-regexp
|  1204.717  1204.406 (------) |  1163.749  1163.531 (------) | v8-richards
|  1208.137  1201.750 (1.005x) |   127.181   126.666 (1.004x) | v8-splay
-------
|  9221.761  9107.596 (1.013x) |  5238.054  5231.110 (1.001x) | all


Timing-wise, it looks to be worth about 3ms on SS and maybe 50ms on V8.

Thanks to cdleary for helping me realize that you could use wrapper
functions like this to guarantee inlining at only some of the call sites of
a function.
Attachment #472564 - Attachment is obsolete: true
Attachment #476727 - Flags: review?(cdleary)
cdleary: ping?
njn: sorry for the delay, reviewing now!
Comment on attachment 476727 [details] [diff] [review]
patch (against TM 53640:dfcf5611ce02)

Gotta love simple inlining wins!
Attachment #476727 - Flags: review?(cdleary) → review+
http://hg.mozilla.org/tracemonkey/rev/46038daefd9f

I re-ran Cachegrind, looks like the access-binary-trees improvement was cannibalized by a different patch (we're down to 24.6M instructions for it now), so the overall instruction count improvement dropped from 1.006x to 1.004x.  Still not bad for an easy patch.  I didn't get AWFY numbers because the regress-x86 page was disabled when I landed the patch :(
Whiteboard: fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/46038daefd9f
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.