Bug 564548 (Closed): double the speed of Math.pow when the second argument is an integer
Opened 15 years ago · Closed 15 years ago
Categories: Core :: JavaScript Engine, defect
Status: RESOLVED FIXED
People: Reporter: sayrer; Assigned: billm
Whiteboard: fixed-in-tracemonkey
Attachments
(2 files, 4 obsolete files)
- 390 bytes, text/html
- 3.36 KB, patch
I noticed v8 had this optimization, so I grabbed it. My patch is kind of lame, because defining traceable natives with the same number of arguments but different types doesn't seem to work right. I bet we could do better.
This shaves a few ms off math-partial-sums on my box (every other Math.pow call has an integer second argument). On the attached microbenchmark, we go from 12x faster than Chrome 5 to 15x faster.
Comment 1 (Reporter) • 15 years ago
Comment 2 (Reporter) • 15 years ago
Comment 3 • 15 years ago
You can specialize this on trace.
Comment 4 (Reporter) • 15 years ago
(In reply to comment #3)
> You can specialize this on trace.
well, I naively tried this:
JS_DEFINE_TRCINFO_2(math_pow,
(2, (static, DOUBLE, math_powi_tn, DOUBLE, INT32, 1, nanojit::ACC_NONE)),
(2, (static, DOUBLE, math_pow_tn, DOUBLE, DOUBLE, 1, nanojit::ACC_NONE)))
but I ended up getting everything sent to the first function listed there. jorendorff says I need to change some stuff in jstracer.cpp to do what I want. I thought I would see if powi() actually won before diving deeper.
Comment 5 • 15 years ago
It only gets specialized if the incoming value is known to be an int; i.e. pow(5) would get specialized, as would pow(x | 0), and pow(i) inside for (i = ...). As we specialize heap reads better, this will further improve performance down the road.
Comment 6 (Reporter) • 15 years ago
(In reply to comment #5)
> it only gets specialized if the incoming value is known to be an int
The problem is that
Math.pow(2314.1234, -0.5)
gets sent to the version that expects the second argument to be an integer.
Comment 7 • 15 years ago
That shouldn't be the case. Attach the patch and I'll take a look.
Comment 8 • 15 years ago
Sayre, can you attach the patch that attempts to specialize on-trace?
Updated (Reporter) • 15 years ago
Blocks: JaegerSpeed
Comment 10 (Assignee) • 15 years ago
This patch includes Rob's integer pow specialization as well as the sqrt optimization for pow from bug 584223. With regard to the integer stuff, it calls separate pow functions on trace depending on whether the arg has integer or double type (as Andreas requested).
Unfortunately, getting this to work necessitated some changes to how specialized natives are handled. Now when you use JS_DEFINE_TRCINFO_n and some argument is an INT32, that specialization is *only* called for int32s. Previously, doubles were converted to ints to allow the specialization to be used. Consequently, for each function declared with INT32 before, I created a wrapper when the arg is a double that casts to an int and then calls the int specialization.
Note that the code for the integer case in callSpecializedNative still calls d2i. I need this to get the types in the LIR to work out. Hopefully it ultimately compiles to a NOP, since the arg is known to be an integer. Is this true?
Attachment #444212 - Attachment is obsolete: true
Attachment #462990 - Flags: review?
Updated (Assignee) • 15 years ago
Attachment #462990 - Flags: review? → review?(gal)
Comment 11 • 15 years ago
Comment on attachment 462990 [details] [diff] [review]
Combined pow optimizations (integer pow and sqrt specialization)
>diff -r 6d9d375a035f js/src/jsmath.cpp
>--- a/js/src/jsmath.cpp Wed Aug 04 15:11:30 2010 -0700
>+++ b/js/src/jsmath.cpp Wed Aug 04 16:48:20 2010 -0700
>@@ -393,43 +393,84 @@ js_math_min(JSContext *cx, uintN argc, V
> } else {
> z = (x < z) ? x : z;
> }
> }
> vp->setNumber(z);
> return JS_TRUE;
> }
>
>+static jsdouble powi(jsdouble x, jsint y) {
>+ jsuint n = (y < 0) ? -y : y;
>+ jsdouble m = x;
>+ jsdouble p = 1;
>+ while (true) {
>+ if ((n & 1) != 0) p *= m;
>+ n >>= 1;
>+ if (n == 0) {
>+ if (y < 0) {
>+ // Unfortunately, we have to be careful when p has reached
>+ // infinity in the computation, because sometimes the higher
>+ // internal precision in the pow() implementation would have
>+ // given us a finite p. This happens very rarely.
>+
>+ jsdouble result = 1.0 / p;
>+ return (result == 0 && isinf(p))
Align the ? with the (
>+ ? pow(x, static_cast<jsdouble>(y)) // Avoid pow(double, int).
>+ : result;
>+ } else {
>+ return p;
>+ }
>+ }
>+ m *= m;
>+ }
>+}
>+
> static JSBool
> math_pow(JSContext *cx, uintN argc, Value *vp)
> {
> jsdouble x, y, z;
>
> if (argc <= 1) {
> vp->setDouble(js_NaN);
> return JS_TRUE;
> }
> if (!ValueToNumber(cx, vp[2], &x))
> return JS_FALSE;
> if (!ValueToNumber(cx, vp[3], &y))
> return JS_FALSE;
>+ /* Special case for square roots */
>+ if (JSDOUBLE_IS_FINITE(x)) {
>+ if (y == 0.5) {
>+ vp->setNumber(sqrt(x));
>+ return JS_TRUE;
>+ } else if (y == -0.5) {
>+ vp->setNumber(1.0/sqrt(x));
Is there no rsqrt in libc? lame.
>+ return JS_TRUE;
>+ }
>+ }
> /*
> * Because C99 and ECMA specify different behavior for pow(),
> * we need to wrap the libm call to make it ECMA compliant.
> */
> if (!JSDOUBLE_IS_FINITE(y) && (x == 1.0 || x == -1.0)) {
> vp->setDouble(js_NaN);
> return JS_TRUE;
> }
> /* pow(x, +-0) is always 1, even for x = NaN. */
> if (y == 0) {
> vp->setInt32(1);
> return JS_TRUE;
> }
>- z = pow(x, y);
>+
>+ if (vp[3].isInt32()) {
>+ z = powi(x, vp[3].toInt32());
>+ } else {
>+ z = pow(x, y);
>+ }
> vp->setNumber(z);
> return JS_TRUE;
> }
>
> static const int64 RNG_MULTIPLIER = 0x5DEECE66DLL;
> static const int64 RNG_ADDEND = 0xBLL;
> static const int64 RNG_MASK = (1LL << 48) - 1;
> static const jsdouble RNG_DSCALE = jsdouble(1LL << 53);
>@@ -668,24 +709,44 @@ math_min_tn(jsdouble d, jsdouble p)
> return d;
> }
> return (p < d) ? p : d;
> }
>
> static jsdouble FASTCALL
> math_pow_tn(jsdouble d, jsdouble p)
> {
>+ /* Special case for square roots */
>+ if (JSDOUBLE_IS_FINITE(d)) {
>+ if (p == 0.5) {
>+ return sqrt(d);
>+ } else if (p == -0.5) {
>+ return 1.0/sqrt(d);
>+ }
>+ }
You could special case for this in the trace code generation but maybe not worth it.
> if (!JSDOUBLE_IS_FINITE(p) && (d == 1.0 || d == -1.0))
> return js_NaN;
> if (p == 0)
> return 1.0;
>+ int32 i;
>+ if (JSDOUBLE_IS_INT32(p, &i)) {
>+ return powi(d, i);
>+ }
> return pow(d, p);
> }
>
> static jsdouble FASTCALL
>+math_powi_tn(jsdouble d, int32 p)
>+{
>+ if (p == 0)
>+ return 1.0;
>+ return powi(d, p);
>+}
>+
>+static jsdouble FASTCALL
> math_random_tn(JSContext *cx)
> {
> return random_nextDouble(cx);
> }
>
> static jsdouble FASTCALL
> math_round_tn(jsdouble x)
> {
>@@ -713,17 +774,18 @@ JS_DEFINE_TRCINFO_1(math_atan2,
> JS_DEFINE_TRCINFO_1(js_math_floor,
> (1, (static, DOUBLE, math_floor_tn, DOUBLE, 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(math_log,
> (1, (static, DOUBLE, math_log_tn, DOUBLE, 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(js_math_max,
> (2, (static, DOUBLE, math_max_tn, DOUBLE, DOUBLE, 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(js_math_min,
> (2, (static, DOUBLE, math_min_tn, DOUBLE, DOUBLE, 1, nanojit::ACCSET_NONE)))
>-JS_DEFINE_TRCINFO_1(math_pow,
>+JS_DEFINE_TRCINFO_2(math_pow,
>+ (2, (static, DOUBLE, math_powi_tn, DOUBLE, INT32, 1, nanojit::ACCSET_NONE)),
> (2, (static, DOUBLE, math_pow_tn, DOUBLE, DOUBLE, 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(math_random,
> (1, (static, DOUBLE, math_random_tn, CONTEXT, 0, nanojit::ACCSET_STORE_ANY)))
> JS_DEFINE_TRCINFO_1(js_math_round,
> (1, (static, DOUBLE, math_round_tn, DOUBLE, 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(js_math_ceil,
> (1, (static, DOUBLE, math_ceil_tn, DOUBLE, 1, nanojit::ACCSET_NONE)))
>
>diff -r 6d9d375a035f js/src/jsnum.cpp
>--- a/js/src/jsnum.cpp Wed Aug 04 15:11:30 2010 -0700
>+++ b/js/src/jsnum.cpp Wed Aug 04 16:48:20 2010 -0700
>@@ -617,16 +617,18 @@ IntToCString(jsint i, jsint base, char *
> *--cp = '-';
>
> JS_ASSERT(cp >= buf);
> return cp;
> }
>
> static JSString * JS_FASTCALL
> js_NumberToStringWithBase(JSContext *cx, jsdouble d, jsint base);
>+static JSString * JS_FASTCALL
>+js_NumberToStringWithDoubleBase(JSContext *cx, jsdouble d, jsdouble base);
>
> static JSBool
> num_toString(JSContext *cx, uintN argc, Value *vp)
> {
> const Value *primp;
> if (!js_GetPrimitiveThis(cx, vp, &js_NumberClass, &primp))
> return JS_FALSE;
> double d = primp->toNumber();
>@@ -846,20 +848,22 @@ num_toPrecision(JSContext *cx, uintN arg
> if (argc == 0 || vp[2].isUndefined())
> return num_toString(cx, 0, vp);
> return num_to(cx, DTOSTR_STANDARD, DTOSTR_PRECISION, 1, MAX_PRECISION, 0,
> argc, vp);
> }
>
> #ifdef JS_TRACER
>
>-JS_DEFINE_TRCINFO_2(num_toString,
>+JS_DEFINE_TRCINFO_3(num_toString,
> (2, (extern, STRING_RETRY, js_NumberToString, CONTEXT, THIS_DOUBLE,
> 1, nanojit::ACCSET_NONE)),
> (3, (static, STRING_RETRY, js_NumberToStringWithBase, CONTEXT, THIS_DOUBLE, INT32,
>+ 1, nanojit::ACCSET_NONE)),
>+ (3, (static, STRING_RETRY, js_NumberToStringWithDoubleBase, CONTEXT, THIS_DOUBLE, DOUBLE,
> 1, nanojit::ACCSET_NONE)))
>
> #endif /* JS_TRACER */
>
> static JSFunctionSpec number_methods[] = {
> #if JS_HAS_TOSOURCE
> JS_FN(js_toSource_str, num_toSource, 0,JSFUN_THISP_NUMBER),
> #endif
>@@ -1049,16 +1053,22 @@ js_IntToString(JSContext *cx, jsint i)
> if (jsuint(i) < INT_STRING_LIMIT)
> return JSString::intString(i);
>
> char buf[12];
> return js_NewStringCopyZ(cx, IntToCString(i, 10, buf, sizeof buf));
> }
>
> static JSString * JS_FASTCALL
>+js_NumberToStringWithDoubleBase(JSContext *cx, jsdouble d, jsdouble base)
>+{
>+ return js_NumberToStringWithBase(cx, d, (jsint)base);
>+}
>+
>+static JSString * JS_FASTCALL
> js_NumberToStringWithBase(JSContext *cx, jsdouble d, jsint base)
> {
> /*
> * The longest possible result here that would need to fit in buf is
> * (-0x80000000).toString(2), which has length 33. (This can produce
> * longer results, but in those cases buf is not used; see comment at
> * NumberToCString.)
> */
>diff -r 6d9d375a035f js/src/jsstr.cpp
>--- a/js/src/jsstr.cpp Wed Aug 04 15:11:30 2010 -0700
>+++ b/js/src/jsstr.cpp Wed Aug 04 16:48:20 2010 -0700
>@@ -2845,23 +2845,31 @@ str_sub(JSContext *cx, uintN argc, Value
> #ifdef JS_TRACER
> JSString* FASTCALL
> js_String_getelem(JSContext* cx, JSString* str, int32 i)
> {
> if ((size_t)i >= str->length())
> return NULL;
> return JSString::getUnitString(cx, str, size_t(i));
> }
>+
>+JSString* FASTCALL
>+js_String_getelem_double(JSContext* cx, JSString* str, jsdouble i)
>+{
>+ return js_String_getelem(cx, str, (int32)i);
>+}
> #endif
>
> JS_DEFINE_TRCINFO_1(js_str_toString,
> (2, (extern, STRING_RETRY, String_p_toString, CONTEXT, THIS,
> 1, nanojit::ACCSET_NONE)))
>-JS_DEFINE_TRCINFO_1(str_charAt,
>+JS_DEFINE_TRCINFO_2(str_charAt,
> (3, (extern, STRING_RETRY, js_String_getelem, CONTEXT, THIS_STRING, INT32,
>+ 1, nanojit::ACCSET_NONE)),
>+ (3, (extern, STRING_RETRY, js_String_getelem_double, CONTEXT, THIS_STRING, DOUBLE,
> 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_2(str_charCodeAt,
> (1, (extern, DOUBLE, js_String_p_charCodeAt0, THIS_STRING,
> 1, nanojit::ACCSET_NONE)),
> (2, (extern, DOUBLE, js_String_p_charCodeAt, THIS_STRING, DOUBLE,
> 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(str_concat,
> (3, (extern, STRING_RETRY, js_ConcatStrings, CONTEXT, THIS_STRING, STRING,
>@@ -3211,20 +3219,27 @@ static JSString* FASTCALL
> String_fromCharCode(JSContext* cx, int32 i)
> {
> JS_ASSERT(JS_ON_TRACE(cx));
> jschar c = (jschar)i;
> if (c < UNIT_STRING_LIMIT)
> return JSString::unitString(c);
> return js_NewStringCopyN(cx, &c, 1);
> }
>+
>+static JSString* FASTCALL
>+String_fromCharCode_double(JSContext* cx, jsdouble i)
>+{
>+ return String_fromCharCode(cx, (int32)i);
>+}
> #endif
>
>-JS_DEFINE_TRCINFO_1(str_fromCharCode,
>- (2, (static, STRING_RETRY, String_fromCharCode, CONTEXT, INT32, 1, nanojit::ACCSET_NONE)))
>+JS_DEFINE_TRCINFO_2(str_fromCharCode,
>+ (2, (static, STRING_RETRY, String_fromCharCode, CONTEXT, INT32, 1, nanojit::ACCSET_NONE)),
>+ (2, (static, STRING_RETRY, String_fromCharCode_double, CONTEXT, DOUBLE, 1, nanojit::ACCSET_NONE)))
>
> static JSFunctionSpec string_static_methods[] = {
> JS_TN("fromCharCode", str_fromCharCode, 1, 0, &str_fromCharCode_trcinfo),
> JS_FS_END
> };
>
> JSObject *
> js_InitStringClass(JSContext *cx, JSObject *obj)
>diff -r 6d9d375a035f js/src/jstracer.cpp
>--- a/js/src/jstracer.cpp Wed Aug 04 15:11:30 2010 -0700
>+++ b/js/src/jstracer.cpp Wed Aug 04 16:48:20 2010 -0700
>@@ -11286,21 +11286,23 @@ TraceRecorder::callSpecializedNative(JSN
> argp--;
> }
>
> for (i = knownargc; i--; ) {
> Value& arg = stackval(0 - (i + 1));
> *argp = get(&arg);
>
> argtype = sn->argtypes[i];
>- if (argtype == 'd' || argtype == 'i') {
>+ if (argtype == 'd') {
> if (!arg.isNumber())
> goto next_specialization;
>- if (argtype == 'i')
>- *argp = d2i(*argp);
>+ } else if (argtype == 'i') {
>+ if (!arg.isInt32())
>+ goto next_specialization;
>+ *argp = d2i(*argp);
> } else if (argtype == 'o') {
> if (arg.isPrimitive())
> goto next_specialization;
> } else if (argtype == 's') {
> if (!arg.isString())
> goto next_specialization;
> } else if (argtype == 'r') {
> if (!VALUE_IS_REGEXP(cx, arg))
Attachment #462990 - Flags: review?(gal) → review+
Comment 12 (Assignee) • 15 years ago
OK, the indentation is fixed. I also couldn't find a better way to do the reciprocal sqrt.
Also, given the cost of pow/sqrt, I doubt that the extra branches to test for 0.5/-0.5 are that significant.
Attachment #462990 - Attachment is obsolete: true
Comment 13 (Assignee) • 15 years ago
I did a little more work on this and for some reason the traceable native stuff was actually slowing down the benchmarks that don't use pow. I've removed that stuff, which was barely helping pow performance anyway. This new, smaller patch seems to be worth about 5ms on SunSpider.
Attachment #463360 - Attachment is obsolete: true
Updated • 15 years ago
Attachment #467159 - Flags: review?(gal)
Comment 14 • 15 years ago
Comment on attachment 467159 [details] [diff] [review]
Simplified patch
All hail SunSpider.
Attachment #467159 - Flags: review?(gal) → review+
Comment 15 • 15 years ago
Comment on attachment 467159 [details] [diff] [review]
Simplified patch
>diff --git a/js/src/jsmath.cpp b/js/src/jsmath.cpp
>--- a/js/src/jsmath.cpp
>+++ b/js/src/jsmath.cpp
>@@ -393,43 +393,84 @@ js_math_min(JSContext *cx, uintN argc, V
> } else {
> z = (x < z) ? x : z;
> }
> }
> vp->setNumber(z);
> return JS_TRUE;
> }
>
>+static jsdouble powi(jsdouble x, jsint y) {
Prevailing style breaks lines after type and mode (before function declarator name), and after parameter list's closing ) so { is in column 1.
>+ jsuint n = (y < 0) ? -y : y;
>+ jsdouble m = x;
>+ jsdouble p = 1;
>+ while (true) {
>+ if ((n & 1) != 0) p *= m;
>+ n >>= 1;
>+ if (n == 0) {
>+ if (y < 0) {
>+ // Unfortunately, we have to be careful when p has reached
>+ // infinity in the computation, because sometimes the higher
>+ // internal precision in the pow() implementation would have
>+ // given us a finite p. This happens very rarely.
>+
>+ jsdouble result = 1.0 / p;
>+ return (result == 0 && isinf(p))
>+ ? pow(x, static_cast<jsdouble>(y)) // Avoid pow(double, int).
>+ : result;
>+ } else {
>+ return p;
Avoid else after return non-sequiturs.
>+ if (JSDOUBLE_IS_FINITE(x) && x != 0.0) {
>+ if (y == 0.5) {
>+ vp->setNumber(sqrt(x));
>+ return JS_TRUE;
>+ } else if (y == -0.5) {
else-after-return again.
> static jsdouble FASTCALL
> math_pow_tn(jsdouble d, jsdouble p)
> {
>+ /* Special case for square roots */
>+ if (JSDOUBLE_IS_FINITE(d) && d != 0.0) {
>+ if (p == 0.5) {
>+ return sqrt(d);
>+ } else if (p == -0.5) {
>+ return 1.0/sqrt(d);
And again.
>+ if (JSDOUBLE_IS_INT32(p, &i)) {
>+ return powi(d, i);
>+ }
Over-braced, don't do it.
Style nits count, keep after them or we'll have randomly styled code in no time flat.
/be
Comment 16 • 15 years ago
To say a bit more (JS C++ style guide on the wiki I believe says all this):
The { in column 1 rule for top-level functions has better navigation usability (vim [[). This rule does not apply to inline class member methods, unless they have too many parameters to fit on one line, or are constructors with base class and member initialisers.
The rule against else-after-return avoids ever-increasing over-indentation or (if one avoids rightward drift) consequent nonsense indentation, e.g.
if (foo)
    return bar;
else
    don't return;
more don't return here;
or the variation:
if (not foo)
    don't return;
else
    return bar;
more after the don't return;
where the condition should be inverted to cast out the abnormal case so the spine (hat tip jimb) of the function is least indented.
Common idiom -- instead of
if (foo)
    return bar;
else
    return baz;
or the else-free fixed version, use
return foo ? bar : baz;
even if multiline (underhang the ? and : below the first char of the condition, typically ( unless the condition is a primary or member expression).
A lot of this is K&R plus some feedback over the years arguing for exceptions to rules, and a few refinements.
There's a substance-not-style issue depending on compiler, optimization level, and processor architecture:
Prefetching icache lines follows unconditional jumps but not conditional ones, as far as I know (please correct me if I'm wrong for specific architectures). So the "then" clause should be faster, especially if the conditional jump around it is to instructions not in the pipe. If the branch is short enough to target prefetched code, then only a minor cycle penalty accrues.
Therefore sometimes it pays to manually optimize for "prefetched is predicted" flow, or just simplify that way due to only two levels of logic, thus putting the fast path in the "then" clause. It's ok to do this nowadays.
We used to avoid indenting the common or hot path a bit too scrupulously, but this just made more work for PGO builds and had the perverse effect of causing some bug filing, due to developer-generated program profiles (-O, but without PGO) showing mispredicted long branches costing where they would not if we either PGO'ed or reordered manually.
So reordering manually sometimes is worth it if the indentation doesn't go rightward too badly.
Sorry for writing so much, wanted to avoid leaving a style-nit-police-raid bad taste with my previous comment.
/be
Comment 17 (Assignee) • 15 years ago
I've fixed these issues.
Attachment #467159 - Attachment is obsolete: true
Updated (Assignee) • 15 years ago
Keywords: checkin-needed
Comment 18 • 15 years ago
(In reply to comment #16)
> The { in column 1 rule for top-level functions has better navigation usability
> (vim [[). This rule does not apply to inline class member methods, unless they
> have too many parameters to fit on one line, or are constructors with base
> class and member initialisers.
Virtual column 1 for inline class members, of course, since they are indented. Any { indented from physical column 1 loses the usability win (vim [[) anyway. Plus, C++ has its creator-influenced style, which deviates from K&R. So we have a mix, with K&R-ish top-level (unindented) function and method definition style.
Style, yay.
/be
Comment 19 • 15 years ago
Keywords: checkin-needed
Whiteboard: fixed-in-tracemonkey
Comment 20 • 15 years ago
(In reply to comment #16)
>
> Prefetching icache lines follows unconditional jumps but not conditional ones,
> as far as I know (please correct me if I'm wrong for specific architectures).
> So the "then" clause should be faster, especially if the conditioanl jump
> around it is to instructions not in the pipe. If the branch is short enough to
> target prefetched code, then only a minor cycle penalty accrues.
>
> Therefore sometimes it pays to manually optimize for "prefetched is predicted"
> flow, or just simplify that way due to only two levels of logic, thus putting
> the fast path in the "then" clause. It's ok to do this nowadays.
I don't have any data, but this smells dubious to me. For one, the compiler could swap the branch ordering itself. If you really think it's important (ie. the branch is super-hot) then I'd use a likely/unlikely pragma and write the code in the order that looks the nicest. My understanding is that these pragmas are aimed at improving branch prediction rather than I-cache prefetching, though; my guess is that branch prediction has a much bigger effect on performance.
Comment 21 • 15 years ago
(In reply to comment #20)
> (In reply to comment #16)
> >
> > Prefetching icache lines follows unconditional jumps but not conditional ones,
> > as far as I know (please correct me if I'm wrong for specific architectures).
> > So the "then" clause should be faster, especially if the conditioanl jump
> > around it is to instructions not in the pipe. If the branch is short enough to
> > target prefetched code, then only a minor cycle penalty accrues.
> >
> > Therefore sometimes it pays to manually optimize for "prefetched is predicted"
> > flow, or just simplify that way due to only two levels of logic, thus putting
> > the fast path in the "then" clause. It's ok to do this nowadays.
>
> I don't have any data, but this smells dubious to me. For one, the compiler
> could swap the branch ordering itself.
Could != does. I'm telling you we've had bugs where our "avoid indenting the spine" style, which I still favor by default, has led to un-PGO'ed (gcc and MSVC if memory serves) code costing enough that someone filed a bug. I can dig up refs if needed.
> If you really think it's important (ie. the branch is super-hot) then I'd use a
> likely/unlikely pragma and write the code in the order that looks the nicest.
The agreement I heard sayrer forge is that we don't add JS_*LIKELY macro uses without profile data showing a problem. Of course this would be un-PGO'ed profile data.
> My understanding is that these
> pragmas are aimed at improving branch prediction rather than I-cache
> prefetching, though; my guess is that branch prediction has a much bigger
> effect on performance.
The GCC __builtin_expect intrinsic can be used for branch prediction where the architecture provides branch likely and branch unlikely instructions, but it is also used to determine which path is fall-through and which is targeted. http://kerneltrap.org/node/4705 has some asm examples.
Modern Intel wants to do its own branch prediction dynamically, but you still have to lay your code out to best advantage -- you the compiler, not so much the developers who always guess wrong. In this light I agree that layout things out for readability and maintenance ease comes ahead of any other good.
/be
Comment 22 • 15 years ago
Ugh, I meant to write "I agree that *laying* things out for readability and maintenance ease comes ahead of any other good."
/be
Comment 23 (Reporter) • 15 years ago
(In reply to comment #21)
>
> The agreement I heard sayrer forge is that we don't add JS_*LIKELY macro uses
> without profile data showing a problem.
Yeah, I think we had general nodding and grunting in agreement.
> Of course this would be un-PGO'ed profile data.
>
That's OK. We still only have PGO working on MSVC. GCC claims to do it, but we don't PGO there, because we have yet to measure an improvement on Linux or OS X.
Comment 24 • 15 years ago
(In reply to comment #21)
> Modern Intel wants to do its own branch prediction dynamically, but you still
> have to lay your code out to best advantage -- you the compiler, not so much
> the developers who always guess wrong.
But gcc at -O2 is going to mash around the control flow structure
absolutely mercilessly, reorder blocks, switch branch senses, and move
anything it estimates to be a cold block off the main path (past the
return instruction). Unless you blunt-instrument it by adding
__builtin_expect all over the place, I don't think you have a hope of
reliably improving performance by changing the layout.
IME trying to "help out" the compiler in these kinds of ways is almost
always pointless, since it requires second-guessing what both the
compiler and the microarchitecture will do. FWIW, I (and Nick,
probably) spent way too much time trying to help out gcc generate
really fast code for the helper functions called from JIT generated
code in Valgrind.
Comment 25 • 15 years ago
(In reply to comment #24)
> (In reply to comment #21)
> > Modern Intel wants to do its own branch prediction dynamically, but you still
> > have to lay your code out to best advantage -- you the compiler, not so much
> > the developers who always guess wrong.
>
> But gcc at -O2 is going to mash around the control flow structure
> absolutely mercilessly, reorder blocks, switch branch senses, and move
> anything it estimates to be a cold block off the main path.
We seem to be in agreement!
The allowance for short if-then flows that put what might obviously be the common case (the non-error case, e.g.) in the "then" that I mentioned evolving as "ok style" did help. Possibly it was before modern gcc -O2 ("tree-ssa") days, but I don't think so (I think it was jsarena-related hacking by Robin Bate-Boerop, if that helps find the bug -- 2007 era?).
If it doesn't matter, then it doesn't matter and there's no substance issue, just style. We can either let the style allowance stand, or go back to "least-indentation for the common-case spine." Opinions?
/be
Comment 26 • 15 years ago
FWIW, I found bug 417995, which was where the issue of if-then code "usually" -- even with -O2 often enough -- results in better performance when the then clause is the common case, in the absence of PGO. Bug 417995 comment 27 cites
[http://www.intel.com/design/processor/manuals/248966.pdf page 88]
If anyone has fresher information, please share by email. Sorry to clutter up this bug, I made it a bit of a discussion list :-/.
/be
Comment 27 (Reporter) • 15 years ago
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated • 15 years ago
Assignee: general → wmccloskey