Closed Bug 564548 Opened 15 years ago Closed 15 years ago

double the speed of Math.pow when the second argument is an integer

Categories

(Core :: JavaScript Engine, defect)

Platform: x86 macOS
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED

People

(Reporter: sayrer, Assigned: billm)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 4 obsolete files)

I noticed v8 had this optimization, so I grabbed it. My patch is kind of lame, because defining traceable natives with the same number of arguments but different types doesn't seem to work right. I bet we could do better. This shaves a few ms off math-partial-sums on my box (every other Math.pow call has an integer second argument). On the attached microbenchmark, we go from 12x faster than Chrome 5 to 15x faster.
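The optimization in question is exponentiation by squaring: when the second argument is an integer, x^y can be computed with O(log |y|) multiplications instead of a full libm pow() call. A minimal standalone sketch modeled on the powi() in the attached patch (variable names are mine):

```cpp
#include <cassert>
#include <cmath>

// Exponentiation by squaring, modeled on the patch's powi():
// O(log |y|) multiplications instead of a full libm pow() call.
static double powi(double x, int y) {
    // Use unsigned arithmetic so that |y| == INT_MIN does not overflow.
    unsigned n = (y < 0) ? 0u - static_cast<unsigned>(y) : static_cast<unsigned>(y);
    double m = x;   // current squared factor: x, x^2, x^4, ...
    double p = 1;   // accumulated product
    while (true) {
        if (n & 1)
            p *= m;
        n >>= 1;
        if (n == 0)
            break;
        m *= m;
    }
    if (y < 0) {
        double result = 1.0 / p;
        // If p overflowed to infinity, libm's higher internal precision
        // might still have produced a finite answer; fall back to pow().
        if (result == 0 && std::isinf(p))
            return std::pow(x, static_cast<double>(y));
        return result;
    }
    return p;
}
```

The negative-exponent fallback matters: 1/p can flush to zero when p overflows even though the true result is a small finite number.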
Attached file microbenchmark.html
Attached patch pow.patch (obsolete) — Splinter Review
You can specialize this on trace.
(In reply to comment #3)
> You can specialize this on trace.

Well, I naively tried this:

  JS_DEFINE_TRCINFO_2(math_pow,
      (2, (static, DOUBLE, math_powi_tn, DOUBLE, INT32, 1, nanojit::ACC_NONE)),
      (2, (static, DOUBLE, math_pow_tn, DOUBLE, DOUBLE, 1, nanojit::ACC_NONE)))

but I ended up getting everything sent to the first function listed there. jorendorff says I need to change some stuff in jstracer.cpp to do what I want. I thought I would see if powi() actually won before diving deeper.
It only gets specialized if the incoming value is known to be an int: e.g. pow(x, 5) would get specialized, as would pow(x, y | 0), and pow(x, i) inside a for (i = ...) loop. As we specialize heap reads better, this will further improve performance down the road.
(In reply to comment #5)
> it only gets specialized if the incoming value is known to be an int

The problem is that Math.pow(2314.1234, -0.5) gets sent to the version that expects the second argument to be an integer.
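To make the failure mode concrete, here is a hypothetical helper (the name is mine, not the patch's) showing what a naive int specialization does when a fractional exponent gets truncated before dispatch:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper illustrating the bug: a naive int specialization
// that truncates the exponent before dispatching. Math.pow(x, -0.5)
// becomes x^0 == 1 instead of 1/sqrt(x).
static double pow_with_truncated_exponent(double x, double y) {
    int yi = static_cast<int>(y);  // -0.5 truncates to 0
    return std::pow(x, static_cast<double>(yi));
}
```

So the dispatch must guard on the value actually being an int32, not coerce doubles into the integer path.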
That shouldn't be the case. Attach the patch and I'll take a look.
Sayre, can you attach the patch that attempts to specialize on-trace?
Blocks: JaegerSpeed
This patch includes Rob's integer pow specialization as well as the sqrt optimization for pow from bug 584223.

With regard to the integer stuff, it calls separate pow functions on trace depending on whether the arg has integer or double type (as Andreas requested). Unfortunately, getting this to work necessitated some changes to how specialized natives are handled. Now when you use JS_DEFINE_TRCINFO_n and some argument is an INT32, that specialization is *only* called for int32s. Previously, doubles were converted to ints to allow the specialization to be used. Consequently, for each function declared with INT32 before, I created a wrapper for the case where the arg is a double, which casts to an int and then calls the int specialization.

Note that the code for the integer case in callSpecializedNative still calls d2i. I need this to get the types in the LIR to work out. Hopefully it ultimately compiles to a NOP, since the arg is known to be an integer. Is this true?
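On the sqrt fast path specifically: substituting sqrt(x) for pow(x, 0.5) is only valid for finite, nonzero x, because C99/ECMA define pow(-Infinity, 0.5) as +Infinity while sqrt(-Infinity) is NaN, and pow(-0, 0.5) is +0 while sqrt(-0) is -0. A standalone sketch of the guard (assuming C99 pow semantics; the function name is mine):

```cpp
#include <cassert>
#include <cmath>

// Guarded sqrt fast path for pow(x, 0.5): only finite, nonzero x may
// take the sqrt() shortcut. pow(-Inf, 0.5) is +Inf but sqrt(-Inf) is
// NaN; pow(-0, 0.5) is +0 but sqrt(-0) is -0.
static double pow_half(double x) {
    if (std::isfinite(x) && x != 0.0)
        return std::sqrt(x);
    return std::pow(x, 0.5);  // slow path handles the edge cases
}
```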
Attachment #444212 - Attachment is obsolete: true
Attachment #462990 - Flags: review?
Attachment #462990 - Flags: review? → review?(gal)
Comment on attachment 462990 [details] [diff] [review]
Combined pow optimizations (integer pow and sqrt specialization)

>diff -r 6d9d375a035f js/src/jsmath.cpp
>--- a/js/src/jsmath.cpp Wed Aug 04 15:11:30 2010 -0700
>+++ b/js/src/jsmath.cpp Wed Aug 04 16:48:20 2010 -0700
>@@ -393,43 +393,84 @@ js_math_min(JSContext *cx, uintN argc, V
>         } else {
>             z = (x < z) ? x : z;
>         }
>     }
>     vp->setNumber(z);
>     return JS_TRUE;
> }
>
>+static jsdouble powi(jsdouble x, jsint y) {
>+    jsuint n = (y < 0) ? -y : y;
>+    jsdouble m = x;
>+    jsdouble p = 1;
>+    while (true) {
>+        if ((n & 1) != 0) p *= m;
>+        n >>= 1;
>+        if (n == 0) {
>+            if (y < 0) {
>+                // Unfortunately, we have to be careful when p has reached
>+                // infinity in the computation, because sometimes the higher
>+                // internal precision in the pow() implementation would have
>+                // given us a finite p. This happens very rarely.
>+
>+                jsdouble result = 1.0 / p;
>+                return (result == 0 && isinf(p))

Align the ? with the (

>+                    ? pow(x, static_cast<jsdouble>(y))  // Avoid pow(double, int).
>+                    : result;
>+            } else {
>+                return p;
>+            }
>+        }
>+        m *= m;
>+    }
>+}
>+
> static JSBool
> math_pow(JSContext *cx, uintN argc, Value *vp)
> {
>     jsdouble x, y, z;
>
>     if (argc <= 1) {
>         vp->setDouble(js_NaN);
>         return JS_TRUE;
>     }
>     if (!ValueToNumber(cx, vp[2], &x))
>         return JS_FALSE;
>     if (!ValueToNumber(cx, vp[3], &y))
>         return JS_FALSE;
>+    /* Special case for square roots */
>+    if (JSDOUBLE_IS_FINITE(x)) {
>+        if (y == 0.5) {
>+            vp->setNumber(sqrt(x));
>+            return JS_TRUE;
>+        } else if (y == -0.5) {
>+            vp->setNumber(1.0/sqrt(x));

Is there no rqsrt in libc? lame.

>+            return JS_TRUE;
>+        }
>+    }
>     /*
>      * Because C99 and ECMA specify different behavior for pow(),
>      * we need to wrap the libm call to make it ECMA compliant.
>      */
>     if (!JSDOUBLE_IS_FINITE(y) && (x == 1.0 || x == -1.0)) {
>         vp->setDouble(js_NaN);
>         return JS_TRUE;
>     }
>     /* pow(x, +-0) is always 1, even for x = NaN. */
>     if (y == 0) {
>         vp->setInt32(1);
>         return JS_TRUE;
>     }
>-    z = pow(x, y);
>+
>+    if (vp[3].isInt32()) {
>+        z = powi(x, vp[3].toInt32());
>+    } else {
>+        z = pow(x, y);
>+    }
>     vp->setNumber(z);
>     return JS_TRUE;
> }
>
> static const int64 RNG_MULTIPLIER = 0x5DEECE66DLL;
> static const int64 RNG_ADDEND = 0xBLL;
> static const int64 RNG_MASK = (1LL << 48) - 1;
> static const jsdouble RNG_DSCALE = jsdouble(1LL << 53);
>@@ -668,24 +709,44 @@ math_min_tn(jsdouble d, jsdouble p)
>         return d;
>     }
>     return (p < d) ? p : d;
> }
>
> static jsdouble FASTCALL
> math_pow_tn(jsdouble d, jsdouble p)
> {
>+    /* Special case for square roots */
>+    if (JSDOUBLE_IS_FINITE(d)) {
>+        if (p == 0.5) {
>+            return sqrt(d);
>+        } else if (p == -0.5) {
>+            return 1.0/sqrt(d);
>+        }
>+    }

You could special case for this in the trace code generation but maybe not worth it.

>     if (!JSDOUBLE_IS_FINITE(p) && (d == 1.0 || d == -1.0))
>         return js_NaN;
>     if (p == 0)
>         return 1.0;
>+    int32 i;
>+    if (JSDOUBLE_IS_INT32(p, &i)) {
>+        return powi(d, i);
>+    }
>     return pow(d, p);
> }
>
> static jsdouble FASTCALL
>+math_powi_tn(jsdouble d, int32 p)
>+{
>+    if (p == 0)
>+        return 1.0;
>+    return powi(d, p);
>+}
>+
>+static jsdouble FASTCALL
> math_random_tn(JSContext *cx)
> {
>     return random_nextDouble(cx);
> }
>
> static jsdouble FASTCALL
> math_round_tn(jsdouble x)
> {
>@@ -713,17 +774,18 @@ JS_DEFINE_TRCINFO_1(math_atan2,
> JS_DEFINE_TRCINFO_1(js_math_floor,
>     (1, (static, DOUBLE, math_floor_tn, DOUBLE, 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(math_log,
>     (1, (static, DOUBLE, math_log_tn, DOUBLE, 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(js_math_max,
>     (2, (static, DOUBLE, math_max_tn, DOUBLE, DOUBLE, 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(js_math_min,
>     (2, (static, DOUBLE, math_min_tn, DOUBLE, DOUBLE, 1, nanojit::ACCSET_NONE)))
>-JS_DEFINE_TRCINFO_1(math_pow,
>+JS_DEFINE_TRCINFO_2(math_pow,
>+    (2, (static, DOUBLE, math_powi_tn, DOUBLE, INT32, 1, nanojit::ACCSET_NONE)),
>     (2, (static, DOUBLE, math_pow_tn, DOUBLE, DOUBLE, 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(math_random,
>     (1, (static, DOUBLE, math_random_tn, CONTEXT, 0, nanojit::ACCSET_STORE_ANY)))
> JS_DEFINE_TRCINFO_1(js_math_round,
>     (1, (static, DOUBLE, math_round_tn, DOUBLE, 1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(js_math_ceil,
>     (1, (static, DOUBLE, math_ceil_tn, DOUBLE, 1, nanojit::ACCSET_NONE)))
>
>diff -r 6d9d375a035f js/src/jsnum.cpp
>--- a/js/src/jsnum.cpp Wed Aug 04 15:11:30 2010 -0700
>+++ b/js/src/jsnum.cpp Wed Aug 04 16:48:20 2010 -0700
>@@ -617,16 +617,18 @@ IntToCString(jsint i, jsint base, char *
>     *--cp = '-';
>
>     JS_ASSERT(cp >= buf);
>     return cp;
> }
>
> static JSString * JS_FASTCALL
> js_NumberToStringWithBase(JSContext *cx, jsdouble d, jsint base);
>+static JSString * JS_FASTCALL
>+js_NumberToStringWithDoubleBase(JSContext *cx, jsdouble d, jsdouble base);
>
> static JSBool
> num_toString(JSContext *cx, uintN argc, Value *vp)
> {
>     const Value *primp;
>     if (!js_GetPrimitiveThis(cx, vp, &js_NumberClass, &primp))
>         return JS_FALSE;
>     double d = primp->toNumber();
>@@ -846,20 +848,22 @@ num_toPrecision(JSContext *cx, uintN arg
>     if (argc == 0 || vp[2].isUndefined())
>         return num_toString(cx, 0, vp);
>     return num_to(cx, DTOSTR_STANDARD, DTOSTR_PRECISION, 1, MAX_PRECISION, 0,
>                   argc, vp);
> }
>
> #ifdef JS_TRACER
>
>-JS_DEFINE_TRCINFO_2(num_toString,
>+JS_DEFINE_TRCINFO_3(num_toString,
>     (2, (extern, STRING_RETRY, js_NumberToString, CONTEXT, THIS_DOUBLE,
>          1, nanojit::ACCSET_NONE)),
>     (3, (static, STRING_RETRY, js_NumberToStringWithBase, CONTEXT, THIS_DOUBLE, INT32,
>+         1, nanojit::ACCSET_NONE)),
>+    (3, (static, STRING_RETRY, js_NumberToStringWithDoubleBase, CONTEXT, THIS_DOUBLE, DOUBLE,
>          1, nanojit::ACCSET_NONE)))
>
> #endif /* JS_TRACER */
>
> static JSFunctionSpec number_methods[] = {
> #if JS_HAS_TOSOURCE
>     JS_FN(js_toSource_str, num_toSource, 0,JSFUN_THISP_NUMBER),
> #endif
>@@ -1049,16 +1053,22 @@ js_IntToString(JSContext *cx, jsint i)
>     if (jsuint(i) < INT_STRING_LIMIT)
>         return JSString::intString(i);
>
>     char buf[12];
>     return js_NewStringCopyZ(cx, IntToCString(i, 10, buf, sizeof buf));
> }
>
> static JSString * JS_FASTCALL
>+js_NumberToStringWithDoubleBase(JSContext *cx, jsdouble d, jsdouble base)
>+{
>+    return js_NumberToStringWithBase(cx, d, (jsint)base);
>+}
>+
>+static JSString * JS_FASTCALL
> js_NumberToStringWithBase(JSContext *cx, jsdouble d, jsint base)
> {
>     /*
>      * The longest possible result here that would need to fit in buf is
>      * (-0x80000000).toString(2), which has length 33. (This can produce
>      * longer results, but in those cases buf is not used; see comment at
>      * NumberToCString.)
>      */
>diff -r 6d9d375a035f js/src/jsstr.cpp
>--- a/js/src/jsstr.cpp Wed Aug 04 15:11:30 2010 -0700
>+++ b/js/src/jsstr.cpp Wed Aug 04 16:48:20 2010 -0700
>@@ -2845,23 +2845,31 @@ str_sub(JSContext *cx, uintN argc, Value
> #ifdef JS_TRACER
> JSString* FASTCALL
> js_String_getelem(JSContext* cx, JSString* str, int32 i)
> {
>     if ((size_t)i >= str->length())
>         return NULL;
>     return JSString::getUnitString(cx, str, size_t(i));
> }
>+
>+JSString* FASTCALL
>+js_String_getelem_double(JSContext* cx, JSString* str, jsdouble i)
>+{
>+    return js_String_getelem(cx, str, (int32)i);
>+}
> #endif
>
> JS_DEFINE_TRCINFO_1(js_str_toString,
>     (2, (extern, STRING_RETRY, String_p_toString, CONTEXT, THIS,
>          1, nanojit::ACCSET_NONE)))
>-JS_DEFINE_TRCINFO_1(str_charAt,
>+JS_DEFINE_TRCINFO_2(str_charAt,
>     (3, (extern, STRING_RETRY, js_String_getelem, CONTEXT, THIS_STRING, INT32,
>+         1, nanojit::ACCSET_NONE)),
>+    (3, (extern, STRING_RETRY, js_String_getelem_double, CONTEXT, THIS_STRING, DOUBLE,
>          1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_2(str_charCodeAt,
>     (1, (extern, DOUBLE, js_String_p_charCodeAt0, THIS_STRING,
>          1, nanojit::ACCSET_NONE)),
>     (2, (extern, DOUBLE, js_String_p_charCodeAt, THIS_STRING, DOUBLE,
>          1, nanojit::ACCSET_NONE)))
> JS_DEFINE_TRCINFO_1(str_concat,
>     (3, (extern, STRING_RETRY, js_ConcatStrings, CONTEXT, THIS_STRING, STRING,
>@@ -3211,20 +3219,27 @@ static JSString* FASTCALL
> String_fromCharCode(JSContext* cx, int32 i)
> {
>     JS_ASSERT(JS_ON_TRACE(cx));
>     jschar c = (jschar)i;
>     if (c < UNIT_STRING_LIMIT)
>         return JSString::unitString(c);
>     return js_NewStringCopyN(cx, &c, 1);
> }
>+
>+static JSString* FASTCALL
>+String_fromCharCode_double(JSContext* cx, jsdouble i)
>+{
>+    return String_fromCharCode(cx, (int32)i);
>+}
> #endif
>
>-JS_DEFINE_TRCINFO_1(str_fromCharCode,
>-    (2, (static, STRING_RETRY, String_fromCharCode, CONTEXT, INT32, 1, nanojit::ACCSET_NONE)))
>+JS_DEFINE_TRCINFO_2(str_fromCharCode,
>+    (2, (static, STRING_RETRY, String_fromCharCode, CONTEXT, INT32, 1, nanojit::ACCSET_NONE)),
>+    (2, (static, STRING_RETRY, String_fromCharCode_double, CONTEXT, DOUBLE, 1, nanojit::ACCSET_NONE)))
>
> static JSFunctionSpec string_static_methods[] = {
>     JS_TN("fromCharCode", str_fromCharCode, 1, 0, &str_fromCharCode_trcinfo),
>     JS_FS_END
> };
>
> JSObject *
> js_InitStringClass(JSContext *cx, JSObject *obj)
>diff -r 6d9d375a035f js/src/jstracer.cpp
>--- a/js/src/jstracer.cpp Wed Aug 04 15:11:30 2010 -0700
>+++ b/js/src/jstracer.cpp Wed Aug 04 16:48:20 2010 -0700
>@@ -11286,21 +11286,23 @@ TraceRecorder::callSpecializedNative(JSN
>         argp--;
>     }
>
>     for (i = knownargc; i--; ) {
>         Value& arg = stackval(0 - (i + 1));
>         *argp = get(&arg);
>
>         argtype = sn->argtypes[i];
>-        if (argtype == 'd' || argtype == 'i') {
>+        if (argtype == 'd') {
>             if (!arg.isNumber())
>                 goto next_specialization;
>-            if (argtype == 'i')
>-                *argp = d2i(*argp);
>+        } else if (argtype == 'i') {
>+            if (!arg.isInt32())
>+                goto next_specialization;
>+            *argp = d2i(*argp);
>         } else if (argtype == 'o') {
>             if (arg.isPrimitive())
>                 goto next_specialization;
>         } else if (argtype == 's') {
>             if (!arg.isString())
>                 goto next_specialization;
>         } else if (argtype == 'r') {
>             if (!VALUE_IS_REGEXP(cx, arg))
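The behavioral change to callSpecializedNative boils down to a matching rule: an INT32 ('i') slot now matches only values already known to be int32, while a DOUBLE ('d') slot still accepts any number. A sketch of that rule in isolation (the enum and names here are hypothetical stand-ins, not the jstracer.cpp types):

```cpp
#include <cassert>

// Hypothetical stand-ins for the jstracer.cpp machinery: an INT32 ('i')
// specialization slot matches only values already typed int32, while a
// DOUBLE ('d') slot accepts any number.
enum ArgTag { TagInt32, TagDouble, TagString };

static bool specializationMatches(char argtype, ArgTag tag) {
    if (argtype == 'd')
        return tag == TagInt32 || tag == TagDouble;  // any number promotes
    if (argtype == 'i')
        return tag == TagInt32;  // no more silent d2i of doubles
    if (argtype == 's')
        return tag == TagString;
    return false;
}
```

Specializations are tried in declaration order, so with this rule the (DOUBLE, INT32) entry for math_pow only fires for int32 exponents, and double exponents fall through to the (DOUBLE, DOUBLE) entry.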
Attachment #462990 - Flags: review?(gal) → review+
Attached patch Fixed patch (obsolete) — Splinter Review
OK, the indentation is fixed. I couldn't find a better way to do the reciprocal sqrt. Also, given the cost of pow/sqrt, I doubt that the extra branches to test for 0.5/-0.5 are significant.
Attachment #462990 - Attachment is obsolete: true
Attached patch Simplified patch (obsolete) — Splinter Review
I did a little more work on this and for some reason the traceable native stuff was actually slowing down the benchmarks that don't use pow. I've removed that stuff, which was barely helping pow performance anyway. This new, smaller patch seems to be worth about 5ms on SunSpider.
Attachment #463360 - Attachment is obsolete: true
Attachment #467159 - Flags: review?(gal)
Comment on attachment 467159 [details] [diff] [review] Simplified patch All hail sunspider.
Attachment #467159 - Flags: review?(gal) → review+
Comment on attachment 467159 [details] [diff] [review]
Simplified patch

>diff --git a/js/src/jsmath.cpp b/js/src/jsmath.cpp
>--- a/js/src/jsmath.cpp
>+++ b/js/src/jsmath.cpp
>@@ -393,43 +393,84 @@ js_math_min(JSContext *cx, uintN argc, V
>         } else {
>             z = (x < z) ? x : z;
>         }
>     }
>     vp->setNumber(z);
>     return JS_TRUE;
> }
>
>+static jsdouble powi(jsdouble x, jsint y) {

Prevailing style breaks lines after type and mode (before function declarator name), and after parameter list's closing ) so { is in column 1.

>+    jsuint n = (y < 0) ? -y : y;
>+    jsdouble m = x;
>+    jsdouble p = 1;
>+    while (true) {
>+        if ((n & 1) != 0) p *= m;
>+        n >>= 1;
>+        if (n == 0) {
>+            if (y < 0) {
>+                // Unfortunately, we have to be careful when p has reached
>+                // infinity in the computation, because sometimes the higher
>+                // internal precision in the pow() implementation would have
>+                // given us a finite p. This happens very rarely.
>+
>+                jsdouble result = 1.0 / p;
>+                return (result == 0 && isinf(p))
>+                    ? pow(x, static_cast<jsdouble>(y))  // Avoid pow(double, int).
>+                    : result;
>+            } else {
>+                return p;

Avoid else after return non-sequiturs.

>+    if (JSDOUBLE_IS_FINITE(x) && x != 0.0) {
>+        if (y == 0.5) {
>+            vp->setNumber(sqrt(x));
>+            return JS_TRUE;
>+        } else if (y == -0.5) {

else-after-return again.

> static jsdouble FASTCALL
> math_pow_tn(jsdouble d, jsdouble p)
> {
>+    /* Special case for square roots */
>+    if (JSDOUBLE_IS_FINITE(d) && d != 0.0) {
>+        if (p == 0.5) {
>+            return sqrt(d);
>+        } else if (p == -0.5) {
>+            return 1.0/sqrt(d);

And again.

>+    int32 i;
>+    if (JSDOUBLE_IS_INT32(p, &i)) {
>+        return powi(d, i);
>+    }

Over-braced, don't do it.

Style nits count, keep after them or we'll have randomly styled code in no time flat.

/be
To say a bit more (the JS C++ style guide on the wiki says all this, I believe):

The { in column 1 rule for top-level functions has better navigation usability (vim [[). This rule does not apply to inline class member methods, unless they have too many parameters to fit on one line, or are constructors with base class and member initialisers.

The rule against else-after-return avoids ever-increasing over-indentation or (if one avoids rightward drift) consequent nonsense indentation, e.g.

  if (foo)
      return bar;
  else
      don't return;
  more don't return here;

or the variation:

  if (not foo)
      don't return;
  else
      return bar;
  more after the don't return;

where the condition should be inverted to cast out the abnormal case so the spine (hat tip jimb) of the function is least indented.

Common idiom -- instead of

  if (foo)
      return bar;
  else
      return baz;

or the else-free fixed version, use

  return foo ? bar : baz;

even if multiline (underhang the ? and : below the first char of the condition, typically ( unless the condition is a primary or member expression).

A lot of this is K&R plus some feedback over the years arguing for exceptions to rules, and a few refinements.

There's a substance-not-style issue depending on compiler, optimization level, and processor architecture: prefetching icache lines follows unconditional jumps but not conditional ones, as far as I know (please correct me if I'm wrong for specific architectures). So the "then" clause should be faster, especially if the conditional jump around it is to instructions not in the pipe. If the branch is short enough to target prefetched code, then only a minor cycle penalty accrues.

Therefore sometimes it pays to manually optimize for "prefetched is predicted" flow, or just simplify that way due to only two levels of logic, thus putting the fast path in the "then" clause. It's ok to do this nowadays.

We used to avoid indenting the common or hot path a bit too scrupulously, but this just made more work for PGO builds and had the perverse effect of causing some bug filing, due to developer-generated, -O but non-PGO program profiles showing mispredicted long branches costing where they would not if we either PGO'ed or reordered manually. So reordering manually is sometimes worth it, if the indentation doesn't go rightward too badly.

Sorry for writing so much; I wanted to avoid leaving a style-nit-police-raid bad taste with my previous comment.

/be
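A throwaway example of the shape being advocated above (a hypothetical function, not from the patch): the abnormal case is cast out first with an early return, so the common path forms the least-indented spine and no else is needed.

```cpp
#include <cassert>

// Hypothetical example of "cast out the abnormal case": the error path
// returns early; the common path is the unindented spine, with no else.
static double safeDivide(double numerator, double denominator) {
    if (denominator == 0.0)
        return 0.0;                  // abnormal case, handled and gone
    return numerator / denominator;  // spine, least indented
}
```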
I've fixed these issues.
Attachment #467159 - Attachment is obsolete: true
(In reply to comment #16)
> The { in column 1 rule for top-level functions has better navigation usability
> (vim [[). This rule does not apply to inline class member methods, unless they
> have too many parameters to fit on one line, or are constructors with base
> class and member initialisers.

Virtual column 1 for inline class members, of course, since they are indented. Any { indented from physical column 1 loses the usability win (vim [[) anyway.

Plus, C++ has its creator-influenced style, which deviates from K&R. So we have a mix, with K&R-ish top-level (unindented) function and method definition style. Style, yay.

/be
(In reply to comment #16)
>
> Prefetching icache lines follows unconditional jumps but not conditional ones,
> as far as I know (please correct me if I'm wrong for specific architectures).
> So the "then" clause should be faster, especially if the conditional jump
> around it is to instructions not in the pipe. If the branch is short enough to
> target prefetched code, then only a minor cycle penalty accrues.
>
> Therefore sometimes it pays to manually optimize for "prefetched is predicted"
> flow, or just simplify that way due to only two levels of logic, thus putting
> the fast path in the "then" clause. It's ok to do this nowadays.

I don't have any data, but this smells dubious to me. For one, the compiler could swap the branch ordering itself.

If you really think it's important (i.e. the branch is super-hot) then I'd use a likely/unlikely pragma and write the code in the order that looks the nicest. My understanding is that these pragmas are aimed at improving branch prediction rather than I-cache prefetching, though; my guess is that branch prediction has a much bigger effect on performance.
(In reply to comment #20)
> (In reply to comment #16)
> >
> > Prefetching icache lines follows unconditional jumps but not conditional
> > ones, as far as I know. So the "then" clause should be faster, especially
> > if the conditional jump around it is to instructions not in the pipe.
> >
> > Therefore sometimes it pays to manually optimize for "prefetched is
> > predicted" flow, thus putting the fast path in the "then" clause.
>
> I don't have any data, but this smells dubious to me. For one, the compiler
> could swap the branch ordering itself.

Could != does. I'm telling you we've had bugs where our "avoid indenting the spine" style, which I still favor by default, has led to un-PGO'ed (gcc and MSVC, if memory serves) code costing enough that someone filed a bug. I can dig up refs if needed.

> If you really think it's important (i.e. the branch is super-hot) then I'd
> use a likely/unlikely pragma and write the code in the order that looks the
> nicest.

The agreement I heard sayrer forge is that we don't add JS_*LIKELY macro uses without profile data showing a problem. Of course this would be un-PGO'ed profile data.

> My understanding is that these pragmas are aimed at improving branch
> prediction rather than I-cache prefetching, though; my guess is that branch
> prediction has a much bigger effect on performance.

The GCC __builtin_expect intrinsic can be used for branch prediction where the architecture provides branch-likely and branch-unlikely instructions, but it is also used to determine which path is the fall-through and which is targeted. http://kerneltrap.org/node/4705 has some asm examples.

Modern Intel wants to do its own branch prediction dynamically, but you still have to lay your code out to best advantage -- you the compiler, not so much the developers, who always guess wrong. In this light I agree that layout things out for readability and maintenance ease comes ahead of any other good.

/be
Ugh, I meant to write "I agree that *laying* things out for readability and maintenance ease comes ahead of any other good." /be
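For concreteness, the __builtin_expect idiom under discussion looks like the following sketch. The macro names here are hypothetical stand-ins, not SpiderMonkey's actual JS_LIKELY/JS_UNLIKELY definitions; the intrinsic is GCC/Clang-only, with a portable no-op fallback.

```cpp
#include <cassert>

// Hypothetical likely/unlikely macros in the __builtin_expect style: on
// GCC/Clang the hint steers block layout so the expected path is the
// fall-through; elsewhere the macros compile away to the bare condition.
#if defined(__GNUC__)
#  define MY_LIKELY(x)   (__builtin_expect(!!(x), 1))
#  define MY_UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
#  define MY_LIKELY(x)   (x)
#  define MY_UNLIKELY(x) (x)
#endif

// Example use: the error check is marked unlikely, so the hot path
// (returning the digit value) is laid out as the fall-through.
static int parseDigit(char c) {
    if (MY_UNLIKELY(c < '0' || c > '9'))
        return -1;       // cold path, moved off the spine by the compiler
    return c - '0';      // hot path
}
```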
(In reply to comment #21)
>
> The agreement I heard sayrer forge is that we don't add JS_*LIKELY macro uses
> without profile data showing a problem.

Yeah, I think we had general nodding and grunting in agreement.

> Of course this would be un-PGO'ed profile data.

That's OK. We still only have PGO working on MSVC. GCC claims to do it, but we don't PGO there, because we have yet to measure an improvement on Linux or OS X.
(In reply to comment #21)
> Modern Intel wants to do its own branch prediction dynamically, but you still
> have to lay your code out to best advantage -- you the compiler, not so much
> the developers, who always guess wrong.

But gcc at -O2 is going to mash around the control flow structure absolutely mercilessly: reorder blocks, switch branch senses, and move anything it estimates to be a cold block off the main path (past the return instruction). Unless you blunt-instrument it by adding __builtin_expect all over the place, I don't think you have a hope of reliably improving performance by changing the layout.

IME, trying to "help out" the compiler in these kinds of ways is almost always pointless, since it requires second-guessing what both the compiler and the microarchitecture will do. FWIW, I (and Nick, probably) spent way too much time trying to help gcc generate really fast code for the helper functions called from JIT-generated code in Valgrind.
(In reply to comment #24)
> (In reply to comment #21)
> > Modern Intel wants to do its own branch prediction dynamically, but you
> > still have to lay your code out to best advantage -- you the compiler, not
> > so much the developers, who always guess wrong.
>
> But gcc at -O2 is going to mash around the control flow structure
> absolutely mercilessly: reorder blocks, switch branch senses, and move
> anything it estimates to be a cold block off the main path.

We seem to be in agreement! The allowance I mentioned evolving as "ok style" -- short if-then flows that put what is obviously the common case (the non-error case, e.g.) in the "then" clause -- did help. Possibly it was before modern gcc -O2 ("tree-ssa") days, but I don't think so (I think it was jsarena-related hacking by Robin Bate Boerop, if that helps find the bug -- 2007 era?).

If it doesn't matter, then it doesn't matter and there's no substance issue, just style. We can either let the style allowance stand, or go back to "least indentation for the common-case spine." Opinions?

/be
FWIW, I found bug 417995, which is where the issue arose of if-then code "usually" -- even with -O2, often enough -- performing better when the then clause is the common case, in the absence of PGO. Bug 417995 comment 27 cites http://www.intel.com/design/processor/manuals/248966.pdf, page 88. If anyone has fresher information, please share by email.

Sorry to clutter up this bug; I made it a bit of a discussion list :-/

/be
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Assignee: general → wmccloskey