Closed Bug 453157 Opened 16 years ago Closed 16 years ago

watchdog thread as an alternative to operation count

Categories

(Core :: JavaScript Engine, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
mozilla1.9.1

People

(Reporter: igor, Assigned: andrei)

References

Details

(Keywords: fixed1.9.1, relnote, Whiteboard: fixed-in-tracemonkey)

Attachments

(3 files, 33 obsolete files)

2.62 KB, text/plain; charset=UTF-8
Details
1.34 KB, patch
gal
: review+
Details | Diff | Splinter Review
39.63 KB, patch
Details | Diff | Splinter Review
Currently the operation callback is called when the operation counter reaches zero with various places in the code contributing with different weights to the counter. In theory such weights should reflect the complexity of operations but in reality they are arbitrary. Thus the idea is to replace this counting by a watchdog thread that simply sets a flag after a predefined timeout expires. Then the code that currently decreases and checks the counter would simply checks the flag. This will remove the need to maintain the counting weights, improve the accuracy of callback scheduling and minimize the amount of code that currently is involved in the callback calling.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Attached patch Work in progress (obsolete) — Splinter Review
Comment on attachment 336496 [details] [diff] [review] Work in progress Here is style nits and suggestions for the patch: >--- a/dom/src/base/nsJSEnvironment.cpp Mon Aug 25 21:56:29 2008 +1200 ... >+#if JS_HAS_OPERATION_COUNT > // The accumulated operation weight before we call MaybeGC > const PRUint32 MAYBE_GC_OPERATION_WEIGHT = 5000 * JS_OPERATION_WEIGHT_BASE; >+#endif As discussed, I see no point in keeping the operation counting functionality in the browser. This just increases efforts to maintain the code. So #if JS_HAS_OPERATION_COUNT can be removed from nsJSEnvironment.cpp and nsDOMThreadService.cpp. >+static JSBool TestCallback(JSContext *cx) >+{ >+ >+ printf("Callback is called for %d\n",cx); >+ return JS_FALSE; Style nit: tabs are banned from the sources! Make sure that the editor does not ever insert them. >+static JSBool >+SetWatchdogInterval(JSContext *cx, JSObject *obj, uintN argc, jsval *argv, jsval *rval) >+{ >+ if (argc != 1) { >+ JS_ReportError(cx, "Wrong number of arguments"); >+ return JS_FALSE; >+ } >+ >+ jsdouble interval; >+ if (!JS_ValueToNumber(cx, argv[0], &interval)) >+ return JS_FALSE; >+ if (!(interval >= 0)) { NSPR does not support very long waiting time. Thus it make sence to limit the interval to 30 minutes (1800 seconds) or so. This way when PR_TicksPerSecond() is 1e6, 1800*1e6 would not overflow 32-bit int. >+ JS_ReportError(cx, "Illegal argument value"); >+ return JS_FALSE; >+ } >+ JS_SetWatchdogCallback(cx, TestCallback, (PRIntervalTime)(interval*PR_TicksPerSecond())); Style nit: a blank is required on both sides of binary operations like *, + etc. >+#if JS_HAS_OPERATION_COUNT > if (!gEnableBranchCallback) { > /* Enable the branch callback, for periodic scope-sharing. */ > gEnableBranchCallback = JS_TRUE; >- JS_SetBranchCallback(cx, my_BranchCallback); >+ //JS_SetBranchCallback(cx, my_BranchCallback); The branch callback has to be set when JS_HAS_OPERATION_COUNT is true for compatibility. >+ JS_FS("setWatchdogInterval", SetWatchdogInterval, 1,0,0), As discussed, the idea here to have a function like watchdogInterval that returns the current watchdog interval when called with 0 arguments and that sets the interval when called with one argument. Another suggestion is to abbreviate the function name to a short name in keeping with the tradition. Name like "watchint" would fit the current customs. > #ifdef DEBUG > JS_FS("dis", Disassemble, 1,0,0), > JS_FS("disfile", DisassFile, 1,0,0), > JS_FS("dissrc", DisassWithSrc, 1,0,0), > JS_FN("dumpHeap", DumpHeap, 0,0), > JS_FS("notes", Notes, 1,0,0), > JS_FS("tracing", Tracing, 0,0,0), > JS_FS("stats", DumpStats, 1,0,0), >@@ -2895,16 +2935,18 @@ static const char *const shell_help_mess > "trap([fun, [pc,]] exp) Trap bytecode execution", > "untrap(fun[, pc]) Remove a trap", > "line2pc([fun,] line) Map line number to PC", > "pc2line(fun[, pc]) Map PC to line number", > "stackQuota([number]) Query/set script stack quota", > "stringsAreUTF8() Check if strings are UTF-8 encoded", > "testUTF8(mode) Perform UTF-8 tests (modes are 1 to 4)", > "throwError() Throw an error from JS_ReportError", >+"setWatchdogInterval(interval) Set watchdog interval to the specified number" >+" of seconds", The lines that involves setWatchdogInterval must be protected with if !JS_HAS_OPERATION_COUNT > static JSBool > ContextCallback(JSContext *cx, uintN contextOp) > { > if (contextOp == JSCONTEXT_NEW) { > JS_SetErrorReporter(cx, my_ErrorReporter); > JS_SetVersion(cx, JSVERSION_LATEST); > SetContextOptions(cx); >+ /* >+ if (JS_GetWatchdogInterval(cx)) { >+ JS_SetOperationCallback(cx, my_test_callback, JS_MAX_OPERATION_LIMIT); >+ }*/ The dead code should not be commented out but rather removed. >diff -r 6958399a2eb1 js/src/jsapi.cpp >@@ -776,16 +776,22 @@ JS_NewRuntime(uint32 maxbytes) ... >+ rt->watchdogLock = JS_NEW_LOCK(); >+ if (!rt->watchdogLock) >+ goto bad; >+ rt->watchdogCv = JS_NEW_CONDVAR(rt->watchdogLock); >+ if (!rt->watchdogCv) >+ goto bad; The style nit: the indent level in 4, not 2, blanks. Another is to use a descriptive suffix for the conditional variable name, not Cv. watchdogWakeup can be a possibility. This will follow the names of other conditional variables. >+ if (rt->watchdogCv) >+ JS_DESTROY_CONDVAR(rt->watchdogCv); >+ if (rt->watchdogLock) >+ JS_DESTROY_LOCK(rt->watchdogLock); Again, the indent is 4, not 2 spaces! >+#ifdef JS_THREADSAFE >+static void WakeupWatchdog(JSRuntime *rt, PRIntervalTime interval) Style nit: attributes like static and the result should be on a separated line so the function name is always starts at the first column on its own line. Example: static void WakeupWatchdog(JSRuntime *rt, PRIntervalTime interval) >+{ >+ if (rt->watchdogInterval > interval) { this requires comments about skipping wakeup. Also, would this unconditionally wake up the watchdod when interval is 0? The style nit: 1. Kill the TABS! 2. To avoid over-identing the code SpiderMonkey's custom is to write: if (!A) return; code; instead of if (A) { code; } ... >+#ifdef JS_THREADSAFE As discussed, this should be #if !JS_HAS_OPERATION_COUNT >+static void WatchdogMain(void *arg) Style nit: function name should be at the first column. >+{ >+ >+ JSRuntime *rt = (JSRuntime*)arg; Style nits: 1. Kill the TABS! 2. A blank should separate * from the type name and the cast itself should have a blank before its operand like in (JSRuntime *) arg. >+ >+ PRStatus status; >+ PRBool isRunning; >+ do { Style nit: declarations should be separated with a blank line from the following code like in: >+ PRStatus status; >+ PRBool isRunning; >+ do { >+ >+ JS_LOCK_GC(rt); >+ JSContext *iter = NULL; >+ JSContext *cc; >+ PRIntervalTime newWatchdogInterval = 0; >+ while((cc = js_ContextIterator(rt, JS_FALSE, &iter))) { >+ if (cc->requestDepth && cc->watchdogLimit) { >+ PRIntervalTime ct = PR_IntervalNow(); Optimization suggestion: Initialize ct with the current time no more than once during the loop. PR_IntervalNow() is heavy. >+ rt->watchdogInterval = newWatchdogInterval?newWatchdogInterval:PR_INTERVAL_NO_TIMEOUT; Nits: blanks around ? and :. >+/* >+ * Change the watchdog limit associated with the watchdog callback. This API >+ * function may be called only when the result of JS_GetOperationCallback(cx) >+ * is not null. >+ */ Nit: The part in comments about "This API..." is no longer true - the function can be called any time. >+extern JS_PUBLIC_API(void) >+JS_SetWatchdogLimit(JSContext *cx, PRIntervalTime watchdogLimit) >+{ >+ if (watchdogLimit == cx->watchdogLimit) >+ return; /* Nothing to change - immediate exit. */ Style nit: in-line comments like here should start at the next tab stop like 4-8-12-16... >+ >+ cx->watchdogLimit = watchdogLimit; >+ >+ /* Start a new watchdog thread if it has not been started. If it has been started wake up the thread and cause Style nit: this line does not fit 78 characters limit. That is, in comments all the text must have no than 78 characters. >+ * the watchdog rescheduling. >+ */ >+ PR_Lock(cx->runtime->watchdogLock); >+ JSBool isRunning = (JSBool)cx->runtime->watchdogThread; Style nit: a) although C++ and C99 allows this, SM style is to stick with C89 tradition of declaring all the variables at the begining of the block. b) blanks after the (JSBool) cast. >+ if (!isRunning) { >+ cx->runtime->watchdogRunning = PR_TRUE; >+ cx->runtime->watchdogThread = PR_CreateThread(PRThreadType(PR_USER_THREAD), >+ WatchdogMain, >+ cx->runtime, >+ PRThreadPriority(PR_PRIORITY_NORMAL), >+ PRThreadScope(PR_LOCAL_THREAD), >+ PRThreadState(PR_JOINABLE_THREAD), >+ 0); >+ } >+ PR_Unlock(cx->runtime->watchdogLock); Here the code does not check for the result of PR_CreateThread. It can fail and the code must return JS_FALSE when so (after unlocking the watchdog lock). >+ if (isRunning) >+ WakeupWatchdog(cx->runtime, cx->watchdogLimit); I think WakeupWatchdog should be called here unconditionally even when isRunning is false. The watchdod thread may well already be sleeping at this point. > JS_PUBLIC_API(JSBool) > JS_IsConstructing(JSContext *cx) > { >- return cx->fp && (cx->fp->flags & JSFRAME_CONSTRUCTING); >+ return cx->fp && (cx->fp-> flags & JSFRAME_CONSTRUCTING); Kill that tab after cx->fp->! >+#ifndef JS_HAS_OPERATION_COUNT >+#if (defined(MOZILLA_VERSION) || defined(JS_THREADSAFE)) >+#define JS_HAS_OPERATION_COUNT 0 >+#else >+#define JS_HAS_OPERATION 1 >+#endif >+#endif Style nit: the preprocessor directives nesed inside #if like here should have blanks after # to indicate the preprocessor level, like in: >+#ifndef JS_HAS_OPERATION_COUNT >+# if (defined(MOZILLA_VERSION) || defined(JS_THREADSAFE)) >+# define JS_HAS_OPERATION_COUNT 0 >+# else >+# define JS_HAS_OPERATION 1 >+# endif >+#endif >diff -r 6958399a2eb1 js/src/jscntxt.cpp .. >@@ -243,17 +246,21 @@ js_NewContext(JSRuntime *rt, size_t stac > JSContextCallback cxCallback; > JS_ClearAllWatchPoints(cx); >+ /* >+ * Code to shutdown watchdog >+ */ >+ ShutdownWatchdog(cx); The call to ShutdownWatchdog should be wrapped into #if !JS_HAS_OPERATION_CALLBACK The style nits: 1. When block comments like here follows the code at the same indentation level, they should be separated with a blank line from the previous code. 2. The comments must be a full sentence ending with a period. 3. If comments fits one, they should be fit on one line unless one wants to show very different nature of the code. Here is a possible fix: > JS_ClearAllWatchPoints(cx); >+ /* Shutdown the watchdog. */ >+ ShutdownWatchdog(cx); But even better would be to remove the comments as they duplicates the function name. > if (cx->operationCallback) { > /* > * Invoke the deprecated branch callback. It may be called only when > * the top-most frame is scripted or JSOPTION_NATIVE_BRANCH_CALLBACK > * is set. > */ >- script = cx->fp ? cx->fp->script : NULL; >+ JSScript *script = cx->fp ? cx->fp->script : NULL; > if (script || JS_HAS_OPTION(cx, JSOPTION_NATIVE_BRANCH_CALLBACK)) Style nits: a bad indent for JSScript and missing blank like after the JSScript declaration. >+static JSBool >+ShutdownWatchdog(JSContext *cx) >+{ >+ Kill the tabs! >+ PR_Lock(cx->runtime->watchdogLock); >+ cx->runtime->watchdogRunning = PR_FALSE; >+ PR_NotifyCondVar(cx->runtime->watchdogCv); >+ PR_Unlock(cx->runtime->watchdogLock); >+ PR_JoinThread(cx->runtime->watchdogThread); >+ cx->runtime->watchdogThread = NULL; >+ return JS_TRUE; Shouldn't this code check that watchdogThread is not null? >--- a/js/src/jscntxt.h Mon Aug 25 21:56:29 2008 +1200 >+#ifdef JS_THREADSAFE The check should be #if !JS_HAS_OPERATION_COUNT >+ >+ /* >+ * Variables to support thread watchdog. >+ */ >+ PRIntervalTime watchdogInterval; This interval in JSRuntime is no longer used and should be removed, right? > >@@ -862,16 +877,17 @@ struct JSContext { "operationCount" is declared at the beginning of JSContext. It should be declared with the attribute volatile like in volatile type operationCount. > JSCList threadLinks; /* JSThread contextList linkage */ >+ PRIntervalTime startTime; /* Time when the context thread was started */ Kill the tabs! :) > > #define CX_FROM_THREAD_LINKS(tl) \ > ((JSContext *)((char *)(tl) - offsetof(JSContext, threadLinks))) > #endif > > /* PDL of stack headers describing stack slots not rooted by argv, etc. */ > JSStackHeader *stackHeaders; > >@@ -881,16 +897,20 @@ struct JSContext { > /* Stack of thread-stack-allocated temporary GC roots. */ > JSTempValueRooter *tempValueRooters; > > /* List of pre-allocated doubles. */ > JSGCDoubleCell *doubleFreeList; > > /* Debug hooks associated with the current context. */ > JSDebugHooks *debugHooks; >+#ifdef JS_THREADSAFE This should be #if !JS_HAS_OPERATION_COUNT >+ /* Watchdog limit associated with the current context. */ >+ PRIntervalTime watchdogLimit; >+#endif > }; >+#if JS_HAS_OPERATION_COUNT >+ > #define JS_CHECK_OPERATION_LIMIT(cx, weight) \ > (JS_CHECK_OPERATION_WEIGHT(weight), \ >- (((cx)->operationCount -= (weight)) > 0 || js_ResetOperationCount(cx))) >+ (((cx)->operationCount -= (weight))> 0 || js_ResetOperationCount(cx))) The spurious blank removal should not be here. > > /* > * A version of JS_CHECK_OPERATION_LIMIT that just updates the operation count > * without calling the operation callback or any other API. This macro resets > * the count to 0 when it becomes negative to prevent a wrap-around when the > * macro is called repeatably. > */ > #define JS_COUNT_OPERATION(cx, weight) \ > ((void)(JS_CHECK_OPERATION_WEIGHT(weight), \ > (cx)->operationCount = ((cx)->operationCount > 0) \ > ? (cx)->operationCount - (weight) \ > : 0)) >- This blanl lank should not be removed either. > /* > * The implementation of the above macros assumes that subtracting weights > * twice from a positive number does not wrap-around INT32_MIN. > */ > #define JS_CHECK_OPERATION_WEIGHT(weight) \ > (JS_ASSERT((uint32) (weight) > 0), \ > JS_ASSERT((uint32) (weight) < JS_BIT(30))) > >@@ -1196,17 +1217,21 @@ extern JSErrorFormatString js_ErrorForma > #define JSOW_ALLOCATION 100 > #define JSOW_LOOKUP_PROPERTY 5 > #define JSOW_GET_PROPERTY 10 > #define JSOW_SET_PROPERTY 20 > #define JSOW_NEW_PROPERTY 200 > #define JSOW_DELETE_PROPERTY 30 > #define JSOW_ENTER_SHARP JS_OPERATION_WEIGHT_BASE > #define JSOW_SCRIPT_JUMP JS_OPERATION_WEIGHT_BASE >- >+#else >+#define JS_CHECK_OPERATION_LIMIT(cx, weight) \ >+ (((cx)->operationCount)> 0 || js_ResetOperationCount(cx)) Style nit: a blank before >. >diff -r 6958399a2eb1 js/src/jsinterp.cpp ... >+#define CHECK_BRANCH() \ >+ JS_BEGIN_MACRO \ >+ if ((cx->operationCount /*-= JSOW_SCRIPT_JUMP*/) <= 0) { \ >+ if (!js_ResetOperationCount(cx)) \ >+ goto error; \ >+ } \ >+ JS_END_MACRO This can be replaced with just #define CHECK_BRANCH() \ JS_BEGIN_MACRO \ if (!JS_CHECK_OPERATION_LIMIT(cx, 1)) \ goto error; \ JS_END_MACRO \ With the watchdog there is no need to optimize the JS_CHECK_OPERATION_LIMIT macro.
Attached patch Work in progress 2 (obsolete) — Splinter Review
Changes are made according to Igor's comments. Close to the final
Attachment #336496 - Attachment is obsolete: true
Comment on attachment 339763 [details] [diff] [review] Work in progress 2 Here is *style-only* comments: >diff -r 6f18a6d94920 dom/src/threads/nsDOMThreadService.cpp > JS_SetErrorReporter(cx, DOMWorkerErrorReporter); > >- JS_SetOperationCallback(cx, DOMWorkerOperationCallback, >- 100 * JS_OPERATION_WEIGHT_BASE); >+ JS_SetWatchdogCallback(cx, DOMWorkerOperationCallback, PR_TicksPerSecond()/1000); missing whitespace around the "/" operator >diff -r 6f18a6d94920 js/src/js.cpp >+#if !JS_HAS_OPERATION_COUNT >+ >+static JSBool >+TestCallback(JSContext *cx) >+{ >+ return JS_TRUE; >+} >+ >+static JSBool >+SetWatchdogInterval(JSContext *cx, JSObject *obj, uintN argc, >+ jsval *argv, jsval *rval) >+{ >+ if (!(argc == 1 || argc == 0)) { >+ JS_ReportError(cx, "Wrong number of arguments"); >+ return JS_FALSE; >+ } >+ >+ if (argc == 0) { >+ >+ return JS_NewDoubleValue(cx, JS_GetWatchdogLimit(cx), rval); >+ >+ } else { 1. No extra blank lines around return. 2. The else part should be omitted as the if-then part ends with the return statement. 3. With else removal the {} around the if part should be removed as the part would contain the single return statement that fits one line. >+ jsdouble interval; >+ if (!JS_ValueToNumber(cx, argv[0], &interval)) >+ return JS_FALSE; >+ if (!(interval >= 0)) { >+ JS_ReportError(cx, "Negative argument value"); >+ return JS_FALSE; >+ } Comment before the "if" that !(x >= 0), not (x < 0), is used to cover the case of nans. >+ JS_SetWatchdogCallback(cx, TestCallback, (PRIntervalTime) (interval >+ * PR_TicksPerSecond())); Indent should be done on the "highest" level, not inside some nested expression. Here the highest level as the argument list so the indent may look like: >+ JS_SetWatchdogCallback(cx, TestCallback, (PRIntervalTime) (interval * PR_TicksPerSecond())); or >+ JS_SetWatchdogCallback( cx, TestCallback, (PRIntervalTime) (interval * PR_TicksPerSecond())); Alternatively, one can use: PRIntervalTime nticks; ... nticks = (PRIntervalTime) (interval * PR_TicksPerSecond()); >+ JS_SetWatchdogCallback(cx, TestCallback, nticks); >diff -r 6f18a6d94920 js/src/jsapi.cpp >@@ -884,16 +899,28 @@ > return rt->data; > } > > JS_PUBLIC_API(void) > JS_SetRuntimePrivate(JSRuntime *rt, void *data) > { > rt->data = data; > } >+#if !JS_HAS_OPERATION_COUNT >+static void >+WakeupWatchdog(JSRuntime *rt, PRIntervalTime interval) >+{ >+ if (rt->watchdogInterval <= interval) { >+ return; >+ } no {} around single-statement that fits one line if-then part of loops. >+#endif >+#if !JS_HAS_OPERATION_COUNT >+static void >+WatchdogMain(void *arg) >+{ >+ >+ JSRuntime *rt = (JSRuntime *) arg; ^^^^^ No tabs here! >+ >+ PRStatus status; >+ PRBool isRunning; no blank line between local variable rt and status declarations >+ >+ do { >+ >+ JS_LOCK_GC(rt); >+ JSContext *iter = NULL; >+ JSContext *cc; >+ PRIntervalTime newWatchdogInterval = 0; >+ while ((cc = js_ContextIterator(rt, JS_FALSE, &iter))) { >+ if (cc->requestDepth && cc->watchdogLimit) { >+ PRIntervalTime ct = PR_IntervalNow(); blank line after a declaration >+ if (ct - cc->startTime > cc->watchdogLimit) >+ cc->operationCount = 0; >+ if (newWatchdogInterval > cc->watchdogLimit >+ || !newWatchdogInterval) >+ newWatchdogInterval = cc->watchdogLimit; 1. When splitting, the operator should be the last thing on the old line, not the first thing on the new line. 2. On the new line the identation should stay on the same aligning thing under "(" that opens the if. 2. When if/for code does not fit one line, {} are required around "then" part, like in: >+ if (newWatchdogInterval > cc->watchdogLimit || >+ !newWatchdogInterval) { >+ newWatchdogInterval = cc->watchdogLimit; } >+ } >+ } >+ rt->watchdogInterval = newWatchdogInterval ? newWatchdogInterval >+ : PR_INTERVAL_NO_TIMEOUT; The rules for the ? operator requires either to fit the whole thing on one line or split it in 3 lines with ? and : parts aligned under the column that starts the condition, like in: >+ rt->watchdogInterval = newWatchdogInterval ? newWatchdogInterval >+ : PR_INTERVAL_NO_TIMEOUT; >+extern JS_PUBLIC_API(JSBool) >+JS_SetWatchdogCallback(JSContext *cx, JSOperationCallback callback, >+ PRIntervalTime watchdogLimit) >+{ >+ JS_ASSERT(callback); Kill the tab! >+extern JS_PUBLIC_API(void) >+JS_ClearWatchdogCallback(JSContext *cx) >+{ >+ cx->operationCount = 1; Kill the tab! >+ cx->operationCallback = NULL; >+} >+ >+extern JS_PUBLIC_API(JSOperationCallback) >+JS_GetWatchdogCallback(JSContext *cx) >+{ >+ return cx->operationCallback; Kill the tab! >+} >+ >+/* >+ * Get the watchdog limit associated with the watchdog callback. >+ */ >+extern JS_PUBLIC_API(PRIntervalTime) >+JS_GetWatchdogLimit(JSContext *cx) >+{ >+ return cx->watchdogLimit; Kill the tab! >+} >+ >+/* >+ * Change the watchdog limit associated with the watchdog callback. This API >+ * function may be called only when the result of JS_GetOperationCallback(cx) >+ * is not null. >+ */ >+extern JS_PUBLIC_API(JSBool) >+JS_SetWatchdogLimit(JSContext *cx, PRIntervalTime watchdogLimit) >+{ >+ JSBool isRunning; >+ if (watchdogLimit == cx->watchdogLimit) >+ return JS_TRUE; /* Nothing to change - immediate exit. */ Kill the tab and re-align the "then" part. >+ >+ cx->watchdogLimit = watchdogLimit; >+ >+ /* Start a new watchdog thread if it has not been started. If it has been >+ * started wake up the thread and cause the watchdog rescheduling. >+ */ The comment style requires to keep the starting /* in multi-line comments alone on the line like in: /* >+ * Start a new watchdog thread if it has not been started. If it has been >+ * started wake up the thread and cause the watchdog rescheduling. >+ */ >+ PR_Lock(cx->runtime->watchdogLock); >+ isRunning = (JSBool) cx->runtime->watchdogThread; >+ if (!isRunning) { >+ cx->runtime->watchdogRunning = PR_TRUE; >+ cx->runtime->watchdogThread = PR_CreateThread(PRThreadType(PR_USER_THREAD), >+ WatchdogMain, >+ cx->runtime, >+ PRThreadPriority(PR_PRIORITY_NORMAL), >+ PRThreadScope(PR_LOCAL_THREAD), >+ PRThreadState(PR_JOINABLE_THREAD), >+ 0); When breaking the line while keeping at least one argument on the old line, the rest of the argument should be aligned on "(" that starts the method call like in: >+ cx->runtime->watchdogThread = PR_CreateThread(PRThreadType(PR_USER_THREAD), >+ WatchdogMain, >+ cx->runtime, >+ PRThreadPriority(PR_PRIORITY_NORMAL), but that would not fit so an extra line break after "=" is necessary: >+ cx->runtime->watchdogThread = PR_CreateThread(PRThreadType(PR_USER_THREAD), >+ WatchdogMain, >+ cx->runtime, >+ PRThreadPriority(PR_PRIORITY_NORMAL), >+ PR_Unlock(cx->runtime->watchdogLock); >+ if (cx->runtime->watchdogThread == NULL) >+ return JS_FALSE; Change pointer == NULL into !pointer. >+#if !JS_HAS_OPERATION_COUNT >+ >+static JSBool >+ShutdownWatchdog(JSContext *cx) >+{ >+ PR_Lock(cx->runtime->watchdogLock); Kill the tab! >diff -r 6f18a6d94920 js/src/jscntxt.h > JSTitle *lockedSealedTitle; /* weak ref, for low-cost sealed > title locking */ > JSCList threadLinks; /* JSThread contextList linkage */ >+ PRIntervalTime startTime; /* Time when the context thread was started */ Align comment to follow the comment alignment from previous lines and start it with a small letter as such in-line comments should not be full sentences. >@@ -1163,51 +1183,57 @@ >-#define JS_CHECK_OPERATION_LIMIT(cx, weight) \ >+#if JS_HAS_OPERATION_COUNT >+ >+# define JS_CHECK_OPERATION_LIMIT(cx, weight) \ > (JS_CHECK_OPERATION_WEIGHT(weight), \ > (((cx)->operationCount -= (weight)) > 0 || js_ResetOperationCount(cx))) As a matter of style, "\" that continues the C preprocessor macros should be put it the column 79. >-#define JS_COUNT_OPERATION(cx, weight) \ >+# define JS_COUNT_OPERATION(cx, weight) \ > ((void)(JS_CHECK_OPERATION_WEIGHT(weight), \ > (cx)->operationCount = ((cx)->operationCount > 0) \ > ? (cx)->operationCount - (weight) \ > : 0)) Here again. > > /* > * The implementation of the above macros assumes that subtracting weights > * twice from a positive number does not wrap-around INT32_MIN. > */ >-#define JS_CHECK_OPERATION_WEIGHT(weight) \ >+# define JS_CHECK_OPERATION_WEIGHT(weight) \ > (JS_ASSERT((uint32) (weight) > 0), \ > JS_ASSERT((uint32) (weight) < JS_BIT(30))) and here. > > /* Relative operations weights. */ >-#define JSOW_JUMP 1 >-#define JSOW_ALLOCATION 100 >-#define JSOW_LOOKUP_PROPERTY 5 >-#define JSOW_GET_PROPERTY 10 >-#define JSOW_SET_PROPERTY 20 >-#define JSOW_NEW_PROPERTY 200 >-#define JSOW_DELETE_PROPERTY 30 >-#define JSOW_ENTER_SHARP JS_OPERATION_WEIGHT_BASE >-#define JSOW_SCRIPT_JUMP JS_OPERATION_WEIGHT_BASE >- >+# define JSOW_JUMP 1 >+# define JSOW_ALLOCATION 100 >+# define JSOW_LOOKUP_PROPERTY 5 >+# define JSOW_GET_PROPERTY 10 >+# define JSOW_SET_PROPERTY 20 >+# define JSOW_NEW_PROPERTY 200 >+# define JSOW_DELETE_PROPERTY 30 >+# define JSOW_ENTER_SHARP JS_OPERATION_WEIGHT_BASE >+# define JSOW_SCRIPT_JUMP JS_OPERATION_WEIGHT_BASE More esoteric stile rules: here the macro definitions should be aligned on 4-character indent-step boundary. That is, the extra space after # should be compensated by removing the space before constants.
Attached patch Work in progress (obsolete) — Splinter Review
Attachment #339763 - Attachment is obsolete: true
The problems with the current approach: 1. It only works with JS_THREADSAFE version. An embedding may well use signals as an alternative to the watchdog signals. 2. It puts way too much policy inside the engine with explicit thread management. 3. It is not very efficient with the code enumerating all contexts. To resolve this the suggestion would be to move all the watchdog management to the browser keeping just API to set watchdog termination flag.
Blocks: 462228
Attached patch Fix (obsolete) — Splinter Review
Attachment #340344 - Attachment is obsolete: true
Attachment #345543 - Flags: review?(igor)
Attached patch Fix ready for testing (obsolete) — Splinter Review
This patch is ready for testing of the watchdog functionality.
Attachment #345543 - Attachment is obsolete: true
Attachment #345820 - Flags: review?(igor)
Attachment #345543 - Flags: review?(igor)
Comment on attachment 345820 [details] [diff] [review] Fix ready for testing >diff -r 5d65342d7287 dom/src/base/nsJSEnvironment.cpp >@@ -885,16 +882,17 @@ GetPromptFromContext(nsJSContext* ctx) > nsIPrompt* prompt; > ireq->GetInterface(NS_GET_IID(nsIPrompt), (void**)&prompt); > return prompt; > } > > JSBool > nsJSContext::DOMOperationCallback(JSContext *cx) > { >+ > nsresult rv; Nit: remove the added blank line here. >diff -r 5d65342d7287 dom/src/threads/nsDOMThreadService.cpp > JSBool > DOMWorkerOperationCallback(JSContext* aCx) > { >+ > nsDOMWorkerThread* worker = (nsDOMWorkerThread*)JS_GetContextPrivate(aCx); Nit: remove the added blank line. >diff -r 5d65342d7287 js/src/js.cpp >+#if !JS_HAS_OPERATION_COUNT >+ >+ /* >+ * Variables to support thread watchdog. >+ */ >+ >+static PRLock *watchdogLock; Nit: Start comments in the first column and remove the blank line after them. >+/* Watchdog limit. */ >+static PRIntervalTime watchdogLimit; >+ >+ >+#endif Nit: remove extra blank line before endif. >+#if !JS_HAS_OPERATION_COUNT >+ >+static JSBool >+ShutdownWatchdog() ... >+static JSBool >+WakeupWatchdog() The functions must be void, not boolean, as they cannot fail and their callers assumes so. >+static JSBool >+TestCallback(JSContext *cx) >+{ >+ printf("Callback is called for %d\n", (int) cx); >+ return JS_TRUE; >+} This function should return false when the watchdog limit is set and it is reached. For reporting it should not not print the context but rather a reasonable text about a script spending too much time. Also, TestCallback is a bad name for such function. >+ >+static void >+WatchdogMain(void *arg) >+{ >+ JSRuntime *rt = (JSRuntime *) arg; >+ PRStatus status; >+ PRBool isRunning; >+ >+ do { >+ >+ PR_Lock(watchdogLock); Nit: remove the blank line after do { . >+ JS_LOCK_GC(rt); >+ JSContext *iter = NULL; >+ JSContext *cc; Nit: the canonical name for an iterated context is acx, not cc. >+ } while (isRunning == PR_TRUE && status == PR_SUCCESS); Nit: use just isRunning and not isRunning == PR_TRUE. >+/* >+ * Change the watchdog limit associated with the watchdog callback. This API >+ * function may be called only when the result of JS_GetOperationCallback(cx) >+ * is not null. >+ */ >+static JSBool >+SetWatchdogLimit(JSContext *cx, PRIntervalTime newWatchdogLimit) >+{ >+ JSBool isRunning; >+ PRIntervalTime oldWatchdogLimit; >+ >+ if (newWatchdogLimit == watchdogLimit) >+ return JS_TRUE; /* Nothing to change - immediate exit. */ Nit: in SM short comments that follow the code on the same line starts form small letter and do end in the period to emphasis that the text is sentence fragment. But here the comments just rephrase the code adding the noise. It is better to drop them all together. >+ >+ oldWatchdogLimit = watchdogLimit; >+ watchdogLimit = newWatchdogLimit; >+ >+ /* >+ * Start a new watchdog thread if it has not been started. If it has been >+ * started wake up the thread and cause the watchdog rescheduling. >+ */ ... >+ PR_Unlock(watchdogLock); >+ if (!watchdogThread) >+ return JS_FALSE; Add JS_ReportError here when thread creation fails. >+#if !JS_HAS_OPERATION_COUNT >+ >+static JSBool >+SetWatchdogInterval(JSContext *cx, JSObject *obj, uintN argc, >+ jsval *argv, jsval *rval) Nits: call the function WatchdogInterval as it both sets and reads the value; align the arguments on the second line to start after (. >+ jsdouble interval; >+ if (!JS_ValueToNumber(cx, argv[0], &interval)) >+ return JS_FALSE; >+ /* Next conditions takes NaNs into account. */ Nit: blank line before the comments. >+ if (!(interval >= 0)) { >+ JS_ReportError(cx, "Negative or NaN argument value"); >+ return JS_FALSE; >+ } Add comments on use of !(interval >= 0) and not (interval < 0) as the way to filter out NaNs. Nit: use 0.0, not 0, to remind the type of interval. >+ if (!(interval <= 1800.0)) { >+ JS_ReportError(cx, "Excessive argument value"); >+ return JS_FALSE; >+ } Here use just interval > 1800 as NaNs are already filtered out. >diff -r 5d65342d7287 js/src/jsapi.cpp ... >+extern JS_PUBLIC_API(JSBool) >+JS_SetWatchdogCallback(JSContext *cx, JSOperationCallback callback) Change the type of the function to void. The callback set should be infallible. Nit: drop extern before the function declaration. >+{ >+ JS_ASSERT(callback); >+ >+ cx->operationCount = 1; Why is this assignment to operationCount is necessary? >+ cx->operationCallback = callback; >+ >+ return JS_TRUE; >+} >+ >+extern JS_PUBLIC_API(void) >+JS_ClearWatchdogCallback(JSContext *cx) Nit: drop extern before the function declaration. >+{ >+ cx->operationCount = 1; Again, why is this assignment to operationCount is necessary? >+ cx->operationCallback = NULL; >+} >+ >+extern JS_PUBLIC_API(JSOperationCallback) >+JS_GetWatchdogCallback(JSContext *cx) Nit: drop extern before the function declaration. >+{ >+ return cx->operationCallback; >+} >+ >+extern JS_PUBLIC_API(JSBool) >+JS_PrepareContextForReset(JSContext *cx) Nit: drop extern before the function declaration. >+{ >+ cx->operationCount = 0; >+ return JS_TRUE; >+} >+ >+#endif > >diff -r 5d65342d7287 js/src/jsapi.h >@@ -43,16 +43,28 @@ > /* > * JavaScript API. > */ > #include <stddef.h> > #include <stdio.h> > #include "js-config.h" > #include "jspubtd.h" > #include "jsutil.h" >+ >+#ifndef JS_HAS_OPERATION_COUNT >+# if (defined(MOZILLA_VERSION) || defined(JS_THREADSAFE)) >+# define JS_HAS_OPERATION_COUNT 0 >+# else >+# define JS_HAS_OPERATION_COUNT 1 >+# endif >+#endif >+ >+#if !JS_HAS_OPERATION_COUNT >+# include "prcvar.h" >+#endif This include, prcvar.h, should be removed. jsapi.c must not include any NSPR stuff. > >+#if JS_HAS_OPERATION_COUNT > /* > * The maximum value of the operation limit to pass to JS_SetOperationCallback > * and JS_SetOperationLimit. > */ > #define JS_MAX_OPERATION_LIMIT ((uint32) 0x7FFFFFFF) > > #define JS_OPERATION_WEIGHT_BASE 4096 > >@@ -2215,16 +2228,33 @@ JS_SetOperationLimit(JSContext *cx, uint > * JS_SetOperationCallback(cx, callback, 4096, NULL); > * > * except that the callback will not be called from a long-running native > * function when JSOPTION_NATIVE_BRANCH_CALLBACK is not set and the top-most > * frame is native. > */ > extern JS_PUBLIC_API(JSBranchCallback) > JS_SetBranchCallback(JSContext *cx, JSBranchCallback cb); >+#else >+/* >+ * Set the watchdog callback that the engine calls while resetting >+ * context. >+ */ >+extern JS_PUBLIC_API(JSBool) >+JS_SetWatchdogCallback(JSContext *cx, JSOperationCallback callback); After I had started to review this patch I realized that the patch really should not use the terming "watchdog" for the API. The current design does not require that the embedding uses a special signaling thread and the new terming just means to maintain more documentation. Thus my strong suggestion is to call this function just as before, JS_SetOperationCallback. >+ >+extern JS_PUBLIC_API(void) >+JS_ClearWatchdogCallback(JSContext *cx); >+ >+extern JS_PUBLIC_API(JSOperationCallback) >+JS_GetWatchdogCallback(JSContext *cx); >+ >+extern JS_PUBLIC_API(JSBool) >+JS_PrepareContextForReset(JSContext *cx); Call this function JS_TriggerOperationCallback(JSContext *cx); >diff -r 5d65342d7287 js/src/jscntxt.cpp > > JSBool > js_ResetOperationCount(JSContext *cx) > { > JSScript *script; >+ JS_ASSERT(cx->operationCount <= 0); > >- JS_ASSERT(cx->operationCount <= 0); Nit: keep the blank line before JS_ASSERT. >+ cx->operationCount = 1; >+ if (cx->operationCallback) >+ return cx->operationCallback(cx); Use a local variable here for cx->operationCallback. An embedding may clear the callback from another thread after the code checks for cx->operationCallback but before calling cx->operationCallback. >diff -r 5d65342d7287 js/src/jspubtd.h > /* Callbacks and their arguments. */ > > typedef enum JSContextOp { > JSCONTEXT_NEW, >- JSCONTEXT_DESTROY >+ JSCONTEXT_DESTROY, >+ JSCONTEXT_SCRIPT_START Call the constant JSCONTEXT_REQUEST_START as JS_BeginRequest may not run any scripts. >diff -r 5d65342d7287 js/src/xpconnect/idl/nsIXPConnect.idl > [ptr] native nsAXPCNativeCallContextPtr(nsAXPCNativeCallContext); >+ >+ native PRIntervalTime(PRIntervalTime); Nit: drop the blank line before PRIntervalTime. > > /***************************************************************************/ > > %{ C++ > /***************************************************************************/ > #define GENERATE_XPC_FAILURE(x) \ > (NS_ERROR_GENERATE_FAILURE(NS_ERROR_MODULE_XPCONNECT,x)) > >@@ -762,9 +764,25 @@ interface nsIXPConnect : nsISupports >+ >+ /** >+ * Returns the value of the watchdog limit for the specified context. >+ * @param cx >+ * A context >+ */ >+ [noscript,notxpcom] PRIntervalTime getWatchdogLimit(in JSContextPtr cx); >+ /** Nit: blank line before the comments. >diff -r 5d65342d7287 js/src/xpconnect/src/xpccomponents.cpp ... >@@ -3359,48 +3359,49 @@ ContextHolder::ContextHolder(JSContext * > if(mJSContext) > { > JS_SetOptions(mJSContext, > JSOPTION_DONT_REPORT_UNCAUGHT | > JSOPTION_PRIVATE_IS_NSISUPPORTS); > JS_SetGlobalObject(mJSContext, aSandbox); > JS_SetContextPrivate(mJSContext, this); > >- if(JS_GetOperationCallback(aOuterCx)) >- { >- JS_SetOperationCallback(mJSContext, ContextHolderOperationCallback, >- JS_GetOperationLimit(aOuterCx)); >- } >+ if(nsXPConnect::GetXPConnect()->GetWatchdogLimit(mJSContext)) >+ JS_SetWatchdogCallback(mJSContext, >+ ContextHolderOperationCallback); This is not correct transformation. The code should set both the callback and watchdog limit for mJSContext to match aOuterCx. >diff -r 5d65342d7287 js/src/xpconnect/src/xpcjsruntime.cpp > // GCCallback calls are chained > static JSBool >@@ -233,16 +233,20 @@ ContextCallback(JSContext *cx, uintN ope > if(operation == JSCONTEXT_NEW) > { > if(!self->OnJSContextNew(cx)) > return JS_FALSE; > } > else if(operation == JSCONTEXT_DESTROY) > { > delete XPCContext::GetXPCContext(cx); >+ } >+ else if(operation == JSCONTEXT_SCRIPT_START) { Nit: xpconnect style dictates to put { on its own line. >+ if(!self->WakeupWatchdog()) >+ return JS_FALSE; > } The code here must be infallible as the JS_BeginRequest cannot fail. This requres in turn that WakeupWatchdog must never fail. >+/* static */ >+void >+XPCJSRuntime::WatchdogMain(void *arg) >+{ >+ >+ XPCJSRuntime *xp = (XPCJSRuntime *) arg; Nit: use xpcrt, not xp, as the name for the runtime local. >+ JSRuntime *rt = xp->GetJSRuntime(); >+ PRStatus status; >+ PRBool isRunning; >+ >+ do { >+ >+ JSContext *iter = NULL; >+ JSContext *cc = NULL; Nit: use acx, not cc. Do not set it to NULL as this is a useless operation. >+ >+ } while (isRunning == PR_TRUE && status == PR_SUCCESS); Nit use isRunning, not isRunning == PR_TRUE. >+PRBool >+XPCJSRuntime::SetWatchdogLimit(JSContext *cx, PRIntervalTime newWatchdogLimit) >+{ >+ >+ PRBool isRunning; >+ PRIntervalTime oldWatchdogLimit; >+ XPCContext *xpc = XPCContext::GetXPCContext(cx); Nit: use ccx, not xpc, for a XPCContext local. >+ >+ if (newWatchdogLimit == xpc->mWatchdogLimit) >+ return PR_TRUE; /* Nothing to change - immediate exit. */ Nit: drop that comment. >+PRBool >+XPCJSRuntime::WakeupWatchdog() The function must be void as the code assumes it never fails. >+ PR_Lock(watchdogLock); >+ if (currentInterval == PR_INTERVAL_NO_TIMEOUT) { >+ PR_NotifyCondVar(watchdogWakeup); >+ } Question: why this wakes the watchdog only when currentInterval == PR_INTERVAL_NO_TIMEOUT ? What if the new watchdog limit is set to a value that is less than the current wait time? > // XPCContext is mostly a dumb class to hold JSContext specific data and > // maps that let us find wrappers created for the given JSContext. ... > private: > XPCJSRuntime* mRuntime; > JSContext* mJSContext; > nsresult mLastResult; > nsresult mPendingResult; > nsIXPCSecurityManager* mSecurityManager; > nsIException* mException; >+ PRIntervalTime mWatchdogLimit; > LangType mCallingLangType; > PRUint16 mSecurityManagerFlags; >+ >+ /* Watchdog Limit */ >+ I guess mWatchdogLimit should be declared after the comment.
Comment on attachment 345820 [details] [diff] [review] Fix ready for testing Igor, what's delaying this review?
(In reply to comment #10) > (From update of attachment 345820 [details] [diff] [review]) > Igor, what's delaying this review? sorry, I see the comment is recent. please update status here.
Attachment #345820 - Flags: review?(igor)
Technically I think this obsoletes patch 345820 above, but I feel rude obsoleting someone else's patch. Doesn't change anything (yet), just updates to tracemonkey tip and removes all the spurious whitespace diffs. I'll go back to this and integrate Igor's comments if the original poster isn't inclined.
This second bit integrates JIT support for the watchdog timer, using the code patching routine in bug 462228, and fixes a crasher bug in that routine. I've tested this with a simple html page that infinite-loops, and it appears to pop up a dialog for the user after long enough and, if selected, stop the page. Though it seems slightly crashy still, every few runs. Some points: - The synchronization mechanism is the crudest thing I could work out that would be correct. There is a boundary mutex on the trace and a counter saying how deep in trace we are (different meaning from the 'onTrace' flag). The watchdog only manipulates the Fragmento when it has acquired the boundary mutex and knows that the thread is in-trace. This is possibly too expensive, further suggestions welcome. - I had some code in here to break out of long-running regex engine traces too, but learned that the exit type of the regex traces is presently not a GuardRecord. So I removed the code to attempt to break out of them: long regexes will simply not be interrupted. We'll have to revisit this. - The watchdog thread is waking its context up 10 times per second. This means that we disconnect all the loop-edges and reconnect them 10 times per second. I think it might be better to run this closer to the frequency of the actual firefox-set script timeout. I don't really know why it was set to this level. Comments appreciated.
Attachment #347015 - Flags: review?(igor)
To Andrei: could you attach ASAP the newer patch even if has some problems in the js.cpp shell's implementation? As I understand, it works with the browser. Some general observation: instead of separated tm->traceBoundaryLock I suggest to use the GC lock. The watchdog thread calls JS_PrepareContextForReset (or JS_TriggerOperationCallback as it is called in the newer patch) with that lock held to ensure that cx is not destroyed or leaves the JS request. I also think that for testing and to ensure API compatibility it would be very helpful if the patch with JS_HAS_OPERATION_COUNT defined would add the operation count update to the trace. This way the trace would follow what the interpreter does without and *with* JS_HAS_OPERATION_COUNT and this option becomes truly the alternative to separated threads or signals when the embedding do not wish to use them.
(In reply to comment #13) > - The watchdog thread is waking its context up 10 times per second. This > means that we disconnect all the loop-edges and reconnect them 10 times per > second. I think it might be better to run this closer to the frequency of the > actual firefox-set script timeout. I don't really know why it was set to this > level. Besides monitoring the long-running scripts the operation callback in the browser serves the following purposes: 1. Monitor the GC heap and trigger the GC when heuristics are met. The bug 453432 would eliminate that. 2. Monitor tight overall memory conditions and stop the script if so. This can be fixed if instead of simply setting a flag the out-of-memory handler would also call JS_PrepareContextForReset/JS_TriggerOperationCallback. 3. Force to quit canceled DOM workers and to force them to leave periodically the JS request so the GC can run on the main thread. With proper setup it should be possible to simply call JS_TriggerOperationCallback for both conditions. Even better would be to use separated JS runtime for workers, but that is not for 1.9.1.
Igor's comments were incorporated. Also JS shell problem suddenly discovered after applying the comments was fixed.
Attachment #345820 - Attachment is obsolete: true
Attachment #347269 - Flags: review?(igor)
Comment on attachment 347271 [details] [diff] [review] The same patch as previous but without removing spaces >diff -r b5f3b30402cb dom/src/base/nsJSEnvironment.cpp >+ } >+ >+ nsCxPusher cxPusher; >+ if (!cxPusher.Push(mContext)) { >+ return NS_ERROR_FAILURE; > } > > nsCxPusher cxPusher; > if (!cxPusher.Push(mContext)) { > return NS_ERROR_FAILURE; > } This is a double-merge, right? >diff -r b5f3b30402cb js/src/js.cpp ... >+#if !JS_HAS_OPERATION_COUNT >+static void >+ShutdownWatchdog() >+{ >+ PR_Lock(watchdogLock); >+ watchdogRunning = PR_FALSE; >+ PR_NotifyCondVar(watchdogWakeup); >+ PR_Unlock(watchdogLock); >+ if (watchdogThread) >+ PR_JoinThread(watchdogThread); >+ watchdogThread = NULL; Although the code logic implies that ShutdownWatchdog is called only once, it is better not to assume that in the function and copy the watchdogThread global into a function-local and clear it under the lock. >+} >+ >+static void >+WakeupWatchdog() >+{ >+ PR_Lock(watchdogLock); >+ if (currentInterval == PR_INTERVAL_NO_TIMEOUT && >+ watchdogLimit) { >+ PR_NotifyCondVar(watchdogWakeup); >+ } To match the browser code, the code should only wake up watchdog if the new sleep interval is less than the old one. >+ PR_Unlock(watchdogLock); >+} >+ >+static JSBool >+JSShellCallback(JSContext *cx) Nit: call this ShellOperationCallback (witjout the JS prefix). >+{ >+ if (watchdogLimit) { >+ JS_ReportWarning(cx, "Script is running too long"); This is not a warning as the script will be forcefully terminated which suggests to use JSReportError. But that would not work as the latter function raises an exception that can be caught and ignored by a script like in: while (true) { try { while (true); } catch (e) { } } So I suggest to use just fprintf(stderr, "Error: ...") to report the error and just return JS_FALSE. >+static void >+WatchdogMain(void *arg) >+{ >+ JSRuntime *rt = (JSRuntime *) arg; >+ PRStatus status; >+ PRBool isRunning; >+ >+ do { >+ JSContext *iter = NULL; >+ JSContext *acx; >+ JSBool isContextRunning = JS_FALSE; >+ PRIntervalTime ct = PR_IntervalNow(); >+ >+ PR_Lock(watchdogLock); >+ JS_LOCK_GC(rt); >+ while ((acx = js_ContextIterator(rt, JS_FALSE, &iter))) { >+ if (acx->requestDepth) { >+ if (ct - acx->startTime > watchdogLimit) { >+ JS_TriggerOperationCallback(acx); >+ } Nit: remove the braces around single-lined then part. >@@ -2119,16 +2267,50 @@ TestUTF8(JSContext *cx, JSObject *obj, u > > static JSBool > ThrowError(JSContext *cx, JSObject *obj, uintN argc, jsval *argv, jsval *rval) > { > JS_ReportError(cx, "This is an error"); > return JS_FALSE; > } > >+#if !JS_HAS_OPERATION_COUNT >+ >+static JSBool >+WatchdogInterval(JSContext *cx, JSObject *obj, uintN argc, >+ jsval *argv, jsval *rval) >+{ >+ if (!(argc == 1 || argc == 0)) { Nit: use just argc > 1 as the condition for clarity. >@@ -3883,16 +4075,23 @@ static JSBool > static JSBool > ContextCallback(JSContext *cx, uintN contextOp) > { > if (contextOp == JSCONTEXT_NEW) { > JS_SetErrorReporter(cx, my_ErrorReporter); > JS_SetVersion(cx, JSVERSION_LATEST); > SetContextOptions(cx); > } >+ else if(contextOp == JSCONTEXT_REQUEST_START) { >+#if !JS_HAS_OPERATION_COUNT >+ if (!JS_GetOperationCallback(cx)) >+ JS_SetOperationCallback(cx, JSShellCallback); >+ //WakeupWatchdog(); >+#endif Nit: put "if !JS_HAS_OPERATION_COUNT" around the whole else-if part to emphasis that the callback checks that event only when the shell implements own operation counting through a watchdog thread. Nit: remove the commented-out WakeupWatchdog. >diff -r b5f3b30402cb js/src/jsapi.cpp >--- a/js/src/jsapi.cpp Mon Nov 10 21:13:59 2008 +1300 >+++ b/js/src/jsapi.cpp Mon Nov 10 13:13:39 2008 +0100 >@@ -898,36 +898,51 @@ JS_SetRuntimePrivate(JSRuntime *rt, void > rt->data = data; > } > > JS_PUBLIC_API(void) > JS_BeginRequest(JSContext *cx) > { > #ifdef JS_THREADSAFE > JSRuntime *rt; >+ JSContextCallback cxCallback; > > JS_ASSERT(cx->thread->id == js_CurrentThreadId()); > if (!cx->requestDepth) { > JS_ASSERT(cx->gcLocalFreeLists == &js_GCEmptyFreeListSet); > > /* Wait until the GC is finished. */ > rt = cx->runtime; >+#if !JS_HAS_OPERATION_COUNT >+ cx->operationCount = 1; The proper place for this initialization is js_NewContext from js/src/. There the code currently calls JS_ClearOperationCallback but the patch changes that function just to clear the callback. So I suggest just to remove the call to JS_ClearOperationCallback from js_NewContext inlining its implementation instead there. >+#if JS_HAS_OPERATION_COUNT > JS_PUBLIC_API(void) > JS_SetOperationCallback(JSContext *cx, JSOperationCallback callback, > uint32 operationLimit) > { > JS_ASSERT(callback); > JS_ASSERT(operationLimit <= JS_MAX_OPERATION_LIMIT); > JS_ASSERT(operationLimit > 0); > >@@ -5314,16 +5332,46 @@ JS_SetBranchCallback(JSContext *cx, JSBr > cx->operationCount = JSOW_SCRIPT_JUMP; > cx->operationLimit = JSOW_SCRIPT_JUMP; > cx->operationCallback = (JSOperationCallback) cb; > } else { > JS_ClearOperationCallback(cx); > } > return oldcb; > } >+#else >+ >+JS_PUBLIC_API(JSBool) >+JS_SetOperationCallback(JSContext *cx, JSOperationCallback callback) >+{ >+ JS_ASSERT(callback); I suggest to drop this assert and allow to call JS_SetOperationCallback with the null argument removing the need for JS_ClearOperationCallback. Which shows that my initial design of operation callback API was not future proof :( >diff -r b5f3b30402cb js/src/jsapi.h >--- a/js/src/jsapi.h Mon Nov 10 21:13:59 2008 +1300 >+++ b/js/src/jsapi.h Mon Nov 10 13:13:39 2008 +0100 >@@ -43,16 +43,24 @@ > /* > * JavaScript API. > */ > #include <stddef.h> > #include <stdio.h> > #include "js-config.h" > #include "jspubtd.h" > #include "jsutil.h" >+ >+#ifndef JS_HAS_OPERATION_COUNT >+# if (defined(MOZILLA_VERSION) || defined(JS_THREADSAFE)) >+# define JS_HAS_OPERATION_COUNT 0 >+# else >+# define JS_HAS_OPERATION_COUNT 1 >+# endif >+#endif We should drop this extra define and just include jsversion.h into jsapi.h. >+#if JS_HAS_OPERATION_COUNT ... >+#else >+extern JS_PUBLIC_API(JSBool) >+JS_TriggerOperationCallback(JSContext *cx); The function JS_TriggerOperationCallback make perfect sense with or without JS_HAS_OPERATION_COUNT. So lets move it outside of the ifs. >diff -r b5f3b30402cb js/src/jscntxt.h ... > /* Relative operations weights. */ >-#define JSOW_JUMP 1 >-#define JSOW_ALLOCATION 100 >-#define JSOW_LOOKUP_PROPERTY 5 >-#define JSOW_GET_PROPERTY 10 >-#define JSOW_SET_PROPERTY 20 >-#define JSOW_NEW_PROPERTY 200 >-#define JSOW_DELETE_PROPERTY 30 >-#define JSOW_ENTER_SHARP JS_OPERATION_WEIGHT_BASE >-#define JSOW_SCRIPT_JUMP JS_OPERATION_WEIGHT_BASE >- >+# define JSOW_JUMP 1 >+# define JSOW_ALLOCATION 100 >+# define JSOW_LOOKUP_PROPERTY 5 >+# define JSOW_GET_PROPERTY 10 >+# define JSOW_SET_PROPERTY 20 >+# define JSOW_NEW_PROPERTY 200 >+# define JSOW_DELETE_PROPERTY 30 >+# define JSOW_ENTER_SHARP JS_OPERATION_WEIGHT_BASE >+# define JSOW_SCRIPT_JUMP JS_OPERATION_WEIGHT_BASE >+#else >+# define JS_CHECK_OPERATION_LIMIT(cx, weight) \ >+ (((cx)->operationCount) > 0 || js_ResetOperationCount(cx)) >+# define JS_COUNT_OPERATION(cx, weight) ((void) 0) >+#endif I should have seen this before: the patch should always define JSOW_ even when !JS_HAS_OPERATION_COUNT. This way the current code that uses the macros and the constants does not need updating. Otherwise the patch would not compile with JS_HAS_OPERATION_COUNT == 1. >diff -r b5f3b30402cb js/src/jsinterp.cpp ... >-#define CHECK_BRANCH() \ >+#if JS_HAS_OPERATION_COUNT >+# define CHECK_BRANCH() \ > JS_BEGIN_MACRO \ > if ((cx->operationCount -= JSOW_SCRIPT_JUMP) <= 0) { \ > if (!js_ResetOperationCount(cx)) \ > goto error; \ > } \ > JS_END_MACRO >- >+#else >+# define CHECK_BRANCH() \ >+ JS_BEGIN_MACRO \ >+ if (!JS_CHECK_OPERATION_LIMIT(cx,1)) \ >+ goto error; \ >+ JS_END_MACRO >+#endif Nit: expand here JS_CHECK_OPERATION_LIMIT to avoid adding that "1" which is unused. The code checks for JS_HAS_OPERATION_COUNT in any case and it is better to be explicit here the same way as the JS_CHECK_OPERATION_LIMIT part is doing. >diff -r b5f3b30402cb js/src/xpconnect/src/xpccomponents.cpp >--- a/js/src/xpconnect/src/xpccomponents.cpp Mon Nov 10 21:13:59 2008 +1300 >+++ b/js/src/xpconnect/src/xpccomponents.cpp Mon Nov 10 13:13:39 2008 +0100 >@@ -3400,43 +3400,44 @@ ContextHolder::ContextHolder(JSContext * > if(mJSContext) > { > JS_SetOptions(mJSContext, > JSOPTION_DONT_REPORT_UNCAUGHT | > JSOPTION_PRIVATE_IS_NSISUPPORTS); > JS_SetGlobalObject(mJSContext, aSandbox); > JS_SetContextPrivate(mJSContext, this); > >- if(JS_GetOperationCallback(aOuterCx)) >- { >- JS_SetOperationCallback(mJSContext, ContextHolderOperationCallback, >- JS_GetOperationLimit(aOuterCx)); >- } >+ if(nsXPConnect::GetXPConnect()->GetWatchdogLimit(mJSContext)) >+ JS_SetOperationCallback(mJSContext, >+ ContextHolderOperationCallback); This is wrong: the should set the limit for mJSContext to match the limit for aOuterCx. Also, for maximum future-profness the code can always set the callback to ContextHolderOperationCallback so that does not need to be updated. > } > } > > JSBool > ContextHolder::ContextHolderOperationCallback(JSContext *cx) > { >+ > ContextHolder* thisObject = > static_cast<ContextHolder*>(JS_GetContextPrivate(cx)); > NS_ASSERTION(thisObject, "How did that happen?"); > > JSContext *origCx = thisObject->mOrigCx; >+ > JSOperationCallback callback = JS_GetOperationCallback(origCx); > JSBool ok = JS_TRUE; > if(callback) > { > ok = callback(origCx); > callback = JS_GetOperationCallback(origCx); > if(callback) > { > // If the callback is still set in the original context, reflect > // a possibly updated operation limit into cx. >- JS_SetOperationLimit(cx, JS_GetOperationLimit(origCx)); >+ nsXPConnect::GetXPConnect()->SetWatchdogLimit(cx, >+ nsXPConnect::GetXPConnect()->GetWatchdogLimit(cx)); > return ok; > } > } Here the code should ensure that the watchdog limits for cx matches the limit set for origCx after calling the original callback if any. >diff -r b5f3b30402cb js/src/xpconnect/src/xpcjsruntime.cpp >+PRBool >+XPCJSRuntime::ShutdownWatchdog() >+{ >+ >+ PR_Lock(watchdogLock); >+ watchdogRunning = PR_FALSE; >+ PR_NotifyCondVar(watchdogWakeup); >+ PR_Unlock(watchdogLock); >+ if (watchdogThread) >+ PR_JoinThread(watchdogThread); >+ watchdogThread = NULL; >+ return PR_TRUE; >+ >+} >+ >+ Nit: extra blank line >diff -r b5f3b30402cb js/src/xpconnect/src/xpcprivate.h > /***************************************************************************/ >@@ -2376,16 +2396,31 @@ public: > if(!offsets) > { > static NS_DEFINE_IID(kThisPtrOffsetsSID, NS_THISPTROFFSETS_SID); > mIdentity->QueryInterface(kThisPtrOffsetsSID, (void**)&offsets); > } > return offsets; > } > >+ QITableEntry* GetOffsets() >+ { >+ if(!HasProto() || !GetProto()->ClassIsDOMObject()) >+ return nsnull; >+ >+ XPCWrappedNativeProto* proto = GetProto(); >+ QITableEntry* offsets = proto->GetOffsets(); >+ if(!offsets) >+ { >+ static NS_DEFINE_IID(kThisPtrOffsetsSID, NS_THISPTROFFSETS_SID); >+ mIdentity->QueryInterface(kThisPtrOffsetsSID, (void**)&offsets); >+ } >+ return offsets; >+ } >+ Hm, is this another that double-merge problem?
Attachment #347269 - Flags: review?(igor)
Attachment #347010 - Attachment is obsolete: true
Attachment #347269 - Attachment is obsolete: true
Attachment #347311 - Flags: review?(igor)
Attachment #347271 - Attachment is obsolete: true
Comment on attachment 347313 [details] [diff] [review] Latest update after another review without removing spaces >diff -r b77d8bd7903b js/src/js.cpp >+ >+#if !JS_HAS_OPERATION_COUNT >+static void >+ShutdownWatchdog() >+{ >+ PRThread *lwt; >+ PR_Lock(watchdogLock); >+ watchdogRunning = PR_FALSE; >+ lwt = watchdogThread; >+ PR_NotifyCondVar(watchdogWakeup); >+ PR_Unlock(watchdogLock); >+ if (lwt) >+ PR_JoinThread(lwt); >+ watchdogThread = NULL; Here watchdogThread should be nulled inside the lock. Otherwise racing threads can execute PR_JoinThread twice. >+static void >+WakeupWatchdog() >+{ >+ >+} Hm, the context callback calls this function, I guess it should not be empty? >@@ -3883,16 +4074,24 @@ static JSBool > static JSBool > ContextCallback(JSContext *cx, uintN contextOp) > { > if (contextOp == JSCONTEXT_NEW) { ... > } >+#if !JS_HAS_OPERATION_COUNT >+ else if(contextOp == JSCONTEXT_REQUEST_START) { >+ >+ if (!JS_GetOperationCallback(cx)) >+ JS_SetOperationCallback(cx, ShellCallback); I suggest to set the operation callback once in JSCONTEXT_NEW to match what the browser is doing. This cannot harm as with the watchdog thread the callback can only be called when the JS_TrigerOperationCallback is called. *Repeated* nit: the callback should be called ShellOperationCallback. Alternatively, the precise name like CancelLongRunningScript can be used. >+ WakeupWatchdog(); Here is that call to empty function. >+ } >+#endif > return JS_TRUE; > } > >+ >+JS_PUBLIC_API(JSBool) >+JS_SetOperationCallback(JSContext *cx, JSOperationCallback callback) The function must be void, not boolean. >+{ >+ cx->operationCallback = callback; >+ return JS_TRUE; >+} >+ >+JS_PUBLIC_API(void) >+JS_ClearOperationCallback(JSContext *cx) >+{ >+ cx->operationCallback = NULL; >+} >+ >+JS_PUBLIC_API(JSOperationCallback) >+JS_GetOperationCallback(JSContext *cx) Move out the common JS_GetOperationCallback outside the #ifdef. >+ >+JS_PUBLIC_API(JSBool) >+JS_TriggerOperationCallback(JSContext *cx) This should also be void. >+#else >+/* >+ * Set the watchdog callback that the engine calls while resetting >+ * context. >+ */ Nit: replace "watchdog" by "operation" in the comments. >diff -r b77d8bd7903b js/src/jscntxt.cpp > JSBool > js_ResetOperationCount(JSContext *cx) > { > JSScript *script; >+ JSOperationCallback operationCallback; the variable is unsed when JS_HAS_OPERATION_COUNT. To avoid warnings in that case move the declartion under #else. >diff -r b77d8bd7903b js/src/xpconnect/src/xpccomponents.cpp > JSBool > ContextHolder::ContextHolderOperationCallback(JSContext *cx) > { >+ Nit: remove this blank line. > ContextHolder* thisObject = > static_cast<ContextHolder*>(JS_GetContextPrivate(cx)); > NS_ASSERTION(thisObject, "How did that happen?"); > > JSContext *origCx = thisObject->mOrigCx; >+ > JSOperationCallback callback = JS_GetOperationCallback(origCx); > JSBool ok = JS_TRUE; > if(callback) > { > ok = callback(origCx); > callback = JS_GetOperationCallback(origCx); > if(callback) > { > // If the callback is still set in the original context, reflect > // a possibly updated operation limit into cx. >- JS_SetOperationLimit(cx, JS_GetOperationLimit(origCx)); >+ nsXPConnect::GetXPConnect()->SetWatchdogLimit(cx, >+ nsXPConnect::GetXPConnect()->GetWatchdogLimit(origCx)); Since SetWatchdogLimit may faul, the code should set ok to JS_FALSE if so. >diff -r b77d8bd7903b js/src/xpconnect/src/xpcprivate.h >@@ -781,16 +788,26 @@ private: >+ >+ /* >+ * Variables to support watchdog thread. >+ */ >+ PRLock *watchdogLock; >+ PRCondVar *watchdogWakeup; >+ PRBool watchdogRunning; >+ PRThread *watchdogThread; >+ PRIntervalTime currentInterval; >+ Nit: rename all class fields so they start with "m" prefix.
Comment on attachment 347313 [details] [diff] [review] Latest update after another review without removing spaces >--- a/js/src/xpconnect/src/xpcjsruntime.cpp Mon Nov 10 10:04:36 2008 -0600 >+++ b/js/src/xpconnect/src/xpcjsruntime.cpp Mon Nov 10 18:40:25 2008 +0100 >+void >+XPCJSRuntime::WatchdogMain(void *arg) >+{ >+ >+ XPCJSRuntime *xpcrt = (XPCJSRuntime *) arg; >+ JSRuntime *rt = xpcrt->GetJSRuntime(); >+ PRStatus status; >+ PRBool isRunning; >+ >+ do { >+ JSContext *iter; >+ JSContext *acx; >+ PRIntervalTime newInterval = (PRIntervalTime) 0; >+ XPCContext *ccx; >+ >+ PRIntervalTime ct = PR_IntervalNow(); >+ PR_Lock(xpcrt->watchdogLock); >+ JS_LOCK_GC(rt); >+ >+ while ((acx = js_ContextIterator(rt, JS_FALSE, &iter))) { >+ if (acx->requestDepth) { FWIW: the first call to js_ContextIterator here returns an invalid pointer which then crashes on firefox startup unless you initialize iter to NULL. At least on my machine; maybe you got lucky and your toolchain initialized it to NULL?
js_ContextIterator's last parameter is in-out -- callers must initialize to NULL if not a valid context pointer. /be
(My comment was backing up Graydon -- iter = NULL; is mandatory...) /be
We definitively don't want to exit trees with a high frequency only poll for certain conditions (i.e. low memory heuristics). Why can't we hook into the points that affect those conditions (i.e. allocation path? or provide a feedback when trees exit with OOM_EXIT?).
With Andrei's patch, plus the NULL initialization I pointed out above, and this patch applied, I do not see a perturbation of sunspider times from baseline, and I remain able to interrupt long-running scripts. Anyone else want to give it a spin?
Attachment #347015 - Attachment is obsolete: true
Attachment #347419 - Flags: review?(igor)
Attachment #347015 - Flags: review?(igor)
(In reply to comment #25) > We definitively don't want to exit trees with a high frequency only poll for > certain conditions (i.e. low memory heuristics). The high frequency for the operation callback calls is necessary only for DOM workers. The main reason for that is to temporary suspend DOM worker's JS request so the GC can proceed on the main thread without waiting too much. For main thread this is not an issue and the callback is called not that often. This will be addressed later in a separated bug - the main idea is to notify the embedding when a thread is about to wait for the GC. Then the browser can check that the thread is the main thread and call JS_TrigerOperationCallback for the workers.
Attached patch Watchdog patch (obsolete) — Splinter Review
Attachment #347311 - Attachment is obsolete: true
Attachment #347639 - Flags: review?(igor)
Attachment #347311 - Flags: review?(igor)
Attachment #347313 - Attachment is obsolete: true
Comment on attachment 347640 [details] [diff] [review] Watchdog patch without removing spaces >diff -r b77d8bd7903b js/src/js.cpp > FILE *gOutFile = NULL; > > static JSBool reportWarnings = JS_TRUE; > static JSBool compileOnly = JS_FALSE; >+ >+#if !JS_HAS_OPERATION_COUNT >+/* >+ * Variables to support thread watchdog. >+ */ >+static PRLock *watchdogLock; >+static PRCondVar *watchdogWakeup; >+static PRBool watchdogRunning; >+static PRThread *watchdogThread; >+static PRIntervalTime currentInterval; Nit which I should spot long time ago: prefix all globals with the letter g even if reportWarnings does not follow that. >+ >+/* Watchdog limit. */ >+static PRIntervalTime watchdogLimit; Nit: drop the comments as it serves no purpose. >+static void >+ShutdownWatchdog() >+{ >+ PRThread *lwt; >+ PR_Lock(watchdogLock); Nit: blank line after declarations. Plus it takes some time to decode what lwt stands for. Use either just t or thread. >+static void >+WakeupWatchdog() >+{ >+ PR_Lock(watchdogLock); >+ if (watchdogThread && watchdogLimit && >+ currentInterval == PR_INTERVAL_NO_TIMEOUT) { This would not wake up the watchdog when the new wakeup interval is less then the current one. > static JSBool > ContextCallback(JSContext *cx, uintN contextOp) > { > if (contextOp == JSCONTEXT_NEW) { > JS_SetErrorReporter(cx, my_ErrorReporter); > JS_SetVersion(cx, JSVERSION_LATEST); > SetContextOptions(cx); >- } >+ JS_SetOperationCallback(cx, ShellOperationCallback); >+ >+ } Nit: remove the extra blank line. > JSBool > js_ResetOperationCount(JSContext *cx) > { >+ JSOperationCallback operationCallback; *Repeated* nit: operationCallback is only used when !JS_HAS_OPERATION_COUNT. Move it to the else part. >diff -r b77d8bd7903b js/src/xpconnect/src/nsXPConnect.cpp > JSBool > ContextHolder::ContextHolderOperationCallback(JSContext *cx) > { > ContextHolder* thisObject = > static_cast<ContextHolder*>(JS_GetContextPrivate(cx)); > NS_ASSERTION(thisObject, "How did that happen?"); > > JSContext *origCx = thisObject->mOrigCx; >+ > JSOperationCallback callback = JS_GetOperationCallback(origCx); Nit: remove the inserted blank line. >+PRBool >+XPCJSRuntime::ShutdownWatchdog() Nit: make the function void - it cannot fail. >+{ >+ >+ PR_Lock(mWatchdogLock); >+ mWatchdogRunning = PR_FALSE; >+ mWatchdogThread = NULL; >+ PR_NotifyCondVar(mWatchdogWakeup); >+ PR_Unlock(mWatchdogLock); >+ if (mWatchdogThread) >+ PR_JoinThread(mWatchdogThread); >+ return PR_TRUE; This will never call JoinThread as mWatchdogThread will be null. I guess the same code from shell's version of ShutdownWatchdog() should be used. >diff -r b77d8bd7903b js/src/xpconnect/src/xpcprivate.h ... > // XPCContext is mostly a dumb class to hold JSContext specific data and > // maps that let us find wrappers created for the given JSContext. > > // no virtuals >@@ -896,16 +913,19 @@ private: > JSContext* aJSContext); > private: > XPCJSRuntime* mRuntime; > JSContext* mJSContext; > nsresult mLastResult; > nsresult mPendingResult; > nsIXPCSecurityManager* mSecurityManager; > nsIException* mException; >+ >+ /* Watchdog Limit */ >+ PRIntervalTime mWatchdogLimit; Nit: drop the useless comments and the extra blank line. After addressing all this issues ask Blake, mrbkap@gmail.com for a review.
Comment on attachment 347419 [details] [diff] [review] update jit support to tip, change to use GC lock To Graydon: >diff --git a/js/src/js.cpp b/js/src/js.cpp >--- a/js/src/js.cpp >+++ b/js/src/js.cpp >@@ -4256,14 +4256,18 @@ > > #if !JS_HAS_OPERATION_COUNT > ShutdownWatchdog(); >+#endif >+ >+ JS_DestroyContext(cx); >+ JS_DestroyRuntime(rt); >+ JS_ShutDown(); >+ >+#if !JS_HAS_OPERATION_COUNT > if (watchdogWakeup) > JS_DESTROY_CONDVAR(watchdogWakeup); > if (watchdogLock) > JS_DESTROY_LOCK(watchdogLock); > #endif > >- JS_DestroyContext(cx); >- JS_DestroyRuntime(rt); >- JS_ShutDown(); > return result; Why is this change necessary? AFAICS the trace locking does not depend on this. >diff --git a/js/src/jsapi.cpp b/js/src/jsapi.cpp ... > JS_TriggerOperationCallback(JSContext *cx) > { > cx->operationCount = 0; >+#ifdef JS_TRACER >+ /* We are called with the GC lock already held. */ >+ if (JS_TRACE_MONITOR(cx).traceDepth) { >+ JS_TRACE_MONITOR(cx).fragmento->disconnectLoops(); >+ } Nit: remove the braces around the "then" part. >diff --git a/js/src/jsbuiltins.cpp b/js/src/jsbuiltins.cpp >+#if !JS_HAS_OPERATION_COUNT >+ JS_LOCK_GC(state->cx->runtime); >+ JS_TRACE_MONITOR(state->cx).traceDepth++; >+ JS_UNLOCK_GC(state->cx->runtime); >+#endif It is hard to relate this to the real code. Could you attach -p version of the diff preferably to bug 450000 as that is the right place for the bug. Also, I am not sure about the locking when JS_TriggerOperationCallback is called when the new trace is not fully constructed. AFAICS such trace would not be patched so it may well be entered before the operationCount is ever checked.
Attachment #347419 - Flags: review?(igor)
Comment on attachment 347640 [details] [diff] [review] Watchdog patch without removing spaces To Andrei: >diff -r b77d8bd7903b js/src/jscntxt.h >@@ -840,18 +840,20 @@ struct JSContext { One more issue: when !JS_HAS_OPERATION_COUNT, operationCount defined at the top of JSContext should be declared as volatile int32 operationCount; as it can and will be set from other threads. Otherwise a compiler may optimize away operationCount checks.
Attached patch Updated watchdog patch (obsolete) — Splinter Review
Attachment #347639 - Attachment is obsolete: true
Attachment #347745 - Flags: review?(mrbkap)
Attachment #347639 - Flags: review?(igor)
Attachment #347640 - Attachment is obsolete: true
(In reply to comment #31) > Why is this change necessary? AFAICS the trace locking does not depend on this. Not in particular reference to the traces, no. What depends on it is successfully shutting down the spidermonkey shell with "quit()". I just included it as a bugfix. Otherwise quit() segfaults. > It is hard to relate this to the real code. Could you attach -p version of the > diff preferably to bug 450000 as that is the right place for the bug. Will do, along with fixed nits. > Also, I am not sure about the locking when JS_TriggerOperationCallback is > called when the new trace is not fully constructed. AFAICS such trace would not > be patched so it may well be entered before the operationCount is ever checked. I don't think this matters. The next time the watchdog wakes up it *will* patch the running code. Entering the code and looping one extra time un-patched (like exiting one-too-many times if over-patched) is no big deal. Or am I misunderstanding your comment? Let's pick it up over in Bug 450000.
(In reply to comment #35) > The next time the watchdog wakes up it *will* patch > the running code. Entering the code and looping one extra time un-patched (like > exiting one-too-many times if over-patched) is no big deal. This is a very good point. But by the same observation we need to patch only the currently run trace, not all traces in the cache.
(In reply to comment #35) > > Why is this change necessary? AFAICS the trace locking does not depend on this. > > Not in particular reference to the traces, no. What depends on it is > successfully shutting down the spidermonkey shell with "quit()". I just > included it as a bugfix. Otherwise quit() segfaults. To Andrei: please fix this in the watchdog patch.
(In reply to comment #36) > (In reply to comment #35) > > The next time the watchdog wakes up it *will* patch > > the running code. Entering the code and looping one extra time un-patched (like > > exiting one-too-many times if over-patched) is no big deal. > > This is a very good point. But by the same observation we need to patch only > the currently run trace, not all traces in the cache. Actually this would not work in general if JS_TriggerOperationCallback is event, not time, driven. For example, the reason why the DOM worker threads need to call the callback with high frequency is that they need to suspend the JS request so the GC can proceed on the main thread. That can be fixed if the engine can notify the browser that some thread is about to wait for other threads to leave their JS requests. With such notification for the main thread the browser can simply call JS_TriggerOperationCallback for each active DOM worker avoiding frequent polling in the worker itself. But this requires that no JS_TriggerOperationCallback call is lost meaning that *all* traces must be patched.
Blocks: 464672
Attached patch Watchdog patch with small fixes. (obsolete) — Splinter Review
Problem with JS shell is fixed. Also compilation problems when JS_THREADSAFE is undefined are also fixed.
Attachment #347745 - Attachment is obsolete: true
Attachment #348285 - Flags: review?(mrbkap)
Attachment #347745 - Flags: review?(mrbkap)
Attachment #347746 - Attachment is obsolete: true
Assignee: igor → andrei
Blocks: 465028
Blocks: 465031
Flags: blocking1.9.1+
Priority: -- → P1
Target Milestone: --- → mozilla1.9.1b2
Whiteboard: [needs review mrbkap]
Comment on attachment 348285 [details] [diff] [review] Watchdog patch with small fixes. >diff -r 432bd8501fa5 js/src/js.cpp >+static void >+WatchdogMain(void *arg) >+ gCurrentInterval = isContextRunning && gWatchdogLimit >+ ? gWatchdogLimit >+ : PR_INTERVAL_NO_TIMEOUT; Nits: parens around the condition and move the ?: lines over one space. >+ status = PR_WaitCondVar(gWatchdogWakeup, gCurrentInterval); >+ isRunning = gWatchdogRunning; >+ PR_Unlock(gWatchdogLock); >+ Nit: Extra newline here. >+ } while (isRunning && status == PR_SUCCESS); >+ Nit: ibid. >+static void >+SetWatchdogLimit(JSContext *cx, PRIntervalTime newWatchdogLimit) Since this function can fail, this should return a JSBool. >+{ ... >+ if (!gWatchdogThread) >+ JS_ReportError(cx, "Failed to create watchdog thread"); And you need to return false here, or this exception will get lost. >+static JSBool >+WatchdogInterval(JSContext *cx, JSObject *obj, uintN argc, >+ jsval *argv, jsval *rval) Nit: move the second line over to line up with the arguments in the first line. >+ /* Next conditions takes negative values and NaNs into account. */ This should be "The next condition ..." right? Once that conditional is run, we don't need to worry about NaN anymore. >+ SetWatchdogLimit(cx, (PRIntervalTime) (interval * PR_TicksPerSecond())); >+ return JS_TRUE; >+ With the above change to make SetWatchdogLimit return a value, this should be just 'return SetWatchdogLimit...'. Also nit: extra blank line here. > static JSBool > ContextCallback(JSContext *cx, uintN contextOp) > { > if (contextOp == JSCONTEXT_NEW) { > JS_SetErrorReporter(cx, my_ErrorReporter); > JS_SetVersion(cx, JSVERSION_LATEST); > SetContextOptions(cx); >- } >+#if !JS_HAS_OPERATION_COUNT >+ JS_SetOperationCallback(cx, ShellOperationCallback); >+#endif >+#if !JS_HAS_OPERATION_COUNT >+ } >+ >+ else if(contextOp == JSCONTEXT_REQUEST_START) { >+ if (cx->runtime->state != JSRTS_LANDING) >+ WakeupWatchdog(); >+ } >+#endif > return JS_TRUE; #if a #endif #if a #endif should be fused into a single #if. It looks like this won't compile if JS_HAS_OPERATION_COUNT is defined since there's no end brace. I suppose you want something like #if !JS_HAS_OPERATION_COUNT JS_SetOperationCallback(cx, ShellOperationCallback); } else if (contextOp == JSCONTEXT_REQUEST_START && cx->runtime->state != JSRTS_LANDING) { WakeupWatchdog(); #endif } Note also the fused if () if ()s. >diff -r 432bd8501fa5 js/src/jspubtd.h >+ * JSCONTEXT_REQUEST_START The JS_BeginRequest method is called and >+ * requestDepth == 0. The callback can be used to >+ * notify other components that the JS script begins. This comment needs some rewording... Something like JS_BeginRequest was called with requestDepth == 0. This callback can be used to notify other components that execution has begun on this context. >diff -r 432bd8501fa5 js/src/xpconnect/idl/nsIXPConnect.idl >--- a/js/src/xpconnect/idl/nsIXPConnect.idl Fri Nov 14 14:13:42 2008 -0800 >+++ b/js/src/xpconnect/idl/nsIXPConnect.idl Sat Nov 15 01:52:45 2008 +0100 >@@ -762,9 +763,26 @@ interface nsIXPConnect : nsISupports > * interfaceCount are like what nsIClassInfo.getInterfaces returns. > */ > [noscript,notxpcom] PRBool defineDOMQuickStubs( > in JSContextPtr cx, > in JSObjectPtr proto, > in PRUint32 flags, > in PRUint32 interfaceCount, > [array, size_is(interfaceCount)] in nsIIDPtr interfaceArray); >+ >+ /** >+ * Returns the value of the watchdog limit for the specified context. >+ * @param cx >+ * A context >+ */ >+ [noscript,notxpcom] PRIntervalTime getWatchdogLimit(in JSContextPtr cx); >+ >+ /** >+ * Sets a new value of the watchdog limit. >+ * @param cx >+ * A context >+ * @param limit >+ * New value of the watchdog limit. >+ */ >+ [noscript,notxpcom] PRBool setWatchdogLimit(in JSContextPtr cx, in PRIntervalTime limit); >+ Non-nit: You need to rev the uuid of this interface. Nit: Get rid of the extra blank line here (and its trailing whitespace). >diff -r 432bd8501fa5 js/src/xpconnect/src/nsXPConnect.cpp I appreciate killing trailing whitespace, but the amount of blame you're taking in this file seems disproportionate to the size of the change you're making. I'd say leave out the whitespace nuking for now and leave it for another patch. >diff -r 432bd8501fa5 js/src/xpconnect/src/xpccomponents.cpp Same comments about trailing whitespace apply here too. >--- a/js/src/xpconnect/src/xpccomponents.cpp Fri Nov 14 14:13:42 2008 -0800 >@@ -3400,21 +3400,26 @@ ContextHolder::ContextHolder(JSContext * > if(mJSContext) > { > JS_SetOptions(mJSContext, > JSOPTION_DONT_REPORT_UNCAUGHT | > JSOPTION_PRIVATE_IS_NSISUPPORTS); > JS_SetGlobalObject(mJSContext, aSandbox); > JS_SetContextPrivate(mJSContext, this); > >- if(JS_GetOperationCallback(aOuterCx)) >- { >- JS_SetOperationCallback(mJSContext, ContextHolderOperationCallback, >- JS_GetOperationLimit(aOuterCx)); >- } >+ PRIntervalTime watchdogLimit = >+ nsXPConnect::GetXPConnect()->GetWatchdogLimit(aOuterCx); >+ >+ if(watchdogLimit) { Nit: In XPConnect, house style is: if(xxx) { } >diff -r 432bd8501fa5 js/src/xpconnect/src/xpcjsruntime.cpp >+/* static */ >+void >+XPCJSRuntime::WatchdogMain(void *arg) Nits here: Nuke the same extra lines pointed out in js.cpp. Also, please match house style for bracing and whitespace (as above). >diff -r 432bd8501fa5 js/src/xpconnect/src/xpcprivate.h >+ PRIntervalTime mCurrentInterval; >+ Nuke this extra line. r=mrbkap with these comments addressed.
Attachment #348285 - Flags: review?(mrbkap) → review+
Whiteboard: [needs review mrbkap] → [needs updated patch based on review]
Any ETA on the patch? This is one of the last blockers for beta.
Attached patch Updated to my comments (obsolete) — Splinter Review
This is the previous patch with my comments. I'm compiling it now. I'm not going to get a chance to check it in for a couple of hours (and I want to ensure that it compiles, first).
Attachment #348285 - Attachment is obsolete: true
Attachment #348286 - Attachment is obsolete: true
Attachment #349109 - Flags: review+
Thanks, Blake. Please try to see that it gets in before the nightlies spin up?
Attached patch Updated even more (obsolete) — Splinter Review
Before checking in, I tried running with this patch and hit a couple of assertions. I'm hoping that bugzilla's interdiff will show the two (small) changes I made to allow the browser to start up and shut down.
Attachment #349147 - Flags: review?(igor)
I'm going to have to defer to igor to get this checked in since it's way too late for me to be able to watch the tree.
Attachment #349147 - Flags: review?(igor) → review+
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Backed out - http://hg.mozilla.org/mozilla-central/rev/19134f7ea893 This seemed to be causing occasional hangs on the test boxes. The following things were seen to be recurring after the 1d817f9d842f changeset: - w32 leak-test had pegged CPU and running high on memory - OSX talos boxes are showing "quit unexpectedly" and "slow script" warnings on the tgfx tests I'll watch the tree to make sure I got the blame right.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
More: after checking in, OSX talos boxes started failing at these points: - NOISE: Cycle 3: loaded http://localhost/page_load_test/sunspider/access-binary-trees.html (next: http://localhost/page_load_test/sunspider/access-fannkuch.html) - NOISE: Cycle 1: loaded http://localhost/page_load_test/gfx/text2.html (next: http://localhost/page_load_test/gfx/tile-jpg.html) - NOISE: Cycle 1: loaded http://localhost/page_load_test/gfx/text2.html (next: http://localhost/page_load_test/gfx/tile-jpg.html) Hopefully you'll be able to tell from access-fannkuch and tile-jpg.html what might be triggering things?
(In reply to comment #49) > More: after checking in, OSX talos boxes started failing at these points: On 64-bit Linux this runs without any problems - cannot reproduce it locally.
Whiteboard: [needs updated patch based on review] → [backed out due to leak-test bustage]
Comment on attachment 349147 [details] [diff] [review] Updated even more Here is a problematic code that can be responsible for the deadlock on the shutdown: >diff --git a/js/src/xpconnect/src/xpcjsruntime.cpp b/js/src/xpconnect/src/xpcjsruntime.cpp ... >+void >+XPCJSRuntime::WatchdogMain(void *arg) >+{ ... >+ do >+ { ... >+ PR_Lock(xpcrt->mWatchdogLock); ... >+ xpcrt->mCurrentInterval = newInterval ? newInterval >+ : PR_INTERVAL_NO_TIMEOUT; >+ status = PR_WaitCondVar(xpcrt->mWatchdogWakeup, xpcrt->mCurrentInterval); >+ isRunning = xpcrt->mWatchdogRunning; >+ PR_Unlock(xpcrt->mWatchdogLock); >+ } while (isRunning && status == PR_SUCCESS); >+} >+PRBool >+XPCJSRuntime::ShutdownWatchdog() >+{ >+ PR_Lock(mWatchdogLock); >+ mWatchdogRunning = PR_FALSE; >+ PRThread *t = mWatchdogThread; >+ mWatchdogThread = NULL; >+ PR_NotifyCondVar(mWatchdogWakeup); >+ PR_Unlock(mWatchdogLock); >+ if(t) >+ PR_JoinThread(t); >+ return PR_TRUE; >+} >+ The bug is that in the watchog main loop calls PR_WaitCondVar uncoditionally before checking for xpcrt->mWatchdogRunning. Thus, if ShutdownWatchdog is called when the watchog is outside its lock, PR_JoinThread will hung forever as the watchdog will re-eneter the lock and call PR_WaitCondVar(, PR_INTERVAL_NO_TIMEOUT).
Andrei and I has to leave for now. So our fix for the above problem has to wait until around 20:00 UTC (12:00 PST).
Attached patch Previous patch with the fix (obsolete) — Splinter Review
The problem Igor pointed to seems to be fixed now. Somebody should check in the patch.
Attachment #349233 - Flags: review?(igor)
Whiteboard: [backed out due to leak-test bustage] → [needs landing]
Attachment #349233 - Flags: review?(igor) → review+
Comment on attachment 349233 [details] [diff] [review] Previous patch with the fix aaaaaaa=beltzner
Attachment #349233 - Flags: approval1.9.1b2+
Attachment #349233 - Flags: approval1.9.1+
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
this blew up the w32 mozilla-central leak-test box again: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1227216533.1227222814.4867.gz
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [needs landing] → [blew up leak test again]
The tinderboxes failed due to slow script warning that was shown on the startup. Since the affected machines are VMs that can run on rather busy boxes, it may well be possible that the slow dialog are legitimate one due to too long execution of the tests. Without the watchdog there were not showing due to bad accounting or something. Sp does anybody know the details of the leak test run on tinderboxes? I.e. in particular, how long it takes for the test to run on average and does it closes the browser window from the script automatically? Also the exact command that is used to run the leak test on tinderboxes would be nice.
(In reply to comment #57) > Since the affected machines are VMs that can run on rather busy boxes, it may > well be possible that the slow dialog are legitimate one due to too long > execution of the tests. I tested a browser build for a leak test with an extra printout added to nsJSContext::DOMOperationCallback from dom/src/base/nsJSEnvironment.cpp . On a fast 64-bit box *Xenon 2.6 GHz) it showed that some script execution time reached 1s before any window was shown. I suspect on VM that may well reach 10s, the time to show the slow script dialog. To workaround that nsJSContext::DOMOperationCallback has to be updated with some code to detect a leak trace run and, if so, increase the max execution time for a script 10 fold or something. I am not sure how to do that.
Igor, who do you think can help you here?
Attachment #347419 - Attachment is obsolete: true
Attachment #349109 - Attachment is obsolete: true
Attachment #349147 - Attachment is obsolete: true
Attachment #349233 - Attachment is obsolete: true
Comment on attachment 349329 [details] [diff] [review] extending the patch to increase 10-fold the max run time for debug builds I need a review for build/* changes in the patch - the firsts 3 files. Otherwise the patch is the same as the one already landed and backed ot.
Attachment #349329 - Flags: review?(ted.mielczarek)
Hrm, can someone explain why we need to raise the limits when we didn't before? "or something" seems pretty vague, was this actually investigated on the failing machines? Also, Igor, if you're around to land this, I can probably review the early parts of your patch.
(In reply to comment #62) > Hrm, can someone explain why we need to raise the limits when we didn't before? > "or something" seems pretty vague, was this actually investigated on the > failing machines? Here is the theory: without the patch the operation callback (that detects the slow scripts and shows the dialog based on time that the script runs) is called periodically when the so-called operation counter reaches certain limit. The updates of the counter and the call to the callback happen mostly in the interpreter when the bytecode does a backward jump or when a function returns. There are also some native functions that potentially run for a long time and that update the counter. This mechanism does not account for the running time of the wast majority of the native code. Thus it is well possible that a script, that spends long time in native code, will not reach the operation counter limit to call the callback. Naturally, for such script no detection that it runs too long would ever be executed. Now, with the patch calls to the operation callback are scheduled from a separated thread based on precise timing. The thread setts a special flag that the interpreter checks precisely at the same places where it previously updated the operation counter. Thus for the above script that spend way too much time in the native code the operation callback will run detecting that a script is slow. A strong confirmation for this theory comes from the fact that for the leak test without the patch the operation callback is called only *once*. Since on the first invocation the callback will just record the current time to compare it with the time on the moment of the second invocation, the slow script warning cannot be shown. Moreover, since the operation count update just depends on the nature of the code executed and does not depend on the CPU speed/system architecture, this only single call to the callback will happen on arbitrary slow machine including busy tinderbox VMs. So the leak test cannot show the dialog. With the patch the situation is changed as now the runtime may well call the callback multiple times on a slow machine so the dialog can be shown. And it is shown on a relatively fast 64-bit Linux during the test when I decrease the dom.max_script_run_time and dom.max_chrome_script_run_time to one second from the default 10 and 20 seconds.
Note: I am not sure that I will be around today, 2008-11-21, to land the updated patch if it gets a review or an approval.
Attachment #349329 - Flags: review?(ted.mielczarek) → review+
Comment on attachment 349329 [details] [diff] [review] extending the patch to increase 10-fold the max run time for debug builds Build parts look fine, not qualified to comment on Vlad's question.
Attachment #349329 - Flags: approval1.9.1b2?
fwiw IE has a somewhat similar system of counting by execution time instead of complexity. That's also why IE throws up the stop script dialog, IMHO, way too often. Has anyone tested this on top 10 web apps. Something like I guess gmail, the new yahoo mail, etc. to test for regressions (this would probably need to be done on slower systems). Maybe there should be an increased threshold once the dialog has been dismissed for a single script for the first time? That's probably another bug though. Just a few thoughts...
There are two questions that I think haven't been clearly addressed: 1. Will taking this patch result in more frequent "slow script" warnings? should we be changing the way we trigger those warnings? 2. Do we think we need this change for a beta release (and if the answer to the first question is "yes", then I think we do) or can we shake out the impact on nightlies?
The answer to #1 is "no", because the slow-script dialog still checks against wall-clock time.
Err. On rereading your question, I realize you weren't concerned about "annoyingly frequent" slow script dialogs... which is what I was answering. More frequent in a good way? Yes, this should help.
I thought that the fact that we had to back it out indicates that the answer to 1 is "yes", because we're seeing a slow-script where we weren't before?
If we have a fx3.1 b3 then one question is: could this bug could wait to be fixed in the first nightly after b2? Lack of b2 slow-script dialog for runaway JITted code would be a task-manager or force-quit scenario -- no fun at all -- but we hope rare for non-developers. At this point I'm more concerned about untriaged and undiagnosed TM crasher bugs in b2. Would welcome others' thoughts. /be
I have been trying to keep up with top crashes and it seems to me that we have fixed most top crashes we caused. That having said yes we do have a whole bunch of fuzzer bugs outstanding.
Comment on attachment 349329 [details] [diff] [review] extending the patch to increase 10-fold the max run time for debug builds a191b2=beltzner Let's get this in and get nightlies spun with it. We're now at a point where we'll have at least two days of nightly baking before we start QA'ing beta builds, so if we see it on top websites, I'd expect to see bugs.
Attachment #349329 - Flags: approval1.9.1b2? → approval1.9.1b2+
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
I backed out the patch again - the build system changes broke the mochitests, http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1227306163.1227308346.12667.gz : File "runtests.py", line 17, in <module> import automation File "/builds/slave/trunk_linux-8/build/objdir/testing/mochitest/automation.py", line 44 IS_DEBUG_BUILD =
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
18:13 <@bhearsum|afk> beltzner: win32 just started --trace-malloc test - looks like the same thing is happening (That's referring to what we were seeing in comment 48)
As it turns out, I didn't wait long enough after the trace malloc test started. The browser window is now up and running the test. Actually, it just finished the test successfully!
Is it possible to get an access to a mchine that is simialr to WINNT 5.2 mozilla-central ?
(In reply to comment #78) > Is it possible to get an access to a mchine that is simialr to WINNT 5.2 > mozilla-central ? Is that neccessary now that we've cycled green there? It seems like if we can fix the issues from comment 75 then we'll be good to go.
The previous patch misses changes to testing/mochitest/Makefile.in that prevented mochitests to run. With the new version I can run both debug and non-debug builds.
Attachment #349329 - Attachment is obsolete: true
Comment on attachment 349511 [details] [diff] [review] previous patch + /testing/mochitest/Makefile.in changes r+ from Waldo for extra chnages.
Attachment #349511 - Flags: review+
Attachment #349511 - Flags: approval1.9.1b2?
Its a blocker so new approval needed for b2 afaik.
Comment on attachment 349511 [details] [diff] [review] previous patch + /testing/mochitest/Makefile.in changes a191b2=beltzner, though not needed for this bug; igor, when you have a new reviewed patch ready, just find a sheriff and go. This is the last thing we're waiting on for b2 :)
Attachment #349511 - Flags: approval1.9.1b2? → approval1.9.1b2+
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
Depends on: 466265
Apparently the patch caused random failes on tinderboxes with tspider tests. Although the problem on some tinderboxes has started when a previous version of the patch was backed out, I still going to back it out to see if this fixes the problem.
I wait from the backout as some tinderboxes went green and it is only talos that are problematic and they has shown the same problem without the patch applied.
backed out due to persistent oranges
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The oranges on the tinderboxes continue with the patch backed out, so it does not responsible for them.
This the previous patch with the bogus assert from bug 466265 removed - that assert is in code inside ifdef that is not used for the browser builds.
Attachment #349511 - Attachment is obsolete: true
Attachment #349587 - Attachment is patch: true
Attachment #349587 - Attachment mime type: application/octet-stream → text/plain
Depends on: 466328
Attached file valgrind warning
In a build from changeset 04cecb0ec24c (which I think is the most recent time this patch landed) I see a lot of these valgrind warnings. Note that the line numbers may be a little off because my tree has a lot of patches in it.
Attached patch prev patch + valgrind fix (obsolete) — Splinter Review
The valgrind varnings were due to missing initialization of XPCJSRuntime.mCurrentInterval.
Attachment #349587 - Attachment is obsolete: true
Attachment #349659 - Flags: review+
Attachment #349659 - Flags: approval1.9.1b2?
Attachment #349659 - Flags: approval1.9.1b2? → approval1.9.1b2+
Comment on attachment 349659 [details] [diff] [review] prev patch + valgrind fix a191b2=beltzner Tree's green again, so let's give this a shot.
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
The latest patch caused failures with Mac talos boxes and with WINNT 5.1 talos mozilla-central nochrome qm-pxp-trunk07. So it has to backed out again - but then the Mac builds has to be clobbered (see bug 466399).
According to http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1227497227.1227509931.1835.gz&fulltext=1 , on windows the failure on WINNT 5.1 talos mozilla-central nochrome qm-pxp-trunk07 happens in http://localhost/page_load_test/sunspider/access-binary-trees.html , this is the same as with the comment 49 .
Note that access-binary-trees we don't JIT at all (we don't even record anythink I think, we certainly don't compile or run anything).
We should probably reopen this bug, or file a new one (maybe better, 10 bugs depend on this one and re-opening it sets of an avalanche).
(In reply to comment #98) > We should probably reopen this bug, or file a new one (maybe better, 10 bugs > depend on this one and re-opening it sets of an avalanche). Unless at least Mac startup problem can be resolved quickly, reopening the bug is the only option. Otherwise all talos mac boxes would stay orange.
I am backing this out again to prevent broken mac builds leaking into nighties.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
After several tries it's becoming clear that we're going to need the flexibility of nightlies in order to work this out. Taking this out of the beta 2 blocking path and adding a relnote flag since AIUI, without this patch we have no way to abort slow scripts with TraceMonkey.
Keywords: relnote
Target Milestone: mozilla1.9.1b2 → mozilla1.9.1
To proceed farther with the bug it is necessary to have access to a box like MAC OS tinderbox that showed the problem. How do I do that?
I think I reproduced this locally, access-fannkuch froze inside standalone talos. No CPU usage, might be a deadlock. Will re-run and try to attach gdb.
The browser is ilooping on jss/real-fannkuch-2.html. We're hitting the TIMEOUT_EXIT very frequently but we never leave js_Interpret, so it seems like the heuristics for timing out are not kicking in. Leaving GDB session open if there's anything someone would like me to poke at.
(In reply to comment #104) > The browser is ilooping on jss/real-fannkuch-2.html. We're hitting the > TIMEOUT_EXIT very frequently but we never leave js_Interpret, so it seems like > the heuristics for timing out are not kicking in. Does it mean that the interpreter simply return to the tracer immediately after TIMEOUT_EXIT?
(In reply to comment #104) > The browser is ilooping on jss/real-fannkuch-2.html. We're hitting the > TIMEOUT_EXIT very frequently but we never leave js_Interpret, Note that operation callback should be called 10 times per second to check for the GC condition. Does this matches "very frequently" ? Also, does the interpreter ever calls js_ResetOperationCount in js/src/jscntxt.cpp
Yes to both questions. There's two problems with the TIMEOUT_EXIT guard. The first is that it uses a stale snapshot. deduceTypeStability() changes the recorder's view of the stack, so when the TIMEOUT_EXIT guard fired it was potentially boxing values incorrectly. The second is that the guard isn't sufficient for catching all iloop cases because we can stitch "one-shot" traces together to form a loopy one, without ever using the LIR_loop path. This is easily reproducible in the shell. Moving the guard into the TraceRecorder ctor seems to fix it, so I'm re-running talos now. That doesn't explain why the watchdog thread was never kicking in though.
Falling off trace every 100ms seems way too frequent to me. If the only reason we do that is to keep checking if we're about to run out of memory, then why don't we make the watchdog thread do that and only interrupt the JS if we really are low on memory? Ideally there'd be a cleaner way to deal with the whole low memory issue, but that probably belongs in a separate Fennec bug.
I checked with Stuart and checking for low memory (i.e. calling NS_GetMemoryManager() and calling IsLowMemory() on the memory manager) on the watchdog thread should be fine. We'd still need to do it in the operation callback too, I guess, to be covered in the case where we're *not* running JIT:ed code, unless the watchdog thread runs then too.
(In reply to comment #108) > Falling off trace every 100ms seems way too frequent to me. See comment 15. (In reply to comment #109) > I checked with Stuart and checking for low memory (i.e. calling > NS_GetMemoryManager() and calling IsLowMemory() on the memory manager) on the > watchdog thread should be fine. There is no need to wake up the watchdog thread on low memory. Just forcing the operation callback calls on each running JSContext should be enough to make sure the JS terminates ASAP. > We'd still need to do it in the operation > callback too, I guess, to be covered in the case where we're *not* running > JIT:ed code, unless the watchdog thread runs then too. The watchdog thread does not terminate jited code. It just sets a flag that the jited code (or interpreter) periodically checks to call the callback when it is set.
Comment on attachment 350164 [details] [diff] [review] fix for timeout_exit bug This fixes the timeout exit bug. According to david the timeout handling itself still isn't right and was observed to crash in exciting ways. Need more testing. We will apply this to TM and see what talos thinks with the latest fix.
Attachment #350164 - Flags: review?(gal) → review+
Pushed fix as changeset http://hg.mozilla.org/tracemonkey/rev/057dd58d50b8 talos is still dying with the overall watchdog patch, but access-fannkuch no longer iloops. will get a stack trace when I get back to the office.
(In reply to comment #113 by David) > talos is still dying with the overall watchdog patch, but access-fannkuch no > longer iloops. will get a stack trace when I get back to the office. What happens if you disable the jit? Will the browser run? If so to proceed farther I suggest to disable that watchdog check in the jited code, land this patch and work in a separated bug on the jit issues.
What's the status of this bug? It's a major issue and hasn't seen progress in more than a week.
(In reply to comment #115) > What's the status of this bug? It's a major issue and hasn't seen progress in > more than a week. Any progress depends on the availability of Mac hardware as described in the comments above.
(In reply to comment #116) > (In reply to comment #115) > > What's the status of this bug? It's a major issue and hasn't seen progress in > > more than a week. > > Any progress depends on the availability of Mac hardware as described in the > comments above. Just to make sure we're all on the same page: if I try the patch on linux or windows, it will work as designed on scripts such as "while (true) {}" -- correct?
I have mac hardware. What exactly do you need me to test?
(In reply to comment #118) > I have mac hardware. What exactly do you need me to test? See comment 113.
(In reply to comment #117) > Just to make sure we're all on the same page: if I try the patch on linux or > windows, it will work as designed on scripts such as "while (true) {}" -- > correct? Yes. The issue is to figure out why Mac talos builds die with the patch. It could be related to the fact that with the patch the GC is called at different moments triggering some hidden bugs.
(In reply to comment #119) > (In reply to comment #118) > > I have mac hardware. What exactly do you need me to test? > > See comment 113. and comment 114.
Could you please refresh the patch? It doesn't apply cleanly to TM tip.
Attached patch patch sync with mozilla-central (obsolete) — Splinter Review
Attachment #349659 - Attachment is obsolete: true
Attached patch patch sync with mozilla-central (obsolete) — Splinter Review
Attachment #352295 - Attachment is obsolete: true
Attached patch patch sync with tm (obsolete) — Splinter Review
The same patch as the previous one. It was just synchronized with the latest mozilla-central code. We run mochi tests on our mini mac. No one test failed. Also we built the browser with the configuration that as we believed was used for talos. The browser did not crash. So we did not manage to find any failures. If there is no directions how we can get any problem with the patch then we would like to try landing again.
Attachment #350164 - Flags: approval1.9.1+
Attached patch sync with tm (obsolete) — Splinter Review
Attachment #352296 - Attachment is obsolete: true
Attachment #352297 - Attachment is obsolete: true
Whiteboard: [blew up leak test again] → fixed-in-tracemonkey
I backed out the patch from TM.
Whiteboard: fixed-in-tracemonkey
> On TM tinderbox the patch busted tgfx test case on Windows Andrei, can you reproduce this problem?
Currently we cannot because we do not have build environment for Windows (MVC and so on). Igor thinks that the problem may be related to the tracing code (looks like the patch that has been made for tracing does not solve the problem). So our plan is to switch off the corresponding tracing code (not all, but in places where it interacts with the watchdog) and test it. It may take a couple of days. If this idea does not work the only options is to debug the failed test.
I plan to land this to tm. If the tests passes, then the problem with the previous patch would be related with jit exit/operation callback interaction. If the patch still hang windows tinderbox, then we will try to debug the problem later on some windows installation.
Attachment #353894 - Attachment is obsolete: true
Attachment #354052 - Attachment is obsolete: true
Attachment #354296 - Flags: review+
Attachment #354296 - Attachment is obsolete: true
Comment on attachment 354296 [details] [diff] [review] the prev patch + disabling operation count support in jit code The change to disable operation counting checks in the traced code has not help - WINNT talos box was still busted with the patch.
Attachment #354052 - Attachment is obsolete: false
To Robert Sayre: do you know how to get MSVC for Windows testing?
I think MS Visual Studio Express is available from MS for free.
You need to download Microsoft Visual C++ from this page: http://www.microsoft.com/express/download/default.aspx#webInstall and then also follow instructions on: https://developer.mozilla.org/en/Windows_Build_prerequisites ro find the rest of what you need.
One theory about the hangs with gfx talos tests is that the changes from the patch has triggered the slow script dialog. With an optimized or debug build I can get the dialog once per about 5 runs of the browser with Linux-64 on somewhat busy system. The dialog is legitimate one AFAICS as in those cases the tests run more then 20 seconds, the limit for chrome scripts. This theory also explains why only Talos tests on Windows hangs. The total runtime for recent gfx tests on Windows Talos box is about 3 minuts while for Linux the time is 40 second and 30 seconds for Mac. Given that 20 seconds per test the Windows box may well hit the limit. Andrei is now doing the testing with Windows XP and he already got the slow dialog with the patch with the debug build.
Regarding that slow script theory. Currently the python script that runs the browser, http://mxr.mozilla.org/mozilla/source/testing/performance/talos/run_tests.py (correct me if this not the right location), only sets dom.max_script_run_time to zero according to http://mxr.mozilla.org/mozilla/source/testing/performance/talos/sample.config . The sample setup does *not* set dom.max_chrome_script_run_time to zero. Thus the max execution time for the chrome scripts is still limited to 20 seconds. Thus to proceed farther I suggest to fix the talos performance scripts in a separated bug to make sure that dom.max_chrome_script_run_time is zero and then re-land the patch in this bug. Now, what product/component should use for such bug?
Talos bugs go in mozilla.org : Release Engineering: Talos
As Igor mentioned I did talos gfx tests on Windows XP SP2. With Core 2 Duo system (not the fastest one) it passes without problem requiring less than 1 second for the optimized non-debug build. However if I run the tests against the non-optimized debug build then I get the slow script dialog. If I set dom.max_chrome_script_run_time to sero then the tests pass without showing dialog. So if we want the watchdog patch to pass tests we need to set the dom.max_chrome_script_run_time parameter to zero.
Depends on: 471595
Attached patch sync with tm (obsolete) — Splinter Review
This what I will attempt to land given that the patch for the bug 471595 went in.
Attachment #354052 - Attachment is obsolete: true
Attached patch sync with tm + void fixes (obsolete) — Splinter Review
Here is yet another tm sync with few fixes to declare methods that never fail to return void, not boolean.
Attachment #355052 - Attachment is obsolete: true
Attachment #355442 - Flags: review+
According to brendan the tinderboxes might be not up to date, so maybe its not the patch's fault. Backing out so I can commit an unrelated patch.
(In reply to comment #145) > According to brendan the tinderboxes might be not up to date, so maybe its not > the patch's fault. Backing out so I can commit an unrelated patch. In what sense ?
I think there was some bad interaction between a talos timeout and this bug. There is a bug this bug depends on to fix that. Maybe igor or andrei can comment. I am just muttering hearsay here.
Do you mean bug 471595, which is fixed for talos, but also need it set for unit testers ?
Both logs in comment #144 are timeouts doing make check (TUnit): .... TEST-PASS | ../../../../_tests/xpcshell-simple/test_update/unit/test_0040_general.js | all tests passed TEST-PASS | ../../../../_tests/xpcshell-simple/test_update/unit/test_0050_general.js | all tests passed TEST-PASS | ../../../../_tests/xpcshell-simple/test_update/unit/test_0051_general.js | all tests passed TinderboxPrint: TUnit<br/>450/0 NEXT ERROR buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 21604 TinderboxPrint: check <em class="testfail">timeout</em><br/> ... TEST-PASS | ../../../_tests/xpcshell-simple/test_intl_uconv/unit/test_decode_8859-10.js | all tests passed TEST-PASS | ../../../_tests/xpcshell-simple/test_intl_uconv/unit/test_decode_8859-11.js | all tests passed TEST-PASS | ../../../_tests/xpcshell-simple/test_intl_uconv/unit/test_decode_8859-13.js | all tests passed TinderboxPrint: TUnit<br/>39/0 NEXT ERROR buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output TinderboxPrint: check <em class="testfail">timeout</em><br/> I don't think there's anything configurable there except the time buildbot waits for output. Sounds like something is hanging.
Yeah, I think you are right. This definitively looks broken. Its backed out. The logs should be sufficient to debug this I hope.
To proceed farther I am going to split the patch in parts to land its pices separately. The first part will go to the bug 419086.
Depends on: 419086
Depends on: 465030
(In reply to comment #151) > To proceed farther I am going to split the patch in parts to land its pices > separately. The first part will go to the bug 419086. OK, what's the second part?
Blocks: 472702
No longer blocks: 472702
Depends on: 472702
(In reply to comment #152) > OK, what's the second part? Bug 472702.
Depends on: 473167
(In reply to comment #153) > (In reply to comment #152) > > OK, what's the second part? > > Bug 472702. What is left to do in this bug? Anything? Just trying to clear things up.
Attached patch watchdog for xpcSplinter Review
The new patch is effectively the part of the previous patches that enables the support for the watchdog for xpc.
Attachment #355442 - Attachment is obsolete: true
Whats the status of this bug? I am really eager to get rid of the operation count updating. It seems to contribute to some strange performance problems we are observing https://bugzilla.mozilla.org/show_bug.cgi?id=475998 (stores confusing the address disambiguation logic of the process).
There are some performance problems with the current patch with thread workers. In its current form it uses the watchdog thread also to wake up them. But due to the need to do it each 10ms or so not to block the main thread that waits for the GC, this causes apparent performance degradation. So newer patch will use the watchdog only for the main thread as only for it the jit is enabled.
Depends on: 477179
Igor, please see bug 477179 for how we're getting rid of the need for a 10ms watchdog thread wakeup interval for DOM workers.
Alternative patch landed for 477187 that solves this bug. Marking WONTFIX. http://hg.mozilla.org/tracemonkey/rev/32fd75dcb1f7
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → WONTFIX
(In reply to comment #159) > Marking WONTFIX. This bug is about the problem, not a particular way to solve it. So it fixed-in-tracemonkey ;)
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Whiteboard: fixed-in-tracemonkey
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
477187 is fixed1.9.1, so this is fixed1.9.1
Keywords: fixed1.9.1
No longer blocks: 464672
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: