Closed Bug 626743 Opened 9 years ago Closed 9 years ago

Don't miss compartments in js_SetDebugMode

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: luke, Assigned: sfink)

References

Details

(Whiteboard: [hardblocker][fixed-in-tracemonkey][has patch])

Attachments

(1 file, 6 obsolete files)

js_SetDebugMode only visits compartments that are the compartment of some context.  Particularly once we remove cx->globalObject (which will set cx->compartment == NULL when we leave the last request for a context as opposed to cx->globalObject's compartment), this can miss a lot of compartments.  It seems we need to enumerate all compartments.
Andreas says that, even now, there are many compartments without a context whose globalObject is in that context.  So this seems like a real problem now.
blocking2.0: --- → betaN+
Whiteboard: hardblocker
What context should recompile the script in a compartment that is not associated with any context? Can any context do it, as long as it's in the same compartment?

The current code seems to check script liveness with respect to a single context. Couldn't a script be live in a different context, and shouldn't that block recompiling for debug too?
(In reply to comment #2)
> What context should recompile the script in a compartment that is not
> associated with any context? Can any context do it, as long as it's in the same
> compartment?

I believe the recompiler ignores the cx argument and instead finds the context in which the script is active using the VMFrame list.

> The current code seems to check script liveness with respect to a single
> context. Couldn't a script be live in a different context, and shouldn't that
> block recompiling for debug too?

That's a good question. I think the only way that could happen is if a script calls a native, and that calls the same script again. I think in that case we can still do it, we just need to fix up both return addresses. We might already handle that case.
(In reply to comment #3)
> (In reply to comment #2)
> > What context should recompile the script in a compartment that is not
> > associated with any context? Can any context do it, as long as it's in the same
> > compartment?
> 
> I believe the recompiler ignores the cx argument and instead finds the context
> in which the script is active using the VMFrame list.

Where? I don't see that.

Whichever context it is, the recompiler uses it mostly for memory allocation/deallocation, which doesn't really matter, but also passes it to the Compiler constructor. If it's the "wrong" context, my worry is that an exception could get thrown, or a context-affiliated callback could be invoked.

> > The current code seems to check script liveness with respect to a single
> > context. Couldn't a script be live in a different context, and shouldn't that
> > block recompiling for debug too?
> 
> That's a good question. I think the only way that could happen is if a script
> calls a native, and that calls the same script again. I think in that case we
> can still do it, we just need to fix up both return addresses. We might already
> handle that case.

Huh? I don't follow. In your example, why would script->native->script change contexts? How is that example different from recursion? Oh... do natives lose the context?

I don't really understand the execution model. But... um... couldn't you have one context spin a nested event loop that processes an event triggering another context? Or what's that JS_SuspendRequest API for? Could it do something similar?

Someday I'll figure out how all this stuff works.
(In reply to comment #4)
> (In reply to comment #3)
> > (In reply to comment #2)
> > > What context should recompile the script in a compartment that is not
> > > associated with any context? Can any context do it, as long as it's in the same
> > > compartment?
> > 
> > I believe the recompiler ignores the cx argument and instead finds the context
> > in which the script is active using the VMFrame list.
> 
> Where? I don't see that.
> 
> Whichever context it is, the recompiler uses it mostly for memory
> allocation/deallocation, which doesn't really matter, but also passes it to the
> Compiler constructor. If it's the "wrong" context, my worry is that an
> exception could get thrown, or a context-affiliated callback could be invoked.

You're right, it finds the right fp, but I guess it just uses the same cx. I still assume any cx is supposed to work, but there does seem to be potential for problems.

> > > The current code seems to check script liveness with respect to a single
> > > context. Couldn't a script be live in a different context, and shouldn't that
> > > block recompiling for debug too?
> > 
> > That's a good question. I think the only way that could happen is if a script
> > calls a native, and that calls the same script again. I think in that case we
> > can still do it, we just need to fix up both return addresses. We might already
> > handle that case.
> 
> Huh? I don't follow. In your example, why would script->native->script change
> contexts? How is that example different from recursion? Oh... do natives lose
> the context?

It would be if the native uses JSAPI to call the same script again on a different cx. It seems pretty unlikely, but I don't think we actually defend against it.

> I don't really understand the execution model. But... um... couldn't you have
> one context spin a nested event loop that processes an event triggering another
> context? 

I think that's an instance of what I am talking about here.

> Or what's that JS_SuspendRequest API for? Could it do something
> similar?

I think Luke knows more about that.
Assignee: general → sphink
Here's my current WIP patch that scans through all scripts in all compartments. It's pretty ugly, partly because we're doing some stuff that requires a JSContext when we don't have one. And even if we use any random context, there's no good way to put it into the correct compartment because we only have the context and compartment, and we need a global object in the right compartment.

I'll need help with this, but currently it just immediately triggers compartment mismatches (JS_RestoreExceptionState is finding a mismatch between the context and exception again.) I'm not sure how these changes cause that. So I'll try to work that out first.
Ok, uncommenting my poor-man's compartment switch worked. It turns out that the context used for compilation really matters.

Andreas, can you take a look at this patch? It's nowhere near being ready for review, and I warn you that it's not pretty, but I'm sure there's a better way of dealing with compartments and contexts than what I'm doing here.
Attachment #505299 - Attachment is obsolete: true
Can you show me the stack of the crash?
Comment on attachment 505304 [details] [diff] [review]
Recompile scripts for all compartments

Man the version hack is nasty but the rest looks pretty good. Does this work?
I talked with Luke and David about this a bunch and I think I actually have some level of understanding now. Comments:

- Is it really necessary to recompile scripts back to regular mode? If it's hard, it seems OK to just not do that.

  - Side note: AFAICT, the cx you pass to the compiler doesn't matter too much. There are a couple of corner cases where it matters: (a) an OOM during compilation will generate an OOM exception on that cx, and (b) js_GetClassPrototype could be called on that cx, and it can fail, but probably only when compiling for the first time.

- For scripts that are not active, can't you just wipe out their jit code, then set the debug flag on them, so that they are compiled on demand in the new mode the next time they are called?

- Instead of setting cx->compartment directly, I assume you want to use the autocompartment call objects there?
Yeah, don't fiddle with cx->compartment, it doesn't guarantee a valid stack frame. Use

JSAutoCompartmentCall ac;
if (!ac.enter(cx, script))
   return false;
(In reply to comment #8)
> Can you show me the stack of the crash?

If necessary, I can get it again. But it only happens when I don't set the context's compartment.

(In reply to comment #9)
> Man the version hack is nasty but the rest looks pretty good. Does this work?

Heh. I'll probably just iterate through my liveScripts vector instead, since it's usually empty anyway.

It appears to work at least as well as it did before. I have some problems I *hope* it helps with, but I haven't worked on reproducing them so it's hard to tell if it's better now. See bug 626830 for an example. Partly this was for future-proofing; the code is wrong now, and if you start clearing cx->compartment it will start to matter very quickly.

(In reply to comment #10)
> - Is it really necessary to recompile scripts back to regular mode? If it's
> hard, it seems OK to just not do that.

It is definitely not necessary, and in fact the error case depends on that. But it's easy, and slightly better (it avoids the "turning on firebug slows everything down even after you turn it back off!" problem as much as possible.)

On the other hand, now that we're doing things runtime-wide, single-stepping may start to get slow when it recompiles every script in the world twice between each step.

>   - Side note: AFAICT, the cx you pass to the compiler doesn't matter too much.
> There are a couple of corner cases where it matters: (a) an OOM during
> compilation will generate an OOM exception on that cx, and (b)
> js_GetClassPrototype could be called on that cx, and it can fail, but probably
> only when compiling for the first time.

That's what I thought from inspection, but then I don't understand why having the wrong compartment on that context caused compartment mismatches *later* when the code was run and threw an exception.

But I didn't look too closely at it.

> - For scripts that are not active, can't you just wipe out their jit code, then
> set the debug flag on them, so that they are compiled on demand in the new mode
> the next time they are called?

That seems like a better idea, yes. Should speed up single-stepping nicely, too.

> - Instead of setting cx->compartment directly, I assume you want to use the
> autocompartment call objects there?

(In reply to comment #11)
> Yeah, don't fiddle with cx->compartment, it doesn't guarantee a valid stack
> frame. Use
> 
> JSAutoCompartmentCall ac;
> if (!ac.enter(cx, script))
>    return false;

JSAutoEnterCompartment, I assume.

Yes, I can do it that way. I started to once, but I hadn't figured out a way to get a meaningful context then. Then when I figured it out, it was just easier to hack cx->compartment because I'm looping over compartments and don't have a script in hand. I also didn't want to enter once per script, but that's silly -- I can just check whether I've already entered so I enter at most once per compartment.

Ok, I'll switch back.
Here's something closer to a real patch. I replaced the script->version hackery with a HashMap lookup, added a JS_BeginRequest() to fix a crash on shutdown, and switched to using JSAutoEnterCompartment.

Also, I replaced the original js/JS_SetDebugMode with this new one. There probably ought to be two -- one like the original, for setting debugMode only within a single compartment, and this one that works across compartments. A single-compartment version seems useful to have, but it currently wouldn't have any users other than the shell.

I have not yet implemented clearing out JITScripts rather than recompiling them.
Attachment #505304 - Attachment is obsolete: true
Switched from HashMap -> HashSet, I guess this is ready for review.

Sorry, didn't get around to purging JITScripts rather than recompiling them. Followup bug?
Attachment #505358 - Attachment is obsolete: true
Attachment #505893 - Flags: review?(dmandelin)
On the patch:

- needs to check for NULL return from JS_NewContext.

- why do only scripts on the current thread need to be considered live?

- I think it would be good to implement the purging right away.
http://hg.mozilla.org/tracemonkey/rev/47678330818a

Oops. Got ahead of myself. This one wasn't r+'ed, and I only addressed 2/3 of dmandelin's comments:

 - now checks for NULL return from JS_NewContext
 - implements purging (which makes MUCH more sense; recompiling was silly)

Backout: http://hg.mozilla.org/tracemonkey/rev/19574542e48f

I was restricting to the current thread mostly through cargo-culting what the previous version did. Is there a lock I can use to look at the other threads? I don't want to race.
cdleary-bot mozilla-central merge info:
http://hg.mozilla.org/mozilla-central/rev/47678330818a
http://hg.mozilla.org/mozilla-central/rev/19574542e48f (backout)
Note: not marking as fixed because last changeset is a backout.
The patch is looking good (with the rest of dmandelin's comments addressed), just a question:

If we fail partway through switching on debug mode, we will be in the state of some compartments having debug mode off.  I don't see the callers of JS_SetDebugMode checking its return value.  Does JSD, by any chance, always check if compartment->debugMode is on before using any of our debug APIs?  If not, then it seems like the return value needs to be carefully bubbled up so that we don't later try to use a debug API on a non-debug compartment.

And a nit: I know the error paths aren't all identical, but could you factor out the common error handling to an error label at the end that we goto if something fails?
(In reply to comment #18)
> If we fail partway through switching on debug mode, we will be in the state of
> some compartments having debug mode off.  I don't see the callers of
> JS_SetDebugMode checking its return value.  Does JSD, by any chance, always
> check if compartment->debugMode is on before using any of our debug APIs?  If
> not, then it seems like the return value needs to be carefully bubbled up so
> that we don't later try to use a debug API on a non-debug compartment.

As I understand it, everything is supposed to check debugMode first. But looking at the code, it really doesn't seem like that's happening. jsdbgapi.cpp checks for things that modify state, but I don't see why the debugHook callbacks wouldn't get invoked. I think you're right.

> And a nit: I know the error paths aren't all identical, but could you factor
> out the common error handling to an error label at the end that we goto if
> something fails?

Sorry, I should have posted the latest version of the patch. After eliminating the recompiling and instead purging the JITScripts, there's much less error handling to worry about.

This patch still doesn't handle multiple threads properly.
Uploading the latest version. Now handles the multithreading issue. In attempting to get the layering right, I had to change the JSD IDL. Alternatively, I could change the XPConnect IDL.

The problem is that XPConnect-land knows whether a compartment is on the main thread, and I need to use that information when setting the debug mode. So I iterate over the main thread compartments in XPConnect, and therefore need to pass in a compartment to jsdbgapi.

Not asking for review yet because I'm detecting memory corruption during GC. Just wanted to post this version so somebody can tell me it's totally crazy.

Remaining issues:
 - JSD APIs could get used after setting debug mode failed
 - memory corruption (possibly not this patch)
Attachment #505893 - Attachment is obsolete: true
Attachment #507643 - Attachment is obsolete: true
Attachment #505893 - Flags: review?(dmandelin)
Blocks: 627758
This patch is not to blame for the memory corruption. That turned out to be caused by a patch for bug 628758.

So the only remaining issue here is that misusing the JSD APIs (i.e., calling them when JSD has not been successfully turned on) can theoretically cause crashes/corruption.
Unfortunately, this patch mutated heavily to handle multiple threads. Since everything in a compartment is either on the main thread or off of it, this uses XPConnect's CompartmentParticipatesInCycleCollection to determine whether something is on the main thread (according to gal, this is the current correct and only working approach, assuming I'm using it right).

Because that information is only available to XPConnect, this restructures things to allow XPConnect to loop over all the eligible compartments in a runtime.
Attachment #507768 - Attachment is obsolete: true
Attachment #508696 - Flags: review?(dmandelin)
Whiteboard: hardblocker → [hardblocker]
Whiteboard: [hardblocker] → [hardblocker][b11]
Whiteboard: [hardblocker][b11] → [hardblocker][b11][has patch]
I just found out the current version of this patch has a memory leak, so don't count on it landing for b11. I will review now, though, to maximize options.
Does the leak happen proportionately to the number of times you transition in and out of debug mode, or is it more pervasive?  If the former, we should probably take it anyway to get Firebug coverage in more of our remaining beta-time.
Comment on attachment 508696 [details] [diff] [review]
Recompile scripts for all compartments

OK, patch looks good to me, but obviously there is something else to do because of the leak, so I'll clear the r flag to indicate that.
Attachment #508696 - Flags: review?(dmandelin)
If I do not turn on debugging at all during a run, I don't see a leak. (Though the try server showed lots of orange for all of the test groups, and I wouldn't think debugging would ever be turned on for more than one of them.)

If I turn on debugging but comment out the OnDebuggerActivated JSD callback, so that Firebug doesn't get notified that debugging is on (and therefore it never makes any JSD calls), then I don't see any leaks, but I get:

###!!! ASSERTION: XPConnect is being called on a scope without a 'Components' property!  (stack and details follow): 'Error', file /home/sfink/src/TM-singlestep/js/src/xpconnect/src/xpcwrappednativescope.cpp, line 781

This is for an object defined by Firebug, when sort of proxy invoke event is processed during shutdown. (It tries to run code when things have been mostly torn down.) Here it is: nsProxyObjectCallInfo::Run.

If I turn on debugging and allow Firebug to do stuff, I get the 'Components' assert plus a bunch of leaked strings. Or at least I did until I recompiled with --trace-malloc; now they're gone. Grr.

But the 'Components' assert seems more important. The leak may just be a side effect of things bailing out prematurely.

As I mentioned, early this morning I kicked off a tryserver run that failed most of the tests because of leaks, although whatever leak analyzer it was running claimed that I was leaking XPCJSStackFrames:

leaked 68513 instances of XPCJSStackFrame with size 32 bytes each (2192416 bytes total)

Which is... interesting.
Based on previous comments, removing this from the b11 radar - we'll get it for the next beta.
Whiteboard: [hardblocker][b11][has patch] → [hardblocker][has patch]
The XPConnect 'Components' assertion seems to be a red herring. It happens without the patch.

The leaks I was chasing are probably not what broke on the try server. I'm not sure if they're even relevant.

I am able to reproduce this with both the bits pulled down from the try server build, and with my own local build. And I think the problem is actually from another patch. I confused matters by assuming the leak report on shutdown (that I only got when applying this patch) was the same issue as what the try server ran into. I had additional patches on the try server. I am waiting on a try server push of just this patch right now, but it's pretty backed up. One opt build has made it through, though, so I may have an answer soon.

STR:

1. Pull down the test zip from a try server. I don't know how to generate one of these locally, and all of my attempts to use the tests from the source tree mysteriously passed.

2. unzip into a directory eg /tmp/tests

3. symlink /tmp/tests/firefox to your $OBJDIR/dist/bin directory

4. From within /tmp/tests:

python -u xpcshell/runxpcshelltests.py --symbols-path=symbols --test-path=xpcshell/tests/browser/components/dirprovider/tests/unit/test_bookmark_pref.js firefox/xpcshell xpcshell/tests/browser/components/dirprovider/tests/unit

From local testing, it appears that this patch is fine. At least for the xpcshell test, I can make the errors come or go by adding in the other patches that interfered.
After discussing with dmandelin, the current intention is to push this a little later this evening if the try server comes out clean and some manual Firebug tests go okay. I'll push it in about an hour unless someone objects.
Comment on attachment 508696 [details] [diff] [review]
Recompile scripts for all compartments

try server still isn't done chewing on it, but I was hoping you could renew the r+.
Attachment #508696 - Flags: review?(dmandelin)
This changes a public interface way past the time to do that (beta3-ish?), but maybe its the only way to go. Lets get the API stability police involved.
http://hg.mozilla.org/tracemonkey/rev/641a33d0932f

Oops, didn't notice comment 31 show up before I pushed to TM. I'll hold off on the planned landing to m-c until I get an official opinion on the API change, but this was requested for the b11 short list so it may have been discussed already.

Specifics: I added one new [noscript] function to the jsdIDebuggerService interface: deactivateDebugger().

I also modified another [noscript] function on the same interface: recompileForDebugMode(JSRuntime,PRBool) -> recompileForDebugMode(JSContext,JSCompartment,PRBool)

They're both [noscript], no in-tree native users of jsdIDebuggerService use these functions other than the one I modified, and I don't know of any out-of-tree users of the interface. So this should be pretty safe. The semantics of the scripted interface do not change.
Oh, right. That was the cost.

The benefit: the previous code iterated over all of the contexts, found their current compartments, and recompiled all scripts found in those compartments. Any compartment not currently associated with a context would be missed. When enabling debugging, that means some scripts would not be debuggable. When disabling, you could theoretically leave some scripts in debug mode and invoking callbacks unexpectedly (which would probably have no effect, but I'm not 100% sure). See also comment 0.

In order to avoid a race condition, I needed to restrict the enable/disable to the main thread. XPConnect knows what thread its compartments are used from; JSD does not. So I modified JSD to allow XPConnect to restrict its operation to eligible compartments. The alternative would be to modify XPConnect so it could be queried about the thread affinity of compartments, but changing nsIXPConnect seemed riskier than changing jsdIDebuggerService.
This broke non-threadsafe shell builds. The fix seemed straightforward, so I pushed it without review. Please take a look.

http://hg.mozilla.org/tracemonkey/rev/17e52535ac5b
(In reply to comment #34)
> This broke non-threadsafe shell builds. The fix seemed straightforward, so I
> pushed it without review. Please take a look.
> 
> http://hg.mozilla.org/tracemonkey/rev/17e52535ac5b

That fix works, and is the minimal change required. But it seems like the proper fix would be elsewhere, so I'll open a new bug 630898 for that since it touches JSAPI.
Attachment #508696 - Flags: review?(dmandelin) → review+
Comment on attachment 508696 [details] [diff] [review]
Recompile scripts for all compartments

Andreas wants to run this past the API stability posse. Explicitly marking for review since this blocks other stuff I want to land (esp bug 626830), so I'd like to resolve it soon.
Attachment #508696 - Flags: review?(brendan)
Comment on attachment 508696 [details] [diff] [review]
Recompile scripts for all compartments

[noscript] only changes for which there is no alternative are ok. uuid change, check. What else should I look at? jsdbgapi.h addition is fine too.

We should test (get honzab to help if you can) that Firebug likes this.

/be
Attachment #508696 - Flags: review?(brendan) → review+
Please cc: timeless on bugs involving jsd, though -- not doing so is bad.

/be
(I did not review all the rest of the patch.)
(In reply to comment #37)
> Comment on attachment 508696 [details] [diff] [review]
> Recompile scripts for all compartments
> 
> [noscript] only changes for which there is no alternative are ok. uuid change,
> check. What else should I look at? jsdbgapi.h addition is fine too.

Thanks, that's fine. The patch already passed review; I was bringing you in only for the IDL change. Blame gal.

> We should test (get honzab to help if you can) that Firebug likes this.

I am testing Firebug, constantly. honzab helped me get going with the test suite, and I've dug into both Firebug's and FBTest's code. Firebug is happy with the patch.
Whiteboard: [hardblocker][has patch] → [hardblocker][fixed-in-tracemonkey]
Whiteboard: [hardblocker][fixed-in-tracemonkey] → [hardblocker][fixed-in-tracemonkey][has patch]
Comment on attachment 508696 [details] [diff] [review]
Recompile scripts for all compartments

+JS_SetDebugModeForCompartment(JSContext *cx, JSCompartment *comp, JSBool debug)
+{
+    JSRuntime *rt = cx->runtime;
+
+    // We can only recompile scripts that are not currently live (executing in
+    // some context). This function is only called from the main thread,

this sounds like an assertion. if so, it should be written using JS_ASSERT().

+    ................................................................... and
+    // will only consider contexts in that same thread and scripts inside
+    // compartments associated with that same thread.

this part is odd, because you use 'same thread' often, but the previous part seemed to mean 'this stuff only applies to the main thread'. if you really mean main thread here, i'd rather you write 'this thread' or 'the main thread'.

+    // Discard all of this thread's inactive JITScripts and set their

here you use 'this thread'..

      * Recompile all active scripts in the runtime for debugMode.
      */
-    [noscript] void recompileForDebugMode(in JSRuntime rt, in PRBool mode);
+    [noscript] void recompileForDebugMode(in JSContext cx, in JSCompartment comp, in PRBool mode);
 
it'd be vaguely nice if this method had an @throws which explains what it means to throw NS_ERROR_FAILURE - i've read the patch and can't quite figure it out. does it mean that no debugging should ever be tried in this app instance? - i.e. can one recover from this? how?

+    if (!NS_SUCCEEDED(rv)) {

this should be NS_FAILED(rv) -- and someone should have caught it. if that means that dmandelin a js reviewer isn't familiar with places outside js/src/<files not subdirectories> then that means someone should probably be designated as the reviewer for those outside areas, either asking brendan to actually review them, or asking mrbkap....

+    nsCOMPtr<jsdIDebuggerService> jsds = do_GetService(jsdServiceCtrID, &rv);

i don't think we actually want to do this, i suspect we want to instead ask the service manager if the service is currently instantiated. the code you have here will instantiate it even if no one is using it. there is probably some cost involved in doing this (albeit hopefully not a high cost).

+    JS_BeginRequest(cx);

i think we have a JSAutoRequest or something fancier in xpconnect. - using it would catch cases where you use goto fail incorrectly (see below).

+    if (!(cx = JS_NewContext(rt, 256))) {
..
+    JS_BeginRequest(cx);
..
+                rv = jsds->RecompileForDebugMode(cx, comp, gDesiredDebugMode);
+                if (!NS_SUCCEEDED(rv)) {
..
+                    goto fail;
..
+fail:
the request and the cx leak here.
 }

These are all things which brendan can review if he's asked to review, so someone should write a patch addressing these comments and seek r?brendan.
Attachment #508696 - Flags: review-
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to comment #37)
> We should test (get honzab to help if you can) that Firebug likes this.
> 
> /be

I guess that should be Jan 'Honza' Odvarko (now CC:), rather than Honza Bambas.
You need to log in before you can comment on or make changes to this bug.