Closed Bug 1260239 Opened 8 years ago Closed 4 years ago

Worker debugger fails to attach, followed by a browser crash.

Categories

(DevTools :: Debugger, defect, P5)

x86
Windows
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: jujjyl, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 2 obsolete files)

STR:

1. Enable "Enable worker debugging (in development)" in developer toolbox settings, and restart the browser.
2. Download https://dl.dropboxusercontent.com/u/40949268/dump/worker_debugging_test.zip
3. Unzip and host the file locally with "python -m SimpleHTTPServer" in the unzipped directory. (this seems to be important, hosting in Dropbox would not reproduce the bug)
4. Navigate to http://localhost:8000/worker_debugging_test.html
5. Open Debugger devtool and click on the "worker_debugging_test.js" worker item under the "Workers" list in the debugger view.

Observe that the worker script does not load up, but a blank window is opened. See screenshot worker_debugger_no_start.png for illustration.

6. Continuing in the current state, click on the page reload navigation button while the blank window is open.

The browser content process will crash. Two different crash traces come up:

More common:
https://crash-stats.mozilla.com/report/index/59529af4-5ee1-402e-9ca9-fbb132160328
https://crash-stats.mozilla.com/report/index/8846a5b2-5aba-4303-9da0-612032160328

In one run, the crash took down the whole browser: https://crash-stats.mozilla.com/report/index/319080cb-a389-4402-bcbc-6e8a32160328
That's not good.  Is this with / without e10s, or doesn't matter?
Component: Developer Tools → Developer Tools: Debugger
Doesn't matter. The above crash logs were with e10s enabled. Here is one without e10s active:

https://crash-stats.mozilla.com/report/index/401e921c-5ebd-41aa-9bd0-13b222160328
The code line reads "MOZ_CRASH("There should be no edges from the debuggee to the debugger.");". The crash message was added at https://bugzilla.mozilla.org/show_bug.cgi?id=1092102. Eddy, how does this look to you?
Flags: needinfo?(ejpbruel)
That assertion means that the debuggee is trying to access an object that lives in a debugger compartment. That should never, ever happen: the debugger should be able to access things in the debuggee, but never the other way around.

Since this is a crasher, I'm assigning P1 to this, and putting it on my todo list for this week.
Assignee: nobody → ejpbruel
Flags: needinfo?(ejpbruel)
Priority: -- → P1
Here's what I figured out so far. We try to run a normal worker runnable. This causes a DOMException to be thrown. The DOMException lives in the debugger's scope, so it must be thrown by debugger code. This implies that the runnable causes debugger code to be executed. Once we finished running the runnable, we check if there are any pending exceptions. Since there are, we try to get it, while we are in the 
debuggee's scope. This triggers the MOZ_CRASH you see above, since we assert that the debugger can never have any references to objects in the debugger's scope.

I tried debugging the crash under rr to figure out where that DOMException is thrown from, but so far the crash does not reproduce for me under Linux. However, the overall problem is that debuggee code can end up calling debugger code. For instance, when a breakpoint is hit, we trigger a debugger callback. That callback can throw an exception, which lives in the debugger's scope, which is then apparently propagated back up to the debuggee's scope. This violates our assertion that the debuggee can never have any edges to the debugger.

Bobby, I need to double check this with Jim, but as far as I know, the fact that exceptions thrown by the debugger can propagate up to the debuggee is not a bug in itself. We should probably always prevent such exceptions from propagating up in the debugger server, but it would be nice if the browser didn't crash whenever we forgot to catch some exception. What would be a nice way to make this work?
Flags: needinfo?(bobbyholley)
Jukka, where you able to reproduce the crash on Linux? I tried to debug this under rr, but I couldn't get the worker window to become blank. On Mac, the crash reproduces consistently.
Flags: needinfo?(jujjyl)
I don't have a linux computer to test. Mostly I work on Windows, where I do get it consistently to produce. I found it curious that it only reproduces when I use python web server, but if I put it to dropbox and run it there (https://dl.dropboxusercontent.com/u/40949268/dump/worker_debugging_test/worker_debugging_test.html), the worker window does correctly show the script and I don't get the blank window. (and no crash either)
Flags: needinfo?(jujjyl)
The Debugger API's principle here is that exceptions generated in debugger code are probably meaningless to the debuggee: after all, if it weren't for the developer attaching the debugger to it, the poor debuggee would never have seen that exception at all: it represents a complaint about a problem in code it doesn't even know exists. So in general, a debugger's callback functions should strive to be "infallible", in the sense that it somehow handles all its own errors itself, and shields the debuggee from them. The Debugger's uncaughtExceptionHook is supposed to be the backstop for this.

So the fact that any debugger exception is propagating to the debuggee at all is a problem already.

Reproducing this is probably key to sorting out the problem.
(In reply to Eddy Bruel [:ejpbruel] from comment #5)
> Bobby, I need to double check this with Jim, but as far as I know, the fact
> that exceptions thrown by the debugger can propagate up to the debuggee is
> not a bug in itself. We should probably always prevent such exceptions from
> propagating up in the debugger server, but it would be nice if the browser
> didn't crash whenever we forgot to catch some exception. What would be a
> nice way to make this work?

I think comment 8 answers this.
Flags: needinfo?(bobbyholley)
Well, it shouldn't *crash*! The crash reports linked in comment 0 are cross-compartment references, and those shouldn't be able to be introduced at all. Certainly Debugger claims it never introduces them. So there's also a bug there.
(In reply to Jim Blandy :jimb from comment #10)
> Well, it shouldn't *crash*! The crash reports linked in comment 0 are
> cross-compartment references, and those shouldn't be able to be introduced
> at all.

What does the debugger do in the absence of an uncaughtExceptionHook? Does it swallow the exception, or propagate it as-is to the debuggee? If the answer is "the latter", then that would explain why we're ending up in the wrap hook.
(In reply to Bobby Holley (busy) from comment #11)
> (In reply to Jim Blandy :jimb from comment #10)
> > Well, it shouldn't *crash*! The crash reports linked in comment 0 are
> > cross-compartment references, and those shouldn't be able to be introduced
> > at all.
> 
> What does the debugger do in the absence of an uncaughtExceptionHook? Does
> it swallow the exception, or propagate it as-is to the debuggee? If the
> answer is "the latter", then that would explain why we're ending up in the
> wrap hook.

I think we're two misunderstandings deep, including my own, so let me back out carefully:

The code that runs when a Debugger callback throws and there is no uncaughtExceptionHook is here:

https://hg.mozilla.org/mozilla-central/file/d5d53a3b4e50/js/src/vm/Debugger.cpp#l1064

This was the backstop for the backstop, so we decided that reporting it and then terminating (JSTRAP_ERROR is what's used for OOM) was the best we could do.

First misunderstanding: the above only applies to throws from Debugger callbacks, but I think I saw someone say that in this case the debugger hadn't really gotten running yet, so Debugger's rules for its callbacks may have nothing to do with this case. The code I linked to above certainly doesn't allow exceptions to propagate.

Second misunderstanding: My comment 10 is completely wrong. I hadn't looked at the stack carefully, and jumped to the conclusion that the assertion had detected an unwrapped cross-compartment reference. That is not what's happening. Rather, the worker cross-compartment wrapper policy refuses to create *any debuggee->debugger references at all*. The assertion that failed claims that the worker thread will never even ask for such a CCW. That aggressive policy surprised me, but it seems reasonable now that I understand it.

I'm sorry for the confusion.

The Debugger implementation code I linked above won't propagate debugger exceptions to the debuggee. And the other Debugger paths that pass data to the debuggee would try to wrap that data appropriately for the debuggee earlier (say, as soon as the Debugger.Object that the APIs require was created), yet we don't see those frames on the stack. So if Debugger is involved here at all, that will be a surprise to me.

As an unrelated maybe-bug: it should be very easy to use the Debugger to hit that assertion:

// Run this code in the debugger global.

var worker = ... // the debuggee global
var dbg = new Debugger;
var workerDO = dbg.addDebuggee(worker);

// This should crash at the assertion here:
// https://hg.mozilla.org/mozilla-central/annotate/63be002b4a80/dom/workers/RuntimeService.cpp#l838
workerDO.makeDebuggeeValue({});

However, if it were Debugger methods trying to wrap a debugger value for presentation to the debuggee, we should see those methods on the stack, and we don't.

I call this a 'maybe-bug', because perhaps the solution is "please never try to do that." But Debugger tries to be a safe API, where you get JS errors, not crashes; clearly we're not following that principle here.
To add what Jim said in comment 12:

I claimed earlier that even when a debugger server has been initialised in a worker, that does necessarily mean that a Debugger object has been created yet. We don't do that until the parent server attaches to the ThreadActor in the worker server.

However, if there is no Debugger object, there can be no debugger callbacks, since those are set on the Debugger object. If there are no debugger callbacks set, we should never be able to enter debugger code from debuggee code, since we can only do this via a debugger callback.

Conversely, if we enter debugger code from debuggee code, this must have been done via a debugger callback. Since a debugger callback exists, a Debugger object must exist. We set the uncaughtExceptionHook right after the Debugger object is created, so if a Debugger object exists, the uncaughtExceptionHook must
have been set.

Assuming the above is correct, we should either never enter debugger code from debuggee code, or exceptions from the debugger should never propagate to the debuggee. Yet that seems to be what we're seeing: a normal worker runnable (which can only run debuggee code) is run, and once it is finished, we
have a pending exception that lives in the debugger's compartment.

I believe the key to figuring out what's going on here is figuring out where that DOMException comes from. Inspecting the DOMException under lldb didn't really tell me much (it's location is set to null). I tried debugging it with rr, but the crash doesn't seem to reproduce under Linux.

I'm not really sure where to go from here, but I'll ask around.
Here is what lldb shows me when I print the contents of that dom::Exception:
https://pastebin.mozilla.org/8865831

As you can see, none of the fields seem to be set to meaningful values, so this doesn't really tell me anything about where that exception comes from, or why it is being thrown.

Boris, any idea why these fields are empty? Are they created lazily, by any chance? If not, do you have any advice on how I could uncover the origin of this exception? I already tried debugging it under rr, but unfortunately the crash does not reproduce on Linux.
Flags: needinfo?(bzbarsky)
First of all, is this a mozilla::dom::Exception, or a mozilla::dom::DOMException?  The latter is a subclass of the former.  I'm guessing it's an Exception, not a DOMException, given the lack of useful anything in it, the mention of DOMException in comment 13 notwithstanding.

mFilename and mLineNumber fields are only used when mLocation is null.  But in this case it is, so if they're blank that means whoever created this exception did not set any useful filename/linenumber.

Third, mMessage/mName are used if set, but if not set and this is an Exception (NOT DOMException), they are derived from mResult as needed. 

Your mResult is 0x80004005 which is NS_ERROR_FAILURE.  That's not terribly helpful...

It may be worth breakpointing on Exception::Initialize, condition on aLocation being null and aResult being NS_ERROR_FAILURE (and maybe conditioned on being on the relevant worker thread if you can manage it; I expect there to be some false positives from mainthread here), and seeing what the C++ callstack is at that point.
Flags: needinfo?(bzbarsky)
(In reply to Jukka Jylänki from comment #7)
> I don't have a linux computer to test. Mostly I work on Windows, where I do
> get it consistently to produce. I found it curious that it only reproduces
> when I use python web server, but if I put it to dropbox and run it there
> (https://dl.dropboxusercontent.com/u/40949268/dump/worker_debugging_test/
> worker_debugging_test.html), the worker window does correctly show the
> script and I don't get the blank window. (and no crash either)

After pulling from mozilla-central today, I can no longer reproduce the issue on Mac either :-( The worker window always shows correctly, and refreshing does not cause the tab to crash. Since I can't reproduce the issue on Linux either, and don't have access to a Windows machine, this makes the bug unactionable for me.

Jukka, can you still reproduce this issue with a recent pull from mozilla-central?
Flags: needinfo?(jujjyl)
I am no longer experiencing the crash at step 6, and on step 5, the content does load up. However, if I repeat step 5 twice (click on the text "worker_debugging_test.js" to open the worker debugger, then close the worker debugger window, then click again on the text "worker_debugging_test.js"), the script content does not load up on the second time, but a blank window is displayed.

If there is nothing that explains why the two bugs (script not loading, and the crash) should have gone away in mozilla-central, then it would still be possible to debug against the earlier revision where it still reproduced for you to understand why it no longer doesn't, rather than dropping it as unactionable(?). You could also mozregression --find-fix to see which code change swept the crash under the rug (or if someone fixed it for you?) for you to see which code change did it.
Flags: needinfo?(jujjyl)
Given the context, I do think it would be awfully nice to have this bug analyzed. The changes in M-C may have fixed it, but my suspicion is that they only masked it.
Took a while, but I managed to bisect this.

The commit that fixed this is:
2abb7f68c033d1216d4068d2db5d47e0d19aed8e

The last broken commit is:
c3701b806a7a9e9044d162fef4413bc3afb3334e
At this point, I'm wondering if the problem didn't reproduce under Linux because I tried it there after trying it on OSX, using a later commit. I'm going to try this again tomorrow, using the definitely broken commit above. If we can debug this with rr, it should be relatively easy to figure out the origin of that DOMException.
I managed to reproduce the bug on Linux under rr, but unlike on OSX, where I could inspect the reserved slot of the JSObject being thrown to obtain the dom::Exception, doing so in rr causes gdb to segfault. I don't understand why.

I also debugged the problem on OSX using the steps suggested by Boris in comment 15, and managed to obtain the following stack trace:
https://pastebin.mozilla.org/8867030

This suggests that the Exception is being thrown because the call to loadSubScript on WorkerDebuggerGlobalScope fails. loadSubScript itself does not throw NS_ERROR_FAILURE. However, the script loader which it uses definitely does, in multiple places.

I am not yet sure why the script loader fails, but regardless, it seems to me that this shouldn't cause the browser to crash.
> The commit that fixed this is:
> 2abb7f68c033d1216d4068d2db5d47e0d19aed8e

I assume that's a gecko-dev commit hash, because there is nothing like that in m-c...

That's an Atomics changeset; having it change behavior here is pretty weird.....
Eddy, are you still working on this?  We're getting duplicates filed...
Flags: needinfo?(ejpbruel)
(In reply to Boris Zbarsky [:bz] (TPAC) from comment #24)
> Eddy, are you still working on this?  We're getting duplicates filed...

I wasn't, but I should be able to spend some cycles on this this week. Thanks for putting it back on my radar!
Flags: needinfo?(ejpbruel)
I spent some more time analysing this bug. Like I mentioned in comment 21, I managed to reproduce the bug on Linux under rr. However, when I then try to set a watchpoint on the exception object and reverse continue to see where it is coming from, I seem to be running into a bug with rr.

I've opened the following issue on github for the problem I'm running into. Details can be found there:
https://github.com/mozilla/rr/issues/1833
Since I can't get rr to work for me, I've tried again to obtain more information about the exception object using just gdb. Given comment 14, I know that I've done this before, but this was sufficiently long ago that I no longer remember the exact steps I used to obtain this information.

At the moment, I'm stuck on the following problem: if I call dump on the exception object (which is a JSObject*), it prints information suggesting it has a single reserved slot. However, when I then try to access that reserved slot, the slots array of the object appears to be empty. The following pastebin shows the exact output I'm seeing: https://pastebin.mozilla.org/8914282

Boris, can you remind me how we got from this JSObject* to a mozilla::dom::Exception ?
Flags: needinfo?(bzbarsky)
slots_ is the heap slots; reserved slots start off in fixed slots unless there are too many of them.

You want js::GetReservedSlot(obj.get(), 0).toPrivate() to get the mozilla::dom::Exception*.
Flags: needinfo?(bzbarsky)
After setting a conditional breakpoint in Exception::Initialize, as suggested by bz in comment 15, I've obtained the following stack trace for when the Exception object that causes the crash is created: https://pastebin.mozilla.org/8914406
Based on the stack trace in comment 29, the Exception object that is causing the crash is created as a result of this call:
https://dxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp?q=WorkerPrivate.cpp%3A585&redirect_type=direct#599

This means that the ErrorResult object rv has a pending exception set on it. Where does this exception come from? If I set a breakpoint on ErrorResult::Throw, the resulting stack trace suggests it is called from WorkerPrivate::AddFeature, which in turn is called from ScriptLoader::LoadAllScripts.

Apparently, what is going on is that we are trying to initialize the worker debugger when the worker is already in the Canceling state. This will cause AddFeature to fail. This in turn causes loading the main worker script to fail.
Attached patch Proposed fix (obsolete) — Splinter Review
The root cause of the crash is that if CompileDebuggerScriptRunnable fails, it sets a pending exception. By the time we report this exception, we are no longer in a compartment, so AutoJSAPI::ReportException assumes it should use the main worker global, while it should be using the worker debugger global instead.

As a result, we end up wrapping an object in the debugger's compartment in the debuggee's compartment. We explicitly disallow any edges from the debuggee to the debugger, and crash deliberately whenever this rule is violated.

I've attached a patch with a proposed fix for the problem. Normally I'd ask khuey for feedback, but since he's no longer the owner of this code, I'm guessing baku would be a good person to ask instead.
Attachment #8796575 - Flags: feedback?(amarchesini)
Do you want to use "ac", or a new JSAutoCompartment?  Note the long comment before the PostRun call which is talking about not changing the compartment before we PostRun....
Oh, and also... This patch is just wrong.  !globalObject can happen in the following cases:

1)  targetIsWorkerThread and DefaultGlobalObject returned null (because the global is not created yet).
2)  !targetIsWorkerThread and we're a worker started from a JSM.

Case #2 is the one all the rigmarole around "ac" is for right now.  We should probably document that more clearly...

Anyway, in case #2, trying to do anything with DefaultGlobalObject() is obviously a bad idea.  We should add a test that exercises an exception on that codepath if we don't have one already...
I guess scoping the compartment-entering down to the targetIsWorkerThread && !globalObject case, and using a separate Maybe<JSAutoCompartment> for it, would do about the right thing.  And lots of comments needed.  :(
Attached patch Proposed fix v2 (obsolete) — Splinter Review
I revised the patch based on bz's feedback. Does this look correct to you?

In any case, we probably need some tests before we can land this. I'm not sure if this patch is correct, because unlike the main worker script, the debugger can have sandboxes as well. What happens if an exception thrown from a debugger sandbox is reported on the main debugger global's compartment?
Attachment #8796575 - Attachment is obsolete: true
Attachment #8796575 - Flags: feedback?(amarchesini)
Attachment #8797695 - Flags: feedback?(amarchesini)
>+      MOZ_ASSERT(targetIsWorkerThread);

Why does this necessarily hold?  What if we had !mWorkerPrivate->GetWrapper() above in the !targetIsWorkerThread case?
(In reply to Boris Zbarsky [:bz] (still a bit busy) from comment #36)
> >+      MOZ_ASSERT(targetIsWorkerThread);
> 
> Why does this necessarily hold?  What if we had
> !mWorkerPrivate->GetWrapper() above in the !targetIsWorkerThread case?

Damn. You are right. I didn't consider that possibility :-(
I updated the patch to take into account comment 36 by bz, and added some tests, so this should be ready for review now.

There is one thing I am still unsure about, though: for the main worker script, we only ever have one global, so DefaultGlobalObject() will automatically pick the correct one. Debugger scripts, however, can create additional globals, in the form of sandboxes.

What happens if the script loaded by CompileDebuggerScriptRunnable creates a sandbox, and the script loaded in that sandbox throws an exception? As the patch is currently written, we will report that exception in the main debugger global. I'm not sure if that's something we're supposed to be doing. FWIW, I wrote a test to cover this case and it seems to work.
Attachment #8797695 - Attachment is obsolete: true
Attachment #8797695 - Flags: feedback?(amarchesini)
Attachment #8798161 - Flags: review?(amarchesini)
It still seems safer to me to scope the JSAutoCompartment around the ReportException() call only....
Attachment #8798161 - Flags: review?(amarchesini) → review+
Assignee: ejpbruel → nobody
Not a P1 anymore. Our worker debugging support was always experimental, and now with the new debugger, it even isn't available anymore.
So I'm downgrading this now.

Web worker support in the new debugger is tracked here: https://github.com/devtools-html/debugger.html/issues/3815
Í'll add a link to this bug there, so we keep it in mind.
Severity: critical → normal
Priority: P1 → P2
Product: Firefox → DevTools
Blocks: dbg-worker

Brian what do you think about the importance/relevance now.

Flags: needinfo?(bhackett1024)

(In reply to Jason Laster [:jlast] from comment #42)

Brian what do you think about the importance/relevance now.

It's hard to say, the relevant code hasn't changed much but this is a tricky fix and I don't know enough about this code to really understand what's going on. Since the symptoms are a deliberate, safe crash I'm inclined to not worry about this unless we start seeing crashes again with this signature.

Flags: needinfo?(bhackett1024)
Priority: P2 → P5

This is very old and the repro steps don't work. We support debugging workers in the new debugger now and I'm not aware of any crash reports currently, so I'm going to call this fixed.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: