Closed Bug 1225160 Opened 4 years ago Closed 4 years ago

crash in JSAutoCompartment::JSAutoCompartment | mozJSComponentLoader::ImportInto

Categories

(Core :: XPConnect, defect, P1, critical)

defect

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox45 --- affected
firefox48 --- fixed

People

(Reporter: Gijs, Assigned: ejpbruel)

References

Details

(Keywords: crash, reproducible)

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is 
report bp-36559863-e718-4bc5-b2f9-aa5f52151116.
=============================================================

We're crashing here:

http://hg.mozilla.org/mozilla-central/annotate/48d636f678b0/js/src/jsobj.h#l159

with a null _group

and according to crash-reports end up there through here:

http://hg.mozilla.org/mozilla-central/annotate/48d636f678b0/js/xpconnect/loader/mozJSComponentLoader.cpp#l1217

I don't really understand how that is possible. But it's annoying - I am repeatedly crashing while trying to use the browser debugger in the course of my normal work. The browser debugger stays up but the process dies underneath me. It tends to happen while I'm stopped on a breakpoint in the debugger. I don't have reliable STR, but I've crashed 3 times in the last 20 minutes, and it's starting to get on my nerves.

bhackett, ejpbruel says you might know more about what's going on here?
Flags: needinfo?(bhackett1024)
Um, I have no idea how that got set.
Group: mozilla-employee-confidential
All objects have a non-NULL group_ pointer, so if group_ is NULL then the object is invalid.  I don't know anything about mozJSComponentLoader or why the object might be invalid.
Flags: needinfo?(bhackett1024)
Duplicate of this bug: 1254155
As written in the duped bug, I can reproduce this crash all the time.

When I'm running our firefox-ui-tests with the --jsdebugger argument and are waiting too long at a formerly set breakpoint, Firefox crashes.

Steps:
1. mach firefox-ui-test --jsdebugger
2. Set breakpoint in marionette/driver.js in switchToWindow (line 1470) at this.getWinEnumerator()
3. Click the ok button of the modal dialog to start the tests
4. When the breakpoint is hit just wait about 10s - then Firefox crashes
Keywords: reproducible
OS: Mac OS X → All
Hardware: Unspecified → All
Eddy, any chance you can help here, maybe with the steps from comment #4 or by pointing to someone who can? Thanks!
Flags: needinfo?(ejpbruel)
I was also able to hit this numerous times with the following steps:

1. Open the page http://cpettitt.github.io/project/dagre-d3/latest/demo/user-defined.html
2. Open the page toolbox
3. Open the browser toolbox
4. In the browser toolbox, open inspector.js, and set a breakpoint inside modifyAttributes
5. Back over in the page toolbox, inspect the house like SVG shape, and edit an attribute

~10s later the browser crashes.

Here's my crash report from these steps:

https://crash-stats.mozilla.com/report/index/28539982-f6f5-4b10-96e6-e43862160309
A similar issue was reported in bug 1232709.
See Also: → 1232709
I am not familiar with mozJSComponentLoader, unfortunately, so I have no idea what's causing this, or who's the best person to ask. The best I can do is needinfo Tom, since git blame claims he wrote the line that the failing assertion is guarding.

Tom, can you take a look at this bug? And if you have no idea what's going on either, could you bring this to the attention of the JS team for me? Thanks!
Flags: needinfo?(ejpbruel) → needinfo?(evilpies)
I only changed this code to add proper rooting, so I actually have no idea what is going on here.
Flags: needinfo?(evilpies)
Component: JavaScript Engine → XPConnect
Bobby, can you play final buck-passer where all of Tom, Eddy and bhackett say they don't know who knows this code? Who does?

Considering the frequency with which this crashes the debuggee when using the JS debugger in the browser toolbox, it would be really really really nice if we got this addressed. As it is, the crashes are interfering with frontend's ability to debug our code. :-(
Flags: needinfo?(bobbyholley)
This looks like the same bug as bug 1232709 - |obj| is presumably null.


(In reply to :Gijs Kruitbosch from comment #10)
> Bobby, can you play final buck-passer where all of Tom, Eddy and bhackett
> say they don't know who knows this code? Who does?

The component loader isn't really code that anybody knows by heart. If this is reproducible (which it sounds like it is) _or_ you can get an rr trace, this should be straightforward for any Gecko hacker (including yourself) to debug. So I think you should just find somebody with the cycles to actually look at it.

In terms of XPConnect peers who might be convinced to look into it, Gabor or Blake, maybe?
Flags: needinfo?(bobbyholley)
Gabor, would you have time to look at this? I'm swamped with other things and not really at home in JS or xpconnect land...
Flags: needinfo?(gkrizsanits)
Yes, I can take a look at this.
Assignee: nobody → gkrizsanits
Flags: needinfo?(gkrizsanits)
I could not reproduce the crash with any of the steps above on the current nightly / fresh build.

(In reply to Brian Hackett (:bhackett) from comment #2)
> All objects have a non-NULL group_ pointer, so if group_ is NULL then the
> object is invalid.  I don't know anything about mozJSComponentLoader or why
> the object might be invalid.

This bug has nothing to do with mozJSComponentLoader, we have an object with an invalid state, and import is just happen to be the function we pass this object to. Someone from the JS team should take a look at this, I have no idea how an object can get into that state, nor can I reproduce this to try and trace the moment when we set that |group_| property to null.
Assignee: gkrizsanits → nobody
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #14)
> I could not reproduce the crash with any of the steps above on the current
> nightly / fresh build.
> 
> (In reply to Brian Hackett (:bhackett) from comment #2)
> > All objects have a non-NULL group_ pointer, so if group_ is NULL then the
> > object is invalid.  I don't know anything about mozJSComponentLoader or why
> > the object might be invalid.
> 
> This bug has nothing to do with mozJSComponentLoader, we have an object with
> an invalid state, and import is just happen to be the function we pass this
> object to. Someone from the JS team should take a look at this, I have no
> idea how an object can get into that state, nor can I reproduce this to try
> and trace the moment when we set that |group_| property to null.

My impression is that |obj| is null, and that this is the non-debug version of bug 1232709. Are you saying that isn't the case?
Flags: needinfo?(gkrizsanits)
(In reply to Bobby Holley (busy) from comment #15) 
> My impression is that |obj| is null, and that this is the non-debug version
> of bug 1232709. Are you saying that isn't the case?

Actually, I think you're right sorry. Anyway, I keep trying to reproduce it when I have some time for it but cannot do much until then.
Flags: needinfo?(gkrizsanits)
Sounds like Gijs can reproduce. Maybe he can come up with reliable STR or an rr trace?
(In reply to Bobby Holley (busy) from comment #17)
> Sounds like Gijs can reproduce. Maybe he can come up with reliable STR or an
> rr trace?

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #16)
> (In reply to Bobby Holley (busy) from comment #15) 
> > My impression is that |obj| is null, and that this is the non-debug version
> > of bug 1232709. Are you saying that isn't the case?
> 
> Actually, I think you're right sorry. Anyway, I keep trying to reproduce it
> when I have some time for it but cannot do much until then.

Do the steps in comment #6 not reproduce this for you:

(In reply to J. Ryan Stinnett [:jryans] (use ni?) (at work week until Mar. 18) from comment #6)
> I was also able to hit this numerous times with the following steps:
> 
> 1. Open the page
> http://cpettitt.github.io/project/dagre-d3/latest/demo/user-defined.html
> 2. Open the page toolbox
> 3. Open the browser toolbox
> 4. In the browser toolbox, open inspector.js, and set a breakpoint inside
> modifyAttributes
> 5. Back over in the page toolbox, inspect the house like SVG shape, and edit
> an attribute
> 
> ~10s later the browser crashes.
> 
> Here's my crash report from these steps:
> 
> https://crash-stats.mozilla.com/report/index/28539982-f6f5-4b10-96e6-
> e43862160309

?

I don't know how to make an rr trace and would need to get everything set-up on my linux vm to do that, AIUI (and manipulate some vmware config things, or something?). I haven't figured out how to do this yet, and I"m fairly nose-to-grindstone with e10s test efforts, unfortunately. I am also... hesitant... about me being able to figure out what's wrong with the component loader - my knowledge of the JS engine as well as xpconnect and their concepts is approximately nil. :-)
(In reply to :Gijs Kruitbosch from comment #18)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #16)
> Do the steps in comment #6 not reproduce this for you:
> ?

No they don't. Actually the breakpoint never gets hit no matter what attributes I change.
Did not have a lot more luck with the steps in Comment 4 either, on a fresh build from mc
mach firefox-ui-test --jsdebugger brings up the update popup and then the test fails, the
debugger does never show up.

> 
> I don't know how to make an rr trace and would need to get everything set-up
> on my linux vm to do that, AIUI (and manipulate some vmware config things,
> or something?). I haven't figured out how to do this yet, and I"m fairly
> nose-to-grindstone with e10s test efforts, unfortunately. I am also...
> hesitant... about me being able to figure out what's wrong with the
> component loader - my knowledge of the JS engine as well as xpconnect and
> their concepts is approximately nil. :-)

Once you've got the rr trace at least the figuring out part would not rest on your shoulder ;)
Anyway, a simpler more reliable steps to reproduce would be plenty. Maybe there is just some flags I need to set or something for the debugger I don't know, or maybe you are on another branch. Can you reproduce this on a fresh debug build from the current mc with a fresh profile?
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #19)
> Did not have a lot more luck with the steps in Comment 4 either, on a fresh
> build from mc
> mach firefox-ui-test --jsdebugger brings up the update popup and then the
> test fails, the

The mach command has been changed in the meantime, but looks like we have a problem bringing up the debugger. If you checkout a revision from March 6th the above command should work for you and replicate this problem. I will try to get this fixed via bug 1256996 tomorrow.
(In reply to Henrik Skupin (:whimboo) from comment #20)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #19)
> > Did not have a lot more luck with the steps in Comment 4 either, on a fresh
> > build from mc
> > mach firefox-ui-test --jsdebugger brings up the update popup and then the
> > test fails, the
> 
> The mach command has been changed in the meantime, but looks like we have a
> problem bringing up the debugger. If you checkout a revision from March 6th
> the above command should work for you and replicate this problem. I will try
> to get this fixed via bug 1256996 tomorrow.

It might have to do something with the fact that in the other STR the debugger (from browser toolbox) never breaks at the breakpoint I set, so essentially does not work at all. I'd prefer to wait until the debugger is functional again.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #21)
> (In reply to Henrik Skupin (:whimboo) from comment #20)
> > (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #19)
> > > Did not have a lot more luck with the steps in Comment 4 either, on a fresh
> > > build from mc
> > > mach firefox-ui-test --jsdebugger brings up the update popup and then the
> > > test fails, the
> > 
> > The mach command has been changed in the meantime, but looks like we have a
> > problem bringing up the debugger. If you checkout a revision from March 6th
> > the above command should work for you and replicate this problem. I will try
> > to get this fixed via bug 1256996 tomorrow.
> 
> It might have to do something with the fact that in the other STR the
> debugger (from browser toolbox) never breaks at the breakpoint I set, so
> essentially does not work at all. I'd prefer to wait until the debugger is
> functional again.

The debugger is perfectly functional, I'm guessing you have e10s turned on, and so the page toolbox runs in the child process, and because of the same reason that MSVS/GDB/XCode breakpoints don't work when you're attached to the other process, they don't work here. Turn e10s off and the debugger will break as described.

As it is, with Ryan's steps I can reproduce this fairly easily on my Windows machine, but after having just set up everything on my Linux VM so I can run Firefox under rr, of course I can't reproduce there. :-\

I'll try copying over prefs from the Windows profile to see if it's something else that's different about the profile.

Otherwise, are there things I can find out from just setting breakpoints in MSVS or something? Any obvious suspects for what would be clearing this thing that shouldn't be cleared?
Flags: needinfo?(gkrizsanits)
OK, I have an rr trace. Now what? :-)

For context, here's the JS stack that I'm seeing at the point where we're crashing:

(rr) call DumpJSStack()
0 XPCU_moduleLambda() ["resource://gre/modules/XPCOMUtils.jsm":273]
    this = [object BackstagePass]
1 XPCU_defineLazyGetter/<.get() ["resource://gre/modules/XPCOMUtils.jsm":198]
    this = [object BackstagePass]
2 Impl._loadSessionData<() ["resource://gre/modules/TelemetrySession.jsm":1956]
3 next(val = undefined) ["self-hosted":947]
    this = [object Generator]
4 TaskImpl_run(aSendResolved = true) ["resource://gre/modules/Task.jsm":319]
    this = [object Object]
5 TaskImpl(iterator = [object Generator]) ["resource://gre/modules/Task.jsm":280]
    this = [object Object]
6 createAsyncFunction/asyncFunction() ["resource://gre/modules/Task.jsm":254]
    this = [object Object]
7 setupChromeProcess/this._delayedInitTask<() ["resource://gre/modules/TelemetrySession.jsm":1417]
8 next(val = undefined) ["self-hosted":947]
    this = [object Generator]
9 TaskImpl_run(aSendResolved = true) ["resource://gre/modules/Task.jsm":319]
    this = [object Object]
10 TaskImpl(iterator = [object Generator]) ["resource://gre/modules/Task.jsm":280]
    this = [object Object]
11 createAsyncFunction/asyncFunction() ["resource://gre/modules/Task.jsm":254]
12 Task_spawn(aTask = [function]) ["resource://gre/modules/Task.jsm":168]
    this = [object Object]
13 this.DeferredTask.prototype._timerCallback/<() ["resource://gre/modules/DeferredTask.jsm":282]
14 next(val = undefined) ["self-hosted":947]
    this = [object Generator]
15 TaskImpl_run(aSendResolved = true) ["resource://gre/modules/Task.jsm":319]
    this = [object Object]
16 TaskImpl(iterator = [object Generator]) ["resource://gre/modules/Task.jsm":280]
    this = [object Object]
17 createAsyncFunction/asyncFunction() ["resource://gre/modules/Task.jsm":254]
18 Task_spawn(aTask = [function]) ["resource://gre/modules/Task.jsm":168]
    this = [object Object]
19 this.DeferredTask.prototype._timerCallback([xpconnect wrapped nsITimer]) ["resource://gre/modules/DeferredTask.jsm":280]
    this = [object Object]
20 bound ([xpconnect wrapped nsITimer]) ["self-hosted":752]
    this = func$1 = 1
(In reply to :Gijs Kruitbosch from comment #22)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #21)
> > (In reply to Henrik Skupin (:whimboo) from comment #20)
> > > (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #19)
> > > > Did not have a lot more luck with the steps in Comment 4 either, on a fresh
> > > > build from mc
> > > > mach firefox-ui-test --jsdebugger brings up the update popup and then the
> > > > test fails, the
> > > 
> > > The mach command has been changed in the meantime, but looks like we have a
> > > problem bringing up the debugger. If you checkout a revision from March 6th
> > > the above command should work for you and replicate this problem. I will try
> > > to get this fixed via bug 1256996 tomorrow.
> > 
> > It might have to do something with the fact that in the other STR the
> > debugger (from browser toolbox) never breaks at the breakpoint I set, so
> > essentially does not work at all. I'd prefer to wait until the debugger is
> > functional again.
> 
> The debugger is perfectly functional, I'm guessing you have e10s turned on,

Thanks for looking into this.

FYI e10s is the default setting on mc/nightly/aurora soon on beta too. If the
debugger is not working with e10s on it is not working. Period. If I have to turn
it off that should be noted in the STR in the future.

> and so the page toolbox runs in the child process, and because of the same
> reason that MSVS/GDB/XCode breakpoints don't work when you're attached to
> the other process, they don't work here. Turn e10s off and the debugger will
> break as described.

If the only way to make the debugger to work is turning e10s off then we have a problem.

(In reply to :Gijs Kruitbosch from comment #23)
> OK, I have an rr trace. Now what? :-)

I'm flagging Bobby for this.

> 
> For context, here's the JS stack that I'm seeing at the point where we're
> crashing:
> 
> (rr) call DumpJSStack()
> 0 XPCU_moduleLambda() ["resource://gre/modules/XPCOMUtils.jsm":273]
>     this = [object BackstagePass]
> 1 XPCU_defineLazyGetter/<.get() ["resource://gre/modules/XPCOMUtils.jsm":198]
>     this = [object BackstagePass]
> 2 Impl._loadSessionData<()
> ["resource://gre/modules/TelemetrySession.jsm":1956]
> 3 next(val = undefined) ["self-hosted":947]

So if I understand it correctly we are stopping the browser with the debugger. How can this timer still fire? And why do we try to init telemetry in the middle of a browser debugger session?
Flags: needinfo?(gkrizsanits)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #24)
> (In reply to :Gijs Kruitbosch from comment #22)
> > (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #21)
> > > (In reply to Henrik Skupin (:whimboo) from comment #20)
> > > > (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #19)
> > > > > Did not have a lot more luck with the steps in Comment 4 either, on a fresh
> > > > > build from mc
> > > > > mach firefox-ui-test --jsdebugger brings up the update popup and then the
> > > > > test fails, the
> > > > 
> > > > The mach command has been changed in the meantime, but looks like we have a
> > > > problem bringing up the debugger. If you checkout a revision from March 6th
> > > > the above command should work for you and replicate this problem. I will try
> > > > to get this fixed via bug 1256996 tomorrow.
> > > 
> > > It might have to do something with the fact that in the other STR the
> > > debugger (from browser toolbox) never breaks at the breakpoint I set, so
> > > essentially does not work at all. I'd prefer to wait until the debugger is
> > > functional again.
> > 
> > The debugger is perfectly functional, I'm guessing you have e10s turned on,
> 
> Thanks for looking into this.
> 
> FYI e10s is the default setting on mc/nightly/aurora soon on beta too. If the
> debugger is not working with e10s on it is not working. Period. If I have to
> turn

You can use the browser-content toolbox to debug things that run in the content process.

> it off that should be noted in the STR in the future.

Agreed; sorry about that - it likely wasn't something about which we thought much. So much requires you to still run without e10s that I (and I think many other people) still turn it off routinely.

> > and so the page toolbox runs in the child process, and because of the same
> > reason that MSVS/GDB/XCode breakpoints don't work when you're attached to
> > the other process, they don't work here. Turn e10s off and the debugger will
> > break as described.
> 
> If the only way to make the debugger to work is turning e10s off then we
> have a problem.

See above. I'm not really sure why this is so different from the MSVS/GDB/XCode case - the hoops are different, but you still have to jump. It would probably make sense for the browser toolbox to connect to both processes, but I don't know how much work that is to implement.

> (In reply to :Gijs Kruitbosch from comment #23)
> > OK, I have an rr trace. Now what? :-)
> 
> I'm flagging Bobby for this.

I've been poking at the trace after getting a new one with symbols (artifact build = no symbols = good luck debugging that). It's still an opt build, which isn't very helpful - but I'm kind of dreading trying to reproduce this with a debug build because of the raciness and the slowdown of VM + rr + debug build. That and gcc5 crashed a gazillion times even trying to build, so going through that again isn't great.

It looks like the module was cached, but its object is nullptr. I'm trying to figure out how/why, I'll update if I find something. In the meantime, if you or Bobby have advice about where to look, I'm listening... :-)

> > For context, here's the JS stack that I'm seeing at the point where we're
> > crashing:
> > 
> > (rr) call DumpJSStack()
> > 0 XPCU_moduleLambda() ["resource://gre/modules/XPCOMUtils.jsm":273]
> >     this = [object BackstagePass]
> > 1 XPCU_defineLazyGetter/<.get() ["resource://gre/modules/XPCOMUtils.jsm":198]
> >     this = [object BackstagePass]
> > 2 Impl._loadSessionData<()
> > ["resource://gre/modules/TelemetrySession.jsm":1956]
> > 3 next(val = undefined) ["self-hosted":947]
> 
> So if I understand it correctly we are stopping the browser with the
> debugger. How can this timer still fire? And why do we try to init telemetry
> in the middle of a browser debugger session?

Because stopping the right amount of JS is "hard". The debugger server still needs to run, etc. The APIs were pretty bad in Venkman's day, and JSD2 is better but still not great at this. There's ongoing work to make the js debugger better at this.

I'm not sure if this really matters for this issue... I've seen this happen without being stopped at a breakpoint, too. These STR just happen to be the most deterministic. Looks like it depends on telemetry being enabled - the crashing code fires 60000ms (1 minute) after startup.
Flags: needinfo?(bobbyholley)
So... I'm struggling a bit, but AIUI things actually go wrong when this thing is constructed, specifically:
(rr) watch * (void**) 0x110e55fcb5d0
Hardware watchpoint 1: * (void**) 0x110e55fcb5d0
(rr) reverse-continue
Continuing.
Hardware watchpoint 1: * (void**) 0x110e55fcb5d0

Old value = (void *) 0x7fce74dbe0c0
New value = (void *) 0xe5e5e5e5e5e5e5e5
JS::PersistentRooted<JSObject*>::PersistentRooted<JSRuntime*> (cx=<optimised out>, this=0x110e55fcb5d0)
    at /home/gijs/mc-realbuild/dist/include/js/RootingAPI.h:979
979	        registerWithRootLists(rootLists(cx));
(rr) bt
#0  JS::PersistentRooted<JSObject*>::PersistentRooted<JSRuntime*> (cx=<optimised out>, this=0x110e55fcb5d0)
    at /home/gijs/mc-realbuild/dist/include/js/RootingAPI.h:979
#1  mozJSComponentLoader::ModuleEntry::ModuleEntry (this=0x110e55fcb590, aCx=0x7fce74dd3400)
    at /home/gijs/mozilla-central/js/xpconnect/loader/mozJSComponentLoader.h:91
#2  0x000058bb74e2854e in mozJSComponentLoader::ImportInto (this=this@entry=0x7fce74da17c0, aLocation=..., targetObj=..., targetObj@entry=..., 
    callercx=callercx@entry=0x7fce74dd3400, vp=..., vp@entry=...) at /home/gijs/mozilla-central/js/xpconnect/loader/mozJSComponentLoader.cpp:1167
#3  0x000058bb74e28fcf in mozJSComponentLoader::Import (this=0x7fce74da17c0, registryLocation=..., targetValArg=..., cx=0x7fce74dd3400, 
    optionalArgc=<optimised out>, retval=...) at /home/gijs/mozilla-central/js/xpconnect/loader/mozJSComponentLoader.cpp:1051
#4  0x000058bb74e2ef11 in nsXPCComponents_Utils::Import (this=<optimised out>, registryLocation=..., targetObj=..., cx=0x7fce74dd3400, 
    optionalArgc=<optimised out>, retval=...) at /home/gijs/mozilla-central/js/xpconnect/src/XPCComponents.cpp:2502
#5  0x000058bb748f745c in NS_InvokeByIndex (that=<optimised out>, methodIndex=<optimised out>, paramCount=<optimised out>, params=<optimised out>)
    at /home/gijs/mozilla-central/xpcom/reflect/xptcall/md/unix/xptcinvoke_x86_64_unix.cpp:182
#6  0x000058bb74e5f991 in CallMethodHelper::Invoke (this=0x7fff2dba8668) at /home/gijs/mozilla-central/js/xpconnect/src/XPCWrappedNative.cpp:2083
#7  CallMethodHelper::Call (this=0x7fff2dba8668) at /home/gijs/mozilla-central/js/xpconnect/src/XPCWrappedNative.cpp:1400
#8  XPCWrappedNative::CallMethod (ccx=..., mode=mode@entry=XPCWrappedNative::CALL_METHOD)
    at /home/gijs/mozilla-central/js/xpconnect/src/XPCWrappedNative.cpp:1367
#9  0x000058bb74e67b0b in XPC_WN_CallMethod (cx=0x7fce74dd3400, argc=2, vp=0x2626181aa830)
    at /home/gijs/mozilla-central/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:1115
#10 0x000058bb76d78b32 in js::CallJSNative (args=..., native=<optimised out>, cx=0x7fce74dd3400)
    at /home/gijs/mozilla-central/js/src/jscntxtinlines.h:235
#11 js::Invoke (cx=0x7fce74dd3400, args=..., construct=<optimised out>) at /home/gijs/mozilla-central/js/src/vm/Interpreter.cpp:478
#12 0x000058bb76d72f8f in Interpret (cx=0x7fce74dd3400, state=...) at /home/gijs/mozilla-central/js/src/vm/Interpreter.cpp:2809
#13 0x000058bb76d7862d in js::RunScript (cx=cx@entry=0x7fce74dd3400, state=...) at /home/gijs/mozilla-central/js/src/vm/Interpreter.cpp:428
#14 0x000058bb76d78939 in js::Invoke (cx=cx@entry=0x7fce74dd3400, args=..., construct=construct@entry=js::NO_CONSTRUCT)
    at /home/gijs/mozilla-central/js/src/vm/Interpreter.cpp:496
#15 0x000058bb76c0c375 in js::fun_call (cx=cx@entry=0x7fce74dd3400, argc=argc@entry=1, vp=vp@entry=0x2626181aa788)
    at /home/gijs/mozilla-central/js/src/jsfun.cpp:1161
#16 0x000058bb76c0faca in js::fun_apply (cx=0x7fce74dd3400, argc=1, vp=0x2626181aa788) at /home/gijs/mozilla-central/js/src/jsfun.cpp:1179
#17 0x000058bb76d78b32 in js::CallJSNative (args=..., native=<optimised out>, cx=0x7fce74dd3400)
    at /home/gijs/mozilla-central/js/src/jscntxtinlines.h:235
#18 js::Invoke (cx=0x7fce74dd3400, args=..., construct=<optimised out>) at /home/gijs/mozilla-central/js/src/vm/Interpreter.cpp:478
#19 0x000058bb76d72f8f in Interpret (cx=0x7fce74dd3400, state=...) at /home/gijs/mozilla-central/js/src/vm/Interpreter.cpp:2809
#20 0x000058bb76d7862d in js::RunScript (cx=cx@entry=0x7fce74dd3400, state=...) at /home/gijs/mozilla-central/js/src/vm/Interpreter.cpp:428
#21 0x000058bb76d78939 in js::Invoke (cx=cx@entry=0x7fce74dd3400, args=..., construct=construct@entry=js::NO_CONSTRUCT)
...


mozJSComponentLoader.h looking like this:

87     class ModuleEntry : public mozilla::Module
88     {
89     public:
90         explicit ModuleEntry(JSContext* aCx)
91           : mozilla::Module(), obj(JS_GetRuntime(aCx)), thisObjectKey(JS_GetRuntime(aCx))
92         {
93             mVersion = mozilla::Module::kVersion;
94             mCIDs = nullptr;
95             mContractIDs = nullptr;
96             mCategoryEntries = nullptr;
97             getFactoryProc = GetFactory;
98             loadProc = nullptr;
99             unloadProc = nullptr;
100 
101             location = nullptr;
102         }

if I step through the constructor, once in the constructor body, obj ends up as:

(rr) p obj
$4 = {<js::PersistentRootedBase<JSObject*>> = {<No data fields>}, <mozilla::LinkedListElement<JS::PersistentRooted<JSObject*> >> = {
    mNext = 0x7fce74dbe0c0, mPrev = 0x6f116b873a10, mIsSentinel = false}, ptr = 0x0}

And, just to make sure the optimized build isn't messing with me, it still looks like that after the constructor returns. It doesn't change until the crash.

At the point where we're crashing we're looking for "resource://gre/modules/TelemetryStorage.jsm". The JS stack is in comment #23 . However, at the point where we're creating this ModuleEntry and set its obj to 0, we are *also* looking for this file, but through a different stack:

(rr) call DumpJSStack()
0 XPCU_moduleLambda() ["resource://gre/modules/XPCOMUtils.jsm":273]
    this = [object BackstagePass]
1 XPCU_defineLazyGetter/<.get() ["resource://gre/modules/XPCOMUtils.jsm":198]
    this = [object BackstagePass]
2 TelemetrySendImpl._checkPendingPings<() ["resource://gre/modules/TelemetrySend.jsm":607]
3 next(val = undefined) ["self-hosted":947]
    this = [object Generator]
4 TaskImpl_run(aSendResolved = true) ["resource://gre/modules/Task.jsm":319]
    this = [object Object]
5 TaskImpl(iterator = [object Generator]) ["resource://gre/modules/Task.jsm":280]
    this = [object Object]
6 createAsyncFunction/asyncFunction() ["resource://gre/modules/Task.jsm":254]
    this = [object Object]
7 TelemetrySendImpl.setup<() ["resource://gre/modules/TelemetrySend.jsm":588]
8 next(val = undefined) ["self-hosted":947]
    this = [object Generator]
9 TaskImpl_run(aSendResolved = true) ["resource://gre/modules/Task.jsm":319]
    this = [object Object]
10 TaskImpl(iterator = [object Generator]) ["resource://gre/modules/Task.jsm":280]
    this = [object Object]
11 createAsyncFunction/asyncFunction(false) ["resource://gre/modules/Task.jsm":254]
    this = [object Object]
12 this.TelemetrySend.setup(testing = false) ["resource://gre/modules/TelemetrySend.jsm":183]
    this = [object Object]
13 setupTelemetry/this._delayedInitTask<() ["resource://gre/modules/TelemetryController.jsm":673]
14 next(val = undefined) ["self-hosted":947]
    this = [object Generator]
15 TaskImpl_run(aSendResolved = true) ["resource://gre/modules/Task.jsm":319]
    this = [object Object]
16 TaskImpl(iterator = [object Generator]) ["resource://gre/modules/Task.jsm":280]
    this = [object Object]
17 createAsyncFunction/asyncFunction() ["resource://gre/modules/Task.jsm":254]
18 Task_spawn(aTask = [function]) ["resource://gre/modules/Task.jsm":168]
    this = [object Object]
19 this.DeferredTask.prototype._timerCallback/<() ["resource://gre/modules/DeferredTask.jsm":282]
20 next(val = undefined) ["self-hosted":947]
    this = [object Generator]


... it's not clear to me why obj gets set to 0 when we're constructing the ModuleEntry for the initial call, or why/how that's even possible.

It also seems like if I use "next" in gdb to step through the rest of ImportInto, when I end up at:

1170	        mInProgressImports.Put(info.Key(), newEntry);
1171	
1172	        rv = info.EnsureURI();
1173	        NS_ENSURE_SUCCESS(rv, rv);
1174	        RootedValue exception(callercx);
1175	        rv = ObjectForLocation(info, sourceLocalFile, &newEntry->obj,
1176	                               &newEntry->thisObjectKey,
1177	                               &newEntry->location, true, &exception);
1178	

and try to continue stepping over the ObjectForLocation call, it goes straight into the segfault. This is surprising because I didn't think, from looking at the stack from comment #0, that there were multiple Cu.import calls on the stack. And in fact, the backtrace from the crash doesn't have two calls in it. So at this point I'm mostly just very confused.

Stepping through this a bit more, it seems that the call to ReadCachedScript from within ObjectForLocation causes the crash - but it's not clear to me how that would work and/or if it's worth figuring that out, rather than why the ModuleEntry's obj thing is null on creation.

Does this help at all? Bobby/Gabor?
I'm not sure what the specific question about rr for me is. rr traces aren't portable, so the debugging itself will need to happen on Gijs' machine, presumably with some help from gabor. Y'all should hop on IRC together. :-)

Any questions about rr itself should probably go to roc, or maybe BenWa.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (busy) from comment #27)
> I'm not sure what the specific question about rr for me is. rr traces aren't
> portable, so the debugging itself will need to happen on Gijs' machine,

Sorry that was me who thought you can just port the recorded trace to another machine... wishful thinking...

(In reply to :Gijs Kruitbosch from comment #26)
> Does this help at all? Bobby/Gabor?

I'm getting too tired, but will look into it tomorrow and ping you over IRC and then we can figure out the rest.
So this is what I think happens:

That particular object can be set to null if PrepareObjectForLocation returns null, but I think we guard against that case. However PrepareObjectForLocation does a tricky things for the debugger it can call back to JS [1] and at this point the module is already set into the cache [2] but it's object is not yet set, so when that JS calls another import with the same module name it finds the cache entry with null obj in it and we crash. Gijs, could you validate my theory?

What we should do is to detect this case by adding an doneInit or something flag to the module and fire some exception that there is a recursion in the import or something similar when this happens. And then be very careful what to import from onNewGlobalObject...

Or if that's feasible we should fire the event once the module is fully initialized. That would solve this issue. Bobby is there a reason why we don't fire that event a little bit later when the obj for the module is set and the module is in a usable state? I can give this approach a try some time, but I'm quite flooded for today...

[1] : http://hg.mozilla.org/mozilla-central/annotate/5e14887312d4/js/xpconnect/loader/mozJSComponentLoader.cpp#l639

[2] : http://hg.mozilla.org/mozilla-central/annotate/3e04659fdf6a/dom/base/ImportManager.cpp#l666
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #29)
> So this is what I think happens:
> 
> That particular object can be set to null if PrepareObjectForLocation
> returns null, but I think we guard against that case. However
> PrepareObjectForLocation does a tricky things for the debugger it can call
> back to JS [1] and at this point the module is already set into the cache
> [2] but it's object is not yet set, so when that JS calls another import
> with the same module name it finds the cache entry with null obj in it and
> we crash. Gijs, could you validate my theory?
> 
> What we should do is to detect this case by adding an doneInit or something
> flag to the module and fire some exception that there is a recursion in the
> import or something similar when this happens. And then be very careful what
> to import from onNewGlobalObject...
> 
> Or if that's feasible we should fire the event once the module is fully
> initialized. That would solve this issue. Bobby is there a reason why we
> don't fire that event a little bit later when the obj for the module is set
> and the module is in a usable state? I can give this approach a try some
> time, but I'm quite flooded for today...
> 
> [1] :
> http://hg.mozilla.org/mozilla-central/annotate/5e14887312d4/js/xpconnect/
> loader/mozJSComponentLoader.cpp#l639
> 
> [2] :
> http://hg.mozilla.org/mozilla-central/annotate/3e04659fdf6a/dom/base/
> ImportManager.cpp#l666

I will try to look at this hypothesis and the proposed solution later today, but it might get pushed back to next week because I'm honestly pretty swamped myself.
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #29)
> So this is what I think happens:
> 
> That particular object can be set to null if PrepareObjectForLocation
> returns null, but I think we guard against that case. However
> PrepareObjectForLocation does a tricky things for the debugger it can call
> back to JS [1] and at this point the module is already set into the cache
> [2] but it's object is not yet set, so when that JS calls another import
> with the same module name it finds the cache entry with null obj in it and
> we crash. Gijs, could you validate my theory?

Nice analysis!

> What we should do is to detect this case by adding an doneInit or something
> flag to the module and fire some exception that there is a recursion in the
> import or something similar when this happens. And then be very careful what
> to import from onNewGlobalObject...

Yes, checking for recursive imports is probably the correct thing to do here. The bug here is in debugger JS, but we probably should throw a helpful error message rather than crashing.

> Or if that's feasible we should fire the event once the module is fully
> initialized. That would solve this issue. Bobby is there a reason why we
> don't fire that event a little bit later when the obj for the module is set
> and the module is in a usable state? I can give this approach a try some
> time, but I'm quite flooded for today...

OnNewGlobalObject needs to be fired before any JS code can run, otherwise the debugger can miss breakpoints.
This VM and the recording live on my approximately dead mac (hardware issues). I'll try to reproduce on my fedora VM on my Windows box in the meantime, but I'm also on PTO + public holiday from Thursday until Tuesday inclusive, so it might be (second half) of next week before I get to this. :-(
(In reply to :Gijs Kruitbosch from comment #30)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #29)
> > So this is what I think happens:
> > 
> > That particular object can be set to null if PrepareObjectForLocation
> > returns null, but I think we guard against that case. However
> > PrepareObjectForLocation does a tricky things for the debugger it can call
> > back to JS [1] and at this point the module is already set into the cache
> > [2] but it's object is not yet set, so when that JS calls another import
> > with the same module name it finds the cache entry with null obj in it and
> > we crash. Gijs, could you validate my theory?

So, I don't *think* this is it... because when I step through PrepareObjectForLocation, and through the JS_FireOnNewGlobalObject(aes.cx(), global);, everything seems to work. We return |obj| , and the call succeeds.

However... when we try to call ReadCachedScript here: http://hg.mozilla.org/mozilla-central/annotate/5e14887312d4/js/xpconnect/loader/mozJSComponentLoader.cpp#l691 everything dies.

Does that help? Am I missing something?
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(gkrizsanits)
(In reply to :Gijs Kruitbosch (away 24-29/3, incl.) from comment #33)
> (In reply to :Gijs Kruitbosch from comment #30)
> > (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #29)
> > > So this is what I think happens:
> > > 
> > > That particular object can be set to null if PrepareObjectForLocation
> > > returns null, but I think we guard against that case. However
> > > PrepareObjectForLocation does a tricky things for the debugger it can call
> > > back to JS [1] and at this point the module is already set into the cache
> > > [2] but it's object is not yet set, so when that JS calls another import
> > > with the same module name it finds the cache entry with null obj in it and
> > > we crash. Gijs, could you validate my theory?
> 
> So, I don't *think* this is it... because when I step through
> PrepareObjectForLocation, and through the JS_FireOnNewGlobalObject(aes.cx(),
> global);, everything seems to work. We return |obj| , and the call succeeds.
> 
> However... when we try to call ReadCachedScript here:
> http://hg.mozilla.org/mozilla-central/annotate/5e14887312d4/js/xpconnect/
> loader/mozJSComponentLoader.cpp#l691 everything dies.
> 
> Does that help? Am I missing something?

I think this is a new bug, previously we did not crash in ReadCachedScript based on the stacks above. I don't know what has changed but if you have a totally different crash stack it might make sense to file a new bug for that. Could you elaborate on the "everything dies" part a bit? Also it might make sense to log which file do we start to import and which one do we finish importing, so at the moment of the crash you would have a chain of imports which would be a useful data to look at.
Flags: needinfo?(gkrizsanits)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #34)
> I think this is a new bug, previously we did not crash in ReadCachedScript
> based on the stacks above. I don't know what has changed but if you have a
> totally different crash stack it might make sense to file a new bug for
> that.

Well no, I think this is still the same issue. See comment 26. Stepping over ObjectForLocation never reaches the next line - we segfault (in ImportInto) before then. The ReadCachedScript call is in ObjectForLocation. I guess I could keep using 'step into' rather than 'step over' to narrow that down further... as I noted in that comment, the weird thing is that we're inside an ImportInto call when we get here (ie the stack already has that, from (somewhere in) there we call ObjectForLocation, and then ReadCachedScript, and then ???? and then we crash, again in ImportInto - but ImportInto only appears in the stack once. Even if we do run JS (maybe as part of ReadCachedScript ?) I don't understand why ImportInto doesn't appear in the stack multiple times with JS in between. The stack doesn't look clobbered enough in either that crashreport or in "bt" in gdb to think that it is corrupt, yet this state of things makes no sense to me. Am I missing something obvious/stupid? Does this make sense to you?

> Could you elaborate on the "everything dies" part a bit?

Sorry, I just meant that trying to "step over" ("next" in gdb terms) that line hits the segfault. So the segfault is somewhere in the calltree under the ReadCachedScript call.

> Also it might
> make sense to log which file do we start to import and which one do we
> finish importing, so at the moment of the crash you would have a chain of
> imports which would be a useful data to look at.

Yes, this makes sense to me. I'd like to do that, but at the moment I'm travelling and because my mbp has busted RAM, I'm only carrying my surface pro, which doesn't like recompiling gecko (as in, it takes ~70 minutes, longer on battery) so I use artifact builds on it and can't do this easily right now. I can see if I have time to do this late tonight when I'm at my destination. Otherwise it'll have to be next week when I'm back at my main machine, from the looks of it.

In case this is easier, I *think* these are fairly conclusive STR (works at least 50% of the time for me) that you should be able to use if you want to look at this yourself:

0) clean profile, start Firefox
1) turn off e10s, restart
2) turn on (set to true) toolkit.telemetry.enabled in about:config . This is the default on official builds, but not when you build yourself, AFAICT.
3) open the web developer tools (ctrl-shift-k or cmd-opt-k), then click the "cog" / settings wheel. Then enable remote debugging and chrome + add-on debugging (bottom right of the settings pane).
5) load http://cpettitt.github.io/project/dagre-d3/latest/demo/user-defined.html in a tab.
6) restart. From the time of this start, you have about 50 seconds to complete the following steps because of the telemetry timer that's involved (which might be tricky if you have a debug build...)
7) restore the tabs from about:home to get back the page loaded in step 5
8) open the web developer tools
9) inspect the little house on the page
10) open the browser toolbox (ctrl-alt-shift-i on Windows/Linux - not sure on OS X off-hand)
11) switch to the debugger tab on the browser toolbox if it's not selected already. Wait for scripts to have come in, type 'inspector.js' in the debugger's search field, open that file
12) in the file, search for modifyAttributes, set a breakpoint on the first line
13) switch back to the main Firefox process, and in the inspector, change the class of one of the elements to add " foo" after the existing value of the attribute, and hit enter
14) the debugger stops on the breakpoint. Try expanding some of the variables on the right, and if you got here before the 60 seconds were up, you'll likely crash the main process doing that or if you "step" in the js debugger.
Flags: needinfo?(gijskruitbosch+bugs)
I can at least confirm that this crashes somewhere in trying to notify debuggers about a new script. Specifically, if I keep "step over" / "next" ing in gdb, the crash happens somewhere under:

#1  0x00007f718467fe33 in js::Debugger::slowPathOnNewScript(JSContext*, JS::Handle<JSScript*>) (cx=cx@entry=0x7f71796a3000, script=..., script@entry=...) at /home/gijs/fx-team/js/src/vm/Debugger.cpp:1612
#2  0x00007f718461feb6 in js::XDRScript<(js::XDRMode)1>(js::XDRState<(js::XDRMode)1>*, JS::Handle<JSObject*>, JS::Handle<JSScript*>, JS::Handle<JSFunction*>, JS::MutableHandle<JSScript*>) (script=..., cx=0x7f71796a3000) at /home/gijs/fx-team/js/src/vm/Debugger.h:1150
#3  0x00007f718461feb6 in js::XDRScript<(js::XDRMode)1>(js::XDRState<(js::XDRMode)1>*, JS::Handle<JSObject*>, JS::Handle<JSScript*>, JS::Handle<JSFunction*>, JS::MutableHandle<JSScript*>) (xdr=xdr@entry=0x7ffc629ed0e0, enclosingScopeArg=..., 
    enclosingScopeArg@entry=..., enclosingScript=..., enclosingScript@entry=..., fun=..., fun@entry=..., scriptp=..., 
    scriptp@entry=...) at /home/gijs/fx-team/js/src/jsscript.cpp:1224
#4  0x00007f71847e25bb in js::XDRState<(js::XDRMode)1>::codeScript(JS::MutableHandle<JSScript*>) (this=this@entry=0x7ffc629ed0e0, scriptp=scriptp@entry=...) at /home/gijs/fx-team/js/src/vm/Xdr.cpp:136
#5  0x00007f718459c16b in JS_DecodeScript(JSContext*, void const*, unsigned int) (cx=cx@entry=0x7f71796a3000, data=<optimized out>, length=<optimized out>) at /home/gijs/fx-team/js/src/jsapi.cpp:6231
#6  0x00007f718299a1ec in ReadCachedScript(mozilla::scache::StartupCache*, nsACString_internal&, JSContext*, nsIPrincipal*, JS::MutableHandle<JSScript*>) (cache=cache@entry=0x7f7178036280, uri=..., cx=0x7f71796a3000, systemPrincipal=<optimized out>, scriptp=scriptp@entry=...) at /home/gijs/fx-team/js/xpconnect/loader/mozJSLoaderUtils.cpp:33
#7  0x00007f7182997cac in mozJSComponentLoader::ObjectForLocation(ComponentLoaderInfo&, nsIFile*, JS::MutableHandle<JSObject*>, JS::MutableHandle<JSScript*>, char**, bool, JS::MutableHandle<JS::Value>) (this=this@entry=0x7f7170a49780, aInfo=..., aComponentFile=0x7f714066ef80, aObject=..., aTableScript=..., aLocation=0x7f7156dc1e40, aPropagateExceptions=true, aException=...)
    at /home/gijs/fx-team/js/xpconnect/loader/mozJSComponentLoader.cpp:691
#8  0x00007f7182998ec7 in mozJSComponentLoader::ImportInto(nsACString_internal const&, JS::Handle<JSObject*>, JSContext*, JS::MutableHandle<JSObject*>) (this=this@entry=0x7f7170a49780, aLocation=..., targetObj=..., 
    targetObj@entry=..., callercx=callercx@entry=0x7f71796a3000, vp=..., vp@entry=...)
    at /home/gijs/fx-team/js/xpconnect/loader/mozJSComponentLoader.cpp:1175
#9  0x00007f7182999898 in mozJSComponentLoader::Import(nsACString_internal const&, JS::Handle<JS::Value>, JSContext*, unsigned char, JS::MutableHandle<JS::Value>) (this=0x7f7170a49780, registryLocation=..., targetValArg=..., cx=0x7f71796a3000, optionalArgc=<optimized out>, retval=...) at /home/gijs/fx-team/js/xpconnect/loader/mozJSComponentLoader.cpp:1051
#10 0x00007f718299f375 in nsXPCComponents_Utils::Import(nsACString_internal const&, JS::Handle<JS::Value>, JSContext*, unsigned char, JS::MutableHandle<JS::Value>) (this=<optimized out>, registryLocation=..., targetObj=..., cx=0x7f71796a3000, optionalArgc=<optimized out>, retval=...) at /home/gijs/fx-team/js/xpconnect/src/XPCComponents.cpp:2502

Still don't understand why the stack is different by the time we crash. I'll try the printf() suggestion and see if that sheds light on this, but that'll be tomorrow or next week because it's getting late.
Uhh, so if I add a log here:

https://dxr.mozilla.org/mozilla-central/rev/494289c72ba3997183e7b5beaca3e0447ecaf96d/js/xpconnect/loader/mozJSComponentLoader.cpp#1166-1167

    if (!mImports.Get(info.Key(), &mod) && !mInProgressImports.Get(info.Key(), &mod)) {
>>>  printf("importinto for new module: %s\n", loc.get());
        newEntry = new ModuleEntry(callercx);

(loc's just an nsAutoCString copy of aLocation, an argument to ImportInto)

and another one when ObjectForLocation returns that logs:

printf("Removed %s from in-progress imports\n", loc.get());


and another one in an else to the above-quoted if() that logs:

        printf("ImportInto for cached module: %s\n", loc.get());

then this gets logged on a crashy run:

importinto for new module: resource://gre/modules/TelemetryStorage.jsm
ImportInto for cached module: resource://gre/modules/TelemetryStorage.jsm


Nothing inbetween. At all.

A non-busted run looks like this:

importinto for new module: resource://gre/modules/TelemetryStorage.jsm
ImportInto for cached module: resource://gre/modules/AppConstants.jsm
ImportInto for cached module: resource://gre/modules/Log.jsm
ImportInto for cached module: resource://gre/modules/Services.jsm
ImportInto for cached module: resource://gre/modules/XPCOMUtils.jsm
ImportInto for cached module: resource://gre/modules/osfile.jsm
ImportInto for cached module: resource://gre/modules/Task.jsm
ImportInto for cached module: resource://gre/modules/TelemetryUtils.jsm
ImportInto for cached module: resource://gre/modules/Promise.jsm
ImportInto for cached module: resource://gre/modules/Preferences.jsm
Removed resource://gre/modules/TelemetryStorage.jsm from in-progress imports


I... I'm not sure where to go from here. :-(
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(gkrizsanits)
(In reply to :Gijs Kruitbosch from comment #37)
> then this gets logged on a crashy run:
> 
> importinto for new module: resource://gre/modules/TelemetryStorage.jsm
> ImportInto for cached module: resource://gre/modules/TelemetryStorage.jsm

This is the exact signature I was looking for to confirm my theory in comment 29, and the recursively imported module is TelemetryStorage. If you could figure out where do we get the second import call then we could fix this. For that I think you can just dump the JS call stack, or you could just return some error and then you would get an exception in JS from the import call.

In fact I think this code:
if (!mImports.Get(info.Key(), &mod) && !mInProgressImports.Get(info.Key(), &mod)) {

Is trying to guard against this case just not doing a great job at it. It should be something like:

if (!mImports.Get(info.Key(), &mod)) {
  // do the load
} else if (mInProgressImports.Get(info.Key(), &mod)) {
  // return some error here about the recursion
}
Flags: needinfo?(gkrizsanits)
I attached full JS stacks as produced by DumpJSStack(), but I don't think those are very readable, so here's shorter ones:

0 XPCU_moduleLambda() ["resource://gre/modules/XPCOMUtils.jsm":273]
    this = [object BackstagePass]
1 XPCU_defineLazyGetter/<.get() ["resource://gre/modules/XPCOMUtils.jsm":198]
    this = [object BackstagePass]
2 TelemetrySendImpl._checkPendingPings<() ["resource://gre/modules/TelemetrySend.jsm":607]
7 TelemetrySendImpl.setup<() ["resource://gre/modules/TelemetrySend.jsm":588]
12 this.TelemetrySend.setup(testing = false) ["resource://gre/modules/TelemetrySend.jsm":183]
13 setupTelemetry/this._delayedInitTask<() ["resource://gre/modules/TelemetryController.jsm":673]


The other one looks like this:

0 XPCU_moduleLambda() ["resource://gre/modules/XPCOMUtils.jsm":273]
    this = [object BackstagePass]
1 XPCU_defineLazyGetter/<.get() ["resource://gre/modules/XPCOMUtils.jsm":198]
    this = [object BackstagePass]
2 Impl._loadSessionData<() ["resource://gre/modules/TelemetrySession.jsm":1972]
7 setupChromeProcess/this._delayedInitTask<() ["resource://gre/modules/TelemetrySession.jsm":1433]
13 this.DeferredTask.prototype._timerCallback/<() ["resource://gre/modules/DeferredTask.jsm":282]
19 this.DeferredTask.prototype._timerCallback([xpconnect wrapped nsITimer]) ["resource://gre/modules/DeferredTask.jsm":280]
20 bound ([xpconnect wrapped nsITimer]) ["self-hosted":752]


So number two here is basically going through:

https://dxr.mozilla.org/mozilla-central/rev/bccb11375f2af838cda714d42fd8cef78f5c7bf1/toolkit/components/telemetry/TelemetrySession.jsm#1429-1433

which is a delayed task that is started after a particular delay which ends up calling

https://dxr.mozilla.org/mozilla-central/rev/bccb11375f2af838cda714d42fd8cef78f5c7bf1/toolkit/components/telemetry/TelemetrySession.jsm#1972

which pokes at TelemetryStorage, which is a lazy getter and therefore imports the module.


Number 1 is *also* a delayed task:

https://dxr.mozilla.org/mozilla-central/rev/bccb11375f2af838cda714d42fd8cef78f5c7bf1/toolkit/components/telemetry/TelemetryController.jsm#667-673

which calls into TelemetrySend, which then here:

https://dxr.mozilla.org/mozilla-central/rev/bccb11375f2af838cda714d42fd8cef78f5c7bf1/toolkit/components/telemetry/TelemetrySend.jsm#607

also pokes at a lazy getter for TelemetryStorage and therefore imports the module.


Now... the confusing thing is these are not recursive stacks. They are separate stacks. Both of these timers fired and we're managing to try to run code off the second one while code off the first one has not run to completion, which is breaking us. I don't know enough about the JS engine / the debugger to figure out how that could ever happen, I just know it's messed up.
Flags: needinfo?(gkrizsanits)
So I've not tested guarding the import-while-pending exception thing. It's quite possible that would stop this blowing up as badly as it is here, now, in this case, but it sure looks like we're violating invariants (returning to the event loop to run code off a timer while not having run running JS code to completion) in a way that will just bite us more even if we address this particular nasty.
(In reply to :Gijs Kruitbosch from comment #41)
> So I've not tested guarding the import-while-pending exception thing. It's
> quite possible that would stop this blowing up as badly as it is here, now,
> in this case, but it sure looks like we're violating invariants (returning
> to the event loop to run code off a timer while not having run running JS
> code to completion) in a way that will just bite us more even if we address
> this particular nasty.

Yes this sounds bad. Bobby do you know what might trigger the event loop to spin in this case? At first we try to load a module from a generator and then the slowPathOnNewGlobalObject triggers the event loop somehow. My JS engine knowledge is very limited here and I cannot spot anything obvious... maybe the engine does not expect someone to call invoke from a generator and that confuses things. Who could take look at this?
Flags: needinfo?(gkrizsanits) → needinfo?(bobbyholley)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #42)
> (In reply to :Gijs Kruitbosch from comment #41)
> > So I've not tested guarding the import-while-pending exception thing. It's
> > quite possible that would stop this blowing up as badly as it is here, now,
> > in this case, but it sure looks like we're violating invariants (returning
> > to the event loop to run code off a timer while not having run running JS
> > code to completion) in a way that will just bite us more even if we address
> > this particular nasty.
> 
> Yes this sounds bad. Bobby do you know what might trigger the event loop to
> spin in this case? At first we try to load a module from a generator and
> then the slowPathOnNewGlobalObject triggers the event loop somehow. My JS
> engine knowledge is very limited here and I cannot spot anything obvious...

onNewGlobalObject just calls a JS callback. It will spin the event loop if the callback does something that spins the event loop.

> maybe the engine does not expect someone to call invoke from a generator and
> that confuses things. Who could take look at this?

I don't understand the question. You and Gijs can reproduce this issue, right? What expertise is missing?
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (busy) from comment #43)
> onNewGlobalObject just calls a JS callback. It will spin the event loop if
> the callback does something that spins the event loop.

A JS function can just do something that spins the event loop without returning? I mean it can do sync XHR and then there is a nested event loop that spins but that's not what we're looking for here. I thought you might now if there are some JS API that can spin the event loop, but never mind then.

> I don't understand the question. You and Gijs can reproduce this issue,
> right? What expertise is missing?

I haven't reproduced it yet, but will when I get there, it's just I have some e10s bugs that have higher prio so I have not got the time yet.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #44)
> I thought you might
> now if there are some JS API that can spin the event loop, but never mind
> then.

XPCOM exposes such APIs. We use them in other places such as the tab-modal implementations of window.alert() and friends. They are problematic there, too, but doing it in a callback like this seems... wrong? :-)

> > I don't understand the question. You and Gijs can reproduce this issue,
> > right? What expertise is missing?
> 
> I haven't reproduced it yet, but will when I get there, it's just I have
> some e10s bugs that have higher prio so I have not got the time yet.

If the JS debugger code is spinning the event loop from new global/script notifications, re-paging :ejpbruel / :fitzgen / :jimb about that. Folks, is that fixable? Why are we doing it at all?
Flags: needinfo?(nfitzgerald)
Flags: needinfo?(jimb)
Flags: needinfo?(ejpbruel)
(In reply to :Gijs Kruitbosch from comment #35)
> In case this is easier, I *think* these are fairly conclusive STR (works at
> least 50% of the time for me) that you should be able to use if you want to
> look at this yourself:

I tried these steps and could not reproduce it even after changing the timeout to 5 minutes :(

I wanted to trace where do we spin the event loop. Is it really the Invoke call? Or something else? If it's the invoke call what is on the JS stack? To me the JS callback for the debugger does not seem to be doing much so I would first try to make sure that it's really the invoke call or some other JS/Debugger API the debugger uses from onNewGlobalObject. I think the debugger spins the event loop in a restrictive/selective way, maybe it fires nsITimers accidentally that belongs to the Telemtry initialization, even though it should not. Eddy should know more about this...
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #46)
> (In reply to :Gijs Kruitbosch from comment #35)
> > In case this is easier, I *think* these are fairly conclusive STR (works at
> > least 50% of the time for me) that you should be able to use if you want to
> > look at this yourself:
> 
> I tried these steps and could not reproduce it even after changing the
> timeout to 5 minutes :(

Debug or opt build?

> I wanted to trace where do we spin the event loop. Is it really the Invoke
> call? Or something else? If it's the invoke call what is on the JS stack? To
> me the JS callback for the debugger does not seem to be doing much so I
> would first try to make sure that it's really the invoke call or some other
> JS/Debugger API the debugger uses from onNewGlobalObject. I think the
> debugger spins the event loop in a restrictive/selective way, maybe it fires
> nsITimers accidentally that belongs to the Telemtry initialization, even
> though it should not. Eddy should know more about this...

Mm, I hope they can enlighten us as to what's going on here.
(In reply to :Gijs Kruitbosch from comment #47)
> Debug or opt build?

Debug. Debugging opt builds is quite tricky.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #48)
> (In reply to :Gijs Kruitbosch from comment #47)
> > Debug or opt build?
> 
> Debug. Debugging opt builds is quite tricky.

Whereas I use opt because doing frontend work on debug builds is almost impossible. :-)

Can you check if you can reproduce the problem on an opt build? I wouldn't be terribly surprised if debug builds are more careful in places and somehow end up avoiding this issue. In fact, there's an assert specifically for the case we're looking at in debug builds: https://dxr.mozilla.org/mozilla-central/source/js/xpconnect/loader/mozJSComponentLoader.cpp#1204 ...
(In reply to :Gijs Kruitbosch from comment #49)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #48)
> 
> Can you check if you can reproduce the problem on an opt build? I wouldn't
> be terribly surprised if debug builds are more careful in places and somehow
> end up avoiding this issue. In fact, there's an assert specifically for the
> case we're looking at in debug builds:
> https://dxr.mozilla.org/mozilla-central/source/js/xpconnect/loader/
> mozJSComponentLoader.cpp#1204 ...

I'd be fine if instead of a crash I'd hit an assertion... also for most of the asserts
there is a crash that follows it shortly like in this case the next line after the assertion...

Anyway, I will try it whenever I get the time again but not sure if it's really useful to try and debug an opt build. Maybe a non optimized release build would get us somewhere.
We have to spin the event loop on new scripts because there are certain things asynchronous things we must do before allowing those new scripts to run (the main thing being set pending breakpoints).

See calls to `unsafeSynchronize` in `_addSource`, which is called from the new script handler: https://dxr.mozilla.org/mozilla-central/rev/55d557f4d73ee58664bdf2fa85aaab555224722e/devtools/server/actors/script.js#1948-2007

In general, you have to be prepared for arbitrary JS to run when creating scripts or globals, since debugger hooks get fired on those events, and that arbitrary JS includes things like spinning the event loop.
Flags: needinfo?(nfitzgerald)
(In reply to Nick Fitzgerald [:fitzgen] [⏰PDT; UTC-7] from comment #51)
> We have to spin the event loop on new scripts because there are certain
> things asynchronous things we must do before allowing those new scripts to
> run (the main thing being set pending breakpoints).
> 
> See calls to `unsafeSynchronize` in `_addSource`, which is called from the
> new script handler:
> https://dxr.mozilla.org/mozilla-central/rev/
> 55d557f4d73ee58664bdf2fa85aaab555224722e/devtools/server/actors/script.
> js#1948-2007
> 
> In general, you have to be prepared for arbitrary JS to run when creating
> scripts or globals, since debugger hooks get fired on those events, and that
> arbitrary JS includes things like spinning the event loop.

OK, so how do you propose we solve this case?

We're not done importing this module, and it's being imported again, and that's causing crashes. From what you're saying it sounds like we can't *not* spin the event loop, so we can't avoid the other code running and importing the same module a second time over.

We could throw on the second import, but that will cause behaviour changes for code that runs while the debugger is active... Importing it a second time despite the first import already being in progress sounds... scary.
Flags: needinfo?(nfitzgerald)
Flags: needinfo?(jimb)
Flags: needinfo?(ejpbruel)
(In reply to :Gijs Kruitbosch from comment #45)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #44)
> > I thought you might
> > now if there are some JS API that can spin the event loop, but never mind
> > then.
> 
> XPCOM exposes such APIs. We use them in other places such as the tab-modal
> implementations of window.alert() and friends. They are problematic there,
> too, but doing it in a callback like this seems... wrong? :-)
> 
> > > I don't understand the question. You and Gijs can reproduce this issue,
> > > right? What expertise is missing?
> > 
> > I haven't reproduced it yet, but will when I get there, it's just I have
> > some e10s bugs that have higher prio so I have not got the time yet.
> 
> If the JS debugger code is spinning the event loop from new global/script
> notifications, re-paging :ejpbruel / :fitzgen / :jimb about that. Folks, is
> that fixable? Why are we doing it at all?

Yes, the devtools server does run JS from the onNewGlobalObject callback; that's how the chrome debugger ensures that new globals become debuggees, to make activity within their scope visible to the debugger.

And yes, the devtools server does spin event loops while stopped in the midst of debuggee code. This is how the debugger's UI remains active even as the debuggee is stopped at a breakpoint. The devtools server tries to stop all other event handling in the debuggee while it's doing so, because letting handlers run obviously breaks the run-to-completion rule. But Gecko's architecture makes this hard. We recently landed a feature that will help us detect these problems.

However, we have *no* legitimate reason to combine those cases, and spin an event loop from onNewGlobalObject. The onNewGlobalObject handler should do nothing but call addGlobal on some Debugger. If this is occurring, I'm very interested in seeing a backtrace.

It might be the case that the onNewGlobalObject handler is trying to load some other module. That would also be a bug; it's not reasonable to expect the platform to handle that kind of reentrancy.

We do spin an event loop from onNewScript notifications, if breakpoints are set. This is an ongoing source of problems, and probably should never have been landed in the first place, but it's part of our current strategy for handling source maps. We'll be looking for alternatives to that strategy in O2. To see if this is part of the problem, you might try logging calls to unsafeSynchronize here:

https://hg.mozilla.org/mozilla-central/file/bccb11375f2a/devtools/server/actors/script.js#l1060
(In reply to :Gijs Kruitbosch from comment #45)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #44)
> > I thought you might
> > now if there are some JS API that can spin the event loop, but never mind
> > then.
> 
> XPCOM exposes such APIs. We use them in other places such as the tab-modal
> implementations of window.alert() and friends. They are problematic there,
> too, but doing it in a callback like this seems... wrong? :-)
> 
> > > I don't understand the question. You and Gijs can reproduce this issue,
> > > right? What expertise is missing?
> > 
> > I haven't reproduced it yet, but will when I get there, it's just I have
> > some e10s bugs that have higher prio so I have not got the time yet.
> 
> If the JS debugger code is spinning the event loop from new global/script
> notifications, re-paging :ejpbruel / :fitzgen / :jimb about that. Folks, is
> that fixable? Why are we doing it at all?

This is an extremely complicated and convoluted part of the debugger, but I’ll try to explain it as best as I can.

When a new script notification is fired, it introduces a new Debugger.Script instance to the debugger. Each Debugger.Script instance has a reference to a Debugger.Source instance, which is a reflection object for the source from which that script was compiled. Ignoring source maps for the time being, each of those Debugger.Source instances has a corresponding source actor, which represents the source to the client.

When you reload a page, it causes all sources for that page to be reloaded. This will cause all the scripts on the page to be recompiled. As a result, we get a new set of Debugger.Script instances. This set of Debugger.Script instances is distinct from the set of Debugger.Script instances we had before the reload, even if none of the sources changed after the reload. Consequently, the Debugger.Source instances we obtain from those Debugger.Script instances are also distinct from the ones we got from the previous set.

This poses a problem, because we now have two distinct sets of Debugger.Source instances. From the perspective of the client, however, nothing changed. All of the sources are still the same as they were before the reload. It is therefore reasonable for the client to expect to be able to talk to those sources using the same source actors as before. However, those source actors still refer to the old Debugger.Source instances. This is a problem, because those old Debugger.Source instances still refer to the old Debugger.Script instances. As a result, sending a request to send a breakpoint to a source actor would result in us setting the breakpoint on a script that is no longer being used, because it is from a previous page load.

To work around this, we create a new set of source actors for the new set of Debugger.Source instances. However, if we detect that we already had a source actor for an older Debugger.Source instance for the same source, instead of allocating an actor id for the new source actor, we reuse the actor id of the old source actor. From the perspective of the client, this effectively and transparently replaces the old source actor with the new source actor, since the client only ever refers to those by their actor id.

So, what does all of this have to do with event loops? Because reloading a page creates a new set of scripts, it wipes out all the breakpoints we had on the old set of scripts. We thus need to reset all our breakpoints on the new set of Debugger.Script instances. Breakpoints are stored by location in the debugger. We first obtain the source actor for that location. We then obtain the Debugger.Source instance for that source actor. And finally, we obtain the Debugger.Script instance from the Debugger.Source instance. We then set breakpoints on the appropriate byte code offsets on the Debugger.Script instance. We need to make sure all source actors point to the latest Debugger.Source instance, otherwise we will end up with an old Debugger.Script instance, and breakpoints will not work as expected. For this reason, we need to make sure all source actors are updated *before* we reset the breakpoints.

Unfortunately, when you add source maps to the mix, updating source actors becomes an asynchronous process. We create a source actor for each original source. However, from the perspective of the engine, those original sources don’t actually exist, so the Debugger.Source instance for these actors instead refers to the generated source that these original sources mapped to. When we get a new version of this Debugger.Source instance, we need to update the existing original source actors, but to do that, we need to fetch the source map. Since fetching a source map is an asynchronous process, the entire process of updating the source actors becomes asynchronous as well.

Without unsafeSynchronise, we would end up with only partially updated source actors before the breakpoints are reset, which can cause us to set breakpoints on old Debugger.Script instances. The end result is that your breakpoints sometimes don’t work.

The second reason that we use unsafeSynchronise is that resetting breakpoints on a source mapped source is similarly an asynchronous process. Breakpoints are stored by original location, but in reality are set on a generated location. To obtain that generated location, we need to fetch the source map. In the case of a top-level breakpoint on a source mapped source, not using unsafeSynchronise here could cause the script to start running (and pass through that breakpoint) before we get a chance to set it. The end result is, again, that your breakpoints sometimes don’t work.

Needless to say, this is all a horrible, horrible hack, and I am more than open to discuss ideas to redesign this. Jim Blandy has some ideas to replace our uses of unsafeSynchronise with breakpoints on the entry point of each frame. This would still cause us to enter a nested event loop, but now this would only happen from within a breakpoint, not a new script notification. I don’t know if that would fix your immediate problem, but I thought I’d point it out.
I have nothing to add that hasn't already been added.
Flags: needinfo?(nfitzgerald)
(In reply to Jim Blandy :jimb from comment #53)
> However, we have *no* legitimate reason to combine those cases, and spin an
> event loop from onNewGlobalObject. The onNewGlobalObject handler should do
> nothing but call addGlobal on some Debugger. If this is occurring, I'm very
> interested in seeing a backtrace.

You can see the backtrace in Comment 36. onNewGlobalObject ends up calling
onNewScript. Which if I understand it right spins the even loop. I agree
this is not really acceptable. Who could work on this?

> 
> It might be the case that the onNewGlobalObject handler is trying to load
> some other module. That would also be a bug; it's not reasonable to expect
> the platform to handle that kind of reentrancy.

I think it does not, it just spins the event loop and a timer fires which
tries to load the module.
Flags: needinfo?(jimb)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #56)
> (In reply to Jim Blandy :jimb from comment #53)
> > However, we have *no* legitimate reason to combine those cases, and spin an
> > event loop from onNewGlobalObject. The onNewGlobalObject handler should do
> > nothing but call addGlobal on some Debugger. If this is occurring, I'm very
> > interested in seeing a backtrace.
> 
> You can see the backtrace in Comment 36. onNewGlobalObject ends up calling
> onNewScript. Which if I understand it right spins the even loop. I agree
> this is not really acceptable. Who could work on this?

In comment 36, I don't see onNewGlobalObject -> onNewScript. A JS stack would be very valuable, perhaps disposative. I only see an invocation of slowPathOnNewScript from an import operation, which seems entirely appropriate. We do spin the event loop from onNewScript.

If we can find a way to reproduce this, the people to work on it would be :ejpbruel or :jlongster.
Flags: needinfo?(jimb)
(In reply to Jim Blandy :jimb from comment #57)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #56)
> > (In reply to Jim Blandy :jimb from comment #53)
> > > However, we have *no* legitimate reason to combine those cases, and spin an
> > > event loop from onNewGlobalObject. The onNewGlobalObject handler should do
> > > nothing but call addGlobal on some Debugger. If this is occurring, I'm very
> > > interested in seeing a backtrace.
> > 
> > You can see the backtrace in Comment 36. onNewGlobalObject ends up calling
> > onNewScript. Which if I understand it right spins the even loop. I agree
> > this is not really acceptable. Who could work on this?
> 
> In comment 36, I don't see onNewGlobalObject -> onNewScript. A JS stack
> would be very valuable, perhaps disposative. I only see an invocation of
> slowPathOnNewScript from an import operation, which seems entirely
> appropriate. We do spin the event loop from onNewScript.

Based on comment 53 and comment 54 it sounds like you'd want to not spin the event loop from onNewScript, but that's a significant project. Is that right?

As a faster workaround so people like the devtools and frontend team can actually use the browser toolbox without crashing all the time, could we just avoid doing this for chrome:// or resource:/// URIs which won't have source maps anyway? That would avoid the issue described here.

> If we can find a way to reproduce this, the people to work on it would be
> :ejpbruel or :jlongster.

There are STR in comment #35. I can reproduce reliably, also when running under rr, when using an opt build. It might be more difficult with a debug build based on :gabor's experience, I haven't tried myself. I can try to set up SSH to the VM with an rr backtrace if that makes this any easier to work on / understand.
Flags: needinfo?(ejpbruel)
(In reply to :Gijs Kruitbosch from comment #58)
> (In reply to Jim Blandy :jimb from comment #57)
> > (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #56)
> > > (In reply to Jim Blandy :jimb from comment #53)
> > > > However, we have *no* legitimate reason to combine those cases, and spin an
> > > > event loop from onNewGlobalObject. The onNewGlobalObject handler should do
> > > > nothing but call addGlobal on some Debugger. If this is occurring, I'm very
> > > > interested in seeing a backtrace.
> > > 
> > > You can see the backtrace in Comment 36. onNewGlobalObject ends up calling
> > > onNewScript. Which if I understand it right spins the even loop. I agree
> > > this is not really acceptable. Who could work on this?
> > 
> > In comment 36, I don't see onNewGlobalObject -> onNewScript. A JS stack
> > would be very valuable, perhaps disposative. I only see an invocation of
> > slowPathOnNewScript from an import operation, which seems entirely
> > appropriate. We do spin the event loop from onNewScript.
> 
> Based on comment 53 and comment 54 it sounds like you'd want to not spin the
> event loop from onNewScript, but that's a significant project. Is that right?
> 
Yes. I think this would be a significant project.

The general idea, as proposed by Jim, is that instead of spinning a nested event loop from onNewScript, you set a breakpoint on the entry point of each frame. If any debuggee code runs, it will hit one of those breakpoints, and we will spin the nested event loop then. This would be much simpler to reason about.

This is not hard to implement in theory. And in theory, theory and practice are the same. But in practice, they are not :-P I expect us to run into a lot of corner cases when implementing this, due to the fragile nature of this code. I would not be surprised if this took up half a quarter to implement. 

> As a faster workaround so people like the devtools and frontend team can
> actually use the browser toolbox without crashing all the time, could we
> just avoid doing this for chrome:// or resource:/// URIs which won't have
> source maps anyway? That would avoid the issue described here.
> 

Things are not quite that simple.

If you want breakpoints to behave correctly, you cannot do *anything* asynchronous from an onNewScript notification. When we fetch a source map, we return a promise that either resolves to the source map, or null if no source map exists. In both cases, we still end up using promise handlers, which are inherently asynchronous. As a result, if you remove the unsafeSynchronise call breakpoints can end up being set too late, even if you don't use source maps!

> > If we can find a way to reproduce this, the people to work on it would be
> > :ejpbruel or :jlongster.
> 
> There are STR in comment #35. I can reproduce reliably, also when running
> under rr, when using an opt build. It might be more difficult with a debug
> build based on :gabor's experience, I haven't tried myself. I can try to set
> up SSH to the VM with an rr backtrace if that makes this any easier to work
> on / understand.

This is a crasher, so if the problem lies with the debugger, we should assign the bug to that component and mark it as P1.

Before we do that, I'd like to better understand the problem, however. Entering a nested event loop is not a bug in itself, and in fact, is a critical feature that the debugger depends on. Why does entering a nested event loop cause the browser debugger to crash in this case? And is that the debugger's fault, or does the real problem lie elsewhere?
Flags: needinfo?(ejpbruel)
(In reply to Eddy Bruel [:ejpbruel] from comment #59)
> (In reply to :Gijs Kruitbosch from comment #58)
> > (In reply to Jim Blandy :jimb from comment #57)
> > > (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #56)
> > > > (In reply to Jim Blandy :jimb from comment #53)
> > > > > However, we have *no* legitimate reason to combine those cases, and spin an
> > > > > event loop from onNewGlobalObject. The onNewGlobalObject handler should do
> > > > > nothing but call addGlobal on some Debugger. If this is occurring, I'm very
> > > > > interested in seeing a backtrace.
> > > > 
> > > > You can see the backtrace in Comment 36. onNewGlobalObject ends up calling
> > > > onNewScript. Which if I understand it right spins the even loop. I agree
> > > > this is not really acceptable. Who could work on this?
> > > 
> > > In comment 36, I don't see onNewGlobalObject -> onNewScript. A JS stack
> > > would be very valuable, perhaps disposative. I only see an invocation of
> > > slowPathOnNewScript from an import operation, which seems entirely
> > > appropriate. We do spin the event loop from onNewScript.
> > 
> > Based on comment 53 and comment 54 it sounds like you'd want to not spin the
> > event loop from onNewScript, but that's a significant project. Is that right?
> > 
> Yes. I think this would be a significant project.
> 
> The general idea, as proposed by Jim, is that instead of spinning a nested
> event loop from onNewScript, you set a breakpoint on the entry point of each
> frame. If any debuggee code runs, it will hit one of those breakpoints, and
> we will spin the nested event loop then. This would be much simpler to
> reason about.
> 
> This is not hard to implement in theory. And in theory, theory and practice
> are the same. But in practice, they are not :-P I expect us to run into a
> lot of corner cases when implementing this, due to the fragile nature of
> this code. I would not be surprised if this took up half a quarter to
> implement. 
> 
> > As a faster workaround so people like the devtools and frontend team can
> > actually use the browser toolbox without crashing all the time, could we
> > just avoid doing this for chrome:// or resource:/// URIs which won't have
> > source maps anyway? That would avoid the issue described here.
> > 
> 
> Things are not quite that simple.
> 
> If you want breakpoints to behave correctly, you cannot do *anything*
> asynchronous from an onNewScript notification. When we fetch a source map,
> we return a promise that either resolves to the source map, or null if no
> source map exists. In both cases, we still end up using promise handlers,
> which are inherently asynchronous. As a result, if you remove the
> unsafeSynchronise call breakpoints can end up being set too late, even if
> you don't use source maps!

But if we synchronously know that there is no source map, we can immediately set breakpoints without spinning the event loop, right?

> > > If we can find a way to reproduce this, the people to work on it would be
> > > :ejpbruel or :jlongster.
> > 
> > There are STR in comment #35. I can reproduce reliably, also when running
> > under rr, when using an opt build. It might be more difficult with a debug
> > build based on :gabor's experience, I haven't tried myself. I can try to set
> > up SSH to the VM with an rr backtrace if that makes this any easier to work
> > on / understand.
> 
> This is a crasher, so if the problem lies with the debugger, we should
> assign the bug to that component and mark it as P1.
> 
> Before we do that, I'd like to better understand the problem, however.
> Entering a nested event loop is not a bug in itself, and in fact, is a
> critical feature that the debugger depends on. Why does entering a nested
> event loop cause the browser debugger to crash in this case? And is that the
> debugger's fault, or does the real problem lie elsewhere?

What happens is:

0) telemetry sets two timers that fire after the browser has been up for 60 seconds
1) when the first timer fires, it triggers the load of TelemetryStorage.jsm
2) that load spins the event loop if the debugger is up
3) we fire the second timer in the nested event loop
4) that triggers another load of TelemetryStorage.jsm, which is half-loaded (and can't be fully loaded because we're waiting on the debugger, apparently...)
5) trying to load the file when it's half loaded blows up. We could make it "just" return an exception, but that would break runtime behaviour of telemetry - Cu.import on a file that exists isn't supposed to fail in any way...

IMO the fundamental problem is that we spin the event loop as part of loading a Cu.import'd module, thus permitting the re-entrancy that we're seeing here, so yes, I would argue it's an issue with the debugger.

TBH, there's another point here, which is that we're spinning the event loop the first time we're loading a script at this URI, so we're guaranteed not to have any breakpoints to set anyway, because the current incarnation of the JS debugger has no concept of "future" breakpoints. Moreover, once a Cu.import'd module has been loaded, it will never be loaded again, so I don't think you'd ever get a new Debugger.Source object for it after the first time, so the whole problem of breakpoints, source mapping and so on just doesn't apply. I wouldn't mind optimizing based on that as well - either way, it seems this issue could be best solved by making the debugger not do the event loop spinning.

Does that help clarify what's going on?
Flags: needinfo?(ejpbruel)
(In reply to :Gijs Kruitbosch from comment #60)
> (In reply to Eddy Bruel [:ejpbruel] from comment #59)
> > (In reply to :Gijs Kruitbosch from comment #58)
> > > (In reply to Jim Blandy :jimb from comment #57)
> > > > (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #56)
> > > > > (In reply to Jim Blandy :jimb from comment #53)
> > > > > > However, we have *no* legitimate reason to combine those cases, and spin an
> > > > > > event loop from onNewGlobalObject. The onNewGlobalObject handler should do
> > > > > > nothing but call addGlobal on some Debugger. If this is occurring, I'm very
> > > > > > interested in seeing a backtrace.
> > > > > 
> > > > > You can see the backtrace in Comment 36. onNewGlobalObject ends up calling
> > > > > onNewScript. Which if I understand it right spins the even loop. I agree
> > > > > this is not really acceptable. Who could work on this?
> > > > 
> > > > In comment 36, I don't see onNewGlobalObject -> onNewScript. A JS stack
> > > > would be very valuable, perhaps disposative. I only see an invocation of
> > > > slowPathOnNewScript from an import operation, which seems entirely
> > > > appropriate. We do spin the event loop from onNewScript.
> > > 
> > > Based on comment 53 and comment 54 it sounds like you'd want to not spin the
> > > event loop from onNewScript, but that's a significant project. Is that right?
> > > 
> > Yes. I think this would be a significant project.
> > 
> > The general idea, as proposed by Jim, is that instead of spinning a nested
> > event loop from onNewScript, you set a breakpoint on the entry point of each
> > frame. If any debuggee code runs, it will hit one of those breakpoints, and
> > we will spin the nested event loop then. This would be much simpler to
> > reason about.
> > 
> > This is not hard to implement in theory. And in theory, theory and practice
> > are the same. But in practice, they are not :-P I expect us to run into a
> > lot of corner cases when implementing this, due to the fragile nature of
> > this code. I would not be surprised if this took up half a quarter to
> > implement. 
> > 
> > > As a faster workaround so people like the devtools and frontend team can
> > > actually use the browser toolbox without crashing all the time, could we
> > > just avoid doing this for chrome:// or resource:/// URIs which won't have
> > > source maps anyway? That would avoid the issue described here.
> > > 
> > 
> > Things are not quite that simple.
> > 
> > If you want breakpoints to behave correctly, you cannot do *anything*
> > asynchronous from an onNewScript notification. When we fetch a source map,
> > we return a promise that either resolves to the source map, or null if no
> > source map exists. In both cases, we still end up using promise handlers,
> > which are inherently asynchronous. As a result, if you remove the
> > unsafeSynchronise call breakpoints can end up being set too late, even if
> > you don't use source maps!
> 
> But if we synchronously know that there is no source map, we can immediately
> set breakpoints without spinning the event loop, right?
> 

We definitely could, but it would require us to write the synchronous case separate from the asynchronous one (which is not what we do now).

> > > > If we can find a way to reproduce this, the people to work on it would be
> > > > :ejpbruel or :jlongster.
> > > 
> > > There are STR in comment #35. I can reproduce reliably, also when running
> > > under rr, when using an opt build. It might be more difficult with a debug
> > > build based on :gabor's experience, I haven't tried myself. I can try to set
> > > up SSH to the VM with an rr backtrace if that makes this any easier to work
> > > on / understand.
> > 
> > This is a crasher, so if the problem lies with the debugger, we should
> > assign the bug to that component and mark it as P1.
> > 
> > Before we do that, I'd like to better understand the problem, however.
> > Entering a nested event loop is not a bug in itself, and in fact, is a
> > critical feature that the debugger depends on. Why does entering a nested
> > event loop cause the browser debugger to crash in this case? And is that the
> > debugger's fault, or does the real problem lie elsewhere?
> 
> What happens is:
> 
> 0) telemetry sets two timers that fire after the browser has been up for 60
> seconds
> 1) when the first timer fires, it triggers the load of TelemetryStorage.jsm
> 2) that load spins the event loop if the debugger is up
> 3) we fire the second timer in the nested event loop
> 4) that triggers another load of TelemetryStorage.jsm, which is half-loaded
> (and can't be fully loaded because we're waiting on the debugger,
> apparently...)
> 5) trying to load the file when it's half loaded blows up. We could make it
> "just" return an exception, but that would break runtime behaviour of
> telemetry - Cu.import on a file that exists isn't supposed to fail in any
> way...
> 
> IMO the fundamental problem is that we spin the event loop as part of
> loading a Cu.import'd module, thus permitting the re-entrancy that we're
> seeing here, so yes, I would argue it's an issue with the debugger.
> 
> TBH, there's another point here, which is that we're spinning the event loop
> the first time we're loading a script at this URI, so we're guaranteed not
> to have any breakpoints to set anyway, because the current incarnation of
> the JS debugger has no concept of "future" breakpoints. Moreover, once a
> Cu.import'd module has been loaded, it will never be loaded again, so I
> don't think you'd ever get a new Debugger.Source object for it after the
> first time, so the whole problem of breakpoints, source mapping and so on
> just doesn't apply. I wouldn't mind optimizing based on that as well -
> either way, it seems this issue could be best solved by making the debugger
> not do the event loop spinning.
> 
> Does that help clarify what's going on?

The reason for spinning up a nested event loop is that we want to create the illusion that the debuggee is paused, but still want to be able to talk to the debugger, which runs in the same thread. So nested event loops should process events that are processed by the debugger, but they should not cause any debuggee code to run. In particular, that second time should not fire until after we leave the nested event loop.

Unfortunately, nested event loops is another thing that is horribly broken, in the sense that there are plenty of ways in which debuggee events can still be processed, even though they shouldn't. Known cases are sync XHR events, promise handlers, and postMessage events. Apparently, timers should be added to that list as well. Sigh.

With that, we now have two ways to fix this: either don't use unsafeSynchronise in onNewScript, or make sure that debuggee timers don't fire while we are inside a nested event loop. Both are significant projects.

The best short term solution I can come up with is the one you suggested: source maps can already be configured on/off for the debugger, so let's rewrite this code so that we don't use any asynchronous calls if source maps are off. Doing so would allow us not to use unsafeSynchronise when source maps are off.
Flags: needinfo?(ejpbruel)
(In reply to Eddy Bruel [:ejpbruel] from comment #61)
> With that, we now have two ways to fix this: either don't use
> unsafeSynchronise in onNewScript, or make sure that debuggee timers don't
> fire while we are inside a nested event loop. Both are significant projects.
> 
> The best short term solution I can come up with is the one you suggested:
> source maps can already be configured on/off for the debugger, so let's
> rewrite this code so that we don't use any asynchronous calls if source maps
> are off. Doing so would allow us not to use unsafeSynchronise when source
> maps are off.

OK. Can either you or :jlongster pick this up? :-)
Flags: needinfo?(ejpbruel)
Duplicate of this bug: 1232709
The chrome module loading case is interesting. Unlike content scripts, module loads have a synchronous API: you call the 
import function, which loads the module and then returns.

Certainly special-casing chrome URLs to avoid spinning the event loop is the best fix for this.

But what if we did want to support source maps for chrome URLs? Is that just impossible? That seems ridiculous.

The synchronous import behavior for chrome can't be reconciled with any asynchronous behavior during either loading or execution. Suppose Telemetry.jsm did have source maps. My suggested hack of setting a breakpoint at the script's entry point doesn't help: if we load Telemetry.jsm, set a breakpoint at its entry point, hit the breakpoint and spin while loading the source map, then the second timer will find itself interrupting the first import, just as before.

But the solution is actually the same: chrome URLs are safe to fetch synchronously. Either at onNewScript time or at entry breakpoint time, we could see that the source map URL was a chrome URL, and synchronously fetch and apply the source map.
Gijs, thanks for the explanation in comment 60. Now I really feel like I know what's going on, and even how to write tests for it that reproduce the problem reliably.
(In reply to :Gijs Kruitbosch from comment #62)
> (In reply to Eddy Bruel [:ejpbruel] from comment #61)
> > With that, we now have two ways to fix this: either don't use
> > unsafeSynchronise in onNewScript, or make sure that debuggee timers don't
> > fire while we are inside a nested event loop. Both are significant projects.
> > 
> > The best short term solution I can come up with is the one you suggested:
> > source maps can already be configured on/off for the debugger, so let's
> > rewrite this code so that we don't use any asynchronous calls if source maps
> > are off. Doing so would allow us not to use unsafeSynchronise when source
> > maps are off.
> 
> OK. Can either you or :jlongster pick this up? :-)

I will put this on my todo for next week.
Assignee: nobody → ejpbruel
Flags: needinfo?(ejpbruel)
Priority: -- → P1
This patch refactors addSource so that it still does the same thing as before if source maps are enabled. However, if source maps are disabled, it now carefully avoids the use of unsafeSynchronise. This keeps the browser from crashing when using the browser debugger, provided you disable source mapping for the latter.

This solution isn't perfect, but it's the best we can do given the constraints we have (we can't stop unsafeSynchronize from running that timer without refactoring Gecko's event model). Not being able to use source maps in the browser debugger doesn't seem like too big of a deal, since we hardly use any source maps for chrome code anyway (if any!).

With the above patch in place, I'd like to suggest that we disable source maps for the browser debugger by default (but keep it enabled for the normal debugger, of course). James, what could would I have to change to accomplish that?
Flags: needinfo?(jlong)
Attachment #8740877 - Flags: review?(jlong)
(In reply to Eddy Bruel [:ejpbruel] from comment #67)
> 
> With the above patch in place, I'd like to suggest that we disable source
> maps for the browser debugger by default (but keep it enabled for the normal
> debugger, of course). James, what could would I have to change to accomplish
> that?

It's just a pref that is pulled here and paused in when attaching the thread: https://github.com/mozilla/gecko-dev/blob/master/devtools/client/framework/attach-thread.js#L50

I suppose that is a good place to detect if it's the browser debugger and force it to false.
Flags: needinfo?(jlong)
Comment on attachment 8740877 [details] [diff] [review]
Carefully avoid unsafeSynchronize when source maps are disabled.

Review of attachment 8740877 [details] [diff] [review]:
-----------------------------------------------------------------

I don't want to r+ this patch, but I don't know what I'd recommend. `_addSource` is getting horrific. I suppose this is fine a short-term solution.

Luckily, moving the sourcemap to the client completely solves all of this. I know we still want to somehow pause the engine and sourcemap breakpoints before we set them, but since we plan to do that as an enhancement on top of the client-sourcemap architecture, it doesn't block that work, and can easily be turned off if it causes horrific problems like this. I'm still a little skeptical that we'll fine a proper hack that doesn't introduce these kinds of problem. Pausing the engine on a breakpoint would not have solved this bug.

Are there bugs filed for moving the sourcemap to the client? Can you link to that bug in `_addSource` with a note that we're going to remove all of these hacks?
Attachment #8740877 - Flags: review?(jlong) → review+
(In reply to James Long (:jlongster) from comment #68)
> (In reply to Eddy Bruel [:ejpbruel] from comment #67)
> > 
> > With the above patch in place, I'd like to suggest that we disable source
> > maps for the browser debugger by default (but keep it enabled for the normal
> > debugger, of course). James, what could would I have to change to accomplish
> > that?
> 
> It's just a pref that is pulled here and paused in when attaching the
> thread:
> https://github.com/mozilla/gecko-dev/blob/master/devtools/client/framework/
> attach-thread.js#L50
> 
> I suppose that is a good place to detect if it's the browser debugger and
> force it to false.

Here's where we set other Browser Toolbox specific prefs:

https://dxr.mozilla.org/mozilla-central/source/devtools/client/framework/toolbox-process-window.js#57-65

It could be easiest to set the pref value you need there.
Try run for this patch:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa105db6630d

I will write a separate patch for the pref value.
Whiteboard: leave-open
Going to post a patch for the final step to set the pref, hopefully :ejpbruel won't mind...
Comment on attachment 8743018 [details]
MozReview Request: Bug 1225160 - Disable source maps for browser debugging. r=ejpbruel

:Gijs, can you verify that browser debugging no longer crashes with this patch applied?
Attachment #8743018 - Flags: feedback?(gijskruitbosch+bugs)
Comment on attachment 8743018 [details]
MozReview Request: Bug 1225160 - Disable source maps for browser debugging. r=ejpbruel

This fixes things for me.
Attachment #8743018 - Flags: feedback?(gijskruitbosch+bugs) → feedback+
Comment on attachment 8743018 [details]
MozReview Request: Bug 1225160 - Disable source maps for browser debugging. r=ejpbruel

:ejpbruel gave r+ on IRC.

Try looks good: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6e47430c95f1
Attachment #8743018 - Flags: review?(ejpbruel) → review+
https://hg.mozilla.org/mozilla-central/rev/b29a652d68e8
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Thanks so much for fixing this, it was driving me a bit insane.

Good work!!
You need to log in before you can comment on or make changes to this bug.