Closed Bug 958980 Opened 9 years ago Closed 9 years ago

crash in js::SN_IS_TERMINATOR(unsigned char*)

Categories

(Core :: JavaScript Engine, defect)

27 Branch
x86
Windows NT
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla29
Tracking Status
firefox26 --- unaffected
firefox27 + verified
firefox28 + verified
firefox29 + verified

People

(Reporter: alice0775, Assigned: shu)

References

Details

(Keywords: crash, regression, reproducible, Whiteboard: [dupe me])

Crash Data

Attachments

(2 files, 1 obsolete file)

Attached file testcase (obsolete) —
This bug was filed from the Socorro interface and is 
report bp-4ca3b08a-0c49-4095-9c4c-7ef652140112.
=============================================================

This crash happens on Nightly29.0a1, holly29.0a1, Aurora28.0a2 as well as Firefox27.b5

Steps To Reproduce:
1. Start Firefox with newly created profile
2. Open testcase
3. Open debugger(Ctrl+Shift+S)

(4. Repeat from Step1 if necessary)

Actual Results:
Crash
Regression window(m-c)
Good:
http://hg.mozilla.org/mozilla-central/rev/679410425704
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0 ID:20131121113308
Bad:
http://hg.mozilla.org/mozilla-central/rev/cdbe5f3adeff
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0 ID:20131121114210
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=679410425704&tochange=cdbe5f3adeff


Regression window(m-i)
Good:
http://hg.mozilla.org/integration/mozilla-inbound/rev/d9abdb232a25
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0 ID:20131120224907
Bad:
http://hg.mozilla.org/integration/mozilla-inbound/rev/22a26c92bf00
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0 ID:20131120225324
Pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=d9abdb232a25&tochange=22a26c92bf00

Suspected: 
 9ee5c2664d78	Shu-yu Guo — Bug 933882 - Invalidate JIT code instead of doing full GC on debug mode toggle. (r=bhackett)
Blocks: 933882
Keywords: regression
Attached file testcase —
Attachment #8358979 - Attachment is obsolete: true
Step3 of STR has to perform it after execution of Step2 as promptly as possible.
The perils of GC-dependent API.
Attachment #8359003 - Flags: review?(jimb)
Assignee: nobody → shu
Status: NEW → ASSIGNED
I applied the patch and tried the repro steps of the duplicate at bug 953261 comment 4. About 75% of the time, it doesn't crash anymore. But sometimes it crashes at this stack. It might be an independent bug, or perhaps this patch just delayed the crash?

xul!mozilla::dom::CallbackObject::Callback
xul!nsEventListenerInfo::GetJSVal
xul!nsEventListenerInfo::GetListenerObject
xul!NS_InvokeByIndex
xul!XPCWrappedNative::CallMethod
xul!XPC_WN_GetterSetter
mozjs!js::Invoke
(In reply to David Major [:dmajor] from comment #6)
> I applied the patch and tried the repro steps of the duplicate at bug 953261
> comment 4. About 75% of the time, it doesn't crash anymore. But sometimes it
> crashes at this stack. It might be an independent bug, or perhaps this patch
> just delayed the crash?
> 
> xul!mozilla::dom::CallbackObject::Callback
> xul!nsEventListenerInfo::GetJSVal
> xul!nsEventListenerInfo::GetListenerObject
> xul!NS_InvokeByIndex
> xul!XPCWrappedNative::CallMethod
> xul!XPC_WN_GetterSetter
> mozjs!js::Invoke

David, that stack is unfamiliar to me. Is there an assert or something? What's going on in xul!mozilla::dom::CallbackObject::Callback?
It's an opt build since we had trouble with debug builds on the other bug. The |this| of the CallbackObject is null. It looks like that |this| corresponds to |Ptr()| at: https://hg.mozilla.org/mozilla-central/file/80a27198344a/content/events/src/nsEventListenerService.cpp#l97

The full stack is here (I didn't want to pollute this bug in case it's a different issue): https://pastebin.mozilla.org/4006317 There is some jsinspector stuff at line 35 if that offers any clues.
(In reply to David Major [:dmajor] from comment #8)
> It's an opt build since we had trouble with debug builds on the other bug.
> The |this| of the CallbackObject is null. It looks like that |this|
> corresponds to |Ptr()| at:
> https://hg.mozilla.org/mozilla-central/file/80a27198344a/content/events/src/
> nsEventListenerService.cpp#l97
> 
> The full stack is here (I didn't want to pollute this bug in case it's a
> different issue): https://pastebin.mozilla.org/4006317 There is some
> jsinspector stuff at line 35 if that offers any clues.

Sorry, no ideas here. This bug is because of the Debugger API letting you access a not-fully-initialized JSScript. I don't think that can get into a CallbackObject though. Jim, correct me if I'm wrong.
Flags: needinfo?(jimb)
Firefox27b7 : bp-7bdbcbf6-6322-453f-8216-7e1d12140118

Please fix. this beta cycle is two weeks only, becomes end.
Comment on attachment 8359003 [details] [diff] [review]
Stop Debugger from exposing partially initialized JSScripts.

Review of attachment 8359003 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/Debugger.cpp
@@ +2609,5 @@
> +        // It is possible that the script was created and thus exposed to GC,
> +        // but *not* fully initialized from fullyInit{FromEmitter,Trivial} due
> +        // to errors.
> +        if (!script->code())
> +            return;

Great. Please add a blank line before the comment.
Attachment #8359003 - Flags: review?(jimb) → review+
(In reply to Shu-yu Guo [:shu] from comment #9)
> (In reply to David Major [:dmajor] from comment #8)
> > The full stack is here (I didn't want to pollute this bug in case it's a
> > different issue): https://pastebin.mozilla.org/4006317 There is some
> > jsinspector stuff at line 35 if that offers any clues.
> 
> Sorry, no ideas here. This bug is because of the Debugger API letting you
> access a not-fully-initialized JSScript. I don't think that can get into a
> CallbackObject though. Jim, correct me if I'm wrong.

Yes, that's right. Debugger.prototype.findScripts scans the debuggee compartments for all JSScripts that match the query it is given. It scans by actually iterating over all the objects allocated in the Zone, not by following references from one object to another --- which means that it may return scripts that are not actually reachable. This would include scripts that were partially constructed, but then dropped due to an error.

Shu, once you found the problem, were you able to identify better steps to reproduce? Can we force an error to occur at the right place, for example?
Flags: needinfo?(jimb) → needinfo?(shu)
(In reply to Jim Blandy :jimb from comment #13)
> (In reply to Shu-yu Guo [:shu] from comment #9)
> > (In reply to David Major [:dmajor] from comment #8)
> > > The full stack is here (I didn't want to pollute this bug in case it's a
> > > different issue): https://pastebin.mozilla.org/4006317 There is some
> > > jsinspector stuff at line 35 if that offers any clues.
> > 
> > Sorry, no ideas here. This bug is because of the Debugger API letting you
> > access a not-fully-initialized JSScript. I don't think that can get into a
> > CallbackObject though. Jim, correct me if I'm wrong.
> 
> Yes, that's right. Debugger.prototype.findScripts scans the debuggee
> compartments for all JSScripts that match the query it is given. It scans by
> actually iterating over all the objects allocated in the Zone, not by
> following references from one object to another --- which means that it may
> return scripts that are not actually reachable. This would include scripts
> that were partially constructed, but then dropped due to an error.
> 
> Shu, once you found the problem, were you able to identify better steps to
> reproduce? Can we force an error to occur at the right place, for example?

STR with timing bugs are always hard. STR that require a particularly unfortunate interleaving is still an open problem for software engineering AFAICT (where's cjones's record & replay system? :)). For an easy test case in the shell you'd need the ability to interleave (not necessarily multithreading, but some ability to call D.p.findScripts *while* parsing some other thread) and fine control over suppressing/triggering GC. Can we do that in the shell? I haven't looked too hard, but my intuition is no.

I'm not sure what you mean by "force an error to occur at the right place".
Flags: needinfo?(shu)
https://hg.mozilla.org/mozilla-central/rev/5bc5af785b3b
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
(In reply to Shu-yu Guo [:shu] from comment #14)
> STR with timing bugs are always hard. STR that require a particularly
> unfortunate interleaving is still an open problem for software engineering
> AFAICT (where's cjones's record & replay system? :)). For an easy test case
> in the shell you'd need the ability to interleave (not necessarily
> multithreading, but some ability to call D.p.findScripts *while* parsing
> some other thread) and fine control over suppressing/triggering GC. Can we
> do that in the shell? I haven't looked too hard, but my intuition is no.
> 
> I'm not sure what you mean by "force an error to occur at the right place".

Wait, is this bug caused by inter-thread timing? Off-thread compilation always does its work in a separate compartment, which the debugger should never be able to observe; if findScripts is somehow encountering scripts that are in mid-construction, then that's a very serious bug.

I thought this bug was caused by fullyInit{FromEmitter,Trivial} returning an error, and leaving a partially-constructed script in the Zone as garbage. If that's the case, then if the shell could force the error to occur in the right place, we could test it reliably.

If the shell can't force such an error to occur now, I don't think it's worth adding such a primitive just to test this.
(In reply to Jim Blandy :jimb from comment #16)
> (In reply to Shu-yu Guo [:shu] from comment #14)
> > STR with timing bugs are always hard. STR that require a particularly
> > unfortunate interleaving is still an open problem for software engineering
> > AFAICT (where's cjones's record & replay system? :)). For an easy test case
> > in the shell you'd need the ability to interleave (not necessarily
> > multithreading, but some ability to call D.p.findScripts *while* parsing
> > some other thread) and fine control over suppressing/triggering GC. Can we
> > do that in the shell? I haven't looked too hard, but my intuition is no.
> > 
> > I'm not sure what you mean by "force an error to occur at the right place".
> 
> Wait, is this bug caused by inter-thread timing? Off-thread compilation
> always does its work in a separate compartment, which the debugger should
> never be able to observe; if findScripts is somehow encountering scripts
> that are in mid-construction, then that's a very serious bug.
> 

I don't know if there are actually threads or just polling, but this is what's going on:

    +------------------------------+                +------------------------------+
    |      Main Thread             |                |    UI Thread                 |
    |------------------------------|                |------------------------------|
    | Start parsing bad.js         |                |                              |
    | Allocate JSScript for bad.js |                |                              |
    |                              |                |                              |
    |                              |                | Open Debugger panel          |
    |                              |                | D.p.findScripts finds bad.js |
    |                              |                |                              |
    | Parse error                  |                |                              |
    | ...                          |                |                              |
    | GC                           |                |                              |
    +------------------------------+                +------------------------------+

I don't know how to interrupt the parsing process to make a Debugger and call findScripts inside the shell. The error happens in the right place (in the JS frontend), it just happens *after* the Debugger has already opened, which is why Alice's STR say that the Debugger must be opened quickly after the navigating to the page.

This patch prevents crashes by ignoring not fully initialized scripts. The frontend should probably block until all scripts have been parsed.
Thank you for the ASCII art! I'm not sure what you mean by "Main Thread" and "UI Thread". The Debugger UI runs on the main thread.

Do you mean that:
- We call JS::CompileOffThreadScript, which then queues a job for a worker thread.
- The worker thread does the compilation (WorkerThread::handleParseWorkload) in a separate, temporary compartment which Debugger should not be able to see at all, and encounters an error.
- The main thread eventually calls JS::FinishOffThreadScript, which waits for the worker thread to finish, and then merges the temporary compartment with the one that originally requested the compilation (WorkerThread::finishParseTask, called from JS::FinishOffThreadScript).
- Then, D.p.findScripts finds the dropped JSScript in the merged compartment.

Note that finishParseTask does the compartment merge even if an error is thrown.

If that's right, then the question is, why can't we get the same error (and thus the same mutant JSScript) without using off-thread compilation?
I think the "quickly" is simply because the GC will eventually clean up the mutant JSScript.
This would reproduce the crash, wouldn't it?  It crashes on my shell, which doesn't contain your fix's changeset:

var text = "function drag(ev)\n\
{\n\
\n\
";

var g = newGlobal();
var dbg = new Debugger(g);
g.text = text;
g.eval('offThreadCompileScript(text, { filename: "moof" });');
g.eval('try { runOffThreadScript(); } catch (ex) { }');
for (s of dbg.findScripts({})) {
  print(s.url + " " + s.startLine + " " + s.lineCount);
}
Some of that's noise; this is a little better:

var g = newGlobal();
var dbg = new Debugger(g);
g.eval('offThreadCompileScript("function drag(ev) {");');
g.eval('try { runOffThreadScript(); } catch (ex) { }');
for (s of dbg.findScripts())
  s.lineCount;
Actually, all the off-thread compilation stuff is a red herring; this crashes:

var g = newGlobal();
var dbg = new Debugger(g);
try { g.eval('function drag(ev) {'); } catch (ex) { }
for (s of dbg.findScripts())
  s.lineCount;
Depends on: 962441
Filed follow-up bug 962441 to add the regression test.
(Can you tell I've got a test tomorrow?)
(In reply to Jim Blandy :jimb from comment #22)
> Actually, all the off-thread compilation stuff is a red herring; this
> crashes:
> 
> var g = newGlobal();
> var dbg = new Debugger(g);
> try { g.eval('function drag(ev) {'); } catch (ex) { }
> for (s of dbg.findScripts())
>   s.lineCount;

Oh nice, how simple! I thought it would be about interleavings, but it really is only about GC timing.

I also didn't really know how the Debugger UI worked. I assumed the UI had a different thread for responsiveness, and that played into it. Too many assumptions.
In retrospect, it doesn't make sense at all why I thought the error from the frontend would immediately collect the errant JSScript, and required off-thread-compilation to trigger.
Reproduced in nightly 2014-01-17.
Verified fixed FF 29.0a1 2014-01-22, win 7 x64.
(In reply to Shu-yu Guo [:shu] from comment #26)
> In retrospect, it doesn't make sense at all why I thought the error from the
> frontend would immediately collect the errant JSScript, and required
> off-thread-compilation to trigger.

It also shows a shortcoming in our testing --- this is a case that could easily have been covered.
Can you please request uplift on this with risk analysis. If low risk this should get into one of last beta's going to build tomorrow. Thanks !
Flags: needinfo?(shu)
Comment on attachment 8359003 [details] [diff] [review]
Stop Debugger from exposing partially initialized JSScripts.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 933882
User impact if declined: GC timing related crashes when using Debugger
Testing completed (on m-c, etc.): on m-c
Risk to taking this patch (and alternatives if risky): none
String or IDL/UUID changes made by this patch: none
Attachment #8359003 - Flags: approval-mozilla-beta?
Attachment #8359003 - Flags: approval-mozilla-aurora?
Flags: needinfo?(shu)
(In reply to Jim Blandy :jimb from comment #28)
> It also shows a shortcoming in our testing --- this is a case that could
> easily have been covered.
So, do you think this is not completely fixed?
I marked this verified since the initial crash is no longer reproducible on ff 29
Flags: needinfo?(jimb)
Attachment #8359003 - Flags: approval-mozilla-beta?
Attachment #8359003 - Flags: approval-mozilla-beta+
Attachment #8359003 - Flags: approval-mozilla-aurora?
Attachment #8359003 - Flags: approval-mozilla-aurora+
Keywords: verifyme
Verified as fixed on FF 27 beta 9
Build Id: 20140123185438

Environments used:
Win XP x86, Win7 X64

I've used the STR provided in comment 0 and the bug is not reproducing anymore.
(In reply to Paul Silaghi, QA [:pauly] from comment #31)
> (In reply to Jim Blandy :jimb from comment #28)
> > It also shows a shortcoming in our testing --- this is a case that could
> > easily have been covered.
> So, do you think this is not completely fixed?
> I marked this verified since the initial crash is no longer reproducible on
> ff 29

No, no, I just meant that more thorough testing when the feature was first implemented could have caught this bug sooner. This bug is fixed.
(In reply to Jim Blandy :jimb from comment #35)

Jim, did you mean to resent the status flags?
based on comments 27, 34
Status: RESOLVED → VERIFIED
Verified as fixed on Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0 (20140126004002).

I reproduced this issue on a 01/17 Aurora build by simply repeating the steps from comment 0 2 or 3 times. On the 01/26 Aurora it didn't reproduce after multiple tries.

Socorro doesn't show any crashes on builds post 01/23 either:
https://crash-stats.mozilla.com/report/list?signature=js%3A%3ASN_IS_TERMINATOR%28unsigned+char*%29&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&hang_type=any&date=2014-01-27+12%3A00%3A00&range_value=4#reports
You need to log in before you can comment on or make changes to this bug.