Closed Bug 1053999 Opened 5 years ago Closed 5 years ago

crash in JS_TransplantObject(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>)

Categories

(Core :: JavaScript Engine, defect, critical)

32 Branch
x86
Windows NT
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla36
Tracking Status
firefox31 --- unaffected
firefox32 + wontfix
firefox33 + wontfix
firefox34 + verified
firefox35 + verified
firefox36 --- verified
b2g-v1.4 --- wontfix
b2g-v2.0 --- wontfix
b2g-v2.0M --- wontfix
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: jbecerra, Assigned: sfink)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(7 files)

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-1d68b20f-fade-4243-adbf-db67d2140813.
=============================================================

There's been a spike in this signature in recent days. I can't tell if they started spiking with builds from 8/11 or 8/07. I can't currently display a graph for this signature. These are happening mostly on Windows 7 and Windows XP. There are some dupes. The list of URLs also has a lot of ask.com URLs. There are very few comments.

This is currently at the top #4 the crashers list for Fx32 Beta, which is scheduled to ship in about two weeks. 

More reports at: https://crash-stats.mozilla.com/report/list?product=Firefox&signature=JS_TransplantObject%28JSContext%2A%2C+JS%3A%3AHandle%3CJSObject%2A%3E%2C+JS%3A%3AHandle%3CJSObject%2A%3E%29

0 	mozjs.dll 	JS_TransplantObject(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>) 	js/src/jsapi.cpp
1 	xul.dll 	xpc::TransplantObject(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>) 	js/xpconnect/wrappers/WrapperFactory.cpp
2 	xul.dll 	nsGlobalWindow::SetNewDocument(nsIDocument*, nsISupports*, bool) 	dom/base/nsGlobalWindow.cpp
3 	xul.dll 	nsDocumentViewer::InitInternal(nsIWidget*, nsISupports*, nsIntRect const&, bool, bool, bool) 	layout/base/nsDocumentViewer.cpp
4 	xul.dll 	nsDocumentViewer::Init(nsIWidget*, nsIntRect const&) 	layout/base/nsDocumentViewer.cpp
5 	xul.dll 	nsDocShell::SetupNewViewer(nsIContentViewer*) 	docshell/base/nsDocShell.cpp
6 	xul.dll 	nsDocShell::Embed(nsIContentViewer*, char const*, nsISupports*) 	docshell/base/nsDocShell.cpp
7 	xul.dll 	nsDocShell::CreateContentViewer(char const*, nsIRequest*, nsIStreamListener**) 	docshell/base/nsDocShell.cpp
8 	xul.dll 	nsDSURIContentListener::DoContent(char const*, bool, nsIRequest*, nsIStreamListener**, bool*) 	docshell/base/nsDSURIContentListener.cpp
9 	xul.dll 	nsDocumentOpenInfo::TryContentListener(nsIURIContentListener*, nsIChannel*) 	uriloader/base/nsURILoader.cpp
10 	xul.dll 	nsDocumentOpenInfo::DispatchContent(nsIRequest*, nsISupports*) 	uriloader/base/nsURILoader.cpp
11 	xul.dll 	nsDocumentOpenInfo::OnStartRequest(nsIRequest*, nsISupports*) 	uriloader/base/nsURILoader.cpp
12 	xul.dll 	mozilla::net::nsHttpChannel::CallOnStartRequest() 	netwerk/protocol/http/nsHttpChannel.cpp
13 	xul.dll 	mozilla::net::nsHttpChannel::ContinueOnStartRequest2(tag_nsresult) 	netwerk/protocol/http/nsHttpChannel.cpp
14 	xul.dll 	mozilla::net::nsHttpChannel::OnStartRequest(nsIRequest*, nsISupports*) 	netwerk/protocol/http/nsHttpChannel.cpp
15 	xul.dll 	nsInputStreamPump::OnStateStart() 	netwerk/base/src/nsInputStreamPump.cpp
16 	xul.dll 	nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream*) 	netwerk/base/src/nsInputStreamPump.cpp
17 	xul.dll 	nsInputStreamReadyEvent::Run() 	xpcom/io/nsStreamUtils.cpp
18 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp
19 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp
20 	xul.dll 	nsXULWindow::ShowModal() 	xpfe/appshell/src/nsXULWindow.cpp
21 	xul.dll 	nsWindowWatcher::OpenWindowInternal(nsIDOMWindow*, char const*, char const*, char const*, bool, bool, bool, nsITabParent*, nsIArray*, nsIDOMWindow**) 	embedding/components/windowwatcher/src/nsWindowWatcher.cpp
Naveed - Can you please get someone to take a look at this topcrash.
Flags: needinfo?(nihsanullah)
Bill, this is your MOZ_CRASH from bug 793904, want to take a look?

On the beta channel there is a clear rise in volume starting with the first 32 beta, but it's not so clear where it started on the earlier channels.
Flags: needinfo?(wmccloskey)
Hm. Looks like we're not OOMing, and SetNewDocument does a JS_CHECK_RECURSION, so we shouldn't be hitting the stack limit either.

Barring OOMs and stack limits, JSObject::swap can only fail if the following methods fail:

SetClassAndProto
generateOwnShape
EmptyShape::getInitialShape

I'm pretty sure that generateOwnShape can only fail due to OOM, and I'd hypothesize the same for getInitialShape. SetClassAndProto looks suspicious though - I'm betting that's failing somehow.
This is a major concern for 32. 60% of those crashes are within the first minute of uptime, and it sits at #4 on the topcrash list with 3% of all crashes for 32.0b7 right now.
Steve your changeset looks implicated here. Can you please take a look?
Assignee: nobody → sphink
Flags: needinfo?(nihsanullah)
These stacks are consistently a little weird. Most of the stacks are cut off, but here's a reasonably complete one:
  https://crash-stats.mozilla.com/report/index/e16a4809-834b-4f76-ba53-354b12140812
The basic pattern seems to be: jit code -> nsGlobalWindow::ShowSlowScriptDialog -> nsXULWindow::ShowModal -> nested event loop -> SetNewDocument via HTTP -> JS_TransplantObject

The other stacks I looked at appear similar. Usually you can at least see ShowModal before they're cut off, which suggests a slow script dialog. I don't know why this would cause problems, of course.

It is a little suspicious that I haven't seen a single stack that ended in main(). I wonder if the cause could somehow be stack overflow (although that would be weird)? David, would you mind checking a few dumps to see how much stack we're using? I remember you had some way to do that in WinDbg.

If that's not the cause, then maybe we can check in a test patch that MOZ_CRASHes inside ReserveForTradeGuts. That code isn't ever allowed to fail, so we might as well move them up there to get more information.
Flags: needinfo?(wmccloskey) → needinfo?(dmajor)
Stack usage in five dumps from 32 beta on varying machine types:
  Decimal: 867960
  Decimal: 866168
  Decimal: 881976
  Decimal: 865368
  Decimal: 864632

(The one you linked is from 28, so things may have changed)
Flags: needinfo?(dmajor)
Thanks! It certainly seems possible that we're failing a recursion check somewhere. The various stack sizes are computed here:
http://mxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCJSRuntime.cpp#3215

On 32-bit Windows, the untrusted stack size should be:
  kStackQuota - kSystemCodeBuffer - kTrustedScriptBuffer
= (900*1024) - (10*1024) - (12*4*1024)
= (900-10-48)*1024
= 862208

Most of these are just slightly over that. The weird thing is that I don't really see where in the code we're doing a recursion check within JSObject::swap. Maybe we'll need to move MOZ_CRASH anyway just to figure out what's happening.
Added a bunch more MOZ_CRASH calls to narrow things down a bit.
Attachment #8476192 - Flags: review?(bobbyholley)
Attachment #8476192 - Flags: review?(bobbyholley) → review+
So that everyone is clear on the timeline, we're at the end of the 32 beta cycle. beta9 goes to build tomorrow. If this is going to be fixed, we'll need a fix by tomorrow. Given comment 4, I'm happy to push off the build as long as possible in order to accommodate this bug.
That seems petty unlikely. The patch here is just a diagnostic bug.
Attached patch transplant-fixSplinter Review
This patch might fix the problem.
Attachment #8476278 - Flags: review?(bobbyholley)
Comment on attachment 8476278 [details] [diff] [review]
transplant-fix

Review of attachment 8476278 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsfriendapi.h
@@ +1044,5 @@
>  
> +#define JS_CHECK_RECURSION_CONSERVATIVE(cx, onerror)                            \
> +    JS_BEGIN_MACRO                                                              \
> +        int stackDummy_;                                                        \
> +        if (!JS_CHECK_STACK_SIZE_WITH_TOLERANCE(js::GetNativeStackLimit(cx),    \

Please add a comment above the definition of JS_CHECK_STACK_SIZE_WITH_TOLERANCE that the tolerance value can be negative.
Attachment #8476278 - Flags: review?(bobbyholley) → review+
Comment on attachment 8476278 [details] [diff] [review]
transplant-fix

Approval Request Comment
[Feature/regressing bug #]: unknown
[User impact if declined]: crashes
[Describe test coverage new/current, TBPL]: it compiles :-)
[Risks and why]: very low risk. just tuning an error-checking parameter.
[String/UUID change made/needed]: none
Attachment #8476278 - Flags: approval-mozilla-beta?
Attachment #8476278 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/9c45a39f76a2
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Attachment #8476278 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8476278 [details] [diff] [review]
transplant-fix

Bill and I spoke on irc. This is a somewhat speculative patch as we don't have concrete str, however, this is our current best shot and it is a fairly safe change to make. Bill doesn't think anything too bad can happen from this change. We can monitor the crash rate next week but likely won't have good data before we build our first RC on Monday.

beta+
Attachment #8476278 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: qe-verify+
Looking in Socorro [1] for crashes over the last week (since this landed in 32.0b9), I can still see over 500 crashes with this signature.

Looking at [2], 32b9 ranks 8th, compared to 6th in 32b8 and 32b7, so it doesn't look fixed to me. Can anyone review these results?

[1] - https://crash-stats.mozilla.com/report/list?product=Firefox&signature=JS_TransplantObject%28JSContext%2A%2C+JS%3A%3AHandle%3CJSObject%2A%3E%2C+JS%3A%3AHandle%3CJSObject%2A%3E%29
[2] - https://crash-stats.mozilla.com/topcrasher_ranks_bybug/?bug_number=1053999
One more note: the number of crashes for the past 5 days is much lower in Aurora and Nightly (14 in Aurora, 6 in Nightly).
I am reopening for investigation since the number of crashes on Beta has not come down from earlier betas, and it is showing at #5 in the top crashers list.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Resetting flags because this was reopened and tracking 33 as 32 is just about ready to ship.
Setting 32 back to affected while we investigate today. We will need a fix today or early tomorrow if we are to fix this in 32.
The only (non-test) caller of JS_TransplantObject is JSObject* xpc::TransplantObject(JSContext*, class JS::Handle<JSObject*>, class JS::Handle<JSObject*>). It is called from:

3 callers of #240766 = JSObject* xpc::TransplantObject(JSContext*, class JS::Handle<JSObject*>, class JS::Handle<JSObject*>)
  #410113 = uint32 nsGlobalWindow::SetNewDocument(nsIDocument*, nsISupports*, uint8)
  #437013 = uint32 XPCWrappedNative::ReparentWrapperIfFound(XPCWrappedNativeScope*, XPCWrappedNativeScope*, class JS::Handle<JSObject*>, nsISupports*)
  #240758 = uint32 mozilla::dom::ReparentWrapper(JSContext*, class JS::Handle<JSObject*>)

All of the crash reports I have looked at are from SetNewDocument.

There are lots of ways to get from JSObject::wrap to js_ReportOverRecursed. I am attaching a report of all of the ones the static analysis can figure out. The report is in a funky format. It lists things like

A
B C G
D
E
---
C
F
E
---
G
F

That means that there is a call chain A -> B -> D -> E, but also A -> B -> C -> F -> E and A-> B -> G -> F -> E. The first column of the first section is a random example path from wrap --> js_ReportOverRecursed. If there are divergent paths, additional callees are displayed immediately after their callers, and then an additional section will be generated that will give the path from that secondary caller to either the final target, or to something that has already been displayed in an earlier path.

Some of the paths are pretty deep. Here's one example. I'm sure it's totally unrealistic and wouldn't happen in practice, but it's an idea of how crazy things can get. If you want the full function names, you can dig them out of the attached report.

swap
ReserveForTradeGuts
SetClassAndProto
getNewType
CheckNewScriptProperties
NewBuiltinClassInstance
NewObjectWithClassProto
NewObjectWithClassProtoCommon
FindProto
FindClassPrototype
getProperty
getGeneric
GetProperty
GetPropertyHelperInline
get
get
WrapNewBindingObject
WrapNewBindingObject
WrapObject
JS_SetPrototype
isExtensible
isExtensible
isExtensible
isExtensible
CallIsExtensible
CallIsExtensible
Call
InterruptCall
DispatchMessage
DispatchSyncMessage
OnMessageReceived
RecvBrowserFrameOpenWindow
OpenWindowOOP
DispatchOpenWindowEvent
DispatchCustomDOMEvent
DispatchDOMEvent
Dispatch
HandleEventTargetChain
HandleEvent
HandleEvent
OpenMenu
ShowMenu
FirePopupShowingEvent
GenerateChildFrames
ProcessChildren
GetAnonymousContent
CreateAnonymousContent
SetValueOfAnonTextControl
SetValue
SetValue
SetValueInternal
UpdateAllValidityStates
UpdatePatternMismatchValidityState
HasPatternMismatch
IsPatternMatching
JS_ExecuteRegExpNoStatics
ExecuteRegExpLegacy
ExecuteRegExpImpl
execute
InterpretCode
push
js_ReportOverRecursed
With no other current build2 drivers and no fix ready on this bug, I think we're going to have to ship 32 with this crash. That doesn't mean that the investigation should stop as we want a fix for 33.
The people seeing this seem to have a high chance of having some kind of toolbar ("ffxtbr" probably means "Firefox Toolbar") installed, see correlations:

18% (167/918) vs.   2% (960/45199) 4zffxtbr@VideoDownloadConverter_4z.com
13% (116/918) vs.   1% (518/45199) 65ffxtbr@FromDocToPDF_65.com
12% (110/918) vs.   1% (493/45199) 8hffxtbr@Allin1Convert_8h.com
10% (91/918) vs.   1% (501/45199) gtffxtbr@GamingWonderland.com
12% (110/918) vs.   5% (2154/45199) faststartff@gmail.com
7% (65/918) vs.   0% (215/45199) 64ffxtbr@TelevisionFanatic.com

And the crashes mostly happen on ask.com pages like:
7 	http://home.tb.ask.com/index.jhtml?ptb=646578D2-2E8F-44C4-A7B0-75FC0077CC6E&n=780c756a&st=tab&p2=^ZX^xdm071^S10501^eg
6 	http://home.tb.ask.com/index.jhtml?ptb=33E3C0CF-38AD-45B1-BA59-16BADAC49D4E&n=780c468a&st=tab&p2=^HJ^xdm022^YYA^br&si=pconverter
5 	http://home.tb.ask.com/index.jhtml?ptb=3548A5FE-5F20-49FB-93F5-D93010C5B4C9&n=780c75c8&p2=^HJ^xdm238^YYA^kh
4 	http://home.tb.ask.com/index.jhtml?n=780bfe50&st=tab&p2=^HJ^xpi000^YYA^

There's tons more of such URLs in those reports
Thanks KaiRo. That's awfully suspicious.

bholley - I've installed one of those toolbars on an isolated machine. Any experiments you want me to try?
Flags: needinfo?(bobbyholley)
(In reply to David Major [:dmajor] from comment #30)
> bholley - I've installed one of those toolbars on an isolated machine. Any
> experiments you want me to try?

Navigating I guess? It's pretty hard to say.

Given that this didn't do the trick, it seems like we should land the diagnostic patch that was r+ed here but never landed. Bill, can you do that?
Flags: needinfo?(bobbyholley) → needinfo?(wmccloskey)
Target Milestone: mozilla34 → ---
I feel terrible I've let this sit so long, but the tree has been closed every time I remember to land it. Could you please take care of this Ryan? Thanks.
Attachment #8491554 - Flags: checkin?(ryanvm)
Flags: needinfo?(wmccloskey)
Comment on attachment 8491554 [details] [diff] [review]
transplant-investigation v2

A commit message would be helpful.
https://developer.mozilla.org/en-US/docs/Mercurial_FAQ#How_can_I_generate_a_patch_for_somebody_else_to_check-in_for_me.3F

Also, please just use the checkin-needed keyword. It plays more nicely with our bug marking tools.
Attachment #8491554 - Flags: checkin?(ryanvm)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #33)
> Comment on attachment 8491554 [details] [diff] [review]
> A commit message would be helpful.

And a Try link.
https://hg.mozilla.org/mozilla-central/rev/df6055033377
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Bill, could you fill the uplift request to aurora & beta? Thanks
Flags: needinfo?(wmccloskey)
How far do we have to uplift this to get the data we need? Is it really a beta-only crash?
Flags: needinfo?(wmccloskey) → needinfo?(kairo)
(In reply to Bill McCloskey (:billm) from comment #38)
> How far do we have to uplift this to get the data we need? Is it really a
> beta-only crash?

We're not seeing this in Nightly or Aurora in significant volume, so if you need more than ~20 crashes per week for analysis, you need it in beta.

Also, reopening this because from all I gather, this is an investigation patch, not a fix.
Status: RESOLVED → REOPENED
Flags: needinfo?(kairo)
Resolution: FIXED → ---
Flags: needinfo?(wmccloskey)
I looked at one crash on Nightly and it's pretty weird:
https://crash-stats.mozilla.com/report/index/88d8cdbf-6992-4029-9f76-be9a82140921

We're supposedly crashing because JSObject::swap returned false:
http://hg.mozilla.org/mozilla-central/annotate/27253887d2cc/js/src/jsapi.cpp#l1113

However, swap can only return false if ReserveForTradeGuts returns false. And I can't see how that could possibly happen:
http://hg.mozilla.org/mozilla-central/annotate/27253887d2cc/js/src/jsobj.cpp#l2426

I don't really have time to investigate this now. Steve, would you mind taking this over? I'm not sure what the next step should be. We could try to land the investigative patch on beta to get more data, I guess.
Flags: needinfo?(wmccloskey) → needinfo?(sphink)
I don't know if this is true for that report from OS X, but on Windows builds you can definitely expect nearby MOZ_CRASHes to jmp to the same disassembly. (We could try making them unique with __LINE__ or __COUNTER__, but that may add a fair amount to our binary size)
It's been 5+ weeks since the last comment on this bug.

Kairo - Is this still a top crash?
Naveed - If this is still a top crash, do you have someone who can help out?
Flags: needinfo?(nihsanullah)
Flags: needinfo?(kairo)
I don't see that bug fixed for 33... So, wontfix.
Target Milestone: mozilla35 → ---
(In reply to Lawrence Mandel [:lmandel] (use needinfo) from comment #42)
> It's been 5+ weeks since the last comment on this bug.
> 
> Kairo - Is this still a top crash?
> Naveed - If this is still a top crash, do you have someone who can help out?

I looked at this yesterday, and I'm no expert in reading crash-stats, but it looks to me like it's still happening. Whether it's a top crash, I don't know.

I was going to try to land some sort of diagnostics patch soonish.
Flags: needinfo?(nihsanullah)
https://crash-stats.mozilla.com/topcrasher_ranks_bybug/?bug_number=1053999 is often slow to load, but it will show you that it's still consistently in the top 10 on all beta and release versions, while it's much lower down on Aurora and Nightly (probably due to the kind of audiences, not due to code improvements).
Flags: needinfo?(kairo)
Still a topcrash for Firefox 34.0b5.
(In reply to Bill McCloskey (:billm) from comment #40)
> I looked at one crash on Nightly and it's pretty weird:
> https://crash-stats.mozilla.com/report/index/88d8cdbf-6992-4029-9f76-
> be9a82140921
> 
> We're supposedly crashing because JSObject::swap returned false:
> http://hg.mozilla.org/mozilla-central/annotate/27253887d2cc/js/src/jsapi.cpp#l1113

No, that version of the code has changed too much. It's

http://hg.mozilla.org/releases/mozilla-beta/annotate/6cb4888905c9/js/src/jsapi.cpp#l1111

> However, swap can only return false if ReserveForTradeGuts returns false.
> And I can't see how that could possibly happen:
> http://hg.mozilla.org/mozilla-central/annotate/27253887d2cc/js/src/jsobj.cpp#l2426

The line above calls JS_WrapObject:
http://hg.mozilla.org/releases/mozilla-beta/annotate/6cb4888905c9/js/src/jsapi.cpp#l1016

which calls JSCompartment::wrap:
http://hg.mozilla.org/releases/mozilla-beta/annotate/6cb4888905c9/js/src/jscompartment.cpp#l412

which returns false because it does a JS_CHECK_CHROME_RECURSION and we've blown the stack.

Only we shouldn't have because of your https://bug1053999.bugzilla.mozilla.org/attachment.cgi?id=8476278 patch, which gives an extra 4KB of stack space to depend on, or a total of 2 * 4KB counting the _CHROME_ part.

Except! The JS_CHECK_RECURSION_CONSERVATIVE runs when the cx is within a "trusted" compartment, which means it is allowed a whole bunch more stack space, more than enough to erase the 8KB buffer. Just before JS_WrapObject is called, the cx enters a content compartment, and the recursion check fails.

\o/

Ok, so one option would be to add a JS_CHECK_CONTENT_RECURSION_CONSERVATIVE and call that from SetNewDocument. It would be different from all the other JS_CHECK*RECURSION* checks in that it would not call js::GetNativeStackLimit(cx) -- it would have to force the use of the "untrusted" limit.

The one thing that makes me nervous is that I don't understand the discrepancy between JS_CHECK_CHROME_RECURSION and the "trusted" native stack limit. Are they meant to be different concepts? If not, then the fix is to use the trusted quota in place of the hardcoded 1024 * sizeof(size_t) fudge factor. The inner check is already a JS_CHECK_CHROME_RECURSION so the outer conservative check would be checking against the same limits. billm?
It ate my needinfo? billm, but that's ok because he redirected me to luke on IRC.
Flags: needinfo?(sphink) → needinfo?(luke)
Luke blocked the serve and hit the ball over to bholley.
Flags: needinfo?(luke) → needinfo?(bobbyholley)
(In reply to Steve Fink [:sfink, :s:] from comment #47)
> Ok, so one option would be to add a JS_CHECK_CONTENT_RECURSION_CONSERVATIVE
> and call that from SetNewDocument. It would be different from all the other
> JS_CHECK*RECURSION* checks in that it would not call
> js::GetNativeStackLimit(cx) -- it would have to force the use of the
> "untrusted" limit.

I think that we should just make JS_CHECK_RECURSION_CONSERVATIVE assume content (the worst case). I don't think we actually need a separate macro.

> The one thing that makes me nervous is that I don't understand the
> discrepancy between JS_CHECK_CHROME_RECURSION and the "trusted" native stack
> limit.

JS_CHECK_CHROME_RECURSION predated my patch to introduce 3 separate stack limits (one for C++, one for trusted JS, and one for untrusted JS). Conceptually, JS_CHECK_CHROME_RECURSION should be called JS_CHECK_SYSTEM_CODE_RECURSION, and should be written in terms of StackForSystemCode, not the arbitrary constant it uses now.

> Are they meant to be different concepts? If not, then the fix is to
> use the trusted quota in place of the hardcoded 1024 * sizeof(size_t) fudge
> factor.

That is the case for JS_CHECK_CHROME_RECURSION (above). But JS_CHECK_RECURSION_CONSERVATIVE is a different beast - it's effectively an offset in the _other_ direction. In practice I think just hard-coding some number of K from the untrusted stack limit should be fine.
Flags: needinfo?(bobbyholley)
#5 topcrash for Firefox 34.0b7. Lots of ask.com urls, and I'm also seeing addons in the crash reports that follow the pattern, 4zffxtbr@VideoDownloadConverter_4z.com or 9tffxtbr@InternetSpeedTracker_9t.com.
Maybe bug 607047 is related. There's a test case there.
See Also: → 607047
This patch really just switches JS_CHECK_RECURSION_CONSERVATIVE to using js::StackForUntrustedScript. That should fix this case because the problem is that during the transplant operation, we run a stack check with an untrusted principle. 

I don't know if there are cases where we will never be running untrusted script and this will cause failures in code that shouldn't really fail.

I'm also not sure whether there should be a followup patch that changes JS_CHECK_CHROME_RECURSION to force the use of either StackForTrustedScript or StackForSystemScript.

The latter two are good reasons to give this review to bholley.
Attachment #8520079 - Flags: review?(bobbyholley)
(In reply to Bobby Holley (:bholley) from comment #50)

Oops, sorry, I didn't see that you had replied to my needinfo when I posted my patch. I think what I ended up doing is consistent with what you said, though.
Comment on attachment 8520079 [details] [diff] [review]
Conservative stack check should assume untrusted code

Review of attachment 8520079 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Steve Fink [:sfink, :s:] from comment #53)
> I don't know if there are cases where we will never be running untrusted
> script and this will cause failures in code that shouldn't really fail.

We never actually run any untrusted script, right? The only reason we end up in an untrusted compartment is because we're rewrapping stuff during transplant, if I understand comment 47 correctly (one great piece of detective work, btw).
 
> I'm also not sure whether there should be a followup patch that changes
> JS_CHECK_CHROME_RECURSION to force the use of either StackForTrustedScript
> or StackForSystemScript.
 
That would be a reasonable thing to do, yes.

::: dom/base/nsGlobalWindow.cpp
@@ +2409,5 @@
>    bool thisChrome = IsChromeWindow();
>  
>    // Check if we're near the stack limit before we get anywhere near the
> +  // transplanting code. The transplanting executes partly within a potentially
> +  // untrusted script, so use the limit for untrusted scripts with a

Per the above, this doesn't make sense, ditto for the other comment changes. Please revert them.

::: js/src/jsfriendapi.h
@@ +1003,5 @@
>   * These macros report a stack overflow and run |onerror| if we are close to
> + * using up the C stack. The JS_CHECK_CHROME_RECURSION variant gives us a
> + * little extra space so that we can ensure that crucial code is able to run.
> + * JS_CHECK_RECURSION_CONSERVATIVE gives us a little less space and assumes
> + * that untrusted code will want to run (untrusted code has a smaller limit.)

The end of this isn't quite right either for the same reason.
Attachment #8520079 - Flags: review?(bobbyholley) → review+
Since you said this would be ok in your comment, here's a followon patch for the crhome check.
Attachment #8520122 - Flags: review?(bobbyholley)
Comment on attachment 8520122 [details] [diff] [review]
Switch chrome stack check to use StackForSystemCode

Review of attachment 8520122 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks. r=bholley

::: js/src/jsfriendapi.h
@@ +1041,5 @@
>              onerror;                                                            \
>          }                                                                       \
>      JS_END_MACRO
>  
>  #define JS_CHECK_CHROME_RECURSION(cx, onerror)                                  \

Would you mind renaming this to JS_CHECK_SYSTEM_RECURSION or something?
Attachment #8520122 - Flags: review?(bobbyholley) → review+
Duplicate of this bug: 607047
From bug 607047:

Steps to Reproduce:
1. Open Firefox
2. Enter: (in JS Console like Firebug)
function spam()   {window.setInterval("spam2();",1);}
function spam2()  {window.setInterval("spam3();",1);}
function spam3()  {window.open();}
3. Enter : (in JS Console like firebug)
spam();

Actual Results:  
Firefox hangs and shows the count of the Popupwindows. (300.000)
(In reply to Bobby Holley (:bholley) from comment #55)
> Comment on attachment 8520079 [details] [diff] [review]
> Conservative stack check should assume untrusted code
> 
> Review of attachment 8520079 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> (In reply to Steve Fink [:sfink, :s:] from comment #53)
> > I don't know if there are cases where we will never be running untrusted
> > script and this will cause failures in code that shouldn't really fail.
> 
> We never actually run any untrusted script, right? The only reason we end up
> in an untrusted compartment is because we're rewrapping stuff during
> transplant, if I understand comment 47 correctly (one great piece of
> detective work, btw).

I don't know. The recursion check is because we're invoking the preWrap and wrap callbacks. I don't know whether either of those is ever going to run scripts.

If they aren't going to run scripts, then this stack checking may be a little too conservative, and just switching the inner check to be a StackForSystemScript check would be good enough without forcing the outer check to always be a StackForUntrustedScript.

But paranoia makes me lean towards what it would be after these two patches, even if the preWrap+wrap callbacks are safe now. If it causes problems, we could relax it then.

bholley: are you ok with landing these two patches with only updates as per your review, given that they are guarding only against preWrap+wrap?

For other embedders' sake, I'd definitely like to keep the 2nd patch's change (using StackForSystemScript rather than a hardcoded bonus value). That should fix the original problem in this bug by itself. It depends on your knowledge of xpconnect's use of preWrap+wrap as to whether it's also a good idea to force the outer check to be StackForUntrustedScript.
Dang it, I always forget to set the ni? flag when submitting the comment. ni? comment 60.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) from comment #57)
> Comment on attachment 8520122 [details] [diff] [review]
> Switch chrome stack check to use StackForSystemCode
> >  
> >  #define JS_CHECK_CHROME_RECURSION(cx, onerror)                                  \
> 
> Would you mind renaming this to JS_CHECK_SYSTEM_RECURSION or something?

Gladly, especially given that it's only used in jscompartment.cpp!
(In reply to Steve Fink [:sfink, :s:] from comment #60)
> I don't know. The recursion check is because we're invoking the preWrap and
> wrap callbacks. I don't know whether either of those is ever going to run
> scripts.

They shouldn't.
 
> If they aren't going to run scripts, then this stack checking may be a
> little too conservative, and just switching the inner check to be a
> StackForSystemScript check would be good enough without forcing the outer
> check to always be a StackForUntrustedScript.
> 
> But paranoia makes me lean towards what it would be after these two patches,
> even if the preWrap+wrap callbacks are safe now. If it causes problems, we
> could relax it then.

Yes, sounds good.
 
> bholley: are you ok with landing these two patches with only updates as per
> your review, given that they are guarding only against preWrap+wrap?

> For other embedders' sake, I'd definitely like to keep the 2nd patch's
> change (using StackForSystemScript rather than a hardcoded bonus value).
> That should fix the original problem in this bug by itself. It depends on
> your knowledge of xpconnect's use of preWrap+wrap as to whether it's also a
> good idea to force the outer check to be StackForUntrustedScript.

Yes, we should do both.
Flags: needinfo?(bobbyholley)
https://hg.mozilla.org/mozilla-central/rev/df574e3263c7
https://hg.mozilla.org/mozilla-central/rev/cc036cbdc52c
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Assuming this to be wontfix for Fx34 at this point (feel free to set it back to affected if I'm wrong). But we for sure need this on Aurora35, right?
Combined patch for backporting. This applies cleanly to Aurora (no changes were necessary.)
Comment on attachment 8524300 [details] [diff] [review]
Rationalize stack recursion checks,

[Approval Request Comment]

Bug caused by (feature/regressing bug #): this is pretty old. It *might* have been introduced by bug 732665, then partially but not completely fixed by bug 809295. But I'd have to dig deeper to be sure.

User impact if declined: topcrash

Testing completed: On mozilla-central for several days now

Risk to taking this patch (and alternatives if risky): this lowers the space allowed for recursion in some situations. I doubt any desired scenario would legitimately require this much stack, but it is possible that some pages or addons could start failing due to stack exhaustion.

String or UUID changes made by this patch: None

Note that I'm blanket nominating this for a whole bunch of branches that I don't know what are. Nothing horrible will happen if we don't take this patch, but taking it would clean up a fairly high frequency crash. I think this crash is only happening when something has gone wrong and a page manages to spawn infinite recursion involving navigation. So the fix converts crashes to script errors, it does not convert crashes to some fully working behavior. Hopefully that makes sense. (It's only crashing when something bad is happening; it just makes the "bad" into "much worse".)
Attachment #8524300 - Flags: approval-mozilla-beta?
Attachment #8524300 - Flags: approval-mozilla-b2g34?
Attachment #8524300 - Flags: approval-mozilla-b2g32?
Attachment #8524300 - Flags: approval-mozilla-b2g30?
Attachment #8524300 - Flags: approval-mozilla-aurora?
(In reply to Steve Fink [:sfink, :s:] from comment #68)
> working behavior. Hopefully that makes sense. (It's only crashing when
> something bad is happening; it just makes the "bad" into "much worse".)

Dang it, I've got to stop throwing around pronouns.

*The crash* just makes "bad" into "much worse".
FWIW, there have not been any crashes with this signature on Nightly since this landed, but this has been very low volume on Nightly and Aurora previously as well.
I was actually hoping to get this into 34 and assumed it would be, since it's been a top crash all through beta. Currently the #2 topcrash for Beta 9 with 913/18944 crashes. 

Lawrence, what do you think? Too late to try it and back it out if we need to?
Flags: needinfo?(lmandel)
Agreed that this would be very nice to have -- if it's safe.

sfink, how comfortable would you be with putting this on a late beta?
Flags: needinfo?(sphink)
(In reply to David Major [:dmajor] (UTC+13) from comment #72)
> Agreed that this would be very nice to have -- if it's safe.
> 
> sfink, how comfortable would you be with putting this on a late beta?

At this point, fairly comfortable. It's fundamentally a pretty safe change to make -- the only thing that could go wrong is introducing new failures in already ugly situations. My main other concern then is that I accidentally botched something while doing it, and any amount of time on Nightly is strong evidence that it's not broken in that sort of way.

I hate saying things like this because I feel like I'm jinxing it, but I'm knocking on wood and crossing my fingers and hanging garlic around my neck right now. (Just kidding about the garlic.)
Flags: needinfo?(sphink)
Attachment #8524300 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8524300 [details] [diff] [review]
Rationalize stack recursion checks,

lizzard and kairo lobbied for this bug during today's channel meeting. Both think that this has significant upside for the stability of 34. We're going to take this fix in beta11. Beta+

sfink - Can you please handle the beta landing on Wed?
Flags: needinfo?(lmandel) → needinfo?(sphink)
Attachment #8524300 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 8524300 [details] [diff] [review]
Rationalize stack recursion checks,

Its too late to land this on b2g30/b2g32 branches approving for 2.1(b2g34)
Attachment #8524300 - Flags: approval-mozilla-b2g34?
Attachment #8524300 - Flags: approval-mozilla-b2g34+
Attachment #8524300 - Flags: approval-mozilla-b2g32?
Attachment #8524300 - Flags: approval-mozilla-b2g30?
In the last week we have:
- 2 crashes in Firofox 36.0a1;
- 6 crashes in Firefox 35.0a2;
- 2 crashes in 34.0b11.

If the crash rate won't increase in the next couple of days I consider we can mark this verified.
Looking for crashes over the past weeks [1]:
- 0 crashes in Firofox 36.0a1 (in builds from November 15 or later)
- 0 crashes in Firefox 35.0a2 (in builds from November 19 or later)
- 22 crashes in 34.0b11
- 10 crashes in 34.0
- 13 crashes in 34.0.5
- 1 crash in 31.3.0esr

While there don;t seem to be any more crashes in 35 and 36, it may be worth to continue to keep an eye on this for 34 for at least another week.

[1] - https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=JS_TransplantObject%28JSContext%2A%2C+JS%3A%3AHandle%3CJSObject%2A%3E%2C+JS%3A%3AHandle%3CJSObject%2A%3E%29#tab-reports
As of today the stats look great over the last week:
- 1 crash in 36.0a2
- 0 crashes in 35.0a2
- 8 crashes in 35.0b1
- 2 crashes in 35.0b2
- 1 crash in 34.0b11
- 13 crashes in 34.0
- 16 crashes in 34.0.5 (greatly reduced from 1307 crashes in 33.1.1 and 2039 crashes in 33.1 in the same time period).

Marking this verified. I will keep an eye just in case but I consider at this point things look good.
You need to log in before you can comment on or make changes to this bug.