Closed Bug 605033 Opened 14 years ago Closed 11 years ago

Crash in js::ValueToNumberSlow @ js::DefaultValue

Categories

(Core :: JavaScript Engine, defect)

2.0 Branch
x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox5 + fixed
firefox6 + fixed
blocking2.0 --- -
status2.0 --- wanted

People

(Reporter: scoobidiver, Assigned: dmandelin)

References

Details

(Keywords: crash, regression, Whiteboard: [crashkill])

Crash Data

Attachments

(3 files)

Build: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b8pre) Gecko/20101017 Firefox/4.0b8pre

It is a residual crash signature that exists in trunk builds.
There has been a spike since Sept 14th.
It is #49 top crasher in 4.0b8pre build for the last week.

Signature	js::DefaultValue(JSContext*, JSObject*, JSType, js::Value*)
UUID	3d328b01-a3a5-4f4e-afb8-fbc802101017
Time 	2010-10-17 12:51:06.160533
Uptime	21007
Last Crash	672468 seconds (1.1 weeks) before submission
Install Age	91815 seconds (1.1 days) since version was first installed.
Product	Firefox
Version	4.0b8pre
Build ID	20101016052749
Branch	2.0
OS	Windows NT
OS Version	6.1.7600
CPU	x86
CPU Info	GenuineIntel family 6 model 23 stepping 10
Crash Reason	EXCEPTION_ACCESS_VIOLATION_READ
Crash Address	0x4	
App Notes 	AdapterVendorID: 1002, AdapterDeviceID: 9460

Frame 	Module 	Signature [Expand] 	Source
0 	mozjs.dll 	js::DefaultValue 	js/src/jsobj.cpp:5692
1 	mozjs.dll 	js::ValueToNumberSlow 	js/src/jsnum.cpp:1226
2 	mozjs.dll 	js::mjit::stubs::GreaterThan 	js/src/methodjit/StubCalls.cpp:1033

More reports at:
http://crash-stats.mozilla.com/report/list?range_value=4&range_unit=weeks&signature=js%3A%3ADefaultValue%28JSContext*%2C%20JSObject*%2C%20JSType%2C%20js%3A%3AValue*%29
There seem to be a lot of crashes on the http://www.appbrain.com/ site, but I haven't been able to reproduce it yet with any of the links.
topcrash number 15 from chofmann's list.
blocking2.0: --- → ?
Whiteboard: [crashkill]
looks like the trunk volume regression range is right around oct 13 or 14

date     tl crashes at, count build, count build, ...
         js::DefaultValue.JSContext...JSObject...JSType..js::Value..
20101010 8  6 4.0b7pre2010100604, 
	        1 4.0b62010091408, 1 4.0b42010081813, 
20101011 2 4.0b62010091408 2 , 
20101012 6  4 4.0b62010091408, 
	        1 4.0b8pre2010101104, 1 4.0b42010081813, 
20101013 4 4.0b32010080519 4 , 
20101014 12  8 4.0b8pre2010101404, 
	        4 4.0b8pre2010101322, 
20101015 21  10 4.0b8pre2010101404, 
	        7 4.0b8pre2010101504, 3 4.0b62010091408, 
	        1 4.0b7pre2010100204, 
20101016 9  5 4.0b8pre2010101604, 
	        3 4.0b8pre2010101504, 1 4.0b62010091408, 
20101017 16  8 4.0b8pre2010101604, 
	        3 4.0b8pre2010101704, 1 4.0b8pre2010101605, 
	        1 4.0b8pre2010101504, 1 4.0b8pre2010101404, 
	        1 4.0b62010091408, 1 4.0b32010080519, 
20101018 23  9 4.0b8pre2010101804, 
	        7 4.0b8pre2010101704, 4 4.0b62010091408, 
	        1 4.0b8pre2010101604, 1 4.0b8pre2010101404, 
	        1 4.0b7pre2010100204, 
20101019 23  9 4.0b8pre2010101904, 
	        7 4.0b8pre2010101804, 2 4.0b8pre2010101604, 
	        2 4.0b8pre2010101504, 1 4.0b8pre2010101805, 
	        1 4.0b8pre2010101404, 1 4.0b8pre2010101322,
blocking2.0: ? → betaN+
Just got this crash with the latest Minefield nightly (Built from http://hg.mozilla.org/mozilla-central/rev/8cbe83542596) while visiting www.appbrain.com.

http://crash-stats.mozilla.com/report/index/21e05006-0f91-417c-99dc-165142101107
Assignee: general → dmandelin
I have some info:

1. In most of these, we crash with an address of 0x4 trying to read the clasp of an object, which means the JSObject* is NULL. In the minidump I read, the reason is that we call ValueToNumberSlow on a jsval that is an array hole. Holes should not appear as "normal" values, so that is a bug.

2. This crash was happening earlier and then stopped. The first one in the "new" version came in an Oct 7 build. They were low-volume at first, but spiked when compartments landed--I don't know why.

3. Bug 601733, which relates to JSOP_GETELEM, landed just before the Oct 7 build. Interestingly, that bug asserts that a value is not a hole at a place that used to check, inside ArgGetter. So, that could be the bug. Assuming holes do reach that point, it's not clear yet whether asserting non-hole is a bug, or if some caller is using ArgGetter in a buggy way.

4. If the case in ArgGetter is not the "cause" of this, then most likely we are reading holes out of an array somewhere else.
Attached patch Diagnostic patchSplinter Review
This patch primarily checks for holes in ArgGetter per the last comment. In case that's not it, the patch also checks for holes read out of arrays by the interpreter and mjit stub call. Of course, they could be read out elsewhere, but I'm starting in the easy places.
Attachment #490729 - Flags: review?(lw)
Attachment #490729 - Flags: review?(lw) → review+
Diagnostic landed:

http://hg.mozilla.org/mozilla-central/rev/1b815a3b4250
Status: NEW → ASSIGNED
OK, we've had a big dropoff in the number of these crashes since the diagnostic landed, so either

 (a) it got fixed by someone else, or
 (b) the diagnostic is making it crash at a different place, which I'm not finding
     in Socorro. (I did search for the sigs of both the functions where I added
     crashes.)

So, I think the next step is to back out the diagnostic by parts. If one of the backouts causes the crash to come back, then that's the one that was being hit. If neither does, then it got fixed by someone else.
(In reply to comment #8)

Argh, in comment 8 when I said there was a dropoff I was looking at data for a different crash, bug. It is pretty interesting that that crash seems to have dropped off with the landing of this diagnostic, so the comments in comment 8 do kind of bug 608118.
Diagnostic backed out:

  http://hg.mozilla.org/mozilla-central/rev/693505bdb668

The result was negative. I think I'm going to back up a bit on this. The previous test was based on the guess that we were getting a hole from getelem, but neither of those two was ever directly confirmed. So I'm going to first confirm that we are crashing on holes, and continue from there.
Attachment #491632 - Flags: review? → review+
Diagnostic 2 backed out:

http://hg.mozilla.org/mozilla-central/rev/09e869b8ff89

The crash in diagnostic 2 got hit one time. That's way fewer than the original topcrash was. That crash seems to have last occurred in the Nov 17 build. (Note that the diagnostic was first active only in the Nov 18 build.) So the original crash might be gone. I'll keep watching.
It's still there, but low frequency. Deprioritizing.
blocking2.0: betaN+ → -
status2.0: --- → wanted
looks like its still inside the top 20 at #17 in beta7 and running at 130-190 crashes per day.   seems like we should stil be trying to get any crash inside the top 30 that is a regression from firefox 3.6.x
That's reasonable. It looks good on nightlies, but let's check it again once beta8 goes out and make it block again if it's still popular there.
bug 619064 just found a reproducible way to crash in js::DefaultValue, except that it involves obj->getClass()->convert == NULL which isn't the same symptom as comment 5 reports.
n/m -- its using the shell-only (and I'm pretty sure unsafe) parent() function.
Mozilla/5.0 (Windows NT 5.1; rv:6.0a1) Gecko/20110501 Firefox/6.0a1
Build ID: 20110501030600
Built from http://hg.mozilla.org/mozilla-central/rev/068d876996c6

I reliably (100%) get a crash with the same signature on http://geo.maunsell.com/smcpublic/index.phtml using a clean profile when I move the mouse over the map, either after page load finished or during page load (once map images are displayed).

Sample crash reports:
bp-ea030708-e93c-4847-8b7a-f8b6a2110503
bp-7e16d21d-48a8-4201-8697-c443d2110503

The stack is different though... A quick scan shows some js::DefaultValue(...) crashes in nightly 6.0a1 with similar stack as mine and others with the same two or three frames at the top of stack as in comment 0.

I don't relish the thought of making a testcase from this URL, but would it be useful?
(In reply to comment #19)
> I reliably (100%) get a crash with the same signature on
> http://geo.maunsell.com/smcpublic/index.phtml using a clean profile when I move
> the mouse over the map, either after page load finished or during page load
> (once map images are displayed).
I cannot reproduce in 4.0.1 with these STR, so it is a new bug.
I renamed the summary to reflect the stack traces in comment 0.
Summary: crash [@ js::DefaultValue(JSContext*, JSObject*, JSType, js::Value*) ] → Crash in js::ValueToNumberSlow [@ js::DefaultValue(JSContext*, JSObject*, JSType, js::Value*) ]
It is #4 top crasher in 5.0b2.

The major part of stack traces are similar to the ones in comment 0:
0 	mozjs.dll 	js::DefaultValue 	js/src/jsobj.cpp:5923
1 	mozjs.dll 	js::ValueToNumberSlow 	js/src/jsnum.cpp:1274
2 	mozjs.dll 	js::ValueToECMAInt32Slow 	js/src/jsnum.cpp:1292
3 	mozjs.dll 	js::mjit::stubs::BitXor 	js/src/methodjit/StubCalls.cpp:617
4 		@0x765f1df 	
5 	mozjs.dll 	CheckStackAndEnterMethodJIT 	js/src/methodjit/MethodJIT.cpp:712
6 	mozjs.dll 	js::Interpret 	js/src/jsinterp.cpp:3542
7 	mozjs.dll 	js::Invoke 	js/src/jsinterp.cpp:716
8 	mozjs.dll 	JS_CallFunctionValue 	js/src/jsapi.cpp:5153
9 	xul.dll 	nsJSContext::CallEventHandler 	dom/base/nsJSEnvironment.cpp:1903
10 	xul.dll 	nsJSEventListener::HandleEvent 	dom/src/events/nsJSEventListener.cpp:224
11 	xul.dll 	nsEventListenerManager::HandleEventSubType 	content/events/src/nsEventListenerManager.cpp:1136
12 	xul.dll 	nsEventListenerManager::HandleEventInternal 	content/events/src/nsEventListenerManager.cpp:1233
13 	xul.dll 	nsEventDispatcher::Dispatch 	content/events/src/nsEventDispatcher.cpp:646 

Maybe related to bug 595351.
Luke you had some ideas about how we could approach a fix in comments 17 and 18. Any updates on that?

If we look at the top crash list, it's kind of dominated by JS related crashes. I think we need to get some eyes on this. It was not in the tp 300 for 4 and 4.0.1 and and seems to be in the top 10 for 5. Probably something regressed this.
Blocks: 659785
In comment 17 I thought I had found a connection to another bug but that other bug was using a shell-only function that isn't meant to be safe (hence the bug was resolved invalid).
(In reply to comment #19)
> Mozilla/5.0 (Windows NT 5.1; rv:6.0a1) Gecko/20110501 Firefox/6.0a1
> Build ID: 20110501030600
> Built from http://hg.mozilla.org/mozilla-central/rev/068d876996c6
> 
> I reliably (100%) get a crash with the same signature on
> http://geo.maunsell.com/smcpublic/index.phtml using a clean profile when I
> move the mouse over the map, either after page load finished or during page
> load (once map images are displayed).
> 
> Sample crash reports:
> bp-ea030708-e93c-4847-8b7a-f8b6a2110503
> bp-7e16d21d-48a8-4201-8697-c443d2110503
> 
> The stack is different though... A quick scan shows some
> js::DefaultValue(...) crashes in nightly 6.0a1 with similar stack as mine
> and others with the same two or three frames at the top of stack as in
> comment 0.
> 
> I don't relish the thought of making a testcase from this URL, but would it
> be useful?

I bisected on nightlies and found this bug to be first present in the 2011-03-19 mozilla-central build. Given that range, the regressing changeset is almost certainly:

changeset:   63445:23455773db73
user:        Kyle Huey <khuey@kylehuey.com>
date:        Fri Mar 18 17:37:46 2011 -0400
summary:     Bug 641325: Turn PGO back on for JS. rs=ted a=sayrer

Maybe the first thing to try is to turn PGO back off for JS and see what happens in nightly builds? We're currently averaging 1.8/day in nightlies, so if this a major cause, we should be able to see some effect fairly quickly.
Sure, you can turn off PGO in nightlies if you want to watch the crash numbers trend, but I don't think that turning off PGO is acceptable to get this to RESOLVED FIXED.
Without knowing whether PGO is actually at fault: Is there a really compelling perf win to make it worth wasting a bunch of time tracking down why the compiler is breaking?

Keep in mind that the JS win from PGO is only going to decrease over time as we focus on (1) extending JIT coverage to all areas of the VM and (2) improving the VM itself.
Turn it off and see!

(When we turned it back on in March it was a roughly 10% win on Sunspider, etc).
(In reply to comment #26)
> Without knowing whether PGO is actually at fault: Is there a really
> compelling perf win to make it worth wasting a bunch of time tracking down
> why the compiler is breaking?
> 
> Keep in mind that the JS win from PGO is only going to decrease over time as
> we focus on (1) extending JIT coverage to all areas of the VM and (2)
> improving the VM itself.

this sounds reasonable given that time is short to get 5.0 stable.

it looks like PGO could also be behind the new instability in this bug (#4 topcrash on 5.0beta) and bug 645775 (#5 topcrash) , and maybe others in the top crash list that we haven't had a chance to investigate yet.

let get it turned of for the next round of beta builds, check crash stats, and make the call on if PGO is ready for this area, or if the other things dvander outlined will be in place  soon enough.

we really can't afford to give up any chance of added instability for perf wins based on the crash rates we we on 4.0x. and this early 5.0 crash data.
Version: Trunk → 2.0 Branch
The simple way to turn it off is just to add:
NO_PROFILE_GUIDED_OPTIMIZE = 1

to js/src/Makefile.in. You might want to make that ifeq (WINNT,$(OS_ARCH)) so you don't disable PGO on Linux builds as well. This will probably be a large perf hit, but that's not a regression from FF4.
Patch to disable PGO per Ted's instructions. This is in order to test how much of this crash (and also bug 645775) is from PGO on the nightly builds. If/when we get data on that, we can better decide what to do for the release.
Attachment #535438 - Flags: review?(ted.mielczarek)
Comment on attachment 535438 [details] [diff] [review]
Patch to disable PGO for for JS

Review of attachment 535438 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #535438 - Flags: review?(ted.mielczarek) → review+
OK, all reviewed.

dmandelin, can you check in?
(In reply to comment #32)
> OK, all reviewed.
> 
> dmandelin, can you check in?

http://hg.mozilla.org/mozilla-central/rev/7d2a3d61a377
Sheila, we want to get this on the beta branch as well, right?
yes, and in time for the next round of builds.
Attachment #535438 - Flags: approval-mozilla-beta+
Please post to .tree-management about the expected perf regression here.
Why are we marking this fixed for 5?  Do we have crashstats numbers showing the decline in crashes?
(In reply to comment #38)
> Why are we marking this fixed for 5?  

I'm guessing asa marked it that way since the patch to backout pgo is held here and marked landed from here.

yeah, that makes it a bit messy.

maybe we should have spun off another bug for that patch and make this bug and the other sharply rising js crash bugs in 5.0 dependent on that.

I posted that list to tree-management.

https://bugzilla.mozilla.org/show_bug.cgi?id=659785 #1 5.0 topcrash
https://bugzilla.mozilla.org/show_bug.cgi?id=637267 $4 5.0 topcrash
https://bugzilla.mozilla.org/show_bug.cgi?id=595635 #6 5.0 topcrash
https://bugzilla.mozilla.org/show_bug.cgi?id=602803 #8 5.0 topcrash

and maybe others further down that we haven't looked at yet.

> Do we have crashstats numbers showing
> the decline in crashes?

nope, not yet.  we need a new beta build and to get 250k+ users on it to do the right kind of analysis.  if there is a way to expedite this it would really help.
This may have regressed Ts on windows - 

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=7a4900cfb25a&tochange=7d2a3d61a377

http://tinyurl.com/42nhg8g

The regression was from 370 -> 380. There was another regression to 395 shortly after which has since been backed out.
Perf regressions are a known consequence of this patch. They're not regressions compared to Firefox 4, just compared to trunk.
its looking like this landed on mozilla central just *after* the cut over of 6.0 on to the aurora channel, making 6.0a2 crashier.

Can someone recheck to see if patch needs to land there?
we'd like this for aurora (sorry if we are asking for something different now).
Attachment #535438 - Flags: approval-mozilla-aurora+
(In reply to comment #42)
> its looking like this landed on mozilla central just *after* the cut over of
> 6.0 on to the aurora channel, making 6.0a2 crashier.

We don't have any data that confirm it actually made it crashier, the data suggests that crashiness remained the same, it just did shift existing crashes to different signatures due to different inlining.
Crash Signature: [@ js::DefaultValue(JSContext*, JSObject*, JSType, js::Value*) ]
Ted asked me to give a bit of a clearer statement here on my assessment of the situation of crashes with/without JS PGO.

Most importantly:

After turning off PGO on beta, we could not see an measurable improvement of overall crash rates (main comparison here is 5.0b2 vs. 5.0b3).
Some signatures that looked to be new (e.g. js::Interpret, js::DefaultValue, js::Invoke, even js::gc::MarkChildren) disappeared or decreased significantly, but ones we've know well as high-rate crashers (js::mjit::EnterMethodJIT, to some extent js::gc::MarkObject) came back to the top levels.
This is easy to explain by different inlining decision the compiler makes as part of PGO.
My conclusion from looking through the numbers over the last weeks is that JS PGO does not regress crash rates, it just shifts crashes to be reported with different signatures, which is what we saw in 5.0b2 mainly.

Followup thoughts/comments:
1) Right now, PGO is ON on aurora, while it's off on central, beta and release.
2) Aurora currently reports with higher than usual crashes/user, but it looks like that was a different issue, fixed late last week. If crash rates go down to a similar level as all other channels within a few days to a week from now, this could be one more data point showing that PGO does not regress overall crash rate.
3) I'd propose to turn PGO ON on central again as well, and leave it on Aurora. Due to the 5.0b2 experience, we now know what signatures the crashes move to, which makes analysis easier. 5 should of course stay as it is, with well-known signatures, we can let "moar speed" slip to 6.
4) If possible, looking into improving signature generation for those crashes would be helpful, so we still see that e.g. we're actually crashing in MethodJIT (#0 frames are always only an address in that case).
5) We should try to investigate that actual causes of those GC and MethodJIT crashes.
I was unable to reproduce the crash using the steps in comment 19 using a PGO build built with MSVC 2010.  I'll catch up with dmandelin in the office tomorrow to make sure that I'm doing it right.

Given that we're going to move to MSVC 2010 Real Soon Now (TM), I would recommend disabling PGO for the js engine on Aurora and waiting until we switch compilers (probably for Gecko 8) to turn PGO back on.
Landed on mozilla-beta for Firefox 6:

http://hg.mozilla.org/releases/mozilla-beta/rev/3aac9d8c3452

Apparently I missed this during Aurora as I thought I already transplanted it. Thanks khuey for calling attention to it.
It is #4 top crasher in 6.0b1/20110705195857.
IIRC, DefaultValue is not present in FF7 top-crashes.  Perhaps this is just a different inlining decision, but I don't see any callers of DefaultValue to take its place.  Assuming the issue has been fixed, one candidate is http://hg.mozilla.org/releases/mozilla-aurora/rev/d2250fc608cc which first appeared on FF7.  Can you think of anything crashy that you fixed in this patch Waldo?
Nope, can't.  (I skimmed the topcrashers a bit further down to see if the crash maybe bifurcated a bit, but none of the signatures look like anything this change would have spawn.)  No idea what happened here.  Serendipity!  :-)
I'm not sure this is connected but I spotted js::ValueToNumberSlow in

https://crash-analysis.mozilla.com/crash_stacks/Stack-summary-8.0a1.txt

....Signature number: 201-js::ValueToNumberSlowJSContext,js::Value,double
https://bugzilla.mozilla.org/buglist.cgi?bug_id=
______ distribution of 1 different stacks, looking at top 10 frames
      1  stacks like
0|0|mozjs.dll|js::ValueToNumberSlow(JSContext *,js::Value,double *)
0|1|mozjs.dll|js::ValueToECMAUint32Slow(JSContext *,js::Value const &amp;,unsigned int *)
0|2|mozjs.dll|JS_ValueToECMAUint32
0|3|xul.dll|XPCConvert::JSData2Native(XPCCallContext &amp;,void *,unsigned __int64,nsXPTType const &amp;,int,nsID const *,unsigned int *)
0|4|xul.dll|XPC_WN_CallMethod(JSContext *,unsigned int,unsigned __int64 *)
0|5|mozjs.dll|CallCompiler::generateNativeStub()
0|6|mozjs.dll|js::mjit::ic::NativeCall(js::VMFrame &amp;,js::mjit::ic::CallICInfo *)
0|7|mozjs.dll|SearchTable
0|8|mozjs.dll|js::mjit::EnterMethodJIT(JSContext *,js::StackFrame *,void *,js::Value *)
0|9|mozjs.dll|js::mjit::JaegerShot(JSContext *)

an there are some varitions of js::mjit::stubs:: related calls on stacks in
https://crash-analysis.mozilla.com/crash_stacks/Stack-summary-7.0a2.txt

but no js::mjit::stubs::GreaterThan like in comment 0
The js:DefaultValue instances we are seeing on the first 6.0 beta (which inadvertently has JS PGO turned on, comment #47 landed _after_ that build, so only the second beta will have it off) all seems to have their stack running through some js::mjit::* so I suspect that those crashes get converted by different inlining decisions to report as js::mjit::EnterMethodJIT on non-JS-PGO builds, like a number of other signatures we've seen.

Given that, I'm inclined to say we should mark this bug FIXED due to the patch turning JS PGO off that has landed from here, potentially changing the summary to state "turn off JS PGO", and use a new bug for issues like what comment #51 is pointing out. This bug is getting confusing.
Crash Signature: [@ js::DefaultValue(JSContext*, JSObject*, JSType, js::Value*) ] → [@ js::DefaultValue(JSContext*, JSObject*, JSType, js::Value*) ] [@ js::DefaultValue(JSContext*, JSObject*, JSType, JS::Value*)]
Can the crash also be reproduced in present edition ( 11/12/13)?
If not, we may enable js PGO.

Mozilla has already moved to VC2010, so this bug may disappear...
JS PGO is currently broken on trunk. See bug 721284.
It still happens at a very low volume:
* 32 crashes in 22.0
* 4 in 23.0b9
but not with the stack trace of comment 0. See https://crash-stats.mozilla.com/report/list?product=Firefox&signature=js%3A%3ADefaultValue%28JSContext*%2C+JS%3A%3AHandle%3CJSObject*%3E%2C+JSType%2C+JS%3A%3AMutableHandle%3CJS%3A%3AValue%3E%29
Status: ASSIGNED → RESOLVED
Crash Signature: [@ js::DefaultValue(JSContext*, JSObject*, JSType, js::Value*) ] [@ js::DefaultValue(JSContext*, JSObject*, JSType, JS::Value*)] → [@ js::DefaultValue(JSContext*, JSObject*, JSType, js::Value*) ] [@ js::DefaultValue(JSContext*, JSObject*, JSType, JS::Value*)] [@ js::DefaultValue(JSContext*, JS::Handle<JSObject*>, JSType, JS::MutableHandle<JS::Value>) ]
Closed: 11 years ago
Resolution: --- → WORKSFORME
Summary: Crash in js::ValueToNumberSlow [@ js::DefaultValue(JSContext*, JSObject*, JSType, js::Value*) ] → Crash in js::ValueToNumberSlow @ js::DefaultValue
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: