Closed Bug 924525 Opened 11 years ago Closed 11 years ago

Crashes @ js::JSONParser::advance() | js::ParseJSONWithReviver | js::Invoke on secondary thread, mostly on OS X

Categories

(Core :: JavaScript Engine, defect)


Tracking


RESOLVED WORKSFORME
Tracking Status
firefox26 - ---
firefox27 - ---

People

(Reporter: smichaud, Unassigned)

References


Details

Crash Data

Attachments

(4 files)

These appear to have started quite recently, and are almost entirely on OS X 10.8. E.g.:

bp-12309981-f0df-41be-a106-a68202131008

But I did find one on Windows:

bp-94889ee8-332c-4ae9-8664-0832f2131007

These should not be confused with crashes at js::JSONParser::advance() on the main thread, which go back a long way (in FF versions) on Windows (at low frequency).  E.g.:

bp-5be37622-7d3c-4cc2-9041-18bb02130924
bp-be912e21-82e1-4ef1-a950-811c72131002
As I mentioned, there are currently very few of these on Windows (and none yet on Linux).  But they're the number 4 Mac topcrasher on the 27 branch.
Crash Signature: [@ js::JSONParser::advance() ]
Summary: Crashes @ js::JSONParser::advance() | js::ParseJSONWithReviver | | js::Invoke on secondary thread, mostly on OS X → Crashes @ js::JSONParser::advance() | js::ParseJSONWithReviver | js::Invoke on secondary thread, mostly on OS X
I find one of these on the Aurora branch:  bp-204b08e8-262a-4159-be57-31c832131003

This suggests that the trigger is a patch that landed recently on both the 27 and 26 branches at more or less the same time.
The earliest builds I can find with this crash are both from 2013-10-01:

firefox-2013-10-01-00-40-05-mozilla-aurora (OS X)
bp-204b08e8-262a-4159-be57-31c832131003

firefox-2013-10-01-03-02-04-mozilla-central (Windows)
bp-94889ee8-332c-4ae9-8664-0832f2131007

It's odd that the oldest of all is the Aurora build.  This suggests that the trigger patch landed some days before 2013-09-30.
0 	XUL 	js::JSONParser::advance() 	/builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/x86_64/js/src/../../../../js/src/jsonparser.cpp
1 	XUL 	js::JSONParser::parse(JS::MutableHandle<JS::Value>) 	/builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/x86_64/js/src/../../../../js/src/jsonparser.cpp
2 	XUL 	js::ParseJSONWithReviver(JSContext*, JS::StableCharPtr, unsigned long, JS::Handle<JS::Value>, JS::MutableHandle<JS::Value>) 	/builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/x86_64/js/src/../../../../js/src/json.cpp
3 	XUL 	json_parse 	/builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/x86_64/js/src/../../../../js/src/json.cpp
4 	XUL 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/jscntxtinlines.h
5 	XUL 	Interpret 	js/src/vm/Interpreter.cpp
6 	XUL 	js::RunScript(JSContext*, js::RunState&) 	js/src/vm/Interpreter.cpp
7 	XUL 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
8 	XUL 	js::CallOrConstructBoundFunction(JSContext*, unsigned int, JS::Value*) 	js/src/jsfun.cpp
9 	XUL 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/jscntxtinlines.h
10 	XUL 	Interpret 	js/src/vm/Interpreter.cpp
11 	XUL 	js::RunScript(JSContext*, js::RunState&) 	js/src/vm/Interpreter.cpp
12 	XUL 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
13 	XUL 	js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) 	js/src/vm/Interpreter.cpp
14 	XUL 	JS_CallFunctionValue(JSContext*, JSObject*, JS::Value, unsigned int, JS::Value*, JS::Value*) 	js/src/jsapi.cpp
15 	XUL 	mozilla::dom::workers::EventListenerManager::DispatchEvent(JSContext*, mozilla::dom::workers::EventTarget const&, JSObject*, mozilla::ErrorResult&) const 	/builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/x86_64/dom/workers/../../../../dom/workers/EventListenerManager.cpp
16 	XUL 	mozilla::dom::EventTargetBinding_workers::dispatchEvent 	dom/workers/EventTarget.h
17 	XUL 	mozilla::dom::EventTargetBinding_workers::genericMethod 	obj-firefox/x86_64/dom/bindings/EventTargetBinding.cpp
18 	XUL 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/jscntxtinlines.h
19 	XUL 	js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) 	js/src/vm/Interpreter.cpp
20 	XUL 	JS_CallFunctionName(JSContext*, JSObject*, char const*, unsigned int, JS::Value*, JS::Value*) 	js/src/jsapi.cpp
21 	XUL 	mozilla::dom::workers::events::DispatchEventToTarget(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>, bool*) 	dom/workers/Events.cpp
22 	XUL 	(anonymous namespace)::EventRunnable::WorkerRun(JSContext*, mozilla::dom::workers::WorkerPrivate*) 	dom/workers/XMLHttpRequest.cpp
23 	XUL 	mozilla::dom::workers::WorkerRunnable::Run() 	dom/workers/WorkerPrivate.cpp
24 	XUL 	mozilla::dom::workers::WorkerPrivate::DoRunLoop(JSContext*) 	dom/workers/WorkerPrivate.cpp
25 	XUL 	(anonymous namespace)::WorkerThreadRunnable::Run() 	dom/workers/RuntimeService.cpp
26 	XUL 	nsThread::ProcessNextEvent(bool, bool*) 	/builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/x86_64/xpcom/threads/../../../../xpcom/threads/nsThread.cpp
27 	XUL 	NS_ProcessNextEvent(nsIThread*, bool) 	/builds/slave/m-in-osx64-0000000000000000000/build/obj-firefox/x86_64/xpcom/build/../../../../xpcom/glue/nsThreadUtils.cpp
28 	XUL 	nsThread::ThreadFunc(void*) 	/builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/x86_64/xpcom/threads/../../../../xpcom/threads/nsThread.cpp
29 	libnss3.dylib 	_pt_root 	
30 	libsystem_c.dylib 	_pthread_start 	
31 	libsystem_c.dylib 	thread_start 	
32 	libnss3.dylib 	null_signal_handler
Given what I said in comment #3, I suspect the trigger for these crashes is the patch for bug 920782.
Blocks: 920782
I don't think bug 920782 is likely to be involved. It should only affect off-thread parsing (i.e., not DOM workers), and its only effect should be to avoid doing a GC.
> I don't think bug 920782 is likely to be involved.

Then do you have any other suggestions for the trigger?

This bug isn't reproducible (yet).  Maybe it would help to have URLs.
Keywords: needURLs
I don't know. It's hard to infer much from a single crash on aurora.

Many of these crashes seem to be coming from the same person. All the CPU and graphics card data are the same. Some of the crashes include an email address for someone at Mozilla. I'll send an email and CC you Steven.
(In reply to Steven Michaud from comment #8)
> > I don't think bug 920782 is likely to be involved.
> 
> Then do you have any other suggestions for the trigger?

I think the trigger is a patch that landed on Talkilla.

In bug 924449 I wrote:

(Florian Quèze [:florian] [:flo] in bug 924449 comment #0)
> https://crash-stats.mozilla.com/report/index/af5a4303-f0d8-442d-9d8f-
> ac6df2131008
> 
> We started seeing these crashes frequently on Talkilla
> (https://talkilla.mozillalabs.com/) just after we moved some of our code
> from the SocialAPI frameworker to a real web worker spawned from the
> SocialAPI frameworker.
> 
> I don't have steps to reproduce, it seems to happen randomly but frequently
> (several times per hour) when Talkilla is installed on a current nightly.

I suspect the reason the crash happens mostly on OS X is that Talkilla developers mostly use MacBooks.
Nothing useful in the URLs:

Total Count 	URL
8 	about:home
4 	http://localhost:8060/
4 	about:blank
2 	https://login.persona.org/sign_in
2 	about:crashes
1 	https://crash-stats.mozilla.com/report/index/3d4569fa-6cc8-4f74-872b-eb4cf2131008
1 	https://www.facebook.com/
1 	https://www.google.ca/?gws_rd=cr&ei=bOFSUounFoaQqgHU94BY
1 	https://crash-stats.mozilla.com/report/index/d1e5f82a-a7fb-488c-bd56-d1fa72131008
1 	about:sessionrestore

Possibly interesting add-on correlation:

  js::JSONParser::advance()|EXC_BAD_ACCESS / 0x0000000d (26 crashes)
     69% (18/26) vs.  14% (18/130) restartless.restart@erikvold.com
     54% (14/26) vs.  12% (15/130) inspector@mozilla.org (DOM Inspector, https://addons.mozilla.org/addon/6622)
Keywords: needURLs
Note: I suspect the Talkilla URLs might not be submitted, as they're loaded via the SocialAPI and workers.

One of my profiles seems to crash a lot, another one doesn't. For Florian, I know he said all of his profiles crash frequently.
Looking through the current crash reports, I notice that all have at least one extension (besides the default one), but it isn't always restartless.restart@erikvold.com (v9) or inspector@mozilla.org (v2.0.14).

There's also firebug@software.joehewitt.com (v1.12.3) -- usually by itself.  And restartless.restart@erikvold.com and inspector@mozilla.org are often (but not always) installed together.

> firefox-2013-10-01-00-40-05-mozilla-aurora (OS X)
> bp-204b08e8-262a-4159-be57-31c832131003
> 
> firefox-2013-10-01-03-02-04-mozilla-central (Windows)
> bp-94889ee8-332c-4ae9-8664-0832f2131007

These two outliers, which are also the earliest builds, also have unique extensions:

e4a8a97b-f2ed-450b-b12d-ee082ba24781 (Greasemonkey v1.12) and firebug@software.joehewitt.com (v1.12.2) for the first, and ad0d925d-88f8-47f1-85ea-8463569e756e (RESTClient v2.0.3, https://addons.mozilla.org/en-us/firefox/addon/restclient/) for the second.

So, though these crashes appear to only happen with extensions, I still suspect they've been triggered by a recent change, on both the 26 and 27 branches, somewhere in the Mozilla tree.
From the initial comment in bug 924449, it sounds like this was triggered by a change in Talkilla:

"We started seeing these crashes frequently on Talkilla (https://talkilla.mozillalabs.com/) just after we moved some of our code from the SocialAPI frameworker to a real web worker spawned from the SocialAPI frameworker.

"I don't have steps to reproduce, it seems to happen randomly but frequently (several times per hour) when Talkilla is installed on a current nightly."

Firefox code is at fault, but I doubt that there was any recent Firefox change that caused this. I think our best bet is to do some testing of Talkilla using debug builds and see if we catch anything in gdb.
Oops.  I assumed Talkilla was an extension of some kind, or used one.
(In reply to Bill McCloskey (:billm) from comment #15)

> do some testing of
> Talkilla using debug builds and see if we catch anything in gdb.

When I start my local debug build on the same profile, it crashes almost at startup with this assertion:
Assertion failure: mParentSuspended, at dom/workers/WorkerPrivate.cpp:2344

I filed a separate bug 924970 on this, but it may be related.
I can reproduce frequently both on a current nightly and on Aurora (26.0a2 (2013-10-10)).
I can't reproduce on beta (Fx 25).

I've also reproduced it 6 times on a local debug build. I'm attaching the reports from the Apple crash reporter; not sure if they contain any useful information.
(In reply to Bill McCloskey (:billm) from comment #15)
> I think our best bet is to do some testing of
> Talkilla using debug builds and see if we catch anything in gdb.

I have this crash stopped in gdb on my local debug build right now, if you are interested.
(In reply to Bill McCloskey (:billm) from comment #15)

> Firefox code is at fault, but I doubt that there was any recent Firefox
> change that caused this. I think our best bet is to do some testing of
> Talkilla using debug builds and see if we catch anything in gdb.

If not a recent change, and depending on how likely a user is to hit this, we might not track here.  Is this something happening frequently in the wild?
These are currently the #2 Mac topcrasher on the 27 branch (with 45 crashes in the last week) and the #4 topcrasher on the 26 branch (with 17 crashes in the last week).  I don't know if they happen only with Talkilla, or have anything to do with Talkilla code.  But they seem to be much easier to trigger using Talkilla, and (apparently) make using current Talkilla quite painful.

So part of the question is how important is Talkilla to us on the Mac?

Another part is how reproducible is this bug?  We don't yet have STR, but we seem quite close.
Interestingly, I see lots of crashes on both 26 and 27, but so far I haven't seen any on my current 25 (built from <http://hg.mozilla.org/releases/mozilla-beta/rev/8b2c180a7b5d>), which gives some places to look/bisect for the actual introduction on the JS engine side.
Steps to reproduce for me look like this, on tonight's nightly (27.0a1 (2013-10-14)), on OS X 10.8.5:

* create a new profile
* go to talkilla.mozillalabs.com
* click:
** activate talkilla
** enable services
* sign in with a persona login
* iconify the browser
* wait a while

After I did this, the first crash happened in < 15 minutes.  I restarted, and another crash happened in < 25 mins.
Attached file json-crash-bt-full.txt
dmose suggested on IRC that a full gdb backtrace may be useful, so I'm attaching one.
Could you attach an all-thread backtrace?

To get a "full format" one you'd (I think) do "thread apply bt full all".  But, though I won't be working on this bug myself, I personally find the non-full format easier to read.  So I suggest you also attach the output of "thread apply bt all".
To really have good STR for this bug, we need steps that take at most a few minutes to perform (and are at least 50% effective).
Attachment 816638 [details] says:
> #2  0x0000000105c4c6d5 in js::ParseJSONWithReviver (cx=0x125bae4c0, chars=<value temporarily unavailable, due to optimizations>, length=985634854918794669, reviver=

    >>> hex(985634854918794669)
    '0xdadadadadadadad'

Probably GC-related.
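
That length value isn't random garbage either (a side note based on my understanding of our Nightly/Aurora builds, not something in the reports themselves): 0xda is the pattern crash-diagnostics builds poison freed GC cells with, and JSString currently stores its length shifted left past 4 flag bits, so a swept string header reads back as exactly this value.  A standalone check, illustrative only:

  // poison-check.cpp -- illustrative only, not Firefox code. Verifies that the
  // bogus length in attachment 816638 is the 0xDA poison word with the low
  // 4 flag bits shifted off, i.e. a JSString header that has already been swept.
  #include <cstdint>
  #include <cstdio>

  int main() {
      const uint64_t badLength  = 985634854918794669ULL;   // from the gdb output above
      const uint64_t poisonWord = 0xdadadadadadadadaULL;   // 0xDA repeated

      std::printf("badLength       = 0x%016llx\n",
                  static_cast<unsigned long long>(badLength));
      std::printf("poisonWord >> 4 = 0x%016llx\n",
                  static_cast<unsigned long long>(poisonWord >> 4));
      std::printf("%s\n", badLength == (poisonWord >> 4) ? "match" : "no match");
      return 0;
  }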
Flags: needinfo?(terrence)
For those who can reproduce this bug, it might be worth testing with the oldest build you can find that supports Talkilla, in the same profile(s) where you normally crash.  No, I don't know what that build is ... but I hope someone else might.

If crashes don't happen with that build, the FF changes that cause/trigger this bug are presumably fairly recent.  And it'd presumably be possible to find a regression range among builds that support Talkilla (hopefully among mozilla-central builds that support Talkilla).
(In reply to Steven Michaud from comment #30)

> If crashes don't happen with that build, the FF changes that cause/trigger
> this bug are presumably fairly recent.

As noted in comment 18 and comment 22, Firefox 25 (currently in beta) doesn't crash; Firefox 26 (currently on Aurora) does.
You need to test with nightlies (on whatever branch).
I attempted to find a regression range using mozilla-central nightlies.

Unfortunately, nightlies as old as 2013-08-03 (before mozilla-central became moz26, and the first nightly with the patch from bug 896860) already crash. I'm not sure if the crash is the same, as crash-stats doesn't seem to have symbols for these old nightlies, but the behavior is similar (random crash after 10-20 minutes and the crashing thread is not the UI thread).

Example of crashes I'm seeing with these old nightlies:
bp-38d16d61-6c66-43fb-8a65-7d4902131016
bp-1bd507be-2428-4769-a96c-b941c2131016
bp-200417d7-94ab-4d46-bd33-2ef022131016

So I think this means that either:
- the crash I'm seeing is a different one because something else was broken; and finding a regression range will be impossible (I tried 2013-08-17, 2013-08-10, 2013-08-06, 2013-08-04 and 2013-08-03; they all crashed).
- something happened while Mozilla25 was in aurora or beta that made the crash disappear there.
I used atos (atos -o XUL -arch x86_64 0xNNNNNNNN) on the 2013-08-03-03-02-05-mozilla-central nightly to translate the top few lines of your crash stacks:

0   XUL     js::JSONParser::advance() (in XUL) + 32
1   XUL     js::JSONParser::parse(JS::MutableHandle<JS::Value>) (in XUL) + 624
2   XUL     js::ParseJSONWithReviver(JSContext*, JS::StableCharPtr, unsigned long,
                                     JS::Handle<JS::Value>,
                                     JS::MutableHandle<JS::Value>) (in XUL) + 231
3   XUL     nsRunnable::Release() (in XUL) + 46
4           @0x100000000
5   XUL     mozilla::dom::workers::XMLHttpRequest::
              Send(nsAString_internal const&, mozilla::ErrorResult&) (in XUL) + 101
6   XUL     mozilla::dom::XMLHttpRequestBinding_workers::
              abort(JSContext*, JS::Handle<JSObject*>,
                    mozilla::dom::workers::XMLHttpRequest*,
                    JSJitMethodCallArgs const&) (in XUL) + 192
7   XUL     mozilla::dom::XMLHttpRequestBinding_workers::
              send(JSContext*, JS::Handle<JSObject*>,
                   mozilla::dom::workers::XMLHttpRequest*,
                   JSJitMethodCallArgs const&) (in XUL) + 676

0   XUL     XUL@0x2064130
1   XUL     XUL@0x2064b50
2   XUL     XUL@0x2062807
3   XUL     XUL@0x1644a0e
4           @0x100000000
5   XUL     XUL@0xa9aa95
6   XUL     XUL@0x1559700
7   XUL     XUL@0x15599b4

They do appear to be the same crash as reported here.

Which means the crash reports I mentioned in comment #3 aren't the earliest examples of these crashes, and that we won't be able to use Socorro to help find a regression range for this bug :-(

If you're willing to go to the effort, it may still be worthwhile to look for a regression range "by hand" in mozilla-central nightlies.  If you see stacks without symbols, use atos to check the top few lines.  Or you can run each nightly, use "gdb firefox [pid]" to attach to its process, then look at the crash stack in gdb.  mozilla-central nightlies don't have their symbols stripped, so gdb's crash stacks should be accurate.

But did the patch for bug 896860 turn on support for Talkilla?  If so there's no point.
> They do appear to be the same crash as reported here.

They also, like all this bug's other crashes, happen on secondary threads.
(In reply to Steven Michaud from comment #34)
> I used atos (atos -o XUL -arch x86_64 0xNNNNNNNN) on the
> 2013-08-03-03-02-05-mozilla-central nightly to translate the top few lines
> of your crash stacks:
[...]
> They do appear to be the same crash as reported here.

Thanks, this is very helpful!

It may be interesting to look for the 'regression' range in Mozilla 25 builds on aurora/beta to find which changeset made that crash disappear for moz25.

> mozilla-central nightlies don't have their symbols
> stripped, so gdb's crash stacks should be accurate.

Interesting.

> But did the patch for bug 896860 turn on support for Talkilla?

More or less. Talkilla can't use workers (and so can't trigger this crash) without this patch.
If we/I go to the effort of making local builds to bisect further, I could try backporting that patch (it's a two-line change).

At this point, all the possible ways to get more information seem like a lot of effort, so maybe we should just see what Terrence has to say about the stacks?
So this is neither a topcrash nor, as far as we can tell, something that is being hit on any major sites.  It's not a really new regression either.  We will not track this unless something from the above list changes; if that's the case, please renominate.
After looking at this with Florian today, it seems that the string has been collected before it gets to json_parse; it repros with Ion and Baseline off; and the amount of code between when the string ("[]") enters the network adapter and when it enters json_parse is extremely small.

Although we were not able to find exactly when the offending GC is happening, this looks very much like some code in workers+XHR not keeping the string alive across a GC. Specifically, it is xhr.responseText that is getting corrupted. I tried to look for the source, but there is at least one value and several strings named mResponse* in that section of code, and the management of them is not straightforward. One thing that might be important is that these have all been inline strings.
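
To make the suspected failure mode concrete -- a minimal sketch only, with a hypothetical helper name and roughly the JSAPI of this era, not the actual worker/XHR code path -- the difference between the safe pattern and the one that would produce exactly this crash looks like this:

  // Sketch only: ParseResponseText is hypothetical, not real worker/XHR code.
  #include "jsapi.h"

  static bool
  ParseResponseText(JSContext* cx, JS::Handle<JSString*> responseText,
                    JS::MutableHandle<JS::Value> result)
  {
      // |responseText| is rooted (it's a Handle), so a GC triggered anywhere
      // below cannot sweep the string or its characters out from under us.
      size_t length;
      const jschar* chars = JS_GetStringCharsAndLength(cx, responseText, &length);
      if (!chars)
          return false;

      // If chars/length were instead cached from an *unrooted* JSString* and a
      // GC ran before parsing, length would read back as 0xdadad... poison and
      // JSONParser::advance() would run off the end -- the crash in this bug.
      return JS_ParseJSON(cx, chars, uint32_t(length), result);
  }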

Bent, is there any good documentation about how the response text gets processed? Do you know of anywhere the code looks at the raw characters from the relevant JSString?
Flags: needinfo?(terrence) → needinfo?(bent.mozilla)
Hm, with the new bindings, Worker XHR now just holds mResponseText as an nsString. It then passes that to JS through the (generated) bindings code. On trunk, that currently looks like this:

  static bool
  get_responseText(JSContext* cx, JS::Handle<JSObject*> obj,
                   mozilla::dom::workers::XMLHttpRequest* self,
                   JSJitGetterCallArgs args)
  {
    ErrorResult rv;
    DOMString result;
    self->GetResponseText(result, rv);
    rv.WouldReportJSException();
    if (rv.Failed()) {
      return ThrowMethodFailedWithDetails(cx, rv, "XMLHttpRequest", "responseText");
    }
    if (!xpc::StringToJsval(cx, result, args.rval().address())) {
      return false;
    }
    return true;
  }

That's a pretty common code path...
Flags: needinfo?(bent.mozilla)
I've added this in the XHR onload callback:
  try {
    dump("xhr.responseText = " + xhr.responseText + "\n");
  } catch(e) {
    dump("caught error: " + e+"\n");
  }
and just before the crash it prints this in my terminal:
  caught error: InternalError: allocation size overflow

With gdb I put a breakpoint in js_ReportAllocationOverflow, and the stack I get starts like this:
#0  js_ReportAllocationOverflow (cxArg=0x126203dc0) at mozilla-central/js/src/jscntxt.cpp:421
#1  0x0000000105b1b636 in JSString::validateLength (maybecx=0x126203dc0, length=985634854918794688) at String-inl.h:108
#2  0x00000001059edcd3 in js::ConcatStrings<(js::AllowGC)0> (cx=0x126203dc0, left=0x128c3a640, right=0x128e69020) at mozilla-central/js/src/vm/String.cpp:395
#3  0x00000001058aaca0 in AddOperation (cx=0x126203dc0, lhs={<js::MutableHandleBase<JS::Value>> = {<js::MutableValueOperations<JS::MutableHandle<JS::Value> >> = {<js::UnbarrieredMutableValueOperations<JS::MutableHandle<JS::Value> >> = {<js::ValueOperations<JS::MutableHandle<JS::Value> >> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, ptr = 0x1156460a8}, rhs={<js::MutableHandleBase<JS::Value>> = {<js::MutableValueOperations<JS::MutableHandle<JS::Value> >> = {<js::UnbarrieredMutableValueOperations<JS::MutableHandle<JS::Value> >> = {<js::ValueOperations<JS::MutableHandle<JS::Value> >> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, ptr = 0x1156460b0}, res=0x1156460a8) at mozilla-central/js/src/vm/Interpreter.cpp:1118

Full stack at: http://pastebin.mozilla.org/3305117

I then looked at frame #2 (ConcatStrings).

The left string has a length of 19; the right string is "huge"/invalid:
(gdb) p (char *)rstr->getChars(cx)
$22 = 0xdadadadadadadada <Address 0xdadadadadadadada out of bounds>
(In reply to Florian Quèze [:florian] [:flo] from comment #41)
Yes, xhr.responseText has been swept. That's effectively the same crash as before.
Depends on: 905382
Let me know if the Talkilla problem is fixed now.  Some code that hopefully fixes the problem just merged to central, but it is different from the patch I had before.
I'm assuming Andrew meant to add a needinfo here.
Flags: needinfo?(florian)
(In reply to Andrew McCreight [:mccr8] from comment #44)
> Let me know if the talkilla problem is fixed now.  Some code that hopefully
> fixes the problem just merged to central, but it is different than the patch
> I had before.

We can't reproduce the crash with Talkilla anymore, so I think you actually fixed it, thanks! :-)
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(florian)
Resolution: --- → WORKSFORME
Issue is resolved - clearing old keywords - qa-wanted clean-up