Closed Bug 960828 Opened 10 years ago Closed 10 years ago

Intermittent PROCESS-CRASH application crashed [@ XPCWrappedNative::GetNewOrUsed(xpcObjectHelper&, XPCWrappedNativeScope*, XPCNativeInterface*, XPCWrappedNative**)][@ js::GetGlobalForObjectCrossCompartment(JSObject*)]

Categories

(Core :: XPConnect, defect)

All
Android
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 986843
mozilla30

People

(Reporter: KWierso, Assigned: bholley)

References

Details

(Keywords: crash, intermittent-failure, Whiteboard: [leave open])

Crash Data

Attachments

(2 files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=33126554&tree=Fx-Team
slave: panda-0328



14:32:43     INFO -  REFTEST TEST-START | http://10.12.130.22:30328/jsreftest/tests/jsreftest.html?test=ecma/Math/15.8.2.2.js
14:32:43     INFO -  REFTEST TEST-LOAD | http://10.12.130.22:30328/jsreftest/tests/jsreftest.html?test=ecma/Math/15.8.2.2.js | 443 / 2106 (21%)
14:32:43     INFO -  INFO | automation.py | Application ran for: 0:05:10.829212
14:32:43     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmpSb2g3fpidlog
14:32:43     INFO -  /data/anr/traces.txt not found
14:32:44     INFO -  mozcrash INFO | Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/fx-team-android/1389906679/fennec-29.0a1.en-US.android-arm.crashreporter-symbols.zip
14:33:31  WARNING -  PROCESS-CRASH | http://10.12.130.22:30328/jsreftest/tests/jsreftest.html?test=ecma/Math/15.8.2.2.js | application crashed [@ XPCWrappedNative::GetNewOrUsed(xpcObjectHelper&, XPCWrappedNativeScope*, XPCNativeInterface*, XPCWrappedNative**)]
14:33:31     INFO -  Crash dump filename: /tmp/tmps2aIT8/6fcbbac9-6bbe-5ca9-2ba6b063-21061672.dmp
14:33:31     INFO -  Operating system: Android
14:33:31     INFO -                    0.0.0 Linux 3.2.0+ #2 SMP PREEMPT Thu Nov 29 08:06:57 EST 2012 armv7l pandaboard/pandaboard/pandaboard:4.0.4/IMM76I/5:eng/test-keys
14:33:31     INFO -  CPU: arm
14:33:31     INFO -       2 CPUs
14:33:31     INFO -  Crash reason:  SIGILL
14:33:31     INFO -  Crash address: 0x63191256
14:33:31     INFO -  Thread 13 (crashed)
14:33:31     INFO -   0  libxul.so!XPCWrappedNative::GetNewOrUsed(xpcObjectHelper&, XPCWrappedNativeScope*, XPCNativeInterface*, XPCWrappedNative**) [nsAutoPtr.h:52058b81ff1f : 906 + 0x6]
14:33:31     INFO -       r4 = 0x6bf8e744    r5 = 0x5d3e9448    r6 = 0x6ae862c0    r7 = 0x5bdd4a10
14:33:31     INFO -       r8 = 0x6bf8e744    r9 = 0x00000001   r10 = 0x5d3e9488    fp = 0x6ae862c0
14:33:31     INFO -       sp = 0x5d3e9378    lr = 0x62c187fb    pc = 0x63191256
14:33:31     INFO -      Found by: given as instruction pointer in context
The fact that Android's crashing here in "any damn test at all" makes me think xpconnect (or, xpconnect should blame Fennec, either one) rather than the owner of the particular test.
Component: JavaScript Engine → XPConnect
Version: unspecified → Trunk
Of the four crashes I looked at, all but one (the one in comment 0 of this bug) are crashing at 0x14, so it looks like there's a null deref at this line:
  http://mxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCWrappedNative.cpp#378

Well, it looks like we're making a recursive call back into GNOU, then hitting a null deref at this line:
  302     Native2WrappedNativeMap* map = Scope->GetWrappedNativeMap();

So I guess GetObjectScope of parent is returning null, then because it is different than scope, we recurse and crash.  I'm not sure how that should be fixed, though.
Summary: Intermittent PROCESS-CRASH | jsreftest.html?test=ecma/Math/15.8.2.2.js | application crashed [@ XPCWrappedNative::GetNewOrUsed(xpcObjectHelper&, XPCWrappedNativeScope*, XPCNativeInterface*, XPCWrappedNative**)] → Intermittent PROCESS-CRASH application crashed [@ XPCWrappedNative::GetNewOrUsed(xpcObjectHelper&, XPCWrappedNativeScope*, XPCNativeInterface*, XPCWrappedNative**)]
(In reply to Andrew McCreight [:mccr8] from comment #6)
> Of the four crashes I looked at, all but one (the one in comment 0 of this
> bug) are crashing at 0x14, so it looks like there's a null deref at this
> line:
>  
> http://mxr.mozilla.org/mozilla-central/source/js/xpconnect/src/
> XPCWrappedNative.cpp#378
> 
> Well, it looks like we're making a recursive call back into GNOU, then
> hitting a null deref at this line:
>   302     Native2WrappedNativeMap* map = Scope->GetWrappedNativeMap();
> 
> So I guess GetObjectScope of parent is returning null, then because it is
> different than scope, we recurse and crash.  I'm not sure how that should be
> fixed, though.

Any ideas, Bobby? :)
Flags: needinfo?(bobbyholley)
Hm, that's pretty bad. GetObjectScope should more or less never return null modulo stuff with old-school JSD machinery, which I doubt is getting invoked on b2g here. I'm worried that someone is passing in an object from a different thread or something. :-(
Flags: needinfo?(bobbyholley)
Armv6 Fennec, rather than b2g.
(In reply to Bobby Holley (:bholley) from comment #18)
> Hm, that's pretty bad. GetObjectScope should more or less never return null
> modulo stuff with old-school JSD machinery, which I doubt is getting invoked
> on b2g here. I'm worried that someone is passing in an object from a
> different thread or something. :-(

Who was this comment directed at?
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #46)
> (In reply to Bobby Holley (:bholley) from comment #18)
> > Hm, that's pretty bad. GetObjectScope should more or less never return null
> > modulo stuff with old-school JSD machinery, which I doubt is getting invoked
> > on b2g here. I'm worried that someone is passing in an object from a
> > different thread or something. :-(
> 
> Who was this comment directed at?

The void. This isn't something I can figure out by inspection.
This is happening too frequently to leave hanging. Any suggestions for who can help?
I can try landing some diagnostics. Moment.
Attached patch Diagnostics. v1Splinter Review
Attachment #8375703 - Flags: review?(continuation)
Comment on attachment 8375703 [details] [diff] [review]
Diagnostics. v1

Review of attachment 8375703 [details] [diff] [review]:
-----------------------------------------------------------------

I guess spamming the log in opt builds right before we crash is okay.
Attachment #8375703 - Flags: review?(continuation) → review+
https://hg.mozilla.org/mozilla-central/rev/1cad5f5cc647
Assignee: nobody → bobbyholley
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
I doubt the diagnostics fixed this.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
And morphing:
https://tbpl.mozilla.org/php/getParsedLog.php?id=34852023&tree=Fx-Team
https://tbpl.mozilla.org/php/getParsedLog.php?id=34856253&tree=Fx-Team
https://tbpl.mozilla.org/php/getParsedLog.php?id=34847298&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=34861692&tree=Mozilla-Inbound
Summary: Intermittent PROCESS-CRASH application crashed [@ XPCWrappedNative::GetNewOrUsed(xpcObjectHelper&, XPCWrappedNativeScope*, XPCNativeInterface*, XPCWrappedNative**)] → Intermittent PROCESS-CRASH application crashed [@ XPCWrappedNative::GetNewOrUsed(xpcObjectHelper&, XPCWrappedNativeScope*, XPCNativeInterface*, XPCWrappedNative**)][@ js::GetGlobalForObjectCrossCompartment(JSObject*)]
https://tbpl.mozilla.org/php/getParsedLog.php?id=34864320&tree=Mozilla-Central

Do the new stacks help, Bobby? :)

Also CCing the other sheriffs to be aware of how the stacks are morphing on this now so we don't end up filing more dupes.
Flags: needinfo?(bobbyholley)
Hm. So the android logs don't appear to contain my printfs. But the fact that we're sometimes (and sometimes not) crashing when trying to invoke:

js::GetObjectClass(js::GetGlobalForObjectCrossCompartment(parent))

Implies that |parent| probably points to garbage.

It would be really helpful to know which PreCreate hook we're dealing with here, but it's tricky without the printfs. Ryan, do you have any idea why the printfs here aren't showing up in the log?

http://mxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCWrappedNative.cpp#380
Flags: needinfo?(bobbyholley) → needinfo?(ryanvm)
Geoff's the guy you want for Android harness questions.
Flags: needinfo?(ryanvm) → needinfo?(gbrown)
You just need to change "printf" to "printf_stderr". 

Output will appear in the logcat output, in the test log.
Flags: needinfo?(gbrown)
Attachment #8378704 - Flags: review?(continuation)
Comment on attachment 8378704 [details] [diff] [review]
More diagnostics. v1

Review of attachment 8378704 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/xpconnect/src/XPCWrappedNative.cpp
@@ +382,5 @@
> +                printf_stderr("IsMainThread: %u\n", (uint32_t) NS_IsMainThread());
> +                char* className = nullptr;
> +                sciWrapper.GetCallback()->GetClassName(&className);
> +                printf_stderr("SH Class Name: %s\n", className);
> +                nsMemory::Free(className);

probably not too important to free the memory right before a crash but okay. ;)

@@ +383,5 @@
> +                char* className = nullptr;
> +                sciWrapper.GetCallback()->GetClassName(&className);
> +                printf_stderr("SH Class Name: %s\n", className);
> +                nsMemory::Free(className);
> +                printf_stderr("IsMainThread: %u\n", (uint32_t) NS_IsMainThread());

You don't need to print IsMainThread twice, do you?
Attachment #8378704 - Flags: review?(continuation) → review+
(In reply to TBPL Robot from comment #80)

New diagnostics are working now:

02-20 11:46:00.558 I/Gecko   ( 1727): Uh oh, hit an object without a scope! Crashing shortly.
02-20 11:46:00.558 I/Gecko   ( 1727): IsMainThread: 1
02-20 11:46:00.558 I/Gecko   ( 1727): SH Class Name: Window
02-20 11:46:00.558 I/Gecko   ( 1727): plannedParent object class: ChromeWindow
02-20 11:46:00.558 I/Gecko   ( 1727): plannedParent Global class: ChromeWindow
02-20 11:46:00.558 I/Gecko   ( 1727): parent Object class: Window
02-20 11:46:01.510 I/DEBUG   (  937): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
02-20 11:46:01.510 I/DEBUG   (  937): Build fingerprint: 'nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys'
02-20 11:46:01.510 I/DEBUG   (  937): pid: 1727, tid: 1743  >>> org.mozilla.fennec <<<
02-20 11:46:01.510 I/DEBUG   (  937): signal 11 (SIGSEGV), fault addr 00000000
I am ruminating on this, FWIW.
So, it seems pretty clear that we're trying to get an XPCWN for an inner nsGlobalWindow whose mJSObject points to memory that has been partially trashed.

We have two sets of crashes. This analysis assumes that they're the same situation, and that the symptoms vary with the contents of the garbage memory.

The logs show us calling into nsWindowSH::PreCreate, which just stashes mJSObject into the outparam.

After this, we call GetObjectScope, which returns obj->compartment()->data->scope. In half the cases, this is null, and we print the logging and explicitly crash. In the other half, it's non-null, but probably garbage, since we appear to crash shortly thereafter trying to dereference it with GetWrappedNativeMap.

In the first case (and probably in the second), the JSObject returned from PreCreate has an intact JSClass whose ->name is "Window", suggesting that this JSObject* once did point to a valid global. But calling js::GetGlobalForObjectCrossCompartment (which effectively does *obj->compartment()->global_) returns a null ref.

So this all suggests that the global and compartment have been recently collected. If that were true, the finalize hook _should_ have caused us to null out mJSObject. This leaves 3 possibilities:
(1) The global was finalized, but the JS engine didn't properly trigger the finalize hook.
(2) The global was finalized, but the XPConnect finalize hook didn't manage to get all the way to nulling out mJSObject on the global.
(3) The global was not finalized (and the analysis above is wrong).

We could answer (2) by adding some release-mode assertions along the finalize path. I could add that kind of patch, if we thought it would be useful.

We could answer (1) vs (3) by logging, but we'd need some way to trigger this on a try push, since we couldn't do that kind of logging in the opt builds we ship. Ryan, do you have any thoughts on how to trigger this reliably on a try build?

Andrew, any thoughts on all of this?
Flags: needinfo?(ryanvm)
Flags: needinfo?(philringnalda)
Er.
Flags: needinfo?(philringnalda) → needinfo?(continuation)
(In reply to Bobby Holley (:bholley) from comment #92)
> Ryan, do you have any thoughts on how to trigger this
> reliably on a try build?

Not sure about the "reliably" part. Usually, we just retrigger until we hit it. This is frequent enough that I would suspect we could retrigger our way to it on Try without too much effort.

Looking at the TBPLbot comments here, I'd start with Armv6 reftest-2.
try: -b o -p android-armv6 -u plain-reftest-2 -t none
Flags: needinfo?(ryanvm)
Hmm.  I don't have any particularly concrete thoughts here.  We have had problems where an object ends up in an arena with an incorrect finalize kind, so we don't run the finalizer.

I wonder if the globals were finalizing could be flagged somewhere when they are unlinked or something, in the XPC hashtables?  I'm just trying to think of some way we could try to guess when something should have been finalized.  I'm not sure how that would work, though.
Flags: needinfo?(continuation)
Any updates here, Bobby? Still hitting us pretty frequently :)
Flags: needinfo?(bobbyholley)
It is on my list. I'm just swamped. Hopefully soon.
Flags: needinfo?(bobbyholley)
I have some cycles to look at this again. Here's a try push with some release-mode assertions and some logging:

https://tbpl.mozilla.org/?tree=Try&rev=4fcb09814e7d
Ryan, do you recommend heavy retriggering of the runs in comment 144? I just want to check before I blow some infra dollars.
Flags: needinfo?(ryanvm)
I think that unless there's a better way to trigger the crashes, it's the only real option we've got. I just got the ball rolling for you :)
Flags: needinfo?(ryanvm)
Ryan, should we retrigger more, or try a different configuration?
Flags: needinfo?(ryanvm)
(In reply to Bobby Holley (:bholley) from comment #162)
> Ryan, should we retrigger more, or try a different configuration?

Comment 168 is from your try push, so apparently retrigger more is the answer.
Flags: needinfo?(ryanvm)
Ok. So it appears that a Window's compartment is being destroyed without WrappedNativeFinalize being called. Andrew, do you have an idea of where to look next?
Flags: needinfo?(continuation)
I don't have any particular ideas, sorry.  The logging patch you posted in IRC looks good.  I think in general you are more in deep GC territory here so you are probably going to get more useful help from jonco or terrence.
Flags: needinfo?(continuation)
All hail the mysterious and fickle Android compiler.

https://tbpl.mozilla.org/?tree=Try&rev=d5788721b33a
Depends on: 986843
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: