mResultVal in IDBRequest has a bad state during CycleCollection

RESOLVED FIXED in B2G C4 (2jan on)

Status

defect
--
critical
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: ggrisco, Assigned: gwagner)

Tracking

(Depends on 1 bug, {crash})

unspecified
B2G C4 (2jan on)
ARM
Gonk (Firefox OS)
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:-, firefox19 unaffected, firefox20 unaffected, firefox21 unaffected, firefox22 unaffected, b2g18+ fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 fixed)

Details

(Whiteboard: [b2g-crash][BTG-1006][BTG-1038], crash signature)

Attachments

(4 attachments, 3 obsolete attachments)

(Reporter)

Description

6 years ago
Description of test:
1) MT Call with a connection Bt Carkit
2) FM on Background.
3) Phone got Crashed and MiniDumps Generated.

Seen on AU 177.


Top Frames
----------
Crash reason:  SIGSEGV
Crash address: 0xc

Thread 3 (crashed)
 0  libxul.so!js_GetGCThingTraceKind [jsgc.h : 120 + 0xc]
     r4 = 0x00000000    r5 = 0x42eccd34    r6 = 0x42eccd34    r7 = 0x00000000
     r8 = 0x0000000c    r9 = 0x000000ad   r10 = 0x40e89589    fp = 0x000000ad
     sp = 0x42ecccb8    lr = 0x40e8dda3    pc = 0x413782cc
    Found by: given as instruction pointer in context
 1  libxul.so!CheckParticipatesInCycleCollection [XPCJSRuntime.cpp : 378 + 0x3]
     r4 = 0x00000000    r5 = 0x42eccd34    r6 = 0x42eccd34    r7 = 0x00000000
     r8 = 0x0000000c    r9 = 0x000000ad   r10 = 0x40e89589    fp = 0x000000ad
     sp = 0x42ecccb8    pc = 0x40e8dda3
    Found by: call frame info
 2  libxul.so!mozilla::dom::indexedDB::IDBRequest::cycleCollection::TraceImpl [IDBRequest.cpp : 311 + 0x3]
     r4 = 0x485affa0    r5 = 0x40e8dd95    r6 = 0x42eccd34    r7 = 0x00000000
     r8 = 0x0000000c    r9 = 0x000000ad   r10 = 0x40e89589    fp = 0x000000ad
     sp = 0x42ecccc8    pc = 0x40d31cdd
    Found by: call frame info
(Reporter)

Updated

6 years ago
Whiteboard: BTG-1006
(Reporter)

Updated

6 years ago
Blocks: 808607
blocking-b2g: --- → tef?
Assignee: nobody → echou
blocking-b2g: tef? → tef+

Updated

6 years ago
Severity: normal → critical
Crash Signature: [@ js_GetGCThingTraceKind]
Keywords: crash
Whiteboard: BTG-1006 → [b2g-crash][BTG-1006]
Flags: needinfo?(ggrisco)
(Reporter)

Comment 3

6 years ago
Similar stacktrace was seen again while doing:

1. Make MO calls.
2. Receive MT calls.
3. Enable GPS.
4. Enable BT.
5. Turn Airplane Mod on/off(5-10 times)

Will attach new minidump
Flags: needinfo?(ggrisco)
(Reporter)

Comment 4

6 years ago
(Reporter)

Updated

6 years ago
Whiteboard: [b2g-crash][BTG-1006] → [b2g-crash][BTG-1006][BTG-1038]
We looked at this before, and I think this will be fixed by backporting some of part 3 of bug 811206, specifically the fixes to IDBWrapperCache.  I think it is okay to ReleaseWrapper() and DROP_JS_OBJECTS things that aren't held.
Andrew is going to do comment 5.
Assignee: khuey → continuation
This is a backport of a small part of bug 811206 relevant to this class we're crashing in.
Attachment #705461 - Attachment is obsolete: true
Comment on attachment 705470 [details] [diff] [review]
Always DROP in the destructor of the wrapper cache

Review of attachment 705470 [details] [diff] [review]:
-----------------------------------------------------------------

I'm having trouble pushing to try at the moment, but I'll do that eventually.  I don't expect any problems to show up there.
Attachment #705470 - Flags: review?(khuey)
Attachment #705470 - Flags: review?(bugs)
Attachment #705470 - Flags: review?(bugs) → review+
Try was being uncooperative, so thanks to jlebar for pushing this to try:
https://tbpl.mozilla.org/?tree=Try&rev=2f65f5542af5
Fyi, this patch was very unhappy on try.
I failed to notice that we need a null check for XPConnect in the drop-it-like-its-hot world.
I'll post a link to a new try run if my try push ever goes through.
Comment on attachment 705738 [details] [diff] [review]
Always DROP in the destructor of IDBWrapperCache

This passes the IDB XPCshell tests locally.
Attachment #705738 - Flags: review?(bugs)
Attachment #705738 - Flags: review?(bugs) → review+
https://hg.mozilla.org/releases/mozilla-b2g18/rev/67b9d435c57b

I'm not sure if this should be closed before it is verified, so I'll just leave it open...
Attachment #705470 - Attachment is obsolete: true
Followup to replace the NS_ENSURE_TRUE with a plain branch so we don't spam the log with WARNING in XPCshell. r=smaug over IRC.

https://hg.mozilla.org/releases/mozilla-b2g18/rev/92259dfdcbd6

Also, my push hit a disturbingly related, probably intermittent crash:

PROCESS-CRASH | /Users/cltbld/talos-slave/test/build/xpcshell/tests/dom/indexedDB/test/unit/test_overlapping_transactions.js | application crashed [@ nsXPCOMCycleCollectionParticipant::CheckForRightISupports]

...

Crash reason:  EXC_BAD_ACCESS / 0x0000000d
Crash address: 0x0

Thread 0 (crashed)
 0  XUL!nsXPCOMCycleCollectionParticipant::CheckForRightISupports [nsCycleCollectionParticipant.cpp : 47 + 0xe]
 1  XUL!mozilla::dom::indexedDB::IDBWrapperCache::cycleCollection::TraceImpl [IDBWrapperCache.cpp : 28 + 0x4]
 2  XUL!TraceJSHolder [XPCJSRuntime.cpp : 321 + 0xc]
 3  XUL!PL_DHashTableEnumerate [pldhash.cpp : 716 + 0xe]
 4  XUL!XPCJSRuntime::TraceXPConnectRoots [nsBaseHashtable.h : 221 + 0x12]
Backed out for the aforementioned crashes and Windows xpcshell timeouts.
https://hg.mozilla.org/releases/mozilla-b2g18/rev/324b933aec07
The crash is intermittent. I've reproduced it locally once or twice.  While this particular test has been flakey in the past, I looked back at a number of prior pushes and couldn't see any other failures like this, so I must have somehow caused it.
It also looks like this may have made the IDB tests in M2 more prone to timing out in dom/indexedDB/test/test_add_put.html.

I was able to reproduce the crash a few times at first, but I haven't had any success since then.
This reproduces 100% of the time on Linux64 debug, just running the IDB XPCShell test_overlapping_transactions.js.

What is happening is that we're doing DROP on things we haven't HELD, so sJSGCThingRootCount becomes too small. (At one point it goes negative) It just so happens on Linux64 that the last layout statics reference is always the one held by Hold/DropJSHolders, so we end up calling release on nsLayoutStatics, which sets sXPConnect to NULL, so we stop actually removing dying JS holders from the XPConnect table.  But of course, XPConnect is still alive and well, so when we run the GC during shutdown CC we touch dead objects and crash.  It seems likely this could show up intermittently on another platform, thus hopefully explaining our other problems.

I think the best solution to this problem is to backport the bits if 811206 that make sJSGCThingRootCount, so we keep it alive for long enough.  Other solutions might be a little simpler, but would require coming up with new logic, which sounds like a terrible idea at this stage in the game.
This patch should now be the entirety of Olli's patch in bug 811206, excluding debug-only things, and some nulling out and reorderings that shouldn't affect anything.  Well, I left out a change to AudioBuffer, but I believe that WebAudio is disabled on b2g18.

Hopefully this more thorough try run will expose problems before I land it:
https://tbpl.mozilla.org/?tree=Try&rev=19e9721f5b4d

Locally, this fixed the Linux64 XPCshell problems.
Attachment #705738 - Attachment is obsolete: true
Attachment #706191 - Flags: review?(bugs)
Comment on attachment 706191 [details] [diff] [review]
Always DROP in ~IDBWrapperCache, allow double dropping

Humbug. M1 and M3 are leaking nsXPConnect, presumably because we're not DROPping enough, somewhere, and thus failing to cause nsLayoutStatics::Release enough.  (The B and M2 failures are because this push is on top of a broken push, 0e76f133a710.)  Olli, can you think of what might be causing the problem?  Or maybe we should just cut to the chase and use all of the seemingly benign changes from part 3 of bug 811206.
Attachment #706191 - Flags: review?(bugs) → feedback?(bugs)
Odd to see nsJSObjWrapper and BrowserNPObject in the leak logs. Have we done some
plugin handling fixes after FF18?
I don't know what kind of object BrowserNPObject is. Never seen it in leak logs before this.
Can't find any relevant plugin handling changes after 18, so perhaps we just happen to leak
BrowserNPObject when something else is leaked.
Haven't yet found any missing drop calls in b2g18 either...
On top of the previous patch.
Just a guess... if we unlink and then somehow get access to DOMRequest again
and preservewrapper, then mResult would be come owning ref again.
DOMRequest's Trace doesn't check mRooted value.

https://tbpl.mozilla.org/?tree=Try&rev=63fef613a943
Comment on attachment 706333 [details] [diff] [review]
Clear mResult in DOMRequest::UnrootResultVal()

Er, wait, in b2g18 we do clear mResult explicitly in UNLINK
Just to make sure Try is playing nicely I pushed a dummy patch.
Comment on attachment 706191 [details] [diff] [review]
Always DROP in ~IDBWrapperCache, allow double dropping

(I don't see anything wrong with this)
Attachment #706191 - Flags: feedback?(bugs) → feedback+
So, pushing to try from b2g18 gives us those M1 and M3 leaks. Fun.
Comment on attachment 706191 [details] [diff] [review]
Always DROP in ~IDBWrapperCache, allow double dropping

Should we push this to b2g18 if we can't rely on try? r=me for the patch.
Attachment #706191 - Flags: review+
Woah, that's odd.  I guess I'll try that in a bit.
Thanks for investigating.  Hopefully this will be okay...

https://hg.mozilla.org/releases/mozilla-b2g18/rev/3e673c17e680
gwagner has been able to reproduce this.  The patch I landed doesn't fix anything.

We're calling TraceJSObject with aScriptThing that is NULL, which means that JSVAL_IS_TRACEABLE is true, but JSVAL_TO_TRACEABLE returns NULL.  I guess I should have realized that earlier.

gwagner said that the value is something that looks like a string, but isn't one, which would explain it.  The question is, how does this bogus value get there.
Assignee: continuation → anygregor
(Assignee)

Updated

6 years ago
Depends on: 834995
Is this still happening with bug 834995 fixed?
(Assignee)

Comment 37

6 years ago
I spent a lot of time trying to reproduce the bug yesterday. First, it seemed that bug 834995 didn't fix it, but after a clean flash I couldn't reproduce it any more.
I am currently in SD and can't start a 2h monkey test on my machine :( Can someone else try to reproduce it?
Can we still reproduce this?
Flags: needinfo?(ggrisco)
Closing this out until we catch it again, given recent comments.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Flags: needinfo?(ggrisco)
Resolution: --- → WORKSFORME
(Assignee)

Comment 40

6 years ago
I still see this on b2g18 :(
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(Assignee)

Comment 41

6 years ago
There is a small difference with the recent crash I saw: We have a jsval that looks like an Object and not like a String:

Trace = 0x4125ac21 <mozilla::dom::indexedDB::IDBRequest::cycleCollection::TraceImpl(void*, void (*)(void*, char const*, void*), void*)>}}, mSource = {<nsCOMPtr_base> = {
      mRawPtr = 0x49d98b50}, <No data fields>}, mTransaction = {mRawPtr = 0x44dfd0e0}, mResultVal = {data = {asBits = 18446743554018508800, s = {payload = {i32 = 0, u32 = 0, boo = 0, 
          str = 0x0, obj = 0x0, ptr = 0x0, why = JS_ARRAY_HOLE, word = 0, uintptr = 0}, tag = JSVAL_TAG_OBJECT}, asDouble = -nan(0xfff8700000000), asPtr = 0x0}}, 
  mError = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mActorParent = 0x0, mErrorCode = NS_OK, mHaveResultOrErrorCode = true, mFilename = {<nsAString_internal> = {
      mData = 0x4d0ed948, mLength = 61, mFlags = 5}, <No data fields>}, mLineNo = 69}
(Assignee)

Comment 42

6 years ago
smaug, could this be fixed with bug 811206
Flags: needinfo?(bugs)
(Assignee)

Comment 43

6 years ago
(In reply to Gregor Wagner [:gwagner] from comment #42)
> smaug, could this be fixed with bug 811206

A current bt: http://www.pastebin.mozilla.org/2114703
(Assignee)

Updated

6 years ago
Summary: Mobile terminated call while connected to BT car kit and FM on in background causes crash → mResultVal in IDBRequest has a bad state during CycleCollection
That is what Andrew was trying to figure out, and also I, and based on the comment 35 the answer is no.
Flags: needinfo?(bugs)
(Assignee)

Comment 45

6 years ago
The bug always showed up after around 2hours during monkey testing b2g18 but I couldn't reproduce on mc during a 5hours test.

Maybe I should give bug 811206 a try.
smaug, mccr8, any other bugs that might be related?
I backported most of 811206 in the patch from comment 34.  The remaining bits are debug-only, or moving a little bit of clearing of stuff around.  Feel free to try the latter.

Did you try adding in checks in various places for the invalid value?

khuey also had the idea of adding in a dummy field at the location of mResultVal in case there is something quite weird happening.
Gregor - been nearly a week since last update. What are the next steps?
(Assignee)

Comment 48

6 years ago
I can't reproduce it any more because I always find other bugs that might be related :( bug 840377
Depends on: 840377
(Assignee)

Comment 49

6 years ago
Something is overwriting our mResultVal.

I added mResultVal2 to IDBRequest right after mResultVal and assign mResultVal to it right when we get it from the helper.
When we crash in the cycle collector, mResultVal2 is still OK:

(gdb) p tmp.mResultVal
$7 = {data = {asBits = 18446743554018508800, s = {payload = {i32 = 0, u32 = 0, boo = 0, str = 0x0, obj = 0x0, ptr = 0x0, why = JS_ARRAY_HOLE, word = 0, uintptr = 0}, tag = JSVAL_TAG_OBJECT}, asDouble = -nan(0xfff8700000000), asPtr = 0x0}}
(gdb) p tmp.mResultVal2
$8 = {data = {asBits = 18446743555304233296, s = {payload = {i32 = 1285724496, u32 = 1285724496, boo = 1285724496, str = 0x4ca29950, obj = 0x4ca29950, ptr = 0x4ca29950, why = 1285724496, word = 1285724496, uintptr = 1285724496}, tag = JSVAL_TAG_OBJECT}, 
    asDouble = -nan(0xfff874ca29950), asPtr = 0x4ca29950}}
(Assignee)

Comment 50

6 years ago
The object we are overwriting holds a string that starts with 'blue', afterwards gdb crashed :(

We should look in the Bluetooth code!

$34 = 0x4a9e1070 "b"
(gdb) p ((char *)((JSString *)$21).d.u1.chars)[0]
$35 = 98 'b'
(gdb) p ((char *)((JSString *)$21).d.u1.chars)[2]
$37 = 108 'l'
(gdb) p ((char *)((JSString *)$21).d.u1.chars)[4]
$39 = 117 'u'
(gdb) p ((char *)((JSString *)$21).d.u1.chars)[6]
$40 = 101 'e'
(Assignee)

Updated

6 years ago
Depends on: 840831
(Assignee)

Comment 51

6 years ago
mrbkap and me spent the whole afternoon yesterday investigating this problem.
Short version: Looks like random memory corruption and we shouldn't block on it and it doesn't seem to be bluetooth related as mentioned earlier.

The crash happens because mResultVal in IDBRequest looks like an object or string pointer (as a jsval) but the actual pointer value is null.
I added a dummy val right before mResultVal and I couldn't reproduce the crash.
Adding a mResultVal2 after mResultVal and assigning the actual content of mResultVal when we get it from the helper shows that we get a correct value from the helper.
Like in comment 49 it's an Object or it can also be a string that comes back from a successful put: http://hg.mozilla.org/mozilla-central/file/081cf5b0121e/dom/settings/SettingsManager.js#l89
mrbkap mentioned that IDBRequests are reused. So the result could also come from an earlier use.

The strange thing is that this only happens during cycle collection. We also asserted that during IDBRequest deconstruction we have a valid value in mResultVal but this assert never triggered.

During debugging I also found bug 840377 and bug 840831. They might be related since they also show some kind of random memory corruption.
The overwritten value in this bug is at offset 40 and in bug 840831 it is also at offset 40.

The conclusion is that something overwrites random values at offset 40 to nullptr and most of them are related to indexedDB and maybe to cycle collection.

We should try to get this in a replay box or maybe one of the cycle collection gurus has a better idea how to debug this.
blocking-b2g: tef+ → tef?
Thanks Gregor - we'll track but not block here.
blocking-b2g: tef? → -
tracking-b2g18: --- → +
CCs happen frequently, so it seems possible that the CC is just the one touching it because it is the first thing to look at this memory after the corruption happens.
Not seen in subsequent stability testing, closing.
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago6 years ago
Resolution: --- → WORKSFORME
Spoke too soon, this came back overnight in AU_LINUX_GECKO_ICS_STRAWBERRY_V1.01.00.01.19.022.  :(
Identical stack trace as in comment 0
Spoke too soon, this came back overnight in AU_LINUX_GECKO_ICS_STRAWBERRY_V1.01.00.01.19.022.  :(
Identical stack trace as in comment 0
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(except that the crashed thread was 0 instead of 3.  still the main process)
Let's say we fixed this in bug 845148 (as I'm pretty sure we did). If we see this again during testing we can reopen of course.
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.