Something (double-free?) is causing libmozsqlite3.so to be a little too crashy

RESOLVED DUPLICATE of bug 822398

Status

defect · P1 · major
RESOLVED DUPLICATE of bug 822398

People

(Reporter: m1, Assigned: justin.lebar+bug)

Tracking

Version: unspecified
Target Milestone: B2G C3 (12dec-1jan)
Hardware: ARM
OS: Gonk (Firefox OS)

Firefox Tracking Flags

(blocking-basecamp:+)

Details

(crash signature)

Attachments

(5 attachments)

We have seen five unique crashes with libmozsqlite3.so on the stack during basic stability testing so far.  Unfortunately, the STR are not easily prescribed.
   e.g. Alternate between receiving MT calls and taking pictures overnight.
 
Dogfooders have also seen some of these:
* https://crash-stats.mozilla.com/report/index/89e64411-0e20-4d86-b984-7277f2121129
* https://crash-stats.mozilla.com/report/index/05bd9934-2b04-4708-9359-db7992121127

Other crash dumps include:
* libmozsqlite3.so!fts3UpdateMethod [sqlite3.c : 116521 + 0x6]
* libmozsqlite3.so!sqlite3Step [sqlite3.c : 64219 + 0x0]
* libmozsqlite3.so!sqlite3Fts3ReadBlock [sqlite3.c : 125744 + 0x0]
* libmozsqlite3.so!sqlite3FkCheck [sqlite3.c : 87447 + 0x0]
(I can provide the minidump for these if needed.)

Seems like something is not quite right with the sqlite within Gecko.
Some things to do here:
 - get better backtraces out of socorro.  This should be the same code that runs for fennec-android; do we have bad binaries?
 - start some harder stress testing with scripting
 - sqlite folks, anything else we can do to characterize these crashes?  Do you see a pattern from what's here?  Nothing really jumps out at me.
Keywords: qawanted
(In reply to Michael Vines [:m1] from comment #0)
> Other crash dumps include:
> * libmozsqlite3.so!fts3UpdateMethod [sqlite3.c : 116521 + 0x6]
> * libmozsqlite3.so!sqlite3Step [sqlite3.c : 64219 + 0x0]
> * libmozsqlite3.so!sqlite3Fts3ReadBlock [sqlite3.c : 125744 + 0x0]
> * libmozsqlite3.so!sqlite3FkCheck [sqlite3.c : 87447 + 0x0]
> (I can provide the minidump for these if needed.)

Sure, or if you're able to get backtraces from the dumps and can share them (somehow), that would be a good start.

Comment 3 (7 years ago)
(In reply to Chris Jones [:cjones] [:warhammer] from comment #1)
>  - get better backtraces out of socorro.  This should be the same code that
> runs for fennec-android; do we have bad binaries?

We hope that bug 811335 is the silver bullet there.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #3)
> We hope that bug 811335 is the silver bullet there.

That's not likely to help much for these crashes. That fix is intended to help us find a second frame when the top frame is garbage. If there's already a second frame it's not going to make a difference.

Comment 5 (7 years ago)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #4)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #3)
> > We hope that bug 811335 is the silver bullet there.
> 
> That's not likely to help much for these crashes. That fix is intended to
> help us find a second frame when the top frame is garbage. If there's
> already a second frame it's not going to make a difference.

Oh, then we have a different problem altogether, I think, as we have a lot of reports like that, where we have a few frames and then garbage (only an address) before, after, or in the middle.
I don't know exactly what is going on here; the stack is a mess. Using my dump-lookup tool on the crashes from comment 0, the only things I can find are:
bp-05bd9934-2b04-4708-9359-db7992121127:
0xbea7ee0c: libmozsqlite3.so!sqlite3Pragma [sqlite3.c : 58299 + 0x2]
0xbea7f11c: libc.so!free [malloc_debug_common.c : 224 + 0x1]
0xbea7f20c: libmozglue.so!arena_avail_tree_insert [jemalloc.c : 3156 + 0x9]
0xbea7f924: plugin-container!main [MozillaRuntimeMain.cpp : 35 + 0x1]
0xbea7f934: libc.so!__libc_init [libc_init_dynamic.c : 114 + 0x9]
0xbea7fe50: libnspr4.so!PR_EnterMonitor [ptsynch.c : 519 + 0x3]

bp-89e64411-0e20-4d86-b984-7277f2121129:
0xbe9b9e0c: libmozsqlite3.so!fts3UpdateMethod [sqlite3.c : 116521 + 0x8]
0xbe9ba11c: libc.so!free [malloc_debug_common.c : 224 + 0x1]
0xbe9ba20c: libstlport.so!std::locale::~locale [locale.cpp : 361 + 0x1]
0xbe9ba23c: libstlport.so!std::_Locale_impl::make_classic_locale [locale_impl.cpp : 671 + 0x1]
0xbe9ba29c: libstlport.so!std::_Refcount_Base::_M_decr [_threads.h : 476 + 0x1]
0xbe9ba934: libc.so!__libc_init [libc_init_dynamic.c : 114 + 0x9]
0xbe9bad10: libnss3.so!pkix_BuildForwardDepthFirstSearch [pkix_build.c : 2085 + 0x0]
0xbe9bad48: libnss3.so!cert_CacheCRLByGeneralName [crl.c : 3055 + 0x1]
0xbe9bad84: libnss3.so!PK11_GetKeyMechanism [pk11mech.c : 149 + 0x11]
0xbe9badc0: libnss3.so!pkix_Build_ValidationCheckers [pkix_build.c : 1085 + 0xd]
0xbe9baeb4: libnss3.so!pkix_pl_String_Equals [pkix_pl_string.c : 190 + 0x9]
0xbe9baf68: libnss3.so!cert_CacheCRLByGeneralName [crl.c : 3055 + 0x1]
0xbe9baf80: libnss3.so!cert_CacheCRLByGeneralName [crl.c : 3052 + 0x2]
0xbe9bafa4: libnss3.so!cert_CacheCRLByGeneralName [crl.c : 3052 + 0x5]

This is just an exhaustive list of addresses on the crashing thread's stack that are inside known functions, so consider it a very loose suggestion of a stack trace.

Notably, these do both have free() on the stack, so it's possible there's some heap corruption happening somewhere.
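The approach behind dump-lookup is worth spelling out: scan every word of the crashed thread's stack and report those that land inside a known function. A minimal sketch of that scan follows, with the symbol table and lookup as illustrative placeholders rather than Breakpad's actual API:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative symbol-table entry: [start, start+size) -> function name. */
struct sym { uintptr_t start; size_t size; const char *name; };

static const struct sym *find_sym(uintptr_t addr, const struct sym *syms, size_t nsyms)
{
    for (size_t i = 0; i < nsyms; i++)
        if (addr >= syms[i].start && addr < syms[i].start + syms[i].size)
            return &syms[i];
    return NULL;
}

/* Walk the raw stack words and print every value that lands inside a
 * known function -- an exhaustive list, not a real unwind, which is why
 * the output above is only a "very loose suggestion of a stack trace". */
static void dump_lookup(uintptr_t sp, const uintptr_t *words, size_t nwords,
                        const struct sym *syms, size_t nsyms)
{
    for (size_t i = 0; i < nwords; i++) {
        const struct sym *s = find_sym(words[i], syms, nsyms);
        if (s)
            printf("0x%08lx: %s\n",
                   (unsigned long)(sp + i * sizeof(uintptr_t)), s->name);
    }
}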
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #6)
> Using my dump-lookup tool on the crashes from comment 0

Drool. Can you share this tool?  This is a *much* better backtrace than what I see using minidump_stackwalk.
It's this:
http://hg.mozilla.org/users/tmielczarek_mozilla.com/dump-lookup/

You need to download and build the Breakpad source:
http://code.google.com/p/google-breakpad/source/checkout

then you can build this tool. For best results you also need the matching symbols, which is sort of a pain ATM.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #6)
> Notably, these do both have free() on the stack, so it's possible there's
> some heap corruption happening somewhere.

Yeah, I'm betting that it's not sqlite at all.
Ted, can you make dump-lookup unwind inlined functions?  For example, the top of the first stack is sqlite3.c:58299, which corresponds to sqlite3VdbeAddOp3, which is presumably inlined into sqlite3Pragma.
dump-lookup can only work with what Breakpad provides it. If we had the actual DWARF for these libraries I could give you library offsets and you could go look those up, but I don't think we have any way to get the unstripped binaries.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #11)
> dump-lookup can only work with what Breakpad provides it. If we had the
> actual DWARF for these libraries I could give you library offsets and you
> could go look those up, but I don't think we have any way to get the
> unstripped binaries.

Surely releng keeps the unstripped binaries around and makes them available in some automatic fashion?  If not, shouldn't we change this?
a) I'm not actually sure if these are RelEng builds or not, these might be jgriffin's builds.
b) No, we've never historically done that. For Firefox builds we strip off the debug info and upload that separately to the symbol server. We were doing this for B2G builds but it broke the Android build system (bug 812397).

Comment 14 (7 years ago)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #13)
> a) I'm not actually sure if these are RelEng builds or not, these might be
> jgriffin's builds.

We are looking to switch the "official" builds from jgriffin's to RelEng's pretty soon, but right now the vast majority of users will have the former.

We've been having a lot of those reports in B2G where we have those garbage address-only frames in what is found as the stack, e.g. stacks like bp-dbd5fa1c-b315-49f8-8774-ff4012121203 also show that garbage pattern. If that's all heap corruption, we have a lot of heap corruption on B2G, which would worry me a lot.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #14)
> We've been having a lot of those reports in B2G where we have those garbage
> address-only frames in what is found as the stack, e.g. stacks like
> bp-dbd5fa1c-b315-49f8-8774-ff4012121203 also show that garbage pattern. If
> that's all heap corruption, we have a lot of heap corruption on B2G, which
> would worry me a lot.

bp-dbd5fa1c-b315-49f8-8774-ff4012121203 is one of our most common stability crashes here as well.   

Some other regular crashes we see that seem to be corruption-related, and that I've also located on crash-stats:
https://crash-stats.mozilla.com/report/index/e3d8ece7-a903-43e9-9727-f9b942121128
https://crash-stats.mozilla.com/report/index/07d36c94-e1db-4dd1-b9e7-5bb992121127
https://crash-stats.mozilla.com/report/index/713f5ece-0769-471c-abc7-6e3b02121129

I've got more of these that I haven't tried looking for on crash-stats yet.
Applying the dump-lookup tool:

(In reply to Michael Vines [:m1] from comment #15)
> https://crash-stats.mozilla.com/report/index/e3d8ece7-a903-43e9-9727-f9b942121128

0xbedc199c: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbedc1ca4: libnspr4.so!PR_AtomicDecrement [pratom.c : 281 + 0x1]
0xbedc1ce4: libnspr4.so!PR_AtomicDecrement [pratom.c : 281 + 0x1]
0xbedc1d3c: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbedc1d44: libmozglue.so!arena_malloc [jemalloc.c : 4142 + 0x1]
0xbedc1d64: libmozglue.so!realloc [jemalloc.c : 4476 + 0x5]
0xbedc1d74: libmozglue.so!imalloc [jemalloc.c : 4219 + 0xd]
0xbedc1d7c: libmozglue.so!realloc [jemalloc.c : 6539 + 0x7]
0xbedc1ea4: libmozglue.so!free [jemalloc.c : 6577 + 0x5]
0xbedc211c: libc.so!free [malloc_debug_common.c : 224 + 0x1]
0xbedc220c: libmozglue.so!arena_avail_tree_insert [jemalloc.c : 3156 + 0x9]
0xbedc2924: plugin-container!main [MozillaRuntimeMain.cpp : 35 + 0x1]
0xbedc2934: libc.so!__libc_init [libc_init_dynamic.c : 114 + 0x9]
<... and some non-useful stuff ...>

> https://crash-stats.mozilla.com/report/index/07d36c94-e1db-4dd1-b9e7-5bb992121127

0xbed635fc: libmozglue.so!arena_dalloc [jemalloc.c : 4651 + 0x1]
0xbed63674: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbed6367c: libmozglue.so!arena_dalloc [jemalloc.c : 4651 + 0x1]
0xbed63874: libnspr4.so!_PR_CreateThread [ptthread.c : 469 + 0x1]
0xbed638a4: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbed63904: libnspr4.so!PR_AtomicIncrement [pratom.c : 275 + 0x1]
0xbed63a1c: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbed63a54: libmozglue.so!calloc [jemalloc.c : 4227 + 0xd]
0xbed63a64: libnspr4.so!PR_Calloc [prmem.c : 444 + 0x1]
0xbed63a74: libnspr4.so!PR_SetThreadPrivate [prtpd.c : 157 + 0xf]
0xbed63aa4: b2g!main [nsBrowserApp.cpp : 154 + 0x11]
<... and then a bunch of stuff that I don't think is useful ...>

> https://crash-stats.mozilla.com/report/index/713f5ece-0769-471c-abc7-6e3b02121129

0xbeeae45c: libmozglue.so!arena_dalloc [jemalloc.c : 3317 + 0x1]
0xbeeae484: libmozglue.so!free [jemalloc.c : 6577 + 0x5]
0xbeeae69c: b2g!NS_StringSetDataRange [nsXPCOMGlue.cpp : 275 + 0x8]
0xbeeae874: libnspr4.so!_PR_CreateThread [ptthread.c : 469 + 0x1]
0xbeeae8a4: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbeeae904: libnspr4.so!PR_AtomicIncrement [pratom.c : 275 + 0x1]
0xbeeaea1c: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbeeaea54: libmozglue.so!calloc [jemalloc.c : 4229 + 0xd]
0xbeeaea64: libnspr4.so!PR_Calloc [prmem.c : 444 + 0x1]
0xbeeaea74: libnspr4.so!PR_SetThreadPrivate [prtpd.c : 157 + 0xf]
0xbeeaeaa4: b2g!main [nsBrowserApp.cpp : 154 + 0x11]
<... and more junk ...>
This sounds a lot like an allocator mismatch.  If instead the heap were corrupted, I'd expect to see us crash occasionally in malloc().  (Although perhaps the final stack in comment 16 is in calloc.)
blocking-basecamp: ? → +
We caught this overnight:

Crash reason:  SIGSEGV
Crash address: 0xc

Thread 36 (crashed)
 0  libmozsqlite3.so!pcache1RemoveFromHash [sqlite3.c : 35749 + 0x0]
 1  libmozsqlite3.so!pcache1EnforceMaxPage [sqlite3.c : 35765 + 0x5]
 2  libmozsqlite3.so!pcache1Destroy [sqlite3.c : 36215 + 0x3]
 3  libmozsqlite3.so!sqlite3PagerClose [sqlite3.c : 35118 + 0x9]
 4  libmozsqlite3.so!sqlite3BtreeClose [sqlite3.c : 49992 + 0x5]
 5  libmozsqlite3.so!sqlite3LeaveMutexAndCloseZombie [sqlite3.c : 112481 + 0x3]
 6  libmozsqlite3.so!sqlite3Close [sqlite3.c : 112430 + 0x3]
 7  libmozsqlite3.so!sqlite3_close [sqlite3.c : 112443 + 0x5]
 8  0x40fadf79
 9  libnspr4.so!PR_Unlock [ptsynch.c : 205 + 0x5]
10  0x40d7f83b
jst asked me to own this, although I'm not going to be able to look at it until next week at the earliest, so please feel free to steal this bug.
Assignee: nobody → justin.lebar+bug
Dr. Hipp, does the stack in comment 18 ring any bells for you? The corresponding source line in our version of sqlite is http://mxr.mozilla.org/mozilla-central/source/db/sqlite3/src/sqlite3.c#35749

Comment 21 (7 years ago)
Navigating off of the http://mxr.mozilla.org/... link in comment 20 causes my self-compiled nightly to latch up, using 100% CPU.  I let it run for 5 minutes before sending it a SIGINT and restarting.  Maybe I'm not compiling nightly right....

This link to the crash location works much better for me: http://www.sqlite.org/src/artifact/9fd22671c27?ln=471

I seem to recall crashes there before when there is heap corruption: when other components of the system continue writing into memory they have previously freed (and which SQLite subsequently allocated), or when they write past array bounds.  We have not previously encountered (that I recall) any bugs in SQLite itself that would cause a crash on that line.  Nor are there any known crash bugs currently in SQLite.  I queried the other SQLite development team members and they all concurred that this is likely a heap corruption issue.

We'll continue to monitor this ticket in case evidence emerges that our initial assessment is mistaken.
(In reply to D. Richard Hipp from comment #21)

Thanks for taking a look!
Can we get a priority and target milestone for this?
We're trying to nail down easily reproducible examples of this.  Our latest attempt:
  1. Make MT calls continuously, once every 30 seconds or so, to the device.
  2. Do not answer those calls.

After 45 minutes we caught:

Crash reason:  SIGSEGV
Crash address: 0x0

Thread 0 (crashed)
 0  libmozglue.so!jemalloc_crash [jemalloc.c : 1562 + 0x0]
 1  libmozglue.so!arena_dalloc [jemalloc.c : 3316 + 0x3]
 2  libmozglue.so!free [jemalloc.c : 6577 + 0x3]
 3  0x41604849


Step 1 was done using test equipment, but maybe QEMU can be instrumented to perform the equivalent?
Summary: libmozsqlite3.so is a little too crashy → Something is causing libmozsqlite3.so to be a little too crashy
That definitely looks like either heap corruption or a double-free. jlebar would probably know more. Just for my own info:
a) What does MT stand for?
b) What version are you testing (and what version of the Gecko code does that map to)?
Mobile Terminated.  Tip of beta.
(In reply to Damon Sicore (:damons) from comment #23)
> Can we get a priority and target milestone for this?

Marking as a P1 for C3 (given the necessity of valid crash info and the timeline in comment 19).
Priority: -- → P1
Target Milestone: --- → B2G C3 (12dec-1jan)
>  0  libmozglue.so!jemalloc_crash [jemalloc.c : 1562 + 0x0]

Now we're talking.  This is a fatal jemalloc assertion.

>  1  libmozglue.so!arena_dalloc [jemalloc.c : 3316 + 0x3]

Unfortunately this line number doesn't correspond to such an assertion, so I can't tell which one we're hitting.  Assuming the line number is sort of correct, we're in arena_run_reg_dalloc (which is inlined into arena_dalloc).  There are only three release asserts in arena_run_reg_dalloc:

> 3321 RELEASE_ASSERT(diff == regind * size);
> 3322 RELEASE_ASSERT(regind < bin->nregs);

These are both asserting that basically we computed the heap block size correctly.  If you free an internal pointer, I think we might hit this (because we'd incorrectly calculate |diff|, line 3286).  It's also possible that the heap became corrupted and we overwrote the bin's size variable (bin->reg_size in arena_dalloc_small).

> 3336 RELEASE_ASSERT((run->regs_mask[elm] & (1U << bit)) == 0);

This is checking that the element we're freeing is allocated.  That is, it's checking for double-frees (or frees of never-allocated things which happen to align correctly with something jemalloc might have allocated).

Since the failures we're seeing seem to be in free() and not in malloc() (or anywhere else), heap corruption seems less likely than the other possibilities.

Assuming that the assertion is actually happening in arena_run_reg_dalloc, this doesn't look like an allocator mismatch, because to get to arena_run_reg_dalloc, we passed some assertions which check that jemalloc owns this piece of memory (arena_dalloc_small, line 4526 checking run->magic == ARENA_RUN_MAGIC).

So my money is on a double-free, although that's not the only possibility.
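To make the regs_mask check concrete, a stripped-down sketch of the idea follows; this is not jemalloc's actual data structure, and the names and sizes are illustrative:

#include <assert.h>
#include <stdint.h>

#define REGS_PER_RUN 32

/* One "run" carving memory into REGS_PER_RUN equal regions; bit i of
 * regs_mask is 1 while region i is free and 0 while it is allocated. */
struct run {
    uint32_t regs_mask;
};

static void region_dalloc(struct run *run, unsigned regind)
{
    unsigned bit = regind % REGS_PER_RUN;
    /* Same shape as the RELEASE_ASSERT above: the bit must still be 0
     * (allocated).  If it is already 1, the region was freed before, or
     * was never handed out -- i.e. a double-free or wild free. */
    assert((run->regs_mask & (1U << bit)) == 0);
    run->regs_mask |= 1U << bit;
}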
> Per comment 26 this is on beta, so it looks like it's hitting this assertion:

Oh, indeed.  I tried to check out beta, but didn't notice I got an error.

So this sounds a lot like a double-free now.
Summary: Something is causing libmozsqlite3.so to be a little too crashy → Something (double-free?) is causing libmozsqlite3.so to be a little too crashy
Are there some jemalloc-related debug flags that we could enable for an automation run that might help track down the perp(s)?
(In reply to Michael Vines [:m1] from comment #31)
> Are there some jemalloc-related debug flags that we could enable for an
> automation run that might help track down the perp(s)?

If you got a full stack trace, you'd be able to see who was freeing the pointer and causing the crash.  It's not clear to me what additional diagnostics we could enable...
Unfortunately that was the full minidump stack trace.  We don't have a facility to maintain a ring buffer of the previous N free()s or something?  Might get lucky with that kind of construct if it is a double-free.
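For reference, the kind of construct being asked about is small. A purely illustrative sketch, not something that exists in jemalloc or the B2G builds discussed here:

#include <stddef.h>
#include <stdint.h>

#define FREE_LOG_SIZE 256

static void    *free_log[FREE_LOG_SIZE];
static uint32_t free_log_idx;

/* Record every pointer passed to free() in a small ring buffer... */
static void log_free(void *ptr)
{
    free_log[free_log_idx++ % FREE_LOG_SIZE] = ptr;
}

/* ...so that when a fatal assertion fires, the crash handler can report
 * whether the offending pointer was freed recently -- a strong hint of
 * a double-free. */
static int recently_freed(const void *ptr)
{
    for (size_t i = 0; i < FREE_LOG_SIZE; i++)
        if (free_log[i] == ptr)
            return 1;
    return 0;
}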
If you can save the minidump it's easy enough to run the dump-lookup tool on it to get more info out of the stack.
I have this running in the background; will see if I reproduce the same results.

Looking into valgrind ...

Comment 36 (7 years ago)
(In reply to Justin Lebar [:jlebar] from comment #28)
> >  0  libmozglue.so!jemalloc_crash [jemalloc.c : 1562 + 0x0]
> 
> Now we're talking.  This is a fatal jemalloc assertion.
> 
> >  1  libmozglue.so!arena_dalloc [jemalloc.c : 3316 + 0x3]
> 
> Unfortunately this line number doesn't correspond to such an assertion, so I
> can't tell which one we're hitting.  Assuming the line number is sort of
> correct, we're in arena_run_reg_dalloc (which is inlined into arena_dalloc).


We have a lot of crashes in that style; see e.g. bp-458b9310-7b95-4923-9c30-f98dd2121205
(In reply to Michael Vines [:m1] from comment #24)
> We're trying to nail down easily reproducible examples of this.  Our latest
> attempt:
>   1. Make MT calls continuously, once every 30 seconds or so, to the device.
>   2. Do not answer those calls.

To confirm, did you start these tests with a "clean phone", that is without the dialer app already running?

Comment 38 (7 years ago)
From the reports I generated today, the steps in bug 816334 might end up reproducing at least that jemalloc_crash coming from arena_dalloc.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #34)
> If you can save the minidump it's easy enough to run the dump-lookup tool on
> it to get more info out of the stack.

Here's what dump-lookup provides:

0xbeedd4cc: libmozglue.so!arena_dalloc [jemalloc.c : 3317 + 0x1]
0xbeedd4f4: libmozglue.so!free [jemalloc.c : 6577 + 0x5]
0xbeedd8e4: libnspr4.so!_PR_CreateThread [ptthread.c : 469 + 0x1]
0xbeedd914: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbeedd974: libnspr4.so!PR_AtomicIncrement [pratom.c : 275 + 0x1]
0xbeedda5c: b2g!MOZ_PNG_init_read_transf [pngrtran.c : 1245 + 0x3]
0xbeedda6c: libc.so!dlmalloc [dlmalloc.c : 4367 + 0x1]
0xbeedda74: libmozglue.so!arena_malloc [jemalloc.c : 4142 + 0x1]
0xbeeddab0: b2g!MOZ_PNG_init_read_transf [pngrtran.c : 1776 + 0x1]
0xbeeddac4: libc.so!pthread_create [pthread.c : 385 + 0x0]
0xbeeddb14: b2g!main [nsBrowserApp.cpp : 164 + 0x11]
0xbeede274: libmozglue.so!pages_map [jemalloc.c : 2402 + 0x1]
0xbeede2b4: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbeede2dc: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbeede2e4: libmozglue.so!malloc_init [jemalloc.c : 6209 + 0x1]
0xbeedec04: b2g!MOZ_PNG_destroy_gamma_table [png.c : 2695 + 0x0]
0xbeedec08: b2g!main [nsCOMPtr.h : 399 + 0x0]
0xbeedec0c: libc.so!__udivdi3 [libgcc2.c : 931 + 0x14]
0xbeedec2c: libc.so!__sfvwrite [fvwrite.c : 133 + 0x1]
0xbeedec5c: libc.so!__sprint [vfprintf.c : 72 + 0x1]
0xbeedec64: libc.so!__vfprintf [vfprintf.c : 818 + 0x7]
0xbeedf20c: libmozglue.so!arena_avail_tree_remove [jemalloc.c : 3156 + 0x1b]
0xbeedf23c: libmozglue.so!arena_run_split [jemalloc.c : 3350 + 0x5]
0xbeedf27c: libmozglue.so!arena_run_alloc [jemalloc.c : 3594 + 0x1]
0xbeedf284: libmozglue.so!arena_run_split [jemalloc.c : 3350 + 0x5]
0xbeedf2ac: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbeedf2b4: libmozglue.so!arena_malloc [jemalloc.c : 4142 + 0x1]
0xbeedf2e0: b2g!fill_window [deflate.c : 1520 + 0x2]
0xbeedf350: b2g!MOZ_Z_deflate [deflate.c : 701 + 0x0]
0xbeedf370: b2g!deflate_slow [deflate.c : 1840 + 0x26]
0xbeedf3a4: b2g!MOZ_Z_deflate [deflate.c : 701 + 0x0]
0xbeedf3b4: b2g!MOZ_PNG_write_chunk_start [pngwutil.c : 135 + 0xe]
0xbeedf3cc: libc.so!free [malloc_debug_common.c : 224 + 0x1]
0xbeedf4d4: libmozglue.so!arena_avail_tree_remove [jemalloc.c : 3156 + 0x1b]
0xbeedf504: libmozglue.so!arena_run_split [jemalloc.c : 3350 + 0x5]
0xbeedf544: libmozglue.so!arena_run_alloc [jemalloc.c : 3594 + 0x1]
0xbeedf554: libmozglue.so!arena_run_alloc [jemalloc.c : 3594 + 0x1]
0xbeedf584: libmozglue.so!malloc_mutex_unlock [jemalloc.c : 1642 + 0x1]
0xbeedf58c: libmozglue.so!arena_malloc [jemalloc.c : 4142 + 0x1]
0xbeedf5bc: libbinder.so!android::sp<android::BBinder>::~sp [StrongPointer.h : 147 + 0x1]
0xbeedf5c0: libbinder.so!android::sp<android::BBinder>::~sp [StrongPointer.h : 147 + 0x1]
0xbeedf5d4: libbinder.so!android::Mutex::~Mutex [threads.h : 285 + 0x1]
0xbeedf5e4: libc.so!__aeabi_atexit [eabi.c : 63 + 0x1]
0xbeedf5ec: libbinder.so!_GLOBAL__I_Static.cpp [Static.cpp : 53 + 0x1]
0xbeedf830: b2g!MOZ_Z_inflateBack [infback.c : 278 + 0x0]
0xbeedf86c: libui.so!android::KeyedVector<const native_handle_t*, android::GraphicBufferAllocator::alloc_rec_t>::~KeyedVector [KeyedVector.h : 34 + 0x1]
0xbeedf874: libui.so!android::Mutex::~Mutex [threads.h : 285 + 0x1]
0xbeedf878: libui.so!android::Mutex::~Mutex [threads.h : 285 + 0x1]
0xbeedf898: b2g!MOZ_Z_inflate [inflate.c : 1008 + 0x2]
0xbeedf89c: libc.so!__aeabi_atexit [eabi.c : 63 + 0x1]
0xbeedf8a4: libui.so!_GLOBAL__I_GraphicLog.cpp [GraphicLog.cpp : 92 + 0x1]
0xbeedf8c0: b2g!MOZ_Z_inflateBack [infback.c : 558 + 0x4]
0xbeedf8f4: b2g!MOZ_Z_inflate [inflate.c : 1008 + 0x2]
0xbeedf904: b2g!gz_open [gzlib.c : 127 + 0xe]
0xbeedfb1c: b2g!android::sp<android::FramebufferNativeWindow>::~sp [StrongPointer.h : 147 + 0x1]
0xbeedfb20: b2g!android::sp<android::FramebufferNativeWindow>::~sp [StrongPointer.h : 147 + 0x1]
0xbeedfbd4: libc.so!__libc_init [libc_init_dynamic.c : 114 + 0x9]
0xbeedfbec: libc.so!__cxa_atexit [atexit.c : 101 + 0x1]
That stack appears sorted by address.  Are you able to sort by order the values appear on the stack?  That might help us find the legit parts of the stack.
(In reply to Justin Lebar [:jlebar] from comment #40)
> That stack appears sorted by address.  Are you able to sort by order the
> values appear on the stack?  That might help us find the legit parts of the
> stack.

Sorry, not sure I follow.  This is the same trace as in comment 24 but using dump-lookup instead of minidump_stackwalk.  Top couple frames are the same (jemalloc_crash didn't show up with dump-lookup, though).
> This is the same trace as in comment 24 but using dump-lookup instead of minidump_stackwalk.

But dump-lookup appears to be sorting the result by address; the first column in the trace in comment 39 is sorted in increasing numeric order, if I'm reading the hex correctly.
m1 set me straight on IRC: those are the stack addresses, not the code addresses.  Got it.
Posted file logcat from crash
I just caught "a" crash in the emulator, but debuggerd somehow let the processes exit before I could attach in gdb.  ARGHHHH!

Will directly attach now, but for posterity here's the logcat from the failing run
It's hard to tell which process went down first, but it seems like it could be the main b2g process

I/IdleService(   43): Idle timer callback: tell observer 43b12410 user is idle
I/IdleService(   43): Get idle time: time since reset 2271313 msec
I/IdleService(   43): Get idle time: time since reset 2271315 msec
I/IdleService(   43): Idle timer callback: current idle time 2271315 msec
I/IdleService(   43): next timeout 5649802113852233 usec (4294965023684 msec from now)
I/IdleService(   43): SetTimerExpiryIfBefore: next timeout 5649802113852233 usec
I/IdleService(   43): reset timer expiry from 0 usec to 5649802113862233 usec
D/AudioHardwareInterface(   38): setMode(NORMAL)
W/AudioPolicyManagerBase(   38): setPhoneState() setting same state 0
I/Gecko   (  141): ###################################### forms.js loaded
I/IdleService(   43): Remove idle observer 43b12410 (1 seconds)
E/GeckoConsole(  141): [JavaScript Warning: "HTTP "Content-Type" of "text/html" is not supported. Load of media resource app://communications.gaiamobile.org/dialer/oncall.html failed." {file: "app://communications.gaiamobile.org/dialer/oncall.html#locked" line: 0}]
E/GeckoConsole(   43): [JavaScript Warning: "HTTP "Content-Type" of "text/html" is not supported. Load of media resource app://communications.gaiamobile.org/dialer/oncall.html failed." {file: "app://communications.gaiamobile.org/dialer/oncall.html#locked" line: 0}]
D/AudioHardwareInterface(   38): setMode(NORMAL)
W/AudioPolicyManagerBase(   38): setPhoneState() setting same state 0
I/IdleService(   43): Register idle observer 43b12410 for 1 seconds
I/IdleService(   43): Register: adjusting next switch from -1 to 1 seconds
I/IdleService(   43): next timeout 1354834819852233 usec (-2284383 msec from now)
I/IdleService(   43): SetTimerExpiryIfBefore: next timeout 1354834819852233 usec
I/IdleService(   43): reset timer expiry from 5649802113862233 usec to 1354837104247476 usec
I/IdleService(   43): Get idle time: time since reset 2285385 msec
I/IdleService(   43): Get idle time: time since reset 2286012 msec
I/IdleService(   43): Idle timer callback: current idle time 2286012 msec
I/IdleService(   43): next timeout 5649802113852233 usec (4294965008986 msec from now)
I/IdleService(   43): SetTimerExpiryIfBefore: next timeout 5649802113852233 usec
I/IdleService(   43): reset timer expiry from 0 usec to 5649802113862233 usec
I/IdleService(   43): Idle timer callback: tell observer 43b12410 user is idle
I/IdleService(   43): Get idle time: time since reset 2286014 msec
I/Gecko   (  141): ###################################### forms.js loaded
I/IdleService(   43): Remove idle observer 43b12410 (1 seconds)
W/AudioFlinger(   38): Thread AudioOut_1 cannot connect to the power manager service
F/libc    (   43): Fatal signal 11 (SIGSEGV) at 0x4b300000 (code=1)
F/libc    (  141): Fatal signal 11 (SIGSEGV) at 0x00000030 (code=1)

The idle observer activity right before the SIGSEGV at 0x4b300000 might implicate those observers.

The dialer crashing at 0x00000030 is definitely not the jemalloc assertion failure.
> The dialer crashing at 0x00000030 is definitely not the jemalloc assertion  failure

I've built up quite a catalog of crash signatures over the last couple days.  If you share the minidump output for the thread that crashed I can let you know if I've seen it.  I've encountered a bunch of other malloc-y type crashes that don't fail at that nice jemalloc assertion.
I managed to catch this in gdb again.  Mine is a layers crash on the compositor thread in the main process.  gdb is being fantastically unhelpful so I can't get much debugging info, however what I do see doesn't make much sense.  Might be just another instance of a heisen-heap-corruption :/.
I filed bug 819248 for the crash I was able to reproduce (twice), which seems like it might be different than this one.  And unfortunately much harder to reproduce :(.

(In reply to Michael Vines [:m1] from comment #46)
> > The dialer crashing at 0x00000030 is definitely not the jemalloc assertion  failure
> 
> I've built up quite a catalog of crash signatures over the last couple days.
> If you share the minidump output for the thread that crashed I can let you
> know if I've seen it.  I've encountered a bunch of other malloc-y type
> crashes that don't fail at that nice jemalloc assertion.

This is something in our IPC code in child processes getting upset when the parent process goes away unexpectedly.  I confirmed that the crash I was seeing is the b2g process.
So the end result is that I don't seem to be able to reproduce this in the emulator.

Comment 50 (7 years ago)
We continue to see a lot of corrupt stacks in B2G crash reports from devices; is there anything we can do against that, or to find out what is causing this?

Updated (7 years ago)
Crash Signature: [@ fts3UpdateMethod ] [@ sqlite3Pragma ] [@ sqlite3Step ]
If we think this is an invalid-free, we could write a replace-malloc module which will bail if it receives an invalid pointer to free.  We should be able to get meaningful stacks in this case.

See bug 804303 for the infrastructure and bug 717853 for an example of the kind of analysis we can do.

That would take maybe a week to get up and running, and there's no guarantee that it would identify the source of this problem.
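Roughly, such a module's free() replacement would amount to the following sketch, where allocator_owns and real_free are placeholders for whatever validity check and forwarding hook the replace-malloc infrastructure actually provides:

#include <stdlib.h>

/* Placeholders: a check that the underlying allocator recognises the
 * pointer (something along the lines of jemalloc's isalloc_validate),
 * and the real free() to forward to. */
extern int  allocator_owns(const void *ptr);
extern void real_free(void *ptr);

/* Hypothetical replacement free(): abort immediately on a pointer the
 * allocator does not own, so the minidump shows the caller doing the
 * invalid free instead of a later, unrelated corruption crash. */
void replace_free(void *ptr)
{
    if (ptr != NULL && !allocator_owns(ptr))
        abort();
    real_free(ptr);
}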
Depends on: 820578

Comment 52 (7 years ago)
Isn't the whole effort of pushing bug 717853 so hard right now motivated by using it on B2G in any case?

That said, any possibility of getting a lead on this corruption problem would be welcome to me from a stability-tracking POV.
> Isn't the whole effort of pushing bug 717853 so hard right now motivated by using it on 
> B2G in any case?

Yes, but I'm missing a step in your argument, because I don't understand how this is related to comment 51 (or this bug in general).
Are we running any tests under valgrind on devices currently? I don't know how slowly valgrind would run on an ARM device (presumably it needs more memory than an actual phone device) but that might be good automation and help figure out this bug.
> Are we running any tests under valgrind on devices currently? I don't know how slowly 
> valgrind would run on an ARM device (presumably it needs more memory than an actual phone 
> device) 

We're not testing B2G on powerful devices, so I suspect this would be a fair bit of work to get going.
B2G can build and run on a PandaBoard (which has 1GB of RAM), which is what we intend to use for automation.
IIRC, jseward has been working on Valgrind for B2G.
We stumbled across a reasonably reliable way to reproduce what looks to be a symptom of this bug with a current beta/nightly:
STR:
1) Rapidly send two SMS messages to the device.  Google Voice is nice for this; alternatively, send two SMS messages and then power the device on.


Program received signal SIGSEGV, Segmentation fault.
0x40d534f8 in nsDOMDesktopNotification::PostDesktopNotification
(this=0xbed46aac) at .../gecko/dom/src/notification/nsDesktopNotification.cpp:45
45            nsString manifestUrl = EmptyString();
(gdb) bt
#0  0x40d534f8 in nsDOMDesktopNotification::PostDesktopNotification
(this=0xbed46aac)
     at ../gecko/dom/src/notification/nsDesktopNotification.cpp:45
#1  0x40c903ba in nsHTMLMediaElement::NetworkError (this=0x0)
     at .../gecko/content/html/content/src/nsHTMLMediaElement.cpp:2996
#2  0xbed46f1c in ?? ()
#3  0xbed46f1c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
dougt, I think this might be in your neck of the woods.  Can you see if you can repro?
Flags: needinfo?(doug.turner)
I've been unable to reproduce so far in a debug build, but that might be changing the timing.

I do see this, though:

I/Gecko   (  104): [Parent 104] ###!!! ASSERTION: QueryInterface needed: 'query_result.get() == mRawPtr', file ../../dist/include/nsCOMPtr.h, line 503

Might this be a suspect?
Flags: needinfo?(bent.mozilla)
(In reply to Chris Jones [:cjones] [:warhammer] from comment #60)
> I/Gecko   (  104): [Parent 104] ###!!! ASSERTION: QueryInterface needed:
> 'query_result.get() == mRawPtr', file ../../dist/include/nsCOMPtr.h, line 503

The only way to debug this that I know of is to break in a debugger and see what's getting called. I doubt that could cause memory corruption like this though (the assertion just tells you that your QI function is messed up, but the pointers should still be valid).

In any case we should figure out what is going on here.
Flags: needinfo?(bent.mozilla)
hmm.  gdb says that it happens here...


#0  nsCOMPtr<nsIStackFrame>::Assert_NoQueryNeeded (aMessage=<value optimized out>, aResult=<value optimized out>, aLocation=<value optimized out>, aData=<value optimized out>,
    exceptn=0xbecbebc8) at ../../../dist/include/nsCOMPtr.h:508
#1  ~nsGetterAddRefs (aMessage=<value optimized out>, aResult=<value optimized out>, aLocation=<value optimized out>, aData=<value optimized out>, exceptn=0xbecbebc8)
    at ../../../dist/include/nsCOMPtr.h:1295
#2  nsXPCException::NewException (aMessage=<value optimized out>, aResult=<value optimized out>, aLocation=<value optimized out>, aData=<value optimized out>, exceptn=0xbecbebc8)
    at /builds/mozilla-central/js/xpconnect/src/XPCException.cpp:431
#3  0x4134a21c in XPCThrower::BuildAndThrowException (cx=0x4047a370, rv=NS_ERROR_DOM_DATA_CLONE_ERR, sz=0x41f45df8 "") at /builds/mozilla-central/js/xpconnect/src/XPCThrower.cpp:182
#4  0x4134a6e2 in XPCThrower::Throw (rv=NS_ERROR_DOM_DATA_CLONE_ERR, cx=0x4047a370) at /builds/mozilla-central/js/xpconnect/src/XPCThrower.cpp:24
#5  0x4134a6fe in xpc::Throw (cx=0x445d9220, rv=<value optimized out>) at /builds/mozilla-central/js/xpconnect/src/XPCThrower.cpp:32
#6  0x40fcb8b8 in NS_DOMWriteStructuredClone (cx=0x4047a370, writer=0xbecbedc8, obj=<value optimized out>, closure=<value optimized out>)
    at /builds/mozilla-central/dom/base/nsJSEnvironment.cpp:3900
#7  0x415dd1fa in Write (aCx=0x4047a370, aWriter=0xbecbedc8, aObj=0x4ab38020, aClosure=0xbecbf1a4) at /builds/mozilla-central/dom/ipc/StructuredCloneUtils.cpp:154
#8  0x41aad13e in JSStructuredCloneWriter::startWrite (this=<value optimized out>, v=<value optimized out>) at /builds/mozilla-central/js/src/jsclone.cpp:711
#9  0x41aadbfc in JSStructuredCloneWriter::write (this=0xbecbedc8, v=<value optimized out>) at /builds/mozilla-central/js/src/jsclone.cpp:775
#10 0x41aaef26 in js::WriteStructuredClone (cx=<value optimized out>, v=<value optimized out>, bufp=<value optimized out>, nbytesp=<value optimized out>, cb=0x423a3e34,
    cbClosure=0xbecbf1a4, transferable=...) at /builds/mozilla-central/js/src/jsclone.cpp:130
#11 0x41a7809a in JS_WriteStructuredClone (cx=0x4047a370, valueArg=..., bufp=0xbecbf190, nbytesp=0xbecbf194, optionalCallbacks=0x423a3e34, closure=0xbecbf1a4, transferable=...)
    at /builds/mozilla-central/js/src/jsapi.cpp:6326
#12 0x41a7811e in JSAutoStructuredCloneBuffer::write (this=0xbecbf190, cx=0x4047a370, valueArg=<value optimized out>, transferable=..., optionalCallbacks=0x423a3e34, closure=0xbecbf1a4)
    at /builds/mozilla-central/js/src/jsapi.cpp:6450
#13 0x41a78160 in JSAutoStructuredCloneBuffer::write (this=0x0, cx=0x1, valueArg=..., optionalCallbacks=0x423a3e34, closure=0xbecbf1a4) at /builds/mozilla-central/js/src/jsapi.cpp:6437
#14 0x415dc922 in mozilla::dom::WriteStructuredClone (aCx=0x4047a370, aSource=..., aBuffer=<value optimized out>, aClosure=...)
    at /builds/mozilla-central/dom/ipc/StructuredCloneUtils.cpp:183
#15 0x40e49948 in GetParamsForMessage (aCx=0x0, aObject=..., aBuffer=..., aClosure=...) at /builds/mozilla-central/content/base/src/nsFrameMessageManager.cpp:218
#16 0x40e4b6c6 in nsFrameMessageManager::DispatchAsyncMessage (this=0x4a2f5c00, aMessageName=..., aObject=..., aCx=0x4047a370, aArgc=2 '\002')
    at /builds/mozilla-central/content/base/src/nsFrameMessageManager.cpp:320
#17 0x40e4b716 in nsFrameMessageManager::BroadcastAsyncMessage (this=0x0, aMessageName=..., aObject=..., aCx=0x445d9220, aArgc=2 '\002')
    at /builds/mozilla-central/content/base/src/nsFrameMessageManager.cpp:352
#18 0x417ece0a in NS_InvokeByIndex_P (that=0x4a2f5c00, methodIndex=6, paramCount=<value optimized out>, params=<value optimized out>)
    at /builds/mozilla-central/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:160
#19 0x4135410a in CallMethodHelper::Invoke (this=0xbecbf4f0) at /builds/mozilla-central/js/xpconnect/src/XPCWrappedNative.cpp:3081
#20 CallMethodHelper::Call (this=0xbecbf4f0) at /builds/mozilla-central/js/xpconnect/src/XPCWrappedNative.cpp:2415
Flags: needinfo?(doug.turner)
No bites on the trotline yet :/.  Frustrating.

Comment 64 (7 years ago)
(In reply to Doug Turner (:dougt) from comment #62)
> hmm.  gdb says that it happens here...

I think it's best to go to other bugs for concrete crashes and leave this one here for the corruption issue.
> I think it's best to go to other bugs for concrete crashes and leave this one here for the corruption issue.

I do not follow.
Severity: normal → major

Comment 66 (7 years ago)
(In reply to Doug Turner (:dougt) from comment #65)
> I do not follow.

This bug basically turned into a discussion of a general B2G (crash) stack corruption issue. We should have separate bugs for specific crashes.

Should we rename this bug to e.g. "Something (double-free?) is causing stack corruptions in B2G crashes"?

Comment 67 (7 years ago)
Just to be clear, I don't think I have seen any B2G crash report for weeks that has a proper stack; they are all corrupted in strange ways.
I believe this is the bug tracking that, but if it's not, we need to file another high-priority bug for that, as that's blocking basically all our crash investigations on B2G atm.
File another bug for that.  That's not what this bug is about.
Summary: Something (double-free?) is causing libmozsqlite3.so to be a little too crashy → Something (double-free?) is causing stack corruptions in B2G crashes
Again, file a separate bug for stack corruption issues.  This bug is about tracking down the heap corruption resulting in the crashes from comment 0, 58, etc.
Summary: Something (double-free?) is causing stack corruptions in B2G crashes → Something (double-free?) is causing libmozsqlite3.so to be a little too crashy
Spun off bug 821353 for crash stack badness.
I am just not able to reproduce this in any build configuration I try (many so far).  I think we need to start narrowing down what might be different about the environments we're using to reproduce this.
Blocks: 808607
(In reply to Chris Jones [:cjones] [:warhammer] from comment #63)
> Created attachment 691629 [details]
> Torture script for emulator

Is there some way to achieve the same for a real device?  That is,
send simulated SMSs to it?  I have the stack running fairly stably
on valgrind on a GS2, but I don't have a zero cost way to send a 
whole bunch of SMSs to it.
Bug 819275 is a possible cause of this.  Have the same crashes been reproduced since that fix merged to beta?
Depends on: 819275
I cannot detect any memory corruption when receiving or sending 
either SMSs or calls, using a build that was last "repo sync"'d on 
Friday 14 Dec.
Not able to get the SMS STR from comment 58 to crash anymore here either.
BTW, we caught bp-6a256401-1aa5-4567-98cb-a0c322121213 again last night in a build with the fix for bug 819275.
Can we mark this WORKSFORME and re-open if it re-appears?
No.  Like I said in comment 77, this is still very much an issue.
bp-4fc59373-53e9-4b20-be40-b81622121214 was seen last night as well, by making repeated MO calls overnight.
(In reply to Michael Vines [:m1] from comment #80)
Michael, do you have any feel for the variability in the
time-to-crash?  IOW, do identical test runs crash after the same
number of MO calls, or does the number of calls prior to the crash vary
widely?  I am just wondering whether, if this isn't straightforward memory
corruption, it's maybe a data race leading to memory corruption,
which would give much more erratic results.
bp-5cc81cf2-0d67-4839-b8ea-51c032121213 was also seen (MO calls over the weekend)
bp-2ff0390d-90f7-45e0-b133-f5fe42121217 too. 

We're finding a lot of these by making repeated MO or MT calls overnight + <something else, or maybe nothing else>.  I generally don't see the same number of crashes nor the same signatures for each automation run, but there's definitely a finite set of 10-20 sigs that keep popping up.  I'm only linking to the signatures I can also find reported on Socorro by dogfooders.
(In reply to Justin Lebar [:jlebar] from comment #17)
Note that this is with a --disable-jemalloc build.  At least
the OTSStream mismatch also exists for Fx on Android and does
not appear to cause any problems.
Comment on attachment 693133 [details]
Allocation mismatches observed during the run described in comment #75

To be explicit, this wouldn't catch an allocation mismatch between jemalloc and the Android malloc, right?
No, it wouldn't.  I have V patches that can detect such mismatches, but didn't
use them for these runs.  For some reason which I long since forgot, I had
the impression from Mike Hommey that allocator mismatches (at least for
Fx on Android) were now a non-problem, so I was kind of assuming that for
b2g too.
In comment 17, I posited that we may be observing just such an allocator mismatch here.  But the dark side clouds everything.
So bionic's memory allocators seem to be implemented as:

void* malloc(size_t bytes) {
    return __libc_malloc_dispatch->malloc(bytes);
}
void free(void* mem) {
    __libc_malloc_dispatch->free(mem);
}
void* calloc(size_t n_elements, size_t elem_size) {
    return __libc_malloc_dispatch->calloc(n_elements, elem_size);
}
void* realloc(void* oldMem, size_t bytes) {
    return __libc_malloc_dispatch->realloc(oldMem, bytes);
}
void* memalign(size_t alignment, size_t bytes) {
    return __libc_malloc_dispatch->memalign(alignment, bytes);
}

So can't we just plug in our own functions into __libc_malloc_dispatch?

I checked on my otoro and nm -D libc.so | grep libc_malloc shows

000433c0 D __libc_malloc_default_dispatch
00043600 D __libc_malloc_dispatch
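The idea would be something along these lines, assuming a dispatch-table layout matching the code quoted above; the struct definition, the symbol's exact type, and the je_* entry-point names are illustrative and would need to be checked against the real bionic and jemalloc declarations:

#include <stddef.h>

struct malloc_dispatch {
    void *(*malloc)(size_t);
    void  (*free)(void *);
    void *(*calloc)(size_t, size_t);
    void *(*realloc)(void *, size_t);
    void *(*memalign)(size_t, size_t);
};

/* Exported by bionic's libc.so (see the nm output above). */
extern struct malloc_dispatch *__libc_malloc_dispatch;

/* jemalloc entry points to redirect to (names illustrative). */
extern void *je_malloc(size_t);
extern void  je_free(void *);
extern void *je_calloc(size_t, size_t);
extern void *je_realloc(void *, size_t);
extern void *je_memalign(size_t, size_t);

static struct malloc_dispatch our_dispatch = {
    je_malloc, je_free, je_calloc, je_realloc, je_memalign,
};

/* After this, libc-internal allocations (asprintf, strdup, ...) would
 * go through jemalloc too, removing one source of allocator mismatches. */
void install_jemalloc_dispatch(void)
{
    __libc_malloc_dispatch = &our_dispatch;
}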
We LD_PRELOAD the jemalloc allocators, so if any code is using the bionic ones, something has gone wrong.

Would be interesting to put some fatal asserts in bionic and see if that's going wrong.
> We LD_PRELOAD the jemalloc allocators, so if any code is using the bionic ones, something has gone 
> wrong.

See bug 801571; I don't think LD_PRELOADING is as robust on Android as it is on Linux.  Or something else is going on.
(In reply to Justin Lebar [:jlebar] from comment #90)
> > We LD_PRELOAD the jemalloc allocators, so if any code is using the bionic ones, something has gone 
> > wrong.
> 
> See bug 801571; I don't think LD_PRELOADING is as robust on Android as it is
> on Linux.  Or something else is going on.

Note we provide the malloc hooks on Linux too, so that libc calls into jemalloc internally too.
(In reply to Julian Seward from comment #86)
> For some reason which I long since forgot, I had the impression from
> Mike Hommey that allocator mismatches were now a non-problem

STL allocator vs stdlib allocator mismatches are the non-problem ones on Android (operator new()/free() or malloc()/operator delete() pairs). libc allocator vs jemalloc allocator mismatches still are a problem, but you don't catch them with V.
We don't LD_PRELOAD on desktop, and we interpose all the libc malloc'ing functions that we call (strdup etc.).
(In reply to Chris Jones [:cjones] [:warhammer] from comment #93)
> We don't LD_PRELOAD on desktop

But jemalloc is in the executable, which has the same effect on symbol resolution.

> and we interpose all the libc malloc'ing functions that we call (strdup etc.).

Actually, we don't _have_ to, because of the hooks.
I just confirmed that some files we load into both plugin-container and b2g address spaces use asprintf.  That seems like our most likely suspect atm.
Depends on: 801571
(In reply to Chris Jones [:cjones] [:warhammer] from comment #95)
> I just confirmed that some files we load into both plugin-container and b2g
> address spaces use asprintf.  That seems like our most likely suspect atm.

I have seen crashes in both b2g and plugin-container that look like heap corruption fwiw.
Yup, the crash reports I sampled above show either plugin-container with those libs loaded, or b2g which will always have those libs loaded.
Now that I've paged this bug back in, I think it's unlikely that a jemalloc/bionic allocator mismatch is the whole story here, unless the stacks we have are quite wrong.

In comment 30, we established that one of the stacks shows us hitting a particular jemalloc assertion.

In order to fail at that point, we had to pass an earlier assertion checking the arena's "magic" field.  This is essentially checking that the address we're freeing lies inside a chunk mmap'ed by jemalloc -- that is, it's checking that we don't have a jemalloc/bionic allocator mismatch.

It might be worth adding a call to isalloc_validate at the start of jemalloc's free(); this would be another check that we're freeing an address owned by jemalloc.  But I think it should be essentially redundant with the existing assertions in the code.  Calling isalloc_validate would in particular let us rule out the remote possibility that the bionic heap happens to have jemalloc's magic value in just the right place so as to pass the assertions.
The assertions on magic look like they're using |assert()|, right?  The failures here are mostly being reproduced by automated stress testing on release builds.  The one reproducible crash outside of that automation is now WFM.

Would one of the later RELEASE_ASSERT()s catch this?

If not, m1, do we have the capability to temporarily add a patch to the chainsaw set to promote those assert()s to RELEASE_ASSERT()?
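For reference, such a patch would essentially switch the relevant assert(x) calls over to the RELEASE_ASSERT(x) macro that jemalloc.c already defines, which stays enabled in release builds. A sketch of the macro's shape (the real definition may differ):

#include <stdlib.h>

/* Unlike assert(), this stays enabled in release builds, so the
 * overnight stress automation would still catch a corrupted magic
 * field.  mozjemalloc's real macro aborts via jemalloc_crash(), the
 * frame visible at the top of the crash in comment 24; abort() is used
 * here only to keep the sketch self-contained. */
#define RELEASE_ASSERT(cond)      \
    do {                          \
        if (!(cond))              \
            abort();              \
    } while (0)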
(In reply to Chris Jones [:cjones] [:warhammer] from comment #99)
> If not, m1, do we have the capability to temporarily add a patch to the
> chainsaw set to promote those assert()s to RELEASE_ASSERT()?

I can certainly ask our test folks to run engineering builds through the chipper; just lmk exactly what you'd like tweaked.
A DEBUG build would enable that code for sure, but if we need to keep running with release builds, jlebar will know best what patch to toss in.
Flags: needinfo?(justin.lebar+bug)
Just FYI, we've tracked down the cause of the mangled crash stacks in bug 821353, and bug 822432 is aiming to reprocess a few weeks' worth of broken crash reports, so hopefully we can get better stacks out of the crashes here.
> Would one of the later RELEASE_ASSERT()s catch this?

Yes, or at least, I think so.  Not all of the assertions on |magic| are RELEASE_ASSERTs, but only because some of them are redundant.  At least, that's my intention.

In order to get into arena_run_reg_dalloc, where the crash from comment 29 is occurring, we have to run arena_dalloc_small, which does a RELEASE_ASSERT on run->magic.  In order to call arena_dalloc_small, we have to call arena_dalloc, which RELEASE_ASSERTs arena->magic.
Flags: needinfo?(justin.lebar+bug)
I reran tests using a hacked version of V that detects allocator mismatches
in shared objects (e.g., libc.so vs libmozglue.so) but found nothing.

I did however see the attached invalid accesses.  In perhaps 20
valgrinded runs on the phone, both with and without a SIM, I only
ever saw these on one run.  And that was the only run in which
the phone had a SIM in but failed to connect to the network for
some reason.  So maybe that's a clue.
Depends on: 822770
I filed bug 822770 on the write-after-frees from comment 104.
In looking at the attachment from comment 104 there are 2 write failures happening.

event.c:965 in event_queue_insert corresponds to TAILQ_INSERT_TAIL(&base->eventqueue, ev, ev_next);

If we assume that valgrind is doing its job, then elm has to be valid (this would correspond to the event object), or we would have had lots of other failures prior to the call to event_add (from WatchFileDescriptor here: https://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_pump_libevent.cc#193).

So this means that the event object is actually allocated properly, and it also means that the ev_next.tqe_next assignment to NULL can't be what's causing the problem (otherwise the call to event_set would have also caused a bunch of invalid writes to offsets greater than 0).

tqe_prev is at an offset of 4, so it can't be that line.
tqh_last is at an offset of 4, so it can't be that line.
The only line left is *(head)->tqh_last = (elm); so it must be that (head)->tqh_last is what's pointing to 0x8517068, and it's this write which is causing the invalid write.

Similarly, if we look at the second stack trace, this time it's at event.c:932, which is event_queue_remove, which calls TAILQ_REMOVE. TAILQ_REMOVE has 4 assignments: two to tqe_prev, which isn't at offset 0; one to tqh_last, which also isn't at offset 0; and this line:
*(elm)->field.tqe_prev = (elm)->field.tqe_next

So that tells me that (elm)->field.tqe_prev points to 0x8517068 and it's that line which causes the invalid write.

The 3rd stack trace looks like the 2nd one.

So this tells me that, in all 3 cases, it looks like the linked list contains a node which has been freed but which hasn't been unlinked from the list.

That leads us to StopWatchingFileDescriptor. It calls event_del and then deletes the node. So if there is a way for event_del to leave the node on the list, then that's the problem.
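The inferred failure mode can be shown in miniature, simplified from libevent; the struct here is illustrative, and the bug is included deliberately to reproduce the write pattern Valgrind flagged:

#include <sys/queue.h>
#include <stdlib.h>

struct event {
    TAILQ_ENTRY(event) ev_next;   /* tqe_next at offset 0, tqe_prev at offset 4 on 32-bit */
};

TAILQ_HEAD(event_list, event);

int main(void)
{
    struct event_list q = TAILQ_HEAD_INITIALIZER(q);

    struct event *ev = malloc(sizeof *ev);
    TAILQ_INSERT_TAIL(&q, ev, ev_next);

    /* BUG pattern: the node is freed without TAILQ_REMOVE, so
     * q.tqh_last still points into *ev. */
    free(ev);

    /* The next insert writes through the stale tqh_last into freed
     * memory -- the invalid write at event.c:965 in the report above. */
    struct event *ev2 = malloc(sizeof *ev2);
    TAILQ_INSERT_TAIL(&q, ev2, ev_next);

    return 0;
}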
> In looking at the attachment from comment 104 there are 2 write failures happening.

Let's discuss this in bug 822770, if you don't mind.  I'm afraid we're lumping all of our mysterious crashes into this bug, which makes it very difficult to track.
So uh, hey, remember how we were reprocessing the busted crashes? One of the crashes from comment 0 doesn't have symbols anymore (build is too old), but the other one is:
https://crash-stats.mozilla.com/report/index/05bd9934-2b04-4708-9359-db7992121127
[@ nsEventStateManager::FillInEventFromGestureDown ], which was bug 699725.
Depends on: 823363
I just filed bug 823363, which has DFD, a hacked-up version of DMD that can detect double-frees.  In theory it shouldn't find anything that Valgrind doesn't, but it uses a different malloc-interception mechanism, so who knows.

For whoever can reproduce this crash:  it might be worth trying DFD just in case it is a double-free that's the problem.  Instructions for use are in bug 823363.
This one looks pretty promising.
Depends on: 816994
This may or may not be bug 822398, but mass-closing so that we can get better resolution on crashes that still reproduce.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 822398
Now that we've fixed crash reporting for B2G (which we broke by enabling elfhack), sqlite isn't appearing in any significant crash signatures AFAICT. This appears to have been a red herring from the busted stacks. The underlying crashes were probably a number of different things; they can be tracked in individual bugs.

Comment 113 (7 years ago)
Removing qawanted; please add it back if there is any need.
Keywords: qawanted
No longer depends on: 801571
No longer depends on: 822770
No longer depends on: 823363
No longer depends on: 820578