741836 - Native crash without backtrace in SQLiteBridge

Reporter

Description

•

13 years ago

Attached file Patch. Testing code to reproduce the bug. — Details

Bug 736237 makes Profile Migration resumable. When testing this, I ran into an issue where the app spontaneously exits without a Java or native crash trace.

Attaching jdb changes nothing. Attaching gdb makes the application work correctly. Using checkJNI makes the application work correctly.

Tracing with printf debugging shows that the app exits at this statement:
https://mxr.mozilla.org/mozilla-central/source/mozglue/android/SQLiteBridge.cpp#342

This reeks of memory corruption. It would be cool if someone could valgrind this scenario and see if we have bugs.

STR:
1) Apply the patches from bug 736237.
2) Apply the attached patch.
3) Copy a big places.sqlite into the Fennec profile. (eg. from desktop machine)
4) Start Fennec, let migration finish.
5) Quit Fennec
6) Start Fennec. Fennec will crash halfway.

JP Rosevear [:jpr]

Updated

•

13 years ago

blocking-fennec1.0: --- → ?

Gian-Carlo Pascutto [:gcp]

Reporter

Updated

•

13 years ago

Blocks: 736237

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 1

•

13 years ago

We didn't reproduce the issue under valgrind yet, but there is something interesting in the trace:

I//data/local/start_valgrind_fennec( 1608): ==1609== Thread 11:
I//data/local/start_valgrind_fennec( 1608): ==1609== Invalid read of size 1
I//data/local/start_valgrind_fennec( 1608): ==1609== at 0x4806620: strlen (mc_replace_strmem.c:391)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x4831AAD: strndup (strndup.c:33)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224CB0B: report_mapping (APKOpen.cpp:681)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x22251103: CustomElf::Load(Mappable*, char const*, int) (CustomElf.cpp:211)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224FDC3: ElfLoader::Load(char const*, int, LibHandle*) (ElfLoader.cpp:251)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224FE8B: __wrap_dlopen (ElfLoader.cpp:46)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224CDF3: loadSQLiteLibs(char const*) (APKOpen.cpp:810)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224CEBD: Java_org_mozilla_gecko_GeckoAppShell_loadSQLiteLibsNative (APKOpen.cpp:945)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x540CBF3: dvmPlatformInvoke (CallEABI.S:258)
I//data/local/start_valgrind_fennec( 1608): ==1609== Address 0x4cb7082 is 0 bytes after a block of size 658 alloc'd
I//data/local/start_valgrind_fennec( 1608): ==1609== at 0x4805440: malloc (vg_replace_malloc.c:263)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224C46D: extractBuf(char const*, Zip*) (APKOpen.cpp:648)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224CDC9: loadSQLiteLibs(char const*) (APKOpen.cpp:804)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224CEBD: Java_org_mozilla_gecko_GeckoAppShell_loadSQLiteLibsNative (APKOpen.cpp:945)
I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x540CBF3: dvmPlatformInvoke (CallEABI.S:258)
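For readers unfamiliar with this kind of report: the trace shows strlen, called from strndup, reading one byte past a 658-byte block that extractBuf allocated. One plausible reading (an assumption, the thread does not confirm it) is that the extracted buffer has no terminating NUL inside the allocation. A minimal sketch of a copy that never reads past the stated length follows; `bounded_dup` is a hypothetical helper, not code from APKOpen.cpp.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate at most max_len bytes of buf into a
 * NUL-terminated string. memchr stops at buf[max_len - 1], so a buffer
 * without a terminating NUL is never read past its end (unlike strlen). */
char *bounded_dup(const char *buf, size_t max_len)
{
    assert(buf != NULL);

    const char *nul = memchr(buf, '\0', max_len);
    size_t len = nul ? (size_t)(nul - buf) : max_len;

    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, buf, len);
    copy[len] = '\0';
    return copy;
}
```

The caller owns the returned string and releases it with free.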

Mike Hommey [:glandium]

Comment 2

•

13 years ago

(In reply to Gian-Carlo Pascutto (:gcp) from comment #1)
> We didn't reproduce the issue under valgrind yet, but there is something
> interesting in the trace:
>
> I//data/local/start_valgrind_fennec( 1608): ==1609== Thread 11:
> I//data/local/start_valgrind_fennec( 1608): ==1609== Invalid read of size 1
> I//data/local/start_valgrind_fennec( 1608): ==1609== at 0x4806620: strlen (mc_replace_strmem.c:391)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x4831AAD: strndup (strndup.c:33)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224CB0B: report_mapping (APKOpen.cpp:681)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x22251103: CustomElf::Load(Mappable*, char const*, int) (CustomElf.cpp:211)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224FDC3: ElfLoader::Load(char const*, int, LibHandle*) (ElfLoader.cpp:251)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224FE8B: __wrap_dlopen (ElfLoader.cpp:46)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224CDF3: loadSQLiteLibs(char const*) (APKOpen.cpp:810)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224CEBD: Java_org_mozilla_gecko_GeckoAppShell_loadSQLiteLibsNative (APKOpen.cpp:945)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x540CBF3: dvmPlatformInvoke (CallEABI.S:258)
> I//data/local/start_valgrind_fennec( 1608): ==1609== Address 0x4cb7082 is 0 bytes after a block of size 658 alloc'd
> I//data/local/start_valgrind_fennec( 1608): ==1609== at 0x4805440: malloc (vg_replace_malloc.c:263)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224C46D: extractBuf(char const*, Zip*) (APKOpen.cpp:648)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224CDC9: loadSQLiteLibs(char const*) (APKOpen.cpp:804)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x2224CEBD: Java_org_mozilla_gecko_GeckoAppShell_loadSQLiteLibsNative (APKOpen.cpp:945)
> I//data/local/start_valgrind_fennec( 1608): ==1609== by 0x540CBF3: dvmPlatformInvoke (CallEABI.S:258)

That's bug 740318.

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 3

•

13 years ago

Julian valgrinded the exact case that's failing for me but nothing showed up.

Brad Lassey [:blassey] (use needinfo?)

Comment 4

•

13 years ago

GCP, I thought this was blocking bug 736237 from landing, in which case it would be a blocker. But bug 736237 landed. Can you explain?

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 5

•

13 years ago

We didn't find anything even after extensive valgrinding of the exact scenario that crashes for me (which didn't fail for sewardj, anyway), and nobody had any further ideas on how to progress. I figured that landing this would either get us some more reports and help to identify a pattern or, in the alternate case, show that this crash almost never happens. Put differently, the alternative to landing was not doing anything.

In the meantime I got an idea: let this run without passing the data we get from SQLite into Android, to see whether it's some Android lib going bonkers.

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 6

•

13 years ago

I managed to crash it while stepping in the debugger:

full log: http://pastebin.mozilla.org/1560695

last part:

sqlite3VdbeExec (p=0x5c5579d8) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:65422
65422 if( p->rc==SQLITE_NOMEM ){
(gdb)
65427 assert( p->rc==SQLITE_OK || p->rc==SQLITE_BUSY );
(gdb)
65428 p->rc = SQLITE_OK;
(gdb)
65429 assert( p->explain==0 );
(gdb)
65430 p->pResultSet = 0;
(gdb)
65431 db->busyHandler.nBusy = 0;
(gdb)
65432 CHECK_FOR_INTERRUPT;
(gdb)
65435 checkProgress = db->xProgress!=0;
(gdb)
65438 sqlite3BeginBenignMalloc();
(gdb)
Program received signal SIGSEGV, Segmentation fault.
0xaff15c2c in ?? () from /home/morbo/jimdb/moz-gdb/lib/42800C743000157/lib/libc.so

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 7

•

13 years ago

And another one:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 15437]
0xaff15c2c in ?? () from /home/morbo/jimdb/moz-gdb/lib/42800C743000157/lib/libc.so
(gdb) bt
#0 0xaff15c2c in ?? () from /home/morbo/jimdb/moz-gdb/lib/42800C743000157/lib/libc.so
#1 0xaff1cd9c in __assert2 () from /home/morbo/jimdb/moz-gdb/lib/42800C743000157/lib/libc.so
#2 0x5dabc4ee in sqlite3MutexAlloc (id=2) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:17352
#3 0x5db46990 in enterMutex () at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:115604
#4 0x5db46b94 in sqlite3ConnectionUnlocked (db=0x5c422408) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:115725
#5 0x5daf031e in sqlite3VdbeHalt (p=0x5c4579d8) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:61797
#6 0x5daf5afe in sqlite3VdbeExec (p=0x5c4579d8) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:65712
#7 0x5daf2704 in sqlite3Step (p=0x5c4579d8) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:63161
#8 0x5daf2946 in sqlite3_step (pStmt=0x5c4579d8) at /home/morbo/hg/mozilla-central/db/sqlite3/src/sqlite3.c:63234
#9 0x80b2a1ca in Java_org_mozilla_gecko_sqlite_SQLiteBridge_sqliteCall (jenv=0x117c68, jDb=0x408bb1f8, jQuery=0x407b0690, jParams=0x40939318, jQueryRes=0x409393e8) at /home/morbo/hg/mozilla-central/mozglue/android/SQLiteBridge.cpp:362
#10 0xaca11d38 in ?? ()
#11 0xaca11d38 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

This actually makes a little sense; the crash always happens at the last "step" inside the query, so this would seem consistent with the crash originating from sqlite3VdbeHalt.

Mark Finkle (:mfinkle) (use needinfo?)

Comment 8

•

13 years ago

Is this bug related to bug 743715?

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 9

•

13 years ago

I needed bug 743715 to provide further evidence that https://bugzilla.mozilla.org/show_bug.cgi?id=730495#c2 is correct.

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 10

•

13 years ago

Disabling jemalloc makes the crasher go away (this isn't 100% proof that jemalloc is the cause, tho).

Mike Hommey [:glandium]

Comment 11

•

13 years ago

That would seem to suggest something is allocating memory with the system allocator, and sqlite is giving that to some jemalloc function.
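To make the suspected failure mode concrete, here is a toy model of two allocators handing each other's blocks around. It is only a sketch: the tags, `tagged_alloc`, and `tagged_free` are hypothetical stand-ins for "system allocator" and "jemalloc", and a real allocator would not detect the mismatch this politely; it would more likely corrupt its own bookkeeping or crash later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tags standing in for the two allocators. */
enum { TAG_SYSTEM = 0x5359, TAG_JEMALLOC = 0x4a45 };

/* Header placed in front of every block; the union keeps the payload
 * aligned for any type. */
typedef union {
    uint32_t tag;
    max_align_t align;
} header_t;

/* Allocate size bytes and record which "allocator" owns the block. */
void *tagged_alloc(uint32_t tag, size_t size)
{
    header_t *h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->tag = tag;
    return h + 1;
}

/* Returns 0 on success, -1 if the block belongs to the other allocator.
 * In the mismatch case nothing is released. */
int tagged_free(uint32_t tag, void *ptr)
{
    assert(ptr != NULL);

    header_t *h = (header_t *)ptr - 1;
    if (h->tag != tag)
        return -1;
    free(h);
    return 0;
}
```

Calling tagged_free(TAG_JEMALLOC, p) on a block obtained with tagged_alloc(TAG_SYSTEM, ...) reports the mismatch, which is the situation the comment above describes.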

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 12

•

13 years ago

Mike, one suspect I am looking at is: http://mxr.mozilla.org/mozilla-central/source/storage/src/mozStorageService.cpp#574 Which would trigger because the SQLiteBridge is called before Gecko is loaded, so it will be calling into sqlite3 without these wrappers, providing incorrect alignment. But I'd expect that to trigger some assert in sqlite3 itself, at least that's what the bugs that caused this wrapper to be added suggest. So I'm not 100% sure.

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 13

•

13 years ago

Bug 722884 has some similarities to this one. Unsure if related.

Brad Lassey [:blassey] (use needinfo?)

Updated

•

13 years ago

blocking-fennec1.0: ? → -

Scoobidiver (away)

Updated

•

13 years ago

Severity: normal → critical

Keywords: crash

Whiteboard: [native-crash]

Julian Seward [:jseward]

Comment 14

•

13 years ago

(In reply to Gian-Carlo Pascutto (:gcp) from comment #10)
> Disabling jemalloc makes the crasher go away

That might explain why the various valgrind runs found nothing -- I was using a non-jemalloc build.

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 15

•

13 years ago

I put some logging into SQLite itself: http://pastebin.mozilla.org/1565953 The sqlite3MutexEnd call is probably what kills us. This is called from sqlite3_shutdown. Richard informed me yesterday that our calling of this function (which was in his opinion unnecessary) could be a cause. Looks like he's right! Theory: the failed sqlite3_config call causes the storage service to shut down and sqlite3_shutdown to be called.
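The theory can be sketched as a small state machine. This is a toy model, not SQLite code: every `lib_*` name and both return codes are hypothetical, and it assumes (as the theory does) that configuring an already-initialized library fails and that the late client reacts to that failure by shutting the library down.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical return codes for the toy library. */
enum { LIB_OK = 0, LIB_MISUSE = 1 };

static bool lib_initialized = false;
static bool mutexes_alive = false;

/* Bring up the library, including its mutex subsystem. */
int lib_initialize(void)
{
    lib_initialized = true;
    mutexes_alive = true;
    return LIB_OK;
}

/* Configuration is only accepted before initialization. */
int lib_config(void)
{
    return lib_initialized ? LIB_MISUSE : LIB_OK;
}

/* Tear down shared state, regardless of who else is still using it. */
int lib_shutdown(void)
{
    mutexes_alive = false;
    lib_initialized = false;
    return LIB_OK;
}

/* Stepping a query needs the mutex subsystem. */
int lib_step(void)
{
    return mutexes_alive ? LIB_OK : LIB_MISUSE;
}

/* The early client initializes, the late client fails to configure and
 * shuts down, then the early client tries to keep going. */
int scenario(void)
{
    int rc = lib_initialize();   /* early client, SQLiteBridge's role */
    assert(rc == LIB_OK);

    if (lib_config() != LIB_OK)  /* late client, storage service's role */
        lib_shutdown();

    return lib_step();           /* early client continues its query */
}
```

In this model scenario() ends with the early client stepping against a torn-down library. The real code hits an assertion in the mutex layer instead of getting an error code back, per the backtrace in comment 7.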

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 16

•

13 years ago

This is probably fixed by bug 721784.

Gian-Carlo Pascutto [:gcp]

Reporter

Updated

•

13 years ago

Status: NEW → RESOLVED

Closed: 13 years ago

Depends on: 721784

Resolution: --- → FIXED

Scoobidiver (away)

Updated

•

13 years ago

Target Milestone: --- → Firefox 14

Nobody; OK to take it and work on it

Assignee

Updated

•

4 years ago

Product: Firefox for Android → Firefox for Android Graveyard