1310152 - [App Verifier] Critical section not initialized in sqlite3

Reporter

Description

•

9 years ago

Attached file xpcshell.exe.2447.dat.xml — Details

App verifier reports that a CRITICAL_SECTION is not initialized in running: ./mach xpcshell-test toolkit/components/places/tests/bookmarks/ Stacktrace: vfbasics!+7ff90c996501 ( @ 0) vfbasics!+7ff90c99768a ( @ 0) vfcuzz!CuzzGetCustomSchedulerInterface+1632 ( @ 0) nss3!sqlite3_mutex_enter+f (e:\hg\mozilla-central\db\sqlite3\src\sqlite3.c @ 22485) xul!mozilla::storage::Connection::initializeClone+436 (e:\hg\mozilla-central\storage\mozstorageconnection.cpp @ 1407) xul!mozilla::storage::`anonymous namespace'::AsyncInitializeClone::Run+1a (e:\hg\mozilla-central\storage\mozstorageconnection.cpp @ 432) xul!nsThread::ProcessNextEvent+2bd (e:\hg\mozilla-central\xpcom\threads\nsthread.cpp @ 1082) xul!NS_ProcessNextEvent+27 (e:\hg\mozilla-central\xpcom\glue\nsthreadutils.cpp @ 290) xul!mozilla::ipc::MessagePumpForNonMainThreads::Run+203 (e:\hg\mozilla-central\ipc\glue\messagepump.cpp @ 338) xul!MessageLoop::RunHandler+4b (e:\hg\mozilla-central\ipc\chromium\src\base\message_loop.cc @ 226) xul!MessageLoop::Run+1e (e:\hg\mozilla-central\ipc\chromium\src\base\message_loop.cc @ 206) xul!nsThread::ThreadFunc+102 (e:\hg\mozilla-central\xpcom\threads\nsthread.cpp @ 467) nss3!_PR_NativeRunThread+ee (e:\hg\mozilla-central\nsprpub\pr\src\threads\combined\pruthr.c @ 419) nss3!pr_root+a (e:\hg\mozilla-central\nsprpub\pr\src\md\windows\w95thred.c @ 96) ucrtbase!o__realloc_base+60 ( @ 0) vfbasics!+7ff90c99f0d4 ( @ 0) KERNEL32!BaseThreadInitThunk+14 ( @ 0) ntdll!RtlUserThreadStart+21 ( @ 0) This error cannot be reproduced with --sequential.

Nicholas Nethercote [inactive]

Comment 1

•

9 years ago

CC'ing drh.

D. Richard Hipp

Comment 2

•

9 years ago

The "e:\hg\mozilla-central\db\sqlite3\src\sqlite3.c @ 22485" source location information from the stack trace maps to https://www.sqlite.org/src/artifact/40dc52f7525fa?ln=98 for SQLite 3.15.0 and to https://www.sqlite.org/src/artifact/40dc52f7525fa?ln=140 for SQLite 3.14.2. From this I infer that the issue occurs using older SQLite 3.14.2, not the newly released 3.15.0 Am I correct?

Marco Bonardo [:mak]

Assignee

Comment 3

•

9 years ago

we didn't upgrade to 3.15.0 yet, so yes.

Ryan VanderMeulen [:RyanVM]

Comment 4

•

9 years ago

We're on 3.14.1 at the moment.

D. Richard Hipp

Comment 5

•

9 years ago

Our current hypothesis is that an sqlite3_mutex object is being used after it has been freed. The experimental sqlite3.c file found at https://sqlite.org/tmp/bug-1310152-test1.zip will assert() (with high probability) if that happens. Can someone please rerun the test using the experimental sqlite3.c file above and let us know what happens?

Marco Bonardo [:mak]

Assignee

Comment 6

•

9 years ago

Cervantes, can you do the run with that custom sqlite code from comment 5? Do you need assistance with that?

Flags: needinfo?(cyu)

Cervantes Yu [:cyu] [:cervantes]

Reporter

Comment 7

•

9 years ago

I got page not found with the above link. Would you attach the file or fix the link? Thanks.

Flags: needinfo?(cyu) → needinfo?(drh)

D. Richard Hipp

Comment 8

•

9 years ago

(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #7) > I got page not found with the above link. Would you attach the file or fix > the link? Thanks. Typo on upload. Fixed now. Please try again.

Cervantes Yu [:cyu] [:cervantes]

Reporter

Comment 9

•

9 years ago

Assertion failure caught inside visual studio debugger. The two threads that use the same critical section and cause the crash The thread that calls DeleteCriticalSection: 00007ff811c02149 Unknown 00007ff8117765f8 Unknown 00007ff811776f4d Unknown > winMutexFree Line 23708 C content of p: - p 0x00007ff6ae6ce9c0 {mutex={DebugInfo=0x00000230cc895fd0 {Type=0x0000 CreatorBackTraceIndex=0x0234 CriticalSection=...} ...} ...} sqlite3_mutex * + mutex {DebugInfo=0x00000230cc895fd0 {Type=0x0000 CreatorBackTraceIndex=0x0234 CriticalSection=0x00007ff6ae6ce9c0 {...} ...} ...} _RTL_CRITICAL_SECTION id 0x00000001 int nRef 0x00000001 volatile int owner 0x000016d0 volatile unsigned long magic 0x1f7a8074 volatile unsigned int sqlite3_mutex_free Line 22482 C sqlite3LeaveMutexAndCloseZombie Line 137740 C sqlite3Close Line 137607 C mozilla::storage::Connection::internalClose Line 963 C++ this == 0x00007ff6aec44340, refcnt == 14 mozilla::storage::`anonymous namespace'::AsyncCloseConnection::Run Line 380 C++ nsThread::ProcessNextEvent Line 1082 C++ NS_ProcessNextEvent Line 290 C++ mozilla::ipc::MessagePumpForNonMainThreads::Run Line 338 C++ MessageLoop::RunHandler Line 226 C++ MessageLoop::Run Line 206 C++ nsThread::ThreadFunc Line 467 C++ _PR_NativeRunThread Line 419 C pr_root Line 96 C [External Code] Thread 0x16D0, which uses the critical section and triggers the assertion: [External Code] > winMutexEnter Line 23739 C assert( p->nRef>0 || p->owner==0 ); sqlite3_mutex_enter Line 22493 C mozilla::storage::Connection::CreateFunction Line 1801 C++ mozilla::storage::Connection::initializeClone Line 1423 C++ this == 0x00007ff6aec44340 refcnt == 14 mozilla::storage::`anonymous namespace'::AsyncInitializeClone::Run Line 432 C++ nsThread::ProcessNextEvent Line 1082 C++ NS_ProcessNextEvent Line 290 C++ mozilla::ipc::MessagePumpForNonMainThreads::Run Line 338 C++ MessageLoop::RunHandler Line 226 C++ MessageLoop::Run Line 206 C++ nsThread::ThreadFunc Line 467 C++ _PR_NativeRunThread Line 419 C pr_root Line 96 C [External Code] It doesn't look like the problem lies in sqlite3. I guess test cases running on the 2 threads race in Sqlite.cloneStorageConnection() and gets the same instance. Marco, is this the expected behavior when running xpcshell?

Flags: needinfo?(drh) → needinfo?(mak77)

Marco Bonardo [:mak]

Assignee

Comment 10

•

9 years ago

Since it doesn't happen in sequential mode, I must think this is 2 different xpcshell executables running in parallel. It is strange since they should be acting on different connections and dbs too and we take the mutex from the connection (each xpcshell uses a separate profile folder). So far I don't have a clear idea on how one may end up reusing a section freed by the other one. I will keep thinking about it. Let me also NI Andrew, who worked more widely on our threading model, and may have better ideas.

Flags: needinfo?(mak77) → needinfo?(bugmail)

Marco Bonardo [:mak]

Assignee

Updated

•

9 years ago

Flags: needinfo?(mak77)

D. Richard Hipp

Comment 11

•

9 years ago

I think what is happening is that the sharedDBMutex is being initialized (in mozStorageConnection.cpp) by this line: // Properly wrap the database handle's mutex. sharedDBMutex.initWithMutex(sqlite3_db_mutex(mDBConn)); The sqlite3_db_mutex() interface returns the mutex that the SQLite database connection mDBConn is using. But then sharedDBMutex continues to be used after mDBConn is closed, and the closing of mDBConn will have deleted the mutex. So the mutex is being used after it is freed.

Marco Bonardo [:mak]

Assignee

Comment 12

•

9 years ago

It's possible, but that wouldn't explain why this only happens when tests are run in parallel... unless that parallelism makes the system busier and changes scheduling. That is a possibility. If that's true, we may be trying to clone a closed connection.

Marco Bonardo [:mak]

Assignee

Comment 13

•

9 years ago

Just random thoughts. If the stacks in comment 9 are trustable, there must be an AsyncInitializeClone runnable enqueued after an AsyncCloseConnection, but that means we'd have an asyncClose() before an asyncClone(), and if that happens, asyncClone should bail our before dispatching its runnable, cause asyncClose() nullifies mDBConn. It doesn't sound possible. Hard to proceed from here without some debugging (my build is still ongoing and have other priorities atm, but plan to do some debugging asap). The best approach could probably involve using http://rr-project.org/, provided this is also reproducible on Linux, if normal debugging fails. That said, off-hand it doesn't look like a problem in Sqlite, and we should not block 3.15.0 upgrade on this.

Andrew Sutherland [:asuth] (he/him)

Comment 14

•

9 years ago

The problem is that Connection::AsyncClone is doing: nsCOMPtr<nsIEventTarget> target = clone->getAsyncExecutionTarget(); and getting its own brand new async thread. This allows the clone to race AsyncClose since they're on 2 different threads instead of being serialized on the source/existing connection's async thread. It should just be doing `getAsyncExecutionTarget();` on the current instance, not the fresh clone. The cloned connection is not made visible to the calling code until the runnable completes, so this should not introduce any new races. Also, the MOZ_ASSERT in the AsyncInitializeClone runnable should be corrected and a comment should be added prior to the nsIEventTarget. With the corrected assert, I don't think we need to get into complicated C++ test territory that would just be enforcing the same thing. Impact-wise, it looks like use of asyncClone is limited to places: * History.cpp's ConcurrentStatementsHolder: http://searchfox.org/mozilla-central/source/toolkit/components/places/History.cpp#1946 * via Sqlite.jsm's cloneStorageConnection: http://searchfox.org/mozilla-central/source/toolkit/modules/Sqlite.jsm#1023 * http://searchfox.org/mozilla-central/source/toolkit/components/places/PlacesUtils.jsm#2106 * http://searchfox.org/mozilla-central/source/toolkit/components/places/UnifiedComplete.js#1871 Since Places is singleton-y rather than something that has its lifecycle controlled by web content, it seems likely this shouldn't be happening in the wild too much, but seems super likely in unit tests, especially under load with the parallel execution.

Flags: needinfo?(bugmail)

Marco Bonardo [:mak]

Assignee

Comment 15

•

9 years ago

Ah, good find, that makes completely sense, I didn't see that call on 2 different thread, that was the only possible explanation. I'll take this.

Assignee: nobody → mak77

Status: NEW → ASSIGNED

Flags: needinfo?(mak77)

Comment hidden (mozreview-request)

Marco Bonardo [:mak]

Assignee

Comment 17

•

9 years ago

I think the original reason for using the clone thread, was to allow calling asyncClone from a non-async connection without forcing it to use asyncClose(). That was likely to simplify the conversion of existing consumers. Indeed one of the tests was exactly doing that. In reality the only thing currently using this API is Sqlite.jsm, that always uses async connections, so it was probably over-engineered.

Andrew Sutherland [:asuth] (he/him)

Comment 18

•

9 years ago

mozreview-review

Comment on attachment 8807166 [details] Bug 1310152 - AsyncClone may try to reuse a freed Critical Section in Sqlite. https://reviewboard.mozilla.org/r/90432/#review90132 Looks good!

Attachment #8807166 - Flags: review?(bugmail) → review+

Pulsebot

Comment 19

•

9 years ago

Pushed by mak77@bonardo.net: https://hg.mozilla.org/integration/autoland/rev/ef6bdef5db2c AsyncClone may try to reuse a freed Critical Section in Sqlite. r=asuth

Marco Bonardo [:mak]

Assignee

Updated

•

9 years ago

Priority: -- → P1

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 20

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/ef6bdef5db2c

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

status-firefox52: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla52

BMO Automation

Updated

•

1 year ago

Product: Toolkit → Core

xpcshell.exe.2447.dat.xml 9 years ago Cervantes Yu [:cyu] [:cervantes] 2.61 KB, application/xml		Details
Bug 1310152 - AsyncClone may try to reuse a freed Critical Section in Sqlite. 9 years ago Marco Bonardo [:mak] 58 bytes, text/x-review-board-request	asuth : review+	Details