Closed Bug 1310152 Opened 8 years ago Closed 8 years ago

[App Verifier] Critical section not initialized in sqlite3

Categories

(Toolkit :: Storage, defect, P1)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: cyu, Assigned: mak)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

App verifier reports that a CRITICAL_SECTION is not initialized in running:
./mach xpcshell-test toolkit/components/places/tests/bookmarks/

Stacktrace:
vfbasics!+7ff90c996501 ( @ 0)
vfbasics!+7ff90c99768a ( @ 0)
vfcuzz!CuzzGetCustomSchedulerInterface+1632 ( @ 0)
nss3!sqlite3_mutex_enter+f (e:\hg\mozilla-central\db\sqlite3\src\sqlite3.c @ 22485)
xul!mozilla::storage::Connection::initializeClone+436 (e:\hg\mozilla-central\storage\mozstorageconnection.cpp @ 1407)
xul!mozilla::storage::`anonymous namespace'::AsyncInitializeClone::Run+1a (e:\hg\mozilla-central\storage\mozstorageconnection.cpp @ 432)
xul!nsThread::ProcessNextEvent+2bd (e:\hg\mozilla-central\xpcom\threads\nsthread.cpp @ 1082)
xul!NS_ProcessNextEvent+27 (e:\hg\mozilla-central\xpcom\glue\nsthreadutils.cpp @ 290)
xul!mozilla::ipc::MessagePumpForNonMainThreads::Run+203 (e:\hg\mozilla-central\ipc\glue\messagepump.cpp @ 338)
xul!MessageLoop::RunHandler+4b (e:\hg\mozilla-central\ipc\chromium\src\base\message_loop.cc @ 226)
xul!MessageLoop::Run+1e (e:\hg\mozilla-central\ipc\chromium\src\base\message_loop.cc @ 206)
xul!nsThread::ThreadFunc+102 (e:\hg\mozilla-central\xpcom\threads\nsthread.cpp @ 467)
nss3!_PR_NativeRunThread+ee (e:\hg\mozilla-central\nsprpub\pr\src\threads\combined\pruthr.c @ 419)
nss3!pr_root+a (e:\hg\mozilla-central\nsprpub\pr\src\md\windows\w95thred.c @ 96)
ucrtbase!o__realloc_base+60 ( @ 0)
vfbasics!+7ff90c99f0d4 ( @ 0)
KERNEL32!BaseThreadInitThunk+14 ( @ 0)
ntdll!RtlUserThreadStart+21 ( @ 0)

This error cannot be reproduced with --sequential.
CC'ing drh.
The "e:\hg\mozilla-central\db\sqlite3\src\sqlite3.c @ 22485" source location information from the stack trace maps to https://www.sqlite.org/src/artifact/40dc52f7525fa?ln=98 for SQLite 3.15.0 and to https://www.sqlite.org/src/artifact/40dc52f7525fa?ln=140 for SQLite 3.14.2.  From this I infer that the issue occurs using older SQLite 3.14.2, not the newly released 3.15.0  Am I correct?
we didn't upgrade to 3.15.0 yet, so yes.
We're on 3.14.1 at the moment.
Our current hypothesis is that an sqlite3_mutex object is being used after it has been freed.  The experimental sqlite3.c file found at https://sqlite.org/tmp/bug-1310152-test1.zip will assert() (with high probability) if that happens.  Can someone please rerun the test using the experimental sqlite3.c file above and let us know what happens?
Cervantes, can you do the run with that custom sqlite code from comment 5? Do you need assistance with that?
Flags: needinfo?(cyu)
I got page not found with the above link. Would you attach the file or fix the link? Thanks.
Flags: needinfo?(cyu) → needinfo?(drh)
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #7)
> I got page not found with the above link. Would you attach the file or fix
> the link? Thanks.

Typo on upload.  Fixed now.  Please try again.
Assertion failure caught inside visual studio debugger. The two threads that use the same critical section and cause the crash

The thread that calls DeleteCriticalSection:
 	00007ff811c02149	Unknown
 	00007ff8117765f8	Unknown
 	00007ff811776f4d	Unknown
>	winMutexFree Line 23708	C
content of p:
-		p	0x00007ff6ae6ce9c0 {mutex={DebugInfo=0x00000230cc895fd0 {Type=0x0000 CreatorBackTraceIndex=0x0234 CriticalSection=...} ...} ...}	sqlite3_mutex *
+		mutex	{DebugInfo=0x00000230cc895fd0 {Type=0x0000 CreatorBackTraceIndex=0x0234 CriticalSection=0x00007ff6ae6ce9c0 {...} ...} ...}	_RTL_CRITICAL_SECTION
		id	0x00000001	int
		nRef	0x00000001	volatile int
		owner	0x000016d0	volatile unsigned long
		magic	0x1f7a8074	volatile unsigned int
 	sqlite3_mutex_free Line 22482	C
 	sqlite3LeaveMutexAndCloseZombie Line 137740	C
 	sqlite3Close Line 137607	C
 	mozilla::storage::Connection::internalClose Line 963	C++
this == 0x00007ff6aec44340, refcnt == 14
 	mozilla::storage::`anonymous namespace'::AsyncCloseConnection::Run Line 380	C++
 	nsThread::ProcessNextEvent Line 1082	C++
 	NS_ProcessNextEvent Line 290	C++
 	mozilla::ipc::MessagePumpForNonMainThreads::Run Line 338	C++
 	MessageLoop::RunHandler Line 226	C++
 	MessageLoop::Run Line 206	C++
 	nsThread::ThreadFunc Line 467	C++
 	_PR_NativeRunThread Line 419	C
 	pr_root Line 96	C
 	[External Code]	

Thread 0x16D0, which uses the critical section and triggers the assertion:
 	[External Code]	
>	winMutexEnter Line 23739	C
assert( p->nRef>0 || p->owner==0 );
 	sqlite3_mutex_enter Line 22493	C
 	mozilla::storage::Connection::CreateFunction Line 1801	C++
 	mozilla::storage::Connection::initializeClone Line 1423	C++
this == 0x00007ff6aec44340 refcnt == 14
 	mozilla::storage::`anonymous namespace'::AsyncInitializeClone::Run Line 432	C++
 	nsThread::ProcessNextEvent Line 1082	C++
 	NS_ProcessNextEvent Line 290	C++
 	mozilla::ipc::MessagePumpForNonMainThreads::Run Line 338	C++
 	MessageLoop::RunHandler Line 226	C++
 	MessageLoop::Run Line 206	C++
 	nsThread::ThreadFunc Line 467	C++
 	_PR_NativeRunThread Line 419	C
 	pr_root Line 96	C
 	[External Code]	

It doesn't look like the problem lies in sqlite3. I guess test cases running on the 2 threads race in Sqlite.cloneStorageConnection() and gets the same instance. Marco, is this the expected behavior when running xpcshell?
Flags: needinfo?(drh) → needinfo?(mak77)
Since it doesn't happen in sequential mode, I must think this is 2 different xpcshell executables running in parallel. It is strange since they should be acting on different connections and dbs too and we take the mutex from the connection (each xpcshell uses a separate profile folder). So far I don't have a clear idea on how one may end up reusing a section freed by the other one. I will keep thinking about it.
Let me also NI Andrew, who worked more widely on our threading model, and may have better ideas.
Flags: needinfo?(mak77) → needinfo?(bugmail)
Flags: needinfo?(mak77)
I think what is happening is that the sharedDBMutex is being initialized (in mozStorageConnection.cpp) by this line:

      // Properly wrap the database handle's mutex.
      sharedDBMutex.initWithMutex(sqlite3_db_mutex(mDBConn));

The sqlite3_db_mutex() interface returns the mutex that the SQLite database connection mDBConn is using. But then sharedDBMutex continues to be used after mDBConn is closed, and the closing of mDBConn will have deleted the mutex.  So the mutex is being used after it is freed.
It's possible, but that wouldn't explain why this only happens when tests are run in parallel... unless that parallelism makes the system busier and changes scheduling. That is a possibility.
If that's true, we may be trying to clone a closed connection.
Just random thoughts. If the stacks in comment 9 are trustable, there must be an AsyncInitializeClone runnable enqueued after an AsyncCloseConnection, but that means we'd have an asyncClose() before an asyncClone(), and if that happens, asyncClone should bail our before dispatching its runnable, cause asyncClose() nullifies mDBConn. It doesn't sound possible.
Hard to proceed from here without some debugging (my build is still ongoing and have other priorities atm, but plan to do some debugging asap). The best approach could probably involve using http://rr-project.org/, provided this is also reproducible on Linux, if normal debugging fails.

That said, off-hand it doesn't look like a problem in Sqlite, and we should not block 3.15.0 upgrade on this.
The problem is that Connection::AsyncClone is doing:
  nsCOMPtr<nsIEventTarget> target = clone->getAsyncExecutionTarget();
and getting its own brand new async thread.  This allows the clone to race AsyncClose since they're on 2 different threads instead of being serialized on the source/existing connection's async thread.

It should just be doing `getAsyncExecutionTarget();` on the current instance, not the fresh clone.  The cloned connection is not made visible to the calling code until the runnable completes, so this should not introduce any new races.  Also, the MOZ_ASSERT in the AsyncInitializeClone runnable should be corrected and a comment should be added prior to the nsIEventTarget.  With the corrected assert, I don't think we need to get into complicated C++ test territory that would just be enforcing the same thing.


Impact-wise, it looks like use of asyncClone is limited to places:
* History.cpp's ConcurrentStatementsHolder: http://searchfox.org/mozilla-central/source/toolkit/components/places/History.cpp#1946
* via Sqlite.jsm's cloneStorageConnection: http://searchfox.org/mozilla-central/source/toolkit/modules/Sqlite.jsm#1023
  * http://searchfox.org/mozilla-central/source/toolkit/components/places/PlacesUtils.jsm#2106
  * http://searchfox.org/mozilla-central/source/toolkit/components/places/UnifiedComplete.js#1871

Since Places is singleton-y rather than something that has its lifecycle controlled by web content, it seems likely this shouldn't be happening in the wild too much, but seems super likely in unit tests, especially under load with the parallel execution.
Flags: needinfo?(bugmail)
Ah, good find, that makes completely sense, I didn't see that call on 2 different thread, that was the only possible explanation. I'll take this.
Assignee: nobody → mak77
Status: NEW → ASSIGNED
Flags: needinfo?(mak77)
I think the original reason for using the clone thread, was to allow calling asyncClone from a non-async connection without forcing it to use asyncClose(). That was likely to simplify the conversion of existing consumers. Indeed one of the tests was exactly doing that.
In reality the only thing currently using this API is Sqlite.jsm, that always uses async connections, so it was probably over-engineered.
Comment on attachment 8807166 [details]
Bug 1310152 - AsyncClone may try to reuse a freed Critical Section in Sqlite.

https://reviewboard.mozilla.org/r/90432/#review90132

Looks good!
Attachment #8807166 - Flags: review?(bugmail) → review+
Pushed by mak77@bonardo.net:
https://hg.mozilla.org/integration/autoland/rev/ef6bdef5db2c
AsyncClone may try to reuse a freed Critical Section in Sqlite. r=asuth
Priority: -- → P1
https://hg.mozilla.org/mozilla-central/rev/ef6bdef5db2c
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: