Last Comment Bug 524780 - Main thread hangs waiting for SQLite lock while another thread does I/O
: Main thread hangs waiting for SQLite lock while another thread does I/O
Status: RESOLVED DUPLICATE of bug 563538
: perf
Product: Toolkit
Classification: Components
Component: Storage (show other bugs)
: unspecified
: x86 Mac OS X
: -- normal with 2 votes (vote)
: ---
Assigned To: Nobody; OK to take it and work on it
:
Mentors:
Depends on: 556400
Blocks:
  Show dependency treegraph
 
Reported: 2009-10-27 14:01 PDT by Robert O'Callahan (:roc) (Exited; email my personal email if necessary)
Modified: 2010-07-20 12:59 PDT (History)
11 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Output of sqlite3_analyzer (44.95 KB, text/plain)
2010-01-05 14:19 PST, Robert O'Callahan (:roc) (Exited; email my personal email if necessary)
no flags Details
Output of hfsdebug on my places.sqlite file (2.43 KB, text/plain)
2010-01-05 14:30 PST, Robert O'Callahan (:roc) (Exited; email my personal email if necessary)
no flags Details

Description Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2009-10-27 14:01:25 PDT
My trunk builds hang a lot doing SQLite stuff. They always have, but I'd thought the situation was going to get better, and it never has. In fact it feels like it's getting worse. There are some severe hangs on startup (at least tens of seconds of apparent unresponsiveness), and other hangs when I type in the URL bar, but later things settle down. I think there are yet more hangs when memory load is high, though. This is a debug build on my nearly-three-year-old Macbook Pro.

Here's a stack trace of the main thread.
#0  0x96b4a2ce in semaphore_wait_signal_trap ()
#1  0x96b51da5 in pthread_mutex_lock ()
#2  0x050ca725 in pthreadMutexEnter (p=0xe2eb478) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:14833
#3  0x050ca2fc in sqlite3_mutex_enter (p=0xe2eb478) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:14106
#4  0x05102143 in sqlite3_clear_bindings (pStmt=0xe2f8e28) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:49141
#5  0x050567ab in mozilla::storage::Statement::Reset (this=0xe2f8a50) at /Users/roc/mozilla-checkin/storage/src/mozStorageStatement.cpp:550
#6  0x04e047a3 in mozStorageStatementScoper::~mozStorageStatementScoper (this=0xbfffcb24) at mozStorageHelper.h:179
#7  0x04e047c1 in mozStorageStatementScoper::~mozStorageStatementScoper (this=0xbfffcb24) at mozStorageHelper.h:180
#8  0x04e1e97f in nsNavHistory::IsURIStringVisited (this=0xe2eac70, aURIString=@0xbfffcb80) at /Users/roc/mozilla-checkin/toolkit/components/places/src/nsNavHistory.cpp:2076
#9  0x04e1eb4a in nsNavHistory::IsVisited (this=0xe2eac70, aURI=0x9a1d490, _retval=0xbfffcc8c) at /Users/roc/mozilla-checkin/toolkit/components/places/src/nsNavHistory.cpp:5234
#10 0x020941bb in nsDocShell::AddToGlobalHistory (this=0xffe4a40, aURI=0x9a1d490, aRedirect=0, aChannel=0x9cf7150) at /Users/roc/mozilla-checkin/docshell/base/nsDocShell.cpp:9673
#11 0x0209c08a in nsDocShell::OnNewURI (this=0xffe4a40, aURI=0x9a1d490, aChannel=0x9cf7150, aOwner=0x0, aLoadType=2097153, aFireOnLocationChange=0, aAddToGlobalHistory=1) at /Users/roc/mozilla-checkin/docshell/base/nsDocShell.cpp:8915
#12 0x0209c2b8 in nsDocShell::OnLoadingSite (this=0xffe4a40, aChannel=0x9cf7150, aFireOnLocationChange=0, aAddToGlobalHistory=1) at /Users/roc/mozilla-checkin/docshell/base/nsDocShell.cpp:8953

Here's what the other threads are doing:
(gdb) info threads
  16 process 227 thread 0x8a33  0x96b4a2ce in semaphore_wait_signal_trap ()
  15 process 227 thread 0x848b  0x96b4a2ce in semaphore_wait_signal_trap ()
  14 process 227 thread 0x776b  0x96b4a286 in mach_msg_trap ()
  13 process 227 thread 0x8c5b  0x96b5146e in __semwait_signal ()
  12 process 227 thread 0x6113  0x96b5146e in __semwait_signal ()
  11 process 227 thread 0x6083  0x96b72c7e in write$UNIX2003 ()
  10 process 227 thread 0x7f13  0x96b5146e in __semwait_signal ()
  9 process 227 thread 0x7c07  0x96b5146e in __semwait_signal ()
  8 process 227 thread 0x7a03  0x96b5146e in __semwait_signal ()
  7 process 227 thread 0x7903  0x96b5146e in __semwait_signal ()
  6 process 227 thread 0x7803  0x96b5146e in __semwait_signal ()
  5 process 227 thread 0x3503  0x96b5146e in __semwait_signal ()
  4 process 227 thread 0x3103  0x96b996fa in select$DARWIN_EXTSN ()
  3 process 227 thread 0x3003  0x96b5146e in __semwait_signal ()
  2 process 227 thread 0x1d0f  0x96b5146e in __semwait_signal ()
* 1 process 227 thread 0x717  0x96b4a2ce in semaphore_wait_signal_trap ()

My guess is that thread 11 is the one holding the SQLite lock, since it's the only one not waiting for a lock of some kind. Here's it's stack trace:

#0  0x96b72c7e in write$UNIX2003 ()
#1  0x050d4601 in seekAndWrite (id=0x14e0930, offset=40764, pBuf=0x217c227c, cnt=1024) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:23784
#2  0x050d48f5 in unixWrite (id=0x14e0930, pBuf=0x217c227c, amt=1024, offset=40764) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:23838
#3  0x050c973a in sqlite3OsWrite (id=0x14e0930, pBuf=0x217c227c, amt=1024, offset=40764) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:11856
#4  0x050e1307 in pager_write (pPg=0x217c2208) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:34721
#5  0x050e1a0d in sqlite3PagerWrite (pDbPage=0x217c2208) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:34892
#6  0x050f43e2 in sqlite3BtreeInsert (pCur=0xe3d2e50, pKey=0x0, nKey=316443, pData=0x22229868, nData=137, nZero=0, appendBias=0, seekResult=0) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:43227
#7  0x0510e3bd in sqlite3VdbeExec (p=0x9a8b918) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:54569
#8  0x05102dfd in sqlite3Step (p=0x9a8b918) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:49388
#9  0x05103103 in sqlite3_step (pStmt=0x9a8b918) at /Users/roc/mozilla-checkin/db/sqlite3/src/sqlite3.c:49447
#10 0x05063bc5 in mozilla::storage::AsyncExecuteStatements::executeStatement (this=0x2222c290, aStatement=0x9a8b918) at /Users/roc/mozilla-checkin/storage/src/mozStorageAsyncStatementExecution.cpp:330
#11 0x0506414d in mozilla::storage::AsyncExecuteStatements::executeAndProcessStatement (this=0x2222c290, aStatement=0x9a8b918, aLastStatement=false) at /Users/roc/mozilla-checkin/storage/src/mozStorageAsyncStatementExecution.cpp:280
#12 0x050643e3 in mozilla::storage::AsyncExecuteStatements::bindExecuteAndProcessStatement (this=0x2222c290, aData=@0xf5b0768, aLastStatement=false) at /Users/roc/mozilla-checkin/storage/src/mozStorageAsyncStatementExecution.cpp:262
#13 0x0506456f in mozilla::storage::AsyncExecuteStatements::Run (this=0x2222c290) at /Users/roc/mozilla-checkin/storage/src/mozStorageAsyncStatementExecution.cpp:551
#14 0x005bb3fa in nsThread::ProcessNextEvent (this=0x9a8bee0, mayWait=1, result=0xb05cfecc) at /Users/roc/mozilla-checkin/xpcom/threads/nsThread.cpp:527

The file being written to is handle 38:
firefox-b 227  roc   38u   REG   14,2    41788 40150552 /Users/roc/Library/Application Support/Firefox/Profiles/5vywfpbz.default/places.sqlite-journal

I have quite a lot of history:
-rw-r--r--   1 roc  roc  44499968 28 Oct 09:52 places.sqlite
-rw-r--r--@  1 roc  roc         0 28 Oct 09:58 places.sqlite-journal

My disk has been quite full in the past, although it isn't now, so I don't know how fragmented that file might be:
Filesystem    1024-blocks      Used Available Capacity  Mounted on
/dev/disk0s2    155954992 115478068  40220924    75%    /

I would have expected SQLite to be able to write without holding its lock...
Comment 1 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2009-12-02 17:56:14 PST
This bug is incredibly painful for me. Am I really the only person who is affected by these hangs?
Comment 2 Shawn Wilsher :sdwilsh 2009-12-02 18:04:34 PST
No, you are the only person I've ever heard about this from.  With that said, I'll see if I can take a look into this more tomorrow.
Comment 3 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2009-12-02 18:10:32 PST
I can send my places.sqlite file to you if that would help.
Comment 4 Shawn Wilsher :sdwilsh 2009-12-02 19:16:41 PST
(In reply to comment #3)
> I can send my places.sqlite file to you if that would help.
Probably doesn't matter, but feel free to send it.
Comment 5 Peter Weilbacher 2009-12-03 01:49:08 PST
I've seen quite a few complaints from people about application hangs in relation to SQLite files with sizes of some tens of MB. It's just the first time I see actual such detailed evidence. These people use pretty-fast-but-not-quite-up-to-date hardware.
So I don't think roc is the only person seeing this, just maybe one of the few troubled people who can actually debug this.
Comment 6 Martijn Wargers [:mwargers] (not working for Mozilla) 2009-12-03 06:03:50 PST
Disabling browsing history helped for me a lot.
Comment 7 Marco Bonardo [::mak] (Away 6-20 Aug) 2009-12-03 06:16:45 PST
we should also consider this is on a debug build, that is largely slower especially on sqlite, i don't expect us being so performant in a debug build seeing the complexity of our queries (but clearly i'd not expect hangs of this duration).
Would be really interesting to know what are we writing.
I would not consider a 44MB DB so large, we have users with larger dbs (iirc dietrich has a DB larger than 80MB for example, others report >100MB dbs).

Just to be sure, is your db clean? you can try running this in the error console:
Components.utils.import("resource://gre/modules/PlacesDBUtils.jsm");PlacesDBUtils.checkAndFixDatabase();

it will take some time to run, but will check if your db has any integrity problem, cleanup and vacuum it.
But first of all PLEASE BACKUP YOUR DATABASE, in case we need to check it.

Is this only on trunk, is 3.6 behaving better?
Comment 8 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2009-12-03 13:04:50 PST
I don't use 3.6.

I'll try checkAndFixDatabase.

It's a debug build, but my hardware isn't the slowest (dual-core laptop with 3GB RAM). Also, in the stack above we're not stuck in our code, so I don't see why debug vs opt should matter.

What information can I get from gdb that would help you next time I get a hang like this?
Comment 9 Shawn Wilsher :sdwilsh 2009-12-03 13:55:03 PST
(In reply to comment #8)
> It's a debug build, but my hardware isn't the slowest (dual-core laptop with
> 3GB RAM). Also, in the stack above we're not stuck in our code, so I don't see
> why debug vs opt should matter.
We build SQLite in debug mode in our debug builds.
Comment 10 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2009-12-03 14:05:24 PST
But we're not stuck in SQLite either. We're waiting on the write system call to complete.
Comment 11 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2010-01-04 11:59:58 PST
This is still happening on trunk. I've verified it happens in an opt build. Relevant stacks:

Main thread:
#0  0x96efa2c2 in semaphore_wait_trap ()
#1  0x96f01db2 in pthread_mutex_lock ()
#2  0x005f3852 in pthreadMutexEnter ()
#3  0x020d2b7c in mozilla::storage::Connection::BeginTransactionAs ()
#4  0x020f5d00 in nsNavHistory::CommitLazyMessages ()
#5  0x0228771d in nsTimerImpl::Fire ()

Another thread:
#0  0x96f1df9a in read$UNIX2003 ()
#1  0x005f837c in unixRead ()
#2  0x005f3330 in sqlite3OsRead ()
#3  0x00606503 in sqlite3PagerAcquire ()
#4  0x006065d4 in btreeGetPage ()
#5  0x0060721a in getAndInitPage ()
#6  0x0060728a in moveToChild ()
#7  0x00607575 in moveToLeftmost ()
#8  0x0064904a in sqlite3Step ()
#9  0x00634dcb in sqlite3_step ()
#10 0x020d945e in mozilla::storage::AsyncExecuteStatements::executeStatement ()
#11 0x020d94d8 in mozilla::storage::AsyncExecuteStatements::executeAndProcessStatement ()
#12 0x020d965a in mozilla::storage::AsyncExecuteStatements::bindExecuteAndProcessStatement ()
#13 0x020d9824 in mozilla::storage::AsyncExecuteStatements::Run ()
Comment 12 D. Richard Hipp 2010-01-04 13:19:26 PST
When you build with -DSQLITE_DEBUG, it enables over 2,700 assert() statements in SQLite, many of which contain expense sanity-check function calls or are in inner loops and other performance sensitive places.  Overall SQLite performance is about 3x slower with -DSQLITE_DEBUG enabled.  So you can see some dramatic performance improvements by using an optimized SQLite build on a CPU-bound application.

But in this case, the delays seems to be associated with I/O, not CPU, so I don't think that SQLITE_DEBUG is the culprit here.
Comment 13 Shawn Wilsher :sdwilsh 2010-01-04 13:27:48 PST
Note that when Firefox is built in debug, we compile SQLite with SQLITE_DEBUG=1
Comment 14 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2010-01-04 13:57:27 PST
I quit Firefox, moved places.sqlite to places.sqlite.bak, copied places.sqlite.bak to places.sqlite (which should defrag it, I presume), restarting, and then trying to reproduce the bug. I couldn't.

So I quit again, moved places.sqlite.bak to places.sqlite (which should restore the old fragged file, I presume), restarted, and tried to reproduce the bug. But then I couldn't reproduce either. So I'm not sure what's going on.
Comment 15 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2010-01-04 14:19:59 PST
OK, I can still reproduce the problem with my old places.sqlite. The query that was being executed on the non-main-thread was
(gdb) p v->zSql
$10 = 0xfa72b48 "SELECT h.url, h.last_visit_date AS visit_date, h.hidden, 1 as whole_entry FROM moz_places h LEFT JOIN moz_historyvisits v ON h.id = v.place_id LEFT JOIN moz_historyvisits_temp v_t ON h.id = v_t.place_id LEFT JOIN moz_bookmarks b ON h.id = b.fk WHERE v.id IS NULL AND v_t.id IS NULL AND b.id IS NULL AND b.id IS NULL AND SUBSTR(h.url, 1, 6) <> 'place:' LIMIT :max_expire"

This was statement #3 of an array of 6 statements.

(Is this asynchronous history expiration?)
Comment 16 Marco Bonardo [::mak] (Away 6-20 Aug) 2010-01-04 14:36:45 PST
(In reply to comment #15)
> (Is this asynchronous history expiration?)

it is the old async expiration done during db flushes, in nsPlacesDBFlush.
Comment 17 Shawn Wilsher :sdwilsh 2010-01-04 14:44:55 PST
Well, soon to be old since the new expiration code has not yet landed.
Comment 18 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2010-01-04 14:49:00 PST
OK, I can reproduce a hang after copying my places.sqlite to a new file (presumably, defragging it). The hang may not be as long, but it's hard to tell. It's still a several-seconds hang.
Comment 19 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2010-01-05 14:19:28 PST
Created attachment 420162 [details]
Output of sqlite3_analyzer
Comment 20 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2010-01-05 14:30:37 PST
Created attachment 420167 [details]
Output of hfsdebug on my places.sqlite file

This hfsdebug tool seems very cool. It shows that my places.sqlite is quite fragmented.

Also, if I make a copy of places.sqlite, the copy is *even more* fragmented. I guess free space on this disk is just super-fragmented in general! Anyway, that would explain why making a copy in comment #18 didn't seem to solve the problem.
Comment 21 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2010-01-05 14:33:34 PST
My hypothesis, based on not very much data, is that the fragmented places.sqlite causes many of the read()s performed by the expiration thread to incur a disk seek and be slow. Then, many of the times the main thread needs to get the SQLite lock, it will have to wait for one of those seeks to complete. This would result in the main thread making extremely slow progress --- even if the data it needs is already in memory.
Comment 22 Matthew N. [:MattN] 2010-06-13 11:12:39 PDT
I experienced a hang like this on today's Minefield build when opening my first new tab.  The hang must have been over 20 seconds as the slow script dialog appeared for tabbrowser.xml. I inspected the process with Activity Monitor and it showed the same places call stack:

...
2310 nsDocShell::OnNewURI(nsIURI*, nsIChannel*, nsISupports*, unsigned int, int, int)
  2310 nsDocShell::AddToGlobalHistory(nsIURI*, int, nsIChannel*)
    2310 nsDocShell::AddToGlobalHistory(nsIURI*, int, nsIURI*)
      2310 nsNavHistory::IsVisited(nsIURI*, int*)
        2310 nsNavHistory::IsURIStringVisited(nsACString_internal const&)
          2310 mozilla::storage::Statement::ExecuteStep(int*)
            2310 mozilla::storage::BindingParams::bind(sqlite3_stmt*)
              2310 int mozilla::storage::variantToSQLiteT<mozilla::storage::(anonymous namespace)::BindingColumnData>(mozilla::storage::(anonymous namespace)::BindingColumnData, nsIVariant*)
                2310 bindText
                  2310 vdbeUnbind
                    2310 pthreadMutexEnter
                      2310 pthread_mutex_lock
                        2310 semaphore_wait_signal_trap

My places.sqlite is 70 MB and running hfsdebug on on it seems to show that it is extremely fragmented:
  17015 allocation blocks in 4688 extents total.
  3.63 allocation blocks per extent on an average.

My drive has never been full and still has 65 GB free. Apparently Mac OS X will defrag "when a file is opened, is under 20MB and contains more than eight fragments" which would obviously never happen for my 70 MB places DB.
My home folder is encrypted with FileVault so I wonder if that has anything to do with the fragmentation.  When I log off, Mac OS claims to be compacting my home folder (IIRC).
Comment 23 Shawn Wilsher :sdwilsh 2010-06-13 11:29:28 PDT
This stack will no longer be a problem with bug 556400
Comment 24 Shawn Wilsher :sdwilsh 2010-06-13 11:31:01 PDT
(In reply to comment #22)
> My places.sqlite is 70 MB and running hfsdebug on on it seems to show that it
> is extremely fragmented:
>   17015 allocation blocks in 4688 extents total.
>   3.63 allocation blocks per extent on an average.
Note that if you have SQLite VACUUM the database, the situation will likely improve for you.
Comment 25 Matthew N. [:MattN] 2010-06-13 11:42:52 PDT
(In reply to comment #24)
> Note that if you have SQLite VACUUM the database, the situation will likely
> improve for you.

Actually, that data was from right after I vacuumed the database.  I did like roc mentioned and copied the database to a new file and back. Now hfsdebug shows that it only has 17 extents which is much better.  Hopefully this helps until bug 556400 gets fixed.
Comment 26 Shawn Wilsher :sdwilsh 2010-07-20 06:32:07 PDT
I think this may largely be a dupe of bug 563538 (only because it is more active and blocking).  Agree?
Comment 27 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2010-07-20 12:59:10 PDT
Probably, yes.

*** This bug has been marked as a duplicate of bug 563538 ***

Note You need to log in before you can comment on or make changes to this bug.