Closed Bug 628921 Opened 9 years ago Closed 9 years ago

Changing from WAL to a rollback journal to change page_size may cause a infinite loop @AsyncExecuteStatements::executeStatement

Categories

(Toolkit :: Storage, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla2.0b12
Tracking Status
blocking2.0 --- final+

People

(Reporter: glandium, Assigned: mak)

References

Details

(Whiteboard: [hardblocker][has patch][can land])

Attachments

(1 file)

So far, I've seen this twice in the past few days. This is with a build using system sqlite version 3.7.4. This doesn't prevent firefox from running fine, but it makes it suck 100% CPU. I'm filing this bug because I'm unsure if it's a sqlite issue or a firefox issue, and I'm looking for guidance as to where to look at.

FWIW, I got a crash while tracking what was happening the second time under gdb, though I think it might have been because I only attached to one thread.

The stack trace when the cpu sucks 100% looks like this:
#0  0x00007f094865d168 in do_fcntl (fd=<value optimized out>, cmd=<value optimized out>, arg=0x7f0918dfe760)
    at ../sysdeps/unix/sysv/linux/fcntl.c:40
#1  0x00007f094865d2c8 in __libc_fcntl (fd=37, cmd=6) at ../sysdeps/unix/sysv/linux/fcntl.c:89
#2  0x00007f09454d2976 in _posixUnlock (id=0x7f092c186150, eFileLock=1) at sqlite3.c:24316
#3  unixUnlock (id=0x7f092c186150, eFileLock=1) at sqlite3.c:24381
#4  0x00007f0945496bf4 in sqlite3OsUnlock (pPager=0x7f092c186008, eLock=1) at sqlite3.c:13784
#5  pagerUnlockDb (pPager=0x7f092c186008, eLock=1) at sqlite3.c:35765
#6  0x00007f0945496f4e in pagerExclusiveLock (pPager=0x7f092c186008) at sqlite3.c:41293
#7  0x00007f0945507230 in sqlite3PagerCloseWal (p=<value optimized out>) at sqlite3.c:41413
#8  sqlite3VdbeExec (p=<value optimized out>) at sqlite3.c:138
#9  0x00007f09454e8820 in sqlite3Step (pStmt=0x7f08e1296708) at sqlite3.c:58706
#10 sqlite3_step (pStmt=0x7f08e1296708) at sqlite3.c:58770
#11 0x00007f09465348a3 in mozilla::storage::stepStmt (aStatement=0x7f08e1296708)
    at ../../../storage/src/mozStoragePrivateHelpers.cpp:283
#12 0x00007f094653369b in mozilla::storage::AsyncExecuteStatements::executeStatement (this=0x7f08df6444a0, 
    aStatement=0x7f08e1296708) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:345
#13 0x00007f094653378a in mozilla::storage::AsyncExecuteStatements::executeAndProcessStatement (this=0x7f08df6444a0, 
    aStatement=0x7f08e1296708, aLastStatement=true) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:292
#14 0x00007f0946533b18 in mozilla::storage::AsyncExecuteStatements::Run (this=0x7f08df6444a0)
    at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:601
#15 0x00007f094672c674 in nsThread::ProcessNextEvent (this=0x7f091fe90b30, mayWait=1, result=0x7f0918dfee5c)
    at ../../../xpcom/threads/nsThread.cpp:633
#16 0x00007f09466f9cb9 in NS_ProcessNextEvent_P (thread=0x25, mayWait=6) at nsThreadUtils.cpp:250
#17 0x00007f094672cd53 in nsThread::ThreadFunc (arg=<value optimized out>) at ../../../xpcom/threads/nsThread.cpp:278
#18 0x00007f094719f7c3 in _pt_root (arg=<value optimized out>) at ptthread.c:228
#19 0x00007f09486558ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#20 0x00007f09483bd02d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#21 0x0000000000000000 in ?? ()

And I can go up to mozilla::storage::AsyncExecuteStatements::executeStatement with gdb's finish command.
Status: NEW → UNCONFIRMED
Ever confirmed: false
Nothing really jumps out at me offhand.  If you hit it again, let me know?
(In reply to comment #1)
> Nothing really jumps out at me offhand.  If you hit it again, let me know?

Happening right now
sqlite3_step is returning SQLITE_BUSY, which explains the infinite loop. It doesn't explain what is keeping the database busy, though.
Sadly, after a while, it manages to break the loop, which doesn't help, since I was hoping to keep the process doing its loop until i can get some debugging instructions :-/
(In reply to comment #4)
> Sadly, after a while, it manages to break the loop, which doesn't help, since I
> was hoping to keep the process doing its loop until i can get some debugging
> instructions :-/
I guess figuring out which database file this is is the way to get more information.  We'd have to poke at the nsIFile on the Connection object to figure that out, it looks like.
Got it again with b11, and it looks like it's an access to places:

0x00007f3246b53168 in do_fcntl (fd=<value optimized out>, cmd=<value optimized out>, arg=0x7f3221bfd750)
    at ../sysdeps/unix/sysv/linux/fcntl.c:40
40	../sysdeps/unix/sysv/linux/fcntl.c: No such file or directory.
	in ../sysdeps/unix/sysv/linux/fcntl.c
(gdb) bt
#0  0x00007f3246b53168 in do_fcntl (fd=<value optimized out>, cmd=<value optimized out>, arg=0x7f3221bfd750)
    at ../sysdeps/unix/sysv/linux/fcntl.c:40
#1  0x00007f3246b532c8 in __libc_fcntl (fd=36, cmd=6) at ../sysdeps/unix/sysv/linux/fcntl.c:89
#2  0x00007f32439c2200 in unixLock (id=0x7f3226979d50, eFileLock=4) at sqlite3.c:24063
#3  0x00007f3243988c54 in sqlite3OsLock (pPager=0x7f3226979c08, eLock=4) at sqlite3.c:13781
#4  pagerLockDb (pPager=0x7f3226979c08, eLock=4) at sqlite3.c:35789
#5  0x00007f3243988f3b in pagerExclusiveLock (pPager=0x24) at sqlite3.c:41289
#6  0x00007f32439f9230 in sqlite3PagerCloseWal (p=<value optimized out>) at sqlite3.c:41413
#7  sqlite3VdbeExec (p=<value optimized out>) at sqlite3.c:138
#8  0x00007f32439da820 in sqlite3Step (pStmt=0x7f31e8f9fa88) at sqlite3.c:58706
#9  sqlite3_step (pStmt=0x7f31e8f9fa88) at sqlite3.c:58770
#10 0x00007f3244a2c267 in mozilla::storage::stepStmt (aStatement=0x7f31e8f9fa88)
    at ../../../storage/src/mozStoragePrivateHelpers.cpp:283
#11 0x00007f3244a2b05f in mozilla::storage::AsyncExecuteStatements::executeStatement (this=0x7f31f4695cc0, 
    aStatement=0x7f31e8f9fa88) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:345
#12 0x00007f3244a2b14e in mozilla::storage::AsyncExecuteStatements::executeAndProcessStatement (this=0x7f31f4695cc0, 
    aStatement=0x7f31e8f9fa88, aLastStatement=true) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:292
#13 0x00007f3244a2b4dc in mozilla::storage::AsyncExecuteStatements::Run (this=0x7f31f4695cc0)
    at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:601
#14 0x00007f3244c2688c in nsThread::ProcessNextEvent (this=0x7f32226370e0, mayWait=1, result=0x7f3221bfde5c)
    at ../../../xpcom/threads/nsThread.cpp:633
#15 0x00007f3244bf3ed1 in NS_ProcessNextEvent_P (thread=0x24, mayWait=6) at nsThreadUtils.cpp:250
#16 0x00007f3244c26f6b in nsThread::ThreadFunc (arg=<value optimized out>) at ../../../xpcom/threads/nsThread.cpp:278
#17 0x00007f324569c7c3 in _pt_root (arg=<value optimized out>) at ptthread.c:228
#18 0x00007f3246b4b8ba in start_thread (arg=<value optimized out>) at pthread_create.c:300
#19 0x00007f32468b302d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#20 0x0000000000000000 in ?? ()
(gdb) frame 11
#11 0x00007f3244a2b05f in mozilla::storage::AsyncExecuteStatements::executeStatement (this=0x7f31f4695cc0, 
    aStatement=0x7f31e8f9fa88) at ../../../storage/src/mozStorageAsyncStatementExecution.cpp:345
345	../../../storage/src/mozStorageAsyncStatementExecution.cpp: No such file or directory.
	in ../../../storage/src/mozStorageAsyncStatementExecution.cpp
(gdb) print mConnection 
$1 = {mRawPtr = 0x7f3226ba9c00}
(gdb) print *((Connection *)0x7f3226ba9c00)
$2 = {<mozIStorageConnection> = {<nsISupports> = {
      _vptr.nsISupports = 0x7f3245578d80}, <No data fields>}, <nsIInterfaceRequestor> = {<nsISupports> = {
      _vptr.nsISupports = 0x7f3245578e90}, <No data fields>}, mRefCnt = {mValue = 111}, 
  sharedAsyncExecutionMutex = {<mozilla::BlockingResourceBase> = {static kResourceTypeName = 0x7f32454247f0}, 
    mLock = 0x7f3226b36df0}, sharedDBMutex = {<mozilla::BlockingResourceBase> = {
      static kResourceTypeName = <same as static member of an already seen type>}, mMutex = 0x7f322a3cf1f8}, 
  threadOpenedOn = {<nsCOMPtr_base> = {mRawPtr = 0x7f3234d46270}, <No data fields>}, mDBConn = 0x7f3226979808, 
  mDatabaseFile = {<nsCOMPtr_base> = {mRawPtr = 0x7f3226c44b00}, <No data fields>}, 
  mAsyncExecutionThread = {<nsCOMPtr_base> = {mRawPtr = 0x7f32226370e0}, <No data fields>}, 
  mAsyncExecutionThreadShuttingDown = false, mTransactionInProgress = 0, 
  mFunctions = {<nsBaseHashtable<nsCStringHashKey, mozilla::storage::Connection::FunctionInfo, mozilla::storage::Connection::FunctionInfo>> = {<nsTHashtable<nsBaseHashtableET<nsCStringHashKey, mozilla::storage::Connection::FunctionInfo> >> = {
        mTable = {ops = 0x7f324562fa10, data = 0x0, hashShift = 28, maxAlphaFrac = 192 '\300', minAlphaFrac = 64 '@', 
          entrySize = 40, entryCount = 4, removedCount = 0, generation = 0, 
          entryStore = 0x7f3226979400 ""}}, <No data fields>}, <No data fields>}, mProgressHandler = {<nsCOMPtr_base> = {
      mRawPtr = 0x0}, <No data fields>}, mFlags = 262150, mStorageService = {mRawPtr = 0x7f322de4a140}}
(gdb) print ((Connection *)0x7f3226ba9c00)->getFilename()
$3 = {<nsACString_internal> = {mData = 0x7f31ebd7a788 "places.sqlite", mLength = 13, mFlags = 5}, <No data fields>}
Mike, I'm going to rope you in on an e-mail thread with the SQLite developers on this to debug this.  While storage could handle the case of getting SQLITE_BUSY all the time, I chatted with drh today and determined that we shouldn't be getting SQLITE_BUSY.
This looks very similar to a loop I'm seeing on solaris (sqlite3_step returning SQLITE_BUSY, because someone else has a PENDING_LOCK on the database).  see bug 633771 which has some information on what I found so far inside sqlite3.c
The "sqlite3PagerCloseWal" frame indicates that SQLite is trying to switch a database from WAL mode to rollback mode. The SQL statement executed was probably "PRAGMA journal_mode = DELETE" (or PERSIST, or anything other than WAL).

Switching from WAL to rollback mode is much harder than from rollback to WAL.
To do so successfully, SQLite requires that all other connections close the database. Not just unlock, but actually close (i.e. call sqlite3_close() or execute a DETACH statement). Otherwise it will return SQLITE_BUSY.
What would I need to poke to know the SQL statement when it happens again?
In the debugger?

In frame sqlite3VdbeExec(), look at the string p->zSql.

Or in sqlite3Step() - ((Vdbe *)pStmt)->zSql.

Above that, call sqlite3_sql() on the statement handle.
(In reply to comment #9)
> The "sqlite3PagerCloseWal" frame indicates that SQLite is trying to switch a
> database from WAL mode to rollback mode. The SQL statement executed was
> probably "PRAGMA journal_mode = DELETE" (or PERSIST, or anything other than
> WAL).
We should only be doing that if we fail to set WAL mode (if this is places.sqlite).
(In reply to comment #12)
> We should only be doing that if we fail to set WAL mode (if this is
> places.sqlite).

We also try to do that in the vacuum manager if page size is not the expected one, WAL does not allow to change page size on a VACUUM.
Checking if Glandium's places.sqlite has a wrong page size is easy, could you just run a PRAGMA page_size; on it please?
This should only happen once every 30 days though, after a 5 minutes idle, unless the two above stack have been taken with 2 different profiles or the process was interrupted (crash, hang, kill).
The "failing to set wal" case would happen really early on startup when no statement exists yet.
And based on comment 9, we are hosed because we have cloned connections and we don't close them. So this makes fixing page size pretty much hard if not impossible.
(In reply to comment #12)
> (In reply to comment #9)
> > The "sqlite3PagerCloseWal" frame indicates that SQLite is trying to switch a
> > database from WAL mode to rollback mode. The SQL statement executed was
> > probably "PRAGMA journal_mode = DELETE" (or PERSIST, or anything other than
> > WAL).
> We should only be doing that if we fail to set WAL mode (if this is
> places.sqlite).

I think if you actually fail to switch into WAL mode, the subsequent "PRAGMA journal_mode = [DELETE|PERSIST|TRUNCATE]" should not need to take the exclusive lock.
(In reply to comment #13)
> We also try to do that in the vacuum manager if page size is not the expected
> one, WAL does not allow to change page size on a VACUUM.
Oof, I forgot about that. :(
(In reply to comment #13)
> (In reply to comment #12)
> > We should only be doing that if we fail to set WAL mode (if this is
> > places.sqlite).
> 
> We also try to do that in the vacuum manager if page size is not the expected
> one, WAL does not allow to change page size on a VACUUM.
> Checking if Glandium's places.sqlite has a wrong page size is easy, could you
> just run a PRAGMA page_size; on it please?

sqlite> PRAGMA page_size;
1024

> This should only happen once every 30 days though, after a 5 minutes idle,
> unless the two above stack have been taken with 2 different profiles or the
> process was interrupted (crash, hang, kill).

This happened at least 5 times so far, within the last 2 weeks, with the same profile.
So, imo this is absolutely our try to fix the page size.
could you please check what's the value of the about:config option storage.vacuum.last.places.sqlite?
(In reply to comment #18)
> So, imo this is absolutely our try to fix the page size.
> could you please check what's the value of the about:config option
> storage.vacuum.last.places.sqlite?

1297709999
(In reply to comment #19)
> (In reply to comment #18)
> > So, imo this is absolutely our try to fix the page size.
> > could you please check what's the value of the about:config option
> > storage.vacuum.last.places.sqlite?
> 
> 1297709999

this is yesterday.
Now I see what happens, we usually vacuum every 30 days, but if we notice the page size is not the expected one we proceed regardless. So for you this could happen every day...
Based on this I'm going to confirm the bug.

Shawn, at this point, unless we find a good alternative for fixing page size, we could want to do a simple one-line patch to disable page size correction...
The only alternative I see is to delay the vacuum to xpcom-shutdown, when the connections should be closed, but without any visible window it could hurt the user (would happen only once though).
Status: UNCONFIRMED → NEW
Ever confirmed: true
My profile is very old, so I guess a lot of users using an old profile like me may have the same kind of issues in the future. Couldn't the vacuum take place during profile upgrade ?
The problem is not the vacuum, the problem is that page size can be changed only by a vacuum, but not if the journal mode is wal.  I'm not sure if SQLite team has any clue if this will be possible in a future version.

It's possible we could change so that in this specific case we run a vacuum on profile upgrade, but doesn't look something feasible in the 4.0 timeframe.  That's why I think the only viable solution today, is to disable page size change on wal databases till next version.
OK, so this turns out to be pretty bad.  For users that have an the default page size of 1024, we have a high chance of hitting this during the 3.6 -> 4.0 upgrade.  4.0 beta users are not seeing this because our vacuuming logic landed before Places started to use WAL.

mak or I can own this bug, and we'll work that out shortly...
blocking2.0: --- → final+
Whiteboard: [hardblocker]
(In reply to comment #23)
> mak or I can own this bug, and we'll work that out shortly...

By all means, but marking it yours for the moment.
Assignee: nobody → sdwilsh
Talked this over with Marco.  He's going to take this, and we are going to go with the safest approach here.  The patch is going to just not try to change the page size when the database is in WAL mode.  In the future, we'll try to figure out a way around this, but for the time being, we'll just have to deal with users having a small page size.
Assignee: sdwilsh → mak77
Status: NEW → ASSIGNED
Summary: Infinite loop @AsyncExecuteStatements::executeStatement → Changing from WAL to a rollback journal to change page_size may cause a infinite loop @AsyncExecuteStatements::executeStatement
Blocks: 634374
Attached patch patch v1.0Splinter Review
I'm sorry we have to rollback on this perf win, but after all since it was not working we are not regressing anything.  We can properly fix this for the next version.
Attachment #512649 - Flags: review?(sdwilsh)
Comment on attachment 512649 [details] [diff] [review]
patch v1.0

r=sdwilsh
Attachment #512649 - Flags: review?(sdwilsh) → review+
Whiteboard: [hardblocker] → [hardblocker][has patch][can land]
Flags: in-testsuite-
Target Milestone: --- → mozilla2.0b12
Depends on: 541373
http://hg.mozilla.org/mozilla-central/rev/49186a8f4fa2
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.