Open Bug 633771 Opened 11 years ago Updated 1 year ago

two threads looping in database code.

Categories

(Toolkit :: Storage, defect, P3)

x86
Solaris
defect

Tracking

()

People

(Reporter: sommerfeld, Unassigned)

Details

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (X11; SunOS i86pc; rv:2.0b11) Gecko/20100101 Firefox/4.0b11
Build Identifier: Mozilla/5.0 (X11; SunOS i86pc; rv:2.0b11) Gecko/20100101 Firefox/4.0b11

With 4.0b10 and 4.0b11, I periodically see firefox start consuming significant cpu (on a dual-core system, it eats slightly more than one cpu).

pstack $(pgrep firefox-bin) | c++filt shows two threads with the following 
trace:

 fe920e67 yield    (d7591ac0, f3bf21a0, f0affde8, 0, e7eea168, fea6d3dc) + 7
 fea4ae3a PR_Sleep (0) + a2
 fcfe4693 unsigned mozilla::storage::AsyncExecuteStatements::Run() (d4794290, 1, f0affed8, 0) + 2e3
 fd5fcf33 unsigned nsThread::ProcessNextEvent(int,int*) (e7efc880, 1, f0afff3c, fd57a580) + 22f
 fd57a5b5 int NS_ProcessNextEvent_P(nsIThread*,int) (e7efc880, 1, f0afff6c, fd5fbdd4) + 45
 fd5fbe48 void nsThread::ThreadFunc(void*) (e7efc880) + 94
 fea49be1 _pt_root (e7eed6a0, fe9ae000, f0afffe8, fe91c03e) + c9
 fe91c093 _thrp_setup (f9a75a40) + 9b
 fe91c320 _lwp_start (f9a75a40, 0, 0, 0, 0, 0)

single-stepping showed that sqlite3_step() is returning SQLITE_BUSY, and the eventual caller is calling PR_Sleep(0) (which turns into a yield()) and is trying again.

digging further showed that control reaches
db/sqlite3/src/sqlite3.c::unixLock():

  /* If some thread using this PID has a lock via a different unixFile*
  ** handle that precludes the requested lock, return BUSY.
  */
  if( (pFile->eFileLock!=pInode->eFileLock && 
          (pInode->eFileLock>=PENDING_LOCK || eFileLock>SHARED_LOCK))
  ){
    rc = SQLITE_BUSY;
    goto end_lock;
  }

hot-patching pInode->eFileLock to zero unwedges the URL bar and makes the symptoms go away.

I have not yet caught the code responsible for abandoning the PENDING_LOCK state.

Others have seen this bug as well.  See:

http://mail.opensolaris.org/pipermail/desktop-discuss/2011-February/016755.html

for one example.  This did not happen with b9.


Reproducible: Always

Steps to Reproduce:
Unclear.  Happens after a few minutes normal use of the browser.  notable sites visited include online.wsj.com, mail.google.com, ...

when it happens, URL completion in the URL bar stops working and cpu usage spikes.  If I bind the firefox process to one cpu I see 400k system calls per second on that cpu.



Expected Results:  
I'd expect the system to eventually detect the failure to acquire the lock and crash.


"The definition of insanity is doing the same thing over and over again and expecting different results".

hot-patching the PR_Sleep(0) to the equivalent of PR_Sleep(1) reduces the impact on the rest of the system; the code should probably adaptively increase the sleep time if it gets repeated SQLITE_BUSY returns, and crash if it's locked out for "too long".  I leave deciding how much is "too long" to you.

I'm suspicious of the sqlite upgrade to 3.7.4 in b10.

I also wonder whether this is related to 593543 on MacOS which mentions "60% under stepStmt, but more interesting 10% in kernel space messing with
semaphores." which sounds at least superficially similar to the path I caught looping on Solaris.
Component: General → Storage
Product: Firefox → Toolkit
QA Contact: general → storage
Version: unspecified → Trunk
Possible duplicate of bug 628921 (similar symptoms; the analysis in that bug doesn't dig inside sqlite3).
In b10/b11 on Solaris, I comment out "journalmode = WAL" because it doesn't work with NFS on Solaris.

Reporter, if your profile is not on NFS,
can you please try builds at
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/contrib/latest-mozilla-central/

Thanks!
Home directory & profile is in local ZFS.  Trying it now.  I'll update the bug if I see the symptom recurring.
The 4.0b12 version is better than 4.0b11 - it only seizes control of one core rather than two on my dual-core dual-proc workstation.

stacktrace to be attached shortly.
I didn't see AsyncExecuteStatements in 4.0b12 stack.

So it might be bug 628921.
With 4.0b11, I leave Firefox running overnight and I see 1 thread keep doing stepStmt(), and it returns BUSY for every time.

The SQL command is
SELECT h.id FROM moz_places h WHERE url = ?1 AND EXISTS(SELECT id FROM moz_historyvisits WHERE place_id = h.id LIMIT 1) 

BTW: I comment out journal_mode = WAL for places in this build.
(In reply to comment #7)
> BTW: I comment out journal_mode = WAL for places in this build.

it's unclear to me if this was the only change, commenting it out won't change the journal mode of an existing db. So if you didn't fix the db manually or create a new one, you can still hit bug 628921.
You should comment out the journal_mode but also create a new database, or set a different journal mode on your test database through a third party app, then re-run the test with your custom build. This is because once we set a db to wal, it will stick to wal till something other changes its mode.
The last vacuum is Jan 20, so it could be vacuum caused a lock on db.

The journal mode of my places.sqlite is not WAL when it happens.
But it is locked.

(In reply to comment #8)

> it's unclear to me if this was the only change, commenting it out won't change
> the journal mode of an existing db. 

For places.sqlite, it will set journal_mode to truncate if it cannot set journal_mode to WAL.
See nsNavHistory.cpp.
I changed
if (NS_SUCCEEDED(SetJournalMode(JOURNAL_WAL))) {
to
if (0 && NS_SUCCEEDED(SetJournalMode(JOURNAL_WAL))) {
ok, that works.
Vacuum is likely to cause a lock, but not along all the night.
I copied places.sqlite to another place, the page size is 32768.
Vacuum will lock the db for about 3 seconds.

I've no idea what is holding the lock.
This problem came back again today.
I think I'm going to reproduce it for everyday.

When it happens, I cannot search in URL bar, and this time I've 2 threads looping.
Unfortunately, I clicked close by accident.
Restarting Firefox, the lock of db was gone.

How can I know which operation is locking the db?
Status: UNCONFIRMED → NEW
Ever confirmed: true
It's unlikely a problem of vacuum.

My storage.vacuum.last.places.sqlite is Feb 09.

I tried to trigger idle daily, but the problem was not reproduced.
Just reproduced again
2 threads
thread 1 is doing
369       "/* do not warn (bug 487787) */ "
370     + "SELECT "
371     +  "(SELECT REPLACE(url, '%s', :query_string) FROM moz_places WHERE id = b.fk) "
372     +  "AS search_url, h.title, "
373     +  "IFNULL(f.url, (SELECT f.url "
374     +                 "FROM moz_places "
375     +                 "JOIN moz_favicons f ON f.id = favicon_id "
376     +                 "WHERE rev_host = (SELECT rev_host FROM moz_places WHERE id = b.fk) "
377     +                 "ORDER BY frecency DESC "
378     +                 "LIMIT 1) "
379     + "), b.parent, b.title, NULL, h.visit_count, h.typed, IFNULL(h.id, b.fk), "
380     +  ":query_type, t.open_count "
381     +  "FROM moz_keywords k "
382     +  "JOIN moz_bookmarks b ON b.keyword_id = k.id "
383     +  "LEFT JOIN moz_places h ON h.url = search_url "
384     +  "LEFT JOIN moz_favicons f ON f.id = h.favicon_id "
385     +  "LEFT JOIN moz_openpages_temp t ON t.url = search_url "
386     +  "WHERE LOWER(k.keyword) = LOWER(:keyword) "
387     +  "ORDER BY h.frecency DESC "

thread 2 is doing
SELECT h.id FROM moz_places h WHERE url = ?1 AND EXISTS(SELECT id FROM
moz_historyvisits WHERE place_id = h.id LIMIT 1)
the first one is autocomplete search by keyword, the second one seems a isvisited query.
ideally these should never lock with WAL, but since you disabled WAL a lock could happen, now if the second query is the async isvisited it should still NOT lock. I guess it must be the synchronous one. the synchronous IsVisited() call seems to be used only by download history and by Sync. Do you have Sync enabled?
And since this is not due to vacuum. I'm moving the bug to Places.
Component: Storage → Places
QA Contact: storage → places
(In reply to comment #15)
> ideally these should never lock with WAL, but since you disabled WAL a lock
> could happen, now if the second query is the async isvisited it should still
> NOT lock. I guess it must be the synchronous one. the synchronous IsVisited()
> call seems to be used only by download history and by Sync. Do you have Sync
> enabled?
It will still lock, just not on the main thread.
I don't think either the 2 queries caused a lock of db.

I reproduced the problem twice yesterday with following steps.

1) Change idle daily time stamp back to 1-2 days earlier.
2) Keep the computer idle for about 30 minutes
3) Try to browse some pages.
4) URL bar will not search history now. Problem reproduced.
I suspect places maintenance is the cause.
I still have this problem with latest mozilla-central.
places maintenance should not be the cause.

I can see 4 fd for places.sqlite, 1 is RDWR, the other 3 are RDONLY.

one of the fd held EXCLUSIVE_LOCK of places.sqlite.
(In reply to comment #20)
> one of the fd held EXCLUSIVE_LOCK of places.sqlite.

I meant PENDING_LOCK.

I think it happened in this way.
Thread 1 is busy doing
SELECT h.id FROM moz_places h WHERE url = ?1 AND EXISTS(SELECT id FROM
moz_historyvisits WHERE place_id = h.id LIMIT 1)
while loading a page.

Thread 2 is doing a transaction for frecency or favicon.

Now, thread 2 reserved a lock and then ask for exclusive lock, because thread 1 is doing a select, it can't get the exclusive lock, thread 2 gets a pending lock and returns BUSY.
We give up "COMMIT TRANSACTION" because of BUSY error, and we leave places.sqlite a pending lock.
thread 2 returns.

Thread 1 can't get SHARED lock now, and it keeps trying.
Other threads may also keep trying to get SHARED lock. URL bar doesn't work now.
And since we sleep(0); it takes a lot of CPU time.

To fix it I think we should do a loop on BUSY for COMMIT.
We do that for ROLLBACK. We should do the same thing for COMMIT.

Also we might want to change PR_Sleep(0); to PR_Sleep(10); to decrease the waste of switching context.
It is quite common that we have 2 threading waiting for SHARED lock while another thread is committing.

This bug happens to some users on Solaris because ZFS might be quite slow on fsync(), especially on heavy IO.
Attached patch patchSplinter Review
Assignee: nobody → ginn.chen
Status: NEW → ASSIGNED
Attachment #515856 - Flags: review?(sdwilsh)
(In reply to comment #21)
> To fix it I think we should do a loop on BUSY for COMMIT.
> We do that for ROLLBACK. We should do the same thing for COMMIT.
But per bug 462173 comment 10, commits should always succeed.  I'm loathe to do this for such a common operation as COMMIT, especially given how often we end up on the calling it on the GUI thread still.
If you are running on Solaris, you should have DTrace and so it should be quite feasible to create probes that show exactly what is happening, rather than just relying on inspection.  You would likely want to use pid probes and perhaps speculation() and friends.
(In reply to comment #23)
> (In reply to comment #21)
> > To fix it I think we should do a loop on BUSY for COMMIT.
> > We do that for ROLLBACK. We should do the same thing for COMMIT.
> But per bug 462173 comment 10, commits should always succeed.  I'm loathe to do
> this for such a common operation as COMMIT, especially given how often we end
> up on the calling it on the GUI thread still.

http://www.sqlite.org/lang_transaction.html

An attempt to execute COMMIT might also result in an SQLITE_BUSY return code if an another thread or process has a shared lock on the database that prevented the database from being updated. When COMMIT fails in this way, the transaction remains active and the COMMIT can be retried later after the reader has had a chance to clear.

If we don't retry for COMMIT, the COMMIT thread will reserve the lock, and no new shared lock is allowed for the db.

(In reply to comment #24)
> If you are running on Solaris, you should have DTrace and so it should be quite
> feasible to create probes that show exactly what is happening, rather than just
> relying on inspection.  You would likely want to use pid probes and perhaps
> speculation() and friends.

The only problem is I didn't find a reliable way to reproduce the problem.
DTrace would not help on that.
(In reply to comment #25)
> > If you are running on Solaris, you should have DTrace and so it should be quite
> > feasible to create probes that show exactly what is happening, rather than just
> > relying on inspection.  You would likely want to use pid probes and perhaps
> > speculation() and friends.
> 
> The only problem is I didn't find a reliable way to reproduce the problem.
> DTrace would not help on that.

Right, the idea is that DTrace with probes operating in speculative mode should have a limited performance overhead impact and only produce output in interesting circumstances so the additional human effort (after authoring the probes, of course :) is limited.

But more specifically, the proposed change to the PR_Sleep invocation seems like the type of proposal that would benefit from the kind of data that DTrace could easily gather (although it would be partially scheduler dependent).
A COMMIT can generate SQLITE_BUSY in one of the rollback journal modes, but not
in PRAGMA journal_mode=WAL.  I thought you were using journal_mode=WAL.  Am I mistaken about that?
(In reply to comment #27)
> A COMMIT can generate SQLITE_BUSY in one of the rollback journal modes, but not
> in PRAGMA journal_mode=WAL.  I thought you were using journal_mode=WAL.  Am I
> mistaken about that?
We are (although I understand it to be broken in OS/2 due to no shared memory being available).  Does it also not work in Solaris Ginn?
We didn't use WAL on Solaris because it doesn't work with NFS on Solaris.

So I comment out setting WAL. See comment #9.
(In reply to comment #29)
> We didn't use WAL on Solaris because it doesn't work with NFS on Solaris.
What doesn't work about it?  The problem here is that a number of the architectural changes we made for Firefox 4 depend on WAL working.
On NFS/Solaris, mmap() for -shm file failed if -shm file is partial locked by fcntl().
See Bug 629296.

What change in Firefox 4 depends on WAL?
What's the result if WAL is disabled, besides this bug?
(In reply to comment #31)
> On NFS/Solaris, mmap() for -shm file failed if -shm file is partial locked by
> fcntl().
> See Bug 629296.
Oh right, I had forgotten about this.

> What change in Firefox 4 depends on WAL?
Namely the fact that we now have three database connections for Places (cookies also uses two database connections during initialization).  OS/2 has been having breakage issues as well (like bug 637115).

> What's the result if WAL is disabled, besides this bug?
Lot's of locking and busy notifications in places that aren't designed to handle it (any synchronous database call, essentially).  The async API is actually decent about handling it because it does wait until busy isn't returned (with the exception of this bug).

I need to spend time thinking about solutions here...
We (the SQLite team) have a note on our to-do list to provide a version of WAL that uses heap memory instead of shared memory.  Such a heap-memory WAL would only allow a single process to connect (obviously - since only a single process would be able to see the heap) but multiple connections within multiple threads would still work from within that process.  Is this an enhancement that we should accelerate?
I think heap-memory WAL will work for NFS/Solaris case.
That will be helpful for the performance.
Comment on attachment 515856 [details] [diff] [review]
patch

If someone is explicitly calling Commit, they should handle this themselves.  However, if they are using autocommit, we should do the busy waiting in the destructor.

Yes, this differs from Rollback, but rollbacks are far less common than commits, so I want to be more conservative here.
Attachment #515856 - Flags: review?(sdwilsh) → review-
Whiteboard: [places-next-wanted]
If the BUSY error is not handled, it will leave the db unusable.
The error must be handled somewhere.
We'd better guarantee that it is handled in mozStorageTransaction.
(In reply to comment #36)
> If the BUSY error is not handled, it will leave the db unusable.
> The error must be handled somewhere.
> We'd better guarantee that it is handled in mozStorageTransaction.
If the consumer is relying on autocommit, I agree.  However, like I said in comment 35, if they call Commit explicitly, we need to not busy wait for them.
(In reply to comment #37)
> If the consumer is relying on autocommit, I agree.  However, like I said in
> comment 35, if they call Commit explicitly, we need to not busy wait for them.

Then probably we should set mCompleted to false for BUSY to make sure it is handled, right?
(In reply to comment #38)
> Then probably we should set mCompleted to false for BUSY to make sure it is
> handled, right?
There is most certainly a bug in how we use mCompleted now.  Please also add a test for this in https://mxr.mozilla.org/mozilla-central/source/storage/test/test_transaction_helper.cpp
Whiteboard: [places-next-wanted]
Assignee: ginn.chen → nobody
Status: ASSIGNED → NEW
Component: Places → Storage
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.