Last Comment Bug 686025 - nsNavHistory::AsyncExecuteLegacyQueries uses synchronous createStatement call instead of async createAsyncStatement call, blocks main thread
: nsNavHistory::AsyncExecuteLegacyQueries uses synchronous createStatement call...
Status: RESOLVED FIXED
[qa-]
:
Product: Toolkit
Classification: Components
Component: Places (show other bugs)
: Trunk
: All All
: -- normal with 4 votes (vote)
: mozilla10
Assigned To: Marco Bonardo [::mak]
:
Mentors:
: 666128 671448 688521 691146 696576 (view as bug list)
Depends on: 683876 hang-detector 689229
Blocks: PlacesJank 679800 691146
  Show dependency treegraph
 
Reported: 2011-09-09 15:08 PDT by Christopher Blizzard (:blizzard)
Modified: 2011-12-08 11:25 PST (History)
51 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed
+
fixed


Attachments
stack traces while the browser is pegged (293.78 KB, text/plain)
2011-09-09 15:08 PDT, Christopher Blizzard (:blizzard)
no flags Details
Another stacktrace - on a different system - while the browser hangs (82.22 KB, text/plain)
2011-09-10 02:11 PDT, Arjen
no flags Details
patch v1.0 (18.34 KB, patch)
2011-09-20 08:40 PDT, Marco Bonardo [::mak]
no flags Details | Diff | Review
patch v1.1 (15.21 KB, patch)
2011-09-21 14:51 PDT, Marco Bonardo [::mak]
bugmail: review+
sdwilsh: review+
Details | Diff | Review
Places stats (1.39 KB, text/plain)
2011-09-26 10:03 PDT, Justin Lebar (not reading bugmail)
no flags Details
Snippet from SQLite log during hang (106.61 KB, text/plain)
2011-09-27 11:21 PDT, Arjen
no flags Details
Contents of sqlite_stat1 before and after Analyze (2.64 KB, text/plain)
2011-09-28 10:03 PDT, Arjen
no flags Details
patch v1.2 (4.96 KB, patch)
2011-10-03 14:34 PDT, Marco Bonardo [::mak]
blizzard: approval‑mozilla‑aurora+
blizzard: approval‑mozilla‑beta+
Details | Diff | Review

Description Christopher Blizzard (:blizzard) 2011-09-09 15:08:22 PDT
Created attachment 559589 [details]
stack traces while the browser is pegged

I got a new laptop from our IT group.  It has Windows 7 (64 bit) on it and I'm using it with the default OS that came on it.

I installed Firefox 6 (release) on it, set it up with my sync account and since then I've been getting occasional hangs.  These hangs:

1. Happen pretty regularly every 5-20 minutes.
2. During that time the browser UI is unresponsive for 20-60 seconds.
3. Firefox pegs one CPU while the browser is unresponsive.
4. It does eventually come back.

I've seen it do this reliably without no plugins enabled and with no extensions enabled.

I finally got a working x86 debugger installed.  Stack traces all look like something is looking at my session history.
Comment 1 Christopher Blizzard (:blizzard) 2011-09-09 15:09:54 PDT
Oh, I've seen this with Sync disabled, and I don't think that the sync service does anything if it's turned off.
Comment 2 Philipp von Weitershausen [:philikon] 2011-09-09 15:52:47 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #1)
> Oh, I've seen this with Sync disabled, and I don't think that the sync
> service does anything if it's turned off.

Correct. The Sync code isn't loaded at all if you don't have Sync configured.
Comment 3 Matthias Versen [:Matti] 2011-09-09 17:13:52 PDT
Do you use a 32bit FF or a 64bit one ?
Comment 4 Johnathan Nightingale [:johnath] 2011-09-09 18:31:07 PDT
Main thread stacks point to nsNavHistory->sqlite -- adding the storage assault squadron
Comment 5 Christopher Blizzard (:blizzard) 2011-09-09 20:26:42 PDT
I'm using a 32 bit build on 64 bit windows.  (I actually typed "firefox.com" into IE to download it!)
Comment 6 Boris Zbarsky [:bz] 2011-09-09 21:30:33 PDT
Looking at the stack, we're going through this code:

476   _getHistoryResults:
477   function WTBLJL__getHistoryResults(aSortingMode, aLimit, aCallback, aScope) {

in WindowsJumpLists.jsm.

It looks like each stack is actually in statement preparation....
Comment 7 Johnny Stenback (:jst, jst@mozilla.com) 2011-09-09 23:29:08 PDT
Chris, I might have seen something similar as well, but I've been utterly unable to find the time to look into it. Do you see the same hangs if you attempt to open up the history menu? For me the problem went away when I disabled sync, blew away my places.sqlite file from my profile, and re-synched. The problem did come back once after doing that, and that's as far as I got. I believe I have a copy of my places.sqlite from when I was seeing this, but again, I have yet to find time to dig in here. At first I thought this was due to a local db corruption problem, but if you're seeing the same thing then I think we have a larger problem at hand here.

What little I did get out of looking into this so far was that the main thread was waiting on a sqlite lock while a background thread was doing stuff with the db while holding the same lock, but that was from an optimized build etc so I didn't get a clear picture of what was going on at the time...
Comment 8 Matthias Versen [:Matti] 2011-09-09 23:32:11 PDT
bug 679800 could be related
Comment 9 Boris Zbarsky [:bz] 2011-09-09 23:35:11 PDT
Hmm.  The stacks above show exactly what jst saw: the main thread is waiting for a lock while an async statement is running on a background thread.

I guess the real question is what that statement on the background thread is.
Comment 10 Matt Evans [:mevans] 2011-09-10 00:40:40 PDT
Chris, can you try with a clean profile and see if you get the same results? Otherwise I suggest we try to get your profile data and try that in another clean win7 platform install.
Comment 11 Arjen 2011-09-10 02:07:59 PDT
I have the same issue, I took a stack trace dump as well. So I'll attach it shortly. My system is a windows 7 64-bit one as well, with 32-bit firefox.
Comment 12 Arjen 2011-09-10 02:11:46 PDT
Created attachment 559674 [details]
Another stacktrace - on a different system - while the browser hangs

Stack trace taken when browser hangs, this one was upon opening a url from outside firefox.
Comment 13 mjh563 2011-09-10 03:20:21 PDT
I've been getting permanent hangs and 100% cpu when opening the Page Info window, and occasional hangs for up to 60s when clicking the Back button. Clearing browsing and download history seems to fix it, if that's any help.
Comment 14 Brendan Eich [:brendan] 2011-09-10 08:35:58 PDT
Why are we using a mutex to make the main thread wait for prepared (and possibly re-prepared and retried) statements to execute? This seems inherently janky and (in the case of this bug) hang-prone.

/be
Comment 15 Brendan Eich [:brendan] 2011-09-10 10:47:03 PDT
Thread 20 is holding the mutex, from mozStorage code. What change sets hit ff6 in that area?

/be
Comment 16 Boris Zbarsky [:bz] 2011-09-10 12:24:23 PDT
Brendan, the main thread is trying to prepare a statement every time that jumplist code is called.  That preparation is what tries to grab the mutex.

If the jumplist code used an already-prepared statement and just ran it async on a background thread it might well not need the mutex on the main thread at all.

That said, thread 20 is just doing an async query on the history.  The real question is who asked for that query to happen, and as usual with async stuff that information is not captured in the thread stacks....
Comment 17 Brendan Eich [:brendan] 2011-09-10 12:53:32 PDT
(In reply to Boris Zbarsky (:bz) from comment #16)
> Brendan, the main thread is trying to prepare a statement every time that
> jumplist code is called.  That preparation is what tries to grab the mutex.

Yes, I gathered that much. Did something I wrote say otherwise? The sqlite3_step code active on thread 20 does execute with retry and re-prepare on schema change, though. Do we know that thread is not ilooping?

> If the jumplist code used an already-prepared statement and just ran it
> async on a background thread it might well not need the mutex on the main
> thread at all.

Maybe. The mutex serializes lots of critical sections, though. Finer grained locking or lock-free methods do look possible. Worth a bug?

I don't thnk you can assume thread 20 is making progress. This could be a livelock. But could it be an ilooping?

/be

> 
> That said, thread 20 is just doing an async query on the history.  The real
> question is who asked for that query to happen, and as usual with async
> stuff that information is not captured in the thread stacks....
Comment 18 Brendan Eich [:brendan] 2011-09-10 12:56:56 PDT
(thanks, iPad spelling corrector!)

IOW the stack shows thread 20 owning the mutex. That could be significant, not incidental.

The number of mutex-guarded critical sections, and their apparent lengths and data at stake, in sqlite3 is alarming.

/be
Comment 19 Christopher Blizzard (:blizzard) 2011-09-10 14:22:23 PDT
Johnny, I can open the history menu without problems.

Is there a way for me to see all sqlite operations as they are executed, either dumped to the error console or stderr?
Comment 20 Dietrich Ayala (:dietrich) 2011-09-12 01:15:19 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #19)
> Is there a way for me to see all sqlite operations as they are executed,
> either dumped to the error console or stderr?

export NSPR_LOG_MODULES=mozStorage:5

^ will send queries to stdout
Comment 21 Andrew Sutherland [:asuth] 2011-09-12 17:08:16 PDT
nsNavHistory::AsyncExecuteLegacyQueries is using createStatement when it looks like it wants to be using createAsyncStatement (see http://mxr.mozilla.org/mozilla-central/source/storage/public/mozIStorageConnection.idl#182 for the right signature), unless I am missing something.

Bad usage:
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/nsNavHistory.cpp#5188

createAsyncStatement does not acquire a mutex on the main thread because it defers the actual construction of the statement to the asynchronous thread.

Mutex-wise, yes, each SQLite connection is protected by a single mutex.  All of our asynchronous operations are designed to make sure that mutex is only acquired on the asynchronous thread.  Unfortunately, the synchronous statements can be used both ways, so it's not immediately obvious if you choose the wrong one.  Regrettably the IDL does not seem to make the danger clear on this front, although I think the MDN docs may do so?

In the storage component, we have a C++ unit test to make sure code does not accidentally acquire a mutex on the main thread when it should not.  Perhaps places should adopt a similar unit test to make sure that it avoids regressing its mutex behavior.
Comment 22 Christopher Blizzard (:blizzard) 2011-09-14 20:33:46 PDT
Note that the hangs happen consistently every 10 minutes.  I can often time it.

Someone said to me that there's some sql maintenance thread that wakes up every 10 minutes or so.
Comment 23 Benjamin Smedberg [:bsmedberg] 2011-09-16 13:14:00 PDT
jst is experiencing a very similar problem where a database will hang for minutes. Apart from the incorrect createStatement call, we need to understand why a database operation off the main thread is taking so long.
Comment 24 Arjen 2011-09-16 14:02:32 PDT
(In reply to Dietrich Ayala (:dietrich) from comment #20)
> export NSPR_LOG_MODULES=mozStorage:5
> 
> ^ will send queries to stdout

If I do in a cmd-prompt on my windows:
set NSPR_LOG_FILE=%TEMP%\log.txt
set NSPR_LOG_MODULES=mozStorage:5

And then start firefox from that same prompt, a log.txt-file is created, but it doesn't get any content... So, is that actually the way I'm supposed to get a log of all sqlite-statements?
Comment 25 Shawn Wilsher :sdwilsh 2011-09-18 12:33:22 PDT
The problem here is that Places (along with lots of other places in our code) still creates and uses synchronous database statements.  SQLite is only threadsafe because it serializes all access to a database connection (this is unlikely to change any time soon).

To make this problem worse, every time we "fix" an area that does this, we end up putting more work on the background thread, which increases the likelihood that the remaining places that need to acquire the mutex on the main thread will encounter contention for the mutex.

Some of you might recall we hit this problem in the run-up to Firefox 4 as well (November/December of last year), and it caused Marco and I to have to reachitect a bunch of stuff in Places for a few months in order to work around it.  Until Firefox removes all uses of the synchronous Storage API from the main thread, this issue will keep rearing it's head.  (When that happens we can actually use SQLite in a way that stops using mutexs and will likely speed it up too.)
Comment 26 Benjamin Smedberg [:bsmedberg] 2011-09-19 07:25:40 PDT
What do we need to do to diagnose the particular "hang every 5 or 10 minutes" issue that blizzard and jst are seeing? Would breaking into the storage thread to see what statement is taking "forever" help diagnose the problem? In attachment 559589 [details] that would presumably something in the stackframe of mozilla::storage::AsyncExecuteStatements::bindExecuteAndProcessStatement. What member gives us the SQL statement that is actually going to run?
Comment 27 Christopher Blizzard (:blizzard) 2011-09-19 17:42:28 PDT
I'm running with a debug build to try to catch the query.
Comment 28 Marco Bonardo [::mak] 2011-09-20 06:01:55 PDT
May be jumplists, expiration, Sync, there's a lot of work done at intervals, most of it is off main-thread, but that's also a recipe for contention, as explained in comment 25.  We keep working to move all the stuff to the async thread, but it's a long and tedious process that breaks all of our API consumers. But we have some consumer that matters more like mobile and Sync and we worked mostly in their direction when changing APIs.
As a first step, I'll audit all of our usage of CreateStatement to check which of them may be immediately converted to CreateAsyncStatement, and see if there's any other statement that can easily be moved.
Surely it's possible some query is behaving slowly with some specific database contents, so knowing the query may help to further improve it.
Comment 29 Marco Bonardo [::mak] 2011-09-20 08:40:29 PDT
Created attachment 561203 [details] [diff] [review]
patch v1.0

This is a first pass, that I want to triple-check still. There are various places in the codebase using createStatement when they may use the async version and it's worth to fix the easy/safe ones.

- CookieService (done)
- some tests in Storage (done only a couple I wrote since some _wants_ to use createStatement)
- nsNavHistory (done)
- AddonRepository always async executes (done with a bit of renaming for clarity)
- XPIProvider uses most statements both sync and async (won't touch it)
- nsContentPrefService.js uses most statements both sync and async (won't touch it)

Dave, would my suggested changes to AddonRepository be fine?

I'm thinking if maybe we may print out a warning when we destroy a statement that was created with createStatement but never executed synchronously. Looks like this may be trackable with a debug property on the statement?
Comment 30 Johnathan Nightingale [:johnath] 2011-09-20 08:52:51 PDT
mak - should we throw this against try and give blizzard a test build?
Comment 31 Brendan Eich [:brendan] 2011-09-20 17:41:08 PDT
(In reply to Andrew Sutherland (:asuth) from comment #21)
> In the storage component, we have a C++ unit test to make sure code does not
> accidentally acquire a mutex on the main thread when it should not.  Perhaps
> places should adopt a similar unit test to make sure that it avoids
> regressing its mutex behavior.

More radical idea: remove the synchronous API entirely in the version we link with?

/be
Comment 32 Christopher Blizzard (:blizzard) 2011-09-21 12:50:28 PDT
I'm going to try and apply this patch to my Firefox 6 tree by hand and see if it makes a difference.

I suspect it will hide an underlying problem, but that's probably for another bug.
Comment 33 Marco Bonardo [::mak] 2011-09-21 13:11:01 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #32)
> I suspect it will hide an underlying problem, but that's probably for
> another bug.

It's not a suspect. The underlying problem is that Places is using both synchronous and asynchronous Storage APIs (unfortunately this happens for other components too), and that by itself creates contention.
The solution is to re-architecture Places to be completely asynchronous, that is a wip from quite some time (starting from when autocomplete was rewritten, to fsyncs reduction, usage of WAL, favicons and visits and Sync made mostly async, and so on...).
The patch by itself will just reduce some possibilities of contention (among which the above reported stack), but cannot solve an architectural problem.
Comment 34 Benjamin Smedberg [:bsmedberg] 2011-09-21 13:19:48 PDT
I don't think any of us are arguing about the general problem which needs to be fixed. But the specific problem of users hanging for minutes is relatively new. We really need to understand the specific query which is taking so long, in addition to this patch and the longer-term work.
Comment 35 Marco Bonardo [::mak] 2011-09-21 14:28:41 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #32)
> I'm going to try and apply this patch to my Firefox 6 tree by hand and see
> if it makes a difference.

I will give you a try, the above patch is conceptually fine, but after a clobber I noticed it doesn't compile (oops), so I'm finally fixing it and pushing it to try.
Comment 37 Justin Lebar (not reading bugmail) 2011-09-21 18:50:55 PDT
> It's not a suspect. The underlying problem is that Places is using both synchronous and asynchronous 
> Storage APIs (unfortunately this happens for other components too), and that by itself creates 
> contention.

I'm confused.  Supposing the sync API didn't exist, wouldn't this mystery long-running query still block all other async queries?  So instead of locking up the whole UI, we'd see things like the awesomebar and maybe favicons not working for minutes at a time?
Comment 38 Christopher Blizzard (:blizzard) 2011-09-21 22:30:18 PDT
I still get the hang running with a build with this patch anyway.  Just caught it in the debugger again.  So it doesn't do much for the problem I'm having.

Not sure if this is relevant information or not, but the sql trace running had this right before the hang started:

2864[a829098]: sqlite3_trace on a64a2d8 for 'BEGIN IMMEDIATE'
2864[a829098]: sqlite3_trace on a64a2d8 for 'UPDATE moz_cookies SET lastAccessed
 = 1316669137394000 WHERE name = 'SSID' AND host = '.mozilla.com' AND path = '/'
'
2864[a829098]: sqlite3_trace on a64a2d8 for 'UPDATE moz_cookies SET lastAccessed
 = 1316669137394000 WHERE name = 'SSRT' AND host = '.mozilla.com' AND path = '/'
'
2864[a829098]: sqlite3_trace on a64a2d8 for 'UPDATE moz_cookies SET lastAccessed
 = 1316669137394000 WHERE name = 'wtspl' AND host = '.mozilla.com' AND path = '/
''
2864[a829098]: sqlite3_trace on a64a2d8 for 'UPDATE moz_cookies SET lastAccessed
 = 1316669137394000 WHERE name = '__qca' AND host = '.mozilla.com' AND path = '/
''
2864[a829098]: sqlite3_trace on a64a2d8 for 'UPDATE moz_cookies SET lastAccessed
 = 1316669137394000 WHERE name = 'WT_FPC' AND host = '.mozilla.com' AND path = '
/''
2864[a829098]: sqlite3_trace on a64a2d8 for 'COMMIT TRANSACTION'
0[20c6a0]: sqlite3_trace on 59f9910 for 'UPDATE moz_hosts SET permission = 1, ex
pireType= 2, expireTime = 1316755537556 WHERE id = 28'
0[20c6a0]: Resetting statement: 'UPDATE moz_hosts SET permission = ?2, expireTyp
e= ?3, expireTime = ?4 WHERE id = ?1'
Comment 39 Marco Bonardo [::mak] 2011-09-22 01:12:11 PDT
(In reply to Justin Lebar [:jlebar] from comment #37)
> I'm confused.  Supposing the sync API didn't exist, wouldn't this mystery
> long-running query still block all other async queries?  So instead of
> locking up the whole UI, we'd see things like the awesomebar and maybe
> favicons not working for minutes at a time?

It depends, the awesomebar uses a separate connection and thanks to the WAL it can read at any time even if any other connection is writing. We have 4 connections exactly to improve responsiveness.
Comment 40 Marco Bonardo [::mak] 2011-09-22 01:19:00 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #38)
> I still get the hang running with a build with this patch anyway.  Just
> caught it in the debugger again.  So it doesn't do much for the problem I'm
> having.

Any possibility I may privately get a copy of the profile you are using for these tests?
Comment 41 Marco Bonardo [::mak] 2011-09-22 01:34:05 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #38)

> 2864[a829098]: sqlite3_trace on a64a2d8 for 'BEGIN IMMEDIATE'

> 0[20c6a0]: Resetting statement: 'UPDATE moz_hosts SET permission = ?2,
> expireTyp
> e= ?3, expireTime = ?4 WHERE id = ?1'

I don't think these are related, the cookies operations are completely async there, the permission manager is completely sync instead, but is doing a simple update and has synchronous=off so should be quite fast since doesn't care about durability. Do you have the queries logged just after these? which was the stack, I don't think it may be the same as before since the patch should at least fix that specific case.
Comment 42 Andrew Sutherland [:asuth] 2011-09-22 17:28:33 PDT
Comment on attachment 561576 [details] [diff] [review]
patch v1.1

r=asuth for teeny storage bit
Comment 43 Christopher Blizzard (:blizzard) 2011-09-23 00:06:42 PDT
I didn't really want to share my profile but if it would help I can try to put it somewhere.

Also I've been running the 7 beta all day and I've seen no hangs so far.
Comment 44 Arjen 2011-09-23 08:07:00 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #43)
> Also I've been running the 7 beta all day and I've seen no hangs so far.

I just installed 7b06, but I did get a hang after a while. Similar in length and behavior, so I doubt its really fixed in 7.
Comment 45 Christopher Blizzard (:blizzard) 2011-09-24 20:34:09 PDT
I've had it happen in 7 now, but it's far less than it was in 6.

Also?  Turns out this happens on my wife's machine.  She has a mac, with her own profile.  But it does beachball at 10 minute intervals.  Not always every 10 minutes, but definitely on the interval.
Comment 46 Justin Lebar (not reading bugmail) 2011-09-24 20:45:40 PDT
(In reply to Marco Bonardo [:mak] from comment #39)
> (In reply to Justin Lebar [:jlebar] from comment #37)
> > I'm confused.  Supposing the sync API didn't exist, wouldn't this mystery
> > long-running query still block all other async queries?  So instead of
> > locking up the whole UI, we'd see things like the awesomebar and maybe
> > favicons not working for minutes at a time?
> 
> It depends, the awesomebar uses a separate connection and thanks to the WAL
> it can read at any time even if any other connection is writing. We have 4
> connections exactly to improve responsiveness.

So if database reads aren't blocked by database writes, why exactly are we blocking here?  Does the API we use impose additional synchronization on top of that provided by the database?  If so, why?

I understand that we should be using async APIs everywhere, but I don't understand why access to the database should be blocked for this long.
Comment 47 Arjen 2011-09-25 01:44:37 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #45)
> I've had it happen in 7 now, but it's far less than it was in 6.

Than you're lucky, I don't see much difference with my 7 vs what I experienced in 6. I.e. 7 hangs as often as 6 did.

(In reply to Justin Lebar [:jlebar] from comment #46)
> So if database reads aren't blocked by database writes, why exactly are we
> blocking here?  Does the API we use impose additional synchronization on top
> of that provided by the database?  If so, why?
> 
> I understand that we should be using async APIs everywhere, but I don't
> understand why access to the database should be blocked for this long.

Given that a complete rewrite ('vacuum' from the sqlite-prompt) of the places.sqlite only takes a second or so, I don't even understand how it would be possible to have database operations that consume this much time. Unless there is some periodical job (like the vacuum) that can only execute if all connections are closed. But those connections are pooled and only time out after several seconds of non-usage...

But as I still have no idea how to make Firefox log the sql-operations on my end, I won't be able to help out with that much.

By the way, while typing this text firefox hung on me again, it took about 3 minutes for it to respond again...
Comment 48 Arjen 2011-09-25 03:30:28 PDT
I've monitored it a bit more, it turns out it takes about 9 minutes (not 10) to re-occur and lasts about 3 minutes than. I.e. it lasted from 11:25-11:28, 11:34-11:37, 11:43-11:46 and so on.
So basically, one third of the time, firefox is unresponsive :X
Comment 49 Andrew Sutherland [:asuth] 2011-09-25 09:08:07 PDT
(In reply to Arjen from comment #47)
> But as I still have no idea how to make Firefox log the sql-operations on my
> end, I won't be able to help out with that much.

You need a debug build or to force PR_LOGGING to be defined during the build of the storage component; then the logging environment variables you used before will have an effect.  Yes, it would be possible to make the logging always built-in with minimal overhead with a patch.  Right now defining PR_LOGGING causes the sqlite3_trace mechanism to be invoked unconditionally.

I think you can find a debug build here (picking a recent dir in the subdir tree); change up the path if it's not the right thing:
https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32-debug/
Comment 50 Arjen 2011-09-25 09:43:18 PDT
(In reply to Andrew Sutherland (:asuth) from comment #49)
> I think you can find a debug build here (picking a recent dir in the subdir
> tree); change up the path if it's not the right thing:
> https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-
> central-win32-debug/

Thanks for the pointer, but those builds require a debug version of the VC80 redistributable as well. And that seems to require a full Visual Studio install, and I'd rather not install environments like that on my personal desktop computer...

Is there no other way to get the query logging enabled?
Comment 51 Shawn Wilsher :sdwilsh 2011-09-25 11:31:54 PDT
Comment on attachment 561576 [details] [diff] [review]
patch v1.1

Review of attachment 561576 [details] [diff] [review]:
-----------------------------------------------------------------

r=sdwilsh on everything but the add-on manager stuff.
Comment 52 Marco Bonardo [::mak] 2011-09-26 02:58:58 PDT
(In reply to Justin Lebar [:jlebar] from comment #46)
> So if database reads aren't blocked by database writes, why exactly are we
> blocking here?

in the specific reported stack is a mutex protecting thread safety in Storage.

> I understand that we should be using async APIs everywhere, but I don't
> understand why access to the database should be blocked for this long.

It should not, thus the bug. No query should take more than a few milliseconds, they have been built with that in mind. Broken data (like a corrupt index or a missing ANALYZE run) or unexpected data distribution may cause that to fail though.

My problem is that I need a profile where this happens, since I've never seen a hang even in the large hundreds MB databases I have for testing, nor my default profile that is well over 100MB of data. There must be a specific case due to the data distribution or broken data that activates the behavior. So anyone able to reproduce these hangs, please, I really need your profile, or at least your places.sqlite database (provided it ends up being Places) by some private mean (either mail or mail a link to download a compressed and encrypted one).

I also have some additional questions that may help me:
- does Nightly improve or kills these hangs?
- may you give me the stats generated by this add-on (flag only "statistics") https://addons.mozilla.org/en-US/firefox/addon/places-maintenance/
Comment 53 Arjen 2011-09-26 09:18:02 PDT
Here are my stats:

> Statistics
Database size is 20480 KiB
user_version is 11
page_size is 4096
cache_size is 124393
journal_mode is wal
synchronous is 1
History can store a maximum of 254758 unique pages
Table moz_bookmarks has 192 records
Table moz_bookmarks_roots has 5 records
Table moz_keywords has 0 records
Table sqlite_sequence has 0 records
Table moz_favicons has 1125 records
Table moz_annos has 154 records
Table moz_anno_attributes has 8 records
Table moz_items_annos has 64 records
Table moz_places has 15647 records
Table moz_historyvisits has 156543 records
Table moz_inputhistory has 126 records
Table sqlite_stat1 has 9 records
Index sqlite_autoindex_moz_bookmarks_roots_1
Index sqlite_autoindex_moz_keywords_1
Index sqlite_autoindex_moz_favicons_1
Index sqlite_autoindex_moz_anno_attributes_1
Index sqlite_autoindex_moz_inputhistory_1
Index moz_bookmarks_itemindex
Index moz_bookmarks_parentindex
Index moz_bookmarks_itemlastmodifiedindex
Index moz_places_faviconindex
Index moz_places_hostindex
Index moz_places_visitcount
Index moz_places_frecencyindex
Index moz_historyvisits_placedateindex
Index moz_historyvisits_fromindex
Index moz_historyvisits_dateindex
Index moz_places_lastvisitdateindex
Index moz_annos_placeattributeindex
Index moz_items_annos_itemattributeindex
Index moz_places_url_uniqueindex
Index moz_bookmarks_guid_uniqueindex
Index moz_places_guid_uniqueindex
Trigger moz_bookmarks_beforedelete_v1_trigger

I also ran the integrity and coherence checks, and all seems well.

I'll check the Nightly shortly.
Comment 54 Andrew Sutherland [:asuth] 2011-09-26 10:00:12 PDT
(In reply to Arjen from comment #50)
> Is there no other way to get the query logging enabled?

I spun a try build wherein I think I enabled PR_LOGGING in mozStorage:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bugmail@asutherland.org-6edd9b6f035f/try-win32/

It should be very similar to a nightly, too...
Comment 55 Justin Lebar (not reading bugmail) 2011-09-26 10:03:00 PDT
Created attachment 562461 [details]
Places stats

These are the statistics from the addon on my dad's Win XP computer.  He's been seeing this hang for a few weeks now.  It happens even with sync off.

Can we generate a build which doesn't require the MSVC debug libraries which will dump some information when the db is locked for more than X seconds?

> My problem is that I need a profile where this happens

I'll try to clean the private data out of my dad's profile and send that to you.

> in the specific reported stack is a mutex protecting thread safety in Storage.

Is this mutex held while accessing the database?  Shouldn't we drop all locks before querying the DB, since the DB does its own locking?  Otherwise we've defeated all the reads-proceed-concurrently-with-writes business sqlite is trying to give us.
Comment 56 Andrew Sutherland [:asuth] 2011-09-26 10:38:58 PDT
(In reply to Andrew Sutherland (:asuth) from comment #54)
> (In reply to Arjen from comment #50)
> > Is there no other way to get the query logging enabled?
> 
> I spun a try build wherein I think I enabled PR_LOGGING in mozStorage:

It looks like I screwed this up.  I'll check with a local build and then spin another (optimized) try build.

(In reply to Justin Lebar [:jlebar] from comment #55)
> Is this mutex held while accessing the database?  Shouldn't we drop all
> locks before querying the DB, since the DB does its own locking?  Otherwise
> we've defeated all the reads-proceed-concurrently-with-writes business
> sqlite is trying to give us.

Storage actually uses the SQLite database connection mutex:
http://mxr.mozilla.org/mozilla-central/source/storage/src/SQLiteMutex.h

The upside is that in cases where the mutex will be rapidly acquired and released in a cycle, we can avoid a lot of churn/contention by just holding it for a longer aggregate period (and thereby potentially reducing the introduction of new types of exciting races).
Comment 57 Brendan Eich [:brendan] 2011-09-26 11:27:20 PDT
(In reply to Marco Bonardo [:mak] from comment #52)
> (In reply to Justin Lebar [:jlebar] from comment #46)
> > So if database reads aren't blocked by database writes, why exactly are we
> > blocking here?
> 
> in the specific reported stack is a mutex protecting thread safety in
> Storage.

What mutex is that, can you give some mxr links? Why does it need to be held across i/o? That should be avoided, always.

/be
Comment 58 Arjen 2011-09-26 11:35:33 PDT
By the way, last night's Nightly "9.0a1 (2011-09-26)" didn't improve things. If anything, it may even have a slightly faster re-occurence (8-8.5 minutes rather than the 9 minutes I mentioned for 7b06) and a slightly longer period of hangs (a bit more than 3 minutes). But that may also be just my impatience for waiting to finish the timing :-)
Comment 59 Andrew Sutherland [:asuth] 2011-09-26 13:10:13 PDT
People experiencing the hang bug, are you using any type of anti-virus?  If so, which one?  Also, a proper try build for win32 claims to be 86 minutes away from finishing baking.  It will be at https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bugmail@asutherland.org-03ac75ee927f/ when done
Comment 60 Andrew Sutherland [:asuth] 2011-09-26 14:24:24 PDT
(In reply to Andrew Sutherland (:asuth) from comment #59)
> People experiencing the hang bug, are you using any type of anti-virus?  If
> so, which one?  Also, a proper try build for win32 claims to be 86 minutes
> away from finishing baking.  It will be at
> https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/
> bugmail@asutherland.org-03ac75ee927f/ when done

Okay, the optimized try build is now up here:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bugmail@asutherland.org-03ac75ee927f/try-win32/

I have confirmed that doing:
set NSPR_LOG_FILE=%TEMP%\log.txt
set NSPR_LOG_MODULES=mozStorage:5

then running it works.
Comment 61 Andreas Gal :gal 2011-09-26 18:36:41 PDT
Two things terrify me about this bug, and none of them have to do with places:

1. Our test infrastructure doesn't catch this. We are clearly not exercising the code during tests the way we exercise it in production. This is something we have to get on top of right away. I suggest we pair up one of Bob's amazing automation wizards with someone who knows places and we come up with a stress test that exposes this issues during a test run with the code as-is.

2. We have no visibility into hangs out in the fields. Looks like this bug hits users without extensions every 10 minutes (according to Blizzard). And we don't know about it! Without visibility we can't prioritize and triage these bugs (sounds like we knew about this problem, we just didn't know it was this widespread). We need a bug for an automated watchdog, and we need it fixed asap. If I can't find one, I will file a new one.
Comment 62 Christopher Blizzard (:blizzard) 2011-09-26 22:47:44 PDT
Mak, if you find me tomorrow I can get you a .zip of my profile, which hangs reliably on windows with Firefox 6.0.2.
Comment 63 Arjen 2011-09-26 23:14:25 PDT
(In reply to Andrew Sutherland (:asuth) from comment #60)
> Okay, the optimized try build is now up here:
> https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/
> bugmail@asutherland.org-03ac75ee927f/try-win32/
> 
> I have confirmed that doing:
> set NSPR_LOG_FILE=%TEMP%\log.txt
> set NSPR_LOG_MODULES=mozStorage:5
> 
> then running it works.

And indeed it works just fine. Unfortunately, I don't see much activity in the log while its hanging and the query just before look to be doing just fine. But I'm running out of time just now, so I'll have a better look tonight.

About virus scanning, I'm running Avira AntiVir Personal. I can try if disabling it helps.

Btw, about the log, it might be useful to add some sort of timestamp into it. Now it just logs the statements with some connection and statement indications (I think), but I'm not certain if any of the other aspects is actually useful in relating the executed statements to the timestamps the hangs occurred. As said, I'll have a longer look at it tonight.
Comment 64 Marco Bonardo [::mak] 2011-09-27 01:07:19 PDT
These are 2 test builds on current Nightly where I've modified a couple things I have suspects on: the former avoids expiration notifications during handleResults (ideally should not make a difference, but since we are out of bounds here is worth a try), the latter also doesn't run ANALYZE (that causes reparsing of prepared statements). Both include the patch in this bug.
I'd like to know if they make any difference for users able to reproduce the hang.

build 1: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mak77@bonardo.net-506df90a2c98/

build 2: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mak77@bonardo.net-0054738ddc24/
Comment 65 Marco Bonardo [::mak] 2011-09-27 01:21:00 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #62)
> Mak, if you find me tomorrow I can get you a .zip of my profile, which hangs
> reliably on windows with Firefox 6.0.2.

I'll be on irc, as usual.

(In reply to Andreas Gal :gal from comment #61)
> 1. Our test infrastructure doesn't catch this. We are clearly not exercising
> the code during tests the way we exercise it in production.

We don't know the cause yet, and it affects only some users. If it's hard to reproduce, it's also hard to make a test for it. Reproducing the production conditions may be nearly impossible considering all the possible hardware/software/profile configurations of the users and those matter. But for sure the infrastructure is missing a real browsing-without-hangs test.

> 2. We have no visibility into hangs out in the fields. Looks like this bug
> hits users without extensions every 10 minutes (according to Blizzard). And
> we don't know about it! Without visibility we can't prioritize and triage
> these bugs (sounds like we knew about this problem, we just didn't know it
> was this widespread).

We still don't know, we have 3 or 4 reports here, and they may even be due to different causes (for example the first stack is due to a wrong CreateStatement call, but looks like it's not the only one). We should reports hangs with full stacks like we do with crashes.
Comment 66 Andreas Gal :gal 2011-09-27 01:25:27 PDT
> We still don't know, we have 3 or 4 reports here, and they may even be due
> to different causes (for example the first stack is due to a wrong
> CreateStatement call, but looks like it's not the only one). We should
> reports hangs with full stacks like we do with crashes.

I attached a patch to bug 429592 to do that
Comment 67 Arjen 2011-09-27 01:47:43 PDT
(In reply to Marco Bonardo [:mak] from comment #65)
> We still don't know, we have 3 or 4 reports here, and they may even be due
> to different causes (for example the first stack is due to a wrong
> CreateStatement call, but looks like it's not the only one). We should
> reports hangs with full stacks like we do with crashes.

Not every one knows about bugzilla, for instance, this support-question may very well be the same issue:
http://support.mozilla.com/nl/questions/876938

And there are 59 others who claim it affects them as well.

Here's one with 580 users:
http://support.mozilla.com/nl/questions/872283

And one with 162:
http://support.mozilla.com/nl/questions/872360

If you search more in the support-section, you'd probably find more reports that appear to be similar.
Comment 68 Marco Bonardo [::mak] 2011-09-27 01:57:13 PDT
(In reply to Arjen from comment #67)
> Not every one knows about bugzilla, for instance, this support-question may
> very well be the same issue:

No doubt, as it may be a completely different issue, unless we consider all browser hangs and freezes due to this bug. My point was that it's hard to relate hangs to reasons, not that we don't have hangs.

Btw, anyone willing to test comment 64 builds and send me profiles privately is welcome!
Comment 69 Marco Bonardo [::mak] 2011-09-27 03:09:14 PDT
Apart testing the above 2 builds, may you also try setting places.history.expiration.interval_seconds to 3600 in your current version and see if hangs disappear?
Remember to restore the pref once done.
Comment 70 Justin Lebar (not reading bugmail) 2011-09-27 08:42:20 PDT
Still happening to my dad on WinXP with virus scanner disabled.
Comment 71 Arjen 2011-09-27 10:10:20 PDT
(In reply to Marco Bonardo [:mak] from comment #69)
> Apart testing the above 2 builds, may you also try setting
> places.history.expiration.interval_seconds to 3600 in your current version
> and see if hangs disappear?
> Remember to restore the pref once done.

With this adjusted parameter my Firefox 7b06 was able to browse for at least 36 minutes without any hang. I was bored enough to test it any longer, so I didn't wait.
Comment 72 Arjen 2011-09-27 10:44:33 PDT
(In reply to Marco Bonardo [:mak] from comment #64)
> build 1:
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mak77@bonardo.net-
> 506df90a2c98/

This one seems to fix the problem. Or at least it didn't reoccur for at least half an hour.

> build 2:
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mak77@bonardo.net-
> 0054738ddc24/

I haven't tested this one, would you like me to?
Comment 73 Arjen 2011-09-27 11:21:41 PDT
Created attachment 562821 [details]
Snippet from SQLite log during hang

I've added a partial log of the statements that where in the log-file during one of those hangs (with the build from comment 60).

As can be seen at the end of the log, there are several deletes on moz_places, moz_annos and a few more.

During a second hang those queries where written to the log, but right after my browser responded again, the entries where added to it. So it is probably a good start to investigate.
Most of those deletes are fairly complicated and - although I have mainly MySQL and PostgreSQL-experience - look to me as probable slow queries. Given the size of the databases and the fact that I have a fairly new and high-end system I doubt its the I/O. Although some missing index might make things cpu-heavy if all data is already in memory.
Comment 74 Arjen 2011-09-27 12:02:37 PDT
I had a look at the query plans of some of those queries using the sqlite3 shell. The first one hit the jackpot:

This one:

SELECT h.id, h.url, h.guid, h.last_visit_date, :limit_uris FROM moz_places h LEFT JOIN moz_historyvisits v ON h.id = v.place_id LEFT JOIN moz_bookmarks b ON h.id = b.fk WHERE v.id IS NULL AND b.id IS NULL AND h.ROWID <> IFNULL(:null_skips_last, (SELECT MAX(ROWID) FROM moz_places)) LIMIT :limit_uris

(I substituted 100 for ':limit_uris' and 0 for ':null_skips_last') resulted in this plan:
0|0|0|SCAN TABLE moz_places AS h (~81 rows)
0|0|0|EXECUTE SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE moz_places USING INTEGER PRIMARY KEY (~1 rows)
0|1|1|SCAN TABLE moz_historyvisits AS v (~1 rows)
0|2|2|SEARCH TABLE moz_bookmarks AS b USING COVERING INDEX moz_bookmarks_itemlastmodifiedindex (fk=?) (~2 rows)

As you can see it expected moz_historyvisits to only have 1 row, rather than the actual 150k rows... And thus it choose a rather inefficient table scan, instead of a more useful index search.

So I ran analyze from the shell, and after that, this plan emerged:
0|0|0|SCAN TABLE moz_places AS h (~81 rows)
0|0|0|EXECUTE SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE moz_places USING INTEGER PRIMARY KEY (~1 rows)
0|1|1|SEARCH TABLE moz_historyvisits AS v USING COVERING INDEX moz_historyvisits_placedateindex (place_id=?) (~2 rows)
0|2|2|SEARCH TABLE moz_bookmarks AS b USING COVERING INDEX moz_bookmarks_itemlastmodifiedindex (fk=?) (~2 rows)

So now it uses the index, as expected.

Unless Firefox manages to mess up those statistics again, I'm fairly certain this fixes not only this specific query, but some of the others that it uses in the deletes from the logs as well.

For now it seems my original firefox 7b06 doesn't exhibit the behavior anymore now those table-statistics are in order... Perhaps you guys can add periodic analyze's to tables or databases that see heavy updates (or inserts/deletes).

Adding periodic vacuum's probably might be a good idea as well (if firefox doesn't do that already), although I have no idea what kind of performance gain or impact that has.
Comment 75 Arjen 2011-09-27 12:05:42 PDT
Obviously the statistics for moz_places where also wrong (there actually almost 8k records), but the table scan was still the correct plan, so that didn't matter much here.
Comment 76 Marco Bonardo [::mak] 2011-09-27 12:29:44 PDT
(In reply to Arjen from comment #74)
> As you can see it expected moz_historyvisits to only have 1 row, rather than
> the actual 150k rows... And thus it choose a rather inefficient table scan,
> instead of a more useful index search.

Those numbers are not the number of rows in the table, those are estimates about the number of rows that may be left to evaluate.
My question is how much time sqlite shell takes to run the query before and after analyze in your case (to simulate the "before" you can DROP sqlite_stat1 and sqlite_stat2 tables).
Will do the same experiment here.

> So I ran analyze from the shell, and after that, this plan emerged:
> Perhaps you guys can add
> periodic analyze's to tables or databases that see heavy updates

What's interesting is that we already do that. What may differ here is that most likely you ran analyze with a sqlite shell compiled with STAT2, while we just use STAT1 for privacy reasons.
Also, I think we don't run it often enough, this is known bug 683876, that I'll immediately prioritize and take.

> Adding periodic vacuum's probably might be a good idea as well (if firefox
> doesn't do that already)

We do already, on idle once a month.
Comment 77 Marco Bonardo [::mak] 2011-09-27 12:40:17 PDT
A question. Are you usually never leaving the browser open unused? The fact you had broken analyze results make me think you hardly hit idle, we have quite some idle cleanup tasks.
Comment 78 Arjen 2011-09-27 13:04:59 PDT
(In reply to Marco Bonardo [:mak] from comment #76)
> Those numbers are not the number of rows in the table, those are estimates
> about the number of rows that may be left to evaluate.

I understood that, but for a full table scan they're normally (at least in postgres and mysql) somewhat similar.

> My question is how much time sqlite shell takes to run the query before and
> after analyze in your case (to simulate the "before" you can DROP
> sqlite_stat1 and sqlite_stat2 tables).
> Will do the same experiment here.

That didn't work, but I picked yesterday's version from a backup. The specific select took 280 cpu seconds according to the shell's timer. Wall-clock wise that sounds about right as well.

After the analyze, the query dove to just 0.0468 cpu seconds.

> What's interesting is that we already do that. What may differ here is that
> most likely you ran analyze with a sqlite shell compiled with STAT2, while
> we just use STAT1 for privacy reasons.
> Also, I think we don't run it often enough, this is known bug 683876, that
> I'll immediately prioritize and take.

Actually, there was no sqlite3_stat2 or sqlite_stat2, but I can double check if you'd like me to. Apart from that, I've no idea how often the analyze is run, but given that my places-db had a mismatch of about 100x (assuming the initial estimate of 81 and later estimate of 7998 are somewhat close to the actual sizes), I doubt it has run for some time... Still, Firefox 5 had no issues with this behaviour, so either something changed to make those queries worse, or something changed to make it more pronounced.

> We do already, on idle once a month.
Ok, I did manual vacuum's and those didn't help anyway. So obviously, vacuum doesn't (re)do analyze.

(In reply to Marco Bonardo [:mak] from comment #77)
> A question. Are you usually never leaving the browser open unused? The fact
> you had broken analyze results make me think you hardly hit idle, we have
> quite some idle cleanup tasks.

I commonly walk away from my PC to watch TV or do other things that take over half an hour without me touching said PC, and if my browser happens to be open, I leave it open. I normally don't close it much during the course of the PC's on-time. So unless something during that 'idle time' prohibits Firefox from actually considering itself idle... the answer would be 'no'.
Comment 79 Brendan Eich [:brendan] 2011-09-27 13:29:09 PDT
(In reply to Brendan Eich [:brendan] from comment #57)
> (In reply to Marco Bonardo [:mak] from comment #52)
> > (In reply to Justin Lebar [:jlebar] from comment #46)
> > > So if database reads aren't blocked by database writes, why exactly are we
> > > blocking here?
> > 
> > in the specific reported stack is a mutex protecting thread safety in
> > Storage.
> 
> What mutex is that, can you give some mxr links? Why does it need to be held
> across i/o? That should be avoided, always.

Mak, can you provide some links please? It is a bad bug, which should be filed and fixed quickly, to hold a mutex across i/o.

/be
Comment 80 Justin Lebar (not reading bugmail) 2011-09-27 13:44:47 PDT
That mutex is the sqlite per-connection mutex we looked at earlier.

I think the solution here is to not share connections between threads.  Then we'll never block on the mutex.
Comment 81 Marco Bonardo [::mak] 2011-09-27 17:20:41 PDT
(In reply to Arjen from comment #78)
> That didn't work

yeah sorry, I forgot you can't drop sqlite_stat# tables, you can delete everything from them though.

> , but I picked yesterday's version from a backup. The
> specific select took 280 cpu seconds according to the shell's timer.
> After the analyze, the query dove to just 0.0468 cpu seconds.

ok, that's pretty bad. Any way I could get a compressed copy of that backup database by mail? Otherwise I'd just like to get the content of sqlite_stat1 table before and after analyze.

> I doubt it has run for some time... Still, Firefox 5 had no
> issues with this behaviour, so either something changed to make those
> queries worse, or something changed to make it more pronounced.

I took a look at the list of changes in Firefox 6, but nothing jumped to my eyes.

> > We do already, on idle once a month.
> Ok, I did manual vacuum's and those didn't help anyway. So obviously, vacuum
> doesn't (re)do analyze.

Vacuum doesn't analyze, that's correct.

> (In reply to Marco Bonardo [:mak] from comment #77)
> I commonly walk away from my PC to watch TV or do other things that take
> over half an hour without me touching said PC

So, this is pretty strange, our analyze calls should be invoked during those idles, unless something broke idle or idle expiration. Will check that too.

(In reply to Brendan Eich [:brendan] from comment #79)
> Mak, can you provide some links please? It is a bad bug, which should be
> filed and fixed quickly, to hold a mutex across i/o.

As said, Storage just gets the mutex from SQLite (http://mxr.mozilla.org/mozilla-central/source/storage/src/mozStorageConnection.cpp#537) and uses it internally to avoid having mutexes races. Storage allows 2 concurrent threads on the same connection (sync and async APIs), and uses that single mutex to drive them. The issue may happen when we cross the line between the two APIs. Asuth explained this really well here and in bug 689229.

At least now we have identified a stack that may cause a contention (with patch here) and the query that is wrongly slow, we still have to identify the reason analyze fails for some users (bug 683876).
I agree it's scary that a long query may hang the ui so badly, but it's probably scarier that we don't detect these cases in the wild.
Comment 82 Brendan Eich [:brendan] 2011-09-27 20:58:07 PDT
(In reply to Marco Bonardo [:mak] from comment #81)
> As said, Storage just gets the mutex from SQLite
> (http://mxr.mozilla.org/mozilla-central/source/storage/src/
> mozStorageConnection.cpp#537) and uses it internally to avoid having mutexes
> races.

What races? What invariants does this mutex protect?

> Storage allows 2 concurrent threads on the same connection (sync and
> async APIs), and uses that single mutex to drive them. The issue may happen
> when we cross the line between the two APIs. Asuth explained this really
> well here and in bug 689229.

The issue is a bug that needs to be fixed. Even if the line is not crossed, holding a mutex across i/o is a bug on its face. Is it on file?

/be
Comment 83 Shawn Wilsher :sdwilsh 2011-09-27 21:30:50 PDT
(In reply to Brendan Eich [:brendan] from comment #82)
> What races? What invariants does this mutex protect?
Note that this is the same mutex SQLite uses to serialize access to the connection.
http://mxr.mozilla.org/mozilla-central/ident?i=sharedDBMutex
Used to:
- protect our "in transaction" state from changing without us being aware of it
- protect `Connection::mFunctions` from being accessed (read/write) on different threads
- protect `Connection::mProgressHandler` from being accessed (read/write) on different threads

http://mxr.mozilla.org/mozilla-central/ident?i=mDBMutex
While executing an asynchronous statement, it is used to:
- prevent `::sqlite3_errmsg` from changing while we step on a statement because the main thread tried to use the connection between us calling step and getting the error.  This is only slightly expanding the time in which we hold the lock from SQLite.

> The issue is a bug that needs to be fixed. Even if the line is not crossed,
> holding a mutex across i/o is a bug on its face. Is it on file?
No, and like I said before [run-up to Firefox 3.6, run-up to Firefox 4.0 (bug 563538 + dependencies), now (bug 689229 comment 16] this is only a problem because of SQLite's longstanding serialization of access to the database and how we use it.  We can't just make everything async at once (we've been working on Places since 3.6 and it still isn't done), so we end up in the state we are in.
Comment 84 Jonathan Guerin 2011-09-28 08:23:25 PDT
I'd love to know when this bug gets fixed!
Comment 85 MasterLeep 2011-09-28 09:41:30 PDT
https://bugzilla.mozilla.org/show_bug.cgi?id=679498 is probably the same bug.
Comment 86 Arjen 2011-09-28 10:03:15 PDT
Created attachment 563098 [details]
Contents of sqlite_stat1 before and after Analyze

I've added the contents of my sqlite_stat1, before and after the the analyze.

I do recall I cleared my history a while back (I think somewhere in july, i.e. with 5.0 or so), so that might account for some of the emptiness, but still, it appears as if the tables have not been analyzed since that moment.
Comment 87 Marco Bonardo [::mak] 2011-09-28 15:31:36 PDT
(In reply to Arjen from comment #86)
> Created attachment 563098 [details]
> Contents of sqlite_stat1 before and after Analyze

fwiw, this is REALLY nonsense:
INSERT INTO "sqlite_stat1" VALUES('moz_historyvisits',NULL,'0');
Comment 88 Marco Bonardo [::mak] 2011-09-29 01:50:17 PDT
May you execute a "PRAGMA integrity_check" on your database? I wonder if analyze fails when indexes go corrupt.
Comment 89 Marco Bonardo [::mak] 2011-09-29 07:29:06 PDT
So, after playing a bit with Firefox 6 and 7, I think I have a better knowledge of why analyze didn't run often enough.  Practically on clear history we run analyze, that stores low values to the stats table, since most indexes are small.  At that point expiration is really conservative, for battery saving and avoiding unneeded work.  If it evaluates the database as clean (not enough entries to expire, that is pretty common till user reaches the history limits) it doesn't enable aggressive expiration steps, included the one on idle.
So now the database grows slowly towards history limits, without further analyze runs, at a certain point stats are so broken that queries take the wrong path.
I think the right approach is to avoid ANALYZE completely and rather use pre-calculated statistics, as implemented in bug 683876.

We should backport some fix at least to Firefox 8 and Firefox 9, the patch in this bug may be reduced to fix the known jumplists case for stable branches. Backporting also bug 683876 would be nice, but if too scary I may at last add an effective analyze run on idle.
Comment 90 Marco Bonardo [::mak] 2011-09-29 07:54:59 PDT
(In reply to Marco Bonardo [:mak] from comment #89)
> Backporting also bug 683876 would be nice, but if too scary I may
> at last add an effective analyze run on idle.

Actually taking bug 690354 on other branches may be easier.
So to clarify, on central I'd take this bug, bug 683876 and bug 690354, on aurora and beta I'd take a reduced version of this bug (only createStatement change) and bug 690354.
Comment 91 Arjen 2011-09-29 11:16:59 PDT
I'm guessing I don't need to check anything else anymore?

Its still weird this was so much more visible during Firefox 6 usage compared to 5, although perhaps the worsening of the staleness of the stats-data just happened to colide with starting to use Firefox 6...

I think its a bit scary to force-feed a database some bogus stats, but I can imagine in this case the amount of possible query plans is relatively small and there are normally not many plans that actually are much worse if the stats are overstating a table's amount of records. At least not in the multi-mb sizes firefox is operating with.
Still, a periodic Analyze would probably be best, but if you can't guarantee the periodic aspect... this may be the more defensive solution.
Comment 92 MasterLeep 2011-09-29 11:34:10 PDT
I've experienced this problem since Firefox 5 at least - maybe 4.
Comment 93 Marco Bonardo [::mak] 2011-09-29 12:46:55 PDT
(In reply to Arjen from comment #91)
> I'm guessing I don't need to check anything else anymore?

Right, I figured it out. Thank you for your availability.
 
> Its still weird this was so much more visible during Firefox 6 usage
> compared to 5

Imo, it's just a problem that piles up slowly.
Comment 94 Sheila Mooney 2011-09-29 14:26:45 PDT
We decided we need to track this for FF8 and FF9. It looks like we will get a fix soon.
Comment 95 Christopher Blizzard (:blizzard) 2011-09-29 23:01:01 PDT
Hi, a couple of notes based on re-reading the bug.

Mak - please make sure you add a telemetry probe that tells us how often we actually analyze the database.  Or, if it's very rare, at least how often we go idle.  I remember that a lot of our GC problems were related to the fact that the browser actually does idle only very rarely.  I don't know if the SQLite code has a different way to measure "idle" or not, but we should make sure that we're actually hitting the condition on a regular basis.  If there are a lot of other services that depend on that condition being hit we need to start measuring in the field now now now.

Second, think about what happened to me.  I set up a new computer, created a new profile (which set up the database with the wrong stats data) and then ran sync which populated the database.  This means that I get the hangs because we probably don't analyze the table after we populate it with a huge amount of data!  This also means that anyone else that does this is probably affected, which is very scary.  I don't know how many people are doing what I'm doing, but I can't imagine it's a small number.

If we figure out that Sync is causing terrible performance problems because of this interaction then we really need to make sure we re-analyze after sync finishes the initial data load.
Comment 96 Christopher Blizzard (:blizzard) 2011-09-29 23:21:57 PDT
Mak, if I select All in the Places Maintenance database does it re-analyze the tables in question or not?
Comment 97 Marco Bonardo [::mak] 2011-09-30 02:10:09 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #95)
> Mak - please make sure you add a telemetry probe that tells us how often we
> actually analyze the database.  Or, if it's very rare, at least how often we
> go idle.

I don't want to run ANALYZE at all in future, as implemented in bug 683876, since it's just too scary to forget to do it.
Regarding idle, telemetry is sent on idle so if a user pings he is idle, if a user doesn't ping we can't know if he signed out of telemetry or doesn't go idle. we may measure the number of times a user idles along a day, but for how telemetry is designed (it pings all statistics collected till the first idle) this requires storing data locally that is something telemetry has not yet been built for (the telemetry pricacy document states collected data is volatile).

> I remember that a lot of our GC problems were related to the fact
> that the browser actually does idle only very rarely.

Well, it depends how much rarely, once a day for Places needs may be enough.

I rather think we may want something different, maybe a system service, to run maintenance on the profile even when the browser is closed. May be more reliable than idles.

> This also means that anyone else that does this is probably
> affected, which is very scary.

Agree, this is scary, that's why I don't want to analyze anymore.

(In reply to Christopher Blizzard (:blizzard) from comment #96)
> Mak, if I select All in the Places Maintenance database does it re-analyze
> the tables in question or not?

Selecting "expire" option is enough.
Comment 98 Brendan Eich [:brendan] 2011-09-30 04:24:35 PDT
(In reply to Shawn Wilsher :sdwilsh from comment #83)
> > The issue is a bug that needs to be fixed. Even if the line is not crossed,
> > holding a mutex across i/o is a bug on its face. Is it on file?
> No, and like I said before [run-up to Firefox 3.6, run-up to Firefox 4.0
> (bug 563538 + dependencies), now (bug 689229 comment 16] this is only a
> problem because of SQLite's longstanding serialization of access to the
> database and how we use it.

We do not need to use the mutexes SQLite abuses just because it exposes them in its API, though. Let's go through the invariants, which I do not understand yet -- please educate me and others reading here.

> - protect our "in transaction" state from changing without us being aware of it

This does not need a mutex acquisition on the main thread, though. Is this a short critical section? It should be, unless you mean the main thread has to keep this state fixed against mutation from a non-main thread -- that would be bad on its face, janky if not hang-y.

> - protect `Connection::mFunctions` from being accessed (read/write) on different threads
> - protect `Connection::mProgressHandler` from being accessed (read/write) on different threads

These must be short critical sections too. So the problem is that SQLite abuses its mutexes by holding them for potentially very long critical sections, and our code on the main thread then gets hung on the same mutexes. That suggests us avoiding using these mutexes at all.

> http://mxr.mozilla.org/mozilla-central/ident?i=mDBMutex
> While executing an asynchronous statement, it is used to:
> - prevent `::sqlite3_errmsg` from changing while we step on a statement because the main thread tried to use the connection between us calling step and getting the error.  This is only slightly expanding the time in which we hold the lock from SQLite.

But if SQLite holds the mutex and goes out to lunch, then we can hang, right?

> We can't just make everything async at once
> (we've been working on Places since 3.6 and it still isn't done), so we end
> up in the state we are in.

Why can't we put all the nasty sync'ing on background threads we communicate with using our own postMessage-style queuing? Then we will still have problems when SQLite goes out to lunch and the queue fills or a response is needed, but we won't hang all of the UI.

I don't see why it takes years to stop abusing SQLite mutexes on the main thread. If we married the SQLite API, it could be there's a lot of grunt work remoting with async message passing. But is it really a multi-year mission?

/be
Comment 99 MasterLeep 2011-09-30 06:56:16 PDT
It's really 683876 that should be tracked for Firefox 8, right?
Comment 100 Marco Bonardo [::mak] 2011-09-30 07:03:19 PDT
nope, see comment 90
Comment 101 Jonathan Guerin 2011-09-30 07:14:25 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #95)
> Hi, a couple of notes based on re-reading the bug.

> 
> If we figure out that Sync is causing terrible performance problems because
> of this interaction then we really need to make sure we re-analyze after
> sync finishes the initial data load.

This is the exact way I set up my browser, and I have hang problems all the time. :-(
PS I'm not a Firefox developer, so I don't know if that comment will be much use, but I'd be happy to supply more data if you need it.
Comment 102 Christopher Blizzard (:blizzard) 2011-09-30 08:33:04 PDT
Brendan & Shawn, can you guys please take the architecture to .platform?  I want to keep this bug about how we fix the bad data that we're creating.
Comment 103 Sid 2011-09-30 11:39:00 PDT
*** Bug 666128 has been marked as a duplicate of this bug. ***
Comment 104 Christopher Blizzard (:blizzard) 2011-09-30 20:59:30 PDT
Since I installed the Places Maintenance add-on and ran the expire operation I haven't seen a long hang.  Yay!
Comment 105 Christopher Blizzard (:blizzard) 2011-09-30 21:08:55 PDT
*** Bug 688521 has been marked as a duplicate of this bug. ***
Comment 106 Michail Pappas 2011-10-01 06:35:41 PDT
I am the author of bug 688521. Thought I'd drop by in case feedback is needed and provide some info on my setup. For the record, issue appears on a new installation of FF 6.0.2 (Windows 7, 64-bit), after it gets firefox sync'ed with my sync account.

Please note that I do not experience a cycle of 9' as described here. It's something like 60" freeze followed by 60" of ok operation. See my description in bug 688521 for cpu graphs, stack when this happens etc.

(In reply to Andrew Sutherland (:asuth) from comment #59)
> People experiencing the hang bug, are you using any type of anti-virus?  If
> so, which one?
NOD32 Antivirus, ver. 4.2.something (64-bit)

(In reply to Christopher Blizzard (:blizzard) from comment #95)
> Second, think about what happened to me.  I set up a new computer, created a
> new profile (which set up the database with the wrong stats data) and then
> ran sync which populated the database.  This means that I get the hangs
> because we probably don't analyze the table after we populate it with a huge
> amount of data!  This also means that anyone else that does this is probably
> affected, which is very scary.  I don't know how many people are doing what
> I'm doing, but I can't imagine it's a small number.

Not a coder here, but it feels like this in an on-spot observation. Also matches my very own experience and possibly explains why my other much-less-powered firefox sync'ed computers do not exhibit this behavior, whereas the most powerful, albeit with a new installation, does...

I expect to be back in Wednesday, meaning I would have access to the problematic platform only by then. Since I still have got 6.0.2 installed, would you like me to perform some test before upgrading to 7? I thought I'd install the places maintainance extension and do an expire operation.

Or do you want me to produce a debug log when the issue appears? Please advise.
Comment 107 MasterLeep 2011-10-01 15:18:32 PDT
(In reply to Christopher Blizzard (:blizzard) from comment #104)
> Since I installed the Places Maintenance add-on and ran the expire operation
> I haven't seen a long hang.  Yay!

That is my experience as well.  It feels like a brand new browser.

Since three independent installations out of three that I know got affected by this bug, and we don't use Sync or anything weird, I would urge you to get this fix out there as soon as possible.
Comment 108 carl@brannenworks.com 2011-10-01 21:43:42 PDT
This has been bugging me for months. Around 50% of the time Firefox locks up and uses 50% CPU time. Fixes itself after a few minutes but only lasts a few minutes before doing it again. I'll add explicit instructions for people like me who don't understand the buzz words.

A few hours ago I installed the extension "Places Maintenance":
https://addons.mozilla.org/en-US/firefox/addon/places-maintenance/
Opened up the add-ons manager. Found "Places Maintenance". Clicked "options". Selected "All" from the drop down menu. Clicked "Execute".

Problem solved.
Comment 109 Christopher Blizzard (:blizzard) 2011-10-03 12:16:46 PDT
*** Bug 691146 has been marked as a duplicate of this bug. ***
Comment 110 Jonathan Guerin 2011-10-03 12:23:09 PDT
I have run the 'Expire' option using the Places Maintenance and have also observed no lockups in the last couple of hours.

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Comment 111 Marco Bonardo [::mak] 2011-10-03 14:04:24 PDT
since this bug is about some specific stack in createStatement, I filed meta bug 691507 to track work and bring on the discussion. I'll now split the patch here and land the reviewed part, filing a new bug for the addons manager part.
Comment 112 Marco Bonardo [::mak] 2011-10-03 14:34:37 PDT
Created attachment 564337 [details] [diff] [review]
patch v1.2

I've splitted the addons manager bits to bug 691512, so now I can land this.
Comment 114 Matt Brubeck (:mbrubeck) 2011-10-03 16:44:43 PDT
https://hg.mozilla.org/mozilla-central/rev/347b57a14ce3
Comment 115 Marco Bonardo [::mak] 2011-10-04 01:46:13 PDT
Comment on attachment 564337 [details] [diff] [review]
patch v1.2

This is the first piece I'd like to backport to 8 and 9.
The change is contained and safe enough imo, just moving some possibly-blocking calls to non-blocking calls.
To fix the hangs we'll also need bug 690354 and bug 691509 though.
Comment 116 Christopher Blizzard (:blizzard) 2011-10-04 14:51:32 PDT
Comment on attachment 564337 [details] [diff] [review]
patch v1.2

Approved for Aurora (Firefox 8) and Aurora (Firefox 9).  Please land as soon as possible.
Comment 118 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-10-13 11:16:52 PDT
Is there something specific QA can do to verify this fix, apart from general Places load testing?
Comment 119 Marco Bonardo [::mak] 2011-10-13 15:46:49 PDT
Not this specific bug, I don't think you can test this specific mutex lockup.
What you may check is how Firefox behaves after a clear history followed by a Sync import or setting up a new profile using Sync (or importing large number of bookmarks).
Comment 120 Marco Bonardo [::mak] 2011-10-13 16:22:04 PDT
PS: notice the above can't be tested in beta since hangs are not fixed in beta (due to bug 691509 missing approval to land there)
Comment 121 mlissner@michaeljaylissner.com 2011-10-14 09:54:17 PDT
Affects Fennec? I have terrible performance there, and I also used sync on Fennec to bring in my places DB.
Comment 122 Marco Bonardo [::mak] 2011-10-14 15:56:35 PDT
it may affect fennec, but mostly what affects fennec is an exageration in the number of entries, that has been addressed some time ago by making expiration more clever, but it may take time to expire all redundant entries, I'll file a bug to adapt expiration to be more aggressive when the number of pages goes dangerously over the expected limit. I think jlebar filed a bug some days ago about this.
Comment 123 Boris Zbarsky [:bz] 2011-10-30 20:48:56 PDT
*** Bug 696576 has been marked as a duplicate of this bug. ***
Comment 124 Richard Newman [:rnewman] 2011-10-31 23:24:39 PDT
*** Bug 671448 has been marked as a duplicate of this bug. ***
Comment 125 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-12-08 11:25:47 PST
Based on above comments, I don't think QA will be able to directly verify this fix.

Note You need to log in before you can comment on or make changes to this bug.