Closed Bug 563538 Opened 14 years ago Closed 14 years ago

Browser freezes due a lock contention across Places statements

Categories

(Toolkit :: Places, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: Dolske, Unassigned)

References

Details

(Keywords: dogfood, helpwanted, Whiteboard: [needs dependencies][see comment 60])

Attachments

(4 files)

Attached file Hang sample
+++ This bug was initially created as a clone of Bug #559217 +++

Encountered another ~30sec hang, and grabbed a clean sample. This is with a current trunk nightly (Gecko/20100503). Happened shortly after restart with a few tabs restored, was entering javascript:prompt("foo") into the URL bar.
We can't do anything with this one short of making it async :(
also making it async could not be so easy, this is part of browser.js::getShortcutOrURI that has a pretty sync scope.
What kind of frequency is this happening for people?
I'd think that is really hard to tell since it can happen randomly. I've never seen it personally.
(In reply to comment #3)
> What kind of frequency is this happening for people?
I have only ever heard complaints about it from dolske fwiw.  Other bad hangs we've had I heard about it from *lots* of people, and that was really before I was known as the guy to talk to about this kind of stuff.
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.5pre) Gecko/20100515 Namoroka/3.6.5pre ID:20100515042315

I don't know if this is the bug I'm seeing on the 1.9.2 branch, but I always get two browser lockups after I start firefox where the HD churns away for a minute or two. It might be something to do with my very large history, or not.

I don't see the lockups on the trunk with the same places.sqlite fwiw. But that doesn't narrow it down much.
I've hung in this stack while compiling Firefox in the background. I didn't profile often, though, so I'm not sure if specifically this stack is very common. But multi-second freezes in general happen very often to me during compilation.
Doesn't sound common enough to block on right now
blocking2.0: ? → -
I've noticed that this _seems_ to happen regularly, when I've been running a build for a day or two, restart to update to the latest lightly, and within a minute or so of restarting it will hang.

I caught it in a debug build this time, the main thread hung in nsNavHistory::RegisterOpenPage() while a background thread is doing... something. I didn't have any luck extracting a useful statement in gdb, but by breaking on the CompletionNotifier's Run() + some lucky stepping, I ended up in nsPlacesExpiration.js#629.

So it looks like I git hit with a long-running expiration right after startup, which in turn is blocking the main thread.
(In reply to comment #9)
> I caught it in a debug build this time, the main thread hung in
> nsNavHistory::RegisterOpenPage() while a background thread is doing...
> something. I didn't have any luck extracting a useful statement in gdb, but by
> breaking on the CompletionNotifier's Run() + some lucky stepping, I ended up in
> nsPlacesExpiration.js#629.
Marco!
The first expiration happens 3 minutes after startup, is this compatible with your timings? We could split the expiration statements in 2 parts to break the running thread eventually.
actually, for the getShortcutORURI thing, I don't expect us to have an incredible amount of keywords, we could have what, 20, 30 at a maximum (actually I think I have about 5). At first request we could cache all keyword/uri associations in a string hash, and serve them from it. At least the check "is this a keyword" would be faster after the first request and immediatly fallback to the default.

Regarding expiration I have a patch in Bug 562644 that removes the only sync part of the service, and avoid doing double queries... should improve expiration performance quite a bit.
(In reply to comment #12)
> At first request we could cache all keyword/uri associations in a string hash,
> and serve them from it.

I like this idea!
Depends on: 572030
The keywords hash idea is partially implemented in bug 572030, it still needs some work right now.
Depends on: 562644
Brendan reports seeing a hang-after-startup that may very well be this too (CC'd). I'm also still seeing this regularly, though I've become numb to just starting the browser and doing something else for a few minutes (my hangs seem to be in the 1-3 minute ballpark, now). Renomming.

(In reply to comment #11)
> The first expiration happens 3 minutes after startup, is this compatible with
> your timings?

Err, why do we do this? Seems like that should at least be on an idle timer (though that would just make the hang happen at random less-noticeable times).
blocking2.0: - → ?
This is perhaps what I'm seeing too. Though last time I sampled I saw the background thread in autocomplete stuff and then the main thread getting an annotation (favicons maybe?). It's really annoying.
Why isn't this assigned yet? If it's what is biting me, it happens often but unpredictably, and it's a blocker in my book -- I restart every time there's a new m-c nightly.

/be
blocking2.0: ? → betaN+
Assignee: nobody → mak77
This bug is a good example of what Lilly calls "table stakes". It doesn't matter how many nifty WebGL or IndexDB features we have if users encounter 30 second hangs and tell their friends.

This blocks Firefox 4 and needs to be fixed ASAP. If I have the wrong assignee, go ahead change it to another person.
(In reply to comment #15)
> Err, why do we do this? Seems like that should at least be on an idle timer
> (though that would just make the hang happen at random less-noticeable times).

Expiring on idle would be insufficient for anyone, we do also that when needed but it would also be completely useless for a bunch of users that don't hit idle. Expiration is not more expensive than before, it is less and we don't have evidence that is caused by expiration (that is also running in the async thread and doing really small steps). I just guessed because it runs on a timer, but I also saw some hang and it was not running in that moment.

Favicons are not annotations and are on the async thread as well, even if a small part is still sync and should be fixed (we improved it but it's not yet ready). A bunch of other stuff could ask for annotations, it's too generic.

There is nothing more than a fact that someone is seeing hangs randomly, we have planned a bunch of improvements but there is no evidence what could improve this to prioritize those. I just have the keywords bug that I can finish, but there is no evidence if it will help or not.

There is not even evidence if this bug is exactly what people is seeing.
If you think someone other has time and knowledge to reach a solution faster feel free to change assignee, I won't mind.
(In reply to comment #19)
> 
> There is not even evidence if this bug is exactly what people is seeing.
> If you think someone other has time and knowledge to reach a solution faster
> feel free to change assignee, I won't mind.

Your team's job is to find out what this bug is, and make sure it is fixed.
(In reply to comment #20)
> (In reply to comment #19)
> > 
> > There is not even evidence if this bug is exactly what people is seeing.
> > If you think someone other has time and knowledge to reach a solution faster
> > feel free to change assignee, I won't mind.
> 
> Your team's job is to find out what this bug is, and make sure it is fixed.

He didn't imply that it wasn't. Only that if anyone else had a better solution, they're welcome to hop in :)

As Marco noted, there are a couple of in-progress bugs that might affect this one. We'll try to get those in as soon as possible. Since Dolske can reproduce regularly, we'll work with him to test fixes.
Status: NEW → ASSIGNED
I also have some ideas on how we can improve our concurrency for read-only tasks in places based on stuff we learned from our IndexedDB work which will reduce lock contention as well.
I can most likely reduce even more expiration work, so that we run minimum number of needed statements. I can figure out a less API-changing approach for caching keywords and put down a list of statements we still run sync without user explicit request (so during a common navigation session).
Also killing temp tables should cut by 30% execution times of most of our statements, I should start working on that so it's ready sooner than later, lazy_add killing is already WIP, and new sqlite should be coming.

Please, anyone who can capture samples of the hangs should provide them to help figure out where we still fail in your case.
Attached file Hang sample
Here is a sample from today's build shortly after I started it (a little under 5 minutes). Normally I get hung when typing in the address bar but in this case the hang was while typing in a bugzilla comment.
(In reply to comment #24)
> Here is a sample from today's build shortly after I started it (a little under
> 5 minutes). Normally I get hung when typing in the address bar but in this case
> the hang was while typing in a bugzilla comment.
That is places code again, but not sure what is triggering it because it's in JS :(
I can guess, since he was not activating anything in our UI, livemarks updating could be triggering getItemAnnotation. Btw, there will always be some kind of locking because our threads are part sync, part async, we can just try to make them rare reducing the time spent in statements, unless we want to spend 6 months just rewriting all the module, that we are doing in chunks.
(In reply to comment #26)
> I can guess, since he was not activating anything in our UI, livemarks updating
> could be triggering getItemAnnotation. Btw, there will always be some kind of
> locking because our threads are part sync, part async, we can just try to make
> them rare reducing the time spent in statements, unless we want to spend 6
> months just rewriting all the module, that we are doing in chunks.

I don't believe that I have any livemarks.
Is this a deadlock? I.e., is no progress made and no CPU time accrued once the threads enter this state?

/be
(In reply to comment #28)
> Is this a deadlock? I.e., is no progress made and no CPU time accrued once the
> threads enter this state?
No, it's just one thread holding the lock for longer than expected.  There's more lock contention that we'd like to have here, and we need to get rid of it.
(In reply to comment #29)
> (In reply to comment #28)
> > Is this a deadlock? I.e., is no progress made and no CPU time accrued once the
> > threads enter this state?
> No, it's just one thread holding the lock for longer than expected.  There's
> more lock contention that we'd like to have here, and we need to get rid of it.

Is there a quick way we can add logging to a build to indicate which statements end up waiting on a lock? I guess that doesn't tell us what the evil long running background thread is that is causing the problem but might help us avoid it.
Nothing comes to mind immediately.  I'll think about it some more and maybe I'll come up with something on the plane tomorrow.
A mutex should never be held across any I/O operation (or of course an unbounded or even just overlong CPU operation).

A condvar can be the right tool if the contending thread is able to wait, but of course the main thread has to service its event queue, so effectively try-locking may be the only recourse.

Is this sqlite code or our code, or a mix?

/be
(In reply to comment #32)
> Is this sqlite code or our code, or a mix?
This is SQLite code and it's how they ensure that access to the database is serialized.

This is exactly why I advocate everyone to use the asynchronous storage API because it's lock free on the calling thread (usually the main thread).
(In reply to comment #33)
> 
> This is exactly why I advocate everyone to use the asynchronous storage API

Yes, but comment 26 says

>  unless we want to spend 6 months just rewriting all the module, that we are doing in chunks.

6 months seems like a Really Long Time, but I don't have a lot of data here. Shipping with hangs like this doesn't fly, so where does that leave us?
(In reply to comment #34)
> 6 months seems like a Really Long Time, but I don't have a lot of data here.
> Shipping with hangs like this doesn't fly, so where does that leave us?
I'm not sure six months is an accurate estimate, but I'm not sure we've got one either.  We've been converting places over to the asynchronous API in chunks each release.

We should see if this is still an issue once mak gets his patch up to nuke nsPlacesDBFlush for two reasons:
1) This would get rid of a lot of the I/O (or at least the large amounts of I/O) we do on the background thread (every two minutes + whenever we add a bookmark).
2) All of our queries get a lot simpler so they should run faster.

I don't have a bug number handy, but I bet mak does.
bug 552023 is temp tables removal (bug 499828, that i'm working on, should come first to simplify code removing redundant stuff), bug 556631 is making frecency async, there is other stuff we can/should convert but will need a list of queries that are running sync during user's browsing, Storage PRLog should be able to generate the list easily. I have an expiration patch on disk that should further reduce timed work, will soon file a bug for it.
Another interesting question for folk seeing this:
1) Do you have Weave (Firefox Sync) installed?
2) If yes to (1), do you have more than one computer using it?

(trying to figure out the increase in places activity)
Shawn, I still have concerns about the fact Storage uses BEGIN IMMEDIATE TRANSACTION on async statements execution, is there a reason to differ from DEFERRED that is used in the sync state?
IMMEDIATE means a RESERVED lock is setup immediately on the DB regardless when the database will be accessed or written to. DEFERRED does instead set a SHARED lock till first write happens then lock changes to RESERVED. Using DEFERRED would give us a larger safe time for reads, unless there is a specific reason IMMEDIATE is being used.
As I recall, it was because of this from the SQLite docs:
Because the acquisition of locks is deferred until they are needed, it is possible that another thread or process could create a separate transaction and write to the database after the BEGIN on the current thread has executed. 

But given that we don't have any way to determine if a statement is read-only or write-only, we may end up creating an immediate transaction for read statements, which would be undesirable.  We should probably switch that to a deferred transaction until we can do that.  Can you file a bug for that, and mark it blocking this?
Depends on: 580240
Filed bug 580240. Notice this involves only multiple statements executed at once through dbconn.executeAsync, not single async statements, we can probably evaluate splitting batches if they are locking for too long times.
For example this causes autocomplete to lock the database for read-only statements.
Also expiration and DBflush are using this method, I'll tweak the former as far as I can (will evaluate chunking statements if needed), and the second is going to be killed along with temp tables.
Depends on: 552023, 556631
(In reply to comment #41)
> For example this causes autocomplete to lock the database for read-only
> statements.
We can create another db connection for things like this which are read-only so there is only file lock contention (and we can use multiple readers) as noted in comment 22.

> Also expiration and DBflush are using this method, I'll tweak the former as far
> as I can (will evaluate chunking statements if needed), and the second is going
> to be killed along with temp tables.
Well, at least with DBFlush we did this to reduce the number of fsyncs which also hurt us on the main thread due to I/O we do on it still.  Expiration is likely in the same boat.
(In reply to comment #37)
> Another interesting question for folk seeing this:
> 1) Do you have Weave (Firefox Sync) installed?
> 2) If yes to (1), do you have more than one computer using it?

I see this without Weave installed, though Weave had been installed in the profile previously.
No Weave here.

Do sqlite maintainers know about API-exposed mutexes being held across I/O being implicated here? That is a flawed design, if I'm understanding things here. It is unnecessary and it can't scale, never mind blowing back as in this bug (again if this diagnosis is correct).

Mutexes may be implemented with competitive spinning on SMP systems. They are not ever for waiting indefinitely or even "too long". Any unbounded computation or input/output operation inside a critical section wants a condvar, and if a thread can't wait it will have to poll.

/be
(In reply to comment #37)
> Another interesting question for folk seeing this:
> 1) Do you have Weave (Firefox Sync) installed?

I do now, but it was not installed when I had previously reported and commented on this bug.
(In reply to comment #44)
> Do sqlite maintainers know about API-exposed mutexes being held across I/O
> being implicated here? That is a flawed design, if I'm understanding things
> here. It is unnecessary and it can't scale, never mind blowing back as in this
> bug (again if this diagnosis is correct).

Yes, at least since January when we discussed this exact issue in bug 524780.
Blocks: 560198
Depends on: 580228
Attached file Another hang
Don't know if you want more of these, in this case I wasn't doing anything in Firefox, just switched back to it to find it hung and it remained so for over a minute.
(In reply to comment #48)
> Created attachment 459997 [details]
> Another hang
> 
> Don't know if you want more of these, in this case I wasn't doing anything in
> Firefox, just switched back to it to find it hung and it remained so for over a
> minute.

Actually just noticed that the background thread in this case is the url classifier service, why would the annotation service be blocked on that?
(In reply to comment #49)
> Actually just noticed that the background thread in this case is the url
> classifier service, why would the annotation service be blocked on that?
There's another thread further down that is the async execution thread which I suspect is places.
when you capture the sample, could you check name of the annotation with the debugger from itemHasAnnotation or getItemAnnotation? identifying the caller would be interesting.
I've been using this component to track how long calls to annotation service takes, but they're all fast. And when the browser does hang, it's not annotation related for me. On startup I get..

1) 0ms getItemsWithAnnotation("livemark/feedURI") PlacesUtils.jsm:1092
2) 0ms addObserver([object]) nsMicrosummaryService.js:95
3) 0ms getItemsWithAnnotation("livemark/feedURI") nsLivemarkService.js:131
4) 0ms getItemsWithAnnotation("microsummary/generatorURI") nsMicrosummaryService:536

Any suggestions for other services to maybe track?

If you want to try out the component, drop the attached into your components directory and update browser.manifest with these 2 lines:

component {ee5a097b-6ee6-4c6a-8eca-6bce87d570e9} AnnoWatcher.js
contract @mozilla.org/browser/annotation-service;1 {ee5a097b-6ee6-4c6a-8eca-6bce87d570e9}
Blocks: 581978
Ed, there are many non-scriptable methods in the anno service, so it's possible your annotation watcher (which is cool) is missing some.  The stack in bug 581978 comment 10 enters the anno service through GetPageAnnotationString for instance.
Depends on: 582703
Error: 32248ms
getPageAnnotation([object XPCWrappedNative_NoHelper],"weave/guid")@:0
anno([object XPCWrappedNative_NoHelper],"weave/guid")@resource://services-sync/util.js:217
GUIDForUri([object XPCWrappedNative_NoHelper],true)@resource://services-sync/engines/history.js:62
HT_onVisit([object XPCWrappedNative_NoHelper],649125,1280579641520209,3054,649117,1,[object Object])@resource://services-sync/engines/history.js:277

Source File: file:///Applications/Firefox%20x.x%20-%20Minefield.app/Contents/MacOS/components/AnnoWatcher.js
Line: 32

Triggered by weave when getting a page/history annotation "weave/guid".

Seems like getting item annotations are very fast though -- there were a ton of these just /before/ that 32second delay:

Error: 0ms
getItemAnnotation(1164,"PlacesOrganizer/OrganizerQuery")@:0
anno(1164,"PlacesOrganizer/OrganizerQuery")@resource://services-sync/util.js:217
(1164,4,[object Array])@resource://services-sync/engines/bookmarks.js:1101
("MobileBookmarks")@resource://services-sync/engines/bookmarks.js:1101
_ensureMobileQuery()@resource://services-sync/engines/bookmarks.js:1111
BMT__ignore(1098)@resource://services-sync/engines/bookmarks.js:1055
BMT_onItemChanged(1098,"favicon",false,"https://mail.google.com/mail/images/favicon.ico",0,1)@resource://services-sync/engines/bookmarks.js:1134

Source File: file:///Applications/Firefox%20x.x%20-%20Minefield.app/Contents/MacOS/components/AnnoWatcher.js
Line: 32
(In reply to comment #54)
> Seems like getting item annotations are very fast though -- there were a ton of
> these just /before/ that 32second delay:
The common case may be fast, but the failure case makes us fall on our faces, skid about 100 feet, and then fall off a cliff without a parachute.  We need an async annotation API for sync at the very least stat, I think (maybe just an async guid API for now?).
I just restarted my minefield updating to 20100804 and after a bunch of fast annotation calls:

Error: 28952ms
getItemsWithAnnotation("microsummary/generatorURI")@:0
()@file:///Applications/Firefox%20x.x%20-%20Minefield.app/Contents/MacOS/components/nsMicrosummaryService.js:536
MSS__updateMicrosummaries()@file:///Applications/Firefox%20x.x%20-%20Minefield.app/Contents/MacOS/components/nsMicrosummaryService.js:182
MSS_notify([object XPCWrappedNative_NoHelper])@file:///Applications/Firefox%20x.x%20-%20Minefield.app/Contents/MacOS/components/nsMicrosummaryService.js:151

However, the very first thing in my log is also to a call to getItemsWithAnnotation and that returned instantly:

Error: 0ms
getItemsWithAnnotation("livemark/feedURI")@:0
PU_getMostRecentFolderForFeedURI([object XPCWrappedNative_NoHelper])@resource://gre/modules/PlacesUtils.jsm:1092
PSB_updateState()@chrome://browser/content/browser.js:2308
([object XPCWrappedNative_NoHelper],null,[object XPCWrappedNative_NoHelper])@chrome://browser/content/browser.js:7508
([object Object],0,[object Array])@chrome://browser/content/tabbrowser.xml:279
_callProgressListeners([object XULElement],"onLocationChange",[object Array],true,false)@chrome://browser/content/tabbrowser.xml:276
updateCurrentBrowser()@chrome://browser/content/tabbrowser.xml:759
onselect([object Event])@chrome://browser/content/browser.xul:1
@:0
set_selectedIndex(7)@chrome://global/content/bindings/tabbox.xml:645
set_selectedPanel([object XULElement])@chrome://global/content/bindings/tabbox.xml:664
set_selectedIndex(7)@chrome://global/content/bindings/tabbox.xml:394
set_selectedItem([object XULElement])@chrome://global/content/bindings/tabbox.xml:426
set_selectedTab([object XULElement])@chrome://global/content/bindings/tabbox.xml:106
set_selectedTab([object XULElement])@chrome://browser/content/tabbrowser.xml:1720
sss_restoreHistoryPrecursor([object ChromeWindow],[object Array],[object Array],7,0,0)@file:///Applications/Firefox%20x.x%20-%20Minefield.app/Contents/MacOS/components/nsSessionStore.js:2029
sss_restoreWindow([object ChromeWindow],[object Object],true)@file:///Applications/Firefox%20x.x%20-%20Minefield.app/Contents/MacOS/components/nsSessionStore.js:1923
sss_onLoad([object ChromeWindow])@file:///Applications/Firefox%20x.x%20-%20Minefield.app/Contents/MacOS/components/nsSessionStore.js:598
sss_init([object ChromeWindow])@file:///Applications/Firefox%20x.x%20-%20Minefield.app/Contents/MacOS/components/nsSessionStore.js:305
init([object ChromeWindow])@:0
delayedStartup(false,false,134)@chrome://browser/content/browser.js:4787
@:0
hm, both microsummaries and livemark services are caching items and not hitting the db more than once.
the livemarks stuff probably comes from getMostRecentFolderForFeedURI that is called to update the Star button status, it will query more than once till livemarks service is fully initialized.
getItemsWithAnnotation is accessing only moz_items_annos and moz_annos, so something must be locking moz_items_annos on the other thread, but I can't find any obvious culprit, so there must some global lock on all the database.
Edward how much time was elapsed from browser start? Did this happen almost immediately after startup?
Ideally WAL should help these cases since any reader can run while the db is being written using a snapshot of the data.
(In reply to comment #58)
> Edward how much time was elapsed from browser start? Did this happen almost
> immediately after startup?
After restarting, I was able to type in the location bar and see some results, but typing some more resulted in a beachball. So probably about 10 seconds after the browser loaded?
Depends on: 573492
Depends on: 584927
Depends on: 546253
Whiteboard: [needs dependencies]
Depends on: 591072
No longer depends on: 591072
Hot to get the statement running on the async thread when you see the hang:
1) In gdb, get to the async thread, and get into the stack from for mozilla::storage::AsyncExecuteStatements::executeStatement.
2) call sqlite3_sql(aStatement)

This method returns a const char * of the SQL string, and we can then identify which statement is running in the background.
No longer depends on: 582703
Keywords: helpwanted
Whiteboard: [needs dependencies] → [needs dependencies][see comment 60]
Unassigning since this is mostly meta and investigated by the whole team on the branch, Shawn will also post more details later on the hang.
Assignee: mak77 → nobody
Status: ASSIGNED → NEW
That's actually over in bug 582703 comment 12.
Why are we blocking on this meta bug?
This is not a meta bug. It does have dependent bugs.
(In reply to comment #64)
> This is not a meta bug. It does have dependent bugs.
Although it should be fixed by those dependent bugs.  We'll have usable nightlies on the Places branch that people can evaluate once bug 616001 [needs review] and bug 617761 [needs owner] are finished up.
So this meta bug is fixed?
Ideally it is since all reported causes have been fixed so far. If anyone of those that were able to reproduce the hangs could tell if they can still reproduce anything or things improved, we'd like to know.
I think Dolske reproduced a hang due to a missing call to finalize, but I can't find the filed bug.  Shawn or Dolke could you add it to dependencies please?
(In reply to comment #68)
> I think Dolske reproduced a hang due to a missing call to finalize, but I can't
> find the filed bug.  Shawn or Dolke could you add it to dependencies please?
I asked dolske to file it, not sure if he did though.
Closing this. Re-open if there are new and related hangs reported.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: