Closed Bug 606062 Opened 14 years ago Closed 14 years ago

Synchronous IO from sqlite causing very long UI pauses on fennec

Categories

(Toolkit :: Storage, defect)

ARM
Android
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 606353
Tracking Status
fennec 2.0b3+ ---

People

(Reporter: cjones, Unassigned)

References

Details

(Keywords: relnote)

Attachments

(5 files)

This is a different issue that bug 601994, based on my understanding of that bug.

STR
 (1) Rebuild fennec
 (2) adb install -r $objdir/dist/fennec-*.apk (rebooting phone seems to work too)
 (3) Start fennec
 (4) Wait for weave/sync to kick in or "Sync Now"
 (5) Return to content area, pan around constantly

While syncing, weave/sync locks up the UI for seconds at a time on my galaxy s.  One hypothesis for this was weave/sync chewing a bunch of CPU and starving the main (gecko) thread.  To test this, I named the gecko main thread gecko.Main, the weave/sync thread gecko.Sync, and other threads gecko.Background.  Then I ran |top -m 20 -d 1 -s cpu -t| during a "Sync Now" until I got the pausing, and killed top after the pausing stopped.  The attached file has those results.

The data is very poor; 9 samples over 9 seconds.  But, both main and sync threads are getting fair amounts of CPU until sample 5, when both drop off.  Both Sync and Main come back in sample 7.  This matches what I experienced; about 2-3 seconds of unresponsiveness after "Sync Now".

Because neither the Sync nor Main thread ran during this time (even though I panned around continuously), I'm led to believe there's either synchronous IO on the main thread or a synchronous proxy from main-->sync thread and synchronous IO on the sync thread.
tracking-fennec: --- → ?
Damn, forgot to attach this to comment 0.
(In reply to comment #1)
> Created attachment 484934 [details]
> Samples from top during syncing
> 
> Damn, forgot to attach this to comment 0.

To be clear, here's what I did from just before this trace started up until its end

 0. Went to settings screen, clicked "Sync Now"
 0.1 Pressed back key to return to content screen
 0.2 Started panning around continuously, at first it's reponsive
--- at some point the gecko.Sync thread first appears in top; that's sample (1) in the attachment ---
 1. The UI jammed up
--- after a few seconds ---
 2. The UI started responding again
 3. Interrupted top (Ctrl-C); that's around sample (9) in the attachment
I can also reproduce this by force-killing, restarting, then doing "Sync Now" and panning around.
Very suspicious-looking are


Thread 12 (Thread 4095):
#0  0xafe0cb9c in fsync () from libc.so
#1  0x8202ac76 in full_fsync (id=0x487ea8, flags=2) at /home/cjones/mozilla/mozilla-central/db/sqlite3/src/sqlite3.c:25510
#2  unixSync (id=0x487ea8, flags=2) at /home/cjones/mozilla/mozilla-central/db/sqlite3/src/sqlite3.c:25558
#3  0x82005698 in sqlite3OsSync (id=0x0, flags=2) at /home/cjones/mozilla
[snip]
#12 0x837fdd38 in mozilla::storage::Connection::ExecuteSimpleSQL (this=<value optimized out>, aSQLStatement=<value optimized out>) at /home/cjones/mozilla/mozilla-central/storage/src/mozStorageConnection.cpp:846


Thread 2 (Thread 4085):
#0  0xafe0d9c4 in __futex_wait () from libc.so
#1  0xafe0f204 in pthread_mutex_lock () from libc.so
#2  0x82028a1a in pthreadMutexEnter (p=0xfffffe00) at /home/cjones/mozilla/mozilla-central/db/sqlite3/src/sqlite3.c:16637
#3  0x820057cc in sqlite3_mutex_enter (p=0xfffffe00) at /home/cjones/mozilla/mozilla-central/db/sqlite3/src/sqlite3.c:15877
#4  0x837fda58 in lock (this=0x4f8168e0, aTransactionType=0) at /home/cjones/mozilla/mozilla-central/storage/src/SQLiteMutex.h:92
#5  SQLiteMutexAutoLock (this=0x4f8168e0, aTransactionType=0) at /home/cjones/mozilla/mozilla-central/storage/src/SQLiteMutex.h:170
#6  mozilla::storage::Connection::BeginTransactionAs (this=0x4f8168e0, aTransactionType=0) at /home/cjones/mozilla/mozilla-central/storage/src/mozStorageConnection.cpp:912
#7  0x838165f0 in mozStorageTransaction::mozStorageTransaction (this=0x51ca5a70, aConnection=<value optimized out>, aCommitOnComplete=0, aType=<value optimized out>) at ../../../../dist/include/mozStorageHelper.h:77
#8  0x83817898 in nsNavHistory::BeginUpdateBatch (this=0x4f49b300) at /home/cjones/mozilla/mozilla-central/toolkit/components/places/src/nsNavHistory.cpp:4373
#9  0x83817912 in UpdateBatchScoper (this=0x4f49b300, aCallback=0x51ca5a50, aUserData=0x0) at /home/cjones/mozilla/mozilla-central/toolkit/components/places/src/nsNavHistory.cpp:336
#10 nsNavHistory::RunInBatchMode (this=0x4f49b300, aCallback=0x51ca5a50, aUserData=0x0) at /home/cjones/mozilla/mozilla-central/toolkit/components/places/src/nsNavHistory.cpp:4406
#11 0x83966e34 in NS_InvokeByIndex_P (that=0x4f49b308, methodIndex=<value optimized out>, paramCount=<value optimized out>, params=<value optimized out>) at /home/cjones/mozilla/mozilla-central/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:190
[snip bunch of JS I can't get a JS backtrace for]
#59 0x833e4efe in nsButtonBoxFrame::DoMouseClick (this=0x5064cb48, aEvent=0x4d8ed5a0, aTrustEvent=<value optimized out>) at /home/cjones/mozilla/mozilla-central/layout/xul/base/src/nsButtonBoxFrame.cpp:172
#60 0x833df4ca in nsButtonBoxFrame::MouseClicked (this=0x4f49b300, aPresContext=<value optimized out>, aEvent=<value optimized out>) at /home/cjones/mozilla/mozilla-central/layout/xul/base/src/nsButtonBoxFrame.h:64
[snip]
77 0x83840306 in nsWindow::OnGlobalAndroidEvent (ae=0x518e6f00) at /home/cjones/mozilla/mozilla-central/widget/src/android/nsWindow.cpp:708
#78 0x8383bff4 in nsAppShell::ProcessNextNativeEvent (this=0x4f199a80, mayWait=<value optimized out>) at /home/cjones/mozilla/mozilla-central/widget/src/android/nsAppShell.cpp:290
#79 0x838410ce in nsBaseAppShell::DoProcessNextNativeEvent (this=0x19d, mayWait=528) at /home/cjones/mozilla/mozilla-central/widget/src/xpwidgets/nsBaseAppShell.cpp:161
#80 0x8384129e in nsBaseAppShell::OnProcessNextEvent (this=0x4f199a80, thr=0x4d20c740, mayWait=0, recursionDepth=<value optimized out>) at /home/cjones/mozilla/mozilla-central/widget/src/xpwidgets/nsBaseAppShell.cpp:317
#81 0x838a0062 in mozilla::dom::ContentParent::OnProcessNextEvent (this=<value optimized out>, thread=0x210, mayWait=0, recursionDepth=0) at /home/cjones/mozilla/mozilla-central/dom/ipc/ContentParent.cpp:560
[snip]


(Thread 2 is the main Gecko thread.)   Looks like the main thread might be blocked on a mutex held by this SQL thread while it fsync()s.  That this happens after "Sync Now" might be a total red herring.
Shawn, does mozilla::storage::Connection::ExecuteSimpleSQL run with the sqlite mutex?  (I'm about to poke, hopefully you can beat me.)
Sure does.

Thread 12 (Thread 4095):
#13 0x837fda0e in mozilla::storage::Connection::CommitTransaction (this=0x4f8168e0) at /home/cjones/mozilla/mozilla-central/storage/src/mozStorageConnection.cpp:944

NS_IMETHODIMP
Connection::CommitTransaction()
{
...
  SQLiteMutexAutoLock lockedScope(sharedDBMutex);

I'll see if this is what's consistently happening on the hangs and move it to (places?) if so.
Component: Firefox Sync: Backend → Storage
Product: Mozilla Services → Toolkit
QA Contact: sync-backend → storage
Summary: Sync/Weave causes very long UI pauses on fennec → Synchronous IO from sqlite causing very long UI pauses on fennec
Attached file Backtrace 2
This shows us doing blocking IO on the main thread from underneath JS.
Attached file Backtraces 3
This again shows a SQL thread doing blocking IO with a mutex, while the main thread waits on that mutex.  That (obviously) in effect makes the main thread do blocking IO as well.
Something sync/weave is doing sure seems to be exacerbating the underlying problem (maybe sync/weave is just giving storage more to do?).  At any rate, I can't repro using steps in comment 4 unless I "Sync Now".  Tried a few other things without "success".
This isn't really a storage bug since storage cannot control what consumers do.  This is very likely a places and/or weave bug (likely a bad interaction with both).  We could even argue that this is a duplicate of bug 563538 if it's involving places.

(In reply to comment #6)
> Shawn, does mozilla::storage::Connection::ExecuteSimpleSQL run with the sqlite
> mutex?  (I'm about to poke, hopefully you can beat me.)
Any SQLite operation runs with the SQLite mutex (either by us grabbing it because we need more atomicity, or SQLite grabbing it for any operation it does).  This is why I advocate consumers of storage to *only* use the asynchronous API so blocking will only happen on background threads.

(In reply to comment #7)
> Sure does.
> 
> Thread 12 (Thread 4095):
> #13 0x837fda0e in mozilla::storage::Connection::CommitTransaction
I'll note that CommitTransaction has nothing to do with ExecuteSimpleSQL, but as stated above, both will acquire the mutex.
Sync, especially initial sync, will add a lot of stuff to places, so I can totally see this being a dupe of bug 563538.  Initial sync is up to 5k URLs, so there's a lot of history to come down, plus whatever bookmark count a user has.  If Places is deadlocking here, it's possible we can be even lazier (though we do chunks of 50-150 at once, which has other side effects).
How can we make progress on this?  The UI freezing for several seconds at a time is bad enough on desktop, and worse on mobile because of the physical-feedback UI paradigm.
(In reply to comment #12)
> Sync, especially initial sync, will add a lot of stuff to places, so I can
> totally see this being a dupe of bug 563538.  Initial sync is up to 5k URLs, so
> there's a lot of history to come down, plus whatever bookmark count a user has.
>  If Places is deadlocking here, it's possible we can be even lazier (though we
> do chunks of 50-150 at once, which has other side effects).
Part of the problem is that we have no script-exposed async API to add history or bookmarks.  Bookmarks API is bug 519514.  Nothing for the history side of things on file AFAIK.
So I assume the path forward is:

* expose async-add APIs for adding bookmarks/history (and ideally getting annos: bug 584927, but I don't think that's what we're hitting here) 
* call those from sync

Maybe that's oversimplifying?
(In reply to comment #15)
> So I assume the path forward is:
> 
> * expose async-add APIs for adding bookmarks/history (and ideally getting
> annos: bug 584927, but I don't think that's what we're hitting here) 
> * call those from sync
> 
> Maybe that's oversimplifying?
No, I think that's about right.  Firefox itself is now adding all history asynchronously, and bookmarking doesn't happen often enough that we've cared about it yet (but fennec does, so they will, in theory, help out there).  It's possible you are hitting the annotation stuff here too because weave relies on guids for everything, which is stored as an annotation.
Depends on: 606353
Looks like Sync is working around the lack of an async annotation api in bug 584927.  I've detailed what needs to be done for an async bookmarking API in bug 519514, and filed bug 606966 for an async history API exposed to JS.
We're tracking the "use async APIs in Sync" stuff from bug 606353l, which links those bugs together.
tracking-fennec: ? → 2.0b3+
tracking-fennec: 2.0b3+ → 2.0+
tracking-fennec: 2.0+ → 2.0b3+
Shawn - can you assign this to someone?
(In reply to comment #19)
> Shawn - can you assign this to someone?
This bug is kinda meta so I'm not sure that is worthwhile.
Keywords: relnote
OK, we can remove this bug from blocking, but I want to know the specific bugs for the fixes so we can block on them.
(In reply to comment #21)
> OK, we can remove this bug from blocking, but I want to know the specific bugs
> for the fixes so we can block on them.
I believe the dependency tree is pretty accurate for "what we know" as of now.
Is keeping this bug open useful for anyone?
this is really a combination of 519514 and 606353, right?
Well, this started as a diagnostic bug, then evolved into a tracker-of-a-tracker (bug 606353).  The only reason I can see for keeping it open is for relnotes or other similar stuff.
lets hope 606353 gets fixed then we will not need a tracker.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: