Closed Bug 583821 Opened 9 years ago Closed 9 years ago

Firefox Sync full syncs are really slow after SQLite3.7.0 upgrade

Categories

(Toolkit :: Storage, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- beta4+

People

(Reporter: dholbert, Unassigned)

References

Details

(Keywords: regression, relnote)

STEPS TO REPRODUCE:
 0. Be using a profile that has lots of history (> 5000 entries) and is set up with latest Firefox Sync "experimental version" (I'm using 1.4.3)
 1. Start Firefox
 2. Tools menu | Sync | Preferences
 3. (optional) For simplicity, tell Sync to just synchronize your history (Minefield will: Use my custom settings" in lowest drop-down menu
 4. Click "Manage Account" and then "Reset Sync"
 5. Choose the bottom option, "Replace all other devices with your local data"
 6. Click Next, Next, and then trigger a sync.

ACTUAL RESULTS: The sync hangs for at least a minute on my machine, apparently while querying my local history.

REGRESSION RANGE (using .txt files in nightly download dirs):
20100723031340 58101a16aff7
20100724030057 30239e4cebd8

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=58101a16aff7&tochange=30239e4cebd8

This includes both a places change related to history (Bug 580228) as well as a sqlite upgrade (Bug 581000).  Filing in Places right now, but it may very well be a sqlite/storage issue.
Note that Firefox 4.0 b2 isn't affected by this -- IMHO this should block 4.0b3, or else it'll be a major regression from one beta to the next for Weave users.
blocking2.0: --- → ?
A few clarifications on comment 0:
 - I believe this affects all "full" syncs - I initially discovered it without any "Reset Sync" required. I only include that in my STR because that's a surefire way to get a full sync, and also because it's what I did to find the regression range.
 - The hang here locks up the UI and spikes the CPU. (whereas in nightlies before this regressed, the CPU usage goes up, but the UI remains responsive.  That'd be the 'expected behavior', I guess.)
What operating system are you on?
I've seen this on both Ubuntu 10.04 & 10.10a2.

This also might be a dupe of (or related to) Bug 563538. (that's another hang that's marked as blocking Bug 580228, which is the history/places change in this bug's regression range)

FWIW, here's the chunk of about:sync-log around the hang (notice that 1:16 elapses before we figure out "5000 outgoing items pre-reconciliation" -- that's the hang):
>2010-08-02 11:57:15     Engine.History       DEBUG   Resetting history last sync time
>2010-08-02 11:57:15     Net.Resource         DEBUG   PUT Length: 563
>2010-08-02 11:57:15     Net.Resource         DEBUG   PUT success 200 https://phx-sync064.services.mozilla.com/1.0/dholbert/storage/crypto/history
>2010-08-02 11:57:15     Engine.History       DEBUG   First sync, uploading all items
>2010-08-02 11:58:31     Engine.History       INFO    5000 outgoing items pre-reconciliation
>2010-08-02 11:58:31     Collection           DEBUG   GET success 200 https://phx-sync064.services.mozilla.com/1.0/dholbert/storage/history?full=1
(In reply to comment #4)
> FWIW, here's the chunk of about:sync-log around the hang (notice that 1:16
> elapses before we figure out "5000 outgoing items pre-reconciliation" -- that's
> the hang):
> >2010-08-02 11:57:15     Engine.History       DEBUG   Resetting history last sync time
> >2010-08-02 11:57:15     Net.Resource         DEBUG   PUT Length: 563
> >2010-08-02 11:57:15     Net.Resource         DEBUG   PUT success 200 https://phx-sync064.services.mozilla.com/1.0/dholbert/storage/crypto/history
> >2010-08-02 11:57:15     Engine.History       DEBUG   First sync, uploading all items
> >2010-08-02 11:58:31     Engine.History       INFO    5000 outgoing items pre-reconciliation
> >2010-08-02 11:58:31     Collection           DEBUG   GET success 200 https://phx-sync064.services.mozilla.com/1.0/dholbert/storage/history?full=1

Mardak - does this tell you anything useful?
(In reply to comment #4)
> This also might be a dupe of (or related to) Bug 563538. (that's another hang
> that's marked as blocking Bug 580228, which is the history/places change in
> this bug's regression range)
Note that that bug has only been seen by people on OS X so far.  Knowing that you are on Linux and seeing it would be interesting.
Assuming this is latest-weave, it's doing a query SELECT url FROM moz_places_view:

http://hg.mozilla.org/services/fx-sync/file/eaffcccd1744/services/sync/modules/engines/history.js#l137

Older versions used history query api and that was locking up the browser, so we switched to sql.
So that change was made in 1.4a1pre (if I'm reading things correctly) which means dholbert was running that code.

Is now a bad time to point out to you that queries against the view, especially with a WHERE clause are going to be *very* slow?  As in, create a new temporary table containing the results of the view query, and then querying that (which has no indexes).  If you can avoid it, you really /really/ *really* should.
I should also add that it's possible that the SQLite upgrade regressed querying the view, but we've had a policy in places that we NEVER select from it because it has bad performance.  I'd hate to have to start caring about view performance because weave uses it (not to mention that it's likely going away in bug 552023).
So while this may have been a result of the SQLite upgrade, queries on the view have always been slow; so much so that places specifically forbids using them.  Since we are turning on the sync UI in b3 (plan at least), this should block beta 4.
blocking2.0: ? → beta4+
Component: Places → Sync
Keywords: relnote
Product: Toolkit → Weave
QA Contact: places → sync
Version: Trunk → unspecified
flagging for tests when a fix lands
Flags: in-testsuite?
(Note that this bug is now marked as blocking beta4+ even though it doesn't show up in the UI that way.  It does show up in queries for beta 4 blockers.)
(In reply to comment #10)
> So while this may have been a result of the SQLite upgrade

I'm marking this as a regression from that bug (bug 581000), for bookkeeping's sake (even though it sounds like the underlying issue is that Sync shouldn't have been using views in the first place).

Also: the SQLite upgrade bug was just backed out for investigation of issues unrelated to this bug.  If that backout lasts through the b3 tagging, then this bug won't affect the 4.0b3 release after all.
(In reply to comment #13)
> Also: the SQLite upgrade bug was just backed out for investigation of issues
> unrelated to this bug.  If that backout lasts through the b3 tagging, then this
> bug won't affect the 4.0b3 release after all.
Assuming that SQLite was the cause of this problem.  And it will not be relanding in time for b3.
(In reply to comment #14)
> Assuming that SQLite was the cause of this problem.

FWIW, I just confirmed that it was indeed the cause, by testing an m-c build from just after the sqlite-backout vs. an m-c build with the backout stripped off.  (The backout build was fine, and the sans-backout build was affected.)
Summary: Firefox Sync full syncs are really slow since 201007240 m-c nightly → Firefox Sync full syncs are really slow after SQLite3.7.0 upgrade
(In reply to comment #10)
> So while this may have been a result of the SQLite upgrade, queries on the view
> have always been slow; so much so that places specifically forbids using them. 
> Since we are turning on the sync UI in b3 (plan at least), this should block
> beta 4.

Okay, so what's the recommended alternative to querying the view?  It's not fast, but it's correct, data-wise, and doesn't cause hangs.
(In reply to comment #16)
> Okay, so what's the recommended alternative to querying the view?  It's not
> fast, but it's correct, data-wise, and doesn't cause hangs.

https://wiki.mozilla.org/Places/Places_SQL_queries_best_practices

or wait for me to kill temp tables... I'm still fighting a couple oranges though.
(In reply to comment #17)
> or wait for me to kill temp tables... I'm still fighting a couple oranges
> though.
Except that that won't help them exactly because they still support Firefox 3.5 and Firefox 3.6.
(In reply to comment #17)
> https://wiki.mozilla.org/Places/Places_SQL_queries_best_practices

That says not to query the view tables, except that querying the actual tables caused bug 575423. Of course the fix doesn't have a test so there's no way to tell for sure whether it actually solves the problem at hand. :/

(In reply to comment #18)
> Except that that won't help them exactly because they still support Firefox 3.5
> and Firefox 3.6.

We could have conditional code so that we do the right thing on Firefox 3.5/3.6 and 4.
(In reply to comment #19)
> (In reply to comment #17)
> > https://wiki.mozilla.org/Places/Places_SQL_queries_best_practices
> 
> That says not to query the view tables

and 5 rows below it says "When using SELECT we must always be sure to query both tables"[...]"To query both tables we can use a special SQL construct"[...]Example.
Hey look, the blocking flag cannot be changed.  Is this fixed now that weave is not querying on the view?
Is 3.7.0 landed?
3.7.0.1 landed.
see bug 581000 for 3.7.0.1
Assignee: nobody → mitcho
Oops, wrong tab! Reassigning to mconnor to resolve for b4; I think we backed out SQLite, though.
Assignee: mitcho → mconnor
(In reply to comment #25)
> Oops, wrong tab! Reassigning to mconnor to resolve for b4; I think we backed
> out SQLite, though.
And then landed 3.7.0.1 though.  I believe that Sync not querying on the views (bug 583852) fixed this, but somebody should confirm that (comment 21)
I'll try to confirm. brb.
Nope, still broken on a full initial sync (using STR from comment 0). Hangs on the same step as before:
>2010-08-16 14:49:55    Engine.History       DEBUG    First sync, uploading all items
>2010-08-16 14:51:20    Engine.History       INFO    5000 outgoing items pre-reconciliation

That's almost 90-seconds of browser-hanging pain.

I'm using Sync 1.4.4 with today's nightly:
Mozilla/5.0 (X11; Linux x86_64; rv:2.0b4pre) Gecko/20100816 Minefield/4.0b4pre
(In reply to comment #28)
> I'm using Sync 1.4.4 with today's nightly

FWIW, I tried reproducing with 1.4.4 & an old nightly shortly after this regressed (20100727), but my sync failed in that case (retried a few times), with errors like:
2010-08-16 14:48:37	Engine.Bookmarks     DEBUG	Total (ms): sync 4377, processIncoming 35, uploadOutgoing 1310, syncStartup 3025, syncFinish 1, createRecord 398, resetLastSync 0, resetClient 1
2010-08-16 14:48:37	RecordMgr            DEBUG	Failed to import record: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIHttpChannel.requestMethod] Stack trace: Res__request()@resource://services-sync/resource.js:205 < Res_get()@resource://services-sync/resource.js:293 < RecordMgr_import()@resource://services-sync/base_records/wbo.js:119 < RecordMgr_get()@resource://services-sync/base_records/wbo.js:142 < SyncEngine__syncStartup()@resource://services-sync/engines.js:352 < anonymous()@resource://services-sync/engines.js:196 < SyncEngine__sync()@resource://services-sync/engines.js:744 < anonymous()@resource://services-sync/engines.js:196 < WrappedNotify()@resource://services-sync/util.js:115 < Engine_sync()@resource://services-sync/engines.js:206 < WeaveSvc__syncEngine()@resource://services-sync/service.js:1455 < anonymous()@resource://services-sync/service.js:1404 < WrappedNotify()@resource://services-sync/util.js:115 < WrappedLock()@resource://services-sync/util.js:87 < WrappedCatch()@resource://services-sync/util.js:66 < sync()@resource://services-sync/service.js:1289 < anonymous()@sync.js:326

This is all with my "normal" browsing profile.  I wonder if the error is due to a recent change in places that makes an older nightly barf?

In any case, though, the hanging persists in today's nightly with the most recent "beta channel" version of weave, as noted in comment 28.
I wonder if turning off automatic_index would help...
http://www.sqlite.org/pragma.html#pragma_automatic_index

(shooting in the dark here)
sdwilsh/mconnor asked me to try "stock" mozilla-central, without the Sync addon installed, to see if that helps.  It doesn't. :(
Okay, so it's not views, since everywhere has that fixed.  This looks like a SQLite regression at this point.
I guess our approach here for the beta is to back out the SQLite upgrade.  Can I get some commitment from the Sync team to help me identify the cause of the regression *this week* so I can report it to the SQLite folks so it can hopefully get fixed in time for 3.7.1?
Whatever help you need from us you'll get.
(and if this doesn't get backed out by 8 PM PDT, I guess I can do that)
SQLite ticket http://www.sqlite.org/src/info/13f033c865 describes a performance issue in the query planner that was introduced in version 3.7.0 but which has now been fixed.  Could this be your problem?  A pre-release snapshot of SQLite 3.7.1 is available at http://www.sqlite.org/draft/download.html and the performance regression of ticket [13f033c865] is fixed in that snapshot.  Can yall do a test build using the snapshot and see if that clears your problem?
(d'oh, accidentally posted early)  sdwilsh: I'd be happy to test a tryserver / custom-patched build with the beta sqlite upgrade that Dr H. referenced above, if you can let me know what I need to do to upgrade.
I'm making a local build with the sqlite code from comment 36 and a version-rev in configure.in (don't know if that's required), to match the patches in bug 581000...
Yay - that upgrade fixes the issue, in my local build!

So, IIUC, all SQLite versions 3.7.0* are affected by this, and 3.7.1 isn't release-ready yet -- should we back out the 3.7.0.1 upgrade and wait until 3.7.1 is ready?
FWIW, we are aiming for a 3.7.1 release early next week - Monday or Tuesday - 2010-08-23 or 2010-09-24.
Assignee: mconnor → nobody
Component: Sync → Storage
Product: Weave → Toolkit
QA Contact: sync → storage
Version: unspecified → Trunk
sdwilsh: backout sqlite or go to beta with the slowness; I think we want to back it out, right?
beltzner: sounds like it, based on sdwilsh's comment 33 & 34.

Sounds like sdwilsh will be AFK until 8pm or later based on comment 34.  In the interests of not clashing with other blocker-landings later tonight, I'll go ahead and back out bug 581000 right now.
Backed out the sqlite upgrade. (see bug 581000 comment 31)  We're back to version 3.6.23.1 for now.

Should I resolve this as FIXED by the backout? (with the understanding that the next sqlite version to land will be 3.7.1-based?)
Yes, that seems appropriate.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to comment #42)
> sdwilsh: backout sqlite or go to beta with the slowness; I think we want to
> back it out, right?
Yes, backout is the right thing to do here.  I couldn't get to it until about now, so thanks dholbert for doing this!

dholbert - did your own trunk build without the test SQLite show the same issue?  Just making sure we've covered all our bases here :)
Yes, the trunk build from comment 40 hits this bug if I remove the 3.7.1 SQLite prerelease.
Awesome!  Thanks a ton!
You need to log in before you can comment on or make changes to this bug.