Closed Bug 660036 Opened 10 years ago Closed 10 years ago
.sqlite-wal size grows more in Firefox 5 than in Firefox 4
173.84 KB, patch
|Details | Diff | Splinter Review|
28.16 KB, patch
|Details | Diff | Splinter Review|
33.67 KB, patch
|Details | Diff | Splinter Review|
3.90 KB, patch
|Details | Diff | Splinter Review|
Bug 625981 is a known bug that causes the places.sqlite-wal file in the Firefox profile to grow much larger than intended. The rate or amount of this growth seems to have regressed since Firefox 4. On my T-Mobile G2 (stock Android 2.2), the WAL file is <10MB after initial sync in Firefox 4.0.1, and >300MB after initial sync in both Beta (5.0b2) and Nightly (7.0a1). All tests were done starting with new empty profiles, using the same sync account. Michael Verdi reported the same regression in bug comment 80, and we've gotten similar reports from Firefox Beta users. (The problem is not 100% reproducible, though. On a Motorola Xoom, I don't see large WAL sizes in any of Firefox, Beta, Aurora, or Nightly builds connected to sync.) Nominating for tracking-firefox5 because this is a major regression for mobile users.
I suspect this regression was introduced in Sync when we removed a bunch of Firefox 3.5/3.6 compatibility code (bug 648338) and also took out the runInBatchMode call for history, since on Firefox 4+ we use mozIAsyncHistory. This did have an effect on the number of Places transactions, going up from 1 single transaction for the whole history sync to up to 100 (we download up to 5000 history items and apply them in batches of 50). The symptom probably arises from the fact that Sync is also doing some reads against the Places DB in between batches. Quoting Richard Hipp from bug 625981 comment 35: After a successful checkpoint, the WAL is reset back to a small size. But a checkpoint cannot run to completion if there is an active concurrent reader, ... Matt, I can round up a patch that brings back the runInBatchMode call for history, if you want to try that out to see whether it fixes the problem?
Here's a patch that restores the batch mode for history sync. Try build: http://tbpl.mozilla.org/?tree=Try&rev=144e9490430a
Whiteboard: patch is in sync but the disproportionate negative impact is on mobile.
(In reply to comment #1) > Matt, I can round up a patch that brings back the runInBatchMode call for > history, if you want to try that out to see whether it fixes the problem? Yes, that would be great. If you can upload the patch, I can build and test it.
(In reply to comment #3) > If you can upload the patch, I can build and test it. *turns off time machine* DONE! See comment 2 :)
Another change between Firefox 4 and 5 was bug 630539 (upgrade to SQLite 3.7.5). After downgrading to SQLite 3.7.4 locally (backing out the patches from bug 630539), I no longer see huge WAL file growth during initial sync from a new profile. Details: I produced two builds from mozilla-central tip, one with the backout and one without, and no other differences. I wiped out the profile/data directory, started Fennec, then connected to my sync account. I ran all of these steps twice with each build, and got consistent results.
Component: General → Storage
Product: Fennec → Toolkit
QA Contact: general → storage
This patch fixes the regression for me by backing out bug 630539. I think we want this on beta and aurora unless there are negative consequences I'm not aware of. This downgrades us to SQLite 3.7.4 which is the version that shipped in Firefox 4. We should also test with SQLite 3.7.6 (see bug 649362) or the upcoming 3.7.7 to see if the problem is still present with those versions.
Do you have results on Aurora too or are just guessing it must be the same? A lot changed there and I'd really like to see if it differs from Beta or has the same exact behavior. Backing out SQLite from Aurora means also backing out bug 658135 that may have an opposite good effect.
(In reply to comment #6) > We should also test with SQLite 3.7.6 (see bug 649362) or the upcoming 3.7.7 > to see if the problem is still present with those versions. A mozilla-central build with SQLite 3.7.6 (using the patch from bug 649362) shows the regressed behavior (>300MB WAL file a few seconds after sync starts). (In reply to comment #7) > Do you have results on Aurora too or are just guessing it must be the same? Yes, I have tested current Nightly, Aurora, Beta, and release builds, and I see the same problematic behavior on Nightly, Aurora, and Beta. I have not tried applying the backout to Aurora specifically, but I was testing the backout on a mozilla-central revision from yesterday which differs from Aurora by only one days' worth of changes.
At Marco's suggestion, I tried adding a call to ForceWALCheckpoint(dbConn); before the dispatch of TOPIC_UPDATEPLACES_COMPLETE in History::UpdatePlaces. This did *not* fix the problem - with this change, I still see the same >300MB WAL file growth during initial sync.
If we downgrade SQLite, we need to back out bug 658135 because it uses sqlite3_stmt_readonly which had a known bug in SQLite 3.7.4. Note: I am working with Richard Hipp to get him some log data to help find the cause of this regression.
Attachment #535448 - Attachment is obsolete: true
At drh's request, I have built Fennec with the test_vfstrace shim enabled. I cannot reproduce this problem on desktop using the steps in this bug (though I have seen on desktop before during regular use), so this patch is for Android and uses the Android logging system. The patch also depends on bug SQLite3.7.6. To view the resulting logs with thread IDs included, build Fennec with this patch, install it on your phone, set up USB debugging, and run this command (requires the "adb" tool from the Android SDK): adb logcat -v thread *:S sqlite:I
This log shows sqlite3 activity as I follow the steps to reproduce this bug. The action that triggers the bug happens at 10:55:33.021: http://people.mozilla.com/~mbrubeck/sqlite-log-660036.txt
Thanks for the log. It appears that on line 31932 of the log (timestamp 05-27 10:55:33.301) that somebody is starting a write transaction, then then holding that transaction open for the remainder of the log (spanning 4 minutes or so). As far as I can tell, that transaction never commits, and so the WAL file just keeps growing and growing.... Is there, perhaps, someway to use sqlite3_trace() (http://www.sqlite.org/c3ref/profile.html) to write the SQL that is being evaluated to the same log file as the vfstrace output, so that we can perhaps correlate the start of the transaction with a specific thread inside of FF?
This log includes statement traces. The action to start sync is at 12:34:44.486: http://people.mozilla.com/~mbrubeck/sqlite-log-660036-trace.txt
The new trace shows that the long-running transaction, the one that makes the WAL grow so large, was started with a "BEGIN DEFERRED" and that the first write was by "INSERT INTO moz_bookmarks (id, fk, type, parent, position, title, folder_type, dateAdded, lastModified, guid) VALUES (NULL, 5, 1, 6, 5, 'Firefox Start', NULL, 1306524884899553, 1306524884899553, GENERATE_GUID())". Is that any help in tracing down what is holding the transaction open inside of FF?
(In reply to comment #15) > The new trace shows that the long-running transaction, the one that makes the > WAL grow so large, was started with a "BEGIN DEFERRED" and that the first write > was by "INSERT INTO moz_bookmarks (id, fk, type, parent, position, title, > folder_type, dateAdded, lastModified, guid) VALUES (NULL, 5, 1, 6, 5, 'Firefox > Start', NULL, 1306524884899553, 1306524884899553, GENERATE_GUID())". Is that > any help in tracing down what is holding the transaction open inside of FF? It tells me that someone is opening a transaction and then inserting a bookmark. Matt, do you know what adds the bookmark with the title "Firefox Start"?
That "INSERT INTO moz_bookmarks" happens a fraction of a second after I press the "start sync" button. It looks like it is BookmarkStore.prototype.create's call to nsNavBookmarks::InsertBookmark here: http://mxr.mozilla.org/mozilla-central/source/services/sync/modules/engines/bookmarks.js#655 I believe the transaction is created at the start of bookmark sync by nsNavBookmarks::RunInBatchMode, here: http://mxr.mozilla.org/mozilla-central/source/services/sync/modules/engines/bookmarks.js#251 I'm not sure why this would be different in SQLite 3.7.4 compared to .5. I'll try running with the tracing enabled under .4 to see if there is any difference.
When did you make the transition from using a rollback journaling mode to using WAL? Was that coincident with the move from 3.7.4 to 3.7.5? In WAL mode, if you have a large transaction - larger than will fit into the pager cache, or if you are under a lot of memory pressure such that the pager cache is having to spill content to disk frequently in order to free up memory, then that could be making the WAL file grow very large in ways that would never happen in the older rollback modes. Usually the size of the WAL file is limited to be roughly the same as the amount of database content that is changed by the transaction. But if you are under memory pressure and if pager cache content is having to be spilled to the WAL file early - before transaction commit - then the WAL can get to be much larger than the transaction. The fact that the WAL is getting really big even before we see the end of the transaction tells me that cache spilling is a big part of the problem. How large the the pager cache. (What is "PRAGMA cache_size" set to? What does the magic "about:memory" URL report?) Is it possible to make it bigger, at least for this one "sync" transaction? Is it possible to break up the "sync" into multiple smaller transactions?
(In reply to comment #18) > Is it possible to break up the "sync" into multiple smaller transactions? It certainly is. In fact as I mention in comment 1, Sync in Firefox 5 applies history records in batches of 50 (in Firefox 4 it was one big transaction). Bookmarks are applied all in one big transaction (using nsINavBookmarkService::runInBatchMode), but we could split that up too, if desired.
Here is the same experiment using SQLite 3.7.4. (This does not include the VFS shim, which doesn't seem to work with versions <3.7.6.) It seems to have the exact same open transaction starting at 14:26:59.438, followed by the same FetchItemInfo and InsertBookmarkIntoDB statements: http://people.mozilla.com/~mbrubeck/sqlite-log-3.7.4-trace.txt But when running this in the SQLite 3.7.4 build, the places.sqlite-wal file never grew over 1.03 MB, while with SQLITE 3.7.5 and up it always grows over 300 MB in just a few seconds.
(In reply to comment #18) > When did you make the transition from using a rollback journaling mode to > using WAL? Was that coincident with the move from 3.7.4 to 3.7.5? No, the switch to WAL mode came before the move from 3.7.4 to 3.7.5. It happened in bug 573492, which landed in December 2010 (before Firefox 4) while bug SQLite3.7.5 landed in March 2011 (after Firefox 4). > How large the the pager cache. (What is "PRAGMA cache_size" set to? What > does the magic "about:memory" URL report?) Is it possible to make it > bigger, at least for this one "sync" transaction? PRAGMA cache_size is set to 6% of physical memory. Physical memory is 512MB on my test device, so cache_size should be around 31MB. If cache size is important, this may explain I do not see the same problem on my laptop (3GB physical memory) or Motorola Xoom tablet (1GB physical memory).
(In reply to comment #18) > How large the the pager cache. (What is "PRAGMA cache_size" set to? What > does the magic "about:memory" URL report?) During initial sync when this bug occurs, about:memory reports sqlite 3.36MB: places.sqlite 1.21MB: cache-used 0.94MB. It does not change much from this level as the sync progresses.
We evaluated history at fault initially, but actually how many bookmarks do you have Matt? The fact all bookmarks are added in a single large transaction may matter more than the history chunks. Could you try syncing only history and then only bookmarks and see who is the worse culprit? Splitting bookmarks batch into 30 or such bookmarks batches, even just to try it, should be trivial.
(In reply to comment #23) > We evaluated history at fault initially, but actually how many bookmarks do > you have Matt? I appear to have 169 bookmarks and 4 folders (276 KB of my sync quota). As soon as I have time (probably next week) I can try disabling bookmarks sync, or changing bookmark sync to use smaller batches.
Is there any way for someone to show me the output of "PRAGMA compile_options;" command to SQLite for the version of FF that is exhibiting the problem?
(In reply to comment #25) > Is there any way for someone to show me the output of "PRAGMA compile_options;" > command to SQLite for the version of FF that is exhibiting the problem? We compile SQLite with these defines: https://mxr.mozilla.org/mozilla-central/source/db/sqlite3/src/Makefile.in#99
Based on logging inside of SQLite, it looks like we the page cache size (pCache->nMax) is set to zero. When we set this in nsNavHistory::InitDB, it looks like we are getting an bad value from PR_GetPhysicalMemorySize: __android_log_print(ANDROID_LOG_INFO, "sqlite", "cache size %lld (%d percent of %lld)", cacheSize, cachePercentage, physMem); ==> I( 7573:0x1d9e) cache size -737 B (6 percent of -12288) When I change InitDB to use PRUint64 instead of PRInt64, I get: ==> I( 7645:0x1de6) cache size 184467440737094778 B (6 percent of 18446744073709539328) That's 164 exabytes, which is clearly incorrect...
Thanks for debugging the full thing, looks like we need a fix in PR_GetPhysicalMemorySize and a fallback for nonsense values in initDB. Also test_systemInfo.js may be expanded to check that returned values make sense. Actually, some user may set page cache to 0, should we disallow that if it's going to cause these issues?
(In reply to comment #28) > Actually, some user may set page cache to 0, should we disallow that if it's > going to cause these issues? Yes, but in a different bug.
On both Samsung Verizon Galaxy Tab (Android 2.2) and HTC T-Mobile G2 (Android 2.2), sysconf(_SC_PHYS_PAGES) is returning -3: > 18446744073709539328 bytes (-3 pages * 4096 bytes/page) This happens if the sysconf implementation fails to find a line with the expected "MemTotal" format in /proc/meminfo: http://www.google.com/codesearch/p?hl=en#XAzRy8oK4zA/libc/unistd/sysconf.c&q=__get_phys_pages&exact_package=git://android.git.kernel.org/platform/bionic.git&l=398 I'm not sure why this is failing. This code has not changed in bionic any time recently, at least in the public Android git repositories, and the first line of /proc/meminfo looks like it should match the expected scanf format. > "MemTotal: 451764 kB" This seems to be fixed in Android 3.0. On a Motorola Xoom (Android 3.0), I get the correct value: > 754163712 bytes (184122 pages * 4096 bytes/page)
to post new findings, this is a bug in Android: http://android.git.kernel.org/?p=platform/bionic.git;a=commitdiff;h=038fbae518e904c7aba64779714a22dbeeb90887 Exists in all versions up to 2.2 Matt is working on a fix
This patches PR_GetPhysicalMemorySize to return 0 instead of garbage when sysconf returns an error, and it patches nsNavHistory::InitDB to fall back to a conservative default value of 256MB if the real physical memory size is not available. Marco points out in IRC that this is a bug in Android 2.2 and earlier that is fixed in later versions by this commit: http://android.git.kernel.org/?p=platform/bionic.git;a=commitdiff;h=038fbae518e904c7aba64779714a22dbeeb90887
Attachment #536696 - Flags: review?(mak77)
Comment on attachment 536696 [details] [diff] [review] patch: Handle negative values from sysconf Review of attachment 536696 [details] [diff] [review]: ----------------------------------------------------------------- Could you please also apply the fallback here, so the two values are coherent: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/nsPlacesExpiration.js#722 I'll then file a bug to better unify the 2 paths in Places in future versions. r=me with non-nits addressed. ::: nsprpub/pr/src/misc/prsystem.c @@ +282,5 @@ > > long pageSize = sysconf(_SC_PAGESIZE); > long pageCount = sysconf(_SC_PHYS_PAGES); > + if (pageSize <= 0 || pageCount <= 0) > + return 0; nit: I'd rather if (pageSize > 0 && pageCount > 0) byte = ... since otherwise byte is already inited to 0. ::: toolkit/components/places/nsNavHistory.cpp @@ +728,5 @@ > > + static PRUint64 physMem = PR_GetPhysicalMemorySize(); > + // If actual value is not available, just use a sane fallback. > + if (physMem == 0) > + physMem = 256*1024*1024; Put this magic number into a #define MEMSIZE_FALLBACK_BYTES please per code style brace the if and indent only 2 spaces
Attachment #536696 - Flags: review?(mak77) → review+
(In reply to comment #32) > This patches PR_GetPhysicalMemorySize to return 0 instead of garbage when > sysconf returns an error, and it patches nsNavHistory::InitDB to fall back to a > conservative default value of 256MB if the real physical memory size is not > available. System requirements says 512MB, so that should be safe too, right? https://www.mozilla.com/en-US/firefox/system-requirements.html
(In reply to comment #34) > System requirements says 512MB, so that should be safe too, right? > > https://www.mozilla.com/en-US/firefox/system-requirements.html Er, that's recommended, so nevermind.
We don't want to take an NSPR change on beta. This patch works without the NSPR change. We will file a followup for the "correct" fix to NSPR and Places.
Attachment #536703 - Attachment is patch: true
Address a missed review comment about nsPlacesExpiration.js.
Comment on attachment 536714 [details] [diff] [review] patch for beta without NSPR change Review of attachment 536714 [details] [diff] [review]: ----------------------------------------------------------------- nit: rather than "[Duplicated in XXX]" I'd put a "Must stay in sync with the code in XXX.". But I don't care too much about this nit, since on m-c we should file a bug to better handle and unify this initialization in Places (I plan to reduce a bit the cache size too), so this code will likely change soon. Thanks for taking care of this bug so deeply. I think the nspr and Places bugs should be handled in separate follow-up bugs, rather than adding more work here.
Attachment #536714 - Flags: review?(mak77) → review+
Comment on attachment 536714 [details] [diff] [review] patch for beta without NSPR change Requesting approval for Aurora and Beta. This is a severe regression that causes Firefox 5 on Android 2.0 through 2.2 to immediately consume hundreds or thousands of MB of storage as soon as sync is enabled. This can easily fill up all available storage on a mobile device. This patch is the lowest-risk fix we can think of. It adds a guard in two locations in Places to detect when PR_GetPhysicalMemorySize is returning negative or garbage values, and falls back to a default value rather than using the garbage value. An alternate fix is to roll back to SQLite 3.7.4 (the version used in Firefox 4) which does not interact as badly with this Android bug. This also requires backing out one unrelated change. (See the two backout patches attached to this bug.) This is probably riskier, since any of the changes between SQLite 3.7.5 and 3.7.4 might have unanticipated effects, and we have been building and testing with 3.7.5 since March. (In reply to comment #38) > nit: rather than "[Duplicated in XXX]" I'd put a "Must stay in sync with the > code in XXX.". Fixed locally and pushed to Try: http://tbpl.mozilla.org/?tree=Try&rev=29bb7b909042 > Thanks for taking care of this bug so deeply. I think the nspr and Places > bugs should be handled in separate follow-up bugs, rather than adding more > work here. Filed bug 661351.
First Try run is looking green; started another run with different coverage: http://tbpl.mozilla.org/?tree=Try&rev=c6ebcc1e6a68
Whiteboard: patch is in sync but the disproportionate negative impact is on mobile. → patch is in places but the disproportionate negative impact is on mobile.
Pushed to mozilla-central for Firefox 7: http://hg.mozilla.org/mozilla-central/rev/5a7f6817a104
Comment on attachment 536714 [details] [diff] [review] patch for beta without NSPR change landed on trunk and has stuck.
Pushed to Aurora (Firefox 6) and Beta (Firefox 5): http://hg.mozilla.org/releases/mozilla-aurora/rev/56960f05b1ec http://hg.mozilla.org/releases/mozilla-beta/rev/f55aac1529bf
VERIFIED FIXED on: Build ID: Mozilla /5.0 (Android;Linux armv7l;rv:7.0a1) Gecko/20110602 Firefox/7.0a1 Fennec/7.0a1 Mozilla /5.0 (Android;Linux armv7l;rv:6.0a2) Gecko/20110602 Firefox/6.0a2 Fennec/6.0a2 Device: HTC Desire Z (Android 2.2)
Status: RESOLVED → VERIFIED
Verified fixed - Mozilla/5.0 (Android; Linux armv7l; rv:5.0) Gecko/20110603 Firefox/5.0 Fennec/5.0 ID:20110603103300 G2 - Android 2.2 N1 - Android 2.2
Firefox 36.0.1 places.sqlite-wal takes over 300,000K
(In reply to neopeeves from comment #46) > Firefox 36.0.1 > places.sqlite-wal takes over 300,000K Please report a new bug, as this one is fixed.
6 years ago
See Also: → 1140902
You need to log in before you can comment on or make changes to this bug.