Closed Bug 430530 Opened 16 years ago Closed 16 years ago

excess disk IO when updating the url-classifier

Categories

(Toolkit :: Safe Browsing, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 3

People

(Reporter: dcamp, Assigned: dcamp)

References

Details

(Keywords: perf, Whiteboard: [long pole])

Attachments

(3 files, 1 obsolete file)

There's a thread at http://ubuntuforums.org/showthread.php?t=759673 complaining that some users are seeing painful amounts of disk IO during safebrowsing updates.

I'm not sure how widespread this is, but regardless I think we can help this a lot with some index and cache size tweaking.  I'm working on getting some numbers and a patch together now.
Flags: blocking-firefox3?
Heh, I _just_ mailed you about this, and then I look at my bugmail, and I see this bug was just filed. Hah.
Version: unspecified → Trunk
Dave, can you renom when you've got those numbers together to help us understand the impact/size of the problem?
Flags: blocking-firefox3?
Flags: blocking-firefox3?
After some testing, it looks like this gets pretty bad on linux.

Once we hit sqlite's cache size, sqlite starts hitting the disk pretty hard.  While the database is small this isn't a big deal (we don't hit the cache limit), and once we're up-to-date it isn't as bad (we aren't adding as much data).

But during the middle/end of the push to get the complete database, things take way too long, and we're thrashing the disk the whole time.

We can reduce the amount of IO by increasing the cache size, and that'll help some (it'll help a whole lot if we increase the cache size by enough to hold the db).  We might also need to look at throttling the update process a bit, so that we don't slam the system all at once...

I've sent mail to Dr. Hipp looking for suggestions, and any other suggestions would be appreciated.. :/
Keywords: perf
OK, so I split the current google database into 3 pieces (which are much bigger chunks than they're currently giving us now, and bit bigger than they're planning to start feeding us during an initial update) and fed them to the DB one piece at a time.  This gives us ~800,000 urls to deal with each update (some are adds that will be applied, some are subs that will cause an add to be deleted, some are subs that will be saved to apply to a future add, and some are adds that will cause a previously-applied sub to delete).

On OSX and Vista and vista, it takes ~30-35 seconds to apply one of these chunks the entire process.  On ubuntu, it takes 10 minutes of system-performance-degrading IO to deal with that.  I'm sure it's worse on network filesystems (though we use the Local data dir on osx and windows).

At its largest size, (which is actually near the middle of the update, because the update tends to send us a bunch of subs-that-don't-have-adds-yet that fill up the database, then the adds that remove them come later) the database has roughly 35,000 1k pages.  During one of these big initial updates all the pages will be touched quite a few times.

Once we've caught up to the state of the server, our updates are going to be much smaller - usually in the low hundreds or occasionally a thousand or two. This will touch many fewer pages in the database (I haven't measured this closely).

The only way I see to get this performing decently is to bump up the max page cache size to fit all the pages during updates.  If we bump the max page cache to a pretty big size (say, 35-40 megs), we'll avoid hitting the disk until commit time for even the largest updates.  We can blow away that cache after the update, so it would be a transient spike in memory during updates, relative to the size of the update. 

This sucks, but it's all I can think of.  Any thoughts or suggestions?
If Vista and OSX both can do the updates in 30-35 seconds as you quote - I am wondering if anything is wrong with Linux/kernel?

I originally reported this as a problem against Ubuntu Hardy with high CPU utilization while loading pages but the bug report got hijacked by disk thrashing bug as it apparently also consumes CPU while thrashing the disk - I have never had disk thrashing problem.

Is there a way to reproduce this easily? I have debug build on my machine so I can try to figure out what is going wrong.
The trouble with Linux may be in the way the underlying filesystems implement fsync() which I guess is called a lot during the trashing. See bug 421482 comment 11 or bug 421482 comment 12.
Here's some data: In 11 minutes of light browsing (I visited perhaps 6 pages), firefox called fsync() 17 times on places.sqlite, 47 (!) times on places.sqlite-journal, and 12 times on other files (one is formhistory.sqlite, the other appear to have been closed already, so I don't know).  Total time spent inside fsync system calls: 63 seconds.  Longest nonresponsive period: 12.8 seconds.

I've attached the strace log with timestamps and durations.  The file descriptors mentioned in it are:

32 /home/mg/.mozilla/firefox/default.a83/places.sqlite
40 /home/mg/.mozilla/firefox/default.a83/formhistory.sqlite
49 /home/mg/.mozilla/firefox/default.a83/places.sqlite-journal
53 54 65 70 already gone from /proc/.../fd/

Strangely enough I had't noticed excessive disk I/O or bouts of nonresponsiveness in Firefox 3.0 beta 4.  The data here is collected from beta 5 (the build included in Ubuntu Hardy Heron), where the fsyncs make the browsing excessively painful.
Looks like this is more than just safebrowsing?
Flags: blocking-firefox3? → blocking-firefox3+
So why do we need to call fsync so many times on the safebrowsing and places/formhistory data in first place? Why are data integrity gurantees required for that sort of data? I mean no big deal if we did an update and the browser crashed - ff is not a database after all and this type of data is not like it is going to cause a lot of problems if it is lost.

I suspect that the problem is sqlite trying to be paranoid - there should be a way to tell it to do relaxed syncs - in a separate low CPU and IO priority (where supported) thread or something like that.
There may be a few different problems here.  First problem may be sqlite vacuums.  It looks to me like Firefox occasionally reads and writes the entire contents of large sqlite files (~50MB in my profile).  Second, there seems to be very excessive preallocation or wasting of space.  For example I just did an offline sqlite vacuum which reduced my profile from 50MB to 38MB.  Then I started the browser and visited one site, and the sqlite files in my profile are up to 47MB.  That's 9MB for one site visit?

The reason I suspect a vacuum issue is that I see this very regularly.  Just as a rough guess, I get a 20-30 second hang on every 20 page loads.

I'm testing with Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b5) Gecko/2008041514 Firefox/3.0b5
Throwing in an additional strace.  This is the result of strace -fte open,close,write,fsync.  During this 150-second run, Firefox writes an incredible 34MB of crap (and then, of course, periodically fsyncs).  The write storm starts at timestamp 00:02:36.  File 80 is urlclassifier3.sqlite-journal and 44 is urlclassifier3.sqlite.  Output to these files continues quite heavily until the browser shuts down at 00:03:20.

Contents of write() calls expurgated for privacy's sake.
From today's triage call:
 - might be able to ship the DB w/Ubuntu?
 - might be due to cost of a sync?
 - will probably suck for network mounted profiles
 - karl thinks he's seen it in gentoo

Dave thinks that the easiest way to solve this is let the DB have more cache memory, but that will potentially affect our memory stats in the firstrun case.

Do we know any SQLLite gurus? 
How about making firefox download a database *snapshot* incrementally?  We know how to download files incrementally (see software update).  With this approach, users won't be protected until they get the whole thing, but Firefox might use less memory and CPU as firefox downloads it the first time, and Firefox might even be able to download it more quickly.
(In reply to comment #16)
> How about making firefox download a database *snapshot* incrementally?  We know
> how to download files incrementally (see software update).  With this approach,
> users won't be protected until they get the whole thing, but Firefox might use
> less memory and CPU as firefox downloads it the first time, and Firefox might
> even be able to download it more quickly.

If we can figure this out on the server side, it would be reasonably easy to implement, and would almost certainly be a better user experience (downloading roughly the same amount of data, but no significant CPU/memory usage).  So yeah, if we can figure out where to host these snapshots this would be a good solution.
I'd be a little wary of changing the basic way the DB is initially downloaded (or shipped). Tweaking the existing mechanism would seem lower-risk.
I'm uncertain that this is purely a Linux phenomenon-- comment 7 of bug # 431094 is from a Windows user, and a 'Lars Erik' talked about this within yesterday's "The Official Build of 20080429 is out" Thread over on Mozillazine. (I sent him a PM there, asking him to come over here and comment). Here's his post:

Lars-Erik
Joined: 30 Nov 2007
Nydalen, Oslo, Norway 	PostPosted: Mon 28th Apr 2008 8:20am     

"Anyone else but me that have HUGE CPU usage and I/O.
Loading some pages cause my HDD to go nuts, my A/V to scan constantly and Minefield CPU usage go 100%!!"

He says he's got an A/V scanner going wild, and most Linux Desktop users don't even bother to have one-- so I'm suspicious that he might be on Windows.
My windows builds sometimes takes forever to start up and the CPU useage is probably around 100% because everything is very slow to respond (switching windows to Firefox 2 or Thunderbird). I haven't checked to see if the CPU useage is at 100% though.  How am I supposed to figure out if it is url-classifier downloading/updating during this?
I described my own experience in reply to Lars, here's what I said. BTW, I am on RPM-based Mandriva-2008-fall (2008.0) so it's not explusive to Ubuntu (Probably no one imagines that it was, but I'll document that fact right here.) 
I said, 

- - - quote form Mozillazine - - - -

"This seems to be a 100% Firefox-Created issue, although increased A/V activity may be a side effect of the FF misbehavior on your box.

"Many times when I start running FF3, my hard drive will busy for a long time doing something. Usually it's shortly after starting FF3, but sometimes it goes wild in the middle of a long session. CPU does go 100%-- and my system has no virus scanner or add-on malware monitor at all, modern Linux doesn't need one in my kind of usage. (The entire Firefox installation is within a USER account, and nothing has access anywhere in a system context.) Within Firefox, the panel becomes totally unresponsive. I can start other applications, because Linux is a bit more "fair" about allowing other applications (like my Window Manager) to get a peice of the box when I do something-- but inside Firefox, it's totally locked up and focused on this disk-thrashing task for many, many seconds.

Lars, are you using a really old and much-upgraded Profile? (I am, I've got about a zillion sets of "saved form" data and don't know how to migrate this info to a freshly-built profile.)

I did get rid of urlcalssifier2.sqlite a couple of weeks ago, but the current urlclassifier3.sqlite has already grown to considerable size. I don't know if FF might be churning the whole file, in tiny random-access reads, to "organize it". If so, that needs to be STOPPED. There's also a tiny chance that it's the new memory model doing some garbage collection, but I doubt this, because it happens right at startup a lot, before any memory bits have been allocated. Also, I have lots of RAM, and the Firefox task can get all the "real" memory it wants to have. Here's my current sqlite file sizes:

-rw-r--r-- 1 rick rick 1082368 2006-04-30 00:00 bookmarks_history.sqlite
-rw-r--r-- 1 rick rick 7168 2008-03-24 01:37 content-prefs.sqlite
-rw-r--r-- 1 rick rick 193536 2008-04-28 10:53 cookies.sqlite
-rw-r--r-- 1 rick rick 12288 2008-04-28 10:28 downloads.sqlite
-rw-r--r-- 1 rick rick 240640 2008-04-28 10:39 formhistory.sqlite
-rw-r--r-- 1 rick rick 5120 2007-12-12 12:16 permissions.sqlite
-rw-r--r-- 1 rick rick 4685824 2008-04-28 10:53 places.sqlite
-rw-r--r-- 1 rick rick 2048 2007-10-16 01:11 search.sqlite
-rw-r--r-- 1 rick rick 32229376 2008-04-28 10:40 urlclassifier3.sqlite
-rw-r--r-- 1 rick rick 2048 2008-04-18 09:37 webappsstore.sqlite

I'm currently running a 5-day old version,
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9pre) Gecko/2008042320 Minefield/3.0pre - Build ID: 2008042320

I've no idea what it's doing, and haven't opened a bug because I've no idea how to describe it adequately-- this was pretty much everything I know."

- - - - end mozillazine quote - - - - -

But my "Firefox locked up, disk thrashing" Time periods often exceed 20 seconds, and lots of people have less capable boxes than I do (disk has 8mb cache, lots of people still have just 2mb). I agree with comment 9-- this problem wasn't present until some time around March. Unfortunately, I can't provide the legwork to determine a regression window for it's appearance.
(In reply to comment #5)
> ... Once we've caught up to the state of the server, our updates are going
> to be much smaller - usually in the low hundreds or occasionally a thousand
> or two. This will touch many fewer pages in the database (I haven't measured > this closely)....
>

Dave, I may be mis-understanding this part of your assessment. But, like SuperNova in comment 20, I run Firefox many times per day, hundreds of times per week-- and I am not experiencing small, incremental updates. Instead, it seems that I'm getting "tons-of-megabytes" disk I/O incidents, day after day after day-- even on days when I haven't downloaded a new Build, although it seems to me that the Databases in my profile shouldn't need to be completely overwritten even if I *did* doi a new Nightly every night.

I have to agree with Rick Stockton.  It would be a mistake to characterize this bug as some kind of ramp-up cost for new users only.  I use Firefox continuously on my laptop and I also see this bug pretty much all day long.  You can see from my strace that Firefox had written more than 30MB within a few minutes of starting, and I can reproduce that behavior at will.

From my very little experience with sqlite, my impression of this bug is that either we're doing full database writes when we think we're doing incremental updates, or we're doing full database vacuums when we wanted incremental vacuum.  

Aside from that, what stands out to me in the strace log is that it looks like the sqlite database is initialized with 1KiB pages.  It might be a lot more efficient if it used the same page size as my filesystem (4KiB).  This could be queried before initializing the database, or if that's too complicated then 4KiB is a far better default on modern Linux systems anyway.
as far as I can deduce from these comments, this bug seems to be associated with beta 5 alone, is this correct?, if so, what exactly is different about the underlying implementation between beta 4 and 5?
Whiteboard: [long pole]
The attached patch lets the cache grow as big as it needs to (up to a limit of 100 megs, as a sanity check) on linux.  The cache is freed when the update is done.  It also fixes the page size to be 4096 (this is actually a regression from the schema versioning change awhile back).

To pick up the pagesize change we need to reset the DB, hence the IMPLEMENTATION_VERSION bump.

I'm not sure this is the best or final fix, but I'd like to get feedback on it, so I think we should put it in.
Assignee: nobody → dcamp
Status: NEW → ASSIGNED
Attachment #318863 - Flags: review?(mconnor)
Whiteboard: [long pole] → [long pole][has patch][needs review mconnor]
Comment on attachment 318863 [details] [diff] [review]
increase the page size, let the cache size grow on linux

>+#ifdef XP_WIN
>+pref("urlclassifier.updatecachemax", -1);
>+#elifdef XP_MACOSX
>+pref("urlclassifier.updatecachemax", -1);
>+#else
>+pref("urlclassifier.updatecachemax", 100 * 1024 * 1024);
>+#endif

you just want to do ifdef MOZ_WIDGET_GTK2 here, I don't think we want this on OS/2 or beos, etc.

> // The implementation version is updated during development when we
> // want to change schema, or to recover from updating bugs.  When an
> // implementation version change is detected, the database is scrapped
> // and we start over.
>-#define IMPLEMENTATION_VERSION 3
>+#define IMPLEMENTATION_VERSION 4

as an aside, this is really bloody convenient. :)

>+  if (gUpdateCacheSize > 0) {
>+    PRUint32 cachePages = gUpdateCacheSize / PAGE_SIZE;
>+    nsCAutoString cacheSizePragma("PRAGMA cache_size=");
>+    cacheSizePragma.AppendInt(cachePages);
>+    rv = mConnection->ExecuteSimpleSQL(cacheSizePragma);
>+    if (NS_FAILED(rv)) {
>+      mUpdateStatus = rv;
>+      return rv;
>+    }
>+    mGrewCache = PR_TRUE;
>+  }

>+  nsCAutoString cacheSizePragma("PRAGMA page_size=");
>+  cacheSizePragma.AppendInt(PAGE_SIZE);

ITYM pageSizePragma

looks good, r=me
Attachment #318863 - Flags: review?(mconnor) → review+
Comment on attachment 318863 [details] [diff] [review]
increase the page size, let the cache size grow on linux

a1.9=beltzner
Attachment #318863 - Flags: approval1.9+
Whiteboard: [long pole][has patch][needs review mconnor] → [long pole][has patch][has review][needs approval]
Thanks DC, MC, MB! I'll keep my eyes on tinderbox and start running the Linux Smoketest as soon as I see this Update in there. (Hoping there's a little bit of a testing window before Google's bug # 402469 changes are implemented, in case we need to figure out which change is to "get the blame"... if the combined effect is no-worky.)
Attached patch updated patchSplinter Review
Updated patch fixes review comments and pref file syntax (can't actually do multiplication in the pref file!).
Attachment #318863 - Attachment is obsolete: true
Keywords: checkin-needed
Checking in browser/app/profile/firefox.js;
/cvsroot/mozilla/browser/app/profile/firefox.js,v  <--  firefox.js
new revision: 1.328; previous revision: 1.327
done
Checking in toolkit/components/url-classifier/src/nsUrlClassifierDBService.cpp;
/cvsroot/mozilla/toolkit/components/url-classifier/src/nsUrlClassifierDBService.cpp,v  <--  nsUrlClassifierDBService.cpp
new revision: 1.74; previous revision: 1.73
done
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Whiteboard: [long pole][has patch][has review][needs approval] → [long pole]
Target Milestone: --- → Firefox 3
Comment on attachment 318939 [details] [diff] [review]
updated patch

>+#ifdef MOZ_WIDGET_GTK2

I changed this to |#ifdef UNIX_BUT_NOT_MAC|.
> >+#ifdef MOZ_WIDGET_GTK2
> 
> I changed this to |#ifdef UNIX_BUT_NOT_MAC|.
You didn't:
http://bonsai.mozilla.org/cvsview2.cgi?subdir=mozilla/browser/app/profile&command=DIFF_FRAMESET&file=firefox.js&rev1=1.327&rev2=1.328
Good thing, too, because UNIX_BUT_NOT_MAC would have been the same as the original patch.
EEEK! Gavin, Steffen: I'm no expert at this, but Tinderbox shows Reed's final version as rev 1.329 (applied a minute AFTER the big update rev 1.328), and the bonsai diff shows this change applied exactly has he said:
http://bonsai.mozilla.org/cvsview2.cgi?subdir=mozilla/browser/app/profile&command=DIFF_FRAMESET&file=firefox.js&rev1=1.328&rev2=1.329

And rev 1.329 is the last, so it's in there as Reed says. Maybe it's a good thing, maybe not-- better ask Reed why he didn't agree with comment 26. But it looks like he made this change very, VERY intentionally.

- - - - -
I'd been waiting for this to land and try it out before the Google folk landed their changes for 402469, but I gave up at about 2AM. So I'm pulling the nightly right now, a little "end user's test report" (how long to update the file) coming in my next comment. My current profile is a very, very old one-- migrated all the way from about Mozilla 0.8, but it has been following FF3 for a long time. I'll leave the current urlclassifier3.sqlite intact upon restart.

And yeah, my profile *IS* getting backed up first ;)
shocking, great, absolutely awesome:

my urlclassifier3.sqlite file auto-magically changed in size from 9.8MB to 17.6MB, without ANY apparent hard drive thrashing at all. But I may not have been using Firefox actively at that moment-- I'll be sure to play again, more acrtively, opening tabs and typing into text boxes, when the "stale" timer expires again in a little less than 45 mins.
shocking, great, absolutely awesome:

my urlclassifier3.sqlite file auto-magically changed in size from 9.8MB to 17.6MB, without ANY apparent hard drive thrashing at all. But I may not have been using Firefox actively at that moment-- I'll be sure to play again, more acrtively, opening tabs and typing into text boxes, when the "stale" timer expires again in a little less than 45 mins.
apology: as you can guess, my "active usage" involved reloading tabs-- and I collided with myself, #36 == #35. But I see something unexpected happened: I was hoping to see the update occur at 12:58 + 2700 seconds = 13:43 (that's why I starteed playing at 13:41, but instead it updated (adding another 200KB) earlier: at 13:31, per my local PC time.

So I missed it. grrrr. I'll try playing again, from 14:00 and continuing until I see the file updated.

THEN: I'll download FF3 nightly into a Windoze box which has either FF2 or no current Firefox at all, see what happens with that "migration" or creation of a new FF3 profile.
(In reply to comment #34)
> EEEK! Gavin, Steffen: I'm no expert at this, but Tinderbox shows Reed's final
> version as rev 1.329 (applied a minute AFTER the big update rev 1.328), and the
> bonsai diff shows this change applied exactly has he said:

"Eek" indeed. I undid that:
mozilla/browser/app/profile/firefox.js 	1.330
OK, I caught it. And WOW, that was really hard to even notice-- right in the middle of reloading a bunch of tabs (including Acid3, which loads a bit slowly, and Tinderbox, which loads REALLY slow). I heard a quick disk I/O-- less than 1/2 second (I'll SWAG between 0.2 and 0.3 seconds). And it was done! added another 300K, now 20.1 MB.

slightly off-topic: how big is this file gonna get?

I'll proceed to start up a Windows-XP machine, install this morning's nightly, and report results.
Downloaded 2.0.0.14 and installed it, creating a new default profile. The upgraded to Tinderbox windows build. (Downloaded 14:21, which might have been too early for Gavin's update-- but it should certainly have no effect on Windows, which should have failed either test very decisively.)

Initial profile creation (under 2.x) took a while, but 3.0 Tinderbox showed no issues when used for only a few minutes. HOWEVER, "no issues" may have been only because I didn't keep it running for long enough- I've got a urlclassifierkey3.txt, but no urlclassifier3.sqlite file. Back to Windoze for a while...

PROBLEM ON WINDOWS (I think).

So I went back to Windows and ran for a long time (a full hour), aggressively switching between and reloading tabs during periods when I anticipated that FF3 might start updating the urlclassifier_x.sqlite file. I ran for over an hour, but

urlclassifier3.sqlite was never created.

urlclassifier2.sqlite was never updated (left unchanged from it's State with Firefox2).

I can't read code reliably, but I'm under the impression that phishing protection (using current Trunk code) should ALWAYS create a urlclassifier3.sqlite file, in the root of the profile, on all platforms. And update should have happened in no more than 45 minutes. (Although I gave it an entire hour, looking for something to happen to either of the urlclassifier_x files).

So I suspect we've introduced a bug-- somehow, a Windows machine doesn't set the timer correctly, or doesn't initiate the timer listener properly.

New bug? The Linux fix worked great.
Sorry, should've looked better before posting.
This will not block the final release of Firefox 3. Any patch will need unit tests in order to be approved.
Flags: blocking-firefox3+ → blocking-firefox3-
(sorry for spam - finger slipped and hit my bookmarklet ;)
Flags: blocking-firefox3- → blocking-firefox3+
(In reply to comment #41)
> PROBLEM ON WINDOWS (I think).

NO PROBLEM EXISTS. I don't know why my experience was different yesterday, but I asked for 'nightly' Windows users at the Mozillazine "daily build" forum Thread to tell me what they saw on their boxen... and urlcalssifier3.sqlite *is* being updated properly, for both XP and Vista users (WildCatRay and LittleMutt, they're much more expert than I am at FireFox bug testing.) So we're all done here, and all done with 402469 :)

When RC1 is created, I'll update my wife's *PRODUCTION* FF2 system and double-check to see what occurs there. She's on Windows exclusively.

My file size is now 44.8MB 
Product: Firefox → Toolkit
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: