Closed Bug 650649 Opened 13 years ago Closed 12 years ago

url-classifier bloat (memory use grows slowly even in new profile, idle browser, showing blank tab)

Categories

(Toolkit :: Safe Browsing, defect)


Tracking


RESOLVED FIXED
Firefox 11

People

(Reporter: zwol, Assigned: gcp)

References

(Blocks 1 open bug)

Details

(Keywords: memory-footprint, Whiteboard: [MemShrink:P1])

Attachments

(9 files)

On two different Windows 7 machines, I observe a slow but continuous memory leak when the browser is completely idle and showing a blank tab. This happens even in a completely fresh profile. Specific steps to reproduce:

* Create a new profile.
* In the new profile, open the Add-ons Manager and disable all plugins and system-provided add-ons (one of these computers has several such; the other has only Flash and Acrobat).
* Optionally, install dbaron's leak monitor add-on [https://addons.mozilla.org/en-US/firefox/addon/leak-monitor/] in the new profile.

* Restart Firefox.
* Open a new (blank) tab; close the about:home tab.
* Open Task Manager and select the process list.  At this point, firefox.exe should be using about 38MB of RAM.

* Leave the computer untouched for at least an hour.  When you return, firefox.exe will be reported as consuming close to 60MB of RAM.  If you continue the experiment, the Firefox process continues to grow at a rate of approximately 20MB per hour.

The person who regularly uses these computers informs me that the leak is much worse if the browser is actually used for anything, with Firefox growing to consume all available RAM within a few hours; and that reloading a single page repeatedly seems to be especially bad. Some of that may be extensions' or pages' fault, of course.

Over on Mac and Linux, I don't observe the same problem (but I haven't yet tried exactly the above experiment) so it could well be a Windows-specific issue.  I'm filing this under Tabbed Browser because dbaron's leak monitor fingered browser.js for at least some of the leaks (details to follow).
On one of the aforementioned computers, I installed dbaron's leak monitor in the new profile, and then after allowing the empty tab to leak for an hour, I opened a new window and closed the old one.  The leak monitor promptly popped up this leak report.
And this is the second leak report it produced, after "some objects were reclaimed" (I assume that means after JS GC and/or XPCOM cycle collection had a chance to run).  It doesn't tell me much, but I see a whole lot of references to chrome://browser/content/browser.js in there.
cc:ing the usual suspects.
Oh, I should have said: all this was with the Windows 4.0 release build.
Component: Tabbed Browser → General
QA Contact: tabbed.browser → general
A similar overall increase in process size occurs with 4.0.1 on my Mac (10.6.7).  However, this time the leak monitor didn't catch anything.

$ while :; do sleep 300; ps xuc | grep firefox-bin; done
USER   PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
zack   360   0.0  2.5  3117464 103708   ??  S    12:22PM   0:09.81 firefox-bin
zack   360   0.0  2.3  2969060  96736   ??  S    12:22PM   0:09.85 firefox-bin
zack   360   0.0  2.3  2969584  96752   ??  S    12:22PM   0:09.90 firefox-bin
zack   360   0.0  2.3  2969060  96776   ??  S    12:22PM   0:10.14 firefox-bin
zack   360   0.0  2.3  2969060  96776   ??  S    12:22PM   0:10.41 firefox-bin
zack   360   0.0  2.3  2969060  96788   ??  S    12:22PM   0:10.64 firefox-bin
zack   360   0.0  2.3  2969060  96788   ??  S    12:22PM   0:10.67 firefox-bin
zack   360   0.0  2.5  2977776 103148   ??  S    12:22PM   0:17.25 firefox-bin
zack   360   0.0  2.5  2977252 103128   ??  S    12:22PM   0:17.34 firefox-bin
zack   360   0.0  2.5  2977252 103128   ??  S    12:22PM   0:17.61 firefox-bin
zack   360   0.0  2.5  2977252 103128   ??  S    12:22PM   0:17.88 firefox-bin
zack   360   0.0  2.5  2977252 103128   ??  S    12:22PM   0:18.15 firefox-bin
zack   360   0.1  2.5  2977252 103128   ??  S    12:22PM   0:18.41 firefox-bin
zack   360   0.4  2.5  2977776 105220   ??  S    12:22PM   0:19.97 firefox-bin
zack   360   1.0  2.6  2985968 109508   ??  S    12:22PM   0:24.47 firefox-bin
zack   360   0.0  2.6  2985444 109484   ??  S    12:22PM   0:24.74 firefox-bin
zack   360   0.0  2.6  2985444 109484   ??  S    12:22PM   0:25.00 firefox-bin
zack   360   0.0  2.6  2985444 109484   ??  S    12:22PM   0:25.28 firefox-bin
zack   360   0.0  2.6  2985444 109484   ??  S    12:22PM   0:25.55 firefox-bin
zack   360   0.0  2.6  2985968 109496   ??  S    12:22PM   0:25.81 firefox-bin
zack   360   0.0  2.7  2985968 112176   ??  S    12:22PM   0:27.73 firefox-bin
zack   360   0.0  2.8  2985968 115568   ??  S    12:22PM   0:32.84 firefox-bin
zack   360   0.0  2.8  2985444 115544   ??  S    12:22PM   0:33.11 firefox-bin

The steady increase in CPU time consumed is also a bit of a concern; it's showing one window, containing one tab, with no contents (empty address bar, not about:blank; I would like to think those are the same thing, but ... :-/)  It shouldn't have anything to burn CPU on.

That appears to be a leak rate of 180KB/minute, by the way.
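For reference, that figure can be reproduced from the RSS column of a log like the one above. A rough sketch (the column index and the 5-minute sample interval are assumptions matching the `ps xuc` loop; the three log lines below are just an illustrative subset, so the rate they yield is not the real one):

```python
# Sketch: derive a leak rate from a ps log like the one above.
# Assumes one sample per `sleep 300` and that RSS is the 6th
# whitespace-separated column, in KiB, as in `ps xuc` output.

SAMPLE_INTERVAL_MIN = 5  # minutes between samples

def leak_rate_kb_per_min(ps_lines):
    """Return (last RSS - first RSS) / elapsed minutes, in KiB/min."""
    rss = [int(line.split()[5]) for line in ps_lines if line.strip()]
    elapsed_min = (len(rss) - 1) * SAMPLE_INTERVAL_MIN
    return (rss[-1] - rss[0]) / elapsed_min

log = [
    "zack 360 0.0 2.3 2969060  96736 ?? S 12:22PM 0:09.85 firefox-bin",
    "zack 360 0.0 2.5 2977252 103128 ?? S 12:22PM 0:17.34 firefox-bin",
    "zack 360 0.0 2.8 2985444 115544 ?? S 12:22PM 0:33.11 firefox-bin",
]
print(round(leak_rate_kb_per_min(log), 1))  # 1880.8 over this 3-sample subset
```

Applied to the full 23-sample log above, excluding the anomalous first sample (96736 to 115544 KiB over 105 minutes), this works out to roughly 180 KiB/minute, matching the figure quoted.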
OS: Windows 7 → All
Hardware: x86 → All
Attached image RSS over time (Mac)
Because I am that kind of nerd, I plotted the RSS measurements from my Mac.  The result is veeery interesting.  It looks like something happens every 30 minutes, more or less, that makes memory usage go up by a big chunk.  In between, nothing.
Just a poor man's guess based on searching MXR: does that timing change if you set urlclassifier.updateinterval to a really large value? The default seems to be to update every 30 minutes.
I set urlclassifier.updatetime to 3600 and let the browser idle for five hours, monitoring RSS every minute; this is the result.  It looks like changing that parameter did have _some_ effect, but it also looks like that's not the _only_ thing leaking memory at 30min intervals.

Possibly this should become a tracking bug?  From this data, there's at least two problems on Mac, and the original syndrome on Windows is different -- from eyeballing the numbers in Task Manager it looks like memory usage goes up continuously there.  (I would be happy to produce a graph if someone tells me how to record a log of RSS every minute in Windows.)
Blocks: 639626
Blocks: mlk-fx5+
No longer blocks: 639626
Summary: Memory leak in new profile, idle browser showing blank tab (possibly W7-specific) → Memory leak in new profile, idle browser showing blank tab
Keywords: mlk
Hm, note that updatetime and updateinterval are different things.
updateinterval is the time between updates; you didn't change it from the original 30 minutes. updatetime is the time between each chunk during an update.
Also, with a current nightly you could check the SQLite memory usage for urlclassifier3.sqlite in about:memory (esp. storage/urlclassifier3.sqlite/Cache_Used), since with continuous updates I'd expect the cache to grow naturally.
I tried this on a Linux64 box under Massif.  Here's the graph for a 1,000
second run:

    MB
27.24^                                                                      # 
     |                                                                      # 
     |                               @   ::::       :::  :   :::::::   @@@@ # 
     |                     :::       @   :          :    :   :    :    @    #:
     |  ::::::@@::::::@@:@::  :@:::::@::::   :@::::::  :::::::    : :::@   :#:
     |  ::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |  ::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |  ::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
   0 +----------------------------------------------------------------------->s
     0                                                                   1050.9

You can see a gradual increase, with some drops along the way (GCs, I
guess?)  I looked at the detailed snapshots;  with the exception of some
very minor variations the only thing that is changing significantly is this
allocation point:

94.22% (26,914,780B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
o1o> 25.06% (7,159,760B): pcache1Alloc (sqlite3.c:33629)
 o2o> 22.08% (6,307,584B): pcache1Fetch (sqlite3.c:33692)
  o3o> 22.08% (6,307,584B): sqlite3PcacheFetch (sqlite3.c:33046)
   o4o> 22.08% (6,307,584B): sqlite3PagerAcquire (sqlite3.c:39873)
    o5o> 22.08% (6,307,584B): btreeGetPage (sqlite3.c:47038)
     o6o> 11.91% (3,401,472B): getAndInitPage (sqlite3.c:47091)
      o7o> 08.21% (2,344,704B): moveToChild (sqlite3.c:49594)
       o8o> 08.09% (2,311,680B): sqlite3BtreeMovetoUnpacked (sqlite3.c:50054)
        o9o> 07.75% (2,212,608B): btreeMoveto (sqlite3.c:46159)
         o10> 07.75% (2,212,608B): sqlite3BtreeInsert (sqlite3.c:52180)
          o11> 07.75% (2,212,608B): sqlite3VdbeExec (sqlite3.c:64977)
           o12> 07.75% (2,212,608B): sqlite3_step (sqlite3.c:59036)
            o13> 07.75% (2,212,608B): mozilla::storage::stepStmt() (mozStoragePrivateHelpers.cpp:288)
             o14> 07.75% (2,212,608B): mozilla::storage::Statement::ExecuteStep() (mozStorageStatement.cpp:617)
              o15> 07.75% (2,212,608B): mozilla::storage::Statement::Execute() (mozStorageStatement.cpp:582)
               o16> 07.75% (2,212,608B): nsUrlClassifierStore::WriteEntry() (nsUrlClassifierDBService.cpp:1967)
                o17> 07.75% (2,212,608B): nsUrlClassifierDBServiceWorker::AddChunk() (nsUrlClassifierDBService.cpp:2563)
                 o18> 07.75% (2,212,608B): nsUrlClassifierDBServiceWorker::ProcessChunk() (nsUrlClassifierDBService.cpp:2693)
                   o19> 07.75% (2,212,608B): nsUrlClassifierDBServiceWorker::UpdateStream() (nsUrlClassifierDBService.cpp:3082)
                   o20> 07.75% (2,212,608B): NS_InvokeByIndex_P ()
                    o21> 07.75% (2,212,608B): nsProxyObjectCallInfo::Run() (nsProxyEvent.cpp:182)
                     o22> 07.75% (2,212,608B): nsThread::ProcessNextEvent() (nsThread.cpp:618)
                      o23> 07.75% (2,212,608B): NS_ProcessNextEvent_P() (nsThreadUtils.cpp:250)
                       o24> 07.75% (2,212,608B): nsThread::ThreadFunc() (nsThread.cpp:273)
                        o25> 07.75% (2,212,608B): _pt_root (ptthread.c:187)

For example, about halfway through this accounts for about 5.5MB, by the end
it accounts for 7.2MB.  This seems to confirm Marco's idea that
urlclassifier is causing the gradual memory usage increase.

Can someone explain briefly what the urlclassifier is?  Is there a good
reason why it should be causing memory usage to creep up slowly?
Urlclassifier is the thing that we use for phishing protection.  It basically fetches data from Google's anti-phishing blacklist and stores it in a database.  Then when we need to load a page we check whether the url is in that database, and if it is we show the phishing error page.
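In other words, the flow bz describes is a local-database membership test on every page load. A minimal sketch (the plain in-memory set and the full SHA-256 digests are simplifications; the real classifier stores partial hashes in SQLite):

```python
# Minimal sketch of the lookup flow: a local blocklist database
# consulted on every page load. A set of hashes stands in for the
# real SQLite-backed store here.
import hashlib

blocklist = set()

def add_entry(url):
    # The real database stores partial hashes of canonicalized URL
    # fragments; full digests stand in for them in this sketch.
    blocklist.add(hashlib.sha256(url.encode()).digest())

def should_block(url):
    return hashlib.sha256(url.encode()).digest() in blocklist

add_entry("phish.example.com/login")
print(should_block("phish.example.com/login"))  # True
print(should_block("mozilla.org"))              # False
```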
bz, is the data fetched in small chunks periodically?  Comment 7 and comment 9 indicate this happens every 30 minutes, but I'm seeing increases on much shorter timescales.  Is that expected?  Is this growth likely to reach some kind of limit?
Hmm.  It's been a while, but I think based on reading code just now that if there are updates to download the url classifier will spend 5 seconds doing that, then 60 seconds sleeping, then repeat until the update is fully downloaded.

I believe 45 minutes is how often we check whether there's an update available...  Again, I could be totally misreading this code.  These various frobs aren't documented very clearly.  :(
Pardon for jumping in, but the specification "Version 4.0 Branch" in the header is a little too constraining, since I can reproduce the above issues easily in 6.0a1pre nightlies.
Zack:
"A similar overall increase in process size occurs with 4.0.1 on my Mac
(10.6.7).  However, this time the leak monitor didn't catch anything."

I had this too (though on Windows). It was totally weird: it stayed perfectly flat for almost exactly 3 minutes, then OUT OF THE BLUE the memory leaking started all of a sudden, as if someone had flipped a lever :)
Maybe sqlite should avoid caching what it writes into urlclassifier, since it isn't especially likely that new entries will be hit? (Or clear the cache after each urlclassifier database update.)
Version: 4.0 Branch → Trunk
(In reply to comment #17)
> Maybe sqlite should avoid caching what it writes into urlclassifier, since it
> isn't especially likely that new entries will be hit? (Or clear the cache after
> each urlclassifier database update.)
What data do you have to indicate that new entries are unlikely to be hit?
I'm assuming we only care about speed in the non-attack case, and that hash collisions between non-attack pages and urlclassifier entries are effectively random. Am I missing something?
(In reply to comment #19)
> I'm assuming we only care about speed in the non-attack case, and that hash
> collisions between non-attack pages and urlclassifier entries are effectively
> random. Am I missing something?
We still have to do a lookup on every load, so clearing the cache is going to be expensive.
No longer blocks: mlk-fx5+
I did an overnight run on Linux64, logging VSIZE and RSS every second (with a few gaps).  Overall they both trended upwards slowly, sometimes going down for a while.  At start the numbers were 535192/80144 KiB, at end they were 569996/96676 KiB.  I also saw spikes in the RSIZE up to 105000 or 110000 KiB on some of the 30 minute marks, so something was happening there.

I've attached about:memory measurements from the start and end.  Here are the most interesting parts:

Start:
37,138,916 B (100.0%) -- explicit
├──20,005,162 B (53.87%) -- heap-unclassified
...
├───2,644,240 B (07.12%) -- storage
│   └──2,644,240 B (07.12%) -- sqlite
│      ├────867,376 B (02.34%) -- places.sqlite
│      │    ├──696,056 B (01.87%) -- cache-used
│      │    ├───96,256 B (00.26%) -- stmt-used
│      │    └───75,064 B (00.20%) -- schema-used

End:
48,258,882 B (100.0%) -- explicit
├──28,258,021 B (58.56%) -- heap-unclassified
...
├───4,295,200 B (08.90%) -- storage
│   └──4,295,200 B (08.90%) -- sqlite
│      ├──2,008,592 B (04.16%) -- places.sqlite
│      │  ├──1,686,536 B (03.49%) -- cache-used
│      │  ├────245,544 B (00.51%) -- stmt-used
│      │  └─────76,512 B (00.16%) -- schema-used


So an 11MB increase in explicit allocations: 8.2MB of that on the heap but unclassified, 1.7MB in sqlite, and most of the sqlite increase in places.sqlite.

We really need to understand what the browser is doing when it's idle.
> We really need to understand what the browser is doing when it's idle.

Some info from
https://wiki.mozilla.org/Phishing_Protection:_Design_Documentation, which is
a page that hasn't been modified since March 2008:

> The first update request happens at a random interval between 0-5 minutes
> after the browser starts. The second update request happens between 15-45
> minutes later. After that, each update is once every 30 minutes.
>
> If the client receives an error during update, it tries again in a minute.
> If it receives three errors in a row, it skips updates until at least 60
> minutes have passed before trying again. If it then receives another (4th)
> error, skips updates for the next 180 minutes and if it receives another
> (5th) error, it skips updates for the next 360 minutes. It will continue
> to check once every 360 minutes until the server responds with a success
> message. The current implementation doesn't change the 30 minute timer
> interval, it just involves skipping updates until the back off time has
> elapsed. 

So it really sounds like the short (2 or 3 second) spikes in memory usage
every 30 minutes or so are the url-classifier getting an update.
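The back-off schedule quoted above can be summarized in a few lines; this is one reading of the wiki text, not the shipped listmanager.js logic:

```python
# Sketch of the update back-off schedule quoted from the wiki
# (an interpretation of the prose, not the actual implementation).
def minutes_until_retry(consecutive_errors):
    if consecutive_errors == 0:
        return 30          # normal update interval
    if consecutive_errors < 3:
        return 1           # on error, try again in a minute
    if consecutive_errors == 3:
        return 60          # three errors in a row: back off 60 min
    if consecutive_errors == 4:
        return 180
    return 360             # 5th error and beyond: 360 min until success

print([minutes_until_retry(n) for n in range(7)])
# [30, 1, 1, 60, 180, 360, 360]
```

Note the wiki also says the 30-minute timer itself never changes; back-off just means skipping timer firings until the computed interval has elapsed.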

But the url-classifier is also doing heaps of local processing in the
meantime, as soon as the browser starts up.  I used a modified version of
Valgrind to obtain a stack trace of every malloc/free call done by the
browser, and looked at them for a period of a minute or so shortly after
start-up (fresh profile, showing only a blank page).  Here are the 10 most
common stack entries seen for a chunk of that time (the numbers at the start
are the popularity, frequency count, percentage of total, and cumulative
percentage):

( 1) 230770 ( 7.7%,  7.7%): ==17034==    at 0x40256B5: VALGRIND_PRINTF_BACKTRACE (valgrind.h:4477)

( 2) 129279 ( 4.3%, 12.0%): ==17034==    by 0x5DCB6F3: nsUrlClassifierDBServiceWorker::SubChunk(unsigned int, unsigned int, nsTArray<nsUrlClassifierEntry, nsTArrayDefaultAllocator>&) (nsUrlClassifierDBService.cpp:2655)

( 3) 115104 ( 3.8%, 15.8%): ==17034==    by 0x4025FF9: free (vg_replace_malloc.c:370)

( 4) 115084 ( 3.8%, 19.7%): ==17034==    by 0x4026903: malloc (vg_replace_malloc.c:240)

( 5) 113580 ( 3.8%, 23.5%): ==17034==    by 0x5DCD59D: nsUrlClassifierDBServiceWorker::ProcessChunk(int*) (nsUrlClassifierDBService.cpp:2696)

( 6)  93749 ( 3.1%, 26.6%): ==17034==    by 0x405A1B4: moz_free (mozalloc.cpp:94)

( 7)  84157 ( 2.8%, 29.4%): ==17034==    by 0x5DC1506: nsUrlClassifierStore::WriteEntry(nsUrlClassifierEntry&) (nsUrlClassifierDBService.cpp:1965)

( 8)  77313 ( 2.6%, 32.0%): ==17034==    by 0x5DC2046: nsUrlClassifierSubStore::BindStatement(nsUrlClassifierEntry const&, mozIStorageStatement*) (nsUrlClassifierDBService.cpp:982)

( 9)  69374 ( 2.3%, 34.3%): ==17034==    by 0x5DCD6F9: nsUrlClassifierDBServiceWorker::UpdateStream(nsACString_internal const&) (nsUrlClassifierDBService.cpp:3083)

(10)  68660 ( 2.3%, 36.6%): ==17034==    by 0x405A197: moz_xmalloc (mozalloc.cpp:100)


1, 3, 4, 6, and 10 are malloc and/or free, or wrappers thereof.  2, 5, 7, 8 and 9 are all url-classifier steps.

Here is a typical stack trace that includes all of 2, 5, 7, 8 and 9, one
which appears extremely frequently:

**17034** malloc(64) = 0x205E7C20
==17034==    at 0x40256B5: VALGRIND_PRINTF_BACKTRACE (valgrind.h:4477)
==17034==    by 0x4026903: malloc (vg_replace_malloc.c:240)
==17034==    by 0x405A197: moz_xmalloc (mozalloc.cpp:100)
==17034==    by 0x5EA01A4: mozilla::storage::Statement::getParams() (mozalloc.h:229)
==17034==    by 0x5EA038B: mozilla::storage::Statement::BindNullByIndex(unsigned int) (mozStorageStatement.cpp:899)
==17034==    by 0x5DC1F05: nsUrlClassifierStore::BindStatement(nsUrlClassifierEntry const&, mozIStorageStatement*) (nsUrlClassifierDBService.cpp:641)
==17034==    by 0x5DC2046: nsUrlClassifierSubStore::BindStatement(nsUrlClassifierEntry const&, mozIStorageStatement*) (nsUrlClassifierDBService.cpp:982)
==17034==    by 0x5DC1506: nsUrlClassifierStore::WriteEntry(nsUrlClassifierEntry&) (nsUrlClassifierDBService.cpp:1965)
==17034==    by 0x5DCB6F3: nsUrlClassifierDBServiceWorker::SubChunk(unsigned int, unsigned int, nsTArray<nsUrlClassifierEntry, nsTArrayDefaultAllocator>&) (nsUrlClassifierDBService.cpp:2655)
==17034==    by 0x5DCD59D: nsUrlClassifierDBServiceWorker::ProcessChunk(int*) (nsUrlClassifierDBService.cpp:2696)
==17034==    by 0x5DCD6F9: nsUrlClassifierDBServiceWorker::UpdateStream(nsACString_internal const&) (nsUrlClassifierDBService.cpp:3083)
==17034==    by 0x6107ED1: NS_InvokeByIndex_P (xptcinvoke_x86_64_unix.cpp:195)

Now, it appears that the allocations done for this trace (64 bytes in this
case) are freed very shortly after.  The cases where allocations aren't
freed correspond to stack traces like the one in comment 11.

I know very little about the url-classifier but if I had to guess I'd say
that we're processing a downloaded update, making lots of deletions from the
database (they're deletions because I'm seeing SubChunk all the time but
never AddChunk), and every so often SQLite has to rearrange a B-tree and
this requires allocating some memory.  It's disconcerting that SQLite's
memory use is growing even though we're seemingly deleting database entries.

For a comparison, I tried a second fresh profile where I chose to "work
offline".  Its memory usage did not creep up, and there were no
allocations/deallocations from the url-classifier.  After a minute or so I
turned off "work offline" and there was a brief flurry of allocations, but
not from the url-classifier, and then it held pretty steady again, with
occasional allocations/deallocations from what looks like the main event
loop.

For a second comparison, I tried another fresh profile that had online
access, but "block reported attack sites" and "block reported web forgeries"
were both turned off in the security preferences.  Again, url-classifier
didn't show up in the allocation trace at all, and once start-up was
complete the allocation rate was very slow, and the allocations were from
the main event loop.

So, conclusions?  Well, maybe the url-classifier is working exactly as it's
supposed to.  (sdwilsh, any info you could provide on that front would be
welcome.)  Though I would have thought that the memory usage would level out
fairly quickly instead of continuing to creep up for hours and hours (as
described in comment 21).  Unfortunately, it sure complicates things in
terms of identifying and fixing real leaks;  users assume that if the
browser is (apparently) idle that it's not doing anything, and so any rise
in memory usage is assumed to be a leak.  Firefox's reputation for leaks
doesn't help here.
Summary: Memory leak in new profile, idle browser showing blank tab → Memory usage grows in new profile, idle browser showing blank tab, due to url-classifier
(In reply to comment #22)
> So, conclusions?  Well, maybe the url-classifier is working exactly as it's
> supposed to.  (sdwilsh, any info you could provide on that front would be
> welcome.)  Though I would have thought that the memory usage would level out
> fairly quickly instead of continuing to creep up for hours and hours (as
> described in comment 21).  Unfortunately, it sure complicates things in
> terms of identifying and fixing real leaks;  users assume that if the
> browser is (apparently) idle that it's not doing anything, and so any rise
> in memory usage is assumed to be a leak.  Firefox's reputation for leaks
> doesn't help here.
Unfortunately, I don't know much about it.  mmmulani was rewriting it, last I knew.
dcamp might know something.
(In reply to comment #22)
> So, conclusions?  Well, maybe the url-classifier is working exactly as it's
> supposed to.  (sdwilsh, any info you could provide on that front would be
> welcome.)  Though I would have thought that the memory usage would level out
> fairly quickly instead of continuing to creep up for hours and hours (as
> described in comment 21).  Unfortunately, it sure complicates things in
> terms of identifying and fixing real leaks;  users assume that if the
> browser is (apparently) idle that it's not doing anything, and so any rise
> in memory usage is assumed to be a leak.  Firefox's reputation for leaks
> doesn't help here.

From what I understood of the events, this seems like expected behaviour.
On each update, url-classifier adds and removes entries from the database.

The code for updating chunks in url-classifier has been stable for quite some time, so this might be a regression caused by something else? (The only major change was to code unrelated to the SQLite usage.) Could it also be related to the current version of SQLite?
To answer more specific questions:
Are deletions expected/what is the make up of an update?
Updates contain roughly the same number of addChunks and subChunks. That is, there are always deletions in each update. I am not sure about the size of deletions per subChunk, though.

When does the browser update the database?
On a fresh profile, it is timed to be random over the first five minutes. On a dirty profile, after 30 seconds.
The relevant code is here: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/content/listmanager.js#254
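The scheduling described above amounts to something like the following sketch (a simplification of the listmanager.js logic; the function name is invented):

```python
# Sketch of the first-update scheduling described above: a random
# point in the first five minutes on a fresh profile, 30 seconds on
# a dirty one. Assumed behaviour, simplified from listmanager.js.
import random

def first_update_delay_seconds(fresh_profile):
    if fresh_profile:
        return random.uniform(0, 5 * 60)   # somewhere in the first 5 min
    return 30                              # dirty profile: 30 seconds

d = first_update_delay_seconds(fresh_profile=True)
print(0 <= d <= 300)                                    # True
print(first_update_delay_seconds(fresh_profile=False))  # 30
```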
ISTR from another bug that the memory increase from urlclassifier was just a result of sqlite caches filling up. [I don't know how strong that claim is, though.] Is there an easyish way to disable / limit caching in the DB to see if that's the case?
Whiteboard: [MemShrink:P2]
I'm copying this measurement over from bug 636791.

Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0

Steps to Reproduce:
1. Start Firefox in Safe Mode
2. Load about:memory
3. Do Nothing (for about 70 minutes)

The key measurements:

- win32/privatebytes went from 34M to 87M

- storage/urlclassifier3.sqlite/Cache_Used went from 0M to 40M

- storage/sqlite/pagecache went from 1M to 43M, but ignore that;  it's double-counting the urlclassifier memory (bug 653630, which has been fixed in FF 6)

So, basically the same phenomenon we've seen previously, but a much bigger and faster rate of increase.

We have to do something about this.  The memory increase is (a) excessive, (b) makes it harder to detect other leaks, and (c) damages our reputation ("Firefox starts leaking memory as soon as you start it up!").
Whiteboard: [MemShrink:P2] → [MemShrink:P1]
Did the default sqlite page cache size change?

On windows and mac we use the default sqlite page cache size, which I remember being in the few-megabyte range.  A 40M page cache is significantly bigger than I would have expected.

It should be ok to just choose a sane page cache size by default rather than using sqlite's built-in default - http://mxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#161
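Setting an explicit cache size is a one-line PRAGMA; here's a sketch using Python's sqlite3 stdlib module (the 32 KiB page size matches what Firefox configures; the 100-page cap is an arbitrary example value, and the url-classifier would issue the same statement through mozStorage rather than Python):

```python
# Sketch: cap SQLite's page cache with an explicit PRAGMA instead of
# relying on the built-in default.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA page_size = 32768")   # Firefox-style 32 KiB pages
# A positive cache_size is a page count: 100 pages * 32 KiB = ~3 MiB.
conn.execute("PRAGMA cache_size = 100")
print(conn.execute("PRAGMA cache_size").fetchone()[0])  # 100
```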

Don't know about the privatebytes changes offhand...
(In reply to comment #30)
> Did the default sqlite page cache size change?
It hasn't changed AFAIK.
Here's the output on a fresh profile with NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5 NSPR_LOG_MODULES=UrlClassifierDbService:5.  I let it sit on about:home for 10 minutes; there are 15,000 lines of output, generated in bursts.  That seems like a lot of work.  Is that expected?

I can't help but notice that http://mxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#3076 is clearly buggy -- it's consistently printing garbage after the "Got ".  I don't know if that's relevant here.

There are heaps of sequences like this:

26203904[7fd802a35480]: Processing a:140161:4:333
26203904[7fd802a35480]: Handling a chunk sized 333
26203904[7fd802a35480]: Adding 41 entries to chunk 140161 in table 4
26203904[7fd802a35480]: adding chunk 140161 took 2ms

The chunk numbers seem awfully high, but maybe they don't mean anything.
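For reference, those "Processing" lines follow the Safe Browsing v2 chunk-header shape, which a few lines can decode (a sketch; the field interpretation is my reading of the wire format, not taken from this log):

```python
# Sketch: decode a chunk header like "a:140161:4:333" -- as I
# understand the v2 wire format: add/sub chunk, chunk serial number,
# hash-prefix length in bytes, and payload size in bytes.
def parse_chunk_header(header):
    kind, num, hash_len, size = header.split(":")
    return {"type": "add" if kind == "a" else "sub",
            "chunk": int(num), "hash_len": int(hash_len),
            "bytes": int(size)}

print(parse_chunk_header("a:140161:4:333"))
# {'type': 'add', 'chunk': 140161, 'hash_len': 4, 'bytes': 333}
```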
(In reply to comment #32)
> Created attachment 540396 [details]
> Output with NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5
> NSPR_LOG_MODULES=UrlClassifierDbService:5
> 
> Here's the output on a fresh profile with
> NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5
> NSPR_LOG_MODULES=UrlClassifierDbService:5.  I let it sit on about:home for
> 10 minutes, there's 15,000 lines of output, which was generated in bursts. 
> That seems like a lot of work.  Is that expected?

Seems roughly right, the malware database we're fetching is big (was around 300,000 entries last I checked) and the protocol is kinda inefficient.

> I can't help but notice that
> http://mxr.mozilla.org/mozilla-central/source/toolkit/components/url-
> classifier/nsUrlClassifierDBService.cpp#3076 is clearly buggy -- it's
> consistently printing garbage after the "Got ".  I don't know if that's
> relevant here.

I think it's useless more than it is buggy.  The chunk is binary data at that point.

> 
> There are heaps of sequences like this:
> 
> 26203904[7fd802a35480]: Processing a:140161:4:333
> 26203904[7fd802a35480]: Handling a chunk sized 333
> 26203904[7fd802a35480]: Adding 41 entries to chunk 140161 in table 4
> 26203904[7fd802a35480]: adding chunk 140161 took 2ms
> 
> The chunk numbers seem awfully high, but maybe they don't mean anything.

Yeah, those are just serial numbers for segments of the malware database that have been increasing since 2007/2008ish.
(In reply to comment #33)
> (In reply to comment #32)
> > Created attachment 540396 [details]
> > Output with NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5
> > NSPR_LOG_MODULES=UrlClassifierDbService:5
> > 
> > Here's the output on a fresh profile with
> > NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5
> > NSPR_LOG_MODULES=UrlClassifierDbService:5.  I let it sit on about:home for
> > 10 minutes, there's 15,000 lines of output, which was generated in bursts. 
> > That seems like a lot of work.  Is that expected?
> 
> Seems roughly right, the malware database we're fetching is big (was around
> 300,000 entries last I checked) and the protocol is kinda inefficient.

(We don't grab all the entries in the first 10 minutes, but a large enough chunk that a 15,000 line file doesn't seem outside the range of possibility).
(In reply to comment #33)
> 
> Seems roughly right, the malware database we're fetching is big (was around
> 300,000 entries last I checked) and the protocol is kinda inefficient.

Are we fetching/storing the database as efficiently as we could?  It seems like we're doing a lot of work and using a lot of memory.
(In reply to comment #35)
> (In reply to comment #33)
> > 
> > Seems roughly right, the malware database we're fetching is big (was around
> > 300,000 entries last I checked) and the protocol is kinda inefficient.
> 
> Are we fetching/storing the database as efficiently as we could?  It seems
> like we're doing a lot of work and using a lot of memory.

No. We are fetching/storing the database in the most inefficient way possible. In addition to eating memory, it also stalls the browser with a disk I/O storm. Firefox frontend folks keep talking about rewriting the classifier code, but I'm not sure how much progress has been made.
(In reply to comment #36)
> 
> No. We are fetching/storing the database in the most inefficient way
> possible.

I talked to dcamp about this.  The short version is that the original code from Google was so-so, and then various large protocol changes were made at short notice, which necessitated some less-than-perfect updates to the url-classifier.  That's why it's not very good code.
Component: General → Phishing Protection
Keywords: mlkfootprint
QA Contact: general → phishing.protection
Summary: Memory usage grows in new profile, idle browser showing blank tab, due to url-classifier → url-classifier bloat (memory use grows slowly even in new profile, idle browser, showing blank tab)
Assignee: nobody → gpascutto
Gian-Carlo, any updates on this bug yet? This is a pretty bad bug from a memory usage point of view, something we're tracking as part of the MemShrink effort.
I'm trying to finish up bug 669410, which, once fully working, would allow fixing the SQLite cache to some small value. That might reduce the impact of this bug a lot.

Going further there's bug 669407 which gets rid of SQLite entirely. 

dcamp has some incomplete patches that replace the SQLite DB by an optimized store. It's orthogonal to my work so it could go in once he or I finish it.
Depends on: 669407
I stumbled onto this in browser/app/profile/firefox.js:

#ifdef MOZ_WIDGET_GTK2
#define RESTRICT_CACHEMAX
#endif
#ifdef XP_OS2
#define RESTRICT_CACHEMAX
#endif

// Maximum size of the sqlite3 cache during an update, in bytes
#ifdef RESTRICT_CACHEMAX
pref("urlclassifier.updatecachemax", 104857600);
#else
pref("urlclassifier.updatecachemax", -1);
#endif

This is strange for several reasons. 
1) Why is there a difference in behavior here between say, Linux and Windows?

2) Even though the define is called RESTRICT_CACHEMAX, if updatecachemax is "-1" the code in UrlClassifier simply won't touch the default cache size at all. See: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#3157
So it should be the SQLite default, which according to the documentation is 2000 pages. Firefox sets the page size to 32K, so this is 64M, which is still less than the 100M defined there.

This first looked like some fallout from bug 416330, but then it's still the wrong way around.

Am I missing something? Anyone have a clue what's going on?
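(As a sanity check on the arithmetic in the comment above, here is a small Python sketch. The 32K page size and the 2000-page default are taken from that comment, not verified against any particular SQLite version, so treat the numbers as assumptions:)

```python
# SQLite's cache_size is measured in pages. Per the discussion above,
# Firefox sets a 32 KiB page size, and SQLite's documented default
# cache is 2000 pages.
PAGE_SIZE_BYTES = 32 * 1024      # assumed page size from the comment
DEFAULT_CACHE_PAGES = 2000       # SQLite's documented default

default_cache_bytes = PAGE_SIZE_BYTES * DEFAULT_CACHE_PAGES
updatecachemax = 104857600       # the "restricted" pref value (100M)

print(default_cache_bytes)                   # 65536000, i.e. ~64M
print(default_cache_bytes < updatecachemax)  # True
```

So with a -1 pref the effective cache (~64M) is actually *smaller* than the 100M "restricted" cap, which is why the #ifdef looks backwards.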
This is the explanation for the previous comment:

https://bugzilla.mozilla.org/show_bug.cgi?id=430530#c26
Depends on: 669410
No longer depends on: 669407
Preliminary testing with the bug 669410 patches (v3) shows that memory usage from the url-classifier stays below 1M, except when it is updating, at which point it briefly goes up to 40M and drops back to 1M again when it's finished. I believe this behavior is much more reasonable and might be sufficient to resolve this issue for desktop.
Great news, Gian-Carlo!  When that bug lands I'll do some more measurements.
(In reply to Gian-Carlo Pascutto (:gcp) from comment #40)
> I stumbled onto this in browser/app/profile/firefox.js:
> 
> #ifdef MOZ_WIDGET_GTK2
> #define RESTRICT_CACHEMAX
> #endif
> #ifdef XP_OS2
> #define RESTRICT_CACHEMAX
> #endif
> 

> This is strange for several reasons. 
> 1) Why is there a difference in behavior here between say, Linux and Windows?

Difference?

Well, maybe I'm blind, but it does the _very same thing_ in Linux and Windows AFAICS :)

You could also write (in C/C++, at least; I can't say off the top of my head whether this syntax is valid in JS)

#if defined(MOZ_WIDGET_GTK2) || defined(XP_OS2)
#define RESTRICT_CACHEMAX
#endif

voilà :)
@Andreas: I can't say anything WRT behavior (did you measure memory use?), but the code is IMO obviously odd.
http://mxr.mozilla.org/mozilla-central/source/browser/app/profile/firefox.js#737
http://mxr.mozilla.org/mozilla-central/source/dom/plugins/base/npapi.h#49
Ahhh! Thanks for the code excerpt, Flávio. 
I stand corrected. 
So XP_OS2 does not mean "either XP or OS2" (as I thought) but XP _running_ on OS/2.

(No one ought to ask me what has happened to OS/2 in all the years since, as I would not be able to tell; nor would I have believed that you could make XP run on the old git. But with a product called VPC it is possible. You never stop learning, do you? :))
Pretty sure `XP` stands for cross-platform and is not a reference to Windows XP at all.
Andreas, XP_OS2 just means "OS/2", just like XP_MACOSX, XP_WIN, XP_BEOS, XP_UNIX, and XP_LINUX mean the obvious corresponding things.

The XP_* defines also predate the existence of Windows XP by at least 3 years, if memory serves me right.
Thanks to the both of you.

The predating bit may be true, though I could not be bothered to look up pre-2001 code for verification :-)
Bug 669410 changes the url-classifier behavior to only keep a 1.2M data structure in memory when not updating. During an update, the SQLite cache is upped to 40M, then dropped back to ~1M again when it's done (a few seconds).

I believe that essentially fixes this bug. Would be nice if anyone else can confirm.
I tried a new profile again and measured RSS every second for an hour.  It started at 81816KB and ended up at 93560KB.  Then I quit and started again and the RSS stayed much more stable, around 83512KB, presumably because no update was occurring.  So I'm not quite sure what to make of that.
Bug 673470 revamps the url-classifier and replaces the random SQLite I/O with a serial read of a much smaller store, removing the need for the I/O cache. I believe this bug is essentially fixed.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 13
I'd like to have measurements like those done earlier in the bug (e.g. RSS measured once per second for several hours) before declaring this fixed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 723153
Assignee: gpascutto → n.nethercote
I have nothing better to do with my Mac right now, I'll run the same test I did then.
I recorded Firefox 11's memory and CPU usage on my laptop for six hours, according to the same procedure as described in comments 1 and 5 (the laptop is a Mac) except that I took a data point every second instead of every five minutes.  Results are in this graph.  VSZ and RSS are in bytes; CPU is in seconds.  The x-axis of all three subplots is time in hours.

It looks like, under these conditions (i.e. completely idle, only one tab showing about:blank, no extensions of any kind) there is no longer a memory leak.  I presume the periodic spikes roughly every half hour are still due to the URL classifier, as discussed above.  I am a little bit concerned that we don't go *completely* idle -- CPU does keep creeping up -- but it's less than one minute of CPU in six hours, so it's probably not a big deal, and anyway we can worry about it in another bug :)
Thanks for doing the measurements!


> It looks like, under these conditions (i.e. completely idle, only one tab
> showing about:blank, no extensions of any kind) there is no longer a memory
> leak.  I presume the periodic spikes roughly every half hour are still due
> to the URL classifier, as discussed above.

I agree, and I am happy to declare victory.


> I am a little bit concerned that
> we don't go *completely* idle -- CPU does keep creeping up -- but it's less
> than one minute of CPU in six hours, so it's probably not a big deal, and
> anyway we can worry about it in another bug :)

I was horrified at first because I thought the y-axis was CPU % :)

It looks to me like it's doing a lot of work the first few times and then just a small amount each time after that.  That sounds reasonable to me.  Investigating it is certainly outside the scope of this bug.
Assignee: n.nethercote → gpascutto
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Filed follow-up bug 735684 for automated testing for this kind of slow leak, and bug 735690 for the CPU usage when idle.
Blocks: 736394
The verified-fixed testing was done against Firefox 11 with the original patches from bug 669410, so let's indicate that correctly.
Target Milestone: Firefox 13 → Firefox 11
Product: Firefox → Toolkit