Last Comment Bug 650649 - url-classifier bloat (memory use grows slowly even in new profile, idle browser, showing blank tab)
: url-classifier bloat (memory use grows slowly even in new profile, idle brows...
Status: RESOLVED FIXED
[MemShrink:P1]
: footprint
Product: Toolkit
Classification: Components
Component: Safe Browsing (show other bugs)
: Trunk
: All All
: -- normal with 6 votes (vote)
: Firefox 11
Assigned To: Gian-Carlo Pascutto [:gcp]
:
Mentors:
: 468003 636791 (view as bug list)
Depends on: 669410
Blocks: 723153 mlk-fx5 736394
  Show dependency treegraph
 
Reported: 2011-04-17 10:13 PDT by Zack Weinberg (:zwol)
Modified: 2014-05-27 12:25 PDT (History)
49 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
leak monitor report before "reclaim" (646.45 KB, text/plain)
2011-04-17 10:16 PDT, Zack Weinberg (:zwol)
no flags Details
leak monitor report after "reclaim" (86.57 KB, text/plain)
2011-04-17 10:19 PDT, Zack Weinberg (:zwol)
no flags Details
RSS over time (Mac) (139.16 KB, image/png)
2011-04-17 14:40 PDT, Zack Weinberg (:zwol)
no flags Details
RSS over time (Mac) (urlclassifier.updatetime=3600) (134.91 KB, image/png)
2011-04-18 13:16 PDT, Zack Weinberg (:zwol)
no flags Details
about:memory at start and end of overnight run (8.38 KB, text/plain)
2011-05-26 16:57 PDT, Nicholas Nethercote [:njn]
no flags Details
results from Brent Garber (96.31 KB, image/png)
2011-06-16 16:55 PDT, Nicholas Nethercote [:njn]
no flags Details
Output with NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5 NSPR_LOG_MODULES=UrlClassifierDbService:5 (873.15 KB, text/plain)
2011-06-19 22:23 PDT, Nicholas Nethercote [:njn]
no flags Details
six hours of VSZ, RSS, and CPU measurements, Firefox 11/Mac (60.53 KB, application/x-gzip)
2012-03-13 21:38 PDT, Zack Weinberg (:zwol)
no flags Details
six hours of VSZ, RSS, and CPU measurements, Firefox 11/Mac, plotted (159.92 KB, image/png)
2012-03-13 21:44 PDT, Zack Weinberg (:zwol)
no flags Details

Description Zack Weinberg (:zwol) 2011-04-17 10:13:08 PDT
On two different Windows 7 machines, I observe a slow but continuous memory leak when the browser is completely idle and showing a blank tab. This happens even in a completely fresh profile. Specific steps to reproduce:

* Create a new profile.
* In the new profile, enter add-on manager and disable all plugins and system-provided add-ons (one of these computers has several such, the other only has Flash and Acrobat).
* Optionally, install dbaron's leak monitor add-on [https://addons.mozilla.org/en-US/firefox/addon/leak-monitor/] in the new profile.

* Restart Firefox.
* Open a new (blank) tab; close the about:home tab.
* Open Task Manager and select the process list.  At this point, firefox.exe should be using about 38MB of RAM.

* Leave the computer untouched for at least an hour.  When you return, firefox.exe will be reported as consuming close to 60MB of RAM.  If you continue the experiment, the Firefox process continues to grow at a rate of approximately 20MB per hour.

The person who regularly uses these computers informs me that the leak is much worse if the browser is actually used for anything, with Firefox growing to consume all available RAM within a few hours; and that reloading a single page repeatedly seems to be especially bad. Some of that may be extensions' or pages' fault, of course.

Over on Mac and Linux, I don't observe the same problem (but I haven't yet tried exactly the above experiment) so it could well be a Windows-specific issue.  I'm filing this under Tabbed Browser because dbaron's leak monitor fingered browser.js for at least some of the leaks (details to follow).
Comment 1 Zack Weinberg (:zwol) 2011-04-17 10:16:39 PDT
Created attachment 526600 [details]
leak monitor report before "reclaim"

On one of the aforementioned computers, I installed dbaron's leak monitor in the new profile, and then after allowing the empty tab to leak for an hour, I opened a new window and closed the old one.  The leak monitor promptly popped up this leak report.
Comment 2 Zack Weinberg (:zwol) 2011-04-17 10:19:23 PDT
Created attachment 526602 [details]
leak monitor report after "reclaim"

And this is the second leak report it produced, after "some objects were reclaimed" (I assume that means after JS GC and/or XPCOM cycle collection had a chance to run).  It doesn't tell me much, but I see a whole lot of references to chrome://browser/content/browser.js in there.
Comment 3 Zack Weinberg (:zwol) 2011-04-17 10:20:05 PDT
cc:ing the usual suspects.
Comment 4 Zack Weinberg (:zwol) 2011-04-17 10:21:27 PDT
Oh, I should have said: all this was with the Windows 4.0 release build.
Comment 5 Zack Weinberg (:zwol) 2011-04-17 14:31:37 PDT
A similar overall increase in process size occurs with 4.0.1 on my Mac (10.6.7).  However, this time the leak monitor didn't catch anything.

$ while :; do sleep 300; ps xuc | grep firefox-bin; done
USER   PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
zack   360   0.0  2.5  3117464 103708   ??  S    12:22PM   0:09.81 firefox-bin
zack   360   0.0  2.3  2969060  96736   ??  S    12:22PM   0:09.85 firefox-bin
zack   360   0.0  2.3  2969584  96752   ??  S    12:22PM   0:09.90 firefox-bin
zack   360   0.0  2.3  2969060  96776   ??  S    12:22PM   0:10.14 firefox-bin
zack   360   0.0  2.3  2969060  96776   ??  S    12:22PM   0:10.41 firefox-bin
zack   360   0.0  2.3  2969060  96788   ??  S    12:22PM   0:10.64 firefox-bin
zack   360   0.0  2.3  2969060  96788   ??  S    12:22PM   0:10.67 firefox-bin
zack   360   0.0  2.5  2977776 103148   ??  S    12:22PM   0:17.25 firefox-bin
zack   360   0.0  2.5  2977252 103128   ??  S    12:22PM   0:17.34 firefox-bin
zack   360   0.0  2.5  2977252 103128   ??  S    12:22PM   0:17.61 firefox-bin
zack   360   0.0  2.5  2977252 103128   ??  S    12:22PM   0:17.88 firefox-bin
zack   360   0.0  2.5  2977252 103128   ??  S    12:22PM   0:18.15 firefox-bin
zack   360   0.1  2.5  2977252 103128   ??  S    12:22PM   0:18.41 firefox-bin
zack   360   0.4  2.5  2977776 105220   ??  S    12:22PM   0:19.97 firefox-bin
zack   360   1.0  2.6  2985968 109508   ??  S    12:22PM   0:24.47 firefox-bin
zack   360   0.0  2.6  2985444 109484   ??  S    12:22PM   0:24.74 firefox-bin
zack   360   0.0  2.6  2985444 109484   ??  S    12:22PM   0:25.00 firefox-bin
zack   360   0.0  2.6  2985444 109484   ??  S    12:22PM   0:25.28 firefox-bin
zack   360   0.0  2.6  2985444 109484   ??  S    12:22PM   0:25.55 firefox-bin
zack   360   0.0  2.6  2985968 109496   ??  S    12:22PM   0:25.81 firefox-bin
zack   360   0.0  2.7  2985968 112176   ??  S    12:22PM   0:27.73 firefox-bin
zack   360   0.0  2.8  2985968 115568   ??  S    12:22PM   0:32.84 firefox-bin
zack   360   0.0  2.8  2985444 115544   ??  S    12:22PM   0:33.11 firefox-bin

The steady increase in CPU time consumed is also a bit of a concern; it's showing one window, containing one tab, with no contents (empty address bar, not about:blank; I would like to think those are the same thing, but ... :-/)  It shouldn't have anything to burn CPU on.

That appears to be a leak rate of 180KB/minute, by the way.
Comment 6 Zack Weinberg (:zwol) 2011-04-17 14:40:00 PDT
Created attachment 526630 [details]
RSS over time (Mac)

Because I am that kind of nerd, I plotted the RSS measurements from my Mac.  The result is veeery interesting.  It looks like something happens every 30 minutes, more or less, that makes memory usage go up by a big chunk.  In between, nothing.
Comment 7 Marco Bonardo [::mak] 2011-04-18 07:19:13 PDT
just a poor man's guess based on searching mxr, does that time change if you set urlclassifier.updateinterval to a really large value, since the default seems to be update every 30 mins.
Comment 8 Zack Weinberg (:zwol) 2011-04-18 13:16:56 PDT
Created attachment 526806 [details]
RSS over time (Mac) (urlclassifier.updatetime=3600)

I set urlclassifier.updatetime to 3600 and let the browser idle for five hours, monitoring RSS every minute; this is the result.  It looks like changing that parameter did have _some_ effect, but it also looks like that's not the _only_ thing leaking memory at 30min intervals.

Possibly this should become a tracking bug?  From this data, there's at least two problems on Mac, and the original syndrome on Windows is different -- from eyeballing the numbers in Task Manager it looks like memory usage goes up continuously there.  (I would be happy to produce a graph if someone tells me how to record a log of RSS every minute in Windows.)
Comment 9 Marco Bonardo [::mak] 2011-04-19 05:12:10 PDT
hm, notice that updatetime and updateinterval are different things.
updateinterval is the time between updates, you didn't change it from the original 30 mins. updatetime is the the time between each chunk during an update.
Comment 10 Marco Bonardo [::mak] 2011-04-19 05:15:01 PDT
Also, with a current nightly you could check in about:memory the sqlite memory usage (esp. storage/urlclassifier3.sqlite/Cache_Used) for urlclassifier3.sqlite since with continuous updates I'd expect the cache to increase naturally.
Comment 11 Nicholas Nethercote [:njn] 2011-05-03 23:44:04 PDT
I tried this on a Linux64 box under Massif.  Here's the graph for a 1,000
second run:

    MB
27.24^                                                                      # 
     |                                                                      # 
     |                               @   ::::       :::  :   :::::::   @@@@ # 
     |                     :::       @   :          :    :   :    :    @    #:
     |  ::::::@@::::::@@:@::  :@:::::@::::   :@::::::  :::::::    : :::@   :#:
     |  ::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |  ::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |  ::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     | @::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
     |:@::::::@ :: : :@ :@::  :@: : :@   :   :@:   ::  ::::  :    : : :@   :#:
   0 +----------------------------------------------------------------------->s
     0                                                                   1050.9

You can see a gradual increase, with some drops along the way (GCs, I
guess?)  I looked at the detailed snapshots;  with the exception of some
very minor variations the only thing that is changing significantly is this
allocation point:

94.22% (26,914,780B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
o1o> 25.06% (7,159,760B): pcache1Alloc (sqlite3.c:33629)
 o2o> 22.08% (6,307,584B): pcache1Fetch (sqlite3.c:33692)
  o3o> 22.08% (6,307,584B): sqlite3PcacheFetch (sqlite3.c:33046)
   o4o> 22.08% (6,307,584B): sqlite3PagerAcquire (sqlite3.c:39873)
    o5o> 22.08% (6,307,584B): btreeGetPage (sqlite3.c:47038)
     o6o> 11.91% (3,401,472B): getAndInitPage (sqlite3.c:47091)
      o7o> 08.21% (2,344,704B): moveToChild (sqlite3.c:49594)
       o8o> 08.09% (2,311,680B): sqlite3BtreeMovetoUnpacked (sqlite3.c:50054)
        o9o> 07.75% (2,212,608B): btreeMoveto (sqlite3.c:46159)
         o10> 07.75% (2,212,608B): sqlite3BtreeInsert (sqlite3.c:52180)
          o11> 07.75% (2,212,608B): sqlite3VdbeExec (sqlite3.c:64977)
           o12> 07.75% (2,212,608B): sqlite3_step (sqlite3.c:59036)
            o13> 07.75% (2,212,608B): mozilla::storage::stepStmt() (mozStoragePrivateHelpers.cpp:288)
             o14> 07.75% (2,212,608B): mozilla::storage::Statement::ExecuteStep() (mozStorageStatement.cpp:617)
              o15> 07.75% (2,212,608B): mozilla::storage::Statement::Execute() (mozStorageStatement.cpp:582)
               o16> 07.75% (2,212,608B): nsUrlClassifierStore::WriteEntry() (nsUrlClassifierDBService.cpp:1967)
                o17> 07.75% (2,212,608B): nsUrlClassifierDBServiceWorker::AddChunk() (nsUrlClassifierDBService.cpp:2563)
                 o18> 07.75% (2,212,608B): nsUrlClassifierDBServiceWorker::ProcessChunk() (nsUrlClassifierDBService.cpp:2693)
                  o19> 07.75% (2,212,608B): nsUrlClassifierDBServiceWorker::UpdateStream() (nsUrlClassifie
rDBService.cpp:3082)
                   o20> 07.75% (2,212,608B): NS_InvokeByIndex_P ()
                    o21> 07.75% (2,212,608B): nsProxyObjectCallInfo::Run() (nsProxyEvent.cpp:182)
                     o22> 07.75% (2,212,608B): nsThread::ProcessNextEvent() (nsThread.cpp:618)
                      o23> 07.75% (2,212,608B): NS_ProcessNextEvent_P() (nsThreadUtils.cpp:250)
                       o24> 07.75% (2,212,608B): nsThread::ThreadFunc() (nsThread.cpp:273)
                        o25> 07.75% (2,212,608B): _pt_root (ptthread.c:187)

For example, about halfway through this accounts for about 5.5MB, by the end
it accounts for 7.2MB.  This seems to confirm Marco's idea that
urlclassifier is causing the gradual memory usage increase.

Can someone explain briefly what the urlclassifier is?  Is there a good
reason why it should be causing memory usage to creep up slowly?
Comment 12 Boris Zbarsky [:bz] 2011-05-04 11:20:14 PDT
Urlclassifier is the thing that we use for phishing protection.  It basically fetches data from Google's anti-phishing blacklist and stores it in a database.  Then when we need to load a page we check whether the url is in that database, and if it is we show the phishing error page.
Comment 13 Nicholas Nethercote [:njn] 2011-05-04 14:05:53 PDT
bz, is the data fetched in small chunks periodically?  Comment 7 and comment 9 indicate this happens every 30 minutes, but I'm seeing increases on much shorter timescales.  Is that expected?  Is this growth likely to reach some kind of limit?
Comment 14 Boris Zbarsky [:bz] 2011-05-04 18:15:39 PDT
Hmm.  It's been a while, but I think based on reading code just now that if there are updates to download the url classifier will spend 5 seconds doing that, then 60 seconds sleeping, then repeat until the update is fully downloaded.

I believe 45 minutes is how often we check whether there's an update available...  Again, I could be totally misreading this code.  These various frobs aren't documented very clearly.  :(
Comment 15 Andreas Eibach 2011-05-05 20:14:41 PDT
pardon for jumping in, but the specification "Version 4.0 Branch" in the header is a little too constraining, since I can reproduce above issues easily in 6.0a1pre nightlies.
Comment 16 Andreas Eibach 2011-05-05 20:17:42 PDT
Zack:
"A similar overall increase in process size occurs with 4.0.1 on my Mac
(10.6.7).  However, this time the leak monitor didn't catch anything."

I had this too (though on Windows). It was totally weird: it stayed perfectly for almost exactly 3 minutes, then OUT OF THE BLUE mem leaking started all of a sudden as if someone had flipped a lever :)
Comment 17 Jesse Ruderman 2011-05-05 20:40:50 PDT
Maybe sqlite should avoid caching what it writes into urlclassifier, since it isn't especially likely that new entries will be hit? (Or clear the cache after each urlclassifier database update.)
Comment 18 Shawn Wilsher :sdwilsh 2011-05-06 09:17:19 PDT
(In reply to comment #17)
> Maybe sqlite should avoid caching what it writes into urlclassifier, since it
> isn't especially likely that new entries will be hit? (Or clear the cache after
> each urlclassifier database update.)
What data do you have to indicate that new entries are unlikely to be hit?
Comment 19 Jesse Ruderman 2011-05-06 12:57:31 PDT
I'm assuming we only care about speed in the non-attack case, and that hash collisions between non-attack pages and urlclassifier entries are effectively random. Am I missing something?
Comment 20 Shawn Wilsher :sdwilsh 2011-05-06 18:56:44 PDT
(In reply to comment #19)
> I'm assuming we only care about speed in the non-attack case, and that hash
> collisions between non-attack pages and urlclassifier entries are effectively
> random. Am I missing something?
We still have to do a lookup on every load, so clearing the cache is going to be expensive.
Comment 21 Nicholas Nethercote [:njn] 2011-05-26 16:57:40 PDT
Created attachment 535513 [details]
about:memory at start and end of overnight run

I did an overnight run on Linux64, logging VSIZE and RSS every second (with a few gaps).  Overall they both trended upwards slowly, sometimes going down for a while.  At start the numbers were 535192/80144 KiB, at end they were 569996/96676 KiB.  I also saw spikes in the RSIZE up to 105000 or 110000 KiB on some of the 30 minute marks, so something was happening there.

I've attached about:memory measurements from the start and end.  Here are the most interesting parts:

Start:
37,138,916 B (100.0%) -- explicit
├──20,005,162 B (53.87%) -- heap-unclassified
...
├───2,644,240 B (07.12%) -- storage
│   └──2,644,240 B (07.12%) -- sqlite
│      ├────867,376 B (02.34%) -- places.sqlite
│      │    ├──696,056 B (01.87%) -- cache-used
│      │    ├───96,256 B (00.26%) -- stmt-used
│      │    └───75,064 B (00.20%) -- schema-used

End:
48,258,882 B (100.0%) -- explicit
├──28,258,021 B (58.56%) -- heap-unclassified
...
├───4,295,200 B (08.90%) -- storage
│   └──4,295,200 B (08.90%) -- sqlite
│      ├──2,008,592 B (04.16%) -- places.sqlite
│      │  ├──1,686,536 B (03.49%) -- cache-used
│      │  ├────245,544 B (00.51%) -- stmt-used
│      │  └─────76,512 B (00.16%) -- schema-used


So an 11MB increase in explicit allocations, 8.2MB of that on the heap but unclassified, 1.7MB in sqlite and most of that in sqlite.

We really need to understand what the browser is doing when it's idle.
Comment 22 Nicholas Nethercote [:njn] 2011-05-29 18:15:23 PDT
> We really need to understand what the browser is doing when it's idle.

Some info from
https://wiki.mozilla.org/Phishing_Protection:_Design_Documentation, which is
a page that hasn't been modified since March 2008:

> The first update request happens at a random interval between 0-5 minutes
> after the browser starts. The second update request happens between 15-45
> minutes later. After that, each update is once every 30 minutes.
>
> If the client receives an error during update, it tries again in a minute.
> If it receives three errors in a row, it skips updates until at least 60
> minutes have passed before trying again. If it then receives another (4th)
> error, skips updates for the next 180 minutes and if it receives another
> (5th) error, it skips updates for the next 360 minutes. It will continue
> to check once every 360 minutes until the server responds with a success
> message. The current implementation doesn't change the 30 minute timer
> interval, it just involves skipping updates until the back off time has
> elapsed. 

So it really sounds like the short (2 or 3 second) spikes in memory usage
every 30 minutes or so are the url-classifier getting an update.

But the url-classifier is also doing heaps of local processing in the
meantime, as soon as the browser starts up.  I used a modified version of
Valgrind to obtain a stack trace of every malloc/free call done by the
browser, and looked at them for a period of a minute or so shortly after
start-up (fresh profile, showing only a blank page).  Here are the 10 most
common stack entries seen for a chunk of that time (the numbers at the start
are the popularity, frequency count, percentage of total, and cumulative
percentage):

( 1) 230770 ( 7.7%,  7.7%): ==17034==    at 0x40256B5: VALGRIND_PRINTF_BACKTRACE (valgrind.h:4477)

( 2) 129279 ( 4.3%, 12.0%): ==17034==    by 0x5DCB6F3: nsUrlClassifierDBServiceWorker::SubChunk(unsigned int, unsigned int, nsTArray<nsUrlClassifierEntry, nsTArrayDefaultAllocator>&) (nsUrlClassifierDBService.cpp:2655)

( 3) 115104 ( 3.8%, 15.8%): ==17034==    by 0x4025FF9: free (vg_replace_malloc.c:370)

( 4) 115084 ( 3.8%, 19.7%): ==17034==    by 0x4026903: malloc (vg_replace_malloc.c:240)

( 5) 113580 ( 3.8%, 23.5%): ==17034==    by 0x5DCD59D: nsUrlClassifierDBServiceWorker::ProcessChunk(int*) (nsUrlClassifierDBService.cpp:2696)

( 6)  93749 ( 3.1%, 26.6%): ==17034==    by 0x405A1B4: moz_free (mozalloc.cpp:94)

( 7)  84157 ( 2.8%, 29.4%): ==17034==    by 0x5DC1506: nsUrlClassifierStore::WriteEntry(nsUrlClassifierEntry&) (nsUrlClassifierDBService.cpp:1965)

( 8)  77313 ( 2.6%, 32.0%): ==17034==    by 0x5DC2046: nsUrlClassifierSubStore::BindStatement(nsUrlClassifierEntry const&, mozIStorageStatement*) (nsUrlClassifierDBService.cpp:982)

( 9)  69374 ( 2.3%, 34.3%): ==17034==    by 0x5DCD6F9: nsUrlClassifierDBServiceWorker::UpdateStream(nsACString_internal const&) (nsUrlClassifierDBService.cpp:3083)

(10)  68660 ( 2.3%, 36.6%): ==17034==    by 0x405A197: moz_xmalloc (mozalloc.cpp:100)


1, 3, 4, 6, and 10 are malloc and/or free, or wrappers thereof.  2, 5, 7, 8 and 9 are all url-classifier steps.

Here is a typical stack trace that includes all of 2, 5, 7, 8 and 9, one
which appears extremely frequently:

**17034** malloc(64) = 0x205E7C20
==17034==    at 0x40256B5: VALGRIND_PRINTF_BACKTRACE (valgrind.h:4477)
==17034==    by 0x4026903: malloc (vg_replace_malloc.c:240)
==17034==    by 0x405A197: moz_xmalloc (mozalloc.cpp:100)
==17034==    by 0x5EA01A4: mozilla::storage::Statement::getParams() (mozalloc.h:
229)
==17034==    by 0x5EA038B: mozilla::storage::Statement::BindNullByIndex(unsigned
 int) (mozStorageStatement.cpp:899)
==17034==    by 0x5DC1F05: nsUrlClassifierStore::BindStatement(nsUrlClassifierEn
try const&, mozIStorageStatement*) (nsUrlClassifierDBService.cpp:641)
==17034==    by 0x5DC2046: nsUrlClassifierSubStore::BindStatement(nsUrlClassifie
rEntry const&, mozIStorageStatement*) (nsUrlClassifierDBService.cpp:982)
==17034==    by 0x5DC1506: nsUrlClassifierStore::WriteEntry(nsUrlClassifierEntry
&) (nsUrlClassifierDBService.cpp:1965)
==17034==    by 0x5DCB6F3: nsUrlClassifierDBServiceWorker::SubChunk(unsigned int
, unsigned int, nsTArray<nsUrlClassifierEntry, nsTArrayDefaultAllocator>&) (nsUr
lClassifierDBService.cpp:2655)
==17034==    by 0x5DCD59D: nsUrlClassifierDBServiceWorker::ProcessChunk(int*) (n
sUrlClassifierDBService.cpp:2696)
==17034==    by 0x5DCD6F9: nsUrlClassifierDBServiceWorker::UpdateStream(nsACStri
ng_internal const&) (nsUrlClassifierDBService.cpp:3083)
==17034==    by 0x6107ED1: NS_InvokeByIndex_P (xptcinvoke_x86_64_unix.cpp:195)

Now, it appears that the allocations done for this trace (64 bytes in this
case) are freed very shortly after.  The cases where allocations aren't
freed correspond to stack traces like the one in comment 11.

I know very little about the url-classifier but if I had to guess I'd say
that we're processing a downloaded update, making lots of deletions from the
database (they're deletions because I'm seeing SubChunk all the time but
never AddChunk), and every so often SQLite has to rearrange a B-tree and
this requires allocating some memory.  It's disconcerting that SQLite's
memory use is growing even though we're seemingly deleting database entries.

For a comparison, I tried a second fresh profile where I chose to "work
offline".  Its memory usage did not creep up, and there were no
allocations/deallocations from the url-classifier.  After a minute or so I
turned off "work offline" and there was a brief flurry of allocations, but
not from the url-classifier, and then it held pretty steady again, with
occasional allocations/deallocations from what looks like the main event
loop.

For a second comparison, I tried another fresh profile that had online
access, but "block reported attack sites" and "block reported web forgeries"
were both turned off in the security preferences.  Again, url-classifier
didn't show up in the allocation trace at all, and once start-up was
complete the allocation rate was very slow, and the allocations were from
the main event loop.

So, conclusions?  Well, maybe the url-classifier is working exactly as it's
supposed to.  (sdwilsh, any info you could provide on that front would be
welcome.)  Though I would have though that the memory usage would level out
fairly quickly instead of continuing to creep up for hours and hours (as
described in comment 21).  Unfortunately, it sure complicates things in
terms of identifying and fixing real leaks;  users assume that if the
browser is (apparently) idle that it's not doing anything, and so any rise
in memory usage is assumed to be a leak.  Firefox's reputation for leaks
doesn't help here.
Comment 23 Shawn Wilsher :sdwilsh 2011-05-29 18:31:04 PDT
(In reply to comment #22)
> So, conclusions?  Well, maybe the url-classifier is working exactly as it's
> supposed to.  (sdwilsh, any info you could provide on that front would be
> welcome.)  Though I would have though that the memory usage would level out
> fairly quickly instead of continuing to creep up for hours and hours (as
> described in comment 21).  Unfortunately, it sure complicates things in
> terms of identifying and fixing real leaks;  users assume that if the
> browser is (apparently) idle that it's not doing anything, and so any rise
> in memory usage is assumed to be a leak.  Firefox's reputation for leaks
> doesn't help here.
Unfortunatley, I don't know much about it.  mmmulani is rewriting it last I knew though.
Comment 24 Boris Zbarsky [:bz] 2011-05-29 19:05:33 PDT
dcamp might know something.
Comment 25 Mehdi Mulani [:mmm] (I don't check this) 2011-05-29 19:11:17 PDT
(In reply to comment #22)
> So, conclusions?  Well, maybe the url-classifier is working exactly as it's
> supposed to.  (sdwilsh, any info you could provide on that front would be
> welcome.)  Though I would have though that the memory usage would level out
> fairly quickly instead of continuing to creep up for hours and hours (as
> described in comment 21).  Unfortunately, it sure complicates things in
> terms of identifying and fixing real leaks;  users assume that if the
> browser is (apparently) idle that it's not doing anything, and so any rise
> in memory usage is assumed to be a leak.  Firefox's reputation for leaks
> doesn't help here.

From what I understood of the events, this seems like expected behaviour.
On each update, url-classifier adds and removes entries from the database.

The code for updating chunks in url-classifier has been stable for quite some time, so this might be a regression caused by something else? (The only major change was to code unrelated to the SQLite usage.) Could it also be related to the current version of SQLite?
Comment 26 Mehdi Mulani [:mmm] (I don't check this) 2011-05-29 19:21:00 PDT
To answer more specific questions:
Are deletions expected/what is the make up of an update?
Updates contain roughly the same number of addChunks and subChunks. That is, there are always deletions in each update. I am not sure about the size of deletions per subChunk, though.

When does the browser update the database?
On a fresh profile, it is timed to be random over the first five minutes. On a dirty profile, after 30 seconds.
The relevant code is here: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/content/listmanager.js#254
Comment 27 Justin Dolske [:Dolske] 2011-05-31 17:03:08 PDT
ISTR from another bug that the memory increase from urlclassifier was just a result of sqlite caches filling up. [I don't know how strong that claim is, though.] Is there an easyish way to disable / limit caching in the DB to see if that's the case?
Comment 28 Nicholas Nethercote [:njn] 2011-06-16 16:37:23 PDT
*** Bug 636791 has been marked as a duplicate of this bug. ***
Comment 29 Nicholas Nethercote [:njn] 2011-06-16 16:55:35 PDT
Created attachment 539940 [details]
results from Brent Garber

I'm copying this measurement over from bug 636791.

Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0

Steps to Reproduce:
1. Start Firefox in Safe Mode
2. Load about:memory
3. Do Nothing (for about 70 minutes)

The key measurements:

- win32/privatebytes went from 34M to 87M

- storage/urlclassifier3.sqlite/Cache_Used went from 0M to 40M

- storage/sqlite/pagecache went from 1M to 43M, but ignore that;  it's double-counting the urlclassifier memory (bug 653630, which has been fixed in FF 6)

So, basically the same phenomenon we've seen previously, but a much bigger and faster rate of increase.

We have to do something about this.  The memory increase is (a) excessive, (b) makes it harder to detect other leaks, and (c) damages our reputation ("Firefox starts leaking memory as soon as you start it up!").
Comment 30 Dave Camp (:dcamp) 2011-06-16 17:41:43 PDT
Did the default sqlite page cache size change?

On windows and mac we use the default sqlite page cache size, which I remember being in the few-megabyte range.  A 40M page cache is significantly bigger than I would have expected.

It should be ok to just choose a sane page cache size by default rather than using sqlite's built-in default - http://mxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#161

Don't know about the privatebytes changes offhand...
Comment 31 Shawn Wilsher :sdwilsh 2011-06-16 19:02:37 PDT
(In reply to comment #30)
> Did the default sqlite page cache size change?
It hasn't changed AFAIK.
Comment 32 Nicholas Nethercote [:njn] 2011-06-19 22:23:31 PDT
Created attachment 540396 [details]
Output with NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5 NSPR_LOG_MODULES=UrlClassifierDbService:5

Here's the output on a fresh profile with NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5 NSPR_LOG_MODULES=UrlClassifierDbService:5.  I let it sit on about:home for 10 minutes, there's 15,000 lines of output, which was generated in bursts.  That seems like a lot of work.  Is that expected?

I can't help but notice that http://mxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#3076 is clearly buggy -- it's consistently printing garbage after the "Got ".  I don't know if that's relevant here.

There are heaps of sequences like this:

6203904[7fd802a35480]: Processing a:140161:4:333
26203904[7fd802a35480]: Handling a chunk sized 333
26203904[7fd802a35480]: Adding 41 entries to chunk 140161 in table 4
26203904[7fd802a35480]: adding chunk 140161 took 2ms

The chunk numbers seem awfully high, but maybe they don't mean anything.
Comment 33 Dave Camp (:dcamp) 2011-06-20 08:06:09 PDT
(In reply to comment #32)
> Created attachment 540396 [details]
> Output with NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5
> NSPR_LOG_MODULES=UrlClassifierDbService:5
> 
> Here's the output on a fresh profile with
> NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5
> NSPR_LOG_MODULES=UrlClassifierDbService:5.  I let it sit on about:home for
> 10 minutes, there's 15,000 lines of output, which was generated in bursts. 
> That seems like a lot of work.  Is that expected?

Seems roughly right, the malware database we're fetching is big (was around 300,000 entries last I checked) and the protocol is kinda inefficient.

> I can't help but notice that
> http://mxr.mozilla.org/mozilla-central/source/toolkit/components/url-
> classifier/nsUrlClassifierDBService.cpp#3076 is clearly buggy -- it's
> consistently printing garbage after the "Got ".  I don't know if that's
> relevant here.

I think it's useless more than it is buggy.  The chunk is binary data at that point.

> 
> There are heaps of sequences like this:
> 
> 6203904[7fd802a35480]: Processing a:140161:4:333
> 26203904[7fd802a35480]: Handling a chunk sized 333
> 26203904[7fd802a35480]: Adding 41 entries to chunk 140161 in table 4
> 26203904[7fd802a35480]: adding chunk 140161 took 2ms
> 
> The chunk numbers seem awfully high, but maybe they don't mean anything.

Yeah, those are just serial numbers for segments of the malware database that have been increasing since 2007/2008ish.
Comment 34 Dave Camp (:dcamp) 2011-06-20 08:06:54 PDT
(In reply to comment #33)
> (In reply to comment #32)
> > Created attachment 540396 [details]
> > Output with NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5
> > NSPR_LOG_MODULES=UrlClassifierDbService:5
> > 
> > Here's the output on a fresh profile with
> > NSPR_LOG_MODULES=UrlClassifierStreamUpdater:5
> > NSPR_LOG_MODULES=UrlClassifierDbService:5.  I let it sit on about:home for
> > 10 minutes, there's 15,000 lines of output, which was generated in bursts. 
> > That seems like a lot of work.  Is that expected?
> 
> Seems roughly right, the malware database we're fetching is big (was around
> 300,000 entries last I checked) and the protocol is kinda inefficient.

(We don't grab all the entries in the first 10 minutes, but a large enough chunk that a 15,000 line file doesn't seem outside the range of possibility).
Comment 35 Nicholas Nethercote [:njn] 2011-06-20 18:05:35 PDT
(In reply to comment #33)
> 
> Seems roughly right, the malware database we're fetching is big (was around
> 300,000 entries last I checked) and the protocol is kinda inefficient.

Are we fetching/storing the database as efficiently as we could?  It seems like we're doing a lot of work and using a lot of memory.
Comment 36 (dormant account) 2011-06-20 18:08:43 PDT
(In reply to comment #35)
> (In reply to comment #33)
> > 
> > Seems roughly right, the malware database we're fetching is big (was around
> > 300,000 entries last I checked) and the protocol is kinda inefficient.
> 
> Are we fetching/storing the database as efficiently as we could?  It seems
> like we're doing a lot of work and using a lot of memory.

No. We are fetching/storing the database in the most inefficient way possible. In addition to eating memory, it also stalls the browser with a disk io storm. Firefox frontend folks keep talking of rewriting the classifier code, but I'm not sure how much progress has been made
Comment 37 Nicholas Nethercote [:njn] 2011-06-20 20:48:44 PDT
(In reply to comment #36)
> 
> No. We are fetching/storing the database in the most inefficient way
> possible.

I talked to dcamp about this.  The short version is that the original code from Google was so-so, and then various large protocol changes were made at short notice, which necessitated some less-than-perfect updates to the url-classifier.  That's why it's not very good code.
Comment 38 Johnny Stenback (:jst, jst@mozilla.com) 2011-07-11 15:57:16 PDT
Gian-Carlo, any updates on this bug yet? This is a pretty bad bug from a memory usage point of view, something we're tracking as part of the MemShrink effort.
Comment 39 Gian-Carlo Pascutto [:gcp] 2011-07-11 22:22:36 PDT
I'm trying to finish up bug 669410, which once fully working, would allow fixing the SQLite cache to some small value. That might reduce the impact of this bug a lot.

Going further there's bug 669407 which gets rid of SQLite entirely. 

dcamp has some incomplete patches that replace the SQLite DB by an optimized store. It's orthogonal to my work so it could go in once he or I finish it.
Comment 40 Gian-Carlo Pascutto [:gcp] 2011-07-20 03:54:58 PDT
I stumbled onto this in browser/app/profile/firefox.js:

#ifdef MOZ_WIDGET_GTK2
#define RESTRICT_CACHEMAX
#endif
#ifdef XP_OS2
#define RESTRICT_CACHEMAX
#endif

// Maximum size of the sqlite3 cache during an update, in bytes
#ifdef RESTRICT_CACHEMAX
pref("urlclassifier.updatecachemax", 104857600);
#else
pref("urlclassifier.updatecachemax", -1);
#endif

This is strange for several reasons. 
1) Why is there a difference in behavior here between say, Linux and Windows?

2) Even though the setting is called RESTRICT_CACHEMAX, if updatecachemax is "-1" the code in UrlClassifier simply won't touch the default cache size at all. See: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#3157
So it should be the SQLite default, which according to the documentation is 2000 pages. Firefix sets it pages to 32K, so this is 64M. This is still less than the 100M defined there.

This first looked like some fallout from bug 416330, but then it's still the wrong way around.

Am I missing something? Anyone have a clue what's going on?
Comment 41 Gian-Carlo Pascutto [:gcp] 2011-07-29 12:03:45 PDT
This is the explanation for the previous comment:

https://bugzilla.mozilla.org/show_bug.cgi?id=430530#c26
Comment 42 Gian-Carlo Pascutto [:gcp] 2011-08-02 13:14:33 PDT
Preliminary testing with bug 669410 patches v3 show that memory usage from url-classifier stays below 1M, except when it is updating, at which it will briefly go up to 40M and drop to 1M again when it's finished. I believe this behavior is much more reasonable and might be sufficient to resolve this issue for desktop.
Comment 43 Nicholas Nethercote [:njn] 2011-08-02 16:23:18 PDT
Great news, Gian-Carlo!  When that bug lands I'll do some more measurements.
Comment 44 Andreas Eibach 2011-09-05 17:00:32 PDT
(In reply to Gian-Carlo Pascutto (:gcp) from comment #40)
> I stumbled onto this in browser/app/profile/firefox.js:
> 
> #ifdef MOZ_WIDGET_GTK2
> #define RESTRICT_CACHEMAX
> #endif
> #ifdef XP_OS2
> #define RESTRICT_CACHEMAX
> #endif
> 

> This is strange for several reasons. 
> 1) Why is there a difference in behavior here between say, Linux and Windows?

Difference?

Well, maybe I'm blind, but it does the _very same thing_ in Linux and Windows AFAICS :)

You could also code (in C/C++, at least, can't say from the top of my head whether this syntax is valid in JS)

#if defined(MOZ_WIDGET_GTK2) || defined(XP_OS2)
#define RESTRICT_CACHEMAX
#endif

voilà :)
Comment 45 Flávio Etrusco 2011-09-05 20:14:13 PDT
@Andreas: I can't say anything WRT to behavior (did you measure memory use?) but the code is IMO obviously odd.
http://mxr.mozilla.org/mozilla-central/source/browser/app/profile/firefox.js#737
http://mxr.mozilla.org/mozilla-central/source/dom/plugins/base/npapi.h#49
Comment 46 Andreas Eibach 2011-09-05 20:35:00 PDT
Ahhh! Thanks for the code excerpt, Flávio. 
I stand corrected. 
So XP_OS2 does not mean "either XP or OS2" (as I thought) but XP _running_ on OS/2.

(As no one ought to ask me what has been happening to OS/2 all the years since I would not be able to tell, nor would I have believed that you could make XP run on the old git. But with a product called VPC it is possible. You never stop learning, do you. :))
Comment 47 Shawn Wilsher :sdwilsh 2011-09-05 20:38:13 PDT
Pretty sure `XP` is for executable platform and is not a reference to Windows XP at all.
Comment 48 Boris Zbarsky [:bz] 2011-09-05 21:38:50 PDT
Andreas, XP_OS2 just means "OS/2", just like XP_MACOSX, XP_WIN, XP_BEOS, XP_UNIX, and XP_LINUX mean the obvious corresponding things.

The XP_* defines also predate the existence of Windows XP by at least 3 years, if memory serves me right.
Comment 49 Andreas Eibach 2011-09-05 21:50:58 PDT
Thanks to the both of you.

The predating bit may be true, though I could not be **** to look up pre-2001 code for verification :-)
Comment 50 Nicholas Nethercote [:njn] 2011-09-06 13:52:46 PDT
*** Bug 468003 has been marked as a duplicate of this bug. ***
Comment 51 Gian-Carlo Pascutto [:gcp] 2011-09-16 15:00:40 PDT
Bug 669410 changes the url-classifier behavior to only keep a 1.2M datastructure in memory when not updating. During an update, the SQLite cache is upped to 40M and dropped down to +-1M again when it's done (a few seconds).

I believe that essentially fixes this bug. Would be nice if anyone else can confirm.
Comment 52 Nicholas Nethercote [:njn] 2011-10-18 23:06:34 PDT
I tried a new profile again and measured RSS every second for an hour.  It started at 81816KB and ended up at 93560KB.  Then I quit and started again and the RSS stayed much more stable, around 83512KB, presumably because no update was occurring.  So I'm not quite sure what to make of that.
Comment 53 Gian-Carlo Pascutto [:gcp] 2012-02-05 22:16:18 PST
Bug 673470 revamps the urlclassifier and replaces the random SQLite I/O by a serial read of a much smaller store, removing the need for the I/O cache. I believe this bug is essentially fixed.
Comment 54 Nicholas Nethercote [:njn] 2012-02-05 23:12:04 PST
I'd like to have measurements like those done earlier in the bug (e.g. RSS measured once per second for several hours) before declaring this fixed.
Comment 55 Zack Weinberg (:zwol) 2012-03-13 15:00:33 PDT
I have nothing better to do with my Mac right now, I'll run the same test I did then.
Comment 56 Zack Weinberg (:zwol) 2012-03-13 21:38:38 PDT
Created attachment 605652 [details]
six hours of VSZ, RSS, and CPU measurements, Firefox 11/Mac
Comment 57 Zack Weinberg (:zwol) 2012-03-13 21:44:35 PDT
Created attachment 605653 [details]
six hours of VSZ, RSS, and CPU measurements, Firefox 11/Mac, plotted

I recorded Firefox 11's memory and CPU usage on my laptop for six hours, according to the same procedure as described in comments 1 and 5 (the laptop is a Mac) except that I took a data point every second instead of every five minutes.  Results are in this graph.  VSZ and RSS are in bytes; CPU is in seconds.  The x-axis of all three subplots is time in hours.

It looks like, under these conditions (i.e. completely idle, only one tab showing about:blank, no extensions of any kind) there is no longer a memory leak.  I presume the periodic spikes roughly every half hour are still due to the URL classifier, as discussed above.  I am a little bit concerned that we don't go *completely* idle -- CPU does keep creeping up -- but it's less than one minute of CPU in six hours, so it's probably not a big deal, and anyway we can worry about it in another bug :)
Comment 58 Nicholas Nethercote [:njn] 2012-03-13 23:03:38 PDT
Thanks for doing the measurements!


> It looks like, under these conditions (i.e. completely idle, only one tab
> showing about:blank, no extensions of any kind) there is no longer a memory
> leak.  I presume the periodic spikes roughly every half hour are still due
> to the URL classifier, as discussed above.

I agree, and I am happy to declare victory.


> I am a little bit concerned that
> we don't go *completely* idle -- CPU does keep creeping up -- but it's less
> than one minute of CPU in six hours, so it's probably not a big deal, and
> anyway we can worry about it in another bug :)

I was horrified at first because I thought the y-axis was CPU % :)

It looks to me like it's doing a lot of work the first few times and then just a small amount each time after that.  That sounds reasonable to me.  Investigating it is certainly outside the scope of this bug.
Comment 59 Zack Weinberg (:zwol) 2012-03-14 08:36:21 PDT
Filed follow-up bug 735684 for automated testing for this kind of slow leak, and bug 735690 for the CPU usage when idle.
Comment 60 Frank Nestel 2012-03-16 02:14:00 PDT
*** Bug 736394 has been marked as a duplicate of this bug. ***
Comment 61 Gian-Carlo Pascutto [:gcp] 2012-03-16 02:27:41 PDT
Verified fixed testing was done against Firefox 11 with the original patches from bug 669410, so let's indicate that correctly.

Note You need to log in before you can comment on or make changes to this bug.