Open Bug 832157 Opened 11 years ago Updated 5 months ago

Browser considerable slowdown after reaching cache limit.

Categories

(Core :: Networking: Cache, defect, P3)

22 Branch
x86_64
Windows 7
defect

Tracking

()

UNCONFIRMED

People

(Reporter: naveenreloaded, Unassigned)

Details

(Keywords: perf, Whiteboard: [snappy][necko-backlog])

Attachments

(11 files)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0
Build ID: 20130104151925

Steps to reproduce:

Setting cache limit to either auto or manual , say if i set it to 500MB ( the default is 1GB) , in either case , when the cache level is reached , the browser starts to slowdown considerably. 


Actual results:

Browser almost comes to a crawl.


Expected results:

Firefox should handle cache based on popularity of a site and cache specific to it. For eg most visited site caches should be preserved , while less frequent sites should be deleted after specific time frame and cache storage space should be kept at least 25% free at all times , ie 500MB out of 1GB should be used to store most frequented , newly accessed site and remaining 250 MB should be used for not so frequent site cache , flash , other stuff , etc. And always remaining 250 MB should be kept free for browser to work with. Currently Firefox slowdown when it cant find new space for storing fresh cache once the limit is reached.
>the browser starts to slowdown considerably. 
Is the loading of pages slower than before or what do you mean with "slowdown" ?
Component: Untriaged → Networking: Cache
Product: Firefox → Core
Whiteboard: [snappy]
Yes after hit the cache limit , there is considerable page loading delay/lag. Sometimes flash videos or contents doesnt load at all. Its as if the browser is struggling to find new space for new contents . When i find the browser is slow , a simple check to cache ALWAYS shows the cache is full. Once i clear it , the problem resolves. Since i reach 1GB / Auto very soon , and it becomes annoying to clear it even now and then , not to mention it takes 2-4 minutes to clear such a huge cache with all those small files , i now have cache cleared when i quit the browser. Because of this , page loading is slow ,coz each time a fresh copy is downloaded.

Thats why i suggested that idea on cache.
any update ?
Naveen: how much space do you have left on your drive?  Is this a spinning hard drive  or a solid state drive (SSD)?

Does the same slowdown happen for you if you set a much smaller limit (say 50 MB)?  How about if you turn the disk cache off entirely?

The most obvious culprit here is evictions taking a long time, since we start evicting when the cache is full.
I have a spinning HDD. Currently i have around 40GB free out of 250GB. 

If i set it to 50MB , the slow down occurs more often. The limit is the problem. Ifs its 1GB , it occurs say in 2-3 days time , and clearing the cache takes atleast 3-4 minutes , as i stated in the first post.

Currently i have set FF to clear cache when exiting . This as you know fetches the websites freshly each time , consuming more bandwidth and longer time to load pages. I really wish FF implements "FREQUENCY-BASED" caching for sites. ie sites i visit more are reddit , fb ,twitter(slowest of all), thehindu , arstechnica , hackernews , addictivetips etc , so prioritizing cache for these sites which are frequently visited and trying to retaining them and clearing the less frequented will be a really good way for dealing with cache imo.
On FF 20 now. The slow is somewhat more evident now. and now the cache is set at 350 as default by firefox. Now I am forced to clean the cache every 5-6 hours or so. even if i set to max limit (1GB) , i will have to clean every day or two. Why cant we implement a dynamic cache management ?
michal, what do you think is going on here?
sorry for bumping. But here is the same problem - 350Mb cap , which as i said , when reached , makes the browser so sluggish and laggy. While trying to clear the cache, the browser goes in to "not responding" for atleast a 2 minutes solid.

http://i.imgur.com/fXmJHkN.png

here is the screenshot
Bumping again.

Still the problem is persisting.

Here is the resource monitor screenshots after telling firefox to clear the 350MB (default) cache.

http://i.imgur.com/CafDp2V.jpg

Please reply or tell what to do .
let's not distract from the current cache work to look at this, but do you have any guesses at what is going on here?
Flags: needinfo?(michal.novotny)
I posted this screenshot to show that , when i hit clear cache in firefox , the browser hangs for a solid 1-2 minutes , I saw what is happening to the HDD using the resource monitor , and this is what i saw , so , i posted it here , hoping you guys would know what its all about , and why firefox hangs for 1-2 mintues to clear a mere 350MB cache.

I am really not that technical , so i dont know. May be its due to numerous sub cache folder firefox creates to store the cache , and also numerous small files (<1kb) which usually takes longer time to delete on HDD. Next time when FF fills to 350Mb , i will check the number of files in these cache folder (total no of files). 

Apart from this , any idea , why firefox should be limited to 350MB as default ? If user to need to change , the max one can set cache is 1GB. Cant firefox set a over head storage space , ie 20% of allotted cache size for storing /deleting / accessing current tabs ?
Flags: needinfo?(michal.novotny)
I understand why clearing the cache is slow. What I don't understand is why there is a "considerable slowdown after reaching cache limit" for the reporter.
Flags: needinfo?(michal.novotny)
As i said before , I think (my guess) , firefox isnt able to find a new space within a filled cache limit to store new data arriving from new tabs. Is there any tool which I can use to see what happens and how firefox reacts when cache limit is reached ? What happens to new data coming in from websites and how ff stores them ?
Also can we change the "VERSION" branch from 18 to current 22 in the details section above ?
Naveen, I'm sorry about the confusion. My questions were really intended for Michal, who is the most experienced developer on the cache we have. That's why they had the needinfo?michal.novotny flag attached to them..
Version: 18 Branch → 22 Branch
Oh sorry . :)
I have no idea what could be the reason of slowdown here. If we hit the cache limit in nsDiskCacheDevice::OnDataSizeChange() we start evicting old entries until the new entry fits into the cache. This should be quite fast operation. It seems that Naveens computer is very slow in deleting files.
Flags: needinfo?(michal.novotny)
http://imgur.com/CaY81sR
http://imgur.com/rjdkiMq

here are some benchmarks of my HDD.
Naveen, you might test a nightly build to see if cache2 has the same problem.
see http://www.janbambas.cz/mozilla-firefox-new-http-cache-is-live/  - note bug 913808 cache evict recently landed, so blog is now incorrect on that point.

Also, your bug 915110 is likely I think to be just inherent slowness of OS accessing the huge directory and not fixable by firefox.  Does having such a large directory perhaps also impact this bug?
Flags: needinfo?(naveenreloaded)
Thanks , will give it a try.

Where can i learn more about cache2 ? That blog link doesnt provide much info .
Flags: needinfo?(naveenreloaded)
I don't know where the doc is at. But, I'm not sure there is much to know
(In reply to Wayne Mery (:wsmwk) from comment #22)
> http://www.janbambas.cz/new-firefox-http-cache-backend-implementation/

This one is quite old.  Try http://www.janbambas.cz/mozilla-firefox-new-http-cache-is-live/

You can turn it on with browser.cache.use_new_backend pref to set to 1.  But I still don't believe this is caused by cache.
(In reply to Honza Bambas (:mayhemer) from comment #23)
>  But I still don't believe this is caused by cache.

ref bug 915110
Naveen, what are your results with cache2?
Flags: needinfo?(naveenreloaded)
Whiteboard: [snappy] → [closeme 2014-10-10][snappy]
I just updated to 32.0.3 and i still do get a lags and performance difference. Current cache is 350mb and i feel the browser is little lagging here and there , but not like before. I must say there is some improvement. 

But why is cache is default set to 350mb in current era ? and max limit being 1 GB ?
Flags: needinfo?(naveenreloaded)
Whiteboard: [closeme 2014-10-10][snappy] → [snappy]
Whiteboard: [snappy] → [snappy][necko-backlog]
This bug still exists in FF 51.0.1. A soon as the cache is full, the browser slows too a crawl and uses inordinate amounts of CPU. Yes, I have lots of tabs, but FF needs to be more robust. Something is clearly wrong about the cache replacement machinery. Given that this problem has existed at least since year 2013, is it not time to try something different? For example, just drop 10% of the LRU ((least recently used) cached items whenever running into a limit. It is patently obvious that the algorithm of delete-just-enough-to-fit-each-new-request is NOT a good solution. It is kind of obvious that such an algorithm cannot work well, because when the browser is opening a lot of tabs, it will constantly bump against the limit and create massive churn in the cache replacement algorithm!


Also see bug https://bugzilla.mozilla.org/show_bug.cgi?id=1225501 , which describes what happens when you clear the cache and then try to restart a firefox instance.

The absolute limit of 1GB cache is also a problem.
(In reply to reikred from comment #27)
> This bug still exists in FF 51.0.1. A soon as the cache is full, the browser
> slows too a crawl and uses inordinate amounts of CPU. 

Let's take a look.

First I'd like to confirm this is actually related to cache.
- close the browser
- start it again
- confirm the CPU load is still there
- close the browser
- presuming you are on windows, delete or rename c:\Users\<your-username>\appdata\Local\Mozilla\Firefox\Profiles\<some-random-string>.default\cache2\ (this folder contains all the http cached data)
- start again the browser
- check the CPU high load problem is gone

> Yes, I have lots of
> tabs, but FF needs to be more robust. Something is clearly wrong about the
> cache replacement machinery. Given that this problem has existed at least
> since year 2013, is it not time to try something different? 

Please setup logging according https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging with following modules:

MOZ_LOG=rotate:20,timestamp,cache2:5

reproduce the CPU high load and then provide the log here or pack it and send it to my bugzilla email.  Thanks!

> Also see bug https://bugzilla.mozilla.org/show_bug.cgi?id=1225501 , which
> describes what happens when you clear the cache and then try to restart a
> firefox instance.

That bug is not related to http cache.
Flags: needinfo?(reikred)
(In reply to reikred from comment #27)
> It is patently obvious that the algorithm of
> delete-just-enough-to-fit-each-new-request is NOT a good solution.

We don't do this in cache2.
i can still confirm this problem is still there. But funny thing is now i am on SSD compared to spinning HDD when i started this thread. Now my cache doesnt go beyond 200mb but there is a definite lag after some time ,and i had to clear the cache even though its low to get FF back to normal speed.
(In reply to Naveen from comment #30)
> i can still confirm this problem is still there. But funny thing is now i am
> on SSD compared to spinning HDD when i started this thread. Now my cache
> doesnt go beyond 200mb but there is a definite lag after some time ,and i
> had to clear the cache even though its low to get FF back to normal speed.

Can I ask you for the same as I asked reikred in comment 28?  Thanks!
Flags: needinfo?(naveenreloaded)
(In reply to Honza Bambas (:mayhemer) from comment #28)

I'm working on generating an example that uses a large randomized working set of tabs, rather than my specific set of tabs, as I feel there may be privacy issues. I will report back.
This is a session with 490 tabs worth of random web sites for stress testing (in particular)  the cache subsystem. This file is for loading via the Session Manager {1] extension for firefox, but if you want to dump the session into a recovery.js file instead, you just extract the last (super-long) line, as can be done by the unix command

tail -n 1 2017-03-01.13_15.session.tabs=490.random-tabs.session > recovery.js

-----------------------------------------------------------------------------------------

That test case ought to be sufficient, but if anyone want to generate other test cases, or add more tabs, this is how I did it:

As a source of random websites, I used the link http://www.uroulette.com/visit/trpq, which will return a new random website each time.

The random tabs were loaded using a short html script from [2], which I will also attach soon. The script was placed in file:///tmp/open-list-of-URLs.html , please adjust as needed for  your platform/OS. Just paste many lines of 

http://www.uroulette.com/visit/trpq 

into a tab running the open-list-of-URLs.html script, click the button, and sit back while firefox churns.

References:

[1] https://addons.mozilla.org/en-US/firefox/addon/session-manager/
[2] http://superuser.com/questions/385207/how-to-open-a-list-of-urls-in-firefox-or-seamonkey
[3] http://www.uroulette.com/visit/trpq
Comment on attachment 8842599 [details]
2017-03-01.13_15.session.tabs=490.random-tabs.session

Hold on, some of the content was screwed up.
Comment on attachment 8842599 [details]
2017-03-01.13_15.session.tabs=490.random-tabs.session

Not ready yet
Attached file open-list-of-URLs.html
Use this html file to load more random web pages into new tabs.
Paste lines like this one into the form reated by the html file


http://www.uroulette.com/visit/trpq
http://www.uroulette.com/visit/trpq
http://www.uroulette.com/visit/trpq
....
http://www.uroulette.com/visit/trpq
(In reply to Honza Bambas (:mayhemer) from comment #28)

Honza, the test case is now ready. Please use my test case in your own browser instance (make a test profile for this !!) and you will then be able to see with your own eyes what is going on. I think that is a superior method, and I hope you agree. 

Use the attached 2017-03-01.13_15.session.tabs=490.random-tabs.session as described in the attachment details. Set cache size to 99MB (say, so that cache is small compared to working set and thrashing is easily triggered).  I can easily trigger cache thrashing that way, and I think you will be able to as well, using the test steps you prescribed above.

Finally, because 490 is quite a lot of tabs, you may want to use the tree-style-tabs extension, as well as some ad blocker, flash blocker, etc  so that you do not get a cacophony of sound from all the web tabs opened.  My set of enabled extensions for the test is


http://adblockplus.org/en/
https://addons.mozilla.org/addon/classicthemerestorer/
http://cookieSwap.mozdev.org
http://www.schuzak.jp/
https://addons.mozilla.org/en-US/firefox/addon/extension-list-dumper-2/
http://flashblock.mozdev.org/
https://getmultifox.com/
http://sessionmanager.mozdev.org/
http://code.google.com/p/tabhunter/
http://piro.sakura.ne.jp/xul/_treestyletab.html.en

PS: I'm running this in a pretty fast machine, i7-4770 cpu with 32GB of RAM, but it is still a very challenging workload for firefox.
Here is the log files as requested by Honza Bambas. The logfile was created under the following conditions: Cache was full, I closes about 50 tabs and opened about 50 new random tabs. The cpu went over 100% for several minutes, and then when the cpu was trending down towards about 65% I took the snapshot.
(In reply to Honza Bambas (:mayhemer) from comment #28)

As you may have seen, I have posted one snapshot of the requested logfile(s) already. I will soon post another snapshot. The cpu usage for firefox is still hovering around 60-80% although I have not touched the browser for quite a while. Cache is full at limit=199MB. Paste of "top" cpu minitor follows:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND   
11501 xyzuser   20   0 4704548 2.789g 247164 S  68.7  8.9  26:06.12 /usr/local/firefox-51.0.1/firefox --class Firefox9 --sync -P test9             
11501 xyzuser   20   0 4771104 2.804g 252608 S  83.1  9.0  28:29.98 /usr/local/firefox-51.0.1/firefox --class Firefox9 --sync -P test9             
11501 xyzuser   20   0 4759512 2.773g 253912 S  62.2  8.9  30:00.25 /usr/local/firefox-51.0.1/firefox --class Firefox9 --sync -P test9             

Would be nice id you could give a preliminary interpretation of what is in the log files, Honza! Currently 557 tabs.
2nd snapshot of logfiles, without any further tabs added or removed, and no viewing/scrolling or other browser use.
(In reply to Honza Bambas (:mayhemer) from comment #28)

I just realized I had to clear Clear the needinfo request for reikred myself. Hereby done. The requested information has been provided.
Flags: needinfo?(reikred)
It would be nice if I could get some confirmation that my posted log files are useful and are being studied. 

I also have a question about a possible temporary remedy: What would happen if I randomly removed,say, 10-20% of the (oldest?) files in ~/.cache/mozilla/firefox/SALT.profilename/cache2/entries? Would Firefox realize hat there was then more space available? Would it help to mark the cache as dirty afterwards, something like

echo 0 > ~/.cache/mozilla/firefox/SALT.profilename/_CACHE_CLEAN_

Would the same temporary remedy work better if performed with firefox not running (exited), then followed by a restart?

Why do I ask these questions about possible remedies? Well, it because right now the only available remedy is to clear all 1024 MB of my cache. But then, if I quit and restart before re-visiting ALL tabs, FF will not come up properly, as described in https://bugzilla.mozilla.org/show_bug.cgi?id=1225501 . Honza says that is not a cache bug, and that is quite possible, but the bugs are intertwined in a way that makes it hard to recover back to a useable state without much rigamarole, multiple partial restarts, visit 50 still unloaded tabs, quit, restart, repeat 10 times, etc etc etc.
I did some experiments with pruning old cache files, and it works, but you have to restart FF afterwards. I did something like this:

cd ~/.cache/mozilla/firefox/SALT.profilename/
cd cache2/entries
# repeat a few times until enough space has been freed up.
/bin/ls -rt | head -n 1000 | xargs /bin/rm 
du -s .

At his point I could restart FF, FF would calculate the cache usage from the files remaining in the cache, and FF would recover nicely back to its previous state. No incompletely loaded "New Tabs" occurred, but maybe I was just lucky.
I will look at all of this next week, thanks for all the effort you put into this!
Flags: needinfo?(honzab.moz)
(In reply to reikred from comment #36)
> Created attachment 8842608 [details]
> open-list-of-URLs.html
> 
> Use this html file to load more random web pages into new tabs.
> Paste lines like this one into the form reated by the html file
> 
> 
> http://www.uroulette.com/visit/trpq
> http://www.uroulette.com/visit/trpq
> http://www.uroulette.com/visit/trpq
> ....
> http://www.uroulette.com/visit/trpq

The page doesn't show any content, response is just a 404.  It renders the test case unusable.  Please provide some different URL.

In the log I can see some activity, but hard to say if that would exhaust the processor that much.

Can you take a snapshot when the CPU/firefox process(es) is/are at 100%?

Thanks.
Flags: needinfo?(naveenreloaded)
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #45)
> (In reply to reikred from comment #36)

> The page doesn't show any content, response is just a 404.  It renders the
> test case unusable.  Please provide some different URL.

Workaround: Skip attempting to generate your own random workload, since apparently the http://www.uroulette.com/visit/trpq no longer works.
Instead load the 2017-03-01.13_15.session.tabs=490.random-tabs.session which I attached earlier. This file already contains the randomized web site workload (don't forget the "tail" command shown earlier if you want to use this file as a recovery.js file). You know much more about FF innards than I do, so I assume working with recovery.js or SessionManager to load the session is a piece of cake for you. If you need to create a lot of simultaneous loads after the initial load, use TreeStyleTab extension mouse3 tab context menu to reload whole subtrees at once.

> In the log I can see some activity, but hard to say if that would exhaust
> the processor that much.
> 
> Can you take a snapshot when the CPU/firefox process(es) is/are at 100%?
> 

Can you please try with the session/workload from the 2017-03-01.13_15.session.tabs=490.random-tabs.session file first (if my MOZ_LOG files are not good enough)? I'll make more MOZ_LOG files if you can't get the session to work, but it will take a while. Thanks!
Here is a longer set of log files as requested by Honza Bambas.

The set of websites is the same as before, and the http cache was 129MB. I reduced the cache limit down to 129MB, and restarted firefox. FF was called from a script where I used some inotifywait trickery to capture each rolling logfile as it was completed. I collected 8*4=32 such files and zipped them up for posting here.

The logfiles cover about the first 3 minutes of operation, as can be seen from the dates on the individual files.

I hope this dataset will do the trick. Please let me know whether it is sufficient to see what is going on. The total size is about 17.1MB of zip, so I had to zipsplit the file.
part2 of moz_log cache log data set.
At a first look I don't see anything wrong in the logs. Eviction of the items over the limit takes around 100us per entry. This looks good to me.
What I am seeing in the log files is there are large periods of time where the cache appears active every several microseconds.

Let's look at the activity between two timestamps timeA and timeB , where timeA is the very first item in the log, and timeB is the timestamp where the cache is getting a pause after the initial burst of activity:

timeA: 2017-03-12 04:37:46.468076 UTC - [Main Thread]: D/cache2 CacheFileIOManager::Init()
timeB: 2017-03-12 04:39:27.401234 UTC - [Cache2 I/O]: D/cache2 CacheFileHandles::RemoveHandle() hash=2c0d8e00114114f70ddfca530a45376dfcb2d8d9 list is empty, removing entry 7f1938cde950
#gap here, then cache starts laboring again when I manually instruct FF to reload a tree of about 40 tabs
timeC: 2017-03-12 04:44:27.716510 UTC - [Main Thread]: D/cache2 CacheStorageService::AddStorageEntry [entryKey=:http://www.tc.gc.ca/, contextKey=]

In the initial 1min41sec there are 168567 logged cache events, calculated from 4041+$(wc -l moz_log.v=51.0.1.p=9.d=2017-03-11-Sat.20:37.0.[0-3]|tail -n 1|cut -d" " -f 3) = 4041+164526 = 168567. I further counted that there was about a total of 7189 cache files in use at timeB.

% find ./cache2/entries -print | wc
   7189    7189  309086

If one compares the number 168567 to the number 7189, does that indicate that there is a high degree of cache activity relative to the number of files cached? Given that there is no browsing going on at all (yet), only an initial load of a set of 567 tabs, of which only actually 12 windows and 12 selected/active tabs are even being loaded?

At timeC, I request FF to load a tree of about 40 tabs from scratch (treestyletab extension makes that request easy). The cache becomes very active again, and counting across all the 32 saved log files, there is a total of 1451560 (1.45M) logged operations within 8min7sec. Most of the entire activity is related to the reloading of the 40 tabs from scratch.

wc moz_log.v=51.0.1.p=9.d=2017-03-11-Sat.20:37.?.?
  1451560  15610404 188866489 total

Please note again: What I have described constitutes ALL the browsing activity performed. An initial load of 12 tabs. Idle. A full reload of 40 tabs. That's all the activity performed. No other GUI operations. Nothing. Nevertheless there are 1451560 logged cache operations across 8min7sec, however there is only 7189 cache files.

Now, I know little about the internal workings of this specific cache. But looking at the data as I have, does it not seem that the cache is very busy? Anyway, this us just an attempt at seeing the big picture.
(In reply to reikred from comment #50)
> I further counted that there was about a total of 7189 cache files in use at
> timeB.
> 
> % find ./cache2/entries -print | wc
>    7189    7189  309086

Did you check right directory? According to cache index stat dumps, there was never such amount of files:

2017-03-12 04:37:46.484819 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=2477, notInitialized=0, removed=10, dirty=13, fresh=0, empty=0, size=131927]
2017-03-12 04:38:59.339114 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=2620, notInitialized=0, removed=72, dirty=300, fresh=506, empty=10, size=131999]
2017-03-12 04:38:59.341995 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=2548, notInitialized=0, removed=5, dirty=17, fresh=438, empty=10, size=131994]
2017-03-12 04:39:20.105097 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=2580, notInitialized=0, removed=98, dirty=435, fresh=478, empty=0, size=131972]
2017-03-12 04:39:20.106872 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=2482, notInitialized=0, removed=0, dirty=0, fresh=380, empty=0, size=131972]
2017-03-12 04:44:29.585656 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=2760, notInitialized=0, removed=24, dirty=300, fresh=681, empty=235, size=132009]
2017-03-12 04:44:29.587729 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=2736, notInitialized=0, removed=0, dirty=0, fresh=657, empty=235, size=132009]
2017-03-12 04:44:49.632850 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=4231, notInitialized=0, removed=1300, dirty=2998, fresh=3413, empty=63, size=131217]
2017-03-12 04:44:49.639904 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=2931, notInitialized=0, removed=0, dirty=4, fresh=2113, empty=63, size=131217]
2017-03-12 04:45:09.670879 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=3372, notInitialized=0, removed=351, dirty=922, fresh=2847, empty=29, size=127805]
2017-03-12 04:45:09.674188 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=3021, notInitialized=0, removed=0, dirty=1, fresh=2496, empty=28, size=127807]
2017-03-12 04:45:29.698380 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=3496, notInitialized=0, removed=180, dirty=710, fresh=3114, empty=23, size=130992]
2017-03-12 04:45:29.700851 UTC - [Cache2 I/O]: D/cache2 CacheIndexStats::Log() [count=3316, notInitialized=0, removed=0, dirty=0, fresh=2934, empty=23, size=130992]

Number of files in cache2/entries is (count - removed - empty).


> If one compares the number 168567 to the number 7189, does that indicate
> that there is a high degree of cache activity relative to the number of
> files cached?

No. It doesn't make sense to compare number of log lines to entry count. During startup we log every record found in the index, but otherwise cache activity and entry count are totally unrelated.

> Nevertheless there are 1451560 logged cache operations across 8min7sec,
> however there is only 7189 cache files.
> 
> Now, I know little about the internal workings of this specific cache. But
> looking at the data as I have, does it not seem that the cache is very busy?

This just mean that logging in cache is verbose.

At what time did you notice slowdown in this session?
(In reply to Michal Novotny (:michal) from comment #51)

Well it seems that for me to create http cache logs and (not very successfully) trying to interpret and correlate them with the slowdown is not leading to any useful observations.

May I humbly suggest that the best method of analysis would be to load up the test session I have been using (see comment 33 and 37), with a low 99MB cache limit, reload all the tabs, and see what happens. Then do the same with a 500MB cache limit. There is nothing like seeing what is going on with your own eyes.

I hope that one of the two resident experts is willing to do that. The observation remains that when the cache is at the limit, the CPU usage goes crazy and the browser is near catatonic, for hours in more extreme cases. I devised a test case that shows exactly that, and I hope someone will run it.
While waiting for some progress on this problem, can someone please tell me what changes I have to make in the source code to recompile Firefox with a higher http cache size limit? For example 2048MB instead of 1024MB.
I cannot reproduce it. I tried to load 100 tabs when cache size was 80MB. When the limit was set 500MB it took 58 seconds. When the limit was set to 80MB it took 61 seconds. Both tests were started with the same profile and the same cache content.

Honza, please have a look at the log in comments 47-48. Maybe you'll find something. All I can say is that the problem is not in the eviction.
Flags: needinfo?(honzab.moz)
(In reply to Michal Novotny (:michal) from comment #55)

Did you open and view 100 new tabs with 100 new and different URLs AFTER the cache was already full (at the limit)?
100 tabs with 100 different URLs was loaded when the cache was at the limit. In total ~100MB of cacheable data was downloaded so evicting was running all the time. It's not possible to view 100 tabs at the same time, so I'm not sure what you mean by viewing it.
Thanks Michal.  I will not have time to try to repro with the stored session, even more when it's not reproducible with it.  Logs didn't reveal anything critical too.

reikerd, I think profiling with some external profiler (like AMD's CodeAnalyst [1] or its successor CodeXL [2]) is the best way to figure out what's going on.

Note that URL for Firefox debug symbols is http://symbols.mozilla.org/

[1] http://developer.amd.com/tools-and-sdks/archive/compute/amd-codeanalyst-performance-analyzer/codeanalyst-performance-analyzer-for-windows/
[2] https://github.com/GPUOpen-Tools/CodeXL/releases/download/v2.2/CodeXL_Win_2.2.733.exe
Flags: needinfo?(honzab.moz) → needinfo?(reikred)
(In reply to Honza Bambas (:mayhemer) from comment #58)
> Thanks Michal.  I will not have time to try to repro with the stored
> session, even more when it's not reproducible with it. 

It should be noted that NOBODY has run my stored session test case with 490 tabs. It appears that Michal made a smaller test case. I have tested my case with

51.0.1 (release version, AND and compiled with cacheSize modified to 8192 MB)
52.0   (release version, AND and compiled with cacheSize modified to 8192 MB)
53.0b2 (no released vsn, ONLY as compiled with cacheSize modified to 8192 MB)

My runs are all in uniprocess mode (because of addons used cause limitations), not multiprocess/e10s mode. What version of firefox and what process mode are you guys testing with? 

It is very simple to run my test case. Make a profile, install some addons, load the session. But you have to visit enough tabs to fill the cache to the limit, and then see what happens after the cache is full.
(In reply to reikred from comment #59)
> It should be noted that NOBODY has run my stored session test case with 490
> tabs. It appears that Michal made a smaller test case. I have tested my case
> with

It should be noted that you didn't answer my questions in comment #51. If you're sure you're watching the right cache directory, then there is something wrong at your side, because there is a huge difference between existing files and files Firefox is aware of. Even if your are manipulating files manually like you mentioned in comment #43, Firefox should "see" more files than there exists. But your case is the opposite so you would need to add files manually to the cache directory.

> My runs are all in uniprocess mode (because of addons used cause
> limitations), not multiprocess/e10s mode. What version of firefox and what
> process mode are you guys testing with? 

I run latest release with clean profile in e10s mode. I got the same result when i disable e10s. When I install addons that you mentioned (except http://www.schuzak.jp/ which doesn't points to any particular addon) the load time is 2 times slower (the slowdown is caused by ABP) but still consistent, i.e. there is no difference when the cache is on the limit or not.

> It is very simple to run my test case. Make a profile, install some addons,
> load the session. But you have to visit enough tabs to fill the cache to the
> limit, and then see what happens after the cache is full.

The problem with your test is that it cannot be rerun with consistent results, because response times will vary across reloads. Also it's hard to do any measurement because scrolling among loading tabs doesn't provide useful timing information. OTOH in my test I load all the pages from a server on a gigabit LAN and I see a exact load timing from httpd logs. FYI, I used tp5n.zip from https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5.

Anyway, I tried to load your set of URLs and when the cache wasn't limited it took ~14 minutes to load them all. The cache size was ~1.2GB when the load was done. When I rerun the test with 100MB cache limit it took again ~14 minutes and I didn't notice any visual difference. Both test started with a clean cache and I used profile with addons you listed.

It's obvious that the problem is specific to you computer. Maybe there is some application that is watching FS changes and tries to do some action on file removal. Try to move cache directory to another filesystem (different type, location etc.) or even tmpfs since you have a lot of memory available. You could also try to use oprofile to find the bottleneck, but especially in this case you should run the tests against you own server to ensure you can get consistent results across different runs.
(In reply to Michal Novotny (:michal) from comment #60)
> (In reply to reikred from comment #59)

I will try to address your questions and observations. And it was good for me to understand the tp5n.zip test methodology  that you use normally. That certainly is quite different from what I was doing. This post is going to be long, but please read on. Part 3 is the important stuff, but party 1 and part 2 addresses questions and requests.

1. about the cache2/entries content: The count from "find ./cache2/entries -print | wc" (add -type f if you want) was in the correct directory, but may have been after loading an additional ~80 tabs after first loading the "490.session" (a shorthand name I will now use). I was not doing any funny business about removing files manually, and for the record I NEVER did any automatic removal of any cache files at any time at all.. The manual removal experiment was for one of my real sessions, not the 490.session.

2. about profiling: I'm a bit rusty on profiling in general, I'm from the gprof generation and I have never used  oprofile nor the more fancy CodeXL and derivatives. I will try to get this going but it will take a while.

3. your test of 490.session: Agree that the pseudo-random (but now frozen) mix of web sites and response times makes it harder to compare two different runs on two different machines on two different networks at two different points in time. However, I suspect that how firefox reacts to cache state and network response time may be an important part of the overall syndrome.

BUT, even more importantly, I have some new results on the effect of uniproc versus multiproc (e10s). The upshot is that I see HUGE difference in behavior between uniproc (BAD behavior) and multiproc (GOOD behavior) runs. Here are some comparisons.

RESULT:

3a. uniproc run gets into permanently high cpu usage after all the reloads

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND 
30454 xyzuser 20   0 8579.5m 6.207g 252.8m R 150.7 19.8  89:01.46 /usr/local/firefox-51.0.1/firefox --class Firefox9M --sync -P test9M 
30454 xyzuser 20   0 8637.2m 6.183g 253.3m R 184.1 19.8 117:13.06 /usr/local/firefox-51.0.1/firefox --class Firefox9M --sync -P test9M

3b. uniproc run created more cache logfiles, ended at count 13.0 (53 files) after load, then *seemingly* no more log items while cpu most often = 150-200%

3c. multiproc run (e10s) will get to low cpu after reloading all the tabs

3d. multiproc run (e10s) generated fewer cache LOG files, ended at count 7.0 (29 files) after load

3e. overall there is some indication that uniproc versus multiproc affects the behavior, not ONLY cache full.

3f. only addons running: SessionManager/SM, Treestyletabs/TST and Tabhunter/TH (if all enabled:uniproc, if all disabled:multiproc). Unless TST has some bug, I find it unlikely that the extensions are the problem. Please note I am down to just 3 addons in this test case, to make life simpler for all of us.

I disabled (actually fully removed) all other addons, in particular no uBlock or AdblockPlus (actually I never ran AdblockPlus, only uBlock) and Flashblock. To get less audio noise I used

#in URL about:config, make the following changes
media.autoplay.enabled                   = true  --> false 
media.block-autoplay-until-in-foreground = false --> true
plugins.click_to_play                    = TRUE (default unknown)

3g. all runs were in 51.0.1 stock version, since this test case does not need or use large cache.

3h. I will post the MOZ_LOG files as zipsplits, but it is a lot of data.

3i. My suspicion is that the uniproc case gets into a bad state, and it is related to the cache being full, but I will gladly retract my original speculation that the observed behavior is a pure case of cache thrashing, given that the e10s case manages to complete. It may be a problem (speculating again!) with the thread scheduling/switching between the cache thread and some other threads (rendering thread??), which causes something that looks very much like an endless/infinite loop in my experience.

Does this help? I'm baffled that you cannot see the difference in behavior. I'll go out on a limb and send back to you one of those "are you sure" questions :): Are you sure that you were ever really in uniproc mode when cachelimit=99 or 100MB? I checked in about:support each time I ran . In 51.0.1 stock, addon TH stock version will block e10s.
e10s=on, load session.490 with MOZ_LOG=rotate:20,timestamp,cache2:5; otherwise filename says it all. 

Note I set cachesize=browser.cache.disk.capacity=50MB her to get the cache full quickly here, apologies for changing it from 99MB to 50MB but it serves to illustrate the behavior.

part1
e10s=off, load session.490 with MOZ_LOG=rotate:20,timestamp,cache2:5; otherwise filename says it all. 

Part 1
(In reply to reikred from comment #61)
> 2. about profiling: I'm a bit rusty on profiling in general, I'm from the
> gprof generation and I have never used  oprofile nor the more fancy CodeXL
> and derivatives. I will try to get this going but it will take a while.

Thanks.  That would be IMO the best option, you can use whichever profiler you like.  I'd just like to know where exactly we are looping.  Logs don't provide that answer, unfortunately.
Flags: needinfo?(reikred)
I am seeing similar - took me a while to figure out what is going on, but sure enough, cache fills and firefox becomes unresponsive for minutes at a time, cannot scroll or do anything, it just hangs, usually for a couple of minutes.  I almost stopped using firefox until I figured out the issue.  Sure enough, I clear the cache and firefox works like a champ.
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
Keywords: perf
Severity: normal → S3

Confirmed this issue when trying to troubleshoot very strange performance problems for a customer who uses Firefox internally . They experience significant browser lag to the point of it becoming virtually unusable - clearing cache solved the issue immediately.

It is highly concerning that this has been open for 11 years and has such low priority. The user impact is tangible, and a a full cache is increasingly more common with modern sites using Service Workers, pre-caching and other offline features.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: