Closed Bug 1487101 Opened 7 years ago Closed 5 years ago

CacheStorageService::PurgeOverMemoryLimit() doesn't work correctly under specific circumstances

Categories

(Core :: Networking: Cache, defect, P3)


Tracking


RESOLVED FIXED
mozilla79
Tracking Status
firefox79 --- fixed

People

(Reporter: mconley, Assigned: mayhemer)

References

Details

(Whiteboard: [MemShrink:P2][gfx-noted][necko-triaged])

Attachments

(5 files, 1 obsolete file)

Attached file Meddler script

This Meddler script was used to produce the test case, using this tool: https://webdbg.com/meddler/

We're going to need somebody to reproduce this and gather a memory report from about:memory, I think.
Whiteboard: [MemShrink]
(In reply to Eric Lawrence (@ericlaw) from comment #3)
> Created attachment 9004893 [details]
> Win10 x64 v61.0.2 memory-report.json.gz

Thanks, ericlaw. Just so we're clear, was this memory report gathered from a session of Firefox that was at that moment consuming an unusual amount of memory due to this bug? I ask because a quick glance at the memory report doesn't show unusually high consumption from any of the processes.
Flags: needinfo?(ericlaw1979)
This memory report was from a Firefox instance that was using ~360mb of "Active Private Working Set" in Windows. That number had been steadily creeping up since the process was launched, and now stands at 427mb ten minutes later (memory report attached).
I should clarify further: The 427mb and growing process was the manager/browser Firefox.exe process (no command line args) while the most active Content process (presumably representing the only tab) is growing more slowly and currently stands at 310mb.
Okay, great - thanks for the context, ericlaw. Hey mccr8, when you have a sec, is this enough to go on to know if this is an imagelib thing? Necko cache might also be a factor here. I'm also seeing a bit of heap unclassified...
Flags: needinfo?(continuation)
I note from the memory report that the reported image footprint is minimal. Are we certain it was captured at the right time?
(In reply to Andrew Osmond [:aosmond] from comment #9)
> I note from the memory report that the reported image footprint is minimal.
> Are we certain it was captured at the right time?

I think these recent memory reports are from a different, shorter session, where ericlaw is still seeing a slow creep-up of memory usage, but where it hasn't yet reached the levels seen in comment 0.
Yeah, the original screenshot was an overnight (~7 hours) run; the original memory capture attached here was after running the repro for 10 minutes or so. After 2 hours, the browser process is at 749mb of active private working set and the tab process is ~150mb (Peaked at 1.1gb)
(In reply to Eric Lawrence (@ericlaw) from comment #11)
> Yeah, the original screenshot was an overnight (~7 hours) run; the original
> memory capture attached here was after running the repro for 10 minutes or
> so.
>
> After 2 hours, the browser process is at 749mb of active private working set
> and the tab process is ~150mb (Peaked at 1.1gb)

Given the image footprint is still very low in this report, I imagine images are responsible for the peak, because we have a 1 minute timer that expires unused entries from the cache.

In the parent process, the heap unclassified is concerning at 255MB, along with the network at 332MB. Presumably this is caching for all of the images we received from the network that we haven't released.

In the content process, it seems much more evenly spread. It is still holding onto the generated '<div data-snippet-id="XXX" ...' strings in the JS memory; presumably those could have been freed earlier (not sure how many images are displayed, maybe this is reasonable).
Priority: -- → P3
Whiteboard: [MemShrink] → [MemShrink][gfx-noted]
Hey valentin, are you the right person to ask about the Necko caching memory usage here? Anything actionable from these memory reports?
Flags: needinfo?(valentin.gosu)
(In reply to Mike Conley (:mconley) (:⚙️) from comment #14)
> Hey valentin, are you the right person to ask about the Necko caching memory
> usage here? Anything actionable from these memory reports?

Unless there is a random memory leak happening, I don't think the necko memory cache could cause this. From what I recall it has a maximum size, which you can inspect at about:cache, computed here [1]. Michal may have a better idea of whether the memory cache could pose a problem.

[1] https://searchfox.org/mozilla-central/rev/55da592d85c2baf8d8818010c41d9738c97013d2/netwerk/cache2/CacheObserver.cpp#196
Flags: needinfo?(valentin.gosu) → needinfo?(michal.novotny)
It's definitely not normal that the cache index occupies 100MB. Right now I don't have a Windows machine available, so I cannot reproduce it locally. Eric, can you please check how big the index file is on disk? The file is in the cache2 directory; you can see the full path on the about:cache page. If the cache entries are leaked, there also might be a lot of files in the doomed directory.
Flags: needinfo?(michal.novotny) → needinfo?(ericlaw1979)
I let the repro run for ~7 hours and the working set ballooned to 1.1gb. I then navigated to about:blank and it remained pretty constant for the minute that I watched. Now, 12 hours later, it's down to 546mb.

about:cache has the following to say:

memory
  Number of entries: 1
  Maximum storage size: 32768 KiB
  Storage in use: 6 KiB
  Storage disk location: none, only stored in memory

disk
  Number of entries: 162651
  Maximum storage size: 358400 KiB
  Storage in use: 357233 KiB
  Storage disk location: C:\Users\ericlaw\AppData\Local\Mozilla\Firefox\Profiles\q58h7lzw.default\cache2

The index file in the cache2 folder is 6.82 MB (7,157,188 bytes). The doomed folder is empty. The entries folder has 162,647 files, of which 162,613 appear to be the 2-3kb JPEG images used in the repro.
Flags: needinfo?(ericlaw1979)
It looks like the relevant people have investigated, so I'm going to clear my needinfo.
Flags: needinfo?(continuation)
Whiteboard: [MemShrink][gfx-noted] → [MemShrink:P2][gfx-noted]
:aosmond, do you have any ideas for next steps here?
Flags: needinfo?(aosmond)
This problem is not related to images. It's a generic problem when a lot of small resources are being cached.

What happens is that CacheStorageService::MemoryPool::PurgeByFrecency() is called from PurgeOverMemoryLimit(), and it sorts the array because frecencyNeedsSort is always true in CacheStorageService::MemoryPool::PurgeOverMemoryLimit(). Sorting takes some time when there are a lot of entries, and then CacheIOThread::YieldAndRerun() returns true because events with higher priority were enqueued in the meantime. Nothing is purged and the event is rescheduled for execution. When CacheStorageService::PurgeOverMemoryLimit() is executed again, it returns early because it's too soon since the last purging. The result is that we try to purge entries only once every 4 seconds and usually nothing is purged, so used memory just grows.

Honza, can you take this bug? This is your code.

Regarding the sorting, consider doing something similar to what CacheIndex::FrecencyArray does. New entries are always appended at the end of the array because we expect that new entries won't be evicted soon. The array is sorted occasionally in CacheIndex::FrecencyArray::SortIfNeeded(), depending on how unsorted the array is.
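The failure mode described above can be condensed into a small model. This is a hypothetical, simplified sketch; the struct and method names merely echo the real ones, and none of this is the actual Necko code. The point it illustrates: the sort eats the time slice, the pending higher-priority event makes the loop bail out, and nothing is ever purged.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of the bug (not the real Necko code).
struct Entry {
  double frecency;
  uint32_t size;
};

struct MemoryPool {
  std::vector<Entry> entries;
  bool frecencyNeedsSort = true;  // in the buggy code this was effectively always true
  uint32_t usage = 0;

  // Returns true when purging was interrupted before evicting anything,
  // modeling CacheIOThread::YieldAndRerun() reporting pending work.
  bool PurgeByFrecency(uint32_t limit, bool higherPriorityEventPending) {
    if (frecencyNeedsSort) {
      // With hundreds of thousands of entries this sort alone consumes
      // the time slice, after which higher-priority events are pending...
      std::sort(entries.begin(), entries.end(),
                [](const Entry& a, const Entry& b) {
                  return a.frecency < b.frecency;
                });
      frecencyNeedsSort = false;
    }
    while (usage > limit && !entries.empty()) {
      if (higherPriorityEventPending) {
        return true;  // ...so we bail out here, having purged nothing
      }
      usage -= entries.front().size;
      entries.erase(entries.begin());
    }
    return false;
  }
};
```

Combined with the 4-second throttle in PurgeOverMemoryLimit(), every rescheduled run repeats this pattern and memory only grows.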
Component: ImageLib → Networking: Cache
Flags: needinfo?(aosmond) → needinfo?(honzab.moz)
Summary: We might be leaking images that reload many times over a long period → CacheStorageService::PurgeOverMemoryLimit() doesn't work correctly under specific circumstances
Thanks for investigating this, Michal. Yep, this looks like my bug.
Assignee: nobody → honzab.moz
Flags: needinfo?(honzab.moz)
Whiteboard: [MemShrink:P2][gfx-noted] → [MemShrink:P2][gfx-noted][necko-triaged]
Blocks: 1541106
See Also: → 1626612

Stupid question from an outsider: why not combine both ideas and do the presumably costly sort only every now and then?

  1. I.e. sort only every other time, but make sure that new entries are added to safer places.
  2. Make sure that a minimum number of purges happen, regardless of high-priority interruptions.

This avoids the strange sorting which might cause problems in other situations.

I do not know about the memoryLimit thing; it might or might not be good.

Go! I am afraid this particular bug has bugged me since 2010.

Attachment #9155964 - Attachment is obsolete: true

Let's see how this works with the second (purge > 20) patch. Occasional sorting turned out to still happen too often, and looking at the array shows it's scattered too much. If we manage to keep the array relatively short, sorting may be kept fast. I also put a 10% reserve on the memory limit, so we purge a little bit more, specifically to save sorting time.
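The two tweaks mentioned here can be sketched as tiny helpers. The constants (purge at least 20 entries before honoring a yield; aim 10% below the configured memory limit) come from the comment, but the helper names and shapes are invented for illustration and are not the landed code.

```cpp
#include <cstdint>

// Hypothetical helper: target 10% below the configured limit, so each run
// trims a bit extra and the next run starts with headroom (saving sort time).
uint32_t EffectiveMemoryLimit(uint32_t configuredLimit) {
  return configuredLimit - configuredLimit / 10;
}

// Hypothetical helper: the purge loop may yield to a higher-priority cache
// event only after at least 20 entries have been evicted in this run.
bool MayYield(uint32_t purgedThisRun, bool higherPriorityEventPending) {
  return higherPriorityEventPending && purgedThisRun >= 20;
}
```

As the following comments point out, a fixed "at least 20" floor may still be too small when the 4-second throttle lets many entries accumulate between runs.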

(In reply to Honza Bambas (:mayhemer) from comment #25)
> Let's see how this works with the second (purge > 20) patch. Occasional
> sorting turned out to still happen too often, and looking at the array shows
> it's scattered too much. If we manage to keep the array relatively short,
> sorting may be kept fast. I also put a 10% reserve on the memory limit, so
> we purge a little bit more, specifically to save sorting time.

Why exactly 20? Is it enough? The minimum delay between purges is 4 seconds and I can imagine that way more entries could be added in 4 seconds.

(In reply to Frank Nestel from comment #24)
> Go! I am afraid this particular bug has bugged me since 2010.

That's unlikely because this code landed in 2015 :)

(In reply to Michal Novotny [:michal] from comment #26)
> (In reply to Honza Bambas (:mayhemer) from comment #25)
> > Let's see how this works with the second (purge > 20) patch. Occasional
> > sorting turned out to still happen too often, and looking at the array
> > shows it's scattered too much. If we manage to keep the array relatively
> > short, sorting may be kept fast. I also put a 10% reserve on the memory
> > limit, so we purge a little bit more, specifically to save sorting time.
>
> Why exactly 20? Is it enough? The minimum delay between purges is 4 seconds
> and I can imagine that way more entries could be added in 4 seconds.

Good point about the 4 seconds, I forgot about it. Then it should be more, yes. What about trying to reduce the array to some specific length only? I.e., put a kind of limit on the array length?

> That's unlikely because this code landed in 2015 :)

OK. What was the previous code like? Not a serious question, of course. Thanks for correcting me.
I am a poor bug reporter: I have been reporting memory leaks for about 7+ years and have never convinced anyone to solve even one. I am not back to Firefox yet, but I would love to be :(

  1. You could still go for even rarer sorting (you had that 0.1 tuning factor up there).
  2. That 0.9 thing is of course also a way to reduce frequency.
  3. I feel that in a loaded environment (OK, my browser) there will be many fresh objects, and the balance between trashing old and new entries is critical. It is unfortunate that there is not at least some LRU policy fallback. But I understand this might be beyond this fix.

(In reply to Honza Bambas (:mayhemer) from comment #27)
> Good point about the 4 seconds, I forgot about it. Then it should be more,
> yes. What about trying to reduce the array to some specific length only?
> I.e., put a kind of limit on the array length?

Sounds like a good approach. This would avoid infinite growth of the array.
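The length-cap idea agreed on here, which the eventual patch implemented with a cap of 2000, can be sketched as follows. The names, the bare frecency array, and the helper split are hypothetical simplifications; the real logic lives in CacheStorageService with considerably more bookkeeping.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// The landed patch used 2000 as the cap (an arbitrary but tested number).
constexpr std::size_t kPoolEntriesLimit = 2000;

// Hypothetical helper: yielding to higher-priority cache I/O events is
// permitted only once the pool is already at or under the cap, so a huge
// backlog can no longer starve the purge indefinitely.
bool MayInterruptPurging(std::size_t poolSize, bool higherPriorityEventPending) {
  return higherPriorityEventPending && poolSize <= kPoolEntriesLimit;
}

// Hypothetical helper: evict lowest-frecency entries, uninterrupted, until
// the pool is within the cap. Returns the number of entries evicted. This
// also bounds the cost of the sort on subsequent runs, since the array can
// never grow without limit again.
std::size_t PurgeToPoolLimit(std::vector<double>& frecencies) {
  std::sort(frecencies.begin(), frecencies.end());  // least valuable first
  std::size_t evicted = 0;
  while (frecencies.size() > kPoolEntriesLimit) {
    frecencies.erase(frecencies.begin());
    ++evicted;
  }
  return evicted;
}
```

This addresses both halves of the diagnosis: the purge can no longer be rescheduled forever without evicting anything, and the array stays short enough that the sort remains cheap.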

Attachment #9157621 - Attachment description: Bug 1487101 - Purge at least 20 HTTP cache entries from the memory pool regardless if there is a higher priority HTTP cache event pending, r=michal → Bug 1487101 - Do not interrupt purging of HTTP cache entries from memory by frecency until the pool containts at most 2000 entries, r=michal

Opening a Firefox instance and opening quite a few tabs (70 maybe), I see 1323 cache entries in about:cache (and 16487 on disk). Somewhat irritating: nearly all the in-memory cached things are actually "expired immediately". Can this be used to tune sorting? I.e., is it expired? Looks like this isn't my leak either. Sigh. Anyway, leaks must be fixed.

Question/Comments.

  1. Where do the 2000 come from? I have a one-day-old Firefox with exactly 80 tabs and 1422 cache entries. Cache limit is 32768 KiB, cache use is 17295 KiB. The limit is only expected to trigger way beyond 2000 entries.
  2. The code does not really guarantee the cache is shrunk to 2000, since it could contain many small bits and the loop can exit earlier.
  3. Sorting the zero-access entries to the end protects both the new entries in the cache and old entries that never get used. This will fill the cache with useless entries, as used entries will be killed before unused ones.

I have missed that this is about the disk cache. Here my Firefox looks like:

    Number of entries: 15222
    Maximum storage size: 1048576 KiB
    Storage in use: 587576 KiB

2000 sounds brutally low; cleanup will kill many heavily used entries and keep all the unused ones.

On the disk cache there is also an expiry date; use this as a (sub)criterion for sorting, to clear at least the unused expired stuff.
Pushed by honzab.moz@firemni.cz:
https://hg.mozilla.org/integration/autoland/rev/e13422469a6c
Do not interrupt purging of HTTP cache entries from memory by frecency until the pool containts at most 2000 entries, r=michal,necko-reviewers
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla79

OK, thanks, I'll disable the disk cache.

This is nothing personal. But what does this patch do? Why not give users the benefit of the time and ideas invested?

The patch is obviously fixing the worst case, but it has great potential to damage the whole purpose of a cache.

No vision about your users; this is why Firefox is on the decline.

(In reply to Frank Nestel from comment #31)
> 1. Where do the 2000 come from? I have a one-day-old Firefox with exactly 80
> tabs and 1422 cache entries. Cache limit is 32768 KiB, cache use is 17295
> KiB. The limit is only expected to trigger way beyond 2000 entries.

We have a so-called "pool" of entries we keep in memory (only headers) while they are in direct use or expected to be reused again (on follow-on navigation, for instance). The bug was that cleaning this pool was not working when it grew beyond some limit.

Hence, I added a limit that keeps cleaning this pool until we have only 2000 entries. It's a totally arbitrary number, but it seems reasonable according to experience and testing.

> 2. The code does not really guarantee the cache is shrunk to 2000, since it
> could contain many small bits and the loop can exit earlier.

You misunderstand what the patch does.

> 3. Sorting the zero-access entries to the end protects both the new entries
> in the cache and old entries that never get used. This will fill the cache
> with useless entries, as used entries will be killed before unused ones.

Again, you don't understand the code.

(In reply to Frank Nestel from comment #32)
> I have missed that this is about the disk cache. Here my Firefox looks like:
>   Number of entries: 15222
>   Maximum storage size: 1048576 KiB
>   Storage in use: 587576 KiB
> 2000 sounds brutally low; cleanup will kill many heavily used entries and
> keep all the unused ones.

What you are looking at is how many entries are on your disk, not how many are in the memory pool.

(In reply to Frank Nestel from comment #33)
> On the disk cache there is also an expiry date; use this as a (sub)criterion
> for sorting, to clear at least the unused expired stuff.

It's not that simple, because there are situations where we want to use expired (stale) content and it's totally valid. I won't go into details, because it's beyond the scope of this bug.

(In reply to Frank Nestel from comment #36)
> OK, thanks, I'll disable the disk cache.

I don't follow this offensive reaction at all.

(In reply to Frank Nestel from comment #37)
> This is nothing personal. But what does this patch do? Why not give users
> the benefit of the time and ideas invested?

We always do, if it has the technical quality.

> The patch is obviously fixing the worst case, but it has great potential to
> damage the whole purpose of a cache.

Again, you misunderstand the patch.

> No vision about your users; this is why Firefox is on the decline.

Again, an offensive reaction I don't follow.

Thank you for your answer, though it fails to convince me. Yes, I might have missed some presumed magic going on, but you didn't attempt to correct that.

Simple question: From observed cache of 13000 entries, under 2000 well used and over 11000 with no access recorded according to about:cache, what would your patch remove?

(In reply to Frank Nestel from comment #39)
> Thank you for your answer, though it fails to convince me. Yes, I might have
> missed some presumed magic going on, but you didn't attempt to correct that.
>
> Simple question: From an observed cache of 13000 entries, under 2000 well
> used and over 11000 with no access recorded according to about:cache, what
> would your patch remove?

Nothing. It doesn't do anything with on-disk entries. And I believe that's now clear from my comment explaining what the patch does.

Please let's stop polluting this bug now. Let's see how the patch works in the field and if leaks in this code persist, let's look at it again.

Thanks.

QA Whiteboard: [qa-79b-p2]
See Also: → 1959090
No longer blocks: 1541106