Closed Bug 1487101 Opened 7 years ago Closed 5 years ago

CacheStorageService::PurgeOverMemoryLimit() doesn't work correctly under specific circumstances

Categories

(Core :: Networking: Cache, defect, P3)


Tracking


RESOLVED FIXED
mozilla79
Tracking Status
firefox79 --- fixed

People

(Reporter: mconley, Assigned: mayhemer)

References

Details

(Whiteboard: [MemShrink:P2][gfx-noted][necko-triaged])

Attachments

(5 files, 1 obsolete file)

Attached file Meddler script

This Meddler script was used to produce the test case, using this tool: https://webdbg.com/meddler/

We're going to need somebody to reproduce this and gather a memory report from about:memory, I think.
Whiteboard: [MemShrink]
(In reply to Eric Lawrence (@ericlaw) from comment #3)
> Created attachment 9004893 [details]
> Win10 x64 v61.0.2 memory-report.json.gz

Thanks, ericlaw. Just so we're clear, was this memory report gathered from a session of Firefox that was at that moment consuming an unusual amount of memory due to this bug? I ask because a quick glance at the memory report doesn't show unusually high consumption from any of the processes.
Flags: needinfo?(ericlaw1979)
This memory report was from a Firefox instance that was using ~360mb of "Active Private Working Set" in Windows. That number had been steadily creeping up since the process was launched, and now stands at 427mb ten minutes later (memory report attached).
I should clarify further: The 427mb and growing process was the manager/browser Firefox.exe process (no command line args) while the most active Content process (presumably representing the only tab) is growing more slowly and currently stands at 310mb.
Okay, great - thanks for the context, ericlaw. Hey mccr8, when you have a sec, is this enough to go on to know if this is an imagelib thing? Necko cache might also be a factor here. I'm also seeing a bit of heap unclassified...
Flags: needinfo?(continuation)
I note from the memory report that the reported image footprint is minimal. Are we certain it was captured at the right time?
(In reply to Andrew Osmond [:aosmond] from comment #9)
> I note from the memory report that the reported image footprint is minimal.
> Are we certain it was captured at the right time?

I think these recent memory reports are from a different, shorter session, where ericlaw is still seeing a slow creep-up of memory usage, but where it hasn't yet reached the levels seen in comment 0.
Yeah, the original screenshot was an overnight (~7 hours) run; the original memory capture attached here was after running the repro for 10 minutes or so. After 2 hours, the browser process is at 749mb of active private working set and the tab process is ~150mb (Peaked at 1.1gb)
(In reply to Eric Lawrence (@ericlaw) from comment #11)
> Yeah, the original screenshot was an overnight (~7 hours) run; the original
> memory capture attached here was after running the repro for 10 minutes or
> so.
>
> After 2 hours, the browser process is at 749mb of active private working set
> and the tab process is ~150mb (Peaked at 1.1gb)

Given the image footprint is still very low in this report, I imagine images are responsible for the peak, because we have a 1 minute timer that expires unused entries from the cache.

In the parent process, the heap unclassified is concerning at 255MB, along with the network at 332MB. Presumably this is caching for all of the images we received from the network that we haven't released.

In the content process, it seems much more evenly spread. It is still holding onto the generated '<div data-snippet-id="XXX" ...' strings in the JS memory; presumably those could have been freed earlier (not sure how many images are displayed, maybe this is reasonable).
Priority: -- → P3
Whiteboard: [MemShrink] → [MemShrink][gfx-noted]
Hey valentin, are you the right person to ask about the Necko caching memory usage here? Anything actionable from these memory reports?
Flags: needinfo?(valentin.gosu)
(In reply to Mike Conley (:mconley) (:⚙️) from comment #14)
> Hey valentin, are you the right person to ask about the Necko caching memory
> usage here? Anything actionable from these memory reports?

Unless there is a random memory leak happening, I don't think the necko memory cache could cause this. From what I recall it has a maximum size, which you can inspect at about:cache, computed here [1]. Michal may have a better idea of whether the memory cache could pose a problem.

[1] https://searchfox.org/mozilla-central/rev/55da592d85c2baf8d8818010c41d9738c97013d2/netwerk/cache2/CacheObserver.cpp#196
Flags: needinfo?(valentin.gosu) → needinfo?(michal.novotny)
It's definitely not normal that the cache index occupies 100MB. Right now I don't have a Windows machine available, so I cannot reproduce it locally. Eric, can you please check how big the index file is on disk? The file is in the cache2 directory; you can see the full path on the about:cache page. If the cache entries are leaked, there also might be a lot of files in the doomed directory.
Flags: needinfo?(michal.novotny) → needinfo?(ericlaw1979)
I let the repro run for ~7 hours and the working set ballooned to 1.1gb. I then navigated to about:blank and it remained pretty constant for the minute that I watched. Now, 12 hours later, it's down to 546mb.

about:cache has the following to say:

memory
  Number of entries: 1
  Maximum storage size: 32768 KiB
  Storage in use: 6 KiB
  Storage disk location: none, only stored in memory

disk
  Number of entries: 162651
  Maximum storage size: 358400 KiB
  Storage in use: 357233 KiB
  Storage disk location: C:\Users\ericlaw\AppData\Local\Mozilla\Firefox\Profiles\q58h7lzw.default\cache2

The index file in the cache2 folder is 6.82 MB (7,157,188 bytes). The doomed folder is empty. The entries folder has 162,647 files, of which 162,613 appear to be the 2-3kb JPEG images used in the repro.
Flags: needinfo?(ericlaw1979)
It looks like the relevant people have investigated, so I'm going to clear my needinfo.
Flags: needinfo?(continuation)
Whiteboard: [MemShrink][gfx-noted] → [MemShrink:P2][gfx-noted]
:aosmond, do you have any ideas for next steps here?
Flags: needinfo?(aosmond)
This problem is not related to images. It's a generic problem when a lot of small resources are being cached.

What happens is that CacheStorageService::MemoryPool::PurgeByFrecency() is called from PurgeOverMemoryLimit(), and it sorts the array because frecencyNeedsSort is always true in CacheStorageService::MemoryPool::PurgeOverMemoryLimit(). Sorting takes some time when there are a lot of entries, and then CacheIOThread::YieldAndRerun() returns true because events with higher priority were enqueued in the meantime. Nothing is purged and the event is rescheduled for execution. When CacheStorageService::PurgeOverMemoryLimit() is executed again, it returns early because it's too soon since the last purging. The result is that we try to purge entries only once every 4 seconds and usually nothing is purged, so used memory just grows.

Honza, can you take this bug? This is your code.

Regarding the sorting, consider doing something similar to what CacheIndex::FrecencyArray does. New entries are always appended at the end of the array because we expect that new entries won't be evicted soon. The array is sorted occasionally in CacheIndex::FrecencyArray::SortIfNeeded(), depending on how unsorted the array is.
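The failure mode described above can be condensed into a small model. This is a hypothetical, simplified sketch; the struct and method names merely echo the real ones, and none of this is the actual Necko code. The point it illustrates: the sort eats the time slice, the pending higher-priority event makes the loop bail out, and nothing is ever purged.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of the bug (not the real Necko code).
struct Entry {
  double frecency;
  uint32_t size;
};

struct MemoryPool {
  std::vector<Entry> entries;
  bool frecencyNeedsSort = true;  // in the buggy code this was effectively always true
  uint32_t usage = 0;

  // Returns true when purging was interrupted before evicting anything,
  // modeling CacheIOThread::YieldAndRerun() reporting pending work.
  bool PurgeByFrecency(uint32_t limit, bool higherPriorityEventPending) {
    if (frecencyNeedsSort) {
      // With hundreds of thousands of entries this sort alone consumes
      // the time slice, after which higher-priority events are pending...
      std::sort(entries.begin(), entries.end(),
                [](const Entry& a, const Entry& b) {
                  return a.frecency < b.frecency;
                });
      frecencyNeedsSort = false;
    }
    while (usage > limit && !entries.empty()) {
      if (higherPriorityEventPending) {
        return true;  // ...so we bail out here, having purged nothing
      }
      usage -= entries.front().size;
      entries.erase(entries.begin());
    }
    return false;
  }
};
```

Combined with the 4-second throttle in PurgeOverMemoryLimit(), every rescheduled run repeats this pattern and memory only grows.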
Component: ImageLib → Networking: Cache
Flags: needinfo?(aosmond) → needinfo?(honzab.moz)
Summary: We might be leaking images that reload many times over a long period → CacheStorageService::PurgeOverMemoryLimit() doesn't work correctly under specific circumstances
Thanks for investigating this, Michal. Yep, this looks like my bug.
Assignee: nobody → honzab.moz
Flags: needinfo?(honzab.moz)
Whiteboard: [MemShrink:P2][gfx-noted] → [MemShrink:P2][gfx-noted][necko-triaged]
Blocks: 1541106
See Also: → 1626612

Stupid question from an outsider: why not combine both ideas and do the presumably costly sort only every now and then?

  1. I.e. sort only every other time, but make sure that new entries are added to safer places.
  2. Make sure that a minimum number of purges happen, regardless of high-priority interruptions.

This avoids the strange sorting which might cause problems in other situations.

I do not know about the memoryLimit thing; it might or might not be good.

Go! I am afraid this particular bug has bugged me since 2010.

Attachment #9155964 - Attachment is obsolete: true

Let's see how this works with the second (purge > 20) patch. Occasional sorting turned out to still happen too often, and looking at the array shows it's scattered too much. If we manage to keep the array relatively short, sorting may be kept fast. I also put a 10% reserve on the memory limit, so we purge a little bit more, specifically to save sorting time.
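The two tweaks mentioned here can be sketched as tiny helpers. The constants (purge at least 20 entries before honoring a yield; aim 10% below the configured memory limit) come from the comment, but the helper names and shapes are invented for illustration and are not the landed code.

```cpp
#include <cstdint>

// Hypothetical helper: target 10% below the configured limit, so each run
// trims a bit extra and the next run starts with headroom (saving sort time).
uint32_t EffectiveMemoryLimit(uint32_t configuredLimit) {
  return configuredLimit - configuredLimit / 10;
}

// Hypothetical helper: the purge loop may yield to a higher-priority cache
// event only after at least 20 entries have been evicted in this run.
bool MayYield(uint32_t purgedThisRun, bool higherPriorityEventPending) {
  return higherPriorityEventPending && purgedThisRun >= 20;
}
```

As the following comments point out, a fixed "at least 20" floor may still be too small when the 4-second throttle lets many entries accumulate between runs.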

(In reply to Honza Bambas (:mayhemer) from comment #25)
> Let's see how this works with the second (purge > 20) patch. Occasional
> sorting turned out to still happen too often, and looking at the array shows
> it's scattered too much. If we manage to keep the array relatively short,
> sorting may be kept fast. I also put a 10% reserve on the memory limit, so
> we purge a little bit more, specifically to save sorting time.

Why exactly 20? Is it enough? The minimum delay between purges is 4 seconds and I can imagine that way more entries could be added in 4 seconds.

(In reply to Frank Nestel from comment #24)
> Go! I am afraid this particular bug has bugged me since 2010.

That's unlikely because this code landed in 2015 :)

(In reply to Michal Novotny [:michal] from comment #26)
> (In reply to Honza Bambas (:mayhemer) from comment #25)
> > Let's see how this works with the second (purge > 20) patch. Occasional
> > sorting turned out to still happen too often, and looking at the array
> > shows it's scattered too much. If we manage to keep the array relatively
> > short, sorting may be kept fast. I also put a 10% reserve on the memory
> > limit, so we purge a little bit more, specifically to save sorting time.
>
> Why exactly 20? Is it enough? The minimum delay between purges is 4 seconds
> and I can imagine that way more entries could be added in 4 seconds.

Good point about the 4 seconds, I forgot about it. Then it should be more, yes. What about trying to reduce the array to some specific length only? I.e., put a kind of limit on the array length?

> That's unlikely because this code landed in 2015 :)

OK. What was the previous code like? Not a serious question, of course. Thanks for correcting me.
I am a poor bug reporter: I have been reporting memory leaks for about 7+ years and have never convinced anyone to solve even one. I am not back to Firefox yet, but I would love to be :(

  1. You could still go for even rarer sorting (you had that 0.1 tuning factor up there).
  2. That 0.9 thing is of course also a way to reduce frequency.
  3. I feel that in a loaded environment (OK, my browser) there will be many fresh objects, and the balance between trashing old and new entries is critical. It is unfortunate that there is not at least some LRU policy fallback. But I understand this might be beyond this fix.

(In reply to Honza Bambas (:mayhemer) from comment #27)
> Good point about the 4 seconds, I forgot about it. Then it should be more,
> yes. What about trying to reduce the array to some specific length only?
> I.e., put a kind of limit on the array length?

Sounds like a good approach. This would avoid infinite growth of the array.
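The length-cap idea agreed on here, which the eventual patch implemented with a cap of 2000, can be sketched as follows. The names, the bare frecency array, and the helper split are hypothetical simplifications; the real logic lives in CacheStorageService with considerably more bookkeeping.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// The landed patch used 2000 as the cap (an arbitrary but tested number).
constexpr std::size_t kPoolEntriesLimit = 2000;

// Hypothetical helper: yielding to higher-priority cache I/O events is
// permitted only once the pool is already at or under the cap, so a huge
// backlog can no longer starve the purge indefinitely.
bool MayInterruptPurging(std::size_t poolSize, bool higherPriorityEventPending) {
  return higherPriorityEventPending && poolSize <= kPoolEntriesLimit;
}

// Hypothetical helper: evict lowest-frecency entries, uninterrupted, until
// the pool is within the cap. Returns the number of entries evicted. This
// also bounds the cost of the sort on subsequent runs, since the array can
// never grow without limit again.
std::size_t PurgeToPoolLimit(std::vector<double>& frecencies) {
  std::sort(frecencies.begin(), frecencies.end());  // least valuable first
  std::size_t evicted = 0;
  while (frecencies.size() > kPoolEntriesLimit) {
    frecencies.erase(frecencies.begin());
    ++evicted;
  }
  return evicted;
}
```

This addresses both halves of the diagnosis: the purge can no longer be rescheduled forever without evicting anything, and the array stays short enough that the sort remains cheap.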

Attachment #9157621 - Attachment description: Bug 1487101 - Purge at least 20 HTTP cache entries from the memory pool regardless if there is a higher priority HTTP cache event pending, r=michal → Bug 1487101 - Do not interrupt purging of HTTP cache entries from memory by frecency until the pool containts at most 2000 entries, r=michal

Opening a Firefox instance and opening quite a few tabs (70 maybe), I see 1323 cache entries in about:cache (and 16487 on disk). Somewhat irritating: nearly all the in-memory cached things are actually "expired immediately". Can this be used to tune sorting? I.e., is it expired? Looks like this isn't my leak either. Sigh. Anyway, leaks must be fixed.

Question/Comments.

  1. Where do the 2000 come from? I have a one-day-old Firefox with exactly 80 tabs and 1422 cache entries. Cache limit is 32768 KiB, cache use is 17295 KiB. The limit is only expected to trigger way beyond 2000 entries.
  2. The code does not really guarantee the cache is shrunk to 2000, since it could contain many small bits and the loop can exit earlier.
  3. Sorting the zero-access entries to the end protects both the new entries in the cache and old entries that never get used. This will fill the cache with useless entries, as used entries will be killed before unused ones.

I have missed that this is about the disk cache. Here my Firefox looks like:

    Number of entries: 15222
    Maximum storage size: 1048576 KiB
    Storage in use: 587576 KiB

2000 sounds brutally low; cleanup will kill many heavily used entries and keep all the unused ones.

On the disk cache there is also an expiry date; use this as a (sub)criterion for sorting, to clear at least the unused expired stuff.
Pushed by honzab.moz@firemni.cz:
https://hg.mozilla.org/integration/autoland/rev/e13422469a6c
Do not interrupt purging of HTTP cache entries from memory by frecency until the pool containts at most 2000 entries, r=michal,necko-reviewers
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla79

OK, thanks, I'll disable the disk cache.

This is nothing personal. But what does this patch do? Why not give users the benefit of the time and ideas invested?

The patch is obviously fixing the worst case, but it has great potential to damage the whole purpose of a cache.

No vision about your users; this is why Firefox is on the decline.

(In reply to Frank Nestel from comment #31)
> 1. Where do the 2000 come from? I have a one-day-old Firefox with exactly 80
> tabs and 1422 cache entries. Cache limit is 32768 KiB, cache use is 17295
> KiB. The limit is only expected to trigger way beyond 2000 entries.

We have a so-called "pool" of entries we keep in memory (only headers) while they are in direct use or expected to be reused again (on follow-on navigation, for instance). The bug was that cleaning this pool was not working when it grew beyond some limit.

Hence, I added a limit that keeps cleaning this pool until we have only 2000 entries. It's a totally arbitrary number, but it seems reasonable according to experience and testing.

> 2. The code does not really guarantee the cache is shrunk to 2000, since it
> could contain many small bits and the loop can exit earlier.

You misunderstand what the patch does.

> 3. Sorting the zero-access entries to the end protects both the new entries
> in the cache and old entries that never get used. This will fill the cache
> with useless entries, as used entries will be killed before unused ones.

Again, you don't understand the code.

(In reply to Frank Nestel from comment #32)
> I have missed that this is about the disk cache. Here my Firefox looks like:
>   Number of entries: 15222
>   Maximum storage size: 1048576 KiB
>   Storage in use: 587576 KiB
> 2000 sounds brutally low; cleanup will kill many heavily used entries and
> keep all the unused ones.

What you are looking at is how many entries are on your disk, not how many are in the memory pool.

(In reply to Frank Nestel from comment #33)
> On the disk cache there is also an expiry date; use this as a (sub)criterion
> for sorting, to clear at least the unused expired stuff.

It's not that simple, because there are situations where we want to use expired (stale) content and it's totally valid. I won't go into details, because it's beyond the scope of this bug.

(In reply to Frank Nestel from comment #36)
> OK, thanks, I'll disable the disk cache.

I don't follow this offensive reaction at all.

(In reply to Frank Nestel from comment #37)
> This is nothing personal. But what does this patch do? Why not give users
> the benefit of the time and ideas invested?

We always do, if it has the technical quality.

> The patch is obviously fixing the worst case, but it has great potential to
> damage the whole purpose of a cache.

Again, you misunderstand the patch.

> No vision about your users; this is why Firefox is on the decline.

Again, an offensive reaction I don't follow.

Thank you for your answer, though it fails to convince me. Yes, I might have missed some presumed magic going on, but you didn't attempt to correct that.

Simple question: From observed cache of 13000 entries, under 2000 well used and over 11000 with no access recorded according to about:cache, what would your patch remove?

(In reply to Frank Nestel from comment #39)
> Thank you for your answer, though it fails to convince me. Yes, I might have
> missed some presumed magic going on, but you didn't attempt to correct that.
>
> Simple question: From an observed cache of 13000 entries, under 2000 well
> used and over 11000 with no access recorded according to about:cache, what
> would your patch remove?

Nothing. It doesn't do anything with on-disk entries. And I believe that's now clear from my comment explaining what the patch does.

Please let's stop polluting this bug now. Let's see how the patch works in the field and if leaks in this code persist, let's look at it again.

Thanks.

QA Whiteboard: [qa-79b-p2]
See Also: → 1959090
No longer blocks: 1541106