Closed Bug 81640 Opened 23 years ago Closed 14 years ago

Lower threshold for large cache entry eviction

Categories

(Core :: Networking: Cache, defect, P1)

Tracking

RESOLVED FIXED
mozilla1.4alpha
Tracking Status
blocking2.0 --- beta7+

People

(Reporter: tenthumbs, Assigned: byronm)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 4 obsolete files)

Mozilla should have a user-configurable limit on the maximum size of an
object that can be permanently cached. Retrieving a really large object
forces other objects out of the cache, which may not be reasonable,
particularly for users on a slow link. For example, I often visit sites
which have the same image in various sizes. There is a small browse
version and a hi-res version. You view the browse version and download
the hi-res version if it's interesting. The hi-res version is often a
4MB jpeg or even a 40MB tiff and it is _not_ meant for viewing in a
browser. It is irritating to see mozilla flush almost everything else in 
its cache just to accommodate these beasts.

For symmetry, there probably should also be a minimum size limit. I can 
imagine a user on a fast network retrieving small objects from a proxy 
server or something without storing them in mozilla's disk cache.

If it were me, I would set a maximum size of 4MB and a minimum size of 0.

Just some ideas.
Target Milestone: --- → mozilla0.9.2
Target Milestone: mozilla0.9.2 → mozilla0.9.3
Priority: -- → P4
Doesn't look like this is getting fixed before the freeze tonight.
Pushing out a milestone.  Please correct if I'm mistaken.
Target Milestone: mozilla0.9.3 → mozilla0.9.4
Target Milestone: mozilla0.9.4 → mozilla1.0
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 
(you can query for this string to delete spam or retrieve the list of bugs I've 
moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
Darin, Brian: Do we need a minimum cache entry size limit?  I don't think so,
but I could be convinced otherwise.

Also, would it be useful to have two separate maximums, one for the disk cache
and one for the memory cache?  It makes the cache more tweakable, but more complex.
Severity: enhancement → normal
Status: NEW → ASSIGNED
Priority: P4 → P1
Target Milestone: mozilla1.0.1 → mozilla1.4alpha
Minimum disk limits are useful if you're using a nearby proxy,
particularly if it's on the same machine. Reading from the proxy's disk 
just to put it in mozilla's disk cache is probably inefficient.

You probably want separate limits for the memory and disk caches. Again,
proxies are the issue. If you have a local proxy then it may well be
faster to have a large memory cache with a large limit and small values
for the slower disk cache.

Whatever you do, do it carefully. Right now I can't use a plugin that 
requires a disk copy of large files because the cache tosses it before 
the plugin can use it. It would only get worse.
Tenthumbs, do you have a bug on that plugin problem?  It sounds like the
plugin shouldn't be depending on the cache for that file.
Gordon, I filed bug 195634 against plugins. To be honest, what are the plugin
people supposed to do? Some objects are inherently random access so they must be
on disk. If they use their own copy then you may well have two disk copies which
may not even fit. (Consider a plugin that does magic with a cdrom iso image.
That could mean using 2*650MB of disk space.)

Right now mozilla has a long-term disk cache and a short-term memory cache. What
it needs is a short-term disk cache as well.
Fixing this would basically make all the "Download Manager downloads fill up
the cache" bugs go away, as well as making it less likely that the cache would
over-run a disk partition.
QA Contact: tever → cacheqa
Assignee: gordon → nobody
Status: ASSIGNED → NEW
Large objects blowing away our small 50 MB cache is definitely an issue that should be fixed. The comments in bug 193911 suggest that simply making the cache larger might be enough to solve the issue. If not, setting a max size could be a good idea.
Blocks: http_cache
Note: currently, YouTube videos, saving large images, or downloading an Ubuntu ISO no longer blows away the cache.
YouTube streams its videos, which are not stored in the cache.
For downloads, as soon as the chunk in the cache grows larger than half the cache, it is doomed from the cache.
Bug 55307, bug 55690 and bug 69938 are about downloads being saved first in the cache, and only flushed from the cache once the downloaded chunk is larger than half the cache. When the cache is enlarged, we could consider reducing the limit for large files in the cache to less than half the total.
This is the 'maximum size limit' asked for in this bug.
Currently, it is 'hardcoded' in nsDiskCacheDevice.cpp as follows:
770     // If the new size is larger than max. file size or larger than
771     // half the cache capacity (which is in KiB's), doom the entry and abort
772     if ((newSize > kMaxDataFileSize) || (newSizeK > mCacheCapacity/2)) {

The mCacheCapacity/2 should be configurable, so that one can change the limit on downloads saved in the cache... Note that reducing it will also prevent caching of any object larger than that limit.
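
For illustration, here is a minimal standalone sketch of that check with the hardcoded divisor pulled out into a parameter (the function and parameter names are mine, not Mozilla's actual API):

    #include <cstdint>

    // Sketch of the doom-the-entry condition quoted above, with the /2
    // divisor made configurable. Sizes follow nsDiskCacheDevice.cpp:
    // newSize is in bytes, the cache capacity is in KiB.
    bool ShouldDoomEntry(int64_t newSize,          // bytes
                         int64_t maxDataFileSize,  // bytes (64MB at this point)
                         int64_t cacheCapacityK,   // KiB
                         int64_t largeFileDivisor) // hypothetical pref; currently 2
    {
        int64_t newSizeK = newSize / 1024;
        return (newSize > maxDataFileSize) ||
               (newSizeK > cacheCapacityK / largeFileDivisor);
    }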
Renaming and focusing on large files.

I think we need to set a "large file" threshold size (probably a fixed size, not half the cache) and either 1) not store such files in the cache; or 2) come up with an algorithm that evicts large files sooner than naive FIFO, at least when other large files are loaded (i.e. it'd be nice to cache a video in case the user wants to watch it again, but we don't want browsing 5 videos to blow away everything previously in the cache. They should evict each other instead).

Two solutions that come to mind are 1) keeping a separate "large files" area of the cache; or 2) putting large files at a lower-than-default priority in our scheme from bug 578541, and limiting the % of cache that that class can take up.

Bz/Biesi:  do we have a sense of whether it's ok to simply not cache files above a certain file size?
Summary: Maximum and minimum size limits for cache objects → Keep large files from blowing away 1/2 the HTTP cache
Assignee: nobody → byronm
(In reply to comment #12)
> Renaming and focusing on large files.
> 
> I think we need to set a "large file" threshold size (probably a fixed
> size, not half the cache) and either 1) not store such files in the cache; or

I am working on a patch that starts to solve this problem. 

As Alfred pointed out in Comment 11, we place a maximum size constraint on entries of the cache. Unfortunately, we do not check the size of an entry up front. Blindly caching large files can lead to evicting many entries from the cache in order to accommodate the large file, only to ultimately evict the large file itself, because it is greater than either a) the maximum allowed file size (64MB) or b) half the disk cache's capacity (125 MB by default, once bug 193911 lands). 

A case where this issue commonly occurs is on YouTube. YouTube videos tend to be large in size, and the videos end up in the cache. As a concrete example, I witnessed one large YouTube video (a Google Tech Talk) evict approximately 500 entries from the cache, only to have the YouTube video itself be evicted because it was greater than half the disk cache's capacity. I know it was mentioned earlier in this thread that YouTube videos are not stored in the cache, but I have definitely observed them being bound to the disk cache.

Because most servers send Content-Length in the http headers, we should be able to tell up front what the size of the cache entry is going to be. If the Content-Length is greater than diskCacheCapacity/2 or MaxFileSize (the two constraints mentioned above), then we should not even start caching the data, and thereby avoid throwing away other cache entries unnecessarily (see the sketch at the end of this comment). 

I am implementing this logic in nsHttpChannel now, and hope to have a patch up shortly. 

> 2) come up with an algorithm that evicts large files sooner than naive FIFO, at
> least when other large files are loaded (i.e. it'd be nice to cache a video in
> case the user wants to watch it again, but we don't want browsing 5 videos to
> blow away everything previously in the cache. They should evict each other
> instead).
> 
> Two solutions that come to mind are 1) keeping a separate "large files" area of
> the cache; or 2) putting large files at a lower-than-default priority in our
> scheme from bug 578541, and limiting the % of cache that that class can take
> up.
> 

I agree that a more intelligent eviction algorithm is needed. It is true that we may want to cache larger files, like a YouTube video, so that the user can watch it again without having to completely re-buffer. But we do not want to evict everything else in order to make this happen. Michal and I were talking about some ways to accommodate this. One option is to tag all large files with a flag, or a lower priority. Then, if adding a new large file to the cache requires eviction, we will only add it if there are other large files available to evict. Honestly, though, if we are going to do this, I suggest a more general priority scheme, so that it can be tweaked as we experiment with other options. This is more or less the suggestion in bug 578541 anyways.

Regardless, until we completely change the eviction algorithm, I think the max file size is necessary. And as long as we have it, we should *definitely* be making the size check before writing to the cache. So I think working on the patch that I spoke of earlier is worthwhile.

Please feel free to let me know what you think.
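
To make the proposed up-front check concrete, here is a standalone sketch (the names are illustrative only; the real patch would live in nsHttpChannel and use Mozilla's types):

    #include <cstdint>

    // If Content-Length already exceeds what the disk cache would ever
    // keep, skip caching instead of evicting entries for data that is
    // doomed anyway. A negative contentLength means the header was
    // absent, in which case we fall back to caching as usual.
    bool ShouldCacheResponse(int64_t contentLength,     // bytes, -1 if unknown
                             int64_t maxDataFileSize,   // bytes (64MB)
                             int64_t diskCacheCapacity) // bytes
    {
        if (contentLength < 0)
            return true; // unknown size: cache and let eviction sort it out
        return (contentLength <= maxDataFileSize) &&
               (contentLength <= diskCacheCapacity / 2);
    }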
Since it's by far the simplest (and thus fastest) fix, so long as nothing "bad" happens if we don't cache large files at all (bz? biesi? I assume it's fine), then let's just not store the file in the cache if file > size_threshold.

What's the right size?  We could start at 25 MB (which is the size we've been using to evict from cache).  But I'd actually be fine with smaller--how about 10 MB?  This is a great opportunity for the peanut gallery to chime in :)
(In reply to comment #14)
> This is a great opportunity for the peanut gallery to chime in :)

The other issue I would like others' thoughts on is what to do if the response headers do not include the Content-Length. I was thinking that in that case, we would just give up and cache the entry.
> nothing "bad" happens if we don't cache large files at all (bz? biesi?  I
> assume it's fine),

It's fine unless SetCacheAsFile was called, I'd think.  If that was called, the data needs to be cached.  Note that we may not know that information early enough; worth checking.
Maybe we could even base the decision on simulating what hit rate we'd get over various browsing histories (byron, your browsing history could be like Craig Venter's DNA!)
(In reply to comment #11)

> 772     if ((newSize > kMaxDataFileSize) || (newSizeK > mCacheCapacity/2)) {
> 

kMaxDataFileSize is set to 64MB. Our cache has now landed with a default size of 250MB. Do we want to change one or both of these values, to prevent large files from taking over the cache? 

Jason suggested dooming entries that get to be 5-10MB.
Let's keep this bug focused on changing kMaxDataFileSize to something smaller.  (I agree that checking Content-length and simply skipping caching for large files is a better solution, but biesi suggests it may be tricky. Let's open a new bug for it if we need it).

Re: what size to pick: on IRC biesi suggested 1MB or even 500KB.  

But before we just pick a number out of our collective backside, let's try the following test:   

If we set the eviction threshold to MAX, and the cache is full, and we start, say, browsing videos on youtube, we ought to see 1) the first video evict MAX worth of cache entries, then get evicted itself from cache. Then 2) future vids ought to reuse the MAX worth of empty space, then get evicted. In other words, compared with a smarter algorithm that checks Content-length and skips caching entirely if len > MAX, we ought to just "waste" MAX bytes in the cache. Provided MAX is a much smaller fraction of the cache than the current 1/2, that shouldn't be a big problem. (ex: a 2.5 MB or 5 MB limit for a 250 MB cache would "waste" 1-2% of cache vs the Content-length checking algorithm. We could probably live with 10 MB, or even larger.)

At this point I'm not confident that we'd actually see this behavior.  I wouldn't be surprised if we continue to blow away cache entries as new large files appear.  So let's run the test and see what happens.  If we see pathological behavior, the triage solution may be to set kMaxDataFileSize on the small side (assuming that helps) and then try to figure out why the cache isn't doing what it's supposed to.
Summary: Keep large files from blowing away 1/2 the HTTP cache → Lower threshold for large cache entry eviction
Out of curiosity, has anyone looked into what the maximum cache entry size on Chrome is?
Digging around in Chromium's source, it looks like they set the maximum cache entry size to 1/8th of whatever the cache size is set to. For those interested: 

Search for MaxFileSize(): http://src.chromium.org/viewvc/chrome/trunk/src/net/disk_cache/backend_impl.h?revision=53838&view=markup

Search for MaxFileSize(): http://src.chromium.org/viewvc/chrome/trunk/src/net/disk_cache/backend_impl.cc?revision=55034&view=markup  (Note that max_size_ is the size of the cache)

Search for EntryImpl::WriteDataImpl: http://src.chromium.org/viewvc/chrome/trunk/src/net/disk_cache/entry_impl.cc?view=markup 

Let me know if you think otherwise.
Darin confirmed that it is indeed 1/8th.
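
For reference, the rule described above boils down to something like the following (a simplification of what Chromium's BackendImpl::MaxFileSize() reportedly computes, not its actual code):

    #include <cstdint>

    // Chromium-style limit: the maximum entry size scales with the cache.
    int64_t MaxFileSize(int64_t cacheSizeBytes)
    {
        return cacheSizeBytes / 8; // e.g. a 250MB cache allows ~31MB entries
    }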
(In reply to comment #19)

> let's try the following test:   
> 
> If we set the eviction threshold to MAX, and the cache is full, and we start,
> say, browsing videos on youtube, we ought to see 1) the first video evict MAX
> worth of cache entries, then get evicted itself from cache.   Then 2) future
> vids ought to reuse the MAX worth of empty space, then get evicted.  

I ran this test, and I observed the cache doing the correct thing. I set a max cache entry size of 10MB. Starting with a full cache, a number of entries are first evicted in order to accommodate the >10MB entry. Then, the >10MB entry is itself evicted, once it has grown beyond 10MB.

On future insertions of >10MB entries, they fill up the "hole" left in the cache, and then evict themselves once they grow past the 10MB limit.

> In other words, compared with a smarter algorithm that checks Content-length
> and skips caching entirely if len > MAX, we ought to just "waste" MAX bytes.

I certainly agree it makes sense to limit the MAX size to something smaller than MIN(64MB, halfCacheSize). What do we need to do to decide what the proper limit is? 

I think that the Content-Length check is still worth doing up front. If the user is not doing something like browsing YouTube for an hour straight, but watching a video here and another there, then it is quite possible they will bring their cache up to roughly full, and each >MAX_SIZE request will have to evict MAX_SIZE worth of other entries. This seems like a silly hit to take, unless the win of avoiding it is minimal compared to other improvements we can make.

What does everyone think?
I for one think it's very worth writing the code to check the size up front and not evict anything if we know the file won't be cached anyways. Seems really wrong to me not to do that part. As for the max size, I think it's all pretty arbitrary. I doubt anyone has put any serious thought into it, and I don't think we have the time to do so now, so my suggestion is to do something close to what we used to have, and if that happens to line up somewhat with chrome's 1/8 of the cache size, then we could settle on the same size for now. Doing anything much better than that would require lots and lots of data that we don't have, and can't get quickly.
(In reply to comment #24)
> I for one think it's very worth writing the code to check the size up front 
> and
> not evict anything if we know the file won't be cached anyways. Seems really
> wrong to me not to do that part.

In that case, I'll keep working on the patch :)

> Doing
> anything much better than that would require lots and lots of data that we
> don't have, and can't get quickly.

Agreed. One option would be to do some rough estimation. I could see what the average size of a YouTube video is, and we could use that as a ballpark metric.
I did a rudimentary investigation of cache entry sizes, trying to target what might be common use cases. 

YouTube
*A 320p, ~30-second video is ~5MB
*A 320p, ~3-minute video is ~14MB

Flickr
*A small photo is ~15KB
*A medium sized photo is ~400KB
*A large photo is up to 2MB

New York Times
*Javascript files of about 50KB
*Images ranging up to ~1MB
*Random flash up through 500KB

HTML5 Spec (As a single-page, huge doc)
*40MB

Obviously this is not very scientific, and I don't pretend that any of these numbers are real averages; they are just what I saw in my browsing and taking of samples.

But given this, I think that if we set the max allowed entry size to 5MB or 10MB, we can accommodate pretty much everything except videos. I do think there is value to being able to cache videos, because people do revisit videos to a certain extent, but we can't do this at the expense of the rest of one's cache. 

We also still need to have a max size relative to the cache size, because people might still change their cache to be very small... I personally think the old default of 1/2 is too big. I suggest dropping it to at least 1/4, and maybe 1/8 like Chrome.

If anyone has opinions on this, please let me know. I want to pick a number and get this in the tree. Thanks.
I'd vote that for now we use the smaller of 5 MB and 1/8 of cache (so in most cases, 5 MB).   The only case in the examples above that gives me a pang of sadness is the HTML5 spec, but really, 40MB?  Download time.

I like the idea of caching larger files at some point when we give them their own section of the cache, so they can evict each other rather than tons of smaller files.

I also like the idea of doing the content-length check, though I think it should be lower priority than 1) disable caching for downloads and plugins if possible, and 2) sanity-checking the current cache as it scales.  I'm going to open bugs for those now.
Blocks: 588507
Attached patch max_file_size_v1 (obsolete) — Splinter Review
Limits the max file size in the disk cache to MIN(5MB, diskCacheCapacity/8).

There was an old overflow bug, but I do not think it applies anymore, since the max size is no longer 64MB. But I could be wrong, so please let me know.

Thanks.
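
For readers following along, the limit the patch implements works out like this (an illustrative sketch, not the actual diff; note from the review below that the constant must be expressed in bytes):

    #include <algorithm>
    #include <cstdint>

    const int64_t kMaxDataFileSize = 5 * 1024 * 1024; // 5 MiB, in bytes

    // Effective per-entry limit: MIN(5MB, capacity/8). With the 250MB
    // default cache, capacity/8 is ~31MB, so the 5 MiB constant dominates;
    // only caches smaller than 40MB lower the limit further.
    int64_t MaxEntrySize(int64_t diskCacheCapacityBytes)
    {
        return std::min(kMaxDataFileSize, diskCacheCapacityBytes / 8);
    }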
(In reply to comment #28)
> Created attachment 467308 [details] [diff] [review]
> max_file_size_v1
> 
> Limits the max file size in the disk cache to MIN(5MB, diskCacheCapacity/8).
> 
> There was an old overflow bug, but I do not think it applies anymore, since the
> max size is no longer 64MB. But I could be wrong, so please let me know.
> 
> Thanks.

I should have mentioned that the old overflow bug was Bug 443067.
Attachment #467308 - Flags: review?(michal.novotny)
Comment on attachment 467308 [details] [diff] [review]
max_file_size_v1

> +#define kMaxDataFileSize   5 * 1024    //5120 KiB. XXX: I think the overflow
> +                                      //issue is moot now that max size is 5MB.
> +                                      //Please correct if I am wrong!

This should be 5*1024*1024. And please change the comment: either remove it completely or keep the note that the upper limit is 65535 KiB, together with the bug number.
Attachment #467308 - Flags: review?(michal.novotny)
Attached file max_file_size_v2 (obsolete) —
Fixes kMaxDataFileSize value. Sorry for the stupid mistake.
Adds a comment, noting the overflow bug from Bug 443067.
Attachment #467308 - Attachment is obsolete: true
Attached patch max_size_v2 (obsolete) — Splinter Review
Attachment #469159 - Attachment is obsolete: true
Attachment #469161 - Flags: review+
This should help with our cache story; let's take this for beta6.
blocking2.0: --- → beta6+
Attached patch max_file_size_v3 (obsolete) — Splinter Review
No code changes, but makes the comments on our maximum allowed file size more clear, thanks to jduell's input.
Attachment #469161 - Attachment is obsolete: true
Comment on attachment 470097 [details] [diff] [review]
max_file_size_v3

Carrying forward Michal's +r.
Attachment #470097 - Flags: review+
Updated with the tip of the tree and ready to land.
Attachment #470097 - Attachment is obsolete: true
Attachment #471286 - Flags: review+
Keywords: checkin-needed
One could also have changed the define to be in KB's, and changed the compare statement to KB's also:
From: if ((newSize > kMaxDataFileSize) || (newSizeK > mCacheCapacity/8))
To:   if ((newSizeK > kMaxDataFileSizeK) || (newSizeK > mCacheCapacity/8))


All size calculations and updates are in KB's, so it makes sense (see also comments 30 and 31) to standardize to KB's.
> standardize to KB's

A good idea, but there'd be more places than this to change to make KBs standard, and we don't want to make that fix while we're in blocker-only mode.

But a good idea for a future bug--file?
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #28)
> Limits the max file size in the disk cache to MIN(5MB, diskCacheCapacity/8).

Would there be any benefit to using MAX(5MB, diskCacheCapacity/8) ? 

I'm thinking about the scenario in which someone might browse lots of videos (or even the HTML5 spec) and want to cache those. If I'm understanding correctly, even if someone manually increases his or her cache size, there's no way to opt in to caching objects > 5 MB, right?
> Would there be any benefit to using MAX(5MB, diskCacheCapacity/8)?

For now video caching at least seems to be handled reasonably well by flash's cache and/or the media cache (for HTML 5 video).

We may support larger file sizes once we've got an algorithm in place that evicts them sooner than smaller files (see bug 578541).
Blocks: 647391
Depends on: 644431
It looks like for some reason comment 16 was completely ignored, causing bug 644431.
Now that the disk cache can be up to 1 GB, this doesn't matter anymore, and the patch should be removed IMHO.
The 5 MB limit seems reasonable at first sight; however, it turned out to be an issue for lots of sites with audio.

Repeating a playlist or a single track now leads to (re)downloading files for each playback. An average MP3 is 320 kbps and ≈ 8-9 MB in size, so it doesn’t fit the current limit. Since repeating is a pretty common usage pattern, this can significantly increase bandwidth usage.

Also, I’ve got lots of complaints from users of my add-on ( http://addons.mozilla.org/addon/vkontakteru-downloader/ ), since the cache is not used anymore when downloading audio after it’s been played. Some of them even downgrade to Firefox 3.6.

There’s

#define DISK_CACHE_MAX_ENTRY_SIZE_PREF "browser.cache.disk.max_entry_size"

in nsCacheService.cpp, which could possibly be used to manually override this limit; however, it seems that this preference is not implemented (it is never referenced in code).
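
A hypothetical sketch of how such a pref could override the automatic limit (the names and the negative-means-auto semantics are my assumptions, not shipped code):

    #include <algorithm>
    #include <cstdint>

    // A negative pref value keeps the automatic MIN(5MB, capacity/8)
    // limit; a non-negative value, in KiB, overrides it.
    int64_t EffectiveMaxEntrySize(int64_t prefValueKiB, // from max_entry_size
                                  int64_t capacityBytes)
    {
        if (prefValueKiB < 0)
            return std::min<int64_t>(5 * 1024 * 1024, capacityBytes / 8);
        return prefValueKiB * 1024;
    }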
Sergey,

> 5 MB limit seems to be reasonable at the first sight, however

Yes.  See bug 647391

> preference is not implemented

bug 650995