Bug 661900 (Closed) -- Opened 13 years ago, Closed 13 years ago

LRU-SP not fully implemented in memory cache: small entries avoid eviction

Categories

(Core :: Networking: Cache, defect)

Priority: Not set
Severity: minor

Tracking


Status: RESOLVED FIXED
Target Milestone: mozilla10

People

(Reporter: gbrown, Assigned: gbrown)

Details

(Whiteboard: Note: TP4M REGRESSION EXPECTED / OK)

Attachments

(1 file, 4 obsolete files)

The memory cache implements most features of the LRU-SP cache algorithm, but diverges in one significant way. The Cheng/Kambayashi paper states that the choice of eviction entry is determined by considering the set of entries at the heads of all eviction lists, and choosing the one with the greatest cost (cost=time-since-last-access x size / nref). The memory cache simply takes entries in the order of the eviction lists (roughly size / nref).
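To make the difference concrete, here is a minimal, self-contained sketch of the selection step the paper describes (my own illustration with hypothetical types; not the actual nsMemoryCacheDevice code):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // A stand-in for a cache entry; field names are modeled on the patch
    // discussion below, not on the real Necko structures.
    struct Entry {
        uint32_t lastFetched;  // timestamp of last access
        uint32_t size;         // bytes held in the cache
        uint32_t fetchCount;   // nref: number of times fetched
        bool     inUse;        // pinned by an open channel, etc.
    };

    // LRU-SP selection: each queue (segment) is kept in LRU order, so only
    // its head needs to be examined; the victim is the head with the
    // greatest cost = time-since-last-access * size / nref.
    const Entry* SelectVictim(const std::vector<std::vector<Entry>>& queues,
                              uint32_t now)
    {
        const Entry* victim = nullptr;
        uint64_t maxCost = 0;
        for (const auto& queue : queues) {
            if (queue.empty())
                continue;
            const Entry& head = queue.front();  // oldest entry in the queue
            if (head.inUse)
                continue;                       // can't evict a pinned entry
            uint64_t cost = uint64_t(now - head.lastFetched) * head.size /
                            std::max<uint32_t>(1, head.fetchCount);
            if (!victim || cost > maxCost) {
                victim = &head;
                maxCost = cost;
            }
        }
        return victim;
    }

The memory cache, by contrast, walked the queues in order and evicted from the first one with an evictable entry, so the cost comparison across queue heads never happened.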

A consequence of the memory cache implementation is that small items tend to avoid eviction, even if they are unused for a long time. This behavior was documented in bug 397275, but the departure from the LRU-SP algorithm was not noted at the time. Indeed, this detail is hardly mentioned in the Cheng paper and is easy to miss -- I believe it was simply overlooked in the original implementation.
I have tested this patch in some ad-hoc scenarios only -- it seems to fix the problem with small items not being evicted.
Assignee: nobody → gbrown
Line 394: it's not necessary to recalculate maxCost, as it's identical to entryCost.

I would just rewrite the inner part of the loop as:

        if ((entry != &mEvictionList[i]) &&
            (!entry->IsInUse())) {
            entryCost = (PRUint64)
                (now - entry->LastFetched()) * entry->Size() /
                PR_MAX(1, entry->FetchCount());
            if (!maxEntry || (entryCost > maxCost)) {
                maxEntry = entry;
                maxCost = entryCost;
            }
        }

This function will be called a lot (once for every item added to the cache), and it recalculates `now` every time, which can be slow. Maybe you should keep the original check for mTotalSize and mInactiveSize at the start of the function to avoid that, and then change the while-loop into a do-while.
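A sketch of the overall shape Jo is suggesting, reusing Entry and SelectVictim from the sketch above (again my own illustration; Cache, its limits, and the inline unlinking are placeholders for the real nsMemoryCacheDevice state, and inactive-size accounting is omitted):

    #include <ctime>

    struct Cache {
        std::vector<std::vector<Entry>> queues;
        uint64_t totalSize = 0;
        uint64_t inactiveSize = 0;
        uint64_t hardLimit = 0;
        uint64_t softLimit = 0;
    };

    void EvictEntriesIfNecessary(Cache& cache)
    {
        // Fast path: skip the time lookup entirely when the cache is
        // already under both limits -- this runs on every insertion.
        if (cache.totalSize < cache.hardLimit &&
            cache.inactiveSize < cache.softLimit)
            return;

        uint32_t now = (uint32_t)time(nullptr);  // computed only when needed
        do {
            const Entry* victim = SelectVictim(cache.queues, now);
            if (!victim)
                break;                           // everything left is pinned
            cache.totalSize -= victim->size;     // account for freed bytes
            for (auto& queue : cache.queues) {   // unlink the victim
                if (!queue.empty() && &queue.front() == victim) {
                    queue.erase(queue.begin());
                    break;
                }
            }
        } while (cache.totalSize >= cache.hardLimit ||
                 cache.inactiveSize >= cache.softLimit);
    }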
Optimized as suggested - thanks Jo!
Attachment #537195 - Attachment is obsolete: true
Attachment #540636 - Attachment is obsolete: true
geoff - who is the right person to review this?
(In reply to Doug Turner (:dougt) from comment #5)
> geoff - who is the right person to review this?

Michal or Bjarne. I have been holding off on a review request while the team sorts out how to test the effectiveness of cache algorithm changes.
Comment on attachment 540639 [details] [diff] [review]
Patch to correct deviation from LRU-SP

Let's get feedback at least, and if it looks good, +r.

Bjarne/Michal: I assigned to Michal, but either of you could take.
Attachment #540639 - Flags: feedback?(michal.novotny)
Comment on attachment 540639 [details] [diff] [review]
Patch to correct deviation from LRU-SP

Heard on dev.platform:

"what I did see though, is that without the patch in bug #661900, the cache ends up just filled with tiny ad-related content and is practically useless."

So let's bump this in priority and see if we can land this soon.
Attachment #540639 - Flags: feedback?(michal.novotny) → review?(michal.novotny)
Comment on attachment 540639 [details] [diff] [review]
Patch to correct deviation from LRU-SP

> +     // LRU-SP eviction selection: Check the head of each segment (each
> +     // eviction list, kept in LRU order) and select the maximal-cost
> +     // entry for eviction. Cost is time-since-accessed * size / nref.

Shouldn't we check the first unused entry? If the first entry is in use we won't evict any other entry from that list.
Attachment #540639 - Flags: review?(michal.novotny)
Yes, I agree: It is better to check the first not-in-use entry in each queue, rather than skip the queue entirely if the head entry is in use. Patch updated.
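Adapted for that review comment, the selection scan looks past in-use entries instead of giving up at the head of a queue (same hypothetical types as the earlier sketches, not the actual patch):

    // Revised selection: for each queue, consider the first entry that is
    // not currently in use, rather than skipping the queue when its head
    // happens to be pinned.
    const Entry* SelectVictimSkipInUse(
        const std::vector<std::vector<Entry>>& queues, uint32_t now)
    {
        const Entry* victim = nullptr;
        uint64_t maxCost = 0;
        for (const auto& queue : queues) {
            for (const Entry& e : queue) {       // LRU order: oldest first
                if (e.inUse)
                    continue;                    // pinned: try the next one
                uint64_t cost = uint64_t(now - e.lastFetched) * e.size /
                                std::max<uint32_t>(1, e.fetchCount);
                if (!victim || cost > maxCost) {
                    victim = &e;
                    maxCost = cost;
                }
                break;  // only the first evictable entry per queue counts
            }
        }
        return victim;
    }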
Attachment #540639 - Attachment is obsolete: true
Attachment #558560 - Flags: review?(michal.novotny)
Attachment #558560 - Flags: review?(michal.novotny) → review+
Any objections to landing this now?
I assume you went through the oranges on the try build and verified they're unrelated?  If so, you've got r+, so it should be OK to land.  Maybe wait until Michal is awake again to make sure...
I went through the oranges and I think they are all unrelated to this change.
Me too.
Keywords: checkin-needed
Keywords: checkin-needed
Whiteboard: [inbound]
http://hg.mozilla.org/mozilla-central/rev/d469840a370f
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [inbound]
Target Milestone: --- → mozilla9
Geoff (or anyone using fennec regularly): please verify this gets rid of the "lots of small stale ad objects" issue and mark VERIFIED if so.
It looks good to me: I tested on Android, with light browsing over a period of about 8 hours. At the end of the 8 hours, my 1M memory cache had about 100 entries (average 10K/entry), with the largest entry about 80K, and many of the entries corresponding to recent activity.
Status: RESOLVED → VERIFIED
This _might_ have regressed Mobile Tp4 <http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/c1767648781e4425#>, so I have temporarily backed it out: <https://hg.mozilla.org/integration/mozilla-inbound/rev/7e132eee5650>.

If the backout doesn't change the numbers for this test, I will reland it either today or tomorrow.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
The backout was merged to mozilla-central: https://hg.mozilla.org/mozilla-central/rev/7e132eee5650

It seems to have resolved the regression.
Target Milestone: mozilla9 → ---
The possibility exists that the new algorithm evicts, say, 100 small entries where the old one took 1 large entry to free the same number of bytes, due to the different nature of the selection (see the do-while loop). For example, a 1 KB entry idle for a day has cost 86400 x 1024, which exceeds that of a 100 KB entry fetched a minute ago (60 x 102400), so the new selection can evict many stale small entries one by one where the old order would drop the single large entry first. That might explain the regression.
(In reply to Jo Hermans from comment #21)

Absolutely. I'll try to collect some data like number of evictions and hit rate for the Talos run with/without this patch...probably next week.
Patrick, do you have a bug open for the cache telemetry patch you wrote yesterday?

Geoff:  I recommend that you build on that patch and add eviction and overall cache hit rate to the telemetry stats, rather than doing a one-off instrumentation just for this bug (should be just as easy).  If you run into questions, ask.  We could really use those stats in general, so let's do it now.
(In reply to Jason Duell (:jduell) from comment #23)
> Patrick, do you have a bug open for the cache telemetry patch you wrote
> yesterday?
> 

nick hurley owns that action item.
It's bug 686948. We also have plans to do telemetry for number of evictions (and dooms). I can make that the next one I work on, should be simple enough.
The regression (comment 19) is in remote-tp4m_nochrome. This test loads 21 pages, 2 times (1, 2, 3, ... 21, 1, 2, 3, ... 21). Each page comes from a test webserver, set up with copies of very popular sites (google, yandex, baidu, etc). The time to load each page is measured and reported in logs, but the overall metric for the test is the "average median" time: for each page, the median load time is computed after discarding that page's largest measurement, and the reported number is the average of the per-page medians, again discarding the largest. Since there are only 2 measurements per page, and the largest is discarded, the median page load time for a page corresponds to the minimum page load time for that page. In the vast majority of cases, a page's first load time is greater than its second load time. So the tp4m time is, generally, the average of the page load times for the second cycle (when caching may be applicable).
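To illustrate the arithmetic (a toy sketch of my own, not Talos code), with two measurements per page the rules reduce to "average the per-page minimums, dropping the single largest":

    #include <algorithm>
    #include <cstdio>
    #include <utility>
    #include <vector>

    // Toy "average median" in the tp4m style: per page, discard the larger
    // of the two load times (leaving the minimum); across pages, discard
    // the largest remaining value and average the rest.
    double AverageMedian(const std::vector<std::pair<double, double>>& loads)
    {
        std::vector<double> perPage;
        for (const auto& p : loads)
            perPage.push_back(std::min(p.first, p.second));
        std::sort(perPage.begin(), perPage.end());
        perPage.pop_back();              // drop the largest per-page value
        double sum = 0;
        for (double v : perPage)
            sum += v;
        return sum / perPage.size();
    }

    int main()
    {
        // (first load, second load) pairs for three hypothetical pages
        std::vector<std::pair<double, double>> loads =
            { {900, 500}, {800, 450}, {1200, 700} };
        std::printf("%.1f\n", AverageMedian(loads));  // (500 + 450) / 2 = 475.0
        return 0;
    }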

I examined logs from try server tp4m-nochrome runs to investigate the effect of this patch on memory cache hit rate and evictions. There is some variability between test runs, but results like this are typical:

            Hits       Misses     Evictions  tp4m time
original    324        516        112        547.5
with patch  241        597        222        585.9

==> The patch to correct LRU-SP decreases the number of Necko memory cache hits, increases the number of evictions, and increases the overall test time, for tp4m-remote.
I think these results are predictable (if not predicted!) and not necessarily anything to be concerned about. The tp4m page set is approx 8 MB, while the mobile memory cache is only 1 MB. If the memory cache were strict LRU, we would expect all tp4m pages to be evicted long before being revisited, resulting in no hits. LRU-SP (either implementation) improves on LRU in this test by evicting larger entries first: Some smaller entries are kept in the cache long enough to result in hits when the second test cycle comes around. The original LRU-SP implementation can be viewed as holding on to those smaller entries more aggressively than my patch, resulting in more hits and better performance -- during this type of test. With my patch, each entry's last access time is taken into account also, so some smaller entries are evicted while larger entries that have been recently loaded are kept in the cache, resulting in fewer hits in this cyclical test. 

In "real life" / "typical" browsing, not every resource that is loaded once will be loaded again: keeping entries in the cache that have not been accessed for an extended period impairs our ability to store recently accessed content, which ought to be more relevant.

On the other hand, I am reluctant to check in this patch when our only test results are negative! Feedback welcome...
Do we have any idea how much cache reuse there is within the 8 MB?  If there's none, then yes, regular LRU would roll over dead, and the fact that your patch has slightly worse perf than our previous LRU-SP is outweighed by the fact that it avoids what comment 8 describes, i.e. the cache filled with never-evicted small files.

[A cache that gets clogged up with small obsolete items will indeed do better on a very large repeated dataset than any LRU-ish algorithm, but that's like saying my broken watch has the correct time twice a day: it's a spurious performance difference.]
(In reply to Jason Duell (:jduell) from comment #28)
> Do we have any idea how much cache reuse there is within the 8 MB?  

There is very little cache reuse: In a single cycle of the tp4m test with evictions disabled there are 434 cache requests total, and only 28 of these are duplicates (14 requests are each made 2x).
Ok, so the tp4m test doesn't seem too worrying.  

Do we have any benchmark anywhere that has a cache access pattern that's more realistic (more cache hits, and with hits skewed towards being closer to the original use)?   I don't know of any.  Did your patch do better on any of the benchmarks?  Like you said, it would be nice to have some positive test results, too.   Given how hard it is to write a good caching benchmark, I'd even be happy if you just browsed around on a mobile device for a while with/without the patch and report back on the difference in hit rate.  (Can I get you to write the telemetry patch for RAM cache hit rate? bug 687085 :)
(In reply to Jason Duell (:jduell) from comment #30)
> Do we have any benchmark anywhere that has a cache access pattern that's
> more realistic (more cache hits, and with hits skewed towards being closer
> to the original use)?   I don't know of any.  

I don't know of any either. But it is fairly easy to manipulate the tp4m test to show the benefit of this patch:
 - change the -tpcycles from 2 to 1, so that every page in the manifest is loaded just once (unless it is entered in the manifest more than once)
 - change the manifest such that some pages are loaded twice or more early in the test, then different pages are loaded twice or more later in the test

Selecting some duplicate pages at random, I got these results:

            Hits       Misses     Evictions  tp4m time
original    97         502        75         1129
with patch  116        474        115        1047

The original cache implementation does not adapt well to the changing access pattern: It makes fewer evictions than the new implementation, but keeps the wrong ones (presumably the small entries that were stored early in the test) resulting in more misses.
Nice work--I'm sold.  The tp regression is bogus.

Let's re-land--put something in your checkin msg like "TP4 REGRESSION OK: see bug" :)
Updated patch comment only. Carrying forward r+ from michal.novotny.
Attachment #558560 - Attachment is obsolete: true
Attachment #563872 - Flags: review+
Keywords: checkin-needed
Whiteboard: Note: TP4M REGRESSION EXPECTED / OK
https://hg.mozilla.org/mozilla-central/rev/5bfd700622c7
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla10