Closed Bug 868441 Opened 11 years ago Closed 11 years ago

Bypass Cache When Lock Held Too Long

Categories

(Core :: Networking: Cache, enhancement)

20 Branch
x86_64
Linux
enhancement
Not set
normal

Tracking


RESOLVED FIXED
mozilla24

People

(Reporter: mcmanus, Assigned: mcmanus)

Details

Attachments

(3 files, 1 obsolete file)

This is a proposed bandaid that I think we can try in Firefox 23. It should help us bypass the cache when the cache lock has been held for a long time. (i.e. during periods of slowness - don't use the cache). It doesn't rescue (i.e. timeout) any transactions that choose to use the cache before they were able to detect the slowness. That would probably be better and can still be done - but this is something quick and easy that can be done without distracting from the cache rewrite work.
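A minimal sketch of the proposed heuristic (assumed names, not the actual Mozilla patch): wrap the service lock so it records when it was acquired, letting any thread ask how long the current holder has had it and choose to bypass the cache when that exceeds a threshold.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>

using Clock = std::chrono::steady_clock;

class TimedLock {
public:
    void Lock() {
        mMutex.lock();
        mAcquiredAt.store(Clock::now().time_since_epoch().count());
    }
    void Unlock() {
        mAcquiredAt.store(0);  // 0 encodes "unlocked"
        mMutex.unlock();
    }
    // Milliseconds the lock has been held so far; 0.0 if currently unlocked.
    double LockHeldTimeMs() const {
        Clock::rep acquired = mAcquiredAt.load();
        if (acquired == 0) {
            return 0.0;
        }
        Clock::duration held =
            Clock::now() - Clock::time_point(Clock::duration(acquired));
        return std::chrono::duration<double, std::milli>(held).count();
    }

private:
    std::mutex mMutex;
    std::atomic<Clock::rep> mAcquiredAt{0};
};
```

A channel would sample `LockHeldTimeMs()` before queuing for the cache and skip straight to the network when the value exceeds a pref-controlled threshold.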
Attached patch patch 0 (obsolete) — Splinter Review
Attachment #745175 - Flags: review?(michal.novotny)
This patch makes me feel warm and fuzzy inside.  Thanks for doing this!
Hmm.. on my older laptop I can see startup of the cache (which happens under the lock, on the cache thread) taking a significant amount of time (~100ms).  On slower machines during a cold startup this may turn the cache off almost instantly, just because initially opening the cache map file can take a long time, even though after that there may not be any delay any more.

I'd at least prolong the time to, say, 500ms.  Otherwise I can already see bugs like "firefox loads stuff from server despite expired header in future" and so on...

We should also log that the cache is off because the lock is being held too long.
Or we could use a longer delay until the cache has opened, then use a shorter one.
So this code only bypasses cache if a lock has been held for too long? 

Then why is it bad to bypass cache during slow initialization? Seems like an ideal time to bypass it.
(In reply to Taras Glek (:taras) from comment #5)
> So this code only bypasses cache if a lock has been held for too long? 

And once the lock is released we go back to using the cache?
(In reply to Honza Bambas (:mayhemer) from comment #3)
> Hmm.. on my older laptop .... held too long.

I misunderstood how the patch worked.  I thought it disabled the cache completely upon encountering a long time spent in the lock.  But it just skips opening a cache entry when the lock has been held a long time; however, opening a cache entry is async, so there is no need to bypass that op to prevent jank.  And just because you didn't open one cache entry doesn't mean we prevent the jank caused by SetExpirationTime and others.  But maybe we do, I just don't see it.


Anyway, after talking to taras on IRC, we could try disabling the cache completely when the lock is held too long (probably because of I/O) and recovering in some 60s or sooner.
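The "disable and recover" idea above could be sketched as a small piece of state (hypothetical names, not the real implementation): once any request observes the lock held past the threshold, every request bypasses the cache until a recovery period expires.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

class CacheDampener {
public:
    explicit CacheDampener(std::chrono::seconds recovery)
        : mRecovery(recovery) {}

    // Called when a request observes the lock held past the threshold.
    void NoteSlowLock() { mSkipUntil = Clock::now() + mRecovery; }

    // True while the recovery period is active: bypass the cache.
    bool ShouldBypassCache() const {
        return mSkipUntil != Clock::time_point{} && Clock::now() < mSkipUntil;
    }

private:
    std::chrono::seconds mRecovery;
    Clock::time_point mSkipUntil{};  // null time point => cache enabled
};
```
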
Comment on attachment 745175 [details] [diff] [review]
patch 0

> +    /**
> +     * Milliseconds for which the service lock has been held. 0 if unlocked.
> +     */
> +    readonly attribute double lockHeldTime;

AFAICS mozilla::TimeStamp() uses integer values, so why did you choose double?


> +    // If the cache lock has been held for a long time then just
> +    // bypass the cache instead of getting in that queue.
> +    nsCOMPtr<nsICacheService> cacheService =
> +        do_GetService(NS_CACHESERVICE_CONTRACTID, &rv);
> +    nsCOMPtr<nsICacheServiceInternal> internalCacheService =
> +        do_QueryInterface(cacheService);
> +    if (internalCacheService) {
> +        double timeLocked;
> +        if (NS_SUCCEEDED(internalCacheService->GetLockHeldTime(&timeLocked)) &&
> +            timeLocked > gHttpHandler->BypassCacheLockThreshold()) {
> +            LOG(("Cache bypassed because service lock held too long [%fms]\n",
> +                 timeLocked));
> +            return ContinueConnect();
> +        }
> +    }
> +
>      // open a cache entry for this channel...
>      rv = OpenCacheEntry(usingSSL);

So, we skip all attempts to use the cache when the lock is being held for too long. But what if the LOAD_ONLY_FROM_CACHE flag was specified? Do we really want to fail the request just because the cache is slow? And what if the request should be served from the appcache? We could end up using some files from the appcache and others from the network. Honza, what's your opinion on this?
Attachment #745175 - Flags: review?(michal.novotny)
(In reply to Michal Novotny (:michal) from comment #8)
> So, we skip all attempts to use the cache when the lock is being held for
> too long. But what if LOAD_ONLY_FROM_CACHE flag was specified? Do we really
> want to fail the request just because the cache is slow? And what if the
> request should be server from appcache? We could end up using some files
> from the appcache and others from the network. Honza, what's your opinion on
> this?

The appcache point is a good one; we must honor the cache in that case.

LOAD_ONLY_FROM_CACHE probably as well, but that is not as critical (we can just fail the load as if the entry were not found in the cache).

What about silently bypassing, or turning async, sync operations like SetExpirationTime() when the lock is being held at the time?

I tried to implement timeout logic but for some reason it didn't work for me (probably just a mistake in my programming, but I didn't want to spend more time on it).
I'd think we'd want to skip this new logic for LOAD_ONLY_FROM_CACHE--just wait in that case.  Ditto for appcache.
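The exemption being discussed might look like this (the flag bit value and helper are illustrative, not the actual necko code): the bypass heuristic is consulted only for loads that are allowed to fall back to the network, while LOAD_ONLY_FROM_CACHE and appcache-backed loads always wait for the cache lock.

```cpp
#include <cstdint>

constexpr uint32_t LOAD_ONLY_FROM_CACHE = 1 << 10;  // illustrative bit value

bool MayBypassCacheLock(uint32_t loadFlags, bool usingApplicationCache,
                        bool lockHeldTooLong) {
    if (loadFlags & LOAD_ONLY_FROM_CACHE) {
        return false;  // must be served from cache; bypassing would fail the load
    }
    if (usingApplicationCache) {
        return false;  // keep appcache page loads internally consistent
    }
    return lockHeldTooLong;
}
```
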

> opening a cache entry is async, so there is no need to bypass that op to prevent jank.

Right, but the later ops that we'd do with the cache entry would presumably block (at least there's a high chance they would).  That's why we chose to skip using the cache in this case.

> we could try to disable cache completely when lock is held too long 
> (probably because of I/O) and recover in some 60s or sooner.

Don't understand this--"disable cache completely" but what happens to existing cache entries that are servicing requests (even if slowly, they may be midway through delivering OnDataAvailable)?  Assuming we keep letting existing entries complete, I'm not clear how what you're proposing is different from this patch (i.e. simply skip using the cache until it appears to have recovered). Other than that, it sounds like you'd wait 60 sec, while this patch would wait just until the cache lock is released (which will generally be sooner).
(In reply to Jason Duell (:jduell) from comment #10)
> I'd think we'd want to skip this new logic for LOAD_ONLY_FROM_CACHE--just
> wait in that case.  Ditto for appcache.

+1.

(In reply to Honza Bambas (:mayhemer) from comment #7)
> completely on encounter of a long time spent in the lock.  But it just
> disables opening a cache entry when lock is held a long time ; however,
> opening a cache entry is async, so there is no need to bypass that op to
> prevent jank.  And just that you didn't open one cache entry doesn't mean we
> prevent jank caused by SetExpirationTime and others.  But maybe yes, I just
> don't see it.

Right.

I guess the goal of this patch is to avoid blocking networking on the cache if the cache seems to be slow at the moment, not to prevent jank. And/or, perhaps there is an inference to be made that if the lock has been held for a long time at the time the channel connects, then it is likely that there will be jank at SetExpirationTime, etc.

In the long term, I really think this is counter-productive to the goal of minimizing network utilization when we can do so. That doesn't mean we shouldn't do it, but IMO we should be conscious of that when measuring the benefit of the change.
To clearly position this - I'm interested in this patch as a short term band aid that's pretty low effort and risk. I hope it puts us in a better place than we are today, and that's really enough for the short term. I definitely don't want it to interfere with the more complicated cache rewrite work or start putting lots of effort into the current cache. Both of those things would be bad outcomes.

For that reason I'm going to steer clear of (interesting!) algorithms like timing out cache requests, or changing expiration apis, etc.. It's just a simple "when cache I/O is really slow, use the cache less often" approach.

One recurring concern is that, as coded, this will leave page loads with inconsistent results (some from cache, some fresh from the net). I don't think that's the end of the world (it can happen anyway with some page construction patterns), but the suggestion honza and taras make about a dampening period would definitely improve that situation (as opposed to evaluating each transaction independently, as patch 0 does). It's also easy to do right on top of the patch I already have - so I like that :)
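Combining the dampening period with the per-transaction check could look roughly like this (hypothetical names and pref values): each channel first consults a global "skip until" timestamp, and only if that is clear does it sample the current lock-held time; a slow lock both bypasses this request and arms the dampener for subsequent ones.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

struct CacheBypassState {
    Clock::time_point skipUntil{};             // null => no dampening active
    std::chrono::milliseconds threshold{250};  // cf. bypass-cachelock-threshold
    std::chrono::seconds dampening{60};        // recovery period
};

bool ShouldBypassCache(CacheBypassState& s,
                       std::chrono::milliseconds lockHeld) {
    Clock::time_point now = Clock::now();
    if (s.skipUntil != Clock::time_point{} && now < s.skipUntil) {
        return true;                           // dampener still active
    }
    if (lockHeld > s.threshold) {
        s.skipUntil = now + s.dampening;       // arm the dampener
        return true;                           // bypass this request too
    }
    return false;                              // use the cache normally
}
```
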

The points about ONLY_FROM_CACHE and appcache are well taken and appreciated. I'll make both of those updates. (I'm actually not totally convinced about appcache, but it is so lightly used I'm happy to just be conservative and leave it untouched by this change).

I'll make those 3 changes and repost the patch.

(In reply to Taras Glek (:taras) from comment #5)
> 
> Then why is it bad to bypass cache during slow initialization? Seems like an
> ideal time to bypass it.

I agree!

(In reply to Michal Novotny (:michal) from comment #8)

> 
> AFAICS mozilla::TimeStamp() uses integer values, so why did you choose
> double?

timestamp - timestamp = duration... duration.ToMilliseconds() returns a double.
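By analogy (this is the std::chrono equivalent, not the mozilla::TimeStamp code): subtracting two time stamps yields a duration, and expressing that duration in milliseconds with sub-millisecond precision naturally produces a double.

```cpp
#include <chrono>

double ElapsedMs(std::chrono::steady_clock::time_point start,
                 std::chrono::steady_clock::time_point end) {
    // duration<double, std::milli> keeps sub-millisecond precision, which is
    // why a double (rather than an integer) is the natural return type here.
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```
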


[bsmith]
> I guess the goal of this patch is to avoid blocking networking on the cache
> if the cache seems to be being slow currently, not to prevent jank. And/or,
> perhaps there is an inference to be made that if the lock has been held for
> a long time at the time of connecting in the channel, then it is likely that
> there will be jank at SetExpirationTime, etc.

yes on both points. 

It's effectively a very crude form of racing.. it's just that rather than actually doing the disk and network in parallel, we preemptively declare a winner based on this lock-held-time heuristic. I did say it was crude :)

> In the long term, I really think this is counter-productive to the goal of
> minimizing network utilization when we can do so.

yep, it's true. But right now that's a lower priority than performance and in this case they might be in conflict.
Attached patch patch 1Splinter Review
Attachment #745175 - Attachment is obsolete: true
Attachment #745904 - Flags: review?(michal.novotny)
Comment on attachment 745904 [details] [diff] [review]
patch 1

Review of attachment 745904 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/libpref/src/init/all.js
@@ +1001,5 @@
>  // Disable IPv6 for backup connections to workaround problems about broken
>  // IPv6 connectivity.
>  pref("network.http.fast-fallback-to-IPv4", true);
>  
> +// The maximum amount of time the nsICacheSession lock can be held

nsICacheSession lock => cache service lock

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +6013,5 @@
> +    TimeStamp cacheSkippedUntil = gHttpHandler->GetCacheSkippedUntil();
> +    if (!cacheSkippedUntil.IsNull()) {
> +        TimeStamp now = TimeStamp::Now();
> +        if (now < cacheSkippedUntil) {
> +            LOG(("channel=%p Cache bypassed because of dampener\n"));

'this' parameter is missing

@@ +6016,5 @@
> +        if (now < cacheSkippedUntil) {
> +            LOG(("channel=%p Cache bypassed because of dampener\n"));
> +            return true;
> +        }
> +        LOG(("channel=%p Cache dampener released\n"));

'this' parameter is missing

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +1225,5 @@
>              mConnectTimeout = clamped(val, 1, 0xffff) * PR_MSEC_PER_SEC;
>      }
>  
> +    // The maximum amount of time to wait for socket transport to be
> +    // established

Wrong comment
Attachment #745904 - Flags: review?(michal.novotny) → review+
https://hg.mozilla.org/mozilla-central/rev/e1eb29a56dab
Assignee: nobody → mcmanus
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla24
We should add telemetry to check we are not disabling the cache too often.
I think we should back this patch out :( 

- I didn't find any significant improvements in the LifeHacker tab-loading benchmark 
- I've been getting profiles from a user who regularly experiences 30 seconds of network cache jank with the latest Nightlies
> I've been getting profiles from a user who regularly experiences 30 seconds of 
> network cache jank with the latest Nightlies

But this patch just skips the cache if we hit jank.  I don't see how it could increase jank.
(In reply to Vladan Djeric (:vladan) from comment #18)
> I think we should back this patch out :( 
> 
> - I didn't find any significant improvements in the LifeHacker tab-loading
> benchmark 

I thought you concluded that benchmark wasn't very relevant to this change.

> - I've been getting profiles from a user who regularly experiences 30
> seconds of network cache jank with the latest Nightlies

That doesn't make any sense at all - let's figure it out, maybe we can learn something useful from nightly!

This patch disables the cache (sometimes).. so it could theoretically make life slower overall, but it shouldn't introduce cache jank - it's using the cache less! Understanding that dynamic might lead to something interesting... (or maybe it's just a bug in the patch, but that seems less likely).
steve has started investigating specific performance scenarios.. maybe he would be interested in working with you, vlad, if the user provides regular data.
(In reply to Jason Duell (:jduell) from comment #19)
> > I've been getting profiles from a user who regularly experiences 30 seconds of 
> > network cache jank with the latest Nightlies
> 
> But this patch just skips the cache if we hit jank.  I don't see how it
> could increase jank.

I'm concerned about putting in performance fixes (i.e. taking on risk) where we do not have any proof (i.e. unclear benefit) of whether they help.
(In reply to Patrick McManus [:mcmanus] from comment #21)
> steve has started investigating specific performance scenarios.. maybe he
> would be interested in working with you vlad if the user provides regular
> data.

Indeed. Vlad, if you think this is possible, please send me an email.
(In reply to Taras Glek (:taras) from comment #22)
> I'm concerned about putting in performance fixes(eg taking on risk) where we
> do not have any proof(eg unclear benefit) on whether they help.

I have to concur.  I never had much confidence in this patch...
Taras: didn't someone on your team run stuff on a slow SDCard profile?  Would it make sense to run on that and see if this is a win?  

I'm fine with backing this out if we can't show it clearly helps perf under janky conditions.
(In reply to Jason Duell (:jduell) from comment #25)
> Taras: didn't someone on your team run stuff on a slow SDCard profile? 
> Would it make sense to run on that and see if this is a win?  
> 
> I'm fine with backing this out if we can't show it clearly helps perf under
> janky conditions.

that's Vlad above. I think honza's failed approach of lock timeouts might be worth revisiting instead of this.
Attached patch timeout WIP 1Splinter Review
This is a fixed patch for the lock timeout (it was just a stupid typo...). The timeout is engaged only on the main thread and on demand - each method acquiring the lock may individually choose to time out or wait indefinitely.

There are a few places in nsCacheEntryDescriptor that use the timeout logic and fail in that case, but those were chosen somewhat randomly.

I consider this a WIP since we should check which methods of the cache descriptor (and others) should use it and how we should handle the failure.  For SetExpirationTime and maybe some others we may just ignore the failure and go on in nsHttpChannel.

Michal, what do you think?
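The on-demand timeout described here can be sketched with the standard library (a stand-in for the nsCacheService lock, not the WIP patch itself): main-thread callers pass a deadline and get a failure instead of blocking, while other callers wait indefinitely as before.

```cpp
#include <chrono>
#include <mutex>

std::timed_mutex gCacheLock;  // stand-in for the cache service lock

enum class LockResult { Acquired, TimedOut };

// Callers that can tolerate failure pass a nonzero timeout; callers that
// must have the lock pass zero and block as before.
LockResult AcquireCacheLock(std::chrono::milliseconds timeout) {
    if (timeout.count() == 0) {
        gCacheLock.lock();  // wait indefinitely
        return LockResult::Acquired;
    }
    return gCacheLock.try_lock_for(timeout) ? LockResult::Acquired
                                            : LockResult::TimedOut;
}
```

The hard part, as the following comments note, is deciding what each caller does with a `TimedOut` result rather than the locking itself.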
Flags: needinfo?(michal.novotny)
(In reply to Patrick McManus [:mcmanus] from comment #20)
> > - I've been getting profiles from a user who regularly experiences 30
> > seconds of network cache jank with the latest Nightlies
> 
> That doesn't make any sense at all - let's figure it out, maybe we can learn
> something useful from nightly!

I should have been more precise. This user captured profiles with long periods of post-startup cache jank before AND after the patch in this bug landed. The profiles show jank from a variety of sources: sometimes it's mostly cache, sometimes it's reflows, etc

> This patch disables the cache (sometimes).. so it could theoretically make
> life slower overall but it shouldn't introduce cache jank - its using the
> cache less! Understanding that dynamic might lead to something
> interesting... (or maybe its just a bug in the patch but that seems less
> likely).

I didn't mean to suggest this patch made things worse :( As Taras said above, we're concerned about landing performance fixes which add complexity but don't achieve the wins expected.

(In reply to Jason Duell (:jduell) from comment #25)
> Taras: didn't someone on your team run stuff on a slow SDCard profile? 
> Would it make sense to run on that and see if this is a win?  

That was me. I did another quick SD card experiment where I loaded aljazeera.com and cnn.com on a recent Nightly. There were ~13 seconds of network cache jank while loading the 2 pages.

This is the profile: http://people.mozilla.com/~bgirard/cleopatra/#report=6a95a414da7414cca3a383450a1f6afd8d26a513&search=nsCache

The sampling interval is 50ms; filter on "nsCache" to see the impact of network cache operations on the main thread.
Sorry, this is a better link for the aljazeera/cnn profile; only some of the cache operations have the "nsCache" prefix:

http://people.mozilla.com/~bgirard/cleopatra/#report=6a95a414da7414cca3a383450a1f6afd8d26a513
(In reply to Vladan Djeric (:vladan) from comment #28)
> I didn't mean to suggest this patch made things worse :( As Taras said
> above, we're concerned about landing performance fixes which add complexity
> but don't achieve the wins expected.

There is no expectation from anybody that this patch would fix all or even most jank. The patch makes a decision whether or not to use the cache for the HTTP request at hand, at the beginning of the request, based on (roughly) previously/recently occurring jank. Once the request has committed to using the cache (because there hasn't been any cache-related jank detected recently), there is still lots of opportunity for jank to occur for that request.  It isn't surprising that in any particular configuration there would be no improvement. The question to answer isn't whether jank occurs at all, but whether there is *less* jank or not. AFAICT, that requires us to look at telemetry or the equivalent across many scenarios.

The patch does not add much complexity. Let's find a metric that we can use to determine its overall impact on reducing jank (or not) across many scenarios, and watch that metric, while also watching our normal HTTP metrics to make sure that we're not regressing average time-to-first-byte and overall page load speed. If we see any problems, then let's rip it out immediately. If we don't see any problems, then there's not a big rush to pull it out. If it turns out to be a small or big win, but not a total win, then, well, that's the goal of the bug.

The risk with Honza's lock timeout patch is that, when the timeout is reached, the operation that tried to acquire the lock fails. That means that instead of jank, we may end up with user-visible failures of various sorts. It is not clear to me that there is no negative user-visible impact when those operations fail. For example, if SetExpirationTime times out on the lock, then that will cause these methods to fail:

  nsHttpChannel::UpdateExpirationTime
    nsHttpChannel::ProcessPartialContent
    nsHttpChannel::ProcessNotModified
    nsHttpChannel::InitCacheEntry
    FinalizeCacheEntry

It seems, just from this quick investigation, that such a change would cause a user-visible failure to load a page in some extremely common situations (e.g. receiving a 304 Not Modified from the server). I do not think that user-visible failure is acceptable behavior; working with jank is better than not working at all. Also, in turn, those methods failing will cause other methods to fail. IMO, it would be OK if somebody other than Honza and Michal spent the time to analyze all the various cases and fix things so that we don't introduce unnecessary user-impacting failures. But I think it would be great if Honza and Michal could concentrate as much as possible on getting the new cache done, instead of chasing this.
Brian, I appreciate that the overhead of proceeding with honza's experimental patch makes it not worth it. I'm not suggesting we drop everything and go that way.

I am suggesting that there is no evidence of this particular patch helping. We should not risk shipping perf fixes that have no measurable benefit.
We do know that this patch does something.. and there is reason to believe it's a good thing. Certainly not a complete thing, but a good thing.

When this landed, intermittent orange bug 858055 showed up on a mochitest, pretty much just on the android tegras.

The test went orange because it was explicitly counting cache hits for some particular scenario; the tegra had earlier auto-disabled the cache due to this patch, and the cache-counting test was caught up in the dampening time.

That means there are cases where this patch identifies a poorly behaving cache (where it couldn't get the lock for a quarter of a second!) and disables that cache.. which is what we're looking for. Additionally, the orange wasn't happening on other platforms, so it's not just disabling the cache all the time. Also good!

I'm inclined to try some other values in here and see if they help with use cases we know about.. it's only on nightly right now, so there's plenty of runway for experimentation.

Ted, do you want to see if setting network.http.bypass-cachelock-threshold to something like 50 gives you a different profile?

It's also possible the media cache (for YT) doesn't go through this path at all. Maybe steve can glance at the checked in changeset and let us know what he thinks about that.
Flags: needinfo?(sworkman)
(In reply to Patrick McManus [:mcmanus] from comment #32)
> It's also possible the media cache (for YT) doesn't go through this path at
> all. Maybe steve can glance at the checked in changeset and let us know what
> he thinks about that.

ChannelMediaResource makes requests using loadflags |= LOAD_BYPASS_LOCAL_CACHE_IF_BUSY ... HttpCacheQuery::mNoWait will be false ... nsCacheRequest will be non-blocking ... So, it looks to me like this won't go on the list of pending requests (i.e. request->WaitForValidation won't be called in nsCacheService::ProcessRequest). But more importantly, I don't think Lock will be called, so this code shouldn't really come into play.
Flags: needinfo?(sworkman)
(In reply to Patrick McManus [:mcmanus] from comment #34)
> Created attachment 759422 [details] [diff] [review]
> effectively disable bypass-cachelock-threshold on beta and release

I'll offer this up to auto-disable the feature after it gets past aurora.. leaving it suitable for experimentation.
Attachment #759422 - Flags: review?(jduell.mcbugs) → review+
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
https://hg.mozilla.org/mozilla-central/rev/b50aafc75977
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Flags: needinfo?(michal.novotny)
(In reply to Patrick McManus [:mcmanus] from comment #35)
> I'll offer this up to auto-disable the feature after it gets past aurora..
> suitable for experimentation.

How's that experiment going? After four years of doing this on the trunk (and aurora, back when aurora used to exist), when are we going to be ready to either ship it, or remove it?
Flags: needinfo?(mcmanus)
well, that's a good find. Honza, I'd like to make this official. ok?
Flags: needinfo?(mcmanus) → needinfo?(honzab.moz)
The code around network.http.bypass-cachelock-threshold pref was removed from http channel a long time ago.
cool. plz patch the pref file when you get a chance.
I don't understand why the patch for this bug just didn't remove the network.http.bypass-cachelock-threshold pref altogether.  There is no use for it.  And there is also no experiment, AFAIK.  We keep the old backend only for appcache.  The prefs to switch between the new and old backends could probably go away now, but I keep them around since they don't carry any maintenance burden.
Flags: needinfo?(honzab.moz)
Ah!!!  It's been fixed on 2013-06-13 :D  Not recently...  OK.  I should have had a second coffee today...

So, I would not bother anyway.  This all should just go away with the old cache code when it is finally removed one blessed day.