Closed Bug 850968 Opened 11 years ago Closed 10 years ago

Cache Effectiveness Telemetry

Categories

(Core :: Networking, enhancement)

18 Branch
x86_64
Windows 7
enhancement
Not set
normal

Tracking

()

RESOLVED WONTFIX
mozilla22

People

(Reporter: mcmanus, Assigned: mcmanus)

Details

Attachments

(2 files, 1 obsolete file)

Let's find out if the HTTP cache helps us or hurts us.

The strategy here is a little crude, but I can live with that.

The experiment takes selected aurora or nightly sessions and puts them into an "experiment mode" for 15 minutes after waiting 2 minutes after startup for things to settle down. After the 15 minutes expires things return to previous behavior.

To be selected, a session needs to have the allow-experiments pref set to true (its default) and the use-cache pref set to on (the default). Beyond that, 1 in 16 sessions is selected at startup time, so if this really screws you up you can just restart :)
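A minimal sketch of the selection rule just described, for illustration only. The struct, the function name, and the way the startup random value is obtained are all hypothetical; the patch itself may do this differently.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the two prefs mentioned above. The real pref
// names are not given in this bug.
struct SessionPrefs {
    bool allowExperiments = true;  // default is true per the description
    bool useCache = true;          // default is on per the description
};

// Returns true when the session should enter the experiment.
// `startupRandom` stands in for whatever random value is drawn at startup.
bool IsSessionSelected(const SessionPrefs& prefs, uint32_t startupRandom) {
    if (!prefs.allowExperiments || !prefs.useCache) {
        return false;  // opted out, or the cache is already disabled by the user
    }
    return (startupRandom % 16) == 0;  // roughly 1 in 16 sessions
}
```

Because the draw happens once per startup, restarting the browser gives a fresh 1-in-16 chance, which is why a restart is enough to get out of the experiment most of the time.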

If you're in the experiment you get further divided into 1 of 4 groups. Two of those groups have their cache enabled, two have it disabled. Those categories are further split into "fast connections" and "slow connections". (Fast vs. slow is pretty crude: I count the TCP connects in those first 2 minutes that complete in <= 125 ms as fast and those that take > 125 ms as slow. If at least 1/3 of them are fast, I call it a fast-connected machine. If I thought it was worth revising I would.)
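The fast/slow heuristic and the four-way split could be sketched roughly as follows. All names are hypothetical, and the handling of a session with no connects at all is an assumption (the description does not say what happens in that case).

```cpp
#include <cstdint>

// The four experiment groups described above (names are illustrative).
enum class ExperimentGroup { FastCacheOn, FastCacheOff, SlowCacheOn, SlowCacheOff };

// Tallies TCP connect times observed during the 2 minute settling period.
struct ConnectStats {
    uint64_t fast = 0;
    uint64_t slow = 0;

    void Record(double connectMs) {
        if (connectMs <= 125.0) {
            ++fast;  // <= 125 ms counts as a fast connect
        } else {
            ++slow;  // > 125 ms counts as a slow connect
        }
    }

    // "At least 1/3 fast" means fast / total >= 1/3, i.e. 3 * fast >= total.
    bool IsFastMachine() const {
        const uint64_t total = fast + slow;
        if (total == 0) {
            return false;  // assumption: no data is treated as slow
        }
        return 3 * fast >= total;
    }
};

// Combines the connection class with the cache on/off choice. How the
// cache on/off choice is made is not stated in the bug, so it is a parameter.
ExperimentGroup AssignGroup(const ConnectStats& stats, bool cacheEnabled) {
    if (stats.IsFastMachine()) {
        return cacheEnabled ? ExperimentGroup::FastCacheOn : ExperimentGroup::FastCacheOff;
    }
    return cacheEnabled ? ExperimentGroup::SlowCacheOn : ExperimentGroup::SlowCacheOff;
}
```

Using the integer comparison `3 * fast >= total` avoids floating-point rounding at exactly the 1/3 boundary.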

Finally, assuming the experiment is running, a datapoint is collected for every http transaction measuring the elapsed time from asyncopen to the time onStopRequest is called... this is reported through telemetry along with which group you are allocated to.
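As a rough illustration of that per-transaction measurement, here is a standalone sketch using `std::chrono` rather than Gecko's `TimeStamp` and `Telemetry` classes. `TransactionTimer` and `SampleStore` are hypothetical stand-ins, not the classes touched by the patch.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical per-transaction timer: remembers when asyncOpen happened and
// reports the elapsed milliseconds when onStopRequest fires.
class TransactionTimer {
public:
    void OnAsyncOpen(Clock::time_point now) { mAsyncOpenTime = now; }

    double ElapsedMsAtStop(Clock::time_point now) const {
        return std::chrono::duration<double, std::milli>(now - mAsyncOpenTime).count();
    }

private:
    Clock::time_point mAsyncOpenTime{};
};

// Stand-in for the telemetry backend: one list of samples per group label.
class SampleStore {
public:
    void Accumulate(const std::string& group, double elapsedMs) {
        mSamples[group].push_back(elapsedMs);
    }

    std::size_t Count(const std::string& group) const {
        auto it = mSamples.find(group);
        return it == mSamples.end() ? 0 : it->second.size();
    }

private:
    std::map<std::string, std::vector<double>> mSamples;
};
```

Passing the time points in explicitly, instead of calling `Clock::now()` inside the class, keeps the sketch deterministic and easy to test.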

The intent is we should be able to see if having the cache enabled speeds things up, slows things down, or changes the tail.. and if it has a better effect on "slow" networks than fast ones. It intentionally doesn't measure hit rate or anything like that - just the overall performance profile.
Given that the different groups aren't explicitly the same set of URIs from the same locations, this is going to rely on having a lot of data. Nightly might not be anywhere near enough.
Attached patch patch 0 (obsolete) — Splinter Review
Attachment #724753 - Flags: feedback?(taras.mozilla)
Attachment #724753 - Flags: feedback?(hurley)
Comment on attachment 724753 [details] [diff] [review]
patch 0

+            Telemetry::Accumulate(
+                telemID,
+                (TimeStamp::Now() - mCacheEffectExperimentAsyncOpenTime).ToMilliseconds());
+        }
use
void AccumulateTimeDelta(ID id, TimeStamp start, TimeStamp end = TimeStamp::Now());

I assume this is a temporary experiment, so we should probably bump bucket size to 50 without worrying about overhead, to increase the chances of bucket success.
Attachment #724753 - Flags: feedback?(taras.mozilla) → feedback+
Comment on attachment 724753 [details] [diff] [review]
patch 0

Review of attachment 724753 [details] [diff] [review]:
-----------------------------------------------------------------

Other than Taras' comment about using AccumulateTimeDelta, looks good to me.
Attachment #724753 - Flags: feedback?(hurley) → feedback+
Attached patch patch v1 Splinter Review
Attachment #724753 - Attachment is obsolete: true
Attachment #725102 - Flags: review?
Attachment #725102 - Flags: review? → review?(hurley)
Comment on attachment 725102 [details] [diff] [review]
patch v1

Review of attachment 725102 [details] [diff] [review]:
-----------------------------------------------------------------

ship it!
Attachment #725102 - Flags: review?(hurley) → review+
https://hg.mozilla.org/mozilla-central/rev/513fafb75e5b
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
So there is all of 1 day of data in the telemetry dashboard for this, and given that there is no control for URI or network there is reason to believe it's going to take a lot of data to make meaningful comparisons. One of our 4 categories has only about 65K datapoints in it, which is likely far too few.

Nonetheless - if you'd like to take the earliest hint, the early data definitely shows the cache helping overall response time especially on slower networks.

The fast networks have about 500K samples, the slow ones around 70K.

These are just desktop numbers for now.. probably not anywhere near enough data from mobile yet (if any - haven't really checked).

percentile  fast-on  fast-off  slow-on  slow-off  (ms)
25               68       135      135       378
50              226       318      378       894
75              533       894     1262      2516
90             1500      2117     3553      5961

You read that by saying "fast networks with the cache on have 25% of their transactions complete in 68ms or less, 50% in 226ms or less, and so on.."
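For readers who want to reproduce that kind of reading from a bucketed histogram, here is a minimal sketch. It assumes each bucket is labeled by an upper bound in milliseconds, which may not match how the telemetry dashboard labels its buckets; `Bucket` and `PercentileUpperBound` are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// One histogram bucket: transactions that finished in `upperBoundMs` or less
// (and above the previous bucket's bound). Buckets must be in ascending order.
struct Bucket {
    uint32_t upperBoundMs;
    uint64_t count;
};

// Returns the upper bound of the first bucket at which the cumulative count
// reaches `pct` percent of all samples. Returns 0 for an empty histogram.
uint32_t PercentileUpperBound(const std::vector<Bucket>& buckets, double pct) {
    uint64_t total = 0;
    for (const Bucket& b : buckets) {
        total += b.count;
    }
    if (total == 0) {
        return 0;
    }
    const double target = static_cast<double>(total) * pct / 100.0;
    uint64_t cumulative = 0;
    for (const Bucket& b : buckets) {
        cumulative += b.count;
        if (static_cast<double>(cumulative) >= target) {
            return b.upperBoundMs;
        }
    }
    return buckets.back().upperBoundMs;  // only reached if pct > 100
}
```

Since the result is always a bucket boundary, two groups can report the same value (for example 135 in two columns above) even when their underlying distributions differ inside that bucket.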
now with 4 days worth of data.. relationships are basically unchanged. At this point,  it seems like the cache is pretty useful even as is.

percentile  fast-on  fast-off  slow-on  slow-off  (ms)
25               81       135      160       378
50              226       318      533       894
75              633       894     1500      2117
90             1500      2117     4222      7083
(In reply to Patrick McManus [:mcmanus] from comment #10)
> now with 4 days worth of data.. relationships are basically unchanged. At
> this point,  it seems like the cache is pretty useful even as is.
> 

Also note the wild variation in perf with cache on. The 10K bucket is empty with cache off for fast connections. I'll let this sit for a few more days, but I'll bet that if we filter fast connections by netbooks (aka shitty hard drives), the cache will be a net loss.
Interesting. From eyeballing histogram numbers, the situation is much better on Android. Cache appears to be a clear win (if one discounts the fact that we have only 3K datapoints there atm).
We've got a fair number of datapoints but not a lot of submissions. Taras suggests temporarily turning up the data collection rate.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #732943 - Flags: review?(taras.mozilla)
Comment on attachment 732943 [details] [diff] [review]
remove 1 in 16 filter

Let's take this out on Friday.
Attachment #732943 - Flags: review?(taras.mozilla) → review+
https://hg.mozilla.org/mozilla-central/rev/5b710d7fe073
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Let's back this out for causing bug 858588.

We'll need to figure out why disabling the HTTP cache broke an application cache test.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Patrick McManus [:mcmanus] from comment #19)
> backouts
Why is it closed as fixed in that case?
Flags: needinfo?(mcmanus)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: needinfo?(mcmanus)
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → WONTFIX