Cache Effectiveness Telemetry

RESOLVED WONTFIX

Status

Core
Networking
--
enhancement
RESOLVED WONTFIX
5 years ago
4 years ago

People

(Reporter: mcmanus, Assigned: mcmanus)

Tracking

18 Branch
mozilla22
x86_64
Windows 7
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 1 obsolete attachment)

(Assignee)

Description

5 years ago
Let's find out if the HTTP cache helps us or hurts us.

The strategy here is a little crude, but I can live with that.

The experiment takes selected Aurora or Nightly sessions, waits 2 minutes after startup for things to settle down, and then puts them into an "experiment mode" for 15 minutes. After the 15 minutes expire, things return to their previous behavior.

To be selected, a session needs to have the allow-experiments pref set to true (its default) and the use-cache pref set to on (also the default). Beyond that, 1 in 16 sessions is selected at startup time, so if this really screws you up you can just restart :)
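A minimal sketch of the selection rule described above, not the code from the patch. The struct, the field names, and the `roll` parameter (a stand-in for whatever random value is drawn once at startup) are all hypothetical; the bug does not show the real pref names.

```cpp
#include <cstdint>

// Hypothetical mirror of the two prefs mentioned in the description.
struct SessionPrefs {
    bool allowExperiments = true;  // "allow-experiments" pref, default true
    bool useCache = true;          // "use-cache" pref, default on
};

// Returns true if this session should enter the experiment.
// `roll` is assumed to be a uniformly random value drawn once at startup.
bool IsSelectedForExperiment(const SessionPrefs& prefs, uint32_t roll) {
    if (!prefs.allowExperiments || !prefs.useCache) {
        return false;  // either pref flipped away from its default opts the session out
    }
    return roll % 16 == 0;  // 1 in 16 of the remaining sessions
}
```

Because the roll happens per session, restarting the browser draws again, which is why a restart is enough to leave the experiment in most cases.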

If you're in the experiment you get further divided into 1 of 4 groups. Two of those groups have their cache enabled, two have it disabled. Those categories are further split into "fast connections" and "slow connections". (Fast vs. slow is pretty crude: I count the TCP connects in those first 2 minutes, treating ones that complete in <= 125 ms as fast and ones that take > 125 ms as slow. If at least 1/3 of them are fast I say it's a fast-connected machine. If I thought it was worth revising I would.)
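The fast/slow split can be sketched roughly as follows. This is an illustration under assumptions: the names are made up, a session with no observed connects is treated as slow (the description does not say what happens in that case), and the cache on/off choice is passed in because the bug does not say how it is drawn.

```cpp
#include <cstddef>
#include <vector>

enum class Group { FastCacheOn, FastCacheOff, SlowCacheOn, SlowCacheOff };

// A connect that completes in <= 125 ms counts as fast. The machine is
// "fast-connected" if at least 1/3 of its connects were fast.
bool IsFastConnected(const std::vector<double>& connectTimesMs) {
    std::size_t fast = 0;
    for (double ms : connectTimesMs) {
        if (ms <= 125.0) {
            ++fast;
        }
    }
    // Assumption: with no samples at all, fall back to "slow".
    // Comparing fast * 3 against the total avoids rounding 1/3 in floating point.
    return !connectTimesMs.empty() && fast * 3 >= connectTimesMs.size();
}

// Combines the network class with the cache setting into one of the 4 groups.
Group PickGroup(bool fastConnected, bool cacheEnabled) {
    if (fastConnected) {
        return cacheEnabled ? Group::FastCacheOn : Group::FastCacheOff;
    }
    return cacheEnabled ? Group::SlowCacheOn : Group::SlowCacheOff;
}
```

Note how lenient the threshold is: a machine with one 100 ms connect and two 300 ms connects still lands in a "fast" group.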

Finally, assuming the experiment is running, a datapoint is collected for every HTTP transaction, measuring the elapsed time from AsyncOpen to the time onStopRequest is called. This is reported through telemetry along with the group you are allocated to.
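The shape of that measurement, sketched with std::chrono rather than Gecko's own types. `FakeTelemetry`, `Transaction`, and the integer group id are hypothetical stand-ins for illustration; the real patch uses Mozilla's Telemetry and TimeStamp classes.

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical telemetry sink: one list of elapsed-ms samples per group id.
struct FakeTelemetry {
    std::map<int, std::vector<int64_t>> samples;

    void RecordElapsed(int groupId, Clock::time_point start, Clock::time_point end) {
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
        samples[groupId].push_back(ms.count());
    }
};

// Hypothetical transaction: remembers when it was opened and reports the
// total elapsed time once the stop notification arrives.
struct Transaction {
    Clock::time_point asyncOpenTime;

    void AsyncOpen(Clock::time_point now) { asyncOpenTime = now; }

    void OnStopRequest(FakeTelemetry& telemetry, int groupId, Clock::time_point now) {
        telemetry.RecordElapsed(groupId, asyncOpenTime, now);
    }
};
```

Passing the clock reading in as a parameter keeps the sketch deterministic; real code would read the current time at each call site.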

The intent is that we should be able to see whether having the cache enabled speeds things up, slows things down, or changes the tail, and whether it has a better effect on "slow" networks than on fast ones. It intentionally doesn't measure hit rate or anything like that, just the overall performance profile.
(Assignee)

Comment 1

5 years ago
Given that the different groups aren't explicitly loading the same set of URIs from the same locations, this is going to rely on having a lot of data. Nightly might not provide anywhere near enough.
(Assignee)

Comment 2

5 years ago
Created attachment 724753 [details] [diff] [review]
patch 0
Attachment #724753 - Flags: feedback?(taras.mozilla)
(Assignee)

Updated

5 years ago
Attachment #724753 - Flags: feedback?(hurley)

Comment 3

5 years ago
Comment on attachment 724753 [details] [diff] [review]
patch 0

+            Telemetry::Accumulate(
+                telemID,
+                (TimeStamp::Now() - mCacheEffectExperimentAsyncOpenTime).ToMilliseconds());
+        }
Use this instead:
void AccumulateTimeDelta(ID id, TimeStamp start, TimeStamp end = TimeStamp::Now());

I assume this is a temporary experiment, so we should probably bump the bucket size to 50 without worrying about overhead, to increase the chances of bucket success.
Attachment #724753 - Flags: feedback?(taras.mozilla) → feedback+
Comment on attachment 724753 [details] [diff] [review]
patch 0

Review of attachment 724753 [details] [diff] [review]:
-----------------------------------------------------------------

Other than Taras' comment about using AccumulateTimeDelta, looks good to me.
Attachment #724753 - Flags: feedback?(hurley) → feedback+
(Assignee)

Comment 5

5 years ago
Created attachment 725102 [details] [diff] [review]
patch v1
Attachment #724753 - Attachment is obsolete: true
Attachment #725102 - Flags: review?
(Assignee)

Updated

5 years ago
Attachment #725102 - Flags: review? → review?(hurley)
Comment on attachment 725102 [details] [diff] [review]
patch v1

Review of attachment 725102 [details] [diff] [review]:
-----------------------------------------------------------------

ship it!
Attachment #725102 - Flags: review?(hurley) → review+

Comment 8

5 years ago
https://hg.mozilla.org/mozilla-central/rev/513fafb75e5b
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
(Assignee)

Comment 9

5 years ago
So there is all of 1 day of data in the telemetry dashboard for this, and given that there is no control for URI or network, there is reason to believe it's going to take a lot of data to make meaningful comparisons. One of our 4 categories has only about 65K datapoints in it, which is likely far too few.

Nonetheless, if you'd like to take the earliest hint, the early data definitely shows the cache helping overall response time, especially on slower networks.

The fast networks have about 500K samples, the slow ones around 70K.

These are just desktop numbers for now. There is probably not anywhere near enough data from mobile yet (if any; I haven't really checked).

percentile  fast-on  fast-off  slow-on  slow-off   (times in ms)
25               68       135      135       378
50              226       318      378       894
75              533       894     1262      2516
90             1500      2117     3553      5961

You read that by saying "fast networks with the cache on have 25% of their transactions complete in 68ms or less, 50% in 226ms or less, and so on.."
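To put a number on "helping", one can compute the relative reduction in elapsed time with the cache on versus off at each percentile. The sketch below uses the 1-day figures from the table above; the struct and function names are illustrative only.

```cpp
#include <array>

// One row of the 1-day table above; all times in milliseconds.
struct Row {
    int percentile;
    double fastOn, fastOff, slowOn, slowOff;
};

constexpr std::array<Row, 4> kOneDayData = {{
    {25, 68, 135, 135, 378},
    {50, 226, 318, 378, 894},
    {75, 533, 894, 1262, 2516},
    {90, 1500, 2117, 3553, 5961},
}};

// Fraction of elapsed time saved with the cache on, relative to cache off.
constexpr double Reduction(double cacheOnMs, double cacheOffMs) {
    return (cacheOffMs - cacheOnMs) / cacheOffMs;
}
```

At the median, for example, `Reduction(226, 318)` is about 0.29 for fast networks and `Reduction(378, 894)` is about 0.58 for slow ones, which matches the observation that the cache helps more on slower networks. These ratios compare different populations of requests, so they are a rough indication rather than a controlled result.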
(Assignee)

Comment 10

5 years ago
Now with 4 days' worth of data, the relationships are basically unchanged. At this point, it seems like the cache is pretty useful even as is.

percentile  fast-on  fast-off  slow-on  slow-off   (times in ms)
25               81       135      160       378
50              226       318      533       894
75              633       894     1500      2117
90             1500      2117     4222      7083

Comment 11

5 years ago
(In reply to Patrick McManus [:mcmanus] from comment #10)
> now with 4 days worth of data.. relationships are basically unchanged. At
> this point,  it seems like the cache is pretty useful even as is.
> 

Also note the wild variation in perf with the cache on. The 10K bucket is empty with the cache off for fast connections. I'll let this sit for a few more days, but I'll bet that if we filter fast connections by netbooks (aka shitty hard drives), the cache will be a net loss.

Comment 12

5 years ago
Interesting. From eyeballing the histogram numbers, the situation is much better on Android. The cache appears to be a clear win (if one discounts the fact that we have only 3K datapoints there at the moment).
(Assignee)

Comment 13

5 years ago
We've got a fair number of datapoints but not a lot of submissions. Taras suggests temporarily turning up the data collection rate.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 14

5 years ago
Created attachment 732943 [details] [diff] [review]
remove 1 in 16 filter
Attachment #732943 - Flags: review?(taras.mozilla)

Comment 15

5 years ago
Comment on attachment 732943 [details] [diff] [review]
remove 1 in 16 filter

Let's take this out on Friday.
Attachment #732943 - Flags: review?(taras.mozilla) → review+
https://hg.mozilla.org/mozilla-central/rev/5b710d7fe073
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago → 5 years ago
Resolution: --- → FIXED
(Assignee)

Comment 18

5 years ago
Let's back this out for causing bug 858588.

We'll need to figure out why disabling the HTTP cache broke an application cache test.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 20

5 years ago
(In reply to Patrick McManus [:mcmanus] from comment #19)
> backouts
Why is it closed as fixed in that case?
Flags: needinfo?(mcmanus)
(Assignee)

Updated

5 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Updated

5 years ago
Flags: needinfo?(mcmanus)
(Assignee)

Updated

4 years ago
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago → 4 years ago
Resolution: --- → WONTFIX