Closed Bug 1226564 Opened 9 years ago Closed 8 years ago

[e10s] CACHE V2 telemetry probe regressions

Categories

(Firefox :: General, defect, P1)

defect

Tracking


RESOLVED FIXED
Tracking Status
e10s + ---

People

(Reporter: rvitillo, Assigned: mconley)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Is the difference expected? See the provided URL and grep for the metric's name.
Summary: STARTUP_HTTP_CACHE_DISPOSITION_2_V2 differs between e10s and non-e10s → [e10s] STARTUP_HTTP_CACHE_DISPOSITION_2_V2 differs between e10s and non-e10s
Flags: needinfo?(jmathies)
Hey Roberto, do you know where these probes come from? 

STARTUP_HTTP_CACHE_DISPOSITION_2_V2
STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS
STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS

AFAICT these probes don't exist. I searched telemetry.mozilla.org, and they don't appear in the code base.

I have found these probes without the 'STARTUP_' prefix. Looking at t.m.o, there are no regressions with those.
Flags: needinfo?(jmathies) → needinfo?(rvitillo)
tracking-e10s: --- → ?
(In reply to Jim Mathies [:jimm] from comment #1)
> Hey Roberto, do you know where these probes come from? 
> 
> STARTUP_HTTP_CACHE_DISPOSITION_2_V2
> STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS
> STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS
> 
> AFAICT these probes don't exist. I searched telemetry.mozilla.org, and they
> don't appear in the code base.
> 
> I have found these probes without the 'STARTUP_' prefix. Looking at t.m.o,
> there are no regressions with those.

These probes are equivalent to the ones without the STARTUP_ prefix but collect data only until session restore has completed.
Flags: needinfo?(rvitillo)
Flags: needinfo?(jmathies)
HTTP_CACHE_DISPOSITION_2_V2:

enum CacheDisposition {
    kCacheHit = 1,
    kCacheHitViaReval = 2,
    kCacheMissedViaReval = 3,
    kCacheMissed = 4
};

https://telemetry.mozilla.org/new-pipeline/dist.html#!compare=e10sEnabled&cumulative=0&end_date=2015-11-30&keys=__none__!__none__!__none__&max_channel_version=nightly%252F45&measure=HTTP_CACHE_DISPOSITION_2_V2&min_channel_version=null&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-11-19&table=0&trim=1&use_submission_date=0

e10s is better at hitting the cache.
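
As context for the hit-rate percentages discussed later in this bug, here is a minimal, self-contained sketch of how a hit rate can be derived from the disposition counts above. This is illustrative only, not pipeline code: the sample counts are made up, and treating revalidation hits as hits is an assumption of the example.

#include <cstdint>
#include <cstdio>
#include <map>

enum CacheDisposition {
    kCacheHit = 1,
    kCacheHitViaReval = 2,
    kCacheMissedViaReval = 3,
    kCacheMissed = 4
};

int main() {
    // Illustrative counts only; real numbers come from the aggregated histogram.
    std::map<CacheDisposition, std::uint64_t> counts = {
        {kCacheHit, 260},
        {kCacheHitViaReval, 40},
        {kCacheMissedViaReval, 100},
        {kCacheMissed, 600}};

    std::uint64_t total = 0;
    std::uint64_t hits = 0;
    for (const auto& entry : counts) {
        total += entry.second;
        // Treat both direct hits and hits via revalidation as "hits".
        if (entry.first == kCacheHit || entry.first == kCacheHitViaReval) {
            hits += entry.second;
        }
    }
    std::printf("cache hit rate: %.1f%%\n", total ? 100.0 * hits / total : 0.0);
    return 0;
}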

STARTUP_HTTP_CACHE_DISPOSITION_2_V2:

Isn't listed on t.m.o. but 

http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/e10s_experiment.ipynb

shows a slight cache hit regression. It looks like the data set was from 20151117, so we need to refresh this with newer data. There was an image cache fix that landed around the same time.

NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS:

https://telemetry.mozilla.org/new-pipeline/dist.html#!compare=e10sEnabled&cumulative=0&end_date=2015-11-30&keys=__none__!__none__!__none__&max_channel_version=nightly%252F45&measure=NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS&min_channel_version=null&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-11-19&table=0&trim=1&use_submission_date=0

no regression on t.m.o

STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS:

Isn't listed on t.m.o. but 

http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/e10s_experiment.ipynb

shows a regression in the lower (0) bucket.

NETWORK_CACHE_V2_HIT_TIME_MS:

https://telemetry.mozilla.org/new-pipeline/dist.html#!compare=e10sEnabled&cumulative=0&end_date=2015-11-30&keys=__none__!__none__!__none__&max_channel_version=nightly%252F45&measure=NETWORK_CACHE_V2_HIT_TIME_MS&min_channel_version=null&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-11-19&table=0&trim=1&use_submission_date=0

slight regression in e10s.

STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS:

Isn't listed on t.m.o. but 

http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/e10s_experiment.ipynb

shows a regression in the lower (0) bucket.
Flags: needinfo?(jmathies)
Summary: [e10s] STARTUP_HTTP_CACHE_DISPOSITION_2_V2 differs between e10s and non-e10s → [e10s] CACHE V2 telemetry probe regressions
Hey Jason, here are a few netwerk cache related telemetry probes (similar to bug 1225097) that regress modestly under e10s. Can you direct this bug to someone who can dig into this and figure out if this is something we should block on?
Flags: needinfo?(jduell.mcbugs)
Kicking this back to triage. Is this something we're blocking on?
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(jduell.mcbugs)
Honza, Michal: 

So the first telemetry probe here (STARTUP_HTTP_CACHE_DISPOSITION_2_V2) seems to show that our cache hit rate during startup (I assume this includes the content of tabs that are restored?) is lower for e10s than non-e10s. (And the overall hit rates are low: it goes from a 10% hit rate in non-e10s to what looks like 5% for e10s. Aren't those numbers both low?) Do you have any ideas here? I'm drawing a blank.

The second one, STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS, looks like it's possibly just seeing the effect of doing some IPC? I.e. non-e10s has a lot of 0ms results, but since we're using IPDL that happens a lot less in e10s? Or does the CacheFileMetadata class only live in the parent process, and there's no IPC? (Same question for the 3rd probe, STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS.)
Flags: needinfo?(jduell.mcbugs) → needinfo?(honzab.moz)
Flags: needinfo?(michal.novotny)
(In reply to Jason Duell [:jduell] (needinfo? me) from comment #6)
> The second one, STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS, looks
> like it's possibly just seeing the effect of doing some IPC?  I.e. non-e10s
> has a lot of 0ms results, but since we're using IPDL that happens a lot less
> in e10s? Or does the CacheFileMetadata class only live in the parent process,
> and there's no IPC?

CacheFileMetadata lives only in the parent process. NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS should measure just the time needed to read the first part of the metadata from the disk. I'm not sure why it's slower in e10s. I guess the regression in NETWORK_CACHE_V2_HIT_TIME_MS has the same cause.
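
To make concrete what a "first read time" probe of this kind measures, here is a minimal, self-contained sketch: time the wall clock around the first disk read of a metadata file and report the elapsed milliseconds. This is not the CacheFileMetadata code; the chunk size, file path, and telemetry sink are placeholders.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

// Placeholder for the telemetry sink; just prints the sample here.
static void Accumulate(const char* aId, long long aMs) {
    std::printf("%s: %lld ms\n", aId, aMs);
}

// Sketch: time only the first chunk read of a metadata file, which is roughly
// what the probe above is described as measuring.
static bool ReadFirstMetadataChunk(const char* aPath, std::vector<char>& aBuf) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();

    std::ifstream file(aPath, std::ios::binary);
    if (!file) {
        return false;
    }
    aBuf.resize(4096);  // assumed size of the "first part" of the metadata
    file.read(aBuf.data(), static_cast<std::streamsize>(aBuf.size()));
    aBuf.resize(static_cast<std::size_t>(file.gcount()));

    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        Clock::now() - start);
    Accumulate("NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS", elapsed.count());
    return true;
}

int main() {
    std::vector<char> buf;
    ReadFirstMetadataChunk("cache-entry-metadata.bin", buf);  // hypothetical path
    return 0;
}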

I have no theory about STARTUP_HTTP_CACHE_DISPOSITION_2_V2.
Flags: needinfo?(michal.novotny)
https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&e10s=false&end_date=2015-12-14&keys=__none__!__none__!__none__&max_channel_version=aurora%252F44&measure=NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS&min_channel_version=null&processType=false&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-11-02&table=0&trim=1&use_submission_date=0

compared to

https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&e10s=true&end_date=2015-12-14&keys=__none__!__none__!__none__&max_channel_version=aurora%252F44&measure=NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS&min_channel_version=null&processType=false&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-10-29&table=0&trim=1&use_submission_date=0


gives roughly the same numbers, e10s faster!



https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&e10s=false&end_date=2015-12-14&keys=__none__!__none__!__none__&max_channel_version=aurora%252F44&measure=HTTP_CACHE_DISPOSITION_2_V2&min_channel_version=null&processType=false&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-11-02&table=0&trim=1&use_submission_date=0

compared to 

https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&e10s=true&end_date=2015-12-14&keys=__none__!__none__!__none__&max_channel_version=aurora%252F44&measure=HTTP_CACHE_DISPOSITION_2_V2&min_channel_version=null&processType=false&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-10-29&table=0&trim=1&use_submission_date=0

same result, e10s better.


Switch to the current Nightly (45) and you get roughly the same; e10s is even better. I really don't know how you got the original numbers, and the [URL] doesn't work.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(honzab.moz)
Resolution: --- → INVALID
Sadly the populations available on telemetry.mozilla.org are self-selecting. Only the populations in the experiments are statistically representative.

I reran the analysis with the numbers from our Beta 44 experiment. Sadly, they're showing no improvement: https://gist.github.com/chutten/64069ac69e142d1d3d39 

I did add a check of the non-startup Cache v2 number which does, indeed, show a less-severe regression than the STARTUP_ variant.

Can you offer an explanation for the differences?
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Flags: needinfo?(michal.novotny)
Flags: needinfo?(honzab.moz)
First I want to ask how exactly you are getting the numbers. I've already been looking into this and found quite the opposite of your claims. So before I spend more time on this, I want to know what exactly to look at and how. Thanks.
Flags: needinfo?(honzab.moz) → needinfo?(chutten)
Certainly.

In pursuit of knowledge to fuel data-driven decision-making, we conduct A/B experiments on prerelease channels. We have to find a population of Firefox users that a) are representative of Firefox users as a whole and b) have not yet turned E10s on or off by themselves.

Otherwise we see something like the numbers on telemetry.mozilla.org where users who have had a bad experience with E10s have turned it off. As a result, the E10s numbers end up looking a lot better than they really are.

Whether the user has any active Experiments is recorded in the telemetry pings that are sent to Mozilla. The analysis shown in the gist acts on this published dataset to get the experimental data (both the control and experiment branches), then graphs and compares them (e10s, the experiment branch, is green; non-e10s, the control branch, is blue).

In most cases we'll take a random sample of the full dataset because the full dataset is too unwieldy. In this case we're looking at 107k clients.

So, that's how data is submitted, published, and analysed. As for how it is collected, we'll need to turn to the codebase.

The (STARTUP_)HTTP_CACHE_DISPOSITION_2_V2 numbers come from nsHttpChannel::ProcessResponse and nsHttpChannel::ContinueConnect. Both call AccumulateCacheHitTelemetry, which calls Telemetry::Accumulate. The ones prefixed with STARTUP_ are just the numbers collected during the period from Firefox startup to the point where session restore completes. (This is collated separately since startup is a period of interest.)

The rest of the numbers appear to be macro'd. Quite a few are recorded in nsLoadGroup::TelemetryReportChannel, for instance... DXR has trouble getting around the prefixes, but I managed to find it eventually.

Does that answer your question?
Flags: needinfo?(chutten) → needinfo?(honzab.moz)
Exhausting!  Thanks.  I'll try to look into this, but will take some time.
Assignee: nobody → honzab.moz
FYI, I'm indisposed right now. I won't get to this sooner than next week, maybe.

Regarding the cache probes: personally I have absolutely no idea what this could be or how to proceed here. This is more a bug to be analyzed by QA, who can find a way to reproduce it somewhat locally. Sorry, I'm an engineer, not QA.

I really don't know what to do from just looking at the graphs.  I'm sure this will not be easy to reproduce locally, but I can spend some reasonable time trying to.

Are there any other probes (whatever the measurement) regressing in the same set of captured results, besides those in comment 11?

Regarding the page load times: I strongly believe this is an overall architecture problem; e10s is simply still too untuned, IMO. The IPC overhead can be involved here as well. These are only theories, though, which is why I'm working on Backtrack (bug 1148204), a tool to capture and help analyze overall scheduling and timing issues.


Other question:
This graph [1] says we have about a 26% cache hit rate; other builds are even over 30%, and e10s=on is always better. But your data says it's 10% for non-e10s and 5% for e10s. Can you explain that difference? Such a huge difference might indicate something is wrong with the data set you are running your analyses on. Or is the raw telemetry data just completely useless and not even worth looking at? I somehow refuse to believe that.


[1] https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2015-12-16&keys=__none__!__none__!__none__&max_channel_version=beta%252F43&measure=HTTP_CACHE_DISPOSITION_2_V2&min_channel_version=null&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2015-11-03&table=0&trim=1&use_submission_date=0
Status: REOPENED → NEW
Flags: needinfo?(honzab.moz) → needinfo?(chutten)
Chris, can we get an updated link to the beta experiment / no-addons data? We're still looking at aurora reports here.
Feel free to ignore t.m.o. It's the Wikipedia of Telemetry data: useful as a starting point, but by no means comprehensive or the result of statistical rigour. (For instance, it doesn't even have STARTUP_HTTP_CACHE_DISPOSITION_2_V2.)

Here's the latest, reviewed, study of Beta45ex2: https://github.com/vitillo/e10s_analyses/blob/master/beta45-withoutaddons/e10s_experiment.ipynb

It shows a regression in the following cache measurements:

STARTUP_HTTP_CACHE_DISPOSITION_2_V2
STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS
STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS
HTTP_PAGE_COMPLETE_LOAD
HTTP_PAGE_COMPLETE_LOAD_CACHED_V2
HTTP_PAGE_OPEN_TO_FIRST_FROM_CACHE_V2
HTTP_SUB_COMPLETE_LOAD
HTTP_SUB_COMPLETE_LOAD_CACHED_V2
HTTP_SUB_OPEN_TO_FIRST_FROM_CACHE_V2

It shows an improvement in the following cache measurement:
HTTP_SUB_REVALIDATION

Trends like these are identified from populations of users providing full distributions of measures over time. There are no steps to reproduce locally, because we are each only a population of one measuring distributions over short periods.

We would like to know the following for the above measurements:

* Are the regressions expected?

* How would these regressions manifest in user experience?

* Are these probes measuring meaningful things, or are they bogus?

From those answers we can direct effort.
Flags: needinfo?(chutten) → needinfo?(honzab.moz)
Depends on: 1253379
(In reply to Chris H-C :chutten from comment #15)
> Feel free to ignore t.m.o. It's the Wikipedia of Telemetry data: useful as a
> starting point, but by no means comprehensive or the result of statistical
> rigour. (for instance, it doesn't even have
> STARTUP_HTTP_CACHE_DISPOSITION_2_V2)
> 
> Here's the latest, reviewed, study of Beta45ex2:
> https://github.com/vitillo/e10s_analyses/blob/master/beta45-withoutaddons/
> e10s_experiment.ipynb
> 
> It shows a regression in the following cache measurements:
> 
> STARTUP_HTTP_CACHE_DISPOSITION_2_V2

My theory is that the point where we stop measuring STARTUP_ and switch to "normal" capturing simply comes at a different time on e10s vs. non-e10s. It happens sooner with e10s, or fewer requests happen before that point in e10s. At startup we do OCSP requests that are simply not cached. If the ratio of OCSP requests to normal requests is higher, it will show up as a regression.

I would ignore the regression here.

> STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS
> STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS

These are more of a mystery to me. They're captured only in the parent process, so no IPC overhead can cause this.

It might be that we do more I/O on startup with e10s. It could also have the same cause as STARTUP_HTTP_CACHE_DISPOSITION_2_V2: simply that the time span used for the "STARTUP_" capture is different (longer in non-e10s, or we do fewer HTTP requests in e10s). Over longer intervals OS preloading may intervene, or theoretically entries may be loaded/purged/reused during a longer startup span, and that would greatly improve these probes thanks to system caching.


Do we have probes for how many measurements, or HTTP requests directly, we do during the "STARTUP" phase?


> HTTP_PAGE_COMPLETE_LOAD
> HTTP_PAGE_COMPLETE_LOAD_CACHED_V2
> HTTP_SUB_COMPLETE_LOAD
> HTTP_SUB_COMPLETE_LOAD_CACHED_V2
> HTTP_PAGE_OPEN_TO_FIRST_FROM_CACHE_V2
> HTTP_SUB_OPEN_TO_FIRST_FROM_CACHE_V2

I'm afraid we don't correctly carry the "cached" timing values to the child. What gets captured here is just the few requests made on the parent, like https://self-repair.mozilla.org/en-US/repair/ and similar.

I think this bug makes the most important probes useless.  Filed bug 1253379 on it.

> 
> It shows an improvement in the following cache measurement:
> HTTP_SUB_REVALIDATION

This probe is measured only on the parent and doesn't bear any IPC overhead. It measures the time between sending the first byte of the conditional request and receiving the last byte of the 304 response. It's responseEnd - responseStart, captured only where there are also non-null cacheReadStart and cacheReadEnd timings (i.e. we are eventually loading from the cache). I believe it has improved under e10s because even the validation request must loop through the main thread, which is less exhausted on the parent process when e10s is on. This probe, though, also suffers from the same bug as those above.
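
Restating that capture rule as code, here is a minimal, self-contained sketch, illustrative only: the struct and field names merely mirror the timing names above, and the values in main() are made up. The probe reports responseEnd - responseStart only when cacheReadStart and cacheReadEnd are both non-null.

#include <chrono>
#include <cstdio>
#include <optional>

using TimeStamp = std::chrono::steady_clock::time_point;

// Placeholder timing record mirroring the timing names used above.
struct ChannelTiming {
    TimeStamp responseStart;
    TimeStamp responseEnd;
    std::optional<TimeStamp> cacheReadStart;  // null unless we read from the cache
    std::optional<TimeStamp> cacheReadEnd;
};

// Sketch of the capture rule: report responseEnd - responseStart, but only
// when the load also has non-null cache read timings.
static std::optional<long long> RevalidationTimeMs(const ChannelTiming& aTiming) {
    if (!aTiming.cacheReadStart || !aTiming.cacheReadEnd) {
        return std::nullopt;  // not eventually served from the cache; skip the probe
    }
    const auto delta = std::chrono::duration_cast<std::chrono::milliseconds>(
        aTiming.responseEnd - aTiming.responseStart);
    return delta.count();
}

int main() {
    const TimeStamp now = std::chrono::steady_clock::now();
    ChannelTiming timing{now, now + std::chrono::milliseconds(42),
                         now, now + std::chrono::milliseconds(40)};
    if (const auto ms = RevalidationTimeMs(timing)) {
        std::printf("HTTP_SUB_REVALIDATION sample: %lld ms\n", *ms);
    }
    return 0;
}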


Can we also check the HTTP_*_COMPLETE_LOAD_NET_V2 probes for regressions?
Flags: needinfo?(honzab.moz)
Thanks for the detailed analysis, Honza!

Chris, please add the 'HTTP_*_COMPLETE_LOAD_NET_V2' to the probes we report and clear this ni so the bug falls back into e10s triage.
Blocks: e10s-perf
Flags: needinfo?(michal.novotny) → needinfo?(chutten)
Priority: -- → P1
The analysis is now pending review: https://github.com/vitillo/e10s_analyses/pull/24
Flags: needinfo?(chutten)
Assignee: honzab.moz → mconley
I'm going to try to sum up what I've derived from this bug. I'll list each of the probes this bug covers and try to establish where each stands in this analysis. In some cases, I've clustered probes because they're in the same state.



STARTUP_HTTP_CACHE_DISPOSITION_2_V2

According to mayhemer in comment 16, we should ignore the regression here.
----------------------------------------------------------------------------------------


STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS
STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS

According to mayhemer in comment 16, these two are a mystery. He asks "Do we have probes for how many measurements, or HTTP requests directly, we do during the "STARTUP" phase?"
----------------------------------------------------------------------------------------


HTTP_PAGE_OPEN_TO_FIRST_FROM_CACHE_V2
HTTP_PAGE_COMPLETE_LOAD_CACHED_V2
HTTP_SUB_COMPLETE_LOAD_CACHED_V2
HTTP_SUB_OPEN_TO_FIRST_FROM_CACHE_V2

According to mayhemer in comment 16, we weren't calculating these probes correctly in the content process. This was fixed in bug 1253379, which has landed in 46. We will need to see new numbers on these probes from the 46 experiment to make an accurate assessment.
----------------------------------------------------------------------------------------


HTTP_PAGE_COMPLETE_LOAD_NET_V2
HTTP_SUB_COMPLETE_LOAD_NET_V2

Looking at the analysis in https://github.com/chutten/e10s_analyses/blob/0d7a789a673f9c825389263764175d502191d9f8/beta45-withoutaddons/e10s_experiment.ipynb, e10s seems to be performing better in general here. There are some small spikes in the higher buckets for e10s, but in general, e10s has more weight in the lower buckets.


So, TL;DR:

Waiting on HTTP_PAGE_OPEN_TO_FIRST_FROM_CACHE_V2, HTTP_PAGE_COMPLETE_LOAD_CACHED_V2, HTTP_SUB_COMPLETE_LOAD_CACHED_V2, HTTP_SUB_OPEN_TO_FIRST_FROM_CACHE_V2 data from the next Telemetry experiment. 

STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS, STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS need explanation.

All other probes seem to be okay, or can be ignored.
So mayhemer's theory here is that STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS and STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS seem worse because the parent is doing fewer network requests up until the sessionstore-windows-restored observer notification fires (this is what is used to snapshot the histograms and prefix them with STARTUP_ at http://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySession.jsm#1846).
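
For illustration, here is a self-contained sketch of that snapshot idea; it is not the TelemetrySession.jsm code, and the map-based storage, histogram names, and function names are placeholders. Regular histograms accumulate for the whole session, and when the session-restore notification fires their current contents are copied once under a STARTUP_ prefix.

#include <cstdint>
#include <map>
#include <string>

// Placeholder in-memory histograms keyed by name; values map bucket -> count.
static std::map<std::string, std::map<std::uint32_t, std::uint64_t>> gHistograms;

static void Accumulate(const std::string& aId, std::uint32_t aSample) {
    ++gHistograms[aId][aSample];
}

// Sketch of what happens on "sessionstore-windows-restored": copy the current
// contents of the histograms into STARTUP_-prefixed twins, exactly once.
static void SnapshotStartupHistograms() {
    static bool snapshotted = false;
    if (snapshotted) {
        return;
    }
    snapshotted = true;
    std::map<std::string, std::map<std::uint32_t, std::uint64_t>> startup;
    for (const auto& entry : gHistograms) {
        startup["STARTUP_" + entry.first] = entry.second;
    }
    gHistograms.insert(startup.begin(), startup.end());
}

int main() {
    Accumulate("HTTP_CACHE_DISPOSITION_2_V2", 1);  // sample before session restore
    SnapshotStartupHistograms();                   // the observer notification fires
    Accumulate("HTTP_CACHE_DISPOSITION_2_V2", 4);  // counted in the regular histogram only
    return 0;
}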

The idea here is that the reads off of the disk for the cache become faster as more lookups occur because the OS is optimizing for the disk accesses. This means that the more requests there are, the more accesses will report as cheap.

The idea that there are fewer requests is bolstered by the fact that there are fewer samples for these two probes in the e10s case: roughly 25% of the number gathered in the non-e10s case.

It seems to be more rare for there to be any histograms captured for either of these probes with e10s enabled than with non-e10s. In fact, in the e10s case on my Windows machine, I can only get the probes to appear if the OS has just started up (so the file access optimizations from the OS haven't been invoked yet).

So that's my explanation for the shape of the curve here for those two probes. I'm not sure there's much else to do here. I'm closing this one out.
Status: NEW → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #21)
> The idea here is that the reads off of the disk for the cache become faster
> as more lookups occur because the OS is optimizing for the disk accesses.
> This means that the more requests there are, the more accesses will report
> as cheap.

Yes, and it's further supported by my work on Backtrack. I can clearly see that the first request made (e10s) is blocked for almost 900 ms on the very first cache access, on an i7 with an SSD. We definitely want to do something about it, but it's completely unrelated to e10s ;)