Closed Bug 1506534 Opened 9 months ago Closed 4 months ago

Collect telemetry to measure how much penalty we will experience with first-party cache isolation

Categories

(Core :: Networking: HTTP, enhancement, P2)

Tracking

RESOLVED FIXED
mozilla68
Tracking Status
firefox68 --- fixed

People

(Reporter: dragana, Assigned: michal)

References

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file)

We want to measure how many times we have a cache hit where the cache entry has not previously been accessed from the current eTLD+1. In those cases we would need to request the entry again if we had cache partitioning by eTLD+1.
For each cache entry we will need a list of eTLD+1s (hashes of the eTLD+1) that have accessed the entry. The list should persist across Firefox restarts.

For each cache hit we will check whether the eTLD+1 is in the list and report telemetry on that (see the sketch below the proposed probes).

Proposed telemetry probes:

1) For each page load, report how many resources were a cache hit but were accessed for the first time from that eTLD+1 (the resource was in the cache but the eTLD+1 was not in the list). I expect in most cases this would be 0.

2) How many resources we would additionally fetch from the network in a session, i.e. how many resources were a cache hit but the eTLD+1 was not in the list.
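
A minimal sketch of the mechanism and the two proposed probes. All names here are hypothetical stand-ins, not actual Necko code, and std::hash stands in for whatever stable eTLD+1 hash would be persisted with the entry metadata:

```cpp
// Minimal sketch, not actual Necko code. Every cache hit checks whether
// the top-level document's eTLD+1 hash is already in the entry's
// persisted access list; if not, the hit would have been a miss under
// eTLD+1 partitioning and both proposed probes are incremented.
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>

// Per-entry set of eTLD+1 hashes; in the real cache this would persist
// across restarts together with the entry's metadata.
using AccessList = std::unordered_set<uint64_t>;

struct IsolationCounters {
  uint32_t firstAccessesThisPageLoad = 0;  // reported per page load (probe 1)
  uint32_t firstAccessesThisSession = 0;   // reported per session (probe 2)
};

void OnCacheHit(AccessList& aList, IsolationCounters& aCounters,
                const std::string& aTopLevelBaseDomain) {
  uint64_t hash = std::hash<std::string>{}(aTopLevelBaseDomain);
  if (aList.insert(hash).second) {  // true only on first insertion
    ++aCounters.firstAccessesThisPageLoad;
    ++aCounters.firstAccessesThisSession;
  }
}
```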
Ehsan, anything else we are interested in?
Assignee: nobody → michal.novotny
Priority: -- → P2
Version: 59 Branch → Trunk
Whiteboard: [necko-triaged]
I think it would be hard to tell from the proposed telemetry probes what effect first-party cache isolation would have on cache size. My plan is to store a hash of the eTLD+1 of every accessing site in the cache metadata and also keep a count of them in the cache index. Then we can very quickly iterate the index and learn how much larger the cache would need to be to keep the same amount of data cached with first-party isolation. To make reports comparable, the telemetry would be sent after writing some defined amount of data (e.g. 1GB), and then the stats would be reset and started again. It might also be interesting to know which content type is duplicated the most. Right now we don't have the content type in the index, but we wanted to add it for other purposes anyway.
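
A rough sketch of the index walk this would enable; IndexRecord is a hypothetical stand-in for the real cache index record, extended with the per-entry site count:

```cpp
// Hypothetical sketch of estimating the size penalty from the index.
// Without deduplication, an entry accessed by N sites would need N
// copies under first-party isolation.
#include <cstdint>
#include <vector>

struct IndexRecord {
  uint32_t fileSize;   // entry size on disk
  uint16_t siteCount;  // number of distinct eTLD+1s that accessed it
};

// Returns the factor by which the cache would need to grow to keep the
// same data cached with first-party isolation and no deduplication.
double EstimateGrowthFactor(const std::vector<IndexRecord>& aIndex) {
  uint64_t current = 0;
  uint64_t isolated = 0;
  for (const IndexRecord& rec : aIndex) {
    current += rec.fileSize;
    isolated += uint64_t(rec.fileSize) * rec.siteCount;
  }
  return current ? double(isolated) / double(current) : 1.0;
}
```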
The effect on the cache size can be mitigated by using a somewhat different structure for the cache (this will require some cache refactoring, but I think we need to do it anyway). We do not need to store a resource twice; we can store it once and have the cache index remember which eTLD+1s have fetched the resource (I would still calculate frecency for each eTLD+1 separately).

We already store resources multiple times because of origin attributes and the anonymous and private flags. Can you also add telemetry to measure how much penalty we already pay because of the OA, anonymous, and private flags?
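
A sketch of the proposed deduplicated layout (hypothetical structures only; note that the next comment explains why the real cache index cannot carry this information):

```cpp
// Hypothetical sketch of the proposed layout: one stored copy of the
// resource, shared by all accessing sites, with frecency tracked per
// eTLD+1 so eviction can still be computed per site.
#include <cstdint>
#include <string>
#include <unordered_map>

struct SharedCacheEntry {
  std::string bodyFile;  // single on-disk copy of the resource
  // Per-site frecency, keyed by eTLD+1 hash.
  std::unordered_map<uint64_t, double> frecencyBySite;
};
```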
(In reply to Dragana Damjanovic [:dragana] from comment #4)
> The effect on the cache size can be mitigated by using a somewhat different
> structure for the cache (this will require some cache refactoring, but I
> think we need to do it anyway). We do not need to store a resource twice;
> we can store it once and have the cache index remember which eTLD+1s have
> fetched the resource (I would still calculate frecency for each eTLD+1
> separately).

I don't understand what you mean. Such information cannot be stored in the index, because the index can be deleted at any time and we must be able to rebuild it from the entry files. So it has to be stored in the entries.

> We already store resources multiple times because of origin attributes and
> the anonymous and private flags. Can you also add telemetry to measure how
> much penalty we already pay because of the OA, anonymous, and private flags?

No. All these flags are part of the entry's hash, so we cannot find the potential duplicates easily.
Depends on: 1533369

Whenever a cache entry is accessed during a document load, the eTLD+1 of the top-level document is added to the entry's metadata. The number of accessing sites is also stored in the cache index, so we know how many copies of each entry we would have if we did first-party isolation without data deduplication. The telemetry is sent every time we write 2GB to the cache, and then the data is reset. The telemetry report ID is an identifier of the telemetry cycle and is used to invalidate the eTLD+1 hashes in all cache entries.
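
A simplified sketch of that reporting cycle (function names hypothetical; the 2GB threshold and the report ID semantics are as described above):

```cpp
// Simplified sketch of the telemetry cycle, not the actual patch.
// Telemetry is sent after every 2GB written, then the stats restart and
// the report ID is bumped, which invalidates the eTLD+1 hashes stored
// in entry metadata without rewriting any files on disk.
#include <cstdint>

constexpr uint64_t kReportThresholdBytes = 2ULL << 30;  // 2GB

struct TelemetryCycle {
  uint64_t bytesWritten = 0;
  uint32_t reportId = 1;  // copied into entry metadata on each access
};

void OnBytesWritten(TelemetryCycle& aCycle, uint64_t aBytes) {
  aCycle.bytesWritten += aBytes;
  if (aCycle.bytesWritten >= kReportThresholdBytes) {
    // SendIsolationTelemetry();  // hypothetical: report the probes below
    aCycle.bytesWritten = 0;
    ++aCycle.reportId;  // entries carrying an older ID are treated as empty
  }
}
```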

Is data review done in Bugzilla or in Phabricator?

  1. What questions will you answer with this data?
  • It tells us the impact of first-party cache isolation, i.e. whether data deduplication is needed.
  2. Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?
  • It's needed to decide how to implement first-party cache isolation.
  3. What alternative methods did you consider to answer these questions? Why were they not sufficient?
  • There isn't an alternative.
  4. Can current instrumentation answer these questions?
  • No.
  5. List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.
  • Percentage increase of cache size, category 1, bug #1506534
  • Percentage increase of cache entry count, category 1, bug #1506534
  • Number of unique sites accessing each cache entry, category 1, bug #1506534
  6. How long will this data be collected?
  • For now it's set to expire in Firefox 74, but it might be extended.
  7. What populations will you measure?
  • Prerelease channels.
  8. If this data collection is default on, what is the opt-out mechanism for users?
  • There is no specific opt-out possibility just for these probes, so the general opt-out mechanism applies.
  9. Please provide a general description of how you will analyze this data.
  • Basic analysis on the TMO measurement dashboard.
  10. Where do you intend to share the results of your analysis?
  • The data will be used to decide how to implement first-party cache isolation.
Flags: needinfo?(chutten)

Preliminary Notes:

For future Data Collection Reviews please attach them to the bug and use the data-review? flag. The review process is documented on wikimo: https://wiki.mozilla.org/Firefox/Data_Collection

Could you please expand a little on how this data collection works? I was trying to follow the code and I'm not sure I understand. It looks as though, occasionally, a sampling of the cache contents is taken, including the number of unique eTLD+1 "base domains" that accessed each cache entry. Unlike the size and count of these items, the number of base domains tries to avoid double-counting by including a generation system (the "telemetry record id"). Is that about right?

Also, could you give a broader explanation of how this data will be used to answer the questions you mention in the data review? What does a bad result look like? What does a good one look like? How will you interpret these results to implement cache isolation?

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is Telemetry so is documented in its definitions file Histograms.json and the Probe Dictionary.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

This collection will expire in Firefox 74 and is being monitored by :michal.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical. (Though the cache is filled and accessed as a result of user interaction, specifically the user's web browsing, the pattern of cache accesses and sizes is purely technical.)

Is the data collection request for default-on or default-off?

Default on for pre-release channels only.

Does the instrumentation include the addition of any new identifiers?

No. The "Telemetry Record ID" is a client-local generation mechanism.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

Yes. :michal is responsible for renewing or removing the collection before it expires in Firefox 74.


Result: datareview+, pending answers to the questions in the Preliminary Notes.

Flags: needinfo?(chutten)

(In reply to Chris H-C :chutten from comment #8)

> Could you please expand a little on how this data collection works? I was
> trying to follow the code and I'm not sure I understand. It looks as though,
> occasionally, a sampling of the cache contents is taken, including the
> number of unique eTLD+1 "base domains" that accessed each cache entry.
> Unlike the size and count of these items, the number of base domains tries
> to avoid double-counting by including a generation system (the "telemetry
> record id"). Is that about right?

The telemetry is sent every time 2GB of data is written to the cache. The most common cache size is 1GB, and my guess is that after writing 2GB of data the cache has been used long enough to have representative data. After sending the telemetry data, all access info is discarded, so the new report won't include any eTLD+1 accesses from the previous telemetry session. This is true for all probes, not just NETWORK_CACHE_ISOLATION_UNIQUE_SITE_ACCESS_COUNT. Clearing all access info from the cache index is simple, because we keep everything in memory. But we cannot easily erase access info from all cache entries on the disk, and that's where the "telemetry report ID" helps: the ID is included in the cache entries, and when the global ID is changed, all information stored in the entries on disk is invalidated.
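
A sketch of that invalidation trick: an entry's stored access info is treated as empty whenever its recorded report ID differs from the current global one (names hypothetical):

```cpp
// Hypothetical sketch of the report-ID invalidation. Disk entries are
// never rewritten when a cycle ends; their access info simply becomes
// stale and is discarded lazily on the next access.
#include <cstdint>
#include <unordered_set>

struct EntryAccessInfo {
  uint32_t reportId = 0;                    // cycle the hashes belong to
  std::unordered_set<uint64_t> siteHashes;  // eTLD+1 hashes
};

// Returns true if this is the first access from the site in the current
// telemetry cycle.
bool RecordSiteAccess(EntryAccessInfo& aInfo, uint32_t aCurrentReportId,
                      uint64_t aSiteHash) {
  if (aInfo.reportId != aCurrentReportId) {
    aInfo.siteHashes.clear();  // data from a previous cycle is invalid
    aInfo.reportId = aCurrentReportId;
  }
  return aInfo.siteHashes.insert(aSiteHash).second;
}
```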

It's an open question whether clearing all access info after sending the telemetry is useful or not. In any case, we'll find out after receiving some telemetry data. If we receive a lot of reports for NETWORK_CACHE_ISOLATION_UNIQUE_SITE_ACCESS_COUNT with a value of 0, then we will either want to increase the 2GB limit, avoid resetting the data, or both.

> Also, could you give a broader explanation of how this data will be used to
> answer the questions you mention in the data review? What does a bad result
> look like? What does a good one look like? How will you interpret these
> results to implement cache isolation?

Data deduplication isn't trivial to implement, and if the increase in size is small enough we could implement the isolation without it. There has been no discussion yet about what a "small" increase is.

Thank you very much for those explanations. datareview+ stands.

Pushed by mnovotny@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c45916f0bea2
Collect telemetry to measure how much penalty we will experience with first-party cache isolation, r=mayhemer, data-r=chutten
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68