Closed Bug 763351 Opened 8 years ago Closed 8 years ago

kCacheMissedViaReval (3) is never reported in cache disposition telemetry (e.g. HTTP_CACHE_DISPOSITION)

Categories

(Core :: Networking: HTTP, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla16
Tracking Status
firefox13 --- wontfix
firefox14 + wontfix
firefox15 + fixed
firefox16 + fixed

People

(Reporter: briansmith, Assigned: briansmith)

References

Details

Attachments

(4 files, 1 obsolete file)

Look at the telemetry histograms for HTTP_CACHE_DISPOSITION for all dates prior to today (i.e. over the entire time we have been collecting telemetry).

The values (at the time I looked) are:

kCacheHit (1): 32.48%
kCacheHitViaReval (2): 16.20%
kCacheMissed (4): 51.32%

Note that these add up to exactly 100%.

Note that kCacheMissedViaReval (3) is COMPLETELY missing--not just low, but NEVER reported, AFAICT.

What this telemetry is saying that is, if we have a cache entry, 100% of the time it will revalidate successfully--i.e. no server has EVER returned anything other than a 304 response to any of our conditional requests. That seems impossible to me. (I could believe the number would be CLOSE to zero; but, it cannot be zero.)
This patch contains an xpcshell test + some logging in nsHttpChannel that shows that it is at least POSSIBLE for us to get kCacheMissedViaReval. If you run the test with NSPR_LOG_MODULES=nsHTTP:5, then the log output should include:

0[31f940]: CACHE DISPOSITION: 4
0[31f940]: CACHE DISPOSITION: 3
0[31f940]: CACHE DISPOSITION: 4
0[31f940]: CACHE DISPOSITION: 3
0[31f940]: CACHE DISPOSITION: 4
0[31f940]: CACHE DISPOSITION: 3

So, I think the problem is not in how we calculate the cache disposition telemetry value. The problem is either in telemetry itself or somewhere else in nsHttpChannel. I had thoughe the most likely situation is that, when we need to do validation, we're switching to heuristic validation, or we're throwing the entry away. But, in that case, I would expect my simple test in this attachment to never output CACHE DISPOSITION: 3. It is possible that my test here is not "real world" enough and that my suspicion is still correct.
Note that the attachment contains a fix for bug 763342, in addition to the work done to investigate this bug.
In the debugger, I found that telemetry is putting values 2 and 3 into bucket 2.

I wrote a simple test server in python that always returns a 200 response with an ETag and Last-Modified. Thus, reloading this page should cause kCacheMissedViaReval all the time. My logging from the first patch shows that indeed we do calculate the disposition to be kCacheMissedViaReval (3).

I made the following change:

-HISTOGRAM(HTTP_CACHE_DISPOSITION, 1, 5, 5, LINEAR, "HTTP Cache Hit, Reval, Failed-Reval, Miss")
HISTOGRAM(HTTP_CACHE_DISPOSITION, 1, 4, 5, LINEAR, "HTTP Cache Hit, Reval, Failed-Reval, Miss")

and with that change, kCacheMissedViaReval (3) is reported.

Now I am starting to suspect that there might be a bug in Telemetry itself.

In the debugger, I noticed some oddity in histogram.cc. the ranges_[] array for this histogram is { 0, 1, 2, 4, 5, 2147483647 }; This does not make sense; there should be a "3" in there between 2 and 4 because the range of the 4th bucket is exactly (3, 3).

Consequently, I believe this is a bug in telemetry and I will file it, blocking this bug.
Attached file Python test server
Here is my python test server that I used to create a reproducable test for this. Run it with "python testserver.py" and then point firefox to http://localhost:8000. Reload the page a bunch of times. The above patch that adds the logging of the cache disposition may also be helpful.
In reply to Nathan Froyd (:froydnj) from bug 732053 comment #2 on 2012-03-08 (three months ago):
> Histograms that could benefit from this macro [HISTOGRAM_ENUMERATED_VALUES]:
> 
> GC_REASON
> HTTP_*_DISPOSITION
> CACHE_*_SEARCH (odd to have a linear histogram for ms histogram, especially
> one so finely bucketed)
> CACHE_SERVICE_LOCK_WAIT* (same)
> PLACES_EXPIRATION_STEPS_TO_CLEAN
> DNS_LOOKUP_METHOD
> UPDATE_STATUS
> 
> but most of those would involve a small tweak to the histogram (either in #
> of buckets or upper bound); doesn't seem worth the effort

So, we need to switch to HISTOGRAM_ENUMERATED_VALUES for this histogram and others, and discard most/all of the data from these probes.
Summary: kCacheMissedViaReval (3) is neer reported in cache disposition telemetry (e.g. HTTP_CACHE_DISPOSITION) → kCacheMissedViaReval (3) is never reported in cache disposition telemetry (e.g. HTTP_CACHE_DISPOSITION)
Ugh--what a mess.  Thanks for noticing this and looking into it.
Note we should rename the histograms when we discard data.
(In reply to Brian Smith (:bsmith) from comment #2)
Nick, I didn't write an automated test for this, but you can verify that it is working by browsing around a little bit and then observing that you have positive counts for values 1, 2, 3, and 4 for HTTP_CACHE_DISPOSITION_2 and the other DISPOSITION_2 probes. But, see also bug 763342, where the device-specific telemetry will only show cache hits.

I left the max at 5 even though there are only 4 distinct valid values now. I am not sure how much of a difference this makes, but AFAICT it shouldn't hurt.
Assignee: nobody → bsmith
Status: NEW → ASSIGNED
Attachment #632537 - Flags: review?(hurley)
Marking as tracking all unreleased versions because we need this telemetry to be corrected ASAP.
Target Milestone: --- → mozilla16
Comment on attachment 632537 [details] [diff] [review]
Fix cache hit rate telemetry

Review of attachment 632537 [details] [diff] [review]:
-----------------------------------------------------------------

Ship it!
Attachment #632537 - Flags: review?(hurley) → review+
Blocks: 763359
No longer depends on: 763359
This appears to have an r+. Anything holding us back from landing on m-c, and then nominating for Aurora/Beta yet?
https://hg.mozilla.org/mozilla-central/rev/5a2f50db17b3
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment on attachment 632537 [details] [diff] [review]
Fix cache hit rate telemetry

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 763359

User impact if declined: Wrong telemetry in Aurora and Beta branches

Testing completed (on m-c, etc.): Landed on m-c last week.

Risk to taking this patch (and alternatives if risky): Low risk. This only modifies telemetry probes.

String or UUID changes made by this patch: NONE.
Attachment #632537 - Flags: approval-mozilla-beta?
Attachment #632537 - Flags: approval-mozilla-aurora?
Comment on attachment 632537 [details] [diff] [review]
Fix cache hit rate telemetry

[Triage Comment]
This missed our last beta. Let's get this into Aurora 15 first.
Attachment #632537 - Flags: approval-mozilla-beta?
Attachment #632537 - Flags: approval-mozilla-beta-
Attachment #632537 - Flags: approval-mozilla-aurora?
Attachment #632537 - Flags: approval-mozilla-aurora+
Attached patch Fix for mozilla-aurora (obsolete) — Splinter Review
A different version of the patch is needed for mozilla-aurora because mozilla-central and mozilla-aurora diverged.
Attachment #640704 - Flags: review?(hurley)
Attachment #640704 - Attachment is obsolete: true
Attachment #640704 - Flags: review?(hurley)
Attachment #640705 - Flags: review?(hurley)
Attachment #640705 - Flags: review?(hurley) → review+
(In reply to Brian Smith (:bsmith) from comment #17)
> Created attachment 640705 [details] [diff] [review]
> Fix for aurora [v2]

Please nominate this patch for mozilla-beta approval since 15 is now on Beta.
Comment on attachment 640705 [details] [diff] [review]
Fix for aurora [v2]

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 763359

User impact if declined: Wrong telemetry in Aurora and Beta branches

Testing completed (on m-c, etc.): Already in m-c and m-a for a while.

Risk to taking this patch (and alternatives if risky): Low risk. This only modifies telemetry probes.

String or UUID changes made by this patch: NONE.
Attachment #640705 - Flags: approval-mozilla-beta?
Comment on attachment 640705 [details] [diff] [review]
Fix for aurora [v2]

[Triage Comment]
More/better data is good - approving for Beta 15.
Attachment #640705 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
It hasn't landed in Beta 15.
Let's get this in this week please, we'll want to know this has landed in Beta before the next one gets built.
Nick - can you land this on beta?
Oddly, it looks as if this has already landed on beta: http://hg.mozilla.org/releases/mozilla-beta/rev/99faf1860ca5 Not sure when it happened, exactly
The beta landing is here, this bug apparently wasn't update properly.

http://hg.mozilla.org/releases/mozilla-beta/rev/99faf1860ca5
You need to log in before you can comment on or make changes to this bug.