Last Comment Bug 763351 - kCacheMissedViaReval (3) is never reported in cache disposition telemetry (e.g. HTTP_CACHE_DISPOSITION)
: kCacheMissedViaReval (3) is never reported in cache disposition telemetry (e....
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Networking: HTTP (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla16
Assigned To: Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
:
: Patrick McManus [:mcmanus]
Mentors:
Depends on: 764190
Blocks: 763359 764260
  Show dependency treegraph
 
Reported: 2012-06-10 15:08 PDT by Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
Modified: 2012-08-14 09:38 PDT (History)
10 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
wontfix
+
wontfix
+
fixed
+
fixed


Attachments
Test validation with non-304 responses with logging (6.13 KB, patch)
2012-06-10 15:16 PDT, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details | Diff | Splinter Review
Python test server (1.39 KB, text/plain)
2012-06-10 17:03 PDT, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details
Fix cache hit rate telemetry (4.74 KB, patch)
2012-06-12 21:13 PDT, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
hurley: review+
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta-
Details | Diff | Splinter Review
Fix for mozilla-aurora (4.94 KB, patch)
2012-07-10 11:56 PDT, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
no flags Details | Diff | Splinter Review
Fix for aurora [v2] (6.48 KB, patch)
2012-07-10 11:58 PDT, Brian Smith (:briansmith, :bsmith, use NEEDINFO?)
hurley: review+
akeybl: approval‑mozilla‑beta+
Details | Diff | Splinter Review

Description Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-06-10 15:08:37 PDT
Look at the telemetry histograms for HTTP_CACHE_DISPOSITION for all dates prior to today (i.e. over the entire time we have been collecting telemetry).

The values (at the time I looked) are:

kCacheHit (1): 32.48%
kCacheHitViaReval (2): 16.20%
kCacheMissed (4): 51.32%

Note that these add up to exactly 100%.

Note that kCacheMissedViaReval (3) is COMPLETELY missing--not just low, but NEVER reported, AFAICT.

What this telemetry is saying that is, if we have a cache entry, 100% of the time it will revalidate successfully--i.e. no server has EVER returned anything other than a 304 response to any of our conditional requests. That seems impossible to me. (I could believe the number would be CLOSE to zero; but, it cannot be zero.)
Comment 1 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-06-10 15:16:02 PDT
Created attachment 631777 [details] [diff] [review]
Test validation with non-304 responses with logging

This patch contains an xpcshell test + some logging in nsHttpChannel that shows that it is at least POSSIBLE for us to get kCacheMissedViaReval. If you run the test with NSPR_LOG_MODULES=nsHTTP:5, then the log output should include:

0[31f940]: CACHE DISPOSITION: 4
0[31f940]: CACHE DISPOSITION: 3
0[31f940]: CACHE DISPOSITION: 4
0[31f940]: CACHE DISPOSITION: 3
0[31f940]: CACHE DISPOSITION: 4
0[31f940]: CACHE DISPOSITION: 3

So, I think the problem is not in how we calculate the cache disposition telemetry value. The problem is either in telemetry itself or somewhere else in nsHttpChannel. I had thoughe the most likely situation is that, when we need to do validation, we're switching to heuristic validation, or we're throwing the entry away. But, in that case, I would expect my simple test in this attachment to never output CACHE DISPOSITION: 3. It is possible that my test here is not "real world" enough and that my suspicion is still correct.
Comment 2 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-06-10 15:17:41 PDT
Note that the attachment contains a fix for bug 763342, in addition to the work done to investigate this bug.
Comment 3 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-06-10 17:01:26 PDT
In the debugger, I found that telemetry is putting values 2 and 3 into bucket 2.

I wrote a simple test server in python that always returns a 200 response with an ETag and Last-Modified. Thus, reloading this page should cause kCacheMissedViaReval all the time. My logging from the first patch shows that indeed we do calculate the disposition to be kCacheMissedViaReval (3).

I made the following change:

-HISTOGRAM(HTTP_CACHE_DISPOSITION, 1, 5, 5, LINEAR, "HTTP Cache Hit, Reval, Failed-Reval, Miss")
HISTOGRAM(HTTP_CACHE_DISPOSITION, 1, 4, 5, LINEAR, "HTTP Cache Hit, Reval, Failed-Reval, Miss")

and with that change, kCacheMissedViaReval (3) is reported.

Now I am starting to suspect that there might be a bug in Telemetry itself.

In the debugger, I noticed some oddity in histogram.cc. the ranges_[] array for this histogram is { 0, 1, 2, 4, 5, 2147483647 }; This does not make sense; there should be a "3" in there between 2 and 4 because the range of the 4th bucket is exactly (3, 3).

Consequently, I believe this is a bug in telemetry and I will file it, blocking this bug.
Comment 4 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-06-10 17:03:56 PDT
Created attachment 631784 [details]
Python test server

Here is my python test server that I used to create a reproducable test for this. Run it with "python testserver.py" and then point firefox to http://localhost:8000. Reload the page a bunch of times. The above patch that adds the logging of the cache disposition may also be helpful.
Comment 5 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-06-10 20:20:43 PDT
In reply to Nathan Froyd (:froydnj) from bug 732053 comment #2 on 2012-03-08 (three months ago):
> Histograms that could benefit from this macro [HISTOGRAM_ENUMERATED_VALUES]:
> 
> GC_REASON
> HTTP_*_DISPOSITION
> CACHE_*_SEARCH (odd to have a linear histogram for ms histogram, especially
> one so finely bucketed)
> CACHE_SERVICE_LOCK_WAIT* (same)
> PLACES_EXPIRATION_STEPS_TO_CLEAN
> DNS_LOOKUP_METHOD
> UPDATE_STATUS
> 
> but most of those would involve a small tweak to the histogram (either in #
> of buckets or upper bound); doesn't seem worth the effort

So, we need to switch to HISTOGRAM_ENUMERATED_VALUES for this histogram and others, and discard most/all of the data from these probes.
Comment 6 Jason Duell [:jduell] (needinfo me) 2012-06-11 10:20:29 PDT
Ugh--what a mess.  Thanks for noticing this and looking into it.
Comment 7 (dormant account) 2012-06-11 11:24:58 PDT
Note we should rename the histograms when we discard data.
Comment 8 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-06-12 21:13:07 PDT
Created attachment 632537 [details] [diff] [review]
Fix cache hit rate telemetry

(In reply to Brian Smith (:bsmith) from comment #2)
Nick, I didn't write an automated test for this, but you can verify that it is working by browsing around a little bit and then observing that you have positive counts for values 1, 2, 3, and 4 for HTTP_CACHE_DISPOSITION_2 and the other DISPOSITION_2 probes. But, see also bug 763342, where the device-specific telemetry will only show cache hits.

I left the max at 5 even though there are only 4 distinct valid values now. I am not sure how much of a difference this makes, but AFAICT it shouldn't hurt.
Comment 9 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-06-12 21:14:26 PDT
Marking as tracking all unreleased versions because we need this telemetry to be corrected ASAP.
Comment 10 Nicholas Hurley [:nwgh][:hurley] 2012-06-13 09:19:03 PDT
Comment on attachment 632537 [details] [diff] [review]
Fix cache hit rate telemetry

Review of attachment 632537 [details] [diff] [review]:
-----------------------------------------------------------------

Ship it!
Comment 11 Alex Keybl [:akeybl] 2012-06-20 16:40:04 PDT
This appears to have an r+. Anything holding us back from landing on m-c, and then nominating for Aurora/Beta yet?
Comment 12 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-06-21 22:20:06 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/5a2f50db17b3
Comment 13 Ed Morley [:emorley] 2012-06-22 09:16:55 PDT
https://hg.mozilla.org/mozilla-central/rev/5a2f50db17b3
Comment 14 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-10 09:07:10 PDT
Comment on attachment 632537 [details] [diff] [review]
Fix cache hit rate telemetry

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 763359

User impact if declined: Wrong telemetry in Aurora and Beta branches

Testing completed (on m-c, etc.): Landed on m-c last week.

Risk to taking this patch (and alternatives if risky): Low risk. This only modifies telemetry probes.

String or UUID changes made by this patch: NONE.
Comment 15 Alex Keybl [:akeybl] 2012-07-10 10:41:31 PDT
Comment on attachment 632537 [details] [diff] [review]
Fix cache hit rate telemetry

[Triage Comment]
This missed our last beta. Let's get this into Aurora 15 first.
Comment 16 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-10 11:56:39 PDT
Created attachment 640704 [details] [diff] [review]
Fix for mozilla-aurora

A different version of the patch is needed for mozilla-aurora because mozilla-central and mozilla-aurora diverged.
Comment 17 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-10 11:58:25 PDT
Created attachment 640705 [details] [diff] [review]
Fix for aurora [v2]
Comment 18 Lukas Blakk [:lsblakk] use ?needinfo 2012-07-18 09:21:55 PDT
(In reply to Brian Smith (:bsmith) from comment #17)
> Created attachment 640705 [details] [diff] [review]
> Fix for aurora [v2]

Please nominate this patch for mozilla-beta approval since 15 is now on Beta.
Comment 19 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2012-07-18 22:11:53 PDT
Comment on attachment 640705 [details] [diff] [review]
Fix for aurora [v2]

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 763359

User impact if declined: Wrong telemetry in Aurora and Beta branches

Testing completed (on m-c, etc.): Already in m-c and m-a for a while.

Risk to taking this patch (and alternatives if risky): Low risk. This only modifies telemetry probes.

String or UUID changes made by this patch: NONE.
Comment 20 Alex Keybl [:akeybl] 2012-07-20 15:10:54 PDT
Comment on attachment 640705 [details] [diff] [review]
Fix for aurora [v2]

[Triage Comment]
More/better data is good - approving for Beta 15.
Comment 21 Scoobidiver (away) 2012-07-27 08:14:04 PDT
It hasn't landed in Beta 15.
Comment 22 Lukas Blakk [:lsblakk] use ?needinfo 2012-08-02 12:02:19 PDT
Let's get this in this week please, we'll want to know this has landed in Beta before the next one gets built.
Comment 23 Josh Aas 2012-08-14 07:48:53 PDT
Nick - can you land this on beta?
Comment 24 Nicholas Hurley [:nwgh][:hurley] 2012-08-14 09:17:41 PDT
Oddly, it looks as if this has already landed on beta: http://hg.mozilla.org/releases/mozilla-beta/rev/99faf1860ca5 Not sure when it happened, exactly
Comment 25 Josh Aas 2012-08-14 09:38:40 PDT
The beta landing is here, this bug apparently wasn't update properly.

http://hg.mozilla.org/releases/mozilla-beta/rev/99faf1860ca5

Note You need to log in before you can comment on or make changes to this bug.