Closed Bug 1356097 Opened 7 years ago Closed 7 years ago

Some of our HTTP page load telemetry metrics look bogus

Categories

(Core :: Networking: HTTP, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: jduell.mcbugs, Assigned: valentin)

Details

(Whiteboard: [necko-active])

Attachments

(2 files)

I was looking over the list of the probes we're planning to use for Race-cache-with-network, and I'm seeing numbers that don't make sense:

Allegedly HTTP_SUB_COMPLETE_LOAD gets fired each time we hit nsLoadGroup::RemoveRequest().  And then HTTP_PAGE_COMPLETE_LOAD gets fired for the last remaining load in the loadGroup.

But the numbers look weird to me if this is true.  For instance, HTTP_PAGE_COMPLETE_LOAD_CACHED_V2 has a lower median time than HTTP_SUB_COMPLETE_LOAD_CACHED_V2, which doesn’t make any sense.

The ‘net’ versions of HTTP are also weird—HTTP_PAGE_COMPLETE_LOAD_NET_V2 median is 323 ms, HTTP_SUB_COMPLETE_LOAD_NET_V2 is 305.  That’s not backwards, but it seems awfully close.  It doesn’t look right to me—fetching index.html then getting even 1 subresource only ought to be a lot slower than just the subresource load time.  Like around twice as long.  And of course much longer for the median page on the Internet these days.

Finally TOTAL_CONTENT_PAGE_LOAD_TIME has a median of 735, but HTTP_PAGE_COMPLETE_LOAD_V2 is only 323.  Valentin notes that TOTAL_CONTENT is fired in nsDocShell::EndPageLoad(), and we might be doing lots of additional stuff there before firing the telemetry. But a 2x difference seems kind of high.

My first impression is that the HTTP_PAGE metrics are bogus somehow.

While we're at this, we should add better descriptions for these probes so it's clear what they measure.  (Does HTTP_PAGE_COMPLETE_LOAD_CACHED--as opposed to _NET--get fired if the entire page is cached?  More than 50 cached? If the document URI was cached? etc).
Whiteboard: [necko-active]
Also--do top-level document loads get counted in HTTP_SUB_COMPLETE_LOAD?  If not, does that mean we only count them as part of the HTTP_PAGE metric?
So, if we are about to run experiments (e.g. like RCWN), how do we find out if changes in the telemetry data are caused by this patch or the experiment?
(In reply to Honza Bambas (:mayhemer) from comment #3)
> So, if we are about to run experiments (e.g. like RCWN), how do we find out
> if changes in the telemetry data are caused by this patch or the experiment?

We should land them at different times. If we delay this one, we can better observe the _changes_ from RCWN. On the other hand the telemetry we will get will not be as accurate in value.

But overall I don't think this patch will make that big of a difference in the telemetry stats, so it shouldn't be a problem if we delay it a bit.
(In reply to Jason Duell [:jduell] (needinfo me) from comment #0)
> While we're at this, we should add better descriptions for these probes so
> it's clear what they measure.  (Does HTTP_PAGE_COMPLETE_LOAD_CACHED--as
> opposed to _NET--get fired if the entire page is cached?  More than 50
> cached? If the document URI was cached? etc).

So, this is indeed weird at first, but I think it's just poor naming and documentation.
The difference between HTTP_PAGE_* and HTTP_SUB_* is just that the HTTP_PAGE_* probes are only for requests that trigger the load. So they report how long it took us to fetch index.html from the net or cache, and has nothing to do with subresources. It gets reported when the loadGroup is emptied, but only includes timing attributes that live on the mDefaultLoadRequest (main channel).
The HTTP_SUB_* probes are reported for every request when it gets removed from the loadGroup.

If we take this into account, the probes start to make sense.

> The ‘net’ versions of HTTP are also weird—HTTP_PAGE_COMPLETE_LOAD_NET_V2
> median is 323 ms, HTTP_SUB_COMPLETE_LOAD_NET_V2 is 305.  That’s not
> backwards, but it seems awfully close.  It doesn’t look right to me—fetching
> index.html then getting even 1 subresource only ought to be a lot slower
> than just the subresource load time.  Like around twice as long.  And of
> course much longer for the median page on the Internet these days.

This is actually pretty good. It means that the median for loads is about the same, but that for HTTP_PAGE_ is a bit bigger, which would make sense considering that is usually the channel that opens the connection.
 
> Finally TOTAL_CONTENT_PAGE_LOAD_TIME has a median of 735, but
> HTTP_PAGE_COMPLETE_LOAD_V2 is only 323.  Valentin notes that TOTAL_CONTENT
> is fired in nsDocShell::EndPageLoad(), and we might be doing lots of
> additional stuff there before firing the telemetry. But a 2x difference
> seems kind of high.

I think this is good. I basically means that the median time to load index.html is 323ms, but that the median time for the entire page load (including sub resources) is 735. 

> But the numbers look weird to me if this is true.  For instance,
> HTTP_PAGE_COMPLETE_LOAD_CACHED_V2 has a lower median time than
> HTTP_SUB_COMPLETE_LOAD_CACHED_V2, which doesn’t make any sense.

I think this number is also OK. If most .html pages in the cache are smaller than let's say image files, we should see a lower time for the HTTP_PAGE_* probes.

> My first impression is that the HTTP_PAGE metrics are bogus somehow.

That was my initial feeling as well, but it just the misunderstanding of what each telemetry probe meant.
I'm considering changing the descriptions from 
> HTTP page: Overall load time - all (ms)
> HTTP subitem: Overall load time - all (ms)
to 
> HTTP page channel: Overall load time - all (ms)
> HTTP subitem channel: Overall load time - all (ms)
Comment on attachment 8869365 [details]
Bug 1356097 - Improve correctness of HTTP_PAGE_* telemetry

https://reviewboard.mozilla.org/r/141010/#review144632
Attachment #8869365 - Flags: review?(mcmanus) → review+
Comment on attachment 8872453 [details]
Bug 1356097 - Update description for HTTP_SUB* and HTTP_PAGE* probes

https://reviewboard.mozilla.org/r/143954/#review148364
Attachment #8872453 - Flags: review?(mcmanus) → review+
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/952cbd41608c
Improve correctness of HTTP_PAGE_* telemetry r=mcmanus
https://hg.mozilla.org/integration/autoland/rev/4c960793ed25
Update description for HTTP_SUB* and HTTP_PAGE* probes r=mcmanus
https://hg.mozilla.org/mozilla-central/rev/952cbd41608c
https://hg.mozilla.org/mozilla-central/rev/4c960793ed25
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: