Closed Bug 658894 Opened 14 years ago Closed 14 years ago

Collect basic telemetry for HTTP requests and page load

Categories

(Core :: Networking, enhancement)


Tracking


RESOLVED FIXED
mozilla7

People

(Reporter: mayhemer, Assigned: mayhemer)

References

Details

(Whiteboard: [inbound])

Attachments

(1 file, 9 obsolete files)

This bug should use the infrastructure implemented in bug 585196 and the timing API in bug 576006 to measure basic timings for HTTP page loads. We have a list here (collected also under bug 650143): https://wiki.mozilla.org/Necko:Telemetry

I decided to collect basic telemetry like this:
- the whole page load time
- info from the channel timing API for the default request
- info from the channel timing API for subrequests
- subrequest count for a page
- subrequest first-byte latency, etc.

I currently have a patch that measures most or all of these. It would be nice to get this into Firefox 6, i.e. get this reviewed and landed by Monday midnight, so that we have numbers to compare against after major improvements like pipelining, preconnections, and DNS caching.
> I currently have a patch that measures most or all of these.
> It would be nice to get this to Firefox 6, i.e. get this reviewed and landed

Sure--where's the patch? :)
Attached patch v1 (obsolete) — Splinter Review
I collect timings only for channels that implement nsICacheInfoChannel, currently only HTTP channels. IMO that is the simplest way to filter out all the others we are not interested in.

I group requests using load groups. It is not the best approach, but it is the simplest. We do not add iframe load groups into the parent load group, so my grouping is not 100% precise.

To explain the numbers:
* HTTP::PageLoadTime = EndPageLoad - default request creation time (~= DoURILoad)
* HTTP::Subrequest.AsyncOpenLatencySincePageLoadStart = subrequest AsyncOpen - default request creation time
* HTTP::Subrequest.FirstByteLatencySincePageLoadStart = subrequest request start - default request creation time
* HTTP::RequestsPerPage = overall number of requests ever added to a load group
* HTTP::RequestsPerPageFromCache = percentage of a page's requests loaded from cache

Separately for the default request (main page) and subrequests, just those that are not clear at first sight:
* DomainLookupStartLatency = Domain Lookup Start - AsyncOpen
* RequestStartLatency = Request Start - AsyncOpen
* RTT = Response End - Request Start
* FirstByteSinceAsyncOpen = Response Start - AsyncOpen
* CacheReadStartLatency = Cache Read Start - AsyncOpen
* PositiveCacheValidation = Request End - Request Start (actually the RTT), but reported when we loaded from cache, i.e. the RTT here represents the "GET If-"/"304 Not Modified" time
* OverallTime = Cache Read End / Response End - AsyncOpen
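Each of the metrics above is a plain subtraction of two timestamps. As a rough, self-contained sketch of that arithmetic (using std::chrono in place of mozilla::TimeStamp; the struct and function names here are illustrative, not from the patch):

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;
using Ms    = std::chrono::milliseconds;

// Hypothetical stand-in for the timestamps nsITimedChannel exposes.
struct ChannelTiming {
    Clock::time_point channelCreation;  // default request created (~= DoURILoad)
    Clock::time_point asyncOpen;        // AsyncOpen() called
    Clock::time_point requestStart;     // first byte of the request sent
    Clock::time_point responseStart;    // first byte of the response received
    Clock::time_point responseEnd;      // last byte of the response received
};

// Every histogram sample is just the difference of two stamps, in ms.
long long DeltaMs(Clock::time_point from, Clock::time_point to) {
    return std::chrono::duration_cast<Ms>(to - from).count();
}

// "RequestStartLatency" = Request Start - AsyncOpen
long long RequestStartLatency(const ChannelTiming& t) {
    return DeltaMs(t.asyncOpen, t.requestStart);
}

// "FirstByteSinceAsyncOpen" = Response Start - AsyncOpen
long long FirstByteSinceAsyncOpen(const ChannelTiming& t) {
    return DeltaMs(t.asyncOpen, t.responseStart);
}
```

The other histograms follow the same pattern, only with different endpoint pairs.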
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Attachment #534444 - Flags: review?(jduell.mcbugs)
(In reply to comment #2)
> Created attachment 534444 [details] [diff] [review]
> * RTT = Response End - Request Start

One favor - can we call this something other than RTT, please? One very well known definition of RTT is the latency of moving the minimum amount of data from one host to the other and back again; here it is being used to describe latency + data-transfer time.
Depends on: 576006, 658876
Comment on attachment 534444 [details] [diff] [review]
v1

+r. I've got some renaming suggestions (all very bike-shed-y: anyone else feel free to chime in).

>diff --git a/docshell/base/nsDocShell.cpp b/docshell/base/nsDocShell.cpp
>
>@@ -6034,16 +6038,30 @@ nsDocShell::EndPageLoad(nsIWebProgress *
>+        UMA_HISTOGRAM_TIMES("HTTP::PageLoadTime (ms)",
>+                            base::TimeDelta::FromMilliseconds(interval));

Let's change "HTTP::PageLoadTime" to "HTTP: Total page load time".

>diff --git a/netwerk/base/src/nsLoadGroup.cpp b/netwerk/base/src/nsLoadGroup.cpp
>--- a/netwerk/base/src/nsLoadGroup.cpp
>+#include "base/histogram.h"
>+#include "base/logging.h"
>
>+#ifdef LOG
>+#undef LOG
>+#endif

The chromium LOG stuff can be a mess to work around--see bug 545995--but it seems that we don't have FORCE_PR_LOG on for nsLoadGroup, which makes things easier (unless it's a bug that we don't have FORCE_PR_LOG on for load groups).

>@@ -335,16 +349,18 @@ nsLoadGroup::Cancel(nsresult status)
>
>         // Remember the first failure and return it...
>         if (NS_FAILED(rv) && NS_SUCCEEDED(firstError))
>             firstError = rv;
>
>         NS_RELEASE(request);
>     }
>
>+    TelemetryReportSubrequests();

Do we have useful data to report when a loadGroup is canceled? Would it be better to not report anything?

>@@ -654,16 +689,54 @@ nsLoadGroup::RemoveRequest(nsIRequest *r
>+
>+    rv = timedChannel->GetAsyncOpen(&timeStamp);
>+    if (NS_SUCCEEDED(rv) && !timeStamp.IsNull()) {
>+        UMA_HISTOGRAM_MEDIUM_TIMES("HTTP::Subrequest.AsyncOpenLatencySincePageLoadStart (ms)",
>+                                   HISTOGRAM_TIME_DELTA(mDefaultRequestCreationTime, timeStamp));

Change to "HTTP subitem: Page start -> subitem open() (ms)".

>+    }
>+
>+    rv = timedChannel->GetResponseStart(&timeStamp);
>+    if (NS_SUCCEEDED(rv) && !timeStamp.IsNull()) {
>+        UMA_HISTOGRAM_MEDIUM_TIMES("HTTP::Subrequest.FirstByteLatencySincePageLoadStart (ms)",
>+                                   HISTOGRAM_TIME_DELTA(mDefaultRequestCreationTime, timeStamp));
>+    }

Change to "HTTP subitem: Page start -> first byte received for subitem reply".

>@@ -797,16 +870,170 @@ nsLoadGroup::AdjustPriority(PRInt32 aDel
>+void
>+nsLoadGroup::TelemetryReportSubrequests()

This function actually really reports on the Default channel! Rename to TelemetryReportDefaultLoad?

>+{
>+    if (mCachedContentLoad) {
>+        UMA_HISTOGRAM_COUNTS("HTTP::RequestsPerPage (count)",
>+                             mTotalCachingRequestsCount);

"HTTP: requests per page"

>+        if (mTotalCachingRequestsCount) {
>+            UMA_HISTOGRAM_ENUMERATION("HTTP::RequestsPerPageFromCache (%)",
>+                mTotalFromCacheRequestsCount * 100 / mTotalCachingRequestsCount,
>+                101);

"HTTP: Requests from cache (%)" (there's nothing "per page" about this stat, is there? And we're counting the page request itself, too, AFAICT)

>+    mozilla::TimeStamp asyncOpen;
>+    rv = aTimedChannel->GetAsyncOpen(&asyncOpen);
>+
>+    mozilla::TimeStamp startTime;
>+    if (NS_SUCCEEDED(rv) && !asyncOpen.IsNull())
>+        startTime = asyncOpen;
>+    else
>+        startTime = channelCreation;

Why are you worried about the asyncOpen time being null? Won't that always be set?

>+#define _UMA_HTTP_REQUEST_HISTOGRAMS_(prefix) \
>+    if (!domainLookupStart.IsNull()) { \
>+        UMA_HISTOGRAM_TIMES( \

I prefer to have long macros put the '\' in the 80th column for each line. Makes it easier to read if they're all in the same column, and if they're at 80, there's no need to reformat them if one line gets longer than the previous max.

>+            prefix "DomainLookupStartLatency (ms)", \
>+            HISTOGRAM_TIME_DELTA(startTime, domainLookupStart)); \

"open() -> DNS request issued (ms)"

>+        \
>+        UMA_HISTOGRAM_TIMES( \
>+            prefix "DomainLookup (ms)", \
>+            HISTOGRAM_TIME_DELTA(domainLookupStart, domainLookupEnd)); \

"DNS lookup time (ms)"

>+    \
>+    if (!connectStart.IsNull()) { \
>+        UMA_HISTOGRAM_TIMES( \
>+            prefix "Connect (ms)", \
>+            HISTOGRAM_TIME_DELTA(connectStart, connectEnd)); \

"TCP connection time (ms)"

>+    \
>+    if (!requestStart.IsNull()) { \
>+        UMA_HISTOGRAM_TIMES( \
>+            prefix "RequestStartLatency (ms)", \
>+            HISTOGRAM_TIME_DELTA(startTime, requestStart)); \

"Open -> first byte of request sent (ms)"

>+        UMA_HISTOGRAM_TIMES( \
>+            prefix "RTT (ms)", \
>+            HISTOGRAM_TIME_DELTA(requestStart, responseEnd)); \

"First byte of request sent -> last byte of response received"

>+        if (cacheReadStart.IsNull()) { \
>+            UMA_HISTOGRAM_TIMES( \
>+                prefix "FirstByteSinceAsyncOpen (ms)", \
>+                HISTOGRAM_TIME_DELTA(startTime, responseStart)); \

"Open -> first byte of reply received (ms)"

>+        } \
>+    \
>+    if (!cacheReadStart.IsNull()) { \
>+        UMA_HISTOGRAM_TIMES( \
>+            prefix "CacheReadStartLatency (ms)", \
>+            HISTOGRAM_TIME_DELTA(startTime, cacheReadStart)); \

"Open -> cache entry opened (ms)"

>+        UMA_HISTOGRAM_TIMES( \
>+            prefix "CacheRead (ms)", \
>+            HISTOGRAM_TIME_DELTA(cacheReadStart, cacheReadEnd)); \

"Cache read time (ms)"

>+        \
>+        if (!requestStart.IsNull()) { \
>+            UMA_HISTOGRAM_TIMES( \
>+                prefix "PositiveCacheValidation (ms)", \
>+                HISTOGRAM_TIME_DELTA(requestStart, responseEnd)); \

So requestStart is when we first send data out to the net (the request). How do we know from this that we're doing a cache validation request? Shouldn't that require a check for cacheReadEnd != null?

>+        } \
>+    UMA_HISTOGRAM_TIMES( \
>+        prefix "OverallTime (ms)", \
>+        HISTOGRAM_TIME_DELTA(startTime, (cacheReadEnd.IsNull() ? \
>+                                         responseEnd : cacheReadEnd)));

"Overall load time (ms)"

>+    if (aDefaultRequest) {
>+        _UMA_HTTP_REQUEST_HISTOGRAMS_("HTTP::DefaultRequest.")
>+    } else {
>+        _UMA_HTTP_REQUEST_HISTOGRAMS_("HTTP::Subrequest.")

How about "HTTP page: " and "HTTP item: " for the prefixes? I could also do "subitem". But short is good.

>diff --git a/netwerk/base/src/nsLoadGroup.h b/netwerk/base/src/nsLoadGroup.h
>+
>+    /* Telemetry */
>+    mozilla::TimeStamp mPageLoadStartTime;
>+    mozilla::TimeStamp mDefaultRequestCreationTime;
>+    bool mCachedContentLoad;
>+    PRUint32 mTotalCachingRequestsCount;
>+    PRUint32 mTotalFromCacheRequestsCount;

Rename and add comments:

// Number of cacheable (HTTP) requests
s/mTotalCachingRequestsCount/mCacheableRequests/

// Number of requests actually served from cache
s/mTotalFromCacheRequestsCount/mCachedRequests/

s/mCachedContentLoad/mDefaultLoadIsCacheable/ or mIsTimedChannel?
Attachment #534444 - Flags: review?(jduell.mcbugs) → review+
I'm tempted to put numbers at the start of the histogram descriptions, so we can present them in a more sensible order (e.g. show DNS before connect times, etc.). But that's possibly a bad idea, given that we'll wind up adding lots more timings, and keeping a strict numerical ordering would mean changing the names a lot. But let's at least make sure that 'page' stats show up before 'item' stats: rename to 'subitem' if needed...
Attached patch v1 -> v2 idiff (obsolete) — Splinter Review
- renames (though I modified a few of them myself a bit more)
- ignore requests that fail to load (for a severe reason, not because of a 404 or so)
- some comments unaddressed; discussed with Jason on IRC
- including only requests that implement nsITimedChannel (also discussed on IRC)
Attachment #534657 - Flags: review?(jduell.mcbugs)
Attached patch v2 (obsolete) — Splinter Review
Attachment #534444 - Attachment is obsolete: true
Attachment #534657 - Flags: review?(jduell.mcbugs) → review+
Made a few changes:

- In AddRequest, turn on timing even if !mDefaultLoadIsTimed, or we won't set timing for channels added before SetDefaultLoad is called.
- In RemoveChannel, the 'aStatus' arg is enough to determine whether the channel failed or not.
- Some line breaks and whitespace removal.
- Made 3 total-time reports: cache hits, from network, and all (combined).
Attachment #534657 - Attachment is obsolete: true
Attached patch v3 (ready for checkin) (obsolete) — Splinter Review
A couple more changes:

-  "HTTP: Requests per page from cache ratio (%)",
+  "HTTP: Requests serviced from cache (%)",

-nsLoadGroup::TelemetryReportRequestTimings
+nsLoadGroup::TelemetryReportChannel

Changed the macro to put backslashes in column 80.

Checked to make sure this doesn't break Fennec (where timings are not yet available). about:telemetry is not supported yet, so we're OK.
Attachment #534658 - Attachment is obsolete: true
Attachment #534674 - Flags: review+
Ran into merge conflicts with biesi's patch for JS web timings, and ran out of time to land for FF 6. It turns out telemetry isn't on yet anyway, so we've got time to polish the bike shed on this (which probably isn't a bad idea anyway).
I will add a few tweaks to the patch soon. I also discovered that we do not collect image requests. I already have a patch implementing nsITimedChannel for imgRequestProxy; I will open a new bug for it.
Target Milestone: --- → mozilla6
It'd be nice to land this, so that we can all pile in and add our favorite necko stats as followup patches. Honza, any ETA on your updated patch? Should we just land this and you can do a followup?
Comment on attachment 534674 [details] [diff] [review]
v3 (ready for checkin)

Review of attachment 534674 [details] [diff] [review]:
-----------------------------------------------------------------

I wanted to add/change one measurement, but later I decided not to. I just have a local patch to optimize the condition in RemoveRequest a bit (comments below).

This doesn't include image requests. I already have a patch for that. We should land it separately; it is a bit more complicated.

::: netwerk/base/src/nsLoadGroup.cpp
@@ +693,5 @@
> +    nsCOMPtr<nsITimedChannel> timedChannel =
> +        do_QueryInterface(request);
> +    if (timedChannel) {
> +        // Don't include failed requests in the timing statistics
> +        if (NS_SUCCEEDED(aStatus)) {

Probably can be merged into a single (timedChannel && NS_SUCCEEDED(aStatus)), or better, move the NS_SUCCEEDED(aStatus) check into the if (mDefaultLoadIsTimed) condition to prevent the QI when the request failed.

@@ +874,5 @@
> +    UMA_HISTOGRAM_COUNTS("HTTP: Requests per page (count)",
> +                         mCacheableRequests);
> +    if (mCacheableRequests) {
> +        UMA_HISTOGRAM_ENUMERATION(
> +            "HTTP: Requests serviced from cache (%)",

Jason, why exactly are you against adding a "per page" suffix here? :) This measures what percentage of requests _on a page_ were loaded from cache.

Your name makes me more confused about what the values are.

::: netwerk/base/src/nsLoadGroup.h
@@ +110,5 @@
> +    mozilla::TimeStamp mPageLoadStartTime;
> +    mozilla::TimeStamp mDefaultRequestCreationTime;
> +    bool mDefaultLoadIsTimed;
> +    PRUint32 mCacheableRequests;
> +    PRUint32 mCachedRequests;

Not sure this is any longer about "cacheable" requests. Rather "timed"?
> This doesn't include image requests. I have already a patch for that. We
> should land it separately, it is a bit more complicated.

OK. You've got my +r for the changes below, assuming they pass try.

> ::: netwerk/base/src/nsLoadGroup.cpp
> @@ +693,5 @@
> > +    nsCOMPtr<nsITimedChannel> timedChannel =
> > +        do_QueryInterface(request);
> > +    if (timedChannel) {
> > +        // Don't include failed requests in the timing statistics
> > +        if (NS_SUCCEEDED(aStatus)) {
>
> Probably can be merged to a single (timedChannel && NS_SUCCEEDED(aStatus))
> or better move the NS_SUCCEEDED(aStatus) check to the if
> (mDefaultLoadIsTimed) condition to prevent QI when failed.

Nice catch.

> @@ +874,5 @@
> > +    UMA_HISTOGRAM_COUNTS("HTTP: Requests per page (count)",
> > +                         mCacheableRequests);
> > +    if (mCacheableRequests) {
> > +        UMA_HISTOGRAM_ENUMERATION(
> > +            "HTTP: Requests serviced from cache (%)",
>
> Jason, why exactly are you against adding "per page" suffix here? :) This
> measures how many percent of requests has been _on a page_ loaded from cache.
>
> Your name makes me more confused about what the values are.

So, for example, a loadgroup has 12 requests, all HTTP (so mCacheableRequests=12), and say 8 of them are loaded from cache (mCachedRequests = 8). I was just figuring the most important statistic of interest is the % of cache hits (total, throughout the browser). I mean, that's the same as the "% of cache hits per page" in the aggregate: we'll have more, smaller samples, with larger variance, doing it this way than if we kept some global counts of requests/cache-hits somewhere and periodically reported them. Why do we care about it per-page?

> ::: netwerk/base/src/nsLoadGroup.h
> @@ +110,5 @@
> > +    mozilla::TimeStamp mPageLoadStartTime;
> > +    mozilla::TimeStamp mDefaultRequestCreationTime;
> > +    bool mDefaultLoadIsTimed;
> > +    PRUint32 mCacheableRequests;
> > +    PRUint32 mCachedRequests;
>
> Not sure this is any longer about "cacheable" requests. Rather "timed"?
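To make the per-page vs. global distinction concrete, here is a hedged sketch of the two aggregation styles under discussion (the struct and function names are illustrative, not the patch's code):

```cpp
#include <cassert>
#include <vector>

// One load group's counters, roughly as the patch keeps them.
struct PageStats {
    unsigned timedRequests;   // requests implementing nsITimedChannel
    unsigned cachedRequests;  // of those, served from cache
};

// Per-page style: one histogram sample per load group.
unsigned PercentFromCache(const PageStats& p) {
    return p.timedRequests ? p.cachedRequests * 100 / p.timedRequests : 0;
}

// Global style: one pair of browser-wide counters, reported periodically.
unsigned GlobalPercentFromCache(const std::vector<PageStats>& pages) {
    unsigned total = 0, cached = 0;
    for (const PageStats& p : pages) {
        total  += p.timedRequests;
        cached += p.cachedRequests;
    }
    return total ? cached * 100 / total : 0;
}
```

With the 12-request/8-cached example, the per-page sample is 66%, while a tiny second page with one uncached request would report 0% on its own; the global ratio over both pages is 61%. The smaller, noisier per-page samples are where the variance concern comes from.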
I support renaming mCacheableRequests to something like mTimedRequests. mCachedRequests is still about cached requests.
Blocks: 662436
RE: per-page - yes, you are right :) now I understand, and agree. I'll do the rest of the changes, I think I'll get to this today.
Depends on: 662534
Taras: We're trying to do a histogram for "% of network requests from cache" (i.e. % cache hits). Right now we have this instrumented so that we call UMA_HISTOGRAM_COUNTS once per LoadGroup (~= once per webpage), with the % of loads from that page that came from cache.

I'm thinking that's going to result in messy stats, where we get a lot of variance (and a lot of entries with 0%, for pages that had nothing from cache). It might make more sense to keep global counters of requests/cache_hits, and then at some point upload them to telemetry. The logical place for this is shutdown, but I'm wondering how late one can call UMA_HISTOGRAM_COUNTS and still have it work. (Alternatively, I guess we could call it every 1000 requests or something like that.) Thoughts?
I would expose a global counter to JS, then add a single field to the .simpleMeasurements part of the request: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#170 Using a histogram doesn't make sense to me for this.
Taras, where do the simpleMeasurements show up? Right now it grabs ("main", "firstPaint", "sessionRestored"), but I don't see any of those on http://people.mozilla.com/~tglek/telemetry/log.html

Honza: I assume we could just keep a pair of global variables (requests, reqsFromCache) in gHttpHandler, and increment one or both in OnStartReq for every (successful?) channel.
I haven't updated the log to include those values, but they will show up same as the other values. Optimistically telemetry will be live in a week, and we'll be able to provide consumers with live stats.
Passes try; includes the changes we agreed on in comment 15.
Attachment #534671 - Attachment is obsolete: true
Attachment #534674 - Attachment is obsolete: true
Attachment #538237 - Flags: review+
Attachment #538237 - Attachment description: v4 → v4 [Check in comment 22]
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #19)
> Honza: I assume we could just keep a pair of global variables (requests,
> reqsFromCache) in gHttpHandler, and increment one or both in OnStartReq for
> every (successful?) channel.

I believe both are interesting. I left the "per page" measure in; I think it is interesting to see how many of a page's requests get satisfied from cache. If we find out it is messy, we can remove it.

The overall cache/network ratio is also interesting, but I think it should be done a different way than this patch does it, and be part of the cache telemetry patch(es).
Even though try was green with this patch, bug 662555 may cause the leak test to assert and crash [1]. Backed out as http://hg.mozilla.org/mozilla-central/rev/63a285db615c due to:

###!!! ASSERTION: Cannot compute with a null value: '!IsNull()', file ../../dist/include/mozilla/TimeStamp.h, line 209

[1] http://tinderbox.mozilla.org/showlog.cgi?tree=Firefox&errorparser=unittest&logfile=1307621350.1307626248.3705.gz&buildtime=1307621350&buildname=OS%20X%2010.6.2%20mozilla-central%20leak%20test%20build&fulltext=1#err0
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #538237 - Attachment description: v4 [Check in comment 22] → v4 [Check in/back out comment 22/24]
Depends on: 662555
Attached patch v5 (obsolete) — Splinter Review
This is updated to the new telemetry code. It is a large enough change that I would rather get it re-reviewed by Taras.
Attachment #538237 - Attachment is obsolete: true
Attachment #542265 - Flags: review?(tglek)
Comment on attachment 542265 [details] [diff] [review]
v5

I cringe at the heavy use of the preprocessor here :) Also feel free to use "using namespace mozilla;" to shorten some of the code.
Attachment #542265 - Flags: review?(tglek) → review+
Status: REOPENED → ASSIGNED
Whiteboard: [inbound]
Backed out for apparently causing reftest assertion failures on Windows debug: http://hg.mozilla.org/integration/mozilla-inbound/rev/47bafd7a2168
Whiteboard: [inbound]
Blocks: 650129
I can see that we sometimes don't get the nsISocketTransport::STATUS_CONNECTED_TO notification, which is where connectEnd is stamped. I'll report a new bug on that. To fix this one, I will simply check for non-null on both of the times I subtract.
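The fix amounts to guarding every subtraction on both endpoints actually being stamped. A minimal sketch of that guard (modeling a possibly-null TimeStamp with std::optional; names are illustrative, not the actual patch code):

```cpp
#include <chrono>
#include <optional>

using Clock = std::chrono::steady_clock;
using Stamp = std::optional<Clock::time_point>;  // empty <=> never stamped

// Produce a histogram sample only when both endpoints exist; otherwise skip
// the sample instead of asserting -- e.g. when STATUS_CONNECTED_TO never
// fires and connectEnd is left unset.
std::optional<long long> SafeDeltaMs(const Stamp& from, const Stamp& to) {
    if (!from || !to)
        return std::nullopt;
    return std::chrono::duration_cast<std::chrono::milliseconds>(*to - *from)
               .count();
}
```

Callers then report only when SafeDeltaMs returns a value, which is exactly the "check for non-null on both times" behavior described above.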
Attached patch v5.1 (obsolete) — Splinter Review
Additional check for non-null added. Please just check that I didn't mess anything up. I didn't manage to push to try :(
Attachment #542265 - Attachment is obsolete: true
Attachment #543113 - Flags: review?(jduell.mcbugs)
Attached patch minor-fixups (obsolete) — Splinter Review
honza--looks good. Just a few minor bitrot and indenting fixups. One question: It doesn't seem like we need both calls to TelemetryReport(). We call it in Cancel, after we've removed all the channels, but in RemoveRequest we also check if we're the last request to be removed and call it there. Seems like that means we'll call it twice when cancelled, for no good reason (it's not a problem--it behaves correctly if called twice). Can we remove the call in Cancel()?
Attachment #543113 - Attachment is obsolete: true
Attachment #543113 - Flags: review?(jduell.mcbugs)
Attachment #543274 - Flags: review+
(In reply to comment #32) > Can we remove the call in Cancel()? Yes we can! Good catch.
Whiteboard: [inbound]
Status: ASSIGNED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
Target Milestone: mozilla6 → mozilla7