Closed
Bug 1313095
Opened 8 years ago
Closed 8 years ago
Add telemetry probes to report load time difference between network and disk cache
Categories
(Core :: Networking: Cache, defect)
Core
Networking: Cache
Tracking
()
RESOLVED
FIXED
mozilla52
Tracking | Status | |
---|---|---|
firefox52 | --- | fixed |
People
(Reporter: michal, Assigned: valentin)
References
(Blocks 1 open bug)
Details
(Whiteboard: [necko-active])
Attachments
(6 files, 2 obsolete files)
2.56 KB,
patch
|
michal
:
feedback+
|
Details | Diff | Splinter Review |
1.48 KB,
patch
|
michal
:
review+
|
Details | Diff | Splinter Review |
115.45 KB,
application/x-xz
|
Details | |
35.43 KB,
patch
|
michal
:
review+
benjamin
:
feedback+
|
Details | Diff | Splinter Review |
6.05 KB,
patch
|
mayhemer
:
review+
michal
:
feedback+
|
Details | Diff | Splinter Review |
16.12 KB,
application/x-xz
|
Details |
No description provided.
Reporter | ||
Updated•8 years ago
|
Assignee: michal.novotny → valentin.gosu
Assignee | ||
Comment 1•8 years ago
|
||
MozReview-Commit-ID: C6kMSPw8gxv
Attachment #8805926 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 2•8 years ago
|
||
MozReview-Commit-ID: BIIowV9tHSm
Attachment #8805996 -
Flags: review?(michal.novotny)
Reporter | ||
Comment 3•8 years ago
|
||
Comment on attachment 8805996 [details] [diff] [review] Add telemetry recording the disk size of cache entry Review of attachment 8805996 [details] [diff] [review]: ----------------------------------------------------------------- ::: netwerk/protocol/http/nsHttpChannel.cpp @@ +6593,5 @@ > Telemetry::Accumulate(Telemetry::HTTP_NET_VS_CACHE_ONSTART_TIME_LARGE, onStartDiff); > Telemetry::Accumulate(Telemetry::HTTP_NET_VS_CACHE_ONSTOP_TIME_LARGE, onStopDiff); > } > + > + Telemetry::Accumulate(Telemetry::HTTP_CACHE_ENTRY_DISK_SIZE, diskStorageSize / 1024); The size should be rounded up. It would be best to use FileSizeInK() in CacheFile::GetDiskStorageSize() instead of FileSize().
Attachment #8805996 -
Flags: feedback+
Reporter | ||
Comment 4•8 years ago
|
||
Comment on attachment 8805926 [details] [diff] [review] Add telemetry to measure if cache loads are faster than network loads Review of attachment 8805926 [details] [diff] [review]: ----------------------------------------------------------------- Nice patch! But size is just one of the parameters we want to use to break the data down. Here is the list we've come up with so far: type: image non-image queue: small mid large request priority: priority normal size: small mid large revalidated: yes no It would be best to have the data separated by all those parameters, e.g to have probes like: onstart_image_small_priority_small_reval onstop_non-image_normal_mid_non-reval But this would give us 144 combinations and this is too much. So the question is which of them we can report separately and which need to be grouped together. E.g. queue length shouldn't be reported without priority because it will be calculated differently for different priority. What about following segmentation? - type - queue + priority - priority + size - revalidation This would be only 32 probes if I calculate correctly.
Attachment #8805926 -
Flags: review?(michal.novotny) → feedback+
Reporter | ||
Updated•8 years ago
|
Attachment #8805996 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 5•8 years ago
|
||
32 new telemetry probes. low-med-high thresholds are more or less arbitrary.
Attachment #8806563 -
Flags: review?(michal.novotny)
Assignee | ||
Updated•8 years ago
|
Attachment #8805926 -
Attachment is obsolete: true
Assignee | ||
Comment 6•8 years ago
|
||
Forgot to qfold the patch with the reval telemetry
Attachment #8806716 -
Flags: review?(michal.novotny)
Reporter | ||
Updated•8 years ago
|
Attachment #8806716 -
Flags: review?(michal.novotny) → review+
Reporter | ||
Comment 7•8 years ago
|
||
Comment on attachment 8806563 [details] [diff] [review] Add telemetry to measure if cache loads are faster than network loads Review of attachment 8806563 [details] [diff] [review]: ----------------------------------------------------------------- (In reply to Valentin Gosu [:valentin] from comment #5) > low-med-high thresholds are more or less arbitrary. I'll do some local testing to find out some sane values. Unfortunately, the maximum queue length logged via existing telemetry probes won't help here. All the reported times will include revalidation time if we do the revalidation. Given that successful revalidation is only around 10% of cache hits and that we guess that we might not do revalidation anymore once we race the disk with the network, I think we should exclude revalidated responses from all probes except reval/notreval probes to have unbiased data. We could create a probe for an average revalidation time. Honza, Patrick, do you have an opinion on this? ::: netwerk/cache2/CacheFile.cpp @@ +1230,5 @@ > return mMetadata->GetFetchCount(_retval); > } > > nsresult > +CacheFile::GetDiskStorageSizeInKB(int64_t *aDiskStorageSize) FileSizeInK returns uint32_t ::: netwerk/cache2/CacheIOThread.cpp @@ +335,5 @@ > } > > +uint32_t CacheIOThread::QueueSize(bool highPriority) > +{ > + if (highPriority) { Access to mEventQueue should be IMO protected with CacheIOThread::mMonitor. ::: netwerk/cache2/CacheObserver.cpp @@ +482,5 @@ > return false; > } > > // static > +uint32_t CacheObserver::CacheQueueSize(bool highPriority) There should be only preference related functions in CacheObserver. It's IMO better to put this into CacheStorage. ::: netwerk/protocol/http/nsHttpChannel.cpp @@ +8236,5 @@ > + if (NS_FAILED(rv)) { > + return; > + } > + > + nsXPIDLCString netRespTime; You can reuse onStartTime variable, right? Maybe rename it to something neutral like tmpStr.
Attachment #8806563 -
Flags: review?(michal.novotny) → feedback+
Reporter | ||
Comment 8•8 years ago
|
||
See comment #7
Flags: needinfo?(mcmanus)
Flags: needinfo?(honzab.moz)
Comment 9•8 years ago
|
||
(In reply to Michal Novotny (:michal) from comment #7) > All the reported times will include revalidation time if we do the > revalidation. Given that successful revalidation is only around 10% of cache > hits and that we guess that we might not do revalidation anymore once we > race the disk with the network, I think we should exclude revalidated > responses from all probes except reval/notreval probes to have unbiased > data. We could create a probe for an average revalidation time. Honza, > Patrick, do you have an opinion on this? Yeah, you should measure what we will finally do, so no reval, yes. I was also looking at the "queue length" thing. I think having 2 and 6 as delimiters are too small numbers. 10 and 20 may be better, or maybe even more. when you see the queue length we already captured so far, READ backlog (and just that, not counting all levels before) can go to 60+ in ~9% cases and over 30 in ~19%. I have no data on how dynamic this can be. Definitely during a page load the queue may be "bursted" with requests that may be quickly served tho. anyway, there is probably a direct relation between the queue length and speed of the storage where cache is, so it's a good sync info we can use to decide.
Flags: needinfo?(honzab.moz)
Comment 10•8 years ago
|
||
Comment on attachment 8806563 [details] [diff] [review] Add telemetry to measure if cache loads are faster than network loads Review of attachment 8806563 [details] [diff] [review]: ----------------------------------------------------------------- ::: netwerk/cache2/CacheIOThread.cpp @@ +343,5 @@ > + return mEventQueue[OPEN_PRIORITY].Length() + > + mEventQueue[READ_PRIORITY].Length() + > + mEventQueue[MANAGEMENT].Length() + > + mEventQueue[OPEN].Length() + > + mEventQueue[READ].Length(); definitely must hold the monitor over mEventQueue, yes
Reporter | ||
Comment 11•8 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #10) > ::: netwerk/cache2/CacheIOThread.cpp > @@ +343,5 @@ > > + return mEventQueue[OPEN_PRIORITY].Length() + > > + mEventQueue[READ_PRIORITY].Length() + > > + mEventQueue[MANAGEMENT].Length() + > > + mEventQueue[OPEN].Length() + > > + mEventQueue[READ].Length(); This won't work. CacheIOThread::LoopOneLevel() moves all events into a local array, so the length for currently processing level will be incorrect. I did a quick hack to have a correct values for logging, but I doubt it is a good final solution. Have a look at https://treeherder.mozilla.org/#/jobs?repo=try&revision=e12f3ed35b2a.
Reporter | ||
Comment 12•8 years ago
|
||
(In reply to Michal Novotny (:michal) from comment #7) > I'll do some local testing to find out some sane values. Unfortunately, the > maximum queue length logged via existing telemetry probes won't help here. I got some data using a build that logs queue length in CacheFile::Init(). I started FF session with about 350 tabs and visited all tabs as quickly as possible to load them all from the cache. log_nspr_20161104-103126.txt-read1 - loading of the tabs log_nspr_20161104-114340.txt-read2 - another run when I also tried to make hdd busy outside FF read1, read2 - parsed logs, first column is a sum of priority levels, second is a sum of all levels The data is a bit surprising, because most of the time priority queue length is 0 and non-priority is 1. From time to time the queue gets longer for a fraction of a second. The second run is faster although one would expect the opposite. I guess it was because most of the data was already in the system cache and I forgot to drop it. I'll do some more testing using slow notebook. You can try it on your computer too using build (log cache2:3) https://archive.mozilla.org/pub/firefox/try-builds/mnovotny@mozilla.com-684034b69df465d58a902db1a179308839791017/try-linux64/
Assignee | ||
Comment 13•8 years ago
|
||
I set the queue size limits to 10-20 as per Honza's comment. In my own experiments I also rarely got the queue length to be higher than 2
Attachment #8808583 -
Flags: review?(michal.novotny)
Assignee | ||
Updated•8 years ago
|
Attachment #8806563 -
Attachment is obsolete: true
Assignee | ||
Comment 14•8 years ago
|
||
MozReview-Commit-ID: 2PbkAm78xz8
Attachment #8808584 -
Flags: review?(honzab.moz)
Reporter | ||
Updated•8 years ago
|
Flags: needinfo?(mcmanus)
Reporter | ||
Comment 15•8 years ago
|
||
This data is from my relatively slow notebook with spinning disk running Windows 7. I would probably choose another limits when looking at the data. High priority queues keep short even under heavy load. Limits 5 and 10 seem reasonable to me. Normal priority queues get longer, let's choose 10 and 50.
Reporter | ||
Comment 16•8 years ago
|
||
Comment on attachment 8808583 [details] [diff] [review] Add telemetry to measure if cache loads are faster than network loads Review of attachment 8808583 [details] [diff] [review]: ----------------------------------------------------------------- (In reply to Valentin Gosu [:valentin] from comment #13) > I set the queue size limits to 10-20 as per Honza's comment. In my own > experiments I also rarely got the queue length to be higher than 2 The patch looks good, thanks! See my comment #15 about the limits.
Attachment #8808583 -
Flags: review?(michal.novotny) → review+
Reporter | ||
Comment 17•8 years ago
|
||
Comment on attachment 8808584 [details] [diff] [review] Preserve cache queue length for logging and telemetry Review of attachment 8808584 [details] [diff] [review]: ----------------------------------------------------------------- ::: netwerk/cache2/CacheIOThread.cpp @@ +258,5 @@ > +CacheIOThread::LogQueuesLength() > +{ > + MonitorAutoLock lock(mMonitor); > + > + LOGINFO(("LogQueuesLength: %d, %d, %d, %d, %d", I don't think we want to land logging of queues length. Keep just the part that maintains mQueueLength array.
Attachment #8808584 -
Flags: feedback+
Comment 18•8 years ago
|
||
Comment on attachment 8808584 [details] [diff] [review] Preserve cache queue length for logging and telemetry Review of attachment 8808584 [details] [diff] [review]: ----------------------------------------------------------------- the counter logic seems good ::: netwerk/cache2/CacheFile.cpp @@ +267,5 @@ > mReady = true; > mDataSize = mMetadata->Offset(); > } else { > + RefPtr<CacheIOThread> ioThread = CacheFileIOManager::IOThread(); > + ioThread->LogQueuesLength(); ok, I don't understand why you are logging exactly here and why on the info level (higher level) why do you need the log anyway? ::: netwerk/cache2/CacheIOThread.h @@ +119,5 @@ > Atomic<nsIThread *> mXPCOMThread; > Atomic<uint32_t, Relaxed> mLowestLevelWaiting; > uint32_t mCurrentlyExecutingLevel; > > + Atomic<int32_t> mQueueLength[LAST_LEVEL]; add a comment what this is
Attachment #8808584 -
Flags: review?(honzab.moz) → review+
Reporter | ||
Comment 19•8 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #18) > ::: netwerk/cache2/CacheFile.cpp > @@ +267,5 @@ > > mReady = true; > > mDataSize = mMetadata->Offset(); > > } else { > > + RefPtr<CacheIOThread> ioThread = CacheFileIOManager::IOThread(); > > + ioThread->LogQueuesLength(); > > ok, I don't understand why you are logging exactly here and why on the info > level (higher level) > > why do you need the log anyway? As I wrote in comment #17, we don't want this for the telemetry. I used it to get some data how long the queues can get and I used info level to simply log only this information by using cache2:3.
Comment 20•8 years ago
|
||
(In reply to Michal Novotny (:michal) from comment #19) > As I wrote in comment #17, we don't want this for the telemetry. I used it > to get some data how long the queues can get and I used info level to simply > log only this information by using cache2:3. OK, good
Assignee | ||
Comment 21•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/d6418ae89d3d269bc1bc960886daa95b28dce49a Bug 1313095 - Preserve cache queue length for logging and telemetry r=honzab https://hg.mozilla.org/integration/mozilla-inbound/rev/5d98f64b411f8d7708143764bfdb63239d32e921 Bug 1313095 - Add telemetry to measure if cache loads are faster than network loads r=michal
Assignee | ||
Comment 22•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/69481dd6296f472bf6d24cef06eac6d4a8ab37a5 Bug 1313095 - Set different limits for cache queue size telemetry for priority and non-priority requests r=michal
Comment 23•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/d6418ae89d3d https://hg.mozilla.org/mozilla-central/rev/5d98f64b411f https://hg.mozilla.org/mozilla-central/rev/69481dd6296f
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment 24•8 years ago
|
||
Please not that all new data collection requires data review: https://wiki.mozilla.org/Firefox/Data_Collection
Flags: needinfo?(valentin.gosu)
Assignee | ||
Comment 25•8 years ago
|
||
Comment on attachment 8808583 [details] [diff] [review] Add telemetry to measure if cache loads are faster than network loads (In reply to Georg Fritzsche [:gfritzsche] from comment #24) > Please not that all new data collection requires data review: > https://wiki.mozilla.org/Firefox/Data_Collection Sorry about that. It's been a while since I added telemetry, and forgot about that rule. f? Benjamin for data review.
Flags: needinfo?(valentin.gosu)
Attachment #8808583 -
Flags: feedback?(benjamin)
Comment 26•7 years ago
|
||
A couple of general points here: * I can tell from the description that this is measuring comparison times. But I can't tell in what conditions these will be recorded. Do certain builds randomly load things both off the network and cache to compare this? Or is this comparison something that happens during normal necko operations? * Why is there an offset of 500ms? Are you simply trying to record something with a min of -500 and max 500? * In general we strongly discourage expires: never histograms. This bug doesn't describe what problem is being solved by adding these metrics. Are you attempting to tune cache parameters, or see how disk versus network speeds compare in the population of users, monitor for regressions in necko?
Flags: needinfo?(valentin.gosu)
Assignee | ||
Comment 27•7 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #26) > A couple of general points here: > * I can tell from the description that this is measuring comparison times. > But I can't tell in what conditions these will be recorded. Do certain > builds randomly load things both off the network and cache to compare this? > Or is this comparison something that happens during normal necko operations? All builds that have extended telemetry turned on will record the time difference between the network load time for a resource, and the subsequent loads from the cache of that resource. > * Why is there an offset of 500ms? Are you simply trying to record something > with a min of -500 and max 500? Yes. We are interested in how many of the cache loads are slower than those from the network, and the scale of the differences (ie. if they are only slightly faster, or a lot faster - by more than 500ms) > * In general we strongly discourage expires: never histograms. This bug > doesn't describe what problem is being solved by adding these metrics. Are > you attempting to tune cache parameters, or see how disk versus network > speeds compare in the population of users, monitor for regressions in necko? We are using this telemetry in order to decide if and when it's worth to race cache with network requests, and use the one that is faster. Michal when do you think we should expire these probes? I assume we will add other probes to verify the RCWN implementation.
Flags: needinfo?(valentin.gosu) → needinfo?(michal.novotny)
Comment 28•7 years ago
|
||
Comment on attachment 8808583 [details] [diff] [review] Add telemetry to measure if cache loads are faster than network loads I suggest expiring in 5 releases, so expires in 58, and revisiting whether these are useful and how you want to permanently monitor them after you have data to explore. data-review=me with that change
Attachment #8808583 -
Flags: feedback?(benjamin) → feedback+
Assignee | ||
Comment 29•7 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/c17cc72764e636b7354463508db84ef6bf01eade Bug 1313095 - Expire NET_VS_CACHE telemetry probes in Firefox 58 data-review=bsmedberg
Comment 30•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/c17cc72764e6
Reporter | ||
Updated•7 years ago
|
Flags: needinfo?(michal.novotny)
You need to log in
before you can comment on or make changes to this bug.
Description
•