Closed Bug 1313095 Opened 8 years ago Closed 8 years ago

Add telemetry probes to report load time difference between network and disk cache

Categories: Core :: Networking: Cache, defect
Status: RESOLVED FIXED
Target milestone: mozilla52
Tracking status: firefox52 --- fixed

People: (Reporter: michal, Assigned: valentin)
References: (Blocks 1 open bug)
Whiteboard: [necko-active]

Attachments: (6 files, 2 obsolete files)

      No description provided.
Whiteboard: [necko-active]
Assignee: michal.novotny → valentin.gosu
MozReview-Commit-ID: C6kMSPw8gxv
Attachment #8805926 - Flags: review?(michal.novotny)
MozReview-Commit-ID: BIIowV9tHSm
Attachment #8805996 - Flags: review?(michal.novotny)
Comment on attachment 8805996 [details] [diff] [review]
Add telemetry recording the disk size of cache entry

Review of attachment 8805996 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +6593,5 @@
>          Telemetry::Accumulate(Telemetry::HTTP_NET_VS_CACHE_ONSTART_TIME_LARGE, onStartDiff);
>          Telemetry::Accumulate(Telemetry::HTTP_NET_VS_CACHE_ONSTOP_TIME_LARGE, onStopDiff);
>      }
> +
> +    Telemetry::Accumulate(Telemetry::HTTP_CACHE_ENTRY_DISK_SIZE, diskStorageSize / 1024);

The size should be rounded up. It would be best to use FileSizeInK() in CacheFile::GetDiskStorageSize() instead of FileSize().
Attachment #8805996 - Flags: feedback+
Comment on attachment 8805926 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

Review of attachment 8805926 [details] [diff] [review]:
-----------------------------------------------------------------

Nice patch! But size is just one of the parameters we want to use to break the data down. Here is the list we've come up with so far:

type: image non-image
queue: small mid large
request priority: priority normal
size: small mid large
revalidated: yes no

It would be best to have the data separated by all those parameters, e.g. to have probes like:

onstart_image_small_priority_small_reval
onstop_non-image_normal_mid_non-reval

But this would give us 144 combinations, which is too much. So the question is which of them we can report separately and which need to be grouped together. E.g. queue length shouldn't be reported without priority, because it will be calculated differently for different priorities.

What about the following segmentation?

- type
- queue + priority
- priority + size
- revalidation

This would be only 32 probes if I calculate correctly.
Attachment #8805926 - Flags: review?(michal.novotny) → feedback+
Attachment #8805996 - Flags: review?(michal.novotny)
32 new telemetry probes. The low-med-high thresholds are more or less arbitrary.
Attachment #8806563 - Flags: review?(michal.novotny)
Attachment #8805926 - Attachment is obsolete: true
Forgot to qfold the patch with the reval telemetry
Attachment #8806716 - Flags: review?(michal.novotny)
Attachment #8806716 - Flags: review?(michal.novotny) → review+
Comment on attachment 8806563 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

Review of attachment 8806563 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Valentin Gosu [:valentin] from comment #5)
> low-med-high thresholds are more or less arbitrary.

I'll do some local testing to find out some sane values. Unfortunately, the maximum queue length logged via existing telemetry probes won't help here.

All the reported times will include the revalidation time if we do the revalidation. Given that successful revalidation happens for only around 10% of cache hits, and that we guess we might not do revalidation anymore once we race the disk with the network, I think we should exclude revalidated responses from all probes except the reval/notreval probes, to have unbiased data. We could create a probe for an average revalidation time. Honza, Patrick, do you have an opinion on this?

::: netwerk/cache2/CacheFile.cpp
@@ +1230,5 @@
>    return mMetadata->GetFetchCount(_retval);
>  }
>  
>  nsresult
> +CacheFile::GetDiskStorageSizeInKB(int64_t *aDiskStorageSize)

FileSizeInK returns uint32_t

::: netwerk/cache2/CacheIOThread.cpp
@@ +335,5 @@
>  }
>  
> +uint32_t CacheIOThread::QueueSize(bool highPriority)
> +{
> +  if (highPriority) {

Access to mEventQueue should be IMO protected with CacheIOThread::mMonitor.

::: netwerk/cache2/CacheObserver.cpp
@@ +482,5 @@
>    return false;
>  }
>  
>  // static
> +uint32_t CacheObserver::CacheQueueSize(bool highPriority)

There should be only preference related functions in CacheObserver. It's IMO better to put this into CacheStorage.

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +8236,5 @@
> +    if (NS_FAILED(rv)) {
> +        return;
> +    }
> +
> +    nsXPIDLCString netRespTime;

You can reuse the onStartTime variable, right? Maybe rename it to something neutral like tmpStr.
Attachment #8806563 - Flags: review?(michal.novotny) → feedback+
See comment #7
Flags: needinfo?(mcmanus)
Flags: needinfo?(honzab.moz)
(In reply to Michal Novotny (:michal) from comment #7)
> All the reported times will include revalidation time if we do the
> revalidation. Given that successful revalidation is only around 10% of cache
> hits and that we guess that we might not do revalidation anymore once we
> race the disk with the network, I think we should exclude revalidated
> responses from all probes except reval/notreval probes to have unbiased
> data. We could create a probe for an average revalidation time. Honza,
> Patrick, do you have an opinion on this?

Yeah, you should measure what we will finally do, so no reval, yes.

I was also looking at the "queue length" thing. I think 2 and 6 are too small as delimiters; 10 and 20 may be better, or maybe even more. Looking at the queue lengths we have already captured, the READ backlog (just that, not counting all levels before it) goes over 60 in ~9% of cases and over 30 in ~19%. I have no data on how dynamic this can be. Definitely during a page load the queue may be "bursted" with requests that may be quickly served, though. Anyway, there is probably a direct relation between the queue length and the speed of the storage the cache is on, so it's good sync info we can use to decide.
Flags: needinfo?(honzab.moz)
Comment on attachment 8806563 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

Review of attachment 8806563 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/cache2/CacheIOThread.cpp
@@ +343,5 @@
> +  return mEventQueue[OPEN_PRIORITY].Length() +
> +         mEventQueue[READ_PRIORITY].Length() +
> +         mEventQueue[MANAGEMENT].Length() +
> +         mEventQueue[OPEN].Length() +
> +         mEventQueue[READ].Length();

definitely must hold the monitor over mEventQueue, yes
(In reply to Honza Bambas (:mayhemer) from comment #10)
> ::: netwerk/cache2/CacheIOThread.cpp
> @@ +343,5 @@
> > +  return mEventQueue[OPEN_PRIORITY].Length() +
> > +         mEventQueue[READ_PRIORITY].Length() +
> > +         mEventQueue[MANAGEMENT].Length() +
> > +         mEventQueue[OPEN].Length() +
> > +         mEventQueue[READ].Length();

This won't work. CacheIOThread::LoopOneLevel() moves all events into a local array, so the length for the currently processing level will be incorrect. I did a quick hack to get correct values for logging, but I doubt it is a good final solution. Have a look at https://treeherder.mozilla.org/#/jobs?repo=try&revision=e12f3ed35b2a.
Attached file queue length
(In reply to Michal Novotny (:michal) from comment #7)
> I'll do some local testing to find out some sane values. Unfortunately, the
> maximum queue length logged via existing telemetry probes won't help here.

I got some data using a build that logs the queue length in CacheFile::Init(). I started a FF session with about 350 tabs and visited all of them as quickly as possible to load them all from the cache.

log_nspr_20161104-103126.txt-read1 - loading of the tabs
log_nspr_20161104-114340.txt-read2 - another run when I also tried to make hdd busy outside FF
read1, read2 - parsed logs, first column is a sum of priority levels, second is a sum of all levels

The data is a bit surprising, because most of the time the priority queue length is 0 and the non-priority length is 1. From time to time the queue gets longer for a fraction of a second. The second run is faster although one would expect the opposite; I guess it was because most of the data was already in the system cache and I forgot to drop it. I'll do some more testing on a slow notebook. You can try it on your computer too using this build (log cache2:3): https://archive.mozilla.org/pub/firefox/try-builds/mnovotny@mozilla.com-684034b69df465d58a902db1a179308839791017/try-linux64/
I set the queue size limits to 10-20 as per Honza's comment. In my own experiments I also rarely got the queue length higher than 2.
Attachment #8808583 - Flags: review?(michal.novotny)
Attachment #8806563 - Attachment is obsolete: true
MozReview-Commit-ID: 2PbkAm78xz8
Attachment #8808584 - Flags: review?(honzab.moz)
Flags: needinfo?(mcmanus)
This data is from my relatively slow notebook with a spinning disk running Windows 7. Looking at it, I would probably choose different limits. High priority queues stay short even under heavy load; limits of 5 and 10 seem reasonable to me. Normal priority queues get longer, so let's choose 10 and 50.
Comment on attachment 8808583 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

Review of attachment 8808583 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Valentin Gosu [:valentin] from comment #13)
> I set the queue size limits to 10-20 as per Honza's comment. In my own
> experiments I also rarely got the queue length to be higher than 2

The patch looks good, thanks! See my comment #15 about the limits.
Attachment #8808583 - Flags: review?(michal.novotny) → review+
Comment on attachment 8808584 [details] [diff] [review]
Preserve cache queue length for logging and telemetry

Review of attachment 8808584 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/cache2/CacheIOThread.cpp
@@ +258,5 @@
> +CacheIOThread::LogQueuesLength()
> +{
> +  MonitorAutoLock lock(mMonitor);
> +
> +  LOGINFO(("LogQueuesLength: %d, %d, %d, %d, %d",

I don't think we want to land logging of queues length. Keep just the part that maintains mQueueLength array.
Attachment #8808584 - Flags: feedback+
Comment on attachment 8808584 [details] [diff] [review]
Preserve cache queue length for logging and telemetry

Review of attachment 8808584 [details] [diff] [review]:
-----------------------------------------------------------------

the counter logic seems good

::: netwerk/cache2/CacheFile.cpp
@@ +267,5 @@
>        mReady = true;
>        mDataSize = mMetadata->Offset();
>      } else {
> +      RefPtr<CacheIOThread> ioThread = CacheFileIOManager::IOThread();
> +      ioThread->LogQueuesLength();

ok, I don't understand why you are logging exactly here and why on the info level (higher level)

why do you need the log anyway?

::: netwerk/cache2/CacheIOThread.h
@@ +119,5 @@
>    Atomic<nsIThread *> mXPCOMThread;
>    Atomic<uint32_t, Relaxed> mLowestLevelWaiting;
>    uint32_t mCurrentlyExecutingLevel;
>  
> +  Atomic<int32_t> mQueueLength[LAST_LEVEL];

add a comment what this is
Attachment #8808584 - Flags: review?(honzab.moz) → review+
(In reply to Honza Bambas (:mayhemer) from comment #18)
> ::: netwerk/cache2/CacheFile.cpp
> @@ +267,5 @@
> >        mReady = true;
> >        mDataSize = mMetadata->Offset();
> >      } else {
> > +      RefPtr<CacheIOThread> ioThread = CacheFileIOManager::IOThread();
> > +      ioThread->LogQueuesLength();
> 
> ok, I don't understand why you are logging exactly here and why on the info
> level (higher level)
> 
> why do you need the log anyway?

As I wrote in comment #17, we don't want this for the telemetry. I used it to get some data how long the queues can get and I used info level to simply log only this information by using cache2:3.
(In reply to Michal Novotny (:michal) from comment #19)
> As I wrote in comment #17, we don't want this for the telemetry. I used it
> to get some data how long the queues can get and I used info level to simply
> log only this information by using cache2:3.

OK, good
https://hg.mozilla.org/integration/mozilla-inbound/rev/69481dd6296f472bf6d24cef06eac6d4a8ab37a5
Bug 1313095 - Set different limits for cache queue size telemetry for priority and non-priority requests r=michal
Please note that all new data collection requires data review:
https://wiki.mozilla.org/Firefox/Data_Collection
Flags: needinfo?(valentin.gosu)
Comment on attachment 8808583 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

(In reply to Georg Fritzsche [:gfritzsche] from comment #24)
> Please not that all new data collection requires data review:
> https://wiki.mozilla.org/Firefox/Data_Collection

Sorry about that. It's been a while since I added telemetry, and I forgot about that rule.
f? Benjamin for data review.
Flags: needinfo?(valentin.gosu)
Attachment #8808583 - Flags: feedback?(benjamin)
A couple of general points here:
* I can tell from the description that this is measuring comparison times. But I can't tell in what conditions these will be recorded. Do certain builds randomly load things both off the network and cache to compare this? Or is this comparison something that happens during normal necko operations?
* Why is there an offset of 500ms? Are you simply trying to record something with a min of -500 and max 500?
* In general we strongly discourage expires: never histograms. This bug doesn't describe what problem is being solved by adding these metrics. Are you attempting to tune cache parameters, see how disk versus network speeds compare across the user population, or monitor for regressions in necko?
Flags: needinfo?(valentin.gosu)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #26)
> A couple of general points here:
> * I can tell from the description that this is measuring comparison times.
> But I can't tell in what conditions these will be recorded. Do certain
> builds randomly load things both off the network and cache to compare this?
> Or is this comparison something that happens during normal necko operations?

All builds that have extended telemetry turned on will record the time difference between the network load time for a resource, and the subsequent loads from the cache of that resource.

> * Why is there an offset of 500ms? Are you simply trying to record something
> with a min of -500 and max 500?

Yes. We are interested in how many of the cache loads are slower than those from the network, and in the scale of the differences (i.e. whether they are only slightly faster, or a lot faster, by more than 500 ms).

> * In general we strongly discourage expires: never histograms. This bug
> doesn't describe what problem is being solved by adding these metrics. Are
> you attempting to tune cache parameters, or see how disk versus network
> speeds compare in the population of users, monitor for regressions in necko?

We are using this telemetry to decide if and when it's worth racing the cache against network requests, using whichever is faster.

Michal, when do you think we should expire these probes? I assume we will add other probes to verify the RCWN implementation.
Flags: needinfo?(valentin.gosu) → needinfo?(michal.novotny)
Comment on attachment 8808583 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

I suggest expiring in 5 releases, so expires in 58, and revisiting whether these are useful and how you want to permanently monitor them after you have data to explore. data-review=me with that change
Attachment #8808583 - Flags: feedback?(benjamin) → feedback+
Flags: needinfo?(michal.novotny)