Closed Bug 1313095 Opened 8 years ago Closed 8 years ago

Add telemetry probes to report load time difference between network and disk cache

Categories: Core :: Networking: Cache, defect
Status: RESOLVED FIXED
Target milestone: mozilla52
Tracking status: firefox52 --- fixed

People: (Reporter: michal, Assigned: valentin)
References: (Blocks 1 open bug)
Whiteboard: [necko-active]

Attachments: (6 files, 2 obsolete files)

      No description provided.
Whiteboard: [necko-active]
Assignee: michal.novotny → valentin.gosu
MozReview-Commit-ID: C6kMSPw8gxv
Attachment #8805926 - Flags: review?(michal.novotny)
MozReview-Commit-ID: BIIowV9tHSm
Attachment #8805996 - Flags: review?(michal.novotny)
Comment on attachment 8805996 [details] [diff] [review]
Add telemetry recording the disk size of cache entry

Review of attachment 8805996 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +6593,5 @@
>          Telemetry::Accumulate(Telemetry::HTTP_NET_VS_CACHE_ONSTART_TIME_LARGE, onStartDiff);
>          Telemetry::Accumulate(Telemetry::HTTP_NET_VS_CACHE_ONSTOP_TIME_LARGE, onStopDiff);
>      }
> +
> +    Telemetry::Accumulate(Telemetry::HTTP_CACHE_ENTRY_DISK_SIZE, diskStorageSize / 1024);

The size should be rounded up. It would be best to use FileSizeInK() in CacheFile::GetDiskStorageSize() instead of FileSize().
Attachment #8805996 - Flags: feedback+
Comment on attachment 8805926 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

Review of attachment 8805926 [details] [diff] [review]:
-----------------------------------------------------------------

Nice patch! But size is just one of the parameters we want to use to break the data down. Here is the list we've come up with so far:

type: image non-image
queue: small mid large
request priority: priority normal
size: small mid large
revalidated: yes no

It would be best to have the data separated by all those parameters, e.g. to have probes like:

onstart_image_small_priority_small_reval
onstop_non-image_normal_mid_non-reval

But this would give us 144 combinations, which is too much. So the question is which of them we can report separately and which need to be grouped together. E.g. queue length shouldn't be reported without priority, because it will be calculated differently for different priorities.

What about the following segmentation?

- type
- queue + priority
- priority + size
- revalidation

This would be only 32 probes if I calculate correctly.
Attachment #8805926 - Flags: review?(michal.novotny) → feedback+
Attachment #8805996 - Flags: review?(michal.novotny)
32 new telemetry probes. The low-med-high thresholds are more or less arbitrary.
Attachment #8806563 - Flags: review?(michal.novotny)
Attachment #8805926 - Attachment is obsolete: true
Forgot to qfold the patch with the reval telemetry
Attachment #8806716 - Flags: review?(michal.novotny)
Attachment #8806716 - Flags: review?(michal.novotny) → review+
Comment on attachment 8806563 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

Review of attachment 8806563 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Valentin Gosu [:valentin] from comment #5)
> low-med-high thresholds are more or less arbitrary.

I'll do some local testing to find out some sane values. Unfortunately, the maximum queue length logged via existing telemetry probes won't help here.

All the reported times will include the revalidation time if we do the revalidation. Given that successful revalidation happens for only around 10% of cache hits, and that we guess we might not do revalidation anymore once we race the disk with the network, I think we should exclude revalidated responses from all probes except the reval/notreval probes, to have unbiased data. We could create a probe for an average revalidation time. Honza, Patrick, do you have an opinion on this?

::: netwerk/cache2/CacheFile.cpp
@@ +1230,5 @@
>    return mMetadata->GetFetchCount(_retval);
>  }
>  
>  nsresult
> +CacheFile::GetDiskStorageSizeInKB(int64_t *aDiskStorageSize)

FileSizeInK returns uint32_t

::: netwerk/cache2/CacheIOThread.cpp
@@ +335,5 @@
>  }
>  
> +uint32_t CacheIOThread::QueueSize(bool highPriority)
> +{
> +  if (highPriority) {

Access to mEventQueue should be IMO protected with CacheIOThread::mMonitor.

::: netwerk/cache2/CacheObserver.cpp
@@ +482,5 @@
>    return false;
>  }
>  
>  // static
> +uint32_t CacheObserver::CacheQueueSize(bool highPriority)

There should be only preference related functions in CacheObserver. It's IMO better to put this into CacheStorage.

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +8236,5 @@
> +    if (NS_FAILED(rv)) {
> +        return;
> +    }
> +
> +    nsXPIDLCString netRespTime;

You can reuse the onStartTime variable, right? Maybe rename it to something neutral like tmpStr.
Attachment #8806563 - Flags: review?(michal.novotny) → feedback+
See comment #7
Flags: needinfo?(mcmanus)
Flags: needinfo?(honzab.moz)
(In reply to Michal Novotny (:michal) from comment #7)
> All the reported times will include revalidation time if we do the
> revalidation. Given that successful revalidation is only around 10% of cache
> hits and that we guess that we might not do revalidation anymore once we
> race the disk with the network, I think we should exclude revalidated
> responses from all probes except reval/notreval probes to have unbiased
> data. We could create a probe for an average revalidation time. Honza,
> Patrick, do you have an opinion on this?

Yeah, you should measure what we will finally do, so no reval, yes.

I was also looking at the "queue length" thing. I think 2 and 6 are too small as delimiters; 10 and 20 may be better, or maybe even more. Looking at the queue lengths we have already captured, the READ backlog (just that, not counting all levels before it) goes over 60 in ~9% of cases and over 30 in ~19%. I have no data on how dynamic this can be. Definitely during a page load the queue may be "bursted" with requests that may be quickly served, though. Anyway, there is probably a direct relation between the queue length and the speed of the storage the cache is on, so it's good sync info we can use to decide.
Flags: needinfo?(honzab.moz)
Comment on attachment 8806563 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

Review of attachment 8806563 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/cache2/CacheIOThread.cpp
@@ +343,5 @@
> +  return mEventQueue[OPEN_PRIORITY].Length() +
> +         mEventQueue[READ_PRIORITY].Length() +
> +         mEventQueue[MANAGEMENT].Length() +
> +         mEventQueue[OPEN].Length() +
> +         mEventQueue[READ].Length();

definitely must hold the monitor over mEventQueue, yes
(In reply to Honza Bambas (:mayhemer) from comment #10)
> ::: netwerk/cache2/CacheIOThread.cpp
> @@ +343,5 @@
> > +  return mEventQueue[OPEN_PRIORITY].Length() +
> > +         mEventQueue[READ_PRIORITY].Length() +
> > +         mEventQueue[MANAGEMENT].Length() +
> > +         mEventQueue[OPEN].Length() +
> > +         mEventQueue[READ].Length();

This won't work. CacheIOThread::LoopOneLevel() moves all events into a local array, so the length for the currently processing level will be incorrect. I did a quick hack to get correct values for logging, but I doubt it is a good final solution. Have a look at https://treeherder.mozilla.org/#/jobs?repo=try&revision=e12f3ed35b2a.
Attached file queue length
(In reply to Michal Novotny (:michal) from comment #7)
> I'll do some local testing to find out some sane values. Unfortunately, the
> maximum queue length logged via existing telemetry probes won't help here.

I got some data using a build that logs the queue length in CacheFile::Init(). I started a FF session with about 350 tabs and visited all of them as quickly as possible to load them all from the cache.

log_nspr_20161104-103126.txt-read1 - loading of the tabs
log_nspr_20161104-114340.txt-read2 - another run when I also tried to make hdd busy outside FF
read1, read2 - parsed logs, first column is a sum of priority levels, second is a sum of all levels

The data is a bit surprising, because most of the time the priority queue length is 0 and the non-priority length is 1. From time to time the queue gets longer for a fraction of a second. The second run is faster although one would expect the opposite; I guess it was because most of the data was already in the system cache and I forgot to drop it. I'll do some more testing on a slow notebook. You can try it on your computer too using this build (log cache2:3): https://archive.mozilla.org/pub/firefox/try-builds/mnovotny@mozilla.com-684034b69df465d58a902db1a179308839791017/try-linux64/
I set the queue size limits to 10-20 as per Honza's comment. In my own experiments I also rarely got the queue length higher than 2.
Attachment #8808583 - Flags: review?(michal.novotny)
Attachment #8806563 - Attachment is obsolete: true
MozReview-Commit-ID: 2PbkAm78xz8
Attachment #8808584 - Flags: review?(honzab.moz)
Flags: needinfo?(mcmanus)
This data is from my relatively slow notebook with a spinning disk running Windows 7. Looking at it, I would probably choose different limits. High priority queues stay short even under heavy load; limits of 5 and 10 seem reasonable to me. Normal priority queues get longer, so let's choose 10 and 50.
Comment on attachment 8808583 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

Review of attachment 8808583 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Valentin Gosu [:valentin] from comment #13)
> I set the queue size limits to 10-20 as per Honza's comment. In my own
> experiments I also rarely got the queue length to be higher than 2

The patch looks good, thanks! See my comment #15 about the limits.
Attachment #8808583 - Flags: review?(michal.novotny) → review+
Comment on attachment 8808584 [details] [diff] [review]
Preserve cache queue length for logging and telemetry

Review of attachment 8808584 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/cache2/CacheIOThread.cpp
@@ +258,5 @@
> +CacheIOThread::LogQueuesLength()
> +{
> +  MonitorAutoLock lock(mMonitor);
> +
> +  LOGINFO(("LogQueuesLength: %d, %d, %d, %d, %d",

I don't think we want to land logging of queues length. Keep just the part that maintains mQueueLength array.
Attachment #8808584 - Flags: feedback+
Comment on attachment 8808584 [details] [diff] [review]
Preserve cache queue length for logging and telemetry

Review of attachment 8808584 [details] [diff] [review]:
-----------------------------------------------------------------

the counter logic seems good

::: netwerk/cache2/CacheFile.cpp
@@ +267,5 @@
>        mReady = true;
>        mDataSize = mMetadata->Offset();
>      } else {
> +      RefPtr<CacheIOThread> ioThread = CacheFileIOManager::IOThread();
> +      ioThread->LogQueuesLength();

ok, I don't understand why you are logging exactly here and why on the info level (higher level)

why do you need the log anyway?

::: netwerk/cache2/CacheIOThread.h
@@ +119,5 @@
>    Atomic<nsIThread *> mXPCOMThread;
>    Atomic<uint32_t, Relaxed> mLowestLevelWaiting;
>    uint32_t mCurrentlyExecutingLevel;
>  
> +  Atomic<int32_t> mQueueLength[LAST_LEVEL];

add a comment what this is
Attachment #8808584 - Flags: review?(honzab.moz) → review+
(In reply to Honza Bambas (:mayhemer) from comment #18)
> ::: netwerk/cache2/CacheFile.cpp
> @@ +267,5 @@
> >        mReady = true;
> >        mDataSize = mMetadata->Offset();
> >      } else {
> > +      RefPtr<CacheIOThread> ioThread = CacheFileIOManager::IOThread();
> > +      ioThread->LogQueuesLength();
> 
> ok, I don't understand why you are logging exactly here and why on the info
> level (higher level)
> 
> why do you need the log anyway?

As I wrote in comment #17, we don't want this for the telemetry. I used it to get some data how long the queues can get and I used info level to simply log only this information by using cache2:3.
(In reply to Michal Novotny (:michal) from comment #19)
> As I wrote in comment #17, we don't want this for the telemetry. I used it
> to get some data how long the queues can get and I used info level to simply
> log only this information by using cache2:3.

OK, good
https://hg.mozilla.org/integration/mozilla-inbound/rev/69481dd6296f472bf6d24cef06eac6d4a8ab37a5
Bug 1313095 - Set different limits for cache queue size telemetry for priority and non-priority requests r=michal
Please note that all new data collection requires data review:
https://wiki.mozilla.org/Firefox/Data_Collection
Flags: needinfo?(valentin.gosu)
Comment on attachment 8808583 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

(In reply to Georg Fritzsche [:gfritzsche] from comment #24)
> Please not that all new data collection requires data review:
> https://wiki.mozilla.org/Firefox/Data_Collection

Sorry about that. It's been a while since I added telemetry, and I forgot about that rule.
f? Benjamin for data review.
Flags: needinfo?(valentin.gosu)
Attachment #8808583 - Flags: feedback?(benjamin)
A couple of general points here:
* I can tell from the description that this is measuring comparison times. But I can't tell in what conditions these will be recorded. Do certain builds randomly load things both off the network and cache to compare this? Or is this comparison something that happens during normal necko operations?
* Why is there an offset of 500ms? Are you simply trying to record something with a min of -500 and max 500?
* In general we strongly discourage expires: never histograms. This bug doesn't describe what problem is being solved by adding these metrics. Are you attempting to tune cache parameters, see how disk versus network speeds compare across the user population, or monitor for regressions in necko?
Flags: needinfo?(valentin.gosu)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #26)
> A couple of general points here:
> * I can tell from the description that this is measuring comparison times.
> But I can't tell in what conditions these will be recorded. Do certain
> builds randomly load things both off the network and cache to compare this?
> Or is this comparison something that happens during normal necko operations?

All builds that have extended telemetry turned on will record the time difference between the network load time for a resource, and the subsequent loads from the cache of that resource.

> * Why is there an offset of 500ms? Are you simply trying to record something
> with a min of -500 and max 500?

Yes. We are interested in how many of the cache loads are slower than those from the network, and in the scale of the differences (i.e. whether they are only slightly faster, or a lot faster, by more than 500 ms).

> * In general we strongly discourage expires: never histograms. This bug
> doesn't describe what problem is being solved by adding these metrics. Are
> you attempting to tune cache parameters, or see how disk versus network
> speeds compare in the population of users, monitor for regressions in necko?

We are using this telemetry to decide if and when it's worth racing the cache against network requests, using whichever is faster.

Michal, when do you think we should expire these probes? I assume we will add other probes to verify the RCWN implementation.
Flags: needinfo?(valentin.gosu) → needinfo?(michal.novotny)
Comment on attachment 8808583 [details] [diff] [review]
Add telemetry to measure if cache loads are faster than network loads

I suggest expiring in 5 releases, so expires in 58, and revisiting whether these are useful and how you want to permanently monitor them after you have data to explore. data-review=me with that change
Attachment #8808583 - Flags: feedback?(benjamin) → feedback+
Flags: needinfo?(michal.novotny)