Closed Bug 767275 Opened 12 years ago Closed 12 years ago

Add more detailed cache lock telemetry

Categories

(Core :: Networking: Cache, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla16
Tracking Status
firefox14 - ---
firefox15 - ---
firefox16 - ---

People

(Reporter: briansmith, Assigned: briansmith)

References

Details

Attachments

(4 files)

This telemetry is very helpful for locating where cache-related jank tends to occur. For example, using this telemetry I was able to determine that the most common cause of cache-related jank is nsHttpChannel::OnStopRequest's call to GetCacheAsFile (see the follow-up bug I will file after this).

Since this is a telemetry-only change, I would like to uplift this to Aurora and Beta right after we land it in Nightly, so we can start getting numbers sooner.
This patch records the telemetry for the cache lock on a per-lock-taker basis. This isn't *exactly* what we'd like; we'd really like to know the caller of the lock taker (e.g. the method in nsHttpChannel), not the actual lock taker (which is usually a method in nsCacheEntryDescriptor), but we can't pass that information down because we would have to go through an XPCOM interface that we don't want to change. In my experience, though, with this change it is pretty easy to deduce who the caller is (e.g. which method in nsHttpChannel).
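In rough terms, each main-thread lock-taking call site passes its own Telemetry ID down to the lock function, which times the lock acquisition and accumulates into that call site's histogram. A minimal sketch of the shape of the change (the exact function signature and histogram names here are assumptions for illustration; the excerpts quoted in the review below show the real code):

// Sketch only -- signature and names are illustrative, not copied from the patch.
void
nsCacheService::Lock(mozilla::Telemetry::ID mainThreadLockerID)
{
    mozilla::Telemetry::ID lockerID;
    if (NS_IsMainThread()) {
        // e.g. CACHE_SERVICE_LOCK_WAIT_MAINTHREAD_NSOUTPUTSTREAMWRAPPER_CLOSE
        lockerID = mainThreadLockerID;
    } else {
        lockerID = static_cast<mozilla::Telemetry::ID>(0);  // sentinel: don't record
    }

    mozilla::TimeStamp start = mozilla::TimeStamp::Now();
    gService->mLock.Lock();
    mozilla::TimeStamp stop = mozilla::TimeStamp::Now();

    // Telemetry isn't thread safe on its own, but this is OK because we're
    // protecting it with the cache lock.
    if (lockerID) {
        mozilla::Telemetry::AccumulateTimeDelta(lockerID, start, stop);
    }
}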
Attachment #635643 - Flags: review?(nfroyd)
Attachment #635643 - Flags: review?(michal.novotny)
Comment on attachment 635643 [details] [diff] [review]
Add more detailed cache lock telemetry

Review of attachment 635643 [details] [diff] [review]:
-----------------------------------------------------------------

I have a couple of concerns about this patch:

- This patch adds a lot of new histograms; the number of histograms we're tracking goes up by ~60%.  This increase has a runtime memory cost, a bandwidth cost (for sending the histograms, even gzip'd), and a storage cost, which we've learned is non-trivial on the metrics side.  I assume the intent is to pinpoint where problems are so that individual parts of the cache system can be finely targeted?  I appreciate the approach, but are all of these really needed (e.g. NSOUTPUTSTREAMWRAPPER_CLOSE lock telemetry)?

- The histograms are very finely bucketed (10000 buckets with 1 bucket/ms).  I am doubtful that this fine of a resolution is needed, and it imposes a significant memory cost (10000 buckets/histogram * (60 CACHE_LOCK_HISTOGRAMS macros * 3 histograms/macro) * 4 bytes/bucket ~= 7M runtime memory assuming each histogram gets one measurement; bucket memory is allocated lazily per-histogram, but all buckets for a given histogram are allocated at once).  I haven't run the numbers, but my guess is that this is an integer-factor, maybe even order-of-magnitude, increase in the amount of memory telemetry requires.  Could 10ms, 20ms resolution be sufficient?  (On Windows, I'm not even sure we can get 10ms resolution consistently, let alone 1ms resolution, on our telemetry timers, so many of the buckets would be unused anyway.)
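For reference, spelling out that back-of-the-envelope arithmetic (numbers taken from the comment above; a sketch, not code from the patch):

// Back-of-the-envelope memory cost using the numbers above (illustration only).
constexpr size_t kBucketsPerHistogram = 10000;  // 1 bucket/ms over 10 s
constexpr size_t kMacros = 60;                  // CACHE_LOCK_HISTOGRAMS macros
constexpr size_t kHistogramsPerMacro = 3;       // MAINTHREAD/CACHETHREAD/OTHERTHREAD
constexpr size_t kBytesPerBucket = 4;

constexpr size_t kTotalBytes =
    kBucketsPerHistogram * kMacros * kHistogramsPerMacro * kBytesPerBucket;
static_assert(kTotalBytes == 7200000, "~7 MB once every histogram has a sample");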
Attachment #635643 - Flags: review?(nfroyd) → review-
(In reply to Nathan Froyd (:froydnj) from comment #2)
> - This patch adds a lot of new histograms; the number of histograms we're
> tracking goes up by ~60%.  This increase has a runtime memory cost, a
> bandwidth cost (for sending the histograms, even gzip'd), and a storage
> cost, which we've learned is non-trivial on the metrics side.  I assume the
> intent is to pinpoint where problems are so that individual parts of the
> cache system can be finely targeted?  I appreciate the approach, but are all
> of these really needed (e.g. NSOUTPUTSTREAMWRAPPER_CLOSE lock telemetry)?

MAINTHREAD_NSOUTPUTSTREAMWRAPPER_CLOSE is definitely needed.

I think the CACHETHREAD_* and OTHERTHREAD_* telemetry is not really useful at this point in time, so I suggest we cut those, which reduces the number of probes by 2/3.

> - The histograms are very finely bucketed (10000 buckets with 1 bucket/ms). 
> I am doubtful that this fine of a resolution is needed, and it imposes a
> significant memory cost (10000 buckets/histogram * (60 CACHE_LOCK_HISTOGRAMS
> macros * 3 histograms/macro) * 4 bytes/bucket ~= 7M runtime memory assuming
> each histogram gets one measurement; bucket memory is allocated lazily
> per-histogram, but all buckets for a given histogram are allocated at once).

Thanks Nathan, this is very useful to know and I agree with you. I copied this bucketing from the other cache lock telemetry. I think an exponential histogram with ~25 buckets is a better idea than what I've done here.
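For concreteness, the sort of definition I have in mind would look something like this (a sketch only; the macro form, histogram name, range, and description are all assumptions, not the actual patch):

// Hypothetical histogram definition: ~25 exponential buckets covering 1 ms .. 10 s.
// The HISTOGRAM(id, min, max, bucket_count, type, description) form, the name,
// and the description are assumptions for illustration.
HISTOGRAM(CACHE_SERVICE_LOCK_WAIT_MAINTHREAD_NSOUTPUTSTREAMWRAPPER_CLOSE,
          1, 10000, 25, EXPONENTIAL,
          "Main-thread wait (ms) for the cache service lock in nsOutputStreamWrapper::Close")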

If I make these two changes, do you think that would reduce the overhead enough to be acceptable?

Should we file a bug to fix the bucketing of the existing cache lock telemetry, which has the same bucketing problem?
(In reply to Brian Smith (:bsmith) from comment #3)
> I think the CACHETHREAD_* and OTHERTHREAD_* telemetry is not really useful
> at this point in time, so I suggest we cut those, which reduces the number
> of probes by 2/3.
> ...
> Thanks Nathan, this is very useful to know and I agree with you. I copied
> this bucketing from the other cache lock telemetry. I think an exponential
> histogram with ~25 buckets is a better idea than what I've done here.
> 
> If I make these two changes, do you think that would reduce the overhead
> enough to be acceptable?

I think that would be sufficient, yes.  I would suggest double-checking that the bucket ranges for the exponential histogram with your chosen parameters are appropriate for your needs.  Hmmm, maybe we should have a simple command-line tool for that...
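Something like the following would do for eyeballing the ranges (a rough sketch; it uses plain geometric spacing, which may not round exactly the way the telemetry histogram code does):

// Rough command-line helper for inspecting exponential bucket boundaries.
// Uses plain geometric spacing; the real histogram code may round differently.
#include <cmath>
#include <cstdio>

int main()
{
    const double min = 1.0, max = 10000.0;  // example: 1 ms .. 10 s
    const int bucketCount = 25;             // example: ~25 buckets

    std::printf("bucket  0 starts at 0 (underflow)\n");
    for (int i = 1; i < bucketCount; ++i) {
        double t = double(i - 1) / double(bucketCount - 2);
        long boundary = std::lround(min * std::pow(max / min, t));
        std::printf("bucket %2d starts at %ld\n", i, boundary);
    }
    return 0;
}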

> Should we file a bug to fix the bucketing of the existing cache lock
> telemetry, which has the same bucketing problem?

That would be good; please do that.
No need to track for release unless this is a regression - please nominate for aurora/beta approval once ready.
I'd personally add the following probe:

- when the main thread waits for the lock, remember the time it started to wait (TimeStamp)
- when some other thread holding the lock is about to release it, add the function in which it was holding the lock to an array (have an enum of all places) + the timestamp
- when the main thread has acquired the lock and the wait time was more than some threshold, check the array and record all places that were marked after the main thread's wait-start time

- have e.g. 6 graphs
- two triples for e.g. 3 thresholds of your choice (25ms, 100ms, 400ms ; the current probe results may be an inspiration)
- each graph will be an enumeration of the places
- the first triple will record the places we are waiting in on the main thread
- the second triple will record the places held by other threads (the cause of the wait)

Then we can see for how long the main thread waits for the lock and what may be the cause (see the sketch below).
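A minimal sketch of what I mean, assuming a small ring buffer guarded by the cache lock itself; the enum values, buffer size, threshold, and function names are all illustrative, and the Telemetry::Accumulate calls are left as comments because those histograms don't exist:

// Illustrative sketch of the proposal above; every name and constant here is
// hypothetical, not part of any patch in this bug.
#include <stdint.h>
#include "mozilla/TimeStamp.h"

enum CacheLockPlace {
    PLACE_DOOM_ENTRY,
    PLACE_OPEN_OUTPUT_STREAM,
    PLACE_CLOSE_DESCRIPTOR,
    // ...
    PLACE_COUNT
};

struct HolderRecord {
    CacheLockPlace place;
    mozilla::TimeStamp released;
};
static HolderRecord sRecentHolders[64];  // ring buffer, guarded by the cache lock
static uint32_t sNextHolder = 0;

// Called by a non-main thread right before it releases the lock.
static void NoteLockRelease(CacheLockPlace place)
{
    sRecentHolders[sNextHolder++ % 64] = { place, mozilla::TimeStamp::Now() };
}

// Called on the main thread right after it acquires the lock.
static void NoteMainThreadWait(mozilla::TimeStamp waitStart, CacheLockPlace waitingIn)
{
    mozilla::TimeDuration waited = mozilla::TimeStamp::Now() - waitStart;
    if (waited.ToMilliseconds() < 25.0) {  // e.g. the 25 ms threshold
        return;
    }
    // Record where the main thread was waiting, e.g.:
    // mozilla::Telemetry::Accumulate(<enum histogram for waiters over 25 ms>, waitingIn);
    for (const HolderRecord& rec : sRecentHolders) {
        if (!rec.released.IsNull() && rec.released >= waitStart) {
            // ...and every place that held the lock during that wait, e.g.:
            // mozilla::Telemetry::Accumulate(<enum histogram for holders over 25 ms>, rec.place);
        }
    }
}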
Attachment #635643 - Flags: review?(michal.novotny) → review+
This is basically the same patch, but the general OTHERTHREAD probe and all of the detailed CACHETHREAD and OTHERTHREAD probes were removed. Also, the bucketing was changed to something more reasonable and the histograms were made exponential.
Attachment #637719 - Flags: review?(nfroyd)
I think this still adds too many probes, and Nathan won't like it (and neither do I).  I proposed comment 6 so it would be done in *this* bug, not in a followup...
Comment on attachment 637719 [details] [diff] [review]
Add more detailed cache lock telemetry [v2]

Review of attachment 637719 [details] [diff] [review]:
-----------------------------------------------------------------

Works for me, with or without the changes suggested by the comments below.

::: netwerk/cache/nsCacheService.cpp
@@ +2495,2 @@
>      } else {
> +        lockerID = static_cast<mozilla::Telemetry::ID>(0);

This cast made me do a double-take.  Why not use mozilla::Telemetry::HistogramCount as the sentinel value?

@@ +2505,5 @@
> +
> +    // Telemetry isn't thread safe on its own, but this is OK because we're
> +    // protecting it with the cache lock. 
> +    if (lockerID) {
> +        mozilla::Telemetry::AccumulateTimeDelta(lockerID, start, stop);

If you used HistogramCount for the sentinel value, this call could become unconditional, though that would be somewhat slower in the non-main-thread case than simply routing around it.
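Concretely, the suggestion amounts to something like this (a sketch of the reviewer's alternative, not the code that landed; mainThreadLockerID is a placeholder for whatever ID the call site passes in):

// Sketch of the suggested alternative. HistogramCount is the one-past-the-end
// value of the Telemetry::ID enum, so it never names a real histogram.
mozilla::Telemetry::ID lockerID = mozilla::Telemetry::HistogramCount;  // sentinel
if (NS_IsMainThread()) {
    lockerID = mainThreadLockerID;
}

// ... take the lock, recording start/stop timestamps ...

if (lockerID != mozilla::Telemetry::HistogramCount) {
    mozilla::Telemetry::AccumulateTimeDelta(lockerID, start, stop);
}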
Attachment #637719 - Flags: review?(nfroyd) → review+
(In reply to Honza Bambas (:mayhemer) from comment #6)
> - when the main thread waits for the lock, remember the time it started to
> wait (TimeStamp)
> - when some other thread holding the lock is about to release it, add the
> function it is keeping the lock to an array (have an enum of all places) +
> the timestamp
> - when the main thread acquired the lock, and the wait time was more then
> some threshold, check the array and record all places that was marked after
> the main thread wait-start time

We did something similar in bug 718801 (stalled because other things have become more important) for sqlite locks.  You might also check out "Analyzing lock contention in multithreaded applications" by Tallent, Mellor-Crummey, and Porterfield: http://dl.acm.org/citation.cfm?id=1693489

Brian's patch added a lot of probes, but it did so in a way that added minimal run-time overhead to collect the data (since we're already collecting cache lock times for telemetry), and the memory overhead is pretty low for such detailed data.  I'm looking forward to seeing what these probes tell us.
Thanks for the comment, Nathan.  If you believe the probes as they are now are OK, then let's get this in.

I'm just curious whether they will really tell us something important.  I'd rather see my proposal (which could later be refined by patches from bug 718801), which may tell us more about the true cause and where to "cut".  It doesn't need huge changes; it's just a few new probes, easily doable.  This is my opinion.  I'm not against landing the current patches in this bug, though.
(In reply to Honza Bambas (:mayhemer) from comment #11)
> I'm just curios whether they tell us really something important.

It depends on whether we're going to keep playing whack-a-mole or not. For changes like bug 767277, this telemetry is useful for determining where the lock contention moves to when we remove the contention from some spot X.

> I'd rather
> see my proposal (that could be later altered by patches from bug 718801)
> that may simply tell us more about the true cause and where to "cut".

I agree that that would be extremely useful, especially because I think we need to quit playing the whack-a-mole game and stop holding the cache service lock over I/O operations. I just don't have time to make that change right now.
Target Milestone: mozilla17 → mozilla16
[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 722034 and bug 746018.

User impact if declined: Without this data, it will be difficult to fix the performance regressions caused by Bug 722034 and/or Bug 746018.

Testing completed (on m-c, etc.): Landed on inbound today.

Risk to taking this patch (and alternatives if risky): Almost zero.
String or UUID changes made by this patch: None
Attachment #638555 - Flags: review+
Attachment #638555 - Flags: approval-mozilla-aurora?
[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 722034 and bug 746018.

User impact if declined: Without this data, it will be difficult to fix the performance regressions caused by Bug 722034 and/or Bug 746018 on the Aurora branch. (Those patches didn't land on -beta but it would be extremely useful to compare the differences in the numbers between -aurora and -beta.)

Testing completed (on m-c, etc.): Landed on inbound today.

Risk to taking this patch (and alternatives if risky): Almost zero.
String or UUID changes made by this patch: None
Attachment #638558 - Flags: review+
Attachment #638558 - Flags: approval-mozilla-beta?
https://hg.mozilla.org/mozilla-central/rev/ecb0393610d5
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment on attachment 638555 [details] [diff] [review]
Add more detailed cache lock telemetry [v2] for Aurora

[Triage Comment]
Approving for Aurora 15, but I don't think we should take this much code change in our second-to-last beta (regardless of the low risk I'm seeing in the patch). Apologies for not getting this into the build for comparison, but I'd like to be cautious.
Attachment #638555 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #638558 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
I will back out the changes from bug 722034 and bug 746018 so that we can compare -central to -aurora.
Is there a reason why a few dozen time histograms are preferable to a single ENUM histogram with a bucket per callsite?
This seems like it would be frustrating to track via the web UI.

Also, I'm not sure there is value in having these histograms hit release.
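For comparison, the single-histogram alternative would look roughly like this (the histogram ID and callsite enum are hypothetical names, not anything in the tree):

// Hypothetical alternative: one enumerated histogram with a bucket per callsite,
// instead of one time histogram per callsite. CACHE_LOCK_WAIT_CALLSITE and
// CacheLockCallsite are illustrative names only.
enum CacheLockCallsite {
    CALLSITE_NSOUTPUTSTREAMWRAPPER_CLOSE,
    CALLSITE_GETCACHEASFILE,
    // ...
    CALLSITE_COUNT
};

static void RecordContendedCallsite(CacheLockCallsite callsite)
{
    // Tells us *which* callsite waited, but drops the per-callsite wait-time
    // distribution that the per-callsite time histograms provide.
    mozilla::Telemetry::Accumulate(mozilla::Telemetry::CACHE_LOCK_WAIT_CALLSITE,
                                   static_cast<uint32_t>(callsite));
}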