Closed Bug 1261063 Opened 8 years ago Closed 8 years ago

Remove direct uses of base::histogram instances

Categories

(Toolkit :: Telemetry, defect, P1)

defect
Points:
3

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox48 --- affected
firefox49 --- fixed

People

(Reporter: gfritzsche, Assigned: gfritzsche)

References

Details

(Whiteboard: [measurement:client])

Attachments

(5 files, 2 obsolete files)

base::histogram and other instances from histogram.h/.cc is only supposed to used internally in Telemetry code.

However, currently there are three files where base::histogram::Add()/Substract() is used directly.
We should:
* make those use the Accumulate() functions like everything else
* remove Telemetry::GetHistogramById() et al from the public interface to avoid those scenarios in the future
Summary: Remove uses base::histogram instances directly → Remove direct uses of base::histogram instances
For converting RasterImage.cpp to use the Telemetry::Accumulate() functions instead of GetHistogramById()->Add()/Substract() we need to understand what this is actually supposed to be doing:
https://dxr.mozilla.org/mozilla-central/search?q=path%3ARasterImage.cpp+GetHistogram&redirect=false&case=true

I see that IMAGE_DECODE_COUNT & IMAGE_MAX_DECODE_COUNT use Add() and Substract() to just keep a count.

IMAGE_MAX_DECODE_COUNT is clear and just ends up being the maximum of the instance decode counts.

IMAGE_DECODE_COUNT seems to be rather randomly set to the count of whatever RasterImage instance submitted this last.
Seth, do you know what question that histogram is actually trying to answer?
Are both of those histograms actually being used and monitored?
Flags: needinfo?(seth)
The other histogram that GetHistogramById()->Add()/Substract() is used instead of Telemetry::Accumulate() is WEBRTC_CALL_COUNT in webrtc:
https://dxr.mozilla.org/mozilla-central/search?q=path%3APeerConnection+GetHistogramById&redirect=false&case=true

This seems to be a really simple count.
Jesup, are you actively using/monitoring WEBRTC_CALL_COUNT?
If yes, can we just migrate this to a plain "count" histogram and simple always add 1 to it in PeerConnectionImpl::startCallTelem()?
Flags: needinfo?(rjesup)
Assignee: nobody → gfritzsche
Status: NEW → ASSIGNED
Priority: P3 → P2
(In reply to Georg Fritzsche [:gfritzsche] from comment #2)
> IMAGE_MAX_DECODE_COUNT is clear and just ends up being the maximum of the
> instance decode counts.

Yep.

> IMAGE_DECODE_COUNT seems to be rather randomly set to the count of whatever
> RasterImage instance submitted this last.

If so, that means it's using the API wrong and we should fix it. I confess that that code predates me and I'm not familiar with the histogram API myself.

> Seth, do you know what question that histogram is actually trying to answer?

It's simple enough, we'd like to have a histogram of how often an image is decoded. Since we can now discard images (which wasn't true a long time ago), which can force us to redecode them if we paint them again, it's nice to know how often this is happening and if we ever make a change that causes a major shift in the histogram.

> Are both of those histograms actually being used and monitored?

I don't think anyone is looking at them frequently; to my knowledge, we mostly look at them when doing changes that have the potential to drastically change these values.

It'd be nice to keep them.
Flags: needinfo?(seth)
(In reply to Georg Fritzsche [:gfritzsche] from comment #3)
> The other histogram that GetHistogramById()->Add()/Substract() is used
> instead of Telemetry::Accumulate() is WEBRTC_CALL_COUNT in webrtc:
> https://dxr.mozilla.org/mozilla-central/
> search?q=path%3APeerConnection+GetHistogramById&redirect=false&case=true
> 
> This seems to be a really simple count.
> Jesup, are you actively using/monitoring WEBRTC_CALL_COUNT?

Somewhat. It's not a critical stat to us.

> If yes, can we just migrate this to a plain "count" histogram and simple
> always add 1 to it in PeerConnectionImpl::startCallTelem()?

We're trying to accumulate the number of calls per browser execution, not just count the number of calls made.
I don't think what your suggesting will get us that.  And we can't do it on shutdown, since we can't count on clean shutdown.
Flags: needinfo?(rjesup)
Ok, i think i misunderstood how both these histograms are used.
They are not updating one count (for calls / decodes) per browser session, but rather keeping 1 count updated for each of N instances per browser session.

A simple fix here would be that we simply expose Telemetry::Subtract(), then we can just keep both use-cases working without changing the semantics.
Seth, for RasterImage.cpp, could you just record IMAGE_DECODE_COUNT once per RasterImage instance when it is destroyed or similar?
Or does that usually happen too late in shutdown?
Flags: needinfo?(seth)
(In reply to Randell Jesup [:jesup] from comment #6)
> > If yes, can we just migrate this to a plain "count" histogram and simple
> > always add 1 to it in PeerConnectionImpl::startCallTelem()?
> 
> We're trying to accumulate the number of calls per browser execution, not
> just count the number of calls made.
> I don't think what your suggesting will get us that.  And we can't do it on
> shutdown, since we can't count on clean shutdown.

Are you worried about shutdown crashes?
Or do you actually drop the PeerConnectionImpl instances so you can never cleanly record this on shutdown?
Flags: needinfo?(rjesup)
> > We're trying to accumulate the number of calls per browser execution, not
> > just count the number of calls made.
> > I don't think what your suggesting will get us that.  And we can't do it on
> > shutdown, since we can't count on clean shutdown.
> 
> Are you worried about shutdown crashes?
> Or do you actually drop the PeerConnectionImpl instances so you can never
> cleanly record this on shutdown?

Anything that doesn't lead to a clean shutdown will be lost, and those that are lost are the people who generally run-until-crash (or os restart for patching, etc).  This means the long tail will get chopped predominately.  For example, my non-development browsers would fail to do clean shutdowns more often than they do clean shutdowns.

However... a) this isn't critical to us, and b) if we cared, we could compare to a separate total-number-of-calls telemetry value to calculate the number of "lost calls".  It doesn't replace the date, but gives us a bound on the impact.

We could probably shift it to a "record once before shutdown" value.
Flags: needinfo?(rjesup)
(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> Seth, for RasterImage.cpp, could you just record IMAGE_DECODE_COUNT once per
> RasterImage instance when it is destroyed or similar?
> Or does that usually happen too late in shutdown?

I think for our use case it'd be fine to record IMAGE_DECODE_COUNT when a RasterImage is destroyed, and just bail if it's being destroyed during shutdown. It shouldn't affect the quality of the results significantly.
Flags: needinfo?(seth)
Attachment #8736782 - Attachment is obsolete: true
(In reply to Randell Jesup [:jesup] from comment #10)
> > > We're trying to accumulate the number of calls per browser execution, not
> > > just count the number of calls made.
> > > I don't think what your suggesting will get us that.  And we can't do it on
> > > shutdown, since we can't count on clean shutdown.
> > 
> > Are you worried about shutdown crashes?
> > Or do you actually drop the PeerConnectionImpl instances so you can never
> > cleanly record this on shutdown?
> 
> Anything that doesn't lead to a clean shutdown will be lost, and those that
> are lost are the people who generally run-until-crash (or os restart for
> patching, etc).  This means the long tail will get chopped predominately. 
> For example, my non-development browsers would fail to do clean shutdowns
> more often than they do clean shutdowns.
> 
> However... a) this isn't critical to us, and b) if we cared, we could
> compare to a separate total-number-of-calls telemetry value to calculate the
> number of "lost calls".  It doesn't replace the date, but gives us a bound
> on the impact.
> 
> We could probably shift it to a "record once before shutdown" value.

Great, can you point to where we could record this before shutdown?
Flags: needinfo?(rjesup)
PeerConnection.js watches for shutdown notifications and deletes/cleans up peerconnections. That might be a reasonable place to hang it off.
Flags: needinfo?(rjesup)
Priority: P2 → P1
Points: --- → 3
Attachment #8737740 - Attachment is obsolete: true
Comment on attachment 8738562 [details] [diff] [review]
Part 1 - Remove public Telemetry functions that return raw histogram instances

This removes the public functions that allow to access the base::histogram.
We consider those internal, so we remove the access here.

Chris, can you take a look here?
Attachment #8738562 - Flags: review?(chutten)
Comment on attachment 8738563 [details] [diff] [review]
Part 2 - Expose C++ function to clear Telemetry histograms

For the semantics of IMAGE_MAX_DECODE_COUNT in RasterImage, we need to overwrite the value.
The simple approach here is to expose clearing the histogram from C++, so we can remove the Substract()/Add() juggle.
We already expose clearing histograms to JavaScript, so that doesn't change things here really.

Note: Longer-term we should address those use-cases better, e.g. by adding a kind of "max" or "single value" histogram type, but that seems out of scope for this bug.

Chris, can you take a look?
Attachment #8738563 - Flags: review?(chutten)
Comment on attachment 8738565 [details] [diff] [review]
Part 3 - Make RasterImage use the public Telemetry API

Seth, this should cover things per the previous discussion here.
In rough testing this looks fine, but i didn't manage to get a decode count > 1 - do you have a tip on triggering this?
Attachment #8738565 - Flags: review?(seth)
Comment on attachment 8738562 [details] [diff] [review]
Part 1 - Remove public Telemetry functions that return raw histogram instances

Review of attachment 8738562 [details] [diff] [review]:
-----------------------------------------------------------------

So... GetKeyedHistogramById(ID, const nsAString&) wasn't actually implemented or called? It'd be nice if that were mentioned in the commit message, but eh.
Attachment #8738562 - Flags: review?(chutten) → review+
Comment on attachment 8738563 [details] [diff] [review]
Part 2 - Expose C++ function to clear Telemetry histograms

Review of attachment 8738563 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8738563 - Flags: review?(chutten) → review+
This version should keep the previous behavior as far as i can tell, as there is only one global instance of mConnectionCounter. It loses the one entry in the "0" bucket, but that doesnt seem like a problem for doing call counts.
Given that this apparently only tracks one global call count, we could just move that to a new count histogram. Is that a feasible option jesup?
Attachment #8739443 - Flags: review?(rjesup)
Review pings.
Flags: needinfo?(seth)
Flags: needinfo?(rjesup)
Comment on attachment 8739443 [details] [diff] [review]
Part 4b (Alternative) - Use a count histogram for WebRTC call counts

Review of attachment 8739443 [details] [diff] [review]:
-----------------------------------------------------------------

I can't evaluate the impact of this without a better understanding of what this change means.  Can you point to more docs on the telemetry types?  The include files don't really define the API well from what I saw (may have missed something)

::: toolkit/components/telemetry/Histograms.json
@@ +6554,5 @@
> +  "WEBRTC_CALL_COUNT_2": {
> +    "alert_emails": ["?"],
> +    "bug_numbers": [1261063],
> +    "expires_in_version": "never",
> +    "kind": "count",

I don't know what a "count" histogram does differently than an "exponential"; I had trouble finding docs on this.
Attachment #8739443 - Flags: review?(rjesup)
Comment on attachment 8739412 [details] [diff] [review]
Part 4a - Make WebRTC PeerConnection use the public Telemetry API

Review of attachment 8739412 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.cpp
@@ -356,5 @@
>    initGMP();
>  
>  #if !defined(MOZILLA_EXTERNAL_LINKAGE)
>    mConnectionCounter = 0;
> -  Telemetry::GetHistogramById(Telemetry::WEBRTC_CALL_COUNT)->Add(0);

removing this means we lose the ability to measure what % of sessions made a call at all.  (Or have to measure it indirectly as total submissions against number of submissions with a count.)  That said, it still would be useful.  Or we could add a second item that lets us know that.  Hopefully this sort of evaluation would be possible without having to jump through hoops, even without the Add(0)
Flags: needinfo?(rjesup)
(In reply to Randell Jesup [:jesup] from comment #26)
> I can't evaluate the impact of this without a better understanding of what
> this change means.  Can you point to more docs on the telemetry types?  The
> include files don't really define the API well from what I saw (may have
> missed something)
> 
> ::: toolkit/components/telemetry/Histograms.json
> @@ +6554,5 @@
> > +  "WEBRTC_CALL_COUNT_2": {
> > +    "alert_emails": ["?"],
> > +    "bug_numbers": [1261063],
> > +    "expires_in_version": "never",
> > +    "kind": "count",
> 
> I don't know what a "count" histogram does differently than an
> "exponential"; I had trouble finding docs on this.

The main documentation is here:
https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe#Choosing_a_Histogram_Type
"count" histograms are different in that they only track one value.

> removing this means we lose the ability to measure what % of sessions made a
> call at all.  (Or have to measure it indirectly as total submissions against
> number of submissions with a count.)  That said, it still would be useful. 
> Or we could add a second item that lets us know that.  Hopefully this sort
> of evaluation would be possible without having to jump through hoops, even
> without the Add(0)

Right - inferring it is a bit inconvenient at the moment.
We could just add a specific probe for that - probably a separate "flag" histogram set from |PeerConnectionCtx::Initialize()| that you can compare with the "count" histogram of actual call counts.

Does that sound ok?
Flags: needinfo?(rjesup)
Seth, Randell, review ping, please.  Refactoring Telemetry is a prereq
for de-racing Telemetry, which is indirectly blocking on this.
Comment on attachment 8738565 [details] [diff] [review]
Part 3 - Make RasterImage use the public Telemetry API

This has been sitting for a while - Timothy, do you have time to review this?
Attachment #8738565 - Flags: review?(tnikkel)
(In reply to Georg Fritzsche [:gfritzsche] from comment #28)
> (In reply to Randell Jesup [:jesup] from comment #26)
> > I can't evaluate the impact of this without a better understanding of what
> > this change means.  Can you point to more docs on the telemetry types?  The
> > include files don't really define the API well from what I saw (may have
> > missed something)
> > 
> > ::: toolkit/components/telemetry/Histograms.json
> > @@ +6554,5 @@
> > > +  "WEBRTC_CALL_COUNT_2": {
> > > +    "alert_emails": ["?"],
> > > +    "bug_numbers": [1261063],
> > > +    "expires_in_version": "never",
> > > +    "kind": "count",
> > 
> > I don't know what a "count" histogram does differently than an
> > "exponential"; I had trouble finding docs on this.
> 
> The main documentation is here:
> https://developer.mozilla.org/en-US/docs/Mozilla/Performance/
> Adding_a_new_Telemetry_probe#Choosing_a_Histogram_Type
> "count" histograms are different in that they only track one value.

And they have a default value that's submitted as well.  That will tell us how many sessions made calls - but unless we can view it with 0 removed, it will be very hard to see the distribution of call numbers > 0.

That said, the data would be there, and this isn't a critical item (per above).

Please drop a link into the .h or .json file to the docs, or describe these types in the .h (or json) file.

> > removing this means we lose the ability to measure what % of sessions made a
> > call at all.  (Or have to measure it indirectly as total submissions against
> > number of submissions with a count.)  That said, it still would be useful. 
> > Or we could add a second item that lets us know that.  Hopefully this sort
> > of evaluation would be possible without having to jump through hoops, even
> > without the Add(0)
> 
> Right - inferring it is a bit inconvenient at the moment.
> We could just add a specific probe for that - probably a separate "flag"
> histogram set from |PeerConnectionCtx::Initialize()| that you can compare
> with the "count" histogram of actual call counts.

Now I'm confused; the docs you link to say 'count' histograms have a default (0), so we would get the % that made calls (and in fact it might be annoying to see numbers other than that, but we'll see).
Flags: needinfo?(rjesup)
(In reply to Randell Jesup [:jesup] from comment #31)

> > Right - inferring it is a bit inconvenient at the moment.
> > We could just add a specific probe for that - probably a separate "flag"
> > histogram set from |PeerConnectionCtx::Initialize()| that you can compare
> > with the "count" histogram of actual call counts.
> 
> Now I'm confused; the docs you link to say 'count' histograms have a default
> (0), so we would get the % that made calls (and in fact it might be annoying
> to see numbers other than that, but we'll see).

Aha, that was for the a different alternative (non-count-type)
Attachment #8738565 - Flags: review?(tnikkel)
Attachment #8738565 - Flags: review?(seth)
Attachment #8738565 - Flags: review+
(In reply to Randell Jesup [:jesup] from comment #31)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #28)
> > (In reply to Randell Jesup [:jesup] from comment #26)
> > > I can't evaluate the impact of this without a better understanding of what
> > > this change means.  Can you point to more docs on the telemetry types?  The
> > > include files don't really define the API well from what I saw (may have
> > > missed something)
> > > 
> > > ::: toolkit/components/telemetry/Histograms.json
> > > @@ +6554,5 @@
> > > > +  "WEBRTC_CALL_COUNT_2": {
> > > > +    "alert_emails": ["?"],
> > > > +    "bug_numbers": [1261063],
> > > > +    "expires_in_version": "never",
> > > > +    "kind": "count",
> > > 
> > > I don't know what a "count" histogram does differently than an
> > > "exponential"; I had trouble finding docs on this.
> > 
> > The main documentation is here:
> > https://developer.mozilla.org/en-US/docs/Mozilla/Performance/
> > Adding_a_new_Telemetry_probe#Choosing_a_Histogram_Type
> > "count" histograms are different in that they only track one value.
> 
> And they have a default value that's submitted as well.  That will tell us
> how many sessions made calls - but unless we can view it with 0 removed, it
> will be very hard to see the distribution of call numbers > 0.
> 
> That said, the data would be there, and this isn't a critical item (per
> above).
> 
> Please drop a link into the .h or .json file to the docs, or describe these
> types in the .h (or json) file.

There are no comments allowed in JSON.
Do you suggest adding a link to the wiki page in the Telemetry.h header?
That sounds useful to me.
Flags: needinfo?(rjesup)
Sure, Telemetry.h
Flags: needinfo?(rjesup)
(In reply to Georg Fritzsche [:gfritzsche] from comment #2)
> For converting RasterImage.cpp to use the Telemetry::Accumulate() functions
> instead of GetHistogramById()->Add()/Substract() we need to understand what
> this is actually supposed to be doing:
> https://dxr.mozilla.org/mozilla-central/search?q=path%3ARasterImage.
> cpp+GetHistogram&redirect=false&case=true
> 
> I see that IMAGE_DECODE_COUNT & IMAGE_MAX_DECODE_COUNT use Add() and
> Substract() to just keep a count.
> 
> IMAGE_MAX_DECODE_COUNT is clear and just ends up being the maximum of the
> instance decode counts.
> 
> IMAGE_DECODE_COUNT seems to be rather randomly set to the count of whatever
> RasterImage instance submitted this last.
> Seth, do you know what question that histogram is actually trying to answer?
> Are both of those histograms actually being used and monitored?

I filed bug 1269829 to make IMAGE_DECODE_COUNT track something useful.
Flags: needinfo?(seth)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: