Closed Bug 816166 Opened 12 years ago Closed 12 years ago

Add log_sum to exp histogram ping data

Categories

(Toolkit :: Telemetry, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla20

People

(Reporter: taras.mozilla, Assigned: froydnj)

References

Details

Attachments

(2 files, 2 obsolete files)

Send a log_sum in telemetry data. Without this telemetry evolution view is easily skewed.
(In reply to Taras Glek (:taras) from comment #0)
> Send a log_sum in telemetry data. Without this telemetry evolution view is
> easily skewed.

Note it may make sense to only do this for exp histograms. For linear histograms a weighted mean can be computed already.
Just to make sure I understand what's going on here, we want log_sum for exponential histograms only:

log_sum = log(x1) + log(x2) + ... + log(xn)

where x1, x2, ..., xn are the values recorded in the histogram (i.e. x1 + ... + xn = the sum field we already send)?
Flags: needinfo?(dzeber)
yes.

What base will be used?

By the way, while we're at it, how about computing 

a) for exponential histograms: log_sum_ss : = sum( log(xi)^2, i=1 ... n) 
b) for linear: sum_ss : = sum( (xi)^2, i=1 ... n)
(These are called sum of squares)
Thus we can compute the variance and a 'measure of spread' (e.g. in this case variance/standard deviation) for the computed mean.

Regards
Saptarshi
Also, how will you handle values 'x1' that are observed as 0 (since taking their logs is undefined)? Does the possibility exist?
(In reply to Saptarshi Guha from comment #4)
> Also, how will you handle values 'x1' that are observed as 0 (since taking
> their logs is undefined)? Does the possibility exist?

Natural log is simplest. What is your preference for counting log(0)? The obvious solution is to only do that for nonzero values.
(In reply to Saptarshi Guha from comment #3)
> 
> a) for exponential histograms: log_sum_ss : = sum( log(xi)^2, i=1 ... n) 
> b) for linear: sum_ss : = sum( (xi)^2, i=1 ... n)
> (These are called sum of squares)
> Thus we can compute the variance and a 'measure of spread' (e.g. in this
> case variance/standard deviation) for the computed mean.

Please file a separate bug for sum of squares, we can discuss that there.
(In reply to Nathan Froyd (:froydnj) from comment #2)
> Just to make sure I understand what's going on here, we want log_sum for
> exponential histograms only:
> 
> log_sum = log(x1) + log(x2) + ... + log(xn)
> 
> where x1, x2, ..., xn are the values recorded in the histogram (i.e. x1 +
> ... + xn = the sum field we already send)?

Yes - cf bug 811552. 

(In reply to Saptarshi Guha from comment #3)
> By the way, while we're at it, how about computing 
> 
> a) for exponential histograms: log_sum_ss : = sum( log(xi)^2, i=1 ... n) 
> b) for linear: sum_ss : = sum( (xi)^2, i=1 ... n)
> (These are called sum of squares)
> Thus we can compute the variance and a 'measure of spread' (e.g. in this
> case variance/standard deviation) for the computed mean.

+1. This would give us more information computed from the raw data rather than the discretized data from the histogram, which is what we really want to use.  


(In reply to Saptarshi Guha from comment #4)
> Also, how will you handle values 'x1' that are observed as 0 (since taking
> their logs is undefined)? Does the possibility exist?

One way to handle this is for work on the log scale to be done using the transformation log(1+x) rather than log(x). That means that we would use this transformation before doing any analysis, and the sum field would record log(1+x_1) + ... + log(1+x_n).
Flags: needinfo?(dzeber)
Summary: improvements for telemetry → Add log_sum to exp histogram ping data
(In reply to Taras Glek (:taras) from comment #6)
> (In reply to Saptarshi Guha from comment #3)
> > 
> > a) for exponential histograms: log_sum_ss : = sum( log(xi)^2, i=1 ... n) 
> > b) for linear: sum_ss : = sum( (xi)^2, i=1 ... n)
> > (These are called sum of squares)
> > Thus we can compute the variance and a 'measure of spread' (e.g. in this
> > case variance/standard deviation) for the computed mean.
> 
> Please file a separate bug for sum of squares, we can discuss that there.

sorry I misunderstood the suggestion. 

Adding sums of squares is reasonable while we are at it.
(In reply to Saptarshi Guha from comment #3)
> By the way, while we're at it, how about computing 
> 
> a) for exponential histograms: log_sum_ss : = sum( log(xi)^2, i=1 ... n) 
> b) for linear: sum_ss : = sum( (xi)^2, i=1 ... n)
> (These are called sum of squares)
> Thus we can compute the variance and a 'measure of spread' (e.g. in this
> case variance/standard deviation) for the computed mean.

Is there value in having just sum_ss for all histograms, and log_sum_ss for exponential only?  I'm just trying to see ways to minimize separate code paths for exponential vs. linear histograms.

I wonder how much of a perf impact computing these extra measures will have when gathering data.
> I wonder how much of a perf impact computing these extra measures will have when 
> gathering data.

I can't imagine it would be significant; if we have a histogram operation on a path that's hot enough to make one call to log() significant, what could we possibly be measuring?
(In reply to Justin Lebar [:jlebar] from comment #10)
> > I wonder how much of a perf impact computing these extra measures will have when 
> > gathering data.
> 
> I can't imagine it would be significant; if we have a histogram operation on
> a path that's hot enough to make one call to log() significant, what could
> we possibly be measuring?

Telemetry::Accumulate has shown up in profiles before; the bug I recall seeing this in was calling it somewhere in font handling.

I don't *know* what the overhead is like, but I would not be totally surprised to find log() making Accumulate 5-10% slower.  And on ARMv6 with soft-float (do we support that?), I could see the slowdown being greater.  But that's pure speculation.
Blocks: 699670
Record the new sum and sum of squares information requested.  We're using
the natural logarithm and actually computing log(1+x) so as to avoid issues
with accumulating 0.

The patch is slightly more complicated than it absolutely needs to be
because I wanted to avoid the overhead of computing logs and whatnot
for non-exponential histograms.  But it wasn't as bad as I thought it
was going to be.
Attachment #687235 - Flags: review?(vdjeric)
Now that we have the extra information, we ought to be sending it.

The only tricky part of this patch is the s/INT_MAX/INT24_MAX/ bit in the
tests.  The problem here is that we don't have properly unsigned integers
in JavaScript, so we want to avoid a situation where the squared sample
has its sign bit set.  I could see this being a problem for very
long-running Firefox sessions and very busy exponential histograms, but
I'm going to punt on actually fixing it.
Attachment #687246 - Flags: review?(vdjeric)
Comment on attachment 687235 [details] [diff] [review]
part 1 - record extra information on raw data collected in histograms

Things gets a little hard to grok because "Accumulate" means different things in different classes. What do you think of making Histogram::Accumulate calculate log_sum_ and log_sum_squares_ depending on the histogram_type? Then "sample_" wouldn't have to be public and accumulation logic would be contained in the Histogram class. I noticed Histogram::DeserializeHistogramInfo does something similar when deciding which FactoryGet method to call.

Btw, it looks you might have forgotten to change the Histogram::Accumulate call in FlagHistogram::Accumulate?
Attachment #687235 - Flags: review?(vdjeric) → review-
Comment on attachment 687246 [details] [diff] [review]
part 2 - send extra information through telemetry pings and add tests

> The only tricky part of this patch is the s/INT_MAX/INT24_MAX/ bit in the
> tests.  The problem here is that we don't have properly unsigned integers
> in JavaScript, so we want to avoid a situation where the squared sample
> has its sign bit set.  I could see this being a problem for very
> long-running Firefox sessions and very busy exponential histograms, but
> I'm going to punt on actually fixing it.

We're not using unsigned integers for any of the sum variables as far as I can tell. So is this issue caused by a difference in max positive values between C++ and JavaScript?

>diff --git a/toolkit/components/telemetry/tests/unit/test_nsITelemetry.js b/toolkit/components/telemetry/tests/unit/test_nsITelemetry.js
>@@ -1,32 +1,48 @@
> /* Any copyright is dedicated to the Public Domain.
>    http://creativecommons.org/publicdomain/zero/1.0/ */
> 
> const Cc = Components.classes;
> const Ci = Components.interfaces;
> const Cu = Components.utils;
>-const INT_MAX = 0x7FFFFFFF;
>+const INT24_MAX = 0x7FFFFF;

Add a comment here. Looks good otherwise
Attachment #687246 - Flags: review?(vdjeric) → review+
(In reply to Vladan Djeric (:vladan) from comment #14)
> Things gets a little hard to grok because "Accumulate" means different
> things in different classes. What do you think of making
> Histogram::Accumulate calculate log_sum_ and log_sum_squares_ depending on
> the histogram_type? Then "sample_" wouldn't have to be public and
> accumulation logic would be contained in the Histogram class. I noticed
> Histogram::DeserializeHistogramInfo does something similar when deciding
> which FactoryGet method to call.

Well, Accumulate is already virtual, so it's already expected that Accumulate may do different things in subclasses of Histogram.  I was trying to add zero overhead in the linear/boolean/flag case, which I think are more common histograms than exponential histograms (haven't checked, though).  I am +0 on using virtual methods for this, but if you feel strongly about just switching on the histogram's type in Accumulate, I am willing to go that route.

Nit: sample_ was made protected, not public.

> Btw, it looks you might have forgotten to change the Histogram::Accumulate
> call in FlagHistogram::Accumulate?

Ooo, good catch.  I thought I changed that, but apparently not, or maybe I forgot to refresh appropriately.
(In reply to Vladan Djeric (:vladan) from comment #15)
> > The only tricky part of this patch is the s/INT_MAX/INT24_MAX/ bit in the
> > tests.  The problem here is that we don't have properly unsigned integers
> > in JavaScript, so we want to avoid a situation where the squared sample
> > has its sign bit set.  I could see this being a problem for very
> > long-running Firefox sessions and very busy exponential histograms, but
> > I'm going to punt on actually fixing it.
> 
> We're not using unsigned integers for any of the sum variables as far as I
> can tell. So is this issue caused by a difference in max positive values
> between C++ and JavaScript?

After looking at in the cold light of day, this is actually because I changed the type of the member variable for accumulating the sum of squares, but forgot to change the return type of the accessor.  Changing the return type of the accessor makes everything work out OK, and none of the gross changes to INT24_MAX are needed.  I'll check that change in without review.

You're correct insofar as there is a mismatch between what we can record in C++ and what we can send in Javascript; it will be informative to see if that turns out to be a problem.  Maybe we should just send the high and low halves of sum_squares separately and let the analysis side of things reconstruct values appropriately?
Flags: needinfo?(dzeber)
With the renaming of SampleSet::Accumulate as discussed.
Attachment #687235 - Attachment is obsolete: true
Attachment #687933 - Flags: review?(vdjeric)
Now without the bogus INT24_MAX changes.
Attachment #687246 - Attachment is obsolete: true
Attachment #687934 - Flags: review+
Attachment #687933 - Flags: review?(vdjeric) → review+
 
> You're correct insofar as there is a mismatch between what we can record in
> C++ and what we can send in Javascript; it will be informative to see if
> that turns out to be a problem.  Maybe we should just send the high and low
> halves of sum_squares separately and let the analysis side of things
> reconstruct values appropriately?

Sorry I didn't get what you meant by the high and low halves.


(In reply to Nathan Froyd (:froydnj) from comment #16)

> Well, Accumulate is already virtual, so it's already expected that
> Accumulate may do different things in subclasses of Histogram.  I was trying
> to add zero overhead in the linear/boolean/flag case, which I think are more
> common histograms than exponential histograms (haven't checked, though).  

Actually I had looked at this before using a dataset from v18 nightly, and I found that exponential histograms are by far the most common. There were 489 different histograms expressed across the 44025 packets I looked at (1 week in September 2012), of which about 3/4 were exponential. The breakdown of the remaining types was about 10% boolean, 7% linear, and 5% categorical (out of 489 histograms).
Flags: needinfo?(dzeber)
(In reply to dzeber from comment #20)
>  
> > You're correct insofar as there is a mismatch between what we can record in
> > C++ and what we can send in Javascript; it will be informative to see if
> > that turns out to be a problem.  Maybe we should just send the high and low
> > halves of sum_squares separately and let the analysis side of things
> > reconstruct values appropriately?
> 
> Sorry I didn't get what you meant by the high and low halves.

No worries, I wasn't being very clear here.  We represent sum_squares in C++ as an unsigned 64-bit number.  But not all such numbers can be exactly represented in JavaScript; the conversion from C++ to JavaScript therefore loses information.  What I'm really wondering is whether we should take steps to avoid this loss of information (e.g. by sending the 64-bit number broken down into quantities that can be exactly represented in JavaScript).  The loss of information would only occur when sum_squares > 2^52, so it might not be worth thinking about too hard...

> Actually I had looked at this before using a dataset from v18 nightly, and I
> found that exponential histograms are by far the most common.

Cool, thanks!
Flags: needinfo?(dzeber)
Also, try runs indicate that Windows doesn't support log1p.  Boo.
(In reply to Nathan Froyd (:froydnj) from comment #21)
> (In reply to dzeber from comment #20)

> No worries, I wasn't being very clear here.  We represent sum_squares in C++
> as an unsigned 64-bit number.  But not all such numbers can be exactly
> represented in JavaScript; the conversion from C++ to JavaScript therefore
> loses information.  What I'm really wondering is whether we should take
> steps to avoid this loss of information (e.g. by sending the 64-bit number
> broken down into quantities that can be exactly represented in JavaScript). 
> The loss of information would only occur when sum_squares > 2^52, so it
> might not be worth thinking about too hard...

Oh ok. Yeah if it's not too involved, I think it would be better to avoid the loss of information. The ss depends on the measurement scale (squared), so it is conceivable that it could run into very large values in some cases. 

I don't know what the concensus was on having different data collection methods (via Accumulate) depending on the histogram type, but if we're taking that route we'd only really need sum and ss for linear, and logsum and logss for exponential. For boolean/categorical/flag, the histogram contains the complete data, so these measurements are not necessary or even meaningful. I'm not sure if this observation simplifies things for you on the backend.
Flags: needinfo?(dzeber)
(In reply to Nathan Froyd (:froydnj) from comment #24)
> Applied with the lo/hi split of sum_squares:
> 
> remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/4e8386f13286
> remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/e42b1de74b4d

Do we need to calculate sum & sum of squares for exponential histograms?
Per comment 23, we don't need to; I suppose we could rearrange SampleSet's methods to not do that unnecessary work and all the Telemetry bits to not reflect those.
(In reply to Nathan Froyd (:froydnj) from comment #26)
> Per comment 23, we don't need to; I suppose we could rearrange SampleSet's
> methods to not do that unnecessary work and all the Telemetry bits to not
> reflect those.

I think it would make sense since exponential histograms are the most common and as you said, Accumulate can show up in profiles
https://hg.mozilla.org/mozilla-central/rev/4e8386f13286
https://hg.mozilla.org/mozilla-central/rev/e42b1de74b4d
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Blocks: 819418
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: