Closed Bug 811552 Opened 12 years ago Closed 6 years ago

improvements for telemetry

Categories

(Toolkit :: Telemetry, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: dzeber, Unassigned)

References

(Blocks 1 open bug)

Details

I have some suggestions for improvements to the way telemetry data is recorded. Anything that makes the data cleaner and more consistent is of great value to the analysts. 

- The padding which adds an extra empty bucket to either end of a histogram (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#241) doesn't really add anything, and can lead to confusion later. I vote to remove it. 

It should be sufficient that we have the counts for non-empty buckets. We can recreate the complete bucketing scheme if necessary from the range/bucket_count/histogram_type variables. As it is now, we can't rely on the fact that buckets observed in our data are non-empty, and it leads to boolean/flag histograms having 3 buckets, which doesn't really make sense. 

- For exponential type histograms, it would make more sense to record the sum of the log values rather than the sum of the raw values. The sum field is used to compute the mean of the raw values. However, exponential histograms are used to handle highly skewed distributions, and we take logs anyway before doing any of the analysis. In the case of skewed data, we would not want to look at the mean of raw values anyway, as it is not a good estimate of center. If we have the sum of the logs, we can compute the mean on the log scale (useful for analysis), and we can always compute the geometric mean (prod(x_1,...,x_n)^(1/n)) to give a measure of distribution center on the original scale if necessary. 

- In the JSON (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Histograms.json), certain histograms are of type "enumerated", meaning that they are categorical variables, whose outcomes represent distinct states, but whose ordering is usually not meaningful. At the moment, these are recorded as LinearHistograms, with buckets arranged approriately. I think it would make more sense to create a new histogram type for these, say CategoricalHistogram. 

They are not "linear", in the sense that the values are precisely integer values, not discretized continuous measurements, so they have a different interpretation. Also, bucket "0" is a proper state and does not correspond to "underflow". Because of this, in our analysis we separate these variables anyway (using the rule histogram_type=="linear" && bucket_count==range[2]+1) and handle them differently. I think it would be more robust to record these inherently as a separate type, following the schema in the JSON.
> - The padding which adds an extra empty bucket to either end of a histogram doesn't really add 
> anything, and can lead to confusion later. I vote to remove it. 

I don't have a strong opinion about the padding, but to state the opposite case: The padding is there to indicate that the data you're seeing wasn't clipped.

That is, suppose we have a histogram whose top bucket value is 50.  If we see 2 observations in the 50 bucket, the histogram will be sent /without/ padding at the top.

The same applies if we have observations in the bottom bucket.

But like I say, I don't care about this one either way.

> - For exponential type histograms, it would make more sense to record the sum of the log values 
> rather than the sum of the raw values.

I don't agree with this one, for a few reasons.

* If it's reasonable to compute the sum over logs, you guys can do that in the back-end; you don't need to modify the telemetry ping in order to accommodate that analysis.

* I don't think it's reasonable to suggest that we should compute mean-over-logs for all exponential histograms, because

  ** mean-over-logs and geometric mean are difficult for laypeople to interpret
  ** mean-over-logs and geometric mean are difficult to compare to arithmetic mean
  ** not all exponential histograms are for highly-skewed distributions

It sounds like you don't need the sum at all; I'd be OK getting rid of it.  But perhaps it serves as a useful checksum (since IIRC we still don't have any real checksums in this ping; instead, we rely on TCP checksums).

> I think it would make more sense to create a new histogram type for these, say 
> CategoricalHistogram.

This sounds nice, since we haven't made any headway getting categorical data labeled in telemetry otherwise.

But it's not clear how to do this while still letting us compare our new data to our old data (labeled with numbers).  And if we solve that problem, we've basically solved the problem of attaching labels to the numbers, so we don't really need CategoricalHistogram.

In general, I think these are good suggestions, but none of these is blocking us from accomplishing real goals, as far as I can tell.  On the other hand, we have critical telemetry work outstanding that's been basically ignored for upwards of a month (bug 790616, and I think there's another one which I can't find at the moment).

I think before we start beautifying the telemetry ping, we should resolve our standing data-quality issues.
Brendan, thanks for the constructive suggestions. 

>- The padding which adds an extra empty bucket to either end of a histogram (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#241) doesn't really add anything, and can lead to confusion later. I vote to remove it. 

I'm not sure why you'd be confused. Booleans have 3 values because of other implementation weirdness from code we imported from chrome, not due to padding alone.
Padding was useful for debugging at some point. At the moment it only weakly serves the point of making sure that bucket enumeration code did not go haywire. Once serverside validation is inplace, I could be convinced to get rid of padding.

> In the JSON (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Histograms.json), certain histograms are of type "enumerated", meaning that they are categorical variables, whose outcomes represent distinct states, but whose ordering is usually not meaningful. At the moment, these are recorded as LinearHistograms, with buckets arranged approriately. I think it would make more sense to create a new histogram type for these, say CategoricalHistogram. 

I think that's reasonable. I'm not sure why it's a big deal for analysts. It seems like you should have an abstraction that hides some of the uglyness of raw data before you do analysis. 

Both of above issues can be solved completely on the serverside. Solving these issues on the clientside wont let you skip fixing this on the serverside, since you have to redo data from existing firefox versions anyway(so  you can compare data from older firefoxes to newer ones).

If you still feel we should make this change, we can consider it. I'm more worried about impact on you having to compare slightly different data than us making a trivial change.

> - For exponential type histograms, it would make more sense to record the sum of the log values 
> rather than the sum of the raw values.

We don't have to abandon .sum. If having a .log_sum provides us with better data, we can add it. Will we get better data from that? It's not clear from your description.
Sorry, I just realized that dzeber is a David, not a Brendan
(In reply to Justin Lebar [:jlebar] from comment #1)
> I don't have a strong opinion about the padding, but to state the opposite
> case: The padding is there to indicate that the data you're seeing wasn't
> clipped.
> 
> That is, suppose we have a histogram whose top bucket value is 50.  If we
> see 2 observations in the 50 bucket, the histogram will be sent /without/
> padding at the top.
> 
> The same applies if we have observations in the bottom bucket.
> 

As it is, the raw data is discretized once it gets stored in a histogram (ie. assigned to a bucket) as we lose the original value, so we don't gain any information from the padding - is this what you meant by clipped?. 

At the moment, the data that is sent back to us consists of those buckets whose count is positive, plus the padding buckets. So some of the middle buckets that have a count of zero are not present at all in the "values" table. It would be better if we could take for granted that a bucket appears in the value table if and only if its count > 0 (ie values[i]>0 all i). This is violated by adding the padding, and to work with this data we have to first do a check to remove buckets with count == 0. What I'm saying is that the padding does not add any extra information, and this extra step is just another place for bugs to creep in. Also, boolean variables take values 0 and 1. If the count of "1" values is positive, the padding adds a bucket labelled "2" with count 0. This behaviour doesn't really make sense.


> * If it's reasonable to compute the sum over logs, you guys can do that in
> the back-end; you don't need to modify the telemetry ping in order to
> accommodate that analysis.

On the analysis side, we treat the sum field as a statistic computed from the raw data (ie. before it is bucketed). This will be different from anything computed from the bucketed values and more accurate. The sum is useful among other things because dividing by the number of observations gives the mean of the raw data. Anything we compute from telemetry histogram values uses the bucketed/discretized data, as the raw data is no longer accessible to us.  

> 
> * I don't think it's reasonable to suggest that we should compute
> mean-over-logs for all exponential histograms, because

As I understand it, the reason for having exponential histograms is to handle skewed data. With a linear histogram you'd have tons of buckets, many empty, for such data. To do any analysis on this data, we always take logs first. Most statistical analysis works best for symmetric data, and the procedure to handle skewed data is to transform it (eg via log) to make it look symmetric. Since everything is done on the log scale, it is natural to work with the mean of the logs (ie geometric mean).

> 
>   ** mean-over-logs and geometric mean are difficult for laypeople to
> interpret

They would be interpreting the geometric mean as a measure of center, ie a "typical value" that you would observe from the distribution, the same way as you would interpret an arithmetic mean. Laypeople would not be thinking about its exact computation.


>   ** mean-over-logs and geometric mean are difficult to compare to
> arithmetic mean

This comparison should not be made for this type of data. The arithmetic mean should not be used at all for highly skewed data, because it does not have the meaning that we normally expect of it. We expect the mean to measure a "typical" central value of a distribution. For skewed data, the arithmetic mean is often far from what we would consider "central" values, since it is so highly affected by outliers. 


>   ** not all exponential histograms are for highly-skewed distributions

I would argue that, in general, exponential histograms should only be used for highly skewed distributions, otherwise a linear histogram is more appropriate (less information is lost). 

> 
> It sounds like you don't need the sum at all; I'd be OK getting rid of it. 
> But perhaps it serves as a useful checksum (since IIRC we still don't have
> any real checksums in this ping; instead, we rely on TCP checksums).

Actually I wasn't aware that it was intended to be a checksum - I was thinking of it more as a statistic of the raw data. How is it used as a checksum?

> 
> This sounds nice, since we haven't made any headway getting categorical data
> labeled in telemetry otherwise.
> 
> But it's not clear how to do this while still letting us compare our new
> data to our old data (labeled with numbers).  And if we solve that problem,
> we've basically solved the problem of attaching labels to the numbers, so we
> don't really need CategoricalHistogram.
> 

I was expecting that the categories would still be labelled with numbers. The dictionary would be in the code somewhere, as it is now. What I meant is that categorical variables have a different interpretation from continuous ones. For one thing, the values are not discretized - the integer values are exactly the original data. Also, it would be inappropriate to, say, take a mean of these values, as they are not really numerical. What I'm saying is that, since Boolean and Flag already have their own types, I think it would make more sense to give Categoricals their own type (and their own histogram_type code) as well, so we can easily filter on types.
(In reply to Taras Glek (:taras) from comment #2)
> 
> I think that's reasonable. I'm not sure why it's a big deal for analysts. It
> seems like you should have an abstraction that hides some of the uglyness of
> raw data before you do analysis. 

I was thinking that for categoricals to have their own type would add that extra layer of abstraction :).

> 
> Both of above issues can be solved completely on the serverside. Solving
> these issues on the clientside wont let you skip fixing this on the
> serverside, since you have to redo data from existing firefox versions
> anyway(so  you can compare data from older firefoxes to newer ones).
> 
> If you still feel we should make this change, we can consider it. I'm more
> worried about impact on you having to compare slightly different data than
> us making a trivial change.

It shouldn't cause a problem with the analysis, since I thought the buckets could still have numerical labels.


> 
> > - For exponential type histograms, it would make more sense to record the sum of the log values 
> > rather than the sum of the raw values.
> 
> We don't have to abandon .sum. If having a .log_sum provides us with better
> data, we can add it. Will we get better data from that? It's not clear from
> your description.

I think this could be a good approach.
> We don't have to abandon .sum. If having a .log_sum provides us with better
> data, we can add it.

I really don't think we should be in the business of adding redundant data to the telemetry ping every time metrics decides that they have a new measure they'd like to investigate.

Instead, I think metrics should calculate log sum from the histogram data.

> I was thinking that for categoricals to have their own type would add that extra layer of 
> abstraction :).

It seems like your goal here is to make the telemetry ping clean.  Sending both numbers and words for categoricals in telemetry pings doesn't seem like the best way to achieve that goal.

With this solution, you also have the problem of figuring out what happens when we change a categorical at some point in the future; how does that affect your old mappings?

If we simply add this data to the histograms.json, the ping remains clean and you don't have this problem.
(In reply to Justin Lebar [:jlebar] from comment #6)
> 
> I really don't think we should be in the business of adding redundant data
> to the telemetry ping every time metrics decides that they have a new
> measure they'd like to investigate.
> 
> Instead, I think metrics should calculate log sum from the histogram data.

As I mentioned, computing statistics from the discretized data stored in the histogram is not the same as computing statistics from the raw data. 

> 
> It seems like your goal here is to make the telemetry ping clean.  Sending
> both numbers and words for categoricals in telemetry pings doesn't seem like
> the best way to achieve that goal.
> 
> With this solution, you also have the problem of figuring out what happens
> when we change a categorical at some point in the future; how does that
> affect your old mappings?
> 
> If we simply add this data to the histograms.json, the ping remains clean
> and you don't have this problem.

There shouldn't be any words to send. The point is to give Categorical histograms their own histogram_type number, eg linear=1, boolean=2, flag=3. I'm proposing categorical=4. They already have their own type "enumerated" in the json, and I think they should have their own type identifier in the actual data. New data under this scheme would be consistent with previous data. Adding the text labels for the categories (if that's what you mean) would not be a bad idea.
> As I mentioned, computing statistics from the discretized data stored in the histogram is not the 
> same as computing statistics from the raw data. 

Ah, I see.  Sorry I missed that.

It looks like the sum that's currently reported is computed on the non-discretized data, so a log sum could make sense here.

> The point is to give Categorical histograms their own histogram_type number, eg linear=1, 
> boolean=2, flag=3. I'm proposing categorical=4.

Oh, if that's all, that's fine.

Sorry I misunderstood!
Depends on: 816166
Depends on: 819418
Depends on: 833858
Blocks: 1201022
No longer depends on: 833858
(In reply to Justin Lebar (not reading bugmail) from comment #1)
> > - The padding which adds an extra empty bucket to either end of a histogram doesn't really add 
> > anything, and can lead to confusion later. I vote to remove it. 
> 
> I don't have a strong opinion about the padding, but to state the opposite
> case: The padding is there to indicate that the data you're seeing wasn't
> clipped.

I think we do rely on the overflow bucket, it's the only clear signal engineers have in e.g. the measurement dashboard to see that their values are outside of the chosen bucket range.

> > - For exponential type histograms, it would make more sense to record the sum of the log values 
> > rather than the sum of the raw values.
> 
> I don't agree with this one, for a few reasons.
>
> [...] 
>
> It sounds like you don't need the sum at all; I'd be OK getting rid of it. 
> But perhaps it serves as a useful checksum (since IIRC we still don't have
> any real checksums in this ping; instead, we rely on TCP checksums).

I suspect we currently use the sum for at least count histograms.
Unless there is a strong reason for looking into this we might want to drop this in favor of a later full review of the histogram serialization format.


Chris, Alessio, does that sound right?
If yes, let's close this bug (as the histogram type change is tracked in bug 833858).
Flags: needinfo?(chutten)
Flags: needinfo?(alessio.placitelli)
To my knowledge sum is unnecessary in count histograms as the scalar value is determined by the number of samples in the first bucket. (which happens to equal the sum, for count histograms). Given that count histograms are deprecated in favour of uint Scalars (which are more efficient, more flexible, and better supported by, eg, main_summary), if they were the only user of sum, we'd probably be wise to speed up scalarization and remove it.

That being said, sum is the real sum of the samples as they come in, instead of the sum of the counts of the buckets, and so can be useful for aggregation of linear and exponential histograms. I wouldn't prioritize its removal over a review of the serialization format.

I favour closing this bug and scheduling a histogram format review.
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #11)
> That being said, sum is the real sum of the samples as they come in, instead
> of the sum of the counts of the buckets, and so can be useful for
> aggregation of linear and exponential histograms. I wouldn't prioritize its
> removal over a review of the serialization format.

In the histogram packing bug for GeckoView, :frank left a very clear comment about the histogram format:

(In reply to Frank Bertsch [:frank] from bug 1467705 comment #4)
> (In reply to Alessio Placitelli [:Dexter] from comment #0)
> > On desktop, we pack histograms [1] before adding them to the main ping. The
> > pipeline seems to expect them to be in this format [2]. Even if we're not
> > using the aggregator for the "mobile-ping", it seems a sane thing to have
> > histograms packed on GeckoView as well, for consistency.
> 
> Sparse Histograms: definitely, because we have the histogram definitions
> available
> min/max/bucket_count/histogram_type: Let's remove this, we don't use them
> anywhere
> sum: Definitely keep
> 
> Thoughts on this?

He mentions that we should definitely keep the sum :) Let's revisit this in a histogram format review!

> I favour closing this bug and scheduling a histogram format review.

Yes, I agree with Chris on this.
Flags: needinfo?(alessio.placitelli)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.