I have some suggestions for improvements to the way telemetry data is recorded. Anything that makes the data cleaner and more consistent is of great value to the analysts.  The padding which adds an extra empty bucket to either end of a histogram (http://mxr.mozilla.org/mozillacentral/source/toolkit/components/telemetry/TelemetryPing.js#241) doesn't really add anything, and can lead to confusion later. I vote to remove it. It should be sufficient that we have the counts for nonempty buckets. We can recreate the complete bucketing scheme if necessary from the range/bucket_count/histogram_type variables. As it is now, we can't rely on the fact that buckets observed in our data are nonempty, and it leads to boolean/flag histograms having 3 buckets, which doesn't really make sense.  For exponential type histograms, it would make more sense to record the sum of the log values rather than the sum of the raw values. The sum field is used to compute the mean of the raw values. However, exponential histograms are used to handle highly skewed distributions, and we take logs anyway before doing any of the analysis. In the case of skewed data, we would not want to look at the mean of raw values anyway, as it is not a good estimate of center. If we have the sum of the logs, we can compute the mean on the log scale (useful for analysis), and we can always compute the geometric mean (prod(x_1,...,x_n)^(1/n)) to give a measure of distribution center on the original scale if necessary.  In the JSON (http://mxr.mozilla.org/mozillacentral/source/toolkit/components/telemetry/Histograms.json), certain histograms are of type "enumerated", meaning that they are categorical variables, whose outcomes represent distinct states, but whose ordering is usually not meaningful. At the moment, these are recorded as LinearHistograms, with buckets arranged approriately. I think it would make more sense to create a new histogram type for these, say CategoricalHistogram. They are not "linear", in the sense that the values are precisely integer values, not discretized continuous measurements, so they have a different interpretation. Also, bucket "0" is a proper state and does not correspond to "underflow". Because of this, in our analysis we separate these variables anyway (using the rule histogram_type=="linear" && bucket_count==range[2]+1) and handle them differently. I think it would be more robust to record these inherently as a separate type, following the schema in the JSON.
Comment 1•5 years ago


>  The padding which adds an extra empty bucket to either end of a histogram doesn't really add > anything, and can lead to confusion later. I vote to remove it. I don't have a strong opinion about the padding, but to state the opposite case: The padding is there to indicate that the data you're seeing wasn't clipped. That is, suppose we have a histogram whose top bucket value is 50. If we see 2 observations in the 50 bucket, the histogram will be sent /without/ padding at the top. The same applies if we have observations in the bottom bucket. But like I say, I don't care about this one either way. >  For exponential type histograms, it would make more sense to record the sum of the log values > rather than the sum of the raw values. I don't agree with this one, for a few reasons. * If it's reasonable to compute the sum over logs, you guys can do that in the backend; you don't need to modify the telemetry ping in order to accommodate that analysis. * I don't think it's reasonable to suggest that we should compute meanoverlogs for all exponential histograms, because ** meanoverlogs and geometric mean are difficult for laypeople to interpret ** meanoverlogs and geometric mean are difficult to compare to arithmetic mean ** not all exponential histograms are for highlyskewed distributions It sounds like you don't need the sum at all; I'd be OK getting rid of it. But perhaps it serves as a useful checksum (since IIRC we still don't have any real checksums in this ping; instead, we rely on TCP checksums). > I think it would make more sense to create a new histogram type for these, say > CategoricalHistogram. This sounds nice, since we haven't made any headway getting categorical data labeled in telemetry otherwise. But it's not clear how to do this while still letting us compare our new data to our old data (labeled with numbers). And if we solve that problem, we've basically solved the problem of attaching labels to the numbers, so we don't really need CategoricalHistogram. In general, I think these are good suggestions, but none of these is blocking us from accomplishing real goals, as far as I can tell. On the other hand, we have critical telemetry work outstanding that's been basically ignored for upwards of a month (bug 790616, and I think there's another one which I can't find at the moment). I think before we start beautifying the telemetry ping, we should resolve our standing dataquality issues.
Comment 2•5 years ago


Brendan, thanks for the constructive suggestions. > The padding which adds an extra empty bucket to either end of a histogram (http://mxr.mozilla.org/mozillacentral/source/toolkit/components/telemetry/TelemetryPing.js#241) doesn't really add anything, and can lead to confusion later. I vote to remove it. I'm not sure why you'd be confused. Booleans have 3 values because of other implementation weirdness from code we imported from chrome, not due to padding alone. Padding was useful for debugging at some point. At the moment it only weakly serves the point of making sure that bucket enumeration code did not go haywire. Once serverside validation is inplace, I could be convinced to get rid of padding. > In the JSON (http://mxr.mozilla.org/mozillacentral/source/toolkit/components/telemetry/Histograms.json), certain histograms are of type "enumerated", meaning that they are categorical variables, whose outcomes represent distinct states, but whose ordering is usually not meaningful. At the moment, these are recorded as LinearHistograms, with buckets arranged approriately. I think it would make more sense to create a new histogram type for these, say CategoricalHistogram. I think that's reasonable. I'm not sure why it's a big deal for analysts. It seems like you should have an abstraction that hides some of the uglyness of raw data before you do analysis. Both of above issues can be solved completely on the serverside. Solving these issues on the clientside wont let you skip fixing this on the serverside, since you have to redo data from existing firefox versions anyway(so you can compare data from older firefoxes to newer ones). If you still feel we should make this change, we can consider it. I'm more worried about impact on you having to compare slightly different data than us making a trivial change. >  For exponential type histograms, it would make more sense to record the sum of the log values > rather than the sum of the raw values. We don't have to abandon .sum. If having a .log_sum provides us with better data, we can add it. Will we get better data from that? It's not clear from your description.
Comment 3•5 years ago


Sorry, I just realized that dzeber is a David, not a Brendan
(Reporter)  
Comment 4•5 years ago


(In reply to Justin Lebar [:jlebar] from comment #1) > I don't have a strong opinion about the padding, but to state the opposite > case: The padding is there to indicate that the data you're seeing wasn't > clipped. > > That is, suppose we have a histogram whose top bucket value is 50. If we > see 2 observations in the 50 bucket, the histogram will be sent /without/ > padding at the top. > > The same applies if we have observations in the bottom bucket. > As it is, the raw data is discretized once it gets stored in a histogram (ie. assigned to a bucket) as we lose the original value, so we don't gain any information from the padding  is this what you meant by clipped?. At the moment, the data that is sent back to us consists of those buckets whose count is positive, plus the padding buckets. So some of the middle buckets that have a count of zero are not present at all in the "values" table. It would be better if we could take for granted that a bucket appears in the value table if and only if its count > 0 (ie values[i]>0 all i). This is violated by adding the padding, and to work with this data we have to first do a check to remove buckets with count == 0. What I'm saying is that the padding does not add any extra information, and this extra step is just another place for bugs to creep in. Also, boolean variables take values 0 and 1. If the count of "1" values is positive, the padding adds a bucket labelled "2" with count 0. This behaviour doesn't really make sense. > * If it's reasonable to compute the sum over logs, you guys can do that in > the backend; you don't need to modify the telemetry ping in order to > accommodate that analysis. On the analysis side, we treat the sum field as a statistic computed from the raw data (ie. before it is bucketed). This will be different from anything computed from the bucketed values and more accurate. The sum is useful among other things because dividing by the number of observations gives the mean of the raw data. Anything we compute from telemetry histogram values uses the bucketed/discretized data, as the raw data is no longer accessible to us. > > * I don't think it's reasonable to suggest that we should compute > meanoverlogs for all exponential histograms, because As I understand it, the reason for having exponential histograms is to handle skewed data. With a linear histogram you'd have tons of buckets, many empty, for such data. To do any analysis on this data, we always take logs first. Most statistical analysis works best for symmetric data, and the procedure to handle skewed data is to transform it (eg via log) to make it look symmetric. Since everything is done on the log scale, it is natural to work with the mean of the logs (ie geometric mean). > > ** meanoverlogs and geometric mean are difficult for laypeople to > interpret They would be interpreting the geometric mean as a measure of center, ie a "typical value" that you would observe from the distribution, the same way as you would interpret an arithmetic mean. Laypeople would not be thinking about its exact computation. > ** meanoverlogs and geometric mean are difficult to compare to > arithmetic mean This comparison should not be made for this type of data. The arithmetic mean should not be used at all for highly skewed data, because it does not have the meaning that we normally expect of it. We expect the mean to measure a "typical" central value of a distribution. For skewed data, the arithmetic mean is often far from what we would consider "central" values, since it is so highly affected by outliers. > ** not all exponential histograms are for highlyskewed distributions I would argue that, in general, exponential histograms should only be used for highly skewed distributions, otherwise a linear histogram is more appropriate (less information is lost). > > It sounds like you don't need the sum at all; I'd be OK getting rid of it. > But perhaps it serves as a useful checksum (since IIRC we still don't have > any real checksums in this ping; instead, we rely on TCP checksums). Actually I wasn't aware that it was intended to be a checksum  I was thinking of it more as a statistic of the raw data. How is it used as a checksum? > > This sounds nice, since we haven't made any headway getting categorical data > labeled in telemetry otherwise. > > But it's not clear how to do this while still letting us compare our new > data to our old data (labeled with numbers). And if we solve that problem, > we've basically solved the problem of attaching labels to the numbers, so we > don't really need CategoricalHistogram. > I was expecting that the categories would still be labelled with numbers. The dictionary would be in the code somewhere, as it is now. What I meant is that categorical variables have a different interpretation from continuous ones. For one thing, the values are not discretized  the integer values are exactly the original data. Also, it would be inappropriate to, say, take a mean of these values, as they are not really numerical. What I'm saying is that, since Boolean and Flag already have their own types, I think it would make more sense to give Categoricals their own type (and their own histogram_type code) as well, so we can easily filter on types.
(Reporter)  
Comment 5•5 years ago


(In reply to Taras Glek (:taras) from comment #2) > > I think that's reasonable. I'm not sure why it's a big deal for analysts. It > seems like you should have an abstraction that hides some of the uglyness of > raw data before you do analysis. I was thinking that for categoricals to have their own type would add that extra layer of abstraction :). > > Both of above issues can be solved completely on the serverside. Solving > these issues on the clientside wont let you skip fixing this on the > serverside, since you have to redo data from existing firefox versions > anyway(so you can compare data from older firefoxes to newer ones). > > If you still feel we should make this change, we can consider it. I'm more > worried about impact on you having to compare slightly different data than > us making a trivial change. It shouldn't cause a problem with the analysis, since I thought the buckets could still have numerical labels. > > >  For exponential type histograms, it would make more sense to record the sum of the log values > > rather than the sum of the raw values. > > We don't have to abandon .sum. If having a .log_sum provides us with better > data, we can add it. Will we get better data from that? It's not clear from > your description. I think this could be a good approach.
Comment 6•5 years ago


> We don't have to abandon .sum. If having a .log_sum provides us with better > data, we can add it. I really don't think we should be in the business of adding redundant data to the telemetry ping every time metrics decides that they have a new measure they'd like to investigate. Instead, I think metrics should calculate log sum from the histogram data. > I was thinking that for categoricals to have their own type would add that extra layer of > abstraction :). It seems like your goal here is to make the telemetry ping clean. Sending both numbers and words for categoricals in telemetry pings doesn't seem like the best way to achieve that goal. With this solution, you also have the problem of figuring out what happens when we change a categorical at some point in the future; how does that affect your old mappings? If we simply add this data to the histograms.json, the ping remains clean and you don't have this problem.
(Reporter)  
Comment 7•5 years ago


(In reply to Justin Lebar [:jlebar] from comment #6) > > I really don't think we should be in the business of adding redundant data > to the telemetry ping every time metrics decides that they have a new > measure they'd like to investigate. > > Instead, I think metrics should calculate log sum from the histogram data. As I mentioned, computing statistics from the discretized data stored in the histogram is not the same as computing statistics from the raw data. > > It seems like your goal here is to make the telemetry ping clean. Sending > both numbers and words for categoricals in telemetry pings doesn't seem like > the best way to achieve that goal. > > With this solution, you also have the problem of figuring out what happens > when we change a categorical at some point in the future; how does that > affect your old mappings? > > If we simply add this data to the histograms.json, the ping remains clean > and you don't have this problem. There shouldn't be any words to send. The point is to give Categorical histograms their own histogram_type number, eg linear=1, boolean=2, flag=3. I'm proposing categorical=4. They already have their own type "enumerated" in the json, and I think they should have their own type identifier in the actual data. New data under this scheme would be consistent with previous data. Adding the text labels for the categories (if that's what you mean) would not be a bad idea.
Comment 8•5 years ago


> As I mentioned, computing statistics from the discretized data stored in the histogram is not the > same as computing statistics from the raw data. Ah, I see. Sorry I missed that. It looks like the sum that's currently reported is computed on the nondiscretized data, so a log sum could make sense here. > The point is to give Categorical histograms their own histogram_type number, eg linear=1, > boolean=2, flag=3. I'm proposing categorical=4. Oh, if that's all, that's fine. Sorry I misunderstood!
Description
•