Closed Bug 736189 Opened 13 years ago Closed 13 years ago

Schedule a MapReduce job to validate Telemetry data in HBase daily

Categories

(Mozilla Metrics :: Data/Backend Reports, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED
Moved to JIRA

People

(Reporter: dre, Assigned: aphadke)

References

Details

(Whiteboard: [JIRA METRICS-845] [Telemetry:P1])

If we just drop bad data points, then we blind ourselves to potential systemic problems. However, we cannot let these invalid data points skew the measures of the valid data. I propose that the MR job scan each new submission and evaluate whether each histogram bucket key is within the defined bounds declared by the histogram. If it is outside those bounds, the entries will be removed and the counts adjusted. A new section of data for that histogram will be created that reports the number of invalid entries, possibly breaking that total down by whether it was outside the MIN or the MAX. Once we have corrected and annotated the data, we can have Nagios alerts that will fire if any of these error counts gets "too high". We can also use those annotations in the dashboards to alert users as to the volume of invalid entries.
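Roughly, the per-histogram step could look like the following sketch (Python for illustration; the "invalid_values" annotation name and the below_min/above_max breakdown are placeholders, not a final format):

# Minimal sketch of the proposed per-submission validation pass.
# Assumes each histogram is a dict with "range" and "values" keys as in
# the Telemetry JSON; the "invalid_values" field is a placeholder name.
def validate_histogram(hist):
    lo, hi = hist["range"][0], hist["range"][-1]
    invalid = {"below_min": 0, "above_max": 0}
    kept = {}
    for key, count in hist["values"].items():
        bucket = int(key)
        if bucket < lo:
            invalid["below_min"] += count
        elif bucket > hi:
            invalid["above_max"] += count
        else:
            kept[key] = count
    hist["values"] = kept
    hist["invalid_values"] = invalid  # annotation consumed by alerts/dashboards
    return hist

The Nagios alerts and dashboard annotations would then only need to read the invalid_values section rather than rescan the raw data.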
Anurag, is there any chance you might be up to taking a crack at building this MR job?
Assignee: xstevens → aphadke
I've found another issue while studying simpleMeasurements.uptime. Some of the values are negative and some are implausibly large (the values are in minutes, so some work out to roughly 11 years, for example). We should probably remove these as well.
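For instance, the filter could be as simple as this sketch (the one-year cutoff is an illustrative assumption, not an agreed threshold):

# Sketch of a sanity filter for simpleMeasurements.uptime (in minutes).
# The upper bound is an illustrative assumption, not an agreed cutoff.
MAX_UPTIME_MINUTES = 60 * 24 * 365  # roughly one year

def uptime_is_valid(uptime_minutes):
    return 0 <= uptime_minutes <= MAX_UPTIME_MINUTES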
lmandel: Take a look at http://aphadke.pastebin.mozilla.org/1529693. The top one is the original (incorrect) JSON and the bottom one is the corrected version. Is this in line with what you want?
Looking at the pastebin, I see that you've introduced a valid range (or did this previously exist?). You then drop values outside of the range. In one case you list the invalid values, but you only list the value 4, even though it looks like there is a value 0 that is invalid and dropped as well. When will values be listed as invalid? Can you get the total count of invalid values, as Daniel suggests in comment 0? Where are the valid ranges specified?
Yes, I did introduce the fake value "4": 10020.

Regarding value 0: afaik, the value "0" is valid and should not be dropped. Is that not true?

Do you want the count of invalid values per JSON, or in an end-of-day report?

The valid ranges are specified in the JSON itself. So, for ZIPARCHIVE_CRC:
"range": [
  1,
  2
],
and for MEMORY_RESIDENT:
"range": [
  32768,
  1048576
],
and so on.
lmandel: is the above in line with what you want?
(In reply to aphadke from comment #5)
> yes, I did introduce the fake value "4" : 10020
>
> wrt value 0, afaik, the value "0" is valid and should not be dropped, is
> that not true?

Let's look at a specific example for EARLY_GLUESTARTUP_HARD_FAULTS.

"EARLY_GLUESTARTUP_HARD_FAULTS": {
  "range": [
    1,
    100
  ],
  "bucket_count": 12,
  "histogram_type": 1,
  "values": {
    "0": 1,
    "1": 0
  },
  "sum": 0
},

gets converted to

"EARLY_GLUESTARTUP_HARD_FAULTS": {
  "range": [
    1,
    100
  ],
  "bucket_count": 12,
  "histogram_type": 1,
  "values": {
    "1": 0
  },
  "sum": 0
},

In this case I see that there is a value of "0" that is submitted and filtered from the second "gets converted to" view of the data. However, I don't see an "invalid_values" entry. Is this expected?

> Do you want the count of invalid values per JSON or end of day report?

As Daniel suggested in comment 0, I think the values are required in the frontend dashboards in order to give developers a view of the size of invalid data for their probes. It sounds to me like this means that the count of invalid values is needed in the JSON data... but feel free to correct me.

> the valid ranges are specified in the JSON itself:
>
> so, for: ZIPARCHIVE_CRC
> "range": [
>   1,
>   2
> ],
>
> for: MEMORY_RESIDENT
> "range": [
>   32768,
>   1048576
> ],
>
> and so on..

Are the ranges needed in the data? Are the ranges used by a consumer of the JSON such as the frontend dashboard?
We use it to make sure we don't aggregate data generated with different parameters.
> Let's look at a specific example for EARLY_GLUESTARTUP_HARD_FAULTS.
>
> "EARLY_GLUESTARTUP_HARD_FAULTS": {
>   "range": [
>     1,
>     100
>   ],
>   "bucket_count": 12,
>   "histogram_type": 1,
>   "values": {
>     "0": 1,
>     "1": 0
>   },
>   "sum": 0
> },
>
> gets converted to
>
> "EARLY_GLUESTARTUP_HARD_FAULTS": {
>   "range": [
>     1,
>     100
>   ],
>   "bucket_count": 12,
>   "histogram_type": 1,
>   "values": {
>     "1": 0
>   },
>   "sum": 0
> },
>
> In this case I see that there is a value of "0" that is submitted and
> filtered from the second "gets converted to" view of the data. However, I
> don't see an "invalid_values" entry. Is this expected?

The dropped "0" was a bug on my part and it's been fixed. The value "0" is not dropped since it's valid.

The count of invalid values will exist as part of the JSON, as developers need the count.
(In reply to aphadke from comment #9)
> The dropped "0" was a bug on my part and it's been fixed. The value "0" is
> not dropped since it's valid.
>
> The count of invalid values will exist as part of the JSON, as developers
> need the count.

If the range is 1-100, how is the value "0" within the valid range?
> If the range is 1-100, how is the value "0" within the valid range?

I really don't know. I was told that value "0" is a special value and should not fall under invalid_values.
Cause Telemetry is weird. :)

From what I remember of what Taras told me, Telemetry always adds bounding edges to the buckets submitted. If the range was 1-100 and the submission had only one value for bucket 50, then the submission would include the bucket before 50 with a count of 0, the bucket 50 with a count of 1, and the bucket after 50 with a count of 0. In the case where the min bucket contains a value, Telemetry will include an element for the bucket before the min value with a count of 0.

The important thing is that the count is 0, meaning there were no observations with that value. It would only be truly invalid data if the count was greater than 0.
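To make the padding concrete, here's a hypothetical submission fragment illustrating that behavior (the neighboring bucket keys assume linearly spaced buckets for simplicity; real histograms may use exponential bucketing):

# Single observation in bucket 50 of a 1-100 histogram, as described above.
submitted_values = {
    "49": 0,  # bucket before the observed one, padded with count 0
    "50": 1,  # the actual observation
    "51": 0,  # bucket after the observed one, padded with count 0
}

# If the min bucket itself has observations, an extra zero-count bucket
# below the min is included, e.g. {"0": 0, "1": 3, "2": 0} for a
# histogram whose declared range starts at 1.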
Nathan, can you comment on whether comment 12 is accurate?
I had a look through the code. With the caveat that I'm completely new to this code, here's what I understand. The relevant function is TelemetryPing.js::packHistogram [1].

* Range is the histogram's range as defined in TelemetryHistograms.h [2]. This is the intrinsic min/max of the histogram and isn't affected by the data in the histogram. But see below for a quirk about the histogram's min.

* "If the range was 1 - 100 and the submission had only one value for bucket 50, then the submission would include the bucket before 50 with a count of 0, the bucket 50 with the count of 1, and the bucket after 50 with the count of 0." -- Correct.

* "The important thing is that the count is 0 meaning there were no observations with that value. It would only be truly invalid data if the count was greater than 0." -- This is probably not correct, at least for values *smaller* than min, due to a quirk in the code. The reported range is defined as [hist.range[1], hist.range[hist.range.length - 1]]. Notice that the first element is not hist.range[0]. But afaict hist.range[0] can have a non-zero count. My guess is that range[0] is an extra bucket added below the min listed in TelemetryHistograms.h, while perhaps we use the max value in TelemetryHistograms.h as the histogram's actual max value. There's a certain amount of sensibility in this, if so.

So to be clear: It appears that the max range value is the histogram's actual max, and values above the max are truncated to the max. In contrast, the min range value is the second-lowest bucket in the histogram, and values below the min range are reported as a bucket smaller than the min. At least, that's what it seems to me is happening.

I'm sorry it took so long for me to get to this. Let me know if I can help more.

[1] http://hg.mozilla.org/mozilla-central/file/244991519f53/toolkit/components/telemetry/TelemetryPing.js#l210
[2] http://hg.mozilla.org/mozilla-central/file/244991519f53/toolkit/components/telemetry/TelemetryHistograms.h
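Pulling that together, a hedged sketch of the bucket classification the validator might use (the function and label names here are mine, not from the code):

# Classify a submitted bucket key against the declared range, following
# the packHistogram quirk described above: the declared max is a hard
# upper bound, but one bucket below the declared min is expected and may
# even carry a non-zero count.
def classify_bucket(bucket, range_min, range_max):
    if bucket > range_max:
        return "invalid"    # values above max are truncated client-side
    if bucket >= range_min:
        return "valid"
    return "underflow"      # the extra below-min bucket (hist.range[0])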
From what I can tell, the modifications to the JSON data look like they'll support the use cases outlined in comment 0. It's tough for me to say for certain that this is what we need without seeing the changes that will come to the Telemetry dashboard to incorporate this information. Taras - Have you had a chance to review this bug? Do you have any feedback? Daniel - You opened this bug. Do the changes as outlined meet your expectations?
Whiteboard: [Telemetry]
I'm not sure what I need to comment on beyond what Justin said. We agreed that metrics should feed a post-processed Telemetry.cpp into some validation routine on the server; now it's up to metrics to figure out how they want to do that.
Depends on: 748417
In addition to validating histograms in the schema, the validation process should flag histograms that were not defined in the schema.
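A sketch of that check (assuming the reference schema is available as a dict keyed by histogram name):

# Sketch: flag any histogram name in a submission that the reference
# schema doesn't define. "histograms" as the submission key is an
# assumption about the document layout.
def unknown_histograms(submission, schema):
    return [name for name in submission.get("histograms", {})
            if name not in schema]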
With daily pings from millions of machines, you *are* going to see corrupt data, particularly without an end-to-end checksum. I guess I'm unclear on what we're trying to accomplish here.
(In reply to Justin Lebar [:jlebar] from comment #18)
> With daily pings from millions of machines, you *are* going to see corrupt
> data, particularly without an end-to-end checksum.
>
> I guess I'm unclear on what we're trying to accomplish here.

We agree that basic client-side validation is not sufficient to prevent corrupted data from being received on the server side. The goal of this bug is to implement a process that validates the server-side data, comparing it against the reference definitions of the histograms. Any data that is invalid will be marked as such in the stored document. This allows us to present only clean, vetted data in our dashboards, as well as to indicate to the user how many invalid data points there were and how many submissions contained invalid data points.
Status: NEW → ASSIGNED
Whiteboard: [Telemetry] → [Telemetry:P1]
Version: 0.1 → unspecified
Target Milestone: Unreviewed → Targeted - JIRA
Whiteboard: [Telemetry:P1] → [JIRA METRICS-845] [Telemetry:P1]
Blocks: 759441
I think this bug can be resolved as METRICS-845 has been resolved.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED