Closed Bug 736189 Opened 13 years ago Closed 13 years ago

Schedule a MapReduce job to validate Telemetry data in HBase daily

Categories

(Mozilla Metrics :: Data/Backend Reports, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED
Moved to JIRA

People

(Reporter: dre, Assigned: aphadke)

References

Details

(Whiteboard: [JIRA METRICS-845] [Telemetry:P1])

If we just drop bad data points, then we blind ourselves to potential systemic problems. However, we cannot let these invalid data points skew the measures of the valid data. I propose that the MR job scan each new submission and evaluate whether each histogram bucket key is within the defined bounds declared by the histogram. If it is outside those bounds, the entries will be removed and the counts adjusted. A new section of data for that histogram will be created that reports the number of invalid entries, possibly breaking that total down by whether it was outside the MIN or the MAX. Once we have corrected and annotated the data, we can have Nagios alerts that will fire if any of these error counts gets "too high". We can also use those annotations in the dashboards to alert users as to the volume of invalid entries.
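Roughly, the per-histogram step could look like the following sketch (Python for illustration; the "invalid_values" annotation name and the below_min/above_max breakdown are placeholders, not a final format):

# Minimal sketch of the proposed per-submission validation pass.
# Assumes each histogram is a dict with "range" and "values" keys as in
# the Telemetry JSON; the "invalid_values" field is a placeholder name.
def validate_histogram(hist):
    lo, hi = hist["range"][0], hist["range"][-1]
    invalid = {"below_min": 0, "above_max": 0}
    kept = {}
    for key, count in hist["values"].items():
        bucket = int(key)
        if bucket < lo:
            invalid["below_min"] += count
        elif bucket > hi:
            invalid["above_max"] += count
        else:
            kept[key] = count
    hist["values"] = kept
    hist["invalid_values"] = invalid  # annotation consumed by alerts/dashboards
    return hist

The Nagios alerts and dashboard annotations would then only need to read the invalid_values section rather than rescan the raw data.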
Anurag, is there any chance you might be up to taking a crack at building this MR job?
Assignee: xstevens → aphadke
I've found another issue while studying simpleMeasurements.uptime. Some of the values are negative and some are implausibly large (the values are in minutes, so some work out to roughly 11 years, for example). We should probably remove these as well.
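For instance, the filter could be as simple as this sketch (the one-year cutoff is an illustrative assumption, not an agreed threshold):

# Sketch of a sanity filter for simpleMeasurements.uptime (in minutes).
# The upper bound is an illustrative assumption, not an agreed cutoff.
MAX_UPTIME_MINUTES = 60 * 24 * 365  # roughly one year

def uptime_is_valid(uptime_minutes):
    return 0 <= uptime_minutes <= MAX_UPTIME_MINUTES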
lmandel: Take a look at http://aphadke.pastebin.mozilla.org/1529693. The top one is the original (incorrect) JSON and the bottom one is the corrected version. Is this in line with what you want?
Looking at the pastebin, I see that you've introduced a valid range (or did this previously exist?). You then drop values outside of the range. In one case you list the invalid values, but you only list the value 4, even though it looks like there is a value 0 that is invalid and dropped as well. When will values be listed as invalid? Can you get the total count of invalid values, as Daniel suggests in comment 0? Where are the valid ranges specified?
Yes, I did introduce the fake value "4": 10020.

Regarding value 0: afaik, the value "0" is valid and should not be dropped. Is that not true?

Do you want the count of invalid values per JSON, or in an end-of-day report?

The valid ranges are specified in the JSON itself. So, for ZIPARCHIVE_CRC:
"range": [
  1,
  2
],
and for MEMORY_RESIDENT:
"range": [
  32768,
  1048576
],
and so on.
lmandel: is the above in line with what you want?
(In reply to aphadke from comment #5)
> yes, I did introduce the fake value "4" : 10020
>
> wrt value 0, afaik, the value "0" is valid and should not be dropped, is
> that not true?

Let's look at a specific example for EARLY_GLUESTARTUP_HARD_FAULTS.

"EARLY_GLUESTARTUP_HARD_FAULTS": {
  "range": [
    1,
    100
  ],
  "bucket_count": 12,
  "histogram_type": 1,
  "values": {
    "0": 1,
    "1": 0
  },
  "sum": 0
},

gets converted to

"EARLY_GLUESTARTUP_HARD_FAULTS": {
  "range": [
    1,
    100
  ],
  "bucket_count": 12,
  "histogram_type": 1,
  "values": {
    "1": 0
  },
  "sum": 0
},

In this case I see that there is a value of "0" that is submitted and filtered from the second "gets converted to" view of the data. However, I don't see an "invalid_values" entry. Is this expected?

> Do you want the count of invalid values per JSON or end of day report?

As Daniel suggested in comment 0, I think the values are required in the frontend dashboards in order to give developers a view of the size of invalid data for their probes. It sounds to me like this means that the count of invalid values is needed in the JSON data... but feel free to correct me.

> the valid ranges are specified in the JSON itself:
>
> so, for: ZIPARCHIVE_CRC
> "range": [
>   1,
>   2
> ],
>
> for: MEMORY_RESIDENT
> "range": [
>   32768,
>   1048576
> ],
>
> and so on..

Are the ranges needed in the data? Are the ranges used by a consumer of the JSON such as the frontend dashboard?
We use it to make sure we don't aggregate data generated with different parameters.
> Let's look at a specific example for EARLY_GLUESTARTUP_HARD_FAULTS.
>
> "EARLY_GLUESTARTUP_HARD_FAULTS": {
>   "range": [
>     1,
>     100
>   ],
>   "bucket_count": 12,
>   "histogram_type": 1,
>   "values": {
>     "0": 1,
>     "1": 0
>   },
>   "sum": 0
> },
>
> gets converted to
>
> "EARLY_GLUESTARTUP_HARD_FAULTS": {
>   "range": [
>     1,
>     100
>   ],
>   "bucket_count": 12,
>   "histogram_type": 1,
>   "values": {
>     "1": 0
>   },
>   "sum": 0
> },
>
> In this case I see that there is a value of "0" that is submitted and
> filtered from the second "gets converted to" view of the data. However, I
> don't see an "invalid_values" entry. Is this expected?

The dropped "0" was a bug on my part and it's been fixed. The value "0" is not dropped since it's valid.

The count of invalid values will exist as part of the JSON, as developers need the count.
(In reply to aphadke from comment #9)
> The dropped "0" was a bug on my part and it's been fixed. The value "0" is
> not dropped since it's valid.
>
> The count of invalid values will exist as part of the JSON, as developers
> need the count.

If the range is 1-100, how is the value "0" within the valid range?
> If the range is 1-100, how is the value "0" within the valid range?

I really don't know. I was told that value "0" is a special value and should not fall under invalid_values.
Cause Telemetry is weird. :)

From what I remember of what Taras told me, Telemetry always adds bounding edges to the buckets submitted. If the range was 1-100 and the submission had only one value for bucket 50, then the submission would include the bucket before 50 with a count of 0, the bucket 50 with a count of 1, and the bucket after 50 with a count of 0. In the case where the min bucket contains a value, Telemetry will include an element for the bucket before the min value with a count of 0.

The important thing is that the count is 0, meaning there were no observations with that value. It would only be truly invalid data if the count was greater than 0.
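To make the padding concrete, here's a hypothetical submission fragment illustrating that behavior (the neighboring bucket keys assume linearly spaced buckets for simplicity; real histograms may use exponential bucketing):

# Single observation in bucket 50 of a 1-100 histogram, as described above.
submitted_values = {
    "49": 0,  # bucket before the observed one, padded with count 0
    "50": 1,  # the actual observation
    "51": 0,  # bucket after the observed one, padded with count 0
}

# If the min bucket itself has observations, an extra zero-count bucket
# below the min is included, e.g. {"0": 0, "1": 3, "2": 0} for a
# histogram whose declared range starts at 1.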
Nathan, can you comment on whether comment 12 is accurate?
I had a look through the code. With the caveat that I'm completely new to this code, here's what I understand. The relevant function is TelemetryPing.js::packHistogram [1].

* Range is the histogram's range as defined in TelemetryHistograms.h [2]. This is the intrinsic min/max of the histogram and isn't affected by the data in the histogram. But see below for a quirk about the histogram's min.

* "If the range was 1 - 100 and the submission had only one value for bucket 50, then the submission would include the bucket before 50 with a count of 0, the bucket 50 with the count of 1, and the bucket after 50 with the count of 0." -- Correct.

* "The important thing is that the count is 0 meaning there were no observations with that value. It would only be truly invalid data if the count was greater than 0." -- This is probably not correct, at least for values *smaller* than min, due to a quirk in the code. The reported range is defined as [hist.range[1], hist.range[hist.range.length - 1]]. Notice that the first element is not hist.range[0]. But afaict hist.range[0] can have a non-zero count. My guess is that range[0] is an extra bucket added below the min listed in TelemetryHistograms.h, while perhaps we use the max value in TelemetryHistograms.h as the histogram's actual max value. There's a certain amount of sensibility in this, if so.

So to be clear: It appears that the max range value is the histogram's actual max, and values above the max are truncated to the max. In contrast, the min range value is the second-lowest bucket in the histogram, and values below the min range are reported as a bucket smaller than the min. At least, that's what it seems to me is happening.

I'm sorry it took so long for me to get to this. Let me know if I can help more.

[1] http://hg.mozilla.org/mozilla-central/file/244991519f53/toolkit/components/telemetry/TelemetryPing.js#l210
[2] http://hg.mozilla.org/mozilla-central/file/244991519f53/toolkit/components/telemetry/TelemetryHistograms.h
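Pulling that together, a hedged sketch of the bucket classification the validator might use (the function and label names here are mine, not from the code):

# Classify a submitted bucket key against the declared range, following
# the packHistogram quirk described above: the declared max is a hard
# upper bound, but one bucket below the declared min is expected and may
# even carry a non-zero count.
def classify_bucket(bucket, range_min, range_max):
    if bucket > range_max:
        return "invalid"    # values above max are truncated client-side
    if bucket >= range_min:
        return "valid"
    return "underflow"      # the extra below-min bucket (hist.range[0])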
From what I can tell, the modifications to the JSON data look like they'll support the use cases outlined in comment 0. It's tough for me to say for certain that this is what we need without seeing the changes that will come to the Telemetry dashboard to incorporate this information. Taras - Have you had a chance to review this bug? Do you have any feedback? Daniel - You opened this bug. Do the changes as outlined meet your expectations?
Whiteboard: [Telemetry]
I'm not sure what I need to comment on beyond what Justin said. We agreed that metrics should feed a post-processed Telemetry.cpp into some validation routine on the server; now it's up to metrics to figure out how they want to do that.
Depends on: 748417
In addition to validating histograms in the schema, the validation process should flag histograms that were not defined in the schema.
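A sketch of that check (assuming the reference schema is available as a dict keyed by histogram name):

# Sketch: flag any histogram name in a submission that the reference
# schema doesn't define. "histograms" as the submission key is an
# assumption about the document layout.
def unknown_histograms(submission, schema):
    return [name for name in submission.get("histograms", {})
            if name not in schema]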
With daily pings from millions of machines, you *are* going to see corrupt data, particularly without an end-to-end checksum. I guess I'm unclear on what we're trying to accomplish here.
(In reply to Justin Lebar [:jlebar] from comment #18)
> With daily pings from millions of machines, you *are* going to see corrupt
> data, particularly without an end-to-end checksum.
>
> I guess I'm unclear on what we're trying to accomplish here.

We agree that basic client-side validation is not sufficient to prevent corrupted data from being received on the server side. The goal of this bug is to implement a process that validates the server-side data, comparing it against the reference definitions of the histograms. Any data that is invalid will be marked as such in the stored document. This allows us to present only clean, vetted data in our dashboards, as well as to indicate to the user how many invalid data points there were and how many submissions contained invalid data points.
Status: NEW → ASSIGNED
Whiteboard: [Telemetry] → [Telemetry:P1]
Version: 0.1 → unspecified
Target Milestone: Unreviewed → Targeted - JIRA
Whiteboard: [Telemetry:P1] → [JIRA METRICS-845] [Telemetry:P1]
Blocks: 759441
I think this bug can be resolved as METRICS-845 has been resolved.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED