Histogram Regression Detector - fatal error comparing histograms

Status: RESOLVED FIXED
Type: defect
Reporter: azhang (Unassigned)
Tracking: Trunk
Points: ---
Firefox Tracking Flags: firefox48 affected

GC_REASON_2 changed from 102 to 101 buckets around February 17, 2016. This is causing fatal errors in the histogram regression detector and in anything else that relies on the number of buckets matching the data size.

See [1] for an example of this. The data for Feb 16 has 102 entries, while the data for Feb 17 has 101.

I'm not sure yet which patch is causing this, or even whether the problem is in the aggregator or in the histogram definition. However, this is one of the reasons Medusa email alerts haven't been going out since mid-February.

[1]: https://aggregates.telemetry.mozilla.org/aggregates_by/submission_date/channels/nightly/?version=43&dates=20160216,20160217&metric=GC_REASON_2
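
For illustration, a minimal sketch of the failure mode (hypothetical code, not the detector's actual implementation): any comparison that assumes both days share the same bucket layout fails as soon as the lengths differ.

def compare_days(values_feb16, values_feb17):
    """Compare two daily snapshots of the same histogram bucket-by-bucket.

    Raises ValueError on a bucket-count mismatch -- the kind of fatal error
    the regression detector hit here (102 entries vs 101).
    """
    if len(values_feb16) != len(values_feb17):
        raise ValueError("bucket count mismatch: %d vs %d"
                         % (len(values_feb16), len(values_feb17)))
    # ... normalize counts and run the actual regression test ...
    return [b - a for a, b in zip(values_feb16, values_feb17)]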
Priority: -- → P1
Whiteboard: [measurement:client]
From a quick look, this seems to be related to bug 920169.
I'm looking into the client side.
Should we handle the client side here or in another bug?

Other notes:
* can we have alerts on that?
* should the aggregator get some hardening to not break all aggregations from one erroneous histogram?
* if we restore the previous bucket count, how do we handle the mismatching GC_REASON_2 submissions?
(In reply to Georg Fritzsche [:gfritzsche] from comment #2)
> I'm looking into the client side.
> Should we handle the client side here or in another bug?

We can handle the client side here. 

> Other notes:
> * can we have alerts on that?
> * should the aggregator get some hardening to not break all aggregations

This bug is not breaking all aggregations in the aggregator. 

> from one erroneous histogram?

We have considered histograms to be immutable until now and I would like this to continue to be the case. Since several systems depend on that assumption (cerberus, medusa, moztelemetry, mozaggregator, telemetry-batch-view, and possibly others), could we add some checks on the client to prevent this from happening again? The alternative is to add checks to all the downstream systems that depend on that assumption.

> * if we restore the previous bucket count, how do we handle the mismatching
> GC_REASON_2 submissions?

We are going to have to handle different bucket lengths in cerberus or add some special casing to ignore histograms with mismatching buckets.
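
A rough sketch of that special casing, assuming the detector walks a series of (date, values) snapshots; the names are illustrative, not cerberus's actual code:

import logging

log = logging.getLogger(__name__)

def usable_snapshots(expected_buckets, series):
    """Yield only the snapshots whose length matches the expected bucket
    layout, logging and skipping the rest instead of aborting the whole
    detection run.  Illustrative sketch only."""
    for date, values in series:
        if len(values) != expected_buckets:
            log.warning("skipping %s: %d buckets, expected %d",
                        date, len(values), expected_buckets)
            continue
        yield date, values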
Some context:
GC_REASON_2 n_values was JS::gcreason::TELEMETRY_REASONS, which was apparently always 100:
https://hg.mozilla.org/mozilla-central/rev/e0070650c153#l6.68

Client-side, nothing seems to have changed; I see GC_REASON_2.bucket_count as 101 on both my current Nightly and one from 2015-06-01.
Maybe the change here:
https://hg.mozilla.org/mozilla-central/diff/c89eba9442f4/toolkit/components/telemetry/Histograms.json
... triggered the aggregator to start using the Histograms.json values (they went from expressions to numbers), whereas it previously used a different hard-coded value?
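
For reference, the 100 vs. 101 numbers are consistent with an enumerated histogram getting one extra bucket on top of n_values. A rough sketch of a linear bucket layout that approximates what the client and moztelemetry compute (treat the formula as illustrative, not canonical):

def linear_buckets(dmin, dmax, n_buckets):
    """Approximate Telemetry's linear bucket layout: bucket 0 is always 0,
    the remaining buckets are spread linearly from dmin to dmax."""
    buckets = [0] * n_buckets
    for i in range(1, n_buckets):
        linear_range = (dmin * (n_buckets - 1 - i) + dmax * (i - 1)) / float(n_buckets - 2)
        buckets[i] = int(linear_range + 0.5)
    return buckets

# An enumerated histogram with n_values = 100 ends up with 101 buckets,
# which matches the GC_REASON_2.bucket_count seen on the client.
assert len(linear_buckets(1, 100, 101)) == 101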
Priority: P1 → --
Whiteboard: [measurement:client]
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #3)
> We have considered histograms to be immutable until now and I would like
> this to continue to be the case. Since several systems depend on that
> assumption (cerberus, medusa, moztelemetry, mozaggregator,
> telemetry-batch-view and possibly others), could we add some checks on the
> client to avoid this from happening again? The alternative is to add checks
> on all the systems downstream that depend on that assumptions.

Good point, though separate from this bug. I don't see us covering that easily in a client build, but maybe we could look into a commit hook that enforces this on changes to Histograms.json.
I'll take it to a different bug.
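
One possible shape for such a hook, purely as a sketch (the field list is an assumption, and a real check would have to handle the expression-valued entries the old file used):

import json

# Fields that determine a histogram's bucket layout (illustrative list).
BUCKET_FIELDS = ("kind", "low", "high", "n_buckets", "n_values")

def changed_bucket_layouts(old_path, new_path):
    """Return names of histograms whose bucket layout changed between two
    revisions of Histograms.json -- something a commit hook could reject."""
    with open(old_path) as f:
        old = json.load(f)
    with open(new_path) as f:
        new = json.load(f)
    changed = []
    for name, old_def in old.items():
        new_def = new.get(name)
        if new_def is None:
            continue  # removals are a separate policy question
        if any(old_def.get(field) != new_def.get(field) for field in BUCKET_FIELDS):
            changed.append(name)
    return changed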
> Client-side nothing seems to have changed, i see GC_REASON_2.bucket_count as
> 101 on both my current Nightly and one from 2015-06-01.
> Maybe the change here:
> https://hg.mozilla.org/mozilla-central/diff/c89eba9442f4/toolkit/components/
> telemetry/Histograms.json
> ... triggered the aggregator starting to use the histograms.json values
> (they went from expressions to numbers) and before used a different
> hard-coded value?

The issue seems to be due to [1]. Anthony, could you harden cerberus against incompatible histogram sizes? You could just ignore histograms with different sizes and print a warning.

[1] https://github.com/vitillo/python_moztelemetry/blob/master/moztelemetry/histogram.py#L43
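
A sketch of what that hardening could look like at the point where submitted values are matched against the bucket layout derived from the definition (hypothetical helper, not the actual histogram.py API):

import logging

log = logging.getLogger(__name__)

def values_or_none(name, definition_buckets, submitted_values):
    """Return the submitted values only if they fit the definition's bucket
    layout; otherwise warn and return None so callers can drop the entry
    instead of crashing.  Hypothetical helper, not moztelemetry's real API."""
    if len(submitted_values) != len(definition_buckets):
        log.warning("%s: %d values for %d buckets, ignoring entry",
                    name, len(submitted_values), len(definition_buckets))
        return None
    return submitted_values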
Flags: needinfo?(azhang)
I hope you all like email :)
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED