Closed Bug 1438335 Opened 2 years ago Closed 2 years ago

Accumulating histogram values >INT_MAX overflows

Categories

(Toolkit :: Telemetry, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox59 --- fixed
firefox60 --- fixed

People

(Reporter: chutten, Assigned: chutten)

References

Details

Attachments

(4 files)

Here's a fun one:

Open a console on about:telemetry and:

let a = Telemetry.getHistogramById("TELEMETRY_TEST_LINEAR")
a.snapshot().ranges
a.snapshot().counts
a.add(2149594584)
a.snapshot().counts

For some reason 2149594584 accumulates to the 0 bucket.

In fact, near as I can tell, values between 2^31 and 2^32 all accumulate to the 0 bucket.

2^31 - 1 accumulates to the high bucket
2^32 + 1 accumulates to the 1 bucket.

There's something odd going on here. It might be JS::ToUint32 misbehaving, or it might be something in our handling. Time to investigate.
I'm 80% sure this is a type coercion going through a signed 32-bit representation and having a Bad Time, but that theory'll have to wait until tomorrow.
I was right.

Accumulate and JSHistogram's add both treat intputs as uint32_t. We don't support negative values, and 32-bit widths seemed like a good idea at the time. So far so good.

The base::Histogram class that actually takes the values and stores the counts, however, takes int[1]. Telemetry's internal_HistogramAdd takes an int32_t[2].

And so we coerce silently from unsigned to signed (because C++, that's why) which falls into the "low" bucket because it is less than 1.

*sigh* I wish this was written in Rust.

[1]: https://searchfox.org/mozilla-central/rev/4234703a532006c5ef9ce09b4c487d88124526a0/ipc/chromium/src/base/histogram.cc#107
[2]: https://searchfox.org/mozilla-central/rev/4234703a532006c5ef9ce09b4c487d88124526a0/toolkit/components/telemetry/TelemetryHistogram.cpp#598
I plan on changing our behaviour so that values > 2^31 received from both the JS and C++ accumulation APIs are clamped to exactly 2^31 before being stored.

(( This is instead of JS taking the value mod 2^32 and then (along with C++ values) allowing it to end up being negative if it lies between 2^31 and 2^32. ))

I also plan to document and test this 2^31 limit.

Sound like a valid plan, :Dexter, :gfritzsche?
Flags: needinfo?(gfritzsche)
Flags: needinfo?(alessio.placitelli)
(In reply to Chris H-C :chutten from comment #2)
> The base::Histogram class that actually takes the values and stores the
> counts, however, takes int[1]. Telemetry's internal_HistogramAdd takes an
> int32_t[2].

I assume this means we were never able to record or transmit a value > 2^31?
With that, this sounds valid, thanks!

I would add that we should visibly log errors.
Should we track this error class in Telemetry?
Flags: needinfo?(gfritzsche)
(In reply to Chris H-C :chutten from comment #3)
> Sound like a valid plan, :Dexter, :gfritzsche?

+1 on this + the extended instrumentation that was discussed on IRC!
Flags: needinfo?(alessio.placitelli)
Depends on: 1438561
(In reply to Chris H-C :chutten from comment #2)
> And so we coerce silently from unsigned to signed (because C++, that's why)
> which falls into the "low" bucket because it is less than 1.
> 
> *sigh* I wish this was written in Rust.

Wow, this happens without even a warning? I can't find any complaints about it in any of our compilers' build logs. But don't we get heaps of signed/unsigned warnings for other things (that we habitually ignore)? Or am I thinking about a different type of use pattern?
Flags: needinfo?(nfroyd)
(In reply to David Major [:dmajor] from comment #6)
> (In reply to Chris H-C :chutten from comment #2)
> > And so we coerce silently from unsigned to signed (because C++, that's why)
> > which falls into the "low" bucket because it is less than 1.
> > 
> > *sigh* I wish this was written in Rust.
> 
> Wow, this happens without even a warning? I can't find any complaints about
> it in any of our compilers' build logs. But don't we get heaps of
> signed/unsigned warnings for other things (that we habitually ignore)? Or am
> I thinking about a different type of use pattern?

I think there's a signed/unsigned implicit coercion flag somewhere, but it's *super* noisy.  (MSVC enables it, I think?  It's very spammy.)  The warning you're probably thinking of is signed/unsigned *comparison*, which is also dangerous!
Flags: needinfo?(nfroyd)
DATA COLLECTION REVIEW

    What questions will you answer with this data?

How much data were we previously recording as 0s, even though they were quite large values?

    Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements? 

We have a bug in our data collection. We need to figure out how big of a deal this is, and whether analyses may have been affected.

    What alternative methods did you consider to answer these questions? Why were they not sufficient?

There are no alternatives. The data was lost at the client side.

    Can current instrumentation answer these questions?

Nope.

    List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the found on the Mozilla wiki.

telemetry.accumulate_clamped_values | Category 1 | bug 1438335
A count of the number of accumulations, keyed by the histogram that was trying to accumulate them, that had to be clamped.


    How long will this data be collected? Choose one of the following:
Forever. I (chutten) wish to monitor this data forever to determine when Telemetry users have a problem.

    What populations will you measure?
pre-release ought to be fine.

    Which release channels?
All pre-release

    Which countries?
All

    Which locales?
All

    Any other filters? Please describe in detail below.
None.

    If this data collection is default on, what is the opt-out mechanism for users?
Data Upload preferences in about:preferences.

    Please provide a general description of how you will analyze this data.
I will eyeball the number of accumulations and inform probe owners if they have a problem.

    Where do you intend to share the results of your analysis?
Bugzilla and email. Though simple analysis will be available to all on telemetry.mozilla.org if you look at the aggregated values.
Comment on attachment 8951356 [details]
bug 1438335 - Clamp the samples accumulated to histograms.

https://reviewboard.mozilla.org/r/220604/#review226720

::: toolkit/components/telemetry/TelemetryHistogram.cpp:1191
(Diff revision 2)
>        LogToBrowserConsole(nsIScriptError::errorFlag, NS_LITERAL_STRING("Not a number"));
>        return true;
>      }
>  
> -    if (!JS::ToUint32(cx, args[0], &value)) {
> +    if (args[0].isNumber() && args[0].toNumber() > UINT32_MAX) {
> +      // Clamp large numerical arguments.

Do you think it's worth expanding this to mention why are we clamping to `UINT32_MAX` here?
Attachment #8951356 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8951357 [details]
bug 1438335 - Document Histogram accumulation clamping behaviour

https://reviewboard.mozilla.org/r/220606/#review226722
Attachment #8951357 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8951358 [details]
bug 1438335 - Test edge-of-representation accumulations to Histograms.

https://reviewboard.mozilla.org/r/220608/#review226724

::: toolkit/components/telemetry/tests/gtest/TestHistograms.cpp:348
(Diff revision 2)
>    uint32_t uCountLast = 0;
>    JS::ToUint32(cx.GetJSContext(), countFirst, &uCountFirst);
>    JS::ToUint32(cx.GetJSContext(), countLast, &uCountLast);
>  
>    const uint32_t kExpectedCountFirst = 2;
> -  const uint32_t kExpectedCountLast = 1;
> +  const uint32_t kExpectedCountLast = 3;

nit: would you mind adding a comment mentioning why we expect 3? We know that now, may be worth adding it in case we stumble on this test in a year or so
Attachment #8951358 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8951427 [details]
bug 1438335 - Record when we have to clamp large Telemetry accumulations.

https://reviewboard.mozilla.org/r/220722/#review226726
Attachment #8951427 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8951356 [details]
bug 1438335 - Clamp the samples accumulated to histograms.

https://reviewboard.mozilla.org/r/220604/#review226720

> Do you think it's worth expanding this to mention why are we clamping to `UINT32_MAX` here?

`value` is uint32_t, so I figured it was implied. I could rewrite as "Clamp large numerical arguments to value's acceptable values"?
When we clamp, we should log an error (to the browser console?).
Flags: needinfo?(chutten)
Blocks: 1438885
Comment on attachment 8951356 [details]
bug 1438335 - Clamp the samples accumulated to histograms.

https://reviewboard.mozilla.org/r/220604/#review227460

::: toolkit/components/telemetry/TelemetryHistogram.cpp:1194
(Diff revision 3)
>  
> -    if (!JS::ToUint32(cx, args[0], &value)) {
> +    if (args[0].isNumber() && args[0].toNumber() > UINT32_MAX) {
> +      // Clamp large numerical arguments to value's acceptable values.
> +      // JS::ToUint32 will take arg[0] modulo 2^32 before returning it, which
> +      // may result in a smaller final value.
> +      value = UINT32_MAX;

Are we logging this error anywhere?
We should log it here or in a follow-up bug, to make it possible for devs or QA to spot when this happens.
Sorry, the patches I submitted are one revision behind. They were hanging around since Comment 21 :S

I'll get the one with the logging up for review soonish.
Flags: needinfo?(chutten)
Comment on attachment 8951427 [details]
bug 1438335 - Record when we have to clamp large Telemetry accumulations.

Redirecting to :francois for data review.
Attachment #8951427 - Flags: review?(liuche) → review?(francois)
Comment on attachment 8951427 [details]
bug 1438335 - Record when we have to clamp large Telemetry accumulations.

https://reviewboard.mozilla.org/r/220722/#review227658

Scalars.yaml looks fine (with change below).

::: toolkit/components/telemetry/Scalars.yaml:1017
(Diff revision 4)
> +      saved-session pings.
> +    expires: never
> +    kind: uint
> +    keyed: true
> +    notification_emails:
> +      - telemetry-client-dev@mozilla.com

Can you please add your own email address here (or someone else's)?

We now ask that every probe be tied to a person (in addition to optionally be CCed to a list).
Attachment #8951427 - Flags: review?(francois) → review+
datareview+

(In reply to Chris H-C :chutten from comment #15)
> DATA COLLECTION REVIEW

1) Is there or will there be **documentation** that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes, in Scalars.yaml.

2) Is there a control mechanism that allows the user to turn the data collection on and off? (Note, for data collection not needed for security purposes, Mozilla provides such a control mechanism) Provide details as to the control mechanism available.

Yes, telemetry setting.

3) If the request is for permanent data collection, is there someone who will monitor the data over time?**

Yes, Chris.

4) Using the **[category system of data types](https://wiki.mozilla.org/Firefox/Data_Collection)** on the Mozilla wiki, what collection type of data do the requested measurements fall under?  **

Category 1.

5) Is the data collection request for default-on or default-off?

Default-on in pre-release channels only.

6) Does the instrumentation include the addition of **any *new* identifiers** (whether anonymous or otherwise; e.g., username, random IDs, etc.  See the appendix for more details)?

No.

7) Is the data collection covered by the existing Firefox privacy notice?

Yes.

8) Does there need to be a check-in in the future to determine whether to renew the data?

No, permanent.
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/01f045f12c5c
Clamp the samples accumulated to histograms. r=Dexter
https://hg.mozilla.org/integration/autoland/rev/a06841d0b44a
Document Histogram accumulation clamping behaviour r=Dexter
https://hg.mozilla.org/integration/autoland/rev/afef6b06366a
Test edge-of-representation accumulations to Histograms. r=Dexter
https://hg.mozilla.org/integration/autoland/rev/4aa002ab5d40
Record when we have to clamp large Telemetry accumulations. r=Dexter,francois
Attachment #8951427 - Flags: review?(liuche)
Summary: Why does accumulating a large power of 2 add a count to the low bucket? → Accumulating histogram values >INT_MAX overflows
Should we uplift this?
Flags: needinfo?(chutten)
Comment on attachment 8951356 [details]
bug 1438335 - Clamp the samples accumulated to histograms.

Approval Request Comment
[Feature/Bug causing the regression]: unknown, this appears to always have been the case
[User impact if declined]: Telemetry from a list of histograms will erroneously report 0s when instead they should report very large numbers.
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: Yes. You can see the data here, if you'd like: https://mzl.la/2CmmpmK
[Needs manual test from QE? If yes, steps to reproduce]: No, I don't think so.
[List of other uplifts needed for the feature/fix]: Just the patches in this bug
[Is the change risky?]: I wouldn't say so, no
[Why is the change risky/not risky?]: It only changes the values reported when they are above INT_MAX in value. This rarely happens.
[String changes made/needed]: None.
Flags: needinfo?(chutten)
Attachment #8951356 - Flags: approval-mozilla-beta?
Comment on attachment 8951356 [details]
bug 1438335 - Clamp the samples accumulated to histograms.

Verified in nightly, let's uplift this for beta 14.
Attachment #8951356 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
One quick clarification, the maximum value permitted to be stored in histograms is INT_MAX - 1[1]. So though we  clamp values coming in from JS to UINT32_MAX, and then clamp values heading out to the histogram storage at INT_MAX, the histogram storage itself performs a _third_ clamp down to INT_MAX - 1.

Huzzah, off-by-one.

[1]: https://searchfox.org/mozilla-central/rev/14d933246211b02f5be21d2e730a57cf087c6606/ipc/chromium/src/base/histogram.cc#107
Depends on: 1440832
(In reply to Chris H-C :chutten from comment #40)
> [Is this code covered by automated tests?]: Yes
> [Has the fix been verified in Nightly?]: Yes. You can see the data here, if
> you'd like: https://mzl.la/2CmmpmK
> [Needs manual test from QE? If yes, steps to reproduce]: No, I don't think
> so.

Setting qe-verify- based on Chris's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.