Open Bug 1311100 Opened 3 years ago Updated 8 months ago

[meta] Fix the measurement semantics for the Telemetry JS API

Categories

(Toolkit :: Telemetry, defect, P3)

defect

Tracking

()

People

(Reporter: Gijs, Unassigned)

References

(Depends on 3 open bugs, Blocks 1 open bug)

Details

(Keywords: meta, Whiteboard: [measurement:client] [measurement:client:project])

Right now we have a lot of code that uses try...catch wrappers around their calls to Services.telemetry because... well, it's not clear, actually.

The current state has a few downsides:
- it's verbose, and annoying to remember
- we regularly include other code in the try catch, for which we'll (often silently) eat errors
- we don't get any feedback (e.g. via telemetry...) about exceptions / issues with our calls to telemetry (e.g. broken params, expired histograms, whatever).

It would be great if there was clarity about the exact circumstances under which telemetry calls fail (ie why we have all these try...catch wrappers). I've asked around, and not really found anybody who knew of something material. I thought it was possible for telemetry to be null if Firefox was built with the right options, but on looking at the code that seems not to be the case.

It would be better if:
- Telemetry guaranteed that all calls that fetch or operate on histograms are failsafe, irrespective of whether it failed to initialize, can't submit data, etc. This could potentially include passing back a no-op shim to getHistogramById calls that pass an unknown histogram ID
- if passed anything broken, telemetry code is responsible for:
-- not throwing JS exceptions
-- reporting errors to the browser console
-- potentially logging opt-in telemetry about broken callsites or something like that.

This would allow removing all the consumer try...catch gunk and having a unified way of finding out what histograms (and/or their callsites) are broken, and in what circumstances telemetry is broken.

As a general principle, it seems to me that no code that does any telemetry gathering genuinely appreciates being interrupted by a JS exception if reporting such telemetry fails.

Does this make sense? Am I missing something? How hard would it be to implement this? Is there anything that prevents us removing these try...catch statements from consumers today?
Flags: needinfo?(gfritzsche)
Flags: needinfo?(chutten)
I have a feeling we shouldn't wrap try/catch around telemetry, but you should work with Georg on this. Some additional notes/thoughts:

1) it was (is?) possible to disable telemetry at build time, so any calls would fail. I don't know whether this is still true, but let's remove this build option if so.

2) Code records telemetry to an unknown histogram. I have a feeling this *should* throw an exception. It indicates a programming error that should break tests.

3) Code records telemetry to an expired/disabled histogram. I'm 99.44% sure this doesn't throw, and it shouldn't.

4) Code records an illegal value. I'm torn on this one. Could we make this not-throw, but if this happens in an automated test we cause the test to fail?
Sadly I don't really know if the current exception-throwing situation is by design or by happenstance.

Also, I don't know if I agree. Exceptions should be thrown in exceptional cases. Accumulating incorrectly, to an expired or unknown histogram... these seem exceptional to me, and exactly the fault of the calling code.

Calling telemetry at the "wrong" time? Yeah, that shouldn't throw (if it does). Timing things incorrectly isn't the caller's fault.
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #2)
> Also, I don't know if I agree. Exceptions should be thrown in exceptional
> cases. Accumulating incorrectly, to an expired or unknown histogram... these
> seem exceptional to me, and exactly the fault of the calling code.

I would disagree on the expired histogram. Given histogram foo, expires in 53 - when do I remove the code accumulating to and/or the histogram itself? I can't do it before merge day because now it'll be removed from 52. I can't do it afterwards because the world will explode with exceptions immediately after the version increment.

I therefore suspect that this case doesn't currently throw, and would support it staying that way.

As for unknown histograms, that might make sense, because those are normally hardcoded constants anyway so we should always hit those in tests etc.

Invalid values to a known histogram, less so, because often the value is computed somehow and so if there are edgecases in that computation they are more likely to not be caught by tests or manual code exercising. Breaking the code that is doing the computation when the value doesn't match what the histogram expects seems wrong and would likely break other running code. It would be more appropriate if we had an alternative route of feeding information about this back through telemetry (e.g. a keyed histogram where the keys are... other histogram IDs!) that increments when we get called with invalid values. That histogram should always be empty, and we should file bugs for cases where it isn't, but I'm less convinced that we should break the caller "in the wild" on release builds if this happens.
Agree with Gijs, his rationale is a better explanation of what I was thinking!

In particular, it is not exceptional to record to an expired histogram. We want teams to be able to leave expired histograms in the tree without penalty, if they decide they may want to resurrect them later.
I think those try-catch patterns we see are there mostly for historic reasons (builds without Telemetry) and are not strictly needed anymore.
Note that the points here apply similarly to scalars and (upcoming) events.

(In reply to Benjamin Smedberg [:bsmedberg] from comment #1)
> 1) it was (is?) possible to disable telemetry at build time, so any calls
> would fail. I don't know whether this is still true, but let's remove this
> build option if so.

This is not true anymore - Telemetry is always enabled now, only recording/reporting/... are toggled.
 
> 2) Code records telemetry to an unknown histogram. I have a feeling this
> *should* throw an exception. It indicates a programming error that should
> break tests.

This throws and i also believe it should.

> 3) Code records telemetry to an expired/disabled histogram. I'm 99.44% sure
> this doesn't throw, and it shouldn't.

It doesn't and i also think it shouldn't.

> 4) Code records an illegal value. I'm torn on this one. Could we make this
> not-throw, but if this happens in an automated test we cause the test to
> fail?

This currently throws, although a bit inconsistently AFAICT.
We should probably adopt to what works best for the Firefox code-base here.
We really need to reliably get test failures for it and get visible errors in the error console.

I think we clearly need to update the nsITelemetry.idl documentation for thrown exceptions too.
Flags: needinfo?(gfritzsche)
Priority: -- → P4
Whiteboard: [measurement:client]
What options do we have to cause test failure (across test harnesses) without throwing exceptions?
(In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> What options do we have to cause test failure (across test harnesses)
> without throwing exceptions?

The simplest thing that I can think of is for telemetry to produce a notification, for which the test harnessess listen either through the observer service or through a listener interface nsITelemetry provides itself.
(In reply to :Gijs Kruitbosch from comment #7)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> > What options do we have to cause test failure (across test harnesses)
> > without throwing exceptions?
> 
> The simplest thing that I can think of is for telemetry to produce a
> notification, for which the test harnessess listen either through the
> observer service or through a listener interface nsITelemetry provides
> itself.

Do we have any existing notifications like that?
(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> (In reply to :Gijs Kruitbosch from comment #7)
> > (In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> > > What options do we have to cause test failure (across test harnesses)
> > > without throwing exceptions?
> > 
> > The simplest thing that I can think of is for telemetry to produce a
> > notification, for which the test harnessess listen either through the
> > observer service or through a listener interface nsITelemetry provides
> > itself.
> 
> Do we have any existing notifications like that?

Not sure what you mean. If you mean "existing observer notifications that would cause test failures if sent during a test" then I'm not aware of any. Most other test-only things (like leak detection) are implemented directly in scripts that parse test output or that run in browser windows. If you preferred, we could potentially do it that way, too, by making Telemetry Cu.reportError() any errors that happen, and then registering a console listener in the test framework that would fail the test based on such errors. This also has the advantage that you'd get a JS stack (assuming you can do a JS Cu.reportError... not sure how much of telemetry is implemented in C++, though we could of course write a jsm wrapper if we thought that was valuable).
So expired histograms *can* still cause problems, specifically if they are "count" histograms, see bug 1275089 and bug 1278388.
I think that's separate. The typical test failure problem is where a test specifically tests telemetry, that telemetry expires, and then the test doesn't see any measurements. The actual recording of telemetry is rarely the failure mode there. I'm still not sure what/whether to do about this, but I think it's separate from this bug.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #11)
> I think that's separate. The typical test failure problem is where a test
> specifically tests telemetry, that telemetry expires, and then the test
> doesn't see any measurements. The actual recording of telemetry is rarely
> the failure mode there. I'm still not sure what/whether to do about this,
> but I think it's separate from this bug.

Well, in this particular case a telemetry recording that initially *wasn't* in a try catch started throwing when the histogram expired, and that broke a test. The idea of comment #0 / filing this bug was that this (ie telemetry throwing exceptions for pretty much anything except getting a non-existant histogram) shouldn't happen. :-)
Depends on: 1315906
Depends on: 1315909
Priority: P4 → P3
Summary: Telemetry should be failsafe and we shouldn't have to try...catch all the calls → Fix Telemetry measurement API semantics
Whiteboard: [measurement:client] → [measurement:client] [measurement:client:project]
(In reply to :Gijs Kruitbosch from comment #12)
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #11)
> > I think that's separate. The typical test failure problem is where a test
> > specifically tests telemetry, that telemetry expires, and then the test
> > doesn't see any measurements. The actual recording of telemetry is rarely
> > the failure mode there. I'm still not sure what/whether to do about this,
> > but I think it's separate from this bug.
> 
> Well, in this particular case a telemetry recording that initially *wasn't*
> in a try catch started throwing when the histogram expired, and that broke a
> test. The idea of comment #0 / filing this bug was that this (ie telemetry
> throwing exceptions for pretty much anything except getting a non-existant
> histogram) shouldn't happen. :-)

This scenario will be fixed by bug 1315906.
Depends on: 1315912
Keywords: meta
Summary: Fix Telemetry measurement API semantics → Fix the measurement semantics for the Telemetry JS API
Summary: Fix the measurement semantics for the Telemetry JS API → [meta] Fix the measurement semantics for the Telemetry JS API
Depends on: 1317702
Depends on: 1321790
Depends on: 1321349
Depends on: 1315910
You need to log in before you can comment on or make changes to this bug.