Telemetry should provide an error-logging service




3 years ago
a year ago


(Reporter: vladan, Unassigned)



Firefox Tracking Flags

(firefox43 affected)


(Whiteboard: [measurement:client])



3 years ago
I've noticed a pattern of Mozilla devs trying to use Telemetry for error logging. Loop does this by using its own ping type. Other devs create a multitude of keyed histograms for reporting different error conditions (e.g. bug 1124428). Others have asked for a way to report all chrome JS exceptions to Telemetry. I've also observed that chrome JS exceptions often go unnoticed (see bug 1193990).

Devs can already submit external pings to Telemetry, but I think that sets the bar too high -- and we really want to know what's going wrong.

We need an easy way for devs to append important error conditions to an error log and have Telemetry upload that error log and expose the results to the relevant teams.

I'm proposing:

1. Creating a TelemetryErrorLog.jsm API for appending to the application-wide error log
2. A new "error-log" ping type. The ping should contain all logged errors + TelemetryEnvironment + possibly clientID. We'll also want to link this log with Telemetry session IDs. I'm concerned about privacy implications of any absolute or relative timestamps.
3. Telemetry should write out this log at shutdown only -- we can make it crash-proof later (and maybe coordinate with aborted-session writing somehow)
4. We should use Heka on the backend to pull these logs apart, and then feed them either to some rudimentary Telemetry dash or to DataDog(?). Either way, devs should be able to access the app-wide and module-specific logs, likely sampled at some small %
5. Error-logging should require data-collection peer approval and have expiry dates, because of privacy concerns and because we don't want logs to get very large
6. I'm undecided on whether this reporting should be Telemetry base-set or extended-set
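
To make points 1, 2 and 5 concrete, here is a minimal sketch of what such an API could look like. It is an illustration only: `TelemetryErrorLog`, `append`, `buildPing`, the `expires` option and the payload field names are all hypothetical and are not part of any existing Telemetry code. It is written as plain JavaScript rather than as a real .jsm module.

```javascript
// Hypothetical sketch: none of these names exist in Telemetry today.
class TelemetryErrorLog {
  constructor({ maxEntries = 100 } = {}) {
    this._maxEntries = maxEntries; // cap so the log cannot grow without bound
    this._entries = [];
    this._dropped = 0;
  }

  // Append one error for `module`. `expires` is a date string after which
  // the entry is no longer recorded (the expiry requirement in point 5).
  append(module, message, { expires, now = new Date() } = {}) {
    const expiry = Date.parse(expires);
    // A missing, unparsable or elapsed expiry all mean "do not record".
    if (!(now.getTime() < expiry)) {
      return false;
    }
    if (this._entries.length >= this._maxEntries) {
      this._dropped++;
      return false;
    }
    // Deliberately no timestamps, given the privacy concern in point 2.
    this._entries.push({ module, message: String(message) });
    return true;
  }

  // Build the payload of the proposed "error-log" ping (point 2).
  // The caller would write this out at shutdown (point 3).
  buildPing({ sessionId, environment, clientId = null }) {
    const ping = {
      type: "error-log",
      sessionId,
      environment,
      errors: this._entries.slice(),
      dropped: this._dropped,
    };
    if (clientId !== null) {
      ping.clientId = clientId; // clientID is optional in the proposal
    }
    return ping;
  }
}
```

A caller might then do `log.append("sync", "401 with fresh credentials", { expires: "2099-01-01" })` and hand the result of `log.buildPing(...)` to whatever submits pings. Tracking the dropped count is one possible way to keep the log small without silently hiding overflow.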
This is something of interest to FxA and Sync. Bug 1124428 is broadly interested in 2 different error cases:

1) Where a server response indicated some failure condition. For these we are really just trying to identify what these responses are and to identify client bugs that cause us to get into a bad state - e.g., do we get authentication errors when, from the client's POV, it's quite sure it has good credentials?

2) Where we have an "unexpected" exception - and getting useful info from these is tricky. For example, we can tell that an unacceptably high number of users see exceptions applying a remote bookmark - but we can't get enough information about what the error is. In an ideal world, we'd be able to record the signature of the exception, including a possibly sanitized stack-trace - much like chrome-hangs. This might point us at the problem being, say, a silly bug in Sync itself, a possible corruption of the places database, or some other error we've never considered.
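
As a rough illustration of what "the signature of the exception, including a possibly sanitized stack-trace" could mean for case (2), here is a minimal sketch. `sanitizeStack` and `errorSignature` are hypothetical helpers, not existing APIs, and this is not how chrome-hangs reporting works. It assumes SpiderMonkey-style `func@url:line:column` frames and assumes that only `resource://` and `chrome://` locations are safe to report; the paths in the example are illustrative.

```javascript
// Hypothetical helper: keep the top frames of a stack, redacting any
// location that is not built-in code (assumed here: resource:// or chrome://).
function sanitizeStack(stack, maxFrames = 5) {
  const safeLocation = /^((?:resource|chrome):\/\/[^\s?#:]+):(\d+)(?::\d+)?$/;
  return String(stack || "")
    .split("\n")
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .slice(0, maxFrames)
    .map(frame => {
      const at = frame.lastIndexOf("@");
      const func = (at >= 0 ? frame.slice(0, at) : "") || "(anonymous)";
      const location = at >= 0 ? frame.slice(at + 1) : frame;
      const match = safeLocation.exec(location);
      // File paths and web URLs could identify the user, so drop them whole.
      // For safe locations keep URL and line number, but not the column.
      return match ? `${func}@${match[1]}:${match[2]}` : `${func}@(redacted)`;
    });
}

// Hypothetical helper: one string per distinct failure, suitable for counting.
function errorSignature(name, stack) {
  return `${name}: ${sanitizeStack(stack).join(" | ")}`;
}

const exampleStack =
  "applyIncoming@resource://services-sync/engines/bookmarks.js:120:9\n" +
  "readFile@file:///home/alice/profile/x.js:3:1\n";
console.log(errorSignature("TypeError", exampleStack));
// prints "TypeError: applyIncoming@resource://services-sync/engines/bookmarks.js:120 | readFile@(redacted)"
```

Grouping reports by such a signature would at least separate "a bug in Sync itself" from failures that originate elsewhere, without uploading raw exception messages or local paths.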

That Google doc and comment 0 imply that this proposal would make (1) better, but it's not clear how it would help (2), which, for our use cases, is probably a bigger issue.
Georg and I are doing some work on stack capturing in Bug 1225851. If it lands, we could make use of Telemetry.captureStack() here. Mark, would your use case require capturing JavaScript call-stacks? Or is it code in C++ you are interested in?
Flags: needinfo?(markh)
(In reply to Iaroslav Sheptykin from comment #3)
> Mark, would your
> use case require capturing JavaScript call-stacks? Or is it code in C++ you
> are interested in?

js. That bug looks very close to what I want! I'm not sure the "key" semantics are useful when the stack isn't predictable, but I'll leave further comment on that until I've actually looked :)
Flags: needinfo?(markh)
Priority: -- → P5
Whiteboard: [measurement:client]