Closed Bug 1318463 Opened 8 years ago Closed 2 years ago

Add telemetry for url-oxidation failures

Categories

(Core :: Networking, defect, P3)

defect

Tracking

()

RESOLVED INACTIVE

People

(Reporter: manishearth, Unassigned)

References

Details

(Whiteboard: [necko-backlog])

Currently we just log spec differences between nsStandardURL and RustURL. We should be using proper telemetry here.
Ralph, what would you recommend we do to make this happen? I'm not very familiar with the telemetry system and what telemetry we're allowed to include, and what y'all did for the media parsing stuff. This is nightly-only.
Flags: needinfo?(giles)
Chris mentioned that we can't collect URLs (even in nightly). Right now we don't need success numbers; we need to know what kinds of mismatches are out there. I guess we could add a pref that lets people opt in? But nobody will tick that pref. Hm.
#telemetry gave an interesting idea -- we could just prompt for submitting mismatches when one happens.
The main requirement for telemetry is that the collection be meaningful, time-limited, and not personally identifiable. The last part is tricky because of the ease of de-anonymising data sets. Collecting specific urls which parse differently wouldn't be ok, no matter how useful it would be for debugging. Error counts and so on are generally fine; our pipeline aggregates data to trade report latency for privacy. Telemetry basically records event histograms. You can record how often something happens with a Flag (single value histogram), how often something succeeded with a Boolean (two-value histogram), or map things like error codes or execution time to an integer (enumerated histogram). That last can be a bit tedious when you want an enum, but I found anything else too difficult to implement. Basically you add an entry to toolkit/components/telemetry/Histograms.json describing a data collection key. Then you call `Telemetry::Accumulate()` with that key from C++ code and the system will take care of the rest. Some days after landing, data will start showing up under that key at https://telemetry.mozilla.org/ There's also an API if you want to write your own dashboard. In addition to module peer review, you need r+ from a data peer to add a telemetry key. There are some guidelines at https://wiki.mozilla.org/Firefox/Data_Collection along with a list of peers you can request feedback from. For the mp4 parser, we have MEDIA_RUST_MP4PARSE_SUCCESS which is a boolean just logging whether the rust code returned an Result::Ok. This lets us confirm our code is running in the wild and gives the highest-level overview of how it's doing. The failure percentage may or may not be meaningful depending on how much malformed input we get. This is where the '1 billion rust invocations' number came from in Firefox 48. Next we added MEDIA_RUST_MP4PARSE_ERROR_CODE which maps the various Result::Err variants to integers. We include the map in the histogram description so it shows up at the bottom of dashboard plots. This gives us more specific information about how parsing is failing which helps concentrate our efforts. (We get a lot of malformed input, but there are still mp4 features we don't implement.) Finally, we report MEDIA_RUST_MP4PARSE_TRACK_MATCH_AUDIO and MEDIA_RUST_MP4PARSE_TRACK_MATCH_VIDEO which are another pair of booleans to record whether the parse result returned by the new rust code matches what was returned by the C++ implementation we want to replace. Like you, we run (not performance critical!) data through both implementations. This is our main metric for when we can transition to the new implementation. We also have pref to assert when the tracks don't match, which is useful in tracking down reproduction cases, which we can't report directly over telemetry but can take voluntary bug reports for.
Okay. We eventually need something like that, but rust-url is not yet in a state where we want to start counting successes; we still want to look for bugs first. Having a dialog box might help here. I'm not sure if that would be too annoying (I'll ask around). In the meantime I guess we should focus on making it pass try.
Flags: needinfo?(giles)
(In reply to Manish Goregaokar [:manishearth] from comment #5) > In the meantime I guess we should focus on making it pass try. Agreed. Let's revisit this once we get it closer to shipping.
Whiteboard: [necko-backlog]
Priority: -- → P1
Priority: P1 → P3
Severity: normal → S3
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INACTIVE
You need to log in before you can comment on or make changes to this bug.