585196 - telemetry infrastructure

Description

14 years ago

We currently gather crash-stats to help us tell how crash-prone our builds are for users in the field. This is good, because we often don't trip on crashes in our own testing, but our field users find all sorts of unusual combinations.

We do not, as far as I know, gather much else. In particular we don't gather performance counters of any sort. This is bad, because we are stuck with performance counters gathered in idealized test runs like those we run in the Talos cluster. No offense intended -- those are good numbers to have! -- but a curated test set is going to make us believe a lot of things are "performing ok" when we just haven't seen how they are performing in the field; we're missing the "real" signal from the larger network of users.

I'd like to start gathering this signal. The counters do not have to be sophisticated or fine-grained. I'd like to gather a handful of coarse-grained numbers broadly from all (or a substantial portion of) our users, and organize them on a server we operate, so that we can get a high-level view of performance problem areas and regressions.

Numbers I'd like to gather (say, min/max/avg for each):

- mapped memory / resident set / private bytes
- possibly the full reporter-set from about:memory
- cycle collection and js gc durations
- number of open file descriptors / OS resource handles
- number of threads, subprocesses, etc.
- startup time
- UI event latency
- page-load, DNS, or other network latencies

Just some very basic counters so we can tell what users are suffering with. They should probably also be binned by product, platform and build-id.
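The min/max/avg-per-counter idea, binned by product, platform and build-id, could be sketched roughly as below. This is only an illustration of the aggregation described above, not an existing Mozilla component: the names `RunningStats`, `CounterStore`, and the sample counter name `startup_ms` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Min/max/avg for one counter, updated one sample at a time."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


class CounterStore:
    """Hypothetical server-side store, one bin per
    (product, platform, build_id, counter) combination."""

    def __init__(self) -> None:
        self._bins = defaultdict(RunningStats)

    def record(self, product, platform, build_id, counter, value):
        self._bins[(product, platform, build_id, counter)].add(value)

    def summary(self, product, platform, build_id, counter):
        # .get() avoids creating an empty bin for an unknown key.
        stats = self._bins.get((product, platform, build_id, counter))
        if stats is None:
            return None
        return {
            "n": stats.count,
            "min": stats.minimum,
            "max": stats.maximum,
            "avg": stats.average,
        }
```

Keeping only a count, a sum, a minimum and a maximum per bin means the store never has to retain individual samples, which fits the "coarse-grained numbers" goal; a regression would show up as a shift in the summary between two build-ids.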

To get this working there's not *much* to do in the client (you can start with any single counter you like; it doesn't matter), but there is probably a lot to do on a server: figuring out which counters (if any) we can collect, whether collection needs to be opt-in or opt-out, how much data we can digest, what kind of storage and processing we can allocate, where to get the bandwidth and CPU to digest millions of such small pings, and how to submit asynchronously and unobtrusively. That kind of thing.
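As a rough sketch of what "submit asynchronously and unobtrusively" might look like on the client, the example below hands pings to a background thread through a bounded queue, so the caller never blocks and a failing upload never surfaces to the user. Everything here is an assumption for illustration: `PingSubmitter`, the `send` callback, and the ping fields are hypothetical, and the real transport is left out entirely.

```python
import json
import queue
import threading


class PingSubmitter:
    """Hypothetical fire-and-forget submitter for small telemetry pings."""

    def __init__(self, send, max_pending=100):
        # `send` is any callable taking a JSON string (for example, an
        # HTTP POST wrapper); it is injected so no real server is assumed.
        self._send = send
        self._queue = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, ping: dict) -> bool:
        """Queue a ping without blocking; drop it if the queue is full."""
        try:
            self._queue.put_nowait(json.dumps(ping))
            return True
        except queue.Full:
            return False

    def _drain(self) -> None:
        while True:
            payload = self._queue.get()
            try:
                self._send(payload)
            except Exception:
                # Reporting must never disturb the application itself.
                pass
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued ping has been handed to `send`."""
        self._queue.join()
```

A caller would do something like `submitter.submit({"counter": "startup_ms", "value": 1234})` and move on. Dropping pings when the queue is full is a deliberate choice in this sketch: for coarse counters collected from many users, losing a few samples is cheaper than stalling the client.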

(Some people have suggested Test Pilot for this, but I think it's not quite the right fit. Those are limited-time trials, and more commonly user-facing. I'm talking about much more pervasive reporting of much simpler and lower-level performance numbers. The "analysis" in each case should be obvious: make the bad numbers go down. We just need to be seeing those bad numbers, and seeing when they change.)

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 1

14 years ago

There's code in extensions/metrics/ that *might* be useful for this sort of thing (though I'm not sure what it's currently set up to measure).  Google folks developed it, and I believe it shipped or ships with Google Toolbar.  It may be designed for more test-pilot-like things, though.

proof-of-concept Telemetry addon, 13 years ago (dormant account), 2.55 KB, application/x-xpinstall
telemetry clientside, 13 years ago (dormant account), 11.73 KB, patch, mossop: review-
screenshot, 13 years ago (dormant account), 74.97 KB, image/png
cycle collector probe, 13 years ago (dormant account), 2.16 KB, patch
cycle collector probe, 13 years ago (dormant account), 1.36 KB, patch, bent.mozilla: review-
cycle collector probe v2, 13 years ago (dormant account), 1.84 KB, patch, bent.mozilla: review+
telemetry clientside + testcase, 13 years ago (dormant account), 16.53 KB, patch
telemetry clientside + testcase, 13 years ago (dormant account), 16.82 KB, patch
telemetry clientside + testcase, 13 years ago (dormant account), 16.84 KB, patch, mossop: review-
telemetry clientside + testcase, 13 years ago (dormant account), 18.68 KB, patch, mossop: review-
telemetry clientside + testcase, 13 years ago (dormant account), 19.38 KB, patch
telemetry clientside + testcase, 13 years ago (dormant account), 19.00 KB, patch, mossop: review+