TelemetryIPCAccumulator::IPCTimerFired timer fires too often
Categories (Toolkit :: Telemetry, enhancement, P3)
People (Reporter: smaug, Unassigned)
References (Blocks 1 open bug)
Attachments (5 obsolete files)
Comment 13•5 years ago
A couple of notes now that I have a partial solution figured out:
- Dispatching to the main thread using nsThreadUtils is tricky because, if the main thread is unavailable, the dispatch helpers leak their already_AddRefed runnable. Which isn't great. (See the sketch after this list.)
- Given that less than half a percent of batches report any discarded histogram accumulations (situations where somehow we've accumulated more than 5x the high water mark of accumulations before the timer is serviced), I propose we drop the concept of discarded data at the IPC layer and send everything.
- For some reason, setting the batch timeout pref in head.js didn't take effect in every xpcshell test under that folder. Not sure what that's all about. But it did confirm that, in tests where accumulations aren't happening in a steady stream, batches just aren't being sent. Which is pretty great, since that was the point.
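For context, here is a minimal sketch of the dispatch pattern in question, assuming the usual nsThreadUtils helpers; the actual call site in TelemetryIPCAccumulator may differ:

```cpp
// Sketch only: why handing an already_AddRefed runnable to the dispatch
// helpers is awkward late in shutdown. If the main thread can no longer
// accept events, the helper cannot safely release the runnable on the
// calling thread, so the reference is leaked instead.
#include "nsCOMPtr.h"
#include "nsThreadUtils.h"

static void ArmTimerOnMainThreadSketch() {
  nsCOMPtr<nsIRunnable> runnable = NS_NewRunnableFunction(
      "TelemetryIPCAccumulator::ArmTimer (sketch)",
      []() {
        // Arm or service the batch timer here, on the main thread.
      });

  // Ownership is transferred via already_AddRefed. If this dispatch fails
  // (e.g. the main thread is gone), the runnable is leaked rather than
  // released on the wrong thread -- the behaviour lamented above.
  nsresult rv = NS_DispatchToMainThread(runnable.forget());
  if (NS_FAILED(rv)) {
    // Nothing to clean up here: ownership already left via forget().
  }
}
```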
Still more to do on this WIP:
- There's no just-in-case timer (maybe we don't need one? The worst case is we lose 2s of data... but only if it's followed by like 58s of silence and then an untidy teardown).
- Clarify shutdown stuff. I pulled the content-child-shutdown observer which flushed accumulations on orderly shutdown because it needed a public API that I didn't want to support. Plus, it was registered in MemoryTelemetry of all places. Weird. I'll probably add the observer to TelemetryIPCAccumulator directly.
- Tests
Comment 14•5 years ago
This was only used to ensure we flushed to IPC on content-child-shutdown.
We can do that internally.
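A hedged sketch of what "doing that internally" could look like: the Telemetry IPC code registers its own observer and flushes when the notification fires. The class name and the flush call are placeholders, not the landed patch:

```cpp
// Sketch only: flush pending accumulations from inside the Telemetry IPC
// code when the content child begins an orderly shutdown. The real patch
// may wire this up differently.
#include <string.h>
#include "mozilla/Services.h"
#include "nsCOMPtr.h"
#include "nsIObserver.h"
#include "nsIObserverService.h"

class TelemetryIPCShutdownFlusher final : public nsIObserver {
 public:
  NS_DECL_ISUPPORTS

  NS_IMETHOD Observe(nsISupports* /*aSubject*/, const char* aTopic,
                     const char16_t* /*aData*/) override {
    if (!strcmp(aTopic, "content-child-shutdown")) {
      // Placeholder: send whatever is batched over IPC before the channel
      // closes, e.g. via the same path the batch timer uses.
    }
    return NS_OK;
  }

 private:
  ~TelemetryIPCShutdownFlusher() = default;
};

NS_IMPL_ISUPPORTS(TelemetryIPCShutdownFlusher, nsIObserver)

static void RegisterShutdownFlusher() {
  nsCOMPtr<nsIObserverService> os = mozilla::services::GetObserverService();
  if (os) {
    nsCOMPtr<nsIObserver> flusher = new TelemetryIPCShutdownFlusher();
    os->AddObserver(flusher, "content-child-shutdown", /* ownsWeak */ false);
  }
}
```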
Comment 15•5 years ago
Depends on D42918
Comment 16•5 years ago
For some reason I have to specify this in individual tests as well as in head.js. Woo.
Depends on D42919
Comment 17•5 years ago
Less than half a percent of batches in Nightly 70 (less than an eighth of a percent in Beta 69) discard histogram accumulations because of overshooting the watermark 5x while the main thread is busy.
(There are essentially zero discards of anything else. Histogram accumulations are the only thing with a high enough volume to trigger the discarding code.)
So let's send all of the data that comes in, then. This saves us some storage, plus a bit of data collection.
Depends on D42920
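For illustration, a rough sketch of the behaviour being removed; the constant names and accumulation type are placeholders rather than the real code:

```cpp
// Sketch only: previously, accumulations arriving after the batch had
// overshot a multiple of the high water mark were dropped. The proposal is
// to keep appending and send everything in the next batch.
#include <cstdint>
#include "nsTArray.h"

struct HistogramAccumulation {  // placeholder for the real accumulation type
  uint32_t mId;
  uint32_t mSample;
};

static const size_t kHighWaterMark = 5 * 1024;  // placeholder threshold
static const size_t kDiscardFactor = 5;         // "5x the high water mark"

void AccumulateToBatch(nsTArray<HistogramAccumulation>& aBatch,
                       const HistogramAccumulation& aValue) {
  // Old behaviour (what this patch drops): silently discard once the batch
  // grows past kDiscardFactor * kHighWaterMark while waiting on the main
  // thread.
  //
  // if (aBatch.Length() >= kDiscardFactor * kHighWaterMark) {
  //   return;  // accumulation discarded, discard counter bumped
  // }

  aBatch.AppendElement(aValue);

  // Crossing the high water mark still requests a prompt send so bursts
  // don't sit around waiting for the timer.
  if (aBatch.Length() >= kHighWaterMark) {
    // ArmIPCTimerOrSendNow();  // placeholder
  }
}
```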
Comment 18•5 years ago
Bopping this down to P2 as I don't want it to land until early 71 anyway, and there are a few other things I should look into before then.
Current state:
- There is no just-in-case timer. In my opinion we don't need one. Even for infrequently-collecting processes such a timer wouldn't improve data reliability, just data freshness. And then only if the process collects less frequently than the just-in-case timer fires.
- The patches as presented are mostly green (try run), though it's possible there are some tests that check non-parent collections I haven't run into yet.
Comment 19•5 years ago
Was chatting with Travis and had a thought:
- The final SendBatch in DeInit is happening on the main thread, so it should fire the IPC immediately without dispatch to idle. Otherwise the event queue might not accept dispatches that late.
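A sketch of that thought, assuming DeInit does run on the main thread and that a SendBatch-style helper exists (the names here are placeholders):

```cpp
// Sketch only: at DeInit time, fire the final IPC send synchronously instead
// of going through an idle dispatch that the shutting-down event queue might
// refuse or never service.
#include "mozilla/Assertions.h"
#include "nsThreadUtils.h"

void SendFinalBatchOnDeInitSketch() {
  MOZ_ASSERT(NS_IsMainThread(), "DeInit is expected to run on the main thread");

  // Do NOT do something like:
  //   NS_DispatchToCurrentThreadQueue(runnable.forget(),
  //                                   mozilla::EventQueuePriority::Idle);
  // That dispatch can be rejected this late in shutdown, silently dropping
  // the last batch.

  // Instead, call the send path directly while we still can:
  // SendBatchNow();  // placeholder for the real IPC send
}
```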
Comment 20•5 years ago
Another update: Was chatting with IPC folks on IRC today and it turns out there's no guarantee of tidy teardown in subprocesses. There might be a tidy teardown in some build configurations for some platforms today, but that isn't guaranteed to happen every time, and it certainly isn't guaranteed to not change over time.
Which is a bit of a pickle. I was kinda pinning my hopes on a "we're about to close the pipe" notice in non-crashing cases to kick us into sending a final batch. Instead, the only promises we can make about subprocess data collections are freshness promises.
Which means I'm gonna want to decrease the batch timeout a bit so that we're losing less data. Which is a bummer, but hopefully with lazy dispatch it should still result in a perf win overall.
Which reminds me, I need to run these patches on talos.
Comment 21•5 years ago
Depends on D42921