Open Bug 1468096 Opened 6 years ago Updated 2 years ago

TelemetryIPCAccumulator::IPCTimerFired timer fires too often

Categories

(Toolkit :: Telemetry, enhancement, P3)

People

(Reporter: smaug, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(5 obsolete files)

I was looking at which timers run often in a rather de-active background tab/process, and TelemetryIPCAccumulator::IPCTimerFired was one of the common ones. Do we really need to fire it so often?
For the record, we batch all accumulations and send them at most every 2s. The relevant comments are bug 1218576 comments 33 and 34. In short, the timer should only fire if the browser is actively accumulating Telemetry. If it is firing -every- (or most every) 2s, then that means the de-active background process isn't terribly de-active (or maybe it is accumulating something about wakeups, which causes the timer to wake up, which causes us to register the wakeup and loop...). Do you happen to know what Telemetry is being accumulated in that rather de-active background process?
Flags: needinfo?(bugs)
Ack, I wasn't clear in the previous comment. That is how it is -supposed- to work. It is possible it has gone awry. While I'm here, on a marginally-related note: we record how many accumulations we had to discard because the 2s timer is being starved, with a scalar: https://mzl.la/2t2KJ6h (the Beta 61 population has discarded 118B accumulations in child processes so far in its release). Increasing the timer duration will require tuning the memory watermarks (or looking into coalescing).
2s is quite often no matter what. One reason it is firing often might be CC/GC running (I'm planning to make those run less often in the background).
Flags: needinfo?(bugs)
(In reply to Olli Pettay [:smaug] from comment #3)
> 2s is quite often no matter what.

What is a reasonable interval? The trade-off with increasing this interval is that we lose some precision on child process Telemetry. If we set this to, say, 60s and we're sending out a ping, the child process data in the ping might be up to 60 seconds old (+ cross-process send time). That, we worried, might throw off analysis.
Flags: needinfo?(bugs)
Not sure what would be a good interval. But remember that we'll have lots more child processes in the near future, so we should try to reduce wakeups.
Flags: needinfo?(bugs)
If the process is waking up to record Telemetry, maybe it would be best to send it without any (or with a very small) delay so we can do so within the same wakeup interval? When we tested this for 1-4 content processes, though, it had some perf implications. Maybe send the first batch within a wakeup interval's width and increase batch intervals geometrically? Something like: current batch interval = min(MAX_INTERVAL, 100ms * number of sequential previous intervals with data in them) Do we know how long a wakeup keeps us active?
Flags: needinfo?(bugs)
Flags: needinfo?(bugs)
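For illustration, a minimal sketch of the growing-interval idea from the comment above, using hypothetical names (NextBatchInterval, aBusyIntervals); note that the formula as written grows linearly with the number of busy intervals rather than geometrically:

  #include "mozilla/TimeStamp.h"

  // Sketch only: not a real TelemetryIPCAccumulator function.
  static mozilla::TimeDuration NextBatchInterval(uint32_t aBusyIntervals) {
    const mozilla::TimeDuration kMaxInterval =
        mozilla::TimeDuration::FromSeconds(60);  // assumed MAX_INTERVAL
    // current batch interval = min(MAX_INTERVAL,
    //   100ms * number of sequential previous intervals with data in them)
    mozilla::TimeDuration next =
        mozilla::TimeDuration::FromMilliseconds(100) * aBusyIntervals;
    return next < kMaxInterval ? next : kMaxInterval;
  }
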
Vicky, do IPC timer wakes fall into your area of interest?
Flags: needinfo?(vchin)
Could we have some quite long max timeout, and then, when more data is reported in the child process, also check whether we've reported recently, and if not, send the data to the parent process? So, if data hasn't been reported to the parent side in, say, the last 5 seconds and new telemetry data is collected, send all the data. And whenever there is new data, ensure a timer is running; if not, create a low-priority timer and fire it after, hmm, 60s? We'd also need to send the data when the process is about to be shut down. This would miss more data if the child process is crashing, sure. Is this totally unreasonable?
Let's see if I'm understanding you properly:

1. Store a timestamp of the last time we talked to the parent process.
2. Whenever a child telemetry accumulation comes in, check the timestamp:
   a. If it's greater than 5s, send immediately
   b. If it isn't > 5s, start a 60s low-priority timer to batch up accumulations
3. On low-priority timer fire, send batched accumulations

So the first piece of telemetry on wakeup will be sent immediately, but everything else has to wait a minute. I'd be tempted to have a short (100ms? Something short enough to fit within the wakeup interval) batch timer on 2.a. instead of sending immediately as I don't think we'll ever be dealing with exactly 1 accumulation.

Losing scalar/histogram/event data from a crashing child process isn't too much of a concern. AFAIK, crash investigations are based on environment and annotations which are managed by the parent process.

Did I get it right?
Flags: needinfo?(bugs)
(In reply to Chris H-C :chutten from comment #7)
> Vicky, do IPC timer wakes fall into your area of interest?

They do. Olli's on it. :)
Flags: needinfo?(vchin)
(In reply to Chris H-C :chutten from comment #9)
> Let's see if I'm understanding you properly:
>
> 1. Store a timestamp of the last time we talked to the parent process.
> 2. Whenever a child telemetry accumulation comes in, check the timestamp:
>    a. If it's greater than 5s, send immediately
>    b. If it isn't > 5s, start a 60s low-priority timer to batch up
>       accumulations
> 3. On low-priority timer fire, send batched accumulations
>
> So the first piece of telemetry on wakeup will be sent immediately, but
> everything else has to wait a minute. I'd be tempted to have a short (100ms?
> Something short enough to fit within the wakeup interval) batch timer on

The idea would be just to use the same task for telemetry data collecting and, if enough time has passed, reporting. So if telemetry data is being collected basically all the time, roughly every 5s data would be sent to the parent side. The long timer would just be a fallback to ensure all the data gets reported at some point (unless the process crashes).

So, no, I don't mean "everything else" or "first piece". We'd clear the reporting timestamp whenever we send the data to the parent process. And if we then trigger a probe > 5s (and < 60s) later, we send the data to the parent again (and reset the timestamp). Is that clearer?
Flags: needinfo?(bugs)
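For illustration, a hedged sketch of the scheme as clarified in the last two comments, with assumed names (MaybeSendBatch, gLastSendTimestamp, SendAccumulatedDataToParent, EnsureFallbackTimer); the 5s threshold and ~60s fallback come straight from the discussion:

  #include "mozilla/TimeStamp.h"

  void SendAccumulatedDataToParent();  // hypothetical: flush the batch over IPC
  void EnsureFallbackTimer();          // hypothetical: arm the ~60s low-priority timer

  static mozilla::TimeStamp gLastSendTimestamp;  // reset on every send

  // Runs in the same task that just recorded child telemetry data (after the
  // accumulation has been appended to the in-memory batch).
  void MaybeSendBatch() {
    mozilla::TimeStamp now = mozilla::TimeStamp::Now();
    if (gLastSendTimestamp.IsNull() ||
        now - gLastSendTimestamp > mozilla::TimeDuration::FromSeconds(5)) {
      SendAccumulatedDataToParent();
      gLastSendTimestamp = now;
    } else {
      // Too soon to report again; make sure the long fallback timer is armed
      // so this data isn't stranded if accumulations stop for a while.
      EnsureFallbackTimer();
    }
  }
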
A chat on IRC later and this is much clearer. Thank you for your patience in explaining this to me.

How Does It Work
----------------
The central theme is "timestamps, not timers".

{Accumulate|Record}Child{Histogram|ScalarAction|Event} would have the form

  gChild{whatever}->AppendElement(...);
  if (gBatchTimestamp.IsNull()) {
    gBatchTimestamp = TimeStamp::Now();
    ArmJustInCaseTimer();
  } else if (TimeStamp::Now() - gBatchTimestamp > INTERVAL ||
             <we've hit the memory watermark>) {
    gBatchTimestamp = TimeStamp();
    CancelJustInCaseTimer();
    SendIPC();
  }

`ArmJustInCaseTimer()` would arm some longish (60s) timer that would ensure that, if all the accumulations in a batch came within INTERVAL, we'd still have some upper bound on the age of the child data. (It needs to remember to set gBatchTimestamp back to TimeStamp().)

`SendIPC()` could make use of IdleDispatch to the main thread to avoid causing perf issues, and can make use of [1] to have the same upper bound on the age of the child data awaiting send. (Maybe we even dispense with idle timeouts if we're sending IPC as a result of the JustInCaseTimer.) By using dispatch we still have some guarantee that normal xpcom shutdown will service all of the runnables. By using timeouts we limit our exposure to aborted sessions.

Tuning the values of INTERVAL and the idle dispatch timeouts will be the biggest source of bikeshedding and should 100% be controlled by pref. (INTERVAL already is, since bug 1500066.)

How Does This Help
------------------
Right now, with linear 2s timers in TelemetryIPCAccumulator, there are several patterns of Telemetry use that would make us the source of wakeups. By removing the timer and requiring that our callers be the trigger for sending the IPC, we guarantee we're not the ones waking things up; it's our callers who are doing that.

This also affords us the opportunity to take advantage of IdleDispatch, a feature added one month after I landed Telemetry IPC support back in bug 1218576. This takes us nicely out of perf-sensitive paths in a way that low-priority timers wish they could.

Downsides
---------
We lose our 2s guarantee of maximum child data staleness and replace it with a statistical staleness model based on how frequently and over what timescales the accumulations come in (with an ultimate upper bound provided by the JustInCaseTimer).

---

Leaving this unprioritized and unassigned so we can evaluate it at triage. Given the timing in the year this is pretty much guaranteed not to be addressed until 2019, but the sooner we get this in, the sooner our users can reap the benefits to battery and perf.

[1]: https://searchfox.org/mozilla-central/source/xpcom/threads/nsThreadUtils.cpp#438
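To make the `SendIPC()` idea concrete, here is a sketch of dispatching the send to the main thread at idle priority with a timeout. The exact helper (an NS_DispatchToThreadQueue overload with a timeout) and the 60s bound are assumptions to be checked against [1]; the runnable body is a placeholder:

  #include "nsCOMPtr.h"
  #include "nsThreadUtils.h"

  using namespace mozilla;

  void DispatchSendIPC() {
    nsCOMPtr<nsIThread> mainThread;
    if (NS_FAILED(NS_GetMainThread(getter_AddRefs(mainThread)))) {
      return;  // very late in shutdown; nothing sensible to do
    }
    nsCOMPtr<nsIRunnable> send = NS_NewRunnableFunction(
        "TelemetryIPCAccumulator::SendIPC",
        [] { /* actual IPC send of the batched child accumulations */ });
    // Run when the main thread is idle, but no later than ~60s from now, so
    // the child data awaiting send has the same upper bound on its age.
    NS_DispatchToThreadQueue(send.forget(), 60 * 1000, mainThread,
                             EventQueuePriority::Idle);
  }
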
Priority: -- → P3
Blocks: 1547950
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Priority: P3 → P1

A couple of notes now that I have a partial solution figured out.

  1. Dispatching to the main thread using nsThreadUtils is tricky because, if the main thread is unavailable, the dispatch helpers leak their already_AddRefed runnable. Which isn't great. (See the sketch after this list.)
  2. Given that less than half a percent of batches report any discarded histogram accumulations (situations where somehow we've accumulated more than 5x the high water mark of accumulations before the timer is serviced), I propose we should drop the concept of discarded data at the IPC layer and send everything.
  3. For some reason setting the batch timeout pref in head.js didn't take effect in every xpcshell test under that folder. Not sure what that's all about. But it did confirm that, in tests where accumulations aren't happening in a steady stream, batches just aren't being sent. Which is pretty great, since that was the point.
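
On note 1: the already_AddRefed dispatch helpers deliberately leak on failure (so the runnable isn't released on the wrong thread). A hedged way to narrow the window, sketched with assumed names rather than the real patch, is to confirm the main thread is still reachable before constructing the runnable, so the common late-shutdown failure never builds anything to leak:

  #include "nsCOMPtr.h"
  #include "nsThreadUtils.h"

  using namespace mozilla;

  void MaybeDispatchBatch() {
    nsCOMPtr<nsIThread> mainThread;
    if (NS_FAILED(NS_GetMainThread(getter_AddRefs(mainThread)))) {
      return;  // too late in shutdown; drop this batch rather than build + leak
    }
    nsCOMPtr<nsIRunnable> task = NS_NewRunnableFunction(
        "TelemetryIPCAccumulator::SendBatch",
        [] { /* send the batched accumulations; omitted here */ });
    // Dispatch can still fail (and leak) if the queue closed in between, but
    // the window is much smaller.
    mainThread->Dispatch(task.forget(), NS_DISPATCH_NORMAL);
  }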

Still more to do on this WIP:

  • There's no just-in-case timer (maybe we don't need one? The worst case is we lose 2s of data... but only if it's followed by like 58s of silence and then an untidy teardown)
  • Clarify shutdown stuff. I pulled the content-child-shutdown observer, which flushed accumulations on orderly shutdown, because it needed a public API that I didn't want to support. Plus, it was registered in MemoryTelemetry of all places. Weird. I'll probably add the observer to TelemetryIPCAccumulator directly (a rough sketch follows this list).
  • Tests
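
A rough sketch of the observer registration mentioned in the second bullet; the observer object and the flush it performs are assumed, while "content-child-shutdown" is the topic named above:

  #include "mozilla/Services.h"
  #include "nsCOMPtr.h"
  #include "nsIObserverService.h"

  void RegisterShutdownFlush(nsIObserver* aObserver) {
    nsCOMPtr<nsIObserverService> os = mozilla::services::GetObserverService();
    if (os) {
      // On orderly content-process shutdown, the observer flushes any
      // still-batched accumulations to the parent before IPC goes away.
      os->AddObserver(aObserver, "content-child-shutdown", /* ownsWeak */ false);
    }
  }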

This was only used to ensure we flushed to IPC on content-child-shutdown.
We can do that internally.

For some reason I have to specify this in individual tests as well as in
head.js. Woo.

Depends on D42919

Less than half a percent of batches in Nightly 70 (less than an eighth of a
percent in Beta 69) discard histogram accumulations because of overshooting the
watermark 5x while the main thread is busy.

(There are essentially zero discards of anything else. Histogram accumulations
are the only thing with a high enough volume to trigger the discarding code.)

So let's send all of the data that comes in, then. This saves us some storage,
plus a bit of data collection.

Depends on D42920

Bopping this down to P2 as I don't want it to land until early 71 anyway, and there are a few other things I should look into before then.

Current state:

  • There is no just-in-case timer. In my opinion we don't need one. Even for infrequently-collecting processes such a timer wouldn't improve data reliability, just data freshness. And then only if the process collects less frequently than the just-in-case timer fires.
  • The patches as presented are mostly green (try run), though it's possible there are some tests that check non-parent collections I haven't run into yet.
Priority: P1 → P2

Was chatting with Travis and had a thought:

  • The final SendBatch in DeInit is happening on the main thread, so it should fire the IPC immediately without dispatch to idle. Otherwise the event queue might not accept dispatches that late.
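
Sketched out, that special case might look like this (the names and the boolean flag are assumed for illustration, not the real patch):

  #include "mozilla/Assertions.h"
  #include "nsThreadUtils.h"

  void SendAccumulatedDataToParent();  // hypothetical immediate IPC send
  void DispatchSendIPCAtIdle();        // hypothetical idle-dispatch path

  void SendBatch(bool aIsDeInit) {
    if (aIsDeInit) {
      // DeInit runs on the main thread; send synchronously rather than
      // queueing an idle runnable the late-shutdown queue may never run.
      MOZ_ASSERT(NS_IsMainThread());
      SendAccumulatedDataToParent();
      return;
    }
    DispatchSendIPCAtIdle();
  }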

Another update: Was chatting with IPC folks on IRC today and it turns out there's no guarantee of tidy teardown in subprocesses. There might be a tidy teardown in some build configurations for some platforms today, but that isn't guaranteed to happen every time, and it certainly isn't guaranteed to not change over time.

Which is a bit of a pickle. I was kinda pinning my hopes on a "we're about to close the pipe" notice in non-crashing cases to kick us into sending a final batch. Instead, the only promises we can make about subprocess data collections are freshness promises.

Which means I'm gonna want to decrease the batch timeout a bit so that we're losing less data. Which is a bummer, but hopefully with lazy dispatch it should still result in a perf win overall.

Which reminds me, I need to run these patches on talos.
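
For reference, reading a (shorter) batch timeout from the pref could look like the sketch below; the pref name and default here are assumptions (bug 1500066 made the interval pref-controlled, so check the actual name in TelemetryIPCAccumulator.cpp):

  #include "mozilla/Preferences.h"

  static uint32_t BatchTimeoutMs() {
    // Illustrative default: with no tidy-teardown guarantee, anything still
    // batched when the process exits is simply lost, so keep batches short.
    return mozilla::Preferences::GetUint("toolkit.telemetry.ipcBatchTimeout",
                                         /* aDefault */ 2000);
  }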

Depends on D42921

Assignee: chutten → nobody
Status: ASSIGNED → NEW
Priority: P2 → P3
Attachment #9087139 - Attachment is obsolete: true
Attachment #9094185 - Attachment is obsolete: true
Attachment #9087138 - Attachment is obsolete: true
Attachment #9087137 - Attachment is obsolete: true
Attachment #9087136 - Attachment is obsolete: true
Severity: normal → S3