Open Bug 1468096 Opened 6 years ago Updated 2 years ago

TelemetryIPCAccumulator::IPCTimerFired timer fires too often

Categories

(Toolkit :: Telemetry, enhancement, P3)

People

(Reporter: smaug, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(5 obsolete files)

I was looking at which timers run often in a rather de-active background tab/process, and TelemetryIPCAccumulator::IPCTimerFired was one of the common ones. Do we really need to fire it so often?
For the record, we batch all accumulations and send them at most every 2s. The relevant comments are bug 1218576 comments 33 and 34. In short, the timer should only fire if the browser is actively accumulating Telemetry. If it is firing -every- (or most every) 2s, then that means the de-active background process isn't terribly de-active (or maybe it is accumulating something about wakeups, which causes the timer to wake up, which causes us to register the wakeup and loop...). Do you happen to know what Telemetry is being accumulated in that rather de-active background process?
Flags: needinfo?(bugs)
Ack, I wasn't clear in the previous comment. That is how it is -supposed- to work. It is possible it has gone awry. While I'm here, on a marginally-related note: we record how many accumulations we had to discard because the 2s timer is being starved, with a scalar: https://mzl.la/2t2KJ6h (the Beta 61 population has discarded 118B accumulations in child processes so far in its release). Increasing the timer duration will require tuning the memory watermarks (or looking into coalescing).
2s is quite often no matter what. One reason it is firing often might be CC/GC running (I'm planning to make those run less often in the background).
Flags: needinfo?(bugs)
(In reply to Olli Pettay [:smaug] from comment #3)
> 2s is quite often no matter what.

What is a reasonable interval? The trade-off with increasing this interval is that we lose some precision on child process Telemetry. If we set this to, say, 60s and we're sending out a ping, the child process data in the ping might be up to 60 seconds old (+ cross-process send time). That, we worried, might throw off analysis.
Flags: needinfo?(bugs)
Not sure what would be a good interval. But remember that we'll have lots more child processes in the near future, so we should try to reduce wakeups.
Flags: needinfo?(bugs)
If the process is waking up to record Telemetry, maybe it would be best to send it without any (or with a very small) delay so we can do so within the same wakeup interval? When we tested this for 1-4 content processes, though, it had some perf implications. Maybe send the first batch within a wakeup interval's width and increase batch intervals geometrically? Something like: current batch interval = min(MAX_INTERVAL, 100ms * number of sequential previous intervals with data in them) Do we know how long a wakeup keeps us active?
Flags: needinfo?(bugs)
Flags: needinfo?(bugs)
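For illustration, a minimal sketch of the growing-interval idea from the comment above, using hypothetical names (NextBatchInterval, aBusyIntervals); note that the formula as written grows linearly with the number of busy intervals rather than geometrically:

  #include "mozilla/TimeStamp.h"

  // Sketch only: not a real TelemetryIPCAccumulator function.
  static mozilla::TimeDuration NextBatchInterval(uint32_t aBusyIntervals) {
    const mozilla::TimeDuration kMaxInterval =
        mozilla::TimeDuration::FromSeconds(60);  // assumed MAX_INTERVAL
    // current batch interval = min(MAX_INTERVAL,
    //   100ms * number of sequential previous intervals with data in them)
    mozilla::TimeDuration next =
        mozilla::TimeDuration::FromMilliseconds(100) * aBusyIntervals;
    return next < kMaxInterval ? next : kMaxInterval;
  }
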
Vicky, do IPC timer wakes fall into your area of interest?
Flags: needinfo?(vchin)
Could we have some quite long max timeout, and then, when more data is reported in the child process, also check whether we've reported recently, and if not, send the data to the parent process? So, if data hasn't been reported to the parent side in, say, the last 5 seconds and new telemetry data is collected, send all the data. And whenever there is new data, ensure a timer is running; if not, create a low-priority timer and fire it after, hmm, 60s? We'd also need to send the data when the process is about to be shut down. This would miss more data if the child process is crashing, sure. Is this totally unreasonable?
Let's see if I'm understanding you properly:

1. Store a timestamp of the last time we talked to the parent process.
2. Whenever a child telemetry accumulation comes in, check the timestamp:
   a. If it's greater than 5s, send immediately
   b. If it isn't > 5s, start a 60s low-priority timer to batch up accumulations
3. On low-priority timer fire, send batched accumulations

So the first piece of telemetry on wakeup will be sent immediately, but everything else has to wait a minute. I'd be tempted to have a short (100ms? Something short enough to fit within the wakeup interval) batch timer on 2.a. instead of sending immediately as I don't think we'll ever be dealing with exactly 1 accumulation.

Losing scalar/histogram/event data from a crashing child process isn't too much of a concern. AFAIK, crash investigations are based on environment and annotations which are managed by the parent process.

Did I get it right?
Flags: needinfo?(bugs)
(In reply to Chris H-C :chutten from comment #7)
> Vicky, do IPC timer wakes fall into your area of interest?

They do. Olli's on it. :)
Flags: needinfo?(vchin)
(In reply to Chris H-C :chutten from comment #9)
> Let's see if I'm understanding you properly:
>
> 1. Store a timestamp of the last time we talked to the parent process.
> 2. Whenever a child telemetry accumulation comes in, check the timestamp:
>    a. If it's greater than 5s, send immediately
>    b. If it isn't > 5s, start a 60s low-priority timer to batch up
>       accumulations
> 3. On low-priority timer fire, send batched accumulations
>
> So the first piece of telemetry on wakeup will be sent immediately, but
> everything else has to wait a minute. I'd be tempted to have a short (100ms?
> Something short enough to fit within the wakeup interval) batch timer on

The idea would be just to use the same task for telemetry data collecting and, if enough time has passed, reporting. So if telemetry data is being collected basically all the time, roughly every 5s data would be sent to the parent side. The long timer would just be a fallback to ensure all the data gets reported at some point (unless the process crashes).

So, no, I don't mean "everything else" or "first piece". We'd clear the reporting timestamp whenever we send the data to the parent process. And if we then trigger a probe > 5s (and < 60s) later, we send the data to the parent again (and reset the timestamp). Is that clearer?
Flags: needinfo?(bugs)
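For illustration, a hedged sketch of the scheme as clarified in the last two comments, with assumed names (MaybeSendBatch, gLastSendTimestamp, SendAccumulatedDataToParent, EnsureFallbackTimer); the 5s threshold and ~60s fallback come straight from the discussion:

  #include "mozilla/TimeStamp.h"

  void SendAccumulatedDataToParent();  // hypothetical: flush the batch over IPC
  void EnsureFallbackTimer();          // hypothetical: arm the ~60s low-priority timer

  static mozilla::TimeStamp gLastSendTimestamp;  // reset on every send

  // Runs in the same task that just recorded child telemetry data (after the
  // accumulation has been appended to the in-memory batch).
  void MaybeSendBatch() {
    mozilla::TimeStamp now = mozilla::TimeStamp::Now();
    if (gLastSendTimestamp.IsNull() ||
        now - gLastSendTimestamp > mozilla::TimeDuration::FromSeconds(5)) {
      SendAccumulatedDataToParent();
      gLastSendTimestamp = now;
    } else {
      // Too soon to report again; make sure the long fallback timer is armed
      // so this data isn't stranded if accumulations stop for a while.
      EnsureFallbackTimer();
    }
  }
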
A chat on IRC later and this is much clearer. Thank you for your patience in explaining this to me.

How Does It Work
----------------
The central theme is "timestamps, not timers".

{Accumulate|Record}Child{Histogram|ScalarAction|Event} would have the form

  gChild{whatever}->AppendElement(...);
  if (gBatchTimestamp.IsNull()) {
    gBatchTimestamp = TimeStamp::Now();
    ArmJustInCaseTimer();
  } else if (TimeStamp::Now() - gBatchTimestamp > INTERVAL ||
             <we've hit the memory watermark>) {
    gBatchTimestamp = TimeStamp();
    CancelJustInCaseTimer();
    SendIPC();
  }

`ArmJustInCaseTimer()` would arm some longish (60s) timer that would ensure that, if all the accumulations in a batch came within INTERVAL, we'd still have some upper bound on the age of the child data. (It needs to remember to set gBatchTimestamp back to TimeStamp().)

`SendIPC()` could make use of IdleDispatch to the main thread to avoid causing perf issues, and can make use of [1] to have the same upper bound on the age of the child data awaiting send. (Maybe we even dispense with idle timeouts if we're sending IPC as a result of the JustInCaseTimer.) By using dispatch we still have some guarantee that normal xpcom shutdown will service all of the runnables. By using timeouts we limit our exposure to aborted sessions.

Tuning the values of INTERVAL and the idle dispatch timeouts will be the biggest source of bikeshedding and should 100% be controlled by pref. (INTERVAL already is, since bug 1500066.)

How Does This Help
------------------
Right now, with linear 2s timers in TelemetryIPCAccumulator, there are several patterns of Telemetry use that would make us the source of wakeups. By removing the timer and requiring that our callers be the trigger for sending the IPC, we guarantee we're not the ones waking things up; it's our callers who are doing that.

This also affords us the opportunity to take advantage of IdleDispatch, a feature added one month after I landed Telemetry IPC support back in bug 1218576. This takes us nicely out of perf-sensitive paths in a way that low-priority timers wish they could.

Downsides
---------
We lose our 2s guarantee of maximum child data staleness and replace it with a statistical staleness model based on how frequently and over what timescales the accumulations come in (with an ultimate upper bound provided by the JustInCaseTimer).

---

Leaving this unprioritized and unassigned so we can evaluate it at triage. Given the timing in the year this is pretty much guaranteed not to be addressed until 2019, but the sooner we get this in, the sooner our users can reap the benefits to battery and perf.

[1]: https://searchfox.org/mozilla-central/source/xpcom/threads/nsThreadUtils.cpp#438
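To make the `SendIPC()` idea concrete, here is a sketch of dispatching the send to the main thread at idle priority with a timeout. The exact helper (an NS_DispatchToThreadQueue overload with a timeout) and the 60s bound are assumptions to be checked against [1]; the runnable body is a placeholder:

  #include "nsCOMPtr.h"
  #include "nsThreadUtils.h"

  using namespace mozilla;

  void DispatchSendIPC() {
    nsCOMPtr<nsIThread> mainThread;
    if (NS_FAILED(NS_GetMainThread(getter_AddRefs(mainThread)))) {
      return;  // very late in shutdown; nothing sensible to do
    }
    nsCOMPtr<nsIRunnable> send = NS_NewRunnableFunction(
        "TelemetryIPCAccumulator::SendIPC",
        [] { /* actual IPC send of the batched child accumulations */ });
    // Run when the main thread is idle, but no later than ~60s from now, so
    // the child data awaiting send has the same upper bound on its age.
    NS_DispatchToThreadQueue(send.forget(), 60 * 1000, mainThread,
                             EventQueuePriority::Idle);
  }
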
Priority: -- → P3
Blocks: 1547950
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Priority: P3 → P1

A couple of notes now that I have a partial solution figured out.

  1. Dispatching to the main thread using nsThreadUtils is tricky because, if the main thread is unavailable, the dispatch helpers leak their already_AddRefed runnable. Which isn't great. (See the sketch after this list.)
  2. Given that less than half a percent of batches report any discarded histogram accumulations (situations where somehow we've accumulated more than 5x the high water mark of accumulations before the timer is serviced), I propose we should drop the concept of discarded data at the IPC layer and send everything.
  3. For some reason setting the batch timeout pref in head.js didn't take effect in every xpcshell test under that folder. Not sure what that's all about. But it did confirm that, in tests where accumulations aren't happening in a steady stream, batches just aren't being sent. Which is pretty great, since that was the point.
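
On note 1: the already_AddRefed dispatch helpers deliberately leak on failure (so the runnable isn't released on the wrong thread). A hedged way to narrow the window, sketched with assumed names rather than the real patch, is to confirm the main thread is still reachable before constructing the runnable, so the common late-shutdown failure never builds anything to leak:

  #include "nsCOMPtr.h"
  #include "nsThreadUtils.h"

  using namespace mozilla;

  void MaybeDispatchBatch() {
    nsCOMPtr<nsIThread> mainThread;
    if (NS_FAILED(NS_GetMainThread(getter_AddRefs(mainThread)))) {
      return;  // too late in shutdown; drop this batch rather than build + leak
    }
    nsCOMPtr<nsIRunnable> task = NS_NewRunnableFunction(
        "TelemetryIPCAccumulator::SendBatch",
        [] { /* send the batched accumulations; omitted here */ });
    // Dispatch can still fail (and leak) if the queue closed in between, but
    // the window is much smaller.
    mainThread->Dispatch(task.forget(), NS_DISPATCH_NORMAL);
  }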

Still more to do on this WIP:

  • There's no just-in-case timer (maybe we don't need one? The worst case is we lose 2s of data... but only if it's followed by like 58s of silence and then an untidy teardown)
  • Clarify shutdown stuff. I pulled the content-child-shutdown observer, which flushed accumulations on orderly shutdown, because it needed a public API that I didn't want to support. Plus, it was registered in MemoryTelemetry of all places. Weird. I'll probably add the observer to TelemetryIPCAccumulator directly (a rough sketch follows this list).
  • Tests
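
A rough sketch of the observer registration mentioned in the second bullet; the observer object and the flush it performs are assumed, while "content-child-shutdown" is the topic named above:

  #include "mozilla/Services.h"
  #include "nsCOMPtr.h"
  #include "nsIObserverService.h"

  void RegisterShutdownFlush(nsIObserver* aObserver) {
    nsCOMPtr<nsIObserverService> os = mozilla::services::GetObserverService();
    if (os) {
      // On orderly content-process shutdown, the observer flushes any
      // still-batched accumulations to the parent before IPC goes away.
      os->AddObserver(aObserver, "content-child-shutdown", /* ownsWeak */ false);
    }
  }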

This was only used to ensure we flushed to IPC on content-child-shutdown.
We can do that internally.

For some reason I have to specify this in individual tests as well as in
head.js. Woo.

Depends on D42919

Less than half a percent of batches in Nightly 70 (less than an eighth of a
percent in Beta 69) discard histogram accumulations because of overshooting the
watermark 5x while the main thread is busy.

(There are essentially zero discards of anything else. Histogram accumulations
are the only thing with a high enough volume to trigger the discarding code.)

So let's send all of the data that comes in, then. This saves us some storage,
plus a bit of data collection.

Depends on D42920

Bopping this down to P2 as I don't want it to land until early 71 anyway, and there are a few other things I should look into before then.

Current state:

  • There is no just-in-case timer. In my opinion we don't need one. Even for infrequently-collecting processes such a timer wouldn't improve data reliability, just data freshness. And then only if the process collects less frequently than the just-in-case timer fires.
  • The patches as presented are mostly green (try run), though it's possible there are some tests that check non-parent collections I haven't run into yet.
Priority: P1 → P2

Was chatting with Travis and had a thought:

  • The final SendBatch in DeInit is happening on the main thread, so it should fire the IPC immediately without dispatch to idle. Otherwise the event queue might not accept dispatches that late.
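
Sketched out, that special case might look like this (the names and the boolean flag are assumed for illustration, not the real patch):

  #include "mozilla/Assertions.h"
  #include "nsThreadUtils.h"

  void SendAccumulatedDataToParent();  // hypothetical immediate IPC send
  void DispatchSendIPCAtIdle();        // hypothetical idle-dispatch path

  void SendBatch(bool aIsDeInit) {
    if (aIsDeInit) {
      // DeInit runs on the main thread; send synchronously rather than
      // queueing an idle runnable the late-shutdown queue may never run.
      MOZ_ASSERT(NS_IsMainThread());
      SendAccumulatedDataToParent();
      return;
    }
    DispatchSendIPCAtIdle();
  }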

Another update: Was chatting with IPC folks on IRC today and it turns out there's no guarantee of tidy teardown in subprocesses. There might be a tidy teardown in some build configurations for some platforms today, but that isn't guaranteed to happen every time, and it certainly isn't guaranteed to not change over time.

Which is a bit of a pickle. I was kinda pinning my hopes on a "we're about to close the pipe" notice in non-crashing cases to kick us into sending a final batch. Instead, the only promises we can make about subprocess data collections are freshness promises.

Which means I'm gonna want to decrease the batch timeout a bit so that we're losing less data. Which is a bummer, but hopefully with lazy dispatch it should still result in a perf win overall.

Which reminds me, I need to run these patches on talos.
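
For reference, reading a (shorter) batch timeout from the pref could look like the sketch below; the pref name and default here are assumptions (bug 1500066 made the interval pref-controlled, so check the actual name in TelemetryIPCAccumulator.cpp):

  #include "mozilla/Preferences.h"

  static uint32_t BatchTimeoutMs() {
    // Illustrative default: with no tidy-teardown guarantee, anything still
    // batched when the process exits is simply lost, so keep batches short.
    return mozilla::Preferences::GetUint("toolkit.telemetry.ipcBatchTimeout",
                                         /* aDefault */ 2000);
  }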

Depends on D42921

Assignee: chutten → nobody
Status: ASSIGNED → NEW
Priority: P2 → P3
Attachment #9087139 - Attachment is obsolete: true
Attachment #9094185 - Attachment is obsolete: true
Attachment #9087138 - Attachment is obsolete: true
Attachment #9087137 - Attachment is obsolete: true
Attachment #9087136 - Attachment is obsolete: true
Severity: normal → S3