Replace the FOG IPC Backend with one that doesn't use PContent or the main thread
Categories
(Toolkit :: Telemetry, task, P2)
People
(Reporter: chutten|PTO, Assigned: charlie)
References
(Blocks 1 open bug)
Details
(Whiteboard: [telemetry:fog:m8])
In bug 1635255 we've been given the okay to build FOG's IPC atop PContent on the main thread while there are few users with little data. We don't need to be on the main thread. We don't want to be tied to only content processes. And the main thread doesn't want us either: it has more important things to do.
And more users with more data are coming.
This bug is for replacing the IPC backend with something better. First and foremost, get in contact with IPC Peers and see what the current state of the art is. Then propose a design for the replacement backend. Then implement it.
Reporter
Comment 1•4 years ago
FOG has now figured out what IPC is gonna look like. Time to see if IPC has any neat ideas about how we could get FOG's IPC off of the main thread and into arbitrary process types. :jld, :nika, do you happen to remember when I asked last year about nifty ways to send IPC opportunistically from and to background threads?
A recap:
- FOG (Firefox on Glean) is the layer that sits atop Glean in Firefox Desktop to replace Firefox Desktop Telemetry as the data collection mechanism.
- It provides (amongst other things) IPC support so, e.g., the JS GC can record samples of how long each phase takes to a timing_distribution.
- Both sides of the communication are Rust.
- As such, the at-the-time-recommended approach was to use serde to bincode-serialize a payload to bytes, send it across FFI as a ByteBuf, send it across IPC in C++, then send it back down FFI to Rust, then bincode-deserialize it in the parent process. All this (and other misc signalling) can be found in FOGIPC.cpp as well as in ContentParent/Child and friends.
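For the record, a minimal Rust sketch of that batch-and-send shape. ChildPayload and the function names are made up for illustration, and a plain Vec<u8> stands in for the ByteBuf that crosses FFI and IPC; this isn't the real FOG code.

use serde::{Deserialize, Serialize};
use std::collections::HashMap;

// Illustrative stand-in for the real FOG IPC payload type.
#[derive(Serialize, Deserialize, Default)]
struct ChildPayload {
    // metric id -> accumulated samples (e.g. for a timing_distribution)
    timing_samples: HashMap<u32, Vec<u64>>,
}

// Child side: bincode-serialize the accumulated payload to bytes for IPC.
fn flush_child_payload(payload: &ChildPayload) -> Result<Vec<u8>, bincode::Error> {
    bincode::serialize(payload)
}

// Parent side: bincode-deserialize the bytes and hand the samples to Glean.
fn receive_child_payload(bytes: &[u8]) -> Result<ChildPayload, bincode::Error> {
    bincode::deserialize(bytes)
}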
Reporter
Comment 2•3 years ago
This rebuild should also take into account that the current at-clean-process-shutdown IPC flushes (like this one for GMP) do not work: by the time the child side is being destroyed, the channel is already gone.
We'll want to tackle that anyway because we can't guarantee clean process shutdowns (especially on mobile), but especially because the existing solutions we thought we had aren't working out.
Reporter
Comment 3•3 years ago
We may wish to change paradigms from "batch-and-send" to "stream" via e.g. DataPipe. Note that using this from Rust ergonomically will involve getting good interfaces for nsI{Input|Output}Stream, a prototype of which Nika attached to bug 1782237.
This will have ramifications for the CPU, thread, power, etc. instrumentation that currently relies on occasionally being triggered by FOGIPC batches. I warned the instrumentation owner (:florian) when it went in that this was coming at some point in the future, so it shouldn't be a surprise. But we'll want to give a lot of notice.
Comment 4•1 year ago
I don't think there's more context to add beyond what :chutten and I discussed asynchronously a few years ago, which is summarized in the comments above. If there are new questions for me, feel free to add a new ni?
Reporter
Comment 5•1 year ago
I've spent some time with DataPipe and I'm not entirely sure that it'll suit our purposes. Synchronizing production with consumption, operation by operation, probably raises the complexity budget too high. By which I mean: the idea of using the DataPipe to send (metric_id, sample) pairs, which are picked up by some thread on the parent side that then hands them to Glean, has some problems:
- ensuring that production and consumption don't drift too far apart in rate,
- bearing the runtime CPU costs of thread dispatch and contention, and
- figuring out the correct ring buffer size for metrics with very different needs (samples on every vsync vs. events on every user interaction differ by orders of magnitude in frequency and sample size).
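For concreteness, a rough Rust sketch of that streaming shape, with a bounded standard-library channel standing in for the DataPipe (all names illustrative; the real thing would cross a process boundary rather than a thread boundary):

use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

type MetricId = u32;
type Sample = u64;

// Child side: try to push a sample; a full buffer forces a policy decision
// (block the instrumented thread, drop the sample, or grow the buffer).
fn record(tx: &SyncSender<(MetricId, Sample)>, id: MetricId, sample: Sample) {
    if let Err(TrySendError::Full(_)) = tx.try_send((id, sample)) {
        // Production has outrun consumption: the rate-matching problem above.
    }
}

// Parent side: a dedicated thread drains pairs and hands them to Glean,
// paying the dispatch/contention cost along the way.
fn drain(rx: &Receiver<(MetricId, Sample)>) {
    while let Ok((_id, _sample)) = rx.recv() {
        // e.g. hand off to the glean.dispatcher thread here.
    }
}

fn main() {
    // Picking one capacity is the ring-buffer-sizing problem: vsync-rate
    // samples and per-interaction events differ by orders of magnitude.
    let (tx, rx) = sync_channel::<(MetricId, Sample)>(1024);
    record(&tx, 1, 16_000);
    drop(tx);
    drain(&rx);
}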
I think the "ideal" form might be a piece of shmem on each parent metric instance that acts as a process-aware representation of the metric's storage and can be sync'd down into Glean on the glean.dispatcher thread as normal. Certainly that most closely mimics what perf was talking about wanting for low-CPU telemetry accumulation: a little math and some sums (plus acquiring an unlikely-to-be-contended write lock). I don't know if that fits any existing IPC mechanisms or patterns, though.
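Something like the following, purely as illustration (none of these names are real FOG or Glean APIs, and creating the shared mapping and handing it to the child is elided):

use std::sync::atomic::{AtomicU64, Ordering};

// A fixed-size accumulator that would live in shared memory: the child
// records with atomics (no locks, no IPC messages), and the parent
// periodically folds the totals into Glean on the glean.dispatcher thread.
#[repr(C)]
struct SharedTimingAccumulator {
    sample_count: AtomicU64,
    sample_sum_us: AtomicU64,
}

impl SharedTimingAccumulator {
    // Child process: record one sample.
    fn accumulate(&self, sample_us: u64) {
        self.sample_count.fetch_add(1, Ordering::Relaxed);
        self.sample_sum_us.fetch_add(sample_us, Ordering::Relaxed);
    }

    // Parent process: drain the totals. (The two swaps aren't a consistent
    // snapshot if a child is accumulating concurrently; a real design would
    // need to handle that.)
    fn drain(&self) -> (u64, u64) {
        (
            self.sample_count.swap(0, Ordering::Relaxed),
            self.sample_sum_us.swap(0, Ordering::Relaxed),
        )
    }
}

fn main() {
    let acc = SharedTimingAccumulator {
        sample_count: AtomicU64::new(0),
        sample_sum_us: AtomicU64::new(0),
    };
    acc.accumulate(1_250);
    let (count, sum_us) = acc.drain();
    println!("{count} samples totalling {sum_us}us");
}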
Nika, should I schedule a chat for us to noodle about this? Are there docs of IPC patterns I should read? Or am I off-base and should take a second look at DataPipe since, even with the complexities I identified, it'll still be the least painful option?
Comment 6•1 year ago
We can definitely set up some time to chat about telemetry-related things for accumulation etc. I'll need a bit of a better idea of what the shape of the data needs to look like for this shared memory region, etc.
I believe that while there are some Glean telemetry types which are quite simple (i.e. increment-a-global-counter style), there are also some telemetry types which are quite complex (like events). You might want distinct systems for complex objects like events vs. counters.
Reporter
Comment 7•2 months ago
Okay, with advice and coaching from Nika, it's time to scope this down.
It should be possible with the current slate of IPC primitives and abstractions to get the existing FOG IPC messages off of the main thread by introducing a protocol with two actors like:
async protocol PFOGTransport {
parent:
  async FOGData(ByteBuf buf);
child:
  async FlushFOGData() returns (ByteBuf buf);
}
Then, at process startup, child processes would send a mozilla::ipc::Endpoint<PFOGTransportParent> (in something likely called RecvCreateFOGTransport) to be bound on a background task queue in the parent process. (I believe the pattern we're following is not dissimilar from PClipboardContentAnalysis and PBackgroundStarter.)
Most of the rest of the code should remain as it is, just running on the new task queue's background thread. We'll probably want to pay attention to (i.e., instrument) any extra delays we incur by doing it this way (though I'm not sure how to timestamp these things). If we come up with a way to instrument or measure any improvements to main-thread responsiveness, that'll tell us whether we have budget to be chattier now that we're not holding up important operations on the main process. And, if possible, we'll want to drain that task queue at or near shutdown.
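(For the "how to timestamp these things" part, one cheap option sketched in Rust with illustrative names: carry the enqueue time along with the task and measure the elapsed time when it actually runs.)

use std::time::{Duration, Instant};

// A flush task carries the time it was queued so the runner can measure how
// long it sat on the background task queue.
struct QueuedFlush {
    queued_at: Instant,
    // ...the serialized payload would live here...
}

fn run_flush(task: &QueuedFlush) -> Duration {
    // In practice this delay would likely feed a timing_distribution rather
    // than being returned directly.
    task.queued_at.elapsed()
}

fn main() {
    let task = QueuedFlush { queued_at: Instant::now() };
    // ...dispatch to the task queue; later, when it runs:
    let queue_delay = run_flush(&task);
    println!("flush sat queued for {queue_delay:?}");
}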
And I think we'll definitely need to update the docs about how to add FOG support to a new process type.
If it seems possible (and from here it looks like it might) we may wish to implement this alongside the current main-thread implementation with a nimbus feature to switch back and forth, then design and run an experiment before we go live. (Or go live with the confidence that if our metrics tank we have a remote-control off switch for a couple versions.)
All of the rest of the ideas (reworking IPC into streaming data pipes, shmem buffers, or what-have-you) are now officially out of scope, not least because we'd need some bespoke work from IPC folks to implement primitives just for us before any of those ideas could even be explored.