Closed Bug 915733 Opened 11 years ago Closed 6 years ago

Use Linux AIO instead of IO threads for IPC

Categories

(Firefox OS Graveyard :: General, defect, P3)

All
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:-)

RESOLVED WONTFIX
blocking-b2g -

People

(Reporter: sinker, Assigned: cyu)

References

Details

(Keywords: perf, Whiteboard: [c=progress p= s= u=])

Attachments

(1 file)

AIO is more efficient because it avoids the context-switch latency introduced by the IO threads, and it issues fewer syscalls. Without AIO, dispatching an IPC message requires 3 context switches; with AIO it requires only one.
I think this would make things a whole lot faster. Great idea.
We're talking about Linux Async I/O (io_setup/destroy/submit/getevents) not POSIX Async I/O (aio_read/write/etc) right?
Yes, we are talking Linux Async I/O, not POSIX Async I/O.
Summary: Use AIO instead of IO threads for IPC → Use Linux AIO instead of IO threads for IPC
This sounds great! Maybe it can help the performance functional team. Thanks, Thinker. So, who will work on this?
blocking-b2g: --- → 1.3?
Keywords: perf
Whiteboard: [c=progress p= s= u=]
Assignee: nobody → cyu
Do we have evidence that context switches show up as a reason for latency or raw throughput slowness? cjones did a bunch of testing of this way back when we started using the IPC layer, and you can send thousands of messages per second even on single-core phone hardware.

Don't want to rain on this parade if it will actually help, but I suspect that it might be better to focus on the application layer, or to work on coalescing messages or message handling to reduce the total load. And I'm worried about putting the Linux code on a completely separate codepath without an I/O thread, because we'll undoubtedly end up with Linux-specific bugs, and our testing population on Linux desktop and phones is small enough that we're much less likely to catch those bugs early.
Thousands of messages per second sounds reasonable even for the low-end hardware we currently ship, but that's throughput. Do we have latency numbers from the previous tests? Since many IPC messages are used to delegate things like input, events, and rendering, even a simple tap on the virtual keyboard can involve many round trips between the processes.

We can first measure the latency and write a small proof-of-concept patch to test whether it really helps. Then we can decide whether the improved latency is worth the maintenance cost of a Linux-specific code path.
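(A minimal sketch of the kind of latency measurement proposed above, using only the Python standard library on a POSIX host. It times round trips over a socketpair between a parent and a forked echo child; the function name and parameters are illustrative, and this measures the raw kernel path only, without Gecko's IO-thread hop.)

```python
import os
import socket
import statistics
import time

def measure_roundtrip(n=1000, payload=b"x" * 64):
    """Time n ping-pong round trips between two processes over a socketpair.

    Returns (mean, median) round-trip latency in microseconds.
    """
    parent, child = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: echo every message back. Small payloads on a socketpair
        # arrive whole in practice, so a single recv per iteration suffices.
        parent.close()
        for _ in range(n):
            child.sendall(child.recv(len(payload)))
        child.close()
        os._exit(0)
    child.close()
    samples = []
    for _ in range(n):
        t0 = time.monotonic()
        parent.sendall(payload)
        parent.recv(len(payload))
        samples.append((time.monotonic() - t0) * 1e6)  # microseconds
    os.waitpid(pid, 0)
    parent.close()
    return statistics.mean(samples), statistics.median(samples)
```

Comparing these figures against the same loop routed through a worker thread would give a first-order estimate of the IO-thread overhead discussed in this bug.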
So far all the slow cases I've seen involve the main thread event loops of either/both the parent and child processes. I would love to be proven wrong but even if we manage to increase the IPC speeds somehow I think we'll never notice the difference due to the current event loop problems.
Status: NEW → ASSIGNED
Target Milestone: --- → 1.2 C3(Oct25)
I ran some experiments and measured the time spent sending an IPC message. It takes hundreds of microseconds from scheduling the IO request to starting to run it, and it normally takes < 100 microseconds to run the IO request itself (i.e. sendmsg()). That is, we normally spend < 1 ms sending an IPC request. On the receiving side, if the main thread is not loaded, I'd guess the figures are comparable.

We could improve responsiveness by categorizing IPC requests so that not all of them queue up equally. Input events are a good example: we want the device to respond to input even when it's under heavy load. If we can make TabChild receive input events faster under load, we might be able to improve responsiveness and hence perceived performance.
Do you mean with async I/O?

With async I/O, two problems are supposed to be solved: 1. avoiding the overhead of the additional context switches, and 2. reducing the mean and variance of the latency of dispatching IPC messages.

For input events, maybe we should put a throttle on them instead. Under heavy load we may be scrolling or running an animation, which means handling several input events but generating only one frame; in other words, input events are already overloading the main thread. We should aggregate input events to avoid that, not try to deliver them faster.
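(The aggregation idea above can be sketched in a few lines: collapse runs of consecutive move events into the newest one, while keeping discrete events such as down/up intact. This is an illustrative stdlib-only sketch, not code from the tree; the event representation is assumed.)

```python
def coalesce_moves(events):
    """Collapse consecutive 'move' events, keeping only the latest of each run.

    Each event is a (kind, payload) tuple; discrete events like 'down' and
    'up' are never dropped, so the observable gesture is preserved.
    """
    out = []
    for ev in events:
        if ev[0] == "move" and out and out[-1][0] == "move":
            out[-1] = ev  # newer move supersedes the still-pending one
        else:
            out.append(ev)
    return out
```

Under load the queue accumulates moves faster than the main thread drains them, so this turns an arbitrarily long backlog of moves into a single event per drain, which is exactly the "handle several input events but generate only one picture" behavior described above.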
Delivering IPC out of order really requires a separate root IPDL actor: one of the important guarantees that makes IPDL "safe" is that the messages are checked and delivered in order; if there is a state machine, you can statically prove that you won't be sending messages to dead actors, etc.

It's certainly possible to have a separate root actor for important versus unimportant messages, but we'd have to do a bunch of additional thinking about how that might affect security checking.
They are still delivered in order.
Cervantes, any update on this? Also, what tools did you use to gather the measurements you quoted in comment 8?
blocking-b2g: 1.3? → -
Priority: -- → P4
Target Milestone: 1.2 C3(Oct25) → ---
Mike, my plate is full of the Nuwa process bugs that need to make it into 1.3. I'll come back to this after that. The measurements are from an adaptation of the WIP in bug 908005.
Priority: P4 → P1
See Also: → 991420
I still don't think we've seen any evidence that this is worth engineering effort.
Attached file ipc_latency.py
This is a simple simulator, written in Python with simpy, of the overhead of the IO thread. By changing the parameters at lines 160~164 you can see that the IO thread increases latency considerably.

For example, with the default settings the mean latency is ~38 ms with the IO thread, but only ~20 ms without it.
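(The attached simulator needs simpy; the same qualitative point can be made with a stdlib-only toy model. Each message pays the sendmsg() cost, and routing it through an IO thread additionally pays a scheduling/context-switch delay before the send starts. The cost parameters below are illustrative, loosely based on the microsecond figures quoted in comment 8, not measurements.)

```python
import random

def simulate(n_msgs=1000, send_us=80, switch_us=300, with_iothread=True, seed=1):
    """Toy model of per-message dispatch latency, in microseconds.

    Draws exponentially distributed send times; when with_iothread is True,
    each message also pays an exponentially distributed scheduling delay
    for the hop to the IO thread. Returns the mean latency.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_msgs):
        lat = rng.expovariate(1.0 / send_us)
        if with_iothread:
            lat += rng.expovariate(1.0 / switch_us)
        total += lat
    return total / n_msgs
```

Even this crude model shows the IO-thread hop dominating the send cost itself, which matches the direction (if not the exact magnitude) of the ~38 ms vs ~20 ms figures from the simpy script.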
Priority: P1 → P3
Firefox OS is not being worked on
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX