Closed Bug 991420 Opened 11 years ago Closed 10 years ago

(silk) IPC Shows 1-2ms latency sending message from b2g to content process and b2g main thread to composite thread

Categories

(Core :: IPC, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED WORKSFORME
2.0 S4 (20june)

People

(Reporter: mchang, Assigned: mchang)

References

Details

(Keywords: perf, Whiteboard: [c=uniformity p=5 s=2014.06.20 u=])

Attachments

(16 files, 1 obsolete file)

442.40 KB, image/png
Details
271.25 KB, image/png
Details
8.97 KB, patch
Details | Diff | Splinter Review
2.69 KB, text/plain
Details
47.75 KB, patch
Details | Diff | Splinter Review
1.71 KB, text/plain
Details
2.91 MB, text/html
Details
18.79 KB, application/pdf
Details
18.62 KB, application/pdf
Details
9.33 KB, patch
Details | Diff | Splinter Review
25.06 KB, application/pdf
Details
2.66 KB, patch
Details | Diff | Splinter Review
1.84 KB, text/plain
Details
21.67 KB, application/pdf
Details
32.08 KB, application/pdf
Details
34.19 KB, application/pdf
Details
For Project Butter, Vincent found out that we're taking 1-2ms to send an IPC message from the b2g process to the content process. Find ways to optimize this. The best way would be to optimize the IPC message delivery system. If that can't happen, look at using sockets to send touch events.
See Also: → 915733
Attached image IPC.png
We need to figure out and fix the IPC latency issue here. sending touch events : 2.5ms notify Refreshing : 2ms notify Composite : 2.2ms That consumes some frame budget can't be ignored.
Summary: (butter) IPC Shows 1-2ms Delay sending touch events from b2g to content process → (butter) IPC Shows 1-2ms latency sending message from b2g to content process and b2g main thread to composite thread
Cool graphs! I'd love to know how you generated them. So if I'm reading this correctly it says that the SendTouchEvent is taking 2.5ms. I see it being processed by the child process main thread as soon as its event queue is done with some previous event. You can't fix that latency with any IPC hacking I don't think. Then for the Refresh message you're posting to the exact same event loop in the child, so it necessarily has to wait for the event queue plus the time it took to process the touch event above. It looks like the child processes that message just as soon as the touch message finishes. Then for the composite message it appears that the compositor was simply busy and didn't process the message until the previous composite finished. So in summary it doesn't look like there's anything to optimize here in the IPC layer. We just need to figure out why the child process' main thread event loop is so busy (which might also have something to do with why the compositor is busy). The only way to see "IPC latency" I think is to gather measurements about the time it takes between sending a message on the main thread, the time it gets copied into the IPC pipe by the IPC thread, and then the time it gets copied out of the IPC pipe on the IPC thread in the receiving process. Everything else is just event loop congestion.
Is the graph from the systrace tool that Jerry is working on? If so please note that systrace records information in the kernel, that is, each "labelling" call is to the kernel, and this might contribute to the latency measured. I am interested in the red block before RecvRealTouchEvent() in the homescreen. Is the main thread idle waiting for incoming events, or is it still processing some other task? If it's processing some other task then we should find another sample that the main thread is idle and the IPC message is processed right away when the IPC thread posts it to the main thread in the home screen.
Depends on: 994968
Depends on: 994970
Vincent, can you share the whole systrace file that you have in comment 1? From Cervantes measurements in https://bugzilla.mozilla.org/show_bug.cgi?id=915733#c8, IPC messages should take < 1ms. Then we could see what's blocking the main thread. Thanks!
Flags: needinfo?(vlin)
Attached image longComposite.png
Thanks for the systrace Vincent. A few things stand out: 1) To answer Cervantes in comment 3, most of the red block in the homescreen app is "ProcessNextNativeEvent". If I'm understanding the code correctly in nsAppShell, this should just be from waiting for new events. 2) In many cases, when we're doing well and the CPUs have the homescreen app + compositor + b2g running, like at 3168.5ms, we have a 0.547 ms latency from sendTouchEvent-> OnMessageReceived in the Homescreen App. From Notify to CompositeToTarget, it has a 0.211 ms latency. So sometimes, IPC can send the message just fine. It's not a consistent 1-2 ms delay. 3) However, there is sporadic delay, which looks serious. For example, at the 3185 ms mark, from Notify -> CompositeToTarget, we have a 3.7 ms delay and 2 CPUs are idle. My understanding from Project Butter is that after a vsync, we should also start compositing. Vincent - From the systrace taken, is this the intended design or it has yet to be implemented? If it hasn't been implemented, it might just be that the compositor hasn't scheduled it's next composite yet. 4) We have a 1ms+ delay between SendNotifyVsyncChange and OnMessageReceived in the Homescreen app. This is probably because the first message from SendTouchEvent takes ~1.3ms and the event loop is blocked? This is probably the JS running in the Homescreen App that sets the transform on the window. @Vincent - Does #3 make sense to you? If the Compositor should have run right after the notify and we had the 3.7 ms delay, then it might be a problem with IPC. Thanks!
Flags: needinfo?(vlin)
(In reply to Mason Chang [:mchang] from comment #6) > Created attachment 8405675 [details] > longComposite.png > > Thanks for the systrace Vincent. A few things stand out: > > 1) To answer Cervantes in comment 3, most of the red block in the homescreen > app is "ProcessNextNativeEvent". If I'm understanding the code correctly in > nsAppShell, this should just be from waiting for new events. > 2) In many cases, when we're doing well and the CPUs have the homescreen app > + compositor + b2g running, like at 3168.5ms, we have a 0.547 ms latency > from sendTouchEvent-> OnMessageReceived in the Homescreen App. From Notify > to CompositeToTarget, it has a 0.211 ms latency. So sometimes, IPC can send > the message just fine. It's not a consistent 1-2 ms delay. > 3) However, there is sporadic delay, which looks serious. For example, at > the 3185 ms mark, from Notify -> CompositeToTarget, we have a 3.7 ms delay > and 2 CPUs are idle. My understanding from Project Butter is that after a > vsync, we should also start compositing. Vincent - From the systrace taken, > is this the intended design or it has yet to be implemented? If it hasn't > been implemented, it might just be that the compositor hasn't scheduled it's > next composite yet. > 4) We have a 1ms+ delay between SendNotifyVsyncChange and OnMessageReceived > in the Homescreen app. This is probably because the first message from > SendTouchEvent takes ~1.3ms and the event loop is blocked? This is probably > the JS running in the Homescreen App that sets the transform on the window. > > @Vincent - Does #3 make sense to you? If the Compositor should have run > right after the notify and we had the 3.7 ms delay, then it might be a > problem with IPC. Thanks! Nice analysis. IMO, it's a problem. There's a Vsync's Notify() in b2g main thread at 3185 ms, but CompositeToTarget took place in Compositor thread at 3189 ms. As Vsync Notify(), it just to schedule a task(CompositeToTarget) to Compositor thread right away. The latency is expected to be less than 1ms. But is it more like scheduling issue instead of IPC issue ? For IPC latency, we should look at the time stamp of sendTouchEvent/recvTouchEvent in b2g/Homescreen process.
Flags: needinfo?(vlin)
After analyzing a bit more, Blake and I found a few other interesting things. First, the thin line above each thread is grey, blue, or green. Grey means that the thread is sleeping. Green means that the thread is executing. Blue means that the thread is ready to run, but hasn't been scheduled on a CPU (http://bigflake.com/systrace/). I'm assuming that means when a thread is sleeping, but becomes runnable, it received a message that says wake up and has already processed the IPC message on the IPC thread, and now the main thread is waiting to process the message. If this is the case, we have a few things to worry about that might not be the IPC subsystem itself. 1) From the original screenshot you have, we see we have a long delay where the Compositor thread is ready to execute (thin blue line), but is not scheduled by the OS. We see that CPU 2 is free doing nothing. I looked through the systrace you sent me and the compositor thread does not have any CPU affinity, so it should be able to execute on CPU 2. If this is the case, bug 980027 might fix the latency between vsync + compositor thread. 2) If we send a SendTouchEvent + Notify Refresh to the same thread on the homescreen app, and dispatching an event takes 1.3 ms, we'll always have 1.3ms+ latency sending the Notify Refresh. We might have to open an app specific bug on OnMessageReceived in the Homescreen App to remove this latency or have the Notify Refresh on another thread to not be blocked by the SendTouchEvent. Bug 976605 might also help here. 3) In LongComposite.png attachment, we see that the Compositor is ready to run for 3.815 ms but isn't scheduled onto any CPU for a while and is blocked by the Homescreen App on CPU3. In addition, the delay between the end of sendTouchEvent on b2g to the thread turning Runnable on the Homescreen App is 0.093 ms, which should be fast enough. It also indicates that IPC is probably fast enough. So what we may be seeing isn't an IPC issue but an OS scheduling issue like Vincent said in comment 7. A few things to check: 1) Vincent, how are you running systrace? What command are you using and do you have a branch with vsync somewhere? If not, which patches have you applied? I'd like to recreate the systrace locally. 2) I'd like to see if we have the same issues with raising the thread priority of the Compositor and test with bug 980027 to see if it is a scheduling issue. 3) I'll check the IPC latency as Vincent described in comment 7 to time stamp sendTouchEvent on the b2g process until the Homescreen app gets it. @Vincent - Can you answer questions 1 please? Thanks!
Flags: needinfo?(vlin)
Attached patch trace-log.patchSplinter Review
Enable systrace, remove some profiling points.
(In reply to Mason Chang [:mchang] from comment #8) > 1) Vincent, how are you running systrace? What command are you using and do > you have a branch with vsync somewhere? If not, which patches have you > applied? I'd like to recreate the systrace locally. > 2) I'd like to see if we have the same issues with raising the thread > priority of the Compositor and test with bug 980027 to see if it is a > scheduling issue. > 3) I'll check the IPC latency as Vincent described in comment 7 to time > stamp sendTouchEvent on the b2g process until the Homescreen app gets it. > > @Vincent - Can you answer questions 1 please? Thanks! For the current staus, you can patch WIP divided by the following bugs. Profiling of Vsync is included in Bug 987527. Bug 980241 - Vsync-triggered RefreshDriver Bug 987523 - Vsync-triggered CompositorParent Bug 970751 - Deliver input events more consistently per composite to make scrolling smoother Bug 987527 - Register Vsync monitor to HWComposer and notify to gecko/hal Bug 987529 - Implement an observer manager for vsync in gecko/hal You may also need to check attachment 8406669 [details] [diff] [review] in Comment 9. systrace comment. (Please update the latest systrace.py) ./systrace.py --time=5 -o mynewtrace_setting.html sched gfx
Flags: needinfo?(vlin)
Attached file Build Error
Hey Vincent, I can't get this to build at all, I always have problems with ATRACE not being defined. I applied the patches in this order: https://bug980241.bugzilla.mozilla.org/attachment.cgi?id=8405953 https://bug987523.bugzilla.mozilla.org/attachment.cgi?id=8405955 https://bug970751.bugzilla.mozilla.org/attachment.cgi?id=8405956 https://bug987527.bugzilla.mozilla.org/attachment.cgi?id=8402555 https://bug987529.bugzilla.mozilla.org/attachment.cgi?id=8405288 attachment 8406669 [details] [diff] [review] I always get the attached build error. Are you doing something special with your .userconfig or have applied updates not posted in those bugs? From what I've seen, atrace_begin should be defined in <cutils/trace.h>, which is already included in trace.h, so I'm probably missing a #export or something. Are you building for a nexus 4 or nexus-4 kit kat? Thanks!
Flags: needinfo?(vlin)
Depends on: 987162
Attached patch butter.patchSplinter Review
All the patches from comment 11 rolled into one in case anyone else needs it. I can't get systrace to work as expected and don't see markers nor other processes in my trace files. I'm going to try to manually instrument timestamps to see if I can reproduce. Vincent, if you have any suggestions on how to get systrace working, please let me know. Thanks!
Attachment #8408591 - Attachment is obsolete: true
After digging into this, I think this is more of a scheduling issue than an IPC issue. If you look at your original systrace, immediately after Notify is called on the compositor thread, a thin line on top of the Compositor thread turns blue, informing that thread it has some work to do. Next, we don't see the Compositor thread actually scheduled for some time later. There is a slight delay between when the compositor is scheduled on CPU 2 until CompositeToTarget executes, which is probably processing the IPC message. This delay is usually ~0.2ms. To further enforce this, I instrumented CompositorParent::Notify and CompositorParent::CompositeToTarget to see the delay. This is on the homescreen app with orangutan to insert kernel touch events. We generally see a < .25 ms delay between Notify -> Composite. However, we sometimes see very large delays like up to 16 ms. I tried applying bug 980027, which sets the Compositor to a RT priority. The results are much better, with the longest delay being 1ms, with most < .2ms. Once in a while, we can see a 24 ms long delay with the Compositor thread having RT priority, but this is because CompositeToTarget is blocked (see 1652ms in the systrace). I can't get systrace to work with the Homescreen App, so I'll continue looking in there to see if there is a Cross-Process delay, rather than just Cross-Thread delay. So some things we can do so far: 1) If the compositor misses a vsync because a composite takes too long, we should just wait until the next vsync to composite. Probably have to update that in bug 987523. I think this is in the design docs, just not updated in the current patch. 2) We should investigate why we have pretty long WaitForevers or use triple buffering (I think triple buffering would fixes this?) 3) We should finish up bug 980027 which sets the Compositor to a RT priority. That would fix up the delay between the Vsync notify / CompositeToTarget. As for the delay between the sendTouchEvent + vsync change, since we only send touch events on vsyncs with project butter, can we combine that message into a general VSync message with an optional payload of a touch event? This would alleviate the second SendNotifyVsyncChange -> OnMessageReceived and minimize IPC latency. That leaves us with one sendTouchEvent -> OnMessageReceived from the b2g process -> Homescreen process that has a delay. I'm still thinking about what we can do there, but I'm wondering if that's just related to the Homescreen App in that it does some JS on every touch event. Most of the other apps should just scroll, so just sending the input to APZ controller should be good enough.
Flags: needinfo?(vlin)
Attached patch compositor.patchSplinter Review
Hi Vincent, can you try this patch and see if you're seeing reduced latency from Notify -> CompositeToTarget on your test environment? Thanks!
Flags: needinfo?(vlin)
Oh and change the pref in mozilla-central/b2g/app/b2g.js, pref hal.gonk.compositor.rt_priority to 1. Thanks!
Depends on: 998379
I think the way forward on this bug involves a couple of things. It's not clear that this is an IPC issue versus the IPC delay is just a symptom of a combination of other issues. I've been reading a lot on other bugs which make me think there are other issues that are causing the latency. From the original bug: 1) Latency between vsync Notify -> CompositeToTarget - Bug 980027 - Set a higher priority on the Compositor thread - This helps with the latency a lot from ~1ms to ~0.2ms, which is what Chrome for Android seems to be experiencing. However, if we composite over 16 ms and break our frame budget, we waste a lot of CPU cycles and become blocked on waiting for fences, which can take up to 24 ms. See bug 977975 , which has a lot of background work on the fencing issue. If we stop compositing if we break our frame budget and drop a frame, then start again at the next vsync, this issue might go away. I'm still a bit blurry on the whole fencing issue. The other issue with this is that we have to make sure we don't schedule the compositor over Audio, which will probably require some tweaking. 2) Latency between Vsync Notify and ContentProcess - Bug 998379 - Combine the vsync notify + sendTouchEvent into one IPC message, which should eliminate this latency. 3) Latency between sendTouchEvent to Content Process - I still have to double check and see if this latency occurs because we use requestAnimationFrame on the Homescreen app which causes the Homescreen app to be scheduled very often. Will check with a standard scroll app like settings. Vincent, what do you think about 1 and 2? Thanks!
As we had discussed off-line, this is really a schedule problem. With high-loading, the latency would be increased dramatically up to several hundred ms. There are already some discussion of changing priority at bug 870294. Combining vsync and touch event change no much since context switching is still there for receiving touch events at parent. Please check bug 998379.
Bug 870294 looks rather old. I think that work is subsumed by bug 980027.
Depends on: 980027
This is using orangutan + the butter patches. I instrumented and got timestamps using PR_Now() at sendTouchEvent + TabChild::RecvRealTouchEvent. We see some pretty long delays between the sendTouchEvent / RecvTouchEvent in the child process. This is still probably event loop congestion / scheduling issues.
Please patch the WIP in order. trace-log (attachment 8406669 [details] [diff] [review]) bug-987529-fix bug-987527-fix bug-988160-fix bug-970751-fix bug-980241-fix bug-987523-fix Do you have N4 device ? Just suggest you to work on device based on Android 4.3. I also encounter same systrace issue on N5 based on 4.4. Haven't figured it out yet. (In reply to Mason Chang [:mchang] from comment #20) > I think the way forward on this bug involves a couple of things. It's not > clear that this is an IPC issue versus the IPC delay is just a symptom of a > combination of other issues. I've been reading a lot on other bugs which > make me think there are other issues that are causing the latency. > > From the original bug: > > 1) Latency between vsync Notify -> CompositeToTarget - Bug 980027 - Set a > higher priority on the Compositor thread - This helps with the latency a lot > from ~1ms to ~0.2ms, which is what Chrome for Android seems to be > experiencing. However, if we composite over 16 ms and break our frame > budget, we waste a lot of CPU cycles and become blocked on waiting for > fences, which can take up to 24 ms. See bug 977975 , which has a lot of > background work on the fencing issue. If we stop compositing if we break our > frame budget and drop a frame, then start again at the next vsync, this > issue might go away. I'm still a bit blurry on the whole fencing issue. The > other issue with this is that we have to make sure we don't schedule the > compositor over Audio, which will probably require some tweaking. make triple-buffering FrameBufferSurface(bug 988160) can fix it(waitForever, waiting fences). I try it on N4 only, but it's supposed to be a general solution. The principle is that if composite ends after this Vsync, it just misses the chance to swap buffer, so driver will keep locking the same buffer on the next Vsync. Triple-buffering makes an intermediate buffer so that producer can always get a valuable buffer generally. bug 980241 will make it start again in the next Vsync. You can check the WIP. Vsync Notify() - unplug to ignore Vsync. RecvUpdate() - plug in to listen to Vsync. > 2) Latency between Vsync Notify and ContentProcess - Bug 998379 - Combine > the vsync notify + sendTouchEvent into one IPC message, which should > eliminate this latency. I don't understand it yet. Let me reply in bug 998379 after reading it completely. > 3) Latency between sendTouchEvent to Content Process - I still have to > double check and see if this latency occurs because we use > requestAnimationFrame on the Homescreen app which causes the Homescreen app > to be scheduled very often. Will check with a standard scroll app like > settings. > > Vincent, what do you think about 1 and 2? Thanks!
Flags: needinfo?(vlin)
1. IPC. I agree with some idea of Comment 2. We should dig more into it before claim it a "IPC latency". What we've seen is just the latency between Send/Recv which includes IPC, but not sure it's the bottleneck. We can just focus on Send/Recv TouchEvent. The latency of Send/Recv Vsync obviously will be affected by the processing time of Recv TouchEvent. It's kind of event loop congestion issue. Anyway, dealing with touch events prior to Recv Vsync in Content is what we expected. In parent side, we also have to guarantee the order of notification is input module first and the rest of all are later(see NotifyVsyncChange() in Hal.cpp of bug 987529). 2. scheduling. I'm curious on the notification to Composite thread. Obviously, it turns into "Runnable" state fast as receiving notification, but why it takes so long to be "running" state ? Do we conclude that just let bug 980027 fix it ?
(In reply to Vincent Lin[:vilin] from comment #24) > > Do you have N4 device ? Just suggest you to work on device based on Android > 4.3. > I also encounter same systrace issue on N5 based on 4.4. Haven't figured it > out yet. I've been doing all my testing on an N4 device with Android 4.3. (In reply to Vincent Lin[:vilin] from comment #25) > > 2. scheduling. > > I'm curious on the notification to Composite thread. Obviously, it turns > into "Runnable" state fast as receiving notification, but why it takes so > long to be "running" state ? Do we conclude that just let bug 980027 fix it ? I think it takes a long time to become in the running state because the OS doesn't schedule the Compositor. The OS scheduler just says the Compositor thread is ready to run. We see from https://bugzilla.mozilla.org/attachment.cgi?id=8408599, that our notification times dramatically decrease. I think bug 980027 will fix this latency issue.
(In reply to Mason Chang [:mchang] from comment #26) > (In reply to Vincent Lin[:vilin] from comment #24) > > > > Do you have N4 device ? Just suggest you to work on device based on Android > > 4.3. > > I also encounter same systrace issue on N5 based on 4.4. Haven't figured it > > out yet. > > I've been doing all my testing on an N4 device with Android 4.3. Sometimes I encountered this issue each time I re-flash gecko. Reboot your device and try again. BTW. Can you see the process name ? Try to remove the hidden property in script.js http://stackoverflow.com/questions/16644116/unable-to-view-html-trace-report-generated-by-systrace-tool-in-android-sdk > > (In reply to Vincent Lin[:vilin] from comment #25) > > > > 2. scheduling. > > > > I'm curious on the notification to Composite thread. Obviously, it turns > > into "Runnable" state fast as receiving notification, but why it takes so > > long to be "running" state ? Do we conclude that just let bug 980027 fix it ? > > I think it takes a long time to become in the running state because the OS > doesn't schedule the Compositor. The OS scheduler just says the Compositor > thread is ready to run. We see from > https://bugzilla.mozilla.org/attachment.cgi?id=8408599, that our > notification times dramatically decrease. I think bug 980027 will fix this > latency issue. Then I found the scheduling issue also happens to Send/Recv TouchEvent. That being the case, I don't get your idea to propose bug 998379. Thinker's comments convinced me more so far. The latency was already there before we had the Vsync notification IPC.
> Sometimes I encountered this issue each time I re-flash gecko. > Reboot your device and try again. > > BTW. Can you see the process name ? Try to remove the hidden property in > script.js > http://stackoverflow.com/questions/16644116/unable-to-view-html-trace-report- > generated-by-systrace-tool-in-android-sdk > I always see the process name. I will try removing that hidden property. > > Then I found the scheduling issue also happens to Send/Recv TouchEvent. That > being the case, I don't get your idea to propose bug 998379. Thinker's > comments convinced me more so far. The latency was already there before we > had the Vsync notification IPC. Yes it is a scheduling issue. The only benefit is that if we combine the touch event + vsync event, we remove 1 opportunity to have a bad scheduling decision. As shown in bug 998379, we may schedule the first touch message just fine, then have a bad scheduling conflict for the vsync message. So the root is a scheduling issue, i agree with you and Thinker. I guess we can bypass 998379 for now. Is our only option to improve the latency for the latency between the b2g process -> content process bug 915733, or any thoughts on how to improve the scheduler latency in this case?
(In reply to Mason Chang [:mchang] from comment #28) > > Sometimes I encountered this issue each time I re-flash gecko. > > Reboot your device and try again. > > > > BTW. Can you see the process name ? Try to remove the hidden property in > > script.js > > http://stackoverflow.com/questions/16644116/unable-to-view-html-trace-report- > > generated-by-systrace-tool-in-android-sdk > > > > I always see the process name. I will try removing that hidden property. > > > > Then I found the scheduling issue also happens to Send/Recv TouchEvent. That > > being the case, I don't get your idea to propose bug 998379. Thinker's > > comments convinced me more so far. The latency was already there before we > > had the Vsync notification IPC. > > Yes it is a scheduling issue. The only benefit is that if we combine the > touch event + vsync event, we remove 1 opportunity to have a bad scheduling > decision. As shown in bug 998379, we may schedule the first touch message > just fine, then have a bad scheduling conflict for the vsync message. So the > root is a scheduling issue, i agree with you and Thinker. I guess we can > bypass 998379 for now. > > Is our only option to improve the latency for the latency between the b2g > process -> content process bug 915733, or any thoughts on how to improve the > scheduler latency in this case? So far, my understanding is bug 915733 is the only effective option. That's to reduce context switch to reduce IPC latency. But I still have two questions. 1. why does it still have ~2ms latency on notification to Composite thread ? It's scheduled in the same process so the context switch is supposed to be lightweight. 2. why boost priority of Composite thread can reduce latency ? Intuitively, it's so undoubted. But in IPC introduction presented by Cervantes, I had asked if IO thread or worker thread can be prioritized ? From the answer, my understanding is it's just like a FIFO. Perhaps the fact is due to preemption, but not scheduling ? https://dl.dropboxusercontent.com/u/2608927/IPC_Overview.html
Flags: needinfo?(cyu)
(In reply to Vincent Lin[:vilin] from comment #29) > > But I still have two questions. > 1. why does it still have ~2ms latency on notification to Composite thread ? > It's scheduled in the same process so the context switch is supposed to be > lightweight. I am not sure how you got this figure. Could you explain more on this? Are you sure the Compositor thread is idle before it receives the notification? It could be that there is still a task running on the compositor thread so the notification is queued. The compositor thread is not a good target for experimenting IPC delay because it regularly has tasks to run. > 2. why boost priority of Composite thread can reduce latency ? Intuitively, Maybe increasing the priority makes the compositor thread have more chance to be scheduled. How much reduction do we have? And how much priority do you increase?
Flags: needinfo?(cyu)
> > 2. why boost priority of Composite thread can reduce latency ? Intuitively, > Maybe increasing the priority makes the compositor thread have more chance > to be scheduled. How much reduction do we have? And how much priority do you > increase? We boost the priority of the compositor thread to real time priority 1 instead of a nice value. This means that when the compositor is ready to run, it gets scheduled before anything else in the system. That's why we see a reduction. If you look at the attachments without real time priority (https://bugzilla.mozilla.org/attachment.cgi?id=8408596) versus with real time priority, we see a 1.0ms latency reduced to 0.26ms latency. This is because the compositor preempts everything else and gets scheduled. When we have large latencies with RT priority, it's usually because the compositor is waiting for a buffer (blocked in waitForever), which triple buffering should fix.
(In reply to Cervantes Yu from comment #30) > (In reply to Vincent Lin[:vilin] from comment #29) > > > > But I still have two questions. > > 1. why does it still have ~2ms latency on notification to Composite thread ? > > It's scheduled in the same process so the context switch is supposed to be > > lightweight. > I am not sure how you got this figure. Could you explain more on this? Are > you sure the Compositor thread is idle before it receives the notification? > It could be that there is still a task running on the compositor thread so > the notification is queued. The compositor thread is not a good target for > experimenting IPC delay because it regularly has tasks to run. (attachment 8401064 [details])It's logged by Systrace. The profiling tool developed by Google and Android developers usually use it to profile performance issue. I'm sure the Compositor thread is idle before it receives the notification. You can see the skinny bar above each thread which represents the state of thread. White represents "sleeping" Blue represents "runnable" (ready) Green represents "running" B2G main thread notified Composite thread at 618ms and Composite thread switched from "sleeping" to "runnable" very fast. But Compositor thread turned to "running" in about 2ms later. > > > 2. why boost priority of Composite thread can reduce latency ? Intuitively, > Maybe increasing the priority makes the compositor thread have more chance > to be scheduled. How much reduction do we have? And how much priority do you > increase?
(In reply to Vincent Lin[:vilin] from comment #32) > I'm sure the Compositor thread is idle before it receives the notification. > You can see the skinny bar above each thread which represents the state of > thread. > White represents "sleeping" > Blue represents "runnable" (ready) > Green represents "running" > > B2G main thread notified Composite thread at 618ms and Composite thread > switched from "sleeping" to "runnable" very fast. But Compositor thread > turned to "running" in about 2ms later. > You mean the 3rd annotation in the diagram that the b2g's main thread post to the compositor thread? That doesn't go over the IPC and I think it's pure kernel scheduling delay. Blue takes the largest portion in the 2.2 ms in the tiny bar so it's just that the kernel hasn't scheduled the compositor thread.
For the compositor thread, we're pretty sure we want to eventually make it real time because it has strict guidelines of every 16.6 ms, especially with hardware vsync. How about the delay between b2g and the content process? I was looking at the nice values for the ProcessPriorityManager, and the foreground process has a nice value of 0. Have we looked at other nice values, or why were those initial values chosen? Maybe we can boost the foreground priority to be a bit higher as well. Thinker, do you have any insight on kernel process priorities? Thanks!
Flags: needinfo?(tlee)
I had mentioned at https://bugzilla.mozilla.org/show_bug.cgi?id=998379#c12. It is about 0.016ms for unagi. There is a test program, we could run it yourself on the target devices.
Flags: needinfo?(tlee)
In theory, once a RT thread/process is on running state. It would be scheduled immediately if not any other RT thread. But, for now, both IO threads and compositor thread are required being scheduled to trigger a composition. So, RT priority is not really work if just apply RT on the compositor thread.
This might shed some light on reducing the context switches in the IPC layer. Let me jump to the conclusion first: we might not need the IPC thread as much as we thought and could possibly get rid of it for IO system calls, leaving only polling on it. Generally, we move things off the main thread for 1. reducing the load of the main thread, and/or 2. not blocking the main thread. For offloading the main thread, I think it's only a valid reason if we queued several IPC messages in the IPC thread and then send them out in one batch. Otherwise the saving of not running the (nonblocking) IO on the main thread could get offset by the cost of context switches. For not blocking the main thread, this is only a problem if we often get blocked with the IPC channel, i.e. we get EAGAIN in sendmsg() and then need to epoll() to get notified when we have more send buffer in the kernel. If we seldom get blocked in sending the IPC messages, we might be better off doing IO directly on the main thread and handle the seldom blocks (maybe redo later after processing some other tasks). This is for the sending side. Receiving IPC messages always involve some form of blocking or waiting before for message to come in, that's why we might still need the IPC thread, but only for polling. To know how much we need the IPC thread, I made some measurement of the IPC channel (please refer to the attachment for the instrumentation for collecting the statistics). My test include swiping the homescreen, opening the gallery and editing images, opening the settings and adjusting the screen brightness, opening the comms app and making a call, and things like that. To my surprise, I *never* get blocked in sendmsg() calls in the chrome process, that is MessageLoopForIO::current()->WatchFileDescriptor() for sendmsg() never happened. Also, when we call IPC::Channel::ChannelImpl::ProcessOutgoingMessages(), the average output queue size is *1.0*. That is, there is no batch sending in my test. We schedule the IPC thread just in order to send each single message. Here are the numbers: blocked: 0/ 11719 (in sending 11719 messages, we get EAGAIN 0 times). Avg queue length: 11719/11719 = 1.000000 (each ProcessOutgoingMessages() sends exactly 1 message). I have to say that this is not a thorough test and it only covers the sending part. But the figures gives us a picture of how much we need the IPC thread for IO. We've seen some async events getting sensible delays, especially when the phone has moderate to high load. Onloading the IPC calls on the main thread (or other worker loops) might be helpful in these cases.
I spent some time measuring the amount of time the Compositor thread sleeps versus when we start to Composite. Since I'm new the code base, I didn't realize that we just post tasks on the compositor thread rather than send an IPC message. Essentially, I measured the time spent from when we PostTask onto the message_loop of the compositor thread, measured the time from when we send the signal for the Compositor Thread to wake up to when it actually wakes up, and finally from when the thread wakes up until we start to composite. In the cases where the Compositor::Composite time is greater than 1 ms, almost all of it is dominated by waiting for the thread to wake up and start processing messages from the message_loop. Here's a sample log: I/Gecko ( 177): [Latency PostTask 177] - 0.122082 ms // Time of Compositor::DispatchTask I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 2.563711 // message_loop::run() I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 2.807874 // CompositorParent::Composite I/Gecko ( 177): [Latency PostTask 177] - 0.000000 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 2.472150 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 2.594232 I/Gecko ( 177): [Latency PostTask 177] - 0.000000 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 2.441630 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 2.563711 I/Gecko ( 177): [Latency PostTask 177] - 0.030520 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 2.319548 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 2.472150 I/Gecko ( 177): [Latency PostTask 177] - 0.000000 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 2.594231 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 2.716313 I/Gecko ( 177): [Latency PostTask 177] - 0.030521 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 2.227987 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 2.380589 I/Gecko ( 177): [Latency PostTask 177] - 0.030520 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 3.052038 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 3.204639 I/Gecko ( 177): [Latency PostTask 177] - 0.000000 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 2.563712 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 2.716314 // This one is also bad, 0.7ms delay I/Gecko ( 177): [Latency PostTask 177] - 0.030521 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 0.915612 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 1.617580 // This one is odd I/Gecko ( 177): [Latency PostTask 177] - 0.030521 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 2.624752 I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 6.348237 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 7.080727 I/Gecko ( 177): [Latency PostTask 177] - 0.030520 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 8.484664 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 8.820387 I/Gecko ( 177): [Latency PostTask 177] - 0.000000 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 7.233328 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 7.538532 I/Gecko ( 177): [Latency PostTask 177] - 0.000000 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 0.518846 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 0.579887 I/Gecko ( 177): [Latency PostTask 177] - 0.000000 ms I/Gecko ( 177): [Latency-581] MessagePumpDefault::Wake up delay: 0.457807 I/Gecko ( 177): [ipc-latency-581] Compositor::MeasureIPC is: 0.488327 Another interesting note is that we take about ~0.2 - 0.3ms from the message_loop::run() until we finally start executing in the compositor. Sometimes, just going from the message_loop to the compositor takes 0.7ms, but the Compositor delay is definitely a scheduling issue. I will do the same measurements as Ben suggested in Comment 2 for the cross process message. From the sounds of it in https://bugzilla.mozilla.org/show_bug.cgi?id=994970#c4, we may need to rework how touch events are sent to the content process as in bug 930939.
(In reply to Cervantes Yu from comment #37) > Generally, we move things off the main thread for 1. reducing the load of > the main thread, and/or 2. not blocking the main thread. The other big reason is that we have multiple threads in the parent and multiple threads in the child that all communicate over the same pipe, so messages must be written sequentially. We cannot have multiple threads writing into the same pipe unless we can somehow guarantee that all writes are atomic. Since our messages are of arbitrary size I don't think we can guarantee that.
Oh, and another! We can't guarantee that a child stays responsive. If it blocks for an extended amount of time our pipe could fill up and every write call will fail. We can't have the parent main thread keep looping in that situation.
Just some data, here is the delay from Hal::SendNotifyVsyncChange, which I changed to be async. We see some delays up to 10ms. Will dive further.
Attachment #8423627 - Attachment description: HalParent::Send async time in Microseconds → HalParent::Send async time in Milliseconds
After chatting with :bent, we had a few leads. I checked out the latest IPC code from Chromium and we didn't modify the parts of the IPC code path that we're going through. Ben thought there might be a bug in IPC that chromium might have fixed. I also tried to instrument MessageChannel::Send() and ProcessLink::SendMessage, which holds two locks, which we think is the cause of the delay. However, adding printfs changes the output enough that I can't reproduce the delays from comment 42. I'm going to try a couple of things to see if this is just an OS scheduling issue. 1) Raise the IPC thread's priority. This will help in the case where we send the message and the main thread is scheduled out by the OS. 2) Send a signal from HalParent to HalChild instead of using IPC. I'm on regression hunting duty this sprint so this data will take a while to get to.
I measured the round trip time from the HalParent::SendNotifyVsync to the HalChild::RecvNotifyVsync, and we have very large delays. All times are in milliseconds. We see that we sometimes have delays of up 50 milliseconds! This is too large. I tried setting the IPC thread on the parent process to nice 20, and nice -20 and we see that when we raise the IPC thread priority ONLY on the main process (we don't have support of changing thread priorities on child process), the average and std deviation goes down a bit but still not enough. 10ms is still too much that we blow our frame budget. I'm going to investigate the second option, of sending a signal.
(In reply to Mason Chang [:mchang] from comment #44) > I measured the round trip time from the HalParent::SendNotifyVsync to the > HalChild::RecvNotifyVsync, and we have very large delays. All times are in > milliseconds. We see that we sometimes have delays of up 50 milliseconds! > This is too large. Where is the time being spent? Is the IPC thread just sitting idle? Or is it working through sending/receiving other messages?
Summary: (butter) IPC Shows 1-2ms latency sending message from b2g to content process and b2g main thread to composite thread → (silk) IPC Shows 1-2ms latency sending message from b2g to content process and b2g main thread to composite thread
I haven't had any chance to look at this. Will get to it after all the regressions :(
Attached file ipcData Sheet1.pdf
I tried measuring various parts of the IPC pipeline when doing a single scroll on the vertical home screen. Unfortunatly, adding printfs and calls to TimeStamp::Now mess up the data a bit, but it gives us some idea where some time is spent across a HalParent::SendVsync / HalChild::RecvVsync. Attached are the raw data. I tried to instrument 4 stages: 1) HalParent::SendVsync 2) Time from when the HalParent::SendVsync message is put on the IPC queue until the IPC thread on the parent process deques. Instrumented at the ioThread->PostTask -> Channel::Send. This is Parent IPC Thread Scheduling. Unfortunately, this is very noisy as we send the data to multiple children and I couldn't filter out only to the foreground process. Ben, if you have any ideas with this, that would be helpful. 3) Channel Send: This is the length of time it takes to send the message on the IPC thread. 4) Time from Parent::Channel::Send -> Content IPC Thread::Dequeue - I couldn't find a good way measure this. 5) Time from the ContentIPC::ProcessMessageReceived -> Main Thread::DispatchMessage. What we see is multiple > 1ms delays in various stages of the pipeline. For example, we sometimes get multiple 1.8ms delays just Sending the vsync message (1) in the content process, sometimes up to 4ms. This is probably waiting on locks. We can also get a 2ms delay from stage (2), waiting for the IPC thread to wake up. Another large delay comes from the content ipc thread -> parent process main thread (5). While sometimes very large delays ~100ms, I think this is noise from something else. However, we still see some 0.5, 1ms delays. Another thing is that the raw data got a lot noisier due to the instrumentation. A better sample would be from attachment 8423627 [details], which has the data roundtrip from HalPArent -> HalChild w/o IPC instrumentation. Most of the time, the delay is acceptable and under 1ms. So now we can decide if we want to: 1) Bypass IPC and use something like a signal to only the foreground process. I want to experiment with this. 2) Try to improve the IPC subsystem situation 3) Live with the latency. This might be ok because we want to rework APZ to send touch events directly to the compositor on the parent process and with APZ, we overpaint. As long as we paint in time (which is a big if), the latency shouldn't matter. This is bug 930939. Just to clarify, we only send IPC messages across process to tick the refresh driver.
Err sorry, correction for comment 47. Better sample for HalParent -> HalChild would be with attachment 8434224 [details].
Attached file Notify VSync w/ Signal
I did a prototype implementation of using a kernel signal to notify vsync instead of IPC. I couldn't do the exact same test as I did with the previous attachment due to the limitation within the signal handler, but I redid the tests instead. The test was on the vertical home screen, just try to scroll back and forth as fast as I could. The test data of using IPC vs a signal is attached. The signal latency is very good. Most of the time, it's < .1 milliseconds, which is fantastic latency. Sometimes, we get 1 ms, and in the worst case, we got 4ms delay, which is significantly better than the IPC delay of 50ms. Interestingly, before I was using orangutan to do 1 swipe. Since I couldn't do that exactly with the signal due to the way we measured data, I redid the tests with IPC. It looks like we're actually flooding the IPC system as we send vsync messages, can't process, and the latency gets monotonically worse until are finally able to catch up and the latency falls back down. I think from this data, using a signal handler to tick the refresh driver on the content process is the way to go.
After lots of data analysis and understanding of the architecture, I think we can close this bug out unless someone objects. I'll let this sit for a few days before I close it out. Going forward: 1) Sending a message from a hardware composer vsync -> Compositor. This didn't actually go through IPC, but the latency was there. We solve this by raising the compositor thread priority to real time 1. This can be tracked in bug 982972. 2) Sending a vsync message from hardware composer -> Foreground Child Content Process::nsRefreshDriver::Tick. We can lower the latency using a signal. This can be tracked in bug 994970. All other background apps refresh drivers don't need to be ticked on vsync. 3) vsync message -> system process refresh driver. Since this is only the system app, the latency isn't that critical. We can directly dispatch the nsRefreshDriver::Tick on the hwc composer thread on the parent process. All other vsync messages are not that latency critical. One note is that in bug 930939, we will process touch events off the main thread in the parent process. We may be able to do that on the compositor thread so no latency issues there.
Resolving from comment 50.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Whiteboard: [c=uniformity p=5 s= u=] → [c=uniformity p=5 s=2014.06.20 u=]
Target Milestone: --- → 2.0 S4 (20june)
See Also: → 1035076
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: