Closed Bug 977975 Opened 7 years ago Closed 7 years ago

composites running slower than 60fps on b2g 1.4

Categories

(Core :: Graphics, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
1.4 S3 (14mar)
blocking-b2g 1.4+
Tracking Status
b2g-v1.4 --- fixed

People

(Reporter: bkelly, Unassigned)

References

Details

(Keywords: perf, Whiteboard: [c= p=5 s= u=])

Attachments

(6 files, 3 obsolete files)

Lately on b2g v1.4 I have seen a number of profiles like this one from bug 975831 comment 54:

  http://people.mozilla.org/~bgirard/cleopatra/#report=91799a05276e72b06beb7d1f038dd55221c95c18

Look at time range 6316 to 6726.

Here you can see a 70+ ms rasterize.  If you look at what the child process is doing you can see that IPC MessageChannel::SendAndWait shows up a few times waiting for a response.

While its unclear if this is due to multiple messages or a single call from the profile, it shows up enough that I started to wonder if IPC is getting delayed or stalling delivering messages to the parent process.

To investigate further I hacked in some instrumentation in SendAndWait() to print to logcat when a synchronous IPC call takes greater than 10ms.  I also then enabled IPC logging (without DEBUG).

The IPC logging shows that most messages are delivered in a timely fashion:

I/GeckoIPC(11713): [time:1393560810093386][11713->11550][PLayerTransactionChild] Sending Msg_PGrallocBufferConstructor([T
ODO])
I/GeckoIPC(11550): [time:1393560810093983][11550<-11713][PLayerTransactionParent] Received Msg_PGrallocBufferConstructor(
[TODO])
I/GeckoIPC(11550): [time:1393560810098459][11550->11713][PLayerTransactionParent] Sending reply Reply_PGrallocBufferConst
ructor([TODO])

In this case the PGrallocBufferConstructor messages was delivered in less than 1ms.

When my instrumentation fires, though, I see that significant delay is coming in the IPC delivery:

I/GeckoIPC(11713): [time:1393560810100034][11713->11550][PLayerTransactionChild] Sending Msg_PTextureConstructor([TODO])
I/GeckoIPC(11713): [time:1393560810100983][11713->11550][PLayerTransactionChild] Sending Msg_PGrallocBufferConstructor([T
ODO])
I/GeckoIPC(11550): [time:1393560810113251][11550<-11713][PLayerTransactionParent] Received Msg_PTextureConstructor([TODO]
)
I/GeckoIPC(11550): [time:1393560810113503][11550<-11713][PLayerTransactionParent] Received Msg_PGrallocBufferConstructor(
[TODO])
E/memalloc(11550): /dev/pmem: No more pmem available
W/memalloc(11550): Falling back to ashmem
I/GeckoIPC(11550): [time:1393560810132156][11550->11713][PLayerTransactionParent] Sending reply Reply_PGrallocBufferConst
ructor([TODO])
I/Gecko   (11713): ### ### SendAndWait() PLayerTransaction::Msg_PGrallocBufferConstructor took 31 ms

Here we see a total delay of 31ms.  Of that delay ~12.5ms is in IPC message delivery and ~18.5ms in parent process work.  In this case it appears the parent process work was quite expensive due to running out of pmem.

I also delays in PLayerTransaction::Msg_Update:

I/GeckoIPC(11713): [time:1393560812133136][11713->11550][PLayerTransactionChild] Sending Msg_Update([TODO])
E/msm7627a.hwcomposer(11550): hwc_set: Unable to render by hwc due to non-pmem memory
I/GeckoIPC(11550): [time:1393560812146536][11550<-11713][PLayerTransactionParent] Received Msg_Update([TODO])
I/GeckoIPC(11550): [time:1393560812147369][11550->11713][PLayerTransactionParent] Sending reply Reply_Update([TODO])
I/Gecko   (11713): ### ### SendAndWait() PLayerTransaction::Msg_Update took 14 ms

Here the entire Msg_Update() took 14ms and ~13.5ms of that was in IPC delivery.

Note all my delayed Msg_Update() instances seem to have the hwc message about non-pmem memory interleaved between the send and receive.

Its unclear to me quite yet what is going on, but it seems we're going to have a bad time if we cannot deliver the gfx IPC messages quickly and reliably.  Its hard to prioritize work in the parent process if we are delayed finding out that we need to do it.

Now that APZ has made composition asynchronous we may be running hwc more frequently than in the past.  This is now keeping the compositor thread busy preventing it from picking up child messages in a timely fashion.

Are the hwc operations synchronous?  Can they be moved to a separate thread?  Or are there other ways we could coordinate hwc better with the client layerizing?  Are there other ways we could prioritize the IPC and gfx messages so that we have some limit on how long they will be delayed?

Any other theories?  Or is this behavior expected?

I'll take this for now to investigate.
On ICS, hwc operations is sync. hwc operations becomes async from gonk JB by fence support. It is bug 957323.
gralloc buffer allocation is synchronous. It could be mitigated by Bug 959089.

Bug 950112 is more aggressive approach. But it can not be used on ICS, because of gralloc buffer's drivers problem. On KK, it seems possible.
gralloc buffer allocation latency is intrinsic problem. we expect that tiling could mitigate some like Bug 963073.
Yea, if I understand correctly we can reduce the gralloc IPC calls once we have bug 963073 which should help.  It concerns me, though, that I see non-gralloc IPC calls being delayed too.

Is it expected that my device would be hitting ashmem so easily?
Which device do you use? Buri/hamachi only has 8MB pmem, it is very easy to hit ashmen.
Yes, I'm on a buri.
Layer transaction could carries multiple messages if there is only one message that request sync transaction. That transaction becomes sync transaction. By removing the sync message, we could reduce the number of sync transaction. For example, ThebesLayer's rendering update is always sync, tiling could reduce the sync transaction.
(In reply to Sotaro Ikeda [:sotaro] from comment #5)
> Which device do you use? Buri/hamachi only has 8MB pmem, it is very easy to
> hit ashmen.

I was able to avoid large the 15+ ms ashmem allocations by prioritizing pmem-vs-ashmem usage.  See bug 978176.  Not sure if this is a reasonable approach or not.
Going back to the delayed IPC messages, I instrumented HwcComposer2D::Render to print any time SwapBuffers() takes more than 5ms:

I/Gecko   ( 6339): ### ### HwcComposer2D::Render() SwapBuffers() took 7 ms
I/Gecko   ( 6339): ### ### HwcComposer2D::Render() SwapBuffers() took 6 ms
I/Gecko   ( 6339): ### ### HwcComposer2D::Render() SwapBuffers() took 18 ms
I/Gecko   ( 6339): ### ### HwcComposer2D::Render() SwapBuffers() took 29 ms
I/Gecko   ( 6339): ### ### SendAndWait() PLayerTransaction::Msg_PGrallocBufferConstructor took 38 ms
I/Gecko   ( 6339): ### ### HwcComposer2D::Render() SwapBuffers() took 13 ms
I/Gecko   ( 6451): ### ### SendAndWait() PLayerTransaction::Msg_PGrallocBufferConstructor took 17 ms
I/Gecko   ( 6339): ### ### SendAndWait() PLayerTransaction::Msg_PGrallocBufferConstructor took 16 ms

From the profile it seems likely most of this is due to fb_lockBuffer().  I'll try to verify that is where the time is going.
Further debugging suggests the delay is coming from fb_lockBuffer() waiting for disp_loop() thread to perform its unlock.

02-28 17:45:52.676 E/msm7627a.gralloc(  891): ### ### fb_lockBuffer() index 2 took 13 ms for pid 891 thread 916 at priority 0
02-28 17:45:52.676 I/Gecko   (  891): ### ### HwcComposer2D::Render() SwapBuffers() took 19 ms
02-28 17:45:52.706 E/msm7627a.gralloc(  891): ### ### fb_lockBuffer() index 0 took 20 ms for pid 891 thread 916 at priority 0
02-28 17:45:52.706 I/Gecko   (  891): ### ### HwcComposer2D::Render() SwapBuffers() took 25 ms

I verified that we are throttling composition correctly.  At first I thought maybe we were just racing ahead.  It seems like the delays are being computed and used to schedule properly.

I also tried various changes to scheduling priority, forcing disp_loop() to unlock the buffer early and yield, etc.  None of these really helped.

Its unclear to me if we are getting out of sync with disp_loop() somehow or if this is just normal behavior.

Since we appear to be blocking on a lock for non-trivial times, it really seems like maybe we should move this work to another thread.
It seems we are hitting periodic delays in the FBIOPUT_VSCREENINFO ioctl sync'ing the framebuffer to the display.  While this is done on a separate thread, its impacting our compositor thread through the framebuffer lock.

For example, this is the ideal sequence of events:

### ### disp_loop begin ioctl 1
### ### fb_lockBuffer begin locking 2
### ### fb_lockBuffer finish locking 2 [FAST]
### ### disp_loop finish ioctl 1 [16ms]
### ### disp_loop unlock 0
### ### disp_loop begin ioctl 2
### ### fb_lockBuffer begin locking 0
### ### fb_lockBuffer finish locking 0 [FAST]

The disp_loop thread is busy pushing out FB 1 here, while the gecko compositor is asking for a lock on FB 2.  Since FB 2 is not in use at all, the lock is granted immediately.  Here the ioctl completes within our frame budget of ~16ms and so everything stays on track.  Gecko's next lock is also immediately granted, etc.

Sometimes, though, we will get an ioctl that is way too slow.  For example:

### ### disp_loop begin ioctl 0
### ### fb_lockBuffer finish locking 1 [FAST]
### ### fb_lockBuffer begin locking 2
### ### disp_loop finish ioctl 0 [30ms]
### ### disp_loop unlock 2
### ### disp_loop begin ioctl 1
### ### fb_lockBuffer finish locking 2 [13ms]
### ### fb_lockBuffer begin locking 0
### ### disp_loop finish ioctl 1 [15ms]
### ### disp_loop unlock 0
### ### disp_loop begin ioctl 2
### ### fb_lockBuffer finish locking 0 [8ms]
### ### fb_lockBuffer begin locking 1
### ### disp_loop finish ioctl 2 [16ms]
### ### disp_loop unlock 1
### ### disp_loop begin ioctl 0
### ### fb_lockBuffer finish locking 1 [6ms]
### ### fb_lockBuffer begin locking 2
### ### disp_loop finish ioctl 0 [15ms]

Here we see the ioctl for FB 0 taking 30ms which is also twice our frame budget.  This causes gecko to block on the FB lock for 13ms.  (Due to our current architecture this also blocks gfx related IPC calls from the child processes to do things like allocate gralloc buffers, etc.)  Since this was just a single slow ioctl, though, the system recovers over the next few cycles and the lock times drift back down towards the fast case.

Unfortunately, though, we also see long series of slow FBIOPUT_VSCREENINFO ioctl calls:

### ### disp_loop begin ioctl 2
### ### fb_lockBuffer begin locking 0
### ### fb_lockBuffer finish locking 0 [FAST]
### ### fb_lockBuffer begin locking 1
### ### disp_loop finish ioctl 2 [29ms]
### ### disp_loop unlock 1
### ### disp_loop begin ioctl 0
### ### fb_lockBuffer finish locking 1 [8ms]
### ### fb_lockBuffer begin locking 2
### ### disp_loop finish ioctl 0 [31ms]
### ### disp_loop unlock 2
### ### disp_loop begin ioctl 1
### ### fb_lockBuffer finish locking 2 [17ms]
### ### fb_lockBuffer begin locking 0
### ### disp_loop finish ioctl 1 [30ms]
### ### disp_loop unlock 0
### ### disp_loop begin ioctl 2
### ### fb_lockBuffer finish locking 0 [18ms]
### ### fb_lockBuffer begin locking 1
### ### disp_loop finish ioctl 2 [31ms]

Here we are consistently blocking the gecko compositor on the FB lock for longer than our 16ms frame budget.

All of this was measured on an ICS buri last flashed with v1.2-device.cfg OEM firmware.

Sotaro, Michael, do either of you know what could be causing such a slow down in the FBIOPUT_VSCREENINFO ioctl?  Is this a DMA or does it use the CPU to perform the copy?  Have we seen this behavior before?
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(mvines)
Here is the instrumentation I used to generate the logs in comment 11.
I was using gecko rev 53dbf19f2f8d to generate the data in comment 11.
I'd be more interested to see the results from a 8x10 device or a QRD8x28 instead of a 7x27a device such as Buri.
Flags: needinfo?(mvines)
I don't think we have devices available yet. Could you guys get some numbers while we are trying to get hardware to Ben?
Vikram, can you please take a look?
Flags: needinfo?(mvikram)
> 
> Sotaro, Michael, do either of you know what could be causing such a slow
> down in the FBIOPUT_VSCREENINFO ioctl?  Is this a DMA or does it use the CPU
> to perform the copy?  Have we seen this behavior before?

On, ICS vsync is alwasys handled implicitly within driver. I suspect it affected to the problem. But how to handle vsync is totally different since JB. It might be better to investigate about the new device as :m1 said.
Flags: needinfo?(sotaro.ikeda.g)
Vsync is a good idea, but I don't think it holds with the results I'm seeing.  If I understand correctly a delay from vsync would occur if we start the ioctl late and have to wait for the next frame.  After that we should be re-synced and only experience another delay if we miss the start again.

I added some more instrumentation to measure time between the end of the last ioctl and the beginning of the next ioctl:

03-01 21:55:26.789 E/msm7627a.gralloc( 3314): ### ### disp_loop finish ioctl 2 [15ms]
03-01 21:55:26.789 E/msm7627a.gralloc( 3314): ### ### disp_loop unlock 1
03-01 21:55:26.799 E/msm7627a.gralloc( 3314): ### ### disp_loop begin ioctl 0, 2 ms after last ioctl completed
03-01 21:55:26.809 E/msm7627a.gralloc( 3314): ### ### fb_lockBuffer begin locking 1
03-01 21:55:26.809 E/msm7627a.gralloc( 3314): ### ### fb_lockBuffer finish locking 1 [FAST]
03-01 21:55:26.819 E/msm7627a.gralloc( 3314): ### ### fb_lockBuffer begin locking 2
03-01 21:55:26.829 E/msm7627a.gralloc( 3314): ### ### disp_loop finish ioctl 0 [29ms]
03-01 21:55:26.829 E/msm7627a.gralloc( 3314): ### ### disp_loop unlock 2
03-01 21:55:26.829 E/msm7627a.gralloc( 3314): ### ### disp_loop begin ioctl 1, 0 ms after last ioctl completed
03-01 21:55:26.829 E/msm7627a.gralloc( 3314): ### ### fb_lockBuffer finish locking 2 [5ms]
03-01 21:55:26.839 E/msm7627a.gralloc( 3314): ### ### fb_lockBuffer begin locking 0
03-01 21:55:26.859 E/msm7627a.gralloc( 3314): ### ### disp_loop finish ioctl 1 [30ms]
03-01 21:55:26.859 E/msm7627a.gralloc( 3314): ### ### disp_loop unlock 0
03-01 21:55:26.859 E/msm7627a.gralloc( 3314): ### ### disp_loop begin ioctl 2, 1 ms after last ioctl completed
03-01 21:55:26.859 E/msm7627a.gralloc( 3314): ### ### fb_lockBuffer finish locking 0 [16ms]
03-01 21:55:26.869 E/msm7627a.gralloc( 3314): ### ### fb_lockBuffer begin locking 1
03-01 21:55:26.889 E/msm7627a.gralloc( 3314): ### ### disp_loop finish ioctl 2 [30ms]
03-01 21:55:26.889 E/msm7627a.gralloc( 3314): ### ### disp_loop unlock 1
03-01 21:55:26.889 E/msm7627a.gralloc( 3314): ### ### disp_loop begin ioctl 0, 0 ms after last ioctl completed
03-01 21:55:26.889 E/msm7627a.gralloc( 3314): ### ### fb_lockBuffer finish locking 1 [17ms]
03-01 21:55:26.899 E/msm7627a.gralloc( 3314): ### ### fb_lockBuffer begin locking 2
03-01 21:55:26.919 E/msm7627a.gralloc( 3314): ### ### disp_loop finish ioctl 0 [30ms]

I can see maybe we missing the vsync on that first one as the ioctl took us 15ms and then we didn't start the next one for 2ms more.  After that, though, I would expect us to resync as our delayed between ioctls are 0ms to 1ms.

Or maybe a vsync conflict throws us into a bad state and another bug prevents us from recovering as expected?
After talking with Andreas on IRC he found that this might be a known issue:

  https://gitorious.org/shr/linux/commit/1ba7f00e223e2548db126e2c810d373551bb7732

My hamachi ICS checkout does not seem to include it.  Is that something that could be backported?

I will focus on re-testing with my nexus-4 now.  That is the only JB device I have at the moment.  I would be happy to test with other hardware if I can get my hands on it.
If this is an issue for 7x25/7x27 only, I wouldn't bother with backporting. We are no longer shipping those devices and the OEMs are unlikely to take kernel patches. We will have to live with it. If more modern QC silicon doesn't show this behavior, I am ok with WONTFIX-ing this bug. Testing on the JB device is a good start. We should focus performance tuning on new QC silicon.

Non-QC silicon doesn't have this driver so there is a good chance this isn't happening there. I will see if I can get you a device to verify.
Ok, I'll focus on the nexus-4 and also test my ICS tarako which is a different chipset.
I'm not sure if its the same root cause yet, but we are definitely composing too slowly on nexus-4.  We are also see IPC stalls as well.

I/Gecko   ( 1467): Compositor: Composite took 40 ms.
I/Gecko   ( 1467): Compositor: Composite took 35 ms.
I/Gecko   ( 1467): Compositor: Composite took 32 ms.
I/Gecko   ( 1772): ### ### SendAndWait() PLayerTransaction::Msg_Update took 22 ms
I/Gecko   ( 1467): Compositor: Composite took 31 ms.
I/Gecko   ( 1772): ### ### SendAndWait() PLayerTransaction::Msg_Update took 21 ms
I/Gecko   ( 1467): Compositor: Composite took 31 ms.
I/Gecko   ( 1467): Compositor: Composite took 31 ms.

Profile:

  http://people.mozilla.org/~bgirard/cleopatra/#report=223d8d70950e35b4ab3db801363c1887b19f8f51

Here we are now seeing the waits on Android::Fence::waitForever() instead of the previous fb_lockBuffer.  I think this represents the differences between ICS and JB.

I need to dig deeper to see if the underlying screen composites are still running too slow or if there is a different issue at work on the nexus-4.
I'm nom'ing this for 1.4 since slow composite times seem like a release blocker issue.  Updating the description to reflect this more concerning condition.
blocking-b2g: --- → 1.4?
Summary: gfx IPC message delivery delayed on b2g 1.4 → composites running slower than 60fps on b2g 1.4
Whiteboard: [c= p= s= u=] → [c= p=5 s= u=]
This may unblock if the problem reduces to just showing up on ICS.
blocking-b2g: 1.4? → 1.4+
(In reply to Ben Kelly [:bkelly] from comment #22)
> 
> Here we are now seeing the waits on Android::Fence::waitForever() instead of
> the previous fb_lockBuffer.  I think this represents the differences between
> ICS and JB.
> 
> I need to dig deeper to see if the underlying screen composites are still
> running too slow or if there is a different issue at work on the nexus-4.

There seems the following possibilities
- underlying layer is too slow
- compositor is doing over composition than the underlying layer expected.
- timing of composition is not correct(does not care about vsync)

On andorid JB, SurfaceFlinger do composition depend on vsync callback. But current b2g compositor did composition without caring about vsync.
That last profile had hwc disabled in developer settings.  Here's another one with hwc enabled:

  http://people.mozilla.org/~bgirard/cleopatra/#report=73a20665150eed780a871dae5ec835621f145298

Still seem to be spending most of our time in that same fence lock.

One thing to note, though, is the fence is for a TextureBind() call, not the render swap buffers.
Sotaro, can you explain comment 25 a bit more?
int SurfaceTextureClient::hook_dequeueBuffer_DEPRECATED(ANativeWindow* window,
        ANativeWindowBuffer** buffer) {
    SurfaceTextureClient* c = getSelf(window);
    ANativeWindowBuffer* buf;
    int fenceFd = -1;
    int result = c->dequeueBuffer(&buf, &fenceFd);
    sp<Fence> fence(new Fence(fenceFd));
    int waitResult = fence->waitForever(1000, "dequeueBuffer_DEPRECATED");
    if (waitResult != OK) {
        ALOGE("dequeueBuffer_DEPRECATED: Fence::wait returned an error: %d",
                waitResult);
        c->cancelBuffer(buf, -1);
        return waitResult;
    }
    *buffer = buf;
    return result;
}

Are we notifying the fence object properly after rendering to the buffer?
(In reply to Andreas Gal :gal from comment #27)
> Sotaro, can you explain comment 25 a bit more?

Since JB, SurfaceFlinger did composition based on vsync event. And OpenGL composition result is also composed by hw composer as Frame Buffer layer. Fence::waitForever() wait until hw composer release. The following diagram shows about the fence and the hwcomposer.

https://github.com/sotaroikeda/android-diagrams/blob/master/graphics/HWComposer_4.3.pdf?raw=true

Hwcomposer's buffer release(release fence) seems to be affected by vsync(screen refresh rate). Therefore b2g's compositor also need to do composition based on vsync event. I saw similar comment on Bug 970751 Comment 13.
On android JB, SurfaceFlinger's rendering is done like the following sequence. It is triggered by vsync event.

//vsync event on hwcomposer
 ->HWComposer::hook_vsync()
  ->HWComposer::vsync()
   ->SurfaceFlinger::onVSyncReceived()
    ->EventThread::onVSyncReceived()
     ->mCondition.broadcast();
      ->//emit vsync event
// receive vsync event
 ->MessageQueue::cb_eventReceiver()
  ->MessageQueue::eventReceiver()
    ->MessageQueue::Handler::dispatchInvalidate()
// receive INVALIDATE message
MessageQueue::Handler::handleMessage()
 ->SurfaceFlinger::onMessageReceived() 
   ->MessageQueue::handleMessageTransaction();
   ->MessageQueue::handleMessageInvalidate();
   ->MessageQueue::signalRefresh();
    ->MessageQueue::refresh()
     ->MessageQueue::Handler::dispatchRefresh()
// receive REFRESH message
MessageQueue::Handler::handleMessage()
 ->SurfaceFlinger::onMessageReceived() 
  ->SurfaceFlinger::preComposition();
  ->SurfaceFlinger::rebuildLayerStacks();
  ->SurfaceFlinger::setUpHWComposer();
  ->SurfaceFlinger::doDebugFlashRegions();
  ->SurfaceFlinger::doComposition();
  ->SurfaceFlinger::postComposition();
Sotaro,

Regarding vsync comment, I do not think vsync is affecting the time taken here. Since composition is asynchronous and sync fences take care of signalling, hence framework needs to wait on release fence 
set by the driver. If you still think vsync will make a difference here, I can provide patches in framework and HAL that will enable HwcComposer2D to wait for the H/W vsync signal (from driver) before starting composition of a frame. I mean wait for vsync before calling hwc prepare from HwcComposer2D.

Ben,

1. Please make sure that your build has this patch: https://bugzilla.mozilla.org/show_bug.cgi?id=957323 . Otherwise we will be waiting in HwcComposer2D for previous retire fence to signal.

2. In Comment 22, is this the time taken by TryRender() call ?
I/Gecko   ( 1467): Compositor: Composite took 31 ms.

3. Can you please disable HWC (to force GPU Composition) and check if it still takes similar time. This will narrow down the issue further.

4. Is this observed on ICS or JB or both ?

5. Nexus-4 has different drivers. Will it be possible for you to test on JB gonk reference device with newer driver version. Or if you can tell me use case, I can test myself.
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(bkelly)
To understand the problem, it might be better to test by changing the frame buffer layer's buffers from 2 to 3.

> #ifndef NUM_FRAMEBUFFER_SURFACE_BUFFERS
> #define NUM_FRAMEBUFFER_SURFACE_BUFFERS (2)
> #endif

http://mxr.mozilla.org/mozilla-central/source/widget/gonk/libdisplay/FramebufferSurface.cpp#41
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Sushil from comment #31)
> 1. Please make sure that your build has this patch:
> https://bugzilla.mozilla.org/show_bug.cgi?id=957323 . Otherwise we will be
> waiting in HwcComposer2D for previous retire fence to signal.

I have been testing with this patch.  My gecko hg rev is at 53dbf19f2f8d dated Feb 28.  I've been waiting to update further since an unrelated crasher seems to have been introduced over the weekend.

> 2. In Comment 22, is this the time taken by TryRender() call ?
> I/Gecko   ( 1467): Compositor: Composite took 31 ms.

This debug is produced by enabled COMPOSITOR_PERFORMANCE_WARNING in our CompositorParent.h.  I believe it measures the time of Render(), not TryRender().

> 3. Can you please disable HWC (to force GPU Composition) and check if it
> still takes similar time. This will narrow down the issue further.

My results in comment 22 were with HWC disabled.  However, Sotaro has told me on IRC we still mostly due GPU composition on nexus-4 since the hwc does not support all the features we need.

> 4. Is this observed on ICS or JB or both ?

I have observed something similar on ICS buri, although its not clear they have the same root cause.  There we are blocked waiting in the FBIOPUT_VSCREENINFO ioctl.

For JB I have been using nexus-4 as its the only JB device I have.

> 5. Nexus-4 has different drivers. Will it be possible for you to test on JB
> gonk reference device with newer driver version. Or if you can tell me use
> case, I can test myself.

Basically uncomment this line, rebuild, and do most anything on the device:

  http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.h#16

Right now I am trying to trace where the fence we are blocking on is coming from.  As far as I can tell its the fence being provided to BufferQueue::queueBuffer().
To clarify comment 33, "do most anything on the device" for me means swipe between homescreens pages and launch the virtual-list-app from bug 975831.  I expect you will hit the slow composites doing anything with the device, though.  It seems quite easy to trigger for me.
Flags: needinfo?(bkelly)
Ok, so this blocking on the fence is not really a problem because its essentially just gating us at vsync rate. The problem is that while we block we can't respond to ipc. So this argues for leaving the general architecture as is, but moving it to a different thread. A less ideal solution would be to delay triggering composition until vsync (that will still block ipc, but less badly).
(In reply to Ben Kelly [:bkelly] from comment #33)
> 
> Right now I am trying to trace where the fence we are blocking on is coming
> from.  As far as I can tell its the fence being provided to
> BufferQueue::queueBuffer().


Release fence comes from hw composer. It seem to be set to FramebufferSurface by the either code.

http://mxr.mozilla.org/mozilla-central/source/widget/gonk/HwcComposer2D.cpp#580

http://mxr.mozilla.org/mozilla-central/source/widget/gonk/HwcComposer2D.cpp#619
(In reply to Sotaro Ikeda [:sotaro] from comment #36)
> Release fence comes from hw composer. It seem to be set to
> FramebufferSurface by the either code.
> 
> http://mxr.mozilla.org/mozilla-central/source/widget/gonk/HwcComposer2D.
> cpp#580
> 
> http://mxr.mozilla.org/mozilla-central/source/widget/gonk/HwcComposer2D.
> cpp#619

Hmm, that seems to be a different fence:

03-03 14:05:33.077 E/Surface ( 5189): ### ### hook queue buffer 0xaf849288 with fenceFd:123
03-03 14:05:33.077 E/Surface ( 5189): ### ### queue buffer 0xaf849288 with fenceFd:123
03-03 14:05:33.077 I/Gecko   ( 5189): ### ### GonkDisplayJB::SetFBReleaseFd fd:142
(In reply to Ben Kelly [:bkelly] from comment #37)
> 
> 03-03 14:05:33.077 E/Surface ( 5189): ### ### hook queue buffer 0xaf849288
> with fenceFd:123
> 03-03 14:05:33.077 E/Surface ( 5189): ### ### queue buffer 0xaf849288 with
> fenceFd:123
> 03-03 14:05:33.077 I/Gecko   ( 5189): ### ### GonkDisplayJB::SetFBReleaseFd
> fd:142

I think it is same fence, but different handle by dup() like the following.

http://androidxref.com/4.3_r2.1/xref/frameworks/native/libs/gui/Surface.cpp#208
Another possiblity about low fps is an acquire fence of FramebufferSurface takes longer time until completion.

http://mxr.mozilla.org/mozilla-central/source/widget/gonk/libdisplay/FramebufferSurface.cpp#132
(In reply to Sotaro Ikeda [:sotaro] from comment #39)
> Another possiblity about low fps is an acquire fence of FramebufferSurface
> takes longer time until completion.
> 
> http://mxr.mozilla.org/mozilla-central/source/widget/gonk/libdisplay/
> FramebufferSurface.cpp#132

In this assumption, acquire fence delays release fence.
Sotaro, can you please explain comment 40 in more detail? Please use longer detailed explanations. I have a hard time following your half sentences :)
Looking at the profile I see us blocking on a lock in DrawQuadInternal. Why would the HWC keep a lock on the gralloc buffer that we decided to draw with the compositor? I guess I still don't understand our theory here.
(In reply to Andreas Gal :gal from comment #41)
> Sotaro, can you please explain comment 40 in more detail? Please use longer
> detailed explanations. I have a hard time following your half sentences :)

Sorry. The above is just explaining another possibility. HWC  uses two fence for the composition. "Aquire Fence" and "Release Fence". The explanation is in the bottom. HWC can not compose the buffer until aquire fence complete. Therefore, if acquire fence delayed by OpenGL driver's reason, HWC' rendering complete could be delayed. But it is just one possibility.

- Aquire Fence: Sync fence object that will be signaled when the buffer's contents are available. Set by Layer. May be -1 if the contents are already available.

- Release Fence: During set() the HWC must set this field to a file descriptor for a sync fence object that will signal after the HWC has finished reading from the buffer.

https://github.com/sotaroikeda/android-diagrams/blob/master/graphics/HWComposer_4.3.pdf?raw=true
From jrmuizel's comment, the names of "Aquire Fence" and "Release Fence" seems to come from  C++11 Standards.
http://preshing.com/20130922/acquire-and-release-fences/
Sotaro, I tried instrumenting those parts of the code, but I still don't see them running while we are blocked waiting for the buffer.

The following debug shows us getting delayed on two buffers.  There is a lot of extra text here because I wanted to show enough history to see when they were last queueud.

I will refer to 0xaf65c088 as buffer 1 and 0xad8fd588 as buffer 2.

Here is where buffer 1 was last queued with fence FD 145.

03-03 15:17:42.446 E/Surface ( 1702): ### ### hook queue buffer 0xaf65c088 with fenceFd:145
03-03 15:17:42.446 E/Surface ( 1702): ### ### queue buffer 0xaf65c088 with fenceFd:145

Some other things happen.  We record the fence used above in lastFencFD, but I don't see us signaling it.

03-03 15:17:42.446 I/Gecko   ( 1702): ### ### FramebufferSurface::nextBuffer() start
03-03 15:17:42.446 I/Gecko   ( 1702): ### ### FramebufferSurface::nextBuffer() finish
03-03 15:17:42.446 E/Fence   ( 1702): ### ### Fence::dup() dup'd 145 to 147
03-03 15:17:42.446 I/Gecko   ( 1702): ### ### FramebufferSurface::onFrameAvailable() lastFenceFD:147
03-03 15:17:42.446 I/Gecko   ( 1702): ### ### GonkDisplayJB::SetFBReleaseFd fd:174

Begin dequeuing buffer 2.

03-03 15:17:42.456 E/Surface ( 1702): ### ### start dequeued buffer [0xb05f52e4]
03-03 15:17:42.456 E/BufferQueue( 1702): [FramebufferSurface] dequeueBuffer: w=768 h=1280 fmt=0x1 usage=0x203
03-03 15:17:42.456 E/BufferQueue( 1702): [FramebufferSurface] dequeueBuffer: returning slot=0 buf=0xadfc97e0 flags=0
03-03 15:17:42.456 E/Fence   ( 1702): ### ### Fence::dup() dup'd 155 to 145
03-03 15:17:42.456 E/Surface ( 1702): ### ### finish dequeued buffer 0xad8fd588 [0xacf32f04][0ms]

Dequeue of buffer 2 completed without any delays this time.

Queue buffer 2 back up using the same fence FD of 145 as buffer 1 above.

03-03 15:17:42.466 E/Surface ( 1702): ### ### hook queue buffer 0xad8fd588 with fenceFd:145
03-03 15:17:42.466 E/Surface ( 1702): ### ### queue buffer 0xad8fd588 with fenceFd:145

Again do some other things including saving the fence ID in lastFenceFD.  Again, I don't see us signaling it in any way.

03-03 15:17:42.466 I/Gecko   ( 1702): ### ### FramebufferSurface::nextBuffer() start
03-03 15:17:42.466 I/Gecko   ( 1702): ### ### FramebufferSurface::nextBuffer() finish
03-03 15:17:42.466 E/Fence   ( 1702): ### ### Fence::dup() dup'd 145 to 155
03-03 15:17:42.466 I/Gecko   ( 1702): ### ### FramebufferSurface::onFrameAvailable() lastFenceFD:155
03-03 15:17:42.466 I/Gecko   ( 1702): ### ### GonkDisplayJB::SetFBReleaseFd fd:174

Begin dequeuing buffer 2.  This is going to take a while.

03-03 15:17:42.476 E/Surface ( 1702): ### ### start dequeued buffer [0xb05f52e4]
03-03 15:17:42.476 E/BufferQueue( 1702): [FramebufferSurface] dequeueBuffer: w=768 h=1280 fmt=0x1 usage=0x203
03-03 15:17:42.476 E/BufferQueue( 1702): [FramebufferSurface] dequeueBuffer: returning slot=1 buf=0xaf0401f0 flags=0
03-03 15:17:42.476 E/Fence   ( 1702): ### ### Fence::dup() dup'd 147 to 145
03-03 15:17:42.496 E/Surface ( 1702): ### ### finish dequeued buffer 0xaf65c088 [0xacf32f04][21ms]

Dequeue has completed after 21ms.  Its unclear what is happening in that blocked time period.  None of my gecko instrumentation fires.

Queue buffer 2 back up and record its fence in lastFenceFD.

03-03 15:17:42.496 E/Surface ( 1702): ### ### hook queue buffer 0xaf65c088 with fenceFd:145
03-03 15:17:42.496 E/Surface ( 1702): ### ### queue buffer 0xaf65c088 with fenceFd:145
03-03 15:17:42.496 I/Gecko   ( 1702): ### ### FramebufferSurface::nextBuffer() start
03-03 15:17:42.496 I/Gecko   ( 1702): ### ### FramebufferSurface::nextBuffer() finish
03-03 15:17:42.496 E/Fence   ( 1702): ### ### Fence::dup() dup'd 145 to 147
03-03 15:17:42.496 I/Gecko   ( 1702): ### ### FramebufferSurface::onFrameAvailable() lastFenceFD:147
03-03 15:17:42.496 I/Gecko   ( 1702): ### ### GonkDisplayJB::SetFBReleaseFd fd:174

Begin dequeuing buffer 1.  This will also be slow.

03-03 15:17:42.496 E/Surface ( 1702): ### ### start dequeued buffer [0xb05f52e4]
03-03 15:17:42.496 E/BufferQueue( 1702): [FramebufferSurface] dequeueBuffer: w=768 h=1280 fmt=0x1 usage=0x203
03-03 15:17:42.496 E/BufferQueue( 1702): [FramebufferSurface] dequeueBuffer: returning slot=0 buf=0xadfc97e0 flags=0
03-03 15:17:42.496 E/Fence   ( 1702): ### ### Fence::dup() dup'd 155 to 145
03-03 15:17:42.506 E/Surface ( 1702): ### ### finish dequeued buffer 0xad8fd588 [0xacf32f04][10ms]

Dequeue completes after 10ms.  Again, its unclear what is happening during the delay.

The only place I can find code using FrameBufferSurface::lastFenceFD is in our GonkDisplayJB class where we pass it to Post().  I have instrumentation there, but don't see it firing.

The attached patch shows where I am instrumenting now.

It seems like the fence is probably getting signaled from the adreno driver, but its hard for me to prove that without its source.

Sotaro, can you think of anywhere else I should instrument?
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Andreas Gal :gal from comment #42)
> Looking at the profile I see us blocking on a lock in DrawQuadInternal. Why
> would the HWC keep a lock on the gralloc buffer that we decided to draw with
> the compositor? I guess I still don't understand our theory here.

OpenGL renders to ANativeWindow(Surface) interface. It's rendering buffers are composed by HWC via FramebufferSurface as "frame buffer layer". Therefore, after hwc composition, the buffer receives fence object. Therefore, Surface need to wait fence completion.

But it seems strange to wait sync under the following function.
>  gl()->fEGLImageTargetTexture2D(textureTarget, mEGLImage);
By the way, I tried using the software EGL driver by modifying egl.cfg, but it crashed the b2g process on boot.
I checked nexus-4 and nexus-5. The result was the following.
 - nexus-4's EGL uses Surface::hook_dequeueBuffer_DEPRECATED()
 - nexus-5's EGL uses Surface::hook_dequeueBuffer()

It might be better to confirm that our recent b2g phone uses which function. If Surface::hook_dequeueBuffer() is used, Fence wait might not be a problem.

It might be better to evaluate by using recent b2g phone. Sadly I do not have it.
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Sotaro Ikeda [:sotaro] from comment #48)
> I checked nexus-4 and nexus-5. The result was the following.
>  - nexus-4's EGL uses Surface::hook_dequeueBuffer_DEPRECATED()
>  - nexus-5's EGL uses Surface::hook_dequeueBuffer()

Sushil, do you know which function is used in latest b2g phone?
Flags: needinfo?(sushilchauhan)
On v1.4 reference device, it uses GonkNativeWindowClient::hook_dequeueBuffer_DEPRECATED. See logs.
I have forced GPU composition. These are during video playback. No high times are observed.

04-29 23:47:39.329   201   968 E HWComposer: Compositor: H/W Composition failed, Time taken in TryRender() = 1 ms
04-29 23:47:39.329   201   968 E HWComposer: Compositor: Time taken by eglSwapBuffers = 2 ms
04-29 23:47:39.329   201   968 E HWComposer: Compositor: Time taken by mHwc->set = 0 ms
04-29 23:47:39.329   201   968 E HWComposer: Compositor: Time taken in Render() = 4 ms
04-29 23:47:39.329  1255  1573 E GonkNativeWindowClient: Compositor: hook_dequeueBuffer_DEPRECATED

04-29 23:47:39.369   201   968 E HWComposer: Compositor: H/W Composition failed, Time taken in TryRender() = 0 ms
04-29 23:47:39.369   201   968 E HWComposer: Compositor: Time taken by eglSwapBuffers = 2 ms
04-29 23:47:39.369   201   968 E HWComposer: Compositor: Time taken by mHwc->set = 0 ms
04-29 23:47:39.369   201   968 E HWComposer: Compositor: Time taken in Render() = 4 ms
04-29 23:47:39.379  1255  1573 E GonkNativeWindowClient: Compositor: hook_dequeueBuffer_DEPRECATED

But during Home screen panning, I see higher times but I do not see GonkNativeWindowClient log. See:
04-30 00:05:49.849   202   977 E HWComposer: Compositor: Render aborted. Time taken in TryRender() = 0 ms
04-30 00:05:49.879   202   977 E HWComposer: Compositor: HwcComposer2D::Render ...
04-30 00:05:49.889   202   977 E HWComposer: Compositor: Time taken by eglSwapBuffers = 6 ms
04-30 00:05:49.889   202   977 E HWComposer: Compositor: Time taken by mHwc->set = 0 ms
04-30 00:05:49.889   202   977 E HWComposer: Compositor: Time taken in Render() = 8 ms
04-30 00:05:49.889   202   977 I Gecko   : Compositor: Composite took 33 ms.

HwcComposer2D::Render() only takes 8 ms here. But most of the time is taken by Rendering all the layers on FBSurface by GPU. This is the time between TryRender() failure and corresponding next call to HwcComposer2D::Render() = 33 ms - 8 ms = 25 ms (which is more than 1 vsync). And profiler is accounting for this time as well. See difference in time stamps: 849 --> 879.
Flags: needinfo?(sushilchauhan)
(In reply to Sushil from comment #50)
> On v1.4 reference device, it uses
> GonkNativeWindowClient::hook_dequeueBuffer_DEPRECATED. See logs.
> I have forced GPU composition. These are during video playback. No high
> times are observed.

Sushil, thanks for taking the log, but GonkNativeWindowClient::hook_dequeueBuffer_DEPRECATED is not related to this usecase. GonkNativeWindow is used only for video codec and camra preview.

For FrameBufferSurface, android's default Surface and BufferQueue is used. Can you confirm Surface::hook_dequeueBuffer() and Surface::hook_dequeueBuffer()?
Flags: needinfo?(sushilchauhan)
On b2g v1.4 reference device, it uses: Surface::hook_dequeueBuffer()
Flags: needinfo?(sushilchauhan)
(In reply to Sushil from comment #52)
> On b2g v1.4 reference device, it uses: Surface::hook_dequeueBuffer()

Thanks for the confirmation!
I checked by adding H/W vsync support in HwcComposer2D so that it waits for h/w vsync signal from driver. It does not improve the numbers in the given use case. So vsync is not the root cause here.
I think there are some bugs in this debug code:

  04-30 00:05:49.889   202   977 I Gecko   : Compositor: Composite took 33 ms.

I'm putting together a patch and hope to have it posted shortly.

I think there are still some underlying problems going on, but it would be good to use an accurate measurement.  Sorry for not catching this earlier.
This patch tries to improve the composition warnings debug by breaking out scheduling time from execution time.  For example:

03-03 22:47:33.916 I/Gecko   (16585): Compositor: Composite execution took 17.7323 ms
03-03 22:47:33.936 I/Gecko   (16585): Compositor: Composite execution took 17.8544 ms
03-03 22:47:33.976 I/Gecko   (16585): Compositor: Composite execution took 37.7232 ms
03-03 22:47:33.976 I/Gecko   (16585): Compositor: Compose starting off schedule by 21.9111 ms
03-03 22:47:34.006 I/Gecko   (16585): Compositor: Composite execution took 31.0697 ms
03-03 22:47:34.006 I/Gecko   (16585): Compositor: Compose starting off schedule by 15.105 ms
03-03 22:47:34.036 I/Gecko   (16585): Compositor: Composite execution took 20.937 ms
03-03 22:47:34.036 I/Gecko   (16585): Compositor: Compose starting off schedule by 8.26848 ms

Here we see some composes that are just barely missing our frame budget due to "execution".  This means all of this time was taken up in the CompositeToTarget() method.

We then see that the execution time is slow that it impacts the scheduling of the next composition.

In other cases its possible to see a scheduling impact simply due to device load or IPC message_loop contention.
Attachment #8385096 - Flags: review?(bgirard)
Minor update to consistently use the calculated target frame rate for warning limits.

Another warning I check for here is if we are triggering composites too fast and racing ahead.  This could create problems by forcing us to block on fences and locks unnecessarily.
Attachment #8385096 - Attachment is obsolete: true
Attachment #8385096 - Flags: review?(bgirard)
Attachment #8385098 - Flags: review?(bgirard)
Sotaro, Sushil, I recommend retesting with that improved debug patch.  I see long executions occurring, but fewer of them are on the order of the 30+ ms reported by the previous debug code.
So I happened to notice in the profile that the child process periodically shows gralloc_unlock() calling IonAlloc::clean_buffer().  At least in some cases this seems around the times when we are blocking in dequeueBuffer() in the parent thread.

For example, see this attachment.

Could this be what the parent is waiting for to get unlocked?  Its just something I noticed, so not sure if it even makes sense or not.  Just a wild theory.

If its plausible I can investigate tomorrow.

Sotaro, what do you think?
Flags: needinfo?(sotaro.ikeda.g)
Comment on attachment 8385098 [details] [diff] [review]
Improve composition thread warning debug. (v2)

Review of attachment 8385098 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +592,5 @@
>      TimeDuration delay = minFrameDelta - delta;
>  #ifdef COMPOSITOR_PERFORMANCE_WARNING
>      mExpectedComposeTime = TimeStamp::Now() + delay;
>  #endif
>      ScheduleTask(mCurrentCompositeTask, delay.ToMilliseconds());

I don't understand why you don't want to capture the expected composite time? The intent was to know if we miss the frame before this task doesn't fire on time.

@@ +623,5 @@
> +    printf_stderr("Compositor: Compose starting off schedule by %g ms\n",
> +                  scheduleDelta.ToMilliseconds());
> +  }
> +  int32_t frameRate = CalculateCompositionFrameRate();
> +  TimeDuration frameBudget = TimeDuration::FromSeconds(1.0 / frameRate);

in asap mode (frameRate == 0) you're going to divide by zero here.

@@ +629,5 @@
> +    TimeDuration lastComposeDelta = TimeStamp::Now() - mLastCompose;
> +    TimeDuration minTimeBetweenComposes =
> +      frameBudget - TimeDuration::FromMilliseconds(2);
> +    if (lastComposeDelta < minTimeBetweenComposes) {
> +      printf_stderr("Compositor: Compose starting %g ms since last; too quick!\n",

I see the logic being this but I think it's really hard to say with certitude that the starting time for a composite is wrong. If this frame is going to take a lot of time to composite then you can't start it too early.

Imagine that the layer tree is expensive on the GPU to compose, if each frame takes 30ms then you want to start right away so this delta would be zero and that would be perfectly valid.

@@ +687,5 @@
>  
>  #ifdef COMPOSITOR_PERFORMANCE_WARNING
> +  TimeDuration executionTime = TimeStamp::Now() - mLastCompose;
> +  if (executionTime > frameBudget) {
> +    printf_stderr("Compositor: Composite execution took %g ms\n",

Curious: Typically I see %f/%d used. In fact I've never seen %g used in practice. Any benefit to using %g instead?
PRIV_FLAGS_NEEDS_FLUSH is set depending on the usage mask, and causes clean_buffer

static int gralloc_map_and_invalidate (gralloc_module_t const* module,
                                       buffer_handle_t handle, int usage,
                                       int l, int t, int w, int h)
{
    if (private_handle_t::validate(handle) < 0)
        return -EINVAL;

    int err = 0;
    private_handle_t* hnd = (private_handle_t*)handle;
    if (usage & (GRALLOC_USAGE_SW_READ_MASK | GRALLOC_USAGE_SW_WRITE_MASK)) {
        if (hnd->base == 0) {
            // we need to map for real                                                                                                                                            
            pthread_mutex_t* const lock = &sMapLock;
            pthread_mutex_lock(lock);
            err = gralloc_map(module, handle);
            pthread_mutex_unlock(lock);
        }
        //Invalidate if reading in software. No need to do this for the metadata                                                                                                  
        //buffer as it is only read/written in software.                                                                                                                          
        IMemAlloc* memalloc = getAllocator(hnd->flags) ;
        err = memalloc->clean_buffer((void*)hnd->base,
                                     hnd->size, hnd->offset, hnd->fd,
                                     CACHE_INVALIDATE);
        if ((usage & GRALLOC_USAGE_SW_WRITE_MASK) &&
            !(hnd->flags & private_handle_t::PRIV_FLAGS_FRAMEBUFFER)) {
            // Mark the buffer to be flushed after cpu read/write                                                                                                                 
            hnd->flags |= private_handle_t::PRIV_FLAGS_NEEDS_FLUSH;
        }
    } else {
        hnd->flags |= private_handle_t::PRIV_FLAGS_DO_NOT_FLUSH;
    }
    return err;
}
The invalidation happens basically with an ioctl:

    struct ion_custom_data d;
    switch(op) {
    case CACHE_CLEAN:
        d.cmd = ION_IOC_CLEAN_CACHES;
        break;
    case CACHE_INVALIDATE:
            d.cmd = ION_IOC_INV_CACHES;
        break;
    case CACHE_CLEAN_AND_INVALIDATE:
    default:
        d.cmd = ION_IOC_CLEAN_INV_CACHES;
    }

    d.arg = (unsigned long int)&flush_data;

    if(ioctl(mIonFd, ION_IOC_CUSTOM, &d)) {
        err = -errno;
	ALOGE("%s: ION_IOC_CLEAN_INV_CACHES failed with error - %s",

              __FUNCTION__, strerror(errno));
        ioctl(mIonFd, ION_IOC_FREE, &handle_data);
        return err;
    }
    ioctl(mIonFd, ION_IOC_FREE, &handle_data);
    return 0;
(In reply to Ben Kelly [:bkelly] from comment #58)
> Sotaro, Sushil, I recommend retesting with that improved debug patch.  I see
> long executions occurring, but fewer of them are on the order of the 30+ ms
> reported by the previous debug code.

Which device are you testing now? From Comment 52, recent b2g device uses Surface::hook_dequeueBuffer(). It is unblocking function. Therefore, fence wait might not be called on compositor thread. If you still use nexus-4 for evalutation, it might be better to change to another device that using Surface::hook_dequeueBuffer().
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Benoit Girard (:BenWa) from comment #60)
> I don't understand why you don't want to capture the expected composite
> time? The intent was to know if we miss the frame before this task doesn't
> fire on time.

I'm confused by this question.  We either schedule with a delay or immediately.  The expected composite time was being set correctly when scheduled with a delay.  When we scheduled immediately, though, we were incorrectly expecting the compose to happen 16ms later.

So my intention was to bring the mExpectedComposeTime value into line without the delay we are passing to ScheduleTask().

> in asap mode (frameRate == 0) you're going to divide by zero here.

Thanks.  I'll fix that by using a constant when frameRate == 0.

> I see the logic being this but I think it's really hard to say with
> certitude that the starting time for a composite is wrong. If this frame is
> going to take a lot of time to composite then you can't start it too early.
> 
> Imagine that the layer tree is expensive on the GPU to compose, if each
> frame takes 30ms then you want to start right away so this delta would be
> zero and that would be perfectly valid.

How do you know before your composite that its going to take 30ms?  It would seem you must know this prior to scheduling in order to make this optimization.

If we start every composite early just in case they are too slow, though, then we end up blocking on vsync and other resources.

Also note, this check is comparing to the actual lastComposeTime, not our scheduled time.  So we won't trigger this warning if the composes are taking 30ms.  You can still schedule immediately and not trigger this warning.

Although it occurs to me that if we are blocking on resources, we also won't trigger this warning.  So its pretty useless.

I'll remove it.  :-)

> Curious: Typically I see %f/%d used. In fact I've never seen %g used in
> practice. Any benefit to using %g instead?

I'll switch to %f.  I just got in the habit of using %g in the past since it handles automatically switching to exponent notation if appropriate.  I don't think that matter here, though.
(In reply to Sotaro Ikeda [:sotaro] from comment #63)
> Which device are you testing now? From Comment 52, recent b2g device uses
> Surface::hook_dequeueBuffer(). It is unblocking function. Therefore, fence
> wait might not be called on compositor thread. If you still use nexus-4 for
> evalutation, it might be better to change to another device that using
> Surface::hook_dequeueBuffer().

I'm using nexus-4.  As far as I know the only option I have for another device right now is to buy a nexus-5.  It appears the soonest I could get one from the Play Store is at least a week.  Local stores only sell them locked to a carrier plan.

If there is a way for me to get one of the production devices, please let me know.
Just grab a device from a local store. Google around first to make sure you can flash it (not firmware locked). SIM locked is fine. You won't need the radio. Please expense the cost. In parallel talk to Mike to get you a production device, but that will take a couple days at least.
Updated patch.  I switched to %f and also aligned the numbers to make the output easier to read.  I dropped the "too quick" warning.
Attachment #8385098 - Attachment is obsolete: true
Attachment #8385098 - Flags: review?(bgirard)
Attachment #8385416 - Flags: review?(bgirard)
I have a nexus-5 now and am working to get my test setup on it.

One thing I noticed, however, is that disabling APZ pretty much removes any long composite execution times while scrolling the virtual-list-demo.  This change from disabling APZ occurs on both my ICS buri and JB nexus-4.

It seems perhaps with APZ gecko is eagerly grabbing and exhausting some resource which is essentially throttling back our compositions.  If that is the case, then no amount of async APIs will prevent us from eventually starving and blocking the compositor.

I think we need to investigate how we are triggering the starvation on the different platforms to understand what resources gecko is over consuming.
(In reply to Ben Kelly [:bkelly] from comment #68)
> I have a nexus-5 now and am working to get my test setup on it.
> 
> One thing I noticed, however, is that disabling APZ pretty much removes any
> long composite execution times while scrolling the virtual-list-demo.  This
> change from disabling APZ occurs on both my ICS buri and JB nexus-4.

The tiling might help this situation, if it is caused by too large gralloc texture rendering.
Here is a nexus-5 profile:

  http://people.mozilla.org/~bgirard/cleopatra/#report=41e68fb8926be16e52c0eee5f25fd04285c90652

While I don't see the extreme 20ms to 30ms compositions, we are instead consistently over our budget at 17ms to 19ms.  This is bad enough that I can visually tell that the frame rate is reduced.

Looking at the profile it seems we are still mostly blocked in eglSwapBuffers, this time on a condition variable instead of a fence.

Here is nexus-5 with APZ disabled showing that many/most of the composites are under our time budget:

  http://people.mozilla.org/~bgirard/cleopatra/#report=24fa126e5290d501deb44f8950799b18b9468478

Also, since nexus-5 is a resource rich phone, synchronous scrolling is buttery smooth compared to APZ at the moment.

I'm going to focus on gecko level resource usage now.  I'll also try tiling to see if that helps APZ.
For a real quick comparison of those two profiles, search for eglSwapBuffers in the composite thread in each.  The difference is pretty huge.
I think to make further progress here you need to wait until tiling is landed, for either euro or can tiling -- that changes things drastically. (Or build one of those branches.)  We're aiming to land tomorrow/thursday.
Thanks Vlad.  I was going to try this branch Chris pointed me at:

  http://hg.mozilla.org/users/bschouten_mozilla.com/tiling
It's a long shot but bug 971914 might be interesting to check.
(In reply to Ben Kelly [:bkelly] from comment #64)
> How do you know before your composite that its going to take 30ms?  It would
> seem you must know this prior to scheduling in order to make this
> optimization.

We don't have a good way of knowing but it means that warning has the potential for a false positive which is not desirable. But I could be convinced.

> I'll switch to %f.  I just got in the habit of using %g in the past since it
> handles automatically switching to exponent notation if appropriate.  I
> don't think that matter here, though.

ohh ok, I was curious. Didn't want to nitpick.
(In reply to Benoit Girard (:BenWa) from comment #75)
> (In reply to Ben Kelly [:bkelly] from comment #64)
> > How do you know before your composite that its going to take 30ms?  It would
> > seem you must know this prior to scheduling in order to make this
> > optimization.
> 
> We don't have a good way of knowing but it means that warning has the
> potential for a false positive which is not desirable. But I could be
> convinced.

Ok, I dropped that particular warning.  I agree it is not useful.
Comment on attachment 8385416 [details] [diff] [review]
Improve composition thread warning debug. (v3)

Review of attachment 8385416 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +595,5 @@
>  #endif
>      ScheduleTask(mCurrentCompositeTask, delay.ToMilliseconds());
>    } else {
> +#ifdef COMPOSITOR_PERFORMANCE_WARNING
> +    mExpectedComposeTime = TimeStamp::Now();

Before I meant mExpectedComposeTime to mean the expected time for the composite to *complete*. Your patch is redefining it. I'm not against but that's a sign that the variable should be named better. My suggestion: mExpectedComposteStartTime.
Attachment #8385416 - Flags: review?(bgirard) → review+
Thanks for the review!  Here is another updated patch with the variable renamed as requested.  Also updated commit message for r=benwa.
Attachment #8385416 - Attachment is obsolete: true
Attachment #8385490 - Flags: review+
Profile of tiling branch with layers.force-per-tile-drawing enabled on nexus5:

  http://people.mozilla.org/~bgirard/cleopatra/#report=64e3327ca8f48e10abe84d4e91572b5170bb44ad

The good news is that it feels smooth again and we see short composites in the profile.

The bad news is there are insanely large rasterizes of 500+ ms mainly blocked on gralloc construction.  Maybe bug 959089 could help there, although reading recent comments there it seems unclear what the way forward is for that bug.
Compositor warning patch pushed to b2g-inbound:

  https://hg.mozilla.org/integration/b2g-inbound/rev/44d519bd551d
Keywords: leave-open
(In reply to Ben Kelly [:bkelly] from comment #79)
> The bad news is there are insanely large rasterizes of 500+ ms mainly
> blocked on gralloc construction.  Maybe bug 959089 could help there,
> although reading recent comments there it seems unclear what the way forward
> is for that bug.

Hmm, I think this might be getting triggered because the tiling branch doesn't have the velocity cap in place.  Rebuilding with bug 976035 to test that theory.
The velocity limit did not really help.  I was still able to get long rasterizes and lots of checkerboarding.  This is even with the b2g velocity limit being fairly low for such a high res device.
I need to investigate further, but the nexus-4 seems to benefit from setting a high RT priority on the compositor thread.  Here is a profile:

  http://people.mozilla.org/~bgirard/cleopatra/#report=a6ad7d33c304ebf393a5b9a64ca462ac16591912

Comparing to the profile in comment 22, the composites are much more consistent and evenly spaced.  That being said, its not a silver bullet and there are still some slow composites.

This was on m-c hg rev e5b09585215f with APZ, no hwc, and no tiling.

I'm going to test the other configurations with the high priority thread tomorrow.  I'll try nexus-5 on m-c and then both devices with both tiling branches.
I've been looking more at the nexus-5 today and it feels like there is something really wrong there.  The compositor thread spends its time in such a way that it always overruns its budget on n5.

Here are some typical times for where compositor spends most of its time for the different devices.

On nexus-4 JB (android version 18):
  * RootLayer()->RenderLayer():  11.4ms
  * CompositorOGL::EndFrame():    1.3ms
  
  * Total composite time:        13.0ms

On nexus-5 KK (android version 19):
  * RootLayer()->RenderLayer():   0.9ms
  * CompositorOGL::EndFrame():   15.7ms
  
  * Total composite time:        16.9ms

When I drill down into the CompositorOGL::EndFrame() on nexus-5 I see that the time is going here:

  * eglSwapBuffers called from UpdateFBSurface():            9.1ms
  * hwc_prepare_primary called from HwcComposer2D::Prepare:  6.2ms

  * Total CompositorOGL::EndFrame():                        15.7ms

The eglSwapBuffers is spending most of its time waiting on a condition variable instead the adreno library.  The hwc_prepare_primary is spending most of its time in an MSMFB_OVERLAY_SET ioctl.

Its difficult to dig deeper since our build uses a prebuilt kernel and I don't have the source for adreno.

I'm not convinced my previous statement that nexus-5 is better with tiling in comment 79 is accurate.  That tiling branch was based on an old m-c at the time I don't think its HwcComposer2D was following this same path.  Once the tiling branch is merged and stable I'll check there again.

Sotaro, do my numbers above make sense at all?  Any ideas why nexus-5 is getting such a dramatically different time profile?
Flags: needinfo?(sotaro.ikeda.g)
We need vendor help here. We are poking around in the dark.
I should note that my numbers in comment 85 included my real-time priority patch for the compositor thread.  I think it helps enough on nexus-4 I will go ahead and clean it up for review.
Priority: -- → P1
Depends on: 980027
Forked off bug 980027 to pursue the compositor thread priority change.
(In reply to Ben Kelly [:bkelly] from comment #85)
> 
> When I drill down into the CompositorOGL::EndFrame() on nexus-5 I see that
> the time is going here:
> 
>   * eglSwapBuffers called from UpdateFBSurface():            9.1ms
>   * hwc_prepare_primary called from HwcComposer2D::Prepare:  6.2ms
> 
>   * Total CompositorOGL::EndFrame():                        15.7ms
> 
> The eglSwapBuffers is spending most of its time waiting on a condition
> variable instead the adreno library.  The hwc_prepare_primary is spending
> most of its time in an MSMFB_OVERLAY_SET ioctl.
> 

Sotaro, can you check eglSwapBuffers is waiting on which condition variable here ? I can look into the hwc_prepare_primary delay.
Sushil, I'm sorry, but I mistyped my comment there.  The condition variable is *inside* adreno; not "instead".  Sorry for the confusion.  So I think we need your help there as well.
Micheal, 

As mentioned in Comment 85, for eglSwapBuffers() delay in Adreno, someone from Graphics needs to take a look. It is in GPU Composition path. I can look into the delay in hwc_prepare_primary call.
Flags: needinfo?(mvines)
(In reply to Ben Kelly [:bkelly] from comment #85)
> 
> I'm not convinced my previous statement that nexus-5 is better with tiling
> in comment 79 is accurate.  That tiling branch was based on an old m-c at
> the time I don't think its HwcComposer2D was following this same path.  Once
> the tiling branch is merged and stable I'll check there again.

If such long rendering is caused by large gralloc buffer usage, tiling could help.
 
> Sotaro, do my numbers above make sense at all?  Any ideas why nexus-5 is
> getting such a dramatically different time profile?

On nexus-4, fence wait happens within rendering quad. On nexus-5, it does not happen within rendering quad. Otherwise, time consuming task seems to defer as much as possible.
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Sotaro Ikeda [:sotaro] from comment #92)
> > I'm not convinced my previous statement that nexus-5 is better with tiling
> > in comment 79 is accurate.  That tiling branch was based on an old m-c at
> > the time I don't think its HwcComposer2D was following this same path.  Once
> > the tiling branch is merged and stable I'll check there again.
> 
> If such long rendering is caused by large gralloc buffer usage, tiling could
> help.

If large buffer is used, whole large buffer data need to uploaded to the OpenGL texture. but tiling is used, only updated tilies' data are uploaded to OpenGL textures.
(In reply to Sushil from comment #89)
> 
> Sotaro, can you check eglSwapBuffers is waiting on which condition variable
> here ? I can look into the hwc_prepare_primary delay.

In side of eglSwapBuffers() are proprietary area, I do not have a source code of it. Surface(ANativeWindow) is the only interface to check how eglSwapBuffers() works.

I checked about it by adding some logs around the above area. From the log, ANativeWindow seems not related to the delay. And ioctl() calls inside of eglSwapBuffers() does not wait long time(less than 1ms).

But from the log, I found one pattern. When eglSwapBuffers() takes longer time(more than 5ms), almost cases, GL updater thread's verly long ioctl() complete(about 15ms) happens just before eglSwapBuffers()complete. GL updater thread's ioctl() is normally take very long time until complete(5ms-25ms). It might mean gl driver is running hard in kernel.
Captured logcat during homescreen panning. A lot of manually logs are added around eglSwapBuffers(), ANativeWindow and ioctl().

The followings are typical log when eglSwapBuffers() takes long time.
------------------------------------

>03-05 14:58:06.489   185   835 I GonkWindow: GonkDisplayJB::SwapBuffers()_E
>03-05 14:58:06.489   185   835 I GonkWindow: GonkDisplayJB::SwapBuffers() 1 ent eglSwapBuffers()
>03-05 14:58:06.489   185   835 I GonkGfx : Surface::hook_dequeueBuffer()
>03-05 14:58:06.489   185   835 V Surface : Surface::dequeueBuffer
>03-05 14:58:06.489   185   835 I GonkGfx : Surface::dequeueBuffer()_E
>03-05 14:58:06.489   185   835 I GonkGfx : Surface::dequeueBuffer() *fenceFd 100
>03-05 14:58:06.489   185   835 I GonkGfx : Surface::dequeueBuffer()_X
>03-05 14:58:06.489   185   835 E libc    : +++**** ioctl fd 88 request 0xc02c093d
>03-05 14:58:06.489   185   835 E libc    : +++**** ioctl fd 88 request 0xc02c093d
>03-05 14:58:06.489   185   835 E libc    : +++**** ioctl fd 88 request 0xc02c093d
>03-05 14:58:06.489   185   835 V Surface : Surface::setBuffersTransform
>03-05 14:58:06.489   185   835 E libc    : +++**** ioctl fd 88 request 0xc0140933
>03-05 14:58:06.489   185   835 I GonkGfx : Surface::hook_queueBuffer()
>03-05 14:58:06.489   185   835 V Surface : Surface::queueBuffer
>03-05 14:58:06.489   185   835 I GonkGfx : Surface::queueBuffer()_E
>03-05 14:58:06.489   185   835 V Surface : Surface::queueBuffer making up timestamp: 237061.31 ms
>03-05 14:58:06.489   185   835 I GonkGfx : Surface::queueBuffer()_X
>03-05 14:58:06.489   185   835 V Surface : Surface::query
>03-05 14:58:06.489   185   835 V Surface : Surface::query
>03-05 14:58:06.489   185   835 V Surface : Surface::query
>03-05 14:58:06.489   185   835 V Surface : Surface::query
>03-05 14:58:06.489   185   835 V Surface : Surface::setBuffersDimensions
>03-05 14:58:06.489   185   936 E libc    : +++**** ioctl: fd 88 duration 17ms +++++++++++++++++++
>03-05 14:58:06.489   185   936 E libc    : +++**** ioctl fd 88 request 0x400c0907
>03-05 14:58:06.489   185   835 I GonkWindow: GonkDisplayJB::SwapBuffers() 2 duration 6ms *****************************************
>03-05 14:58:06.489   185   835 I GonkWindow: GonkDisplayJB::SwapBuffers() 3 ext eglSwapBuffers()
>03-05 14:58:06.489   185   835 I GonkWindow: GonkDisplayJB::SwapBuffers()_X
(In reply to Sotaro Ikeda [:sotaro] from comment #93)
> 
> If large buffer is used, whole large buffer data need to uploaded to the
> OpenGL texture. but tiling is used, only updated tilies' data are uploaded
> to OpenGL textures.

Gralloc data upload to OpenGL Texture could affect to the rendering performance significantly. On last September's b2g, gralloc data are uploaded to OpenGL texture every GL binding. It caused OpenGL driver too busy and Rendering fps was degraded. It is Bug 912134.
(In reply to Sotaro Ikeda [:sotaro] from comment #93)
> (In reply to Sotaro Ikeda [:sotaro] from comment #92)
> > > I'm not convinced my previous statement that nexus-5 is better with tiling
> > > in comment 79 is accurate.  That tiling branch was based on an old m-c at
> > > the time I don't think its HwcComposer2D was following this same path.  Once
> > > the tiling branch is merged and stable I'll check there again.
> > 
> > If such long rendering is caused by large gralloc buffer usage, tiling could
> > help.
> 
> If large buffer is used, whole large buffer data need to uploaded to the
> OpenGL texture. but tiling is used, only updated tilies' data are uploaded
> to OpenGL textures.

I don't dispute that tiling would help.

What I don't understand, though, is why we would be able to mostly keep up on nexus-4 without tiling, but *need* tiling to keep up on nexus-5 with the same gecko and workload.  I don't see many #ifdef's around ANDROID_VERSION > 18.
(In reply to Sotaro Ikeda [:sotaro] from comment #94)
> But from the log, I found one pattern. When eglSwapBuffers() takes longer
> time(more than 5ms), almost cases, GL updater thread's verly long ioctl()
> complete(about 15ms) happens just before eglSwapBuffers()complete. GL
> updater thread's ioctl() is normally take very long time until
> complete(5ms-25ms). It might mean gl driver is running hard in kernel.

What is the "GL updater thread"?  Is that a thread spawned from within adreno?  How are you monitoring that?
(In reply to Ben Kelly [:bkelly] from comment #98)
> (In reply to Sotaro Ikeda [:sotaro] from comment #94)
> 
> What is the "GL updater thread"?  Is that a thread spawned from within
> adreno?  How are you monitoring that?

I just added a log around ioctl(). And checked the log by using the following command.

> adb shell logcat -v threadtime > log.txt

Thread name can be gotten by the follwoing command.

> adb shell b2g-ps -t
(In reply to Ben Kelly [:bkelly] from comment #98)
> 
> What is the "GL updater thread"?  Is that a thread spawned from within
> adreno? 

I do not about it actually. Gecko does not create such thread. From it, I suspect it is created by adreno's GL module.
For high time taken by MSMFB_OVERLAY_SET ioctl from hwc_prepare_primary, we expect a fix by 03/14.
The followings are display size info. Nexus-5's dispaly is about 2.1 times larger than nexus-4's one.
 - nexus-4: 1280*768
 - nexus-5: 1920*1080
(In reply to Sotaro Ikeda [:sotaro] from comment #93)
> ...
> If large buffer is used, whole large buffer data need to uploaded to the
> OpenGL texture. but tiling is used, only updated tilies' data are uploaded
> to OpenGL textures.

I'm really glad we have a fix (comment 102).  We have to be careful.  We cannot afford to deal with performance issues in other places with tiling - we need tiling just to fix the Gecko graphics issues, rather than make up for deficiencies elsewhere - we have enough deficiencies in Gecko graphics.
Milan, should we break the nexus-5 MSMFB_OVERLAY_SET issue out into a separate bug?  This one is getting quite confusing with all the devices in play.

Here is my understanding of the current situation:

1) nexus-4 seems mostly ok with bug 980027, although there is some question if that solution will impact other time sensitive processing like audio.

2) nexus-5 needs the fix from comment 102 for MSMFB_OVERLAY_SET ioctl.  We can then re-assess its performance.

3) buri is suffering from the issue in comment 11.  I'm in the process of verifying this also effects v1.3 and, if that is the case, will write a separate bug to track it.
Yes, let's have a separate bug for high time taken in MSMFB_OVERLAY_SET ioctl in driver (Comment 102).
Depends on: 980166
Done.  See bug 980166 for tracking the MSMFB_OVERLAY_SET ioctl issue.

Sushil, I took the liberty of marking you the assignee.  :-)
Can you comment on the cause of the MSMFB_OVERLAY_SET issue?
I wrote bug 980171 to track the v1.3 buri issue.  I wrote bug 980321 to track decoupling composition from the IPC thread.
Blocks: 970751
No longer blocks: 970751
According to Sotaro's analysis, this issue does not only cause performance issues, but may cause artifacts on the screen. See bug 977880 for a video.
Blocks: 977880
Which issue? MSMFB_OVERLAY_SET?
No, delay in MSMFB_OVERLAY_SET ioctl will not cause artifacts reported in Bug 977880.
Blocks: 970751
What is our working theory for the cause of the performance issue and artifacts?
(In reply to Andreas Gal :gal from comment #113)
> What is our working theory for the cause of the performance issue and
> artifacts?

1. I checked the artifacts reported in Bug 977880, it is tearing. It seems read and write is happening at same time on same fd. Since it is GPU composition, I suspect release fence of FB layer is not being honoured by framework. I already reported similar issue earlier. Please see Bug 962152. The root cause of both bugs seem to be similar. I checked by making BUFFER_SYNC ioctl as synchronous so that driver waits on acquire fence of FB until it gets signalled, I do not see tearing reported in Bug 977880.

2. I am able to see tearing even when driver takes 0 ms in MSMFB_OVERLAY_SET ioctl. So artifacts are not related to this delay. Driver poc is looking into MSMFB_OVERLAY_SET ioctl delay (Bug 980166).
More specifically, these artifacts are related to Bug 974152.
An observation:  It seems composites tend to run over budget more frequently on nexus-4 when we are doing alpha blending things.  In particular, the fade on the scrollbar in settings triggers it.

Is this expected?

Also, it seems we might miss an planeAlpha value in KK, but correcting that did not really change anything for my nexus-5.  See bug 980568.
alpha requires a more expensive shader, this might be a driver problem
Depends on: 980647
Opened bug 980647 to track alpha transparency issue I found with the scrollbar.  Disabling the opacity transition on the scrollbar removes all/most of the slow composite executions while scrolling settings on my nexus-4.
No longer blocks: 970751
It feels like this has become a meta-bug.  All of the issues found have been spun out to other bugs.  Should we close it or leave it here for tracking for now?

I am not currently investigating this any further at the moment.

(If we do close we should probably mark FIXED as it did land a patch for improving compositor warning debug.)
Assignee: bkelly → nobody
Status: ASSIGNED → NEW
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Flags: needinfo?(mvines)
Flags: needinfo?(mvikram)
Keywords: leave-open
Target Milestone: --- → 1.4 S3 (14mar)
You need to log in before you can comment on or make changes to this bug.