Closed Bug 1047149 Opened 6 years ago Closed 6 years ago

Invalid acquire fence for FB layer.

Categories

(Core :: Graphics: Layers, defect)

2.0 Branch
ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla34
blocking-b2g 2.0+
Tracking Status
firefox32 --- wontfix
firefox33 --- wontfix
firefox34 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: sushilchauhan, Assigned: sushilchauhan)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 310][caf priority: p1][CR 699312])

Attachments

(2 files, 2 obsolete files)

In an automation run, it has been observed that HwcComposer2D received invalid acquire fence for FB layer. So when the fd is passed to HAL and display driver tries to get file or sync_fence data from that fd, it fails and reports invalid fd.
I will upload a patch to check for acquire fence validity for FB layer in FramebufferSurface, before passing it to HWC HAL.

Sotaro, do you know, in which condition, the FB layer's acquire fence can be invalid?
Flags: needinfo?(sotaro.ikeda.g)
Add check to validate acquire fence of FrameBuffer layer, before passing acquire fence fd to HAL.
Attachment #8466557 - Flags: review?(sotaro.ikeda.g)
Assignee: nobody → sushilchauhan
Comment on attachment 8466557 [details] [diff] [review]
Add check to validate acquire fence of FrameBuffer layer.

Review of attachment 8466557 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!
Attachment #8466557 - Flags: review?(sotaro.ikeda.g) → review+
(In reply to Sushil from comment #1)
> I will upload a patch to check for acquire fence validity for FB layer in
> FramebufferSurface, before passing it to HWC HAL.
> 
> Sotaro, do you know, in which condition, the FB layer's acquire fence can be
> invalid?

If Surface::queueBuffer() or Surface::cancelBuffer() is called by EGLSurface without fence file descriptor, FramebufferSurface's fence becomes invalid.
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Sotaro Ikeda [:sotaro] from comment #4)
> 
> If Surface::queueBuffer() or Surface::cancelBuffer() is called by EGLSurface
> without fence file descriptor, FramebufferSurface's fence becomes invalid.

Sorry, in Surface::cancelBuffer()'s case, buffer is not delivered to FramebufferSurface.
Sotaro,

"isValid" check is not sufficient because it only checks for -1 (Reference: include/ui/Fence.h):
- bool isValid() const { return mFenceFd != -1; }

The problem is that framework is sending bogus acquire fence fd (fd > 0) for FB layer to display HAL, that's why display driver returns this error:
- null fence! i=0 fd=150

So driver checks if file* is NULL or file pointer does not support sync fence operations i.e. it is not a sync fence fd. So it is hitting one of these 2 checks below.

struct file *file = fget(fd);

- if (file == NULL)
    return NULL;

- if (file->f_op != &sync_fence_fops)
    return NULL;

The patch in this bug is not sufficient. Please check, in which scenario, we can get a fd, which is not a sync fence fd (bogus fd). It is happening on acquire fence of FB layer (in automation runs).
Assignee: sushilchauhan → nobody
Flags: needinfo?
Flags: needinfo? → needinfo?(sotaro.ikeda.g)
(In reply to Sushil from comment #6)
> 
> The patch in this bug is not sufficient. Please check, in which scenario, we
> can get a fd, which is not a sync fence fd (bogus fd). It is happening on
> acquire fence of FB layer (in automation runs).

all fences except -1 are provided by EGL. Therefore, I assume that the bogus fds are provided by EGL. But EGL implementation is not open sourced. Therefore mozilla side can not analyze the scenarios...
Flags: needinfo?(sotaro.ikeda.g)
By EGL, do you mean GPU (KGSL) is providing bogus fd for FB layer in this case ?
Flags: needinfo?(sotaro.ikeda.g)
android framework expects fence fds are valid except -1 and they are delivered exactly same way as android until FramebufferSurface::onFrameAvailable().
(In reply to Sushil from comment #8)
> By EGL, do you mean GPU (KGSL) is providing bogus fd for FB layer in this
> case ?

From the symptom, it seems most possible. EGL write fence fd to FramebufferSurface via ANativeWindow.
Flags: needinfo?(sotaro.ikeda.g)
Within android framework and gecko, fence fd is handled as android::Fence object. Therefore, there seems no mistake of passing fence fd from ANativeWindow to FramebufferSurface. If there is no possibility of that fence fd changes from valid to bogus, original fd that provided by EGL seems bogus.
[Blocking Requested - why for this release]:
blocking-b2g: --- → 2.0?
Sotaro,

There can be 2 scenarios:
1. The original fd provided by EGL seems to be bogus (as you had mentioned).
2. Someone close fd and do not set the "variable" to -1. Meanwhile, this fd is occupied by some non sync fence operation, now same variable is passed to HwcComposer2D without initializing it to -1. Can you also double, there is no such code in framework (before coming to HwcComposer2D) ?
Flags: needinfo?(sotaro.ikeda.g)
Sotaro, I have updated the patch to take care of Scenario # 2 in HwcComposer2D, which I mentioned in previous comment.
Attachment #8466557 - Attachment is obsolete: true
Attachment #8468953 - Flags: review?(sotaro.ikeda.g)
(In reply to Sushil from comment #6)
> 
> The problem is that framework is sending bogus acquire fence fd (fd > 0) for
> FB layer to display HAL, that's why display driver returns this error:
> - null fence! i=0 fd=150

Sushil, I have some questions.

- Which code return this error?
- Is it clear that the acquire fence is provided by FramebufferSurface?
- Does the problem happen when hw composer composition is disabled?

Because, WebGL and hw accelerated canvas 2d could also provide acquire fence. And they are consumed by HwComposer2d during HwComposer composition.
Flags: needinfo?(sotaro.ikeda.g) → needinfo?(sushilchauhan)
Assignee: nobody → sushilchauhan
blocking-b2g: 2.0? → 2.0+
(In reply to Sotaro Ikeda [:sotaro] from comment #15)
> 
> Sushil, I have some questions.
> 
> - Which code return this error?
Display driver code returns this error in BUFFER_SYNC_IOCTL which gets called from display HAL in hwc set. Do you have access to driver code ? Driver checks 2 things: "NULL and non sync fence operation" on file* from this fd, which I have mentioned in Comment 6. Bogus fd = 150 is failing in 1 of them.

> - Is it clear that the acquire fence is provided by FramebufferSurface?
Yes, I have checked logs of all instances of this error, bogus fd is the acquire fence fd of FB layer.

> - Does the problem happen when hw composer composition is disabled?
Yes, this problem will happen even when HWC composition is disabled because the problem is in acquire fence of FB layer.

Can you review my attached patch? Can you also double check all such code in Gecko framework before it comes to HwcComposer2D (all acquire fences coming to HwcComposer2D in GPU and HWC paths).
Flags: needinfo?(sushilchauhan)
Attachment #8468953 - Flags: review?(sotaro.ikeda.g) → review+
Sotaro,

I further looked into ADB logs for the bogus acquire fence fd issue. I found sync_merge() failures. I have listed some of them below. Does it give you some hint? What FB acquire fence does framework send to HwcComposer2D when sync_merge() fails in case of FramebufferSurface, like below instances. The failure log is in Fence::merge() in frameworks/native/libs/ui/Fence.cpp. Please check and let me know.

08-05 20:19:32.870   239   882 E Fence   : merge: sync_merge("FramebufferSurface:1", 100, 167) returned an error: Bad file number (-9)
....
08-05 20:19:40.110   239   882 E Fence   : merge: sync_merge("TextureHostOGL", 209, 110) returned an error: No such file or directory (-2)
08-05 20:19:40.110   239   882 E Fence   : merge: sync_merge("TextureHostOGL", 206, 110) returned an error: No such file or directory (-2)
08-05 20:19:40.110   239   882 E Fence   : merge: sync_merge("TextureHostOGL", 188, 110) returned an error: No such file or directory (-2)
08-05 20:19:40.110   239   882 E Fence   : merge: sync_merge("TextureHostOGL", 184, 110) returned an error: No such file or directory (-2)
08-05 20:19:40.110   239   882 E Fence   : merge: sync_merge("TextureHostOGL", 175, 110) returned an error: No such file or directory (-2)
....
08-05 22:31:38.580   239   882 E Fence   : merge: sync_merge("FramebufferSurface:0", 198, 247) returned an error: Bad file number (-9)
....
08-05 22:33:34.080   239   882 E Fence   : merge: sync_merge("TextureHostOGL", 184, 93) returned an error: Not a typewriter (-25)
....
08-05 22:33:34.140   239   882 E Fence   : merge: sync_merge("TextureHostOGL", 184, 138) returned an error: Bad file number (-9)
....
08-05 22:42:08.180   239   882 E Fence   : merge: sync_merge("TextureHostOGL", 138, 100) returned an error: Bad file number (-9)

Meanwhile, I am planning to add logs in driver code to check which specific case we are hitting, which I mentioned in Comment 6. Do you have access to driver code?
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Sushil from comment #16)
> 
> Can you also double check all such code in
> Gecko framework before it comes to HwcComposer2D (all acquire fences coming
> to HwcComposer2D in GPU and HWC paths).

"fence fd of FB layer" does not pass gecko framework. As in Comment 9, From EGL to FramebufferSurface::onFrameAvailable() is exactly same as android. From FramebufferSurface::onFrameAvailable(), they are different. But they just store fd for next composition.

In android case, raw fd is stored in the function like the following.

FramebufferSurface::onFrameAvailable() 
->HWComposer::fbPost()
 ->HWComposer::setFramebufferTarget()
  ->nextBuffer(buf, acquireFence);
  ->acquireFenceFd = acquireFence->dup();
  ->disp.framebufferTarget->acquireFenceFd = acquireFenceFd;


In gecko case, Fence object is stored in the function like the following.

FramebufferSurface::onFrameAvailable()
  ->nextBuffer(buf, acquireFence);
  ->mPrevFBAcquireFence = new Fence(acquireFence->dup());

https://github.com/sotaroikeda/firefox-diagrams/blob/master/widget/widget_HwcComposer2D__FirefoxOS_1_4.pdf?raw=true
Can you check the errors in Comment 17 ?
(In reply to Sushil from comment #17)
> Sotaro,
> 
> I further looked into ADB logs for the bogus acquire fence fd issue. I found
> sync_merge() failures. I have listed some of them below. Does it give you
> some hint? What FB acquire fence does framework send to HwcComposer2D when
> sync_merge() fails in case of FramebufferSurface, like below instances. The
> failure log is in Fence::merge() in frameworks/native/libs/ui/Fence.cpp.
> Please check and let me know.

It says that FB layer's acquire fence is bogus. gecko re-uses FB layer's acquire fence as gfx layers' release fences. 

By the way, Bug merge is already remove in master gecko by bug 1028532. because merge is not necessary in gecko's use case and it consumes time.
(In reply to Sushil from comment #19)
> Can you check the errors in Comment 17 ?

Fence::merge() returns Fence::NO_FENCE, if merge fails. Therefore error can be checked by checking if merged fence is valid.
Flags: needinfo?(sotaro.ikeda.g)
Should we pull patch from Bug 1028532, to see if it helps on v2.0?
(In reply to Sushil from comment #17)
> 
> Meanwhile, I am planning to add logs in driver code to check which specific
> case we are hitting, which I mentioned in Comment 6. Do you have access to
> driver code?

Recently, Flame builds also build kernel. So, I think I have an access to the driver code. But current mozilla's flame is JB. It seems better to use KK. Since today, flame KK build setting seems to be added to build script.
(In reply to Sushil from comment #22)
> Should we pull patch from Bug 1028532, to see if it helps on v2.0?

Bug 1028532 is only for performance. I do not think it helps to fix this bug.
(In reply to Sotaro Ikeda [:sotaro] from comment #24)
> (In reply to Sushil from comment #22)
> > Should we pull patch from Bug 1028532, to see if it helps on v2.0?
> 
> I do not think it helps to fix this bug.

The above is just my assumption, because I do not know about fence implementation in kernel side and EGL implementation.

If we could uplift just for performance. It seems OK to uplift. A patch iof Bug 1028532 is very simple and seem low risk of regression.
Sotaro, can you ask to uplift the patch in Bug 1028532 to v2.0 because we are not sure of consequence of sync_merge() failure later in display pipeline. I assume you already done sanity test with that patch.
Flags: needinfo?(sotaro.ikeda.g)
Depends on: 1028532
Flags: needinfo?(sotaro.ikeda.g)
Nominated Bug 1028532 to b2g-v2.0.
(In reply to Sotaro Ikeda [:sotaro] from comment #23)
> (In reply to Sushil from comment #17)
> 
> Recently, Flame builds also build kernel. So, I think I have an access to
> the driver code. But current mozilla's flame is JB. It seems better to use
> KK. Since today, flame KK build setting seems to be added to build script.

mozilla's kk flame build is very unstable now. :-(
I am uploading the HG friendly version of reviewed patch.

Patch details:
1. Add check to validate the acquire fence of FrameBuffer layer
2. Reset release fence fd of FB layer after passing it to FB Surface.
3. Reset acquire fence fd of HWC layers before getting it from Gecko.
Attachment #8468953 - Attachment is obsolete: true
Attachment #8469620 - Flags: review+
Sotaro, can you please push the patch in Comment 29 to Try Server? Sorry, I have forgotten the steps :(
Flags: needinfo?(sotaro.ikeda.g)
Keywords: checkin-needed
Sotaro,

I need to create a separate HG patch for v2.0 branch because the code is different at: http://mxr.mozilla.org/mozilla-central/source/widget/gonk/libdisplay/FramebufferSurface.cpp#171 in master and v2.0.

To create an HG patch on b2g-inbound (master), I use the command:
hg clone http://hg.mozilla.org/integration/b2g-inbound

So I need to create an HG patch for v2.0 tip as well. What is the corresponding clone path for v2.0 tip?
Flags: needinfo?(sotaro.ikeda.g)
you can create a commit patch also from git repository. By the following command.
> git diff -p -U8
Flags: needinfo?(sotaro.ikeda.g)
current b2g-v2.0 gecko's repository seems like the following.

https://hg.mozilla.org/releases/mozilla-b2g32_v2_0/
Uploading HG friendly version of reviewed patch for b2g v2.0 branch.
Attachment #8470177 - Flags: review+
Whiteboard: [CR 699312]
https://hg.mozilla.org/mozilla-central/rev/9d23da4eda66
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Whiteboard: [CR 699312] → [caf priority: p1][CR 699312]
Whiteboard: [caf priority: p1][CR 699312] → [b2g-crash][caf-crash 310][caf priority: p1][CR 699312]
Keywords: crash
Observed on: 

Device: msm8610
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.054
Moz BuildID: 20140804000204
B2G Version: 2.0
Gecko Version: 32.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=9e5907995c9327f14cb5d182cee5ff16b1743ed4
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=ed87f0a54baf646eb0b20b4debc090c8016d2104
The patch landed in AU 57.
This bug could be caused by Bug 1053204.
Flags: in-moztrap?(bzumwalt)
No STR is present to create test case to address bug.
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(bzumwalt)
Flags: in-moztrap-
You need to log in before you can comment on or make changes to this bug.