Closed Bug 1057220 Opened 10 years ago Closed 10 years ago

Failed to map ion memory in the client on gonk

Categories

(Core :: Graphics: Layers, defect, P1)

32 Branch
ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla34
blocking-b2g 2.0+
Tracking Status
firefox32 --- wontfix
firefox33 --- wontfix
firefox34 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- unaffected

People

(Reporter: tkundu, Assigned: sotaro)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 217][caf priority: p1][CR 686674])

Attachments

(6 files, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #1038461 +++

+++ This bug was initially created as a clone of Bug #1034294 +++

This bug is created from Bug 1034294 comment 32. 

Logs : 
---------------------------------------------------------
07-14 16:25:00.210  8506  8536 E qdmemalloc: ion: Failed to map memory in the client: Invalid argument
07-14 16:25:00.210  8506  8536 E qdgralloc: Could not mmap handle 0xb0c55c40, fd=45 (Invalid argument)
07-14 16:25:00.210  8506  8536 E qdgralloc: gralloc_register_buffer: gralloc_map failed
07-14 16:25:00.210  8506  8536 W GraphicBufferMapper: registerBuffer(0xb0c55c40) failed -22 (Invalid argument)
07-14 16:25:00.210  8506  8536 E GraphicBuffer: unflatten: registerBuffer failed: Invalid argument (-22)
07-14 16:25:00.210  8506  8536 I Gecko   : ParamTraits<MagicGrallocBufferHandle>::Read() failed to get gralloc buffer
07-14 16:25:00.220  8506  8536 I Gecko   : IPDL protocol error: Error deserializing 'MaybeMagicGrallocBufferHandle'
07-14 16:25:00.220  8506  8536 I Gecko   : [Child 8506] ###!!! ABORT: IPDL error [PSharedBufferManagerChild]: "Error deserializing 'MaybeMagicGrallocBufferHandle'". abort()ing as a result.: file ../../../../../../../../gecko/ipc/glue/ProtocolUtils.cpp, line 198
07-14 16:25:00.230   232   910 W Adreno-GSL: <gsl_ldd_control:412>: ioctl fd 141 code 0xc02c093d (IOCTL_KGSL_SUBMIT_COMMANDS) failed: errno 22 Invalid argument
07-14 16:25:00.230  8506  8536 E Gecko   : mozalloc_abort: [Child 8506] ###!!! ABORT: IPDL error [PSharedBufferManagerChild]: "Error deserializing 'MaybeMagicGrallocBufferHandle'". abort()ing as a result.: file ../../../../../../../../gecko/ipc/glue/ProtocolUtils.cpp, line 198


Crash logs: Please see attachment in bug 1034294 comment 32
Issue is reproduced again with fix from bug 1053204 on msm8926 1GB device

We attached logs in Comment 0 when this happened again.

We asked our internal test team to reproduce it again with additional logging patch from 
bug 1038461 Comment 80 ? 

Please let us know if you want us to reproduce it with any other logs. I have a big doubt that gecko is closing some FD by mistake and that is causing this issue.
Flags: needinfo?(sotaro.ikeda.g)
Observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.066
Moz BuildID: 20140810160202
B2G Version: ---
Gecko Version: 32.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=de28796a8956a48bb98ca67df6a33e0622d642d1
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=2b27becae85092d46bfadcd4fb5605e82e1e1093
Assignee: nobody → sotaro.ikeda.g
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Tapas Kumar Kundu from comment #1)
> 
> We asked our internal test team to reproduce it again with additional
> logging patch from 
> bug 1038461 Comment 80 ? 

Thanks for re-testing:-) Right now, the log paches are enough. If we got more information, we could add more logs.
From attachment 8477215 [details], the crash happened at Setting app.
(In reply to Sotaro Ikeda [:sotaro] from comment #4)
> From attachment 8477215 [details], the crash happened at Setting app.

But from Bug 1038461, the crash is caused by b2g process's close(). It is not clear if the Setting app is related to the crash.
[Blocking Requested - why for this release]:
blocking-b2g: --- → 2.0?
(In reply to Sotaro Ikeda [:sotaro] from comment #4)
> From attachment 8477215 [details], the crash happened at Setting app.

Agreed. I also notice same :) .
blocking-b2g: 2.0? → 2.0+
No longer blocks: CAF-v2.1-FC-metabug
(In reply to Sotaro Ikeda [:sotaro] from comment #3)
> (In reply to Tapas Kumar Kundu from comment #1)
> > 
> > We asked our internal test team to reproduce it again with additional
> > logging patch from 
> > bug 1038461 Comment 80 ? 
> 
> Thanks for re-testing:-) Right now, the log paches are enough. If we got
> more information, we could add more logs.

bionic patch in bug 1038461 Comment 80 is not working and causing some boot hang issue on device. Could you please take a look ? Thanks
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Tapas Kumar Kundu from comment #8)
> (In reply to Sotaro Ikeda [:sotaro] from comment #3)
> > (In reply to Tapas Kumar Kundu from comment #1)
> > 
> > Thanks for re-testing:-) Right now, the log paches are enough. If we got
> > more information, we could add more logs.
> 
> bionic patch in bug 1038461 Comment 80 is not working and causing some boot
> hang issue on device. Could you please take a look ? Thanks

Wasn't the patch worked in the past on your side, was it? On my side, the patch works without problem on v2.0 flame. I confirmed it just now.
Flags: needinfo?(sotaro.ikeda.g)
There seems 2 possibility why the patch makes problem.
- [1] patch to close() makes bad thing.
- [2] logout in eartly stage of booting makes bad thing.
Tapas, can you try an updated bionic patch? I updated the bionic patch. It is attachment 8477769 [details] [diff] [review] in bug 1038461.
Flags: needinfo?(tkundu)
(In reply to Sotaro Ikeda [:sotaro] from comment #10)
> - [2] logout in eartly stage of booting makes bad thing.

I faced this problem during creating the kernel log patch. I am not sure this could happen from user side logout.
Tapas, if attachment 8477769 [details] [diff] [review] does not work, can you test attachment 8477772 [details] [diff] [review] in bug 1038461? It does no logout, it just implement close() with c instead of assembler. I create it to make clear an cause of the boot problem between [1] and [2] in Comment 10.
(In reply to Sotaro Ikeda [:sotaro] from comment #11)
> Tapas, can you try an updated bionic patch? I updated the bionic patch. It
> is attachment 8477769 [details] [diff] [review] in bug 1038461.

New bionic patch works. But new bionic patch won't confirm us if FD is closed accidentally .
Flags: needinfo?(tkundu)
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Tapas  (always NI me) from comment #14)
> (In reply to Sotaro Ikeda [:sotaro] from comment #11)
> > Tapas, can you try an updated bionic patch? I updated the bionic patch. It
> > is attachment 8477769 [details] [diff] [review] in bug 1038461.
> 
> New bionic patch works. But new bionic patch won't confirm us if FD is
> closed accidentally .

The bionic patch do not have a way to recognize b2g. There seems no other good way now. 
Before the crash, close() failure should happen randomly. If we are lucky enough, an incorrect close could be in a log. Can you run test this weekend with the current patches? 

Bug 1053277 seems to make close() to recognize b2g.
Flags: needinfo?(sotaro.ikeda.g) → needinfo?(tkundu)
wrap a close() in gecko and add logout.
Flags: needinfo?(tkundu)
Tapas, can you try attachment 8477789 [details] [diff] [review]? It could add close() log that is called from gecko.
Flags: needinfo?(tkundu)
(In reply to Sotaro Ikeda [:sotaro] from comment #17)
> Tapas, can you try attachment 8477789 [details] [diff] [review]? It could
> add close() log that is called from gecko.

Hi Sotaro,

Did you test this patch in your device ? I can seeing close syscalls logs are coming from gecko in logcat but homescreen is not showing any icons in device after enabling these logs. 

I think that logging all close syscall() in gecko has some overhead and this is causing this issue. 

Could you please change your patch so that it prints a table to close syscall() and device timestamps every 10secs . Please test this patch on your device 1st and confirm that it does not cause additional issue for us. 

Thanks a lot for your help
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(tkundu)
Hi Sotaro, I also tried to log all close syscall using strace but b2g process gives segmentation fault. 

Please give us a minimum logging patch for close syscall and test it in your build fully.
Sotaro, IMO, we should try to store timestamps of close syscall with FD in memory and print it in regular interval when CPU usage is low. WDYT ? Please give us a patch and test it in your build 1st.
(In reply to Tapas  (always NI me) from comment #18)
> (In reply to Sotaro Ikeda [:sotaro] from comment #17)
> > Tapas, can you try attachment 8477789 [details] [diff] [review]? It could
> > add close() log that is called from gecko.
> 
> Hi Sotaro,
> 
> Did you test this patch in your device ? I can seeing close syscalls logs
> are coming from gecko in logcat but homescreen is not showing any icons in
> device after enabling these logs. 

Yes, I tested on v2.0 flame. And it works as expected.
Flags: needinfo?(sotaro.ikeda.g)
When I put patches to bugzilla, basically all patches are tested locally if possible.
(In reply to Tapas  (always NI me) from comment #20)
> Sotaro, IMO, we should try to store timestamps of close syscall with FD in
> memory and print it in regular interval when CPU usage is low. WDYT ? Please
> give us a patch and test it in your build 1st.

I think that it seems more efficient if we just find the code that is doing invalid fd close(). From the experience of Bug 1057220, before fd is closed incorrectly, a lot of invalid fd closes happened. If b2g process is intentionally crashed when invalid fd close happen, we could find out the place of the bug.
The patch cause the crash if gecko close invalid fd.
Reduce logout.
Attachment #8477860 - Attachment is obsolete: true
I put attachment 8477860 [details] [diff] [review] to try server. Two tests were failed because of invalid fd close, but tryserver logs did not have correct decoded stack dump :-(
  https://tbpl.mozilla.org/?tree=Try&rev=a0f0d7c3afc8
Tapas, how do you think about Comment 23?
Flags: needinfo?(tkundu)
(In reply to Tapas  (always NI me) from comment #8)
> > Thanks for re-testing:-) Right now, the log paches are enough. If we got
> > more information, we could add more logs.
> 
> bionic patch in bug 1038461 Comment 80 is not working and causing some boot
> hang issue on device. Could you please take a look ? Thanks

I confirmed it when the patch is applied to kk-flame. But the problem did not happen on JB flame. I suspect it is a KK gonk or kernal's problem.

A week ago, mozilla flameKK build ROM's full flash did not boot at all. Today I tried built flame KK ROM's full flash. It booted correctly, but some capabilities(ex bluetooth) still did not work correctly.
(In reply to Sotaro Ikeda [:sotaro] from comment #28)
> (In reply to Tapas  (always NI me) from comment #8)
> > > Thanks for re-testing:-) Right now, the log paches are enough. If we got
> > > more information, we could add more logs.
> > 
> > bionic patch in bug 1038461 Comment 80 is not working and causing some boot
> > hang issue on device. Could you please take a look ? Thanks
> 
> I confirmed it when the patch is applied to kk-flame. But the problem did
> not happen on JB flame. I suspect it is a KK gonk or kernal's problem.

But mozilla's flame KK build did not cause the problem of Comment 18.
(In reply to Sotaro Ikeda [:sotaro] from comment #27)
> Tapas, how do you think about Comment 23?

Nice idea. I assumed that patch in Comment 25 is for this only .
Flags: needinfo?(tkundu) → needinfo?(sotaro.ikeda.g)
(In reply to Tapas  (always NI me) from comment #30)
> (In reply to Sotaro Ikeda [:sotaro] from comment #27)
> > Tapas, how do you think about Comment 23?
> 
> Nice idea. I assumed that patch in Comment 25 is for this only .

Yes.

But this patch track only gecko's fd close. If the problem happens on other gonk modules in b2g process, the problem can not be tracked. Therefore, bionic's invalid fd close check only in b2g process becomes necessary for further checks. I am going to create a patch of doing it.
Flags: needinfo?(sotaro.ikeda.g)
I confirmed that why the bionic patch does not work on KK. Since KK, __libc_write_log() calls openg() and close() in every function call. Therefor, since applying the patch, close() recursively call close() forever. On JB, fd is hold as static variable and close() is not called.
From comment 32, KK's __libc_write_log() calls open() and close() for the every function call. KK's __libc_write_log() takes longer time than JB's one.
The patch provides the following capability.
- logout close() log only on b2g and b2g content process.
- rollback __libc_write_log() implementation to JB's one.
- abort if close() close invalid fd
By applying both the following patches, I confirmed flame-KK works without problem with the logout.
- part 1: attachment 8477910 [details] [diff] [review]
- part 2: attachment 8477910 [details] [diff] [review]
Tapas, can you confirm that the patches in Comment 36 works. They can check more wide area than attachment 8477861 [details] [diff] [review].
Flags: needinfo?(tkundu)
(In reply to Sotaro Ikeda [:sotaro] from comment #36)
> By applying both the following patches, I confirmed flame-KK works without
> problem with the logout.
> - part 1: attachment 8477910 [details] [diff] [review]
> - part 2: attachment 8477910 [details] [diff] [review]

sorry, Correction:
- part 1: attachment 8477910 [details] [diff] [review]
- part 2: attachment 8477911 [details] [diff] [review]
(In reply to Sotaro Ikeda [:sotaro] from comment #38)
> (In reply to Sotaro Ikeda [:sotaro] from comment #36)
> > By applying both the following patches, I confirmed flame-KK works without
> > problem with the logout.
> > - part 1: attachment 8477910 [details] [diff] [review]
> > - part 2: attachment 8477910 [details] [diff] [review]
> 
> sorry, Correction:
> - part 1: attachment 8477910 [details] [diff] [review]
> - part 2: attachment 8477911 [details] [diff] [review]

This works fine .I asked our test team to reproduce again with these patches.
Attached file logs
This is the log collected from device which was running some stability test only to reproduce this bug .

Stack trace is different but I am seeing all stacktraces is coming from 

libc.so!close [close.c : 50 + 0x3]

i.e. 

following code with your bionic patches :

int  close(int fd)
{
    if (sIsB2GProcess) {
        __libc_format_log(ANDROID_LOG_WARN, "libc", "close() fd %d", fd);
    }

    int  ret;
    ret = __close(fd);
    if (sIsB2GProcess && (ret < 0) && (fd != -1)) { 
        __libc_format_log(ANDROID_LOG_WARN, "libc", "close() ret %d errno %d fd %d", ret, -errno, fd);
        abort(); =================> This is line number 50 in stack trace.
    }
    return ret;
}

This confirm us that we are closing invalid FD in 3 more places (3 stacktraces) ?

Could you please fix it ?
Flags: needinfo?(tkundu) → needinfo?(sotaro.ikeda.g)
Two of the stacks are the same and in bluetooth.  I filed bug 1058366 for that.  The other one is different, and I filed bug 1058368 for that.
(In reply to Tapas  (always NI me) from comment #40)
> 
> This confirm us that we are closing invalid FD in 3 more places (3
> stacktraces) ?
> 
> Could you please fix it ?

The related bugs are already created by khuey, thanks!
Flags: needinfo?(sotaro.ikeda.g)
log patches part 1 and part 2 are applied and built into AU 73.
Blocks: 1058780
No longer blocks: 1058780
Tapas, can you do re-testing by adding attachment 8479910 [details] [diff] [review] in Bug 1054929?
Flags: needinfo?(tkundu)
(In reply to Sotaro Ikeda [:sotaro] from comment #45)
> Tapas, can you do re-testing by adding attachment 8479910 [details] [diff] [review]
> [review] in Bug 1054929?

we are landing this patch internally and we will update here soon :)
Depends on: 1061993
Depends on: 1054929
We suspect that there are still issues with duplicate fd close and so would like to keep this bug open to track as we plan to debug further with bionic patch.
Greg/Tapas -- discussed with Bhavana and lets close this bug and reopen if we find any future issues. Currently there is no pending bug here.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Flags: needinfo?(tkundu)
You need to log in before you can comment on or make changes to this bug.