Closed Bug 1001897 Opened 10 years ago Closed 10 years ago

crash in wifi_connect_on_socket_path

Categories

(Firefox OS Graveyard :: Wifi, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:1.4+)

RESOLVED WONTFIX
2.0 S6 (18july)
blocking-b2g 1.4+

People

(Reporter: tkundu, Assigned: hchang)

References

Details

(Keywords: crash, Whiteboard: [caf priority: p1][CR 655397][b2g-crash][p=3])

Crash Data

Attachments

(9 files)

Attached file stack trace
Test steps:

1. Made an MO call and got connected.
2. Sent multiple MO SMS messages.
3. Turned BT on and paired with another device.
4. Transferred video files via BT.
5. Downloaded some games via WiFi.
6. While toggling WiFi on/off, the device crashed.
blocking-b2g: --- → 1.4?
Severity: normal → critical
blocking-b2g: 1.4? → 1.4+
Keywords: crash
Whiteboard: [CR 655397] → [CR 655397][b2g-crash]
Looks like this is caused by a mismatched C library function call:

The stack trace:

0  libhardware_legacy.so!wifi_connect_on_socket_path [wifi.c : 653 + 0x2] ==> JB/KK wifi.c (because wifi_connect_on_socket_path only appears in JB/KK)
1  libxul.so!ICSWpaSupplicantImpl::do_wifi_connect_to_supplicant(char const*) [WifiUtils.cpp : 182 + 0x1] ==> ICS

For ICS, we call wifi.c::wifi_connect_to_supplicant with no argument
( http://dxr.mozilla.org/mozilla-central/source/dom/wifi/WifiUtils.cpp#182 )

but the wifi.c::wifi_connect_to_supplicant actually called seems to be the JB/KK version, which requires an argument "interface".
Looks like Henry has already found the root cause. :-)
Assignee: nobody → hchang
Unfortunately I don't think it's quite that.  In KK, the parameter to wifi_connect_on_socket_path() was dropped again.  

* https://www.codeaurora.org/cgit/quic/la/platform/hardware/libhardware_legacy/tree/wifi/wifi.c?h=kk_3.5#n656
* https://www.codeaurora.org/cgit/quic/la/platform/hardware/libhardware_legacy/commit/wifi?id=0f330488afcfff031bc0ba88de826f998f7cfaf9

I don't see much to go on right now, I'm hoping that we'll catch the crash again soon and maybe get lucky with the logs.
(In reply to Michael Vines [:m1] [:evilmachines] from comment #4)
> Unfortunately I don't think it's quite that.  In KK, the parameter to
> wifi_connect_on_socket_path() was dropped again.  
> 
> *
> https://www.codeaurora.org/cgit/quic/la/platform/hardware/libhardware_legacy/
> tree/wifi/wifi.c?h=kk_3.5#n656
> *
> https://www.codeaurora.org/cgit/quic/la/platform/hardware/libhardware_legacy/
> commit/wifi?id=0f330488afcfff031bc0ba88de826f998f7cfaf9
> 
> I don't see much to go on right now, I'm hoping that we'll catch the crash
> again soon and maybe get lucky with the logs.

Yes you're right. If this crash occurred on KK, then the missing parameter is not the root cause.
Are we able to get the wifi.c being used for this crash? Or is it just the unchanged AOSP KK wifi.c?
(In reply to Henry Chang [:henry] from comment #5)
> > * https://www.codeaurora.org/cgit/quic/la/platform/hardware/libhardware_legacy/tree/wifi/wifi.c?h=kk_3.5
>
> Are we able to get the wifi.c being used for this crash? 

Yep, I linked to it in comment 4.  The reference to line 653 in the backtrace doesn't help me at least :-/
(In reply to Michael Vines [:m1] [:evilmachines] from comment #6)
> (In reply to Henry Chang [:henry] from comment #5)
> > > * https://www.codeaurora.org/cgit/quic/la/platform/hardware/libhardware_legacy/tree/wifi/wifi.c?h=kk_3.5
> >
> > Are we able to get the wifi.c being used for this crash? 
> 
> Yep, I linked to it in comment 4.  The reference to line 653 in the
> backtrace doesn't help me at least :-/

Sadly I cannot reproduce this issue on nexus 5 with the codebase I checked out last night.
Also cannot reproduce on nexus 4 kk with manifest:

https://github.com/mozilla-b2g/b2g-manifest/blob/master/nexus-4-kk.xml

except for 

1) gecko 1.4: http://hg.mozilla.org/releases/mozilla-b2g30_v1_4/
2) hardware/libhardware_legacy/wifi/wifi.c @ rev kk_3.5
Hi Michael, could you help check this on your side first, per comments 7 and 8, and confirm that it crashes in wifi.c? Also, what is the reproduction rate of this bug?
Flags: needinfo?(mvines)
We have seen this crash twice, and it is not something that you're going to be able to easily reproduce with a couple of minutes of testing.  I'm hoping that the next time this reproduces we'll have more to go on.
Flags: needinfo?(mvines)
Hi Jason, this issue is still not reproducible. Should we remove the blocking then investigate it when there's more info from CAF? Thanks.
Flags: needinfo?(jsmith)
We've instrumented our build and hopefully the next time this reproduces there'll be more to go on.  In the meantime we are in a holding pattern so if misusing this bug is the preferred Moz workflow then go for it!
According to comment 12, marking it as 1.4?. Thanks.
blocking-b2g: 1.4+ → 1.4?
Flags: needinfo?(jsmith)
blocking-b2g: 1.4? → backlog
blocking-b2g: backlog → 1.4?
I meant to add these additional minidump/extra file to bug 1007766 which is related.  I will update that bug now.
Henry - Is the new information Greg has provided enough to go off of to fix this bug?
Flags: needinfo?(hchang)
(In reply to Jason Smith [:jsmith] from comment #17)
> Henry - Is the new information Greg has provided enough to go off of to fix
> this bug?

So far there are two different kinds of coredump:

1) Stack check failure after wpa_ctrl_attach_helper returns. It's most likely due to
   corruption of the stack frame of wpa_ctrl_attach_helper(). However, I checked the
   wpa_supplicant source code (I checked kk_3.5, not sure if it's being used in the test)
   but found no clue.

2) Invalid memory access in wpa_ctrl_recv in bug 1007766. We suspect it's due to
   a race around the terminating event. For this dump, we are waiting for
   the test with the patch applied, as Bug 1007766 comment 15 mentioned.

I am not sure if both of them are from the same root cause. (I hope they are ...)

Greg, is the revision of libhardware_legacy as well as wpa_supplicant kk_3.5?
Have you tried the same test run with any device that we can get in the Taipei office?
We tried Nexus 4 KK and Nexus 5 but still have not seen the crash....

Thanks!
Flags: needinfo?(hchang) → needinfo?(ggrisco)
Whiteboard: [CR 655397][b2g-crash] → [CR 655397][b2g-crash][p=3]
Target Milestone: --- → 2.0 S2 (23may)
Triage : We are waiting for Greg to come back here before making a (non)blocking call.
(In reply to Henry Chang [:henry] from comment #18)
> (In reply to Jason Smith [:jsmith] from comment #17)
> > Henry - Is the new information Greg has provided enough to go off of to fix
> > this bug?
> 
> There are two different kinds of coredump until now:
> 
> 1) stack check failure after wpa_ctrl_attach_helper returns. It's most
> likely due to
>    the stack frame of function wpa_ctrl_attach_helper() corruption. However,
> I checked
>    wpa_supplicant source code (I checked kk_3.5, not sure if it's being used
> in the test)
>    but found no clue.
> 
> 2) Invalid memory access in wpa_ctrl_recv in 1007766. We are suspecting it's
> due to 
>    some racing issue around the terminating event. For this dump, we are
> waiting for
>    the test with patch applied as Bug 1007766 comment 15 mentioned.
> 
> I am not sure if both of them are from the same root cause. (I hope they are
> ...)
> 
> Greg, is the revision of libhardware_legacy as well as wpa_supplicant kk_3.5?
> Have you tried the same test run with any device that we can get in Taipei
> office?
> We tried nexus 4 kk and nexus 5 but the crash still not seen....
> 
> Thanks!

Henry,

Can we please try the same on a QRD?

Is it reproducible on that device?
(In reply to Preeti Raghunath(:Preeti) from comment #20)
> 
> Henry,
> 
> Can we please try the same on a QRD?
> 
> Is it reproducible on that device?
I wonder if we need to do this again, since this bug comes from the partner and the partner only uses QRD for testing.
(In reply to Henry Chang [:henry] from comment #18)
> 
> Greg, is the revision of libhardware_legacy as well as wpa_supplicant kk_3.5?
> Have you tried the same test run with any device that we can get in Taipei
> office?

Yes, these are both kk_3.5.  I'm waiting for test results from applying the patch in bug 1007766 and will report back once I have them.  We haven't tried on other devices, I think test team only has QRD.
Crash observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.5.01.04.00.113.097
Moz BuildID: 20140511000204
B2G Version: 1.4
Gecko Version: 30.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=17fb44880e95bc7ae363a609d811bf5a9a067b5b
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=2f11e3aba98eb785ec24504fe9988ab61a03b64d
(In reply to cafbot (PoC: ggrisco) from comment #23)
> Crash observed on: 
> 
> Device: msm8226
> Gonk Version: AU_LINUX_GECKO_B2G_KK_3.5.01.04.00.113.097
> Moz BuildID: 20140511000204
> B2G Version: 1.4
> Gecko Version: 30.0
> Gaia: 
> http://git.mozilla.org/?p=releases/gaia.git;a=commit;
> h=17fb44880e95bc7ae363a609d811bf5a9a067b5b
> Gecko:
> http://git.mozilla.org/?p=releases/gecko.git;a=commit;
> h=2f11e3aba98eb785ec24504fe9988ab61a03b64d

This is the crash which occurred on gecko w/o the patch I mentioned applied.
We also expect this bug could be resolved by that patch.
Whiteboard: [CR 655397][b2g-crash][p=3] → [CR 655397][b2g-crash][p=3][ETA: 5/27]
Target Milestone: 2.0 S2 (23may) → 2.0 S6 (18july)
Flags: needinfo?(ggrisco)
Marking this dependent on bug 1005775, since that bug should fix this issue.
blocking-b2g: 1.4? → 1.4+
Depends on: 1005775
(In reply to Jason Smith [:jsmith] from comment #29)
> Marking this dependent on bug 1005775, since that bug should fix this issue.

Actually, this is a different crash than 1005775 and the patch does not fix it.  I had the patch applied when running the tests that produced minidump #3 and #4 (attached).  Sorry, I should have made that more clear.  Crashes with a signature containing "wifi_connect_on_socket_path" are not fixed by the patch from bug 1005775.
Flags: needinfo?(jsmith)
ok
No longer depends on: 1005775
Flags: needinfo?(jsmith)
details for minidump #4 which had patch for bug 1005775 applied:

Device: msm8226
Moz BuildID: 20140518000201
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.5.01.04.00.113.105
B2G Version: 1.4
Gecko Version: 30.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=7019efdbcfa58d3ff4702b018420db3d8753bb93
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=46b48309fc782275104e128f58a601123c21922e
Any chance at pulling in the ETA for this one?
Flags: needinfo?(hchang)
Flags: needinfo?(ikumar)
changing ni to Preeti for ETA.
Flags: needinfo?(ikumar) → needinfo?(praghunath)
ni Eric for ETA. I'm now keen to understand next steps.
Flags: needinfo?(praghunath)
We found that calling FD_SET [1] with a socket fd of 1024 or larger would cause the stack overflow that we observe in the crash log. So we suspect there is a file descriptor leak in wifi or elsewhere in the system. We are also still trying to reproduce this crash. In the meantime, are you able to attach gdb to capture this kind of crash and see whether too many files are open at the time of the crash (for example, the b2g process opening tons of sockets connecting to wpa_supplicant)? Thanks!

[1] http://androidxref.com/4.4.2_r2/xref/external/wpa_supplicant_8/src/common/wpa_ctrl.c#454
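To illustrate the FD_SET concern raised above, here is a minimal sketch of the bounds arithmetic. It assumes the usual one-bit-per-descriptor bitmap layout of fd_set (true on common Linux libcs, but an assumption here), and the helper names fd_set_byte_offset and fd_set_write_in_bounds are hypothetical, not taken from wpa_ctrl.c.

```c
#include <stddef.h>
#include <sys/select.h>

/* Byte offset inside an fd_set that FD_SET(fd, ...) would touch,
 * assuming a one-bit-per-descriptor bitmap (an assumption). */
size_t fd_set_byte_offset(int fd)
{
    return (size_t)fd / 8;
}

/* Returns 1 when FD_SET(fd, set) would stay inside the fd_set object,
 * 0 when it would write past it (or fd is negative). */
int fd_set_write_in_bounds(int fd)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return 0;
    return fd_set_byte_offset(fd) < sizeof(fd_set);
}
```

With FD_SETSIZE at 1024, descriptor 1023 is the last one that fits. When the fd_set lives on the stack, a larger descriptor makes FD_SET write beyond it, which would be consistent with the stack check failure seen in the dumps.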
Flags: needinfo?(hchang)
Adding ni to address questions raised by Henry.

PS: Please ni for faster response.
Flags: needinfo?(tkundu)
Flags: needinfo?(ggrisco)
(In reply to Henry Chang [:henry] from comment #36)
> We found FD_SET [1] with socket fd larger than 1024 would cause the stack
> overflow which we observe in the crash log. So we are suspecting if there is
> any file descriptor leak in wifi or the entire system. We are also still
> trying to reproduce this crash. In the mean time, are you able to attach gdb
> to capture this kind of crash to see if too many files are opened at the
> time crashing. (For example, b2g process opens tons of socket connecting to
> wpa_supplicant). Thanks!
> 
> [1]
> http://androidxref.com/4.4.2_r2/xref/external/wpa_supplicant_8/src/common/
> wpa_ctrl.c#454

Thanks for your help. We are trying to find the fd leak by looking into |lsof| output . We will update here soon.
We reproduced the crash after a 10-hour wifi toggle test on Nexus 4 with m-c revision [1] and found a large number of ashmem regions allocated (counted using "adb shell lsof| grep b2g | grep -c ashmem"). Bug 1004191 addressed this issue and supposedly resolved this leak, but we still see it on mozilla-central (the patch for Bug 1004191 is in the tree we use).

[1] http://hg.mozilla.org/mozilla-central/log?rev=41a54c8add09
Henry

So we will continue the investigation, right? Do we want the partner to test with the patch on bug 1004191?
Flags: needinfo?(hchang)
Crash observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.5.01.04.00.113.097
Moz BuildID: 20140511000204
B2G Version: 1.4
Gecko Version: 30.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=17fb44880e95bc7ae363a609d811bf5a9a067b5b
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=2f11e3aba98eb785ec24504fe9988ab61a03b64d
Crash observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.5.01.04.00.113.097
Moz BuildID: 20140511000204
B2G Version: 1.4
Gecko Version: 30.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=17fb44880e95bc7ae363a609d811bf5a9a067b5b
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=2f11e3aba98eb785ec24504fe9988ab61a03b64d
Crash observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.5.01.04.00.113.102
Moz BuildID: 20140515000202
B2G Version: 1.4
Gecko Version: 30.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=2e97bee6bb79d3577dba1bf2a1bbfcba64ee99ab
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=35f27a8e9b3f651748aa22095553024556272de8
sorry for the spam, please ignore comments 41-43.
Flags: needinfo?(ggrisco)
(In reply to Preeti Raghunath(:Preeti) from comment #40)
> Henry
> 
> So we will continue investigation right? Did we want partner to test with
> patch on bug 1004191?

Patch on bug 1004191 is for mozilla-central. I found another bug 998504 about fd leak which was caught and has been resolved on 1.4. 

The top priority for me is to confirm whether the crashes observed by the partner were really caused by an fd leak. I am pretty sure an fd leak could lead to the same crash/stack overflow, but there is no evidence that the partner's crashes were caused by one until we can capture the crash and run "adb shell lsof" at the time it occurs.

Thanks!
Flags: needinfo?(hchang)
Crash observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.5.01.04.00.113.102
Moz BuildID: 20140515000202
B2G Version: 1.4
Gecko Version: 30.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=2e97bee6bb79d3577dba1bf2a1bbfcba64ee99ab
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=35f27a8e9b3f651748aa22095553024556272de8
Hi cafbot, can you also dump the file descriptor list using the "adb shell lsof" command when the crash happens? It's a really important clue for identifying whether the crash is caused by a file descriptor leak.
(In reply to Vincent Chang[:vchang] from comment #47)
> Hi cafbot, can you also dump the file descriptor list using "adb shell lsof"
> command when crash happened ? It's really an important clue to identify if
> the crash is caused by file descriptor leak.

We're working on this feature actually, not currently available though.
Does that mean you are working on fixing the file descriptor leak, or something else?
(In reply to Vincent Chang[:vchang] from comment #49)
> Does it mean that you are working on fixing the file descriptor leak or
> something else ?

We're working on adding 'can you also dump the file descriptor list using "adb shell lsof" command when crash happened ?' to the crash logs.  We are not working on a Gecko file descriptor leak at this time.
(In reply to Michael Vines [:m1] [:evilmachines] from comment #50)
> (In reply to Vincent Chang[:vchang] from comment #49)
> > Does it mean that you are working on fixing the file descriptor leak or
> > something else ?
> 
> We're working on adding 'can you also dump the file descriptor list using
> "adb shell lsof" command when crash happened ?' to the crash logs.  We are
> not working on a Gecko file descriptor leak at this time.

Thanks for your prompt response. :-) 
Maybe you can use the command watch -n 1 "adb shell lsof| grep b2g | grep -c ashmem" to observe whether the count keeps increasing. It seems to increase by 1 every 30 seconds in our testing. By a rough estimate, the fd number may exceed 1024 in less than 10 hours and cause the crash.
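The rough estimate above can be written out as a small calculation. This is only a sketch under the stated assumption of a steady leak of one descriptor every 30 seconds; seconds_until_fd_limit is a hypothetical helper.

```c
/* Seconds until the descriptor number reaches `limit`, given a steady leak
 * of one descriptor every `seconds_per_fd` seconds.
 * Returns 0 if the limit has already been reached. */
long seconds_until_fd_limit(long open_fds, long limit, long seconds_per_fd)
{
    if (open_fds >= limit)
        return 0;
    return (limit - open_fds) * seconds_per_fd;
}
```

Starting from zero, seconds_until_fd_limit(0, 1024, 30) gives 30720 seconds, roughly 8.5 hours, which matches the "less than 10 hours" figure. A process that already has a few hundred descriptors open would get there sooner.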
Attached patch wifi.c.patch
An easier way to prove the fd leak: apply this patch and run the test until it crashes. There should be a log line:

"It's going to have a stack overflow since monitor_conn->s >= 1024"

if the crash is caused by fd leak.
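The attached patch itself is not reproduced in this thread. As a hedged illustration of the kind of check it describes, a guard along these lines could log before select() is used on an oversized descriptor. The function name warn_if_fd_too_large and the FILE-based logging are assumptions made for this sketch, not the patch's actual code.

```c
#include <stdio.h>
#include <sys/select.h>

/* Hypothetical sketch, not the attached wifi.c.patch: report when a socket
 * descriptor is too large to be stored safely in an fd_set.
 * Returns 1 if a warning was logged, 0 otherwise. */
int warn_if_fd_too_large(int sock, FILE *log)
{
    if (sock >= FD_SETSIZE) {
        fprintf(log,
                "It's going to have a stack overflow since monitor_conn->s >= %d\n",
                FD_SETSIZE);
        return 1;
    }
    return 0;
}
```

If this line shows up in the log right before the crash, the fd leak explanation is supported. If the crash occurs without it, another cause is more likely.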
:tk, can you please check out Henry's patch.
Crash observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.5.01.04.00.113.105
Moz BuildID: 20140518000201
B2G Version: 1.4
Gecko Version: 30.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=7019efdbcfa58d3ff4702b018420db3d8753bb93
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=46b48309fc782275104e128f58a601123c21922e
Crash observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.5.01.04.00.113.105
Moz BuildID: 20140518000201
B2G Version: 1.4
Gecko Version: 30.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=7019efdbcfa58d3ff4702b018420db3d8753bb93
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=46b48309fc782275104e128f58a601123c21922e
(In reply to Henry Chang [:henry] from comment #52)
> Created attachment 8428544 [details] [diff] [review]
> wifi.c.patch
> 
> Easier way to prove the fd leak: Apply this patch and run the test until
> crash. There's supposed a line of log:
> 
> "It's going to have a stack overflow since monitor_conn->s >= 1024"
> 
> if the crash is caused by fd leak.

Our stability team is testing a new build with your patch and I will update here when they reproduce it next time.

I also uploaded logs from the last stability test (it also has lsof logs every 30 seconds) in [1]

Crash timestamp is : 2014-05-26 03:38:38 in logs

I can see the total number of FDs opened by the b2g process is 420 at the time of the crash.

Could you please take a look and suggest next steps?

[1] https://drive.google.com/file/d/0B1cSMS8_GuAEYW1MMXZMZnRuWXc/edit?usp=sharing
Flags: needinfo?(tkundu) → needinfo?(hchang)
(In reply to Tapas Kumar Kundu from comment #56)
> (In reply to Henry Chang [:henry] from comment #52)
> > Created attachment 8428544 [details] [diff] [review]
> > wifi.c.patch
> > 
> > Easier way to prove the fd leak: Apply this patch and run the test until
> > crash. There's supposed a line of log:
> > 
> > "It's going to have a stack overflow since monitor_conn->s >= 1024"
> > 
> > if the crash is caused by fd leak.
> 
> Our stability team is testing new build with your patch and I will update
> here when they reproduces it next time. 
> 
> I also uploaded logs from last stability test (it also has lsof logs every
> 30 seconds) in [1]
> 
> Crash timestamp is : 2014-05-26 03:38:38 in logs
> 
> I can see total number of FD opened by b2g process is 420 at the time of
> crash.
> 
> Could you please take a look and suggest ?
> 
> [1]
> https://drive.google.com/file/d/0B1cSMS8_GuAEYW1MMXZMZnRuWXc/edit?usp=sharing

I checked the log and it's not the same crash point as before [1].
The output of lsof looks fine at the time of crash...

[1]
Thread 12 (crashed)
 0  libc.so!strstr [strstr.c : 49 + 0x0]
     r0 = 0xfffffff1    r1 = 0xb6d25714    r2 = 0x00000012    r3 = 0x0000005a
     r4 = 0x9f900000    r5 = 0x9f900000    r6 = 0xb6d25714    r7 = 0x00000063
     r8 = 0x00000012    r9 = 0x9f8fe050   r10 = 0xb6eae394   r12 = 0x01000000
     fp = 0xb6eac2ec    sp = 0xb08fff20    lr = 0xb6e8b6fd    pc = 0xb6e8b6e8
    Found by: given as instruction pointer in context
 1  libhardware_legacy.so!update_ctrl_interface [wifi.c : 448 + 0x9]
     r4 = 0x9f8fe050    r5 = 0x000000d0    r6 = 0xb6d25d80    r7 = 0x00000072
     r8 = 0xb6eae394    r9 = 0x9f8fe050   r10 = 0xb6eae394    fp = 0xb6eac2ec
     sp = 0xb08fff38    pc = 0xb6d2391f
    Found by: call frame info
 2  libhardware_legacy.so!ensure_config_file_exists [wifi.c : 495 + 0x5]
     r4 = 0xb6d25d80    r5 = 0xb6eae394    r6 = 0xb6d2554d    r7 = 0xb6eae394
     r8 = 0xb6eae394    r9 = 0xb63c67f0   r10 = 0x00000000    fp = 0xb6eac2ec
     sp = 0xb0900030    pc = 0xb6d23ae5
    Found by: call frame info
Flags: needinfo?(hchang)
(In reply to Henry Chang [:henry] from comment #57)
> (In reply to Tapas Kumar Kundu from comment #56)
> > I also uploaded logs from last stability test (it also has lsof logs every
> > 30 seconds) in [1]
> > 
> > Crash timestamp is : 2014-05-26 03:38:38 in logs
> > 
Sorry, the crash timestamp is 2014-05-25 03:38:38 in the logs. Sorry for the confusion. Please let me know if you checked the lsof logs for some other timestamp.

> I checked the log and it's not the same crash point as before [1].

We guessed that crash [1] may be same as this crash. But our guess may be wrong. 

> [1]
> https://drive.google.com/file/d/0B1cSMS8_GuAEYW1MMXZMZnRuWXc/edit?usp=sharing
Flags: needinfo?(hchang)
(In reply to Tapas Kumar Kundu from comment #58)
> (In reply to Henry Chang [:henry] from comment #57)
> > (In reply to Tapas Kumar Kundu from comment #56)
> > > I also uploaded logs from last stability test (it also has lsof logs every
> > > 30 seconds) in [1]
> > > 
> > > Crash timestamp is : 2014-05-26 03:38:38 in logs
> > > 
> Sorry, crash timestamp is 2014-05-25 03:38:38 in logs . Sorry for confusion.
> Please let me know if you checked lsof logs for some other timestamp. 
> 

So, the lsof dump at 2014-05-25 03:38:38 does show that fds (most likely "anon_inode:dmabuf")
were leaking; this is definitely a bug and will definitely result in the stack overflow.

> > I checked the log and it's not the same crash point as before [1].
> 
> We guessed that crash [1] may be same as this crash. But our guess may be
> wrong. 
> 
> > [1]
> > https://drive.google.com/file/d/0B1cSMS8_GuAEYW1MMXZMZnRuWXc/edit?usp=sharing

I am not saying they have different root causes, since I cannot explain
what really caused strstr to crash in the log you attached. But the points where
the crashes occurred are different. I am now trying to connect the strstr crash to the fd leak.

thanks!
Flags: needinfo?(hchang)
Filed Bug 1017589 for the fd leak observed in the log in comment 56. If we can prove the wifi_connect_on_socket_path crash is caused by the fd leak, we can make this bug a dup of Bug 1017589.
(In reply to Tapas Kumar Kundu from comment #56)
> (In reply to Henry Chang [:henry] from comment #52)
> > Created attachment 8428544 [details] [diff] [review]
> > wifi.c.patch
> > 
> > Easier way to prove the fd leak: Apply this patch and run the test until
> > crash. There's supposed a line of log:
> > 
> > "It's going to have a stack overflow since monitor_conn->s >= 1024"

Tapas, why is there a limit of 1024? On b2g, the b2g process's file descriptor count could rise to more than 4000 in some use cases. See Bug 877495. Therefore, the current b2g process's file descriptor limit is set to 8192 by the following.

 https://github.com/mozilla-b2g/gonk-misc/blob/master/b2g.sh#L6
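The two numbers discussed here come from different places: the per-process descriptor limit is a runtime setting, while the capacity of an fd_set is the compile-time constant FD_SETSIZE. A small sketch of comparing them follows; the helper names are hypothetical.

```c
#include <sys/resource.h>
#include <sys/select.h>

/* Soft limit on open descriptors for the calling process, or -1 on error. */
long soft_nofile_limit(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;
    return (long)lim.rlim_cur;
}

/* Returns 1 when a process with this limit may be handed descriptors
 * that an fd_set (and therefore select()) cannot represent. */
int limit_exceeds_fd_setsize(long nofile_limit)
{
    return nofile_limit > FD_SETSIZE;
}
```

With the limit raised to 8192, limit_exceeds_fd_setsize(8192) is 1 on systems where FD_SETSIZE is 1024. So the process is allowed to receive descriptors that code built around select() cannot handle, even though no limit was exceeded.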
Flags: needinfo?(tkundu)
> I also uploaded logs from last stability test (it also has lsof logs every
> 30 seconds) in [1]
> 
> Crash timestamp is : 2014-05-26 03:38:38 in logs
> 
> I can see total number of FD opened by b2g process is 420 at the time of
> crash.
> 
> Could you please take a look and suggest ?

As in Comment 61, we cannot say 420 is a huge number in b2g. It cannot be evidence of leaking.
(In reply to Sotaro Ikeda [:sotaro] from comment #61)
> (In reply to Tapas Kumar Kundu from comment #56)
> > (In reply to Henry Chang [:henry] from comment #52)
> > > Created attachment 8428544 [details] [diff] [review]
> > > wifi.c.patch
> > > 
> > > Easier way to prove the fd leak: Apply this patch and run the test until
> > > crash. There's supposed a line of log:
> > > 
> > > "It's going to have a stack overflow since monitor_conn->s >= 1024"
> 
> Tapas, why it has a limitation of 1024? On b2g, b2g process' file descriptor
> number could rise up to more than 4000 in some use cases. See Bug 877495.
> Therefore, current b2g process's file descriptor's limitation is set to 8192
> by the following.
> 
>  https://github.com/mozilla-b2g/gonk-misc/blob/master/b2g.sh#L6

I am seeing the fd limit as 8192 for the b2g process in v1.4 : https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla-b2g/gonk-misc/tree/b2g.sh?h=mozilla/v1.4#n6 

I am curious to know why you think we have 1024 as the fd limit for the b2g process.

BTW, it seems to me that the patch from comment 52 is just a hypothesis for this bug, which is not proved yet.
Flags: needinfo?(sotaro.ikeda.g)
I already answered the question in Comment 63 at Bug 1017589 Comment 10.
Flags: needinfo?(tkundu)
Flags: needinfo?(sotaro.ikeda.g)
Crash observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.5.01.04.00.113.114
Moz BuildID: 20140528000201
B2G Version: 1.4
Gecko Version: 30.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=cd595be0a8e975559e8938830df5face89bec3e8
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=d591b0c691da6847dcb9a4f626211b597e8807fe
According to Bug 1017589 Comment 15, wpa_supplicant is going to be fixed to avoid the fd_set stack overflow issue, since an fd greater than 1024 is not a bad value for B2G.
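The actual wpa_supplicant change is not shown in this bug. One common way to avoid the fd_set size restriction is to wait with poll(), which takes the descriptor as a plain integer and so has no FD_SETSIZE ceiling. The sketch below shows that general technique only; wait_readable is a hypothetical helper and is not claimed to be what landed.

```c
#include <poll.h>

/* Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error
 * (including hang-up or error conditions without pending data). */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = fd;          /* any valid descriptor value, even >= 1024 */
    pfd.events = POLLIN;
    pfd.revents = 0;

    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Because no bitmap indexed by descriptor number is involved, a socket numbered 4000 is handled the same way as one numbered 4.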
Whiteboard: [CR 655397][b2g-crash][p=3][ETA: 5/27] → [caf priority: p1][CR 655397][b2g-crash][p=3][ETA: 5/27]
Hi TK, according to Bug 1017589 Comment 14, have you fixed wpa_supplicant? Has it brought any improvement for this bug?
Flags: needinfo?(tkundu)
(In reply to Ken Chang[:ken] from comment #67)
> Hi TK, according to Bug 1017589 Comment 14, have you fixed the
> wap_supplicant? have it any improvement for this bug?

Yes. We have landed some fixes for wpa_supplicant and we are trying to find more use cases where the b2g process calls select(). I will update here soon.
Waiting for Partner's feedback.
Whiteboard: [caf priority: p1][CR 655397][b2g-crash][p=3][ETA: 5/27] → [caf priority: p1][CR 655397][b2g-crash][p=3]
We are not seeing this issue anymore in our testing, but we are seeing bug 1025414.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(tkundu)
Resolution: --- → WONTFIX