Closed Bug 1061993 Opened 5 years ago Closed 5 years ago

Closing an invalid fd in WiFi

Categories

(Firefox OS Graveyard :: Wifi, defect)

ARM
Gonk (Firefox OS)
defect
Not set

Tracking

(blocking-b2g:2.0+, firefox33 wontfix, firefox34 fixed, firefox35 fixed, b2g-v2.0 fixed, b2g-v2.0M fixed, b2g-v2.1 fixed, b2g-v2.2 fixed)

RESOLVED FIXED
2.1 S5 (26sep)
blocking-b2g 2.0+
Tracking Status
firefox33 --- wontfix
firefox34 --- fixed
firefox35 --- fixed
b2g-v2.0 --- fixed
b2g-v2.0M --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: bhargavg1, Assigned: hchang)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 335][caf priority: p1][CR 716601])

Attachments

(5 files, 3 obsolete files)

When using the patch from Bug 1057220 to catch all invalid fds being closed, we see a crash when operating the WiFi module

You can see the point of crash in the logs as, 

libc    : close() ret -1 errno -9 fd 213

In most of the cases when the crash happens see,

D DHCP    : failed to set prefixLength 0: Cannot assign requested address


this crash has been seen 10 times in automation
We need a stack here.
Flags: needinfo?(bhargavg1)
parsed dump file
extra file at time of crash

Please look for "close()" string in the file for the exact point
Flags: needinfo?(bhargavg1)
Finally the logcat
Attachment #8483130 - Attachment is obsolete: true
[Blocking Requested - why for this release]:
blocking-b2g: --- → 2.0?
Suspect to be a race condition while calling ifc_* functions because they all share same variable |ifc_ctl_sock| while doing ioctl() instead of request socket on their own.
There is a checkin-needed patch (see Bug 1038531) which may fix
the race condition by serializing all libnetutil command to one 
thread (NetworkWorker). We can rerun the automation test while
the patch lands.
Flags: needinfo?(dlee)
(In reply to Henry Chang [:henry] from comment #8)
> There is a checkin-needed patch (see Bug 1038531) which may fix
> the race condition by serializing all libnetutil command to one 
> thread (NetworkWorker). We can rerun the automation test while
> the patch lands.

Do you have a patch for v2.0? we can verify the same if available
There are two threads running ifc_* functions at same time after wifi disabled:
1. Network Worker thread [1]: ifc_act_on_ipv4_route()
2. Wifi Control thread [2]: ifc_configure() -> ifc_reset_connection()

The log shows crash happen in thread(30) running |close()| in |ifc_act_on_ipv4_route()|, while another thread(15) running |ioctl()| in |ifc_reset_connection()|.

Because ifc_* shares ifc_ctl_sock[3] between functions, the crash might be caused by ifc_act_on_ipv4_route() and ifc_configure() calls close() almost at same time, leaves no time for the set and check of ifc_ctl_sock in ifc_close()[4] to run.

Since ifc_* functions are not thread-safe, executing all ifc_* functions in single thread, as the patch mentioned in comment 8, might resolve this issue.

[1] http://mxr.mozilla.org/mozilla-central/source/dom/system/gonk/NetworkWorker.cpp#226
[2] http://mxr.mozilla.org/mozilla-central/source/dom/wifi/WifiProxyService.cpp#269
[3] https://android.googlesource.com/platform/system/core/+/master/libnetutils/ifc_utils.c#58
[4] https://android.googlesource.com/platform/system/core/+/master/libnetutils/ifc_utils.c#148
Whiteboard: [CR 716601] → [caf priority: p1][CR 716601]
Whiteboard: [caf priority: p1][CR 716601] → [b2g-crash][caf-crash 335][caf priority: p1][CR 716601]
Keywords: crash
Observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.076
Moz BuildID: 20140829000203
B2G Version: 2.0
Gecko Version: 32.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=5bf3b8cdea15e62ce7bf77a15085a18e24e33c44
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=f2ab70e9ab18439dae25e6eb21f186708b2888de
(In reply to bhargavg1 from comment #9)
> (In reply to Henry Chang [:henry] from comment #8)
> > There is a checkin-needed patch (see Bug 1038531) which may fix
> > the race condition by serializing all libnetutil command to one 
> > thread (NetworkWorker). We can rerun the automation test while
> > the patch lands.
> 
> Do you have a patch for v2.0? we can verify the same if available

I believe the latest patch in Bug 1038531 cannot apply to 2.0. 
Maybe one of the obsoleted patch could (such as attachment 8462323 [details] [diff] [review])
Assignee: nobody → hchang
Attached patch Patch_for_V2_0.patch (obsolete) — Splinter Review
(In reply to Henry Chang [:henry] from comment #13)
> Created attachment 8483383 [details] [diff] [review]
> Patch_for_V2_0.patch

This is a large change and i think too risky for 2.0 at this point. Can we have an alternate solution for 2.0 which is less risky?
Flags: needinfo?(hchang)
(In reply to Inder from comment #14)
> (In reply to Henry Chang [:henry] from comment #13)
> > Created attachment 8483383 [details] [diff] [review]
> > Patch_for_V2_0.patch
> 
> This is a large change and i think too risky for 2.0 at this point. Can we
> have an alternate solution for 2.0 which is less risky?

Probably moving only the function which causes the race condition
is enough. Will provide a lighter patch as soon as possible.
Flags: needinfo?(hchang)
A more lightweight solution: adding exclusive lock to ifc_* functions [1]

[1] http://hg.mozilla.org/mozilla-central/file/776fa9cf70cd/dom/network/NetUtils.cpp#l88
Attached patch Less risky patch (obsolete) — Splinter Review
Chuck,

Could you please take a look at this patch? I add a lock to ifc functions
to avoid race condition and it has lower rick to the release branch. Thansk!
Attachment #8483928 - Flags: review?(chulee)
Comment on attachment 8483928 [details] [diff] [review]
Less risky patch

Review of attachment 8483928 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks!

::: dom/network/src/NetUtils.cpp
@@ +60,5 @@
>  DEFINE_DLFUNC(ifc_remove_host_routes, int32_t, const char*)
>  DEFINE_DLFUNC(ifc_remove_default_route, int32_t, const char*)
>  DEFINE_DLFUNC(dhcp_stop, int32_t, const char*)
>  
> +NetUtils::NetUtils() 

Nit: Extra space
Attachment #8483928 - Flags: review?(chulee) → review+
blocking-b2g: 2.0? → 2.0+
CAF is picking this up today to test and ensure the issue is fixed. NI inder so it does not fall off the plate.
Flags: needinfo?(ikumar)
It was built into our AU: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.081. We will update with our findings shortly.
Flags: needinfo?(ikumar)
We want to see  if crash reproduces with "Less risky patch" and bionic close() debug patch
Removed the trailing white space and renamed the patch since it's not risky at all :p
Attachment #8483383 - Attachment is obsolete: true
Attachment #8483928 - Attachment is obsolete: true
No news is good news!
Let's wait for final confirmation from our side. Adding ni on Greg to confim once we get results on this.
Flags: needinfo?(ggrisco)
Hi Greg and Inder, any update on this one? Thank you.
Flags: needinfo?(ikumar)
Flags: needinfo?(ggrisco)
Hi Howie,

We are not seeing this issue anymore in our stability test. Please go ahead and land this patch. We will create a new bug if it comes again :) .
Flags: needinfo?(ikumar)
Flags: needinfo?(ggrisco)
Hi Henry, please help to land this on master first. After it's fixed, request approval flag: approval‑mozilla‑b2g32 (v2.0). If the patch also apply to v2.1, please request the following approval too: approval‑mozilla‑aurora.
Flags: needinfo?(hchang)
(In reply to howie [:howie] from comment #27)
> Hi Henry, please help to land this on master first. After it's fixed,
> request approval flag: approval‑mozilla‑b2g32 (v2.0). If the patch also
> apply to v2.1, please request the following approval too:
> approval‑mozilla‑aurora.

This patch will only be applied to branched releases since this issue
can be fixed whenever Bug 1038531 lands.
Flags: needinfo?(hchang)
Comment on attachment 8487039 [details] [diff] [review]
No-risk patch (r+'d) (for 2.0 only)

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Wifi
User impact if declined: Switching wifi off may cause invalid fd to be closed.
Testing completed: Yes.
Risk to taking this patch (and alternatives if risky): No 
String or UUID changes made by this patch: No
Attachment #8487039 - Attachment description: No-risk patch (r+'d) → No-risk patch (r+'d) (for 2.0/2.1 only)
Attachment #8487039 - Flags: approval-mozilla-b2g30?
Attachment #8487039 - Flags: approval-mozilla-aurora?
By the way, since the patch is not harmful to master, I also attached patch for m-c. 
Thanks.
Comment on attachment 8487039 [details] [diff] [review]
No-risk patch (r+'d) (for 2.0 only)

Withdraw mozilla-aurora approval request since this patch needs to rebase to apply to mozilla-aurora
Attachment #8487039 - Flags: approval-mozilla-aurora?
Attachment #8487039 - Attachment description: No-risk patch (r+'d) (for 2.0/2.1 only) → No-risk patch (r+'d) (for 2.0 only)
Attachment #8489894 - Attachment description: Bug1061993_for_mc.patch → Bug1061993 (applicable for 2.1 and m-c)
https://hg.mozilla.org/mozilla-central/rev/2ab68be61e8c
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.1 S5 (26sep)
Comment on attachment 8487039 [details] [diff] [review]
No-risk patch (r+'d) (for 2.0 only)

Correct the nom in Comment 29 to approval‑mozilla‑b2g32 (v2.0)
Attachment #8487039 - Flags: approval-mozilla-b2g30? → approval-mozilla-b2g32?
There needs to be an Aurora approval request for the v2.1 uplift as well.
Flags: needinfo?(hchang)
Attachment #8487039 - Flags: approval-mozilla-b2g32? → approval-mozilla-b2g32+
Bhavana -- We need this fixed for 2.1 as well.
Flags: needinfo?(bbajaj)
https://hg.mozilla.org/releases/mozilla-b2g32_v2_0/rev/5cf783171d5c

(In reply to Inder from comment #37)
> Bhavana -- We need this fixed for 2.1 as well.

Needs an approval request first. See comment 35.
Flags: needinfo?(bbajaj)
Henry, can you please help with the needed aurora patch here?
(In reply to bhavana bajaj [:bajaj] from comment #40)
> Henry, can you please help with the needed aurora patch here?

The patch applies cleanly to Aurora. Just needs an approval request...
Comment on attachment 8489894 [details] [diff] [review]
Bug1061993 (applicable for 2.1 and m-c)

Approval Request Comment
[Feature/regressing bug #]: Wifi
[User impact if declined]: Turning off wifi may cause to close invalid fd.
[Describe test coverage new/current, TBPL]: Bug 1057220 and wifi test cases
[Risks and why]: No
[String/UUID change made/needed]: No
Attachment #8489894 - Flags: approval-mozilla-aurora?
Flags: needinfo?(hchang)
Sorry for late response. Already requested approval.
Thanks!
Attachment #8489894 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Unable to verify as it is a back-end issue.
QA Whiteboard: [QAnalyst-Triage?][QAnalyst-verify-]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?][QAnalyst-verify-] → [QAnalyst-Triage+][QAnalyst-verify-]
Flags: needinfo?(ktucker)
You need to log in before you can comment on or make changes to this bug.