Closed Bug 1175387 Opened 6 years ago Closed 6 years ago

libmdns block on recvfrom()

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:2.5+, firefox42 fixed, b2g-master fixed)

RESOLVED FIXED
FxOS-S4 (07Aug)
blocking-b2g 2.5+
Tracking Status
firefox42 --- fixed
b2g-master --- fixed

People

(Reporter: schien, Assigned: xeonchen)

References

Details

Attachments

(1 file, 2 obsolete files)

This is a follow-up for bug 1171827.
I encountered this bug again. After debugging mdnsd, I found that it error handling logic and mdnsd enters an infinite loop:

* In external/mdnsresponder/mDNSPosix/mDNSPosix.c::mDNSPosixRunEventLoopOnce(), it calls select(). select() returns -1 and errno = EBADF. One of the file descriptor being watched has an error. It notifies its caller that there are no data dispatched.
* In external/mdnsresponder/mDNSPosix/PosixDaemon.c::MainLoop(), continues to the next iteration, with gotData == false and enter mDNSPosixRunEventLoopOnce(). Because EBADF was not handled, select() will still return -1 and EBADF. This becomes a CPU spin without any progress.

We need to upgrade mdnsresponder to see if this can be fixed in a newer version.
Currently have this happening on the Sony foxfooding device. I had a 20 minute period where:
- first was slow to wake
- took 30 secs to respond to each tap
- after device went back to sleep, could not wake it with the power button, device was hot, adb reported 80% CPU consumption of root b2g process. 

While reading bugs about this for 5-10 mins, CPU consumption suddenly went back to zero, and device back to normal.
Gary, Cervantes, are there any updates on this? Based on comment 2 and other reports, foxfooders are still facing this issue. This is a very bad bug.
blocking-b2g: --- → spark+
Flags: needinfo?(xeonchen)
Flags: needinfo?(cyu)
We may need to patch this our own and upstream the fix. We are already using master branch of mdnsresponder and it hasn't been updated for ~3yrs.
Flags: needinfo?(cyu)
(In reply to Cervantes Yu from comment #4)
> We may need to patch this our own and upstream the fix. We are already using
> master branch of mdnsresponder and it hasn't been updated for ~3yrs.

Yes, let's please do that :)
blocking-b2g: spark+ → 2.5+
We don't know what triggers this issue, which makes it difficult to investigate and verify.
The library is also used on Android and Apple's devices, but they don't have this issue?

Garvan, do you mean your device went back to normal without any action?
Did you change the network status before the situation happened? E.g. enter basement/elevator
which cause cellular/WiFi disconnect?
Flags: needinfo?(xeonchen) → needinfo?(gkeeley)
> Garvan, do you mean your device went back to normal without any action?

Yes, I was running adb top in a terminal, and then noticed the CPU usage had dropped, I picked up the device and it was responding again. The entire time, the device was on my desk (from when I noticed the problem to when it was resolved).

> Did you change the network status before the situation happened? E.g. enter
> basement/elevator
> which cause cellular/WiFi disconnect?

No.

I am sure other foxfooders will have this happen, we should let the list know we are looking for more feedback on the events surrounding this issue.
Flags: needinfo?(gkeeley)
The file descriptor owned by |DNSServiceRef| is closed before being detached from STS. It may cause |EBADF| while polling on a closed descriptor.

Besides, the descriptor is closed twice by |DNSServiceRefDeallocate| and |PR_Close|, so I use |PR_CreateSocketPollFd| instead.

Patrick, would you give some comment on these modifications?
Attachment #8628111 - Flags: feedback?(mcmanus)
(In reply to Gary Chen [:xeonchen] from comment #8)

> Besides, the descriptor is closed twice by |DNSServiceRefDeallocate| and
> |PR_Close|, so I use |PR_CreateSocketPollFd| instead.

we definitely need to backport any fix becuase of that. It could accidentally close another (unrelated socket - opened after the first close).
Comment on attachment 8628111 [details] [diff] [review]
0001-Bug-1175387-close-file-descriptor-after-detached-fro.patch

Review of attachment 8628111 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/dns/mdns/libmdns/MDNSResponderOperator.cpp
@@ +122,5 @@
>      if (osfd == -1) {
>        return NS_ERROR_FAILURE;
>      }
>  
> +    mFD = PR_CreateSocketPollFd(osfd);

to be honest, I've never used that bit of nspr.. but the documentation says the result is only for use with pr_poll() and you're doing way more than that - so I don't think it is a good idea.

I think you can continue with the old importFile approach, but just before the pr_close() do PR_ChangeFileDescNativeHandle() to change the osfd to -1.
Attachment #8628111 - Flags: feedback?(mcmanus) → feedback-
Depends on: 1180596
Close fd after detached from STS, and make PR_Close does nothing.
Attachment #8628111 - Attachment is obsolete: true
Attachment #8638249 - Flags: review?(mcmanus)
Attachment #8638249 - Flags: review?(mcmanus) → review+
Keywords: checkin-needed
Keywords: checkin-needed
Needs rebasing.
Keywords: checkin-needed
rebase, carry r+
Attachment #8638249 - Attachment is obsolete: true
Attachment #8643449 - Flags: review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/1f6f47627dbd
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → FxOS-S4 (07Aug)
See Also: → 1194532
You need to log in before you can comment on or make changes to this bug.