libmdns block on recvfrom()

RESOLVED FIXED in Firefox 42

Status

RESOLVED FIXED
4 years ago
2 years ago

People

(Reporter: schien, Assigned: xeonchen)

Tracking

unspecified
FxOS-S4 (07Aug)
ARM
Gonk (Firefox OS)

Firefox Tracking Flags

(blocking-b2g:2.5+, firefox42 fixed, b2g-master fixed)

Details

Attachments

(1 attachment, 2 obsolete attachments)

This is a follow-up for bug 1171827.
I encountered this bug again. After debugging mdnsd, I found that it error handling logic and mdnsd enters an infinite loop:

* In external/mdnsresponder/mDNSPosix/mDNSPosix.c::mDNSPosixRunEventLoopOnce(), it calls select(). select() returns -1 and errno = EBADF. One of the file descriptor being watched has an error. It notifies its caller that there are no data dispatched.
* In external/mdnsresponder/mDNSPosix/PosixDaemon.c::MainLoop(), continues to the next iteration, with gotData == false and enter mDNSPosixRunEventLoopOnce(). Because EBADF was not handled, select() will still return -1 and EBADF. This becomes a CPU spin without any progress.

We need to upgrade mdnsresponder to see if this can be fixed in a newer version.

Comment 2

4 years ago
Currently have this happening on the Sony foxfooding device. I had a 20 minute period where:
- first was slow to wake
- took 30 secs to respond to each tap
- after device went back to sleep, could not wake it with the power button, device was hot, adb reported 80% CPU consumption of root b2g process. 

While reading bugs about this for 5-10 mins, CPU consumption suddenly went back to zero, and device back to normal.
Gary, Cervantes, are there any updates on this? Based on comment 2 and other reports, foxfooders are still facing this issue. This is a very bad bug.
blocking-b2g: --- → spark+
Flags: needinfo?(xeonchen)
Flags: needinfo?(cyu)
We may need to patch this our own and upstream the fix. We are already using master branch of mdnsresponder and it hasn't been updated for ~3yrs.
Flags: needinfo?(cyu)
(In reply to Cervantes Yu from comment #4)
> We may need to patch this our own and upstream the fix. We are already using
> master branch of mdnsresponder and it hasn't been updated for ~3yrs.

Yes, let's please do that :)
blocking-b2g: spark+ → 2.5+
(Assignee)

Comment 6

4 years ago
We don't know what triggers this issue, which makes it difficult to investigate and verify.
The library is also used on Android and Apple's devices, but they don't have this issue?

Garvan, do you mean your device went back to normal without any action?
Did you change the network status before the situation happened? E.g. enter basement/elevator
which cause cellular/WiFi disconnect?
Flags: needinfo?(xeonchen) → needinfo?(gkeeley)

Comment 7

4 years ago
> Garvan, do you mean your device went back to normal without any action?

Yes, I was running adb top in a terminal, and then noticed the CPU usage had dropped, I picked up the device and it was responding again. The entire time, the device was on my desk (from when I noticed the problem to when it was resolved).

> Did you change the network status before the situation happened? E.g. enter
> basement/elevator
> which cause cellular/WiFi disconnect?

No.

I am sure other foxfooders will have this happen, we should let the list know we are looking for more feedback on the events surrounding this issue.
Flags: needinfo?(gkeeley)
(Assignee)

Comment 8

4 years ago
Created attachment 8628111 [details] [diff] [review]
0001-Bug-1175387-close-file-descriptor-after-detached-fro.patch

The file descriptor owned by |DNSServiceRef| is closed before being detached from STS. It may cause |EBADF| while polling on a closed descriptor.

Besides, the descriptor is closed twice by |DNSServiceRefDeallocate| and |PR_Close|, so I use |PR_CreateSocketPollFd| instead.

Patrick, would you give some comment on these modifications?
Attachment #8628111 - Flags: feedback?(mcmanus)
(In reply to Gary Chen [:xeonchen] from comment #8)

> Besides, the descriptor is closed twice by |DNSServiceRefDeallocate| and
> |PR_Close|, so I use |PR_CreateSocketPollFd| instead.

we definitely need to backport any fix becuase of that. It could accidentally close another (unrelated socket - opened after the first close).
Comment on attachment 8628111 [details] [diff] [review]
0001-Bug-1175387-close-file-descriptor-after-detached-fro.patch

Review of attachment 8628111 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/dns/mdns/libmdns/MDNSResponderOperator.cpp
@@ +122,5 @@
>      if (osfd == -1) {
>        return NS_ERROR_FAILURE;
>      }
>  
> +    mFD = PR_CreateSocketPollFd(osfd);

to be honest, I've never used that bit of nspr.. but the documentation says the result is only for use with pr_poll() and you're doing way more than that - so I don't think it is a good idea.

I think you can continue with the old importFile approach, but just before the pr_close() do PR_ChangeFileDescNativeHandle() to change the osfd to -1.
Attachment #8628111 - Flags: feedback?(mcmanus) → feedback-
(Assignee)

Updated

3 years ago
Depends on: 1180596
(Assignee)

Comment 11

3 years ago
Created attachment 8638249 [details] [diff] [review]
0001-Bug-1175387-close-file-descriptor-after-detached-fro.patch

Close fd after detached from STS, and make PR_Close does nothing.
Attachment #8628111 - Attachment is obsolete: true
Attachment #8638249 - Flags: review?(mcmanus)
Attachment #8638249 - Flags: review?(mcmanus) → review+
(Assignee)

Updated

3 years ago
Keywords: checkin-needed
(Assignee)

Updated

3 years ago
Keywords: checkin-needed
Needs rebasing.
Keywords: checkin-needed
(Assignee)

Comment 14

3 years ago
Created attachment 8643449 [details] [diff] [review]
0001-Bug-1175387-close-file-descriptor-after-detached-fro.patch

rebase, carry r+
Attachment #8638249 - Attachment is obsolete: true
Attachment #8643449 - Flags: review+
(Assignee)

Updated

3 years ago
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/1f6f47627dbd
Status: NEW → RESOLVED
Last Resolved: 3 years ago
status-firefox42: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → FxOS-S4 (07Aug)
status-b2g-master: --- → fixed
You need to log in before you can comment on or make changes to this bug.