Open Bug 1749022 Opened 3 years ago Updated 2 years ago

mdns_service thread remains active in the parent process

Categories

(Core :: WebRTC: Networking, defect, P2)

defect

Tracking

()

Performance Impact medium

People

(Reporter: florian, Assigned: padenot)

References

(Blocks 1 open bug)

Details

(Keywords: perf:resource-use, power, reproducible)

+++ This bug was initially created as a clone of Bug #1690626 +++

After using WebRTC (eg. the data channel demo at https://tomashubelbauer.github.io/webrtc-data-channel-demo/src/index.html) an mdns_service thread remains active in the parent process, taking 2% of a CPU core, and waking up the CPU very often (about every ms). Here's a profile of it: https://share.firefox.dev/3HZt5oT

The mdns_service thread disappears after the content process that used webrtc dies.

Some relevant comments from the previous bug:

(Dan Minor [:dminor] in bug 1690626 comment #2)

This is all my/our code, so I guess I'll apologize in advance for any problems you find. I think hooking into the profiler API directly would probably be the way to go.

My intention was to use timeouts on the socket reads and writes to idle the thread, so this might be fixable with some tuning of those values.

The other problem here is that once started, this runs until shutdown, even if it is no longer necessary. Depending on the network, there might be background mDNS traffic that we waste time parsing. I filed Bug 1569955 to look into shutting things down if there are no hostnames registered with the service.

(Dan Minor [:dminor] in bug 1690626 comment #3)

Also, a big part of the reason why we're running this service is lack of mDNS support on older versions of Windows, in particular Windows 7. So we might be able to toss a lot of this code and rely upon the OS more once we stop supporting Windows 7. Or, start switching over now on platforms where it is well supported, like OS X and Linux. At the time, I was hesitant to have different code paths for different OSs since this introduced a lot of new code all at once, but that isn't the same concern now.

Severity: -- → S4

I don't think a bug causing a CPU wakeup every millisecond can be of severity S4.

Byron, what's the way forward here? Is it to stop the thread when we no longer need it, or to replace it with OS APIs wherever possible?

Severity: S4 → --
Flags: needinfo?(docfaraday)

So we do not seem to have any plan to stop supporting Windows 7, and my understanding is that the usage numbers there are still significant, so switching over to OS support isn't going to carry the day by itself (although it is probably worth doing). Shutting the service down when unneeded is probably the first thing to do here.

Severity: -- → S2
Flags: needinfo?(docfaraday)
Priority: -- → P2
Performance Impact: --- → P1
Blocks: media-triage
Blocks: webrtc-triage
No longer blocks: media-triage

I'm not sure how serious this is. First, only a small percentage of our users leverage webrtc (somewhere around 1%), and with Fission, closing out a meeting on say, zoom, would eventually shut down the content process associated with the zoom domain. Fixing it also isn't going to impact webrtc performance since this service needs to run while webrtc is in use. This said it would be nice to fix and I can prioritize it within the team, but we have other high priorities right now. I'll add this to our triage list and get the teams thoughts on level of effort.

(In reply to Jim Mathies [:jimm] from comment #3)

I'm not sure how serious this is. First, only a small percentage of our users leverage webrtc (somewhere around 1%)

This bug is usually triggered without the user intentionally using webrtc. Eg. I see it when using my bank's website, and I encountered it today again on the website of a major French tv channel where I had a replay video paused in a background tab. I think both of these sites use a peer connection for fingerprinting purpose. about:webrtc showed only closed connections, but the mdns thread was still power hungry until I found and closed the relevant background tab.

No longer blocks: webrtc-triage
Severity: S2 → S3

(In reply to Florian Quèze [:florian] from comment #0)

+++ This bug was initially created as a clone of Bug #1690626 +++

After using WebRTC (eg. the data channel demo at https://tomashubelbauer.github.io/webrtc-data-channel-demo/src/index.html) an mdns_service thread remains active in the parent process, taking 2% of a CPU core, and waking up the CPU very often (about every ms). Here's a profile of it: https://share.firefox.dev/3HZt5oT

Looking at this, where is the 2% cpu utilization itemized? (I see the network request here.)

(In reply to Jim Mathies [:jimm] from comment #5)

(In reply to Florian Quèze [:florian] from comment #0)

+++ This bug was initially created as a clone of Bug #1690626 +++

After using WebRTC (eg. the data channel demo at https://tomashubelbauer.github.io/webrtc-data-channel-demo/src/index.html) an mdns_service thread remains active in the parent process, taking 2% of a CPU core, and waking up the CPU very often (about every ms). Here's a profile of it: https://share.firefox.dev/3HZt5oT

Looking at this, where is the 2% cpu utilization itemized? (I see the network request here.)

The profile shows the wake-ups but 2% cpu use is hard to see in a profile (it's too low for the timeline tooltips to show up). The 2% number likely came from looking at about:processes on my Intel 2018 Macbook Pro. Currently (on a faster high end M1 Max macbook pro) on about:processes I see the mdns_service thread using continuously between 0.5 and 0.8% of a CPU core while I'm watching a stream on air mozilla, and the OS X activity monitor shows more than 1000 idle wake-ups per second for the parent process.

This code originally landed with 10ms timeout on the sockets [1] and was decreased to 1ms later on, but I'm not sure that really improved the situation with the shutdown hangs we were seeing at the time. I'd suggest that we could safely change this back to a 10ms timeout without requiring a lot of additional testing, and potentially increase the timeout after that. This would improve the idle wake-ups situation.

[1] https://searchfox.org/mozilla-central/rev/fc5c4461124b8572442c71bc34947bee68b75551/dom/media/webrtc/transport/mdns_service/src/lib.rs#427

Assignee: nobody → padenot

Dan, I'm looking into this for power usage improvements. Is there any particular reasons this is using blocking threaded IO and not evented IO? Maybe the Rust ecosystem wasn't ready at the time? I'm not sure about the timeline here.

We have tokio-udp in tree, so maybe we can switch to that to solve the wakeup issues.

Flags: needinfo?(dminor)

Yes, I think switching to tokio makes sense, I can't think of any reason not to use it. I don't recall why I didn't use it the first time around, but yes, maybe things weren't mature enough at that time.

Flags: needinfo?(dminor)

The Performance Priority Calculator has determined this bug's performance priority to be P2.

Platforms: [x] Windows [x] macOS [x] Linux [x] Android
[x] Causes severe resource usage
[x] Able to reproduce locally

Performance Impact: P1 → P2
Keywords: reproducible
You need to log in before you can comment on or make changes to this bug.