Open Bug 1728091 Opened 3 years ago Updated 2 years ago

Socket process using 100% of a core (mostly in the IPC I/O Child thread)

Categories

(Core :: WebRTC: Networking, defect, P3)

People

(Reporter: florian, Unassigned)

References

Details

My Nightly (93.0a1 (2021-08-25), 64-bit, on OS X 10.15.7) currently has its Socket process taking 100% of a core. About 98% of that is spent in the "IPC I/O Child" thread, about 2% in the "Socket Thread", and less than 1% in the "SamplerThread" (from the profiler) and the "IPDL Background" thread.

Here is the stack captured by /usr/bin/sample:

    334 Thread_9683838: IPC I/O Child
    + 334 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff720eab8b]
    +   334 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff720ef109]
    +     334 ThreadFunc(void*)  (in XUL) + 10  [0x11593063a]
    +       334 base::Thread::ThreadMain()  (in XUL) + 341  [0x1159320c5]
    +         334 MessageLoop::Run()  (in XUL) + 80  [0x115927910]
    +           334 base::MessagePumpLibevent::Run(base::MessagePump::Delegate*)  (in XUL) + 866  [0x115929b22]
    +             334 event_base_loop  (in XUL) + 2044  [0x115946f6c]
    +               334 base::MessagePumpLibevent::OnLibeventNotification(int, short, void*)  (in XUL) + 229  [0x1159295c5]
    +                 334 IPC::Channel::ChannelImpl::ProcessIncomingMessages()  (in XUL) + 881  [0x115934d61]
    +                   334 mozilla::ipc::NodeChannel::OnMessageReceived(IPC::Message&&)  (in XUL) + 173  [0x1159674cd]
    +                     334 mozilla::ipc::NodeController::OnEventMessage(mojo::core::ports::NodeName const&, mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >)  (in XUL) + 485  [0x11596a6f5]
    +                       334 mojo::core::ports::Node::AcceptEvent(mozilla::UniquePtr<mojo::core::ports::Event, mozilla::DefaultDelete<mojo::core::ports::Event> >)  (in XUL) + 5217  [0x11593c971]
    +                         334 mozilla::ipc::NodeController::PortStatusChanged(mojo::core::ports::PortRef const&)  (in XUL) + 437  [0x115969a85]
    +                           334 mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(IPC::Message&&)  (in XUL) + 773  [0x115962465]
    +                             325 nsThread::Dispatch(already_AddRefed<nsIRunnable>, unsigned int)  (in XUL) + 427  [0x11572285b]
    +                             ! 325 mozilla::detail::EventQueueInternal<16ul>::PutEvent(already_AddRefed<nsIRunnable>&&, mozilla::EventQueuePriority, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator>*)  (in XUL) + 658,712,...  [0x11570e1d2,0x11570e208,...]
    +                             9 nsThread::Dispatch(already_AddRefed<nsIRunnable>, unsigned int)  (in XUL) + 388  [0x115722834]
    +                               9 mozilla::detail::MutexImpl::lock()  (in libmozglue.dylib) + 11  [0x10bae83bb]
    +                                 9 _pthread_mutex_firstfit_lock_slow  (in libsystem_pthread.dylib) + 222  [0x7fff720ea937]
    +                                   9 _pthread_mutex_firstfit_lock_wait  (in libsystem_pthread.dylib) + 83  [0x7fff720ec917]
    +                                     9 __psynch_mutexwait  (in libsystem_kernel.dylib) + 10  [0x7fff7202e062]

What happened before I reached this state:

  1. Bug 1724417 (i.e. I left an Air Mozilla tab open for a few days, and it created hundreds of peer connection objects)
  2. I captured a profile of it (https://share.firefox.dev/3kyndIZ). Interesting observation in this profile: the Poll markers in the "Socket Thread" of the Socket process show we are polling 1000 sockets (is that the maximum?).
  3. I started the profiler again right before closing the tab causing the excessive peer connection use, and then attempted to capture a profile after closing the tab.

Result: capturing the profile took a while; I finally got a profile that does not include the Socket process (https://share.firefox.dev/3sVNOU2), and the Socket process continuously takes 100% of a core and still has its SamplerThread active.

I don't know if the bad state of the Socket process is a side effect of having closed many peer connections at once when closing the tab (could there be an overflow somewhere if we only support 1000 sockets at most, and we attempted to close more?), or a side effect of the failed profile capture.

The Socket Thread samples might also be relevant:

    334 Thread_9683848: Socket Thread
    + 334 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff720eab8b]
    +   334 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff720ef109]
    +     334 _pt_root  (in libnss3.dylib) + 310  [0x10b8ff9e6]
    +       334 nsThread::ThreadFunc(void*)  (in XUL) + 252  [0x11572181c]
    +         334 MessageLoop::Run()  (in XUL) + 80  [0x115927910]
    +           334 mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)  (in XUL) + 205  [0x115966b3d]
    +             334 nsThread::ProcessNextEvent(bool, bool*)  (in XUL) + 4340  [0x115724f64]
    +               334 non-virtual thunk to mozilla::net::nsSocketTransportService::Run()  (in XUL) + 13  [0x1157dc48d]
    +                 308 mozilla::net::nsSocketTransportService::Run()  (in XUL) + 1294  [0x1157da64e]
    +                 ! 307 PR_Poll  (in libnss3.dylib) + 696  [0x10b8fbdd8]
    +                 ! : 307 poll  (in libnss3.dylib) + 474  [0x10b8f524a]
    +                 ! :   307 __select  (in libsystem_kernel.dylib) + 10  [0x7fff720340fe]
    +                 ! 1 PR_Poll  (in libnss3.dylib) + 485  [0x10b8fbd05]
    +                 26 mozilla::net::nsSocketTransportService::Run()  (in XUL) + 856  [0x1157da498]
    +                   26 nsThread::ProcessNextEvent(bool, bool*)  (in XUL) + 4340  [0x115724f64]
    +                     26 mozilla::MozPromise<bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, false>::ThenValueBase::ResolveOrRejectRunnable::Run()  (in XUL) + 61  [0x11905370d]
    +                       26 mozilla::MozPromise<bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, false>::ThenValue<mozilla::MediaTransportHandlerSTS::GetIceStats(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, double)::$_20>::DoResolveOrRejectInternal(mozilla::MozPromise<bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, false>::ResolveOrRejectValue&)  (in XUL) + 90  [0x119059aea]
    +                         26 mozilla::MozPromise<mozilla::UniquePtr<mozilla::dom::RTCStatsCollection, mozilla::DefaultDelete<mozilla::dom::RTCStatsCollection> >, nsresult, true>::ChainTo(already_AddRefed<mozilla::MozPromise<mozilla::UniquePtr<mozilla::dom::RTCStatsCollection, mozilla::DefaultDelete<mozilla::dom::RTCStatsCollection> >, nsresult, true>::Private>, char const*)  (in XUL) + 238  [0x119059e9e]
    +                           26 void mozilla::MozPromise<mozilla::UniquePtr<mozilla::dom::RTCStatsCollection, mozilla::DefaultDelete<mozilla::dom::RTCStatsCollection> >, nsresult, true>::Private::Resolve<mozilla::UniquePtr<mozilla::dom::RTCStatsCollection, mozilla::DefaultDelete<mozilla::dom::RTCStatsCollection> > >(mozilla::UniquePtr<mozilla::dom::RTCStatsCollection, mozilla::DefaultDelete<mozilla::dom::RTCStatsCollection> >&&, char const*)  (in XUL) + 186  [0x11905a1da]
    +                             26 mozilla::MozPromise<mozilla::UniquePtr<mozilla::dom::RTCStatsCollection, mozilla::DefaultDelete<mozilla::dom::RTCStatsCollection> >, nsresult, true>::DispatchAll()  (in XUL) + 94  [0x11905a4be]
    +                               17 nsThread::Dispatch(already_AddRefed<nsIRunnable>, unsigned int)  (in XUL) + 388  [0x115722834]
    +                               : 17 mozilla::detail::MutexImpl::lock()  (in libmozglue.dylib) + 11  [0x10bae83bb]
    +                               :   17 _pthread_mutex_firstfit_lock_slow  (in libsystem_pthread.dylib) + 222  [0x7fff720ea937]
    +                               :     17 _pthread_mutex_firstfit_lock_wait  (in libsystem_pthread.dylib) + 83  [0x7fff720ec917]
    +                               :       17 __psynch_mutexwait  (in libsystem_kernel.dylib) + 10  [0x7fff7202e062]
    +                               9 nsThread::Dispatch(already_AddRefed<nsIRunnable>, unsigned int)  (in XUL) + 427  [0x11572285b]
    +                                 9 mozilla::detail::EventQueueInternal<16ul>::PutEvent(already_AddRefed<nsIRunnable>&&, mozilla::EventQueuePriority, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator>*)  (in XUL) + 658,712  [0x11570e1d2,0x11570e208]

I'm not sure whether it's the same problem or not, but I think it is similar to the problem I reported in Bug 1727975.
If possible, next time you could try reducing the profiler interval to 200μs (or look at my profile in that bug).

(In reply to Florian Quèze [:florian] from comment #0)

the Poll markers in the "Socket Thread" of the Socket process show we are polling 1000 sockets (is that the maximum?).

Yes; see SOCKET_LIMIT_TARGET and related code.
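
For context, here is a minimal standalone sketch of the kind of cap being described (plain C++, not the actual nsSocketTransportService code; the BuildPollList helper and its exact behaviour are made up for illustration, and only the 1000-socket figure comes from the discussion above). The point is simply that the socket thread hands a bounded number of descriptors to each poll call, so once enough peer connections exist the Poll markers saturate at that limit:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Illustrative cap, matching the 1000 sockets visible in the Poll markers;
    // the real constant (SOCKET_LIMIT_TARGET) lives in the socket transport
    // service code.
    constexpr size_t kSocketLimitTarget = 1000;

    // Hypothetical helper: only the first kSocketLimitTarget sockets make it
    // into the poll list; the rest have to wait until a slot frees up.
    // Returns how many sockets were left waiting.
    size_t BuildPollList(const std::vector<int>& aAllSockets,
                         std::vector<int>& aPollList) {
      const size_t n = std::min(aAllSockets.size(), kSocketLimitTarget);
      aPollList.assign(aAllSockets.begin(), aAllSockets.begin() + n);
      return aAllSockets.size() - n;
    }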

Something I notice in the profile is that there are a lot of GetIceStats messages being sent from the content process to the socket process, and then the replies coming back. Some of the messages are answered before the burst of requests ends, but it looks like most of them aren't. (I'd like to say approximately how many messages there are per burst, but while the profiler UI can show me a table of those markers, it won't tell me how many rows are in it. It seems to be a large number.)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #5)

(I'd like to say approximately how many messages there are per burst, but while the profiler UI can show me a table of those markers, it won't tell me how many rows are in it. It seems to be a large number.)

You can get this information from the profiler console API. Open the web developer console on the profiler tab, and then run filteredMarkers.length.

If I understood what you wanted to measure, it's 7154 requests and 7154 replies per burst.

GetIceStats and the analysis above seem to show that this is likely a WebRTC issue, and that IPC is just doing what it's told to do?

Component: IPC → WebRTC: Networking

Maybe we want to switch this to using a one-shot timer that is re-scheduled once all the GetStats calls finish, so we don't end up with a backlog of GetStats calls?

https://searchfox.org/mozilla-central/rev/a730b83206183bf097820c5780eef0d5e4103b9d/dom/media/webrtc/jsapi/PeerConnectionCtx.cpp#520-522
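
A minimal standalone sketch of that idea (plain C++ with std::async, not the actual Gecko nsITimer/MozPromise code; CollectStatsAsync, the delays, and the interval below are made up for illustration). The next stats round is only armed after the previous one has completed, so a slow socket process cannot accumulate a backlog of outstanding GetStats requests the way a fixed-interval timer can:

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <thread>

    using namespace std::chrono_literals;

    // Stand-in for a GetIceStats round trip; the 3s delay simulates a socket
    // process that answers more slowly than the nominal polling interval.
    std::future<int> CollectStatsAsync(int aRound) {
      return std::async(std::launch::async, [aRound] {
        std::this_thread::sleep_for(3s);
        return aRound;
      });
    }

    int main() {
      constexpr auto kPollInterval = 1s;
      for (int round = 0; round < 3; ++round) {
        // "One-shot" behaviour: wait for the reply, then re-arm the next poll.
        std::cout << "stats round " << CollectStatsAsync(round).get()
                  << " finished\n";
        std::this_thread::sleep_for(kPollInterval);
      }
      // A repeating 1s timer would instead issue a new request every second,
      // queueing several unanswered requests for every one that completes.
      return 0;
    }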

The severity field is not set for this bug.
:bwc, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(docfaraday)
Severity: -- → S3
Flags: needinfo?(docfaraday)
Priority: -- → P3