Closed Bug 1729466 Opened 3 years ago Closed 3 years ago

Async logger messages are enqueued even when the async logger isn't running, resulting in a massive memory leak

Categories

(Core :: Audio/Video: cubeb, defect, P2)

Tracking

RESOLVED FIXED
94 Branch
Tracking Status
firefox94 --- fixed

People

(Reporter: jib, Assigned: padenot)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Attached image ACrawl.png

STR:

  1. Open https://jsfiddle.net/jib1/6tbvwgu3/20/ (creates 2000 RTCPeerConnection objects + SRD)

Expected results:

  • Log spew

Actual results:

  • Spew plus Firefox slowly grinds to a halt.
  • Doesn't recover even after closing the tab, likely due to about:webrtc
  • Opening about:webrtc after this seems to stall forever.

Profile doesn't reveal much, but Firefox is near beach ball: https://share.firefox.dev/3zTr6yr

I think we know what's going on here.

  • Allowing 2000+ peer connection objects to be created is costly
  • about:webrtc collects snapshots from all open AND CLOSED peer connections in Firefox, until the user opens about:webrtc, waits for it to load, and clicks the "Clear History" button.

(In reply to Jan-Ivar Bruaroey [:jib] (needinfo? me) from comment #0)

> Profile doesn't reveal much, but Firefox is near beach ball: https://share.firefox.dev/3zTr6yr

The relevant content process is not included in the profile (it likely was too busy to respond to the profiler's IPC), so indeed the profile isn't useful.

about:processes can show a bit of what's going on. We create a large number of threads (not sure if it's infinite or if my content process became unresponsive before it was done creating threads), more than 2000. In about:processes, I saw that the threads we repeatedly create are: PacerThread, ModuleProcessThread, VoiceProcessThread. All of them are active, using 0.3% to 0.5% of a CPU core each.

Changing severity to S3 until we have info that this is super common. It is something we definitely need to look at, but probably after the libwebrtc uplift is complete.

Severity: -- → S3
Priority: -- → P2

(In reply to Michael Froman [:mjf] from comment #2)

> Changing severity to S3 until we have info that this is super common. It is something we definitely need to look at, but probably after the libwebrtc uplift is complete.

(In reply to Florian Quèze [:florian] from comment #1)

> (In reply to Jan-Ivar Bruaroey [:jib] (needinfo? me) from comment #0)
> > Profile doesn't reveal much, but Firefox is near beach ball: https://share.firefox.dev/3zTr6yr
>
> The relevant content process is not included in the profile (it likely was too busy to respond to the profiler's IPC), so indeed the profile isn't useful.
>
> about:processes can show a bit of what's going on. We create a large number of threads (not sure if it's infinite or if my content process became unresponsive before it was done creating threads), more than 2000. In about:processes, I saw that the threads we repeatedly create are: PacerThread, ModuleProcessThread, VoiceProcessThread. All of them are active, using 0.3% to 0.5% of a CPU core each.

Given that part of the libwebrtc update (bug 1654112) includes using a shared thread for a lot of this libwebrtc work, it may improve things considerably.

Depends on: 1654112

If you want to test on a WIP build with the new threading model you can use this.

(In reply to Andreas Pehrson [:pehrsons] from comment #4)

> If you want to test on a WIP build with the new threading model you can use this.

I tested and it's a dramatic improvement:

  • no more thread explosion (the relevant content process had a maximum of 45 threads, only 10 of them active)
  • total CPU use was between 1 and 2 cores.
  • I managed to capture a profile: https://share.firefox.dev/2VwvBQO
  • GraphRunner is using about 50% of a core (about 2/3 of it being in mozilla::AsyncLogger::Log, did you enable some debug logging in your try run?)
  • the MainThread, Socket Thread and IPC I/O Child thread together use about 50% of a core, doing activity triggered by the EverySecondTelemetryCallback_m runnable (i.e. bug 1724417).
  • 3 WebRTCPD threads use about 2% of a core each. Most of this time seems to be spent allocating memory and firing/creating timers (NS_NewTimerWithCallback).

Also, there is a massive memory leak when using this build. Just during the time of my profile from comment 5, the memory track shows memory use increased by 1.5 GB. The system monitor shows the process using 29.6 GB of memory. about:memory shows:

29,787.47 MB (100.0%) -- explicit
├──28,952.53 MB (97.20%) ── heap-unclassified

After several attempts, I managed to capture a profile with the nativeallocations feature enabled: https://share.firefox.dev/2X1UXpZ
It shows that 99% of the retained memory is from the mozilla::AsyncLogger::Log calls (based on the stacks, my guess is they are from the GraphRunner thread).

Just looking at the code, I guess this happens when you're profiling while audio is capturing/playing (so audio callbacks are firing), but you're not profiling the audio callbacks themselves. We generate log entries without consuming them, because the audio callback logging (the consumer side) was never started.

This should happen on Nightly too.

Paul, could you take a look?

Flags: needinfo?(padenot)
Assignee: nobody → padenot
Flags: needinfo?(padenot)

Reading comment 5, I'm turning this bug into the issue described in comment 7, because the uplift fixes the original issue.

Component: WebRTC: Signaling → Audio/Video: cubeb
Summary: Opening 1000s of RTCPeerConnections brings Firefox to a crawl → Async logger messages are enqueued even when the async logger isn't running, resulting in a massive memory leak
Pushed by padenot@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/58b23100d739
Don't enqueue async logging messages when the async logger is not running. r=pehrsons
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 94 Branch