Closed Bug 1868070 Opened 5 months ago Closed 5 months ago

Intermittently the socket thread uses 100% of a single logical core and the browser drops all connections

Categories

(Core :: Networking: HTTP, defect, P1)

Firefox 122
defect

Tracking

()

VERIFIED FIXED
122 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox120 --- unaffected
firefox121 --- unaffected
firefox122 + fixed

People

(Reporter: tgnff242, Assigned: kershaw)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: nightly-community, regression, Whiteboard: [necko-triaged][necko-priority-queue])

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0

Steps to reproduce:

This is an intermittent issue and I don't have STR. It might be triggered by YouTube (I'm never logged in and I use uBO).

Actual results:

Suddenly, the browser starts using a lot of CPU. about:processes shows most of the CPU being used by the socket thread. When that happens, all connections are dropped. Exiting Firefox doesn't terminate its processes.

Expected results:

The first time I experienced this issue was in BuildID 20231201095335.

Attempting to capture an HTTP log results in an empty log.

Capturing a performance profile with the Firefox Profiler is impossible without an internet connection, since saving the profile locally requires opening it on the profiler site first. Should I report this somewhere?

This is a profile I captured with perf: https://share.firefox.dev/46Gd42T

These are the crash reports produced when killing the main process, which hangs on exit: https://crash-stats.mozilla.org/report/index/752f9a78-caed-46e0-b719-7cd890231202 https://crash-stats.mozilla.org/report/index/0525b25a-2682-41f5-a3e9-c4e460231204

Has STR: --- → no
Component: Networking → Networking: HTTP
See Also: → 1867566
Blocks: necko-perf
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged]

It happened again, this time when doing a Google search through the urlbar (suggestions disabled).

Just to be clear, this is not a performance issue: Firefox is unusable whenever this happens and requires a restart; otherwise it cannot load any site.

https://crash-stats.mozilla.org/report/index/fc029fce-439f-41ad-ac4f-dfda00231206

Crash Signature: shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
Crash Signature: shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown → [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown ]

Hi Reporter,

Could you set this pref network.http.http3.max_accumlated_time_ms to 0 and see if this happens again?

Thanks.

Flags: needinfo?(tgnff242)

The bug has a crash signature, thus the bug will be considered confirmed.

Status: UNCONFIRMED → NEW
Ever confirmed: true

The crash signature is not related to this bug.

Assignee: nobody → kershaw
Severity: S3 → S2
Crash Signature: [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown ]
Priority: P2 → P1
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-queue]

I think the problem here is that neqo returns a zero duration from this line. When this happens, accumulated_time never increases, so we are stuck in this loop forever.
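The failure mode described above can be illustrated with a minimal sketch. This is a Python simulation, not the actual neqo or necko code: `run_loop`, `next_timeout_ms`, `budget_ms`, and `exit_on_zero` are hypothetical names, and the assumption is only that the loop keeps going until the accumulated timeouts reach a budget. With a zero timeout the accumulator never grows, so only an explicit zero check lets the loop exit.

```python
from typing import Callable, Tuple


def run_loop(
    next_timeout_ms: Callable[[], int],
    budget_ms: int,
    exit_on_zero: bool,
    max_iterations: int = 10_000,
) -> Tuple[int, bool]:
    """Simulate a loop that exits once accumulated timeouts reach a budget.

    Returns (iterations_run, exited_cleanly). `max_iterations` is a safety
    cap that exists only in this simulation, standing in for "forever".
    """
    accumulated_ms = 0
    for iteration in range(1, max_iterations + 1):
        timeout_ms = next_timeout_ms()  # hypothetical stand-in for the value neqo returns

        # Assumed shape of the fix: leave the loop when the timeout is zero.
        if exit_on_zero and timeout_ms == 0:
            return iteration, True

        accumulated_ms += timeout_ms  # adding zero makes no progress
        if accumulated_ms >= budget_ms:
            return iteration, True

    # The safety cap was hit: in a real loop this would be a busy spin.
    return max_iterations, False


if __name__ == "__main__":
    print(run_loop(lambda: 0, budget_ms=5, exit_on_zero=False, max_iterations=1000))
    print(run_loop(lambda: 0, budget_ms=5, exit_on_zero=True))
    print(run_loop(lambda: 2, budget_ms=5, exit_on_zero=False))
```

In this sketch, a constant zero timeout without the zero check runs until the safety cap, which corresponds to the socket thread spinning at 100% CPU; with the check, the loop exits on the first iteration.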

Tracking this for Fx122 as it was triaged as an S2.
:kershaw is this a regression introduced by Bug 1852924?

Flags: needinfo?(kershaw)

(In reply to Donal Meehan [:dmeehan] from comment #7)

Tracking this for Fx122 as it was triaged as an S2.
:kershaw is this a regression introduced by Bug 1852924?

Yes, this is introduced by Bug 1852924.

Flags: needinfo?(kershaw)
Keywords: regression
Regressed by: 1852924

(In reply to Kershaw Chang [:kershaw] from comment #2)

Could you set this pref network.http.http3.max_accumlated_time_ms to 0 and see if this happens again?

I set the pref yesterday and haven't encountered the issue so far. However, it sometimes occurred three or more times in a day and on other days not at all, so I'm not sure yet whether the pref made a difference. I'll report back in a week unless it happens again sooner.

Flags: needinfo?(tgnff242)
Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7fc200daf13a
Make sure we exit the loop when neqo returns a zero duration, r=necko-reviewers,valentin
Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → 122 Branch

I haven't encountered the bug since. I reset the pref as soon as the patch landed. It seems it's fixed.

Status: RESOLVED → VERIFIED

(In reply to tgn-ff from comment #12)

I haven't encountered the bug since. I reset the pref as soon as the patch landed. It seems it's fixed.

Thanks for verifying.
