Closed Bug 1766448 Opened 3 years ago Closed 9 months ago

Hit MOZ_CRASH(NSS_Shutdown failed) at /xpcom/build/XPCOMInit.cpp:769 with WebRTC

Categories

(Core :: XPCOM, defect)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: jkratzer, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: testcase, Whiteboard: [bugmon:bisected,confirmed])

Attachments

(1 file)

Testcase found while fuzzing mozilla-central rev 31346aa577d3 (built with: --enable-debug --enable-fuzzing).

Testcase can be reproduced using the following commands:

$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch --build 31346aa577d3 --debug --fuzzing -n firefox
$ python -m grizzly.replay ./firefox/firefox testcase.html
Hit MOZ_CRASH(NSS_Shutdown failed) at /xpcom/build/XPCOMInit.cpp:769

    ==290346==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f2f38569b64 bp 0x7ffd133b5f20 sp 0x7ffd133b5e90 T290346)
    ==290346==The signal is caused by a WRITE memory access.
    ==290346==Hint: address points to the zero page.
        #0 0x7f2f38569b64 in mozilla::ShutdownXPCOM(nsIServiceManager*) /xpcom/build/XPCOMInit.cpp:769:9
        #1 0x7f2f3f29d1dc in XRE_TermEmbedding() /toolkit/xre/nsEmbedFunctions.cpp:226:3
        #2 0x7f2f390f2a3e in mozilla::ipc::ScopedXREEmbed::Stop() /ipc/glue/ScopedXREEmbed.cpp:90:5
        #3 0x7f2f3f29d875 in XRE_InitChildProcess(int, char**, XREChildData const*) /toolkit/xre/nsEmbedFunctions.cpp:733:16
        #4 0x560d9667be30 in content_process_main /browser/app/../../ipc/contentproc/plugin-container.cpp:57:28
        #5 0x560d9667be30 in main /browser/app/nsBrowserApp.cpp:327:18
        #6 0x7f2f4f93d0b2 in __libc_start_main /build/glibc-sMfBJT/glibc-2.31/csu/../csu/libc-start.c:308:16
        #7 0x560d96651bdc in _start (/home/jkratzer/builds/mc-debug/firefox-bin+0x15bdc) (BuildId: b794bf6e78cdcf18a15287c381b22f521b1dfd30)
    
    UndefinedBehaviorSanitizer can not provide additional info.
    SUMMARY: UndefinedBehaviorSanitizer: SEGV /xpcom/build/XPCOMInit.cpp:769:9 in mozilla::ShutdownXPCOM(nsIServiceManager*)
    ==290346==ABORTING
Attached file Testcase

Looks like WebRTC is conspiring to keep NSS alive until late in shutdown. I'm not sure which of XPCOM, WebRTC and NSS should be responsible for it going away. It doesn't seem like a severe issue in any event.

Summary: Hit MOZ_CRASH(NSS_Shutdown failed) at /xpcom/build/XPCOMInit.cpp:769 → Hit MOZ_CRASH(NSS_Shutdown failed) at /xpcom/build/XPCOMInit.cpp:769 with WebRTC

Bugmon Analysis
Verified bug as reproducible on mozilla-central 20220426094609-31346aa577d3.
Unable to bisect testcase (Testcase reproduces on start build!):

Start: 1c01cb995fc94cb8d7971f9cd31ff6c9d5a7d8c9 (20210427095509)
End: 31346aa577d363be6c1139d1a134507b93f3784f (20220426094609)
BuildFlags: BuildFlags(asan=False, tsan=False, debug=True, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False)

Whiteboard: [bugmon:confirm] → [bugmon:bisected,confirmed]

(In reply to Andrew McCreight [:mccr8] from comment #2)

Looks like WebRTC is conspiring to keep NSS alive until late in shutdown. I'm not sure which of XPCOM, WebRTC and NSS should be responsible for it going away. It doesn't seem like a severe issue in any event.

Whoever is supposed to look at it: a pernosco session would ease the task significantly. Jason, if you can find the time...?

Flags: needinfo?(jkratzer)

Based on the number of different checks in https://searchfox.org/mozilla-central/rev/0ffae75b690219858e5a45a39f8759a8aee7b9a2/xpcom/build/XPCOMInit.cpp#760-779 for downgrading it from a crash to a warning, I think it's probably somewhat expected that we'll accidentally leak sometimes, as it appears we opt out of it in some test suites. Perhaps we should also suppress it when fuzzing?
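For illustration, the kind of crash-to-warning downgrade described above might look roughly like the sketch below. This is not the code in XPCOMInit.cpp: the names in `kOptOutVariables` are hypothetical placeholders, and it is an assumption here that the opt-outs are expressed as environment variables.

```cpp
#include <array>
#include <cstdlib>

enum class ShutdownFailureAction { Crash, Warn };

// Hypothetical opt-out variables. These are placeholders for illustration,
// not the names checked by the real code.
constexpr std::array<const char*, 2> kOptOutVariables = {
    "EXAMPLE_IGNORE_SHUTDOWN_LEAKS",
    "EXAMPLE_LEAK_LOG",
};

// Decide what to do when shutdown of the library fails.
// `lookup` maps a variable name to its value, or nullptr when it is unset.
template <typename Lookup>
ShutdownFailureAction DecideAction(Lookup lookup) {
  for (const char* name : kOptOutVariables) {
    if (lookup(name) != nullptr) {
      // Any opt-out being present downgrades the crash to a warning.
      return ShutdownFailureAction::Warn;
    }
  }
  return ShutdownFailureAction::Crash;
}

// Convenience wrapper that consults the process environment.
inline ShutdownFailureAction DecideActionFromEnvironment() {
  return DecideAction([](const char* name) { return std::getenv(name); });
}
```

Taking the lookup as a parameter keeps the decision testable without touching the real environment. Suppressing the crash when fuzzing would then amount to one more condition that returns `Warn`.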

A pernosco session for this bug can be found here.

Flags: needinfo?(jkratzer)

Looking at that pernosco session, the output includes this:

WARNING: YOU ARE LEAKING THE WORLD (at least one JSRuntime and everything alive inside it, that is) AT JS_ShutDown TIME.  FIX THIS!

If anything in the JSRuntime that has leaked has acquired an NSS resource, NSS won't shut down cleanly. I think NSS complaining is more of a symptom of whatever is leaking the JSRuntime here.
Maybe it would be a good idea to downgrade NSS' failure to a warning when a JSRuntime gets leaked?
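A toy model of that reasoning, with nothing taken from NSS or SpiderMonkey: `ToyCryptoLibrary`, `ToyRuntime` and `ClassifyShutdown` are all hypothetical names. The sketch assumes that shutdown fails while any handle is outstanding, and shows how a leaked owner keeps its handle alive and how the failure could be downgraded when a leak is already known.

```cpp
// Stand-in for a library that refuses to shut down while resources are held.
class ToyCryptoLibrary {
 public:
  // RAII handle: counts itself as outstanding for as long as it lives.
  class Handle {
   public:
    explicit Handle(ToyCryptoLibrary& lib) : mLib(lib) { ++mLib.mOpenHandles; }
    ~Handle() { --mLib.mOpenHandles; }
    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;

   private:
    ToyCryptoLibrary& mLib;
  };

  // Succeeds only when no handles are outstanding.
  bool Shutdown() const { return mOpenHandles == 0; }

 private:
  int mOpenHandles = 0;
};

// Stand-in for a runtime that owns a library resource.
struct ToyRuntime {
  explicit ToyRuntime(ToyCryptoLibrary& lib) : handle(lib) {}
  ToyCryptoLibrary::Handle handle;
};

enum class Outcome { Clean, Warn, Crash };

// Suggested policy: a failed shutdown is only fatal when no runtime leak
// has already been detected, since the leak is the more likely root cause.
inline Outcome ClassifyShutdown(bool shutdownSucceeded, bool runtimeLeaked) {
  if (shutdownSucceeded) {
    return Outcome::Clean;
  }
  return runtimeLeaked ? Outcome::Warn : Outcome::Crash;
}
```

If a `ToyRuntime` is allocated and never destroyed, `Shutdown()` returns false, and `ClassifyShutdown(false, true)` reports a warning rather than a crash.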

(In reply to Nika Layzell [:nika] (ni? for response) from comment #5)

Based on the number of different checks in https://searchfox.org/mozilla-central/rev/0ffae75b690219858e5a45a39f8759a8aee7b9a2/xpcom/build/XPCOMInit.cpp#760-779 for downgrading it from a crash to a warning, I think it's probably somewhat expected that we'll accidentally leak sometimes, as it appears we opt out of it in some test suites. Perhaps we should also suppress it when fuzzing?

I assume we could get a more actionable leak report then? But do we report fuzzer-found leaks at all?

Flags: needinfo?(jkratzer)

It is possible that this doesn't leak in a debug build. We do shutdown GC/CCs in builds where we care about leak checking, and I don't know if these fuzzing builds have them or not.

(In reply to Jens Stutte [:jstutte] from comment #8)

(In reply to Nika Layzell [:nika] (ni? for response) from comment #5)

Based on the number of different checks in https://searchfox.org/mozilla-central/rev/0ffae75b690219858e5a45a39f8759a8aee7b9a2/xpcom/build/XPCOMInit.cpp#760-779 for downgrading it from a crash to a warning, I think it's probably somewhat expected that we'll accidentally leak sometimes, as it appears we opt out of it in some test suites. Perhaps we should also suppress it when fuzzing?

I assume we could get a more actionable leak report then? But do we report fuzzer-found leaks at all?

Jens, we do not currently record testcases that trigger leaks. If it's something you think we should be recording, we have the capability to do so.

Flags: needinfo?(jkratzer)

(In reply to Jason Kratzer [:jkratzer] from comment #10)

Jens, we do not currently record testcases that trigger leaks. If it's something you think we should be recording, we have the capability to do so.

I'd defer this question to more authoritative experts. There could be value, but there could also be a huge number of positives to work through...

Flags: needinfo?(nika)
Flags: needinfo?(continuation)

I think Tyson has looked a bit into fuzzing for leaks. It is probably a big project and I don't know if we have the bandwidth to fix issues it might find right now.

Flags: needinfo?(continuation)

Given that we probably don't have the bandwidth to fix leak fuzzing issues right now, a reasonable approach might be to guard this assertion behind an #ifndef FUZZING so that it doesn't trigger during future fuzzing test runs due to memory leaks.
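A minimal sketch of such a guard, assuming `FUZZING` is defined by the build (for example via `-DFUZZING`). `ShutdownFailureIsFatal` and `CheckShutdown` are hypothetical helpers for illustration, not existing functions:

```cpp
#include <cstdio>
#include <cstdlib>

// True unless the build defines FUZZING.
constexpr bool ShutdownFailureIsFatal() {
#ifndef FUZZING
  return true;
#else
  return false;
#endif
}

// Returns true when shutdown succeeded. On failure it always warns, and
// aborts only in builds where the failure is considered fatal.
inline bool CheckShutdown(bool shutdownSucceeded) {
  if (shutdownSucceeded) {
    return true;
  }
  std::fprintf(stderr, "WARNING: shutdown failed\n");
  if (ShutdownFailureIsFatal()) {
    std::abort();
  }
  return false;
}
```

Keeping the warning unconditional means fuzzing logs would still show the failure even though it no longer aborts the process.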

Flags: needinfo?(nika)
Severity: -- → S3

Bugmon was unable to reproduce this issue.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Keywords: bugmon
Keywords: bugmon

A change to the Taskcluster build definitions over the weekend caused Bugmon to fail when reproducing issues. This issue has been corrected. Re-enabling bugmon.

Testcase crashes using the initial build (mozilla-central 20230325090348-88bef400b57b) but not with tip (mozilla-central 20240322093041-5d6efea5e0bb).

The bug appears to have been fixed in the following build range:

Start: 5684a44a2c2ce7ccb727c718d397c02e0d2d141e (20240319064603)
End: 4872ae54a708638bc5394c4dad81b5ee46da6bfd (20240319164128)
Pushlog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=5684a44a2c2ce7ccb727c718d397c02e0d2d141e&tochange=4872ae54a708638bc5394c4dad81b5ee46da6bfd

jkratzer, can you confirm that the above bisection range is responsible for fixing this issue?
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Flags: needinfo?(jkratzer)
Keywords: bugmon

I'm not quite sure what fixed this, but bug 1857841 or bug 1885859 seem like likely culprits to me. Any thoughts, Paul?

Flags: needinfo?(jkratzer) → needinfo?(pbone)

Maybe bug 1885859, by letting something get freed at the right time, possibly through some references? But it's very much a guess.

Flags: needinfo?(pbone)

I'm not sure if we can say with certainty what this was fixed by but we're no longer seeing it reported by the fuzzers.

Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED