Closed Bug 1573731 Opened 3 months ago Closed 3 months ago

Crash in [@ mozilla::StaticPrefs::InitStaticPrefsFromShared]

Categories

(Core :: Preferences: Backend, defect, P1, critical)

Unspecified
Windows 10
defect

Tracking

RESOLVED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox69 --- unaffected
firefox70 + fixed

People

(Reporter: lizzard, Assigned: njn)

References

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-7155c1a1-e483-404c-84ee-facc60190814.

A few crashes suddenly appeared in the 20190812215403 build.

Top 10 frames of crashing thread:

0 xul.dll mozilla::StaticPrefs::InitStaticPrefsFromShared obj-firefox/dist/include/mozilla/StaticPrefList_dom.h:1286
1 xul.dll mozilla::ipc::SharedPreferenceDeserializer::DeserializeFromSharedMemory ipc/glue/ProcessUtils_common.cpp:176
2 xul.dll mozilla::dom::ContentProcess::Init dom/ipc/ContentProcess.cpp:174
3 xul.dll XRE_InitChildProcess toolkit/xre/nsEmbedFunctions.cpp:739
4 firefox.exe static int content_process_main ipc/contentproc/plugin-container.cpp:56
5 firefox.exe static int NS_internal_main browser/app/nsBrowserApp.cpp:267
6 firefox.exe wmain toolkit/xre/nsWindowsWMain.cpp:131
7 firefox.exe static int __scrt_common_main_seh f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:288
8 kernel32.dll BaseThreadInitThunk 
9 ntdll.dll RtlUserThreadStart 

Any ideas? I don't know if it's even possible to tell from these crash reports what went wrong and if there's any reason to blame IPC.

Flags: needinfo?(kmaglione+bmo)

NI'ing Nick because this is likely related to the latest changes to StaticPrefs. There's one thing worth mentioning: these crashes happened early during content process startup, so they weren't collected properly until I landed the fix for bug 1282776. I had already seen these as unreported crashes on my machine days before (see bug 1448219 comment 6), which is one of the reasons why I speed-landed that fix.

Flags: needinfo?(n.nethercote)

It looks as if these stopped in the 8-17 nightly. I wonder if something got backed out?

#2 crash on the 8-21 Linux Nightly, with 16 crashes.

The stack trace is hard to read because there is generated C++ code and macros involved. The actual crash is a failure of one of the diagnostic asserts here, which means that one of the Internals::GetSharedPrefValue() calls is failing. I can see two possibilities there:

  • pref_SharedLookup() succeeds and then pref->GetValue() returns an error result, which would be caused by WantValueKind() failing.
  • pref_SharedLookup() fails, which would be caused by gSharedMap->Get() failing.
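The two failure modes can be sketched roughly like this. Everything below is an invented simplification for illustration (the types, signatures, and the map contents are not the real Preferences internals), but it shows how either a missing map entry or a value-kind mismatch produces the failed lookup that trips the assert:

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-ins for the real pref machinery.
enum class PrefValueKind { Bool, Int };

struct SharedPref {
  PrefValueKind kind;
  int value;
};

// Stand-in for gSharedMap: pref name -> shared entry.
static std::map<std::string, SharedPref> gSharedMap = {
    {"dom.webdriver.enabled", {PrefValueKind::Bool, 0}},
};

// Possibility 2: the lookup itself fails when the pref is absent from
// the shared map (gSharedMap->Get() failing).
const SharedPref* pref_SharedLookup(const std::string& aName) {
  auto it = gSharedMap.find(aName);
  return it == gSharedMap.end() ? nullptr : &it->second;
}

// Possibility 1: the lookup succeeds, but the stored kind doesn't match
// the requested one (WantValueKind() failing), so GetValue() errors out.
bool GetSharedPrefValue(const std::string& aName, PrefValueKind aKind,
                        int* aOut) {
  const SharedPref* pref = pref_SharedLookup(aName);
  if (!pref) {
    return false;  // map lookup failed
  }
  if (pref->kind != aKind) {
    return false;  // wrong value kind
  }
  *aOut = pref->value;
  return true;
}
```

Either `false` return would surface as the `NS_SUCCEEDED(rv)` diagnostic assertion failure in `InitStaticPrefsFromShared()`.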

Unfortunately I don't have a deep understanding of the pref IPC stuff. jya, do you have any ideas?

Flags: needinfo?(n.nethercote) → needinfo?(jyavenard)

The assertion is that, outside the parent process, any static pref (whether Always or Once) must exist in the shared pref map, since it must have been read (and set) at least once before the shared preference map was created in the parent process.
For this assertion to be triggered, the static prefs must not have been initialised properly in the main process before the shared pref map global object was created.
Here, Preferences::Internals::GetSharedPrefValue returned false (it couldn't find the pref).

Now, I'd be keen to know what happened on August 14th such that these crashes suddenly appeared.

AFAIK, :njn, you're the only person who has touched that code around that time. Has anything been changed in how StaticPrefs are initialised in the main process, or in when they are initialised?

I've tried to grab some of the memory dumps found in crash-stats, but neither Visual Studio 2019 nor WinDbg yields anything of use. Do we know in which process this crash is occurring?

Flags: needinfo?(jyavenard) → needinfo?(n.nethercote)

(In reply to Jean-Yves Avenard [:jya] from comment #6)

> Now, I'd be more keen to know on what happens on August 14th so suddenly you have those crashes.

The crashes were already happening but were not being reported until I landed the fix for bug 1282776. That's why they seem to start on that day.

> I've tried to grab some memory dump found in crash-stats, but neither Visual Studio 2019 nor WinDbg gets something of use. Do we know in which process this crash is occurring?

They're all happening in content processes.

This signature spiked again on 8-22. So far we have 1206 crashes, but only 144 installations.

(100.0% in signature vs 01.03% overall) moz_crash_reason = MOZ_DIAGNOSTIC_ASSERT(false) (NS_SUCCEEDED(rv))

When I checked my minidump files the issue seemed to affect the dom.webdriver.enabled entry. From what I can tell this pref seems to be unused apart from an entry in Navigator.webidl. See this search: https://searchfox.org/mozilla-central/search?q=dom.webdriver.enabled&path=

Could this be the cause, or am I missing something?

I can't see anything special about dom.webdriver.enabled. It's possible there's a problem that affects a lot of prefs and this just happens to be the unlucky first one.

Flags: needinfo?(n.nethercote)

This is still crashing extensively in the Windows builds for 8-22.

Random comment .. I wonder if this is somehow related to bug 1576454. That seems to be related to early-stage allocator crashing, and per comment 2 above, these crashes are also early-in-process and have no obvious explanation.

Bug 1576454 now has a clear explanation -- stack overflow due to too much recursion. It appears to be unrelated to this bug.

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression

I would love to see this fixed for 70 but won't consider it a blocker despite the high volume since (from comment 7) they were already happening but weren't being reported. It is the top crash other than Shutdown and OOM so I think it should be a high priority.

Priority: -- → P1
Keywords: topcrash

I just hit a crash with this signature a whole bunch of times in a row, without even noticing that a crash was happening under the hood:
bp-7a051745-85cb-4313-9ff7-575820190829 8/29/19, 11:08 AM
bp-b76447a3-7154-4c95-aefe-351640190829 8/29/19, 11:08 AM
bp-88958db6-1793-4091-a8ed-60c3e0190829 8/29/19, 11:08 AM
bp-39fb396e-8dde-4d96-b00b-154f10190829 8/29/19, 11:06 AM
bp-c37f0dbd-30eb-4a23-b8ce-fe1b40190829 8/29/19, 11:06 AM
bp-260bdbbc-5da3-4d2f-871d-0062e0190829 8/29/19, 11:06 AM
bp-fcf08662-d714-4c51-86b6-347c30190829 8/29/19, 11:06 AM
bp-1c36c8d2-bcbf-4680-8eaf-05a8b0190829 8/29/19, 11:06 AM
bp-40a639e7-7052-4907-832c-5fe2e0190829 8/29/19, 11:06 AM

My STR (not sure if they're reliable) were:
(1) have a pending Nightly update, ready to install (green update arrow visible on hamburger menu)
(2) Start a separate session of Nightly, e.g. mkdir /tmp/foo; firefox -no-remote -profile /tmp/foo
(This triggers the update to happen underneath your existing session)
(3) Back in your main session of Firefox, Ctrl+N to open a new window.

For me, the New Window would open with the "Sorry, just one more thing we need to do" error page (indicating that it was needing to start a new content process + getting blocked from doing so due to the update that'd happened). And each time I did this (opened a new window & hit that page), I would end up with a new entry in about:crashes with this crash signature.

#1 crash for Linux for the August 28 Nightlies, with about 27% of all crashes.

Also the #4 crash on Windows and the #2 crash on OSX for those Nightlies.

dholbert: thank you for the info, that's very helpful. I suspect the problem is that the main process and the content processes have slightly different ideas of which prefs are defined (because they come from different binaries with different prefs defined), and this triggers the diagnostic assertion failure. I will investigate some more.

Because it's violated when updates occur, and when the violation occurs it's safe to continue, for reasons explained in the patch. This should fix a top crash.

Assuming I've correctly understood what's happening...

  • I've written a patch that should fix the problem.
  • The good news is that all it does is disable the diagnostic assertion, which means that Beta and Release are unaffected by this.
  • Diagnostic assertions do affect Dev Edition, which is built from mozilla-beta, so this will need backporting to mozilla-beta.
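The version-skew scenario behind the fix can be sketched as follows. This is a hypothetical, much-simplified model, not the real InitStaticPrefsFromShared(): the pref names, the aAssertOnMiss flag, and the set-based "map" are all invented for illustration. The point is that after an update, the freshly spawned content process (new binary) may expect a static pref that the still-running parent (old binary) never put into the shared map, and the safe response is to keep the child's compiled-in default rather than assert:

```cpp
#include <cassert>
#include <set>
#include <string>

// Prefs the (old) parent binary knew about when it built the shared map.
static const std::set<std::string> kParentSharedMap = {
    "dom.webdriver.enabled", "layout.some.pref"};

// Prefs the (new) content-process binary expects to initialize.
static const std::set<std::string> kChildStaticPrefs = {
    "dom.webdriver.enabled", "layout.some.pref", "dom.brand.new.pref"};

// Pre-fix behavior: fail (MOZ_DIAGNOSTIC_ASSERT) when any static pref is
// missing from the shared map. Post-fix behavior: tolerate the miss and
// fall back to the child's compiled-in default, which is safe because the
// pref didn't exist in the old parent and so can only hold its default.
bool InitStaticPrefsFromShared(bool aAssertOnMiss, int* aMisses) {
  *aMisses = 0;
  for (const auto& name : kChildStaticPrefs) {
    if (kParentSharedMap.count(name) == 0) {
      if (aAssertOnMiss) {
        return false;  // old code: diagnostic assertion fires
      }
      ++*aMisses;  // new code: keep the static default, keep going
    }
  }
  return true;
}
```

With the invented skew above, the assert-on-miss path fails while the tolerant path succeeds with one fallback, matching the described fix.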
Pushed by nnethercote@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a4c348d43116
Remove a diagnostic assertion in InitStaticPrefsFromShared(). r=jya
Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Assignee: nobody → n.nethercote

Here are the current stats for this crash over the last seven nightly builds:

  • 20190829094151: 14
  • 20190829214656: 10
  • 20190830093857: 4
  • 20190830215433: 0 [the fix landed in this build]
  • 20190831095143: 22 [???]
  • 20190831221004: 0
  • 20190901094958: 0

It looks as expected except for the 22 crashes marked with "[???]". I don't know what to make of that. Nonetheless, my inclination is to wait a few days and see if the crash rate remains at 0 for subsequent builds.

Flags: needinfo?(kmaglione+bmo)

Things have improved, but crashes with this signature are still happening.

I see now there have been two types of crash happening with this signature. In the past week there have been 916 crashes.

  • 313 of them involve the diagnostic assert, e.g. Windows, Linux. These ones stopped happening after this bug's patch landed, unsurprisingly.
  • 603 of them do not involve the diagnostic assert, e.g. Windows, Linux, Mac. These have continued.

The non-diagnostic-assert ones have crash address and crash reason fields that are consistent with a diagnostic assert (e.g. 0x7ffd697135c7 and EXCEPTION_BREAKPOINT on Windows; 0 and SIGSEGV on Linux), but they lack a "MOZ_CRASH Reason (Raw)" field. I can't see what code remaining within InitStaticPrefsFromShared() could cause crashes like these.

Thanks Nick. I spun off Bug 1578430 to track the continued crashes that are happening on both 70 and now 71 nightly.

Component: IPC → Preferences: Backend

The crashes that dholbert experienced in comment 16 are interesting. I have annotated each one.

> bp-7a051745-85cb-4313-9ff7-575820190829 	8/29/19, 11:08 AM 	DIAGNOSTIC_ASSERT
> bp-b76447a3-7154-4c95-aefe-351640190829 	8/29/19, 11:08 AM 	DIAGNOSTIC_ASSERT
> bp-88958db6-1793-4091-a8ed-60c3e0190829 	8/29/19, 11:08 AM 	DIAGNOSTIC_ASSERT
> bp-39fb396e-8dde-4d96-b00b-154f10190829 	8/29/19, 11:06 AM 	Minimal info
> bp-c37f0dbd-30eb-4a23-b8ce-fe1b40190829 	8/29/19, 11:06 AM 	Minimal info
> bp-260bdbbc-5da3-4d2f-871d-0062e0190829 	8/29/19, 11:06 AM 	Minimal info
> bp-fcf08662-d714-4c51-86b6-347c30190829 	8/29/19, 11:06 AM 	gfx crash, unrelated
> bp-1c36c8d2-bcbf-4680-8eaf-05a8b0190829 	8/29/19, 11:06 AM 	DIAGNOSTIC_ASSERT
> bp-40a639e7-7052-4907-832c-5fe2e0190829 	8/29/19, 11:06 AM 	DIAGNOSTIC_ASSERT

Five of them crashed at the diagnostic assert. But three of them have the "minimal info" form:

  • The crash reason and address are the same as for the diagnostic assertion crash reports.
  • They are missing some fields (Install age, Process type).
  • They have some empty fields (Install time, Adapter Vendor ID, Adapter Device ID).

gsvelto, erahm suggested that these "minimal info" crash reports might be due to them lacking an "extra" file due to the crash happening very early during content process startup. Can you explain this a little, e.g. when can that happen, how crash reports are generated in that case, and if any of the fields might be considered unreliable? Thanks.

Flags: needinfo?(gsvelto)

The "minimal info" crash reports are found when we periodically scan the Crash Reports/pending folder and find minidumps without an .extra file attached to them. All the metadata fields should be considered unreliable: I added code to synthesize the .extra file from the currently running version of Firefox, but that might not be the version in which the crash happened (this is especially true for Nightly).

Note that we noticed those orphaned minidumps precisely because of this bug. My guess is that they might be leftovers from before I fixed the issue with .extra file generation in bug 1282776; the change that synthesizes the .extra file for those crashes landed later, in bug 1566855. If current nightlies start generating more of those orphaned minidumps, then we have a major problem with crash generation.
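Assuming a pending-folder layout like the one described (.dmp minidumps with sibling .extra metadata files), a scan for orphaned minidumps might look like this sketch; the function name and directory handling are hypothetical, not the actual crash reporter code:

```shell
# find_orphaned_dumps DIR
# Prints each .dmp file in DIR that has no matching .extra file,
# i.e. the "minimal info" case where metadata must be synthesized.
find_orphaned_dumps() {
  local dir="$1"
  for dmp in "$dir"/*.dmp; do
    # Skip the literal glob when the directory has no .dmp files at all.
    [ -e "$dmp" ] || continue
    if [ ! -f "${dmp%.dmp}.extra" ]; then
      echo "orphaned: $dmp"
    fi
  done
}
```

Pointing it at a profile's `Crash Reports/pending` directory would list any minidumps whose metadata was lost at crash time.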

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #29)

> the current running version of Firefox but that might not be the one in which the crash happened (this is especially true for nightly).

That might explain what happened when I tried investigating one of those crashes by disassembling the “matching” build: the reported offset from libxul wasn't in the right function (or even at an instruction boundary), but the offset from the function was at the right place inside a MOZ_CRASH (writing the value of __LINE__ to address 0… but maybe not the right line).

See Also: → 1578430