Closed Bug 1909898 Opened 4 months ago Closed 3 months ago

Crash in [@ mozilla::GetBootstrap]

Categories

(Core :: XPCOM, defect)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
131 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox-esr128 --- disabled
firefox128 --- disabled
firefox129 --- disabled
firefox130 --- disabled
firefox131 --- fixed

People

(Reporter: aryx, Assigned: rkraesig)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file, 1 obsolete file)

Observed for Firefox early betas (b1 - b6) for version 128 and 129 with a noticeable spike to the end of the early beta cycle (b5 and b6) - are there clients which don't update to late beta builds?

See bug 1835048 for a fixed bug with the same signature.

Crash report: https://crash-stats.mozilla.org/report/index/bca9d736-3fe4-44b5-9f7a-9ff570240725

Reason: EXCEPTION_BREAKPOINT

Top 8 frames:

0  firefox.exe  mozilla::GetBootstrap(char const*, mozilla::LibLoadingStrategy)  xpcom/glue/standalone/nsXPCOMGlue.cpp:403
0  firefox.exe  InitXPCOMGlue(mozilla::LibLoadingStrategy)  browser/app/nsBrowserApp.cpp:244
1  firefox.exe  NS_internal_main(int, char**, char**)  browser/app/nsBrowserApp.cpp:437
1  firefox.exe  wmain(int, wchar_t**)  toolkit/xre/nsWindowsWMain.cpp:151
2  firefox.exe  invoke_main()  /builds/worker/workspace/obj-build/browser/app/D:/a/_work/1/s/src/vctools/crt/vcstartup/src/startup/exe_common.inl:90
2  firefox.exe  __scrt_common_main_seh()  /builds/worker/workspace/obj-build/browser/app/D:/a/_work/1/s/src/vctools/crt/vcstartup/src/startup/exe_common.inl:288
3  kernel32.dll  BaseThreadInitThunk
4  ntdll.dll  RtlUserThreadStart

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 5 desktop browser crashes on Windows on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

bug 1897479 added the crash being hit here, so leaving a ni? for :yannis

Flags: needinfo?(yjuglaret)
Keywords: regression
Regressed by: 1897479

Set release status flags based on info from the regressing bug 1897479

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #0)

Observed for Firefox early betas (b1 - b6) for version 128 and 129 with a noticeable spike to the end of the early beta cycle (b5 and b6) - are there clients which don't update to late beta builds?

This is just because the patch :nika's identified as the cause has its relevant code scoped to EARLY_BETA_OR_EARLIER (via MOZ_DIAGNOSTIC_ASSERT_ENABLED).

Unfortunately it looks like these aren't just relabeled bug 1816848 crashes. We're trying to load the address of XRE_CheckBlockScopeStaticVarInit, but it's not found, and we die upon asserting that it exists. I suspect — knowing absolutely nothing about the relevant build steps, and just grepping the repo — that outside of Nightly we strip all exports from xul.dll except for those listed in libxul.symbols.

As an immediate minimal stupid fix to get the crashes not to happen, we could just convert that assertion to an if and skip the check if the function isn't exported. Alternatively, we could be slightly more clever, and do something like the following in toolkit/library/moz.build:

if CONFIG["EARLY_BETA_OR_EARLIER"] and CONFIG["OS_ARCH"] == "WINNT":
    SYMBOLS_FILE = "../libxul-early-beta.symbols"
else:
    SYMBOLS_FILE = "../libxul.symbols"

Pinging :glandium for confirmation about SYMBOLS_FILE, as well as an opinion on whether the "slightly more clever" solution is the problematic kind of clever.

Flags: needinfo?(yjuglaret) → needinfo?(mh+mozilla)

In beta and later we prevent xul.dll from exporting any symbols other
than those explicitly enumerated in libxul.symbols. In this case the
symbol load will fail.

Rather than asserting that it succeeds, check whether failure occurred.

Assignee: nobody → rkraesig
Status: NEW → ASSIGNED

Add XRE_CheckBlockScopeStaticVarInit to the set of exported symbols
when building for x64 Windows in early beta or earlier.

The two patches I've just attached are for the two approaches described in comment 4; at most one of them should be landed.

(Caveat lector: neither one has been tested yet, as I don't recall offhand how to build as early-beta.)

The --stage-changes approach should let you build central as early beta locally (docs, including how to do it on Try).

The affected code only runs in EARLY_BETA_OR_EARLIER

See https://firefox-source-docs.mozilla.org/setup/configuring_build_options.html#building-as-beta-or-release

that outside of Nightly we strip all exports from xul.dll except for those listed in libxul.symbols.

We're doing that on nightly too.

The code is behind MOZ_DIAGNOSTIC_ASSERT_ENABLED, why aren't debug builds affected? (MOZ_DIAGNOSTIC_ASSERT_ENABLED is early-beta or central-debug)

Flags: needinfo?(mh+mozilla)

That said, if you moved the function to the Bootstrap class instead, like the other XRE_* functions defined in Bootstrap.cpp/h, then you wouldn't have to care about exporting it.

(In reply to Mike Hommey [:glandium] from comment #12)

That said, if you moved the function to the Bootstrap class instead, like the other XRE_* functions defined in Bootstrap.cpp/h, then you wouldn't have to care about exporting it.

My understanding is that this diagnostic code was specifically intended to be run before XRE_GetBootstrap, due to the implicit TLS use in the static bool sBootstrapInitialized = false; line.

Also, now that I've done a bit of digging, I don't think this would fix it. See below.


(In reply to Mike Hommey [:glandium] from comment #10)

We're doing that on nightly too.

I no longer believe we're doing this at all. I've checked a copy of xul.dll from a known-affected build (129.0b6), and dumpbin.exe clearly shows XRE_CheckBlockScopeStaticVarInit as exported:

File Type: DLL

  Section contains the following exports for xul.dll

          [...]
          23 number of functions
          23 number of names

    ordinal hint RVA      name

          1    0 085FA000 ?WerNotifyProc@CrashReporter@@YAKPEAX@Z
          2    1 02EF1140 DumpCompleteHeap
          [...]
         11    A 00030F60 XRE_CheckBlockScopeStaticVarInit
         12    B 00030FD0 XRE_GetBootstrap
          [...]

... which sinks my theory from comment 4.

(In reply to Mike Hommey [:glandium] from comment #11)

The code is behind MOZ_DIAGNOSTIC_ASSERT_ENABLED, why aren't debug builds affected? (MOZ_DIAGNOSTIC_ASSERT_ENABLED is early-beta or central-debug)

Yeah. And the relatively low volume of this crash (even though it is a topcrash!) seems inconsistent with it affecting all early-beta builds.

What I now think is going on in this crash is that firefox.exe v129.0b6 is trying to load xul.dll v129.0b7, and is failing because the load procedure for the two is no longer identical: the former is looking for a function the latter doesn't have. This is consistent with concerns raised in bug 1816848 about incomplete partial installation. Unlike my previous hypothesis, this would actually explain the crash volume profile described in comment 0 (emphasis added):

Observed for Firefox early betas (b1 - b6) for version 128 and 129 with a noticeable spike to the end of the early beta cycle (b5 and b6) - are there clients which don't update to late beta builds?

The only clients that would be crashing here would be clients upgrading from early-beta to late-beta, which tracks with the above.

I don't know if we have or want any checks comparing the version number of xul.dll to firefox.exe, but I suspect the answers are "no" and "yes" respectively. More immediately, though, I'm withdrawing any consideration of mucking about with moz.build and SYMBOLS_FILE, because it won't have any effect.

Attachment #9418242 - Attachment is obsolete: true

Oh, right, symbols that are exported via code stay exported regardless of SYMBOLS_FILE.

I don't know if we have or want any checks comparing the version number of xul.dll to firefox.exe

That's something we have, actually, but obviously, that comes after GetBootstrap
https://searchfox.org/mozilla-central/rev/0c55d51c0d2a9b672e42ad40ea54f90267f92a8e/toolkit/xre/nsAppRunner.cpp#4100-4108

But all the versions involved are derived from https://hg.mozilla.org/releases/mozilla-beta/file/tip/config/milestone.txt which doesn't contain the beta number.

(In reply to Mike Hommey [:glandium] from comment #14)

That's something we have, actually, but obviously, that comes after GetBootstrap

That's not obvious at all, actually. (Well, other than because, if we did the following, this particular bug probably wouldn't exist...) After loading the DLL, but before calling XRE_Bootstrap, we could check the version number in the VS_FIXEDFILEINFO embedded in the DLL's resources and compare it to the one embedded in the executable. That number is different between beta versions (as you're well aware :-)).

We should probably also improve the error message for the existing version check — see bug 1911174 — but that's getting a little out of scope for this issue.

I support Ray's analysis, in particular these two points:

(In reply to Ray Kraesig [:rkraesig] from comment #13)

My understanding is that this diagnostic code was specifically intended to be run before XRE_GetBootstrap, due to the implicit TLS use in the static bool sBootstrapInitialized = false; line.

What I now think is going on in this crash is that firefox.exe v129.0b6 is trying to load xul.dll v129.0b7, and is failing because the load procedure for the two is no longer identical: the former is looking for a function the latter doesn't have.

We can see that mismatch in the example crash from comment 0 where indeed firefox.exe has version 129.0.0.1033 (129.0b6) and xul.dll has version 129.0.0.1321 (129.9b7). I didn't anticipate that such mismatch can occur when writing this code. Ray's proposal patch looks good to me to avoid crashing in this scenario.

:bytesized , is it expected that this mismatch can occur at some point during the update process? If not, do we already have a bug to track this mismatch issue? Thanks.

Flags: needinfo?(bytesized)

(In reply to Yannis Juglaret [:yannis] from comment #16)

:bytesized , is it expected that this mismatch can occur at some point during the update process? If not, do we already have a bug to track this mismatch issue? Thanks.

Sometimes update can fail in unfortunate ways and leave things in a bad state on disk, with a mixture of files of different versions and corrupt files. We attempt to rollback when we encounter failures, but sometimes the rollback fails too. The error codes involved tend to be "Access Denied" errors. I suspect that antivirus is involved, but it's hard to be sure since I have yet to hear from someone that can actually reproduce the problem.

I think there is probably a bug open for this somewhere, but I can't currently find it.

Flags: needinfo?(bytesized)
Attachment #9418241 - Attachment description: Bug 1909898 [1/1] - Skip checking initialization if check-function is not exported r?yjuglaret,nika → Bug 1909898 [1/1] - Skip checking initialization if check-function is not found r?yjuglaret,glandium

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash
Pushed by rkraesig@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9d83d63633d4 [1/1] - Skip checking initialization if check-function is not found r=yjuglaret
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 131 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: