Crash in [@ mozilla::GetBootstrap]
Categories
(Core :: XPCOM, defect)
Tracking
()
People
(Reporter: aryx, Assigned: rkraesig)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(1 file, 1 obsolete file)
Bug 1909898 [1/1] - Skip checking initialization if check-function is not found r?yjuglaret,glandium
48 bytes,
text/x-phabricator-request
|
Details | Review |
Observed for Firefox early betas (b1 - b6) for version 128 and 129 with a noticeable spike to the end of the early beta cycle (b5 and b6) - are there clients which don't update to late beta builds?
See bug 1835048 for a fixed bug with the same signature.
Crash report: https://crash-stats.mozilla.org/report/index/bca9d736-3fe4-44b5-9f7a-9ff570240725
Reason: EXCEPTION_BREAKPOINT
Top 8 frames:
0 firefox.exe mozilla::GetBootstrap(char const*, mozilla::LibLoadingStrategy) xpcom/glue/standalone/nsXPCOMGlue.cpp:403
0 firefox.exe InitXPCOMGlue(mozilla::LibLoadingStrategy) browser/app/nsBrowserApp.cpp:244
1 firefox.exe NS_internal_main(int, char**, char**) browser/app/nsBrowserApp.cpp:437
1 firefox.exe wmain(int, wchar_t**) toolkit/xre/nsWindowsWMain.cpp:151
2 firefox.exe invoke_main() /builds/worker/workspace/obj-build/browser/app/D:/a/_work/1/s/src/vctools/crt/vcstartup/src/startup/exe_common.inl:90
2 firefox.exe __scrt_common_main_seh() /builds/worker/workspace/obj-build/browser/app/D:/a/_work/1/s/src/vctools/crt/vcstartup/src/startup/exe_common.inl:288
3 kernel32.dll BaseThreadInitThunk
4 ntdll.dll RtlUserThreadStart
Comment 1•4 months ago
|
||
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on beta
- Top 5 desktop browser crashes on Windows on beta
For more information, please visit BugBot documentation.
Comment 2•3 months ago
|
||
bug 1897479 added the crash being hit here, so leaving a ni? for :yannis
Comment 3•3 months ago
|
||
Set release status flags based on info from the regressing bug 1897479
Assignee | ||
Comment 4•3 months ago
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #0)
Observed for Firefox early betas (b1 - b6) for version 128 and 129 with a noticeable spike to the end of the early beta cycle (b5 and b6) - are there clients which don't update to late beta builds?
This is just because the patch :nika's identified as the cause has its relevant code scoped to EARLY_BETA_OR_EARLIER
(via MOZ_DIAGNOSTIC_ASSERT_ENABLED
).
Unfortunately it looks like these aren't just relabeled bug 1816848 crashes. We're trying to load the address of XRE_CheckBlockScopeStaticVarInit
, but it's not found, and we die upon asserting that it exists. I suspect — knowing absolutely nothing about the relevant build steps, and just grepping the repo — that outside of Nightly we strip all exports from xul.dll
except for those listed in libxul.symbols
.
As an immediate minimal stupid fix to get the crashes not to happen, we could just convert that assertion to an if
and skip the check if the function isn't exported. Alternatively, we could be slightly more clever, and do something like the following in toolkit/library/moz.build
:
if CONFIG["EARLY_BETA_OR_EARLIER"] and CONFIG["OS_ARCH"] == "WINNT":
SYMBOLS_FILE = "../libxul-early-beta.symbols"
else:
SYMBOLS_FILE = "../libxul.symbols"
Pinging :glandium for confirmation about SYMBOLS_FILE
, as well as an opinion on whether the "slightly more clever" solution is the problematic kind of clever.
Assignee | ||
Comment 5•3 months ago
|
||
In beta and later we prevent xul.dll
from exporting any symbols other
than those explicitly enumerated in libxul.symbols
. In this case the
symbol load will fail.
Rather than asserting that it succeeds, check whether failure occurred.
Updated•3 months ago
|
Assignee | ||
Comment 6•3 months ago
|
||
Add XRE_CheckBlockScopeStaticVarInit
to the set of exported symbols
when building for x64 Windows in early beta or earlier.
Assignee | ||
Comment 7•3 months ago
|
||
The two patches I've just attached are for the two approaches described in comment 4; at most one of them should be landed.
(Caveat lector: neither one has been tested yet, as I don't recall offhand how to build as early-beta.)
Reporter | ||
Comment 8•3 months ago
|
||
The --stage-changes
approach should let you build central as early beta locally (docs, including how to do it on Try).
Comment 9•3 months ago
|
||
The affected code only runs in EARLY_BETA_OR_EARLIER
Comment 10•3 months ago
|
||
that outside of Nightly we strip all exports from xul.dll except for those listed in libxul.symbols.
We're doing that on nightly too.
Comment 11•3 months ago
|
||
The code is behind MOZ_DIAGNOSTIC_ASSERT_ENABLED, why aren't debug builds affected? (MOZ_DIAGNOSTIC_ASSERT_ENABLED is early-beta or central-debug)
Comment 12•3 months ago
•
|
||
That said, if you moved the function to the Bootstrap class instead, like the other XRE_* functions defined in Bootstrap.cpp/h, then you wouldn't have to care about exporting it.
Assignee | ||
Comment 13•3 months ago
•
|
||
(In reply to Mike Hommey [:glandium] from comment #12)
That said, if you moved the function to the Bootstrap class instead, like the other XRE_* functions defined in Bootstrap.cpp/h, then you wouldn't have to care about exporting it.
My understanding is that this diagnostic code was specifically intended to be run before XRE_GetBootstrap
, due to the implicit TLS use in the static bool sBootstrapInitialized = false;
line.
Also, now that I've done a bit of digging, I don't think this would fix it. See below.
(In reply to Mike Hommey [:glandium] from comment #10)
We're doing that on nightly too.
I no longer believe we're doing this at all. I've checked a copy of xul.dll
from a known-affected build (129.0b6), and dumpbin.exe
clearly shows XRE_CheckBlockScopeStaticVarInit
as exported:
File Type: DLL
Section contains the following exports for xul.dll
[...]
23 number of functions
23 number of names
ordinal hint RVA name
1 0 085FA000 ?WerNotifyProc@CrashReporter@@YAKPEAX@Z
2 1 02EF1140 DumpCompleteHeap
[...]
11 A 00030F60 XRE_CheckBlockScopeStaticVarInit
12 B 00030FD0 XRE_GetBootstrap
[...]
... which sinks my theory from comment 4.
(In reply to Mike Hommey [:glandium] from comment #11)
The code is behind MOZ_DIAGNOSTIC_ASSERT_ENABLED, why aren't debug builds affected? (MOZ_DIAGNOSTIC_ASSERT_ENABLED is early-beta or central-debug)
Yeah. And the relatively low volume of this crash (even though it is a topcrash!) seems inconsistent with it affecting all early-beta builds.
What I now think is going on in this crash is that firefox.exe
v129.0b6 is trying to load xul.dll
v129.0b7, and is failing because the load procedure for the two is no longer identical: the former is looking for a function the latter doesn't have. This is consistent with concerns raised in bug 1816848 about incomplete partial installation. Unlike my previous hypothesis, this would actually explain the crash volume profile described in comment 0 (emphasis added):
Observed for Firefox early betas (b1 - b6) for version 128 and 129 with a noticeable spike to the end of the early beta cycle (b5 and b6) - are there clients which don't update to late beta builds?
The only clients that would be crashing here would be clients upgrading from early-beta to late-beta, which tracks with the above.
I don't know if we have or want any checks comparing the version number of xul.dll
to firefox.exe
, but I suspect the answers are "no" and "yes" respectively. More immediately, though, I'm withdrawing any consideration of mucking about with moz.build
and SYMBOLS_FILE
, because it won't have any effect.
Updated•3 months ago
|
Comment 14•3 months ago
|
||
Oh, right, symbols that are exported via code stay exported regardless of SYMBOLS_FILE.
I don't know if we have or want any checks comparing the version number of xul.dll to firefox.exe
That's something we have, actually, but obviously, that comes after GetBootstrap
https://searchfox.org/mozilla-central/rev/0c55d51c0d2a9b672e42ad40ea54f90267f92a8e/toolkit/xre/nsAppRunner.cpp#4100-4108
But all the versions involved are derived from https://hg.mozilla.org/releases/mozilla-beta/file/tip/config/milestone.txt which doesn't contain the beta number.
Assignee | ||
Comment 15•3 months ago
|
||
(In reply to Mike Hommey [:glandium] from comment #14)
That's something we have, actually, but obviously, that comes after GetBootstrap
That's not obvious at all, actually. (Well, other than because, if we did the following, this particular bug probably wouldn't exist...) After loading the DLL, but before calling XRE_Bootstrap
, we could check the version number in the VS_FIXEDFILEINFO
embedded in the DLL's resources and compare it to the one embedded in the executable. That number is different between beta versions (as you're well aware :-)).
We should probably also improve the error message for the existing version check — see bug 1911174 — but that's getting a little out of scope for this issue.
Comment 16•3 months ago
•
|
||
I support Ray's analysis, in particular these two points:
(In reply to Ray Kraesig [:rkraesig] from comment #13)
My understanding is that this diagnostic code was specifically intended to be run before
XRE_GetBootstrap
, due to the implicit TLS use in thestatic bool sBootstrapInitialized = false;
line.What I now think is going on in this crash is that
firefox.exe
v129.0b6 is trying to loadxul.dll
v129.0b7, and is failing because the load procedure for the two is no longer identical: the former is looking for a function the latter doesn't have.
We can see that mismatch in the example crash from comment 0 where indeed firefox.exe
has version 129.0.0.1033
(129.0b6) and xul.dll
has version 129.0.0.1321
(129.9b7). I didn't anticipate that such mismatch can occur when writing this code. Ray's proposal patch looks good to me to avoid crashing in this scenario.
:bytesized , is it expected that this mismatch can occur at some point during the update process? If not, do we already have a bug to track this mismatch issue? Thanks.
Comment 17•3 months ago
|
||
(In reply to Yannis Juglaret [:yannis] from comment #16)
:bytesized , is it expected that this mismatch can occur at some point during the update process? If not, do we already have a bug to track this mismatch issue? Thanks.
Sometimes update can fail in unfortunate ways and leave things in a bad state on disk, with a mixture of files of different versions and corrupt files. We attempt to rollback when we encounter failures, but sometimes the rollback fails too. The error codes involved tend to be "Access Denied" errors. I suspect that antivirus is involved, but it's hard to be sure since I have yet to hear from someone that can actually reproduce the problem.
I think there is probably a bug open for this somewhere, but I can't currently find it.
Updated•3 months ago
|
Comment 18•3 months ago
|
||
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 19•3 months ago
|
||
Updated•3 months ago
|
Comment 20•3 months ago
|
||
bugherder |
Description
•