Crash in [@ firefox_on_glean::factory::create_and_register_metric]
Categories
(Firefox :: New Tab Page, defect)
Tracking
()
People
(Reporter: mccr8, Assigned: mconley)
References
(Regression)
Details
(Keywords: crash, regression, topcrash)
Crash Data
Attachments
(8 files, 1 obsolete file)
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
phab-bot
:
approval-mozilla-beta+
|
Details | Review |
|
48 bytes,
text/x-phabricator-request
|
phab-bot
:
approval-mozilla-beta+
|
Details | Review |
|
48 bytes,
text/x-phabricator-request
|
phab-bot
:
approval-mozilla-release+
|
Details | Review |
|
48 bytes,
text/x-phabricator-request
|
phab-bot
:
approval-mozilla-release+
|
Details | Review |
|
1.39 MB,
application/x-xpinstall
|
Details | |
|
1.39 MB,
application/x-xpinstall
|
Details |
Crash report: https://crash-stats.mozilla.org/report/index/d039cffb-4b1c-489b-ad90-0a9c10260122
I saw this in the crash spike reporter report.
I guess time_unit is None?
"timing_distribution" => {
let meta = CommonMetricData {
name,
category,
send_in_pings,
lifetime,
disabled,
..Default::default()
};
let metric = TimingDistributionMetric::new(BaseMetricId(metric_id), meta, time_unit.unwrap());
MOZ_CRASH Reason:
called `Option::unwrap()` on a `None` value
Top 10 frames:
0 xul.dll MOZ_Crash(char const*, int, char const*) mfbt/Assertions.h:375
0 xul.dll RustMozCrash(char const*, int, char const*) mozglue/static/rust/wrappers.cpp:18
1 xul.dll mozglue_static::panic_hook(std::panic::PanicHookInfo*) mozglue/static/rust/lib.rs:99
2 xul.dll core::ops::function::Fn::call<void (*)(ref$<std::panic::PanicHookInfo>), tupl... /rustc/1159e78c4747b02ef996e55082b704c09b970588/library/core/src/ops/function.rs:253
3 xul.dll std::panic::PanicHookInfo::new() library/alloc/src/boxed.rs:1985
3 xul.dll std::panicking::rust_panic_with_hook() library/std/src/panicking.rs:841
4 xul.dll std::panicking::begin_panic_handler::closure$0() library/std/src/panicking.rs:699
5 xul.dll std::sys::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_h... library/std/src/sys/backtrace.rs:174
6 xul.dll std::panicking::begin_panic_handler() library/std/src/panicking.rs:697
7 xul.dll core::panicking::panic_fmt() library/core/src/panicking.rs:75
[Tracking Requested - why for this release]: big crash spike on 147.0.1
| Reporter | ||
Comment 1•2 months ago
|
||
Currently this is is around 9% of all crashes on 147.0.1.
| Reporter | ||
Comment 2•2 months ago
|
||
RegisterRuntimeMetric is being called from JS. In 147, it looks like there are two call sites in JS.
| Reporter | ||
Comment 3•2 months ago
|
||
mconley did something or other to hopefully stop this and said they'd look at it tomorrow so I'll assign them
Comment 4•2 months ago
|
||
Looks like this was regressed by Bug 2008850 when it shipped to release in the New Tab 149.1.20260121.51415 train hop.
The New Tab 149.1.20260121.51415 train hop has been rolled back for now to allow for investigation.
Setting 148 and 149 to affected since it will need a fix before the next train hop
Updated•2 months ago
|
Comment 5•2 months ago
|
||
Set release status flags based on info from the regressing bug 2008850
| Assignee | ||
Comment 6•2 months ago
|
||
STR:
- Download the 149.1.20260121.51415 newtab train-hop XPI locally somewhere on the file system.
- In a debug build of Firefox 147.0.1 (either a local build or a Mozilla build, it doesn’t matter), install the Nimbus DevTools add-on.
- Click on the add-on button, and choose “nimbus-devtools” to open the add-on UI tab.
- Choose “Experiment Browser” on the lefthand side.
- Find “New Tab 148.0.20251211.63751 to Release” in the list, and in the dropdown to the right of it, choose “rollout”, and then under the “Actions” dropdown, choose “Force Enroll”
- Restart the browser.
- Visit about:support and ensure that New Tab add-on version 148.0.20251211.63751 is installed.
- If it’s not that version, it probably didn’t download in time before you restarted. Wait a few more seconds, and restart again. Repeat until you see it listed in about:support.
- Restart the browser a few times to get the 148 NewTabGleanUtils written to the startup cache
- Do this maybe 2 or 3 times.
- Open about:config
- Find the
browser.newtabpage.trainhopAddon.versionpref and set it to the stringany - Visit about:addons, and click on the gear icon
- Choose “Install Add-on From File…”
- Select the 149.1.20260121.51415 newtab XPI downloaded in the first step.
- Restart the browser
ER:
No crash.
AR:
The browser crashes.
Notably, the subsequent start of the browser does not appear to crash, which makes me suspect that in the wild, this manifests as a single crash that doesn't repeat.
| Assignee | ||
Comment 7•2 months ago
|
||
These STR support my hypothesis for what's happening here:
This crash makes sense if our module loading is getting out of order, and we read in NewTabGleanUtils before the XPI has registered the XPI version of that module. Then we’d be running the Firefox 147 version of NewTabGleanUtils, which does not include this patch for adding timing_distribution to the extraArgs allow list.
So I think it is likely that this is another manifestation of this bug, where the ScriptPreloader was loading an old version of a module because the URIs matched.
Here’s how this plays out:
- 147 client runs with the prior train-hop XPI, and
NewTabGleanUtilsexists in the ScriptPreloader bytecode cache because it's been read a few times during startup. - The 149 XPI is downloaded and installed. Train-hop XPI initialization runs. When
AboutNewTabResourceMappingreaches the point of registering the metrics, it loadsNewTabGleanUtilsviaresource://newtab/lib/NewTabGleanUtils.sys.mjs, but accesses it from the ScriptPreloader cache instead of the XPI. - The bytecode for the underlying
NewTabGleanUtilsexecutes, and because it is missing thetiming_distributionentry in the extraArgs allow list, theregisterMetricIfNeededmethod in the bytecode version ofNewTabGleanUtilsdoes not passextraArgstoServices.fog.registerRuntimeMetric. - At this point, the crash is predictable -
timing_distributionmetrics must include the extra args parameter.
| Assignee | ||
Comment 8•2 months ago
|
||
I think this scenario also explains why it wasn't caught by our CI - our CI jobs take the XPI from main and install it on a bare copy of Release. They don't first install a pre-existing copy of any active train-hops.
And it also might explain why our QA team didn't hit this. Presumably, the QA team never got into a scenario where the pre-existing install of Firefox Release they were testing had the 148.0.20251211.63751 train-hop's NewTabGleanUtils in its script cache.
| Assignee | ||
Comment 9•2 months ago
|
||
System add-ons can be installed in the profile directory as .xpi JAR files that
are updated independently of the application. Prior to this patch, when these JARs
are updated independently from the rest of the application, the ScriptPreloader
and StartupCache can serve stale bytecode compiled from the previous version of
the JAR, causing the old code to execute even though the JAR has been replaced.
This fix here adds some defense-in-depth by checking if a script's JAR file is
omni.ja before using cached bytecode. The check is applied at multiple
levels: in mozJSModuleLoader and mozJSSubScriptLoader (which prevents both
ScriptPreloader and StartupCache from being consulted), in ScriptPreloader's
GetCachedStencil (which protects other callers like nsFrameMessageManager),
and by passing URIs to GetCachedStencil where they were previously omitted.
Scripts from omni.ja continue to benefit from bytecode caching as before.
| Assignee | ||
Comment 10•2 months ago
|
||
| Assignee | ||
Comment 11•2 months ago
|
||
System add-ons can be installed in the profile directory as .xpi JAR files that
are updated independently of the application. Prior to this patch, when these JARs
are updated independently from the rest of the application, the ScriptPreloader
and StartupCache can serve stale bytecode compiled from the previous version of
the JAR, causing the old code to execute even though the JAR has been replaced.
This fix here adds a helper to checking if a script's JAR file is in one of
the omni.ja files before using cached bytecode.
Scripts from the omni.ja files continue to benefit from bytecode caching as
before.
Original Revision: https://phabricator.services.mozilla.com/D280333
Updated•2 months ago
|
Comment 12•2 months ago
|
||
firefox-beta Uplift Approval Request
- User impact if declined: Clients cannot receive newtab train-hop XPI's without potentially running out-of-date versions of the parent process scripts.
- Code covered by automated testing: yes
- Fix verified in Nightly: no
- Needs manual QE test: no
- Steps to reproduce for manual QE testing:
- Risk associated with taking this patch: low
- Explanation of risk level: Ted and I have conferred, and we both think this is low risk and well understood. If there's risk here, it's that it might potentially regression some startup performance benchmarks if we happen to have gotten some of our rules wrong in terms of which URIs can be cached and which can't (though we're fairly certain we haven't).
- String changes made/needed: None.
- Is Android affected?: yes
| Assignee | ||
Comment 13•2 months ago
|
||
Original Revision: https://phabricator.services.mozilla.com/D280334
Comment 14•2 months ago
|
||
Comment 15•2 months ago
|
||
Updated•2 months ago
|
| Assignee | ||
Comment 17•2 months ago
|
||
Try push with fix for Windows paths:
https://treeherder.mozilla.org/jobs?repo=try&landoCommitID=175402
Comment 18•2 months ago
|
||
| Assignee | ||
Comment 19•2 months ago
|
||
Here's the green try push that I was waiting for before pulling the lando lever: https://treeherder.mozilla.org/jobs?repo=try&revision=15709e3b9d819770ee70ce90d80bc897ef54dda3
Comment 20•2 months ago
|
||
firefox-beta Uplift Approval Request
- User impact if declined: Clients cannot receive newtab train-hop XPI's without potentially running out-of-date versions of the parent process scripts.
- Code covered by automated testing: yes
- Fix verified in Nightly: no
- Needs manual QE test: no
- Steps to reproduce for manual QE testing: N/A
- Risk associated with taking this patch: low
- Explanation of risk level: Ted and I have conferred, and we both think this is low risk and well understood. If there's risk here, it's that it might potentially regression some startup performance benchmarks if we happen to have gotten some of our rules wrong in terms of which URIs can be cached and which can't (though we're fairly certain we haven't).
- String changes made/needed: None.
- Is Android affected?: no
| Assignee | ||
Comment 21•2 months ago
|
||
Original Revision: https://phabricator.services.mozilla.com/D280334
| Assignee | ||
Comment 22•2 months ago
|
||
Is Android affected?: yes
I want to qualify this in that the bug exists in platform code, which means that Android may be affected, but we have no evidence to suggest that anybody has hit anything like this on Android.
Comment 23•2 months ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/58e81c362b8f
https://hg.mozilla.org/mozilla-central/rev/94794eabf67c
Updated•2 months ago
|
Updated•2 months ago
|
Updated•2 months ago
|
Comment 24•2 months ago
|
||
| uplift | ||
| Assignee | ||
Comment 25•2 months ago
|
||
System add-ons can be installed in the profile directory as .xpi JAR files that
are updated independently of the application. Prior to this patch, when these JARs
are updated independently from the rest of the application, the ScriptPreloader
and StartupCache can serve stale bytecode compiled from the previous version of
the JAR, causing the old code to execute even though the JAR has been replaced.
This fix here adds a helper to checking if a script's JAR file is in one of
the omni.ja files before using cached bytecode.
Scripts from the omni.ja files continue to benefit from bytecode caching as
before.
Original Revision: https://phabricator.services.mozilla.com/D280333
Updated•2 months ago
|
Comment 26•2 months ago
|
||
firefox-release Uplift Approval Request
- User impact if declined: Clients cannot receive newtab train-hop XPI's without potentially running out-of-date versions of the parent process scripts.
- Code covered by automated testing: yes
- Fix verified in Nightly: no
- Needs manual QE test: yes
- Steps to reproduce for manual QE testing: See the steps at https://gist.github.com/mikeconley/653f52f653a686c9bd8b9c83af8ec64d, steps 1-15
- Risk associated with taking this patch: low
- Explanation of risk level: Ted and I have conferred, and we both think this is low risk and well understood. If there's risk here, it's that it might potentially regression some startup performance benchmarks if we happen to have gotten some of our rules wrong in terms of which URIs can be cached and which can't (though we're fairly certain we haven't).
- String changes made/needed: None.
- Is Android affected?: no
Comment 27•2 months ago
|
||
firefox-release Uplift Approval Request
- User impact if declined: Clients cannot receive newtab train-hop XPI's without potentially running out-of-date versions of the parent process scripts.
- Code covered by automated testing: yes
- Fix verified in Nightly: no
- Needs manual QE test: yes
- Steps to reproduce for manual QE testing: See the steps at https://gist.github.com/mikeconley/653f52f653a686c9bd8b9c83af8ec64d, steps 1-15
- Risk associated with taking this patch: low
- Explanation of risk level: Ted and I have conferred, and we both think this is low risk and well understood. If there's risk here, it's that it might potentially regression some startup performance benchmarks if we happen to have gotten some of our rules wrong in terms of which URIs can be cached and which can't (though we're fairly certain we haven't).
- String changes made/needed: None.
- Is Android affected?: no
| Assignee | ||
Comment 28•2 months ago
|
||
Original Revision: https://phabricator.services.mozilla.com/D280334
Updated•2 months ago
|
Updated•2 months ago
|
Updated•2 months ago
|
Updated•2 months ago
|
Comment 29•2 months ago
|
||
| uplift | ||
Updated•2 months ago
|
| Assignee | ||
Comment 30•2 months ago
|
||
QA testing procedure to reproduce the crash on builds without the fix:
Reproducing the crash on a Nightly or Beta (unbranded or dev edition!) that doesn't have the crash fix:
- Launch the browser
- Disable automatic updates
- Visit about:config
- Set
xpinstall.signatures.requiredtofalse - Add a new string pref
browser.newtabpage.trainhopAddon.versionset to the stringany - Install newtab-crash-test-part-1.xpi
- Restart the browser
- Visit about:support and ensure that New Tab version 999.0.0 is installed
- Restart the browser several times, waiting a few seconds in between.
- Install newtab-crash-test-part-2.xpi
- Restart the browser
This should be enough to crash on a build without the crash fix
| Assignee | ||
Comment 31•2 months ago
|
||
Comment 32•2 months ago
|
||
We managed to reproduce the issue on all desktop operating systems (Windows 10 x64, Windows 11, macOS 15.6.1 and Linux Mint 22.2) using older builds of Firefox Beta 148 and Firefox Nightly 149 (without the crash fix), following the steps described in Comment 30, with only one exception: the browser needs to be fully closed and reopened to trigger the crash.
I can confirm that the issue is no longer reproducible (the crash doesn’t occur) on the builds with the crash fix.
- Tested versions:
- Firefox Release 147.0.3 (Build ID: 20260203002554)
- Firefox Devedition 148.0b10 (Build ID: 20260202090533)
- Firefox Nightly 149.0a1 (BuildID: 20260203210813)
- Tested platforms:
- Windows 10 x64
- Windows 11
- macOS 15.6.1
- Linux Mint 22.2
- Tested addon versions:
- Firefox Beta 148.0b10 and Firefox Nightly 149.0a1: used the .xpi files provided in Comment 30 (Addon Version: 999.0.0 and 999.1.0)
- Firefox Release 147.0.3: Addon version 148.0.20251211.63751 (currently Live on the Prod server) and 149.1.20260121.51415 (Currently Live on the Stage server)
Based on this, I am marking this issue as Verified - Fixed.
Updated•2 months ago
|
Description
•