Closed Bug 2012009 Opened 2 months ago Closed 2 months ago

Crash in [@ firefox_on_glean::factory::create_and_register_metric]

Categories

(Firefox :: New Tab Page, defect)

Unspecified
Windows 11
defect

Tracking

()

VERIFIED FIXED
149 Branch
Tracking Status
firefox-esr140 --- unaffected
firefox147 + verified
firefox148 + verified
firefox149 + verified

People

(Reporter: mccr8, Assigned: mconley)

References

(Regression)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(8 files, 1 obsolete file)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
1.39 MB, application/x-xpinstall
Details
1.39 MB, application/x-xpinstall
Details

Crash report: https://crash-stats.mozilla.org/report/index/d039cffb-4b1c-489b-ad90-0a9c10260122

I saw this in the crash spike reporter report.

I guess time_unit is None?

        "timing_distribution" => {
            let meta = CommonMetricData {
                name,
                category,
                send_in_pings,
                lifetime,
                disabled,
                ..Default::default()
            };
            let metric = TimingDistributionMetric::new(BaseMetricId(metric_id), meta, time_unit.unwrap());

MOZ_CRASH Reason:

called `Option::unwrap()` on a `None` value

Top 10 frames:

0  xul.dll  MOZ_Crash(char const*, int, char const*)  mfbt/Assertions.h:375
0  xul.dll  RustMozCrash(char const*, int, char const*)  mozglue/static/rust/wrappers.cpp:18
1  xul.dll  mozglue_static::panic_hook(std::panic::PanicHookInfo*)  mozglue/static/rust/lib.rs:99
2  xul.dll  core::ops::function::Fn::call<void (*)(ref$<std::panic::PanicHookInfo>), tupl...  /rustc/1159e78c4747b02ef996e55082b704c09b970588/library/core/src/ops/function.rs:253
3  xul.dll  std::panic::PanicHookInfo::new()  library/alloc/src/boxed.rs:1985
3  xul.dll  std::panicking::rust_panic_with_hook()  library/std/src/panicking.rs:841
4  xul.dll  std::panicking::begin_panic_handler::closure$0()  library/std/src/panicking.rs:699
5  xul.dll  std::sys::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_h...  library/std/src/sys/backtrace.rs:174
6  xul.dll  std::panicking::begin_panic_handler()  library/std/src/panicking.rs:697
7  xul.dll  core::panicking::panic_fmt()  library/core/src/panicking.rs:75

[Tracking Requested - why for this release]: big crash spike on 147.0.1

Currently this is is around 9% of all crashes on 147.0.1.

Keywords: topcrash

RegisterRuntimeMetric is being called from JS. In 147, it looks like there are two call sites in JS.

See Also: → 2008850

mconley did something or other to hopefully stop this and said they'd look at it tomorrow so I'll assign them

Assignee: nobody → mconley
Component: Telemetry → New Tab Page
Product: Toolkit → Firefox

Looks like this was regressed by Bug 2008850 when it shipped to release in the New Tab 149.1.20260121.51415 train hop.
The New Tab 149.1.20260121.51415 train hop has been rolled back for now to allow for investigation.
Setting 148 and 149 to affected since it will need a fix before the next train hop

Keywords: regression
Regressed by: 2008850
See Also: 2008850

Set release status flags based on info from the regressing bug 2008850

STR:

  1. Download the 149.1.20260121.51415 newtab train-hop XPI locally somewhere on the file system.
  2. In a debug build of Firefox 147.0.1 (either a local build or a Mozilla build, it doesn’t matter), install the Nimbus DevTools add-on.
  3. Click on the add-on button, and choose “nimbus-devtools” to open the add-on UI tab.
  4. Choose “Experiment Browser” on the lefthand side.
  5. Find “New Tab 148.0.20251211.63751 to Release” in the list, and in the dropdown to the right of it, choose “rollout”, and then under the “Actions” dropdown, choose “Force Enroll”
  6. Restart the browser.
  7. Visit about:support and ensure that New Tab add-on version 148.0.20251211.63751 is installed.
    1. If it’s not that version, it probably didn’t download in time before you restarted. Wait a few more seconds, and restart again. Repeat until you see it listed in about:support.
  8. Restart the browser a few times to get the 148 NewTabGleanUtils written to the startup cache
    1. Do this maybe 2 or 3 times.
  9. Open about:config
  10. Find the browser.newtabpage.trainhopAddon.version pref and set it to the string any
  11. Visit about:addons, and click on the gear icon
  12. Choose “Install Add-on From File…”
  13. Select the 149.1.20260121.51415 newtab XPI downloaded in the first step.
  14. Restart the browser

ER:

No crash.

AR:

The browser crashes.

Notably, the subsequent start of the browser does not appear to crash, which makes me suspect that in the wild, this manifests as a single crash that doesn't repeat.

These STR support my hypothesis for what's happening here:

This crash makes sense if our module loading is getting out of order, and we read in NewTabGleanUtils before the XPI has registered the XPI version of that module. Then we’d be running the Firefox 147 version of NewTabGleanUtils, which does not include this patch for adding timing_distribution to the extraArgs allow list.

So I think it is likely that this is another manifestation of this bug, where the ScriptPreloader was loading an old version of a module because the URIs matched.

Here’s how this plays out:

  1. 147 client runs with the prior train-hop XPI, and NewTabGleanUtils exists in the ScriptPreloader bytecode cache because it's been read a few times during startup.
  2. The 149 XPI is downloaded and installed. Train-hop XPI initialization runs. When AboutNewTabResourceMapping reaches the point of registering the metrics, it loads NewTabGleanUtils via resource://newtab/lib/NewTabGleanUtils.sys.mjs, but accesses it from the ScriptPreloader cache instead of the XPI.
  3. The bytecode for the underlying NewTabGleanUtils executes, and because it is missing the timing_distribution entry in the extraArgs allow list, the registerMetricIfNeeded method in the bytecode version of NewTabGleanUtils does not pass extraArgs to Services.fog.registerRuntimeMetric.
  4. At this point, the crash is predictable - timing_distribution metrics must include the extra args parameter.

I think this scenario also explains why it wasn't caught by our CI - our CI jobs take the XPI from main and install it on a bare copy of Release. They don't first install a pre-existing copy of any active train-hops.

And it also might explain why our QA team didn't hit this. Presumably, the QA team never got into a scenario where the pre-existing install of Firefox Release they were testing had the 148.0.20251211.63751 train-hop's NewTabGleanUtils in its script cache.

System add-ons can be installed in the profile directory as .xpi JAR files that
are updated independently of the application. Prior to this patch, when these JARs
are updated independently from the rest of the application, the ScriptPreloader
and StartupCache can serve stale bytecode compiled from the previous version of
the JAR, causing the old code to execute even though the JAR has been replaced.

This fix here adds some defense-in-depth by checking if a script's JAR file is
omni.ja before using cached bytecode. The check is applied at multiple
levels: in mozJSModuleLoader and mozJSSubScriptLoader (which prevents both
ScriptPreloader and StartupCache from being consulted), in ScriptPreloader's
GetCachedStencil (which protects other callers like nsFrameMessageManager),
and by passing URIs to GetCachedStencil where they were previously omitted.
Scripts from omni.ja continue to benefit from bytecode caching as before.

System add-ons can be installed in the profile directory as .xpi JAR files that
are updated independently of the application. Prior to this patch, when these JARs
are updated independently from the rest of the application, the ScriptPreloader
and StartupCache can serve stale bytecode compiled from the previous version of
the JAR, causing the old code to execute even though the JAR has been replaced.

This fix here adds a helper to checking if a script's JAR file is in one of
the omni.ja files before using cached bytecode.

Scripts from the omni.ja files continue to benefit from bytecode caching as
before.

Original Revision: https://phabricator.services.mozilla.com/D280333

Attachment #9540469 - Flags: approval-mozilla-beta?

firefox-beta Uplift Approval Request

  • User impact if declined: Clients cannot receive newtab train-hop XPI's without potentially running out-of-date versions of the parent process scripts.
  • Code covered by automated testing: yes
  • Fix verified in Nightly: no
  • Needs manual QE test: no
  • Steps to reproduce for manual QE testing:
  • Risk associated with taking this patch: low
  • Explanation of risk level: Ted and I have conferred, and we both think this is low risk and well understood. If there's risk here, it's that it might potentially regression some startup performance benchmarks if we happen to have gotten some of our rules wrong in terms of which URIs can be cached and which can't (though we're fairly certain we haven't).
  • String changes made/needed: None.
  • Is Android affected?: yes
Attachment #9540470 - Flags: approval-mozilla-beta?
Pushed by mconley@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/25dbc84378e4 https://hg.mozilla.org/integration/autoland/rev/6b9bd409d517 Part 1: Don't retrieve scripts from non omni.ja JAR files from the ScriptPreloader or StartupCache. r=tcampbell https://github.com/mozilla-firefox/firefox/commit/74f72acbbb92 https://hg.mozilla.org/integration/autoland/rev/9e06a08664d7 Part 2: Add a regression test for ScriptPreloader. r=tcampbell
Pushed by amarc@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/903cacf5967d https://hg.mozilla.org/integration/autoland/rev/a6a37223ca24 Revert "Bug 2012009 - Part 2: Add a regression test for ScriptPreloader. r=tcampbell" for causing marionette failures @ test_preloader_cache_bypass

Backed out for causing marionette failures

Flags: needinfo?(mconley)
Attachment #9540470 - Attachment is obsolete: true
Attachment #9540470 - Flags: approval-mozilla-beta?
Pushed by mconley@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/0e7692507cc5 https://hg.mozilla.org/integration/autoland/rev/58e81c362b8f Part 1: Don't retrieve scripts from non omni.ja JAR files from the ScriptPreloader or StartupCache. r=tcampbell https://github.com/mozilla-firefox/firefox/commit/4fee41448ba8 https://hg.mozilla.org/integration/autoland/rev/94794eabf67c Part 2: Add a regression test for ScriptPreloader. r=tcampbell

Here's the green try push that I was waiting for before pulling the lando lever: https://treeherder.mozilla.org/jobs?repo=try&revision=15709e3b9d819770ee70ce90d80bc897ef54dda3

Flags: needinfo?(mconley)

firefox-beta Uplift Approval Request

  • User impact if declined: Clients cannot receive newtab train-hop XPI's without potentially running out-of-date versions of the parent process scripts.
  • Code covered by automated testing: yes
  • Fix verified in Nightly: no
  • Needs manual QE test: no
  • Steps to reproduce for manual QE testing: N/A
  • Risk associated with taking this patch: low
  • Explanation of risk level: Ted and I have conferred, and we both think this is low risk and well understood. If there's risk here, it's that it might potentially regression some startup performance benchmarks if we happen to have gotten some of our rules wrong in terms of which URIs can be cached and which can't (though we're fairly certain we haven't).
  • String changes made/needed: None.
  • Is Android affected?: no
Attachment #9540634 - Flags: approval-mozilla-beta?

Is Android affected?: yes

I want to qualify this in that the bug exists in platform code, which means that Android may be affected, but we have no evidence to suggest that anybody has hit anything like this on Android.

Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 149 Branch
Attachment #9540634 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9540469 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

System add-ons can be installed in the profile directory as .xpi JAR files that
are updated independently of the application. Prior to this patch, when these JARs
are updated independently from the rest of the application, the ScriptPreloader
and StartupCache can serve stale bytecode compiled from the previous version of
the JAR, causing the old code to execute even though the JAR has been replaced.

This fix here adds a helper to checking if a script's JAR file is in one of
the omni.ja files before using cached bytecode.

Scripts from the omni.ja files continue to benefit from bytecode caching as
before.

Original Revision: https://phabricator.services.mozilla.com/D280333

Attachment #9541260 - Flags: approval-mozilla-release?

firefox-release Uplift Approval Request

  • User impact if declined: Clients cannot receive newtab train-hop XPI's without potentially running out-of-date versions of the parent process scripts.
  • Code covered by automated testing: yes
  • Fix verified in Nightly: no
  • Needs manual QE test: yes
  • Steps to reproduce for manual QE testing: See the steps at https://gist.github.com/mikeconley/653f52f653a686c9bd8b9c83af8ec64d, steps 1-15
  • Risk associated with taking this patch: low
  • Explanation of risk level: Ted and I have conferred, and we both think this is low risk and well understood. If there's risk here, it's that it might potentially regression some startup performance benchmarks if we happen to have gotten some of our rules wrong in terms of which URIs can be cached and which can't (though we're fairly certain we haven't).
  • String changes made/needed: None.
  • Is Android affected?: no
Flags: qe-verify+

firefox-release Uplift Approval Request

  • User impact if declined: Clients cannot receive newtab train-hop XPI's without potentially running out-of-date versions of the parent process scripts.
  • Code covered by automated testing: yes
  • Fix verified in Nightly: no
  • Needs manual QE test: yes
  • Steps to reproduce for manual QE testing: See the steps at https://gist.github.com/mikeconley/653f52f653a686c9bd8b9c83af8ec64d, steps 1-15
  • Risk associated with taking this patch: low
  • Explanation of risk level: Ted and I have conferred, and we both think this is low risk and well understood. If there's risk here, it's that it might potentially regression some startup performance benchmarks if we happen to have gotten some of our rules wrong in terms of which URIs can be cached and which can't (though we're fairly certain we haven't).
  • String changes made/needed: None.
  • Is Android affected?: no
Attachment #9541268 - Flags: approval-mozilla-release?
Flags: in-testsuite+
Attachment #9541260 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #9541268 - Flags: approval-mozilla-release? → approval-mozilla-release+
QA Whiteboard: [qa-triage-done-c149/b148]

QA testing procedure to reproduce the crash on builds without the fix:

Reproducing the crash on a Nightly or Beta (unbranded or dev edition!) that doesn't have the crash fix:

  1. Launch the browser
  2. Disable automatic updates
  3. Visit about:config
  4. Set xpinstall.signatures.required to false
  5. Add a new string pref browser.newtabpage.trainhopAddon.version set to the string any
  6. Install newtab-crash-test-part-1.xpi
  7. Restart the browser
  8. Visit about:support and ensure that New Tab version 999.0.0 is installed
  9. Restart the browser several times, waiting a few seconds in between.
  10. Install newtab-crash-test-part-2.xpi
  11. Restart the browser

This should be enough to crash on a build without the crash fix

We managed to reproduce the issue on all desktop operating systems (Windows 10 x64, Windows 11, macOS 15.6.1 and Linux Mint 22.2) using older builds of Firefox Beta 148 and Firefox Nightly 149 (without the crash fix), following the steps described in Comment 30, with only one exception: the browser needs to be fully closed and reopened to trigger the crash.
I can confirm that the issue is no longer reproducible (the crash doesn’t occur) on the builds with the crash fix.

  • Tested versions:
    • Firefox Release 147.0.3 (Build ID: 20260203002554)
    • Firefox Devedition 148.0b10 (Build ID: 20260202090533)
    • Firefox Nightly 149.0a1 (BuildID: 20260203210813)
  • Tested platforms:
    • Windows 10 x64
    • Windows 11
    • macOS 15.6.1
    • Linux Mint 22.2
  • Tested addon versions:

Based on this, I am marking this issue as Verified - Fixed.

Flags: qe-verify+
Regressions: 2015206
Regressions: 2016120
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: