Closed Bug 1757571 Opened 2 years ago Closed 2 years ago

Firefox 98 fails to draw a window on Linux aarch64

Categories

(Firefox :: General, defect)

Firefox 98
Desktop
Linux
defect

Tracking

RESOLVED FIXED
100 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox97 --- unaffected
firefox98 + wontfix
firefox99 + fixed
firefox100 + fixed

People

(Reporter: olivier, Assigned: glandium)

References

(Regression)

Details

(Keywords: regression, regressionwindow-wanted)

Attachments

(1 file)

I'm testing aarch64 builds of Firefox 98.0-1 (current release candidate) on a Raspberry Pi 4 running an Ubuntu 21.10 desktop (Wayland session). Specifically, I'm testing the RC builds of the Ubuntu packages that will go into impish-updates (from the ubuntu-mozilla-security PPA) and the snap package in the candidate channel.
In both cases, the application consistently fails to draw a window on screen. It sometimes draws a rect for the window, but nothing else is painted.

The builds for 97.0.1-1 (deb and snap) work fine in the same environment.

This issue appears to affect only aarch64 builds; other architectures are fine.

I went back and tested various beta builds, and the very first beta for 98.0 was already affected by the regression.

I am going to dig deeper to try and understand what is causing this regression.

The latest nightly build in https://launchpad.net/~ubuntu-mozilla-daily/+archive/ubuntu/ppa/+packages appears to be similarly affected.
Unfortunately Launchpad doesn't keep build artifacts around for too long, so I won't be able to use that PPA to bisect when the problem first happened.

QA Whiteboard: [qa-regression-triage]

Did you see any errors in the logs or crash reports? Do you mind providing a screenshot?

Flags: needinfo?(olivier)

It looks like we experience this bug in Playwright as well. Firefox M98 headless doesn't work when compiled for Ubuntu 20.04 aarch64.

Did you see any errors in the logs or crash reports?

No errors, it just hangs for us when opening a new page.

No errors in the logs, and no crash report. There is nothing to screenshot, really, as the window doesn't even paint.

X11 sessions are similarly affected, so it's not a Wayland-specific problem.

Flags: needinfo?(olivier)

I'm bisecting by way of building individual arm64 snaps of 98.0a1 for each revision of interest. This is a slow and tedious process.

Revision 9fd81f089c484ee7d36366d6363613fd2c0d57d9 built on 2022-01-20 15:26:35 exhibits the problem.

Revision a952c92c4dae45b199ab0370ad118cb83c360353 built on 2022-01-10 21:31:34 doesn't exhibit the problem.

Revision 60998033086a179f73edd702599f93ab75ff443e built on 2022-01-15 09:45:36 fails to build from source.

Revision a428d96af61ddf8d6c8b1bdcd37fb38ececd9db3 built on 2022-01-18 09:50:36 doesn't exhibit the problem.

Revision cc33400f0ff80f0eada6c3aa637f37d247a3ff46 built on 2022-01-19 21:47:18 exhibits the problem.

Revision 89aa2c8696b7b10a4e71f95d4a468171b92bb828 built on 2022-01-18 21:55:06 exhibits the problem.

Revision 7a69711e136447cbf2bcc2afeceda6cb7bc9155c built on 2022-01-18 12:48:04 doesn't exhibit the problem.

So the regression happened at some point between revision 7a69711e136447cbf2bcc2afeceda6cb7bc9155c and revision 89aa2c8696b7b10a4e71f95d4a468171b92bb828.
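For anyone repeating this kind of manual bisection, here is a minimal sketch of the search, written in Rust. It is only an illustration, not the tooling used above: the `Verdict` type and the `bisect` and `regression_window` functions are hypothetical names, and the sketch assumes the builds are ordered by build date, the first is good, and the last is bad. The verdicts are the ones reported in the comments above, with the revision that failed to build marked as skipped.

```rust
/// Outcome of testing one build. `Skip` covers revisions that fail to build.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Verdict {
    Good,
    Bad,
    Skip,
}

/// Narrows an ordered list of `len` builds down to (last good, first bad).
/// Assumes index 0 is known good and index `len - 1` is known bad.
fn bisect<F>(len: usize, mut test: F) -> (usize, usize)
where
    F: FnMut(usize) -> Verdict,
{
    assert!(len >= 2);
    let (mut good, mut bad) = (0, len - 1);
    while bad - good > 1 {
        let mid = good + (bad - good) / 2;
        // Try the midpoint first, then the other untested builds in the
        // window, so that unbuildable revisions can be stepped over.
        let candidates = (mid..bad).chain((good + 1..mid).rev());
        let mut progressed = false;
        for i in candidates {
            match test(i) {
                Verdict::Good => {
                    good = i;
                    progressed = true;
                    break;
                }
                Verdict::Bad => {
                    bad = i;
                    progressed = true;
                    break;
                }
                Verdict::Skip => continue,
            }
        }
        if !progressed {
            // Every build left in the window was skipped; stop here.
            break;
        }
    }
    (good, bad)
}

/// The builds reported in this bug, ordered by build date (hashes shortened).
const BUILDS: [(&str, Verdict); 7] = [
    ("a952c92c", Verdict::Good), // 2022-01-10
    ("60998033", Verdict::Skip), // 2022-01-15, fails to build
    ("a428d96a", Verdict::Good), // 2022-01-18 09:50
    ("7a69711e", Verdict::Good), // 2022-01-18 12:48
    ("89aa2c86", Verdict::Bad),  // 2022-01-18 21:55
    ("cc33400f", Verdict::Bad),  // 2022-01-19
    ("9fd81f08", Verdict::Bad),  // 2022-01-20
];

/// Returns the (last good, first bad) revisions for the builds above.
fn regression_window() -> (&'static str, &'static str) {
    let (good, bad) = bisect(BUILDS.len(), |i| BUILDS[i].1);
    (BUILDS[good].0, BUILDS[bad].0)
}
```

In practice the closure passed to `bisect` would build and launch the snap for the given revision, which is the slow part. The sketch may test a skipped revision more than once; caching verdicts would avoid that.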

I narrowed the regression window further. Last known good revision: e2dfad4efbedffbe2bd759a3ad977f287be15c96. First bad revision: 5541e31f93a192ade150c89e3380b94be180d164. So the regression was introduced by bug 1750646.

Regressed by: 1750646

Set release status flags based on info from the regressing bug 1750646

:glandium, since you are the author of the regressor, bug 1750646, could you take a look?
For more information, please visit auto_nag documentation.

Flags: needinfo?(mh+mozilla)

Can you narrow it further down to an individual commit, or even better, an individual crate?

Flags: needinfo?(mh+mozilla) → needinfo?(olivier)

I'll attempt to do that, yes.

Has Regression Range: --- → yes

I've finally narrowed it down to the upgrade of the crossbeam-* crates:

crossbeam-channel 0.5.1 -> 0.5.2
crossbeam-epoch 0.9.5 -> 0.9.6
crossbeam-utils 0.8.5 -> 0.8.6

I confirmed by rebuilding revision 5541e31f93a192ade150c89e3380b94be180d164 with those three crates reverted, and the problem is gone.

Flags: needinfo?(olivier)

(In reply to Olivier Tilloy from comment #18)

I've finally narrowed it down to the upgrade of the crossbeam-* crates:

crossbeam-channel 0.5.1 -> 0.5.2
crossbeam-epoch 0.9.5 -> 0.9.6
crossbeam-utils 0.8.5 -> 0.8.6

I confirmed by rebuilding revision 5541e31f93a192ade150c89e3380b94be180d164 with those three crates reverted, and the problem is gone.

Are all three culpable or is there any reason to believe it's only one or two of the three components?

(In reply to Arthur K. [He/Him] from comment #19)

Are all three culpable or is there any reason to believe it's only one or two of the three components?

I have no idea. Given the upstream release commit, I was under the impression that these three components are inter-dependent. But I have no understanding of how the crossbeam crates work, or how they're used in Firefox.

I can certainly attempt to bisect even further, but it would be helpful if someone who actually understands how this is used would take a look, because I'm shooting in the dark here.

For users of the firefox snap on arm64, note that the build for 98.0.1-2 (in the stable channel) carries a patch to revert the crossbeam updates, which restores functionality. This is of course a temporary measure, until we can identify a proper fix.

(In reply to Olivier Tilloy from comment #21)

For users of the firefox snap on arm64, note that the build for 98.0.1-2 (in the stable channel) carries a patch to revert the crossbeam updates, which restores functionality. This is of course a temporary measure, until we can identify a proper fix.

Thank you for the update, Olivier. I confirm the snap update solves it on my side
(Ubuntu 21.10 aarch64 on RPI4)

(In reply to Olivier Tilloy from comment #21)

For users of the firefox snap on arm64, note that the build for 98.0.1-2 (in the stable channel) carries a patch to revert the crossbeam updates, which restores functionality. This is of course a temporary measure, until we can identify a proper fix.

I reached out to a couple Rust folks associated with these specific components and pointed them here. We'll see if they reply.

I reported one of the duplicates of this bug. I can also confirm the new snap update solves the problem for me; in fact, I'm submitting this comment from a Firefox window on my RPI4 with Ubuntu 21.20

(In reply to raul.saavedra from comment #24)

from a Firefox window on my RPI4 with Ubuntu 21.20

Apologies for the typo; I meant 21.10.

Sounds like we've got an upstream workaround for now. With any luck we'll be able to get a proper crossbeam fix landed in time for 99.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #26)

Sounds like we've got an upstream workaround for now. With any luck we'll be able to get a proper crossbeam fix landed in time for 99.

Is there any place where the upstream issue/workaround could be tracked? I would be interested in the specifics of the workaround.

Olivier, is the crossbeam patch available somewhere? I'd need to apply it to Fedora (https://bugzilla.redhat.com/show_bug.cgi?id=2063961).
Thanks.

Flags: needinfo?(olivier)

I've isolated it to crossbeam-channel. I'm bisecting further down to individual changes.

(In reply to Mike Hommey [:glandium] from comment #29)

I've isolated it to crossbeam-channel. I'm bisecting further down to individual changes.

Mike,

From Taiki Endo, one of the crossbeam maintainers:

Hi, thanks for the report.

I believe the regression is from channel, as there are few changes to the existing implementation in epoch and utils during the relevant period.

And the changes to the channel implementation that occurred during the relevant period are as follows.

  • 1a1c9749
  • 7f3f6cf
  • a649cfc
  • 2c00d5b
  • 10009bd

7f3f6cf, a649cfc, and 2c00d5b seem very small and unrelated, so I think it is related to 1a1c974 or 10009bd.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #28)

Olivier, is the crossbeam patch available somewhere? I'd need to apply it to Fedora (https://bugzilla.redhat.com/show_bug.cgi?id=2063961).
Thanks.

Here: https://git.launchpad.net/~mozilla-snaps/+git/firefox-snap/tree/patches/revert-crossbeam-crates-upgrade.patch?h=stable

Flags: needinfo?(olivier)

(In reply to Arthur K. [He/Him] from comment #30)

I confirmed it's this one.

Where did that communication with upstream happen? I can't seem to find it in the upstream repo issues.

I'm working on a patch that is not about completely reverting the original change, FWIW.

Flags: needinfo?(thee.chicago.wolf)

This is enough to fix it, apparently.

diff --git a/crossbeam-channel/src/waker.rs b/crossbeam-channel/src/waker.rs
index dec73a9..e6d0b49 100644
--- a/crossbeam-channel/src/waker.rs
+++ b/crossbeam-channel/src/waker.rs
@@ -77,11 +77,13 @@ impl Waker {
     /// Attempts to find another thread's entry, select the operation, and wake it up.
     #[inline]
     pub(crate) fn try_select(&mut self) -> Option<Entry> {
+        let thread_id = current_thread_id();
+
         self.selectors
             .iter()
             .position(|selector| {
                 // Does the entry belong to a different thread?
-                selector.cx.thread_id() != current_thread_id()
+                selector.cx.thread_id() != thread_id
                     && selector // Try selecting this operation.
                         .cx
                         .try_select(Selected::Operation(selector.oper))

This suggests some race condition somewhere else, though.
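To make the shape of that change easier to see outside the diff, here is a minimal, self-contained sketch of the same pattern. It is not crossbeam's code: the `Selector` struct and `select_foreign` function are hypothetical stand-ins, and `std::thread::current().id()` is used in place of crossbeam's internal `current_thread_id()` helper. The only point it illustrates is reading the calling thread's ID once, before the iterator closure, rather than once per entry.

```rust
use std::thread::{self, ThreadId};

/// Hypothetical stand-in for a waker entry; not crossbeam's actual type.
#[derive(Debug)]
struct Selector {
    thread_id: ThreadId,
    oper: usize,
}

/// Removes and returns the first entry registered by a thread other than
/// the caller, or `None` if every entry belongs to the calling thread.
fn select_foreign(selectors: &mut Vec<Selector>) -> Option<Selector> {
    // Read the calling thread's ID once, outside the closure, instead of
    // looking it up again for every entry being examined.
    let me = thread::current().id();

    selectors
        .iter()
        .position(|selector| selector.thread_id != me)
        .map(|pos| selectors.remove(pos))
}
```

Both forms should select the same entry, so the sketch says nothing about why the per-entry lookup hangs on arm64 Linux.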

(In reply to Mike Hommey [:glandium] from comment #34)

This suggests some race condition somewhere else, though.

BTW, while not fully confirmed yet, disabling Rust LTO might "fix" it too.

(In reply to Mike Hommey [:glandium] from comment #33)

Where did that communication with upstream happen? I can't seem to find it in the upstream repo issues.

I'm working on a patch that is not about completely reverting the original change, FWIW.

If you're asking how I reached out to the crossbeam maintainer, it was via email. Do you need his contact info?

Flags: needinfo?(thee.chicago.wolf)

The severity field is not set for this bug.
:mossop, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(dtownsend)
Assignee: nobody → mh+mozilla
Status: NEW → ASSIGNED
Severity: -- → S2
Flags: needinfo?(dtownsend)

:glandium tomorrow we have the final beta build for 99.
Depending on the risk and when this lands, unsure if we can address this for 99?

Flags: needinfo?(mh+mozilla)

Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/5075188978be
Upgrade crossbeam-channel to 0.5.4. r=emilio

Comment on attachment 9268598 [details]
Bug 1757571 - Upgrade crossbeam-channel to 0.5.4.

Beta/Release Uplift Approval Request

  • User impact if declined: Startup deadlock on arm64 Linux
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The changes are mostly no-ops. There's an improvement in the handling of excessively large timeouts, and potentially a slight performance improvement from not reading a thread-local variable repeatedly when there are multiple selectors. This only papers over whatever is happening on arm64 Linux (see https://phabricator.services.mozilla.com/D141563#inline-780442)
  • String changes made/needed:

Flags: needinfo?(mh+mozilla)
Attachment #9268598 - Flags: approval-mozilla-beta?
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 100 Branch

Comment on attachment 9268598 [details]
Bug 1757571 - Upgrade crossbeam-channel to 0.5.4.

Approved for beta uplift, available on the beta channel with 99RC1. Thanks.

Attachment #9268598 - Flags: approval-mozilla-beta? → approval-mozilla-beta+