Closed Bug 1823915 Opened 2 years ago Closed 2 years ago

[Wayland] Startup hang when a primary password is set

Categories

(Thunderbird :: General, defect)

Thunderbird 113
Unspecified
Linux
defect

Tracking

(thunderbird_esr102 unaffected)

RESOLVED FIXED
113 Branch
Tracking Status
thunderbird_esr102 --- unaffected

People

(Reporter: rjl, Assigned: emilio)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: hang, regression, regressionwindow-wanted)

Attachments

(3 files)

I've personally seen this on Thunderbird 112.0b2 and 113.0a1 (2023-03-22) on Linux x64.

STR:

  • Create a new profile; set up a mail account, save the password. Do not set a primary password yet.

  • Close and restart Thunderbird a few times to verify normal startup.

  • Go to Settings->Privacy, enable and set a Primary Password. There will be a confirmation dialog.

  • Close Thunderbird.

  • Start Thunderbird, enter primary password when prompted. This launch works as expected (I think?)

  • Close and restart Thunderbird. Enter password. The startup will hang this time.

  • Kill the Thunderbird process with kill <pid>

  • Start Thunderbird, enter password. This startup should work normally.

Expected:

  • No startup freezes

Actual:

  • Every other startup freezes and needs to be killed. This has been consistent for me at least since 112.0b1, possibly a little longer.

Other notes:

  • I always start Thunderbird with the "Select Profile" dialog. (Uncheck the button that says "always use this profile"). I have not tested without that dialog being displayed.

Christian, does this reproduce for you?

Flags: needinfo?(chriechers)

The hang happens even after a reboot. Interesting...

(In reply to Wayne Mery (:wsmwk) from comment #1)

Christian, does this reproduce for you?

Yes, this happens frequently here too with an existing profile, and a master password set. As a workaround I start the beta from a shell with the -P switch, so I just need to press Ctrl-C to kill it.
The next attempt is successful eventually, but not always.

Flags: needinfo?(chriechers)

I saw this once or twice like a week ago.
But it stopped happening briefly after, so I had assumed something else might have fixed it.

I cannot reproduce (linux) using current comm-beta nor current comm-central, neither using existing profile nor fresh profile using the STR given here.

I disabled Wayland last week. Could that be related?

It looks like Wayland is a likely factor. I've re-enabled Wayland, rebooted, and can immediately reproduce again, hang on every other startup.

Emilio, Martin, do you think any recent Wayland-related change could have caused this bug?

It might be useful to know that Thunderbird has special code (in xpfe) that forces bringing up the Primary Password prompt early, earlier than bringing up the main application window. We only see the hang if that early prompt is enabled.

I've looked at some of the recent commits on comm-central that mentioned Wayland and attempted to back them out locally, but so far I haven't found the cause of the regression.

Flags: needinfo?(stransky)

FYI, I've already tried backing out bug 1821618, bug 1798360, bug 1820204, bug 1818530, bug 1813407, but none of that helped.

Can you get a backtrace using gdb or something, or a crash report from kill -11 <thunderbird-pid>? Without it it's hard to what might be going on.

Flags: needinfo?(kaie)

You can run on terminal with WAYLAND_DEBUG=1 env variable and see what's going on.
And/or gdb to frozen TB and attach backtrace here.

Flags: needinfo?(stransky)

I just submitted crash report 79c8110a-399a-8d14-9d96-df39e169af6e

I attached with gdb to a stuck locally built debug process, but I cannot see anything interesting. All threads appear to be idle and waiting for work.

I've tried WAYLAND_DEBUG=1 and will attach two logs, good and bad.

I couldn't yet spot significant differences.
There's one block of events, involving "zwp_linux_dmabuf_v1@39.create_params" which executes slightly earlier in the good scenario.

And after execution is stuck in the bad scenario, the good scenario continues with a second block running xgm_wm_base@27.get_xdg_surface

The bad scenario ends with wl_compositor@4.create_surface as the last event.

Flags: needinfo?(kaie)
Attached file log-bad2
Attached file log-good2

I am running Thunderbird with Wayland, so that definitely checks out from here. If I run with MOZ_ENABLE_WAYLAND=0 set in my environment, Thunderbird does not hang.

OS: Unspecified → Linux
Summary: Startup hang when a primary password is set → [Wayland] Startup hang when a primary password is set

Can you try to use mozregression to find broken commit? Or is that present since beginning?

You may run it with MOZ_LOG="WidgetWayland:5" to see where it's blocked in our Wayland backend.

I reported Bug 1824670 that looks like this one, I'm also on Wayland, but I can start Thunderbird with thunderbird --new-instance.

Duplicate of this bug: 1824670

(In reply to Martin Stránský [:stransky] (ni? me) from comment #17)

You may run it with MOZ_LOG="WidgetWayland:5" to see where it's blocked in our Wayland backend.

That doesn't give much information.

In both scenarios (main window comes up or doesn't come up after password entered),
the log output starts with:

[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_surface_create_locked [7ff2fe854800]
[Parent 7759: Main Thread]: D/WidgetWayland     gtk wl_surface 7ff2e32d0d80 ID 40
[Parent 7759: Main Thread]: D/WidgetWayland     created frame callback ID 59
[Parent 7759: Main Thread]: D/WidgetWayland     created surface 7ff2deb159c0 ID 49
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_invalidate [7ff2fe854800]
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_invalidate [7ff2fe854800]
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_frame_callback_handler [7ff2fe854800] frame_callback_handler 7ff2deb15b00 ready_to_draw 0 (set to true) initial_draw callback 2

If it fails to come up, no other log follows.

If it works, the log continues with:

[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_surface_create_locked [7ff2da010000]
[Parent 7759: Main Thread]: D/WidgetWayland     gtk wl_surface 7ff2d742bba0 ID 58
[Parent 7759: Main Thread]: D/WidgetWayland     guessing subsurface position 26 23
[Parent 7759: Main Thread]: D/WidgetWayland     created frame callback ID 66
[Parent 7759: Main Thread]: D/WidgetWayland     created surface 7ff2d4a864c0 ID 47
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_invalidate [7ff2da010000]
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_frame_callback_handler [7ff2da010000] frame_callback_handler 7ff2d4a86560 ready_to_draw 0 (set to true) initial_draw callback 1

(In reply to Martin Stránský [:stransky] (ni? me) from comment #16)

Can you try to use mozregression to find broken commit? Or is that present since beginning?

I haven't ever used mozregression.
Maybe someone else could help with that?

(In reply to Kai Engert (:KaiE:) from comment #14)

I found my crash report (kill -11):
https://crash-stats.mozilla.org/report/index/317ae00b-2a3c-4b7e-b9da-33ab00230323

  • primary Intel, secondary Nvidia
  • "status": "blocklisted:FEATURE_FAILURE_SOFTWARE_GL"

Do you have the proprietary Nvidia driver installed?
Can the problem be fixed by uninstalling it or by starting thunderbird with
$ __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json path/to/thunderbird?

(In reply to Kai Engert (:KaiE:) from comment #21)

I haven't ever used mozregression.
Maybe someone else could help with that?

For example:
$ pip3 install mozregression
$ MOZ_ENABLE_WAYLAND=1 ~/.local/bin/mozregression --app thunderbird --launch 2023-03-22
$ MOZ_ENABLE_WAYLAND=1 ~/.local/bin/mozregression --app thunderbird --good 2022-03-01 --bad 2023-03-22 --profile-persistence reuse
$ MOZ_ENABLE_WAYLAND=1 ~/.local/bin/mozregression --app thunderbird --good 2022-03-01 --bad 2023-03-22 --profile-persistence reuse -p ~/specific_thunderbirdtestprofile

(In reply to Darkspirit from comment #22)

Do you have the proprietary Nvidia driver installed?

I only have a nvidia-gpu-firmware package installed.

Can the problem be fixed by uninstalling it or by starting thunderbird with
$ __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json path/to/thunderbird?

Using that command line doesn't help.

It never fails to start up when running under mozregression.

If I manually extract the exact same archive from the persistence directory, and use --profile ~/specific_thunderbirdtestprofile
it fails every second time.

Even if I don't keep extracted directory, and extract the archive every time before running, it still fails only every second time, so it cannot be something stored inside the binary directory.

I think that means mozregression performs some additional global cleanup that resets the environment to the working state, so that we always get the working startup from attempt 1.

What kind of cleanup could that be?

I used manual running (without mozgression).

Latest working build:
https://archive.mozilla.org/pub/thunderbird/nightly/2023/03/2023-03-02-10-18-45-comm-central/thunderbird-112.0a1.en-US.linux-x86_64.tar.bz2
INFO: application_buildid: 20230302101845
INFO: application_changeset: 0a9a9457e26bd3fd01635d701a168bb82c32f4f2

First broken build:
https://archive.mozilla.org/pub/thunderbird/nightly/2023/03/2023-03-04-10-28-19-comm-central/thunderbird-112.0a1.en-US.linux-x86_64.tar.bz2
INFO: application_buildid: 20230304102819
INFO: application_changeset: ea0cf52a17885a137b106fd038114f23c648b394

Build 2023-03-03 wasn't found by mozgression (which I used for downloading).

Oldest comm-central treeherder job that still worked:

https://treeherder.mozilla.org/jobs?repo=comm-central&revision=0a9a9457e26bd3fd01635d701a168bb82c32f4f2

IIUC the regression was introduced by one of the following comm-central treeherder jobs:

https://treeherder.mozilla.org/jobs?repo=comm-central&revision=9c3f6d8cb3413f4b797ee3c2074a7316f1fa48ba
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=82720e5726e5c460f094390e66a9fe886cbb4c99
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=b4f412c985679b206e3d4aa9e923b53677ac544e
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=002ee849c3d0403ad7b11940e6379501c6c47ddf
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=587e783e17a003d5f1d9764ace3265ef64e41161
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=b4118276e3515491a8a652359e999efd59429adf
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=ea0cf52a17885a137b106fd038114f23c648b394

... or by one of the following mozillacentral jobs:

https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=9d6a3eb520acb919d4b9d249698e92a40b6fb592
https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=ba36dea109e7851e4c392dc57e6c7aada3790fbf
https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=5eb81f0156a80fd113309e1a4671c134d0bc4615
https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=fa3134749ef096303e069eb5e38469258e521c0e
https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=a4e2776e87186593dfcd3984fd61df789f5e51be
https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=7642d50032fb242a3377f4a0381e9da02a8bac64

(In reply to Kai Engert (:KaiE:) from comment #25)

Latest working build:

INFO: application_changeset: 0a9a9457e26bd3fd01635d701a168bb82c32f4f2

First broken build:

INFO: application_changeset: ea0cf52a17885a137b106fd038114f23c648b394

= https://hg.mozilla.org/comm-central/pushloghtml?fromchange=0a9a9457e26bd3fd01635d701a168bb82c32f4f2&tochange=ea0cf52a17885a137b106fd038114f23c648b394


bug 1821256 could also be related.

There is bug 1826129 which has been closed as duplicate of bug 1787182.

It's NOT caused by bug 1747722 (that was my wild guess, but backing out locally doesn't help).

I found the cause: bug 1819664

Building the latest c-c ea0cf52a17885a137b106fd038114f23c648b394 from 2023-03-04 with
m-c a4e2776e87186593dfcd3984fd61df789f5e51be from 2023-03-03 still works.

Then I applied the following three patches from 1819664:
https://hg.mozilla.org/mozilla-central/rev/d24bb2cea16d4a9499b491ee5dbbd5e98a499a2d
https://hg.mozilla.org/mozilla-central/rev/d709b8c58a4e35a71f99401f56e8e80db08f7a34
https://hg.mozilla.org/mozilla-central/rev/5a8d0e47ebd50928473d59353589358d55dcc731

That's broken.

Regressed by: 1819664

Latest mozilla-central and comm-central,
backing out those 3 patches fixes this bug.

(In reply to Kai Engert (:KaiE:) from comment #32)

The culprit is
https://hg.mozilla.org/mozilla-central/rev/d24bb2cea16d4a9499b491ee5dbbd5e98a499a2d

Emilio, do you understand why this code causes the hang?
Do you have an idea how this could be fixed?
Thanks in advance

Flags: needinfo?(emilio)

So https://hg.mozilla.org/mozilla-central/rev/d24bb2cea16d4a9499b491ee5dbbd5e98a499a2d#l2.53 is supposed to prevent this. The only way I see this could happen is if Thunderbird continuously adds l10n mutations so that we never unblock the load event, or a fluent promise never resolving... I can debug it more if needed, but if you could check for those things it'd be great.

Flags: needinfo?(emilio) → needinfo?(kaie)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #34)

The only way I see this could happen is if Thunderbird continuously adds l10n mutations so that we never unblock the load event, or a fluent promise never resolving...

I don't know how this might happen.

FYI, when bringing up the initial password prompt, we're here:
https://searchfox.org/mozilla-central/source/toolkit/xre/nsXREDirProvider.cpp#765-788

This is very early during startup, prior to opening up any other app windows.

Could this disturb the usual l10n flow that you're expecting to happen?

Flags: needinfo?(kaie)

It's also very strange this happens on exactly every second application start.

Okay let me take a look.

Flags: needinfo?(emilio)

I can't repro super reliably but the patch above should help if my understanding of the situation is correct. I could repro sometimes before the flush but never afterwards, but it'd be good if you could verify.

Flags: needinfo?(emilio) → needinfo?(kaie)

Consider the following sequence of events:

  • We do an l10n mutation -> mBlockingLoad=true, mPendingMutations.Size() = 1
  • That flush triggers a promise -> mBlockingLoad=true, mPendingMutations.Size() = 0, mPendingPromises=1
  • Another mutation happens -> mBlockingLoad=true, mPendingMutations.Size() = 1, mPendingPromises = 1
  • The promise resolves -> mBlockingLoad=true, mPendingMutations.Size() = 1, mPendingPromises=0

Key part there is that the second mutation didn't post a task to flush,
so we rely on the refresh driver (which is not reliable on Wayland as
described in the comment).

Instead, we need to track the pending flush separately from the blocking
load.

Assignee: nobody → emilio
Status: NEW → ASSIGNED

Emilio, great work, no hangs for me with this patch.

Flags: needinfo?(kaie)
Pushed by ealvarez@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/952b96ee3909 Ensure that we have a pending flush when blocking load. r=smaug
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 113 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: