[Wayland] Startup hang when a primary password is set
Categories
(Thunderbird :: General, defect)
Tracking
(thunderbird_esr102 unaffected)
Tracking | Status | |
---|---|---|
thunderbird_esr102 | --- | unaffected |
People
(Reporter: rjl, Assigned: emilio)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: hang, regression, regressionwindow-wanted)
Attachments
(3 files)
I've personally seen this on Thunderbird 112.0b2 and 113.0a1 (2023-03-22) on Linux x64.
STR:
-
Create a new profile; set up a mail account, save the password. Do not set a primary password yet.
-
Close and restart Thunderbird a few times to verify normal startup.
-
Go to Settings->Privacy, enable and set a Primary Password. There will be a confirmation dialog.
-
Close Thunderbird.
-
Start Thunderbird, enter primary password when prompted. This launch works as expected (I think?)
-
Close and restart Thunderbird. Enter password. The startup will hang this time.
-
Kill the Thunderbird process with
kill <pid>
-
Start Thunderbird, enter password. This startup should work normally.
Expected:
- No startup freezes
Actual:
- Every other startup freezes and needs to be killed. This has been consistent for me at least since 112.0b1, possibly a little longer.
Other notes:
- I always start Thunderbird with the "Select Profile" dialog. (Uncheck the button that says "always use this profile"). I have not tested without that dialog being displayed.
Reporter | ||
Comment 2•2 years ago
|
||
The hang happens even after a reboot. Interesting...
Comment 3•2 years ago
•
|
||
(In reply to Wayne Mery (:wsmwk) from comment #1)
Christian, does this reproduce for you?
Yes, this happens frequently here too with an existing profile, and a master password set. As a workaround I start the beta from a shell with the -P switch, so I just need to press Ctrl-C to kill it.
The next attempt is successful eventually, but not always.
Comment 4•2 years ago
|
||
I saw this once or twice like a week ago.
But it stopped happening briefly after, so I had assumed something else might have fixed it.
I cannot reproduce (linux) using current comm-beta nor current comm-central, neither using existing profile nor fresh profile using the STR given here.
Comment 5•2 years ago
|
||
I disabled Wayland last week. Could that be related?
Comment 6•2 years ago
•
|
||
It looks like Wayland is a likely factor. I've re-enabled Wayland, rebooted, and can immediately reproduce again, hang on every other startup.
Comment 7•2 years ago
|
||
Emilio, Martin, do you think any recent Wayland-related change could have caused this bug?
It might be useful to know that Thunderbird has special code (in xpfe) that forces bringing up the Primary Password prompt early, earlier than bringing up the main application window. We only see the hang if that early prompt is enabled.
I've looked at some of the recent commits on comm-central that mentioned Wayland and attempted to back them out locally, but so far I haven't found the cause of the regression.
Comment 8•2 years ago
|
||
FYI, I've already tried backing out bug 1821618, bug 1798360, bug 1820204, bug 1818530, bug 1813407, but none of that helped.
Assignee | ||
Comment 9•2 years ago
|
||
Can you get a backtrace using gdb or something, or a crash report from kill -11 <thunderbird-pid>
? Without it it's hard to what might be going on.
Comment 10•2 years ago
|
||
You can run on terminal with WAYLAND_DEBUG=1 env variable and see what's going on.
And/or gdb to frozen TB and attach backtrace here.
Comment 11•2 years ago
|
||
I just submitted crash report 79c8110a-399a-8d14-9d96-df39e169af6e
I attached with gdb to a stuck locally built debug process, but I cannot see anything interesting. All threads appear to be idle and waiting for work.
I've tried WAYLAND_DEBUG=1 and will attach two logs, good and bad.
I couldn't yet spot significant differences.
There's one block of events, involving "zwp_linux_dmabuf_v1@39.create_params" which executes slightly earlier in the good scenario.
And after execution is stuck in the bad scenario, the good scenario continues with a second block running xgm_wm_base@27.get_xdg_surface
The bad scenario ends with wl_compositor@4.create_surface as the last event.
Comment 12•2 years ago
|
||
Comment 13•2 years ago
|
||
Comment 14•2 years ago
|
||
I found my crash report (kill -11):
https://crash-stats.mozilla.org/report/index/317ae00b-2a3c-4b7e-b9da-33ab00230323
Reporter | ||
Comment 15•2 years ago
|
||
I am running Thunderbird with Wayland, so that definitely checks out from here. If I run with MOZ_ENABLE_WAYLAND=0 set in my environment, Thunderbird does not hang.
Updated•2 years ago
|
Comment 16•2 years ago
|
||
Can you try to use mozregression to find broken commit? Or is that present since beginning?
Comment 17•2 years ago
|
||
You may run it with MOZ_LOG="WidgetWayland:5" to see where it's blocked in our Wayland backend.
Comment 18•2 years ago
|
||
I reported Bug 1824670 that looks like this one, I'm also on Wayland, but I can start Thunderbird with thunderbird --new-instance
.
Comment 20•2 years ago
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #17)
You may run it with MOZ_LOG="WidgetWayland:5" to see where it's blocked in our Wayland backend.
That doesn't give much information.
In both scenarios (main window comes up or doesn't come up after password entered),
the log output starts with:
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_surface_create_locked [7ff2fe854800]
[Parent 7759: Main Thread]: D/WidgetWayland gtk wl_surface 7ff2e32d0d80 ID 40
[Parent 7759: Main Thread]: D/WidgetWayland created frame callback ID 59
[Parent 7759: Main Thread]: D/WidgetWayland created surface 7ff2deb159c0 ID 49
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_invalidate [7ff2fe854800]
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_invalidate [7ff2fe854800]
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_frame_callback_handler [7ff2fe854800] frame_callback_handler 7ff2deb15b00 ready_to_draw 0 (set to true) initial_draw callback 2
If it fails to come up, no other log follows.
If it works, the log continues with:
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_surface_create_locked [7ff2da010000]
[Parent 7759: Main Thread]: D/WidgetWayland gtk wl_surface 7ff2d742bba0 ID 58
[Parent 7759: Main Thread]: D/WidgetWayland guessing subsurface position 26 23
[Parent 7759: Main Thread]: D/WidgetWayland created frame callback ID 66
[Parent 7759: Main Thread]: D/WidgetWayland created surface 7ff2d4a864c0 ID 47
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_invalidate [7ff2da010000]
[Parent 7759: Main Thread]: D/WidgetWayland moz_container_wayland_frame_callback_handler [7ff2da010000] frame_callback_handler 7ff2d4a86560 ready_to_draw 0 (set to true) initial_draw callback 1
Comment 21•2 years ago
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #16)
Can you try to use mozregression to find broken commit? Or is that present since beginning?
I haven't ever used mozregression.
Maybe someone else could help with that?
Comment 22•2 years ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #14)
I found my crash report (kill -11):
https://crash-stats.mozilla.org/report/index/317ae00b-2a3c-4b7e-b9da-33ab00230323
- primary Intel, secondary Nvidia
- "status": "blocklisted:FEATURE_FAILURE_SOFTWARE_GL"
Do you have the proprietary Nvidia driver installed?
Can the problem be fixed by uninstalling it or by starting thunderbird with
$ __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json path/to/thunderbird
?
(In reply to Kai Engert (:KaiE:) from comment #21)
I haven't ever used mozregression.
Maybe someone else could help with that?
For example:
$ pip3 install mozregression
$ MOZ_ENABLE_WAYLAND=1 ~/.local/bin/mozregression --app thunderbird --launch 2023-03-22
$ MOZ_ENABLE_WAYLAND=1 ~/.local/bin/mozregression --app thunderbird --good 2022-03-01 --bad 2023-03-22 --profile-persistence reuse
$ MOZ_ENABLE_WAYLAND=1 ~/.local/bin/mozregression --app thunderbird --good 2022-03-01 --bad 2023-03-22 --profile-persistence reuse -p ~/specific_thunderbirdtestprofile
Comment 23•2 years ago
|
||
(In reply to Darkspirit from comment #22)
Do you have the proprietary Nvidia driver installed?
I only have a nvidia-gpu-firmware package installed.
Can the problem be fixed by uninstalling it or by starting thunderbird with
$ __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json path/to/thunderbird
?
Using that command line doesn't help.
Comment 24•2 years ago
|
||
It never fails to start up when running under mozregression.
If I manually extract the exact same archive from the persistence directory, and use --profile ~/specific_thunderbirdtestprofile
it fails every second time.
Even if I don't keep extracted directory, and extract the archive every time before running, it still fails only every second time, so it cannot be something stored inside the binary directory.
I think that means mozregression performs some additional global cleanup that resets the environment to the working state, so that we always get the working startup from attempt 1.
What kind of cleanup could that be?
Comment 25•2 years ago
|
||
I used manual running (without mozgression).
Latest working build:
https://archive.mozilla.org/pub/thunderbird/nightly/2023/03/2023-03-02-10-18-45-comm-central/thunderbird-112.0a1.en-US.linux-x86_64.tar.bz2
INFO: application_buildid: 20230302101845
INFO: application_changeset: 0a9a9457e26bd3fd01635d701a168bb82c32f4f2
First broken build:
https://archive.mozilla.org/pub/thunderbird/nightly/2023/03/2023-03-04-10-28-19-comm-central/thunderbird-112.0a1.en-US.linux-x86_64.tar.bz2
INFO: application_buildid: 20230304102819
INFO: application_changeset: ea0cf52a17885a137b106fd038114f23c648b394
Build 2023-03-03 wasn't found by mozgression (which I used for downloading).
Comment 26•2 years ago
|
||
Oldest comm-central treeherder job that still worked:
IIUC the regression was introduced by one of the following comm-central treeherder jobs:
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=9c3f6d8cb3413f4b797ee3c2074a7316f1fa48ba
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=82720e5726e5c460f094390e66a9fe886cbb4c99
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=b4f412c985679b206e3d4aa9e923b53677ac544e
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=002ee849c3d0403ad7b11940e6379501c6c47ddf
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=587e783e17a003d5f1d9764ace3265ef64e41161
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=b4118276e3515491a8a652359e999efd59429adf
https://treeherder.mozilla.org/jobs?repo=comm-central&revision=ea0cf52a17885a137b106fd038114f23c648b394
... or by one of the following mozillacentral jobs:
https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=9d6a3eb520acb919d4b9d249698e92a40b6fb592
https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=ba36dea109e7851e4c392dc57e6c7aada3790fbf
https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=5eb81f0156a80fd113309e1a4671c134d0bc4615
https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=fa3134749ef096303e069eb5e38469258e521c0e
https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=a4e2776e87186593dfcd3984fd61df789f5e51be
https://treeherder.mozilla.org/jobs?repo=mozilla-central&revision=7642d50032fb242a3377f4a0381e9da02a8bac64
Comment 27•2 years ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #25)
Latest working build:
INFO: application_changeset: 0a9a9457e26bd3fd01635d701a168bb82c32f4f2
First broken build:
INFO: application_changeset: ea0cf52a17885a137b106fd038114f23c648b394
bug 1821256 could also be related.
Comment 28•2 years ago
|
||
There is bug 1826129 which has been closed as duplicate of bug 1787182.
Comment 29•2 years ago
|
||
It's NOT caused by bug 1747722 (that was my wild guess, but backing out locally doesn't help).
Comment 30•2 years ago
|
||
I found the cause: bug 1819664
Building the latest c-c ea0cf52a17885a137b106fd038114f23c648b394 from 2023-03-04 with
m-c a4e2776e87186593dfcd3984fd61df789f5e51be from 2023-03-03 still works.
Then I applied the following three patches from 1819664:
https://hg.mozilla.org/mozilla-central/rev/d24bb2cea16d4a9499b491ee5dbbd5e98a499a2d
https://hg.mozilla.org/mozilla-central/rev/d709b8c58a4e35a71f99401f56e8e80db08f7a34
https://hg.mozilla.org/mozilla-central/rev/5a8d0e47ebd50928473d59353589358d55dcc731
That's broken.
Comment 31•2 years ago
|
||
Latest mozilla-central and comm-central,
backing out those 3 patches fixes this bug.
Comment 32•2 years ago
|
||
With https://hg.mozilla.org/mozilla-central/rev/5a8d0e47ebd50928473d59353589358d55dcc731 applied:
Still works.
With https://hg.mozilla.org/mozilla-central/rev/d709b8c58a4e35a71f99401f56e8e80db08f7a34 applied:
Still works.
The culprit is
https://hg.mozilla.org/mozilla-central/rev/d24bb2cea16d4a9499b491ee5dbbd5e98a499a2d
Comment 33•2 years ago
|
||
(In reply to Kai Engert (:KaiE:) from comment #32)
The culprit is
https://hg.mozilla.org/mozilla-central/rev/d24bb2cea16d4a9499b491ee5dbbd5e98a499a2d
Emilio, do you understand why this code causes the hang?
Do you have an idea how this could be fixed?
Thanks in advance
Assignee | ||
Comment 34•2 years ago
|
||
So https://hg.mozilla.org/mozilla-central/rev/d24bb2cea16d4a9499b491ee5dbbd5e98a499a2d#l2.53 is supposed to prevent this. The only way I see this could happen is if Thunderbird continuously adds l10n mutations so that we never unblock the load event, or a fluent promise never resolving... I can debug it more if needed, but if you could check for those things it'd be great.
Comment 35•2 years ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #34)
The only way I see this could happen is if Thunderbird continuously adds l10n mutations so that we never unblock the load event, or a fluent promise never resolving...
I don't know how this might happen.
FYI, when bringing up the initial password prompt, we're here:
https://searchfox.org/mozilla-central/source/toolkit/xre/nsXREDirProvider.cpp#765-788
This is very early during startup, prior to opening up any other app windows.
Could this disturb the usual l10n flow that you're expecting to happen?
Comment 36•2 years ago
|
||
It's also very strange this happens on exactly every second application start.
Assignee | ||
Comment 38•2 years ago
|
||
I can't repro super reliably but the patch above should help if my understanding of the situation is correct. I could repro sometimes before the flush but never afterwards, but it'd be good if you could verify.
Assignee | ||
Comment 39•2 years ago
|
||
Consider the following sequence of events:
- We do an l10n mutation -> mBlockingLoad=true, mPendingMutations.Size() = 1
- That flush triggers a promise -> mBlockingLoad=true, mPendingMutations.Size() = 0, mPendingPromises=1
- Another mutation happens -> mBlockingLoad=true, mPendingMutations.Size() = 1, mPendingPromises = 1
- The promise resolves -> mBlockingLoad=true, mPendingMutations.Size() = 1, mPendingPromises=0
Key part there is that the second mutation didn't post a task to flush,
so we rely on the refresh driver (which is not reliable on Wayland as
described in the comment).
Instead, we need to track the pending flush separately from the blocking
load.
Updated•2 years ago
|
Comment 40•2 years ago
|
||
Emilio, great work, no hangs for me with this patch.
Comment 41•2 years ago
|
||
Comment 42•2 years ago
|
||
bugherder |
Description
•