Closed Bug 1629669 Opened 4 years ago Closed 2 years ago

Thunderbird Crash in [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] (password?)

Categories

(Thunderbird :: General, defect)

x86_64
Windows
defect

Tracking

(thunderbird_esr102+ fixed, thunderbird104 fixed)

RESOLVED FIXED
105 Branch
Tracking Status
thunderbird_esr102 + fixed
thunderbird104 --- fixed

People

(Reporter: worcester12345, Assigned: KaiE)

References

(Depends on 2 open bugs)

Details

(Keywords: crash, dupeme, topcrash-thunderbird, Whiteboard: [TM:102.2.+])

Crash Data

Attachments

(1 file, 2 obsolete files)

This bug is for crash report bp-088ba0db-c9c5-456a-91f6-41ebf0200413.

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernelbase.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::GetEvent xpcom/threads/ThreadEventQueue.cpp:149
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1108
6 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
7 xul.dll nsThread::Shutdown xpcom/threads/nsThread.cpp:881
8 xul.dll nsThreadManager::Shutdown xpcom/threads/nsThreadManager.cpp:284
9 xul.dll mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:649

Report ID
1f603768-0e2a-4839-ae27-cef399cbe0c2

68.7.0 (32-bit)

What were you doing at the time of your crashes?

Flags: needinfo?(worcester12345)

I think I was in Firefox, and Thunderbird was just sitting there in the background. Maybe in Excel. I just noticed the "Mozilla Crash Reporter" in the taskbar and went over to it. Glad I didn't lose my browser, as I have lost data there before.

Flags: needinfo?(worcester12345)

This could be related to bug 1524247.

During startup or while thunderbird was running, had you dismissed a password prompt without entering a password? Or entered an incorrect password?

I don't think so. I had already been in email, and was doing something else in Firefox. I noticed the Mozilla Crash Reporter had come up, and Firefox was still running. I think it (Thunderbird) was just sitting there idle in the background.

Thanks. But I'm not asking whether you had answered a password prompt in the seconds before crash. I mean literally, "During startup or while thunderbird was running" ... and then crashed hours or days later.

Oh, dismissed a password dialog ever when that happened? I don't recall. Usually, I log in and then do other stuff with the window minimized. For this account, I usually just put it in at first launch, then leave it alone.

Keywords: crash

Are you still crashing?

90% of crashes with this signature are Thunderbird - the minority are Firefox

Depends on: 1505660
Flags: needinfo?(worcester12345)
See Also: → 1524247

Not as often, and if I did, I'd log it in here, and it would either remain as new or get "duplicated".

Flags: needinfo?(worcester12345)
See Also: → 1691429

(In reply to Worcester12345 from comment #8)

Oh, dismissed a password dialog ever when that happened? I don't recall. Usually, I log in and then do other stuff with the window minimized. For this account, I usually just put it in at first launch, then leave it alone.

I could attempt to do this, if it helps you. Let me know.

(In reply to Worcester12345 from comment #13)

(In reply to Worcester12345 from comment #8)

Oh, dismissed a password dialog ever when that happened? I don't recall. Usually, I log in and then do other stuff with the window minimized. For this account, I usually just put it in at first launch, then leave it alone.

I could attempt to do this, if it helps you. Let me know.

sure

Flags: needinfo?(worcester12345)

Perhaps a duplicate of bug 1524247 or one of the other shutdown crashes

Flags: needinfo?(vseerror)
Keywords: dupeme

Looking at one of the crashes, I arrived at nsThread::ShutdownInternal where I read:

  // XXXroc What if posting the event fails due to OOM?
  mEvents->PutEvent(event.forget(), EventQueuePriority::Normal);

Inside PutEvent, there seem to be more than just OOM conditions that might have us return false here. Shouldn't we check for false and in case return a nullptr as shutdown context (which would us not make enter the SpinEventLoopUntil that apparently waits forever for a flag on the shutdown context to be set) ?

It also seems that we sequentially shut down the threads, waiting for each thread to join before we shut down the next. We might want to decouple the sending of the shutdown event to all threads from the waiting for the threads to join.

Randell, WDYT ?

Flags: needinfo?(rjesup)

By "something", I mean try to click on a folder, open an email, pretty much anything.

I went to "More troubleshooting info" and clicked submit to the right of cb7914eb-4a79-436d-87e0-f5ff0eec3f94, and that is where it turned to "failed". There is nothing further to do there, and no instructions on what to do to fix it. In the past, these have started working again sometimes on their own, with no rhyme or reason as to why or how.

(In reply to Worcester12345 from comment #23)

It finally let me submit a report:

https://crash-stats.mozilla.org/report/index/6e57f4c2-c7ff-48cc-8b01-4809b0210928

shutdownhang | nsAppShell::ProcessNextNativeEvent is primarily version 91. I believe this is a continuation of the original crash signature of this bug, which is not so prevalent in version 91. And so perhaps not related to bug 1671405.

Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] → [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ @ shutdownhang | nsAppShell::ProcessNextNativeEvent ]
OS: Windows 10 → Windows

(In reply to Jens Stutte [:jstutte] from comment #16)

Looking at one of the crashes, I arrived at nsThread::ShutdownInternal where I read:

  // XXXroc What if posting the event fails due to OOM?
  mEvents->PutEvent(event.forget(), EventQueuePriority::Normal);

Inside PutEvent, there seem to be more than just OOM conditions that might have us return false here. Shouldn't we check for false and in case return a nullptr as shutdown context (which would us not make enter the SpinEventLoopUntil that apparently waits forever for a flag on the shutdown context to be set) ?

The false returns from PutEventInternal() are pretty limited - mEventsAreDoomed or no mQueue in the target sink. mEventsAreDoomed can happen if we try to send an event to a thread after it gets past nsThread.cpp:414, where we process events at shutdown time until the event queue is empty. At that point we set mEventsAreDoomed and start cleanup, but an attempt to send us a message could come in there I think.

However for this to happen, we'd have to have already started shutdown for the thread, and ShutdownInternal() doesn't let itself be called twice. If a thread exited on it's own, and Shutdown() was called on it while it was in the process of shutting down, perhaps this sequence could happen. Seems unlikely, but in theory possible for threads that commit suicide.
We could add a MOZ_RELEASE_ASSERT to flag this case, and see if it's being hit.

It also seems that we sequentially shut down the threads, waiting for each thread to join before we shut down the next. We might want to decouple the sending of the shutdown event to all threads from the waiting for the threads to join.

Not shutting them down sequentially should in theory let us shut down faster -- however, it's probably much more likely to expose latent race conditions. So a somewhat risky change, in the short term. Of course, there's no real order imposed currently, so ordering-races could in theory happen already (though in practice they may not), but simultaneous shutdown will definitely make more races possible.

Nika: thoughts?

Flags: needinfo?(rjesup) → needinfo?(nika)
Depends on: 1735129

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #25)

It also seems that we sequentially shut down the threads, waiting for each thread to join before we shut down the next. We might want to decouple the sending of the shutdown event to all threads from the waiting for the threads to join.

Not shutting them down sequentially should in theory let us shut down faster -- however, it's probably much more likely to expose latent race conditions. So a somewhat risky change, in the short term. Of course, there's no real order imposed currently, so ordering-races could in theory happen already (though in practice they may not), but simultaneous shutdown will definitely make more races possible.

IIUC in the long run we want to skip this stage anyway (if toolkit.shutdown.fastShutdownStage >= 3). So it is unclear to me if optimizing performance here is still worth the effort. But we did a similar parallelization for quota clients shutdown and it moved the needle quite significantly. And having some extra opportunity to find (and eliminate) races could be interesting, too. Maybe we could hide it behind a pref and enable it in the beginning only on TSAN for a while?

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #25)

However for this to happen, we'd have to have already started shutdown for the thread, and ShutdownInternal() doesn't let itself be called twice. If a thread exited on it's own, and Shutdown() was called on it while it was in the process of shutting down, perhaps this sequence could happen. Seems unlikely, but in theory possible for threads that commit suicide.
We could add a MOZ_RELEASE_ASSERT to flag this case, and see if it's being hit.

I made a try push with a direct return (and warning) for that case to see what happens. There are also other crash reports that indicate this might happen also during profile-before-change for individual thread shutdowns.

Depends on: 1735284

(In reply to Jens Stutte [:jstutte] from comment #16)

Looking at one of the crashes, I arrived at nsThread::ShutdownInternal where I read:

  // XXXroc What if posting the event fails due to OOM?
  mEvents->PutEvent(event.forget(), EventQueuePriority::Normal);
```...
It also seems that we [sequentially shut down the threads](https://searchfox.org/mozilla-central/rev/50c3cf7a3c931409b54efa009795b69c19383541/xpcom/threads/nsThreadManager.cpp#411-413), waiting for each thread to join before we shut down the next. We might want to decouple the sending of the shutdown event to all threads from the waiting for the threads to join.

Randell, WDYT ?

Magnus, any concerns about doing this?

(In reply to Jens Stutte [:jstutte] from comment #27)

...
I made a try push with a direct return (and warning) for that case to see what happens. There are also other crash reports that indicate this might happen also during profile-before-change for individual thread shutdowns.

I suspect no one in the bug here did any testing (I know I didn't - sorry). Is it worth sending this out to a few community members or development team, and if so what do we want to ask of them?

Flags: needinfo?(mkmelin+mozilla)

Jens

The crash signatures for this bug [1] are good for Thunderbird 78 (which is the #1 shutdown crash sig), but neither exists for Thunderbird 91.

The #1 shutdown crash sig for 91, and a good statistical match, is shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown which corresponds to Bug 1505660. Is that signature a match for the scenario you have detailed here?

[1] [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown ] [@ @ shutdownhang | nsAppShell::ProcessNextNativeEvent ]

Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(jstutte)
Flags: needinfo?(mkmelin+mozilla)

(In reply to Wayne Mery (:wsmwk) from comment #29)

Jens

The crash signatures for this bug [1] are good for Thunderbird 78 (which is the #1 shutdown crash sig), but neither exists for Thunderbird 91.

The #1 shutdown crash sig for 91, and a good statistical match, is shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown which corresponds to Bug 1505660. Is that signature a match for the scenario you have detailed here?

No, the scenario from bug 1735284 is supposed to never happen, and by now the MOZ_CRASH we added did not occur.

Flags: needinfo?(jstutte)

Worcester12345 are you still seeing this crash?

Flags: needinfo?(nika) → needinfo?(worcester12345)

Nika, see comment 25

Flags: needinfo?(nika)

(In reply to Wayne Mery (:wsmwk) from comment #31)

Worcester12345 are you still seeing this crash?

If I did, I think bugzilla would hopefully steer me here. If not, I imagine you would. The only question mark on that is that sometimes I can't get Thunderbird to open links to bugzilla correctly. I think eventually they will make their way, or I have to brute-force it to work, so there's that. I'd say you could just mark it wont fix for now if that is what you want. Can't really say it is fixed, since nothing was done to fix it. Also can't say works for me, since there is no way to tell when it might happen again.

Flags: needinfo?(worcester12345)

This type of crash occurs when a thread isn't responsive during shutdown, e.g. due to some background blocking work. For exmaple in a crash like https://crash-stats.mozilla.org/report/index/b24de20f-8c19-46ae-a1f8-af2a50220425#allthreads, the blocked thread is probably Thread 26, which is currently running what looks like blocking win32 API calls to verify the signature of loaded DLLs (https://searchfox.org/mozilla-central/rev/13d69189a8abfc5064fe44944550b9b6eb4544f5/toolkit/xre/dllservices/mozglue/Authenticode.cpp#161-189). There's not much we can do about situations like that unfortunately.

In the thunderbird crashes I looked at (admittedly, I only poked at a couple: https://crash-stats.mozilla.org/report/index/b1183302-cc57-47ef-92b0-1b6390220424#allthreads and https://crash-stats.mozilla.org/report/index/ccea46a5-21db-48f3-b9f6-0968d0220423#allthreads), the blocked thread appears to be the "IMAP" thread, which is blocked in nsImapProtocol::GetPassword waiting on a monitor: https://searchfox.org/comm-central/rev/16430df4c00f639809ba0aea5cf6c31575d55419/mailnews/imap/src/nsImapProtocol.cpp#8473. It looks like whatever system is supposed to tell the thread to die during shutdown in that situation isn't working properly (though it appears there's supposed to be one based on the conditions there).

Flags: needinfo?(nika)

IIUC either this loop needs an additional check if we are in shutdown or something else should ensure that nsImapProtocol::OnPromptCanceled() is called during shutdown if we are waiting for a password?

Flags: needinfo?(vseerror)

(In reply to Jens Stutte [:jstutte] from comment #35)

IIUC either this loop needs an additional check if we are in shutdown or something else should ensure that nsImapProtocol::OnPromptCanceled() is called during shutdown if we are waiting for a password?

Flags: needinfo?(vseerror) → needinfo?(mkmelin+mozilla)

This particular signature no longer exists in version 91 and newer.
Nor does the signature for bug 1524247.

But the problem surely is not gone.

Severity: critical → S2
Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ @ shutdownhang | nsAppShell::ProcessNextNativeEvent ] → [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown]
See Also: → 1638638
Summary: Crash in [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] → Thunderbird Crash in [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] (password?)
Whiteboard: [actionable]
See Also: → 1630597
Assignee: nobody → mkmelin+mozilla
Status: NEW → ASSIGNED
Flags: needinfo?(mkmelin+mozilla)
Target Milestone: --- → 105 Branch
Blocks: 1768344

If at all related to bug 1768344, please see STR from bug 1768344 and its related crash using 104.0b2 for any clues / similarities: https://crash-stats.mozilla.org/report/index/e0dca0f9-14f7-492e-b507-4e4e40220802

See Also: → 1783573
Depends on: 1783573

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/173576a762db
Fix Thunderbird Crash in [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] - might be stuck waiting for password on shutdown. r=benc

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Regressions: 1783659

This change introduced startup lockup bug 1783710

Regressions: 1783710

I think it's best to reopen.

The call to GetServerShuttingDown() isn't simple - it involves a runnable that's dispatched - probably to the main thread?

Apparently that dispatching, while having the monitor locked, causes a deadlock.

I think we must rework that loop to check GetServerShuttingDown() while the monitor is released.

And while I was looking at this code, I'm worried about the related variables m_passwordObtained and m_passwordStatus. IMHO those variables must only be accessed while holding the monitor, otherwise there are races.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

The attached patch fixes the deadlock for me.
I'm not yet asking for review. First, let's check if tests pass:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=cb72d4b0a166337cf1d7c4f936d402450c70a8b1

Attachment #9288919 - Attachment description: WIP: Bug 1629669 - Rework nsImapProtocol::GetPassword to avoid deadlock, and better protect m_passwordObtained/m_passwordStatus. → Bug 1629669 - Rework nsImapProtocol::GetPassword to avoid deadlock, and better protect m_passwordObtained/m_passwordStatus. r=benc

(In reply to Kai Engert (:KaiE:) from comment #44)

The attached patch fixes the deadlock for me.

If this is at all related to what I am able to repro in bug 1768344 and will get pulled into 104.0b3/b4, I can test to see if it at least resolves things there.

(In reply to Kai Engert (:KaiE:) from comment #44)

The attached patch fixes the deadlock for me.
I'm not yet asking for review. First, let's check if tests pass:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=cb72d4b0a166337cf1d7c4f936d402450c70a8b1

Try run looks good I think...
You left me standing - I'm still peppering printf()s everywhere to try and understand the flow of things!
I'll just wrap up what I've done so far then give your patch a proper look.

(In reply to Kai Engert (:KaiE:) from comment #42)

The call to GetServerShuttingDown() isn't simple - it involves a runnable that's dispatched - probably to the main thread?
Apparently that dispatching, while having the monitor locked, causes a deadlock.
I think we must rework that loop to check GetServerShuttingDown() while the monitor is released.
And while I was looking at this code, I'm worried about the related variables m_passwordObtained and m_passwordStatus. IMHO those variables must only be accessed while holding the monitor, otherwise there are races.

Yep, I think you're right about GetServerShuttingDown() causing it.
I think the deadlock was probably always there, but the loop was preventing event dispatch so it didn't actually happen.
Adding the GetServerShuttingDown() gives the event dispatching a kick and allows the queued-up prompt code to run.... and to deadlock!
Something along those lines anyway. It's not super clear which things are running on which threads (sigh).

Backout: https://hg.mozilla.org/comm-central/rev/e8d3e664b980fd02d1ca1c3500391ef4423838cb
Let's land a proper fix instead.

Kai, thanks for jumping in!

Assignee: mkmelin+mozilla → kaie
Attachment #9288919 - Attachment description: Bug 1629669 - Rework nsImapProtocol::GetPassword to avoid deadlock, and better protect m_passwordObtained/m_passwordStatus. r=benc → Bug 1629669 - Rework nsImapProtocol::GetPassword to avoid deadlock, and better protect m_passwordObtained/m_passwordStatus. r=benc,mkmelin
Attachment #9287870 - Attachment is obsolete: true
Attachment #9288939 - Attachment is obsolete: true

Comment on attachment 9288919 [details]
Bug 1629669 - Rework nsImapProtocol::GetPassword to avoid deadlock, and better protect m_passwordObtained/m_passwordStatus. r=benc,mkmelin

[Approval Request Comment]
Regression caused by (bug #):
User impact if declined: crash
Testing completed (on c-c, etc.):
Risk to taking this patch (and alternatives if risky): minor risk for IMAP connectivity

Attachment #9288919 - Flags: approval-comm-beta?

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/31d6fb3903f2
Rework nsImapProtocol::GetPassword to avoid deadlock, and better protect m_passwordObtained/m_passwordStatus. r=benc,mkmelin

Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED

Just out of interest, what does password = nsString(); do?
https://hg.mozilla.org/comm-central/rev/31d6fb3903f23ad0b6e4e97324636d0f7283a007#l1.12

Why not passwort.Truncate(); or `password = "";`` (if the latter compiles).

(In reply to newsfan from comment #52)

Just out of interest, what does password = nsString(); do?
https://hg.mozilla.org/comm-central/rev/31d6fb3903f23ad0b6e4e97324636d0f7283a007#l1.12

Why not passwort.Truncate(); or `password = "";`` (if the latter compiles).

It erases the string. Assigning to "" creates local data, so might be less efficient. Calling Truncate would have been perfect, but I couldn't find that function, after trying Delete, Erase, and others, I gave up ;-)

That's currently true, but the API is designed to potentially fail (and the code checks for a returned failure). Because an API might not fill out parameters on failure, it's safer to init the out parameter.

Comment on attachment 9288919 [details]
Bug 1629669 - Rework nsImapProtocol::GetPassword to avoid deadlock, and better protect m_passwordObtained/m_passwordStatus. r=benc,mkmelin

[Triage Comment]
Approved for beta

Attachment #9288919 - Flags: approval-comm-beta? → approval-comm-beta+

FYI I'm excited to have this, but suspect we should let this sit on beta a while, and not take it too quickly to 102, to give us time to watch what happens on beta for 2-3 weeks.

Whiteboard: [actionable] → [TM:102.2.+]

Time for 102 uplift?

Comment on attachment 9288919 [details]
Bug 1629669 - Rework nsImapProtocol::GetPassword to avoid deadlock, and better protect m_passwordObtained/m_passwordStatus. r=benc,mkmelin

see comment 50 for approval comment

Attachment #9288919 - Flags: approval-comm-esr102?

Comment on attachment 9288919 [details]
Bug 1629669 - Rework nsImapProtocol::GetPassword to avoid deadlock, and better protect m_passwordObtained/m_passwordStatus. r=benc,mkmelin

[Triage Comment]
Approved for esr102

Attachment #9288919 - Flags: approval-comm-esr102? → approval-comm-esr102+

(Collecting information here, which eventually will go into other bug reports)

I believe the signature has shifted in the last 10 months because shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown has no v102 Thunderbird crashes in the past 6 months, which should be impossible - I would expect to see some pre-102.3.3 examples on crash-stats. Perhaps the crash signature morphed somewhere along the way and I failed to update the bug(s).

Anyway, post-102.3.3, we still have plenty of examples of users crash comments mentioning password. The majority of crash signatures are

  • mozilla::dom::workerinternals::RuntimeService::CrashIfHanging - bug 1435343 (where I have also commented)
  • shutdownhang | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads - Bug 1505660
  • shutdownhang | mozilla::SpinEventLoopUntil | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads - Bug 1505660

Most mention SMTP. For example bp-fe8b8258-7db3-4c48-9244-d9e3c0230323

CrashIfHanging bug 1435343 examples:

SpinEventLoopUntil Bug 1505660 examples:

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: