Crash in [@ PLDHashTable::Search | mozilla::dom::DocumentOrShadowRoot::GetElementById]
Categories
(Thunderbird :: General, defect)
Tracking
(thunderbird_esr68 unaffected, thunderbird73 wontfix, thunderbird74 verified)
Tracking | Status | |
---|---|---|
thunderbird_esr68 | --- | unaffected |
thunderbird73 | --- | wontfix |
thunderbird74 | --- | verified |
People
(Reporter: wsmwk, Assigned: benc)
References
(Regression)
Details
(Keywords: crash, regression, topcrash-thunderbird, Whiteboard: [regression:tb73])
Crash Data
Attachments
(2 files, 2 obsolete files)
2.21 KB,
patch
|
mkmelin
:
review+
wsmwk
:
approval-comm-beta+
|
Details | Diff | Splinter Review |
3.54 KB,
patch
|
benc
:
review+
wsmwk
:
approval-comm-beta+
|
Details | Diff | Splinter Review |
#2 crash for 73.0b1
Regression in nightly. bp-2736a504-a4fc-4a77-876b-461d80200106 is the earliest buildid 20200105105422 73.0a1
Top 10 frames of crashing thread:
0 xul.dll PLDHashTable::Search xpcom/ds/PLDHashTable.cpp:501
1 xul.dll mozilla::dom::DocumentOrShadowRoot::GetElementById dom/base/DocumentOrShadowRoot.cpp:98
2 xul.dll nsMsgWindow::GetMessageWindowDocShell comm/mailnews/base/src/nsMsgWindow.cpp:76
3 xul.dll nsImapProtocol::SetupWithUrl comm/mailnews/imap/src/nsImapProtocol.cpp:825
4 xul.dll nsImapProtocol::LoadImapUrl comm/mailnews/imap/src/nsImapProtocol.cpp:2099
5 xul.dll nsImapIncomingServer::LoadNextQueuedUrl comm/mailnews/imap/src/nsImapIncomingServer.cpp:521
6 xul.dll nsresult `anonymous namespace'::SyncRunnable2<nsIImapMailFolderSink, const char*, nsIImapUrl*>::Run comm/mailnews/imap/src/nsSyncRunnableHelpers.cpp:121
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1241
8 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
9 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:87
Comment 1•5 years ago
|
||
It would have been "caused" by bug 1603649 where the caller was introduced: https://searchfox.org/comm-central/rev/009c65e87d240a61404a57362bdf7aa514894fd5/mailnews/base/src/nsMsgWindow.cpp#76-78
Reporter | ||
Comment 2•5 years ago
|
||
Pretty bad for beta, so we should respin beta with a fix for this. Of course, first need a fix.
Comment 3•5 years ago
|
||
Unfortunately probably a bug over in m-c so re-spinning is harder, crashing at https://searchfox.org/mozilla-central/rev/f98dad153b59a985efd4505912588d4651033395/xpcom/ds/PLDHashTable.cpp#501 via https://searchfox.org/mozilla-central/rev/f98dad153b59a985efd4505912588d4651033395/dom/base/DocumentOrShadowRoot.cpp#132
Comment 4•5 years ago
•
|
||
Or well, if wanted we could back out the last changeset from bug 1603649 (use the earlier approach) on beta.
Reporter | ||
Comment 5•5 years ago
|
||
FWIW, there are so far zero Firefox crashes, and zero Mac crashes.
Comment 6•5 years ago
|
||
Add a null check to the M-C code and get the reviewer interested that way?
Reporter | ||
Comment 7•5 years ago
|
||
As long as we get the solution quicker rather than later :) ... Who is driving?
he Mac crash is PLDHashTable::Search | nsDocShell::QueryInterface ? bp-e86ab8a5-e63d-4c26-ab45-0b3910200131
Comment 8•5 years ago
|
||
Maybe we want to add a null check @ DocumentOrShadowRoot.cpp#132?
But, I wonder how that ever happens. If we don't get an element with the expected id, things will not work properly at all.
Reporter | ||
Comment 9•5 years ago
|
||
Now #1 crash for beta
Assignee | ||
Comment 10•5 years ago
|
||
How odd. The QueryInterface crash is exclusive to Mac, and the GetElementById is windows. The code path on the m-c side looks like it should be a pretty solid and well-tested one (I mean, GetElementById() is pretty fundamental, right?). I'm currently leaning toward it being a threading issue - maybe TB being shut down while IMAP operations are still pending (that dom being whisked away at just the wrong time, say)... But unless I can replicate it, it's going to be a matter of trying to work out and understand the whole code path, and where races/clashes could be occurring.
Reporter | ||
Comment 11•5 years ago
|
||
Too bad we don't have a graph that shows recent topcrashes against each other over time - we could clearly see when a signature trails off and a new one begins. Or that it's not happening to more nightly users - we'd have a more clear picture of the regression range :) But given the crash rate I'd say the checkin range should be 2020-01-01 to 2020-01-05, or very late December.
So you're of the opinion it's not likely to be an m-c checkin? Nor bug 1603649 whose checkin was 2019-12-20?
Assignee | ||
Comment 12•5 years ago
|
||
I did think we could mitigate it with an m-c patch to check for null... but it's worse than that - I think it's accessing freed memory which may have been overwritten by something else. So a null check doesn't help.
The crash is happening in the main thread, in response to an event that was dispached to the main threads message queue by the IMAP thread.
My best theory so far is that the IMAP thread posts a LoadNextQueuedUrl() request to the main thread. But before it's serviced, shutdown is initiated. So when the request is eventually processed by the main thread, the messagewindow and its docshell have been (at least partly) torn down. Kaboom.
Tricky to confirm without having steps to replicate the crash, of course.
But if so, the cheesy mitigation is to modify all the IMAPs dispatch-to-main-thread handlers (there's a proxy class to dispatch-to-main a whole bunch of IMAP functions), to detect that shutdown has started and if so, just exit/fail without doing anything. There are already some "we're shutting down" flags in the IMAP code, so hopefully I can use one of them, once I figure out how they're used currently.
So that's my current plan and approach. Stay tuned.
The proper solution is to check over all the IMAP threading and shutdown handling, making sure it's all nice and clean and safe, but that's a bigger task ;-)
I suspect some of these other PLDHashTable::Search() crashes might have a similar cause:
https://bugzilla.mozilla.org/buglist.cgi?quicksearch=PLDHashTable%3A%3ASearch
Assignee | ||
Comment 13•5 years ago
|
||
nsMsgWindow::GetMessageWindowDocShell()
seems to have a superfluous 'if' statement in it (leftover from some recent changes I think). This patch clears it up.
Assignee | ||
Comment 14•5 years ago
|
||
After lots of false starts, I discovered that there's a isBeingDestroyed()
method on nsIDocShell
. This patch adds a check in nsMsgWindow::GetMessageWindowDocShell()
so it'll hopefully abort before crashing. There are also a couple of tweaks in nsImapProtocol::SetupWithUrl()
to check for such an error before plowing on blindly...
I can't say for sure that this'll reduce the crashes, but I guess there's only one way to tell...
Assignee | ||
Comment 15•5 years ago
|
||
And some more notes, just while they're fresh in my head:
- It seems bonkers that any
nsMsgWindow
-related code is present down innsImapProtocol
in the first place. nsImapProtocol::SetupWithUrl()
is doing thisnsMsgWindow
hackery to set up notification callbacks on the mock channel.nsImapProtocol::SetupWithUrl()
has special handling to deal mockchannels vs real channels. This kind of skirts the whole point of using mock channels - the mockchannel should probably be a less leaky abstraction.
So I think the cleanup steps would be:
- ruthlessly track down and refactor real/mockchannel checks and better encapsulate the mock behaviour.
- go over the mockchannel notification callback nsMsgWindow hackery and see what it's really trying to do, and see if it can be done /without/ tying the
nsImapProtocol
system to the GUI side.
Updated•5 years ago
|
Comment 16•5 years ago
|
||
Assignee | ||
Comment 17•5 years ago
|
||
Thanks Magnus - seems like there's a perfect error code for all kinds of surprising occasions!
There seems to be a little bustage currently, so I'll hold off the checkin-needed for now.
Reporter | ||
Comment 18•5 years ago
|
||
So this linux signature as well?
PLDHashTable::Search | <name omitted> | nsMsgWindow::GetMessageWindowDocShell bp-391bc935-1256-4557-9937-52a1a0200220
Assignee | ||
Comment 19•5 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #18)
So this linux signature as well?
PLDHashTable::Search | <name omitted> | nsMsgWindow::GetMessageWindowDocShell bp-391bc935-1256-4557-9937-52a1a0200220
Looks like it, yes. In that case it's coming in from javascript, but the actual crash looks exactly the same.
Assignee | ||
Comment 20•5 years ago
|
||
Damn it, the patch breaks all the imap unit tests!
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=079b46908cf84690b92396efb55b5c0ca105952b
Hopefully I've just done something stupid and screwed up the patch.
If we're unlucky, we'll find out that all the tests were relying on the docshell being in the shutting-down state the whole time or something stupid like that... :-(
Gah...
More investigation.
Reporter | ||
Updated•5 years ago
|
Assignee | ||
Comment 21•5 years ago
|
||
Revised patch. Removed some returncode checks I had added. The imap code was ignoring the returncode and assuming the returned pointer was null on failure. Should probably be tightened up, but for now I just went through and made sure the pointer would in fact be null.
Also added a comment on what the offending code is actually doing (it wasn't at all obvious to me, and it's still not clear it should be doing it anyway!).
Try build here - will check back in on it later today and add checkin flags if it looks clean.
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=bf32cf94e26d4f0eea6ab887da62d5220a153238
Assignee | ||
Comment 22•5 years ago
|
||
Last try build failed, but this one looks clean:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=095f3079e980e7bd6b366282b56c341a6301c98c
(there's one mochitest fail, but that seems unrelated to this this patch).
I'll flag the checkin-needed now because I think we need to get this into the wild and see if it helps.
But I'm definitely open to following it up - Magnus, want to give this patch another quick once-over and see if there's anything extra you'd like me to chase up?
Assignee | ||
Updated•5 years ago
|
Reporter | ||
Comment 23•5 years ago
|
||
If one of you gets me a Mac try build I will test it.
I can very reliably reproduce PLDHashTable::Search | nsDocShell::QueryInterface on Mac beta
- mailnews.show_send_progress=true mailnews.sendInBackground=false
- send a large message
- don't use "Cancel" but click to close the compose window "X".
bp-c0cd6cee-42b7-4460-8583-390ae0200223
Magnus see comment 22.
Reporter | ||
Comment 24•5 years ago
|
||
I am unable to reproduce this on windows TB74
Btw, the steps above are from bug 1608733 which the user cites for Mac using version 68.
Comment 25•5 years ago
|
||
Let's get it checked in and you can check it on nightly.
Assignee | ||
Comment 26•5 years ago
•
|
||
Can't seem to get the repro steps in comment 23 to trigger the bug on Linux (with the latest code, but without my patch), which is a shame.
In any case, I just kicked off a new try build including windows and mac:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=3d0ac805d17841266b11ef18638217df8b510431
bug 1608733 Is the same crash, but coming in from a different code path. My patch should make the offending function bail out before it crashes, but it's possible that the calling code doesn't gracefully handle the error code So even if the patch works, the repro steps might give you a different crash :-) But it'd be progress.
Assignee | ||
Comment 27•5 years ago
|
||
Scratch that - I did get the comment 23 repro steps working!
I'm still a little fuzzy on the exact steps, frantically closing down windows and confirmation dialogs before the sending finished (and I'm running I3, which doesn't even have "X" close buttons on the windows), but it did trigger.
After applying the patch and trying to trigger the bug again, I haven't yet managed to get it to crash down in PLDHashTable, which is very promising.
(I'm running a debug build and I do get a assert triggering down in cairo, via MOZ_gdk_display_close(). But I think that's out of scope for this bug)
Reporter | ||
Comment 28•5 years ago
|
||
Comment 30•5 years ago
|
||
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/947de04af434
Remove superfluous 'if' in nsMsgWindow::GetMessageWindowDocShell(). r=mkmelin
https://hg.mozilla.org/comm-central/rev/ebcd40d4aaab
Make nsMsgWindow::GetMessageWindowDocShell() fail if docshell is being destroyed. r=mkmelin
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Reporter | ||
Comment 31•5 years ago
|
||
Reporter | ||
Updated•5 years ago
|
Comment 32•5 years ago
|
||
bugherder uplift |
Thunderbird 74.0b2:
https://hg.mozilla.org/releases/comm-beta/rev/75481e886915
Thunderbird 74.0b2:
https://hg.mozilla.org/releases/comm-beta/rev/3cfca362ac8c
Updated•5 years ago
|
Reporter | ||
Comment 33•5 years ago
|
||
There are no crashes for buildid 20200227011511 aka 74.0b2
v.fixed
Description
•