Closed Bug 1610406 Opened 5 years ago Closed 5 years ago

Crash in [@ PLDHashTable::Search | mozilla::dom::DocumentOrShadowRoot::GetElementById]

Categories

(Thunderbird :: General, defect)

Unspecified
All
defect
Not set
critical

Tracking

(thunderbird_esr68 unaffected, thunderbird73 wontfix, thunderbird74 verified)

VERIFIED FIXED
Thunderbird 75.0
Tracking Status
thunderbird_esr68 --- unaffected
thunderbird73 --- wontfix
thunderbird74 --- verified

People

(Reporter: wsmwk, Assigned: benc)

References

(Regression)

Details

(Keywords: crash, regression, topcrash-thunderbird, Whiteboard: [regression:tb73])

Crash Data

Attachments

(2 files, 2 obsolete files)

#2 crash for 73.0b1

Regression in nightly. bp-2736a504-a4fc-4a77-876b-461d80200106 is the earliest buildid 20200105105422 73.0a1

Top 10 frames of crashing thread:

0 xul.dll PLDHashTable::Search xpcom/ds/PLDHashTable.cpp:501
1 xul.dll mozilla::dom::DocumentOrShadowRoot::GetElementById dom/base/DocumentOrShadowRoot.cpp:98
2 xul.dll nsMsgWindow::GetMessageWindowDocShell comm/mailnews/base/src/nsMsgWindow.cpp:76
3 xul.dll nsImapProtocol::SetupWithUrl comm/mailnews/imap/src/nsImapProtocol.cpp:825
4 xul.dll nsImapProtocol::LoadImapUrl comm/mailnews/imap/src/nsImapProtocol.cpp:2099
5 xul.dll nsImapIncomingServer::LoadNextQueuedUrl comm/mailnews/imap/src/nsImapIncomingServer.cpp:521
6 xul.dll nsresult `anonymous namespace'::SyncRunnable2<nsIImapMailFolderSink, const char*, nsIImapUrl*>::Run comm/mailnews/imap/src/nsSyncRunnableHelpers.cpp:121
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1241
8 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
9 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:87

Flags: needinfo?(benc)

Pretty bad for beta, so we should respin beta with a fix for this. Of course, first need a fix.

Flags: needinfo?(mkmelin+mozilla)
Keywords: regression
Regressed by: 1603649

Or well, if wanted we could back out the last changeset from bug 1603649 (use the earlier approach) on beta.

FWIW, there are so far zero Firefox crashes, and zero Mac crashes.

Add a null check to the M-C code and get the reviewer interested that way?

As long as we get the solution quicker rather than later :) ... Who is driving?

he Mac crash is PLDHashTable::Search | nsDocShell::QueryInterface ? bp-e86ab8a5-e63d-4c26-ab45-0b3910200131

Severity: normal → critical
Crash Signature: [@ PLDHashTable::Search | mozilla::dom::DocumentOrShadowRoot::GetElementById] → [@ PLDHashTable::Search | mozilla::dom::DocumentOrShadowRoot::GetElementById] [@ PLDHashTable::Search | nsDocShell::QueryInterface ]
Flags: needinfo?(benc) → needinfo?(mkmelin+mozilla)
OS: Windows 10 → All

Maybe we want to add a null check @ DocumentOrShadowRoot.cpp#132?
But, I wonder how that ever happens. If we don't get an element with the expected id, things will not work properly at all.

Assignee: nobody → benc
Flags: needinfo?(mkmelin+mozilla)

Now #1 crash for beta

How odd. The QueryInterface crash is exclusive to Mac, and the GetElementById is windows. The code path on the m-c side looks like it should be a pretty solid and well-tested one (I mean, GetElementById() is pretty fundamental, right?). I'm currently leaning toward it being a threading issue - maybe TB being shut down while IMAP operations are still pending (that dom being whisked away at just the wrong time, say)... But unless I can replicate it, it's going to be a matter of trying to work out and understand the whole code path, and where races/clashes could be occurring.

Too bad we don't have a graph that shows recent topcrashes against each other over time - we could clearly see when a signature trails off and a new one begins. Or that it's not happening to more nightly users - we'd have a more clear picture of the regression range :) But given the crash rate I'd say the checkin range should be 2020-01-01 to 2020-01-05, or very late December.

So you're of the opinion it's not likely to be an m-c checkin? Nor bug 1603649 whose checkin was 2019-12-20?

Flags: needinfo?(benc)

I did think we could mitigate it with an m-c patch to check for null... but it's worse than that - I think it's accessing freed memory which may have been overwritten by something else. So a null check doesn't help.

The crash is happening in the main thread, in response to an event that was dispached to the main threads message queue by the IMAP thread.
My best theory so far is that the IMAP thread posts a LoadNextQueuedUrl() request to the main thread. But before it's serviced, shutdown is initiated. So when the request is eventually processed by the main thread, the messagewindow and its docshell have been (at least partly) torn down. Kaboom.
Tricky to confirm without having steps to replicate the crash, of course.
But if so, the cheesy mitigation is to modify all the IMAPs dispatch-to-main-thread handlers (there's a proxy class to dispatch-to-main a whole bunch of IMAP functions), to detect that shutdown has started and if so, just exit/fail without doing anything. There are already some "we're shutting down" flags in the IMAP code, so hopefully I can use one of them, once I figure out how they're used currently.
So that's my current plan and approach. Stay tuned.

The proper solution is to check over all the IMAP threading and shutdown handling, making sure it's all nice and clean and safe, but that's a bigger task ;-)

I suspect some of these other PLDHashTable::Search() crashes might have a similar cause:
https://bugzilla.mozilla.org/buglist.cgi?quicksearch=PLDHashTable%3A%3ASearch

Flags: needinfo?(benc)

nsMsgWindow::GetMessageWindowDocShell() seems to have a superfluous 'if' statement in it (leftover from some recent changes I think). This patch clears it up.

Attachment #9127763 - Flags: review?(mkmelin+mozilla)

After lots of false starts, I discovered that there's a isBeingDestroyed() method on nsIDocShell. This patch adds a check in nsMsgWindow::GetMessageWindowDocShell() so it'll hopefully abort before crashing. There are also a couple of tweaks in nsImapProtocol::SetupWithUrl() to check for such an error before plowing on blindly...

I can't say for sure that this'll reduce the crashes, but I guess there's only one way to tell...

Attachment #9127764 - Flags: review?(mkmelin+mozilla)

And some more notes, just while they're fresh in my head:

  • It seems bonkers that any nsMsgWindow-related code is present down in nsImapProtocol in the first place.
  • nsImapProtocol::SetupWithUrl() is doing this nsMsgWindow hackery to set up notification callbacks on the mock channel.
  • nsImapProtocol::SetupWithUrl() has special handling to deal mockchannels vs real channels. This kind of skirts the whole point of using mock channels - the mockchannel should probably be a less leaky abstraction.

So I think the cleanup steps would be:

  1. ruthlessly track down and refactor real/mockchannel checks and better encapsulate the mock behaviour.
  2. go over the mockchannel notification callback nsMsgWindow hackery and see what it's really trying to do, and see if it can be done /without/ tying the nsImapProtocol system to the GUI side.
Attachment #9127763 - Flags: review?(mkmelin+mozilla) → review+
Comment on attachment 9127764 [details] [diff] [review] 1610406-add-docshell-destruction-check.patch Review of attachment 9127764 [details] [diff] [review]: ----------------------------------------------------------------- ::: mailnews/base/src/nsMsgWindow.cpp @@ +77,5 @@ > + // but really, we shouldn't even get here in such cases. > + bool doomed; > + rootShell->IsBeingDestroyed(&doomed); > + if (doomed) { > + return NS_ERROR_FAILURE; Maybe we should remove NS_ERROR_ILLEGAL_DURING_SHUTDOWN instead to perhaps make errors easier to understand if someone runs into them?
Attachment #9127764 - Flags: review?(mkmelin+mozilla) → review+

Thanks Magnus - seems like there's a perfect error code for all kinds of surprising occasions!
There seems to be a little bustage currently, so I'll hold off the checkin-needed for now.

Attachment #9127764 - Attachment is obsolete: true
Attachment #9128025 - Flags: review+

So this linux signature as well?
PLDHashTable::Search | <name omitted> | nsMsgWindow::GetMessageWindowDocShell bp-391bc935-1256-4557-9937-52a1a0200220

Blocks: 1344036

(In reply to Wayne Mery (:wsmwk) from comment #18)

So this linux signature as well?
PLDHashTable::Search | <name omitted> | nsMsgWindow::GetMessageWindowDocShell bp-391bc935-1256-4557-9937-52a1a0200220

Looks like it, yes. In that case it's coming in from javascript, but the actual crash looks exactly the same.

Damn it, the patch breaks all the imap unit tests!
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=079b46908cf84690b92396efb55b5c0ca105952b

Hopefully I've just done something stupid and screwed up the patch.
If we're unlucky, we'll find out that all the tests were relying on the docshell being in the shutting-down state the whole time or something stupid like that... :-(
Gah...
More investigation.

Crash Signature: [@ PLDHashTable::Search | mozilla::dom::DocumentOrShadowRoot::GetElementById] [@ PLDHashTable::Search | nsDocShell::QueryInterface ] → [@ PLDHashTable::Search | mozilla::dom::DocumentOrShadowRoot::GetElementById] [@ PLDHashTable::Search | nsDocShell::QueryInterface ] [@ PLDHashTable::Search | <name omitted> | nsMsgWindow::GetMessageWindowDocShell ]

Revised patch. Removed some returncode checks I had added. The imap code was ignoring the returncode and assuming the returned pointer was null on failure. Should probably be tightened up, but for now I just went through and made sure the pointer would in fact be null.
Also added a comment on what the offending code is actually doing (it wasn't at all obvious to me, and it's still not clear it should be doing it anyway!).

Try build here - will check back in on it later today and add checkin flags if it looks clean.
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=bf32cf94e26d4f0eea6ab887da62d5220a153238

Attachment #9128025 - Attachment is obsolete: true
Attachment #9128338 - Flags: review+

Last try build failed, but this one looks clean:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=095f3079e980e7bd6b366282b56c341a6301c98c
(there's one mochitest fail, but that seems unrelated to this this patch).
I'll flag the checkin-needed now because I think we need to get this into the wild and see if it helps.
But I'm definitely open to following it up - Magnus, want to give this patch another quick once-over and see if there's anything extra you'd like me to chase up?

If one of you gets me a Mac try build I will test it.

I can very reliably reproduce PLDHashTable::Search | nsDocShell::QueryInterface on Mac beta

  1. mailnews.show_send_progress=true mailnews.sendInBackground=false
  2. send a large message
  3. don't use "Cancel" but click to close the compose window "X".
    bp-c0cd6cee-42b7-4460-8583-390ae0200223

Magnus see comment 22.

Flags: needinfo?(mkmelin+mozilla)
Flags: needinfo?(benc)

I am unable to reproduce this on windows TB74

Btw, the steps above are from bug 1608733 which the user cites for Mac using version 68.

Let's get it checked in and you can check it on nightly.

Status: NEW → ASSIGNED
Flags: needinfo?(mkmelin+mozilla)
Flags: needinfo?(benc)
Target Milestone: --- → Thunderbird 75.0

Can't seem to get the repro steps in comment 23 to trigger the bug on Linux (with the latest code, but without my patch), which is a shame.

In any case, I just kicked off a new try build including windows and mac:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=3d0ac805d17841266b11ef18638217df8b510431

bug 1608733 Is the same crash, but coming in from a different code path. My patch should make the offending function bail out before it crashes, but it's possible that the calling code doesn't gracefully handle the error code So even if the patch works, the repro steps might give you a different crash :-) But it'd be progress.

Scratch that - I did get the comment 23 repro steps working!
I'm still a little fuzzy on the exact steps, frantically closing down windows and confirmation dialogs before the sending finished (and I'm running I3, which doesn't even have "X" close buttons on the windows), but it did trigger.

After applying the patch and trying to trigger the bug again, I haven't yet managed to get it to crash down in PLDHashTable, which is very promising.
(I'm running a debug build and I do get a assert triggering down in cairo, via MOZ_gdk_display_close(). But I think that's out of scope for this bug)

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/947de04af434
Remove superfluous 'if' in nsMsgWindow::GetMessageWindowDocShell(). r=mkmelin
https://hg.mozilla.org/comm-central/rev/ebcd40d4aaab
Make nsMsgWindow::GetMessageWindowDocShell() fail if docshell is being destroyed. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Attachment #9127763 - Flags: approval-comm-esr68?
Attachment #9127763 - Flags: approval-comm-esr68? → approval-comm-beta?
Attachment #9128338 - Flags: approval-comm-beta?
Comment on attachment 9128338 [details] [diff] [review] 1610406-add-docshell-destruction-check-3.patch Yes, let's see how this addresses the regression on beta
Attachment #9128338 - Flags: approval-comm-beta? → approval-comm-beta+
Attachment #9127763 - Flags: approval-comm-beta? → approval-comm-beta+

There are no crashes for buildid 20200227011511 aka 74.0b2
v.fixed

Status: RESOLVED → VERIFIED
Whiteboard: [regression:tb73]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: