Closed Bug 1353704 Opened 7 years ago Closed 4 years ago

Crash in mozilla::mailnews::MsgDBReporter::GetPath and mozilla::mailnews::MsgDBReporter::CollectReports - lifetime of reporter object shoud be bound to (and owned by) the nsMsgDatabase

Categories

(Thunderbird :: General, defect)

52 Branch
x86
All
defect
Not set
critical

Tracking

(thunderbird_esr52 wontfix, thunderbird_esr60 wontfix, thunderbird_esr68 affected, thunderbird51 wontfix, thunderbird52 wontfix, thunderbird53 wontfix, thunderbird54 wontfix, thunderbird67 wontfix)

RESOLVED FIXED
Thunderbird 78.0
Tracking Status
thunderbird_esr52 --- wontfix
thunderbird_esr60 --- wontfix
thunderbird_esr68 --- affected
thunderbird51 --- wontfix
thunderbird52 --- wontfix
thunderbird53 --- wontfix
thunderbird54 --- wontfix
thunderbird67 --- wontfix

People

(Reporter: wsmwk, Assigned: jorgk-bmo)

References

()

Details

(Keywords: crash, regression, Whiteboard: [regression:TB51?])

Crash Data

Attachments

(1 file, 3 obsolete files)

#15 crash for TB52.0.  Like bug 1353702, users with this crash signature also tend to report OOM signatures, but in this case not OOM | Small. And again, perhaps addon or some other program involved. Three pairings listed below


bp-864a16f5-dd57-43c4-abcb-463d72170404
bp-567ae4a1-3956-4028-a358-9dcb32170404   OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer::moveToTenured 


bp-58dd7fb9-2c28-435b-9b8d-8880b2170404
bp-bfa45f39-3ce7-4904-a25b-c0c6e2170403   OOM | large | NS_ABORT_OOM | PLDHashTable::Add | nsImapFlagAndUidState::AddUidCustomFlagPair 


I suspect these two crashes are the same user:
bp-56435972-9fea-495d-886c-7d9e72170322
bp-4f4d5acf-12fe-463f-ade1-918532170329
 0 	xul.dll	mozilla::mailnews::MsgDBReporter::GetPath(nsACString_internal&, bool)	c:/builds/moz2_slave/tb-rel-c-esr45-w32_bld-0000000/build/mailnews/db/msgdb/src/nsMsgDatabase.cpp:1054
1 	xul.dll	mozilla::mailnews::MsgDBReporter::CollectReports(nsIMemoryReporterCallback*, nsISupports*, bool)	c:/builds/moz2_slave/tb-rel-c-esr45-w32_bld-0000000/build/mailnews/db/msgdb/src/nsMsgDatabase.cpp:1041
2 	xul.dll	nsRunnableFunction<<lambda_4186443ceb77967b62731b2342d35d74> >::Run()	xpcom/glue/nsThreadUtils.h:261
3 	xul.dll	nsThread::ProcessNextEvent(bool, bool*)	xpcom/threads/nsThread.cpp:972
4 	xul.dll	NS_ProcessNextEvent(nsIThread*, bool)	xpcom/glue/nsThreadUtils.cpp:297
5 	xul.dll	nsThebesFontEnumerator::EnumerateFonts(char const*, char const*, unsigned int*, wchar_t***)	gfx/src/nsThebesFontEnumerator.cpp:44
now at #13 crash for TB52.0.1. I suspect connection to bug 1158578

Dave's symptoms in Bug 1360046, which I have just duplicated to this bug, are very slow, unstable and constant crashes. 

Some users who have this crash signature also see
- Bug 1354810 - Crash in mozilla::dom::Promise::PerformWorkerMicroTaskCheckpoint 
- Bug 1353702 - Crash in PtrToNodeMatchEntry (CompareCacheMatchEntry) during CC
- @0x0 | mozilla::mailnews::MsgDBReporter::CollectReports  bp-33943c5e-47d7-49cb-936b-ece2a0170426
- CDesktopBrowser::_UpdateLayoutStoreState  bp-0bc3b92d-4edb-4a6a-9112-b41120170425
Depends on: 1158578
Dave, 
You can test a draft patch from bug 1158578 using the win32.exe version in the directory at https://mzl.la/2q8E3Ec 
Backup your Thunderbird profile first as a precaution https://support.mozilla.org/en-US/kb/profiles-where-thunderbird-stores-user-data#w_backing-up-a-profile

Does it help?
Flags: needinfo?(dave)
Wayne,

Yes, thanks!  This is the elixir I was looking for.  Back to normal operation and no crashes since the install 12 hours ago.
Flags: needinfo?(dave)
52.1.1 is out a bit now and even with bug 1158578 fixed, it's #15 crash.  
Example https://support.mozilla.org/en-US/questions/1160061#answer-970951
The user's crashes from https://support.mozilla.org/en-US/questions/1160061
- CCGraphBuilder::BuildGraph   bp-3a72ac8e-c0cd-4f82-bd22-bc6d70170522 bug 1272230 (and perhaps Bug 1353702)
- mozilla::mailnews::MsgDBReporter::GetPath   bp-4f81be9e-54f1-4aec-8304-953680170521
- mozilla::mailnews::MsgDBReporter::GetPath   bp-cff9cef3-444b-4839-a631-2d1bc0170521
- xul.dll@0x11844d4   bp-ca143aae-e664-4003-a6d0-2cf920170520
- OOM | small   bp-a56edee6-1ffb-42ab-a889-b835e0170516
Depends on: 1272230, 1353702
https://support.mozilla.org/en-US/questions/116187  @ mozilla::mailnews::MsgDBReporter::CollectReports 
bp-b0e13cdb-1d5d-4440-ab9f-a3f1a0170528
Crash Signature: [@ mozilla::mailnews::MsgDBReporter::GetPath] → [@ mozilla::mailnews::MsgDBReporter::GetPath] [@ mozilla::mailnews::MsgDBReporter::CollectReports ]
#89 and #142 for 52.2.1, so no longer a topcrash.
mozilla::mailnews::MsgDBReporter::GetPath dropped substantially ~June 14
many users of current crashes have filtaquilla addon installed
i just have numerous rss feeds in my profile, no email account (POP3, IMAP or otherwise)
i leave thunderbird in background to collect feeds updates

i got few counts for thunderbird 59.0b2 just now
https://crash-stats.mozilla.com/report/index/d59f7e04-4088-4c8a-9d7d-9c1490180306
https://crash-stats.mozilla.com/report/index/8c29d391-d123-461c-af66-f278a0180306

i don't have filtaquilla installed and the crash happens in safe mode as well
https://crash-stats.mozilla.com/report/index/ec0d1dd8-b499-4d2e-8989-7b69b0180306

it consumes 25% CPU (one of the 4 cores) and around 3.5G of physical memory at the time of each crash.
it crashed as "OOM | small" in this particular instance
https://crash-stats.mozilla.com/report/index/7ff4402a-6653-443a-993d-aeb730180306

given high CPU consumption, it is probably not just a simply memory leak
it doesn't even run for 15 mins, falling back to 58.0b3 for now
This user also reports crashes with RSS.

bp-e506e7ef-1f5e-4185-9315-fb5b10180226 mozilla::dom::workers::RuntimeService::CrashIfHanging - bug 1439893
bp-dcf4005d-6b60-4696-8130-2b1bb0180227    OOM | small  
bp-7fb3f27c-0e58-406e-b335-39f2d0180219    @0x0 | mozilla::mailnews::MsgDBReporter::CollectReports
See Also: → 1439893
looks working fine with Thunderbird 60.0b2, up and running for 2 hours with low CPU and steady memory usage :)
http://ftp.mozilla.org/pub/thunderbird/releases/60.0b2/
Lot of this kind of crashes per day with TB 52.8.0, after remove RSS account, it's fine.
See Also: → 1541990

Looking at graph for past 6 months, crash rate of version 60 appears to be about the same as version 52.

just had https://crash-stats.mozilla.com/report/index/e5b51280-f57a-45e6-851f-1a16b0190715 At the time I was looking for why Thunderbird was sitting there consuming over 1200mb. Opened activity manager and sat for minutes observing it telling me it was indexing message 3 or 4 in an RSS feed folder. Then I asked for an about:memory and crash happened.

Crashed again today. This time I got a memory report before it crashed. 622mb of 1137mb in maildb.... 266mb on a feed that has a 220mb mbox and a 94mb msf database. If the memory Gz is of interest to anyone. Just ask.

(In reply to Matt from comment #18)

just had https://crash-stats.mozilla.com/report/index/e5b51280-f57a-45e6-851f-1a16b0190715 60.7.2 Crash Report [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown ]

Matt, see bug 1526127.

Does this also affect 64 bit Thunderbird?

If not, another good reason to start switching over.

I am using TB 64bit and have been for more than 12 months. Despite Wayne's pointing me to bug 1526127 I really do not see that as the correct bug for my crash as whilst the updater was supposedly running, it completed and post crash I was on the new version.

What I do have are a number of accounts that are "non functional" so I cancel password prompts, but most of my high memory issues appear to be related to a freenode certificate request that displays an incorrect certificate and just stops network functions until it is acknowledged.

I only have one account, but it keeps asking for password retry/renew. It is a Yahoo account. Can't find specifics on how this should be set up for server names, but I did it the way I found it once in the past. It gets mail, but re-requests frequently, and a popup in lower right hand says yahoo mail not available. Not sure if this is the same problem as you. Let me know if there is a way I can check or help with this. I'm on 69.0b2 (64-bit).

I clicked reply on a message in my gmail account. The compose window didn't come up. (Thunderbird had been running at 800MB and according to task manager there was not a big increase at the time of the crash)
bp-3f039407-0341-49ed-9b4b-458a50190814

Flags: needinfo?(jorgk)
Flags: needinfo?(benc)

Looks like something bad happened, and to add insult to injury, it then crashed in error reporting :-(

  void GetPath(nsACString &memoryPath, bool aAnonymize) {
    memoryPath.AssignLiteral("explicit/maildb/database(");
    nsCOMPtr<nsIMsgFolder> folder;
935  mDatabase->GetFolder(getter_AddRefs(folder));

Looks like the database was null here. We can protect against that.

Flags: needinfo?(jorgk)
Attached patch 1353704-fix-crash.patch (obsolete) — Splinter Review

This will fix the crash in reporting, but it won't fix the horrible thing that caused the report to be requested.

Assignee: nobody → jorgk
Status: NEW → ASSIGNED
Flags: needinfo?(benc)
Attachment #9085413 - Flags: review?(benc)

(In reply to Jorg K (GMT+2) from comment #26)

Created attachment 9085413 [details] [diff] [review]
1353704-fix-crash.patch

This will fix the crash in reporting, but it won't fix the horrible thing that caused the report to be requested.

Indeed. So with the crash "gone", what kind of behavior might users see, or what types of issues might we have now papered over?

I can say a little more about my environment at the time although not sure it will help 1. many tabs open and several compose windows, 2. TB was open for several days (but less than a week), 3. using ~800MB last I looked, which is more than normal for me but not high enough to be alarming (probably higher than normal because I had chat open and RSS feeds running which I don't normally have), 4. I was starting to reply to a message in my Inbox from cinymini of bug 1422251 and just before the crash I searched for her full address which includes @gmx.de but search would return any results, but searching on only cinymini was successful (so very strange)

(In reply to Wayne Mery (:wsmwk) from comment #27)

So with the crash "gone", what kind of behavior might users see, or what types of issues might we have now papered over?

We'll find out ;-)

Going beyond my incident, as suggested in earlier comments, I've been in touch with ~6 reporters since 2017. Some anecdotal information from them:

  • "It looks like if the folders are kept with a lower number of records it helps with alleviating the issue but not with fixing it." and (same user) " if I restart TB often it will not crash". (no RSS feeds) Other crashes from this user bp-e5a9a202-8e82-4c52-8d3b-f36f20190520

  • For some users the crash has simply gone away (again, like bug 1353702) after some period of time, and in some cases after switching machines. In the later case, bad memory certainly is a possibility. I'm really looking forward to the breakpad folks or someone hooking up a memory tester to the crash reporter - bug 995652

Summary: Crash in mozilla::mailnews::MsgDBReporter::GetPath → Crash in mozilla::mailnews::MsgDBReporter::GetPath and mozilla::mailnews::MsgDBReporter::CollectReports

I can find a reviewer soon and we can ship it in TB 69 beta 3 if you agree to do a new build.

Comment on attachment 9085413 [details] [diff] [review]
1353704-fix-crash.patch

First reviewer wins ;-)
Attachment #9085413 - Flags: review?(mkmelin+mozilla)
Attachment #9085413 - Flags: review?(acelists)
Comment on attachment 9085413 [details] [diff] [review]
1353704-fix-crash.patch

Review of attachment 9085413 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/db/msgdb/src/nsMsgDatabase.cpp
@@ +924,5 @@
>      GetPath(path, aAnonymize);
>      return aCb->Callback(
>          EmptyCString(), path, nsIMemoryReporter::KIND_HEAP,
>          nsIMemoryReporter::UNITS_BYTES,
> +        mDatabase ? mDatabase->SizeOfIncludingThis(GetMallocSize) : 0,

mDatabase is passed as 'db' in the MsgDBReporter constructor from mMemReporter = new mozilla::mailnews::MsgDBReporter(this); in nsMsgDatabase constructor. So how can 'this' (the nsMsgDatabase object) be ever null? Is this some memory corruption?

Hmm, good point. The problem is that there is no ref-counting here:

class MsgDBReporter final : public nsIMemoryReporter {
  nsMsgDatabase *mDatabase;

So most likely the database is already destroyed by the time we come to report. Let me try this again ;-)

This should be better.

Attachment #9085413 - Attachment is obsolete: true
Attachment #9085413 - Flags: review?(mkmelin+mozilla)
Attachment #9085413 - Flags: review?(benc)
Attachment #9085413 - Flags: review?(acelists)
Attachment #9085430 - Flags: review?(mkmelin+mozilla)
Attachment #9085430 - Flags: review?(benc)
Attachment #9085430 - Flags: review?(acelists)

Oops, commit message should be:
Use a ref-counted reference to the database for the DB reporter

The question is whether we want to keep a reference to the DB even somehow after its lifetime. This is something for jcranmer ;)

Sorry about the noise. We need to null-out mMemReporter so it gets destroyed or the reference it keeps will stop the database from being destroyed. Here's an example of a one-line change that can upset the entire system.

The funny thing is, the DTOR actually de-registers using UnregisterWeakMemoryReporter(), so after the database was destroyed, there shouldn't have been any more reporting. As many crashes, the whole thing is tricky. I won't take this to ESR in a hurry.

BTW, Joshua won't help us here, if you're lucky, he'll answer general questions on IRC.

Attachment #9085430 - Attachment is obsolete: true
Attachment #9085430 - Flags: review?(mkmelin+mozilla)
Attachment #9085430 - Flags: review?(benc)
Attachment #9085430 - Flags: review?(acelists)
Attachment #9085434 - Flags: review?(mkmelin+mozilla)
Attachment #9085434 - Flags: review?(benc)
Attachment #9085434 - Flags: review?(acelists)
Comment on attachment 9085434 [details] [diff] [review]
1353704-fix-crash-using-ref.patch (v2)

Review of attachment 9085434 [details] [diff] [review]:
-----------------------------------------------------------------

I think this comment is key:

> the DTOR actually de-registers using UnregisterWeakMemoryReporter(), so after the database was destroyed, there shouldn't have been any more reporting

The fact the reporter is falling down because the db is gone kind of implies something more serious going on...

Regarding the patch, I think it'll cause both the `nsMsgDatabase` and `MsgDBReporter` to lock each other into existence - we'll just gradually leak objects over time.
It seems like the lifetime of the reporter is tightly scoped to the `nsMsgDatabase` object, so the reporter shouldn't need a `RefPtr`.

Are we sure the `nsMsgDatabase` is being destroyed before the reporter callback is called? Or is it being just being corrupted in some way?

::: mailnews/db/msgdb/src/nsMsgDatabase.cpp
@@ +1002,4 @@
>        m_msgReferences(nullptr),
>        m_cacheSize(kMaxHdrsInCache) {
>    mMemReporter = new mozilla::mailnews::MsgDBReporter(this);
>    mozilla::RegisterWeakMemoryReporter(mMemReporter);

So at this point, the `MsgDBReporter` will be holding a RefPtr to  `nsMsgDatabase` which will keep it from being destroyed.
But `nsMsgDatabase::mMemReporter` is also a RefPtr which prevents the `MsgDBReporter` from being deleted.

So aren't the database and reporter now mutually locking each other in existence?

(I can't tell if `mozilla::RegisterWeakMemoryReporter` also holds the reporter object in existance or not, but I guess it's a moot point).

@@ +1006,5 @@
>  }
>  
>  nsMsgDatabase::~nsMsgDatabase() {
>    mozilla::UnregisterWeakMemoryReporter(mMemReporter);
> +  mMemReporter = nullptr;

Will the `~nsMsgDatabase()` dtor ever be called?
Attachment #9085434 - Flags: review?(benc) → review-
Comment on attachment 9085434 [details] [diff] [review]
1353704-fix-crash-using-ref.patch (v2)

Right. You have a better idea?
Attachment #9085434 - Attachment is obsolete: true
Attachment #9085434 - Flags: review?(mkmelin+mozilla)
Attachment #9085434 - Flags: review?(acelists)
Assignee: jorgk → nobody
Status: ASSIGNED → NEW

(In reply to Ben Campbell from comment #38)

Will the ~nsMsgDatabase() dtor ever be called?

Good question, see bug 1342858. Kent's personal prediction was that we'll never solve it unless we get rid of Mork. That does its own ref-counting and other hair-raising stuff :-(

(In reply to Jorg K (GMT+2) from comment #25)

Looks like the database was null here. We can protect against that.

Crash address was 0xe5e5e631. Windows' free replaces memory with 0xe5e5e5e5, so this looks like nsMsgDatabase * was used after free, and the vtable pointer was offset by 19 entries before attempting to call it.

Have you tried making the reporter hold an nsWeakPtr<nsMsgDatabase>? That should be legal, and it will be cleared to null after nsMsgDatabase::~nsMsgDatabase is called, allowing you to legitimately check for null (null checks would also have to be added, of course, but weak pointers do remind you to try that).

(In reply to Jorg K (GMT+2) from comment #40)

Good question, see 1342858. Kent's personal prediction was that we'll never solve it unless we get rid of Mork. That does its own ref-counting and other hair-raising stuff :-(

The Mork side of things isn't too bad, if my stale brain cells are any good. The strong references of mdbI* will hold whatever needs to be held, and they won't hold any cyclic references back into nsMsgDatabase and friends. The problem is nsMsgHdr's... funky... approach to ref-counting, IIRC.

(In reply to Joshua Cranmer [:jcranmer] from comment #41)

Crash address was 0xe5e5e631. Windows' free replaces memory with 0xe5e5e5e5, so this looks like nsMsgDatabase * was used after free, and the vtable pointer was offset by 19 entries before attempting to call it.

Ahh, now that sounds like the core problem to fix. Either somewhere is holding a raw pointer to nsMsgDatabase and accessing it after free, or perhaps somewhere has screwed-up refcounting (eg forgetting to AddRef() a return value or something?).

Have you tried making the reporter hold an nsWeakPtr<nsMsgDatabase>? That should be legal, and it will be cleared to null after nsMsgDatabase::~nsMsgDatabase is called, allowing you to legitimately check for null (null checks would also have to be added, of course, but weak pointers do remind you to try that).

The more I think about it, the more the holding-the-DB-open-in-the-reporter seems to be the wrong approach. The lifetime of the reporter object is meant to be bound to (and owned by) the nsMsgDatabase. The reporter is registered/unregistered in the nsMsgDatabase ctor/dtor.

Grepping through, it looks like raw nsMsgDatabase pointers are relatively rare:
https://searchfox.org/comm-central/search?q=nsMsgDatabase%5B+%5D%3F%5B*%5D&case=false&regexp=true&path=

There are a couple there that smell a little odd:

nsDBFolderInfo holds a raw nsMsgDatabase pointer (no refcounting that I can see):
https://searchfox.org/comm-central/source/mailnews/db/msgdb/public/nsDBFolderInfo.h#125

There's also one in nsMsgOfflineImapOperation, although it looks like its ctor/dtor does increment/decrement the refcount manually...
https://searchfox.org/comm-central/source/mailnews/db/msgdb/src/nsMsgOfflineImapOperation.h#48

Raw nsIMsgDatabase pointers are a little more prevalent:
https://searchfox.org/comm-central/search?q=nsIMsgDatabase%5B+%5D%3F%5B*%5D&regexp=true&path=

(I haven't picked through these ones at all)

Joshua's suggestion for reference.

Assignee: nobody → jorgk

Noting for the record, I crashed today bp-0b9d399f-ec1c-4995-9264-59e8c0190819. I had just fired up my laptop from sleep and was deleting messages from vseerror Inbox while network was coming up - unclear whether it had yet started to filter the 56 new messages. So completely different circumstances from user POV from comment 24

Assignee: jorgk → nobody

Roughly same rate of version 68 crashes as version 60 for the last 6 months. But at roughly 30 per day (if you include v60 crashes whose users will be upgraded) this is no longer a topcrash.

bp-c20aa3bb-808a-4757-a238-043f60200118 is a 68.4.1 example of mozilla::mailnews::MsgDBReporter::GetPath

What's worth a quick fix? And what's worth filing follow up bugs?
What do you think?

Flags: needinfo?(benc)

(In reply to Wayne Mery (:wsmwk) from comment #29)

In the later case, bad memory certainly is a possibility. I'm really looking forward to the breakpad folks or someone hooking up a memory tester to the crash reporter - bug 995652

I might be willing to try this if not too complicated and you give instructions.

Flags: needinfo?(benc)
Summary: Crash in mozilla::mailnews::MsgDBReporter::GetPath and mozilla::mailnews::MsgDBReporter::CollectReports → Crash in mozilla::mailnews::MsgDBReporter::GetPath and mozilla::mailnews::MsgDBReporter::CollectReports - lifetime of reporter object shoud be bound to (and owned by) the nsMsgDatabase
Comment on attachment 9085680 [details] [diff] [review]
1353704-fix-crash-using-ref.patch

Thanks Jorg for the updated patch.
Changing NI to review request
Attachment #9085680 - Flags: review?(benc)
Comment on attachment 9085680 [details] [diff] [review]
1353704-fix-crash-using-ref.patch

Review of attachment 9085680 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. The ptr-weakening 'feels' like it's happening in the right place :-)
Attachment #9085680 - Flags: review?(benc) → review+
Assignee: nobody → jorgk-bmo
Status: NEW → ASSIGNED
Target Milestone: --- → Thunderbird 78.0

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/e47ba37adeb2
Use a ref-counted reference to the database for the DB reporter. r=BenC

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: