Open Bug 1677202 Opened 5 years ago Updated 27 days ago

heap-use-after-free with C-C TB during shutdown, nsMsgDBService::~nsMsgDBService(), debug build, only

Categories

(Thunderbird :: General, defect)

x86_64
Linux
defect

Tracking

(Not tracked)

People

(Reporter: ishikawa, Assigned: ishikawa)

References

(Blocks 1 open bug)

Details

(Keywords: csectype-uaf, sec-other)

Attachments

(6 files)

I noticed a heap-use-after-free error during local mochitest execution with locally created ASAN version of C-C TB.

Here is an excerpt from such error stacktrace.
Full logs of three cases are attached. They all point to similar cause, I think.

229:46.98 TEST_START: comm/mail/test/browser/keyboard/browser_spacehit.js
229:46.98 GECKO(106748) Chrome file doesn't exist: /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/_tests/testing/mochitest/browser/comm/mail/test/browser/keyboard/head.js
	  ...

230:18.34 INFO TEST-START | Shutdown
230:18.34 INFO Browser Chrome Test Summary
230:18.34 INFO Passed:	4
	  ...
230:26.09 GECKO(106748) =================================================================
230:26.09 GECKO(106748) ==106748==ERROR: AddressSanitizer: heap-use-after-free on address 0x614000557ac8 at pc 0x7f8a7d5f62f5 bp 0x7ffe0f775230 sp 0x7ffe0f775228
230:26.09 GECKO(106748) READ of size 8 at 0x614000557ac8 thread T0
230:26.82 GECKO(106748) error: address range table at offset 0xd50 has an invalid tuple (length = 0) at offset 0xd60
230:27.16 GECKO(106748)	    #0 0x7f8a7d5f62f4 in operator-> /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/nsCOMPtr.h:875:5
230:27.16 GECKO(106748)	    #1 0x7f8a7d5f62f4 in nsMsgDBService::~nsMsgDBService() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:88:14
230:27.16 GECKO(106748)	    #2 0x7f8a7d5f5c88 in nsMsgDBService::Release() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:72:1
230:27.18 GECKO(106748)	    #3 0x7f8a7e0d61a9 in ~nsCOMPtr_base /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCOMPtr.h:329:7
230:27.18 GECKO(106748)	    #4 0x7f8a7e0d61a9 in mozilla::SegmentedVector<nsCOMPtr<nsISupports>, 4096ul, mozilla::MallocAllocPolicy>::SegmentImpl<509ul>::~SegmentImpl() /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/SegmentedVector.h:78:21
230:27.18 GECKO(106748)	    #5 0x7f8a7e0d5b4d in mozilla::SegmentedVector<nsCOMPtr<nsISupports>, 4096ul, mozilla::MallocAllocPolicy>::PopLastN(unsigned int) /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/SegmentedVector.h:246:14
230:27.18 GECKO(106748)	    #6 0x7f8a7e0b3528 in mozilla::dom::DeferredFinalizerImpl<nsISupports>::DeferredFinalize(unsigned int, void*) /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/dom/BindingUtils.h:2721:15
230:27.18 GECKO(106748)	    #7 0x7f8a7e08f0f4 in mozilla::IncrementalFinalizeRunnable::ReleaseNow(bool) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/CycleCollectedJSRuntime.cpp:1549:17
230:27.18 GECKO(106748)	    #8 0x7f8a7e09044d in mozilla::CycleCollectedJSRuntime::FinalizeDeferredThings(mozilla::CycleCollectedJSContext::DeferredFinalizeType) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/CycleCollectedJSRuntime.cpp:1622:24
230:27.18 GECKO(106748)	    #9 0x7f8a7e08b013 in mozilla::CycleCollectedJSRuntime::OnGC(JSContext*, JSGCStatus, JS::GCReason) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/CycleCollectedJSRuntime.cpp:1692:7
230:27.24 GECKO(106748)	    #10 0x7f8a9175d3e2 in js::gc::GCRuntime::maybeCallGCCallback(JSGCStatus, JS::GCReason) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:7168:3
230:27.24 GECKO(106748)	    #11 0x7f8a9175f991 in ~AutoCallGCCallbacks /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:7147:32
230:27.24 GECKO(106748)	    #12 0x7f8a9175f991 in js::gc::GCRuntime::gcCycle(bool, js::SliceBudget, mozilla::Maybe<JSGCInvocationKind> const&, JS::GCReason) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:7261:1
230:27.24 GECKO(106748)	    #13 0x7f8a917649ba in js::gc::GCRuntime::collect(bool, js::SliceBudget, mozilla::Maybe<JSGCInvocationKind> const&, JS::GCReason) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:7484:9
230:27.24 GECKO(106748)	    #14 0x7f8a917751e5 in gc /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:7560:3
230:27.24 GECKO(106748)	    #15 0x7f8a917751e5 in JS::NonIncrementalGC(JSContext*, JSGCInvocationKind, JS::GCReason) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:8397:21
230:27.25 GECKO(106748)	    #16 0x7f8a7e0f8a59 in nsCycleCollector::FixGrayBits(bool, TimeLog&) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3274:19
230:27.25 GECKO(106748)	    #17 0x7f8a7e0fa369 in nsCycleCollector::BeginCollection(ccType, nsICycleCollectorListener*) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3575:3
230:27.25 GECKO(106748)	    #18 0x7f8a7e0f994f in nsCycleCollector::Collect(ccType, js::SliceBudget&, nsICycleCollectorListener*, bool) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3409:9
230:27.25 GECKO(106748)	    #19 0x7f8a7e0f93e8 in nsCycleCollector::ShutdownCollect() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3352:20
230:27.25 GECKO(106748)	    #20 0x7f8a7e0fbfc1 in nsCycleCollector::Shutdown(bool) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3641:5
230:27.25 GECKO(106748)	    #21 0x7f8a7e0ffc2c in nsCycleCollector_shutdown(bool) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3956:18
230:27.26 GECKO(106748)	    #22 0x7f8a7e47e43a in mozilla::ShutdownXPCOM(nsIServiceManager*) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/build/XPCOMInit.cpp:716:3
230:27.26 GECKO(106748)	    #23 0x7f8a8f896cb3 in ScopedXPCOMStartup::~ScopedXPCOMStartup() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsAppRunner.cpp:1648:5
230:27.26 GECKO(106748)	    #24 0x7f8a8f8c2198 in operator() /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/UniquePtr.h:460:5
230:27.27 GECKO(106748)	    #25 0x7f8a8f8c2198 in mozilla::UniquePtr<ScopedXPCOMStartup, mozilla::DefaultDelete<ScopedXPCOMStartup> >::reset(ScopedXPCOMStartup*) /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/UniquePtr.h:302:7
230:27.27 GECKO(106748)	    #26 0x7f8a8f8b3f55 in operator= /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/UniquePtr.h:275:5
230:27.27 GECKO(106748)	    #27 0x7f8a8f8b3f55 in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsAppRunner.cpp:5300:16
230:27.27 GECKO(106748)	    #28 0x7f8a8f8b5c01 in XRE_main(int, char**, mozilla::BootstrapConfig const&) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsAppRunner.cpp:5339:21
230:27.27 GECKO(106748)	    #29 0x5598f371ee95 in do_main /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mail/app/nsMailApp.cpp:217:22
230:27.27 GECKO(106748)	    #30 0x5598f371ee95 in main /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mail/app/nsMailApp.cpp:324:16
230:27.32 GECKO(106748)	    #31 0x7f8a9ea96cc9 in __libc_start_main csu/../csu/libc-start.c:308:16
230:27.32 GECKO(106748)	    #32 0x5598f3674029 in _start (/NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/bin/thunderbird+0xe9029)

This may be the reason why we see crash during shutdown.

Three full logs are attached in the subsequent comments.

Although I think this is very diffucult to exploit, who knows.
I am setting the security flag. Please clear it as people see fit.

https://searchfox.org/comm-central/rev/43a28c7a2eee7e5cd4ffecd6628450db2a4d9abe/mailnews/db/msgdb/src/nsMsgDatabase.cpp#88 well it's inside ifdef debug... so probably very hard to exploit ;)

I suppose the if clause should be:

if (pMessageDB && pMessageDB->m_dbFile)

(In reply to Magnus Melin [:mkmelin] from comment #4)

https://searchfox.org/comm-central/rev/43a28c7a2eee7e5cd4ffecd6628450db2a4d9abe/mailnews/db/msgdb/src/nsMsgDatabase.cpp#88 well it's inside ifdef debug... so probably very hard to exploit ;)

Right. I have noticed that by now. :-)

How does one clear the security flag?

I suppose the if clause should be:

if (pMessageDB && pMessageDB->m_dbFile)

I found out that things are not that simple.

I traced the values in #ifdef'ed code one by one.
It turns out pMessageDB points to an already freed heap area sometimes.
So pMessage->m_dbFile may not be null, but contains garbage.
This is timing-dependent. Sometimes the ASAN error is not triggered (!)
And that is why we can get crash SOMETIMES and not always, although in the mochitest flow, it seems always to happen at least on my PC.
However, if I run mochitest comm/mail/test/browser/keyboard/browser_spacehit.js by itself, the error does NOT happen(!).

So the issue is that there is a pointer to already freed structure in m_dbCache sometimes. Why?

I checked the code in nsMsgDatabase::~nsMsgDatabase()
and found that there is a following call in order to remove such entry in m_dbCache.
So this is probably not triggered all the time.

static_cast<nsMsgDBService*>(serv.get())->RemoveFromCache(this);

Problem is that it is protected by if (serv), of course.
I have traced and indeed there are cases when the call to RemoveFromCache() does not happen due to
|serv| is null.
I think this is because during shutdown phase, the XPCOM service has been shutdown and unavailable when we need it.

This is what I have figured out so far.

Maybe I can create a locally available code to clear such an entry in |m_dbCache|.
After all, we have routines to dump the entries in |m_dbCache| without going through XPCOM hoops. So maybe I can do it.

Typical local log now.

Note the xpcom-will-shutdown notification at the top.

 1:43.17 GECKO(195723) 1605327166749	Marionette	TRACE	Received observer notification xpcom-will-shutdown
 1:43.17 GECKO(195723) 1605327166750	Marionette	INFO	Stopped listening on port 2828
 1:43.17 GECKO(195723) 1605327166751	Marionette	DEBUG	Marionette stopped listening
 1:43.17 GECKO(195723) [195723, Main Thread] WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/events/DOMEventTargetHelper.cpp:291
 1:43.17 GECKO(195723) [195723, Main Thread] WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/events/DOMEventTargetHelper.cpp:291
 1:43.17 GECKO(195723) ~nsMsgDatabase(): closing database    /COMM-CENTRAL/TMP-DIR/tmp219v2et6.mozrunner/Mail/Local Folders/Sample.msf
 1:43.17 GECKO(195723) Not calling RemoveFromCache(this), this = 0x6140002a0c40
 1:43.17 GECKO(195723) [195723, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length()', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:80
 1:43.17 GECKO(195723) db left open pMessageDB=0x6140002a0c40, =================================================================
 1:43.17 GECKO(195723) ==195723==ERROR: AddressSanitizer: heap-use-after-free on address 0x6140002a0cc8 at pc 0x7fa55eb6c3c3 bp 0x7ffe11136550 sp 0x7ffe11136548
 1:43.17 GECKO(195723) READ of size 8 at 0x6140002a0cc8 thread T0
 1:43.17 GECKO(195723) error: address range table at offset 0xd50 has an invalid tuple (length = 0) at offset 0xd60
 1:43.17 GECKO(195723) ==195723==ERROR: AddressSanitizer: heap-use-after-free on address 0x6140002a0cc8 at pc 0x7fa55eb6c3c3 bp 0x7ffe11136550 sp 0x7ffe11136548
 1:43.17 GECKO(195723)     #0 0x7fa55eb6c3c2 in get /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/nsCOMPtr.h:856:48
 1:43.17 GECKO(195723)     #1 0x7fa55eb6c3c2 in operator nsIFile * /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/nsCOMPtr.h:864:33
 1:43.17 GECKO(195723)     #2 0x7fa55eb6c3c2 in nsMsgDBService::~nsMsgDBService() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:94:43
 1:43.17 GECKO(195723)     #3 0x7fa55eb6bc88 in nsMsgDBService::Release() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:72:1

1:43.17 GECKO(195723) Not calling RemoveFromCache(this), this = 0x6140002a0c40

This means that pointer to the to-be-released structure remains in m_dbCache.
Sure enough this pointer is referenced in nsMsgDBService::~nsMsgDBService().

So we may want to remove this pointer from the cache somehow.
I wonder how.

That will remove this pesky warning that has bothered me for a long time.

 1:43.17 GECKO(195723) [195723, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length()', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:80
Blocks: 1342858
See Also: → 1468691, 1377692

So my analysis based on local log is this.

When ~nsMsgDatabase() is called,
this also calls RemoveFromCache() to remove the being-deleted nsMsgDatabase from m_dbCache.

Unfortunately, during shutdown time, XPCOM shutdown proceeds asynchronously and this RemoveFromCache() cannot be called for the databases deleted near the end sometimes.

So that leaves pointers (to nsMsgDatabase that has already been released !) in m_dbCache.
This causes the reference to released heap in ~nsMsgDBService().

(I think deletion of nsMsgDatabase should always come before nsMsgDBService. I am not sure if this is strictly followed.)

I tried to see if I can somehow make the old pointer to nsMsgDBService() to call RemoveFromCache() even if XPCOM service itself is shutdown. The timing-dependence kicked in. Sometimes it works. But sometimes we get reference problem to already released heap again inside RemoveFromCache() even.

So my tentative final solution is as follows.

Count the number of times that RemoveFromCache() cannot be called due to early XPCOM shutdown.

Then in ~nsMsgDBService(), I check not only m_dbCache.Length(), but whether it is smaller than or equal to the # of failures to call RemoveFromCache() because the service was not available any more.
We allow such left over pointers to m_dbCache referenced by ~nsMsgDatabase() as inevitable.

We don't print the db left open message if m_dbCache.Length() <= the # of failures to invoke RemoveFromCache(...) by exiting early in ~nsMsgDBService().

So far so good locally. : No more db left open message for the three tests that produced sanitizer: heap-use-after-free.
No more heap-use-after-free (although there ARE leaks.) We won't crash, that is more important here.

I am going to run the whole mochitest locally and then, if it succeeds, submit the tryserver run.

Even though, I have somehow avoid the crash by the bandaid kludge,
the racy issue is obviously there.
To wit, an excerpt in the local log of mochitest with ASAN version of C-C TB.
There is only one place where "db left open..." message appears.
But you can see that ~nsMsgDBService() is called and executed BEFORE
~nsMsgDatabase() as far as the output messages are concerned and indeed, the counts printed suggest that
~nsMsgDBService() is executed BEFORE the RemoveFromCache() in ~nsMsgDatabase() is executed.

226:00.40 GECKO(307753) [307753, Main Thread] WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/events/DOMEventTargetHelper.cpp:291
226:00.86 GECKO(307753) [307753, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length() || m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:88
226:00.86 GECKO(307753) m_dbCache.Length() = 1, unremoved_db_from_cache_due_to_unavailable_XPCOM=0
226:00.86 GECKO(307753) db left open pMessageDB=0x614000138440, p->m_dbFile=0x610000558240, /COMM-CENTRAL/TMP-DIR/tmp1rlqki5k.mozrunner/Mail/Local Folders/JunkCommandsA.msf
226:00.86 GECKO(307753) ~nsMsgDatabase(): closing database    /COMM-CENTRAL/TMP-DIR/tmp1rlqki5k.mozrunner/Mail/Local Folders/JunkCommandsA.msf
226:00.86 GECKO(307753) Not calling RemoveFromCache(this), this = 0x614000138440
226:01.91 GECKO(307753) [307753, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3359

We certainly need better ordering of destruction of objects if possible.

Group: mail-core-security

This patch avoids the accessing of left over stale database pointers in the cache.
The pointers are to be removed by calling RemoveFromCache() explicitly, but during shutdown,
the ordering of subsystem shutdowns are not deterministic and
RemoveFromCache() may not be callable, and thus there are some stale pointers in the cache.

This is a bandage to not to access such cached pointers if we know that such pointers are likely to be the pointers that TB tried to remove from cache but due to the early subsystem shutdown, it could not.

The true solution would be to introduce proper ordering of
shutting down XPCOM subsystems.

Assignee: nobody → ishikawa
Attachment #9188190 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9188190 [details] [diff] [review] Try counting the database pointer left in the cache due to early XPCOM shutdown and avoid accessing them later. Review of attachment 9188190 [details] [diff] [review]: ----------------------------------------------------------------- Sorry but I don't think this is something we should do. It also adds far too much debug type spew here. It could be a useful local-debugging patch perhaps.
Attachment #9188190 - Flags: review?(mkmelin+mozilla) → review-

(In reply to Magnus Melin [:mkmelin] from comment #10)

Comment on attachment 9188190 [details] [diff] [review]
Try counting the database pointer left in the cache due to early XPCOM
shutdown and avoid accessing them later.

Review of attachment 9188190 [details] [diff] [review]:

Sorry but I don't think this is something we should do. It also adds far too
much debug type spew here. It could be a useful local-debugging patch
perhaps.

I can take out the printf() by surrounding it in #ifdef. I think the patch in one form or the other would be necessary.

REASON #1. This is necessary to run DEBUG version of ASAN build C-C TB successfully to completion during local mochitest.
REASON #2. This solves or exposes why the dreaded "some db msgs left open" message at TB shutdown that puzzled developers for some time (at least four years).
Please see Bug 1342858
WARNING: some msg dbs left open: '!m_dbCache.Length()' - for last folder that was open in the window and a message was viewed in it?
REASON #3: It is not THAT verbose in contrast to the current situation and actually gives enough information to let developers off the hook by giving information in relation to REASON #2.
Log BEFORE version

...
    44:34.94 GECKO(809035) [809035, Main Thread] WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/events/DOMEventTargetHelper.cpp:291
    44:35.02 GECKO(809035) [809035, Main Thread] WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/events/DOMEventTargetHelper.cpp:291
*** 44:35.38 GECKO(809035) [809035, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length()', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:80
    44:35.73 GECKO(809035) [809035, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3359
    44:35.84 GECKO(809035) [809035, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/modules/libpref/Preferences.cpp:4287

With the log line marked with "***", some developers got really confused why the "dbs left open" message and it does not print even the database name if it is the official build even if DEBUG is set.
The reason of not printing the name was trying to print the name crashed the build.
But the patch here solves the crash issue and we can safely print the name now.
We also know the reason why it was not removed from the cache.
The issue of not closing the msg dbs will persist until the order of XPCOM subsystem shutdowns will be taken care of.)

Log AFTER version

    247:05.42 GECKO(622036) [622036, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length() || m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:88
+++ 247:05.42 GECKO(622036) m_dbCache.Length() = 1, unremoved_db_from_cache_due_to_unavailable_XPCOM=0
+A+ 247:05.42 GECKO(622036) db left open pMessageDB=0x6140000e7a40, p->m_dbFile=0x610000571e40, /COMM-CENTRAL/TMP-DIR/tmpw1j8jkkn.mozrunner/Mail/Local Folders/Sample.msf
+B+ 247:05.42 GECKO(622036) ~nsMsgDatabase(): closing database    /COMM-CENTRAL/TMP-DIR/tmpw1j8jkkn.mozrunner/Mail/Local Folders/Sample.msf
+C+ 247:05.42 GECKO(622036) Not calling RemoveFromCache(this), this = 0x6140000e7a40
    247:06.31 GECKO(622036) [622036, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3359

The messages are printed only at TB's shutdown time. They are not numerous.
Also, the current patch is for exposing the issue clearly including why and the underlying cause of XPCOM shutdown issue.
I can simplify the line marked with A by eliminating pointer values as follows.

247:05.42 GECKO(622036) db left open /COMM-CENTRAL/TMP-DIR/tmpw1j8jkkn.mozrunner/Mail/Local Folders/Sample.msf

I can modify C by eliminating the pointer value as follows.

247:05.42 GECKO(622036) Not calling RemoveFromCache(this) due unavailable XPCOM service.

That would make the log look like:

    247:05.42 GECKO(622036) [622036, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length() || m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:88
    247:05.42 GECKO(622036) m_dbCache.Length() = 1, unremoved_db_from_cache_due_to_unavailable_XPCOM=0
    247:05.42 GECKO(622036) db left open /COMM-CENTRAL/TMP-DIR/tmpw1j8jkkn.mozrunner/Mail/Local Folders/Sample.msf
+B+ 247:05.42 GECKO(622036) ~nsMsgDatabase(): closing database    /COMM-CENTRAL/TMP-DIR/tmpw1j8jkkn.mozrunner/Mail/Local Folders/Sample.msf
    247:05.42 GECKO(622036) Not calling RemoveFromCache(this) due unavailable XPCOM service.
    247:06.31 GECKO(622036) [622036, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3359

I will court the opinion of couple of developers in bug 1342858.

Flags: needinfo?(alta88)
Flags: needinfo?(acelists)

I hope we can find out how to resolve the problem properly instead.

That requires the proper ordering of subsystem shutdown, which has plagued TB and FF for many years.

Also, at least in one other case, I remember that TB code is written in such a manner that destruction of of an object initiated by
GC during shutdown (that is the destruction timing is NOT programmer control at all), and in one such case, by the time shutdown triggers the GC and eventually this destruction of certain object is done, the function called from the destructor of the object is no longer callable because the XPCOM has shutdown already.
If I recall correctly, it is about the deletion of a temporary file which may contain a piece of mail message (private information), but TB code simply discards the object that refers to the file, and let its automatic destruction by GC calls the destructor automatically, and the destructor calls the file system removal API after obtaining the file path name from the stored file object (can't recall which class).
Unfortunately, mapping of the object to the file system pathname fails because the service to provide the mapping has already shutdown.
Thus the temporary file is left around. Tough.

So unless we have a sure way to invoke the destruction of all the message databases BEFORE the db cache service is shutdown, we will be in trouble.
Besides, I don't know if this applies here, but as the destruction by automatic GC indicates, by the time shutdown time comes, TB may not know all the db msgs that are open. (Hmm, that sounds unlikely).
In any case, we may have to look out for deadlock if we introduce a forced order of shutdown.

It is not easy. The kludge here has exposed the underlying issue.

(In reply to ISHIKAWA, Chiaki from comment #13)

That requires the proper ordering of subsystem shutdown, which has plagued TB and FF for many years.

Also, at least in one other case, I remember that TB code is written in such a manner that destruction of initiated by
GC during shutdown (that is the destruction timing is NOT programmer control at all), and in one such case, by the time, initiating shutdown triggers the GC, and eventually this destruction of certain object is done, but the function called from the destructor of the object is no longer callable because the XPCOM has shutdown.

If I recall correctly, it is about the deletion of a temporary file which may contain a piece of mail message (private information), but TB code simply discards the object that refers to the file, and let its automatic destruction by GC calls the destructor automatically, and the destructor calls the file system removal API after obtaining the file path name from the stored file object (can't recall which class).

Unfortunately, mapping of the object to the file system pathname fails because the service to provide the mapping has already be shutdown.
Thus the temporary file is left around. Tough.

I think the bug I was referring to is
Bug 1595343 C-C TB: nsMsgAttachment::DeleteAttachment fails to delete attachment file occasionally during shutdown, possibily leaving orphaned files in temp/tmp directory

Flags: needinfo?(alta88)
Comment on attachment 9188190 [details] [diff] [review] Try counting the database pointer left in the cache due to early XPCOM shutdown and avoid accessing them later. Review of attachment 9188190 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for analyzing this tough problem. ::: mailnews/db/msgdb/src/nsMsgDatabase.cpp @@ +83,5 @@ > + // some entries cannot be removed even if we tried. > + // So we we count such unremoved entries and ignore them. Bug 1677202 > + NS_WARNING_ASSERTION( > + !m_dbCache.Length() || > + m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM, Does this actually hide the strange "some msg dbs left open" message in some cases? Do we just hide it, but the underlying problem of XPCOM shutdown still exists? @@ +119,5 @@ > + (void *) pMessageDB); > + fflush(stdout); > + printf("p->m_dbFile=%p, ", (void *) pMessageDB->m_dbFile); > + fflush(stdout); > + printf("%s\n", pMessageDB->m_dbFile->HumanReadablePath().get()); So we print out the folder path, but is it useful for anything? Will it be one of the last accessed folders? But we can't do anything about it. @@ +887,5 @@ > +// > +// This does not work. Depending on timing, m_dbCache.Length() accesses already > +// released heap. > +// > +// void nsMsgDBService::RemoveFromCacheDuringShutdown(nsMsgDatabase* pMessageDB){ Adding commented out code for documentation purposes? @@ +1137,5 @@ > + // comm/mail/test/browser/keyboard/browser_spacehit.js is gone. > + // But the problem with the following test persists. > + // comm/mail/test/browser/multiple-identities/browser_displayNames.js > + // > + // m_mdbEnv->Release(); //??? is this right? So when do we release this now if not here?
Flags: needinfo?(acelists)

Hi,

Thank you for your comment.

(In reply to :aceman from comment #15)

Comment on attachment 9188190 [details] [diff] [review]
Try counting the database pointer left in the cache due to early XPCOM
shutdown and avoid accessing them later.

Review of attachment 9188190 [details] [diff] [review]:

Thanks for analyzing this tough problem.

::: mailnews/db/msgdb/src/nsMsgDatabase.cpp
@@ +83,5 @@

  • // some entries cannot be removed even if we tried.
  • // So we we count such unremoved entries and ignore them. Bug 1677202
  • NS_WARNING_ASSERTION(
  • !m_dbCache.Length() ||
  • m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM,

Does this actually hide the strange "some msg dbs left open" message in some
cases? Do we just hide it, but the underlying problem of XPCOM shutdown
still exists?

It avoids the "some msg dbs left open" messages ALL THE TIME in my local testing and past try server runs.
More importantly, now it avoids touching invalid pointers and thus avoids the assertion abort,
and we can now print the offending DB's pathname. (See my comment on pathname in the next comment section.)

Now, why this matters is this.
In DEBUG build that prints memory leaks at the end had aborted prematurely if this assertion is hit.
We could not obtain memory leak information at all due the abort.
However, with this patch, the DEBUG version could terminate successfully and all the nice dumps produced at exit time is produced in full. (My local log immediately became larger thanks to this.)

@@ +119,5 @@

  •         (void *) pMessageDB);
    
  •  fflush(stdout);
    
  •  printf("p->m_dbFile=%p, ", (void *) pMessageDB->m_dbFile);
    
  •  fflush(stdout);
    
  •  printf("%s\n", pMessageDB->m_dbFile->HumanReadablePath().get());
    

So we print out the folder path, but is it useful for anything? Will it be
one of the last accessed folders? But we can't do anything about it.

Yes, right now, we can't do anything. I am printing this in order for anyone to
to figure out the underlying problem of unclosed(?) dbfile at program shutdown.
In each test, where such an unclosed db is left before shutdown, the path name is printed.
A serious debugger should be able to figure out what folder is left unclosed (and hopefully where it is not closed, etc.).

@@ +887,5 @@

+//
+// This does not work. Depending on timing, m_dbCache.Length() accesses already
+// released heap.
+//
+// void nsMsgDBService::RemoveFromCacheDuringShutdown(nsMsgDatabase* pMessageDB){

Adding commented out code for documentation purposes?

Yes. I tried a few tricks which did not work. Accessing m_dbCache.Length() did not work during SHUTDOWN phase. :-(

@@ +1137,5 @@

  • // comm/mail/test/browser/keyboard/browser_spacehit.js is gone.
  • // But the problem with the following test persists.
  • // comm/mail/test/browser/multiple-identities/browser_displayNames.js
  • //
  • // m_mdbEnv->Release(); //??? is this right?

So when do we release this now if not here?

Hm... I wonder if this ~nsMsgDatabase() is called only once in TB's execution or is called many times.
If it is called many times, I need to see why m_mdbEnv->Release() crashed. But I am afraid that the crash happens only during SHUTDOWN and we may need to treat this Release() specially. Either we may not call it for now to avoid the mysterious crash() , especially if _nsMsgDatabase() only once (?)

[But I have a suspicion that maybe with the counting hack, this may not abort any more.
Unfortunately, I cannot test this easily now. For the past few weeks, I cannot test local repo due to a corruption caused by an HG problem right now. The fix that uses |hg pull| mentioned in Bug 1720302 does not work. It does a lot of I/O and the repo seems to become pristine. However, as soon as I try to do something using HG MQUEUE the phantom modification appears and I could not apply changes, etc. :-(
Trying to submit a job is also screwed up due to the same reason. I want to remove unwanted local patches, but any operation that touches the local repo refuses to operate due to the same issue.
Maybe I should try the script mentioned in gist in the same bug, but I am a bit afraid that maybe MQUEUE is not really compatible with the fixes proposed now. Ugh...

Lately, sanitized version of C-C TB does not run much because of UBSAN runtime errors here and there.
I have fixed all of them except for this issue locally and submitted patches.

Anyway, to refresh collective memory, here is the log from recent run of sanitized version of C-C TB (with my posted patches) locally under linux.

I have modifeid|~nsMsgDBService()| as follows.
Printing of |pMessageDB->m_dbFile| is the cause.
That is pMessageDB in the code below contains a non-null pointer to an already freed area during shutdown.

Due to async nature, the issue happens during shutdown FOLLOWING a test, and the test that triggers this issue may be different each time.
The log attached was found during local mochitest, and the test that caused it was
comm/mail/test/browser/folder-display/browser_virtualFolderCommands.js

’’’
nsMsgDBService::~nsMsgDBService() {
#ifdef DEBUG
// If you hit this warning, it means that some code is holding onto
// a db at shutdown.
// NS_WARNING_ASSERTION(!m_dbCache.Length(), "some msg dbs left open");
// Problem is as follows.
// RemoveFromCache() fails due to early XPCOM shutdow and
// some entries cannot be removed even if we tried.
// So we we count such unremoved entries and ignore them appropriately.
// See Bug 1677202
NS_WARNING_ASSERTION(!m_dbCache.Length() ||
m_dbCache.Length() <=
unremoved_db_from_cache_due_to_unavailable_XPCOM,
"some msg dbs left open");

printf(
"m_dbCache.Length() = %lu, "
"unremoved_db_from_cache_due_to_unavailable_XPCOM=%u\n",
(unsigned long)m_dbCache.Length(),
unremoved_db_from_cache_due_to_unavailable_XPCOM);
#endif

#if 1
// Only print this on local builds since it causes crashes,
// see bug 1468691, bug 1377692 and bug 1342858.

// The above comment was there because of the following issue.
// Timing-dependent issue: Sometimes, pMessageDB points to
// already freed area.
// So if the unremoved entries are due to early
// XPCOM shutdown at exit time, we don't bother to look into them.
// We figure it out by comparing the cache size and the count of
// known unremoved entries.
// See bug 1677202.
if (m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM)
return;

printf(
"m_dbCache.Length() = %lu, "
"unremoved_db_from_cache_due_to_unavailable_XPCOM=%u\n",
(unsigned long)m_dbCache.Length(),
unremoved_db_from_cache_due_to_unavailable_XPCOM);

for (uint32_t i = 0; i < m_dbCache.Length(); i++) {
nsMsgDatabase* pMessageDB = m_dbCache.ElementAt(i);
// There are times when the area pointed by pMessageDB has
// already been released by the time control reaches here.
// So what should we do?
// But the above early return avoids such cases hopefully.
// The separate outputs for variables make the analysis easy for ASAN dump.
if (pMessageDB) {
printf("db left open pMessageDB=%p, ", (void*)pMessageDB);
fflush(stdout);
printf("p->m_dbFile=%p, ", (void*)pMessageDB->m_dbFile);
fflush(stdout);
printf("\n");
fflush(stdout);

if 0

  // depending on the timing, etc. m_dbFile is an already released pointer.
  // The following was observed on tryserver in May 2022
  // some msg dbs left open: '!m_dbCache.Length() || m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM', file /builds/worker/checkouts/gecko/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:141
  // m_dbCache.Length() = 2, unremoved_db_from_cache_due_to_unavailable_XPCOM=1
  // db left open pMessageDB=0x7eff74e2c760, p->m_dbFile=0xe5e5e5e5e5e5e5e5
  // referencing m_dbFile (= 0xe5e5e5...) caused core dump.
  printf("%s\n", pMessageDB->m_dbFile->HumanReadablePath().get());
  fflush(stdout);

endif

}

}
#endif
}
’’’

I thjnk an early XPCOM shutdown and Garbage Collection are to blame, but we may force an ordering of subsystem shutdowns and that may improve the situation.

Just a hunch. Are we forgetting to make a pointer to be refcounted when we put it into m_dbCache?
This may explain why a pointer in m_dbCache becomes stale during GC phase.

I have built sanitized version of C-C TB using patches in

to avoid premature termination due to other ASAN or UBSAN issues.

During the mochitest today, I saw three ASAN issues, heap-use-after-free from the code I added
to dump the m_dbCache structure.
The line numbers may be slightly off from the pristine C-C TB source files due to my local patches to dump values, etc.

These all occur during the shutdown processing of Garbage Collector:
~ScopedXPCOMStartup -> Shutdown XCPCOM -> nsCycleCollector_shutdown -> nsCycleCollector::Shutdown -> ...

Like I mentioned, I THINK m_dbCache.ElementAt(i) is not properly protected from GC.
That is, if m_dbCache.ElementAt(i) contains a non-null pointer, the object pointed by this pointer should NOT be reclaimed by GC (!).

  for (uint32_t i = 0; i < m_dbCache.Length(); i++) {
    nsMsgDatabase* pMessageDB = m_dbCache.ElementAt(i);
    // There are times when the area pointed by pMessageDB has
    // already been released by the time control reaches here.
    // So what should we do?
    // But the above early return avoids such cases hopefully.
    // The separate outputs for variables make the analysis easy for ASAN dump.
    if (pMessageDB) {
      printf("db left open pMessageDB=%p, ", (void*)pMessageDB);
      fflush(stdout);
      printf("p->m_dbFile=%p, ", (void*)pMessageDB->m_dbFile);
      fflush(stdout);
      printf("\n");
      fflush(stdout);

Cache certainly contains a copy of something.

How do we make sure an object of which pointer is in m_dbCache[...] (m_dbCache.ElementAt(i)) is not reclaimed by GC during shutdown?

Is there a mechanism to explicitly reference such an object from a system-wide array of pointers that contain a reference to objects so that GC won't reclaim the object since there IS valid reference to the object?

I have seen such mechanisms in Lisp Interpreters to make sure premature GC does not occur when a system function/primitive is operating.

For example, such a mechanism is in GNU Emacs, for example. When a C source code is written to reference an object used inside Emacs-Lisp interpreter, such C source code uses a macro to store the pointer to an object into a static array inside Emacs-Lisp so that a GC triggered by a function call chain from the C source code will not, by chance, eliminate the object. : the reference to the object is in the static array in Emacs-Lisp, and so the object is not eliminated by GC.
After the processing in C source code, the pointer should removed from the static array to avoid the memory leak. There is a macro to do this. The object will be properly eliminated during the subsequent GC if there is no other reference to the object.

My two cent worth.

I assume a smart pointer would handle that? RefPtr<nsMsgDatabase> instaed of nsMsgDatabase*

(In reply to Magnus Melin [:mkmelin] from comment #20)

I assume a smart pointer would handle that? RefPtr<nsMsgDatabase> instaed of nsMsgDatabase*

Thank you for the tips.
Pointers are stored into m_dbCache in the following code.
https://searchfox.org/comm-central/source/mailnews/db/msgdb/public/nsMsgDatabase.h#52

void EnsureCached(nsMsgDatabase* pMessageDB) {
    if (!m_dbCache.Contains(pMessageDB)) m_dbCache.AppendElement(pMessageDB);
  }

So will it be something as follows?

void EnsureCached(nsMsgDatabase* pMessageDB) {
    RefPtr<nsMsgDatabase> lp = pMessageDB;
    if (!m_dbCache.Contains(lp) m_dbCache.AppendElement(lp);
  }

I am not sure if it compiles or not.
Or do we have to trace where the pointer passed EnsureCached() is created and change that to use smart pointer?

I was thinking maybe just in the loop? I haven't really tried any of this. The smart pointer would prevent GC until the variable it refers to goes out of scope.

(In reply to Magnus Melin [:mkmelin] from comment #22)

I was thinking maybe just in the loop? I haven't really tried any of this. The smart pointer would prevent GC until the variable it refers to goes out of scope.

The above code compiled.
However, I still see the heap-use-after-free.

The smart pointer would prevent GC until the variable it refers to goes out of scope.

void EnsureCached(nsMsgDatabase* pMessageDB) {
    RefPtr<nsMsgDatabase> lp = pMessageDB;   <--- |lp| is created.
    if (!m_dbCache.Contains(lp) m_dbCache.AppendElement(lp);
  }

With the code above, GC won't try to GC an object pointed by lp until this EnsureCached() is exited.

We need to make sure that the runtime understands the value stored in m_dbCache.AppendElment(lp) is referencing an object that should NOT be garbage collected.
I wonder if the above code let this happen.

Not tried yet: maybe I need to do something like this using NS_IF_ADDREF. I searched, but the documents available to addref, refcnt pointer, etc. seem to be very outdated.
CODE-1:

  void EnsureCached(nsMsgDatabase* pMessageDB) {
     if (!m_dbCache.Contains(pMessageDB)) {
       m_dbCache.AppendElement(pMessageDB);
       NS_IF_ADDREF(pMessageDB);
     }

During the shutdown, various subsystems to make Refptr works as expected may not
be available because they had been shutdown before(?). Well, quite unlikely if we think that GC is called after the decision to shutdown is called anyway. If GC cannot work well, we are really hosed.

I was thinking maybe just in the loop?

The free/releasing occurred well before the loop to print the value was entered.
This is because in the attached log the pointer is freed in |~nsMsgHdr()|.
The line number may be off from the pristince C-C TB code due to local patches.

125:52.30 GECKO(1763962) freed by thread T0 here:
125:52.30 GECKO(1763962)     #0 0x559650efab06 in free /builds/worker/fetches/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:52:3
125:52.30 GECKO(1763962)     #1 0x7fd66af736b5 in nsMsgDatabase::Release() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:1142:1
125:52.30 GECKO(1763962)     #2 0x7fd66af9db53 in nsMsgHdr::~nsMsgHdr() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgHdr.cpp:97:3
125:52.30 GECKO(1763962)     #3 0x7fd66af9dd1d in nsMsgHdr::~nsMsgHdr() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgHdr.cpp:90:23
125:52.30 GECKO(1763962)     #4 0x7fd66af9cd75 in nsMsgHdr::Release() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgHdr.cpp:20:1

nsMsgDatabase.cpp:114 is as follows in my tree.

       NS_IMPL_ISUPPORTS(nsMsgDatabase, nsIMsgDatabase, nsIMsgOfflineOpsDatabase,
                  nsIDBChangeAnnouncer)
   So the release of the nsMsgDatabase object.

The destruction of nsMsgHdr object leads to the premature destruction of the object of which pointer is still stored in m_dbCache.
The code is interesting.

nsMsgHdr.cpp:20 is as follows.

NS_IMPL_ISUPPORTS(nsMsgHdr, nsIMsgDBHdr)

So this is the destruction of nsMsgHdr object.
What is very revealing is as follows. nsMsgHdr.cpp:90 and 97 is as follows.

nsMsgHdr::~nsMsgHdr() {     <--- Line 90
  if (m_mdbRow) {
    if (m_mdb) {
      NS_RELEASE(m_mdbRow);  <--- NS_RELEASE ???
      m_mdb->RemoveHdrFromUseCache((nsIMsgDBHdr*)this, m_messageKey);
    }
  }
  NS_IF_RELEASE(m_mdb);    <--- Line 97
}

The action on Line 97 should be NO-OP if there is another reference to the object pointed by m_mdb exists (in m_dbCache !)
I don't even understand the difference between NS_RELEASE() and NS_IF_RELEASE(), but the use of this NS_IF_RELEASE macro suggests that I do need NS_IF_ADDREF to the pointer that I store in m_dbCache.
I THINK I am onto something.
(To be honest, in a pure object-oriented code, object poinred by m_mdb will be reclaimed by GC eventually, I think.
But there must be a reason to invoke |NS_IF_RELEASE(m_mdb)| . Maybe m_mdb is created quite often to reference a message in a folder, and thus we want to reclaim it explicitly?)

The CODE-1 above compiled.
On my machine mochitest of sanitized version of C-C TB takes 9072 seconds to run as of today.
The first heap-use-after-free occurred about 90 minutes into the test. (When I ran the tests that caused heap-after-free individually, the heap-use-after-free did NOT occur. Or at least NOT in a few tries. So I have to run the whole test and hope the heap-use-after-free either occurs or not.)
Let me report back the test result after a few hours.

There may be a confused refcount issue and the drastic handling of GC at shutdown is a problem. (nsMsgService is shutdown before
nsMsgDatabase and thus proper RemoveFromCache() cannot be done within nsMsgDatabase(), etc.

Anyway, I could avoid heap-use-after-free
with a patch in https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=756131c42270eb8398a046f139b1794f3b9436f4
So we can avoid the segfaulting and other nasty things that may happen due to the issue.
But the build may cause leakage which may not be nice as in https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=756131c42270eb8398a046f139b1794f3b9436f4

The threshold of leak flagged as error is now meager 50KB.
( See https://thunderbird.topicbox.com/groups/developers/Tb6ce4cf69ec84306 )

I always prefer clean shutdown without segfault over a bit of known leak at shutdown time.

However, I am doing a bit additional debugging and trying to see if the leak can be reduced.
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=82d6e9a840f1aaf6bbca4e80c1f20c6a20f6d415
EDIT: A few typos fixed.

(In reply to ISHIKAWA, Chiaki from comment #24)
...

However, I am doing a bit additional debugging and trying to see if the leak can be reduced.
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=82d6e9a840f1aaf6bbca4e80c1f20c6a20f6d415

That patch for additional debugging was wrong.

Better prospect is in
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=d2dba5779dbedaa0c018f3d8261f1e4d30f6fc0a

Under normal circumstances, TB tries to keep the order of release so that the
data A that relies on an XPCOM B is released before B, etc.

That is, "->" is "relies on" operator.
A -> B -> C
In this case, TB tries to release, A, B, and C in that order.

Unfortunately, the GC invoked during shutdown (in DTOR?) seems to invoke the release of objects
that are still around in a seemingly random order.
That is the crux of the issue.

m_dbCache ought to be empty at the end, but due to the nsMsgDBService which is released earlier,
the removal from m_dbCache does not work reliably near shutdown, and thus
we are left with stale pointer that points at already released heap area that contained the object, nsMsgDatabase.

My patch in the treeserver job tries to defer the removal of nsMsgDatabase as late as possible so that
the reference in DEBUG build should not touch already released objects.

That, in turn, MAY increase the amount of objects and pointers in m_dbCache which should have been removed earlier, but
I am afraid that we may have to live with it.
But, anyway, I am looking at this nsMsgDatabase be removed by implicit release in DTOR thingy.
This should be avoided in a complex software as in TB.
There are ordering relations among the objects regarding their release, and we need to invoke the explict release
at the time of shutdown (as forced by the clicking of X button on the window, etc. We do not have a chance of closing the visited folder, etc. and that is what the shutdown process is supposed to do. GC is a way to invoke release of objects, but there lies this problem of unobserved release order of objects.)

Yeah, people may think, "Why bother? Why don't we stop touching the m_dbCache during shutdown in DEBUG code?":
Well, I think this issue is only the tip of an iceberg [ of shutdown ordering ] and that is why I think solving this is important.
I am afraid there MAY be other similar release order issues which simply does not appear often enough. And it may depend on the binary toolchain one uses, etc.

I cringe at the number of times when TB crashed during shutdown.
That really left a bitter taste in my mouth. : "Can I trust a mailer that crashes at shutdown? Will my messages I just downloaded be intact?"

The crash during shutdown occurred for many people very often and I want to solve at least one of the causes of them.

BTW, because of the crash during shtudown, I rarely shut down TB any more.
I keep it running as long as possible on my PC.

Story so far:

I have hit upon random heap-use-after-free failures now that I have a working SANITIZED version of C-C TB under linux locally.
(Working means it does not crash early due to the known isssues fixed in the following patches.)
The patches to create working SANIIZED version of C-C TB are as follows.
Bug 1852662,
Bug 1909768 --- I have to get rid of this to proceed further and notice the problem during shutdown.

I may have forgotten other necessary patches. Maybe we need a meta bug for SANITIZER build of C-C TB.

I even replaced the code of libevent with the latest github version (I tested it under linux on my local PC and it worked. I did not test Windows, Mac, etc.) because I hit upon a strange heap-use-after-release case in the event handling code.
See Bug 1909876
This is again necessary to proceed further to hunt down the real C-C TB memory bug.

Random nature of the issue:
The random failures happen during mochitest. Random in the sense that the heap-use-after-free occurs after different test finishes in different runs. A test that triggers heap-use-after-free does not necessarily trigger it in the next mochitest run, etc.

Root cause:
It is still a mystery to a degree, but I found the root cause. I think I have hit upon a mess of refcount/release, etc. which
was mostly removed by Bug 1340972. I think the work in Bug 1340972 was initiated after observing the memory leak study.
So basically, it is a problem of memory handling at the level of object classes.
I learned of this bugizlla entry from the line:
https://searchfox.org/comm-central/source/mailnews/db/msgdb/src/nsMsgHdr.cpp#27
Note the comment about DTOR.
I am now sure that leaving the class object destruction to DTOR is the core of the evil issue here.

There are class objects that can be safely released by GC in random order.
But there are class objects that need to be released in certain order, and GC during shutdown violates this ordering.
(I am thinking of nsMsgHdr, nsMsgDatababase, nsMsgService and t|m_dbCache| in particular).
This is BAD.
For example, msMsgDatabase::~nsMsgDatabase needs to reference nsMsgService object in order to release itself from the m_dbCache to avoid leaving dangling pointer there , but during shutdown, there are cases when nsMsgService object has been destroyed and thus it cannot delete itself from the m_dbCache.
I notice there are OTHER unexplainable strange heap-use-after-free after I tried to remove this particular issue.
And tracing the source code, I now learned of the large cleanup effort, which was Bug 1340972. I think something later came up and broke the subtle memory reference relation.

Sure, now I can see why we have random crashes at shutdown.

What am I going to do?
Now, after learning the subtle and wide scope of the issue, I would not spend time in a near time to fix this completely. I would rather spend time on having the buffered write incorporated into the next release.
This is Bug 1242030.

Food for thought:
I have a series of questions after investigating this.

  1. Why FF does not suffer from crashes during shutdown? (Or does it?)
    If not, then FF developers must have stuck to a coding rule to avoid this issue caused by GC during shutdown tries to release objects in a random order.
    Or, possibly by chance (?), they have not created a convoluted relationship of object classes that will be adversely affected by the GC during shutdown. As I mentioned nsMsgHdr, nsMsgDatababase, nsMsgService and the m_dbCache come to mind immediately.

These classes are the ones involved in the heap-use-after-free. Even after I tried to fix the issue by changing the code in a few places, I get the errors in different places.
I could avoid the easy to cause issues, but then I get a random failure of nsMsgHdr::~nsMsgHdr trying to release an object and hit
heap-use-after-free. (double free basically.), which is almost impossible to figure out where and why it happens.

  1. Why didn't developers had noticed this issue much earlier?
    SANITIZED version did not work due to some bugs for quite some time. That is the reason why people have noticed this issue, I think.

I used to run valgrind for this type of memory error, but ever since the change form mozmill to mochitest, I have not been able to run C-C TB under valgrind during the local mochitest, and xpcshell test more or less due to some issues adequately.
So I have not run full C-C TB mochitest and xpcshell test suite under valgrind for quite a long time now.

One problem with running C-C TB is the ridiculous number of threads created during mochitst under valgrind. I think it is like
1000-1500. We don't need that many threads for normal operation of C-C TB, correct? Something in C-C TB tries to create threads like crazy under valgrind.: maybe a routine re-tries to create a thread after a timeout occurs when it tries to create the thread first time. Valgrind slows down such creation process very much, thus hitting a timeout.
I think just traversing the thread queue when a thread switch occurs under valgrind slows down the operation significantly when we have 1K+ threads.

BTW, I don't remember catching this heap-use-after-free often when I tested C-C TB under valgrind (and that is before the change to
mochitest from mozmill), so the problem might have crept into C-C TB in the last 5-6 years? But the slowdown by valgrind might have masked the bug. So it is hard to tell from my experience only.

  1. Let us use SANITIZER version more often.
    Although SANITIZER may not be able to catch uninitialized variable use like valgrind does, it certain allows us to catch this
    heap-use-after-free and exposes this GC-during-shutdown issues clearly, so I will use it constantly but it still imposes about x 1.5
  • 2.0 slowdown.
    Which means a lot for a hobbyist patch submitter.

But I will urge people who test C-C TB locally (under linux only?) to run sanitized version since it also bring us other niceties.

For some strange reasons, I now get a proper timeout notice when there is a very long pause during mochitest execution under linux between tests.
Before using sanitized version for test, I only look at the blank window (Xephyr window) on my local PC wondering why it takes so long, but now somehow the proper timeout handler kicks in.It seems sanitizer runtime checks for non-activity.
Sanitizer run-time seems to checks for for runnable threads, and if nothing happens for sometime, it kicks in a SEGV handler to report it.

[end of memo]

If I understand the comments correctly, this happens with a debug build, only.

In a debug build, details of not-yet-released/cleared elements in a list will be printed.
In an optimized build, those items are simply ignored?

Summary: heap-use-after-free with C-C TB during shutdown, nsMsgDBService::~nsMsgDBService() → heap-use-after-free with C-C TB during shutdown, nsMsgDBService::~nsMsgDBService(), debug build, only

(In reply to Kai Engert (:KaiE:) from comment #28)

If I understand the comments correctly, this happens with a debug build, only.

In a debug build, details of not-yet-released/cleared elements in a list will be printed.
In an optimized build, those items are simply ignored?

Something like this.
But in THIS PARTICULAR CASE, it is an explicit dump to figure out why some databases are not released.
(And it turns out there is a one too many reference or dangling pointer in the database cache left over BECAUSE
nsMsgService which the nsMsgDatabase needs during releasing itself from the database cache is no longer available (it seems timing-dependent from one run to the other) has already been released and is no longer available.
So THIS particular path is NOT hit during optimization.

HOWEVER, now that I have working sanitized version, I get a strange double free, etc. during GC after I tried to remove this particular bug, and that is when there is a DEEP and subtle refcount issues surrounding this issue. I see many comments regarding whether a particular pointer should be refcounted, or should not be, etc.
And I am afraid there is a bit of loose handling SOMEWHERE, and the random release order of GC is not helpful.
Now, I am sure that this spartan release of objects during shutdown ignoring the implicit ordering done by running code at normal times is ONE of the potential causes of crashes we have observed for a long time.
Sure there probably are other causes, too.

Any new details or related bug reports?

Flags: needinfo?(ishikawa)

(In reply to Wayne Mery (:wsmwk) from comment #30)

Any new details or related bug reports?

Thank you for the reminder, Wayne.

I think I have a better understanding of the object life time and how to handle pointers to them at shutdown time, which is exactly the issue here, thanks to the work I have put to the bugzilla 1956408 recently.
That bugzilla is to modify the IMAP module to experience less data race/deadlock/unnecessary sleep-wake up cycle (typically 60 seconds sleep and wakeup cycle).

So I will come back to this once I fixed that bug.

But I found a genuine GC bug to boot during testing IMAP module under TSAN. :-(
Initially, I thought it is an issue caused by DEBUG-only assertions, but I could not erase the issue after tinkering with the source code.
I analyzed the TSAN stack trace of a new data race I caught very recently. And I suspected a bug in GC handling of so-called WeakMap, and instrumented what (which thread) is doing what GC action carefully, and found an issue about data races.
Basically, the lock was held to protect a node. So far, so good.
The node is part of a linked list, and GC was trying to remove the node from the list.
The node itself was under a lock. However, in a linked list, the predecessor node's ->_next and the successor node's ->_prev fields need to be rewritten and thus needs to be protected in the same lock. They were not.
Thus data race.
TSAN testing revealed this. The slow down x5~x20 makes the window large enough for me to hit the race.

I am going to file the bugzilla for that now.
Untill this GC bug is fixed in the GC code, I have to whitelist the GC race which may not be quite correct approach, but I have to live with it to make sure the new IMAP code does not crash under stress testing, which I am doing right now. GC's TSAN race interfered with the test and I needed to investigate and found the issue.

All in all, my knowledge of how to cope with refcounted pointers created on main thread (that does I/O)
that had been passed to objects created by non-main thread has progressed siginificantly.

So, I hope I can fix this in a satisfactory manner once I fix the IMAP module in a satisfactory manner.

Flags: needinfo?(ishikawa)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: