Open Bug 1677202 Opened 4 years ago Updated 3 years ago

heap-use-after-free with C-C TB during shutdown, nsMsgDBService::~nsMsgDBService()

Categories

(Thunderbird :: General, defect)

x86_64
Linux
defect

Tracking

(Not tracked)

People

(Reporter: ishikawa, Assigned: ishikawa)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

I noticed a heap-use-after-free error during local mochitest execution with locally created ASAN version of C-C TB.

Here is an excerpt from such error stacktrace.
Full logs of three cases are attached. They all point to similar cause, I think.

229:46.98 TEST_START: comm/mail/test/browser/keyboard/browser_spacehit.js
229:46.98 GECKO(106748) Chrome file doesn't exist: /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/_tests/testing/mochitest/browser/comm/mail/test/browser/keyboard/head.js
	  ...

230:18.34 INFO TEST-START | Shutdown
230:18.34 INFO Browser Chrome Test Summary
230:18.34 INFO Passed:	4
	  ...
230:26.09 GECKO(106748) =================================================================
230:26.09 GECKO(106748) ==106748==ERROR: AddressSanitizer: heap-use-after-free on address 0x614000557ac8 at pc 0x7f8a7d5f62f5 bp 0x7ffe0f775230 sp 0x7ffe0f775228
230:26.09 GECKO(106748) READ of size 8 at 0x614000557ac8 thread T0
230:26.82 GECKO(106748) error: address range table at offset 0xd50 has an invalid tuple (length = 0) at offset 0xd60
230:27.16 GECKO(106748)	    #0 0x7f8a7d5f62f4 in operator-> /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/nsCOMPtr.h:875:5
230:27.16 GECKO(106748)	    #1 0x7f8a7d5f62f4 in nsMsgDBService::~nsMsgDBService() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:88:14
230:27.16 GECKO(106748)	    #2 0x7f8a7d5f5c88 in nsMsgDBService::Release() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:72:1
230:27.18 GECKO(106748)	    #3 0x7f8a7e0d61a9 in ~nsCOMPtr_base /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCOMPtr.h:329:7
230:27.18 GECKO(106748)	    #4 0x7f8a7e0d61a9 in mozilla::SegmentedVector<nsCOMPtr<nsISupports>, 4096ul, mozilla::MallocAllocPolicy>::SegmentImpl<509ul>::~SegmentImpl() /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/SegmentedVector.h:78:21
230:27.18 GECKO(106748)	    #5 0x7f8a7e0d5b4d in mozilla::SegmentedVector<nsCOMPtr<nsISupports>, 4096ul, mozilla::MallocAllocPolicy>::PopLastN(unsigned int) /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/SegmentedVector.h:246:14
230:27.18 GECKO(106748)	    #6 0x7f8a7e0b3528 in mozilla::dom::DeferredFinalizerImpl<nsISupports>::DeferredFinalize(unsigned int, void*) /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/dom/BindingUtils.h:2721:15
230:27.18 GECKO(106748)	    #7 0x7f8a7e08f0f4 in mozilla::IncrementalFinalizeRunnable::ReleaseNow(bool) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/CycleCollectedJSRuntime.cpp:1549:17
230:27.18 GECKO(106748)	    #8 0x7f8a7e09044d in mozilla::CycleCollectedJSRuntime::FinalizeDeferredThings(mozilla::CycleCollectedJSContext::DeferredFinalizeType) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/CycleCollectedJSRuntime.cpp:1622:24
230:27.18 GECKO(106748)	    #9 0x7f8a7e08b013 in mozilla::CycleCollectedJSRuntime::OnGC(JSContext*, JSGCStatus, JS::GCReason) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/CycleCollectedJSRuntime.cpp:1692:7
230:27.24 GECKO(106748)	    #10 0x7f8a9175d3e2 in js::gc::GCRuntime::maybeCallGCCallback(JSGCStatus, JS::GCReason) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:7168:3
230:27.24 GECKO(106748)	    #11 0x7f8a9175f991 in ~AutoCallGCCallbacks /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:7147:32
230:27.24 GECKO(106748)	    #12 0x7f8a9175f991 in js::gc::GCRuntime::gcCycle(bool, js::SliceBudget, mozilla::Maybe<JSGCInvocationKind> const&, JS::GCReason) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:7261:1
230:27.24 GECKO(106748)	    #13 0x7f8a917649ba in js::gc::GCRuntime::collect(bool, js::SliceBudget, mozilla::Maybe<JSGCInvocationKind> const&, JS::GCReason) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:7484:9
230:27.24 GECKO(106748)	    #14 0x7f8a917751e5 in gc /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:7560:3
230:27.24 GECKO(106748)	    #15 0x7f8a917751e5 in JS::NonIncrementalGC(JSContext*, JSGCInvocationKind, JS::GCReason) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/gc/GC.cpp:8397:21
230:27.25 GECKO(106748)	    #16 0x7f8a7e0f8a59 in nsCycleCollector::FixGrayBits(bool, TimeLog&) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3274:19
230:27.25 GECKO(106748)	    #17 0x7f8a7e0fa369 in nsCycleCollector::BeginCollection(ccType, nsICycleCollectorListener*) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3575:3
230:27.25 GECKO(106748)	    #18 0x7f8a7e0f994f in nsCycleCollector::Collect(ccType, js::SliceBudget&, nsICycleCollectorListener*, bool) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3409:9
230:27.25 GECKO(106748)	    #19 0x7f8a7e0f93e8 in nsCycleCollector::ShutdownCollect() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3352:20
230:27.25 GECKO(106748)	    #20 0x7f8a7e0fbfc1 in nsCycleCollector::Shutdown(bool) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3641:5
230:27.25 GECKO(106748)	    #21 0x7f8a7e0ffc2c in nsCycleCollector_shutdown(bool) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3956:18
230:27.26 GECKO(106748)	    #22 0x7f8a7e47e43a in mozilla::ShutdownXPCOM(nsIServiceManager*) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/build/XPCOMInit.cpp:716:3
230:27.26 GECKO(106748)	    #23 0x7f8a8f896cb3 in ScopedXPCOMStartup::~ScopedXPCOMStartup() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsAppRunner.cpp:1648:5
230:27.26 GECKO(106748)	    #24 0x7f8a8f8c2198 in operator() /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/UniquePtr.h:460:5
230:27.27 GECKO(106748)	    #25 0x7f8a8f8c2198 in mozilla::UniquePtr<ScopedXPCOMStartup, mozilla::DefaultDelete<ScopedXPCOMStartup> >::reset(ScopedXPCOMStartup*) /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/UniquePtr.h:302:7
230:27.27 GECKO(106748)	    #26 0x7f8a8f8b3f55 in operator= /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/mozilla/UniquePtr.h:275:5
230:27.27 GECKO(106748)	    #27 0x7f8a8f8b3f55 in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsAppRunner.cpp:5300:16
230:27.27 GECKO(106748)	    #28 0x7f8a8f8b5c01 in XRE_main(int, char**, mozilla::BootstrapConfig const&) /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsAppRunner.cpp:5339:21
230:27.27 GECKO(106748)	    #29 0x5598f371ee95 in do_main /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mail/app/nsMailApp.cpp:217:22
230:27.27 GECKO(106748)	    #30 0x5598f371ee95 in main /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mail/app/nsMailApp.cpp:324:16
230:27.32 GECKO(106748)	    #31 0x7f8a9ea96cc9 in __libc_start_main csu/../csu/libc-start.c:308:16
230:27.32 GECKO(106748)	    #32 0x5598f3674029 in _start (/NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/bin/thunderbird+0xe9029)

This may be the reason why we see crash during shutdown.

Three full logs are attached in the subsequent comments.

Although I think this is very diffucult to exploit, who knows.
I am setting the security flag. Please clear it as people see fit.

https://searchfox.org/comm-central/rev/43a28c7a2eee7e5cd4ffecd6628450db2a4d9abe/mailnews/db/msgdb/src/nsMsgDatabase.cpp#88 well it's inside ifdef debug... so probably very hard to exploit ;)

I suppose the if clause should be:

if (pMessageDB && pMessageDB->m_dbFile)

(In reply to Magnus Melin [:mkmelin] from comment #4)

https://searchfox.org/comm-central/rev/43a28c7a2eee7e5cd4ffecd6628450db2a4d9abe/mailnews/db/msgdb/src/nsMsgDatabase.cpp#88 well it's inside ifdef debug... so probably very hard to exploit ;)

Right. I have noticed that by now. :-)

How does one clear the security flag?

I suppose the if clause should be:

if (pMessageDB && pMessageDB->m_dbFile)

I found out that things are not that simple.

I traced the values in #ifdef'ed code one by one.
It turns out pMessageDB points to an already freed heap area sometimes.
So pMessage->m_dbFile may not be null, but contains garbage.
This is timing-dependent. Sometimes the ASAN error is not triggered (!)
And that is why we can get crash SOMETIMES and not always, although in the mochitest flow, it seems always to happen at least on my PC.
However, if I run mochitest comm/mail/test/browser/keyboard/browser_spacehit.js by itself, the error does NOT happen(!).

So the issue is that there is a pointer to already freed structure in m_dbCache sometimes. Why?

I checked the code in nsMsgDatabase::~nsMsgDatabase()
and found that there is a following call in order to remove such entry in m_dbCache.
So this is probably not triggered all the time.

static_cast<nsMsgDBService*>(serv.get())->RemoveFromCache(this);

Problem is that it is protected by if (serv), of course.
I have traced and indeed there are cases when the call to RemoveFromCache() does not happen due to
|serv| is null.
I think this is because during shutdown phase, the XPCOM service has been shutdown and unavailable when we need it.

This is what I have figured out so far.

Maybe I can create a locally available code to clear such an entry in |m_dbCache|.
After all, we have routines to dump the entries in |m_dbCache| without going through XPCOM hoops. So maybe I can do it.

Typical local log now.

Note the xpcom-will-shutdown notification at the top.

 1:43.17 GECKO(195723) 1605327166749	Marionette	TRACE	Received observer notification xpcom-will-shutdown
 1:43.17 GECKO(195723) 1605327166750	Marionette	INFO	Stopped listening on port 2828
 1:43.17 GECKO(195723) 1605327166751	Marionette	DEBUG	Marionette stopped listening
 1:43.17 GECKO(195723) [195723, Main Thread] WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/events/DOMEventTargetHelper.cpp:291
 1:43.17 GECKO(195723) [195723, Main Thread] WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/events/DOMEventTargetHelper.cpp:291
 1:43.17 GECKO(195723) ~nsMsgDatabase(): closing database    /COMM-CENTRAL/TMP-DIR/tmp219v2et6.mozrunner/Mail/Local Folders/Sample.msf
 1:43.17 GECKO(195723) Not calling RemoveFromCache(this), this = 0x6140002a0c40
 1:43.17 GECKO(195723) [195723, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length()', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:80
 1:43.17 GECKO(195723) db left open pMessageDB=0x6140002a0c40, =================================================================
 1:43.17 GECKO(195723) ==195723==ERROR: AddressSanitizer: heap-use-after-free on address 0x6140002a0cc8 at pc 0x7fa55eb6c3c3 bp 0x7ffe11136550 sp 0x7ffe11136548
 1:43.17 GECKO(195723) READ of size 8 at 0x6140002a0cc8 thread T0
 1:43.17 GECKO(195723) error: address range table at offset 0xd50 has an invalid tuple (length = 0) at offset 0xd60
 1:43.17 GECKO(195723) ==195723==ERROR: AddressSanitizer: heap-use-after-free on address 0x6140002a0cc8 at pc 0x7fa55eb6c3c3 bp 0x7ffe11136550 sp 0x7ffe11136548
 1:43.17 GECKO(195723)     #0 0x7fa55eb6c3c2 in get /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/nsCOMPtr.h:856:48
 1:43.17 GECKO(195723)     #1 0x7fa55eb6c3c2 in operator nsIFile * /NEW-SSD/ASAN-OBJ-DIR/objdir-tb3/dist/include/nsCOMPtr.h:864:33
 1:43.17 GECKO(195723)     #2 0x7fa55eb6c3c2 in nsMsgDBService::~nsMsgDBService() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:94:43
 1:43.17 GECKO(195723)     #3 0x7fa55eb6bc88 in nsMsgDBService::Release() /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:72:1

1:43.17 GECKO(195723) Not calling RemoveFromCache(this), this = 0x6140002a0c40

This means that pointer to the to-be-released structure remains in m_dbCache.
Sure enough this pointer is referenced in nsMsgDBService::~nsMsgDBService().

So we may want to remove this pointer from the cache somehow.
I wonder how.

That will remove this pesky warning that has bothered me for a long time.

 1:43.17 GECKO(195723) [195723, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length()', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:80
Blocks: 1342858
See Also: → 1468691, 1377692

So my analysis based on local log is this.

When ~nsMsgDatabase() is called,
this also calls RemoveFromCache() to remove the being-deleted nsMsgDatabase from m_dbCache.

Unfortunately, during shutdown time, XPCOM shutdown proceeds asynchronously and this RemoveFromCache() cannot be called for the databases deleted near the end sometimes.

So that leaves pointers (to nsMsgDatabase that has already been released !) in m_dbCache.
This causes the reference to released heap in ~nsMsgDBService().

(I think deletion of nsMsgDatabase should always come before nsMsgDBService. I am not sure if this is strictly followed.)

I tried to see if I can somehow make the old pointer to nsMsgDBService() to call RemoveFromCache() even if XPCOM service itself is shutdown. The timing-dependence kicked in. Sometimes it works. But sometimes we get reference problem to already released heap again inside RemoveFromCache() even.

So my tentative final solution is as follows.

Count the number of times that RemoveFromCache() cannot be called due to early XPCOM shutdown.

Then in ~nsMsgDBService(), I check not only m_dbCache.Length(), but whether it is smaller than or equal to the # of failures to call RemoveFromCache() because the service was not available any more.
We allow such left over pointers to m_dbCache referenced by ~nsMsgDatabase() as inevitable.

We don't print the db left open message if m_dbCache.Length() <= the # of failures to invoke RemoveFromCache(...) by exiting early in ~nsMsgDBService().

So far so good locally. : No more db left open message for the three tests that produced sanitizer: heap-use-after-free.
No more heap-use-after-free (although there ARE leaks.) We won't crash, that is more important here.

I am going to run the whole mochitest locally and then, if it succeeds, submit the tryserver run.

Even though, I have somehow avoid the crash by the bandaid kludge,
the racy issue is obviously there.
To wit, an excerpt in the local log of mochitest with ASAN version of C-C TB.
There is only one place where "db left open..." message appears.
But you can see that ~nsMsgDBService() is called and executed BEFORE
~nsMsgDatabase() as far as the output messages are concerned and indeed, the counts printed suggest that
~nsMsgDBService() is executed BEFORE the RemoveFromCache() in ~nsMsgDatabase() is executed.

226:00.40 GECKO(307753) [307753, Main Thread] WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/events/DOMEventTargetHelper.cpp:291
226:00.86 GECKO(307753) [307753, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length() || m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:88
226:00.86 GECKO(307753) m_dbCache.Length() = 1, unremoved_db_from_cache_due_to_unavailable_XPCOM=0
226:00.86 GECKO(307753) db left open pMessageDB=0x614000138440, p->m_dbFile=0x610000558240, /COMM-CENTRAL/TMP-DIR/tmp1rlqki5k.mozrunner/Mail/Local Folders/JunkCommandsA.msf
226:00.86 GECKO(307753) ~nsMsgDatabase(): closing database    /COMM-CENTRAL/TMP-DIR/tmp1rlqki5k.mozrunner/Mail/Local Folders/JunkCommandsA.msf
226:00.86 GECKO(307753) Not calling RemoveFromCache(this), this = 0x614000138440
226:01.91 GECKO(307753) [307753, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3359

We certainly need better ordering of destruction of objects if possible.

Group: mail-core-security

This patch avoids the accessing of left over stale database pointers in the cache.
The pointers are to be removed by calling RemoveFromCache() explicitly, but during shutdown,
the ordering of subsystem shutdowns are not deterministic and
RemoveFromCache() may not be callable, and thus there are some stale pointers in the cache.

This is a bandage to not to access such cached pointers if we know that such pointers are likely to be the pointers that TB tried to remove from cache but due to the early subsystem shutdown, it could not.

The true solution would be to introduce proper ordering of
shutting down XPCOM subsystems.

Assignee: nobody → ishikawa
Attachment #9188190 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9188190 [details] [diff] [review]
Try counting the database pointer left in the cache due to early XPCOM shutdown and avoid accessing them later.

Review of attachment 9188190 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry but I don't think this is something we should do. It also adds far too much debug type spew here. It could be a useful local-debugging patch perhaps.
Attachment #9188190 - Flags: review?(mkmelin+mozilla) → review-

(In reply to Magnus Melin [:mkmelin] from comment #10)

Comment on attachment 9188190 [details] [diff] [review]
Try counting the database pointer left in the cache due to early XPCOM
shutdown and avoid accessing them later.

Review of attachment 9188190 [details] [diff] [review]:

Sorry but I don't think this is something we should do. It also adds far too
much debug type spew here. It could be a useful local-debugging patch
perhaps.

I can take out the printf() by surrounding it in #ifdef. I think the patch in one form or the other would be necessary.

REASON #1. This is necessary to run DEBUG version of ASAN build C-C TB successfully to completion during local mochitest.
REASON #2. This solves or exposes why the dreaded "some db msgs left open" message at TB shutdown that puzzled developers for some time (at least four years).
Please see Bug 1342858
WARNING: some msg dbs left open: '!m_dbCache.Length()' - for last folder that was open in the window and a message was viewed in it?
REASON #3: It is not THAT verbose in contrast to the current situation and actually gives enough information to let developers off the hook by giving information in relation to REASON #2.
Log BEFORE version

...
    44:34.94 GECKO(809035) [809035, Main Thread] WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/events/DOMEventTargetHelper.cpp:291
    44:35.02 GECKO(809035) [809035, Main Thread] WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/events/DOMEventTargetHelper.cpp:291
*** 44:35.38 GECKO(809035) [809035, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length()', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:80
    44:35.73 GECKO(809035) [809035, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3359
    44:35.84 GECKO(809035) [809035, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/modules/libpref/Preferences.cpp:4287

With the log line marked with "***", some developers got really confused why the "dbs left open" message and it does not print even the database name if it is the official build even if DEBUG is set.
The reason of not printing the name was trying to print the name crashed the build.
But the patch here solves the crash issue and we can safely print the name now.
We also know the reason why it was not removed from the cache.
The issue of not closing the msg dbs will persist until the order of XPCOM subsystem shutdowns will be taken care of.)

Log AFTER version

    247:05.42 GECKO(622036) [622036, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length() || m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:88
+++ 247:05.42 GECKO(622036) m_dbCache.Length() = 1, unremoved_db_from_cache_due_to_unavailable_XPCOM=0
+A+ 247:05.42 GECKO(622036) db left open pMessageDB=0x6140000e7a40, p->m_dbFile=0x610000571e40, /COMM-CENTRAL/TMP-DIR/tmpw1j8jkkn.mozrunner/Mail/Local Folders/Sample.msf
+B+ 247:05.42 GECKO(622036) ~nsMsgDatabase(): closing database    /COMM-CENTRAL/TMP-DIR/tmpw1j8jkkn.mozrunner/Mail/Local Folders/Sample.msf
+C+ 247:05.42 GECKO(622036) Not calling RemoveFromCache(this), this = 0x6140000e7a40
    247:06.31 GECKO(622036) [622036, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3359

The messages are printed only at TB's shutdown time. They are not numerous.
Also, the current patch is for exposing the issue clearly including why and the underlying cause of XPCOM shutdown issue.
I can simplify the line marked with A by eliminating pointer values as follows.

247:05.42 GECKO(622036) db left open /COMM-CENTRAL/TMP-DIR/tmpw1j8jkkn.mozrunner/Mail/Local Folders/Sample.msf

I can modify C by eliminating the pointer value as follows.

247:05.42 GECKO(622036) Not calling RemoveFromCache(this) due unavailable XPCOM service.

That would make the log look like:

    247:05.42 GECKO(622036) [622036, Main Thread] WARNING: some msg dbs left open: '!m_dbCache.Length() || m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mailnews/db/msgdb/src/nsMsgDatabase.cpp:88
    247:05.42 GECKO(622036) m_dbCache.Length() = 1, unremoved_db_from_cache_due_to_unavailable_XPCOM=0
    247:05.42 GECKO(622036) db left open /COMM-CENTRAL/TMP-DIR/tmpw1j8jkkn.mozrunner/Mail/Local Folders/Sample.msf
+B+ 247:05.42 GECKO(622036) ~nsMsgDatabase(): closing database    /COMM-CENTRAL/TMP-DIR/tmpw1j8jkkn.mozrunner/Mail/Local Folders/Sample.msf
    247:05.42 GECKO(622036) Not calling RemoveFromCache(this) due unavailable XPCOM service.
    247:06.31 GECKO(622036) [622036, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsCycleCollector.cpp:3359

I will court the opinion of couple of developers in bug 1342858.

Flags: needinfo?(alta88)
Flags: needinfo?(acelists)

I hope we can find out how to resolve the problem properly instead.

That requires the proper ordering of subsystem shutdown, which has plagued TB and FF for many years.

Also, at least in one other case, I remember that TB code is written in such a manner that destruction of of an object initiated by
GC during shutdown (that is the destruction timing is NOT programmer control at all), and in one such case, by the time shutdown triggers the GC and eventually this destruction of certain object is done, the function called from the destructor of the object is no longer callable because the XPCOM has shutdown already.
If I recall correctly, it is about the deletion of a temporary file which may contain a piece of mail message (private information), but TB code simply discards the object that refers to the file, and let its automatic destruction by GC calls the destructor automatically, and the destructor calls the file system removal API after obtaining the file path name from the stored file object (can't recall which class).
Unfortunately, mapping of the object to the file system pathname fails because the service to provide the mapping has already shutdown.
Thus the temporary file is left around. Tough.

So unless we have a sure way to invoke the destruction of all the message databases BEFORE the db cache service is shutdown, we will be in trouble.
Besides, I don't know if this applies here, but as the destruction by automatic GC indicates, by the time shutdown time comes, TB may not know all the db msgs that are open. (Hmm, that sounds unlikely).
In any case, we may have to look out for deadlock if we introduce a forced order of shutdown.

It is not easy. The kludge here has exposed the underlying issue.

(In reply to ISHIKAWA, Chiaki from comment #13)

That requires the proper ordering of subsystem shutdown, which has plagued TB and FF for many years.

Also, at least in one other case, I remember that TB code is written in such a manner that destruction of initiated by
GC during shutdown (that is the destruction timing is NOT programmer control at all), and in one such case, by the time, initiating shutdown triggers the GC, and eventually this destruction of certain object is done, but the function called from the destructor of the object is no longer callable because the XPCOM has shutdown.

If I recall correctly, it is about the deletion of a temporary file which may contain a piece of mail message (private information), but TB code simply discards the object that refers to the file, and let its automatic destruction by GC calls the destructor automatically, and the destructor calls the file system removal API after obtaining the file path name from the stored file object (can't recall which class).

Unfortunately, mapping of the object to the file system pathname fails because the service to provide the mapping has already be shutdown.
Thus the temporary file is left around. Tough.

I think the bug I was referring to is
Bug 1595343 C-C TB: nsMsgAttachment::DeleteAttachment fails to delete attachment file occasionally during shutdown, possibily leaving orphaned files in temp/tmp directory

Flags: needinfo?(alta88)
Comment on attachment 9188190 [details] [diff] [review]
Try counting the database pointer left in the cache due to early XPCOM shutdown and avoid accessing them later.

Review of attachment 9188190 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for analyzing this tough problem.

::: mailnews/db/msgdb/src/nsMsgDatabase.cpp
@@ +83,5 @@
> +  // some entries cannot be removed even if we tried.
> +  // So we we count such unremoved entries and ignore them.  Bug 1677202 
> +  NS_WARNING_ASSERTION(
> +    !m_dbCache.Length() ||
> +    m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM,

Does this actually hide the strange "some msg dbs left open" message in some cases? Do we just hide it, but the underlying problem of XPCOM shutdown still exists?

@@ +119,5 @@
> +             (void *) pMessageDB);
> +      fflush(stdout);
> +      printf("p->m_dbFile=%p, ", (void *) pMessageDB->m_dbFile);
> +      fflush(stdout);
> +      printf("%s\n", pMessageDB->m_dbFile->HumanReadablePath().get());

So we print out the folder path, but is it useful for anything? Will it be one of the last accessed folders? But we can't do anything about it.

@@ +887,5 @@
> +//
> +// This does not work. Depending on timing, m_dbCache.Length() accesses already
> +// released heap.
> +//
> +// void nsMsgDBService::RemoveFromCacheDuringShutdown(nsMsgDatabase* pMessageDB){

Adding commented out code for documentation purposes?

@@ +1137,5 @@
> +    // comm/mail/test/browser/keyboard/browser_spacehit.js is gone.
> +    // But the problem with the following test persists.
> +    // comm/mail/test/browser/multiple-identities/browser_displayNames.js
> +    //
> +    // m_mdbEnv->Release();  //??? is this right?

So when do we release this now if not here?
Flags: needinfo?(acelists)

Hi,

Thank you for your comment.

(In reply to :aceman from comment #15)

Comment on attachment 9188190 [details] [diff] [review]
Try counting the database pointer left in the cache due to early XPCOM
shutdown and avoid accessing them later.

Review of attachment 9188190 [details] [diff] [review]:

Thanks for analyzing this tough problem.

::: mailnews/db/msgdb/src/nsMsgDatabase.cpp
@@ +83,5 @@

  • // some entries cannot be removed even if we tried.
  • // So we we count such unremoved entries and ignore them. Bug 1677202
  • NS_WARNING_ASSERTION(
  • !m_dbCache.Length() ||
  • m_dbCache.Length() <= unremoved_db_from_cache_due_to_unavailable_XPCOM,

Does this actually hide the strange "some msg dbs left open" message in some
cases? Do we just hide it, but the underlying problem of XPCOM shutdown
still exists?

It avoids the "some msg dbs left open" messages ALL THE TIME in my local testing and past try server runs.
More importantly, now it avoids touching invalid pointers and thus avoids the assertion abort,
and we can now print the offending DB's pathname. (See my comment on pathname in the next comment section.)

Now, why this matters is this.
In DEBUG build that prints memory leaks at the end had aborted prematurely if this assertion is hit.
We could not obtain memory leak information at all due the abort.
However, with this patch, the DEBUG version could terminate successfully and all the nice dumps produced at exit time is produced in full. (My local log immediately became larger thanks to this.)

@@ +119,5 @@

  •         (void *) pMessageDB);
    
  •  fflush(stdout);
    
  •  printf("p->m_dbFile=%p, ", (void *) pMessageDB->m_dbFile);
    
  •  fflush(stdout);
    
  •  printf("%s\n", pMessageDB->m_dbFile->HumanReadablePath().get());
    

So we print out the folder path, but is it useful for anything? Will it be
one of the last accessed folders? But we can't do anything about it.

Yes, right now, we can't do anything. I am printing this in order for anyone to
to figure out the underlying problem of unclosed(?) dbfile at program shutdown.
In each test, where such an unclosed db is left before shutdown, the path name is printed.
A serious debugger should be able to figure out what folder is left unclosed (and hopefully where it is not closed, etc.).

@@ +887,5 @@

+//
+// This does not work. Depending on timing, m_dbCache.Length() accesses already
+// released heap.
+//
+// void nsMsgDBService::RemoveFromCacheDuringShutdown(nsMsgDatabase* pMessageDB){

Adding commented out code for documentation purposes?

Yes. I tried a few tricks which did not work. Accessing m_dbCache.Length() did not work during SHUTDOWN phase. :-(

@@ +1137,5 @@

  • // comm/mail/test/browser/keyboard/browser_spacehit.js is gone.
  • // But the problem with the following test persists.
  • // comm/mail/test/browser/multiple-identities/browser_displayNames.js
  • //
  • // m_mdbEnv->Release(); //??? is this right?

So when do we release this now if not here?

Hm... I wonder if this ~nsMsgDatabase() is called only once in TB's execution or is called many times.
If it is called many times, I need to see why m_mdbEnv->Release() crashed. But I am afraid that the crash happens only during SHUTDOWN and we may need to treat this Release() specially. Either we may not call it for now to avoid the mysterious crash() , especially if _nsMsgDatabase() only once (?)

[But I have a suspicion that maybe with the counting hack, this may not abort any more.
Unfortunately, I cannot test this easily now. For the past few weeks, I cannot test local repo due to a corruption caused by an HG problem right now. The fix that uses |hg pull| mentioned in Bug 1720302 does not work. It does a lot of I/O and the repo seems to become pristine. However, as soon as I try to do something using HG MQUEUE the phantom modification appears and I could not apply changes, etc. :-(
Trying to submit a job is also screwed up due to the same reason. I want to remove unwanted local patches, but any operation that touches the local repo refuses to operate due to the same issue.
Maybe I should try the script mentioned in gist in the same bug, but I am a bit afraid that maybe MQUEUE is not really compatible with the fixes proposed now. Ugh...

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: