getMsgHdrForMsgId() may cause massive loss of MSF files since it potentially opens all folders in the profile hitting the limit of open files
Categories
(MailNews Core :: Database, defect, P1)
Tracking
(thunderbird_esr102+ fixed, thunderbird105 wontfix, thunderbird108 fixed)
People
(Reporter: b1, Assigned: mkmelin)
References
Details
(Keywords: perf, Whiteboard: [TM:102.6.0])
Attachments
(1 file)
48 bytes,
text/x-phabricator-request
|
wsmwk
:
approval-comm-beta+
wsmwk
:
approval-comm-esr102+
|
Details | Review |
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36 Edg/104.0.1293.70
Steps to reproduce:
getMsgHdrForMsgId()
may cause massive loss of MSF files since it potentially opens all folders in the profile hitting the limit of open files.
As we've seen in bug 1773822, opening too many folder databases leads to the loss of the MSF files which can't be opened any more and are then considered invalid and are deleted.
https://github.com/Betterbird/thunderbird-patches/blob/main/102/bugs/1787609-fix-findMsgIdInFolder.patch
Turned out that CloseDBIfFolderNotOpen()
didn't try hard enough, hence we added the ForceClosed()
.
The change in nsMsgDatabase::CheckForErrors()
avoids blowing away databases needlessly due to "too many open files".
BTW, with ~1800 folders, some of them rather big, getMsgHdrForMsgId()
will query all the databases for the given message ID if started with an non-existent message ID. This makes the application unresponsive for seconds/minutes. This should be up for a rewrite. With the patch at least it doesn't obliterate the MSF files any more. On Windows there is a ~500 open file limit, so for us ~1300 MSFs were deleted.
Comment 3•1 year ago
|
||
Several users are reporting serious version 102 performance issues, so we will want to sort this out.
Why is the new argument needed? if folderOpen is evaluated as false, then SetMsgDatabase(nullptr) is not enough to close the DB? If not and mDatabase->ForceClosed() is needed too, why don't we want to call it also at all other callsites (unconditionally)?
-nsresult nsMsgDBFolder::CloseDBIfFolderNotOpen() {
+NS_IMETHODIMP nsMsgDBFolder::CloseDBIfFolderNotOpen(bool aForceClosed) {
nsresult rv;
nsCOMPtr<nsIMsgMailSession> session =
do_GetService(NS_MSGMAILSESSION_CONTRACTID, &rv);
NS_ENSURE_SUCCESS(rv, rv);
bool folderOpen;
session->IsFolderOpenInWindow(this, &folderOpen);
if (!folderOpen &&
-
!(mFlags & (nsMsgFolderFlags::Trash | nsMsgFolderFlags::Inbox)))
-
!(mFlags & (nsMsgFolderFlags::Trash | nsMsgFolderFlags::Inbox))) {
- if (aForceClosed && mDatabase) mDatabase->ForceClosed();
SetMsgDatabase(nullptr); - }
return NS_OK;
}
(In reply to :aceman from comment #4)
Why is the new argument needed? if folderOpen is evaluated as false, then SetMsgDatabase(nullptr) is not enough to close the DB? If not and mDatabase->ForceClosed() is needed too, why don't we want to call it also at all other callsites (unconditionally)?
Good questions we cannot answer. Observed behaviour is a) calling "force closed" unconditionally breaks mail/components/extensions/test/xpcshell/test_ext_messages.js and b) not calling "force closed" and just using the hunk in MailUtils.jsm doesn't achieve the required result, the MSF files still get deleted. You can easily test that when trying to open an article via a message ID when the article is not on your system and therefore all folders are opened (see bug 37653 for enabling patch).
Correct, the test fails like this (with closing the DB unconditionally):
5:03.19 INFO xpcshell-local.ini:comm/mail/components/extensions/test/xpcshell/test_ext_messages.js | Starting test_move_copy_delete
5:03.19 INFO (xpcshell/head.js) | test test_move_copy_delete pending (2)
5:03.19 INFO (xpcshell/head.js) | test run_next_test 4 finished (2)
5:03.19 INFO "Extension attached"
10:00.48 pid:21844 JavaScript error: chrome://messenger/content/parent/ext-mail.js, line 1839: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIMsgDBHdr.getProperty]
5:03.59 INFO "CONSOLE_MESSAGE: (error) [JavaScript Error: "NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIMsgDBHdr.getProperty]" {file: "chrome://messenger/content/parent/ext-mail.js" line: 1839}]"
The line 1839 is the msgHdr.getProperty() call in the block:
function convertMessage(msgHdr, extension) {
if (!msgHdr) {
return null;
}
let composeFields = Cc[
"@mozilla.org/messengercompose/composefields;1"
].createInstance(Ci.nsIMsgCompFields);
let junkScore = parseInt(msgHdr.getProperty("junkscore"), 10) || 0;
The test does not call convertMessage() directly. It could use some investigation why it is called and whether there is not some wrong assumption of the folder having the DB still open.
Assignee | ||
Comment 7•1 year ago
|
||
Ben, please have a look.
Comment 8•1 year ago
|
||
As an aside, I think we're getting lost in the weeds here, and trying to paper over symptoms rather than looking at underlying causes.
The more general issue is that we don't have any good way to limit the number of MSF files open at any one time. And getMsgHdrForMsgId()
can be a particularly pathological case, in that it might rummage through all the folders in an account, leaving the files open as it goes.
I don't think getMsgHdrForMsgId()
should have to shoulder the responsibility for managing opening/closing MSF files. Same goes for the zillions of other similar functions which hit the message databases.
My understanding of mork:
- When you open the DB, all the data is basically slurped into RAM.
- writes are buffered up in RAM until one of the commits is performed.
- maybe there's some autocommit threshold? If enough commits build up or a time limit elapses it commits automatically? (pure speculation on my part- I've not looked for this in the code yet).
- commits might just append extra changes on to the end of the file, or they might compress and rewrite the whole file.
Does this sound about right?
Next question: Does anyone actually know how the life cycle of nsMsgDatabase
et al corresponds to when the MSF file is opened and closed?
I'm guessing nsMsgDatabase has a 1:1 lifetime correspondence with a mork DB, so I'd guess it's the mork layer which determines when files are open or closed...
(I honestly don't know - I've tried looking a few times, but gave up in the increasing layers of voodoo down there ;- )
Upshot: is the MSF file opened in the nsMsgDatabase
ctor, and close it in the dtor? Or is it more complicated than that?
What I'm getting at:
- the vast majority of folders we hit are for read-only purposes. There's no point actually keeping the MSF file open once we've read it's contents.
- some folders are likely to see lot of writes, and in those cases it might be worth keeping the file open to perform regular commits (but even then, it might be that the filesystem caching is good enough that keeping the file open doesn't give us any performance improvement).
- some folders will only see infrequent writes. For these, it'd be perfectly fine to have to open the MSF file again to perform an occasional commit.
So... we just need to be waaaaay more aggressive about closing MSF files.
To do this will likely take a lot of wading about in the mork layers. But I think It'd solve so many problems.
Followup question:
Is it reasonable to have all the databases resident in RAM all the time? (we're kind of heading toward this anyway, as we look toward replacing per-folder mork DBs with a per-account database - even if it might end up in something like SQLlite, where the resident blocks are automatically loaded/unloaded rather than being permanently in RAM).
This would simplify a whole lot of things - folder.msgDatabase
would always be valid.
At the moment, there's a lot of voodoo everywhere involving setting .msgDatabase
to null and calling ForceDBClosed()
and I really hate it.
Comment 9•1 year ago
|
||
Extra followup question:
Why do .MSF file get deleted when you hit the maximum-open-file limit?
I'm guessing it's because of some code panicing when it can't an MSF file and just deleting it on the assumption it's corrupt... but I really don't know.
In any case, it seems like something worth writing up in it's own bug?
![]() |
||
Comment 10•1 year ago
|
||
There is the MsgDBCacheManager at https://searchfox.org/comm-central/source/mailnews/base/src/MsgDBCacheManager.jsm that keeps a few recently used databases open. But other than those all others should get automatically closed. I also think some are not closed as they should and we have some reports about it. Your analysis is theoretically correct, but we couldn't find the problem in the internals yet.
Comment 11•1 year ago
|
||
(In reply to Ben Campbell from comment #9)
Why do .MSF file get deleted when you hit the maximum-open-file limit?
This is fixed (or at least mitigated) in our patch:
https://github.com/Betterbird/thunderbird-patches/blob/13c1387b48b27c9666978396554c9c6b9c4c23fc/102/bugs/1787609-fix-findMsgIdInFolder.patch#L166
Comment 12•1 year ago
|
||
(In reply to b4 from comment #11)
This is fixed (or at least mitigated) in our patch:
https://github.com/Betterbird/thunderbird-patches/blob/13c1387b48b27c9666978396554c9c6b9c4c23fc/102/bugs/1787609-fix-findMsgIdInFolder.patch#L166
Doh. I did read the patch too! That's obviously where my vague feeling that something was panicing and deleting the MSF came from!
Comment 13•1 year ago
|
||
Hmm. I don't see test_ext_messages.js
failing here.
This is on a linux debug build... something platform-dependent maybe?
![]() |
||
Comment 14•1 year ago
|
||
I did see it on Linux debug build.
Comment 15•1 year ago
|
||
Just pulled the latest patches (over the last few hours) and rebuilt... but still no fail here.
I sent it to try-c-c:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=1211c005e8bd8b2f60838137876d9d4282ed9d66
Comment 16•1 year ago
|
||
(In reply to Ben Campbell from comment #13)
Hmm. I don't see
test_ext_messages.js
failing here.
Not with the patch as it is. If fails when "force closed" is unconditional. We explained that in comment #5.
Updated•1 year ago
|
![]() |
||
Comment 17•1 year ago
|
||
Yes, you need to modify the patch in nsMsgDBFolder::CloseDBIfFolderNotOpen to call mDatabase->ForceClosed(); ignoring aForceClosed.
Comment 18•1 year ago
|
||
OK, I've spent the morning printf debugging, and here's what I've got:
Overall:
A database is being ForceClosed()
during the test, and there's still a nsMsgHdr
held somewhere which references it.
I guess that's not too surprising - lots of places hold nsIMsgDBHdr
pointers. There's no way for the database to know about them all and to ask them to relinquish their DB refcounts... so even ForceClosed()
can't really guarantee that a database will be freed.
If file handles are bound to nsMsgDatabase lifetimes, then ForceClosed()
is just inherently doomed to be a bit crap. There's a bit of stuff in the DB to track outstanding message iterators, but there's no way to be sure there aren't other things holding a refcount on the DB.
So, if the patch fixes the immediate issue, I say we should just land it as is, and I'll open a new bug to try and sort out DB lifetime management properly.
The Details:
Making ForceClosed() unconditional, the failing line in the test run is indeed:
let junkScore = parseInt(msgHdr.getProperty("junkscore"), 10) || 0;
JavaScript error: chrome://messenger/content/parent/ext-mail.js, line 1839: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIMsgDBHdr.getProperty]
The failure is occurring in nsMsgHdr::GetProperty()
. It bails out because m_mdbRow
is null.
Working back from there, I can see that particular nsMsgHdr previously being created fine, with a database (m_mdb) and a non-null row (m_mdbRow). Here's a heavily-edited exerpt of some of my printfs while the test runs:
0:04.39 pid:408695 nsMsgHdr::nsMsgHdr(db=0x7f2399efb120, row=0x7f237714ef28) this=0x7f237714efa0
[ So our nsMsgHdr has now been created, with a non-null db and row. ]
0:04.39 pid:408695 nsMsgHdr::SetMessageKey() this=0x7f237714efa0 mdb=0x7f2399efb120 row=0x7f237714ef28
0:04.39 pid:408695 nsMsgHdr::SetStringProperty('storeToken') this=0x7f237714efa0 db=0x7f2399efb120 row=0x7f237714ef28
[... snip lots of other nsMsgHdr calls, all with the same db and row as in the ctor... ]
0:04.40 pid:408695 nsMsgDatabase::ForceClosed() this=0x7f2399efb120
[and there goes our database - byebye! ]
0:04.40 pid:408695 nsMsgHdr::GetProperty('junkscore') this=0x7f237714efa0 db=0x7f2399efb120 row=(nil) canary='XXXXXXXX'
[uhoh... row is now null!]
0:04.40 pid:408695 ===> NS_ERROR_NULL_POINTER!!!
0:04.40 pid:408695 JavaScript error: chrome://messenger/content/parent/ext-mail.js, line 1842: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIMsgDBHdr.getProperty]
m_mdbRow
is private and the only place I can see which clears it is nsMsgHdr::ReleaseMDBRow(). But that's not triggering at all, according to the printf I added to it.
I did think maybe something else might have been blatting over the memory... but I replaced the m_mdbRow pointer with a char[8] filled with 'X's, and moved m_mdbRow elsewhere in the class. And my 'X's remained intact (you can see the canary='XXXXXXXX'
in the output above). So I can't be 100% sure, but I don't think it's a memory overwrite.
So something is clearing out m_mdbRow
, but I have no idea what or how :-(
And it's not the real issue anyway, so I'm going to stop digging.
One other mystery:
The failing GetProperty()
dies because m_mdbRow
is null, and returns NS_ERROR_NULL_POINTER
.
But, by the time is shows up as a javascript error in the test framework, it's NS_ERROR_ILLEGAL_VALUE
.
Why?
Oh, the failure happens for local folders - you can bypass the other runs of the same test with --tag local
:
./mach xpcshell-test --tag local comm/mail/components/extensions/test/xpcshell/test_ext_messages.js
Comment 19•11 months ago
|
||
Wrote up my "bigger-picture" approach in Bug 1795178.
Updated•11 months ago
|
Comment 20•11 months ago
|
||
Ben, how about taking this bug here while bug 1795178 is still open? This bug here can also be triggered when clicking on a mid: URL since the call stack involves getMsgHdrForMsgId():
getMsgHdrForMsgId resource:///modules/MailUtils.jsm:654
openMessageByMessageId resource:///modules/MailUtils.jsm:352
_handleMidLink resource:///actors/MailLinkParent.jsm:74
(We heard somewhere that mid: URLs are broken in trunk, but the work in 102.)
We've just written up a summary in bug 1554188 comment #31. Looks like both bug 1554188 and likely also bug 1551173 comment #25 are caused by too many open files during search.
Comment 21•11 months ago
|
||
(In reply to b5 from comment #20)
Ben, how about taking this bug here while bug 1795178 is still open?
I suspect bug 1795178 might be easier to fix. Besides, I'm sick of trying to use sticking plasters to fix transverse fractures.
Assignee | ||
Comment 22•10 months ago
|
||
Updated•10 months ago
|
Comment 23•10 months ago
|
||
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/7c043976746e
Fix findMsgIdInFolder(): Tolerate missing MSF and close database after use. Related backend changes. r=benc
Assignee | ||
Updated•10 months ago
|
Comment 24•10 months ago
|
||
Timing and risk wise, do we want this to ride nightly until the next merge in two weeks? (and two weeks before holiday)
Or should we prefer to uplift earlier, to 109.0b3 in one week?
Assignee | ||
Comment 25•10 months ago
|
||
Comment on attachment 9304373 [details]
Bug 1787609 - Fix findMsgIdInFolder(): Tolerate missing MSF and close database after use. Related backend changes. r=benc
[Approval Request Comment]
Regression caused by (bug #): unknown
User impact if declined: potential msf loss
Testing completed (on c-c, etc.): c-c
Risk to taking this patch (and alternatives if risky): moderate, this is touchy code. Probably best to put on beta now to let it bake there.
Updated•10 months ago
|
Comment 26•10 months ago
|
||
Comment on attachment 9304373 [details]
Bug 1787609 - Fix findMsgIdInFolder(): Tolerate missing MSF and close database after use. Related backend changes. r=benc
[Triage Comment]
Approved for beta
Comment 27•10 months ago
|
||
bugherder uplift |
Thunderbird 108.0b2:
https://hg.mozilla.org/releases/comm-beta/rev/f035e8a5eb5b
Comment 28•10 months ago
|
||
Comment on attachment 9304373 [details]
Bug 1787609 - Fix findMsgIdInFolder(): Tolerate missing MSF and close database after use. Related backend changes. r=benc
[Triage Comment]
Approved for esr102
Comment 29•10 months ago
|
||
bugherder uplift |
Thunderbird 102.6.0:
https://hg.mozilla.org/releases/comm-esr102/rev/5438263ee834
Description
•