Closed Bug 1787609 Opened 2 years ago Closed 1 year ago

getMsgHdrForMsgId() may cause massive loss of MSF files since it potentially opens all folders in the profile hitting the limit of open files

Categories

(MailNews Core :: Database, defect, P1)

Thunderbird 102

Tracking

(thunderbird_esr102+ fixed, thunderbird105 wontfix, thunderbird108 fixed)

RESOLVED FIXED
109 Branch
Tracking Status
thunderbird_esr102 + fixed
thunderbird105 --- wontfix
thunderbird108 --- fixed

People

(Reporter: b1, Assigned: mkmelin)

References

Details

(Keywords: perf, Whiteboard: [TM:102.6.0])

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36 Edg/104.0.1293.70

Steps to reproduce:

getMsgHdrForMsgId() may cause massive loss of MSF files since it potentially opens all folders in the profile hitting the limit of open files.

As we've seen in bug 1773822, opening too many folder databases leads to the loss of the MSF files which can't be opened any more and are then considered invalid and are deleted.

https://github.com/Betterbird/thunderbird-patches/blob/main/102/bugs/1787609-fix-findMsgIdInFolder.patch
Turned out that CloseDBIfFolderNotOpen() didn't try hard enough, hence we added the ForceClosed().
The change in nsMsgDatabase::CheckForErrors() avoids blowing away databases needlessly due to "too many open files".

BTW, with ~1800 folders, some of them rather big, getMsgHdrForMsgId() will query all the databases for the given message ID if started with an non-existent message ID. This makes the application unresponsive for seconds/minutes. This should be up for a rewrite. With the patch at least it doesn't obliterate the MSF files any more. On Windows there is a ~500 open file limit, so for us ~1300 MSFs were deleted.

Several users are reporting serious version 102 performance issues, so we will want to sort this out.

Severity: -- → S2
Keywords: perf
Priority: -- → P1
See Also: → 1788901

Why is the new argument needed? if folderOpen is evaluated as false, then SetMsgDatabase(nullptr) is not enough to close the DB? If not and mDatabase->ForceClosed() is needed too, why don't we want to call it also at all other callsites (unconditionally)?

-nsresult nsMsgDBFolder::CloseDBIfFolderNotOpen() {
+NS_IMETHODIMP nsMsgDBFolder::CloseDBIfFolderNotOpen(bool aForceClosed) {
nsresult rv;
nsCOMPtr<nsIMsgMailSession> session =
do_GetService(NS_MSGMAILSESSION_CONTRACTID, &rv);
NS_ENSURE_SUCCESS(rv, rv);
bool folderOpen;
session->IsFolderOpenInWindow(this, &folderOpen);
if (!folderOpen &&

  •  !(mFlags & (nsMsgFolderFlags::Trash | nsMsgFolderFlags::Inbox)))
    
  •  !(mFlags & (nsMsgFolderFlags::Trash | nsMsgFolderFlags::Inbox))) {
    
  • if (aForceClosed && mDatabase) mDatabase->ForceClosed();
    SetMsgDatabase(nullptr);
  • }
    return NS_OK;
    }
Flags: needinfo?(b2)

(In reply to :aceman from comment #4)

Why is the new argument needed? if folderOpen is evaluated as false, then SetMsgDatabase(nullptr) is not enough to close the DB? If not and mDatabase->ForceClosed() is needed too, why don't we want to call it also at all other callsites (unconditionally)?

Good questions we cannot answer. Observed behaviour is a) calling "force closed" unconditionally breaks mail/components/extensions/test/xpcshell/test_ext_messages.js and b) not calling "force closed" and just using the hunk in MailUtils.jsm doesn't achieve the required result, the MSF files still get deleted. You can easily test that when trying to open an article via a message ID when the article is not on your system and therefore all folders are opened (see bug 37653 for enabling patch).

Flags: needinfo?(b2)

Correct, the test fails like this (with closing the DB unconditionally):

5:03.19 INFO xpcshell-local.ini:comm/mail/components/extensions/test/xpcshell/test_ext_messages.js | Starting test_move_copy_delete
5:03.19 INFO (xpcshell/head.js) | test test_move_copy_delete pending (2)
5:03.19 INFO (xpcshell/head.js) | test run_next_test 4 finished (2)
5:03.19 INFO "Extension attached"
10:00.48 pid:21844 JavaScript error: chrome://messenger/content/parent/ext-mail.js, line 1839: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIMsgDBHdr.getProperty]
5:03.59 INFO "CONSOLE_MESSAGE: (error) [JavaScript Error: "NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIMsgDBHdr.getProperty]" {file: "chrome://messenger/content/parent/ext-mail.js" line: 1839}]"

The line 1839 is the msgHdr.getProperty() call in the block:
function convertMessage(msgHdr, extension) {
if (!msgHdr) {
return null;
}

let composeFields = Cc[
"@mozilla.org/messengercompose/composefields;1"
].createInstance(Ci.nsIMsgCompFields);

let junkScore = parseInt(msgHdr.getProperty("junkscore"), 10) || 0;

The test does not call convertMessage() directly. It could use some investigation why it is called and whether there is not some wrong assumption of the folder having the DB still open.

Flags: needinfo?(mkmelin+mozilla)

Ben, please have a look.

Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(mkmelin+mozilla) → needinfo?(benc)

As an aside, I think we're getting lost in the weeds here, and trying to paper over symptoms rather than looking at underlying causes.

The more general issue is that we don't have any good way to limit the number of MSF files open at any one time. And getMsgHdrForMsgId() can be a particularly pathological case, in that it might rummage through all the folders in an account, leaving the files open as it goes.

I don't think getMsgHdrForMsgId() should have to shoulder the responsibility for managing opening/closing MSF files. Same goes for the zillions of other similar functions which hit the message databases.

My understanding of mork:

  • When you open the DB, all the data is basically slurped into RAM.
  • writes are buffered up in RAM until one of the commits is performed.
    • maybe there's some autocommit threshold? If enough commits build up or a time limit elapses it commits automatically? (pure speculation on my part- I've not looked for this in the code yet).
  • commits might just append extra changes on to the end of the file, or they might compress and rewrite the whole file.

Does this sound about right?

Next question: Does anyone actually know how the life cycle of nsMsgDatabase et al corresponds to when the MSF file is opened and closed?
I'm guessing nsMsgDatabase has a 1:1 lifetime correspondence with a mork DB, so I'd guess it's the mork layer which determines when files are open or closed...
(I honestly don't know - I've tried looking a few times, but gave up in the increasing layers of voodoo down there ;- )
Upshot: is the MSF file opened in the nsMsgDatabase ctor, and close it in the dtor? Or is it more complicated than that?

What I'm getting at:

  • the vast majority of folders we hit are for read-only purposes. There's no point actually keeping the MSF file open once we've read it's contents.
  • some folders are likely to see lot of writes, and in those cases it might be worth keeping the file open to perform regular commits (but even then, it might be that the filesystem caching is good enough that keeping the file open doesn't give us any performance improvement).
  • some folders will only see infrequent writes. For these, it'd be perfectly fine to have to open the MSF file again to perform an occasional commit.

So... we just need to be waaaaay more aggressive about closing MSF files.
To do this will likely take a lot of wading about in the mork layers. But I think It'd solve so many problems.

Followup question:
Is it reasonable to have all the databases resident in RAM all the time? (we're kind of heading toward this anyway, as we look toward replacing per-folder mork DBs with a per-account database - even if it might end up in something like SQLlite, where the resident blocks are automatically loaded/unloaded rather than being permanently in RAM).
This would simplify a whole lot of things - folder.msgDatabase would always be valid.
At the moment, there's a lot of voodoo everywhere involving setting .msgDatabase to null and calling ForceDBClosed() and I really hate it.

Extra followup question:

Why do .MSF file get deleted when you hit the maximum-open-file limit?

I'm guessing it's because of some code panicing when it can't an MSF file and just deleting it on the assumption it's corrupt... but I really don't know.
In any case, it seems like something worth writing up in it's own bug?

There is the MsgDBCacheManager at https://searchfox.org/comm-central/source/mailnews/base/src/MsgDBCacheManager.jsm that keeps a few recently used databases open. But other than those all others should get automatically closed. I also think some are not closed as they should and we have some reports about it. Your analysis is theoretically correct, but we couldn't find the problem in the internals yet.

(In reply to Ben Campbell from comment #9)

Why do .MSF file get deleted when you hit the maximum-open-file limit?

This is fixed (or at least mitigated) in our patch:
https://github.com/Betterbird/thunderbird-patches/blob/13c1387b48b27c9666978396554c9c6b9c4c23fc/102/bugs/1787609-fix-findMsgIdInFolder.patch#L166

(In reply to b4 from comment #11)

This is fixed (or at least mitigated) in our patch:
https://github.com/Betterbird/thunderbird-patches/blob/13c1387b48b27c9666978396554c9c6b9c4c23fc/102/bugs/1787609-fix-findMsgIdInFolder.patch#L166

Doh. I did read the patch too! That's obviously where my vague feeling that something was panicing and deleting the MSF came from!

Hmm. I don't see test_ext_messages.js failing here.
This is on a linux debug build... something platform-dependent maybe?

Flags: needinfo?(benc)

I did see it on Linux debug build.

Just pulled the latest patches (over the last few hours) and rebuilt... but still no fail here.

I sent it to try-c-c:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=1211c005e8bd8b2f60838137876d9d4282ed9d66

(In reply to Ben Campbell from comment #13)

Hmm. I don't see test_ext_messages.js failing here.

Not with the patch as it is. If fails when "force closed" is unconditional. We explained that in comment #5.

Flags: needinfo?(benc)

Yes, you need to modify the patch in nsMsgDBFolder::CloseDBIfFolderNotOpen to call mDatabase->ForceClosed(); ignoring aForceClosed.

OK, I've spent the morning printf debugging, and here's what I've got:

Overall:

A database is being ForceClosed() during the test, and there's still a nsMsgHdr held somewhere which references it.
I guess that's not too surprising - lots of places hold nsIMsgDBHdr pointers. There's no way for the database to know about them all and to ask them to relinquish their DB refcounts... so even ForceClosed() can't really guarantee that a database will be freed.

If file handles are bound to nsMsgDatabase lifetimes, then ForceClosed() is just inherently doomed to be a bit crap. There's a bit of stuff in the DB to track outstanding message iterators, but there's no way to be sure there aren't other things holding a refcount on the DB.

So, if the patch fixes the immediate issue, I say we should just land it as is, and I'll open a new bug to try and sort out DB lifetime management properly.

The Details:

Making ForceClosed() unconditional, the failing line in the test run is indeed:

let junkScore = parseInt(msgHdr.getProperty("junkscore"), 10) || 0;

JavaScript error: chrome://messenger/content/parent/ext-mail.js, line 1839: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIMsgDBHdr.getProperty]

The failure is occurring in nsMsgHdr::GetProperty(). It bails out because m_mdbRow is null.

Working back from there, I can see that particular nsMsgHdr previously being created fine, with a database (m_mdb) and a non-null row (m_mdbRow). Here's a heavily-edited exerpt of some of my printfs while the test runs:

 0:04.39 pid:408695 nsMsgHdr::nsMsgHdr(db=0x7f2399efb120, row=0x7f237714ef28) this=0x7f237714efa0

   [ So our nsMsgHdr has now been created, with a non-null db and row. ]

 0:04.39 pid:408695 nsMsgHdr::SetMessageKey() this=0x7f237714efa0 mdb=0x7f2399efb120 row=0x7f237714ef28
 0:04.39 pid:408695 nsMsgHdr::SetStringProperty('storeToken') this=0x7f237714efa0 db=0x7f2399efb120 row=0x7f237714ef28

   [... snip lots of other nsMsgHdr calls, all with the same db and row as in the ctor... ]

 0:04.40 pid:408695 nsMsgDatabase::ForceClosed() this=0x7f2399efb120

   [and there goes our database - byebye! ]

 0:04.40 pid:408695 nsMsgHdr::GetProperty('junkscore') this=0x7f237714efa0 db=0x7f2399efb120 row=(nil) canary='XXXXXXXX'

   [uhoh... row is now null!]

 0:04.40 pid:408695     ===> NS_ERROR_NULL_POINTER!!!
 0:04.40 pid:408695 JavaScript error: chrome://messenger/content/parent/ext-mail.js, line 1842: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIMsgDBHdr.getProperty]

m_mdbRow is private and the only place I can see which clears it is nsMsgHdr::ReleaseMDBRow(). But that's not triggering at all, according to the printf I added to it.

I did think maybe something else might have been blatting over the memory... but I replaced the m_mdbRow pointer with a char[8] filled with 'X's, and moved m_mdbRow elsewhere in the class. And my 'X's remained intact (you can see the canary='XXXXXXXX' in the output above). So I can't be 100% sure, but I don't think it's a memory overwrite.

So something is clearing out m_mdbRow, but I have no idea what or how :-(
And it's not the real issue anyway, so I'm going to stop digging.

One other mystery:
The failing GetProperty() dies because m_mdbRow is null, and returns NS_ERROR_NULL_POINTER.
But, by the time is shows up as a javascript error in the test framework, it's NS_ERROR_ILLEGAL_VALUE.
Why?

Oh, the failure happens for local folders - you can bypass the other runs of the same test with --tag local:

./mach xpcshell-test --tag local comm/mail/components/extensions/test/xpcshell/test_ext_messages.js
Flags: needinfo?(benc)

Wrote up my "bigger-picture" approach in Bug 1795178.

See Also: → 1624699

Ben, how about taking this bug here while bug 1795178 is still open? This bug here can also be triggered when clicking on a mid: URL since the call stack involves getMsgHdrForMsgId():

     getMsgHdrForMsgId resource:///modules/MailUtils.jsm:654
     openMessageByMessageId resource:///modules/MailUtils.jsm:352
     _handleMidLink resource:///actors/MailLinkParent.jsm:74

(We heard somewhere that mid: URLs are broken in trunk, but the work in 102.)

We've just written up a summary in bug 1554188 comment #31. Looks like both bug 1554188 and likely also bug 1551173 comment #25 are caused by too many open files during search.

Flags: needinfo?(benc)

(In reply to b5 from comment #20)

Ben, how about taking this bug here while bug 1795178 is still open?

I suspect bug 1795178 might be easier to fix. Besides, I'm sick of trying to use sticking plasters to fix transverse fractures.

Flags: needinfo?(benc)
Assignee: nobody → mkmelin+mozilla
Status: NEW → ASSIGNED

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/7c043976746e
Fix findMsgIdInFolder(): Tolerate missing MSF and close database after use. Related backend changes. r=benc

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch

Timing and risk wise, do we want this to ride nightly until the next merge in two weeks? (and two weeks before holiday)
Or should we prefer to uplift earlier, to 109.0b3 in one week?

Component: General → Database

Comment on attachment 9304373 [details]
Bug 1787609 - Fix findMsgIdInFolder(): Tolerate missing MSF and close database after use. Related backend changes. r=benc

[Approval Request Comment]
Regression caused by (bug #): unknown
User impact if declined: potential msf loss
Testing completed (on c-c, etc.): c-c
Risk to taking this patch (and alternatives if risky): moderate, this is touchy code. Probably best to put on beta now to let it bake there.

Attachment #9304373 - Flags: approval-comm-esr102?
Attachment #9304373 - Flags: approval-comm-beta?
Whiteboard: [TM:102.6.0]

Comment on attachment 9304373 [details]
Bug 1787609 - Fix findMsgIdInFolder(): Tolerate missing MSF and close database after use. Related backend changes. r=benc

[Triage Comment]
Approved for beta

Attachment #9304373 - Flags: approval-comm-beta? → approval-comm-beta+
See Also: → 1778380

Comment on attachment 9304373 [details]
Bug 1787609 - Fix findMsgIdInFolder(): Tolerate missing MSF and close database after use. Related backend changes. r=benc

[Triage Comment]
Approved for esr102

Attachment #9304373 - Flags: approval-comm-esr102? → approval-comm-esr102+
Regressions: 1806078
See Also: → 1888791
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: