Closed Bug 1773822 Opened 2 years ago Closed 2 years ago

Massive loss of MSF files when upgrading from 91 to 102

Categories

(MailNews Core :: Database, defect)

Thunderbird 102
defect

Tracking

(thunderbird_esr102 fixed, thunderbird102+ affected)

RESOLVED FIXED
103 Branch
Tracking Status
thunderbird_esr102 --- fixed
thunderbird102 + affected

People

(Reporter: rachel, Assigned: mkmelin)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression)

Attachments

(2 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0

Steps to reproduce:

We upgraded from TB 91 ESR to TB 102 beta and were greeted by a few messages that moving messages to certain folders failed, the usual message that is displayed if the folder is deemed "corrupt", for example because the MSF file is missing.

Comparing the profile to a backup it turned out that about 1100 MSF files have disappeared. This is likely due to the migration of panacea.dat, it is known that removal of panacea.dat causes immediate loss of MSF files, see bug 1093217 comment #5, bug 1551173 comment #22 (about creating zero-byte MSFs temporarily).

It's possible that this is also the real cause of bug 1773511. If the Sent folder's MSF got lost, the sent message likely cant be move there. It's unclear how the fcc_folder pref got changed though.

Ben, have you seen lost MSF files in the context of the panacea.dat migration?

Flags: needinfo?(benc)
Blocks: tb102found

The next backup run showed that 853 MSF files had disappeared again after using version 102 for about a week.

We looked at this a bit more. Bug 1774072 could well be the culprit here.

The consequence of a folder not having a valid MSF leads to the deletion of the MSF:
https://searchfox.org/comm-central/rev/449dac8d0f363633df9e1d4f10fa5515773a2d97/mailnews/db/msgdb/src/nsMsgDatabase.cpp#1185

Validity is checked here
https://searchfox.org/comm-central/rev/449dac8d0f363633df9e1d4f10fa5515773a2d97/mailnews/local/src/nsMsgBrkMBoxStore.cpp#184
and the check relies on the comparison of unread messages, folder size, etc. which is compared to the data from the folder cache:
https://searchfox.org/comm-central/rev/449dac8d0f363633df9e1d4f10fa5515773a2d97/mailnews/local/src/nsMsgBrkMBoxStore.cpp#209-236

If this data is incorrect because the folder cache wasn't flushed and contains stale data, then hell breaks loose.

Maybe the initial loss of MSF files was due to a failed conversion of panacea.dat triggering valid MSF files to be declared invalid and deleted as detailed in the previous comment.

Well, even with the fix from bug 1774072, MSF files keep disappearing, this time 845. Restoring them from backup and opening those corresponding folders doesn't show any visual indication that the MSF would be rebuilt and therefore was invalid and worth deleting.

Component: Untriaged → Database
Product: Thunderbird → MailNews Core

This should catch the issue, the massive loss appears to occur at least every 24 hours. Our hunch is that something will be wrong with the folder cache and therefore valid MSFs well be deleted.

This already caught some nonsense processing:

=== nsMsgDatabase::CheckForErrors at Sun Jun 19 23:30:09 2022
=== nsMsgDatabase::CheckForErrors file C:\Users\xxx\AppData\Local\Temp\MozillaMailnews\Ab.msf
=== nsMsgDatabase::CheckForErrors: (1, expected) err: 80550006, exists: 0, size: 2676119275016, newFile: 0
=== nsMsgDatabase::CheckForErrors: (7) err: 80550006
=== nsMsgDatabase::CheckForErrors: (8) err: 80550005, deleteInvalidDB: 1
=== nsMsgDatabase::CheckForErrors: !!!!!!!!!!!!!!!!!!!!!
=== nsMsgDatabase::CheckForErrors: (9) REMOVING, err: 80550005
=== nsMsgDatabase::CheckForErrors: !!!!!!!!!!!!!!!!!!!!!
=== nsMsgDatabase::CheckForErrors: (10, expected) err: 80550005, newFile: 0, m_leaveInvalidDB: 1

This is for a temp copy created by nsMsgDBFolder::CreateBackupDirectory():
So the function gets called with 80550006 (NS_MSG_ERROR_FOLDER_SUMMARY_MISSING)
https://searchfox.org/comm-central/rev/a19f8cff3316602d1770b261eaf4f29e6c80d6cb/mailnews/base/public/msgCore.h#108
That is switched to 80550005 (NS_MSG_ERROR_FOLDER_SUMMARY_OUT_OF_DATE) and the code proceeds to delete the non-existent file. fileSize wasn't initialised and carries garbage. Not good.

Looking at the debug output, it becomes clear that TB doesn't check so many folders during "normal" operation. However, using the "Quick Folder Move" addon, it seems that all folders are accessed and checked for validity:
https://github.com/kewisch/quickmove-extension/blob/main/src/content/quickmove.js#L314
That would explain a mass exodus of 800+ MSFs in one hit. BTW, QFM is much slower in showing the pop-up menus with all the folders in the system compared to TB 91, so some backend changes must have caused this slowness.

OK, we found what the problem is. QFM triggers the opening of all folders in the profile which will eventually exceed the 256 open files permissible on Windows. Our logging crashed, and protecting against failure of the fopen call, we see:

!!! nsMsgDatabase::CheckForErrors: Couldn't open log file
[Parent 12304, Main Thread] ###!!! ASSERTION: morkBool_kFalse: 'Error', file C:/mozilla-source/mozilla-central/comm/mailnews/db/mork/morkConfig.cpp:20

Not to our surprise, about 800 MFS got blown away in a single hit, likely all the ones that couldn't be opened. It's likely that the panacea.dat to folderCache.json conversion during the upgrade from 91 to 102 suffered the same issue.

Attachment #9281948 - Attachment is obsolete: true

Just for the record, this a typical stack:

xul.dll!nsMsgBrkMBoxStore::IsSummaryFileValid(nsIMsgFolder * aFolder, nsIMsgDatabase * aDB, bool * aResult) Line 187	C++
xul.dll!nsMailDatabase::GetSummaryValid(bool * aResult) Line 115	C++
xul.dll!nsMsgDatabase::CheckForErrors(nsresult err, bool sync, nsMsgDBService * aDBService, nsIFile * summaryFile) Line 1194	C++
xul.dll!nsMsgDatabase::OpenInternal(nsMsgDBService * aDBService, nsIFile * summaryFile, bool aCreate, bool aLeaveInvalidDB, bool sync) Line 1159	C++
xul.dll!nsMsgDatabase::Open(nsMsgDBService * aDBService, nsIFile * aFolderName, bool aCreate, bool aLeaveInvalidDB) Line 1130	C++
xul.dll!nsMailDatabase::Open(nsMsgDBService * aDBService, nsIFile * aSummaryFile, bool aCreate, bool aUpgrading) Line 48	C++
xul.dll!nsMsgDBService::OpenFolderDB(nsIMsgFolder * aFolder, bool aLeaveInvalidDB, nsIMsgDatabase * * _retval) Line 180	C++
xul.dll!nsMsgLocalMailFolder::OpenDatabase() Line 836	C++
xul.dll!nsMsgLocalMailFolder::GetDBFolderInfoAndDB(nsIDBFolderInfo * * folderInfo, nsIMsgDatabase * * db) Line 873	C++
xul.dll!nsMsgDBFolder::GetStringProperty(const char * propertyName, nsTSubstring<char> & propertyValue) Line 2018	C++

GetStringProperty is called from the add-on on all folders :-(
https://github.com/kewisch/quickmove-extension/blob/1774546923c4667fc93096a6ff4e3a04a6269bec/src/content/quickmove.js#L288

This is a little surprising since using QFM in TB 91 didn't usually cause an MSF exodus. As stated in comment #8, it's also much slower these days, perhaps the slowness comes from deleting 800 files from the system. So some backend changes have led to databases not being closed any more.

All this looks like an issue introduce with the panacea->JSON conversion. The add-on retrieves MRUTime which is mostly not present in the cache, hence all the databases are opened and left open. Not surprisingly bug 1726319 (https://hg.mozilla.org/comm-central/rev/aba71e533211) already somewhat papered over the issue, for a different attribute. Our solution:
https://github.com/Betterbird/thunderbird-patches/blob/main/102/bugs/1773822-no-mass-msf-exodus.patch

We suggest to address this quickly together with bug 1774072 since blowing away users' MSF files on Windows won't be popular.

Assignee: nobody → mkmelin+mozilla
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

You should check whether the panacea.dat->json conversion works as desired (bug 418551) on Windows where you can only open 256 files at the same time. The bug was originally reported as "Massive loss of MSF files when upgrading from 91 to 102", see comment #0. At that point, no QFM add-on had been exercised which later turned out to mass kill MSFs. Briefly looking at https://hg.mozilla.org/comm-central/rev/5891b788e8bd#l12.661 is doesn't appear to open databases, but it's still possible that the conversion isn't right, so later the databases need to be opened overrunning the 256-open-files-limit because the properties weren't migrated properly. This suspicion exists since what appears to be a "cover up" in bug 1726319 comment #207.

This seems to be the cause of bug 1773401. It's likely that the recent menu is populated the same way the QFM add-on does.

Blocks: 1773401

(In reply to Rachel Martin from comment #13)

You should check whether the panacea.dat->json conversion works as desired (bug 418551) on Windows where you can only open 256 files at the same time.

The actual conversion only opens the single panacea.dat mork DB, so that should be fine.

but it's still possible that the conversion isn't right, so later the databases need to be opened overrunning the 256-open-files-limit because the properties weren't migrated properly. This suspicion exists since what appears to be a "cover up" in bug 1726319 comment #207.

This seems possible. There's no master list of properties that go into the folder cache (https://hg.mozilla.org/comm-central/rev/5891b788e8bd#l12.575).
By default the migration just copies the values verbatim, but since integer values are stored in mork as hex, they need special handling to convert.
I'd thought I'd managed to grep the codebase for all the integer properties, but maybe there's some missing?
A smoking gun indicating a bad conversion would be foldercache.json containing a hex string value in what should be a numeric property. If you've got an affected folderCache.json can you spot any such cases there?

Flags: needinfo?(benc)

The QFM add-on or places where TB displays a list of recent folders itself (bug 1773401) rely on the MRUTime attribute (https://searchfox.org/comm-central/rev/1960b28f43dd0fa96394129ea9c4a552d873e127/mail/base/content/folderPane.js#2256). Looking at our folderCache.json file, there are 1876 folders, 1306 have "MRUTime" : null and 540 have a value, looks like 30 don't have it at all. We find that surprising since after QFM has retrieved all the properties they should be in the cache for next time, right? So why are there still 1306 entries with null?

As mentioned, QFM building the list of recent folders hitting all folders in the system is woefully slow in 102. Maybe that's because the new cache is generally slower or maybe it's because the values aren't cached. The patch only addresses running over the Windows 256-open-files limit which leads to errors opening the 257th MSF which is then deleted as invalid, also a somewhat suboptimal algorithm.

You should experiment with the add-on on a setup with 2000 folders to see what happens.

Target Milestone: --- → 103 Branch

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/7c180ced774e
Prevent mass exodus of MSF files by closing DB after getting attribute. r=benc

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

(In reply to Rachel Martin from comment #16)

We find that surprising since after QFM has retrieved all the properties they should be in the cache for next time, right? So why are there still 1306 entries with null?

To be continued in bug 1776280.

(In reply to Rachel Martin from comment #16)

You should experiment with the add-on on a setup with 2000 folders to see what happens.

We tested the conversion from panacea to JSON on a small profile where QFM had been run, so all the folders should have had their MRUTime property cached in panacea.dat. A pity that this file is so hard to read. After opening the profile in 102, all folders had MRUTime set in the new folder cache. Maybe the bad results on our 2000-folder production setup was a mix of bug 1774072, bug 1773822 (this one here) and bug 1776280.

Comment on attachment 9282094 [details]
Bug 1773822 - Prevent mass exodus of MSF files by closing DB after getting attribute. r=benc

[Approval Request Comment]
Regression caused by (bug #):
User impact if declined: User perceives data loss
Testing completed (on c-c, etc.): On 103.0b1
Risk to taking this patch (and alternatives if risky):

Attachment #9282094 - Flags: approval-comm-esr102?

Comment on attachment 9282094 [details]
Bug 1773822 - Prevent mass exodus of MSF files by closing DB after getting attribute. r=benc

[Triage Comment]
Approved for esr102

It should be a solid patch, but a little hesitant to take both this and bug 1774072 at the same time, with beta being live less than 24 hours. However, by the time 102.0.1 ships the patch will have been on beta for 5-6 days. So we can decide in a few days against taking it if need be.

Attachment #9282094 - Flags: approval-comm-esr102? → approval-comm-esr102+
Keywords: regression
Regressed by: 418551
See Also: → 1777606
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: