Massive loss of MSF files when upgrading from 91 to 102
Categories
(MailNews Core :: Database, defect)
Tracking
(thunderbird_esr102 fixed, thunderbird102+ affected)
People
(Reporter: rachel, Assigned: mkmelin)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: regression)
Attachments
(2 files, 1 obsolete file)
9.04 KB,
patch
|
Details | Diff | Splinter Review | |
48 bytes,
text/x-phabricator-request
|
wsmwk
:
approval-comm-esr102+
|
Details | Review |
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0
Steps to reproduce:
We upgraded from TB 91 ESR to TB 102 beta and were greeted by a few messages that moving messages to certain folders failed, the usual message that is displayed if the folder is deemed "corrupt", for example because the MSF file is missing.
Comparing the profile to a backup it turned out that about 1100 MSF files have disappeared. This is likely due to the migration of panacea.dat, it is known that removal of panacea.dat causes immediate loss of MSF files, see bug 1093217 comment #5, bug 1551173 comment #22 (about creating zero-byte MSFs temporarily).
It's possible that this is also the real cause of bug 1773511. If the Sent folder's MSF got lost, the sent message likely cant be move there. It's unclear how the fcc_folder pref got changed though.
Reporter | ||
Comment 1•2 years ago
|
||
Ben, have you seen lost MSF files in the context of the panacea.dat migration?
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Comment 2•2 years ago
|
||
The next backup run showed that 853 MSF files had disappeared again after using version 102 for about a week.
Reporter | ||
Comment 3•2 years ago
|
||
We looked at this a bit more. Bug 1774072 could well be the culprit here.
The consequence of a folder not having a valid MSF leads to the deletion of the MSF:
https://searchfox.org/comm-central/rev/449dac8d0f363633df9e1d4f10fa5515773a2d97/mailnews/db/msgdb/src/nsMsgDatabase.cpp#1185
Validity is checked here
https://searchfox.org/comm-central/rev/449dac8d0f363633df9e1d4f10fa5515773a2d97/mailnews/local/src/nsMsgBrkMBoxStore.cpp#184
and the check relies on the comparison of unread messages, folder size, etc. which is compared to the data from the folder cache:
https://searchfox.org/comm-central/rev/449dac8d0f363633df9e1d4f10fa5515773a2d97/mailnews/local/src/nsMsgBrkMBoxStore.cpp#209-236
If this data is incorrect because the folder cache wasn't flushed and contains stale data, then hell breaks loose.
Reporter | ||
Comment 4•2 years ago
|
||
Maybe the initial loss of MSF files was due to a failed conversion of panacea.dat triggering valid MSF files to be declared invalid and deleted as detailed in the previous comment.
Reporter | ||
Comment 5•2 years ago
|
||
Well, even with the fix from bug 1774072, MSF files keep disappearing, this time 845. Restoring them from backup and opening those corresponding folders doesn't show any visual indication that the MSF would be rebuilt and therefore was invalid and worth deleting.
Updated•2 years ago
|
Reporter | ||
Comment 6•2 years ago
|
||
This should catch the issue, the massive loss appears to occur at least every 24 hours. Our hunch is that something will be wrong with the folder cache and therefore valid MSFs well be deleted.
Reporter | ||
Comment 7•2 years ago
|
||
This already caught some nonsense processing:
=== nsMsgDatabase::CheckForErrors at Sun Jun 19 23:30:09 2022
=== nsMsgDatabase::CheckForErrors file C:\Users\xxx\AppData\Local\Temp\MozillaMailnews\Ab.msf
=== nsMsgDatabase::CheckForErrors: (1, expected) err: 80550006, exists: 0, size: 2676119275016, newFile: 0
=== nsMsgDatabase::CheckForErrors: (7) err: 80550006
=== nsMsgDatabase::CheckForErrors: (8) err: 80550005, deleteInvalidDB: 1
=== nsMsgDatabase::CheckForErrors: !!!!!!!!!!!!!!!!!!!!!
=== nsMsgDatabase::CheckForErrors: (9) REMOVING, err: 80550005
=== nsMsgDatabase::CheckForErrors: !!!!!!!!!!!!!!!!!!!!!
=== nsMsgDatabase::CheckForErrors: (10, expected) err: 80550005, newFile: 0, m_leaveInvalidDB: 1
This is for a temp copy created by nsMsgDBFolder::CreateBackupDirectory()
:
So the function gets called with 80550006 (NS_MSG_ERROR_FOLDER_SUMMARY_MISSING)
https://searchfox.org/comm-central/rev/a19f8cff3316602d1770b261eaf4f29e6c80d6cb/mailnews/base/public/msgCore.h#108
That is switched to 80550005 (NS_MSG_ERROR_FOLDER_SUMMARY_OUT_OF_DATE) and the code proceeds to delete the non-existent file. fileSize
wasn't initialised and carries garbage. Not good.
Reporter | ||
Comment 8•2 years ago
|
||
Looking at the debug output, it becomes clear that TB doesn't check so many folders during "normal" operation. However, using the "Quick Folder Move" addon, it seems that all folders are accessed and checked for validity:
https://github.com/kewisch/quickmove-extension/blob/main/src/content/quickmove.js#L314
That would explain a mass exodus of 800+ MSFs in one hit. BTW, QFM is much slower in showing the pop-up menus with all the folders in the system compared to TB 91, so some backend changes must have caused this slowness.
Reporter | ||
Comment 9•2 years ago
|
||
OK, we found what the problem is. QFM triggers the opening of all folders in the profile which will eventually exceed the 256 open files permissible on Windows. Our logging crashed, and protecting against failure of the fopen
call, we see:
!!! nsMsgDatabase::CheckForErrors: Couldn't open log file
[Parent 12304, Main Thread] ###!!! ASSERTION: morkBool_kFalse: 'Error', file C:/mozilla-source/mozilla-central/comm/mailnews/db/mork/morkConfig.cpp:20
Not to our surprise, about 800 MFS got blown away in a single hit, likely all the ones that couldn't be opened. It's likely that the panacea.dat to folderCache.json conversion during the upgrade from 91 to 102 suffered the same issue.
Reporter | ||
Comment 10•2 years ago
|
||
Just for the record, this a typical stack:
xul.dll!nsMsgBrkMBoxStore::IsSummaryFileValid(nsIMsgFolder * aFolder, nsIMsgDatabase * aDB, bool * aResult) Line 187 C++
xul.dll!nsMailDatabase::GetSummaryValid(bool * aResult) Line 115 C++
xul.dll!nsMsgDatabase::CheckForErrors(nsresult err, bool sync, nsMsgDBService * aDBService, nsIFile * summaryFile) Line 1194 C++
xul.dll!nsMsgDatabase::OpenInternal(nsMsgDBService * aDBService, nsIFile * summaryFile, bool aCreate, bool aLeaveInvalidDB, bool sync) Line 1159 C++
xul.dll!nsMsgDatabase::Open(nsMsgDBService * aDBService, nsIFile * aFolderName, bool aCreate, bool aLeaveInvalidDB) Line 1130 C++
xul.dll!nsMailDatabase::Open(nsMsgDBService * aDBService, nsIFile * aSummaryFile, bool aCreate, bool aUpgrading) Line 48 C++
xul.dll!nsMsgDBService::OpenFolderDB(nsIMsgFolder * aFolder, bool aLeaveInvalidDB, nsIMsgDatabase * * _retval) Line 180 C++
xul.dll!nsMsgLocalMailFolder::OpenDatabase() Line 836 C++
xul.dll!nsMsgLocalMailFolder::GetDBFolderInfoAndDB(nsIDBFolderInfo * * folderInfo, nsIMsgDatabase * * db) Line 873 C++
xul.dll!nsMsgDBFolder::GetStringProperty(const char * propertyName, nsTSubstring<char> & propertyValue) Line 2018 C++
GetStringProperty is called from the add-on on all folders :-(
https://github.com/kewisch/quickmove-extension/blob/1774546923c4667fc93096a6ff4e3a04a6269bec/src/content/quickmove.js#L288
This is a little surprising since using QFM in TB 91 didn't usually cause an MSF exodus. As stated in comment #8, it's also much slower these days, perhaps the slowness comes from deleting 800 files from the system. So some backend changes have led to databases not being closed any more.
Reporter | ||
Comment 11•2 years ago
|
||
All this looks like an issue introduce with the panacea->JSON conversion. The add-on retrieves MRUTime which is mostly not present in the cache, hence all the databases are opened and left open. Not surprisingly bug 1726319 (https://hg.mozilla.org/comm-central/rev/aba71e533211) already somewhat papered over the issue, for a different attribute. Our solution:
https://github.com/Betterbird/thunderbird-patches/blob/main/102/bugs/1773822-no-mass-msf-exodus.patch
We suggest to address this quickly together with bug 1774072 since blowing away users' MSF files on Windows won't be popular.
Assignee | ||
Comment 12•2 years ago
|
||
Updated•2 years ago
|
Reporter | ||
Comment 13•2 years ago
|
||
You should check whether the panacea.dat->json conversion works as desired (bug 418551) on Windows where you can only open 256 files at the same time. The bug was originally reported as "Massive loss of MSF files when upgrading from 91 to 102", see comment #0. At that point, no QFM add-on had been exercised which later turned out to mass kill MSFs. Briefly looking at https://hg.mozilla.org/comm-central/rev/5891b788e8bd#l12.661 is doesn't appear to open databases, but it's still possible that the conversion isn't right, so later the databases need to be opened overrunning the 256-open-files-limit because the properties weren't migrated properly. This suspicion exists since what appears to be a "cover up" in bug 1726319 comment #207.
Reporter | ||
Comment 14•2 years ago
|
||
This seems to be the cause of bug 1773401. It's likely that the recent menu is populated the same way the QFM add-on does.
Comment 15•2 years ago
|
||
(In reply to Rachel Martin from comment #13)
You should check whether the panacea.dat->json conversion works as desired (bug 418551) on Windows where you can only open 256 files at the same time.
The actual conversion only opens the single panacea.dat mork DB, so that should be fine.
but it's still possible that the conversion isn't right, so later the databases need to be opened overrunning the 256-open-files-limit because the properties weren't migrated properly. This suspicion exists since what appears to be a "cover up" in bug 1726319 comment #207.
This seems possible. There's no master list of properties that go into the folder cache (https://hg.mozilla.org/comm-central/rev/5891b788e8bd#l12.575).
By default the migration just copies the values verbatim, but since integer values are stored in mork as hex, they need special handling to convert.
I'd thought I'd managed to grep the codebase for all the integer properties, but maybe there's some missing?
A smoking gun indicating a bad conversion would be foldercache.json containing a hex string value in what should be a numeric property. If you've got an affected folderCache.json can you spot any such cases there?
Reporter | ||
Comment 16•2 years ago
|
||
The QFM add-on or places where TB displays a list of recent folders itself (bug 1773401) rely on the MRUTime attribute (https://searchfox.org/comm-central/rev/1960b28f43dd0fa96394129ea9c4a552d873e127/mail/base/content/folderPane.js#2256). Looking at our folderCache.json file, there are 1876 folders, 1306 have "MRUTime" : null
and 540 have a value, looks like 30 don't have it at all. We find that surprising since after QFM has retrieved all the properties they should be in the cache for next time, right? So why are there still 1306 entries with null?
As mentioned, QFM building the list of recent folders hitting all folders in the system is woefully slow in 102. Maybe that's because the new cache is generally slower or maybe it's because the values aren't cached. The patch only addresses running over the Windows 256-open-files limit which leads to errors opening the 257th MSF which is then deleted as invalid, also a somewhat suboptimal algorithm.
You should experiment with the add-on on a setup with 2000 folders to see what happens.
Assignee | ||
Updated•2 years ago
|
Comment 17•2 years ago
|
||
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/7c180ced774e
Prevent mass exodus of MSF files by closing DB after getting attribute. r=benc
Reporter | ||
Comment 18•2 years ago
|
||
(In reply to Rachel Martin from comment #16)
We find that surprising since after QFM has retrieved all the properties they should be in the cache for next time, right? So why are there still 1306 entries with null?
To be continued in bug 1776280.
Reporter | ||
Comment 19•2 years ago
|
||
(In reply to Rachel Martin from comment #16)
You should experiment with the add-on on a setup with 2000 folders to see what happens.
We tested the conversion from panacea to JSON on a small profile where QFM had been run, so all the folders should have had their MRUTime property cached in panacea.dat. A pity that this file is so hard to read. After opening the profile in 102, all folders had MRUTime set in the new folder cache. Maybe the bad results on our 2000-folder production setup was a mix of bug 1774072, bug 1773822 (this one here) and bug 1776280.
Comment 20•2 years ago
|
||
Comment on attachment 9282094 [details]
Bug 1773822 - Prevent mass exodus of MSF files by closing DB after getting attribute. r=benc
[Approval Request Comment]
Regression caused by (bug #):
User impact if declined: User perceives data loss
Testing completed (on c-c, etc.): On 103.0b1
Risk to taking this patch (and alternatives if risky):
Comment 21•2 years ago
|
||
Comment on attachment 9282094 [details]
Bug 1773822 - Prevent mass exodus of MSF files by closing DB after getting attribute. r=benc
[Triage Comment]
Approved for esr102
It should be a solid patch, but a little hesitant to take both this and bug 1774072 at the same time, with beta being live less than 24 hours. However, by the time 102.0.1 ships the patch will have been on beta for 5-6 days. So we can decide in a few days against taking it if need be.
Comment 22•2 years ago
|
||
bugherder uplift |
Thunderbird 102.0.1:
https://hg.mozilla.org/releases/comm-esr102/rev/d327573d7672
Updated•2 years ago
|
Description
•