Open Bug 845952 (maildirblockers) Opened 8 years ago Updated 1 month ago

finish "maildir" message storage [meta]

Categories

(MailNews Core :: Database, defect)

defect
Not set
normal

Tracking

(Not tracked)

People

(Reporter: wsmwk, Assigned: benc)

References

(Depends on 21 open bugs, Blocks 2 open bugs)

Details

(Keywords: feature, meta, user-doc-needed, Whiteboard: [maildir])

This meta bug is to organize the work to finish "maildir" message storage. The blockers are bugs from [1] and from bug 58308 (which is about qmail and won't block this). I also include fixed bug 402392 and bug 738651. The end goal is to enable maildir as the default, but the meta bug could certainly live on to finish straggling issues after maildir is enabled by default.

To do list:
- Someone to file a bug for "enable maildir as default message store"
- check to see if I have missed anything in blockers
- remove anything in blockers that doesn't exactly belong

[1] https://bugzilla.mozilla.org/buglist.cgi?list_id=5785777;field0-0-0=short_desc;type0-0-1=substring;field0-0-1=status_whiteboard;resolution=---;query_format=advanced;value0-0-1=maildir;type0-0-0=anywordssubstr;value0-0-0=maildir%20mail-dir;product=MailNews%20Core;product=Thunderbird
Adding another task to the to-do list:

- Tool to optimize mail storage (convert mbox to maildir in existing installations)
No longer depends on: 537626
No longer depends on: 584695
Depends on: 58308, 852080, 855950, 855954
Whiteboard: [maildir]
No longer depends on: 789679
Depends on: 856087
The "maildir" in the summary of bug 534135 refers to pure "Maildir" as used by an IMAP server, not Tb's maildir-like store (Maildir Lite, ...). Removing it from the dependency list.
No longer depends on: 534135
One day (after the >4GB mbox work) I'd like to try out the maildir-lite support and see if I can polish some of the bugs reported here.
FYI.

Quick summary of phenomena observed in some basic tests with a recent trunk nightly (Tb 22.0a0).

The current maildir store behaves as follows:
- RETR to LocalMbox/tmp/nnnn, then move to LocalMbox/cur/nnnn: works.
- fetch body[] to IMAPMbox/tmp/nnnn, then move to IMAPMbox/cur/nnnn: works.
- Copy/Move of mail(s) is always handled the same way as
  Copy/Move from LocalSource/cur/nnnn to LocalTargetMbox/cur/nnnn.
- No handling for Copy/Move when the source/target folder is berkeleystore:
    there is no Mbox/cur/nnnn, so it misbehaves.
- No handling for Copy/Move when the source/target folder is maildirstore/IMAP:
  - Offline-Use=Off :
    No offline-store file (Mbox/cur/nnnn) => misbehaves.
  - Offline-Use=On :
    Copy/Move from/to IMAP_maildir is
       the same Copy/Move as Copy/Move from/to Local_maildir
     + re-synchronization with the server because of IMAP, if Move.
    So, on a move from an IMAP folder, IMAP_maildir/cur/nnnn is
    simply moved to maildir_Target_folder/cur/nnnn,
    as is done in a move from maildir_local_mbox to maildir_local_mbox.
    Then, when the IMAP Mbox is opened after the move, the mail is fetched again.
- It looks like return codes, status etc. are not checked in many places.
  So an empty Mbox/cur/nnnn directory, or an Mbox/cur/nnnn file of size 0,
  is easily created in many cases.
- Unique filename generation has holes, so the same cur/nnnn file can be
  used by multiple mails.
- If the directory name for an Mbox is suffixed:
    Mbox names are case sensitive on the IMAP server.
    File names in the client file system are case insensitive.
  abc/cur for abc, Abc-1/cur for Abc, ABC-2/cur for ABC, are used.
  In this case, the association between Mbox name and directory name
  is easily broken by renaming a folder, or by unsubscribe/subscribe (due to
  a known bug).
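
To illustrate the points above about unchecked return codes and holes in unique filename generation, here is a minimal Python sketch. It is not Thunderbird's code. The function names (unique_maildir_name, deliver), the name layout, and the host-name sanitising are assumptions made for illustration, loosely following the time/pid/counter/host recipe suggested for Maildir. The tmp/ then cur/ flow mirrors the behaviour described in the list.

```python
import itertools
import os
import re
import socket
import time

# Per-process counter: two deliveries in the same microsecond still differ.
_delivery_counter = itertools.count()


def unique_maildir_name() -> str:
    """Combine time, pid, a counter and the host name into one file name."""
    now = time.time()
    seconds = int(now)
    micros = int((now - seconds) * 1_000_000)
    # Assumption: replace anything unusual in the host name with "_" so the
    # result is a valid file name on common file systems.
    host = re.sub(r"[^A-Za-z0-9.-]", "_", socket.gethostname()) or "localhost"
    return f"{seconds}.M{micros}P{os.getpid()}Q{next(_delivery_counter)}.{host}"


def deliver(maildir: str, data: bytes) -> str:
    """Write the message to tmp/, then publish it in cur/ without overwriting."""
    name = unique_maildir_name()
    tmp_path = os.path.join(maildir, "tmp", name)
    cur_path = os.path.join(maildir, "cur", name)
    # O_EXCL makes creation fail rather than silently reuse an existing file.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as out:
        out.write(data)
    # link() raises FileExistsError if cur_path already exists; rename() on
    # POSIX would replace it. Assumes the file system supports hard links.
    os.link(tmp_path, cur_path)
    os.unlink(tmp_path)
    return cur_path
```

Every failure here surfaces as an exception instead of leaving a zero-length file behind unnoticed, which is the kind of checking the list above says is missing.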
An odd phenomenon was also observed.
- If auto-sync of the IMAP account is disabled,
  <server_name>.sbd is created, <server_name>.sbd\INBOX and INBOX.msf
  are created and used, and <server_name>.sbd\INBOX/cur is created
  and used. Directories/files for subfolders under INBOX are created
  in <server_name>.sbd\INBOX.sbd.
  Directories/files for all other folders are created as normal under the
  <server_name> directory (and <server_name>.msf is created, as usual).
  This may be caused by the /INBOX/Inbox folder that was intentionally
  created for bug testing.
Component: General → Database
Product: Thunderbird → MailNews Core
Hardware: x86 → All
Version: unspecified → Trunk
I've opened meta bug 859011 for the many currently known problems around "Copy/Move mails with MaildirStore".
I am moving such bugs from this bug's dependency tree to that bug, to keep this bug as the root meta bug for "followups after MaildirStore implementation".
Depends on: 797710
Depends on: 890742
Depends on: 906469
Blocks: 476239
Depends on: 816304
Depends on: 1011399
Alias: maildirblockers
Keywords: feature
Depends on: 1078367
No longer depends on: 58308
No longer depends on: 753147
No longer blocks: 1135309
Depends on: 1135309
No longer depends on: 1124948, 1135309
Is it possible to have the maildir files have the same filename [or at least the date/time part], sans the flags as they do on the server?
Deleting a folder with Thunderbird will delete the .msf file, but not the actual folder that messages are stored in.
My inbox folder has about 156k emails in it, but the corresponding file system folder has 200k files.  How can this be reconciled without downloading all the emails again?  It's taken close to 40 hours to download my whole mail store so far and it isn't quite finished yet.  The sluggishness and lock ups during this time are painful too.

I've had crashes and had to restart Thunderbird, this may be part of the problem.  I haven't deleted 40k emails either (from the inbox).

The inbox is only one of my folders, some other folders have huge numbers too.  All up I should have about 2M emails (not quite), which includes mailing list data amongst other emails going back over 10 years.

It would be much better to be able to rsync the Maildir folder and have a process adjust file names and remove tag information from the file name to the lines at the top of each email.  Then build the .msf files from the actual content ... so, basically an offline build.
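
As a rough illustration of the "remove tag information from the file name" step, here is a minimal Python sketch. It assumes the server uses the standard Maildir naming, where flags follow a ":2," marker at the end of the file name. The helper names are hypothetical, and how the flags would then be written into the message or the .msf file is left open.

```python
from __future__ import annotations

# Standard Maildir flag letters and their meanings.
FLAG_NAMES = {
    "D": "draft",
    "F": "flagged",
    "P": "passed",
    "R": "replied",
    "S": "seen",
    "T": "trashed",
}


def split_maildir_name(filename: str) -> tuple[str, str]:
    """Split "unique:2,FLAGS" into (unique, FLAGS); flags may be empty."""
    base, marker, flags = filename.partition(":2,")
    if not marker:
        return filename, ""  # no info suffix, e.g. a file still in new/
    return base, flags


def describe_flags(flags: str) -> list[str]:
    """Translate flag letters into readable names, ignoring unknown letters."""
    return [FLAG_NAMES[letter] for letter in sorted(flags) if letter in FLAG_NAMES]
```

For example, split_maildir_name("1434.M1P2.host,S=100:2,RS") separates the unique part "1434.M1P2.host,S=100" from the flags "RS".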
Did you guys even test this?

I've initiated a repair on my Inbox and now I have 16K extra files in the OS with what looks like another 120K to go.

Total emails in the Inbox (reported by TB) is identical to the server's OS directory, that is around 156K; right now the client's OS directory has 217K of email files.

Of course the problems are amplified with large mail storage, but even small test folders exhibit problems.
Andrew, AFAIK no one using maildir has experienced this problem. Please file a separate bug report describing your issue at https://bugzilla.mozilla.org/enter_bug.cgi?product=Thunderbird and answering the questions posed there, and make it block this bug, because this bug is a meta bug for overall tracking, not for fixing specific bugs. Thanks

(In reply to Andrew McGlashan from comment #8)
> My inbox folder has about 156k emails in it, but the corresponding file
> system folder has 200k files.  How can this be reconciled without
> downloading all the emails again?  It's taken close to 40 hours to download
> my whole mail store so far and it isn't quite finished yet.  The
> sluggishness and lock ups during this time are painful too.


> I've had crashes and had to restart Thunderbird, this may be part of the
> problem.  I haven't deleted 40k emails either (from the inbox).

Please post your crash IDs in the new bug report
https://support.mozilla.org/en-US/kb/mozilla-crash-reporter#w_viewing-crash-reports

> The inbox is only one of my folders, some other folders have huge numbers
> too.  All up I should have about 2M emails (not quite), which includes
> mailing list data amongst other emails going back over 10 years.
> 
> It would be much better to be able to rsync the Maildir folder and have a
> process adjust file names and remove tag information from the file name to
> the lines at the top of each email.  Then build the .msf files from the
> actual content ... so, basically an offline build.

I believe imap would not like that.
Please direct all future comments to the new bug(s) you file.
Depends on: 1176675
Depends on: 1182686
Depends on: 1044456
Duplicate of this bug: 749983
Depends on: 1259035
Depends on: 1259040
Depends on: 1261633
Blocks: 1306254
Depends on: 1275948, 1264673
Depends on: 1307017
Depends on: 1317066
Depends on: 1293770
Depends on: 1203570
Depends on: 1317117
Is there any progress on this issue?

Thomas
Depends on: 1457409
Depends on: 1214407
Depends on: 1333342
Depends on: 1472524
Depends on: 1215807
Depends on: 1486491
Duplicate of this bug: 101273
Depends on: 1491228
Depends on: 1498532
Depends on: 1504465
Depends on: 1519364
Depends on: 1529929
Depends on: 1515254
Depends on: 1526289
Depends on: 1519045
Depends on: 1586653
Depends on: 856396
Depends on: 1593455
Depends on: 1607021
No longer depends on: 1607021
Assignee: nobody → benc
Depends on: 1533624
Depends on: 1611897
Depends on: 1643901
See Also: → 533792
Depends on: 1617518

Guys, are you aware that in TB68 the maildir implementation caused imap folders to have messages with unreadable attachments? For example, many users reported that they couldn't open pdf files from attachments. The EML files on disk were fine. Repairing the folder helps, as does deleting the msf file and letting the program recreate it. So the problem was in improper msf files.

Another problem, concerning maildir local folders, was connected to msf files too. When moving messages from folder to folder in local folders (mostly more than 50 at a time), they were moved on disk but not in TB (they stayed in the msf). TB still showed them in the source folder (but clicking them said the message was unavailable). It didn't happen just "sometimes"; it happened all the time. Generally, after moving messages (when manually sorting them year by year) we had to delete the msf to get a proper message list and see what was moved and what was not.

My question is - are you aware of these problems, and did you fix them? Is msf thoroughly tested? The 78+ roadmap says that maildir is decent in 78, but I didn't find anything in the changelog about the bugs I stumbled upon. To be clear, it wasn't just on my machine. My coworker and I deployed many migrations from mbox to maildir and had these problems on almost every machine.

Currently we use mbox for imap and maildir for local folders, but prefer to move emails around in mbox because it's more stable. We convert to maildir in the end after all work is done.

Last but not least, I would name maildir emails by date then msg id because it allows sorting them yearly and totally simplifies archiving. Trying to work on one big maildir local folder consisting of mails from many years is in tb68 almost impossible without converting to mbox first.

To explain the last sentence, "I would name maildir emails by date then msg id" - I mean the files on the disk. That would allow moving them between folders by hand rather than in TB. Moving many emails in maildir local folders is very unstable, as I said before. I did write a script that extracts the date from each email message and renames the files, but it would be super cool not to have to do that in the first place. This functionality is dictated by the fact that we archive mails mostly by year. People often say: delete/archive emails older than x years.
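
For readers who want to see what such a rename involves, here is a minimal Python sketch of the idea. It is not the script mentioned above. The function name date_based_name, the "<date>_<message id>.eml" layout, the "nodate_" fallback and the choice of UTC are all assumptions made for illustration.

```python
import re
from datetime import timezone
from email.parser import BytesHeaderParser
from email.utils import parsedate_to_datetime


def date_based_name(path: str) -> str:
    """Return a "<UTC date>_<message id>.eml" name for the message at path."""
    with open(path, "rb") as fp:
        headers = BytesHeaderParser().parse(fp)  # parses the headers only

    raw_id = str(headers.get("Message-ID", "")).strip().strip("<>")
    # Keep only characters that are safe in file names on common systems.
    msgid = re.sub(r"[^A-Za-z0-9._@-]", "_", raw_id) or "no-id"

    try:
        sent = parsedate_to_datetime(str(headers.get("Date", "")))
    except (TypeError, ValueError):
        sent = None  # missing or malformed Date header
    if sent is None:
        return f"nodate_{msgid}.eml"
    if sent.tzinfo is not None:
        # Normalise to UTC so names sort consistently across senders.
        sent = sent.astimezone(timezone.utc)
    return f"{sent:%Y%m%d-%H%M%S}_{msgid}.eml"
```

The returned name could then be passed to os.rename() while TB is closed. As described above, the msf files would need to be deleted afterwards so that TB reindexes the folder.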

Another bug.
Reproduce: create an account, set a custom folder to C:\foo, close TB, manually delete C:\foo. Open TB. This time not only is C:\foo created, but c:\foo.sbd is created as well, and TB starts syncing emails and writes files into c:\foo.sbd instead of c:\foo.
To get around this, you have to close TB, delete c:\foo.sbd and run TB again.
This time TB sees the existing c:\foo and starts to write files to it.

First off, thanks for taking the time to write all that up - very useful and much appreciated!

(In reply to Zbigniew Gralewski from comment #16)

> My question is - are you aware of these problems and did you fix them? Is msf thoroughly tested? 78+ roadmap says that maildir is decent in 78, but i didn't find anything mentioned in changelog about bugs i stumbled upon. To be clear it wasn't on just my machine. With my coworker we deployed many migrations from mbox to maildir and had these problems on almost every machine.

This bug is the overview one to track all the maildir related issues - see the "Depends on" list at the top to see all the unresolved maildir bugs.
There are definitely still enough rough edges that I'd be wary of using maildir in production.
A big maildir push is high on my TODO list.
If there are maildir issues not linked to this meta bug, then I recommend creating a new bug and adding it to the "depends on" list.

I don't see any existing bugs that obviously cover the imap-folders-have-messages-with-unreadable-attachments issue you mention - want to write it up? No problem if not - I'll go through your comments in more detail and write whatever we don't already have.

> Last but not least, I would name maildir emails by date then msg id because it allows sorting them yearly and totally simplifies archiving. Trying to work on one big maildir local folder consisting of mails from many years is in tb68 almost impossible without converting to mbox first.

I think that's an interesting point - definitely something to look into.
There's a bigger question here: Is there any benefit to adhering to the maildir spec (all emails as files in a single flat directory), rather than, say, automatically stashing emails into subfolders. For example, "<YYYY>-<MM>/<msgid>.eml" would probably be manageable and rather useful to the user. You could get a more even distribution by, say, using subdirs based on hashing the messageid. But at the expense of making it an arse for the user to find emails in the filesystem (seems like a bad tradeoff).
Probably a discussion to break out into another bug or on the mailing list.
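
To make the two layouts concrete, here is a minimal Python sketch. Both function names and the exact path shapes are hypothetical; nothing like this exists in the maildir store. SHA-1 is used only to spread files evenly across directories, not for security.

```python
import hashlib
from datetime import datetime


def path_by_month(sent: datetime, msgid: str) -> str:
    """The "<YYYY>-<MM>/<msgid>.eml" layout: easy for a human to browse."""
    return f"{sent:%Y-%m}/{msgid}.eml"


def path_by_hash(msgid: str) -> str:
    """Bucket by the first hash byte: 256 evenly filled, opaque subdirs."""
    digest = hashlib.sha1(msgid.encode("utf-8")).hexdigest()
    return f"{digest[:2]}/{msgid}.eml"
```

With the month layout, a user looking for a mail from September 2020 knows which directory to open. With the hash layout, the directory name tells them nothing, which is the tradeoff described above.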

In any case, plain maildir is a good first step. There's still a bunch of places in the code that kind-of-sort-of assume mbox. So getting vanilla maildir solid and reliable makes it waaaaay simpler to add other potential storage schemes (either minor variants on maildir or stuff that's completely different in approach).

> To explain last sentence "I would name maildir emails by date then msg id" - i mean files on the disk. That would allow to move them to folders by hand and not in TB. Moving many emails in maildir local folders is very unstable just as I said before. I did write a script that extracts email date from email message and renames files, but it would be super cool to not have to do that in the first place

I fully agree with this suggestion from Zbigniew Gralewski.

@Zbigniew Gralewski :

You wrote about a script. I am interested.
Could you pass it to me?
thoste at email dot com

Thank you

(In reply to Ben Campbell from comment #18)

...

> > Last but not least, I would name maildir emails by date then msg id because it allows sorting them yearly and totally simplifies archiving. Trying to work on one big maildir local folder consisting of mails from many years is in tb68 almost impossible without converting to mbox first.
>
> I think that's an interesting point - definitely something to look into.
> There's a bigger question here: Is there any benefit to adhering to the maildir spec (all emails as files in a single flat directory), rather than, say, automatically stashing emails into subfolders. For example, "<YYYY>-<MM>/<msgid>.eml" would probably be manageable and rather useful to the user. You could get a more even distribution by, say, using subdirs based on hashing the messageid. But at the expense of making it an arse for the user to find emails in the filesystem (seems like a bad tradeoff).

There is indeed a "breaking point" on folder size where it takes forever to enumerate folder contents in the MS Windows environment.

(In reply to Wayne Mery (:wsmwk) from comment #20)

> > I think that's an interesting point - definitely something to look into.
> > There's a bigger question here: Is there any benefit to adhering to the maildir spec (all emails as files in a single flat directory), rather than, say, automatically stashing emails into subfolders. For example, "<YYYY>-<MM>/<msgid>.eml" would probably be manageable and rather useful to the user. You could get a more even distribution by, say, using subdirs based on hashing the messageid. But at the expense of making it an arse for the user to find emails in the filesystem (seems like a bad tradeoff).
>
> There is indeed a "breaking point" on folder size where it takes forever to enumerate folder contents in the MS Windows environment.

Wayne, internally I would leave them as they are, following the maildir spec as it is. The cur and tmp folders are fine: one dir in TB, two dirs on the disk (cur and tmp). In other words, it is useful to keep the location of files on disk consistent with the structure of folders in Thunderbird. Look at it this way: we have GDPR, we teach people how to archive and delete emails, and TB can be configured to move them into yearly subfolders when archiving. The problem is a user who does nothing and just keeps thousands of emails in one big inbox. The admin has to be able to quickly move them into local folders, sort them by year into subfolders, and finally make the user's mailbox smaller so that the user is forced to sort and archive in real time or once a week. The admin can put maildir local folders under sync by google drive, synology drive, dropbox etc. (excluding MSF files), which gives real-time protection of the local folders. I use that with success. So I would only use the date extracted from the email as the filename, because it helps a lot with manual admin work. Renamed EML files reindex in TB just fine. Maybe use the messageid for emails that don't have a proper "Date:" field in the headers, or use "date_messageid". Look at my ahk script attached in a recent post. We rename all files using it and sort them into subfolders by date manually, then delete the msf files and run TB, and the job of sorting thousands of files is done. Admin work is quick and business rules apply. Maybe a bit offtopic, but I wish TB were friendly to admins and to implementing business rules.

(In reply to Ben Campbell from comment #18)

> There's a bigger question here: Is there any benefit to adhering to the maildir spec (all emails as files in a single flat directory), rather than, say, automatically stashing emails into subfolders. For example, "<YYYY>-<MM>/<msgid>.eml" would probably be manageable and rather useful to the user. You could get a more even distribution by, say, using subdirs based on hashing the messageid. But at the expense of making it an arse for the user to find emails in the filesystem (seems like a bad tradeoff).
> Probably a discussion to break out into another bug or on the mailing list.

One issue with the YYYY/MM folders: which timezone would determine when the month rolls over to the next one - local or UTC?
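
A small Python sketch of why this matters: the same instant can land in different month folders depending on the timezone used. The month_bucket helper and the UTC+2 "local" zone are assumptions chosen purely for illustration.

```python
from datetime import datetime, timedelta, timezone


def month_bucket(sent: datetime, tz: timezone) -> str:
    """Return the "YYYY/MM" folder for an aware datetime, as seen in tz."""
    return sent.astimezone(tz).strftime("%Y/%m")


# One message, sent half an hour before midnight UTC on 31 August.
sent = datetime(2020, 8, 31, 23, 30, tzinfo=timezone.utc)

utc_bucket = month_bucket(sent, timezone.utc)
local_bucket = month_bucket(sent, timezone(timedelta(hours=2)))  # e.g. UTC+2

print(utc_bucket, local_bucket)  # prints "2020/08 2020/09"
```

Bucketing by UTC keeps the folder stable if the profile moves between machines or timezones, at the cost of occasionally disagreeing with the date the user sees in the message list.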
