Open Bug 121947 Opened 23 years ago Updated 4 months ago

Need to unescape "From " lines in the message body before displaying them to the UI.

Categories

(MailNews Core :: Backend, defect)

defect

Tracking

(Not tracked)

People

(Reporter: cavin, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

This bug is extended from bug 119441.  See the related bug 121946.

We need to make sure that we unescape "From " lines in the msg body before 
displaying them to the UI (compose/read windows).
 
  If you have this line      Unescape it to     Comment
  ---------------------      --------------     -----------------
  >From now on               From now on
  >>From now on              >From now on
  >>>From now on             >>From now on
  >/>From now on             >/>From now on     No need to unescape
   From now on                From now on       No need to unescape

Note that the above example lines start at the beginning of a line (ie, column 
#1).
*** Bug 121946 has been marked as a duplicate of this bug. ***
*** Bug 120443 has been marked as a duplicate of this bug. ***
*** Bug 169209 has been marked as a duplicate of this bug. ***
*** Bug 191386 has been marked as a duplicate of this bug. ***
*** Bug 200648 has been marked as a duplicate of this bug. ***
*** Bug 219939 has been marked as a duplicate of this bug. ***
This is not limited to windows.

pi
OS: Windows NT → All
*** Bug 240609 has been marked as a duplicate of this bug. ***
*** Bug 233935 has been marked as a duplicate of this bug. ***
Depends on: 248726
This bug is also related to bug 194382. You may also wish to look Comment #8,
regarding the different mbox formats available.
Blocks: 248726
No longer depends on: 248726
Product: MailNews → Core
Component: MailNews: Backend → ChatZilla
Component: ChatZilla → MailNews: Backend
Shouldn't this only be done when the message is coming from an mbox file (e.g.
not IMAP)? - i.e. the true inverse of bug 121946.
*** Bug 288545 has been marked as a duplicate of this bug. ***
*** Bug 306030 has been marked as a duplicate of this bug. ***
Assignee: mscott → nobody
QA Contact: esther → backend
Product: Core → MailNews Core
This indeed appears to be closely related to Bug 194382, however, it is manifested in the message body, not in an attachment.
still seeing this in 3.0b4, or a subspecies of this (not altogether surprising, since it says assigned to nobody up there).  With a message body like

From nospacebefore
 From spacebefore
>From gtbefore
> From gtspacebefore
 >From spacegtbefore
From nospacebefore
Fromnospaceafter

the spacegtbefore line remains untransformed.  However the gtbefore line gets a space added in front of it.  When you read this message the spacegtbefore line has its leading space stripped.  So it becomes the same as the gtbefore line.
Blocks: 528815
Bug 110389
Bug 119441
Bug 121946
Bug 121947
Bug 194382
Bug 271225
...

What are they all about?
Are they a result of incompetence/ignorance? Or neglect? Or sign of hopeless attempts to revive a moulding corpse? Or maybe the great result of "Bazaar"?

When I started to use Thunderbird, the first thing I came upon was its inability to properly import the mail from Outlook. Well, after some time I started to fix it myself. And I found out that the one big problem (poor quality of the import module) was artificially divided to multiple small "bugs". Some of them were so close to each other than their fix would affect the same lines of code. But when you fix only one such bug, without regard to the others, you will always get the suboptimal results. Sometimes, when you can see the whole picture at once, you start to realize where the single root of those multiple problems lies. And sometimes you need to restructure the whole thing to fix them all at once, and prevent many other problems in advance. You know, a properly planned and implemented code solves more problems than years of patching.

Now I came across a problem with my imported mails. They sometimes contain some ">"s before "From "s that were not in the original messages. Or in the case of text messages, some portions of text are incorrectly marked as quotation. Well, the roots of this are easy to trace - they are in the EscapeFromSpaceLine() function that makes it possible to store the mails in the MBOX format. But why cannot I have the mails in their original form? Is this problem unresolvable? I started to search the Bugzilla for the "From " issues, and found out that they are so numerous and long-standing that it is astonishing!

What do I see? Again, the lack of proper structure of the code. And as the result - many problems, that sometimes seem to be different. And if one tries to fix one such bug, he gives rise to others.
The problem with the "From " handling is the problem of storage. It originates from the specific file format (MBOX), and it has simple and straightforvard solution (that have been implemented many times in many mail systems - MTAs, MUAs, etc). It has its level (I mean storage level). It may easily be isolated and solved - once and forever. But the reality is that it persists for years!

Well, we use the MBOX file format. It uses "From " lines to separate different messages. So the lines inside messages that start with "From " must be treated specifically. OK, this problem is common in the world of file formats, and escaping is generally accepted solution. But what is the problem of restoring the original data from such escaped form?

Let me summarise the problems around this. I must say that these problems were reported by many other people in the abovementioned bugs, I only summarize them!

1. The code that TB uses to escape the "From " lines is scattered inside the TB code, it duplicates itself (I see at least one place where this happens: EscapeFromSpaceLine() in nsMsgUtils.cpp that is used only in import module, I would not be surprised if there are other redundant functions). This leads to possibility to make different bugs that would be difficult to trace. Also, if we would need to reimplement something, it will require tracking every occurence of code that handles this...

2. Well, the escaping does happen more or less properly. But the unescaping is totally absent! Despite the long-standing demand, both the latest stable release (3.1.6), as well as development (3.3a1pre), show those nasty artifacts. And it isn't just a cosmetic issue; it's the real problem at least in two cases:
2.1. If the message is signed.
2.2. If the contents of the message must not be tampered with (e.g. come kind of source code). In this case, situations may exist where the inserted extra character will remain unnoticed and will not make the code syntactically erroneous, but will make it give erroneous results that will be very difficult to trace!

3. I stated in the previous paragraph, that the escaping is done properly. But this is not the case! TB escapes the simple text data inconsistently with regards to other cases - it inserts space before "From "s, not ">"! Somewhere in the source tree there is some "bug fix" that does this for years, and now go try to figure where it is! This is the bright example of the problem that is discussed in the p.1.

4. For some weird reason, TB includes the separating "From " line into the source code of the message. Hey, this line is a storage artifact! It has nothing to do with the original message!

The resoultion is, as I stated above, simple. We need to stick to the mboxrd format (and do this consistently, everywhere) when using MBOX, and we need to decode from this format every time we use its data.

As I can understand from the discussion, TB is already able to use multilpe storage formats besides MBOX. So as a programmer, I can assume that there exist some layering of code (at least when it comes to storage). If it is not so, then it's a shame! Let's hope that layering exists.
The MBOX is a format. Let's handle it as a format. Well, a program may use whatever format it chooses, to store its data. One may decide to plant a flower field that will depict the rasterised source text of the message! The high-level code doesn't need to care about it.
So what is the problem to write a library that will handle all the communications with the MBOX files? And is it difficult to include the encoding/decoding of data (including the escaping/unescaping of "From " lines) into it, thus creating the single place to support this format? And is it impossible to write the guidelines for programmers that will encourage not reinventing the wheel, but reusing existing library? Listen, you are programmers, don't the words "Incapsulation", "Abstraction", "Modularization" and such sound familiar? Does "good programming style" mean something to you?

Ah, I see. Bazaar. "Mozilla's culture". It means that there is no person who is responsible for anything. No one knows how Mozilla works! If one once existed, he is already retired, and the "Mozilla's culture" ensured that no one else has a plot. It's not surprising to see sentences like this: "No one here knows how the mapilib works, so let it work its way"! And this is what supporters say. So we create patches to fix patches applied above patches, and the structure (if it ever existed) is buried below them.

The absurd of the situation leads to the distinction between the insertion of ">"s into the message "body" and message "attachments"! Between this problem when a message is received and when it's saved as draft! A person that is responsible for keeping the Bugzilla clean should have marked all these as duplicates, as the problem of the storage level must not be handled on the level of MIME. We try to store a RFC822 message that only has its headers and a body! All other details are irrelevant - no matter what is the structure of the body - whether it's MIME, or something else... We must get from the storage what we put into storage! No matter when we put it there - when we receive or when we compose - the results must be the same!

And we divide the solution into steps, thus making "partial solution"  for a problem that needs a solid solution. A solution that would greatly benefit from being developed as single whole.

Well, I think that the responce to all this will be something like "do it yourself". OK, I would do it. But the purpose of all this is to show one case of a systematic pattern that today determines the Mozilla development. And if it will not change, Mozilla is going to die, giving way to more solid projects that have a core developers team (like, say, WebKit or BSDs).

Thank you for your attention.
Blocks: 194382
As horrible as the MBOX format is, it wasn't invented by Mozilla.  The hack of adding ">" in front of lines beginning with "from " is very widespread, and there are mail systems that operate such message mangling anyway, independently of their storage format.  That is to say, even if TB gets ready of it by choosing a superior storage format, it is likely that several users will get their messages disrupted by intermediate hosts.

Putting a space in front of every unquoted line of test, a.k.a. space stuffing, is the general solution standardized by RFC 3676.  The resulting plain text is identified by the parameter format=flowed.  As Mike notes, it is already being done in TB, although the current implementation leaves something to be desired (See Bug 456053 and Bug 609908.)

The leading space only has to be unescaped on displaying the message.  If the text is signed, the leading space may be necessary to properly verify the signature.  If the displaying software ignores RFC 3676, the leading space may appear as a harmless extra margin.

Of course, imported messages and received messages that are not space-stuffed have to be stored as they are.  Modern storage formats allow just that.
Alessandro,

what you say is generally true, however, you may note that this bug has emerged from bug 119441 (about import issue) that was split to this bug and its counterpart - bug 121946. These are about storage, i.e. how to properly prepare the message to store in the mbox format, and how to get the message from the store to properly display it (well, the issue title may be slightly misleading as it doesn't mention the storage).

Then, the things that you correctly state about how the user agent may help user to keep his writings intact on send, are irrelevant to this problem (that is storage-level, not MIME-level). And trying to adopt the solution from a wrong level may lead to an adverse effects. The specifics of the RFC 3676 is that it helps MUAs to prepare whatever user writes to look very close to the original, no matter if the receiving party uses RFC 3676 or not, if that party wants to stick to the plain text representation. This standard doesn't necessarily allows to recreate the precise original form of the text (it may strip spaces before CRLFs at the paragraph ends). The storage must not allow any change of the data: whatever is put to storage, MUST be kept and output exactly intact.

The inaccuracies in implementation of the RFC 3676, imo, should be discussed in another dedicated bug.
I just spent a very frustrating hour and a half trying to figure out what was wrong with my PHP script that was sending an email which triggered this behaviour, thinking there was some invisible highbyte char stuck in there. Eventually I find out it is a nearly decade old bug in TB. Can't help but think this would be a trivial bug to fix. Likely take less time than I spent just now, or than it took for all the people above to write their comments.
Thanks TB developers, for wasting the time of all the people above, while adding another hundred niche features that hardly anyone will ever see, instead of fixing a core bug that has caused so many people problems.

As to everyone who is following this bug... it only has 4 votes. No wonder it isn't getting any action. If you haven't, please login and vote for it, before we all run out of hair to pull out.
Jonathan,

you are right, that it is very frustrating,
but the fix is not that trivial, and work is being done that will fix this - see Bug 402392 that is being worked on very actively. When it will be fixed, this one will be fixed, too. And that bug is highly voted for.
I'm glad to hear that Mike, as I was ready to start taping a piece of paper with ">From" written on it to the office entry every day. I, like many people above, get quite ruffled by this common prioritizing practice among software developers (FOSS and paid proprietary alike). Frankly this is pretty well displayed here, in that an "enhancement" introduced in '07 gets the jump over a core bug from '02 (arguably a double bug in that it SHOULD have been escaped with a space at least).

Since I don't know the mozilla source and am not even a proficient C programmer, I have to preface this with a grain of salt... but I truly believe this ought to be trivial. It is an unescape filter. Hard to think of anything much simpler. 5 minutes with the 7.0.1 candidate source suggests it could be done on line 699 of nsParseMailbox.cpp (tho since I don't know at what level of abstraction this sits, I may very well be wrong).

Times like these I REALLY wish I was a full time C hacker so I could just fix it myself and submit a patch.

Here's hoping the fix arrives SOON. Though I have to admit I'm not holding my breath on 402392 either.
Per discussion in bug 808450, switching the current mboxo implementation to the mboxrd variant would resolve the ambiguity in quoting of "From" where desired vs. where generated to avoid conflict with the mbox definition, thus would also resolve the issue of displaying it correctly (this bug).
Also, as for the discussion in bug 808450 I'd suggest to switch the severity of this bug to the highest possible level, well at least as long as these two bugs are considered duplicates.

Further people working on this bug should probably read bug 808450, as it contains IMHO much more technical background on the issues and recommendation how one should deal with it in terms of informing users what has happened to their mails over the years.
I may further suggest any future reader of this issue to have a lookt at bug 808450 comment 21 and bug 808450 comment 25, where I give the results of several test cases and whether and how TB corrupts mails,... as well as two further issues (missing empty line separator, "non-standards-compliant" From_-line format)
There exist several variants of the mbox format, which are known as mboxo, mboxrd, mboxcl and mboxcl2. They are all explained in

  http://www.qmail.org/qmail-manual-html/man5/mbox.html

which is widely considered today to be the definitive text on the subject of mbox files.

The best and commonly recommended variant to implement today is mboxrd, which provides for watertight round-trip compatibility. Most of the other variants are broken and cause spurious ">" signs to appear or disappear when you import/export to/from such mbox files.
Markus, I guess you should not expect any soon solution to this issue.

As you can see it is known for many years, but users are still exposed to silent data corruption, without even openly warning about this.
Severity: normal → S3
Flags: needinfo?(benc)
Summary: Need to unescape "From " lines in the msg body before displaying them to the UI. → Need to unescape "From " lines in the message body before displaying them to the UI.
You need to log in before you can comment on or make changes to this bug.