Closed Bug 493544 Opened 15 years ago Closed 12 years ago

invalid decoding of multiline utf-8 subject header (better to be tolerant with split of byte code at wrong position in RFC 2047 encoding)

Categories

(MailNews Core :: MIME, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 20.0

People

(Reporter: sebcsaba, Assigned: ZaneUJi)

References

(Blocks 2 open bugs)

Details

(Keywords: testcase)

Attachments

(9 files, 3 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; hu; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8
Build Identifier: verzió 2.0.0.21 (20090302)

If a multiline base64 encoded subject header is wrapped in the middle of a multibyte character, then it's not displayed properly.


Reproducible: Always

Steps to Reproduce:
Create/send a mail with the next header, and open it in Thunderbird:

Subject: =?utf-8?B?S0FSIC0gw4lydGVzw610w6lzIHbDoXLDs2xpc3TDoXJhIGhlbHllesOpc3LF?=
 =?utf-8?B?kWw=?=

Actual Results:  
Shows this subject: "KAR - Értesítés várólistára helyezésr�"

Expected Results:  
Shows this subject: "KAR - Értesítés várólistára helyezésről"

The letter 'ő' (u0151) is a special hungarian character, it may be displayed as a question mark on your system. But anyway, the last character is 'l' (u006c) must be displayed properly.
c3LF in base64 == 0x7372C5 (last part of first encoded word) 
kWw= in Base64 == 0x916C   (second encoded word)
'ő'(U+0151) == 0xC5 0x91 in UTF-8.

An encoded word should complete within the encoded word. Mail sender have to split to 0x...7372 and 0xC5916C(split based on unicode character) upon base64 encoding.
If Tb generated the header, Tb's bug. If Tb didin't generate the header, INVALID.
Can you reference an rfc about this interpretation? I cannot find any.

Other email-clients displays this header correctly (Gmail, RoundCube Webmail)
RFC 2047.

My description of "should complete within the encoded word" was inaccurate. It was based on "3. Character sets" in RFC 2047, but description in "3. Character sets" was for "escape sequence" of some character sets such as ISO-2022-JP. Sorry for my confusion.

I think your case corresponds to next section.

> http://www.faqs.org/rfcs/rfc2047.html
> 5. Use of encoded-words in message headers
>(snip)
> (1) An 'encoded-word' may replace a 'text' token (as defined by RFC 822)
>    in any Subject or Comments header field, (snip)
>(snip)
>   Ordinary ASCII text and 'encoded-word's may appear together in the
>   same header field.  However, an 'encoded-word' that appears in a
>   header field defined as '*text' MUST be separated from any adjacent
>   'encoded-word' or 'text' by 'linear-white-space'.

AFAIR, same phenomenon was reported to some other bugs and closed as INVALID. I don't know whether quirks for or torelance with such RFC violation is requested ot not.
In evaluating whether this deserves some sort of quirk, it would be very helpful to know exactly which clients generate messages with this sort of header...
phpmailer did it.

I think the tolerant interpretation of the RFC should be preferred in this case.
This case is relieved by code like next;
  If (consecutive encoded words have same charset)
    { join decoded "consecutive encoded words"; }  
  Interprets the "joined decoded consecutive encoded words" by specified charset;
But this kind of coding breaks next case.
   - First encoded word contains single garbage byte.
   - Second encoded word contains correct byte codes.
   In this case, only "single garbage byte" is displayed in U+FFFD currently.
   But if abve change is made, all characters in second encoded word
   can be displayed in U+FFFD, if the "single garbage byte" changes interpreted
   unicode characters for byte codes in second encoded word.

AFAIK, most often case is split of 3 byte code(this bug's case), and garbage is in last part of last encoded word even if garbage is involved.
So I think code like above is practical. (Ignore a part of "5. Use of encoded-words in message headers" of RFC upon decoding/interpreting.)
3 byte code => 2/3 byte code
(Modification of comment #6)
>   If (consecutive encoded words have same charset)
>     { join decoded "consecutive encoded words"; }  

Report of incorrect splitting is mainly for next case.
  (1) UTF-8 (AFAIK, no bug is reported for 2 bytes code such as Shift_JIS)
            (nor EUC_JP which has 3 bytes code in addition to 2 bytes code)
  (2) split of next (AFAIK, no bug is reported for 2 bytes code only case)
  (2-a) split of 3 bytes code
  (2-b) split of 2 bytes code, if single or 3 byte code is mixed
And RFC 2047 "3. Character sets" shouldn't be ignored when charset which has escape sequence. So, even if quirks will be implemented, it should be limited to UTF-8 only. 
(Modified version) 
>   If ( utf-8
>        && (consecutive encoded words have same charset)
>        && (consecutive encoded words uses same encoding of Q or B)
>      )
>      { join decoded binary of consecutive encoded words; }
Severity: normal → enhancement
OS: Windows XP → All
Summary: invalid decoding of multiline utf-8 subject header → invalid decoding of multiline utf-8 subject header (better to be tolerant with split of byte code at wrong position in RFC 2047 encoding)
Version: unspecified → Trunk
No longer blocks: 476741
attached testcase from bug #476741
Confirmed here on 

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2pre) Gecko/20090725 Lightning/1.0pre Shredder/3.0b4pre ID:20090725112141

Otherwise, confirmed per duplicate in comment #09 and comment #10
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: testcase
(In reply to comment #12)
> confirmed per duplicate in comment #09 and comment #10

Since this bug is Severity=Enhancement, changing from UNCONFIRMED to NEW usually means "it's proved that request is reasonable and very useful for many users and required for many users" or "developers or tiage team accepted the enhancement request".
Aureliano Buendía, you changed this enhancement request bug to NEW. You've completed feasibility study of enhancement request of this bug? Or you are ready to implement enhancement of this bug?
For me this is a bug not enhancement.

I forgot of read "Importance:" tag in this bug: in bug #476741 it is "normal".

If you think that this is "enhancement" set as "UNCONFIRMED".
Status: NEW → UNCONFIRMED
Ever confirmed: false
(Off-Topic)

(In reply to comment #14)
> For me this is a bug not enhancement.
> If you think that this is "enhancement" set as "UNCONFIRMED".

Right box of Importance field is Severity: of "bug" at B.M.O, which is clearly defined. It's irrelevant to your thoughts about "bug or enhancement". It's irrelevant to my thoughts about "bug or enhancement" too.
Aureliano Buendía, please read at least next page, which is linked by "Importance" in this bug.
> https://bugzilla.mozilla.org/page.cgi?id=fields.html#importance
  Other than enhancement(trivial to blocker) :
    Flaw in code of Tb really exists(Tb doesn't work as designed)
  enhancement :
    Tb works as designed, but improvement/change is required.
Severity: enhancement → blocker
Severity: blocker → enhancement
Severity: enhancement → normal
Status: UNCONFIRMED → NEW
Ever confirmed: true
Component: Message Reader UI → MIME
Product: Thunderbird → MailNews Core
QA Contact: message-reader → mime
Hardware: x86 → All
Changing back to Severity=Enhancement, Status=UNCONFIRMED, to avoid mis-leading, confusion.
Severity: normal → enhancement
Status: NEW → UNCONFIRMED
Ever confirmed: false
Erm, if I set a status, I mean it...
While I don't think that confirmed ENH bugs should be in UNCO state, it'd clearly be an useful enhancement.
Status: UNCONFIRMED → NEW
Ever confirmed: true
is bug 506927 a dupe of this one?
what about a subject like this one: (all should go into 1 line)

> Subject: =?UTF-8?B?W2lwYXguYXQgIzE2ODg2XSBJUEFYIC0gV2VicmFkaW8gQmVzdGVsbGJlc3TD?= =?UTF-8?B?pHRpZ3VuZyAtIEJFN0I1N0Ux?=

this should read decoded:
> Subject: [ipax.at #16886] IPAX - Webradio Bestellbestätigung - BE7B57E1

but instead reads:
> Subject: [ipax.at #16886] IPAX - Webradio Bestellbest�
Attached file my testcase
Blocks: 586981
Well, it is to convenient to say "this bug caused by RFC rule violation " 
so thats not TB's problem.
But yes it is TB's problem, so what if the program that generated this mail
faulty encoded the subject violating this RFC rule, other clients get past
this violation and display this subject correctly, the same is expected from TB.
I love TB, it's great E-mail client, but you can't just say that "we sticking directly to the rules with no exception" because that will leave people no choice
rather then moving to other client that does the job correctly.
(In reply to comment #23)
> other clients get past this violation and display this subject correctly

There is no "correct" subject there, there's just a broken encoding.
Actually, I'd say the best section of RfC 2047 to look at is 6.3(2):

| 6.3. Mail reader handling of incorrectly formed 'encoded-word's
|   It is possible that an 'encoded-word' that is legal according to the
|   syntax defined in section 2, is incorrectly formed according to the
|   rules for the encoding being used.
...
|   (2) Any 'encoded-word' which encodes a non-integral number of
                                           ^^^^^^^^^^^^^^^^^^^^^^
|       characters or octets is incorrectly formed.
        ^^^^^^^^^^
|   A mail reader need not attempt to display the text associated with an
|   'encoded-word' that is incorrectly formed.

You can only make *assumptions* about what the subject was meant to be.
This bug is about implementing such heuristics.

> moving to other client that does the job correctly.

Since there does not exist a "correct" here, there is no such client, hence we're lucky. ;-)
(In reply to comment #24)
> (In reply to comment #23)
> > other clients get past this violation and display this subject correctly
> 
> There is no "correct" subject there, there's just a broken encoding.
> Actually, I'd say the best section of RfC 2047 to look at is 6.3(2):
> 
> | 6.3. Mail reader handling of incorrectly formed 'encoded-word's
> |   It is possible that an 'encoded-word' that is legal according to the
> |   syntax defined in section 2, is incorrectly formed according to the
> |   rules for the encoding being used.
> ...
> |   (2) Any 'encoded-word' which encodes a non-integral number of
>                                            ^^^^^^^^^^^^^^^^^^^^^^
> |       characters or octets is incorrectly formed.
>         ^^^^^^^^^^
> |   A mail reader need not attempt to display the text associated with an
> |   'encoded-word' that is incorrectly formed.
> 
> You can only make *assumptions* about what the subject was meant to be.
> This bug is about implementing such heuristics.
> 
> > moving to other client that does the job correctly.
> 
> Since there does not exist a "correct" here, there is no such client, hence
> we're lucky. ;-)
I really not interested in getting into argue with you nor any one else, but the fact is that Outlook Express managed to display the subject correctly god knows
how, it is frustrating that TB's can't do so.
I'm not a big fan of Microsoft's software but somehow they do the trick right,
and that is what I expect from TB, even if the massage encoded not by the
standard RFC rules.
It mostly can be equivalent saying that some website is not written in the
right web standard thats why FireFox can't diplay it right but I.E. can,
the fact the there is a program that can display it correctly and it's
Microsoft's is frustrating...and FireFox not going to do anything about it,
just kick back and relax until the whole world will code their website in a
proper way (thats a huge bummer).
Add a screenshot of the current problem.
Attached file letter (eml)
letter (eml), referring to the above attached screenshot
the same letter, but in the web interface below the source code in the web
interface:
Received: from mxfront71.mail.yandex.net ([127.0.0.1])
    by mxfront71.mail.yandex.net with LMTP id cKYGAw2V
    for <mc.sim@k-max.name>; Thu, 7 Oct 2010 22:38:20 +0400
Received: from srv51-h-st.jino.ru (srv51-h-st.jino.ru [81.177.139.53])
    by mxfront71.mail.yandex.net (nwsmtp/Yandex) with ESMTP id cJwO9dPD;
    Thu,  7 Oct 2010 22:38:19 +0400
X-Yandex-Front: mxfront71.mail.yandex.net
X-Yandex-TimeMark: 1286476699
X-Yandex-Spam: 1
Received: by srv51-h-st.jino.ru (Postfix, from userid 2154)
    id 5E365602274; Thu,  7 Oct 2010 22:38:19 +0400 (MSD)
To: mc.sim@k-max.name
Subject: =?utf-8?B?UmU60LfQsNGP0LLQutCwX9C90LBf0J7QntCeX9CQLdCb0LDQsQ==?=
Date: Thu, 7 Oct 2010 18:38:19 +0000

From: "mc.sim@k-max.name" <mc.sim@k-max.name>

Reply-to: "mc-sim85@a-lab.name, client@a-lab.name, d_freylik@a-lab.name,
d_bondar@a-lab.name," <mc-sim85@a-lab.name>

Message-ID: <2229bc106c3737c52a0b3607872b977d@www.a-lab.name>

X-Priority: 1

X-Mailer: cformsII (deliciousdays.com) [version 11.6.1]

MIME-Version: 1.0

Content-Transfer-Encoding: 8bit

Content-Type: text/plain; charset="utf-8"
X-Yandex-Forward: 9ad8c647fe6d8aa2cecd22a8baeb6ab9

Здравствуйте, Иван!

Спасибо за Ваше письмо!

Мы ответим Вам в ближайшее время.
The subject is not full. It ends with symbol �
Blocks: RFC2047
I think I got the point.

It seems to be the fault of the software/system the mail sender use by mal-splitting the chars embedded.

I think it would be nice if there's an extension or a tool to fix the encoding.

Here's a simple tool I made to fix the mal-splitted e-mails. Feel free to use it: http://db.tt/gxzlS41
(In reply to Alex from comment #25)

> I really not interested in getting into argue with you nor any one else, but
> the fact is that Outlook Express managed to display the subject correctly
> god knows
> how, it is frustrating that TB's can't do so.
> I'm not a big fan of Microsoft's software but somehow they do the trick
> right,
> and that is what I expect from TB, even if the massage encoded not by the
> standard RFC rules.
> It mostly can be equivalent saying that some website is not written in the
> right web standard thats why FireFox can't diplay it right but I.E. can,
> the fact the there is a program that can display it correctly and it's
> Microsoft's is frustrating...and FireFox not going to do anything about it,
> just kick back and relax until the whole world will code their website in a
> proper way (thats a huge bummer).

I totally agree! I love TB by this BUG (yes, it's a bug) is driving me insane (and not only me) and should be fixed. Why you FORCE your users to have this problem?
Bug was described in 2009. 3 years passed and it is seems that this problem will never be fixed because it is "RFC violation". But there are plenty e-mail clients that think about user experience and not about strict RFC compliance. It is very frustrating that TB developers think about RFC and not about people who use this software :(
Attached patch Unit test and patch (obsolete) — Splinter Review
Attachment #682871 - Flags: review?(jduell.mcbugs)
Comment on attachment 682871 [details] [diff] [review]
Unit test and patch

bz: if you can't take this yourself, can you suggest a reviewer with Unicode chops that we can delegate it to?
Attachment #682871 - Flags: review?(jduell.mcbugs) → review?(bzbarsky)
Comment on attachment 682871 [details] [diff] [review]
Unit test and patch

Julian, are you willing to review this?  If not, please let me know and I'll find someone else...
Attachment #682871 - Flags: review?(bzbarsky) → review?(julian.reschke)
(In reply to Boris Zbarsky (:bz) from comment #38)
> Comment on attachment 682871 [details] [diff] [review]
> Unit test and patch
> 
> Julian, are you willing to review this?  If not, please let me know and I'll
> find someone else...

I can't do a proper *code* review, but I have provide some feedback.

1) Changing the code to be more forgiving should only be done when there's evidence that it is indeed needed. How frequent are these malformed header fields? What MUA is producing them? Just because other MUAs accept them isn't a sufficient reason (to me).

2) More importantly, we are currently trying to *remove* RFC2047 support from Firefox (see bug 601933). Also, there's work on replacing the header field parsing in the mail client by a new JS library (see bug 746052). If the latter is going forward, the code change proposed here will become irrelevant anyway.
Attachment #682871 - Flags: review?(julian.reschke) → review?(bzbarsky)
(In reply to Julian Reschke from comment #39)
> 2) More importantly, we are currently trying to *remove* RFC2047 support
> from Firefox (see bug 601933). Also, there's work on replacing the header
> field parsing in the mail client by a new JS library (see bug 746052). If
> the latter is going forward, the code change proposed here will become
> irrelevant anyway.

That code is moving very slowly at the present time, so we'll still need to use nsMIMEHeaderParam for several release cycles.
Attachment #682871 - Flags: review?(bzbarsky) → review?(smontagu)
Comment on attachment 682871 [details] [diff] [review]
Unit test and patch

Review of attachment 682871 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla/netwerk/mime/nsMIMEHeaderParamImpl.cpp
@@ +1075,5 @@
> +  if(aQOrBase64 == 'Q' || aQOrBase64 == 'q')
> +    decodedText = DecodeQ(aEncoded, aLen);
> +  else if (aQOrBase64 == 'B' || aQOrBase64 == 'b') {
> +    decodedText = PL_Base64Decode(aEncoded, aLen, nullptr);
> +  } else {

The input to aQOrBase64 has gone through ToUpper, so no need to test against lowercase 'q' and 'b'. 

Perhaps this whole test can become an assertion.

@@ +1084,5 @@
> +    return NS_ERROR_INVALID_ARG;
> +  }
> +
> +  nsresult rv;
> +  nsCOMPtr<nsIUTF8ConverterService> 

Nit: whitespace at end of line.

@@ +1173,1 @@
>        goto badsyntax;

You should be testing curEncoding here, no?

@@ +1263,5 @@
> +  if (!encodedText.IsEmpty()) {
> +    rv = DecodeQOrBase64Str(encodedText.get(), encodedText.Length(),
> +                            prevEncoding, prevCharset.get(), aResult);
> +    if (NS_FAILED(rv)) {
> +        aResult.Append(encodedText);

Nit: overindenting
(In reply to Simon Montagu from comment #41)
> @@ +1173,1 @@
> >        goto badsyntax;
> 
> You should be testing curEncoding here, no?
>
Good catch.
 
The document at:
https://developer.mozilla.org/en-US/docs/Developer_Guide/Coding_Style
says that str.Truncate() is preferred to str.SetLength(0). I was wondering if I should change those str.SetLength(0) too.
Yes, please do.
Attachment #682871 - Attachment is obsolete: true
Attachment #682871 - Flags: review?(smontagu)
Attachment #686540 - Flags: review?(smontagu)
Attachment #686540 - Flags: review?(smontagu) → review+
Any reason not to land this?
https://hg.mozilla.org/integration/mozilla-inbound/rev/45fd2425adc4
The testcase (mailnews part) will be landed after m-c merge because c-c depends on m-c.
Assignee: nobody → ZaneUJi
Status: NEW → ASSIGNED
Flags: in-testsuite-
For the next time,
- Please separate the core patch and the mailnews patch because they are in the different repository.
- Please set your user name, mail address, and checkin comment in the patch header. (see http://blog.bonardo.net/2010/06/22/so-youre-about-to-use-checkin-needed for details.)
- Please push patches to tryserver (or ask for it if you have no Level 1 commit access).
- Your patch didn't build with compilers other than msvc because a goto statement can not cross initialization of a variable (msvc allows this violation).
- Finally, please set "checkin-needed" keyword if you have no Level 3 commit access.

This time, I have done it instead.
Whiteboard: [leave open]
(In reply to Masatoshi Kimura [:emk] from comment #50)
> For the next time,
> - Please separate the core patch and the mailnews patch because they are in
> the different repository.
> - Please set your user name, mail address, and checkin comment in the patch
> header. (see
> http://blog.bonardo.net/2010/06/22/so-youre-about-to-use-checkin-needed for
> details.)
> - Please push patches to tryserver (or ask for it if you have no Level 1
> commit access).
> - Your patch didn't build with compilers other than msvc because a goto
> statement can not cross initialization of a variable (msvc allows this
> violation).
> - Finally, please set "checkin-needed" keyword if you have no Level 3 commit
> access.
> 
> This time, I have done it instead.

Thank you.
https://hg.mozilla.org/comm-central/rev/c438208883ab
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Flags: in-testsuite- → in-testsuite+
Resolution: --- → FIXED
Whiteboard: [leave open]
Target Milestone: --- → Thunderbird 20.0
Attached patch Test case for bug 833028 (obsolete) — Splinter Review
Attachment #704826 - Flags: review?(jduell.mcbugs)
Attached patch Patch for bug 833028 (obsolete) — Splinter Review
Attachment #704827 - Flags: review?(jduell.mcbugs)
(In reply to Zane U. Ji from comment #55)
> Created attachment 704827 [details] [diff] [review]
> Patch for bug 833028

The Patch works fine.
I'm confused--why are we getting new patch for a bug that's already landed?  Is there something that the patch landed in comment 53 didn't fix?  If so please open a new bug--we try to keep bugs to one patch per bug to make backing out and release management easier.
The bug is already open - bug 833028. I don't know why Zane U. Ji attached patches not there.
Nothing special. It's just that I wasn't so familiar with the system.
Comment on attachment 704826 [details] [diff] [review]
Test case for bug 833028

Clearing obsolete requests and patches, as these have moved to bug 833028.
Attachment #704826 - Attachment is obsolete: true
Attachment #704826 - Flags: review?(jduell.mcbugs)
Attachment #704827 - Attachment is obsolete: true
Attachment #704827 - Flags: review?(jduell.mcbugs)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: