Closed Bug 1054252 Opened 10 years ago Closed 10 years ago

wrong values in From and Recipient columns in message list due to improper RFC 2047 decoding / parsing order

Categories

(MailNews Core :: MIME, defect)

defect
Not set
normal

Tracking

(thunderbird36 fixed)

RESOLVED FIXED
Thunderbird 36.0
Tracking Status
thunderbird36 --- fixed

People

(Reporter: mitchnull, Assigned: mitchnull)

References

Details

(Keywords: regression)

Attachments

(2 files, 6 obsolete files)

User Agent: Opera/9.80 (X11; Linux x86_64) Presto/2.12.388 Version/12.16

Steps to reproduce:

Receive an email with RFC 2047 encoded From or To header where the RFC 2047 encoded part contains comma(s) (,):
From: =?iso-8859-1?Q?Radics=2C_P=E9ter=2C_Example_Inc?= <peter.radics@example.com>
To: =?iso-8859-1?Q?Radics=2C_P=E9ter=2C_Example_Inc?= <peter.radics@example.com>

[decoded as: Radics, Péter, Example Inc <peter.radics@example.com>]




Actual results:

In the message list, the From column displays only "Radics" instead of "Radics, Péter, Example Inc"

[The Recipient column displays  correctly, but that's just a coincidence, see below]


Expected results:

Both the From and Recipient columns should display the correct display name (Radics, Péter, Example Inc).

The issue is caused by incorrect order of RFC2047 decoding and splitting of the addresses in nsMsgDBView::FetchRecipients() and nsMsgDBView::FetchAuthor():  The header is first decoded (resulting in "Radics, Péter, Example Inc <peter.radics@example.com>") the split at the commas, resulting in three "addresses"

The splitting / extracting of addresses should be done on the undecoded header.

The bug was exposed by commit 6ff312f950a6 (Bug 842632, part 3e: Use the C++ API in base, r=Neil.), but the issue was present earlier for the Recipient column, it was just hidden from view because the three parsed addresses (in the example) were concatenated again with comma as separator, thus resulting in "good looking" recipients column.

The From header got broken by the commit because previously no splitting was done on the From header when extracting the name / email pair and the patch used the (new?) ExtractFirstAddress() call that first splits the header and parses the first address for the name / email pair.

See Bug 254519 for an earlier discussion about proper handling of RFC 2047 headers.

The attached patch fixes the issue for me. It was created and tested with the esr-31.0 release, but applies cleanly to current comm-central tip, too (not tested there).

I'm not familiar with the various ns*String classes in the codebase, but tried to use the correct form based on other usage in the code.
Component: Folder and Message Lists → MIME
Product: Thunderbird → MailNews Core
Thx for the patch!
Please get it reviewed so it can land - see https://developer.mozilla.org/en/docs/Getting_your_patch_in_the_tree#Getting_the_patch_reviewed

If you have any questions, feel free to ask.
Assignee: nobody → mitchnull
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Hardware: x86_64 → All
Status: NEW → ASSIGNED
Attachment #8473660 - Attachment is patch: true
Attachment #8473660 - Attachment mime type: text/x-patch → text/plain
Comment on attachment 8473660 [details] [diff] [review]
patch to fix RFC2047 decoding and address extraction order

Review of attachment 8473660 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/src/nsMsgDBView.cpp
@@ +399,4 @@
>  
>    nsCString emailAddress;
>    nsString name;
> +  ExtractFirstAddress(EncodedHeader(author), name, emailAddress);

There's an undesirable change in this patch: the charset for 8-bit headers is not going to be set correctly. The GetMime2Decoded* methods have some involved computations to grab a charset, which needs to be passed into the detection mechanism here. Given that these are some of the most important headers likely to see non-UTF-8 8-bit headers, it's very important to get that charset correct.
Attachment #8473660 - Flags: review-
I've only found rfc6532 that allows non-encoded 8bit header values, and that only allows UTF-8.  Where / how could a non-utf-8 8-bit encoded From or To header be encountered?

If it's really necessary / valid to handle non-utf8-8bit header values here, then the defaultEncoding could be extracted from the header and passed to EncodedHeader().  The other part of the "elaborate encoding handling" in GetMime2Decoded* is the possibility to override the encoding in RFC2047 encoded-words.  This information would be hard to transfer to EncodedHeader(), but imo this "feature" is misguided anyway, I don't think it's a valid concern that a client will send a properly formatted RFC2047 encoded header but lie about the encoding in the encoded-word.

I'll attach a new patch that extracts the default encoding from the header and passes that to EncodedHeader, but I'll probably have to touch the nsMsgDatabase header, too, to add a getEffectiveHeaderCharset() call.  Is that acceptable?
(In reply to Radics Péter from comment #3)
> I've only found rfc6532 that allows non-encoded 8bit header values, and that
> only allows UTF-8.  Where / how could a non-utf-8 8-bit encoded From or To
> header be encountered?

According to some statistics I did on NNTP a few months ago, 97% of NNTP headers are completely ASCII. Of the remaining 3%, 3.6% are UTF-8, and 96.4% are not UTF-8. I can't run statistics on email in general, but I would not be surprised if not-UTF-8 were still more common than UTF-8.

In general, when it comes to email, the first thing you need to realize is that standards do not reflect reality.

> If it's really necessary / valid to handle non-utf8-8bit header values here,
> then the defaultEncoding could be extracted from the header and passed to
> EncodedHeader().  The other part of the "elaborate encoding handling" in
> GetMime2Decoded* is the possibility to override the encoding in RFC2047
> encoded-words.  This information would be hard to transfer to
> EncodedHeader(), but imo this "feature" is misguided anyway, I don't think
> it's a valid concern that a client will send a properly formatted RFC2047
> encoded header but lie about the encoding in the encoded-word.

The overriding of the encoding in RFC 2047 encoded words doesn't work anyways, not since I landed jsmime. Effectively, the charset parameter is only used for convert8BitHeader.

> I'll attach a new patch that extracts the default encoding from the header
> and passes that to EncodedHeader, but I'll probably have to touch the
> nsMsgDatabase header, too, to add a getEffectiveHeaderCharset() call.  Is
> that acceptable?

Adding structured header support would have been preferable (which is why I didn't add it earlier), but let's not let the perfect be the enemy of the good here.
I've added a new EffectiveCharset property to nsIMsgHdr and used this to pass the defaultCharset to EncodedHeader().

I've also changed the uuid for nsIMsgHdr since as far as I understand adding new properties breaks binary compatibility.

nsMsgHdr just forwards GetEffectiveCharset() to nsMsgDatabase, where I implemented the call based on the logic found in RowCellColumnToMime2DecodedString().

tested on esr31, applies cleanly to trunk.
Attachment #8473660 - Attachment is obsolete: true
To add to statistics, the percentage of feed headers that are UTF-8 vs 2047 encoded is 100.0%.

The enemy of the good here is, rather, bug 842632 and its regresssions[1].  There is an uncomfortable reluctance/failure to acknowledge, at the very least, significant user facing changes that result from these regressions, and to provide a non vague path for addressing them.  To imply that jsmime (with almost no history behind it) or structured headers (not even landed) are 'perfect' is... optimistic.  And that was not the first word choice.

[1] Bug 1023285, Bug 858337#c45, Bug 1039443, Bug 1055077 (particularly problematic).
Attachment #8474580 - Flags: review?(Pidgeot18)
Comment on attachment 8474580 [details] [diff] [review]
revised patch to fix RFC2047 decoding and address extraction order

Review of attachment 8474580 [details] [diff] [review]:
-----------------------------------------------------------------

Also, if it's not too much trouble, I'd like to see a test.

::: mailnews/base/public/nsIMsgHdr.idl
@@ +86,4 @@
>                                    [array, size_is(aCount)] out octet aKey);
>  
>      attribute string Charset;
> +    readonly attribute ACString EffectiveCharset;

effectiveCharset, please.

This also needs a documentation comment in Doxygen/Javadocs style.

::: mailnews/base/src/nsMsgDBView.cpp
@@ +396,5 @@
>  
> +  nsCString author;
> +  nsresult rv = aHdr->GetAuthor(getter_Copies(author));
> +
> +  nsCString defaultCharset;

Nit: headerCharset, not defaultCharset

::: mailnews/db/msgdb/src/nsMsgDatabase.cpp
@@ +3588,5 @@
>  }
>  
> +nsresult nsMsgDatabase::GetEffectiveCharset(nsIMdbRow *row, nsACString &resultCharset)
> +{
> +  nsresult err = NS_OK;

Nit: name this rv, not err.

@@ +3593,5 @@
> +  resultCharset.Truncate();
> +  err = RowCellColumnToCharPtr(row, m_messageCharSetColumnToken, getter_Copies(resultCharset));
> +  if (NS_FAILED(err) || resultCharset.IsEmpty() || resultCharset.Equals("us-ascii"))
> +  {
> +    m_dbFolderInfo->GetEffectiveCharacterSet(resultCharset);

This logic doesn't match the logic in RowCellColumnToMime2DecodedString (you dropped the query of m_dbFolderInfo::GetCharacterSetOverride).

Also, make RowCellColumnToMime2DecodedString use this logic instead of duplicating it.
Attachment #8474580 - Flags: review?(Pidgeot18) → review-
(In reply to Joshua Cranmer [:jcranmer] from comment #7)
> Comment on attachment 8474580 [details] [diff] [review]
> revised patch to fix RFC2047 decoding and address extraction order
> 
> Review of attachment 8474580 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Also, if it's not too much trouble, I'd like to see a test.

I have about 2 days exposure to the codebase, so I'd prefer if someone more in-touch would help me out with this, either by providing the test-case or giving me some pointers on where to start (similar test-case I could use as a starting point, how to run tests, etc)

> 
> ::: mailnews/base/public/nsIMsgHdr.idl
> @@ +86,4 @@
> >                                    [array, size_is(aCount)] out octet aKey);
> >  
> >      attribute string Charset;
> > +    readonly attribute ACString EffectiveCharset;
> 
> effectiveCharset, please.

sure, will change, although I followed the style of the "Charset" property, should I change that one, too, for consistency?

> 
> This also needs a documentation comment in Doxygen/Javadocs style.

will do

> 
> ::: mailnews/base/src/nsMsgDBView.cpp
> @@ +396,5 @@
> >  
> > +  nsCString author;
> > +  nsresult rv = aHdr->GetAuthor(getter_Copies(author));
> > +
> > +  nsCString defaultCharset;
> 
> Nit: headerCharset, not defaultCharset

ok

> 
> ::: mailnews/db/msgdb/src/nsMsgDatabase.cpp
> @@ +3588,5 @@
> >  }
> >  
> > +nsresult nsMsgDatabase::GetEffectiveCharset(nsIMdbRow *row, nsACString &resultCharset)
> > +{
> > +  nsresult err = NS_OK;
> 
> Nit: name this rv, not err.

I lifted this verbatim from the original implementation, but sure, will change.

> 
> @@ +3593,5 @@
> > +  resultCharset.Truncate();
> > +  err = RowCellColumnToCharPtr(row, m_messageCharSetColumnToken, getter_Copies(resultCharset));
> > +  if (NS_FAILED(err) || resultCharset.IsEmpty() || resultCharset.Equals("us-ascii"))
> > +  {
> > +    m_dbFolderInfo->GetEffectiveCharacterSet(resultCharset);
> 
> This logic doesn't match the logic in RowCellColumnToMime2DecodedString (you
> dropped the query of m_dbFolderInfo::GetCharacterSetOverride).

last time I checked it seemed that the GetCharacterSetOverride was used only while decoding the RFC2047 encoded words, which I think is not the best thing to do (see my previous comment about this). If you still insist on keeping this "feature", I'd have to pass this on to EncodedHeader(), which'll result in a much larger patch... I'll double-check that this is indeed only used to override the charset provided in the RFC2047 string though.

> 
> Also, make RowCellColumnToMime2DecodedString use this logic instead of
> duplicating it.

ok (although I'd drop the GetCharacterSetOverride() stuff from there, too...)

I have some urgent issues at work so it might take a couple of days for the new patch
- I fixed all the "Nits" raised in the previous review
- Now I use m_dbFolderInfo::GetCharacterSetOverride for the charset selection, but not for overriding the charset in RFC2047-encoded-words (see below)
- I don't pass charsetOverride to m_mimeConverter->DecodeMimeHeader(), see previous comments about this (also, as far as I could see this override is not used anywhere else in the codebase)
- changed EffectiveCharset to effectiveCharset and Charset to charset in the IDL
- added doxygen-style comment to effectiveCharset
- changed logic in nsMsgDatabase::RowCellColumnToMime2DecodedString() and nsMsgDatabase::RowCellColumnToCollationKey() to use the new GetEffectiveCharset(), note: the charsetOverride flag is not queried and not passed to DecodeMimeHeader() here, see above

patch tested on comm-esr31, applies cleanly to trunk.

will take a look at the testing stuff next week if nobody beats me to it (please do ;). As far as I can see I should probably use the xpcshell-style tests, but any further pointers regarding testing are welcome...
Attachment #8474580 - Attachment is obsolete: true
Attachment #8481822 - Flags: review?(Pidgeot18)
Comment on attachment 8481822 [details] [diff] [review]
patch to fix RFC2047 decoding and address extraction order, revision 3

Review of attachment 8481822 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/public/nsIMsgHdr.idl
@@ +85,4 @@
>      void getRecipientsCollationKey(out unsigned long aCount,
>                                    [array, size_is(aCount)] out octet aKey);
>  
> +    attribute string charset;

Changing the case of this attribute breaks anyone who potentially uses this in JS code (including, notably, add-ons). Don't.

::: mailnews/db/msgdb/src/nsMsgHdr.cpp
@@ +627,5 @@
>  }
>  
> +NS_IMETHODIMP nsMsgHdr::GetEffectiveCharset(nsACString &resultCharset)
> +{
> +  return m_mdb->GetEffectiveCharset(GetMDBRow(), resultCharset);

m_mdbRow instead of GetMDBRow, please.
(In reply to Joshua Cranmer [:jcranmer] from comment #10)
> Comment on attachment 8481822 [details] [diff] [review]
> patch to fix RFC2047 decoding and address extraction order, revision 3
> 
> Review of attachment 8481822 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: mailnews/base/public/nsIMsgHdr.idl
> @@ +85,4 @@
> >      void getRecipientsCollationKey(out unsigned long aCount,
> >                                    [array, size_is(aCount)] out octet aKey);
> >  
> > +    attribute string charset;
> 
> Changing the case of this attribute breaks anyone who potentially uses this
> in JS code (including, notably, add-ons). Don't.

oops, I thought the case-changing/normalization applies to all users, but apparently it's only done for c++.  I reverted this change.

> 
> ::: mailnews/db/msgdb/src/nsMsgHdr.cpp
> @@ +627,5 @@
> >  }
> >  
> > +NS_IMETHODIMP nsMsgHdr::GetEffectiveCharset(nsACString &resultCharset)
> > +{
> > +  return m_mdb->GetEffectiveCharset(GetMDBRow(), resultCharset);
> 
> m_mdbRow instead of GetMDBRow, please.

ok. (I got the GetMDBRow() from the GetCharset() implementation, apparently GetMDBRow() and m_mdbRow are both used in the file with about the same frequency...)

I'll attach the rev4 patch tomorrow after a bit of testing.
- use m_mdbRow instead of GetMDBRow() in nsMsgHdr::GetEffectiveCharset()
- revert capitalization-change in Charset attribute in nsIMsgHdr.idl

tested on comm-esr31, applies cleanly on trunk
Attachment #8481822 - Attachment is obsolete: true
Attachment #8481822 - Flags: review?(Pidgeot18)
Attachment #8484040 - Flags: review?(Pidgeot18)
Comment on attachment 8484040 [details] [diff] [review]
patch-thunderbird-addr-parse-rev4.diff

Review of attachment 8484040 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/db/msgdb/src/nsMsgDatabase.cpp
@@ +3591,5 @@
> +{
> +  resultCharset.Truncate();
> +  bool characterSetOverride;
> +  m_dbFolderInfo->GetCharacterSetOverride(&characterSetOverride);
> +  nsresult rs = RowCellColumnToCharPtr(row, m_messageCharSetColumnToken, getter_Copies(resultCharset));

Nit: we usually call it nsresult rv.

@@ +3597,5 @@
> +      resultCharset.Equals("us-ascii") || characterSetOverride)
> +  {
> +    m_dbFolderInfo->GetEffectiveCharacterSet(resultCharset);
> +  }
> +  return NS_OK;

If you're always only going to make this return NS_OK, then just make this function a void function, since it's not being exposed to XPIDL.
Attachment #8484040 - Flags: review?(Pidgeot18) → feedback+
- fixed the nsresult variable name typo (rs -> rv)
- changed the return in nsMsgDatabase::GetEffectiveCharset() to return the result of the previous getters (RowCellColumnToCharPtr() or m_dbFolderInfo->GetEffectiveCharacterSet()). This is still effectively "always return NS_OK", but I think the code is cleaner this way. Changing nsMsgDatabase::GetEffectiveCharset() to void would make it a bit out-of-place in my opinion (see for example the usage in nsMsgHdr::GetEffectiveCharset())

I'll attach an alternate patch where I did change the return type of GetEffectiveCharset to void, but I'd strongly prefer this version (5a) over the void version (5b)
Attachment #8484040 - Attachment is obsolete: true
Attachment #8488561 - Flags: review?(Pidgeot18)
- fixed the nsresult variable name typo (rs -> rv)
- changed the return type of nsMsgDatabase::GetEffectiveCharset() to void as requested in the review, but please consider the other patch (rev5a) instead, as I feel that one is better / more consistent.
Attachment #8488562 - Flags: review?(Pidgeot18)
I added some brief tests to make sure that this patch works: they should pass with this patch and fail without it.
(In reply to Joshua Cranmer [:jcranmer] from comment #16)
> Created attachment 8492840 [details] [diff] [review]
> An xpcshell test for this patch
> 
> I added some brief tests to make sure that this patch works: they should
> pass with this patch and fail without it.

Thanks for the test! 2 minor issues:
1) AFAIK the comma in the rfc2047-encoded address should be escaped (=2C), see rfc2047 5.(3):

 As a replacement for a 'word' entity within a 'phrase', for example,
    one that precedes an address in a From, To, or Cc header.  The ABNF
    definition for 'phrase' from RFC 822 thus becomes:

    phrase = 1*( encoded-word / word )

    In this case the set of characters that may be used in a "Q"-encoded
    'encoded-word' is restricted to: <upper and lower case ASCII
    letters, decimal digits, "!", "*", "+", "-", "/", "=", and "_"
    (underscore, ASCII 95.)>.

so in the last test case, the input should be: [{from: "=?UTF-8?Q?H=C3=A5s=C3=A4ther=2C_David?= <db@null.invalid>"},

2) After fixing 1) the test no longer fails on the unpatched thunderbird (I tested with comm-esr31), so apparently it tests some other code-path then the one I patched :(
(In reply to Radics Péter from comment #17)
> 2) After fixing 1) the test no longer fails on the unpatched thunderbird (I
> tested with comm-esr31), so apparently it tests some other code-path then
> the one I patched :(

Sorry, this was user-error on my part, when I ran the xpcshell-tests it picked up my live (patched) install of thunderbird instead of the one I just built.  The test indeed fails without the patch and passes with it (after fixing issue 1 above).
Blocks: 842632
Keywords: regression
Attachment #8488561 - Flags: review?(Pidgeot18) → review?(kent)
Attachment #8488562 - Flags: review?(Pidgeot18)
Comment on attachment 8488561 [details] [diff] [review]
patch to fix RFC2047 decoding and address extraction order, revision 5

I've looked this over some to come up to speed on the issues, but as I understand from IRC there will be a patch provided with the xpcshell test combined and fixed per comment 17 before I do the review. Removing the review request until then, fell free to add it back when you are really ready for me to review.
Attachment #8488561 - Flags: review?(kent)
Comment on attachment 8507386 [details] [diff] [review]
patch to fix RFC2047 decoding and address extraction order, revision 6

- merged tests to the rev5a patch
- added 3 more tests for charset specified in header
Attachment #8507386 - Attachment description: patch-thunderbird-addr-parse-rev6.diff → patch to fix RFC2047 decoding and address extraction order, revision 6
Comment on attachment 8507386 [details] [diff] [review]
patch to fix RFC2047 decoding and address extraction order, revision 6

Review of attachment 8507386 [details] [diff] [review]:
-----------------------------------------------------------------

This all looks good to me. Thanks for the patch!
Attachment #8507386 - Flags: review?(kent) → review+
Attachment #8488561 - Attachment is obsolete: true
Attachment #8488562 - Attachment is obsolete: true
Keywords: checkin-needed
Thanks for this patch, it has now been checked into comm-central and should be available in the next Thunderbird nightly build. (To be released as Thunderbird 36.)

https://hg.mozilla.org/comm-central/rev/0b9f01eb00e7
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 36.0
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: