Last Comment Bug 387860 - Subject lines in mails may contain mangled multi-byte characters
: Subject lines in mails may contain mangled multi-byte characters
Status: RESOLVED FIXED
:
Product: Bugzilla
Classification: Server Software
Component: Email Notifications (show other bugs)
: 3.0
: All All
: -- normal with 3 votes (vote)
: Bugzilla 3.0
Assigned To: Ilya Slobodin
: default-qa
:
Mentors:
: 394165 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2007-07-12 01:17 PDT by Marc Schumann [:Wurblzap]
Modified: 2007-08-29 07:21 PDT (History)
10 users (show)
mkanat: approval+
LpSolit: approval3.0+
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
Patch for 3.0 (700 bytes, patch)
2007-08-01 04:21 PDT, Ilya Slobodin
wurblzap: review+
wicked: review+
Details | Diff | Splinter Review

Description Marc Schumann [:Wurblzap] 2007-07-12 01:17:53 PDT
If a multi-byte character is in an unlucky spot, it may get split in two, effectively breaking it. http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=5475 is an example for this: its summary “Test der Umlaute öäü” gets damaged in mails. You can reproduce this by having an e-mail sent to you for it.
Comment 1 Marc Schumann [:Wurblzap] 2007-07-12 01:19:46 PDT
Oh yeah, here's how the Subject header looks like:

Subject: [Bug 62] New:=?UTF-8?Q?=20Test=20der=20Umlaute=20=C3=B6=C3?=
 =?UTF-8?Q?=A4=C3=BC?=

The second umlaut gets split across the line break.
Comment 2 Ilya Slobodin 2007-08-01 04:21:56 PDT
Created attachment 274755 [details] [diff] [review]
Patch for 3.0

This patch is based on Dan Kogai the Encode maintainer's explanation:

http://www.nntp.perl.org/group/perl.unicode/2004/02/msg2404.html

Dan Kogai: It LOOKS LIKE a bug but it is not.

[...]

Why that happens is that you use Encode::encode("UTF-8").  That means  
the utf8 flag of the resulting $string is OFF.   Without UTF-8 flag  
perl takes it as "\xA4\xC3", not "\N{LATIN CAPITAL LETTER A WITH  
TILDE}".  Now try the code below.

use Encode qw(encode);
my $string = Encode::decode("UTF-8",  
"ääääääääääääääääääääääääääääääääääää");
#                    ^^^^^^
print Encode::encode("MIME-Q", $string), "\n"

[...]
Comment 3 Marc Schumann [:Wurblzap] 2007-08-01 04:26:30 PDT
So... Does this mean this bug here will be fixed by bug 363153?
Comment 4 Max Kanat-Alexander 2007-08-01 04:28:49 PDT
(In reply to comment #3)
> So... Does this mean this bug here will be fixed by bug 363153?

  It means that this bug is technically part of that bug, but due to the nature of the technical difficulties we're having over there, it might be a good idea to test and consider this patch. (With the addition of spaces around the "=" of course.)
Comment 5 Marc Schumann [:Wurblzap] 2007-08-01 12:51:37 PDT
Comment on attachment 274755 [details] [diff] [review]
Patch for 3.0

With the patch, the subject line no longer reads

    Subject: [Bug 5475]=?UTF-8?Q?=20Test=20der=20Umlaute=20=C3=B6=C3?=
     =?UTF-8?Q?=A4=C3=BC?=

but

    Subject: [Bug 5475]=?UTF-8?Q?=20Test=20der=20Umlaute=20=C3=B6?=
     =?UTF-8?Q?=C3=A4=C3=BC?=

which works -- r=Wurblzap.

I can do the check-in after a+, touched up as mentioned in comment 4, if you don't have CVS access, Ilya.
Comment 6 Max Kanat-Alexander 2007-08-01 15:52:46 PDT
I can approve this for tip now, but since we're frozen for QA, LpSolit has to approve it for 3.0.

Make sure that you add spaces around the "=" on checkin.
Comment 7 Frédéric Buclin 2007-08-01 22:47:19 PDT
Go for it for 3.0.1. Be sure to test this patch with both utf8 turned on and off, in case one of the two cases is broken. Also, is data seen differently when being newly added in 3.0 or upgraded from 2.20 where UTF-8 wasn't in use? Just to be sure. I don't want an emergency release next week.

Please land it today to catch it in our QA tests.
Comment 8 Marc Schumann [:Wurblzap] 2007-08-01 23:48:19 PDT
(In reply to comment #7)
> Go for it for 3.0.1. Be sure to test this patch with both utf8 turned on and
> off, in case one of the two cases is broken. Also, is data seen differently

It works with utf8 turned on and turned off.

> when being newly added in 3.0 or upgraded from 2.20 where UTF-8 wasn't in use?

I can't (or won't; your pick) test the 2.20-upgrade part of this. Please reconfirm a3.0+.
Comment 9 Marc Schumann [:Wurblzap] 2007-08-02 00:24:12 PDT
Thanks for the patch, Ilya.

Tip:
Checking in Bugzilla/Mailer.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Mailer.pm,v  <--  Mailer.pm
new revision: 1.11; previous revision: 1.10
done

3.0:
Postponed until a3.0+.
Comment 10 Frédéric Buclin 2007-08-02 02:13:25 PDT
Bah... I wanted to test with a 2.20.5 installation where some comments were written with the UTF8 encoding, and some others with the ISO-8859-1 encoding (selected from my web browser), and then upgrading to 3.0.1 to see what happen. After the upgrade, the utf8 param was off, as expected when upgrading from 2.20. Then I 1) ran recode.pl 2) turned on the utf8 param and 3) ran checksetup.pl, and guess what? "Test der Umlaute öäü" which was written as UTF8 is still there, but the same string which was written with the ISO-8859-1 encoding has disappeared and has been replaced by "Test der Umlaute". All the text which was following this string has been clipped too. I ran ./contrib/recode.pl --charset=iso-8859-1 --show-failures, and it reported no error at all, but I now lost some data.

I'm not blaming this patch as it has nothing to do with the problem I just had. I'm only saying that this sucks as I cannot test this patch with a real upgraded DB.
Comment 11 Marc Schumann [:Wurblzap] 2007-08-02 02:39:38 PDT
Yeah, this matches what people experience and report to support-bugzilla@.
Comment 12 Max Kanat-Alexander 2007-08-02 03:50:31 PDT
There is no difference between an upgraded DB and an originally-3.0 DB.

If you're experiencing dataloss even though you ran recode, file another bug. We've had a few reports of that already.
Comment 13 Teemu Mannermaa (:wicked) 2007-08-09 09:20:34 PDT
Comment on attachment 274755 [details] [diff] [review]
Patch for 3.0

This seems to fix the issue atleast for the testcase in this bug on a 3.0 branch installation with utf8 enabled. I did also test this with an update from 2.22 to 3.0 that had a latin1 subject on a bug. (Note that recode.pl is irrevelant at this case because it doesn't touch latin1 strings.)

However, when I tested this with utf8 flag disabled (on DB that's still latin1) I found that it didn't fix the error. It still results malformed UTF8 strings on subjects with UTF8 letters. It seems Encode::MIME::Header (the Perl module that handless MIME-Q encoding type for Encode module) version that comes with Perl 5.8.5 always encodes things in UTF8 and uses Perl utf8 mode.

Simplest fix for this would to be to remove if clause and always run string through UTF8 decoding. This does seem to work for UTF8 subjects. For subjects with latin1 encoding, this will just strip out the non-UTF8 characters. However, without this subjects are always said to be in UTF8 but instead contain 8-bit latin1 characters which makes TB to ignore the whole encoded word and thus strips out almost all of the subject. This means non-utf8 converted systems are already badly broken as things are now so this could be considered to be a separate bug.

Giving this r- for now. I can reverse that if we don't care about non-utf8 mode as everybody should just convert their databases because otherwise there are lot's of things broken anyway.
Comment 14 Marc Schumann [:Wurblzap] 2007-08-09 10:48:10 PDT
Even if we do care about non-utf8 mode, we can surely handle this in a separate bug, I think. With this patch here, nothing changes for non-utf8 mode. Would you consider this and maybe r+?
Comment 15 Max Kanat-Alexander 2007-08-09 13:52:23 PDT
(In reply to comment #13)
> However, when I tested this with utf8 flag disabled (on DB that's still latin1)
> I found that it didn't fix the error. 

  That's OK, Bugzilla is not expected to handle utf8 with the utf8 flag off.

> Giving this r- for now. I can reverse that if we don't care about non-utf8 mode
> as everybody should just convert their databases because otherwise there are
> lot's of things broken anyway.

  Yes, just reverse it. We don't care about people who have utf8 in their database but haven't turned on the utf8 parameter.
Comment 16 Frédéric Buclin 2007-08-09 13:55:10 PDT
(In reply to comment #15)
>   Yes, just reverse it. We don't care about people who have utf8 in their
> database but haven't turned on the utf8 parameter.

No, my own tests also show that non-UTF8 data in an installation with utf8 turned off is also broken.
Comment 17 Marc Schumann [:Wurblzap] 2007-08-09 14:02:42 PDT
(In reply to comment #16)
> No, my own tests also show that non-UTF8 data in an installation with utf8
> turned off is also broken.

Well it's broken without this patch as well, right? So nothing new there. I don't think we should attempt to fix this very very old bugginess and the UTF-8 bugginess at the same time.
Comment 18 Max Kanat-Alexander 2007-08-09 14:19:21 PDT
(In reply to comment #16)
> No, my own tests also show that non-UTF8 data in an installation with utf8
> turned off is also broken.

  Yes, that's to be expected and is essentially impossible to fix. The data in a non-utf8 installation could be in ANY encoding! Email is latin1 by default, so we do our best to encode everything as latin1, and if users don't want to use utf8, they have to customize Bugzilla just like they always did. However, they should just be using utf8 if they have non-latin encodings.
Comment 19 Teemu Mannermaa (:wicked) 2007-08-10 04:12:17 PDT
Comment on attachment 274755 [details] [diff] [review]
Patch for 3.0

> > No, my own tests also show that non-UTF8 data in an installation with utf8
>   Yes, that's to be expected and is essentially impossible to fix. The data in

No, problem is that we SAY it's UTF8 even for latin1 encoded strings. This obviously breaks decoding process and standard compliant systems ignore whole encoded word. We simply can't use MIME-Q encoding in Encode.pm when in non-UTF8 mode as that will make this decision for us, implicitly. This regressed whenever we started to encode headers with Encode.pm.

But I agree that we should handle that regression in some other bug and just note for this bug that non-UTF8 mode still mangless multi-byte characters.
Comment 20 Frédéric Buclin 2007-08-10 07:41:54 PDT
a=me for 3.0.1 based on my discussion with wicked on IRC. Please land this patch asap.
Comment 21 Marc Schumann [:Wurblzap] 2007-08-10 09:02:54 PDT
Branch:
Checking in Bugzilla/Mailer.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Mailer.pm,v  <--  Mailer.pm
new revision: 1.7.2.4; previous revision: 1.7.2.3
done
Comment 22 Marc Schumann [:Wurblzap] 2007-08-29 07:21:15 PDT
*** Bug 394165 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.