Closed Bug 387860 Opened 17 years ago Closed 17 years ago

Subject lines in mails may contain mangled multi-byte characters

Categories

(Bugzilla :: Email Notifications, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
Bugzilla 3.0

People

(Reporter: Wurblzap, Assigned: islobodin)

References

Details

Attachments

(1 file)

If a multi-byte character is in an unlucky spot, it may get split in two, effectively breaking it. http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=5475 is an example for this: its summary “Test der Umlaute öäü” gets damaged in mails. You can reproduce this by having an e-mail sent to you for it.
Oh yeah, here's how the Subject header looks like:

Subject: [Bug 62] New:=?UTF-8?Q?=20Test=20der=20Umlaute=20=C3=B6=C3?=
 =?UTF-8?Q?=A4=C3=BC?=

The second umlaut gets split across the line break.
Attached patch Patch for 3.0Splinter Review
This patch is based on Dan Kogai the Encode maintainer's explanation:

http://www.nntp.perl.org/group/perl.unicode/2004/02/msg2404.html

Dan Kogai: It LOOKS LIKE a bug but it is not.

[...]

Why that happens is that you use Encode::encode("UTF-8").  That means  
the utf8 flag of the resulting $string is OFF.   Without UTF-8 flag  
perl takes it as "\xA4\xC3", not "\N{LATIN CAPITAL LETTER A WITH  
TILDE}".  Now try the code below.

use Encode qw(encode);
my $string = Encode::decode("UTF-8",  
"ääääääääääääääääääääääääääääääääääää");
#                    ^^^^^^
print Encode::encode("MIME-Q", $string), "\n"

[...]
So... Does this mean this bug here will be fixed by bug 363153?
(In reply to comment #3)
> So... Does this mean this bug here will be fixed by bug 363153?

  It means that this bug is technically part of that bug, but due to the nature of the technical difficulties we're having over there, it might be a good idea to test and consider this patch. (With the addition of spaces around the "=" of course.)
Assignee: email-notifications → islobodin
Attachment #274755 - Flags: review?(wurblzap)
Comment on attachment 274755 [details] [diff] [review]
Patch for 3.0

With the patch, the subject line no longer reads

    Subject: [Bug 5475]=?UTF-8?Q?=20Test=20der=20Umlaute=20=C3=B6=C3?=
     =?UTF-8?Q?=A4=C3=BC?=

but

    Subject: [Bug 5475]=?UTF-8?Q?=20Test=20der=20Umlaute=20=C3=B6?=
     =?UTF-8?Q?=C3=A4=C3=BC?=

which works -- r=Wurblzap.

I can do the check-in after a+, touched up as mentioned in comment 4, if you don't have CVS access, Ilya.
Attachment #274755 - Flags: review?(wurblzap) → review+
Flags: approval?
Flags: approval3.0?
I can approve this for tip now, but since we're frozen for QA, LpSolit has to approve it for 3.0.

Make sure that you add spaces around the "=" on checkin.
Status: NEW → ASSIGNED
Flags: approval? → approval+
Go for it for 3.0.1. Be sure to test this patch with both utf8 turned on and off, in case one of the two cases is broken. Also, is data seen differently when being newly added in 3.0 or upgraded from 2.20 where UTF-8 wasn't in use? Just to be sure. I don't want an emergency release next week.

Please land it today to catch it in our QA tests.
Flags: approval3.0? → approval3.0+
Target Milestone: Bugzilla 3.2 → Bugzilla 3.0
Version: 3.1 → 3.0
(In reply to comment #7)
> Go for it for 3.0.1. Be sure to test this patch with both utf8 turned on and
> off, in case one of the two cases is broken. Also, is data seen differently

It works with utf8 turned on and turned off.

> when being newly added in 3.0 or upgraded from 2.20 where UTF-8 wasn't in use?

I can't (or won't; your pick) test the 2.20-upgrade part of this. Please reconfirm a3.0+.
Flags: approval3.0+ → approval3.0?
Thanks for the patch, Ilya.

Tip:
Checking in Bugzilla/Mailer.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Mailer.pm,v  <--  Mailer.pm
new revision: 1.11; previous revision: 1.10
done

3.0:
Postponed until a3.0+.
Bah... I wanted to test with a 2.20.5 installation where some comments were written with the UTF8 encoding, and some others with the ISO-8859-1 encoding (selected from my web browser), and then upgrading to 3.0.1 to see what happen. After the upgrade, the utf8 param was off, as expected when upgrading from 2.20. Then I 1) ran recode.pl 2) turned on the utf8 param and 3) ran checksetup.pl, and guess what? "Test der Umlaute öäü" which was written as UTF8 is still there, but the same string which was written with the ISO-8859-1 encoding has disappeared and has been replaced by "Test der Umlaute". All the text which was following this string has been clipped too. I ran ./contrib/recode.pl --charset=iso-8859-1 --show-failures, and it reported no error at all, but I now lost some data.

I'm not blaming this patch as it has nothing to do with the problem I just had. I'm only saying that this sucks as I cannot test this patch with a real upgraded DB.
Yeah, this matches what people experience and report to support-bugzilla@.
There is no difference between an upgraded DB and an originally-3.0 DB.

If you're experiencing dataloss even though you ran recode, file another bug. We've had a few reports of that already.
Comment on attachment 274755 [details] [diff] [review]
Patch for 3.0

This seems to fix the issue atleast for the testcase in this bug on a 3.0 branch installation with utf8 enabled. I did also test this with an update from 2.22 to 3.0 that had a latin1 subject on a bug. (Note that recode.pl is irrevelant at this case because it doesn't touch latin1 strings.)

However, when I tested this with utf8 flag disabled (on DB that's still latin1) I found that it didn't fix the error. It still results malformed UTF8 strings on subjects with UTF8 letters. It seems Encode::MIME::Header (the Perl module that handless MIME-Q encoding type for Encode module) version that comes with Perl 5.8.5 always encodes things in UTF8 and uses Perl utf8 mode.

Simplest fix for this would to be to remove if clause and always run string through UTF8 decoding. This does seem to work for UTF8 subjects. For subjects with latin1 encoding, this will just strip out the non-UTF8 characters. However, without this subjects are always said to be in UTF8 but instead contain 8-bit latin1 characters which makes TB to ignore the whole encoded word and thus strips out almost all of the subject. This means non-utf8 converted systems are already badly broken as things are now so this could be considered to be a separate bug.

Giving this r- for now. I can reverse that if we don't care about non-utf8 mode as everybody should just convert their databases because otherwise there are lot's of things broken anyway.
Attachment #274755 - Flags: review-
Even if we do care about non-utf8 mode, we can surely handle this in a separate bug, I think. With this patch here, nothing changes for non-utf8 mode. Would you consider this and maybe r+?
(In reply to comment #13)
> However, when I tested this with utf8 flag disabled (on DB that's still latin1)
> I found that it didn't fix the error. 

  That's OK, Bugzilla is not expected to handle utf8 with the utf8 flag off.

> Giving this r- for now. I can reverse that if we don't care about non-utf8 mode
> as everybody should just convert their databases because otherwise there are
> lot's of things broken anyway.

  Yes, just reverse it. We don't care about people who have utf8 in their database but haven't turned on the utf8 parameter.
(In reply to comment #15)
>   Yes, just reverse it. We don't care about people who have utf8 in their
> database but haven't turned on the utf8 parameter.

No, my own tests also show that non-UTF8 data in an installation with utf8 turned off is also broken.
(In reply to comment #16)
> No, my own tests also show that non-UTF8 data in an installation with utf8
> turned off is also broken.

Well it's broken without this patch as well, right? So nothing new there. I don't think we should attempt to fix this very very old bugginess and the UTF-8 bugginess at the same time.
(In reply to comment #16)
> No, my own tests also show that non-UTF8 data in an installation with utf8
> turned off is also broken.

  Yes, that's to be expected and is essentially impossible to fix. The data in a non-utf8 installation could be in ANY encoding! Email is latin1 by default, so we do our best to encode everything as latin1, and if users don't want to use utf8, they have to customize Bugzilla just like they always did. However, they should just be using utf8 if they have non-latin encodings.
Comment on attachment 274755 [details] [diff] [review]
Patch for 3.0

> > No, my own tests also show that non-UTF8 data in an installation with utf8
>   Yes, that's to be expected and is essentially impossible to fix. The data in

No, problem is that we SAY it's UTF8 even for latin1 encoded strings. This obviously breaks decoding process and standard compliant systems ignore whole encoded word. We simply can't use MIME-Q encoding in Encode.pm when in non-UTF8 mode as that will make this decision for us, implicitly. This regressed whenever we started to encode headers with Encode.pm.

But I agree that we should handle that regression in some other bug and just note for this bug that non-UTF8 mode still mangless multi-byte characters.
Attachment #274755 - Flags: review- → review+
a=me for 3.0.1 based on my discussion with wicked on IRC. Please land this patch asap.
Flags: approval3.0? → approval3.0+
Branch:
Checking in Bugzilla/Mailer.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Mailer.pm,v  <--  Mailer.pm
new revision: 1.7.2.4; previous revision: 1.7.2.3
done
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.