Closed Bug 369835 Opened 18 years ago Closed 17 years ago

email_in.pl should convert character encoding to utf8

Categories

(Bugzilla :: Incoming Email, defect)

2.23.4
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Bugzilla 3.0

People

(Reporter: firefox, Assigned: fe)

Details

Attachments

(4 files, 2 obsolete files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1 Build Identifier: Bugzilla 2.23.4+ Bugzilla doesn't check the character encoding of incoming email, which causes problems if the mail contains bytes which aren't valid UTF-8 characters. Reproducible: Always Steps to Reproduce: 1. Set up Bugzilla to use UTF-8 encoding 2. Send mail in Latin1 encoding, containing non-ASCII characters Actual Results: Bug is truncated at first non-ASCII character. Expected Results: Bugzilla should check the headers of the incoming message, and convert the text to UTF-8.
Attached patch Quick patchSplinter Review
I'm assuming latin1 here, if there's no charset parameter specified in the content type header. I'm not sure if this is reasonable? This also requires Text::Iconv;
Assignee: general → create-and-change
Component: Bugzilla-General → Creating/Changing Bugs
OS: Other → All
Hardware: PC → All
Summary: email_in.pl does should convert character encoding to utf8 → email_in.pl should convert character encoding to utf8
Version: unspecified → 2.23.4
Yes, RFC specifies that the charset is ISO-8859-1 if nothing is specified. We don't want to use Text::Iconv--let's just use Encode, which comes with perl.
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to comment #1) > Quick patch this patch doesn't recognize charsets from Content-Type strings like this: "Content-Type: text/plain; charset=KOI8-R" I'm change your single charset find regexp with this huge construction (i'm not a perl coder, sorry) $ct =~ /charset=(.+);/; my $charset= $1; if (!$charset) { $ct =~ /charset=(.+)/; $charset= $1 || 'iso-8859-1'; } this works.
I was trying to 'just use perl Encode', but it wasn't successfull :( In email_in.pl we have 'texts', but Encode needs 'bytes'. May be it's possible to have 'bytes' from <STDIN> at very beginning, but at this point encoding table is not knowing yet. I'm confused. So, my edited patch also needs Text::Iconv
Flags: blocking3.1.1?
Flags: blocking3.0.1?
This is not a blocker. I'm in the process of releasing these releases right now. :-)
Flags: blocking3.1.1?
Flags: blocking3.1.1-
Flags: blocking3.0.1?
Flags: blocking3.0.1-
:) may I ask you for a blocker status for 3.?.2?
Flags: blocking3.1.2?
Flags: blocking3.0.2?
Yeah, sure. Technically this would be a sort of minor dataloss bug, so it can be a blocker. You don't need to use Text::Iconv, though, just Encode: http://perldoc.perl.org/Encode.html#PERL-ENCODING-API
Assignee: create-and-change → fe
Flags: blocking3.1.2?
Flags: blocking3.1.2+
Flags: blocking3.0.2?
Flags: blocking3.0.2+
I spent two days in reading this doc and attempts of using Encode. I'm not a perlcoder, and this attempts was unsuccessfull. I'll better wait for release 3.0.2 and read this piece of code to learn perl a little better. :)
Okay. You can see how contrib/recode.pl uses it.
Attached patch patch that use Encode (obsolete) — Splinter Review
Attachment #277888 - Attachment is obsolete: true
Attachment #280368 - Flags: review?(mkanat)
Status: NEW → ASSIGNED
Comment on attachment 280368 [details] [diff] [review] patch that use Encode >--- bugzilla-3.0.1/email_in.pl 2007-08-13 16:38:09.000000000 +0400 >+++ bugzilla/email_in.pl 2007-09-10 22:04:43.000000000 +0400 >+ my $charset = 'iso-8859-1'; >+ if ($ct =~ /charset=(.+);/) { >+ $charset= $1; >+ } elsif ($ct =~ /charset=(.+)/){ >+ $charset= $1; >+ } That's the same condition twice. The second one is identical to the first one. >+ if (Bugzilla->params->{'utf8'}) { >+ $body = encode('UTF-8', decode($charset, $body)); >+ } What Perl version did you test that on? Because "UTF-8" means something different than "utf8" in later 5.8.x versions.
Attachment #280368 - Flags: review?(mkanat) → review-
(In reply to comment #11) > (From update of attachment 280368 [details] [diff] [review]) > >--- bugzilla-3.0.1/email_in.pl 2007-08-13 16:38:09.000000000 +0400 > >+++ bugzilla/email_in.pl 2007-09-10 22:04:43.000000000 +0400 > >+ my $charset = 'iso-8859-1'; > >+ if ($ct =~ /charset=(.+);/) { > >+ $charset= $1; > >+ } elsif ($ct =~ /charset=(.+)/){ > >+ $charset= $1; > >+ } > > That's the same condition twice. The second one is identical to the first > one. There is ';' in first condition, and isn't in second. > >+ if (Bugzilla->params->{'utf8'}) { > >+ $body = encode('UTF-8', decode($charset, $body)); > >+ } > > What Perl version did you test that on? Because "UTF-8" means something > different than "utf8" in later 5.8.x versions. 5.8.7
(In reply to comment #12) > There is ';' in first condition, and isn't in second. Yes, I know, you don't need that. Do ([^;]+) instead. > 5.8.7 Okay, that's fine.
(In reply to comment #13) >Do ([^;]+) instead. I do. Thank you again.
Attachment #280368 - Attachment is obsolete: true
Attachment #280401 - Flags: review?(mkanat)
Comment on attachment 280401 [details] [diff] [review] patch that use Encode v2 >--- bugzilla-3.0.1/email_in.pl 2007-08-13 16:38:09.000000000 +0400 >@@ -295,9 +296,17 @@ > my $body; > foreach my $part (@parts) { > my $ct = $part->content_type || 'text/plain'; >+ my $charset = 'iso-8859-1'; >+ if ($ct =~ /charset=([^;]+)/) { >+ $charset= $1; >+ } Nit: There's a lot of extra whitespace there. This looks good, but before I approve it, I just want to know how you tested it.
Attachment #280401 - Flags: review?(mkanat) → review+
Flags: approval?
Flags: approval3.0?
(In reply to comment #15) > This looks good, but before I approve it, I just want to know how you tested > it. cat ~/mail.txt | perl ./email_in.pl -vv cat ~/mail2.txt | perl ./email_in.pl -vv and look at output by eyes. mail.txt and mail2.txt attached.
Flags: blocking3.1.3+
Flags: blocking3.1.2+
Flags: blocking3.0.3+
Flags: blocking3.0.2+
Flags: approval?
Flags: approval3.0?
Flags: approval3.0+
Flags: approval+
Max, can you check it in?
Target Milestone: --- → Bugzilla 3.0
tip: Checking in email_in.pl; /cvsroot/mozilla/webtools/bugzilla/email_in.pl,v <-- email_in.pl new revision: 1.8; previous revision: 1.7 done 3.0 branch: Checking in email_in.pl; /cvsroot/mozilla/webtools/bugzilla/email_in.pl,v <-- email_in.pl new revision: 1.5.2.3; previous revision: 1.5.2.2 done
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Component: Creating/Changing Bugs → Incoming Email
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: