Closed
Bug 369835
Opened 18 years ago
Closed 17 years ago
email_in.pl should convert character encoding to utf8
Categories
(Bugzilla :: Incoming Email, defect)
Tracking
()
RESOLVED
FIXED
Bugzilla 3.0
People
(Reporter: firefox, Assigned: fe)
Details
Attachments
(4 files, 2 obsolete files)
998 bytes,
patch
|
Details | Diff | Splinter Review | |
977 bytes,
patch
|
mkanat
:
review+
|
Details | Diff | Splinter Review |
604 bytes,
text/plain
|
Details | |
624 bytes,
text/plain
|
Details |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1
Build Identifier: Bugzilla 2.23.4+
Bugzilla doesn't check the character encoding of incoming email, which causes problems if the mail contains bytes which aren't valid UTF-8 characters.
Reproducible: Always
Steps to Reproduce:
1. Set up Bugzilla to use UTF-8 encoding
2. Send mail in Latin1 encoding, containing non-ASCII characters
Actual Results:
Bug is truncated at first non-ASCII character.
Expected Results:
Bugzilla should check the headers of the incoming message, and convert the text to UTF-8.
Reporter | ||
Comment 1•18 years ago
|
||
I'm assuming latin1 here, if there's no charset parameter specified in the content type header. I'm not sure if this is reasonable?
This also requires Text::Iconv;
Updated•18 years ago
|
Assignee: general → create-and-change
Component: Bugzilla-General → Creating/Changing Bugs
OS: Other → All
Hardware: PC → All
Summary: email_in.pl does should convert character encoding to utf8 → email_in.pl should convert character encoding to utf8
Version: unspecified → 2.23.4
Comment 2•18 years ago
|
||
Yes, RFC specifies that the charset is ISO-8859-1 if nothing is specified. We don't want to use Text::Iconv--let's just use Encode, which comes with perl.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee | ||
Comment 3•17 years ago
|
||
(In reply to comment #1)
> Quick patch
this patch doesn't recognize charsets from Content-Type strings like this:
"Content-Type: text/plain; charset=KOI8-R"
I'm change your single charset find regexp with this huge construction (i'm not a perl coder, sorry)
$ct =~ /charset=(.+);/;
my $charset= $1;
if (!$charset) {
$ct =~ /charset=(.+)/;
$charset= $1 || 'iso-8859-1';
}
this works.
Assignee | ||
Comment 4•17 years ago
|
||
I was trying to 'just use perl Encode', but it wasn't successfull :(
In email_in.pl we have 'texts', but Encode needs 'bytes'.
May be it's possible to have 'bytes' from <STDIN> at very beginning, but at this point encoding table is not knowing yet.
I'm confused.
So, my edited patch also needs Text::Iconv
Assignee | ||
Updated•17 years ago
|
Flags: blocking3.1.1?
Flags: blocking3.0.1?
Comment 5•17 years ago
|
||
This is not a blocker. I'm in the process of releasing these releases right now. :-)
Flags: blocking3.1.1?
Flags: blocking3.1.1-
Flags: blocking3.0.1?
Flags: blocking3.0.1-
Assignee | ||
Comment 6•17 years ago
|
||
:)
may I ask you for a blocker status for 3.?.2?
Flags: blocking3.1.2?
Flags: blocking3.0.2?
Comment 7•17 years ago
|
||
Yeah, sure. Technically this would be a sort of minor dataloss bug, so it can be a blocker.
You don't need to use Text::Iconv, though, just Encode:
http://perldoc.perl.org/Encode.html#PERL-ENCODING-API
Assignee: create-and-change → fe
Flags: blocking3.1.2?
Flags: blocking3.1.2+
Flags: blocking3.0.2?
Flags: blocking3.0.2+
Assignee | ||
Comment 8•17 years ago
|
||
I spent two days in reading this doc and attempts of using Encode. I'm not a perlcoder, and this attempts was unsuccessfull.
I'll better wait for release 3.0.2 and read this piece of code to learn perl a little better. :)
Comment 9•17 years ago
|
||
Okay. You can see how contrib/recode.pl uses it.
Assignee | ||
Comment 10•17 years ago
|
||
Attachment #277888 -
Attachment is obsolete: true
Attachment #280368 -
Flags: review?(mkanat)
Assignee | ||
Updated•17 years ago
|
Status: NEW → ASSIGNED
Comment 11•17 years ago
|
||
Comment on attachment 280368 [details] [diff] [review]
patch that use Encode
>--- bugzilla-3.0.1/email_in.pl 2007-08-13 16:38:09.000000000 +0400
>+++ bugzilla/email_in.pl 2007-09-10 22:04:43.000000000 +0400
>+ my $charset = 'iso-8859-1';
>+ if ($ct =~ /charset=(.+);/) {
>+ $charset= $1;
>+ } elsif ($ct =~ /charset=(.+)/){
>+ $charset= $1;
>+ }
That's the same condition twice. The second one is identical to the first one.
>+ if (Bugzilla->params->{'utf8'}) {
>+ $body = encode('UTF-8', decode($charset, $body));
>+ }
What Perl version did you test that on? Because "UTF-8" means something different than "utf8" in later 5.8.x versions.
Attachment #280368 -
Flags: review?(mkanat) → review-
Assignee | ||
Comment 12•17 years ago
|
||
(In reply to comment #11)
> (From update of attachment 280368 [details] [diff] [review])
> >--- bugzilla-3.0.1/email_in.pl 2007-08-13 16:38:09.000000000 +0400
> >+++ bugzilla/email_in.pl 2007-09-10 22:04:43.000000000 +0400
> >+ my $charset = 'iso-8859-1';
> >+ if ($ct =~ /charset=(.+);/) {
> >+ $charset= $1;
> >+ } elsif ($ct =~ /charset=(.+)/){
> >+ $charset= $1;
> >+ }
>
> That's the same condition twice. The second one is identical to the first
> one.
There is ';' in first condition, and isn't in second.
> >+ if (Bugzilla->params->{'utf8'}) {
> >+ $body = encode('UTF-8', decode($charset, $body));
> >+ }
>
> What Perl version did you test that on? Because "UTF-8" means something
> different than "utf8" in later 5.8.x versions.
5.8.7
Comment 13•17 years ago
|
||
(In reply to comment #12)
> There is ';' in first condition, and isn't in second.
Yes, I know, you don't need that. Do ([^;]+) instead.
> 5.8.7
Okay, that's fine.
Assignee | ||
Comment 14•17 years ago
|
||
(In reply to comment #13)
>Do ([^;]+) instead.
I do. Thank you again.
Attachment #280368 -
Attachment is obsolete: true
Attachment #280401 -
Flags: review?(mkanat)
Comment 15•17 years ago
|
||
Comment on attachment 280401 [details] [diff] [review]
patch that use Encode v2
>--- bugzilla-3.0.1/email_in.pl 2007-08-13 16:38:09.000000000 +0400
>@@ -295,9 +296,17 @@
> my $body;
> foreach my $part (@parts) {
> my $ct = $part->content_type || 'text/plain';
>+ my $charset = 'iso-8859-1';
>+ if ($ct =~ /charset=([^;]+)/) {
>+ $charset= $1;
>+ }
Nit: There's a lot of extra whitespace there.
This looks good, but before I approve it, I just want to know how you tested it.
Attachment #280401 -
Flags: review?(mkanat) → review+
Updated•17 years ago
|
Flags: approval?
Flags: approval3.0?
Assignee | ||
Comment 16•17 years ago
|
||
Assignee | ||
Comment 17•17 years ago
|
||
Assignee | ||
Comment 18•17 years ago
|
||
(In reply to comment #15)
> This looks good, but before I approve it, I just want to know how you tested
> it.
cat ~/mail.txt | perl ./email_in.pl -vv
cat ~/mail2.txt | perl ./email_in.pl -vv
and look at output by eyes.
mail.txt and mail2.txt attached.
Updated•17 years ago
|
Flags: blocking3.1.3+
Flags: blocking3.1.2+
Flags: blocking3.0.3+
Flags: blocking3.0.2+
Updated•17 years ago
|
Flags: approval?
Flags: approval3.0?
Flags: approval3.0+
Flags: approval+
Comment 20•17 years ago
|
||
tip:
Checking in email_in.pl;
/cvsroot/mozilla/webtools/bugzilla/email_in.pl,v <-- email_in.pl
new revision: 1.8; previous revision: 1.7
done
3.0 branch:
Checking in email_in.pl;
/cvsroot/mozilla/webtools/bugzilla/email_in.pl,v <-- email_in.pl
new revision: 1.5.2.3; previous revision: 1.5.2.2
done
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Updated•17 years ago
|
Component: Creating/Changing Bugs → Incoming Email
You need to log in
before you can comment on or make changes to this bug.
Description
•