Closed Bug 126266 (bz-charset) Opened 22 years ago Closed 19 years ago

Use UTF-8 (Unicode) charset encoding for pages and email for NEW installations

Categories

(Bugzilla :: Bugzilla-General, defect, P1)

2.15

Tracking


RESOLVED FIXED
Bugzilla 2.22

People

(Reporter: burnus, Assigned: glob)

References

Details

(Whiteboard: i18n)

Attachments

(2 files, 30 obsolete files)

12.63 KB, patch
Wurblzap
: review+
2.40 KB, patch
cso
: review+
Presently the bugzilla webpages don't contain an encoding header. Neither do the
emails.

Expected:
- The HTML pages come with an encoding header such as:
  <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
- The emails come with an encoding header such as:
  MIME-version: 1.0
  Content-type: text/plain; format=flowed; charset=ISO-8859-1
  Content-transfer-encoding: 8BIT

Reasoning:
The encoding information makes sure that 8-bit characters are shown
correctly. I have chosen ISO-8859-1 (Latin-1) since it is the most widespread
(though not as "good" as UTF-8) and is the default encoding of MySQL.
The patch changes defparams.pl so only new installations of Bugzilla are
affected. Additionally, this value can easily be changed on the "parameters"
page.
Some small case changes should be made, so that the output looks like this:
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

instead of:
MIME-version: 1.0
Content-type: text/plain; format=flowed; charset=ISO-8859-1
Content-transfer-encoding: 8BIT
Changed "ISO" to "iso" and "8BIT" to "8bit".
Could someone mark attachment 70138, I'm not allowed to do so
"MIME-version: 1.0" should be "MIME-Version: 1.0"
"Content-type" should be "Content-Type"
"Content-transfer-encoding" should be "Content-Transfer-Encoding"
> "Content-type" should be "Content-Type"
> "Content-transfer-encoding" should be "Content-Transfer-Encoding"
Fixed. I should really go to bed ...
("Obsolete" marking is bug 97729 by the way)
Keywords: patch, review
Comment on attachment 70148 [details] [diff] [review]
defparams.pl patch (v3): Send per default the encoding for HTML (header) and for emails

Can't use a META tag for content-type. It's broken and makes Bad Things happen
in Netscape 4.x.

Need to actually send a charset parameter on the Content-Type header being spit
out in the HTTP.

Please see the discussion on bug 38856 for why this was refused entry the last
time it was presented, and take anything into consideration from that bug that
we need to do to keep everyone happy.
Attachment #70148 - Flags: review-
Attachment #70138 - Attachment is obsolete: true
Attachment #70144 - Attachment is obsolete: true
Attachment #70144 - Attachment is patch: true
Keywords: patch, review
Attached patch Bigger patch for text/html (v4) (obsolete) — Splinter Review
This patch addresses the problems by replacing the print "Content-Type:
text/html\n\n" by a function.

This patch was _not_ thoroughly tested, the %...% part in the email settings is
untested (to come...).
Additionally, the documentation (3.5.5) needs to be updated if this is
checked in.
Attached patch Bigger patch for text/html (v5) (obsolete) — Splinter Review
Now tested. Changes to previous version:
HTMLencoding -> encoding (since used by mail and needed for %encoding%
substitution)
%encoding% substitution works now
Attached patch Bigger patch for text/html (v5) (obsolete) — Splinter Review
Now tested. Changes to previous version:
- HTMLencoding -> encoding (since used by mail and needed for %encoding%
substitution)
- %encoding% substitution works now (in the email params)
Keywords: patch, review
--- reports.cgi	2002/01/31 23:51:38	1.51
+++ reports.cgi	2002/02/19 11:17:36
+    PutHTMLContentType("Content-disposition: inline;
filename=bugzilla_report.html";

I missed a ")" before the semicolon.
Question: what does it do if you leave it blank?  (I haven't looked at the patch
yet, but easier to ask you than dig through the patch :)

Does it do the Content-Type: text/plain (or text/html) without the ;
charset=xxxx on the end if you leave it blank?

(it'll need to work this way for the japanese folks IIRC since they have to be
able to change the charset on the fly in the middle of the page)
> Question: what does it do if you leave it blank?  (I haven't looked at the 
> patch yet, but easier to ask you than dig through the patch :)
> Does it do the Content-Type: text/plain (or text/html) without the ;
> charset=xxxx on the end if you leave it blank?
It then sends only the text/html part (text/plain is not supported (yet?)).

+  if( Param('encoding') ne '') {
+    print 'Content-Type: text/html; charset='.Param('encoding')."\n$header\n";
+  } else {
+    print "Content-Type: text/html\n$header\n";

> (it'll need to work this way for the japanese folks IIRC since they have to be
> able to change the charset on the fly in the middle of the page)
Hm. This doesn't sound that healthy actually, but if the browser still likes it ...
Attached patch Bigger patch for text/html (v6) (obsolete) — Splinter Review
Fixes missing ')' and re-diff after long_list.cgi and describekeywords.cgi have
been templateized.
> It sends then only the text/html part (text/plain is not supported (yet?)).

text/plain is how the email sends, correct? :-)
> > It sends then only the text/html part (text/plain is not supported (yet?)).
> text/plain is how the email sends, correct? :-)
True. I missed this since it is only used in defparams.pl and thus easily
customizable. In this sense it is not true that only "Content-Type: text/plain"
appears if "encoding" is empty. I'm rather comfortable having a
   Content-Type: text/plain; format=flowed; charset=%encoding%
in the params' *mail settings, but I can also move this part
(mime, 8bit, content-type) to another perl function, if it is desired.
Attached patch Bigger patch for text/html (v7) (obsolete) — Splinter Review
minor changes to PutHTMLContentType (I confess I forgot to initialise a
variable to ''; plus: call Param('encoding') only once).
Attached patch Bigger patch for text/html (v8) (obsolete) — Splinter Review
Rediff after relogin.cgi and defparams.pl had been changed.
I think this would nicely complement bug 126456 (2.16 blocker/"Fix our error
handling").
Regarding this: Would it make sense to set $vars->{'header_done'} in the
PutHTMLContentType once bug 126456 is checked in or is this the wrong place to
do so?
Severity: normal → major
No, vars->{'header_done'} should be set only after the global/header template
has been printed. 

I'm still not convinced about the way you are doing things in this bug, but I
need more time to look at it to work out why. ;-)

Gerv
--- post_bug.cgi        2002/02/05 00:20:08     1.39
|+++ post_bug.cgi        2002/02/24 15:43:09
|-print "Content-type: text/html\n\n";
|+pPutHTMLContentType();
s/pP/P/

> I'm still not convinced about the way you are doing things in this bug, but I
> need more time to look at it to work out why. ;-)
Hmm. I thought it wasn't that bad ;-) As long as you can come up with something
else which sends the email and the HTML pages with the right charset, I'm fine
with that.
Comment on attachment 71202 [details] [diff] [review]
Bigger patch for text/html (v8)

This is the way I think it should work. The function should be called
SendHTTPHeader(), and take an array of strings, including a Content-Type.

It should print them all, in the order given, with \n separating, but if it
spots an HTML Content-Type, it should slyly insert the charset into it. It
prints \n\n at the end.

This seems to me to be a much cleaner interface, and it works for different
content-types too, and is extensible.

Gerv
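For illustration, the interface Gerv describes can be sketched like this (in Python rather than Bugzilla's Perl; the function and parameter names here are placeholders, not the eventual implementation): join the given header lines with newlines, slip the configured charset into any HTML Content-Type, and finish with the blank line that ends the HTTP header block.

```python
def send_http_header(headers, encoding="iso-8859-1"):
    """Join header lines with \\n, appending the configured charset
    to any HTML Content-Type, and end with the blank line that
    separates HTTP headers from the body."""
    if not headers:
        headers = ["Content-Type: text/html"]
    out = []
    for line in headers:
        if line.lower().startswith("content-type:") and "text/html" in line and encoding:
            line += "; charset=" + encoding
        out.append(line)
    return "\n".join(out) + "\n\n"
```

A Content-Disposition line, for instance, would pass through untouched, which is what makes the interface usable for other content types as well.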
Attachment #71202 - Flags: review-
Attached patch Bigger patch for text/html (v9) (obsolete) — Splinter Review
Fix pPut... error and rediff after xml.cgi and userprefs.cgi got checked in.
*** Bug 128609 has been marked as a duplicate of this bug. ***
altering the summary of this bug to more closely match what the patch is
actually accomplishing.
Summary: Bugzilla should send encoding ISO-8859-1 per default → Allow administrator to set charset encoding for pages and email
-needs work-

Goal:
- Provide new function "PutHTMLContentHeader" which sends the
content-transfer-encoding for HTML
- Provide option sending emails with content-transfer-encoding 8bit or
quoted-printable
(This can be set via the editparams.cgi)

Done:
- Options are in defparams.pl
- PutHTMLContentType is used.
- Default email setting uses MIME with %encoding% and %transportencoding%
- The email body is either sent as 8bit or quoted-printable

Todo:
- Honour RFC 2047 for the encoding of the header
- Use MIME encoding and other features for the other emails which presently
  are not affected by editparams.cgi
- Check whether we need to change something for 16-bit characters.
I fear that MIME::QuotedPrint doesn't do the right thing in this case.
- Do some clean up
- Testing: I haven't yet tested the changes between v9 and v10.

I'd be glad if someone could assure me that I'm on the right road.
burnus: if you disagree with my assessment of how this would most cleanly be
implemented (as written in comment #22), could you at least say why? :-)

Gerv
Attached patch v11/v1h: Encoding patch for HTML (obsolete) — Splinter Review
I split the two areas, mail and HTML output. This only contains the changes
needed for HTML and tries to address all issues given in comment #22.
The only difference is that a "SendHTTPHeader()" is equivalent to
SendHTTPHeader("Content-Type: text/html").

I think this patch is rather clean and independent of addressing the email
encoding. Checking this in first reduces the size of the more complicated
email patch. (Does someone know a lightweight perl implementation for the
encoding of email headers? I have a slight idea how to write it, but it is
going to be ugly and lengthy :-()

> burnus: if you disagree with my assessment of how this would most cleanly be
> implemented (as written in comment #22), could you at least say why? :-)
Well the reason is simple: I overlooked this comment :-(
Comment on attachment 72387 [details] [diff] [review]
v11/v1h: Encoding patch for HTML

>+# This sends a HTTP header
>+# It takes an list as argument and prints them \n separated

"a list as an argument" :-)

>+# If it finds "Content-Type: text/html" and the param "encoding" is set
>+#   it adds the charsetencoding
>+# If called without an argument it assumes that "Content-Type: text/html" is
>+#   ment.

"meant".

>+sub SendHTTPHeader(@){
>+  my $header = join("\n",@_);
>+  my $encoding = Param('encoding');
>+  if($header eq "") {
>+    $header = "Content-Type: text/html";
>+  }

$header ||= "Content-Type: text/html" is neater :-)


>+DefParam("encoding",
>+         "Character encoding used for the HTML documents. (This should match the encoding used by the database.)",
>+         "t",
>+         'iso-8859-1');
>+

Please default this to nothing. See long arguments in other bugs for the
reason.

Other than those nits, r=gerv :-)

Gerv
Attachment #72387 - Flags: review+
Fixed the issues which have been raised in comment 29.
Comment on attachment 72860 [details] [diff] [review]
72387: v12/v2h: Encoding patch for HTML

r=gerv.

Gerv
Attachment #72860 - Flags: review+
*** Bug 129646 has been marked as a duplicate of this bug. ***
Bug 129643 contains another patch to fix the content type issue; it removes the
content type prints rather than calling a function, and moves it into PutHeader.
I am sorry I didn't see this before duplicating work. Just wanted to give a
heads up regarding this patch on this bug.


The 72860 patch has gone stale, and no longer applies cleanly to HEAD. This is
mostly just a refresh. There are one or two new places, though, where people
are putting additional fields in the header; those bits should be scrutinized
for correctness by the responsible parties.
Also, shouldn't the target milestone for this bug be set to 2.16?
It's going to go stale again as soon as bug 84876 lands.  Everything email is
changing.  If everyone insists this is a showstopper I suppose we can put it in
2.16, but I certainly won't make it a blocker.  If it gets done before we
release, good, otherwise we'll have no qualms about releasing without it. 
Putting in 2.18 for now...  if it gets reviewed and checked in before then we'll
bump it up.
Target Milestone: --- → Bugzilla 2.18
It would also be good to set the charset on the output of xml.cgi; the current
patch only sets encoding on the html output from xml.cgi, which you get when no
bug numbers have been specified.

The output should probably specify the encoding in the <?xml?> PI at the start
of the output:
  <?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>

I originally suggested this in the comments on bug 105960, but it was suggested
that it was better to include it with this one. As it is, most XML parsers
won't handle output like:
  http://bugzilla.mozilla.org/xml.cgi?id=384

as it includes 8-bit characters but is not UTF-8 (XML files default to UTF-8 if
no encoding is given).
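The failure mode is easy to reproduce with any strict XML parser (a Python sketch; the sample document content is made up): without an encoding declaration the parser must assume UTF-8, so a raw Latin-1 byte makes the document ill-formed.

```python
import xml.etree.ElementTree as ET

# Declaration-less document containing the Latin-1 byte 0xE9 ("é").
latin1_doc = '<bug><summary>r\u00e9sum\u00e9</summary></bug>'.encode("iso-8859-1")

try:
    ET.fromstring(latin1_doc)    # parser assumes UTF-8...
    parsed_without_decl = True
except ET.ParseError:
    parsed_without_decl = False  # ...and rejects the stray 0xE9 byte

# The same bytes parse fine once the PI declares the real encoding.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_doc
summary = ET.fromstring(declared).find("summary").text
```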
*** Bug 152190 has been marked as a duplicate of this bug. ***
I've refreshed this again for the benefit of people who will be running 2.16,
as well as for the benefit of anyone who wants to review it for inclusion in
2.17 when it opens (hint, hint).
Attachment #77394 - Attachment is obsolete: true
Blocks: 160096
*** Bug 160097 has been marked as a duplicate of this bug. ***
*** Bug 173227 has been marked as a duplicate of this bug. ***
Heads up, the default charset (for Mozilla?) in Redhat 8 is UTF-8 (unicode), so
ISO-8859-1 data entered into b.m.o has been showing up incorrectly for users of
that operating system.  Depending on the browser support, we may want to default
to UTF-8 instead of ISO-8859-1 for new installations.

We should also have a recommendation for existing installations about how to
migrate data from multiple other charsets into the one they want to use, if this
is even possible.
Note that CGI.pm enforces sending a charset on text/* responses.
When I mentioned that on IRC, timeless suggested that that was a bad idea,
because it disables autodetection. This is particularly important for
attachments, which may be testing something wrt autodetection in Mozilla.

That patch also makes supporting this a one-liner in Bugzilla::CGI - just add a
$self->charset(Param('charset')) call into the B::C constructor.

We can't convert existing stuff unless we know what charset it currently is, and
we don't.
> Note that CGI.pm enforces sending a charset on text/* responses.

Can we work around that by sending a bogus charset? Mozilla might still
autodetect if it doesn't recognise it.

Gerv
I'd really prefer to fix CGI.pm...
Once I was visiting a site in UTF-8. Then I opened Bugzilla and entered a
comment containing some accented characters. It was only when I received the
mail back from Bugzilla that I realised my Mozilla was in UTF-8 while I was
writing the comment, and therefore those accented characters turned out to be
nonsense. I thus had to re-enter the comment.

If Bugzilla had sent the content-type charset in the HTML header or HTTP
header, this wouldn't have happened.

Or even better, Bugzilla would use UTF-8 by default.
Right, but I don't know how well browsers like ns4 deal with utf-8.
With the latest N4 I had used (4.76, I think), it wouldn't switch to UTF-8.

But are there a lot of people using N4 to report Mozilla bugs? I don't know
if Bugzilla keeps a record of connection statistics. If it does, it would be
possible to know the percentage of charset-unaware browsers / total browsers
(counting only the different kinds of browsers used per logged-in user, but not
the number of times they access Bugzilla). If this is less than 10%, I think
it's safe to use UTF-8, because we can't wait forever for 0% to happen. Don't
you agree?
Oh, and judging by the comments at the bottom of
http://www.mysql.com/doc/en/Upgrading-from-3.23.html, MySQL doesn't support
utf8 encoding.

OTOH, I don't know if we need database support for this. It would be nice, and
may make searching a bit easier, but it's not essential, I think.
I have no clue what the rate is, but we do need to support NS4. If the only
side effect is that NS4 doesn't show non-ASCII characters, then I think we can
deal with that - there's really no other option.

A quick web search shows that some browsers do have issues with UTF-8 encoding,
although it appears that they're OK with the ASCII (or maybe Latin-1) character
sets. Hmm

Has anyone got any suggestions for:

a) how to store this in the db (remembering that mysql will then not work
correctly on string-based operations with non-ASCII chars)
b) how to convert existing data (pgsql has code within the db to do conversions.
We can use Encode, but only on perl 5.8)
c) whether supporting this fully should require perl 5.8 (I really really really
hope not)
d) what to do if we want character encoding X but the browser sends stuff which
isn't valid for that (and how we detect that case)
e) whether sending an (admin-defined) charset on all text/* documents will cause
problems compared to the current no-charset setting?
f) whether we should allow an admin-defined charset, or just handle everything
as utf8. This will probably make it much easier to deal with postgres, although
I don't know if DBD::Pg handles all that correctly - dkl?
g) Anything else?
It all depends on what you mean by support.

I use MySQL 3.23.x to store UTF-8 encoded data: Chinese, Japanese, Korean,
Arabic, and various Latin-script languages. There are a few things to keep in
mind:

1. Use Latin-1 as the database encoding.

2. Your char columns should be BINARY, since you don't want the server doing
case-insensitive string comparisons.

3. When doing wild-card searching and string manipulation you need to account
for the fact that a single character may take four bytes (i.e., four Latin-1
characters).

4. Collation will give you Unicode order.
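Point 3 is worth spelling out: when UTF-8 bytes sit in a Latin-1 column, the server's byte-oriented string functions see each byte as a separate "character". A sketch of the mismatch (Python, illustrative values only):

```python
text = "日本語"                # 3 characters to the application
stored = text.encode("utf-8")  # what a Latin-1/BINARY column actually holds
assert len(text) == 3
assert len(stored) == 9        # each of these CJK characters is 3 bytes

# A byte-wise SUBSTRING(col, 1, 2) would slice mid-character,
# leaving bytes that no longer decode to the original text:
fragment = stored[:2]
assert "\ufffd" in fragment.decode("utf-8", errors="replace")
```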
By 'support', I mean not having to do any of those things :)

(2) is the main one here, which we need to avoid.
*** Bug 179076 has been marked as a duplicate of this bug. ***
The other thing to note is that we don't really use many db text functions, so
the db support doesn't matter that much.

If the db doesn't support it, then doing stuff like substring searches based on
non-utf8 strings may return funny results. I'm personally OK with that...

Some of our quoting functions probably need to be updated to properly accept
non-utf8 input, mind you.

I've also changed my mind on making these columns 'BINARY', mainly because even
though that is really likely to break lots of stuff, it's the way other dbs do
stuff (because the sql spec says so), so we'll have to deal with it eventually
anyway.
Blocks: bz-russian
Blocks: 182975
Re comment 7: Dave, if the META tag is a problem, shouldn't it be removed from
Bugzilla's *.html files? (bug_status.html, bugwritinghelp.html,
confirmhelp.html, quicksearch*.html, votehelp.html)

When using our local Bugzilla with Mozilla 1.2* and 1.3a, we got problems with
wrong encodings and accidentally changed summary lines. Mozilla "randomly"
flips the default charset between ISO-8859-1 and UTF-8 (bug 148369, bug 158285,
bug 159295).

A <meta http-equiv= ... "...; iso-8859-1"> tag in
template/en/custom/global/header.html.tmpl worksforme as a quick fix.
Re comment 55:

This can't be done safely.  At least these files _are_ in ISO-8859-1.  What 
about localized Bugzillas?

What you suggest is a hack to work around Mozilla troubles, not a Bugzilla fix.
Right, which is why we should just say that everything is utf-8 and be done
with it?

It's a simple standard natively supported by perl.
Re comment 56:
No, the .html files are all plain ASCII, and should display correctly with any
"ASCII-derived" charset (iso-8859-whatever, utf-8, ...). So if the mysterious
NS4.x problem from comment 7 is an issue, the META tags should be removed (and
possibly "AddCharset iso-8859-1 .html" added to .htaccess for localized .html
files).

Yes, the META tag with charset in "global/header.html.tmpl" is a necessary fix
that makes localized Bugzillas usable with current Mozilla releases. And so it
is also an intermediate fix for this Bugzilla bug 126266 ;-)


Re comment 57: I agree. But then there should be some support for conversion of
existing localized bugzilla installations to utf-8.
As I understand it, we can't switch to UTF-8 until we drop support for NS 4.x.
Isn't that right?

Our current behaviour of not setting a charset is very useful because it allows
people to just start using Bugzilla in their language, and browser auto-detect
algorithms generally Do The Right Thing. I think Bugzilla should definitely
continue to ship with no default charset.

However, making it easier for admins to add one is perfectly reasonable, and
this patch looks like the right idea (although it would need to add charset to
more Content-Types than just text/html.)

Gerv
gerv: That works for html right up until two people with different charsets
comment on a single bug.

It also doesn't work for xml, which _must_ be given a charset.

If we set utf-8, then any browser will Do The Right Thing. (At least any
non-ancient one - I don't know how NS2 or IE2 will act, and I don't
particularly care....)

What problem does Netscape 4 have? The only comment in this bug is justdave's
mention that we can't use <meta>, but have to use Content-Type. That's OK with
me.

I mentioned that I don't know if NS4 will work, and that's true. Local testing
shows that it works for ASCII, though, and I'm not set up to try non-ASCII
stuff.

Using a single character set has the advantage that we can use the 'standard'
perl features on it. The problem with allowing an admin-settable charset is
that we have no way of testing if inputted data is correct. With utf-8, we can
use perl 5.8's native stuff, or simulate it under 5.6.
As a test, consider http://www.unicode.org/iuc/iuc10/x-utf8.html. I get missing
fonts (which display as ? in NS4, and as a glyph for the 4-digit codepoint
under Mozilla), but the stuff which I do have fonts for does display correctly.
I vote for universal UTF-8 encoding.
This would be a simple, safe and lasting solution.
UTF-8 everywhere would be nice.  The main thing I would like to see is xml.cgi
producing formal XML output.  Currently it doesn't set a charset in the <?xml?>
line, so standard XML parsing tools treat it as UTF-8.

This causes big problems for some bugs where non UTF-8 8-bit characters have
been used, making the xml parser return an error.  Being able to use standard
XML tools with bugzilla would be a very useful feature ...
Right, but we have to set some charset, and the current problem is that we
don't have one to use...
I was just pointing out xml.cgi as a reason why it is worth worrying about the
charset issue.

This is a case where we can't just leave it to the web browser's charset
detection heuristics (an XML parser is required to treat the current output of
xml.cgi as UTF-8, and then fail when it finds invalid UTF-8 data ...).

The two solutions are to allow the bugzilla administrator to set the charset (in
which case this setting should be used in xml.cgi's output as well), or decide
on a fixed encoding such as UTF-8.  Since a lot of people are moving toward
UTF-8, the second option is the one I would prefer (even though it is probably
more work in the short term).
I can imagine two possible scenarios:

a) Allow the administrator to set the character encoding he likes.
This would mean caring about encoding everywhere in the code, and I would
expect some work to be done now and many bugs to appear in the future (because
every programmer would have to think about the charset).

b) Set one fixed encoding, UTF-8, everywhere.
This might in the beginning be about the same amount of work as in the previous
case, but later on I would expect minimal bugs resulting from forgetting about
the encoding of a page.

That is why I prefer to use fixed UTF-8.
You don't have to convince *me* :)

That said, I have other Bugzilla stuff I'd prefer to do, so if someone wants to
take this, feel free. Any patch needs to come with a script which can convert
existing content, from at least ascii, utf-8 (ie no change),
iso-8859-{1,1+euro}, ISO-JP-{Mumble}. Anyone who has content in another
encoding can patch the script - I think that's likely to cover most of the
detectable content.

It needs to work on at least perl 5.6.1. Making it work on perl 5.6.0 would
probably avoid lots of other arguments, but I'm not sure if that has the
required support we need.
It may be useful to know that I believe Simon Cozens has written a number of
charset-conversion Perl modules, which you should investigate when writing any
conversion scripts.

Gerv
Yep, and Encode is standard with 5.8. Problem is that it requires 5.7.1, so....

Maybe we could have the script require 5.8 - it could easily be run offline, and
generate sql UPDATE statements rather than modify stuff directly.
Has anyone addressed the issue of how we deal with existing data in
many different encodings in the current Bugzilla? (Maybe this is not
the bug for dealing with that?)

I think I said this before, but if we are going to move to UTF-8 in
Bugzilla, then it needs to be done only for new data. The old
data should be sent without any charset info.

Strategically, it would be something like this:

1. Mark a transition date. 
2. After that date, all data would be handled as UTF-8 with
input and output processes accurately reflecting UTF-8.
3. Any data prior to that date should not bear any charset info.
==================================================================

By the way, I don't understand all the comments about Comm 4.x.
It certainly supports UTF-8. It does not have an automatic font/glyph-finding
mechanism as in Netscape 6/7 and IE.
What you need to do with Comm 4.x is for users to set a font that 
supports characters you want to display under Unicode. 

Edit | Prefs | Appearance | Fonts | For the encoding | Unicode

pick some fonts that have a lot of different lang characters like
Arial Unicode (for Win). CJK users can simply choose their native
fonts and that should work for most situations. 
momoi: we can't do that, because comments for the same bug can happen both
before and after. What we need to do is have a script to go through every
comment, and try to autodetect the charset.
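Such autodetection can only ever be heuristic, but a crude per-comment pass (a Python sketch covering only the common Western case, not the East Asian encodings discussed elsewhere in this bug) already goes a long way: bytes that decode as strict UTF-8 almost certainly are UTF-8, and everything else can be treated as Latin-1, which never fails to decode.

```python
def guess_charset(raw: bytes) -> str:
    """Heuristic: bytes that are valid UTF-8 are almost certainly UTF-8;
    anything else is assumed to be ISO-8859-1, which maps every byte to
    some character and so can never fail to decode."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"
```

A real conversion script would need additional tables for the multi-byte encodings mentioned later in this bug, since those are not distinguishable by decoding alone.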
See bug 182975 comment 3 for a previous mention of problems with NS4's form
submission when using UTF-8.  Not sure if that's covered by the workaround Kat
mentioned.

As for transition, I agree with Brad, we should convert all of the existing data
to UTF-8 as best as possible as part of the upgrade once this is included.

FYI, MySQL does not appear to support Unicode in any shape or form yet. 
According to their website, Unicode support is planned for version 4.1 (which
isn't out yet).  Sybase and Postgres, which are the other two databases we're on
the verge of supporting, both do, however.
Alias: bz-charset
Keywords: patch, review
OS: Linux → All
Priority: -- → P1
Hardware: PC → All
What about using the OS/user's default charset?
For example (assuming Solaris):
- if the default $LANG is C/POSIX or iso8859-1 we default to iso8859-1.
- if the default $LANG is ja_JP.UTF-8 we default to UTF-8
etc.
Regarding comment 73, selecting the database character set based on
the locale of the user running the Bugzilla instance or the MySQL instance
is not going to work: it is quite likely that a server running Bugzilla
lacks the locale setting of interest. Further, this is a rather advanced
setting: most people outside of Eastern Europe and Asia will be happy
with iso8859-1. People in Eastern Europe and Asia are already aware of
the character encoding used on their particular installation, out of
necessity.

The real problem is when you have a database that contains rows with 
multiple character sets, where the character sets used for each row are 
not tagged. Generally this shouldn't happen with Bugzilla and each 
database should be internally consistent. Of course I can imagine 
scenarios where it could actually occur.

In any event, the approach to take is to export the entire database to text, 
transcode the rows as appropriate, and import the database back. This is 
the approach recommended by Oracle through 9i, and was the method 
used by Amazon.com (for example) when they moved their databases to 
UTF-8.
I think I will agree on forcing the use of UTF-8. I would appreciate detailed
guidance on how I could implement the said solution in my environment.
Below is a description of my environment:

1) Debian GNU/Linux 2.2.20-idepci  ---> from #uname
2) Debian package of bugzilla (ver 2.14.2), sendmail and others
3) we changed it to Content-type: text/html; charset=euc-jp under
 /usr/lib/cgi-bin/bugzilla

If there is anything you would like to clarify, please let me know. Please
consider me a newbie trying to find my way around this kind of huge system.

Domo Arigato Gozaimasu (Thank you very much in Japanese)
*** Bug 188745 has been marked as a duplicate of this bug. ***
Comment on attachment 72387 [details] [diff] [review]
v11/v1h: Encoding patch for HTML

one year worth of bitrot...
Attachment #72387 - Flags: review-
Comment on attachment 72860 [details] [diff] [review]
72387: v12/v2h: Encoding patch for HTML

one year worth of bitrot...
Attachment #72860 - Flags: review-
Anyone have an up-to-date patch for this? This would be really cool to get in
fairly soon, even if it's optional and only available if you're using Sybase or
Postgres (since MySQL doesn't natively support utf-8).
Nothing stops you from storing UTF-8 in a MySQL database: I do it regularly and
it works fine. Supposedly MySQL 4.1 will include unicode support, but in the
meanwhile it would be nice to have this fixed so those of us using Unicode in
MySQL can get our users off our backs complaining about busted subject lines in
email. ;-)
Mozilla should render bug pages in standards mode, but it can't until this is
fixed. Bug 38856 supposedly fixed this.
The charset encoding has nothing to do with standards mode.
According to my understanding of the pre- and post-filing discussion of bug
196292, the absence of a charset declaration puts Mozilla in quirks mode.
'char' doesn't appear on that page. The official description is
http://www.mozilla.org/docs/web-developer/quirks/doctypes.html - it goes on
doctype, not charset.
And from the page Bradley mentioned, it says that a page will render in quirks
mode if it uses "The public identifier "-//W3C//DTD HTML 4.01 Transitional//EN",
without a system identifier.", which is why bug pages render in quirks mode.  It
is not related to the charset.
Hmm. I thought we had a system identifier. Oh well. Did mozilla's behaviour
change at some point, btw? The <img> in the <table> in the header got displayed
differently after we added in the doctype way back whenever it was we templatised.
Just confirming that comment 83 is incorrect.
*** Bug 202114 has been marked as a duplicate of this bug. ***
*** Bug 174340 has been marked as a duplicate of this bug. ***
For the record: Adding an encoding to the Content-Type header is now easier
since bug 201816 has landed. (A short usage is found in the documentation after
bug 201955 is in.)

This doesn't solve the general problem though (ISO-8859-1 vs. UTF-8 vs. other
encodings), and using correctly encoded mail headers ('to' and 'subject') and
mail bodies is still to be done.
It also doesn't actually make any of the data utf8 (you need tags on the <form>
for that), nor does it handle data currently in the system which is not utf8.
There's also validation (which will probably mean patches to CGI.pm), and some
other stuff too.
*** Bug 207960 has been marked as a duplicate of this bug. ***
Blocks: 135762
*** Bug 213864 has been marked as a duplicate of this bug. ***
*** Bug 219257 has been marked as a duplicate of this bug. ***
*** Bug 220066 has been marked as a duplicate of this bug. ***
*** Bug 221838 has been marked as a duplicate of this bug. ***
The problem still exists in 2.16.3, where the HTML pages use UTF-8 encoding.
Email goes out without the proper MIME header. The problem, however, is not as
simple as adding some header: several MUAs (Mail User Agents) cannot deal with
UTF encodings yet, so the preferable solution would be _reencoding_ the binary
message before sending it as email. For email, the ISO Latin charsets are
preferable, for Europe at least, and the encoding should be quoted-printable. Is
there a fix for the stable version already?
There are also several MUAs that can properly handle UTF-8 mail. The best
solution is probably to fix those that don't, instead of introducing a
reencoding workaround that reintroduces the original problem again.
mutt allows the user to specify a list of preferred character encodings for
sending email.  If the message can be encoded in the first encoding in the list,
it is used, then the second, etc.  (IIRC, the default value for that preference
is us-ascii, iso-8859-1, UTF-8.)  A similar solution would likely work well in
Bugzilla.
With regard to comment 100, the user needs to be aware that some encodings are
indistinguishable from each other without out-of-band information: all of the
ISO 8859-x encodings share the same encoding space, as do EUC-KR and EUC-CN,
even though they have very different character sets associated with them.
 
What does mutt do when the text it wants to send cannot be transcoded to a
character set/encoding in the user's list? Latin-1 cannot be transcoded to
Shift JIS, for example, without losing the accented characters. And is the
system aware of differences within character sets with a single name? On
Windows, GB2312 implies CP936, which has some 18K more characters than pure
GB2312 as would be found on Unix systems (in EUC-CN, for example).
 
And I assume that, at least internally, MUTT is pivoting through Unicode for 
this? Otherwise you end up with n**2 tables for n different encodings... 
 
I don't see the point of your out-of-band information comment.  Both email and
web pages can and should contain encoding information.

I don't know what mutt does if the message can't be encoded in any of the
encodings.  It would only be an issue if utf-8 were removed from the list.

Internally, mutt probably uses the relevant library functions such as iconv,
which I'm sure go through Unicode internally in some sense or another.
My comment about out-of-band data is that without knowing the language of the 
comment, you cannot necessarily know which of the 8859-x encodings is correct. 
If the comment is in Russian then you can trivially convert it to 8859-1 or 
8859-2 or 8859-6 or whatever, and not get mapping errors. 
 
Similarly, a comment in Chinese in EUC-CN can be transcoded to EUC-KR without 
an error, but resulting in absolute garbage. 
 
Finally, if all of the comments are in Unicode you need to know the language 
of the comment so you can pick the most appropriate legacy encoding to 
transcode to without trying them all... unless you only limit yourself to 
those the user indicates they can process. 
 
FWIW, I think giving the user a choice of encoding to transcode mail to prior 
to sending is a fine idea... it just needs to be handled the right way because 
you can get burned very quickly. 
 
I agree with comment #99. Bugzilla can't take care of everyone who still lives in
the stone age and uses broken email clients like Eudora. Nonetheless, a
reasonable fallback (as suggested) can be supported.

As for mutt, it does use iconv(3) and Unicode is the internal representation. 

re : comment #101
Once all textual data in bugzilla is converted to Unicode (at least for bugs
filed after D-day or bugs whose comments don't include non-ASCII characters up
to D-day), we exclusively deal with Unicode data so that there's no issue with
indistinguishability of legacy encodings.    

re : comment #100. It'd be great if the prioritized list of encodings could be
configurable per user. Or, at least, there should be an option to get emails
exclusively in UTF-8. I don't want to receive bug mails in ISO-8859-1 (although
I don't have any problem dealing with them with a patched version of Pine that
takes advantage of iconv(3)).
Re comment 103:  as comment 104 says, there should never be any guessing
involved, since we should know the encoding of all data.
My personal preference is to use ASCII by default (for email) and fall back on
UTF-8 if there are any non-ASCII characters present.  Don't even mess with
ISO-8859-1.  All web interaction will always be UTF-8.

Eudora 6 (for the Mac anyway) deals with Unicode just fine.  6.0 was the first
version that did, however.  5.x had problems with it in subject lines (it'd just
display the raw =?UTF-8?B?foobarbaz?= on the subject lines) but dealt with it
fine in message bodies.  Anyone still using 4.x or less needs to upgrade. :)

Seriously, anyone using an email client old enough to not support that kind of
stuff is pretty unlikely to be dealing with bugs that require it.  And everyone
supports ASCII :)  If someone is regularly dealing with internationalized data,
then they need to upgrade their software to handle it.
> My personal preference is to use ASCII by default (for email) and fall back on
> UTF-8 if there are any non-ASCII characters present.

How could you tell the difference between doing that, and just using UTF-8
exclusively? (Other than by looking at the headers, but that would only matter
if the UA did so itself, and if it does, then it's unlikely not to support UTF-8.)
My comments about needing to know the language so you can reliably transcode to
legacy encodings are only applicable if the suggestion in comment 100 (that
Bugzilla allow the user to specify a list of encodings they are willing to
accept) is adopted.
 
Encoding cannot be trusted to tell you what language a piece of text is in. I
can write perfectly normal German in 7-bit US-ASCII. Unicode doesn't help at
all, since you lose all language-related information. You cannot reliably
determine language based on what block a character or set of characters comes
from, except in relatively rare circumstances. Knowledge of language is
required to reliably transcode to a legacy character set.
 
If all data is in UTF-8, and no transcoding is required, then nothing needs to  
be done: US-ASCII just works. As soon as you start transcoding between 
encodings you need to know the language to do it right. 
 
For the record, Zippy's Bugzilla has been in production for almost 4 months now
with utf-8 specified as the charset in the headers using Perl 5.6.1 and we've
had no issues.  Note that it was a new installation that started from scratch
with utf-8, and not applying it to any legacy data.

I think it would be a piece of cake to make new Bugzilla installations use
utf-8.  Upgrading existing ones is going to be a can of worms though.
How about a feature to use UTF-8 for all bugs numbered <configurable number
here> and up?  That would at least allow a transition going forward.
> Knowledge of language is required to reliably transcode to a legacy character 
> set.

Why?
re comment #110: That's what I was saying in comment #104 and what Markus Kuhn
suggested more than a year ago in another bug. In addition to new bugs (with bug
# > N), old bugs with pure ASCII data at a certain D-day can be 'converted' to
UTF-8. To do so, we need to add a boolean field to each bug to indicate whether
it's carrying UTF-8 data (because the test on bug # doesn't work for them).

re comment #106 and Eudora: I meant to write about Eudora on Windows. For an
unknown reason, Eudora-Mac's I18N support has always been  ahead of Eudora-Win's
I18N support. I believe Eudora-Win still doesn't support UTF-8. Neither does it
allow users to choose character encoding for outgoing emails. No support for RFC
2047 header decoding (I'm not sure of the latest version).
  
BTW, there's one more to take care of in bugzilla's migration to UTF-8 only
world. I noticed that some Western Europeans had entered their names in bugzilla
account in ISO-8859-1 [1] (I have never seen non-Western Europeans enter their
names in the corresponding legacy encoding). Perhaps bugzilla-admin could scan
the account name field to see if there's any character outside US-ASCII. If
there is, send an email to account owners asking them to reenter their names
with View|Character Coding set to UTF-8.

[1] That's also the case of some xml/xul files (that are supposed to be in
UTF-8) in the mozilla source tree.  
We can convert existing data to UTF-8 with a reasonable degree of accuracy. We
can't get it perfectly correct, of course, but we should be able to come up
with something which is close, and which an admin could script given the
requirements of their userbase.
No, I wouldn't dare. In some bugs, a few different encodings (that are all but
impossible to distinguish from each other without human intervention especially
given that they're usually pretty short, which keep us from using any charset
detection based on statistics) are used. 
This would be done on a per-comment basis, not a per-bug basis. If you mix
encodings within a single comment, there's not much anyone can do.
How would you reliably determine the encoding on a per-comment basis without
human intervention? That's the whole point of Tom's comments. Mozilla's charset
detector or Basistech's similar product sometimes fail even for much longer
chunks of text than the usual length of bugzilla comments. Even 95% or 99%
accuracy is not good enough (not that I think you can reach even 90%) for our
purpose. Just leaving them alone is better if you can't get 100%.
If a single bug uses multiple character sets, then it doesn't matter if we screw 
them up, since no UA is going to ever show all the comments right at the same 
time anyway.
> it doesn't matter if we screw them up,

 It does matter. 

If we just leave *alone* old bugs with non-ASCII comments, we can change the
encoding manually to view them correctly (although not all of comments correctly
at the same time if multiple encodings are used in a single bug). However, if we
screw them up by the incorrect detection, it's a lot harder to view them. We
have to figure out not only the original encoding but also the wrong encoding
that the charset detector believed comments to be in.

Therefore, I'd stand by my comment #112 (dbaron's comment #110). If there is
someone who's got the infinite amount of freetime, (s)he may go comment by
comment and convert them to UTF-8 manually in the backend DB. 
Commercial encoding/language detectors can often do quite well with as little 
as 96 bytes: our encoding/language detector can achieve 98--99% accuracy in
its first guess on a buffer of this length, and is almost 100% accurate within 
the top two or three guesses with 64 bytes or more. In these cases you can 
take the top three candidates and convert from the hypothesized encoding to 
Unicode and see how many invalid characters you get in the conversion. The one 
with the fewest invalid characters wins. Or if there is a tie, you take the 
first guess. This can work pretty well, and has worked well for companies 
migrating gigabyte size (and larger) databases. However, these kinds of tools 
are expensive and not readily available: open sourced converters do not 
approach this level of accuracy. Detecting within a BZ database though is 
complicated by the fact that comments may include source code or similar 
'noise' that needs to be accounted for. 
 
In any event, as Jungshik says, multiple encodings can be displayed by the 
user manually switching character encodings on the page. I expect this is an 
issue for those using cyrillic (Russian, Ukrainian) where there are multiple 
competing encodings in regular use. 
 
In any event, this bug isn't the place to discuss UTF-8 migration issues. 
*** Bug 225291 has been marked as a duplicate of this bug. ***
If by comment 119 we're not going to discuss migration here (hooray), what stops
us from simply adding a charset Param() and getting on with the migration
problems when we try to apply it to existing Bugzillas?
*** Bug 226941 has been marked as a duplicate of this bug. ***
re: comment 121:

I agree.  Let's get on with it.  For this to go in right now, we need the
following behaviour:

- This is implemented as a Param.
- For new installations, the Param defaults to "utf-8".
- For upgrading installations just picking up this param, it defaults to being
disabled.

That will get around the problems caused by existing installations having to
migrate data, since it won't force them to migrate when this goes in.

We can then open another bug for migration paths, and encourage people to submit
theirs if they come up with one.
Regarding the patch in attachment 88599 [details] [diff] [review]: Is there a significant reason to write
$header =~ s#^Content-Type: text/html$#Content-Type: text/html; charset=$encoding#i;
instead of
$header .= "; charset=$encoding"; #?
Blocks: 229010
attachment 88599 [details] [diff] [review] doesn't seem to take care of 'bug mail'. If 'UTF-8' (or any
other charset) is set in a new installation, bug mails should have [1]

Content-Type: text/plain; charset=CHARSET  
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0 

Currently, bug mails don't have any of the above. When C-T and C-T-E headers are
missing, that is regarded as the RFC 822/RFC 2822 default, which is equivalent to 

Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
MIME-Version: 1.0

In addition, header fields (e.g.  Subject, From, etc) should be encoded per RFC
2047. [2] That is, you can't just send out raw 8bit characters in the message
header. Instead, they have to be encoded as following:


Subject: =?UTF-8?B?.........?= blahblah =?UTF-8?B?.....?=

Subject: =?ISO-8859-1?Q?His=20name=20is=20G=F6the?=

[1]
http://www.faqs.org/rfcs/rfc2822.html
http://www.faqs.org/rfcs/rfc2045.html
[2]
http://www.faqs.org/rfcs/rfc2047.html
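As an illustration of RFC 2047 encoded words (Python shown here for brevity; Bugzilla itself would use a Perl module such as MIME-Tools, and the sample subject is hypothetical), a non-ASCII Subject header can be encoded like this:

```python
# Encode a Subject containing non-ASCII text as RFC 2047 "encoded words"
# using Python's standard library. The subject string is just an example.
from email.header import Header

subject = Header('His name is Göthe', charset='iso-8859-1')
encoded = subject.encode()
# Produces something like: =?iso-8859-1?q?His_name_is_G=F6the?=
print(encoded)
```

A MIME-aware client decodes this back to the original text; a non-MIME-aware client shows the raw encoded word, which is the failure mode discussed above.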

Just in case it's not known, MIME-Tools at
http://www.zeegee.com/code/perl/MIME-tools/
can make it easy to deal with RFC 2047 header encoding as well as other
MIME-related issues. 

Encode module would be handy when we finally decide to migrate.
http://www.cpan.org/modules/by-category/13_Internationalization_Locale/Encode/
Well, these days, there are lots of encoding converters and Mozilla has one if
you build  intl/uconv/tests, but given that a large part of bugzilla is written
in Perl, Encode may have some advantages.
Sorry for spamming. I hadn't read comment #27. It has the following:

> Default email setting uses MIME with %encoding% and %transportencoding%
> The email body is either send as 8bit or quoted-printable

There should be an option to use 'C-T-E' of base64 for text/*. Another option
might be necessary to pick whichever is shorter after calculating the length of
the encoded result. 
There's a myth that Q-P is for text/* and Base64 is for binary (image/*,
audio/*, etc.). That's wrong. For CJK, Russian, Greek, Thai and other
non-Western European text, Base64 is more space-efficient and is not much worse
than Q-P in terms of 'human readability'. For CJK, Russian, Greek, Thai, and so
forth, '=A1=B0=C0!=20"=B1=AC' would be as cryptic as 'xerTgylkRt' if a
non-MIME-aware client is used. The same is true of the Q encoding and B
encoding in RFC 2047-style header encoding.

>  Honour RFC 2047 for the encoding of the header

 Yes, this is important and easy to do with MIME-Tools. 


>Check whether we need to change something for 16bit characters.
>I fear that MIME:QuotedPrint doesn't do the right thing in this case. 

  If '16bit characters' means supporting UTF-16(LE|BE), I guess you don't have
to worry. RFC (2)822 email messages are byte-oriented, so I don't think it's
possible to send non-byte-oriented data in text/* messages no matter how we
encode it. If your concern is that MIME::QuotedPrint splits multiple octets
representing a single character (in multibyte encodings such as UTF-8, GB2312,
Big5, EUC-KR, ISO-2022-JP) into two neighboring encoded words, that's indeed a
problem. My memory is not clear, but the last time I checked, it did the right
thing.
jshin: I just read your latest bugmail using |less|
as I have read my last several thousand bugmails.

base64 means I can't use less.

That would be a showstopper for me.

If we're going to do base64, I'd have to request that users be able to pref
against it.

I don't care if I can't read CJK bugmails content, I need to be able to quickly
scan for bug status changes and attachment creation and flag changes.

Note that if i were to actually visit bugzilla.mozilla.co.jp or whatever it is
and it were a current bugzilla (say 2.19.1), I'd request that my interface
be English instead of the Japanese default. Which means I'd expect to see
"Created an attachment" and:

jshin@mailaps.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jshin@mailaps.org

and ...

Even though the actual bugzilla might store Japanese internally.
Actually, it probably shouldn't really store strings for this stuff, since I
should be able to get those strings in Spanish when I load the bug later.
If the bugmail required base64, there's a 99% chance you wouldn't be able to
read it anyway, even if it was ascii+garbage, because it would all be garbage,
so you're not losing anything.  The trick is you don't convert it to base64
unless it contains 8-bit data.  It should be fairly easy to get a percentage of
8-bit data in the mail to be sent...  0% = use us-ascii.  < 30% = use Quoted
Printable.  >30% = use base64.

The way we have it set up on Zippy's bugzilla right now it just ships raw 8-bit
data in the body of the email (C-T-E: 8bit), but we mime-encode the headers
(Subject in particular) using MIME::Words if and only if they actually contain
8-bit data.  We probably should use QP or base64 on the content, but in our case
it didn't matter because the mail systems between the Bugzilla and the
recipients (who were all internal folks) were all 8-bit capable.
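The percentage heuristic proposed above can be sketched as follows (a hypothetical helper, not actual Bugzilla code; the thresholds are the ones from the comment):

```python
# Pick a Content-Transfer-Encoding from the share of 8-bit bytes in the body.
# 0% high bytes -> plain 7bit US-ASCII; under 30% -> quoted-printable stays
# mostly human-readable; 30% or more -> base64 is more compact.
def pick_cte(body: str) -> str:
    data = body.encode('utf-8')
    if not data:
        return '7bit'
    high = sum(1 for b in data if b > 0x7F)   # count 8-bit bytes
    ratio = high / len(data)
    if ratio == 0:
        return '7bit'
    if ratio < 0.30:
        return 'quoted-printable'
    return 'base64'
```

For a mostly-English comment with a stray accented name this picks QP, so the status-change lines stay greppable; a fully CJK comment goes to base64.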
The argument for always using quoted-printable is the following:
  suppose someone who can't deal with base64 gets bugmail containing a long
comment in a language that person can't read.  He doesn't really care about the
comment (or is forced not to, since he can't read it), but he does want to know
whether the bug was reassigned or its target milestone was changed.  If the
message is encoded using quoted-printable, he'll see reassignment or target
milestone changes.  If it's base64, he won't.
Well, I'm not a big fan of Base64/QP. They're just necessary evils. Most SMTP
'transports' are 8bit clean these days, but we can't be sure, which is why we
need the ESMTP negotiation mechanism (unfortunately, not many MTAs/MUAs do that).
Even if bugzilla sends out bug mails in 8bit C-T-E, an SMTP server somewhere in
the way can turn 8BITMIME to Base64 or QP if it determines that its  peer
doesn't support 8BITMIME ESMTP extension (see RFC 1652. sendmail 8.x does that
by default). All I want is that the door should not be shut for base64 if
somebody wants it.

As for using 'less' and non-MIME-aware email clients, QP is certainly better
than Base64, but still not as convenient as 8bit. If you really care about
it, you'd better set up a local procmail filter that converts incoming emails to
8BITMIME automatically. You can also filter your existing mailboxes through the
procmail filter. I've done that for years because I also do want to run 'grep',
'less' and friends on my mail boxes. Actually, I stopped doing it (for
single-part text/* messages) when sendmail 8.x began to support automatic
decoding of base64/qp to 8BITMIME before delivering incoming emails to mboxes.
If you don't control your MTA and can't run procmail on incoming emails, you can
still filter your local mailboxes (on disk) through it. Please don't claim that
people won't know how to do that: if someone uses grep, less, etc. to look for
something in her mailbox, she certainly can (and there are other tools that let
you do the equivalent).

Blocks: 110692
Simple question: why isn't the trivial fix (of adding ";
charset=UNICODE-1-1-UTF-8" to "Content-Type: text/html") applied for 2.16.5?
Ulrich: the Bugzilla Guide outlines why Bugzilla doesn't ship with a default
charset.

On a separate note, is it a Mozilla bug that this page now has fonts double the
size, and complains about Chinese text display? Or is it one of the comments?

Gerv
(In reply to comment #132)
> Simple question why isn't the trivial fix (of adding ";

Please, go through comments in this bug.

> charset=UNICODE-1-1-UTF-8" to "Content-Type: text/html") applied for 2.16.5?

The preferred MIME name is  not 'UNICODE-1-1-UTF-8' but 'UTF-8'
It is also necessary to add an attribute like accept-encoding="UTF-8" to all the
forms if you want web browsers to send data back in unicode.  Without that, they
may use the locale charset instead (even if the page is encoded in UTF-8).
IMO, I think it should be "accept-charset" instead of "accept-encoding".

http://www.w3.org/TR/html4/interact/forms.html#adef-accept-charset
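A minimal sketch of such a form (the action and field names here are illustrative, not Bugzilla's actual template):

```html
<!-- accept-charset asks the browser to submit the form data in UTF-8,
     regardless of the user's locale charset -->
<form method="post" action="process_bug.cgi" accept-charset="UTF-8">
  <input type="text" name="short_desc">
  <input type="submit" value="Commit">
</form>
```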
:: sigh ::  :)

This is very high on my list, btw, we'll have this in 2.20 or bust. :)
Whiteboard: i18n
Target Milestone: Bugzilla 2.18 → Bugzilla 2.20
BTW, improperly encoded headers are one criterion that filters like amavisd-new
use to detect spam.

This means if you use strict filters, they'll also eat some of your bugzilla
traffic (this is bad)

I vote for UTF-8 everywhere with database conversion on upgrade. Other encodings
might work better with old tools but variable encoding cause so many problems
everywhere it's not even fun to write about here.

One encoding to rule them all, period. Screw ancient tools - if they can't grok
UTF-8 that means they also need massive amounts of handholding everywhere to
work anyway.
(In reply to comment #138)
> I vote for UTF-8 everywhere with database conversion on upgrade.

And therein lies the problem.  As stated in earlier comments here, conversion of
an existing database is *almost* impossible, because we have no reliable and
accurate way to tell what the existing character sets (note plural) are.

Making this all work on a clean install is a cakewalk.  Upgrading existing
systems is a nightmare.
Summary: Allow administrator to set charset encoding for pages and email → Use UTF-8 (Unicode) charset encoding for pages and email
As an administrator for a db that would need conversion, I can tell you
incomplete conversion is the lesser evil there.

Undefined encoding is far worse even short-term (since more and more clients do
use UTF-8, so you end up with mixed encodings anyway)

Just treat all data as iso-8859-15 and convert to UTF-8 so there are no invalid
Unicode combinations left, that's all I ask. Some bugs will be mangled wrongly,
but they are already mangled in some circumstances *NOW* when viewed by people
with a different client than the original submitter
If we could vote, I would also vote for using UTF-8.

Come on, we're in the 21st century, 2004 to be exact, and the encoding problem
is still unsolved!

When the database is upgraded, is it possible to keep the old data?  Normally,
yes, right?  In this case, there's one solution but that might be quite long to
implement:
When someone notices that a comment can't display correctly, he could click a
link to mark it as badly converted.  You know, like the [reply] link that was
absent before some version (I don't know which exactly).  In order to avoid
sabotage, we could only allow the owner of the comment to do this.

One step further (even longer to implement):
he could be directed to a new page where the comment is fetched from the *old*
database and is displayed with a default charset, Latin-1, of course.  Then,
there's a selector with different encodings.  The user chooses (or tries) an
encoding and the selector trigger a submit to the server.  This time, the server
sends back the comment with the demanded encoding (actually, the server's job is
just writing a different metatag in HTML/HTTP header).  When the user is sure of
the encoding, he pushes the submit button and the server, with the chosen
encoding, convert the old comment to UTF-8 and put it in the database, replacing
the wrong one.
There is no reason to "vote" to use UTF-8.  It is pretty clear that that is the
preferred option of the bugzilla developers (just look at the title of the bug).

The problem is how to get from here (encodings not specified, database may
contain comments in a variety of different encodings) to there (all data in UTF-8).

Given that all comments for a bug are displayed on a single page, and people are
going to want to comment on bugs filed before any switch to a fixed encoding, it
will be necessary to convert the old data (having a new bugs vs. old bugs
distinction won't work and neither will separate databases/installations).

The conversion process will probably need to be customisable for a particular
installation though.  While treating existing data as ISO-8859-1 or ISO-8859-15
might work for most english or european installations, it would be incorrect for
a Japanese bugzilla installation.

Probably the right process is to check if each string is valid UTF-8.  If it is,
leave it.  If it isn't, convert it from the encoding the administrator thinks
the most of their content is in.

If there are other encodings where you can validate a string like you can with
UTF-8, the conversion process could be more sophisticated.  Unfortunately you
can't easily check whether a string is actually ISO-8859-1 or not because pretty
much any 8-bit string could be valid ISO-8859-1 (strictly speaking, strings
containing characters in the range 0x80 - 0x9F aren't valid ISO-8859-1, but many
Windows boxes will send back strings in Windows-1252 encoding when asked for
ISO-8859-1).
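The check-then-convert process described above can be sketched as follows (Python for brevity, not Bugzilla's Perl; the fallback charset is whatever the administrator picks, and iso-8859-1 here is just an assumption). Note that, as the comment says, the latin-1 fallback can never fail, so it silently "succeeds" even on data that was really in some other legacy encoding:

```python
# Keep strings that already decode as valid UTF-8, and transcode everything
# else from an admin-chosen legacy charset (assumed to be iso-8859-1 here).
def to_utf8(raw: bytes, fallback: str = 'iso-8859-1') -> str:
    try:
        return raw.decode('utf-8')   # already valid UTF-8: leave it alone
    except UnicodeDecodeError:
        # Every byte sequence is "valid" iso-8859-1, so this cannot fail,
        # which is exactly why the guess may be wrong for e.g. EUC-JP data.
        return raw.decode(fallback)
```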
Here's my suggestion (which probably doesn't say anything new). 

All Bugzillas have a "charset" param, used in all the appropriate places. All
new Bugzillas have this set to "UTF-8".

Whenever someone upgrades a Bugzilla from pre-this-change to post-this-change,
checksetup asks them for a charset. If they specify one, Bugzilla converts all
comments from that charset to UTF-8, and uses UTF-8 in the future. The list of
available charsets may be limited, if each one requires explicit support.

They also have the option of specifying no charset. If they do that, the param
is set to "", and Bugzilla continues to send no charset (i.e. it works exactly
as now.) UTF-8 and "" are the only two valid values for the charset after the
upgrade.

I'm sure that misses something. What? :-)

Gerv
For the record, there appear to be at least 15 unique character sets in use in
bug data on bugzilla.mozilla.org.
Picking "one" to convert from won't even come close to working.  The best option I can
think of is to tell the browser that everything is UTF-8 and provide a
"re-encode me" link next to items which we can detect aren't UTF-8 for people
who have specific privileges.  Clicking that link would then load the item all
by itself on a blank page with no charset set, and let their browser auto-detect
the charset, and let the user confirm before submitting to re-encode it as
UTF-8.  Then let people fix the stuff they think is important.  I have a
prototype of this form submission on landfill somewhere at the moment, I was
playing with that idea at one point.
Ick.

Can we tell (using JS) what charset the browser has auto-detected? Or would
people need to look in Page Info and then choose from a list?

How do we detect those items which aren't UTF-8? That is to say, how does the
auto-conversion script detect which items to convert and which to flag for
"manual" conversion? 

Is there any way to do this without (re-)writing Mozilla's code to do charset
detection for small volumes of text?

> Picking "one" to convert from won't even come close to working.

Depends what you mean by "close". Picking ISO-8859-15 would almost certainly do
99% of comments...

Gerv
(In reply to comment #145)
> Can we tell (using JS) what charset the browser has auto-detected? Or would
> people need to look in Page Info and then choose from a list?

document.characterSet

> How do we detect those items which aren't UTF-8? That is to say, how does the
> auto-conversion script detect which items to convert and which to flag for
> "manual" conversion? 

(I guess it checks if the item is a valid UTF-8 sequence?)

> Depends what you mean by "close". Picking ISO-8859-15 would almost certainly do
> 99% of comments...

well, for 99% of the comments, US-ASCII will suffice...
I'm also confused as to why you think ISO-8859-15 is more common than -8859-1.

And 99% of comments still leaves many thousands of comments that will become 
largely unreadable, which is a regression from the current state (which involves 
the UA guessing at the encoding instead of being forced to use the wrong one).

justdave's idea seems like the best so far.
I know that someone else has already suggested the following solution but I
can't find it at the moment. I think that this would be the best trade-off
solution for new/legacy databases:

Add a utf8 flag to each bug. All new bugs automatically have the utf8 flag set.
Any bugs which have the utf8 flag set have all of the appropriate HTTP/HTML
headers to specify the charset encoding for the bug and form as UTF-8. All other
bugs are displayed as currently (no charset). This sorts out the standard
bugzilla installation and upgrades and is the only thing needed to be done
officially by bugzilla (IMHO).

Then, if someone wants to convert existing bugs to UTF-8, this can be done
via a 3rd-party tool. E.g. someone creates a utility which can be run on a
bugzilla database which (1) tests existing bugs for US-ASCII-only text and, if
so, sets the utf8 flag, and (2) converts other charsets to UTF-8 from detected
or specified charsets if desired.
Here's what the prototype script I have on landfill does...

It runs a quick Perl regexp to see if there's any 8-bit data in it.  If there
isn't, then consider it to be ASCII (which is a subset of UTF-8, and thus safe).

If there is 8-bit data, then we use Encode::decode_utf8() to test whether it's
valid UTF-8 or not.  This isn't 100% accurate, but it's probably better than
99%.  This does mean a minimum requirement of Perl 5.8.0 however.

<form method="post">
<input type="hidden" name="action" value="update-comment">
<input type="hidden" name="bug_id" value="[% bug_id %]">
Comment: <textarea name="comment">[% comment FILTER html %]</textarea><br>
Charset: <input type="text" id="charset" name="charset" value="foo"><br>
<script>document.getElementById("charset").value =
document.characterSet;</script><br>
<input type="submit" value="Update">
</form>
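In Python terms (the prototype itself is Perl; this is just a sketch of the same detection steps, not the landfill script), the classification looks like:

```python
# Pure 7-bit data is ASCII, which is a subset of UTF-8 and thus safe as-is.
# Otherwise, test whether the bytes decode as strict UTF-8; if not, the
# comment is in some unknown legacy encoding and needs manual re-encoding.
def classify(raw: bytes) -> str:
    if all(b < 0x80 for b in raw):
        return 'ascii'
    try:
        raw.decode('utf-8')
        return 'utf-8'     # decodes cleanly; as noted above, ~99% reliable
    except UnicodeDecodeError:
        return 'unknown'   # 8-bit data in some legacy encoding
```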
> I'm also confused as to why you think ISO-8859-15 is more common than -8859-1.

Because it's in practice a superset (although I doubt there's many Euro
characters in Bugzilla, so the difference doesn't really matter.)

Brodie: your idea might work if single bugs were the only thing Bugzilla
displays, but it doesn't. It displays multiple bugs at once, and buglists, and...

Gerv
iso-8859-15 is more than iso-8859-1 + euro.

There are other new characters in there that *are* used in western europe.

(The characters they replace, OTOH, can safely be said to be *very* unusual -
that's why they were nuked when ISO defined -15.)
(In reply to comment #142)
> There is no reason to "vote" to use UTF-8.  It is pretty clear that that is the
> preferred option of the bugzilla developers (just look at the title of the bug).

  My comment is, in fact, in reply to comment #139, which seems to mean we have
to postpone the database upgrade because we still can't figure out how to do a
*complete* conversion, and thus the database still remains without an encoding.

  Maybe I got it wrong, but my "vote" to use UTF-8 means that we have to do
it as early as possible.  The reason is simple:
the later we do the conversion, the more loss we're going to get.

> While treating existing data as ISO-8859-1 or ISO-8859-15
> might work for most english or european installations, it would be incorrect for
> a Japanese bugzilla installation.

  Sure, but there's no solution to this.  We have to be determined or we'll
never get out of it.  As I've written before, it's better to lose what we got in
the past than to also lose what we're going to get in the future.

> Probably the right process is to check if each string is valid UTF-8.  If it is,
> leave it.  If it isn't, convert it from the encoding the administrator thinks
> the most of their content is in.

  This is infeasible.  Taking bugzilla as an example, it has more than 200000
bugs.  How could he check whether most of the content is in one encoding or
another?  You're not supposing he's going to read every bug, are you?
(In reply to comment #147)
> I'm also confused as to why you think ISO-8859-15 is more common than -8859-1.
> 
> And 99% of comments still leaves many thousands of comments that will become 
> largely unreadable, which is a regression from the current state (which involves 
> the UA guessing at the encoding instead of being forced to use the wrong one).
> 
> justdave's idea seems like the best so far.

  That was my idea :)
(In reply to comment #148)
> Then, if someone wants to convert existing bugs to UTF-8, then this can be done
> via an 3rd party tool. e.g. someone creates a utility which can be run on a
> bugzilla database which (1) tests existing bugs for US-ASCII only text and if so
> sets the utf8 flag, (2) convert other charsets to UTF-8 from detected or
> specified charsets if desired.

   An (existing) bug can have comments in different encodings (i18n related
bugs).  The detection/conversion should be done with respect to every comment,
not to the whole bug.
What about Encode::Guess, for more excitement. We can't get 100%, but we can
probably get close enough.

What about the other issues I've raised, such as database searching support and
forms on older browsers?
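A rough Python analog of the Encode::Guess idea, trying a list of candidate charsets in order (the candidate order here is an assumption; a real migration would pick candidates per installation):

```python
def guess_decode(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (charset, text) for the first candidate that decodes cleanly.

    Strict decoders reject bytes that are invalid for them, so ordering
    candidates from strictest to loosest gives a usable (if imperfect)
    guess.  latin-1 accepts every byte, so it acts as the final fallback.
    """
    for charset in candidates:
        try:
            return charset, raw.decode(charset)
        except UnicodeDecodeError:
            continue
    return None, None

print(guess_decode(b"caf\xc3\xa9")[0])  # utf-8
print(guess_decode(b"caf\xe9")[0])      # cp1252
print(guess_decode(b"\x81")[0])         # latin-1 (0x81 is undefined in cp1252)
```

This is exactly why it can't reach 100%: a latin-1 byte sequence that happens to be valid UTF-8 will be misclassified.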
Re: ISO-8859-1 and ISO-8859-15

Windows-1252 is a proper superset of ISO-8859-1 and more commonly emitted by
browsers than ISO-8859-1. It is the safest guess of the three.
"commonly emitted by browsers than ISO-8859-15" I meant
While debating what to do with existing installations of bugzilla, can this
please be implemented so that all *new* installations will at least be UTF-8.
Then there will not be ever-increasing numbers of users who need to go through
the pain of this.
> Windows-1252 is a proper superset of ISO-8859-1

How is this true? Aren't both 8-bit encodings with valid glyphs for all octets?

Gerv
(In reply to comment #159)
> > Windows-1252 is a proper superset of ISO-8859-1
> 
> How is this true? Aren't both 8-bit encodings with valid glyphs for all octets?

 The C1 block of ISO-8859-1 (0x80 - 0x9f) doesn't have any graphic characters, as
its name implies. It's for control characters (usually with no visible
representation). Nonetheless, they're characters, so strictly speaking
ISO-8859-1 is NOT a proper subset of Windows-1252. However, practically, it can
be thought of that way. Anyway, that's quite off-topic here. We should do
something about this bug soon and, especially, I agree with comment #158.

(In reply to comment #159)
> > Windows-1252 is a proper superset of ISO-8859-1
> 
> How is this true? Aren't both 8-bit encodings with valid glyphs for all octets?

Windows-1252 includes printable characters in the C1 block that ISO-8859-(1,15)
does not.
(In reply to comment #159)
> > Windows-1252 is a proper superset of ISO-8859-1
> 
> How is this true?

It isn't.

> Aren't both 8-bit encodings with valid glyphs for all octets?

They aren't. For example 0x81 isn't defined in Windows-1252.
I knowingly ignored control characters that are of no practical interest here.
As far as *printable* characters are concerned, Windows-1252 is a proper
superset of ISO-8859-1. For user-entered bug data it makes sense to consider
characters in the 0x80-0x9f range as printable characters from the Windows
counterpart of a particular ISO encoding.
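The C1-range difference discussed in the last few comments is easy to demonstrate; a small Python illustration:

```python
# Bytes 0x80-0x9f are control characters in ISO-8859-1 but mostly
# printable punctuation in Windows-1252.
curly = bytes([0x93, 0x94, 0x85])  # cp1252: curly quotes and ellipsis
assert curly.decode("cp1252") == "\u201c\u201d\u2026"
assert curly.decode("latin-1") == "\x93\x94\x85"  # C1 controls, no glyphs

# ...but 0x81 (like 0x8d, 0x8f, 0x90, 0x9d) is unassigned even in cp1252:
try:
    bytes([0x81]).decode("cp1252")
except UnicodeDecodeError:
    print("0x81 is undefined in Windows-1252")
```

So for user-entered text, decoding legacy 8-bit data as cp1252 recovers strictly more printable characters than latin-1, which is the argument being made above.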
I feel it would be useful to split this bug into two - one bug for implementing 
the optional charset headers and one for making the data of existing databases 
match it. 

The implementation as I see it is optionally sending a user defined character 
encoding for all pages, comments, emails, etc. This would be set to UTF-8 by 
default in new installs and turned off by default in upgrades (no header = same 
behaviour as now). This problem seems to be quite well defined and 
implementable now as evidenced by the patches. 

The other bug, for the problem of upgrading existing installs to use that 
character set, has already had a lot of discussion and ultimately just needs 
someone to try implementing a solution to nail down the real problems.
*** Bug 256665 has been marked as a duplicate of this bug. ***
Attached patch updated patch (obsolete) — Splinter Review
ok, here's a different patch:

  - updated against the tip
  - uses CGI's charset() method, which simplifies things a lot
  - charset is always utf-8, so the param is now a boolean
  - param defaults to false (for existing installs), with checksetup setting
    it to true for new installs

todo:

  - encode email headers as per rfc 2047
  - encode email body as base64 (depending on the percentage of non-ASCII chars)
  - set accept-encoding attribute on all forms (template function?)
  - add test to ensure accept-encoding attribute is always present

for a new bug (imho):

  - migration tools for existing installs
How about splitting the email bits out into a separate bug to make it easier to
get traction on this one?
here's *preliminary* email encoding, with the following issues:

  - not applied to all instances of sendmail
  - need to move the call to Param
  - doesn't encode to/from/reply-to
  - needs loads of testing :)

but it's a start.
Attachment #70148 - Attachment is obsolete: true
Attachment #70262 - Attachment is obsolete: true
Attachment #70270 - Attachment is obsolete: true
Attachment #70271 - Attachment is obsolete: true
Attachment #70410 - Attachment is obsolete: true
Attachment #70804 - Attachment is obsolete: true
Attachment #71202 - Attachment is obsolete: true
Attachment #71859 - Attachment is obsolete: true
Attachment #72246 - Attachment is obsolete: true
Attachment #72387 - Attachment is obsolete: true
Attachment #72860 - Attachment is obsolete: true
Attachment #88599 - Attachment is obsolete: true
Attachment #158736 - Attachment is obsolete: true
I have to change the charset to UTF-8 every time I want to read a bug report from
bugzilla.mozilla.org which contains German umlauts and similar characters or
Cyrillic chars, because no charset is specified and the default is ISO-8859-1
(which is almost right for other pages). Only with UTF-8 are the reports
displayed correctly. So it seems to me that the bugzilla.mozilla.org data ARE in
UTF-8 already.
Regarding comment #169: The problem is worse than that: It's not just a display
problem, but an encoding problem: When using Mozilla (not MS-IE), form
submissions will also not be in UTF-8. Thus the characters are actually stored
with the wrong encoding in bugzilla. This applies to attachments as well. In my
bugzilla I have bug reports with mixed encodings. I think we don't want mixed
encodings (one encoding per comment or attachment or input field), right?
No longer blocks: 266658
(In reply to comment #169)
> I have to change the charset to UTF-8 every time I want to read a bug report from
> [deleted]
> displayed correctly. So it seems to me that the bugzilla.mozilla.org data ARE in
> UTF-8 already.

   I'm not so sure.  Maybe those reports were made by people whose browsers'
encoding was set to UTF-8, by chance?

   I've just made a "bug", actually a test-bed, in bug 266658 in order to avoid
pollution to this bug which is mainly for discussion.

   Everybody, when you want to test in that bug, please write as much info as
possible, like the encoding of your browser when you write the test, and in what
language you're writing.
There's nothing to test. The problem is well known. Bugzilla comments are in
various encodings - UTF-8, ISO-8859-1, GB2312, Shift_JIS, KOI8-R, ISO-8859-2,
EUC-KR, Windows-1251, etc because mozilla.org's bugzilla does NOT specify its
encoding in the HTTP header and html meta tag. What encoding is used in a particular
comment is determined by the encoding selected in View | Encoding at the time of
posting. To reduce the work required when it's finally decided to move to UTF-8,
everybody IS strongly encouraged to set View|Encoding to UTF-8 before posting
any non-ASCII comments.

In case of attachment, you can explicitly specify the encoding like 'text/html;
charset=XXX' or 'text/plain; charset=YYY' so that there's nothing to worry about. 
*** Bug 44343 has been marked as a duplicate of this bug. ***
I am concerned about these patches forcing new installs of bugzilla into UTF-8. I
need to be able to file bugs containing pound sign and euro symbols which are
only available in ISO-8859-15. If a parameter is added it should be to specify a
charset for the installation rather than a boolean switch between "UTF-8" or
"let each bug be different".

Also, how does this affect the ctype=xml query string argument? At present if I
include a pound sign in a bug report, the page generates invalid XML, as there
is no ISO-8859-15 encoding declaration on the XML, so most parsers default to
UTF-8 and see the file as invalid (Firefox displays the document with a ?
graphic where the pound sign is, IE with a parse error).

Finally, there is the importxml.pl script. In version 2.9.1, if I supply it an
XML file in UTF-8 that contains a pound sign the expat XML parser correctly
rejects the file as invalid XML. If I supply it an XML file in ISO-8859-1, it
accepts the file but enters the pound sign into the mysql database with an extra
character preceding it (an A with an accent character above it). I'm assuming
this is because importxml.pl assumes the data to be UTF-8 and pays no attention
to the actual encoding directive in the file?

(In reply to comment #174)
> I am concerned about these patches forcing new installs of bugzilla into UTF-8. I
> need to be able to file bugs containing pound sign and euro symbols which are
> only available in ISO-8859-15. If a parameter is added it should be to specify a
> charset for the installation rather than a boolean switch between "UTF-8" or
> "let each bug be different".

Please inform yourself about the things you are talking about before posting in
a bug report. UTF-8 supports _all_ characters of ISO-8859-15 and many more;
characters above code point 127 just happen to be encoded in two or more bytes
rather than one. Currently we have no encoding whatsoever defined in bugzilla;
that means basically every character above code point 127 that you use in a bug
report is at best ill-defined, at worst undefined garbage.
UTF-8 makes it possible to support most characters used in any country
without having to support setting possibly differing charsets on individual bugs.

That means that basically no non-ASCII characters can become invalid by moving
to UTF-8, as they are undefined and unsupported now, but almost all possible
ones will become valid when switching to UTF-8.
*** Bug 275377 has been marked as a duplicate of this bug. ***
glob - You might as well request review, so that we can get some action on this
bug. At least we can comment on what we think might be wrong with it.
Assignee: justdave → bugzilla
I don't know where this bug is going, but let me repeat one of the two points
from the initial report:
"Presently the bugzilla webpages don't contain an encoding header."
Doing a "grep header *.pl *.cgi", I see many occurrences of "print
$cgi->header();". There's nothing wrong with it, except that the CGI.pm
documentation says: "The -charset parameter can be used to control the character
set sent to the browser.  If not provided, defaults to ISO-8859-1."

Currently Firefox 1.0 (just to name one example) still thinks pages are
ISO-8859-1, and I'll have to switch the charset for every page I visit to UTF-8.
Well someone stated bugzilla uses UTF-8 internally, and exclusively. But please:
Why don't you tell the browser?

If this sounds easy to fix, would you please change those "$cgi->header()" to
"$cgi->header('-charset' => 'utf-8')" in the near future? Did I miss something
important? Sorry if I sound a bit impatient after almost two years...
Perhaps it would help if you read the bug. Especially the parts about legacy
content and such.
(In reply to comment #179)
> Perhaps it would help if you read the bug. Especially the parts about legacy
> content and such.

If you are starting with a new and empty bugzilla and (just for example) German
translation, all the "Umlaute" are mis-displayed. This has nothing to do with
legacy contents.

The longer you wait to decide on the right character encoding, the more "legacy
content" you will get. To make things worse (or maybe better) Microsoft's
Internet Explorer sets the page encoding (automatically?) to UTF-8, so even when
I enter the same words on the same machine with two different browsers, they
will use two different character encodings. This has nothing to do with legacy
contents.

Content is already misdisplayed _now_. Despite that, you could make the
character encoding a configurable parameter, so those who think that everything
is OK with ISO-8859-1 can leave everything as it is now.

To summarize: It is a bug to use UTF-8 character coding in the HTML when saying
the page is encoded as ISO-8859-1 in the HTTP header.
Blocks: bz-recode
Byron Jones: if your patch is ready for review then please request it. Try gerv
and justdave, as the previous attempted patches did.

+ created bug 280633 for upgrading existing installations so we can stop wasting
time in this bug on that subject and changed the summary appropriately.
Summary: Use UTF-8 (Unicode) charset encoding for pages and email → Use UTF-8 (Unicode) charset encoding for pages and email for new installations
Comment on attachment 158835 [details] [diff] [review]
utf-8 patch with initial email support

This is definitely a really good start.  It's bitrotted now though, and getting
it to affect all callouts to sendmail should be a piece of cake now that it's
all in one place anyway :)

Also of note is the TODO item for properly handling encoding of email addresses
(only encoding the real-name part and not the address itself), which probably
should be done before this goes in as it'll break things for people that have
non-ascii chars in their email address.

glob: any chance of an update?
Attachment #158835 - Flags: review-
(In reply to comment #182)
> (From update of attachment 158835 [details] [diff] [review] [edit])
> This is definitely a really good start.  
Great to hear that !

> (only encoding the real-name part and not the address itself), which probably
> should be done before this goes in as it'll break things for people that have
> non-ascii chars in their email address.

Non-ascii chars in their email address? You mean IDN in the domain-name part of
an email address? Internationalized domain names (I don't think there are many
bugzilla account holders with IDNs in their addresses) had better be converted
to punycode.
> You meant IDN in the domain-name part of an email address? 

no, as part of the real name.
eg. 
From: =?UTF-8?B?RnLDqWTDqXJpYyBCdWNsaW4=?= <lpsolit@gmail.com>

i think we should ignore IDNs for now.

i've started working on getting this patch up to date and in a more workable state.
(In reply to comment #184)
> > You meant IDN in the domain-name part of an email address? 
> 
> no, as part of the real name.
> eg. 
> From: =?UTF-8?B?RnLDqWTDqXJpYyBCdWNsaW4=?= <lpsolit@gmail.com>

So, the current patch RFC-2047-encodes the whole thing? If so, that definitely
needs to be fixed before landing it.
 
> i think we should ignore IDNs for now.

Yeah, that's all right. I've never seen any bugzilla account holder with an IDN
in her email address.
(In reply to comment #184)
> eg. 
> From: =?UTF-8?B?RnLDqWTDqXJpYyBCdWNsaW4=?= <lpsolit@gmail.com>

  Except make sure that you use quoted-printable (?UTF-8?Q?) instead of Base64
(?UTF-8?B?), if you do encode the header (the average mail client deals better
with quoted-printable than Base64). If we required perl 5.8, we could use Encode
(which is what I use at my installation in the Czech Republic).

  However, the internationalization of email header encoding is actually a
different bug: bug 110692.
> So, the current patch RFC-2047-encodes the whole thing? If so, that definitely
> needs to be fixed before landing it. 

no, the current patch doesn't do any encoding of email addresses.
that from line was one i picked at random from my mailbox.

> However, the internationalization of email header encoding is actually a
> different bug: bug 110692.

the current patch already encodes the subject, using UTF-8/quoted-printable.
Attached patch utf-8 v3 (obsolete) — Splinter Review
updated utf-8 patch.  this patch:

  - adds a boolean utf-8 parameter, which is enabled by default on new
    installs, and disabled by default on existing installs

if utf-8 is enabled :

  - page's charset is set to utf-8
  - encoding attribute added to xml pages
  - all emails are encoded:
      - email charset is set to utf-8
      - subjects are utf-8 quotedprint'ed if required
      - name component of email addresses is qp'ed if required
      - the body is encoded as:
	  - quoted-printable if less than 50% of characters require encoding
	  - base64 otherwise
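The body-encoding rule in the last bullet can be sketched like so (Python for illustration; the patch itself is Perl):

```python
import base64
import quopri

def choose_body_encoding(body: bytes):
    """Pick a Content-Transfer-Encoding for a mail body.

    Mirrors the rule above: quoted-printable when more than half the bytes
    are 7-bit clean (printable ASCII plus CR/LF), base64 otherwise.
    A fully 7-bit body is sent unencoded.
    """
    clean = sum(0x20 <= b <= 0x7e or b in (0x0a, 0x0d) for b in body)
    if clean == len(body):
        return "7bit", body
    if clean > len(body) / 2:
        return "quoted-printable", quopri.encodestring(body)
    return "base64", base64.encodebytes(body)
```

The rationale: quoted-printable keeps mostly-ASCII text human-readable in the raw message, while base64 is more compact once most bytes would need escaping.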
Attachment #116623 - Attachment is obsolete: true
Attachment #158835 - Attachment is obsolete: true
Attachment #173337 - Flags: review?
if that patch goes in, please do file a bug to switch the mail to multipart so
that the changes at the top can be qp even if the comments are base64.
I understand that this bug is not supposed to deal with legacy Bugzillas -
that's cool. However, to make it easier to fix that problem, could we do the
following?

Instead of having a "utf8" boolean, have a "charset" parameter, which is blank
by default and set to "utf8" on new installs. The behaviour is the same - it
just means that when we come to fix legacy installs, we have a mechanism whereby
users can choose a different charset which best matches their legacy data,
without rearranging the prefs again.

Gerv
(In reply to comment #190)

I think that's a good idea. It helps localized installations, too -- localized
templates often have *some* character encoding.
re: character encodings
We should be pushing UTF-8 as the only supported character encoding. This
simplifies everything - it allows all languages supported by Unicode to be
stored and displayed in bugzilla without changes. The same comment can be any
combination of languages. Supported databases only need to support UTF-8, not a
list of legacy encodings. If an existing legacy bugzilla installation is a
single character set then migrating it to UTF-8 is not difficult.

re: templates
All templates should be in UTF-8. There is no reason to use legacy encodings but
many reasons we shouldn't, e.g. for the reasons above, and for easy replacement
of UI languages. 

Think about a bug database where you wish to have multiple UI languages for the
same database. Japanese, Korean and English speakers all can read and write the
Japanese bug reports but prefer to have their own language for the UI. If legacy
encodings are used for the UI then it is not possible to enter the Japanese bugs
when using a Korean UI. (This isn't a theoretical problem; it was the situation
I had in my last company).
Brodie said everything I was about to say, which is what a lot of web-based
product developers must know (but they don't as can be seen in hotmail, yahoo
mail, etc).
OK, I guess the real question is: is there some technically feasible solution to
upgrading older Bugzillas to UTF-8, assuming the admin can tell us what charset
it's in now? Do the relevant Perl modules exist? Do we have any clue which data
would need massaging?

We don't have to _implement_ it (yet), just determine whether or not it exists.
If it doesn't exist, then the UI should have the ability to specify an alternate
charset, no matter how hard we push UTF-8 in the docs and defaults. If it does
exist, then we can have a "UTF-8 or nothing" switch like now.

Gerv
(In reply to comment #194)
> OK, I guess the real question is: is there some technically feasible solution to
> upgrading older Bugzillas to UTF-8, assuming the admin can tell us what charset
> it's in now? Do the relevant Perl modules exist? 

As you know well, we can't make that assumption about bugzilla.mozilla.org
because even in a single bug, multiple different encodings are used at
bugzilla.mozilla.org. (We should be able to come up with a few things to do for
the migration of bugzilla.mozilla.org, though.[1])  However, if there's such a
legacy installation with a single encoding used throughout, one can use 'Encode'
module. (http://search.cpan.org/~dankogai/Encode-2.09/Encode.pm)

> Do we have any clue which data would need massaging?

  Any textual data (needless to say, we shouldn't touch attachment even if it's
text/*) beyond ASCII (no sane person would have used 7bit encodings like
ISO-2022-JP or HZ for bugzilla). 
 

[1] Some of them are : 1) send emails to those with their names stored in
ISO-8859-1 (I've never seen anyone use non-ASCII characters in encodings other
than ISO-8859-1 for their names at bugzilla.mozilla.org) to update their account
info in UTF-8. 2) Begin to emit 'charset=UTF-8' for bugs filed after a certain
day.(say, 2005-03-01). Do the same for current bugs with ASCII characters alone
in their comments and title. 3) For existing bugs, add a very prominent warning
right above 'additional comments' text area that 'View | Character Encoding'
should be set to UTF-8 if one's comment includes non-ASCII characters. 4) If we
really want to migrate all existing bugs to UTF-8, add a button to each comment
to indicate the current character encoding. If necessary, this button can be
made available only to the select group of people knowledgable enough to
identify encodings reliably. 5) search/query may need some more tinkering...
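For a legacy installation that really did use a single known encoding throughout, the Encode-based conversion mentioned above is mechanically simple; a Python analog (Perl's Encode module does the same decode/encode dance):

```python
def recode_to_utf8(raw: bytes, legacy_charset: str) -> bytes:
    """Re-encode bytes from a single known legacy charset to UTF-8."""
    return raw.decode(legacy_charset).encode("utf-8")

# A latin-1 'café' becomes the two-byte UTF-8 sequence for é:
print(recode_to_utf8(b"caf\xe9", "latin-1"))  # b'caf\xc3\xa9'
```

The hard case, as this thread keeps noting, is that bugzilla.mozilla.org has no single legacy charset, so the charset argument would have to be guessed per comment.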
There are modules in Perl to detect and recode character sets. Technically this
is not a problem. Dave Miller has some comments on using the browser to detect
the charset and then recoding using Perl in bug 280633. He also seems to have
scrabbled together a proof of concept of this.

Remember also that this is implemented as an option. New databases get it by
default. Old databases in legacy encodings don't have to convert their database
to use newer versions of bugzilla. They can just leave the utf8 flag set false
and everything continues as normal. 

Of course they will probably get better results and user experience by recoding
to utf8, thus we have bug 280633. The possible migration features such as those
that Jungshik mentioned should be discussed there.
comments on patch 'utf-8 v3':

* desc => 'Use UniCode (UTF-8 character set)',

Unicode, not UniCode. 

As a better description, perhaps...
'Use UTF-8 (Unicode) encoding for all text in Bugzilla. New installations should
set this to true to avoid character encoding problems. Existing databases should
set this to true only after the data has been converted from existing legacy
character encodings to UTF-8 (see bug 280633).'

Other than that I don't know perl enough to review properly.
A while back someone proposed on IRC that we turned on UTF-8 for every bug ID
that was greater than a certain number. Although that isn't a perfect solution
(still mixed character encoding in the database) it would at least make all new
bugs and all comments on new bugs forward compatible.
Great idea, but it is a new feature so either a new bug or discuss it on bug
280633. Let's keep discussion here to just support in new installations so that
we at least get that functionality ASAP.
Summary: Use UTF-8 (Unicode) charset encoding for pages and email for new installations → Use UTF-8 (Unicode) charset encoding for pages and email for NEW installations
(In reply to comment #198)
> A while back someone proposed on IRC that we turned on UTF-8 for every bug ID
> that was greater than a certain number. 

That's not new :-) It was first proposed by Markus Kuhn in 2002(?) and I
repeated it a couple of times here (e.g. see comment #195 point #2 ). Anyway,
it'll be my last comment here on the migration. If there's anything new to add,
I'll add in bug 280633
Anne: that breaks things when you have content from multiple bugs on the same
page, such as buglists or longlist.cgi output.

I think a Bugzilla needs to be either UTF-8, or a specific charset, or "no
charset" (undefined, as now). Having it as > 1 specific charset, or part as some
charset and part as no charset sounds like a nightmare.

Gerv
(In reply to comment #201)
> Anne: that breaks things when you have content from multiple bugs on the same
> page, such as buglists or longlist.cgi output.

So does the current solution of allowing any encoding.

Sending pages that have content from multiple bugs as UTF-8 and sending all bugs
with ID > current bug number as UTF-8 seems like a reasonable start to me. 
We'll stop accumulating content of unknown encoding in new bugs, and there will
still be a way to view the content on the older bugs (by viewing the bug as its
own page) if there's some content that isn't UTF-8.
Hmm. this is a 'vicious' cycle that I have to break.  I'll add my response to
comment #201 in bug 280633
Comment on attachment 173337 [details] [diff] [review]
utf-8 v3


  Hey. I have only small comments, without doing some testing:

>+    if ($header !~/[^\x20-\x7E\x0A\x0D]/  and $body !~ /[^\x20-\x7E\x0A\x0D]/) {

  Yeah, make those regexes a function, like we talked about. :-)

>+    $head->mime_attr('content-type' => 'text/plain') unless defined $head->mime_attr('content-type');

  *nit* Just break this line and indent the "unless" four spaces (for a total
of 8 spaces).

>+    if (defined $subject && $subject =~ /[^\x20-\x7E\x0A\x0D]/) {

  Another place where the function would be cool. It's probably a
Bugzilla::Util function, really.

>+ foreach my $field (qw(from to cc reply-to)) {

  Other possible fields are Sender, X-Envelope-To, Errors-To, X-BeenThere, and
Return-Path. Usually, though, those don't have names in them.

>+            $value =~ s/[\r\n]+$//;

  I think that any given header should only have one line-ending, right? Unless
it's split across several lines, in which case you'd have to remove the line
terminators, which I think are semicolons for wrapped headers.

  If it had more than one line ending, it would be the end of the headers.

>+                if ($name =~ /[^\x20-\x7E\x0A\x0D]/) {

  Another good place for the function.

>+                    push @addresses, '=?UTF-8?Q?' . encode_qp($name) . '?= <' . $addr->address . '>';

  Names can have commas in them. Does this deal with that?

  Also, does this QP encode the entire name? That will make encoded emails a
mess in my Evolution, since it generally doesn't like encoded names. :-( I
could live with that, though.

  Also, the line is *slightly* too long, and needs to be split into two lines.

>+                    $changed = 1;
>+                } else {
>+                    push @addresses, $addr->format;

  Why do we even call format on the address, if we haven't changed it? Couldn't
we just output it as a raw string? (Or is there some other problem with that
that I'm not aware of?)

>+    if ($body !~ /[^\x20-\x7E\x0A\x0D]/) {
>+        # body is 7-bit clean, don't encode

  Just reverse the logic, instead of having an empty if.

>+    } else {
>+        # count number of 7-bit chars, and use quoted-printable if more
>+        # than half the message is 7-bit clean
>+        my $count = ($body =~ tr/\x20-\x7E\x0A\x0D//);
>+        if ($count > length($body) / 2) {
>+            $head->replace('Content-Transfer-Encoding', 'quoted-printable');
>+            $body = encode_qp($body);
>+        } else {
>+            $head->replace('Content-Transfer-Encoding', 'base64');
>+            $body = encode_base64($body);

  I'd want to test this with a few common mail clients, to make sure that they
can actually read Base64 bodies. I seem to recall that some can't, but they all
support QP. I'm not sure about that, though.

>+    $self->charset(Param('utf8') ? 'UTF-8' : '');

  I think that usually the charset is lowercase, in HTTP headers.

>-<?xml version="1.0" standalone="yes"?>
>+<?xml version="1.0" [% IF Param('utf8') %]encoding="UTF-8" [% END %]standalone="yes" ?>

  And the same, here. Although here I'm pretty sure it doesn't matter.
Attachment #173337 - Flags: review? → review-
Keywords: relnote
i'll do an updated patch when i have the chance, however i can answer a few of
your queries now:

> >+            $value =~ s/[\r\n]+$//;
> 
> I think that any given header should only have one line-ending, right? 

on windows, get() was returning the fields with CRLF, so chomp wasn't stripping
them.  looking at the code, maybe i don't need to do that anymore.  i'll have a
play.

> >+ push @addresses, '=?UTF-8?Q?' . encode_qp($name) . '?= <' . $addr->address
. '>';
> 
>   Names can have commas in them. Does this deal with that?

yes.  Mail::Address->parse() returns an array of addresses, splitting in the
correct location.

however i just realised that Mail::Address->name() flips the order of comma
separated names.  ie.  "jones, byron" becomes "byron jones".  i should be using
phrase().

> Also, does this QP encode the entire name? 

no, only characters that require QP'ing

> >+                    $changed = 1;
> >+                } else {
> >+                    push @addresses, $addr->format;
> 
> Why do we even call format on the address, if we haven't changed it? Couldn't
> we just output it as a raw string? (Or is there some other problem with that
> that I'm not aware of?)

there may be more than one address in the field, so we can't use the raw string.

> >+    $self->charset(Param('utf8') ? 'UTF-8' : '');
> 
> I think that usually the charset is lowercase, in HTTP headers.

ok, i'll make it lowercase

> >+<?xml version="1.0" [% IF Param('utf8') %]encoding="UTF-8" [% END
%]standalone="yes" ?>
> 
>   And the same, here. Although here I'm pretty sure it doesn't matter.

it's case insensitive, but the xml specs use uppercase, so that's what most
people use.
Status: NEW → ASSIGNED
Attached patch utf-8 v4 (obsolete) — Splinter Review
this version addresses issues raised.

notes:

in the parameter description i didn't want to include the bug number, as
that's more for the documentation, and it would be confusing if the local
bugzilla install had a bug number 280633

i've set MIME::Parser to not use temp files.  while the MIME::Parser docs
indicate there's a performance hit, as we only parse the header, the temporary
objects are always empty.

i've added "sender" and "errors-to" to the list of fields to encode email
addresses on.  the other two x- headers are added by mail servers, so there's
no reason to check them here.
Attachment #173337 - Attachment is obsolete: true
Attachment #173713 - Flags: review?
Would there be a performance gain from checking for 7-bit-clean text before
calling the encode_message function? Would that save copying the (header, body)
pair a number of times?

e.g.
    ($header, $body) = encode_message($header, $body) if Param('utf8');

becomes

    # make sure there's work to be done
    if (Param('utf8') and (!is_7bit_clean($header) or !is_7bit_clean($body))) {
        ($header, $body) = encode_message($header, $body);
    }
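A sketch of the proposed is_7bit_clean helper (Python for illustration; the real helper would live in Bugzilla::Util and use the same character class as the patch's regexes):

```python
import re

def is_7bit_clean(text: str) -> bool:
    """True when text contains only printable ASCII plus CR/LF.

    Mirrors the patch's /[^\x20-\x7E\x0A\x0D]/ test; note the class
    excludes tab, matching the regex used in the patch.
    """
    return re.search(r"[^\x20-\x7E\x0A\x0D]", text) is None

# Guarding the expensive call, as suggested above:
# if utf8_param and not (is_7bit_clean(header) and is_7bit_clean(body)):
#     header, body = encode_message(header, body)
print(is_7bit_clean("Subject: hello\r\n"))  # True
print(is_7bit_clean("h\u00e9llo"))          # False
```

Factoring the test out this way also gives checksetup a ready-made check for the admin name discussed in the next comments.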
The full name of the administrator user created by checksetup needs to be
converted to UTF-8 if this name contains non-ASCII chars.
Blocks: 280905
(In reply to comment #208)
> The full name of the administrator user created by checksetup needs to be 
> converted to UTF-8 if this name contains non-ASCII chars.

that's tricky as i can't tell what charset the console is running in.

how about i update checksetup to only allow 7-bit clean characters in the admin
name, with a comment saying that once bugzilla is running the name can be
updated via the webpages?
(In reply to comment #209)
> how about i update checksetup to only allow 7-bit clean characters in the admin
> name, with a comment saying that once bugzilla is running the name can be
> updated via the webpages?

  I think that's an acceptable solution. Just put the is_7bit_clean function in
Bugzilla::Util, and don't "require" Bugzilla::Util until you need it. (Don't
"use" it -- that will break checksetup. But you probably know that. :-))

  Of course, I think you can pull out the "locale" information from the console,
somehow. You could preserve that environment variable, the same way that we
currently preserve $ENV{'PATH'}. I'm not sure it would work on Win32, though.
why not use the utf8 perl support?

http://search.cpan.org/dist/perl/lib/utf8.pm
(In reply to comment #209)
> that's tricky as i can't tell what charset the console is running in.

nl_langinfo(CODESET) in C; surely perl has something like that too?
(In reply to comment #211)
> http://search.cpan.org/dist/perl/lib/utf8.pm

  That's a pragma to enable/disable utf8 support in the perl source code, and to
convert a standard perl string to being a perl string with utf8 encoding. It
doesn't recode anything. (Read the page.) So, it doesn't really have any use,
here. We're encoding in quoted-printable. That has nothing to do with the above
link.

  Finally, Encode.pm support (the built-in perl charset converter) is only
available in perl 5.8, and we require perl 5.6.
(In reply to comment #212)
> > that's tricky as i can't tell what charset the console is running in.
> nl_langinfo(CODESET) in C; surely perl has something like that too?

even if we could detect the charset, i'd have to worry about conversion from the
detected charset to utf8.  sure, there are modules that'll help, but it'd be a lot
of work for something with a trivial workaround.
Attachment #173713 - Flags: review?
Attached patch utf-8 v5 (obsolete) — Splinter Review
adds the administrator name checking to checksetup, and the optimisation
suggested by brodie.
Attachment #173713 - Attachment is obsolete: true
Attachment #174022 - Flags: review?
Note that this patch doesn't apply cleanly to BugMail.pm due to one line
changed in bug 280973.
Sorry for not providing an updated patch, but it's a bit hard to do on the setup
I'm on right now.
Attached patch utf-8 v6 (obsolete) — Splinter Review
fixed bitrot; thanks Håvard
Attachment #174022 - Attachment is obsolete: true
Attachment #174022 - Flags: review?
Attachment #174458 - Flags: review?
This patch does not apply cleanly for me against yesterday's CVS, there are
problems in checksetup.pl and BugMail.pm.
It seems to be simple bitrot issues, but I can't get at CVS from here to fix
them properly at the moment.
Attached patch utf-8 v7 (obsolete) — Splinter Review
bitrot fixes
Attachment #174458 - Attachment is obsolete: true
Attachment #176018 - Flags: review?
Attachment #174458 - Flags: review?
*** Bug 285255 has been marked as a duplicate of this bug. ***
*** Bug 279589 has been marked as a duplicate of this bug. ***
Blocks: 279589
No longer blocks: 279589
Flags: blocking2.20?
Hmm, this is damn close to ready to go, I don't want to lose it.
Flags: blocking2.20? → blocking2.20+
Comment on attachment 176018 [details] [diff] [review]
utf-8 v7

>Index: checksetup.pl
>+    # As it's a new install, enable UTF-8
>+    SetParam('utf8', 1);

I'm not sure if the new admin check is the best place to check this.  Lots of
folks upgrading from 2.16 or earlier are going to get nailed with this dialog
even when it's not a new install because they twiddled with the bits on their
admin account.  We should try to find some other way to ensure that it's a new
install.

>Index: Bugzilla/CGI.pm
>+    $self->charset(Param('utf8') ? 'utf-8' : '');

Nit: 'UTF-8' should be all uppercase in the header, to follow RFC 3629 section
8.  (technically the field isn't case-sensitive, but since it's defined that
way in the RFC we should follow it)

Rest of this looks good to me.  Find a better way to detect a new install and
this has an r+ from me.
Attachment #176018 - Flags: review? → review-
Attached patch utf-8 v8 (obsolete) — Splinter Review
improved "new install" detection -- if we have to create data/nomail, it's a
new install.
Attachment #176018 - Attachment is obsolete: true
Attachment #177578 - Flags: review?
Comment on attachment 177578 [details] [diff] [review]
utf-8 v8

ok, all code style and architecture nits addressed, actually tried testing it
now...

The summary encoding is not happening correctly.  I'm not sure what's wrong
with it, but Eudora is showing decoded summaries with an extra = on the end,
and Thunderbird is outright refusing to decode them.

Subject: =?UTF-8?Q?[Bug 579] This is a s=C3=BCmm=C3=A1ry= ?=
Attachment #177578 - Flags: review? → review-
We need to require MIME::Base64 v3.03 also.  MIME::Tools doesn't explicitly
prereq it, but it won't install due to test failures if you don't have at least
that version.  It'll be fewer tech support problems for us if we just outright
require it to save people from getting the install errors on MIME::Tools.
Attached patch utf-8 v9 (obsolete) — Splinter Review
fixes subject encoding
requires MIME::Base64 (version 3.01 on windows, 3.03 on unix)
Attachment #177578 - Attachment is obsolete: true
Attachment #177580 - Flags: review?
Comment on attachment 177580 [details] [diff] [review]
utf-8 v9

woot!
Attachment #177580 - Flags: review? → review+
Attached patch utf-8 v10 (obsolete) — Splinter Review
<justdave> hmmm.....
<justdave> actually, can we swap the order of MIME::Tools and MIME::Base64 in
the modules list?
<justdave> MIME::Base64 is the prereq and since people tend to work top to
bottom...
Attachment #177580 - Attachment is obsolete: true
Attachment #177581 - Flags: review?
Attachment #177581 - Flags: review? → review+
woot! woot!!
Flags: approval+
(In reply to comment #225)
> The summary encoding is not happening correctly.  I'm not sure what's wrong
> with it, but Eudora is showing decoded summaries with an extra = on the end,
> and Thunderbird is outright refusing to decode them.
> 
> Subject: =?UTF-8?Q?[Bug 579] This is a s=C3=BCmm=C3=A1ry= ?=

If my memory is correct, that form is not allowed to have spaces within it.  All
the words that are ASCII should be passed through outside the =?-escaped form,
and all the words that are not ASCII should be escaped separately.

For the details, see ftp://ftp.rfc-editor.org/in-notes/rfc2047.txt (which I
haven't really looked at while writing this comment; the above is from memory).
(or, alternatively, the spaces inside it could be escaped, but then you risk
hitting the 75-character limit)
bah!  so close! :)  but he's right....
Flags: approval+
Comment on attachment 177581 [details] [diff] [review]
utf-8 v10

r- per comment 231
Attachment #177581 - Flags: review+ → review-
Do we not have more XML outputs than just show.xml.tmpl which need the encoding
defined?

Surely the best test for a new install is if we are creating
localconfig/creating the database?

Gerv
> Do we not have more XML outputs than just show.xml.tmpl which need the
> encoding defined?

ahhh, you're correct.

  template/en/default/bug/show.xml.tmpl
  template/en/default/config.rdf.tmpl
  template/en/default/list/list.rdf.tmpl
  template/en/default/list/list.rss.tmpl
  template/en/default/reports/duplicates.rdf.tmpl

> Surely the best test for a new install is if we are creating localconfig

no, because when we create localconfig, data/params hasn't been created; it's
created in the second phase of first-time checksetup.

> creating the database?

i normally manually create an empty database before kicking off checksetup, as i
have to set the access permissions for the bugzilla account anyhow.  so i have
new installs with an existing, but empty, database.
> no, because when we create localconfig, data/params hasn't been created; it's
> created in the second phase of first-time checksetup.

Right then - so let's do it in the second phase, when we create data/params.

Gerv
i've hit a snag.

even if the header is folded correctly, Mail::Mailer strips \n\s* from the
lines, removing the folding, so lines can break rfc by exceeding the max length.

grr
spaces inside =?Q? must be encoded as an underscore (_), see
ftp://ftp.rfc-editor.org/in-notes/rfc2047.txt, 4.2(2)
Attached patch utf-8 v11 (obsolete) — Splinter Review
ok, this is probably the best i can do without rewriting a whole lot of other
modules.

this version encodes only the words in the subject that require encoding,
rather than the whole line.  this avoids any spaces issues, and makes the line
easier to wrap.

the sub normally generates a header that is wrapped at 75 characters, using
Mail::Headers's folding code.

however Mail::Mailer kindly strips \n's from the header, resulting in lines
that are longer than 75 characters being sent.

Mail::Header's folding code is very simple -- it breaks the line on whitespace
only.  thus even if Mail::Mailer didn't unfold the header lines, it's possible
(but unlikely) that we'll still generate >75 character lines.

so, here's a solution that appears to work but is not rfc compliant.

i've contacted the author of Mail::Mailer and Mail::Header, so another solution
may be to wait for these issues to be fixed upstream.
Attachment #177581 - Attachment is obsolete: true
Attachment #178110 - Flags: review?
Depends on: 287064
(In reply to comment #240)

I don't quite understand some parts:

+sub encode_qp_words($) {
+    my ($line) = (@_);
+
+    my $line = encode_qp($line, '');
+    $line =~ s/ /=20/g;

Shouldn't you replace SPC with '_'?

+    return "=?UTF-8?Q?$line?=";

Is this an unconditional return? Looks like it.
Will the rest ever be considered?

+    
+    my @encoded;
+    foreach my $word (split / /, $line) {

Are there any SPCs left?

+        if (!is_7bit_clean($word)) {
+            push @encoded, '=?UTF-8?Q?' . encode_qp($word, '') . '?=';
+        } else {
+            push @encoded, $word;
+        }
+    }
+    return join(' ', @encoded);
+}
Comment on attachment 178110 [details] [diff] [review]
utf-8 v11

> +    $line =~ s/ /=20/g;
> 
> Shouldn't you replace SPC with '_'?

the rfc allows for _ or =20 :

The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be represented as
"_" (underscore, ASCII 95.)

> +    return "=?UTF-8?Q?$line?=";
> 
> Is this an unconditional return? Looks like it.
> Will the rest ever be considered?

d'oh, those three lines are debug code and shouldn't be there.
thanks for pointing that out.
Attachment #178110 - Attachment is obsolete: true
Attachment #178110 - Flags: review?
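For illustration, the intended behaviour of that sub -- minus the stray debug lines -- can be sketched like this. This is a hypothetical Python rendering of the per-word 'Q' encoding under discussion, not the actual Bugzilla Perl; `b2a_qp(..., header=True)` applies the header variant of quoted-printable, which handles the space-to-underscore rule from RFC 2047 section 4.2:

```python
from binascii import b2a_qp

def is_7bit_clean(word: str) -> bool:
    # True if the word contains only ASCII characters
    return all(ord(ch) < 128 for ch in word)

def encode_qp_words(line: str) -> str:
    """Encode only the words that need it, as RFC 2047 'Q' encoded-words."""
    encoded = []
    for word in line.split(' '):
        if is_7bit_clean(word):
            # ASCII words pass through untouched, outside the =?...?= form
            encoded.append(word)
        else:
            # header=True: spaces become '_' and '_' itself becomes =5F
            qp = b2a_qp(word.encode('utf-8'), header=True).decode('ascii')
            encoded.append('=?UTF-8?Q?' + qp + '?=')
    return ' '.join(encoded)
```

Because only the non-ASCII words are wrapped, the spaces between words stay outside the encoded-words and the line remains easy to fold.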
Attached patch utf-8 v12 (obsolete) — Splinter Review
Mail::Mailer version 1.67 fixes the bugs that were stopping us from using it.

This patch bumps up the minimum version, and addresses the other outstanding
issues.
Attachment #179244 - Flags: review?
Blocks: 281522
Blocks: 287684
Blocks: 287682
note that it's still possible for us to generate emails with lines greater than
75 characters, if the subject doesn't contain any spaces we don't have a point
to wrap it at.

i know how to fix this, but it's a fair amount of work, so i'd prefer for that
to be covered in another bug.

note that the current bugzilla code can also generate >75 char lines, as there
are no checks in place to stop this ... for example if the url is too long, eg
"http://you-havent-visited-editparams.cgi-yet/userprefs.cgi" the "Configure
bugmail" line in the message footer will be more than 75 characters.
(In reply to comment #244)
> note that it's still possible for us to generate emails with lines greater than
> 75 characters, if the subject doesn't contain any spaces we don't have a point
> to wrap it at.

Why would you consider this to be a problem?

From RFC 2822:

2.1.1. Line Length Limits

   There are two limits that this standard places on the number of
   characters in a line. Each line of characters MUST be no more than
   998 characters, and SHOULD be no more than 78 characters, excluding
   the CRLF.

So IMO, you are following the spirit of the RFC and are wrapping when possible;
sometimes as you pointed out that is not possible.  

I would find the alternative of MIME encoding the subject lines that are
>75chars to be a much worse solution as I would have to assume that the only
mail agents still susceptible to being bit by the 78 character recommended limit
to be extremely old and therefore wouldn't understand MIME encoding anyway.
(In reply to comment #245)
> > [...] generate emails with lines greater than 75 characters

> Why would you consider this to be a problem?
> From RFC 2822:
[snip]

See RFC 2047, specifically (from its section 2):
   An 'encoded-word' may not be more than 75 characters long, including
   'charset', 'encoding', 'encoded-text', and delimiters.  If it is
   desirable to encode more text than will fit in an 'encoded-word' of
   75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may
   be used.

   While there is no limit to the length of a multiple-line header
   field, each line of a header field that contains one or more
   'encoded-word's is limited to 76 characters.


Current patch would fail to meet this requirement if someone creates a summary 
with too many consecutive non-spaces so that an 'encoded-word' longer than 75 
characters is created (which mail programs etc. may not recognise).

Solution (as described in the RFC) is to break up the text into smaller chunks 
creating multiple encoded-word entities each <= 75 characters, but this can 
just as well be done after this patch lands.
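The chunking the RFC describes can be sketched as follows -- a hypothetical illustration, not the patch under review. It accumulates characters (not bytes), so an encoded-word never ends partway through a multi-byte UTF-8 sequence:

```python
import base64

def encode_word_chunks(text: str, maxlen: int = 75) -> list:
    """Split text into multiple '=?UTF-8?B?...?=' encoded-words,
    each at most maxlen characters long."""
    overhead = len('=?UTF-8?B??=')   # 12 characters of framing
    words, chunk = [], ''
    for ch in text:
        candidate = chunk + ch
        b64 = base64.b64encode(candidate.encode('utf-8'))
        if chunk and overhead + len(b64) > maxlen:
            # adding ch would overflow: flush the current chunk
            words.append('=?UTF-8?B?%s?='
                         % base64.b64encode(chunk.encode('utf-8')).decode('ascii'))
            chunk = ch
        else:
            chunk = candidate
    if chunk:
        words.append('=?UTF-8?B?%s?='
                     % base64.b64encode(chunk.encode('utf-8')).decode('ascii'))
    return words
```

The resulting encoded-words would then be joined with CRLF SPACE when the header is emitted, as the RFC describes.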
"If it's not a regression from 2.18 and it's not a critical problem with
something that's already landed, let's push it off." - Dave
Flags: blocking2.20+
Whiteboard: i18n → i18n [wanted for 2.20]
Flags: blocking2.20-
(In reply to comment #247)
> "If it's not a regression from 2.18 and it's not a critical problem with
> something that's already landed, let's push it off." - Dave

So continuing to fill bugzilla databases with unrecoverable undefined **** &
having notification mails vanish out of existence because they're so badly
formatted that any spam checker will mistake them for spam is not a critical problem?
(In reply to comment #248)
> (In reply to comment #247)
> > "If it's not a regression from 2.18 and it's not a critical problem with
> > something that's already landed, let's push it off." - Dave
> 
> So continuing to fill bugzilla databases with unrecoverable undefined crap &
> having notification mails vanish out of existence because they're so badly
> formatted that any spam checker will mistake them for spam is not a critical problem?

It's not a critical problem in something that has landed during this cycle; it
has always been there. So it isn't a blocker for 2.20. I'm sure a complete set
of patches would still be accepted, though. Perhaps you could do a review on the
current set?
My Bugzilla version is 2.17.7.
Where can I download the "utf-8 v12" files?
If I get the patch files and overwrite the same files, will the change take
effect?

Who can help me?!

Where can I get the patch files, and how do I use them?

I am struggling with the garbled character problem.

My version is 2.17.7.

Thanks all.
After some discussions on IRC, it's become apparent that this is potentially
destablizing, as there are uncertainties about how searching will be affected
and what kind of problems we'll run into by not using Perl's utf-8 support.  I'm
perfectly willing to check this in and iron out said problems afterwards, but
not while we're in a release freeze, and not on a stable branch.  Pushing this
off to 2.22.  I'd really like to land this as soon as possible after we branch
for 2.20 though.
Whiteboard: i18n [wanted for 2.20] → i18n
Target Milestone: Bugzilla 2.20 → Bugzilla 2.22
*** Bug 298243 has been marked as a duplicate of this bug. ***
OK, we've branched, and the trunk is open.  Let's get this thing reviewed and
landed! :)
Comment on attachment 179244 [details] [diff] [review]
utf-8 v12

Hit by bitrot, but trivial unrotting -- r=wurblzap on an unrotted patch.

In a follow-up bug, we need to find a way to stop substr() from splitting UTF-8
characters in half :/

Glitches in standards compliance should imho be handled in post-checkin fixes.

Tested on Windows, using smtp and testfile as mail_delivery_method. Couldn't
get my hands on MIME-tools 5.417, but it works for me with 5.411a just as well.
Works for newchangedmail, passwordmail, flag mail.
Tested both quoted-printable and base64 encodings (forced base64 by turning the
8-bit-content check around).
Tested 7-bit-only mails.

Let's do it :)
Attachment #179244 - Flags: review? → review+
*** Bug 175782 has been marked as a duplicate of this bug. ***
Flags: approval+
Unrotted the original patch so that it may be checked in.
Fixing 011pod.t complaint, too.
Attachment #179244 - Attachment is obsolete: true
Attachment #191577 - Flags: review+
Checking in checksetup.pl;
/cvsroot/mozilla/webtools/bugzilla/checksetup.pl,v  <--  checksetup.pl
new revision: 1.420; previous revision: 1.419
done
Checking in defparams.pl;
/cvsroot/mozilla/webtools/bugzilla/defparams.pl,v  <--  defparams.pl
new revision: 1.163; previous revision: 1.162
done
Checking in Bugzilla/BugMail.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/BugMail.pm,v  <--  BugMail.pm
new revision: 1.42; previous revision: 1.41
done
Checking in Bugzilla/CGI.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/CGI.pm,v  <--  CGI.pm
new revision: 1.18; previous revision: 1.17
done
Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v  <--  Util.pm
new revision: 1.34; previous revision: 1.33
done
Checking in template/en/default/config.rdf.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/config.rdf.tmpl,v  <-- 
config.rdf.tmpl
new revision: 1.5; previous revision: 1.4
done
Checking in template/en/default/bug/show.xml.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/bug/show.xml.tmpl,v  <--
 show.xml.tmpl
new revision: 1.8; previous revision: 1.7
done
Checking in template/en/default/list/list.rdf.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/list/list.rdf.tmpl,v  <--
 list.rdf.tmpl
new revision: 1.5; previous revision: 1.4
done
Checking in template/en/default/list/list.rss.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/list/list.rss.tmpl,v  <--
 list.rss.tmpl
new revision: 1.4; previous revision: 1.3
done
Checking in template/en/default/reports/duplicates.rdf.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/reports/duplicates.rdf.tmpl,v
 <--  duplicates.rdf.tmpl
new revision: 1.2; previous revision: 1.1
done
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
FYI - if a site admin changed CGI.pm to force UTF-8 per the UTF-8 security
fixes, a CVS update will fail. bugzilla/docs/html/security-bugzilla.html should
be changed to reflect the fixes for this change.


--------------

<<<<<<< CGI.pm
    # Make sure that we don't send any charset headers
    $self->charset('UTF-8');
=======
    # Send appropriate charset
    $self->charset(Param('utf8') ? 'UTF-8' : '');
>>>>>>> 1.18 

--------------

thx

tim
Flags: documentation?
Attached patch Documentation patch (obsolete) — Splinter Review
Attachment #192066 - Flags: review?(documentation)
Comment on attachment 192066 [details] [diff] [review]
Documentation patch

>Index: docs/xml/security.xml
>-      incorporate by default the code changes suggested by 
>+      <para>If you installed Bugzilla version 2.20 or later from scratch,

Wasn't this checked in on trunk only - therefore it is Bugzilla version 2.22 or
later?

>+      This is because due to internationalization concerns, we are unable to
>+      turn the <emphasis>utf8</emphasis> parameter on by default for upgraded
>+      installations.

This sentence doesn't read correctly to me...
Attachment #192066 - Flags: review?(documentation) → review-
(In reply to comment #261)
> Wasn't this checked in on trunk only - therefore it is Bugzilla version 2.22 or
> later?

True.

> This sentence doesn't read correctly to me...

Ok. I'm no native speaker -- please give me a good sentence, and I'll put it
into a patch.
(In reply to comment #262)
> Ok. I'm no native speaker -- please give me a good sentence, and I'll put it
> into a patch.

Apparently it does make sense to others... so just fix the first bit :)
Attachment #192066 - Attachment is obsolete: true
Attachment #195374 - Flags: review?(documentation)
Comment on attachment 195374 [details] [diff] [review]
Documentation patch 1.2

r=me by inspection....
Attachment #195374 - Flags: review?(documentation) → review+
Docs (attachment 195374 [details] [diff] [review]):
Checking in docs/xml/security.xml;
/cvsroot/mozilla/webtools/bugzilla/docs/xml/security.xml,v  <--  security.xml
new revision: 1.8; previous revision: 1.7
done
Flags: documentation?
*** Bug 318151 has been marked as a duplicate of this bug. ***
*note*

test at http://landfill.bugzilla.org/bugzilla-tip :

If the name of a saved search contains UTF-8 characters they display wrong.

"Frédéric" would display as "Fr�d�ric".

regards reinhardt [[user:gangleri]]
(In reply to comment #268)
> If the name of a saved search contains UTF-8 characters they display wrong.

Works for me.

If the saved search was created before the switch to UTF-8, then yes, this is possible, but not a bug -- conversion is handled by bug 280633. If you get broken characters with a newly saved search, then please file a new bug.
(In reply to comment #269)

> Works for me.
> 
> If the saved search was created before the switch to UTF-8, then yes, this is
> possible, but not a bug -- conversion is handled by bug 280633. If you get
> broken characters with a newly saved search, then please file a new bug.

opened Bugzilla Bug 318583
bug at landfill.bugzilla.org : if a search is saved with a name containing UTF-8 characters this name is not shown properly at all pages containing "Saved Searches:"
*** Bug 316836 has been marked as a duplicate of this bug. ***
*** Bug 319343 has been marked as a duplicate of this bug. ***
Why can't bugzilla also use a META tag for content-type? How is doing that broken?
Added to the Bugzilla 2.22 Release Notes in bug 322960.
Keywords: relnote
There are some issues when I use bugzilla with utf-8.

With Mysql db:

1) The Mysql connection should be utf-8; this is enabled by this patch:
Index: Bugzilla/DB/Mysql.pm
===================================================================
RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v
retrieving revision 1.36
diff -r1.36 Mysql.pm
70a71,76
>     $self->do ("set session character_set_results=utf8");
>     $self->do ("set session character_set_client=utf8");
>     $self->do ("set session character_set_connection=utf8");
>     $self->do ("set session character_set_database=utf8");
>     $self->do ("set session character_set_server=utf8");
> 

2) The Summary field in search results (aka the bug list) gets trimmed right between the bytes of a Unicode character, and the length of a Summary is about 30 characters for Russian, not 60. Before the "..." you can see a bad symbol. See patch:
Index: Bugzilla/Template.pm
===================================================================
RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Template.pm,v
retrieving revision 1.41
diff -r1.41 Template.pm
272a273,275
> 
>         my $utf8_string = $string;
>         utf8::decode ($utf8_string);
274c277
<       return $string if !$length || length($string) <= $length;
---
>       return $string if !$length || length($utf8_string) <= $length;
277c280,283
<       my $newstr = substr($string, 0, $strlen) . $ellipsis;
---
>       my $newstr = substr($utf8_string, 0, $strlen) . $ellipsis;
> 
>         utf8::encode ($newstr);

3) Comments are wrapped as bytes, not as Unicode text. As a result the length of each line of Russian text is about 40 characters, not 80. Patch for this:
Index: Bugzilla/Util.pm
===================================================================
RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v
retrieving revision 1.45
diff -r1.45 Util.pm
30a31
> use utf8;
230a232,233
>       utf8::decode($comment);
> 
247a251
>       utf8::encode($wrappedcomment);

5) Searching with Russian text is impossible; only ASCII works.

With Postgresql db, created with "-E UNICODE":

1) Perl strings are not marked as Unicode strings; perl does not use utf-8. Patch:
Index: Bugzilla.pm
===================================================================
RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla.pm,v
retrieving revision 1.29
diff -r1.29 Bugzilla.pm
26a27
> use encoding 'utf8';
Index: Bugzilla/DB/Pg.pm
===================================================================
RCS file: /cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Pg.pm,v
retrieving revision 1.18
diff -r1.18 Pg.pm
69,70c69,78
< 
<     my $self = $class->db_new($dsn, $user, $pass);
---
>       my $attributes = { RaiseError => 0,
>                     AutoCommit => 1,
>                     PrintError => 0,
>                     ShowErrorStatement => 1,
>                     HandleError => \&_handle_error,
>                     TaintIn => 1,
>                     FetchHashKeyName => 'NAME',
>                                       pg_enable_utf8 => 1};
> 
>     my $self = $class->db_new($dsn, $user, $pass, $attributes);

After this patch the imported DB is displayed correctly and searches with Russian work, but the bugzilla installation is not usable - new comments, bugs and other strings are saved in the db as broken unicode strings.

My point of view is that if UTF-8 is declared as the encoding for new installations, then all bugzilla perl scripts MUST work with strings as Unicode strings, not as bytes. All the issues above are about this.
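Issues 2 and 3 above share a single root cause: string operations running over UTF-8 bytes instead of characters. A minimal illustration in Python (Perl strings that have not been through utf8::decode behave like the bytes case):

```python
s = 'Frédéric'
b = s.encode('utf-8')    # 10 bytes for 8 characters

# Truncating by characters keeps whole characters...
assert s[:3] == 'Fré'

# ...but truncating by bytes can cut a multi-byte sequence in half,
# which is why trimmed summaries end in a bad symbol before the "...".
try:
    b[:3].decode('utf-8')    # b[:3] ends after the first byte of the two-byte 'é'
    broken = False
except UnicodeDecodeError:
    broken = True
assert broken
```

As an aside, the first three `set session character_set_*` statements from issue 1 (client, results, connection) are what MySQL's `SET NAMES utf8` shorthand sets in one statement; character_set_database and character_set_server are normally server-side defaults rather than per-session settings.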
Attached file test, please ignore (obsolete) (deleted) —
please ignore, just test attach zip
Attachment #283019 - Attachment is obsolete: true
The content of attachment 283019 [details] has been deleted by
    Dave Miller <justdave@bugzilla.org>
who provided the following reason:

irrelevant to this bug

The token used to delete this attachment was generated at 2007-10-01 09:55:04 PDT.
QA Contact: matty_is_a_geek → default-qa
You need to log in before you can comment on or make changes to this bug.