Closed Bug 135762 Opened 18 years ago Closed 13 years ago

UTF-8 characters in submitted text get displayed as NCRs

Categories

(Bugzilla :: Creating/Changing Bugs, defect)

2.15
x86
Linux
defect
Not set

RESOLVED DUPLICATE of bug 280633

People

(Reporter: Markus.Kuhn, Assigned: myk)

Details

I am using this bugzilla installation here (bugzilla.mozilla.org) with the
Mozilla 0.9.9 navigator in order to show you the problem. I use in the following
few text lines a number of non-ISO-8859-1 characters:

greek capital sigma = 'Σ'
element of sign = '∈'
double right quotation marks = '”'
infinity = '∞'
euro sign = '€'

I can see these characters (which I entered via cut & paste from a UTF-8 file
shown in an xterm window in UTF-8 mode) correctly displayed in this mozilla form
field.

As you will see, after submission of this form field text here these characters
get displayed on the resulting bugzilla bug description page as visible numeric
character references (like '&#931;' instead of the actual greek letter capital
sigma), and not as the characters themselves. So something goes wrong on the way
between me entering this form field and it getting displayed in the end on the
resulting page. I'll leave it to you to decide, whether this is a bug in
bugzilla or mozilla and what the web standards say about the use of UTF-8
characters in forms.
Strangely, the double-right-quotation-mark and the euro-sign got through intact
onto this web page, but the other test characters got indeed converted into
numeric character references.

The email confirmation that I received for the above bug submission was also
defective. It contained the double-right-quotation-mark and the euro-sign encoded
in Microsoft's code page CP1252 (0x94 and 0x80), but the email had no MIME
header to indicate that the mail is indeed encoded in this character set! The
other non-ASCII characters were represented in the email as numeric character
references again.

Correct behaviour would be: If the description of a bug contains any non-ASCII
characters, then the email sent out should get the header lines

  Mime-version: 1.0
  Content-type: text/plain; charset=UTF-8
  Content-transfer-encoding: 8bit

added and the entire text should be UTF-8 encoded, to guarantee that not only
the web interface user but also the email recipient receives the bug description
exactly as the submitter had entered it.
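
A minimal sketch of such a notification, built with Python's standard email package rather than Bugzilla's own mail code. The function name build_bugmail and the example.org addresses are hypothetical and only serve the illustration.

```python
from email.message import EmailMessage


def build_bugmail(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a plain-text mail whose body is labelled and transferred as UTF-8."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # charset and cte produce the Content-Type charset parameter and the
    # Content-Transfer-Encoding header; MIME-Version is added automatically.
    msg.set_content(body, subtype="plain", charset="utf-8", cte="8bit")
    return msg


mail = build_bugmail(
    "bugzilla-daemon@example.org",  # hypothetical addresses
    "reporter@example.org",
    "[Bug 135762] test",
    "sigma = Σ, element of = ∈, infinity = ∞, euro = €\n",
)
# The serialized message carries the raw UTF-8 bytes of the body.
print(mail.as_bytes().decode("utf-8"))
```

The body is sent as raw 8-bit data, so this only works end to end if every mail server on the path is 8-bit clean; base64 or quoted-printable would be the conservative alternative.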

The special treatment of CP1252 characters surprises me here. This does not at
all sound like it conforms properly to W3C standards (which usually do not use
CP1252 anywhere).
Apparently your Mozilla encoding was set to ISO-8859-1 when you cut & pasted
UTF-8 text into the Bugzilla form field. You would not have had this problem
if you had set your encoding to UTF-8.

When characters not covered by the character repertoire of the current
encoding are entered in a form field, they are turned into NCRs
before being passed over to the server. What exactly Mozilla
has to do in this case is not clear. The following page discusses
the issue at great length:

   http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

The relevant part of HTML 4 spec is at : 

http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13
  
 
 
As for your question about the euro sign and 'right double quotation mark',
which are not in the Latin1 repertoire but are in CP1252: they didn't get
converted into NCRs because Mozilla treats Latin1 and CP1252 (a superset
of ISO-8859-1) identically.
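
The behaviour described above can be imitated in a few lines of Python. This is only a sketch of the effect, not of Mozilla's actual code; the helper name submit_as is hypothetical, and Python's "xmlcharrefreplace" error handler stands in for the browser's NCR fallback.

```python
# Characters from the original report, keyed by their description.
samples = {
    "greek capital sigma": "\u03a3",          # Σ
    "element of sign": "\u2208",              # ∈
    "double right quotation mark": "\u201d",  # ”
    "infinity": "\u221e",                     # ∞
    "euro sign": "\u20ac",                    # €
}


def submit_as(text: str, form_charset: str) -> bytes:
    """Encode form text, replacing unencodable characters with NCRs."""
    return text.encode(form_charset, errors="xmlcharrefreplace")


for name, char in samples.items():
    # Treating the page as CP1252 lets the quote and the euro through as
    # single bytes, while the other characters become numeric references.
    print(f"{name}: {submit_as(char, 'cp1252')!r}")
```

With strict ISO-8859-1 instead of CP1252, the euro sign and the quotation mark would be turned into NCRs as well.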
Я  можу  їсти  шкло,  й воно мені не

To demonstrate it works, I'm entering a Russian sentence from the UTF-8
sampler on the Kermit project web page. Mozilla encoding is set to UTF-8.

> When characters not covered by the character repertoire of the current
> encoding are entered in a form field, they are turned into NCRs
> before being passed over to the server.

  Given NCRs, Bugzilla has no way to tell whether they are meant as the
literal characters making up the NCRs ('&' followed by '#', digits,
and ';') or as the characters represented by the NCRs. Bugzilla
treats them as 'verbatim characters' and that's why you've
got '&#931;' instead of 'greek capital sigma'.
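
A small sketch of why the reference shows up verbatim, assuming the server HTML-escapes stored comment text before putting it on the page (the function name render_comment is hypothetical, and this is Python rather than Bugzilla's own code):

```python
import html


def render_comment(stored_text: str) -> str:
    """Escape stored comment text for inclusion in an HTML page."""
    return html.escape(stored_text)


# What the browser sent when Σ could not be encoded in the form charset.
submitted = "sigma = &#931;"

# The ampersand is escaped, so the browser shows the six characters
# '&', '#', '9', '3', '1', ';' instead of the Greek letter.
print(render_comment(submitted))
```

Escaping is the safe choice on the server side, since the same byte sequence could just as well have been typed on purpose by someone writing about NCRs.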

  I believe the standard-compliant way is to use the 'charset' parameter
to indicate which encoding is used. Apparently, this issue has
been discussed before and Mozilla developers decided against it
because a lot of server-side scripts don't check the 'charset'
parameter and do 'the right thing'.
Depends on: 35970
At the moment, Bugzilla simply does not specify in which encoding it wants to
have the form fields submitted. This can easily be fixed! All Bugzilla has to do
is to add to the FORM element the two attributes

  accept-charset="UTF-8" enctype="multipart/form-data"

This will force the browser to send back the form fields as a properly labeled
UTF-8 message body of the POST command, and Bugzilla can receive every character
unambiguously. Bugzilla can keep its entire message database in UTF-8 and label
every outgoing page and mail as UTF-8, and suddenly Bugzilla will be beautifully
Unicode transparent.

That seems to be the only RightThing[TM] here, because the HTML 4 standard says
that you can't transfer any non-ASCII data in URI parameters.

http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13
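
To illustrate why the server needs to know the submission encoding, here is a rough Python sketch. It uses a URL-encoded body instead of the multipart/form-data proposed above, purely to keep the example short; the field name "comment" and the helper read_comment are hypothetical.

```python
from urllib.parse import parse_qs, urlencode

# What a browser would send for the field if it encodes the form as UTF-8.
body = urlencode({"comment": "Σ ∈ ∞"}, encoding="utf-8")
print(body)


def read_comment(form_body: str, charset: str) -> str:
    """Decode the 'comment' field, assuming the given submission charset."""
    return parse_qs(form_body, encoding=charset)["comment"][0]


# Decoding with the agreed charset recovers the text exactly.
print(read_comment(body, "utf-8"))
# Guessing a different charset silently yields mojibake, not an error.
print(repr(read_comment(body, "latin-1")))
```

The wrong guess does not raise any error, which is why an explicit agreement on UTF-8 is more robust than letting the server infer the charset.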
Regarding Markus's comment#5, I had thought about this
before and it would seem to be the right thing to do 
eventually. I do worry about a number of things
about this proposal.

1. Is this for sending data back to the server only?
2. Are we going to mandate UTF-8 as the Bugzilla display
encoding also, so that the server sends UTF-8 as the HTTP
charset?

3. What happens to all the legacy data, which are in a number
of different encodings and for which users typically switch encodings
to view them right now? We should not inconvenience these
users. (This is doable because we currently send raw bytes
as received to the server. We can then re-create them
by simply matching our browser encoding to the one
used in sending the data.)

4. If we unify on UTF-8, how can we enter data that has to
do with our converter problem? For example, character X in
Japanese somehow gets mapped incorrectly into character Y
by our native encoding -> Unicode converter. If we use
our own UTF-8 converter, we cannot represent the problem
character correctly. It is that very converter we are 
complaining about! These input data are possible precisely because we send raw
bytes to the server under the chosen 
native encoding.

5. What is our migration story? We have a sizable amount
of non-ASCII data in current Bugzilla. Will the current 
practice of viewing under a different encoding still be
possible?
To answer the questions in comment #5 from momomi:

I would prefer if communication with Bugzilla were in UTF-8 in both directions.
It is critical that it is done from the client to the server at least, as
otherwise the server has no clue about the encoding used and can't output valid
HTML as a result. In the direction to the client, numerical character references
would be possible as well, but I see no advantage in using these. They just
consume significantly more space and bandwidth than UTF-8.

The argument about Japanese characters with unclear mapping doesn't hold,
because the most widely used clients currently convert these in inconsistent
ways internally into Unicode anyway. So you don't lose any unambiguity by just
following W3C and IETF policy properly and consistently doing everything in UTF-8
with proper MIME labels. Very much on the contrary, you eliminate for most
clients two conversion steps and therefore finally preserve information reliably
in Unicode throughout the exchange end-to-end. If Japanese users want to file bug
reports with regard to the interpretation of certain SJIS or EUC byte sequences,
then they have to quote these as ASCII hex sequences anyway, independent of
Bugzilla's transfer encoding.

I think it is bad practice to put into Mozilla messages with an undefined
character encoding. This confuses both search engines and users and should be
strongly discouraged in the future.

There are two migration options (which can both be supported, or either):

a) Manually convert the message encoding with iconv when installing the new
UTF-8 Bugzilla. If a bugzilla administrator has a huge amount of legacy data
consistently in any single encoding other than ASCII (there can't be too many),
then the installation documentation should advise on how to run an encoding
converter through all the stored text messages. A generally widely recommended
converter is GNU iconv, as found in the GNU libc distribution, or in the
separately available libiconv package by Bruno Haible.

b) Add in the database a new binary field for every bug report, which indicates
whether this bug was started on a Bugzilla version that requested UTF-8 field
content or not. This way, old bug reports will continue to remain in the
undefined encoding, whereas only newly started bug reports will be sent as UTF-8
to the client and will request UTF-8 from the client. That should ensure a very
smooth transition, as nothing changes for existing bug reports.

Is that easy to implement?
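
For what it's worth, option b) could be sketched roughly like this in Python. The StoredComment type, its is_utf8 field, and the to_response helper are all hypothetical and do not reflect Bugzilla's actual schema; the point is only that the flag decides whether a charset label is sent.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StoredComment:
    raw: bytes     # bytes exactly as stored in the database
    is_utf8: bool  # hypothetical flag: bug was created after the switch


def to_response(comment: StoredComment) -> Tuple[bytes, Optional[str]]:
    """Return the bytes to send and the charset label, if any."""
    if comment.is_utf8:
        # Raises UnicodeDecodeError if the stored data is not valid UTF-8,
        # so corrupt rows are noticed instead of being mislabelled.
        comment.raw.decode("utf-8")
        return comment.raw, "utf-8"
    # Legacy bug: pass the bytes through unlabelled, as happens today.
    return comment.raw, None


new_bug = StoredComment("Σ".encode("utf-8"), is_utf8=True)
old_bug = StoredComment(b"\xf0\xd2", is_utf8=False)
print(to_response(new_bug))
print(to_response(old_bug))
```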
> If Japanese users want to file bug
> reports with regard to the interpretation of certain SJIS 
> or EUC byte sequences,
> then they have to quote these as ASCII hex sequences anyway, independent of
> Bugzilla's transfer encoding.

  On top of that, we can always attach HTML files in the encoding
of our choice to demonstrate converter-related issues, along with
screenshots (contrasting what Mozilla should do with what Mozilla
currently does) if necessary.
Target Milestone: --- → Bugzilla 2.20
Considering that services as popular as Google are now running completely in
UTF-8 (all Google search result pages are UTF-8 encoded), I think the time is
ripe to do the same for Bugzilla as well.
Bugzilla defaults its display to ISO-8859-1, so any characters that are outside
of this range will appear as 'escape' values. If you change the display to
UTF-8, then the text entered will appear wrong to everyone else, if they view
the pages in ISO-8859-1. The only real solution is to force all pages to be in
UTF-8. It might also be an idea to include the content encoding in the comments,
so the appropriate conversion can be done.
Bugzilla should migrate to UTF-8 asap. Markus' migration plan B is certainly
workable and it'd prevent mixed-encoding pages from accidentally being made
(one comment in KOI8-R, another comment in UTF-8, still another in EUC-JP, yet
another in Windows-1251, ... ISO-8859-5; CJK encodings like EUC-JP, EUC-KR and
Shift_JIS cover a significant part of the KOI8-R/Windows-1251/ISO-8859-5
repertoire). See bug 212380 comment #9 and bug 212380 comment #10 for an
example. I forgot to set View | Character Coding to KOI8-R and ended up posting
my comment with a Russian word in EUC-KR. Bugzilla doesn't specify the MIME
charset and Mozilla uses my default value.


As Markus noted, with services like Google going all the way to UTF-8, I don't
see any reason to put off this migration. His plan b) can work with minimal
disruption, I guess.
fixing dependencies.

This is a MUCH bigger issue than you're making it out to be.

iconv requires you to know which charset you're converting from.  We have no clue.
There is so much data in so many bugs, in such a wide variety of
charsets, that we have no way to do conversion until we have a viable way to do
detection.
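
To make the problem concrete, here is a small Python sketch (the word and the candidate list are arbitrary examples): the same stored bytes decode without any error under several single-byte charsets, so a converter cannot tell from the bytes alone which reading is the intended one.

```python
# A Russian word as it would be stored if submitted in KOI8-R.
raw = "Привет".encode("koi8-r")

# Candidate source charsets a blind conversion might assume.
candidates = ["koi8-r", "cp1251", "iso-8859-5", "latin-1"]

# Every candidate decodes successfully; only one result is the real text.
readings = {name: raw.decode(name) for name in candidates}
for name, text in readings.items():
    print(f"{name:>10}: {text}")
```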
Depends on: bz-charset
No longer depends on: 35970
Please note that I specifically wrote that plan B is workable. I'm NOT for Plan A.
I do NOT think we have to take Plan A.
Just to chime in with my $0.02, moving to UTF-8 for all new bugs and leaving the
legacy bugs as legacy seems like the only viable solution to me... and I agree
that we should do that ASAP (should have a few years back, in fact).
personally I like plan A as a long-term fix.

Short term I suppose we need to implement plan B and add a mechanism to allow
people to manually convert comments that aren't ASCII or UTF-8.  We can easily
scan for bugs that are 100% ASCII and flag them as UTF-8 since it's transparent.
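
The ASCII scan could look something like the following Python sketch. The helper name can_flag_as_utf8 and the idea of passing every text field of a bug as raw bytes are assumptions for illustration; bytes.isascii() needs Python 3.7 or later.

```python
from typing import Iterable


def can_flag_as_utf8(fields: Iterable[bytes]) -> bool:
    """True if every stored text field of a bug is pure ASCII.

    ASCII bytes mean the same thing in UTF-8, so such a bug can be
    labelled UTF-8 without converting anything.
    """
    return all(field.isascii() for field in fields)


print(can_flag_as_utf8([b"crash on startup", b"steps to reproduce"]))
print(can_flag_as_utf8([b"caf\xe9 menu is garbled"]))
```

As the next paragraph points out, the fields checked would have to include real names, summaries and attachment descriptions, not just comments.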

Note that there could also be non-ASCII characters in user real names, bug
summaries, attachment descriptions, etc.  buglists could get confusing, not that
they aren't already in those situations, but that we couldn't enforce utf-8 on
buglist displays until we know all of the data has been converted on all of the
bugs.

A lot of this discussion has already happened on bug 126266 (the dependency)
*** Bug 163921 has been marked as a duplicate of this bug. ***
I've been playing with iconv a bit, and I have a few ideas also...

1) Bugzilla should have always been UTF-8! It's much harder to convert a bunch
of stuff from unknown_encoding to UTF-8 than simply doing everything in it.

2) currently, while the default value for all the submits is CP1252
(iso-8859-1), quite a few people, IMHO, manually switch the encoding to
something else. I think this mostly includes those that already know what
happens if you simply leave things as-is. Hence, it would probably be possible
to convert the messages of such users from their own encoding to UTF-8. Most of
the other posts use HTML entities, which, themselves, are plain ASCII, so they
won't be affected and won't affect anything.

3) you could use some algorithm for converting with iconv (for example, first,
you try to convert a message from ASCII, if that fails, you try ISO-8859-1, if
that fails, you take the next from the list etc. And you CAN define the list by
your own then!). I've implemented this idea with a list of two charsets in a
small webmail system, post.online.lt, and it works in most cases. All you need is
to know well enough what that list should consist of and in what order...

4) in the current situation, stuff is broken anyway (the server sends ISO-8859-1,
but the posts come in any-user-defined-encoding), so almost nothing would
change if you simply change the default value. Users will still be able to
switch the encoding to any other to see the old posts. However, new bugs will be
clean. 
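
The try-a-list idea from point 3 can be sketched in Python as follows. This is not the post.online.lt implementation; the function name decode_with_fallback and the example list are assumptions. One caveat: a single-byte charset such as ISO-8859-1 accepts every byte sequence when decoding, so it never "fails" and only makes sense as the last entry.

```python
from typing import Sequence, Tuple


def decode_with_fallback(raw: bytes, charsets: Sequence[str]) -> Tuple[str, str]:
    """Try each charset in order; return the text and the charset that worked."""
    for name in charsets:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # not valid in this charset, try the next one
    raise ValueError("no candidate charset could decode the data")


# Strict charsets first, the catch-all single-byte charset last.
order = ["ascii", "utf-8", "iso-8859-1"]
print(decode_with_fallback(b"plain text", order))
print(decode_with_fallback("Σ".encode("utf-8"), order))
print(decode_with_fallback(b"caf\xe9", order))
```

A successful decode only means the bytes are valid in that charset, not that the guess is right, so the order of the list carries most of the intelligence.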
Nobody's arguing with #1! :)  But very few people knew what UTF8 was in 1998
when Bugzilla was originally written.
The worst thing is that there are still a lot of admins/programmers that don't
care a bit about UTF-8. They still program everything in those darn 8-bit
charsets and ignore any suggestions/requests to make their systems UTF8-aware. :(
Mozilla includes a universal encoding detector that has excellent performance.

Just go for plan A and run the universal encoding detector on each comment
separately to detect the encoding correctly. 
One or two lines of text is enough to get excellent results. 
If needed, it wouldn't take long to write a tiny C++ utility around
universalchardet.dll that brings the functionality to the command line.

You could keep the original db around but frozen for the few cases where there'd
be problems (but those few cases very certainly are already a pain in the
current situation).

The page http://mozilla.org/projects/intl/detectorsrc.html describes how to
use the universal detector standalone.
This bug has not been touched by its owner in over six months, even though it is
targeted to 2.20, for which the freeze is 10 days away. Unsetting the target
milestone, on the assumption that nobody is actually working on it or has any
plans to soon.

If you are the owner, and you plan to work on the bug, please give it a real
target milestone. If you are the owner, and you do *not* plan to work on it,
please reassign it to nobody@bugzilla.org or a .bugs component owner. If you are
*anybody*, and you get this comment, and *you* plan to work on the bug, please
reassign it to yourself if you have the ability.
Target Milestone: Bugzilla 2.20 → ---
This problem is not a blocker, but it is annoying. Please, try to solve this.
Also don't forget, that headers with UTF-8 characters should be MIME encoded:

Subject: příliš žluťoučký kůň úpěl ďábelské ódy

This won't work even with the header

Content-Type: text/plain; charset=UTF-8

Adding the Content-Type line to data/params file solved the problem with mail
body encoding, but headers are still broken.
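
For illustration, here is how the Subject line above would be MIME-encoded (RFC 2047 encoded words) with Python's standard email.header module. This is a sketch of the target format, not of Bugzilla's mail code.

```python
from email.header import Header, decode_header, make_header

subject = "příliš žluťoučký kůň úpěl ďábelské ódy"

# Header.encode() produces ASCII-only encoded words, folded if too long.
encoded = Header(subject, charset="utf-8").encode()
print("Subject:", encoded)

# A mail client reverses the encoding to get the original text back.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```

The Content-Type header only describes the body, which is why it has no effect on non-ASCII text in Subject or other header fields.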
Now that bug 126266 has landed, is this a duplicate of bug 280633, or at least fixed by it?
QA Contact: mattyt-bugzilla → default-qa
I'd say the problem is that the installation in question just needs to move to using utf8. And for installations that exist already with bad data in them, then yeah, it's a dup of 280633.

*** This bug has been marked as a duplicate of 280633 ***
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE