Closed Bug 135762 Opened 18 years ago Closed 13 years ago
UTF-8 characters in submitted text get displayed as NCRs
I am using this Bugzilla installation here (bugzilla.mozilla.org) with the Mozilla 0.9.9 navigator in order to show you the problem. In the following few lines of text I use a number of non-ISO-8859-1 characters:

  greek capital sigma         = 'Σ'
  element-of sign             = '∈'
  right double quotation mark = '”'
  infinity                    = '∞'
  euro sign                   = '€'

I can see these characters (which I entered via cut & paste from a UTF-8 file shown in an xterm window in UTF-8 mode) correctly displayed in this Mozilla form field. As you will see, after submission of this form field text, these characters get displayed on the resulting Bugzilla bug description page as visible numeric character references (like '&#931;' instead of the actual Greek letter capital sigma), and not as the characters themselves. So something goes wrong on the way between me entering this form field and it getting displayed in the end on the resulting page. I'll leave it to you to decide whether this is a bug in Bugzilla or Mozilla, and what the web standards say about the use of UTF-8 characters in forms.
Strangely, the right double quotation mark and the euro sign got through intact onto this web page, but the other test characters did indeed get converted into numeric character references. The email confirmation that I received for the above bug submission was also defective. It contained the right double quotation mark and the euro sign encoded in Microsoft's code page CP1252 (0x94 and 0x80), but the email had no MIME header to indicate that the mail is indeed encoded in this character set! The other non-ASCII characters were again represented in the email as numeric character references. Correct behaviour would be: if the description of a bug contains any non-ASCII characters, then the email sent out should get the header lines

  MIME-Version: 1.0
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit

added, and the entire text should be UTF-8 encoded, to guarantee that not only the web interface user but also the email recipient receives the bug description exactly as the submitter entered it. The special treatment of CP1252 characters surprises me here. This does not at all sound like proper conformance to W3C standards (which do not use CP1252 anywhere).
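The mail headers described above are easy to produce with standard libraries; here is a Python sketch of the mechanism (Bugzilla itself is written in Perl, so this only illustrates what its mailer would need to emit, not actual Bugzilla code):

```python
from email.message import EmailMessage

def build_bug_mail(description: str) -> bytes:
    """Build a notification mail whose non-ASCII body is sent as
    8bit UTF-8 with proper MIME labelling."""
    msg = EmailMessage()
    msg["Subject"] = "Bug confirmation"
    # set_content() with a non-ASCII str body defaults to charset utf-8
    # and emits MIME-Version and Content-Type headers automatically;
    # cte="8bit" sends the raw UTF-8 bytes instead of base64.
    msg.set_content(description, cte="8bit")
    return bytes(msg)

raw = build_bug_mail("greek capital sigma = 'Σ', euro sign = '€'")
```

With these headers present, any MIME-capable mail reader displays the sigma and the euro sign exactly as entered, instead of unlabelled CP1252 bytes.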
Apparently your Mozilla encoding was set to ISO-8859-1 when you cut & pasted UTF-8 text into the Bugzilla form field. You would not have had this problem if you had set your encoding to UTF-8. When characters not covered by the character repertoire of the current encoding are entered in a form field, they are turned into NCRs before being passed on to the server. What exactly Mozilla has to do in this case is not clear. The following page goes to great lengths about the issue: http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html The relevant part of the HTML 4 spec is at: http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13 As for your question about the euro sign and the right double quotation mark, which are not in the Latin-1 repertoire but are in CP1252: they didn't get converted into NCRs because Mozilla treats Latin-1 and CP1252 (a superset of ISO-8859-1) identically.
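This NCR fallback can be reproduced in a few lines of Python via the 'xmlcharrefreplace' codec error handler (a sketch of the behaviour, not Mozilla's actual code; note that Python's iso-8859-1 codec, unlike Mozilla's, does not secretly behave as CP1252, so here even '€' would become an NCR):

```python
def simulate_form_submission(text: str, page_charset: str) -> bytes:
    """Mimic the browser behaviour described above: characters that the
    page charset cannot represent are replaced by numeric character
    references before the form data is sent to the server."""
    return text.encode(page_charset, errors="xmlcharrefreplace")

# Under ISO-8859-1 the Greek capital sigma is not representable:
latin1_body = simulate_form_submission("sigma: Σ", "iso-8859-1")  # b'sigma: &#931;'
# Under UTF-8 every character survives intact:
utf8_body = simulate_form_submission("sigma: Σ", "utf-8")
```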
Я можу їсти шкло, й воно мені не

To demonstrate that it works, I'm entering a sentence from the UTF-8 sampler at the Kermit project web page. Mozilla's encoding is set to UTF-8.
> When characters not covered by the character repertoire of the current
> encoding are entered in a form field, they are turned into NCRs
> before being passed over to the server.

Given NCRs, Bugzilla has no way to tell whether the NCRs are meant as the literal characters making up the NCR ('&' followed by '#' and digits followed by ';') or as the characters represented by the NCR. Bugzilla treats them as verbatim characters, and that's why you've got '&#931;' instead of the Greek capital sigma. I believe the standards-compliant way is to use the 'charset' parameter to indicate which encoding is used. Apparently, this issue has been discussed before, and Mozilla developers decided against it because a lot of server-side scripts neither check the 'charset' parameter nor do 'the right thing'.
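The ambiguity is easy to demonstrate (a Python sketch; Bugzilla itself is Perl): the bytes the browser sends for a real sigma under ISO-8859-1 with NCR fallback are identical to what a user sends by literally typing the six ASCII characters of the reference.

```python
from html import unescape

# What the browser sends for a real 'Σ' under ISO-8859-1 with NCR fallback:
from_browser = "Σ".encode("iso-8859-1", errors="xmlcharrefreplace").decode("ascii")
# What a user who literally typed '&', '#', '9', '3', '1', ';' sends:
typed_literally = "&#931;"

# Byte-identical on the wire, so the server cannot tell them apart:
same_on_the_wire = from_browser == typed_literally
# Unescaping blindly would recover the sigma in the first case,
# but silently corrupt the literal input in the second:
recovered = unescape(from_browser)
```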
At the moment, Bugzilla simply does not specify the encoding in which it wants the form fields submitted. This can easily be fixed! All Bugzilla has to do is add to the FORM element the two attributes

  accept-charset="UTF-8"
  enctype="multipart/form-data"

This will force the browser to send back the form fields as a properly labeled UTF-8 message body of the POST command, and Bugzilla can receive every character unambiguously. Bugzilla can keep its entire message database in UTF-8 and label every outgoing page and mail as UTF-8, and suddenly Bugzilla will be beautifully Unicode-transparent. That seems to be the only RightThing[TM] here, because the HTML 4 standard says that you can't transfer any non-ASCII data in URI parameters. http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13
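A sketch of why this removes the ambiguity on the server side, using application/x-www-form-urlencoded for brevity (the multipart/form-data case is analogous, and per HTML 4 each part can additionally carry its own charset label):

```python
from urllib.parse import urlencode, parse_qs

# If the browser is told (via accept-charset="UTF-8") to submit in UTF-8,
# the server knows exactly how to decode every field: no NCR fallback,
# no guessing.
body = urlencode({"comment": "sigma: Σ euro: €"}, encoding="utf-8")
fields = parse_qs(body, encoding="utf-8")
```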
Regarding Markus's comment #5, I had thought about this before and it would seem to be the right thing to do eventually. I do, however, worry about a number of things in this proposal.

1. Is this for sending data back to the server only?
2. Are we going to mandate UTF-8 as the Bugzilla display encoding also, so that the server sends UTF-8 as the HTTP charset?
3. What happens to all the legacy data which are in a number of different encodings, and which users typically switch encodings to view right now? We should not inconvenience these users. (This is doable because we currently send raw bytes as received to the server. We can then re-create them by simply matching our browser encoding to the one used in sending the data.)
4. If we unify on UTF-8, how can we enter data that has to do with our converter problems? For example, character X in Japanese somehow gets mapped incorrectly into character Y by our native encoding -> Unicode converter. If we use our own UTF-8 converter, we cannot represent the problem character correctly. It is that very converter we are complaining about! Such input data are possible precisely because we send raw bytes to the server under the chosen native encoding.
5. What is our migration story? We have a sizable amount of non-ASCII data in the current Bugzilla. Will the current practice of viewing under a different encoding still be possible?
To answer the questions in comment #5 from momoi: I would prefer it if communication with Bugzilla were in UTF-8 in both directions. It is critical that this is done from the client to the server at least, as otherwise the server has no clue about the encoding used and can't output valid HTML as a result. In the direction to the client, numeric character references would be possible as well, but I see no advantage in using them. They just consume significantly more space and bandwidth than UTF-8.

The argument about Japanese characters with unclear mappings doesn't hold, because the most widely used clients currently convert these internally into Unicode in inconsistent ways anyway. So you don't lose any unambiguity by just following W3C and IETF policy properly and doing everything consistently in UTF-8 with proper MIME labels. Very much on the contrary, you eliminate two conversion steps for most clients and therefore finally preserve information reliably in Unicode throughout the exchange, end-to-end. If Japanese users want to file bug reports with regard to the interpretation of certain SJIS or EUC byte sequences, then they have to quote these as ASCII hex sequences anyway, independent of Bugzilla's transfer encoding.

I think it is bad practice to put messages with an undefined character encoding into Bugzilla. This confuses both search engines and users and should be strongly discouraged in the future. There are two migration options (both can be supported, or either one):

a) Manually convert the message encoding with iconv when installing the new UTF-8 Bugzilla. If a Bugzilla administrator has a huge amount of legacy data consistently in any single encoding other than ASCII (there can't be too many), then the installation documentation should advise on how to run an encoding converter over all the stored text messages.
A widely recommended converter is GNU iconv, as found in the GNU libc distribution, or in the separately available libiconv package by Bruno Haible.

b) Add to the database a new binary field for every bug report, which indicates whether the bug was started on a Bugzilla version that requested UTF-8 field content or not. This way, old bug reports will continue to remain in the undefined encoding, whereas only newly started bug reports will be sent as UTF-8 to the client and will request UTF-8 from the client. That should ensure a very smooth transition, as nothing changes for existing bug reports. Is that easy to implement?
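Option a) amounts to a one-time re-encode of every stored comment. In Python terms it is essentially this sketch, which assumes the administrator actually knows the single legacy charset (the hard part, and the reason this option only fits installations with homogeneous data):

```python
def convert_comment(raw: bytes, legacy_charset: str) -> bytes:
    """Re-encode one stored comment from its legacy charset to UTF-8,
    i.e. the same transformation as `iconv -f <legacy> -t UTF-8`."""
    return raw.decode(legacy_charset).encode("utf-8")

koi8_comment = "шкло".encode("koi8-r")   # legacy bytes as stored today
utf8_comment = convert_comment(koi8_comment, "koi8-r")
```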
> If Japanese users want to file bug reports with regard to the
> interpretation of certain SJIS or EUC byte sequences, then they have
> to quote these as ASCII hex sequences anyway, independent of
> Bugzilla's transfer encoding.

On top of that, we can always attach HTML files in the encoding of our choice to demonstrate converter-related issues, along with screenshots (contrasting what Mozilla should do with what Mozilla currently does) if necessary.
Considering that services as popular as Google now run completely in UTF-8 (all Google search result pages are UTF-8 encoded), I think the time is ripe to do the same for Bugzilla as well.
Bugzilla defaults its display to ISO-8859-1, so any characters that are outside of this range will appear as 'escaped' values. If you change the display to UTF-8, then the text entered will appear wrong to everyone else who views the pages in ISO-8859-1. The only real solution is to force all pages to be in UTF-8. It might also be an idea to store the content encoding with the comments, so the appropriate conversion can be done.
Bugzilla should migrate to UTF-8 ASAP. Markus's migration plan B is certainly workable, and it would prevent mixed-encoding pages from accidentally being made (one comment in KOI8-R, another comment in UTF-8, still another in EUC-JP, yet another in Windows-1251 or ISO-8859-5; CJK encodings like EUC-JP, EUC-KR, and Shift_JIS cover a significant part of the KOI8-R/Windows-1251/ISO-8859-5 repertoire). See bug 212380 comment #9 and bug 212380 comment #10 for an example. I forgot to set View | Character Coding to KOI8-R and ended up posting my comment with a Russian word in EUC-KR: Bugzilla doesn't specify the MIME charset, and Mozilla used my default value. As Markus noted, with services like Google going all the way to UTF-8, I don't see any reason to put off this migration. His plan b) can work with minimal disruption, I guess.
fixing dependencies. This is a MUCH bigger issue than you're making it out to be. iconv requires you to know which charset you're converting from. We have no clue. There is so much data in so many bugs, in such a wide variety of charsets, that we have no way to do conversion until we have a viable way to do detection.
Please note that I specifically wrote that plan B is workable. I'm NOT for plan A. I do NOT think we have to take plan A.
Just to chime in with my $0.02: moving to UTF-8 for all new bugs and leaving the legacy bugs as legacy seems like the only viable solution to me... and I agree that we should do that ASAP (we should have done it a few years back, in fact).
Personally, I like plan A as a long-term fix. Short term, I suppose we need to implement plan B and add a mechanism to allow people to manually convert comments that aren't ASCII or UTF-8. We can easily scan for bugs that are 100% ASCII and flag them as UTF-8, since that conversion is transparent. Note that there could also be non-ASCII characters in user real names, bug summaries, attachment descriptions, etc. Buglists could get confusing (not that they aren't already in those situations), but we couldn't enforce UTF-8 on buglist displays until we know all of the data has been converted on all of the bugs. A lot of this discussion has already happened on bug 126266 (the dependency).
*** Bug 163921 has been marked as a duplicate of this bug. ***
I've been playing with iconv a bit, and I have a few ideas as well...

1) Bugzilla should always have been UTF-8! It's much harder to convert a bunch of stuff from unknown encodings to UTF-8 than simply to do everything in it from the start.

2) Currently, while the default encoding for all submissions is ISO-8859-1 (treated as CP1252), quite a few people, IMHO, manually switch the encoding to something else. I think this mostly includes those who already know what happens if you simply leave things as-is. Hence, it would probably be possible to convert the messages of such users from their own encoding to UTF-8. Most of the other posts use HTML entities, which are themselves plain ASCII, so they won't be affected and won't affect anything.

3) You could use some algorithm for converting with iconv (for example, first you try to convert a message from ASCII; if that fails, you try ISO-8859-1; if that fails, you take the next one from the list, etc. And you CAN define the list yourself!). I've implemented this idea with a list of two charsets in a small webmail system, post.online.lt, and it works in most cases. All you need is to know well enough what that list should consist of and in what order.

4) In the current situation, stuff is broken anyway (the server sends ISO-8859-1, but the posts come in any user-defined encoding), so almost nothing would change if you simply changed the default value. Users will still be able to switch the encoding to anything else to see the old posts. However, new bugs will be clean.
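The fallback scheme in point 3) can be sketched in a few lines of Python. One caveat worth building in: ISO-8859-1 maps all 256 byte values, so a strict decode with it never fails; it therefore only makes sense as the very last entry, with UTF-8 tried before it (the charset list here is an illustrative assumption, not a recommendation):

```python
def decode_with_fallbacks(raw: bytes,
                          encodings=("ascii", "utf-8", "iso-8859-1")):
    """Try each charset in turn and return the first strict decode that
    succeeds, together with the charset that matched. The list and its
    order are deliberately site policy, as point 3) suggests."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the listed charsets matched")
```

Since iso-8859-1 accepts any byte sequence, the default list always returns something; with a custom list that omits such a catch-all, the ValueError marks comments needing manual conversion.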
Nobody's arguing with #1! :) But very few people knew what UTF-8 was in 1998, when Bugzilla was originally written.
The worst thing is that there are still a lot of admins/programmers who don't care a bit about UTF-8. They still program everything in those darn 8-bit charsets and ignore any suggestions/requests to make their systems UTF-8-aware. :(
Mozilla includes a universal encoding detector with excellent performance. Just go for plan A and run the universal encoding detector on each comment separately to detect the encoding correctly. One or two lines of text are enough to get excellent results. If needed, it wouldn't take long to write a tiny C++ utility around universalchardet.dll that brings the functionality to the command line. You could keep the original db around but frozen for the few cases where there'd be problems (but those few cases are very certainly already a pain in the current situation).
There is a page at http://mozilla.org/projects/intl/detectorsrc.html about how to use the universal detector standalone.
This bug has not been touched by its owner in over six months, even though it is targeted to 2.20, for which the freeze is 10 days away. Unsetting the target milestone, on the assumption that nobody is actually working on it or has any plans to soon. If you are the owner, and you plan to work on the bug, please give it a real target milestone. If you are the owner, and you do *not* plan to work on it, please reassign it to email@example.com or a .bugs component owner. If you are *anybody*, and you get this comment, and *you* plan to work on the bug, please reassign it to yourself if you have the ability.
Target Milestone: Bugzilla 2.20 → ---
This problem is not a blocker, but it is annoying. Please try to solve it. Also don't forget that headers with UTF-8 characters should be MIME-encoded:

  Subject: příliš žluťoučký kůň úpěl ďábelské ódy

This won't work even with the header

  Content-Type: text/plain; charset=UTF-8

Adding the Content-Type line to the data/params file solved the problem with the mail body encoding, but headers are still broken.
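Header fields indeed need RFC 2047 encoded-words rather than raw UTF-8 or a body-level Content-Type. A Python sketch of the mechanism Bugzilla's (Perl) mailer would need:

```python
from email.header import Header

# Wrap the non-ASCII Subject in an RFC 2047 encoded-word so that the
# header itself stays pure ASCII while declaring its charset inline.
subject = Header("příliš žluťoučký kůň úpěl ďábelské ódy", charset="utf-8")
encoded = subject.encode()   # e.g. '=?utf-8?...?=' encoded-word form
```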
I'd say the problem is that the installation in question just needs to move to using UTF-8. And for installations that already exist with bad data in them, then yeah, it's a dup of bug 280633. *** This bug has been marked as a duplicate of 280633 ***
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE