Bug 4463 - Default encoding for XUL/XML/RDF should use UTF-8
Product: Core
Classification: Components
Component: XML
: Trunk
: All All
P1 critical
: M5
Assigned To: Nisheeth Ranjan
: rchen
: Andrew Overholt [:overholt]
: 5262
Reported: 1999-03-31 14:57 PST by rchen
Modified: 1999-04-26 14:29 PDT

Description rchen 1999-03-31 14:57:49 PST
I separated the UTF-8 default-encoding issue out from bug 4431.
Comment 1 Peter Trudelle 1999-03-31 17:06:59 PST
Changing severity to enhancement, assigning to danm.
Comment 2 msanz 1999-03-31 17:14:59 PST
Peter, this is not an enhancement, check this reference.

As part of M4 we have a goal of pseudo-localizing XUL into Japanese, and this is
a blocking issue. Please modify the priority and set the TFV to M4.
Comment 3 rchen 1999-04-01 10:40:59 PST

The fix should be in M4 and severity is major.

Thank you.
Comment 4 Peter Trudelle 1999-04-01 17:01:59 PST
Okay, reassigning to hyatt for M4.
Comment 5 David Hyatt 1999-04-01 19:03:59 PST
I have no idea what this means.  Isn't nsString unicode?  What is the problem?
Comment 6 bobj 1999-04-01 19:28:59 PST
nsString is UTF-16 Unicode, which uses a 16-bit value for all characters.
But the XML is likely to be UTF-8.

If an XML file does not contain:
   <?xml encoding='IANA-charset-name'?>
then by default the encoding is either UTF-16 or UTF-8.  For most interesting
cases, each UTF-16 character is a fixed 2-byte quantity, but these can be
affected by endian-ness.  So you need to check if the file starts with a
Byte Order Mark (BOM).  It will either be FEFF or FFFE, depending on the

If there's a BOM, the data is UTF-16.  Otherwise it is UTF-8 which is a
byte-stream and unaffected by endian-ness.  ASCII is a proper subset of
UTF-8 and each character is represented as 1 byte.  Other character sets
are encoded as multiple bytes per character (e.g., accented characters).
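
The BOM check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `GuessedEncoding`, `GuessDefaultEncoding`, and `ExampleBomCheck` are hypothetical names, not functions from nsParser or expat, and the sketch covers only the case where the file has no explicit encoding declaration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration; not taken from the Mozilla tree.
enum class GuessedEncoding { Utf8, Utf16BigEndian, Utf16LittleEndian };

// Look at the first two bytes of a document that carries no explicit
// encoding declaration. A BOM (FE FF or FF FE) means UTF-16; anything
// else, including input shorter than two bytes, is treated as UTF-8.
GuessedEncoding GuessDefaultEncoding(const unsigned char* data, std::size_t length) {
    if (length >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return GuessedEncoding::Utf16BigEndian;
        if (data[0] == 0xFF && data[1] == 0xFE) return GuessedEncoding::Utf16LittleEndian;
    }
    return GuessedEncoding::Utf8;
}

// Small usage example: a file starting with "<w" has no BOM.
void ExampleBomCheck() {
    const unsigned char plain[] = {'<', 'w', 'i', 'n'};
    assert(GuessDefaultEncoding(plain, sizeof plain) == GuessedEncoding::Utf8);
}
```

Returning the byte order along with the UTF-16 result lets the caller swap bytes when the file's order differs from the machine's.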

We plan to create all of the 5.0 XUL files in UTF-8, so we need to convert
from UTF-8 to UTF-16.
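
For illustration, a UTF-8 to UTF-16 conversion of the kind mentioned above might look like the sketch below. `Utf8ToUtf16` and `ExampleConversion` are hypothetical names; this is not the converter Mozilla uses. It assumes malformed or truncated sequences should become U+FFFD, and it does not reject overlong forms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only.
std::u16string Utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray continuation or invalid lead

        // Every continuation byte must exist and look like 10xxxxxx.
        bool ok = (i + extra < in.size());
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(0xFFFD);  // outside Unicode range or a lone surrogate
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // one 16-bit unit
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Small usage example: ASCII passes through unchanged, one unit per byte.
void ExampleConversion() {
    assert(Utf8ToUtf16("XUL") == u"XUL");
}
```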

For M4 we need UTF-8 default supported.  Later we will need UTF-16 and the
  <?xml encoding='IANA-charset-name'?>
supported.  But that is another bug: 4431
Comment 7 David Hyatt 1999-04-01 20:06:59 PST
Is this only XUL's problem?  Are XML and HTML supporting UTF-8?
Comment 8 David Hyatt 1999-04-02 01:14:59 PST
I'm trying to figure out if this problem is with the XML parser or if it's with
the XUL content sink (or both).
Comment 9 bobj 1999-04-02 12:18:59 PST
I think it's the XML parser.  It needs to parse:
   <?xml encoding='IANA-charset-name'?>
as spec'd in

(As it did in pre-5.0 browsers, HTML needs to support UTF-8 if specified,
but HTML defaults to ISO-8859-1 not UTF-8.  Our implementation is more
complicated because we take into account user settings too.)

Another thing that affects both XML and HTML is the HTTP Content-Type, e.g.,
     Content-Type: text/xml; charset=UTF-8
     Content-Type: text/html; charset=UTF-8

This needs to be parsed by netlib for both HTML and XML.  ftang is working
w/gagan and rickg on this.

For M4, we need the default XML behavior.  For M4, we don't need the above HTTP
header and the XML <?xml encoding ...> parsing.
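
As a rough illustration of the HTTP header case, the sketch below pulls the charset parameter out of a Content-Type value. `ExtractCharset` and `ExampleCharset` are hypothetical names and this is not netlib code. It assumes the parameter is written as `charset=value` with no spaces around the `=`, optionally quoted.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; returns "" when no charset is present.
std::string ExtractCharset(const std::string& contentType) {
    // Lower-case copy so the parameter name is matched case-insensitively.
    std::string lower;
    for (unsigned char c : contentType) {
        lower.push_back(static_cast<char>(std::tolower(c)));
    }
    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();

    // Skip an opening quote if the value is quoted.
    if (pos < contentType.size() && contentType[pos] == '"') ++pos;

    // The value ends at a semicolon, a closing quote, or whitespace.
    std::size_t end = pos;
    while (end < contentType.size() && contentType[end] != ';' &&
           contentType[end] != '"' &&
           !std::isspace(static_cast<unsigned char>(contentType[end]))) {
        ++end;
    }
    return contentType.substr(pos, end - pos);
}

// Small usage example with one of the header values quoted above.
void ExampleCharset() {
    assert(ExtractCharset("text/xml; charset=UTF-8") == "UTF-8");
}
```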
Comment 10 tao 1999-04-02 12:40:59 PST
Be advised that we are supposed to switch to expat by M4. One thing to try is
to recompile the client with expat turned on.

Adding Nisheeth to the CC list for comments.
Comment 11 tao 1999-04-02 12:40:59 PST
Be advised that we are supposed to switch to expat by M4. One thing to try is
to recompile the client with expat turned on.

Adding Nisheeth to the CC list for comments.
Comment 12 Katsuhiko Momoi 1999-04-02 13:39:59 PST
As an additional fact, UTF-8 HTML display is working in the
current (4/1) build as long as the Meta-Equiv Content-Type
header includes "charset = utf-8".
Comment 13 Frank Tang 1999-04-06 11:53:59 PDT
Hyatt, I hope you don't mind that I reassigned this to myself. I have fixed this
in nsParser (approved by rickg) to use UTF-8 as the default charset for RDF, XML,
or XUL. Checked in as mozilla/htmlparser/src/nsParser.cpp 3.81.

I verified this with my pseudo-l10n file. The buttons show up correctly. However,
the menu still displays garbage, but this is a separate issue. Let's put the menu
display problem into a separate bug.
Comment 14 rchen 1999-04-09 17:40:59 PDT
I am reopening the bug because something is wrong with the newer builds. Here
are the results with the UTF-8 JA pseudo navigator.xul on two OS configurations:

Build 04-07-11
  JA NT: menu, buttons, and status bar display fine.
  US NT with J fonts: buttons and status bar display fine, but the menu does not
  (it shows ????). The result is similar to Mac.

Builds 04-08-10, 04-09-12, and 04-09-16
  JA NT: only the frame window is displayed; no menu, buttons, or status bar.
  US NT with J fonts: only the frame window is displayed; no menu, buttons, or status bar.
Comment 15 msanz 1999-04-09 18:32:59 PDT
Moving to M5. We need to investigate more, and it shouldn't be a show stopper.
Comment 16 Frank Tang 1999-04-15 10:48:59 PDT
ToNewCString in nsExpatTokenizer::ConsumeToken damages the data. It damages the Unicode data which has already been converted (by assuming

Nisheeth, please fix it ASAP. None of our XUL/XML/RDF works without this. This is a blocker for L10N and pseudo-L10N.
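
To illustrate the kind of damage reported here, the sketch below narrows UTF-16 data to one byte per character. It is an assumption that the narrow copy keeps only the low 8 bits of each 16-bit unit; the exact behavior of ToNewCString is not spelled out in this bug. `LossyNarrow` and `ExampleNarrowing` are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a lossy 16-bit to 8-bit copy (assumed behavior:
// the high byte of each unit is dropped).
std::string LossyNarrow(const std::u16string& wide) {
    std::string narrow;
    for (char16_t unit : wide) {
        narrow.push_back(static_cast<char>(unit & 0xFF));
    }
    return narrow;
}

// ASCII survives the narrowing, which is why English-only XUL would not
// show the problem.
void ExampleNarrowing() {
    assert(LossyNarrow(u"OK") == "OK");
}
```

Under that assumption, HIRAGANA LETTER A (U+3042) would come out as the single byte 0x42, the letter "B", so Japanese text would be corrupted while ASCII text is unaffected.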
Comment 17 nhottanscp 1999-04-15 12:37:59 PDT
This is also blocking viewing message headers (2671).
Comment 18 bobj 1999-04-15 14:42:59 PDT
This has been reopened because of the switch to expat.
Clearing resolution FIXED.
Comment 19 Nisheeth Ranjan 1999-04-19 17:06:59 PDT
Accepting bug.  Setting component to XML...
Comment 20 Nisheeth Ranjan 1999-04-21 22:32:59 PDT
The fix is checked in.  Expat now accepts Unicode buffers.
Comment 21 Nisheeth Ranjan 1999-04-24 03:12:59 PDT
*** Bug 4431 has been marked as a duplicate of this bug. ***
Comment 22 Nisheeth Ranjan 1999-04-24 03:14:59 PDT
*** Bug 5262 has been marked as a duplicate of this bug. ***
Comment 23 rchen 1999-04-26 14:29:59 PDT
Verified on Japanese NT4.0.