Closed Bug 10680 Opened 21 years ago Closed 20 years ago

Can't decode UTF-8 Japanese text in JS function

Categories

(Core :: Layout, defect, P1, critical)

VERIFIED FIXED

People

(Reporter: tao, Assigned: tao)

Details

(Whiteboard: Need a converter for escape-unicode)

Attachments

(3 files)

Please reassign to the proper owner.

To reproduce:

1. Launch the apprunner.
2. Click on the "QA|Strres Test" menuitem to go to a test case
   I hacked up based on openLocation.xul.
3. There are 3 titledbuttons labeled "ok", "test", and
   "cancel: ...". Note that the 3rd button's label properly
   displays Japanese text in utf-8. The label string is set
   in XUL. (see attached figure 1)
4. Now, click on the "test" button which will

   4.1. set 'cancel' btn label to "\u88e6\ue3bb\u8b82" which is the escaped
        Unicode form of a Japanese text.
   4.2. set 'ok' btn label to "onetwothree" which is a plain English
        text retrieved by nsIStringBundle from an external
        property file, resource:/res/strres.properties.
   4.3. set 'test' button to "æ »ã  " which is a utf-8 encoded Japanese
        text retrieved by nsIStringBundle from
        resource:/res/strres.properties.

5. As you can see in the attached figure 2:
   5.1 Japanese text in escaped unicode displays properly.
   5.2 English texts retrieved from property file display correctly.
   5.3 UTF-8 encoded Japanese texts do not display correctly. Even if I store
       the text in a JS var and pass it to the setAttribute() function, it
       does not draw correctly.
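
The symptom in 5.3 is consistent with UTF-8 bytes being widened one byte per character instead of being decoded. A minimal sketch of that difference follows. It is an illustration only: the character U+5FE0 is an arbitrary choice, and the helper names widenBytes and decodeUtf8 are hypothetical, not taken from the test case or from Mozilla code.

```javascript
// Illustrative only: U+5FE0 is an arbitrary CJK character, not read from the test case.
const original = "\u5fe0";

// Encode to UTF-8 bytes, as the text would be stored in a UTF-8 properties file.
const utf8Bytes = new TextEncoder().encode(original); // bytes 0xE5 0xBF 0xA0

// Wrong: treat each byte as its own character (a Latin-1 style widening).
function widenBytes(bytes) {
  let out = "";
  for (const b of bytes) {
    out += String.fromCharCode(b);
  }
  return out;
}

// Right: decode the whole byte sequence as UTF-8.
function decodeUtf8(bytes) {
  return new TextDecoder("utf-8").decode(bytes);
}

const garbled = widenBytes(utf8Bytes); // 3 code units: U+00E5 U+00BF U+00A0
const decoded = decodeUtf8(utf8Bytes); // 1 code unit: U+5FE0
console.log(garbled.length, decoded.length, decoded === original); // 3 1 true
```

If the property file loader behaves like widenBytes, the string handed to setAttribute() is already garbled before layout ever sees it.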



Feel free to let me know if you need assistance in reproducing this bug.



Thanks
Assignee: brendan → tao
Component: Javascript Engine → Layout
First, this is almost certainly not a JS engine bug.  The JS engine uses UCS-2
character arrays for strings, without combining sequences and other hard stuff,
per ECMA-262 (ISO 16262) and very much like Java.  It does not and should not
know anything about utf-8 or other encodings.
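
As a rough illustration of the code-unit model described above (a sketch run against a modern JS engine, not engine source; the helper name codeUnits is made up here), a \u escape in script source becomes a single 16-bit unit, and a combining sequence is simply stored as separate units:

```javascript
// Hypothetical helper: list the 16-bit code units of a string as 4-digit hex.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).padStart(4, "0"));
  }
  return units;
}

// A \u escape in JS source is resolved by the parser into one code unit.
console.log(codeUnits("\u88e6")); // [ '88e6' ]

// A base letter plus a combining accent stays as two separate code units.
console.log(codeUnits("e\u0301")); // [ '0065', '0301' ]
```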

If you use utf-8 in a document, inside of a script tag or not, you must specify
the doc's charset as utf-8 to get it transcoded by code in htmlparser to a UCS-2
string.  The substrings of that string that lie within script tags are then
parsed by the JS engine.  But the external scripts loaded by script src= are not
necessarily so converted.

Looking at
http://lxr.mozilla.org/mozilla/source/intl/strres/tests/strres-test.xul (I think
this bug's URL field should point to strres-test.xul, not strres-test.js), I see
an encoding="UTF-8" attribute on the mandatory-first ?xml tag.  But it's not
clear to me that the encoding attribute on ?xml must govern the script loaded
via html:script src= -- in fact I think it's unlikely, and can't find a clear
statement in the HTML 4.0 spec that requires the doc's charset to be used if no
charset attribute is given in the script tag.  My memory of corresponding with
Dave Raggett of w3.org in the HTML 4.0 days is muddled.

Tao, can you please try adding charset="UTF-8" to the html:script src= tag?

Cc'ing nisheeth for XML advice.

/be
I tried that before logging this bug. It didn't make any difference. I believe
that, in Seamonkey, an XML doc is treated as UTF-8 if the charset is not set.
Status: NEW → ASSIGNED
Target Milestone: M9
I'll look into the C++ code to see if the data got corrupted in between modules.
Adding charset="UTF-8" should work and this is definitely a bug.

Long ago, we talked about (and I thought implemented) supporting .jsu file
extension for UTF-8 javascript files.  Does that work?

I don't remember if we reached resolution on how to treat the charset encoding
of an unlabeled external .js file.  Since it could be sourced from multiple
docs, using the doc encoding may not be the "right" thing.

Maybe we should default like XML: (1) use the labeled charset, or
(2) if there is a Unicode Byte Order Mark (BOM), then it's UCS-2 (or UTF-16)
big/little endian, (3) otherwise it's UTF-8?
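
The defaulting order proposed above could be sketched roughly as follows. This is only an illustration of the three steps, not an implemented policy; the function name guessScriptCharset and its parameters are hypothetical.

```javascript
// Sketch of the proposed order: label, then BOM, then UTF-8.
// bytes: array-like of byte values; labeledCharset: string or null/undefined.
function guessScriptCharset(bytes, labeledCharset) {
  // (1) An explicit label wins.
  if (labeledCharset) {
    return labeledCharset;
  }
  // (2) A byte order mark signals UTF-16 and its endianness.
  if (bytes.length >= 2) {
    if (bytes[0] === 0xfe && bytes[1] === 0xff) return "UTF-16BE";
    if (bytes[0] === 0xff && bytes[1] === 0xfe) return "UTF-16LE";
  }
  // (3) Otherwise assume UTF-8.
  return "UTF-8";
}

console.log(guessScriptCharset([0xfe, 0xff, 0x00, 0x41], null)); // UTF-16BE
console.log(guessScriptCharset([0x61, 0x62], null));             // UTF-8
console.log(guessScriptCharset([0x61, 0x62], "Shift_JIS"));      // Shift_JIS
```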
I hope someone tried adding charset="UTF-8" to the html:script tag, and filed a
bug on layout folks (nisheeth?) if that didn't work.

The deal in 4.x was this: .jsu file extension is mapped by our server to the
MIME type application/x-javascript; charset="utf-8" (my memory may be fuzzy on
the exact timing of this change).  I don't remember whether we made the browser
respect charset= on script in the old layout engine, but that wasn't a priority
as we reasoned that the server admin and .js file author know best, not all the
potential includers of the .js file.
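
For illustration, a consumer that trusts the server-supplied MIME type would need to pull the charset parameter out of it. The sketch below assumes a simple header value like the one quoted above; the function name charsetFromContentType is hypothetical and the parsing is deliberately minimal, not a full MIME parameter parser.

```javascript
// Extract the charset parameter from a Content-Type value, or return null.
function charsetFromContentType(contentType) {
  // Accept both charset=utf-8 and charset="utf-8".
  const m = contentType.match(/;\s*charset\s*=\s*"?([^";\s]+)"?/i);
  return m ? m[1].toLowerCase() : null;
}

console.log(charsetFromContentType('application/x-javascript; charset="utf-8"')); // utf-8
console.log(charsetFromContentType("application/x-javascript"));                  // null
```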

As bobj says, using the doc charset seems wrong and I can't find any requirement
to do that in the HTML 4.0 spec at http://www.w3.org.

/be
Duh, I mean "I hope someone tried charset= on html:script src= and if that
failed, then I hope someone will reassign this bug to nisheeth" -- or whoever in
layout owns hooking up the converter for script src=.  Sorry for suggesting that
a new bug be filed.  Either this bug is against layout, or it should be fixed by
tao using charset= on html:script around line 20 of the testcase URL above (I
set that to the .xul file).

/be
As I said earlier in this bug report, I did set charset="UTF-8" in the
<html:script > tag before logging this bug. It did not work.

If I tweak the property file parser to decode the file as UTF-8 encoded,
then the utf-8 encoded double-byte characters are displayed correctly.


The escaped Unicode sequences are stored as an array of 2-byte unichars, one per
source character. For example, "\u9cdf" takes up 6 unichars as opposed to 1.
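
To make the 6-versus-1 difference concrete, here is a minimal sketch of collapsing such escapes into single code units. It is written in JS purely for illustration and is not the converter later tracked in bug 11725; the function name unescapeUnicode is hypothetical, and it only handles well-formed \uXXXX sequences.

```javascript
// Replace each literal backslash-u-XXXX sequence with the code unit it names.
function unescapeUnicode(raw) {
  return raw.replace(/\\u([0-9a-fA-F]{4})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

// "\\u9cdf" in JS source is the six characters: backslash, u, 9, c, d, f,
// which is how the escape arrives when read verbatim from a properties file.
const raw = "\\u9cdf";
const converted = unescapeUnicode(raw);
console.log(raw.length, converted.length); // 6 1
```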


Is there a converted to identify and convert the array to UCS-2?
Oops, the last sentence should read as

"Is there a converter to identify and convert the array to UCS-2?"
QA Contact: cbegle → teruko
changing qa contact to teruko
Severity: blocker → critical
Priority: P3 → P1
Whiteboard: Need to determine where to convert the escape-unicode
Target Milestone: M9 → M10
I have a patch to fix the truncated ASCII text problem. The non-English text
display problem has been isolated to the property file loader code.



Moving to M10 since it isn't an M9 blocker. Raising priority to P1 since it
blocks non-English text display.
Blocks: 7820
Assignee: tao → ftang
Status: ASSIGNED → NEW
Whiteboard: Need to determine where to convert the escape-unicode → Need a converter for escape-unicode
Frank, I am reassigning this bug to you per our earlier discussion. Please mark
it fixed when the escaped unicode converter is done. Also update the status
summary.
Status: NEW → ASSIGNED
Depends on: 11725
Assignee: ftang → tao
Status: ASSIGNED → NEW
Created a separate bug about \u converters (11725) and marked this bug as
dependent on it. Reassigning this bug back to tao; tao should fix it after I
fix 11725.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Tao, is there any test case that Teruko can follow to verify this bug? If not,
can you verify it yourself? Thanks!
Hi, Teruko:

I believe that you can verify this via "QA|String Bundle Test" or simply type
in "resource:/res/strres-test.xul" in the location bar to load a XUL page
that has 3 cmd buttons at the bottom of the page.

Clicking on the middle one will trigger a JS function which retrieves a CJK
character, "loyalty", and sets the middle button's label to it.

Let me know if this does not work.
Status: RESOLVED → VERIFIED
I verified this in the 9-28 build.