Closed Bug 68210 Opened 24 years ago Closed 24 years ago

String.split() shows non-ASCII characters as Unicode

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

VERIFIED INVALID

People

(Reporter: teruko, Assigned: rogerl)

Details

(Whiteboard: [HTML testcase files are corrupted; save the text version!!][js1.2])

Attachments

(3 files)

<HTML>
Test case

<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Shift_JIS">
<SCRIPT LANGUAGE="JavaScript1.2">
myVar = "  これは 日本語 表示です。  ";
splits = myVar.split(" ", 3);
document.write(splits);
</script>
</HTML>

After the Japanese string is split, the result is displayed as Unicode escape sequences, as follows:

["\u3053\u308C\u306F", "\u65E5\u672C\u8A9E",
"\u8868\u793A\u3067\u3059\u3002"]

The Japanese characters should be displayed.

I tested this with Netscape 6 RTM, the 2001020604 Mtrunk build, and 4.x.

Roger said:
The result you're seeing occurs because, when the array returned by split is
converted to a string, the individual substrings are escape-mapped before being
joined. This is specific to version 1.2 and only happens during the
array.toString part of the call. If you access the array elements individually,
you won't get this behaviour.
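
As a minimal sketch of the workaround described above, the elements can be read one at a time and concatenated by hand, so the implicit array-to-string conversion is never involved. The helper name joinElements is hypothetical, and the array literal is only a stand-in for the result of the split() call in the test case.

```javascript
// Hypothetical helper: concatenate array elements one by one, so the
// array's own toString() is never called.
function joinElements(parts, separator) {
  var out = "";
  for (var i = 0; i < parts.length; i++) {
    if (i > 0) {
      out += separator;
    }
    out += parts[i]; // each element is a plain string
  }
  return out;
}

// Stand-in for the array returned by split() in the test case.
var splits = ["これは", "日本語", "表示です。"];
var text = joinElements(splits, ",");
// In a page, document.write(text) would then receive a plain string
// instead of an Array object.
```

Passing the resulting string to document.write() sidesteps the version-specific array conversion, at the cost of choosing the separator explicitly.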
Attached file HTML testcase
OK, the testcases do not run properly when loaded from the server!
In order to use them, you'll have to save them and run them locally.


Here is the output of the testcase:


Different versions of JavaScript: apply myString.split(" ") to the string:

                                                                                                                 
                          これは 日本語 侮ヲですB


JS version 1.1:

myArray.toSource() =   ["", "", "\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB", 
""]

myArray.toString() =     ,,これは,日本語,侮ヲですB,


myArray[0] =
myArray[1] =
myArray[2] = これは
myArray[3] = 日本語
myArray[4] = 侮ヲですB
myArray[5] =



JS version 1.2:

myArray.toSource() =   ["\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB"]

myArray.toString() =     ["\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB"]


myArray[0] = これは
myArray[1] = 日本語
myArray[2] = 侮ヲですB



JS version 1.3:

myArray.toSource() =   ["", "", "\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB", 
"", ""]

myArray.toString() =     ,,これは,日本語,侮ヲですB,,


myArray[0] =
myArray[1] =
myArray[2] = これは
myArray[3] = 日本語
myArray[4] = 侮ヲですB
myArray[5] =
myArray[6] =



JS version 1.4:

myArray.toSource() =   ["", "", "\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB", 
"", ""]

myArray.toString() =     ,,これは,日本語,侮ヲですB,,


myArray[0] =
myArray[1] =
myArray[2] = これは
myArray[3] = 日本語
myArray[4] = 侮ヲですB
myArray[5] =
myArray[6] =



JS version 1.5:

myArray.toSource() =   ["", "", "\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB", 
"", ""]

myArray.toString() =     ,,これは,日本語,侮ヲですB,,


myArray[0] =
myArray[1] =
myArray[2] = これは
myArray[3] = 日本語
myArray[4] = 侮ヲですB
myArray[5] =
myArray[6] =
The testcase illustrates what Roger said above: JS1.2 differs from
all other JS versions in its treatment of Array.toString().


In JS1.2, Array.toString() is the same as Array.toSource(). In all 
other versions of JS, Array.toString() and Array.toSource() are different.
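
To make the difference concrete, here is a rough sketch that imitates the two output styles in plain JavaScript. The helpers escapeNonAscii and sourceLike are hypothetical and only approximate the escaped form seen in this report (\xHH below 0x100, \uHHHH above); they are not the engine's actual toSource() code, and they do not escape quotes, backslashes, or control characters.

```javascript
// Hypothetical helper: escape non-ASCII characters in the style of the
// toSource() output quoted in this report.
function escapeNonAscii(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (code < 0x80) {
      out += str.charAt(i); // ASCII passes through unchanged
    } else {
      var hex = code.toString(16).toUpperCase();
      if (code < 0x100) {
        out += "\\x" + hex; // always two hex digits in 0x80..0xFF
      } else {
        while (hex.length < 4) {
          hex = "0" + hex;
        }
        out += "\\u" + hex;
      }
    }
  }
  return out;
}

// Hypothetical helper: build a bracketed, quoted, escaped listing.
function sourceLike(parts) {
  var quoted = [];
  for (var i = 0; i < parts.length; i++) {
    quoted.push('"' + escapeNonAscii(parts[i]) + '"');
  }
  return "[" + quoted.join(", ") + "]";
}

var parts = ["これは", "日本語"];
var plain = parts.join(",");     // comma-joined, as toString() gives outside JS1.2
var escaped = sourceLike(parts); // escaped listing, similar to the JS1.2 output
```

With these inputs, plain keeps the Japanese characters, while escaped contains the same \u3053\u308C\u306F and \u65E5\u672C\u8A9E sequences shown in the original report.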


String.split() returns an Array object. When that array is passed to
document.write(), Array.toString() is applied to it. In JS1.2, that gives you
the Array.toSource() output, unlike other versions of JS. That's why the
result of String.split() looks odd in JS1.2.


Note, however, that even in JS1.2 the individual elements myArray[i]
appear without any Unicode escape-mapping.


Note: all comments apply to the Netscape implementation of JavaScript.
They do not apply to Microsoft's implementation. Note also that
Microsoft's JavaScript implementation does not have a toSource() method.
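
Since toSource() is not available in every implementation, a script that wants to print an array portably can check for the method first. The following is a small sketch under that assumption; the helper name describeArray is hypothetical.

```javascript
// Hypothetical helper: describe an array without assuming that the
// non-standard toSource() method exists.
function describeArray(arr) {
  if (typeof arr.toSource === "function") {
    return arr.toSource(); // engine-specific output
  }
  return arr.join(",");    // portable fallback
}

var description = describeArray(["a", "b"]);
```

The fallback uses join(), which is available on arrays in every implementation discussed here.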

Summary: Javascript-Split shows Non-ASCII characters as unicode → String.split() shows non-ASCII characters as Unicode
Whiteboard: [Testcases DON'T WORK OFF SERVER; save and run locally!]
If you want to run the HTML testcase, you'll have to save the TEXT of it
that I've attached above (attachment id = 24963), and run it locally.


I don't know why, but the HTML files got corrupted when saved to the
Bugzilla server. 


The string you test, and the split() command you apply to the string,
can be adjusted in these variables in the file: 


          var myString = "  これは 日本語 表示です。  ";

          var mySplitCommand =  'myString.split(" ")';
Based on Roger's explanation, which the testcase confirmed, I have to
mark this bug as invalid. In order to get the Japanese characters to display,
you should use:


                <SCRIPT LANGUAGE="JavaScript"> 
NOT

                <SCRIPT LANGUAGE="JavaScript1.2">



And even in JavaScript1.2, if you access the array elements individually
(myArray[i]), you will get the Japanese characters and not Unicode escaping.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → INVALID
Marking Verified.
Status: RESOLVED → VERIFIED
Whiteboard: [Testcases DON'T WORK OFF SERVER; save and run locally!] → [HTML testcase files are corrupted; save the text version!!]
Whiteboard: [HTML testcase files are corrupted; save the text version!!] → [HTML testcase files are corrupted; save the text version!!][js1.2]