String.split() shows non-ASCII characters as Unicode

VERIFIED INVALID

Core
JavaScript Engine
17 years ago
14 years ago

People

(Reporter: Teruko Kobayashi, Assigned: rogerl (gone))

Details

(Whiteboard: [HTML testcase files are corrupted; save the text version!!][js1.2])

Attachments

(3 attachments)

(Reporter)

Description

17 years ago
Test case:

<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Shift_JIS">
<SCRIPT LANGUAGE="JavaScript1.2">
myVar = "  これは 日本語 表示です。  ";
splits = myVar.split(" ", 3);
document.write(splits);
</SCRIPT>
</HEAD>
</HTML>

After the Japanese string is split, the result is displayed as Unicode escapes, as follows:

["\u3053\u308C\u306F", "\u65E5\u672C\u8A9E",
"\u8868\u793A\u3067\u3059\u3002"]

The Japanese characters should be displayed.

I tested this with Netscape 6 RTM, the 2001020604 Mtrunk build, and 4.x.

Roger said:
The result you're seeing is because when the array result from split is
converted to a string the individual sub-strings are escape-mapped before being
joined. This is specific to version 1.2 and only happens during the
array.toString part of the call. If you access the array elements individually,
you won't get this behaviour.
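
A minimal sketch of the behaviour Roger describes, written for a current JavaScript engine. The helper names escapeNonAscii and js12StyleArrayToString are hypothetical and only imitate the escaped output format shown in this report; they are not the engine's actual code, and the sketch does not escape quotes or control characters.

```javascript
// Hypothetical helper: write each non-ASCII UTF-16 code unit as \xHH or \uHHHH,
// imitating the escaped form seen in this report.
function escapeNonAscii(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code < 0x80) {
      out += str[i]; // ASCII is left as-is (assumption: no quote/control escaping)
    } else if (code <= 0xff) {
      out += "\\x" + code.toString(16).toUpperCase().padStart(2, "0");
    } else {
      out += "\\u" + code.toString(16).toUpperCase().padStart(4, "0");
    }
  }
  return out;
}

// Hypothetical stand-in for the JS1.2 array-to-string conversion:
// each element is quoted and escape-mapped, then the pieces are joined.
function js12StyleArrayToString(arr) {
  return "[" + arr.map((s) => '"' + escapeNonAscii(s) + '"').join(", ") + "]";
}

const splits = ["これは", "日本語", "表示です。"];

// Whole-array conversion: the escaped form from the report.
const escaped = js12StyleArrayToString(splits);

// Individual element access: the string itself, no escaping involved.
const first = splits[0];
```

The escaping happens only in the array-to-string step; the element strings themselves are never modified.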

Comment 1

17 years ago
Created attachment 24959 [details]
HTML testcase

Comment 2

17 years ago
Created attachment 24960 [details]
HTML testcase (second try)

Comment 3

17 years ago
OK, the testcases do not run properly when loaded from the server!
To use them, you'll have to save them and run them locally.


Here is the output of the testcase:


Different versions of JavaScript: apply myString.split(" ") to the string:

                                                                                                                 
                          これは 日本語 侮ヲですB


JS version 1.1:

myArray.toSource() =   ["", "", "\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB", 
""]

myArray.toString() =     ,,これは,日本語,侮ヲですB,


myArray[0] =
myArray[1] =
myArray[2] = これは
myArray[3] = 日本語
myArray[4] = 侮ヲですB
myArray[5] =



JS version 1.2:

myArray.toSource() =   ["\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB"]

myArray.toString() =     ["\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB"]


myArray[0] = これは
myArray[1] = 日本語
myArray[2] = 侮ヲですB



JS version 1.3:

myArray.toSource() =   ["", "", "\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB", 
"", ""]

myArray.toString() =     ,,これは,日本語,侮ヲですB,,


myArray[0] =
myArray[1] =
myArray[2] = これは
myArray[3] = 日本語
myArray[4] = 侮ヲですB
myArray[5] =
myArray[6] =



JS version 1.4:

myArray.toSource() =   ["", "", "\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB", 
"", ""]

myArray.toString() =     ,,これは,日本語,侮ヲですB,,


myArray[0] =
myArray[1] =
myArray[2] = これは
myArray[3] = 日本語
myArray[4] = 侮ヲですB
myArray[5] =
myArray[6] =



JS version 1.5:

myArray.toSource() =   ["", "", "\u201A\xB1\u201A\xEA\u201A\xCD", 
"\u201C\xFA\u2013{\u0152\xEA", "\u2022\u017D\xA6\u201A\xC5\u201A\xB7\uFFFDB", 
"", ""]

myArray.toString() =     ,,これは,日本語,侮ヲですB,,


myArray[0] =
myArray[1] =
myArray[2] = これは
myArray[3] = 日本語
myArray[4] = 侮ヲですB
myArray[5] =
myArray[6] =

Comment 4

17 years ago
The testcase illustrates what Roger said above: JS1.2 differs from
all other JS versions in its treatment of Array.toString().


In JS1.2, Array.toString() is the same as Array.toSource(). In all 
other versions of JS, Array.toString() and Array.toSource() are different.


String.split() returns an Array object, and passing that Array to
document.write() applies Array.toString() to it. In JS1.2, that gives you
the Array.toSource() output, unlike other versions of JS. That's why the
result of String.split() looks funny in JS1.2.


Note, however, that even in JS1.2 the individual elements myArray[i]
appear without any Unicode escape-mapping.


Note: all comments apply to the Netscape implementation of JavaScript.
They do not apply to Microsoft's implementation. Note also that
Microsoft JavaScript does not have a toSource() method.
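
The implicit conversion described in this comment can be checked in a current engine with a short sketch. It assumes modern JavaScript, where there is no version switch and toSource() is not available, so it only shows the non-JS1.2 behaviour: converting the array to a string is the same as joining its elements with commas, with no escaping.

```javascript
// Same input string as the original test case.
const myString = "  これは 日本語 表示です。  ";
const myArray = myString.split(" ");

// What writing the array out effectively does: an implicit string conversion.
const implicit = String(myArray);

// The explicit call and a plain join produce the same text.
const explicit = myArray.toString();
const joined = myArray.join(",");
```

The leading and trailing commas in the result come from the empty strings that split(" ") produces for the spaces at both ends of the input.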

Summary: Javascript-Split shows Non-ASCII characters as unicode → String.split() shows non-ASCII characters as Unicode
Whiteboard: [Testcases DON'T WORK OFF SERVER; save and run locally!]

Comment 5

17 years ago
Created attachment 24963 [details]
Source of HTML testcase (as text file !!!)

Comment 6

17 years ago
If you want to run the HTML testcase, you'll have to save the TEXT of it
that I've attached above (attachment id = 24963), and run it locally.


I don't know why, but the HTML files got corrupted when saved to the
Bugzilla server. 


The string you test, and the split() command you apply to the string,
can be adjusted in these variables in the file: 


          var myString = "  これは 日本語 表示です。  ";

          var mySplitCommand =  'myString.split(" ")';

Comment 7

17 years ago
Based on Roger's explanation, which the testcase confirmed, I have to 
mark this bug as invalid. In order to get the Japanese characters to display,
you should use 


                <SCRIPT LANGUAGE="JavaScript"> 
NOT

                <SCRIPT LANGUAGE="JavaScript1.2">



And even in JavaScript1.2, if you access the array elements individually
(myArray[i]), you will get the Japanese characters and not Unicode escaping.
Status: NEW → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → INVALID
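
One caveat when dropping the JavaScript1.2 version, going by the testcase output in Comment 3: outside JS1.2, split(" ") keeps empty strings for the leading and trailing spaces, so the limit of 3 in the original test case no longer yields the three words. The sketch below assumes a current JavaScript engine; trim() and the /\s+/ separator are suggestions, not something taken from this bug.

```javascript
const myVar = "  これは 日本語 表示です。  ";

// Literal single-space separator: the two leading spaces produce empty
// strings, and the limit of 3 cuts the result off after the first word.
const literal = myVar.split(" ", 3);

// Trim first and split on runs of whitespace to get the three words.
const words = myVar.trim().split(/\s+/, 3);

// Access the elements individually, as recommended above, one line each.
const lines = words.map((word, i) => "splits[" + i + "] = " + word);
```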

Comment 8

17 years ago
Marking Verified.
Status: RESOLVED → VERIFIED

Updated

17 years ago
Whiteboard: [Testcases DON'T WORK OFF SERVER; save and run locally!] → [HTML testcase files are corrupted; save the text version!!]

Updated

14 years ago
Whiteboard: [HTML testcase files are corrupted; save the text version!!] → [HTML testcase files are corrupted; save the text version!!][js1.2]