When running test262 in the shell (currently requires a patch from bug 669766), most of the ietestcenter tests fail with a syntax error in the first line:
jit-test/tests/ietestcenter/chapter07/7.3/7.3-9.js:1: SyntaxError: illegal character:
jit-test/tests/ietestcenter/chapter07/7.3/7.3-9.js:1: /// Copyright (c) 2009 Microsoft Corporation
I don't really know anything about this, but the file type reported for the failing files is:
"UTF-8 Unicode (with BOM) English text, with CRLF line terminators"
When the JS shell loads files, what does it assume about the character encoding?
I'm looking into this. I tried a quick script to convert the UTF8 to ASCII, but the files contain non-ASCII characters, so that doesn't work.
So the shell assumes ASCII and transcodes it to UCS2 (or UTF16, I'm not sure), much as Firefox transcodes from whatever encoding a file was downloaded in.
So the only fixes I can think of are:
- figure out the encoding, and transcode it to UCS2
- assume UTF-8 instead of assuming ASCII
- ignore the problem, and measure test262 using the browser
I think I like option 2 the best. AIUI, UTF8 is a strict superset of ASCII, so we break nothing by assuming it. Also, I think UTF8 is the only encoding we could use that handles all characters while also not breaking stuff for users. Finally, I suspect it can be made to work by setting JS_C_STRINGS_ARE_UTF8, and so involves minimal work.
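To illustrate the superset claim, here is a minimal Python sketch (not shell code; the sample sources and the helper name `decodes_identically` are made up for this comment). Pure-ASCII input decodes to the same text under ASCII and UTF8, while input containing non-ASCII bytes only decodes under UTF8.

```python
# Illustrative only: these sample sources are invented, not taken from test262.
ascii_source = b"var x = 1;\r\n"
utf8_source = "var s = 'caf\u00e9';\r\n".encode("utf-8")


def decodes_identically(data: bytes) -> bool:
    """Return True if data decodes to the same text as ASCII and as UTF-8."""
    try:
        return data.decode("ascii") == data.decode("utf-8")
    except UnicodeDecodeError:
        # The data contains bytes outside the ASCII range (or is not valid UTF-8).
        return False


print(decodes_identically(ascii_source))  # True
print(decodes_identically(utf8_source))   # False
```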
(Pretty low priority, but worth a quick look)
No, I think I'm all wrong here. We already support UTF8 with the -U flag, and passing it doesn't solve the problem (though if it did, specifying -U would be much preferable to the solutions above).
I'm thinking we can just remove the BOM, either as a one-off for the test262 files, or during parsing. Wikipedia says of the BOM:
``While the Unicode Standard does allow a BOM in UTF-8, it does not require or recommend it. Byte order has no meaning in UTF-8 so a BOM serves only to identify a text stream or file as UTF-8.
The reason the BOM is recommended against is that it defeats the ASCII back-compatibility that is part of UTF-8's design. Many existing pieces of software can handle UTF-8 inside the text but not at the start. For instance, the bytes of UTF-8 can be placed between the quotes of string constants in many programming languages, and that language will write the correct UTF-8 to a file or to a display, despite the language not knowing anything about UTF-8. This provides an easy migration path to convert systems to Unicode and to remove all legacy encodings. The unexpected three bytes of the BOM break this however, as they are located where they are certain to be a syntax error.``
Comment 3 is right on. It's easy to remove the BOM, and that fixes the problem. Patch coming.
Created attachment 545504
Check for BOM
This checks for, and skips, a byte-order mark on UTF8 files.
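As a rough illustration of the idea (a Python sketch, not the attached patch; `skip_utf8_bom` is a hypothetical name), the check compares the first three bytes of the file against the UTF8 encoding of the BOM and drops them if they match:

```python
import codecs


def skip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte-order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data


# Sample input modelled on the failing first line quoted in this bug.
source = codecs.BOM_UTF8 + b"/// Copyright (c) 2009 Microsoft Corporation\r\n"
print(skip_utf8_bom(source)[:3])  # b'///'
```

Only a BOM at the very start of the data is skipped; files without one pass through unchanged.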
Comment on attachment 545504
Check for BOM
Review of attachment 545504:
This is kinda hackish, but it arguably works and no one's really really going to care, so whatever.
and the followup