Bug 671029 - test262 shell test failures due to UTF8+BOM
Status: RESOLVED FIXED
js-triage-done
Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: x86 Mac OS X
Importance: -- normal
Target Milestone: mozilla8
Assigned To: Paul Biggar
Blocks: 669766
Reported: 2011-07-12 12:06 PDT by Paul Biggar
Modified: 2011-07-19 08:11 PDT


Attachments
Check for BOM (1.96 KB, patch)
2011-07-12 15:16 PDT, Paul Biggar
jwalden+bmo: review+

Description Paul Biggar 2011-07-12 12:06:51 PDT
When running test262 in the shell (currently requires a patch from bug 669766), most of the ietestcenter tests fail with a syntax error in the first line:

jit-test/tests/ietestcenter/chapter07/7.3/7.3-9.js:1: SyntaxError: illegal character:
jit-test/tests/ietestcenter/chapter07/7.3/7.3-9.js:1: /// Copyright (c) 2009 Microsoft Corporation 
jit-test/tests/ietestcenter/chapter07/7.3/7.3-9.js:1: .^

I don't really know anything about this, but the filetype of the files that fail is:

"UTF-8 Unicode (with BOM) English text, with CRLF line terminators"
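
For illustration only, the "with BOM" part of that filetype can be checked directly: a UTF-8 BOM is the three bytes EF BB BF at the very start of the file. The following is a minimal Python sketch, not anything from the shell; the helper name has_utf8_bom and the sample bytes are made up for the example.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

# Hypothetical bytes resembling the start of a failing test file:
# the BOM, then the copyright comment, then a CRLF line terminator.
sample = b"\xef\xbb\xbf/// Copyright (c) 2009 Microsoft Corporation\r\n"

print(has_utf8_bom(sample))            # file with a BOM
print(has_utf8_bom(b"/// Copyright"))  # same text without one
```

A parser that treats the input as ASCII sees those three bytes as the first characters of line 1, which would be consistent with the error pointing at the start of the line.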
Comment 1 Boris Zbarsky [:bz] (still a bit busy) 2011-07-12 12:20:28 PDT
When the JS shell loads files, what does it assume about the character encoding?
Comment 2 Paul Biggar 2011-07-12 14:02:47 PDT
I'm looking into this. I tried a quick script to convert the UTF8 to ASCII, but there were non-ASCII characters in there.

So the shell assumes ASCII and transcodes it to UCS2 (or UTF16, I'm not sure), as Firefox does from whatever encoding it downloads in.

So the only fixes I can think of are:
- figure out the encoding, and transcode it to UCS2
- assume UTF-8 instead of assuming ASCII
- ignore the problem, and measure test262 using the browser

I think I like option 2 the best. AIUI, UTF8 is a strict superset of ASCII, so we break nothing by assuming it. Also, I think UTF8 is the only encoding we could use that handles all characters while also not breaking stuff for users. Finally, I suspect it can be made to work by setting JS_C_STRINGS_ARE_UTF8, and so involves minimal work.
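
The superset claim can be sketched outside the shell. This is a small Python illustration, not shell code, and the helper name decodes_as_ascii and the sample sources are hypothetical: pure ASCII bytes decode to the same text under either assumption, while UTF-8 bytes for a non-ASCII character are rejected by a strict ASCII decoder.

```python
def decodes_as_ascii(data: bytes) -> bool:
    """Return True if every byte in data is 7-bit ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False

# Pure ASCII input: assuming UTF-8 instead of ASCII changes nothing.
ascii_source = b"var x = 1; // plain ASCII\r\n"
same_text = ascii_source.decode("ascii") == ascii_source.decode("utf-8")

# A string literal containing U+00E9, encoded as UTF-8 (bytes C3 A9).
utf8_source = "var s = 'caf\u00e9';".encode("utf-8")

print(same_text)
print(decodes_as_ascii(utf8_source))
```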

(Pretty low priority, but worth a quick look)
Comment 3 Paul Biggar 2011-07-12 14:24:57 PDT
No, I think I'm all wrong here. We already support UTF8, with the -U flag, and this doesn't solve the problem (though if it did, specifying -U would be much preferable to the solutions above).

I'm thinking we can just remove the BOM, either once off for the test262 files, or during parsing. Wikipedia says of the BOM:

``While the Unicode Standard does allow a BOM in UTF-8,[2] it does not require or recommend it.[3] Byte order has no meaning in UTF-8[4] so a BOM serves only to identify a text stream or file as UTF-8.

The reason the BOM is recommended against is that it defeats the ASCII back-compatibility that is part of UTF-8's design. Many existing pieces of software can handle UTF-8 inside the text but not at the start. For instance, the bytes of UTF-8 can be placed between the quotes of string constants in many programming languages, and that language will write the correct UTF-8 to a file or to a display, despite the language not knowing anything about UTF-8. This provides an easy migration path to convert systems to Unicode and to remove all legacy encodings. The unexpected three bytes of the BOM break this however, as they are located where they are certain to be a syntax error.``
Comment 4 Paul Biggar 2011-07-12 14:52:44 PDT
Comment 3 is right on. It's easy to remove the BOM, and that fixes the problem. Patch coming.
Comment 5 Paul Biggar 2011-07-12 15:16:26 PDT
Created attachment 545504
Check for BOM

This checks for, and skips, a byte-order mark on UTF8 files.
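
As a rough picture of the idea (this is not the attached patch, which is not reproduced in this bug, and the function name skip_utf8_bom is hypothetical), checking for and skipping a leading BOM before the source reaches the parser could look like this in Python:

```python
import codecs

def skip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; other input is unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Hypothetical file contents: BOM followed by a comment and a statement.
with_bom = b"\xef\xbb\xbf/// Copyright\r\nvar x = 1;\r\n"
without_bom = b"/// Copyright\r\nvar x = 1;\r\n"

print(skip_utf8_bom(with_bom) == without_bom)
print(skip_utf8_bom(without_bom) == without_bom)
```

Only a BOM at the very start is touched, so files without one pass through unchanged. Python's built-in "utf-8-sig" codec applies the same rule when decoding.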
Comment 6 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-13 17:08:43 PDT
Comment on attachment 545504
Check for BOM

Review of attachment 545504:
-----------------------------------------------------------------

This is kinda hackish, but it arguably works and no one's really really going to care, so whatever.
Comment 7 Marco Bonardo [::mak] 2011-07-19 08:10:44 PDT
http://hg.mozilla.org/mozilla-central/rev/102481f5e2b9
Comment 8 Marco Bonardo [::mak] 2011-07-19 08:11:03 PDT
and the followup
http://hg.mozilla.org/mozilla-central/rev/52e36db1e8c7
