399347 - The JSC and Shell tools choke on UTF-8 BOM

Reporter

Description

•

18 years ago

User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1) Build Identifier: 1.6R7 Several places in the code use classes/constructors for URL/file streams/readers which use the default encoding. I think specs at least default to UTF-8, or if not (in the case of IE) only do UTF-8. Anyway, here are the things to search on which will get you to the offending areas: - "new FileReader", drop use of this class totally - "new InputStreamReader", always pass in an encoding - "Kit.readStream", check the use of what gets passed in and returned - "Kit.readReader", check the use of what gets passed in and returned Example changes I had to make (jsc/Main, shell/Main, shell/Global): - "new FileReader(f)" to "new InputStreamReader(new FileInputStream(f), "UTF-8")" - "new InputStreamReader(is)" to "new InputStreamReader(is, "UTF-8")" - "result = new String(data)" to "result = new String(data, "UTF-8")" I would assume a good behavior across all of Rhino would be to try to get an encoding from the charset header when a URL is used. If that doesn't exist or it is a file, default to UTF-8. I didn't go the distance in shell/Main.readFileOrUrl to parse the Content-Type header if its' a URL, but the comment "XXX: Use 'charset=' argument of Content-Type if URL?" alludes to this already. Reproducible: Always Steps to Reproduce: 1. Create any JS file in Notepad and save as UTF-8. 2. Run JSC and Shell on it. 3. Actual Results: Both will choke saying that they encountered an illegal character. Expected Results: Clean parse.

Attila Szegedi

Comment 1

•

18 years ago

You're referring to RFC 4329, right? Yes, this is justified -- I can see how we'd want to be conformant with that... However, we need to think about some additional implications though first. I'm late for my morning gym, but will elaborate more in few hours.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Attila Szegedi

Comment 2

•

18 years ago

Okay. So, we have the following cases: 1. Source message has explicit charset parameter in its Content-type. We use the explicit charset. No-brainer. 2. Source message has Content-type application/javascript or application/ecmascript. We use the RFC 4329 section 4.2. "Character Encoding Scheme Detection" to detect the charset. 3. Source message has Content-type text/javascript or text/ecmascript. I have no idea what to do here. RFC 4329 7.1 and 8.1 both say "Encoding considerations: The same as the considerations in section 3.1 of [RFC3023]." RFC 3023 is the XML media types RFC, I happen to know this one rather well. In section 3.1 it says: --- quote --- Conformant with [RFC2046], if a text/xml entity is received with the charset parameter omitted, MIME processors and XML processors MUST use the default charset value of "us-ascii"[ASCII]. In cases where the XML MIME entity is transmitted via HTTP, the default charset value is still "us-ascii". (Note: There is an inconsistency between this specification and HTTP/1.1, which uses ISO-8859-1[ISO8859] as the default for a historical reason. Since XML is a new format, a new default should be chosen for better I18N. US-ASCII was chosen, since it is the intersection of UTF-8 and ISO-8859-1 and since it is already used by MIME.) --- quote ends --- So, we have several strategies for text/* content type. We can assume us-ascii, (RFC 3023 would mandate we do that) we can assume ISO-8859-1 in case the URL scheme is http or https, or we can assume UTF-8. Since text/javascript and text/ecmascript content types are obsoleted anyway, I'd suggest we fall back on us-ascii when we encounter these content types without an explicit charset. I believe that's the intent of RFC 3023 (and of RFC 4329 too, since it delegates to it). 4. Source has no associated content type (i.e. comes from a local file from a filesystem that has no concept of content types). I'd say this should be left to the application, and provide Rhino API for app to specify the content type and/or charset. In the special case where the application is either jsc or shell, we just assume application/javascript to be the content type when file extension is .js and application/ecmascript when the file extension is .es, and maybe introduce a "-encoding" command line switch (similar to the one javac has) for explicitly specifying a different charset. How does all of this sound?

Attila Szegedi

Comment 3

•

17 years ago

Ok, I added the following functionality to Rhino shell: 1. An -enc command line argument, i.e. "-enc utf-8" is now supported to specify default encoding. 2. When source is read from an URL and it has a ";charset=" declaration in Content-type, that is used. 3. When source is read from an URL without charset declaration in content type, or is read from a local file, then RFC 4329 4.2.2. logic is used to automatically recognize UTF-32LE, UTF-32BE, UTF-8, UTF-16LE, and UTF-16BE encodings. 4. If no encodings can be automatically recognized, the command line "-enc" value is used, if it exists. 5. If -enc does not exist, but the source is a URL and content type is application/*, UTF-8 is assumed 6. If -enc does not exist, but the source is a URL and content type is text/*, US-ASCII is assumed 7. In the remaining case (source is local file, no encoding was autodetected, and no -enc parameter was specified), we use file.encoding system property as the character encoding. Console input is handled more simply: 1. if there is -enc, it is used 2. Otherwise, file.encoding system property is used as the character encoding.

Attila Szegedi

Comment 4

•

17 years ago

Checking in toolsrc/org/mozilla/javascript/tools/resources/Messages.properties; /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/resources/Messages.properties,v <-- Messages.properties new revision: 1.36; previous revision: 1.35 done RCS file: /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/ParsedContentType.java,v done Checking in toolsrc/org/mozilla/javascript/tools/shell/ParsedContentType.java; /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/ParsedContentType.java,v <-- ParsedContentType.java initial revision: 1.1 done Checking in toolsrc/org/mozilla/javascript/tools/shell/ShellContextFactory.java; /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/ShellContextFactory.java,v <-- ShellContextFactory.java new revision: 1.12; previous revision: 1.11 done Checking in toolsrc/org/mozilla/javascript/tools/shell/Main.java; /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/Main.java,v <-- Main.java new revision: 1.77; previous revision: 1.76 done

Attila Szegedi

Comment 5

•

17 years ago

Additionally, shell no longer chokes on BOM: Checking in toolsrc/org/mozilla/javascript/tools/shell/Main.java; /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/Main.java,v <-- Main.java new revision: 1.78; previous revision: 1.77 done I still need to look into adding the same smarts to jsc.

Attila Szegedi

Comment 6

•

17 years ago

Okay, I extracted intelligent character encoding handling into a separate class used by both shell and jsc. I also added -encoding switch to jsc, and renamed it from -enc to -encoding in shell too. The reason is that javac uses "-encoding" as well. Checking in toolsrc/org/mozilla/javascript/tools/jsc/Main.java; /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/jsc/Main.java,v <-- Main.java new revision: 1.16; previous revision: 1.15 done RCS file: /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/SourceReader.java,v done Checking in toolsrc/org/mozilla/javascript/tools/SourceReader.java; /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/SourceReader.java,v <-- SourceReader.java initial revision: 1.1 done Checking in toolsrc/org/mozilla/javascript/tools/resources/Messages.properties; /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/resources/Messages.properties,v <-- Messages.properties new revision: 1.37; previous revision: 1.36 done Checking in toolsrc/org/mozilla/javascript/tools/shell/Main.java; /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/Main.java,v <-- Main.java new revision: 1.79; previous revision: 1.78 done

Attila Szegedi

Updated

•

17 years ago

Status: NEW → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

Attila Szegedi

Updated

•

17 years ago

OS: Windows XP → All

Hardware: PC → All

Bugzilla

The JSC and Shell tools choke on UTF-8 BOM

Categories

(Rhino Graveyard :: Core, defect)

Tracking

(Not tracked)

People

(Reporter: robert.flaherty, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Updated

Updated