The JSC and Shell tools choke on UTF-8 BOM

RESOLVED FIXED

Status

RESOLVED FIXED
11 years ago
10 years ago

People

(Reporter: robert.flaherty, Unassigned)

Tracking

Details

(Reporter)

Description

11 years ago
User-Agent:       Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)
Build Identifier: 1.6R7

Several places in the code use classes/constructors for URL/file streams/readers which use the default encoding.  I think specs at least default to UTF-8, or if not (in the case of IE) only do UTF-8.  Anyway, here are the things to search on which will get you to the offending areas:

- "new FileReader", drop use of this class totally
- "new InputStreamReader", always pass in an encoding
- "Kit.readStream", check the use of what gets passed in and returned
- "Kit.readReader", check the use of what gets passed in and returned

Example changes I had to make (jsc/Main, shell/Main, shell/Global):
- "new FileReader(f)" to "new InputStreamReader(new FileInputStream(f), "UTF-8")"
- "new InputStreamReader(is)" to "new InputStreamReader(is, "UTF-8")"
- "result = new String(data)" to "result = new String(data, "UTF-8")"

I would assume a good behavior across all of Rhino would be to try to get an encoding from the charset header when a URL is used.  If that doesn't exist or it is a file, default to UTF-8.  I didn't go the distance in shell/Main.readFileOrUrl to parse the Content-Type header if its' a URL, but the comment "XXX: Use 'charset=' argument of Content-Type if URL?" alludes to this already.

Reproducible: Always

Steps to Reproduce:
1. Create any JS file in Notepad and save as UTF-8.
2. Run JSC and Shell on it.
3.
Actual Results:  
Both will choke saying that they encountered an illegal character.

Expected Results:  
Clean parse.

Comment 1

11 years ago
You're referring to RFC 4329, right? Yes, this is justified -- I can see how we'd want to be conformant with that... However, we need to think about some additional implications though first. I'm late for my morning gym, but will elaborate more in few hours.
Status: UNCONFIRMED → NEW
Ever confirmed: true

Comment 2

11 years ago
Okay. So, we have the following cases:

1. Source message has explicit charset parameter in its Content-type. We use the explicit charset. No-brainer.

2. Source message has Content-type application/javascript or application/ecmascript. We use the RFC 4329 section 4.2. "Character Encoding Scheme Detection" to detect the charset.

3. Source message has Content-type text/javascript or text/ecmascript. I have no idea what to do here. RFC 4329 7.1 and 8.1 both say "Encoding considerations: The same as the considerations in section 3.1 of [RFC3023]." RFC 3023 is the XML media types RFC, I happen to know this one rather well. In section 3.1 it says:

--- quote ---
Conformant with [RFC2046], if a text/xml entity is received with the charset parameter omitted, MIME processors and XML processors MUST use the default charset value of "us-ascii"[ASCII].  In cases where the XML MIME entity is transmitted via HTTP, the default charset value is still "us-ascii". (Note: There is an inconsistency between this specification and HTTP/1.1, which uses ISO-8859-1[ISO8859] as the default for a historical reason. Since XML is a new format, a new default should be chosen for better I18N. US-ASCII was chosen, since it is the intersection of UTF-8 and ISO-8859-1 and since it is already used by MIME.)
--- quote ends ---

So, we have several strategies for text/* content type. We can assume us-ascii, (RFC 3023 would mandate we do that) we can assume ISO-8859-1 in case the URL scheme is http or https, or we can assume UTF-8. Since text/javascript and text/ecmascript content types are obsoleted anyway, I'd suggest we fall back on us-ascii when we encounter these content types without an explicit charset. I believe that's the intent of RFC 3023 (and of RFC 4329 too, since it delegates to it).

4. Source has no associated content type (i.e. comes from a local file from a filesystem that has no concept of content types). I'd say this should be left to the application, and provide Rhino API for app to specify the content type and/or charset. 

In the special case where the application is either jsc or shell, we just assume application/javascript to be the content type when file extension is .js and application/ecmascript when the file extension is .es, and maybe introduce a "-encoding" command line switch (similar to the one javac has) for explicitly specifying a different charset.

How does all of this sound?

Comment 3

10 years ago
Ok, I added the following functionality to Rhino shell:

1. An -enc command line argument, i.e. "-enc utf-8" is now supported to specify default encoding.
2. When source is read from an URL and it has a ";charset=" declaration in Content-type, that is used.
3. When source is read from an URL without charset declaration in content type, or is read from a local file, then RFC 4329 4.2.2. logic is used to automatically recognize UTF-32LE, UTF-32BE, UTF-8, UTF-16LE, and UTF-16BE encodings.
4. If no encodings can be automatically recognized, the command line "-enc" value is used, if it exists.
5. If -enc does not exist, but the source is a URL and content type is application/*, UTF-8 is assumed
6. If -enc does not exist, but the source is a URL and content type is text/*, US-ASCII is assumed
7. In the remaining case (source is local file, no encoding was autodetected, and no -enc parameter was specified), we use file.encoding system property as the character encoding.

Console input is handled more simply:
1. if there is -enc, it is used
2. Otherwise, file.encoding system property is used as the character encoding.

Comment 4

10 years ago
    Checking in toolsrc/org/mozilla/javascript/tools/resources/Messages.properties;
    /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/resources/Messages.properties,v  <--  Messages.properties
    new revision: 1.36; previous revision: 1.35
    done
    RCS file: /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/ParsedContentType.java,v
    done
    Checking in toolsrc/org/mozilla/javascript/tools/shell/ParsedContentType.java;
    /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/ParsedContentType.java,v  <--  ParsedContentType.java
    initial revision: 1.1
    done
    Checking in toolsrc/org/mozilla/javascript/tools/shell/ShellContextFactory.java;
    /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/ShellContextFactory.java,v  <--  ShellContextFactory.java
    new revision: 1.12; previous revision: 1.11
    done
    Checking in toolsrc/org/mozilla/javascript/tools/shell/Main.java;
    /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/Main.java,v  <--  Main.java
    new revision: 1.77; previous revision: 1.76
    done

Comment 5

10 years ago
Additionally, shell no longer chokes on BOM:

    Checking in toolsrc/org/mozilla/javascript/tools/shell/Main.java;
    /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/Main.java,v  <--  Main.java
    new revision: 1.78; previous revision: 1.77
    done

I still need to look into adding the same smarts to jsc.

Comment 6

10 years ago
Okay, I extracted intelligent character encoding handling into a separate class used by both shell and jsc.
I also added -encoding switch to jsc, and renamed it from -enc to -encoding in shell too. The reason is that javac uses "-encoding" as well.

    Checking in toolsrc/org/mozilla/javascript/tools/jsc/Main.java;
    /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/jsc/Main.java,v  <--  Main.java
    new revision: 1.16; previous revision: 1.15
    done
    RCS file: /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/SourceReader.java,v
    done
    Checking in toolsrc/org/mozilla/javascript/tools/SourceReader.java;
    /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/SourceReader.java,v  <--  SourceReader.java
    initial revision: 1.1
    done
    Checking in toolsrc/org/mozilla/javascript/tools/resources/Messages.properties;
    /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/resources/Messages.properties,v  <--  Messages.properties
    new revision: 1.37; previous revision: 1.36
    done
    Checking in toolsrc/org/mozilla/javascript/tools/shell/Main.java;
    /cvsroot/mozilla/js/rhino/toolsrc/org/mozilla/javascript/tools/shell/Main.java,v  <--  Main.java
    new revision: 1.79; previous revision: 1.78
    done

Updated

10 years ago
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED

Updated

10 years ago
OS: Windows XP → All
Hardware: PC → All
You need to log in before you can comment on or make changes to this bug.