User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)
Build Identifier: 1.6R7

Several places in the code use classes/constructors for URL/file streams/readers that fall back to the platform default encoding. I think the specs at least default to UTF-8, and where they are not followed (as in IE), only UTF-8 is supported.

Anyway, here are the things to search on which will get you to the offending areas:
- "new FileReader", drop use of this class totally
- "new InputStreamReader", always pass in an encoding
- "Kit.readStream", check the use of what gets passed in and returned
- "Kit.readReader", check the use of what gets passed in and returned

Example changes I had to make (jsc/Main, shell/Main, shell/Global):
- "new FileReader(f)" to "new InputStreamReader(new FileInputStream(f), "UTF-8")"
- "new InputStreamReader(is)" to "new InputStreamReader(is, "UTF-8")"
- "result = new String(data)" to "result = new String(data, "UTF-8")"

I would assume a good behavior across all of Rhino would be to try to get the encoding from the charset parameter of the Content-Type header when a URL is used. If that doesn't exist, or the source is a file, default to UTF-8. I didn't go the distance in shell/Main.readFileOrUrl to parse the Content-Type header if it's a URL, but the comment "XXX: Use 'charset=' argument of Content-Type if URL?" alludes to this already.

Reproducible: Always

Steps to Reproduce:
1. Create any JS file in Notepad and save as UTF-8.
2. Run JSC and Shell on it.

Actual Results: Both will choke saying that they encountered an illegal character.

Expected Results: Clean parse.
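For illustration, the replacements listed in the report can be put together in a small standalone sketch. The class and method names here (ExplicitEncodingExample, openUtf8, readFile, decodeUtf8) are hypothetical and are not part of Rhino; only the java.io calls are real.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper class, not Rhino code.
public class ExplicitEncodingExample {

    // Instead of "new FileReader(f)", which uses the platform default
    // encoding, open the file with the charset named explicitly.
    static Reader openUtf8(File f) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(f), "UTF-8"));
    }

    // Read the whole file into a String, decoding it as UTF-8.
    static String readFile(File f) throws IOException {
        Reader r = openUtf8(f);
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            r.close();
        }
    }

    // Instead of "new String(data)", name the charset when decoding bytes.
    static String decodeUtf8(byte[] data) throws IOException {
        return new String(data, "UTF-8");
    }
}
```

With this approach, the result no longer depends on the file.encoding of the machine running the shell or compiler.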
You're referring to RFC 4329, right? Yes, this is justified; I can see why we'd want to conform to it. However, we first need to think through some additional implications. I'm late for my morning gym, but will elaborate in a few hours.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Ok, I added the following functionality to the Rhino shell:
1. An -enc command line argument is now supported to specify the default encoding, e.g. "-enc utf-8".
2. When source is read from a URL and it has a ";charset=" declaration in Content-Type, that is used.
3. When source is read from a URL without a charset declaration in Content-Type, or is read from a local file, the logic from RFC 4329 section 4.2.2 is used to automatically recognize the UTF-32LE, UTF-32BE, UTF-8, UTF-16LE, and UTF-16BE encodings.
4. If no encoding can be automatically recognized, the command line "-enc" value is used, if it exists.
5. If -enc does not exist, but the source is a URL and the content type is application/*, UTF-8 is assumed.
6. If -enc does not exist, but the source is a URL and the content type is text/*, US-ASCII is assumed.
7. In the remaining case (source is a local file, no encoding was autodetected, and no -enc parameter was specified), we use the file.encoding system property as the character encoding.

Console input is handled more simply:
1. If there is -enc, it is used.
2. Otherwise, the file.encoding system property is used as the character encoding.
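The precedence described above can be sketched roughly as follows. This is not the code that was committed to the shell: the class, method, and parameter names are hypothetical, and it assumes that the RFC 4329 section 4.2.2 detection amounts to checking the leading bytes for a byte order mark. The fallback for a URL whose content type is neither application/* nor text/* is not stated in the comment, so falling through to file.encoding there is an assumption as well.

```java
// Hypothetical sketch of the encoding precedence; not the actual Rhino code.
public class EncodingChoiceSketch {

    // Returns the encoding indicated by a leading byte order mark, or null.
    // UTF-32LE must be tested before UTF-16LE because both start with FF FE.
    static String detectBom(byte[] b) {
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // isUrl:           true for a URL source, false for a local file
    // declaredCharset: value of ";charset=" in Content-Type, or null
    // mediaType:       e.g. "application/javascript", or null
    // head:            the first bytes of the source
    // encOption:       value of the -enc argument, or null
    static String chooseEncoding(boolean isUrl, String declaredCharset,
                                 String mediaType, byte[] head,
                                 String encOption) {
        if (isUrl && declaredCharset != null) return declaredCharset; // step 2
        String bom = detectBom(head);                                 // step 3
        if (bom != null) return bom;
        if (encOption != null) return encOption;                      // step 4
        if (isUrl && mediaType != null) {
            if (mediaType.startsWith("application/")) return "UTF-8"; // step 5
            if (mediaType.startsWith("text/")) return "US-ASCII";     // step 6
        }
        // Step 7 for local files; for other URL content types this
        // fallback is an assumption, since the comment does not cover it.
        return System.getProperty("file.encoding");
    }
}
```

For example, under these assumptions a local file that starts with EF BB BF is read as UTF-8 even when "-enc utf-16" is given, because autodetection is consulted before the command line value.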
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED