Bug 648102 - UTF-8 support for shell read/snarf function
Status: RESOLVED FIXED
Whiteboard: fixed-in-tracemonkey
Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: x86 Mac OS X
Importance: -- normal
Target Milestone: ---
Assigned To: general
Depends on: 650526
Blocks: 635933 640855 648596 663552
 
Reported: 2011-04-06 13:41 PDT by Dave Herman [:dherman]
Modified: 2012-01-22 17:23 PST
CC: 6 users


Attachments
Quick patch to add -U which sets UTF-8 C strings (2.65 KB, patch)
2011-04-08 10:22 PDT, Wesley W. Garland
gal: review+
mattbasta: feedback+
print a smiley! (84 bytes, text/plain)
2011-04-08 11:31 PDT, Colin Walters
Legal JS program with UTF-8 chars in it (55 bytes, text/plain)
2011-04-08 12:02 PDT, Wesley W. Garland
series of patches to improve UTF-8 support in the JS shell (7.15 KB, patch)
2011-04-08 13:31 PDT, Colin Walters
gal: review+

Description Dave Herman [:dherman] 2011-04-06 13:41:31 PDT
The AMO validator needs to read in UTF-8-encoded files, and read/snarf only reads in bytes with some old encoding. But we have UTF-8 support in the engine so it shouldn't be too hard to create a UTF-8 variant.

This could be in the form of an optional second argument to read/snarf that is a string from a fixed set of recognized encoding names? Or maybe just a boolean to switch between the current encoding and UTF-8?

Dave
Comment 1 Wesley W. Garland 2011-04-07 12:06:04 PDT
FileAsString() reads the input file, converting the C string to a JS string via JS_NewStringCopyN().

It seems to me that the correct solution is not to add yet another code path, but rather to use the one we already have: call JS_SetCStringsAreUTF8() before JS_NewRuntime().
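
For reference, a minimal sketch of the call order being proposed here (the embedding boilerplate is illustrative, not the shell's actual main()):

#include "jsapi.h"

int main(void)
{
    JS_SetCStringsAreUTF8();    /* must be called before the first JS_NewRuntime() */
    JSRuntime *rt = JS_NewRuntime(8L * 1024L * 1024L);
    if (!rt)
        return 1;
    /* ... create a context and global object, compile and run scripts ... */
    JS_DestroyRuntime(rt);
    JS_ShutDown();
    return 0;
}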
Comment 2 Colin Walters 2011-04-07 12:08:58 PDT
(In reply to comment #1)
 
> It seems to me that the correct solution is not to add yet-another-code path,
> but rather use the one we already have: call JS_SetCStringsAreUTF8() before
> JS_NewRuntime()

But if you do that, you lose the ability to hold arbitrary binary data in strings, which is in turn necessary if your JavaScript code doesn't uniformly use Uint8Array.
Comment 3 Wesley W. Garland 2011-04-07 14:03:50 PDT
Where does the JS shell use C strings with bytes > 0x7f?
Comment 4 Colin Walters 2011-04-07 16:20:08 PDT
(In reply to comment #3)
> Where does the JS shell use C strings with bytes > 0x7f?

Sorry, I don't know all of the context here; it may be that in some limited use cases one can get away with JS_SetCStringsAreUTF8.  But in general it's web-incompatible at least.
Comment 5 Wesley W. Garland 2011-04-08 06:03:17 PDT
Colin, whether or not a pointer to char represents ISO-8859-1 or UTF-8 data in the source to your web browser (or, in this case, JS shell) is completely orthogonal to web-compatibility.
Comment 6 Colin Walters 2011-04-08 08:42:41 PDT
(In reply to comment #5)
> Colin, whether or not a pointer to char represents ISO-8859-1 or UTF-8 data in
> the source to your web browser (or, in this case, JS shell) is completely
> orthognal web-compatibilty.

Setting it makes string handling *very* different from Firefox.  Note that Spidermonkey doesn't really try to enforce ISO-8859-1 - in practice it's just "arbitrary binary data".

Let me define "web compatible" here by saying basically if Firefox were to just invoke JS_SetCStringsAreUTF8 today with no other changes, it would probably fail to run many web pages.  For example, XPConnect would no longer pass through binary strings, but would throw an exception.

On this particular bug, we have a version of this in GJS too; I think the right approach for the bug is what Dave suggested - a new variant of JS_CompileScript with an encoding argument.


Although, it *would* be nice to eventually switch Spidermonkey over to converting to UTF-8 by default, and have callers that want access to raw string data just use JS_GetStringCharsAndLength (though even more ideally have a version that returns const uint8* instead of pretending it's valid Unicode).
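
For illustration, a sketch of that access pattern (a hypothetical helper, not Firefox or shell code): JS_GetStringCharsAndLength() hands back the raw UTF-16 code units, so no C-string conversion is involved.

#include <stdio.h>
#include "jsapi.h"

/* Dump a string's raw code units; surrogates and "binary" values pass through untouched. */
static JSBool
DumpStringUnits(JSContext *cx, JSString *str)
{
    size_t len;
    const jschar *chars = JS_GetStringCharsAndLength(cx, str, &len);
    if (!chars)
        return JS_FALSE;
    for (size_t i = 0; i < len; i++)
        printf("%04x ", chars[i]);
    printf("\n");
    return JS_TRUE;
}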

Basically anything calling any of the JS_EncodeString family needs auditing if you're going to set this flag.  As it is now in the JS shell, it is used for printing out values, as well as by the Unix exec() wrapper.

Previously, it looks like printing a binary string would just send it straight to your terminal; actually this is a bug either way - in GJS, when printing things, we detect if the value isn't valid UTF-8 and squash it.  Without modification, binary strings would throw an exception and fail to print.

For the exec() case, you'd no longer be able to pass binary data as arguments to subprocesses.
Comment 7 Wesley W. Garland 2011-04-08 09:44:13 PDT
So we understand that this switch doesn't affect web correctness; it affects how the embedding or browser communicates with the JS engine.

The proper way to get Unicode data into the JS engine is to transform it to UTF-16 (native endian), from whatever the input encoding happens to be, and to feed it to JS_CompileUCScript().

IMO, the JS engine should NOT be doing arbitrary charset transformation: that is iconv's job; also, the browser already knows how to do this.

Anyhow, Dave wasn't suggesting we add that to the JS engine, only to add a helper in the shell. My point is that it's probably easier to just turn on UTF-8 C strings than to write a routine to do this.

For the exec() case - is this really a concern? It's not even built that way unless #if defined(SHELL_HACK) && defined(DEBUG) && defined(XP_UNIX)
Comment 8 Colin Walters 2011-04-08 10:00:04 PDT
(In reply to comment #7)
> So we understand that this switch doesn't affect web correctness, it affects
> how the embedding or browser communicates with the JS engine. 

True, strictly speaking.  But my point remains that Firefox would break messily if this flag was set blindly.

> The proper way to get Unicode data into the JS engine is to transform it to
> UTF-16 (native endian), from whatever the input encoding happens to be, and to
> feed it JS_CompileUCScript().

Ah, that's probably what Firefox does?  ... a quick "git grep" later...looks like mostly, yes.  A few tests use CompileScript.

> IMO, he JS engine should NOT be doing arbitrary charset transformation: that is
> iconv's job; also, the browser already knows how to do this.

I agree with this generally.  In my case I already have high quality Unicode routines in GLib, so I don't really want Spidermonkey doing it too (though probably NSPR has stuff in this area too, which is a different mess).

> Anyhow, Dave wasn't suggesting we add that to the JS engine, only to add a
> helper in the shell. My point is that it's probably easier to just turn on
> UTF-8 C strings than to write a routine to do this.

I think that'd just lead to more confusion down the road when the JS shell behaves differently from Firefox.

The right thing I think is to just kill off JS_SetCStringsAreUTF8.  It's an evil temptation for embedders as it is now.


But for this bug, I think we agree the JS shell should:

1) Gain some mechanism for specifying the input encoding of its script.  On Unix at least I'd expect this to default to nl_langinfo(CODESET), and maybe have a command line argument?
2) Convert the input into UTF-16 via JS_DecodeBytes, and then feed that to JS_CompileUCScript.
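
A rough sketch of step 2 for concreteness (an illustrative helper, not taken from the attached patches; note that JS_DecodeBytes() itself only interprets the bytes as UTF-8 when the C-strings-are-UTF-8 flag is set, which is the limitation discussed below):

#include "jsapi.h"

/* Decode a byte buffer to UTF-16 and compile it. JS_DecodeBytes() is called
 * twice: once with a NULL destination to measure, once to convert. The
 * compile entry point returns JSObject * in this era of the API; later
 * versions return JSScript *. */
static JSObject *
CompileBytesAsUC(JSContext *cx, JSObject *global,
                 const char *bytes, size_t nbytes, const char *filename)
{
    size_t nchars;
    if (!JS_DecodeBytes(cx, bytes, nbytes, NULL, &nchars))
        return NULL;
    jschar *chars = (jschar *) JS_malloc(cx, nchars * sizeof(jschar));
    if (!chars)
        return NULL;
    if (!JS_DecodeBytes(cx, bytes, nbytes, chars, &nchars)) {
        JS_free(cx, chars);
        return NULL;
    }
    JSObject *script = JS_CompileUCScript(cx, global, chars, nchars, filename, 1);
    JS_free(cx, chars);
    return script;
}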
Comment 9 Colin Walters 2011-04-08 10:10:20 PDT
Maybe in the future of course we could actually flip the other way and use UTF-8 internally:

Robert O'Callahan has a good post on this:

http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html

Though he doesn't mention (or perhaps didn't know) in that post that we need to support non-Unicode data in JavaScript strings for backwards compatibility.
Comment 10 Wesley W. Garland 2011-04-08 10:15:08 PDT
Our opinions on what should happen in the long term with JS_SetCStringsAreUTF8() differ by 180 degrees - but that's okay; neither of us is the module owner anyhow. :)

> 2) Convert the input into UTF-16 via JS_DecodeBytes, and then 
> feed that to JS_CompileUCScript.

The current implementation, which uses JS_NewStringCopyN(), is equivalent to this. Both it and JS_DecodeBytes() are implemented in terms of js_InflateStringToBuffer(), which requires JS_SetCStringsAreUTF8() if you want the input to be recognized as UTF-8.
Comment 11 Wesley W. Garland 2011-04-08 10:19:24 PDT
We mid-aired on the last comment.  ROC's post is interesting, and reminds me that Google's engine stores its JS strings internally as either UTF-16, UTF-8, or ASCII, depending on the string.  At least it did a year ago.

I'm ambivalent as to whether the JS engine should change its internal representation -- I would want to see perf numbers -- but that has no real bearing on the API, or on how the shell reads files. It's (mostly) an engine-contained problem.  That said, an internal UTF-8 representation would go nicely with UTF-8 C strings.
Comment 12 Wesley W. Garland 2011-04-08 10:22:21 PDT
Created attachment 524660 [details] [diff] [review]
Quick patch to add -U which sets UTF-8 C strings

Dave, does this meet your needs?
Comment 13 Colin Walters 2011-04-08 10:22:32 PDT
(In reply to comment #10)
> Our opinions on what should happen in the long term with
> JS_SetCStringsAreUTF8() differ by 180 degress- but that's okay, neither of us
> are the module owner anyhow. :)

Wait, do you actually use it?

> The current implementation, which uses JS_NewStringCopyN(), is equivalent to
> this. It and JS_DecodeBytes() are both implemented in terms of
> js_InflateStringToBuffer() which requires JS_SetCStringsAreUTF8() if you want
> the input to be recognized as UTF-8.

Ohh right.  So I think the right fix is a public JS_DecodeUTF8 API that doesn't depend on the value of JS_CStringsAreUTF8().
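
(For reference, the shape such an API would take - a declaration-only sketch mirroring JS_DecodeBytes, not a quote from any landed header:)

extern JSBool
JS_DecodeUTF8(JSContext *cx, const char *src, size_t srclen,
              jschar *dst, size_t *dstlenp);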
Comment 14 Dave Herman [:dherman] 2011-04-08 10:51:27 PDT
> Dave, does this meet your needs?

They're actually Matt Basta's needs, not mine -- the AMO validator isn't my jurisdiction. They're using Reflect.parse, which is how I got involved.

I haven't followed this whole conversation very well, but I'm somewhat ambivalent. I don't really care that much about cleanliness in the shell, because it's a ball of mud to begin with and NPOTB... and anyway, SpiderNode FTW. ;)

Matt: is Wes's patch sufficient for your needs? Basically, it adds a -U command-line option to the shell which changes all I/O to be UTF-8.

Dave
Comment 15 Colin Walters 2011-04-08 11:04:57 PDT
To restate more clearly, the reason I'm fighting so much on this bug is simply that I've learned over time when using Spidermonkey:

"Do What Firefox Does"

Firefox is where all the effort goes into test suites, where new JS features are driven and prototyped, etc.
Comment 16 Wesley W. Garland 2011-04-08 11:14:32 PDT
Normally, I'd agree with you, but I think adding arbitrary character set conversion into the shell is too big a burden at this time.

Feel free to pick up the slack, though.
Comment 17 Wesley W. Garland 2011-04-08 11:17:43 PDT
Sorry, I didn't mean that to come off as snarky as it did.

The point, though, is that the shell doesn't have access to the same libraries that the browser uses to retrieve resources and perform character set conversion.

This is why the shell uses files, and why I'm suggesting using what resources we have at our disposal to do the conversion.

I would be very sad to see the JS shell start to depend on libiconv, for example.
Comment 18 Wesley W. Garland 2011-04-08 11:19:57 PDT
Argh, I meant for this to be comment 15, mid-air collision bit me:

Dave/Matt: Not all I/O: code coming in via JS_CompileFile() - e.g. from the -f
command line option - is currently unaffected by this switch. This is an engine
bug I've been meaning to fix for quite some time...  Matt will need to use
snarf, read, or run to get the behaviour he wants.

(Full disclosure -- I didn't really test this because I had a hard time generating a test case that didn't involve copy/pasting out of my browser which is causing me charset-grief)

Colin: yes, I do use UTF-8 C strings. IMO, anybody who cares about getting
arbitrary binary data into JS strings should be relying on the defined
interfaces for doing this rather than on the happenstance of the ISO-8859-1 ->
UTF-16 mapping done by the legacy implementation of js_InflateString.  (But I
prefer storing arbitrary binary data in byte-oriented data types anyhow.)
Comment 19 Dave Herman [:dherman] 2011-04-08 11:23:07 PDT
Wes: what effect does -U have on load() then?

Dave
Comment 20 Colin Walters 2011-04-08 11:31:29 PDT
(In reply to comment #12)
> Created attachment 524660 [details] [diff] [review]
> Quick patch to add -U which sets UTF-8 C strings
> 
> Dave, does this meet your needs?

So what's interesting to me is this patch actually *breaks* UTF-8 input for me.  I'll attach a sample test file.  But *why* that's the case I'm not sure yet.
Comment 21 Colin Walters 2011-04-08 11:31:51 PDT
Created attachment 524687 [details]
print a smiley!
Comment 22 Colin Walters 2011-04-08 11:35:48 PDT
But can we get a test case: an actual file that AMO wants to validate and that failed?
Comment 23 Wesley W. Garland 2011-04-08 12:02:50 PDT
Created attachment 524699 [details]
Legal JS program with UTF-8 chars in it

Dave -- load() is unaffected by this bug, because it is implemented with JS_CompileFile() which is not aware of the JS_SetCStringsAreUTF8() switch.  This is an engine bug, IMHO (there is an old bug on this).

Colin - Not sure what's going on with your test; it works for me, but as I explained on IRC, I'm suspicious of my terminal.

The correct output from this program is "Hello\n174" when the shell is invoked with -U; a syntax error results otherwise.

To run the program, save it in a file and use the run() function or paste it into the REPL. Or use snarf()/read() to read it into a JS string and eval() it.
Comment 24 Matt Basta [:basta] 2011-04-08 13:02:38 PDT
Hi all, thanks for looking so deeply into this.

Dave: That would be fine, it's all we need. Our code looks something like this:

try {
    print(JSON.stringify(Reflect.parse(read("foo.js"))));
} catch(e) {
    print(JSON.stringify({
        "error":true,
        "error_message":e.toString(),
        "line_number":e.lineNumber
    }));
}

As long as the -U flag allows read() to decode the file, that should sufficiently meet our needs.

Is the patch ready for us to install/deploy?
Comment 25 Wesley W. Garland 2011-04-08 13:07:34 PDT
Matt:

You could certainly *test* the patch, but it can't go into the source tree until it is reviewed by a JS peer.

Are you able to test?  If so, please apply the patch locally and test; indicate that it meets your needs by marking feedback+ on the attachment, and I will request peer review. Once reviewed, assuming it passes muster, I can push it into the tracemonkey tree, where it will eventually make its way into mozilla-central.

Wes

PS: If you don't know how to apply a patch, build spidermonkey, etc -- find me in #jsapi and I will help you.
Comment 26 Colin Walters 2011-04-08 13:28:25 PDT
Ok, here's how I'm testing.  I have two files:

// begin testValidEscapedUTF8.js
var \u03A9 = "Hello";
print(\u03A9);
// end testValidEscapedUTF8.js

// begin testValidLiteralUTF8.js
var Ω = "Hello";
print(Ω);
// end testValidLiteralUTF8.js

Now, from the current mozilla-central git (I'm using the github mirror):

$ ./shell/js -e 'compile(snarf("/home/walters/tmp/jstest/testValidEscapedUTF8.js"))'
$ ./shell/js -e 'compile(snarf("/home/walters/tmp/jstest/testValidLiteralUTF8.js"))'
<string>:SyntaxError: illegal character:
<string>:var Ω = "Hello";
<string>:.....^
$

We can see that the shell barfs on the literal UTF-8.  If I add JS_SetCStringsAreUTF8() before JS_NewRuntime, it does work (so I'm not sure what happened above).

Now, I have a series of patches that give the same result, but don't require calling JS_SetCStringsAreUTF8.
Comment 27 Colin Walters 2011-04-08 13:31:04 PDT
Created attachment 524720 [details] [diff] [review]
series of patches to improve UTF-8 support in the JS shell

These patches take a different approach from Wes's - we essentially assume that the shell input is always UTF-8.  This is honestly a much better default than what we're doing now, which is hard to characterize - we're taking each byte of input as the low byte of a UTF-16 code unit... I think that's basically going to be corrupted unless the input is ASCII.
Comment 28 Wesley W. Garland 2011-04-08 13:35:38 PDT
What the current shell does is to treat all input as ISO-8859-1, via the algorithm you described.  There is no ambiguity.
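
For clarity, the mapping in question (a simplified sketch of the non-UTF-8 inflation path, not the actual js_InflateStringToBuffer code): each input byte becomes one UTF-16 code unit, which is exactly the ISO-8859-1 interpretation.

#include "jsapi.h"

/* Each byte maps to the code point with the same numeric value (0x00-0xFF). */
static void
InflateLatin1(const unsigned char *src, size_t len, jschar *dst)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}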
Comment 29 Colin Walters 2011-04-08 13:59:25 PDT
(In reply to comment #28)
> What the current shell does is to treat all input as ISO-8859-1, via the
> algorithm you described.  There is no ambiguity.

You're right, sorry.  I had forgotten this property of UTF-16.  

So this patch changes the shell to default to UTF-8 instead of ISO-8859-1, unconditionally.  I think this is basically strictly better, because whatever encoding your JavaScript happens to be in, you should be able to losslessly convert it to UTF-8 via an external mechanism (be that iconv or whatever).
Comment 30 Matt Basta [:basta] 2011-04-14 08:23:41 PDT
Wes: The patch works just fine. All of my tests pass now! Hooray! I'm going to talk to the other AMO guys to see what efforts need to be organized to update Spidermonkey once this makes its way into the source tree.

Thanks for your help!
Comment 31 Wesley W. Garland 2011-04-14 08:40:32 PDT
Which patch did you test, Matt?
Comment 32 Matt Basta [:basta] 2011-04-14 09:31:55 PDT
Wes's patch from 4/8 (attachment 524660 [details] [diff] [review])
Comment 33 Andreas Gal :gal 2011-04-14 10:10:52 PDT
Comment on attachment 524720 [details] [diff] [review]
series of patches to improve UTF-8 support in the JS shell

Changing a public API is frowned upon but this API is really obscure so what the hell.
Comment 34 Colin Walters 2011-04-14 11:13:44 PDT
(In reply to comment #33)
> Comment on attachment 524720 [details] [diff] [review]
> series of patches to improve UTF-8 support in the JS shell
> 
> Changing a public API is frowned upon but this API is really obscure so what
> the hell.

Yeah, this is basically only used by js "console" implementations, and they probably only have a total of one call in their codebase.

But the other alternative would be introducing a "UC" variant as JS_UCBufferIsCompilableUnit, and using JS_DecodeUTF8 to convert the input into UTF-16 and pass it in there.  I started on a patch to do this but ended up deciding a boolean was easier.
Comment 35 Wesley W. Garland 2011-04-14 12:56:15 PDT
I agree with Colin w.r.t. the interface, and I actually use it in my REPL.  I'll just add an autoconf-style test when the time comes.

Anybody see a reason not to land both these patches?
Comment 36 Colin Walters 2011-04-14 13:40:36 PDT
(In reply to comment #35)
> I agree with Colin w.r.t. the interface, and I actually use it in my REPL. 
> I'll just add an autoconf-style test when the time comes.
> 
> Anybody see a reason not to land both these patches?

The main reason I did mine is that I objected to adding JS_SetCStringsAreUTF8() to the js shell =)  But yes, they don't actually conflict; they just fix the same problem in different ways.
Comment 38 Michael Wu [:mwu] 2011-06-10 15:16:56 PDT
Comment on attachment 524720 [details] [diff] [review]
series of patches to improve UTF-8 support in the JS shell

>@@ -4612,7 +4612,10 @@ JS_BufferIsCompilableUnit(JSContext *cx, JSObject *obj, const char *bytes, size_
> 
>     CHECK_REQUEST(cx);
>     assertSameCompartment(cx, obj);
>-    chars = js_InflateString(cx, bytes, &length);
>+    if (bytes_are_utf8)
>+        chars = js_InflateString(cx, bytes, &length, JS_TRUE);
>+    else
>+        chars = js_InflateString(cx, bytes, &length);
>     if (!chars)
>         return JS_TRUE;
> 

Uhh, this is wrong AFAICT since CESU-8 != UTF-8.
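
To illustrate the distinction (an example, not part of the patch): a character outside the BMP, such as U+1D11E, is one 4-byte sequence in UTF-8, but CESU-8 encodes its two UTF-16 surrogates separately as two 3-byte sequences, so a decoder expecting one form can reject or mangle the other.

/* U+1D11E (MUSICAL SYMBOL G CLEF) in each encoding. */
static const unsigned char utf8_clef[]  = { 0xF0, 0x9D, 0x84, 0x9E };
static const unsigned char cesu8_clef[] = { 0xED, 0xA0, 0xB4, 0xED, 0xB4, 0x9E };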
