Bug 648102 - UTF-8 support for shell read/snarf function
Status: RESOLVED FIXED
Whiteboard: fixed-in-tracemonkey
Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: x86 Mac OS X
Importance: -- normal
Target Milestone: ---
Assigned To: general
Depends on: 650526
Blocks: 635933 640855 648596 663552
 
Reported: 2011-04-06 13:41 PDT by Dave Herman [:dherman]
Modified: 2012-01-22 17:23 PST
CC: 6 users


Attachments
Quick patch to add -U which sets UTF-8 C strings (2.65 KB, patch)
2011-04-08 10:22 PDT, Wesley W. Garland
gal: review+
mattbasta: feedback+
print a smiley! (84 bytes, text/plain)
2011-04-08 11:31 PDT, Colin Walters
Legal JS program with UTF-8 chars in it (55 bytes, text/plain)
2011-04-08 12:02 PDT, Wesley W. Garland
series of patches to improve UTF-8 support in the JS shell (7.15 KB, patch)
2011-04-08 13:31 PDT, Colin Walters
gal: review+

Description Dave Herman [:dherman] 2011-04-06 13:41:31 PDT
The AMO validator needs to read in UTF-8-encoded files, and read/snarf only reads in bytes with some old encoding. But we have UTF-8 support in the engine so it shouldn't be too hard to create a UTF-8 variant.

This could be in the form of an optional second argument to read/snarf that is a string from a fixed set of recognized encoding names? Or maybe just a boolean to switch between the current encoding and UTF-8?

Dave
Comment 1 Wesley W. Garland 2011-04-07 12:06:04 PDT
FileAsString() reads the input file, converting the C string to a JS string via JS_NewStringCopyN().

It seems to me that the correct solution is not to add yet another code path, but rather to use the one we already have: call JS_SetCStringsAreUTF8() before JS_NewRuntime().
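
For reference, a minimal sketch of the call order being proposed here (the embedding boilerplate is illustrative, not the shell's actual main()):

#include "jsapi.h"

int main(void)
{
    JS_SetCStringsAreUTF8();    /* must be called before the first JS_NewRuntime() */
    JSRuntime *rt = JS_NewRuntime(8L * 1024L * 1024L);
    if (!rt)
        return 1;
    /* ... create a context and global object, compile and run scripts ... */
    JS_DestroyRuntime(rt);
    JS_ShutDown();
    return 0;
}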
Comment 2 Colin Walters 2011-04-07 12:08:58 PDT
(In reply to comment #1)
 
> It seems to me that the correct solution is not to add yet-another-code path,
> but rather use the one we already have: call JS_SetCStringsAreUTF8() before
> JS_NewRuntime()

But if you do that, you lose the ability to hold arbitrary binary data in strings, which is in turn necessary if your JavaScript code doesn't uniformly use Uint8Array.
Comment 3 Wesley W. Garland 2011-04-07 14:03:50 PDT
Where does the JS shell use C strings with bytes > 0x7f?
Comment 4 Colin Walters 2011-04-07 16:20:08 PDT
(In reply to comment #3)
> Where does the JS shell use C strings with bytes > 0x7f?

Sorry, I don't know all of the context here; it may be that in some limited use cases one can get away with JS_SetCStringsAreUTF8.  But in general it's web-incompatible at least.
Comment 5 Wesley W. Garland 2011-04-08 06:03:17 PDT
Colin, whether or not a pointer to char represents ISO-8859-1 or UTF-8 data in the source to your web browser (or, in this case, JS shell) is completely orthogonal to web-compatibility.
Comment 6 Colin Walters 2011-04-08 08:42:41 PDT
(In reply to comment #5)
> Colin, whether or not a pointer to char represents ISO-8859-1 or UTF-8 data in
> the source to your web browser (or, in this case, JS shell) is completely
> orthognal web-compatibilty.

Setting it makes string handling *very* different from Firefox.  Note that Spidermonkey doesn't really try to enforce ISO-8859-1 - in practice it's just "arbitrary binary data".

Let me define "web compatible" here by saying basically if Firefox were to just invoke JS_SetCStringsAreUTF8 today with no other changes, it would probably fail to run many web pages.  For example, XPConnect would no longer pass through binary strings, but would throw an exception.

On this particular bug, we have a version of this in GJS too; I think the right approach for the bug is what Dave suggested - a new variant of JS_CompileScript with an encoding argument.


Although, it *would* be nice to eventually switch Spidermonkey over to converting to UTF-8 by default, and have callers that want access to raw string data just use JS_GetStringCharsAndLength (though even more ideally have a version that returns const uint8* instead of pretending it's valid Unicode).
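
For illustration, a sketch of that access pattern (a hypothetical helper, not Firefox or shell code): JS_GetStringCharsAndLength() hands back the raw UTF-16 code units, so no C-string conversion is involved.

#include <stdio.h>
#include "jsapi.h"

/* Dump a string's raw code units; surrogates and "binary" values pass through untouched. */
static JSBool
DumpStringUnits(JSContext *cx, JSString *str)
{
    size_t len;
    const jschar *chars = JS_GetStringCharsAndLength(cx, str, &len);
    if (!chars)
        return JS_FALSE;
    for (size_t i = 0; i < len; i++)
        printf("%04x ", chars[i]);
    printf("\n");
    return JS_TRUE;
}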

Basically anything calling any of the JS_EncodeString family needs auditing if you're going to set this flag.  As it is now in the JS shell, it is used for printing out values, as well as by the Unix exec() wrapper.

Previously, it looks like printing a binary string would just send it straight to your terminal; actually this is a bug either way - in GJS, when printing things, we detect if the value isn't valid UTF-8 and squash it.  Without modification, binary strings would throw an exception and fail to print.

For the exec() case, you'd no longer be able to pass binary data as arguments to subprocesses.
Comment 7 Wesley W. Garland 2011-04-08 09:44:13 PDT
So we understand that this switch doesn't affect web correctness; it affects how the embedding or browser communicates with the JS engine.

The proper way to get Unicode data into the JS engine is to transform it to UTF-16 (native endian), from whatever the input encoding happens to be, and to feed it to JS_CompileUCScript().

IMO, the JS engine should NOT be doing arbitrary charset transformation: that is iconv's job; also, the browser already knows how to do this.

Anyhow, Dave wasn't suggesting we add that to the JS engine, only to add a helper in the shell. My point is that it's probably easier to just turn on UTF-8 C strings than to write a routine to do this.

For the exec() case - is this really a concern? It's not even built that way unless #if defined(SHELL_HACK) && defined(DEBUG) && defined(XP_UNIX)
Comment 8 Colin Walters 2011-04-08 10:00:04 PDT
(In reply to comment #7)
> So we understand that this switch doesn't affect web correctness, it affects
> how the embedding or browser communicates with the JS engine. 

True, strictly speaking.  But my point remains that Firefox would break messily if this flag was set blindly.

> The proper way to get Unicode data into the JS engine is to transform it to
> UTF-16 (native endian), from whatever the input encoding happens to be, and to
> feed it JS_CompileUCScript().

Ah, that's probably what Firefox does?  ... a quick "git grep" later...looks like mostly, yes.  A few tests use CompileScript.

> IMO, he JS engine should NOT be doing arbitrary charset transformation: that is
> iconv's job; also, the browser already knows how to do this.

I agree with this generally.  In my case I already have high quality Unicode routines in GLib, so I don't really want Spidermonkey doing it too (though probably NSPR has stuff in this area too, which is a different mess).

> Anyhow, Dave wasn't suggesting we add that to the JS engine, only to add a
> helper in the shell. My point is that it's probably easier to just turn on
> UTF-8 C strings than to write a routine to do this.

I think that'd just lead to more confusion down the road when the JS shell behaves differently from Firefox.

The right thing I think is to just kill off JS_SetCStringsAreUTF8.  It's an evil temptation for embedders as it is now.


But for this bug, I think we agree the JS shell should:

1) Gain some mechanism for specifying the input encoding of its script.  On Unix at least I'd expect this to default to nl_langinfo(CODESET), and maybe have a command line argument?
2) Convert the input into UTF-16 via JS_DecodeBytes, and then feed that to JS_CompileUCScript.
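
A rough sketch of step 2 for concreteness (an illustrative helper, not taken from the attached patches; note that JS_DecodeBytes() itself only interprets the bytes as UTF-8 when the C-strings-are-UTF-8 flag is set, which is the limitation discussed below):

#include "jsapi.h"

/* Decode a byte buffer to UTF-16 and compile it. JS_DecodeBytes() is called
 * twice: once with a NULL destination to measure, once to convert. The
 * compile entry point returns JSObject * in this era of the API; later
 * versions return JSScript *. */
static JSObject *
CompileBytesAsUC(JSContext *cx, JSObject *global,
                 const char *bytes, size_t nbytes, const char *filename)
{
    size_t nchars;
    if (!JS_DecodeBytes(cx, bytes, nbytes, NULL, &nchars))
        return NULL;
    jschar *chars = (jschar *) JS_malloc(cx, nchars * sizeof(jschar));
    if (!chars)
        return NULL;
    if (!JS_DecodeBytes(cx, bytes, nbytes, chars, &nchars)) {
        JS_free(cx, chars);
        return NULL;
    }
    JSObject *script = JS_CompileUCScript(cx, global, chars, nchars, filename, 1);
    JS_free(cx, chars);
    return script;
}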
Comment 9 Colin Walters 2011-04-08 10:10:20 PDT
Maybe in the future of course we could actually flip the other way and use UTF-8 internally:

Robert O'Callahan has a good post on this:

http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html

Though he doesn't mention (or perhaps didn't know) in that post that we need to support non-Unicode data in JavaScript strings for backwards compatibility.
Comment 10 Wesley W. Garland 2011-04-08 10:15:08 PDT
Our opinions on what should happen in the long term with JS_SetCStringsAreUTF8() differ by 180 degrees - but that's okay; neither of us is the module owner anyhow. :)

> 2) Convert the input into UTF-16 via JS_DecodeBytes, and then 
> feed that to JS_CompileUCScript.

The current implementation, which uses JS_NewStringCopyN(), is equivalent to this. Both it and JS_DecodeBytes() are implemented in terms of js_InflateStringToBuffer(), which requires JS_SetCStringsAreUTF8() if you want the input to be recognized as UTF-8.
Comment 11 Wesley W. Garland 2011-04-08 10:19:24 PDT
We mid-aired on the last comment.  ROC's post is interesting, and reminds me that Google's engine stores its JS strings internally as either UTF-16, UTF-8, or ASCII, depending on the string.  At least it did a year ago.

I'm ambivalent as to whether the JS engine should change its internal representation -- I would want to see perf numbers -- but that has no real bearing on the API, or on how the shell reads files. It's (mostly) an engine-contained problem.  That said, an internal UTF-8 representation would go nicely with UTF-8 C strings.
Comment 12 Wesley W. Garland 2011-04-08 10:22:21 PDT
Created attachment 524660 [details] [diff] [review]
Quick patch to add -U which sets UTF-8 C strings

Dave, does this meet your needs?
Comment 13 Colin Walters 2011-04-08 10:22:32 PDT
(In reply to comment #10)
> Our opinions on what should happen in the long term with
> JS_SetCStringsAreUTF8() differ by 180 degress- but that's okay, neither of us
> are the module owner anyhow. :)

Wait, do you actually use it?

> The current implementation, which uses JS_NewStringCopyN(), is equivalent to
> this. It and JS_DecodeBytes() are both implemented in terms of
> js_InflateStringToBuffer() which requires JS_SetCStringsAreUTF8() if you want
> the input to be recognized as UTF-8.

Ohh right.  So I think the right fix is a public JS_DecodeUTF8 API that doesn't depend on the value of JS_CStringsAreUTF8().
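
(For reference, the shape such an API would take - a declaration-only sketch mirroring JS_DecodeBytes, not a quote from any landed header:)

extern JSBool
JS_DecodeUTF8(JSContext *cx, const char *src, size_t srclen,
              jschar *dst, size_t *dstlenp);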
Comment 14 Dave Herman [:dherman] 2011-04-08 10:51:27 PDT
> Dave, does this meet your needs?

They're actually Matt Basta's needs, not mine -- the AMO validator isn't my jurisdiction. They're using Reflect.parse, which is how I got involved.

I haven't followed this whole conversation very well, but I'm somewhat ambivalent. I don't really care that much about cleanliness in the shell, because it's a ball of mud to begin with and NPOTB... and anyway, SpiderNode FTW. ;)

Matt: is Wes's patch sufficient for your needs? Basically, it adds a -U command-line option to the shell which changes all I/O to be UTF-8.

Dave
Comment 15 Colin Walters 2011-04-08 11:04:57 PDT
To restate more clearly, the reason I'm fighting so much on this bug is simply that I've learned over time when using Spidermonkey:

"Do What Firefox Does"

Firefox is where all the effort goes into test suites, where new JS features are driven and prototyped, etc.
Comment 16 Wesley W. Garland 2011-04-08 11:14:32 PDT
Normally, I'd agree with you, but I think adding arbitrary character set conversion into the shell is too big a burden at this time.

Feel free to pick up the slack, though.
Comment 17 Wesley W. Garland 2011-04-08 11:17:43 PDT
Sorry, I didn't mean that to come off as snarky as it did.

The point, though, is that the shell doesn't have access to the same libraries that the browser uses to retrieve resources and perform character set conversion.

This is why the shell uses files, and why I'm suggesting using what resources we have at our disposal to do the conversion.

I would be very sad to see the JS shell start to depend on libiconv, for example.
Comment 18 Wesley W. Garland 2011-04-08 11:19:57 PDT
Argh, I meant for this to be comment 15, mid-air collision bit me:

Dave/Matt: Not all I/O: code coming in via JS_CompileFile() - e.g. from the -f
command line option - is currently unaffected by this switch. This is an engine
bug I've been meaning to fix for quite some time...  Matt will need to use
snarf, read, or run to get the behaviour he wants.

(Full disclosure -- I didn't really test this because I had a hard time generating a test case that didn't involve copy/pasting out of my browser which is causing me charset-grief)

Colin: yes, I do use UTF-8 C strings. IMO, anybody who cares about getting
arbitrary binary data into JS strings should be relying on the defined
interfaces for doing this rather than on the happenstance of the ISO-8859-1 ->
UTF-16 mapping done by the legacy implementation of js_InflateString.  (But I
prefer storing arbitrary binary data in byte-oriented data types anyhow.)
Comment 19 Dave Herman [:dherman] 2011-04-08 11:23:07 PDT
Wes: what effect does -U have on load() then?

Dave
Comment 20 Colin Walters 2011-04-08 11:31:29 PDT
(In reply to comment #12)
> Created attachment 524660 [details] [diff] [review]
> Quick patch to add -U which sets UTF-8 C strings
> 
> Dave, does this meet your needs?

So what's interesting to me is this patch actually *breaks* UTF-8 input for me.  I'll attach a sample test file.  But *why* that's the case I'm not sure yet.
Comment 21 Colin Walters 2011-04-08 11:31:51 PDT
Created attachment 524687 [details]
print a smiley!
Comment 22 Colin Walters 2011-04-08 11:35:48 PDT
But can we get a test case: an actual file that AMO wants to validate and that failed?
Comment 23 Wesley W. Garland 2011-04-08 12:02:50 PDT
Created attachment 524699 [details]
Legal JS program with UTF-8 chars in it

Dave -- load() is unaffected by this bug, because it is implemented with JS_CompileFile() which is not aware of the JS_SetCStringsAreUTF8() switch.  This is an engine bug, IMHO (there is an old bug on this).

Colin - Not sure what's going on with your test; it works for me, but as I explained on IRC, I'm suspicious of my terminal.

The correct output from this program is "Hello\n174" when the shell is invoked with -U; a syntax error results otherwise.

To run the program, save it in a file and use the run() function or paste it into the REPL. Or use snarf()/read() to read it into a JS string and eval() it.
Comment 24 Matt Basta [:basta] 2011-04-08 13:02:38 PDT
Hi all, thanks for looking so deeply into this.

Dave: That would be fine, it's all we need. Our code looks something like this:

try {
    print(JSON.stringify(Reflect.parse(read("foo.js"))));
} catch(e) {
    print(JSON.stringify({
        "error":true,
        "error_message":e.toString(),
        "line_number":e.lineNumber
    }));
}

As long as the -U flag allows read() to decode the file, that should sufficiently meet our needs.

Is the patch ready for us to install/deploy?
Comment 25 Wesley W. Garland 2011-04-08 13:07:34 PDT
Matt:

You could certainly *test* the patch, but it can't go into the source tree until it is reviewed by a JS peer.

Are you able to test?  If so, please apply the patch locally and test; indicate that it meets your needs by marking feedback+ on the attachment, and I will request peer review. Once reviewed, assuming it passes muster, I can push it into the tracemonkey tree, where it will eventually make its way into mozilla-central.

Wes

PS: If you don't know how to apply a patch, build spidermonkey, etc -- find me in #jsapi and I will help you.
Comment 26 Colin Walters 2011-04-08 13:28:25 PDT
Ok, here's how I'm testing.  I have two files:

// begin testValidEscapedUTF8.js
var \u03A9 = "Hello";
print(\u03A9);
// end testValidEscapedUTF8.js

// begin testValidLiteralUTF8.js
var Ω = "Hello";
print(Ω);
// end testValidLiteralUTF8.js

Now, from the current mozilla-central git (I'm using the github mirror):

$ ./shell/js -e 'compile(snarf("/home/walters/tmp/jstest/testValidEscapedUTF8.js"))'
$ ./shell/js -e 'compile(snarf("/home/walters/tmp/jstest/testValidLiteralUTF8.js"))'
<string>:SyntaxError: illegal character:
<string>:var Ω = "Hello";
<string>:.....^
$

We can see that the shell barfs on the literal UTF-8.  If I add JS_SetCStringsAreUTF8() before JS_NewRuntime, it does work (so I'm not sure what happened above).

Now, I have a series of patches that give the same result, but don't require calling JS_SetCStringsAreUTF8.
Comment 27 Colin Walters 2011-04-08 13:31:04 PDT
Created attachment 524720 [details] [diff] [review]
series of patches to improve UTF-8 support in the JS shell

These patches take a different approach from Wes's - we essentially assume that the shell input is always UTF-8.  This is honestly a much better default than what we're doing now, which is hard to characterize - we're taking each byte of input as the low byte of a UTF-16 code unit... I think that's basically going to be corrupted unless the input is ASCII.
Comment 28 Wesley W. Garland 2011-04-08 13:35:38 PDT
What the current shell does is to treat all input as ISO-8859-1, via the algorithm you described.  There is no ambiguity.
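
For clarity, the mapping in question (a simplified sketch of the non-UTF-8 inflation path, not the actual js_InflateStringToBuffer code): each input byte becomes one UTF-16 code unit, which is exactly the ISO-8859-1 interpretation.

#include "jsapi.h"

/* Each byte maps to the code point with the same numeric value (0x00-0xFF). */
static void
InflateLatin1(const unsigned char *src, size_t len, jschar *dst)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}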
Comment 29 Colin Walters 2011-04-08 13:59:25 PDT
(In reply to comment #28)
> What the current shell does is to treat all input as ISO-8859-1, via the
> algorithm you described.  There is no ambiguity.

You're right, sorry.  I had forgotten this property of UTF-16.  

So this patch changes the shell to default to UTF-8 instead of ISO-8859-1, unconditionally.  I think this is basically strictly better, because whatever encoding your JavaScript happens to be in, you should be able to losslessly convert it to UTF-8 via an external mechanism (be that iconv or whatever).
Comment 30 Matt Basta [:basta] 2011-04-14 08:23:41 PDT
Wes: The patch works just fine. All of my tests pass now! Hooray! I'm going to talk to the other AMO guys to see what efforts need to be organized to update Spidermonkey once this makes its way into the source tree.

Thanks for your help!
Comment 31 Wesley W. Garland 2011-04-14 08:40:32 PDT
Which patch did you test, Matt?
Comment 32 Matt Basta [:basta] 2011-04-14 09:31:55 PDT
Wes's patch from 4/8 (attachment 524660 [details] [diff] [review])
Comment 33 Andreas Gal :gal 2011-04-14 10:10:52 PDT
Comment on attachment 524720 [details] [diff] [review]
series of patches to improve UTF-8 support in the JS shell

Changing a public API is frowned upon but this API is really obscure so what the hell.
Comment 34 Colin Walters 2011-04-14 11:13:44 PDT
(In reply to comment #33)
> Comment on attachment 524720 [details] [diff] [review]
> series of patches to improve UTF-8 support in the JS shell
> 
> Changing a public API is frowned upon but this API is really obscure so what
> the hell.

Yeah, this is basically only used by js "console" implementations, and they probably only have a total of one call in their codebase.

But the other alternative would be introducing a "UC" variant as JS_UCBufferIsCompilableUnit, and using JS_DecodeUTF8 to convert the input into UTF-16 and pass it in there.  I started on a patch to do this but ended up deciding a boolean was easier.
Comment 35 Wesley W. Garland 2011-04-14 12:56:15 PDT
I agree with Colin w.r.t. the interface, and I actually use it in my REPL.  I'll just add an autoconf-style test when the time comes.

Anybody see a reason not to land both these patches?
Comment 36 Colin Walters 2011-04-14 13:40:36 PDT
(In reply to comment #35)
> I agree with Colin w.r.t. the interface, and I actually use it in my REPL. 
> I'll just add an autoconf-style test when the time comes.
> 
> Anybody see a reason not to land both these patches?

The main reason I did mine is that I objected to adding JS_SetCStringsAreUTF8() to the js shell =)  But yes, they don't actually conflict; they just fix the same problem in different ways.
Comment 38 Michael Wu [:mwu] 2011-06-10 15:16:56 PDT
Comment on attachment 524720 [details] [diff] [review]
series of patches to improve UTF-8 support in the JS shell

>@@ -4612,7 +4612,10 @@ JS_BufferIsCompilableUnit(JSContext *cx, JSObject *obj, const char *bytes, size_
> 
>     CHECK_REQUEST(cx);
>     assertSameCompartment(cx, obj);
>-    chars = js_InflateString(cx, bytes, &length);
>+    if (bytes_are_utf8)
>+        chars = js_InflateString(cx, bytes, &length, JS_TRUE);
>+    else
>+        chars = js_InflateString(cx, bytes, &length);
>     if (!chars)
>         return JS_TRUE;
> 

Uhh, this is wrong AFAICT since CESU-8 != UTF-8.
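
To illustrate the distinction (an example, not part of the patch): a character outside the BMP, such as U+1D11E, is one 4-byte sequence in UTF-8, but CESU-8 encodes its two UTF-16 surrogates separately as two 3-byte sequences, so a decoder expecting one form can reject or mangle the other.

/* U+1D11E (MUSICAL SYMBOL G CLEF) in each encoding. */
static const unsigned char utf8_clef[]  = { 0xF0, 0x9D, 0x84, 0x9E };
static const unsigned char cesu8_clef[] = { 0xED, 0xA0, 0xB4, 0xED, 0xB4, 0x9E };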
