Closed
Bug 811911
Opened 12 years ago
Closed 10 years ago
Regression: Unicode option -U removed from js shell
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
FIXED
mozilla20
Tracking | Status | |
---|---|---|
firefox41 | --- | fixed |
People
(Reporter: mozillabugs, Assigned: terrence)
Details
Attachments
(2 files, 2 obsolete files)
7.23 KB,
patch
|
mozillabugs
:
review+
|
Details | Diff | Splinter Review |
2.47 KB,
patch
|
jandem
:
review+
|
Details | Diff | Splinter Review |
The js shell used to have an option -U, which caused input and output to be in UTF-8. That's a very useful feature when testing and demoing internationalization APIs, and probably for other purposes as well. Using this option now results in:
$ js -U
Error: Invalid short option: -U
... plus usage information.
Without the -U option, non-ASCII output is reduced to mojibake:
$ js
js> print(new Intl.DateTimeFormat("ja", {year: "numeric", month: "long", day: "numeric"}).format(new Date()))
2012t114?
With the -U option, non-ASCII output used to display correctly:
$ js -U
js> print(new Intl.DateTimeFormat("ja", {year: "numeric", month: "long", day: "numeric"}).format(new Date()))
2012年11月14日
Reporter | ||
Comment 1•12 years ago
|
||
It looks like the feature was removed as part of bug 805080.
Assignee | ||
Comment 2•12 years ago
|
||
Hmm. I'm surprised that worked at all. Because of several obvious bugs, JS_SetCStringsAreUTF8 was actually causing most (but not quite all) strings that the interfaced was passed to be treated as as CESU-8 and *not* as UTF-8. I removed -U in in the first patch in Bug 805080 because it would have been hitting this bug. I did not expect that anyone would be depending on -U since there aren't any bug reports about the broken encoding. I guess CESU-8 and UTF-8 are close enough that nobody noticed.
The goal, of course, is to make the shell /always/ process strings in UTF-8 mode. I think this should be relatively easy now, so I'll take a crack at it today.
In the mean time, you may be able to use the shell's |load()| function as a workaround -- see Bug 501265.
Reporter | ||
Comment 3•12 years ago
|
||
CESU-8 and UTF-8 are identical for characters in the Unicode Basic Multilingual Plane, covering less than half of the current Unicode characters, but the vast majority of everyday text, and the same characters that ECMAScript supports in all functionality. We'd likely see bugs with supplementary characters such as parts of the Hong Kong Supplementary Character Set or emoji.
Assignee | ||
Comment 4•12 years ago
|
||
The attached is a total hack, but is good enough until we are linking against ICU and can offload our conversions to it.
Norbert, does this patch fix your problem? Also, is there a good way to reproduce this without the Intl patches? I was not sure how to get those to compile, so I'm flying a bit blind here. I tried the following, with limited success:
~/moz/branch/utf8_the_shell/mozilla/js/src (fix)
> LC_ALL=jp ../../../def-js-dbg-clang-x64/js \
-e "print((new Date()).toLocaleString())"
Wed Nov 14 15:34:10 2012
~/moz/branch/utf8_the_shell/mozilla/js/src (fix)
> LC_ALL=en ../../../def-js-dbg-clang-x64/js \
-e "print((new Date()).toLocaleString())"
Wed Nov 14 15:34:15 2012
~/moz/branch/utf8_the_shell/mozilla/js/src (fix)
> ../../../def-js-dbg-clang-x64/js \
-e "print((new Date()).toLocaleString())"
Wed 14 Nov 2012 03:34:19 PM PST
Assignee: general → terrence
Status: NEW → ASSIGNED
Attachment #681736 -
Flags: feedback?(mozillabugs)
Reporter | ||
Comment 5•12 years ago
|
||
(In reply to Terrence Cole [:terrence] from comment #4)
> Also, is there a good way to
> reproduce this without the Intl patches? I was not sure how to get those to
> compile, so I'm flying a bit blind here.
For testing, you can simply use
js> print("2012\u5e7411\u670815\u65e5")
This should print "2012年11月15日". Make sure that your terminal uses UTF-8 - in the Terminal app on the Mac, you may have to change the Preferences (Settings -> Advanced).
Similarly, "\uD83D\uDE04" should turn into a smiley when full UTF-8 is supported, assuming you have a font (it happens to work in Terminal on Mac OS 10.7 even with the old -U).
Reporter | ||
Comment 6•12 years ago
|
||
Comment on attachment 681736 [details] [diff] [review]
v0
Review of attachment 681736 [details] [diff] [review]:
-----------------------------------------------------------------
- The old -U option allowed both input and output in UTF-8.
- The old version required an explicit option -U. Was the default always just ASCII+mojibake, or was it some other encoding? If I follow the code and documentation correctly, it's actually ISO-8859-1 + mojibake. If it was just ASCII+mojibake, then UTF-8 is clearly an improvement, but if it other encodings were supported, we have to be more careful.
::: js/src/shell/js.cpp
@@ +339,5 @@
> +
> + bad_surrogate:
> + if (maybecx) {
> + JS_snprintf(buffer, 10, "0x%x", c);
> + JS_ReportErrorFlagsAndNumber(maybecx, JSREPORT_ERROR, js_GetErrorMessage,
Are you sure you want to fail on malformed surrogates? ECMAScript doesn't do anything to prevent them otherwise, so it might be better to convert them to U+FFFD (replacement character) and continue.
@@ +347,5 @@
> +}
> +
> +static bool
> +DeflateStringToUTF8Buffer(JSContext *maybecx, const jschar *src, size_t srclen,
> + char *dst, size_t *dstlenp)
Do I understand correctly that on exit *dstlenp will be set to the number of bytes used? A comment documenting this would be useful.
@@ +352,5 @@
> +{
> + size_t i, utf8Len;
> + jschar c, c2;
> + uint32_t v;
> + uint8_t utf8buf[6];
UTF-8 uses at most 4 bytes per character. The comment in jsstr.h is wrong; the one in jsstr.cpp correct.
@@ +390,5 @@
> + }
> + dstlen -= utf8Len;
> + }
> + *dstlenp = (origDstlen - dstlen);
> + return JS_TRUE;
Jeff told me that JS_TRUE and JS_FALSE shouldn't be used anymore; true and false are preferred.
Attachment #681736 -
Flags: feedback?(mozillabugs)
Assignee | ||
Comment 7•12 years ago
|
||
(In reply to Norbert Lindenberg from comment #6)
>
> - The old -U option allowed both input and output in UTF-8.
Before the changes I made, input would be in ISO-8859-1 (no -U) or in CESU-8 (with -U). Output would be in ISO-8859-1 + mojibake or in CESU-8 respectively. Both CESU-8 were inappropriately referred to as UTF-8 in the interface.
After the changes I made, input should be in UTF-8, but output would only in ISO-8859-1 + mojibake. This was an obvious oversight on my part. My hope is that after this patch both input and output will always be UTF-8.
> - The old version required an explicit option -U. Was the default always
> just ASCII+mojibake, or was it some other encoding? If I follow the code and
> documentation correctly, it's actually ISO-8859-1 + mojibake. If it was just
> ASCII+mojibake, then UTF-8 is clearly an improvement, but if it other
> encodings were supported, we have to be more careful.
The interface and implementation demonstrate enough confusion and active lying that I'm not sure I would go so far as to call what it accepted an "encoding."
> ::: js/src/shell/js.cpp
This is (minus the CESU-8 specific bits) the code that used to live in jsstr and handled "encodings." After reading it, I'm sure you can see why I am eager to get us moved to ICU so we can use its conversion code instead.
> @@ +339,5 @@
> > +
> > + bad_surrogate:
> > + if (maybecx) {
> > + JS_snprintf(buffer, 10, "0x%x", c);
> > + JS_ReportErrorFlagsAndNumber(maybecx, JSREPORT_ERROR, js_GetErrorMessage,
>
> Are you sure you want to fail on malformed surrogates? ECMAScript doesn't do
> anything to prevent them otherwise, so it might be better to convert them to
> U+FFFD (replacement character) and continue.
These functions are all shell specific, so it's just a matter of what's convenient to the users of the shell (e.g. us). I never liked the old behavior, so I think I'll take your suggestion. :-)
> @@ +352,5 @@
> > +{
> > + size_t i, utf8Len;
> > + jschar c, c2;
> > + uint32_t v;
> > + uint8_t utf8buf[6];
>
> UTF-8 uses at most 4 bytes per character. The comment in jsstr.h is wrong;
> the one in jsstr.cpp correct.
Good catch! Looks like I forgot to update this when removing the CESU-8 support.
Assignee | ||
Comment 8•12 years ago
|
||
It looks like I "fixed" the wrong print the first time around. With your STR I was quickly able to find the right one. I replaced all the rest too this time, for good measure.
Attachment #681736 -
Attachment is obsolete: true
Attachment #682689 -
Flags: review?(mozillabugs)
Reporter | ||
Comment 9•12 years ago
|
||
Comment on attachment 682689 [details] [diff] [review]
v1: Fixed and applied review.
Review of attachment 682689 [details] [diff] [review]:
-----------------------------------------------------------------
Some test cases would be useful, with input such as "\ud800", "\ud800more", "\udc00", "\uD83D\uDE04", "2012\u5e7411\u670815\u65e5".
::: js/src/shell/js.cpp
@@ +315,5 @@
> + if (c < 0x80)
> + continue;
> + if (0xD800 <= c && c <= 0xDFFF) {
> + /* Surrogate pair. */
> + chars++;
The increment comes too early; we don't know yet whether we're going to consume the following char. This leads to a crash for
js> print("\udc00")
Probably should be around line 332.
@@ +318,5 @@
> + /* Surrogate pair. */
> + chars++;
> +
> + /* nbytes sets 1 length since this is surrogate pair. */
> + nbytes--;
Also too early - we don't know yet whether we're going to consume two chars.
@@ +323,5 @@
> + if (c >= 0xDC00 || chars == end) {
> + nbytes += 2; /* Bad Surrogate -> 0xFFFD */
> + continue;
> + }
> + c2 = *chars;
Needs to be chars[1] because we don't know yet whether we're going to use the char.
@@ +345,5 @@
> +PutUTF8ReplacementCharacter(size_t *srclenp, char *dst, size_t *dstlenp) {
> + if (*dstlenp < 2)
> + return false;
> + *dst++ = (char) 0xFF;
> + *dst++ = (char) 0xFD;
U+FFFD is the code point; the corresponding UTF-8 bytes are EF BF BD (yes, three bytes). The length calculation above did account for three bytes.
@@ +423,5 @@
> + JSLinearString *linear = str->ensureLinear(cx);
> + if (!linear)
> + return NULL;
> +
> + const jschar *es5chars = linear->chars();
How about simply "chars"? There's nothing specific to ES5 here...
Attachment #682689 -
Flags: review?(mozillabugs) → review-
Assignee | ||
Comment 10•12 years ago
|
||
(In reply to Norbert Lindenberg from comment #9)
> Comment on attachment 682689 [details] [diff] [review]
Yikes! that half of the bugs you found were present in the version of this function that has been in SpiderMonkey forever.
> Some test cases would be useful, with input such as "\ud800", "\ud800more",
> "\udc00", "\uD83D\uDE04", "2012\u5e7411\u670815\u65e5".
These tests caught several more bugs. I'm going to add them to the patch in Bug 811060, since I moved the TestUTF8 function to jsapi-tests in that patch already.
Assignee | ||
Comment 11•12 years ago
|
||
Attachment #682689 -
Attachment is obsolete: true
Attachment #683687 -
Flags: review?(mozillabugs)
Reporter | ||
Comment 12•12 years ago
|
||
Comment on attachment 683687 [details] [diff] [review]
v2: Many more bugs fixed.
Review of attachment 683687 [details] [diff] [review]:
-----------------------------------------------------------------
One remaining issue, otherwise this looks good to me.
::: js/src/shell/js.cpp
@@ +315,5 @@
> + if (c < 0x80)
> + continue;
> + if (0xD800 <= c && c <= 0xDFFF) {
> + /* nbytes sets 1 length since this is surrogate pair. */
> + if (c >= 0xDC00 || chars == end) {
Since we moved the chars++, chars must be chars+1 here.
Attachment #683687 -
Flags: review?(mozillabugs) → review+
Assignee | ||
Comment 13•12 years ago
|
||
(In reply to Norbert Lindenberg from comment #12)
> ::: js/src/shell/js.cpp
> @@ +315,5 @@
> > + if (c < 0x80)
> > + continue;
> > + if (0xD800 <= c && c <= 0xDFFF) {
> > + /* nbytes sets 1 length since this is surrogate pair. */
> > + if (c >= 0xDC00 || chars == end) {
>
> Since we moved the chars++, chars must be chars+1 here.
Good catch!
https://hg.mozilla.org/integration/mozilla-inbound/rev/f7f8011950c9
Comment 14•12 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Reporter | ||
Comment 15•12 years ago
|
||
Seems we missed something:
js> throw(new Error("2012\u5e7411\u670815\u65e5"))
still produces mojibake.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 16•10 years ago
|
||
Sorry I let this linger for so long, Norbert!
I /finally/ tracked this down today and we're actually inboarding and processing the string correctly. It's only during the very last stage where we're formatting the message that we encode to latin1 instead of utf8. This is using the !dontReportUncaught path, which the browser is not using anymore (mostly), so there's not any reason at all not to just take this to utf8.
Attachment #8614896 -
Flags: review?(jdemooij)
Updated•10 years ago
|
Attachment #8614896 -
Flags: review?(jdemooij) → review+
Comment 17•10 years ago
|
||
Status: REOPENED → RESOLVED
Closed: 12 years ago → 10 years ago
status-firefox41:
--- → fixed
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•