Closed Bug 811911 Opened 12 years ago Closed 10 years ago

Regression: Unicode option -U removed from js shell

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla20

Tracking Flags:

Tracking

Status

firefox41

---

fixed

People

(Reporter: mozillabugs, Assigned: terrence)

Details

Attachments

(2 files, 2 obsolete files)

v0 12 years ago Terrence Cole [:terrence] 5.44 KB, patch		Details \| Diff \| Splinter Review
v1: Fixed and applied review. 12 years ago Terrence Cole [:terrence] 7.30 KB, patch	mozillabugs : review-	Details \| Diff \| Splinter Review
v2: Many more bugs fixed. 12 years ago Terrence Cole [:terrence] 7.23 KB, patch	mozillabugs : review+	Details \| Diff \| Splinter Review
use_utf8_for_error_message_output-v0.diff 10 years ago Terrence Cole [:terrence] 2.47 KB, patch	jandem : review+	Details \| Diff \| Splinter Review

Norbert Lindenberg

Reporter

Description

•

12 years ago

The js shell used to have an option -U, which caused input and output to be in UTF-8. That's a very useful feature when testing and demoing internationalization APIs, and probably for other purposes as well. Using this option now results in: $ js -U Error: Invalid short option: -U ... plus usage information. Without the -U option, non-ASCII output is reduced to mojibake: $ js js> print(new Intl.DateTimeFormat("ja", {year: "numeric", month: "long", day: "numeric"}).format(new Date())) 2012t114? With the -U option, non-ASCII output used to display correctly: $ js -U js> print(new Intl.DateTimeFormat("ja", {year: "numeric", month: "long", day: "numeric"}).format(new Date())) 2012年11月14日

Norbert Lindenberg

Reporter

Comment 1

•

12 years ago

It looks like the feature was removed as part of bug 805080.

Terrence Cole [:terrence]

Assignee

Comment 2

•

12 years ago

Hmm. I'm surprised that worked at all. Because of several obvious bugs, JS_SetCStringsAreUTF8 was actually causing most (but not quite all) strings that the interfaced was passed to be treated as as CESU-8 and *not* as UTF-8. I removed -U in in the first patch in Bug 805080 because it would have been hitting this bug. I did not expect that anyone would be depending on -U since there aren't any bug reports about the broken encoding. I guess CESU-8 and UTF-8 are close enough that nobody noticed. The goal, of course, is to make the shell /always/ process strings in UTF-8 mode. I think this should be relatively easy now, so I'll take a crack at it today. In the mean time, you may be able to use the shell's |load()| function as a workaround -- see Bug 501265.

Norbert Lindenberg

Reporter

Comment 3

•

12 years ago

CESU-8 and UTF-8 are identical for characters in the Unicode Basic Multilingual Plane, covering less than half of the current Unicode characters, but the vast majority of everyday text, and the same characters that ECMAScript supports in all functionality. We'd likely see bugs with supplementary characters such as parts of the Hong Kong Supplementary Character Set or emoji.

Terrence Cole [:terrence]

Assignee

Comment 4

•

12 years ago

Attached patch v0 (obsolete) — Details — Splinter Review

The attached is a total hack, but is good enough until we are linking against ICU and can offload our conversions to it. Norbert, does this patch fix your problem? Also, is there a good way to reproduce this without the Intl patches? I was not sure how to get those to compile, so I'm flying a bit blind here. I tried the following, with limited success: ~/moz/branch/utf8_the_shell/mozilla/js/src (fix) > LC_ALL=jp ../../../def-js-dbg-clang-x64/js \ -e "print((new Date()).toLocaleString())" Wed Nov 14 15:34:10 2012 ~/moz/branch/utf8_the_shell/mozilla/js/src (fix) > LC_ALL=en ../../../def-js-dbg-clang-x64/js \ -e "print((new Date()).toLocaleString())" Wed Nov 14 15:34:15 2012 ~/moz/branch/utf8_the_shell/mozilla/js/src (fix) > ../../../def-js-dbg-clang-x64/js \ -e "print((new Date()).toLocaleString())" Wed 14 Nov 2012 03:34:19 PM PST

Assignee: general → terrence

Status: NEW → ASSIGNED

Attachment #681736 - Flags: feedback?(mozillabugs)

Norbert Lindenberg

Reporter

Comment 5

•

12 years ago

(In reply to Terrence Cole [:terrence] from comment #4) > Also, is there a good way to > reproduce this without the Intl patches? I was not sure how to get those to > compile, so I'm flying a bit blind here. For testing, you can simply use js> print("2012\u5e7411\u670815\u65e5") This should print "2012年11月15日". Make sure that your terminal uses UTF-8 - in the Terminal app on the Mac, you may have to change the Preferences (Settings -> Advanced). Similarly, "\uD83D\uDE04" should turn into a smiley when full UTF-8 is supported, assuming you have a font (it happens to work in Terminal on Mac OS 10.7 even with the old -U).

Norbert Lindenberg

Reporter

Comment 6

•

12 years ago

Comment on attachment 681736 [details] [diff] [review] v0 Review of attachment 681736 [details] [diff] [review]: ----------------------------------------------------------------- - The old -U option allowed both input and output in UTF-8. - The old version required an explicit option -U. Was the default always just ASCII+mojibake, or was it some other encoding? If I follow the code and documentation correctly, it's actually ISO-8859-1 + mojibake. If it was just ASCII+mojibake, then UTF-8 is clearly an improvement, but if it other encodings were supported, we have to be more careful. ::: js/src/shell/js.cpp @@ +339,5 @@ > + > + bad_surrogate: > + if (maybecx) { > + JS_snprintf(buffer, 10, "0x%x", c); > + JS_ReportErrorFlagsAndNumber(maybecx, JSREPORT_ERROR, js_GetErrorMessage, Are you sure you want to fail on malformed surrogates? ECMAScript doesn't do anything to prevent them otherwise, so it might be better to convert them to U+FFFD (replacement character) and continue. @@ +347,5 @@ > +} > + > +static bool > +DeflateStringToUTF8Buffer(JSContext *maybecx, const jschar *src, size_t srclen, > + char *dst, size_t *dstlenp) Do I understand correctly that on exit *dstlenp will be set to the number of bytes used? A comment documenting this would be useful. @@ +352,5 @@ > +{ > + size_t i, utf8Len; > + jschar c, c2; > + uint32_t v; > + uint8_t utf8buf[6]; UTF-8 uses at most 4 bytes per character. The comment in jsstr.h is wrong; the one in jsstr.cpp correct. @@ +390,5 @@ > + } > + dstlen -= utf8Len; > + } > + *dstlenp = (origDstlen - dstlen); > + return JS_TRUE; Jeff told me that JS_TRUE and JS_FALSE shouldn't be used anymore; true and false are preferred.

Attachment #681736 - Flags: feedback?(mozillabugs)

Terrence Cole [:terrence]

Assignee

Comment 7

•

12 years ago

(In reply to Norbert Lindenberg from comment #6) > > - The old -U option allowed both input and output in UTF-8. Before the changes I made, input would be in ISO-8859-1 (no -U) or in CESU-8 (with -U). Output would be in ISO-8859-1 + mojibake or in CESU-8 respectively. Both CESU-8 were inappropriately referred to as UTF-8 in the interface. After the changes I made, input should be in UTF-8, but output would only in ISO-8859-1 + mojibake. This was an obvious oversight on my part. My hope is that after this patch both input and output will always be UTF-8. > - The old version required an explicit option -U. Was the default always > just ASCII+mojibake, or was it some other encoding? If I follow the code and > documentation correctly, it's actually ISO-8859-1 + mojibake. If it was just > ASCII+mojibake, then UTF-8 is clearly an improvement, but if it other > encodings were supported, we have to be more careful. The interface and implementation demonstrate enough confusion and active lying that I'm not sure I would go so far as to call what it accepted an "encoding." > ::: js/src/shell/js.cpp This is (minus the CESU-8 specific bits) the code that used to live in jsstr and handled "encodings." After reading it, I'm sure you can see why I am eager to get us moved to ICU so we can use its conversion code instead. > @@ +339,5 @@ > > + > > + bad_surrogate: > > + if (maybecx) { > > + JS_snprintf(buffer, 10, "0x%x", c); > > + JS_ReportErrorFlagsAndNumber(maybecx, JSREPORT_ERROR, js_GetErrorMessage, > > Are you sure you want to fail on malformed surrogates? ECMAScript doesn't do > anything to prevent them otherwise, so it might be better to convert them to > U+FFFD (replacement character) and continue. These functions are all shell specific, so it's just a matter of what's convenient to the users of the shell (e.g. us). I never liked the old behavior, so I think I'll take your suggestion. :-) > @@ +352,5 @@ > > +{ > > + size_t i, utf8Len; > > + jschar c, c2; > > + uint32_t v; > > + uint8_t utf8buf[6]; > > UTF-8 uses at most 4 bytes per character. The comment in jsstr.h is wrong; > the one in jsstr.cpp correct. Good catch! Looks like I forgot to update this when removing the CESU-8 support.

Terrence Cole [:terrence]

Assignee

Comment 8

•

12 years ago

Attached patch v1: Fixed and applied review. (obsolete) — Details — Splinter Review

It looks like I "fixed" the wrong print the first time around. With your STR I was quickly able to find the right one. I replaced all the rest too this time, for good measure.

Attachment #681736 - Attachment is obsolete: true

Attachment #682689 - Flags: review?(mozillabugs)

Norbert Lindenberg

Reporter

Comment 9

•

12 years ago

Comment on attachment 682689 [details] [diff] [review] v1: Fixed and applied review. Review of attachment 682689 [details] [diff] [review]: ----------------------------------------------------------------- Some test cases would be useful, with input such as "\ud800", "\ud800more", "\udc00", "\uD83D\uDE04", "2012\u5e7411\u670815\u65e5". ::: js/src/shell/js.cpp @@ +315,5 @@ > + if (c < 0x80) > + continue; > + if (0xD800 <= c && c <= 0xDFFF) { > + /* Surrogate pair. */ > + chars++; The increment comes too early; we don't know yet whether we're going to consume the following char. This leads to a crash for js> print("\udc00") Probably should be around line 332. @@ +318,5 @@ > + /* Surrogate pair. */ > + chars++; > + > + /* nbytes sets 1 length since this is surrogate pair. */ > + nbytes--; Also too early - we don't know yet whether we're going to consume two chars. @@ +323,5 @@ > + if (c >= 0xDC00 || chars == end) { > + nbytes += 2; /* Bad Surrogate -> 0xFFFD */ > + continue; > + } > + c2 = *chars; Needs to be chars[1] because we don't know yet whether we're going to use the char. @@ +345,5 @@ > +PutUTF8ReplacementCharacter(size_t *srclenp, char *dst, size_t *dstlenp) { > + if (*dstlenp < 2) > + return false; > + *dst++ = (char) 0xFF; > + *dst++ = (char) 0xFD; U+FFFD is the code point; the corresponding UTF-8 bytes are EF BF BD (yes, three bytes). The length calculation above did account for three bytes. @@ +423,5 @@ > + JSLinearString *linear = str->ensureLinear(cx); > + if (!linear) > + return NULL; > + > + const jschar *es5chars = linear->chars(); How about simply "chars"? There's nothing specific to ES5 here...

Attachment #682689 - Flags: review?(mozillabugs) → review-

Terrence Cole [:terrence]

Assignee

Comment 10

•

12 years ago

(In reply to Norbert Lindenberg from comment #9) > Comment on attachment 682689 [details] [diff] [review] Yikes! that half of the bugs you found were present in the version of this function that has been in SpiderMonkey forever. > Some test cases would be useful, with input such as "\ud800", "\ud800more", > "\udc00", "\uD83D\uDE04", "2012\u5e7411\u670815\u65e5". These tests caught several more bugs. I'm going to add them to the patch in Bug 811060, since I moved the TestUTF8 function to jsapi-tests in that patch already.

Terrence Cole [:terrence]

Assignee

Comment 11

•

12 years ago

Attached patch v2: Many more bugs fixed. — Details — Splinter Review

Attachment #682689 - Attachment is obsolete: true

Attachment #683687 - Flags: review?(mozillabugs)

Norbert Lindenberg

Reporter

Comment 12

•

12 years ago

Comment on attachment 683687 [details] [diff] [review] v2: Many more bugs fixed. Review of attachment 683687 [details] [diff] [review]: ----------------------------------------------------------------- One remaining issue, otherwise this looks good to me. ::: js/src/shell/js.cpp @@ +315,5 @@ > + if (c < 0x80) > + continue; > + if (0xD800 <= c && c <= 0xDFFF) { > + /* nbytes sets 1 length since this is surrogate pair. */ > + if (c >= 0xDC00 || chars == end) { Since we moved the chars++, chars must be chars+1 here.

Attachment #683687 - Flags: review?(mozillabugs) → review+

Terrence Cole [:terrence]

Assignee

Comment 13

•

12 years ago

(In reply to Norbert Lindenberg from comment #12) > ::: js/src/shell/js.cpp > @@ +315,5 @@ > > + if (c < 0x80) > > + continue; > > + if (0xD800 <= c && c <= 0xDFFF) { > > + /* nbytes sets 1 length since this is surrogate pair. */ > > + if (c >= 0xDC00 || chars == end) { > > Since we moved the chars++, chars must be chars+1 here. Good catch! https://hg.mozilla.org/integration/mozilla-inbound/rev/f7f8011950c9

Ed Morley [:emorley]

Comment 14

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/f7f8011950c9

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla20

Norbert Lindenberg

Reporter

Comment 15

•

12 years ago

Seems we missed something: js> throw(new Error("2012\u5e7411\u670815\u65e5")) still produces mojibake.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Norbert Lindenberg

Reporter

Updated

•

12 years ago

Blocks: 837957

Norbert Lindenberg

Reporter

Updated

•

12 years ago

No longer blocks: 769872

Norbert Lindenberg

Reporter

Updated

•

12 years ago

No longer blocks: 837957

Terrence Cole [:terrence]

Assignee

Comment 16

•

10 years ago

Attached patch use_utf8_for_error_message_output-v0.diff — Details — Splinter Review

Sorry I let this linger for so long, Norbert! I /finally/ tracked this down today and we're actually inboarding and processing the string correctly. It's only during the very last stage where we're formatting the message that we encode to latin1 instead of utf8. This is using the !dontReportUncaught path, which the browser is not using anymore (mostly), so there's not any reason at all not to just take this to utf8.

Attachment #8614896 - Flags: review?(jdemooij)

Jan de Mooij [:jandem]

Updated

•

10 years ago

Attachment #8614896 - Flags: review?(jdemooij) → review+

Pulsebot

Comment 17

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/7f6f9c513651

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 18

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/7f6f9c513651

Status: REOPENED → RESOLVED

Closed: 12 years ago → 10 years ago

status-firefox41: --- → fixed

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.