397215 - print() is buggy with non-ASCII strings

Reporter

Description

•

17 years ago

js> print("\u263A")
:

Should print a smiley face character.

http://intertwingly.net/blog/2007/09/21/JSON-Interop suggests a crazy workaround where you hand print() a "string" whose characters are really UTF-8 bytes.  I hope we can fix this bug before the workaround becomes widespread.

A comment on that page points out that readline() has a similar issue.

Sam Ruby

Assignee

Comment 1

•

17 years ago

Excellent!

After a quick scan, it looks like all the right code is in there; what's needed is mostly deciding what the right behavior is -- that, and the brazen audacity to make the solution that is come up with the default.  The alternative is that people will deploy crazy workarounds, such as the one I found and briefly contemplated deploying before I found a change I could make to the Python side of my implementation.

Brendan Eich [:brendan]

Updated

•

17 years ago

Depends on: 315288

Sam Ruby

Assignee

Comment 2

•

17 years ago

After a few tests, it looks like JS_C_STRINGS_ARE_UTF8 solves the issue I brought up.  Evidence:

http://intertwingly.net/blog/2007/09/21/JSON-Interop#c1190666379

So, the question is which value should be the default?  Should the correct behavior only be available to those who can read and compile their own version?

My vote would be to rename the define to JS_C_STRINGS_ARE_ISO_8859_1 and flip all of the #ifdefs.  This would enable anybody who wanted the previous behavior to build their own executable with that feature.

Brendan Eich [:brendan]

Comment 3

•

17 years ago

We could rename that macro (I prefer the shorter name), but flipping the default breaks very long-standing API compatibility promises.

We really do have lots of downstream distributions and consumers, too many of them incommunicado (e.g. tellme.com's VXML-based service, which uses massively multi-threaded SpiderMonkey, now part of MSFT). Unlike, say, C-Python, we have kept jsapi.h stable for a very long time, with only a few unintended bugs leaking through to change API semantics that mattered (enough to be reported).

The real work in Mozilla code is changing XPConnect, etc. to cope, hence the blocking bug.

/be

Brendan Eich [:brendan]

Comment 4

•

17 years ago

Can CouchDB folks build with the currently-named macro defined to 1 and be happy?

/be

Sam Ruby

Assignee

Comment 5

•

17 years ago

Yes, the macro solves the CouchDB problem.  But for anybody who (like me) simply issues a "apt-get install spidermonkey-bin" is likely to do exactly what Jesse fears most: find and propagate crazy workarounds.

P.S.  I may be dense, but I don't understand the jsapi comment.  jsapi.h does not have any #ifdefs that depend on this setting, it merely exposes functions that will do the conversions for you, and one function to tell you how the code has been compiled.  Unless I am reading this header incorrectly, the necessary breaking changes to jsapi.h have already been made.

Igor Bukanov

Comment 6

•

17 years ago

(In reply to comment #5)
> Unless I am reading this header incorrectly, the necessary
> breaking changes to jsapi.h have already been made.

The macro affects the utility functions that JS API implementation code uses to convert between bytes and JSStrings. Thus turning JS_C_STRINGS_ARE_UTF8 on by default would break any application that currently uses Latin-1 encoding and passes non-ascii chars to SpiderMonkey API.

Brendan Eich [:brendan]

Comment 7

•

17 years ago

Right, we can't change the *source* to force an incompatible API difference. But we could build a distributed JS shell with JS_C_STRINGS_ARE_UTF8. Do we even build the JS shell that Sam fetches with apt-get spidermonkey-bin, though? I doubt it. NetBSD, probably OpenBSD, the OSSP guys, and some Linuxes must build it themselves via Makefile.ref.

If Makefile.ref is used with straightforward options (BUILD_OPT=1, maybe OPTIMIZER=-Os on Mac OS X), then we could define JS_C_STRINGS_ARE_UTF8 in that makefile.

In any event, we absolutely must run our automated testsuite on builds with and without JS_C_STRINGS_ARE_UTF8 defined. Bob, can you try the "with" case and file any bugs that bite it, that don't bite the "without" case? Thanks.

Similar question has arisen in the past about building a libjs.{a,so} distributed by others, with respect to JS_THREADSAFE (which then requires NSPR).

We have never acted alone as a good vendor of standalone SpiderMonkey. It was always a group effort among distros, groups such as OSSP, and Mozilla. Now with CouchDB using the js shell it may be time to rally interested parties and fix a few bugs (this bug's dependency, the unified build system bug, others I'm not going to remember right now).

/be

Bob Clary [:bc] (inactive)

Comment 8

•

17 years ago

(In reply to comment #7)

> In any event, we absolutely must run our automated testsuite on builds with and
> without JS_C_STRINGS_ARE_UTF8 defined. Bob, can you try the "with" case and
> file any bugs that bite it, that don't bite the "without" case? Thanks.

I presume those getting SpiderMonkey from their distros are running 1.6.0. Would a run on gecko 1.8.0 be a better first target to test or would the trunk be better?

Brendan Eich [:brendan]

Comment 9

•

17 years ago

Trunk -- we do not support 1.8.0.x in official builds and I don't have time for anything but cvs trunk and hg.mozilla.org projects. We should focus on the future since the distros have their own schedules and trailing-edge investments by us may be wasted.

/be

Brendan Eich [:brendan]

Comment 10

•

17 years ago

Michael, is Adobe using JS_C_STRINGS_ARE_UTF8 in Acrobat or other SpiderMonkey embeddings? What version(s) of SpiderMonkey?

/be

Sam Ruby

Assignee

Comment 11

•

17 years ago

Perhaps a dumb question, but I'm not shy.

Could the #ifdefs be turned into real ifs?  With the existing compile time switch defining the initial (default) value?  And js.c could call the following new entry point early in its processing:

extern JS_PUBLIC_API(void)
JS_SetStringsAreUTF8(JSBool utf8);

If desired, then over a period of many many years, first the default could be switched, and ultimately the the non-unicode code paths could be removed.

Jason Orendorff [:jorendorff]

Comment 12

•

17 years ago

Here's another thought.  Maybe we can leave JS_C_STRINGS_ARE_UTF8 alone and just modify print() to Do The Right Thing.  The Right Thing here depends on the OS.  On Unix, the locale API tells you what charset to use, right?

This is still a backwards-incompatible change, but a much smaller one.  It won't affect embedders, since print() only exists in the shell.

Sam Ruby

Assignee

Comment 13

•

17 years ago

If you change print without changing readline, then "Señor" becomes "SeÃ±or".

It is true that Print and ReadLine are in js.c.  It could have it's own private versions of STRING<->JSVAL without affecting the jsapi at all.

Blake Kaplan (:mrbkap) (inactive)

Comment 14

•

17 years ago

How much work (and what sort of performance/codesize hit would it be to make this a runtime option instead of a compile time option)?

Bob Clary [:bc] (inactive)

Comment 15

•

17 years ago

(In reply to comment #7)

bug 397530

Sam Ruby

Assignee

Comment 16

•

17 years ago

(In reply to comment #14)
> How much work (and what sort of performance/codesize hit would it be to make
> this a runtime option instead of a compile time option)?

The functions involved are small.  The way to minimize the performance impact would be to have one if check (against a new JSBool added to struct JSRuntime?) per call, as opposed to potentially one if check per character.  This expands the codesize a bit, but again, the amount of code is nearly trivial.

As to the effort, I think it will take longer to figure out what to do than to do it.  If it helps, I could code up a patch for the above.  The one thing I would appreciate before doing so is some indication of what the desired API to set the runtime option would be.

Brendan Eich [:brendan]

Comment 17

•

17 years ago

There's a chance the runtime cost could regress Tp or another tinderboxed metric -- stranger things have happened.

The JS_CStringsAreUTF8 function takes no parameters, so can't be per-runtime. It can only reflect a macro or a true global. If we think each JSRuntime (disjoint GC heap containing strings and their byte-wide-char deflations) shold be able to vary the proposed runtime flag, then we would want to deprecate and stub out JS_CStringsAreUTF8 and provide both

JS_PUBLIC_API(JSBool)
JS_GetCStringsAreUTF8(JSRuntime *rt);

JS_PUBLIC_API(JSBool)
JS_SetCStringsAreUTF8(JSRuntime *rt, JSBool flag);

where the latter returns the old value of rt->CstringsAreUTF8 after updating that new member of JSRuntime, and the C <=> UTF16 string transcoders test that member fetched from cx->runtime.

We'd have to get perf testing somehow, possibly by just checking in and seeing what changed. Code size hit would be good to note for a few top platforms.

/be

Sam Ruby

Assignee

Comment 18

•

17 years ago

Attached patch Patch for evaluation purposes (obsolete) — Details — Splinter Review

(In reply to comment #17)
> There's a chance the runtime cost could regress Tp or another tinderboxed
> metric -- stranger things have happened.

Eek!

> The JS_CStringsAreUTF8 function takes no parameters, so can't be per-runtime.
> It can only reflect a macro or a true global. If we think each JSRuntime
> (disjoint GC heap containing strings and their byte-wide-char deflations) shold
> be able to vary the proposed runtime flag, then we would want to deprecate and
> stub out JS_CStringsAreUTF8 and provide both

It is not clear to me how to do that, so the attached patch does not include the deprecated stub.

> JS_PUBLIC_API(JSBool)
> JS_GetCStringsAreUTF8(JSRuntime *rt);
> 
> JS_PUBLIC_API(JSBool)
> JS_SetCStringsAreUTF8(JSRuntime *rt, JSBool flag);
> 
> where the latter returns the old value of rt->CstringsAreUTF8 after updating
> that new member of JSRuntime, and the C <=> UTF16 string transcoders test that
> member fetched from cx->runtime.

Thanks!
 
> We'd have to get perf testing somehow, possibly by just checking in and seeing
> what changed. Code size hit would be good to note for a few top platforms.

I've attached a patch for evaluation.  

Questions/comments:

* Should js have a command line option to toggle this setting?  My preference would be that the default be to "do the right thing", but that coupled with a toggle could be confusing.  For evaluation purposes, I the attached patch has js.c default to UTF8 on Unix systems, and non UTF8 everywhere else.

* Additionally, there probably should be a shell function for this setting.

* I'm not thrilled by the fact that this required QuoteString to be passed a context.  If there is any measurable runtime overhead, this is likely to be the cause.  Hopefully this overhead will be negligible compared to the preexisting calls to sprintf.

* I couldn't figure out any way that %hc was ever used in a context other than utf8, do I did not make the change to pass context to dosprintf.

* After the context is no longer available during shutdown, the code behaves as if utf8 was NOT set.  non-ASCII characters may not be processed correctly at such times.

Sam Ruby

Assignee

Comment 19

•

17 years ago

It occurs to me that a JS_free(cx, buffer) might be a good idea in the Print function.  And perhaps even an if(buffer) or two.

Brendan Eich [:brendan]

Updated

•

17 years ago

Attachment #282392 - Flags: review?(igor)

Igor Bukanov

Comment 20

•

17 years ago

Since the problem comes from the fact that SM may not be compiled with JS_C_STRINGS_ARE_UTF8 when an application needs UTF8 conversions, I suggest to keep JS_GetCStringsAreUTF8 as it is and add

  void JS_SetCStringsAreUTF8();

that sets a global flag. In this way the application can just call JS_SetCStringsAreUTF8 at the beginning of its execution and does not worry about JS_C_STRINGS_ARE_UTF8 definition. Moreover, to avoid issues with JS_GetStringBytes returning non-UTF8 bytes due to caching after calls to JS_SetCStringsAreUTF8 SM should disallows calls to the function after the first JSRuntime instance is created. This can be enforced with asserts.

Brendan Eich [:brendan]

Comment 21

•

17 years ago

Igor: good point, a truly global is cool if set early, and we have single-threaded restrictions on the first calls in a process to JS_NewRuntime and JS_NewContext already.

Sam, care to revise the patch? If not, please call for someone else to do it and I'll help find a hacker.

/be

Sam Ruby

Assignee

Comment 22

•

17 years ago

Attached patch static global version (obsolete) — Details — Splinter Review

* I no longer see the point of having JS_SetCStringsAreUTF8 returning the previous value as it can't be changed, but this patch continues to implement the signature as spec'ed.

* It has been over a decade since I have coded a static variable in C/C++, but if I recall correctly such was initialized and behaved differently in shared objects vs dynamic link libraries.  On one the data was shared across processes, but not the other.  But I might be misremembering this.

* I still wonder if js should have a command line option to toggle this setting?  My preference remains that the default be to "do the right thing", but that coupled with a toggle could be confusing.  Like the previous patch, this patch has js.c default to UTF8 on Unix systems, and non UTF8 everywhere else.

* A global variable that is initialized early pretty much rules out the possibility of exposing a shell function for this setting.

* Use of a global variable makes the patch considerably smaller :-)

* Now that the value is global, retaining the previous behavior for %hc is considerably easier.

* Shutdown no longer becomes an issue (if it ever was).

* I remembered to call JS_Free in Print this time.  :-)

* I've not marked the previous patch as obsolete due to these open questions.

Attachment #284761 - Flags: review?(igor)

Brendan Eich [:brendan]

Updated

•

17 years ago

Attachment #284761 - Attachment is patch: true

Attachment #284761 - Attachment mime type: application/octet-stream → text/plain

Brendan Eich [:brendan]

Comment 23

•

17 years ago

Comment on attachment 284761 [details] [diff] [review]
static global version

(In reply to comment #22)
> * It has been over a decade since I have coded a static variable in C/C++, but
> if I recall correctly such was initialized and behaved differently in shared
> objects vs dynamic link libraries.  On one the data was shared across
> processes, but not the other.  But I might be misremembering this.

Not in any OS we support, and not in any OS I know of. We use BSS (avoid explicit zero-init) as a matter of style and (usually minor) object-file size minimization.

A few nits on the patch:

> StringsAreUTF8(JSContext *cx, JSObject *obj, uintN argc, jsval *argv,
>                jsval *rval)
> {
>-    *rval = JS_CStringsAreUTF8() ? JSVAL_TRUE : JSVAL_FALSE;
>+    *rval = JS_GetCStringsAreUTF8() ? JSVAL_TRUE : JSVAL_FALSE;

Could avoid branches (no real perf worries, just noting) and keep source shorter (real win here) by using BOOLEAN_TO_JSVAL(JS_GetCStringsAreUTF8()).

>+// The following is initially undefined, changeable; but once set
>+// it never should be changed.

C-style comments for now, please -- we have downstreams using non-GCC and non-MSVC. Major comment style is documented in README.html (major == multiline).

>+static JSBool CStringsAreUTF8 = -(JS_TRUE);

No need for this (still wondering what OS you might have been remembering ;-).

> JS_PUBLIC_API(JSBool)
>-JS_CStringsAreUTF8()
>+JS_GetCStringsAreUTF8()
> {
>-#ifdef JS_C_STRINGS_ARE_UTF8
>-    return JS_TRUE;
>+    if (CStringsAreUTF8 == -(JS_TRUE)) {
>+        /* Only default to UTF-8 on Unix */
>+#ifdef XP_UNIX
>+        CStringsAreUTF8 = JS_TRUE;
> #else
>-    return JS_FALSE;
>+        CStringsAreUTF8 = JS_FALSE;
> #endif
>+    }

So just #ifdef the static initialiser and simplify this API to:

>+
>+    return CStringsAreUTF8;
>+}
>+
>+

Nit: one blank line between functions is house style.

>+JS_PUBLIC_API(JSBool)
>+JS_SetCStringsAreUTF8(JSBool flag)
>+{
>+    JS_ASSERT(CStringsAreUTF8==-(JS_TRUE) || CStringsAreUTF8==(flag!=JS_FALSE));
>+    CStringsAreUTF8 = flag ? JS_TRUE : JS_FALSE;

No need for ?: here.

Have to stop here, but Igor should do the real review. IIRC he was traveling back to Norway from California over the weekend.

/be

Sam Ruby

Assignee

Comment 24

•

17 years ago

(In reply to comment #23)
> (From update of attachment 284761 [details] [diff] [review])
> (In reply to comment #22)
> > * It has been over a decade since I have coded a static variable in C/C++, but
> > if I recall correctly such was initialized and behaved differently in shared
> > objects vs dynamic link libraries.  On one the data was shared across
> > processes, but not the other.  But I might be misremembering this.
> 
> Not in any OS we support, and not in any OS I know of. We use BSS (avoid
> explicit zero-init) as a matter of style and (usually minor) object-file size
> minimization.

Cool.

> A few nits on the patch:
> 
> > StringsAreUTF8(JSContext *cx, JSObject *obj, uintN argc, jsval *argv,
> >                jsval *rval)
> > {
> >-    *rval = JS_CStringsAreUTF8() ? JSVAL_TRUE : JSVAL_FALSE;
> >+    *rval = JS_GetCStringsAreUTF8() ? JSVAL_TRUE : JSVAL_FALSE;
> 
> Could avoid branches (no real perf worries, just noting) and keep source
> shorter (real win here) by using BOOLEAN_TO_JSVAL(JS_GetCStringsAreUTF8()).

I didn't do it.  :-)

> >+// The following is initially undefined, changeable; but once set
> >+// it never should be changed.
> 
> C-style comments for now, please -- we have downstreams using non-GCC and
> non-MSVC. Major comment style is documented in README.html (major ==
> multiline).

My bad.

> >+static JSBool CStringsAreUTF8 = -(JS_TRUE);
> 
> No need for this (still wondering what OS you might have been remembering ;-).

I guess my comment wasn't clear enough.  I want a three state boolean, with three values: not-yet-set, set-false, and set-true.  A call to Get before Set will convert this to the default value (a value it will hold for the remainder of the process).  Once set, calls to re-set it to the same value are OK, but calls to change it will trigger an assert.

> > JS_PUBLIC_API(JSBool)
> >-JS_CStringsAreUTF8()
> >+JS_GetCStringsAreUTF8()
> > {
> >-#ifdef JS_C_STRINGS_ARE_UTF8
> >-    return JS_TRUE;
> >+    if (CStringsAreUTF8 == -(JS_TRUE)) {
> >+        /* Only default to UTF-8 on Unix */
> >+#ifdef XP_UNIX
> >+        CStringsAreUTF8 = JS_TRUE;
> > #else
> >-    return JS_FALSE;
> >+        CStringsAreUTF8 = JS_FALSE;
> > #endif
> >+    }
> 
> So just #ifdef the static initialiser and simplify this API to:
> 
> >+
> >+    return CStringsAreUTF8;
> >+}

See explanation above.

> >+
> >+
> 
> Nit: one blank line between functions is house style.

Generally, mine too.  Again, my bad.

> >+JS_PUBLIC_API(JSBool)
> >+JS_SetCStringsAreUTF8(JSBool flag)
> >+{
> >+    JS_ASSERT(CStringsAreUTF8==-(JS_TRUE) || CStringsAreUTF8==(flag!=JS_FALSE));
> >+    CStringsAreUTF8 = flag ? JS_TRUE : JS_FALSE;
> 
> No need for ?: here.

Just trying to follow the (bad) example above.  :-)

I also want to prevent against setting this flag to -1.

> Have to stop here, but Igor should do the real review. IIRC he was traveling
> back to Norway from California over the weekend.
> 
> /be

- Sam Ruby

Igor Bukanov

Comment 25

•

17 years ago

Comment on attachment 284761 [details] [diff] [review]
static global version

>diff -u -r3.167 js.c
>@@ -714,12 +714,17 @@
> {
>     uintN i;
>     JSString *str;
>+    char *bytes;
> 
>     for (i = 0; i < argc; i++) {
>         str = JS_ValueToString(cx, argv[i]);
>         if (!str)
>             return JS_FALSE;
>-        fprintf(gOutFile, "%s%s", i ? " " : "", JS_GetStringBytes(str));
>+        bytes = js_DeflateString(cx, JSSTRING_CHARS(str), JSSTRING_LENGTH(str));
>+        if (bytes) {
>+            fprintf(gOutFile, "%s%s", i ? " " : "", bytes);
>+            JS_free(cx, bytes);
>+        }
>     }

The code should use a pattern like 

if (!bytes)
    return JS_FALSE;

to properly propagate the failure.

>diff -u -r3.161 jsapi.h
>--- jsapi.h	1 Oct 2007 00:31:22 -0000	3.161
>+++ jsapi.h	13 Oct 2007 12:46:37 -0000
>@@ -2231,11 +2231,18 @@
> 
> /*
>  * Return JS_TRUE if C (char []) strings passed via the API and internally
>- * are UTF-8. The source must be compiled with JS_C_STRINGS_ARE_UTF8 defined
>- * to get UTF-8 support.
>+ * are UTF-8.
>  */
> JS_PUBLIC_API(JSBool)
>-JS_CStringsAreUTF8();
>+JS_GetCStringsAreUTF8();


Lets not change the API here and stick with JS_CStringsAreUTF8().

>+JS_PUBLIC_API(JSBool)
>+JS_SetCStringsAreUTF8(JSBool flag);

That should be JS_PUBLIC_API(void) as JS_CStringsAreUTF8() provide a way to query for the flag.


> 
> /*
>  * Character encoding support.
>@@ -2253,11 +2260,11 @@
>  * NB: Neither function stores an additional zero byte or jschar after the
>  * transcoded string.
>  *
>- * If the source has been compiled with the #define JS_C_STRINGS_ARE_UTF8 to
>- * enable UTF-8 interpretation of C char[] strings, then JS_EncodeCharacters
>- * encodes to UTF-8, and JS_DecodeBytes decodes from UTF-8, which may create
>- * addititional errors if the character sequence is malformed.  If UTF-8
>- * support is disabled, the functions deflate and inflate, respectively.
>+ * If cStringsAreUtf8 is set to JS_TRUE to enable UTF-8 interpretation

The comments should refer to JS_CStringsAreUTF8(), not an internal flag.

>Index: jsopcode.c
>===================================================================
>RCS file: /cvsroot/mozilla/js/src/jsopcode.c,v
>retrieving revision 3.266
>diff -u -r3.266 jsopcode.c
>--- jsopcode.c	17 Sep 2007 19:11:52 -0000	3.266
>+++ jsopcode.c	13 Oct 2007 12:46:38 -0000
>@@ -530,6 +530,7 @@
>     const jschar *s, *t, *z;
>     const char *e;
>     char *bp;
>+    JSBool CStringsAreUtf8 = JS_GetCStringsAreUTF8();

There is no need to add an overhead of a function call here. Make CStringsAreUtf8 a global named like 
js_CStringsAreUtf8 and declared in jsprvtd.h/ defined in jsapi.c as:

#ifdef JS_C_STRINGS_ARE_UTF8
# define js_CStringsAreUtf8 JS_TRUE
#else
extern JSBool js_CStringsAreUtf8;
#endif

#ifndef JS_C_STRINGS_ARE_UTF8
JSBool js_CStringsAreUtf8;
#endif

The use in Spider Monkey sources js_CStringsAreUtf8, not the function call. In this way when the source is compiled with JS_C_STRINGS_ARE_UTF8 defined the compiler can optimize away the if checks for the js_CStringsAreUtf8.

Igor Bukanov

Comment 26

•

17 years ago

(In reply to comment #23)
> Igor should do the real review. IIRC he was traveling
> back to Norway from California over the weekend.

I am traveling to Norway on Tuesday as I stay with my friend at Seattle for few days.

Brendan Eich [:brendan]

Comment 27

•

17 years ago

I had to find this via Sam's blog (not a bad route, but:). This bug needs an assignee. Sam or Igor, please take it.

I like compiling out tests, so comment 25 seems good. If the API were not yet shipped in a standalone js1.7 release, I would prefer JS_{Get,Set}CStringsAreUTF8 but IIRC (bclary, please confirm) we have tagged 1.7 and it has JS_CStringsAreUTF8 already in jsapi.[ch].

/be

Bob Clary [:bc] (inactive)

Comment 28

•

17 years ago

(In reply to comment #27)
>
> (bclary, please confirm) we have tagged
> 1.7 and it has JS_CStringsAreUTF8 already in jsapi.[ch].

yes.

Sam Ruby

Assignee

Comment 29

•

17 years ago

Attached patch js_CStringsAreUtf8 is now a global instead of a function (obsolete) — Details — Splinter Review

* I assume we are now OK with Set returning void?
 * JS_NewRuntime locks in the value of js_CStringsAreUTF8
 * Kept in default to utf-8, but only on Unix.

Attachment #282392 - Attachment is obsolete: true

Attachment #284761 - Attachment is obsolete: true

Attachment #286398 - Flags: review?

Attachment #282392 - Flags: review?(igor)

Attachment #284761 - Flags: review?(igor)

Brendan Eich [:brendan]

Comment 30

•

17 years ago

Comment on attachment 286398 [details] [diff] [review]
js_CStringsAreUtf8 is now a global instead of a function

Unfortunately, bugzilla does not guess or default if you set the r? flag but do not fill in the text field that appears next to it for that value -- setting to ask Igor to review.

/be

Attachment #286398 - Flags: review? → review?(igor)

Igor Bukanov

Comment 31

•

17 years ago

Comment on attachment 286398 [details] [diff] [review]
js_CStringsAreUtf8 is now a global instead of a function

>diff -u -r3.168 js.c
>--- js/src/js.c	13 Oct 2007 20:09:48 -0000	3.168
>+++ js/src/js.c	27 Oct 2007 13:10:02 -0000
>@@ -714,12 +714,17 @@

Tip: use diff -U8 -p with patches for more context and function names. 

> {
>     uintN i;
>     JSString *str;
>+    char *bytes;
> 
>     for (i = 0; i < argc; i++) {
>         str = JS_ValueToString(cx, argv[i]);
>         if (!str)
>             return JS_FALSE;
>-        fprintf(gOutFile, "%s%s", i ? " " : "", JS_GetStringBytes(str));
>+        bytes = js_DeflateString(cx, JSSTRING_CHARS(str), JSSTRING_LENGTH(str));

js_DeflateString is internal SpiderMonkey. Although jsshell sometimes has to use them, it is better to use only public API from jsapi.h. This suggests to add the following declarations with their implementations in jsapi.c:

JS_PUBLIC_API(char *)
JS_EncodeString(JSContext *cx, JSString *str);
 
that would just call
js_DeflateString(cx, JSSTRING_CHARS(str), JSSTRING_LENGTH(str));
 
> JS_PUBLIC_API(JSBool)
> JS_CStringsAreUTF8()
> {
...
>+}
>+
>+JS_PUBLIC_API(void)
>+JS_SetCStringsAreUTF8(JSBool flag)
>+{
...
> }
> 

For compatibility JS_CStringsAreUTF8() must return false by default. This also suggest that SpiderMonkey should just define:

JS_SetCStringsAreUTF8(void), not JS_SetCStringsAreUTF8(JSBool)

so it would be an embedding that decides about UTF-8, not the engine.

Then jsshell can call the function at the beginning of main() like in:

const char *s;
...

s = getenv("LANG");
if (s && strstr(s, "UTF-8"))
    JS_SetCStringsAreUTF8();

On Fedora, Debian and Ubuntu this will correctly set JS_SetCStringsAreUTF8() as necessary. I will set r+ on a patch with these changes and will help to land it.

Sam Ruby

Assignee

Comment 32

•

17 years ago

Attached patch incorporate suggestions by Igor Bukanov (obsolete) — Details — Splinter Review

add JS_EncodeString public API
JS_SetCStringsAreUTF8 now accepts no parameters
js.c now sets strings to be UTF-8 based on the LANG environment variable
patch now in diff -U8 -p format

Attachment #286398 - Attachment is obsolete: true

Attachment #287560 - Flags: review?

Attachment #286398 - Flags: review?(igor)

Sam Ruby

Assignee

Updated

•

17 years ago

Attachment #287560 - Flags: review? → review?(igor)

Sam Ruby

Assignee

Comment 33

•

17 years ago

(In reply to comment #30)
> (From update of attachment 286398 [details] [diff] [review])
> Unfortunately, bugzilla does not guess or default if you set the r? flag but do
> not fill in the text field that appears next to it for that value -- setting to
> ask Igor to review.

"You entered a username that matched more than one user, so we have instead left the requestee field blank."

Apparently there are more than 100 igor's and three Igor Bukanov's.  I selected <igor@mir2.org> this time.

Igor Bukanov

Comment 34

•

17 years ago

Comment on attachment 287560 [details] [diff] [review]
incorporate suggestions by Igor Bukanov

> JS_PUBLIC_API(JSRuntime *)
> JS_NewRuntime(uint32 maxbytes)
> {
>     JSRuntime *rt;
> 
>+    /* lock in choice between utf-8 support and inflate/deflate */
>+    JS_CStringsAreUTF8();
>+
> #ifdef DEBUG
>     static JSBool didFirstChecks;
> 
...
>@@ -5198,23 +5201,49 @@ JS_EncodeCharacters(JSContext *cx, const
>+/*
>+ * The following is a three state boolean value: initially undefined
>+ * and changeable; but once set it can never be changed.
>+ */
>+#ifndef JS_C_STRINGS_ARE_UTF8
>+JSBool js_CStringsAreUTF8 = -(JS_TRUE);
>+#endif

I do not think that we should insist that JS_SetCStringsAreUTF8() can be called only once. For example, a pattern like:

int main()
{
    ...
    if (want_utf8_via_environment)
        JS_SetCStringsAreUTF8();
    .....
    if (want_utf8_via_option)
        JS_SetCStringsAreUTF8();

is reasonable.


Thus JS_SetCStringsAreUTF8() should just assert that JS_NewRuntime has not been invoked before the call. Since JS_NewRuntime already has "static JSBool didFirstChecks", that can be used for the assertion.

Sam Ruby

Assignee

Comment 35

•

17 years ago

(In reply to comment #34)
> (From update of attachment 287560 [details] [diff] [review])
> > JS_PUBLIC_API(JSRuntime *)
> > JS_NewRuntime(uint32 maxbytes)
> > {
> >     JSRuntime *rt;
> > 
> >+    /* lock in choice between utf-8 support and inflate/deflate */
> >+    JS_CStringsAreUTF8();
> >+
> > #ifdef DEBUG
> >     static JSBool didFirstChecks;
> > 
> ...
> >@@ -5198,23 +5201,49 @@ JS_EncodeCharacters(JSContext *cx, const
> >+/*
> >+ * The following is a three state boolean value: initially undefined
> >+ * and changeable; but once set it can never be changed.
> >+ */
> >+#ifndef JS_C_STRINGS_ARE_UTF8
> >+JSBool js_CStringsAreUTF8 = -(JS_TRUE);
> >+#endif
> 
> I do not think that we should insist that JS_SetCStringsAreUTF8() can be called
> only once. For example, a pattern like:
> 
> int main()
> {
>     ...
>     if (want_utf8_via_environment)
>         JS_SetCStringsAreUTF8();
>     .....
>     if (want_utf8_via_option)
>         JS_SetCStringsAreUTF8();
> 
> is reasonable.

The above sequence would not cause an assertion to fail with the current patch.    

> Thus JS_SetCStringsAreUTF8() should just assert that JS_NewRuntime has not been
> invoked before the call. Since JS_NewRuntime already has "static JSBool
> didFirstChecks", that can be used for the assertion.

The current patch allows JS_SetCStringsAreUTF8 to be called any number of times, just so long as the first time it was called is before the first call to JS_NewRuntime.

didFirstChecks currently is an internal static variable, and therefore only accessible to JS_NewRuntime.  It certainly could be renamed and moved out to jsprvtd.h.

It might just be me, but the current patch does what you want and (again, to me) seems a small bit cleaner.  But let me know if you want a new patch, and I will comply.

Igor Bukanov

Comment 36

•

17 years ago

(In reply to comment #35)
> (In reply to comment #34)
> The above sequence would not cause an assertion to fail with the current patch. 

Sorry, I was not clear. The current patch does not allow the following:

if (!JS_CStringsAreUTF8() && some_complex_checks_for_environment())
    JS_SetCStringsAreUTF8();

In addition the patch uses a 3-state boolean where a normal 2 state + debug-only-helper (like didFirstChecks) would do the assert job and make non-debug code smaller. What exactly I have in mind is the following implementation:

#ifdef DEBUG
static JSBool NewRuntimeWasCalled = JS_FALSE;
#endif

JS_PUBLIC_API(JSRuntime *)
JS_NewRuntime(uint32 maxbytes)
{
    JSRuntime *rt;

#ifdef DEBUG
    if (!NewRuntimeWasCalled) {
...
}
...

#ifndef JS_C_STRINGS_ARE_UTF8
JSBool js_CStringsAreUTF8 = JS_FALSE;
#endif

JS_PUBLIC_API(JSBool)
JS_CStringsAreUTF8()
{
    return js_CStringsAreUTF8;
}

JS_PUBLIC_API(void)
JS_SetCStringsAreUTF8()
{
    JS_ASSERT(!NewRuntimeWasCalled);
    js_CStringsAreUTF8 = JS_TRUE;
}

Could you update the patch accordingly?

Sam Ruby

Assignee

Comment 37

•

17 years ago

Attached patch incorporate additional suggestions by Igor Bukanov (obsolete) — Details — Splinter Review

Attachment #287560 - Attachment is obsolete: true

Attachment #287602 - Flags: review?

Attachment #287560 - Flags: review?(igor)

Sam Ruby

Assignee

Updated

•

17 years ago

Attachment #287602 - Flags: review? → review?(igor)

Sam Ruby

Assignee

Comment 38

•

17 years ago

(In reply to comment #36)
> 
> #ifdef DEBUG
> static JSBool NewRuntimeWasCalled = JS_FALSE;
> #endif

I hope you don't mind, but the above variable name seemed more consistent with a js_ prefix.  Let me know if you disagree.
 
> JS_PUBLIC_API(void)
> JS_SetCStringsAreUTF8()
> {
>     JS_ASSERT(!NewRuntimeWasCalled);
>     js_CStringsAreUTF8 = JS_TRUE;
> }

An ifdef is required around that assignment.
 
> Could you update the patch accordingly?

Done.

Igor Bukanov

Comment 39

•

17 years ago

Comment on attachment 287602 [details] [diff] [review]
incorporate additional suggestions by Igor Bukanov

>+/*
>+ * Has a new runtime ever been created?  Used to detect unsafe changes to
>+ * js_CStringsAreUTF8 after a runtime has been created, and to ensure that
>+ * "first checks" on runtime creation are run only once.
>+ */
>+#ifdef DEBUG
>+static JSBool js_NewRuntimeWasCalled = JS_FALSE;
>+#endif


A static variable should not be defined in the header. Move the definition before js_NewRutime method and you can assume r+ for such patch. Note that I will be traveling for the next few days so I may do the final review and land the patch only on 2007-11-12.

Sam Ruby

Assignee

Comment 40

•

17 years ago

Attached patch move the declaration of js_NewRuntimeWasCalled (obsolete) — Details — Splinter Review

Attachment #287602 - Attachment is obsolete: true

Attachment #287669 - Flags: review?(igor)

Attachment #287602 - Flags: review?(igor)

Igor Bukanov

Comment 41

•

17 years ago

Comment on attachment 287669 [details] [diff] [review]
move the declaration of js_NewRuntimeWasCalled

The patch simplifies UTF-8 support for SpiderMonkey embeddings.

Attachment #287669 - Flags: review?(igor)

Attachment #287669 - Flags: review+

Attachment #287669 - Flags: approval1.9?

Igor Bukanov

Comment 42

•

17 years ago

Nominating for 1.9/1.8.1 so API for UTF-8 support will be available even for 1.8.1-based embeddings.

Flags: blocking1.9?

Flags: blocking1.8.1.10?

Daniel Veditz [:dveditz]

Updated

•

17 years ago

Flags: blocking1.8.1.10? → blocking1.8.1.11?

Mike Schroepfer

Comment 43

•

17 years ago

Have we run this through the js regression test suite and are there any new tests - just curious about regression risk here...

Mike Schroepfer

Comment 44

•

17 years ago

Minusing from blocking - but would take the patch if regression concerns are addressed

Flags: blocking1.9? → blocking1.9-

Bob Clary [:bc] (inactive)

Comment 45

•

17 years ago

(In reply to comment #43)
> Have we run this through the js regression test suite and are there any new
> tests - just curious about regression risk here...
> 

I haven't tested the patch, but can fairly easily. I'll do that later today. With regards to new tests, I'll leave that to others to decide.

Bob Clary [:bc] (inactive)

Comment 46

•

17 years ago

sam, the patch has bitrotted. Can you attach a refreshed one?

Sam Ruby

Assignee

Comment 47

•

17 years ago

Attached patch resync patch (obsolete) — Details — Splinter Review

I see minor shifts in line offsets, but no significant bitrot.  In any case, I've attached a patch against the current CVS head.

Attachment #287669 - Attachment is obsolete: true

Attachment #288570 - Flags: approval1.9?

Attachment #287669 - Flags: approval1.9?

Igor Bukanov

Updated

•

17 years ago

Attachment #288570 - Flags: review+

Mike Schroepfer

Comment 48

•

17 years ago

will approve pending js test results - thanks folks!

Bob Clary [:bc] (inactive)

Comment 49

•

17 years ago

fyi, i had some crashers during the test run related to GC which I can't reproduce and am dealing with a number of changed error messages which makes my regression detection more laborious. I'll kick off a new set of builds and try again tonight. sorry for the delay.

Bob Clary [:bc] (inactive)

Comment 50

•

17 years ago

on linux opt|debug shell from last night: js1_5/Regress/regress-315974.js now runs out of memory. I had to modify the patch to get it to apply with the current trunk. I'll attach it next.

Bob Clary [:bc] (inactive)

Comment 51

•

17 years ago

Attached patch tested version of patch (obsolete) — Details — Splinter Review

Bob Clary [:bc] (inactive)

Comment 52

•

17 years ago

Also, I had crashed in debug browser builds on linux with js1_5/GC/regress-203278-2.js and js1_5/Regress/regress-203278-1.js during the automated run but can't reproduce them running standalone with or without the spider. 

You can run this from the browser at 

<http://test.bclary.com/tests/mozilla.org/js/js-test-driver-standards.html?test=js1_5/GC/regress-203278-2.js;language=type;text/javascript>

<http://test.bclary.com/tests/mozilla.org/js/js-test-driver-standards.html?test=js1_5/Regress/regress-203278-1.js;language=type;text/javascript>

Mike Schroepfer

Comment 53

•

17 years ago

Bob/Sam - so what's the next step?

Bob Clary [:bc] (inactive)

Comment 54

•

17 years ago

I've update my test tree and will re-run the tests this afternoon. Hopefully the crashes will be gone or I'll be able to reproduce them.

Sam Ruby

Assignee

Comment 55

•

17 years ago

(In reply to comment #53)
> Bob/Sam - so what's the next step?

Best case: there is a bug in my patch that was missed in the review.

Worst case: there is a bug in the utf-8 logic that previously was only enabled by a compiler switch.

Commenting out the call to JS_SetCStringsAreUTF8(); in js.c may help determine which of the two cases we are faced with.

Bob Clary [:bc] (inactive)

Comment 56

•

17 years ago

(In reply to comment #55)
> 
> Commenting out the call to JS_SetCStringsAreUTF8(); in js.c may help determine
> which of the two cases we are faced with.
> 

js1_5/Regress/regress-315974.js does not fail with out of memory at ./js1_5/Regress/regress-315974.js:79 if the call is commented out.

Mike Beltzner [:beltzner, not reading bugmail]

Comment 57

•

17 years ago

Comment on attachment 288570 [details] [diff] [review]
resync patch

(clearing approval request as per comments above)

Attachment #288570 - Flags: approval1.9?

Igor Bukanov

Comment 58

•

17 years ago

(In reply to comment #50)
> on linux opt|debug shell from last night: js1_5/Regress/regress-315974.js now
> runs out of memory. I had to modify the patch to get it to apply with the
> current trunk. I'll attach it next.

The problem with js1_5/Regress/regress-315974.js is the bogus code in QuoteString from jsopcode.c. It should never report errors for bad surrogate pairs and rather just fallback to using \u escapes for such chars.

This is an existing problem and should be addressed by another. For now I suggest to remove JS_SetCStringsAreUTF8, test and land this patch, then fix QuoteString and then add JS_SetCStringsAreUTF8 to the shell.

Brendan Eich [:brendan]

Comment 59

•

17 years ago

Ok, who will update the patch, file the followup bug(s) on fixing QuoteString/re-adding JS_SetCStringsAreUTF8?

/be

Sam Ruby

Assignee

Comment 60

•

17 years ago

(In reply to comment #59)
> Ok, who will update the patch, file the followup bug(s) on fixing
> QuoteString/re-adding JS_SetCStringsAreUTF8?

1) To the best of my knowledge, my patches were against CVS HEAD, but consistently they were found to be suffering from "bitrot" and needing modification to work.  Ideally somebody will tell me that I should have been using a different tag, at which point, it will all make sense, and I can start contributing directly again.

2) The latest patch, minus the lines in the block "@@ -3422,16 +3427,23 @@ main(int argc, char **argv, char **envp)", seem to be what is recommended to move forward.  I could manually edit the patch file, but I would really prefer to solve #1 first.

3) While this puts the API in place, there is now a known problem if this API or if the compile time flags are used.  I'm not familiar enough with this codebase to know how to chase memory leaks, but if a test case were provided, I would gladly pursue it.  Preferably after #1 is addressed.

4) Neither here nor there, but a while back (10-24) you asked that this bug be assigned.  I don't have the necessary authority to do so.

Bob Clary [:bc] (inactive)

Comment 61

•

17 years ago

over to sam.

Assignee: general → rubys

Igor Bukanov

Updated

•

17 years ago

Depends on: 406555

Igor Bukanov

Comment 62

•

17 years ago

To proceed I suggest to fix the bug 406555 first and then land the updated patch as with that bug fixed there would be no regressions.

The current behavior of the decompiler exposed by this bug is wrong as it the job of the decompiler is to produce UCS-2 JS strings from something that is build from UCS-2 sources or strings. These has nothing to do with UTF8 conversion of JS strings into byte sequences. If one wants to avoid escaping non-ascii symbols in strings, then the decompiler should deal on its own using a special option or flag that properly supports UCS-2 with, for example, invalid UTF-16 surrogate pairs.

Igor Bukanov

Comment 63

•

17 years ago

(In reply to comment #60 by Sam Ruby)
> 2) The latest patch, minus the lines in the block "@@ -3422,16 +3427,23 @@
> main(int argc, char **argv, char **envp)", seem to be what is recommended to
> move forward.

Could you update the patch to sync with CVS head? It should *not* include the code to activate UTF8 option in js shell. That can be done in another bug to exclude a possibility of affecting jsshell output. I will review the patch quickly and ask for its approval so it can be included in beta2 of FF3.

Sam Ruby

Assignee

Comment 64

•

17 years ago

Attached patch sync with CVS head; remove code to activate UTF8 option in js shell (obsolete) — Details — Splinter Review

Methodology: took the latest (tested version of the patch), applied it to CVS HEAD, noded that there were two rejections in jsopcode.c.  Examined both rejections and came to the conclusion that that code no longer needed to be patched (and wondered if those changes would have caused the test failure?).  Compiled and ran my tests, verified that they worked.  Removed the lines in js.c which enable the UTF-8 support, and verified that the original behavior was restored.

If this patch gets applied, should the title of this bug be changed?

Attachment #288570 - Attachment is obsolete: true

Attachment #291456 - Flags: review?(igor)

Igor Bukanov

Comment 65

•

17 years ago

Comment on attachment 291456 [details] [diff] [review]
sync with CVS head; remove code to activate UTF8 option in js shell

>Index: jsapi.c
>===================================================================
..
>+/*
>+ * Has a new runtime ever been created?  Used to detect unsafe changes to
>+ * js_CStringsAreUTF8 after a runtime has been created, and to ensure that
>+ * "first checks" on runtime creation are run only once.
>+ */

Nit: I am not a native speaker, but "Used to detect" is not a proper English grammar. Something like "The flag is used" would be better.

>+/*
>+ * The following determines whether C Strings are to be treated as UTF-8
>+ * or ISO-8859-1/WIN-1252.  For correct operation, it must be set prior
>+ * to the first call to JS_NewRuntime.
>+ */

The comments are not true: Windows-1252 is not ISO-8859-1 as it includes, for example, the euro sign, which is not in ISO-8859-1.


>Index: jsapi.h
>===================================================================
>- * If the source has been compiled with the #define JS_C_STRINGS_ARE_UTF8 to
>- * enable UTF-8 interpretation of C char[] strings, then JS_EncodeCharacters
>- * encodes to UTF-8, and JS_DecodeBytes decodes from UTF-8, which may create
>- * addititional errors if the character sequence is malformed.  If UTF-8
>- * support is disabled, the functions deflate and inflate, respectively.
>+ * If JS_CStringsAreUTF8() is JS_TRUE then JS_EncodeCharacters encodes to
>+ * UTF-8, and JS_DecodeBytes decodes from UTF-8, which may create addititional
>+ * errors if the character sequence is malformed.  If UTF-8 support is
>+ * disabled, the functions deflate and inflate, respectively.
>  */

Nit: SpiderMonkey style is to write true/false in comments when referring to boolean values, not JS_TRUE/JS_FALSE. Also "additional" is misspelled as "addititional".

>Index: jsprf.c
>                 /*
>                  * This would do a simple string/byte conversion
>-                 * if JS_C_STRINGS_ARE_UTF8 is not defined.
>+                 * unless CStringsAreUTF8.
>                  */

Nit: use js_CStringsAreUTF8, not CStringsAreUTF8 in comments to avoid confusion.

>Index: jsstr.c
> JSBool
>-js_InflateStringToBuffer(JSContext *cx, const char *src, size_t srclen,
>-                         jschar *dst, size_t *dstlenp)
>+js_InflateStringToBuffer(JSContext* cx, const char *src, size_t srclen,
>+                         jschar *dst, size_t* dstlenp)

Nit: preserve here the original signature. SpiderMonkey's style is to use T *t, not T* t. 

I will r+ on a patch with this fixed.

Igor Bukanov

Comment 66

•

17 years ago

(In reply to comment #64)
> and wondered if those changes would have caused the test failure?

This was the case: the previous patch just exposed the bug in the decompiler.

> If this patch gets applied, should the title of this bug be changed?

No: the title is a reflection of the state before the bug is fixed.

Igor Bukanov

Comment 67

•

17 years ago

(In reply to comment #65)
> (From update of attachment 291456 [details] [diff] [review])
...
> >Index: jsapi.h
> >===================================================================
> >- * If the source has been compiled with the #define JS_C_STRINGS_ARE_UTF8 to
> >- * enable UTF-8 interpretation of C char[] strings, then JS_EncodeCharacters
> >- * encodes to UTF-8, and JS_DecodeBytes decodes from UTF-8, which may create
> >- * addititional errors if the character sequence is malformed.  If UTF-8
> >- * support is disabled, the functions deflate and inflate, respectively.
> >+ * If JS_CStringsAreUTF8() is JS_TRUE then JS_EncodeCharacters encodes to
> >+ * UTF-8, and JS_DecodeBytes decodes from UTF-8, which may create addititional
> >+ * errors if the character sequence is malformed.  If UTF-8 support is
> >+ * disabled, the functions deflate and inflate, respectively.
> >  */
> 
> Nit: SpiderMonkey style is to write true/false in comments when referring to
> boolean values, not JS_TRUE/JS_FALSE. Also "additional" is misspelled as
> "addititional".

That misspell is already in the code and it was not misspelled in the patch. But it would be nice to fix it.

Sam Ruby

Assignee

Comment 68

•

17 years ago

(In reply to comment #67)
> (In reply to comment #65)
> >+/*
> >+ * The following determines whether C Strings are to be treated as UTF-8
> >+ * or ISO-8859-1/WIN-1252.  For correct operation, it must be set prior
> >+ * to the first call to JS_NewRuntime.
> >+ */
> 
> The comments are not true: Windows-1252 is not ISO-8859-1 as it includes, for
> example, the euro sign, which is not in ISO-8859-1.

I understand and agree with all of the other comments.  I'm just not sure how to proceed with this one.

*technically* if you take UTF-16 and lop off the high order byte, what you end up is only correct if the data in question conforms to set of codepoints that are valid in iso-8859-1.

In *practice*, however, all of the displayable code points in ISO-8859-1 will work, *and* U+0080 will print as a Euro on Windows machines.  As well as the other 26 deviations that Microsoft made from ISO-8859-1:

http://www.intertwingly.net/stories/2004/04/14/i18n.html#CleaningWindows

As we are talking about a *print* function in code which for legacy reasons defaults to simply chopping off bytes without regard to consequences, it isn't clear to me whether it would be better to document what should be happening (iso-8859-1) vs what effectively does happen (win-1252, but only on Windows machines).

Igor Bukanov

Comment 69

•

17 years ago

(In reply to comment #68)
> In *practice*, however, all of the displayable code points in ISO-8859-1 will
> work, *and* U+0080 will print as a Euro on Windows machines.

The euro sign position in Unicode is U+20AC. So with the default strip-higher-byte "decoding" that would result in 0xAC or ¬ sign. I.e. print("\u20AC") would not print the unicode on Windows with Windows-1252 as encoding. On the other hand all symbols in ISO-8859-1 are bellow 256 by the definition. Thus using escapes to enter them in JS string literals and decoding them back to bytes will preserve the result.

This is why I suggest to remove the reference to Win-1252 from the comments while keeping the ISO standard.

Sam Ruby

Assignee

Comment 70

•

17 years ago

(In reply to comment #69)
> (In reply to comment #68)
> > In *practice*, however, all of the displayable code points in ISO-8859-1 will
> > work, *and* U+0080 will print as a Euro on Windows machines.
> 
> The euro sign position in Unicode is U+20AC. So with the default
> strip-higher-byte "decoding" that would result in 0xAC or ¬ sign. I.e.
> print("\u20AC") would not print the unicode on Windows with Windows-1252 as
> encoding. On the other hand all symbols in ISO-8859-1 are bellow 256 by the
> definition. Thus using escapes to enter them in JS string literals and decoding
> them back to bytes will preserve the result.
> 
> This is why I suggest to remove the reference to Win-1252 from the comments
> while keeping the ISO standard. 

More commonly most js programs which make use of the print statement would be more complicated forms of:

print(readline());

And the above will happily deal with Microsoft curly quotes and mdashes.

Furthermore, it is a bit more complicated than that:

https://bugzilla.mozilla.org/show_bug.cgi?id=228779

Finally, just for fun: try putting &#x80; inside a HTML page and see what firefox does with it.

But, I've said my peace.  I'll set off to create a patch as prescribed.

Sam Ruby

Assignee

Comment 71

•

17 years ago

Attached patch Address review comments. — Details — Splinter Review

Attachment #291456 - Attachment is obsolete: true

Attachment #291485 - Flags: review?(igor)

Attachment #291456 - Flags: review?(igor)

Jason Orendorff [:jorendorff]

Comment 72

•

17 years ago

I sense the need for a better understanding of the problem.  :)

print() takes 16-bit JS strings and produces 8-bit output on stdout.  Then,
separately, that stdout is consumed by some other program that interprets the
8-bit output as characters.

SpiderMonkey has no control over the consumer program.  You can pipe the js
shell to anything.  Different consumers assume different and incompatible
things about their input.

For example, on Windows (up to Windows XP), the Command Prompt window treats
output as IBM Code Page 437.  But on the exact same machine, if you pipe the
same command to a file, then open the file in Notepad, it'll assume something
different: the "ANSI code page" for that machine.  On U.S. and Western European
Windows machines, that's Windows-1252, but I think it depends on what
localization of Windows you have.  (And some Windows programs assume UTF-8.)

The non-Windows world seems to have converged on UTF-8, finally, but that's
recent and it's still not always that simple (for example, in many network
protocols and data formats, UTF-8 is *not* the default, by spec).

The best thing to do is configure SpiderMonkey to produce UTF-8.  Then if you
need the output in some other charset, you can always pipe SpiderMonkey's
output to a conversion utility (in a pinch, something like python -c "import
sys; sys.stdout.write(sys.stdin.read().decode('utf-8').encode('iso-8859-1'))"
might work for you).

Let's just put those few paragraphs into the comment.  Almost no one
understands this; we'd be doing a public service.

Igor Bukanov

Comment 73

•

17 years ago

Comment on attachment 291485 [details] [diff] [review]
Address review comments.

>Index: jsstr.c
>===================================================================
> jschar *
> js_InflateString(JSContext *cx, const char *bytes, size_t *length)
> {
>     jschar *chars = NULL;
>     size_t dstlen = 0;
>+    size_t i, bytesLength = *length;

The final nit which I should have seen on many previous occasions: bytesLength is only used with !js_CStringsAreUTF8. Thus bytesLength = *length should be moved into the else part.

Igor Bukanov

Comment 74

•

17 years ago

(In reply to comment #72)
> Let's just put those few paragraphs into the comment.  Almost no one
> understands this; we'd be doing a public service.

I suggest to do this in a separated bug. That bug is need in any case since the last patch just provides the option to switch to UTF-8 at runtime but does not activate it. Initially I suggested in jsshell to auto-sense the presence of UTF-8 in the locale setup, but that may affect the output of the test cases. So a new bug would be better a better place for that and writeup about UTF-8.

Igor Bukanov

Comment 75

•

17 years ago

Attached patch initializing locals only when necessary — Details — Splinter Review

The new version changes code in jsstr.c to initialize locals only when they will be used and makes sure that js_InflateString sets *length to zero on errors in all cases and js_InflateStringToBuffer supports NULL as dst parameter for the non-UTF8 case.

The patch also renames some locals to get consistent naming. 

To Sam Ruby: do you have a chance to look at jsstr.c changes as well?

Attachment #291861 - Flags: review?(brendan)

Igor Bukanov

Updated

•

17 years ago

Attachment #289017 - Attachment is obsolete: true

Igor Bukanov

Updated

•

17 years ago

Attachment #291485 - Flags: review?(igor)

Brendan Eich [:brendan]

Comment 76

•

17 years ago

Comment on attachment 291861 [details] [diff] [review]
initializing locals only when necessary

Declare locals in first-use order if you can, e.g. nbytes etc. before chars in js_InflateString. Likewise in js_GetDeflatedStringLength, etc.

Why zero *length on failure in js_InflateString? Generally, scalar out params are not usable by caller after a failure in SpiderMonkey. Ah, is this part of the js_InflateString* contract to fail with buffer-too-small but return a useful partial result? If so, comment liberally.

r/a=me with these nits picked. Thanks,

/be

Attachment #291861 - Flags: review?(brendan)

Attachment #291861 - Flags: review+

Attachment #291861 - Flags: approval1.9+

Igor Bukanov

Comment 77

•

17 years ago

Attached patch v12 — Details — Splinter Review

The new version addresses the nits from comment 76 plus renames size_t *length into lengthp to follow SM style.

Note that in the current code in the non UTF-8 case on errors js_InflateString sets *lengthp to 0 while the UTF-8 case does not touch it, it does not set it to a partial result. So to ensure the compatibility it is sufficient to always set  *lengthp to zero in all the cases like the previous patch did.

Attachment #292027 - Flags: review+

Attachment #292027 - Flags: approval1.9+

Sam Ruby

Assignee

Comment 78

•

17 years ago

(In reply to comment #75)
> 
> To Sam Ruby: do you have a chance to look at jsstr.c changes as well?

It is not clear to me what you are asking here.  Are you talking about additional changes?  If you are talking about the either the set of changes that were made to HEAD while this patch was under development or the various nits that are now being addressed, I'm comfortable with both.

Of course, the real proof will be when the default behavior is utf-8 on a real platform.

Igor Bukanov

Comment 79

•

17 years ago

(In reply to comment #78)
> It is not clear to me what you are asking here.

I meant the additional changes to jsstr.c compared with your patch from comment 71. Bugzilla's interdiff works in this case and shows them at

https://bugzilla.mozilla.org/attachment.cgi?oldid=291485&action=interdiff&newid=292027&headers=1

> Of course, the real proof will be when the default behavior is utf-8 on a real
> platform.

That is for sure.

Sam Ruby

Assignee

Comment 80

•

17 years ago

(In reply to comment #79)
> (In reply to comment #78)
> > It is not clear to me what you are asking here.
> 
> I meant the additional changes to jsstr.c compared with your patch from comment
> 71.

Looks fine to me.

Reed Loden [:reed]

Updated

•

17 years ago

Keywords: checkin-needed

Reed Loden [:reed]

Updated

•

17 years ago

Keywords: checkin-needed

Igor Bukanov

Comment 81

•

17 years ago

I checked in the patch from comment 77 to the trunk of CVS:

http://bonsai.mozilla.org/cvsquery.cgi?module=PhoenixTinderbox&branch=HEAD&cvsroot=%2Fcvsroot&date=explicit&mindate=1197367443&maxdate=1197367855&who=igor%mir2.org

Sam Ruby: thanks for driving this to the resolution!

Status: NEW → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

Igor Bukanov

Updated

•

17 years ago

Target Milestone: --- → mozilla1.9 M11

Bob Clary [:bc] (inactive)

Updated

•

17 years ago

Flags: in-testsuite-

Flags: in-litmus-

Daniel Veditz [:dveditz]

Comment 82

•

17 years ago

Not blocking a branch release since this is a dev tool, but if you request approval on a branch-merged patch we could look at it. It's a pretty big patch for the branch, though. What's the reward vs. the regression risk of taking this on the branch? Please make that case if you have to have this.

Flags: blocking1.8.1.12? → blocking1.8.1.12-

Patch for evaluation purposes 17 years ago Sam Ruby 32.69 KB, patch		Details \| Diff \| Splinter Review
static global version 17 years ago Sam Ruby 16.38 KB, patch		Details \| Diff \| Splinter Review
js_CStringsAreUtf8 is now a global instead of a function 17 years ago Sam Ruby 16.84 KB, patch		Details \| Diff \| Splinter Review
incorporate suggestions by Igor Bukanov 17 years ago Sam Ruby 23.53 KB, patch		Details \| Diff \| Splinter Review
incorporate additional suggestions by Igor Bukanov 17 years ago Sam Ruby 24.53 KB, patch		Details \| Diff \| Splinter Review
move the declaration of js_NewRuntimeWasCalled 17 years ago Sam Ruby 24.74 KB, patch	igor : review+	Details \| Diff \| Splinter Review
resync patch 17 years ago Sam Ruby 24.65 KB, patch	igor : review+	Details \| Diff \| Splinter Review
tested version of patch 17 years ago Bob Clary [:bc] (inactive) 24.61 KB, patch		Details \| Diff \| Splinter Review
sync with CVS head; remove code to activate UTF8 option in js shell 17 years ago Sam Ruby 22.21 KB, patch		Details \| Diff \| Splinter Review
Address review comments. 17 years ago Sam Ruby 21.96 KB, patch		Details \| Diff \| Splinter Review
initializing locals only when necessary 17 years ago Igor Bukanov 23.41 KB, patch	brendan : review+ brendan : approval1.9+	Details \| Diff \| Splinter Review
v12 17 years ago Igor Bukanov 23.59 KB, patch	igor : review+ igor : approval1.9+	Details \| Diff \| Splinter Review