495790 - jsapi should be able to use UTF-8 sources

Assignee

Description

•

15 years ago

JS_CompileFileHandle() and friends do not process input as UTF-8, even if JS_CStringsAreUTF8().

They really should.  By this, I mean that a String literal containing a UTF-8 multi-byte character should be properly promoted to UTF-16, following the same algorithm as js_InflateSTring() in jsstr.cpp.

In #jsapi conversation today, shaver suggests a JS_CompilerFileHandleUTF8() API. brendan and jorendorff think it should 'just work' if JS_CStringsAreUTF8().

I'm going to take a read through the sources and figure out where this needs to happen, and how to do it without performance penalty for non-UTF8 sources.

Implementation advice sought. API opinion sought.

Also, should leading BOMs be handled (I think so), and how does the browser get UTF-8 JavaScript into jsapi? Does it convert to UTF-16 and use the UC interfaces?

Wesley W. Garland

Assignee

Comment 1

•

15 years ago

Further investigation (reading, not testing) shows that JS_CompileScript() does, in fact, handle UTF-8 sources, as it js_InflateString()s the source code before passing it along to JS_CompileUCScript().

This leaves JS_CompileScript() treating JS source code differently than JS_CompileFile() and JS_CompileFileHandle() when C strings are UTF-8.

The root of the issue appears to be that GetChar() in jsscan.cpp does a naive inflation when converting from char->jschar.

It is my intention to fix this by inflating utf8->utf16 inline in GetChar(), and patching UnGetChar, PeekChar, etc as needed.

Things to watch out for
 - Unicode line and paragraph separators 0x2028 and 0x2029 should be treated the same as \n

 - AFAICT, js_CStringsAreUTF8 might change halfway through a token stream (MT silliness?); to insure that each stream is processed as either UTF8 or not UTF8, I will add a TSF_UTF8_BYTESTREAM flag and set it during js_InitTokenStream().

 - I have considered the case where \uXXXX has a trailing non-BMP character and decided that it is irrelevant (was concerned about sizeof(ts->ungetbuf))

 - It should be possible to re-implement JS_CompileScript to call js_CompileScript() directly when I'm done, rather than js_CompileScript(js_inflateString(theScript)): is there any reason why this would be undesirable? It would be a separate patch. It would yield guaranteed consistent behaviour across the API (as opposed to coincidentally identical).

Brendan Eich [:brendan]

Comment 2

•

15 years ago

(In reply to comment #1)
> The root of the issue appears to be that GetChar() in jsscan.cpp does a naive
> inflation when converting from char->jschar.

That's right. The C-strings-are-UTF-8 work didn't touch that function, which calls js_fgets only in the file input case.
 
> It is my intention to fix this by inflating utf8->utf16 inline in GetChar(),
> and patching UnGetChar, PeekChar, etc as needed.

No, UngetChar (note spelling) and PeekChar do not have to re-encode. They push uint16 jschar units back on appropriate storage, *after* js_fgets has returned cbuf to GetChar, which has then decoded once and once only from ISO-Latin-1 (what it does today) or UTF-8 (what you propose to do).

>  - AFAICT, js_CStringsAreUTF8 might change halfway through a token stream (MT
> silliness?); to insure that each stream is processed as either UTF8 or not
> UTF8, I will add a TSF_UTF8_BYTESTREAM flag and set it during
> js_InitTokenStream().

No. The rule is that js_CStringsAreUTF8 must be set before the world begins. See the comment in jsapi.h.

Do not over-engineer this, it's nonsensical to try to support runtime changes to this flag.

>  - I have considered the case where \uXXXX has a trailing non-BMP character and
> decided that it is irrelevant (was concerned about sizeof(ts->ungetbuf))

This does not make sense. The \u escape can spell only 16-bit code storage unit values, period.

>  - It should be possible to re-implement JS_CompileScript to call
> js_CompileScript() directly when I'm done, rather than
> js_CompileScript(js_inflateString(theScript)): is there any reason why this
> would be undesirable? It would be a separate patch. It would yield guaranteed
> consistent behaviour across the API (as opposed to coincidentally identical).

Note that js_CompileScript is now JSCompiler::compileScript.

How would you make it take a char buffer, again?

/be

Wesley W. Garland

Assignee

Comment 3

•

15 years ago

> No, UngetChar (note spelling) and PeekChar do not have to re-encode. 

I came to this conclusion tentatively after realizing that the unget buffer was jschar earlier this afternoon. Thanks for reinforcing my thought; that tidbit changes this patch from "annoying" to "rather trivial".

Will drop the TSF_ flag and use js_StringsAreUTF8. I read JS_SetCStringsAreUTF8() but failed to process the assertion. I need to start doing a better job of reading header files: you guys bury more comments there than I expect.

> Note that js_CompileScript is now JSCompiler::compileScript.

Heh, I did my initial analysis when I was cloning a new repository. That must change must have been made relatively recently (April or later).

As for making JSCompiler::compileScript take char buffers: I admit, I haven't given "how" any real thought.  I think all the pieces are /in there/ to do that, but I was more concerned with "should I" than "how shall I?" this afternoon.

Thanks for the feedback Brendan. I have a good enough handle on the surrounding code and issues now that I can implement with trivial changes and generate good test coverage.

Final implementation question: my plan is convert illegal UTF-8 sequences into equally illegal UTF-16. Any kind of encoding error handling in GetChar seems like it cost too much for too little benefit.

Wesley W. Garland

Assignee

Comment 4

•

15 years ago

Attached patch Patch to add UTF-8 to support to FILE-backed token streams, works except in corner cases (obsolete) — Details — Splinter Review

Wesley W. Garland

Assignee

Comment 5

•

15 years ago

This is a working but incomplete patch. It does not deal with the corner case where an identifier is longer than JS_LINE_LIMIT bytes, and a multibyte character straddles the buffer boundary.

The approach in this patch is to process single characters as usual. If a multibyte character is detected, the total number of bytes in the character is derived from the first byte of the encoding. We then "swallow" those bytes, and convert to UTF-16. Either or one or two jschars are used for the UTF-16 data, non-BMP sequences are handled as expected. Character counting is decoupled from byte counting and gives correct diagnostic information for a programmer using a UTF-8 editor.

This patch moves the UTF8->UCS4 logic out of jsstr.cpp and into jsstr.h.  This approach was taken as that code is used the js_InflateString() loop and so probably needs to remain inline-able.

Feedback sought for how to best handle the case at "multi-byte character straddles buffer boundary" comment.

Jeff Walden [:Waldo]

Comment 6

•

15 years ago

https://bugzilla.mozilla.org/show_bug.cgi?id=445886 has some interesting tests and commentary on replacement characters and code unit consumption in case of errors in the input.

I tend to think this patch, when it is eventually completed, should be reviewed by at least one non-JS character encoding doyen, beyond the SpiderMonkey peer review requirement.

Wesley W. Garland

Assignee

Comment 7

•

15 years ago

Attached patch Patch to add UTF-8 to support FILE-backed token streams (obsolete) — Details — Splinter Review

This patch is expected to work correctly on all valid UTF-8 input.

It differs from the behaviour js_InflateString  (which is called by JS_CompileScript()) in the case of invalid UTF-8 sequences.  js_InflateString() will throw an exception whereas I will produce characters consisting of equally-invalid UTF-16.

Advice sought re. how to best handle bad byte sequences from within GetChar() sought. My temptation is to turn the first invalid sequence into EOF -- that will have the least impact on the surrounding code, and still provide diagnostics (of a sort) for the programmer.

 * Jeff Walden points out that securely consuming invalid Unicode is a non-trivial and tricky proposition.

 * Jeff also points out that my patch has a flaw at the buffer boundary: I do not handle the case where ungetc fails, and ungetc is not guaranteed to succeed for more than one byte of push-back on all platforms.  Implementing the pushback correctly might require adding a new member to JSTokenStream, although I'd rather find another solution.

Attachment #381860 - Attachment is obsolete: true

Wesley W. Garland

Assignee

Comment 8

•

15 years ago

Attached patch Patch to add UTF-8 to support to FILE-backed token streams (obsolete) — Details — Splinter Review

This patch enhances the previous one by adding invalid-byte-sequence error handling. 

Invalid sequences of UTF-8 are handled by pushing an invalid UTF-16 character (0xFFFF) into the token buffer.  When that invalid character is pulled out of the buffer as GetChar returns, we pass along an invalid character (-2, similar to how EOF is passed), and consider finding this token to be a syntax error.

We also insure that we never skip over the invalid character or otherwise "lose" it to optimizations in parsing: it should always be handled.

I was going to request review on this patch, but just noticed that I missed handling the case where ungetc() fails. Will try to get to that later today.

Attachment #381939 - Attachment is obsolete: true

Wesley W. Garland

Assignee

Comment 9

•

15 years ago

Attached patch Patch to add UTF-8 to support to FILE-backed token streams (obsolete) — Details — Splinter Review

Replaces call to ungetc with a buffer inside the JSTokenStream struct, and corrects a UTF-8 validation bug.

Attachment #382347 - Attachment is obsolete: true

Attachment #382361 - Flags: review?(jorendorff)

Brendan Eich [:brendan]

Comment 10

•

15 years ago

Brace, comment, paren style nits by quick inspection.

Seems like we shouldn't need another unget buffer, but I get it if it's UTF-8 decoding-specific -- name it accordingly? utf8buf or something like that.

Is this patch making maximal use of existing UTF-8 decoding machinery added for js_CStringsAreUTF8?

/be

Wesley W. Garland

Assignee

Comment 11

•

15 years ago

Style Police: How firm is the wide-code rule?  Line 419 (grep: 0xFFE) is 84 characters; breaking it up makes it significantly harder to read IMO.

Unfortunately the other unget buffer is necessary; it needs to store bytes ("code units") which make up part of a character; the other unget buffer stores characters only.

A better name might in fact be appropriate: I can't see this buffer being reasonably reused for any other purpose inside the scanner without other major surgery (buffer is sized exactly big enough, for starters).  utf8buf and utf8buflen seem reasonable.

The "inner machinery" for decoding UTF-8 from jsstr.cpp has been moved into jsstr.h so that it can be reused by GetChar -- that's the bit-tricky code which changes UTF-8 to UCS4. 

There is very very slight intent-duplication between this code and js_InflateStringToBuffer() -- conversion from UCS4 to UTF-16 (pretty trivial - 3 LOC) and error handling. Errors must be handled differently between the two functions, as the jsstr.cpp code is free to throw exceptions whereas GetChar may not, instead returning tokens which lead to eventual syntax errors.

Jason Orendorff [:jorendorff]

Comment 12

•

15 years ago

99 columns is the new hard limit, so an 84-character line is fine. Even code like this:

>+                        codePoint = js_Utf8ToOneUcs4Char((const uint8 *)
>+                                                         (cbuf + j), 
>+                                                         utf8Length);

could fit on one line. (If you actually use vi, feel free to patch line 2 where it says tw=78. I don't use vi, so the wrong value doesn't bother me, and I really don't know if the correct value is tw=98 or tw=99 or what.)

Brendan said not to overengineer this, but I think this patch is too clunky; it could really use a bit more engineering.

This patch changes most GetChar call sites to check for JS_INVALID_JSCHAR. So basically we're changing GetChar to be able to return an error code. I think this is a fine thing, but then it seems sensible to go all the way, and have GetChar report the error itself instead of leaving this to the caller. This is what all our other code does, and in the case of Unicode errors, the resulting error messages will be more appropriate that way than if we just let the scanner blunder ahead to some syntax error.

Let's avoid the unget buffer. Instead, change js_fgets to fill a buffer of jschars. The little buffer of 8-bit characters can be local in js_fgets. (js/shell uses js_fgets too, but I think adapting that code to use jschars will be no problem.)

Note that js/src style is to write "} else" on the same line.

Jason Orendorff [:jorendorff]

Updated

•

15 years ago

Attachment #382361 - Flags: review?(jorendorff)

Brendan Eich [:brendan]

Comment 13

•

15 years ago

My "don't over-engineer" advice in no way favored clunkiness :-P. I was hoping for a small patch, since we already have uint16 jschar at a low level.

But making GetChar fallible is a big change. Of course it should report/throw but the callers all must then check its return value for the error code. That's ~40 callsites that have to test and propagate failure. This is a non-trivial source complexity, compiled code size, and JS compiler runtime hit, all only for the sake of this bug.

/be

Jason Orendorff [:jorendorff]

Comment 14

•

15 years ago

Another option is to have GetChar report a warning on an encoding error and translate the garbage bytes as '\ufffd', the Unicode "replacement character":

http://www.fileformat.info/info/unicode/char/fffd/index.htm

Then we might as well leave all GetChar call sites as they are--basically drop the second half of the above patch.  In a comment, non-UTF8 bytes would be ignored.  In a string, you would get the all-too-familiar � character.  Anywhere else it would be a syntax error.

Jeff Walden [:Waldo]

Comment 15

•

15 years ago

That was what the original patch did, as I recall, but I somewhat-discouraged it because I was wary of encoding bugs and wasn't sure the extra pain of recovery was worthwhile.  If we have this many GetChar calls, tho, it does seem the way to go.

Wesley W. Garland

Assignee

Comment 16

•

15 years ago

Attached patch Patch to add UTF-8 to support to FILE-backed token streams (obsolete) — Details — Splinter Review

I'm hoping you'll find this patch less clunky -- having a hard time balancing "elegant code" against "beware of lurking dragons".

This patch is algorithmically identical to last one. It address the style nits raised above, and makes better use of screen real estate.

Architecturally speaking, it is quite different:

- unget buffer is gone: we demand-get exactly enough extra characters in the edge case when they are needed

- GetChar re-factored to only check js_CStringsAreUTF8 once per buffer; UTF-8 details moved out into ProcessUTF8Char(). This is now the only comparison added to the scanner's "hot loop" when parsing FILEs in traditional mode, and no extra code is executed when compile JavaScript which came from an in-memory buffer.

- When an invalid UTF-8 sequence is read, we now produce TOK_ERROR and stop processing the input buffer. Compilation error is noted in ProcessUTF8Char().

- In order note compilation errors, we need to pass a context through to ProcessUTFChar(), via GetChar(). This is accomplished by removing the context from the JSCompile class, and moving it into the JSTokenStream struct.

- Because of this, previous references to JSCompile::context are now JSCompile::context(), an inline method which retrieves the context from the JSTokenStream. This bloats the patch but should be easy to review for correctness.

- In some circumstances, js_utf8ToOneUcs4Char() might assert on UTF8 bytes which would get reported as compilation errors by ProcessUTF8Chars() in an optimized build. I don't consider this a bug.

- js_InitTokenStream() now takes a "snapshot" of the context it was called with. This patch would create a bug if there was ever a situation where an active token stream could be passed from one context to another. I could not find any such instances.

Attachment #382361 - Attachment is obsolete: true

Attachment #388296 - Flags: review?(jorendorff)

Jeff Walden [:Waldo]

Comment 17

•

15 years ago

Comment on attachment 388296 [details] [diff] [review]
Patch to add UTF-8 to support to FILE-backed token streams

Stealing with jorendorff's approval...

Attachment #388296 - Flags: review?(jorendorff) → review?(jwalden+bmo)

Wesley W. Garland

Assignee

Comment 18

•

15 years ago

Attached patch Patch to add UTF-8 to support to FILE-backed token streams — Details — Splinter Review

Exact same patch, except I
 - rebased it to 45dda1dc4d07+ tip (2 hours old), and
 - removed two lines in jstracer.h which accidentally "leaked in" via a temporary local fix to bug 504119

Attachment #388296 - Attachment is obsolete: true

Attachment #388498 - Flags: review?(jwalden+bmo)

Attachment #388296 - Flags: review?(jwalden+bmo)

Jeff Walden [:Waldo]

Comment 19

•

15 years ago

Comment on attachment 388498 [details] [diff] [review]
Patch to add UTF-8 to support to FILE-backed token streams

>diff --git a/js/src/jsparse.cpp b/js/src/jsparse.cpp

>-    JS_ARENA_ALLOCATE_TYPE(funbox, JSFunctionBox, &context->tempPool);
>+    JS_ARENA_ALLOCATE_TYPE(funbox, JSFunctionBox, &(context()->tempPool));

Superfluous extra parentheses.


>diff --git a/js/src/jsscan.cpp b/js/src/jsscan.cpp

>+static bool
>+ProcessUTF8Char(JSTokenStream *ts, const unsigned char *cbuf, jschar *ubuf, ptrdiff_t *ip, ptrdiff_t *jp, ptrdiff_t *lenp)
>+{
>+  /* 
>+   * At the beginning of a UTF-8 multi-byte character.
>+   * If an invalid sequence is detected, we insert 
>+   * TOK_ERROR into the buffer and set the TSF_ERROR flag.
>+   */
>+  uint32 codePoint;
>+  int utf8Length;
>+

JS_ASSERT(cbuf[*jp] >= 0x80);

>+  if (cbuf[*jp] >= 0xc2 && cbuf[*jp] <= 0xdf) {
>+    utf8Length = 2;
>+  } else if (cbuf[*jp] >= 0xe0 && cbuf[*jp] <= 0xef) {
>+    utf8Length = 3;
>+  } else if ((cbuf[*jp] >= 0xf0) && (cbuf[*jp] <= 0xf4)){

These conditions are overparenthesized.

That's most of the easy stuff, would continue more but OS X is hinting it needs a reboot, so dumping what little I have for now and leaving the rest for the other side; wait for the full review before posting a revised patch.

Jeff Walden [:Waldo]

Comment 20

•

15 years ago

Comment on attachment 388498 [details] [diff] [review]
Patch to add UTF-8 to support to FILE-backed token streams

In general through the entire patch, comments should break after column 79 (maybe 80 if the wording most naturally results in that, we haven't quite needed to decide which yet), not earlier; half-width comments eat up extra vertical space for not much gain.  (Code lines break after column 99, a semi-recent change from the previous after-79 standard.)


>diff --git a/js/src/jsscan.cpp b/js/src/jsscan.cpp
>+static bool
>+ProcessUTF8Char(JSTokenStream *ts, const unsigned char *cbuf, jschar *ubuf, ptrdiff_t *ip, ptrdiff_t *jp, ptrdiff_t *lenp)
>+{
>+  /* 
>+   * At the beginning of a UTF-8 multi-byte character.
>+   * If an invalid sequence is detected, we insert 
>+   * TOK_ERROR into the buffer and set the TSF_ERROR flag.
>+   */

This comment seems like it would be better outside the method.  Also, use complete sentences, so perhaps start with, "Process one UTF-8 multi-byte character starting from its first byte."


>+  uint32 codePoint;
>+  int utf8Length;
>+

Don't include a blank line here, given the codePoint declaration comment further below.


>+  if (cbuf[*jp] >= 0xc2 && cbuf[*jp] <= 0xdf) {
>+    utf8Length = 2;
>+  } else if (cbuf[*jp] >= 0xe0 && cbuf[*jp] <= 0xef) {
>+    utf8Length = 3;
>+  } else if ((cbuf[*jp] >= 0xf0) && (cbuf[*jp] <= 0xf4)){
>+    utf8Length = 4;
>+  } else {
>+    goto bad_utf8_bytes;
>+  }

I find myself reading this and being frustrated that we're using a signed type, ptrdiff_t, rather than an unsigned type here, for a non-negative index into memory.  Not your problem to fix, seems to be a systemic issue, just would rather we were using size_t here.

Could you common cbuf[*jp] into |unsigned char c|, please, to cut down on the symbol-clutter?


>+  if (*ip + utf8Length > *lenp) {

codePoint's declaration should move down to immediately above this if; we're C++ now, we declare at use or as little in advance as possible.


>+    /*
>+     * Multi-byte character straddles buffer boundary, 
>+     * pull a few extra bytes from the stream to use.
>+     */
>+    unsigned char utf8Bytes[4];
>+
>+    memcpy(utf8Bytes, cbuf + *jp, *lenp - *ip);

This needs |ptrdiff_t remaining = *lenp - *ip;| commoning to avoid repetition and a scourge of parentheses on the next several lines.


>+    if (utf8Length - (*lenp - *ip) != 
>+        js_fgets((char *)utf8Bytes + (*lenp - *ip), utf8Length - (*lenp - *ip) + 1, ts->file)) {

Where does this + 1 come from?  This looks to me like you're chomping off an extra byte whenever we pass the buffer limit while processing a multi-byte character.


>+      goto bad_utf8_bytes;
>+    } else {
>+      codePoint = js_Utf8ToOneUcs4Char((const uint8 *)utf8Bytes, utf8Length);
>+    }

Having an else after an if which ends with a goto is unnecessary; you can just keep the else-block at the same level of nestedness as the if.


>+  } else {
>+    codePoint = js_Utf8ToOneUcs4Char((const uint8 *)(cbuf + *jp), utf8Length);
>+  }

...but then again, I think I'd prefer only one js_Utf8ToOneUcs4Char call here, so utf8Bytes would move to toplevel and the if would assign to a pointer which would be the first argument to this call.


>+  *lenp -= (utf8Length - 1);
>+  while(--utf8Length)

Space between while and (.


>+  {
>+    unsigned char mask = 0x80;
>+    if (cbuf[(*jp)++] & mask != 0x80) {

Given C operator precedence, which requires that the left-hand side be parenthesized, how does this even work for multi-byte characters?  As soon as you hit a multi-byte character, 0xC0 != 0x80 and it's goto bad_utf8_bytes time.


>+      goto bad_utf8_bytes;
>+    }
>+    mask = 0xC0;
>+  }
>+
>+  if (codePoint >= 0x10000) { 

3 + 6 + 6 + 6 == width(10FFFF), but I'd like a JS_ASSERT(codePoint <= 0x10FFFF) before this line as a sanity check.  Also, and this might just be personal preference, but I'd prefer |codePoint > 0xFFFF| because it conveys out-of-range rather than in-the-not-in-range-range and because counting the adjacent zeroes in 0x10000 is hard.  :-)


>+    /* 
>+     * Handle UTF-16 surrogate pair. ubuf is always 
>+     * big enough, encoding a UTF-8 code point which
>+     * converts to a surrogate pair requires at least
>+     * two bytes.
>+     */

ubuf's size is not immediately apparent in the context of this code; you have to go off and find that ts->userbuf.base has extent JS_LINE_LIMIT to do so.  Longer-term we really should use a JSTempVector for this or something like that, but for now at least direct the reader to that code and reasoning in addition to simply stating it and sketching an outline somewhat disconnected from the immediate code at hand.


>+    ubuf[*ip] = ((codePoint - 0x10000) >> 10) + 0xD800;
>+    ubuf[*ip + 1] = ((codePoint - 0x10000) & 0x3FF) + 0xDC00;

Fundamentally all this is is grabbing bit-slices out of codePoint and or-ing them with surrogate masks, so I don't think additions or subtractions are appropriate to convey that -- not to mention that adds/subs are necessarily slower than shifts and bitmasking (it's conceivable a really smart compiler might intuit all the bit-range invariants we have, but I'd bet against it).  This should be equivalent (double-check me, please) and more in-line with the bit-ish nature of the manipulations:

  ubuf[*ip] = ((codePoint >> 10) & 0x3FF) | 0xD800; /* NB: chops off 0x100000 bit */
  ubuf[*ip + 1] = (codePoint & 0x3FF) | 0xDC00;


>+  } else {
>+    ubuf[*ip] = (jschar) codePoint;
>+  }

C++-style jschar(codePoint) cast, please.


>+  if ((codePoint >= 0xD800 && codePoint <= 0xDFFF) || codePoint > 0x10FFF ||
>+      ((codePoint & 0xF000) && (codePoint & 0xFFE == 0xFFE))) {
>+    goto bad_utf8_bytes;
>+  }

Second condition can't be hit per above reasoning.  The third suffers from C operator precedence fail, looking to fail any code point in the range [U+F000, U+FFFF].

Do [U+FDD0, U+FDEF] need to be treated as errors as well here?


>+  bad_utf8_bytes:

Style nit: label shouldn't be indented here.


>+  ubuf[*ip] = (short unsigned int)TOK_ERROR;
>+  js_ReportCompileErrorNumber(ts->context, ts, NULL, JSREPORT_ERROR,
>+                              JSMSG_INVALID_UNICODE_CHAR);
>+  return false;

Why cast to short unsigned int here, rather than to jschar?  C++ style also.

It's convenient for the reporting method to detect double-reports as it does, this would be pretty ugly otherwise.


>-                for (j = 0; i < len; i++, j++)
>-                    ubuf[i] = (jschar) (unsigned char) cbuf[j];
>+                
>+                if (!js_CStringsAreUTF8) {
>+                  for (j = 0; i < len; i++, j++) {
>+                      ubuf[i] = (jschar) (unsigned char) cbuf[j];
>+                  }

C++ style here, also nix the braces around the for-loop body since it's only one line (as before for the latter, fixing up the former to conform to new style).


>+                } else {
>+                  for (j = 0; i < len; i++, j++) {
>+                    if (cbuf[j] & 0x80) {
>+                      if (ProcessUTF8Char(ts, reinterpret_cast<const unsigned char *>(cbuf), ubuf,
>+                                           &i, &j, &len) == false)
>+                        break;
>+                    } else {
>+                        ubuf[i] = (jschar) (unsigned char) cbuf[j];
>+                    }
>+                  }

Would prefer this reversed so that multi-byte is logically the exceptional case.  Also, |if (!Process(...))| rather than comparing to false, and if the entire condition fits in 99 characters, don't bother bracing the break-statement.


>diff --git a/js/src/jsstr.h b/js/src/jsstr.h

> /*
>+ * Convert a utf8 character sequence into a UCS-4 character and return that
>+ * character.  It is assumed that the caller already checked that the sequence
>+ * is valid.
>+ */
>+static JS_INLINE uint32

inline, not the obfuscating JS_INLINE -- this is an internal header, and anyway, it's 2009.


>+js_Utf8ToOneUcs4Char(const uint8 *utf8Buffer, int utf8Length)
>+{
>+    uint32 ucs4Char;
>+    uint32 minucs4Char;
>+    /* from Unicode 3.1, non-shortest form is illegal */

Capitalize and add a period at the end while you're here.


>+    static const uint32 minucs4Table[] = {
>+        0x00000080, 0x00000800, 0x0001000, 0x0020000, 0x0400000
>+    };
>+
>+
>+    JS_ASSERT(utf8Length >= 1 && utf8Length <= 6);

Hm, this 5- and 6-byte character thing is whack in modern Unicode and ES5 (maybe ES3, too, but I'm too lazy to check for certain) and results in an error (see DecodeURI for when this could be hit, via Decode!jsstr.cpp).  This is too involved for driveby changes in this bug, so we should fix this in bug 505366 in the future.

Anyway.  r- until we can hash out the above; sorry for the delay in getting to reviewing this, not something that was easy to skim without slowly teasing out meaning line-by-line...

Attachment #388498 - Flags: review?(jwalden+bmo) → review-

Brendan Eich [:brendan]

Comment 21

•

15 years ago

(In reply to comment #20)
> (From update of attachment 388498 [details] [diff] [review])
> In general through the entire patch, comments should break after column 79
> (maybe 80 if the wording most naturally results in that, we haven't quite
> needed to decide which yet),

See the modeline -- vim decides, no "we" thinking required.

> not earlier;

Affirmatron.

Looks like c-basic-offset of 2 is used!? Indentation unit, in non-emacs speak.

/be

Jeff Walden [:Waldo]

Comment 22

•

15 years ago

(In reply to comment #21)
> See the modeline -- vim decides, no "we" thinking required.

Modeline doesn't set distinct limits for code and comment lengths, does it?  99 is all I see in modelines right now, so comment length is merely implicit in that -- and I don't think anyone really wants 99-limit comments.


> Looks like c-basic-offset of 2 is used!? Indentation unit, in non-emacs speak.

Hm, I missed that, didn't I?  An impressive oversight on my part, to be sure.

Brendan Eich [:brendan]

Comment 23

•

15 years ago

(In reply to comment #22)
> (In reply to comment #21)
> > See the modeline -- vim decides, no "we" thinking required.
> 
> Modeline doesn't set distinct limits for code and comment lengths, does it?  99
> is all I see in modelines right now,

Ergh, jsscan.cpp has 78, not 79 and not 99.

> so comment length is merely implicit in
> that -- and I don't think anyone really wants 99-limit comments.

Righto.

/be

Terrence Cole [:terrence]

Updated

•

12 years ago

Depends on: 805081

Jeff Walden [:Waldo]

Comment 24

•

5 years ago

This is long since fixed.

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED

Patch to add UTF-8 to support to FILE-backed token streams, works except in corner cases 15 years ago Wesley W. Garland 11.55 KB, patch		Details \| Diff \| Splinter Review
Patch to add UTF-8 to support FILE-backed token streams 15 years ago Wesley W. Garland 10.34 KB, patch		Details \| Diff \| Splinter Review
Patch to add UTF-8 to support to FILE-backed token streams 15 years ago Wesley W. Garland 29.56 KB, patch		Details \| Diff \| Splinter Review
Patch to add UTF-8 to support to FILE-backed token streams 15 years ago Wesley W. Garland 31.33 KB, patch		Details \| Diff \| Splinter Review
Patch to add UTF-8 to support to FILE-backed token streams 15 years ago Wesley W. Garland 38.40 KB, patch		Details \| Diff \| Splinter Review
Patch to add UTF-8 to support to FILE-backed token streams 15 years ago Wesley W. Garland 37.80 KB, patch	Waldo : review-	Details \| Diff \| Splinter Review