Last Comment Bug 652771 - Update Unicode Table
: Update Unicode Table
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla8
Assigned To: Tom Schuster [:evilpie]
:
: Jason Orendorff [:jorendorff]
Mentors:
: 26473 497692 (view as bug list)
Depends on: 672760 675400 675896
Blocks: test262 672042 672083 672759 289630 514808
  Show dependency treegraph
 
Reported: 2011-04-26 03:53 PDT by Tom Schuster [:evilpie]
Modified: 2011-08-04 05:32 PDT (History)
16 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Sneak Peek (129.96 KB, patch)
2011-06-21 07:21 PDT, Tom Schuster [:evilpie]
no flags Details | Diff | Splinter Review
Generating script (3.10 KB, text/x-python)
2011-06-21 07:26 PDT, Tom Schuster [:evilpie]
no flags Details
Generating script v2 (211.82 KB, application/x-gzip)
2011-06-22 10:36 PDT, Tom Schuster [:evilpie]
no flags Details
wip 1 (76.87 KB, patch)
2011-06-22 10:42 PDT, Tom Schuster [:evilpie]
no flags Details | Diff | Splinter Review
pre-v1 (2.12 MB, patch)
2011-06-22 15:58 PDT, Tom Schuster [:evilpie]
no flags Details | Diff | Splinter Review
v1 (2.08 MB, patch)
2011-07-18 11:45 PDT, Tom Schuster [:evilpie]
jwalden+bmo: review-
Details | Diff | Splinter Review
v2 (2.08 MB, patch)
2011-07-20 05:05 PDT, Tom Schuster [:evilpie]
jwalden+bmo: review+
Details | Diff | Splinter Review
Final (2.08 MB, patch)
2011-07-21 17:15 PDT, Tom Schuster [:evilpie]
no flags Details | Diff | Splinter Review
v3 (2.08 MB, patch)
2011-07-23 13:01 PDT, Tom Schuster [:evilpie]
jwalden+bmo: review+
Details | Diff | Splinter Review

Description Tom Schuster [:evilpie] 2011-04-26 03:53:45 PDT

    
Comment 1 Tom Schuster [:evilpie] 2011-04-26 03:59:48 PDT
At least we don't recognize these characters in trim(), based on this test http://test262.ecmascript.org/:
\x85
\u1680
\u180e
\u202f
\u205f

Seems to partially confirm this:
https://secure.wikimedia.org/wikipedia/en/wiki/Whitespace_character#Unicode
http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode

js_X, js_Y and js_A are from hg@1, maybe they require some update today, but i don't understand how this table was generated.
Comment 2 Jeff Walden [:Waldo] (remove +bmo to email) 2011-04-26 06:52:53 PDT
Note bug 514808 as well, and note that there's some belief that \u0085 is properly not a whitespace character.  Dunno about all the others.

Also, if you end up doing this and decide to look somewhere for a review, please don't look to me.  :-)  smontagu does intl stuff, he might be a reasonable reviewer perhaps.
Comment 3 Luke Wagner [:luke] 2011-04-26 07:21:58 PDT
I hear js_X et al. and some other parts of our Unicode handling were copied from an open-source Java project.  A refresh sounds good.
Comment 4 David Mandelin [:dmandelin] 2011-04-26 09:42:01 PDT
(In reply to comment #3)
> I hear js_X et al. and some other parts of our Unicode handling were copied
> from an open-source Java project.  A refresh sounds good.

Agreed. A little more background on that: the js_X et al. Unicode tables were taken from an early version of the JDK (1.1 or some such). It was actually a pretty good design except for the small issue that it doesn't represent everything in the Unicode spec. See, for example, bug 502789 comment 12.

I don't think we can get it right without a new design. It might be good to look at Python--they seem to have good Unicode support these days.
Comment 5 Tom Schuster [:evilpie] 2011-05-03 11:48:03 PDT
So looked at the python implementation. It uses a bit more logic then our current one, and some stuff like whitespace is hardcoded as a switch. Overall we would execute a few more instructions. On the other hand python is maintained and they have a script for automatically generating the required tables. One thing i am currently not sure about is if we can easily implement all our macros with it, in particular JS_ISLETTER (these use some bitfield, and i can't map them to something in the python code directly in my head).
Comment 6 Tom Schuster [:evilpie] 2011-05-10 11:59:52 PDT
I would like to include this two files from cpython into SpiderMonkey, see license isn't clear to me, so i would need somebody to make sure this is legally okay.
>/*
>Unicode character type helpers.
>
>Written by Marc-Andre Lemburg (mal@lemburg.com).
>Modified for Python 2.0 by Fredrik Lundh (fredrik@pythonware.com)
>
>Copyright (c) Corporation for National Research Initiatives.
>
>*/
gerv: Jeff suggest you looking at this? 

http://hg.python.org/cpython/file/5e1ed84883c5/Objects/unicodectype.c
http://hg.python.org/cpython/file/5e1ed84883c5/Objects/unicodetype_db.h
Comment 7 Gervase Markham [:gerv] 2011-05-11 04:45:53 PDT
The license for recent versions of Python is here:
http://docs.python.org/license.html

It is GPL-compatible and I think lax enough that we wouldn't object to it on the grounds of its terms. (However, I'd want Luis to check that out for certain.) But, as you can see, the amount of text required is long and verbose - almost as long as the code you want to import! 

Is there no alternative source for the code?

Gerv
Comment 8 Tom Schuster [:evilpie] 2011-05-11 06:53:29 PDT
>the amount of text required is long and verbose - almost as long as the code you want to import
It's not so much about the code actually, but that it's actively maintained (at least that's what i hope) and we could update the table anytime with the provided python script.
I didn't realize we would need to include the whole license blob.

>Is there no alternative source for the code?
I actually didn't look at to much stuff, but the python unicode support is considered pretty good.
Comment 9 Gervase Markham [:gerv] 2011-05-11 09:59:38 PDT
I think we would need to include the whole license blob, but if you decide to go further down this route, we could check that with a lawyer. But I'd appreciate it if you could cast your eye around for alternative sources first. Is there not some canonical source for this data which we can process?

Gerv
Comment 10 Tom Schuster [:evilpie] 2011-05-16 12:57:59 PDT
It would really help if we could first find this magical csv file that is used in the Java and possibly the Python implementation.
Personally i would like to see the java based current implementation gone.
Comment 11 Luke Wagner [:luke] 2011-06-17 16:40:16 PDT
Btw, when this bug gets fixed, you may have the privilege of resolving a Really Old Bug (bug 26473).
Comment 12 Tom Schuster [:evilpie] 2011-06-20 10:02:26 PDT
>Btw, when this bug gets fixed, you may have the privilege of resolving a Really >Old Bug (bug 26473).
Nice

Okay i am slowly getting to something here. I was under the illusion that i only need 3 classes "whitespace", "identifier start" and "identifier end". Then i found this obscure code:
#define JS_ISXMLNSSTART(c)      ((JS_CCODE(c) & 0x00000100) || (c) == '_')
#define JS_ISXMLNS(c)           ((JS_CCODE(c) & 0x00000080) || (c) == '.' ||  \
                                 (c) == '-' || (c) == '_')
I figured the first one checks for:
> JSCT_ENCLOSING_MARK         = 7,
> JSCT_COMBINING_SPACING_MARK = 8,
This sucks because 2 more classes lets the table grow.
Comment 13 Tom Schuster [:evilpie] 2011-06-21 07:21:46 PDT
Created attachment 540744 [details] [diff] [review]
Sneak Peek

The code patch
Comment 14 Tom Schuster [:evilpie] 2011-06-21 07:26:05 PDT
Created attachment 540746 [details]
Generating script

You can get makeunicodedata from 
http://hg.python.org/cpython/file/c8192197d23d/Tools/unicode/makeunicodedata.py
UnicodeData from 
http://unicode.org/Public/UNIDATA/UnicodeData.txt

reader.py is really similar to makeunicodetype in makeunicodedata.py
Comment 15 Jeff Walden [:Waldo] (remove +bmo to email) 2011-06-21 16:50:22 PDT
Comment on attachment 540744 [details] [diff] [review]
Sneak Peek

Review of attachment 540744 [details] [diff] [review]:
-----------------------------------------------------------------

As far as the character-detection methods go, it looks like you could templatize them to take either char or jschar, avoiding the many casts you have in the patch.  Then include a JS_ASSERT(c >= 0) and assign the char into a jschar and do all the computations with the jschar.  (Add a JS_STATIC_ASSERT(jschar(-1) > 0) to jsstr.cpp or something to verify that jschar is always an unsigned type -- it's JSUint16 now everywhere but Windows, and on Windows it's wchar_t which MS defines as uint16_t.)  A local unsigned shouldn't be necessary -- these are inlines, so the compiler will do the right thing, whatever that happens to be.

General comment: the Char prefix on the method names doesn't usually help much, and you can pretty much nix it.  (More precise comments follow.)

All this Unicode gunk doesn't have much to do with strings, really.  We should separate all these character info tables and stuff into a completely separate file.  Unicode.{cpp,h} maybe?  (mfbt might be even nicer for some of this, but one step at a time, the JS-only step first.)  Don't much care about the paint, do think it would be nice to not have those large tables in jsstr.cpp/h.

::: js/src/jsdate.cpp
@@ +548,5 @@
>      /* return true if matches, otherwise, false */
>  
>      while (count > 0 && s1[s1off] && s2[s2off]) {
>          if (ignoreCase) {
> +            if (CharToLower((jschar)s1[s1off]) != CharToLower(s2[s2off])) {

ToLowerCase seems a better name to me, as it implies the "Char" bit and doesn't elide a word.

::: js/src/jsnuminlines.h
@@ +70,5 @@
>          if ('0' <= c && c <= '9') {
>              *result = NumberTraits<T>::toSelfType(T(c - '0'));
>              return true;
>          }
> +        if (CharIsSpace(c)) {

I don't see the prefix as useful here.  But I'd make this IsWhitespace (or IsWhiteSpace?  but I'd prefer just one word to avoid the mental speed bump of two), as that's what the Unicode category's called and that's what the spec calls them and all.

::: js/src/jsopcode.cpp
@@ +765,5 @@
>      /* Loop control variables: z points at end of string sentinel. */
>      for (const jschar *t = s; t < z; s = ++t) {
>          /* Move t forward from s past un-quote-worthy characters. */
>          jschar c = *t;
> +        while (CharIsPrint(c) && c != qc && c != '\\' && c != '\t' &&

IsPrintable seems a more obviously meaningful name than CharIsPrint.  Most people don't write code that would have them familiar with isprint these days.

::: js/src/jsscan.cpp
@@ +141,5 @@
>  
>      if (length == 0)
>          return JS_FALSE;
>      jschar c = *chars;
> +    if (!CharIsIdentifierStart(c))

Again the Char prefix seems unnecessary.

@@ +146,5 @@
>          return JS_FALSE;
>      const jschar *end = chars + length;
>      while (++chars != end) {
>          c = *chars;
> +        if (!CharIsIdentifierPart(c))

And again.

@@ +644,5 @@
>      ispair = false;
>      if (length > 2 && bp[1] == '#') {
>          /* Match a well-formed XML Character Reference. */
>          i = 2;
> +        if (length > 3 && (bp[i] == 'x' || bp[i] == 'X')) {

<3

@@ +1136,5 @@
>          cp[2] == 'i' &&
>          cp[3] == 'n' &&
>          cp[4] == 'e') {
>          skipChars(5);
> +        while ((c = getChar()) != '\n' && CharIsSpaceOrBom2((jschar)c))

IsWhitespaceOrBOM2

@@ +1360,5 @@
>          }
>  
>          tp = newToken(-1);
>  
> +        if (CharIsLetter(c)) {

IsLetter

::: js/src/jsstr.h
@@ +797,5 @@
> +const jschar NO_BREAK_SPACE  = 0x00A0;
> +const jschar BYTE_ORDER_MARK2 = 0xFFFE;
> +
> +
> +enum {

Make this:

namespace CharFlag {
enum {
    WHITESPACE = 1 << 0,
    LETTER     = 1 << 1,
    ...
};
}

cdleary did this in the version/options code, and it's nice because you can have a prefix or not depending on whether it makes the code using it more readable or not.

@@ +804,5 @@
> +    FLAG_IDENTIFIER_PART = 1 << 2,
> +    FLAG_NO_DELTA = 1 << 3,
> +    FLAG_LINE_TERMINATOR = 1 << 4,
> +    FLAG_ENCLOSING_MARK = 1 << 5,
> +    FLAG_COMBINING_SPACING_MARK = 1 << 6

Ordinarily I think vertical alignment of assignments is overrated, but if you're shifting by increasing numbers, I think aligning the = would be good.

@@ +808,5 @@
> +    FLAG_COMBINING_SPACING_MARK = 1 << 6
> +};
> +
> +
> +struct CharInfo {

CharacterInfo (see below).

Rather than do everything by manual bit-reference against the flags member, add some inline member methods to abstract them:

bool isWhitespace()
bool isLetter()
bool isIdentifierPart()
bool isLineTerminator()
bool isEnclosingMark()
bool isCombiningSpaceMark()

You can't make flags private, alas, because that would break the initializer, but at least you can make use of it and the raw flag enums outside CharacterInfo quite rare.

@@ +811,5 @@
> +
> +struct CharInfo {
> +    const jschar upperCase;
> +    const jschar lowerCase;
> +    const uint8 flags; 

The upperCase and lowerCase fields need descriptions.  flags can be described by reference to the associated enum.

@@ +816,5 @@
> +};
> +
> +extern const uint16 index1[];
> +extern const uint16 index2[];
> +extern const CharInfo js_charinfo[];

These three arrays need descriptions.  We don't want the same situation as bug 26473 comment 5, with nobody knowing how this works and being thus unwilling to touch it.  :-)

@@ +819,5 @@
> +extern const uint16 index2[];
> +extern const CharInfo js_charinfo[];
> +
> +static inline const CharInfo *
> +GetCharInfo(jschar c)

The "get" prefix usually indicates fallibility -- and when it returns a pointer, even moreso.  Make this return a reference, and name it just "CharInfo", since the struct has the longer name now.

@@ +821,5 @@
> +
> +static inline const CharInfo *
> +GetCharInfo(jschar c)
> +{
> +	int index;

This should be a uint16 as both index1 and index2 are arrays of uint16.

@@ +831,5 @@
> +	return &js_charinfo[index];
> +}
> +
> +
> +#define JS_ISXMLSPACE(c)        ((c) == ' ' || (c) == '\t' || (c) == '\r' ||  \

Make this a static inline method.

@@ +835,5 @@
> +#define JS_ISXMLSPACE(c)        ((c) == ' ' || (c) == '\t' || (c) == '\r' ||  \
> +                                 (c) == '\n')
> +
> +#define JS_ISXMLNSSTART(c)      ((GetCharInfo(c)->flags & FLAG_COMBINING_SPACING_MARK) || (c) == '_')
> +#define JS_ISXMLNS(c)           ((GetCharInfo(c)->flags & FLAG_ENCLOSING_MARK) || (c) == '.' ||  \

And these too.

@@ +839,5 @@
> +#define JS_ISXMLNS(c)           ((GetCharInfo(c)->flags & FLAG_ENCLOSING_MARK) || (c) == '.' ||  \
> +                                 (c) == '-' || (c) == '_')
> +
> +#define JS_ISXMLNAMESTART(c)    (JS_ISXMLNSSTART(c) || (c) == ':')
> +#define JS_ISXMLNAME(c)         (JS_ISXMLNS(c) || (c) == ':')

And these too.

@@ +851,2 @@
>  {
> +	unsigned w = c;

Tab characters!  Lotsa places in these methods, make alignment a mess in Splinter.

@@ +855,5 @@
> +		return js_isidstart[w];
> +	else {
> +		const CharInfo *info = GetCharInfo(c);
> +		return info->flags & FLAG_LETTER;
> +	}

if (c < 128)
    return js_isidstart[w];
return CharInfo(c).isLetter();

@@ +861,2 @@
>  
> +#define CharIsLetter CharIsIdentifierStart

Why not an inline function to make the two identical?

@@ +861,5 @@
>  
> +#define CharIsLetter CharIsIdentifierStart
> +
> +static inline bool
> +CharIsIdentifierPart(jschar c)

Make the same style of adjustment as in CharIsIdentifierStart.

@@ +875,4 @@
>  }
>  
>  static inline bool
> +CharIsSpace(jschar c) 

Same style adjustments again.  (And for all the others, just not going to mention it any more.)

@@ +879,5 @@
>  {
> +	unsigned w = c;
> +	if (w < 128)
> +		return js_isspace[w];		
> +	if (w == NO_BREAK_SPACE)

Put a comment by this explaining why just referring to CharInfo(NO_BREAK_SPACE) isn't adequate.  (I assume that reason is performance?  It's not actually stated anywhere, so I'm just guessing.)

@@ +889,4 @@
>  }
>  
>  static inline bool
> +CharIsSpaceOrBom2(jschar c) 

My GNOME character map says BOM isn't in the whitespace category.  Does this method implement CharIsSpaceOrBom2(BOM) correctly and return true?  Unless the CharInfo's different from Unicode, it looks like it doesn't.

@@ +893,3 @@
>  {
> +	unsigned w = c;	
> +	/* Treat little- and big-endian BOMs as whitespace for compatibility. */	

This comment belongs by the callers of this method, not within it.  There's no obvious compatibility constraint on a method clearly named to detect BOM or its reverse -- its callers are what want that special behavior.

@@ +904,4 @@
>  }
>  
> +static inline jschar
> +CharToUpper(jschar c)

I sort of have a glimmer of understanding of this method, but it's not clear without a description of FLAG_NO_DELTA.  Please add some comments on this by the FLAG_NO_DELTA enum value?

@@ +915,5 @@
> +	return c + upper;
> +}
> +
> +static inline jschar
> +CharToLower(jschar c)

The comments by FLAG_NO_DELTA should address the identical concern here, I hope.

@@ +1091,5 @@
>  static inline const jschar *
>  js_SkipWhiteSpace(const jschar *s, const jschar *end)
>  {
>      JS_ASSERT(s <= end);
> +    while (s != end && js::CharIsSpace(*s))

I note that in this method, IsWhitespace is very much the natural name for the query being made here.  :-)

::: js/src/jsxml.cpp
@@ +1155,5 @@
>  
>  #define IS_XML_CHARS(chars)                                                   \
> +    (CharToLower((chars)[0]) == 'x' &&                                         \
> +     CharToLower((chars)[1]) == 'm' &&                                         \
> +     CharToLower((chars)[2]) == 'l')

Make this a static inline method if you're touching it this much?  The over-parenthesization-but-not-really of chars grates.

@@ +1160,4 @@
>  
>  #define HAS_NS_AFTER_XML(chars)                                               \
> +    (CharToLower((chars)[3]) == 'n' &&                                         \
> +     CharToLower((chars)[4]) == 's')

And again.
Comment 16 Jeff Walden [:Waldo] (remove +bmo to email) 2011-06-21 17:01:13 PDT
Hmm, I could have sworn I checked.  Ignore the whitespace commentary -- Unicode does just call it "space", so those really should be "IsSpace" and such.
Comment 17 Tom Schuster [:evilpie] 2011-06-22 10:36:44 PDT
Created attachment 541102 [details]
Generating script v2
Comment 18 Tom Schuster [:evilpie] 2011-06-22 10:42:11 PDT
Created attachment 541104 [details] [diff] [review]
wip 1
Comment 19 Tom Schuster [:evilpie] 2011-06-22 15:58:27 PDT
Created attachment 541212 [details] [diff] [review]
pre-v1

Won't be here till next monday, but because i think it takes a lot of time and cross-checking i will attach the current patch here already for review.

This patch passes nearly all tests, except some ecma/String/xxx.js, these are wrong, because they are not compatible to this version of Unicode (i believe it is 6.0.0, but i don't really dig their http repository). The file test.js in unicode-tests is supposed to replace all these failing String/xxx.js tests.

I am going to explain the index1, index2, js_charinfo tabel, but this needs some more prep work. I am also going to document reader.py a little bit better.
Comment 20 Jeff Walden [:Waldo] (remove +bmo to email) 2011-06-28 16:27:52 PDT
I'm beginning to think this data, or something along these lines, wants to live in mfbt/ so both JS and Gecko can use it.  roc tells me that smontagu was considering doing such unification at one time, so it's probably a good idea to do that if we're rewriting this.  Right now mfbt/ doesn't actually build a library or anything (it's purely headers, and it does dirty evil tricks to avoid building one at the moment, purely for reasons of expediency), so we'd need bug 668090 to be fixed to do that.  Or we can do the hack that bug 668090 comment 1 points out, but as I note in that bug that seems quite confusing, and worth avoiding if possible.

Anyway.  I'm still looking at this and should have comments to make which don't concern that side issue.  I should have more comments tomorrow.
Comment 21 Jeff Walden [:Waldo] (remove +bmo to email) 2011-06-28 16:31:09 PDT
smontagu, roc tells me you were thinking of unifying a bunch of Unicode character data tables at one point.  Are you still interested?  And if you have the time, do you have any particular comments on the patch here, or on the attached conversion scripts?
Comment 22 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-01 12:46:38 PDT
Comment on attachment 541212 [details] [diff] [review]
pre-v1

Review of attachment 541212 [details] [diff] [review]:
-----------------------------------------------------------------

I haven't looked at UnicodeData.txt, makeunicodedata.py, out, reader.py, or test, but here are comments on the rest of it, at least.

::: js/src/jsdate.cpp
@@ +548,5 @@
>      /* return true if matches, otherwise, false */
>  
>      while (count > 0 && s1[s1off] && s2[s2off]) {
>          if (ignoreCase) {
> +            if (unicode::ToLowerCase((jschar)s1[s1off]) != 

Hm, you didn't templatize these methods to take either char or jschar.  That would help with avoiding the kinda noisy cast here.

@@ +549,5 @@
>  
>      while (count > 0 && s1[s1off] && s2[s2off]) {
>          if (ignoreCase) {
> +            if (unicode::ToLowerCase((jschar)s1[s1off]) != 
> +                unicode::ToLowerCase(s2[s2off])) {

Without the unicode:: prefix (see later) this probably still fits on one line (in which case you could remove the braces).

::: js/src/jsopcode.cpp
@@ +766,5 @@
>      for (const jschar *t = s; t < z; s = ++t) {
>          /* Move t forward from s past un-quote-worthy characters. */
>          jschar c = *t;
> +        while (c < 127 && isprint(c) && c != qc && 
> +               c != '\\' && c != '\t') {

This fits in one line, so please make it do so.

::: js/src/jsscan.cpp
@@ +795,5 @@
>              goto out;
>          }
>  
>          tokenbuf.clear();
> +        if (unicode::IsXMLNameSpaceStart(c)) {

IsXMLNamespaceStart, pretty sure namespace in XML terms is a single word.

@@ +1137,5 @@
>          cp[3] == 'n' &&
>          cp[4] == 'e') {
>          skipChars(5);
> +        while ((c = getChar()) != '\n' && 
> +                unicode::IsSpaceOrBOM2((jschar)c))

One line, another place where the templatizing-for-char/jschar would save.

@@ +1149,5 @@
>                      return true;
>                  }
>                  line = temp;
>              }
> +            while (c != '\n' && unicode::IsSpaceOrBOM2((jschar)c))

And another saving place.

@@ +1164,5 @@
>                      filenameBuf[i++] = (char) c;
>                  }
>                  if (c == '"') {
>                      while ((c = getChar()) != '\n' &&
> +                           unicode::IsSpaceOrBOM2((jschar)c)) {

And another.

@@ +1361,5 @@
>          }
>  
>          tp = newToken(-1);
> +        
> +        /* It's okay to use IsLetter here, because $ and _ are < 128) */

It might be good to have JS_STATIC_ASSERT('$' < 128 && '_' < 128) as an even further backstop.  Or you could replace the comment with that, but I suspect it wouldn't be quite as clear.

@@ +1414,5 @@
>      if (c1kind == Ident) {
>          identStart = userbuf.addressOfNextRawChar() - 1;
>          hadUnicodeEscape = false;
> +    
> +        

Don't add an extra line.

::: js/src/jsstr.cpp
@@ +6115,5 @@
>  }
>  
>  } /* namespace js */
> +
> +CharacterInfo js_charinfo[] = {

Move the unicode stuff into js/src/vm/Unicode.cpp and js/src/vm/Unicode.h.  We moved the JSString definition stuff out of jsstr.h/cpp, and it makes sense to move this for similar reasons.

::: js/src/jsstr.h
@@ +797,5 @@
>  const jschar BYTE_ORDER_MARK = 0xFEFF;
>  const jschar BYTE_ORDER_MARK2 = 0xFFFE;
>  const jschar NO_BREAK_SPACE  = 0x00A0;
>  
> +/* This enum contains the all the knowledge required to handle

/*
 * This

@@ +798,5 @@
>  const jschar BYTE_ORDER_MARK2 = 0xFFFE;
>  const jschar NO_BREAK_SPACE  = 0x00A0;
>  
> +/* This enum contains the all the knowledge required to handle
> + * unicode in JavaScript

"Unicode": it should only be uncapitalized for the namespace name, I think -- definitely capitalized in comments and such.  This applies other places, too.

@@ +801,5 @@
> +/* This enum contains the all the knowledge required to handle
> + * unicode in JavaScript
> + * 
> + * SPACE, every character that is either in the ECMA-262 5th Edition
> + * class WhiteSpace or LineTerminator.

These last four lines are discombobulated, and I'm not sure what they're supposed to do/be.  :-)

@@ +856,5 @@
> +    };
> +};
> +
> +class CharacterInfo {
> +    /* Both upperCase and loweCase normally store the delta between two

/*
 * Both...

and "lowerCase".

This comment is very, very helpful in understanding the tricks played here!

@@ +859,5 @@
> +class CharacterInfo {
> +    /* Both upperCase and loweCase normally store the delta between two
> +     * letters. For example the lower case alpha (a) has the char code
> +     * 97, and the upper case alpha (A) has 65. So for "a" we would 
> +     * store -32 in upperCase (97 + (-32) = 65 :) and 0 in lowerCase,

You're not really storing -32, you're storing jschar(-32) which converts to some large uint16_t.  Maybe after the "So for" sentence you should add "(Well, not -32 exactly, but (2**16 - 32) to induce unsigned overflow with identical mathematical behavior.)"?  There's a balance between wordiness and not lying here.

Probably the parenthetical shouldn't have a smiley in it.

@@ +861,5 @@
> +     * letters. For example the lower case alpha (a) has the char code
> +     * 97, and the upper case alpha (A) has 65. So for "a" we would 
> +     * store -32 in upperCase (97 + (-32) = 65 :) and 0 in lowerCase,
> +     * because this char is already in lower case.
> +     * For upper case alpha, we would store, 0 in upperCase and 32 in

"we'd store 0 in upperCase"

@@ +864,5 @@
> +     * because this char is already in lower case.
> +     * For upper case alpha, we would store, 0 in upperCase and 32 in
> +     * lowerCase (65 + 32 = 97).
> +     * 
> +     * If we can't store the delta between to chars the flag 

"If the delta between the chars wouldn't fit in a jschar, the flag..."

@@ +868,5 @@
> +     * If we can't store the delta between to chars the flag 
> +     * FLAG_NO_DELTA is set, and you can just use upperCase and lowerCase
> +     * without adding them the base char. See CharInfo.toUpperCase().
> +     * 
> +     * We use deltas because this way, we can safe a lot data. For

"We use deltas to reuse information for multiple characters."

@@ +871,5 @@
> +     * 
> +     * We use deltas because this way, we can safe a lot data. For
> +     * example the whole lower case english alphabet fits into on entry,
> +     * because it's always a UnicodeLetter and upperCase contains
> +     * -32.

This whole comment has made me realize the JS approach to uppercase and lowercase is just Unicode-bogus.  The browser and stuff that want this information will want it working for even stuff past U+FFFF.  So probably we shouldn't actually share any of this, or we should leave that for someone who really understands the whole problem space.  :-\  Sadfaces.

@@ +873,5 @@
> +     * example the whole lower case english alphabet fits into on entry,
> +     * because it's always a UnicodeLetter and upperCase contains
> +     * -32.
> +     */
> +public:     

Two-space (half) indent

@@ +880,5 @@
> +    uint8 flags; 
> +
> +
> +    inline bool
> +    isSpace() 

inline bool isSpace() {

Likewise for the rest of these.

@@ +894,5 @@
> +
> +    inline bool
> +    isIdentifierPart()
> +    {
> +        /* Attention: we return false for $ and _ in this function */

A CharacterInfo isn't tied to a particular character, so I don't think this comment makes sense here.

@@ +915,5 @@
> +
> +
> +extern const uint16 index1[];
> +extern const uint16 index2[];
> +extern CharacterInfo js_charinfo[];

This should be const CharacterInfo.

@@ +917,5 @@
> +extern const uint16 index1[];
> +extern const uint16 index2[];
> +extern CharacterInfo js_charinfo[];
> +
> +inline CharacterInfo&

...so this probably needs to be as well.

@@ +931,3 @@
>  }
>  
> +class unicode

This should be namespace unicode.  Yes, that's different from how the enum thing I mentioned should go.  I'm not entirely certain I can cogently articulate the distinction, myself.  But methods going in namespaces is consistent with existing code.  Then the few files that want this stuff can do |using namespace js::unicode;| to avoid needing to type out unicode::IsSpace or whatever.

@@ +934,2 @@
>  {
> +public:

This ordinarily should be indented two spaces, since it's in a class definition.  But if this is all becoming a namespace, nothing inside here will need indentation, anyway.

@@ +935,5 @@
> +public:
> +    static bool
> +    IsIdentifierStart(jschar ch)
> +    {
> +        /* Per ECMA-262 5th Edition the first character of some indentifer

"of an identifier" (note typo)

Multiline comments should be in this form:

/*
 * ...
 */

i.e. with start of the comment being an otherwise-empty line.

@@ +937,5 @@
> +    IsIdentifierStart(jschar ch)
> +    {
> +        /* Per ECMA-262 5th Edition the first character of some indentifer
> +         * must be either:
> +         * (From $ 7.6 IdentifierStart)

ES5 7.6 IdentifierStart

@@ +939,5 @@
> +        /* Per ECMA-262 5th Edition the first character of some indentifer
> +         * must be either:
> +         * (From $ 7.6 IdentifierStart)
> +         * $ (dollar sign)
> +         * _ (low line)

"underscore" is the normal English (or at least American) name for this character; I did a double-take on seeing "low line" here.

Or maybe you're really just referring to it the same way Unicode does.  That seems reasonable, but you should make this association clearer: please say U+0024 DOLLAR SIGN and U+005F LOW LINE so it's obvious.

@@ +947,5 @@
> +         * 
> +         * Attention: this functions does _not_ handle 
> +         * unicode escape sequences. The dollar sign and low line are not
> +         * included in the set of UnicodeLetters. So IsLetter('$') returns 
> +         * false. They are both handled by the lookup table.

This comment talking about IsLetter('$') seems out of place.  IsIdentifierStart('$') does exactly what it claims to do; how IsLetter('$') works doesn't really matter -- or at least it doesn't really matter inside this method.

@@ +959,5 @@
>  
> +    static bool
> +    IsIdentifierPart(jschar ch)
> +    {
> +        /* This handles ECMA-262 5th Edition $ 7.6 "IdentifierPart". 

I'd just copy in the pseudo-BNF from the spec directly:

IdentifierPart ::
  IdentifierStart
  UnicodeCombiningMark
  UnicodeDigit
  UnicodeConnectorPunctuation
  <ZWNJ>
  <ZWJ>

Although, given this knowledge I'm not sure what the reader's supposed to do with it, since the implementation doesn't logically track those alternatives.  /* Matches ES5 7.6 IdentifierPart. */ seems good enough to me.

@@ +985,5 @@
> +
> +    static bool
> +    IsSpace(jschar ch)
> +    {
> +        /* IsSpace checks if some character is included in the merged set

/*
 * IsSpace ...

@@ +987,5 @@
> +    IsSpace(jschar ch)
> +    {
> +        /* IsSpace checks if some character is included in the merged set
> +         * of WhiteSpace and LineTerminator, specified by ECMA-262 5th Edition
> +         * $ 7.2 and $ 7.3. We combined them, because in practice nearly

ES5 7.2 and 7.3, no need to spell it out overlong, most other places don't.

The aesthete in me wants "ยง" here, but nothing else does, so it'd be out of place.  ;-)

@@ +991,5 @@
> +         * $ 7.2 and $ 7.3. We combined them, because in practice nearly
> +         * every calling function wants this, except some code in the tokenizer.
> +         * 
> +         * We use a lookup table for ASCII-7 characters, because they are 
> +         * very common and need to handled fast in the tokenizer.

"must be handled quickly"

@@ +1013,5 @@
> +
> +        if (ch < 128)
> +            return js_isspace[ch];
> +
> +        if (ch == NO_BREAK_SPACE || ch == BYTE_ORDER_MARK2)

The /* We accept */ comment should move to just before this.

@@ +1027,5 @@
> +        
> +        /* The delta didn't fit into jschar, so we had to store the 
> +         * actual char code.
> +         */
> +        if (info.flags & CharFlag::NO_DELTA)

Just a thought, but maybe the flag should be CharFlag::UPPER_LOWER_ARE_ACTUAL?  That seems clear enough to be able to get rid of the above comment and its equivalent in ToLowerCase(ch).

@@ +1031,5 @@
> +        if (info.flags & CharFlag::NO_DELTA)
> +            return info.upperCase;
> +        
> +        /* Add the delta to produce the right upper case letter. */
> +        return ch + info.upperCase;    

I'd say NO_DELTA is commented such that this comment isn't needed.

::: js/src/tests/ecma_3/extensions/regress-274152.js
@@ +54,5 @@
>   
>    expect = 'SyntaxError: illegal character';
>  
> +  var formatcontrolchars = [/*
> +                            '\u200C', These are allowed by ECMA-262 7.1  <ZWNJ> and <ZWJ>

Just remove them from the list entirely.
Comment 23 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-05 16:50:13 PDT
Comment on attachment 541212 [details] [diff] [review]
pre-v1

Review of attachment 541212 [details] [diff] [review]:
-----------------------------------------------------------------

Wheeee, Python script comments done!  Ball's in your end now, thus canceling r?.

::: js/src/unicode-tests/reader.py
@@ +1,1 @@
> +# -*- coding: utf-8 -*-

Probably this file should just generate the vm/Unicode.cpp file I mentioned previously -- right now it dumps stuff into out, but then the developer has to copy-paste it into jsstr.cpp.  Get rid of the middleman, make it patently obvious that the contents of the file are themselves entirely unimportant as far as making changes to it goes.  :-)

Along these lines, I don't see any generated comments like "/* generated by reader.py -- DO NOT MODIFY */".  Any files generated here should have such beasts.  And (I'm not the expert) it might not be a bad idea to spit out some sort of license header in the generated files, ideally as generous as possible to simplify matters, to placate Linux distros and whoever that want every source file in the tree to have a license header.

@@ +7,5 @@
> +
> +reader = csv.reader(open('UnicodeData.txt', 'r'), delimiter=';')
> +
> +# ECMAScript 5 $ 7.2
> +whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]

It might be more readable to name the characters.  I tried something like this to get Python to do the work for me, but it doesn't work (I'm guessing it doesn't associate control characters with names, which is sad):

whitespace = map(lambda name: ord(unicodedata.lookup(name)),
                 ["CHARACTER TABULATION",
                  "LINE TABULATION",
                  "FORM FEED",
                  "SPACE",
                  "NO-BREAK SPACE",
                  "ZERO WIDTH NO-BREAK SPACE", # also BOM
                 ])

I asked about this on stackoverflow, maybe someone there will have a bright idea:

http://stackoverflow.com/questions/6589586/how-can-i-determine-a-unicode-character-from-its-name-in-python-even-if-that-cha

@@ +9,5 @@
> +
> +# ECMAScript 5 $ 7.2
> +whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]
> +# $ 7.3
> +line_terminator = [0xa, 0xd, 0x2028, 0x2029]

Same comment and wonderings as above apply here.

@@ +22,5 @@
> +FLAG_LETTER = 1 << 1
> +FLAG_IDENTIFIER_PART = 1 << 2
> +FLAG_NO_DELTA = 1 << 3
> +FLAG_ENCLOSING_MARK = 1 << 4
> +FLAG_COMBINING_SPACING_MARK = 1 << 5

These all could use vertical alignment, again.

@@ +47,5 @@
> +    
> +    if code > MAX:
> +        break
> +    
> +    # we use all this in one class, because in practice we don't need to seperate them often

separate

@@ +48,5 @@
> +    if code > MAX:
> +        break
> +    
> +    # we use all this in one class, because in practice we don't need to seperate them often
> +    # 0x85 NEL (Next Line) is not enforced by ECMA-262, but it says:

I filed https://bugs.ecmascript.org/show_bug.cgi?id=118 on fixing the test262 tests which wrongly believe U+0085 should be a space character, so this should probably be undone, and U+0085 not special-cased.
Comment 24 Tom Schuster [:evilpie] 2011-07-18 11:41:11 PDT
Comment on attachment 541212 [details] [diff] [review]
pre-v1

Review of attachment 541212 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsstr.h
@@ +871,5 @@
> +     * 
> +     * We use deltas because this way, we can safe a lot data. For
> +     * example the whole lower case english alphabet fits into on entry,
> +     * because it's always a UnicodeLetter and upperCase contains
> +     * -32.

While fixing some bugs and looking at v8 made me realize that we indeed do the wrong thing at the momemnt, not even for the small region till U+FFFF. We actually would need to look at the previous (or next?) character and even the language. ES5 specifies this in 15.5.4.16:
>The result must be derived according to the case mappings in the Unicode >character database (this explicitly includes not only the UnicodeData.txt >file, but also the SpecialCasings.txt file that accompanies it in Unicode >2.1.8 and later).

I am going to do this in Bug 672042 after this landed.

::: js/src/unicode-tests/reader.py
@@ +7,5 @@
> +
> +reader = csv.reader(open('UnicodeData.txt', 'r'), delimiter=';')
> +
> +# ECMAScript 5 $ 7.2
> +whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]

Nice idea, but really it's lame that they don't safe the aliases.

@@ +22,5 @@
> +FLAG_LETTER = 1 << 1
> +FLAG_IDENTIFIER_PART = 1 << 2
> +FLAG_NO_DELTA = 1 << 3
> +FLAG_ENCLOSING_MARK = 1 << 4
> +FLAG_COMBINING_SPACING_MARK = 1 << 5

I don't do this in python.
Comment 25 Tom Schuster [:evilpie] 2011-07-18 11:45:41 PDT
Created attachment 546588 [details] [diff] [review]
v1

So added a few comments to your comments :)

Please also note Bug 672083 on better compressing for the table, eg v8 uses a different table for case mappings and does a limited table search for cases like IsSpace.

I added some code for generating two tests in make_unicode.py, but we should have a look what needs tests.
Also the comment for Unicode.h took me a long time to write, but i am not entirely sure that i hit the nail with them, because i was knee deep into that stuff while writing it :/
Comment 26 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-19 16:23:13 PDT
Comment on attachment 546588 [details] [diff] [review]
v1

Review of attachment 546588 [details] [diff] [review]:
-----------------------------------------------------------------

This doesn't implement correct toLowerCase/toUpperCase semantics, per bug 672042, right?  But it's an improvement over status quo.  And we can fix that building on this.  Just want to be sure I understand the overall situation here...

This is mostly down to nits, but there are enough of them, and this change is complex enough, that we should probably err on the side of over-reviewing.  A few tweaks and one more pass and I think we're good here!

::: js/src/Makefile.in
@@ +187,5 @@
>  		GlobalObject.cpp \
>  		Stack.cpp \
>  		String.cpp \
>  		ParseMaps.cpp \
> +		Unicode.cpp \

Hum, I thought vm/*.cpp files were in their own list.  Sad that they're not.  :-(

::: js/src/jsdate.cpp
@@ +76,5 @@
>  
>  #include "vm/Stack-inl.h"
>  
>  using namespace js;
> +using namespace ::unicode;

Make that js::unicode -- that's definitely the more common style in js/src, and having dependent |using| statements like that seems like not great style anyway.

For just two uses, tho, I'd be tempted to just unicode::ToLowerCase() both of them, if the change fit in 99ch.

::: js/src/jsnum.cpp
@@ +1456,5 @@
>      char *cstr, *istr, *estr;
>      JSBool negative;
>      jsdouble d;
>  
> +    s1 = js_SkipSpace(s, send);

If you're changing the line, let's move the declaration of s1 down to this first use.  No need to fix up the rest of the method's declarations, although if you want to I don't particularly mind.

::: js/src/jsnuminlines.h
@@ +80,5 @@
>      }
>  
>      const jschar *bp = chars;
>      const jschar *end = chars + length;
> +    bp = js_SkipSpace(bp, end);

I'd be tempted to just pass |chars| as the first argument and assign to and declare |bp| at the same time.  The current code requires an extra bit of understanding to see that the range being maybe advanced through is [chars, chars + end), because the variable has another intermediate value to read through.

::: js/src/jsopcode.cpp
@@ -766,5 @@
>      for (const jschar *t = s; t < z; s = ++t) {
>          /* Move t forward from s past un-quote-worthy characters. */
>          jschar c = *t;
> -        while (JS_ISPRINT(c) && c != qc && c != '\\' && c != '\t' &&
> -               !(c >> 8)) {

|!(c >> 8)|?!?!?!  How did I not notice this in previous readings here?  I'm quite happy to see that hard-to-read condition go away!

::: js/src/jsscan.cpp
@@ +78,5 @@
>  #include "jsxml.h"
>  #endif
>  
>  using namespace js;
> +using namespace ::unicode;

js::unicode

@@ +1168,2 @@
>                          continue;
>                      }

This can be a single-liner now.  And I like the explicit |continue| here for what'd otherwise be an empty loop!  This idiom was only pointed out to me yesterday (for avoiding loops with the empty statement as bodies), but it seems clearly the correct alternative in hindsight.

@@ +1360,5 @@
>          }
>  
>          tp = newToken(-1);
> +        
> +        /* It's okay to use IsLetter here, because $ and _ are < 128) */

I think I might have suggested this comment, but on second read it's clunky.  What say you to this?

/* '$' and '_' don't pass IsLetter, but they're < 128 so never appear here. */

::: js/src/jsstr.cpp
@@ +87,5 @@
>  #include "vm/String-inl.h"
>  
>  using namespace js;
>  using namespace js::gc;
> +using namespace js::unicode;

\o/

@@ +1206,5 @@
>              ++begin;
>      }
>  
>      if (trimRight) {
> +        while (end > begin && unicode::IsSpace(chars[end-1]))

|chars[end - 1]|

::: js/src/jsstr.h
@@ +229,5 @@
> +namespace unicode {
> +template <typename T>
> +inline bool
> +IsSpace(T ch);
> +}

Is this namespace unicode { IsSpace declaration } really necessary?  You include vm/Unicode.h above, which I would think would be sufficient to give you this.

Also, I'm somewhat wary of declaring inline functions like this and then using them before actually defining them, as you do here.  I thought that spewed warnings with at least some compilers.  If this turns out to be necessary, I'd say you should move this method into jssstrinlines.h, and have that include the header that defines unicode::IsSpace.

@@ +235,5 @@
> +/*
> + * Return s advanced past any Unicode white space characters.
> + */
> +static inline const jschar *
> +js_SkipSpace(const jschar *s, const jschar *end)

Don't use the js_ prefix any more: namespace js ftw!  You're even inside one, so this is defining a js::js_SkipSpace method.  :-O  (The one exception I'm aware of: traceable natives declared with macros have cruel and unusual things done to them, and I don't think you can have a namespaced traceable native, so keep using js_* there.)

@@ +239,5 @@
> +js_SkipSpace(const jschar *s, const jschar *end)
> +{
> +    JS_ASSERT(s <= end);
> +    
> +    while (s != end && unicode::IsSpace(*s))

General hygiene: use < for pointer range checking, not !=.  Nothing can go wrong here.  But in general, supposing you misunderstood your code and the pointer somehow skipped over the sentinel, you'd end up looping for a really long time on bad memory, and the loop body or whatever might do things with that out-of-bounds index which could result in a vulnerability if you were unlucky.  Best to avoid that as a matter of practice with the strongest condition that can be written.

::: js/src/make_unicode.py
@@ +131,5 @@
> +            test_mapping.write('  [' + hex(upper) + ', ' + hex(lower) + '], /* ' +
> +                       name + (' (' + alias + ')' if alias else '') + ' */\n')
> +        else:
> +            test_mapping.write('  [' + hex(code) + ', ' + hex(code) + '],\n')
> +    test_mapping.write(']')

var mapping = [
...
]
assertEq...

relies on ASE, always unwise -- write a semicolon after the close of the array literal.

@@ +191,5 @@
> + * Step 3:
> + *  Combining the index and the bottom 6 bits of the orginial jschar.
> + *   real_index = index2[(index << 6) + (jschar & 0b111111)] (this is [..********++++++])
> + *
> + *  The advantage here is that the bigest number in index1 doesn't need 10 bits, but 8 and we safe 

"The advantage here is that the biggest number in index1 only needs 8 bits, not 10, and we save..."

@@ +230,5 @@
> +    data_file.write(comment)
> +    data_file.write('const CharacterInfo js_charinfo[] = {\n')
> +    for d in table:
> +        data_file.write('    {')
> +        data_file.write(', '.join((str(e) for e in d)) + '},\n')

If you're not going to have all this in one write, let's make it one write for the '    {', one for the join, and one for the ',\n'.

@@ +243,5 @@
> +        for entry in data:
> +            s = str(entry)
> +            assert len(s) <= 3
> +            if len(s) < 3:
> +                s = (' ' * (3 - len(s))) + s

I believe this is what rjust was for:

  s = str(entry).rjust(3)

@@ +251,5 @@
> +                line = pad + s + ', '
> +            else:
> +                line = line + s + ', '
> +        lines.append(line)
> +        lines[-1] = lines[-1][:-2] # remove ", "

Actually, I think trailing commas here (inside array initializers) are permitted by ISO C++.  (Let us pause for a moment and curse ISO C++ for not extending this nicety to enum declarations.  *pause*  Done.)  If no compilers complain, it seems best to not have to do/explain the extra work.  :-)

@@ +264,5 @@
> +    
> +    data_file.write('}\n}\n')
> +
> +def getsize(data):
> +    # return smallest possible integer size for the given array

Should be a docstring, that's the Python way as I understand it.

@@ +271,5 @@
> +        return 1
> +    elif maxdata < 65536:
> +        return 2
> +    else:
> +        return 4

An |assert maxdata < 2**32| here would be nice out of an abundance of caution.

@@ +274,5 @@
> +    else:
> +        return 4
> +
> +def splitbins(t):
> +    """t, trace=0 -> (t1, t2, shift).  Split a table to save space.

"trace=0" looks vestigial.

@@ +297,5 @@
> +            n >>= 1
> +            maxshift += 1
> +    del n
> +    bytes = sys.maxsize  # smallest total size so far
> +    t = tuple(t)    # so slices can be dict keys

(t,) might be more idiomatic, not sure.

@@ +298,5 @@
> +            maxshift += 1
> +    del n
> +    bytes = sys.maxsize  # smallest total size so far
> +    t = tuple(t)    # so slices can be dict keys
> +    for shift in [6]:

Is this vestigial too?  Looks like this was all some sort of adaptive code trying to determine the best possible split, or something, but this looks like it might be just hardcoding a 6 or something.

@@ +325,5 @@
> +    print("Best:", end=' ', file=sys.stderr)
> +    dump(t1, t2, shift, bytes)
> +
> +    # exhaustively verify that the decomposition is correct
> +    mask = ~((~0) << shift) # i.e., low-bit mask of shift bits

|2**shift - 1| seems clearer to me, and wouldn't need a comment in my book.

@@ +349,5 @@
> +    print('Generating...')
> +    generate_unicode_stuff(unicode_data,
> +        open('vm/Unicode.cpp', 'w'),
> +        open('tests/ecma_5/String/string-upper-lower-mapping.js', 'w'),
> +        open('tests/ecma_5/String/string-space-trim.js', 'w'))

I like how this generates tests too!

::: js/src/vm/Unicode.h
@@ +1,1 @@
> +#ifndef Unicode_h_

This file could use a license header.  Right?

@@ +73,5 @@
> +const jschar BYTE_ORDER_MARK2 = 0xFFFE;
> +const jschar NO_BREAK_SPACE  = 0x00A0;
> +
> +class CharacterInfo {
> +    /* 

This comment (and various others) doesn't look like it wraps at 79 characters -- please rewrap at 79ch if it's not.

@@ +127,5 @@
> +template <typename T>
> +inline const CharacterInfo&
> +CharInfo(T ch)
> +{
> +    /* We must accept EOF, because jsscan.cpp sometimes passes EOF. */

That seems like a bug to me, or at least like semantics that make this method more complex than it should be.  File a followup to make those cases handle EOF explicitly?

@@ +129,5 @@
> +CharInfo(T ch)
> +{
> +    /* We must accept EOF, because jsscan.cpp sometimes passes EOF. */
> +    JS_ASSERT((ch <= 0xffff && ch >= 0) || ch == EOF);
> +    uint16 code = uint16(ch);

This should probably be jschar, not uint16, since it's a character code in a very real sense.

@@ +292,5 @@
> +}
> +
> +
> +}
> +}

/* namespace js */ and /* namespace unicode */ here.  Isolated closing braces at top level like this are horribly hard to read.  :-)
Comment 27 Tom Schuster [:evilpie] 2011-07-20 05:05:14 PDT
Created attachment 547041 [details] [diff] [review]
v2

>Is this vestigial too?  Looks like this was all some sort of adaptive code >trying to determine the best possible split, or something, but this looks like >it might be just hardcoding a 6 or something.
nice, good catch
>This file could use a license header.  Right?
I wasn't to sure about what licensing i should use, because you said something about a permissive license in a previous review. I don't think thats terrible important here, because this code isn't very useful except you implemented javascripts botched unicode. I hope nobody does that.
Comment 28 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-21 16:42:10 PDT
Comment on attachment 547041 [details] [diff] [review]
v2

Review of attachment 547041 [details] [diff] [review]:
-----------------------------------------------------------------

Address the few comments, then let's get this landed!  \o/

::: js/src/jsnuminlines.h
@@ +40,5 @@
>  #ifndef jsnuminlines_h___
>  #define jsnuminlines_h___
>  
> +#include "jsstrinlines.h"
> +#include "vm/Unicode.h"

Generally we've been keeping subdirectory #includes separate by a blank line, and we include all non-inline headers before we include any inline headers, and we include headers in js/src first, then vm/* and other subdirectories:

  #include "vm/Unicode.h"
  
  #include "jsstrinlines.h"

I don't remember all the style quirks here, but I'm sure at least for including a header in js/src and a header in a subdirectory, we definitely have a single style requiring the above.

::: js/src/make_unicode.py
@@ +37,5 @@
> +line_terminator = [
> +    0xa, # LINE FEED
> +    0xd, # CARRIAGE RETURN
> +    ord(u'\N{LINE SEPARATOR}'),
> +    ord(u'\N{PARAGRAPH SEPARATOR}')

Comma after this.

@@ +169,5 @@
> +""")
> +
> +    index1, index2, shift = splitbins(index)
> +
> +    # Don't forget updating CharInfo in Unicode.cpp if you need to change this

"Don't forget to update..."

@@ +213,5 @@
> + * let index2 be an empty array
> + * let cache be a hash map
> + *
> + * while shift is less then maximal amount you can shift 0xffff before it's 0
> + *  let chunks be table split in chunks of size 2^shift

Since ^ is a bit ambiguous between exponentiation and bit-xor, use ** instead here.

::: js/src/tests/ecma_5/String/string-space-trim.js
@@ +1,2 @@
> +/* Generated by make_unicode.py DO NOT MODIFY */
> +var onlySpace = String.fromCharCode(0x9, 0xa, 0xb, 0xc, 0xd, 0x20, 0xa0, 0x1680, 0x180e, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008, 0x2009, 0x200a, 0x2028, 0x2029, 0x202f, 0x205f, 0x3000, 0xfeff);

Given that it looks like make_unicode.py does indeed now print out a PD notice, did you not rerun it?  Please do so.

::: js/src/tests/ecma_5/String/string-upper-lower-mapping.js
@@ +1,2 @@
> +/* Generated by make_unicode.py DO NOT MODIFY */
> +var mapping = [

Regenerate this, too.

::: js/src/vm/Unicode.cpp
@@ +1,2 @@
> +/* Generated by make_unicode.py DO NOT MODIFY */
> +#include "Unicode.h"

And here needs regen.

::: js/src/vm/Unicode.h
@@ +14,5 @@
> + *
> + * The Original Code is SpiderMonkey JavaScript engine.
> + *
> + * The Initial Developer of the Original Code is
> + * Mozilla Corporation.

This is all your code, right?  So you're the Initial Developer, not MoCo.  :-)  Or you could put the SpiderMonkey project or somesuch, I guess.

@@ +36,5 @@
> + *
> + * ***** END LICENSE BLOCK ***** */
> +
> +#ifndef Unicode_h_
> +#define Unicode_h_

On second look, this seems like the sort of macro-name that wouldn't unreasonably be found in other code.  :-\  vm_Unicode_h_ maybe?  h8 C macros.
Comment 29 Tom Schuster [:evilpie] 2011-07-21 17:15:57 PDT
Created attachment 547566 [details] [diff] [review]
Final

Can't check-in at the moment.
Comment 30 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-21 17:42:36 PDT
I'm going to check this in, no need for the keyword.
Comment 31 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-22 01:43:07 PDT
Well, something's whacked.

http://tbpl.mozilla.org/?tree=Try&rev=e271453a3173

Tom, if you have a Mac (I don't remember you having one, but I could be wrong), feel free to debug this before I show up in the morning.  (And if you don't mind, hg qimport this URL rather than modifying the patch in your tree, so you get the username/commit message that your posted patch here lacks:

http://hg.mozilla.org/try/raw-rev/50e75bf9ba3e

)  If you don't have a Mac, I'll try a hand at it myself.  Either way, we definitely need more changes before we can push this into the tree.
Comment 32 Tom Schuster [:evilpie] 2011-07-22 08:05:41 PDT
I don't have a mac, but i already had similar test failures when i updated the table, but didn't change the shifts. In that case some characters were identified as the wrong ones. I wonder why this only happens on OPT.
Comment 33 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-22 16:50:33 PDT
So for people not following along on IRC, it bounced because IsIdentifierStart(T c) is called with T = int32_t, specifically with |c == EOF && c == -1|, and |-1 < 128|, and therefore the result is determined by |js_isidstart[-1]|, which in Mac optimized builds happens to not be completely 0, and we get random flags.  So fixing bug 672760 is actually a prerequisite to fixing this, so we're not passing in EOF (which isn't even guaranteed to be -1, note) to these methods.
Comment 34 Tom Schuster [:evilpie] 2011-07-23 13:01:08 PDT
Created attachment 547950 [details] [diff] [review]
v3
Comment 35 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-26 14:01:26 PDT
Comment on attachment 547950 [details] [diff] [review]
v3

I'm going to push this and bug 672760 to try, then if that's good I should be able to land this.  \o/
Comment 36 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-27 00:12:18 PDT
Try approved, after I corrected a buglet in bug 672760's patch, so I pushed this to m-i.  Before pushing I made some changes to remove trailing whitespace noticed during |hg o -p| and also to add a few .rstrip() calls to make_unicode.py to do the same in the generated files (easy to verify correctness with |hg diff -b|):

http://hg.mozilla.org/integration/mozilla-inbound/rev/54b8ca3b0c7a

And with that, barring unexpected calamities, I think we're done in this bug.  \o/
Comment 38 Tom Schuster [:evilpie] 2011-08-01 12:23:52 PDT
*** Bug 26473 has been marked as a duplicate of this bug. ***
Comment 39 Tom Schuster [:evilpie] 2011-08-01 12:46:18 PDT
*** Bug 497692 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.