Add support for locale argument to String.prototype.toLocale{Lower,Upper}Case

RESOLVED FIXED in Firefox 55

Status

()

defect
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: anba, Assigned: anba)

Tracking

(Blocks 1 bug, {dev-doc-complete})

Trunk
mozilla55
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox55 fixed)

Details

Attachments

(12 attachments, 17 obsolete attachments)

9.72 KB, patch
anba
: review+
Details | Diff | Splinter Review
6.03 MB, patch
anba
: review+
Details | Diff | Splinter Review
67.92 KB, patch
anba
: review+
Details | Diff | Splinter Review
23.21 KB, patch
anba
: review+
Details | Diff | Splinter Review
30.13 KB, patch
Waldo
: review+
Details | Diff | Splinter Review
555.89 KB, patch
Waldo
: review+
Details | Diff | Splinter Review
37.67 KB, patch
anba
: review+
Details | Diff | Splinter Review
4.33 KB, patch
anba
: review+
Details | Diff | Splinter Review
4.29 KB, patch
anba
: review+
Details | Diff | Splinter Review
29.77 KB, patch
anba
: review+
Details | Diff | Splinter Review
4.61 KB, patch
anba
: review+
Details | Diff | Splinter Review
1.26 KB, patch
anba
: review+
Details | Diff | Splinter Review
(Assignee)

Updated

2 years ago
Blocks: test262
(Assignee)

Updated

2 years ago
Blocks: 672042
(Assignee)

Comment 1

2 years ago
Posted patch bug1318403.patch (obsolete) — Splinter Review
Unfortunately it wasn't possible to simply switch to ICU without causing noticeable performance regressions, so this patch got a bit longer than expected. :-|


builtin/String.js:
Support the locale argument for String.prototype.toLocale{Lower,Upper}Case() per [1,2]. Language tag validation and canonicalization isn't exactly fast, so we need fast paths to handle the common cases when zero or one arguments are provided.

[1] http://tc39.github.io/ecma402/#sup-string.prototype.tolocalelowercase
[2] http://tc39.github.io/ecma402/#sup-string.prototype.tolocaleuppercase

builtin/Intl.js:
Again another fast path to handle language tags which contain only a language subtag.

builtin/Intl.cpp:
Call to ICU when language dependent case mapping is required, otherwise use the normal String.prototype.to{Lower,Upper}String() implementation. We can't use ICU for all case mappings, because it's 0.3 to 4 times slower when compared to the optimized case conversion methods from this patch. The language check is done in C++ instead of self-hosted code, because it's faster to do it in C++, at least when measured with µ-benchmarks. 

jsstr.cpp:
The case mapping methods in jsstr.cpp only supported simple case mappings, but ICU always applies full case mappings (that means the special case mappings from Unicode's SpecialCasing.txt file are used). This forces us to implement full case mapping to avoid inconsistent results when String.prototype.to{Lower,Upper}String() is invoked. Yay! 

Full case mapping has the interesting property that it can change the string length for upper or lower case forms. For example "ß" (German sharp S) is upper cased to two "S" characters. But since most strings won't contain any characters which require special casing, we first try to case map the string into a new string of the same length. Only if that fails we inspect the whole string to compute the actual length for the case mapped string and then restart the case mapping process.

Some special casing rules are inlined in jsstr.cpp, for example the lower case mapping for U+0130. This shouldn't be a forward compatibility problem with new Unicode versions, because SpecialCasing.txt doesn't get updated regularly (last update was in 2002 for Unicode 3.2). Nevertheless I've also added validations for the inlined cases to vm/make_unicode.py to make sure we're notified if it was updated.

vm/make_unicode.py:
Added new methods to read and process the SpecialCasing.txt file. To avoid checking every special casing character one after another, the generated code for js::unicode::CanUpperCaseSpecialCasing() is a bit more complex than a simple switch-statement, but hopefully the added complexity will pay off in practice (it certainly does in µ-benchmarks).
Attachment #8812872 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 2

2 years ago
Posted patch bug1318403.patch (obsolete) — Splinter Review
Updated patch to apply cleanly on inbound/central.
Attachment #8812872 - Attachment is obsolete: true
Attachment #8812872 - Flags: review?(jwalden+bmo)
Attachment #8813291 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 3

2 years ago
Posted patch bug1318403.patch (obsolete) — Splinter Review
Updated patch to apply cleanly on inbound.
Attachment #8813291 - Attachment is obsolete: true
Attachment #8813291 - Flags: review?(jwalden+bmo)
Attachment #8822705 - Flags: review?(jwalden+bmo)
Summary: Add support for locale argument to String.prototype.toLocale{Lower,Uppper}Case → Add support for locale argument to String.prototype.toLocale{Lower,Upper}Case
(Assignee)

Comment 4

2 years ago
Reduce code duplication and prepare make_bmp_mapping_test for modifications in part 2.
Attachment #8822705 - Attachment is obsolete: true
Attachment #8822705 - Flags: review?(jwalden+bmo)
Attachment #8840173 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 5

2 years ago
Added new methods to read and process the Unicode SpecialCasing.txt file. To avoid checking every special casing character one after another, the generated code for js::unicode::CanUpperCaseSpecialCasing() is a bit more complex than a simple switch-statement, but hopefully the added complexity will pay off in practice (it certainly does in µ-benchmarks).
Attachment #8840175 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 6

2 years ago
Rerun make_unicode.py to pick up the changes from part 2.
Attachment #8840177 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 7

2 years ago
Change String.prototype.to{Lower,Upper}Case() to handle Unicode special casing characters, so we finally apply full case mapping instead of just simple case mapping.

Full case mapping has the interesting property that it can change the string length for upper or lower case forms. For example "ß" (German sharp S) is upper cased to two "S" characters. But since most strings won't contain any characters which require special casing, we first try to case map the string into a new string of the same length. Only if that fails we inspect the whole string to compute the actual length for the case mapped string and then restart the case mapping process.

Some special casing rules are inlined in jsstr.cpp, for example the lower case mapping for U+0130. This shouldn't be a forward compatibility problem with new Unicode versions, because SpecialCasing.txt doesn't get updated regularly (last update was in 2002 for Unicode 3.2). Nevertheless I've also added validations for the inlined cases to vm/make_unicode.py to make sure we're notified if it was updated.
Attachment #8840178 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 8

2 years ago
builtin/String.js:
Support the locale argument for String.prototype.toLocale{Lower,Upper}Case() per [1,2]. Language tag validation and canonicalization isn't exactly fast, so we need fast paths to handle the common cases when zero or one arguments are provided.

[1] http://tc39.github.io/ecma402/#sup-string.prototype.tolocalelowercase
[2] http://tc39.github.io/ecma402/#sup-string.prototype.tolocaleuppercase

builtin/Intl.js:
Again another fast path to handle language tags which contain only a language subtag.

builtin/Intl.cpp:
Call to ICU when language dependent case mapping is required, otherwise use the normal String.prototype.to{Lower,Upper}String() implementation. We can't use ICU for all case mappings, because it's 0.3 to 4 times slower when compared to the optimized case conversion methods from this patch. The language check is done in C++ instead of self-hosted code, because it's faster to do it in C++, at least when measured with µ-benchmarks.
Attachment #8840179 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 9

2 years ago
Update the test262 exclusions list.
Attachment #8840180 - Flags: review?(jwalden+bmo)
Comment on attachment 8840173 [details] [diff] [review]
bug1318403-part1-reduce-code-duplication.patch

Review of attachment 8840173 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/make_unicode.py
@@ +726,5 @@
> +
> +    def write_table(data_type, name, tbl, idx1_name, idx1, idx2_name, idx2, println):
> +        println('const {} unicode::{}[] = {{'.format(data_type, name))
> +        for d in tbl:
> +            println('    {{{}}},'.format(', '.join(str(e) for e in d)))

{{ {} }}

, extra spacing making this more readable.
Attachment #8840173 - Flags: review?(jwalden+bmo) → review+
(Assignee)

Comment 11

2 years ago
Now with 100% less uncurry. :-)
Attachment #8840175 - Attachment is obsolete: true
Attachment #8840175 - Flags: review?(jwalden+bmo)
Attachment #8844524 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 12

2 years ago
Fixed the indentation issue.
Attachment #8840177 - Attachment is obsolete: true
Attachment #8840177 - Flags: review?(jwalden+bmo)
Attachment #8844527 - Flags: review?(jwalden+bmo)
Comment on attachment 8844524 [details] [diff] [review]
bug1318403-part2-process-special-casing.patch

Review of attachment 8844524 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/make_unicode.py
@@ +163,3 @@
>  def int_ranges(ints):
>      """ Yields consecutive ranges (inclusive) from integer values. """
>      from itertools import tee, izip_longest

If we're importing izip_longest at top level, it could be removed here maybe.

@@ +602,5 @@
> +    # Special casing for U+03A3 (GREEK CAPITAL LETTER SIGMA).
> +    assert lowerCase(0x03A3) == 0x03C3 and conditional_tolower[0x03A3] == ([0x03C2], 'Final_Sigma');
> +
> +    return (
> +        unconditional_tolower, unconditional_toupper

All on one line?

@@ +706,5 @@
> +            println(indent, 'if (ch < {})'.format(hexlit(min_child)))
> +            println(indent, '    return false;')
> +
> +        # If there's no successor block, we can omit the |input <= max_child| check,
> +        # because it was already check when we emitted the parent range test.

checked

@@ +782,5 @@
> +        println('    switch(ch) {')
> +        for (code, converted) in sorted(code_map.iteritems(), key=itemgetter(0)):
> +            println('      case {}: return {};'.format(hexlit(code), len(converted)))
> +        println('    }')
> +        println('    MOZ_ASSERT_UNREACHABLE("Bad character input.");')

I'd prefer a println('') before this.

@@ +805,5 @@
> +        println('    return;')
> +
> +        println('}')
> +
> +    write_has_method('CanUpperCaseSpecialCasing', unconditional_toupper)

I'd prefer if the function names were just inlined in all these functions, as there's only one caller of each of these functions.  And maybe rename each function to write_CanUpperCaseSpecialCasing and so on?  "has"/"length"/"append" as names aren't as clear as those would be.

@@ +832,5 @@
>                  (upper, lower, name, alias) = entry
> +                upper = unconditional_toupper[code] if code in unconditional_toupper else [upper]
> +                lower = unconditional_tolower[code] if code in unconditional_tolower else [lower]
> +                println('  ["{}", "{}"], /* {}{} */'.format("".join(imap(unicodeEsc, upper)),
> +                                                            "".join(imap(unicodeEsc, lower)),

It would have been nice if converting this to hexints were done in a separate patch, so the *semantic* changes to the generated test, reflecting the new casing logic, could actually be reviewed.  But I'm not going to hold you to it if it's at all difficult.
Attachment #8844524 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 8844527 [details] [diff] [review]
bug1318403-part3-recreate-files.patch

Review of attachment 8844527 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/Unicode.cpp
@@ +2685,5 @@
> +      case 0x1F88: return 2;
> +      case 0x1F89: return 2;
> +      case 0x1F8A: return 2;
> +      case 0x1F8B: return 2;
> +      case 0x1F8C: return 2;

A part of me sort of wishes these had the codepoint names in a comment next to them, like they are in the test.  But this is followup-land at best.
Attachment #8844527 - Flags: review?(jwalden+bmo) → review+
Attachment #8840180 - Flags: review?(jwalden+bmo) → review+
(Assignee)

Comment 15

2 years ago
Updated part 1 per review comments, carrying r+.
Attachment #8840173 - Attachment is obsolete: true
Attachment #8845057 - Flags: review+
(Assignee)

Comment 16

2 years ago
Recreated files, split from part 3. Carrying r+ for part 3.
Attachment #8845058 - Flags: review+
(Assignee)

Comment 17

2 years ago
Updated part 2 per review comments, carrying r+.
Attachment #8844524 - Attachment is obsolete: true
Attachment #8845059 - Flags: review+
(Assignee)

Comment 18

2 years ago
Updated part 3 so it only includes special casing changes, carrying r+.
Attachment #8844527 - Attachment is obsolete: true
Attachment #8845060 - Flags: review+
(Assignee)

Comment 19

2 years ago
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #13)
> @@ +832,5 @@
> >                  (upper, lower, name, alias) = entry
> > +                upper = unconditional_toupper[code] if code in unconditional_toupper else [upper]
> > +                lower = unconditional_tolower[code] if code in unconditional_tolower else [lower]
> > +                println('  ["{}", "{}"], /* {}{} */'.format("".join(imap(unicodeEsc, upper)),
> > +                                                            "".join(imap(unicodeEsc, lower)),
> 
> It would have been nice if converting this to hexints were done in a
> separate patch, so the *semantic* changes to the generated test, reflecting
> the new casing logic, could actually be reviewed.  But I'm not going to hold
> you to it if it's at all difficult.

Split part 3 into part 1.2 (only reformatting) and new part 3 (special casing changes).
(Assignee)

Comment 20

2 years ago
Missed to address one review comment.
Attachment #8845059 - Attachment is obsolete: true
Attachment #8845065 - Flags: review+
(Assignee)

Comment 21

2 years ago
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #14)
> ::: js/src/vm/Unicode.cpp
> @@ +2685,5 @@
> > +      case 0x1F88: return 2;
> > +      case 0x1F89: return 2;
> > +      case 0x1F8A: return 2;
> > +      case 0x1F8B: return 2;
> > +      case 0x1F8C: return 2;
> 
> A part of me sort of wishes these had the codepoint names in a comment next
> to them, like they are in the test.  But this is followup-land at best.

Sure, I can do that. Do you also want the names added to the AppendUpperCaseSpecialCasing() and the existing IsIdentifierStartNonBMP() and IsIdentifierPartNonBMP() methods?
Comment on attachment 8840178 [details] [diff] [review]
bug1318403-part4-string-special-casing.patch

Review of attachment 8840178 [details] [diff] [review]:
-----------------------------------------------------------------

Looks broadly fine, but this is too many comments piling up for me to r+, even tho I really want to.

::: js/src/jsstr.cpp
@@ +617,5 @@
> +            }
> +        }
> +        // Ignore any characters with the property Case_Ignorable.
> +        // NB: We need to skip over all Case_Ignorable characters, even when
> +        // they also have the Cased binary property.

There's a tests of this behavior somewhere, right?

@@ +658,5 @@
> + * U+03A3 (GREEK CAPITAL LETTER SIGMA) has two different lower case mappings
> + * depending on its context:
> + * When it's preceded by a cased character and not followed by another cased
> + * character, its lower case form is U+03C2 (GREEK SMALL LETTER FINAL SIGMA).
> + * Otherwise its lower case mapping is U+03C3 (GREEK SMALL LETTER SIGMA).

Mild preference for not parenthesizing the codepoint name because convention is not to.

@@ +671,5 @@
> +    MOZ_ASSERT(unicode::ToLowerCase(0x3A3) == 0x3C3);
> +
> +#if ENABLE_INTL_API
> +    if (Final_Sigma_PrecededByCased(chars, length, index) &&
> +        !Final_Sigma_FollowedByCased(chars, length, index))

As these two functions are both only used in this function, could you just inline them both, including with early-returns?  As long as there are good comments indicating what they're trying to detect, I don't think we need separate functions for the two detections performed.

@@ +710,5 @@
> +                }
> +            }
> +
> +            // Special case for U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
> +            if (c == 0x130) {

Looks like vm/Unicode.h could stand to have a section of

  const char16_t GREEK_SMALL_LETTER_SIGMA = 0x03C3;
  const char16_t LATIN_CAPITAL_LETTER_I_WITH_DOT_ABOVE = 0x0130;
  ...

with all the various constants that keep piling up.  (And ideally that have piled up in other files in one-offs, but we can deal with them later.)  This, the codepoints expanded below, the sigmas, &c.

@@ +712,5 @@
> +
> +            // Special case for U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
> +            if (c == 0x130) {
> +                // Return if the output buffer is too small.
> +                if (srcLength == destLength)

Prefer >= as safer if something goes awry.  Also in similar other places.

@@ +743,5 @@
> +template <typename CharT>
> +static size_t
> +ToLowerCaseLength(const CharT* chars, size_t startIndex, size_t length)
> +{
> +    size_t lowerLength = length;

Maybe add

    // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE is the only codepoint that
    // lowercases to more than one codepoint.
    if (mozilla::IsSame<Latin1Char, CharT>::value)
        return lowerLength;

so we don't have to rely on the compiler recognizing the loop never adds 1 in the Latin1Char case.

@@ +767,1 @@
>      size_t length = str->length();

I think it would be slightly helpful to readers (or at least this reader) to make this |const|, given we have two length-y variables floating around now.

@@ +773,5 @@
> +        // We don't need extra special casing checks in the loop below, because
> +        // U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) and U+03A3 (GREEK
> +        // CAPITAL LETTER SIGMA) already have simple lower case mappings.
> +        MOZ_ASSERT(unicode::CanLowerCase(0x130), "U+0130 has a simple lower case mapping");
> +        MOZ_ASSERT(unicode::CanLowerCase(0x3A3), "U+03A3 has a simple lower case mapping");

It'd be nice if CanLowerCase were constexpr so we could static_assert this, but I imagine that's not readably writable now, without C++14's generalized constexpr support.

@@ +779,1 @@
>          // Look for the first upper case character.

This comment is literally correct, but it's not really the right way to describe it -- it operates at too low a level to make intuitive sense, IMO.  It really should be:

"""
Look for the first character that changes when lowercased.
"""

It's true that that by definition means "is uppercase", but "uppercase" only has that meaning if you remember that this function performs a lowercasing operation.  And when there's both uppercasing and lowercasing in the conceptual space, having to remember that is cognitive overhead.  My phrasing, on the other hand, has its precise meaning without having to keep in mind context.

@@ +795,5 @@
>              if (unicode::CanLowerCase(c))
>                  break;
>          }
>  
>          // If all characters are lower case, return the input string.

Possibly this could be "If no character needs to change, return the input string."

@@ +818,5 @@
> +            PodCopy(buf.get(), newChars.get(), readChars);
> +            newChars = Move(buf);
> +
> +            MOZ_ALWAYS_TRUE((length ==
> +                ToLowerCaseImpl(newChars.get(), chars, readChars, length, resultLength)));

Remove the apparent extra parens here.

@@ +872,5 @@
>      /*
>       * Forcefully ignore the first (or any) argument and return toLowerCase(),
>       * ECMA has reserved that argument, presumably for defining the locale.
>       */
>      if (cx->runtime()->localeCallbacks && cx->runtime()->localeCallbacks->localeToLowerCase) {

I guess after the next patch (if it doesn't do so), please guard this block by #if EXPOSE_INTL_API, and add "// not used #if EXPOSE_INTL_API" by localeToLowerCase's field definition in jsapi.h.  Ditto for the uppercase equivalents.

@@ +922,5 @@
> +
> +static inline size_t
> +LengthUpperCaseSpecialCasing(char16_t charCode)
> +{
> +    return unicode::LengthUpperCaseSpecialCasing(charCode);

MOZ_ASSERT(CanUpperCaseSpecialCasing(charCode));

@@ +990,5 @@
> +}
> +
> +static bool
> +ToUpperCaseImpl(Latin1Char* destChars, const char16_t* srcChars, size_t startIndex,
> +                size_t srcLength, size_t destLength)

First, is this function necessary so that the template function above is never instantiated such that it would fail the src/dest-types static_assert?  Possibly could use a comment to that effect.

Second, is this overload actually called in any code?  If it's not called, using = delete instead seems better.

@@ +1040,5 @@
>      {
>          AutoCheckCannotGC nogc;
>          const CharT* chars = str->chars<CharT>(nogc);
>  
>          // Look for the first lower case character.

Same comment-adjustment as above.

@@ +1064,3 @@
>          }
>  
>          // If all characters are upper case, return the input string.

Same adjustment as above.

@@ +1067,5 @@
>          if (i == length)
>              return str;
>  
>          // If the string is Latin1, check if it contains the MICRO SIGN (0xb5)
>          // or SMALL LETTER Y WITH DIAERESIS (0xff) character. The corresponding

"""
The string changes when uppercased, so we must create a new string.  Can it be Latin-1?

If the original string is Latin-1, it can -- unless the string contains U+00B5 MICRO SIGN or U+00FF SMALL LETTER Y WITH DIAERESIS, the only Latin-1 codepoints that don't uppercase within Latin-1.  Search for those codepoints to decide whether the new string can be Latin-1.

If the original string is a two-byte string, its uppercase form is so rarely Latin-1 that we don't even consider creating a new Latin-1 string.
"""

@@ +1088,5 @@
>          }
>  
>          if (resultIsLatin1) {
> +            resultLength = length;
> +            Latin1CharPtr buf = cx->make_pod_array<Latin1Char>(resultLength + 1);

If we make_pod_array here but find something inflated, we make_pod_array *again* and copy into it.  But when the new allocation is only slightly bigger, which it almost always will be, we could just realloc and possibly avoid a fresh allocation and copy.  That seems much better to me.  So let's do something along those lines.  Maybe

tempalate<typename CharT>
static bool
ReallocChars(JSContext* cx, typename JS::UniquePtr<CharT[], JS::FreePolicy>* chars,
             size_t oldLength, size_t newLength)
{
    CharT* oldChars = chars->release();
    CharT* newChars = cx->maybe_pod_realloc(oldChars, oldLength, newLength);
    if (!newChars)
        return false;

    chars->reset(newChars);
    return true;
}

or so.

@@ +1106,5 @@
> +                CopyChars(buf2.get(), buf.get(), readChars);
> +                buf = Move(buf2);
> +
> +                MOZ_ALWAYS_TRUE((length ==
> +                    ToUpperCaseImpl(buf.get(), chars, readChars, length, resultLength)));

Extra parens here and in the two-byte block below.

@@ +1112,1 @@
>              newChars.construct<Latin1CharPtr>(Move(buf));

This whole block is exactly duplicated in the not-Latin-1 block.  In a followup, could you make this one template function that's separately called in the two blocks?

::: js/src/vm/Unicode.h
@@ +239,5 @@
>      return CharInfo(ch).isSpace();
>  }
>  
> +/*
> + * Returns the simple upper case mapping of the given UTF-16 code unit.

"simple upper case mapping" (see CanUpperCaseSpecialCasing for details)

and make the same change for ToLowerCase.

@@ +342,5 @@
> + * Unicode defines two case mapping modes:
> + * 1. "simple case mappings" for one-to-one mappings which are independent of
> + *    context and language (defined in UnicodeData.txt).
> + * 2. "special case mappings" for mappings which can increase or decrease the
> + *    string length or are dependent of context or language (defined in

dependent on (or independent of?  but I don't think that's what was meant).

@@ +343,5 @@
> + * 1. "simple case mappings" for one-to-one mappings which are independent of
> + *    context and language (defined in UnicodeData.txt).
> + * 2. "special case mappings" for mappings which can increase or decrease the
> + *    string length or are dependent of context or language (defined in
> + *    SpecialCasing.txt)

Maybe end #1 with a comma or semicolon, then end this with a period.

@@ +349,5 @@
> + * The CanUpperCase() method defined above only supports simple case mappings.
> + * In order to support the full case mappings of all Unicode characters,
> + * callers need to check this method in addition to CanUpperCase().
> + *
> + * NOTE: All special upper case mappings are unconditional in Unicode 9.

Maybe "permanently unconditional as of Unicode 5", which I think is right but I could be mistaken about if I misunderstand other readings I've done.

@@ +357,5 @@
> +
> +/*
> + * Returns the length of the upper case mapping of |ch|.
> + *
> + * An assertion is raised if |ch| doesn't have a special upper case mapping.

This function asserts if

and below as well.
Attachment #8840178 - Flags: review?(jwalden+bmo) → review-
(In reply to André Bargull from comment #21)
> (In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #14)
> > ::: js/src/vm/Unicode.cpp
> > @@ +2685,5 @@
> > > +      case 0x1F88: return 2;
> > > +      case 0x1F89: return 2;
> > > +      case 0x1F8A: return 2;
> > > +      case 0x1F8B: return 2;
> > > +      case 0x1F8C: return 2;
> > 
> > A part of me sort of wishes these had the codepoint names in a comment next
> > to them, like they are in the test.  But this is followup-land at best.
> 
> Do you also want the names added to the
> AppendUpperCaseSpecialCasing() and the existing IsIdentifierStartNonBMP()
> and IsIdentifierPartNonBMP() methods?

Yeah, I think so.
Comment on attachment 8840179 [details] [diff] [review]
bug1318403-part5-intl-case-conversion.patch

Review of attachment 8840179 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/builtin/Intl.cpp
@@ +798,5 @@
>  
>  } // anonymous namespace
>  
> +int32_t u_strToLower(UChar* dest, int32_t destCapacity, const UChar* src, int32_t srcLength,
> +                     const char* locale, UErrorCode* pErrorCode)

int32_t
u_strToLower(...)
{

and same below.

@@ +3641,5 @@
>  
> +/******************** String ********************/
> +
> +// Lithuanian, Turkish, and Azeri have language dependent case mappings.
> +const char* const languagesWithSpecialCasing[] = { "lt", "tr", "az" };

Might as well do

  static const char languagesWC[][3]

to guarantee these strings are stored compactly/adjacently in memory.  Also, if these are all only used in one function, put the array *in* that function.

@@ +3657,5 @@
> +            if (equal(locale, language, 2))
> +                return language;
> +        }
> +    }
> +    return ""; // ICU root locale

Blank line before this.

::: js/src/builtin/Intl.js
@@ +510,5 @@
> +
> +        // Replace deprecated subtags with their preferred values.
> +        locale = callFunction(std_Object_hasOwnProperty, langSubtagMappings, locale)
> +                 ? langSubtagMappings[locale]
> +                 : locale;

Please add an assertion to make_intl_data.py somewhere that |langTagMappings| doesn't contain any 2*3ALPHA mappings.  Then add a backstop assertion here that there's no such mapping.

::: js/src/builtin/String.js
@@ +627,5 @@
> +        // Step 3.
> +        var requestedLocales = CanonicalizeLocaleList(locales);
> +
> +        // Steps 4-6.
> +        requestedLocale = requestedLocales.length > 0 ? requestedLocales[0] : undefined;

This use-just-the-first-element behavior is to spec, but an interesting deviation from how most of the Intl stuff works.  Odd.  (Ish?)

::: js/src/tests/Intl/String/toLocaleUpperCase.js
@@ +31,5 @@
> +assertEq("a".toLocaleUpperCase(undefined), "A");
> +assertEq("a".toLocaleUpperCase([]), "A");
> +assertEq("a".toLocaleUpperCase({}), "A");
> +assertEq("a".toLocaleUpperCase({length: 0}), "A");
> +assertEq("a".toLocaleUpperCase({length: -1}), "A");

Seems like some verification with the "x-x" language tag would be desirable.  (And also in the lowercasing file.)

::: js/src/vm/SelfHosting.cpp
@@ +2354,5 @@
>  #else
>      JS_FN("std_String_normalize",                str_normalize,                0,0),
>  #endif
>      JS_FN("std_String_concat",                   str_concat,                   1,0),
>      

Mind removing the trailing WS while you're here?
Attachment #8840179 - Flags: review?(jwalden+bmo) → review+
(Assignee)

Comment 25

2 years ago
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #22)
> ::: js/src/jsstr.cpp
> @@ +617,5 @@
> > +            }
> > +        }
> > +        // Ignore any characters with the property Case_Ignorable.
> > +        // NB: We need to skip over all Case_Ignorable characters, even when
> > +        // they also have the Cased binary property.
> 
> There's a tests of this behavior somewhere, right?

Yes, this is covered in test262 (U+0345 COMBINING GREEK YPOGEGRAMMENI has the properties Cased and Case_Ignorable):

https://github.com/tc39/test262/blob/master/test/built-ins/String/prototype/toLowerCase/special_casing_conditional.js

(This is the test I mentioned on IRC when I said I first messed up the Final_Sigma check. :-)

> @@ +712,5 @@
> > +
> > +            // Special case for U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
> > +            if (c == 0x130) {
> > +                // Return if the output buffer is too small.
> > +                if (srcLength == destLength)
> 
> Prefer >= as safer if something goes awry.  Also in similar other places.

I think using >= could be confusing given that the function starts with the assertion |MOZ_ASSERT(srcLength <= destLength)|. Maybe I should have added a comment to ToLowerCaseImpl explaining the two valid ways this function can be called:
---
If srcLength == destLength is true, the output buffer was optimistically allocated with the same size as the input buffer. When we append characters which have special casing mappings, we test `srcLength == destLength` to decide if we need to back out and reallocate a sufficiently large output buffer. Otherwise the output buffer was allocated with the correct size to hold all lower case mapped characters, i.e. destLength == ToLowerCaseLength(chars, 0, len(chars)) is true.
---


> 
> @@ +743,5 @@
> > +template <typename CharT>
> > +static size_t
> > +ToLowerCaseLength(const CharT* chars, size_t startIndex, size_t length)
> > +{
> > +    size_t lowerLength = length;
> 
> Maybe add
> 
>     // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE is the only codepoint
> that
>     // lowercases to more than one codepoint.
>     if (mozilla::IsSame<Latin1Char, CharT>::value)
>         return lowerLength;
> 
> so we don't have to rely on the compiler recognizing the loop never adds 1
> in the Latin1Char case.

This function should never be called with Latin1 characters, because the first call to ToLowerCaseImpl will always succeed. (This is one of the reasons I wished if-constexpr was already a thing, because then I could guard the statements which are only necessary for Latin1 resp. don't apply for Latin1 characters.)


> @@ +773,5 @@
> > +        // We don't need extra special casing checks in the loop below, because
> > +        // U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) and U+03A3 (GREEK
> > +        // CAPITAL LETTER SIGMA) already have simple lower case mappings.
> > +        MOZ_ASSERT(unicode::CanLowerCase(0x130), "U+0130 has a simple lower case mapping");
> > +        MOZ_ASSERT(unicode::CanLowerCase(0x3A3), "U+03A3 has a simple lower case mapping");
> 
> It'd be nice if CanLowerCase were constexpr so we could static_assert this,
> but I imagine that's not readably writable now, without C++14's generalized
> constexpr support.

Agreed.


> 
> @@ +990,5 @@
> > +}
> > +
> > +static bool
> > +ToUpperCaseImpl(Latin1Char* destChars, const char16_t* srcChars, size_t startIndex,
> > +                size_t srcLength, size_t destLength)
> 
> First, is this function necessary so that the template function above is
> never instantiated such that it would fail the src/dest-types static_assert?
> Possibly could use a comment to that effect.
> 
> Second, is this overload actually called in any code?  If it's not called,
> using = delete instead seems better.

Yes, this definition was only necessary for the static_assert.


> 
> @@ +1088,5 @@
> >          }
> >  
> >          if (resultIsLatin1) {
> > +            resultLength = length;
> > +            Latin1CharPtr buf = cx->make_pod_array<Latin1Char>(resultLength + 1);
> 
> If we make_pod_array here but find something inflated, we make_pod_array
> *again* and copy into it.  But when the new allocation is only slightly
> bigger, which it almost always will be, we could just realloc and possibly
> avoid a fresh allocation and copy.  That seems much better to me.  So let's
> do something along those lines.  Maybe
> 
> tempalate<typename CharT>
> static bool
> ReallocChars(JSContext* cx, typename JS::UniquePtr<CharT[], JS::FreePolicy>*
> chars,
>              size_t oldLength, size_t newLength)
> {
>     CharT* oldChars = chars->release();
>     CharT* newChars = cx->maybe_pod_realloc(oldChars, oldLength, newLength);
>     if (!newChars)
>         return false;
> 
>     chars->reset(newChars);
>     return true;
> }
> 
> or so.

Okay, I can try to do this, but I'll split it into a separate patch, so it's easier for me to test and if I made any mistakes, we only need to back out the realloc part. :-)
And |pod_realloc| instead of |maybe_pod_realloc|, right?
(Assignee)

Comment 26

2 years ago
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #22)
> @@ +349,5 @@
> > + * The CanUpperCase() method defined above only supports simple case mappings.
> > + * In order to support the full case mappings of all Unicode characters,
> > + * callers need to check this method in addition to CanUpperCase().
> > + *
> > + * NOTE: All special upper case mappings are unconditional in Unicode 9.
> 
> Maybe "permanently unconditional as of Unicode 5", which I think is right
> but I could be mistaken about if I misunderstand other readings I've done.

I didn't find any note in Unicode 9 that the mappings are permanently unconditional. 5.18 Case Mappings, "Complications for Case Mapping" section, only has this note:
---
Because only a few context-sensitive case mappings exist, and because they involve only a
very few characters, implementations may choose to hard-code the treatment of these
characters for casing operations rather than using data-driven code based on the Unicode
Character Database. However, if this approach is taken, each time the implementation is
upgraded to a new version of the Unicode Standard, hard-coded casing operations should
be checked for consistency with the updated data. See SpecialCasing.txt in the Unicode
Character Database for details of context-sensitive case mappings.
---
(Assignee)

Comment 27

2 years ago
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #22)
> Looks broadly fine, but this is too many comments piling up for me to r+,
> even tho I really want to.

The updated patch should cover all review comments, except for this one:


> @@ +712,5 @@
> > +
> > +            // Special case for U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
> > +            if (c == 0x130) {
> > +                // Return if the output buffer is too small.
> > +                if (srcLength == destLength)
> 
> Prefer >= as safer if something goes awry.  Also in similar other places.

I didn't change it to >=, but instead added a comment to explain the two possible values for |destLength|.

> @@ +743,5 @@
> > +template <typename CharT>
> > +static size_t
> > +ToLowerCaseLength(const CharT* chars, size_t startIndex, size_t length)
> > +{
> > +    size_t lowerLength = length;
> 
> Maybe add
> 
>     // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE is the only codepoint
> that
>     // lowercases to more than one codepoint.
>     if (mozilla::IsSame<Latin1Char, CharT>::value)
>         return lowerLength;
> 
> so we don't have to rely on the compiler recognizing the loop never adds 1
> in the Latin1Char case.

Added a Latin1Char version of ToLowerCaseLength which has `MOZ_ASSERT_UNREACHABLE("Never called for Latin1 strings");`.

> 
> @@ +990,5 @@
> > +}
> > +
> > +static bool
> > +ToUpperCaseImpl(Latin1Char* destChars, const char16_t* srcChars, size_t startIndex,
> > +                size_t srcLength, size_t destLength)
> 
> First, is this function necessary so that the template function above is
> never instantiated such that it would fail the src/dest-types static_assert?
> Possibly could use a comment to that effect.

Added a comment.

> Second, is this overload actually called in any code?  If it's not called,
> using = delete instead seems better.

It's never called at runtime, but the compiler is unable to see this at compile-time.

> 
> @@ +1088,5 @@
> >          }
> >  
> >          if (resultIsLatin1) {
> > +            resultLength = length;
> > +            Latin1CharPtr buf = cx->make_pod_array<Latin1Char>(resultLength + 1);
> 
> If we make_pod_array here but find something inflated, we make_pod_array
> *again* and copy into it.  But when the new allocation is only slightly
> bigger, which it almost always will be, we could just realloc and possibly
> avoid a fresh allocation and copy.  That seems much better to me.  So let's
> do something along those lines.  Maybe
> 

Addressed in part 4.2.

> 
> @@ +1112,1 @@
> >              newChars.construct<Latin1CharPtr>(Move(buf));
> 
> This whole block is exactly duplicated in the not-Latin-1 block.  In a
> followup, could you make this one template function that's separately called
> in the two blocks?
> 

Addressed in part 4.3.
Attachment #8840178 - Attachment is obsolete: true
Attachment #8846697 - Flags: review?(jwalden+bmo)
Ah, right.  I was thinking of the Case Folding Stability Policy, but that doesn't have the meaning I thought it had.

http://www.unicode.org/policies/stability_policy.html#Case_Folding
(Assignee)

Comment 29

2 years ago
Changed malloc to realloc per comment #22.

I didn't like the pointer of UniquePtr approach, so I used normal UniquePtrs and Moves.
Attachment #8846698 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 30

2 years ago
From comment #22:

Moves the duplicated code out of ToUpperCase into a new helper method.
Attachment #8846699 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 31

2 years ago
Updated part 5 per review comments, carrying r+ from Waldo.

And additionally changed the JSAutoByteString to a normal JSLinearString*, so we don't need to copy the language tag string needlessly.
Attachment #8840179 - Attachment is obsolete: true
Attachment #8846701 - Flags: review+
(Assignee)

Updated

2 years ago
Attachment #8846697 - Attachment filename: bug1318403-part4-string-special-casing.patch → bug1318403-part4.1-string-special-casing.patch
(Assignee)

Updated

2 years ago
Attachment #8846697 - Attachment description: bug1318403-part4-string-special-casing.patch → bug1318403-part4.1-string-special-casing.patch
Attachment #8846697 - Attachment filename: bug1318403-part4.1-string-special-casing.patch → bug1318403-part4-string-special-casing.patch
(Assignee)

Comment 32

2 years ago
Updated part 6 to apply cleanly on inbound. Carrying r+.
Attachment #8840180 - Attachment is obsolete: true
Attachment #8846703 - Flags: review+
(Assignee)

Comment 33

2 years ago
Updates make_unicode.py to print the character names in all generated files.

I've used the existing |test_table|, renamed it to |codepoint_table|, and changed it to also save the details of non-bmp characters. 


Drive-by changes:
Moved some lines in process_unicode_data(...) which are only needed for bmp code-points. And removed the for_each_non_bmp_group function, because int_ranges provides the same functionality.
Attachment #8846706 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 34

2 years ago
Regenerated all Unicode files after part 7.
Attachment #8846708 - Flags: review?(jwalden+bmo)
Comment on attachment 8846698 [details] [diff] [review]
bug1318403-part4.2-use-realloc.patch

Review of attachment 8846698 [details] [diff] [review]:
-----------------------------------------------------------------

I preferred rewriting the UniquePtr in place, but eh.
Attachment #8846698 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 8846697 [details] [diff] [review]
bug1318403-part4.1-string-special-casing.patch

Review of attachment 8846697 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsstr.cpp
@@ +710,5 @@
> +                    continue;
> +                }
> +            }
> +
> +            // Special case for U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.

The named constant means this is unnecessary.  But really, it makes more sense to combine it with the "is lwoercased to" comment below, so:

"""
Special case: U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE lowers to <U+0069 U+0307>.
"""

or somesuch.

@@ +722,5 @@
> +                destChars[j++] = CharT(unicode::COMBINING_DOT_ABOVE);
> +                continue;
> +            }
> +
> +            // Special case for U+03A3 GREEK CAPITAL LETTER SIGMA.

"""
Greek uppercase sigma lowercases to one of two codepoints depending on context.
"""

@@ +757,5 @@
> +
> +static size_t
> +ToLowerCaseLength(const Latin1Char* chars, size_t startIndex, size_t length)
> +{
> +    MOZ_ASSERT_UNREACHABLE("Never called for Latin1 strings");

Preference for lowercase at the start of assertion messages.

@@ +766,5 @@
>  static JSString*
>  ToLowerCase(JSContext* cx, JSLinearString* str)
>  {
> +    // Unlike toUpperCase, toLowerCase has the nice invariant that if the
> +    // input is a Latin1 string, the output is also a Latin1 string.

Latin-1, and elsewhere.
Attachment #8846697 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 8846699 [details] [diff] [review]
bug1318403-part4.3-toupper-duplicate-code.patch

Review of attachment 8846699 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsstr.cpp
@@ +1070,5 @@
> +
> +    *resultLength = length;
> +    DestCharPtr buf = cx->make_pod_array<DestChar>(length + 1);
> +    if (!buf)
> +        return DestCharPtr();

Could return buf.
Attachment #8846699 - Flags: review?(jwalden+bmo) → review+
Attachment #8846706 - Flags: review?(jwalden+bmo) → review+
Attachment #8846708 - Flags: review?(jwalden+bmo) → review+
(Assignee)

Comment 38

2 years ago
Update part 4.1 per review comments, carrying r+ from Waldo.
Attachment #8846697 - Attachment is obsolete: true
Attachment #8855266 - Flags: review+
(Assignee)

Comment 39

2 years ago
Update part 4.2 to apply on changes from 4.1, carrying r+.
Attachment #8846698 - Attachment is obsolete: true
Attachment #8855267 - Flags: review+
(Assignee)

Comment 40

2 years ago
Update part 4.3 per review comments, carrying r+.
Attachment #8846699 - Attachment is obsolete: true
Attachment #8855268 - Flags: review+
(Assignee)

Comment 41

2 years ago
Update part 5 to apply on changes from part 4, carrying r+.
Attachment #8846701 - Attachment is obsolete: true
Attachment #8855269 - Flags: review+
(Assignee)

Comment 42

2 years ago
Update part 6 to apply cleanly on inbound, carrying r+.
Attachment #8846703 - Attachment is obsolete: true
Attachment #8855270 - Flags: review+

Comment 44

2 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9a91cb4b12e5
Part 1.1: Reduce code duplication in make_unicode. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/3a6ac8fe8434
Part 1.2: Recreate files created through make_unicode. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/e32eeab1e231
Part 2: Update make_unicode.py to process SpecialCasing.txt. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/0a8e8e545554
Part 3: Recreate files created through make_unicode to include special casing info. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/c34970cf36e9
Part 4.1: Handle special casing characters in String.prototype.to(Lower|Upper)Case. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/23da062eab54
Part 4.2: Use realloc instead of malloc when resizing a newly created string buffer. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/26c080929f59
Part 4.3: Move duplicate code in ToUpperCase to new helper method. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/a9b7a65ca73a
Part 5: Reimplement String.prototype.toLocale{Lower,Upper}Case per ECMAScript Intl specification. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/5b06ed0f1b18
Part 6: Update test262 exclusion list. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/3043922c81c7
Part 7: Add code point names in generated Unicode sources. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/c008acb08872
Part 8: Generate Unicode resources to include code point names. r=Waldo
Keywords: checkin-needed
Backed out fro rooting hazard:

https://hg.mozilla.org/integration/mozilla-inbound/rev/d06a959dded788fbb879d9c602cd77b44fb97168
https://hg.mozilla.org/integration/mozilla-inbound/rev/7eb3020455859030f0bec366508b58107e763785
https://hg.mozilla.org/integration/mozilla-inbound/rev/0f1e08d8a2268a164ad07a9311fad6780bc3116a
https://hg.mozilla.org/integration/mozilla-inbound/rev/c88cb0e4cbbd89e557e47e12208994165532a461
https://hg.mozilla.org/integration/mozilla-inbound/rev/687026812c877df7f7825d455ecd339a68e10933
https://hg.mozilla.org/integration/mozilla-inbound/rev/3c030549f4fe3e2b85de743367bf4af5321132f5
https://hg.mozilla.org/integration/mozilla-inbound/rev/c1aa9fb53a0158f3ea04b76669b7a61f12c00fd5
https://hg.mozilla.org/integration/mozilla-inbound/rev/eaa1aa7657ba63768a9f7ce3fb46eaa392683ac3
https://hg.mozilla.org/integration/mozilla-inbound/rev/74b9bdf6c5cfc4c713395274e9d0e1bd67d01e1a
https://hg.mozilla.org/integration/mozilla-inbound/rev/7bdd0b240bc441d97b7c40e75dbed2d567ef39e2
https://hg.mozilla.org/integration/mozilla-inbound/rev/df999dd8dbc70c0176b8e465d77dbdf04da7ec96

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=c008acb088721fab8f74c9f7c42fe5e28e3ab825&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=89184345&repo=mozilla-inbound

[task 2017-04-06T13:05:01.717367Z] Found 1 hazards and 92 unsafe references
[task 2017-04-06T13:05:01.721112Z] + check_hazards /home/worker/workspace/analysis
[task 2017-04-06T13:05:01.721307Z] + set +e
[task 2017-04-06T13:05:01.721623Z] ++ grep -c 'Function.*has unrooted.*live across GC call' /home/worker/workspace/analysis/rootingHazards.txt
[task 2017-04-06T13:05:01.722818Z] + NUM_HAZARDS=1
[task 2017-04-06T13:05:01.723067Z] ++ grep -c '^Function.*takes unsafe address of unrooted' /home/worker/workspace/analysis/refs.txt
[task 2017-04-06T13:05:01.724178Z] + NUM_UNSAFE=92
[task 2017-04-06T13:05:01.724410Z] ++ grep -c '^Function.* has unnecessary root' /home/worker/workspace/analysis/unnecessary.txt
[task 2017-04-06T13:05:01.726226Z] + NUM_UNNECESSARY=866
[task 2017-04-06T13:05:01.726248Z] + set +x
[task 2017-04-06T13:05:01.726282Z] TinderboxPrint: rooting hazards<br/>1
[task 2017-04-06T13:05:01.726306Z] TinderboxPrint: unsafe references to unrooted GC pointers<br/>92
[task 2017-04-06T13:05:01.726324Z] TinderboxPrint: unnecessary roots<br/>866

See https://queue.taskcluster.net/v1/task/NHLSaCnQQj6_mU2ciVr5Rg/runs/0/artifacts/public/build/rootingHazards.txt.gz
Flags: needinfo?(andrebargull)
(Assignee)

Comment 46

2 years ago
This looks like a false-positive by the rooting analysis. It claims ToLowerCaseImpl (in jsstr.cpp) can GC, but I don't see how this is possible. Maybe the call to ICU's u_hasBinaryProperty in Final_Sigma (which is called from ToLowerCaseImpl) confuses the rooting analysis? Adding JS::AutoSuppressGCAnalysis to Final_Sigma seems to fix the rooting hazard:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=255b27c8f6b51fbb11f38af5621624333ef960bf

@sfink:
What is the correct way to let the rooting analysis know that a certain function cannot gc?
Flags: needinfo?(andrebargull) → needinfo?(sphink)
It would be much better to look at https://queue.taskcluster.net/v1/task/NHLSaCnQQj6_mU2ciVr5Rg/runs/0/artifacts/public/build/hazards.txt.gz since it gives the GC stack. In this case, Final_Sigma calls u_hasBinaryProperty_58 calls a function pointer named 'contains' in a BinaryProperty type, and it assumes that can do anything or everything, including GC.

Your patch looks good to me. It's suppressing the checks in a function that doesn't touch anything related to GC, other than taking in a pointer that could theoretically point into a JSString. But (1) it looks like you only pass in a malloced buffer, and (2) this really *can't* GC. Except (1) doesn't matter, because it turns out GC *would* be fatal here, which is why there is an AutoCheckCannotGC that the hazard analysis tripped over.

Anyway, the choices would be (1) use AutoSuppressGCAnalysis exactly as you did; (2) add u_hasBinaryProperty_58 to annotations.js to whitelist it, but that'd be gross because of the macro-added _58 prefix; or (3) whitelist BinaryProperty.contains in annotations.js. The choice of (1) vs (3) doesn't matter much unless we start hitting more hazards, and we'd want something that gets rid of all of them at once.

The only thing I'd change is the comment.

    // u_hasBinaryProperty cannot gc, but the analysis doesn't know about it.

The analysis *does* know about it, with its macro-mangled name even. It just sees that it calls through a function pointer that it has no way of knowing what is. So the comment should refer to BinaryProperty.contains, called by u_hasBinaryProperty.
Flags: needinfo?(sphink)
(r=me with the comment fix)
(Assignee)

Comment 49

2 years ago
(In reply to Steve Fink [:sfink] [:s:] from comment #47)
> It would be much better to look at
> https://queue.taskcluster.net/v1/task/NHLSaCnQQj6_mU2ciVr5Rg/runs/0/
> artifacts/public/build/hazards.txt.gz since it gives the GC stack. 

Thanks for the tip about "hazards.txt.gz"!

> Anyway, the choices would be (1) use AutoSuppressGCAnalysis exactly as you
> did; (2) add u_hasBinaryProperty_58 to annotations.js to whitelist it, but
> that'd be gross because of the macro-added _58 prefix; or (3) whitelist
> BinaryProperty.contains in annotations.js. The choice of (1) vs (3) doesn't
> matter much unless we start hitting more hazards, and we'd want something
> that gets rid of all of them at once.

Hmm, maybe it's better (or rather more future-proof) to use (3), given that we eventually want to support https://github.com/tc39/proposal-regexp-unicode-property-escapes, which will also need to call u_hasBinaryProperty?
(Assignee)

Comment 50

2 years ago
(In reply to André Bargull from comment #49)
> Hmm, maybe it's better (or rather more future-proof) to use (3), given that
> we eventually want to support
> https://github.com/tc39/proposal-regexp-unicode-property-escapes, which will
> also need to call u_hasBinaryProperty?

On second thought we can always switch to (3) later, so we don't need to guess now how we will implement Unicode property escapes then. :-)
(Assignee)

Comment 51

2 years ago
Changed the comment to:
// Tell the analysis the BinaryProperty.contains function pointer called by
// u_hasBinaryProperty cannot GC.


r+ by sfink in comment #48.
Attachment #8856249 - Flags: review+

Comment 53

2 years ago
Pushed by ihsiao@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/503d276ad3bb
Part 1.1: Reduce code duplication in make_unicode. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/6890eedfe3be
Part 1.2: Recreate files created through make_unicode. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/8424a6a068da
Part 2: Update make_unicode.py to process SpecialCasing.txt. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/70abb1ca3cba
Part 3: Recreate files created through make_unicode to include special casing info. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/d539219ef53a
Part 4.1: Handle special casing characters in String.prototype.to(Lower|Upper)Case. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/3ebf41460510
Part 4.2: Use realloc instead of malloc when resizing a newly created string buffer. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/41df7f41e185
Part 4.3: Move duplicate code in ToUpperCase to new helper method. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/0e77e970cac1
Part 5: Reimplement String.prototype.toLocale{Lower,Upper}Case per ECMAScript Intl specification. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/85b3106777fa
Part 6: Update test262 exclusion list. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/946ad1048730
Part 7: Add code point names in generated Unicode sources. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/d0c6a4e68b34
Part 8: Generate Unicode resources to include code point names. r=Waldo
https://hg.mozilla.org/integration/mozilla-inbound/rev/85adabbcb617
Part 9: Suppress rooting analysis when calling u_hasBinaryProperty. r=sfink
Keywords: checkin-needed
Duplicate of this bug: 536672
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1378782
You need to log in before you can comment on or make changes to this bug.