Closed Bug 1017107 Opened 10 years ago Closed 10 years ago

StringBuffer should create Latin1 strings if possible

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla33

People

(Reporter: jandem, Assigned: jandem)

References

Details

Attachments

(1 file, 2 obsolete files)

I may fix some other issues first, but filing this because it's definitely needed at some point.
One idea I had is to give StringBuffer both a jschar (TwoByte) and char (Latin1) buffer (and use mozilla::Maybe to ensure only one of them can be used without asserting). We start using the Latin1 one, and when we call a non-Latin1 method, we switch from Latin1 to TwoByte by copying everything to the TwoByte buffer.

We'd also have to rename appendInflated etc.

Luke, I was just curious if you had better ideas, before I spend a lot of time on it :)
Flags: needinfo?(luke)
Yeah, that makes sense.

I bet a bunch of places will call two-byte operations when they could call latin1 operations, so it'd probably be worth an audit to push everything toward latin1.  Also, I was thinking, although you could just add 'char' overloads for all the append operations, it would be better to give them all explicit names like appendLatin1/appendTwoByte.  That way, when someone isn't thinking and just appends a jschar, they don't add this big silent performance fault.

One question is whether all the appendTwoByte operations should attempt, if the underlying buffer is latin1, to test whether the given jschars could actually be deflated to latin1.  It kinda makes sense since we are iterating over the chars anyhow.  It really depends on how StringBuffer is used.
Flags: needinfo?(luke)
Attached patch Part 1 - append methods (obsolete) — Splinter Review
Initially I also renamed the append methods that take "char" and "const char *" to appendLatin1, but the patch was much bigger (~72K), because almost all append calls are for const char*.

This patch uses appendTwoByte for jschar, appendLatin1 for Latin1Char and append for "char".
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Attachment #8435083 - Flags: review?(luke)
It seems inconsistent to have separate 'char' and 'latin1'.  There would also be the hazard of silent slicing if you have a 'jschar c' and call 'sb.append(c)'.

If it is too gross/noisy to convert all 'char' callers to latin1, maybe we should just rely on overloading and have 'append(jschar c)' and 'append(Latin1 c)'.  Silent slicing shouldn't happen then and, if we dynamically coerce jschars that fit in latin1 into latin1, then the silent inflation shouldn't be a problem either.

Another question: is there a part 2 that will try to use a latin1 buffer until the first two-byte-char?
(In reply to Luke Wagner [:luke] from comment #4)
> If it is too gross/noisy to convert all 'char' callers to latin1, maybe we
> should just rely on overloading and have 'append(jschar c)' and
> 'append(Latin1 c)'.  Silent slicing shouldn't happen then and, if we
> dynamically coerce jschars that fit in latin1 into latin1, then the silent
> inflation shouldn't be a problem either.

Yeah, the more I think about it the more I like this. The caller can just append() what it has and let the StringBuffer take care of choosing the best encoding. If the automatic jschar -> latin1 conversion hurts perf somewhere, we can always add appendDontDeflate for perf-sensitive callers.

> Another question: is there a part 2 that will try to use a latin1 buffer
> until the first two-byte-char?

Not yet :) I want to get part 1 in first, so that I don't have to keep rebasing part 2.
(In reply to Jan de Mooij [:jandem] from comment #5)
> Yeah, the more I think about it the more I like this. The caller can just
> append() what it has and let the StringBuffer take care of choosing the best
> encoding. If the automatic jschar -> latin1 conversion hurts perf somewhere,
> we can always add appendDontDeflate for perf-sensitive callers.

Indeed, good point.  Also, I assume StringBuffer::append(JSString*) wouldn't auto deflate, inductively trusting that a two-byte JSString really needed to be.  Similarly, it might be good to have an appendSubstring(JSString *base, uint32_t off, uint32_t len) that also didn't attempt to deflate.  Given those, it seems like it would be rare to append() a random jschar array.

ArrayJoin is pretty hot and uses StringBuffer, so it might be worth auditing that whole path.  The appending of the separator passes a jschar; it might be worth hoisting out the branch on whether the separator is Latin1/TwoByte (esp. in the most common case of ',').
Attached patch Patch (obsolete) — Splinter Review
Rewrites StringBuffer as discussed. There's a bunch of callers that need more work (sb.twoByteBegin() etc will assert if Latin1), but I'll fix these at some point.

I noticed another advantage of append instead of appendLatin1/appendTwoByte: it works better with templating.
Attachment #8435083 - Attachment is obsolete: true
Attachment #8435083 - Flags: review?(luke)
Attachment #8437114 - Flags: review?(luke)
Comment on attachment 8437114 [details] [diff] [review]
Patch

Review of attachment 8437114 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/builtin/RegExp.cpp
@@ +175,5 @@
>          if (*it == '/' && (it == oldChars || it[-1] != '\\')) {
>              /* There's a forward slash that needs escaping. */
>              if (sb.empty()) {
>                  /* This is the first one we've seen, copy everything up to this point. */
> +                if (unescaped->hasTwoByteChars() && !sb.ensureTwoByte())

This seems a bit inefficient: we'll likely always have to do an inflate at the end.  Perhaps we really need two string buffer types: StringBuffer (which tries to abstract over latin1 vs. two-byte) and Latin1StringBuffer/TwoByteStringBuffer (which are instantiations of some StringBufferT<CharT> template).  Ideally, StringBuffer wouldn't have this externally-observable "am I latin1 or two byte" state; you'd stick stuff in of various types and pull out JSStrings.

::: js/src/vm/StringBuffer.h
@@ +42,5 @@
> +     * a TwoByte buffer, inflate() copies the chars in latin1Cb to twoByteCb
> +     * and destroys latin1Cb.
> +     */
> +    mozilla::Maybe<Latin1CharBuffer> latin1Cb;
> +    mozilla::Maybe<TwoByteCharBuffer> twoByteCb;

I know it's on the stack, but this kinda uses a lot of stack space.

Ideally, we'd have some utility mozilla::MaybeOneOf:
  mozilla::MaybeOneOf<Latin1CharBuffer, TwoByteCharBuffer> cb;
  cb.construct<Latin1CharBuffer>(...);
  cb.ref<Latin1CharBuffer>().append(...);
  cb.ref<TwoByteCharBuffer>().append(...); // assertion failure
  cb.destroy();
  cb.construct<Latin1CharBuffer>(...);
  cb.ref<TwoByteCharBuffer().append(...);  // ok

This seems generally useful... are you game for adding it?  It seems like you could mostly just copy/paste Maybe.h into mfbt/MaybeOneOf.h, change the 'bool constructed' into a tri-state enum, templatize all the members, etc.  The one bit of trickiness that needs some template-fu is mapping the static type to the enum value you can assert on.  Here is how you do that: https://pastebin.mozilla.org/5385747.
Attached patch PatchSplinter Review
Thanks for the suggestions, here's an updated patch as discussed,
Attachment #8437114 - Attachment is obsolete: true
Attachment #8437114 - Flags: review?(luke)
Attachment #8438016 - Flags: review?(luke)
Blocks: 1023778
Blocks: 1024518
Comment on attachment 8438016 [details] [diff] [review]
Patch

Review of attachment 8438016 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/NameFunctions.cpp
@@ +234,5 @@
>                  /*
>                   * Don't have consecutive '<' characters, and also don't start
>                   * with a '<' character.
>                   */
> +                if (!buf.empty() && buf.getChar(buf.length() - 1) != '<' && !buf.append("<"))

Can you turn this append("<") into append('<')?  (And below)

::: js/src/jsstr.cpp
@@ +699,5 @@
>          AutoCheckCannotGC nogc;
>          const CharT *chars = str->chars<CharT>(nogc);
>          for (size_t i = 0; i < length; i++) {
>              jschar c = unicode::ToLowerCase(chars[i]);
> +            MOZ_ASSERT_IF((IsSame<CharT, Latin1Char>::value), c <= 0xff);

Oh that's how.

::: js/src/vm/String-inl.h
@@ +45,5 @@
>  NewFatInlineString(ThreadSafeContext *cx, JS::Latin1Chars chars)
>  {
>      size_t len = chars.length();
>  
> +    if (EnableLatin1Strings) {

Since EnableLatin1Strings determines whether a latin1 buffer is used in StringBuffer, it seems like we don't need this here.

::: js/src/vm/StringBuffer.cpp
@@ +80,5 @@
> +
> +    JSFlatString *str = js_NewString<CanGC>(cx, buf.get(), len);
> +    if (str)
> +        buf.forget();
> +    return str;

Even for trivial cases like this, I like to main the general style of "early return is for failure", which would mean:
  if (!str)
      return nullptr;
  
  buf.forget();
  return str;

@@ +107,3 @@
>  
> +    return isLatin1()
> +        ? FinishStringFlat<Latin1Char>(cx, *this)

Can you align '?' and ':' with 'i' in 'isLatin' ?

@@ +118,4 @@
>          return cx->names().empty;
>  
> +    JSAtom *atom = AtomizeChars(cx, twoByteChars().begin(), len);
> +    twoByteChars().clear();

I think you need an isLatin1() branch here.

Second, pre-existing but: it'd nice to extractWellSized() instead of making a copy here.  Currently all the APIs in jsatom.h make copies, but there is an AtomizeAndtake (that t should be capitalized...) that we could extern.

::: js/src/vm/StringBuffer.h
@@ +97,5 @@
> +    }
> +    inline bool append(char c) { return append(Latin1Char(c)); }
> +
> +    inline bool append(const jschar *begin, const jschar *end);
> +    inline bool append(const jschar *chars, size_t len) { return append(chars, chars + len); }

Aesthetically, for classes with may members like this, I think it looks nicer not to mix the single-line and multi-line styles.  I generally segregate them (having a block of single-line functions and blocks of multi-line) or, if that grouping doesn't make sense, turn the single-liners into multi-liners.

@@ +147,5 @@
> +    /*
> +     * Because inflation is fallible, these methods should only be used if
> +     * the buffer is known to be TwoByte.
> +     */
> +    void infallibleAppendNoInflate(const jschar *chars, size_t len) {

"no inflate" is a bit confusing; I see it's referring to the internal buffer, but, read from the callsite, it looks like it's talking about the chars, which is confusing since they are already jschars.  I couldn't think of any other not-super-long names.  How about we just have a DebugOnly<bool> hasEnsuredTwoByteChars_ that is only set to true by ensureTwoByte() and infalliableAppend(jschar*) asserts this is true.  This should catch any errors relatively early.

@@ +193,3 @@
>       */
> +    template <typename CharT>
> +    CharT *extractWellSized();

To make this API a bit safer and simpler, perhaps we could untemplateize this and have it always return jschar*, and name it stealChars().  (It'll inflate to two-byte if latin1.)  For the impl of finishString, we should be able to just call ExtractWellSized directly.

::: mfbt/MaybeOneOf.h
@@ +25,5 @@
> + * arguments and that object will be destroyed when the owning MaybeOneOf is
> + * destroyed.
> + */
> +template<class T1, class T2>
> +class MaybeOneOf

This looks great.  For good form, you should probably file a bug in the mfbt component.  I'll still review, but that way everyone else can have a heads up.
Attachment #8438016 - Flags: review?(luke) → review+
Depends on: 1024688
https://hg.mozilla.org/integration/mozilla-inbound/rev/dd46f3baee85

(In reply to Luke Wagner [:luke] from comment #10)
> Can you turn this append("<") into append('<')?  (And below)

Done.

> Since EnableLatin1Strings determines whether a latin1 buffer is used in
> StringBuffer, it seems like we don't need this here.

We need it because there are other places where we call NewFatInlineString(..., Latin1Chars).

> I think you need an isLatin1() branch here.

True, but AtomizeChars doesn't accept Latin1 strings yet (it's in one of the JSON patches; will fix the code here when that lands).

> Second, pre-existing but: it'd nice to extractWellSized() instead of making
> a copy here.  Currently all the APIs in jsatom.h make copies, but there is
> an AtomizeAndtake (that t should be capitalized...) that we could extern.

Yeah, but as this patch is pretty big I'd prefer doing this separately. I'll also rewrite the (only) caller of AtomizeAndtake to just call the Latin1 version of AtomizeChars instead of inflating + AtomizeAndtake, with that done, we could remove AtomizeAndtake I think and I'm not sure if keeping it for StringBuffer is still worth it.

> Aesthetically, for classes with may members like this, I think it looks
> nicer not to mix the single-line and multi-line styles.  I generally
> segregate them (having a block of single-line functions and blocks of
> multi-line) or, if that grouping doesn't make sense, turn the single-liners
> into multi-liners.

Done.

> "no inflate" is a bit confusing; I see it's referring to the internal
> buffer, but, read from the callsite, it looks like it's talking about the
> chars, which is confusing since they are already jschars.  I couldn't think
> of any other not-super-long names.  How about we just have a DebugOnly<bool>
> hasEnsuredTwoByteChars_ that is only set to true by ensureTwoByte() and
> infalliableAppend(jschar*) asserts this is true.  This should catch any
> errors relatively early.

Done. It's also nice for templated callers to have infallibleAppend accept both Latin1Char* and jschar*.

> To make this API a bit safer and simpler, perhaps we could untemplateize
> this and have it always return jschar*, and name it stealChars().  (It'll
> inflate to two-byte if latin1.)  For the impl of finishString, we should be
> able to just call ExtractWellSized directly.

Done.
Forgot to mention, I also renamed StringBuffer::inflate() to StringBuffer::inflateChars(), becaues zlib does:

#define inflate MOZ_Z_inflate

And somehow that got #included in StringBuffer.cpp :(
https://hg.mozilla.org/mozilla-central/rev/dd46f3baee85
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: