Closed Bug 1271014 Opened 9 years ago Closed 8 years ago

make DeflateStringToUTF8Buffer return partial result if dest buffer runs out of space

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: myk, Assigned: myk)

Details

(Whiteboard: [v8api])

Attachments

(2 files, 3 obsolete files)

SpiderShim currently has a custom implementation of DeflateStringToUTF8Buffer [1], since neither of the existing ones (the public one in CharacterEncoding and a private one in CTypes) does quite what it wants, which is to return a partial result if the destination buffer runs out of space (CharacterEncoding requires the buffer to have enough space [2]) and continue on a bad surrogate (CTypes aborts on a bad surrogate [3]).

In order to support SpiderShim, the public DeflateStringToUTF8Buffer in CharacterEncoding should return a partial result if the destination buffer runs out of space.

[1] https://github.com/mozilla/spidernode/blob/master/deps/spidershim/src/v8string.cc#L455-L538
[2] https://dxr.mozilla.org/mozilla-central/rev/e5a10b/js/public/CharacterEncoding.h#208-213
[3] https://dxr.mozilla.org/mozilla-central/rev/e5a10b/js/src/ctypes/CTypes.cpp#172-177
I'll refactor the SpiderShim code for this.
Assignee: nobody → myk
This patch refactors the SpiderShim implementation to be a compatible replacement for the existing function in CharacterEncoding.

It adds two optional parameters, *dstlenp and *numcharsp, which enable the function to write a partial result (up to *dstlenp bytes) and return the number of bytes and chars written (by updating *dstlenp and *numcharsp).

(The SpiderShim implementation includes a third parameter to configure the encoder to write an invalid surrogate to the buffer instead of replacing it with the UTF-8 replacement character, but Node doesn't use that feature of the V8 API, so I won't add it here.)

This passes the check-spidermonkey tests, although I don't think those exercise this API, at least not directly.
Attachment #8750502 - Flags: review?(terrence)
Status: NEW → ASSIGNED
Comment on attachment 8750502 [details] [diff] [review]
return partial result and num bytes/chars written

It looks like Terrence is on PTO, so I'm redirecting this review request to another reviewer who has touched this area of code in the past.
Attachment #8750502 - Flags: review?(terrence) → review?(jdemooij)
I can review this patch tomorrow - I didn't take a close look but we should definitely add some jsapi-tests for all interesting (edge) cases here.
(In reply to Jan de Mooij [:jandem] from comment #4)
> I can review this patch tomorrow - I didn't take a close look but we should
> definitely add some jsapi-tests for all interesting (edge) cases here.

I can do so, although there aren't currently tests for DeflateStringToUTF8Buffer, neither for edge cases nor for basic functionality.
Attached patch test DeflateStringToUTF8Buffer (obsolete) — Splinter Review
Here's a jsapi-test for both the existing and the new functionality, including basic tests and tests of various edge cases.
Attachment #8751516 - Flags: review?(jdemooij)
Comment on attachment 8750502 [details] [diff] [review]
return partial result and num bytes/chars written

Review of attachment 8750502 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, but a suggestion below to simplify the code a bit. If you can post an updated patch with that fixed I can r+ quickly.

::: js/src/vm/CharacterEncoding.cpp
@@ +73,5 @@
>             ? ::GetDeflatedUTF8StringLength(s->latin1Chars(nogc), s->length())
>             : ::GetDeflatedUTF8StringLength(s->twoByteChars(nogc), s->length());
>  }
>  
> +const char16_t UTF8_REPLACEMENT_CHAR = u'\uFFFD';

According to the following page, VS 2013 doesn't support this syntax yet: https://developer.mozilla.org/en-US/docs/Using_CXX_in_Mozilla_code

So to be safe, I'd just use:

static const char16_t UTF8_REPLACEMENT_CHAR = 0xFFFD;

@@ +98,5 @@
>              if (srclen < 1) {
> +                v = UTF8_REPLACEMENT_CHAR;
> +            } else {
> +                char16_t c2 = *src;
> +                if ((c2 < 0xDC00) || (c2 > 0xDFFF)) {

Pre-existing, but please remove the parentheses around the < and > expressions while you're here.

(I like extra parentheses when there's potential confusion about the evaluation order, but (x < y || a > b) is common enough that it shouldn't confuse anyone :)

@@ +113,5 @@
>          if (v < 0x0080) {
>              /* no encoding necessary - performance hack */
> +            if (dstlenp && dstlen == 0) {
> +                goto finish;
> +            }

Nit: no {} (because each of condition, else-body, then-body fit on a single line).

The control-flow and the subtraction at the end seem more complicated than necessary though. What if we do the following:

- At the start of this function:

if (dstlenp)
    *dstlenp = 0;

Then at the end of the loop body:

if (dstlenp)
    *dstlenp += utf8Len;

And to bounds check here:

if (dstlenp && *dstlenp + 1 < origDstlen)
    return;

What do you think? That way we always maintain the correct state so we can just return when we're done.
Attachment #8750502 - Flags: review?(jdemooij) → feedback+
Comment on attachment 8751516 [details] [diff] [review]
test DeflateStringToUTF8Buffer

Review of attachment 8751516 [details] [diff] [review]:
-----------------------------------------------------------------

Beautiful! Thank you for adding these tests.

::: js/src/jsapi-tests/testDeflateStringToUTF8Buffer.cpp
@@ +22,5 @@
> +    // Test with an ASCII string, which calls JSFlatString::latin1Chars
> +    // to retrieve the characters from the string and generates UTF-8 output
> +    // that is identical to the ASCII input.
> +
> +    str = JS_NewStringCopyZ(cx, "Ohai"); // { 0x4F, 0x68, 0x61, 0x69 }

Nit: add a MOZ_RELEASE_ASSERT(str); here and after the JS_New* calls below.

(We don't have to handle OOM gracefully in tests but crashing immediately helps if this ever fails.)

@@ +98,5 @@
> +    // Test with a Latin-1 string, which calls JSFlatString::latin1Chars
> +    // like with the ASCII string but generates UTF-8 output that is different
> +    // from the ASCII input.
> +
> +    str = JS_NewUCStringCopyZ(cx, u"Óhãï"); // { 0xD3, 0x68, 0xE3, 0xEF }

VS 2013 comment applies here too, you can use MOZ_UTF16("foo") instead. Hopefully we'll drop support for VS 2013 soon.
Attachment #8751516 - Flags: review?(jdemooij) → review+
(In reply to Jan de Mooij [:jandem] from comment #7)
> ::: js/src/vm/CharacterEncoding.cpp
> @@ +73,5 @@
> >             ? ::GetDeflatedUTF8StringLength(s->latin1Chars(nogc), s->length())
> >             : ::GetDeflatedUTF8StringLength(s->twoByteChars(nogc), s->length());
> >  }
> >  
> > +const char16_t UTF8_REPLACEMENT_CHAR = u'\uFFFD';
> 
> According to the following page, VS 2013 doesn't support this syntax yet:
> https://developer.mozilla.org/en-US/docs/Using_CXX_in_Mozilla_code

Ah, thanks for the reference, I was unaware of that page!


> So to be safe, I'd just use:
> 
> static const char16_t UTF8_REPLACEMENT_CHAR = 0xFFFD;

Sounds good, fixed in this patch.


> @@ +98,5 @@
> >              if (srclen < 1) {
> > +                v = UTF8_REPLACEMENT_CHAR;
> > +            } else {
> > +                char16_t c2 = *src;
> > +                if ((c2 < 0xDC00) || (c2 > 0xDFFF)) {
> 
> Pre-existing, but please remove the parentheses around the < and >
> expressions while you're here.

Done!


> (I like extra parentheses when there's potential confusion about the
> evaluation order, but (x < y || a > b) is common enough that it shouldn't
> confuse anyone :)

Indeed, I noticed that too but figured I'd resist the urge to make unrelated changes.  Thanks for encouraging my inner pedant! ;-)


> @@ +113,5 @@
> >          if (v < 0x0080) {
> >              /* no encoding necessary - performance hack */
> > +            if (dstlenp && dstlen == 0) {
> > +                goto finish;
> > +            }
> 
> Nit: no {} (because each of condition, else-body, then-body fit on a single
> line).

Right, fixed.


> The control-flow and the subtraction at the end seem more complicated than
> necessary though. What if we do the following:
> 
> - At the start of this function:
> 
> if (dstlenp)
>     *dstlenp = 0;
> 
> Then at the end of the loop body:
> 
> if (dstlenp)
>     *dstlenp += utf8Len;
> 
> And to bounds check here:
> 
> if (dstlenp && *dstlenp + 1 < origDstlen)
>     return;

Here presumably you mean:

> if (dstlenp && *dstlenp + 1 > origDstlen)
>     return;

And then the check for the general case would be:

> if (dstlenp && *dstlenp + utf8Len > origDstlen)
>     return;

(We could refactor a bit further, setting utf8Len first based on the value of |v| and then consolidating these conditional early returns.  But I don't think it'd add clarity.)


> What do you think? That way we always maintain the correct state so we can
> just return when we're done.

I like it!  I adapted the original implementation from the DeflateStringToUTF8Buffer function in CTypes.cpp <https://dxr.mozilla.org/mozilla-central/rev/3461f3/js/src/ctypes/CTypes.cpp#124-187>, but I prefer yours, as it maintains the loop invariant that *dstlenp is the number of bytes written to the destination buffer.

So I've adopted your implementation in this version of the patch.  The only changes I made are the aforementioned correction to the conditionals, s/origDstlen/capacity/ (to make its purpose clearer), and consolidation of the code that initializes capacity and *dstlenp, which is now:

> size_t capacity = 0;
> if (dstlenp) {
>     capacity = *dstlenp;
>     *dstlenp = 0;
> }
Attachment #8750502 - Attachment is obsolete: true
Attachment #8751817 - Flags: review?(jdemooij)
Comment on attachment 8751817 [details] [diff] [review]
return partial result and num bytes/chars written

Review of attachment 8751817 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, looks great.
Attachment #8751817 - Flags: review?(jdemooij) → review+
Attached patch test DeflateStringToUTF8Buffer (obsolete) — Splinter Review
(In reply to Jan de Mooij [:jandem] from comment #8)

> Nit: add a MOZ_RELEASE_ASSERT(str); here and after the JS_New* calls below.
> 
> (We don't have to handle OOM gracefully in tests but crashing immediately
> helps if this ever fails.)

Indeed.  I added that to this patch.


> VS 2013 comment applies here too, you can use MOZ_UTF16("foo") instead.
> Hopefully we'll drop support for VS 2013 soon.

Right.  Fixed in this patch.
It looks like one (or more) of the tests is failing on Windows in the try run.  I'll build and debug there.
(In reply to Myk Melez [:myk] [@mykmelez] from comment #13)
> It looks like one (or more) of the tests is failing on Windows in the try
> run.  I'll build and debug there.

The problem on Windows is that the encoding of wide string literals (L"") is undefined in the C++ specification for characters outside the ASCII range, so compilers are free to choose an encoding, and MSVC is choosing a different one.

This hasn't been an issue in Mozilla code before because we've only used MOZ_UTF16 with ASCII string literals to date.

The fix (until we require MSVC 2015 and get rid of MOZ_UTF16, anyway) is to specify the characters via hexadecimal escape sequences, i.e.:

>-    str = JS_NewUCStringCopyZ(cx, MOZ_UTF16("Óhãï")); // { 0xD3, 0x68, 0xE3, 0xEF }
>+    str = JS_NewUCStringCopyZ(cx, MOZ_UTF16("\xD3\x68\xE3\xEF")); // u"Óhãï"
> …
>-    str = JS_NewUCStringCopyZ(cx, MOZ_UTF16("Όhȃї")); // { 0x038C, 0x0068, 0x0203, 0x0457 }
>+    str = JS_NewUCStringCopyZ(cx, MOZ_UTF16("\x038C\x0068\x0203\x0457")); // u"Όhȃї"

Here's an updated patch with that change.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=73185ae37418
Attachment #8751516 - Attachment is obsolete: true
Attachment #8751827 - Attachment is obsolete: true
Attachment #8752398 - Flags: review?(jdemooij)
Attachment #8752398 - Flags: review?(jdemooij) → review+
https://hg.mozilla.org/mozilla-central/rev/6a5fe7a3af40
https://hg.mozilla.org/mozilla-central/rev/ea6e8a3c5880
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: