Closed
Bug 1271014
Opened 9 years ago
Closed 8 years ago
make DeflateStringToUTF8Buffer return partial result if dest buffer runs out of space
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
mozilla49
Tracking | Status | |
---|---|---|
firefox49 | --- | fixed |
People
(Reporter: myk, Assigned: myk)
Details
(Whiteboard: [v8api])
Attachments
(2 files, 3 obsolete files)
5.57 KB,
patch
|
jandem
:
review+
|
Details | Diff | Splinter Review |
11.65 KB,
patch
|
jandem
:
review+
|
Details | Diff | Splinter Review |
SpiderShim currently has a custom implementation of DeflateStringToUTF8Buffer [1], since neither of the existing ones (the public one in CharacterEncoding and a private one in CTypes) does quite what it wants, which is to return a partial result if the destination buffer runs out of space (CharacterEncoding requires the buffer to have enough space [2]) and continue on a bad surrogate (CTypes aborts on a bad surrogate [3]). In order to support SpiderShim, the public DeflateStringToUTF8Buffer in CharacterEncoding should return a partial result if the destination buffer runs out of space. [1] https://github.com/mozilla/spidernode/blob/master/deps/spidershim/src/v8string.cc#L455-L538 [2] https://dxr.mozilla.org/mozilla-central/rev/e5a10b/js/public/CharacterEncoding.h#208-213 [3] https://dxr.mozilla.org/mozilla-central/rev/e5a10b/js/src/ctypes/CTypes.cpp#172-177
Assignee | ||
Comment 2•9 years ago
|
||
This patch refactors the SpiderShim implementation to be a compatible replacement for the existing function in CharacterEncoding. It adds two optional parameters, *dstlenp and *numcharsp, which enable the function to write a partial result (up to *dstlenp bytes) and return the number of bytes and chars written (by updating *dstlenp and *numcharsp). (The SpiderShim implementation includes a third parameter to configure the encoder to write an invalid surrogate to the buffer instead of replacing it with the UTF-8 replacement character, but Node doesn't use that feature of the V8 API, so I won't add it here.) This passes the check-spidermonkey tests, although I don't think those exercise this API, at least not directly.
Attachment #8750502 -
Flags: review?(terrence)
Assignee | ||
Updated•9 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 3•9 years ago
|
||
Comment on attachment 8750502 [details] [diff] [review] return partial result and num bytes/chars written It looks like Terrence is on PTO, so I'm redirecting this review request to another reviewer who has touched this area of code in the past.
Attachment #8750502 -
Flags: review?(terrence) → review?(jdemooij)
Comment 4•9 years ago
|
||
I can review this patch tomorrow - I didn't take a close look but we should definitely add some jsapi-tests for all interesting (edge) cases here.
Assignee | ||
Comment 5•9 years ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #4) > I can review this patch tomorrow - I didn't take a close look but we should > definitely add some jsapi-tests for all interesting (edge) cases here. I can do so, although there aren't currently tests for DeflateStringToUTF8Buffer, neither for edge cases nor for basic functionality.
Assignee | ||
Comment 6•8 years ago
|
||
Here's a jsapi-test for both the existing and the new functionality, including basic tests and tests of various edge cases.
Attachment #8751516 -
Flags: review?(jdemooij)
Comment 7•8 years ago
|
||
Comment on attachment 8750502 [details] [diff] [review] return partial result and num bytes/chars written Review of attachment 8750502 [details] [diff] [review]: ----------------------------------------------------------------- Looks good, but a suggestion below to simplify the code a bit. If you can post an updated patch with that fixed I can r+ quickly. ::: js/src/vm/CharacterEncoding.cpp @@ +73,5 @@ > ? ::GetDeflatedUTF8StringLength(s->latin1Chars(nogc), s->length()) > : ::GetDeflatedUTF8StringLength(s->twoByteChars(nogc), s->length()); > } > > +const char16_t UTF8_REPLACEMENT_CHAR = u'\uFFFD'; According to the following page, VS 2013 doesn't support this syntax yet: https://developer.mozilla.org/en-US/docs/Using_CXX_in_Mozilla_code So to be safe, I'd just use: static const char16_t UTF8_REPLACEMENT_CHAR = 0xFFFD; @@ +98,5 @@ > if (srclen < 1) { > + v = UTF8_REPLACEMENT_CHAR; > + } else { > + char16_t c2 = *src; > + if ((c2 < 0xDC00) || (c2 > 0xDFFF)) { Pre-existing, but please remove the parentheses around the < and > expressions while you're here. (I like extra parentheses when there's potential confusion about the evaluation order, but (x < y || a > b) is common enough that it shouldn't confuse anyone :) @@ +113,5 @@ > if (v < 0x0080) { > /* no encoding necessary - performance hack */ > + if (dstlenp && dstlen == 0) { > + goto finish; > + } Nit: no {} (because each of condition, else-body, then-body fit on a single line). The control-flow and the subtraction at the end seem more complicated than necessary though. What if we do the following: - At the start of this function: if (dstlenp) *dstlenp = 0; Then at the end of the loop body: if (dstlenp) *dstlenp += utf8Len; And to bounds check here: if (dstlenp && *dstlenp + 1 < origDstlen) return; What do you think? That way we always maintain the correct state so we can just return when we're done.
Attachment #8750502 -
Flags: review?(jdemooij) → feedback+
Comment 8•8 years ago
|
||
Comment on attachment 8751516 [details] [diff] [review] test DeflateStringToUTF8Buffer Review of attachment 8751516 [details] [diff] [review]: ----------------------------------------------------------------- Beautiful! Thank you for adding these tests. ::: js/src/jsapi-tests/testDeflateStringToUTF8Buffer.cpp @@ +22,5 @@ > + // Test with an ASCII string, which calls JSFlatString::latin1Chars > + // to retrieve the characters from the string and generates UTF-8 output > + // that is identical to the ASCII input. > + > + str = JS_NewStringCopyZ(cx, "Ohai"); // { 0x4F, 0x68, 0x61, 0x69 } Nit: add a MOZ_RELEASE_ASSERT(str); here and after the JS_New* calls below. (We don't have to handle OOM gracefully in tests but crashing immediately helps if this ever fails.) @@ +98,5 @@ > + // Test with a Latin-1 string, which calls JSFlatString::latin1Chars > + // like with the ASCII string but generates UTF-8 output that is different > + // from the ASCII input. > + > + str = JS_NewUCStringCopyZ(cx, u"Óhãï"); // { 0xD3, 0x68, 0xE3, 0xEF } VS 2013 comment applies here too, you can use MOZ_UTF16("foo") instead. Hopefully we'll drop support for VS 2013 soon.
Attachment #8751516 -
Flags: review?(jdemooij) → review+
Assignee | ||
Comment 9•8 years ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #7) > ::: js/src/vm/CharacterEncoding.cpp > @@ +73,5 @@ > > ? ::GetDeflatedUTF8StringLength(s->latin1Chars(nogc), s->length()) > > : ::GetDeflatedUTF8StringLength(s->twoByteChars(nogc), s->length()); > > } > > > > +const char16_t UTF8_REPLACEMENT_CHAR = u'\uFFFD'; > > According to the following page, VS 2013 doesn't support this syntax yet: > https://developer.mozilla.org/en-US/docs/Using_CXX_in_Mozilla_code Ah, thanks for the reference, I was unaware of that page! > So to be safe, I'd just use: > > static const char16_t UTF8_REPLACEMENT_CHAR = 0xFFFD; Sounds good, fixed in this patch. > @@ +98,5 @@ > > if (srclen < 1) { > > + v = UTF8_REPLACEMENT_CHAR; > > + } else { > > + char16_t c2 = *src; > > + if ((c2 < 0xDC00) || (c2 > 0xDFFF)) { > > Pre-existing, but please remove the parentheses around the < and > > expressions while you're here. Done! > (I like extra parentheses when there's potential confusion about the > evaluation order, but (x < y || a > b) is common enough that it shouldn't > confuse anyone :) Indeed, I noticed that too but figured I'd resist the urge to make unrelated changes. Thanks for encouraging my inner pedant! ;-) > @@ +113,5 @@ > > if (v < 0x0080) { > > /* no encoding necessary - performance hack */ > > + if (dstlenp && dstlen == 0) { > > + goto finish; > > + } > > Nit: no {} (because each of condition, else-body, then-body fit on a single > line). Right, fixed. > The control-flow and the subtraction at the end seem more complicated than > necessary though. What if we do the following: > > - At the start of this function: > > if (dstlenp) > *dstlenp = 0; > > Then at the end of the loop body: > > if (dstlenp) > *dstlenp += utf8Len; > > And to bounds check here: > > if (dstlenp && *dstlenp + 1 < origDstlen) > return; Here presumably you mean: > if (dstlenp && *dstlenp + 1 > origDstlen) > return; And then the check for the general case would be: > if (dstlenp && *dstlenp + utf8Len > origDstlen) > return; (We could refactor a bit further, setting utf8Len first based on the value of |v| and then consolidating these conditional early returns. But I don't think it'd add clarity.) > What do you think? That way we always maintain the correct state so we can > just return when we're done. I like it! I adapted the original implementation from the DeflateStringToUTF8Buffer function in CTypes.cpp <https://dxr.mozilla.org/mozilla-central/rev/3461f3/js/src/ctypes/CTypes.cpp#124-187>, but I prefer yours, as it maintains the loop invariant that *dstlenp is the number of bytes written to the destination buffer. So I've adopted your implementation in this version of the patch. The only changes I made are the aforementioned correction to the conditionals, s/origDstlen/capacity/ (to make its purpose clearer), and consolidation of the code that initializes capacity and *dstlenp, which is now: > size_t capacity = 0; > if (dstlenp) { > capacity = *dstlenp; > *dstlenp = 0; > }
Attachment #8750502 -
Attachment is obsolete: true
Attachment #8751817 -
Flags: review?(jdemooij)
Comment 10•8 years ago
|
||
Comment on attachment 8751817 [details] [diff] [review] return partial result and num bytes/chars written Review of attachment 8751817 [details] [diff] [review]: ----------------------------------------------------------------- Thanks, looks great.
Attachment #8751817 -
Flags: review?(jdemooij) → review+
Assignee | ||
Comment 11•8 years ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #8) > Nit: add a MOZ_RELEASE_ASSERT(str); here and after the JS_New* calls below. > > (We don't have to handle OOM gracefully in tests but crashing immediately > helps if this ever fails.) Indeed. I added that to this patch. > VS 2013 comment applies here too, you can use MOZ_UTF16("foo") instead. > Hopefully we'll drop support for VS 2013 soon. Right. Fixed in this patch.
Assignee | ||
Comment 12•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bcfacefb3a6a
Assignee | ||
Comment 13•8 years ago
|
||
It looks like one (or more) of the tests is failing on Windows in the try run. I'll build and debug there.
Assignee | ||
Comment 14•8 years ago
|
||
(In reply to Myk Melez [:myk] [@mykmelez] from comment #13) > It looks like one (or more) of the tests is failing on Windows in the try > run. I'll build and debug there. The problem on Windows is that the encoding of wide string literals (L"") is undefined in the C++ specification for characters outside the ASCII range, so compilers are free to choose an encoding, and MSVC is choosing a different one. This hasn't been an issue in Mozilla code before because we've only used MOZ_UTF16 with ASCII string literals to date. The fix (until we require MSVC 2015 and get rid of MOZ_UTF16, anyway) is to specify the characters via hexadecimal escape sequences, i.e.: >- str = JS_NewUCStringCopyZ(cx, MOZ_UTF16("Óhãï")); // { 0xD3, 0x68, 0xE3, 0xEF } >+ str = JS_NewUCStringCopyZ(cx, MOZ_UTF16("\xD3\x68\xE3\xEF")); // u"Óhãï" > … >- str = JS_NewUCStringCopyZ(cx, MOZ_UTF16("Όhȃї")); // { 0x038C, 0x0068, 0x0203, 0x0457 } >+ str = JS_NewUCStringCopyZ(cx, MOZ_UTF16("\x038C\x0068\x0203\x0457")); // u"Όhȃї" Here's an updated patch with that change. https://treeherder.mozilla.org/#/jobs?repo=try&revision=73185ae37418
Attachment #8751516 -
Attachment is obsolete: true
Attachment #8751827 -
Attachment is obsolete: true
Attachment #8752398 -
Flags: review?(jdemooij)
Updated•8 years ago
|
Attachment #8752398 -
Flags: review?(jdemooij) → review+
Comment 15•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/6a5fe7a3af40 https://hg.mozilla.org/integration/mozilla-inbound/rev/ea6e8a3c5880
Comment 16•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/6a5fe7a3af40 https://hg.mozilla.org/mozilla-central/rev/ea6e8a3c5880
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
status-firefox49:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
You need to log in
before you can comment on or make changes to this bug.
Description
•