Closed
Bug 892806
Opened 11 years ago
Closed 11 years ago
Clean up InflateUTF8String() and related functions.
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
mozilla25
People
(Reporter: n.nethercote, Assigned: n.nethercote)
Details
Attachments
(1 file, 1 obsolete file)
31.79 KB,
patch
|
terrence
:
review+
|
Details | Diff | Splinter Review |
We have a function InflateUTF8StringToBuffer() which has two modes: depending on the arguments, it either computes the inflated size of a UTF8 buffer, or it actually expands it. And in several places (ReadStringCommon(), InflateUTF8String(), FileAsString()) we call it in the first mode, then allocate, and then call it in the second mode. InflateUTF8StringToBufferReplaceInvalid() is similar, but only used in ReadStringCommon(). Furthermore, InflateUTF8StringToBuffer() and InflateUTF8StringToBufferReplaceInvalid() are very similar, both dealing with UTF8 decoding. Having said that, the latter checks for an invalid second byte whereas the former doesn't, which doesn't seem right.
Assignee | ||
Comment 1•11 years ago
|
||
This patch does the following. - It fixes the duplicated UTF8 decoding by changing InflateUTF8StringToBuffer() to be templatized on a tri-value enum, so it does one of (a) determines length, reporting errors; (b) determines length, ignoring errors, (c) expands the string. So it's still doing multiple things, but the templates make this clearer. And it does that check of the second-byte. - It removes InflateUTF8StringToBufferReplaceInvalid(), now unnecessary. - It changes things so that the call-in-one-mode-then-another dance is done in a single place: InflateUTF8StringHelper(). ReadStringCommon() and FileAsString() now both use higher-level functions. InflateUTF8StringHelper() also does faster copying if the string is ASCII. - It introduces a new function, InflateUTF8StringReplaceInvalid(). - In CTypes.cpp, it changes a function template parameter to a function pointer argument, which avoids unnecessary duplication of the function.
Attachment #774390 -
Flags: review?(terrence)
Comment 2•11 years ago
|
||
Comment on attachment 774390 [details] [diff] [review] Clean up InflateUTF8String() and related functions. Review of attachment 774390 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for doing this! These were originally implemented in haste for B2G and I did not want to hold it up too much or take any risky refactorings, so I insisted on a trivial copy. I had been meaning to get back to this and clean it up myself: thanks for beating me to it. It looks pretty excellent, but I'm asking for a pretty big refactoring so I'd like to see it again after that. ::: js/src/jsstr.cpp @@ +3995,5 @@ > return NULL; > } > > +static void > +ReportInvalidCharacter(JSContext *cx, uint32_t offset) This is a bit weird: you're using cx as if it's a maybecx here, but the only callers use cx unconditionally. I expect you can get rid of the check. If it is needed, change the name to maybecx. @@ +4006,5 @@ > + } > +} > + > +static void > +ReportBufferTooSmall(JSContext *cx, uint32_t dummy) Ditto. @@ +4013,5 @@ > + JS_ReportErrorNumber(cx, js_GetErrorMessage, NULL, JSMSG_BUFFER_TOO_SMALL); > +} > + > +static void > +ReportTooBigCharacter(JSContext *cx, uint32_t v) Ditto. @@ +4066,5 @@ > + goto invalidMultiByteCodeUnit; \ > + } else { \ > + MOZ_ASSUME_UNREACHABLE("bad action"); \ > + } \ > + } while (0) Rather than duping this rather large branch structure to all of our errors paths and bloating up the full function, could we move this out-of-line somehow? How about making the macro unconditionally call report and having both that function and the invalid label check for CountAndReportInvalids? #define INVALID do { report(cx, action, arg); goto invalidMultiByteCodeUnit; } while(0); invalidMultiByteCodeUnit: if (action == CountAndReportInvalids) return false; i += n - 1 // ... @@ +4169,5 @@ > + lengthp); > +} > + > +jschar * > +js::InflateUTF8StringReplaceInvalid(JSContext *cx, const char *src, size_t *lengthp) Please move this, the other, and all the static helpers to /js/src/vm/CharacterEncoding.cpp. ::: js/src/jsstr.h @@ +276,3 @@ > */ > extern jschar * > InflateUTF8String(JSContext *cx, const char *bytes, size_t *length); Move this to /js/public/CharacterEncoding.h and change the name to UTF8CharsToNewTwoByteCharsZ. It's a bit wordy, but, like the other methods in that file, it tells you /exactly/ what you're getting. You'll also want to change the signature to return TwoByteCharsZ: the caller should be able to assume the length matches the input on success, so the length outparam should be unneeded now. The input should also be changed to be UTF8Chars, which also carries the length so a separate length in-parameter isn't needed anymore. Unfortunately we only have UTF8CharsZ at the moment, but it should be very easy to implement by copying either Latin1Chars or TwoByteChars. @@ +282,5 @@ > + * will be replaced by \uFFFD. No exception will be thrown for malformed UTF-8 > + * input. > + */ > +extern jschar * > +InflateUTF8StringReplaceInvalid(JSContext *cx, const char *bytes, size_t *length); Ditto, with a new name of: LossyUTF8ToNewTwoByteCharsZ.
Attachment #774390 -
Flags: review?(terrence)
Assignee | ||
Comment 3•11 years ago
|
||
> Rather than duping this rather large branch structure to all of our errors > paths and bloating up the full function, could we move this out-of-line > somehow? There isn't bloat because the conditions are compile-time constants. Nonetheless, I've tweaked it a bit by using |continue| instead of |goto|. > You'll also want to change the signature to return TwoByteCharsZ: the caller > should be able to assume the length matches the input on success, so the > length outparam should be unneeded now. The input should also be changed to > be UTF8Chars, which also carries the length so a separate length > in-parameter isn't needed anymore. Unfortunately we only have UTF8CharsZ at > the moment, but it should be very easy to implement by copying either > Latin1Chars or TwoByteChars. I've done all that except for removing the length outparam -- it is still needed. Assuming the length matches the input on success would be disastrous, those lengths are entirely unrelated. At least the length outparam is just an outparam now, not an inoutparam (because the input length is within the UTF8Chars).
Assignee | ||
Comment 4•11 years ago
|
||
Attachment #779020 -
Flags: review?(terrence)
Assignee | ||
Updated•11 years ago
|
Attachment #774390 -
Attachment is obsolete: true
Comment 5•11 years ago
|
||
Comment on attachment 779020 [details] [diff] [review] Clean up InflateUTF8String() and related functions. Review of attachment 779020 [details] [diff] [review]: ----------------------------------------------------------------- I like it. r=me (In reply to Nicholas Nethercote [:njn] from comment #3) > > You'll also want to change the signature to return TwoByteCharsZ: the caller > > should be able to assume the length matches the input on success, so the > > length outparam should be unneeded now. The input should also be changed to > > be UTF8Chars, which also carries the length so a separate length > > in-parameter isn't needed anymore. Unfortunately we only have UTF8CharsZ at > > the moment, but it should be very easy to implement by copying either > > Latin1Chars or TwoByteChars. > > I've done all that except for removing the length outparam -- it is still > needed. Assuming the length matches the input on success would be > disastrous, those lengths are entirely unrelated. At least the length > outparam is just an outparam now, not an inoutparam (because the input > length is within the UTF8Chars). D'oh. Quite right. If we could get rid of the null-termination requirement we could hand back the size as part of the return. Not for today though. ::: js/src/jsstr.h @@ +261,5 @@ > > /* > * Inflate bytes in ASCII encoding to jschars. Return null on error, otherwise > * return the jschar that was malloc'ed. length is updated to the length of the > + * new string (in jschars). A null char is appended, but it is not included in One space between sentences.
Attachment #779020 -
Flags: review?(terrence) → review+
Assignee | ||
Comment 6•11 years ago
|
||
Well, if you were wondering if there is thorough testing of this InflateUTF8 stuff in the tree, I can confirm that there is! Thanks to toolkit/components/ctypes/tests/unit/test_jsctypes.js.in and netwerk/test/unit/test_bug722299.js (and Valgrind) I found several subtle and not so subtle bugs in the patch. (Note to self: embedding |continue| in a macro is a bad idea because if you have nested loops it's easy to forget you'll continue on the *inner* loop.) As a result, I'm fairly confident about the final version that's just about ready to land.
Assignee | ||
Comment 7•11 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/a6d8089d98d7
Comment 8•11 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/a6d8089d98d7
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25
You need to log in
before you can comment on or make changes to this bug.
Description
•