Closed Bug 892806 Opened 11 years ago Closed 11 years ago

Clean up InflateUTF8String() and related functions.

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla25

People

(Reporter: n.nethercote, Assigned: n.nethercote)

Details

Attachments

(1 file, 1 obsolete file)

We have a function InflateUTF8StringToBuffer() which has two modes:
depending on the arguments, it either computes the inflated size of a UTF8
buffer, or it actually expands it.  And in several places (ReadStringCommon(),
InflateUTF8String(), FileAsString()) we call it in the first mode, then
allocate, and then call it in the second mode.
InflateUTF8StringToBufferReplaceInvalid() is similar, but only used in
ReadStringCommon().

Furthermore, InflateUTF8StringToBuffer() and
InflateUTF8StringToBufferReplaceInvalid() are very similar, both dealing with
UTF8 decoding.  Having said that, the latter checks for an invalid second byte
whereas the former doesn't, which doesn't seem right.
This patch does the following.

- It fixes the duplicated UTF8 decoding by changing InflateUTF8StringToBuffer()
  to be templatized on a tri-value enum, so it does one of (a) determines
  length, reporting errors;  (b) determines length, ignoring errors, (c)
  expands the string.  So it's still doing multiple things, but the templates
  make this clearer.  And it does that check of the second-byte.

- It removes InflateUTF8StringToBufferReplaceInvalid(), now unnecessary.

- It changes things so that the call-in-one-mode-then-another dance is done in a
  single place:  InflateUTF8StringHelper().  ReadStringCommon() and
  FileAsString() now both use higher-level functions.
  InflateUTF8StringHelper() also does faster copying if the string is ASCII.

- It introduces a new function, InflateUTF8StringReplaceInvalid().
  
- In CTypes.cpp, it changes a function template parameter to a function pointer
  argument, which avoids unnecessary duplication of the function.
Attachment #774390 - Flags: review?(terrence)
Comment on attachment 774390 [details] [diff] [review]
Clean up InflateUTF8String() and related functions.

Review of attachment 774390 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for doing this! These were originally implemented in haste for B2G and I did not want to hold it up too much or take any risky refactorings, so I insisted on a trivial copy. I had been meaning to get back to this and clean it up myself: thanks for beating me to it.

It looks pretty excellent, but I'm asking for a pretty big refactoring so I'd like to see it again after that.

::: js/src/jsstr.cpp
@@ +3995,5 @@
>      return NULL;
>  }
>  
> +static void
> +ReportInvalidCharacter(JSContext *cx, uint32_t offset)

This is a bit weird: you're using cx as if it's a maybecx here, but the only callers use cx unconditionally. I expect you can get rid of the check. If it is needed, change the name to maybecx.

@@ +4006,5 @@
> +    }
> +}
> +
> +static void
> +ReportBufferTooSmall(JSContext *cx, uint32_t dummy)

Ditto.

@@ +4013,5 @@
> +        JS_ReportErrorNumber(cx, js_GetErrorMessage, NULL, JSMSG_BUFFER_TOO_SMALL);
> +}
> +
> +static void
> +ReportTooBigCharacter(JSContext *cx, uint32_t v)

Ditto.

@@ +4066,5 @@
> +                    goto invalidMultiByteCodeUnit;                  \
> +                } else {                                            \
> +                    MOZ_ASSUME_UNREACHABLE("bad action");           \
> +                }                                                   \
> +            } while (0)

Rather than duping this rather large branch structure to all of our errors paths and bloating up the full function, could we move this out-of-line somehow? How about making the macro unconditionally call report and having both that function and the invalid label check for CountAndReportInvalids?

#define INVALID do { report(cx, action, arg); goto invalidMultiByteCodeUnit; } while(0);

invalidMultiByteCodeUnit:
    if (action == CountAndReportInvalids)
        return false;
    i += n - 1 // ...

@@ +4169,5 @@
> +                                   lengthp);
> +}
> +
> +jschar *
> +js::InflateUTF8StringReplaceInvalid(JSContext *cx, const char *src, size_t *lengthp)

Please move this, the other, and all the static helpers to /js/src/vm/CharacterEncoding.cpp.

::: js/src/jsstr.h
@@ +276,3 @@
>   */
>  extern jschar *
>  InflateUTF8String(JSContext *cx, const char *bytes, size_t *length);

Move this to /js/public/CharacterEncoding.h and change the name to UTF8CharsToNewTwoByteCharsZ. It's a bit wordy, but, like the other methods in that file, it tells you /exactly/ what you're getting.

You'll also want to change the signature to return TwoByteCharsZ: the caller should be able to assume the length matches the input on success, so the length outparam should be unneeded now. The input should also be changed to be UTF8Chars, which also carries the length so a separate length in-parameter isn't needed anymore. Unfortunately we only have UTF8CharsZ at the moment, but it should be very easy to implement by copying either Latin1Chars or TwoByteChars.

@@ +282,5 @@
> + * will be replaced by \uFFFD. No exception will be thrown for malformed UTF-8
> + * input.
> + */
> +extern jschar *
> +InflateUTF8StringReplaceInvalid(JSContext *cx, const char *bytes, size_t *length);

Ditto, with a new name of: LossyUTF8ToNewTwoByteCharsZ.
Attachment #774390 - Flags: review?(terrence)
> Rather than duping this rather large branch structure to all of our errors
> paths and bloating up the full function, could we move this out-of-line
> somehow?

There isn't bloat because the conditions are compile-time constants.  Nonetheless, I've tweaked it a bit by using |continue| instead of |goto|.


> You'll also want to change the signature to return TwoByteCharsZ: the caller
> should be able to assume the length matches the input on success, so the
> length outparam should be unneeded now. The input should also be changed to
> be UTF8Chars, which also carries the length so a separate length
> in-parameter isn't needed anymore. Unfortunately we only have UTF8CharsZ at
> the moment, but it should be very easy to implement by copying either
> Latin1Chars or TwoByteChars.

I've done all that except for removing the length outparam -- it is still needed.  Assuming the length matches the input on success would be disastrous, those lengths are entirely unrelated.  At least the length outparam is just an outparam now, not an inoutparam (because the input length is within the UTF8Chars).
Attachment #774390 - Attachment is obsolete: true
Comment on attachment 779020 [details] [diff] [review]
Clean up InflateUTF8String() and related functions.

Review of attachment 779020 [details] [diff] [review]:
-----------------------------------------------------------------

I like it. r=me

(In reply to Nicholas Nethercote [:njn] from comment #3)
> > You'll also want to change the signature to return TwoByteCharsZ: the caller
> > should be able to assume the length matches the input on success, so the
> > length outparam should be unneeded now. The input should also be changed to
> > be UTF8Chars, which also carries the length so a separate length
> > in-parameter isn't needed anymore. Unfortunately we only have UTF8CharsZ at
> > the moment, but it should be very easy to implement by copying either
> > Latin1Chars or TwoByteChars.
> 
> I've done all that except for removing the length outparam -- it is still
> needed.  Assuming the length matches the input on success would be
> disastrous, those lengths are entirely unrelated.  At least the length
> outparam is just an outparam now, not an inoutparam (because the input
> length is within the UTF8Chars).

D'oh. Quite right. If we could get rid of the null-termination requirement we could hand back the size as part of the return. Not for today though.

::: js/src/jsstr.h
@@ +261,5 @@
>  
>  /*
>   * Inflate bytes in ASCII encoding to jschars. Return null on error, otherwise
>   * return the jschar that was malloc'ed. length is updated to the length of the
> + * new string (in jschars).  A null char is appended, but it is not included in

One space between sentences.
Attachment #779020 - Flags: review?(terrence) → review+
Well, if you were wondering if there is thorough testing of this InflateUTF8 stuff in the tree, I can confirm that there is!  Thanks to toolkit/components/ctypes/tests/unit/test_jsctypes.js.in and netwerk/test/unit/test_bug722299.js (and Valgrind) I found several subtle and not so subtle bugs in the patch.  (Note to self: embedding |continue| in a macro is a bad idea because if you have nested loops it's easy to forget you'll continue on the *inner* loop.)  As a result, I'm fairly confident about the final version that's just about ready to land.
https://hg.mozilla.org/mozilla-central/rev/a6d8089d98d7
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: