811060 - Move character "decoding" out of jsstr and type it

Assignee

Description

•

12 years ago

Attached patch v0: character encoding (obsolete) — Details — Splinter Review

These patches are the first step on the way to fully sane and typey characters in the JS engine.  These create the base types and replace DeflateString, since it's the simplest user.  These base classes pave the way to killing the SkipRoot in NewString, since that work was based on these classes (I rebuilt the character work for DeflateString to make the review easier).

This patch does the following:
* Deletes DecodeString and replaces all users with nicely named and typed functions with extensive range assertions.
* Moved our current UTF8 tests from custom shell code to a jsapi-test.  These tests are (still) not currently testing anything useful.  This will change when we move UTF-8 over to this interface.
* Makes JS_EncodeStringToBuffer and JS_GetStringEncodingLength take and always take, respectively, a JSContext.  This involved a small change to jsd, which I think is correct.  I'm prepared to patch the ability to take a NULL cx back in if that turns out to be required.

I tried to only include things in this patch that were actively being used, but if I missed anything, hopefully this list of things we still need to do will suffice to explain what the extra bits are doing.  Remaining work:
* Create UTF8Chars : public OnebyteCharsBase.
* Move the remaining Inflate/Deflate functions to CharacterEncoding.
* Create StableTwoByteChars : public TwoByteCharsBase.
* Make JSFlatString, JSLinearString and JSStableString return the correct TwoByteCharsBase subclasses.

The last part is basically converting the engine to Range.  I've already rebased the RangePtr work I did to Range, so it is basically done once I get this in and rebase on top of it.

Attachment #680756 - Flags: review?(luke)

Terrence Cole [:terrence]

Assignee

Comment 1

•

12 years ago

Attached patch v0: Range and RangedPtr fixes (obsolete) — Details — Splinter Review

The split off Range addition to mfbt and ConvertibleToBool for RangedPtr.

Attachment #680761 - Flags: review?(jwalden+bmo)

:Ms2ger (he/him; ⌚ UTC+1/+2)

Comment 2

•

12 years ago

Comment on attachment 680756 [details] [diff] [review]
v0: character encoding

Review of attachment 680756 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/public/CharacterEncoding.h
@@ +24,5 @@
> +
> +// By default, all C/C++ 1-byte-per-character strings passed into the JSAPI
> +// are treated as ISO/IEC 8859-1, also known as ISO-Latin-1. That is, each
> +// byte is treated as a 2-byte character, and there is no way to pass in a
> +// string containing characters beyond U+00FF.

windows-1252?

::: js/src/jsreflect.cpp
@@ +3437,5 @@
>              if (!baseops::GetPropertyDefault(cx, config, sourceId, nullVal, &prop))
>                  return JS_FALSE;
>  
>              if (!prop.isNullOrUndefined()) {
> +                RawString str = ToString(cx, prop);

Why?

:Ms2ger (he/him; ⌚ UTC+1/+2)

Comment 3

•

12 years ago

Comment on attachment 680756 [details] [diff] [review]
v0: character encoding

Review of attachment 680756 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/exported_headers.mk
@@ +22,5 @@
>    LinkedList.h \
>    MathAlgorithms.h \
>    MSStdInt.h \
>    NullPtr.h \
> +  Range.h \

Wrong patch

Jeff Walden [:Waldo]

Comment 4

•

12 years ago

Comment on attachment 680761 [details] [diff] [review]
v0: Range and RangedPtr fixes

Review of attachment 680761 [details] [diff] [review]:
-----------------------------------------------------------------

Range stores start/length.  Should it perhaps store start/end?  This would be consistent with RangedPtr.  It seems like it'd play nicer with algorithms iterating over the contents of a range, as the end point would be more easily available (not requiring a computation to get).

Not that there's anything strictly wrong with this patch, setting aside style issues, but the start/end issue needs to be resolved before I can give a + here, I think.

::: mfbt/Range.h
@@ +5,5 @@
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#ifndef mfbt_Range_h___
> +#define mfbt_Range_h___

There's a consistent include-guard naming scheme for mfbt stuff, see mfbt/STYLE for it.

@@ +8,5 @@
> +#ifndef mfbt_Range_h___
> +#define mfbt_Range_h___
> +
> +#include "mozilla/RangedPtr.h"
> +

Please add a #include <stddef.h> here to get size_t.

@@ +16,5 @@
> +template <typename T>
> +class Range
> +{
> +    T *ptr_;
> +    size_t length_;

mfbt doesn't _-suffix members, so these should be ptr/length.  And * goes by type name.

@@ +22,5 @@
> +    typedef void (Range::* ConvertibleToBool)();
> +    void nonNull() {}
> +
> +  public:
> +    Range() : ptr_(NULL), length_(0) {}

mfbt/STYLE is a little outdated on this point: use nullptr here, please, and add the #include "mozilla/NullPtr.h" up top.

@@ +23,5 @@
> +    void nonNull() {}
> +
> +  public:
> +    Range() : ptr_(NULL), length_(0) {}
> +    Range(T *ptr, size_t length) : ptr_(ptr), length_(length) {}

T* ptr, or to avoid shadowing, T* p and size_t len like RangedPtr has.

@@ +25,5 @@
> +  public:
> +    Range() : ptr_(NULL), length_(0) {}
> +    Range(T *ptr, size_t length) : ptr_(ptr), length_(length) {}
> +
> +    T *start() const { return ptr_; }

This should return a RangedPtr.  Heck, ptr should be a RangedPtr, actually, so that operator[] could just be |return ptr[offset];|.

Having |RangedPtr<T> end() const| seems right to provide the full complement here.

Attachment #680761 - Flags: review?(jwalden+bmo)

Terrence Cole [:terrence]

Assignee

Comment 5

•

12 years ago

Attached patch v1: Range and RangedPtr fixes (obsolete) — Details — Splinter Review

Updated with review comments.

Attachment #680761 - Attachment is obsolete: true

Attachment #680869 - Flags: review?(jwalden+bmo)

Jeff Walden [:Waldo]

Comment 6

•

12 years ago

Erm, that looks like a different patch from the first one I reviewed.  Should I still be reviewing it?

Terrence Cole [:terrence]

Assignee

Comment 7

•

12 years ago

Attached patch v1: Range and RangedPtr fixes — Details — Splinter Review

Doh!  Sorry!  I qpu'ed instead of qpo'ing and didn't stop to check what my queue state was before exporting.  I think my brain was still expecting this patch to be on top of the other.

Attachment #680869 - Attachment is obsolete: true

Attachment #680869 - Flags: review?(jwalden+bmo)

Attachment #681091 - Flags: review?(jwalden+bmo)

Jeff Walden [:Waldo]

Comment 8

•

12 years ago

Comment on attachment 681091 [details] [diff] [review]
v1: Range and RangedPtr fixes

Review of attachment 681091 [details] [diff] [review]:
-----------------------------------------------------------------

Quality.

Attachment #681091 - Flags: review?(jwalden+bmo) → review+

Luke Wagner [:luke]

Comment 9

•

12 years ago

Comment on attachment 680756 [details] [diff] [review]
v0: character encoding

Sorry for the delay, but I'm pretty weak in matters of character encodings.  Types+Unicode make me think Waldo, so I'll tentatively punt to him.

Attachment #680756 - Flags: review?(luke) → review?(jwalden+bmo)

Terrence Cole [:terrence]

Assignee

Comment 10

•

12 years ago

Fine by me.  As long as one of you, Bill, or Jeff is processing reviews, I can almost always make forward progress somewhere :-).

Jeff, based on Norbert's comments in Bug 805081, I'd like to rename TwoByteChars(Z) to ES5Chars(Z).  I don't think it should affect the review too much, so I'm just going to leave the patch as is unless you specifically want me to do the rename before you review it.

Naveed Ihsanullah [:naveed]

Updated

•

12 years ago

Whiteboard: [js:t]

Jeff Walden [:Waldo]

Comment 11

•

12 years ago

Comment on attachment 680756 [details] [diff] [review]
v0: character encoding

Review of attachment 680756 [details] [diff] [review]:
-----------------------------------------------------------------

We discussed CharacterEncoding.h in-person and covered most of the details that really matter that way, but I had this written up, with a bunch of nitpicky things, so I might as well post it anyway.  Not all apply, some are probably nonsensical, but getting it out there warts and all at least doesn't memory-hole the details entirely.  (I was going along fine here until I realized that Chars and CharsZ derived from totally distinct types, which just seemed wrong [compare to JSLinearString and JSFlatString, which do have a subtype relationship], and at that point talking it out seemed more productive than trying to retroactively correct all the comments to deal with that fundamental bad assumption.)

I've mentioned a few times before that I don't find // for big documentary comments as readable as I find /**/.  I'm still giving it time, but right now, the // reads as line noise to me, and it's harder to ignore than * and delineates less well to boot.

ES5Chars is not a good name; nothing about any of this implies ES5-ishness at all.  Stick with TwoByteChars, it's good enough.  CodeUnit is workable as well, although it doesn't imply 16-bit, which makes me think it's a little worse than TwoByteChars.  A code unit in a UTF-8 string is a byte, or in UTF-32 four bytes, for example.

I'd like to see a relatively fast followup to get rid of the .get() in the methods where you didn't want to change return type to a Range subclass; dropping the smartness so early loses most of the benefits.  And it's hard to grep for these, too, because sometimes, rarely, get() is what you want.

::: js/public/CharacterEncoding.h
@@ +19,5 @@
> +// This type is the base of all 1-byte character range types.
> +typedef mozilla::Range<char> OneByteCharsBase;
> +
> +// This type is the base of all 1-byte null-terminated character types.
> +typedef mozilla::RangedPtr<char> OneByteCharPtrBase;

Hmm.  So this raises the <T>-versus-<const T> issue for container types.  I guess you have this as <T> because we write into it, and you have the two-byte case as <const T> because we don't?  Inconsistent and kind of messy, but whatever, we can figure this out more as we go.

@@ +25,5 @@
> +// By default, all C/C++ 1-byte-per-character strings passed into the JSAPI
> +// are treated as ISO/IEC 8859-1, also known as ISO-Latin-1. That is, each
> +// byte is treated as a 2-byte character, and there is no way to pass in a
> +// string containing characters beyond U+00FF.
> +class ISOLatin1Chars : public OneByteCharsBase

ISOLatin1Chars isn't so long, but all the combinations of this with other stuff grate a bit.  Let's trim this to Latin1Chars so we can keep everything shorter.

@@ +33,5 @@
> +    ISOLatin1Chars(char *bytes, size_t length) : OneByteCharsBase(bytes, length) {}
> +};
> +
> +// Like ISOLatin1Chars, but implies NULL termination for C compatibility.
> +class ISOLatin1CharsZ : public OneByteCharPtrBase

Why not inherit from Latin1Chars?  This is-a the other, seems to me, just as JSFlatString is-a JSLinearString.

@@ +36,5 @@
> +// Like ISOLatin1Chars, but implies NULL termination for C compatibility.
> +class ISOLatin1CharsZ : public OneByteCharPtrBase
> +{
> +  public:
> +    ISOLatin1CharsZ(char *bytes, size_t length) : OneByteCharPtrBase(bytes, length) {}

MOZ_ASSERT(bytes[length] == '\0');

@@ +45,5 @@
> +
> +// SpiderMonkey uses a 2-byte character representation: it is a
> +// 2-byte-at-a-time view of a UTF-16 byte stream. This is similar to UCS-2,
> +// but unlike UCS-2, we do not strip UTF-16 extension bytes. This allows a
> +// sufficiently dedicated JavaScript program to be fully unicode aware by

Unicode-aware

@@ +47,5 @@
> +// 2-byte-at-a-time view of a UTF-16 byte stream. This is similar to UCS-2,
> +// but unlike UCS-2, we do not strip UTF-16 extension bytes. This allows a
> +// sufficiently dedicated JavaScript program to be fully unicode aware by
> +// manually interpreting UTF-16 extension characters embedded in the JS
> +// String.

string

@@ +56,5 @@
> +    TwoByteChars(jschar *chars, size_t length) : TwoByteCharsBase(chars, length) {}
> +    TwoByteChars(const jschar *chars, size_t length) : TwoByteCharsBase(chars, length) {}
> +};
> +
> +// Like TwoByteChars, but NULL terminated for compatibility with JSFlatString.

NULL always reads as a pointer to me; could you use '\0' instead?  (This applies everywhere you say NULL but mean '\0'.)

@@ +57,5 @@
> +    TwoByteChars(const jschar *chars, size_t length) : TwoByteCharsBase(chars, length) {}
> +};
> +
> +// Like TwoByteChars, but NULL terminated for compatibility with JSFlatString.
> +class TwoByteCharsZ : public TwoByteCharsBase

Why doesn't this also inherit from TwoByteChars, for the same is-a reason?

@@ +61,5 @@
> +class TwoByteCharsZ : public TwoByteCharsBase
> +{
> +  public:
> +    TwoByteCharsZ() : TwoByteCharsBase() {}
> +    TwoByteCharsZ(jschar *chars, size_t length): TwoByteCharsBase(chars, length) {}

Missing a space before the :, and MOZ_ASSERT(chars[length] == '\0').

@@ +63,5 @@
> +  public:
> +    TwoByteCharsZ() : TwoByteCharsBase() {}
> +    TwoByteCharsZ(jschar *chars, size_t length): TwoByteCharsBase(chars, length) {}
> +
> +    operator TwoByteChars &() { return reinterpret_cast<TwoByteChars&>(*this); }

...

Inheritance!

@@ +72,5 @@
> +// contains any UTF-16 extension characters, then this may give invalid Latin1
> +// output. The returned string is zero terminated. The returned string or the
> +// returned string's |start()| must be freed with JS_free or js_free,
> +// respectively. If allocation fails, an OOM error will be set and the method
> +// will return a NULL chars. This method cannot trigger GC.

Maybe add a "(which can be tested for with the ! operator)", since that isn't obvious from the class definitions.

@@ +73,5 @@
> +// output. The returned string is zero terminated. The returned string or the
> +// returned string's |start()| must be freed with JS_free or js_free,
> +// respectively. If allocation fails, an OOM error will be set and the method
> +// will return a NULL chars. This method cannot trigger GC.
> +ISOLatin1CharsZ

This should probably be extern.

@@ +74,5 @@
> +// returned string's |start()| must be freed with JS_free or js_free,
> +// respectively. If allocation fails, an OOM error will be set and the method
> +// will return a NULL chars. This method cannot trigger GC.
> +ISOLatin1CharsZ
> +TwoByteCharsToISOLatin1CharsZ(JSContext *cx, TwoByteChars tbchars);

Please put "Lossy" at the start of this name, to make people think twice about using it.  Gecko has NS_LossyConvertUTF16toASCII, for example, and the name should set off warnings any time anyone uses it, or sees it in a patch being reviewed.  Ditto for all the other TwoByteCharsTo* methods.  It'll make all these stick out as BAD so much more in SpiderMonkey, rather than lingering forever only half-recognized.

Since this allocates new chars, let's sneak a "New" in here for clarity.

It doesn't really seem necessary to call out the "Z" in the method name -- we're allocating the chars, it's only reasonable to hand them back null-terminated.  The return type indicates that, so let's drop the Z.  Also let's make its allocating nature clearer.  Ergo: LossyTwoByteCharsToNewLatin1Chars.

@@ +76,5 @@
> +// will return a NULL chars. This method cannot trigger GC.
> +ISOLatin1CharsZ
> +TwoByteCharsToISOLatin1CharsZ(JSContext *cx, TwoByteChars tbchars);
> +
> +// Identical to the JSContext taking version, but will not report OOM.

Grammar nazi says no comma for you!

@@ +77,5 @@
> +ISOLatin1CharsZ
> +TwoByteCharsToISOLatin1CharsZ(JSContext *cx, TwoByteChars tbchars);
> +
> +// Identical to the JSContext taking version, but will not report OOM.
> +ISOLatin1CharsZ

extern

@@ +87,5 @@
> +// output. Zero is appended to the output buffer. The output buffer *MUST* be
> +// of greater than or equal length to the input buffer (plus 1 for null
> +// termination implied by the ISOLatin1CharsZ type).
> +inline void
> +TwoByteCharsToISOLatin1BufferZ(TwoByteChars input, ISOLatin1CharsZ out)

const TwoByteChars &input

Going purely by name, I'd expect this to return a Latin1Chars or something.  Maybe make this LossyCopyTwoByteCharsToLatin1?  (I don't think the "Buffer" part's necessary here.)

jsutil.h has a bunch of precedent for (dest, src) as the canonical ordering (presumably imitating stdlib), so please switch the argument order.

@@ +90,5 @@
> +inline void
> +TwoByteCharsToISOLatin1BufferZ(TwoByteChars input, ISOLatin1CharsZ out)
> +{
> +    for (size_t i = 0; i < input.length(); ++i)
> +        out[i] = static_cast<char>(input[i]);

I like hoisting |len = input.length()| and using that to reduce cognitive overhead of thinking about whether the compiler will inline here.

::: js/src/jscntxt.cpp
@@ +970,5 @@
>                  }
>                  JS_ASSERT(expandedArgs == argCount);
>                  *out = 0;
>                  js_free(buffer);
> +                TwoByteChars ucmsg(reportp->ucmessage, size_t(out - reportp->ucmessage));

Use mozilla::PointerRangeSize here, please.

::: js/src/jsprf.cpp
@@ +361,5 @@
>      /* and away we go */
>      return fill2(ss, s ? s : "(null)", slen, width, flags);
>  }
>  
> +static int cvt_ws(SprintfState *ss, const jschar *ws, int width, int prec, int flags)

static int
cvt_ws(...)

::: js/src/jsreflect.cpp
@@ +3437,5 @@
>              if (!baseops::GetPropertyDefault(cx, config, sourceId, nullVal, &prop))
>                  return JS_FALSE;
>  
>              if (!prop.isNullOrUndefined()) {
> +                RawString str = ToString(cx, prop);

Keep this Rooted; it's way too cutesy otherwise, and we don't care about perf here.

::: js/src/methodjit/PolyIC.cpp
@@ +2447,2 @@
>      JaegerSpew(JSpew_PICs, "generated %s stub at %p for atom %p (\"%s\") shape %p (%s: %d)\n",
> +               js_CodeName[JSOp(*f.pc())], cs.executableAddress(), (void*)name, latin1.get(),

get() will just give you a RangedPtr, which itself isn't safe to pass varargs-style -- this needs to be .get().get().  (Yeah, ugh.  But varargs.)

...actually, I'm a little confused here.  Range in the previous patch here has start() and end(), no get() at all.  Please explain!

::: js/src/vm/CharacterEncoding.cpp
@@ +6,5 @@
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#include "jsapi.h"
> +#include "jscntxt.h"
> +#include "js/CharacterEncoding.h"

Blank line between js* and js/* headers.

I don't think you need jsapi.h here.  jscntxt.h should be enough for everything this file uses, beyond the encoding stuff.

@@ +16,5 @@
> +{
> +    JS_ASSERT(cx);
> +    char *latin1 = cx->pod_malloc<char>(tbchars.length() + 1);
> +    if (!latin1) {
> +        JS_ReportOutOfMemory(cx);

js_ReportOutOfMemory(cx)

Attachment #680756 - Flags: review?(jwalden+bmo) → review-

Terrence Cole [:terrence]

Assignee

Comment 12

•

12 years ago

Attached patch v1: Simplified. (obsolete) — Details — Splinter Review

This removes the Z type variants: we decided that char* was implicitly "C" typed and therefore \0 terminated.  I've also removed the Buffer variants since we don't /really/ need this until we replace DeflateStringToBuffer.  In fact, most of CharacterEncoding.h is gone now.

(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #11)
> Comment on attachment 680756 [details] [diff] [review]
> @@ +74,5 @@
> > +// returned string's |start()| must be freed with JS_free or js_free,
> > +// respectively. If allocation fails, an OOM error will be set and the method
> > +// will return a NULL chars. This method cannot trigger GC.
> > +ISOLatin1CharsZ
> > +TwoByteCharsToISOLatin1CharsZ(JSContext *cx, TwoByteChars tbchars);
>
> It doesn't really seem necessary to call out the "Z" in the method name --
> we're allocating the chars, it's only reasonable to hand them back
> null-terminated.  The return type indicates that, so let's drop the Z.  Also
> let's make its allocating nature clearer.  Ergo:
> LossyTwoByteCharsToNewLatin1Chars.

Now that this is returning a bare |char*|, I think it would be better if we kept the Z suffix to match what we do with NewStringFromCopy, etc.

Attachment #680756 - Attachment is obsolete: true

Attachment #688029 - Flags: review?(jwalden+bmo)

Terrence Cole [:terrence]

Assignee

Comment 13

•

12 years ago

Comment on attachment 688029 [details] [diff] [review]
v1: Simplified.

This is wrong still.  Had some ideas last night, will ping you on IRC to discuss.

Attachment #688029 - Flags: review?(jwalden+bmo)

Terrence Cole [:terrence]

Assignee

Comment 14

•

12 years ago

Attached patch v3 (obsolete) — Details — Splinter Review

This integrates the fixes we discussed. I think it looks much nicer now.

Attachment #688029 - Attachment is obsolete: true

Attachment #690997 - Flags: review?(jwalden+bmo)

Terrence Cole [:terrence]

Assignee

Comment 15

•

12 years ago

Attached patch v4: this time remembering to qref — Details — Splinter Review

Attachment #690997 - Attachment is obsolete: true

Attachment #690997 - Flags: review?(jwalden+bmo)

Attachment #691080 - Flags: review?(jwalden+bmo)

Jeff Walden [:Waldo]

Comment 16

•

12 years ago

Comment on attachment 691080 [details] [diff] [review]
v4: this time remembering to qref

Review of attachment 691080 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay here, I am a horrible human being.  :-(

::: js/jsd/jsd_xpc.cpp
@@ +959,5 @@
>          mBaseLineNumber = JSD_GetScriptBaseLineNumber(mCx, mScript);
>          mLineExtent = JSD_GetScriptLineExtent(mCx, mScript);
>          mFirstPC = JSD_GetClosestPC(mCx, mScript, 0);
>          JSD_UnlockScriptSubsystem(mCx);
>          

This trailing whitespace taunts me, because it knows I shouldn't insist on it being fixed, because jsd is about to be dumped on the scrap heap.  :-(

::: js/public/CharacterEncoding.h
@@ +40,5 @@
> +
> +  public:
> +    Latin1CharsZ() : Base(NULL, 0) {}
> +    Latin1CharsZ(char *bytes, size_t length) : Base(reinterpret_cast<unsigned char *>(bytes), length) {}
> +    Latin1CharsZ(unsigned char *bytes, size_t length) : Base(bytes, length) {}

These last two should assert bytes[length] == '\0'.

@@ +42,5 @@
> +    Latin1CharsZ() : Base(NULL, 0) {}
> +    Latin1CharsZ(char *bytes, size_t length) : Base(reinterpret_cast<unsigned char *>(bytes), length) {}
> +    Latin1CharsZ(unsigned char *bytes, size_t length) : Base(bytes, length) {}
> +
> +    char *c_str() { return reinterpret_cast<char *>(get()); }

I almost feel like we should name this SpiderMonkey-style, but I have no idea what that would be, and meh anyway.  :-)

@@ +61,5 @@
> +/*
> + * SpiderMonkey uses a 2-byte character representation: it is a
> + * 2-byte-at-a-time view of a UTF-16 byte stream. This is similar to UCS-2,
> + * but unlike UCS-2, we do not strip UTF-16 extension bytes. This allows a
> + * sufficiently dedicated JavaScript program to be fully unicode aware by

Unicode-aware

@@ +83,5 @@
> +{
> +    typedef mozilla::RangedPtr<jschar> Base;
> +
> +  public:
> +    TwoByteCharsZ(jschar *chars, size_t length) : Base(chars, length) {}

This should assert chars[length] == '\0'.

@@ +101,5 @@
> +LossyTwoByteCharsToNewLatin1CharsZ(JSContext *cx, TwoByteChars tbchars);
> +
> +} // namespace JS
> +
> +inline void JS_free(const JS::Latin1CharsZ &ptr) { js_free((void*)ptr.get()); }

Hm.  I guess ptr.get() is const unsigned char* here, due to ptr being consty, ergo the need for the cast.  But should this really be taking a const&?  Freeing seems mutation-y enough that this should be the relatively rare case of a non-const&, seems to me.

::: js/src/jsapi-tests/testUTF8.cpp
@@ +11,5 @@
> +#include "js/CharacterEncoding.h"
> +
> +BEGIN_TEST(testUTF8_badUTF8)
> +{
> +    static const char* badUTF8 = "...\xC0...";

static const char badUTF8[]

@@ +23,5 @@
> +END_TEST(testUTF8_badUTF8)
> +
> +BEGIN_TEST(testUTF8_bigUTF8)
> +{
> +    static const char* bigUTF8 = "...\xFB\xBF\xBF\xBF\xBF...";

[] too

@@ +28,5 @@
> +    JSString *str = JS_NewStringCopyZ(cx, bigUTF8);
> +    CHECK(str);
> +    const jschar *chars = JS_GetStringCharsZ(cx, str);
> +    CHECK(chars);
> +    return true;

Presumably this should CHECK(chars[3] == 0x00FB) too?

@@ +38,5 @@
> +    static const jschar badSurrogate[] = { 'A', 'B', 'C', 0xDEEE, 'D', 'E', 0 };
> +    JS::TwoByteChars tbchars(badSurrogate, js_strlen(badSurrogate));
> +    JS::Latin1CharsZ latin1 = JS::LossyTwoByteCharsToNewLatin1CharsZ(cx, tbchars);
> +    CHECK(latin1);
> +    CHECK(latin1[3] == (unsigned char)0xee);

'\xEE'

::: js/src/jsapi.cpp
@@ -6206,5 @@
> -    /* jsd calls us with a NULL cx. Ugh. */
> -    if (cx) {
> -        AssertHeapIsIdle(cx);
> -        CHECK_REQUEST(cx);
> -    }

DIE DIE DIE DIE DIE

::: js/src/jsapi.h
@@ +5395,2 @@
>   * Encode string into a buffer. The function does not stores an additional
>   * zero byte. The function returns (size_t) -1 if the string can not be

Make that "can't" while you're in the area, or "cannot" if you're the formal type.

::: js/src/jscntxt.cpp
@@ +1013,5 @@
>                  JS_ASSERT(expandedArgs == argCount);
>                  *out = 0;
>                  js_free(buffer);
> +                TwoByteChars ucmsg(reportp->ucmessage,
> +                                   PointerRangeSize(const_cast<jschar *>(reportp->ucmessage), out));

Hmm.  I should throw some IsConvertible template-fu at PointerRangeSize to avoid this necessity at some point, perhaps.

const_cast<> is always a little scary-looking; mind using the safer-reading static_cast<const jschar*>(out) instead to get the job done?

::: js/src/jsinferinlines.h
@@ +290,5 @@
>       */
>      if (JSID_IS_STRING(id)) {
>          JSFlatString *str = JSID_TO_FLAT_STRING(id);
> +        TwoByteChars cp = str->range();
> +        if (JS7_ISDEC(cp[0]) || cp[0] == '-') {

This is still kind of relying on the JSFlatString-ness whence cp came, to do the [0] safely.  Something I guess we're going to come back and clean up later?

@@ +300,1 @@
>          }

This is interesting -- so we group all numeric keys as JSID_VOID?  Even something like '1234567890123456789012345678901234567890' that is never elementy?  Interesting.

::: js/src/vm/CharacterEncoding.cpp
@@ +13,5 @@
> +
> +Latin1CharsZ
> +JS::LossyTwoByteCharsToNewLatin1CharsZ(JSContext *cx, TwoByteChars tbchars)
> +{
> +    JS_ASSERT(cx);

Add an AssertNoGC nogc; here, or however you spell it.

@@ +17,5 @@
> +    JS_ASSERT(cx);
> +    size_t len = tbchars.length();
> +    unsigned char *latin1 = cx->pod_malloc<unsigned char>(len + 1);
> +    if (!latin1) {
> +        js_ReportOutOfMemory(cx);

You shouldn't be doing this -- cx->pod_malloc calls cx->runtime->pod_malloc, which reports.

@@ +21,5 @@
> +        js_ReportOutOfMemory(cx);
> +        return Latin1CharsZ();
> +    }
> +    for (size_t i = 0; i < len; ++i)
> +        latin1[i] = (unsigned char)tbchars[i];

static_cast<>

::: js/src/vm/String.h
@@ +476,5 @@
>          JS_ASSERT(JSString::isLinear());
>          return d.u1.chars;
>      }
> +
> +    JS::TwoByteChars range() const {

Eventually we'll want to make this be chars(), I think, but this is a good step for now.

Attachment #691080 - Flags: review?(jwalden+bmo) → review+

Terrence Cole [:terrence]

Assignee

Comment 17

•

12 years ago

Green try run at:
https://tbpl.mozilla.org/?tree=Try&rev=9d7d08a9f3ce

Pushed at:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6acc72608961
https://hg.mozilla.org/integration/mozilla-inbound/rev/f8c06ade84d6

Ed Morley [:emorley]

Comment 18

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/f8c06ade84d6
https://hg.mozilla.org/mozilla-central/rev/6acc72608961

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla21

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Updated

•

12 years ago

Depends on: 832103

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Updated

•

11 years ago

Depends on: 913883

Boris Chiou [:boris]

Updated

•

8 years ago

Blocks: 1276573

v0: character encoding 12 years ago Terrence Cole [:terrence] 36.68 KB, patch	Waldo : review-	Details \| Diff \| Splinter Review
v0: Range and RangedPtr fixes 12 years ago Terrence Cole [:terrence] 2.79 KB, patch		Details \| Diff \| Splinter Review
v1: Range and RangedPtr fixes 12 years ago Terrence Cole [:terrence] 1.73 KB, patch		Details \| Diff \| Splinter Review
v1: Range and RangedPtr fixes 12 years ago Terrence Cole [:terrence] 3.40 KB, patch	Waldo : review+	Details \| Diff \| Splinter Review
v1: Simplified. 12 years ago Terrence Cole [:terrence] 36.39 KB, patch		Details \| Diff \| Splinter Review
v3 12 years ago Terrence Cole [:terrence] 37.49 KB, patch		Details \| Diff \| Splinter Review
v4: this time remembering to qref 12 years ago Terrence Cole [:terrence] 38.31 KB, patch	Waldo : review+	Details \| Diff \| Splinter Review