Closed Bug 468472 Opened 16 years ago Closed 15 years ago

Cosmetic fixes for the new StringObject class

Categories

(Tamarin Graveyard :: Virtual Machine, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: daumling, Assigned: daumling)

References

Details

Attachments

(1 file)

This bug collects minor cosmetic fixes for the StringObject class.
Attached patch: Patch (Splinter Review)
Various cosmetic fixes to new-strings code, based on working them into a large codebase:

-- added StUTF8String and StUTF16String classes, for when a UTF-8 or UTF-16 string is used in a temporary manner: the dtor optionally deletes the temporary (if optimizing for memory usage) or just nulls it and allows it to be gc'ed (if optimizing for performance). (Replaces the RegExp-specific UsesUTF8String class.) Also modified various internal code paths to use these instead of toUTF8String()/toUTF16String() where appropriate.

-- modified various internal code paths to use StringIndexer where appropriate, rather than String::operator[]. (IMHO we should consider removing that operator to encourage use of StringIndexer.)

-- changed String::charAt() methods to match operator[] in return type so they are interchangeable.

-- added AvmCore::newString(wchar*, len) utility method

-- added default-parameter values to AvmCore::internAllocUtf8() and String::indexOf() to improve backwards compatibility.
Attachment #351999 - Flags: review?(daumling)
Attachment #351999 - Flags: review?(daumling) → review+
Comment on attachment 351999 [details] [diff] [review]
Patch

The StUTFxx classes are very nice -- leaving the decision between memory usage and performance to the user is great. The idea of declaring CharAtType for the result of [] and charAt() is so obvious -- another great catch.

DomainClass.cpp, lines 121-131: This could also have been a simple lastIndexOf(core->cachedChars['.']), right?
> DomainClass.cpp, lines 121-131: This could also have been a simple lastIndexOf
> (core->cachedChars ['.']), right?

Yep, I'll make that change. FYI, we could use a lastIndexOf() variant that allows a string-literal (as we have with indexOf()) -- would you like to consider adding that in a followup?
Heh, actually, that reveals a subtle bug in String::lastIndexOf -- the default value for offset should be 0x7fffffff rather than 0, in order to match String.as. (Otherwise, calling lastIndexOf with the default arg always returns -1.) I'll fix this prior to pushing.
pushed to redux as changeset:   1214:8113a26b70aa
Two more things that would be helpful to add:

-- lastIndexOf() with literal string (as with indexOf())
-- a way to copy UTF16 (or UTF8) into a buffer without having to instantiate the full UTF16String... can be simulated with StringIndexer and a loop, but a single call to do this would be handy in many places.
I also wonder if we shouldn't rename the two String::create() methods to something else, to emphasize that they DON'T take UTF8/UTF16. I mistakenly used them in that fashion.
(In reply to comment #7)
> I also wonder if we shouldn't rename the two String::create() methods to
> something else, to emphasize they DON'T take UTF8/UTF16. I mistakenly used them
> in that fashion.

So if I understand correctly, the 16-bit version of String::create() should probably be called String::createUCS2(), and the 8-bit version String::createLatin1(). (The first 256 codepoints of Unicode are an exact match for Latin1, right?)

Given this, we're not using them entirely correctly internally.... ByteArrayGlue and FileClass read UTF16 but use create() to fill in the string; you'd only notice it's wrong for non-BMP text, which is easy for most of the active coders here to miss.

(IMHO we should probably also move that code for UTF16LE and UF16BE into String itself, so that getData() can be made truly private.)
Upon further reflection, I'm thinking that createUCS2() should simply be eradicated. Does it offer a performance benefit sufficient to outweigh the risk of using it incorrectly (i.e. when createUTF16 should be used instead)?
(In reply to comment #9)
> Upon further reflection, I'm thinking that createUCS2() should simply be
> eradicated. Does it offer a performance benefit sufficient to outweigh the
> risk of using it incorrectly (ie when createUTF16 should be used instead)?

Changes turned out to be substantive rather than cosmetic, so offered to https://bugzilla.mozilla.org/show_bug.cgi?id=465506 instead of here.
We should also

1) remove String::operator[] to emphasize the usage of charAt() and StringIndexer

2) Investigate the usage of StringBuffer and StringOutputStream. IMHO, these classes could be collapsed or even removed and replaced with String::appendXXX().
Steven further suggested that we add methods that copy character data into a buffer to avoid excessive creation of UTFxxStrings.

Proposed APIs:

int32_t fillLatin1Buffer (char* buffer, int32_t length);
int32_t fillUTF8Buffer (utf8_t* buffer, int32_t length);
int32_t fillUTF16Buffer (wchar* buffer, int32_t length);

No NUL is added -- callers can add one themselves if they wish.

Return value: the number of characters copied. If less than the expected value, the characters did not fit (i.e. fillLatin1Buffer() detected 16/32-bit characters).
(In reply to comment #11)
> We should also
> 
> 1) remove String::operator[] to emphasize the usage of charAt() and
> StringIndexer

I would only agree with this if it doesn't muddle Unicode worries with string worries. As a litmus test, the core string class should have no Unicode types or methods in it, unless severely commented as to why they aren't in Unicode-specific helpers.

Specifically, the new CharAtType vs the old uint32_t, and this comment above, cause me to worry we're losing separation of concerns.

> 2) Investigate the usage of StringBuffer and StringOutputStream. IMHO, these
> classes could be collapsed or even removed and replaced with
> String::appendXXX().

+1
> I would only agree with this if it doesn't muddle Unicode worries with string
> worries. As a litmus test, the core string class should have no Unicode types
> or methods in it, unless severely commented as to why they aren't in
> unicode-specific helpers.

You should definitely review and comment on the latest patch for 465506 (that I haven't landed) -- String was already "muddled" and this muddles it much further. But IMHO that is a good thing; why do we want to remove Unicode-ness from String?
We talked about this extensively at the outset of the design of the new string code; I don't want to revisit it here unless there's new evidence to consider. The core logic for manipulating arrays of 8/16/32-bit values should be encapsulated, and the additional logic for treating such arrays as various kinds of Unicode encoding (utf8/16le/16be/32/ucs2) should be distinct and encapsulated.

The only caveat ought to be where performance is a concern.

The reasons they should be separated include:

* the core memory management problem is agnostic of encoding
* it's too easy to mix up encodings
* assumptions about whether surrogate pairs are one index or two should not be baked into the core class, because they will differ over time and configuration, compatibility mode, etc.
> > 1) remove String::operator[] to emphasize the usage of charAt() and
> > StringIndexer
> 
> I would only agree with this if it doesn't muddle Unicode worries with string
> worries. 

charAt() and [] are already exact equivalents; this just makes it more obvious when you're using [] on String directly when you should be using StringIndexer.
(In reply to comment #15)
> We talked about this extensively at the outset of the design of the new string
> code, I don't want to revisit it here unless there's new evidence to consider. 

IMHO there is new evidence to consider in light of the mistakes we found, which are easy to make... we could restore a separation of concerns by moving the UTF-related calls elsewhere (e.g. into the now-defunct StringUtils), but it would take a day or two of rethinking. Michael, your thoughts?
Also, keep in mind that some muddling is going to be avoidable only at the cost of possible misuse. e.g., say we have

    String::append(char*)

by the rules of the game you state, this would append a strict sequence of 8-bit characters, which really only makes sense as Latin-1... if you passed a UTF-8 string, each byte would be appended as a codepoint, resulting in a definitely-wrong answer. But it's pretty easy to do this without realizing it (our existing code did so). Likewise for the single-byte String::create() method: it's certainly *possible* to use it correctly and safely, but it requires extra diligence and is error-prone.
I can understand Ed's concerns in light of code cleanliness. OTOH, it is probably easier to have all String-related APIs in one header file; IMHO, there are no technical reasons why the APIs should be in separate files. And, given the mess the String integration caused, I am currently reluctant to land another patch that touches several files unless mandated by a very good reason. Maybe we should keep the issue somewhere in our records and revisit it during a general round of clean-up.
(In reply to comment #15)
> * the core memory management problem is agnostic of encoding

agreed

> * it's too easy to mix up encodings

agreed, except this is my point: currently the APIs are IMHO too easy to misuse WRT encoding.

> * assumptions about whether surrogate pairs are one index or two, should not be
> baked into the core class, because they will differ over time and
> configuration, compatibility mode, etc.

This is my biggest point of concern: current ECMA compliance actually requires being able to create "malformed" strings (e.g. String.fromCharCode(0xd800).charCodeAt(0) == 0xd800).
Performance improvements: 

1) replace core->concatStrings(s1, s2) with s1->append(s2) if both are non-NULL

2) Especially: s1 = core->concatStrings(s1, core->newStringLatin1Constant("text")) ==> s1->appendLatin1("text"); and s = core->concatStrings(s, core->newStringUTF16(buffer, len)) ==> s = s->append16(buffer, len);

3) Try to replace creation of single-character ASCII strings with core->cachedChars['c']; you will get them anyway. Best example:

if (core->intern(P) == core->internConstantStringLatin1("0")) ==> if (core->intern(P) == core->cachedChars['0'])
> 3) Try to replace creation of single-character ASCII strings with
> core->cachedChars['c']; you will get them anyway. Best example:

agreed, but even better would be to offer an API like

    Stringp AvmCore::singleCharString(wchar c)

so we can make cachedChars private.
We do have String::createLatin1, which does this for you already. It would even hand out a dynamic string if the character was not ASCII.
I think this bug is probably resolved at this point. Any open issues?
Nope, closing
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Resolved fixed engineering / work item that has been pushed.  Setting status to verified.
Status: RESOLVED → VERIFIED