Cosmetic fixes for the new StringObject class

VERIFIED FIXED

Status

Tamarin
Virtual Machine
VERIFIED FIXED
10 years ago
9 years ago

People

(Reporter: Michael Daumling, Assigned: Michael Daumling)

Tracking

Details

Attachments

(1 attachment)

26.71 KB, patch
Michael Daumling
: review+
Details | Diff | Splinter Review
(Assignee)

Description

10 years ago
This bug collects minor cosmetic fixes for the StringObject class.

Comment 1

10 years ago
Created attachment 351999 [details] [diff] [review]
Patch

Various cosmetic fixes to new-strings code, based on working them into a large codebase:

-- added StUTF8String and StUTF16String classes, for when a utf8/ or utf16 string is used in a temporary manner: the dtor optionally deletes the temporary (if optimizing for memory usage) or just nulls it and allows it to be gc'ed (if optimizing for performance). (Replaces the RegExp-specific UsesUTF8String class.)  Also modified various internal code paths to use it instead of toUTF8String()/toUTF16String() where appropriate.

-- modified various internal code paths to use StringIndexer where appropriate, rather than String::operator[]. (IMHO we should consider removing that operator to encourage use of StringIndexer.)

-- changed String::charAt() methods to match operator[] in return type so they are interchangeable.

-- added AvmCore::newString(wchar*, len) utility method

-- added default-parameter values to AvmCore::internAllocUtf8() and String::indexOf() to improve backwards compatibility.
Attachment #351999 - Flags: review?(daumling)
(Assignee)

Updated

10 years ago
Attachment #351999 - Flags: review?(daumling) → review+
(Assignee)

Comment 2

10 years ago
Comment on attachment 351999 [details] [diff] [review]
Patch

The StUTFxxClasses are very nice - leaving the decision to memory usage vs performance to the user is great. The idea to declare CharAtType for the result of [] and charAt() is so obvious - another great catch.

DomainClass.cpp, lines 121-131: This could also have been a simple lastIndexOf (core->cachedChars ['.']), right?

Comment 3

10 years ago
> DomainClass.cpp, lines 121-131: This could also have been a simple lastIndexOf
> (core->cachedChars ['.']), right?

Yep, I'll make that change. FYI, we could use a lastIndexOf() variant that allows a string-literal (as we have with indexOf()) -- would you like to consider adding that in a followup?

Comment 4

10 years ago
Heh, actually, that reveals a subtle bug in String::lastIndexOf -- default value for offset should be 0x7fffffff rather than 0, in order to match String.as. (otherwise, calling lastIndexOf with the default arg always returns -1) I'll fix this prior to pushing.

Comment 5

10 years ago
pushed to redux as changeset:   1214:8113a26b70aa

Comment 6

10 years ago
Two more things that would be helpful to add:

-- lastIndexOf() with literal string (as with indexOf())
-- a way to copy UTF16 (or UTF8) into a buffer without having to instantiate the full UTF16String... can be simulated with StringIndexer and a loop, but a single call do this would be handy in many places.

Comment 7

10 years ago
I also wonder if we shouldn't rename the two String::create() methods to something else, to emphasize they DON'T take UTF8/UTF16. I mistakenly used them in that fashion.

Comment 8

10 years ago
(In reply to comment #7)
> I also wonder if we shouldn't rename the two String::create() methods to
> something else, to emphasize they DON'T take UTF8/UTF16. I mistakenly used them
> in that fashion.

So if I understand correctly, the 16-bit version of String::create() should probably be called String::createUCS2(), and the 8-bit version String::createLatin1(). (The first 256 codepoints of Unicode are an exact match for Latin1, right?)

Given this, we're not using them entirely correctly internally.... ByteArrayGlue and FileClass read UTF16 but use create() to fill in the string; you'd only notice it's wrong for non-BMP text, which is easy for most of the active coders here to miss.

(IMHO we should probably also move that code for UTF16LE and UF16BE into String itself, so that getData() can be made truly private.)

Comment 9

10 years ago
Upon further reflection, I'm thinking that createUCS2() should simply be eradicated. Does it offer a performance benefit sufficient to outweight the risk of using it incorrectly (ie when createUTF16 should be used instead)?

Comment 10

10 years ago
(In reply to comment #9)
> Upon further reflection, I'm thinking that createUCS2() should simply be
> eradicated. Does it offer a performance benefit sufficient to outweight the
> risk of using it incorrectly (ie when createUTF16 should be used instead)?

Changes turned out to be substantive rather than cosmetic, so offered to https://bugzilla.mozilla.org/show_bug.cgi?id=465506 instead of here.
(Assignee)

Comment 11

10 years ago
We should also

1) remove String::operator[] to emphasize the usage of charAt() and StringIndexer

2) Investigate the usage of StringBuffer and StringOutputStream. IMHO, these classes could be collapsed or even removed and replaced with String::appendXXX().
(Assignee)

Comment 12

10 years ago
Steven further suggested that we add methods that copy character data into a buffer to avoid excessive creation of UTFxxStrings.

Proposed APIs:

int32_t fillLatin1Buffer (char* buffer, int32_t length);
int32_t fillUTF8Buffer (utf8_t* buffer, int32_t length);
int32_t fillUTF16Buffer (wchar* buffer, int32_t length);

No NUL added - people can do so themselves if they wish.

Return value: The number of characters copied. If < expected value, then the characters do not fit (i.e. fillLatin1Buffer() detected 16/32-bit characters).

Comment 13

10 years ago
(In reply to comment #11)
> We should also
> 
> 1) remove String::operator[] to emphasize the usage of charAt() and
> StringIndexer

I would only agree with this if it doesn't muddle Unicode worries with string worries.  as a litmus tests, the core string class should have no unicode types or methods in it, unless severely commented as to why they aren't in unicode-specific helpers.

specifically, the new CharAtType vs the old uint32_t, and this comment above, cause me to worry we're losing separation of concerns.

> 2) Investigate the usage of StringBuffer and StringOutputStream. IMHO, these
> classes could be collapsed or even removed and replaced with
> String::appendXXX().

+1

Comment 14

10 years ago
> I would only agree with this if it doesn't muddle Unicode worries with string
> worries.  as a litmus tests, the core string class should have no unicode types
> or methods in it, unless severely commented as to why they aren't in
> unicode-specific helpers.

You should definitely review and comment on the latest patch for 465506 (that I haven't landed) -- String was already "muddled" and this muddles it much further. But IMHO that is a good thing; why do we want to remove Unicode-ness from String?

Comment 15

10 years ago
We talked about this extensively at the outset of the design of the new string code, I don't want to revisit it here unless there's new evidence to consider.  the core logic for manipulating arrays of 8/16/32 bit values should be encapsulated, and the additional logic for treating such arrays as various kinds of unicode encoding (utf8/16le/16be/32/ucs2) should be distinct and encapsulated.

the only caveats ought to be where performance is a concern.  

the reason why they should be separated, include:

* the core memory management problem is agnostic of encoding
* its too easy to mix up encodings
* assumptions about whether surrogate pairs are one index or two, should not be baked into the core class, because they will differ over time and configuration, compatibility mode, etc.

Comment 16

10 years ago
> > 1) remove String::operator[] to emphasize the usage of charAt() and
> > StringIndexer
> 
> I would only agree with this if it doesn't muddle Unicode worries with string
> worries. 

charAt() and [] are already exact equivalents; this just makes it more obvious when you're using [] on String directly when you should be using StringIndexer.

Comment 17

10 years ago
(In reply to comment #15)
> We talked about this extensively at the outset of the design of the new string
> code, I don't want to revisit it here unless there's new evidence to consider. 

IMHO there is new evidence to consider in light of the mistakes that we found that are easy to make... we could go back to a separation of concerns by moving the UTF related calls elsewhere (eg into the now-defunct StringUtils), but it will take a day or two of rethinking. Michael, your thoughts?

Comment 18

10 years ago
Also, keep in mind that some muddling is going to be avoidable only at the cost of possible misuse. e.g., say we have

    String::append(char*)

by the rules of the game you state, this would append a strict sequence of 8-bit characters, which really only makes sense as Latin1... if you passed a UTF8 string, each byte would be appended as a codepoint, resulting in the definitely-wrong answer. But it's pretty easy to do this without realizing it (our existing code did so). Likewise for the single-byte String::create() method: it's certainly *possible* to use it correctly and safely, but it required extra diligence and is error-prone.
(Assignee)

Comment 19

10 years ago
I can understand Ed's concerns in the light of cleanness of code. OTOH, it is probably easier to have all String-related APIs in one header file. IMHO, there are no technical reasons why APIs should be in separate files. And, given the mess the String integration caused, I am currently reluctant to again land a patch that touches several files unless mandated by a very good reason. Maybe we should keep the issue somewhere in our records, and revisit it during a general round of clean-up.

Comment 20

10 years ago
(In reply to comment #15)
> * the core memory management problem is agnostic of encoding

agreed

> * its too easy to mix up encodings

agreed, except this is my point: currently the APIs are IMHO too easy to misuse WRT encoding.

> * assumptions about whether surrogate pairs are one index or two, should not be
> baked into the core class, because they will differ over time and
> configuration, compatibility mode, etc.

now this is a point that is my biggest point of concern, as current ECMA compliance actually requires being able to create "malformed" strings (eg String.fromCharCode(0xd800).charCodeAt(0) == 0xd800).
(Assignee)

Comment 21

10 years ago
Performance improvements: 

1) replace core->concatStrings(s1, s2) with s1->append(s2) if  both are non-NULL

2) Especially: s1 = core->concatStrings(s1, core->newStringLatin1Constant("text") ==> s1->appendLatin1("text"); and s = core->concatStrings(s, core->newStringUTF16(buffer, len) ==> s = s->append16(buffer, len);

3) Try to replace single-character string creation with ASCII with core->cachedChars['c'], you will get them anyway. Best example:

if (core->intern(P) == core->internConstantStringLatin1("0")) ==> if (core->intern(P) == core->cachedChars['0'])

Comment 22

10 years ago
> 3) Try to replace single-character string creation with ASCII with
> core->cachedChars['c'], you will get them anyway. Best example:

agreed, but even better would be to offer an API like

    Stringp AvmCore::singleCharString(wchar c)

so we can make cachedChars private.
(Assignee)

Comment 23

10 years ago
We do have String::createLatin1 that does this for you already. It would even hand out a dynamic string if the character was not ASCII.

Comment 24

10 years ago
I think this bug is probably resolved at this point. Any open issues?

Comment 25

9 years ago
Nope, closing
Status: ASSIGNED → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED

Comment 26

9 years ago
Resolved fixed engineering / work item that has been pushed.  Setting status to verified.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.