Closed Bug 468472 Opened 16 years ago Closed 15 years ago

Cosmetic fixes for the new StringObject class

Categories

(Tamarin Graveyard :: Virtual Machine, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: daumling, Assigned: daumling)

References

Details

Attachments

(1 file)

This bug collects minor cosmetic fixes for the StringObject class.
Attached patch: Patch (Splinter Review)
Various cosmetic fixes to new-strings code, based on working them into a large codebase:

-- added StUTF8String and StUTF16String classes, for when a UTF-8 or UTF-16 string is used in a temporary manner: the dtor optionally deletes the temporary (if optimizing for memory usage) or just nulls it and allows it to be gc'ed (if optimizing for performance). (Replaces the RegExp-specific UsesUTF8String class.) Also modified various internal code paths to use these instead of toUTF8String()/toUTF16String() where appropriate.

-- modified various internal code paths to use StringIndexer where appropriate, rather than String::operator[]. (IMHO we should consider removing that operator to encourage use of StringIndexer.)

-- changed String::charAt() methods to match operator[] in return type so they are interchangeable.

-- added AvmCore::newString(wchar*, len) utility method

-- added default-parameter values to AvmCore::internAllocUtf8() and String::indexOf() to improve backwards compatibility.
Attachment #351999 - Flags: review?(daumling)
Attachment #351999 - Flags: review?(daumling) → review+
Comment on attachment 351999 [details] [diff] [review]
Patch

The StUTFxx classes are very nice -- leaving the decision between memory usage and performance to the user is great. The idea of declaring CharAtType for the result of [] and charAt() is so obvious -- another great catch.

DomainClass.cpp, lines 121-131: This could also have been a simple lastIndexOf(core->cachedChars['.']), right?
> DomainClass.cpp, lines 121-131: This could also have been a simple lastIndexOf
> (core->cachedChars ['.']), right?

Yep, I'll make that change. FYI, we could use a lastIndexOf() variant that allows a string-literal (as we have with indexOf()) -- would you like to consider adding that in a followup?
Heh, actually, that reveals a subtle bug in String::lastIndexOf -- the default value for offset should be 0x7fffffff rather than 0, in order to match String.as. (Otherwise, calling lastIndexOf with the default arg always returns -1.) I'll fix this prior to pushing.
pushed to redux as changeset:   1214:8113a26b70aa
Two more things that would be helpful to add:

-- lastIndexOf() with literal string (as with indexOf())
-- a way to copy UTF16 (or UTF8) into a buffer without having to instantiate the full UTF16String... can be simulated with StringIndexer and a loop, but a single call to do this would be handy in many places.
I also wonder if we shouldn't rename the two String::create() methods to something else, to emphasize that they DON'T take UTF8/UTF16. I mistakenly used them in that fashion.
(In reply to comment #7)
> I also wonder if we shouldn't rename the two String::create() methods to
> something else, to emphasize they DON'T take UTF8/UTF16. I mistakenly used them
> in that fashion.

So if I understand correctly, the 16-bit version of String::create() should probably be called String::createUCS2(), and the 8-bit version String::createLatin1(). (The first 256 codepoints of Unicode are an exact match for Latin1, right?)

Given this, we're not using them entirely correctly internally.... ByteArrayGlue and FileClass read UTF16 but use create() to fill in the string; you'd only notice it's wrong for non-BMP text, which is easy for most of the active coders here to miss.

(IMHO we should probably also move that code for UTF16LE and UF16BE into String itself, so that getData() can be made truly private.)
Upon further reflection, I'm thinking that createUCS2() should simply be eradicated. Does it offer a performance benefit sufficient to outweigh the risk of using it incorrectly (i.e. when createUTF16 should be used instead)?
(In reply to comment #9)
> Upon further reflection, I'm thinking that createUCS2() should simply be
> eradicated. Does it offer a performance benefit sufficient to outweigh the
> risk of using it incorrectly (ie when createUTF16 should be used instead)?

Changes turned out to be substantive rather than cosmetic, so offered to https://bugzilla.mozilla.org/show_bug.cgi?id=465506 instead of here.
We should also

1) remove String::operator[] to emphasize the usage of charAt() and StringIndexer

2) Investigate the usage of StringBuffer and StringOutputStream. IMHO, these classes could be collapsed or even removed and replaced with String::appendXXX().
Steven further suggested that we add methods that copy character data into a buffer to avoid excessive creation of UTFxxStrings.

Proposed APIs:

int32_t fillLatin1Buffer (char* buffer, int32_t length);
int32_t fillUTF8Buffer (utf8_t* buffer, int32_t length);
int32_t fillUTF16Buffer (wchar* buffer, int32_t length);

No NUL is added -- callers can add one themselves if they wish.

Return value: the number of characters copied. If less than the expected value, the characters did not fit (i.e. fillLatin1Buffer() detected 16/32-bit characters).
(In reply to comment #11)
> We should also
> 
> 1) remove String::operator[] to emphasize the usage of charAt() and
> StringIndexer

I would only agree with this if it doesn't muddle Unicode worries with string worries. As a litmus test, the core string class should have no Unicode types or methods in it, unless severely commented as to why they aren't in Unicode-specific helpers.

Specifically, the new CharAtType vs the old uint32_t, and this comment above, cause me to worry we're losing separation of concerns.

> 2) Investigate the usage of StringBuffer and StringOutputStream. IMHO, these
> classes could be collapsed or even removed and replaced with
> String::appendXXX().

+1
> I would only agree with this if it doesn't muddle Unicode worries with string
> worries. As a litmus test, the core string class should have no Unicode types
> or methods in it, unless severely commented as to why they aren't in
> unicode-specific helpers.

You should definitely review and comment on the latest patch for 465506 (that I haven't landed) -- String was already "muddled" and this muddles it much further. But IMHO that is a good thing; why do we want to remove Unicode-ness from String?
We talked about this extensively at the outset of the design of the new string code; I don't want to revisit it here unless there's new evidence to consider. The core logic for manipulating arrays of 8/16/32-bit values should be encapsulated, and the additional logic for treating such arrays as various kinds of Unicode encoding (utf8/16le/16be/32/ucs2) should be distinct and encapsulated.

The only caveat ought to be where performance is a concern.

The reasons they should be separated include:

* the core memory management problem is agnostic of encoding
* it's too easy to mix up encodings
* assumptions about whether surrogate pairs are one index or two should not be baked into the core class, because they will differ over time and configuration, compatibility mode, etc.
> > 1) remove String::operator[] to emphasize the usage of charAt() and
> > StringIndexer
> 
> I would only agree with this if it doesn't muddle Unicode worries with string
> worries. 

charAt() and [] are already exact equivalents; this just makes it more obvious when you're using [] on String directly when you should be using StringIndexer.
(In reply to comment #15)
> We talked about this extensively at the outset of the design of the new string
> code, I don't want to revisit it here unless there's new evidence to consider. 

IMHO there is new evidence to consider in light of the mistakes we found, which are easy to make... we could restore a separation of concerns by moving the UTF-related calls elsewhere (e.g. into the now-defunct StringUtils), but it would take a day or two of rethinking. Michael, your thoughts?
Also, keep in mind that some muddling is going to be avoidable only at the cost of possible misuse. e.g., say we have

    String::append(char*)

by the rules of the game you state, this would append a strict sequence of 8-bit characters, which really only makes sense as Latin-1... if you passed a UTF-8 string, each byte would be appended as a codepoint, resulting in a definitely-wrong answer. But it's pretty easy to do this without realizing it (our existing code did so). Likewise for the single-byte String::create() method: it's certainly *possible* to use it correctly and safely, but it requires extra diligence and is error-prone.
I can understand Ed's concerns in light of code cleanliness. OTOH, it is probably easier to have all String-related APIs in one header file; IMHO, there are no technical reasons why the APIs should be in separate files. And, given the mess the String integration caused, I am currently reluctant to land another patch that touches several files unless mandated by a very good reason. Maybe we should keep the issue somewhere in our records and revisit it during a general round of clean-up.
(In reply to comment #15)
> * the core memory management problem is agnostic of encoding

agreed

> * it's too easy to mix up encodings

agreed, except this is my point: currently the APIs are IMHO too easy to misuse WRT encoding.

> * assumptions about whether surrogate pairs are one index or two, should not be
> baked into the core class, because they will differ over time and
> configuration, compatibility mode, etc.

This is my biggest point of concern: current ECMA compliance actually requires being able to create "malformed" strings (e.g. String.fromCharCode(0xd800).charCodeAt(0) == 0xd800).
Performance improvements: 

1) replace core->concatStrings(s1, s2) with s1->append(s2) if both are non-NULL

2) Especially: s1 = core->concatStrings(s1, core->newStringLatin1Constant("text")) ==> s1->appendLatin1("text"); and s = core->concatStrings(s, core->newStringUTF16(buffer, len)) ==> s = s->append16(buffer, len);

3) Try to replace creation of single-character ASCII strings with core->cachedChars['c']; you will get them anyway. Best example:

if (core->intern(P) == core->internConstantStringLatin1("0")) ==> if (core->intern(P) == core->cachedChars['0'])
> 3) Try to replace creation of single-character ASCII strings with
> core->cachedChars['c']; you will get them anyway. Best example:

agreed, but even better would be to offer an API like

    Stringp AvmCore::singleCharString(wchar c)

so we can make cachedChars private.
We do have String::createLatin1, which does this for you already. It would even hand out a dynamic string if the character was not ASCII.
I think this bug is probably resolved at this point. Any open issues?
Nope, closing
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Resolved fixed engineering / work item that has been pushed.  Setting status to verified.
Status: RESOLVED → VERIFIED