Last Comment Bug 28221 - [tracking bug] profile string usage; deploy new implementations where appropriate
: [tracking bug] profile string usage; deploy new implementations where appropr...
Status: RESOLVED WONTFIX
HELPWANTED
: helpwanted, perf
Product: Core
Classification: Components
Component: String (show other bugs)
: Trunk
: All All
: -- normal with 1 vote (vote)
: ---
Assigned To: jag (Peter Annema)
: jag (Peter Annema)
Mentors:
: 26435 28842 (view as bug list)
Depends on: 16108 40140 46738 53065 53209 65219 67876 69872 70143 70740 74726 74985
Blocks: 70090
  Show dependency treegraph
 
Reported: 2000-02-17 11:18 PST by Warren Harris
Modified: 2009-04-14 22:11 PDT (History)
13 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
nsStringStats.h (558 bytes, text/plain)
2000-02-24 20:42 PST, Chris Waterson
no flags Details
diffs to xpcom/ds to implement string stats (14.08 KB, patch)
2000-02-24 20:43 PST, Chris Waterson
no flags Details | Diff | Review
better diffs. (15.92 KB, patch)
2000-02-24 23:46 PST, Chris Waterson
no flags Details | Diff | Review
Here is a patch containing the changes I made to measure COW efficacy (7.72 KB, patch)
2000-02-28 04:08 PST, Scott Collins
no flags Details | Diff | Review
diffs for mutated/unmutated accounting (in addition to Chris' diffs) (39.32 KB, patch)
2000-02-28 11:56 PST, Warren Harris
no flags Details | Diff | Review
attaching data <dougt@netscape.com> generated... (133.10 KB, text/plain)
2000-07-28 18:09 PDT, Scott Collins
no flags Details
more <dougt@netscape.com> data... (181.48 KB, text/plain)
2000-07-28 18:10 PDT, Scott Collins
no flags Details
Here's the patch Doug used to generate this data... (3.12 KB, patch)
2000-07-28 18:12 PDT, Scott Collins
no flags Details | Diff | Review

Description Warren Harris 2000-02-17 11:18:30 PST
I still believe that an immutable nsIString interface coupled with appropriate 
implementations could be a huge win for us in terms of both space and time.  
There would need to be at least 4 implementations to make this work:

- nsUnicharString for double-byte encoding,
- nsCString for single-byte encoding,
- nsSubString would manage a lengh and offset into another nsIString to avoid 
copying,
- nsConcatenatedString would manage a sequence of nsIStrings, treating them as a 
single concatenated string.

To determine whether my hypothesis is correct, I think we can instrument 
nsString and nsCString to gather statistics that indicate how many copies of 
strings we make in the process of running our app. Specifically:

- Count the number of times each nsString constructs a char/PRUnichar array. 
This is often done when passing them to IDL-generated interfaces. (This number 
could be completely eliminated with nsIString.) [ToNewString, ToNewCString, 
ToNewUnicode, ToCString]

- Count the number of times we construct nsStrings from char/PRUnichar arrays. 
This is often done when we want to manipulate strings that come in from 
IDL-generated interfaces. (Some number of these could be eliminated with 
nsIString.) (How can we break down first-time constructions from copies - 
histogram?)

- Count the number of times we assign the character sequence in a string. Also 
count the percentage of strings which are actually assigned. (This number would 
indicate the number of additional nsIStrings which would need to be created due 
to immutability.) [SetString, Assign, operator=]

- Count the number of substring operations done on nsStrings. (This number could 
be replaced by an allocation of an nsSubString object.) [SetLength, Truncate, 
Trim, Left, Mid, Right, Cut]

- Count the number of concatenation operations done on nsStrings. Also count the 
percentage of strings which are concatenated. (This number could be replaced by 
an allocation of an nsConcatenatedString object, saving space.) [operator+, 
operator+=, Append, Insert]

- Count the number of mutation operations done on nsStrings. Also count the 
percentage of strings which are mutated. (This number would indicate how often 
an actual string buffer (e.g. the existing nsString implementation) would 
continue to be needed.) [SetCharAt, ToLowerCase, ToUpperCase, StripChars, 
StripWhitespace, ReplaceChar, ReplaceSubstring, CompressSet, CompressWhitespace]

Once we have these counts, we can see can do some analysis to determine what 
sort of ramifications nsIString might have: Will we allocate far fewer strings 
because they're shared more? Will we need to allocate far more substring objects 
because they're mutated too often? What sort of space might we expect to save 
due to more sharing. What sort of space might we expect to loose due to more 
copies made as a result of mutation.

Right now we're in the dark.
Comment 1 Suresh Duddi (gone) 2000-02-17 15:59:02 PST
Just counting constructors wont be good I think. We should factor in how many 
are destroyed to get a figure of how many will exist. That will influence the 
space issue.
Comment 2 Chris Waterson 2000-02-17 16:04:39 PST
n.b. that vidur & troy are exploring some kind of BSTR-like stuff to reduce 
copies in layout. Not sure if their stuff would cross interface boundaries...
Comment 3 rickg 2000-02-17 16:11:09 PST
adding myself to cc.
Comment 4 Doug Turner (:dougt) 2000-02-22 09:28:41 PST
I have wanted this for a long while.  With this we could get rid of the 
nsXPIDLCString.  Adding myself to cc.
Comment 5 Daniel Veditz [:dveditz] 2000-02-22 10:41:09 PST
We could get rid of *some* uses of nsXPIDL(c)String -- those where the string 
is immutable. If the caller is going to manipulate the string anyway then it 
doesn't buy anything.
Comment 6 Scott Collins 2000-02-22 13:29:57 PST
Another plan for getting rid of |nsXPIDL[C]String| is to roll its functionality 
into |nsString|.  See related bugs

  <http://bugzilla.mozilla.org/show_bug.cgi?id=28846> -- alecf
  <http://bugzilla.mozilla.org/show_bug.cgi?id=28841> -- scc
Comment 7 Chris Waterson 2000-02-24 11:46:52 PST
I'm tinkering with instrumenting nsStr with the stack walking code n' stuff to
see what intersting statistics I can produce.
Comment 8 Chris Waterson 2000-02-24 20:41:40 PST
Ok, made a first cut at gathering some stats. It's not everything that you asked
for, warren, but most of the ones that were really easy to pick up. The data is
below for simple startup & shutdown with www.cnn.com as the homepage.

For each operation, I captured the number of times the operation occurred, the
total number of characters (one- or two-byte) that were involved in the
operation, and the mean and standard deviation of the number of characters per
operation (cuz I knew warren'd ask).

For ctors and Assign operations, I tried to deduce when copy-on-write (COW)
sharing would and would not occur; e.g., "nsString::nsString(const nsStr&) COW"
indicates that the incoming nsStr's buffer could be shared. "NOCOW" means that
the incoming nsStr's buffer was incompatible, and would need to be inflated to
two-byte.

ns[C]String::Append(const nsStr&) sometimes needs to inflate[deflate] the
inbound nsStr, so I factored out INFL[DEFL] appends from normal appends. (This
may be indicative of how useful a segmented buffer implementation would be.)

                                                      ------ Characters ------
Operation                                       Count   Total    Mean   StdDev
nsCString::Append(char)                          3081    3081       1 +/-    0
nsCString::Append(const char*)                  33624  704152      21 +/-   85
nsCString::Append(const nsCString&)              1317   17939      14 +/-   11
nsCString::Append(const nsStr&) DEFL             1213   22236      18 +/-   17
nsCString::Assign(const PRUnichar*)                62    3553      57 +/-   19
nsCString::Assign(const char*)                   2139   36108      17 +/-   22
nsCString::Assign(const nsStr&) COW                85    1720      20 +/-    4
nsCString::Assign(const nsStr&) NOCOW            7217  551108      76 +/-   48
nsCString::Cut()                                  570    5015       9 +/-    5
nsCString::ToNewCString                          7494  357502      48 +/-   26
nsCString::ToNewUnicode                            64    3559      56 +/-   21
nsCString::nsCString()                          35033       0       0 +/-    0
nsCString::nsCString(const PRUnichar*)             62    3553      57 +/-   19
nsCString::nsCString(const char*)                 105    5971      57 +/-   68
nsCString::nsCString(const nsCString&) COW       7655  364331      48 +/-   26
nsCString::nsCString(const nsStr&) NOCOW         6896  230701      33 +/-   36
nsString::Append(PRUnichar)                    153883  153883       1 +/-    0
nsString::Append(char)                            982     982       1 +/-    0
nsString::Append(const PRUnichar*)              31627  676237      21 +/-  202
nsString::Append(const char*)                   32759  816672      25 +/-  409
nsString::Append(const nsStr&)                  11694   71151       6 +/-    8
nsString::Append(const nsStr&) INFL               236     290       1 +/-    1
nsString::Append(const nsString&)               19554  292897      15 +/-   25
nsString::Cut()                                  4897   68723      14 +/-  228
nsString::Insert(PRUnichar)                      1383    1383       1 +/-    0
nsString::Insert(const char*)                      70     140       2 +/-    0
nsString::SetCharAt()                            4591    4591       1 +/-    0
nsString::ToNewCString                           2351   56711      24 +/-   23
nsString::ToNewUTF8String                        4204  169205      40 +/-   41
nsString::ToNewUnicode                           3674   56392      15 +/-   13
nsString::nsString()                           133552       0       0 +/-    0
nsString::nsString(const PRUnichar*)             1055   14819      14 +/-    8
nsString::nsString(const char*)                  1953  253280     130 +/- 1671
nsString::nsString(const nsStr&) NOCOW            230    6670      29 +/-   52
nsString::nsString(const nsString&) COW         10843  222664      21 +/-  526
Comment 9 Chris Waterson 2000-02-24 20:42:19 PST
Created attachment 5736 [details]
nsStringStats.h
Comment 10 Chris Waterson 2000-02-24 20:43:26 PST
Created attachment 5737 [details] [diff] [review]
diffs to xpcom/ds to implement string stats
Comment 11 Chris Waterson 2000-02-24 20:43:49 PST
Attached extra files and patches required to gather statistics.
Comment 12 Chris Waterson 2000-02-24 23:46:24 PST
Created attachment 5739 [details] [diff] [review]
better diffs.
Comment 13 Warren Harris 2000-02-25 00:28:55 PST
Here's another run, with more functions accounted for, and visiting more sites:

                                                      ------ Characters ------
Operation                                       Count   Total    Mean   StdDev
nsCString::Append(char)                          8600    8600       1 +/-    0
nsCString::Append(const char*)                 157749 2465740      16 +/-   51
nsCString::Append(const nsCString&)              2348   27977      12 +/-   10
nsCString::Append(const nsStr&) DEFL             3549   59550      17 +/-   17
nsCString::Assign(const PRUnichar*)               347   15652      45 +/-   14
nsCString::Assign(const char*)                   5762  185482      32 +/-   43
nsCString::Assign(const nsStr&) COW               152    2978      20 +/-    3
nsCString::Assign(const nsStr&) NOCOW           33221 2086625      63 +/-   45
nsCString::Cut()                                 3306   26758       8 +/-    5
nsCString::SetCharAt()                            196     196       1 +/-    0
nsCString::ToNewCString                         35072 1739748      50 +/-   30
nsCString::ToNewUnicode                           375   15521      41 +/-   18
nsCString::nsCString()                         133167       0       0 +/-    0
nsCString::nsCString(const PRUnichar*)            347   15652      45 +/-   14
nsCString::nsCString(const char*)                 496   57162     115 +/-   93
nsCString::nsCString(const nsCString&) COW      35300 1747820      50 +/-   30
nsCString::nsCString(const nsStr&) NOCOW        37316 1381001      37 +/-   42
nsString::Append(PRUnichar)                    215192  215192       1 +/-    0
nsString::Append(char)                           1996    1996       1 +/-    0
nsString::Append(const PRUnichar*)             146947 3483496      24 +/-  152
nsString::Append(const char*)                  186147 3901793      21 +/-  195
nsString::Append(const nsStr&)                  44255  318232       7 +/-   34
nsString::Append(const nsStr&) INFL               405    1578       4 +/-    4
nsString::Append(const nsString&)              100415 1785264      18 +/-   36
nsString::Assign(PRUnichar)                     27116   27116       1 +/-    0
nsString::Assign(char)                            322     322       1 +/-    0
nsString::Assign(const PRUnichar*)              28497  464960      16 +/-   52
nsString::Assign(const char*)                   88503 3087143      35 +/-  281
nsString::Assign(const nsStr&) COW             133837 1256708       9 +/-   35
nsString::Assign(const nsStr&) NOCOW            11782   83533       7 +/-   16
nsString::Cut()                                 31814  578205      18 +/-  249
nsString::Insert(PRUnichar)                      8674    8674       1 +/-    0
nsString::Insert(const char*)                    1461    2932       2 +/-    0
nsString::SetCharAt()                           41303   41303       1 +/-    0
nsString::ToNewCString                           8461  224481      27 +/-   36
nsString::ToNewUTF8String                       28337 1151229      41 +/-   43
nsString::ToNewUnicode                           3177   53286      17 +/-   12
nsString::nsString()                           705595       0       0 +/-    0
nsString::nsString(const PRUnichar*)             4675   83960      18 +/-   17
nsString::nsString(const char*)                 16266  494636      30 +/-  651
nsString::nsString(const nsStr&) NOCOW            490   39884      81 +/-  114
nsString::nsString(const nsString&) COW         33812  515520      15 +/-  329
TOTAL                                        2326782 27657905

Here, 12.7% of the characters fall into the COW category. On a previous run for 
just the mozilla.org page, I got 14.4%. Seems like we can safely assume 
that we can save >10% by doing COW.
Comment 14 Scott Collins 2000-02-25 00:37:52 PST
Three things to test:

  (1) put a flag in |nsStr| to simulate COW semantics ... "this is a reference"
      then charge subsequent mutators with the cost of an allocation/copy
      this will help us better determine the value of adding COW

  (2) count the number of times a string had mutators applied to it, this will
      help us better determine the value of adding, e.g., an |nsIImutableString|

  (3) put a time in the string, and whenever it changes size or when it gets
      destroyed, add its duration to a bucket for that size.  This will help
      us better determine the value of adding, e.g., arena based allocation
      for some capacities.
Comment 15 Chris Waterson 2000-02-25 01:50:01 PST
*** Bug 28842 has been marked as a duplicate of this bug. ***
Comment 16 Scott Collins 2000-02-25 03:02:23 PST
*** Bug 26435 has been marked as a duplicate of this bug. ***
Comment 17 Warren Harris 2000-02-25 03:49:25 PST
Interesting news! I implemented number (2) to determine how many strings are 
mutated. Here are the results:

Allocated strings = 833582
Mutated strings =   551607
Unmutated strings = 281252

That's over 50%!! This was for a run visiting about 10-15 pages, including 
tinderbox, a build log, cnn, yahoo, abcnews and others. 

If you add my numbers up you'll see that there are 700+ strings unaccounted 
for. This is because I only determine the number of mutated/unmutated strings 
when they're destroyed, so the remining ones must be leaks. 

Here's another longer run (60%):

Allocated strings = 1231811
Mutated strings =   777911
Unmutated strings = 467373

and here's one bringing up my mailbox (with 4000+ messages), and forwarding a 
message with a lot of extra typing added to it (39%):

Allocated strings = 2170221
Mutated strings =   1594744
Unmutated strings = 621191

Immutable strings should be a huge win!
Comment 18 troy 2000-02-25 08:29:15 PST
That's what I had informally determined as well, that immutable strings would 
be a big win
Comment 19 John Bandhauer 2000-02-25 09:17:19 PST
And that's not even counting char* and PRUnichar* strings that are manually 
alloc'd and are never in nsStrings. Right?
Comment 20 Scott Collins 2000-02-27 21:08:47 PST
I instrumented |ns[C]String| to determine the amount of character copying and the 
number of allocations that would be saved by implementing a Copy-On-Write [COW] 
mechanism without necessarily changing the current interface.  Here is a sample 
run, typical of my results to date.

       un-shared work:     15313192
             COW work:      1183682
un-shared allocations:       416265
      COW allocations:       460985 or about 10.74% more than un-shared 
allocations

Yes, we save a lot on copying characters, but we actually end up doing _more_ 
allocations.  The reason?  One explanation is callers copying strings and making 
small modifications in extant string variables explicitly for this purpose.  So, 
for example, I have a string variable into which I copy another string (the 
allocation and copying are deferred), but now I modify it (and am charged for the 
allocation, and some fraction of the copying, depending on the operation) and do 
something with it, like compare.  Then I copy another string into it (the current 
value is released, but the allocation and copy are deferred again), and then, as 
before, immediately make a change... now I'm charged for an allocation that I 
wouldn't have been in the non-COW implementation.

Interesting results.
Comment 21 Scott Collins 2000-02-28 04:08:25 PST
Created attachment 5842 [details] [diff] [review]
Here is a patch containing the changes I made to measure COW efficacy
Comment 22 Warren Harris 2000-02-28 11:56:16 PST
Created attachment 5856 [details] [diff] [review]
diffs for mutated/unmutated accounting (in addition to Chris' diffs)
Comment 23 rickg 2000-02-28 12:06:47 PST
One comment about SCC's analysis: the 2nd copy would not necessarily require a 
subsequent allocation. We can implement COW so that the underlying buffer is 
retained until the string is deleted or resized. The original buffer could be 
reused in the 2nd copy, so that the 2nd allocation (may not) be neccessary. Of 
course a great deal depends on the size of the strings being operated upon.
Comment 24 Scott Collins 2000-02-28 13:51:45 PST
With respect to the notion of a segmented string implementation: note that
|GetUnicode| is called in more than a thousand places.  |GetBuffer| is called 
less, but still quite a bit.

  <http://lxr.mozilla.org/seamonkey/search?string=GetUnicode>
  <http://lxr.mozilla.org/seamonkey/search?string=GetBuffer>

Both are obstacles to implementing a segmented string since callers expect the 
resulting pointer to point to the entire buffer, and they do math with it or pass 
it to things expecting an entire string.
Comment 25 rickg 2000-02-28 14:05:33 PST
I agree, Scott. It makes you wish we had iterators, doesn't it?
Comment 26 Warren Harris 2000-02-28 16:39:49 PST
I did some more analysis... of how many strings have GetBuffer or GetUnicode 
called for them. Here's the answer:

Allocated strings =  756941
Mutated strings =    521389 (68%)
Unmutated strings =  245669 (32%)
Contiguous buffers = 140836 (18%)

This was for visiting mozilla.org, cnn.com, abcnews.com, usatoday.com. 
GetBuffer and GetUnicode were only called for 18% of the strings. So I think a 
non-contiguous buffer implementation could still be a win. I didn't count how 
many times GetBuffer/GetUnicode were called for the same string, but that would 
be easy to add.


On another note... I must have been on crack when I reported the percentages 
for unmutated strings. For the 3 runs I listed above, the percentages are 
33.74%, 37.94% and 28.62% respectively. And for the above run, 32%. Still a 
win, although not quite as spectacular as first reported.
Comment 27 Warren Harris 2000-02-28 16:54:20 PST
P.S. My GetBuffer/GetUnicode analysis doesn't include places in the code that 
use mStr directly, so it's an upper bound. If there's a critical place in the 
code that uses mStr, then my numbers could be completely off.

Comment 28 Scott Collins 2000-02-29 00:05:22 PST
I am posting the following set of recommendations to this bug to keep external
developers informed of the direction in which we are heading.  The main players
in this impending change are already, as far as I know, all on the same page.
Rick Gessner <rickg@netscape.com> recently sent out his recommendations---which
I hope he will also post to this bug---which touch on the same themes.  Rick's
recommendations paint a fairly good picture of the new world.  I still think
the following evaluation is valuable, because it goes into detail as to what the
actual changes are/should-be, and _why_ those particular changes are important.


We want to move to a world where string clients can select from among a range of
implementations that embody different implementation strategies, e.g., a
general purpose string such as we have now, and specific-use implementations
like an immutable string that optimizes allocations, and a segmented string that
minimizes character copying over editing operations on very large datasets.

These new goals impose new requirements on our current string interfaces.  Any
changes we make to the current interface must be source compatible with extant
clients, or we must be willing to pay the penalty of updating callers.

Note: our new goals fall out of our experiences using strings in our
application.  They differ significantly from our original goals (which were all
about revealing the underlying implementation to clients for performance) and
so none of these recommendations can or should be taken as a criticism of the
current interface.

Specific recommendations fall into several categories (note: these are not the
the recommendations ... these are the categories).  Very roughly in order of
importance with respect to this effort:

  [A] removing from the interface any visible members that compromise the
      abstraction allowing different underlying representations, else clients
      won't be satisfied by alternate implementations

  [B] removing from the interface any routines that aren't specifically about
      manipulating the underlying representation, else alternate implementations
      must re-implement identical functionality

  [C] removing from the interface any i18n sensitive functionality, though
      mostly instances of this recommendation will be covered by [B] above

  [D] removing from the interface unused, unneeded, or unconventionally located
      functionality, to reduce the burden on alternate implementations, and to
      generally simplify

  [E] adding to the interface any support machinery needed to enable changes
      falling into one of the categories above, or simply to allow multiple
      implementations at all, e.g., |virtual|

I believe rickg and I are already very much in agreement on these points.  We
discussed them at length, and his recent message on redesign notes echoes these
sentiments.  It is clear from his recent email messages that he has been
focused on these same key issues.


Here are my specific recommendations:

   1  [A] Remove public inheritance from |nsStr|.  Access to a specific
      underlying buffer representation is prohibitive to alternate
      implementations, e.g., a segmented string et al.  It is also agreed that
      having any public data members is a political impediment to crossing
      XPCOM boundaries.  According to rickg, visible inheritance from |nsStr|
      is not exploited heavily, and should be easily removed.  This is arguably
      the most important thing we can do to enable further enhancements to our
      string implementations and uses.

   2  [A] |GetUnicode| and |GetBuffer| impose a prohibitive burden on
      implementations in a multiple implementation world.  As rickg points out,
      this is another reason to add iterators.  Unfortunately, these two
      routines are very heavily used.

   3  [E] Make the string interface abstract to allow multiple implementations.
      We were already paying for a vtable, so no extra space requirements are
      expected.  The performance impact should be minimal.

   4  [E] Split the abstract interface into layers encouraging read-only
      implementations, e.g., an immutable string

   5  [BCD] Make narrowing/widening an explicit operation done by constructors.
      Do not allow implicit conversion in append and assign operations.  Tests
      show that we are not exploiting the `double interface' of string very
      much, and this is good.  Note that like |ToUpperCase| (et al)
      functionality mentioned below, encoding conversions are properly in the
      domain of i18n, and duplicating the functionality at the low-level in
      string is suboptimal. 

   6  [BDE] Either remove operator overloading from the abstract interface, or
      implement it conventionally, that is: non-virtual inlines using only the
      abstract signatures for |Append| and |Assign|.  Implement |operator+=()|
      and |operator=()| as members; implement |operator+()| and relations as
      non-members.  Virtual assignment operators must be written carefully to
      avoid slicing.

   7  [BCDE] Remove |ToUpperCase|, |ToLowerCase|, |IsASCII|, |IsUnicode|,
      |IsSpace|, and |IsAlpha| from the interface.  Of these, only |ToLowerCase|
      is heavily used, and i18n functionality like this must be pushed up into
      the i18n layer, where, coincidentally, this functionality already happens
      to exist.

   8  [D] Remove (the little used) |ToNewString| from the interface.  This
      functionality is already available in the form of the copy-constructor.
      In a multiple implementation world, the user will typically need to
      select a specific implementation, in any case.

   9  [BCD] Remove |IsOrdered| and |BinarySearch| from the interface.  These are 
not
      general purpose routines, and can easily be implemented outside the
      string class if they are deemed still needed.

  10  [BCD] |EqualsIgnoreCase| and the |Compare| functions when the
      |aIgnoreCase| parameter is true are problematic just as the other i18n
      dependent routines are.  Unfortunately, these routines are very heavily
      used.  Again, they are a burden in a multiple implementation world.  They
      should be implemented as non-members (based on extant i18n facilities)
      that use iterators into the underlying string ... which also implies that
      we will need string iterators.

  11  [BCD] |ToFloat| and |ToInteger| should be removed from the interface.
      Parsing should not be part of the required functionality for multiple
      implementations.  Given iterators, this functionality could be moved to
      a non-member implementation, which, in any case, is again requires i18n
      sensitivity.  |ToFloat| is not heavily used.  |ToInteger| is.

      Similarly, the |Append|s that format a float or an integer are i18n
      dependent.  Some work may be required to provide similar functionality
      that is factored into the i18n support.

  12  [D] We probably don't need the power to say something like
  
        myStr.Assign(yourStr).Append(herStr).Cut(20, 15);

      It makes sense with operators, but we may want to simplify the client
      interface with respect to named member functions.  The |Assign|, |Append|,
      |Insert|, |Cut|, and |SetString|, signatures should be changed to return
      |void|.

  13  [BCD] Turn the specialized modification and accessor functions |Trim|,
      |CompressSet|, |StripChar|, |StripChars|, |StripWhitespace|,
      |ReplaceChar|, |ReplaceSubstring|, and |CountChar| into non-member
      `algorithms' that can be applied to any implementation.

  14  [DE] Given the current copying signatures of |Left|, |Right|, and |Mid|,
      they should probably be turned into non-member algorithms writing to an
      iterator as well.

  15  [E] Add iterators.  Several of the points above are eased or solved by the
      introduction of reasonable iterators.

  16  [AD] Eliminate |nsSubsume[C]Str|.  To much implementation knowledge is
      currently required to reasonably utilize this in clients, and it presents
      a burden to implementations to facilitate.
Comment 29 Scott Collins 2000-03-30 10:25:59 PST
This is my primary focus at the moment.
Comment 30 Scott Collins 2000-03-30 10:27:44 PST
fixing summary to better reflect our understanding
Comment 31 Scott Collins 2000-05-15 01:38:04 PDT
mass re-assigning to my new bugzilla account
Comment 32 Scott Collins 2000-05-16 00:57:15 PDT
Well, NEW_STRING_APIS is now switched on.  The factoring is accomplished.  And 
some new implementation exist to solve some problems.  We need a replacement for 
XPIDL string; we need a COW implementation; we need to deploy the new 
implementations.  I'm re-summarizing this bug for the work of measuring and 
deploying the new implementations.
Comment 33 leger 2000-05-24 06:05:15 PDT
Putting on nsbeta3 radar.  warren say we really need to get this in for PR3, per 
beta2 PDT reviews.
Comment 34 Scott Collins 2000-06-06 10:20:09 PDT
Simon, can you add your recent profiling work to this bug?

For everyone else, under discussion is the idea that `chunk' allocating strings 
has turned out to be a bigger source of wasted space than it has been a 
performance boon.

(oops, too many cc's, Bugzilla is making me remove one to add simon.  Sorry dp)
Comment 35 Simon Fraser 2000-06-06 11:49:20 PDT
Adding newsgroup postings on string usage:

            <news:sfraser-F78927.13360202062000@secnews1.mcom.com>
It seems that nsString::SetCapacity() always buffers the string
size in 64-character chunks - this logic lives down in nsStr::Alloc.
So:

   nsString foo;
   foo.SetCapacity(1);
   foo.Assign('a');

will eat up 128 bytes of heap space.

I have not found a way that I can set the capacity of an nsString
to exactly the length I know is needed. This of course has quite
an impact on bloat.

Seeing this leads me to question how often we actually need to
chunk changes in string length; what proportion of strings actually
change length during their lifetime? My guess is that it's < 50%,
which perhaps suggests that the normal behaviour should be to
not round up string sizes, and that we should have an API that
allows the caller to create a string with, or specify that an
existing string is likely to change length frequently from now
on.

          <news:sfraser-0720CE.17500902062000@secnews1.mcom.com>

Some data on the bloat that results from string chunking (recall,
bloat = total memory every allocated, not a runtime high-water mark).
Numbers are K.

Test              Allocated       Used        Waste    % waste
--------------------------------------------------------------
Simple browser      2938.00     1548.98     1389.01     47.28%
Complex browser     5839.17     3214.73     2624.44     44.95%
Mail                6232.82     3369.89     2862.93     45.93%

So this chunking almost doubles the amount of memory that our
strings use.

      <news:sfraser-D08810.16523102062000@secnews1.mcom.com>
In another post, waterson posed the question of how many strings with
identical contents are allocated, and whether we could use atoms for
these common strings. (He was, I think, talking about string usage in
a particular module/API, but the question can be generalized.)

So I put some debug code in nsStr::Destroy, that dumps out the contents
of strings just before they are deleted, if aDest.mOwnsBuffer == PR_TRUE
(which indicates that the buffer was heap-allocated). Some results are below.

These results can be used to find places is the code the might benefit from
shared strings, or cacheing of frequently used strings. Of course, I have
no data on call sites here.

The data look like this:

 848 1   dummy:path

'848' is the count (# strings with these contents), '1' is the character width
(1 for char, 2 for PRUnichar), and the rest is the string itself.

Test 1:
   Bring up browser, loading simple text-only HTML page, Quit.
   <http://www.smfr.org/mozilla/sortedstrings.txt>
 
 848 1   dummy:path
 510 2   true
 333 2   file:///Other%20stuff/Documents/Mozilla/Users50/Simon/localstore.rdf
 303 2   monospace
 278 2   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/xbl
 277 1   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/xbl
 255 2   geneva
 205 2   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/keymaster/gatekeeper/there.is.only.xul
 200 1   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/keymaster/gatekeeper/there.is.only.xul
 167 1   file:///Other%20stuff/Documents/Mozilla/Users50/Simon/localstore.rdf
 159 1   file:///Bleeding%20Edge/Mozilla%20tree/src/mozilla/dist/viewer_debug/
 135 2   ISO-8859-1
 123 2   UTF-8
 117 2   serif
  85 2   broadcaster
  79 2   menuitem
  78 2   menupopup
  73 2                 <string is a run of 28 spaces>
  71 2   vertical
  67 2   rdf:http://home.netscape.com/NC-rdf#Name

Test 2:
   Bring up browser, surf to mozilla.org, tinderbox, bugzilla, load a bug,
   open prefs dialog.
   <http://www.smfr.org/mozilla/sortedbrowser.txt>

1797 2   monospace
1698 2   true
1007 2   serif
 879 2   geneva
 848 1   dummy:path
 371 2   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/xbl
 370 1   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/xbl
 333 2   file:///Other%20stuff/Documents/Mozilla/Users50/Simon/localstore.rdf
 322 2   ISO-8859-1
 315 1   file:///Bleeding%20Edge/Mozilla%20tree/src/mozilla/dist/viewer_debug/
 266 2   vertical
 248 2   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/keymaster/gatekeeper/there.is.only.xul
 243 2   white
 236 1   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/keymaster/gatekeeper/there.is.only.xul
 176 2   \
 167 1   file:///Other%20stuff/Documents/Mozilla/Users50/Simon/localstore.rdf
 159 2   UTF-8
 128 1   css
 125 2   never
 121 2   black


Test 3:
   Bring up browser, open mail-news, load 2 large IMAP folders
   (including one of bugzilla mail)
   <http://www.smfr.org/mozilla/sortedmail.txt>

7084 2   UTF-8
4785 2   us-ascii
2809 1   mozilla.org
2808 1   bugzilla-daemon
1304 2   true
1034 2   component://netscape/intl/unicode/decoder?charset=x-imap4-modified-utf7
1032 2   x-imap4-modified-utf7
1029 2   never
 987 1   netscape.com
 957 2   geneva
 948 1   %S Receiving: message headers %lu of %lu
 947 1   sfraser
 941 2   Bugzilla
 894 2   monospace
 848 1   dummy:path
 500 2   file:///Other%20stuff/Documents/Mozilla/Users50/Simon/localstore.rdf
 488 2   autostretch
 392 2   menuitem
 358 2   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/xbl
 357 1   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/xbl

Comment 36 Scott Collins 2000-07-28 18:09:54 PDT
Created attachment 12100 [details]
attaching data <dougt@netscape.com> generated...
Comment 37 Scott Collins 2000-07-28 18:10:43 PDT
Created attachment 12101 [details]
more <dougt@netscape.com> data...
Comment 38 Scott Collins 2000-07-28 18:12:08 PDT
Created attachment 12102 [details] [diff] [review]
Here's the patch Doug used to generate this data...
Comment 39 Scott Collins 2000-07-28 18:29:44 PDT
Doug, I attached your patch and data to this bug ... which seems like the 
appropriate place.  What conclusions can we draw from this data?  There are 
certain very common strings, true, but this is not enough to know if they are 
candidates for being replaced with |nsShared[C]String|s, since we don't know how 
they were generated.  What do you think?  The other bug filed on this is bug 
#46738.  I commented there as well.  We should consider marking that bug either a 
duplicate or a blocker for this bug.
Comment 40 Scott Collins 2001-02-24 12:43:05 PST
marking dependencies, turning this [officially] into a tracking bug for
deploying new string implementations
Comment 41 Scott Collins 2002-11-03 12:49:16 PST
giving up ancient string bugs to the new string owner.  jag, you'll want to sort
through these and see which ones still apply and go with or against the
direction in which you intend strings evolve
Comment 42 jag (Peter Annema) 2009-04-14 22:11:56 PDT
This work is no longer relevant, strings have had a new implementation for a while now :-)

If the new strings code needs performance tuning, please file new bugs.

Note You need to log in before you can comment on or make changes to this bug.