28221 - [tracking bug] profile string usage; deploy new implementations where appropriate

Reporter

Description

•

25 years ago

I still believe that an immutable nsIString interface coupled with appropriate 
implementations could be a huge win for us in terms of both space and time.  
There would need to be at least 4 implementations to make this work:

- nsUnicharString for double-byte encoding,
- nsCString for single-byte encoding,
- nsSubString would manage a length and offset into another nsIString to avoid 
copying,
- nsConcatenatedString would manage a sequence of nsIStrings, treating them as a 
single concatenated string.
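
For illustration only, here is a minimal sketch of how the implementations listed above could share one read-only interface. Every name in it (ImmutableString, FlatString, SubString, ConcatenatedString, Flatten) is hypothetical and stands in for the proposed classes; the single-byte nsCString variant is left out for brevity, and std::shared_ptr is assumed in place of XPCOM reference counting.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical read-only interface; a stand-in for the proposed nsIString.
class ImmutableString {
public:
  virtual ~ImmutableString() = default;
  virtual std::size_t Length() const = 0;
  virtual char16_t CharAt(std::size_t aIndex) const = 0;
};

using StringRef = std::shared_ptr<const ImmutableString>;

// Owns a flat two-byte buffer (the nsUnicharString role).
class FlatString final : public ImmutableString {
public:
  explicit FlatString(std::u16string aChars) : mChars(std::move(aChars)) {}
  std::size_t Length() const override { return mChars.size(); }
  char16_t CharAt(std::size_t aIndex) const override { return mChars.at(aIndex); }

private:
  const std::u16string mChars;
};

// A length and offset into another string; no characters are copied.
class SubString final : public ImmutableString {
public:
  SubString(StringRef aBase, std::size_t aOffset, std::size_t aLength)
      : mBase(std::move(aBase)), mOffset(aOffset), mLength(aLength) {
    assert(mOffset + mLength <= mBase->Length());
  }
  std::size_t Length() const override { return mLength; }
  char16_t CharAt(std::size_t aIndex) const override {
    assert(aIndex < mLength);
    return mBase->CharAt(mOffset + aIndex);
  }

private:
  const StringRef mBase;
  const std::size_t mOffset;
  const std::size_t mLength;
};

// A sequence of strings treated as one; the parts are shared, not copied.
class ConcatenatedString final : public ImmutableString {
public:
  explicit ConcatenatedString(std::vector<StringRef> aParts)
      : mParts(std::move(aParts)) {}
  std::size_t Length() const override {
    std::size_t total = 0;
    for (const auto& part : mParts) total += part->Length();
    return total;
  }
  char16_t CharAt(std::size_t aIndex) const override {
    for (const auto& part : mParts) {
      if (aIndex < part->Length()) return part->CharAt(aIndex);
      aIndex -= part->Length();
    }
    assert(false && "index out of range");
    return 0;
  }

private:
  const std::vector<StringRef> mParts;
};

// Copies the characters out; only needed when a caller wants a flat buffer.
std::u16string Flatten(const ImmutableString& aString) {
  std::u16string flat;
  for (std::size_t i = 0; i < aString.Length(); ++i) flat.push_back(aString.CharAt(i));
  return flat;
}
```

In this sketch a substring or concatenation costs one small object and shares the underlying characters; whether that beats copying depends on the counts proposed below.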

To determine whether my hypothesis is correct, I think we can instrument 
nsString and nsCString to gather statistics that indicate how many copies of 
strings we make in the process of running our app. Specifically:

- Count the number of times each nsString constructs a char/PRUnichar array. 
This is often done when passing them to IDL-generated interfaces. (This number 
could be completely eliminated with nsIString.) [ToNewString, ToNewCString, 
ToNewUnicode, ToCString]

- Count the number of times we construct nsStrings from char/PRUnichar arrays. 
This is often done when we want to manipulate strings that come in from 
IDL-generated interfaces. (Some number of these could be eliminated with 
nsIString.) (How can we break down first-time constructions from copies - 
histogram?)

- Count the number of times we assign the character sequence in a string. Also 
count the percentage of strings which are actually assigned. (This number would 
indicate the number of additional nsIStrings which would need to be created due 
to immutability.) [SetString, Assign, operator=]

- Count the number of substring operations done on nsStrings. (This number could 
be replaced by an allocation of an nsSubString object.) [SetLength, Truncate, 
Trim, Left, Mid, Right, Cut]

- Count the number of concatenation operations done on nsStrings. Also count the 
percentage of strings which are concatenated. (This number could be replaced by 
an allocation of an nsConcatenatedString object, saving space.) [operator+, 
operator+=, Append, Insert]

- Count the number of mutation operations done on nsStrings. Also count the 
percentage of strings which are mutated. (This number would indicate how often 
an actual string buffer (e.g. the existing nsString implementation) would 
continue to be needed.) [SetCharAt, ToLowerCase, ToUpperCase, StripChars, 
StripWhitespace, ReplaceChar, ReplaceSubstring, CompressSet, CompressWhitespace]

Once we have these counts, we can do some analysis to determine what 
sort of ramifications nsIString might have: Will we allocate far fewer strings 
because they're shared more? Will we need to allocate far more substring objects 
because they're mutated too often? What sort of space might we expect to save 
due to more sharing? What sort of space might we expect to lose due to more 
copies made as a result of mutation?

Right now we're in the dark.

Warren Harris

Reporter

Updated

•

25 years ago

Keywords: perf

Summary: investigate nsIString → investigate nsIString

Suresh Duddi (gone)

Comment 1

•

25 years ago

Just counting constructors won't be good, I think. We should factor in how many 
are destroyed to get a figure of how many will exist. That will influence the 
space issue.

Chris Waterson

Comment 2

•

25 years ago

n.b. that vidur & troy are exploring some kind of BSTR-like stuff to reduce 
copies in layout. Not sure if their stuff would cross interface boundaries...

Scott Collins

Updated

•

25 years ago

Status: NEW → ASSIGNED

rickg

Comment 3

•

25 years ago

adding myself to cc.

Doug Turner (:dougt)

Comment 4

•

25 years ago

I have wanted this for a long while.  With this we could get rid of the 
nsXPIDLCString.  Adding myself to cc.

Daniel Veditz [:dveditz]

Comment 5

•

25 years ago

We could get rid of *some* uses of nsXPIDL(c)String -- those where the string 
is immutable. If the caller is going to manipulate the string anyway then it 
doesn't buy anything.

Scott Collins

Comment 6

•

25 years ago

Another plan for getting rid of |nsXPIDL[C]String| is to roll its functionality 
into |nsString|.  See related bugs

  <http://bugzilla.mozilla.org/show_bug.cgi?id=28846> -- alecf
  <http://bugzilla.mozilla.org/show_bug.cgi?id=28841> -- scc

Scott Collins

Updated

•

25 years ago

Target Milestone: M17

Chris Waterson

Comment 7

•

25 years ago

I'm tinkering with instrumenting nsStr with the stack walking code n' stuff to
see what interesting statistics I can produce.

Chris Waterson

Comment 8

•

25 years ago

Ok, made a first cut at gathering some stats. It's not everything that you asked
for, warren, but most of the ones that were really easy to pick up. The data is
below for simple startup & shutdown with www.cnn.com as the homepage.

For each operation, I captured the number of times the operation occurred, the
total number of characters (one- or two-byte) that were involved in the
operation, and the mean and standard deviation of the number of characters per
operation (cuz I knew warren'd ask).
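
A rough sketch of the kind of per-operation accumulator that could produce the count, total, mean and standard deviation columns. The OpStats name and its layout are assumptions for illustration; the attached nsStringStats.h may well do this differently. It uses the population standard deviation computed from running sums.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical accumulator for one instrumented operation.
struct OpStats {
  std::uint64_t count = 0;    // number of times the operation ran
  double total = 0.0;         // characters involved, summed over all calls
  double totalSquares = 0.0;  // sum of squared character counts

  // Call once per operation with the number of characters it touched.
  void Record(std::uint32_t aChars) {
    ++count;
    total += aChars;
    totalSquares += static_cast<double>(aChars) * aChars;
  }

  double Mean() const { return count ? total / count : 0.0; }

  // Population standard deviation: sqrt(E[x^2] - E[x]^2).
  double StdDev() const {
    if (count == 0) return 0.0;
    const double mean = Mean();
    const double variance = totalSquares / count - mean * mean;
    return variance > 0.0 ? std::sqrt(variance) : 0.0;
  }
};
```

Keeping only three running values per operation means the instrumentation itself allocates nothing, which matters when the thing being measured is allocation.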

For ctors and Assign operations, I tried to deduce when copy-on-write (COW)
sharing would and would not occur; e.g., "nsString::nsString(const nsStr&) COW"
indicates that the incoming nsStr's buffer could be shared. "NOCOW" means that
the incoming nsStr's buffer was incompatible, and would need to be inflated to
two-byte.

ns[C]String::Append(const nsStr&) sometimes needs to inflate[deflate] the
inbound nsStr, so I factored out INFL[DEFL] appends from normal appends. (This
may be indicative of how useful a segmented buffer implementation would be.)

                                                      ------ Characters ------
Operation                                       Count   Total    Mean   StdDev
nsCString::Append(char)                          3081    3081       1 +/-    0
nsCString::Append(const char*)                  33624  704152      21 +/-   85
nsCString::Append(const nsCString&)              1317   17939      14 +/-   11
nsCString::Append(const nsStr&) DEFL             1213   22236      18 +/-   17
nsCString::Assign(const PRUnichar*)                62    3553      57 +/-   19
nsCString::Assign(const char*)                   2139   36108      17 +/-   22
nsCString::Assign(const nsStr&) COW                85    1720      20 +/-    4
nsCString::Assign(const nsStr&) NOCOW            7217  551108      76 +/-   48
nsCString::Cut()                                  570    5015       9 +/-    5
nsCString::ToNewCString                          7494  357502      48 +/-   26
nsCString::ToNewUnicode                            64    3559      56 +/-   21
nsCString::nsCString()                          35033       0       0 +/-    0
nsCString::nsCString(const PRUnichar*)             62    3553      57 +/-   19
nsCString::nsCString(const char*)                 105    5971      57 +/-   68
nsCString::nsCString(const nsCString&) COW       7655  364331      48 +/-   26
nsCString::nsCString(const nsStr&) NOCOW         6896  230701      33 +/-   36
nsString::Append(PRUnichar)                    153883  153883       1 +/-    0
nsString::Append(char)                            982     982       1 +/-    0
nsString::Append(const PRUnichar*)              31627  676237      21 +/-  202
nsString::Append(const char*)                   32759  816672      25 +/-  409
nsString::Append(const nsStr&)                  11694   71151       6 +/-    8
nsString::Append(const nsStr&) INFL               236     290       1 +/-    1
nsString::Append(const nsString&)               19554  292897      15 +/-   25
nsString::Cut()                                  4897   68723      14 +/-  228
nsString::Insert(PRUnichar)                      1383    1383       1 +/-    0
nsString::Insert(const char*)                      70     140       2 +/-    0
nsString::SetCharAt()                            4591    4591       1 +/-    0
nsString::ToNewCString                           2351   56711      24 +/-   23
nsString::ToNewUTF8String                        4204  169205      40 +/-   41
nsString::ToNewUnicode                           3674   56392      15 +/-   13
nsString::nsString()                           133552       0       0 +/-    0
nsString::nsString(const PRUnichar*)             1055   14819      14 +/-    8
nsString::nsString(const char*)                  1953  253280     130 +/- 1671
nsString::nsString(const nsStr&) NOCOW            230    6670      29 +/-   52
nsString::nsString(const nsString&) COW         10843  222664      21 +/-  526

Chris Waterson

Comment 9

•

25 years ago

Attached file nsStringStats.h

Chris Waterson

Comment 10

•

25 years ago

Attached patch diffs to xpcom/ds to implement string stats

Chris Waterson

Comment 11

•

25 years ago

Attached extra files and patches required to gather statistics.

Chris Waterson

Comment 12

•

25 years ago

Attached patch better diffs.

Warren Harris

Reporter

Comment 13

•

25 years ago

Here's another run, with more functions accounted for, and visiting more sites:

                                                      ------ Characters ------
Operation                                       Count   Total    Mean   StdDev
nsCString::Append(char)                          8600    8600       1 +/-    0
nsCString::Append(const char*)                 157749 2465740      16 +/-   51
nsCString::Append(const nsCString&)              2348   27977      12 +/-   10
nsCString::Append(const nsStr&) DEFL             3549   59550      17 +/-   17
nsCString::Assign(const PRUnichar*)               347   15652      45 +/-   14
nsCString::Assign(const char*)                   5762  185482      32 +/-   43
nsCString::Assign(const nsStr&) COW               152    2978      20 +/-    3
nsCString::Assign(const nsStr&) NOCOW           33221 2086625      63 +/-   45
nsCString::Cut()                                 3306   26758       8 +/-    5
nsCString::SetCharAt()                            196     196       1 +/-    0
nsCString::ToNewCString                         35072 1739748      50 +/-   30
nsCString::ToNewUnicode                           375   15521      41 +/-   18
nsCString::nsCString()                         133167       0       0 +/-    0
nsCString::nsCString(const PRUnichar*)            347   15652      45 +/-   14
nsCString::nsCString(const char*)                 496   57162     115 +/-   93
nsCString::nsCString(const nsCString&) COW      35300 1747820      50 +/-   30
nsCString::nsCString(const nsStr&) NOCOW        37316 1381001      37 +/-   42
nsString::Append(PRUnichar)                    215192  215192       1 +/-    0
nsString::Append(char)                           1996    1996       1 +/-    0
nsString::Append(const PRUnichar*)             146947 3483496      24 +/-  152
nsString::Append(const char*)                  186147 3901793      21 +/-  195
nsString::Append(const nsStr&)                  44255  318232       7 +/-   34
nsString::Append(const nsStr&) INFL               405    1578       4 +/-    4
nsString::Append(const nsString&)              100415 1785264      18 +/-   36
nsString::Assign(PRUnichar)                     27116   27116       1 +/-    0
nsString::Assign(char)                            322     322       1 +/-    0
nsString::Assign(const PRUnichar*)              28497  464960      16 +/-   52
nsString::Assign(const char*)                   88503 3087143      35 +/-  281
nsString::Assign(const nsStr&) COW             133837 1256708       9 +/-   35
nsString::Assign(const nsStr&) NOCOW            11782   83533       7 +/-   16
nsString::Cut()                                 31814  578205      18 +/-  249
nsString::Insert(PRUnichar)                      8674    8674       1 +/-    0
nsString::Insert(const char*)                    1461    2932       2 +/-    0
nsString::SetCharAt()                           41303   41303       1 +/-    0
nsString::ToNewCString                           8461  224481      27 +/-   36
nsString::ToNewUTF8String                       28337 1151229      41 +/-   43
nsString::ToNewUnicode                           3177   53286      17 +/-   12
nsString::nsString()                           705595       0       0 +/-    0
nsString::nsString(const PRUnichar*)             4675   83960      18 +/-   17
nsString::nsString(const char*)                 16266  494636      30 +/-  651
nsString::nsString(const nsStr&) NOCOW            490   39884      81 +/-  114
nsString::nsString(const nsString&) COW         33812  515520      15 +/-  329
TOTAL                                        2326782 27657905

Here, 12.7% of the characters fall into the COW category. On a previous run for 
just the mozilla.org page, I got 14.4%. Seems like we can safely assume 
that we can save >10% by doing COW.
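
The 12.7% figure can be reproduced from the table by summing the Total column of the four rows marked COW and dividing by the TOTAL row. A small sketch of that arithmetic (the constant and function names are made up for this example):

```cpp
#include <cassert>
#include <cstdint>

// Character totals copied from the four "COW" rows of the table above:
// nsCString::Assign, nsCString ctor, nsString::Assign, nsString ctor.
constexpr std::uint64_t kCowChars[] = {2978, 1747820, 1256708, 515520};

// Character count from the TOTAL row.
constexpr std::uint64_t kTotalChars = 27657905;

// Percentage of all characters handled by operations that could share a buffer.
double CowSharePercent() {
  std::uint64_t cow = 0;
  for (std::uint64_t chars : kCowChars) cow += chars;
  assert(cow <= kTotalChars);
  return 100.0 * static_cast<double>(cow) / static_cast<double>(kTotalChars);
}
```

This counts characters only; it says nothing about how many of those shared buffers would later be mutated and copied anyway.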

Scott Collins

Comment 14

•

25 years ago

Three things to test:

  (1) put a flag in |nsStr| to simulate COW semantics ... "this is a reference"
      then charge subsequent mutators with the cost of an allocation/copy
      this will help us better determine the value of adding COW

  (2) count the number of times a string had mutators applied to it, this will
      help us better determine the value of adding, e.g., an |nsIImmutableString|

  (3) put a time in the string, and whenever it changes size or when it gets
      destroyed, add its duration to a bucket for that size.  This will help
      us better determine the value of adding, e.g., arena based allocation
      for some capacities.

Chris Waterson

Comment 15

•

25 years ago

*** Bug 28842 has been marked as a duplicate of this bug. ***

Scott Collins

Comment 16

•

25 years ago

*** Bug 26435 has been marked as a duplicate of this bug. ***

Warren Harris

Reporter

Comment 17

•

25 years ago

Interesting news! I implemented number (2) to determine how many strings are 
mutated. Here are the results:

Allocated strings = 833582
Mutated strings =   551607
Unmutated strings = 281252

That's over 50%!! This was for a run visiting about 10-15 pages, including 
tinderbox, a build log, cnn, yahoo, abcnews and others. 

If you add my numbers up you'll see that there are 700+ strings unaccounted 
for. This is because I only determine the number of mutated/unmutated strings 
when they're destroyed, so the remaining ones must be leaks. 
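
To make the accounting concrete, here is a sketch of the idea: each string carries a "was mutated" flag and is classified only in its destructor, so anything still alive at exit shows up in neither bucket. MutationTally and TrackedString are hypothetical names, and which operations count as a mutation is an assumption here (only Append is modelled); this is not the code from the attached diffs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical tallies for one run.
struct MutationTally {
  std::size_t allocated = 0;
  std::size_t mutated = 0;
  std::size_t unmutated = 0;

  // Strings constructed but never destroyed; at exit these look like leaks.
  std::size_t Outstanding() const { return allocated - mutated - unmutated; }
};

class TrackedString {
public:
  TrackedString(MutationTally& aTally, std::string aValue)
      : mTally(aTally), mValue(std::move(aValue)) {
    ++mTally.allocated;
  }
  TrackedString(const TrackedString&) = delete;
  TrackedString& operator=(const TrackedString&) = delete;

  ~TrackedString() {
    // Classification happens only here, so live strings are in neither bucket.
    if (mMutated) {
      ++mTally.mutated;
    } else {
      ++mTally.unmutated;
    }
  }

  // The only mutator in this sketch; a real string has many more.
  void Append(const std::string& aTail) {
    mValue += aTail;
    mMutated = true;
  }

  const std::string& Get() const { return mValue; }

private:
  MutationTally& mTally;
  std::string mValue;
  bool mMutated = false;
};
```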

Here's another longer run (60%):

Allocated strings = 1231811
Mutated strings =   777911
Unmutated strings = 467373

and here's one bringing up my mailbox (with 4000+ messages), and forwarding a 
message with a lot of extra typing added to it (39%):

Allocated strings = 2170221
Mutated strings =   1594744
Unmutated strings = 621191

Immutable strings should be a huge win!

troy

Comment 18

•

25 years ago

That's what I had informally determined as well, that immutable strings would 
be a big win

John Bandhauer

Comment 19

•

25 years ago

And that's not even counting char* and PRUnichar* strings that are manually 
alloc'd and are never in nsStrings. Right?

Scott Collins

Comment 20

•

25 years ago

I instrumented |ns[C]String| to determine the amount of character copying and the 
number of allocations that would be saved by implementing a Copy-On-Write [COW] 
mechanism without necessarily changing the current interface.  Here is a sample 
run, typical of my results to date.

       un-shared work:     15313192
             COW work:      1183682
un-shared allocations:       416265
      COW allocations:       460985 or about 10.74% more than un-shared allocations

Yes, we save a lot on copying characters, but we actually end up doing _more_ 
allocations.  The reason?  One explanation is callers copying strings and making 
small modifications in extant string variables explicitly for this purpose.  So, 
for example, I have a string variable into which I copy another string (the 
allocation and copying are deferred), but now I modify it (and am charged for the 
allocation, and some fraction of the copying, depending on the operation) and do 
something with it, like compare.  Then I copy another string into it (the current 
value is released, but the allocation and copy are deferred again), and then, as 
before, immediately make a change... now I'm charged for an allocation that I 
wouldn't have been in the non-COW implementation.
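
The pattern described above can be modelled with a toy cost counter. This is a sketch under stated assumptions, not the instrumentation from the patch: EagerString and CowString are hypothetical, the eager string is assumed to keep and reuse its buffer whenever the new value fits, and allocations are counted by the model rather than observed from the heap.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

struct Cost {
  std::size_t allocations = 0;
  std::size_t charsCopied = 0;
};

// Eager copy: keeps its buffer and reuses it when the new value fits.
class EagerString {
public:
  explicit EagerString(Cost& aCost) : mCost(aCost) {}
  void Assign(const std::string& aSource) {
    if (aSource.size() > mCapacity) {  // buffer too small: charge an allocation
      ++mCost.allocations;
      mCapacity = aSource.size();
    }
    mChars = aSource;
    mCost.charsCopied += aSource.size();
  }
  void SetCharAt(char aChar, std::size_t aIndex) { mChars.at(aIndex) = aChar; }

private:
  Cost& mCost;
  std::string mChars;
  std::size_t mCapacity = 0;
};

// Copy-on-write: assignment shares the buffer, the first mutation pays.
class CowString {
public:
  explicit CowString(Cost& aCost) : mCost(aCost) {}
  // Releases any private buffer and references the source's buffer instead.
  void Assign(const std::shared_ptr<std::string>& aShared) { mBuffer = aShared; }
  void SetCharAt(char aChar, std::size_t aIndex) {
    assert(mBuffer);
    if (mBuffer.use_count() > 1) {  // shared: the deferred copy happens now
      ++mCost.allocations;
      mCost.charsCopied += mBuffer->size();
      mBuffer = std::make_shared<std::string>(*mBuffer);
    }
    mBuffer->at(aIndex) = aChar;
  }

private:
  Cost& mCost;
  std::shared_ptr<std::string> mBuffer;
};

// A scratch variable is repeatedly assigned from a source, then tweaked.
void RunScenario(std::size_t aRounds, Cost& aEager, Cost& aCow) {
  const auto source = std::make_shared<std::string>("http://www.mozilla.org/");
  EagerString eager(aEager);
  CowString cow(aCow);
  for (std::size_t i = 0; i < aRounds; ++i) {
    eager.Assign(*source);
    eager.SetCharAt('H', 0);
    cow.Assign(source);
    cow.SetCharAt('H', 0);
  }
}
```

With ten rounds the eager model is charged one allocation and the COW model ten, because every re-assignment throws away the private buffer that the previous mutation paid for. In this particular pattern the characters copied are the same in both models, so it only illustrates the allocation side of the result, not the copying savings.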

Interesting results.

Scott Collins

Comment 21

•

25 years ago

Attached patch Here is a patch containing the changes I made to measure COW efficacy

Warren Harris

Reporter

Comment 22

•

25 years ago

Attached patch diffs for mutated/unmutated accounting (in addition to Chris' diffs)

rickg

Comment 23

•

25 years ago

One comment about SCC's analysis: the 2nd copy would not necessarily require a 
subsequent allocation. We can implement COW so that the underlying buffer is 
retained until the string is deleted or resized. The original buffer could be 
reused in the 2nd copy, so that the 2nd allocation may not be necessary. Of 
course a great deal depends on the size of the strings being operated upon.

Scott Collins

Comment 24

•

25 years ago

With respect to the notion of a segmented string implementation: note that
|GetUnicode| is called in more than a thousand places.  |GetBuffer| is called 
less, but still quite a bit.

  <http://lxr.mozilla.org/seamonkey/search?string=GetUnicode>
  <http://lxr.mozilla.org/seamonkey/search?string=GetBuffer>

Both are obstacles to implementing a segmented string since callers expect the 
resulting pointer to point to the entire buffer, and they do math with it or pass 
it to things expecting an entire string.
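
A small sketch of why a flat-pointer accessor is awkward for a segmented string. SegmentedString, Flatten, ForEachChar and CountChar are hypothetical names for this example; the point is only that no single existing buffer holds the whole string, so a GetUnicode-style call has to build a contiguous copy, while visitor or iterator access does not.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical segmented string: characters live in several separate buffers.
class SegmentedString {
public:
  void AppendSegment(std::u16string aSegment) {
    mLength += aSegment.size();
    mSegments.push_back(std::move(aSegment));
  }

  std::size_t Length() const { return mLength; }

  // What a flat-pointer accessor would have to do: build one contiguous
  // copy, because no existing buffer contains the entire string.
  std::u16string Flatten() const {
    std::u16string flat;
    flat.reserve(mLength);
    for (const auto& segment : mSegments) flat += segment;
    return flat;
  }

  // Iterator-style access: visit each character in order without copying.
  template <typename Visitor>
  void ForEachChar(Visitor aVisit) const {
    for (const auto& segment : mSegments) {
      for (char16_t ch : segment) aVisit(ch);
    }
  }

private:
  std::vector<std::u16string> mSegments;
  std::size_t mLength = 0;
};

// A client written against the visitor instead of a raw pointer.
std::size_t CountChar(const SegmentedString& aString, char16_t aChar) {
  std::size_t count = 0;
  aString.ForEachChar([&](char16_t ch) {
    if (ch == aChar) ++count;
  });
  return count;
}
```

Callers that do pointer arithmetic on the result of GetUnicode would each need to be rewritten in the second style, or pay for Flatten on every call.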

rickg

Comment 25

•

25 years ago

I agree, Scott. It makes you wish we had iterators, doesn't it?

Warren Harris

Reporter

Comment 26

•

25 years ago

I did some more analysis... of how many strings have GetBuffer or GetUnicode 
called for them. Here's the answer:

Allocated strings =  756941
Mutated strings =    521389 (68%)
Unmutated strings =  245669 (32%)
Contiguous buffers = 140836 (18%)

This was for visiting mozilla.org, cnn.com, abcnews.com, usatoday.com. 
GetBuffer and GetUnicode were only called for 18% of the strings. So I think a 
non-contiguous buffer implementation could still be a win. I didn't count how 
many times GetBuffer/GetUnicode were called for the same string, but that would 
be easy to add.


On another note... I must have been on crack when I reported the percentages 
for unmutated strings. For the 3 runs I listed above, the percentages are 
33.74%, 37.94% and 28.62% respectively. And for the above run, 32%. Still a 
win, although not quite as spectacular as first reported.

Warren Harris

Reporter

Comment 27

•

25 years ago

P.S. My GetBuffer/GetUnicode analysis doesn't include places in the code that 
use mStr directly, so it's an upper bound. If there's a critical place in the 
code that uses mStr, then my numbers could be completely off.

Scott Collins

Comment 28

•

25 years ago

I am posting the following set of recommendations to this bug to keep external
developers informed of the direction in which we are heading.  The main players
in this impending change are already, as far as I know, all on the same page.
Rick Gessner <rickg@netscape.com> recently sent out his recommendations---which
I hope he will also post to this bug---which touch on the same themes.  Rick's
recommendations paint a fairly good picture of the new world.  I still think
the following evaluation is valuable, because it goes into detail as to what the
actual changes are/should-be, and _why_ those particular changes are important.


We want to move to a world where string clients can select from among a range of
implementations that embody different implementation strategies, e.g., a
general purpose string such as we have now, and specific-use implementations
like an immutable string that optimizes allocations, and a segmented string that
minimizes character copying over editing operations on very large datasets.

These new goals impose new requirements on our current string interfaces.  Any
changes we make to the current interface must be source compatible with extant
clients, or we must be willing to pay the penalty of updating callers.

Note: our new goals fall out of our experiences using strings in our
application.  They differ significantly from our original goals (which were all
about revealing the underlying implementation to clients for performance) and
so none of these recommendations can or should be taken as a criticism of the
current interface.

Specific recommendations fall into several categories (note: these are not the
recommendations ... these are the categories).  Very roughly in order of
importance with respect to this effort:

  [A] removing from the interface any visible members that compromise the
      abstraction allowing different underlying representations, else clients
      won't be satisfied by alternate implementations

  [B] removing from the interface any routines that aren't specifically about
      manipulating the underlying representation, else alternate implementations
      must re-implement identical functionality

  [C] removing from the interface any i18n sensitive functionality, though
      mostly instances of this recommendation will be covered by [B] above

  [D] removing from the interface unused, unneeded, or unconventionally located
      functionality, to reduce the burden on alternate implementations, and to
      generally simplify

  [E] adding to the interface any support machinery needed to enable changes
      falling into one of the categories above, or simply to allow multiple
      implementations at all, e.g., |virtual|

I believe rickg and I are already very much in agreement on these points.  We
discussed them at length, and his recent message on redesign notes echoes these
sentiments.  It is clear from his recent email messages that he has been
focused on these same key issues.


Here are my specific recommendations:

   1  [A] Remove public inheritance from |nsStr|.  Access to a specific
      underlying buffer representation is prohibitive to alternate
      implementations, e.g., a segmented string et al.  It is also agreed that
      having any public data members is a political impediment to crossing
      XPCOM boundaries.  According to rickg, visible inheritance from |nsStr|
      is not exploited heavily, and should be easily removed.  This is arguably
      the most important thing we can do to enable further enhancements to our
      string implementations and uses.

   2  [A] |GetUnicode| and |GetBuffer| impose a prohibitive burden on
      implementations in a multiple implementation world.  As rickg points out,
      this is another reason to add iterators.  Unfortunately, these two
      routines are very heavily used.

   3  [E] Make the string interface abstract to allow multiple implementations.
      We were already paying for a vtable, so no extra space requirements are
      expected.  The performance impact should be minimal.

   4  [E] Split the abstract interface into layers encouraging read-only
      implementations, e.g., an immutable string

   5  [BCD] Make narrowing/widening an explicit operation done by constructors.
      Do not allow implicit conversion in append and assign operations.  Tests
      show that we are not exploiting the `double interface' of string very
      much, and this is good.  Note that like |ToUpperCase| (et al)
      functionality mentioned below, encoding conversions are properly in the
      domain of i18n, and duplicating the functionality at the low-level in
      string is suboptimal. 

   6  [BDE] Either remove operator overloading from the abstract interface, or
      implement it conventionally, that is: non-virtual inlines using only the
      abstract signatures for |Append| and |Assign|.  Implement |operator+=()|
      and |operator=()| as members; implement |operator+()| and relations as
      non-members.  Virtual assignment operators must be written carefully to
      avoid slicing.

   7  [BCDE] Remove |ToUpperCase|, |ToLowerCase|, |IsASCII|, |IsUnicode|,
      |IsSpace|, and |IsAlpha| from the interface.  Of these, only |ToLowerCase|
      is heavily used, and i18n functionality like this must be pushed up into
      the i18n layer, where, coincidentally, this functionality already happens
      to exist.

   8  [D] Remove (the little used) |ToNewString| from the interface.  This
      functionality is already available in the form of the copy-constructor.
      In a multiple implementation world, the user will typically need to
      select a specific implementation, in any case.

   9  [BCD] Remove |IsOrdered| and |BinarySearch| from the interface.  These are
      not general purpose routines, and can easily be implemented outside the
      string class if they are deemed still needed.

  10  [BCD] |EqualsIgnoreCase| and the |Compare| functions when the
      |aIgnoreCase| parameter is true are problematic just as the other i18n
      dependent routines are.  Unfortunately, these routines are very heavily
      used.  Again, they are a burden in a multiple implementation world.  They
      should be implemented as non-members (based on extant i18n facilities)
      that use iterators into the underlying string ... which also implies that
      we will need string iterators.

  11  [BCD] |ToFloat| and |ToInteger| should be removed from the interface.
      Parsing should not be part of the required functionality for multiple
      implementations.  Given iterators, this functionality could be moved to
      a non-member implementation, which, in any case, again requires i18n
      sensitivity.  |ToFloat| is not heavily used.  |ToInteger| is.

      Similarly, the |Append|s that format a float or an integer are i18n
      dependent.  Some work may be required to provide similar functionality
      that is factored into the i18n support.

  12  [D] We probably don't need the power to say something like
  
        myStr.Assign(yourStr).Append(herStr).Cut(20, 15);

      It makes sense with operators, but we may want to simplify the client
      interface with respect to named member functions.  The |Assign|, |Append|,
      |Insert|, |Cut|, and |SetString| signatures should be changed to return
      |void|.

  13  [BCD] Turn the specialized modification and accessor functions |Trim|,
      |CompressSet|, |StripChar|, |StripChars|, |StripWhitespace|,
      |ReplaceChar|, |ReplaceSubstring|, and |CountChar| into non-member
      `algorithms' that can be applied to any implementation.

  14  [DE] Given the current copying signatures of |Left|, |Right|, and |Mid|,
      they should probably be turned into non-member algorithms writing to an
      iterator as well.

  15  [E] Add iterators.  Several of the points above are eased or solved by the
      introduction of reasonable iterators.

  16  [AD] Eliminate |nsSubsume[C]Str|.  Too much implementation knowledge is
      currently required to reasonably utilize this in clients, and it presents
      a burden to implementations to facilitate.
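
To give a feel for the shape several of these recommendations point at, here is a minimal sketch: a read-only layer and a writable layer (4), mutators returning |void| (12), |operator+=| as a non-virtual inline built on the abstract |Append| (6), |operator+| as a non-member (6), and |CountChar| as a non-member algorithm (13). All names are hypothetical and index-based access stands in for real iterators; this is not the NEW_STRING_APIS design.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical read-only layer.
class ReadableString {
public:
  virtual ~ReadableString() = default;
  virtual std::size_t Length() const = 0;
  virtual char CharAt(std::size_t aIndex) const = 0;
};

// Hypothetical writable layer; named mutators return void.
class WritableString : public ReadableString {
public:
  virtual void Assign(const ReadableString& aSource) = 0;
  virtual void Append(const ReadableString& aSource) = 0;

  // Non-virtual inline operator that uses only the abstract signature.
  WritableString& operator+=(const ReadableString& aSource) {
    Append(aSource);
    return *this;
  }
};

// One possible implementation; others could plug in behind the same layers.
class SimpleString final : public WritableString {
public:
  SimpleString() = default;
  explicit SimpleString(std::string aChars) : mChars(std::move(aChars)) {}

  std::size_t Length() const override { return mChars.size(); }
  char CharAt(std::size_t aIndex) const override { return mChars.at(aIndex); }

  void Assign(const ReadableString& aSource) override {
    std::string copy;  // build first, so assigning a string to itself is safe
    for (std::size_t i = 0; i < aSource.Length(); ++i) copy.push_back(aSource.CharAt(i));
    mChars = std::move(copy);
  }
  void Append(const ReadableString& aSource) override {
    const std::size_t count = aSource.Length();  // fixed up front for self-append
    for (std::size_t i = 0; i < count; ++i) mChars.push_back(aSource.CharAt(i));
  }

private:
  std::string mChars;
};

// Non-member operator+: the caller gets a concrete implementation back.
inline SimpleString operator+(const ReadableString& aLeft, const ReadableString& aRight) {
  SimpleString result;
  result.Append(aLeft);
  result.Append(aRight);
  return result;
}

// Non-member algorithm: works on any implementation of the read-only layer.
inline std::size_t CountChar(const ReadableString& aString, char aChar) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < aString.Length(); ++i) {
    if (aString.CharAt(i) == aChar) ++count;
  }
  return count;
}
```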

Scott Collins

Comment 29

•

24 years ago

This is my primary focus at the moment.

Target Milestone: M17 → M16

Scott Collins

Updated

•

24 years ago

Priority: P3 → P1

Scott Collins

Comment 30

•

24 years ago

fixing summary to better reflect our understanding

Summary: investigate nsIString → factor |ns[C]String| to allow specialized implementations

Scott Collins

Comment 31

•

24 years ago

mass re-assigning to my new bugzilla account

Assignee: scc → scc

Status: ASSIGNED → NEW

Scott Collins

Updated

•

24 years ago

Status: NEW → ASSIGNED

Scott Collins

Comment 32

•

24 years ago

Well, NEW_STRING_APIS is now switched on.  The factoring is accomplished.  And 
some new implementations exist to solve some problems.  We need a replacement for 
XPIDL string; we need a COW implementation; we need to deploy the new 
implementations.  I'm re-summarizing this bug for the work of measuring and 
deploying the new implementations.

Keywords: helpwanted

Summary: factor |ns[C]String| to allow specialized implementations → profile string usage; deploy new implementations where appropriate

Whiteboard: HELPWANTED

Target Milestone: M16 → M20

leger

Comment 33

•

24 years ago

Putting on nsbeta3 radar.  warren says we really need to get this in for PR3, per 
beta2 PDT reviews.

Keywords: nsbeta3

Scott Collins

Comment 34

•

24 years ago

Simon, can you add your recent profiling work to this bug?

For everyone else, under discussion is the idea that `chunk' allocating strings 
has turned out to be a bigger source of wasted space than it has been a 
performance boon.

(oops, too many cc's, Bugzilla is making me remove one to add simon.  Sorry dp)

Simon Fraser [no longer active]

Comment 35

•

24 years ago

Adding newsgroup postings on string usage:

            <news:sfraser-F78927.13360202062000@secnews1.mcom.com>
It seems that nsString::SetCapacity() always buffers the string
size in 64-character chunks - this logic lives down in nsStr::Alloc.
So:

   nsString foo;
   foo.SetCapacity(1);
   foo.Assign('a');

will eat up 128 bytes of heap space.

I have not found a way that I can set the capacity of an nsString
to exactly the length I know is needed. This of course has quite
an impact on bloat.
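
The arithmetic behind the 128-byte figure, as a sketch. This models the reported behaviour (capacity rounded up to a multiple of 64 characters) and is not the actual nsStr::Alloc code; the function names and the treatment of a zero-length request are assumptions.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kChunkChars = 64;  // chunk size reported above

// Round a requested capacity (in characters) up to a whole number of chunks.
constexpr std::size_t RoundedCapacity(std::size_t aChars) {
  return ((aChars + kChunkChars - 1) / kChunkChars) * kChunkChars;
}

// Heap bytes for a string of aChars characters of aCharSize bytes each.
constexpr std::size_t HeapBytes(std::size_t aChars, std::size_t aCharSize) {
  return RoundedCapacity(aChars) * aCharSize;
}

// Bytes allocated but not holding any character.
constexpr std::size_t WastedBytes(std::size_t aChars, std::size_t aCharSize) {
  return HeapBytes(aChars, aCharSize) - aChars * aCharSize;
}

// One two-byte character: 64 characters * 2 bytes = 128 bytes, as in the example.
static_assert(HeapBytes(1, 2) == 128, "matches the one-character example");
```

Under this model a one-character nsString wastes 126 of its 128 bytes, and a 65-character string jumps to a 128-character buffer.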

Seeing this leads me to question how often we actually need to
chunk changes in string length; what proportion of strings actually
change length during their lifetime? My guess is that it's < 50%,
which perhaps suggests that the normal behaviour should be to
not round up string sizes, and that we should have an API that
allows the caller to create a string with, or specify that an
existing string is likely to change length frequently from now
on.

          <news:sfraser-0720CE.17500902062000@secnews1.mcom.com>

Some data on the bloat that results from string chunking (recall,
bloat = total memory ever allocated, not a runtime high-water mark).
Numbers are K.

Test              Allocated       Used        Waste    % waste
--------------------------------------------------------------
Simple browser      2938.00     1548.98     1389.01     47.28%
Complex browser     5839.17     3214.73     2624.44     44.95%
Mail                6232.82     3369.89     2862.93     45.93%

So this chunking almost doubles the amount of memory that our
strings use.
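
For anyone checking the table, the waste percentage is (allocated - used) / allocated, and "almost doubles" corresponds to allocated / used. A short sketch with the figures copied from the table (the struct and function names are made up for this example):

```cpp
#include <cassert>

// Figures in K, copied from the table above.
struct BloatRow {
  const char* test;
  double allocatedK;
  double usedK;
};

constexpr BloatRow kRows[] = {
    {"Simple browser", 2938.00, 1548.98},
    {"Complex browser", 5839.17, 3214.73},
    {"Mail", 6232.82, 3369.89},
};

// Share of allocated string memory that never held a character.
inline double WastePercent(const BloatRow& aRow) {
  assert(aRow.allocatedK > 0.0);
  return 100.0 * (aRow.allocatedK - aRow.usedK) / aRow.allocatedK;
}

// How many times larger the allocation is than the characters stored.
inline double GrowthFactor(const BloatRow& aRow) {
  assert(aRow.usedK > 0.0);
  return aRow.allocatedK / aRow.usedK;
}
```

The growth factor comes out between 1.8 and 1.9 for all three rows, which is where "almost doubles" comes from.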

      <news:sfraser-D08810.16523102062000@secnews1.mcom.com>
In another post, waterson posed the question of how many strings with
identical contents are allocated, and whether we could use atoms for
these common strings. (He was, I think, talking about string usage in
a particular module/API, but the question can be generalized.)

So I put some debug code in nsStr::Destroy, that dumps out the contents
of strings just before they are deleted, if aDest.mOwnsBuffer == PR_TRUE
(which indicates that the buffer was heap-allocated). Some results are below.

These results can be used to find places in the code that might benefit from
shared strings, or caching of frequently used strings. Of course, I have
no data on call sites here.

The data look like this:

 848 1   dummy:path

'848' is the count (# strings with these contents), '1' is the character width
(1 for char, 2 for PRUnichar), and the rest is the string itself.
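
A sketch of how such a tally could be kept in memory rather than dumped and sorted afterwards. ContentHistogram is a hypothetical helper, not the debug code described above, and two-byte contents are assumed to be narrowed to a std::string key for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Key: (character width, contents), matching the columns described above.
using ContentKey = std::pair<int, std::string>;
// Row: (count, key).
using ContentRow = std::pair<std::size_t, ContentKey>;

class ContentHistogram {
public:
  // Call at the point where a heap-owned buffer is about to be freed.
  void RecordDestroyed(int aCharWidth, const std::string& aContents) {
    ++mCounts[ContentKey(aCharWidth, aContents)];
  }

  // Rows ordered by descending count, like the listings below.
  std::vector<ContentRow> Sorted() const {
    std::vector<ContentRow> rows;
    for (const auto& entry : mCounts) rows.emplace_back(entry.second, entry.first);
    std::stable_sort(rows.begin(), rows.end(),
                     [](const ContentRow& a, const ContentRow& b) {
                       return a.first > b.first;
                     });
    return rows;
  }

private:
  std::map<ContentKey, std::size_t> mCounts;
};
```

Because the width is part of the key, the one-byte and two-byte copies of the same text are counted separately, as they are in the results.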

Test 1:
   Bring up browser, loading simple text-only HTML page, Quit.
   <http://www.smfr.org/mozilla/sortedstrings.txt>
 
 848 1   dummy:path
 510 2   true
 333 2   file:///Other%20stuff/Documents/Mozilla/Users50/Simon/localstore.rdf
 303 2   monospace
 278 2   component://netscape/layout/element-factory?namespace=http://www.mozilla.org/xbl
 277 1   component://netscape/layout/element-factory?namespace=http://www.mozilla.org/xbl
 255 2   geneva
 205 2   component://netscape/layout/element-factory?namespace=http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul
 200 1   component://netscape/layout/element-factory?namespace=http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul
 167 1   file:///Other%20stuff/Documents/Mozilla/Users50/Simon/localstore.rdf
 159 1   file:///Bleeding%20Edge/Mozilla%20tree/src/mozilla/dist/viewer_debug/
 135 2   ISO-8859-1
 123 2   UTF-8
 117 2   serif
  85 2   broadcaster
  79 2   menuitem
  78 2   menupopup
  73 2                 <string is a run of 28 spaces>
  71 2   vertical
  67 2   rdf:http://home.netscape.com/NC-rdf#Name

Test 2:
   Bring up browser, surf to mozilla.org, tinderbox, bugzilla, load a bug,
   open prefs dialog.
   <http://www.smfr.org/mozilla/sortedbrowser.txt>

1797 2   monospace
1698 2   true
1007 2   serif
 879 2   geneva
 848 1   dummy:path
 371 2   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/xbl
 370 1   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/xbl
 333 2   file:///Other%20stuff/Documents/Mozilla/Users50/Simon/localstore.rdf
 322 2   ISO-8859-1
 315 1   file:///Bleeding%20Edge/Mozilla%20tree/src/mozilla/dist/viewer_debug/
 266 2   vertical
 248 2   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/keymaster/gatekeeper/there.is.only.xul
 243 2   white
 236 1   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/keymaster/gatekeeper/there.is.only.xul
 176 2   \
 167 1   file:///Other%20stuff/Documents/Mozilla/Users50/Simon/localstore.rdf
 159 2   UTF-8
 128 1   css
 125 2   never
 121 2   black


Test 3:
   Bring up browser, open mail-news, load 2 large IMAP folders
   (including one of bugzilla mail)
   <http://www.smfr.org/mozilla/sortedmail.txt>

7084 2   UTF-8
4785 2   us-ascii
2809 1   mozilla.org
2808 1   bugzilla-daemon
1304 2   true
1034 2   component://netscape/intl/unicode/decoder?charset=x-imap4-modified-utf7
1032 2   x-imap4-modified-utf7
1029 2   never
 987 1   netscape.com
 957 2   geneva
 948 1   %S Receiving: message headers %lu of %lu
 947 1   sfraser
 941 2   Bugzilla
 894 2   monospace
 848 1   dummy:path
 500 2   file:///Other%20stuff/Documents/Mozilla/Users50/Simon/localstore.rdf
 488 2   autostretch
 392 2   menuitem
 358 2   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/xbl
 357 1   component://netscape/layout/element-factory?namespace=http://
www.mozilla.org/xbl

Scott Collins

Comment 36

•

24 years ago

Attached file: attaching data <dougt@netscape.com> generated...

Scott Collins

Comment 37

•

24 years ago

Attached file: more <dougt@netscape.com> data...

Scott Collins

Comment 38

•

24 years ago

Attached patch: Here's the patch Doug used to generate this data...

Scott Collins

Comment 39

•

24 years ago

Doug, I attached your patch and data to this bug ... which seems like the 
appropriate place.  What conclusions can we draw from this data?  There are 
certain very common strings, true, but this is not enough to know if they are 
candidates for being replaced with |nsShared[C]String|s, since we don't know how 
they were generated.  What do you think?  The other bug filed on this is bug 
#46738.  I commented there as well.  We should consider marking that bug either a 
duplicate or a blocker for this bug.
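
To make the sharing idea concrete, here is a minimal sketch of an intern table that hands out one immutable buffer per distinct value. This is NOT the |nsShared[C]String| API; the class and its methods are hypothetical, and it uses standard-library smart pointers only to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical intern table (illustration only, not nsSharedString).
class InternTable {
public:
    typedef std::shared_ptr<const std::string> Handle;

    // Returns the shared buffer for |value|, creating it on first use.
    Handle Intern(const std::string& value) {
        auto it = mEntries.find(value);
        if (it != mEntries.end()) {
            if (Handle live = it->second.lock()) {
                return live;  // an identical string is already alive
            }
        }
        Handle fresh = std::make_shared<std::string>(value);
        mEntries[value] = fresh;  // weak reference: the table owns nothing
        return fresh;
    }

    // Number of distinct values that are still referenced somewhere.
    std::size_t LiveCount() const {
        std::size_t n = 0;
        for (const auto& entry : mEntries) {
            if (!entry.second.expired()) {
                ++n;
            }
        }
        return n;
    }

private:
    std::unordered_map<std::string, std::weak_ptr<const std::string>> mEntries;
};
```

Under this scheme the 1797 copies of "monospace" in Test 2 would collapse into one buffer, but only if their lifetimes overlap, and the data here does not tell us whether they do.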

Dawn Endico

Updated

•

24 years ago

QA Contact: leger → kandrot

Scott Collins

Updated

•

24 years ago

Component: XPCOM → String

OS: Windows NT → All

Hardware: PC → All

Target Milestone: --- → mozilla0.9.1

Scott Collins

Comment 40

•

24 years ago

marking dependencies, turning this [officially] into a tracking bug for
deploying new string implementations

Depends on: 46892, 69872, 69873, 70075, 70076, 70078

Scott Collins

Updated

•

24 years ago

Summary: profile string usage; deploy new implementations where appropriate → [tracking bug] profile string usage; deploy new implementations where appropriate

Scott Collins

Updated

•

24 years ago

No longer depends on: 46892

Scott Collins

Updated

•

24 years ago

Depends on: 53065

Scott Collins

Updated

•

24 years ago

No longer depends on: 69873

Scott Collins

Updated

•

24 years ago

Depends on: 69873

Scott Collins

Updated

•

24 years ago

Depends on: 70082

Scott Collins

Updated

•

24 years ago

Depends on: 70083

Scott Collins

Updated

•

24 years ago

Depends on: 70090

Scott Collins

Updated

•

24 years ago

Target Milestone: mozilla0.9.1 → ---

Scott Collins

Updated

•

24 years ago

Priority: P1 → --

Scott Collins

Updated

•

24 years ago

No longer depends on: 70090

Scott Collins

Updated

•

24 years ago

Blocks: 70090

Scott Collins

Updated

•

24 years ago

No longer depends on: 70075

Scott Collins

Updated

•

24 years ago

Depends on: 65219

Scott Collins

Updated

•

24 years ago

Depends on: 16108, 40140, 53209, 67876

Scott Collins

Updated

•

24 years ago

Depends on: 70143

Scott Collins

Updated

•

24 years ago

No longer depends on: 70078

Scott Collins

Updated

•

24 years ago

Depends on: 70740

Scott Collins

Updated

•

23 years ago

Depends on: 74726

Scott Collins

Updated

•

23 years ago

Depends on: 74985

Scott Collins

Updated

•

23 years ago

Depends on: 46738

Scott Collins

Updated

•

23 years ago

No longer depends on: 70083

Scott Collins

Updated

•

23 years ago

No longer depends on: 70082

Scott Collins

Updated

•

23 years ago

No longer depends on: 69873

jag (Peter Annema)

Assignee

Updated

•

23 years ago

QA Contact: kandrot → jaggernaut

Axel Hecht

Updated

•

23 years ago

No longer depends on: 70076

Scott Collins

Comment 41

•

22 years ago

giving up ancient string bugs to the new string owner.  jag, you'll want to sort
through these and see which ones still apply and go with or against the
direction in which you intend strings to evolve

Assignee: scc → jaggernaut

Status: ASSIGNED → NEW

jag (Peter Annema)

Assignee

Comment 42

•

15 years ago

This work is no longer relevant; strings have had a new implementation for a while now :-)

If the new strings code needs performance tuning, please file new bugs.

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → WONTFIX

Nobody; OK to take it and work on it

Updated

•

3 years ago

Component: String → XPCOM

Attachments:

nsStringStats.h | 25 years ago | Chris Waterson | 558 bytes, text/plain
diffs to xpcom/ds to implement string stats | 25 years ago | Chris Waterson | 14.08 KB, patch
better diffs. | 25 years ago | Chris Waterson | 15.92 KB, patch
Here is a patch containing the changes I made to measure COW efficacy | 25 years ago | Scott Collins | 7.72 KB, patch
diffs for mutated/unmutated accounting (in addition to Chris' diffs) | 25 years ago | Warren Harris | 39.32 KB, patch
attaching data <dougt@netscape.com> generated... | 24 years ago | Scott Collins | 133.10 KB, text/plain
more <dougt@netscape.com> data... | 24 years ago | Scott Collins | 181.48 KB, text/plain
Here's the patch Doug used to generate this data... | 24 years ago | Scott Collins | 3.12 KB, patch