Open Bug 232927 Opened 21 years ago Updated 2 years ago

investigate further performance improvements in strings

Categories

(Core :: XPCOM, defect, P4)

defect

Tracking

()

Future

People

(Reporter: dbaron, Unassigned)

References

Details

(Keywords: perf)

Attachments

(5 files, 1 obsolete file)

Once bug 231995 lands, there are some other performance improvements to strings that I'd like to investigate (i.e., write them and test to see if they're really improvements): 1. Remove threadsafe reference counting by having flag for strings accessed on multiple threads -- and prevent them from sharing. 2. Have an nsSubstring::PromoteToShared() const method (use |mutable_this| inside) that makes a dependent string, string with adopted buffer, or auto string (probably not worth it until not threadsafe) shared. Use this in Assign when the buffer isn't already shared and it's not a stack-to-stack assignment. 3. Implement multiple sizes of nsAutoString (nsStackString16, nsStackString32, etc.). (I think Stack is a better name than Auto, and it's a chance to fix the ns*C*AutoString inconsistency with nsStack*C*String.) 4. Have nsAutoString set a flag that indicates that it has a stack buffer so we have the ability to use that stack buffer in an assignment after we've moved to a non-stack buffer (especially important if we do the nsAutoString part of (2).
this list looks great.
5. investigate tweaks in the capacity doubling code: * whether to double initially when going from immutable to mutable * whether to continue up powers of two or use max(2*current capacity, required capacity)
Maybe the malloc size should be 8-byte aligned. This is what win32 malloc does so this is free and it should help reduce reallocation count. Maybe string data access should occur on aligned memory so one should ensure that sizeof(nsStringHeader) is always 8 like it seems to be now. This patch is completely untested.
Thanks for the patch Bernard. I'll incorporate that into my patch for bug 231995 since that hasn't landed yet.
What if malloc can return an address that's 4 mod 8? If that's possible, then rounding size up to a multiple of 8 is not going to align the address (alignment usually refers to addresses, not sizes, so it seems the code must be assuming 0 mod 8 pointer alignment from malloc). Is the idea simply to overallocate in case the string grows by 1-7 chars and there's room in the rounded-up size for the growth? Why is it harmful (to performance, presumably) to let realloc worry about that, if the malloc heap uses 8-byte alignment? What about other platforms? The patch is against a file in the attic, and I'm rusty on string code anyway, but I hope my comment here is helpful. /be
(In reply to comment #5) > What if malloc can return an address that's 4 mod 8? Then this code will allocate a few more bytes than necessary. > Is the idea simply to overallocate in case the string grows by 1-7 chars and > there's room in the rounded-up size for the growth? Yes it is exactly what is intended > Why is it harmful (to > performance, presumably) to let realloc worry about that, if the malloc heap > uses 8-byte alignment? I think that it's a liitle gain to avoid some realloc() just for a few bytes. Those realloc() calls should be fast though because there's actually nothing to realloc in this case because malloc() did round up the size previously. It just avoids the "realloc" code path in the string code. > > What about other platforms? The constant 7 used to round up the size should be 3 if malloc returns 4-byte aligned adresses or could be changed to something like: "2 * sizeof(size_t) - 1" (platform specific malloc returns 8-byte aligned addresses) or "sizeof(size_t) - 1" (4-byte aligned adresses). If 7 is used when 4 mod 8 addresses are returned by malloc then a few bytes will be wasted by the rounded-up size (too much round-up).
Hmm... I'm going to hold off on this malloc alignment patch until after the string branch lands.
Comment on attachment 141418 [details] [diff] [review] align malloc size some testing shows that this patch saves 50% of realloc() calls. ("ReallocCount" string statistics going from 196954 to 97751 browsing a few web sites)
This is another variation. It saves 15% more reallocations compared to the previous patch. I think that it can be efficient especially when the platform malloc() implementation uses links of blocks of size 8, 16, 32, ... for small blocks. If the malloc() implementation is not efficient then another idea would be to use several linked lists of arena allocated small blocks of size 8, 16, ... and reuse the free blocks without actually calling free() (for instance using one nsFixedSizeAllocator for each block size: http://lxr.mozilla.org/seamonkey/source/xpcom/ds/nsFixedSizeAllocator.h#41).
We gained significant memory use improvements when we stopped doing that a number of years ago. IIRC, something like 70% of strings are never modified.
Why should we optimize to reduce realloc count, and nothing else? realloc is not the dominant cost, and you're not measure heap fragmentation or its effect, virtual memory (vm size and even resident set size) bloat. Size-classifying allocators, especially power-of-two ones, are fragmentation prone. See waterson's research from several years ago into the Lea malloc, at http://www.mozilla.org/projects/footprint/allocators.html. Bernard, I don't think this patch is being subject to the right system tests. /be
As far as I've seen, maxheap went up on brad as well after the checkin of the string branch... Would this patch help that as well? Would the patch help Txul/Ts even more? I guess that's the some of the really interesting questions...
please see my notes in bug 234864 concerning dbaron's point #4 in comment #0 of this bug.
I think that to improve the string library you have to improve on sharing and/or improve allocations/reallocation. So when I first noticed the high reallocation count noted in bug 231995 I thought that something could be improved here. I don't like my patch to round up to 2^n very much but I just wanted to see how many reallocations I could cut down that way, especially when I saw the 50% gain from the first patch . That patch (the 1st one) is different because I think that it is free on most platforms and even if it suppresses some realloc() that are almost no-ops it can avoid some noise in the malloc stats. Theses two patches can be run against the pageloader tests to see if there are improvements or not. If there's too little performance gain then we'll know for sure that reallocation is not important.
I should probably do a more thorough audit rather than just waiting for assertions and fixing them when I see them. And I should probably test the installer...
same, but with better comments
Attachment #141870 - Attachment is obsolete: true
Comment on attachment 141871 [details] [diff] [review] preliminary patch for non-threadsafe refcounting Note that, given that it may (given some of the other suggestions above) soon be possible for any string to become a string with a shared buffer, it would be better to assert when assigning from any string to any string. However, we can't do that because there's no place to store the original thread for strings without shared buffers since increasing the size of nsString breaks the nsStringContainer code. The weakness of the assertions is particularly problematic for the case of nsAutoString, where only a large string would cause problems.
dbaron, if i want to let JS share one of our sharable buffers, do i need to worry about it releasing that buffer on a background thread? IOW, does this patch prevent me from implementing sharing from C++ to JS in xpconnect?
Comment on attachment 141871 [details] [diff] [review] preliminary patch for non-threadsafe refcounting FWIW, I just tested this on btek, and it gains less than a percent (probably about half) on pageload time.
Comment on attachment 141934 [details] [diff] [review] patch for more sharing This has no noticeable affect on pageload.
Keywords: perf
I tested changing the size of the ns[C]AutoString initial, stack-based, buffer. My testcase first created a new profile, then loaded a +2M html-file (saved bugzilla buglist) from localhost. When the page had loaded ready, I quit moz. In this testcase there is a ~5.5% reduction of allocation and reallocation if using a buffer of size 128 compared to a 64. I guess this could do a small change on Tp?
There might be costs involved with increasing the size of an object that is as commonly used as nsAutoString, even if it's stackbased. The reason is that on some systems you'll end up forcing the OS to muck around with increasing the stacksize when it runs out of what it originally had planned for. I'm not sure if this is a problem on the platforms that moz runs on these days, but I do think testing needs to be done across at least the most important platforms before we do something like this. IMHO it would be interesting to see some numbers on the patch from comment 3 though. Yes, number of realloc()s is in itself not something to optimize for, however the benefit of reducing them might very well outweigh the cost of the patch. Possibly align-code like that should be platform dependant and we'd only use it on platforms where we know to benefit from it (like win32 apparently)
QA Contact: string
Assignee: dbaron → nobody
Priority: -- → P4
Target Milestone: --- → Future
Component: String → XPCOM
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: