Closed
Bug 269853
Opened 20 years ago
Closed 18 years ago
Use copy-on-write substrings for tokenizer
Categories
(Core :: DOM: HTML Parser, defect)
Core
DOM: HTML Parser
Tracking
()
RESOLVED
FIXED
People
(Reporter: bryner, Assigned: bryner)
References
Details
(Keywords: perf)
Attachments
(2 files)
40.51 KB,
patch
|
jst
:
review+
darin.moz
:
superreview+
|
Details | Diff | Splinter Review |
41.82 KB,
patch
|
Details | Diff | Splinter Review |
Often in the tokenizer code, the situation arises where most of the time, a token will want the unmodified text from the scanner buffer as its text value, but occasionally, it needs to modify it somehow. Currently this would require having both an nsScannerSubstring and an nsString, which is bloaty and hard to manage. What I'd like to do instead is implement a new class that wraps a portion of the scanner buffer in a nsDependentSubstring, doing the appropriate refcounting on the buffers so that they stay alive. If you want to write to the string, the data is copied and the reference to the scanner buffer is released. Quantify data from this change is very promising. I ran a profile before and after with it focused on the ResumeParse subtree. Since this change affects both allocation and string copying, I added up all of the time spent in malloc, memcpy, memmove, and free on the before and after profiles. It appears that with this change we spend 70% less time total in these functions combined. Also, I saw roughly 1,000 fewer calls to malloc. The Quantify run used the netscape.com page on the pageload test with a fresh profile. This is also somewhat related to bug 269054. After thinking this over, I like this solution better than what's proposed in that bug. If we allowed arbitrary types of strings to suddenly become dependent, with no way of telling that this happened, that's going to be a nightmare for callers. In my opinion, we should make it continue to be the case that all of the string classes except for nsDependent* have an owning reference to their data.
Assignee | ||
Comment 1•20 years ago
|
||
One other thing... bz had some concerns that changing the sizes of the token classes would cause the token allocator to become less efficient. I'm open to adjusting that, but in the profile I ran, we had no problems with running out of buckets... it fell through to PL_ARENA_ALLOCATE only 2 or 3 times.
Assignee | ||
Comment 2•20 years ago
|
||
(In reply to comment #1) > adjusting that, but in the profile I ran, we had no problems with running out > of buckets... it fell through to PL_ARENA_ALLOCATE only 2 or 3 times. Ok, let me correct that. It called AddBucket 3 times. It called PL_ArenaAllocate 41 times, out of 4,993 calls to CreateTokenOfType.
Assignee | ||
Updated•20 years ago
|
Attachment #165931 -
Flags: superreview?(darin)
Attachment #165931 -
Flags: review?(bzbarsky)
Comment 3•20 years ago
|
||
nsScannerSharedSubstring maybe? one quick note: should the nsTDependentSubstring ctors use a method other than Rebind to avoid having to initialize mData, mLength, and mFlags twice, etc.?
Assignee | ||
Comment 4•20 years ago
|
||
(In reply to comment #3) > nsScannerSharedSubstring maybe? > > one quick note: should the nsTDependentSubstring ctors use a method other than > Rebind to avoid having to initialize mData, mLength, and mFlags twice, etc.? I guess they could, if it's a big deal. I figured the function call overhead would outweigh any savigs from not initializing those members.
Comment 5•20 years ago
|
||
So... I'll give this a closer look a bit later, but for starters it looks like none of our preallocated buckets (see http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsDTDUtils.cpp#1253) will be the right size for the whitespace token with your change... More importantly, the whitespace tokens will no longer share a bucket with end, cdata, entity, etc tokens. So we'll have an extra bucket floating about... Again, I'm not sure how much that matters. If we decide it does, it actually looks like end tokens could use the same setup pretty easily, but entity tokens certainly could not. Perhaps the right answer is to just add another bucket to the preallocated set?
Assignee | ||
Comment 6•20 years ago
|
||
Right, I didn't bother making any changes for start and end tokens since they'll generally just have converted their tag to an id and we'd be wasting space. nsFixedSizeAllocator does add a bucket the first time you try to allocate an object of a size you haven't already added a bucket for. So preallocating the bucket would certainly make sense given the current code. I don't know if it's a problem that we now have an additional bucket... as I said, it seems to have a pretty good hit rate for the arena but I'm not sure what to expect.
Assignee | ||
Comment 7•20 years ago
|
||
Comment on attachment 165931 [details] [diff] [review] patch Trying to lighten boris's load a bit... peterv, can you look at this?
Attachment #165931 -
Flags: review?(bzbarsky) → review?(peterv)
Comment 8•20 years ago
|
||
Comment on attachment 165931 [details] [diff] [review] patch looks good to me. sr=darin provided peterv is happy with the patch.
Attachment #165931 -
Flags: superreview?(darin) → superreview+
Comment 9•20 years ago
|
||
Comment on attachment 165931 [details] [diff] [review] patch - In nsScanner::ReadUntil(): + while (current != mEndPosition) { [...] + while (*setcurrent) { + if (*setcurrent == theChar) { + goto found; + } + ++setcurrent; } + } + + ++current; + } + // If we are here, we didn't find any terminator in the string and + // current = mEndPosition + SetPosition(current); + AppendUnicodeTo(origin, current, aString); + return Eof(); + +found: + if(addTerminal) + ++current; + AppendUnicodeTo(origin, current, aString); + SetPosition(current); + + //DoErrTest(aString); + + return NS_OK; You could completely eliminate that goto if you simply replace the goto with the code after the lable. r=jst with that changed.
Attachment #165931 -
Flags: review?(peterv) → review+
Assignee | ||
Comment 10•20 years ago
|
||
Comment 11•20 years ago
|
||
Brian, could you add that other bucket in nsDTDUtils?
Comment 12•20 years ago
|
||
This commit have added +parser/htmlparser/src/nsScannerString.cpp:439 + `nsScannerBufferList*bufferList' might be used uninitialized in this function on brad TBox
Comment 13•20 years ago
|
||
That's pretty obviously a bogus warning (bufferList is used if and only if sameBuffer is true, in which case it will be set; there's no way the value of sameBuffer can change between those two blocks).
Comment 14•20 years ago
|
||
This may have caused the regression in bug 273085
Assignee | ||
Comment 15•18 years ago
|
||
this should have been marked fixed when checked in
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•