Closed Bug 269853 Opened 21 years ago Closed 19 years ago

Use copy-on-write substrings for tokenizer

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: bryner, Assigned: bryner)

References

Details

(Keywords: perf)

Attachments

(2 files)

patch 21 years ago Brian Ryner (not reading) 40.51 KB, patch	jst : review+ darin.moz : superreview+	Details \| Diff \| Splinter Review
final patch with jst's comments addressed 21 years ago Brian Ryner (not reading) 41.82 KB, patch		Details \| Diff \| Splinter Review

Brian Ryner (not reading)

Assignee

Description

•

21 years ago

Often in the tokenizer code, the situation arises where most of the time, a token will want the unmodified text from the scanner buffer as its text value, but occasionally, it needs to modify it somehow. Currently this would require having both an nsScannerSubstring and an nsString, which is bloaty and hard to manage. What I'd like to do instead is implement a new class that wraps a portion of the scanner buffer in a nsDependentSubstring, doing the appropriate refcounting on the buffers so that they stay alive. If you want to write to the string, the data is copied and the reference to the scanner buffer is released. Quantify data from this change is very promising. I ran a profile before and after with it focused on the ResumeParse subtree. Since this change affects both allocation and string copying, I added up all of the time spent in malloc, memcpy, memmove, and free on the before and after profiles. It appears that with this change we spend 70% less time total in these functions combined. Also, I saw roughly 1,000 fewer calls to malloc. The Quantify run used the netscape.com page on the pageload test with a fresh profile. This is also somewhat related to bug 269054. After thinking this over, I like this solution better than what's proposed in that bug. If we allowed arbitrary types of strings to suddenly become dependent, with no way of telling that this happened, that's going to be a nightmare for callers. In my opinion, we should make it continue to be the case that all of the string classes except for nsDependent* have an owning reference to their data.

Brian Ryner (not reading)

Assignee

Comment 1

•

21 years ago

Attached patch patch — Details — Splinter Review

One other thing... bz had some concerns that changing the sizes of the token classes would cause the token allocator to become less efficient. I'm open to adjusting that, but in the profile I ran, we had no problems with running out of buckets... it fell through to PL_ARENA_ALLOCATE only 2 or 3 times.

Brian Ryner (not reading)

Assignee

Comment 2

•

21 years ago

(In reply to comment #1) > adjusting that, but in the profile I ran, we had no problems with running out > of buckets... it fell through to PL_ARENA_ALLOCATE only 2 or 3 times. Ok, let me correct that. It called AddBucket 3 times. It called PL_ArenaAllocate 41 times, out of 4,993 calls to CreateTokenOfType.

Brian Ryner (not reading)

Assignee

Updated

•

21 years ago

Attachment #165931 - Flags: superreview?(darin)

Attachment #165931 - Flags: review?(bzbarsky)

Darin Fisher

Comment 3

•

21 years ago

nsScannerSharedSubstring maybe? one quick note: should the nsTDependentSubstring ctors use a method other than Rebind to avoid having to initialize mData, mLength, and mFlags twice, etc.?

Brian Ryner (not reading)

Assignee

Comment 4

•

21 years ago

(In reply to comment #3) > nsScannerSharedSubstring maybe? > > one quick note: should the nsTDependentSubstring ctors use a method other than > Rebind to avoid having to initialize mData, mLength, and mFlags twice, etc.? I guess they could, if it's a big deal. I figured the function call overhead would outweigh any savigs from not initializing those members.

Boris Zbarsky [:bzbarsky]

Comment 5

•

21 years ago

So... I'll give this a closer look a bit later, but for starters it looks like none of our preallocated buckets (see http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsDTDUtils.cpp#1253) will be the right size for the whitespace token with your change... More importantly, the whitespace tokens will no longer share a bucket with end, cdata, entity, etc tokens. So we'll have an extra bucket floating about... Again, I'm not sure how much that matters. If we decide it does, it actually looks like end tokens could use the same setup pretty easily, but entity tokens certainly could not. Perhaps the right answer is to just add another bucket to the preallocated set?

Brian Ryner (not reading)

Assignee

Comment 6

•

21 years ago

Right, I didn't bother making any changes for start and end tokens since they'll generally just have converted their tag to an id and we'd be wasting space. nsFixedSizeAllocator does add a bucket the first time you try to allocate an object of a size you haven't already added a bucket for. So preallocating the bucket would certainly make sense given the current code. I don't know if it's a problem that we now have an additional bucket... as I said, it seems to have a pretty good hit rate for the arena but I'm not sure what to expect.

Brian Ryner (not reading)

Assignee

Comment 7

•

21 years ago

Comment on attachment 165931 [details] [diff] [review] patch Trying to lighten boris's load a bit... peterv, can you look at this?

Attachment #165931 - Flags: review?(bzbarsky) → review?(peterv)

Darin Fisher

Comment 8

•

21 years ago

Comment on attachment 165931 [details] [diff] [review] patch looks good to me. sr=darin provided peterv is happy with the patch.

Attachment #165931 - Flags: superreview?(darin) → superreview+

Johnny Stenback (:jst)

Comment 9

•

21 years ago

Comment on attachment 165931 [details] [diff] [review] patch - In nsScanner::ReadUntil(): + while (current != mEndPosition) { [...] + while (*setcurrent) { + if (*setcurrent == theChar) { + goto found; + } + ++setcurrent; } + } + + ++current; + } + // If we are here, we didn't find any terminator in the string and + // current = mEndPosition + SetPosition(current); + AppendUnicodeTo(origin, current, aString); + return Eof(); + +found: + if(addTerminal) + ++current; + AppendUnicodeTo(origin, current, aString); + SetPosition(current); + + //DoErrTest(aString); + + return NS_OK; You could completely eliminate that goto if you simply replace the goto with the code after the lable. r=jst with that changed.

Attachment #165931 - Flags: review?(peterv) → review+

Brian Ryner (not reading)

Assignee

Comment 10

•

21 years ago

Attached patch final patch with jst's comments addressed — Details — Splinter Review

Boris Zbarsky [:bzbarsky]

Comment 11

•

21 years ago

Brian, could you add that other bucket in nsDTDUtils?

Aleksey Nogin

Comment 12

•

21 years ago

This commit have added +parser/htmlparser/src/nsScannerString.cpp:439 + `nsScannerBufferList*bufferList' might be used uninitialized in this function on brad TBox

Boris Zbarsky [:bzbarsky]

Comment 13

•

21 years ago

That's pretty obviously a bogus warning (bufferList is used if and only if sameBuffer is true, in which case it will be set; there's no way the value of sameBuffer can change between those two blocks).

Andrew Schultz

Updated

•

21 years ago

Blocks: 273085

Boris Zbarsky [:bzbarsky]

Comment 14

•

21 years ago

This may have caused the regression in bug 273085

Brian Ryner (not reading)

Assignee

Comment 15

•

19 years ago

this should have been marked fixed when checked in

Status: NEW → RESOLVED

Closed: 19 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Use copy-on-write substrings for tokenizer

Categories

(Core :: DOM: HTML Parser, defect)

Tracking

()

People

(Reporter: bryner, Assigned: bryner)

References

Details

(Keywords: perf)

Crash Data

Security

(public)

User Story

Attachments

(2 files)

Description

Comment 1

Comment 2

Updated

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Updated

Comment 14

Comment 15

Attachment

General

Description

File Name

Content Type