Use copy-on-write substrings for tokenizer

RESOLVED FIXED

Status

()

Core
HTML: Parser
RESOLVED FIXED
14 years ago
12 years ago

People

(Reporter: Brian Ryner (not reading), Assigned: Brian Ryner (not reading))

Tracking

({perf})

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Assignee)

Description

14 years ago
Often in the tokenizer code, the situation arises where most of the time, a
token will want the unmodified text from the scanner buffer as its text value,
but occasionally, it needs to modify it somehow.  Currently this would require
having both an nsScannerSubstring and an nsString, which is bloaty and hard to
manage.

What I'd like to do instead is implement a new class that wraps a portion of the
scanner buffer in a nsDependentSubstring, doing the appropriate refcounting on
the buffers so that they stay alive.  If you want to write to the string, the
data is copied and the reference to the scanner buffer is released.

Quantify data from this change is very promising.  I ran a profile before and
after with it focused on the ResumeParse subtree.  Since this change affects
both allocation and string copying, I added up all of the time spent in malloc,
memcpy, memmove, and free on the before and after profiles.  It appears that
with this change we spend 70% less time total in these functions combined. 
Also, I saw roughly 1,000 fewer calls to malloc.  The Quantify run used the
netscape.com page on the pageload test with a fresh profile.

This is also somewhat related to bug 269054.  After thinking this over, I like
this solution better than what's proposed in that bug.  If we allowed arbitrary
types of strings to suddenly become dependent, with no way of telling that this
happened, that's going to be a nightmare for callers.  In my opinion, we should
make it continue to be the case that all of the string classes except for
nsDependent* have an owning reference to their data.
(Assignee)

Comment 1

14 years ago
Created attachment 165931 [details] [diff] [review]
patch

One other thing... bz had some concerns that changing the sizes of the token
classes would cause the token allocator to become less efficient.  I'm open to
adjusting that, but in the profile I ran, we had no problems with running out
of buckets... it fell through to PL_ARENA_ALLOCATE only 2 or 3 times.
(Assignee)

Comment 2

14 years ago
(In reply to comment #1)
> adjusting that, but in the profile I ran, we had no problems with running out
> of buckets... it fell through to PL_ARENA_ALLOCATE only 2 or 3 times.

Ok, let me correct that.  It called AddBucket 3 times.  It called
PL_ArenaAllocate 41 times, out of 4,993 calls to CreateTokenOfType.
(Assignee)

Updated

14 years ago
Attachment #165931 - Flags: superreview?(darin)
Attachment #165931 - Flags: review?(bzbarsky)

Comment 3

14 years ago
nsScannerSharedSubstring maybe?

one quick note: should the nsTDependentSubstring ctors use a method other than
Rebind to avoid having to initialize mData, mLength, and mFlags twice, etc.?
(Assignee)

Comment 4

14 years ago
(In reply to comment #3)
> nsScannerSharedSubstring maybe?
> 
> one quick note: should the nsTDependentSubstring ctors use a method other than
> Rebind to avoid having to initialize mData, mLength, and mFlags twice, etc.?

I guess they could, if it's a big deal.  I figured the function call overhead
would outweigh any savigs from not initializing those members.
So... I'll give this a closer look a bit later, but for starters it looks like
none of our preallocated buckets (see
http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsDTDUtils.cpp#1253)
will be the right size for the whitespace token with your change...

More importantly, the whitespace tokens will no longer share a bucket with end,
cdata, entity, etc tokens.  So we'll have an extra bucket floating about...

Again, I'm not sure how much that matters.  If we decide it does, it actually
looks like end tokens could use the same setup pretty easily, but entity tokens
certainly could not.

Perhaps the right answer is to just add another bucket to the preallocated set?
(Assignee)

Comment 6

14 years ago
Right, I didn't bother making any changes for start and end tokens since they'll
generally just have converted their tag to an id and we'd be wasting space.

nsFixedSizeAllocator does add a bucket the first time you try to allocate an
object of a size you haven't already added a bucket for.  So preallocating the
bucket would certainly make sense given the current code.  I don't know if it's
a problem that we now have an additional bucket... as I said, it seems to have a
pretty good hit rate for the arena but I'm not sure what to expect.
(Assignee)

Comment 7

14 years ago
Comment on attachment 165931 [details] [diff] [review]
patch

Trying to lighten boris's load a bit... peterv, can you look at this?
Attachment #165931 - Flags: review?(bzbarsky) → review?(peterv)

Comment 8

14 years ago
Comment on attachment 165931 [details] [diff] [review]
patch

looks good to me.

sr=darin provided peterv is happy with the patch.
Attachment #165931 - Flags: superreview?(darin) → superreview+
Comment on attachment 165931 [details] [diff] [review]
patch

- In nsScanner::ReadUntil():

+  while (current != mEndPosition) {
[...]
+      while (*setcurrent) {
+	 if (*setcurrent == theChar) {
+	   goto found;
+	 }
+	 ++setcurrent;
      }
+    }
+    
+    ++current;
+  }

+  // If we are here, we didn't find any terminator in the string and
+  // current = mEndPosition
+  SetPosition(current);
+  AppendUnicodeTo(origin, current, aString);
+  return Eof();
+
+found:
+  if(addTerminal)
+    ++current;
+  AppendUnicodeTo(origin, current, aString);
+  SetPosition(current);
+
+  //DoErrTest(aString);
+
+  return NS_OK;

You could completely eliminate that goto if you simply replace the goto with
the code after the lable.

r=jst with that changed.
Attachment #165931 - Flags: review?(peterv) → review+
(Assignee)

Comment 10

14 years ago
Created attachment 166978 [details] [diff] [review]
final patch with jst's comments addressed
Brian, could you add that other bucket in nsDTDUtils?

Comment 12

14 years ago
This commit have added

+parser/htmlparser/src/nsScannerString.cpp:439
+ `nsScannerBufferList*bufferList' might be used uninitialized in this function

on brad TBox
That's pretty obviously a bogus warning (bufferList is used if and only if
sameBuffer is true, in which case it will be set; there's no way the value of
sameBuffer can change between those two blocks).

Updated

14 years ago
Blocks: 273085
This may have caused the regression in bug 273085
(Assignee)

Comment 15

12 years ago
this should have been marked fixed when checked in
Status: NEW → RESOLVED
Last Resolved: 12 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.