Closed Bug 329974 Opened 18 years ago Closed 18 years ago

Use static data for common whitespace values in textnodes

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: sicking, Assigned: sicking)

References

Details

Attachments

(3 files)

There is a lot of makup out there that looks like

 <foo>
  <bar>...</bar>
  <baz>
 </foo>

This produces textnodes that contains values like "\n  " and "\n    ". It should be possible to detect this and instead of allocating new string data for each of these nodes instead let them point into a single buffer that we allocate on startup. Since nsTextFragment doesn't use null terminated data we can use a single string like "        \n        " and then just set start and length as appropriate.

I'm currently running some stats to determine what types of values are common enough that this is useful. Once I have that i'll attach a patch.
I went to just over 500 pages (spider starting with alexa top 500 until we crash). This resulted in 391031 SetTo calls in nsTextFragment. Out of these 227593 calls were for whitespace-only strings. Below are the resulting stats, n=newline, s=space, t=tab.

86872 - n
10601 - s
8313 - nt
7388 - nn
7357 - ntt
6773 - nss
6197 - nssssssss
5774 - nssss
5143 - nssssssssss
5128 - nttt
3929 - ntttt
3533 - nssssssssssss
3517 - nssssssssssssssss
3396 - nssssss
2835 - ns
2511 - nttttt
2216 - nssssssssssssss
1929 - nssssssssssssssssssssss
1815 - nssssssssssssssssss
1765 - ntttttt
1720 - nnn
1708 - nssssssssssssssssssss
1591 - sn
1514 - nssssssssssssssssssssssss
1219 -
1185 - snssssssssssss
1131 - nsss
1033 - nsssss
993 - ss
961 - nttttttt
800 - nssssssssssssssssssssssssssss
786 - snssssssssssssssssss
777 - snssssss
752 - nsssssssssssssss
743 - nsssssssssssssssssss
715 - nnt
688 - snssssssssssssssssssssssss
674 - ntttttttt
603 - snsssssssssssssss
603 - nssssssssssssssssssssssssss
507 - nnnn
475 - sss
466 - nntttttttttttt
457 - nttttttnttttttt
449 - snssssssssss
435 - nnttttttttttt
414 - nsssssssss
406 - nnss
373 - nssssssssssssssssssssssssssssssss
369 - snssss
365 - snssssssss
352 - nnssss
345 - nntt
339 - snssssssssssssssssssss
339 - nnssssssss
313 - ntttttttttt
313 - nttttttttt
277 - nssssssssssssssssssssssssssssss
276 - nsssssss
273 - snssssssssssssss
262 - snss
257 - nntttttttttt
251 - nssssssssssssssssssssssssssssssssss
250 - nnssssss
246 - ntnt
246 - nnssssssssssss
240 - ntss
237 - nntttttttttttttt
236 - snssssssssssssssssssssssssss
235 - ntssss
232 - nnttttttttttttttt
231 - snsssssssss
229 - snssssssssssssssss
228 - ssss
I tried some various strategies of which types of whitespace nodes to use static data for. My initial strategy (A) was to match patterns that started with 1-7 newlines followed by 0-50 spaces or tabs (not a mixture), or just 0-50 spaces. This matches 204664 out of 227593 nodes.

If we allow a single space before the newlines it matches 214985 strings.

Allowing 0-10 spaces before the newlines matches 215442 strings.

If we match 0-50 tabs in addition to 0-50 spaces that adds an additional 500-something strings.

I also tried to reduce the number of newlines to match. Just matching 1-3 looses some 1200 strings, 1-2 an additional 2200 and just a single newline 14000.
Sicking, I was wondering what the data looks like if you include the unicode char produced by &nbsp; in "whitespace"...  Would that require another spider run, or can it be extracted from whatever data you logged?
I'd have to make another run. What pattern that includes &nbsp; do you suspect is common?

I did note that out of the total 391031 values 12927 were a single non-whitespace char. If a significan portion of these chars have a low unicode value it would probably be worth having a static array for them too.
> What pattern that includes &nbsp; do you suspect is common?

&nbsp;&nbsp;&nbsp;....

(repeated some number of times).  Possibly with a leading \n.

I looked into which characters are common for single-char nodes. Seems like pretty much only characters 1-255 appear often enough on their own to warrent searching for them (well, 1-183 really, but it's cheap enough to do the rest).

The only other charater that occurs pretty often on its own is 12288 which appears to be some sort of whitespace and 12539 which is a dot. However they appear as 0.4 resp. 0.2 per cent of the total number of textnodes so I don't really think it's worth it.
So I ran some tests that included &nbsp; characters. This gave the following result out of 1030005 nodes:

25189  /^\n{1,7}$/s
4093   /^ \n{1,7}$/s
119615 /^\n{1,7} {1,50}$/s
14155  /^ \n{1,7} {1,50}$/s
92678  /^\n{1,7}\t{1,50}$/s
3526   /^ \n{1,7}\t{1,50}$/s
285    /^\n{1,7}\x{a0}{1,50}$/s
7      /^ \n{1,7}\x{a0}{1,50}$/s
7353   /^ {2,50}$/s
1092   /^\t{2,50}$/s
2319   /^\x{a0}{2,50}$/s
664    /^\x{a0}\n{1,7}( {0,50}|\t{0,50}|\x{a0}{0,50})$/s
167    /^ \x{a0}{1,50}$/s
4      /^ {2,10}\x{a0}{1,50}$/s

Note that all these patterns should be mutually exclusive. This also does not include any single-char nodes.

While &nbsp; on its own is more common then tabs, I don't think searching for either is worth it. After newlines tabs are definitely worth it, however &nbsp; is not.
This is the perl program i've used to gather the stats. I've put the compressed log on http://people.mozilla.com/~sicking/data/bug329974_log.bz2
Attached patch Patch to fixSplinter Review
This patch makes us share all 8-bit single-char node values, as well as nodevalues matching the expression:

/^ ?\n{0,7}( {0,50}|\t{0,50})$/

Those two together matches just over 60% of all textnodes!


I also got sucked into cleaning nsTextFragment up a bit. There were way too much dead code in there to just let it pass.
Attachment #215447 - Flags: superreview?(jst)
Attachment #215447 - Flags: review?(jst)
I just thought of a minor change that would be good to include. Rather then having an initial 

if (firstChar == ' ' || firstChar == '\n' || firstChar == '\t')

I should make that

if (firstChar <= ' ' &&
    (firstChar == ' ' || firstChar == '\n' || firstChar == '\t'))

That way the first test will most likly be the one that decides which way we go and we'd save a few cycles. And even better is

if (firstChar <= ' ' &&
    (firstChar == '\n' || firstChar == ' ' || firstChar == '\t'))

since newline is the most common firstChar of the three
> +    NS_ERROR("Could not initialize nsAttrValue");

You mean nsTextFragment, right?
Comment on attachment 215447 [details] [diff] [review]
Patch to fix

- In nsTextFragment::operator=(const nsTextFragment& aOther):

+    else if (aOther.mState.mIs2b) {
+      m2b = NS_STATIC_CAST(PRUnichar*,
+        nsMemory::Clone(aOther.m2b, aOther.mState.mLength * sizeof(PRUnichar)));
+    }
+    else {
+      m1b = NS_STATIC_CAST(char *,
+        nsMemory::Clone(aOther.m1b, aOther.mState.mLength * sizeof(char)));
     }

Combine that to:

+      m1b = NS_STATIC_CAST(PRUnichar*,
+        nsMemory::Clone(aOther.m2b, aOther.mState.mLength * (sizeof(char) << aOther.mState.mIs2b)));

? :) Or combine them to:

+      m1b = NS_STATIC_CAST(PRUnichar*,
+        nsMemory::Clone(aOther.m2b, aOther.mState.mLength * (aOther.mState.mIs2b ? sizeof(PRUnichar) : sizeof(char)));

r+sr=jst either way.
Attachment #215447 - Flags: superreview?(jst)
Attachment #215447 - Flags: superreview+
Attachment #215447 - Flags: review?(jst)
Attachment #215447 - Flags: review+
Checked in. Unfortunately there was no win in any perf numbers :(
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Blocks: 332184
No longer blocks: 332184
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: