Open Bug 493651 Opened 16 years ago Updated 1 year ago

SelectorMatches is Firefox hotspot on Zimbra Collaboration Suite Client running on Intel Architecture

Categories

(Core :: CSS Parsing and Computation, defect)

x86
Windows XP
defect

Tracking

()

People

(Reporter: mohammad.r.haghighat, Unassigned)

References

Details

(Whiteboard: [evang-wanted] [platform-rel-Intel])

Attachments

(11 files)

User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.1; MS-RTC LM 8) Build Identifier: 3.6a1pre Approximately, 40% of the total retired instructions in Firefox when running Zimbra Collaboration Suite client on Intel Architecture is in gklayout, 15% in xpcom_core, and 14% in js. Approximately, 30% of the layout (~13% of the overall) is in the function "SelectorMatches". This shows a good potential for performance improvement of the browser. Reproducible: Always Steps to Reproduce: 1. Download and install Zimbra server from http://www.zimbra.com/ 2. Need to have Zimbra's performance harness 3. Run the performanve harness on Firefox Actual Results: The attached CSV files capture a full run of Zimbra’s performance harness on a very recent version of Firefox running on Intel Atom processor, profiled by Intel VTune. For more information regarding this suite, refer to Zimbra pages, for example: http://www.zimbrablog.com/blog/archives/2008/09/lets-talk-speed-chrome-and-webkit.html 1. Module View: FirefoxModuleViewOnZimbra.csv 2. FirefoxHotspotViewOnZimbra.csv 3. FirefoxSourceViewOnZimbra.csv You can open these files in a spreadsheet and see the sampled profile. These results point to the function “SelectorMatches” in nsCSSRuleProcessor.cpp as the current hotspot of Firefox on ZCS. Moreover, the exact dynamic count of basic blocks of file nsCSSRuleProcessor.cpp is in the attached file nsCSSRuleProcessor_cpp.html. The file is enerated by Intel Compiler code-coverage tool. A brief explanation of the output. Below each basic block, the beginning of the basis block is marked by a ‘^’ sign and the total execution count of that basic block is presented. If a source construct (e.g., a macro) results in multiple BBs for the same source position, the sum of the counts are presented, and the number of BBs is shown in parentheses. For instance, from the following we see that SelectorMatches is called 60341089 times, while the “if” on line 1843 results in two basic blocks and their sum is 105M, i.e., SelectorMatches is called ~53M times from there. The next hot caller is in line 1885, where SelectorMatches is called ~6M times. 1199) static PRBool SelectorMatches(RuleProcessorData &data, ^ 60,341,089 1843) if (SelectorMatches(*data, selector, 0, nsnull, aForStyling)){ ^ 105,555,380 (2) 1885) if (SelectorMatches(*data, aSelector, 0, nsnull, PR_TRUE)) { ^ 5,576,907 The hottest call stack is captured in the file HotCallStack.txt. I started a mail thread about optimizations related to SelectorMatches, and several folks provided helpful comments that are captured here. - moh From Moh: Hi Brendan, I’ve been looking at Zimbra Client on the latest Firefox on both Core-2 as well as Atom using Zimbra’s performance harness I got from them. It turns out that 40% of the total retired instructions is in gklayout, 15% in xpcom_core, and 14% in js. 30% of the layout (~13% of the overall) is in a single function (SelectorMatches). I guess it facilitates style rule matching. Searching bugzilla and the web, it seems that over years, people have spent time optimizing this function, but none of them seems terribly recent, e.g.: https://bugzilla.mozilla.org/show_bug.cgi?id=66634 http://www.mail-archive.com/mozilla-layout@mozilla.org/msg00301.html It seems it would be worth investigating it again, and I’d like to do so if there’s any chance for improvement. Any starting points? Any || potentials? Thanks, - moh From Brendan: Thanks Moh -- good find. Cc'ing style folks and Andreas. /be From David Baron: There's been a lot of work over time to: 1. pull loop invariants out of SelectorMatches 2. call SelectorMatches less We've probably done what we can for (1), but there's more we could do for (2), e.g., https://bugzilla.mozilla.org/show_bug.cgi?id=479655 . I suspect there's also more we could do to optimize the function itself at a low level (e.g., what to check first, how to avoid mispredicted branches and cache misses, etc.). It might be useful to know what the distribution of call stacks was in the testcase. -David From Boris Zbarsky: Some quick notes: 1) Is the issue optimizing SelectorMatches, or number of calls to it? I'd be interested in knowing what the latter number is for your testcase. 2) Is the time spent _in_ SelectorMatches, or _under_ (e.g. creating the matching structs under SelectorMatchesTree)? 3) Do we have any data on where in SelectorMatches time is being spent on this testcase? The ordering of the various parts of SelectorMatches is certainly something we should look into. 4) Parallel selector matching came up at a talk recently. I've attached Leo's e-mails to me about that. -Boris
Component: General → Style System (CSS)
Product: Firefox → Core
QA Contact: general → style-system
In particular, note that the shortcut on line 1208 is not particularly effective in this case. Out of the ~60M calls that reach there, only ~7M of them return through the shortcut, while quite a good number of clocks are spent to do the necessary check.
> SelectorMatches is called 60341089 times That's the problem, right there. No amount of speeding up SelectorMatches is likely to help in the face of that! ;) The line 1843 caller is what's used to go up the tree to match selectors like "foo bar". The line 1885 caller is the entrypoint to selector matching. The 12:1 ratio of calls to entrypoints, with lots of calls via SelectorMatchesTree, is consistent with selectors of the form "#foo *", which force a walk to the root for all nodes that are not descendants of a #foo.... Are there many such selectors on this page? For that matter, would it be possible to attach the stylesheets from this page to this bug? More interesting, in some ways is the 5e6 entrypoints to selector matching. How many nodes does the page have? You can use this bookmarklet: javascript:function%20countNodes(root)%20{var%20count%20=%201;%20if%20(root.contentDocument)%20{%20count%20+=%20countNodes(root.contentDocument);%20}%20for%20(var%20i%20=%200;%20i%20<%20root.childNodes.length;%20++i)%20{count%20+=%20countNodes(root.childNodes[i]);}%20var%20xblNodes;%20if%20(root.nodeType%20==%20Node.ELEMENT_NODE)%20{xblNodes%20=%20document.getAnonymousNodes(root);}%20else%20{xblNodes%20=%20null;}%20if%20(xblNodes)%20{for%20(i%20=%200;%20i%20<%20xblNodes.length;%20++i)%20{count%20+=%20countNodes(xblNodes[i]);}}%20return%20count;}%20alert(countNodes(window.document)) to count them. Just copy/paste that into your url bar on the page in question. I'd be somewhat surprised if the answer is more than 20000 or so, honestly. But maybe this page is a lot bigger than I expected? The callstack is also interesting: it shows us reresolving style on a node with kids at least 13 levels deep, and the reason we're doing it is that someone changed style and then asked for computed style. Is the page possibly doing that sort of thing a whole bunch of times on some node near the root (e.g. the body).... Looking at the dynamic counts attachment, we have 5e6 calls to ContentEnumFunc, but only 283k calls to RulesMatching. So the page has somewhere on the order of 25-ish rules that are in the "slow matching rule" bucket, apparently... 283k is still a lot of attempts to match styles, which brings us back to the questions of how many nodes there are and what the page is doing.
Moh, I'm trying to understand the dynamic-count output. It looks like there are 60,341,089 hits in SelectorMatches and 7,435,641 hits in the |return PR_FALSE| inside that first if, but what does the "144,460,859 (5)" on the if itself mean? What are the five basic blocks? And why do they add up to more than the total count for SelectorMatches? By the way, only 7e6 exits there just means lots of selectors without a tagname in them, which would be consistent with what the rest of this is looking like... Looking further down in this function, and sorta assuming I understand what the multiple basic blocks thing means, it looks like a lot of the matching time is spent on the class selectors, so maybe the rules in question look more like ".foo *"?
Hi Boris, 144460859 is the sum total of the counts of the basic blocks that map to the source position corresponding to the "if" statement in line 1208, i.e., the complex conditional expression of the if statement. The expression is implemented in 5 basic blocks. It is possible that in some cases such a sum total becomes larger than the count of the entry block. The reason is that, in general, these blocks may not be mutually exclusive on the executed path, and more than one of these blocks may get executed on the same execution path. The reason why these counts might sometimes be usful (in spite of being confusing sometimes) is some constructs such as macros may be too large or may contain loops, and it would be good to have more information about them than just the # of calls to them. Your interpretation of "60,341,089 hits in SelectorMatches and 7,435,641 hits in the 1st return statement" is correct. From the comment in the shortcute "optimization : bail out early if we can", I thought this was an optional exit just as an optimization shortcut. But commenting that fragment caused a crash, so it seems not to be purely an option. Let me know if you need more information. - moh
> complex conditional expression of the if statement Aha. Gotcha. > I thought this was an optional exit just as an optimization shortcut Ah, no. The optimization is doing that check first, basically. The check needs to be done sometime. I'd dearly love to get my hands on the stylesheets involved and better yet on the script that's making that computed style call the stack in comment 5 shows...
(In reply to comment #7) > Are there many such selectors on this page? This is not just a single page. The test harness corresponds to Zimbra's performance tests on their collaboratuin suite's basic operations. An example chart is given here: http://www.zimbrablog.com/blog/archives/2008/09/lets-talk-speed-chrome-and-webkit.html Basically, it is an automated script using Selenium around such ZCS operations as logginig on, moving the mouse over various buttons, clicking certain folders, checking preferences, checking calendar, etc. -- what we typically do with a mail/calendar system. The harness times each such primitive operation separately. It is essentially a rendering experience of more than 1.5 minutes of AJAX events. I'll work on getting the stylesheet for you. Would need to check it with Zimbra.
Ah, I see. In that case, the numbers of calls might well be reasonable; depends on exactly what operations get performed.... It's hard to say whether we can reduce the number of calls without having some idea of what the JS is doing and what the CSS looks like. :( Bug 479655 might help there, of course. David, does anything obvious jump out at you in terms of possible ordering changes in SelectorMatches based on this data?
The average number of iterations of the main while loop of SelectorMatchesTree (line 1793) is 13. If this loop just inspects the data structures and there are no “global” side effects in the body of this loop and what gets called from there (i.e., the processing of each tree node), then there seems to be potentials for 13x gain through parallelism. Right? Here are more details: return through iter avg [min:max] sum hits %return iters {line 1840: !data} 14.596376 [1 : 33] 53397382 3658263 88.36% iters {line 1872: !GREEDY} 1.008398 [1 : 6] 389271 386029 9.32% iters {line 1877: all matched} 6.015172 [0 : 27] 575682 95705 2.31%
Possibly... Doing the ancestors or prev siblings in parallel might make sense for the greedy case, maybe...
Moh, would you be willing to apply this patch and retest with it? I'd be interested in the effect on the number of calls to SelectorMatches...
Sure. The patch doesn't apply cleanly. Will post the results after a refresh.
The patch should apply cleanly on top of m-c rev bc81d3337a1f. That was tip when I started working on it earlier today, but I bet other stuff landed since then. I can try to update it, but given checkin volume today it might be simpler to just update to that rev and apply it...
After cloning the latest version, the patch applies cleanly. Working on getting the numbers.
Hey Boris, The patch results in a good reduction (1.5x) in the number of calls to SelectorMatchesTree and SelectorMatches. The new coverage file is attached. Original total call to SelectorMatchesTree: 4,286,204 Original total call to SelectorMatches: 60,341,089 1843) if (SelectorMatches(*data, selector, 0, nsnull, aForStyling)) { ^ 105,555,380 (2) 1885) if (SelectorMatches(*data, aSelector, 0, nsnull, PR_TRUE)) { ^ 5,576,907 1938) if (SelectorMatches(*data, selector, 0, nsnull, PR_TRUE)) { ^ 2,528,012 (2) 1766) result = !SelectorMatches(data, negation, aStateMask, ^ 27,486 (3) Trial total call to SelectorMatchesTree: 2,783,365 Trial total call to SelectorMatches: 40,311,512 1843) if (SelectorMatches(*data, selector, 0, nsnull, aForStyling)) { ^ 69,597,120 (2) 1885) if (SelectorMatches(*data, aSelector, 0, nsnull, PR_TRUE)) { ^ 2,584,135 1938) if (SelectorMatches(*data, selector, 0, nsnull, PR_TRUE)) { ^ 2,514,230 (2) 1766) result = !SelectorMatches(data, negation, aStateMask, ^ 18,044 (3) I'll do a performance run of the test harness. - moh
Here are the speedup results of the patch on an Intel Atom system. The patch gives us about an overall 10% improvement on Zimbra's atomic tests. I ran the entire test suite three times with the original and the new versions, and there is very little variance in the results of each version. The S curve of the speedups looks good. Overall, 27 tests improve and 15 tests degrade in performance. The gains, however, are more than the losses. orig new spdup Test 0.484 0.544 0.89 Click Preferences_Signatures tab 1.251 1.397 0.90 Click Drafts folder.. 0.500 0.536 0.93 Click Preferences_Address Book tab 0.688 0.728 0.95 Click Preferences_Composing tab 0.637 0.672 0.95 Cal_FindLocations tab(2nd time).. 0.381 0.402 0.95 View Mail in Reading Pane... 0.603 0.631 0.96 Cal_FindResources tab(2nd time).. 1.284 1.328 0.97 Click Sent folder(2nd time).. 2.236 2.305 0.97 go to 'Preferences'.. 1.275 1.306 0.98 Click Preferences_Accounts tab 0.535 0.545 0.98 Click Preferences_Calendar tab 1.233 1.254 0.98 Click Junk folder(2nd time).. 2.063 2.095 0.98 Cal_Schedule tab.. 1.820 1.833 0.99 go to 'Documents'.. 1.003 1.008 0.99 Click Preferences_Mail tab 1.072 1.068 1.00 Click Junk folder.. 1.567 1.557 1.01 go to 'Tasks'.. 1.183 1.176 1.01 Click Preferences_Shortcuts tab 1.932 1.903 1.02 View mail with Reading Pane off... 1.063 1.045 1.02 Cal_Schedule tab(2nd time).. 1.116 1.091 1.02 Click Drafts folder(2nd time).. 1.052 1.028 1.02 go to 'Address Book'.. 0.646 0.627 1.03 Cal_FindAttendees tab(2nd time).. 1.092 1.060 1.03 Click Trash folder(2nd time).. 1.526 1.453 1.05 Open Mail Compose.. 1.144 1.080 1.06 Click Trash folder.. 1.326 1.247 1.06 Click Inbox folder(2nd time).. 1.120 1.044 1.07 Cal_FindAttendees tab.. 1.021 0.947 1.08 Cal_FindResources tab.. 1.215 1.123 1.08 Click Inbox folder.. 0.324 0.297 1.09 View another Mail in Reading Pane... 1.038 0.946 1.10 Cal_FindLocations tab.. 2.572 2.321 1.11 go to 'Calendar'.. 1.275 1.129 1.13 View another mail with Reading Pane off... 1.493 1.190 1.25 Click Sent folder.. 0.388 0.289 1.35 go to 'Address Book' 2nd time.. 0.303 0.220 1.37 Open Mail Compose(2nd time).. 0.260 0.189 1.37 go to 'Documents' 2nd time.. 0.308 0.215 1.43 go to 'Tasks' 2nd time.. 0.618 0.428 1.44 go to 'Mail'.. 0.459 0.315 1.45 go to 'Calendar' 2nd time.. 0.605 0.402 1.51 go to 'Mail' 2nd time.. 0.406 0.268 1.51 go to 'Preferences' 2nd time.. Here are the overall speedup statistics: Average: 1.09x Geomean: 1.08x Min: 0.89x Max: 1.51x The detailed spreadsheet is also attached. Note that the ultra high speedups tends to be in the very short running tests. - moh
OK. Sounds like we should work on fixing bug 479655, then: that patch gives an estimate of what we can expect the resulting speedup to look like... and it's nice, at least in this case.
Depends on: 479655
Er, actually that patch may be somewhat orthogonal to bug 479655. I've spun it off into bug 494117.
Depends on: 494117
Looking for further improvements, it turned out that there were significantly large number of L2 Cache misses in running Firefox on Zimbra client. Out of the total, 65M misses on my run, gklayout was the heavy hitter, responsible for 43% of the misses (28M out of 65M). In the new version (Boris’s patch), AttrubuteEnumFunc() accounts for 23% of gklayout misses. The misses were at the first statement of the function: data = aData->data; AttrubuteEnumFunc() is called from three loops. Value profiling of the trip counts show that some of these loops can get up to a couple of hundred iterations (700+). At each iteration, a selector is passed to the function. Different selectors’ data could be distant in memeory, resulting in L2 misses, which are expensive. The good thing is that we know all selectors at the beginning of each loop. So, we could prefetch the next selector at each iteration. I inserted SW prefetches before these three calls, and am getting good results. Essentially, all AttrubuteEnumFunc() misses get eliminated, resulting in a reduction of gklayout misses from 28M to 20M, and getting another 4% performance overall improvement. So, with the patch and this fix, we get an overall 13% improvement of Firefox on Zimbra. The details are in the attached spreadsheet. I'll try this experiment on other Intel plaforms as well. #include "xmmintrin.h" for(; iter != end; ++iter) { _mm_prefetch ((char*) &((*(iter+1))->mOperator), _MM_HINT_T1); AttributeEnumFunc(*iter, &data); } This optimization is independent of Boris’s patch and can be applied to the existing source as well. - moh
Boris’s patch improves Firefox performance on Intel Core2 as well as on Atom. The overall performance improvement of Zimbra’s performance suite is 1.12x. The numbers follow: spdup orig new Test 0.83 0.329 0.395 Click Preferences_Signatures tab 0.92 0.836 0.913 Click Junk folder.. 0.93 0.435 0.466 Click Preferences_Composing tab 0.94 0.962 1.021 Click Junk folder(2nd time).. 0.96 0.911 0.953 Click Trash folder(2nd time).. 0.97 1.038 1.073 Click Drafts folder.. 0.98 0.840 0.856 Click Drafts folder(2nd time).. 0.98 0.839 0.852 Click Trash folder.. 0.99 0.319 0.323 Click Preferences_Address Book tab 0.99 0.862 0.872 Click Preferences_Accounts tab 1.00 0.286 0.286 View another Mail in Reading Pane... 1.00 0.424 0.424 Cal_FindResources tab(2nd time).. 1.00 0.819 0.820 Click Sent folder(2nd time).. 1.00 1.366 1.366 go to 'Documents'.. 1.01 0.901 0.897 Click Inbox folder(2nd time).. 1.01 0.681 0.676 Click Preferences_Mail tab 1.01 1.771 1.751 go to 'Preferences'.. 1.02 1.500 1.470 Cal_Schedule tab.. 1.04 0.463 0.444 Cal_FindAttendees tab(2nd time).. 1.04 0.365 0.350 Click Preferences_Calendar tab 1.05 1.507 1.431 View mail with Reading Pane off... 1.06 0.712 0.675 Cal_FindResources tab.. 1.06 1.882 1.775 go to 'Calendar'.. 1.08 0.820 0.762 Cal_FindAttendees tab.. 1.08 0.727 0.673 Cal_Schedule tab(2nd time).. 1.09 1.039 0.956 Click Sent folder.. 1.09 0.791 0.725 go to 'Address Book'.. 1.09 0.474 0.434 Cal_FindLocations tab(2nd time).. 1.10 0.896 0.818 Click Preferences_Shortcuts tab 1.10 0.970 0.882 Click Inbox folder.. 1.10 1.236 1.120 Open Mail Compose.. 1.15 0.993 0.866 View another mail with Reading Pane off... 1.18 0.785 0.667 Cal_FindLocations tab.. 1.20 1.248 1.037 go to 'Tasks'.. 1.36 0.212 0.155 go to 'Documents' 2nd time.. 1.39 0.296 0.213 go to 'Address Book' 2nd time.. 1.43 0.462 0.324 View Mail in Reading Pane... 1.44 0.251 0.174 go to 'Tasks' 2nd time.. 1.51 0.497 0.329 go to 'Mail'.. 1.52 0.364 0.240 go to 'Calendar' 2nd time.. 1.53 0.481 0.314 go to 'Mail' 2nd time.. 1.55 0.326 0.210 go to 'Preferences' 2nd time.. 1.57 0.251 0.160 Open Mail Compose(2nd time).. 1.12x Average speedup 1.11x Geomean speedup 0.83x Min speedup 1.57x Max speedup The software prefetches that I mentioned in Comment #23, do not, however, improve the performance of Core-2 Duo in contrast to Atom, where they result in 4% overall improvement. The reason is that Atom has a smaller cache size (L1=24K, L2=512K) compared to Core-2 Duo (L1=32K, L2=4096K). In particular, note that 512K << 4096K. Thus, SW prefetching into L2 cache results in good gains on Atom. The attached spreadsheet captures the details. - moh
Adding Carmen and Leo to the CC List. - moh
GCC has a generic prefetch hint, which could be used to avoid the dependency on xmmintrin.h: for(; iter != end; ++iter) { __builtin_prefetch(&iter[1]->mOperator, /*write=*/0, /*locality=*/2); AttributeEnumFunc(*iter, &data); } I don't know if AttributeEnumFunc ever modifies the data at iter->mOperator; if so, the 0 should be changed to a 1. Of course this only trades an Intel dependency for a GCC dependency; perhaps we ought to have generic prefetch wrappers in XPCOM or NSPR.
> I don't know if AttributeEnumFunc ever modifies the data It does not. If the data is not already const here, it should be.
Performance improvemnts through paralllism tracked on a separate thread (bug #520942). https://bugzilla.mozilla.org/show_bug.cgi?id=520942
Whiteboard: [evang-wanted]
Attached file HotPar'10 Paper
We recently submitted a paper describing some of our work and results on parallelizing the CSS rule-matching component of Firefox to the 2010 USENIX Workshop on Hot Topics in Parallelism (http://www.usenix.org/event/hotpar10/cfp/). We were just informed that our paper (see attachment) was one of the 16 papers that were accepted. Special thanks to those of you who provided us the benchmarks we used and for answering our questions. Hope we did a fair job in the acknowledgment section :).
Both dependent bugs were just closed - we're doing better yere?
jst should have numbers+graphs in a few days, but the short version is "yes".
Hot diggity.
Summary: SelectorMatches is Firefox Hotspot on Zimbra Collaboratuin Suite Client running on Intel Architecture → SelectorMatches is Firefox hotspot on Zimbra Collaboration Suite Client running on Intel Architecture
The big boost from this change is in the "go to '*' 2nd time.." tests as visible here: http://spreadsheets.google.com/ccc?key=0AiY-NtfakhpidHlSeHdIWnJXOE1xY193ZnBuSGJWSEE&hl=en
Wow! Very nice improvement.
Whiteboard: [evang-wanted] → [evang-wanted] [platform-rel-Intel]
platform-rel: --- → ?
platform-rel: ? → ---
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: