Closed Bug 520942 Opened 15 years ago Closed 5 years ago

Parallelism Opportunities in CSS Selector Matching

Categories

(Core :: CSS Parsing and Computation, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: mohammad.r.haghighat, Unassigned)

Details

Attachments

(9 files, 1 obsolete file)

The purpose of this bug report is capturing the results of parallelism opportunity investigations in CSS rule-matching and tracking the related implementation progress.

This bug is also closely related to the bug #493651, which is about performance opportunities in CSS rule-matching in a more general context:  
https://bugzilla.mozilla.org/show_bug.cgi?id=493651
First, in order to assess the speedup potentials of parallelizing the SelectorMatchesTree process, we built a tracing mechanism based on VProf, the value-profiling package under the SpiderMonkey source tree. This mechanism achieves the following:

1. Determines which branch in the SelectorMatchesTree function is most often executed (for the Mozilla page load tests, as well as for the Zimbra suite, we saw that it was very heavily biased towards descendant selectors)
2. For each SelectorMatchesTree call, it yields the execution trace (we used notations to indicate the decisions taken in each iteration, e.g. "p" means that a match against the parent/ancestor was attempted, "n" means that there was no "cached" ancestor available, "b" means that a match against a sibling was attempted, "m" means that a match was found for the current simple selector, "1" means that the function finished because there were no more ancestors available to match against, "4" means that the function exited with a successful match). The results obtained with this tracing mechanism show that in the vast majority of cases, the matching process for a descendant selector results in a non-match. Here is a sample of the most frequent iteration execution patterns encountered when executing SelectorMatchesTree function calls for the Mozilla page load tests:

#Occurences Pattern               #Iterations   %iterations covered(cumulative)
  5561595     p3                          1             2.93
  1997994     ppppppppppn1               10            13.43
  1977238     pppppppppppn1              11            24.88
  1846126     pn3                         1            25.85
  1599769     ppppppppn1                  8            32.58
  1530957     pppppppppn1                 9            39.83
  1293911     ppppppppppppn1             12            47.99
  1187742     pppppppn1                   7            52.37
   840304     ppppppn1                    6            55.02
   619361     pppppppppppppn1            13            59.25
   550482     ppppppppppppppn1           14            63.31
   418940     pppppppppppppppn1          15            66.61
   382272     ppppppppppppppppn1         16            69.83
   310138     pppppn1                     5            70.65
   259582     pm4                         1            70.78
   224274     pnm4                        1            70.90
   216842     pppppppppppppppppn1        17            72.84
   184728     ppppppppppppppppppn1       18            74.59
   176367     pmppppppppppn1             11            75.61
   173219     4                           0            75.61
   156926     pmpppppppppn1              10            76.43
   152407     pppppppppppppppppppn1      19            77.96
   141351     pmppppppn1                  7            78.48
   138621     ppppppppppppppppppppn1     20            79.94
   127323     pppppppppppppppppppppn1    21            81.34
   125789     pmppppppppn1                9            81.94
   117422     ppppn1                      4            82.18
   105090     pmpppppppn1                 8            82.63
    96911     pppppppppppppppppppppppn1  23            83.80
    82972     ppppppppppppppppppppppn1   22            84.76
    82084     ppnpnpnpnpnpnpnpnpnpn1     11            85.23
    80338     ppnpnpnpnpnpnpnpnpn1       10            85.66


Based on the profiling&tracing results obtained, we bank on the majority of rule non-match behavior observed. We re-factored the selector matching algorithm into a parallel equivalent so that for each SelectorMatchesTree call, the iterations that carry out the matching process are distributed in chunks to and executed in parallel by an adjustable number of worker threads. There is some degree of speculation, as not all iterations might get executed; the speculative work that proves to be unnecessary is discarded. Furthermore, the code is also designed in such a way that the speculative work does not cause any unwanted side effects. 
As observed through profiling, in the majority of cases all the iterations need to be executed and having them execute in parallel would lead to considerable performance improvement. There is also a check by each worker thread to avoid unnecessary computations if another thread has already found a match. The parallel rule matching process we implemented depends on three adjustable parameters:
 
1. the number of worker threads available
2. the default workload size to be assigned to a worker thread for processing (i.e. no. of matching iterations) 
3. a threshold on the workload size that needs to be hit before enabling the parallel selector matching. 

The implementation is currently Pthreads based. We tested the performance of our parallelized prototype using the Mozilla page load tests on a Nehalem machine and with the best configuration out of the ones we tested (2 worker threads, a default workload size of 4 and minimum workload threshold of 28), we saw performance improvements in ~ 87% of the pages tested, with a maximum speedup of 1.84x. The first JPEG file attached shows the plotted performance results over the 390 web pages tested.
The second JPEG file attached shows the speedups achieved for the Zimbra Collaboration suite.
We're still working on refining the parallelizing algorithm and on cleaning up the code (we need to clean up the comments and take out the debugging code). The patch attached contains the current code we have.

Let us know if there's any more info that we can provide.

--
Carmen
+  *  Copyright 2009 Intel Corporation. All rights reserved.

Code contributed to Mozilla has to be under the MPL/GPL/LGPL tri-license as described in http://www.mozilla.org/MPL/license-policy.html .  Is Intel willing to contribute the SelectorThread.h/cpp code under that license?  (See the license headers for the other files in the directory for examples of what I'd expect the license header to look like.)

(Note that the changes to nsCSSRuleProcessor.cpp are already under that license.)
Another important basic question:  did you test this in an --enable-debug build?  There are many threadsafety invariants that are checked by assertions in debug builds; if this patch violates some of those, then that's an additional cost we'd need to consider (because in order to avoid crashing due to things like races to set reference count variables, we'd need to fix such bugs).


That said, this certainly looks interesting; I think you've discussed it a good bit more with Boris than with me, so his comments would probably be more useful than mine (since he has a better sense of your work).
Hi,

We didn't use --enable-debug, only --enable-debugger-info-modules=yes. This is the .mozconfig file used:

ac_add_options --enable-optimize
ac_add_options --disable-debug
ac_add_options --enable-debugger-info-modules=yes
ac_add_options --disable-accessibility
ac_add_options --enable-shared
ac_add_options --disable-libxul
ac_add_options --disable-tests
ac_add_options --disable-mochitest
ac_add_options --enable-application=browser
mk_add_options MOZ_CO_PROJECT=browser
 

--
Carmen

(In reply to comment #6)
> Another important basic question:  did you test this in an --enable-debug
> build?  There are many threadsafety invariants that are checked by assertions
> in debug builds; if this patch violates some of those, then that's an
> additional cost we'd need to consider (because in order to avoid crashing due
> to things like races to set reference count variables, we'd need to fix such
> bugs).
> 
> 
> That said, this certainly looks interesting; I think you've discussed it a good
> bit more with Boris than with me, so his comments would probably be more useful
> than mine (since he has a better sense of your work).
(In reply to comment #5)
> Is Intel willing to contribute the SelectorThread.h/cpp code under that license?

Sure. We have our legal approval for doing so, and have already done it in the
case of VProf. 

Carmen, please look at the the license block from VProf, copy it into the newly
introduced files, and resubmit the patch.

David, thanks for the catch.

-moh
So from a quick skim this might in fact be ok threadsafety-wise due to the limited range of checks SelectorThread::parallelSelectorMatches performs (namespace, tag, id, class).  Would need to check that carefully, of course, but it should all just be atom and integer compares.  And you have a check up front to restrict the parallel matching to cases where SelectorMatchesTree is called with a selector with null mNext, mOperator being the descendant combinator, no negations, pseudo-classes, or attr selectors...

So the basic approach looks sound (and the operating assumption that selectors of the form "foo bar" rarely match on the "foo" part is correct).

Which OSes was this tested on?  I assume Linux given the pthreads usage?  I wonder whether NSPR threads would give similar results (and make the code more portable)...
Oh, another question, for Carmen.  How much time do you have to spend on this, in terms of the NSPR threads question above, cleaning up some of the code duplication introduced in SelectorMatchesTree, etc?  That is, are you planning to do that sort of thing yourself, up through checkin, or do you want someone to help get the patch into shape to check in?
Oh, and the other key property that makes this safe is that the main thread blocks on the worker threads' matching work, so there's no danger of the objects involved going away or anything like that.

(In reply to comment #9)
> So from a quick skim this might in fact be ok threadsafety-wise due to the
> limited range of checks SelectorThread::parallelSelectorMatches performs
> (namespace, tag, id, class).  Would need to check that carefully, of course,
> but it should all just be atom and integer compares.  And you have a check up
> front to restrict the parallel matching to cases where SelectorMatchesTree is
> called with a selector with null mNext, mOperator being the descendant
> combinator, no negations, pseudo-classes, or attr selectors...
> 
Yes, that's what we've done. We did profile these properties too (e.g. how often are there no negations, no attributes, etc.) and the results showed that it was rare to have negations or attributes, for example. 

> So the basic approach looks sound (and the operating assumption that selectors
> of the form "foo bar" rarely match on the "foo" part is correct).
> 
> Which OSes was this tested on?  I assume Linux given the pthreads usage?  I
> wonder whether NSPR threads would give similar results (and make the code more
> portable)...

So far we've tested it on Windows only. For the implementation we used the Pthreads-w32 API.

--
Carmen
I've just resubmitted the patch with the license information modified for the SelectorThread files. David, Boris, could you please take a look and let me know if you think the licensing info is fine?

Thank you.
--
Carmen

(In reply to comment #8)
> (In reply to comment #5)
> > Is Intel willing to contribute the SelectorThread.h/cpp code under that license?
> 
> Sure. We have our legal approval for doing so, and have already done it in the
> case of VProf. 
> 
> Carmen, please look at the the license block from VProf, copy it into the newly
> introduced files, and resubmit the patch.
> 
> David, thanks for the catch.
> 
> -moh
(In reply to comment #10)
> Oh, another question, for Carmen.  How much time do you have to spend on this,
> in terms of the NSPR threads question above, cleaning up some of the code
> duplication introduced in SelectorMatchesTree, etc?  That is, are you planning
> to do that sort of thing yourself, up through checkin, or do you want someone
> to help get the patch into shape to check in?

Right now I am working on improving the thread workload management and I also need to test it against the currently functional version to see if the  performance improves. That and cleaning up the code should take me about a week. I hadn't thought about how to check it in yet :). Let's see how Moh thinks we should proceed.

Thanks :).
--
Carmen
Boris,

We'd love to work and get all the possible issues and concerns resolved, but we'd need to know what those issues are in the first place ;) I.e., we'd need your help in telling us what needs to be done. I think knowing the process would also help us in other parallelism related work we might do in future.

-moh
License looks fine to me.  David?

As far as threading goes...  Benjamin thinks (and I agree) that using NSPR threads here is less pain than trying to make sure pthreads works across the board with shipping Firefox on typical users' machines.

You will probably also want to use the NSPR locking primitives, or our C++ wrappers on top of them.  See xpcom/glue/CondVar.h and xpcom/glue/Mutex.h for a start.

I haven't read the details of the thread code, but for the nsCSSRuleProcessor changes, with this patch there are three copies of the main matching loop.  There should be no need for that; one copy should be quite sufficient, right?  The loop to determine |n| is unnecessary; just check tempSelector->mNext; if we ever want to test for anything but n == 1 we can modify the check at that point (say by storing the number of following selectors in the chain in each selector at parse time).

The "assume 128 is large enough" code needs to be fixed to not assume that.  I suggest using an nsAutoTArray instead of a C array.

The code in parallelSelectorMatches can be cleaned up (e.g. the commented-out bits removed, the aForStyling argument removed, the unused setNodeFlags local removed).

The debug print statements should perhaps be under DEBUG_PARALLEL_MATCHING instead of C_DEBUG?

General whitespace issues (use of tabs etc) also need addressing, but that can happen towards the end if desired.
Boris, thank you so much for the helpful suggestions. I'll go ahead and work on getting the patch code into shape.

--
Carmen

(In reply to comment #17)
> License looks fine to me.  David?
> 
> As far as threading goes...  Benjamin thinks (and I agree) that using NSPR
> threads here is less pain than trying to make sure pthreads works across the
> board with shipping Firefox on typical users' machines.
> 
> You will probably also want to use the NSPR locking primitives, or our C++
> wrappers on top of them.  See xpcom/glue/CondVar.h and xpcom/glue/Mutex.h for a
> start.
> 
> I haven't read the details of the thread code, but for the nsCSSRuleProcessor
> changes, with this patch there are three copies of the main matching loop. 
> There should be no need for that; one copy should be quite sufficient, right? 
> The loop to determine |n| is unnecessary; just check tempSelector->mNext; if we
> ever want to test for anything but n == 1 we can modify the check at that point
> (say by storing the number of following selectors in the chain in each selector
> at parse time).
> 
> The "assume 128 is large enough" code needs to be fixed to not assume that.  I
> suggest using an nsAutoTArray instead of a C array.
> 
> The code in parallelSelectorMatches can be cleaned up (e.g. the commented-out
> bits removed, the aForStyling argument removed, the unused setNodeFlags local
> removed).
> 
> The debug print statements should perhaps be under DEBUG_PARALLEL_MATCHING
> instead of C_DEBUG?
> 
> General whitespace issues (use of tabs etc) also need addressing, but that can
> happen towards the end if desired.
Great!  If you have any questions, feel free to ask here, or mail me directly, or ask on irc.mozilla.org in the #developers channel!
Comment on attachment 405222 [details] [diff] [review]
Parallelized selector matching patch - updated w/ license information

Some drive-by comments:

>diff -crBN mozilla-central/layout/style/SelectorThread.cpp mozilla-central-pSMT/layout/style/SelectorThread.cpp
>*** mozilla-central/layout/style/SelectorThread.cpp	Wed Dec 31 16:00:00 1969
>--- mozilla-central-pSMT/layout/style/SelectorThread.cpp	Wed Oct  7 21:51:00 2009
>
>+ SharedData* SelectorMainThread::startOrWakeup(int mode, int _availableAncestorIndex, RuleProcessorData** _ancestors, FILE* _f, nsCSSSelector* _initialSelector)
>+ {

[snip]

>+ 	for (int j = 0; j < i; j++)
>+ 	{
>+ 		sharedData->sems[j]->post();

Barrier here would be more efficient in the common case (threads already created), since you would only need to make one system call to awaken them rather than NTHREADS calls.

[snip]
>+ 	sharedData->mainSem->wait();
>+ 

Why is the main thread idle-waiting here?  It would be better for it to also participate in the selector matching; you'd save a TCB.

>+ 
>+ //------------------------------------------------------------------------------------------------------------
>+ 

>+ void *SelectorThread::run()
>+ {

[snip]

>+ 	while (true)
>+ 	{
>+ 		sharedData->sems[index]->wait();

Barrier sync with the main thread and other worker threads here would be better, IMHO.

>+ 		for (i = tData->startAIndex; i <= tData->endAIndex; i++)
>+ 		{
>+ 			
>+ 			//if  (i < 0) break;
>+ 			if ((sharedData->found) && (sharedData->foundIndex < i)) break;
>+ 			tData->ancestor = sharedData->ancestors[i];
>+ 			//tData->currentAIndex = i;
>+ 
>+ 			result = parallelSelectorMatches(aForStyling);
>+ 			if (result)
>+ 			{
>+ 				sharedData->mainLock->acquire();
>+ 				if (!sharedData->found)
>+ 				{
>+ 					sharedData->found = true;
>+ 					sharedData->foundIndex = i;
>+ 				}
>+ 				else
>+ 				{
>+ 					if (i < sharedData->foundIndex)
>+ 					{
>+ 						sharedData->foundIndex = i;
>+ 					}
>+ 					//else, there's already an earlier match, so this match is irrelevant
>+ 				sharedData->mainLock->release();

Yeah ... you don't want to lock/unlock |mainLock| in every loop iteration :).  I'd recommend making found/foundIndex thread-private variables, and have the main thread compute the min foundIndex of all threads (including itself) after the barrier below.  You could bring back this "speculation failed" optimization with s/sharedData->foundIndex/sharedData->approximateFoundIndex/ and do the comparison and update with the thread-local foundIndex outside of the mutex.  That being racy is OK: having sharedData->approximateFoundIndex not decrease monotically doesn't affect the correctness of the algorithm.

>+ 		sharedData->mainLock->acquire();
>+ 
>+ 		sharedData->WaitForThreads--;
>+ 		if (sharedData->WaitForThreads == 0)
>+ 		{
>+ #ifdef C_DEBUG
>+ 				fprintf(f, "thread #%d->run(): WaitForThreads = %d, everyone's gone to sleep...\n", this->getIndex(), sharedData->WaitForThreads);
>+ #endif
>+ 			sharedData->mainSem->post();
>+ 		}
>+ 		sharedData->mainLock->release();

This is sort of a proto-barrier.  I'd make it an actual barrier; when the main thread is also doing selector matching, the main thread would also want to hit that barrier like all the other worker threads.
Adding Leo.
Attached file HotPar'10 Paper
HotPar'10 Paper
We recently submitted a paper describing some of our work and results on parallelizing the CSS rule-matching component of Firefox to the USENIX Workshop on Hot Topics in Parallelism (<a href="http://www.usenix.org/event/hotpar10/cfp/">HotPar '10</a>). We were just informed that our paper was one of the 16 papers that were accepted. Special thanks to those of you who provided us the benchmarks we used and for answering our questions. Hope we did a fair job in the acknowledgment section :).
Hi everyone,

Attached is the patch containing the implementation of the parallel selector matching process using NSPR instead of Pthreads. It takes into account most of the suggestions that Boris made (e.g., using NSPR constructs, cleaning up comments, using an nsAutoTArray to remove the previous size limitations, etc.).

I've tested this implementation on two systems running Windows (Nehalem and Atom), and in both cases the performance speedups observed are slightly worse than the ones observed when experimenting with the previous Pthread implementation. Attached JPEGs show the performance results of the NSPR implementation for the Mozilla page load tests on Nehalem and Atom, respectively.

With the NSPR implementation, using the Mozilla page load tests on Nehalem, we get  an average speedup of 2x (0.49x to 14x, mean = 2x), with ~75% of the benchmarks showing improvement or not being affected performance wise. With our initial Pthreads implementation, we obtained an average speedup of 1.08x  (0.51x to 1.84x, mean = 1.08x) with ~ 87% of the benchmarks showing improvement or not being affected performance wise. The obtained speedups depend on the amount of work needed for CSS rule matching.

With the NSPR implementation, using the Mozilla page load tests on Atom, we get an average speedup of 1.03x (0.67x to 1.5x, mean=1.03x) with ~69% of the benchmarks not being affected performance wise. Note that while Nehalem is multi-core, the Atom that I used for performance testing is single core with hyperthreading.  

I also have another patch (see attached NSPR-MTWT patch file), built upon the NSPR one, where the main thread also joins in the parallel rule matching process, carrying out the workload that a worker thread would do, instead of sitting idle waiting for the worker threads to finish as before. However, I still have not tested this implementation in terms of performance, only in terms of functionality using the Mozilla page load tests, but expect that the performance should improve, at least slightly.

Please feel free to take the code and test it or modify it as you see fit. It would definitely need to be adapted to the latest trunk as this was built on a older version of the Firefox source code.

--
Carmen
The bugzilla diff view doesn't display the patch for me for some reason. The raw output is fine.
Andreas, I bet the built-in bugzilla thing just fails because the patch doesn't apply to m-c tip.

Carmen, thanks for following up on this!  The patch will definitely need to be merged to tip and ideally modified so that we don't have the two different copies of SelectorMatchesTree around (needs more ifdefs, possibly, but makes it less likely that they'll get out of sync); do you want to do that or do you want someone else to pick that up?  If you want I might do at least a preliminary merge to tip just so I can get some measurements on a testcase or two I have lying around here....
Actually, looks like that patch creates _three_ separate copies of SelectorMatchesTree.  That's really not desirable at all, and reading more carefully should be easy enough to avoid.
And does SelectorThread still need windows.h?  Seems to compile fine for me (on Mac) without it.
SharedData::SharedData seems to sprintf into a random memory location (str1 is never actually set to point to anywhere).  That doesn't look right.
Hi Boris,


I’d have loved to complete the work to the end. However, I am in the process of finishing up my PhD dissertation and preparing for my final defense which I expect to happen sometime in summer. I’m thus worried that my continuing the work on this patch will further delay my graduation. Please feel free to take the code and test/modify it as you see fit or assign it to someone else. I will definitely be interested to follow this project's further evolution.

--
Carmen
Carmen, sounds good.  I know how that defense thing works.  ;)

I'll get this merged to trunk (probably starting with the MTWT version, right?) and see how things look.  Thanks for working on this!
Hi Boris,

Yes, i think that's the version you want. It's the one where the main thread joins in carrying out the CSS selector matching process instead of sitting idle waiting for the worker threads to complete the work.
Thanks again for all your help :).

--
Carmen

(In reply to comment #34)
> Carmen, sounds good.  I know how that defense thing works.  ;)
> 
> I'll get this merged to trunk (probably starting with the MTWT version, right?)
> and see how things look.  Thanks for working on this!
Attached patch Merged to tipSplinter Review
This is the "NSPR version with main thread as worker thread" patch merged to tip and with the redundant SelectorMatchesTree stuff removed (as well as a few minor changes to SelectorMatchingTree callsites to allow it to assume that aSelector is non-null and reindentation of the new code to be 2-space indented, no tabs).  This doesn't eliminate the 3x duplication of the SelectorMatches prefix (namespace, tag, id, class match); the best way to eliminate that will depend on how we end up splitting up the compilation units here.

Chris, if you want to play with this, it should apply fine to m-c tip.  I tried it on the following task:

1)  Load http://www.whatwg.org/specs/web-apps/current-work/
2)  Wait for all loading work to be done.
3)  Run javascript:%20var%20docEl%20=%20document.documentElement;%20docEl.style.display%20=%20"none";%20document.body.offsetWidth;%20var%20start%20=%20new%20Date();%20docEl.style.display%20=%20"";%20window.getComputedStyle(document.body,%20"").fontSize;%20var%20end%20=%20new%20Date();%20alert(end%20-%20start)

and I see a slowdown from abou 1425ms to about 1650ms (both with NUM_WORKERS set to 1 and 2) on a Core 2 Duo Macbook Pro.
I filed bug 631527 on a more ambitious plan to parallelize over entire subtrees, not over matching of a single selector against a single node.

Stylo does parallel selector-matching. I think we can close this.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: