Closed Bug 996842 Opened 10 years ago Closed 10 years ago

Improve parallel performance of convolution workload


(Core :: General, defect)

31 Branch
Windows 7
Not set





(Reporter: jaswanth.sreeram, Unassigned)




(1 file, 3 obsolete files)

Attached file convolve-test.html (obsolete) —
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36

Steps to reproduce:

Open the attached test case in the latest Nightly (31.0a1 2014-04-15)

Actual results:

[Parallel] : 5 iterations took 1201ms. Average is 4.16 fps

[Sequential] : 5 iterations took 52ms. Average is 96.15fps 

Expected results:

Parallel version should run faster than sequential on my 4-core (8 h/w threads) machine.
User agent automaticelly inserted above is incorrect. It should be:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
Attached file convolve-test.html (obsolete) —
Edited attachment to fix bug.
Attachment #8407115 - Attachment is obsolete: true
Fixing a bug in the test case and bumping number of iterations to 50 produces the following:

[Parallel] : 50 iterations took 11734ms. Average is 4.26 fps

[Sequential] : 50 iterations took 2394ms. Average is 20.89fps
A few obvious things:

- Looks like buildPar has only a sequential implementation in js/src/builtin/TypedObject.js,
  so we get no benefits from using buildPar

- Looks like the parallel version of the benchmark generates a new matrix for each iteration
  while the sequential version reuses the memory, so that skews the result - to at least get to
  an even playing field the sequential version should allocate one grid per outer loop

- The inner loop of BuildTypedSeqImpl has a lot of activity, and does not have a fast path
  for when the element type is simple - having that would help a lot on raw performance.  In
  general the use of push/pop on the index array is more work than just setting an element.
  The use of apply to spread the arguments to the call is probably very expensive.
Attached file convolve-test-ArrayofPixels.html (obsolete) —
Modified the test case as follows:
1) use fromPar instead of buildPar
2) In the sequential case, allocate a new result for run
3) Change type of color values from uint8Clamped to int32
Attachment #8407126 - Attachment is obsolete: true
I have modified the test case as follows:
1) Use |fromPar| instead of |buildPar|
2) In the sequential case, allocate a new result for each iteration as per Lars's comment in #4.
3) Changed type of color values from uint8 to int32.

With this modified program, I see the following on my Nightly (2014-04-24):

[Parallel] : 5 iterations took 1377ms. Average is 3.63 fps

[Sequential] : 5 iterations took 232ms. Average is 21.55fps
Attachment #8412920 - Attachment is obsolete: true
See Also: → 966567
I dug a little deeper.  There are two problems here.

The main problem is that the parallel engine is not used at all, because there is not yet an implementation of parallel work for "sized" array types.  The initial guard for applicability in the function MapTypedParImpl() - which is used to implement fromPar - thus fails in its fourth disjunct and all the work is sequential, using the sequential typed map.

The second problem is that the sequential typed map is fairly expensive, there are several specializations that are probably desirable / necessary to make it perform well with non-simple types. There is some discussion of that in bug 983577, but I'm guessing more things are going on.

In short, let's revisit this bug after Niko has landed the updates to TypedObject and we can rewrite the various self-hosted methods to get rid of the sized/unsized discrepancy.
Ever confirmed: true
Blocks: PJS
Closed: 10 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.