Closed
Bug 399488
Opened 17 years ago
Closed 17 years ago
Faster gaussian blur
Categories
(Core :: SVG, defect)
Core
SVG
Tracking
()
RESOLVED
FIXED
People
(Reporter: tor, Assigned: jwatt)
References
Details
(Keywords: perf)
Attachments
(3 files, 3 obsolete files)
6.84 KB,
patch
|
Details | Diff | Splinter Review | |
2.97 KB,
image/svg+xml
|
Details | |
11.50 KB,
patch
|
longsonr
:
review+
roc
:
superreview+
beltzner
:
approval1.9+
|
Details | Diff | Splinter Review |
No description provided.
Comment 1•17 years ago
|
||
Requesting blocking as it seems to be responsible for bug 403932 which is a blocker.
Flags: blocking1.9?
2-pass patch is slower than the trunk on my performance test:
trunk : 28.95s 28.89s 28.92s
2-pass: 32.25s 32.15s 32.20s
Simple speedup up of current trunk code - remove the 24 divisions per pixels. Speeds up blur by about 30% in my benchmark.
trunk : 28.95s 28.89s 28.92s
prediv: 20.17s 20.05s 20.05s
Someone want to unrot this and get it landed?
Assignee | ||
Comment 5•17 years ago
|
||
I've also more aggressively coded the inner loops to assume the compiler is stupid about optimization, and I changed nsAutoPtr to nsAutoArrayPtr so we don't leak.
Looks like you changed some prediv2s to prediv by mistake?
Assignee | ||
Comment 7•17 years ago
|
||
:-/ Thanks!
Attachment #291760 -
Attachment is obsolete: true
Attachment #307849 -
Attachment is obsolete: true
Assignee | ||
Comment 8•17 years ago
|
||
tor: any chance you can post that benchmark if you're still around?
Assignee | ||
Comment 9•17 years ago
|
||
I'm not seeing much more than a 10% perf improvement in my debug build on WinXP on this test.
Assignee | ||
Updated•17 years ago
|
Attachment #307851 -
Flags: superreview?(roc)
Attachment #307851 -
Flags: review?(longsonr)
Attachment #307851 -
Flags: superreview?(roc) → superreview+
Comment 10•17 years ago
|
||
Comment on attachment 307851 [details] [diff] [review]
patch with prediv2 fix
> void
> nsSVGFEGaussianBlurElement::BoxBlurH(PRUint8 *aInput, PRUint8 *aOutput,
> PRInt32 aStride, const nsRect &aRegion,
>- PRUint32 leftLobe, PRUint32 rightLobe)
>+ PRUint32 leftLobe, PRUint32 rightLobe,
>+ PRUint8 *prediv)
prediv should be const since you're not modifying its contents in this method. Various methods should change with this.
> {
> PRInt32 boxSize = leftLobe + rightLobe + 1;
>+ PRInt32 posStart = aRegion.x - leftLobe;
>
> for (PRInt32 y = aRegion.y; y < aRegion.YMost(); y++) {
Could do this:
PRInt32 lineIndex = y * aStride;
or some other name. This is applicable to most filters though and so might be better as another patch.
> PRUint32 sums[4] = {0, 0, 0, 0};
> for (PRInt32 i=0; i < boxSize; i++) {
>- PRInt32 pos = aRegion.x - leftLobe + i;
>+ PRInt32 pos = posStart + i;
> pos = PR_MAX(pos, aRegion.x);
> pos = PR_MIN(pos, aRegion.XMost() - 1);
>- sums[0] += aInput[aStride*y + 4*pos ];
>- sums[1] += aInput[aStride*y + 4*pos + 1];
>- sums[2] += aInput[aStride*y + 4*pos + 2];
>- sums[3] += aInput[aStride*y + 4*pos + 3];
>+ PRInt32 index = aStride*y + 4*pos;
Nit: spaces round * is more usual.
you would use lineIndex here if you went for that idea.
>+ sums[0] += aInput[index ];
>+ sums[1] += aInput[index + 1];
>+ sums[2] += aInput[index + 2];
>+ sums[3] += aInput[index + 3];
> }
> for (PRInt32 x = aRegion.x; x < aRegion.XMost(); x++) {
> PRInt32 tmp = x - leftLobe;
> PRInt32 last = PR_MAX(tmp, aRegion.x);
> PRInt32 next = PR_MIN(tmp + boxSize, aRegion.XMost() - 1);
>
>- aOutput[aStride*y + 4*x ] = sums[0]/boxSize;
>- aOutput[aStride*y + 4*x + 1] = sums[1]/boxSize;
>- aOutput[aStride*y + 4*x + 2] = sums[2]/boxSize;
>- aOutput[aStride*y + 4*x + 3] = sums[3]/boxSize;
>-
>- sums[0] += aInput[aStride*y + 4*next ] -
>- aInput[aStride*y + 4*last ];
>- sums[1] += aInput[aStride*y + 4*next + 1] -
>- aInput[aStride*y + 4*last + 1];
>- sums[2] += aInput[aStride*y + 4*next + 2] -
>- aInput[aStride*y + 4*last + 2];
>- sums[3] += aInput[aStride*y + 4*next + 3] -
>- aInput[aStride*y + 4*last + 3];
>+ PRInt32 index = aStride*y + 4*x;
>+ aOutput[index ] = prediv[sums[0]];
>+ aOutput[index + 1] = prediv[sums[1]];
>+ aOutput[index + 2] = prediv[sums[2]];
>+ aOutput[index + 3] = prediv[sums[3]];
>+
>+ PRInt32 index2 = aStride*y + 4*next;
>+ PRInt32 index3 = aStride*y + 4*last;
Nit spaces again and possibility to use lineIndex
>+ sums[0] += aInput[index2 ] - aInput[index3 ];
>+ sums[1] += aInput[index2 + 1] - aInput[index3 + 1];
>+ sums[2] += aInput[index2 + 2] - aInput[index3 + 2];
>+ sums[3] += aInput[index2 + 3] - aInput[index3 + 3];
> }
> }
> }
>
> void
> nsSVGFEGaussianBlurElement::BoxBlurV(PRUint8 *aInput, PRUint8 *aOutput,
> PRInt32 aStride, const nsRect &aRegion,
>- PRUint32 topLobe, PRUint32 bottomLobe)
>+ PRUint32 topLobe, PRUint32 bottomLobe,
>+ PRUint8 *prediv)
> {
> PRInt32 boxSize = topLobe + bottomLobe + 1;
>+ PRInt32 posStart = aRegion.y - topLobe;
>
> for (PRInt32 x = aRegion.x; x < aRegion.XMost(); x++) {
> PRUint32 sums[4] = {0, 0, 0, 0};
>+ PRInt32 fourX = 4*x;
Nit: spaces
Some compilers have been known to do x << 2 quicker. Maybe that's a part of the follow up patch though.
> for (PRInt32 i=0; i < boxSize; i++) {
>- PRInt32 pos = aRegion.y - topLobe + i;
>+ PRInt32 pos = posStart + i;
> pos = PR_MAX(pos, aRegion.y);
> pos = PR_MIN(pos, aRegion.YMost() - 1);
>- sums[0] += aInput[aStride*pos + 4*x ];
>- sums[1] += aInput[aStride*pos + 4*x + 1];
>- sums[2] += aInput[aStride*pos + 4*x + 2];
>- sums[3] += aInput[aStride*pos + 4*x + 3];
>+ PRInt32 index = aStride * pos + fourX;
>+ sums[0] += aInput[index ];
>+ sums[1] += aInput[index + 1];
>+ sums[2] += aInput[index + 2];
>+ sums[3] += aInput[index + 3];
> }
> for (PRInt32 y = aRegion.y; y < aRegion.YMost(); y++) {
> PRInt32 tmp = y - topLobe;
> PRInt32 last = PR_MAX(tmp, aRegion.y);
> PRInt32 next = PR_MIN(tmp + boxSize, aRegion.YMost() - 1);
>
>- aOutput[aStride*y + 4*x ] = sums[0]/boxSize;
>- aOutput[aStride*y + 4*x + 1] = sums[1]/boxSize;
>- aOutput[aStride*y + 4*x + 2] = sums[2]/boxSize;
>- aOutput[aStride*y + 4*x + 3] = sums[3]/boxSize;
>-
>- sums[0] += aInput[aStride*next + 4*x ] -
>- aInput[aStride*last + 4*x ];
>- sums[1] += aInput[aStride*next + 4*x + 1] -
>- aInput[aStride*last + 4*x + 1];
>- sums[2] += aInput[aStride*next + 4*x + 2] -
>- aInput[aStride*last + 4*x + 2];
>- sums[3] += aInput[aStride*next + 4*x + 3] -
>- aInput[aStride*last + 4*x + 3];
>+ PRInt32 index = aStride * y + fourX;
>+ aOutput[index ] = prediv[sums[0]];
>+ aOutput[index + 1] = prediv[sums[1]];
>+ aOutput[index + 2] = prediv[sums[2]];
>+ aOutput[index + 3] = prediv[sums[3]];
>+
>+ PRInt32 index2 = aStride * next + fourX;
>+ PRInt32 index3 = aStride * last + fourX;
Could just put the body of next and last in here and omit those variables
I think nextIndex and lastIndex are better names
>+ sums[0] += aInput[index2 ] - aInput[index3 ];
>+ sums[1] += aInput[index2 + 1] - aInput[index3 + 1];
>+ sums[2] += aInput[index2 + 2] - aInput[index3 + 2];
>+ sums[3] += aInput[index2 + 3] - aInput[index3 + 3];
> }
> }
> }
>
>+PRUint8 *
>+nsSVGFEGaussianBlurElement::SetupPredivide(PRUint32 size)
>+{
>+ PRUint8 *tmp = new PRUint8[size * 256];
>+ if (tmp) {
>+ for (int i = 0; i < 256; i++)
PRUint32 rather than int
>+ memset(tmp + i * size, i, size);
>+ }
>+ return tmp;
>+}
>+
> nsresult
> nsSVGFEGaussianBlurElement::GetDXY(PRUint32 *aDX, PRUint32 *aDY,
> const nsSVGFilterInstance& aInstance)
> {
> float stdX, stdY;
> nsSVGLength2 val;
>
> GetAnimatedNumberValues(&stdX, &stdY, nsnull);
>@@ -786,59 +804,77 @@ nsSVGFEGaussianBlurElement::GetDXY(PRUin
> return NS_OK;
> }
>
> void
> nsSVGFEGaussianBlurElement::GaussianBlur(PRUint8 *aInput, PRUint8 *aOutput,
> nsSVGFilterResource *aFilterResource,
> PRUint32 aDX, PRUint32 aDY)
> {
>- PRUint8 *tmp = static_cast<PRUint8*>
>- (calloc(aFilterResource->GetDataSize(), 1));
>+ nsAutoArrayPtr<PRUint8> tmp(new PRUint8[aFilterResource->GetDataSize()]);
>+ if (!tmp)
>+ return;
>+ memset(tmp, 0, aFilterResource->GetDataSize());
> nsRect rect = aFilterResource->GetSurfaceRect();
> #ifdef DEBUG_tor
> fprintf(stderr, "FILTER GAUSS rect: %d,%d %dx%d\n",
> rect.x, rect.y, rect.width, rect.height);
> #endif
>
> PRUint32 stride = aFilterResource->GetDataStride();
>
> if (aDX & 1) {
> // odd
>- BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2);
>- BoxBlurH(tmp, aOutput, stride, rect, aDX/2, aDX/2);
>- BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2);
>+ nsAutoArrayPtr<PRUint8> prediv(SetupPredivide(2 * (aDX / 2) + 1));
>+ if (!prediv)
>+ return;
>+
>+ BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2, prediv);
>+ BoxBlurH(tmp, aOutput, stride, rect, aDX/2, aDX/2, prediv);
>+ BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2, prediv);
> } else {
> // even
> if (aDX == 0) {
> aFilterResource->CopyImageSubregion(tmp, aInput);
> } else {
>- BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2 - 1);
>- BoxBlurH(tmp, aOutput, stride, rect, aDX/2 - 1, aDX/2);
>- BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2);
>+ nsAutoArrayPtr<PRUint8> prediv(SetupPredivide(2 * (aDX / 2) + 1));
>+ nsAutoArrayPtr<PRUint8> prediv2(SetupPredivide(2 * (aDX / 2)));
>+ if (!prediv || !prediv2)
>+ return;
>+
>+ BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2 - 1, prediv2);
>+ BoxBlurH(tmp, aOutput, stride, rect, aDX/2 - 1, aDX/2, prediv2);
>+ BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2, prediv);
> }
> }
I can't help thinking this lot could be reordered. Test for 0 first then in the else create prediv, split into an odd or even creating prediv 2 in the even then do the final BoxBlurH outside the inner if else. Also applies to aDY with slightly different logic.
Attachment #307851 -
Flags: superreview+ → superreview?(roc)
Assignee | ||
Comment 11•17 years ago
|
||
longsonr: did you mean to clear roc's sr+?
Assignee | ||
Comment 12•17 years ago
|
||
(In reply to comment #10)
> Could just put the body of next and last in here and omit those variables
Same for tmp, but I'd rather let the compiler figure out that optimization. I'm fairly sure it should, and it's more readable as-is. I've moved those lines down so they're grouped as they should be.
> I think nextIndex and lastIndex are better names
I avoided those names for more generic names to avoid being misleading.
> I can't help thinking this lot could be reordered. Test for 0 first then in the
> else create prediv, split into an odd or even creating prediv 2 in the even
> then do the final BoxBlurH outside the inner if else. Also applies to aDY with
> slightly different logic.
There's a never ending number of improvements that could be made to this code, but we've got to call it a day at some point. With only two days to the freeze I think we should stop here.
Attachment #307851 -
Attachment is obsolete: true
Attachment #309795 -
Flags: review?(longsonr)
Attachment #307851 -
Flags: superreview?(roc)
Attachment #307851 -
Flags: review?(longsonr)
Updated•17 years ago
|
Attachment #309795 -
Flags: review?(longsonr) → review+
Comment 13•17 years ago
|
||
(In reply to comment #11)
> longsonr: did you mean to clear roc's sr+?
>
Not at all. I didn't realise I'd done that till you pointed it out. Sorry.
Assignee | ||
Comment 14•17 years ago
|
||
Comment on attachment 309795 [details] [diff] [review]
patch addressing most of longsonr's comments
Okay, better re-request.
Attachment #309795 -
Flags: superreview?(roc)
Attachment #309795 -
Flags: superreview?(roc) → superreview+
Assignee | ||
Comment 15•17 years ago
|
||
Comment on attachment 309795 [details] [diff] [review]
patch addressing most of longsonr's comments
Requesting approval1.9. This patch improves our performance on one of the most concerning areas of our SVG implementation: the performance and user experience of filters.
Attachment #309795 -
Flags: approval1.9+
Assignee | ||
Updated•17 years ago
|
Attachment #309795 -
Flags: approval1.9+ → approval1.9?
Updated•17 years ago
|
Assignee: nobody → jwatt
Comment 16•17 years ago
|
||
Comment on attachment 309795 [details] [diff] [review]
patch addressing most of longsonr's comments
a1.9=beltzner
Attachment #309795 -
Flags: approval1.9? → approval1.9+
Assignee | ||
Comment 17•17 years ago
|
||
Checked in.
Status: NEW → RESOLVED
Closed: 17 years ago
OS: Linux → All
Hardware: PC → All
Resolution: --- → FIXED
I actually thought of a better idea. This patch is problematic because it uses memory bandwidth when filters are probably already quite memory-bandwidth-hungry. We can actually speed up the division another way, by writing
PRUint32 p = v/N;
as
PRUint32 k = ((1 << 25) - 1)/N; // ceil(2^24/N)
PRUint32 p = (v*k) >> 24;
where 'k' is hoisted out of the loops, of course.
If you work out the math, this works for all N <= 2^24/255 which should be plenty...
Jonathan, do you want to try that?
Assignee | ||
Comment 19•17 years ago
|
||
Good idea. I'll give that a go, but it won't be before beta 5.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Note that you might as well clamp N to 2^16 or something that works with the above formula. Values that high won't be used in real drawings and would be incredibly slow anyway.
Comment 21•17 years ago
|
||
I'm going to go ahead and close this; roc or jwatt, can you file a follow-up bug?
Status: REOPENED → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → FIXED
Yeah, I've got patches in my tree and I'll file a new bug for them.
Assignee | ||
Comment 23•17 years ago
|
||
Please don't close bugs that aren't assigned to you.
You need to log in
before you can comment on or make changes to this bug.
Description
•