Closed Bug 399488 Opened 17 years ago Closed 17 years ago

Faster gaussian blur

Categories

(Core :: SVG, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: tor, Assigned: jwatt)

References

Details

(Keywords: perf)

Attachments

(3 files, 3 obsolete files)

No description provided.
Keywords: perf
Requesting blocking as it seems to be responsible for bug 403932 which is a blocker.
Flags: blocking1.9?
Flags: wanted1.9+
Flags: blocking1.9?
Flags: blocking1.9-
2-pass patch is slower than the trunk on my performance test: trunk : 28.95s 28.89s 28.92s 2-pass: 32.25s 32.15s 32.20s
Simple speedup up of current trunk code - remove the 24 divisions per pixels. Speeds up blur by about 30% in my benchmark. trunk : 28.95s 28.89s 28.92s prediv: 20.17s 20.05s 20.05s
Depends on: 416305
Someone want to unrot this and get it landed?
Attached patch patch updated to tip + more (obsolete) — Splinter Review
I've also more aggressively coded the inner loops to assume the compiler is stupid about optimization, and I changed nsAutoPtr to nsAutoArrayPtr so we don't leak.
Looks like you changed some prediv2s to prediv by mistake?
Attached patch patch with prediv2 fix (obsolete) — Splinter Review
:-/ Thanks!
Attachment #291760 - Attachment is obsolete: true
Attachment #307849 - Attachment is obsolete: true
tor: any chance you can post that benchmark if you're still around?
I'm not seeing much more than a 10% perf improvement in my debug build on WinXP on this test.
Attachment #307851 - Flags: superreview?(roc)
Attachment #307851 - Flags: review?(longsonr)
Attachment #307851 - Flags: superreview?(roc) → superreview+
Comment on attachment 307851 [details] [diff] [review] patch with prediv2 fix > void > nsSVGFEGaussianBlurElement::BoxBlurH(PRUint8 *aInput, PRUint8 *aOutput, > PRInt32 aStride, const nsRect &aRegion, >- PRUint32 leftLobe, PRUint32 rightLobe) >+ PRUint32 leftLobe, PRUint32 rightLobe, >+ PRUint8 *prediv) prediv should be const since you're not modifying its contents in this method. Various methods should change with this. > { > PRInt32 boxSize = leftLobe + rightLobe + 1; >+ PRInt32 posStart = aRegion.x - leftLobe; > > for (PRInt32 y = aRegion.y; y < aRegion.YMost(); y++) { Could do this: PRInt32 lineIndex = y * aStride; or some other name. This is applicable to most filters though and so might be better as another patch. > PRUint32 sums[4] = {0, 0, 0, 0}; > for (PRInt32 i=0; i < boxSize; i++) { >- PRInt32 pos = aRegion.x - leftLobe + i; >+ PRInt32 pos = posStart + i; > pos = PR_MAX(pos, aRegion.x); > pos = PR_MIN(pos, aRegion.XMost() - 1); >- sums[0] += aInput[aStride*y + 4*pos ]; >- sums[1] += aInput[aStride*y + 4*pos + 1]; >- sums[2] += aInput[aStride*y + 4*pos + 2]; >- sums[3] += aInput[aStride*y + 4*pos + 3]; >+ PRInt32 index = aStride*y + 4*pos; Nit: spaces round * is more usual. you would use lineIndex here if you went for that idea. >+ sums[0] += aInput[index ]; >+ sums[1] += aInput[index + 1]; >+ sums[2] += aInput[index + 2]; >+ sums[3] += aInput[index + 3]; > } > for (PRInt32 x = aRegion.x; x < aRegion.XMost(); x++) { > PRInt32 tmp = x - leftLobe; > PRInt32 last = PR_MAX(tmp, aRegion.x); > PRInt32 next = PR_MIN(tmp + boxSize, aRegion.XMost() - 1); > >- aOutput[aStride*y + 4*x ] = sums[0]/boxSize; >- aOutput[aStride*y + 4*x + 1] = sums[1]/boxSize; >- aOutput[aStride*y + 4*x + 2] = sums[2]/boxSize; >- aOutput[aStride*y + 4*x + 3] = sums[3]/boxSize; >- >- sums[0] += aInput[aStride*y + 4*next ] - >- aInput[aStride*y + 4*last ]; >- sums[1] += aInput[aStride*y + 4*next + 1] - >- aInput[aStride*y + 4*last + 1]; >- sums[2] += aInput[aStride*y + 4*next + 2] - >- aInput[aStride*y + 4*last + 2]; >- sums[3] += aInput[aStride*y + 4*next + 3] - >- aInput[aStride*y + 4*last + 3]; >+ PRInt32 index = aStride*y + 4*x; >+ aOutput[index ] = prediv[sums[0]]; >+ aOutput[index + 1] = prediv[sums[1]]; >+ aOutput[index + 2] = prediv[sums[2]]; >+ aOutput[index + 3] = prediv[sums[3]]; >+ >+ PRInt32 index2 = aStride*y + 4*next; >+ PRInt32 index3 = aStride*y + 4*last; Nit spaces again and possibility to use lineIndex >+ sums[0] += aInput[index2 ] - aInput[index3 ]; >+ sums[1] += aInput[index2 + 1] - aInput[index3 + 1]; >+ sums[2] += aInput[index2 + 2] - aInput[index3 + 2]; >+ sums[3] += aInput[index2 + 3] - aInput[index3 + 3]; > } > } > } > > void > nsSVGFEGaussianBlurElement::BoxBlurV(PRUint8 *aInput, PRUint8 *aOutput, > PRInt32 aStride, const nsRect &aRegion, >- PRUint32 topLobe, PRUint32 bottomLobe) >+ PRUint32 topLobe, PRUint32 bottomLobe, >+ PRUint8 *prediv) > { > PRInt32 boxSize = topLobe + bottomLobe + 1; >+ PRInt32 posStart = aRegion.y - topLobe; > > for (PRInt32 x = aRegion.x; x < aRegion.XMost(); x++) { > PRUint32 sums[4] = {0, 0, 0, 0}; >+ PRInt32 fourX = 4*x; Nit: spaces Some compilers have been known to do x << 2 quicker. Maybe that's a part of the follow up patch though. > for (PRInt32 i=0; i < boxSize; i++) { >- PRInt32 pos = aRegion.y - topLobe + i; >+ PRInt32 pos = posStart + i; > pos = PR_MAX(pos, aRegion.y); > pos = PR_MIN(pos, aRegion.YMost() - 1); >- sums[0] += aInput[aStride*pos + 4*x ]; >- sums[1] += aInput[aStride*pos + 4*x + 1]; >- sums[2] += aInput[aStride*pos + 4*x + 2]; >- sums[3] += aInput[aStride*pos + 4*x + 3]; >+ PRInt32 index = aStride * pos + fourX; >+ sums[0] += aInput[index ]; >+ sums[1] += aInput[index + 1]; >+ sums[2] += aInput[index + 2]; >+ sums[3] += aInput[index + 3]; > } > for (PRInt32 y = aRegion.y; y < aRegion.YMost(); y++) { > PRInt32 tmp = y - topLobe; > PRInt32 last = PR_MAX(tmp, aRegion.y); > PRInt32 next = PR_MIN(tmp + boxSize, aRegion.YMost() - 1); > >- aOutput[aStride*y + 4*x ] = sums[0]/boxSize; >- aOutput[aStride*y + 4*x + 1] = sums[1]/boxSize; >- aOutput[aStride*y + 4*x + 2] = sums[2]/boxSize; >- aOutput[aStride*y + 4*x + 3] = sums[3]/boxSize; >- >- sums[0] += aInput[aStride*next + 4*x ] - >- aInput[aStride*last + 4*x ]; >- sums[1] += aInput[aStride*next + 4*x + 1] - >- aInput[aStride*last + 4*x + 1]; >- sums[2] += aInput[aStride*next + 4*x + 2] - >- aInput[aStride*last + 4*x + 2]; >- sums[3] += aInput[aStride*next + 4*x + 3] - >- aInput[aStride*last + 4*x + 3]; >+ PRInt32 index = aStride * y + fourX; >+ aOutput[index ] = prediv[sums[0]]; >+ aOutput[index + 1] = prediv[sums[1]]; >+ aOutput[index + 2] = prediv[sums[2]]; >+ aOutput[index + 3] = prediv[sums[3]]; >+ >+ PRInt32 index2 = aStride * next + fourX; >+ PRInt32 index3 = aStride * last + fourX; Could just put the body of next and last in here and omit those variables I think nextIndex and lastIndex are better names >+ sums[0] += aInput[index2 ] - aInput[index3 ]; >+ sums[1] += aInput[index2 + 1] - aInput[index3 + 1]; >+ sums[2] += aInput[index2 + 2] - aInput[index3 + 2]; >+ sums[3] += aInput[index2 + 3] - aInput[index3 + 3]; > } > } > } > >+PRUint8 * >+nsSVGFEGaussianBlurElement::SetupPredivide(PRUint32 size) >+{ >+ PRUint8 *tmp = new PRUint8[size * 256]; >+ if (tmp) { >+ for (int i = 0; i < 256; i++) PRUint32 rather than int >+ memset(tmp + i * size, i, size); >+ } >+ return tmp; >+} >+ > nsresult > nsSVGFEGaussianBlurElement::GetDXY(PRUint32 *aDX, PRUint32 *aDY, > const nsSVGFilterInstance& aInstance) > { > float stdX, stdY; > nsSVGLength2 val; > > GetAnimatedNumberValues(&stdX, &stdY, nsnull); >@@ -786,59 +804,77 @@ nsSVGFEGaussianBlurElement::GetDXY(PRUin > return NS_OK; > } > > void > nsSVGFEGaussianBlurElement::GaussianBlur(PRUint8 *aInput, PRUint8 *aOutput, > nsSVGFilterResource *aFilterResource, > PRUint32 aDX, PRUint32 aDY) > { >- PRUint8 *tmp = static_cast<PRUint8*> >- (calloc(aFilterResource->GetDataSize(), 1)); >+ nsAutoArrayPtr<PRUint8> tmp(new PRUint8[aFilterResource->GetDataSize()]); >+ if (!tmp) >+ return; >+ memset(tmp, 0, aFilterResource->GetDataSize()); > nsRect rect = aFilterResource->GetSurfaceRect(); > #ifdef DEBUG_tor > fprintf(stderr, "FILTER GAUSS rect: %d,%d %dx%d\n", > rect.x, rect.y, rect.width, rect.height); > #endif > > PRUint32 stride = aFilterResource->GetDataStride(); > > if (aDX & 1) { > // odd >- BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2); >- BoxBlurH(tmp, aOutput, stride, rect, aDX/2, aDX/2); >- BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2); >+ nsAutoArrayPtr<PRUint8> prediv(SetupPredivide(2 * (aDX / 2) + 1)); >+ if (!prediv) >+ return; >+ >+ BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2, prediv); >+ BoxBlurH(tmp, aOutput, stride, rect, aDX/2, aDX/2, prediv); >+ BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2, prediv); > } else { > // even > if (aDX == 0) { > aFilterResource->CopyImageSubregion(tmp, aInput); > } else { >- BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2 - 1); >- BoxBlurH(tmp, aOutput, stride, rect, aDX/2 - 1, aDX/2); >- BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2); >+ nsAutoArrayPtr<PRUint8> prediv(SetupPredivide(2 * (aDX / 2) + 1)); >+ nsAutoArrayPtr<PRUint8> prediv2(SetupPredivide(2 * (aDX / 2))); >+ if (!prediv || !prediv2) >+ return; >+ >+ BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2 - 1, prediv2); >+ BoxBlurH(tmp, aOutput, stride, rect, aDX/2 - 1, aDX/2, prediv2); >+ BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2, prediv); > } > } I can't help thinking this lot could be reordered. Test for 0 first then in the else create prediv, split into an odd or even creating prediv 2 in the even then do the final BoxBlurH outside the inner if else. Also applies to aDY with slightly different logic.
Attachment #307851 - Flags: superreview+ → superreview?(roc)
longsonr: did you mean to clear roc's sr+?
(In reply to comment #10) > Could just put the body of next and last in here and omit those variables Same for tmp, but I'd rather let the compiler figure out that optimization. I'm fairly sure it should, and it's more readable as-is. I've moved those lines down so they're grouped as they should be. > I think nextIndex and lastIndex are better names I avoided those names for more generic names to avoid being misleading. > I can't help thinking this lot could be reordered. Test for 0 first then in the > else create prediv, split into an odd or even creating prediv 2 in the even > then do the final BoxBlurH outside the inner if else. Also applies to aDY with > slightly different logic. There's a never ending number of improvements that could be made to this code, but we've got to call it a day at some point. With only two days to the freeze I think we should stop here.
Attachment #307851 - Attachment is obsolete: true
Attachment #309795 - Flags: review?(longsonr)
Attachment #307851 - Flags: superreview?(roc)
Attachment #307851 - Flags: review?(longsonr)
Attachment #309795 - Flags: review?(longsonr) → review+
(In reply to comment #11) > longsonr: did you mean to clear roc's sr+? > Not at all. I didn't realise I'd done that till you pointed it out. Sorry.
Comment on attachment 309795 [details] [diff] [review] patch addressing most of longsonr's comments Okay, better re-request.
Attachment #309795 - Flags: superreview?(roc)
Attachment #309795 - Flags: superreview?(roc) → superreview+
Comment on attachment 309795 [details] [diff] [review] patch addressing most of longsonr's comments Requesting approval1.9. This patch improves our performance on one of the most concerning areas of our SVG implementation: the performance and user experience of filters.
Attachment #309795 - Flags: approval1.9+
Attachment #309795 - Flags: approval1.9+ → approval1.9?
Assignee: nobody → jwatt
Comment on attachment 309795 [details] [diff] [review] patch addressing most of longsonr's comments a1.9=beltzner
Attachment #309795 - Flags: approval1.9? → approval1.9+
Checked in.
Status: NEW → RESOLVED
Closed: 17 years ago
OS: Linux → All
Hardware: PC → All
Resolution: --- → FIXED
I actually thought of a better idea. This patch is problematic because it uses memory bandwidth when filters are probably already quite memory-bandwidth-hungry. We can actually speed up the division another way, by writing PRUint32 p = v/N; as PRUint32 k = ((1 << 25) - 1)/N; // ceil(2^24/N) PRUint32 p = (v*k) >> 24; where 'k' is hoisted out of the loops, of course. If you work out the math, this works for all N <= 2^24/255 which should be plenty... Jonathan, do you want to try that?
Good idea. I'll give that a go, but it won't be before beta 5.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Note that you might as well clamp N to 2^16 or something that works with the above formula. Values that high won't be used in real drawings and would be incredibly slow anyway.
I'm going to go ahead and close this; roc or jwatt, can you file a follow-up bug?
Status: REOPENED → RESOLVED
Closed: 17 years ago17 years ago
Resolution: --- → FIXED
Yeah, I've got patches in my tree and I'll file a new bug for them.
Please don't close bugs that aren't assigned to you.
Depends on: 441368
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: