Closed Bug 399488 Opened 17 years ago Closed 17 years ago

Faster gaussian blur

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: tor, Assigned: jwatt)

References

Details

(Keywords: perf)

Attachments

(3 files, 3 obsolete files)

alternate work in progress - two pass instead of six 17 years ago tor 6.84 KB, patch		Details \| Diff \| Splinter Review
avoid division in gaussian filter 17 years ago tor 9.17 KB, patch		Details \| Diff \| Splinter Review
patch updated to tip + more 17 years ago Jonathan Watt [:jwatt] 11.25 KB, patch		Details \| Diff \| Splinter Review
patch with prediv2 fix 17 years ago Jonathan Watt [:jwatt] 11.25 KB, patch		Details \| Diff \| Splinter Review
stress test (hangs for >20 seconds!) 17 years ago Jonathan Watt [:jwatt] 2.97 KB, image/svg+xml		Details
patch addressing most of longsonr's comments 17 years ago Jonathan Watt [:jwatt] 11.50 KB, patch	longsonr : review+ roc : superreview+ beltzner : approval1.9+	Details \| Diff \| Splinter Review

tor

Reporter

Description

•

17 years ago

Attached patch alternate work in progress - two pass instead of six — Details — Splinter Review

No description provided.

Jesse Ruderman

Updated

•

17 years ago

Keywords: perf

Frédéric Buclin

Comment 1

•

17 years ago

Requesting blocking as it seems to be responsible for bug 403932 which is a blocker.

Flags: blocking1.9?

tor

Reporter

Updated

•

17 years ago

Flags: wanted1.9+

Flags: blocking1.9?

Flags: blocking1.9-

tor

Reporter

Comment 2

•

17 years ago

2-pass patch is slower than the trunk on my performance test: trunk : 28.95s 28.89s 28.92s 2-pass: 32.25s 32.15s 32.20s

tor

Reporter

Comment 3

•

17 years ago

Attached patch avoid division in gaussian filter (obsolete) — Details — Splinter Review

Simple speedup up of current trunk code - remove the 24 divisions per pixels. Speeds up blur by about 30% in my benchmark. trunk : 28.95s 28.89s 28.92s prediv: 20.17s 20.05s 20.05s

Jonathan Watt [:jwatt]

Assignee

Updated

•

17 years ago

Depends on: 416305

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 4

•

17 years ago

Someone want to unrot this and get it landed?

Jonathan Watt [:jwatt]

Assignee

Comment 5

•

17 years ago

Attached patch patch updated to tip + more (obsolete) — Details — Splinter Review

I've also more aggressively coded the inner loops to assume the compiler is stupid about optimization, and I changed nsAutoPtr to nsAutoArrayPtr so we don't leak.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 6

•

17 years ago

Looks like you changed some prediv2s to prediv by mistake?

Jonathan Watt [:jwatt]

Assignee

Comment 7

•

17 years ago

Attached patch patch with prediv2 fix (obsolete) — Details — Splinter Review

:-/ Thanks!

Attachment #291760 - Attachment is obsolete: true

Attachment #307849 - Attachment is obsolete: true

Jonathan Watt [:jwatt]

Assignee

Comment 8

•

17 years ago

tor: any chance you can post that benchmark if you're still around?

Jonathan Watt [:jwatt]

Assignee

Comment 9

•

17 years ago

Attached image stress test (hangs for >20 seconds!) — Details

I'm not seeing much more than a 10% perf improvement in my debug build on WinXP on this test.

Jonathan Watt [:jwatt]

Assignee

Updated

•

17 years ago

Attachment #307851 - Flags: superreview?(roc)

Attachment #307851 - Flags: review?(longsonr)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

17 years ago

Attachment #307851 - Flags: superreview?(roc) → superreview+

Robert Longson [:longsonr]

Comment 10

•

17 years ago

Comment on attachment 307851 [details] [diff] [review] patch with prediv2 fix > void > nsSVGFEGaussianBlurElement::BoxBlurH(PRUint8 *aInput, PRUint8 *aOutput, > PRInt32 aStride, const nsRect &aRegion, >- PRUint32 leftLobe, PRUint32 rightLobe) >+ PRUint32 leftLobe, PRUint32 rightLobe, >+ PRUint8 *prediv) prediv should be const since you're not modifying its contents in this method. Various methods should change with this. > { > PRInt32 boxSize = leftLobe + rightLobe + 1; >+ PRInt32 posStart = aRegion.x - leftLobe; > > for (PRInt32 y = aRegion.y; y < aRegion.YMost(); y++) { Could do this: PRInt32 lineIndex = y * aStride; or some other name. This is applicable to most filters though and so might be better as another patch. > PRUint32 sums[4] = {0, 0, 0, 0}; > for (PRInt32 i=0; i < boxSize; i++) { >- PRInt32 pos = aRegion.x - leftLobe + i; >+ PRInt32 pos = posStart + i; > pos = PR_MAX(pos, aRegion.x); > pos = PR_MIN(pos, aRegion.XMost() - 1); >- sums[0] += aInput[aStride*y + 4*pos ]; >- sums[1] += aInput[aStride*y + 4*pos + 1]; >- sums[2] += aInput[aStride*y + 4*pos + 2]; >- sums[3] += aInput[aStride*y + 4*pos + 3]; >+ PRInt32 index = aStride*y + 4*pos; Nit: spaces round * is more usual. you would use lineIndex here if you went for that idea. >+ sums[0] += aInput[index ]; >+ sums[1] += aInput[index + 1]; >+ sums[2] += aInput[index + 2]; >+ sums[3] += aInput[index + 3]; > } > for (PRInt32 x = aRegion.x; x < aRegion.XMost(); x++) { > PRInt32 tmp = x - leftLobe; > PRInt32 last = PR_MAX(tmp, aRegion.x); > PRInt32 next = PR_MIN(tmp + boxSize, aRegion.XMost() - 1); > >- aOutput[aStride*y + 4*x ] = sums[0]/boxSize; >- aOutput[aStride*y + 4*x + 1] = sums[1]/boxSize; >- aOutput[aStride*y + 4*x + 2] = sums[2]/boxSize; >- aOutput[aStride*y + 4*x + 3] = sums[3]/boxSize; >- >- sums[0] += aInput[aStride*y + 4*next ] - >- aInput[aStride*y + 4*last ]; >- sums[1] += aInput[aStride*y + 4*next + 1] - >- aInput[aStride*y + 4*last + 1]; >- sums[2] += aInput[aStride*y + 4*next + 2] - >- aInput[aStride*y + 4*last + 2]; >- sums[3] += aInput[aStride*y + 4*next + 3] - >- aInput[aStride*y + 4*last + 3]; >+ PRInt32 index = aStride*y + 4*x; >+ aOutput[index ] = prediv[sums[0]]; >+ aOutput[index + 1] = prediv[sums[1]]; >+ aOutput[index + 2] = prediv[sums[2]]; >+ aOutput[index + 3] = prediv[sums[3]]; >+ >+ PRInt32 index2 = aStride*y + 4*next; >+ PRInt32 index3 = aStride*y + 4*last; Nit spaces again and possibility to use lineIndex >+ sums[0] += aInput[index2 ] - aInput[index3 ]; >+ sums[1] += aInput[index2 + 1] - aInput[index3 + 1]; >+ sums[2] += aInput[index2 + 2] - aInput[index3 + 2]; >+ sums[3] += aInput[index2 + 3] - aInput[index3 + 3]; > } > } > } > > void > nsSVGFEGaussianBlurElement::BoxBlurV(PRUint8 *aInput, PRUint8 *aOutput, > PRInt32 aStride, const nsRect &aRegion, >- PRUint32 topLobe, PRUint32 bottomLobe) >+ PRUint32 topLobe, PRUint32 bottomLobe, >+ PRUint8 *prediv) > { > PRInt32 boxSize = topLobe + bottomLobe + 1; >+ PRInt32 posStart = aRegion.y - topLobe; > > for (PRInt32 x = aRegion.x; x < aRegion.XMost(); x++) { > PRUint32 sums[4] = {0, 0, 0, 0}; >+ PRInt32 fourX = 4*x; Nit: spaces Some compilers have been known to do x << 2 quicker. Maybe that's a part of the follow up patch though. > for (PRInt32 i=0; i < boxSize; i++) { >- PRInt32 pos = aRegion.y - topLobe + i; >+ PRInt32 pos = posStart + i; > pos = PR_MAX(pos, aRegion.y); > pos = PR_MIN(pos, aRegion.YMost() - 1); >- sums[0] += aInput[aStride*pos + 4*x ]; >- sums[1] += aInput[aStride*pos + 4*x + 1]; >- sums[2] += aInput[aStride*pos + 4*x + 2]; >- sums[3] += aInput[aStride*pos + 4*x + 3]; >+ PRInt32 index = aStride * pos + fourX; >+ sums[0] += aInput[index ]; >+ sums[1] += aInput[index + 1]; >+ sums[2] += aInput[index + 2]; >+ sums[3] += aInput[index + 3]; > } > for (PRInt32 y = aRegion.y; y < aRegion.YMost(); y++) { > PRInt32 tmp = y - topLobe; > PRInt32 last = PR_MAX(tmp, aRegion.y); > PRInt32 next = PR_MIN(tmp + boxSize, aRegion.YMost() - 1); > >- aOutput[aStride*y + 4*x ] = sums[0]/boxSize; >- aOutput[aStride*y + 4*x + 1] = sums[1]/boxSize; >- aOutput[aStride*y + 4*x + 2] = sums[2]/boxSize; >- aOutput[aStride*y + 4*x + 3] = sums[3]/boxSize; >- >- sums[0] += aInput[aStride*next + 4*x ] - >- aInput[aStride*last + 4*x ]; >- sums[1] += aInput[aStride*next + 4*x + 1] - >- aInput[aStride*last + 4*x + 1]; >- sums[2] += aInput[aStride*next + 4*x + 2] - >- aInput[aStride*last + 4*x + 2]; >- sums[3] += aInput[aStride*next + 4*x + 3] - >- aInput[aStride*last + 4*x + 3]; >+ PRInt32 index = aStride * y + fourX; >+ aOutput[index ] = prediv[sums[0]]; >+ aOutput[index + 1] = prediv[sums[1]]; >+ aOutput[index + 2] = prediv[sums[2]]; >+ aOutput[index + 3] = prediv[sums[3]]; >+ >+ PRInt32 index2 = aStride * next + fourX; >+ PRInt32 index3 = aStride * last + fourX; Could just put the body of next and last in here and omit those variables I think nextIndex and lastIndex are better names >+ sums[0] += aInput[index2 ] - aInput[index3 ]; >+ sums[1] += aInput[index2 + 1] - aInput[index3 + 1]; >+ sums[2] += aInput[index2 + 2] - aInput[index3 + 2]; >+ sums[3] += aInput[index2 + 3] - aInput[index3 + 3]; > } > } > } > >+PRUint8 * >+nsSVGFEGaussianBlurElement::SetupPredivide(PRUint32 size) >+{ >+ PRUint8 *tmp = new PRUint8[size * 256]; >+ if (tmp) { >+ for (int i = 0; i < 256; i++) PRUint32 rather than int >+ memset(tmp + i * size, i, size); >+ } >+ return tmp; >+} >+ > nsresult > nsSVGFEGaussianBlurElement::GetDXY(PRUint32 *aDX, PRUint32 *aDY, > const nsSVGFilterInstance& aInstance) > { > float stdX, stdY; > nsSVGLength2 val; > > GetAnimatedNumberValues(&stdX, &stdY, nsnull); >@@ -786,59 +804,77 @@ nsSVGFEGaussianBlurElement::GetDXY(PRUin > return NS_OK; > } > > void > nsSVGFEGaussianBlurElement::GaussianBlur(PRUint8 *aInput, PRUint8 *aOutput, > nsSVGFilterResource *aFilterResource, > PRUint32 aDX, PRUint32 aDY) > { >- PRUint8 *tmp = static_cast<PRUint8*> >- (calloc(aFilterResource->GetDataSize(), 1)); >+ nsAutoArrayPtr<PRUint8> tmp(new PRUint8[aFilterResource->GetDataSize()]); >+ if (!tmp) >+ return; >+ memset(tmp, 0, aFilterResource->GetDataSize()); > nsRect rect = aFilterResource->GetSurfaceRect(); > #ifdef DEBUG_tor > fprintf(stderr, "FILTER GAUSS rect: %d,%d %dx%d\n", > rect.x, rect.y, rect.width, rect.height); > #endif > > PRUint32 stride = aFilterResource->GetDataStride(); > > if (aDX & 1) { > // odd >- BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2); >- BoxBlurH(tmp, aOutput, stride, rect, aDX/2, aDX/2); >- BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2); >+ nsAutoArrayPtr<PRUint8> prediv(SetupPredivide(2 * (aDX / 2) + 1)); >+ if (!prediv) >+ return; >+ >+ BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2, prediv); >+ BoxBlurH(tmp, aOutput, stride, rect, aDX/2, aDX/2, prediv); >+ BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2, prediv); > } else { > // even > if (aDX == 0) { > aFilterResource->CopyImageSubregion(tmp, aInput); > } else { >- BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2 - 1); >- BoxBlurH(tmp, aOutput, stride, rect, aDX/2 - 1, aDX/2); >- BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2); >+ nsAutoArrayPtr<PRUint8> prediv(SetupPredivide(2 * (aDX / 2) + 1)); >+ nsAutoArrayPtr<PRUint8> prediv2(SetupPredivide(2 * (aDX / 2))); >+ if (!prediv || !prediv2) >+ return; >+ >+ BoxBlurH(aInput, tmp, stride, rect, aDX/2, aDX/2 - 1, prediv2); >+ BoxBlurH(tmp, aOutput, stride, rect, aDX/2 - 1, aDX/2, prediv2); >+ BoxBlurH(aOutput, tmp, stride, rect, aDX/2, aDX/2, prediv); > } > } I can't help thinking this lot could be reordered. Test for 0 first then in the else create prediv, split into an odd or even creating prediv 2 in the even then do the final BoxBlurH outside the inner if else. Also applies to aDY with slightly different logic.

Attachment #307851 - Flags: superreview+ → superreview?(roc)

Jonathan Watt [:jwatt]

Assignee

Comment 11

•

17 years ago

longsonr: did you mean to clear roc's sr+?

Jonathan Watt [:jwatt]

Assignee

Comment 12

•

17 years ago

Attached patch patch addressing most of longsonr's comments — Details — Splinter Review

(In reply to comment #10) > Could just put the body of next and last in here and omit those variables Same for tmp, but I'd rather let the compiler figure out that optimization. I'm fairly sure it should, and it's more readable as-is. I've moved those lines down so they're grouped as they should be. > I think nextIndex and lastIndex are better names I avoided those names for more generic names to avoid being misleading. > I can't help thinking this lot could be reordered. Test for 0 first then in the > else create prediv, split into an odd or even creating prediv 2 in the even > then do the final BoxBlurH outside the inner if else. Also applies to aDY with > slightly different logic. There's a never ending number of improvements that could be made to this code, but we've got to call it a day at some point. With only two days to the freeze I think we should stop here.

Attachment #307851 - Attachment is obsolete: true

Attachment #309795 - Flags: review?(longsonr)

Attachment #307851 - Flags: superreview?(roc)

Attachment #307851 - Flags: review?(longsonr)

Robert Longson [:longsonr]

Updated

•

17 years ago

Attachment #309795 - Flags: review?(longsonr) → review+

Robert Longson [:longsonr]

Comment 13

•

17 years ago

(In reply to comment #11) > longsonr: did you mean to clear roc's sr+? > Not at all. I didn't realise I'd done that till you pointed it out. Sorry.

Jonathan Watt [:jwatt]

Assignee

Comment 14

•

17 years ago

Comment on attachment 309795 [details] [diff] [review] patch addressing most of longsonr's comments Okay, better re-request.

Attachment #309795 - Flags: superreview?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

17 years ago

Attachment #309795 - Flags: superreview?(roc) → superreview+

Jonathan Watt [:jwatt]

Assignee

Comment 15

•

17 years ago

Comment on attachment 309795 [details] [diff] [review] patch addressing most of longsonr's comments Requesting approval1.9. This patch improves our performance on one of the most concerning areas of our SVG implementation: the performance and user experience of filters.

Attachment #309795 - Flags: approval1.9+

Jonathan Watt [:jwatt]

Assignee

Updated

•

17 years ago

Attachment #309795 - Flags: approval1.9+ → approval1.9?

Reed Loden [:reed]

Updated

•

17 years ago

Assignee: nobody → jwatt

Mike Beltzner [:beltzner, not reading bugmail]

Comment 16

•

17 years ago

Comment on attachment 309795 [details] [diff] [review] patch addressing most of longsonr's comments a1.9=beltzner

Attachment #309795 - Flags: approval1.9? → approval1.9+

Jonathan Watt [:jwatt]

Assignee

Comment 17

•

17 years ago

Checked in.

Status: NEW → RESOLVED

Closed: 17 years ago

OS: Linux → All

Hardware: PC → All

Resolution: --- → FIXED

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 18

•

17 years ago

I actually thought of a better idea. This patch is problematic because it uses memory bandwidth when filters are probably already quite memory-bandwidth-hungry. We can actually speed up the division another way, by writing PRUint32 p = v/N; as PRUint32 k = ((1 << 25) - 1)/N; // ceil(2^24/N) PRUint32 p = (v*k) >> 24; where 'k' is hoisted out of the loops, of course. If you work out the math, this works for all N <= 2^24/255 which should be plenty... Jonathan, do you want to try that?

Jonathan Watt [:jwatt]

Assignee

Comment 19

•

17 years ago

Good idea. I'll give that a go, but it won't be before beta 5.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 20

•

17 years ago

Note that you might as well clamp N to 2^16 or something that works with the above formula. Values that high won't be used in real drawings and would be incredibly slow anyway.

Joe Drew (not getting mail)

Comment 21

•

17 years ago

I'm going to go ahead and close this; roc or jwatt, can you file a follow-up bug?

Status: REOPENED → RESOLVED

Closed: 17 years ago → 17 years ago

Resolution: --- → FIXED

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 22

•

17 years ago

Yeah, I've got patches in my tree and I'll file a new bug for them.

Jonathan Watt [:jwatt]

Assignee

Comment 23

•

17 years ago

Please don't close bugs that aren't assigned to you.

Daniel Veditz [:dveditz]

Updated

•

17 years ago

Depends on: 441368

You need to log in before you can comment on or make changes to this bug.