Closed Bug 753010 Opened 9 years ago Closed 6 years ago

Add a NEON optimized blur method

Categories

(Core :: Graphics, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla38

People

(Reporter: jrmuizel, Assigned: ethlin)

References

(Depends on 1 open bug)

Details

Attachments

(2 files)

This would be handy to have.
See bug 509052 for an old SSE2 version.
Assignee: nobody → etlin
Add neon functions for blur to speed up performance. The method is similar with SSE version.
Attachment #8551072 - Flags: feedback?(hshih)
Attachment #8551072 - Flags: feedback?(hshih) → review?(mstange)
I'm amazed by how identical this looks to my patch in bug 1045865. The only differences seem to be in the last few lines of BlurNEON.cpp, and in the location of the call to vqmovn_u32 (which I've moved into Divide).
Duplicate of this bug: 1045865
Oh, and our GenerateIntegralImage_NEON implementation is completely different.
Comment on attachment 8551072 [details] [diff] [review]
Part 1 - Add neon method for blur operation

Review of attachment 8551072 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great. I reviewed it by comparing it to my implementation; my suggestions below are basically just those differences where I preferred my version.

::: gfx/2d/BlurNEON.cpp
@@ +13,5 @@
> +uint32x4_t Divide(uint32x4_t aValues, uint32x4_t aDivisor)
> +{
> +  uint64x2_t roundingAddition = vdupq_n_u64(int64_t(1) << 31);
> +  uint64x2_t multiplied21 = vmull_u32(vget_low_u32(aValues), vget_low_u32(aDivisor));
> +  uint64x2_t multiplied43 = vmull_u32(vget_high_u32(aValues), vget_high_u32(aDivisor));

vget_low_u32(aDivisor) is always the same as vget_high_u32(aDivisor), isn't it? Would it make sense to pass aDivisor as uint32x2_t instead?

@@ +201,5 @@
> +  GenerateIntegralImage_NEON(leftInflation, aRightLobe, aTopLobe, aBottomLobe,
> +                             aIntegralImage, aIntegralImageStride, aData,
> +                             mStride, size);
> +
> +  uint32x4_t divisor = vdupq_n_u32(reciprocal);

Right, so this can be uint32x2_t and vdupq_n_u32.

@@ +257,5 @@
> +      bottomLeft = vld1q_u32(bottomLeftBase + x + 12);
> +      uint32x4_t result4 = BlurFourPixels(topLeft, topRight, bottomRight, bottomLeft, divisor);
> +
> +      uint8x8_t combine1 = vqmovn_u16(vcombine_u16(vqmovn_u32(result1), vqmovn_u32(result2)));
> +      uint8x8_t combine2 = vqmovn_u16(vcombine_u16(vqmovn_u32(result3), vqmovn_u32(result4)));

Instead of calling vqmovn_u32 here every time, just move it into Divide.

@@ +281,5 @@
> +      uint8x8_t final = vqmovn_u16(vcombine_u16(vqmovn_u32(result), vdup_n_u16(0)));
> +      vst1_lane_u8(data + stride * y + x    , final, 0);
> +      vst1_lane_u8(data + stride * y + x + 1, final, 1);
> +      vst1_lane_u8(data + stride * y + x + 2, final, 2);
> +      vst1_lane_u8(data + stride * y + x + 3, final, 3);

So my patch did this instead:

  uint32x2_t final = vreinterpret_u32_u8(vmovn_u16(vcombine_u16(result, vdup_n_u16(0))));
  *(uint32_t*)(data + stride * y + x) = vget_lane_u32(final, 0);

I don't think I've tested whether that works. Do you think it would work? Would it be faster?
Attachment #8551072 - Flags: review?(mstange) → review+
Comment on attachment 8551072 [details] [diff] [review]
Part 1 - Add neon method for blur operation

Review of attachment 8551072 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/2d/BlurNEON.cpp
@@ +21,5 @@
> +
> +MOZ_ALWAYS_INLINE
> +uint32x4_t BlurFourPixels(const uint32x4_t& aTopLeft, const uint32x4_t& aTopRight,
> +                       const uint32x4_t& aBottomRight, const uint32x4_t& aBottomLeft,
> +                       const uint32x4_t& aDivisor)

indent
Thanks for the recommendations. I tested the performance of the last part and your changes are correct and faster.
Attachment #8553468 - Flags: review?(mstange)
Comment on attachment 8553468 [details] [diff] [review]
Part 2 - Refactor some neon functions

thanks!
Attachment #8553468 - Flags: review?(mstange) → review+
https://hg.mozilla.org/mozilla-central/rev/afd9fa40a02e
https://hg.mozilla.org/mozilla-central/rev/09f774177683
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
Blocks: gfxperf
You need to log in before you can comment on or make changes to this bug.