Improve the performance of mask image computation by neon

RESOLVED FIXED in mozilla37

Status

()

RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: ethlin, Assigned: ethlin)

Tracking

unspecified
mozilla37
ARM
Gonk (Firefox OS)
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 5 obsolete attachments)

(Assignee)

Description

4 years ago
When generating the SVG mask, we used to calculate index by pixel. There is a way to get index by accumulating. Another problem is the luminance calculation is using double type, but the decimal digits of precision is only 4. I think we can just use float type to do the calculation.
(In reply to Ethan Lin[:Ethan] from comment #0)
> When generating the SVG mask, we used to calculate index by pixel. There is
> a way to get index by accumulating.  Another problem is the luminance
> calculation is using double type, but the decimal digits of precision is
> only 4. I think we can just use float type to do the calculation.

Sounds good to me.

BTW if you have content that's using SVG masks and it's slow, consider using alpha masks, which should be much faster.

(Ideally we'd use a mask layer to handle SVG masks, though I'm not sure whether we have plans to do that any time soon?)
(Assignee)

Comment 2

4 years ago
Created attachment 8542072 [details] [diff] [review]
v1 - accumulation index and use float

There are 2 changes:
1. Calculate index by accumulation. Which can reduce the cross operations in the for loop.
2. Use float instead of double. The calculation of luminance will be faster.

I tried the 480x360 mask image, the time of mask computation is from 12.5ms to 10.6ms on flame-kk.
Attachment #8542072 - Flags: feedback?(hshih)
Attachment #8542072 - Flags: review+
Attachment #8542072 - Flags: feedback?(hshih) → feedback+
(Assignee)

Comment 3

4 years ago
Created attachment 8542888 [details] [diff] [review]
v2 - Use neon to speed up

There is a bug 1106909 about the scrolling performance problem on the mobile lockscreen and we found that the SVG mask uses a lot of time in each frame.
In this patch, I try to use neon to speed up the mask creation. I also change the luminance calculation to uint8 computation. This is the formula: (76*R + 150*G + 29*B) >> 8 [1].
By this patch, the creation time of 480x360 in flame-kk is from the original 12.5ms to 1.5ms.

[1] http://en.wikipedia.org/wiki/YUV
Attachment #8542072 - Attachment is obsolete: true
Attachment #8542888 - Flags: review?(longsonr)
Comment on attachment 8542888 [details] [diff] [review]
v2 - Use neon to speed up

Everything is cosmetic really. I would like to see another patch mainly because I need more time to review the NEON code.

>+  int rFac = 76*aOpacity;
>+  int gFac = 150*aOpacity;
>+  int bFac = 29*aOpacity;

needs spaces round *
shouldn't use int try int32_t (multiple places)
I don't mind magic numbers from the SVG specification but 76 isn't. We can afford floating point calculations outside loops so use the original constants and multiply by 255
rFax, gFac and bFac don't mean much. Don't abbreviate and try to think of better names please.

>+  int offset = aStride - 4 * aSize.width;

int32_t

>+  int offset = aStride - 4 * aSize.width;
>+  uint8_t *pixel = aData;
>+  int rFac = 76*aOpacity;
>+  int gFac = 150*aOpacity;
>+  int bFac = 29*aOpacity;
>+

ditto

>+/**
>+ * Byte offsets of channels in a native packed gfxColor or cairo image surface.
>+ */
>+#ifdef IS_BIG_ENDIAN
>+#define GFX_ARGB32_OFFSET_A 0
>+#define GFX_ARGB32_OFFSET_R 1
>+#define GFX_ARGB32_OFFSET_G 2
>+#define GFX_ARGB32_OFFSET_B 3
>+#else
>+#define GFX_ARGB32_OFFSET_A 3
>+#define GFX_ARGB32_OFFSET_R 2
>+#define GFX_ARGB32_OFFSET_G 1
>+#define GFX_ARGB32_OFFSET_B 0
>+#endif

Be nice if these moved to a header so there's just the one copy. Do we actually support NEON with both endiannesses though?

>+
>+void
>+ComputesRGBLuminanceMask_NEON(uint8_t *aData,
>+                              int32_t aStride,
>+                              const IntSize &aSize,
>+                              float aOpacity)
>+{
>+  int rFac = 76*aOpacity;
>+  int gFac = 150*aOpacity;
>+  int bFac = 29*aOpacity;

ditto.

>+
>+  uint8_t *pixel = aData;
>+  int offset = aStride - 4 * aSize.width;
>+  for (int32_t y = 0; y < aSize.height; y++) {
>+    for (int32_t x = 0; x < aSize.width; x+=8) {

spaces round +=

>+      if (!pixel[GFX_ARGB32_OFFSET_A]) {
>+        memset(pixel, 0, 4);
>+      }
>+      pixel += 4;
>+    }
>+    pixel += offset;
>+  }
>+
>+  pixel = aData;
>+  int remainWidth = aSize.width & 8;

Please don't use int.

>+  uint16x8_t temp;
>+  uint8x8_t gray;
>+  uint8x8x4_t result;
>+  uint8x8_t rVec = vdup_n_u8(rFac);
>+  uint8x8_t gVec = vdup_n_u8(gFac);
>+  uint8x8_t bVec = vdup_n_u8(bFac);
>+  for (int32_t y = 0; y < aSize.height; y++) {
>+    for (int32_t x = 0; x < aSize.width; x+=8) {

spaces round += please.

>+      uint8x8x4_t argb  = vld4_u8(pixel);
>+      temp = vmull_u8(argb.val[GFX_ARGB32_OFFSET_R], rVec);
>+      temp = vmlal_u8(temp, argb.val[GFX_ARGB32_OFFSET_G], gVec);
>+      temp = vmlal_u8(temp, argb.val[GFX_ARGB32_OFFSET_B], bVec);
>+      gray = vshrn_n_u16(temp, 8);
>+      result.val[0] = gray;
>+      result.val[1] = gray;
>+      result.val[2] = gray;
>+      result.val[3] = gray;
>+      vst4_u8(pixel, result);
>+      pixel += 8 * 4;
>+    }
>+
>+    for (int32_t x = 0; x < remainWidth; ++x) {
>+      pixel[0] = (rFac * pixel[GFX_ARGB32_OFFSET_R] +
>+                  gFac * pixel[GFX_ARGB32_OFFSET_G] +
>+                  bFac * pixel[GFX_ARGB32_OFFSET_B]) >> 8;
>+      memset(pixel+1, pixel[0], 3);
>+      pixel+=4;

spaces around += please and + (line above)




>diff --git a/layout/svg/nsSVGMaskFrameNEON.h b/layout/svg/nsSVGMaskFrameNEON.h
>new file mode 100644
>index 0000000..74db889
>--- /dev/null
>+++ b/layout/svg/nsSVGMaskFrameNEON.h
>@@ -0,0 +1,20 @@
>+/* -*- mode: c++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
>+/* this source code form is subject to the terms of the mozilla public
>+ * license, v. 2.0. if a copy of the mpl was not distributed with this file,
>+ * You can obtain one at http://mozilla.org/MPL/2.0/. */
>+
>+#ifndef __NS_SVGMASKFRAMENEON_H__
>+#define __NS_SVGMASKFRAMENEON_H__
>+
>+#include "mozilla/gfx/2D.h"
>+
>+using namespace mozilla;
>+using namespace mozilla::gfx;

Do you need both (or in fact either) of these lines?
Attachment #8542888 - Flags: review?(longsonr) → review-
(Assignee)

Comment 5

4 years ago
Created attachment 8543716 [details] [diff] [review]
v3 - Use neon to speed up

I fix the coding style and some try server errors. I think neon should support both endiannesses because we also use neon in other places.
Attachment #8543716 - Flags: review?(longsonr)
(Assignee)

Updated

4 years ago
Attachment #8542888 - Attachment is obsolete: true
(Assignee)

Comment 6

4 years ago
Created attachment 8543820 [details] [diff] [review]
v4 - Use neon to speed up

Last patch contained wrong file. This is the correct one. Fix the coding style and some try server errors.
Attachment #8543716 - Attachment is obsolete: true
Attachment #8543716 - Flags: review?(longsonr)
Attachment #8543820 - Flags: review?(longsonr)
(Assignee)

Comment 7

4 years ago
Comment on attachment 8543820 [details] [diff] [review]
v4 - Use neon to speed up

Review of attachment 8543820 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/reftests/svg/svg-integration/reftest.list
@@ +26,5 @@
>  == filter-html-zoomed-01.xhtml filter-html-01-ref.svg
>  == mask-html-01.xhtml mask-html-01-ref.svg
>  == mask-html-01-extref-01.xhtml mask-html-01-ref.svg
>  random == mask-html-01-extref-02.xhtml mask-html-01-ref.svg # random due to bug 877661
> +fuzzy-if(B2G&&browserIsRemote,1,2300) == mask-html-zoomed-01.xhtml mask-html-01-ref.svg

I change the fuzzy tolerant number from 2000 to 2300 of the test case 'mask-html-zoomed-01.xhtml'. That is because the number of different pixels before calculating luminance is 2250, and the original luminance calculation has some error from float to integer. After the float conversion, the number of different pixels become less than 2000 coincidentally.
With the new luminance calculation, there is no error of float conversion. So the number become 2250. That is why I have to change the tolerant number.
Comment on attachment 8543820 [details] [diff] [review]
v4 - Use neon to speed up

+        luminance = pixel[0] = (redFactor * pixel[GFX_ARGB32_OFFSET_R] +
+                                greenFactor * pixel[GFX_ARGB32_OFFSET_G] +
+                                blueFactor * pixel[GFX_ARGB32_OFFSET_B]) >> 8;

what's the pixel[0] doing? Seems like that could be removed, no?

>+  pixel = aData;
>+  int32_t remainWidth = aSize.width % 8;

Call this remainderWidth (or widthRemainder) as that's clearer. I really don't like abbreviating variables.

>+  int32_t neonWidth = aSize.width - remainWidth;

roundedWidth perhaps? neonWidth doesn't really give much away as to what it is.

>+  uint16x8_t temp;
>+  uint8x8_t gray;
>+  uint8x8x4_t result;
>+  uint8x8_t redVec = vdup_n_u8(redFactor);
>+  uint8x8_t greenVec = vdup_n_u8(greenFactor);
>+  uint8x8_t blueVec = vdup_n_u8(blueFactor);
>+  for (int32_t y = 0; y < aSize.height; y++) {
>+    // Calculate luminance by neon with 8 pixels per loop 
>+    for (int32_t x = 0; x < neonWidth; x += 8) {
>+      uint8x8x4_t argb  = vld4_u8(pixel);
>+      temp = vmull_u8(argb.val[GFX_ARGB32_OFFSET_R], redVec); // temp = red * redFactor
>+      temp = vmlal_u8(temp, argb.val[GFX_ARGB32_OFFSET_G], greenVec); // temp += green * greenFactor
>+      temp = vmlal_u8(temp, argb.val[GFX_ARGB32_OFFSET_B], blueVec); // temp += blue * blueFactor
>+      gray = vshrn_n_u16(temp, 8); // gray = temp >> 8
>+
>+      // Put the result to the 8 pixels
>+      result.val[0] = gray;
>+      result.val[1] = gray;
>+      result.val[2] = gray;
>+      result.val[3] = gray;
>+      vst4_u8(pixel, result);
>+      pixel += 8 * 4;

What if pixel[GFX_ARGB32_OFFSET_A] == 0? You set all the right values in the first place (in code I've elided) but then we overwrite them here with the wrong values don't you.

>+    }
>+
>+    // Calculate the reset pixels of the line by cpu

s/reset/rest/ ?

>+    for (int32_t x = 0; x < remainWidth; x++) {
>+      pixel[0] = (redFactor * pixel[GFX_ARGB32_OFFSET_R] +
>+                  greenFactor * pixel[GFX_ARGB32_OFFSET_G] +
>+                  blueFactor * pixel[GFX_ARGB32_OFFSET_B]) >> 8;
>+      memset(pixel+1, pixel[0], 3);

spaces round + please?

Same as above, what if pixel[GFX_ARGB32_OFFSET_A] == 0?

>+      pixel += 4;
>+    }
>+    pixel += offset;
>+  }
>+}
>+
>diff --git a/layout/svg/nsSVGMaskFrameNEON.h b/layout/svg/nsSVGMaskFrameNEON.h
>new file mode 100644
>index 0000000..6ce7a0a
>--- /dev/null
>+++ b/layout/svg/nsSVGMaskFrameNEON.h
>@@ -0,0 +1,20 @@
>+/* -*- mode: c++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
>+/* this source code form is subject to the terms of the mozilla public
>+ * license, v. 2.0. if a copy of the mpl was not distributed with this file,
>+ * You can obtain one at http://mozilla.org/MPL/2.0/. */
>+
>+#ifndef __NS_SVGMASKFRAMENEON_H__
>+#define __NS_SVGMASKFRAMENEON_H__
>+
>+#include "mozilla/gfx/2D.h"
>+
>+//using namespace mozilla;

Nit: No commented out code please, remove the above line.
(Assignee)

Comment 9

4 years ago
Created attachment 8544322 [details] [diff] [review]
v5 - Use neon to speed up

This patch fix the above coding style problems. About the case of alpha with zero value, I think the result should be correct. The first loop sets all channel to zero when alpha is zero, so the following luminance calculation will still get the result as zero. That's something like a pre-process for neon to prevent the alpha check when calculating 8 pixels at a time.
Attachment #8543820 - Attachment is obsolete: true
Attachment #8543820 - Flags: review?(longsonr)
Attachment #8544322 - Flags: review?(longsonr)
I'll look at this later btw have you seen bug 1025928, especially bug 1025928 comment 3. Up to you whether you combine that idea or take that bug or ignore this comment altogether :-)
Another totally followup idea would be, could we create an alpha only surface for mask? 1/4 the memory and data to write if it works of course.
Attachment #8544322 - Flags: review?(longsonr) → review+
Don't forget to add r=longsonr on the checkin comment :-)
(Assignee)

Comment 13

4 years ago
(In reply to Robert Longson from comment #11)
> Another totally followup idea would be, could we create an alpha only
> surface for mask? 1/4 the memory and data to write if it works of course.

Right, I tried this method yesterday. I used a A8 format surface to store the mask result and also only set alpha channel(only has this channel now) in luminance calculation. For 480x360 image with neon, this will improve about 2.5ms when applying mask. I think we will have more benefits with cpu computation. I will add this in next patch.
Another idea is using cache, and this is what I am trying. If the mask has no change, we could just use the last one. Do you think it's a correct direction? I am still studying how to make sure there is no change of the mask.
Flags: needinfo?(longsonr)
Caching sounds difficult. How would you track changes in the mask's content? SMIL changes would be hard to track I think (use elements suffer from that problem already).

The mask may also be applied to many elements, each of different size requiring different bitmap sizes.
Flags: needinfo?(longsonr)
(Assignee)

Comment 15

4 years ago
Created attachment 8544508 [details] [diff] [review]
v6 -  Use neon to speed up. r=longsonr

Add reviewer to comment. I will create another bug for further improvement.
Attachment #8544322 - Attachment is obsolete: true
Attachment #8544508 - Flags: review+
(Assignee)

Updated

4 years ago
Summary: Improve the performance of mask image computation → Improve the performance of mask image computation by neon
https://hg.mozilla.org/mozilla-central/rev/a437d18a4b9b
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla37
You need to log in before you can comment on or make changes to this bug.