Closed Bug 634557 Opened 14 years ago Closed 14 years ago

Implement efficient scaling YUV-to-RGB conversions for ARM (using NEON).

Categories

(Core :: Graphics, defect)

ARM
Linux
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla6

People

(Reporter: jbramley, Assigned: derf)

References

Details

Attachments

(7 files, 11 obsolete files)

25.04 KB, application/octet-stream
Details
45.97 KB, patch
Details | Diff | Splinter Review
60.58 KB, application/gzip
Details
120.43 KB, application/gzip
Details
27.34 KB, patch
derf
: review+
Details | Diff | Splinter Review
19.54 KB, patch
derf
: review+
Details | Diff | Splinter Review
11.60 KB, text/plain
Details
The existing YV12-to-RGB565 routine doesn't handle scaling. A scaling version would be very desirable for Firefox Mobile 4.
Assignee: Jacob.Bramley → nobody
Component: JavaScript Engine → Graphics
QA Contact: general → thebes
Whoops, accidentally wiped out the assignee.
Assignee: nobody → Jacob.Bramley
Attached file Conversion routines. (obsolete) —
Attached file Test framework. (obsolete) —
Attachment #513455 - Attachment mime type: application/octet-stream → text/plain
The conversion routines provide the following:

  * Multiply-by-two.
     - Sampling using the nearest pixel.
     - Minimum source width of 16 pixels.
     - Number of rows must be divisible by 2.
  * Divide-by-two.
     - Sampling using a 2*2 average.
     - Minimum source width of 16 pixels.
     - Number of rows must be divisible by 2.
  * Divide-by-four.
     - Sampling using a 4*4 average.
     - Minimum source width of 32 pixels.
     - Number of rows must be divisible by 4.

There are variants for both ARGB-8888 and RGB-565 format. Each routine expects to be able to decode a whole image in one shot. Row-by-row variants are feasible, but will be notably less efficient as the function call overhead here is significant simply because we're using so many registers which need to be stacked.

There is room for improvement, both in flexibility and in performance, so we should consider having another look at these routines in the future.

The test framework provided runs each routine, along with Siarhei's implementation (from yuv_convert_arm.cpp) and Steve's solution (which I'll attach in a minute). It saves the result of each routine as a BMP file and displays the time taken for each one.
This is the solution on which my scaling variants are based.
Factors of 2 and 4 are all well and good, but what's required for this bug is an ARM version of ScaleYCbCrToRGB32 targeting RGB565 instead, i.e., support for arbitrary scale factors. ScaleYCbCrToRGB32 also supports 90-degree rotations, but only because that's in the original Chromium code; I don't believe we use that functionality. Arbitrary scale factors, however, we definitely need.
I doubt that it would be possible to implement arbitrary scale factors before the release. Does it really need to be arbitrary? Do many users scale videos by factors of 7, for example?

It is surely possible to write an arbitrary scaler, but it would not be efficient. Still, it could probably be made notably more efficient than a plain C version, so I'll give it some thought.
Video is quite often scaled by small amounts. For example, to correct for the aspect ratio it may be scaled by 11/10. YouTube had for a long time (and still does, AFAIK) some kind of off-by-one error in their layout code that causes all videos to be scaled by a few pixels. The list goes on and on.

For scale factors of 0.5 or greater, I think this could be made reasonably efficient using, e.g., VTBL (a fantastic instruction). For smaller scale factors, it's probably okay to use a less efficient (but still vectorized) fallback that does the loads for the horizontal filtering step individually (since there are so many fewer output pixels than source pixels).
(In reply to comment #8)
> Video is quite often scaled by small amounts. For example, to correct for the
> aspect ratio it may be scaled by 11/10. YouTube had for a long time (and still
> does, AFAIK) some kind of off-by-one error in their layout code that causes all
> videos to be scaled by a few pixels. The list goes on and on.

Ah, Ok, so we scale up by 11 into some buffer, then scale down by 10 with a separate routine?
No, not at all. Using 11x the memory for a scale factor of 1.1 would be absolutely terrible for performance, not to mention very fragile for large frame sizes (what do you do when you need to scale by 4739/4320?).

Really do look at ScaleYCbCrToRGB32 in gfx/ycbcr/yuv_convert.cpp. This is the interface we want. You specify the exact number of pixels in the output. Internally the location to sample at is computed with fixed-point math (the exact details of that are not critical, but the 16 fractional bits used by this code are probably a fine choice). The existing code interpolates between two rows into a temporary buffer (which is only the size of two output rows, plus a small amount of padding), but it should be possible (and more efficient) to load the necessary pieces of both rows into registers, sample using VTBL, and then interpolate, without ever hitting a temporary buffer.
(In reply to comment #10)
> No, not at all. Using 11x the memory for a scale factor of 1.1 would be
> absolutely terrible for performance, not to mention very fragile for large
> frame sizes (what do you do when you need to scale by 4739/4320?).

Agreed!

> Really do look at ScaleYCbCrToRGB32 in gfx/ycbcr/yuv_convert.cpp.

I'll give it a closer look.
Interesting, I did not know that somebody else is working on scaled YUV->RGB conversion at the moment :-) Just the other day I tried to revive the old planar YUV support discussion and propose a minor API extension in pixman, which would allow to fit scaled YUV->RGB conversion nicely into cairo rendering pipeline: http://comments.gmane.org/gmane.comp.graphics.pixman/1029

Would it be better for me to suspend this activity?
(In reply to comment #7)
> It is surely possible to write an arbitrary scaler, but it would not be
> efficient. Still, it could probably be made notably more efficient than a plain
> C version, so I'll give it some thought.

The competition for this NEON optimized single pass YUV-RGB scaler is going to be just an unscaled NEON YUV->RGB conversion followed by RGB->RGB scaling. Possibly done on per-slice basis in order to improve L1/L2 cache hit ratio.

If NEON optimized YUV->RGB scaling function will not be fast enough to outperform it, that's going to be a fail.
(In reply to comment #12)
> Would it be better for me to suspend this activity?

Arguably, the code would be better off in Cairo than in Firefox, as it would be more generally useful. I also suspect that your NEON knowledge exceeds my own somewhat. What is the likely timescale for getting something working in Cairo? I'm not really familiar with the Cairo APIs, but do you think it'd be easy enough to hook a suitable scaling routine into Firefox?

I'm currently working on a routine that will scale between 0.5-1.0 times in either or both directions. It's not actually as tricky as I'd thought, but getting it right will probably take some time. Do you have any code already?
(In reply to comment #13)
> If NEON optimized YUV->RGB scaling function will not be fast enough to
> outperform it, that's going to be a fail.

Down-scaling YV12 is much simpler than down-scaling RGB, so it should be easy to make the combined operation much faster than the separate ones, even without considering the cache and memory effects. In YV12, each U and V value covers a 2*2 pixel block, so we can use a simple nearest-value scheme and apply bilinear filtering only to the Y channel.

Up-scaling is similar. I suspect that for small-to-moderate up-scaling, bilinear filters on the U and V channels will be unnoticeable, so again, we only need to implement a proper bilinear filter for the Y channel.

The biggest problem is that an arbitrary scale is that it's not actually very well suited to NEON. We can't use the VLD{1-4} instructions to their full potential, and we essentially end up loading scalar values into individual lanes. However, the rest of the filter (including the YV12 stuff) can be handled very nicely.
(In reply to comment #14)
> What is the likely timescale for getting something working in Cairo?

That's actually the biggest question. Integrating this functionality smoothly into pixman (and cairo) is more difficult than writing a standalone scaler itself. Especially considering the emphasis on image quality and intention to do it right in pixman. On the other hand, I'm already fed up having quick hacks for mostly the same stuff in different codebases. So finding some real "home" for these optimizations may be a good idea.

If everything goes fine, then it may take one or two weeks. If something goes really bad (strong disagreement with upstream about this feature), then it would need to be dropped in a few days :-)

> I'm not really familiar with the Cairo APIs, but do you think it'd be easy
> enough to hook a suitable scaling routine into Firefox?

Yes, it should be quite easy. It is going to be a cairo surface. Or at least in the case of relying on pixman only - just C API for buffer to buffer scaling.

> I'm currently working on a routine that will scale between 0.5-1.0 times in
> either or both directions. It's not actually as tricky as I'd thought, but
> getting it right will probably take some time.

I wonder what's so special about 0.5-1.0 times scaling (and not arbitrary scale factor)? I would guess that upscaling might be actually needed more often. Because screen resolutions are not so bad on modern mobile devices (at least VGA). "Internet" videos seem to be typically having rather low resolution.

> Do you have any code already?

Lots of different test code snippets but not a complete implementation yet. At the very least, doing unscaled YUV->RGB conversion in a temporary scanline buffer and then using nearest or bilinear scaling of this RGB data should be quite easy because all of this code already exists. Going to single pass processing and then checking what is faster is a natural next step.
Code-freeze for the next Fennec release is Mar. 8th (i.e., five days from now). In the past that's been known to slip a day or two, but if we want this before then, the Cairo approach doesn't seem possible. In fact, if it requires a Cairo update on our side, it's probably not possible for 4.x at all, since AFAIK the version we're currently using doesn't support YUV surfaces at all. Not doing this through Cairo also makes it easier to push the code back upstream to Chromium (where the rest of our YUV->RGB code comes from), if we wanted to do that.

FWIW, doing YUV->RGB and then scaling seems like a slower approach than scaling the YUV and then converting. It requires a temporary buffer (which means more loads/stores, even if you can reduce the number of per-pixel loads) and more arithmetic (because you have to interpolate at full resolution for R, G, and B, instead of only doing half the interpolation for U and V, or doing nearest-neighbor as Jacob suggested).

I don't understand the limitation on the scale range either, if you're doing individual loads per pixel. The route I was suggested was limited to scale factors larger than 0.5 (but not limited at 2 on the upper end) because it was using aligned loads of up to two blocks of data, followed by VTBL to select the individual columns that you actually wanted. You would still need a fallback that supports scaling down to less than 0.5 the original size.
(In reply to comment #17)
> Code-freeze for the next Fennec release is Mar. 8th (i.e., five days from now).

But you are not closing up shop after this date, right? And I hope there will be more fennec releases later.

> In the past that's been known to slip a day or two, but if we want this before
> then, the Cairo approach doesn't seem possible. In fact, if it requires a Cairo
> update on our side, it's probably not possible for 4.x at all, since AFAIK the
> version we're currently using doesn't support YUV surfaces at all. Not doing
> this through Cairo also makes it easier to push the code back upstream to
> Chromium (where the rest of our YUV->RGB code comes from), if we wanted to do
> that.

Well, if Jacob is going to provide optimizations for this Chromium code in the available time frame, then surely go for it. And actually I have some other more urgent things to do which may be beneficial for Fennec considering Mar. 8th deadline.

> FWIW, doing YUV->RGB and then scaling seems like a slower approach than scaling
> the YUV and then converting. It requires a temporary buffer (which means more
> loads/stores, even if you can reduce the number of per-pixel loads) and more
> arithmetic (because you have to interpolate at full resolution for R, G, and B,
> instead of only doing half the interpolation for U and V, or doing
> nearest-neighbor as Jacob suggested).

Many approaches can be tried and benchmarked. Just unscaled YUV->RGB conversion already exists. RGB->RGB scaling exists too. So they can be used easily. And keeping temporary buffer in L1 cache is going to be faster than doing first pass converting the whole frame to RGB, followed by another pass also scaling the whole frame. Memory bandwidth is the main bottleneck in ARM devices for such operations unless they involve really a lot of computations.

> I don't understand the limitation on the scale range either, if you're doing
> individual loads per pixel. The route I was suggested was limited to scale
> factors larger than 0.5 (but not limited at 2 on the upper end) because it was
> using aligned loads of up to two blocks of data, followed by VTBL to select the
> individual columns that you actually wanted. You would still need a fallback
> that supports scaling down to less than 0.5 the original size.

I'm not so sure. VTBL is a relatively slow instruction which takes 2-3 cycles to execute and only provides 64-bit of data as a result: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/ch16s06s06.html

It may be a better idea to do initial processing of pixels using ARM instructions, bundle them together and then pass to NEON. The rationale is that NEON has throughput of around 1 instruction per cycle at best. But the decoder can provide 2 instructions per cycle, so ARM pipeline is underused and actually can do some useful work assisting NEON.
2-3 cycles to do the equivalent of 8 loads sounds pretty good to me. To be fair, you need to add the overhead to build the table that drives the selection and to do the vector loads that feed it, but the ARM side has to do this work, too (and also has to pack the data so it can shunt it over to the NEON side, which isn't free), and it can't parallelize any of it, while the NEON side can.

NEON can also share the first cycle of one instruction with the last cycle of the previous one, as long as they're of different types (load/store/permute, which includes VTBL, vs. arithmetic), as I'm sure you're aware. Done right, this can lessen the impact of using the expensive permute instructions considerably.

One could also envisage a hybrid approach, with the ARM side doing, say, U and V loads, while the NEON side does the Y loads and interpolation. There's less U and V data, and more calculations can be shared, so the lack of vectorization may not be so bad.
(In reply to comment #19)
> NEON can also share the first cycle of one instruction with the last cycle of
> the previous one, as long as they're of different types (load/store/permute,
> which includes VTBL, vs. arithmetic), as I'm sure you're aware.

Unfortunately not with Cortex-A9 anymore.
(In reply to comment #17)
> Code-freeze for the next Fennec release is Mar. 8th (i.e., five days from now).
> In the past that's been known to slip a day or two, but if we want this before
> then, the Cairo approach doesn't seem possible.

To be honest, I'm going to struggle to get it into Firefox by that date too, but I'm hoping to have something to post on this bug today, or maybe Monday.

> I don't understand the limitation on the scale range either, if you're doing
> individual loads per pixel.

For ranges 0.5-1.0, you have exactly four source pixels influencing any one destination pixel. The same is also true of up-scaling, but I suspect that some of the calculation details will differ so I will consider that separately. Most of the algorithm will be very similar.

Up-scaling by large factors will require bilinear filters on the U and V channels too. I don't know where the threshold would be, but nearest artifacts on U and V would probably be visible between factors of 1.5 and 2.5. We could filter U and V on all up-scaling operations to be on the safe side, but it will cost performance for small scaling factors, such as the off-by-one-type errors that you mentioned.

Down-scaling below 0.5 will be possible with the algorithm I'm looking at now, but it probably wouldn't do it right. What is the proper way to down-scale when you have more than two source pixels (in each dimension) to a destination pixel?

> The route I was suggested was limited to scale
> factors larger than 0.5 (but not limited at 2 on the upper end) because it was
> using aligned loads of up to two blocks of data, followed by VTBL to select the
> individual columns that you actually wanted.

The trouble with the VTBL approach is that you have to check that you aren't reading beyond the limits of the source image. The conditional code required can be quite disruptive. I might be wrong, but intuitively I would have thought that a sequence of scalar loads will be faster. My current (sketchy and untested) implementation loads two pixels from two rows (for V filtering) based on a src_x co-ordinate, adds a 16.16 (fixed-point) 'dx' value to src_x, then repeats for each element. This isn't as bad as it looks because most of the loads will be in cache anyway.
(In reply to comment #19)
> 2-3 cycles to do the equivalent of 8 loads sounds pretty good to me. To be
> fair, you need to add the overhead to build the table that drives the selection
> and to do the vector loads that feed it, but the ARM side has to do this work,
> too (and also has to pack the data so it can shunt it over to the NEON side,
> which isn't free), and it can't parallelize any of it, while the NEON side can.

NEON can do scalar loads. For example:

@ Load a uint8_t from [r0] to the first (8-bit) lane of d0.
vld1.u8   d0[0], [r0]

It isn't parallel, of course, but neither is the work you have to do to populate the VTBL look-up register. (It might be possible to do something clever there with 16.16 offsets, but I haven't thought it through.)

VTBL works very well for some tasks. I think it would work well here if we could guarantee that we could over-read the source image. However, because we probably can't over-read the source, we cannot issue the block load required by VTBL, so individual element loads are the only obvious way to proceed.
(In reply to comment #21)
> Down-scaling below 0.5 will be possible with the algorithm I'm looking at now,
> but it probably wouldn't do it right. What is the proper way to down-scale
> when you have more than two source pixels (in each dimension) to a destination
> pixel?

Well, "proper" is in the eye of the beholder. All the alternatives have trade-offs. Without some kind of low-pass filter, you'll introduce aliasing. The (mathematically) ideal one is based on a sinc filter of infinite extent, which is clearly impractical. A real implementation would use something like a Lanczos filter (sinc-windowed sinc), but that will still introduce ringing around edges because real images aren't band-limited. A box filter is probably the best you can hope for around strong edges.

In practical terms, I'd be happy to see a simple bilinear filter like you're proposing (which at least has _some_ low-pass characteristics). Much more than that is getting too complex to be worth it.

> The trouble with the VTBL approach is that you have to check that you aren't
> reading beyond the limits of the source image. The conditional code required

You could always just loop up through the last block, and only special case the end. But in practice all of our images are padded (for Theora, an extra 16 pixels on all four sides for Y, 8 for U and V; for VP8, twice that amount).
(In reply to comment #22)
> It isn't parallel, of course, but neither is the work you have to do to
> populate the VTBL look-up register. (It might be possible to do something
> clever there with 16.16 offsets, but I haven't thought it through.)

Each 16.16 offset is just a+b*x, where a and b are 16.16 values, and x is the horizontal pixel coordinate. You can advance all of them to the next block by adding 16*b. After that you subtract off the integer coordinate of the data you actually have loaded, and extract a few bits from the integer part to make up the table (this extraction is, I think, actually the most expensive part). You even get to re-use the tables for both rows you're loading. It _seems_ reasonable to me, but I haven't written the code.
(In reply to comment #23)
> You could always just loop up through the last block, and only special case the
> end. But in practice all of our images are padded (for Theora, an extra 16
> pixels on all four sides for Y, 8 for U and V; for VP8, twice that amount).

I am working with 8 _destination_ pixels at a time, so for the 0.5-to-1.0 range, we could be loading anything between 9 to 16 source pixels. Are the images actually padded, or are they padded up to a 16-pixel alignment boundary?

Of course, we could choose to handle 16 source pixels at a time and store a variable width, but that seemed more complicated to implement when I thought through it a day or two ago.
(In reply to comment #25)
> range, we could be loading anything between 9 to 16 source pixels. Are the
> images actually padded, or are they padded up to a 16-pixel alignment boundary?

The images are actually padded (motion compensation can reference data outside the frame). This might change if we were to enable post-processing (which is done in a separate buffer that is never used for motion compensation, and thus doesn't need the padding), but AFAIK we have no plans for doing that in the near future.

> Of course, we could choose to handle 16 source pixels at a time and store a
> variable width, but that seemed more complicated to implement when I thought
> through it a day or two ago.

Yeah, I don't think that's a good idea, either.
Attached file Half-complete generic scaler. (obsolete) —
Ok, here's a part-complete scaler. It lacks the YUV conversion, but that can be pulled straight from my other routines (or from Siarhei's version, currently in mozilla-central). It also lacks a lot of the looping code, the nearest filter for U and V, and that kind of stuff. It does, however, do a bilinear filter on the Y channel.

It builds and runs without crashing, but is obviously untested. Apologies for the half-written nature of this code!

I won't be able to work on it this weekend, but feel free to continue it if you feel so inclined. (I'll also be at home with no working internet connection so I won't be able to pick up e-mails.) I'll be back in the office on Monday, as usual.
Okay, I actually took the time to implement the VTBL method, with bilinear for Y and nearest-neighbor for U and V. Managed to get a simple ordered dither for free (a few cycles per row, but no extra cycles per pixel).

As far as I can tell, using VTBL for the loads can load the 64 Y bytes needed to produce 16 output pixels in 48 cycles. Less on an A8 where some NEON instructions can dual issue (this could be improved, I didn't make much attempt at sharing). And I get the x weights for free.

By comparison, the single-lane VLD2.8 loads your patch was using were 3 cycles each (or 6 cycles/pixel compared to 3). You could reduce that a lot by using a VLD1.16 and unzipping things at the end, but it's still over 4 cycles/pixel before accounting for the weights or any of the surrounding ARM instructions.

I also used a simpler method for the bilinear blending, which uses half the multiplies (since they have an incredible 6-cycle latency), used VDUP/VMLAL instead of VMUL/VADD for the conversion to RGB itself, and VQSHLU instead of VQSHRUN/VSHLL for the final packing, since there's no even/odd merging that needs to be done.

In any case, the code for a single row is attached. I'm working on a full patch. This only supports scale factors down to 0.5 (and shouldn't be used for large scale factors, or the nearest-neighbor for chroma will look awful), and only supports 4:2:0, so we'll need a few more versions, but this should at least cover the most common cases.
Ok. My solution isn't far off ready but there'll be little performance difference so if you're already integrating your patch it doesn't make sense to stop now. (I used VMLAL too in the end, but I'm still doing single-lane loads because over-reading the rows makes me nervous, even if it's safe for our current video sources.)
Incidentally, it's still best to do a whole-image conversion, rather wrap each row in a function call. We need to stack every pretty much every callee-preserved register for this call, so there's a lot of data to move around for each call.
(In reply to comment #29)
> Ok. My solution isn't far off ready but there'll be little performance
> difference so if you're already integrating your patch it doesn't make sense to
> stop now. (I used VMLAL too in the end, but I'm still doing single-lane loads

Yeah, sorry about jumping in like that, but I hadn't heard from you for several days, and your last message was, "Feel free to continue." The above is far from a complete solution, and we'll still probably need a single-lane load version for scale factors under 0.5, if nothing else. I don't really have a great deal of time to spend on this personally, but I wanted to get us to the point of having even a basic proof of concept that actually works (which the above is not, yet... hopefully I'll have a full patch sometime today).

> because over-reading the rows makes me nervous, even if it's safe for our
> current video sources.)

I agree with you in principle, but it turned out that in order to avoid various half-pixel shifts (which under large scale changes can become several pixels in either the source or the destination, depending on whether you're shrinking or enlarging the frame, respectively), you need coordinates that are slightly negative at the beginning and point slightly past the row at the end. The current Chromium code does not handle this at all well, and so is not a good model in this regards. The upshot is you're always going to overread, unless you clamp, which is expensive.

(In reply to comment #30)
> Incidentally, it's still best to do a whole-image conversion, rather wrap each
> row in a function call. We need to stack every pretty much every
> callee-preserved register for this call, so there's a lot of data to move
> around for each call.

From a pure performance perspective, you are correct, but the cost is on the order of 28 kcycles for a 1080p frame, i.e., about as much as a couple of contested lock acquisitions. The maintenance benefits are huge. Even just from the reduced number of versions of this function we need: although I said above this code only handles 4:2:0, it actually handles 4:2:2 as well, with only minor modifications to the calling C code. It makes it easier to solve any future problems with row-overreading without rewriting the asm (you can switch to a C fallback for parts of the rows, which won't even be _terrible_ since they reduce to "compute a single pixel value and memcpy"). It also makes it trivial to add things like multithreading for dual-core devices. Even just handling things like the half-pixel offsets mentioned above (i.e., all the nasty things marked "TODO") was extraordinarily easier to do in C. If we ever want to support MPEG cositing instead of JPEG, we can do that in C now, etc. I'm sure there's other things I haven't thought of. There are relatively few people who can write NEON asm, or even read and understand it, but nearly any developer we have can fix issues in C.
So, this is the full patch. It adds ScaleYCbCrToRGB565 with (currently) a single NEON-acclerated code path. It also fixes a ton of bugs I found in the process, against both the current ConvertYCbCrToRGB565 path and even the desktop paths. I'll be breaking it out into pieces for review and separating out some things into their own bugs, but this should be enough to get people going for now.

jbramley, if you want to help write some of the other accelerated NEON variants, that would be awesome. Feel free to throw out my current asm if you want. It should be fairly well independent of the rest. I just needed something to hang the rest around.
Attachment #518212 - Attachment is obsolete: true
Attachment #518634 - Attachment is obsolete: true
Attachment #518690 - Attachment mime type: application/octet-stream → text/plain
(In reply to comment #31)
> (In reply to comment #29)
> > Ok. My solution isn't far off ready but there'll be little
> > performance difference so if you're already integrating your patch
> > it doesn't make sense to stop now. (I used VMLAL too in the end, but
> > I'm still doing single-lane loads
> 
> Yeah, sorry about jumping in like that, but I hadn't heard from you
> for several days, and your last message was, "Feel free to continue."

No worries!

> The above is far from a complete solution, and we'll still probably
> need a single-lane load version for scale factors under 0.5, if
> nothing else. I don't really have a great deal of time to spend on
> this personally,

That's a shame. You appear to be one of those rare people who know NEON
well enough to write a routine like this.

> > because over-reading the rows makes me nervous, even if it's safe
> > for our current video sources.)
> 
> I agree with you in principle, but it turned out that in order to
> avoid various half-pixel shifts (which under large scale changes can
> become several pixels in either the source or the destination,
> depending on whether you're shrinking or enlarging the frame,
> respectively), you need coordinates that are slightly negative at the
> beginning and point slightly past the row at the end. The current
> Chromium code does not handle this at all well, and so is not a good
> model in this regards. The upshot is you're always going to overread,
> unless you clamp, which is expensive.

I'm sure there's a good way to clamp. My working solution knows when the
iteration is the last one in the row, so we could have a separate src_dx
for the last pixel in the last block, for example.

I'm using overlapping blocks so we will only ever over-read by one
pixel. I think you'll be familiar with the concept, but to clarify and
for the sake of the bug record, it is described here:
http://blogs.arm.com/software-enablement/196-coding-for-neon-part-2-dealing-with-leftovers/

> (In reply to comment #30)
> > Incidentally, it's still best to do a whole-image conversion, rather
> > wrap each row in a function call. We need to stack every pretty much
> > every callee-preserved register for this call, so there's a lot of
> > data to move around for each call.
> 
> From a pure performance perspective, you are correct, but the cost is
> on the order of 28 kcycles for a 1080p frame, i.e., about as much as a
> couple of contested lock acquisitions.

28000 cycles at 1GHz is admittedly not a lot of actual time, but it
still seems inefficient to me. The performance with a 1080p frame is not
significant, but the penalty for a small video is relatively much
greater.

> The maintenance benefits are huge.

That argument is a good one, and probably out-weighs the performance
penalty in this case.

> jbramley, if you want to help write some of the other accelerated NEON
> variants, that would be awesome. Feel free to throw out my current asm
> if you want. It should be fairly well independent of the rest. I just
> needed something to hang the rest around.

I certainly wouldn't throw it out. I think it'd be best if someone takes
the best bits from each of our implementations. (That someone could be
me if you like, but a group effort may be better.)

I'd like to suggest a few convention modifications, if you don't mind.
For example, "push" is much clearer than "stmfd ...", to me at least.
I'll give it a full review next week.

----

I'd also like to see this stuff handled by Pixman eventually. If Pixman
has the routines, more than just Firefox benefits from it, and we avoid
having several similar (or identical) implementations spread about the
place.

----

Finally, it should be noted that the _best_ place to do this kind of
this is on a GPU. Is there any plan to get that stuff working? I heard
from somewhere that we had OpenGL-ES support for some things. Apologies
for my ignorance; I'm not familiar with this area of the browser at all.
(In reply to comment #34)
> I certainly wouldn't throw it out. I think it'd be best if someone takes
> the best bits from each of our implementations. (That someone could be
> me if you like, but a group effort may be better.)

By that, I mean that we shouldn't nit-pick about the precise implementation at the start, but we should tweak and optimize later.
Depends on: 640980
(In reply to comment #34)
> That's a shame. You appear to be one of those rare people who know NEON
> well enough to write a routine like this.

I wrote my first ARM assembly in October. I make no claims to being an expert. But if you noticed the other 1800-odd lines in the diff I posted, there's a few things that need to be done besides the NEON part.

> I'm sure there's a good way to clamp. My working solution knows when the
> iteration is the last one in the row, so we could have a separate src_dx
> for the last pixel in the last block, for example.

It's not quite that simple. Large scaling factors mean we could be pointing outside the source for more than 16 pixels at the beginning and at the end. The stuff I said before about only having to compute a single pixel value per edge per row is also not correct: for 4:2:0 and 4:2:2 sources you can have two, because the chroma offset is still negative when the first luma pixel is processed (at least when you're doing bilinear for chroma instead of nearest-neighbor). A similar thing happens on the right edge.

> I'm using overlapping blocks so we will only ever over-read by one
> pixel. I think you'll be familiar with the concept, but to clarify and
> for the sake of the bug record, it is described here:
> http://blogs.arm.com/software-enablement/196-coding-for-neon-part-2-dealing-with-leftovers/

Sure. My code over-reads and then special-cases the writes in the last block, because it let me handle widths less than 16 without any extra work. However, you can do quite a lot of over-reading when the destination width is less than 16 (I still believe it's safe, but it's a lot more than one pixel). I'm happy with either approach. My patch includes C fallbacks which can be used when one of the dimensions is very small.

> 28000 cycles at 1GHz is admittedly not a lot of actual time, but it
> still seems inefficient to me. The performance with a 1080p frame is not
> significant, but the penalty for a small video is relatively much
> greater.

Sure, but most people, at least on a mobile device, are still only going to be playing one video at a time, so relative performance doesn't mean much. I agree it's not the most efficient solution, but it is much more maintainable, and there are much easier places in our codebase to save a few kilocycles, if it really bothers you.

> I certainly wouldn't throw it out. I think it'd be best if someone takes
> the best bits from each of our implementations. (That someone could be
> me if you like, but a group effort may be better.)

Sounds like you just volunteered. I'm happy to serve as reviewer for whatever you come up with.

> I'd like to suggest a few convention modifications, if you don't mind.
> For example, "push" is much clearer than "stmfd ...", to me at least.
> I'll give it a full review next week.

Sure. I will note that, although in principle your register allocation macros are a nice idea, I found it extremely difficult to read the resulting code, because I couldn't (easily) tell what registers were D registers vs. Q registers, which registers were pairs, or grouped consecutively for the instructions where that matters... at least not without spending a lot of time jumping back and forth. If you do want to use them, they should also probably live somewhere centralized, rather than just being plopped into yuv_row_arm.s.

> I'd also like to see this stuff handled by Pixman eventually. If Pixman
> has the routines, more than just Firefox benefits from it, and we avoid
> having several similar (or identical) implementations spread about the
> place.

Unfortunately, upgrading pixman is not an option for the 4.x timeframe, and I'm still hoping to get this, if not into 4.0, at least into a quick follow-up release. Post-4.x, there's been some grumbling about dropping pixman altogether and writing our own thing. That is not my department, however.

> Finally, it should be noted that the _best_ place to do this kind of
> this is on a GPU. Is there any plan to get that stuff working? I heard
> from somewhere that we had OpenGL-ES support for some things. Apologies
> for my ignorance; I'm not familiar with this area of the browser at all.

Yes, this was put off so long because we were expecting the GPU solution to work. Unfortunately it crashes on far too many devices to be turned on by default, and is simply not ready for prime time. This is the hastily-conceived back-up plan.(In reply to comment #35)

> By that, I mean that we shouldn't nit-pick about the precise implementation at
> the start, but we should tweak and optimize later.

Agreed. "Something that works" is most important. I'm willing to r+ a patch that isn't the fastest possible, as long as it's better than what we have now (which shouldn't be difficult) and doesn't break anything (considering all the special cases and how much is already broken, a more challenging task). The stuff I posted meets that minimum bar, I think. Further improvements would be most welcome, though they can be done as a separate patch/bug on a less-accelerated schedule if necessary.
Depends on: 641014
Blocks: 641019
Attachment #518690 - Attachment is patch: true
Attachment #516906 - Attachment is patch: true
Attachment #516906 - Attachment is patch: false
(In reply to comment #36)
> (In reply to comment #34)
> > I'd also like to see this stuff handled by Pixman eventually. If Pixman
> > has the routines, more than just Firefox benefits from it, and we avoid
> > having several similar (or identical) implementations spread about the
> > place.
> 
> Unfortunately, upgrading pixman is not an option for the 4.x timeframe,

Any version of pixman is perfectly usable for fennec 4.x when fennec is configured with --enable-system-pixman. That's what I'm doing in all my tests. And that's what all sane linux distributions are doing too. But I admit that all the others who have to rely on a bundled in-tree pixman copy might have a bit harder life.

The biggest problem, as I see here, is the process of upgrading this in-tree pixman which scares everyone. Eliminating or greatly reducing the chances of human error would be a good idea. Because pixman is basically a pure C library (with a bit of external assembly sources for ARM) responsible for software rendering done on memory buffers and has no external dependencies, has anybody ever considered to introduce something similar to sqlite "amalgamation"? http://www.sqlite.org/amalgamation.html

> and I'm still hoping to get this, if not into 4.0, at least into a
> quick follow-up release. Post-4.x, there's been some grumbling
> about dropping pixman altogether and writing our own thing. That
> is not my department, however.

As I see it, some politics might be involved here. The future always looks brighter with the new flashy things. And the big GPU plan might also serve as an example. Anyway, I hope these grumblers have the answers for the following questions readily available:
1. What they don't like in pixman?
2. How exactly the own thing is going to help?

> > Finally, it should be noted that the _best_ place to do this kind of
> > this is on a GPU. Is there any plan to get that stuff working? I heard
> > from somewhere that we had OpenGL-ES support for some things. Apologies
> > for my ignorance; I'm not familiar with this area of the browser at all.
> 
> Yes, this was put off so long because we were expecting the GPU solution to
> work. Unfortunately it crashes on far too many devices to be turned on by
> default, and is simply not ready for prime time. This is the hastily-conceived
> back-up plan.(In reply to comment #35)

Indeed, this original GPU plan surely affected all of us, me included (at least as a great source as discouragement). That's why we all are a bit late and catching up with introducing the missing ARM NEON optimizations for the software rendering backend.

> > By that, I mean that we shouldn't nit-pick about the precise
> > implementation at the start, but we should tweak and optimize later.
> 
> Agreed. "Something that works" is most important. I'm willing to r+ a patch
> that isn't the fastest possible, as long as it's better than what we have now
> (which shouldn't be difficult)

Actually this is where I have some doubts.

Essentially "what" we currently do is an unscaled "yv12 -> r5g6b5" conversion followed by nearest neighbour scaling of the result to r5g6b5 destination. And in my opinion this existing design is very hard to beat in terms of performance at least for upscaling, especially with high scale ratios. Just because this solution performs actual YUV->RGB conversion for much smaller number of as opposed to this conversion happening in the end. For example, for 480x360 video (360p format from youtube) upscaled to fullscreen 640x480, we need to convert 1.78x less pixels. Downscaling is a different story, and I assume when we need to downscale 720p video to fit the typical smartphone 800x480 screen, we additionally have troubles saving any CPU cycles to provide them to the video decoder, so we might have to forget about any quality at all in this case and concentrate on performance only, so bilinear scaling even only for the luma plane may be excessive.

But "how" we currently do the above mentioned unscaled "yv12 -> r5g6b5" conversion followed by nearest neighbour scaling of the result to r5g6b5 destination is another question. See bug 641196 for more details. Are you sure that just introducing a single pass yuv->rgb565 scaler without fixing the rest of current braindead html5 video graphics pipeline would give much dividends?

Wrapping it up, I would like to see some answers for the following questions:
1. What is the primary purpose of this bug precisely? Is it about improving html5 video performance? Or about improving image quality? Or maybe just finding some use for the fantastic VTBL instruction? ;) Some clarity would be surely welcome.
2. What are the (primary) use cases of interest?
3. How the improvements introduced by the attached patches are going to be evaluated?
By the way, I'm still doing my own implementation of YUV->RGB scaler. So I have not dropped out of this "contest" yet :)
(In reply to comment #37)
> The biggest problem, as I see here, is the process of upgrading this in-tree
> pixman which scares everyone. Eliminating or greatly reducing the chances of
> human error would be a good idea. Because pixman is basically a pure C library

I do not agree with you here (though I tend to agree with your assessment of "new flashy things"), but I am also not the one who works on these areas of the code, and this is very much not my call. So arguing with me won't get you anywhere, and I don't intend to comment on this any further.

> 1. What they don't like in pixman?
> 2. How exactly the own thing is going to help?

Please ask in #gfx or file a different bug on the subject if you want to discuss it.

> Essentially "what" we currently do is an unscaled "yv12 -> r5g6b5" conversion
> followed by nearest neighbour scaling of the result to r5g6b5 destination. And
> in my opinion this existing design is very hard to beat in terms of performance
> at least for upscaling, especially with high scale ratios. Just because this

If the limitation is memory bandwidth, as you keep saying, then this is not true, because "what we currently do" is drag the entire image through cache twice during this process, instead of once (many more times, actually, given bug 641196, but even with that fixed, it would be twice, because right now the conversion and scaling steps happen in different processes). I agree that the "scale, convert" order may not be the best for upscaling, but it should still be done in one ScaleYCbCrToRGB565 function. Staging to a temporary RGB24 row buffer that fits in L1 is probably best here, and it would not be hard to modify the patch I posted to do so.

> to downscale 720p video to fit the typical smartphone 800x480 screen, we
> additionally have troubles saving any CPU cycles to provide them to the video
> decoder, so we might have to forget about any quality at all in this case and

I don't think WebM can decode 720p on current smartphone CPUs, at all. Theora can just barely do it for relatively low-motion clips on an N900, basically leaving no CPU left over for audio or actual display. So making global policy decisions on what kind of scaling to use to try to get these cases to work seems like a bad idea to me. In theory it should work on some of the dual-core A9s, except that we do decode and color conversion on the same thread. Fixing that, I think, should be a separate bug.

> destination is another question. See bug 641196 for more details. Are you sure
> that just introducing a single pass yuv->rgb565 scaler without fixing the rest
> of current braindead html5 video graphics pipeline would give much dividends?

I am not at all sure that it is sufficient, but I am sure that it is necessary.

> 1. What is the primary purpose of this bug precisely? Is it about improving
> html5 video performance? Or about improving image quality? Or maybe just
> finding some use for the fantastic VTBL instruction? ;) Some clarity would be
> surely welcome.

See the title. This bug is about a very specific improvement to HTML5 video performance. Lots of other, possibly orthogonal, improvements are certainly possible, but should be in their own bugs.

> 2. What are the (primary) use cases of interest?

I think the primary use cases should be a) relatively small scalings (aspect ratio corrections, etc.), and b) moderate upscaling (240p, 360p -> 480p, etc.). I am happy to listen to other opinions.

> 3. How the improvements introduced by the attached patches are going to be
> evaluated?

I thought I said that in my last comment: "better than what we have now... and doesn't break anything." If there's more than one "competing" solution that meets both those criteria, then all the usual trade-offs apply (speed vs. maintenance cost, etc., see comment #31).

(In reply to comment #38)
> By the way, I'm still doing my own implementation of YUV->RGB scaler. So I have
> not dropped out of this "contest" yet :)

Great! This is how open source works.
Attached file Jacob's generic scaler (unoptimized). (obsolete) —
I realize that I never attached my complete scalar, so here it is. It produces a good result when scaling from 0.5 to 1.5, and some sources still look good up to 2.5. (The U and V block become apparent at that point.) I haven't optimized the code yet, so it's poorly scheduled and lacks PLD instructions.

I intend to slot the two solutions into a test + benchmark framework and pick the best bits from each, until we have a good all-round solution. Your VTBL solution is starting to look appealing. My solution generates a lot of code for all the single-lane loads. (It could load them in a loop, but that might cost some performance.) I will assess the difference in more detail.

Mine still works on whole images, but I want to run some real benchmarks to see how much effect it has. With the figures you quoted, I'm doubtful that it will matter, and in that case we can use a per-row solution.

(In reply to comment #36)
> Sure. I will note that, although in principle your register allocation macros
> are a nice idea, I found it extremely difficult to read the resulting code,

Yep, agreed! I wrote them to try to help keep track of overlapping registers in different regions of code, but the macros ended up rather horrible to use.
Attachment #516906 - Attachment is obsolete: true
Attachment #519688 - Attachment mime type: application/octet-stream → text/plain
I've attached the simple testbench I wrote, which you may find helpful. It reads input from a Theora file, so that a) all the various conditions (picture offsets, 4:2:2 and 4:4:4 formats, etc.) can be tested easily and b) the buffer is padded exactly as it will be in Firefox.

It's not suitable for timing as is, but it would not be too hard to modify it for that purpose. Right now it just reads the first frame, converts it, and dumps the result to a PNG.

It uses ycbcr_to_rgb565.h, ycbcr_to_rgb565.cpp, yuv_convert_arm.cpp, and yuv_row_arm.s directly as from my patch. The rest of the headers are stubs for the remaining Mozilla dependencies. Much more convenient than testing in-tree. You'll need installed copies of libtheora, libogg, libpng, and zlib to use it, though.
Attachment #519688 - Attachment description: Jacob's generic scalar (unoptimized). → Jacob's generic scaler (unoptimized).
As described in bug 637961 comment #14 the current tree of thunderbird does not compile on windows with out the patch of this bug. Unfortunately the old patch provided here cannot be applied to the trunk (mozilla-central). So I updated it. 

(Disclaimer: I have no idea what this really does and the only purpose of the patch is to make it compile again.)
(In reply to comment #42)
> As described in bug 637961 comment #14 the current tree of thunderbird does not
> compile on windows with out the patch of this bug. Unfortunately the old patch
> provided here cannot be applied to the trunk (mozilla-central). So I updated
> it. 
> 
> (Disclaimer: I have no idea what this really does and the only purpose of the
> patch is to make it compile again.)

I've been splitting that patch up into separate patches for purposes of review. Of those, bug 640588 and bug 641019 have already landed (which is why the patch in this bug no longer applies). I believe the patches from bug 639415, bug 583958, and bug 641014 should now be enough to fix your issue (if they're not, let me know).
> I've been splitting that patch up into separate patches for purposes of review.
> Of those, bug 640588 and bug 641019 have already landed (which is why the patch
> in this bug no longer applies). I believe the patches from bug 639415, bug
> 583958, and bug 641014 should now be enough to fix your issue (if they're not,
> let me know).

Thank you for that hint. With these patches applied the compilation works.
(In reply to comment #44)
> Thank you for that hint. With these patches applied the compilation works.

Great. I've made a note to that effect over in bug 641014. Sorry this got spread around in so many bugs, but there were, uh, a lot of things broken.
Hi!

I've been trying to compare the two routines side-by-side. However, the
BilinearY_Row_NEON routine doesn't seem to work for me.

  • When the vertical scaling factor is 1, I get output as if only U has
    been sampled, and the Y and V values are both 0 (or at least close
    to it).

  • Other vertical scaling factors produce lots of colourful noise.

I thought I was calling it incorrectly, but the Bilinear_Row_C routine
works correctly. Before I try to debug it, do I have the latest version
here?

Also, I found a bug in my previous routine (which I'll correct when I
upload my comparison results), and a unrelated bug in
ScaleYCbCrToRGB565:

  source_x0_q16 = source_x0*source_dx_q16+(source_dx_q16>>1)-0x8000;

The source_x0 quantity is already in the source co-ordinate space, so it
just needs converting to 16.16 format:

  source_x0_q16 = (source_x0<<16)+(source_dx_q16>>1)-0x8000;

My test images have padding on all sides, so this one was easy to spot.

Thanks,
Jacob
(In reply to comment #46)
> I thought I was calling it incorrectly, but the Bilinear_Row_C routine
> works correctly. Before I try to debug it, do I have the latest version
> here?

I haven't updated it since I posted the patch, and though I've made some changes to other, unrelated pieces of the code in other bugs compared to the current patch in this bug, I haven't touched the actual NEON routine. I know the NEON routine was actually being called when I tested it, too, because I was setting breakpoints in it, so it at least worked at one point, for at least a few test cases.

>   source_x0_q16 = source_x0*source_dx_q16+(source_dx_q16>>1)-0x8000;
> 
> The source_x0 quantity is already in the source co-ordinate space, so it
> just needs converting to 16.16 format:

Ooh, good call.
Here's my framework, for the record.

Your scaler is faster than mine in most cases on A9. Mine is better for small images on A8.

Overall, your solution is faster, and I see no reason to change it, particularly as it's already integrated.

Curiosities:
  * For the test-case I was using at the time, the VMUL/VMLA sequence ("a*(1-w)+c*w") was faster than performing a single VMUL with a VSUB ("(a-c)*w+c"). I suspect that this isn't the case in general.
  * My solution is really quick if no scaling is required. This isn't useful, but it's interesting to note.
  * Your solution is really good at big images (compared to mine). I suspect that your core algorithm is better optimized than mine, but the function call overhead dominates for small images.
  * I use a slightly different method for up-scaling than you, particularly around the borders. I take source samples to be points, but you take th I don't much about raster graphics, so I don't know which is considered correct. (A web search gave me several different answeres.)
Attachment #519688 - Attachment is obsolete: true
> around the borders. I take source samples to be points, but you take th [...]

... them to be areas. At least, that's how I understand it. It only makes a difference at the borders. (Your solution copes with it by handling negative co-ordinates sensibly.)

[Sorry, something went wrong in my final edit somehow.]
Timothy, I'm assigning this to you because you've already done the integration work and I don't yet have a build environment set up for the browser. Let me know if this is a problem.
Assignee: Jacob.Bramley → tterribe
Oh, wow. I was pre-loading the next block for each source element on each iteration. Curious to see why my routine was terribly slow on some tests on A9, but not on A8, I made it pre-load two blocks ahead. Some particurly bad tests doubled in speed!

The change was trivial. I'll attach a new version and some new results once the benchmark has finished running.
Includes PLD tweak.
Attachment #513455 - Attachment is obsolete: true
Attachment #513456 - Attachment is obsolete: true
Attachment #513463 - Attachment is obsolete: true
Attachment #524163 - Attachment is obsolete: true
Here are also the straightforward two-pass yuv->rgb neon and bilinear scalers added for reference (first pass performs unscaled conversion, second pass performs scaling).

The code should work, though I did not check it carefully for the corner cases yet, so it can be considered alpha quality. No padding of images is required which is about the only benefit for the potential users. Due to doing scaling in RGB colorspace, bilinear interpolation is performed for all color components (plus unused 'alpha' channel) without saving time on skipping it for chroma planes. So the performance is not that great. Unscaled YUV->RGB conversion needs some tuning for quality and performance. Also bilinear filter uses pixman implementation here and probably after sacrificing some precision it can be done with a bit less instructions. Still I doubt that it will be sufficient to catch up with the other implementations even for large scale factors.

A benchmark from ARM Cortex-A8:

Converting:  ./samples/sample-480x360-pad16.yv12
 Scale factor:  0.5
  Method: [Bilinear][derf]             Average time:   0.513ms
  Method: [Bilinear][jbramley][vtbl]   Average time:   0.464ms
  Method: [Bilinear][siarhei][twopass] Average time:   1.286ms
  Method: [Nearest][siarhei][twopass]  Average time:   0.583ms
Scale factor:  1.5
  Method: [Bilinear][derf]             Average time:   3.807ms
  Method: [Bilinear][jbramley][vtbl]   Average time:   3.650ms
  Method: [Bilinear][siarhei][twopass] Average time:   5.435ms
  Method: [Nearest][siarhei][twopass]  Average time:   2.040ms
 Scale factor:  2.0
  Method: [Bilinear][derf]             Average time:   6.712ms
  Method: [Bilinear][jbramley][vtbl]   Average time:   6.447ms
  Method: [Bilinear][siarhei][twopass] Average time:   8.966ms
  Method: [Nearest][siarhei][twopass]  Average time:   2.939ms


PS. Seems like jbramley's implementation does not have a special handling for odd sized 'src_left'.
(In reply to comment #53)

> Here are also the straightforward two-pass yuv->rgb neon and bilinear scalers
> added for reference (first pass performs unscaled conversion, second pass
> performs scaling).

An interesting comparison, thanks!

> PS. Seems like jbramley's implementation does not have a special handling for
> odd sized 'src_left'.

Thanks for spotting that! I had forgetten about that case, to be honest. It could be incorporated if it's important.
(In reply to comment #53)
> Still I doubt that it will be sufficient to
> catch up with the other implementations even for large scale factors.

Actually, yours implicitly scales all channels, as it scales after conversion. The other implementations only do interpolation on Y, so your two-pass routine isn't too bad at all for large images. This isn't bad at all, considering that it has to do many more memory accesses!

Yes, a one-pass rountine could probably do it faster, but there are probably more obvious optimization targets for now. (Scaling down by less than 1/2 springs to mind, as does handling of other YCbCr formats.)
(In reply to comment #50)
> Timothy, I'm assigning this to you because you've already done the integration
> work and I don't yet have a build environment set up for the browser. Let me
> know if this is a problem.

There's no way I have enough time to get this ready by April 13th, which is at least the planned deadline for getting stuff into the next fennec release, with the new, faster release cycle.

(In reply to comment #49)
> > around the borders. I take source samples to be points, but you take th [...]
> 
> ... them to be areas. At least, that's how I understand it. It only makes a
> difference at the borders. (Your solution copes with it by handling negative
> co-ordinates sensibly.)
> 
> [Sorry, something went wrong in my final edit somehow.]

I'm not sure what you mean by "take them to be areas"... I'm treating them as discrete samples of a continuous signal. The important thing is to get the handling of the coordinate systems to be consistent, especially between the cases of 4:2:0 vs. 4:2:2 (and 4:4:4) and even and odd pixel offsets. But this matters at a lot more than just the edges.

For example, when upscaling by a factor of 2, you should find that if you don't introduce a -0.25 pixel offset, then you'll be doubling the last pixel in the image. It's easy to see the effect at the edges, but in reality you've introduced a half-pixel shift to the entire image (compared to what you would see if you simply had a screen that was twice as big). You run into similar problems when you scale in the other direction unless you make a similar correction. For very large or very small scale factors, this can be quite significant. You have to handle reading out of bounds at the end of a row either way you do it, so it shouldn't be _that_ much more complex to also handle it at the beginning.
(In reply to comment #56)
> There's no way I have enough time to get this ready by April 13th,
> which is at least the planned deadline for getting stuff into the next
> fennec release, with the new, faster release cycle.

Wow, that deadline's tight! Ok, we're going to miss that one too then, I
think.

> (In reply to comment #49) I'm not sure what you mean by "take them to
> be areas"...

My description was poor. One approach, which I believe is what your
algorithm does, is to take samples from outside the border pixels, like
this:

[Scaling from 4px wide to 8px wide.]
Source:            *       *       *       *  
Sample points:   -   -   -   -   -   -   -   -

The problem there is working out what to interpolate with for the edge
pixels. The only reasonable choice (I think) in this case is to sample
the edge pixels twice, but then you need a special case somewhere,
probably covered using compare-and-VBIT, or something similar. With
hindsight, it's probably not terribly expensive to do.

Another approach, which is what my algorithm does, is to ignore
everything outside the border source samples, like this:

[Scaling from 4px wide to 8px wide.]
Source:             *      *      *      *
Sample points:      -  -  -  -  -  -  -  -

In effect, the first method applies a different (basically
nearest-sample) scaling technique to border pixels, whilst the second
method just ignores them.

The second method was marginally easier to implement, which is why I
chose it, but I don't know if it's considered good practice.
Depending on the type of resampling, it usually makes sense either to mirror the image beyond the boundaries or to use a color to blend with (e.g. black). Mirroring can be a pain to implement though, and unless you're sampling more than one pixel outside the source region, it's no different from your first choice.
(In reply to comment #54)
> > PS. Seems like jbramley's implementation does not have a special handling for
> > odd sized 'src_left'.
> 
> Thanks for spotting that! I had forgetten about that case, to be honest. It
> could be incorporated if it's important.

It's just that for both odd and even 'src_left' argument, your code is using the same values of 'u' and 'v' pointers passed next to the scaling function. Which causes some chroma plane shift relative to luma plane. And the same actually applies to 'src_top'. One of the test would be to perform unscaled yuv->rgb conversion once as a whole image, and once by splitting it to individual tiles and converting such tiles separately. The expected end result should be the same.

(In reply to comment #55)
> Yes, a one-pass rountine could probably do it faster, 

Actually looks like it's not so simple anymore: http://lists.freedesktop.org/archives/pixman/2011-April/001156.html

I still can cheat a bit by reducing 8-bit precision for the fractional part to just 4-bit when sampling and interpolating pixels. This is going to save one instruction per pixel. And one more thing which can be tried is to convert pixel data to planar format before interpolating to get rid of unnecessary calculations with the unused alpha channel (interpolate 3 color components instead of 4). Though this is going to add a bunch of VUZP.8 instructions to the code and give one more advantage to ARM Cortex-A8 which can dual issue at least some of them.

> but there are probably
> more obvious optimization targets for now. (Scaling down by less than 1/2
> springs to mind, as does handling of other YCbCr formats.)

But these cases may be not so interesting in practice. I would primarily care about scaling of youtube 360p video to 480p screen for now (~1.3x scale factor), and it's the main battleground for scalers :)

And scaling down by using less than 1/2 scale factor is probably better done by downscaling 2x to the temporary buffer first (just averaging color from 2x2 blocks) and then using bilinear scaling to the final size. This way the image quality is also going to be better.

(In reply to comment #57)
> The problem there is working out what to interpolate with for the edge
> pixels.

There is a good article about scaling images here: http://www.virtualdub.org/blog/pivot/entry.php?id=86

But I guess, for html5 video in the browser nobody would care if the scaling is actually done wrong from the academic point of view, as long as it is fast enough and looks good ;)
(In reply to comment #57)
> [Scaling from 4px wide to 8px wide.]
> Source:             *      *      *      *
> Sample points:      -  -  -  -  -  -  -  -

The problem is that even though you've doubled the number of pixels here, you've actually scaled by 2.33 (i.e., the step size between samples is 3/7 = 0.43 instead of 0.5). See the article Siarhei linked for more details.

(In reply to comment #59)
> But I guess, for html5 video in the browser nobody would care if the scaling is
> actually done wrong from the academic point of view, as long as it is fast
> enough and looks good ;)

"Nobody would care" is a pretty poor argument once you have a few million users. Some of them will care. Especially when it's not _that_ hard to get things right, and shouldn't have a significant effect on performance.

In any case, one of the reasons my original patch provided reference C code that tried to get all these things right (modulo the source offset bug that Jacob found) was so that it would be easy to test that, whatever algorithm people came up with, the output was identical. That may not hold true for Siarhei's two-pass approach, but it should at least be possible to test against it for shifts, etc.
(In reply to comment #60)
> "Nobody would care" is a pretty poor argument once you have a few million
> users. Some of them will care. Especially when it's not _that_ hard to get
> things right, and shouldn't have a significant effect on performance.

I don't mean to be rude, but currently html5 video playback just sucks on mobile devices. And it's not "some of them", but more like almost everyone can easily notice that the performance is clearly insufficient.

(In reply to comment #56)
> (In reply to comment #50)
> > Timothy, I'm assigning this to you because you've already done the integration
> > work and I don't yet have a build environment set up for the browser. Let me
> > know if this is a problem.
> 
> There's no way I have enough time to get this ready by April 13th, which is at
> least the planned deadline for getting stuff into the next fennec release, with
> the new, faster release cycle.

And this is just adding insult to injury :( I wonder if anybody is looking after html5 video playback in Fennec in general? Because the first priority should be fixing issues like bug 647462. That bug 647462 is really seriously contributing to video playback performance no matter how and where you do the scaling. Image quality aside, the current nearest scaling used in Fennec should be already reasonably fast. But because the overall performance is bad, there are clearly issues and they need to be solved by looking at the bigger picture.

Single pass partial or full bilinear scaling is nice and should definitely improve user experience, being a clear winner by quality/performance metric. But I'm still not convinced that it's a magic pill for solving performance problems. And this is what matters.
(In reply to comment #61)
> (In reply to comment #60)
> > "Nobody would care" is a pretty poor argument once you have a few million
> > users. Some of them will care. Especially when it's not _that_ hard to get
> > things right, and shouldn't have a significant effect on performance.
> I don't mean to be rude, but currently html5 video playback just sucks on
> mobile devices. And it's not "some of them", but more like almost everyone can
> easily notice that the performance is clearly insufficient.

Which is why I've tried to remove every roadblock I can for you guys to getting to actual patches we can commit. No one is more frustrated than I that this is still not solved. I will still advocate for a "good" implementation because, honestly, doing it the right way is just not that hard, but my personal review criteria from comment #36 still apply, and if those are too strict for you, I'm sure you can shop around and find someone else willing to r+ any random patch. But we haven't even gotten that far yet.
 
> > There's no way I have enough time to get this ready by April 13th, which is
> And this is just adding insult to injury :( I wonder if anybody is looking
> after html5 video playback in Fennec in general? Because the first priority

In short, the answer is no. Just because I am a Mozilla employee does not mean that this is my job. It is very much not. I have been trying to help out on this bug precisely because I care, same as you, and because, as far as I can tell, no one else is going to do it.

When I say, "I don't have time," I don't mean, "I'd like to go home at 5 o'clock tonight so I can catch my TV shows," I mean, "I've been awake for 53 and a half hours straight" (technically that was Saturday through Monday; I've had about 18 hours of sleep, total, since then).

> should be fixing issues like bug 647462. That bug 647462 is really seriously
> contributing to video playback performance no matter how and where you do the

I agree, which is why I personally tried to find someone who could actually review that patch for you, instead of randomly picking a familiar name without bothering to ask if they were available and willing to do the review and ignoring the fact that they had no familiarity with the code in question (like me). I suppose I could have just r+'d it for you, but I'd like to avoid disasters like bug 616469, which single-handedly caused bug 620526, bug 637961, bug 640588, bug 641014, and bug 641019. Stuff like that doesn't help anyone.

_I_ don't mean to be rude, but perhaps you'd like to take the chip off your shoulder and get back to actual engineering.
Attachment #518690 - Attachment is obsolete: true
I've rebased my patch on top of current trunk (i.e., with the patches for bug
583958, bug 639415, bug 640588, bug 641014, and bug 641019 removed, since these have now landed). It's now split into two pieces: a reference C implementation in the first patch and the single path that I've accelerated with NEON in the second patch.
I should also note, this version corrects the offset bug jbramley identified in #c46.
Is it ready to go? I'm eager to get this committed so we can make incremental improvements (such as other optimized paths) from there. I can review the NEON side, at least, if you need a reviewer.
Comment on attachment 525216 [details] [diff] [review]
ScaleYCbCr42xToRGB565_BilinearY_Row_NEON

(In reply to comment #67)
> Is it ready to go? I'm eager to get this committed so we can make incremental
> improvements (such as other optimized paths) from there. I can review the NEON
> side, at least, if you need a reviewer.

I am not opposed. I know I said April 13th was the deadline for Fennec 5 above (because that's what I was told), but apparently it branched last night. So there's no need to rush to get things in by the 13th. But I'll set you for reviewer for the NEON piece and try to find someone for the C piece. We can add and/or replace parts of these patches in the future as necessary.
Attachment #525216 - Flags: review?(Jacob.Bramley)
Attachment #525214 - Flags: review?(chris.double)
Attachment #525214 - Flags: review?(chris.double) → review+
Comment on attachment 525216 [details] [diff] [review]
ScaleYCbCr42xToRGB565_BilinearY_Row_NEON

I'd prefer the NEON code to include a long comment to explain the general principle of how the algorithm works, as it may not be obvious at first and could be useful to people writing similar routines in the future (or maintaining this one). The VTBL index stuff, for example, looks pretty peculiar until you realize what VTBL does. On the other hand, it has a similar level of commenting to most of the other NEON routines that I've seen, so I'll leave the choice up to you.

The NEON code is sound, and it should be reasonably well tested by now.

To be clear to others reading this bug, though, I can't review the C++ stuff around it because I've not worked in this area of the browser before and I'm not at all familiar with it.
Attachment #525216 - Flags: review?(Jacob.Bramley) → review+
Updated patch for check-in. Carrying forward r=doublec.
Attachment #525214 - Attachment is obsolete: true
Attachment #526164 - Flags: review+
Updated patch for check-in. I added some additional comments explaining the general idea of the VTBL approach. Carrying forward r=jbramley.
Attachment #525216 - Attachment is obsolete: true
Attachment #526165 - Flags: review+
Keywords: checkin-needed
Whiteboard: [526164 and 526165 ready to land]
(In reply to comment #37)
> (In reply to comment #36)
> > and I'm still hoping to get this, if not into 4.0, at least into a
> > quick follow-up release. Post-4.x, there's been some grumbling
> > about dropping pixman altogether and writing our own thing. That
> > is not my department, however.
> 
> As I see it, some politics might be involved here. The future always looks
> brighter with the new flashy things. And the big GPU plan might also serve as
> an example. Anyway, I hope these grumblers have the answers for the following
> questions readily available:
> 1. What they don't like in pixman?
> 2. How exactly the own thing is going to help?

I think Tim was mixing up cairo and pixman here. Some Mozilla gfx people definitely have a plan to move away from using cairo as our cross-platform graphics API layer. We're suspicious of new flashy things too, so over the last week we've been experimenting with a prototype of an alternative design, and just within the last two days we've got data showing a major performance improvement with the new API on top of D2D vs cairo-d2d on some benchmarks we care about.

For non-GPU rendering, we don't have any specific long-term plans. The short-term plan is to continue using cairo (and pixman of course), underneath the new API. It's very possible that we'll continue doing that indefinitely.
(Further discussion should probably go to dev.platform; we should announce something there soon.)
http://hg.mozilla.org/mozilla-central/rev/e957f873a565
http://hg.mozilla.org/mozilla-central/rev/e05cdd49d004

(If there's more work to be done, it seems like it would be better done in new bugs.)
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Whiteboard: [526164 and 526165 ready to land]
Target Milestone: --- → mozilla6
The patches apparently have no effect on Fennec (old scaling method is used in all cases there). So benchmarking was done with just a "desktop" variant of Firefox.

Based on these profiling results, new scaling method works somewhat slower, but definitely provides better image quality. An obvious performance problem is the redundant copy which can be observed with the new method. So it got changed from "unscaled YUV-RGB conversion + scaling RGB->RGB" to "scaling YUV->RGB + RGB->RGB copy". Even without taking this redundant copy into account, higher quality YUV->RGB scaling is still a bit slower than old code (a naive full image YUV->RGB conversion followed by full image nearest scaling).

(In reply to comment #74)
> http://hg.mozilla.org/mozilla-central/rev/e957f873a565
> http://hg.mozilla.org/mozilla-central/rev/e05cdd49d004
> 
> (If there's more work to be done, it seems like it would be better done in new
> bugs.)

OK. I have added some profiling results here to sum up all the achievements.

So is this bug FIXED exactly as originally intended? Comment 39 did not provide much clarity regarding the purpose of this particular bug, so I'm asking for the final confirmation. I just want to be sure that the people around here know what's going on, what they are doing and what they are going to do next. To keep focused on practical html5 video playback improvements for Fennec.
(In reply to comment #75)
> Created attachment 526576 [details]
> oprofile-logs-for-bug-634557.txt (before and after patches got applied)
Thanks for running these. This is useful information.

> The patches apparently have no effect on Fennec (old scaling method is used in
> all cases there). So benchmarking was done with just a "desktop" variant of
> Firefox.

We're still missing a piece that forwards the scaling hint from the chrome to the content process. Last time I talked to stuart about it, he said that cjones was going to handle that part, but I haven't heard anything about it in a while, so I'll try to track down what the story is there (but see below). However, I believe Fennec will use the scaling path if the video has a non-square pixel aspect ratio (which quite a lot of content does). At least, that's what I was using to test against.

> RGB->RGB copy". Even without taking this redundant copy into account, higher
> quality YUV->RGB scaling is still a bit slower than old code (a naive full
> image YUV->RGB conversion followed by full image nearest scaling).

I think we always knew that bilinear was going to be somewhat slower than nearest, however the quality improvement is marked. I notice the Xorg memcpy usage also goes down by a bit with the new patches. Overall it looks like we're about 17% slower? Which is not bad for doing 3 times the amount of work.

> (In reply to comment #74)
> > (If there's more work to be done, it seems like it would be better done in new
> > bugs.)
> 
> OK. I have added some profiling results here to sum up all the achievements.
> 
> So is this bug FIXED exactly as originally intended? Comment 39 did not provide

The title of the bug says "Implement efficient scaling YUV-to-RGB conversions for ARM", and we now have at least one ARM-accelerated scaling converter. It's not so much that I'm going to declare "victory!" but I do agree additional enhancements should be in a new bug for organizational purposes. I'll open one.

> much clarity regarding the purpose of this particular bug, so I'm asking for
> the final confirmation. I just want to be sure that the people around here know
> what's going on, what they are doing and what they are going to do next. To
> keep focused on practical html5 video playback improvements for Fennec.

And thanks for that. Part of the difficulties in eliminating the extra copy during composition is that in Fennec this actually happens in a separate process from the YUV-to-RGB conversion. One approach proposed at the Platform Work Week last week, which will also help with EGL layers, is to have libtheora/libvpx decode directly into a shared memory buffer in the content process, and have the chrome process do both the conversion and scaling, which could then be done in-place during composition (this would also eliminate the need to forward scaling hints from chrome to content, and avoid the case where these hints are out-of-date by the time the image actually shows up in the chrome process, e.g., if the user is actively zooming, causing the image to get scaled again). I've started the modifications to libtheora to make this possible, and discussed the required API changes with the VP8 folks last week. Those changes should also probably be tracked in separate bugs.
(In reply to comment #72)
> I think Tim was mixing up cairo and pixman here. Some Mozilla gfx people

By the way, I apologize if my ignorance here caused any consternation. This just proves my point that I am not the one who makes these decisions, nor, apparently, should I be. I know just enough about what's going on in #gfx to be dangerous.

> For non-GPU rendering, we don't have any specific long-term plans. The
> short-term plan is to continue using cairo (and pixman of course), underneath
> the new API. It's very possible that we'll continue doing that indefinitely.

This makes moving these routines into pixman more appealing, but I think we can proceed in parallel with the code here. I.e., at some future point when a merged, in-tree pixman provides everything we need, we can remove all of this code (whether pixman uses a version of this code or something else is up to you). The new rapid release model should make upgrading pixman easier for you, since there shouldn't be any more "freeze the world for months on end" periods.
Blocks: 650651
(In reply to comment #76)
> not so much that I'm going to declare "victory!" but I do agree additional
> enhancements should be in a new bug for organizational purposes. I'll open one.

I opened bug 650651 for this purpose.
(In reply to comment #76)
> We're still missing a piece that forwards the scaling hint from the chrome to
> the content process. Last time I talked to stuart about it, he said that cjones
> was going to handle that part, but I haven't heard anything about it in a
> while, so I'll try to track down what the story is there (but see below).

Does it make sense to have bug(s) in bugzilla to track this and similar activity? Or maybe everything is already there?

> I think we always knew that bilinear was going to be somewhat slower than
> nearest, however the quality improvement is marked.

Actually it was not so certain. With the strange redundant conversions happening in the graphics pipeline all over the place earlier, it was quite possible to hit some non-neon code path in pixman and get a real slide show. Doing bilinear scaling in a different place could possibly change the graphics pipeline a bit and avoid hitting some non-optimized parts of code as a side effect.

That's why I insisted on upgrading pixman first. The older version of pixman just had less optimizations, and some of them got added to pixman later to specifically address the html5 video related operations such as those from bug 641196. Yes, it may be not so practical to optimize something that should have not been used in Fennec the first place. But these operations still look generic enough to justify at least having some optimizations for them. Additionally, there are html5 games which also need some related optimizations.

> I notice the Xorg memcpy usage also goes down by a bit with the new
> patches. Overall it looks like we're about 17% slower? Which is not
> bad for doing 3 times the amount of work.

Yes, I also noticed that. The problem is that this video is exactly on the borderline of perfect playback on this hardware: firefox process uses a bit less than 100% of one CPU core, and Xorg uses a tiny bit of another. So a small fraction of frames is still dropped occasionally, with more frames dropped for the new scaling method. Xorg memcpy usage is quite likely correlated with the actual number of frames shown on screen. And the contribution of bilinear scaling to CPU usage may probably depend on whether the frames are dropped before or after scaling.

The other thing about this particular hardware is that it has really slow memory. It's a pre-production pandaboard with its memory clocked at half of the normal frequency. This board is mostly convenient for one thing: highlight the badness of any cache unfriendly code :)

The relative results may be different on some other hardware, but at least the odds are in favor of bilinear scaling in such setup.

> The title of the bug says "Implement efficient scaling YUV-to-RGB conversions
> for ARM", and we now have at least one ARM-accelerated scaling converter. It's
> not so much that I'm going to declare "victory!" but I do agree additional
> enhancements should be in a new bug for organizational purposes. I'll open
> one.

Agreed. This scaling converter is quite impressive, taking advantage of the known padding/alignment properties of the buffers with the decoded video. Such tricks may be harder to implement in a general purpose library though (not that it's impossible, but handling all the corner cases can be quite a pain).

At least another good thing is that I had another look at RGB bilinear scaling in pixman and among other things, it can be done ~10% faster by just relaxing interpolation precision. This change is not bitexact and may require additional verification/testing, but it may be beneficial or even critical for smooth bilinear pinch zoom in Fennec.
(In reply to comment #79)
> Does it make sense to have bug(s) in bugzilla to track this and similar
> activity? Or maybe everything is already there?

I asked cjones to open a bug for it, which he apparently file as bug 651000.
Depends on: 671818
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: