Open Bug 1394061 Opened 7 years ago Updated 2 years ago

av1 performance regression

Categories

(Core :: Audio/Video: Playback, defect, P3)

People

(Reporter: rillian, Assigned: drno)

The recent update of the av1 reference implementation in third_party/aom to upstream commit id f5bdeac22930ff4c6b219be49c843db35970b918 (bug 1380118) resulted in dropped frames for streams over 1 Mbps, as demonstrated by the demo at https://demo.bitmovin.com/public/firefox/av1/ even on high-end hardware.

A lot of new features have been added to the code recently, so it may just be extra complexity. It's also now returning 16 bit-per-channel image data, even for 8-bit input, so memory bandwidth usage should be higher. Or maybe something is interacting badly with the Firefox playback scheduling.
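
As a rough illustration of the bandwidth point, the sketch below compares the size of one decoded planar frame at 8 and 16 bits per sample. The 4:2:0 layout, the 1280x720 resolution, and the function name are assumptions chosen for the example; they are not taken from the demo stream.

```python
def frame_bytes(width, height, bytes_per_sample):
    """Bytes in one planar 4:2:0 frame: a full-size luma plane
    plus two chroma planes subsampled by 2 in each direction."""
    luma = width * height
    chroma = 2 * ((width + 1) // 2) * ((height + 1) // 2)
    return (luma + chroma) * bytes_per_sample


# Assumed resolution, for illustration only.
width, height = 1280, 720
size_8bit = frame_bytes(width, height, 1)
size_16bit = frame_bytes(width, height, 2)
print(size_8bit, size_16bit, size_16bit / size_8bit)
```

Under these assumptions, every frame handed to the compositor carries twice as many bytes as before, regardless of the bit depth of the source video.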

This bug is about tracking down and resolving the regression so the demo plays smoothly.
David, the update should be merged soon, making it easier to verify this. Could you run your profiler, please, and see if anything stands out vs the 2017 August 25 Firefox Nightly?
Flags: needinfo?(dmajor)
(Fixed typo)

As a first sanity check, I confirmed that I can reproduce the symptoms on my test machine (haven't run the profiler yet) with https://demo.bitmovin.com/public/firefox/av1/ at 1 Mbps.

Nightly 08-25 takes about 9% of my 8-core CPU, so 75% of a core
Nightly 08-28 takes about 15% of my CPU, so a full core and then some
Flags: needinfo?(dmajor)
Huge increase in av1_loop_filter_rows:

Nightly 0825:

xul.dll!av1_loop_filter_frame, 7532
 xul.dll!av1_loop_filter_rows, 7531
  xul.dll!av1_filter_block_plane_non420_ver, 4828
  xul.dll!av1_filter_block_plane_non420_hor, 2676

Nightly 0828:

xul.dll!av1_loop_filter_frame, 33164
 xul.dll!av1_loop_filter_rows, 33163
  xul.dll!av1_filter_block_plane_vert, 16703
  xul.dll!av1_filter_block_plane_horz, 16378
Also a large increase in CreateAndCopyData:

Nightly 0825:

xul.dll!mozilla::VideoData::CreateAndCopyData, 611
 xul.dll!mozilla::VideoData::SetVideoDataToImage, 607
  xul.dll!mozilla::layers::SharedPlanarYCbCrImage::CopyData, 607
   xul.dll!mozilla::layers::UpdateYCbCrTextureClient, 589
    xul.dll!mozilla::layers::MappedYCbCrTextureData::CopyInto, 588
     xul.dll!mozilla::layers::MappedYCbCrChannelData::CopyInto, 588

Nightly 0828:

xul.dll!mozilla::VideoData::CreateAndCopyData, 4907
 xul.dll!mozilla::VideoData::SetVideoDataToImage, 4895
  xul.dll!mozilla::layers::SharedPlanarYCbCrImage::CopyData, 4895
   xul.dll!mozilla::layers::UpdateYCbCrTextureClient, 4882
    xul.dll!mozilla::layers::MappedYCbCrTextureData::CopyInto, 4882
     xul.dll!mozilla::layers::MappedYCbCrChannelData::CopyInto, 4882
And some notable decreases:

av1_cdef_frame decreased from 7106 to 4573
update_boundary_info (#663) decreased from 6043 to negligible

The numbers in the data above are sample counts at 8 kHz for 10 seconds of playback. So about 80k samples represent "a full core" worth of processing. In total I had 72k samples in nightly 0825, and 92k samples in nightly 0828, which more or less agrees with my comment 3.
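
The arithmetic behind that comparison can be sketched as follows. The sample counts and CPU percentages are the ones quoted in this bug; the function and variable names are made up for the example.

```python
SAMPLE_RATE_HZ = 8000  # profiler sampling rate
DURATION_S = 10        # seconds of playback profiled
SAMPLES_PER_CORE = SAMPLE_RATE_HZ * DURATION_S  # one fully busy core


def cores_from_samples(samples):
    """Convert a profiler sample count into 'cores worth' of CPU time."""
    return samples / SAMPLES_PER_CORE


def cores_from_cpu_percent(percent, n_cores=8):
    """Convert whole-machine CPU usage into 'cores worth' of CPU time."""
    return percent / 100 * n_cores


profile_totals = {"0825": 72_000, "0828": 92_000}  # approximate totals
cpu_percent = {"0825": 9, "0828": 15}              # from comment 3

for build, samples in profile_totals.items():
    print(build,
          round(cores_from_samples(samples), 2),
          round(cores_from_cpu_percent(cpu_percent[build]), 2))

# Growth factors for the two hot spots in the profiles above.
loop_filter_growth = 33164 / 7532  # av1_loop_filter_frame
copy_growth = 4907 / 611           # VideoData::CreateAndCopyData
print(round(loop_filter_growth, 1), round(copy_growth, 1))
```

Both estimates put the 0828 build slightly above one core, and the loop filter and the copy grew by roughly 4x and 8x respectively.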
Thanks, David. That's very helpful. It looks like the PARALLEL_DEBLOCK feature disabled SIMD in the loop filter. I've tried to re-enable it in https://aomedia-review.googlesource.com/c/aom/+/19920 but there may be alignment issues.

The CreateAndCopyData spike is on our side. I thought using the mSkip member of YCbCrBuffer::Plane would avoid the overhead of my downsampling loop, but if it did, it didn't help enough. Hopefully native 10/12-bit support (bug 1215089) will give us a fast-path here, but in the meantime I'll look into optimizing what we have.
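
To illustrate the idea behind a skip-based copy, here is a minimal sketch of a strided plane copy that keeps one byte out of each 16-bit sample. It assumes little-endian samples whose values fit in 8 bits, so the low byte carries the whole value. The function and parameter names are hypothetical, loosely modeled on an offset/skip pair, and this is not the actual Gecko code.

```python
import struct


def copy_plane_with_skip(src, width, height, stride, offset=0, skip=0):
    """Copy width x height one-byte samples out of src.

    stride is the source row length in bytes, offset is the first byte
    to read in each row, and skip is the number of bytes to step over
    after each byte that is read.
    """
    step = 1 + skip
    out = bytearray()
    for row in range(height):
        start = row * stride + offset
        out += src[start:start + width * step:step]
    return bytes(out)


# A 2x2 plane of 8-bit values stored in little-endian 16-bit containers.
samples = [16, 128, 235, 64]
src = struct.pack("<4H", *samples)

# Each source row is 2 samples * 2 bytes = 4 bytes; skip the high byte.
dst = copy_plane_with_skip(src, width=2, height=2, stride=4, skip=1)
print(list(dst))
```

A copy like this still walks twice as many source bytes as an 8-bit plane would need, which may be part of why the approach didn't help enough.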
Assignee: nobody → giles
Depends on: 1215089
Depends on: 1413734
Nils, this is a tracking bug for a performance regression from the last update we did. The recent upstream optimizations should help, but we probably still need a 16-bit fast-path. Overlaps with HDR work.
Assignee: giles → drno
Severity: normal → S3