Closed Bug 950113 Opened 6 years ago Closed 6 years ago

canvasmark 5% regression on 10.6 machines

Categories: Core :: Graphics: Layers, defect
Version: 29 Branch
Platform: All, macOS
Priority: Not set

Tracking:
Status: VERIFIED FIXED
Target Milestone: mozilla29
Tracking Status:
  firefox28 --- unaffected
  firefox29 + fixed

People: (Reporter: jmaher, Assigned: bjacob)

Keywords: perf, regression
Whiteboard: [talos_regression]

Attachments: (1 file)

:bjacob, could you weigh in here and help us understand whether this is related to bug 897452?
Flags: needinfo?(bjacob)
Many thanks for catching this 5% regression. I'll make sure to find some time this week to look into it.

Could you please indicate how to run this performance test locally?
Flags: needinfo?(bjacob)
The simple way would be to run talos locally:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

./talos -a tcanvasmark -e <path_to_firefox> --results_url canvasmark.out --datazilla-url canvasmark.json

likewise you could run it online at:
http://www.kevs3d.co.uk/dev/canvasmark/
(In reply to Benoit Jacob [:bjacob] from comment #2)
> Could you please indicate how to run this performance test locally?

"./mach talos-test
Version: unspecified → 29 Branch
(In reply to Matt Brubeck (:mbrubeck) from comment #4)
> "./mach talos-test

Sorry, I posted this comment accidentally.  Please ignore it.
Could you expand on how this will impact our users, and on its severity, so we can determine whether this is something we need to track?
Flags: needinfo?(mbrubeck)
This will be investigated before it has a chance to hit users. Thanks to it having been filed early and having one clear suspected cause, it should really be fixed in time; the only reason it hasn't been already is bad timing: we're having a gfx work week here in Toronto, and then there will be the holidays. But even after that, we'll have a few weeks before it hits even Aurora. It's impossible to tell how this regression would affect users before investigating closely what regressed in there.
Flags: needinfo?(mbrubeck)
Tracking but would like an assignee for this
Flags: needinfo?(bjacob)
I'll take this, see if I can reproduce. Please allow a few days.
Assignee: nobody → bjacob
Flags: needinfo?(bjacob)
I have tried very hard to reproduce this on Linux with GL layers and Skia, which should be quite close to Mac, but I couldn't reproduce any performance impact. I tried with both Skia/GL and Skia/software canvas.

So I pushed different stages of the implicated changeset range to tryserver, and am waiting for results. Note that I had to apply the patch from bug 903274 to get talos to work locally, so, to be safe, I pushed it on my try pushes too.

"Last good changeset" in the above regression range i.e. changeset just before the first ptexture patches:
https://tbpl.mozilla.org/?tree=Try&rev=ea1df9a297af

After the two "preliminary" patches for PTexture, before the actual core PTexture patches:
https://tbpl.mozilla.org/?tree=Try&rev=8e4b67131669

After the core PTexture patches:
https://tbpl.mozilla.org/?tree=Try&rev=46874ab8303c

After the post-PTexture cleanup patches:
https://tbpl.mozilla.org/?tree=Try&rev=0d95a688d49b

After the GLContext cleanup patches from bug 942502:
https://tbpl.mozilla.org/?tree=Try&rev=5185e8c669b1

After the GLContext cleanup patches from bug 942506:
https://tbpl.mozilla.org/?tree=Try&rev=6902835feb8f
Argh, tcanvasmark wasn't part of the 'other' talos tier, as I thought it was. So my above try pushes are useless.

New try pushes with -t all, and too bad for tryserver compute cycles:

"Last good changeset" in the above regression range i.e. changeset just before the first ptexture patches:
https://tbpl.mozilla.org/?tree=Try&rev=a6228f254790

After the two "preliminary" patches for PTexture, before the actual core PTexture patches:
https://tbpl.mozilla.org/?tree=Try&rev=50d0995efa4b

After the core PTexture patches:
https://tbpl.mozilla.org/?tree=Try&rev=37b0fb43a865

After the post-PTexture cleanup patches:
https://tbpl.mozilla.org/?tree=Try&rev=192d6afe8768

After the GLContext cleanup patches from bug 942502:
https://tbpl.mozilla.org/?tree=Try&rev=dd4641e7de68

After the GLContext cleanup patches from bug 942506:
https://tbpl.mozilla.org/?tree=Try&rev=0f846c981258
OK, so this tells us:

 1. That the regression is caused by one of the PTexture patches, but not the two "prerequisite" ones that landed first. We unfortunately don't know more precisely yet as some intermediate csets don't build.

 2. That while most canvasmark tests are affected to some extent, this is particularly visible in the "Arena5" test.

 3. That Mac OSX 10.8 is not affected; we also already knew (comment 10) that Linux with GL layers is unaffected. So far, this only affects "old" Mac OSX: at least 10.6 (we don't know about 10.7), but not 10.8.
Blocks: PTexture
It's probable that this is caused by PTexture because, iirc, in this scenario we use SharedTextureClientOGL, whose data is externally owned by the producer. When a TextureClient's data is not deleted by the TextureClient/Host, we can't do the usual asynchronous destruction steps (that is, the client asynchronously tells the host to destroy the texture). Instead we send a synchronous call to the host and tell it to abandon all references to the texture, so that the client can safely destroy it. I need to check, but I think that before PTexture, the TextureClient/Host pair was just not doing anything about the texture's deletion, which is prone to races since the producer would just decide to delete the texture without knowing whether it was being used by the compositor; but somehow it was working.

It looks very much like the performance regression is due to the added synchronous call. If so, I see two solutions:
 * The easy and unsafe way: just pretend we don't need to do the synchronous destruction of the TextureClient/Host, and assume that the producer is magically smart enough and that we won't race (that's going back to the previous behavior that was probably race-prone but was working somehow).
 * The right (I think) way: make it possible to have asynchronous destruction when we deallocate textures on the client side, where the client asynchronously notifies the host that it should lose its references to the texture, and the host sends an asynchronous notification back to the client so that the client side knows it can delete or reuse the shared texture data.

The other place where we use the synchronous destruction mode is B2G's hardware-decoded video playback, so I suppose that if we go for the second option it'll make it more efficient there too.
We can also go for solution 1 (since the patch would be rather simple) while waiting for solution 2, which would require a bit more thinking and testing.

I've been wanting to implement solution 2 for some time, but I wanted to wait and see whether we had actual performance issues before doing it.
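
To make solution 2 more concrete, here is a minimal, self-contained sketch of the asynchronous handshake in plain C++. All names below (HostChannel, TextureClientSketch, SendRemoveTextureAsync) are hypothetical; this is not the actual Gecko TextureClient/TextureHost/IPDL API:

  // Hypothetical sketch of asynchronous texture destruction (solution 2).
  #include <functional>
  #include <memory>
  #include <utility>

  struct SharedTextureData { /* externally-owned texture memory */ };

  // Stand-in for the IPC channel to the compositor side. A real implementation
  // would post the message to the compositor and run the callback when the
  // reply arrives; here it just runs the callback immediately.
  class HostChannel {
  public:
    void SendRemoveTextureAsync(int textureId, std::function<void()> onHostDetached) {
      (void)textureId;
      onHostDetached();
    }
  };

  class TextureClientSketch {
  public:
    TextureClientSketch(HostChannel* channel, int id, std::shared_ptr<SharedTextureData> data)
      : mChannel(channel), mId(id), mData(std::move(data)) {}

    // Instead of a blocking round trip, tell the host to drop its references
    // and release the shared data only once the host has confirmed. The client
    // never blocks, and the producer never deletes data the compositor might
    // still be using.
    void DestroyAsync() {
      auto data = std::move(mData);  // keep data alive until the host detaches
      mChannel->SendRemoveTextureAsync(mId, [data]() {
        // Host has dropped its references; 'data' is released here (or could
        // be handed back to the producer for reuse).
      });
    }

  private:
    HostChannel* mChannel;
    int mId;
    std::shared_ptr<SharedTextureData> mData;
  };

  int main() {
    HostChannel channel;
    TextureClientSketch client(&channel, 1, std::make_shared<SharedTextureData>());
    client.DestroyAsync();
    return 0;
  }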
(In reply to Nicolas Silva [:nical] from comment #13)
> It's probable that this is caused by PTexture

See the tryserver pushes in comment 11: this is now established.

> It looks very much like the performance regression is due to the added
> synchronous call,

Is there a theory explaining why that would affect only Mac OSX 10.6, and not 10.8, and not Linux with GL layers?
(In reply to Benoit Jacob [:bjacob] from comment #14)
> Is there a theory explaining why that would affect only Mac OSX 10.6, and
> not 10.8, 

I don't know; I need to check how canvas works on OSX and see if we use the same kind of textures on 10.6 and 10.8.

> and not Linux with GL layers?

Do we measure that?
OMTC on Linux still uses deprecated textures by default so no PTexture unless you flip the pref.
(In reply to Nicolas Silva [:nical] from comment #15)
> (In reply to Benoit Jacob [:bjacob] from comment #14)
> > and not Linux with GL layers?
> 
> Do we measure that?
> OMTC on Linux still uses deprecated textures by default so no PTexture
> unless you flip the pref.

Hah, thanks, let me try that.
(In reply to Benoit Jacob [:bjacob] from comment #16)
> (In reply to Nicolas Silva [:nical] from comment #15)
> > (In reply to Benoit Jacob [:bjacob] from comment #14)
> > > and not Linux with GL layers?
> > 
> > Do we measure that?
> > OMTC on Linux still uses deprecated textures by default so no PTexture
> > unless you flip the pref.
> 
> Hah, thanks, let me try that.

That still didn't show any significant performance difference with Skia/GL. There was possibly a tiny performance difference with Skia/software, but it's not clear that it's significant, and even if it is, it's much smaller than the one we're talking about here and not of the same nature, so I'll consider that this doesn't reproduce on Linux. See the next comment about what is happening on Mac OSX.
Got some concrete stuff that we can work on. I received much help from :BenWa to use various aspects of the wonderful Gecko profiler --- thanks!

The first thing is, I could reproduce this 5% canvasmark regression locally on a Mac OSX 10.6 machine.

I then profiled the "Arena5" part that exhibits the biggest performance difference (see comment 12) using the Gecko profiler.

Profile before PTexture landed:
http://people.mozilla.org/~bgirard/cleopatra/?customProfile=http://people.mozilla.org/~bjacob/without-ptexture.dms&select=330200,330500#

Profile after PTexture landed:
http://people.mozilla.org/~bgirard/cleopatra/?customProfile=http://people.mozilla.org/~bjacob/with-ptexture.dms&select=391500,391800#

We see that before PTexture landed, the compositor was mostly sleeping and only woke up for no more than 2 ms to composite each frame. In other words, compositing was cheap.

After PTexture landed, we see that the compositor has to work for about 8 ms to composite each frame. In other words, PTexture made compositing much more expensive.

If we then click on the time slices when the compositor is active and look at the stacks, we see that before PTexture landed, most of the time was spent in this stack:

  CompositableParentManager::ReceiveCompositableUpdate
    BufferTextureHost::Updated
      BufferTextureHost::Upload
        TextureImageTextureSourceOGL::Update
          TextureImage::UpdateFromDataSource
            BasicTextureImage::DirectUpdate
              UploadSurfaceToTexture
                UploadImageDataToTexture
                  TexSubImage2DHelper
                    glTexSubImage2D

but after PTexture landed, this stack looks different, and is now much more costly:

  CompositableParentManager::ReceiveCompositableUpdate
    BufferTextureHost::Updated
      BufferTextureHost::Upload
        TextureImageTextureSourceOGL::Update
          TextureImage::UpdateFromDataSource
            BasicTextureImage::DirectUpdate
              UploadSurfaceToTexture
                UploadImageDataToTexture
                  glTexImage2D

In other words, the PTexture landing caused us to do glTexImage2D instead of glTexSubImage2D. Why? Looking at UploadImageDataToTexture's code, this means that we have a new texture every time. And indeed, if we look further in the after-PTexture-landed profile, we see this stack at every frame:

  CompositableParentManager::ReceiveCompositableUpdate
    ImageHost::UseTextureHost
      BufferTextureHost::SetCompositor
        BufferTextureHost::DeallocateDeviceData
          TextureImageTextureSourceOGL::DeallocateDeviceData
            BasicTextureImage::~BasicTextureImage
              glDeleteTextures

So we _are_ deleting a texture at every frame, and the above texSubImage2D-became-texImage2D is only a symptom of that; the real question becomes: why are we destroying a texture at every frame?
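
As an aside on that texSubImage2D-vs-texImage2D difference, here is a generic OpenGL upload idiom (not the actual UploadImageDataToTexture code) sketching why a freshly created texture forces the more expensive call:

  // Generic OpenGL idiom (assumes a current GL context; not Gecko's code):
  // a texture that already has storage of the right size can be updated in
  // place with glTexSubImage2D (cheap); a freshly created texture has no
  // storage yet, so glTexImage2D must (re)allocate it first (expensive).
  #include <GL/gl.h>

  void UploadPixels(GLuint tex, bool isNewTexture,
                    GLsizei width, GLsizei height, const void* pixels)
  {
    glBindTexture(GL_TEXTURE_2D, tex);
    if (isNewTexture) {
      // The path we end up on every frame once the texture is destroyed
      // at each composite: allocate new storage, then upload.
      glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                   GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    } else {
      // The path we were on before PTexture: reuse the existing storage.
      glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                      GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    }
  }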

Looking at the PTexture diff around the functions mentioned in the latter stack, it seems that the most likely candidate is this diff from the Part 4 patch on the PTexture bug 897452 (attachment 8343311 [details] [diff] [review]):


   void
   ImageHost::UseTextureHost(TextureHost* aTexture)
   {
  +  if (mFrontBuffer) {
  +    // XXX - When we implement sharing textures between several compositables
  +    // we will need to not remove the compositor if there is another compositable
  +    // using the texture.
  +    mFrontBuffer->SetCompositor(nullptr);
  +  }
     CompositableHost::UseTextureHost(aTexture);
     mFrontBuffer = aTexture;
   }

Nical, do you know what the fix would be to avoid destroying our textures at every frame? Should UseTextureHost be made idempotent, i.e. early-return doing nothing if aTexture == mFrontBuffer?
Flags: needinfo?(nical.bugzilla)
By the way, comment 18 also suggests an answer to "why does this only reproduce on Mac OSX 10.6?". The answer is that this regression causes us to do more OpenGL work than necessary (destroying, recreating and reallocating textures at every frame). If the system's OpenGL libraries are optimized enough, they can make this a non-issue, but evidently Mac OSX 10.6's OpenGL libraries are not optimized in that way, which makes the issue apparent. I learned a lesson here: not to dismiss performance regressions that seem to affect only some older systems and not some newer systems.
Nice catch! Darn, I don't remember why I ended up clearing the compositor (which is what causes the GL texture to be deallocated). Maybe it was something about the problem we had with GL resources not being deallocated before the widget. After the GLContext fix we did recently, I think it's worth trying to just remove the if branch and what it contains. If this breaks anything, we should look into why and document the fix better than I did.
Flags: needinfo?(nical.bugzilla)
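
For reference, a minimal sketch of the two directions discussed above, based on the snippet quoted earlier (illustrative only; not necessarily identical to the patch that actually landed):

  // Illustrative sketch only, based on the snippet quoted above; not
  // necessarily the exact patch that landed.
  // Option (a): make UseTextureHost idempotent so that receiving the same
  // front buffer again does not clear its compositor (and therefore does not
  // deallocate its GL texture every frame).
  void
  ImageHost::UseTextureHost(TextureHost* aTexture)
  {
    if (aTexture == mFrontBuffer) {
      return;  // same texture as last time: keep its device data alive
    }
    if (mFrontBuffer) {
      mFrontBuffer->SetCompositor(nullptr);
    }
    CompositableHost::UseTextureHost(aTexture);
    mFrontBuffer = aTexture;
  }
  // Option (b), from the previous comment: simply remove the
  // 'mFrontBuffer->SetCompositor(nullptr)' branch altogether.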
As you suggested (thanks!). I'm currently testing this and won't land it unless it fixes the regression, but it'd be useful to get your r+ before it gets late in your timezone.
Attachment #8359855 - Flags: review?(nical.bugzilla)
The patch does fix the regression according to Talos, see this graph:
https://datazilla.mozilla.org/?start=1386499904&stop=1389728145&product=Firefox&repository=Try&os=mac&os_version=OS%20X%2010.6.8&test=tcanvasmark&graph_search=4a1e392f375c&tr_id=4110663&graph=3D%20Rendering%20-%20Maths-%20polygons-%20image%20transforms&x86=false&project=talos

I also verified locally that it removes the above-mentioned oddities in the profiles (i.e. I checked that it no longer causes us to destroy our texture every time, and lets us use glTexSubImage2D instead of glTexImage2D as we should).
I've now verified locally that this patch fixes the regression. The reason it took me a bit longer is that I still had the Gecko profiler enabled, which had a surprisingly large performance impact.
Attachment #8359855 - Flags: review?(nical.bugzilla) → review+
https://hg.mozilla.org/mozilla-central/rev/960a1153f12e
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Just got this from dev-tree-management, so marking VERIFIED.

Improvement: Mozilla-Inbound - CanvasMark - MacOSX 10.6 (rev4) - 5.86% increase
-------------------------------------------------------------------------------
    Previous: avg 5995.458 stddev 15.604 of 12 runs up to revision 347d5704cbe1
    New     : avg 6346.917 stddev 30.185 of 12 runs since revision f0d8f9868ca4
    Change  : +351.458 (5.86% / z=22.524)
    Graph   : http://mzl.la/1iXl2aZ

Changeset range: http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=347d5704cbe1&tochange=f0d8f9868ca4

Changesets:
  * http://hg.mozilla.org/integration/mozilla-inbound/rev/960a1153f12e
    : Benoit Jacob <bjacob@mozilla.com> - Bug 950113 - Avoid destroying our texture everytime we composite a ImageHost - r=nical
    : http://bugzilla.mozilla.org/show_bug.cgi?id=950113

  * http://hg.mozilla.org/integration/mozilla-inbound/rev/f0d8f9868ca4
    : Nicholas Hurley <hurley@todesschaf.org> - Bug 959799 - Fix autogen include guard error. r=mcmanus
    : http://bugzilla.mozilla.org/show_bug.cgi?id=959799

Bugs:
  * http://bugzilla.mozilla.org/show_bug.cgi?id=950113 - canvasmark 5% regression on 10.6 machines
  * http://bugzilla.mozilla.org/show_bug.cgi?id=959799 - Http2HuffmanOutgoing.h:4:9: error: 'mozilla__net__Http2HuffmanOutgoing_h' is used as a header guard here, followed by #define of a different macro [-Werror,-Wheader-guard] #ifndef mozilla__net__Http2HuffmanOutgoing_h
Status: RESOLVED → VERIFIED
thanks for looking into this and fixing it!
Thanks again to you: this bug was a very good example of the value of filing bugs about Talos regressions, even if they seem to affect only one version of one OS. Some things just shouldn't affect performance at all, so if they do, it's a clear sign that a real bug is lurking.