Closed Bug 1314291 Opened 3 years ago Closed 3 years ago

11.82 - 15.05% tresize (linux64) regression on push 20ff6628ea8b (Thu Oct 27 2016)

Categories

(Core :: Graphics, defect, P3)

51 Branch
Unspecified
Linux
defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox52 --- fixed

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression, Whiteboard: [gfx-noted])

Attachments

(1 file)

Talos has detected a Firefox performance regression from push 20ff6628ea8b. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 15%  tresize linux64 opt e10s     38.38 -> 44.15
 12%  tresize linux64 pgo e10s     36.19 -> 40.47


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=3856

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
cc'ing relman as this is a larger regression.

also tpaint has some linux64 regressions here:
== Change summary for alert #3856 (as of October 27 2016 12:10 UTC) ==

Regressions:

 23%  tpaint linux64 opt           311.03 -> 381.19
 21%  tpaint linux64 pgo           257.3 -> 311.61
 15%  tresize linux64 opt e10s     38.38 -> 44.15
 12%  tresize linux64 pgo e10s     36.19 -> 40.47

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=3856

:lsalzman, I see you reviewed this, could you help determine if there is a way to reduce this perf impact or if this is something we need to keep around?
Flags: needinfo?(lsalzman)
This was already a known regression from bug 1286649, due to contention for the X display, which we needed to ensure that X window creation was resilient to errors, otherwise causing a lot of nasty errors on resize. One of the compensatory bugs was bug 1287463, but the fallout from that caused bug 1304637. Given that bug 1287463 was not necessary at all to follow-up on bug 1286649, and given the consequences window repainting bugs that resulted, I believe it is better to in the worst case accept this regression, since otherwise we're stuck with unfixable startup/resize errors and repainting bugs.

I can do some experiments maybe this week to see if I can get rid of some of the contention, but if I can't, I believe we otherwise need to just accept the regressions to ensure that the Linux implementation is sufficiently robust.
Thanks for the information and background on this issue Lee.  Please don't spend too much time on this- ideally something small will be seen and we can reduce or remove these regressions.  I am glad we already know what is causing this.
Has Regression Range: --- → yes
Has STR: --- → yes
Flags: needinfo?(lsalzman)
OS: Unspecified → Linux
Priority: -- → P3
Whiteboard: [gfx-noted]
Version: unspecified → 50 Branch
This makes the PutImage requests unchecked to remove contention caused by the sync versions we were using before, and then finally uses xcb_discard_reply to ensure we don't get any errors from these. This seems to fix both the regression in tpaint and tresize.

Due to the fact that xcb_discard_reply requires libxcb >= 1.6, and out build machines seem to be using an ancient setup with libxcb 1.5 (even Ubuntu 12.04 has libxcb 1.8), I have to use dlsym to work around the availability issue.
Attachment #8807108 - Flags: review?(karlt)
Comment on attachment 8807108 [details] [diff] [review]
use _unchecked xcb PutImage requests in nsShmImage

I don't know how this improves performance with xcb_shm, but removing one of
the two calls in CreateDrawTarget() makes things simpler, so I like this
anyway.

The regressions here were reported on mozilla-inbound.  I assume that uses GL,
and so this code is not involved?

>+  gXCBDiscardReply = (void (*)(xcb_connection_t*, unsigned int))dlsym(RTLD_DEFAULT, "xcb_discard_reply");

Please wrap to stay within 80 columns.

Something like reinterpret_cast<decltype(gXCBDiscardReply)> is an option, if
you like.

It should be easy to consider gXCBDiscardReply in setting gUseShmPixmaps, so
please do that, just in case.

>+    mPutRequest = xcb_copy_area(mConnection, mPixmap, mWindow, mGC,
>+                                0, 0, 0, 0, mSize.width, mSize.height);
>   } else {
>+    mPutRequest = xcb_shm_put_image(mConnection, mWindow, mGC,
>+                                    mSize.width, mSize.height,
>+                                    0, 0, mSize.width, mSize.height,
>+                                    0, 0, mDepth,
>+                                    XCB_IMAGE_FORMAT_Z_PIXMAP, 0,
>+                                    mShmSeg, 0);
>+  }
>+  // The PutImage request may generate an error if the rect falls outside the
>+  // window bounds. Discard the reply to silence these errors.

Looking at ProcShmPutImage() and ProcCopyArea() and their documentation I
don't think an error is generated if the rect falls outside, and so all the
discard_reply code can be dropped.  If mWindow may have been destroyed, then
it is worth keeping the discard_reply, but AIUI, window destruction is
careful to stop compositing first.

If keeping the discard_reply, then please adjust or remove this comment,
because mention of the rect and window bounds is misleading AFAICT.

>+  if (gXCBDiscardReply) {
>+    gXCBDiscardReply(mConnection, mPutRequest.sequence);
>   }
> 
>   // Send a request that returns a response so that we don't have to start a
>   // sync in nsShmImage::CreateDrawTarget to retrieve the result of mPutRequest.

Please make mPutRequest a local variable, or drop it if removing
discard_reply(), as mPutRequest is no longer used elsewhere.
Attachment #8807108 - Flags: review?(karlt) → review+
Pushed by lsalzman@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/36024a692117
use unchecked xcb PutImage requests in nsShmImage. r=karlt
It would appear that due to a mixup in my testing that this change didn't unfortunately entirely compensate for the talos regressions. However, we're still keeping the patch in as it cleans up some unnecessary syncing/error handling within the code.
https://hg.mozilla.org/mozilla-central/rev/36024a692117
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 52
Lee, this bug is resolved and I want to properly annotate the results inside of perfherder.  Do you have additional work you can think of here, or should we accept this regression?
Flags: needinfo?(lsalzman)
(In reply to Joel Maher ( :jmaher) from comment #9)
> Lee, this bug is resolved and I want to properly annotate the results inside
> of perfherder.  Do you have additional work you can think of here, or should
> we accept this regression?

For now, we're just accepting the regression. I did as much as was worth to investigate possible ways to mitigate the regression, but they did not end up doing so, and we need to keep the robustness improvements that came from patch that caused said regression.
Flags: needinfo?(lsalzman)
Thanks Lee!
Component: Untriaged → Graphics
Product: Firefox → Core
Target Milestone: Firefox 52 → ---
Version: 50 Branch → 51 Branch
You need to log in before you can comment on or make changes to this bug.