Closed Bug 1492519 (opened: last year; closed: 5 months ago)

2.98 - 133.72% sessionrestore / tresize / ts_paint / ts_paint_webext (linux64, linux64-qr, windows10-64, windows10-64-qr, windows7-32) regression on push 541a1ba5f55a5119333b9f7de349e5d3e8482bf8 (Tue Sep 18 2018)

Categories

(Testing :: Talos, defect, P1)

Tracking

(firefox-esr60 unaffected, firefox62 wontfix, firefox63 wontfix, firefox64 ?)

RESOLVED WORKSFORME
mozilla64

People

(Reporter: jmaher, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression, talos-regression)

Attachments

(1 file)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=541a1ba5f55a5119333b9f7de349e5d3e8482bf8

You are the author of one of the patches included in that push, and we need your help to address this regression.

Regressions:

134%  ts_paint linux64-qr opt e10s stylo                365.25 -> 853.67
133%  ts_paint_webext linux64-qr opt e10s stylo         366.83 -> 854.17
 82%  ts_paint linux64 opt e10s stylo                   239.08 -> 435.92
 82%  ts_paint_webext linux64 opt e10s stylo            240.75 -> 437.75
 81%  ts_paint linux64 pgo e10s stylo                   227.50 -> 412.17
 81%  ts_paint_webext linux64 pgo e10s stylo            227.33 -> 411.75
 28%  tresize windows10-64 pgo e10s stylo               7.01 -> 9.01
 26%  tresize windows10-64 opt e10s stylo               7.49 -> 9.44
 23%  tresize windows7-32 opt e10s stylo                8.13 -> 10.00
 21%  tresize windows7-32 pgo e10s stylo                8.11 -> 9.82
 17%  sessionrestore linux64-qr opt e10s stylo          728.00 -> 852.00
  7%  sessionrestore windows7-32 opt e10s stylo         291.33 -> 311.58
  6%  sessionrestore windows7-32 pgo e10s stylo         270.54 -> 287.50
  5%  sessionrestore windows10-64 pgo e10s stylo        263.88 -> 277.67
  5%  sessionrestore windows10-64 opt e10s stylo        284.96 -> 299.58
  3%  sessionrestore windows10-64-qr opt e10s stylo     400.33 -> 412.25
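For reference, those percentages are just the relative change between the before and after values in each row. A quick sketch to sanity-check the table (plain arithmetic, not the actual Talos/Perfherder code):

```javascript
// Recompute a regression percentage from the before/after values in the
// table above. Plain arithmetic, not the real Perfherder implementation.
function regressionPercent(before, after) {
  return ((after - before) / before) * 100;
}

// ts_paint linux64-qr opt: 365.25 -> 853.67
console.log(regressionPercent(365.25, 853.67).toFixed(1)); // prints 133.7
```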


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=15979

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:dao, I see that you landed the changes from bug 1485629, can you look at these regressions and help explain them or fix them?

I will get some Gecko profiles in the next hour or so.
Component: General → Theme
Flags: needinfo?(dao+bmo)
Product: Testing → Firefox
This makes no sense to me. Do these tests run with lightweightThemes.selectedThemeID != "default-theme@mozilla.org"? Also, tresize isn't even in the startup path, is it?
Flags: needinfo?(dao+bmo)
Any ideas, Florian?
Flags: needinfo?(florian)
Priority: -- → P1
Looking at the profiles from https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&group_state=expanded&searchStr=talos&fromchange=f22f793f7cd209fedf853090436af56931a3ced2&tochange=1432c27ce2927994faea1f9af34c70f806499c4d

On the Linux ts_paint profile, the 'firstPaint' point is somehow taken right after the blank window gets painted when the early blank window is enabled, and after the browser UI gets painted when the blank window is disabled. So that explains the regression.

On the Windows 10 x64 QuantumRender opt ts_paint profiles, the 'firstPaint' marker is after the first time we have painted the browser UI. These profiles don't show us displaying the early blank window at all. I'm not sure why. My current best guess is that we start these runs with a profile that has an empty xulStore, so we return early at https://searchfox.org/mozilla-central/rev/0b8ed772d24605d7cb44c1af6d59e4ca023bd5f5/browser/components/nsBrowserGlue.js#323-325
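That early-return guess can be sketched as follows. The `xulStore.getValue` signature follows the real XULStore service, but the attribute names and the exact condition are assumptions for illustration, not the actual nsBrowserGlue.js code:

```javascript
// Sketch of the early return described above: with a fresh (empty) xulStore
// there is no persisted window geometry, so the early blank window is
// skipped entirely. Attribute names and condition are illustrative.
const BROWSER_DOC = "chrome://browser/content/browser.xul";

function shouldShowEarlyBlankWindow(xulStore) {
  const width = xulStore.getValue(BROWSER_DOC, "main-window", "width");
  const height = xulStore.getValue(BROWSER_DOC, "main-window", "height");
  return Boolean(width && height); // empty store => bail out early
}

// A brand-new Talos profile starts with an empty store:
const emptyStore = { getValue: () => "" };
console.log(shouldShowEarlyBlankWindow(emptyStore)); // prints false
```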


On an unrelated note, the Linux startup profile I looked at (https://perfht.ml/2MQeyi5) has pretty obvious (unrelated) issues that may be interesting to look into:
- Each time we start a content process, the main thread is blocked in gfxFcPlatformFontList::ReadSystemFontList for more than 60ms in libfontconfig.so code. I wonder if there may be fc-cache issues similar to what we observed in bug 1443437.
- the first paint blocks the main thread for 457ms! In the meantime, we spend a lot of time on the Renderer thread in i965_dri.so, so we may have video driver issues.

I haven't explored yet the tresize and sessionrestore profiles, but the regression there is small enough that I would be (pleasantly!) surprised if the profiles turned out to show the cause of the regression.
Flags: needinfo?(florian)
(In reply to Florian Quèze [:florian] from comment #5)

> I haven't explored yet the tresize and sessionrestore profiles

The sessionrestore tests measure the time between the sessionRestoreInit marker and when we are done restoring the session. This is IMO a bad test because it measures the length of something that happens asynchronously and is affected by whatever unrelated code runs in the middle. Showing the early blank window moves some processing before sessionRestoreInit that would otherwise happen in the middle of what we are measuring.

The interesting question here is: why is the patch causing a difference? It turns out with Dão's patch applied, we no longer show the early blank window when running the sessionrestore test. This means the value of lightweightThemes.selectedThemeID is not "default-theme@mozilla.org" when running the test. And indeed, that pref is initialized to "" at https://searchfox.org/mozilla-central/rev/0b8ed772d24605d7cb44c1af6d59e4ca023bd5f5/testing/profiles/perf/user.js#67-69
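The fix that eventually landed amounts to changing that default in the perf profile's user.js. The sketch below follows the commit message; the surrounding lines of the real testing/profiles/perf/user.js are not reproduced here:

```javascript
// testing/profiles/perf/user.js (sketch, not the full file).
// Before: the perf profile emptied the theme pref, which made startup take
// a different path than real profiles and defeated the early blank window:
//   user_pref("lightweightThemes.selectedThemeID", "");
// After: match the product default so Talos exercises the normal path:
user_pref("lightweightThemes.selectedThemeID", "default-theme@mozilla.org");
```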
(In reply to Florian Quèze [:florian] from comment #6)
> The interesting question here is: why is the patch causing a difference? It
> turns out with Dão's patch applied, we no longer show the early blank window
> when running the sessionrestore test. This means the value of
> lightweightThemes.selectedThemeID is not "default-theme@mozilla.org" when
> running the test. And indeed, that pref is initialized to "" at
> https://searchfox.org/mozilla-central/rev/
> 0b8ed772d24605d7cb44c1af6d59e4ca023bd5f5/testing/profiles/perf/user.js#67-69

Ouch! Nice find. I guess we should fix that in user.js.
Component: Theme → Talos
Product: Firefox → Testing
Blocks: 1372694
Attached patch: patch (Splinter Review)
Assignee: nobody → dao+bmo
Status: NEW → ASSIGNED
Attachment #9010949 - Flags: review?(aswan)
(In reply to Dão Gottwald [::dao] from comment #9)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=9dd9da1aa221d0ae4f8c966af68c728d30676d22

No clue what the test_TelemetrySession.js failure on Windows 10 is about. Hopefully it's unrelated?
(In reply to Florian Quèze [:florian] from comment #5)

> I haven't explored yet the tresize and sessionrestore profiles, but the
> regression there is small enough that I would be (pleasantly!) surprised if
> the profiles turned out to show the cause of the regression.

As I expected, I haven't found any usable information in the tresize profiles.
Attachment #9010949 - Flags: review?(aswan) → review+
(In reply to Dão Gottwald [::dao] from comment #11)
> (In reply to Dão Gottwald [::dao] from comment #9)
> > https://treeherder.mozilla.org/#/jobs?repo=try&revision=9dd9da1aa221d0ae4f8c966af68c728d30676d22
> 
> No clue what the test_TelemetrySession.js failure on Windows 10 is about.
> Hopefully it's unrelated?

This is green now: https://treeherder.mozilla.org/#/jobs?repo=try&revision=26778046d1b3396a310de6637d0f51e01c2856a8
Keywords: checkin-needed
Pushed by btara@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f2b2b043c046
lightweightThemes.selectedThemeID default should be "default-theme@mozilla.org" rather than "". r=aswan
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/f2b2b043c046
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Backout by aciure@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/ce24611f29dc
Backed out 1 changesets for tps perma failures a=backout
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Andrei Ciure[:andrei_ciure_] from comment #17)
> Backed out 1 changesets (bug 1492519) for tps perma failures

Looks like this only occurs on Linux x64 QuantumRender opt. I have no idea why that would be the case. Unassigning for now...
Assignee: dao+bmo → nobody
Flags: needinfo?(dao+bmo)
::dao, the original regressions still exist on more platforms than linux64-qr; if you are not working on this, can you find someone else to do so, or recommend backing out or accepting these regressions?
Flags: needinfo?(dao+bmo)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #19)
> ::dao, the original regressions still exist on more than linux64-qr; if you
> are not working on this can you find someone else to do that or recommend
> backing out or accepting these regressions?

This is a Talos glitch due to a wrongly set preference. There is no regression in the product, therefore a backout doesn't seem like a sensible option to me. Accepting the regression is out of the question too, as we need Talos to run with the right preferences. I need a Talos expert to make sense of the Linux x64 QuantumRender opt failure that made my patch bounce. Are you that expert?
Flags: needinfo?(dao+bmo) → needinfo?(jmaher)
::dao, I could help figure it out. It isn't clear what is wrong: I looked at the failure logs for the tps jobs and I don't see specific errors. If there is a preference we need to set for a specific test, let's do that.
Flags: needinfo?(jmaher) → needinfo?(dao+bmo)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)
> ::dao, I could help figure it out- It isn't clear what is wrong- I looked at
> the failure logs for the tps jobs and I don't see specific errors.  If there
> is a preference we need to set for a specific test, lets do that.

I don't know that there's a preference that we need to set for a specific test. Tests should work fine with lightweightThemes.selectedThemeID == "default-theme@mozilla.org".
Flags: needinfo?(dao+bmo)
Looking into this, I see something unique to the linux64-qr builds:
02:49:52    ERROR -  JavaScript error: moz-extension://9436f915-1fa8-42c7-875f-0d4c7944c580/content/tabswitch-content-process.js, line 18: TypeError: extensionChild is undefined; can't access its "baseURI" property
02:49:52     INFO -  [Parent 1931, Gecko_IOThread] WARNING: pipe error (81): Connection reset by peer: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 363
02:49:52     INFO -  [Parent 1931, Gecko_IOThread] WARNING: pipe error (80): Connection reset by peer: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 363
02:49:54     INFO -  console.error: PushService:
02:49:54     INFO -    clearOriginData: Error clearing origin data:
02:49:54     INFO -    UnknownError
02:49:54     INFO -  console.error: PushService:
02:49:54     INFO -    stateChangeProcessEnqueue: Error transitioning state
02:49:54     INFO -    UnknownError
02:49:54     INFO -  ###!!! [Parent][DispatchAsyncMessage] Error: PBackgroundIDBFactory::Msg_PBackgroundIDBFactoryRequestConstructor Value error: message was deserialized, but contained an illegal value
02:49:54     INFO -  ###!!! [Parent][DispatchAsyncMessage] Error: PBackgroundIDBFactory::Msg_PBackgroundIDBFactoryRequestConstructor Value error: message was deserialized, but contained an illegal value
02:49:54     INFO -  *** UTM:SVC TimerManager:registerTimer called after profile-before-change notification. Ignoring timer registration for id: recipe-client-addon-run


Why is there a JavaScript error? The error appears to come from:
https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/tabswitch/content/tabswitch-content-process.js#18

    let extensionChild = processScript.getExtensionChild(WEBEXTENSION_ID);
    let aboutPageURI = extensionChild.baseURI.resolve(ABOUT_PATH_PATH);


does the default-theme treat extensions differently?
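For what it's worth, a defensive version of that snippet would turn the hard TypeError into a logged warning. This is a hypothetical hardening sketch, not the fix that landed: `processScript` is mocked so it runs standalone, and the id/path constants are illustrative:

```javascript
// Hypothetical hardening of the failing tabswitch snippet: guard against
// getExtensionChild() returning undefined instead of letting the TypeError
// escape. processScript is mocked here so the sketch is self-contained.
const WEBEXTENSION_ID = "tabswitch-talos@mozilla.org"; // illustrative id
const ABOUT_PAGE_PATH = "content/test.html";           // illustrative path

const processScript = {
  // Simulates the suspected race: the extension child may not be
  // registered yet when the content process script runs.
  getExtensionChild(id) {
    return undefined;
  },
};

function resolveAboutPageURI() {
  const extensionChild = processScript.getExtensionChild(WEBEXTENSION_ID);
  if (!extensionChild) {
    // Previously this threw: "TypeError: extensionChild is undefined"
    console.warn(`extension ${WEBEXTENSION_ID} not registered yet`);
    return null;
  }
  return extensionChild.baseURI.resolve(ABOUT_PAGE_PATH);
}

console.log(resolveAboutPageURI()); // prints null instead of throwing
```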
Flags: needinfo?(dao+bmo)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #23)
> looking into this, I see something unique for the linux64-QR builds:
> 02:49:52    ERROR -  JavaScript error:
> moz-extension://9436f915-1fa8-42c7-875f-0d4c7944c580/content/tabswitch-
> content-process.js, line 18: TypeError: extensionChild is undefined; can't
> access its "baseURI" property
[...]
> 
> why is there a javascript error?  the error looks to be:
> https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/
> tabswitch/content/tabswitch-content-process.js#18
> 
>     let extensionChild = processScript.getExtensionChild(WEBEXTENSION_ID);
>     let aboutPageURI = extensionChild.baseURI.resolve(ABOUT_PATH_PATH);
> 
> 
> does the default-theme treat extensions differently?

It shouldn't. aswan, any ideas?
Flags: needinfo?(dao+bmo) → needinfo?(aswan)
It is also odd that this is on linux64-qr only, not win10-qr. Maybe there is something different with the theme in Linux or in QR mode.
The error in comment 24 is coming from the extension used internally by the tps test. I can't see any obvious connection between Dão's patch and that failure, and I can't reproduce the failure locally. Also, the link from comment 17 to the failures isn't working for me; which platforms did this actually fail on? Sorry for my Treeherder ineptness, but can you generate a durable link to one (or more) logs from failed runs?
Flags: needinfo?(aswan) → needinfo?(jmaher)
:aswan, thanks for responding. This passes the tps test on all configs except linux64-qr; it works on win10-qr and on regular linux64.

here is a push on treeherder when it originally landed:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&resultStatus=success,testfailed,busted,exception&searchStr=tps&selectedJob=202100584&revision=f2b2b043c04620257c8e64a1c722b6838331d143

and the log file where it fails:
https://taskcluster-artifacts.net/AuP4VIstRQmX1JRMGyL14Q/0/public/logs/live_backing.log

I have been playing around on try and I am having trouble seeing where extensionChild is undefined: if I add a dump(extensionChild) it shows up as an object, and I can even iterate through all the properties of the object.

Ideally I would have a hacked up solution on try showing the problem narrowed down or more data to help figure out where to look.
Flags: needinfo?(jmaher)
Hm, so that error is happening in the first browser run when it is supposed to just load getInfo.html and immediately exit.
And it looks like that is not actually the cause of the failure here since it also happens in successful runs, eg:
https://treeherder.mozilla.org/logviewer.html#?job_id=204270119&repo=mozilla-inbound&lineNumber=819

And, the browser does actually exit cleanly.  The real test run starts a little later:
https://treeherder.mozilla.org/logviewer.html#?job_id=202100584&repo=mozilla-inbound&lineNumber=880

But it never produces any output and is eventually killed after a timeout.

I don't have a Linux setup handy; a useful step would be to reproduce this locally, see what the browser is actually doing when the harness times out, and see if there's anything relevant in the browser console...
I really don't understand this. I have moved the offending code to a different part of the extension, and with that there is no JavaScript error in the logs, only a timeout. I have some dump statements that print out some status on linux64, but print nothing on linux64-qr.

I agree we need someone with linux machine to look at this locally.
Robert, can you help find someone to investigate further?
Flags: needinfo?(rwood)
I'm not on Linux either. :bebe or :igoldan, are either of you on Linux by any chance?
Flags: needinfo?(rwood)
Flags: needinfo?(igoldan)
Flags: needinfo?(bebe)
I think we both are running Windows.
:igoldan?
Flags: needinfo?(bebe)
(In reply to Florin Strugariu [:Bebe] from comment #32)
> I thing we both are running Win.
> :Igoldan?

Yes, that's right.
Flags: needinfo?(igoldan)
I managed to set up Windows Subsystem for Linux. :jmaher, if it's ok with you, I could pick up this bug for further investigation.
Flags: needinfo?(jmaher)
No, WSL will not solve this; we need to run the code natively.

:ahal, could you help out here?
Flags: needinfo?(jmaher) → needinfo?(ahal)
Yeah, I'll take a look when I get a chance.
Sorry for the delay.

I can't seem to reproduce, but maybe someone can verify that I'm doing it right. I'm running the following from latest central on Fedora 29:

$ MOZ_WEBRENDER=1 MOZ_ACCELERATED=1 ./mach talos-test -a tps

I see the test start and tabs load without any errors. Should I be doing anything else?
Also I see the original reasons for wanting to reproduce this locally are:

1. See browser console output
2. See what is happening with browser visually

But for 1), this output should also be in the stdout (at least I would hope it would be!). And for 2), if the extension isn't loading, then the browser is probably just sitting on the newtab page. So I guess I'm questioning whether we really need to block on reproducing this locally here.

Trying to reproduce it in an interactive task (aka one click loaner) might be an option.
Flags: needinfo?(ahal)
(In reply to Andrew Halberstadt [:ahal] from comment #38)
> Trying to reproduce it in an interactive task (aka one click loaner) might
> be an option.

(though personally I would just print debug with tons of try pushes over attempting that)
:ahal, you need to have this patch applied to your local tree:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f2b2b043c04620257c8e64a1c722b6838331d143

then tps should fail.
Thanks, I missed that it got backed out!

After applying that patch and re-running a build I still can't reproduce. Here's a try push to verify that this didn't get fixed by something else in the meantime:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=84b20b6a9799a7d4640b3ec3c0734e69fdec1a59
It's still an issue. Seeing as I can't reproduce locally, it must be timing or environment related. It's possible that webrender changes the timing of things just enough to cause this to happen.
Status: REOPENED → RESOLVED
Closed: Last year → 8 months ago
Resolution: --- → INACTIVE

This still needs fixing.

Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---

Dave, the perf fix was backed out because of an intermittent failure on Linux 64-bit QR. Unfortunately, we weren't able to reproduce it locally. Could we ask someone else to attempt a reproduction, or reland the fix (maybe the intermittent issue got fixed in the meantime)?

Flags: needinfo?(dave.hunt)

I certainly don't know enough about the tests to provide useful insight here. Perhaps :jmaher has some fresh ideas? I understand that we're unable to reproduce this locally, but it's not clear to me if this can be reproduced by pushing to try.

Flags: needinfo?(dave.hunt) → needinfo?(jmaher)

What can we not reproduce locally, the unittest failure or the performance regression?

Flags: needinfo?(jmaher)

Comment 40 suggests that tps would fail, so I'm guessing that. My understanding is that the performance regression is still present because the fix was backed out due to a failure that we were unable to replicate locally.

Flags: needinfo?(jmaher)

OK, re-reading this bug, I see that the fix for the regressions caused the linux64-qr tps test to fail, and there was no relevant information in the logs. I had no ideas on this test originally, and many people couldn't figure it out. This would need some debugging.

Is this even reproducing in today's browser?
Can someone with native linux run this?

Flags: needinfo?(jmaher)

Florin, I know you now have native Ubuntu. Could you find the time to redo the steps from comment 37 up to comment 42?
We need to reproduce this weird Linux intermittent, so we can address it and reland the performance fix.

Flags: needinfo?(fstrugariu)
Depends on: 1540856
Depends on: 1542030
No longer depends on: 1542030

I think bug 1540856 fixed this. Bug 1540856 comment 7 doesn't look like a straight reversal of comment 0 here; I guess this bug just lingered for so long that many things changed in the meantime.

Status: REOPENED → RESOLVED
Closed: 8 months ago → 5 months ago
Flags: needinfo?(fstrugariu)
Resolution: --- → WORKSFORME