3% win7 e10s svgx talos regression on inbound (v.46) on Jan 12 from push 8e0a8cdc62ad




New Tab Page


(Reporter: jmaher, Assigned: maxim zhilyaev)


({perf, regression})

(Whiteboard: [talos_regression][e10s])


(1 attachment)



2 years ago
Happy times! I am a bit late in filing this regression - in fact, the previous push had a lot of regressions, so it took a while to see that this one was different :)

here is information about what svgx is:

to run it locally:
talos -e <firefox> -a tsvgx --develop

or to run on try:
try: -b o -p win32 -u none -t svgr-e10s --rebuild 5

Here is the compare view:

It shows that the composite tests are regressing.

Comment 1

2 years ago
:oyiptong, can you look into this?  Please ask questions if you have trouble seeing the regression or verifying a potential fix locally or on try.  Also, if this is something we should accept, please explain why it is happening and why we should accept it - this way we have a documented history of it.

Thanks for looking at this!
Flags: needinfo?(oyiptong)
Ok, taking a look.
Flags: needinfo?(oyiptong)

Comment 3

2 years ago
thanks Olivier!  let me know if you need help or have questions!
tracking-e10s: --- → -

Comment 4

2 years ago
:oyiptong, checking in to see if you have any troubles or questions.
Flags: needinfo?(oyiptong)
Assignee: nobody → oyiptong
Flags: needinfo?(oyiptong)
No questions yet. I should be working on this bug tomorrow or so.
So, :jmaher, what does the svgx test do?

I suspect the svgx tests slow down because they open a lot of newtab pages. How are the times measured?

Comment 7

2 years ago
here is a bit about the svg tests:

not too many pages, but specifically the composite tests are failing, for example:

Comment 8

2 years ago
times are measured when mozafterpaint is seen after loading - keep in mind this is an ASAP test where we force the browser to load faster with these prefs:
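For anyone not familiar with the measurement, here is a minimal sketch of the idea - illustrative only, not the actual talos pageloader code, and the listener/logging details are assumptions:

  // Sketch: time from page load to the first MozAfterPaint (illustrative only).
  var start;
  window.addEventListener("load", function () {
    start = window.performance.now();
    window.addEventListener("MozAfterPaint", function onPaint() {
      window.removeEventListener("MozAfterPaint", onPaint);
      var elapsed = window.performance.now() - start;
      dump("paint time: " + elapsed + " ms\n");
    });
  });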
Having problems building on win32 as per bug 1240993. Progress might be slower than expected, though I'm still working on it.
Never mind, it seems the pip issue has been fixed in a later revision.


2 years ago
Assignee: oyiptong → mzhilyaev

Comment 11

2 years ago
this is now on aurora, and I see it for linux64 e10s ts_paint.

Comment 12

2 years ago
jmaher, I do not seem to be able to reproduce the problem on my win7 laptop. The performance does not change noticeably between the baseline and the implicated commit.  Short of shooting try pushes, do you think I can work on the machine instance that surfaces the issue?
Flags: needinfo?(jmaher)

Comment 13

2 years ago
Thanks for asking, Maxim!  There is a good chance that the hardware you are running on won't show the regression, so I think a couple of try pushes would be useful.  This happens about 25% of the time in our regressions.  It is still important to try to understand the issue - many times a simple fix comes from it.

Can you give me the command that you ran to try to reproduce this?

If you push to try, use this:
try: -b o -p win32 -u none -t svgr,svgr-e10s --rebuild-talos 5

Ideally two commits- a baseline, then one with any changes/backouts.
Flags: needinfo?(jmaher)

Comment 14

2 years ago
I ran this on win7 for both base and implicated commits:

talos --develop --executablePath obj-i686-pc-mingw32/dist/bin/firefox.exe --e10s --activeTests tsvgx

No apparent differences in local.json (or local.out).  Offending tests (like composite-scale-rotate-opacity.svg) seem to produce the same results between the baseline and the later commit (+/- minor deviation).
Doing profiling runs against try may also produce useful results: https://wiki.mozilla.org/Buildbot/Talos/Profiling

Comment 16

2 years ago
Reporting current progress on the bug:

After a series of test patches I have identified the code change implicated in the performance degradation.
Note that the implicated change is highly illogical, but I couldn't trigger the regression in any other way.

Below is the relevant chunk of the init function from lxr.mozilla.org/mozilla-central/source/browser/modules/DirectoryLinksProvider.jsm#707:

  init: function DirectoryLinksProvider_init() {
    // ensure remote new tab doesn't go beyond aurora
    if (!AppConstants.RELEASE_BUILD) {
      return Task.spawn(function*() {
        ....  A TON OF IO OPERATIONS  ....

Removal of the RemoteNewTabUtils method calls from the base commit (https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=deaf7c6ca55b) degrades performance.  Results are here: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=deaf7c6ca55b&newProject=try&newRevision=05551d74aaa7&originalSignature=10e252613ca692c02415637069fdc71d612830bc&newSignature=10e252613ca692c02415637069fdc71d612830bc

The performance degradation is not specific to the "tsvgx opt e10s" test.  It becomes noticeable in the specific opacity tests because these tests are extremely short-lived under e10s. However, almost all tests across all suites suffer mild degradation, especially under e10s. This makes me believe that systematic time is being added somehow by DirectoryLinksProvider, either by issuing IO calls or by acting on frecency-change events (which talos tests definitely fire).

My current theory is that RemoteNewTabUtils caused DirectoryLinksProvider to fail to initialize properly, hence it either did not run the IO inside the spawned task or failed to handle history events. Once the RemoteNewTabUtils calls were removed, initialization completed properly, which introduced an extra, systematic time delta because DirectoryLinksProvider became fully functional again.
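A minimal sketch of the failure mode I have in mind is below - the names are hypothetical and this is not the real module code; the point is just that throwing before Task.spawn() means the generator body (and all of its I/O) never runs:

  // Hypothetical illustration of the theory, not DirectoryLinksProvider itself.
  init: function DirectoryLinksProvider_init() {
    // If this reference is broken, init() throws here and the task below
    // never runs - no I/O, no frecency-event bookkeeping.
    RemoteNewTabUtils.someBrokenCall();   // hypothetical failing call

    return Task.spawn(function*() {
      // Removing the broken call above lets this body run again,
      // re-introducing the I/O and a small but systematic time cost.
      yield doExpensiveIO();              // hypothetical placeholder
    });
  },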

My next steps are to try running the talos profiler and to debug DirectoryLinksProvider in place on Windows.

It appears that the profiler does not work with e10s tests:
I ran this: try: -b o -p win32 -u none -t svgr-e10s mozharness: --spsProfile
The e10s job failed, see here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d03a675ef250&selectedJob=16172919

If you have advice on how to run the profiler with e10s, please advise.

Comment 17

2 years ago
:mconley, can you help Maxim with the profiler on e10s?

Maxim, thanks for the great investigation so far!
Flags: needinfo?(mconley)

Comment 18

2 years ago
:mconley volunteered to take a look at the problem.
The description of all the things I've done contains many links to builds, try pushes, and talos results, so I parked it in the Google doc below:

As of now, I am unable to root cause the issue. I seem to be able to trigger the regression by removing lines as described in comment 16, but I have no logical explanation for why it would slow down the tests. If anything, it should speed tests up. In fact, on my local Windows build, I do see about a 3% performance increase after the offending lines are removed.  Hence, reaching for divine powers (i.e. :mconley).

As a side note, talos tests with profiling enabled failed with timeout:

Comment 19

2 years ago
To exemplify the issue, I am including links to try runs with and without the magic patch that removes references to RemoteNewTabUtils inside DirectoryLinksProvider.

The original commit is https://hg.mozilla.org/try/rev/311418dbd2c5 - Fuse aboutNewTabService and RemoteNewTabLocation and return resource URLS r=marcosc 

Here's the try and talos results without the change:

Here's the try run and talos results with the change

As far as I can tell, the regression is present when the magic patch is applied.
Still can't answer why this change would cause a regression.
My wild theory is that the change does make the tests run faster, but then it uncovers some crazy concurrency issue that slower-running tests do not encounter.
With some hackery, I was able to tease some profiles out:

Baseline: https://cleopatra.io/#report=7918d1d0b7de549a888fcc4cd39a00045f1bf466
Regression: https://cleopatra.io/#report=7c9ac843bed630812becdc2d3e313418804e4fd6

Unfortunately, the sampling rate is too low in these, so I'm doing a new push with a 1ms sampling rate - hopefully that doesn't throw off the measurements.
With 1ms sampling rate:

Baseline: https://cleopatra.io/#report=b785cea3bdf697f366ade71c6f7613d96855af7e&selection=0,1063,6,7,1064,1065,4,5,6,7,1064,8,9,10,37
Regression: https://cleopatra.io/#report=29f7d291956d395777c8d2975d6057753882c26a&selection=0,1175

With the regression profile, I see a few short samples of the content process cycle collecting that are not present in the baseline. It's short - only around 2ms - but that's how much we're apparently regressing here, so that's my top lead.

At least for the first run of the composite-scale-rotate-opacity test, it appears to be a string bundle that the content process is releasing.

Comment 22

2 years ago
this is great mconley, thanks for getting this to work!

Comment 23

2 years ago
To eliminate the possibility of I/O during visit-notification handling, I disabled the DirectoryLinksProvider.updateSuggestedTile() method and re-tested in this try push:

Regression is still visible:

At this point I am ready to declare defeat. Talos won :(
I do believe that the code is working correctly:
- the changes are related to newtab setup and do not affect the running of the svgx rendering tests, as only a single newtab is ever opened by the rendering tests.
- the only thing that can possibly affect svgx performance is the handling of frecency events
- however, the commit change is such that it halves the time spent on frecency event handling, which gives about a 1.5% performance gain in my local talos tests.  :mconley's profiling showed the same evidence.

Furthermore, for non-e10s tests, we do get about a 1% performance increase as expected.
Could it be that the harness under e10s generates a slight error upon a small increase in test performance?
So I've been poking at this for a few days with not much to show for it, but last night I pushed some patches that disable the setTimeout calls inside DirectoryLinksProvider's onLinkChanged and onManyLinksChanged.

Here are the pushes:

Baseline: https://treeherder.mozilla.org/#/jobs?repo=try&revision=782137197fac
Regression: https://treeherder.mozilla.org/#/jobs?repo=try&revision=38ee2dca5aec

Comparison: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=782137197fac&newProject=try&newRevision=38ee2dca5aec&originalSignature=10e252613ca692c02415637069fdc71d612830bc&newSignature=10e252613ca692c02415637069fdc71d612830bc

With the setTimeout calls disabled, the baseline and regression patches appear to be equivalent, so this is my current area of exploration. I'm going to add some profile markers in and around these setTimeout calls to see if I can track whether or not the behaviour has changed here due to the "regressor" patch.
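For reference, the deferral pattern being toggled looks roughly like this - a sketch only; the handler name comes from this bug, but the body and helper are assumptions, not the actual DirectoryLinksProvider code:

  // Sketch of the setTimeout deferral under investigation (illustrative only).
  onLinkChanged: function (aProvider, aLink) {
    // Original pattern: push the follow-up work to a later event-loop turn.
    setTimeout(() => this._handleLinkChanged(aLink), 0);

    // "Disabled" variant being tested: run the work synchronously instead,
    // to rule out timer scheduling as the source of the regression.
    // this._handleLinkChanged(aLink);
  },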

Comment 25

2 years ago
:mconley, FYI - there will be half as many setTimeout() calls with the "regressor" patch.
I think my setTimeout theory is bunk. I also am not convinced about the cycle / garbage collection theories.

The main problem here is that as soon as we start profiling, the problem becomes hard or impossible to detect, unless we make our sampling more coarse (but then we risk not getting useful samples). My profiles have been mostly useless here (at least as far as I can tell). In fact, when I profile, I find that oftentimes the "regression" patch performs the test more quickly than the baseline patch.

jmaher, I don't think we should burn much more time investigating this - we might want to punt on this one.
Flags: needinfo?(mconley) → needinfo?(jmaher)

Comment 27

2 years ago
ideally we would understand it even if we don't fix it.  A lot of time is spent here- we should be smart, wontfix this and live with it.
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #27)
> ideally we would understand it even if we don't fix it.  A lot of time is
> spent here- we should be smart, wontfix this and live with it.

To be clear, this is your okay to grudgingly WONTFIX this?
Flags: needinfo?(jmaher)

Comment 29

2 years ago
Last Resolved: 2 years ago
Flags: needinfo?(jmaher)
Resolution: --- → WONTFIX
Version: unspecified → Trunk