Closed Bug 1260736 Opened 8 years ago Closed 8 years ago

3.71 - 5.47% damp / tp5o (windows7-32) regression on push 4de1094b41b2 for interruptible reflow (Tue Mar 29 2016)

Categories

(Core :: General, defect, P1)

48 Branch
Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
e10s + ---
firefox47 --- unaffected
firefox48 --- fixed

People

(Reporter: wlach, Assigned: ting)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(1 file, 1 obsolete file)

Hey Jim, Ting, I can't figure out which of your patches is the culprit, and unfortunately they both landed in a single push. :(

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=4de1094b41b2&selectedJob=24805221&group_state=expanded

Any help you can provide tracking this down would be appreciated.

--

Talos has detected a Firefox performance regression from push 4de1094b41b2. Since you are the author of one of the patches included in that push, we need your help to address this regression.

This is a list of all known regressions and improvements related to the push:

https://treeherder.mozilla.org/perf.html#/alerts?id=668

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see:

https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5
https://wiki.mozilla.org/Buildbot/Talos/Tests#DAMP

Reproducing and debugging the regression:

If you would like to re-run this Talos test on a potential fix, use try with the following syntax:

try: -b o -p win32 -u none -t tp5o-e10s[Windows 7],g2-e10s[Windows 7] --rebuild 5  # add "mozharness: --spsProfile" to generate profile data

(we suggest --rebuild 5 to be more confident in the results)

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:

https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:

talos --develop -e [path]/firefox -a tp5o:damp

(add --e10s to run tests in e10s mode)

Making a decision:

As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations:

https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Flags: needinfo?(jmathies)
Flags: needinfo?(janus926)
Blocks: 1256666
Product: Firefox → Core
As there has been no action on this bug, I took the liberty of bisecting the set of patches that landed in the single push:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=2378cff556d9&tochange=628b0ba6ecd2

we should have results in a few hours.
Bug 1255968 peeks the IPC message queue to see if there are IO events that should interrupt the current reflow, which could possibly cause a perf regression. But I wonder how the regression numbers were calculated: did they compare against previous e10s runs? If yes, can we also get numbers that compare against non-e10s?

If the numbers do not regress from non-e10s, would it be all right?
Flags: needinfo?(janus926) → needinfo?(wlachance)
(In reply to Ting-Yu Chou [:ting] from comment #2)
> Bug 1255968 peeks the IPC message queue to see if there are IO events that
> should interrupt the current reflow, which could possibly cause a perf
> regression. But I wonder how the regression numbers were calculated: did
> they compare against previous e10s runs? If yes, can we also get numbers
> that compare against non-e10s?

Yes, you can see information like this on the perfherder e10s dashboard:

https://treeherder.mozilla.org/perf.html#/e10s?filter=tp5o summary

By hovering over each result and clicking on `graph`, you can visualize how the tests have changed over time. For example, you can see that the results went from "slightly better than non-e10s" to "about the same":

https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,7d6d0a62f7a2126944a5e7f463ac524c6f771fd3,1%5D&series=%5Bmozilla-inbound,e9786d3bc11a5fff1fa013567494164638d623e9,1%5D&selected=%5Bmozilla-inbound,e9786d3bc11a5fff1fa013567494164638d623e9,29284,24814034%5D

> If the numbers do not regress from non-e10s, would it be all right?

There really isn't one answer to that question. In general we want Firefox performance to be as good as possible while still behaving correctly. We might accept a regression if the change that caused it is important enough and it's not possible (or too difficult) to fix the regression. We rely on you (the developers) to make this judgement.

Let's see the results of Joel's test and go from there.
Flags: needinfo?(wlachance)
the results are in, you can see the difference in this compare link:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=25284d4437f7&newProject=try&newRevision=628b0ba6ecd2&framework=1&showOnlyImportant=0

the changes are the puppetwidget changes from bug 1255968:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4de1094b41b2

This is effectively comparing against a fresh clobber build of the exact revision (not changeset, but revision) just prior to this change. We should understand why this is happening and weigh the value of wontfixing this vs. fixing it.

Looking forward to understanding the root cause better.
No longer blocks: 1148978
Flags: needinfo?(jmathies) → needinfo?(janus926)
The PuppetWidget changes are for interruptible reflow to work in the content process, which is important when a reflow takes a long time.

Originally it simply returned false, but now it peeks the queued IPC messages to see if there are IO events that need to be handled, and if so, interrupts the reflow. Since non-e10s does a similar check with native APIs [1] [2], I expect the numbers for e10s and non-e10s to be about the same with the patch.

The compare shows tp5o had similar results, but damp took about twice as long as non-e10s for the reload subtests.

Jim, I think it's better to keep bug 1255968 in m-c, and I will check here why damp got much higher numbers on reload. What do you think?

[1] https://dxr.mozilla.org/mozilla-central/source/widget/windows/nsWindow.cpp#3590
[2] https://dxr.mozilla.org/mozilla-central/source/widget/cocoa/nsChildView.mm#1606-1641
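
To spell out the mechanism for readers following along: in non-e10s the widget can ask the OS directly whether input is waiting, while in the content process the closest equivalent is peeking the widget's own IPC queue. A minimal, self-contained sketch of that shape (illustrative names and types only, not the actual PuppetWidget/nsPresContext code):

  // Illustrative sketch only: a reflow loop that periodically asks the
  // widget layer whether input-like events are waiting, and interrupts
  // the current reflow if so. Not the real mozilla-central code.
  #include <deque>
  #include <mutex>

  enum class MsgType { Input, Paint, Other };

  struct PendingQueue {
    std::mutex lock;
    std::deque<MsgType> msgs;

    // True if any queued message looks like an IO/input event.
    bool HasPendingInputEvent() {
      std::lock_guard<std::mutex> guard(lock);
      for (MsgType m : msgs) {
        if (m == MsgType::Input) {
          return true;
        }
      }
      return false;
    }
  };

  void ReflowSomeFrames(PendingQueue& aQueue, int aFrameCount) {
    for (int i = 0; i < aFrameCount; ++i) {
      // ... reflow one frame ...
      if (aQueue.HasPendingInputEvent()) {
        return;  // remember where we stopped; resume after the input is handled
      }
    }
  }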
Flags: needinfo?(janus926) → needinfo?(jmathies)
The callstack of HasPendingInputEvent() while running damp:

  xul.dll!mozilla::widget::PuppetWidget::HasPendingInputEvent() Line 1427	C++
  xul.dll!nsContentSink::DidProcessATokenImpl() Line 1404	C++
  xul.dll!nsHtml5TreeOpExecutor::RunFlushLoop() Line 460	C++
  xul.dll!nsHtml5ExecutorFlusher::Run() Line 130	C++
  xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Line 994	C++
(In reply to Ting-Yu Chou [:ting] from comment #5)
> damp took about twice as long as non-e10s for the reload subtests.

This only applies to the complicated subtest, and when I run the test locally, the complicated page returns a 404 error. Is this expected?
Which page is it trying to load? I suspect you need the tp5 pageset, which 'mach talos-test' should set up automatically (unless something broke).
We run Windows XP infrequently; there this showed up as regressing on tp5o as well. The PGO results are also coming in, and we see the same regressions on PGO as we do on non-PGO.
First, we should make sure the regression here is caused by our use of PeekMessages and not something else.

tp5o remains under that of non-e10s, so we can take this regression. DAMP is slightly above non-e10s now, which is unfortunate, but DAMP measures devtools-related layout, so I'm not as concerned about it.

I think we should have an e10s bug on file about looking for ways to optimize peek. I wouldn't block on this though.
Flags: needinfo?(jmathies)
I will check why damp reload got 2x longer after I'm back from the holiday.
Tried to profile the damp test on Windows, but the profile does not contain the content process. The command was:

  $ talos --develop --executablePath ... -a damp --spsProfile --tppagecycles 1
(In reply to Ting-Yu Chou [:ting] from comment #12)
> Tried to profile the damp test on Windows, but the profile does not
> contain the content process. The command was:
> 
>   $ talos --develop --executablePath ... -a damp --spsProfile --tppagecycles
> 1

Do you know if there is a bug for this?
Flags: needinfo?(jmaher)
I believe mconley fixed this not too long ago. :mconley, do you know if SPS profiling works on all processes for damp?
Flags: needinfo?(jmaher) → needinfo?(mconley)
OS: Unspecified → Linux
Hardware: Unspecified → x86_64
With my apologies for the spam: the wrong OS/Hardware got set when I added myself to the CC list.
OS: Linux → Unspecified
Hardware: x86_64 → Unspecified
Unfortunately, it doesn't appear to be working out of the box. The work to transition all of the existing tests over to the new scripts has not happened yet (bug 1251363).
Flags: needinfo?(mconley)
tracking-e10s: --- → ?
Summary: 3.71 - 5.47% damp / tp5o (windows7-32) regression on push 4de1094b41b2 (Tue Mar 29 2016) → 3.71 - 5.47% damp / tp5o (windows7-32) regression on push 4de1094b41b2 for interruptible reflow (Tue Mar 29 2016)
The damp regression only happens on Windows, since "content.sink.enable_perf_mode" is 1 on Windows and 0 on the other OSes. And there are plenty of "debug:netmonitor:..." async messages that PeekMessages() walks through.

Though I couldn't confirm it with a profile, I expect most of the time goes to entering the monitor in PeekMessages(). Currently a single PuppetWidget::HasPendingInputEvent() call invokes PeekMessages() and enters the monitor 11 times. I'd like to reduce that to once by removing the aMsgId parameter from PeekMessages(), so the callback gets called for every message and HasPendingInputEvent() no longer has to call PeekMessages() once per message type.

I'll verify my theory on Try tomorrow.
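
To make the cost concrete, here is a rough, self-contained sketch of the two API shapes (made-up types and names, not the real MessageChannel/PeekMessages signatures). With a per-message-id peek, a caller checking 11 message types takes the queue's monitor 11 times; with a callback that sees every queued message, a single acquisition is enough:

  #include <cstdint>
  #include <deque>
  #include <functional>
  #include <mutex>

  struct Message { uint32_t type; };

  struct Channel {
    std::mutex monitor;
    std::deque<Message> pending;

    // Current shape (illustrative): peek only messages of one type, so a
    // caller interested in 11 types enters the monitor 11 times.
    void PeekMessagesOfType(uint32_t aMsgId,
                            const std::function<bool(const Message&)>& aCb) {
      std::lock_guard<std::mutex> guard(monitor);
      for (const Message& m : pending) {
        if (m.type == aMsgId && !aCb(m)) {
          return;  // callback asked to stop
        }
      }
    }

    // Proposed shape (illustrative): hand every queued message to the
    // callback, so one monitor acquisition covers all message types.
    void PeekMessages(const std::function<bool(const Message&)>& aCb) {
      std::lock_guard<std::mutex> guard(monitor);
      for (const Message& m : pending) {
        if (!aCb(m)) {
          return;  // callback asked to stop
        }
      }
    }
  };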
OS: Unspecified → Windows
(In reply to Ting-Yu Chou [:ting] from comment #18)
> Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=48be06da397d

Build error, retrying: https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa02a267c3c6

I also tested damp on my laptop with the patch, and the numbers for reloading the complicated page are lower than non-e10s: https://goo.gl/eQzIjZ.
(In reply to Ting-Yu Chou [:ting] from comment #19)
> I also tested damp on my laptop with the patch, and the numbers for
> reloading the complicated page are lower than non-e10s: https://goo.gl/eQzIjZ.

Those numbers were wrong: I had a patch applied by mistake that removes the "debug:netmonitor:..." messages. After removing it, I retested and updated the numbers. e10s still runs slower than non-e10s, but the regression goes from 100% down to under 50%.
Attached patch patch v1 (obsolete) — Splinter Review
I removed the |aMsgId| parameter of PeekMessages(), so a caller that is interested in more than one message type doesn't have to call PeekMessages() and enter the monitor for each one.

With the patch, the comparison [1] on perfherder:

tp5o opt e10s 333.63 > 326.03 (-2.28%)
damp opt e10s 283.62 > 274.83 (-3.10%) 

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=21bf1af375c1fa8565ae3bb2e89bd1a0809363d4&newProject=try&newRevision=aa02a267c3c6dabf509872bf8b7787f59b4aa4d4&framework=1&showOnlyImportant=0
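
As a caller-side illustration of the new shape (hypothetical names again, not the code in this patch), the message types of interest can all be checked in a single pass:

  #include <cstdint>
  #include <functional>

  struct Message { uint32_t type; };

  // Stand-in for the single-pass peek: invokes the callback for every
  // queued message until the callback returns false.
  using PeekFn = std::function<void(const std::function<bool(const Message&)>&)>;

  // Hypothetical caller: one peek, one monitor entry, however many
  // message types we care about.
  bool HasPendingInputEvent(const PeekFn& aPeekMessages) {
    bool found = false;
    aPeekMessages([&found](const Message& aMsg) {
      if (aMsg.type == 1 /* mouse */ || aMsg.type == 2 /* key */ ||
          aMsg.type == 3 /* touch */) {
        found = true;
        return false;  // stop iterating
      }
      return true;  // keep scanning the queue
    });
    return found;
  }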
Attachment #8740370 - Flags: review?(dvander)
thanks for pushing on this and getting back most of the regression!
Attachment #8740370 - Flags: review?(dvander) → review+
Priority: -- → P1
Attached patch patch v2Splinter Review
Fixed the commit log.
Attachment #8740370 - Attachment is obsolete: true
(In reply to Ting-Yu Chou [:ting] from comment #23)
> Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8c4b9513dd3

The busted Windows 7 opt jobs were caused by this:

  bash: wget: command not found

The others look good.
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/b24b880f2e23
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Assignee: nobody → janus926
Comment on attachment 8740819 [details] [diff] [review]
patch v2

Approval Request Comment
[Feature/regressing bug #]: bug 1255968
[User impact if declined]: perf regression
[Describe test coverage new/current, TreeHerder]: automated tests
[Risks and why]: low, because there are only 2 clients of the API
[String/UUID change made/needed]: n/a
Attachment #8740819 - Flags: approval-mozilla-aurora?
I see nice wins on tp5 and some medium wins on damp! Thanks for getting this fix in :)
Version: unspecified → 48 Branch
Moving from Core::Untriaged to Core::General https://bugzilla.mozilla.org/show_bug.cgi?id=1407598
Component: Untriaged → General