Closed
Bug 1260736
Opened 9 years ago
Closed 9 years ago
3.71 - 5.47% damp / tp5o (windows7-32) regression on push 4de1094b41b2 for interruptible reflow (Tue Mar 29 2016)
Categories
(Core :: General, defect, P1)
Tracking
RESOLVED
FIXED
mozilla48
| | Tracking | Status |
| --- | --- | --- |
| e10s | + | --- |
| firefox47 | --- | unaffected |
| firefox48 | --- | fixed |
People
(Reporter: wlach, Assigned: ting)
References
Details
(Keywords: perf, regression, Whiteboard: [talos_regression])
Attachments
(1 file, 1 obsolete file)
6.10 KB, patch
Hey Jim, Ting, I can't figure out which of your patches is the culprit and unfortunately they were both landed in a single push. :(
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=4de1094b41b2&selectedJob=24805221&group_state=expanded
Any help you can provide tracking this down would be appreciated.
--
Talos has detected a Firefox performance regression from push 4de1094b41b2. As author of one of the patches included in that push, we need your help to address this regression.
This is a list of all known regressions and improvements related to the push:
https://treeherder.mozilla.org/perf.html#/alerts?id=668
On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.
To learn more about the regressing test(s), please see:
https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5
https://wiki.mozilla.org/Buildbot/Talos/Tests#DAMP
Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p win32 -u none -t tp5o-e10s[Windows 7],g2-e10s[Windows 7] --rebuild 5 # add "mozharness: --spsProfile" to generate profile data
(we suggest --rebuild 5 to be more confident in the results)
To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code
Then run the following command from the directory where you set up Talos:
talos --develop -e [path]/firefox -a tp5o:damp
(add --e10s to run tests in e10s mode)
Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch(es) will be backed out! ***
Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Flags: needinfo?(jmathies)
Flags: needinfo?(janus926)
Updated•9 years ago
Product: Firefox → Core
Comment 1•9 years ago
As there is no action on this bug, I took the liberty of bisecting the set of patches that landed in the single push:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=2378cff556d9&tochange=628b0ba6ecd2
we should have results in a few hours.
Assignee
Comment 2•9 years ago
Bug 1255968 peeks the IPC message queue to see if there are I/O events that should interrupt the current reflow, which could possibly cause a perf drawback. But I wonder how the regression numbers were calculated: did they compare to previous e10s runs? If yes, can we also get numbers that compare to non-e10s?
If the numbers do not regress from non-e10s, would it be all right?
Flags: needinfo?(janus926) → needinfo?(wlachance)
Reporter
Comment 3•9 years ago
(In reply to Ting-Yu Chou [:ting] from comment #2)
> Bug 1255968 peeks the IPC message queue to see if there are I/O events that
> should interrupt the current reflow, which could possibly cause a perf
> drawback. But I wonder how the regression numbers were calculated: did they
> compare to previous e10s runs? If yes, can we also get numbers that compare
> to non-e10s?
Yes, you can see information like this on the perfherder e10s dashboard:
https://treeherder.mozilla.org/perf.html#/e10s?filter=tp5o summary
By hovering over each result and clicking on `graph`, you can visualize how the tests have changed over time. For example, you can see that the results went from "slightly better than non-e10s" to "about the same":
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,7d6d0a62f7a2126944a5e7f463ac524c6f771fd3,1%5D&series=%5Bmozilla-inbound,e9786d3bc11a5fff1fa013567494164638d623e9,1%5D&selected=%5Bmozilla-inbound,e9786d3bc11a5fff1fa013567494164638d623e9,29284,24814034%5D
> If the numbers do not regress from non-e10s, would it be all right?
There really isn't one answer to that question. In general, we want Firefox performance to be as good as possible while still behaving correctly. We might accept a regression if the underlying change is more important and it's not possible (or too difficult) to fix the regression. In general we rely on you (the developers) to make this judgement.
Let's see the results of Joel's test and go from there.
Flags: needinfo?(wlachance)
Comment 4•9 years ago
The results are in; you can see the difference in this compare link:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=25284d4437f7&newProject=try&newRevision=628b0ba6ecd2&framework=1&showOnlyImportant=0
The regressing changes are the PuppetWidget changes from bug 1255968:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4de1094b41b2
This effectively compares against a fresh clobber build of the exact revision (not changeset, but revision) just prior to this change. We should understand why this is happening and weigh the value of wontfixing this vs. fixing it.
Looking forward to understanding the root cause better.
No longer blocks: 1148978
Flags: needinfo?(jmathies) → needinfo?(janus926)
Assignee
Comment 5•9 years ago
The PuppetWidget changes make interruptible reflow work in the content process, which is important when a reflow takes a long time.
Originally the check simply returned false, but now it peeks the queued IPC messages to see if there are I/O events that need to be handled, and if so the reflow is interrupted. Since non-e10s does a similar check with native APIs [1] [2], I expect the numbers for e10s and non-e10s to be about the same with the patch. (A simplified sketch of the mechanism follows the references below.)
The comparison shows tp5o had similar results, but damp took twice as long as non-e10s for the reload subtests.
Jim, I think it's better to keep bug 1255968 on mozilla-central, and I will check why damp got much higher numbers on reload here. What do you think?
[1] https://dxr.mozilla.org/mozilla-central/source/widget/windows/nsWindow.cpp#3590
[2] https://dxr.mozilla.org/mozilla-central/source/widget/cocoa/nsChildView.mm#1606-1641
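For context, here is a minimal, self-contained model of the mechanism described above (illustrative only; none of these names or signatures are the actual mozilla-central code): a long reflow periodically asks the widget whether input is already pending and bails out early so that input can be serviced first. In non-e10s that question goes to the native event queue ([1] and [2] above); with the PuppetWidget change it goes to the pending IPC message queue instead.

```cpp
// Simplified model of interruptible reflow -- not Mozilla code, just the idea.
#include <cstdio>
#include <deque>
#include <mutex>

struct Message { int type; };          // stand-in for a queued IPC message
constexpr int kInputEvent = 1;

std::mutex gQueueLock;                 // stand-in for the channel monitor
std::deque<Message> gPendingMessages;  // messages not yet dispatched

// Like the widget's HasPendingInputEvent(): peek the queue for input events
// without removing anything from it.
bool HasPendingInputEvent() {
  std::lock_guard<std::mutex> lock(gQueueLock);
  for (const Message& m : gPendingMessages) {
    if (m.type == kInputEvent) {
      return true;
    }
  }
  return false;
}

void ReflowDocument(int totalChunks) {
  for (int i = 0; i < totalChunks; ++i) {
    // ... lay out one chunk of the frame tree ...
    if (HasPendingInputEvent()) {
      std::printf("interrupting reflow at chunk %d to service input\n", i);
      return;  // the remaining work is picked up by a later reflow pass
    }
  }
  std::printf("reflow finished\n");
}

int main() {
  gPendingMessages.push_back({kInputEvent});
  ReflowDocument(100);
}
```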
Flags: needinfo?(janus926) → needinfo?(jmathies)
Assignee
Comment 6•9 years ago
The call stack of HasPendingInputEvent() while running damp:
xul.dll!mozilla::widget::PuppetWidget::HasPendingInputEvent() Line 1427 C++
xul.dll!nsContentSink::DidProcessATokenImpl() Line 1404 C++
xul.dll!nsHtml5TreeOpExecutor::RunFlushLoop() Line 460 C++
xul.dll!nsHtml5ExecutorFlusher::Run() Line 130 C++
xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Line 994 C++
Assignee
Comment 7•9 years ago
(In reply to Ting-Yu Chou [:ting] from comment #5)
> damp took twice as long as non-e10s for the reload subtests.
This only applies to the complicated subtest, and when I run the test locally, the complicated page shows a 404 error. Is this expected?
Comment 8•9 years ago
Which page is it trying to load? I suspect you need the tp5 pageset, which 'mach talos-test' should automatically set up (unless something broke).
Comment 9•9 years ago
We run Windows XP infrequently; this showed up as regressing on tp5o there. Also, the PGO results are coming in and we see the same regressions on PGO as we do on non-PGO.
Comment 10•9 years ago
First, we should make sure the regression here is caused by our use of PeekMessages and not something else.
tp5o remains under that of non-e10s, so we can take this regression. DAMP is slightly above non-e10s now, which is unfortunate. But DAMP measures devtools-related layout, so I'm not as concerned about it.
I think we should have an e10s bug on file about looking for ways to optimize peek. I wouldn't block on this though.
Flags: needinfo?(jmathies)
Assignee
Comment 11•9 years ago
I will check why damp reload got 2x longer after I'm back from the holiday.
Assignee
Comment 12•9 years ago
Tried to profile the damp test on Windows, but the profile does not contain the content process. The command was:
$ talos --develop --executablePath ... -a damp --spsProfile --tppagecycles 1
Assignee
Comment 13•9 years ago
(In reply to Ting-Yu Chou [:ting] from comment #12)
> Tried to profile the damp test on Windows, but the profile does not contain
> the content process. The command was:
>
> $ talos --develop --executablePath ... -a damp --spsProfile --tppagecycles
> 1
Do you know if there is a bug for this?
Flags: needinfo?(jmaher)
Comment 14•9 years ago
I believe mconley fixed this not too long ago. :mconley, do you know if SPS profiling works on all processes for damp?
Flags: needinfo?(jmaher) → needinfo?(mconley)
Updated•9 years ago
OS: Unspecified → Linux
Hardware: Unspecified → x86_64
Comment 15•9 years ago
With my apologies for the spam: the wrong OS/Hardware got set when I added myself to CC.
OS: Linux → Unspecified
Hardware: x86_64 → Unspecified
Comment 16•9 years ago
Unfortunately, it doesn't appear to be working out of the box. The work to transition all of the existing tests over to the new scripts has not happened yet (bug 1251363).
Flags: needinfo?(mconley)
Assignee
Updated•9 years ago
tracking-e10s:
--- → ?
Summary: 3.71 - 5.47% damp / tp5o (windows7-32) regression on push 4de1094b41b2 (Tue Mar 29 2016) → 3.71 - 5.47% damp / tp5o (windows7-32) regression on push 4de1094b41b2 for interruptible reflow (Tue Mar 29 2016)
Assignee
Comment 17•9 years ago
The damp regression only happens on Windows because "content.sink.enable_perf_mode" is 1 on Windows and 0 on the other OSes, and there are plenty of "debug:netmonitor:..." async messages for PeekMessages() to walk through.
Though I couldn't confirm it with a profile, I expect most of the time is spent entering the monitor in PeekMessages(). Currently a single PuppetWidget::HasPendingInputEvent() call invokes PeekMessages() and enters the monitor 11 times; I'd like to make that 1 time by removing the aMsgId parameter from PeekMessages(), so the callback is invoked for every message and HasPendingInputEvent() no longer needs to call PeekMessages() once per message type. A rough sketch of the idea is below.
I'll verify my theory on Try tomorrow.
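Here is a minimal, self-contained model of that change, assuming the cost really is dominated by the lock acquisition (illustrative only: the real PeekMessages() lives on the IPC channel, takes a monitor rather than a std::mutex, and its actual signature differs):

```cpp
// Simplified before/after model of the PeekMessages() change -- not the
// actual IPC channel code, just the locking pattern under discussion.
#include <deque>
#include <functional>
#include <mutex>

struct Message { int type; };

std::mutex gMonitor;           // stand-in for the channel monitor
std::deque<Message> gPending;  // messages queued but not yet dispatched

// Before: the caller passes a single message ID, so checking 11 different
// input message types means entering the monitor 11 times.
void PeekMessagesById(int aMsgId,
                      const std::function<bool(const Message&)>& aCallback) {
  std::lock_guard<std::mutex> lock(gMonitor);
  for (const Message& msg : gPending) {
    if (msg.type == aMsgId && !aCallback(msg)) {
      return;  // callback returned false: stop walking the queue
    }
  }
}

// After: no message ID parameter; the callback sees every pending message,
// so one monitor entry covers all the types the caller cares about.
void PeekMessages(const std::function<bool(const Message&)>& aCallback) {
  std::lock_guard<std::mutex> lock(gMonitor);
  for (const Message& msg : gPending) {
    if (!aCallback(msg)) {
      return;
    }
  }
}
```

With the first form, checking 11 message types costs 11 lock acquisitions; with the second it costs one.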
OS: Unspecified → Windows
Assignee
Comment 18•9 years ago
Assignee
Comment 19•9 years ago
(In reply to Ting-Yu Chou [:ting] from comment #18)
> Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=48be06da397d
Build error, retry: https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa02a267c3c6
I also tested damp on my laptop with the patch; the numbers for reloading the complicated page are lower than non-e10s: https://goo.gl/eQzIjZ.
Assignee
Comment 20•9 years ago
(In reply to Ting-Yu Chou [:ting] from comment #19)
> I also tested damp on my laptop with the patch; the numbers for reloading
> the complicated page are lower than non-e10s: https://goo.gl/eQzIjZ.
Those were wrong numbers; I had misplaced a patch that removes the "debug:netmonitor:..." messages. I removed it, retested, and updated the numbers: e10s still runs slower than non-e10s, but the regression goes from 100% to <50%.
Assignee
Comment 21•9 years ago
I removed the |aMsgId| parameter of PeekMessages(), so if the caller is interested in more than one message type, it doesn't have to call PeekMessages() and enter the monitor for each one. A caller-side sketch follows below.
With the patch, the comparison [1] on perfherder:
tp5o opt e10s 333.63 > 326.03 (-2.28%)
damp opt e10s 283.62 > 274.83 (-3.10%)
[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=21bf1af375c1fa8565ae3bb2e89bd1a0809363d4&newProject=try&newRevision=aa02a267c3c6dabf509872bf8b7787f59b4aa4d4&framework=1&showOnlyImportant=0
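For illustration, the caller side of the new API looks roughly like this, continuing the simplified model from comment 17 (the placeholder IDs stand in for the input-related message types and are not real PBrowser message IDs):

```cpp
// Caller-side sketch: one PeekMessages() call (one monitor entry) covers all
// input-event types, instead of one call per type.
#include <set>

bool HasPendingInputEvent() {
  // Placeholder IDs standing in for the ~11 input-related message types.
  static const std::set<int> kInputMessageTypes = {1, 2, 3, 4, 5};
  bool found = false;
  PeekMessages([&found](const Message& aMsg) {
    if (kInputMessageTypes.count(aMsg.type)) {
      found = true;
      return false;  // stop walking the queue once an input event is seen
    }
    return true;     // keep walking
  });
  return found;
}
```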
Attachment #8740370 -
Flags: review?(dvander)
Comment 22•9 years ago
Thanks for pushing on this and getting back most of the regression!
Attachment #8740370 -
Flags: review?(dvander) → review+
Updated•9 years ago
Priority: -- → P1
Assignee
Comment 23•9 years ago
Assignee
Comment 24•9 years ago
Fixed the commit log.
Attachment #8740370 -
Attachment is obsolete: true
Assignee
Comment 25•9 years ago
(In reply to Ting-Yu Chou [:ting] from comment #23)
> Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8c4b9513dd3
The busted jobs on Windows 7 opt were because of this:
bash: wget: command not found
The others look good.
Assignee
Updated•9 years ago
Keywords: checkin-needed
Comment 26•9 years ago
Keywords: checkin-needed
Comment 27•9 years ago
bugherder
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox48:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Assignee
Updated•9 years ago
Assignee: nobody → janus926
Assignee
Comment 28•9 years ago
Comment on attachment 8740819 [details] [diff] [review]
patch v2
Approval Request Comment
[Feature/regressing bug #]: bug 1255968
[User impact if declined]: perf regression
[Describe test coverage new/current, TreeHerder]: automated tests
[Risks and why]: low, because there are only 2 clients of the API
[String/UUID change made/needed]: n/a
Attachment #8740819 -
Flags: approval-mozilla-aurora?
Assignee
Updated•9 years ago
status-firefox47:
--- → ?
Assignee
Comment 29•9 years ago
Bug 1255968 won't be uplifted, see bug 1255968 comment 44.
Assignee
Updated•9 years ago
Attachment #8740819 -
Flags: approval-mozilla-aurora?
Comment 30•9 years ago
I see nice wins on tp5 and some medium wins on damp! Thanks for getting this fix in :)
Updated•8 years ago
Version: unspecified → 48 Branch
Comment 31•7 years ago
Moving from Core::Untriaged to Core::General https://bugzilla.mozilla.org/show_bug.cgi?id=1407598
Component: Untriaged → General