Closed Bug 1454948 Opened 6 years ago Closed 6 years ago

4.4 - 4.83% damp (linux64, windows10-64) regression on push d22104932e7b (Tue Apr 17 2018)

Categories

(DevTools :: Netmonitor, defect)

defect
Not set
normal

Tracking

(firefox-esr52 unaffected, firefox-esr60 unaffected, firefox60 unaffected, firefox61- fix-optional, firefox62- fix-optional, firefox63 ?)

RESOLVED WONTFIX
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox60 --- unaffected
firefox61 - fix-optional
firefox62 - fix-optional
firefox63 --- ?

People

(Reporter: igoldan, Assigned: Honza)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=8bdeab80ff08949d236fd898a669e44b93f9ce30&tochange=ac148c181d2b0471fbe34b2559ad9876790f3796

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

  5%  damp linux64 opt e10s stylo     61.00 -> 63.94
  4%  damp windows10-64 opt e10s stylo71.26 -> 74.40


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=12763

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:Honza Are we going to accept this perf regression? Or should we look for a fix?
Flags: needinfo?(odvarko)
Another try run highlighting the same issue:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=ecf859b9595030ed338b2b60ac65695d49bfe85c&newProject=try&newRevision=cf0475005fa1e19a6239548778239c0bbbe3bd18&originalSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&showOnlyImportant=1&framework=1

This isn't really a perf regression but more a test breakage.
The patch from bug 1436665 most likely delay some execution without waiting for it correctly.
The increase of ".settle" subtest is a typical sign of such issue.
The improvement made to complicated.netmonitor.open (-25ms) is reported to complicated.netmonitor.open.settle (+18ms)
Same for simple.netmonitor.open (-28ms) and simple.netmonitor.open.settle (+19ms).

"settle" subtest should be close to zero, which was the case before this patch.
It ensures that the tests are waiting correctly for any asynchronous event and won't pollute the test running right after,
so it is important for DAMP health to fix that.

A good way to understand what is going on is to run DAMP and record a profile on perf-html.
Then, focus from the "Marker chart" on the marker related to complicated.netmonitor.open.settle and see what JS code is being executed during this mark.
(In reply to Alexandre Poirot [:ochameau] from comment #2)
> Another try run highlighting the same issue:
> https://treeherder.mozilla.org/perf.html#/
> comparesubtest?originalProject=try&originalRevision=ecf859b9595030ed338b2b60a
> c65695d49bfe85c&newProject=try&newRevision=cf0475005fa1e19a6239548778239c0bbb
> e3bd18&originalSignature=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&newSignatur
> e=f79b6a4f1f8f53326d3056f6c8008c0ff4de0a94&showOnlyImportant=1&framework=1
> 
> This isn't really a perf regression but more a test breakage.
> The patch from bug 1436665 most likely delay some execution without waiting
> for it correctly.
> The increase of ".settle" subtest is a typical sign of such issue.
> The improvement made to complicated.netmonitor.open (-25ms) is reported to
> complicated.netmonitor.open.settle (+18ms)
> Same for simple.netmonitor.open (-28ms) and simple.netmonitor.open.settle
> (+19ms).
> 
> "settle" subtest should be close to zero, which was the case before this
> patch.
> It ensures that the tests are waiting correctly for any asynchronous event
> and won't pollute the test running right after,
> so it is important for DAMP health to fix that.
> 
> A good way to understand what is going on is to run DAMP and record a
> profile on perf-html.
> Then, focus from the "Marker chart" on the marker related to
> complicated.netmonitor.open.settle and see what JS code is being executed
> during this mark.

Is there any work being done for resolving this test breakage?
We must be sure our tests are in good shape.
Flags: needinfo?(poirot.alex)
Yes, the test hasn't been fixed.

I think Honza was planning to look into that, I'll ping him during our next perf meeting.
Flags: needinfo?(poirot.alex)
(In reply to Alexandre Poirot [:ochameau] from comment #4)
> Yes, the test hasn't been fixed.
> 
> I think Honza was planning to look into that, I'll ping him during our next
> perf meeting.

Yes, I am planning to work on this.
Honza
Assignee: nobody → odvarko
Flags: needinfo?(odvarko)
Status: NEW → ASSIGNED
What's the status on this?
Flags: needinfo?(odvarko)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #6)
> What's the status on this?
I've already done some analysis and the issue doesn't represent any performance problem for DevTools.

The problem is related to two subtests: one is faster and one is slower - that's why Talos shows the regression. The way how the first subtest is waiting for async operations needs to be changed. I can't reproduce the problem locally (it's only visible on Try), and so it takes time to figure it out.

Honza
Flags: needinfo?(odvarko)
(In reply to Jan Honza Odvarko [:Honza] from comment #7)
> (In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #6)
> > What's the status on this?
> I've already done some analysis and the issue doesn't represent any
> performance problem for DevTools.
> 
> The problem is related to two subtests: one is faster and one is slower -
> that's why Talos shows the regression. The way how the first subtest is
> waiting for async operations needs to be changed. I can't reproduce the
> problem locally (it's only visible on Try), and so it takes time to figure
> it out.
> 
> Honza

Could you give me an update on this? And link the bugs you filed with this one?
Flags: needinfo?(odvarko)
Product: Firefox → DevTools
Emailing a reminder, but, it does not look crucial for 62 since from comment 2 there is no actual perf issue and the test just need fixing.  But it would be good to be able to rely more on the test correctness.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #9)
> Emailing a reminder, but, it does not look crucial for 62 since from comment
> 2 there is no actual perf issue and the test just need fixing.  But it would
> be good to be able to rely more on the test correctness.
Correct, this is not important for 62/63.

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #8)
> (In reply to Jan Honza Odvarko [:Honza] from comment #7)
> Could you give me an update on this? And link the bugs you filed with this
> one?
No follow ups filed.

I think that the scope of the work is revisiting the test itself
and make sure it's solid and measuring the right thing.
As I mentioned there is no real perf regression and we just
need correct testing.

Honza
Flags: needinfo?(odvarko)
(In reply to Jan Honza Odvarko [:Honza] from comment #10)
> (In reply to Liz Henry (:lizzard) (needinfo? me) from comment #9)
> I think that the scope of the work is revisiting the test itself
> and make sure it's solid and measuring the right thing.
> As I mentioned there is no real perf regression and we just
> need correct testing.
> 
> Honza

:ochameau can you look into this issue and state some action items?
Flags: needinfo?(poirot.alex)
"settle" subtests have been removed in bug 1473322.
While this was a way to ensure very high quality tests, it proved to be too much work to maintain.
Unfortunately, it is still too hard to:
* figure out what async code is still pending
* wait for absolutely all async code
And right here, Honza looked into that and the pending async code only reproduced on CI, which makes is utterly hard to fix.

As-is, netmonitor subtests are correct enough. A good quality metric is the stddev of the subtests. It is still under 1%.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(poirot.alex)
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.