Closed Bug 1371649 Opened 7 years ago Closed 6 years ago

2.48 - 3.73% tp5n nonmain_normal_fileio (windows7-32) regression on push 2b11565d26fc3e84db455ecc46cab3fdf4cde59b (Thu Jun 8 2017)

Categories

(Core :: Networking: HTTP, defect, P2)

x86
Windows 7
defect

Tracking


RESOLVED WORKSFORME
Tracking Status
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- fix-optional
firefox56 --- affected

People

(Reporter: igoldan, Assigned: mayhemer)

References

Details

(Keywords: perf, regression, talos-regression, Whiteboard: [necko-triaged])

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=2b11565d26fc3e84db455ecc46cab3fdf4cde59b

Since you are the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

  4%  tp5n nonmain_normal_fileio windows7-32 pgo e10s     355,510,048.50 -> 368,766,201.08
  2%  tp5n nonmain_normal_fileio windows7-32 opt e10s     356,332,714.04 -> 365,168,193.50


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=7154

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Blocks: 1369496, 1346783
Component: Untriaged → Networking: HTTP
Product: Firefox → Core
:mayhemer Could you look over this bug and confirm the regression?
Flags: needinfo?(honzab.moz)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #1)
> :mayhemer Could you look over this bug and confirm the regression?

I highly doubt that bug 1369496 would increase IO operations that much (if my understanding of the regression is correct), since it only very slightly changes some HTTP networking internal scheduling.

Unfortunately I'm getting busier and busier right now, so I may not be able to look at it right away.
Looking at this, there is a clear regression that lines up with the push for bug 1369496:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,c0e52d7e6d0f31d629654c20aad3fec29640939f,1,1%5D&zoom=1496941416057.9004,1496945524394.4573,346067416.59720427,370325717.31681037

I also looked at a larger time scale; this is a test with a lot of noise or many changes.

:mayhemer it is ok to be busy - could you find someone else to look at this?  Maybe :mcmanus, since he reviewed it?  Or maybe you have a better idea of when you can take a few hours to a day to look into this?
Oh, according to comment 0, it looks like we are pretty certain it's the change from bug 1369496.  I wrote comment 2 before I looked at that more closely...

Notes for me:

[16:13]	jmaher: that is measured with the xperf toolchain (that is why it is win7 only- those are the only machines we have that toolchain installed on)
[16:14]	jmaher: we run tp5 with a single page load (not repeated)
[16:14]	jmaher: nonmain = non main thread
[16:14]	jmaher: normal = normal operations, not startup
[16:14]	jmaher: file IO = any file access (even to cached files)- that is read/write
Assignee: nobody → honzab.moz
Flags: needinfo?(honzab.moz)
[16:18]	mayhemer: jmaher: I presume there is no breakdown on the IO?
[16:20]	jmaher: mayhemer: it is on my p3 list to make that easier to get that data; I can push to try and get a bunch of data by hacking up talos a bit; then analyzing the data with some scripts


Thanks!
Flags: needinfo?(jmaher)
Whiteboard: [necko-active]
a few days late, but here is what I have:
https://docs.google.com/a/mozilla.com/spreadsheets/d/1QBnZ6lbOR3q_PDtwqY9oZq3ClUFMrNwZ5qgaPncJ7PM/edit?usp=sharing

a lot of font files are referenced after this patch.
Flags: needinfo?(jmaher)
Honza, wanna have a look at the info in Comment 6?
Flags: needinfo?(honzab.moz)
(In reply to Mike Taylor [:miketaylr] (55 Regression Engineering Owner) from comment #7)
> Honza, wanna have a look at the info in Comment 6?

I'm aware, just busy.
Flags: needinfo?(honzab.moz)
(In reply to Joel Maher ( :jmaher) from comment #6)
> a few days late, but here is what I have:
> https://docs.google.com/a/mozilla.com/spreadsheets/d/
> 1QBnZ6lbOR3q_PDtwqY9oZq3ClUFMrNwZ5qgaPncJ7PM/edit?usp=sharing
> 
> a lot of font files are referenced after this patch.

Thanks for this!  But to be honest, I'm lost more than before...

I can see that the HTTP cache IO (in cache2/entries) doesn't overlap AT ALL between the two runs.  Are those runs made on exactly the same set of pages?  Are the runs in different containers?  Are they really the same changeset except bug 1369496?  Because otherwise I don't follow why there would not be any overlap of http cache entries at all - the entry file name is based on |context+URL| hash.  Context may contain the container id (reason I'm asking), URLs should remain the same?
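
To make my assumption explicit, here is a rough sketch of why I would expect the entry file names to overlap between runs.  This is a simplification for illustration only, not the real cache2 key format or hashing code; the point is just that the file name is derived from a hash of context+URL, so identical pages loaded in identical contexts should produce identical entry file names:

  # Simplified illustration only - not the actual Gecko cache2 key/hash code.
  import hashlib

  def cache_entry_file_name(context, url):
      # The real key also encodes things like the anonymous flag and origin
      # attributes (e.g. container id); here we just concatenate for clarity.
      key = context + ":" + url
      return hashlib.sha1(key.encode("utf-8")).hexdigest().upper()

  # Same context + same URL -> same entry file name, so two runs over the
  # same page set should show heavy overlap under cache2/entries.
  print(cache_entry_file_name("container=0", "https://example.com/"))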

The HTTP cache IO adds "only" around 4M IOs.  It's about 17% of the overall difference.  This is the only thing that my patch could have an influence on.

Then I can see +9M IOs on {cltbld}\AppData\Local\{profile}\startupCache\startupCache.4.little

And 12.5M IOs on ttf files - no idea who works with these.


Sorry, but I'm less and less convinced this is caused by bug 1369496 at all.

Let's take bug 1369496 out from mozilla-central and see if something changes, what do you think?
Flags: needinfo?(jmaher)
I pushed a baseline from the tip of inbound to try, then pushed again with the same baseline plus a backout. I see this on the graph:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Btry,c0e52d7e6d0f31d629654c20aad3fec29640939f,1,1%5D&highlightedRevisions=cbdffb52c6a2&highlightedRevisions=1e0399b52cc7&zoom=1498238586800.0247,1498239255000,346138119.21698123,383507138.2802308


for the baseline the data falls in the 368-372M range for 6 data points

for the baseline+backout, I see bi-modal data, 350-355M and 368-373M, so this seems to make a difference.


looking at the data for inbound, there is a large range of normal values:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=%5Bmozilla-inbound,c0e52d7e6d0f31d629654c20aad3fec29640939f,1,1%5D&selected=%5Bmozilla-inbound,c0e52d7e6d0f31d629654c20aad3fec29640939f,NaN,NaN,1%5D

Not fully conclusive - I will retrigger many more times on my try pushes.  Possibly there is not much we can do here and we will have to accept it, but it would be nice to know what is going on either way.
Flags: needinfo?(jmaher)
Does xperf provide call stacks showing where a file is being opened from?  I think the font thing is the most interesting to figure out here.

Just a note: we also have a kill-switch pref to disable the throttling - network.http.throttle.enable.  It may be interesting to check what effect that has.
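
If it helps, the minimal way to flip it for a comparison run would be a pref in the test profile, along these lines (assuming the usual user.js / profile-pref mechanism; exactly where Talos injects prefs may differ, so treat this as a sketch):

  // user.js - turn the throttling kill switch off for the comparison run
  // (boolean pref; name as mentioned above)
  user_pref("network.http.throttle.enable", false);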
I am not sure what is meant by callstacks - I suspect the answer is no, as we are profiling inside the Windows OS, not inside Firefox.

Are you suggesting pushing to try with network.http.throttle.enable=False and comparing that to baseline?
(In reply to Joel Maher ( :jmaher) from comment #12)
> I am not sure what is meant by callstacks -

a Windows term for a back trace or stack trace.

> I suspect the answer is no, as
> we are profiling inside the Windows OS, not inside Firefox.

according to some xperf docs [1][2] it should be possible to get stack traces for IO events as well.  You only need to set up symbols correctly.

> 
> Are you suggesting pushing to try with network.http.throttle.enable=False
> and comparing that to baseline?

Maybe it can show some difference too, or at least it would be interesting to know.  That pref flip should effectively disable the whole feature (landed/enabled in bug 1365307) that has only been slightly changed by the patch for the offending bug 1369496.

thanks!


[1] https://blogs.msdn.microsoft.com/ntdebugging/2008/04/03/windows-performance-toolkit-xperf/
[2] https://blogs.msdn.microsoft.com/ntdebugging/2009/08/17/using-xperf-to-investigate-slow-io-issues/
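
For reference, the kind of invocation those docs describe looks roughly like the following.  This is only a sketch of the manual xperf workflow, not what the Talos harness currently runs; the exact kernel flags, stackwalk event names and symbol path may need adjusting:

  rem enable file IO kernel events with stack walking for read/write/create
  xperf -on PROC_THREAD+LOADER+FILE_IO+FILE_IO_INIT -stackwalk FileRead+FileWrite+FileCreate
  rem ... run the test scenario ...
  rem stop the trace and merge it into an .etl file
  xperf -d fileio.etl
  rem point the viewer at Microsoft's public symbol server so stacks resolve
  set _NT_SYMBOL_PATH=srv*C:\symbols*https://msdl.microsoft.com/download/symbols
  xperfview fileio.etl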
We looked at this in regression triage today... did anyone do the try push here? (comment 12)
(I think Comment #14 is directed at Joel)
Flags: needinfo?(jmaher)
Thanks Joel.
and I have data:
baseline: 359904066
pref: 349405002

this shows about a 10MB difference which is similar to the regression reported.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-8) from comment #18)
> and I have data:
> baseline: 359904066
> pref: 349405002
> 
> this shows about a 10MB difference which is similar to the regression
> reported.

So, there is a *lower* number of IO with the patch for bug 1369496 and the pref at "false"?
yes, with the tip of trunk (assume that includes the patch) and the pref = False, we see a 10MB lower number in IO.
(In reply to Joel Maher ( :jmaher) (UTC-8) from comment #20)
> yes, with the tip of trunk (assume that includes the patch) and the pref =
> False, we see a 10MB lower number in IO.

Thanks.  Any chance of getting the stack traces, at least for the ttf file opening?  Do you have any idea where in the code we could be opening those files from?  Or who would know?
I am not sure how to do that - ideally that could be done by running on try with custom code? |try -b o -p win32 -u none -t xperf-e10s|
ni me on comment 22.
Flags: needinfo?(honzab.moz)
Should we consider backing this out at this point?
(In reply to Mike Taylor [:miketaylr] (55 Regression Engineering Owner) from comment #24)
> Should we consider backing this out at this point?

There will be another change in the code around landing soon.  I'd like to wait until then (or check before landing via a try push) if that's going to affect the regression here.

If it doesn't help, I'll investigate the reason.

Note that given the relation between the code changes and the character of this regression, it's not likely that the regression is directly caused by those changes; more likely they only uncover a bug somewhere else.

My patch that started this regression is part of the CPD project, which is an important Quantum bug.  Hence, I am against backing it out.
We are about to have the 55 RC tomorrow. It looks like there are no actions to take at this moment. Marking 55 as fix-optional.
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Keeping as P1.  We've made significant changes since this was filed, so we should look again at how the perf is doing, and try to fix if it's still present.
Whiteboard: [necko-active]
Priority: P1 → P2
Whiteboard: [necko-triaged]
I have recently changed the code that caused this regression in bug 1386746.  It's currently disabled via a pref, but I would like to check whether this regression is still present when the pref is switched.

To turn it on, change:

network.http.throttle.version = 2
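
(Expressed the same way as the earlier kill-switch check, i.e. as a profile pref - again assuming the usual user.js mechanism Talos uses, so just a sketch:

  // user.js - opt the test profile into the new throttling algorithm
  user_pref("network.http.throttle.version", 2);
)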

Joel, can you please repush with that pref modified?  Or just tell me (maybe again, I know) how and I will do it.

Thanks!
Flags: needinfo?(honzab.moz) → needinfo?(jmaher)
nonmain_normal_fileio appears to be fixed:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=76231f59448d&newProject=try&newRevision=6d9fa8b952303035c078c1c9bbb4fa049ded2f7e&framework=1

the _netio counters seem to have changed, but for the purposes of this regression, there is no issue.
Flags: needinfo?(jmaher)
Thanks a lot, Joel!  I am about to run a pref-flip study on the new version of the code.  If successful, I will switch to the new version and likely remove the old code.

ni? on myself so I don't lose this.
Flags: needinfo?(honzab.moz)
Depends on: 1444975
fixed with bug 1444975
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(honzab.moz)
Resolution: --- → WORKSFORME