Bugzilla

Assignee

Comment 2

•

7 years ago

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #1)
> :mayhemer Could you look over this bug and confirm the regression?

I highly doubt that bug 1369496 would increase IO operations (if my understanding of the regressions is correct) that much since it only very slightly changes some http networking internal scheduling.  

Unfortunately I'm more and more getting busy right now, so it may not be right now for me to look at it.

Comment 3

•

7 years ago

in looking at this, there is a clear regression that holds consistent with the push for bug 1369496:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,c0e52d7e6d0f31d629654c20aad3fec29640939f,1,1%5D&zoom=1496941416057.9004,1496945524394.4573,346067416.59720427,370325717.31681037

I even looked at a larger scale and this is a test that has a lot of noise or many changes.

:mayhemer I is ok to be busy, could you find someone else to look at this?  Maybe :mcmanus as he reviewed it?   Or maybe you have a better idea of when you can take a few hours to a day to look into this?

Assignee

Comment 4

•

7 years ago

Oh, according comment 0, it looks like we are pretty certain it's the change from bug 1369496.  I wrote comment 2 before I looked at that more closely...

Notes for me:

[16:13]	jmaher: that is measured with the xperf toolchain (that is why it is win7 only- those are the only machines we have that toolchain installed on)
[16:14]	jmaher: we run tp5 with a single page load (not repeated)
[16:14]	jmaher: nonmain = non main thread
[16:14]	jmaher: normal = normal operations, not startup
[16:14]	jmaher: file IO = any file access (even to cached files)- that is read/write

Assignee: nobody → honzab.moz

Flags: needinfo?(honzab.moz)

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 5

•

7 years ago

[16:18]	mayhemer: jmaher: I presume there is no breakdown on the IO?
[16:20]	jmaher: mayhemer: it is on my p3 list to make that easier to get that data; I can push to try and get a bunch of data by hacking up talos a bit; then analyzing the data with some scripts


Thanks!

Flags: needinfo?(jmaher)

ovidiu boca[:Ovidiu]

Updated

•

7 years ago

status-firefox53: --- → unaffected

status-firefox54: --- → unaffected

status-firefox55: --- → affected

Updated

•

7 years ago

Whiteboard: [necko-active]

Comment 6

•

7 years ago

a few days late, but here is what I have:
https://docs.google.com/a/mozilla.com/spreadsheets/d/1QBnZ6lbOR3q_PDtwqY9oZq3ClUFMrNwZ5qgaPncJ7PM/edit?usp=sharing

a lot of font files are referenced after this patch.

Flags: needinfo?(jmaher)

Mike Taylor [:miketaylr]

Comment 7

•

7 years ago

Honza, wanna have a look at the info in Comment 6?

Flags: needinfo?(honzab.moz)

Assignee

Comment 8

•

7 years ago

(In reply to Mike Taylor [:miketaylr] (55 Regression Engineering Owner) from comment #7)
> Honza, wanna have a look at the info in Comment 6?

I'm aware, just busy.

Flags: needinfo?(honzab.moz)

Assignee

Comment 9

•

7 years ago

(In reply to Joel Maher ( :jmaher) from comment #6)
> a few days late, but here is what I have:
> https://docs.google.com/a/mozilla.com/spreadsheets/d/
> 1QBnZ6lbOR3q_PDtwqY9oZq3ClUFMrNwZ5qgaPncJ7PM/edit?usp=sharing
> 
> a lot of font files are referenced after this patch.

Thanks for this!  But to be honest, I'm lost more than before...

I can see that the HTTP cache IO (in cache2/entries) doesn't overlap AT ALL between the two runs.  Are those runs made on exactly the same set of pages?  Are the runs in different containers?  Are they really the same changeset except bug 1369496?  Because otherwise I don't follow why there would not be any overlap of http cache entries at all - the entry file name is based on |context+URL| hash.  Context may contain the container id (reason I'm asking), URLs should remain the same?

The HTTP cache IO adds "only" around 4M IOs.  It's about 17% of the overall difference.  This is the only thing that my patch could have an influence on.

Then I can see +9M IOs on {cltbld}\AppData\Local\\{profile}\startupCache\startupCache.4.little

And 12.5M IOs on ttf files - no idea who works with these.


Sorry, but I'm less and less convinced this is caused by bug 1369496 at all.

Let's take bug 1369496 out from mozilla-central and see if something changes, what do you think?

Flags: needinfo?(jmaher)

Comment 10

•

7 years ago

I pushed to try a baseline from tip of inbound, then again for same baseline+backout, I see this on the graph:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Btry,c0e52d7e6d0f31d629654c20aad3fec29640939f,1,1%5D&highlightedRevisions=cbdffb52c6a2&highlightedRevisions=1e0399b52cc7&zoom=1498238586800.0247,1498239255000,346138119.21698123,383507138.2802308


for the baseline the data falls in the 368-372M range for 6 data points

for the baseline+backout, I see bi-modal data, 350-355M and 368-373M, so this seems to make a difference.


looking at the data for inbound, there is a large range of normal values:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=%5Bmozilla-inbound,c0e52d7e6d0f31d629654c20aad3fec29640939f,1,1%5D&selected=%5Bmozilla-inbound,c0e52d7e6d0f31d629654c20aad3fec29640939f,NaN,NaN,1%5D

not fully conclusive- I will retrigger many more times on my try pushes.  Possibly there is not much we can do here and have to accept it, it would be nice to know what is going on either way.

Flags: needinfo?(jmaher)

Assignee

Comment 11

•

7 years ago

Does xperf provide callstacks where from a file is being open?  I think the font thing is the most interesting to figure out here.

Just a note we also have a kill switch pref to disable the throttling - network.http.throttle.enable.  May be interesting to check what effect that may have.

Comment 12

•

7 years ago

I am not sure what is mentioned by callstacks- I suspect the answer is no as we are profiling inside the windows OS, not related to firefox.

Are you suggesting pushing to try with network.http.throttle.enable=False and comparing that to baseline?

David Bolter [:davidb] (NeedInfo me for attention)

Assignee

Comment 13

•

7 years ago

(In reply to Joel Maher ( :jmaher) from comment #12)
> I am not sure what is mentioned by callstacks- 

a windows term for back trace or stack trace.

> I suspect the answer is no as
> we are profiling inside the windows OS, not related to firefox.

according some xperf docs [1][2] it should be possible to get stack traces for IO events as well.  you only need to setup symbols correctly.

> 
> Are you suggesting pushing to try with network.http.throttle.enable=False
> and comparing that to baseline?

maybe it can show some difference too or at least it would be interesting to know.  that pref flip should effectively disable the whole feature (landed/enabled in bug 1365307) that only has been slightly changed by the patch for the offending bug 1369496.

thanks!


[1] https://blogs.msdn.microsoft.com/ntdebugging/2008/04/03/windows-performance-toolkit-xperf/
[2] https://blogs.msdn.microsoft.com/ntdebugging/2009/08/17/using-xperf-to-investigate-slow-io-issues/

Comment 14

•

7 years ago

We looked at this in regression triage today... did anyone do the try push here? (comment 12)

Mike Taylor [:miketaylr]

Comment 15

•

7 years ago

(I think Comment #14 is directed at Joel)

Flags: needinfo?(jmaher)

Comment 16

•

7 years ago

I just did this:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=e01e4c8941984b0c2b6884ea5eba35b8c5717bf5&tochange=174b70f6c4c723174989d24f670cec7842355ef6

I will follow up when results are available in a couple hours.

Assignee

Comment 17

•

7 years ago

Thanks Joel.

Comment 18

•

7 years ago

and I have data:
baseline: 359904066
pref: 349405002

this shows about a 10MB difference which is similar to the regression reported.

Flags: needinfo?(jmaher)

Assignee

Comment 19

•

7 years ago

(In reply to Joel Maher ( :jmaher) (UTC-8) from comment #18)
> and I have data:
> baseline: 359904066
> pref: 349405002
> 
> this shows about a 10MB difference which is similar to the regression
> reported.

So, there is a *lower* number of IO with the patch for bug 1369496 and the pref at "false"?

Comment 20

•

7 years ago

yes, with the tip of trunk (assume that includes the patch) and the pref = False, we see a 10MB lower number in IO.

Assignee

Comment 21

•

7 years ago

(In reply to Joel Maher ( :jmaher) (UTC-8) from comment #20)
> yes, with the tip of trunk (assume that includes the patch) and the pref =
> False, we see a 10MB lower number in IO.

Thanks.  Any chance on getting the stack traces for at least the tff file opening?  Do you have any idea where from in the code we could be opening those files?  Or who could know?

Comment 22

•

7 years ago

I am not sure how to do that- ideally that can be done running on try with custom code? |try -b o -p win32 -u none -t xperf-e10s|

Assignee

Comment 23

•

7 years ago

ni me on comment 22.

Flags: needinfo?(honzab.moz)

Mike Taylor [:miketaylr]

Comment 24

•

7 years ago

Should we consider backing this out at this point?

Firefox Bug Husbandry Bot

Assignee

Comment 25

•

7 years ago

(In reply to Mike Taylor [:miketaylr] (55 Regression Engineering Owner) from comment #24)
> Should we consider backing this out at this point?

There will be another change in the code around landing soon.  I'd like to wait until then (or check before landing via a try push) if that's going to affect the regression here.

If it doesn't help, I'll investigate the reason.

Note that because the relation between the code changes and the character of this regression it's not likely the regression is directly caused by those changes but only uncovers a bug somewhere else.

My patch that started this regressions is part of the CPD project, which is an important Quantum bug.  Hence, I am against backing it out.

Gerry Chang [:gchang]

Comment 26

•

7 years ago

We are about to have 55 RC tomorrow. Looks like there are no actions at this moment. Mark 55 fix-optional.

status-firefox55: affected → fix-optional

Gerry Chang [:gchang]

Updated

•

7 years ago

status-firefox56: --- → affected

Comment 27

•

7 years ago

Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258

Priority: -- → P1

Comment 28

•

7 years ago

Keeping as P1.  We've made significant changes since this was filed, so we should look again at how the perf is doing, and try to fix if it's still present.

Updated

•

7 years ago

Whiteboard: [necko-active]

Updated

•

7 years ago

Priority: P1 → P2

Updated

•

7 years ago

Whiteboard: [necko-triaged]

Assignee

Comment 29

•

6 years ago

I have recently changed the code that caused this regressions in bug 1386746.  It's currently disabled via a pref, but I would like to check if this regression is still present when the pref is switched.

To turn it on, change:

network.http.throttle.version = 2

Joel, can you please repush with that pref modified?  Or just tell me (maybe again, I know) how and I will do it.

Thanks!

Flags: needinfo?(honzab.moz) → needinfo?(jmaher)

Comment 30

•

6 years ago

nonmain_normal_fileio appears to be fixed:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=76231f59448d&newProject=try&newRevision=6d9fa8b952303035c078c1c9bbb4fa049ded2f7e&framework=1

the _netio counters seem to have changes, but for the purposes of this regression, there is no issue.

Flags: needinfo?(jmaher)

Assignee

Comment 31

•

6 years ago

Thanks a lot Joel!  I am about to run a pref-flip study on the new version of the code.  If successful I will switch to the new version and likely remove the old version code.

ni? on myself to not loose this.

Flags: needinfo?(honzab.moz)

Assignee

Updated

•

6 years ago

Depends on: 1444975