Closed Bug 1362920 Opened 2 years ago Closed 2 years ago

[e10s-multi] Talos tps page load times have grown increasingly erratic for 32-bit Win7 e10s compared to non-e10s

Categories: Core :: General, defect
Version: 54 Branch
Platform: x86 Windows 7
Priority: Not set

Status: RESOLVED FIXED
Target Milestone: mozilla56

Tracking Status:
firefox-esr52 --- unaffected
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- fixed
firefox56 --- fixed

People: (Reporter: cpeterson, Assigned: mconley)

Keywords: talos-regression
Whiteboard: [qf-][e10s-multi:-]

Attachments: (2 files)
tps improved for 32-bit Win7 and 64-bit Win8 starting with the pushlog [1] that enabled e10s-multi on 2017-01-04. However, tps scores on 32-bit Win7 have grown increasingly erratic but only for the "opt e10s" and "pgo e10s" test configurations. The non-e10s "opt" and "pgo" configurations on 32-bit Win7 and all e10s and non-e10s on 64-bit Win8 are stable. You can toggle and compare all of those configurations on this Perfherder page:

https://treeherder.mozilla.org/perf.html#/graphs?timerange=31536000&series=%5Bmozilla-central,a86a2a069ed634663dbdef7193f2dee69b50dbc9,1,1%5D&series=%5Bmozilla-central,22751a9cf13f8eb316dd7abb02109dac45a3b8df,1,1%5D&series=%5Bmozilla-central,890f291f15fa3591eb1694ceb3476e94a69a096a,1,1%5D&series=%5Bmozilla-central,087c2c11959a3da9e402a87d7d1fdefd4d1638ec,1,1%5D&series=%5Bmozilla-central,38bec4f9b89bcbee8b353d582a8f5ab360c9b735,1,1%5D&series=%5Bmozilla-central,b577ed169c62edd6045f127a597a9b55905c8d81,1,1%5D&series=%5Bmozilla-central,0c6878f5f448ce4a08cb81f025d8a3b1557a0305,1,1%5D&series=%5Bmozilla-central,cfc195cb8dcd3d23be28f59f57a9bb68b8d7dfe2,1,1%5D

Unfortunately, we have two variables here: Win 7/8 and 32/64 bit. It's hard to know whether this is a problem with e10s-multi on Win7 specifically or 32-bit in general because Talos doesn't test 64-bit Win7 and none of the other Talos platforms are 32-bit.

[1] https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=cad2ea346d06ec5a3a70eda912513201dff0c21e&tochange=57ac9f63fc6953f4efeb0cc84a60192d3721251f
Flags: needinfo?(elancaster)
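The "increasingly erratic" claim can be made concrete by comparing the relative spread of each series. A minimal sketch using the coefficient of variation (stdev/mean); the series names and scores below are illustrative stand-ins, not real Perfherder data:

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    """Relative spread of a series: stdev / mean (lower = more stable)."""
    return stdev(samples) / mean(samples)

# Hypothetical tps scores per test configuration.
series = {
    "win7-32 opt e10s":     [34.1, 51.8, 36.9, 49.2, 33.5, 52.7],
    "win7-32 opt non-e10s": [44.9, 45.3, 44.6, 45.1, 44.8, 45.0],
    "win8-64 opt e10s":     [30.2, 30.8, 29.9, 30.5, 30.1, 30.6],
}

for name, samples in series.items():
    print(f"{name}: CV = {coefficient_of_variation(samples):.3f}")
```

With real data, the win7-32 e10s series would be expected to show a much larger CV than the stable configurations.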
(In reply to Chris Peterson [:cpeterson] from comment #0)
> tps improved for 32-bit Win7 and 64-bit Win8 starting with the pushlog [1]
> that enabled e10s-multi on 2017-01-04.

Just for the record, the improvement came mostly from filtering the glyph-cache overhead out of the test [1] (which is why it affected the non-e10s configurations as well).

[1]: d0d55af22bcb	Gabor Krizsanits — Bug 1317312 - Tps should filter out glyph cache overhead. r=mconley

> However, tps scores on 32-bit Win7
> have grown increasingly erratic but only for the "opt e10s" and "pgo e10s"
> test configurations. The non-e10s "opt" and "pgo" configurations on 32-bit
> Win7 and all e10s and non-e10s on 64-bit Win8 are stable. You can toggle and
> compare all of those configurations on this Perfherder page:
> 
> https://treeherder.mozilla.org/perf.html#/graphs?timerange=31536000&series=%5Bmozilla-central,a86a2a069ed634663dbdef7193f2dee69b50dbc9,1,1%5D&series=%5Bmozilla-central,22751a9cf13f8eb316dd7abb02109dac45a3b8df,1,1%5D&series=%5Bmozilla-central,890f291f15fa3591eb1694ceb3476e94a69a096a,1,1%5D&series=%5Bmozilla-central,087c2c11959a3da9e402a87d7d1fdefd4d1638ec,1,1%5D&series=%5Bmozilla-central,38bec4f9b89bcbee8b353d582a8f5ab360c9b735,1,1%5D&series=%5Bmozilla-central,b577ed169c62edd6045f127a597a9b55905c8d81,1,1%5D&series=%5Bmozilla-central,0c6878f5f448ce4a08cb81f025d8a3b1557a0305,1,1%5D&series=%5Bmozilla-central,cfc195cb8dcd3d23be28f59f57a9bb68b8d7dfe2,1,1%5D
> 
> Unfortunately, we have two variables here: Win 7/8 and 32/64 bit. It's hard
> to know whether this is a problem with e10s-multi on Win7 specifically or
> 32-bit in general because Talos doesn't test 64-bit Win7 and none of the
> other Talos platforms are 32-bit.
> 
> [1] https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=cad2ea346d06ec5a3a70eda912513201dff0c21e&tochange=57ac9f63fc6953f4efeb0cc84a60192d3721251f

Thanks, this is very interesting. Fortunately, the e10s-multi numbers are still better than the non-e10s numbers, but it's really sad to see all the optimizations on the non-e10s side not affecting any of the e10s configurations :( And that the win7-32 regression didn't ring any alarm bells... How can I get the list of changesets that caused the regressions and the improvements in the non-e10s configurations? I think we should investigate both.
Whiteboard: [qf] → [qf][e10s-multi:?]
> How can I get the list of various change-sets that caused the
> regressions and the improvements on non-e10s configurations? I think we
> should investigate both.

You can get a list of changesets from the Perfherder graph. Highlight and zoom in on a section of the graph, then click the circles on the bottom graph to see the changeset and pushlog. The April 16 improvement to non-e10s [1] started after pushlog [2].

[1] https://treeherder.mozilla.org/perf.html#/graphs?timerange=31536000&series=%5Bmozilla-central,087c2c11959a3da9e402a87d7d1fdefd4d1638ec,1,1%5D&series=%5Bmozilla-central,0c6878f5f448ce4a08cb81f025d8a3b1557a0305,1,1%5D&zoom=1491778966922.078,1492902751707.7922,33.12359413404143,49.9775267183111

[2] https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ce69b6e1773e9e0d0a190ce899f34b1658e66ca4&tochange=c697e756f738ce37abc56f31bfbc48f55625d617
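Besides clicking through the graph, a list of changesets between two revisions can be pulled programmatically from the hg.mozilla.org pushlog JSON endpoint. A sketch; the payload below is canned and simplified, so verify the exact JSON shape against the live `json-pushes` endpoint before relying on it:

```python
import json
from urllib.parse import urlencode

def pushlog_url(repo, fromchange, tochange):
    """Build a json-pushes URL covering the pushes between two revisions."""
    base = f"https://hg.mozilla.org/{repo}/json-pushes"
    return base + "?" + urlencode({"fromchange": fromchange, "tochange": tochange})

def changesets_from_pushlog(payload):
    """Flatten a json-pushes payload ({pushid: {"changesets": [...], ...}})
    into a single list of changeset hashes, oldest push first."""
    changesets = []
    for pushid in sorted(payload, key=int):
        changesets.extend(payload[pushid]["changesets"])
    return changesets

# Canned (hypothetical) payload instead of a live HTTP request:
sample = json.loads('{"2": {"changesets": ["ccc"], "user": "b"},'
                    ' "10": {"changesets": ["ddd", "eee"], "user": "c"}}')
print(pushlog_url("mozilla-central", "cad2ea346d06", "57ac9f63fc69"))
print(changesets_from_pushlog(sample))
```

Note that push IDs must be sorted numerically, not as strings, or push 10 would sort before push 2.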
(In reply to Chris Peterson [:cpeterson] from comment #3)
> > How can I get the list of various change-sets that caused the
> > regressions and the improvements on non-e10s configurations? I think we
> > should investigate both.
> 
> You can get a list of changesets from the Perfherder graph. Highlight and
> zoom some section of the graph and then click the circles on the bottom
> graph to see the changeset and pushlog. The April 16 improvement to non-e10s
> [1] started after pushlog [2].

Thanks, I figured it out as well in the end, but I didn't know about the zooming feature :) It's a bit weird to track all the changes because of the merges, but I think I found them.

Cause of the first improvement: Bug 1354199
Cause of the second: Bug 1302071

The added noise for the e10s case this bug is originally about is somewhat harder to track... that's up next.
Joel, do you have any thoughts on this? Are you aware of any changes in the Win7 machine config that might have contributed?
Flags: needinfo?(jmaher)
There should be no changes to the win7 machines. We get this often on various platforms or tests; I gave up after spending hundreds of hours tracking down OS X or Linux issues in the past with no resolution.

Glad that there is focus on the data!
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #6)
> there should be no changes to the win7 machines.  We get this often in
> various platforms or tests, I gave up after spending hundreds of hours
> tracking down osx or linux issues in the past with no resolution.

Joel, are you recommending that we not bother looking for an explanation of the erratic results from win7 e10s-multi since it is still faster than non-multi e10s? Where can we find more information about the differences between the win7 and win8 test machines' hardware configurations?

e10s-multi was enabled with two content processes on January 4 and increased to four content processes on March 18. In the graph, you can see that win7-32 e10s variance gets even worse after March 18. So there really is something about the win7 test machines' hardware configuration or 32-bit Firefox that gets increasingly unhappy as we add more content processes.
Flags: needinfo?(elancaster) → needinfo?(jmaher)
Whiteboard: [qf][e10s-multi:?] → [qf][e10s-multi:+]
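The "variance gets even worse after March 18" observation can be checked by comparing the spread of datapoints before and after that date. A rough sketch with hypothetical datapoints (not real Perfherder values):

```python
from datetime import date
from statistics import stdev

def variance_ratio(points, cutoff):
    """Compare spread before vs. after a cutoff date.
    points: list of (date, score) pairs; returns stdev(after) / stdev(before)."""
    before = [score for d, score in points if d < cutoff]
    after = [score for d, score in points if d >= cutoff]
    return stdev(after) / stdev(before)

# Hypothetical win7-32 e10s tps datapoints around the March 18 bump
# from two to four content processes.
points = [
    (date(2017, 3, 10), 35.0), (date(2017, 3, 12), 36.1),
    (date(2017, 3, 14), 34.7), (date(2017, 3, 16), 35.6),
    (date(2017, 3, 20), 31.2), (date(2017, 3, 22), 48.9),
    (date(2017, 3, 24), 29.8), (date(2017, 3, 26), 51.3),
]

print(f"spread grew {variance_ratio(points, date(2017, 3, 18)):.1f}x after the cutoff")
```

With the real series, a ratio well above 1 after the cutoff would support the claim; a ratio near 1 would not.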
bmiroglio said he is looking at some e10s-multi telemetry that may be relevant.
(In reply to Chris Peterson [:cpeterson] from comment #7)
> e10s-multi was enabled with two content processes on January 4 and increased
> to four content processes on March 18. In the graph, you can see that
> win7-32 e10s variance gets even worse after March 18. So there really is
> something about the win7 test machines' hardware configuration or 32-bit
> Firefox that gets increasingly unhappy as we add more content processes.

It's funny because that's the first thing I checked as well, and I came to the opposite conclusion: after March 18 it was calm for a good three days, and only then did the noise get stronger.

https://treeherder.mozilla.org/perf.html#/graphs?timerange=31536000&series=%5Bmozilla-central,890f291f15fa3591eb1694ceb3476e94a69a096a,1,1%5D&zoom=1489148994393.675,1490762075396.9465,2.8985507246376727,56.52173913043478

But with this much noise we don't have enough data points to tell for sure, so you might be right.

For the initial enabling, I think it wasn't January 4th but January 23rd:
https://hg.mozilla.org/mozilla-central/rev/aefa445b9c77
The previous landing attempts were all backed out before reaching m-c, I think. Again, I feel like there is a good two-day delay there as well before the noise gets stronger, but I would not bet my life on it. I'm not sure what happened on Jan 4th...

In conclusion, I think we should be cautious about the assumptions we make with the available data. But do let me know if I'm missing something obvious.

One thing that can cause such noise is CPU temperature. I wonder if these machines are used for anything else... But if you look at the timestamps of the nodes at both extremes, it does not seem to be completely random. It could be that in busy hours it gets hotter in the server room :)
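The busy-hours hypothesis above could be checked by bucketing scores by the hour the test ran and comparing the bucket means. A small sketch; the hours and scores are made up for illustration:

```python
from collections import defaultdict
from statistics import mean

def mean_by_hour(points):
    """Group (hour, score) samples by hour-of-day and average each bucket,
    to see whether scores drift during busy hours."""
    buckets = defaultdict(list)
    for hour, score in points:
        buckets[hour].append(score)
    return {hour: mean(scores) for hour, scores in sorted(buckets.items())}

# Hypothetical samples: (hour of day the test ran, tps score).
points = [(3, 34.8), (3, 35.2), (9, 35.1), (14, 50.7), (14, 49.9), (21, 36.0)]
print(mean_by_hour(points))
```

A real analysis would need the actual job start times from Treeherder rather than the graph node timestamps, which (as noted below) are not the execution times.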
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f2171cf2d000&newProject=try&newRevision=a038fa2d3c91&framework=1&showOnlyImportant=0

Let's see if a single content process or 4 makes a difference in the noise. I've added a few more rounds to Jim's measurements.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #10)
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f2171cf2d000&newProject=try&newRevision=a038fa2d3c91&framework=1&showOnlyImportant=0
> 
> Let's see if a single content process or 4 makes a difference in the noise.
> I've added a few more rounds to Jim's measurements.

I don't think we should blame it on e10s-multi:

https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=%5Btry,890f291f15fa3591eb1694ceb3476e94a69a096a,1,1%5D&highlightedRevisions=f2171cf2d000&highlightedRevisions=a038fa2d3c91&zoom=1493895336165.563,1493901821132.4504,20.901955997242645,33.45097560508579

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #9)
> One thing that can result such noise is CPU temperature. I wonder if these
> machines are used for anything else... But if you take a look at the
> timestamps of the nodes at both extremes it does not seem to be completely
> random. It can be that in busy hours it gets hotter in a server room :)

This was just a silly thought of course, and I'm pretty sure it's not the case (plus the timestamps on the nodes are not the times when these tests were executed). But based on Joel's comment that this happens often, I still can't rule out something at the hardware/infra level.
my results for tp5o responsiveness opt e10s are contrary to the above: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=4aeff94a7fc872e4716def28f89647fd0d0b4488&newProject=try&newRevision=abd843dd9c400aa08a06a64bf886e5fc972e407a&framework=1&filter=e10s&showOnlyImportant=0

> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #9)
> > One thing that can result such noise is CPU temperature. I wonder if these
> > machines are used for anything else... But if you take a look at the
> > timestamps of the nodes at both extremes it does not seem to be completely
> > random. It can be that in busy hours it gets hotter in a server room :)
> 
> This was just a silly thought ofc and I'm pretty sure it's not the case
> (+the timestamps on the nodes are not the time when these tests were
> executed). But based on Joel's comment that this happens often, I still
> cannot rule out something at the hardware/infra level.

My test run *may* support this hardware theory. The Base (e10s(1)) run happened in the afternoon, whereas the New (e10s(4)) run happened the following overnight/early morning. But there were plenty of other test runs that don't show a notable regression in standard deviation.
Attachment #8865442 - Attachment description: graphs.jpg → Nightly comparison between e10s and non-e10s
Whiteboard: [qf][e10s-multi:+] → [qf-][e10s-multi:+]
No longer blocks: e10s-multi
Whiteboard: [qf-][e10s-multi:+] → [qf-][e10s-multi:-]
this got buried for me, let me answer a few questions here.

The hardware for win7 and win8 is identical, and the machines are on the same VLAN and in the same location in the datacenter. So if there are differences between win7 and win8, it would be hard to believe this is infrastructure related (although we once traced some OS X issues down to excessive multicast packets from default software on OS X that did network discovery on startup, and we reboot the machines often!).

The difference is that win8 is 64-bit and we use a 64-bit build, versus win7 with a 32-bit build. Do we see a difference in opt vs. pgo?

Typically we see issues on weekends vs. weekdays; do we see that pattern here?
Flags: needinfo?(jmaher)
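The weekend-vs-weekday question could be answered by bucketing datapoints by day of week and comparing the two groups. A small sketch over hypothetical (date, score) pairs:

```python
from datetime import date
from statistics import mean

def weekend_vs_weekday(points):
    """Split (date, score) pairs into weekend and weekday buckets and
    return the mean score of each, to eyeball a weekend effect."""
    weekend = [s for d, s in points if d.weekday() >= 5]  # Sat=5, Sun=6
    weekday = [s for d, s in points if d.weekday() < 5]
    return mean(weekend), mean(weekday)

# Hypothetical datapoints; dates chosen so some fall on a weekend.
points = [
    (date(2017, 5, 5), 36.0),   # Friday
    (date(2017, 5, 6), 49.5),   # Saturday
    (date(2017, 5, 7), 51.2),   # Sunday
    (date(2017, 5, 8), 35.4),   # Monday
    (date(2017, 5, 9), 34.9),   # Tuesday
]

weekend_mean, weekday_mean = weekend_vs_weekday(points)
print(f"weekend mean {weekend_mean:.1f} vs weekday mean {weekday_mean:.1f}")
```

A large gap between the two means on the real series would point at infrastructure load rather than the build configuration.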
See Also: → 1372261
I think we can resolve this bug as FIXED thanks to Mike's tps test fixes in bug 1372261!

Mike landed his fixes on July 10 and win7 stabilized immediately:

https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=%5Bmozilla-central,7bdaad0fa21778103f4cd0d6bbe81fe3dc49040c,1,1%5D&series=%5Bmozilla-central,a86a2a069ed634663dbdef7193f2dee69b50dbc9,1,1%5D
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Assignee: nobody → mconley
Depends on: 1372261
Target Milestone: --- → mozilla56
You need to log in before you can comment on or make changes to this bug.