1271948 - linux64 talos tests became bimodal on May 9th: tp5o_scroll, tscrollx, glterrain

Reporter

Description

•

8 years ago

I am not sure why this happened, but here is an example of the behavior:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,485c0fb96b42e7472f3cafbfea5ce0c3058b1bd1,1%5D

this happens for both e10s and non-e10s, so it is possibly related to a commit rather than infrastructure

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 1

•

8 years ago

I am collecting more data in this range:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&group_state=expanded&filter-searchStr=linux%20talos%20g1&tochange=5268ba1f8114fb1db78f36e0c6e20fdf1c0e596d&fromchange=23a4649788f0c4c3fbc9d7b4cb8e7515e77d32e3&selectedJob=27526494

the g1 and svg jobs are the ones which have those tests.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 2

•

8 years ago

lots of retriggers, doing more now. With extra data we don't see the bi-modal data easily, but with 10+ data points we are seeing bi-modal data.

here are some graphs (zoomed in):
glterrain: https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,485c0fb96b42e7472f3cafbfea5ce0c3058b1bd1,1%5D&zoom=1462817376254.6167,1462819128433.5552,11.709407662906855,25.14401134275268
* bed3ca8d4a30 

tp5o_scroll: https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,434100a87461dd17a9176d8820820e0a4fc038c7,1%5D&zoom=1462813969177.8667,1462821921391.4238,4.108863962465807,4.391958447063864
* bed3ca8d4a30 

tscrollx: https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,7f213cf0f28370e00d701b7b3e65621d396c5df7,1%5D&zoom=1462815813160.567,1462819782132.223,3.457967760630079,3.8963945374850146
* 277692603a94  (prior push to bed3ca8d4a30)

collecting data on earlier pushes. Originally I had d55e3bc94309 (3 pushes after bed3ca8d4a30) for glterrain, so if the start moves any pushes earlier, I am calling this 100% infrastructure and will need to hunt that down.
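A minimal sketch of how the two modes could be separated once a push has 10+ retriggers. This is not how perfherder detects alerts; it is just a largest-gap split for eyeballing the data. The function name, the thresholds, and the sample values are all hypothetical.

```python
from statistics import pstdev


def split_modes(values, min_per_mode=3, gap_factor=3.0):
    """Split values at the largest gap and report whether they look bi-modal.

    Returns (low, high, is_bimodal). min_per_mode and gap_factor are
    illustrative thresholds, not values taken from Talos or perfherder.
    """
    ordered = sorted(values)
    if len(ordered) < 2 * min_per_mode:
        # Too few data points to say anything about two modes.
        return ordered, [], False

    # Only consider cuts that leave at least min_per_mode points on each side.
    candidates = range(min_per_mode, len(ordered) - min_per_mode + 1)
    cut = max(candidates, key=lambda i: ordered[i] - ordered[i - 1])
    low, high = ordered[:cut], ordered[cut:]

    gap = high[0] - low[-1]
    spread = max(pstdev(low), pstdev(high))
    # Call it bi-modal when the gap dwarfs the noise inside each cluster.
    is_bimodal = gap > gap_factor * max(spread, 1e-9)
    return low, high, is_bimodal


if __name__ == "__main__":
    # Hypothetical retrigger results for one push (not real Talos data).
    samples = [12.1, 12.3, 12.0, 12.2, 12.4, 24.8, 25.0, 24.9, 25.1, 12.2, 24.7]
    low, high, bimodal = split_modes(samples)
    print(len(low), len(high), bimodal)
```

With only a handful of points the function refuses to call anything bi-modal, which matches the observation that the pattern only becomes visible with enough retriggers.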

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 3

•

8 years ago

and cart is bi-modal:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,5470dd119f529e9cc4ca90863bb3950b11ae265f,1%5D&series=%5Bfx-team,5470dd119f529e9cc4ca90863bb3950b11ae265f,0%5D&series=%5Bmozilla-central,5470dd119f529e9cc4ca90863bb3950b11ae265f,0%5D&zoom=1462791945088.5417,1462841955546.875,26.034324674358157,27.573358131607225

I suspect the list will go on

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 4

•

8 years ago

and damp goes bi-modal:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,2becc39f2d1a40c23b420d838d99330621d50c26,1%5D

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 5

•

8 years ago

and svgr_opacity:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,bd0b1764d1a874d7adbe4322b42eee851c10d31e,1%5D&selected=%5Bmozilla-inbound,bd0b1764d1a874d7adbe4322b42eee851c10d31e,NaN,NaN,1%5D

...


here are the alerts I have seen so far from perfherder:
https://treeherder.mozilla.org/perf.html#/alerts?id=1148

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 6

•

8 years ago

this is now officially an infra issue!

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 7

•

8 years ago

hard to know the first instance here, I would say May 9th around 3pm EDT.

according to the reconfig wiki page:
https://wiki.mozilla.org/ReleaseEngineering:Maintenance

there were no buildbot reconfigs over the weekend or on Monday.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

8 years ago

Blocks: 1255582

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 8

•

8 years ago

it appears the load on the linux machines jumped up at about 2:30 EDT on Monday and has stayed pretty high. There was a dip for a brief period, and oddly enough that is a period where we have no bi-modal data.

Amy Rich [:arr] [:arich]

Comment 9

•

8 years ago

Granted jmaher root access to talos-linux64-ix-008 and talos-linux-ix-048 to troubleshoot. When you're done, please let buildduty know so they can delete those two hosts from the VPN list.

Justin Wood (:Callek)

Comment 10

•

8 years ago

Joel notified me in IRC that he is done with the troubleshooting steps on the two machines from c#9.

Can you remove the hosts from the Loan list please?

Flags: needinfo?(vciobancai)

Flags: needinfo?(ihsiao)

Flags: needinfo?(aselagea)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 11

•

8 years ago

watching two machines for a couple of hours, I see firefox chewing up 100+% of the cpu, but no other odd processes or resource usage. Looking on graphite, the load seemed to be down on Monday until 8pm UTC, and then it became quite high and remained high except for a dip on Tuesday (coincidentally this same dip corresponds to the lack of bi-modal data for a few hours).

:arr, can you highlight any changes that were made over the weekend or Monday which would affect the linux machines?

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

8 years ago

Flags: needinfo?(arich)

Amy Rich [:arr] [:arich]

Comment 12

•

8 years ago

No hardware or OS changes made to these systems in a long time.

https://hg.mozilla.org/build/puppet/ shows all config changes made by puppet by people in releng. Those all look like BBB or key rotations, so I wouldn't think those would have any impact.

I'm not qualified to say if anything else in https://hg.mozilla.org/build/?sort=lastchange would have been impactful, but that's another place to look.

Flags: needinfo?(arich)

Iris Hsiao [:ihsiao]

Comment 13

•

8 years ago

Removed talos-linux64-ix-008 and talos-linux-ix-048 from the Loan list

Flags: needinfo?(vciobancai)

Flags: needinfo?(ihsiao)

Flags: needinfo?(aselagea)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 14

•

8 years ago

there was a period of ~4 hours where we ended up on the low (better) mode:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,08e8cf88e48de5fdb067b982f12675315fc84b71,1,1%5D&zoom=1463034227115.83,1463139609000,8.681846533566159,20.67069411720928

and now we are back to bi-modal.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 15

•

8 years ago

looking at bug 1262760, and a more remote possibility in bug 1214487.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 16

•

8 years ago

of course our timestamps are all messed up, perfherder confuses me with the timestamps on the graph vs the logs on treeherder vs the data points popup.

treeherder is in my local timezone: EDT
perfherder???
graphite in: UTC

in the perfherder graph (mouse click on data points), I see:
13:40 -> 21:40 as the time window of a temporary change. From what I can tell that is EDT-1:00, so I estimate in UTC: May 12 19:40 -> May 13 03:40

oddly, I see a spike in load at that time range on talos-linux-ix-048, which is very confusing given the fact that the spike in load is what we saw as the pattern for the bi-modal data.

looking at the average load across all linux64 instances:
https://graphite-scl3.mozilla.org/render/?width=1287&height=854&_salt=1462986630.126&from=-2days&target=averageSeries(hosts.talos-linux64-ix-*_test_releng_scl3_mozilla_com.load.load.shortterm)

^ you need to be vpn'd in to see that. Not sure why we require vpn, but it is what it is.

here we see the average jump up in the same general time window.

for reference here is a graph of the average load from before until now:
https://graphite-scl3.mozilla.org/render/?width=1287&height=854&_salt=1462986630.126&from=-10days&target=averageSeries(hosts.talos-linux64-ix-*_test_releng_scl3_mozilla_com.load.load.shortterm)
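For anyone who wants the same average-load series for a different time window, here is a small sketch that rebuilds the render URLs above. The host and target string are copied from those links; the helper name is made up, the `_salt` cache-buster is dropped, and `format=json` is the standard Graphite render option for raw data points instead of a PNG. Nothing here makes a network request, and the host is still only reachable over VPN.

```python
from urllib.parse import urlencode

# Host and target are taken from the graph links in this comment.
GRAPHITE_RENDER = "https://graphite-scl3.mozilla.org/render/"
TARGET = (
    "averageSeries("
    "hosts.talos-linux64-ix-*_test_releng_scl3_mozilla_com.load.load.shortterm"
    ")"
)


def load_graph_url(window, fmt=None, width=1287, height=854):
    """Build a render URL for the average short-term load.

    window is a Graphite relative time such as "-2days" or "-10days".
    fmt is optional; "json" asks Graphite for data points instead of an image.
    """
    params = {"width": width, "height": height, "from": window, "target": TARGET}
    if fmt:
        params["format"] = fmt
    # urlencode percent-escapes the parentheses and "*" in the target.
    return GRAPHITE_RENDER + "?" + urlencode(params)


if __name__ == "__main__":
    print(load_graph_url("-2days"))
    print(load_graph_url("-10days", fmt="json"))
```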


if this continues on much longer, I will have to consider disabling talos on linux64 just due to the large volume of random alerts being generated.

William Lachance (:wlach)

Comment 17

•

8 years ago

(In reply to Joel Maher (:jmaher) from comment #16)
> of course our timestamps are all messed up, perfherder confuses me with the
> timestamps on the graph vs the logs on treeherder vs the data points popup.
> 
> treeherder is in my local timezone: EDT
> perfherder???
> graphite in: UTC

Perfherder uses the push timestamp to plot stuff on the graph, using the user's local timezone.
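To line up the three tools, a sketch like the following could convert a local perfherder time or an epoch-millisecond value (like the ones in the `zoom=` part of the graph URLs) to UTC for comparison with graphite. The helper names are made up, the EDT offset of UTC-4 is an assumption about the viewer's timezone, and reading the first two `zoom` values as epoch milliseconds is an inference from their magnitude.

```python
from datetime import datetime, timedelta, timezone


def local_to_utc(naive_local, utc_offset_hours):
    """Attach a fixed UTC offset to a naive local time and convert to UTC."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    return naive_local.replace(tzinfo=local_tz).astimezone(timezone.utc)


def epoch_ms_to_utc(ms):
    """Convert epoch milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Assumption: the viewer is in EDT (UTC-4), as stated for treeherder.
    print(local_to_utc(datetime(2016, 5, 12, 13, 40), -4))
    # First zoom value from the glterrain graph link in comment 2.
    print(epoch_ms_to_utc(1462817376254.6167))
```

Under the EDT assumption, 13:40 local maps to 17:40 UTC, so the offset used for the popup times is worth double checking before comparing against the graphite load graphs.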

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 18

•

8 years ago

a backout in bug 1262760:
https://bugzilla.mozilla.org/show_bug.cgi?id=1262760#c73

this was one of the two changes I was looking at that could be remotely related from an infra config standpoint.

that was backed out and rolled out a few hours ago.  Quite possibly that was the culprit, I am going to do a massive retrigger bomb to collect more data.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 19

•

8 years ago

ok, the backout from bug 1262760 did not solve the problem. I would still like to clear bug 1214487 before getting crazy here.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 20

•

8 years ago

to double check, I did retriggers back on Thursday and Friday. This data should help us prove without a doubt that this is not code related, although I would prefer this to be in-tree code related and to admit I jumped the gun!

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 21

•

8 years ago

oddly this remained a problem, although the bi-modal data stopped from May 14 to May 25. Then around June 8th, it magically stopped. In fact, on June 8th we upgraded the NVIDIA drivers in bug 1273286, and it appears that the bi-modal data has stopped!

As per my work to investigate running talos inside of docker, it seems that adding gl to the mix causes a lot of noise:
https://bugzilla.mozilla.org/show_bug.cgi?id=1269784#c12

The only conclusion I would draw is that our noise might be related to which graphics driver we use and what else might be using the GPU or X session.

As it stands, I am going to resolve this bug as we no longer have bi-modal data and much less noise overall.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED


Categories

(Testing :: Talos, defect)

Tracking

(Not tracked)

People

(Reporter: jmaher, Unassigned)


Security

(public)
