linux64 talos tests became bimodal on May 9th: tp5o_scroll, tscrollx, glterrain

RESOLVED FIXED

Status

Testing
Talos

People

(Reporter: jmaher, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

I am not sure why this happened, but here is an example of the behavior:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,485c0fb96b42e7472f3cafbfea5ce0c3058b1bd1,1%5D

this affects both e10s and non-e10s, and could be caused by a commit or by infrastructure.
lots of retriggers so far, doing more now- with only a few data points per push the bi-modal pattern is hard to see, but with 10+ data points it shows up clearly.
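
One rough way to check a batch of retriggered values for two clusters is to sort them and look for a single gap that dwarfs the rest. This is only a sketch: the function name, the gap-ratio threshold, and the sample numbers are made up for illustration, and this is not how Perfherder detects anything.

```python
from statistics import mean


def split_modes(values, min_gap_ratio=3.0):
    """Return (low, high) clusters if one gap dominates, else None.

    min_gap_ratio is an arbitrary illustrative threshold, not a Talos setting.
    """
    xs = sorted(values)
    if len(xs) < 4:
        return None  # too few data points to tell noise from two modes
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    i = max(range(len(gaps)), key=gaps.__getitem__)  # index of largest gap
    largest = gaps[i]
    typical = mean(gaps[:i] + gaps[i + 1:])  # average of the remaining gaps
    if largest == 0 or largest < min_gap_ratio * typical:
        return None
    return xs[: i + 1], xs[i + 1:]


# Hypothetical retrigger results for a single push (not real Talos data).
retriggers = [4.12, 4.14, 4.11, 4.38, 4.36, 4.13, 4.37, 4.39, 4.12, 4.35]
print(split_modes(retriggers))      # two clusters, near 4.1 and near 4.4
print(split_modes(retriggers[:3]))  # None: three points are not enough
```

With only three of the ten values the check cannot say anything, which matches why the pattern only became visible after many retriggers.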

here are some graphs (zoomed in), each with the first push that shows bi-modal data:
glterrain: https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,485c0fb96b42e7472f3cafbfea5ce0c3058b1bd1,1%5D&zoom=1462817376254.6167,1462819128433.5552,11.709407662906855,25.14401134275268
* bed3ca8d4a30 

tp5o_scroll: https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,434100a87461dd17a9176d8820820e0a4fc038c7,1%5D&zoom=1462813969177.8667,1462821921391.4238,4.108863962465807,4.391958447063864
* bed3ca8d4a30 

tscrollx: https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,7f213cf0f28370e00d701b7b3e65621d396c5df7,1%5D&zoom=1462815813160.567,1462819782132.223,3.457967760630079,3.8963945374850146
* 277692603a94  (prior push to bed3ca8d4a30)

collecting data on earlier pushes. Originally I had d55e3bc94309 (3 pushes after bed3ca8d4a30) as the first bi-modal push for glterrain, so if more retriggers move the start to any earlier push, I am calling this 100% infrastructure and will need to hunt that down.
this is now officially an infra issue!
hard to know the first instance here, I would say May 9th around 3pm EDT.

according to the reconfig wiki page:
https://wiki.mozilla.org/ReleaseEngineering:Maintenance

there were no buildbot reconfigs over the weekend or on Monday.
it appears the load on the linux machines jumped up at about 2:30 EDT on Monday and has stayed pretty high since. There was a dip for a brief period- oddly enough that is a period where we have no bi-modal data.
Granted jmaher root access to talos-linux64-ix-008 and talos-linux-ix-048 to troubleshoot. When you're done, please let buildduty know so they can delete those two hosts from the VPN list.
Joel notified me in IRC that he is done with these troubleshooting steps on those two machines in c#9

Can you remove the hosts from the Loan list, please?
Flags: needinfo?(vciobancai)
Flags: needinfo?(ihsiao)
Flags: needinfo?(aselagea)
watching two machines for a couple of hours, I see firefox chewing up 100+% of the cpu, but no other odd processes or resource usage. Looking at graphite, the load was low on Monday until 8pm UTC and then stayed quite high, except for a dip on Tuesday (coincidentally, that dip corresponds to the few hours with no bi-modal data).

:arr, can you highlight any changes made over the weekend or on Monday that would affect the linux machines?
Flags: needinfo?(arich)
No hardware or OS changes made to these systems in a long time.

https://hg.mozilla.org/build/puppet/ shows all config changes made by puppet by people in releng. Those all look like BBB or key rotations, so I wouldn't think those would have any impact.

I'm not qualified to say if anything else in https://hg.mozilla.org/build/?sort=lastchange would have been impactful, but that's another place to look.
Flags: needinfo?(arich)
Removed talos-linux64-ix-008 and talos-linux-ix-048 from the Loan list
Flags: needinfo?(vciobancai)
Flags: needinfo?(ihsiao)
Flags: needinfo?(aselagea)
looking at bug 1262760, and a more remote possibility in bug 1214487.
of course our timestamps are all messed up, perfherder confuses me with the timestamps on the graph vs the logs on treeherder vs the data points popup.

treeherder is in my local timezone: EDT
perfherder???
graphite in: UTC

in the perfherder graph (mouse click on data points), I see:
13:40 -> 21:40 as the time window of a temporary change.  From what I can tell that is EDT-1:00, so I estimate May 12 19:40 -> May 13 03:40 in UTC.

oddly, I see a spike in load at that time range on talos-linux-ix-048, which is very confusing given the fact that the spike in load is what we saw as the pattern for the bi-modal data.

looking at the average load across all linux64 instances:
https://graphite-scl3.mozilla.org/render/?width=1287&height=854&_salt=1462986630.126&from=-2days&target=averageSeries(hosts.talos-linux64-ix-*_test_releng_scl3_mozilla_com.load.load.shortterm)

^ you need to be vpn'd in to see that- not sure why we require vpn, but it is what it is.

here we see the average jump up in the same general time window.

for reference here is a graph of the average load from before until now:
https://graphite-scl3.mozilla.org/render/?width=1287&height=854&_salt=1462986630.126&from=-10days&target=averageSeries(hosts.talos-linux64-ix-*_test_releng_scl3_mozilla_com.load.load.shortterm)
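
For comparing load against the bi-modal windows, the same render endpoint can return raw numbers instead of an image by adding `format=json`. A minimal sketch, assuming the VPN-only host above; the helper names, the threshold, and the sample payload are hypothetical and nothing here actually contacts the server:

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

BASE = "https://graphite-scl3.mozilla.org/render/"
TARGET = ("averageSeries(hosts.talos-linux64-ix-*"
          "_test_releng_scl3_mozilla_com.load.load.shortterm)")


def render_url(window, fmt="json"):
    """Build a render URL for a relative window such as '-2days'."""
    query = urlencode({"from": window, "target": TARGET, "format": fmt})
    return BASE + "?" + query


def high_load_points(payload, threshold):
    """Return (timestamp, value) pairs above threshold from a JSON payload.

    Graphite's JSON output is a list of series, each with a "datapoints"
    list of [value, timestamp] pairs; missing samples are null.
    """
    series = json.loads(payload)
    return [(ts, value)
            for s in series
            for value, ts in s["datapoints"]
            if value is not None and value > threshold]


# Hypothetical payload in the render API's JSON shape (not real load data).
sample = ('[{"target": "t", "datapoints": '
          '[[0.4, 1462800000], [null, 1462800060], [1.7, 1462800120]]}]')

print(render_url("-10days"))
print(high_load_points(sample, threshold=1.0))
```

The timestamps in the payload are epoch seconds in UTC, so they can be lined up against push times once those are converted to UTC as well.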


if this continues on much longer, I will have to consider disabling talos on linux64 just due to the large volume of random alerts being generated.
(In reply to Joel Maher (:jmaher) from comment #16)
> of course our timestamps are all messed up, perfherder confuses me with the
> timestamps on the graph vs the logs on treeherder vs the data points popup.
> 
> treeherder is in my local timezone: EDT
> perfherder???
> graphite in: UTC

Perfherder uses the push timestamp to plot stuff on the graph, using the user's local timezone.
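
To line up Perfherder, treeherder and graphite, it helps to convert everything to UTC. A small sketch, assuming the `zoom=` values in the graph links above are epoch milliseconds (they look like it, but that is an assumption) and using a fixed UTC-4 offset for EDT rather than a real timezone database:

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for US Eastern daylight time (valid for May/June).
EDT = timezone(timedelta(hours=-4), "EDT")
FMT = "%Y-%m-%d %H:%M:%S %Z"


def describe(epoch_ms):
    """Render an epoch-millisecond timestamp in both UTC and EDT."""
    utc = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return utc.strftime(FMT), utc.astimezone(EDT).strftime(FMT)


# Left edge of the zoomed glterrain graph, truncated to whole milliseconds.
print(describe(1462817376254))
# → ('2016-05-09 18:09:36 UTC', '2016-05-09 14:09:36 EDT')
```

That left edge falls in the early afternoon EDT on May 9th, in the same general window as the load jump seen in graphite.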
a backout landed in bug 1262760:
https://bugzilla.mozilla.org/show_bug.cgi?id=1262760#c73

this was one of the two changes I was looking at that could be remotely related from an infra config standpoint.

that was backed out and rolled out a few hours ago.  Quite possibly that was the culprit, I am going to do a massive retrigger bomb to collect more data.
ok the backout from bug 1262760 did not solve the problem, I would still like to clear bug 1214487 before getting crazy here.
to double check, I did retriggers back on Friday and Thursday- this data should help us prove without a doubt this is not code related- I would prefer this to be in-tree code related and admit I jumped the gun!
oddly this was still a problem for a while, although the bi-modal data stopped from May 14 to May 25.  Then around June 8th, it magically stopped.  In fact on June 8th we upgraded the NVIDIA drivers in bug 1273286, and it appears that the bi-modal data stopped with that upgrade!

As per my work to investigate running talos inside of docker, it seems that adding gl to the mix causes a lot of noise:
https://bugzilla.mozilla.org/show_bug.cgi?id=1269784#c12

The only conclusion I would draw is that our noise might be related to which graphics driver we use and what else might be using the GPU or X session.

As it stands, I am going to resolve this bug as we no longer have bi-modal data and much less noise overall.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED