Closed Bug 1201230 Opened 9 years ago Closed 8 years ago

investigate infrastructure change which caused the linux talos tresize test to regress by 10% on July 30th

Categories: Testing :: Talos, defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: Reporter: jmaher; Assignee: Unassigned
Attachments: 4 files

In bug 1190877 we had a linux regression that looked real.  Then we came back a couple of days later, retriggered, and found that all of the data was showing the regression.  Going back a few revisions earlier, we see that the regression shows up on retriggers there as well, so it is unrelated to any change in the tree.

Here is a link to the graph:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=[mozilla-inbound,72f4651f24362c87efb15d5f4113b9ca194d8e3f,1]&series=[mozilla-aurora,86708a260eef1d74b07e38d19095ffae06a3d262,1]&series=[mozilla-beta,86708a260eef1d74b07e38d19095ffae06a3d262,1]

The problem here is that the test is so noisy that we get alerts every Monday (developers complain every Monday when they get an email saying they caused a regression).  Essentially this test should be turned off until we figure this out.

This is seen on linux64 and linux32; in addition, it was seen on mozilla-central, mozilla-aurora and mozilla-beta.  That is a good indicator that an infrastructure change took place which caused this regression; the question is, what was it?

Looking at the puppet logs, idleizer is the thing that changed the most.  That doesn't seem like the type of change which would cause performance regressions.  Here is a link to the puppet logs:
http://hg.mozilla.org/build/puppet/graph

There are buildbot changes as well; I am not sure how to fully determine what those changes were or how they were deployed.

In bug 1193707 I got the idea that we could debug this based on network traffic.  Using psutil to grab a snapshot of the network counters at the start and at the end of the test, I took a diff to see the total traffic (a minimal sketch of this follows the list below).  This covered 8 counters:
lo - rx bytes
lo - tx bytes
lo - rx packets
lo - tx packets
eth0 - rx bytes
eth0 - tx bytes
eth0 - rx packets
eth0 - tx packets
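
A rough sketch of that measurement, assuming psutil is available on the test slave; the field names follow psutil's snetio counters, and the talos invocation itself is elided:

import psutil

FIELDS = ("bytes_recv", "bytes_sent", "packets_recv", "packets_sent")

def snapshot(nics=("lo", "eth0")):
    # Per-NIC counters since boot; keep only the interfaces we care about.
    counters = psutil.net_io_counters(pernic=True)
    return {nic: {f: getattr(counters[nic], f) for f in FIELDS}
            for nic in nics if nic in counters}

before = snapshot()
# ... run the talos test here ...
after = snapshot()

for nic in after:
    for field in FIELDS:
        print(nic, field, after[nic][field] - before[nic][field])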

Starting in comment 16 (https://bugzilla.mozilla.org/show_bug.cgi?id=1193707#c16) you can see the differences.  Dustin did some extra debugging and pinpointed that the big thing on the network is m-dns/arp packets.  Those do not seem like a likely culprit; probably a red herring.

Some other ideas to look at (a collection sketch follows this list):
* uptime on the machines
* are we somehow mixing pgo/non-pgo
* machine specific differences
* total cpu on the machines
* total memory on the machines
* disk IO on the machines
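
A hedged sketch for collecting those per-machine data points with psutil, intended to be logged at the start of each talos run (where to send the data is left open):

import time
import psutil

def host_snapshot():
    # Coarse host-level metrics: uptime, CPU, memory, and cumulative disk IO.
    disk = psutil.disk_io_counters()
    mem = psutil.virtual_memory()
    return {
        "uptime_sec": time.time() - psutil.boot_time(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "mem_used_bytes": mem.used,
        "disk_read_bytes": disk.read_bytes,
        "disk_write_bytes": disk.write_bytes,
    }

print(host_snapshot())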

Another trend to explain is that on weekends (when we have fewer overall builds, jobs, and tests) the numbers for linux* tresize are not as noisy; they fall into a much smaller range.  It could be that we just don't generate enough data on weekends, but week after week we follow exactly the same pattern.  Is it possible that whatever is happening on the infrastructure is related to this?  Why are we not seeing this shift in noise on the weekends?
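
One way to quantify that weekday/weekend pattern, as a sketch only: assuming the tresize results are exported from the Perfherder graph as (ISO timestamp, value) rows in a CSV file (the export format here is an assumption), group them by day of week and compare the spread:

import csv
import statistics
from datetime import datetime

def noise_by_day(path):
    # Bucket result values by weekday and report count and standard deviation.
    # Assumes at least two results per weekday in the export.
    by_day = {}
    with open(path) as f:
        for ts, value in csv.reader(f):
            day = datetime.fromisoformat(ts).strftime("%A")
            by_day.setdefault(day, []).append(float(value))
    for day, values in sorted(by_day.items()):
        print(day, len(values), round(statistics.stdev(values), 2))
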
(In reply to Joel Maher (:jmaher) from comment #0)
> The problem here is that the test is so noisy that we get alerts every
> Monday (developers complain every Monday when they get an email saying
> they caused a regression).  Essentially this test should be turned off
> until we figure this out.
...
> Another trend to explain is that on weekends (when we have fewer overall
> builds, jobs, and tests) the numbers for linux* tresize are not as noisy;
> they fall into a much smaller range.  It could be that we just don't
> generate enough data on weekends, but week after week we follow exactly
> the same pattern.  Is it possible that whatever is happening on the
> infrastructure is related to this?  Why are we not seeing this shift in
> noise on the weekends?

Was this test noisy on Mondays before July 30?  I'm not sure from your description or from the graph.

There are two directions to pursue here:

 One, what caused the big change around Thu Jul 30, 6:43:02 (the push of 5e130ad70aa7, the now-exonerated commit initially blamed for the regression)?

 Two, what is the hidden variable shared by this performance metric and the day of the week?

One is obviously the root problem to solve, but Two might be the leverage we need to figure it out.
We were not getting false alerts every week previously; this started on July 30th, and it has happened every Monday like clockwork since :)  That is useful in one way: it reminds me to look into this issue.

This test doesn't do much of anything other than resize the browser.  Many other tests show differences between weekends and weekdays; this is one that became worse at a single point in time.  There might be more to investigate in the browser, prefs, etc.
Regarding One, you've had a look at the puppet changes.  http://hg.mozilla.org/build/puppet/rev/a3b1ae943ba4 upgraded buildbot, and although buildbot itself is not tagged, the difference was
  http://hg.mozilla.org/build/buildbot/rev/ade48470874e
The effect of this code was to *allow* idleizer to power down (halt) hosts instead of rebooting them, but this was protected by IDLEIZER_HALT_ON_IDLE, which (in the puppet commit) was only set on Windows (in a batch file, so I'm pretty sure it didn't leak onto Linux).  To my knowledge, we haven't had an epidemic of hosts halting, so I don't think this is causing the issue.  Not to mention, it landed 10 hours after the suspected commit.
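
For illustration only (this is not the actual code in ade48470874e, just a sketch of the gating described above):

import os

def idle_action():
    # The buildbot change allowed a halt instead of a reboot when a host goes
    # idle, but only when IDLEIZER_HALT_ON_IDLE is set in the environment --
    # which the puppet commit only did on Windows.
    if os.environ.get("IDLEIZER_HALT_ON_IDLE"):
        return "halt"
    return "reboot"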

It looks like the most recent reconfig before that time was
  http://hg.mozilla.org/build/buildbot-configs/rev/29fb3e98c237
with no corresponding changes to buildbotcustom.  So it's hard to see anything there that could be related.

Looked at a different way, https://wiki.mozilla.org/ReleaseEngineering/Maintenance shows a change later on the 30th:

buildbot-configs

    bug 1184117 - Make Android 4.0 opt-in by default on Try - r=Callek (e7176ef993ef)
    bug 1189273 - Drop twice-daily nightlies for v2.2/v2.2r and stop running emulator nightly builds on the B2G release branches - r=catlee (fbd3abad6d6b)

buildbotcustom

    bug 1187966 - Set the env for periodic file update jobs so we can use TOOLTOOL_CACHE = - r=nthomas (dc0535892866)

mozharness

    bug 1184594 - git 'repo' is not deployed correctly on build machines managed by buildbot - r=hwine (57dde819edcb)
    bug 1188648 - Use HOST_LOG env var to log gecko output when running Gij on _all_ gecko builds - r=jgriffin (79fffa3a18ae)
    bug 1188648 - Always log gecko output during Gaia JS Integration Tests - r=jgriffin (d72df953784c)
    bug 1188698 - Do not fail when succeed count is 0 (672af8a8da3e)
    No bug - Backed out changeset 79fffa3a18ae -- Unfortunately try run reveals this isn't working as expected (faf4013a7fc6)
    No bug - Backed out changeset 57dde819edcb (3ae81f4458ef)

The mozharness changes shouldn't be related because (I think?) Talos runs from the in-tree mozharness.

I don't know of any other changes that could affect the hosts themselves.  We only have so many tools that will touch all talos hosts -- puppet, mig, and buildbot, basically.  So there's some subtle bug hidden in one of those changes.

Morgan: do you have historical information on number of reboots or uptime or anything like that for this period?

Julien: did anything change with mig around this time that might have altered its resource usage?
Flags: needinfo?(winter2718)
Flags: needinfo?(jvehent)
Nothing changed on MIG's side.  We had a single RabbitMQ outage on August 9th for a few hours, and nothing else since.  That shouldn't have impacted talos systems anyway, because mig-agent runs via taskrunner on the talos machines, not as a persistent process.
Flags: needinfo?(jvehent)
Thanks, Julien.  I had forgotten mig only runs on startup on these hosts, so it's definitely off the hook.

One more needinfo -- kim, was there switch work going on around that time?  There's an outside chance that some change to switch configuration has caused unicast traffic to broadcast, or something like that.

As for Two, I think the current hypotheses are

 - network traffic levels differ based on overall activity level
 - hosts may be more likely to be "fresh" on the weekends due to idle reboots

Are there others?

Hopefully Morgan has data to address the second hypothesis.
Flags: needinfo?(kmoir)
Switch work occurred in June. See bug 1161314
Flags: needinfo?(kmoir)
(In reply to Dustin J. Mitchell [:dustin] from comment #3)
> Morgan: do you have historical information on number of reboots or uptime or
> anything like that for this period?

Sorry for the delay here; runner on non-puppet Windows isn't reporting to influxdb.  I believe we can get this info from graphite.  I can take a looksie.
Flags: needinfo?(winter2718)
I would like to experiment with turning off m-dns on the osx machines; we have a LOT of noise in talos on osx 10.10, and if we had it off for all the machines we could see whether the noise is reduced.

:dustin, could you help us figure that out?
Flags: needinfo?(dustin)
Blocks: 1191019
I presume you just want to test this on a small scale first? 
http://krypted.com/tag/com-apple-mdnsresponder-plist/ is good reading on bonjour (Apple's version of mdns).
Maybe we could test on a loaner, assuming the loaner is in the same vlan/network as the other machines.

A few thoughts-
* if the loaner is in a "private" network, then we could rule out network influence over a variety of runs
* if this is not a service we need or use for our tests, then we should consider disabling it across the board.
Loaners are on the same network as the others -- and moving to a "quiet" network is going to be very difficult (since that network wouldn't have the required network flows).  I'd be game for disabling it across the board.  I don't think there's any part of Firefox that relies on it (autodiscovery of other browsers?  sounds creepy).

I'm not necessarily the best person to write that patch, though -- maybe Jake or Amy?  I'm kind of on a mad dash to get my ducks in a row before a work-week next week.
Flags: needinfo?(dustin)
Good advice, Dustin.  As this isn't high priority, I think we can schedule this over the next week or two and find the right time to do it, ideally after the uplift and related releases next week (or ASAP this week!!)

Jake, is this something you could comment on when it would make sense and if you could pick it up?
Flags: needinfo?(jwatkins)
Yes.  Although we can't just disable the mdns service completely, we could disable mdns multicast advertisements across the board.  I'll work up a patch.

For more mdns insight, see my comment here: https://bugzilla.mozilla.org/show_bug.cgi?id=1193707#c27
Flags: needinfo?(jwatkins)
I think this is likely a red herring, and it worries me that we're attempting to disable core services (that are enabled on user machines) when doing testing.  Joel, do you have something that actually points to the issue being mdns, as opposed to just seeing traffic on the wire (which is completely normal)?
I think it's fair to try disabling services that aren't required for tests to run and that have the potential to introduce noise into the results.
(In reply to Chris AtLee [:catlee] from comment #15)
> I think it's fair to try and disable services that aren't required for tests
> to run, and have the potential to introduce noise into the results.

While possibly questionable in the long run, it's definitely important in order to identify the cause for this. And so far I don't think we know what the cause is.

If it turns out that it's this service, then we can have an actual discussion whether to remove it or not. But right now we should consider it a debugging/bisecting step.
This might help linux; it would be an experiment.  I really got motivated to do it because we realized that all OSX talos results are useless.  We have actually disabled reporting of osx 10.10 results in our compare view, because every push shows a bunch of osx 10.10 regressions/improvements, even for two pushes of the same revision!

Fixing osx 10.10 is more of a priority than the linux infrastructure; I would call it a lucky win if both are fixed by the same thing.
We're in the process of replacing all of the 10.10 infrastructure with new hardware running a newer version of the OS, which went back to the old way of doing mdns (Apple used a new method for 10.10.0 and switched back in 10.10.4).  If the goal here is to fix OS X, we're probably better off spending effort on getting tests on 10.10.5 functional and seeing if that magically fixes the linux problem when the last 10.10.2 machine goes away.  I think it's unlikely that mdns is the cause, but there's always some chance.

If the goal is to experiment with mdns on the OS X 10.10.2 machines (which have been in production since Q1, long before the July regression) to see if it impacts linux, we would need to roll out this change to all 10.10.2 hosts.  Broadcast packets from a large number of hosts are what would be creating traffic, so disabling one or two will make no difference.  There's also some danger of adversely impacting general OS X dns resolution, since it's all the same bundled service (this is why we can't just disable the service as a whole).
I am happy to wait for the r7 10.10.5 machines to come online before trying anything out.  We should use this as an opportunity to disable extra services which are not needed on 10.10.5 before deploying.

Let's not mess with 10.10.2.  My understanding is that we are 4-6 weeks out on the 10.10.5 machines?
:dividehex found bug 1144206, which points out that we've already disabled bonjour on 10.10.2 back in March.
Depends on: 1144206
Attached file stop-rebooting.diff
Looks like the machines are rebooting after jobs again; this patch will fix that.
Attachment #8666165 - Flags: review?(catlee)
Attachment #8666165 - Flags: review?(catlee) → review+
Since bonjour multicast had been disabled on the 10.10.2 test slaves, and the expectation was for it to be disabled on all 10.10.x test slaves, this patch does just that.  Apple made a major change in a minor release by bringing mDNSResponder back in 10.10.4.  This patch also cleans up an erroneous file created by the previous code.
Attachment #8667024 - Flags: review?(dustin)
Comment on attachment 8667024 [details] [diff] [review]
bug1201230-fix_bonjour_diable_arg.patch

Review of attachment 8667024 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/tweaks/manifests/disable_bonjour.pp
@@ +11,5 @@
> +            /10\.10\.[0-3]/: {
> +                $plist = 'com.apple.discoveryd'
> +                $disable_arg = '--no-multicast'
> +            }
> +            # OSX 10.10.4+ Apple saw the error of there ways and brought back mDNSResponder

their
Attachment #8667024 - Flags: review?(dustin) → review+
Comment on attachment 8667024 [details] [diff] [review]
bug1201230-fix_bonjour_diable_arg.patch

Committed with minor edits

--- /home/jwatkins/bug1201230-fix_bonjour_diable_arg.patch      2015-09-28 17:14:55.561608232 -0700
+++ /home/jwatkins/bug1201230-fix_bonjour_diable_arg-2.patch    2015-09-29 10:03:15.840736542 -0700
@@ -19,7 +19,7 @@
 -                        unless => "/usr/bin/defaults read /System/Library/LaunchDaemons/com.apple.discoveryd.plist | egrep 'no-multicast'"
 +            # This file is a byproduct of the previous code which disabled multicast mdns discovery
 +            # it has been updated and moved to modules/tweaks/manifests/disable_bonjour.pp
-+            if $::macosx_productversion == /10\.10\.[4,]/ {
++            if $::macosx_productversion =~ /10\.10\.[4-9]/ {
 +                file { "/System/Library/LaunchDaemons/com.apple.discoveryd.plist":
 +                    ensure => absent;
                  }
@@ -68,7 +68,7 @@
 +                $plist = 'com.apple.discoveryd'
 +                $disable_arg = '--no-multicast'
 +            }
-+            # OSX 10.10.4+ Apple saw the error of there ways and brought back mDNSResponder
++            # OSX 10.10.4+ Apple saw the error of their ways and brought back mDNSResponder
 +            default: {
 +                $plist = 'com.apple.mDNSResponder'
 +                $disable_arg = '-NoMulticastAdvertisements'


remote:   https://hg.mozilla.org/build/puppet/rev/2834295e51c4
remote:   https://hg.mozilla.org/build/puppet/rev/3982e61ffd1e
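
For reference, a small sketch (not part of the committed change) of how one could verify on a given OS X host which daemon/flag the patch above applies, based on the version split in the diff:

import re
import subprocess

def expected_mdns_disable():
    # 10.10.0-10.10.3 use discoveryd; 10.10.4+ brought back mDNSResponder.
    ver = subprocess.check_output(["sw_vers", "-productVersion"]).decode().strip()
    if re.match(r"10\.10\.[0-3]$", ver):
        return ("com.apple.discoveryd", "--no-multicast")
    return ("com.apple.mDNSResponder", "-NoMulticastAdvertisements")

print(expected_mdns_disable())
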
This rolls back the idleizer patch that broke the no-reboot behavior.  The reason is that the new features added to idleizer depended on environment variables being set.  We deploy idleizer by calling it from within the buildbot.tac file that is downloaded before running buildbot; unfortunately, setting environment variables there seems to require modifying the buildbot.tac template, which is no trivial matter.  It will be simpler to just re-implement the patch using a config file.
Attachment #8669824 - Flags: review?(dustin)
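
To illustrate what "setting environment variables here" would have meant (a sketch only; the value and placement are assumptions, and this is the approach the comment above decided against), buildbot.tac is an ordinary Python file loaded by twistd, so the flag could in principle be exported near its top before idleizer is started:

import os

# Hypothetical: export the idleizer flag before the tac file starts buildbot,
# so the new idleizer code can see it.  Doing this properly would require
# changing the buildbot.tac template, which is why the patch was rolled back.
os.environ.setdefault("IDLEIZER_HALT_ON_IDLE", "false")
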
Comment on attachment 8669824 [details] [diff] [review]
idleizer-rollback.diff

Assuming this is just the reverse of
  http://hg.mozilla.org/build/buildbot/rev/ade48470874e
Attachment #8669824 - Flags: review?(dustin) → review+
Just looking for an ETA on when this will land, so I can keep an eye on the results.
Attached file bump-buildbot-version
Tested on talos-linux64-ix-001, this deploys the rolled back idleizer code.
Attachment #8671014 - Flags: review?(dustin)
Comment on attachment 8671014 [details]
bump-buildbot-version

Be sure to ping :markco to let him know.
Attachment #8671014 - Flags: review?(dustin) → review+
Did this ever get landed? Do we have a plan for how to move forward?
We were unable to find a root cause here.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Blocks: 1255582