Closed Bug 1193707 Opened 6 years ago Closed 6 years ago

temporarily record network statistics from the operating system as counters and only submit to treeherder for linux64 tresize

Categories

(Testing :: Talos, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jmaher, Assigned: jmaher)

References

Details

Attachments

(1 file, 2 obsolete files)

To solve the tresize weekend problem, I think a first step is to monitor network statistics.

psutil has a great net_io_counters function, but psutil is compiled code and nearly impossible to get onto the system, so I just included it and compile it for linux64 - then we can record data on linux64 only!  And let's limit it to tresize.
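
As a rough sketch of the idea (illustrative only, not the exact patch), the per-interface deltas come from two snapshots of the cumulative counters:

import time

import psutil

# net_io_counters(pernic=True) returns cumulative per-interface counters
# since boot, so the difference between two snapshots is the traffic
# generated during the run.
start = psutil.net_io_counters(pernic=True)
time.sleep(60)  # stand-in for the duration of a tresize run
end = psutil.net_io_counters(pernic=True)

for nic in ('lo', 'eth0'):
    for field in ('bytes_sent', 'bytes_recv', 'packets_sent', 'packets_recv'):
        delta = getattr(end[nic], field) - getattr(start[nic], field)
        print '%s: %s: %s' % (nic, field, delta)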

I see these differences during a tresize run on try:
13:53:18     INFO -  lo: bytes_sent: 301539
13:53:18     INFO -  lo: bytes_recv: 301539
13:53:18     INFO -  lo: packets_sent: 720
13:53:18     INFO -  lo: packets_recv: 720
13:53:18     INFO -  eth0: bytes_sent: 73723
13:53:18     INFO -  eth0: bytes_recv: 1513923
13:53:18     INFO -  eth0: packets_sent: 123
13:53:18     INFO -  eth0: packets_recv: 5883

Now to get that into counters, and into something not so ugly.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Wow, including the source of psutil is a bit rude - can't we find another way?

For example, the psutil sources are already in m-c, in python/psutil/. On my linux box, when I activate the virtualenv in $OBJ_DIR/_virtualenv, the psutil module is available and working. We should make that usable for test jobs, if it is not already the case.
Oh, I just found that psutil is also in the internal pypi:

http://pypi.pub.build.mozilla.org/pub/

There are sources for linux/mac, and it seems there is a compiled version for windows. We should try to just require psutil from there if we can't use the psutil compiled from the objdir. I can try that with a simple patch to the mozharness talos.py file.
I thought this at first, but it really isn't there.  If you add psutil to requirements.txt, you get:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jmaher@mozilla.com-e385c5e1c46a/try-win32/try_xp-ix_test-chromez-bm119-tests1-windows-build422.txt.gz

This is a temporary thing which I intend to land this week - that leaves a lot of time to figure out another way.  I do agree that adding the full psutil source is not ideal, but when I filed bug 1192971, you can see there are other related bugs for getting psutil.
Cool, bug 1192971 is a great starting point. :) It should be fine; I think uploading those wheels should not be hard.

But in the end, if it does not work and we intend to use psutil on linux only, I would prefer to land a patch on mozharness to install psutil when testing talos if the platform is linux - and in the talos code, not add psutil as a dependency but do a try/except around the psutil import (if possible).
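
The optional import could look like this minimal sketch (just to illustrate the suggestion, not actual talos code):

try:
    import psutil
except ImportError:
    # psutil stays optional: without it we simply skip the network counters
    psutil = None

def network_counters():
    if psutil is None:
        return None
    return psutil.net_io_counters(pernic=True)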

Well, let's first see if it is possible to resolve bug 1192971 - that would be awesome! I can see a lot of stuff that could be cleaned up if psutil is used!
A fallback would be to read the raw /sys/proc/... values.
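
On linux the usual raw source is /proc/net/dev; a quick hypothetical sketch of parsing it (assuming the standard two header lines and column layout):

def read_proc_net_dev():
    """Return {iface: (bytes_recv, packets_recv, bytes_sent, packets_sent)}."""
    counters = {}
    with open('/proc/net/dev') as f:
        for line in f.readlines()[2:]:  # the first two lines are headers
            iface, data = line.split(':', 1)
            fields = data.split()
            # rx bytes/packets are columns 0/1; tx bytes/packets are 8/9
            counters[iface.strip()] = (int(fields[0]), int(fields[1]),
                                       int(fields[8]), int(fields[9]))
    return counters
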
So, I've been on #releng and asked about bug 1192971. Thanks to :pmoore and others, it was fast and the wheels are here!

So, if pip is >= 1.5, it should prefer wheels and use them when available:

https://packaging.python.org/en/latest/installing.html#installing-cached-wheels
I see that psutil has landed, but I haven't had a lot of luck getting data out of it.  I need to experiment some more.
I guess the question is, are we using a pip > 1.5?  I can't tell in the logs.
(In reply to Joel Maher (:jmaher) from comment #9)
> I guess the question is, are we using a pip > 1.5?  I can't tell in the logs.

Right, that's it - I tried that, and this is a success:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f4dd950feab2

Here is the patch I used:
https://hg.mozilla.org/try/rev/27c2318c59b5

and for talos:
https://bitbucket.org/parkouss/talos/commits/9d3e2cf54970891dfb5b65f6d63e7c2f59fd96a4?at=default

If you look at the windows logs, the .whl file is installed!

So we need to discuss the patch - I changed something in python.py to make this work. It seems good, but we need to ensure that mozsystemmonitor (the one package that uses psutil in the harness) can use psutil >= 3.0. Also, I could add some comments.

If we can't change the python.py file in the harness, we can still hack the talos.py one to fix that.
Depends on: 1194074
Hardcoded for eth0 (production); this still needs better summarization and to upload what I really want (total packets, not the median).
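
For reference, since the OS counters are cumulative, the total for a run is just the last sample minus the first - a hypothetical sketch of that summarization:

def summarize_counter(samples):
    # samples are cumulative readings collected during the run; the total
    # traffic is the difference between the last and the first reading
    if len(samples) < 2:
        return 0
    return samples[-1] - samples[0]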
Attachment #8651074 - Flags: feedback?
Attachment #8651074 - Flags: feedback? → feedback?(j.parkouss)
Attachment #8646877 - Attachment is obsolete: true
Comment on attachment 8651074 [details] [diff] [review]
collect network stats a bit cleaner (0.9)

Review of attachment 8651074 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/cmanager_linux.py
@@ +8,5 @@
>  from cmanager import CounterManager
>  from mozprocess import pid as mozpid
>  
> +try:
> +    import psutil

We should be good with just "import psutil" (since it is in the requirements file anyway). Also, a good practice is to group the simple "import" statements together (and group the "from foo import bar" statements after that).
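
i.e. something like this, using the imports from the hunk above:

import psutil

from cmanager import CounterManager
from mozprocess import pid as mozpid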

@@ +141,5 @@
>              except ValueError:
>                  print "Invalid data, not a float"
>                  raise
>          else:
> +            pass

Hm, I don't think you should change that - at least not in this patch.

@@ +205,5 @@
>      def getCounterValue(self, counterName):
>          """Returns the last value of the counter 'counterName'"""
>          try:
>              self.updatePidList()
> +            if counterName.startswith('Network'):

Hmm, I don't really like this if statement, but I suppose changing it would involve refactoring the counters, so let's do that later.
Attachment #8651074 - Flags: feedback?(j.parkouss) → feedback+
OK, this is updated with the feedback and is working on try.  I also aggregated the counters.
Attachment #8651074 - Attachment is obsolete: true
Attachment #8651245 - Flags: review?(j.parkouss)
Comment on attachment 8651245 [details] [diff] [review]
collect network statistics on talos tresize (1.0)

Sorry for the delay!

LGTM, let's land that to see if the network affects the Talos results.
Attachment #8651245 - Flags: review?(j.parkouss) → review+
Blocks: 1197483
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Now that we have some data (although I backed this out on inbound just now), it looks like eth0 rx packets are growing:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=[mozilla-inbound,47f45692cd483aaf83ff4eb363d65726b2da23aa,1]

tx is pretty much even from Sunday -> Monday; the loopback counters are all the same.

Bytes go up a bit for rx/tx:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=[mozilla-inbound,65acfc92adde7564b8e7eebfaccf83947895695e,1]&series=[mozilla-inbound,47f45692cd483aaf83ff4eb363d65726b2da23aa,1]

I am not sure if the bytes are an issue, but the packet counts are so different that I would like to figure out what the problem is.  If the OS has to parse these packets and cycle through some programs, it would explain why our numbers are so different.

:arr, can you help find someone who could figure out why we see so many more packets (75% increase) on boxes between Sunday and Monday?
Flags: needinfo?(arich)
Oh, we could get a tcpdump capture:
/usr/sbin/tcpdump

I could do this and upload the capture to blobber, maybe keyed off the machine name to keep it consistent.
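
A hypothetical sketch of what that could look like from the harness (the flag choices and paths are assumptions; tcpdump needs root):

import socket
import subprocess

host = socket.gethostname()
capture = '/tmp/%s.pcap' % host  # keyed off the machine name, as suggested

# -p: no promiscuous mode; -G 60 -W 1: write one 60-second file, then exit
subprocess.call(['/usr/sbin/tcpdump', '-p', '-G', '60', '-W', '1',
                 '-w', capture])
# the capture file could then be uploaded to blobber with the other logs
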
:jmaher and I talked, and I suggested using netstat to capture stats on linux and OS X. To examine the actual content, I suggested doing short tcpdump captures (this will require root privs).
Flags: needinfo?(arich)
I'd bet a lot of this is broadcasts - ARP requests, mDNS, etc.  Macs, especially, are ridiculously chatty.

It'd be interesting to capture, say, an hour of data per day of the week on a talos host, without promiscuous mode on, and then dig into protocol counts a little bit.
Oh, I see this is closed, never mind, sorry for the noise.
This is still active - I would like to investigate this more.  I am thinking of running netstat -s and diffing the data from start to finish.
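
A minimal sketch of that diffing idea (assuming netstat is available on the host):

import difflib
import subprocess

def netstat_stats():
    # 'netstat -s' prints cumulative per-protocol statistics
    return subprocess.check_output(['netstat', '-s']).splitlines()

before = netstat_stats()
# ... run the test job here ...
after = netstat_stats()

# a simple line diff shows which counters moved during the run
for line in difflib.unified_diff(before, after, lineterm=''):
    print line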
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Let me get more data on this tomorrow; it might be easy to figure out.
Holy cow, Macs' chattiness is proportional to the square of the number of Macs on a subnet (every Mac's broadcasts are seen, and often answered, by every other Mac).  They basically do their own DNS via broadcast.  I see some ARP and netbios mixed in there too, but it's dwarfed by the mdns.  Inventory says there are about 950 hosts in this subnet (!!).

On a randomly selected scl3 talos host, running tcpdump for 60 seconds with -p (promiscuous mode disabled) and excluding unicast traffic to/from this host shows 1544 packets, or about 26pps.

If I remember from earlier explorations of this (it's amazing what staring at tcpdumps can teach you!), most of this mdns stuff occurs when Macs start, as they try to find their friends.  That's probably also true for Windows hosts, although to a smaller extent.  That would explain the weekend/weekday pattern: fewer jobs means fewer reboots means fewer startups means less chatter.

That said, all of this traffic should be rather quickly disposed of by a Linux host, so I don't expect it to cause any noticeable effects on performance numbers.
Nice find, Dustin!  Is there a way to turn off the mdns discovery on Macs at reboot?  Is it possible that linux is not just dropping the packets, or that the rate is high enough that we end up causing the network driver to do more work than intended?

I also recall some hackery when we brought up osx 10.10, as it couldn't find graphs.mozilla.org at the end of a test run.  Is there something we did with DNS to make it work?

I guess ideally turning things off would be the best outcome - less random noise on our testing network.
Well, DNS != mDNS so let's not dredge up every DNS issue we've ever had :)

I suppose anything's possible, but I don't see what the kernel would do with these packets since it's not running an mDNS (avahi), so it's dropping them once it sees there's no socket open on the mdns port.  30pps isn't much for a NIC.  In other words, I suspect that this is a dead end.  My gut says that there's a common cause with whatever performance changes you're seeing and the decreased weekend traffic, specifically around overall load.  That could be related to number of reboots: the probability of a reboot between jobs is likely a lot higher on weekends (since hosts can be rebooted while idle) than on weekdays.

http://osxdaily.com/2009/09/15/disable-bonjour-by-turning-off-mdnsresponder/ suggests a way to turn off Bonjour.  I don't know if that would have any effect on Firefox, though.
Dustin, thanks for clarifying!

My next plan is to look at overall system cpu/memory, ideally to see if there is a correlation between sets of high numbers/counters and sets of low numbers/counters.

I should record 'uptime' as well, and also try to figure out programmatically what jobs ran prior to the talos job.  Maybe there is some other process running, since we are not rebooting.
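
For the uptime part, a small sketch (psutil >= 2.0 exposes boot_time()):

import time

import psutil

# uptime is just 'now' minus the boot time reported by the OS
uptime_seconds = int(time.time() - psutil.boot_time())
print 'uptime: %d seconds' % uptime_seconds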

Good thinking Dustin!
(In reply to Dustin J. Mitchell [:dustin] from comment #25)
> Well, DNS != mDNS so let's not dredge up every DNS issue we've ever had :)

This is true, but unfortunately apple rolls the unicast dns resolver into the same daemon, so disabling the service altogether breaks ALL THE THINGS.

A little up-to-date background on osx mdns here:

mdnsResponder has been the mdns AND dns resolver in OSX for quite a while.  As of 10.10.0, it was replaced with a complete rewrite called discoveryd.  discoveryd was chock-full of critical bugs and stability issues and seemed to get worse with every minor release.  As of 10.10.4, Apple ripped it out and put good old mdnsResponder back in.  I suspect almost all the dns issues we have seen in yosemite can be attributed to discoveryd at this point, so it's probably a good idea to move all yosemite systems (both servers and testers) to 10.10.5.

Dustin: I would be interested to know if the arp/mdns traffic was coming from all osx versions or from just 10.10.  Would you happen to recall?

> I suppose anything's possible, but I don't see what the kernel would do with
> these packets since it's not running an mDNS (avahi), so it's dropping them
> once it sees there's no socket open on the mdns port.  30pps isn't much for
> a NIC.  In other words, I suspect that this is a dead end.  My gut says that
> there's a common cause with whatever performance changes you're seeing and
> the decreased weekend traffic, specifically around overall load.  That could
> be related to number of reboots: the probability of a reboot between jobs is
> likely a lot higher on weekends (since hosts can be rebooted while idle)
> than on weekdays.

I agree with dustin's gut here. This is probably a dead end as far as impacting linux goes.  But we can try disabling the mdns multicast adverts anyway (without breaking dns).
Flags: needinfo?(dustin)
Ah, *that* is why we didn't disable mDNS then :)

I don't recall hostnames from my look earlier, but here's a sample from around 10 minutes just now:

cut -d- -f 1-3  hosts.txt| sort | uniq -c
     26 install.test.releng.scl3.mozilla.com.mdns
     25 r4-mini-001.test.releng.scl3.mozilla.com.mdns
   6769 t-snow-r4
   1583 t-yosemite-r7

So it looks like snow leopard is the bigger offender, followed by yosemite.

I don't have any issue with disabling the adverts -- I don't think that will substantially affect performance.  Disabling the whole service might (if Firefox is using the OS X resolver and suddenly it skips trying to do mDNS resolution, that might change page-load times).
Flags: needinfo?(dustin)
OK, the mdns might be fixed when we get the r7 (10.10.5) stuff live.  Right now all the talos results for OSX are an expensive random number generator - this is why disabling mDNS might be useful.  It looks like the r7 stuff is in progress; maybe in a week or two we can revisit?  Maybe we can just disable mdnsResponder on 10.10.5 right away? :)
No need to fix this.
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Blocks: 1255582