Closed Bug 966400 Opened 10 years ago Closed 10 years ago

Gather, aggregate and visualize machine utilization stats

Categories: Infrastructure & Operations Graveyard :: CIDuty (task)
Platform: x86_64 Windows 8
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
People: Reporter: taras.mozilla; Assignee: Unassigned
Attachments: 3 files, 2 obsolete files

We have no idea what the utilization of our machines is. Let's fix that.
I want to see snapshots over time of CPU, memory, disk, and network.

Ideally we'd overlay some high-level events (build, hg checkout, cleanup, etc.), but that can wait.

The recipe seems to be:
* sar to record data http://en.wikipedia.org/w/index.php?title=Sar_%28Unix%29&action=edit&section=3
* a way to dump that data into rrdtool http://www.elekslabs.com/2013/12/rrd-and-rrdtool-sar-graphs-using-pyrrd.html (a sketch of this step follows below)
* a way to visualize the data http://javascriptrrd.sourceforge.net/
* data should live in a public S3 bucket somewhere
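Here's a rough sketch of the sar -> rrdtool step, assuming sysstat's sadf is available to dump the sar history and shelling out to the rrdtool CLI. The sadf -d column layout varies between sysstat versions, so treat the field indexes as a guess and verify with "sadf -d -- -u | head":

#!/usr/bin/env python3
# Rough sketch, not a finished tool: dump today's sar CPU history via
# sysstat's sadf and feed %user/%idle into an RRD through the rrdtool
# CLI. Assumed `sadf -d -- -u` layout:
#   hostname;interval;timestamp;CPU;%user;%nice;%system;%iowait;%steal;%idle
import calendar
import os
import subprocess
import time

RRD = "cpu.rrd"

def create_rrd(start):
    # 10-minute step (sar's default cron interval), a day of raw
    # samples plus hourly averages kept for a year.
    subprocess.check_call([
        "rrdtool", "create", RRD, "--start", str(start - 1), "--step", "600",
        "DS:user:GAUGE:1200:0:100",
        "DS:idle:GAUGE:1200:0:100",
        "RRA:AVERAGE:0.5:1:144",
        "RRA:AVERAGE:0.5:6:8760",
    ])

def sar_samples():
    out = subprocess.check_output(["sadf", "-d", "--", "-u"]).decode()
    for line in out.splitlines():
        if not line or line.startswith("#"):
            continue
        f = line.split(";")
        # keep only the all-CPU aggregate row ("-1" in the sysstat
        # versions I've seen; some print "all")
        if f[3] not in ("-1", "all"):
            continue
        tstr = f[2].replace(" UTC", "")
        ts = calendar.timegm(time.strptime(tstr, "%Y-%m-%d %H:%M:%S"))
        yield ts, float(f[4]), float(f[9])  # timestamp, %user, %idle

def main():
    samples = sorted(sar_samples())  # rrdtool wants increasing timestamps
    if not samples:
        return
    if not os.path.exists(RRD):
        create_rrd(samples[0][0])
    for ts, user, idle in samples:
        subprocess.check_call(
            ["rrdtool", "update", RRD, "%d:%f:%f" % (ts, user, idle)])

if __name__ == "__main__":
    main()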

This will help us make sure we are utilizing our AWS node types efficiently.

We should probably think about displaying this data on a per-machine basis and maybe aggregating it across all machines of a particular type. The aggregation should probably be a follow-up.
How useful are the AWS CloudWatch metrics here?
We've also got tons of data in graphite:
https://graphite.mozilla.org/
(In reply to Chris AtLee [:catlee] from comment #2)
> We've also got tons of data in graphite:
> https://graphite.mozilla.org/

I have no idea if getting this data out of graphite is harder than starting from scratch.
I believe mmayo's org has already solved this problem. I'm willing to wager RelEng could piggyback on their aggregation infrastructure or at least steal the same technical solution. bobm can shed some light here. 

Bug 859573 comment 11 states there were meetings between IT and RelEng about unifying solutions to this problem. This bug sounds like the opportunity to do that. CC :arich.

Graphite has an HTTP API to facilitate data retrieval: http://graphite.readthedocs.org/en/1.0/url-api.html. It's also trivial to create custom dashboards using that API. Mozilla Services' "Pencil" dashboard does this.
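For example, here's a quick untested sketch of pulling a series out of the render API as JSON (the target is a releng-style metric path; format=json and the [value, unix timestamp] datapoint pairs are from the documented API):

# Sketch only: format=json returns a list of series, each with a
# "target" name and "datapoints" as [value, unix_timestamp] pairs
# (value is null/None for gaps in the data).
import json
from urllib.request import urlopen

URL = ("https://graphite.mozilla.org/render"
       "?target=hosts.bld-linux64-ec2-*_build_releng_use1_mozilla_com"
       ".cpu.0.cpu.user&from=-24hours&format=json")

for series in json.load(urlopen(URL)):
    values = [v for v, t in series["datapoints"] if v is not None]
    if values:
        print("%s: avg %.2f%% user" % (series["target"], sum(values) / len(values)))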

Graphite/Carbon can have scaling issues. I encountered and mostly solved them back when I was a server engineer. But I've been out of that game for a while - best to ask bobm or someone in Services/IT land.

I highly recommend collectd for data collection. Low probe overhead (written in C) + many plugins/probes + very good extensibility.

Heka (http://hekad.readthedocs.org/en/latest/) is worth investigating for data/log transport.

Look for an email thread from July 23 to release@ and auto-tools@ titled "Using Heka in automation" where I postulated compelling use cases for using Heka in release automation. I've re-forwarded this to :taras, :catlee, and :rail. The ideas are beyond the scope of this bug, but Heka could be part of a larger solution, e.g. use Heka for transporting system utilization today and add logs, etc. to it later.

mozharness has support for correlating resource usage to steps within the larger job (see bug 859573). But you still need monitoring that outlives a mozharness job or you won't have complete data. That, or you get mozharness reporting to the global collector. But most collection tools poll at hard-coded intervals and can't correlate exactly to fine-grained/short events. (Do we care?) As I successfully argued in bug 859573 comment #16, I think the two solutions are complementary. Plus, one of mozharness's goals is to run things locally. Having collection baked into mozharness makes it easier for us regular folk to get resource usage too.

</braindump>
Flags: needinfo?(bobm)
(In reply to Gregory Szorc [:gps] from comment #4)
> 
> Graphite has an HTTP API to facilitate data retrieval:
> http://graphite.readthedocs.org/en/1.0/url-api.html. It's also trivial to
> create custom dashboards using that API. Mozilla Services' "Pencil"
> dashboard does this.

Cool. Cumulative CPU usage:
https://graphite.mozilla.org/render?target=hosts.bld-linux64-ec2-*_build_releng_use1_mozilla_com.cpu.0.cpu.user&format=raw seems to work


> 
> Graphite/Carbon can have scaling issues. I encountered and mostly solved
> them back when I was a server engineer. But I've been out of that game for a
> while - best to ask bobm or someone in Services/IT land.

I wonder if we can just export rrd files from graphite for batch processing.

Dustin or Amy, how easy is it to get at the raw data?

> 
> I highly recommend collectd for data collection. Low probe overhead (written
> in C) + many plugins/probes + very good extensibility.

Sure. Seems to work.

> 
> Heka (http://hekad.readthedocs.org/en/latest/) is worth investigating for
> data/log transport.
> 
> Look for an email thread from July 23 to release@ and auto-tools@ titled
> "Using Heka in automation" where I postulated compelling use cases for using
> Heka in release automation. I've reforwarded this to :taras, :catlee, and
> :rail. The ideas are beyond the scope of this bug, but Heka could be part of
> a larger solution, e.g. use Heka for transporting system utilization today,
> add logs, etc to it later.

Really not sure where you're going with the Heka thing. We use it for telemetry; it doesn't add value here.



> 
> mozharness has support for correlating resource usage to steps within the
> larger job (see bug 859573). But you still need monitoring that outlives a
> mozharness job or you won't have complete data. That, or you get mozharness
> reporting to the global collector. But most collection tools poll at
> hard-coded intervals and can't correlate exactly to fine-grained/short
> events. (Do we care?) As I successfully argued in bug 859573 comment #16, I
> think the two solutions are complementary. Plus, one of mozharness's goals
> is to run things locally. Having collection baked into mozharness makes it
> easier for us regular folk to get resource usage too.

Agree. Thanks for the pointers.
Note our EC2 machines don't have any disk throughput info in graphite; someone should add that. Network should be measured in bytes, not packets.
So here is a chart showing that we should probably try to run tests on bld nodes, as they spend a lot of time idle. Tests should be run in spot mode so they can be displaced by a build job.

This compares load on the first CPU between bld and test boxes.
https://graphite.mozilla.org/render?from=-24hours&until=now&width=1800&height=900&target=sum%28hosts.tst-linux*-ec2-*_test_releng_*_mozilla_com.cpu.0.cpu.user%29&_uniq=0.9621919982401083&&target=sumSeries%28hosts.bld-linux64-ec2-*_build_releng_*_mozilla_com.cpu.0.cpu.user%29
Here is a graph showing a bug somewhere. We spin up 20 idle testers at 16:00. Could it be that someone tried to test a build that failed to run any tests? This would also benefit from running tests on compile nodes.

This is the URL I used:
https://graphite.mozilla.org/render?from=-24hours&until=now&width=1800&height=900&target=sum%28hosts.tst-linux*-ec2-*_test_releng_*_mozilla_com.cpu.*.cpu.user%29&target=sum%28hosts.tst-linux*-ec2-*_test_releng_*_mozilla_com.cpu.*.cpu.idle%29
Attachment #8369310 - Attachment description: render.png → Adding test capacity too eagerly
(In reply to Taras Glek (:taras) from comment #5)
> (In reply to Gregory Szorc [:gps] from comment #4)
> > Graphite/Carbon can have scaling issues. I encountered and mostly solved
> > them back when I was a server engineer. But I've been out of that game for a
> > while - best to ask bobm or someone in Services/IT land.
> 
> I wonder if we can just export rrd files from graphite for batch processing.
> 
> Dustin or Amy, how easy is it to get at the raw data?

Jake is the best person to ask here.
Flags: needinfo?(jwatkins)
(In reply to Taras Glek (:taras) from comment #5)
> (In reply to Gregory Szorc [:gps] from comment #4)
> > 
> > Graphite/Carbon can have scaling issues. I encountered and mostly solved
> > them back when I was a server engineer. But I've been out of that game for a
> > while - best to ask bobm or someone in Services/IT land.
> 
> I wonder if we can just export rrd files from graphite for batch processing.
> 
> Dustin or Amy, how easy is it to get at the raw data?

Carbon doesn't use rrd files; instead it stores the collected metrics in whisper db files. If you want to do batch processing, you would need to import the whisper.py libs to read them out. :Ericz would be able to provide more info since he operates the graphite/carbon infrastructure.

> 
> > 
> > I highly recommend collectd for data collection. Low probe overhead (written
> > in C) + many plugins/probes + very good extensibility.
> 
> Sure. Seems to work.
> 

I also recommend collectd. Aside from the points :gps made, it is already installed on all releng systems. It is very easy to add an additional output module if we decide to send metrics somewhere else (or somewhere in addition to graphite/carbon).
Flags: needinfo?(jwatkins)
As :dividehex said, locally on the graphite servers you can use the whisper libraries in Python or the whisper-fetch command to dump raw data. But as we expand our single server into a cluster, you may need to use the URL API to get the data from multiple servers, so I'd recommend just starting with a raw-format URL like your example https://graphite-scl3.mozilla.org/render?target=hosts.bld-linux64-ec2-*_build_releng_use1_mozilla_com.cpu.0.cpu.user&format=raw.
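If you do end up on a graphite box, something like this untested sketch will dump a day of raw points. It assumes the stock whisper python package and the default /opt/graphite/storage/whisper layout; the hostname in the path is made up for illustration:

# Sketch: read a whisper db directly on a graphite server.
import time
import whisper

PATH = ("/opt/graphite/storage/whisper/hosts/"
        "bld-linux64-ec2-0001_build_releng_use1_mozilla_com/"
        "cpu/0/cpu/user.wsp")

# fetch() returns ((start, end, step), values); values has None for gaps
(start, end, step), values = whisper.fetch(PATH, int(time.time()) - 24 * 3600)
for i, v in enumerate(values):
    if v is not None:
        print(start + i * step, v)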
Looking at CPU idle is not going to tell you whether a slave is actually idle. For instance, when running make check, only one CPU will be used, and not even necessarily at 100%. Many other build operations will also only use one CPU. Plus, your graphs only show idle time on cpu.0, which is just one of the CPUs, and processes are not going to be running on cpu.0 all the time. IOW, the graphs are not showing the right data to support the conclusions you are drawing.
EC2 instances are also very rarely at 100% idle since CPU steal commonly chews up a lot of what would be idle.
I think Eric was simply giving an example.  When we look at CPU utilization aggregated over the month, we look at all of the CPUs over all the machines in a given pool.  You can use wildcard matching to get the type of data you want.

https://graphite-scl3.mozilla.org/render?target=averageSeries(hosts.talos-linux64-ix*.cpu.*.cpu.idle)&from=20140101&until=20140131

That is an example of using wildcards and an average series to look at the CPU utilization of the talos-linux64-ix pool over the entire month of Jan. You can be more or less fine-grained if you want different stats. I'd look at the rendering and functions documentation (and the fetch sketch below the links):

https://graphite.readthedocs.org/en/latest/render_api.html
https://graphite.readthedocs.org/en/latest/functions.html
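If you'd rather crunch the numbers yourself, the same query can be fetched as JSON (sketch; host, target, and date range are straight from the example above):

# Sketch: post-process the monthly pool query instead of rendering it.
import json
from urllib.request import urlopen

URL = ("https://graphite-scl3.mozilla.org/render"
       "?target=averageSeries(hosts.talos-linux64-ix*.cpu.*.cpu.idle)"
       "&from=20140101&until=20140131&format=json")

(series,) = json.load(urlopen(URL))  # averageSeries yields one series
idle = [v for v, t in series["datapoints"] if v is not None]
print("talos-linux64-ix, Jan 2014: avg %.1f%% idle, min %.1f%% idle"
      % (sum(idle) / len(idle), min(idle)))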
(In reply to Gregory Szorc [:gps] from comment #15)
> EC2 instances are also very rarely at 100% idle since CPU steal commonly
> chews up a lot of what would be idle.

:gps makes a good point here.  If we want to be able to observe CPU steal, we should collect and compare the usage as reported by cloudwatch to what we are collecting from the EC2 instance itself.
(In reply to Jake Watkins [:dividehex] from comment #17)
> (In reply to Gregory Szorc [:gps] from comment #15)
> > EC2 instances are also very rarely at 100% idle since CPU steal commonly
> > chews up a lot of what would be idle.
> 
> :gps makes a good point here.  If we want to be able to observe CPU steal,
> we should collect and compare the usage as reported by cloudwatch to what we
> are collecting from the EC2 instance itself.

Why do we need to compare against cloudwatch?

I just logged into an Ubuntu EC2 instance and CPU steal is reported to the instance. (vmstat -s and look for "stolen cpu ticks"). I've been able to measure CPU steal since at least 2010 using collectd.

According to the proc man page, steal has been reported in /proc/stat since 2.6.11. I'm pretty sure collectd reads directly from proc.
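A minimal sketch, if anyone wants to sanity-check steal without any extra tooling:

# Sample /proc/stat twice and report the steal share of the interval.
# Field order after the "cpu" label is user nice system idle iowait
# irq softirq steal (per proc(5)); steal is index 7, kernel >= 2.6.11.
import time

def cpu_times():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

before = cpu_times()
time.sleep(5)
after = cpu_times()
delta = [b - a for a, b in zip(before, after)]
print("steal: %.1f%% of the last 5s" % (100.0 * delta[7] / sum(delta)))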
(In reply to Gregory Szorc [:gps] from comment #18)
> (In reply to Jake Watkins [:dividehex] from comment #17)
> > (In reply to Gregory Szorc [:gps] from comment #15)
> > > EC2 instances are also very rarely at 100% idle since CPU steal commonly
> > > chews up a lot of what would be idle.
> > 
> > :gps makes a good point here.  If we want to be able to observe CPU steal,
> > we should collect and compare the usage as reported by cloudwatch to what we
> > are collecting from the EC2 instance itself.
> 
> Why do we need to compare against cloudwatch?
> 
> I just logged into an Ubuntu EC2 instance and CPU steal is reported to the
> instance. (vmstat -s and look for "stolen cpu ticks"). I've been able to
> measure CPU steal since at least 2010 using collectd.
> 
> According to the proc man page, steal has been reported in /proc/stat since
> 2.6.11. I'm pretty sure collectd reads directly from proc.

I was not aware CPU steal was reported to the instance. Thanks for pointing that out. Good to know.
Attached image render.png (obsolete) —
Here is a fancier graph. I corrected this one to use stacked graphs and to adjust idle/wait (for some reason they are multiplied by 2 in the data).

For curious parties, I included all CPUs and all CPU measures (e.g. on a single box the graph adds up to 100). In reality it doesn't make a difference whether I measure 1 CPU or 4.

The pattern of suspected overprovisioning still holds: we seem to be overprovisioned. Now that the graphs are stacked, you can see more clearly when machines get killed, etc.

https://graphite.mozilla.org/render?from=-60days&until=-59days&width=1800&height=900&target=stacked(scale(sum(exclude(hosts.bld-linux64-ec2-0[0-9]*_build_releng_*_mozilla_com.cpu.*.cpu.*, "(idle|wait)")),0.25))&target=stacked(scale(sum(hosts.bld-linux64-ec2-0[0-9]*_build_releng_*_mozilla_com.cpu.*.cpu.idle), 0.125))&target=stacked(scale(sum(hosts.bld-linux64-ec2-0[0-9]*_build_releng_*_mozilla_com.cpu.*.cpu.wait), 0.125))
Attachment #8369595 - Attachment is obsolete: true
nm re doubled measures. attaching corrected graph
Attachment #8370216 - Attachment is obsolete: true
(In reply to Taras Glek (:taras) from comment #21)
> Created attachment 8370225 [details]
> build scaling issues from 60 days ago
> 
> nm re doubled measures. attaching corrected graph

This one divides by 100 to show the number of machines.
Blocks: 968381
Thanks for helping here, spinning off actionable stuff into bug 968381.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Flags: needinfo?(bobm)
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard