Closed
Bug 966400
Opened 10 years ago
Closed 10 years ago
Gather, aggregate and visualize machine utilization stats
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: taras.mozilla, Unassigned)
References
Details
Attachments
(3 files, 2 obsolete files)
We have no idea what the utilization of our machines is. Let's fix that. I want to see snapshots over time of CPU, memory, disk, and network usage. Ideally this will have some high-level events overlaid (build, hg checkout, cleanup, etc.), but that can wait. The recipe seems to be:
* sar to record data: http://en.wikipedia.org/w/index.php?title=Sar_%28Unix%29&action=edit&section=3
* a way to dump that data into rrdtool: http://www.elekslabs.com/2013/12/rrd-and-rrdtool-sar-graphs-using-pyrrd.html
* a way to visualize the data: http://javascriptrrd.sourceforge.net/
* the data should live in a public S3 bucket somewhere

This will help us make sure we are utilizing our AWS node types efficiently. We should probably think about displaying this data on a per-machine basis and maybe aggregating it across all machines of a particular type. The aggregation should probably be a follow-up.
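The sar half of the recipe can be prototyped before any rrdtool work: sadf (which ships with sysstat alongside sar) can dump sar's binary data files as semicolon-separated records that are easy to post-process. Below is a minimal sketch of parsing that output into Python dicts. The column layout and the sample record are assumptions for illustration (the exact column set varies by sysstat version; check `sadf -d` output locally), and the hostname in the sample is made up.

```python
def parse_sadf_cpu(text):
    """Parse 'sadf -d'-style semicolon-separated CPU records into dicts.

    Assumed layout: hostname;interval;timestamp;CPU;%user;%nice;%system;%iowait;%steal;%idle
    (column order differs across sysstat versions -- verify against real output).
    """
    rows = []
    for line in text.strip().splitlines():
        if line.startswith("#"):  # sadf emits the header as a comment line
            continue
        f = line.split(";")
        rows.append({
            "host": f[0],
            "timestamp": f[2],
            "cpu": f[3],          # -1 means "all CPUs" in sar's convention
            "user": float(f[4]),
            "idle": float(f[9]),
        })
    return rows

# Made-up sample record in the assumed layout:
sample = (
    "# hostname;interval;timestamp;CPU;%user;%nice;%system;%iowait;%steal;%idle\n"
    "bld-linux64-ec2-0001;60;2014-02-01 12:00:01 UTC;-1;42.50;0.00;7.50;1.00;4.00;45.00\n"
)
rows = parse_sadf_cpu(sample)
print(rows[0]["user"], rows[0]["idle"])  # -> 42.5 45.0
```

From here, each parsed row could be pushed into rrdtool (e.g. via pyrrd as the linked post describes) or shipped to whatever aggregator wins this discussion.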
Comment 1•10 years ago
How much are the AWS cloud watch metrics useful here?
Comment 2•10 years ago
We've also got tons of data in graphite: https://graphite.mozilla.org/
Reporter
Comment 3•10 years ago
(In reply to Chris AtLee [:catlee] from comment #2)
> We've also got tons of data in graphite: https://graphite.mozilla.org/

I have no idea if getting this data out of graphite is harder than starting from scratch.
Comment 4•10 years ago
I believe mmayo's org has already solved this problem. I'm willing to wager RelEng could piggyback on their aggregation infrastructure, or at least steal the same technical solution. bobm can shed some light here.

Bug 859573 comment 11 states there were meetings between IT and RelEng about unifying solutions to this problem. This bug sounds like the opportunity to do that. CC :arich.

Graphite has an HTTP API to facilitate data retrieval: http://graphite.readthedocs.org/en/1.0/url-api.html. It's also trivial to create custom dashboards using that API. Mozilla Services' "Pencil" dashboard does this.

Graphite/Carbon can have scaling issues. I encountered and mostly solved them back when I was a server engineer, but I've been out of that game for a while - best to ask bobm or someone in Services/IT land.

I highly recommend collectd for data collection: low probe overhead (it's written in C), many plugins/probes, and very good extensibility.

Heka (http://hekad.readthedocs.org/en/latest/) is worth investigating for data/log transport. Look for an email thread from July 23 to release@ and auto-tools@ titled "Using Heka in automation" where I postulated compelling use cases for using Heka in release automation. I've re-forwarded it to :taras, :catlee, and :rail. The ideas are beyond the scope of this bug, but Heka could be part of a larger solution, e.g. use Heka for transporting system utilization today and add logs, etc. to it later.

mozharness has support for correlating resource usage to steps within the larger job (see bug 859573). But you still need monitoring that outlives a mozharness job or you won't have complete data. That, or you get mozharness reporting to the global collector. Note that most collection tools poll at hard-coded intervals and can't correlate exactly to fine-grained/short events. (Do we care?) As I successfully argued in bug 859573 comment #16, I think the two solutions are complementary. Plus, one of mozharness's goals is to run things locally. Having collection baked into mozharness makes it easier for us regular folk to get resource usage too. </braindump>
Flags: needinfo?(bobm)
Reporter
Comment 5•10 years ago
(In reply to Gregory Szorc [:gps] from comment #4)
> Graphite has an HTTP API to facilitate data retrieval:
> http://graphite.readthedocs.org/en/1.0/url-api.html. It's also trivial to
> create custom dashboards using that API. Mozilla Services' "Pencil"
> dashboard does this.

Cool. Cumulative CPU usage: https://graphite.mozilla.org/render?target=hosts.bld-linux64-ec2-*_build_releng_use1_mozilla_com.cpu.0.cpu.user&format=raw seems to work.

> Graphite/Carbon can have scaling issues. I encountered and mostly solved
> them back when I was a server engineer. But I've been out of that game for a
> while - best to ask bobm or someone in Services/IT land.

I wonder if we can just export rrd files from graphite for batch processing. Dustin or Amy, how easy is it to get at the raw data?

> I highly recommend collectd for data collection. Low probe overhead (written
> in C) + many plugins/probes + very good extensibility.

Sure, seems to work.

> Heka (http://hekad.readthedocs.org/en/latest/) is worth investigating for
> data/log transport. Look for an email thread from July 23 to release@ and
> auto-tools@ titled "Using Heka in automation" where I postulated compelling
> use cases for using Heka in release automation. I've reforwarded this to
> :taras, :catlee, and :rail. The ideas are beyond the scope of this bug, but
> Heka could be part of a larger solution, e.g. use Heka for transporting
> system utilization today, add logs, etc. to it later.

I'm really not sure where you are going with the Heka thing. We use it for telemetry; it doesn't add value here.

> mozharness has support for correlating resource usage to steps within the
> larger job (see bug 859573). But you still need monitoring that outlives a
> mozharness job or you won't have complete data. That, or you get mozharness
> reporting to the global collector. But most collection tools poll at
> hard-coded intervals and can't correlate exactly to fine-grained/short
> events. (Do we care?) As I successfully argued in bug 859573 comment #16, I
> think the two solutions are complementary. Plus, one of mozharness's goals
> is to run things locally. Having collection baked into mozharness makes it
> easier for us regular folk to get resource usage too.

Agreed. Thanks for the pointers.
Reporter
Comment 6•10 years ago
Note our EC2 machines don't have any disk throughput info in graphite; someone should add that. Network should be measured in bytes, not packets.
Reporter
Comment 7•10 years ago
So here is a chart showing that we should probably try to run tests on bld nodes, as they spend a lot of time idle. Tests should be run in spot mode so they can be displaced by a build job. This compares load on the first CPU between bld and test boxes. https://graphite.mozilla.org/render?from=-24hours&until=now&width=1800&height=900&target=sum%28hosts.tst-linux*-ec2-*_test_releng_*_mozilla_com.cpu.0.cpu.user%29&_uniq=0.9621919982401083&&target=sumSeries%28hosts.bld-linux64-ec2-*_build_releng_*_mozilla_com.cpu.0.cpu.user%29
Reporter
Comment 8•10 years ago
Here is a graph showing a bug somewhere. We spin up 20 idle testers at 16:00. Could it be that someone tried to test a build that failed to run any tests? This would also benefit from running tests on compile nodes. This is the URL I used: https://graphite.mozilla.org/render?from=-24hours&until=now&width=1800&height=900&target=sum%28hosts.tst-linux*-ec2-*_test_releng_*_mozilla_com.cpu.*.cpu.user%29&target=sum%28hosts.tst-linux*-ec2-*_test_releng_*_mozilla_com.cpu.*.cpu.idle%29
Reporter
Comment 9•10 years ago
I think we are throwing money away by spinning up idle capacity. Note how when we reach capacity, it is always tailed by ~20% more idle than needed. https://graphite.mozilla.org/render?from=-24hours&until=now&width=1800&height=900&_uniq=0.9621919982401083&&target=sumSeries%28hosts.bld-linux64-ec2-*_build_releng_*_mozilla_com.cpu.0.cpu.user%29&target=sumSeries%28hosts.bld-linux64-ec2-*_build_releng_*_mozilla_com.cpu.0.cpu.idle%29
Reporter
Updated•10 years ago
Attachment #8369310 -
Attachment description: render.png → Adding test capacity too eagerly
Comment 10•10 years ago
(In reply to Taras Glek (:taras) from comment #5)
> (In reply to Gregory Szorc [:gps] from comment #4)
> > Graphite/Carbon can have scaling issues. I encountered and mostly solved
> > them back when I was a server engineer. But I've been out of that game for
> > a while - best to ask bobm or someone in Services/IT land.
>
> I wonder if we can just export rrd files from graphite for batch processing.
> Dustin or Amy, how easy is it to get at the raw data?

Jake is the best person to ask here.
Flags: needinfo?(jwatkins)
Comment 11•10 years ago
(In reply to Taras Glek (:taras) from comment #5)
> I wonder if we can just export rrd files from graphite for batch processing.
> Dustin or Amy, how easy is it to get at the raw data?

Carbon doesn't use rrd files; it stores the collected metrics in whisper db files. If you want to do batch processing, you would need to import the whisper.py libs for reading them out. :Ericz would be able to provide more info since he operates the graphite/carbon infrastructure.

> I highly recommend collectd for data collection. Low probe overhead (written
> in C) + many plugins/probes + very good extensibility.

I also recommend collectd. Aside from the points :gps made, it is already installed on all releng systems, and it is very easy to add an additional output module if we decide to send metrics elsewhere (or in addition to graphite/carbon).
Flags: needinfo?(jwatkins)
Reporter
Comment 12•10 years ago
https://graphite.mozilla.org/render?from=-60days&until=-59days&width=1800&height=900&_uniq=0.9621919982401083&&target=sumSeries%28hosts.bld-linux64-ec2-*_build_releng_*_mozilla_com.cpu.0.cpu.user%29&target=sumSeries%28hosts.bld-linux64-ec2-*_build_releng_*_mozilla_com.cpu.0.cpu.idle%29 Note, this is a less crazy time for our tree. It shows that even outside network-outage periods, our capacity overprovisioning patterns are suboptimal.
Comment 13•10 years ago
As :dividehex said, locally on the graphite servers you can use the whisper libraries in Python or the whisper-fetch command to dump raw data. But as we expand our single server into a cluster, you may need to use the URL API to get the data from multiple servers, so I'd recommend just starting with a raw-format URL like your example: https://graphite-scl3.mozilla.org/render?target=hosts.bld-linux64-ec2-*_build_releng_use1_mozilla_com.cpu.0.cpu.user&format=raw.
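For batch processing against that URL API, the `format=raw` response is easy to parse without any Graphite-side libraries. A minimal sketch, assuming the raw layout is one series per line of the form `target,start,end,step|v1,v2,None,...` (worth verifying against your Graphite version's render docs); the target name and values below are made up:

```python
def parse_graphite_raw(text):
    """Parse Graphite's format=raw render output into per-target series."""
    series = {}
    for line in text.strip().splitlines():
        header, _, values = line.partition("|")
        # The last three header fields are start, end, step; rsplit keeps
        # the target intact even if it ever contained commas.
        target, start, end, step = header.rsplit(",", 3)
        points = [None if v == "None" else float(v) for v in values.split(",")]
        series[target] = {
            "start": int(start),
            "end": int(end),
            "step": int(step),
            "points": points,
        }
    return series

# Made-up single-series sample with one null datapoint:
sample = "hosts.bld-linux64-ec2-0001.cpu.0.cpu.user,1391000000,1391000300,60|1.0,2.0,None,4.0,3.0"
parsed = parse_graphite_raw(sample)
vals = [p for p in parsed["hosts.bld-linux64-ec2-0001.cpu.0.cpu.user"]["points"] if p is not None]
print(sum(vals) / len(vals))  # mean of the non-null samples -> 2.5
```

Null datapoints show up routinely (instances terminating, collection gaps), so filtering them before aggregating matters for the idle-capacity math in this bug.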
Comment 14•10 years ago
Looking at cpu idle is not going to tell you whether a slave is actually idle or not. For instance, when running make check, only 1 cpu will be used, and not even necessarily at 100%. There are many other build operations that will only use 1 CPU. Plus, your graphs are only showing idle time on cpu.0, which is only 1 of them, and processes are not going to be running on cpu.0 all the time. IOW, the graphs are not showing the right data to come to the conclusions you are getting to.
Comment 15•10 years ago
EC2 instances are also very rarely at 100% idle since CPU steal commonly chews up a lot of what would be idle.
Comment 16•10 years ago
I think Eric was simply giving an example. When we look at CPU utilization aggregated over the month, we look at all of the CPUs over all the machines in a given pool. You can use wildcard matching to get the type of data you want. https://graphite-scl3.mozilla.org/render?target=averageSeries(hosts.talos-linux64-ix*.cpu.*.cpu.idle)&from=20140101&until=20140131 Is an example of using wildcards and an average series to look at the CPU utilization of the talos-linux64-ix pool over the entire month of Jan. You can be more or less fine grained if you want different stats. I'd look at the rendering and functions documentation: https://graphite.readthedocs.org/en/latest/render_api.html https://graphite.readthedocs.org/en/latest/functions.html
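Those wildcard/function targets get unwieldy to hand-encode; a tiny helper around the render API keeps the escaping correct. A sketch using the comment's host and target (the helper name is mine, not a Graphite API):

```python
from urllib.parse import urlencode

BASE = "https://graphite-scl3.mozilla.org/render"  # host from the comment above

def render_url(target, frm, until, fmt="raw"):
    """Build a Graphite render-API URL; urlencode handles (), *, and . safely."""
    return BASE + "?" + urlencode({
        "target": target,
        "from": frm,
        "until": until,
        "format": fmt,
    })

# The averageSeries example from this comment, as a properly encoded URL:
url = render_url("averageSeries(hosts.talos-linux64-ix*.cpu.*.cpu.idle)",
                 "20140101", "20140131")
print(url)
```

Swapping `averageSeries` for `sumSeries` or adjusting the wildcard reproduces the other graphs linked in this bug.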
Comment 17•10 years ago
(In reply to Gregory Szorc [:gps] from comment #15)
> EC2 instances are also very rarely at 100% idle since CPU steal commonly
> chews up a lot of what would be idle.

:gps makes a good point here. If we want to be able to observe CPU steal, we should collect the usage as reported by cloudwatch and compare it to what we are collecting from the EC2 instance itself.
Comment 18•10 years ago
(In reply to Jake Watkins [:dividehex] from comment #17)
> :gps makes a good point here. If we want to be able to observe CPU steal,
> we should collect and compare the usage as reported by cloudwatch to what we
> are collecting from the EC2 instance itself.

Why do we need to compare against cloudwatch? I just logged into an Ubuntu EC2 instance, and CPU steal is reported to the instance (run vmstat -s and look for "stolen cpu ticks"). I've been able to measure CPU steal since at least 2010 using collectd. According to the proc man page, steal has been reported in /proc/stat since kernel 2.6.11, and I'm pretty sure collectd reads directly from proc.
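Concretely, steal is the 8th counter on the aggregate `cpu` line of /proc/stat (user nice system idle iowait irq softirq steal ..., per proc(5)). A small sketch of computing the steal share from such a line; the tick values in the sample are invented:

```python
def steal_fraction(stat_line):
    """Return steal ticks as a fraction of all CPU ticks on a /proc/stat cpu line.

    Field order per proc(5): user nice system idle iowait irq softirq steal ...
    The steal field exists on kernels >= 2.6.11.
    """
    fields = stat_line.split()
    if fields[0] != "cpu" or len(fields) < 9:
        raise ValueError("expected an aggregate 'cpu' line with a steal field")
    ticks = [int(f) for f in fields[1:]]
    return ticks[7] / sum(ticks)

# Made-up sample: 1000 user, 200 system, 700 idle, 50 iowait, 50 steal
sample = "cpu 1000 0 200 700 50 0 0 50"
print(round(steal_fraction(sample), 3))  # -> 0.025, i.e. 2.5% stolen
```

On a live box you would read the first line of /proc/stat twice and diff the counters, since they are cumulative since boot; the fraction above is what "idle-looking but stolen" time the graphs in this bug would miss.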
Comment 19•10 years ago
(In reply to Gregory Szorc [:gps] from comment #18)
> Why do we need to compare against cloudwatch?
>
> I just logged into an Ubuntu EC2 instance and CPU steal is reported to the
> instance (vmstat -s and look for "stolen cpu ticks"). I've been able to
> measure CPU steal since at least 2010 using collectd.

I was not aware CPU steal was reported to the instance. Thanks for pointing that out. gtk
Reporter
Comment 20•10 years ago
Here is a fancier graph. I corrected this one to use stacked graphs and to adjust idle/wait (for some reason they are multiplied by 2 in the data). For curious parties, I included all CPUs and all CPU measures (e.g. on a single box the graph adds up to 100). It doesn't make a difference in reality whether I measure 1 CPU or 4; the pattern of suspected overprovisioning still holds. We seem to be overprovisioning. Now that the graphs are stacked, you can see more clearly when machines get killed, etc. https://graphite.mozilla.org/render?from=-60days&until=-59days&width=1800&height=900&target=stacked(scale(sum(exclude(hosts.bld-linux64-ec2-0[0-9]*_build_releng_*_mozilla_com.cpu.*.cpu.*, "(idle|wait)")),0.25))&target=stacked(scale(sum(hosts.bld-linux64-ec2-0[0-9]*_build_releng_*_mozilla_com.cpu.*.cpu.idle), 0.125))&target=stacked(scale(sum(hosts.bld-linux64-ec2-0[0-9]*_build_releng_*_mozilla_com.cpu.*.cpu.wait), 0.125))
Attachment #8369595 -
Attachment is obsolete: true
Reporter
Comment 21•10 years ago
Never mind regarding the doubled measures; attaching a corrected graph.
Attachment #8370216 -
Attachment is obsolete: true
Reporter
Comment 22•10 years ago
(In reply to Taras Glek (:taras) from comment #21)
> Created attachment 8370225 [details]
> build scaling issues from 60 days ago
>
> nm re doubled measures. attaching corrected graph

This one divides by 100 to show the number of machines.
Reporter
Comment 23•10 years ago
Thanks for helping here, spinning off actionable stuff into bug 968381.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Updated•9 years ago
Flags: needinfo?(bobm)
Assignee
Updated•6 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard