Closed Bug 966400 Opened 10 years ago Closed 10 years ago

Gather, aggregate and visualize machine utilization stats

Categories: Infrastructure & Operations Graveyard :: CIDuty (task)
Platform: x86_64 Windows 8
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
People: Reporter: taras.mozilla; Assignee: Unassigned
Attachments: 3 files, 2 obsolete files

We have no idea what the utilization of our machines is. Let's fix that.
I want to see snapshots over time of CPU, memory, disk, and network.

Ideally we'd overlay some high-level events (build, hg checkout, cleanup, etc.), but that can wait.

The recipe seems to be:
* sar to record data http://en.wikipedia.org/w/index.php?title=Sar_%28Unix%29&action=edit&section=3
* a way to dump that data into rrdtool http://www.elekslabs.com/2013/12/rrd-and-rrdtool-sar-graphs-using-pyrrd.html (a sketch of this step follows below)
* a way to visualize the data http://javascriptrrd.sourceforge.net/
* data should live in a public S3 bucket somewhere
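Here's a rough sketch of the sar -> rrdtool step, assuming sysstat's sadf is available to dump the sar history and shelling out to the rrdtool CLI. The sadf -d column layout varies between sysstat versions, so treat the field indexes as a guess and verify with "sadf -d -- -u | head":

#!/usr/bin/env python3
# Rough sketch, not a finished tool: dump today's sar CPU history via
# sysstat's sadf and feed %user/%idle into an RRD through the rrdtool
# CLI. Assumed `sadf -d -- -u` layout:
#   hostname;interval;timestamp;CPU;%user;%nice;%system;%iowait;%steal;%idle
import calendar
import os
import subprocess
import time

RRD = "cpu.rrd"

def create_rrd(start):
    # 10-minute step (sar's default cron interval), a day of raw
    # samples plus hourly averages kept for a year.
    subprocess.check_call([
        "rrdtool", "create", RRD, "--start", str(start - 1), "--step", "600",
        "DS:user:GAUGE:1200:0:100",
        "DS:idle:GAUGE:1200:0:100",
        "RRA:AVERAGE:0.5:1:144",
        "RRA:AVERAGE:0.5:6:8760",
    ])

def sar_samples():
    out = subprocess.check_output(["sadf", "-d", "--", "-u"]).decode()
    for line in out.splitlines():
        if not line or line.startswith("#"):
            continue
        f = line.split(";")
        # keep only the all-CPU aggregate row ("-1" in the sysstat
        # versions I've seen; some print "all")
        if f[3] not in ("-1", "all"):
            continue
        tstr = f[2].replace(" UTC", "")
        ts = calendar.timegm(time.strptime(tstr, "%Y-%m-%d %H:%M:%S"))
        yield ts, float(f[4]), float(f[9])  # timestamp, %user, %idle

def main():
    samples = sorted(sar_samples())  # rrdtool wants increasing timestamps
    if not samples:
        return
    if not os.path.exists(RRD):
        create_rrd(samples[0][0])
    for ts, user, idle in samples:
        subprocess.check_call(
            ["rrdtool", "update", RRD, "%d:%f:%f" % (ts, user, idle)])

if __name__ == "__main__":
    main()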

This will help us make sure we are utilizing our AWS node types efficiently.

We should probably think about displaying this data on a per-machine basis and maybe aggregating it across all machines of a particular type. The aggregation should probably be a follow-up.
How useful are the AWS CloudWatch metrics here?
We've also got tons of data in graphite:
https://graphite.mozilla.org/
(In reply to Chris AtLee [:catlee] from comment #2)
> We've also got tons of data in graphite:
> https://graphite.mozilla.org/

I have no idea if getting this data out of graphite is harder than starting from scratch.
I believe mmayo's org has already solved this problem. I'm willing to wager RelEng could piggyback on their aggregation infrastructure or at least steal the same technical solution. bobm can shed some light here. 

Bug 859573 comment 11 states there were meetings between IT and RelEng about unifying solutions to this problem. This bug sounds like the opportunity to do that. CC :arich.

Graphite has an HTTP API to facilitate data retrieval: http://graphite.readthedocs.org/en/1.0/url-api.html. It's also trivial to create custom dashboards using that API. Mozilla Services' "Pencil" dashboard does this.
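For example, here's a quick untested sketch of pulling a series out of the render API as JSON (the target is a releng-style metric path; format=json and the [value, unix timestamp] datapoint pairs are from the documented API):

# Sketch only: format=json returns a list of series, each with a
# "target" name and "datapoints" as [value, unix_timestamp] pairs
# (value is null/None for gaps in the data).
import json
from urllib.request import urlopen

URL = ("https://graphite.mozilla.org/render"
       "?target=hosts.bld-linux64-ec2-*_build_releng_use1_mozilla_com"
       ".cpu.0.cpu.user&from=-24hours&format=json")

for series in json.load(urlopen(URL)):
    values = [v for v, t in series["datapoints"] if v is not None]
    if values:
        print("%s: avg %.2f%% user" % (series["target"], sum(values) / len(values)))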

Graphite/Carbon can have scaling issues. I encountered and mostly solved them back when I was a server engineer. But I've been out of that game for a while - best to ask bobm or someone in Services/IT land.

I highly recommend collectd for data collection. Low probe overhead (written in C) + many plugins/probes + very good extensibility.

Heka (http://hekad.readthedocs.org/en/latest/) is worth investigating for data/log transport.

Look for an email thread from July 23 to release@ and auto-tools@ titled "Using Heka in automation" where I postulated compelling use cases for using Heka in release automation. I've re-forwarded this to :taras, :catlee, and :rail. The ideas are beyond the scope of this bug, but Heka could be part of a larger solution, e.g. use Heka for transporting system utilization today and add logs, etc. to it later.

mozharness has support for correlating resource usage to steps within the larger job (see bug 859573). But you still need monitoring that outlives a mozharness job or you won't have complete data. That, or you get mozharness reporting to the global collector. But most collection tools poll at hard-coded intervals and can't correlate exactly to fine-grained/short events. (Do we care?) As I successfully argued in bug 859573 comment #16, I think the two solutions are complementary. Plus, one of mozharness's goals is to run things locally. Having collection baked into mozharness makes it easier for us regular folk to get resource usage too.

</braindump>
Flags: needinfo?(bobm)
(In reply to Gregory Szorc [:gps] from comment #4)
> 
> Graphite has an HTTP API to facilitate data retrieval:
> http://graphite.readthedocs.org/en/1.0/url-api.html. It's also trivial to
> create custom dashboards using that API. Mozilla Services' "Pencil"
> dashboard does this.

Cool. Cumulative CPU usage:
https://graphite.mozilla.org/render?target=hosts.bld-linux64-ec2-*_build_releng_use1_mozilla_com.cpu.0.cpu.user&format=raw seems to work


> 
> Graphite/Carbon can have scaling issues. I encountered and mostly solved
> them back when I was a server engineer. But I've been out of that game for a
> while - best to ask bobm or someone in Services/IT land.

I wonder if we can just export rrd files from graphite for batch processing.

Dustin or Amy, how easy is it to get at the raw data?

> 
> I highly recommend collectd for data collection. Low probe overhead (written
> in C) + many plugins/probes + very good extensibility.

Sure. Seems to work.

> 
> Heka (http://hekad.readthedocs.org/en/latest/) is worth investigating for
> data/log transport.
> 
> Look for an email thread from July 23 to release@ and auto-tools@ titled
> "Using Heka in automation" where I postulated compelling use cases for using
> Heka in release automation. I've reforwarded this to :taras, :catlee, and
> :rail. The ideas are beyond the scope of this bug, but Heka could be part of
> a larger solution, e.g. use Heka for transporting system utilization today,
> add logs, etc to it later.

Really not sure where you're going with the Heka thing. We use it for telemetry; it doesn't add value here.



> 
> mozharness has support for correlating resource usage to steps within the
> larger job (see bug 859573). But you still need monitoring that outlives a
> mozharness job or you won't have complete data. That, or you get mozharness
> reporting to the global collector. But most collection tools poll at
> hard-coded intervals and can't correlate exactly to fine-grained/short
> events. (Do we care?) As I successfully argued in bug 859573 comment #16, I
> think the two solutions are complementary. Plus, one of mozharness's goals
> is to run things locally. Having collection baked into mozharness makes it
> easier for us regular folk to get resource usage too.

Agree. Thanks for the pointers.
Note our EC2 machines don't have any disk throughput info in graphite; someone should add that. Network should be measured in bytes, not packets.
So here is a chart showing that we should probably try to run tests on bld nodes, as they spend a lot of time idle. Tests should be run in spot mode so they can be displaced by a build job.

This compares load on the first CPU between bld and test boxes.
https://graphite.mozilla.org/render?from=-24hours&until=now&width=1800&height=900&target=sum%28hosts.tst-linux*-ec2-*_test_releng_*_mozilla_com.cpu.0.cpu.user%29&_uniq=0.9621919982401083&&target=sumSeries%28hosts.bld-linux64-ec2-*_build_releng_*_mozilla_com.cpu.0.cpu.user%29
Here is a graph showing a bug somewhere. We spin up 20 idle testers at 16:00. Could it be that someone tried to test a build that failed to run any tests? This would also benefit from running tests on compile nodes.

This is the URL I used:
https://graphite.mozilla.org/render?from=-24hours&until=now&width=1800&height=900&target=sum%28hosts.tst-linux*-ec2-*_test_releng_*_mozilla_com.cpu.*.cpu.user%29&target=sum%28hosts.tst-linux*-ec2-*_test_releng_*_mozilla_com.cpu.*.cpu.idle%29
Attachment #8369310 - Attachment description: render.png → Adding test capacity too eagerly
(In reply to Taras Glek (:taras) from comment #5)
> (In reply to Gregory Szorc [:gps] from comment #4)
> > Graphite/Carbon can have scaling issues. I encountered and mostly solved
> > them back when I was a server engineer. But I've been out of that game for a
> > while - best to ask bobm or someone in Services/IT land.
> 
> I wonder if we can just export rrd files from graphite for batch processing.
> 
> Dustin or Amy, how easy is it to get at the raw data?

Jake is the best person to ask here.
Flags: needinfo?(jwatkins)
(In reply to Taras Glek (:taras) from comment #5)
> (In reply to Gregory Szorc [:gps] from comment #4)
> > 
> > Graphite/Carbon can have scaling issues. I encountered and mostly solved
> > them back when I was a server engineer. But I've been out of that game for a
> > while - best to ask bobm or someone in Services/IT land.
> 
> I wonder if we can just export rrd files from graphite for batch processing.
> 
> Dustin or Amy, how easy is it to get at the raw data?

Carbon doesn't use rrd files; instead it stores the collected metrics in whisper db files. If you want to do batch processing, you would need to import the whisper.py libs to read them out. :Ericz would be able to provide more info since he operates the graphite/carbon infrastructure.

> 
> > 
> > I highly recommend collectd for data collection. Low probe overhead (written
> > in C) + many plugins/probes + very good extensibility.
> 
> Sure. Seems to work.
> 

I also recommend collectd. Aside from the points :gps made, it is already installed on all releng systems. It is very easy to add an additional output module if we decide to send metrics somewhere else (or somewhere in addition to graphite/carbon).
Flags: needinfo?(jwatkins)
As :dividehex said, locally on the graphite servers you can use the whisper libraries in Python or the whisper-fetch command to dump raw data. But as we expand our single server into a cluster, you may need to use the URL API to get the data from multiple servers, so I'd recommend just starting with a raw-format URL like your example https://graphite-scl3.mozilla.org/render?target=hosts.bld-linux64-ec2-*_build_releng_use1_mozilla_com.cpu.0.cpu.user&format=raw.
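If you do end up on a graphite box, something like this untested sketch will dump a day of raw points. It assumes the stock whisper python package and the default /opt/graphite/storage/whisper layout; the hostname in the path is made up for illustration:

# Sketch: read a whisper db directly on a graphite server.
import time
import whisper

PATH = ("/opt/graphite/storage/whisper/hosts/"
        "bld-linux64-ec2-0001_build_releng_use1_mozilla_com/"
        "cpu/0/cpu/user.wsp")

# fetch() returns ((start, end, step), values); values has None for gaps
(start, end, step), values = whisper.fetch(PATH, int(time.time()) - 24 * 3600)
for i, v in enumerate(values):
    if v is not None:
        print(start + i * step, v)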
Looking at CPU idle is not going to tell you whether a slave is actually idle. For instance, when running make check, only one CPU will be used, and not even necessarily at 100%. Many other build operations will also only use one CPU. Plus, your graphs only show idle time on cpu.0, which is just one of the CPUs, and processes are not going to be running on cpu.0 all the time. IOW, the graphs are not showing the right data to support the conclusions you are drawing.
EC2 instances are also very rarely at 100% idle since CPU steal commonly chews up a lot of what would be idle.
I think Eric was simply giving an example.  When we look at CPU utilization aggregated over the month, we look at all of the CPUs over all the machines in a given pool.  You can use wildcard matching to get the type of data you want.

https://graphite-scl3.mozilla.org/render?target=averageSeries(hosts.talos-linux64-ix*.cpu.*.cpu.idle)&from=20140101&until=20140131

That is an example of using wildcards and an average series to look at the CPU utilization of the talos-linux64-ix pool over the entire month of Jan. You can be more or less fine-grained if you want different stats. I'd look at the rendering and functions documentation (and the fetch sketch below the links):

https://graphite.readthedocs.org/en/latest/render_api.html
https://graphite.readthedocs.org/en/latest/functions.html
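If you'd rather crunch the numbers yourself, the same query can be fetched as JSON (sketch; host, target, and date range are straight from the example above):

# Sketch: post-process the monthly pool query instead of rendering it.
import json
from urllib.request import urlopen

URL = ("https://graphite-scl3.mozilla.org/render"
       "?target=averageSeries(hosts.talos-linux64-ix*.cpu.*.cpu.idle)"
       "&from=20140101&until=20140131&format=json")

(series,) = json.load(urlopen(URL))  # averageSeries yields one series
idle = [v for v, t in series["datapoints"] if v is not None]
print("talos-linux64-ix, Jan 2014: avg %.1f%% idle, min %.1f%% idle"
      % (sum(idle) / len(idle), min(idle)))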
(In reply to Gregory Szorc [:gps] from comment #15)
> EC2 instances are also very rarely at 100% idle since CPU steal commonly
> chews up a lot of what would be idle.

:gps makes a good point here.  If we want to be able to observe CPU steal, we should collect and compare the usage as reported by cloudwatch to what we are collecting from the EC2 instance itself.
(In reply to Jake Watkins [:dividehex] from comment #17)
> (In reply to Gregory Szorc [:gps] from comment #15)
> > EC2 instances are also very rarely at 100% idle since CPU steal commonly
> > chews up a lot of what would be idle.
> 
> :gps makes a good point here.  If we want to be able to observe CPU steal,
> we should collect and compare the usage as reported by cloudwatch to what we
> are collecting from the EC2 instance itself.

Why do we need to compare against cloudwatch?

I just logged into an Ubuntu EC2 instance and CPU steal is reported to the instance. (vmstat -s and look for "stolen cpu ticks"). I've been able to measure CPU steal since at least 2010 using collectd.

According to the proc man page, steal has been reported in /proc/stat since 2.6.11. I'm pretty sure collectd reads directly from proc.
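A minimal sketch, if anyone wants to sanity-check steal without any extra tooling:

# Sample /proc/stat twice and report the steal share of the interval.
# Field order after the "cpu" label is user nice system idle iowait
# irq softirq steal (per proc(5)); steal is index 7, kernel >= 2.6.11.
import time

def cpu_times():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

before = cpu_times()
time.sleep(5)
after = cpu_times()
delta = [b - a for a, b in zip(before, after)]
print("steal: %.1f%% of the last 5s" % (100.0 * delta[7] / sum(delta)))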
(In reply to Gregory Szorc [:gps] from comment #18)
> (In reply to Jake Watkins [:dividehex] from comment #17)
> > (In reply to Gregory Szorc [:gps] from comment #15)
> > > EC2 instances are also very rarely at 100% idle since CPU steal commonly
> > > chews up a lot of what would be idle.
> > 
> > :gps makes a good point here.  If we want to be able to observe CPU steal,
> > we should collect and compare the usage as reported by cloudwatch to what we
> > are collecting from the EC2 instance itself.
> 
> Why do we need to compare against cloudwatch?
> 
> I just logged into an Ubuntu EC2 instance and CPU steal is reported to the
> instance. (vmstat -s and look for "stolen cpu ticks"). I've been able to
> measure CPU steal since at least 2010 using collectd.
> 
> According to the proc man page, steal has been reported in /proc/stat since
> 2.6.11. I'm pretty sure collectd reads directly from proc.

I was not aware CPU steal was reported to the instance. Thanks for pointing that out. Good to know.
Attached image render.png (obsolete) —
Here is a fancier graph. I corrected this one to use stacked graphs and to adjust idle/wait (for some reason they are multiplied by 2 in the data).

For curious parties, I included all CPUs and all CPU measures (e.g. on a single box the graph adds up to 100). In reality it doesn't make a difference whether I measure 1 CPU or 4.

The pattern of suspected overprovisioning still holds: we seem to be overprovisioned. Now that the graphs are stacked, you can see more clearly when machines get killed, etc.

https://graphite.mozilla.org/render?from=-60days&until=-59days&width=1800&height=900&target=stacked(scale(sum(exclude(hosts.bld-linux64-ec2-0[0-9]*_build_releng_*_mozilla_com.cpu.*.cpu.*, "(idle|wait)")),0.25))&target=stacked(scale(sum(hosts.bld-linux64-ec2-0[0-9]*_build_releng_*_mozilla_com.cpu.*.cpu.idle), 0.125))&target=stacked(scale(sum(hosts.bld-linux64-ec2-0[0-9]*_build_releng_*_mozilla_com.cpu.*.cpu.wait), 0.125))
Attachment #8369595 - Attachment is obsolete: true
nm re doubled measures. attaching corrected graph
Attachment #8370216 - Attachment is obsolete: true
(In reply to Taras Glek (:taras) from comment #21)
> Created attachment 8370225 [details]
> build scaling issues from 60 days ago
> 
> nm re doubled measures. attaching corrected graph

This one divides by 100 to show the number of machines.
Blocks: 968381
Thanks for helping here, spinning off actionable stuff into bug 968381.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Flags: needinfo?(bobm)
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard