Closed Bug 918988 Opened 11 years ago Closed 11 years ago

metrics collection for windows

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dividehex, Assigned: dividehex)

References

Details

Attachments

(1 file)

In the effort to collect metric on all releng systems, windows will prove to be the most difficult since collectd is posix only. We have a couple options but the one that looks the most attractive is to write python app using psutil and pywin32. The app will run as a system service and simply poll psutil for a certain set of metrics at a set interval in a loop. Queue them and send them off to the graphite server. From what I can see, we already install pywin32 (build 218) on builders and testers, but psutil is not yet installed. We will need to fix that.
Depends on: 918992
Blocks: 920629
Depends on: 921240
Quick update here: This is mostly written and initial test runs show it works well. It still needs some touch up (eg. docs strings) and the config module need to be implemented. (lots of hardcoded stuff that shouldn't be) When that's done, I'll have dustin review it. Then, pending any changes, I'll hand this off to Q to deploy as disabled until 921783 is resolved.
Attachment #813231 - Flags: review?(dustin)
Comment on attachment 813231 [details] [diff] [review] metcollective.patch I don't know much about building windows services, so just a few general comments. Bugzilla doesn't recognize this as a patch, so I can't reply inline. The .py file should have a license header (presumably MPLv2) There are some under_score methods and some camelCase methods - might as well be consistent. Code after 'raise' never executes, so the sys.exit(-1) is dead code and should be removed. The make_trans call should probably be done once, rather than on every invocation of sanitize_path Some comments in MetricReadCls as to what the methods are for would help. MetricWriteCls_graphite.teardown needs () on the close invocation. It doesn't look like send_metric does any kind of backoff when a send fails. Is that intentional? I can see why it'd be better to drop metrics than deliver them late and potentially pound the heck out of the server with cached metrics. You can probably get quicker shutdowns by using a condition variable with a timeout instead of an Event with sleep's. I don't know if that's a problem, though. In fact, for both MetricsReader and MetricsWriter, that loop is an awkward fit: for reader, _workload may do nothing if the metric is not ready to be read yet. And similarly, for writer, you are passing a timeout to the queue.get method, only so you can return to _workload and sleep before calling queue.get again. Instead, for Reader, perhaps the work_obj could have a method to get the next wakeup time; then just wait until that time using Condition.wait() with a timeout. For Writer, you can just call queue.get with no timeout, but you'll need to find a way to make that call return when terminating the thread. The easiest may be to push None into the queue, and use that as a semaphore for "please exit". That will get you queue flushing automatically, too. This looks like it will work just fine, and I like the overall structure -- the above are just areas for improvement :)
Attachment #813231 - Flags: review?(dustin) → review+
I'm going to R/F this. I've updated the thread wait logic and made some other improvements from c3. git repo is here: https://github.com/dividehex/metric-collective We will track deployment in bug920629
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: