Closed Bug 884307 Opened 11 years ago Closed 8 years ago

Expose releng slave nagios history via JSON

Categories

(Infrastructure & Operations :: MOC: Projects, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: coop, Assigned: ryanc)

Details

(Whiteboard: [slavehealth])

Attachments

(1 file, 6 obsolete files)

I'd like to be able to add the nagios history for individual slaves on the new slave health pages. If nagios were able to output that data via JSON, it would be instantly compatible with the other data sources I'm using to construct those pages.

Nagios has a bunch of plugins that might do the job, but I'm not familiar with any of them:

http://exchange.nagios.org/directory/Addons/APIs/JSON
Assignee: server-ops-releng → server-ops
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → shyam
Ashish, can we look at some options to do this for the releng instance?
Assignee: server-ops → ashish
I'm leaning towards mklivestatus on this one, which provides for SQL-like queries to Nagios:

http://mathias-kettner.de/checkmk_livestatus.html

We already have this installed on all instances and it can output JSON if required. However, livestatus isn't available over HTTP; if that is needed, I'll dup this against Bug 788514.
Do we still need to work on this?
(In reply to Ashish Vijayaram [:ashish] from comment #2) 
> We already have this installed on all instances and it can output to JSON if
> required. livestatus isn't available over HTTP however but there if that is
> needed, I'll dup this against Bug 788514.

HTTP(S) was the important part here. I'm trying to include the nagios information into a web tool, so I was hoping for a web-accessible API to just dump machine history from nagios as JSON.

Maybe I don't understand how livestatus works though? Can you give me an example of the output and how to access it?
(In reply to Chris Cooper [:coop] from comment #4)
> (In reply to Ashish Vijayaram [:ashish] from comment #2) 
> > We already have this installed on all instances and it can output to JSON if
> > required. livestatus isn't available over HTTP however but there if that is
> > needed, I'll dup this against Bug 788514.
> 
> HTTP(S) was the important part here. I'm trying to include the nagios
> information into a web tool, so I was hoping for a web-accessible API to
> just dump machine history from nagios as JSON.
> 
> Maybe I don't understand how livestatus works though? Can you give me an
> example of the output and how to access it?

livestatus is a broker module for Nagios and you can query it with a SQL-like language. For example:

[ashish@nagios1.private.phx1 ~]$ cat query 
GET hostgroups
Columns: alias num_hosts
OutputFormat: json

[ashish@nagios1.private.phx1 ~]$ cat query | sudo unixcat /var/log/nagios/rw/live | head -n 3 
[["3ware Servers",0],
["Admin server for amo",1],
["AMO Celery Nodes",1],

For now, livestatus is query-able locally via the pipe (/var/log/nagios/rw/live) interface but HTTP can be turned on for whitelisted internal URLs.
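For a script rather than a shell pipeline, the same local query can be sketched in Python. This is a minimal sketch, not the code used on the Nagios hosts: `build_query` and `query_livestatus` are hypothetical helper names, and the default socket path is simply the one shown in the example above.

```python
import json
import socket


def build_query(table, columns, filters=(), limit=None):
    """Assemble an LQL query string that asks for JSON output."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {expr}" for expr in filters]
    if limit is not None:
        lines.append(f"Limit: {limit}")
    lines.append("OutputFormat: json")
    # A blank line terminates the query.
    return "\n".join(lines) + "\n\n"


def query_livestatus(query, socket_path="/var/log/nagios/rw/live"):
    """Send one query to the livestatus UNIX socket and decode the JSON reply.

    The socket path is an assumption taken from the shell example; it
    differs between installations.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that no more input follows
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


if __name__ == "__main__":
    # Prints the same query text as the "cat query" example.
    print(build_query("hostgroups", ["alias", "num_hosts"]))
```

Calling `query_livestatus(build_query("hostgroups", ["alias", "num_hosts"]))` on the Nagios host would be expected to return the same list of lists as the `unixcat` pipeline, given read access to the socket.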
Flags: needinfo?(coop)
(In reply to Ashish Vijayaram [:ashish] from comment #5)
> For now, livestatus is query-able locally via the pipe
> (/var/log/nagios/rw/live) interface but HTTP can be turned on for
> whitelisted internal URLs.

OK, this part sounds promising.

For ref, I'm looking to pull in the last 10 nagios events to the following page, with a link to the full nagios history:

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=tegra&name=tegra-050

That's not technically an internal URL, but we could change that, say by setting up https://slavehealth.pvt.build.mozilla.org. 

Sheriffs rely on this tool too (including philor who's non-MoCo), so it would be useful to have the public URL available, but the public version would simply degrade to displaying what's currently in the nagios section of slave_health, i.e. nothing.

How locked down does our URL whitelist need to be? Could we make it accessible from the entire build network? Since the JSON request is going to come from the client, I'm not sure how else this will work.

Can we get the HTTP interface set up, please? I'll worry about filing the relevant follow-up bugs once I can get access to the data and play with it.
Flags: needinfo?(coop)
We are working on a security review of the TCP and/or the HTTP(S) part. Stay tuned.
(In reply to Michal Purzynski [:michal`] (use NEEDINFO) from comment #7)
> We are working on a security review of the TCP and/or the HTTP(S) part. Stay
> tuned.

What was the output?
Component: Server Operations → MOC: Projects
Flags: needinfo?(mpurzynski)
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → lypulong
I've just spoken with coop and he helped me understand this project. Looks like it is still valid and it would be useful to have it.

The plugins would run on the server side and expose the aggregated data as JSON.

Having it accessible from entire build network seems like the right thing to do, no problem with that.
Flags: needinfo?(mpurzynski)
OK, will the livestatus hand-off work?


You can query this info remotely using livestatus. I have also attached an example.


nc nagios1.private.scl3.mozilla.com 6557 < get_service2.query
[["check_https_cert_only!60",["sysalerts"],["irc","pagerduty-funnel","sysadmin-oncall"],["servicenow-checks","nagios-servers","hp-servers","generic"],["sysalerts"],["irc","pagerduty-funnel","sysadmin-oncall"]]]
Attached file bug884307.txt (obsolete)
The date is in epoch. date +%s for current time.

dgarvey@dgarvey-mozilla:~/bug1184750$ date -d @1439590000
Fri Aug 14 15:06:40 PDT 2015
dgarvey@dgarvey-mozilla:~/bug1184750$ date -d @1439600036
Fri Aug 14 17:53:56 PDT 2015
dgarvey@dgarvey-mozilla:~/bug1184750$ cat state_history 
GET statehist
Columns: host_name service_description state duration duration_part
Filter: time >= 1439590000
Filter: time < 1439600036
dgarvey@dgarvey-mozilla:~/bug1184750$
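The epoch arithmetic behind the two `Filter: time` lines can also be done in a script. The sketch below is illustrative only: `epoch_to_utc` and `window_filters` are hypothetical helper names, and it works in UTC rather than the PDT shown by `date -d`.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Convert a Unix epoch timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def window_filters(start, end):
    """Return LQL Filter lines selecting start <= time < end.

    Both arguments are timezone-aware datetimes.
    """
    return [
        f"Filter: time >= {int(start.timestamp())}",
        f"Filter: time < {int(end.timestamp())}",
    ]


start = epoch_to_utc(1439590000)
end = epoch_to_utc(1439600036)

print(start.isoformat())  # 2015-08-14T22:06:40+00:00 (15:06:40 PDT)
for line in window_filters(start, end):
    print(line)
```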
Attached file bug884307-json.txt (obsolete)
Here is the json example.

$ cat state_history
GET statehist
Columns: host_name service_description state duration duration_part
Filter: time >= 1439590000
Filter: time < 1439600036
OutputFormat: json
Attached file livestatus_json (obsolete)
Here is a small script to make this query:
GET statehist
Columns: host_name service_description state duration duration_part
Filter: time >= 1439590000
Filter: time < 1439600036
OutputFormat: json


Please be sure to change the path to the live socket.
socket_path = "/omd/sites/stage/tmp/run/live"
Assignee: ashish → dgarvey
coop, ping
Flags: needinfo?(coop)
(In reply to Ashish Vijayaram [:ashish] from comment #5)
> For now, livestatus is query-able locally via the pipe
> (/var/log/nagios/rw/live) interface but HTTP can be turned on for
> whitelisted internal URLs.

dgarvey: what did you want me to look at here? 

What I'm looking for is a web-accessible API endpoint for nagios data, i.e. this particular machine has these current issues and/or history of events. Does livestatus_json provide that at a URL I can access?
Flags: needinfo?(coop)
Attached image Screenshot from 2015-09-29 16:33:04.png (obsolete)
No, it just has a socket to query the data from; it doesn't present the data in HTML. I am no web dev but I was able to whip up a flask presentation pretty quickly.
Attached file bug884307_flask.txt (obsolete)
The flask stuff.
coop,

I am using flask to provide the web frontend and because I am not a webdev I have limited it to one query.

This query gets the history; is that what you need?
Assignee: dgarvey → rchilds
Status: NEW → ASSIGNED
Really sorry about the delay with this.

I think just outputting the json to a subfolder in Apache makes more sense than using Flask for this,

http://nagios1.private.scl3.mozilla.com/livestatus/

Just need to iron out the query for this. Currently using,

> GET log
> Columns: time host_name service_description state
> Filter: time >= 1455091200
> Filter: time < 1455137494
> OutputFormat: json

But it's not giving the desired output. Any ideas Ashish?
Flags: needinfo?(ashish)
Alright,

Getting there with the query.

http://nagios1.private.releng.scl3.mozilla.com/livestatus/

Coop,

Sorry again for the delay, but want to make sure this is still relevant considering the age of this bug.
Flags: needinfo?(ashish) → needinfo?(coop)
(In reply to Ryan C [:ryanc] from comment #20)
> Alright,
> 
> Getting there with the query.
> 
> http://nagios1.private.releng.scl3.mozilla.com/livestatus/
> 
> Coop,
> 
> Sorry again for the delay, but want to make sure this is still relevant
> considering the age of this bug.

Hey Ryan,

Thanks for picking this up. Yes, this bug is still relevant. It's always good to be able to pull in more data about the machines when something is broken.

To that end, more data in the output is always appreciated. Would it be possible to add a header row to the file to make ongoing interpretation easier if/when the format changes?

Also, would it be possible to provide a key/value match-up for fields like "state" (possibly in the same output dir) to aid in interpretation?
Flags: needinfo?(coop)
Chris,

Still a WIP, but updated with header,

http://nagios1.private.releng.scl3.mozilla.com/livestatus/

Unfortunately key/value with LQL is not a thing,

"In order to avoid redundancy and keep the overhead as low as possible, the output is not formatted as a list of objects (with key/value pairs), but as a list of lists (JSON speaks of arrays)."

https://mathias-kettner.de/checkmk_livestatus.html
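Since livestatus only returns a list of lists, the key/value view can be rebuilt on the consuming side once a header row is present. The following is a minimal sketch under stated assumptions: the sample rows and host names are made up for illustration, the function names are hypothetical, and an empty `service_description` is assumed to mark a host-level row. The numeric codes are the standard Nagios ones.

```python
# Standard Nagios state codes.
SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}


def rows_to_dicts(table):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def state_name(record):
    """Translate the numeric state into a label.

    Assumption: rows with an empty service_description describe the host
    itself, so the host state table applies to them.
    """
    if record.get("service_description"):
        return SERVICE_STATES.get(record["state"], "UNKNOWN")
    return HOST_STATES.get(record["state"], "UNKNOWN")


# Illustrative sample only; not real output from the releng instance.
table = [
    ["time", "host_name", "service_description", "state"],
    [1456012766, "example-host-1", "PING", 2],
    [1456012746, "example-host-2", "", 1],
]

for record in rows_to_dicts(table):
    print(record["host_name"], state_name(record))
```

Keeping the mapping on the client avoids changing the server output, at the cost of the client having to know which table each row came from.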
Here we go,

[rchilds@nagios1.private.releng.scl3 ~]$ cat query | sudo unixcat /var/log/nagios/rw/live
[["time","host_name","plugin_output"],
[1456012766,"t-w732-ix-229.wintest.releng.scl3.mozilla.com","PING CRITICAL - Packet loss = 100%"],
[1456012766,"bld-lion-r5-090.build.releng.scl3.mozilla.com","CHECK_NRPE: Socket timeout after 20 seconds."],
[1456012766,"t-w732-ix-092.wintest.releng.scl3.mozilla.com","PING CRITICAL - Packet loss = 100%"],
[1456012746,"t-w732-ix-195.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 1.75 ms"],
[1456012746,"t-xp32-ix-095.wintest.releng.scl3.mozilla.com","PING CRITICAL - Packet loss = 100%"],
[1456012746,"panda-0387.p4.releng.scl3.mozilla.com","DOWN: in state failed_self_test"],
[1456012746,"t-w732-ix-206.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 1.01 ms"],
[1456012746,"t-w732-ix-092.wintest.releng.scl3.mozilla.com","PING CRITICAL - Packet loss = 100%"],
[1456012746,"t-w732-ix-187.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 0.86 ms"],
[1456012706,"t-w732-ix-008.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 0.90 ms"]]


Next I'm gonna work on getting this into Puppet and refreshing this every 10 seconds. Still not sure if I believe the time stamps, though.
Chris,

Just to confirm. Does the output from comment 23 work for you?
Flags: needinfo?(coop)
(In reply to Ryan C [:ryanc] from comment #24)
> Just to confirm. Does the output from comment 23 work for you?

Sorry for the delay. I was on PTO for 2 weeks in there.

The output looks good, and I should be able to integrate it into our machine health dashboard easily.

A few questions:
1) Will only machines with outstanding issues appear in the list?
2) What does the timestamp represent? Last check? First occurrence?

Thanks for this.
Flags: needinfo?(coop)
(In reply to Chris Cooper [:coop] from comment #25)
> (In reply to Ryan C [:ryanc] from comment #24)
> > Just to confirm. Does the output from comment 23 work for you?
> 
> Sorry for the delay. I was on PTO for 2 weeks in there.
> 
> The output looks good, and I should be able to integrate it into our machine
> health dashboard easily.
> 
> A few questions:
> 1) Will only machines with outstanding issues appear in the list?
> 2) What does the timestamp represent? Last check? First occurrence?
> 
> Thanks for this.

Chris,

I pushed this out in r115592. It's working, but needs a little bit more to get it just right. As of now it refreshes every minute and displays the last 30 alerts. I'm not quite sure about the timestamp, it definitely doesn't look legit. Do you need a timestamp? I'll defer to Ashish for the rest of this.

Ashish,

The cron is running, but I can't get Puppet to take the proper cron entry without enclosing the LQL query in double quotes.

> command => '/usr/bin/printf "GET log \nColumns: time host_name plugin_output \nOutputFormat: json \nLimit: 30 \nFilter: host_name >= "" \n" | /usr/bin/unixcat /var/log/nagios/rw/live > /var/www/html/livestatus/index.html',

But it should be,

> command => '/usr/bin/printf 'GET log \nColumns: time host_name plugin_output \nOutputFormat: json \nLimit: 30 \nFilter: host_name >= "" \n' | /usr/bin/unixcat /var/log/nagios/rw/live > /var/www/html/livestatus/index.html',

So it won't cancel out,

> host_name >= ""

Any idea?
Flags: needinfo?(ashish)
Sure, fixed that with better quoting:

> -            command => '/usr/bin/printf "GET log \nColumns: time host_name plugin_output \nOutputFormat: json \nLimit: 30 \nFilter: host_name >= "" \n" | /usr/bin/unixcat /var/log/nagios/rw/live > /var/www/html/livestatus/index.html',
> +            command => '/usr/bin/printf \'GET log \nColumns: time host_name plugin_output \nOutputFormat: json \nLimit: 30 \nFilter: host_name >= "" \n\' | /usr/bin/unixcat /var/log/nagios/rw/live > /var/www/html/livestatus/index.html',

I do have other queries about r115592, let's take it on IRC.
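The effect of the two quoting styles on the `""` in the filter can be checked without touching cron. The sketch below uses Python's `shlex`, which follows POSIX shell quoting rules closely enough for this purpose; the command strings are shortened, hypothetical versions of the real cron entry that keep only the Filter line.

```python
import shlex

# Shortened stand-ins for the two cron commands (only the Filter line kept).
double_quoted = r'''printf "GET log \nFilter: host_name >= "" \n"'''
single_quoted = r"""printf 'GET log \nFilter: host_name >= "" \n'"""

# With outer double quotes, the inner "" closes and reopens the string,
# so the two quote characters never reach printf.
broken_arg = shlex.split(double_quoted)[1]

# With outer single quotes, the inner "" is passed through literally.
fixed_arg = shlex.split(single_quoted)[1]

print(repr(broken_arg))
print(repr(fixed_arg))
```

In the double-quoted form the filter reaches livestatus as `host_name >=` with nothing after it, which matches the "cancel out" behaviour described above.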
Flags: needinfo?(ashish)
The more I look at this, the less I think it's doing what I need.

What I'd really like is the alert history for each machine in json format.

i.e. when I visit (notice the json param):

http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/history.cgi?host=t-xp32-ix-103&json=1

...I'd get a json dump of the alert history for t-xp32-ix-103, instead of the html output. 

I'm trying to determine at a glance whether a given machine has a history of problems and might need service/decomm, and a long list of alerts is a good data point to add to our machine health dashboard.

Sorry to move the goalposts. :/

This may all be immaterial though. Right now the releng machine health dashboard can't pull the nagios data without a CORS change, or some changes on the releng code side.

If you want to WONTFIX this, I'd be fine with that.
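A per-host history request of this kind would presumably map onto an LQL query with a `host_name` filter. The sketch below is one possible shape, not an existing endpoint: `host_history_query` is a hypothetical name, the column list is taken from queries used earlier in this bug, and the host name is validated first because it would come from a URL parameter and LQL is line-based.

```python
import re

# Conservative host name pattern: letters, digits, dots and hyphens only.
HOSTNAME_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?")


def host_history_query(host, limit=30):
    """Build an LQL query for the most recent log entries of one host.

    Rejects anything that is not a plain host name so that a crafted
    parameter cannot inject extra LQL header lines.
    """
    if not HOSTNAME_RE.fullmatch(host):
        raise ValueError(f"invalid host name: {host!r}")
    return (
        "GET log\n"
        "Columns: time host_name service_description state plugin_output\n"
        f"Filter: host_name = {host}\n"
        f"Limit: {int(limit)}\n"
        "OutputFormat: json\n\n"
    )


print(host_history_query("t-xp32-ix-103.wintest.releng.scl3.mozilla.com"))
```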
(In reply to Chris Cooper [:coop] from comment #28)
> This may all be immaterial though. Right now the releng machine health
> dashboard can't pull the nagios data without a CORS change, or some changes
> on the releng code side.

Chris,

If the releng side prevents this entirely, let me know and we'll WONTFIX this. But I'd be willing to pursue this to get it as you like.

Ashish,

I'm assuming one of those things is to make this a sub-class of nagios (e.g. nagios::releng::livestatus), but we'll chat tomorrow.
Flags: needinfo?(coop)
(In reply to Ryan C [:ryanc] from comment #29)
> If the releng side prevents this entirely, let me know and we'll WONTFIX
> this. But I'd be willing to pursue this to get it as you like.

I think it just requires a small change to wrap the json output in a jsonp callback, e.g.

jsonp_callback([[1458241618,"t-w732-ix-055.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 0.81 ms"], [1458241608,"t-w732-ix-208.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 1.18 ms"]])

Since the livestatus output is static, if you could provide a separate endpoint, e.g. livestatus_jsonp, that returned the livestatus data wrapped in a named callback, I should be able to digest it on the releng side. The same would apply for any single slave history json output.
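The wrapping step itself is small. A minimal sketch, assuming the livestatus rows have already been loaded as a Python list; `wrap_jsonp` is a hypothetical helper, and the callback name is checked against a plain identifier pattern so that a caller-supplied name cannot smuggle in script.

```python
import json
import re

# Accept only simple JavaScript identifiers as callback names.
CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def wrap_jsonp(payload, callback="jsonp_callback"):
    """Serialize payload as JSON and wrap it in a named JSONP callback."""
    if not CALLBACK_RE.fullmatch(callback):
        raise ValueError(f"invalid callback name: {callback!r}")
    return f"{callback}({json.dumps(payload)});"


rows = [
    [1458241618, "t-w732-ix-055.wintest.releng.scl3.mozilla.com",
     "PING OK - Packet loss = 0%, RTA = 0.81 ms"],
    [1458241608, "t-w732-ix-208.wintest.releng.scl3.mozilla.com",
     "PING OK - Packet loss = 0%, RTA = 1.18 ms"],
]

print(wrap_jsonp(rows))
```

A response like this would normally be served with a JavaScript content type rather than `application/json`.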
Flags: needinfo?(coop)
Chris,

I'm going to redo this in Python, since I just discovered a module that interfaces with livestatus. This will be a lot easier.
Attached file releng_json1.py (obsolete)
Update,

Getting the initial workings set up. Now to get urlparse and the rest set up.
Attachment #8648322 - Attachment is obsolete: true
Attachment #8648323 - Attachment is obsolete: true
Attachment #8648589 - Attachment is obsolete: true
Attachment #8667616 - Attachment is obsolete: true
Attachment #8667618 - Attachment is obsolete: true
Attached file releng_json2.py
Hey Chris,

Sorry for the extremely delayed update, I've been occupied with other projects and this got pushed to the back.

I've attached what I have, but I don't think this is a solid or long-term solution, considering how hacky it is.

[rchilds@nagios1.private.releng.scl3 ~]$ curl http://127.0.0.1:5000/livestatus?h=t-xp32-ix-103.wintest.releng.scl3.mozilla.com
[["time","host_name","service_description","state","duration"],
[1471091187,"t-xp32-ix-103.wintest.releng.scl3.mozilla.com","",1,29],
[1471091187,"t-xp32-ix-103.wintest.releng.scl3.mozilla.com","PING",2,29],
[1471091187,"t-xp32-ix-103.wintest.releng.scl3.mozilla.com","PING",2,0],
[1471094758,"t-xp32-ix-103.wintest.releng.scl3.mozilla.com","",0,3571],
[1471094758,"t-xp32-ix-103.wintest.releng.scl3.mozilla.com","PING",0,3571]]
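For the "at a glance" use case, rows like these could be reduced to time spent per state on the dashboard side. This is a rough sketch, not part of releng_json2.py: `seconds_per_state` is a hypothetical name, `duration` is assumed to be in seconds, and rows with an empty `service_description` are assumed to describe the host itself.

```python
from collections import defaultdict


def seconds_per_state(table):
    """Sum the duration column per (service, state) pair.

    table is the list of lists returned by the endpoint, header row first.
    """
    header, *rows = table
    totals = defaultdict(int)
    for row in rows:
        record = dict(zip(header, row))
        # Assumption: an empty service_description marks a host-level row.
        service = record["service_description"] or "(host)"
        totals[(service, record["state"])] += record["duration"]
    return dict(totals)


host = "t-xp32-ix-103.wintest.releng.scl3.mozilla.com"
table = [
    ["time", "host_name", "service_description", "state", "duration"],
    [1471091187, host, "", 1, 29],
    [1471091187, host, "PING", 2, 29],
    [1471091187, host, "PING", 2, 0],
    [1471094758, host, "", 0, 3571],
    [1471094758, host, "PING", 0, 3571],
]

for (service, state), seconds in sorted(seconds_per_state(table).items()):
    print(service, state, seconds)
```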


We're looking into new monitoring tools, such as Prometheus, to possibly replace Nagios in some aspects - I feel this would be 1000% better for this task.
Attachment #8734863 - Attachment is obsolete: true
OK, let's not waste any more time on this.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX