
Comment 2

•

12 years ago

I'm leaning towards mklivestatus on this one, which provides SQL-like queries against Nagios: http://mathias-kettner.de/checkmk_livestatus.html We already have this installed on all instances and it can output JSON if required. livestatus isn't available over HTTP, however; if that is needed, I'll dup this against Bug 788514.

Comment 3

•

12 years ago

Do we still need to work on this?

Reporter

Comment 4

•

12 years ago

(In reply to Ashish Vijayaram [:ashish] from comment #2)
> We already have this installed on all instances and it can output to JSON if
> required. livestatus isn't available over HTTP however but there if that is
> needed, I'll dup this against Bug 788514.

HTTP(S) was the important part here. I'm trying to include the nagios information into a web tool, so I was hoping for a web-accessible API to just dump machine history from nagios as JSON.

Maybe I don't understand how livestatus works though? Can you give me an example of the output and how to access it?

Comment 5

•

12 years ago

(In reply to Chris Cooper [:coop] from comment #4)
> (In reply to Ashish Vijayaram [:ashish] from comment #2)
> > We already have this installed on all instances and it can output to JSON if
> > required. livestatus isn't available over HTTP however but there if that is
> > needed, I'll dup this against Bug 788514.
>
> HTTP(S) was the important part here. I'm trying to include the nagios
> information into a web tool, so I was hoping for a web-accessible API to
> just dump machine history from nagios as JSON.
>
> Maybe I don't understand how livestatus works though? Can you give me an
> example of the output and how to access it?

livestatus is a broker module for Nagios and you can query it with a SQL-like language. For example:

[ashish@nagios1.private.phx1 ~]$ cat query
GET hostgroups
Columns: alias num_hosts
OutputFormat: json
[ashish@nagios1.private.phx1 ~]$ cat query | sudo unixcat /var/log/nagios/rw/live | head -n 3
[["3ware Servers",0],
["Admin server for amo",1],
["AMO Celery Nodes",1],

For now, livestatus is query-able locally via the pipe (/var/log/nagios/rw/live) interface but HTTP can be turned on for whitelisted internal URLs.
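For readers without unixcat at hand, the same request can be sketched in Python. This is only an illustration of the exchange described above, not anything attached to this bug: the helper names build_query and query_unix_socket are made up, and the socket path is the one quoted in the comment, which will differ on other installs.

```python
import json
import socket

# Path quoted in the comment above; other installs will use a different one.
LIVE_SOCKET = "/var/log/nagios/rw/live"


def build_query(table, columns, extra_headers=()):
    """Assemble an LQL request: one header per line, ended by a blank line."""
    lines = ["GET " + table, "Columns: " + " ".join(columns)]
    lines.extend(extra_headers)
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"


def query_unix_socket(query, socket_path=LIVE_SOCKET):
    """Send one query over the local socket and return the decoded JSON rows."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # tell livestatus the request is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read means the server closed the connection
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))
```

On the Nagios host itself, query_unix_socket(build_query("hostgroups", ["alias", "num_hosts"])) should return the same list of [alias, num_hosts] pairs as the unixcat pipeline, provided the calling user can read the socket.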

Shyam Mani [:fox2mike]

Updated

•

12 years ago

Flags: needinfo?(coop)

Michal Purzynski [:michal`] (use NEEDINFO)

Reporter

Comment 6

•

12 years ago

(In reply to Ashish Vijayaram [:ashish] from comment #5)
> For now, livestatus is query-able locally via the pipe
> (/var/log/nagios/rw/live) interface but HTTP can be turned on for
> whitelisted internal URLs.

OK, this part sounds promising. For ref, I'm looking to pull in the last 10 nagios events to the following page, with a link to the full nagios history:

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=tegra&name=tegra-050

That's not technically an internal URL, but we could change that, say by setting up https://slavehealth.pvt.build.mozilla.org. Sheriffs rely on this tool too (including philor who's non-MoCo), so it would be useful to have the public URL available, but the public version would simply degrade to displaying what's currently in the nagios section of slave_health, i.e. nothing.

How locked down does our URL whitelist need to be? Could we make it accessible from the entire build network? Since the JSON request is going to come from the client, I'm not sure how else this will work.

Can we get the HTTP interface set up, please? I'll worry about filing the relevant follow-up bugs once I can get access to the data and play with it.

Flags: needinfo?(coop)

Comment 7

•

12 years ago

We are working on a security review of the TCP and/or the HTTP(S) part. Stay tuned.

Ludovic Hirlimann [:Usul]

Comment 8

•

11 years ago

(In reply to Michal Purzynski [:michal`] (use NEEDINFO) from comment #7)
> We are working on a security review of the TCP and/or the HTTP(S) part. Stay
> tuned.

What was the outcome?

Component: Server Operations → MOC: Projects

Flags: needinfo?(mpurzynski)

Product: mozilla.org → Infrastructure & Operations

QA Contact: shyam → lypulong

Michal Purzynski [:michal`] (use NEEDINFO)

Comment 9

•

11 years ago

I've just spoken with coop and he helped me understand this project. It looks like it is still valid and would be useful to have. The plugins would run on the server side and expose the aggregated data as JSON. Having it accessible from the entire build network seems like the right thing to do, no problem with that.

Flags: needinfo?(mpurzynski)

Comment 10

•

10 years ago

•

Edited

OK, will the livestatus hand-off work? You can query this info remotely using livestatus. I have also attached an example.

nc nagios1.private.scl3.mozilla.com 6557 < get_service2.query
[["check_https_cert_only!60",["sysalerts"],["irc","pagerduty-funnel","sysadmin-oncall"],["servicenow-checks","nagios-servers","hp-servers","generic"],["sysalerts"],["irc","pagerduty-funnel","sysadmin-oncall"]]]
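The nc call above can be mirrored from a script. A rough sketch, assuming the server answers with JSON (that is, the query file carries OutputFormat: json) and replies once the client stops sending; exchange and query_tcp are illustrative names, and 6557 is simply the port used in the example.

```python
import json
import socket


def exchange(sock, query):
    """Send an LQL query on an already connected socket and decode the reply."""
    sock.sendall(query.encode("utf-8"))
    sock.shutdown(socket.SHUT_WR)  # request is complete; wait for the answer
    with sock.makefile("rb") as reply:
        return json.loads(reply.read().decode("utf-8"))


def query_tcp(host, query, port=6557, timeout=10.0):
    """Open a TCP connection (port as in the nc example) and run one query."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return exchange(sock, query)
```

Keeping exchange separate from the connection setup means the same code path serves both the TCP listener and a local socket.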

Comment 11

•

10 years ago

Attached file bug884307.txt (obsolete) — Details

The dates are epoch timestamps; use date +%s for the current time.

dgarvey@dgarvey-mozilla:~/bug1184750$ date -d @1439590000
Fri Aug 14 15:06:40 PDT 2015
dgarvey@dgarvey-mozilla:~/bug1184750$ date -d @1439600036
Fri Aug 14 17:53:56 PDT 2015
dgarvey@dgarvey-mozilla:~/bug1184750$ cat state_history
GET statehist
Columns: host_name service_description state duration duration_part
Filter: time >= 1439590000
Filter: time < 1439600036
dgarvey@dgarvey-mozilla:~/bug1184750$
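The epoch arithmetic can also be done in Python instead of date(1). A small sketch with hypothetical helper names; it works in UTC, so the first timestamp above shows up as 22:06:40 UTC rather than 15:06:40 PDT.

```python
from datetime import datetime, timezone


def to_epoch(moment):
    """Convert a timezone-aware datetime to whole epoch seconds."""
    return int(moment.timestamp())


def from_epoch(seconds):
    """Render an epoch value as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


def time_window_filters(start, end):
    """Return the two Filter headers bounding start <= time < end."""
    return [
        "Filter: time >= %d" % to_epoch(start),
        "Filter: time < %d" % to_epoch(end),
    ]
```

Passing timezone-aware datetimes avoids the usual ambiguity about which local zone a naive value was meant to be in.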

Comment 12

•

10 years ago

Attached file bug884307-json.txt (obsolete) — Details

Here is the json example.

$ cat state_history
GET statehist
Columns: host_name service_description state duration duration_part
Filter: time >= 1439590000
Filter: time < 1439600036
OutputFormat: json

Comment 13

•

10 years ago

Attached file livestatus_json (obsolete) — Details

Here is a small script to make this query:

GET statehist
Columns: host_name service_description state duration duration_part
Filter: time >= 1439590000
Filter: time < 1439600036
OutputFormat: json

Please be sure to change the path to the live socket:

socket_path = "/omd/sites/stage/tmp/run/live"

Updated

•

10 years ago

Assignee: ashish → dgarvey

Comment 14

•

10 years ago

coop, ping

Flags: needinfo?(coop)

Reporter

Comment 15

•

10 years ago

(In reply to Ashish Vijayaram [:ashish] from comment #5)
> For now, livestatus is query-able locally via the pipe
> (/var/log/nagios/rw/live) interface but HTTP can be turned on for
> whitelisted internal URLs.

dgarvey: what did you want me to look at here?

What I'm looking for is a web-accessible API endpoint for nagios data, i.e. this particular machine has these current issues and/or history of events. Does livestatus_json provide that at a URL I can access?

Flags: needinfo?(coop)

Comment 16

•

10 years ago

Attached image Screenshot from 2015-09-29 16:33:04.png (obsolete) — Details

No, it just has a socket to query the data from; it doesn't present the data in HTML. I am no web dev, but I was able to whip up a Flask presentation pretty quickly.

Comment 17

•

10 years ago

Attached file bug884307_flask.txt (obsolete) — Details

The flask stuff.

Comment 18

•

10 years ago

coop, I am using Flask to provide the web frontend and, because I am not a webdev, I have limited it to one query. This query gets the history, which is what you need?

Assignee

Updated

•

10 years ago

Assignee: dgarvey → rchilds

Status: NEW → ASSIGNED

Assignee

Comment 19

•

10 years ago

Really sorry about the delay with this. I think just outputting the json to a subfolder in Apache makes more sense than using Flask for this,

http://nagios1.private.scl3.mozilla.com/livestatus/

Just need to iron out the query for this. Currently using,

> GET log
> Columns: time host_name service_description state
> Filter: time >= 1455091200
> Filter: time < 1455137494
> OutputFormat: json

But it's not giving the desired output. Any ideas Ashish?

Flags: needinfo?(ashish)

Assignee

Comment 20

•

10 years ago

Alright,

Getting there with the query.

http://nagios1.private.releng.scl3.mozilla.com/livestatus/

Coop,

Sorry again for the delay, but want to make sure this is still relevant considering the age of this bug.

Flags: needinfo?(ashish) → needinfo?(coop)

Reporter

Comment 21

•

10 years ago

(In reply to Ryan C [:ryanc] from comment #20)
> Alright,
>
> Getting there with the query.
>
> http://nagios1.private.releng.scl3.mozilla.com/livestatus/
>
> Coop,
>
> Sorry again for the delay, but want to make sure this is still relevant
> considering the age of this bug.

Hey Ryan,

Thanks for picking this up. Yes, this bug is still relevant. It's always good to be able to pull in more data about the machines when something is broken. To that end, more data in the output is always appreciated.

Would it be possible to add a header row to the file to make ongoing interpretation easier if/when the format changes?

Also, would it be possible to provide a key/value match-up for fields like "state" (possibly in the same output dir) to aid in interpretation?

Flags: needinfo?(coop)

Assignee

Comment 22

•

10 years ago

Chris,

Still a WIP, but updated with header,

http://nagios1.private.releng.scl3.mozilla.com/livestatus/

Unfortunately key/value with LQL is not a thing,

"In order to avoid redundancy and keep the overhead as low as possible, the output is not formatted as a list of objects (with key/value pairs), but as a list of lists (JSON speaks of arrays)."

https://mathias-kettner.de/checkmk_livestatus.html
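Since LQL only returns a list of lists, the key/value view requested earlier would have to be built by whoever consumes the file. A possible sketch, assuming the first row is the header row: rows_to_records and with_state_name are invented names, the state labels are the standard Nagios numeric codes, and treating an empty service_description as a host-level row is an assumption rather than something confirmed in this bug.

```python
# Standard Nagios numeric state codes.
HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}
SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def rows_to_records(table):
    """Turn [header, row, row, ...] into a list of dicts keyed by column name."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def with_state_name(record):
    """Return a copy of record with a readable state_name added.

    Assumption: an empty or missing service_description marks a host-level row.
    """
    names = SERVICE_STATES if record.get("service_description") else HOST_STATES
    labelled = dict(record)
    labelled["state_name"] = names.get(record.get("state"), "UNKNOWN")
    return labelled
```

Doing the mapping on the consumer side keeps the published file as small as livestatus intends while still giving the dashboard named fields.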

Assignee

Comment 23

•

9 years ago

Here we go,

[rchilds@nagios1.private.releng.scl3 ~]$ cat query | sudo unixcat /var/log/nagios/rw/live
[["time","host_name","plugin_output"],
[1456012766,"t-w732-ix-229.wintest.releng.scl3.mozilla.com","PING CRITICAL - Packet loss = 100%"],
[1456012766,"bld-lion-r5-090.build.releng.scl3.mozilla.com","CHECK_NRPE: Socket timeout after 20 seconds."],
[1456012766,"t-w732-ix-092.wintest.releng.scl3.mozilla.com","PING CRITICAL - Packet loss = 100%"],
[1456012746,"t-w732-ix-195.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 1.75 ms"],
[1456012746,"t-xp32-ix-095.wintest.releng.scl3.mozilla.com","PING CRITICAL - Packet loss = 100%"],
[1456012746,"panda-0387.p4.releng.scl3.mozilla.com","DOWN: in state failed_self_test"],
[1456012746,"t-w732-ix-206.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 1.01 ms"],
[1456012746,"t-w732-ix-092.wintest.releng.scl3.mozilla.com","PING CRITICAL - Packet loss = 100%"],
[1456012746,"t-w732-ix-187.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 0.86 ms"],
[1456012706,"t-w732-ix-008.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 0.90 ms"]]

Next I'm gonna work on getting this into Puppet and refreshing this every 10 seconds. Still not sure if I believe the time stamps, though.

Assignee

Comment 24

•

9 years ago

Chris, Just to confirm. Does the output from comment 23 work for you?

Flags: needinfo?(coop)

Reporter

Comment 25

•

9 years ago

(In reply to Ryan C [:ryanc] from comment #24)
> Just to confirm. Does the output from comment 23 work for you?

Sorry for the delay. I was on PTO for 2 weeks in there.

The output looks good, and I should be able to integrate it into our machine health dashboard easily.

A few questions:
1) Will only machines with outstanding issues appear in the list?
2) What does the timestamp represent? Last check? First occurrence?

Thanks for this.

Flags: needinfo?(coop)

Assignee

Comment 26

•

9 years ago

(In reply to Chris Cooper [:coop] from comment #25)
> (In reply to Ryan C [:ryanc] from comment #24)
> > Just to confirm. Does the output from comment 23 work for you?
>
> Sorry for the delay. I was on PTO for 2 weeks in there.
>
> The output looks good, and I should be able to integrate it into our machine
> health dashboard easily.
>
> A few questions:
> 1) Will only machines with outstanding issues appear in the list?
> 2) What does the timestamp represent? Last check? First occurrence?
>
> Thanks for this.

Chris,

I pushed this out in r115592. It's working, but needs a little bit more to get it just right. As of now it refreshes every minute and displays the last 30 alerts. I'm not quite sure about the timestamp, it definitely doesn't look legit. Do you need a timestamp? I'll defer to Ashish for the rest of this.

Ashish,

The cron is running, but I can't get Puppet to take the proper cron entry without enclosing the LQL query in double quotes.

> command => '/usr/bin/printf "GET log \nColumns: time host_name plugin_output \nOutputFormat: json \nLimit: 30 \nFilter: host_name >= "" \n" | /usr/bin/unixcat /var/log/nagios/rw/live > /var/www/html/livestatus/index.html',

But it should be,

> command => '/usr/bin/printf 'GET log \nColumns: time host_name plugin_output \nOutputFormat: json \nLimit: 30 \nFilter: host_name >= "" \n' | /usr/bin/unixcat /var/log/nagios/rw/live > /var/www/html/livestatus/index.html',

So it won't cancel out,

> host_name >= ""

Any idea?

Flags: needinfo?(ashish)

Comment 27

•

9 years ago

Sure, fixed that with better quoting:

> - command => '/usr/bin/printf "GET log \nColumns: time host_name plugin_output \nOutputFormat: json \nLimit: 30 \nFilter: host_name >= "" \n" | /usr/bin/unixcat /var/log/nagios/rw/live > /var/www/html/livestatus/index.html',
> + command => '/usr/bin/printf \'GET log \nColumns: time host_name plugin_output \nOutputFormat: json \nLimit: 30 \nFilter: host_name >= "" \n\' | /usr/bin/unixcat /var/log/nagios/rw/live > /var/www/html/livestatus/index.html',

I do have other queries about r115592, let's take it on IRC.
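An alternative to escaping quotes inside the Puppet string is to keep the query in a small script so the cron command is only a path. A sketch of that idea, not what r115592 does: QUERY repeats the headers from the cron entry above, publish is a made-up helper, and the step that actually sends QUERY to the live socket is left out.

```python
import os
import tempfile

# Same headers as the cron entry above; no shell quoting is involved here.
QUERY = "\n".join([
    "GET log",
    "Columns: time host_name plugin_output",
    "OutputFormat: json",
    "Limit: 30",
    'Filter: host_name >= ""',
]) + "\n\n"


def publish(text, destination):
    """Write text to destination atomically so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(destination))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".livestatus-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the web server must read it
        os.replace(tmp_path, destination)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing to a temporary file and renaming it also avoids the window in which a shell redirect has truncated index.html but the new output has not arrived yet.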

Flags: needinfo?(ashish)

Reporter

Comment 28

•

9 years ago

The more I look at this, the less I think it's doing what I need.

What I'd really like is the alert history for each machine in json format, i.e. when I visit (notice the json param):

http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/history.cgi?host=t-xp32-ix-103&json=1

...I'd get a json dump of the alert history for t-xp32-ix-103, instead of the html output. I'm trying to determine at a glance whether a given machine has a history of problems and might need service/decomm, and a long list of alerts is a good data point to add to our machine health dashboard.

Sorry to move the goalposts. :/

This may all be immaterial though. Right now the releng machine health dashboard can't pull the nagios data without a CORS change, or some changes on the releng code side.

If you want to WONTFIX this, I'd be fine with that.

Assignee

Comment 29

•

9 years ago

(In reply to Chris Cooper [:coop] from comment #28)
> This may all be immaterial though. Right now the releng machine health
> dashboard can't pull the nagios data without a CORS change, or some changes
> on the releng code side.

Chris,

If the releng side prevents this entirely, let me know and we'll WONTFIX this. But I'd be willing to pursue this to get it as you like.

Ashish,

I'm assuming one of those things is to make this a sub-class of nagios (e.g. nagios::releng::livestatus), but we'll chat tomorrow.

Assignee

Updated

•

9 years ago

Flags: needinfo?(coop)

Reporter

Comment 30

•

9 years ago

(In reply to Ryan C [:ryanc] from comment #29)
> If the releng side prevents this entirely, let me know and we'll WONTFIX
> this. But I'd be willing to pursue this to get it as you like.

I think it just requires a small change to wrap the json output in a jsonp callback, e.g.

jsonp_callback([[1458241618,"t-w732-ix-055.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 0.81 ms"],
[1458241608,"t-w732-ix-208.wintest.releng.scl3.mozilla.com","PING OK - Packet loss = 0%, RTA = 1.18 ms"]])

Since the livestatus output is static, if you could provide a separate endpoint, e.g. livestatus_jsonp, that returned the livestatus data wrapped in a named callback, I should be able to digest it on the releng side. The same would apply for any single slave history json output.
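The wrapping step itself is small. A sketch of what a livestatus_jsonp endpoint might do, with wrap_jsonp as a hypothetical helper; the callback-name check is an added precaution so that a caller-supplied name cannot inject script into the response.

```python
import json
import re

# Accept only plain JavaScript identifiers as callback names.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def wrap_jsonp(callback, payload):
    """Serialize payload as JSON and wrap it in a call to the named callback."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("unsafe JSONP callback name: %r" % (callback,))
    return "%s(%s);" % (callback, json.dumps(payload))
```

If the callback name is fixed rather than taken from the request, the same wrapped output can be written to a static file alongside the plain JSON.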

Flags: needinfo?(coop)

Assignee

Comment 31

•

9 years ago

Chris,

I'm going to redo this in Python, since I just discovered a module that interfaces with livestatus. This will be a lot easier.

Assignee

Comment 32

•

9 years ago

Attached file releng_json1.py (obsolete) — Details

Update,

Getting the initial workings set up. Now to get urlparse and the rest set up.

Attachment #8648322 - Attachment is obsolete: true

Attachment #8648323 - Attachment is obsolete: true

Attachment #8648589 - Attachment is obsolete: true

Attachment #8667616 - Attachment is obsolete: true

Attachment #8667618 - Attachment is obsolete: true

Assignee

Comment 33

•

9 years ago

Attached file releng_json2.py — Details

Hey Chris,

Sorry for the extremely delayed update, I've been occupied with other projects and this got pushed to the back. I've attached what I have, but I don't think this is a solid or long term solution for this, considering how hacky this is.

[rchilds@nagios1.private.releng.scl3 ~]$ curl http://127.0.0.1:5000/livestatus?h=t-xp32-ix-103.wintest.releng.scl3.mozilla.com
[["time","host_name","service_description","state","duration"],
[1471091187,"t-xp32-ix-103.wintest.releng.scl3.mozilla.com","",1,29],
[1471091187,"t-xp32-ix-103.wintest.releng.scl3.mozilla.com","PING",2,29],
[1471091187,"t-xp32-ix-103.wintest.releng.scl3.mozilla.com","PING",2,0],
[1471094758,"t-xp32-ix-103.wintest.releng.scl3.mozilla.com","",0,3571],
[1471094758,"t-xp32-ix-103.wintest.releng.scl3.mozilla.com","PING",0,3571]]

We're looking into new monitoring tools, such as Prometheus, to possibly replace Nagios in some aspects. I feel this would be 1000% better for this task.
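For illustration only, the request-handling half of such an endpoint could look roughly like this with the standard library. It is not the attached releng_json2.py: host_from_url and history_query are invented names, the use of the statehist table is a guess based on the columns in the output above, and the time window arguments are assumed to be supplied by the caller.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Letters, digits, dots and hyphens only, so a request cannot smuggle
# extra LQL header lines in through the host name.
_HOST_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?")

COLUMNS = ["time", "host_name", "service_description", "state", "duration"]


def host_from_url(url):
    """Extract and validate the h= parameter from a request URL."""
    values = parse_qs(urlsplit(url).query).get("h", [])
    if len(values) != 1 or not _HOST_RE.fullmatch(values[0]):
        raise ValueError("expected exactly one valid ?h=<hostname> parameter")
    return values[0]


def history_query(host, since, until):
    """Build a state-history query for one host within an epoch time window."""
    return "\n".join([
        "GET statehist",
        "Columns: " + " ".join(COLUMNS),
        "Filter: host_name = " + host,
        "Filter: time >= %d" % since,
        "Filter: time < %d" % until,
        "OutputFormat: json",
    ]) + "\n\n"
```

The validation matters because LQL is line-oriented: an unchecked newline in the h value would let a client append its own headers to the query.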

Attachment #8734863 - Attachment is obsolete: true