Closed Bug 1383047 Opened 8 years ago Closed 8 years ago

Add additional alerts to the service so that we catch more complex errors (e.g. /trees/mozilla-inbound)

Categories

(Infrastructure & Operations :: MOC: Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garbas, Assigned: ryanc)

References

Details

This action item comes from a postmortem after the latest TreeStatus downtime [1].
[1] https://docs.google.com/document/d/1NfxZVmfGTwLalUethwboKMvKKO1vFBkamW0Vdf9LXEI/edit#

:RyanC: I was referred to you as a Nagios master. Would it be possible to find a time to improve (together) the current Nagios check? I would need a bit of guidance with the Nagios configuration.
In particular, I would like to check whether the API endpoints work as expected (return proper status codes, maybe even check the JSON output if possible).
I'd also like to automatically update the status.mozilla.org page if possible. My quick search turned up nagios2statuspage [1].
[1] https://github.com/bldrtech/nagios2statuspage
Flags: needinfo?(rchilds)
Blocks: 1383050
Hey,

We have the existing checks set up, but we could improve on them if we had specifics on what you want to monitor (a site string, JSON, etc.).

Existing checks:
13:43:20 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net: is OK - TCP OK - 0.073 second response time on 23.21.51.84 port 443 Last Checked: 2017-07-21 20:39:08 UTC
13:43:24 <~ryanc> nagios-releng: status treestatus.mozilla-releng.net:*
13:43:24 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:HTTPS is OK - HTTP OK: HTTP/1.1 200 OK - 4910 bytes in 1.347 second response time Last Checked: 2017-07-21 20:03:48 UTC
13:43:24 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:HTTPS - SSL Cert expiration is OK - OK - Certificate 'treestatus.mozilla-releng.net' will expire on 2018-11-28 12:00 +0000/UTC. Last Checked: 2017-07-21 20:03:48 UTC
13:43:24 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:PING is OK - TCP OK - 0.073 second response time on 23.21.51.84 port 443 Last Checked: 2017-07-21 20:39:33 UTC

Events over July according to Nagios:
[root@nagios1.private.releng.scl3 rchilds]# grep "treestatus.mozilla-releng.net" /var/log/nagios/archives/nagios-07-*2017* | grep CRITICAL | grep HARD
/var/log/nagios/archives/nagios-07-21-2017-00.log:[1500521767] SERVICE ALERT: treestatus.mozilla-releng.net;HTTPS;CRITICAL;HARD;3;HTTP CRITICAL - No data received from host
/var/log/nagios/archives/nagios-07-21-2017-00.log:[1500525743] SERVICE ALERT: treestatus.mozilla-releng.net;PING;CRITICAL;HARD;5;CRITICAL - Socket timeout
/var/log/nagios/archives/nagios-07-21-2017-00.log:[1500525963] HOST ALERT: treestatus.mozilla-releng.net;DOWN;HARD;10;CRITICAL - Socket timeout
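For anyone who wants to slice the archived alerts further than the grep pipeline does, here is a minimal Python sketch. It assumes the two line shapes shown in the output above (SERVICE ALERT with six semicolon-separated fields, HOST ALERT with five); the `Alert` type and `parse_alert` name are made up for illustration and are not part of Nagios.

```python
import re
from typing import NamedTuple, Optional

# Matches "[<epoch>] SERVICE ALERT: ..." or "[<epoch>] HOST ALERT: ...",
# even when grep has prefixed the line with "<filename>:".
ALERT_RE = re.compile(r"\[(\d+)\] (SERVICE|HOST) ALERT: (.*)$")


class Alert(NamedTuple):
    timestamp: int          # Unix epoch seconds, as written in the log
    kind: str               # "SERVICE" or "HOST"
    host: str
    service: Optional[str]  # None for HOST alerts
    state: str              # e.g. CRITICAL, DOWN
    state_type: str         # HARD or SOFT
    attempt: int
    output: str


def parse_alert(line: str) -> Optional[Alert]:
    """Parse one archived alert line; return None if it is not an alert."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    timestamp, kind, rest = match.groups()
    if kind == "SERVICE":
        host, service, state, state_type, attempt, output = rest.split(";", 5)
    else:
        host, state, state_type, attempt, output = rest.split(";", 4)
        service = None
    return Alert(int(timestamp), kind, host, service,
                 state, state_type, int(attempt), output)
```

Filtering on `state_type == "HARD"` then replaces the last grep stage, and the `service` field makes it easy to separate HTTPS failures from PING failures.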
Flags: needinfo?(rchilds)
(In reply to Rok Garbas [:garbas] from comment #1)
> :RyanC: I was referred to you as a Nagios master. Would it be possible to
> find a time to improve (together) current Nagios check? I would need a bit
> of guidance with Nagios configuration.
>
> Especially I would like to check if API endpoints work as expected (return
> proper status codes, maybe even check JSON output if possible).

What is the URL for these endpoints? What values are we looking for?
Assignee: nobody → rchilds
Status: NEW → ASSIGNED
Component: TreeStatus → MOC: Service Requests
Flags: needinfo?(rgarbas)
Product: Release Engineering → Infrastructure & Operations
QA Contact: catlee → lypulong
:RyanC: I was looking at the code in the ``modules/nagios4/`` folder of the ``sysadmins-puppet`` repository, and here are the questions I have:

(1) Currently the mozilla-releng.net checks are bound together with scl3 (``modules/nagios4/manifests/prod/releng/scl3.pp``). In reality, mozilla-releng.net services have nothing to do with scl3, except that we are migrating relengapi services away from it. Would it make sense to run the checks from the nagios host then (defining a different parent)?

(2) I might misunderstand the infrastructure of sysadmins-puppet, but currently you can only do one check for each URL (please tell me if I'm wrong). I would like to define multiple checks (i.e. multiple services, if I use Nagios terminology).

(3) As mentioned in previous comments, I would like to generate these checks (Nagios configuration) from the configuration in mozilla-releng/services and manually send a patch for review. The reason is that it is hard to keep up to date with all the services that are being added/removed. Would it be possible to have all of mozilla-releng.net related services in one

(4) I noticed a few options in the Nagios documentation for increasing the check frequency, but I'm not sure which option to use. For the more important services I would like to increase the frequency of checks. I was thinking of running checks every 5 min; would this be too often? Should we go even lower? What would you suggest?

(5) For each service I would like to customize the URL that is being reported, e.g. for the releng-tooltool project I would like to include https://docs.mozilla-releng.net/projects/releng-tooltool.html. Would that be possible? If I understand the code correctly, I should set the info_url parameter, right?

Currently I would like to add a few more checks for the treestatus and tooltool projects.

(1) https://treestatus.mozilla-releng.net
(1.1) https://treestatus.mozilla-releng.net/trees should return a JSON object where the ``result.ash.tree`` path is equal to "ash" and the http status code is 200
(1.2) https://treestatus.mozilla-releng.net/trees/mozilla-beta should return a JSON object where the ``result.tree`` path is equal to "mozilla-beta" and the http status code is 200
(1.3) https://treestatus.mozilla-releng.net/trees/invalid should return a JSON object where the ``detail`` path is "No such tree" and the http status code is 404

(2) https://tooltool.mozilla-releng.net
(2.1) https://tooltool.mozilla-releng.net/sha512/f93a685c8a10abbd349cbef5306441ba235c4cbfba1cc000299e11b58f258e9953cbe23463515407925eeca94c3f5d8e5f637c95be387e620845efa43cdcb0c0 should include a "Location" header which starts with "https://mozilla-releng-usw2-tooltool.s3-us-west-2.amazonaws.com:443/sha512/f93a685c8a10abbd349cbef5306441ba235c4cbfba1cc000299e11b58f258e9953cbe23463515407925eeca94c3f5d8e5f637c95be387e620845efa43cdcb0c0?Signature=" and the http status code is 302
(2.2) https://tooltool.mozilla-releng.net/sha512/edf96781042db513700c4a092ef367c05933967b036db9b0f716b75da613a7eaea055d0f60b1e12f6e41a545962cec97a7b78c6b86363ee1ec7a9f42699a5531 should return a JSON object where the ``detail`` path is equal to "You don't have the permission to access the requested resource. It is either read-protected or not readable by the server." and the http status code is 403
(2.3) https://tooltool.mozilla-releng.net/sha512/invalid should return a JSON object where the ``detail`` path is "Invalid sha512 digest" and the http status code is 400

[1] modules/nagios4/manifests/prod/releng/scl3.pp
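To make the requested checks concrete, the "status code plus JSON path" rule from items 1.1 to 1.3 and 2.2 to 2.3 could be sketched in Python roughly as follows. This is only an illustration of the rule, not the Nagios plugin that ends up being used; `lookup` and `evaluate` are hypothetical names, the exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL), and fetching the response is deliberately left out.

```python
import json

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def lookup(doc, dotted_path):
    """Walk a dotted path such as 'result.ash.tree' through nested dicts."""
    node = doc
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(dotted_path)
        node = node[key]
    return node


def evaluate(status, body, want_status, path, want_value):
    """Return (exit_code, message) for one check. Pure function, no network."""
    if status != want_status:
        return CRITICAL, f"CRITICAL - expected HTTP {want_status}, got {status}"
    try:
        value = lookup(json.loads(body), path)
    except (ValueError, KeyError):
        # ValueError covers malformed JSON; KeyError covers a missing path.
        return CRITICAL, f"CRITICAL - JSON path {path} not found"
    if value != want_value:
        return CRITICAL, f"CRITICAL - {path} is {value!r}, expected {want_value!r}"
    return OK, f"OK - HTTP {status}, {path} == {want_value!r}"
```

Check 1.1 would then be `evaluate(status, body, 200, "result.ash.tree", "ash")`, and check 1.3 `evaluate(status, body, 404, "detail", "No such tree")`, where `status` and `body` come from whatever HTTP client performs the request.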
Flags: needinfo?(rgarbas)
(In reply to Rok Garbas [:garbas] from comment #4)
> :RyanC: i was looking at code in ``modules/nagios4/`` folder of
> ``sysadmins-puppet`` repository. and here are the questions i have
>
> (1) currently mozilla-releng.net checks are bound together with scl3
> (``modules/nagios4/manifests/prod/releng/scl3.pp``). in reality
> mozilla-releng.net services have nothing to do with scl3 except that we are
> migrating relengapi services away from it. Would it make sense to run checks
> from nagios host then (defining a different parent)?

scl3 is where our primary nagios instances for releng and non-releng live

> (2) I might misunderstand the infrastructure of sysadmins-puppet but
> currently you can only do one check for each url (please tell me if I'm
> wrong). I would like to define multiple checks (eg. multiple services if i
> use nagios terminology)

No, multiple services can be assigned to a host

> (3) As mentioned in previous comments I would like to generate this checks
> (nagios configuration) from the configuration in mozilla-releng/services and
> manually send a patch for a review. Reason for this is that it is hard to
> keep up to date with all the services that are being added/removed. Would it
> be possible to have all of mozilla-releng.net related services in one

OK

> (4) I noticed in nagios documentation few options to increase frequency, but
> I'm not sure which option to use. For services which are more important I
> would like to increase the frequency of checks. I was thinking to run checs
> every 5min, would this be to often? should we go even lower? What would you
> suggest

5 seems fine

> (5) For each I would like to customize the url that is being reported. eg.
> for releng-tooltool project i would like to include
> https://docs.mozilla-releng.net/projects/releng-tooltool.html. would that be
> possible? if i undertand the code correctly i should set info_url parameter
> right?

We can do that

> Currently I would like to add few more checks for treestatus and tooltool
> project.
>
> (1) https://treestatus.mozilla-releng.net

We have that,
$ git grep "treestatus.mozilla-releng.net"
modules/nagios4/manifests/prod/releng/scl3.pp: 'treestatus.mozilla-releng.net' => {

> (1.1) https://treestatus.mozilla-releng.net/trees should return JSON object
> where ``result.ash.tree`` path is equal to "ash" and http status code is 200

We'll add that

> (1.2) https://treestatus.mozilla-releng.net/trees/mozilla-beta should return
> JSON object where ``result.tree`` path is equal to "mozilla-beta" and http
> status code is 200

We'll add that

> (1.3) https://treestatus.mozilla-releng.net/trees/invalid should return JSON
> object where ``detail`` path is "No such tree" and http status code is 404

We'll add that

> (2) https://tooltool.mozilla-releng.net

We don't have that but we'll add it

> (2.1)
> https://tooltool.mozilla-releng.net/sha512/
> f93a685c8a10abbd349cbef5306441ba235c4cbfba1cc000299e11b58f258e9953cbe23463515
> 407925eeca94c3f5d8e5f637c95be387e620845efa43cdcb0c0 should include
> "Location" header which starts with
> "https://mozilla-releng-usw2-tooltool.s3-us-west-2.amazonaws.com:443/sha512/
> f93a685c8a10abbd349cbef5306441ba235c4cbfba1cc000299e11b58f258e9953cbe23463515
> 407925eeca94c3f5d8e5f637c95be387e620845efa43cdcb0c0?Signature=" and https
> status code is 302

We'll add that

> (2.2)
> https://tooltool.mozilla-releng.net/sha512/
> edf96781042db513700c4a092ef367c05933967b036db9b0f716b75da613a7eaea055d0f60b1e
> 12f6e41a545962cec97a7b78c6b86363ee1ec7a9f42699a5531 should return JSON
> object where ``detail`` path is qeual to "You don't have the permission to
> access the requested resource. It is either read-protected or not readable
> by the server." and http status code is 403

We'll add that

> (2.3) https://tooltool.mozilla-releng.net/sha512/invalid should return JSON
> object where ``detail`` path is "Invalid sha512 digest" and http status code
> is 400

We'll add that

I'll get back to you by tomorrow with diff and added checks.
(In reply to Rok Garbas [:garbas] from comment #4)

All looks well besides 2.2; making a custom check for this now,

> (1.1) https://treestatus.mozilla-releng.net/trees should return JSON object
> where ``result.ash.tree`` path is equal to "ash" and http status code is 200

$ curl -sS https://treestatus.mozilla-releng.net/trees | jq -r '.result.ash.tree'
ash
$ curl -I https://treestatus.mozilla-releng.net/trees
HTTP/1.1 200 OK

> (1.2) https://treestatus.mozilla-releng.net/trees/mozilla-beta should return
> JSON object where ``result.tree`` path is equal to "mozilla-beta" and http
> status code is 200

$ curl -sS https://treestatus.mozilla-releng.net/trees/mozilla-beta | jq -r '.result.tree'
mozilla-beta
$ curl -I https://treestatus.mozilla-releng.net/trees/mozilla-beta
HTTP/1.1 200 OK

> (1.3) https://treestatus.mozilla-releng.net/trees/invalid should return JSON
> object where ``detail`` path is "No such tree" and http status code is 404

$ curl -sS https://treestatus.mozilla-releng.net/trees/invalid | jq -r '.detail'
No such tree
$ curl -I https://treestatus.mozilla-releng.net/trees/invalid
HTTP/1.1 404 NOT FOUND

> (2.1)
> https://tooltool.mozilla-releng.net/sha512/
> f93a685c8a10abbd349cbef5306441ba235c4cbfba1cc000299e11b58f258e9953cbe23463515
> 407925eeca94c3f5d8e5f637c95be387e620845efa43cdcb0c0 should include
> "Location" header which starts with
> "https://mozilla-releng-usw2-tooltool.s3-us-west-2.amazonaws.com:443/sha512/
> f93a685c8a10abbd349cbef5306441ba235c4cbfba1cc000299e11b58f258e9953cbe23463515
> 407925eeca94c3f5d8e5f637c95be387e620845efa43cdcb0c0?Signature=" and https
> status code is 302

$ curl -I https://tooltool.mozilla-releng.net/sha512/f93a685c8a10abbd349cbef5306441ba235c4cbfba1cc000299e11b58f258e9953cbe23463515407925eeca94c3f5d8e5f637c95be387e620845efa43cdcb0c0 | grep Location
Location: https://mozilla-releng-usw1-tooltool.s3-us-west-1.amazonaws.com:443/sha512/f93a685c8a10abbd349cbef5306441ba235c4cbfba1cc000299e11b58f258e9953cbe23463515407925eeca94c3f5d8e5f637c95be387e620845efa43cdcb0c0?Signature=
$ curl -I https://tooltool.mozilla-releng.net/sha512/f93a685c8a10abbd349cbef5306441ba235c4cbfba1cc000299e11b58f258e9953cbe23463515407925eeca94c3f5d8e5f637c95be387e620845efa43cdcb0c0
HTTP/1.1 302 FOUND

> (2.2)
> https://tooltool.mozilla-releng.net/sha512/
> edf96781042db513700c4a092ef367c05933967b036db9b0f716b75da613a7eaea055d0f60b1e
> 12f6e41a545962cec97a7b78c6b86363ee1ec7a9f42699a5531 should return JSON
> object where ``detail`` path is qeual to "You don't have the permission to
> access the requested resource. It is either read-protected or not readable
> by the server." and http status code is 403

$ curl -sS https://tooltool.mozilla-releng.net/sha512/edf96781042db513700c4a092ef367c05933967b036db9b0f716b75da613a7eaea055d0f60b1e12f6e41a545962cec97a7b78c6b86363ee1ec7a9f42699a5531.json | jq -r '.detail'
Invalid sha512 digest
$ curl -I https://tooltool.mozilla-releng.net/sha512/edf96781042db513700c4a092ef367c05933967b036db9b0f716b75da613a7eaea055d0f60b1e12f6e41a545962cec97a7b78c6b86363ee1ec7a9f42699a5531
HTTP/1.1 302 FOUND

> (2.3) https://tooltool.mozilla-releng.net/sha512/invalid should return JSON
> object where ``detail`` path is "Invalid sha512 digest" and http status code
> is 400

$ curl -sS https://tooltool.mozilla-releng.net/sha512/invalid | jq -r '.detail'
Invalid sha512 digest
$ curl -I https://tooltool.mozilla-releng.net/sha512/invalid
HTTP/1.1 400 BAD REQUEST
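The redirect test from 2.1 (a 302 plus a "Location" header with a given prefix) can be expressed without curl and grep. The Python sketch below is one possible shape for it, using only the standard library; `head` and `check_redirect` are illustrative names, and the exit codes assume the usual Nagios convention. Note that the curl output above shows a us-west-1 bucket while the request in comment #4 named us-west-2, so a strict prefix match as written there would report CRITICAL for that response.

```python
import http.client
from urllib.parse import urlsplit

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def head(url, timeout=10):
    """Send one HEAD request over HTTPS without following redirects.

    Returns (status, Location header or None).
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", target)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def check_redirect(status, location, want_prefix, want_status=302):
    """Return (exit_code, message) for a redirect check. Pure function."""
    if status != want_status:
        return CRITICAL, f"CRITICAL - expected HTTP {want_status}, got {status}"
    if not location or not location.startswith(want_prefix):
        return CRITICAL, f"CRITICAL - unexpected Location: {location!r}"
    return OK, f"OK - HTTP {status}, Location starts with expected prefix"
```

Typical use would be `check_redirect(*head(url), want_prefix)`. Keeping the comparison separate from the request makes it easy to relax the prefix (for example, to accept more than one bucket region) without touching the network code.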
Here we go,

22:50:53 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees is OK - HTTP OK: Status line output matched "200" - 6237 bytes in 0.345 second response time Last Checked: 2017-08-31 05:50:46 UTC
22:50:53 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees/invalid is OK - HTTP OK: Status line output matched "404" - 753 bytes in 0.343 second response time Last Checked: 2017-08-31 05:50:46 UTC
22:50:53 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees/mozilla-beta is OK - HTTP OK: Status line output matched "200" - 730 bytes in 0.443 second response time Last Checked: 2017-08-31 05:50:46 UTC
22:50:53 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:HTTPS is OK - HTTP OK: HTTP/1.1 200 OK - 4910 bytes in 1.305 second response time Last Checked: 2017-08-31 05:50:46 UTC
22:50:53 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:HTTPS - SSL Cert expiration is OK - OK - Certificate 'treestatus.mozilla-releng.net' will expire on 2018-11-28 12:00 +0000/UTC. Last Checked: 2017-08-31 05:50:46 UTC
22:50:53 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees is OK - Check JSON status API OK Last Checked: 2017-08-31 05:50:46 UTC
22:50:53 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees/mozilla-beta is OK - Check JSON status API OK Last Checked: 2017-08-31 05:50:46 UTC
22:50:53 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:PING is OK - TCP OK - 0.078 second response time on 54.243.209.110 port 443 Last Checked: 2017-08-31 05:50:46 UTC
22:54:24 <nagios-releng> ryanc: [] tooltool.mozilla-releng.net:HTTP Status - https://tooltool.mozilla-releng.net/sha512-1 is OK - HTTP OK: Status line output matched "302" - 1741 bytes in 0.359 second response time Last Checked: 2017-08-31 05:49:35 UTC
22:54:24 <nagios-releng> ryanc: [] tooltool.mozilla-releng.net:HTTP Status - https://tooltool.mozilla-releng.net/sha512-2 is OK - HTTP OK: Status line output matched "302" - 1753 bytes in 0.321 second response time Last Checked: 2017-08-31 05:52:38 UTC
22:54:24 <nagios-releng> ryanc: [] tooltool.mozilla-releng.net:HTTPS is OK - HTTP OK: HTTP/1.1 200 OK - 4910 bytes in 1.237 second response time Last Checked: 2017-08-31 05:17:50 UTC
22:54:24 <nagios-releng> ryanc: [] tooltool.mozilla-releng.net:HTTPS - SSL Cert expiration is OK - OK - Certificate 'treestatus.mozilla-releng.net' will expire on 2018-11-28 12:00 +0000/UTC. Last Checked: 2017-08-31 05:17:50 UTC
22:54:24 <nagios-releng> ryanc: [] tooltool.mozilla-releng.net:PING is OK - TCP OK - 0.076 second response time on 174.129.205.60 port 443 Last Checked: 2017-08-31 05:49:43 UTC

As for the JSON checks for https://treestatus.mozilla-releng.net/trees/invalid and https://tooltool.mozilla-releng.net/*: do these have any endpoints we can check that don't return a 4XX status code? The JSON check does not handle those responses:

[rchilds@nagios1.private.releng.scl3 ~]$ ./check_json.pl -u https://tooltool.mozilla-releng.net/sha512/invalid -a '{detail}'
Check JSON status API CRITICAL - Connection failed: 400 BAD REQUEST
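The failure above suggests the plugin treats any 4XX response as a failed connection before it looks at the body. For reference, here is a small Python sketch of a fetch that keeps the body of an error response so the ``detail`` field can still be inspected. It is not a patch for check_json.pl; `fetch` and `detail_of` are hypothetical names, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import json
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen, timeout=10):
    """Return (status, body_bytes); HTTP error statuses are returned, not raised."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the error object still carries the body.
        return err.code, err.read()


def detail_of(url, opener=urllib.request.urlopen):
    """Return (status, value of the top-level 'detail' key, or None)."""
    status, body = fetch(url, opener)
    try:
        doc = json.loads(body)
    except ValueError:
        return status, None
    if not isinstance(doc, dict):
        return status, None
    return status, doc.get("detail")
```

With something along these lines, check 2.3 would compare the result of `detail_of(url)` against `(400, "Invalid sha512 digest")` instead of failing at the connection stage.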
Flags: needinfo?(rgarbas)
Since we cannot check error responses where we return a 4xx status code, it is OK to skip those checks.

That means skipping these checks, which are listed above:
- 1.3
- 2.2
- 2.3

Also, I noticed I didn't provide a link (info_url) for each service:
- treestatus -> https://docs.mozilla-releng.net/projects/releng-treestatus.html
- tooltool -> https://docs.mozilla-releng.net/projects/releng-tooltool.html

Thank you :ryanc!
Flags: needinfo?(rgarbas)
(In reply to Rok Garbas [:garbas] from comment #9)
> Since we can not check error responses where we return 4xx response and it
> is ok to skip those checks.
>
> That means to skip check:
> - 1.3
> - 2.2
> - 2.3
> which are listed above.
>
> also i noticed i didn't provide a link (info_url) for each service
> - treestatus ->
> https://docs.mozilla-releng.net/projects/releng-treestatus.html
> - tooltool -> https://docs.mozilla-releng.net/projects/releng-tooltool.html
>
>
> thank you :ryanc!

All set. Amy, since Rok is on leave, where else should these go besides #sysadmins? #buildduty?

13:45:25 <~ryanc> nagios-releng: status treestatus.mozilla-releng.net:*
13:45:25 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees is OK - HTTP OK: Status line output matched "200" - 6084 bytes in 0.485 second response time Last Checked: 2017-08-31 20:40:40 UTC
13:45:25 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees/mozilla-beta is OK - HTTP OK: Status line output matched "200" - 730 bytes in 0.634 second response time Last Checked: 2017-08-31 20:40:40 UTC
13:45:25 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:HTTPS is OK - HTTP OK: HTTP/1.1 200 OK - 4910 bytes in 1.228 second response time Last Checked: 2017-08-31 19:50:45 UTC
13:45:25 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:HTTPS - SSL Cert expiration is OK - OK - Certificate 'treestatus.mozilla-releng.net' will expire on 2018-11-28 12:00 +0000/UTC. Last Checked: 2017-08-31 19:50:45 UTC
13:45:25 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees is OK - Check JSON status API OK Last Checked: 2017-08-31 20:40:40 UTC
13:45:25 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees/mozilla-beta is OK - Check JSON status API OK Last Checked: 2017-08-31 20:40:42 UTC
13:45:25 <nagios-releng> ryanc: [] treestatus.mozilla-releng.net:PING is OK - TCP OK - 0.072 second response time on 54.243.209.110 port 443 Last Checked: 2017-08-31 20:40:36 UTC
13:45:29 <~ryanc> nagios-releng: status tooltool.mozilla-releng.net:*
13:45:29 <nagios-releng> ryanc: [] tooltool.mozilla-releng.net:HTTP Status - https://tooltool.mozilla-releng.net/sha512-1 is OK - HTTP OK: Status line output matched "302" - 1717 bytes in 0.330 second response time Last Checked: 2017-08-31 20:44:29 UTC
13:45:29 <nagios-releng> ryanc: [] tooltool.mozilla-releng.net:HTTP Status - https://tooltool.mozilla-releng.net/sha512-2 is OK - HTTP OK: Status line output matched "302" - 1711 bytes in 0.329 second response time Last Checked: 2017-08-31 20:44:34 UTC
13:45:29 <nagios-releng> ryanc: [] tooltool.mozilla-releng.net:HTTPS is OK - HTTP OK: HTTP/1.1 200 OK - 4910 bytes in 1.262 second response time Last Checked: 2017-08-31 20:17:50 UTC
13:45:29 <nagios-releng> ryanc: [] tooltool.mozilla-releng.net:HTTPS - SSL Cert expiration is OK - OK - Certificate 'treestatus.mozilla-releng.net' will expire on 2018-11-28 12:00 +0000/UTC. Last Checked: 2017-08-31 20:17:50 UTC
13:45:29 <nagios-releng> ryanc: [] tooltool.mozilla-releng.net:PING is OK - TCP OK - 0.073 second response time on 54.243.58.209 port 443 Last Checked: 2017-08-31 20:44:29 UTC
Flags: needinfo?(arich)
Yes, these should all go to #buildduty, not #sysadmins.
Flags: needinfo?(arich)
(In reply to Amy Rich [:arr] [:arich] from comment #11)
> Yes, these should all go to #buildduty, not #sysadmins.

Flipped in 955395dbdf7806c965b680535403218e8c9830a1; reopen if anything else is needed.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED