Establish nagios checks for mozilla-releng.net https web apps

Status: NEW
Product: Release Engineering
Component: General
Priority: P1
Severity: normal
Opened: 2 years ago
Updated: 11 months ago

People

(Reporter: dividehex, Assigned: garbas)

(Reporter)

Description

2 years ago
In bug 1301677, domains and SSL certificates were set up for several Heroku apps and AWS CloudFront endpoints. These SSL certificates should be monitored for expiration.
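
As an illustration of what a certificate expiration check measures, here is a minimal Python sketch. It is not the Nagios plugin used in this bug; the function names and the 30-day threshold are assumptions made for the example. It reads the certificate's notAfter field over a TLS connection that sends SNI, then computes the days remaining.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # hypothetical warning threshold, not taken from this bug


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and a cert's notAfter string."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def expiry_state(days_left: float, warn_days: int = WARN_DAYS) -> str:
    """Map remaining days to a Nagios-style state name."""
    if days_left <= 0:
        return "CRITICAL"
    if days_left < warn_days:
        return "WARNING"
    return "OK"


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect with SNI and return the peer certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A caller would pass `fetch_not_after("docs.mozilla-releng.net")` and `datetime.now(timezone.utc)` to `days_until_expiry`, then feed the result to `expiry_state`.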
(Reporter)

Updated

2 years ago
Depends on: 1301677
(Reporter)

Comment 1

2 years ago
I've added HTTPS and SSL cert Nagios checks for the cluster of web apps that were set up in bug 1301677. The only check failing at this time is the HTTPS service check on clobberer.mozilla-releng.net. This is probably due to a missing root redirect to /docs/.


The following hosts were added to releng nagios:
mozilla-releng.net
staging.mozilla-releng.net
docs.mozilla-releng.net
docs.staging.mozilla-releng.net
shipit.mozilla-releng.net
shipit.staging.mozilla-releng.net
clobberer.mozilla-releng.net
clobberer.staging.mozilla-releng.net
dashboard.shipit.mozilla-releng.net
dashboard.shipit.staging.mozilla-releng.net
(Reporter)

Updated

2 years ago
Summary: Establish ssl certificate expiration checks for mozilla-releng.net https apps → Establish nagios checks for mozilla-releng.net https web apps
(Reporter)

Comment 2

2 years ago
Also, dashboard.shipit.mozilla-releng.net is missing the root redirect / -> /docs/.

I've downtimed both hosts for 30 days.

:garbas, do you have plans to add the root level redirects for these apps?
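
For reference, a rough Python sketch of how the missing root redirect could be confirmed by hand. This is an assumption-laden illustration, not the Nagios check itself: the helper names are hypothetical, and the set of accepted redirect status codes is my own choice.

```python
import http.client
from typing import Optional, Tuple
from urllib.parse import urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_root_redirect(status: int, location: Optional[str],
                     target: str = "/docs/") -> bool:
    """True if the response is a redirect whose Location path equals `target`."""
    if status not in REDIRECT_CODES or not location:
        return False
    return urlsplit(location).path == target


def probe_root(host: str, timeout: float = 10.0) -> Tuple[int, Optional[str]]:
    """Send HEAD / over HTTPS without following redirects."""
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

With the redirect missing, `probe_root` would be expected to return a 404 with no Location header, for which `is_root_redirect` returns False.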
Flags: needinfo?(rgarbas)
(Assignee)

Comment 3

2 years ago
None of the services is used in production yet.

Also, I don't think it makes sense to check anything under "staging.mozilla-releng.net", since staging will be up and down a lot: we also use it to test ideas, not only as a step before production.

I was hoping to get an EC2 instance where I would set up Prometheus and Grafana to do the monitoring as well, not only for pings but also for some internal metrics of the services. But I think having two things check whether a service is up can only be a good thing.

I'm leaving the needinfo set for myself to keep this on my to-do list and resolve it before Hawaii (i.e., next week).
(Assignee)

Comment 4

2 years ago
As of yesterday, the following services are deployed to production:
 - https://mozilla-releng.org
 - https://docs.mozilla-releng.org
(Reporter)

Comment 5

2 years ago
I've removed all the staging hosts from nagios. As a reminder, the nagios downtime for clobberer will expire on 12-21-2016 17:21:11.  Just let me know if this needs to be extended.

commit 583a90f3763dae2351e51b1c2dd398505d4f04be
Author: Jake Watkins <jwatkins@mozilla.com>
Date:   Thu Dec 15 09:21:02 2016 -0800

    Bug 1308352: Remove nagios checks for staging webapps
Terminology is fun! Minor clarification on comment 4:

The new implementation of the service "trychooser" is now running on the production instance of the new host https://mozilla-releng.net (not .org).

However, developers are still routed to the legacy deployment at:
  http://trychooser.pub.build.mozilla.org/

As can be shown via curl:
  $ curl -IL http://trychooser.pub.build.mozilla.org/
  HTTP/1.1 200 OK
  Date: Thu, 15 Dec 2016 19:11:06 GMT
  Server: Apache
  X-Backend-Server: web1.releng.webapp.scl3.mozilla.com
  Last-Modified: Wed, 21 Sep 2016 13:14:54 GMT

When the redirect is added on the legacy service, the new service will be fully in production.

Deployment of that redirect is blocked on allowing redeployment of the legacy relengapi, see bug 1318890
Extending the downtime for this set of hosts for another 30 days:

04:32:44 < nagios-releng> Mon 20:32:44 PST [4019] clobberer.mozilla-releng.net:HTTPS is WARNING: HTTP WARNING: HTTP/1.1 404 NOT FOUND - 447 bytes in 0.344 second response time (http://m.mozilla.org/HTTPS)

Please feel free to remove the downtime, or let us know when these go live once the redirect is unblocked.
(Assignee)

Comment 8

2 years ago
A few things have happened in the last 10 days, and I have been monitoring them myself (only to revert the change in case something went wrong). Since everything looks fine, I would like to request Nagios checks.

1. A redirect for TryChooser is now in place:
- old: http://trychooser.pub.build.mozilla.org
- new: (Static HTML, AWS S3) https://mozilla-releng.net/trychooser

2. A redirect for a frontend of TreeStatus is now in place:
- old: https://api.pub.build.mozilla.org/treestatus/
- new: (Static HTML, AWS S3) https://mozilla-releng.net/treestatus

3. A proxy is in place for API requests for TreeStatus
- old: https://api.pub.build.mozilla.org/treestatus/*
- new: (Heroku app) https://treestatus.mozilla-releng.net/

4. New frontend for ShipIt was released:
- new: (Static HTML, AWS S3) https://shipit.mozilla-releng.net

5. New backend for ShipIt Dashboard was released:
- new: (Heroku app) https://dashboard.shipit.mozilla-releng.net/

As for ``clobberer.mozilla-releng.net``, please extend the downtime until the end of Feb.
Flags: needinfo?(rgarbas) → needinfo?(jwatkins)

Comment 9

2 years ago
buildduty should be able to help with this.
Assignee: jwatkins → nobody
Component: Tools → Buildduty
QA Contact: hwine → bugspam.Callek
(Reporter)

Updated

2 years ago
Flags: needinfo?(jwatkins)
(In reply to Rok Garbas [:garbas] from comment #8)
> Few things happen in last 10 days and I have been monitoring them myself
> (only to revert the change in case something went wrong, but since
> everything looks fine I would like to request Nagios checks.
> 
> 1. A redirect for TryChooser is now in place:
> - old: http://trychooser.pub.build.mozilla.org
> - new: (Static HTML, AWS S3) https://mozilla-releng.net/trychooser
> 
> 2. A redirect for a frontend of TreeStatus is now in place:
> - old: https://api.pub.build.mozilla.org/treestatus/
> - new: (Static HTML, AWS S3) https://mozilla-releng.net/treestatus

We already have a check for 'mozilla-releng.net'. [1]
Do we want separate checks for each endpoint (e.g. treestatus, trychooser)?  

> 3. A proxy is in place for API requests for TreeStatus
> - old: https://api.pub.build.mozilla.org/treestatus/*
> - new: (Heroku app) https://treestatus.mozilla-releng.net/

That will probably require uncommenting the existing treestatus block (I will follow up with a patch once it is clearer what is needed here).
 
> 4. New frontend for ShipIt was released:
> - new: (Static HTML, AWS S3) https://shipit.mozilla-releng.net
> 
> 5. New backend for ShipIt Dashboard was released:
> - new: (Heroku app) https://dashboard.shipit.mozilla-releng.net/

Nagios checks for points 4 and 5 are already present [1] 

> As for the ``clobberer.mozilla-releng.net`` please extend the downtime until
> the end of Feb.

The current downtime is scheduled to expire on 02-22-2017 20:36:31. I added another downtime that starts on 02-22-2017 20:36:32 and ends on 02-28-2017 23:59:59.
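
For anyone scripting a follow-on downtime like the one above, the sketch below builds a Nagios SCHEDULE_HOST_DOWNTIME external command line in Python. It is only an illustration: the bug does not say how the downtime was actually entered, the timestamps are treated as UTC for simplicity (the real time zone is not stated), and the author and comment values are placeholders.

```python
from datetime import datetime, timezone


def schedule_host_downtime(host: str, start: datetime, end: datetime,
                           author: str, comment: str, now: datetime) -> str:
    """Build a fixed-downtime external command line for one host.

    Field order: host;start;end;fixed;trigger_id;duration;author;comment
    """
    start_ts = int(start.timestamp())
    end_ts = int(end.timestamp())
    duration = end_ts - start_ts  # seconds; informational for fixed downtime
    return (
        f"[{int(now.timestamp())}] SCHEDULE_HOST_DOWNTIME;"
        f"{host};{start_ts};{end_ts};1;0;{duration};{author};{comment}"
    )


# Assumption: times interpreted as UTC; author and comment are placeholders.
command = schedule_host_downtime(
    "clobberer.mozilla-releng.net",
    datetime(2017, 2, 22, 20, 36, 32, tzinfo=timezone.utc),
    datetime(2017, 2, 28, 23, 59, 59, tzinfo=timezone.utc),
    author="buildduty",
    comment="bug 1308352",
    now=datetime(2017, 2, 21, 0, 0, 0, tzinfo=timezone.utc),
)
```

The resulting line would be written to the Nagios external command file; its path depends on the installation and is not given in this bug.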

[1] http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/status.cgi?hostgroup=releng-apps&style=detail

@Jake: can you please provide some directions here?
Thanks!
Flags: needinfo?(jwatkins)
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #10)

We definitely want to monitor each endpoint separately since they're different instances and can fail separately.
Assignee: nobody → aselagea
Created attachment 8835920 [details] [diff] [review]
bug_1308352.patch

Patch to enable alerts for:
    - treestatus.mozilla-releng.net
    - mozilla-releng.net/trychooser
    - mozilla-releng.net/treestatus
Flags: needinfo?(jwatkins)
Attachment #8835920 - Flags: review?(jwatkins)
(Reporter)

Comment 13

a year ago
Comment on attachment 8835920 [details] [diff] [review]
bug_1308352.patch

Review of attachment 8835920 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think we need to check the individual endpoints of the URL, since these all point to a single CloudFront endpoint (mozilla-releng.net). As it is, the host check is just a primitive TCP ping, and the service checks are an SNI HTTPS check and a cert expiration check.

::: modules/nagios/manifests/releng/scl3.pp
@@ +1275,4 @@
>                  'releng-apps'
>              ]
>          },
> +        'mozilla-releng.net/trychooser' => {

I don't believe this format will work, since the hostname is derived from this key.
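
To make the point concrete, here is a small Python sketch (not part of the Puppet module; the helper names are hypothetical) that splits the proposed keys into hostname and path. The two path-based entries resolve to the same host that is already checked, and a key containing a slash is not itself a valid hostname.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlsplit


def split_endpoint(endpoint: str) -> Tuple[str, str]:
    """Split 'host/path' (scheme optional) into (hostname, path)."""
    parts = urlsplit(endpoint if "//" in endpoint else "//" + endpoint)
    return parts.hostname or "", parts.path or "/"


def group_by_host(endpoints: List[str]) -> Dict[str, List[str]]:
    """Group endpoint paths under the hostname they resolve to."""
    grouped: Dict[str, List[str]] = {}
    for endpoint in endpoints:
        host, path = split_endpoint(endpoint)
        grouped.setdefault(host, []).append(path)
    return grouped


proposed = [
    "mozilla-releng.net/trychooser",
    "mozilla-releng.net/treestatus",
    "treestatus.mozilla-releng.net",
]
hosts = group_by_host(proposed)
```

Only `treestatus.mozilla-releng.net` yields a hostname that is not already covered by the existing `mozilla-releng.net` entry.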
Attachment #8835920 - Flags: review?(jwatkins) → review-
I'm going to comment out the entry for clobberer.mozilla-releng.net unless this is now live. It is still sending alerts to oncall because the page returns a 404.
Priority: -- → P1
Assigning this to Rok and moving it to the General queue, since he was working on these.
Assignee: aselagea → rgarbas
Component: Buildduty → General
QA Contact: bugspam.Callek