Created attachment 8852261 [details]
pingdom.pdf

The MOC has been receiving alerts from Pingdom for the past few days, with most of the alerts recovering after a minute or so. I've attached a screenshot of the Pingdom UI showing a sample of the 'outage' period. First reports began on Mar. 24th at ~12:46 AM.

It also appears that there is a migration in progress from our IT-managed infra to Lithium, which may be related (bug 1342467). Comments there mention that monitoring should be suspended to iron out any issues during the transition. We can edit the existing checks to alert only if the site is hard-down for 5 minutes or longer, but we welcome any other suggestions.

The MOC would also like to know how you would like us to handle outages: if the site is unavailable, whom should we escalate to, and where (product/component) should we file bugs when we encounter issues?
(In reply to Justin Lazaro [:jlaz] from comment #0)
> We can edit the existing checks to only alert if the site is hard-down for 5
> minutes or longer, but welcome any other suggestions.

Let's start with that to avoid alert fatigue.
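For illustration, the suggested behavior boils down to debouncing the check: with a 1-minute check interval, page only after 5 consecutive failures, so a one-minute blip never alerts. A minimal sketch of that logic (this is an assumption about how the threshold would work, not Pingdom's actual configuration or API):

```python
def should_alert(results, threshold=5):
    """Page only when the site is hard-down for `threshold` consecutive checks.

    results: booleans from the uptime check, newest last (True = check passed).
    With 1-minute checks and threshold=5, this is the
    "hard-down for 5 minutes or longer" behavior.
    """
    if len(results) < threshold:
        return False
    # Alert only if every one of the most recent `threshold` checks failed.
    return not any(results[-threshold:])

# A brief blip that recovers within a minute does not page:
print(should_alert([True, False, True, False, True]))           # False
# Five straight failed checks do page:
print(should_alert([True, False, False, False, False, False]))  # True
```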
I have emailed Lithium Support to try to get an update on this; I suspect it is due to the bug you mentioned. I have also asked about performance expectations in relation to your suggested 5-minute Pingdom monitor alert for the site. I believe Patrick and Roland are your two contacts for this?
Component: Lithium Migration → General
Product: support.mozilla.org → support.mozilla.org - Lithium
Please see https://bugzilla.mozilla.org/show_bug.cgi?id=1342467#c10. Un-needinfo'ing myself.
Hello. This is happening again today; SUMO is showing high latency.
Severity: normal → major
Created attachment 8854355 [details]
Screenshot 2017-04-04 01.33.05.png

These alerts also occurred a few times over my shift, with recoveries within one minute.
Created attachment 8854448 [details]
document.pdf

So far today I've had this many alerts for this: very high latency spikes and various other errors from Lithium. I've bumped our alerting check up to 5 mins from 1 min just to try to reduce the alert fatigue.

Resolved By Email at Apr 04, 2017 at 2:55 PM (London)
Opened On Apr 04, 2017 at 2:55 PM (London)

Resolved By Email at Apr 04, 2017 at 2:57 PM (London)
Opened On Apr 04, 2017 at 2:57 PM (London)

Resolved By Email at Apr 04, 2017 at 3:06 PM (London)
Opened On Apr 04, 2017 at 3:02 PM (London)

Resolved By Email at Apr 04, 2017 at 3:12 PM (London)
Opened On Apr 04, 2017 at 3:12 PM (London)

Resolved By Email at Apr 04, 2017 at 3:15 PM (London)
Opened On Apr 04, 2017 at 3:15 PM (London)

Resolved By Email at Apr 04, 2017 at 3:19 PM (London)
Opened On Apr 04, 2017 at 3:19 PM (London)

Resolved By Email at Apr 04, 2017 at 3:22 PM (London)
Opened On Apr 04, 2017 at 3:22 PM (London)
The main site is very, very slow to load for me, and I've had a report that it is breaking automated tests.
We have test failures for automated tests today which are failing because https://support.mozilla.org cannot be reached or is serving an invalid certificate. This is tracked by bug 1353391. The machines which are reporting this are all located in SCL3. Could this bug be related to that?
Could this be related? https://bugzilla.mozilla.org/show_bug.cgi?id=1353364 They just turned this off and traffic dropped on Moz.org.
(In reply to Patrick McClard [:pmcclard] from comment #11)
> Could this be related? https://bugzilla.mozilla.org/show_bug.cgi?id=1353364
>
> They just turned this off and traffic dropped on Moz.org.

Entirely possible. The site seems to be responding much faster now. :whimboo, are your problems gone?
Our tests run once a day for central and aurora nightly builds, so I can't tell before tomorrow.
I manually connected to one of the machines used for testing in SCL3 and can confirm that https://support.mozilla.org is reachable again, so tests should also work again.
I have to say that we still see these issues today on our machines in SCL3: https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=896e9cfb9d67d6a73e70e39532f31306c22202cb&filter-searchStr=firefox%20ui%20fxfn&filter-tier=1&filter-tier=2&filter-tier=3

Linux tests ran earlier today and seem to be fine, but the OS X and Windows tests all failed; they were running around 10am - 1pm UTC.
We've had no further alerting from SUMO since yesterday. There was an issue during the night with connectivity between AWS (and other places) and SCL3, but that seems to have been before the period you're describing:

https://bugzilla.mozilla.org/show_bug.cgi?id=1353624#c5
https://bug1353624.bmoattachments.org/attachment.cgi?id=8854700&t=Sh4SuBps1VvvctJr8Ccor3

I don't see anything informative as to why things failed (logs, error messages, etc.) in that link, but I'm not particularly familiar with Treeherder since we don't use it, and we have no access to the current SUMO infrastructure to check from the other side. Without further information the MOC has nothing to go on and will have to leave this to the SUMO/Lithium folks.
Just to let you know: in the meantime our tests have stabilized again and no longer show issues with support.mozilla.org.
:whimboo - as we reported in bug 1353572, it looks like Lithium blocks ICMP checks. We had ICMP checks in place for SUMO, and they are now alerting (the HTTPS checks show success). Can you confirm whether this is the expected norm now? If so, we'll remove the ICMP checks and leave only the existing HTTPS checks for uptime on SUMO.
If HTTPS is fine, we are fine. :) Thanks.
Thanks! We also just noticed the ICMP ping check actually cleared a couple of days after the first alert, so it looks like we don't need to make any changes.

PagerDuty shows:
on Apr 6, 2017 at 4:42 PM Resolved through the API. Host: support.mozilla.org

And Nagios logging:
./nagios-04-07-2017-00.log:[Thu Apr 6 16:42:14 2017] HOST ALERT: support.mozilla.org;UP;HARD;2;PING OK - Packet loss = 0%, RTA = 2.34 ms
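For reference, the 5-minute hard-down threshold discussed earlier maps onto Nagios's soft/hard state model: failures are SOFT (and silent) until max_check_attempts is reached. A minimal sketch of a service definition with that behavior (directive values and the check command name are illustrative assumptions, not SUMO's actual configuration):

```
define service {
    host_name            support.mozilla.org
    service_description  HTTPS uptime
    check_command        check_https          ; assumed command name
    check_interval       1                    ; minutes between checks while OK
    retry_interval       1                    ; re-check every minute while failing (SOFT)
    max_check_attempts   5                    ; 5th consecutive failure goes HARD and notifies
}
```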
Status: NEW → RESOLVED
Resolution: --- → FIXED