Closed
Bug 1351498
Opened 7 years ago
Closed 7 years ago
support.mozilla.org has been alerting in Pingdom
Categories
(support.mozilla.org - Lithium :: General, enhancement)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jlaz, Unassigned)
Details
(Whiteboard: [li-00134461])
Attachments
(4 files)
The MOC has been receiving alerts from Pingdom for the past few days, with most of the alerts recovering after a minute or so. I've attached a screenshot of the Pingdom UI showing a sample of the 'outage' period. The first reports began on Mar. 24th at ~12:46 AM.

It also appears that there is a migration in progress from our IT-managed infra to Lithium, which may be related (bug 1342467). There have been comments mentioning that monitoring should be suspended while any issues are ironed out during this transition.

We can edit the existing checks to only alert if the site is hard-down for 5 minutes or longer, but welcome any other suggestions. The MOC would also like to know how you would like us to handle outages: if the site is unavailable, whom should we escalate to, and where should we send bugs (product/component) when we encounter issues?
Comment 2•7 years ago
(In reply to Justin Lazaro [:jlaz] from comment #0)
> We can edit the existing checks to only alert if the site is hard-down for 5
> minutes or longer, but welcome any other suggestions.

Let's start with that to avoid alert fatigue.
Comment 3•7 years ago
I have emailed Lithium Support to try to get an update on this; I suspect it is due to the bug you mentioned. I have also asked for performance expectations regarding your suggested 5-minute Pingdom alert threshold for the site. I believe Patrick and Roland are your two people for this?
Flags: needinfo?(rtanglao)
Flags: needinfo?(pmcclard)
Updated•7 years ago
Component: Lithium Migration → General
Flags: needinfo?(rtanglao)
Product: support.mozilla.org → support.mozilla.org - Lithium
Updated•7 years ago
Whiteboard: [li-00134461]
Comment 4•7 years ago
Please see: https://bugzilla.mozilla.org/show_bug.cgi?id=1342467#c10

Un-needinfo'ing myself.
Comment 5•7 years ago
Hello, this is happening again today. SUMO is showing latency.
Severity: normal → major
Comment 6•7 years ago
Reporter | ||
Comment 7•7 years ago
These alerts also occurred a few times over my shift, with recoveries within one minute.
Comment 8•7 years ago
So far today I've had alerts for this many times. Very high latency spikes and various other errors from Lithium. I've bumped our alerting check up to 5 mins from 1 min just to try and reduce the alert fatigue.

Resolved By Email at Apr 04, 2017 at 2:55 PM (London)
Opened On Apr 04, 2017 at 2:55 PM (London)

Resolved By Email at Apr 04, 2017 at 2:57 PM (London)
Opened On Apr 04, 2017 at 2:57 PM (London)

Resolved By Email at Apr 04, 2017 at 3:06 PM (London)
Opened On Apr 04, 2017 at 3:02 PM (London)

Resolved By Email at Apr 04, 2017 at 3:12 PM (London)
Opened On Apr 04, 2017 at 3:12 PM (London)

Resolved By Email at Apr 04, 2017 at 3:15 PM (London)
Opened On Apr 04, 2017 at 3:15 PM (London)

Resolved By Email at Apr 04, 2017 at 3:19 PM (London)
Opened On Apr 04, 2017 at 3:19 PM (London)

Resolved By Email at Apr 04, 2017 at 3:22 PM (London)
Opened On Apr 04, 2017 at 3:22 PM (London)
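To illustrate what moving the threshold from 1 minute to 5 minutes does to the alert windows listed in this comment, here is a minimal Python sketch. It is not how Pingdom evaluates checks; `would_alert` and its duration rule are hypothetical, and the timestamps only have minute resolution, so a window that opens and resolves in the same minute counts as zero minutes long.

```python
from datetime import datetime, timedelta

# Alert windows copied from the comment above as (opened, resolved), London time.
WINDOWS = [
    ("2:55 PM", "2:55 PM"),
    ("2:57 PM", "2:57 PM"),
    ("3:02 PM", "3:06 PM"),
    ("3:12 PM", "3:12 PM"),
    ("3:15 PM", "3:15 PM"),
    ("3:19 PM", "3:19 PM"),
    ("3:22 PM", "3:22 PM"),
]


def parse(clock):
    """Turn a clock time from the list into a datetime on Apr 04, 2017."""
    return datetime.strptime("Apr 04, 2017 " + clock, "%b %d, %Y %I:%M %p")


def would_alert(opened, resolved, threshold=timedelta(minutes=5)):
    """Hypothetical rule: page only if the outage lasted at least `threshold`."""
    return parse(resolved) - parse(opened) >= threshold


# Windows that would still page under the 5-minute rule.
paged = [w for w in WINDOWS if would_alert(*w)]
print(len(paged))  # prints 0
```

Under these assumptions, none of the seven windows would have paged: the longest one (3:02 PM to 3:06 PM) lasted four minutes.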
Comment 9•7 years ago
The main site is very, very slow to load for me, and I've had a report that it is breaking automated tests.
Comment 10•7 years ago
We have test failures for automated tests today which are failing because https://support.mozilla.org cannot be reached or is serving an invalid certificate. This is tracked by bug 1353391. The machines which are reporting this are all located in SCL3. Could this bug be related to that?
Comment 11•7 years ago
Could this be related? https://bugzilla.mozilla.org/show_bug.cgi?id=1353364 They just turned this off and traffic dropped on Moz.org.
Flags: needinfo?(pmcclard)
Comment 12•7 years ago
(In reply to Patrick McClard;pmcclard from comment #11)
> Could this be related? https://bugzilla.mozilla.org/show_bug.cgi?id=1353364
>
> They just turned this off and traffic dropped on Moz.org.

Entirely possible. Seems to be responding much faster now. :whimboo are your problems gone?
Flags: needinfo?(hskupin)
Comment 13•7 years ago
Our tests run once a day for central and aurora nightly builds, so I can't tell before tomorrow.
Flags: needinfo?(hskupin)
Comment 14•7 years ago
I manually connected to one of the machines used for testing in SCL3 and can say that https://support.mozilla.org is more reachable now. So the tests should also work again.
Comment 15•7 years ago
I have to say that we still have those issues today with our machines in SCL3:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=896e9cfb9d67d6a73e70e39532f31306c22202cb&filter-searchStr=firefox%20ui%20fxfn&filter-tier=1&filter-tier=2&filter-tier=3

The Linux tests ran earlier today and seem to be fine. But the OS X and Windows tests all failed because they ran between roughly 10am and 1pm UTC.
Comment 16•7 years ago
We've had no further alerting from SUMO since yesterday. There was an issue during the night with connectivity between AWS (and other places) and scl3, but that seems to be before the period you're talking about.

https://bugzilla.mozilla.org/show_bug.cgi?id=1353624#c5
https://bug1353624.bmoattachments.org/attachment.cgi?id=8854700&t=Sh4SuBps1VvvctJr8Ccor3

I don't see anything informative as to why things failed (logs, error messages, etc.) in that link, but I'm not particularly familiar with treeherder since we don't use it. We have no access to the current SUMO infrastructure to check from the other side. Without any further information the MOC has nothing to go on and will have to leave this to the SUMO/Lithium folks.
Comment 17•7 years ago
I just want to let you know that our tests have since stabilized and no longer show issues with support.mozilla.org.
Comment 19•7 years ago
:whimboo - as we reported in bug 1353572, it looks like Lithium blocks ICMP checks, which we had in place for SUMO and which are now alerting (the HTTPS checks are showing success). Can you confirm whether this is the expected norm now? If so, we'll remove the ICMP checks and just leave the existing HTTPS checks for uptime on SUMO.
Flags: needinfo?(hskupin)
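For readers unfamiliar with the difference between the two kinds of check: an HTTPS check only needs the web server to answer a request, so it keeps working even when ICMP echo is filtered. The following Python sketch shows the general idea of such a check. It is not the MOC's Pingdom or Nagios configuration; the function name `https_up`, the 10-second timeout, and the "2xx or 3xx means up" rule are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def https_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx/3xx status within `timeout` seconds.

    `opener` is injectable so the check can be exercised without network access.
    Connection failures, timeouts, TLS errors and HTTP error statuses all count
    as "down" in this sketch.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False
```

Calling `https_up("https://support.mozilla.org")` would perform a real request; the block itself makes no network calls.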
Comment 21•7 years ago
Thanks! We also just noticed the ICMP ping check actually cleared a couple of days after the first alert, so it looks like we don't need to make any changes.

PagerDuty shows:
on Apr 6, 2017 at 4:42 PM
Resolved through the API.
Host: support.mozilla.org (View Message)

And Nagios logging:
./nagios-04-07-2017-00.log:[Thu Apr 6 16:42:14 2017] HOST ALERT: support.mozilla.org;UP;HARD;2;PING OK - Packet loss = 0%, RTA = 2.34 ms
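The Nagios line quoted above is a semicolon-separated HOST ALERT record: host, state, state type, attempt number, then the plugin output. As a rough sketch of how such a line can be pulled apart when grepping logs, the Python below parses exactly that line. The helper `parse_host_alert` is hypothetical, and it assumes the human-readable timestamp format shown in this comment rather than a raw epoch value.

```python
import re

# The line from the comment above, including the grep filename prefix.
LINE = (
    "./nagios-04-07-2017-00.log:[Thu Apr 6 16:42:14 2017] HOST ALERT: "
    "support.mozilla.org;UP;HARD;2;PING OK - Packet loss = 0%, RTA = 2.34 ms"
)

# Bracketed timestamp, the literal "HOST ALERT:", then the semicolon fields.
PATTERN = re.compile(r"\[(?P<when>[^\]]+)\] HOST ALERT: (?P<fields>.*)$")


def parse_host_alert(line):
    """Split a HOST ALERT line into named fields, or return None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Split at most 4 times so semicolons inside the plugin output are preserved.
    parts = match.group("fields").split(";", 4)
    if len(parts) != 5:
        return None
    host, state, state_type, attempt, output = parts
    return {
        "when": match.group("when"),
        "host": host,
        "state": state,
        "state_type": state_type,
        "attempt": int(attempt),
        "output": output,
    }


alert = parse_host_alert(LINE)
print(alert["host"], alert["state"], alert["state_type"])
```

Here the record shows the host in an `UP` state of type `HARD`, which matches the PagerDuty resolution at the same time.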
Updated•7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED