Closed Bug 765623 Opened 12 years ago Closed 12 years ago

Nagios check geodns1.vips.scl3.mozilla.com:releasesrsynclag doesn't follow DNS changes

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: nthomas, Assigned: bhourigan)

Details

Right now the usual primary mirror in Europe, mozilla.openap.net, is doing some maintenance. They've blocked rsync connections and 3crowd (which releases-rsync.m.o points to) has failed over to the secondary, ftp.acc.umu.se. So far so good.

Nagios hasn't picked this up though, and is giving a spurious alert:
 geodns1.vips.scl3.mozilla.com:releasesrsynclag is WARNING: Mirrors WARNING - mozilla.openap.net [EMEA primary] : 65 minutes

It looks like the host is hard coded or something.
Assignee: server-ops → bhourigan
:nthomas

It looks like the releases rsync lag check has no way to determine what is and isn't active in 3Crowd. In fact, looking at the script, there is a comment about this exact problem:

# list disabled servers here:  these have to be manually updated, we don't have a way to tell what's active in 3crowd
my @skipservers = ( '130.239.18.138', '130.239.18.163', '130.239.18.173', '140.211.166.134', '149.20.20.5', '64.50.236.214' );

If you think this feature is required, we can change the scope of the bug to add it to the nagios check.
The motivation was to have nagios alerts for this be trustworthy, since AMO depends on all of releases.m.o being less than 30 minutes out of date to serve newly uploaded addons. I hope this means oncall is on the hook for the alerts, but I don't know if they page or if we have good docs for how to deal with the issue. 

AIUI, 3crowd will change the DNS response if a primary is down, but this check will still alert, which is a false positive. If the secondary is up and current we're OK; if it's not current we need to know. I don't know what 3crowd does if both primary and secondary are down. Maybe we just need to move the monitoring into CatchPoint or something; it would be great if we can maintain the visibility of the nagios check though.
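To illustrate the behaviour asked for above, here is a minimal Python sketch of a check that only alerts on mirrors currently present in the DNS response. It is not the existing Perl check. The function names, the `mirror_lag` input (mirror IP mapped to lag in minutes) and the example addresses are assumptions made for illustration; the 30-minute threshold comes from the AMO requirement mentioned in this comment. One caveat: with geo-aware DNS the answer depends on where the query originates, so this would only reflect the view from the monitoring host.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed threshold, taken from the "less than 30 minutes" AMO requirement.
LAG_WARNING_MINUTES = 30


def resolve_active_ips(hostname, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses the name resolves to right now.

    The resolver is injectable so the logic can be exercised without
    network access.
    """
    _name, _aliases, addresses = resolver(hostname)
    return set(addresses)


def evaluate_mirrors(mirror_lag, active_ips, warn_minutes=LAG_WARNING_MINUTES):
    """Return (exit_code, message), judging only mirrors that DNS is serving.

    mirror_lag is a hypothetical dict of mirror IP -> lag in minutes.
    Mirrors absent from the DNS response are ignored instead of being
    listed in a hand-maintained skip list.
    """
    monitored_active = {
        ip: lag for ip, lag in mirror_lag.items() if ip in active_ips
    }
    if not monitored_active:
        return UNKNOWN, "Mirrors UNKNOWN - no monitored mirror is in the DNS response"

    stale = {ip: lag for ip, lag in monitored_active.items() if lag >= warn_minutes}
    if stale:
        details = ", ".join(
            f"{ip}: {lag} minutes" for ip, lag in sorted(stale.items())
        )
        return WARNING, f"Mirrors WARNING - {details}"

    return OK, (
        f"Mirrors OK - {len(monitored_active)} active mirror(s) "
        f"within {warn_minutes} minutes"
    )
```

With this shape, a primary that has been failed out of DNS no longer triggers an alert on its own, while a stale secondary that is being served still does. The UNKNOWN result covers the case where nothing being monitored is in the response, which is one way to surface the "both primary and secondary are down" situation.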
:jakem

Does 3Crowd have an API to export the list of servers that are 'up' with a corresponding VIP? Would Dynect be appropriate for releases.m.o?
As far as I know, 3crowd does not really have a suitable API for this. It can page if a node is "down" I think, but I don't think there's a way to programmatically scan it for up or down nodes. Dynect might be better, I really haven't investigated that aspect of it.

However, neither is capable of making sure any given mirror is "up to date".
Is this still needed after moving product delivery to cdn?
Hmm, so will releases.m.o turn into ftp.m.o now, or the CDN ?
To be honest, I really don't know enough about what these are used for to be able to say with certainty that we can point them at the CDN or FTP cluster. I suspect we cannot, and that this is still a necessity, at least for now.


Having AMO addons on a CDN sounds like a decent idea to me, technically. AFAIK it has the same "once a file is created, it is almost never changed" mentality, which makes it good for a CDN. It should also have faster uptake of new files than mirrors would. My only concern would be the cache hit rate... I suspect it would be lower. Not sure if it would be significant enough to be a problem.

I can't say if this is financially viable... we haven't tested to determine total bandwidth usage (among other things). This leads me to conclude that even if we can and do move AMO downloads to a CDN, it won't be for some time. We should plan on the status quo existing for a few months at least.
We still distribute addons via releases.m.o, so we need some sort of monitoring of that (pending bug 626564). If we already have something in nagios then we could WONTFIX this, as the mirror system is effectively getting dismantled. If we don't, it boils down to a similar problem, except that releases.m.o is releases.geo.mozilla.com, rather than 3crowd handling releases-rsync.m.o.
So what's the action item here?
QA Contact: mrz → shyam
Addons are no longer distributed via releases.mozilla.org. I believe we can WONTFIX this bug. All of geodns and the associated nagios checks can go away... there are other bugs on that already.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → mozilla.org Graveyard