Closed Bug 933334 Opened 11 years ago Closed 8 years ago

sumotools1.webapp.phx1: puppetize postfix

Categories

(Infrastructure & Operations :: Infrastructure: Puppet, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: cww, Unassigned)

This broke again.

I'm making it critical because we use email alerts from this server to let us know when things spike in input and we just released a bunch of stuff for Firefox 25.

This absolutely needs to stop happening.

+++ This bug was initially created as a clone of Bug #886561 +++

I use 

/usr/sbin/sendmail -t

to send data reports off of our metrics server (sumotools1.webapp.phx1.mozilla.com) to our community managers.
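For context, a minimal sketch of how a report script might hand a message to that command. With `-t`, sendmail takes the recipients from the message headers, so the script only has to build a well-formed message. The function names and the addresses in the usage note are hypothetical, not taken from the actual reporting scripts.

```python
import subprocess
from email.message import EmailMessage

SENDMAIL = "/usr/sbin/sendmail"  # path as given in the report above


def build_report(sender, recipients, subject, body):
    """Build a plain-text message; `sendmail -t` reads recipients from the To header."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_report(msg, sendmail=SENDMAIL):
    """Pipe the message to sendmail on stdin.

    -t takes recipients from the headers; -i stops a line containing only "."
    from ending the input. check=True raises CalledProcessError if sendmail
    exits non-zero, so a cron wrapper at least sees a local failure.
    """
    subprocess.run([sendmail, "-t", "-i"], input=msg.as_bytes(), check=True)
```

A cron job would then call something like `send_report(build_report("reports@example.com", ["managers@example.com"], "Daily report", text))`.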

This seems to have broken some time after June 8th. Can someone look into it or give me another command?

Thanks.
The commands are in bug 886561. I have done this now, so mail should be working.

[root@sumotools1.webapp.phx1 ~]# /etc/init.d/postfix  status
master is stopped
[root@sumotools1.webapp.phx1 ~]# /etc/init.d/postfix  start
Starting postfix:                                          [  OK  ]
[root@sumotools1.webapp.phx1 ~]# /etc/init.d/postfix  status
master (pid  10022) is running...

And it sounds like you want postfix monitored on this server, so that we will be paged and fix it if it happens to die. Is that accurate?
(Also, it can't have been broken since June 8th, because it was fixed on August 13th as per bug 904660.)
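If monitoring is what's wanted, a check could be as simple as the sketch below. It assumes the init script prints the same "is running" / "is stopped" text shown in the transcript above, and it uses the conventional Nagios plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The function names are hypothetical and this is not the check actually deployed anywhere.

```python
import subprocess

OK, CRITICAL, UNKNOWN = 0, 2, 3  # conventional Nagios plugin exit codes


def classify(status_output):
    """Map the init script's status text to (exit_code, message)."""
    text = status_output.strip()
    if "is running" in text:
        return OK, "OK - " + text
    if "is stopped" in text:
        return CRITICAL, "CRITICAL - " + text
    return UNKNOWN, "UNKNOWN - unrecognised status: " + text


def run_check(init_script="/etc/init.d/postfix"):
    """Run the init script's status command, print one line, return the exit code."""
    try:
        result = subprocess.run(
            [init_script, "status"], capture_output=True, text=True
        )
    except OSError as exc:
        # Script missing or not executable: report UNKNOWN rather than crash.
        print("UNKNOWN - could not run %s: %s" % (init_script, exc))
        return UNKNOWN
    code, message = classify(result.stdout)
    print(message)
    return code
```

Run as a plugin, the script would finish with `sys.exit(run_check())` so the monitoring system sees the exit code.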
(In reply to [:Cww] from comment #0)
> This broke again.
> 
> I'm making it critical because we use email alerts from this server to let
> us know when things spike in input and we just released a bunch of stuff for
> Firefox 25.
> 
> This absolutely needs to stop happening.

So, this seems like a one-off setup that might not be built to our standards. In bug 886561 it seems postfix stopped after an upgrade. Was this installed by hand on sumotools? If it were puppet managed, puppet would ensure it is running and this would not happen.

If sumotools is really this mission critical, we should take a deep dive into it and make sure it is properly managed and documented. (As of now it is not even in our list of sumo servers; I recall this being a one-off that maybe we P2V'd in the sumo move?)

In the meantime, we have people looking into getting mail flowing again.
I said it on August 14th, but that bug was resolved, so I'll say it again here:

https://bugzilla.mozilla.org/show_bug.cgi?id=904660#c3

We need to make sure that if it's important, puppet will ensure it exists.

Changing the title and component to reflect the actual work to be done.
Component: Infrastructure: Mail → Infrastructure: Puppet
QA Contact: limed → jdow
Summary: sendmail no longer working on sumotools1.webapp.phx1.mozilla.com → puppetize postfix
Sorry, this broke last week. I just cloned a bug when I realized that we should have gotten a backlog of alerts around the release and didn't.

Mail seems to be flowing again, so that's good. Thanks Sheeri.

I would like a server that I can use to run essentially a cron job that queries the SUMO and Input DBs and sends out alert emails when something breaks. This saves us from having to manually run those queries multiple times each day, especially around releases.  It's not mission critical in that I can get the data another way but only if I know it's not running.

If sumotools isn't the right place, then we need somewhere to do it. Sumotools has been holding up great for this purpose except that postfix seems to turn itself off every couple of months and I need to file a bug to get it restored.

I don't think anyone ever put postfix on puppet. If that's all it takes, I think it's fine.
(In reply to [:Cww] from comment #5)
> If sumotools isn't the right place, then we need somewhere to do it.
> Sumotools has been holding up great for this purpose except that postfix
> seems to turn itself off every couple of months and I need to file a bug to
> get it restored.

I think sumotools is a fine place to be checking for the status of things, but we may be able to provide a better way to report alarms to you than email from the box itself.

> I would like a server that I can use to run essentially a cron job that
> queries the SUMO and Input DBs and sends out alert emails when something
> breaks. This saves us from having to manually run those queries multiple
> times each day, especially around releases.  It's not mission critical in
> that I can get the data another way but only if I know it's not running.

If the cron job wrote its results to a file ("okay" or "warning" or "critical"), then we could have the existing monitoring system actually go as far as paging people directly, and/or emailing alerts about the cron results. Our nagios setup has a mission-critical email delivery setup and supports escalation from IRC to email to SMS if so desired.
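A rough sketch of that hand-off, with hypothetical function names and a file path left to the caller. The cron job writes one of the three words to a file, and the monitoring side reads it back. The reader also treats a missing or stale file as "critical", so a cron job that has silently stopped running is noticed too.

```python
import os
import tempfile
import time

VALID = ("okay", "warning", "critical")


def write_status(path, status):
    """Called by the cron job: record the latest result atomically."""
    if status not in VALID:
        raise ValueError("status must be one of %r" % (VALID,))
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target,
    # so the monitor never reads a half-written file.
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(status + "\n")
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the monitor may run as another user
    os.replace(tmp, path)


def read_status(path, max_age_seconds, now=None):
    """Called by the monitor: return the status word.

    A missing, unreadable, stale, or unrecognised file counts as "critical".
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        with open(path) as handle:
            word = handle.read().strip()
    except OSError:
        return "critical"
    if age > max_age_seconds or word not in VALID:
        return "critical"
    return word
```

The `max_age_seconds` threshold would be set somewhat above the cron interval, so that one late run does not page anyone.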

Thoughts?
Hmmm...

So that'd work for some cases and not others.

One of the reports is just a daily email with the number/names of new contributors so we can welcome them (IMO, this isn't mission critical but the community managers would be sad if it's not there).  I mean I can hack it and have it send a critical page once a week, I guess.

The other regular email is this alert with text. I think an on-demand alert system would be awesome for that.

I definitely don't need to use email or postfix if there are better ways.
(In reply to [:Cww] from comment #7)
> Hmmm...
> 
> So that'd work for some cases and not others.
> 
> One of the reports is just a daily email with the number/names of new
> contributors so we can welcome them (IMO, this isn't mission critical but
> the community managers would be sad if it's not there).  I mean I can hack
> it and have it send a critical page once a week, I guess.

I'm all for making Postfix work reliably, but at some point it's not the best long-term solution for every case.

For this case, puppetizing postfix would ensure that the report is delivered within an hour of being generated, which is more than sufficient for a daily report.

Sending the report through a mail delivery provider (we use Socketlabs) would also be an excellent alternative, especially if it's being sent to non-@mozilla addresses.

> The other regular email is this alert with text. I think an on-demand alert
> system would be awesome for that.

Great!

In the short-term, we'll still need this bug (puppetize postfix) and bug 933345 (monitor postfix queue) to ensure that everything is delivered okay.
See Also: → 933399
It's only going to a couple of Mozilla email addresses (basically for community managers who don't want to have to log into the VPN and use SQL to get this info), so the puppet solution would be cheapest, I think.
I think the nagios plan with delivery would be awesome for alerts. In the medium term, we will need three different ones (one each for desktop, Android and FxOS) so the notifications go to the right people.
(In reply to [:Cww] from comment #10)
> I think the nagios plan with delivery would be awesome for alerts. In the
> middle term, we will need three different ones (one for desktop, android and
> fxos) so the notifications go to the right people.

Could you add more about the Nagios side of things to bug 933399? That bug is tracking further work on that side. This bug is (now) focused just on fixing up postfix.
Severity: critical → normal
Summary: puppetize postfix → sumotools1.webapp.phx1: puppetize postfix
Taking myself off because this is a bug about puppetizing postfix, not about anything particularly database-related.
Hi, this bug hasn't had much movement since 2013 and the reporter's BZ account is disabled. I'm marking this bug as resolved. 

I checked puppet and postfix isn't explicitly managed on this host; however, I'm not going to change anything without a reason and context.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE