PuppetAgain error e-mails should get processed more timely.

RESOLVED FIXED

Status

RESOLVED FIXED
6 years ago
5 months ago

People

(Reporter: Callek, Unassigned)

Tracking

Details

(Reporter)

Description

6 years ago
So, a brief syntax error was checked in earlier today, which caused all EC2 slaves to scream about puppet errors.

However, those errors have been gradually filling our shared inbox for the better part of 7 hours now, which would surely mask any other real issues, whether transient or code-related.

The e-mail headers show ~4pm (ET), while it's now 11:17pm.

We should find a way to get these processed/sent faster. I'm not sure if the delay is on a PuppetAgain system, Mozilla's e-mail servers, or somewhere else, but this is a huge pain and has confused me in the past, since the mails arrive well after the incident itself.
(Reporter)

Comment 1

6 years ago
Note that e-mail is still strolling in for this same event, so far more than 14 hours after it happened.
As I understand it, that's because they're in AWS.  I don't know how AWS handles email, but presumably it's rate-limited to reduce spamming.
Assignee: server-ops-releng → nobody
Component: Server Operations: RelEng → Release Engineering: Machine Management
QA Contact: arich → armenzg
I think that's our smarthost limiting us: 

Feb 12 05:52:41 puppetmaster-02 postfix/error[28966]: B4B5639FE: to=<releng-shared@mozilla.com>, relay=none, delay=59398, delays=59355/43/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to mx2.corp.phx1.mozilla.com[63.245.216.70]:25: Connection timed out)
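For reference, the `delay=` and `delays=` fields in that log line can be turned into something more readable. The following is a minimal sketch, assuming the standard Postfix meaning of `delays=a/b/c/d` (time before the queue manager, time in the queue manager, connection setup, and message transmission, all in seconds). The function name `parse_delays` is hypothetical and only for illustration.

```python
import re

# Log line copied from the comment above.
LOG_LINE = (
    "Feb 12 05:52:41 puppetmaster-02 postfix/error[28966]: B4B5639FE: "
    "to=<releng-shared@mozilla.com>, relay=none, delay=59398, "
    "delays=59355/43/0/0, dsn=4.4.1, status=deferred "
    "(delivery temporarily suspended: connect to "
    "mx2.corp.phx1.mozilla.com[63.245.216.70]:25: Connection timed out)"
)

# Matches "delay=<total>, delays=<a>/<b>/<c>/<d>" as written by Postfix.
DELAY_RE = re.compile(r"delay=(?P<total>[\d.]+), delays=(?P<parts>[\d./]+)")


def parse_delays(line):
    """Hypothetical helper: extract the delay fields from one Postfix log line."""
    match = DELAY_RE.search(line)
    if match is None:
        raise ValueError("no delay fields found in log line")
    before_qmgr, in_qmgr, conn_setup, transmission = (
        float(part) for part in match.group("parts").split("/")
    )
    total = float(match.group("total"))
    return {
        "total_seconds": total,
        "total_hours": total / 3600.0,
        "before_qmgr": before_qmgr,
        "in_qmgr": in_qmgr,
        "conn_setup": conn_setup,
        "transmission": transmission,
    }


info = parse_delays(LOG_LINE)
print(f"message has been queued for {info['total_hours']:.1f} hours")
```

Nearly all of the delay sits in the first field, so the message spent its time waiting in the queue on the puppetmaster rather than in an active transfer.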
(Reporter)

Comment 4

6 years ago
Point of reference, error e-mails still flowing in, now 27 hours after event
(Reporter)

Comment 5

6 years ago
Ok, so props go to IT for helping me track this down.

tl;dr: we have to ask Amazon to fix this for us, and I have submitted that request.

Limed, Solarce, Ravi, Justdave, and I were throwing ideas around.

We realized, for sure, that we have routes from AWS --> smtp.m.o
We confirmed that smtp.m.o maps to mx[12].corp....
We confirmed that the AWS puppetmaster can connect to mx[12] on port 25

The mails are held in postfix queue on the puppetmaster, due to connection timing out against mx[12]
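The port 25 check described above could be repeated with something like the sketch below. This is not the command that was actually run; `can_connect` is a hypothetical helper, and the hostnames to probe are passed on the command line rather than hard-coded. Note that one successful connection does not rule out a connection rate limit, which is consistent with what we saw: the check passed while queued mail still timed out.

```python
import socket
import sys


def can_connect(host, port=25, timeout=10.0):
    """Hypothetical helper: True if a TCP connection opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Usage: python check_smtp.py <mail-host> [<mail-host> ...]
    for host in sys.argv[1:]:
        state = "reachable" if can_connect(host) else "NOT reachable"
        print(f"{host}:25 is {state}")
```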

solarce noticed https://forums.aws.amazon.com/message.jspa?messageID=397163#397163 (specifically last response there)

which describes Amazon limiting SMTP connections by default, even within the VPC, and points at a way to file a request to raise that limit.

I filled out said request at https://portal.aws.amazon.com/gp/aws/html-forms-controller/contactus/ec2-email-limit-rdns-request/submit?_encoding=UTF8

Rail also mentions that this likely was the root cause of Bug 824485
(Reporter)

Comment 6

6 years ago
Hello,

We've reviewed and approved your request for the removal of the EC2 e-mail sending limitations on your Amazon Web Services account. There are no longer limitations on your account for any IPs and instances under your account. If you requested removal of e-mail sending limits on Amazon Elastic IPs, they've also been removed.

Because reverse DNS record entries are commonly considered in anti-SPAM filters, we recommend assigning a reverse DNS record to the Elastic IP address you use to send email to third parties. Please use the form located at this link to request a reverse DNS entry:

https://aws-portal.amazon.com/gp/aws/html-forms-controller/contactus/ec2-email-limit-rdns-request

Note that a corresponding forward DNS record mapping only one domain to one Elastic IP address must exist before we can create the reverse DNS record on our side.

Thank you for your inquiry. Did I solve your problem?
 
If yes, please click here:
http://www.amazon.com/gp/help/survey?p=<SNIP>
 
If no, please click here:
http://www.amazon.com/gp/help/survey?p=<SNIP>

Best regards,

Oren M.
http://aws.amazon.com

---- Original message: ----

AWS AccountId                        <SNIP>
AccountEmailAddress                        <SNIP>@mozilla.com
UseCaseDescription                        We have puppet masters in AWS EC2 instances, which we want to e-mail us if there are errors in puppetizing our machines.

When a puppet-wide error happens (say, a typo in a node definition) we'd normally get hundreds of e-mails per machine over the course of a few minutes. Though with the SMTP limiting I see (based on experimentation and https://forums.aws.amazon.com/message.jspa?messageID=397163#397163 ) it takes well over a day for a very brief event to get purged from the mail queue we have on the puppet machine(s).

All mails are internal to us (routing from our AWS systems to an in-house mail server). I would like the ability for our machines to send a max of 1000 e-mails within an hour.

The max will rarely be reached at present, but when a burst of e-mails happens we want them to still get delivered without blocking reports of future issues.
ElasticIPAddress1                        
ElasticIPAddress2                        
ReverseDNSRecord1                        
ReverseDNSRecord2
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering

Updated

5 months ago
Product: Release Engineering → Infrastructure & Operations