Closed Bug 1332761 Opened 8 years ago Closed 8 years ago

Discourse authentication emails taking too long to arrive

Categories

(Infrastructure & Operations :: Community IT: Discourse, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: michael, Assigned: gene)

Details

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36

Steps to reproduce:
1. Have an unauthenticated Discourse browsing session.
2. Click "Log In", select the email option, and complete the form with a valid email address.
3. Wait for the email with the authentication key link.

Actual results:
Often (not always) the email takes too long to arrive in my gmail.com inbox, resulting in an expired token error message when clicking the authentication link in the email.

Expected results:
The email should arrive within moments, and the link should work & authenticate the user to Discourse.
Assignee: nobody → gene
Michael, thanks for the report. Can you copy and paste, or attach, the original email that arrived too late in your inbox to this ticket? Here's how to get the full original email with headers: https://support.google.com/mail/answer/29436?hl=en

I can look at this and at our logs to determine the cause of the delay in email delivery.

-Gene
Flags: needinfo?(michael)
Headers are below. The timestamps don't look particularly concerning or interesting, but for whatever reason, the email still doesn't appear in the inbox for quite some time (as much as 10+ minutes). Perhaps someone else can (attempt to?) reproduce. Delivered-To: michael@example.com Received: by 10.176.70.9 with SMTP id m9csp243624uaa; Fri, 20 Jan 2017 14:33:54 -0800 (PST) X-Received: by 10.84.213.151 with SMTP id g23mr24709701pli.43.1484951634246; Fri, 20 Jan 2017 14:33:54 -0800 (PST) Return-Path: <01010159be03bda8-411fa9f9-842f-4e52-b4ff-d8ea6758a64d-000000@us-west-2.amazonses.com> Received: from a27-31.smtp-out.us-west-2.amazonses.com (a27-31.smtp-out.us-west-2.amazonses.com. [54.240.27.31]) by mx.google.com with ESMTPS id s18si8019117pgd.149.2017.01.20.14.33.53 for <michael@example.com> (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 20 Jan 2017 14:33:54 -0800 (PST) Received-SPF: pass (google.com: domain of 01010159be03bda8-411fa9f9-842f-4e52-b4ff-d8ea6758a64d-000000@us-west-2.amazonses.com designates 54.240.27.31 as permitted sender) client-ip=54.240.27.31; Authentication-Results: mx.google.com; dkim=pass header.i=@sso.mozilla.com; dkim=pass header.i=@amazonses.com; spf=pass (google.com: domain of 01010159be03bda8-411fa9f9-842f-4e52-b4ff-d8ea6758a64d-000000@us-west-2.amazonses.com designates 54.240.27.31 as permitted sender) smtp.mailfrom=01010159be03bda8-411fa9f9-842f-4e52-b4ff-d8ea6758a64d-000000@us-west-2.amazonses.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=mozilla.com DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/simple; s=djibmc2gvzwjtdg2g6oqgcfqizc74ac3; d=sso.mozilla.com; t=1484951633; h=Content-Type:From:To:Subject:Message-Id:Date:MIME-Version; bh=STBTZEM0Pj+N75936CvBkUWqqEe7icuP0bKFf53uUKk=; b=WVcX4mDQ+szkfYzF0MsM9lCzzbxncuWV6F6UhXxg0Z9UGhg3nZX9ZqVl2XMb4zDh JcUQmAX60vIyma8UM+jQ6mJ2p0WD7EVKeKQJg9Tfa5fvU4svcYBf0craDI6JtoR32c4 VfkpKlUyw4AhBIPkoL3uwjAz/oePp7UziE3hq6P8= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; 
c=relaxed/simple; s=gdwg2y3kokkkj5a55z2ilkup5wp5hhxx; d=amazonses.com; t=1484951633; h=Content-Type:From:To:Subject:Message-Id:Date:MIME-Version:Feedback-ID; bh=STBTZEM0Pj+N75936CvBkUWqqEe7icuP0bKFf53uUKk=; b=a9bp7kuOTVcUBg4if4b+tDCsnHoUAAop1je4pQu8mA5KCmk5i9pUGrHd9af9awzW I0oZtL00RjAG5IIics8FCel7RuXChPDtzH7cWLgCaU0IO6eMJuHX7NCthc+btCnhtHA 821vfoE/auuywO/t0oEcukM4dphOvlmJYhxOK9Cg= Content-Type: multipart/alternative; boundary="----sinikael-?=_1-14849516331790.037705678725615144" From: Mozilla SSO <noreply@sso.mozilla.com> To: michael@example.com Subject: Welcome to discourse.mozilla-community.org Message-ID: <01010159be03bda8-411fa9f9-842f-4e52-b4ff-d8ea6758a64d-000000@us-west-2.amazonses.com> X-Mailer: nodemailer (2.3.0; +http://nodemailer.com/; SES/1.3.0) Date: Fri, 20 Jan 2017 22:33:53 +0000 MIME-Version: 1.0 X-SES-Outgoing: 2017.01.20-54.240.27.31 Feedback-ID: 1.us-west-2.uHogrXyy+XKMRmZyxDC5UOCBhNiEyoF5M/tmxXz7Wmc=:AmazonSES
Flags: needinfo?(michael)
Oh, also, what was the actual email address you were authenticating with? I saw you obfuscated it in the headers, and I need it to correlate with the logs. Feel free to share it in whatever obfuscating method you prefer.
Flags: needinfo?(michael)
The domain part was {my last name}.net. :)
Flags: needinfo?(michael)
So I've looked at this and here's what I can see:

2017-01-20T22:33:53.100Z code sent from auth0 358042

Here, at 33 minutes 53.1 seconds, Auth0 logged the code being generated and sent to the mail service (AWS SES).

Fri, 20 Jan 2017 14:33:54 -0800 (PST) Received: from a27-31.smtp-out.us-west-2.amazonses.com (a27-31.smtp-out.us-west-2.amazonses.com. [54.240.27.31]) by mx.google.com with ESMTPS id s18si8019117pgd.149.2017.01.20.14.33.53

One second later, at 33 minutes and 54 seconds, Google reports in the headers in Comment 2 that it received the email from AWS SES. There's then a 9 minute delay, at which point Auth0 records your attempt to use the now expired code:

2017-01-20T22:42:42.956Z wrong email or verification code

So it sounds like, for whatever reason, the email was received by Google one second after you clicked the button but then wasn't displayed or presented to you until some time later.

Were you refreshing while you were waiting for the email? I'm assuming at some point you went and did something else (didn't just sit refreshing email for 9 minutes). Did you get to see when it arrived with one of those inbox refreshes? If so, do you remember what time you first saw it at?

Unfortunately, looking at the Auth0 logs and the email headers, it sounds like Google received the email very quickly (1 second) but then didn't show it to you for some reason. I say unfortunately because if it's a delay inside Google, that's the one part of the picture I can't do anything about.

The one log I don't have, because of a misconfiguration on our side which stopped the logging, is the AWS SES notification of the disposition of its connection to Google. I suspect it would show just what Google reports in the header though.
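For anyone who wants to re-check the timeline, the intervals can be recomputed from the three timestamps quoted in this comment. This is only a minimal Python sketch: it assumes the Auth0 log times are UTC (the trailing Z) and that the -0800 offset in the Received header is accurate, and the variable names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Timestamps copied from the Auth0 log lines and the Received header above.
# Assumption: the Auth0 log entries are in UTC (they end in "Z").
code_sent = datetime(2017, 1, 20, 22, 33, 53, 100000, tzinfo=timezone.utc)
code_rejected = datetime(2017, 1, 20, 22, 42, 42, 956000, tzinfo=timezone.utc)

# The Received header carries a -0800 (PST) offset.
pst = timezone(timedelta(hours=-8))
google_received = datetime(2017, 1, 20, 14, 33, 54, tzinfo=pst)

# Aware datetimes can be subtracted across time zones directly.
delivery_delay = google_received - code_sent
time_to_attempt = code_rejected - code_sent

print("Auth0 to Google:", delivery_delay)          # 0:00:00.900000
print("Auth0 to failed attempt:", time_to_attempt)  # 0:08:49.856000
```

The first figure is the roughly one second hand-off to Google, and the second is the roughly nine minutes before the expired code was tried.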
Michael, thanks again for reporting this and sharing the email header details. If you see this happen again please let us know but at the moment it looks like a Gmail issue delaying display of the email.
Status: UNCONFIRMED → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
I've been seeing this for a few months, too, with a non-Google server. Is there any way we could get the timeout lengthened to, say, 12 or 24 hours? As long as the link only works once, I'm not sure what the security benefits of that short a timeout are…
(Just tested it now, entering "bwinton@latte.ca", then clicking the button in discourse at 10:39, but didn't receive the message at my server until 10:46:37.)
(In reply to Blake Winton (:bwinton) (:☕️) from comment #8)
> (Just tested it now, entering "bwinton@latte.ca", then clicking the button
> in discourse at 10:39, but didn't receive the message at my server until
> 10:46:37.)

If you'd like I can change the email address associated with your Discourse account to bwinton@mozilla.com, and then you can log in with your LDAP creds.
That would definitely let me reply to comments, which would be good, but it's not quite the same as a fix… ;) (Still, if you could do that I would appreciate it. Thank you!)
Done, you should've received an activation email.

> but it's not quite the same as a fix… ;)

The problem is that extending the timeout isn't really a fix either: if you're waiting longer than 5 minutes for the email that allows you to log in, that's the real problem. However, I think having users wait around for a long time but still be able to log in is better than them not being able to log in at all. @gene, is this something we can revisit?
Flags: needinfo?(gene)
My opinion is that changing the validity window doesn't buy us anything.

Say we change the validity window to an hour. If a user tries to log into a website and their receiving mail server doesn't deliver the email to them for, say, 10 minutes (due to anti-spam greylisting or some other issue), indeed the link would still be valid. But that experience in its entirety is still completely broken.

Here's the scenario we have now:
* A user wants to log in to a website; they begin the login process by asking for a login email.
* They wait around checking their email for 10 minutes.
* Somehow, after 10 minutes, the user hasn't sought out an alternative login method, used a different email address, or given up on using the site completely, and is still checking email intermittently and willing to wait 10 minutes to log in to a website.
* The user gets the email and clicks the link and it doesn't work.

Here's what we'd have if we increased the validity window:
* A user wants to log in to a website; they begin the login process by asking for a login email.
* They wait around checking their email for 10 minutes.
* Somehow, after 10 minutes, the user hasn't sought out an alternative login method, used a different email address, or given up on using the site completely, and is still checking email intermittently and willing to wait 10 minutes to log in to a website.
* The user gets the email and clicks the link and it works.

I really don't want to "shift blame" here, it's not my intent, but if a user has a mail server that won't deliver email that it receives to its users in a timely manner, the fix needs to happen on the mail server side. This is based on the assumption that AWS Simple Email Service is delivering the email to the receiving mail server in a timely manner. In the examples I've looked at and the diagnostics I've run, AWS SES delivers email promptly (in 1 second).

The IAM team discussed the idea of establishing monitoring that would trigger SES and determine delivery time, but declined to prioritize doing so as we just don't have an indicator yet that this might be a problem.

Anyhow,
* I totally understand that it's super frustrating to try to log into a website and not be able to
* Waiting 10 minutes, let alone 1 minute, to be able to log in to a site is unacceptable from a user experience standpoint
* Unless the cause of the delay is on either the Auth0 side or the AWS SES side of things, the amount of time the receiving mail server takes to deliver that mail to its users is unfortunately out of our (Mozilla's) control
* If anyone has data indicating that anyone experiences a delay of their mail server *receiving* email from AWS SES, do please let us know
* If you can contact the people that run your mail server that's causing the delay, you should do so, because these login emails are probably not the only emails being delayed to you, and you probably want to have it fixed.
Flags: needinfo?(gene)
For the record, I'm the people running my mail server, and the only site that is giving me problems is the Mozilla Discourse instance. Most other places don't seem to require email to sign in, or have their timeouts set for hours or days… If you'd like I'll happily provide server logs showing when the message hits my computer (quite a long time after I hit the button), and how quickly it gets to my inbox from there (basically instantaneously), which leads me to believe it's something on the AWS end. I've tried getting it to send me email again, and will send the full headers to you, if you think they would be useful in debugging why this is taking so long. More generally, though, I would suggest 10 minutes is still far too short. Why not make it 24 hours? What problem are you trying to solve by setting the timeout that short? Surely you agree that being able to log in is a far more acceptable user experience than not being able to, and asking random people to fix their mail servers seems unreasonable to me.
Clicked the button at 10:36, hit the server at 10:47, six minutes after the link expired. :( Full headers: -------------------------------------------- Return-Path: <0101015ccda35269-2c704181-3bdc-405f-bbd3-616b1b004d14-000000@us-west-2.amazonses.com> Delivered-To: bwinton@latte.ca Received: from localhost (localhost [127.0.0.1]) by aviva.latte.ca (Postfix) with ESMTP id 6604392ABF88 for <bwinton@latte.ca>; Wed, 21 Jun 2017 22:47:27 -0400 (EDT) Authentication-Results: latte.ca (amavisd-new); dkim=pass (1024-bit key) header.d=sso.mozilla.com header.b=DQLvrftQ; dkim=pass (1024-bit key) header.d=amazonses.com header.b=fRPiTDOl Received: from aviva.latte.ca ([127.0.0.1]) by localhost (latte.ca [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id lNXtHVtXwN2P for <bwinton@latte.ca>; Wed, 21 Jun 2017 22:47:25 -0400 (EDT) Received: from a27-23.smtp-out.us-west-2.amazonses.com (a27-23.smtp-out.us-west-2.amazonses.com [54.240.27.23]) (using TLSv1 with cipher ECDHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) by aviva.latte.ca (Postfix) with ESMTPS id 1D38192ABF7F for <bwinton@latte.ca>; Wed, 21 Jun 2017 22:47:24 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/simple; s=djibmc2gvzwjtdg2g6oqgcfqizc74ac3; d=sso.mozilla.com; t=1498098652; h=Content-Type:From:To:Subject:Message-Id:Date:MIME-Version; bh=C2mSJ0KSTstUYcV8lNnYhlP5x3cL4YGhDfjkIeJVN/8=; b=DQLvrftQVD0x6Xl9zlrzlfxM6jojHnpCyLLuqVF/v4R9YTOfs1Q0UAAdZRMpzTUX KX4bhWpK3jKzlCj8Q+Fbq2Lf3wnK3dYHZw6U/7BfW2px9KF5366nRKZoPsSNoTZtH+S ruthl9ecNTY1mBE97QF4pGKKbIi78D306TE3LD+Y= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/simple; s=hsbnp7p3ensaochzwyq5wwmceodymuwv; d=amazonses.com; t=1498098652; h=Content-Type:From:To:Subject:Message-Id:Date:MIME-Version:Feedback-ID; bh=C2mSJ0KSTstUYcV8lNnYhlP5x3cL4YGhDfjkIeJVN/8=; b=fRPiTDOlzq6bYA1RJ9PDHlkpyjYs1GlyQCsy+AkJWHyMySvpAZtBbR+NX1wGleoL ErnkfGiTh/IhZuHttRUQfjkx3Mu7LX503FjufnWODcaOZPpRRluGU8L34wE6e1AF2c5 
Rg2WR8K0qm5WvlqYzp1pjiar2jhTaYdEJndpq4xA= Content-Type: multipart/alternative; boundary="----sinikael-?=_1-14980986516290.983197633177042" From: Mozilla SSO <noreply@sso.mozilla.com> To: bwinton@latte.ca Subject: Welcome to discourse.mozilla-community.org Message-ID: <0101015ccda35269-2c704181-3bdc-405f-bbd3-616b1b004d14-000000@us-west-2.amazonses.com> X-Mailer: nodemailer (2.3.0; +http://nodemailer.com/; SES/1.3.0) Date: Thu, 22 Jun 2017 02:30:51 +0000 MIME-Version: 1.0 X-SES-Outgoing: 2017.06.22-54.240.27.23 Feedback-ID: 1.us-west-2.uHogrXyy+XKMRmZyxDC5UOCBhNiEyoF5M/tmxXz7Wmc=:AmazonSES -------------------------------------------- (I'll also note that my mail client checks for new mail every 10 minutes, so even if it took 0 seconds to arrive at my server, and I was checking email at the exact second it hit my client, and clicked the link as fast as I could, I would only be able to successfully log in about 50% of the time.) Please let me know if there's anything else I can do to help.
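As a rough cross-check of the headers above, the gap between the message's Date header and the earliest Received stamp added by aviva.latte.ca can be computed with Python's standard library. This is a sketch only: it assumes both the sending and receiving clocks were roughly accurate, and `parse_header_date` is a hypothetical helper name. The header-derived figure need not match the wall-clock times quoted in the comment.

```python
from email.utils import parsedate_to_datetime

def parse_header_date(value):
    """Parse an RFC 5322 date, dropping a trailing "(EDT)"-style comment."""
    return parsedate_to_datetime(value.split(" (")[0].strip())

# Values copied from the headers above.
date_header = "Thu, 22 Jun 2017 02:30:51 +0000"
first_received = "Wed, 21 Jun 2017 22:47:24 -0400 (EDT)"  # Postfix ESMTPS hop

# Both values carry UTC offsets, so the subtraction is timezone-aware.
handoff_delay = parse_header_date(first_received) - parse_header_date(date_header)
print(handoff_delay)  # 0:16:33
```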
(In reply to Gene Wood [:gene] from comment #12)
> My opinion is that changing the validity window doesn't buy us anything.

It does, in that it allows everyone to log in.

> Say we change the validity window to an hour. If a user tries to log into a
> website and their receiving mail server doesn't deliver the email to them
> for say 10 minutes (due to anti-spam greylisting or some other issue),
> indeed the link would still be valid. But that experience in its entirety
> is still completely broken.

It's not completely broken, because a user can still log in. We shouldn't punish a user with a bad UX, and then punish them with not being able to log in.

> Here's the scenario we have now:
> [...]
> The user gets the email and clicks the link and it doesn't work.
>
> Here's what we'd have if we increased the validity window
> [...]
> The user gets the email and clicks the link and it works.

I don't think you're giving enough importance to the link working, and so to a user being able to log in.

> I totally understand that it's super frustrating to try to log into a
> website and not be able to

It's more than frustrating, it's exclusionary. It has the potential to exclude (and given the support emails I've responded to, almost certainly has excluded) people from discussions happening on Discourse.

However bad the UX leading up to it is, and whatever other authentication options are available, the login link should always work as a last resort to log someone in (dependent, of course, on security considerations). If it doesn't, then there's really not a whole lot a user can do.
Status: RESOLVED → REOPENED
Ever confirmed: true
Flags: needinfo?(gene)
Resolution: WONTFIX → ---
Makes sense. I've increased the OTP Expiry in our dev environment and will test it out. Once it looks good I'll schedule a push to production.
Flags: needinfo?(gene)
So, as a question, why did you choose 15 minutes instead of 24 hours? Is there some sort of database limit where you don't want to have rows sticking around for too long because it takes up space?
The production change has been completed and the expiry time is now set to 15 minutes.
> Is there some sort of database limit where you don't want to have rows sticking around for too long because it takes up space?

There is a database but no, no limits (that I know of) on its size. The choice of 15 minutes was in service of upping by three times to something that has got to be the maximum amount of time a user is going to sit around waiting to be able to log into a site without giving up or trying some other auth method (GitHub, Google, re-initiating the passwordless flow).

If we continue to see users both getting emails delivered in excess of 15 minutes and those users having waited for an email to arrive in, say, 20 minutes without trying other login methods, we'll need to consider increasing the window size further (I'm not expecting this to happen).
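To illustrate what the change means for the delays reported earlier in this bug, here is a small Python sketch. The 5-minute previous window is an assumption inferred from the thread ("upping by three times" to 15 minutes); the delay figures are the approximate ones quoted in earlier comments, and the function names are hypothetical.

```python
from datetime import timedelta

OLD_WINDOW = timedelta(minutes=5)   # assumption: inferred from the thread
NEW_WINDOW = timedelta(minutes=15)  # value stated for production above

# Approximate delays quoted in earlier comments.
reported_delays = {
    "gmail.com report (code sent to attempted use)": timedelta(minutes=9),
    "latte.ca test, 10:39 to 10:46:37": timedelta(minutes=7, seconds=37),
    "latte.ca test, 10:36 to 10:47": timedelta(minutes=11),
}

def link_usable(delay, window):
    """True if the email shows up before the code expires."""
    return delay <= window

def poll_success_chance(window, poll_interval):
    """Chance that a mail client polling at a fixed interval fetches the
    email inside the window, assuming instant delivery and a random
    offset between sending and the next poll."""
    return min(1.0, window / poll_interval)

for label, delay in reported_delays.items():
    print(label, link_usable(delay, OLD_WINDOW), link_usable(delay, NEW_WINDOW))

# The 10-minute polling client mentioned earlier in the thread.
print(poll_success_chance(OLD_WINDOW, timedelta(minutes=10)))  # 0.5
print(poll_success_chance(NEW_WINDOW, timedelta(minutes=10)))  # 1.0
```

Under these assumptions, every reported delay misses a 5-minute window and fits a 15-minute one. The polling estimate matches the "about 50%" figure given earlier for a client that checks mail every 10 minutes.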
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED