Closed Bug 1672239 Opened 5 years ago Closed 4 years ago

Phabricator emails: Be more resilient to failures caused by specific events

Categories

(Conduit :: Phabricator, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mhentges, Assigned: mhentges)

References

Details

(Keywords: conduit-triaged)

Attachments

(3 files)

Though processing each event (Phabricator-side) and rendering each event (phab-email-side) is handled safely, there's other cases in which issues can be encountered that will cause the email stream to stall (email send, parsing feed stories, etc).

We should handle these cases to ensure that the email stream never stalls.

Assignee: nobody → mhentges
Status: NEW → ASSIGNED
Blocks: 1667229
Keywords: conduit-triaged

Had a good chat with Zeid, here's where we landed:

  1. Have a failsafe system on the Phabricator endpoint. For each event:
    1. Parse out the revision number, title, and recipients of the event. If this fails, skip the event, log the failure in Sentry and bump a metric in statsd.
    2. Attempt to fully populate the event with relevant data. If this fails, log the failure in Sentry and bump a metric in statsd. Continue to the next step, even if full population fails.
    3. Send the event to the email service with both the "basic" data (revision number, title, recipients) and "complete" data if possible from the previous step^.
  2. Have a failsafe system on the email service. For each raw event data from the endpoint:
    1. If the "complete" data is available:
      1. Attempt to parse the "complete" data, render the HTML and text, and convert it to MIME. Send that email, if successful, and continue to the next event
      2. If unsuccessful, log the failure in Sentry and bump a metric in statsd.
    2. If the "complete" data isn't available (or it couldn't be parsed/rendered), parse the "basic" data, then render a barebones email saying "something happened to <link to revision>". Send that email, if successful, and continue to the next event
    3. If the above step failed^, skip the event, log the failure in Sentry and bump a metric in statsd.
  3. Handle Amazon SES errors appropriately:
    • For errors like MessageRejected, build the "basic" email and attempt to send that instead (skip the email if MessageRejected is still encountered).
    • For errors like AccountSendingPaused or ConnectionClosedError, keep retrying until the send succeeds.
      • This has a downside: the longer it takes before this issue is resolved (network error, creds error, settings error, whatever), the more "behind" the service will be in the email queue. Since the Phabricator email endpoint uses some "current" state (rather than "at the time" state) when populating events, the emails will be slightly incorrect. For example, a user might receive an email showing that they're "added as reviewer", but it'll show them having already accepted the patch if they had done so before the email was sent.
  4. This is a lower priority and won't block the release, but to avoid the "stale email" issue described in that last bullet point^, we can add a queuing system:
    1. Perform the regular "Phabricator query, render (with failsafe) logic as before, but put every completed email into a queue in the database. Keep doing this, even when emails are unable to be sent due to an Amazon SES issue.
    2. A separate process will work through the queue, sending emails via SES, retrying as necessary and only popping emails from the queue when they're sent
    • The benefit of this strategy is that, though emails may be delayed when SES failures occur, once sent they will appear as they were rendered when the event occurred.

This is to make it consistent with it's class's name.

Depends on D110275

Adjusts the structure of the output so that the consuming service can
optionally use a "minimal context" to produce emails if the full context
can't be rendered.

Changes exception-handling behaviour:

  • If an error is encountered while resolving email context, safely fall
    back to just providing a "minimal context".
  • If an error is encountered before the "minimal context" can be
    resolved, no longer skip the event. Instead, respond with an error,
    and block the email queue until the issue is resolved.

One note, now that this is ready for review:

[in Phabricator] Attempt to fully populate the event with relevant data. If this fails, log the failure in Sentry and bump a metric in statsd.

We don't have statsd attached to Phabricator yet, so I'll defer this. Besides, we'll know that event-population is failing because of events in Sentry.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Attachment #9214246 - Attachment description: Bug 1672239: Add "minimalContext" to each email event → WIP: Bug 1672239: Add "minimalContext" to each email event
Attachment #9214246 - Attachment description: WIP: Bug 1672239: Add "minimalContext" to each email event → Bug 1672239: Add "minimalContext" to each email event
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: