Closed Bug 1518281 Opened 6 years ago Closed 6 years ago

[tracker] switch from rabbitmq to pub/sub

Categories

(Socorro :: Infra, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Socorro uses rabbitmq to queue crash ids for processing. The queues get populated in three ways:

  1. Antenna (the collector) saves crashes to AWS S3 which triggers Pigeon which tosses crash ids into the socorro.normal queue.

  2. A user views a crash report for a crash that hasn't been processed which adds a crash id to the socorro.priority queue.

  3. Someone requests a crash to be reprocessed either from the report view or the Reprocess API which adds crash ids to the socorro.reprocessing queue.

The queue gets consumed by the RabbitMQCrashStore in the processor.

This bug covers figuring out a plan to redo that using Amazon SQS and coming up with a rough work estimate.

Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P1

I think we want to maintain three separate queues: normal, priority, and reprocessing. We should set up 3 SQS queues.

We can set up AWS S3 event notification to add things to SQS. If we decide to go with this plan, we should nix bug #1513080 because we won't need to adjust Antenna at all. I wonder what the shape of those events in SQS looks like.
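
On the shape question: when S3 delivers straight to SQS, the message body is a JSON document with a top-level "Records" list, and each record carries the bucket name and the URL-encoded object key. Below is a minimal sketch of pulling crash ids out of one such body. The function name, the "/raw_crash/" filter, and the idea that the crash id is the last path segment of the key are assumptions for illustration, not something confirmed in this bug.

```python
import json
from urllib.parse import unquote_plus


def crash_ids_from_s3_event(body):
    """Yield crash ids from the JSON body of an S3 event notification.

    Hypothetical helper: assumes raw crashes are saved under a key that
    contains "/raw_crash/" and ends with the crash id.
    """
    event = json.loads(body)
    # S3 sends a one-off "s3:TestEvent" message without a "Records" key when
    # notifications are first configured, so default to an empty list.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        # Object keys arrive URL-encoded in the notification.
        key = unquote_plus(record["s3"]["object"]["key"])
        if "/raw_crash/" in key:
            # Assumption: the crash id is the final path segment.
            yield key.rsplit("/", 1)[-1]
```

A consumer would call this on each SQS message body and skip messages that yield nothing.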

We can write an SQSCrashStore that does what the RabbitMQCrashStore is currently doing where it cycles between the queues and yields crash ids.
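
A rough sketch of the "cycle between the queues and yield crash ids" idea, kept independent of any SQS client so it can run anywhere. The class name and the `receivers` interface are hypothetical; with SQS, each callable would wrap a receive-message call for one queue.

```python
class CyclingCrashQueue:
    """Cycle between several queues and yield crash ids.

    ``receivers`` maps a queue name to a zero-argument callable that returns
    a (possibly empty) list of crash ids. This interface is an assumption
    made for the sketch, not the existing crash store API.
    """

    def __init__(self, receivers):
        # dicts preserve insertion order, so queues are visited in the
        # order they were given.
        self.receivers = dict(receivers)

    def new_crashes(self):
        """Visit each queue once, in order, yielding (queue, crash_id)."""
        for name, receive in self.receivers.items():
            for crash_id in receive():
                yield name, crash_id
```

The processor would call `new_crashes()` in a loop. Listing the priority queue first means it is checked first on every pass.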

We can either continue with the current architecture and write SQS equivalents of PriorityjobRabbitMQCrashStore and ReprocessingOneRabbitMQCrashStore, or rework how that works and do something saner. I think I'm voting for the latter.

We need to be able to run an SQS equivalent in the local development environment. We're currently using localstack for a local S3 and it implements a local SQS. That might work nicely.

I want to talk with Miles and Brian.

Brian pointed out that we could skip SQS and switch directly to Pub/Sub. I'll look at that, too.

I worked through requirements and talked with Brian and Miles and the consensus is that we should switch to Pub/Sub.

We're going to try to do that this quarter. I'm going to rescope this bug to switching from rabbitmq to pubsub.

Priority: P1 → P2
Summary: look at switching from rabbitmq to sqs → switch from rabbitmq to pub/sub
Summary: switch from rabbitmq to pub/sub → [tracker] switch from rabbitmq to pub/sub

Rough scope of work:

  • bug #1527343: (Will) Write a class the collector can use to produce Pub/Sub messages.
  • bug #1527346: (Will) Write a class the webapp can use to add crash ids in the same way to the socorro.priority and socorro.reprocessing queues.
  • bug #1527345: (Will) Write a class the processor will use to consume crash ids from the Pub/Sub queues.
  • (Will) Write tests.
  • (ops) Set up socorro.normal, socorro.priority, and socorro.reprocessing queues for stage.
  • (ops) Set up socorro.normal, socorro.priority, and socorro.reprocessing queues for prod.
  • (ops; Will) Write configuration.
  • (ops; Will) Deploy to stage.
    • (ops) Set up queues in stage.
    • (ops; Will) Deploy antenna with Pub/Sub configuration and code.
    • (ops; Will) Deploy socorro processor and webapp with Pub/Sub configuration and code.
    • (ops; Will) Update Datadog graphs for stage.
    • (ops; Will) Verify.
  • (ops; Will) Deploy to prod.
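
To make the producer items above concrete, here is a minimal sketch of a class that publishes crash ids to one Pub/Sub topic per queue. The class name, the topic names, and the constructor arguments are hypothetical. The publisher is passed in so the sketch runs without credentials; in a real deployment it would presumably be a `google.cloud.pubsub_v1.PublisherClient`, which provides the `topic_path()` and `publish()` calls used here.

```python
class PubSubCrashQueue:
    """Publish crash ids to one Pub/Sub topic per queue (sketch only)."""

    def __init__(self, publisher, project_id, topic_names):
        # ``topic_names`` maps a queue name (e.g. "priority") to a topic
        # name. Both are placeholders chosen for this example.
        self.publisher = publisher
        self.topic_paths = {
            queue: publisher.topic_path(project_id, topic)
            for queue, topic in topic_names.items()
        }

    def publish(self, queue, crash_ids):
        """Publish each crash id as its own message; return message ids."""
        topic_path = self.topic_paths[queue]
        # Pub/Sub message payloads must be bytes.
        futures = [
            self.publisher.publish(topic_path, data=crash_id.encode("utf-8"))
            for crash_id in crash_ids
        ]
        # Block until the service has accepted every message.
        return [future.result() for future in futures]
```

Injecting the publisher also makes it straightforward to substitute a fake in tests or an emulator-backed client in the local development environment.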

I'll break this up into bugs that block this bug.

Depends on: 1527343
Depends on: 1527345
Depends on: 1527346
Depends on: 1528243
Depends on: 1535727
Depends on: 1538202
Depends on: 1539153
Depends on: 1540831
Depends on: 1540833

All bugs have been completed and we've pushed everything to prod. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

Here's the project plan to switch to PubSub in case future me needs to find it again: https://docs.google.com/document/d/13_PbjSncCH60tLjkWst91B6TQQDaJYlSuSvPLqaWN5A/edit
