Closed Bug 1583930 Opened 5 years ago Closed 4 years ago

[tracker] switch from pubsub to aws sqs

Categories

(Socorro :: General, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

We were planning to move Socorro to GCP and towards that, we switched from using RabbitMQ to using PubSub for queuing crash ids to process.

We're no longer planning to migrate to GCP, so we should probably switch from PubSub to AWS SQS.

This issue covers tracking that project.

Depends on: 1433148
Depends on: 1425475
Depends on: 1517807

In working on the SQS plan, I discovered we had a bunch of technical debt issues we needed to solve first:

  • fix the s3mock issue in Antenna so we could update to the latest boto3/botocore (bug #1417807)
  • update Socorro from boto to boto3, which involved a semi-extensive rewrite (bug #1433148) and also fixes the AWS auth issues (bug #1425475)

I've finished that work and it's sitting on stage now. It'll go to prod in December.

However, I'm pretty sure my window for working on this SQS project is over and I need to switch back to MLS. I'm hoping I find time to pick this back up in 2020 and finish up the SQS project plan.

Depends on: 1598765
Depends on: 1601455
Depends on: 1602120
Depends on: 1602121

I landed the last of the required code changes for supporting AWS SQS. I updated the migration plan. I think we're all set to migrate in January 2020. I'll work with Brian to schedule that.
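
For readers unfamiliar with the shape of the change, here is a minimal sketch of queuing and consuming crash ids over SQS. It is not Socorro's actual code: the function names and the queue URL argument are hypothetical, and `client` is assumed to behave like `boto3.client("sqs")`. Only the `send_message`, `receive_message`, and `delete_message` calls are taken from the boto3 SQS client.

```python
# Sketch only: `client` is expected to behave like boto3.client("sqs").
# Function names and the queue URL are hypothetical, not Socorro's own.


def publish_crash_id(client, queue_url, crash_id):
    """Queue one crash id for processing."""
    client.send_message(QueueUrl=queue_url, MessageBody=crash_id)


def consume_crash_ids(client, queue_url, handle, max_messages=10):
    """Receive one batch of crash ids, handle each, then delete it.

    Returns the number of messages handled in this batch.
    """
    resp = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows at most 10 per call
        WaitTimeSeconds=5,  # long polling instead of a tight loop
    )
    # The "Messages" key is absent when the queue returned nothing.
    messages = resp.get("Messages", [])
    for msg in messages:
        handle(msg["Body"])
        # Delete only after the handler returns, so a failure leaves the
        # message to reappear once its visibility timeout expires.
        client.delete_message(
            QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
        )
    return len(messages)
```

Passing the client in as an argument keeps the sketch testable with a stand-in object and leaves credentials and region configuration to the caller.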

Depends on: 1605716
Depends on: 1617008
Depends on: 1617187
Depends on: 1617977
Depends on: 1618201

We switched stage over to AWS SQS yesterday. First we switched the collector (5:18pm EST), waited for the Pub/Sub queue to dry up, then switched the processor (5:35pm EST), then the webapp (6:31pm EST).
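
The "wait for the queue to dry up" step between switching the collector and switching the processor can be sketched as a simple poll loop. This is an illustration, not the procedure actually used: `get_depth` is a hypothetical callable that returns the current number of queued crash ids (for the old queue that number would have to come from whatever monitoring is available), and the poll interval and timeout are arbitrary.

```python
import time


def wait_until_drained(get_depth, poll_seconds=30, timeout_seconds=3600,
                       sleep=time.sleep):
    """Poll get_depth() until it returns 0.

    Returns True once the queue is empty, or False if it is still
    non-empty after roughly timeout_seconds of waiting.
    """
    waited = 0
    while True:
        if get_depth() == 0:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

The ordering matters: once the collector publishes only to the new queue, the processor can keep consuming the old queue until it is empty, so no crash ids are stranded when the processor is switched.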

The Pub/Sub queue graphs still show a non-zero count that fluctuates between roughly 14 and 30. I don't know where those crash ids are coming from or where they're going. Socorro has a job that runs nightly to catch any crash reports that weren't processed, so I don't think there's any risk here. It's just curious.

Everything else looks fine.

Regarding the stage migration, the graphs in Grafana for Pub/Sub were showing prod--not stage. That's why they looked curious after the migration.

We switched prod over to AWS SQS today. I sent an email to the stability mailing list and notified #stability and #breakpad on chat.mozilla.org. First we switched the collector over (9:35am EDT), waited for the Pub/Sub queue to dry up, then switched the processor (9:55am EDT), then the webapp (10:10am EDT).

Pub/Sub queue hit zero. Nothing in Sentry. Grafana graphs look fine.

I notified the stability mailing list and #stability and #breakpad on chat.mozilla.org that the maintenance window was done.

I'll keep an eye on things for the rest of today and write up bugs for removing Pub/Sub configuration, code, and documentation.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

Everything looks fine.

We pushed all the Pub/Sub code removal to production for Socorro and Antenna.

We're done here. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED