[tracker] switch from pubsub to aws sqs
Categories
(Socorro :: General, task, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
We were planning to move Socorro to GCP and, as a step toward that, we switched from RabbitMQ to Pub/Sub for queuing crash ids to process.
We're no longer planning to migrate to GCP, so we should probably switch from Pub/Sub to AWS SQS.
This bug tracks that project.
Comment 1•5 years ago (Assignee)
Working out a project plan in this document: https://docs.google.com/document/d/1HZ3C-Qg0phMta0uRi5jBarH35rk0OwTRvpGKQtqZTf0/edit#
Comment 2•5 years ago (Assignee)
In working on the SQS plan, I discovered we had a bunch of technical debt issues we needed to solve first:
- fix the s3mock issue in Antenna so we could update to the latest boto3/botocore (bug #1417807)
- update Socorro from boto to boto3, which involved a semi-extensive rewrite (bug #1433148) and also fixes AWS auth things (bug #1425475)
I've finished that work and it's sitting on stage now. It'll go to prod in December.
However, I'm pretty sure my window for working on this SQS project is over and I need to switch back to MLS. I'm hoping I find time to pick this back up in 2020 and finish up the SQS project plan.
Comment 3•5 years ago (Assignee)
I landed the last of the required code changes for supporting AWS SQS. I updated the migration plan. I think we're all set to migrate in January 2020. I'll work with Brian to schedule that.
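For readers unfamiliar with SQS, here is a minimal sketch of what queuing and consuming crash ids over SQS can look like. It is not Socorro's actual code. The function names `publish_crash_id` and `process_available` are hypothetical, and `client` is assumed to be a boto3 SQS client (for example from `boto3.client("sqs")`) or anything with the same `send_message`, `receive_message`, and `delete_message` methods.

```python
def publish_crash_id(client, queue_url, crash_id):
    """Queue one crash id for processing (hypothetical helper)."""
    client.send_message(QueueUrl=queue_url, MessageBody=crash_id)


def process_available(client, queue_url, handle_crash_id, wait_seconds=10):
    """Receive one batch of crash ids, handle each, then delete it.

    Returns the number of messages handled. A message is deleted only
    after handle_crash_id returns, so a failure leaves it on the queue
    to be redelivered after the visibility timeout.
    """
    resp = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,        # SQS allows at most 10 per call
        WaitTimeSeconds=wait_seconds,  # long polling, at most 20 seconds
    )
    handled = 0
    # The "Messages" key is absent when the queue returned nothing.
    for msg in resp.get("Messages", []):
        handle_crash_id(msg["Body"])
        client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=msg["ReceiptHandle"],
        )
        handled += 1
    return handled
```

Passing the client in, rather than creating it inside the functions, keeps the sketch testable against an in-memory stand-in.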
Comment 4•5 years ago (Assignee)
We switched stage over to AWS SQS yesterday. We switched the collector first (5:18pm EST), waited for the queue to dry up, then switched the processor (5:35pm EST), then the webapp (6:31pm EST).
The Pub/Sub queue graphs continue to show a non-zero number that fluctuates between roughly 14 and 30. I don't know where those crash ids are coming from or where they're going. Socorro has a job that runs nightly to catch any crash reports that weren't processed, so I don't think there's any risk here. It's just curious.
Everything else looks fine.
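The "wait for the queue to dry up" step in the cutover above can be sketched as a polling loop. This is an illustrative sketch, not the procedure actually used here, which may well have been done by watching graphs. `wait_for_drain` and `sqs_queue_depth` are hypothetical names. The queue drained during this migration was Pub/Sub, so the SQS helper is shown only as one possible depth source, assuming a boto3 SQS client is passed in.

```python
import time


def sqs_queue_depth(client, queue_url):
    """Approximate count of visible plus in-flight messages in an SQS queue."""
    resp = client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )
    attrs = resp["Attributes"]  # SQS returns attribute values as strings
    return (
        int(attrs["ApproximateNumberOfMessages"])
        + int(attrs["ApproximateNumberOfMessagesNotVisible"])
    )


def wait_for_drain(get_depth, poll_seconds=30.0, timeout_seconds=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_depth() until it reports 0 or the timeout passes.

    Returns True if the queue drained, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if get_depth() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

With something like this, the order described above would be: point the collector at the new queue, call `wait_for_drain` on the old queue, then switch the processor, then the webapp. Queue counts of this kind are approximate, so a zero reading is worth confirming with a second check before moving on.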
Comment 5•5 years ago (Assignee)
Regarding the stage migration, the graphs in Grafana for Pub/Sub were showing prod--not stage. That's why they looked curious after the migration.
We switched prod over to AWS SQS today. I sent an email to the stability mailing list and notified #stability and #breakpad on chat.mozilla.org. First we switched the collector over (9:35am EDT), waited for the Pub/Sub queue to dry up, then switched the processor (9:55am EDT), then the webapp (10:10am EDT).
Pub/Sub queue hit zero. Nothing in Sentry. Grafana graphs look fine.
I notified the stability mailing list and #stability and #breakpad on chat.mozilla.org that the maintenance window was done.
I'll keep an eye on things for the rest of today and write up bugs for removing Pub/Sub configuration, code, and documentation.
Updated•5 years ago (Assignee)
Comment 6•5 years ago (Assignee)
Everything looks fine.
We pushed all the Pub/Sub code removal to production for Socorro and Antenna.
We're done here. Marking as FIXED.