Closed Bug 1447748, Opened 7 years ago, Closed 7 years ago
socorro infrastructure migration -- March 29th, 2018
Categories: Socorro :: Infra (task)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: willkg; Assignee: Unassigned
For the last year, we've been making changes to Socorro code and infrastructure to eventually migrate to a new infrastructure. This work is covered in tracker bug #1391034.
The new infrastructure simplifies a lot of the ops side of maintaining our crash ingestion pipeline and brings Socorro's infrastructure in line with how we do infrastructure for other Mozilla projects. That makes developing and maintaining Socorro a lot easier.
Further, it cleans up some architecture issues. For example, in the new infrastructure, Socorro's components (processor, webapp) will automatically scale with load.
The current target date for migration to the new infrastructure is March 26th, 2018.
This bug covers migration prep.
Reporter
Comment 1 • 7 years ago
We have a go/no-go meeting scheduled for March 26th, at 5:00pm UTC (10:00am Pacific / 1:00pm Eastern).
Reporter
Comment 2 • 7 years ago
We had the go/no-go meeting. Summary of the meeting is as follows:
1. Copying the contents of the S3 crash bucket has not completed. The S3DistCp approach that worked for -stage isn't working for -prod: each attempt takes between 16 and 23 hours to fail, copies no data in that time, and fails with error messages that are mysterious at best.
2. Over the last 5 days, we've talked with our AWS colleagues about the keys in our bucket (we have terrible keys--bug #1448421), S3 infrastructure throttle limits, and other possible causes. Eventually, we concluded that S3DistCp wasn't going to work for us and that we should try something different. One suggestion was to roll our own copy system using the aws cli or boto (see the sketch after this list). Another was to try out s3s3mirror.
3. Miles launched an s3s3mirror run over the weekend. It's still running. He and Brian did some maths and estimate that it will finish in the next couple of days.
4. We're a "no-go" for doing the migration today.
5. Will will update this bug, the stability list, and the crash stats status message.
6. When the initial S3 copy is done, we'll schedule a new migration attempt.
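For illustration, here's a minimal sketch of the "roll our own copy system" idea mentioned in item 2, using boto3. This is not what we're running; the bucket names and worker count are hypothetical placeholders, not Socorro's actual configuration.

import concurrent.futures

import boto3

SOURCE_BUCKET = "example-socorro-prod"      # hypothetical name
DEST_BUCKET = "example-socorro-new-prod"    # hypothetical name

s3 = boto3.client("s3")

def copy_key(key):
    # boto3's managed copy does a server-side copy (object data never
    # leaves S3) and switches to multipart automatically for large
    # objects, which avoids copy_object's 5 GB single-request limit.
    s3.copy(
        CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        Bucket=DEST_BUCKET,
        Key=key,
    )

def main():
    # Page through every key in the source bucket and fan the copies
    # out over a thread pool. Listing is sequential; copies overlap.
    paginator = s3.get_paginator("list_objects_v2")
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
        for page in paginator.paginate(Bucket=SOURCE_BUCKET):
            for obj in page.get("Contents", []):
                pool.submit(copy_key, obj["Key"])

if __name__ == "__main__":
    main()

A real run would also need retries with backoff to handle S3 throttling on a bucket this size, which is part of why tools like S3DistCp and s3s3mirror are attractive in the first place.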
Reporter
Comment 3 • 7 years ago
The copy isn't quite done, but it's almost done. Miles and Brian migrated -stage to -new-stage today and that went well.
Given that, and since we want to give users some lead time and we're running out of time to migrate this week, we're scheduling the next go/no-go meeting (followed by a migration attempt if we're "go") for March 28th, at 5:00pm UTC (10:00am Pacific / 1:00pm Eastern).
Summary: socorro infrastructure migration -- March 26th, 2018 → socorro infrastructure migration -- March 28th, 2018
Reporter
Comment 4 • 7 years ago
We're super close, but decided to push it off until Thursday.
New migration window: Thursday, March 29th, at 3:00pm UTC (8:00am Pacific / 11:00am Eastern).
During the migration window, the Crash Stats site will continue to work, but Socorro will not be processing new incoming crashes. After the outage window, we'll cut over to the new infrastructure and start the processors. The new Crash Stats site will see new crashes, but will be missing some crash data in S3. Super Search and anything else that uses Elasticsearch will be fine, but the report view and the API endpoints that pull from S3 might not work, depending on whether the data they're looking for has been copied yet. That likelihood will decrease rapidly over time; by next week we should be fine.
We'll set up a process for moving data over by hand (sketched below) if people *need* to see crash data from a report view for crashes where not all the data has made it over yet.
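For the curious, a hedged sketch of what a by-hand mover for a single crash might look like with boto3. The key prefixes below are made up for illustration -- Socorro's real S3 key layout (see bug #1448421) is different -- as are the bucket names and the crash id.

import boto3

SOURCE_BUCKET = "example-socorro-prod"      # hypothetical name
DEST_BUCKET = "example-socorro-new-prod"    # hypothetical name

s3 = boto3.client("s3")

def move_crash(crash_id):
    # Hypothetical prefixes for the raw crash, its dumps, and the
    # processed crash; the real layout differs (bug #1448421).
    prefixes = [
        "v1/raw_crash/" + crash_id,
        "v1/dump/" + crash_id,
        "v1/processed_crash/" + crash_id,
    ]
    for prefix in prefixes:
        resp = s3.list_objects_v2(Bucket=SOURCE_BUCKET, Prefix=prefix)
        for obj in resp.get("Contents", []):
            # Server-side copy of each object belonging to this crash.
            s3.copy_object(
                Bucket=DEST_BUCKET,
                Key=obj["Key"],
                CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            )

move_crash("a1b2c3d4-e5f6-47a8-9b0c-d1e2f3180329")  # made-up crash id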
Summary: socorro infrastructure migration -- March 28th, 2018 → socorro infrastructure migration -- March 29th, 2018
Comment 5 • 7 years ago
What will happen to crashes that occur during the migration while the processors are offline? Will they be queued and then processed after the migration when the processors are back online? Or will those crash reports be lost?
Flags: needinfo?(willkg)
Reporter
Comment 6 • 7 years ago
Trevor: Socorro will continue to collect crashes and queue them up for processing. Processing will be started after the migration. Nothing gets lost.
Flags: needinfo?(willkg)
Reporter
Comment 7 • 7 years ago
We're done with the migration! We finished a while ago, but I wanted to wait before notifying anyone to give us some time to make sure everything was ok.
Everything is great!... with one caveat: data on S3 for crashes that came in between 2018-03-24 and 2018-03-29 might be missing. We're backfilling that data now. In the meantime, anything on Crash Stats that uses Elasticsearch or Postgres (Super Search, the API endpoints backed by Elasticsearch or Postgres, etc) should work fine, but anything that looks at data on S3 (the crash report view, /api/RawCrash/, /api/ProcessedCrash/, etc) might error out if the data is missing.
For example, the crash report view will show this:
https://screenshots.firefox.com/D5AKOXiXcFEVUsbb/crash-stats.mozilla.com
If you see that and really need to look at that crash data right now (as opposed to when the backfill is done), then let us know and we can move it over manually.
I'll keep this open until we're done with the backfill.
Reporter
Comment 8 • 7 years ago
Because of the Alice-in-Wonderland-style data flow we have for ADI, copying the -prod database to -new-prod caused RawADIMoverCronApp not to run today (2018-03-30), so we didn't have ADI data for 2018-03-29 in -new-prod for a few hours.
The data is now in -new-prod and I'm pretty sure the job should run correctly tomorrow. I'll check it tomorrow.
More details about that in bug #1450243.
S3 crash data for the last week is still being copied to -new-prod.
Other than that, everything looks good so far!
Comment 9 • 7 years ago
The data copy is fully completed. I think we're good here.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reporter
Comment 10 • 7 years ago
Fantastic! Super smooth migration! Congratulations all around!