remove support for private symbols bucket in tecken
Categories
(Tecken :: General, task, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
Attachments
(4 files)
Per bug #1838892, we're dropping support for the private symbols bucket.
This bug covers:
- demote the accounts that were noted in UPLOAD_URL_EXCEPTIONS so they no longer have upload privileges
- remove the UPLOAD_URL_EXCEPTIONS configuration variable and everything that uses it in the tecken codebase
- remove the access=public/private stuff from SYMBOL_URLS handling
- remove the UPLOAD_URL_EXCEPTIONS configuration in the infrastructure codebase
- write up a DSRE bug to remove the private symbols bucket for stage and prod
Comment 1 • 1 year ago (Assignee)
Grabbing this to work on now. I removed uploader privileges for anyone listed in UPLOAD_URL_EXCEPTIONS.
Comment 2 • 1 year ago (Assignee)
The public/private stuff affects how the StorageBucket works. Switching everything to public (aka getting rid of private) causes many, many tests to fail: they all depended on the bucket being private, which forces the boto codepath for download, and that codepath is mocked out in the tests.
In order to remove the public/private stuff, I need to rewrite the tests so they're not using botomock for upload and download and instead using localstack (a fake s3 service) for upload and http for download (bug #1834626). I wanted to do that anyway, but not necessarily in this pass, but the two things are tangled so I'm just going to do it all.
Comment 5 • 1 year ago (Assignee)
This needs extensive testing on stage.
Before landing this, I ran the system tests against stage to get a baseline for what the output looks like. Everything looks ok.
Once this deploys to stage, I will:
- re-upload those same zip files to make sure duplicate-identification works
- run the system tests with a new set of zip files to verify everything (upload and download) works
- go through the site to make sure all the pages work
Comment 6 • 1 year ago (Assignee)
I ran the system tests on stage before and after. All the tests did fine with the uploading API. However, the downloading API is failing.
On stage, the first url in SYMBOL_URLS is:
https://s3-us-east-1.amazonaws.com/symbols-webeng-stage
If I try to HEAD a symbol file in the Tecken symbols stage public bucket, I get a DNS error:
curl --head 'https://s3-us-east-1.amazonaws.com/symbols-webeng-stage/v1/libfreebl3.dylib/D260D4EA1F983017A5C99E944CF653B30/libfreebl3.dylib.sym'
curl: (6) Could not resolve host: s3-us-east-1.amazonaws.com
Tecken prod has this as the first url in SYMBOL_URLS:
https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public
If I do the same thing to the Tecken symbols prod public bucket, it works fine:
curl --head 'https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/libfreebl3.dylib/D260D4EA1F983017A5C99E944CF653B30/libfreebl3.dylib.sym'
HTTP/1.1 200 OK
x-amz-id-2: og8g+iJaAQSz599xyOItWA3xrR2ddTtYW7eUk6iJbFbBmbcuZhXF617Wg+lbwMrwz0IkJwmaWKI=
x-amz-request-id: 7MNECFR7YRHP8QJ7
Date: Mon, 31 Jul 2023 14:19:47 GMT
Last-Modified: Thu, 27 Jul 2023 17:18:54 GMT
x-amz-expiration: expiry-date="Sun, 27 Jul 2025 00:00:00 GMT", rule-id="first_ia_then_delete"
ETag: "12a8670fac7df82bc5b80bf202b6a8e6"
x-amz-server-side-encryption: AES256
x-amz-meta-original_md5_hash: b6e10cca0af2c864f586302446aa82f6
x-amz-meta-original_size: 771035
Content-Encoding: gzip
x-amz-version-id: ar1QIPcdxx34ARB85_pqhtx6a_ro0Caq
Accept-Ranges: bytes
Content-Type: text/plain
Server: AmazonS3
Content-Length: 201727
I think one of a couple of things is wrong with Tecken stage:
- the SYMBOL_URLS value is wrong; wrong region? wrong structure?
- the symbols-webeng-stage bucket policy is different than the one for org.mozilla.crash-stats.symbols-public and doesn't allow HTTPS access, though I don't know why this would cause a DNS error
My hunch is that the SYMBOL_URLS value is wrong and has been ever since the stage environment was set up years ago. Because the url didn't have access=public in it, the bucket was being checked using boto, which ignored the host in the url and worked fine. Now that that code is out (no more public/private), it's using HTTP HEAD, which uses the host in the url and fails.
I wrote up a Jira ticket to get some SRE help to look into it.
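To illustrate the hunch above: a boto-style lookup only needs the bucket name, which can be read from the path of the SYMBOL_URLS entry, while an HTTP HEAD has to resolve the host in the url. The sketch below is illustrative only and is not Tecken's code. The function names are made up for this example, and it uses the standard library's urllib in place of requests.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def bucket_name_from_url(base_url: str) -> str:
    """Return the first path segment, which path-style S3 urls use as the bucket name.

    A client that only uses this value never looks at the host part of the url.
    """
    return urlparse(base_url).path.strip("/").split("/")[0]


def symbol_url(base_url: str, key: str) -> str:
    """Join a SYMBOL_URLS entry and an object key into the url an HTTP HEAD would hit."""
    return base_url.rstrip("/") + "/" + key.lstrip("/")


def exists_via_http_head(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP HEAD on url answers 200, False on 403 or 404.

    DNS and connection failures are not swallowed: they surface as
    urllib.error.URLError, which is how a bad host would show up here.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as exc:
        if exc.code in (403, 404):
            return False
        raise
```

With the stage value, bucket_name_from_url returns symbols-webeng-stage regardless of whether the host resolves, while exists_via_http_head depends on the whole url being reachable.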
Comment 7 • 1 year ago (Assignee)
There were two issues:
- We were using a legacy endpoint of the form s3-us-east-1... with a hyphen between the service and region, and that doesn't work with us-east-1. We were advised to switch to s3.us-east-1... (see https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#s3-legacy-endpoints).
- Once we fixed the endpoint url, we hit an HTTP 403 when requesting things in the bucket. Harold added a bucket policy.
Now the download API works.
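The endpoint fix amounts to swapping the hyphen after "s3" for a dot in the host. A minimal sketch of that rewrite, assuming path-style urls like the ones in SYMBOL_URLS; the function name and the regular expression are illustrative and do not come from the Tecken or infrastructure codebases:

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches legacy dash-style regional hosts such as s3-us-east-1.amazonaws.com.
# Hosts with a port or credentials do not match and are left alone.
_LEGACY_S3_HOST = re.compile(r"^s3-([a-z]{2}(?:-[a-z]+)+-\d+)\.amazonaws\.com$")


def modernize_s3_endpoint(url: str) -> str:
    """Rewrite s3-<region>.amazonaws.com to s3.<region>.amazonaws.com.

    Urls that do not use the legacy host form are returned unchanged.
    """
    parts = urlsplit(url)
    match = _LEGACY_S3_HOST.match(parts.netloc)
    if match is None:
        return url
    region = match.group(1)
    return urlunsplit(parts._replace(netloc=f"s3.{region}.amazonaws.com"))


if __name__ == "__main__":
    print(modernize_s3_endpoint("https://s3-us-east-1.amazonaws.com/symbols-webeng-stage"))
```

This would also rewrite the prod url, which uses the dash form for us-west-2. That form resolves today, as the curl output in comment 6 shows, so rewriting it is optional.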
One interesting thing is that this means that the download API never worked on stage, but because the stage SYMBOL_URLS has the second url as prod, it would always degrade to redirecting to prod. Fun!
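The degrade-to-prod behavior follows from checking the SYMBOL_URLS entries in order and using the first one that has the file. A rough sketch of that idea, with a hypothetical function name and an injected exists callable standing in for whatever check the downloader really performs:

```python
from typing import Callable, Iterable, Optional


def first_url_with_symbol(
    base_urls: Iterable[str],
    key: str,
    exists: Callable[[str], bool],
) -> Optional[str]:
    """Return the url of the first bucket that has key, or None if none do.

    base_urls is checked in order, so an earlier entry (stage) wins over a
    later one (prod) only when the earlier one actually reports the file.
    """
    for base in base_urls:
        candidate = base.rstrip("/") + "/" + key.lstrip("/")
        if exists(candidate):
            return candidate
    return None
```

If the check against the first entry never succeeds, every request falls through to the second entry, which matches what stage was doing.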
I need to go through and test everything on stage. I'm also going to make some minor changes to the systemtest scripts.
Comment 10 • 1 year ago (Assignee)
I went through and tested everything on stage:
- system test against stage:
  - tests uploading symbol files using the upload API
  - tests uploading symbol files using the upload API and the upload-by-download path
  - tests the download API
- I built a new symbol zip file and uploaded it multiple times; second+ upload didn't reupload because the items were the same as what was in the bucket already
- I verified that the download API lets you download symbol files from the stage symbols bucket and can degrade to the prod symbols bucket when the requested file isn't in the stage bucket
- I went through the upload, upload files, file, and django admin site status pages to make sure they all worked correctly.
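The duplicate-upload check in the list above can be pictured as comparing the local file against what the bucket already reports for that key. This is a sketch under assumptions: it assumes the comparison uses the original_size and original_md5_hash metadata that show up in the curl output in comment 6, and the function name is hypothetical. The real check in Tecken may compare different fields.

```python
import hashlib
from typing import Mapping, Optional


def needs_upload(local_bytes: bytes, remote_metadata: Optional[Mapping[str, str]]) -> bool:
    """Return True if the file should be uploaded, False if the bucket copy matches.

    remote_metadata is None when the key is not in the bucket yet. Otherwise it
    is assumed to carry original_size and original_md5_hash as strings.
    """
    if remote_metadata is None:
        return True
    same_size = remote_metadata.get("original_size") == str(len(local_bytes))
    same_hash = remote_metadata.get("original_md5_hash") == hashlib.md5(local_bytes).hexdigest()
    return not (same_size and same_hash)
```

Under these assumptions, a second upload of an unchanged zip would skip every file in it, because each file's size and hash match the metadata stored by the first upload.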
Everything looks good. I think this is ready to go to production now. Whew.
Once we push this to production, we still need to do the following cleanup tasks:
- remove the access=public/private stuff from SYMBOL_URLS handling
- remove the UPLOAD_URL_EXCEPTIONS configuration in the infrastructure codebase
- write up a DSRE bug to remove the private symbols bucket for stage and prod
Comment 11 • 1 year ago (Assignee)
This went to prod in bug #1847570.
There's an increase in mean time (ms) to handle a download API request going from 10ms to ~20ms.
It's possible this is because we just did a deploy and we've got a cold cache or something like that. It's also possible that this is because of a code change. Previously, the code did a HeadObject call using the boto library client. Now it does an HTTP HEAD call using requests. It's possible the former was faster--maybe it involved less set-up/take-down time. It's also possible both are true.
I checked the Socorro dashboard and there's no noticeable change in the stackwalker runtime.
I'll keep an eye on it, but even if it is a regression, I think it's ok for now since it's a ~10ms difference on something that's pretty fast. We can optimize it after we rewrite the storage layer, which will give us a lot more options for how to figure out if a file exists in a bucket, and after we migrate to GCP.
Also, it looks like we stopped emitting some metrics:
tecken.symboldownloader_exists
tecken.symboldownloader_exists_cache_hit
tecken.symboldownloader_exists_cache_miss
Other than that, everything looks ok at this point.
I want to re-add those metrics. Also, since this is a large change I'll take a look at everything again tomorrow.
Comment 12 • 1 year ago (Assignee)
Re: the regression, the graphs settled back down at a mean of 10ms, so it was probably a cold cache or something like that.
Comment 13 • 1 year ago (Assignee)
I figured out what happened with the missing metrics. We had two different "does this thing exist?" functions. Only one of them was being used, and it emitted those metrics. When I removed the private/public support code, that changed which one is used, so now the other one is used. It emits different metrics:
tecken.symboldownloader_public_exists
tecken.symboldownloader_public_exists_cache_hit
tecken.symboldownloader_public_exists_cache_miss
I'll fix this. It probably also means I need to make sure that cache busting is using the correct function and works.
Comment 15 • 1 year ago (Assignee)
willkg merged PR #2772: "bug 1843356: fix check_url_head metrics" in 1907629.
This changes check_url_head to use the old metrics keys that were emitting data in our dashboards.
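For context on why the key names matter: the cache hit and miss counters only show up in dashboards if the function that runs emits them under the keys the dashboards query. The sketch below is not the code from PR #2772. It is a hypothetical cached existence check that takes its metrics key prefix as a parameter, uses a collections.Counter as a stand-in for a real metrics client, and guesses that the _cache_hit and _cache_miss keys are counters.

```python
from collections import Counter
from typing import Callable, Dict


def make_cached_exists(
    check_url: Callable[[str], bool],
    metrics: Counter,
    key_prefix: str = "tecken.symboldownloader_exists",
) -> Callable[[str], bool]:
    """Wrap check_url with an in-memory cache that counts hits and misses.

    Counts go to <key_prefix>_cache_hit and <key_prefix>_cache_miss, so a
    different prefix sends the same signal to different metric names.
    """
    cache: Dict[str, bool] = {}

    def exists(url: str) -> bool:
        if url in cache:
            metrics[f"{key_prefix}_cache_hit"] += 1
            return cache[url]
        metrics[f"{key_prefix}_cache_miss"] += 1
        cache[url] = check_url(url)
        return cache[url]

    return exists
```

Passing key_prefix="tecken.symboldownloader_public_exists" would reproduce the situation from comment 13, where the data was still being emitted but under names nothing was graphing.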
Comment 18 • 1 year ago (Assignee)
This deployed to prod just now in bug #1848164.
I'm seeing metrics for symboldownloader cache usage again.
Last thing to do here is:
- write up a DSRE bug to remove the private symbols bucket for stage and prod
Comment 19 • 1 year ago (Assignee)
I wrote up DSRE-1401 to cover decommissioning the stage and prod private symbols buckets.
We're all done here.