The APT index of packages.mozilla.org is too large
Categories: Release Engineering :: General, defect, P2
Tracking: Not tracked
People: Reporter: jlorenzo, Assigned: gabriel
References: Blocks 1 open bug
Attachments: 4 files
First reported by :sylvestre
Steps to reproduce
sudo apt-get update
Expected results
Fetching the packages.mozilla.org
index should just be a matter of downloading a few kilobytes.
Actual results
[...]
Get:5 https://packages.mozilla.org/apt mozilla InRelease [1356 B]
Get:6 https://packages.mozilla.org/apt mozilla/main all Packages [21.4 MB]
Fetched 21.4 MB in 2s (9199 kB/s)
Notes
I suspect we either need to tweak Google Artifact Registry or report a bug to them.
Assignee • Comment 1 • 1 year ago
This is the repository index containing the platform-agnostic l10n packages.
There are roughly 200 l10n packages included in each nightly release.
The file that has grown excessively large is here:
https://packages.mozilla.org/apt/dists/mozilla/main/binary-all/Packages
Currently, it stands at 20M, as shown in the Actual results
section above.
➜ ~ du -h Packages
20M Packages
Google Artifact Registry isn't providing compressed versions of the index, as can be seen here:
https://packages.mozilla.org/apt/dists/mozilla/main/binary-all/Packages.xz
According to the Debian repository documentation:
https://wiki.debian.org/DebianRepository/Format
Compression of indices
... [an] index may be compressed in one or multiple of the following formats:
No compression (no extension)
XZ (.xz extension)
Gzip (.gz extension, usually for Contents files, and diffs)
Bzip2 (.bz2 extension, usually for Translations)
LZMA (.lzma extension)
Clients must support xz compression, and must support gzip and bzip2 if they want to use the files that are listed as usual use cases of these formats. Support for all three formats is highly recommended, as gzip and bzip2 are historically more widespread.
Servers should offer only xz compressed files, except for the special cases listed above. Some historical clients may only understand gzip compression, if these need to be supported, gzip-compressed files may be offered as well.
When I compress the file locally I get the size down to 3M.
➜ ~ du -h Packages
20M Packages
➜ ~ xz Packages
➜ ~ du -h Packages.xz
3.0M Packages.xz
By my count there are 38,683 packages indexed in this file (counted as the lines starting with Package).
Using the timestamp embedded in the nightly Version field (and some napkin math), I derived the following from the index:
- Nearly half the packages (19,668) are older than 90 days.
- Approximately 65% of the packages (25,585) are older than 60 days.
- On average, around 6,447 packages are published monthly.
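The napkin math above could be reproduced with a short script along these lines. This is a hypothetical sketch: it assumes nightly Version strings end in a YYYYMMDDHHMMSS build timestamp, which may not match the index's real version format.

```python
import re
from datetime import datetime, timezone

# Tiny stand-in for the real ~20M Packages index; the Version format
# here (trailing YYYYMMDDHHMMSS timestamp) is an assumption.
SAMPLE_INDEX = """\
Package: firefox-nightly-l10n-fr
Version: 121.0a1.20230915143501

Package: firefox-nightly-l10n-de
Version: 122.0a1.20231201090000
"""

def age_stats(index_text, now):
    """Count Package stanzas and bucket them by build-timestamp age."""
    total = len(re.findall(r"^Package: ", index_text, re.M))
    stamps = re.findall(r"^Version: .*?(\d{14})$", index_text, re.M)
    ages = [
        (now - datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)).days
        for ts in stamps
    ]
    return {
        "total": total,
        "older_than_90d": sum(a > 90 for a in ages),
        "older_than_60d": sum(a > 60 for a in ages),
    }

stats = age_stats(SAMPLE_INDEX, now=datetime(2024, 1, 1, tzinfo=timezone.utc))
print(stats)
```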
I think we need a cron job that can go in and clean up 🧹 the repository, since Google Artifact Registry has no clean-up policies for APT and YUM repos. Google Artifact Registry should also serve compressed indexes... but in the meantime we could compress the file ourselves and point our CDN at the compressed version to hasten the process. I don't know how long it would take Google to start serving compressed indexes after we highlight the issue.
Comment 2 • 1 year ago
It should store only the latest versions, no?
No need to keep previous ones, AFAIK.
Assignee • Comment 3 • 1 year ago
Oh, if that's the case then maybe there should be a task that cleans up the repo after each new nightly is published. I guess someone might want to install a specific version of Firefox? Probably not.
Assignee • Comment 4 • 1 year ago
Especially not on nightly.
Assignee • Comment 5 • 1 year ago
Marking this as S2 because I don't think there's a workaround and the repo is pretty bloated.
Assignee • Updated • 1 year ago
Assignee • Comment 6 • 1 year ago
Some reasons we might want to keep some non-latest releases in the repository:
- Sometimes older versions are needed by people dependent on systems or applications that are not compatible with the latest package
- If there is a critical issue with a new release, having older packages available allows users to quickly rollback to a stable version
Comment 7 • 1 year ago
I filed the following two support tickets with Google for feature requests:
- https://console.cloud.google.com/support/cases/detail/v2/47695937?project=moz-fx-productdelivery-pr-38b5
- https://console.cloud.google.com/support/cases/detail/v2/47695977?project=moz-fx-productdelivery-pr-38b5
I think we can work around both of these missing features by adding a separate http handler for XZ compression, and removing old versions on a daily basis.
Assignee • Comment 8 • 1 year ago
:jbuck could I get access to those support tickets? It would also be useful to support .gz compression in our request handler.
I've been poking at the APT registry API while looking at this issue; I could reuse some of that and write a clean-up script to remove old versions (but I'm not sure how we could deploy and schedule it).
A little off-topic, but there's also some discussion about the 404 page on Bug 1861929, perhaps we can deal with 404s in a handler too? Return a Mozilla 404 page?
Comment 9 • 1 year ago
:gabriel - you are CC'd on those tickets. You should just be able to reply. I don't know of an easy way to let you interact with the support portal itself, tho.
We could just deploy a cleanup script as a k8s cron.
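A k8s CronJob for this could look roughly like the sketch below. The image and arguments mirror the invocation quoted later in the thread (comment 58); the resource name, namespace defaults, and schedule are placeholders, not the actual deployment.

```yaml
# Sketch only: name and schedule are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mozilla-linux-pkg-manager-cleanup
spec:
  schedule: "0 4 * * *"   # daily; adjust as needed
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: mozillareleases/mozilla-linux-pkg-manager:latest
              command:
                - poetry
                - run
                - mozilla-linux-pkg-manager
                - clean-up
                - --product
                - firefox
                - --channel
                - nightly
                - --format
                - deb
                - --retention-days
                - "2"
                - --repository
                - mozilla
                - --region
                - us
```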
We should be able to use nginx to serve up a custom 404, I think.
Updated • 1 year ago
Assignee • Updated • 1 year ago
Assignee • Comment 10 • 1 year ago
I am working on a clean-up script: https://github.com/mozilla-releng/mozilla-linux-pkg-manager
Assignee • Comment 11 • 1 year ago
How can we deploy this as a k8s cron? Should I package it up in a Docker image?
Comment 12 • 1 year ago
If you provide a Dockerfile in the repo, then GitHub Actions can build and deploy it.
Comment 13 • 1 year ago
(which we can plumb)
Assignee • Comment 14 • 1 year ago
Thanks Chris. Could I have read-only access to the product delivery repository so I can call the read APIs locally while working on Bug 1863841?
Assignee • Comment 16 • 1 year ago
It worked. I can look inside the repository. Thanks!
Comment 17 • 1 year ago
ping me when you're done so I can remove the bespoke access!
Assignee • Comment 18 • 1 year ago
I don't need it anymore, you can remove it
Comment 19 • 1 year ago
done
Assignee • Comment 20 • 1 year ago
(In reply to chris valaas [:cvalaas] from comment #12)
If you provide a Dockerfile in the repo then github actions can build and deploy it
I packaged the script into a docker image and deployed it.
https://github.com/mozilla-releng/mozilla-linux-pkg-manager#docker
https://hub.docker.com/r/mozillareleases/mozilla-linux-pkg-manager/tags
Comment 21 • 1 year ago
We're setting this up as a cron, yeah? How often? Daily? Weekly?
Updated • 1 year ago
Assignee • Comment 23 • 1 year ago
We should trigger it as soon as it's available so it can clean out the repository.
Assignee • Comment 24 • 1 year ago
I updated the script's image today, by the way (I had built an ARM image by accident; it is now an AMD64 image).
Comment 25 • 1 year ago
Can you make the image runnable as non-root? Our k8s clusters do not allow pods to run with root privs...
Error at the moment:
Creating virtualenv mozilla-linux-pkg-manager-VA82Wl8V-py3.11 in /.cache/pypoetry/virtualenvs
[...]
virtualenv: error: argument dest: the destination . is not write-able at /
Updated • 1 year ago
Assignee • Comment 26 • 1 year ago
Oh I didn't consider that. Yeah, it should be able to. Looking...
Assignee • Comment 27 • 1 year ago
I modified the Docker image to create and use a non-root user. I published it as 0.4.0.
Comment 28 • 1 year ago
I'm not seeing the repo updated (but the new image is on dockerhub) -- what's the UID/GID of the user?
Assignee • Comment 29 • 1 year ago
Comment 30 • 1 year ago
Starts up fine now, but crashes. Error/trace attached
Updated • 1 year ago
Comment 31 • 1 year ago
(This is a run against the stage env)
Assignee • Comment 32 • 1 year ago
There are VPN packages in the staging repo breaking some assumptions I made about the Firefox packages.
I updated the script to be a little more resilient and uploaded a 0.5.0 image.
Assignee • Comment 33 • 1 year ago
Let me know if there are more issues. The output of this new version against the staging repo should be "successfully parsed the package data" and "there are no nightly packages."
Comment 34 • 1 year ago
That's what I'm seeing now. I'll promote the dry-run to prod and see what happens.
Comment 35 • 1 year ago
Error seen in prod. Note the delete attempt(?) when --dry-run is set.
Updated • 1 year ago
Assignee • Comment 36 • 1 year ago
The attempt is expected. The script uses a validate_only flag available in the API, so it will send the request, but nothing will actually be deleted. The error looks like an IAM permissions issue; the script doesn't seem to have permission to send the delete requests.
Comment 37 • 1 year ago
Sweet. I updated permissions and will run again.
Comment 38 • 1 year ago
Looks like you can only do 50 at a time
Updated • 1 year ago
Assignee • Comment 39 • 1 year ago
That's... interesting.
The docs say it's 10,000, and so does the protocol buffer in Google's client... (maybe that limit is for Docker images?)
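The 50-at-a-time limit can be worked around by chunking delete requests client-side. A minimal sketch, assuming a `send_batch` callable that wraps the actual delete RPC (the names here are illustrative, not the script's real identifiers):

```python
# Chunk version names into batches the API will accept.
# BATCH_LIMIT = 50 reflects the limit observed in practice,
# not the 10,000 the docs advertise.
BATCH_LIMIT = 50

def batched(names, size=BATCH_LIMIT):
    """Yield successive slices of `names` no longer than `size`."""
    for i in range(0, len(names), size):
        yield names[i:i + size]

def delete_versions(names, send_batch):
    """Send every batch through `send_batch` (e.g. a wrapper around
    the Artifact Registry batch-delete call); return the batch count."""
    count = 0
    for batch in batched(names):
        send_batch(batch)
        count += 1
    return count

# Example with a stub sender that just records what it was given:
sent = []
n = delete_versions([f"v{i}" for i in range(120)], sent.append)
print(n)  # prints 3 (batches of 50 + 50 + 20)
```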
Comment 40 • 1 year ago
Looks good now. Latest run ends with:
2023-12-06 22:04:58,716 - mozilla-linux-pkg-manager - INFO - Done cleaning up!
We good to run it for real?
Assignee • Comment 41 • 1 year ago
Yeah, that's the expected output, lgtm 👍
Comment 42 • 1 year ago
I don't think it's working ... ?
I ran it twice and it claimed to be deleting tons of stuff each time. But I would expect nothing to happen the second run-through. I checked the repo itself and everything seems to still be there.
I wonder if these two lines:
https://github.com/mozilla-releng/mozilla-linux-pkg-manager/blob/a07f976a468d6f97957a277cc9c070418b64361c/src/mozilla_linux_pkg_manager/cli.py#L145-L146
need to be inside the for loop started on line 134?
IOW, I think it's only deleting the last "batch" from each "package"
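The suspected bug and its fix can be illustrated with a stub. This is a hypothetical sketch of the loop shape, not the script's actual code; `send_delete` and `_batches` stand in for the real delete call and batching.

```python
def _batches(items, size=50):
    """Split a list of version names into batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def clean_up(packages, send_delete):
    for package, expired_versions in packages.items():
        for batch in _batches(expired_versions):
            # This call must live inside the loops. In the buggy
            # version it sat after them, so only the last batch of
            # the last package was ever deleted.
            send_delete(package, batch)

# Stub sender that records (package, batch size) pairs:
deleted = []
clean_up(
    {"firefox-nightly": ["a"] * 60, "firefox-nightly-l10n-fr": ["b"] * 10},
    lambda pkg, batch: deleted.append((pkg, len(batch))),
)
print(deleted)
```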
Assignee • Comment 43 • 1 year ago
Yes, good catch! It does look like it only deleted a single batch. Fixing it...
Assignee • Comment 44 • 1 year ago
I patched the indentation error and published the fix
Comment 45 • 1 year ago
So far so good. I'll set this to run weekly (unless you've changed your mind?).
Comment 46 • 1 year ago
It is 12.2M (to compare: vscode is 3k, llvm is 6k).
Comment 47 • 1 year ago
Sorry, you said daily. I'll set it to daily. The script finished successfully and we're all good!
Assignee • Comment 48 • 1 year ago
Thanks Chris!
It is better now. The indexes with the Firefox packages are ~25k (that would be about ~4k if they were compressed).
The index with the l10n packages was around ~60M+ 😓 and now it's at ~1.2M 😅
We could tweak the arguments to the script and delete a few hundred more outdated l10n packages.
I don't think that's going to get us all the way there, but we can get into the kilobyte realm using compressed indexes:
➜ wget https://packages.mozilla.org/apt/dists/mozilla/main/binary-all/Packages
--2023-12-07 12:06:40-- https://packages.mozilla.org/apt/dists/mozilla/main/binary-all/Packages
Resolving packages.mozilla.org (packages.mozilla.org)... 34.160.78.70
Connecting to packages.mozilla.org (packages.mozilla.org)|34.160.78.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1285769 (1.2M) [text/html]
Saving to: ‘Packages’
Packages 100%[===================>] 1.23M 770KB/s in 1.6s
2023-12-07 12:06:42 (770 KB/s) - ‘Packages’ saved [1285769/1285769]
➜ xz Packages
➜ du -h Packages.xz
152K Packages.xz
Assignee • Comment 49 • 9 months ago
The updates to the script described on Bug 1875338 will land soon. Once I publish them we can start cleaning-up packages in a more flexible way.
For example to clean-up the devedition/beta packages we can do something like:
mozilla-linux-pkg-manager \
clean-up \
--package "^firefox-(devedition|beta)(-l10n-.+)?$" \
--retention-days 60 \
--repository mozilla \
--region us \
--dry-run \
2>&1 | tee clean-up-devedition-and-beta.log
The packages aren't super bloated with versions yet, but we should keep them in check so they don't get as bloated as the nightly packages did.
I don't see a lot of activity on https://issuetracker.google.com/issues/308952923
Should I reach out to someone on our GCP account team? Start considering a workaround?
Comment 50 • 9 months ago
I will ask for an update in our google slack channel
Assignee • Comment 51 • 9 months ago
The clean-up policies for apt repos seem to be working in preview now :) (tried them out on the dev account)
We could do:
- A prefix rule matching "firefox-nightly" that keeps only the n most recent versions
- A prefix rule matching "firefox-beta" and "firefox-devedition" that does the same
Comment 52 • 9 months ago
From slack 3/4/2024 3:11pm:
Hi @cvalaas, regarding the first bug, to reduce the size of the package index: I checked with the product team and unfortunately this feature request has not been prioritized, so there is no ETA. I am pushing for them to at least evaluate it, and will keep you posted.
Assignee • Comment 53 • 9 months ago
😕
I think we can download and compress the indexes and store the compressed versions somewhere.
Then point at the files, e.g. https://packages.mozilla.org/apt/dists/mozilla/main/binary-all/Packages.{COMPRESSION_FORMAT}
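The compression step of that workaround can be sketched with Python's standard-library lzma module. Fetching the index and publishing the compressed copy are left out; this only shows producing the Packages.xz bytes apt clients expect.

```python
import lzma

def compress_index(packages_bytes: bytes) -> bytes:
    """xz-compress a Packages index. In the real workaround the input
    would be fetched from packages.mozilla.org and the output stored
    somewhere the CDN can serve as Packages.xz."""
    return lzma.compress(packages_bytes, preset=9)

# Debian Packages stanzas are highly repetitive, so large xz ratios
# like the ~20M -> ~3M seen above are plausible. Tiny demonstration:
sample = b"Package: firefox-nightly-l10n-xx\nVersion: 121.0a1\n\n" * 2000
compressed = compress_index(sample)
print(len(sample), len(compressed))
```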
Comment 54 • 2 months ago
Today, it is quite large: 27.3 MB.
If/when Thunderbird uses it, it is going to be too big.
Comment 55 • 2 months ago
We were hoping to publish Thunderbird debs to the PPA soon. Is there a time frame for when the compression or clean-up will be done?
Reporter • Comment 56 • 2 months ago
Hey :cvalaas!
(In reply to Gabriel Bustamante [:gabriel] from comment #51)
The clean-up policies for apt repos seem to be working in preview now :) (tried them out on the dev account)
Would it be something easy to implement as code? Given that Thunderbird is looking into publishing their package too, we should avoid doubling the size of the APT repo. Just as an example, I was traveling last week and I couldn't update Firefox because of my data plan abroad.
(In reply to Gabriel Bustamante [:gabriel] from comment #53)
I think we can download and compress the indexes and store the compressed versions somewhere.
Then point at the files - i.e. https://packages.mozilla.org/apt/dists/mozilla/main/binary-all/Packages.{COMPRESSION_FORMAT}
Would this option make sense too? It seems hacky at first, but given that the feature is still missing on Google Artifact Registry, it sounds like the best option we have.
Comment 57 • 2 months ago
Looks like google's clean-up method is available. Here's how it's set now:
Comment 58 • 2 months ago
Also, we have :gabriel's script which runs weekly and is invoked thusly:
- command:
- poetry
- run
- mozilla-linux-pkg-manager
- clean-up
- --product
- firefox
- --channel
- nightly
- --format
- deb
- --retention-days
- "2"
- --repository
- mozilla
- --region
- us
So I guess we could use either/both? My preference would be to use google's built-in clean-up policy, tho.
But just let me know which way you want to go and what parameters to use.
Reporter • Comment 59 • 2 months ago
Thank you very much for looking into this so quickly, :cvalaas!
Let's go with Google's clean-up. It's standard, and we don't have to maintain our own thing. Let's keep the last 10 versions of firefox-nightly (that would give us approximately 5 days of Nightly). Regarding firefox-beta and firefox-devedition, 5 versions should be enough (that's a week and a half). For firefox (aka release), 3 versions should be good; that's a whole cycle (4 weeks) with 2 dot releases in the meantime.
Comment 60 • 2 months ago
Unfortunately, the cleanup policies can only operate on prefixes, so any rules I assign to "firefox" will also match everything else. But KEEP rules take precedence (according to the UI), so how about:
DELETE anything older than 31d any package name starting with "firefox"
(implicit BUT) KEEP 5 most recent versions of any package name starting with "firefox-beta" or "firefox-devedition"
(implicit BUT) KEEP 10 most recent versions of any package name starting with "firefox-nightly"
I think that accomplishes what you want?
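For illustration, those three rules could be expressed in an Artifact Registry cleanup-policy file roughly like the sketch below. The rule names are invented, and the exact schema should be verified against Google's cleanup-policies documentation before use.

```json
[
  {
    "name": "delete-firefox-older-than-31d",
    "action": {"type": "Delete"},
    "condition": {
      "packageNamePrefixes": ["firefox"],
      "olderThan": "31d"
    }
  },
  {
    "name": "keep-5-beta-devedition",
    "action": {"type": "Keep"},
    "mostRecentVersions": {
      "packageNamePrefixes": ["firefox-beta", "firefox-devedition"],
      "keepCount": 5
    }
  },
  {
    "name": "keep-10-nightly",
    "action": {"type": "Keep"},
    "mostRecentVersions": {
      "packageNamePrefixes": ["firefox-nightly"],
      "keepCount": 10
    }
  }
]
```

A file like this would typically be applied with something like `gcloud artifacts repositories set-cleanup-policies --policy=FILE`, ideally after a dry run.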
Reporter • Comment 61 • 1 month ago
No problem, thanks for calling this out! 👍
It looks good to me. However, I'm not totally sure how the implicit BUTs are going to work. I tried to test it out on moz-fx-dev-releng/releng-apt-dev, but for some unknown reason, I can't get any results with this command line:
gcloud logging read 'protoPayload.serviceName="artifactregistry.googleapis.com" AND protoPayload.request.parent:"projects/moz-fx-dev-releng/locations/northamerica-northeast2/repositories/releng-apt-dev" AND protoPayload.request.validateOnly=true' \
--resource-names="projects/moz-fx-dev-releng" \
--project=moz-fx-dev-releng
For the record, I did enable Audit logs [2] by flipping DATA_WRITE [3] to see this kind of event: google.devtools.artifactregistry.v1.ArtifactRegistry.BatchDeleteVersions.
Anyway, :cvalaas, could you show me some examples of what would be deleted or not with your proposal?
[1] https://cloud.google.com/artifact-registry/docs/repositories/cleanup-policy#dry-run
[2] https://cloud.google.com/logging/docs/audit/configure-data-access#config-console-enable
[3] https://cloud.google.com/artifact-registry/docs/audit-logging#permission-type
Comment 62 • 1 month ago
https://github.com/mozilla-it/webservices-infra/pull/3160 will fix the logs in stage/prod
Comment 63 • 1 month ago
:jlorenzo, are you able to see what's in stage? I implemented the same rules there and they are live. There are no "firefox" packages, but the "firefox" rule does exist there.
I see that there are now 5 versions of all the "firefox-beta" packages, but I also see there are NO versions of "firefox-esr*", since those would've been deleted by the "firefox" 31d catch-all. Is that bad? If it is, I suppose we could also add a "firefox" rule that always keeps 1 old version?
Comment 64 • 1 month ago
This PR will implement xz compression for Packages: https://github.com/mozilla-it/webservices-infra/pull/3176
If there are any other types of compression or paths we should compress, we should be able to add them fairly easily.
Reporter • Comment 65 • 1 month ago
:jlorenzo, are you able to see what's in stage?
Yup, I'm in! For the record, I'm also able to view what's in prod.
I implemented the same rules there and they are live.
Excellent, thank you!
but I also see there are NO versions of "firefox-esr*", since those would've been deleted by the "firefox" 31d catch-all. Is that bad?
It's usually okay because we ship a new ESR version every 4 weeks. However, it's not a golden rule because there are some known exceptions like around the end of the year. This December, for instance, there will be 6 weeks between 2 ESR.
If it is, I suppose we could also add a "firefox" rule that always keeps 1 old version?
If we change
DELETE anything older than 31d any package name starting with "firefox"
into
KEEP 3 most recent versions of any package name starting with "firefox"
would the 2 implicit BUTs still be valid?
This PR will implement xz compression for Packages
🙌 Thank you so much for providing this solution so quickly!
Comment 66 • 1 month ago
And if anyone else is curious what I'm using to compress the Package listing... it's a shell script: https://github.com/mozilla/artifact-registry-compression/blob/main/compress.sh
Comment 67 • 1 month ago
If we change
DELETE anything older than 31d any package name starting with "firefox" into
KEEP 3 most recent versions of any package name starting with "firefox", would the 2 implicit BUTs be still valid?
I believe so! I'll see what happens on stage.
Comment 68 • 1 month ago
Okay, on stage, there are still 5 copies of firefox-beta (there were 6 as of this morning). So the more restrictive KEEP rules seem to take precedence; good!
Reporter • Comment 69 • 1 month ago
Perfect! Then, let's go with:
KEEP 3 most recent versions of any package name starting with "firefox"
(implicit BUT) KEEP 5 most recent versions of any package name starting with "firefox-beta" or "firefox-devedition"
(implicit BUT) KEEP 10 most recent versions of any package name starting with "firefox-nightly"
Then, let's see how thin the index gets. We may want to be more restrictive if it's still too big.
Thank you again :cvalaas and :jbuck for this quick turnaround!
Comment 70 • 1 month ago
Made the rules live in prod: https://github.com/mozilla-it/webservices-infra/pull/3198
Comment 71 • 1 month ago
The XZ compression of https://packages.mozilla.org/apt/dists/mozilla/main/binary-all/Packages.xz is live - it's a nice ~90% compression from 27.3M to 3.1M
There's a Cloud Run (code here) running on an hourly schedule that fetches the uncompressed Packages index, compresses it, then uploads to another GCS bucket. Nginx does a path matching rule to serve that compressed content.
If there are other paths or compression formats that should be added, please let us know, it should be pretty trivial to do so.
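The nginx path-matching rule described above might look roughly like this sketch. The GCS bucket name is a placeholder, and the real configuration may differ (caching, extra headers, etc.).

```nginx
# Sketch only: bucket name is hypothetical.
location = /apt/dists/mozilla/main/binary-all/Packages.xz {
    proxy_pass https://storage.googleapis.com/EXAMPLE-COMPRESSED-INDEX-BUCKET/Packages.xz;
    proxy_set_header Host storage.googleapis.com;
}
```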
Comment 72 • 1 month ago
haHA I spoke too soon. Of course, to use the new compressed Packages index, we'll need to update the Release/InRelease files too. And they need to be signed with google's GPG key too? 🙃
Reporter • Comment 73 • 1 month ago
Thanks for making it live, :cvalaas!
(In reply to Jon Buckley [:jbuck] from comment #72)
haHA I spoke too soon. Of course, to use the new compressed Packages index, we'll need to update the Release/InRelease files too. And they need to be signed with google's GPG key too? 🙃
Ah yeah, I forgot about that part. The only thing that's signed on an APT repo is the metadata (e.g. InRelease). That's right, we need to have that signed by Google's GPG key.
(In reply to Jon Buckley [:jbuck] from comment #71)
The XZ compression of https://packages.mozilla.org/apt/dists/mozilla/main/binary-all/Packages.xz is live - it's a nice ~90% compression from 27.3M to 3.1M
I'm surprised the Packages index is still that big despite the cleanup being live. In comment 54, Sylvestre shared it was already 27.3 MB. I confirm it's now 27.47 MB: wget https://packages.mozilla.org/apt/dists/mozilla/main/binary-all/Package.
Is there anything we missed in the cleanup, :cvalaas?
Comment 74 • 1 month ago
Oops, I guess the cleanup doesn't work without the first DELETE rule as well. I have a PR (https://github.com/mozilla-it/webservices-infra/pull/3234) to reinstate that, and then it should clean up by tomorrow.
Reporter • Comment 75 • 1 month ago
Great! It's now just 5.4MB. Thank you very much, :cvalaas! The next step would be to compress the metadata but as said in comment 72 and comment 73, that's something only Google can do.
I'm marking this bug as fixed since it's not 20+MB anymore.