frequent OOMs in balrog
Categories: Release Engineering :: General, defect
Tracking: Not tracked
People: Reporter: jcristau; Assignee: Unassigned
References: Blocks 1 open bug
Attachments: 2 files
Description • Reporter
balrog's uwsgi processes keep getting oom-killed.
Comment 1 • Reporter • 1 day ago
Grafana points at a change around January 14.
Comment 2 • Reporter • 1 day ago
First OOM appears to be timestamped 2026-01-14T12:11:40Z
Memory cgroup out of memory: Killed process 721264 (uwsgi) total-vm:638804kB, anon-rss:534324kB, file-rss:8832kB, shmem-rss:0kB, UID:0 pgtables:1168kB oom_score_adj:976
Comment 3 • Reporter • 1 day ago
The admin app shows a similar pattern, though it doesn't go quite high enough to OOM.
Updated • Reporter • 1 day ago
Comment 4 • Reporter • 1 day ago
We had a deploy of v3.100 on January 13, but it was complete by 16:11 UTC, and things were still fine at that stage...
Comment 5 • Reporter • 1 day ago
Bug 2004050 caused a fairly large increase of the nightly release blob's size, and the timing would fit.
Comment 6 • 20 hours ago
Investigating only the public API app for now, since that's the one causing immediate issues.
First, we're leaking references from the release_assets cache into the releases cache.
def get_release(name, trans, include_sc=True):
    base_row = cache.get("releases", name, lambda: get_base_row(name, trans))
    asset_rows = cache.get("release_assets", name, lambda: get_asset_rows(name, trans))
    if base_row:
        base_blob = base_row["data"]  # copies a reference; base_row["data"] is a dict
        for asset in asset_rows:
            path = asset["path"].split(".")[1:]
            ensure_path_exists(base_blob, path)
            set_by_path(base_blob, path, asset["data"])  # copies references from `release_assets` into `releases`
This makes it so that evicting values from release_assets doesn't actually free any memory until every cached release that references that asset also gets evicted. In testing with the 504 releases in the dev db dump, fixing this reduced the combined cache size by ~20MB for one worker (keep in mind prod runs 6 of those per pod, so ~120MB).

That's still far from the ~500MB increase, but the dev data is pretty arbitrary: by duplicating the one nightly in the dump 25 times and shrinking the release_assets cache size to better mimic prod eviction (prod has 8k+ releases), I was able to leak 250MB of data into the releases cache. Fixing this issue might be enough to bring memory usage back down, although it's hard to tell without testing.
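To make the reference sharing concrete, here's a minimal standalone sketch; the plain dicts and names below are stand-ins rather than Balrog's actual cache API, and the deepcopy at the end is only one possible (untested) way to decouple the two caches:

import copy

# Stand-ins for the two per-worker caches (plain dicts here, not Balrog's real cache class).
releases_cache = {}
release_assets_cache = {}

# One asset row's blob: ~16 partials of ~900 bytes each, roughly the per-entry numbers used below.
asset_data = {"partials": ["x" * 900 for _ in range(16)]}
release_assets_cache["Firefox-nightly-asset"] = asset_data

# What set_by_path() effectively does today: graft the asset dict into the base blob by reference.
base_blob = {"platforms": {"linux64": asset_data}}
releases_cache["Firefox-nightly"] = base_blob

# The cached release blob and the cached asset row now share the same object...
assert releases_cache["Firefox-nightly"]["platforms"]["linux64"] is asset_data

# ...so evicting the asset entry can't free that memory while the release blob stays cached.
del release_assets_cache["Firefox-nightly-asset"]

# One possible fix (untested): copy the asset data before grafting it in, so that evicting
# a release_assets entry actually releases its memory.
releases_cache["Firefox-nightly"]["platforms"]["linux64"] = copy.deepcopy(asset_data)

In get_release itself the equivalent change would presumably be to copy either base_row["data"] or asset["data"] before calling set_by_path, at the cost of some extra copying per call.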
The second observation is that the cache size is absolutely ridiculous. Locally I'm able to get the release_assets cache to a whopping 500MB after fixing the bug above. That's partly due to nightly having more partials: the math is roughly 113 locales * 7 platforms * 16 partials * 900 bytes * 30 nightlies, which is ~330MB of Firefox nightlies alone in caches, up from ~82MB. Beyond that, we should look at reducing the cache scopes: /update uses get_release to build a ~10MB object just to extract ~14kB from it (16 partials * 900 bytes).
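For reference, here is the back-of-the-envelope math from above written out; all the per-item sizes and counts are the rough estimates quoted in this comment, not measured values:

locales = 113
platforms = 7
partials_per_entry = 16
bytes_per_partial = 900      # rough size of one partial entry in the blob
cached_nightlies = 30

# Estimated nightly footprint in the release_assets cache: ~342MB, i.e. the ~330MB above.
nightly_cache_bytes = locales * platforms * partials_per_entry * bytes_per_partial * cached_nightlies
print(f"{nightly_cache_bytes / 1e6:.0f} MB")

# What a single /update request actually needs out of the ~10MB blob get_release builds: ~14.4kB.
update_bytes = partials_per_entry * bytes_per_partial
print(f"{update_bytes / 1e3:.1f} kB")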