Open Bug 2014991 Opened 1 day ago Updated 20 hours ago

frequent OOMs in balrog

Categories

(Release Engineering :: General, defect)

Tracking

(Not tracked)

People

(Reporter: jcristau, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Balrog's uwsgi processes keep getting OOM-killed.

Grafana points at a change around January 14.

First OOM appears to be timestamped 2026-01-14T12:11:40Z

Memory cgroup out of memory: Killed process 721264 (uwsgi) total-vm:638804kB, anon-rss:534324kB, file-rss:8832kB, shmem-rss:0kB, UID:0 pgtables:1168kB oom_score_adj:976

The admin app shows a similar pattern, though it doesn't go quite high enough to OOM.

Attachment #9543068 - Attachment description: Memory request utilisation % - grafana → Memory request utilisation % - grafana - public app

We had a deploy of v3.100 on January 13, but it was complete by 16:11 UTC, and things were still fine at that stage...

Bug 2004050 caused a fairly large increase of the nightly release blob's size, and the timing would fit.

See Also: → 2004050
Blocks: 2014908

Investigating only the public app for now, as it's the one causing immediate issues.

First, we're leaking references from the release_assets cache into the releases one.

def get_release(name, trans, include_sc=True):
    base_row = cache.get("releases", name, lambda: get_base_row(name, trans))
    asset_rows = cache.get("release_assets", name, lambda: get_asset_rows(name, trans))
    if base_row:
        base_blob = base_row["data"]  # no copy: base_blob aliases the dict stored in the releases cache

        for asset in asset_rows:
            path = asset["path"].split(".")[1:]
            ensure_path_exists(base_blob, path)
            set_by_path(base_blob, path, asset["data"])  # copies references from the release_assets cache into the releases one
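
To make the effect concrete, here's a toy illustration with plain dicts standing in for the two caches (none of this is Balrog code):

asset_data = {"partials": ["..."]}  # stand-in for ~1MB of cached asset data

release_assets_cache = {"Firefox-nightly": asset_data}
releases_cache = {}

# get_release-style merge: the same object ends up referenced from both caches
merged_blob = {"platforms": {"linux": release_assets_cache["Firefox-nightly"]}}
releases_cache["Firefox-nightly"] = merged_blob

# Evicting the asset from release_assets frees nothing, since the object is
# still reachable through the releases cache:
del release_assets_cache["Firefox-nightly"]
assert releases_cache["Firefox-nightly"]["platforms"]["linux"] is asset_data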

This reference sharing means that evicting values from release_assets doesn't actually free any memory until every cached release that references the asset has also been evicted. In testing with the 504 releases in the dev db dump, fixing this reduced the combined cache size by ~20MB for one worker (prod runs 6 of those per pod, so ~120MB). That's still far from the ~500MB increase, but that data is fairly arbitrary: by duplicating the one nightly in the dump 25 times and reducing the release_assets cache size to better mimic prod eviction (prod has 8k+ releases), I was able to leak 250MB of data into the releases cache. Fixing this issue might be enough to bring the memory usage back down, although it's very difficult to tell without testing.
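
One way to fix it is to copy the cached rows before merging. A minimal sketch, reusing the helper names from the snippet above and assuming a deepcopy is affordable on this path (the fix actually tested/shipped may differ):

from copy import deepcopy

def get_release(name, trans, include_sc=True):
    base_row = cache.get("releases", name, lambda: get_base_row(name, trans))
    asset_rows = cache.get("release_assets", name, lambda: get_asset_rows(name, trans))
    if base_row:
        # work on copies so the merged blob never aliases objects owned by either cache
        base_blob = deepcopy(base_row["data"])

        for asset in asset_rows:
            path = asset["path"].split(".")[1:]
            ensure_path_exists(base_blob, path)
            set_by_path(base_blob, path, deepcopy(asset["data"]))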

The second observation is that the cache size is absolutely ridiculous. Locally I'm able to get the release_assets cache to a whopping 500MB even after fixing the bug above. That's partly because nightly now has more partials (roughly 113 locales * 7 platforms * 16 partials * 900 bytes * 30 nightlies), i.e. ~330MB of Firefox nightly data alone sitting in caches instead of ~82MB, but we should also look at reducing the cache scopes: /update uses get_release to build a ~10MB object just to extract ~14kB from it (16 partials * 900 bytes).
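
Back-of-the-envelope version of those numbers (all figures taken from the estimates above):

locales, platforms, partials, partial_bytes, nightlies = 113, 7, 16, 900, 30

nightly_cache_bytes = locales * platforms * partials * partial_bytes * nightlies
print(nightly_cache_bytes / 1e6)   # ~342 MB (~326 MiB), i.e. the ~330MB figure above

# what a single /update lookup actually needs out of a ~10MB release object
per_request_bytes = partials * partial_bytes
print(per_request_bytes / 1e3)     # ~14.4 kB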

