frequent OOMs in balrog
Categories: Release Engineering :: General, defect
Tracking: Not tracked
People: Reporter: jcristau; Assignee: Unassigned
References: Blocks 1 open bug
Attachments: 2 files
Description • Reporter
balrog's uwsgi processes keep getting oom-killed.
Comment 1 • Reporter • 1 day ago
Grafana points at a change around January 14.
Comment 2 • Reporter • 1 day ago
First OOM appears to be timestamped 2026-01-14T12:11:40Z
Memory cgroup out of memory: Killed process 721264 (uwsgi) total-vm:638804kB, anon-rss:534324kB, file-rss:8832kB, shmem-rss:0kB, UID:0 pgtables:1168kB oom_score_adj:976
Comment 3 • Reporter • 1 day ago
The admin app shows a similar pattern, though it doesn't go quite high enough to OOM.
Updated • Reporter • 1 day ago
Comment 4 • Reporter • 1 day ago
We had a deploy of v3.100 on January 13, but it was complete by 16:11 UTC, and things were still fine at that stage...
Comment 5 • Reporter • 1 day ago
Bug 2004050 caused a fairly large increase of the nightly release blob's size, and the timing would fit.
Comment 6 • 20 hours ago
Investigating only the public API app for now, since that's the one causing immediate issues.
First, we're leaking references from the release_assets cache into the releases cache.
def get_release(name, trans, include_sc=True):
    base_row = cache.get("releases", name, lambda: get_base_row(name, trans))
    asset_rows = cache.get("release_assets", name, lambda: get_asset_rows(name, trans))
    if base_row:
        base_blob = base_row["data"]  # copies a reference; base_row["data"] is a dict
        for asset in asset_rows:
            path = asset["path"].split(".")[1:]
            ensure_path_exists(base_blob, path)
            set_by_path(base_blob, path, asset["data"])  # copies references from `release_assets` into `releases`
This makes it so that evicting values from release_assets doesn't actually free any memory until every cached release that references that asset also gets evicted. In testing with the 504 releases in the dev db dump, fixing this reduced the combined cache size by ~20MB for one worker (keep in mind prod runs 6 of those per pod, so ~120MB).

That's still far from the ~500MB increase, but the dev data is pretty arbitrary: by duplicating the one nightly in the dump 25 times and shrinking the release_assets cache size to better mimic prod eviction (prod has 8k+ releases), I was able to leak 250MB of data into the releases cache. Fixing this issue might be enough to bring memory usage back down, although it's hard to tell without testing.
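To make the reference sharing concrete, here's a minimal standalone sketch; the plain dicts and names below are stand-ins rather than Balrog's actual cache API, and the deepcopy at the end is only one possible (untested) way to decouple the two caches:

import copy

# Stand-ins for the two per-worker caches (plain dicts here, not Balrog's real cache class).
releases_cache = {}
release_assets_cache = {}

# One asset row's blob: ~16 partials of ~900 bytes each, roughly the per-entry numbers used below.
asset_data = {"partials": ["x" * 900 for _ in range(16)]}
release_assets_cache["Firefox-nightly-asset"] = asset_data

# What set_by_path() effectively does today: graft the asset dict into the base blob by reference.
base_blob = {"platforms": {"linux64": asset_data}}
releases_cache["Firefox-nightly"] = base_blob

# The cached release blob and the cached asset row now share the same object...
assert releases_cache["Firefox-nightly"]["platforms"]["linux64"] is asset_data

# ...so evicting the asset entry can't free that memory while the release blob stays cached.
del release_assets_cache["Firefox-nightly-asset"]

# One possible fix (untested): copy the asset data before grafting it in, so that evicting
# a release_assets entry actually releases its memory.
releases_cache["Firefox-nightly"]["platforms"]["linux64"] = copy.deepcopy(asset_data)

In get_release itself the equivalent change would presumably be to copy either base_row["data"] or asset["data"] before calling set_by_path, at the cost of some extra copying per call.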
The second observation is that the cache size is absolutely ridiculous. Locally I'm able to get the release_assets cache to a whopping 500MB after fixing the bug above. That's partly due to nightly having more partials: the math is roughly 113 locales * 7 platforms * 16 partials * 900 bytes * 30 nightlies, which is ~330MB of Firefox nightlies alone in caches, up from ~82MB. Beyond that, we should look at reducing the cache scopes: /update uses get_release to build a ~10MB object just to extract ~14kB from it (16 partials * 900 bytes).
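For reference, here is the back-of-the-envelope math from above written out; all the per-item sizes and counts are the rough estimates quoted in this comment, not measured values:

locales = 113
platforms = 7
partials_per_entry = 16
bytes_per_partial = 900      # rough size of one partial entry in the blob
cached_nightlies = 30

# Estimated nightly footprint in the release_assets cache: ~342MB, i.e. the ~330MB above.
nightly_cache_bytes = locales * platforms * partials_per_entry * bytes_per_partial * cached_nightlies
print(f"{nightly_cache_bytes / 1e6:.0f} MB")

# What a single /update request actually needs out of the ~10MB blob get_release builds: ~14.4kB.
update_bytes = partials_per_entry * bytes_per_partial
print(f"{update_bytes / 1e3:.1f} kB")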