Open Bug 2015887 Opened 1 day ago Updated 1 day ago

llama.cpp leaks ~500mb for each link preview performed until threads are exhausted

Status:

NEW

People

(Reporter: gregtatum, Unassigned)

Greg Tatum [:gregtatum]

Reporter

Description

•

1 day ago

I got the inference process up to 1.5 GB of memory usage before link preview refused to handle any more requests. From the performance profile, it looks like each request creates new threads instead of reusing existing ones. Eventually it seems to exhaust the available threads and breaks.

https://share.firefox.dev/3OeYJqz

I'm tentatively marking this as S2 since memory usage climbs to a pretty problematic level. It is quite easy to accidentally activate the link preview feature, since it's triggered by a long press on a link. The process is eventually terminated after some timeouts, which frees the memory, but in the meantime a user could be negatively affected.

Greg Tatum [:gregtatum]

Reporter

Comment 1

•

1 day ago

Here's what Claude suggests. I have no idea if it's accurate, but it's a good place to start the investigation:

Root Cause

Every time LinkPreviewModel.generateTextAI() is called, the following happens:

New Engine Creation (LinkPreviewModel.sys.mjs:415-480):
- Line 417-445: Creates a new engine via createEngine()
- Line 479: Terminates the engine after use
Engine Caching Issue (EngineProcess.sys.mjs:133-137 and MLEngineParent.sys.mjs:287-295):
- Engines are supposed to be reused if pipeline options match
- The engineId for link-preview is "wllamapreview" (line 174 in EngineProcess.sys.mjs)
- However, the engine caching only works if called through the same MLEngine instance
Native Thread Creation (LlamaBackend.cpp:221-254):
- In ReinitializeContext(), new threadpools are created:
  - Line 247: mThreadpool.reset(mLib->ggml_threadpool_new(&tpp))
  - Line 235: mThreadpoolBatch.reset(mLib->ggml_threadpool_new(&tppBatch))
  - These create llama.cpp worker threads with callbacks (lines 223-224, 228-231)
Worker Thread (LlamaRunner.cpp:437-439):
- Each generation also creates a new "LlamaWorker" thread via NewNamedThread
- This thread is properly shut down via AsyncShutdown() (line 391)

The Problem

The issue is that engines are NOT being reused properly. Looking at LinkPreviewModel.sys.mjs, each call to generateTextAI():

- Creates a fresh engine (lines 417-445)
- Terminates it immediately after use (line 479)

This means:

- New llama.cpp threadpools are created each time (multiple threads per threadpool)
- The threadpools should be freed when the LlamaBackend is destroyed (via custom deleters in LlamaBackend.h:93-94)
- However, if there are any lingering references to the backend, the threadpools won't be freed
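To make the "lingering reference" hypothesis concrete, here is a toy model in JavaScript. It is only a sketch: ToyBackend, runPreview, and the strayReference flag are invented for illustration and do not exist in the Firefox source. The model mimics a reference-counted backend whose threadpool is freed only when the last reference is dropped, and it assumes (without evidence from the real code) that some second object holds a reference.

```javascript
// Toy model only: none of these names exist in the Firefox source.
// Counts threadpools that have been created but not yet freed.
let liveThreadpools = 0;

class ToyBackend {
  #refs = 0;

  constructor() {
    // Creating a backend creates a threadpool.
    liveThreadpools++;
  }

  addRef() {
    this.#refs++;
  }

  release() {
    if (this.#refs === 0) {
      throw new Error("release() without a matching addRef()");
    }
    this.#refs--;
    if (this.#refs === 0) {
      // Last reference dropped: the threadpool is freed.
      liveThreadpools--;
    }
  }
}

// One preview: create a backend, use it, terminate the engine.
function runPreview({ strayReference }) {
  const backend = new ToyBackend();
  backend.addRef(); // held by the engine
  backend.addRef(); // held by some other object (hypothetical)
  backend.release(); // engine terminated
  if (!strayReference) {
    backend.release(); // the other holder lets go as well
  }
}

for (let i = 0; i < 3; i++) {
  runPreview({ strayReference: false });
}
const afterCleanRuns = liveThreadpools;

for (let i = 0; i < 3; i++) {
  runPreview({ strayReference: true });
}
const afterLeakyRuns = liveThreadpools;
```

In this model, clean previews leave no threadpools behind, while every preview with a stray reference leaves one alive. Whether such a reference actually exists in the real backend is exactly what the investigation would need to confirm.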

Potential Solutions

Reuse the engine instead of creating a new one each time:
- Cache the engine at the LinkPreviewModel level
- Only recreate if pipeline options change
Ensure proper cleanup of threadpools:
- Verify that all RefPtr references to LlamaBackend are dropped when engine terminates
- Check if LlamaStreamSource might be keeping the backend alive

Based on my analysis, the issue is that LinkPreviewModel.generateTextAI() creates a new engine for every preview. Each new engine creates new llama.cpp threadpools, and these threads are not being properly reused.

The Fix

The solution is to cache and reuse the engine at the LinkPreviewModel level. Here's what needs to be changed in browser/components/genai/LinkPreviewModel.sys.mjs:

The problem is in the generateTextAI function (lines 366-481). Currently:

- Lines 415-445: Creates a new engine every time
- Line 479: Terminates it after use

Instead, the engine should be:

- Created once and cached in the LinkPreviewModel object
- Reused across multiple generateTextAI() calls
- Only recreated if pipeline options change
- Properly terminated only when LinkPreviewModel is torn down

The solution would involve:

- Adding a cached engine field to LinkPreviewModel
- Checking if an engine exists and its options match before creating a new one
- Reusing the existing engine when possible
- Only calling engine.terminate() when the options change or on module teardown

This would eliminate the thread leak by ensuring threadpools are created once and reused, rather than being created and destroyed on every preview.
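The caching approach described above could be sketched roughly as follows. This is not the actual patch: EngineCache, createEngineStub, and terminateQuietly are hypothetical names, the stub only counts how many engines get created, and comparing options with JSON.stringify is a simplifying assumption (it is sensitive to property order). In the real module this state would presumably live on the LinkPreviewModel object and call the real createEngine().

```javascript
// Hypothetical stand-in for createEngine(); it only counts calls so the
// caching behaviour can be observed.
let enginesCreated = 0;
async function createEngineStub(options) {
  enginesCreated++;
  return {
    options,
    terminated: false,
    async terminate() {
      this.terminated = true;
    },
  };
}

// Terminate the engine behind a promise, ignoring failures. If creation or
// termination failed, this sketch has nothing further to release.
async function terminateQuietly(enginePromise) {
  if (!enginePromise) {
    return;
  }
  try {
    const engine = await enginePromise;
    await engine.terminate();
  } catch (e) {
    // Intentionally ignored in this sketch.
  }
}

// Hypothetical cache holder. It stores the creation promise rather than the
// engine, so concurrent callers share a single creation.
class EngineCache {
  #createEngine;
  #enginePromise = null;
  #optionsKey = null;

  constructor(createEngine) {
    this.#createEngine = createEngine;
  }

  // Return the cached engine, recreating it only when the options change.
  get(options) {
    const key = JSON.stringify(options);
    if (this.#enginePromise && this.#optionsKey === key) {
      return this.#enginePromise;
    }
    const previous = this.#enginePromise;
    const promise = (async () => {
      await terminateQuietly(previous);
      return this.#createEngine(options);
    })();
    this.#optionsKey = key;
    this.#enginePromise = promise;
    // Do not keep a failed creation cached; the next call retries.
    promise.catch(() => {
      if (this.#enginePromise === promise) {
        this.#enginePromise = null;
        this.#optionsKey = null;
      }
    });
    return promise;
  }

  // Terminate the cached engine, for example on module teardown.
  async teardown() {
    const previous = this.#enginePromise;
    this.#enginePromise = null;
    this.#optionsKey = null;
    await terminateQuietly(previous);
  }
}
```

With this shape, repeated calls using the same options share one engine, and terminate() runs only when the options change or on teardown. One limitation of the sketch: it does not wait for in-flight requests on the old engine before terminating it, which a real implementation would need to handle.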

Categories

(Core :: Machine Learning: On Device, defect, P3)
