llama.cpp leaks ~500 MB for each link preview performed until threads are exhausted
Categories
(Core :: Machine Learning: On Device, defect, P3)
Tracking
()
People
(Reporter: gregtatum, Unassigned)
Details
I got the inference process up to 1.5 GB of memory usage before link preview refused to handle any more requests. From the performance profile, it looks like it's creating new threads and not reusing them. Eventually it seems to exhaust the available threads and breaks.
https://share.firefox.dev/3OeYJqz
I'm tentatively marking this as S2 since it reaches a pretty problematic amount of memory. It is quite easy to accidentally activate the link preview feature, since it's triggered by a long press on a link. The process is terminated after some timeouts, so the memory is eventually freed, but in the meantime the user could be negatively affected.
Comment 1 (Reporter) • 1 day ago
Here's what Claude suggests. I have no idea if it's accurate, but it's a good place to start the investigation:
Root Cause
Every time LinkPreviewModel.generateTextAI() is called, the following happens:
- New Engine Creation (LinkPreviewModel.sys.mjs:415-480):
  - Line 417-445: Creates a new engine via createEngine()
  - Line 479: Terminates the engine after use
- Engine Caching Issue (EngineProcess.sys.mjs:133-137 and MLEngineParent.sys.mjs:287-295):
  - Engines are supposed to be reused if pipeline options match
  - The engineId for link-preview is "wllamapreview" (line 174 in EngineProcess.sys.mjs)
  - However, the engine caching only works if called through the same MLEngine instance
- Native Thread Creation (LlamaBackend.cpp:221-254):
  - In ReinitializeContext(), new threadpools are created:
    - Line 247: mThreadpool.reset(mLib->ggml_threadpool_new(&tpp))
    - Line 235: mThreadpoolBatch.reset(mLib->ggml_threadpool_new(&tppBatch))
  - These create llama.cpp worker threads with callbacks (lines 223-224, 228-231)
- Worker Thread (LlamaRunner.cpp:437-439):
  - Each generation also creates a new "LlamaWorker" thread via NewNamedThread
  - This thread is properly shut down via AsyncShutdown() (line 391)
The Problem
The issue is that engines are NOT being reused properly. Looking at LinkPreviewModel.sys.mjs, each call to generateTextAI():
- Creates a fresh engine (line 417-445)
- Terminates it immediately after use (line 479)
This means:
- New llama.cpp threadpools are created each time (multiple threads per threadpool)
- The threadpools should be freed when the LlamaBackend is destroyed (via custom deleters in LlamaBackend.h:93-94)
- However, if there are any lingering references to the backend, the threadpools won't be freed
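To make the suspected pattern concrete, here is a minimal model of it. Nothing here is the real Firefox code: `createFakeEngine`, `generatePerCall`, `stats`, and the `leaks` flag are all hypothetical names, and the "two pools per engine" figure is only an assumption based on the two `ggml_threadpool_new` calls mentioned above.

```javascript
// Hypothetical model of the create/use/terminate flow; not the real APIs.
const stats = { poolsCreated: 0, poolsLive: 0 };

function createFakeEngine({ leaks }) {
  // Assumption: each engine owns two pools (regular and batch).
  stats.poolsCreated += 2;
  stats.poolsLive += 2;
  return {
    async run(text) {
      return `preview of ${text}`;
    },
    async terminate() {
      // If something still references the backend, the pools are never freed.
      if (!leaks) {
        stats.poolsLive -= 2;
      }
    },
  };
}

// Mirrors the reported flow: create an engine, use it once, terminate it.
async function generatePerCall(text, { leaks }) {
  const engine = createFakeEngine({ leaks });
  try {
    return await engine.run(text);
  } finally {
    await engine.terminate();
  }
}
```

In this model, every call pays for new pools, and with `leaks: true` the live pool count grows with each preview, which would be consistent with the memory growth in the profile if a lingering reference really does exist.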
Potential Solutions
- Reuse the engine instead of creating a new one each time:
  - Cache the engine at the LinkPreviewModel level
  - Only recreate if pipeline options change
- Ensure proper cleanup of threadpools:
  - Verify that all RefPtr references to LlamaBackend are dropped when engine terminates
  - Check if LlamaStreamSource might be keeping the backend alive
Based on my analysis, the issue is that LinkPreviewModel.generateTextAI() creates a new engine for every preview, which creates new llama.cpp threadpools each time, and these threads are not being properly reused.
The Fix
The solution is to cache and reuse the engine at the LinkPreviewModel level. Here's what needs to be changed in browser/components/genai/LinkPreviewModel.sys.mjs:
The problem is in the generateTextAI function (lines 366-481). Currently:
- Line 415-445: Creates a new engine every time
- Line 479: Terminates it after use
Instead, the engine should be:
- Created once and cached in the LinkPreviewModel object
- Reused across multiple generateTextAI() calls
- Only recreated if pipeline options change
- Properly terminated only when LinkPreviewModel is torn down
Would you like me to implement this fix? The solution would involve:
- Adding a cached engine field to LinkPreviewModel
- Checking if an engine exists and its options match before creating a new one
- Reusing the existing engine when possible
- Only calling engine.terminate() when the options change or on module teardown
This would eliminate the thread leak by ensuring threadpools are created once and reused, rather than being created and destroyed on every preview.
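A rough sketch of what that caching could look like, untested against the real module. `EngineCache` and the injected `createEngine` factory are hypothetical names, and comparing options via `JSON.stringify` is an assumption (it is sensitive to key order), not how the existing code compares pipeline options.

```javascript
// Sketch only: `createEngine` is an injected, hypothetical factory that
// returns a promise for an object with an async terminate() method.
class EngineCache {
  #enginePromise = null;
  #optionsKey = null;
  #createEngine;

  constructor(createEngine) {
    this.#createEngine = createEngine;
  }

  // Returns the cached engine when options are unchanged; otherwise
  // terminates the old engine and creates a replacement.
  getEngine(options) {
    const key = JSON.stringify(options); // assumption: stable key order
    if (this.#enginePromise && this.#optionsKey === key) {
      return this.#enginePromise;
    }
    const previous = this.#enginePromise;
    const promise = (async () => {
      if (previous) {
        const old = await previous.catch(() => null);
        await old?.terminate();
      }
      return this.#createEngine(options);
    })();
    this.#optionsKey = key;
    this.#enginePromise = promise;
    // Do not keep a failed creation cached.
    promise.catch(() => {
      if (this.#enginePromise === promise) {
        this.#enginePromise = null;
        this.#optionsKey = null;
      }
    });
    return promise;
  }

  // Call once on module teardown.
  async teardown() {
    const pending = this.#enginePromise;
    this.#enginePromise = null;
    this.#optionsKey = null;
    const engine = await pending?.catch(() => null);
    await engine?.terminate();
  }
}
```

Caching the promise rather than the resolved engine means two overlapping previews with the same options share one creation. One caveat: replacing the engine on an options change terminates the old one even if a request is still using it, so a real patch would likely need to wait for in-flight generations first.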