Open Bug 1924881 Opened 1 year ago Updated 2 months ago

Investigate pipeline-server memory growth / usage for potential leaks

Categories

(Webtools :: Searchfox, task)

Tracking

(Not tracked)

People

(Reporter: asuth, Unassigned)

References

Details

I noticed the pipeline-server fell over today due to apparently hitting its memory limit. Logs didn't show any suspicious queries, just normal ones that the context menu would generate. Briefly noodling with htop open on the server and issuing requests seemed to show non-trivial amounts of retention that could explain this. Note that we do explicitly give the pipeline-server a ulimit cap in order to ensure that the alpha-quality pipeline-server can't cause problems for the stable, supported functionality.

In terms of known potential things that could happen:

  • We explicitly use ustr for string-interning of symbols and pretty identifiers which, by design, entrains all values it sees. The rationale for this has been that the set of symbols/pretty identifiers is bounded.
    • We mainly arrived here because we were already using ustr as an evolution of :mccr8's crossref optimizations, which used refcounted strings to minimize our memory usage in the case where we hold the entire crossref in memory at one time and so the memory usage really matters. We hadn't initially parameterized the structs, so it made sense to just use the ustr values where they already existed and take the potential performance improvements from being able to compare strings more efficiently. But :arai made most (all?) of our structs parameterized on string types, so we can potentially stop doing this if it turns out to be the actual problem.
  • The fancy debug logging stuff uses tracing-forest to aggregate all logs associated with specific spans of interest. This inherently accumulates and buffers the logs for the duration of the span. This should only be active when &debug=true is passed to the pipeline server, and even then the memory use should be ephemeral, but if this goes awry, it would very quickly entrain a ton of memory.
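To make the interning concern above concrete, here's a minimal sketch in plain std (this is not ustr's actual implementation, just an illustration of the mechanism) of why an interner entrains every distinct string it sees for the lifetime of the process:

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

// Global interner: every distinct string is leaked so that the
// &'static str stays valid forever. Memory stays bounded only if the
// set of interned strings (symbols / pretty identifiers) is bounded.
static INTERNED: OnceLock<Mutex<HashSet<&'static str>>> = OnceLock::new();

fn intern(s: &str) -> &'static str {
    let set = INTERNED.get_or_init(|| Mutex::new(HashSet::new()));
    let mut guard = set.lock().unwrap();
    if let Some(&existing) = guard.get(s) {
        // Already seen: no new allocation, and callers can compare by
        // pointer instead of by content.
        return existing;
    }
    // First time seen: allocate once and retain for the process lifetime.
    let leaked: &'static str = Box::leak(s.to_owned().into_boxed_str());
    guard.insert(leaked);
    leaked
}

fn interned_count() -> usize {
    INTERNED
        .get_or_init(|| Mutex::new(HashSet::new()))
        .lock()
        .unwrap()
        .len()
}

fn main() {
    let a = intern("nsIFoo::Bar");
    let b = intern("nsIFoo::Bar");
    // Re-interning the same string is cheap and pointer-equal...
    assert!(std::ptr::eq(a, b));
    intern("nsIFoo::Baz");
    // ...but every *distinct* string is retained forever.
    println!("interned: {}", interned_count());
}
```

The upside is cheap equality comparisons and deduplication; the downside, as noted above, is that a server fed an unbounded stream of distinct strings grows without bound.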

In terms of available tooling / resources:

Note that an orthogonal thing we could do here is move all our daemons to be run and automatically restarted by systemd.
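As a sketch of that orthogonal option, a hypothetical systemd unit could handle both restart-on-failure and a memory cap; all paths, names, and limits below are placeholders, not our actual deployment:

```ini
# Hypothetical pipeline-server.service; the interesting lines are
# Restart=on-failure (automatic restart) and MemoryMax (a
# cgroup-enforced cap, analogous to the existing ulimit).
[Unit]
Description=searchfox pipeline-server
After=network.target

[Service]
ExecStart=/home/ubuntu/searchfox/pipeline-server
User=ubuntu
Restart=on-failure
RestartSec=5
MemoryMax=4G

[Install]
WantedBy=multi-user.target
```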

See Also: → 1924886

I did a quick pass at this locally using bytehound (noting that I had problems building it, so I used the most recent 0.11.0 release binaries) and:

  • It suggested we're not leaking anything, but our "Clean" memory does grow over time if we do a bunch of diagramming stuff trying to make sure we touch a ton of symbols; that's expected from the memory mapping. (We do also expect ustr usage to grow over time, but the accumulation seems negligible.) The dominant memory usage for firefox-main is from FileLookupMap::new, which gets intentionally entrained by the local_index::make_all_local_servers step and is ~326M for firefox-main.
    • Because of how excerpt extraction currently works, its performance is not amazing, so queries using the "default" search can be somewhat long-running. In that case, if you kill the pipeline-server while a query is still running, the liquid Template::render calls can be holding onto a large Vec that they're writing into, which looks like a leak but is not one if you let them finish what they're doing.
    • That said, I did get one case where a small leftover liquid String was hanging around with hyper's dispatcher on the stack in a MapResponseFuture, and I do wonder if there's some kind of very limited retention of responses. But so far I've only ever seen a single one of these, and it may be that my shutdown process of "ctrl-c" followed by "ctrl-" loses a small number of samples, which would mean the samples of the last responses getting freed don't end up in the log.
  • It's a pretty great tool, especially the scripting console! The case study in the docs in particular provides a nice overview of how that can be used, although the most immediately useful thing is the "never deallocated flamegraph".
  • Caveats: At default settings with full stack tracing, the traces can get quite large (6G and 10G for ~6-minute traces), and the UI can then want a fair bit of memory (~40G resident).
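For reference, the basic invocation pattern I used looks roughly like this (binary names are per the 0.11.0 release layout; paths are placeholders for your own install):

```shell
# Run the server under the profiler; every allocation gets traced,
# which is where the large .dat files come from.
LD_PRELOAD=/path/to/libbytehound.so ./pipeline-server

# Afterwards, load the resulting trace(s) into the analysis UI
# (this is where the flamegraphs and scripting console live).
bytehound server memory-profiling_*.dat
```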

For what I tested (issuing a bunch of queries for each):

  • calls-between / calls-from / calls-to / class diagram / inheritance diagram. I did the normal thing, plus I used the debug mode for calls-from and calls-to, which turns on tracing / tracing-forest and builds giant JSON buffers.
  • the "default" query with tons of results
  • the field-layout mechanism

I feel much better about pipeline-server now, but I think it makes sense to leave this open. Probably a good thing to have before closing would be basic memory-usage metrics for pipeline-server over time, especially if we can correlate against the specific queries it received.
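A minimal sketch of what such memory-usage metrics could look like (Linux-only, reading VmRSS from /proc/self/status; the function name and the logging shape are hypothetical, not existing pipeline-server code):

```rust
use std::fs;
use std::thread;
use std::time::Duration;

// Sample the process's resident set size so that growth over time can
// later be correlated against the specific queries in the request log.
// "VmRSS" is the resident-set field in /proc/[pid]/status.
fn resident_kib() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    let line = status.lines().find(|l| l.starts_with("VmRSS:"))?;
    // Line looks like "VmRSS:      12345 kB"; take the number.
    line.split_whitespace().nth(1)?.parse().ok()
}

fn main() {
    // In a real deployment this loop would live on a background task and
    // emit to the structured log with a timestamp and recent request IDs.
    for _ in 0..3 {
        if let Some(kib) = resident_kib() {
            println!("VmRSS: {kib} KiB");
        }
        thread::sleep(Duration::from_millis(100));
    }
}
```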

:nika reported a pipeline-server failure on Nov 17 (13:21 eastern) 2025 that :arai looked into and there was an allocation failure and :arai manually restarted the server. It could be interesting to check the ELB logs to see what the pipeline-server might have been dealing with in the lead-up to that, but the comment 0 note about just having systemd restart pipeline-server if it falls over is of course something we can do without figuring out what might have been going on.
