Open Bug 1866783 Opened 2 years ago Updated 1 year ago

3.22 - 2.04% Base Content Heap Unclassified / Base Content Heap Unclassified + 1 more (Linux, Windows) regression on Tue November 21 2023

Categories

(Core :: DOM: Core & HTML, defect, P3)

defect

Tracking

()

Tracking Status
firefox-esr115 --- unaffected
firefox120 --- unaffected
firefox121 --- wontfix
firefox122 --- fix-optional

People

(Reporter: bacasandrei, Unassigned)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: perf, perf-alert, regression)

Perfherder has detected a awsy performance regression from push 238d07f2df955a10d01dd4e6197356c768bf6ba0. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Test Platform Options Absolute values (old vs new)
3% Base Content Heap Unclassified linux1804-64-shippable-qr fission 2,215,389.50 -> 2,286,742.67
2% Base Content Heap Unclassified windows11-64-2009-shippable-qr fission 1,793,759.00 -> 1,837,259.33
2% Base Content Heap Unclassified linux1804-64-shippable-qr fission 2,240,853.50 -> 2,286,600.33

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the patch(es) may be backed out in accordance with our regression policy.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

For more information on performance sheriffing please see our FAQ.

Note: The needinfo should have been addressed to :chutten, but I was unable to do so due to PTO settings.

Flags: needinfo?(peterv)

Note: The needinfo should have been addressed to :chutten, but I was unable to do so due to PTO settings.

I will take care to notify Chris about this once he's back from PTO.

Set release status flags based on info from the regressing bug 1852098

Flags: needinfo?(peterv) → needinfo?(chutten)

To make sure I understand correctly, this is an additional 40-60k of memory not attributed to any memory reporter. The bug's about two things, then: 1) Is this an expected, understood, acceptable, etc. increase in heap per process? and 2) Maybe this memory ought to be reported, huh?

Do I have that right?

For 1) I'd say it's a little expected. bug 1852098 introduced nearly 2300 new metrics. 20B or so per metric sounds about right. I can't say whether it's acceptable, but it's a pretty small value amortized over the metrics involved, so maybe it is?

I will say that in local testing of removing the old use counters, there is no improvement to unclassified heap, but there is an improvement somewhere between 4k-80k (I ran two local runs of ./mach awsy-test --base both with and without and took the farthest and closest measurements) in explicit heap (which I guess is to be expected since Telemetry specifically reports its memory)

Since we're aiming to remove the old stuff and get some/most/all/more of the memory back that we're adding here, we might be cool to let this ride for the nonce?

Flags: needinfo?(chutten)

(In reply to Chris H-C :chutten from comment #3)

Since we're aiming to remove the old stuff and get some/most/all/more of the memory back that we're adding here, we might be cool to let this ride for the nonce?

I would strongly second this.

Duplicate of this bug: 1867658
See Also: 1867658
  1. What's the timeframe for removing the old stuff?
  2. Agree that adding reporters so this memory is properly accounted for makes sense.

A regression in heap-unclassified isn't a huge deal, but yes please add some memory reporters to fix this. Not having memory reporters makes understanding Firefox memory usage more difficult.

  1. ASAP. I hope to have something concrete in bug 1866834 in the coming days, but it'll boil down to "As soon as we (mozilla in general, but Emilio, Nick Alexander, and myself in specific) are convinced the data coming out of the new is at least as good and useful as the data coming out of the old (through performing a data validation and marking our acceptance therein)"
  2. So, uh... how do I add memory reporters? And is it possible in Rust? Is there docs for that? (:mccr8, do you know?) -- I've looked for mention of CollectReports in the sourcedocs to no avail.
Flags: needinfo?(continuation)

Bad news: the amount of base heap gained by removing Telemetry Use Counter probes is minimal if any. My local testing's pretty conclusive that there's some, but it's hidden in the noise when done in CI.

So if we want improvements to base heap allocation, we're going to have to find it ourselves and not just wait for the old system to be retired.

I'm not really sure what the state of Rust memory reporting is. Servo must do it I'm assuming? Nika, do you know if there's any documentation of writing memory reporters in Rust? I vaguely remember something about macros...

Flags: needinfo?(continuation) → needinfo?(nika)

IIRC there is some helper stuff which exists to make setting up rust memory reporting easier. There's a malloc_size_of crate which has a derive for annotating structs, and you can use the rust-xpcom APIs to register a nsIMemoryReporter with the nsIMemoryReporterManager. Unfortunately, I'm relatively unaware of how best to do this, so I'm going to redirect to :emilio to give more advice.

This is definitely something which is going to continue coming up more and more often, and we should perhaps look into making the story around memory reporters in Rust more ergonomic (e.g. by exporting malloc_size_of from the xpcom crate and setting up a more idiomatic reporter interface in the future).

Flags: needinfo?(nika) → needinfo?(emilio)

I think a pre-existing example could be this, or the code from bug 1677866.

That code uses the relevant nsIMemoryReporterManager stuff a bit below. That reuses wr_malloc_size_of, which is probably fine, rather than the malloc_size_of version inside servo, which has a lot more dependencies / servo-specific implementations.

Unifying them in the long term would be nice (the reason wr_malloc_size_of exists is so that wr can be published in crates.io), but for now something like what security/ is doing seems alright. Does that help?

Flags: needinfo?(emilio)

The severity field is not set for this bug.
:farre, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(afarre)

Apparently I need some background on what memory reporting is and how it works, and luckily nsIMemoryManager.idl has a link to some docs... lol. I went looking for where in the source docs it might've been migrated to, but couldn't find a thing. So I'll crack on reading code.

Seems as though there's a few problems to conquer:

  • One is to design the organization of memory reporters within telemetry/components/glean.
    • Since we have an nsIFOG singleton, that seems like a good root node (like how Legacy Telemetry uses TelemetryImpl). We can define subreports on collection (explicit/fog/gifft, explicit/fog/mla, maybe an organization will suggest itself when I start writing).
    • I'm not sure what to do about measuring the SDK's memory itself at this time. Might be most sensible to defer that unless we find that there's a hidden cache of stuff in there somewhere (ping_lifetime_data?). I think most of the accumulations will be out in the embedding (FOG). To check this assumption I suppose I could check the changes in unclassified heap when I add/remove use counters and see if there's a large shift.
  • Another problem is measuring the size of things.
    • In C++ we can avail ourselves of SizeOfIncludingThis and friends. This'll get us through the GIFFT maps at least (though we may want to address bug 1868174 first, or at the same time)
    • In Rust we get to use MallocSizeOfOps like security is using as described in comment #12. I'm not sure how #[derive(MallocSizeOf)] works (pretty sure that's where size_of(mozilla::MallocSizeOf) comes from) or how we're going to count all the stuff we codegen (looking at you Lazy). But at least all those helpful impls will give me a great starting point.

I think this might be enough to start writing code, though it's about 100 minutes from signing off for the year, so mostly I'm writing this down so I have a hope at remembering things in 2024.

Assignee: nobody → chutten
Severity: -- → S4
Status: NEW → ASSIGNED
Flags: needinfo?(afarre)
Priority: -- → P2
Assignee: chutten → nobody
Blocks: 1882112
Status: ASSIGNED → NEW
Priority: P2 → P3
You need to log in before you can comment on or make changes to this bug.