Open Bug 1939888 Opened 2 months ago Updated 25 days ago

12.44% Heap Unclassified Fresh start [+30s] (Linux) regression on Wed October 30 2024

Categories

(Core :: Layout, defect, P3)

defect

Tracking

()

Tracking Status
firefox-esr115 --- unaffected
firefox-esr128 --- unaffected
firefox133 --- wontfix
firefox134 --- wontfix
firefox135 --- fix-optional
firefox136 --- fix-optional

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Regression)

Details

(4 keywords)

Perfherder has detected a awsy performance regression from push aa2b4302bdccdbfd659158479d68c4f377237e6c. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Test Platform Options Absolute values (old vs new)
12% Heap Unclassified Fresh start [+30s] linux1804-64-shippable-qr fission tp6 118,722,819.33 -> 133,491,586.67

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the patch(es) may be backed out in accordance with our regression policy.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

You can run all of these tests on try with ./mach try perf --alert 43251

The following documentation link provides more information about this command.

For more information on performance sheriffing please see our FAQ.

If you have any questions, please do not hesitate to reach out to afinder@mozilla.com.

Flags: needinfo?(emilio)

Set release status flags based on info from the regressing bug 1927111

So:

I'm guessing pulling in CustomizableUI is the key here, will send a try run to double-check.

The only reason we needed that was because of tests that called things like CustomizableUI.reset() without entering and leaving customize mode... Marco / Dao, do you know if there's a better way of dealing with it? I guess pulling CustomizableUI unnecessarily is rather unfortunate...

Flags: needinfo?(mak)
Flags: needinfo?(emilio)
Flags: needinfo?(dao+bmo)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #2)

So:

I'm guessing pulling in CustomizableUI is the key here, will send a try run to double-check.

The only reason we needed that was because of tests that called things like CustomizableUI.reset() without entering and leaving customize mode... Marco / Dao, do you know if there's a better way of dealing with it? I guess pulling CustomizableUI unnecessarily is rather unfortunate...

Hmm, I'm not sure I understand the problem. CustomizableUI is already used extensively in front-end code, including in the startup path, e.g. https://searchfox.org/mozilla-central/rev/94c62970ba2f9c40efd5a4f83a538595425820d9/browser/base/content/browser.js#7506. So why would loading the module from UrlbarInput add additional overhead?

Flags: needinfo?(dao+bmo) → needinfo?(gijskruitbosch+bugs)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #2)

So:

I'm guessing pulling in CustomizableUI is the key here, will send a try run to double-check.

Do you have results from the try run?

Flags: needinfo?(emilio)

Err, I must've forgotten to push or something. Here it is: https://perf.compare/compare-results?baseRev=82c1ad37ab3775a9737346b27a152f7c5d68a2f6&newRev=a6c5abc82a094aec4d10c057aedeb5833bf702f1&baseRepo=try&newRepo=try&framework=4

Also just trying to verify the regression:

I'm thinking that it can't really be CustomizableUI, as that shouldn't be in heap-unclassified... But maybe popover or something? I'm also curious as why this looks platform specific (it shouldn't). It's also really massive for that change (10 megabytes).

Alex, two questions:

  • If I look at the same test across platforms, this seems only on Linux. It also seems that it was after a similar change in the opposite direction that didn't trigger an alert but is probably related. Would there be any chance of figuring out the culprit of the improvement shown in the linux graph before this regression? Or is it a configuration change of sorts?

  • How sure are we of the regression range here? Things are not adding up here, but let's see if the try runs show something useful.

Flags: needinfo?(emilio) → needinfo?(afinder)

I agree it shouldn't be CUI-load-related - it's a singleton and it should have been loaded anyway.

The perfcompares suggest the regression is real (assuming that's just trypushes of the same commits).

I don't see where on the (new UI) perf compare I can look at the actual memory dumps from the tests... so not sure what other suggestions to offer?

Is it possible the geometryutils thingy pushed us over some webidl limit? Do we know if any (parts) of the individual commits already trip the regression, ie can we bisect it to specific code?

Flags: needinfo?(gijskruitbosch+bugs)

To get the actual about:memory report, you need to click through to the try push and then look under artifacts or something.

The gold standard way to investigate heap-unclassified is by running with DMD (it requires a special build flag, but that is set on Nightly). I'm not sure off the top of my head how to enable that on try and also set the save location to be MOZ_UPLOAD_DIR.

I don't have better suggestions, it seems doubtful that this has added new unclassified, more likely we're measuring something that was not measured before.
I concur with Gijs it may be nice starting to figure which line exactly causes the regression, is it sufficient to do anything with CustomizableUI to cause it (just call some method that doesn't add ref cycles), or is calling addListener necessary? is the toolbarvisibilitychange listener involved at all?

Flags: needinfo?(mak)

(In reply to Marco Bonardo [:mak] from comment #8)

I don't have better suggestions, it seems doubtful that this has added new unclassified, more likely we're measuring something that was not measured before.

Heap-unclassified is the amount of jemalloc allocations that are not covered by any memory reporter, so either we're allocating a lot more memory, or we've stopped measuring things. I do agree with the general point that it doesn't make much sense that a JS change would cause such a large regression in heap-unclassified.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #5)

Err, I must've forgotten to push or something. Here it is: https://perf.compare/compare-results?baseRev=82c1ad37ab3775a9737346b27a152f7c5d68a2f6&newRev=a6c5abc82a094aec4d10c057aedeb5833bf702f1&baseRepo=try&newRepo=try&framework=4

Also just trying to verify the regression:

I'm thinking that it can't really be CustomizableUI, as that shouldn't be in heap-unclassified... But maybe popover or something? I'm also curious as why this looks platform specific (it shouldn't). It's also really massive for that change (10 megabytes).

Alex, two questions:

  • If I look at the same test across platforms, this seems only on Linux. It also seems that it was after a similar change in the opposite direction that didn't trigger an alert but is probably related. Would there be any chance of figuring out the culprit of the improvement shown in the linux graph before this regression? Or is it a configuration change of sorts?

  • How sure are we of the regression range here? Things are not adding up here, but let's see if the try runs show something useful.

Additional note: In this whole range, two other changes increased the heap-unclassified: bug 1917493 and Bug 1917652. Both landed, were backed out, relanded and backed out again. Each time they landed, the heap-unclassified increased. But when they were backed out, the values returned to baseline. Ultimately, they both were backed out and their negative effect was removed, so i did not include them in the points above.

Ok, so:

(In reply to Emilio Cobos Álvarez (:emilio) from comment #5)

Err, I must've forgotten to push or something. Here it is: https://perf.compare/compare-results?baseRev=82c1ad37ab3775a9737346b27a152f7c5d68a2f6&newRev=a6c5abc82a094aec4d10c057aedeb5833bf702f1&baseRepo=try&newRepo=try&framework=4

Also just trying to verify the regression:

I'm thinking that it can't really be CustomizableUI, as that shouldn't be in heap-unclassified... But maybe popover or something? I'm also curious as why this looks platform specific (it shouldn't). It's also really massive for that change (10 megabytes).

Alex, two questions:

  • If I look at the same test across platforms, this seems only on Linux. It also seems that it was after a similar change in the opposite direction that didn't trigger an alert but is probably related. Would there be any chance of figuring out the culprit of the improvement shown in the linux graph before this regression? Or is it a configuration change of sorts?

From what I can see in the graph, the improvement before this regression looks like it was caused by 42d3134758c7 (or Bug 1921811 Make urlbar a popover so that it draws on the top layer), also mentioned by Mayank in his comment about the Return-toBaseline (Bug 1921811).

  • How sure are we of the regression range here? Things are not adding up here, but let's see if the try runs show something useful.

I can try backing out the patch and push to try, but I see you already tried this. I'll try to see if anyone else from the performance team has any ideas. I'll try reaching out to sparky to see if he has any ideas.

Flags: needinfo?(afinder) → needinfo?(gmierz2)

Yeah I guess at this point we do know that it causes the issue due to .showPopover being called earlier. Last link in comment 11 shows the improvement vs. the first.

So I guess the remaining thing is knowing why that causes a 10mb regression.

I need to dig into DMD for that but should be reproducible locally at least.

Flags: needinfo?(emilio)

I see :emilio has found something that could be causing the issue. Feel free to needinfo me if there are any other questions.

Flags: needinfo?(gmierz2)

It has been over 7 days with no activity on this performance regression.

:emilio, since you are the author of the regressor, bug 1927111, which triggered this performance alert, could you please provide a progress update?

If this regression is something that fixes a bug, changes the baseline of the regression metrics, or otherwise will not be fixed, please consider closing it as WONTFIX. See this documentation for more information on how to handle regressions.

For additional information/help, please needinfo the performance sheriff who filed this alert (they can be found in comment #0), or reach out in #perftest, or #perfsheriffs on Element.

For more information, please visit BugBot documentation.

Flags: needinfo?(emilio)

Yeah so I couldn't look into this just yet, but given the change above I'm not super concerned. My patch just made the memory usage happen earlier, rather than the first time the urlbar is focused. Still should be investigated tho.

Flags: needinfo?(emilio)
Flags: needinfo?(emilio)
Severity: -- → S3
Priority: -- → P3

It's probably more likely for this unclassified memory issue to be in layout or widget code rather than address bar, but we're not sure which it is, so for now we'll move this to Layout, but feel free to move it to more appropriate component.

Component: Address Bar → Layout
Product: Firefox → Core
Flags: needinfo?(emilio)
You need to log in before you can comment on or make changes to this bug.