1939888 - 12.44% Heap Unclassified Fresh start [+30s] (Linux) regression on Wed October 30 2024

Reporter

Description

•

2 months ago

treeherder

Perfherder has detected a awsy performance regression from push aa2b4302bdccdbfd659158479d68c4f377237e6c. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio	Test	Platform	Options	Absolute values (old vs new)
12%	Heap Unclassified Fresh start [+30s]	linux1804-64-shippable-qr	fission tp6	118,722,819.33 -> 133,491,586.67

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the patch(es) may be backed out in accordance with our regression policy.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

You can run all of these tests on try with ./mach try perf --alert 43251

The following documentation link provides more information about this command.

For more information on performance sheriffing please see our FAQ.

If you have any questions, please do not hesitate to reach out to afinder@mozilla.com.

Flags: needinfo?(emilio)

Alex Finder

Updated

•

2 months ago

status-firefox135: --- → affected

BugBot [:suhaib / :marco/ :calixte]

Comment 1

•

2 months ago

Set release status flags based on info from the regressing bug 1927111

status-firefox133: --- → affected

status-firefox134: --- → affected

status-firefox-esr128: --- → unaffected

Emilio Cobos Álvarez (:emilio)

Comment 2

•

2 months ago

So:

https://hg.mozilla.org/integration/autoland/rev/aa2b4302bdccdbfd659158479d68c4f377237e6c shouldn't be responsible since it's mac-only.
So that leaves https://hg.mozilla.org/integration/autoland/rev/75d799eb1f33254be20ff86f0776133c89e4c9ae

I'm guessing pulling in CustomizableUI is the key here, will send a try run to double-check.

The only reason we needed that was because of tests that called things like CustomizableUI.reset() without entering and leaving customize mode... Marco / Dao, do you know if there's a better way of dealing with it? I guess pulling CustomizableUI unnecessarily is rather unfortunate...

Flags: needinfo?(mak)

Flags: needinfo?(emilio)

Flags: needinfo?(dao+bmo)

Donal Meehan [:dmeehan]

Updated

•

2 months ago

status-firefox133: affected → wontfix

status-firefox136: --- → affected

status-firefox-esr115: --- → unaffected

Dão Gottwald [:dao]

Comment 3

•

2 months ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #2)

So:

https://hg.mozilla.org/integration/autoland/rev/aa2b4302bdccdbfd659158479d68c4f377237e6c shouldn't be responsible since it's mac-only.

So that leaves https://hg.mozilla.org/integration/autoland/rev/75d799eb1f33254be20ff86f0776133c89e4c9ae

I'm guessing pulling in CustomizableUI is the key here, will send a try run to double-check.

The only reason we needed that was because of tests that called things like CustomizableUI.reset() without entering and leaving customize mode... Marco / Dao, do you know if there's a better way of dealing with it? I guess pulling CustomizableUI unnecessarily is rather unfortunate...

Hmm, I'm not sure I understand the problem. CustomizableUI is already used extensively in front-end code, including in the startup path, e.g. https://searchfox.org/mozilla-central/rev/94c62970ba2f9c40efd5a4f83a538595425820d9/browser/base/content/browser.js#7506. So why would loading the module from UrlbarInput add additional overhead?

Flags: needinfo?(dao+bmo) → needinfo?(gijskruitbosch+bugs)

Dão Gottwald [:dao]

Comment 4

•

2 months ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #2)

So:

https://hg.mozilla.org/integration/autoland/rev/aa2b4302bdccdbfd659158479d68c4f377237e6c shouldn't be responsible since it's mac-only.

So that leaves https://hg.mozilla.org/integration/autoland/rev/75d799eb1f33254be20ff86f0776133c89e4c9ae

I'm guessing pulling in CustomizableUI is the key here, will send a try run to double-check.

Do you have results from the try run?

Flags: needinfo?(emilio)

Emilio Cobos Álvarez (:emilio)

Comment 5

•

2 months ago

Err, I must've forgotten to push or something. Here it is: https://perf.compare/compare-results?baseRev=82c1ad37ab3775a9737346b27a152f7c5d68a2f6&newRev=a6c5abc82a094aec4d10c057aedeb5833bf702f1&baseRepo=try&newRepo=try&framework=4

Also just trying to verify the regression:

https://treeherder.mozilla.org/jobs?repo=try&revision=4253f63bbdcd4690f44f9dd7bc3535f8155aae17 vs https://treeherder.mozilla.org/jobs?repo=try&revision=26f2b59e73cfdab3d9080b6afeb8b4932ace5388 (which should be https://perf.compare/compare-results?baseRev=26f2b59e73cfdab3d9080b6afeb8b4932ace5388&newRev=4253f63bbdcd4690f44f9dd7bc3535f8155aae17&baseRepo=try&newRepo=try&framework=4)

I'm thinking that it can't really be CustomizableUI, as that shouldn't be in heap-unclassified... But maybe popover or something? I'm also curious as why this looks platform specific (it shouldn't). It's also really massive for that change (10 megabytes).

Alex, two questions:

If I look at the same test across platforms, this seems only on Linux. It also seems that it was after a similar change in the opposite direction that didn't trigger an alert but is probably related. Would there be any chance of figuring out the culprit of the improvement shown in the linux graph before this regression? Or is it a configuration change of sorts?
How sure are we of the regression range here? Things are not adding up here, but let's see if the try runs show something useful.

Flags: needinfo?(emilio) → needinfo?(afinder)

:Gijs (he/him)

Comment 6

•

2 months ago

I agree it shouldn't be CUI-load-related - it's a singleton and it should have been loaded anyway.

The perfcompares suggest the regression is real (assuming that's just trypushes of the same commits).

I don't see where on the (new UI) perf compare I can look at the actual memory dumps from the tests... so not sure what other suggestions to offer?

Is it possible the geometryutils thingy pushed us over some webidl limit? Do we know if any (parts) of the individual commits already trip the regression, ie can we bisect it to specific code?

Flags: needinfo?(gijskruitbosch+bugs)

Andrew McCreight [:mccr8]

Comment 7

•

2 months ago

To get the actual about:memory report, you need to click through to the try push and then look under artifacts or something.

The gold standard way to investigate heap-unclassified is by running with DMD (it requires a special build flag, but that is set on Nightly). I'm not sure off the top of my head how to enable that on try and also set the save location to be MOZ_UPLOAD_DIR.

Marco Bonardo [:mak]

Comment 8

•

2 months ago

I don't have better suggestions, it seems doubtful that this has added new unclassified, more likely we're measuring something that was not measured before.
I concur with Gijs it may be nice starting to figure which line exactly causes the regression, is it sufficient to do anything with CustomizableUI to cause it (just call some method that doesn't add ref cycles), or is calling addListener necessary? is the toolbarvisibilitychange listener involved at all?

Flags: needinfo?(mak)

Andrew McCreight [:mccr8]

Comment 9

•

2 months ago

•

Edited

(In reply to Marco Bonardo [:mak] from comment #8)

I don't have better suggestions, it seems doubtful that this has added new unclassified, more likely we're measuring something that was not measured before.

Heap-unclassified is the amount of jemalloc allocations that are not covered by any memory reporter, so either we're allocating a lot more memory, or we've stopped measuring things. I do agree with the general point that it doesn't make much sense that a JS change would cause such a large regression in heap-unclassified.

Mayank Bansal

Comment 10

•

2 months ago

•

Edited

(In reply to Emilio Cobos Álvarez (:emilio) from comment #5)

Err, I must've forgotten to push or something. Here it is: https://perf.compare/compare-results?baseRev=82c1ad37ab3775a9737346b27a152f7c5d68a2f6&newRev=a6c5abc82a094aec4d10c057aedeb5833bf702f1&baseRepo=try&newRepo=try&framework=4

Also just trying to verify the regression:

https://treeherder.mozilla.org/jobs?repo=try&revision=4253f63bbdcd4690f44f9dd7bc3535f8155aae17 vs https://treeherder.mozilla.org/jobs?repo=try&revision=26f2b59e73cfdab3d9080b6afeb8b4932ace5388 (which should be https://perf.compare/compare-results?baseRev=26f2b59e73cfdab3d9080b6afeb8b4932ace5388&newRev=4253f63bbdcd4690f44f9dd7bc3535f8155aae17&baseRepo=try&newRepo=try&framework=4)

I'm thinking that it can't really be CustomizableUI, as that shouldn't be in heap-unclassified... But maybe popover or something? I'm also curious as why this looks platform specific (it shouldn't). It's also really massive for that change (10 megabytes).

Alex, two questions:

If I look at the same test across platforms, this seems only on Linux. It also seems that it was after a similar change in the opposite direction that didn't trigger an alert but is probably related. Would there be any chance of figuring out the culprit of the improvement shown in the linux graph before this regression? Or is it a configuration change of sorts?

How sure are we of the regression range here? Things are not adding up here, but let's see if the try runs show something useful.

Additional note: In this whole range, two other changes increased the heap-unclassified: bug 1917493 and Bug 1917652. Both landed, were backed out, relanded and backed out again. Each time they landed, the heap-unclassified increased. But when they were backed out, the values returned to baseline. Ultimately, they both were backed out and their negative effect was removed, so i did not include them in the points above.

Emilio Cobos Álvarez (:emilio)

Comment 11

•

2 months ago

Ok, so:

https://treeherder.mozilla.org/jobs?repo=try&revision=5738c0e6eabaa902f1e590e5f91101474e66bcfa is without the second patch (I expect to still see the regression)
https://treeherder.mozilla.org/jobs?repo=try&revision=2c1befa11ecb64a2937124a175e73d6061eb0189 with a full revert of the first patch (expect improvement)
https://treeherder.mozilla.org/jobs?repo=try&revision=54e5c38fe650fbe5f05ec11d5b68fedd3288d6e6 without showPopover() which is the only other thing that I think could affect this somehow?

Alex Finder

Comment 12

•

2 months ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #5)

Err, I must've forgotten to push or something. Here it is: https://perf.compare/compare-results?baseRev=82c1ad37ab3775a9737346b27a152f7c5d68a2f6&newRev=a6c5abc82a094aec4d10c057aedeb5833bf702f1&baseRepo=try&newRepo=try&framework=4

Also just trying to verify the regression:

https://treeherder.mozilla.org/jobs?repo=try&revision=4253f63bbdcd4690f44f9dd7bc3535f8155aae17 vs https://treeherder.mozilla.org/jobs?repo=try&revision=26f2b59e73cfdab3d9080b6afeb8b4932ace5388 (which should be https://perf.compare/compare-results?baseRev=26f2b59e73cfdab3d9080b6afeb8b4932ace5388&newRev=4253f63bbdcd4690f44f9dd7bc3535f8155aae17&baseRepo=try&newRepo=try&framework=4)

I'm thinking that it can't really be CustomizableUI, as that shouldn't be in heap-unclassified... But maybe popover or something? I'm also curious as why this looks platform specific (it shouldn't). It's also really massive for that change (10 megabytes).

Alex, two questions:

If I look at the same test across platforms, this seems only on Linux. It also seems that it was after a similar change in the opposite direction that didn't trigger an alert but is probably related. Would there be any chance of figuring out the culprit of the improvement shown in the linux graph before this regression? Or is it a configuration change of sorts?

From what I can see in the graph, the improvement before this regression looks like it was caused by 42d3134758c7 (or Bug 1921811 Make urlbar a popover so that it draws on the top layer), also mentioned by Mayank in his comment about the Return-toBaseline (Bug 1921811).

How sure are we of the regression range here? Things are not adding up here, but let's see if the try runs show something useful.

I can try backing out the patch and push to try, but I see you already tried this. I'll try to see if anyone else from the performance team has any ideas. I'll try reaching out to sparky to see if he has any ideas.

Flags: needinfo?(afinder) → needinfo?(gmierz2)

Emilio Cobos Álvarez (:emilio)

Comment 13

•

2 months ago

Yeah I guess at this point we do know that it causes the issue due to .showPopover being called earlier. Last link in comment 11 shows the improvement vs. the first.

So I guess the remaining thing is knowing why that causes a 10mb regression.

Emilio Cobos Álvarez (:emilio)

Comment 14

•

2 months ago

I need to dig into DMD for that but should be reproducible locally at least.

Flags: needinfo?(emilio)

Pascal Chevrel:pascalc

Updated

•

2 months ago

status-firefox134: affected → wontfix

Greg Mierzwinski [:sparky]

Comment 15

•

2 months ago

I see :emilio has found something that could be causing the issue. Feel free to needinfo me if there are any other questions.

Flags: needinfo?(gmierz2)

Ryan VanderMeulen [:RyanVM]

Updated

•

2 months ago

status-firefox135: affected → fix-optional

BugBot [:suhaib / :marco/ :calixte]

Comment 16

•

1 month ago

It has been over 7 days with no activity on this performance regression.

:emilio, since you are the author of the regressor, bug 1927111, which triggered this performance alert, could you please provide a progress update?

If this regression is something that fixes a bug, changes the baseline of the regression metrics, or otherwise will not be fixed, please consider closing it as WONTFIX. See this documentation for more information on how to handle regressions.

For additional information/help, please needinfo the performance sheriff who filed this alert (they can be found in comment #0), or reach out in #perftest, or #perfsheriffs on Element.

For more information, please visit BugBot documentation.

Flags: needinfo?(emilio)

Emilio Cobos Álvarez (:emilio)

Comment 17

•

1 month ago

Yeah so I couldn't look into this just yet, but given the change above I'm not super concerned. My patch just made the memory usage happen earlier, rather than the first time the urlbar is focused. Still should be investigated tho.

Flags: needinfo?(emilio)

Emilio Cobos Álvarez (:emilio)

Updated

•

1 month ago

Flags: needinfo?(emilio)

Emilio Cobos Álvarez (:emilio)

Updated

•

1 month ago

Severity: -- → S3

Priority: -- → P3

Greg Mierzwinski [:sparky]

Updated

•

1 month ago

Keywords: backlog-deferred

James Teow [:jteow]

Comment 18

•

1 month ago

It's probably more likely for this unclassified memory issue to be in layout or widget code rather than address bar, but we're not sure which it is, so for now we'll move this to Layout, but feel free to move it to more appropriate component.

Component: Address Bar → Layout

Product: Firefox → Core

Donal Meehan [:dmeehan]

Updated

•

29 days ago

status-firefox136: affected → fix-optional

Emilio Cobos Álvarez (:emilio)

Updated

•

25 days ago

Flags: needinfo?(emilio)