Closed Bug 1892940 Opened 2 years ago Closed 2 years ago

11.07 - 10.15% Images / Images (OSX) regression on Wed April 17 2024

Tracking

(firefox-esr115 unaffected, firefox125 unaffected, firefox126 unaffected, firefox127 fixed, firefox128 fixed)

Status:

RESOLVED FIXED

Milestone:

128 Branch

Tracking Flags:

Tracking

Status

firefox-esr115

---

unaffected

firefox125

---

unaffected

firefox126

---

unaffected

firefox127

---

fixed

firefox128

---

fixed

People

(Reporter: fbilt, Assigned: whimboo)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Attachments

(1 file)

memory diff (base vs. patched) 2 years ago Henrik Skupin [:whimboo][⌚️UTC+2] 78.83 KB, text/plain		Details

Florin Bilt

Reporter

Description

•

2 years ago

Perfherder has detected a awsy performance regression from push e96fce323a1bee6228bfcb97988b7fbf37681f10. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio	Test	Platform	Options	Absolute values (old vs new)
11%	Images	macosx1015-64-shippable-qr	fission tp6	6,004,909.82 -> 6,669,934.43
10%	Images	macosx1015-64-shippable-qr	fission tp6	6,032,871.24 -> 6,645,367.10

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the patch(es) may be backed out in accordance with our regression policy.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

You can run these tests on try with ./mach try perf --alert 42353

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(hskupin)

BugBot [:suhaib / :marco/ :calixte]

Comment 1

•

2 years ago

Set release status flags based on info from the regressing bug 1884401

status-firefox125: --- → unaffected

status-firefox126: --- → unaffected

status-firefox-esr115: --- → unaffected

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 2

•

2 years ago

I can already see a higher Images value for a landing before my code reached autoland/central.

My change only modifies how Marionette starts the application but doesn't make any changes to Firefox itself. So I doubt that my change could have actually made this change.

Daisuke, given that you worked on bug 1891602, do you think that your patch might have caused this regression?

Flags: needinfo?(hskupin) → needinfo?(daisuke)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 3

•

2 years ago

Nevertheless I've triggered a perf job with a backout of my patch:

Base: https://treeherder.mozilla.org/jobs?repo=try&revision=5c313d96a02c6e793f5612065f0b43fe8cd8d315
Backout: https://treeherder.mozilla.org/jobs?repo=try&revision=b737d6f7835a98e32db6ffb2133e48f9c7f081f4

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 4

•

2 years ago

Actually as it looks like it is indeed a regression as caused by my patch.

I'll take a look how image memory is recorded and how starting Firefox via psutil will cause this change in memory usage.

Flags: needinfo?(daisuke)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 5

•

2 years ago

The Images perf suite seems to sample memory from explicit/images. And the subtest that is actually affected here is Images closed extra processes opt fission tp6 and especially the Fresh start checkpoint. It increased by 173%.

Running ./mach awsy-test --tp6 --quick locally I can as well see an increase for Fresh Start from around 1MB to 3MB.

Andrew, do you have an idea what this specifically could mean?

Flags: needinfo?(continuation)

Andrew McCreight [:mccr8]

Comment 6

•

2 years ago

I'd recommend downloading the actual memory reports from the checkpoint in question, before and after your patch, and then using the "load and diff" option in about:memory to see where the actual difference lies.

That Fresh Start link doesn't make much sense to me. "Fresh start" and "closed extra processes" are separate checkpoints, I thought, so I have no idea what it means for something to be both.

If we ignore the "closed extra processes" here and just treat this as a "Fresh start" regression, one thing that I can think of is that we drop the decoded images a certain amount of time after we're idle. Maybe your change causes the browser to start up a little later and we end up not dropping images quite as fast. If "fresh start" is the only checkpoint that regressed (and not, say, "fresh start [+30s]", and your patch just messes with test harness stuff, then this is probably okay. But it would be good to understand what regressed a bit better.

Flags: needinfo?(continuation)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 7

•

2 years ago

(In reply to Andrew McCreight [:mccr8] from comment #6)

I'd recommend downloading the actual memory reports from the checkpoint in question, before and after your patch, and then using the "load and diff" option in about:memory to see where the actual difference lies.

Ah that's a good idea. Let me do that.

That Fresh Start link doesn't make much sense to me. "Fresh start" and "closed extra processes" are separate checkpoints, I thought, so I have no idea what it means for something to be both.

Yes, there is a bug in the perfherder UI. This is really about Fresh Start and not closed extra processes. Greg filed a bug to get this fixed.

If we ignore the "closed extra processes" here and just treat this as a "Fresh start" regression, one thing that I can think of is that we drop the decoded images a certain amount of time after we're idle. Maybe your change causes the browser to start up a little later and we end up not dropping images quite as fast. If "fresh start" is the only checkpoint that regressed (and not, say, "fresh start [+30s]", and your patch just messes with test harness stuff, then this is probably okay. But it would be good to understand what regressed a bit better.

Yes, Fresh Start [+30s] stays the same. It's only the memory used right after startup. I'll attach a diff for the both runs.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 8

•

2 years ago

Attached file memory diff (base vs. patched) — Details

Could you maybe have a look at this diff? Maybe we can really ignore it?

Flags: needinfo?(continuation)

Andrew McCreight [:mccr8]

Comment 9

•

2 years ago

It looks like the change is from about:newtab being alive in the privilegedabout process. Is that present in the before memory report? I'm guessing not. It could even be the case that your changes caused things to load faster and about:newtab wasn't loaded before. I don't know what the expected thing is. Honestly, it wouldn't be a bad idea to ignore the FreshStart data point, but I haven't noticed it being a problem before. The general way we average a few data points evenly in AWSY is a bit odd.

It seems a little odd, but yeah I think that is fine to ignore it. It sounds like this is a test harness change that is causing a massive increase in one data point.

Status: NEW → RESOLVED

Closed: 2 years ago

Flags: needinfo?(continuation) → needinfo?(hskupin)

Resolution: --- → WONTFIX

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 10

•

2 years ago

(In reply to Andrew McCreight [:mccr8] from comment #9)

It looks like the change is from about:newtab being alive in the privilegedabout process. Is that present in the before memory report?

Yes, it is. At least it's shown by these lines:

privilegedabout (pid 1188)
Explicit Allocations 
[..]
├───4.87 MB (19.79%) -- window-objects
│   ├──4.81 MB (19.53%) -- top(about:newtab, id=27)
│   │  ├──2.53 MB (10.29%) -- active
│   │  │  ├──2.47 MB (10.03%) -- window(about:newtab)
│   │  │  │  ├──1.54 MB (06.26%) -- js-realm(about:newtab)

But with Marionette we disable about:newtab and by default load about:blank instead.

I've done some more checks and looked at parts that are different between subprocess.Popen and psutil.Popen, and after commenting out certain parts of the Marionette's process handler I noticed that the performance regression is visible when running the following code:

https://searchfox.org/mozilla-central/rev/b41bb321fe4bd7d03926083698ac498ebec0accf/testing/marionette/client/marionette_driver/processhandler.py#231-250

When processOutput is not called when creating the process it works fine. As such it looks like that we have an issue with reading stdout/stderr and maybe we do not read everything and leave logs in the buffer which is accumulated? Where would this be reflected in the memory logs?

I'm off tomorrow but it looks like I should further check to find the exact reason and fix it.

Status: RESOLVED → REOPENED

Flags: needinfo?(hskupin) → needinfo?(continuation)

Resolution: WONTFIX → ---

Andrew McCreight [:mccr8]

Comment 11

•

2 years ago

The memory logs rely on reading the memory reports that get dumped to disk, to specific file names. I can't imagine how failing to write something to stdout would cause those to change.

Flags: needinfo?(continuation)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 12

•

2 years ago

Here some more findings but it's still unclear to me what's causing it:

The whole ProcessReader code as used by Marionette is a 1:1 copy from mozprocess without any changes
Running the process reader and not using stdout=subprocess.PIPE when starting the Firefox process the memory usage drops
Not running the process reader and using stdout=subprocess.PIPE when starting the Firefox process the memory usage drops as well

I further tried to remove all the code that is related to handling stdout but without any change.

I've then also checked again the Images perf results for MacOS for the last 90 days and noticed that there was quite a fluctuation including a lot of cases when the Fresh start was already at this level. With my patch landed the results now look way more stable and values do not have a range of 1.7MB but just 0.4MB now.

That that I may agree to Andrew and that we should accept the new value as base for the future.

Status: REOPENED → RESOLVED

Closed: 2 years ago → 2 years ago

Resolution: --- → WORKSFORME

Andrew McCreight [:mccr8]

Updated

•

2 years ago

Resolution: WORKSFORME → WONTFIX

Andrew McCreight [:mccr8]

Comment 13

•

2 years ago

I'm marking this wontfix because it sounds like we're accepting the regression, rather than it went away for an unknown reason.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 14

•

2 years ago

Actually today I found another regression in Marionette related to psutil and reading the process output (see bug 1896106). As it turned out for the Marionette's new session command we are using poll() which forwards to wait() including waiting for the reader to finish. This causes a 1s delay before any Marionette consumer including AWSY can continue. But that also means that in this 1s Firefox has acquired more memory, which basically ends up in this regression.

As the try build including the awsy tp6 jobs show the images measurement returns to the previous value, and fresh start drops by 63%!

Assignee: nobody → hskupin

Status: RESOLVED → REOPENED

Depends on: 1896106

Resolution: WONTFIX → ---

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

2 years ago

Severity: -- → S3

Priority: -- → P2

BugBot [:suhaib / :marco/ :calixte]

Comment 15

•

2 years ago

Set release status flags based on info from the regressing bug 1884401

status-firefox128: --- → affected

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 16

•

2 years ago

Florin, could you please check if my patch from bug 1896106 actually fixed this regression? Or do we have to wait some more days to get better numbers?

Flags: needinfo?(fbilt)

Florin Bilt

Reporter

Comment 17

•

2 years ago

The graph seems ti have returned to its initial values. Good job!

Perfherder has detected a awsy performance change from push 78ea7783f16f564aa8886dd8e51b42c2971c508b.

Improvements:

Ratio	Test	Platform	Options	Absolute values (old vs new)
10%	Images	macosx1015-64-shippable-qr	fission tp6	6,642,001.27 -> 5,979,633.87

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

You can run these tests on try with ./mach try perf --alert 221

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(fbilt)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 18

•

2 years ago

That's great to hear. In that case lets close this bug as fixed!

Status: REOPENED → RESOLVED

Closed: 2 years ago → 2 years ago

status-firefox128: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 128 Branch

Ryan VanderMeulen [:RyanVM]

Updated

•

2 years ago

status-firefox127: affected → fixed

You need to log in before you can comment on or make changes to this bug.