2.91 - 3.03% Explicit Memory (windows7-32) regression on push 517c9efa9592ad0c103ec2542e4bc7fcaf06c505 (Fri September 20 2019)
Categories
(Firefox :: Address Bar, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr60 | --- | unaffected |
firefox-esr68 | --- | unaffected |
firefox67 | --- | unaffected |
firefox68 | --- | unaffected |
firefox69 | --- | unaffected |
firefox70 | --- | unaffected |
firefox71 | --- | wontfix |
firefox72 | --- | fixed |
People
(Reporter: marauder, Assigned: dao)
References
(Regression)
Details
(Keywords: perf, perf-alert, regression)
Attachments
(2 files)
We have detected an awsy regression from push:
As author of one of the patches included in that push, we need your help to address this regression.
Regressions:
3% Explicit Memory windows7-32 opt 275,105,364.06 -> 283,428,326.77
3% Explicit Memory windows7-32 opt 274,870,410.18 -> 282,861,784.65
You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=23229
On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.
To learn more about the regressing test(s), please see: https://wiki.mozilla.org/AWSY/Tests
Reporter | ||
Updated•5 years ago
|
Assignee | ||
Comment 1•5 years ago
|
||
How confident are we in the regression range? I have no idea off-hand how bug 1580538 could have caused this.
Comment 2•5 years ago
•
|
||
(In reply to Dão Gottwald [::dao] from comment #1)
How confident are we in the regression range?
What would be interesting is knowing whether the needle moved at all on other platforms.
I have no idea off-hand how bug 1580538 could have caused this.
The job details have memory reports (see 'job details' tab on https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&searchStr=awsy&revision=517c9efa9592ad0c103ec2542e4bc7fcaf06c505&selectedJob=268135403 ) for the subtests that regressed ( https://treeherder.mozilla.org/perf.html#/comparesubtest?framework=4&originalProject=autoland&originalSignature=2118213&newProject=autoland&newSignature=2118213&originalRevision=ac5fc1e491a2bb5481d47247bea3c2b22318001c&newRevision=517c9efa9592ad0c103ec2542e4bc7fcaf06c505 ) which you can compare in about:memory to see what's eating the extra memory.
Comment 4•5 years ago
|
||
(In reply to Marian Raiciof [:marauder] from comment #3)
Thanks for the details, Gijs!
I was still hoping you would know how to figure out if there were similar AWSY regressions on other platforms (perhaps of smaller magnitude, so they didn't trigger alerts) ?
Reporter | ||
Comment 5•5 years ago
|
||
I added more platforms to the graph and also did backfills for these, because there are missing data points:
(it might take a while to see the data points)
https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=autoland,2118213,1,4&series=autoland,2118063,1,4&series=autoland,2118263,1,4&series=autoland,2073424,1,4&highlightAlerts=1&zoom=1568947527472,1569007116055,255879474.56320772,444877917.3373801
I'll leave this here if anyone wants to add more platforms:
How to add a new platform - you can click on the "windows7-32" from the left rectangular.
Then on the popup that appears select: awsy - autoland - the platform - in the filter text type "Explicit Memory"
Select the test from "Tests" field.
Then click on Plot Graphs.
In case no tests appears, click on the blue button - Include Subtests.
Comment 6•5 years ago
|
||
(In reply to Marian Raiciof [:marauder] from comment #5)
I added more platforms to the graph and also did backfills for these, because there are missing data points:
(it might take a while to see the data points)
https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=autoland,2118213,1,4&series=autoland,2118063,1,4&series=autoland,2118263,1,4&series=autoland,2073424,1,4&highlightAlerts=1&zoom=1568947527472,1569007116055,255879474.56320772,444877917.3373801
Looks like Linux saw a similar increase and mac and win10 were steady. :-\
Comment 7•5 years ago
|
||
Dão, are the about:memory reports any help here? The regression seems serious enough that we should probably at least figure out what caused it in the short term...
Assignee | ||
Comment 8•5 years ago
|
||
Adding this to the dependency tree so it shows up when we look at megabar-related bugs.
Assignee | ||
Updated•5 years ago
|
Comment 9•5 years ago
|
||
Dao, this is marked as P1 but the bug is unassigned, is that a regression we need to fix in 71? Thanks
Comment 10•5 years ago
|
||
off-hand, diffing the memory reports, the only relevant thing I see is a bump on gfx/heap-textures... maybe I'm misreading the diff, but I don't see much other.
Comment 11•5 years ago
•
|
||
Hey Gijs, this is a unassigned P1 which may be a regression 71 related to megabar work. Can we get an update here on what we should do with this?
Comment 12•5 years ago
|
||
(In reply to Jim Mathies [:jimm] from comment #11)
Hey Gijs, this is a unassigned P1 which may be a regression 71 related to megabar work. Can we get an update here on what we should do with this?
It needs an investigation and fix.
(In reply to Marco Bonardo [::mak] from comment #10)
off-hand, diffing the memory reports, the only relevant thing I see is a bump on gfx/heap-textures... maybe I'm misreading the diff, but I don't see much other.
So to make it easier to see what we're talking about, looking at:
before: https://queue.taskcluster.net/v1/task/R-trSLX6SX-IuqH5IjOyUQ/runs/0/artifacts/public/test_info/memory-report-TabsClosedForceGC-0.json.gz
after: https://queue.taskcluster.net/v1/task/Lx0AY5lITyaeEnJ7HxJ-zA/runs/0/artifacts/public/test_info/memory-report-TabsClosedForceGC-0.json.gz
there's a 10-mb difference in gfx/heap-textures .
I don't know how to dig into that - the memory report doesn't really provide clues. But I think that means we should track this under graphics. Maybe :jbonisteel can help find someone to dig deeper. I could see why the CSS change here (removing overflow: hidden) might have changed how gfx processes some of the layers here but I have no idea why it'd increase memory usage for textures or anything like that.
Comment 13•5 years ago
|
||
Removing overflow hidden effectively removes a clip, which means the painting stack has one less piece of information telling it that drawing can be limited to that clip. So texture sizes could get bigger as a result.
Comment 14•5 years ago
|
||
(In reply to Timothy Nikkel (:tnikkel) from comment #13)
Removing overflow hidden effectively removes a clip, which means the painting stack has one less piece of information telling it that drawing can be limited to that clip. So texture sizes could get bigger as a result.
I'm a bit surprised by this though - the toolbar hasn't changed size; if anything the contents of the previously overflow: hidden
container should now be constrained to the available space rather than potentially extending "underneath" it, as that's how XUL flexbox works... That is, I wouldn't have expected any difference in terms of the constraints on the (contents of the) toolbar.
Anyway, is there a straightforward way to find out what part of the toolbar or its children is responsible for the change, and/or how to "help" graphics code know what's going on?
Updated•5 years ago
|
Assignee | ||
Comment 15•5 years ago
|
||
Thanks mak and Gijs for the investigation so far, I have not much to add. I'm also surprised by the increased texture size as the toolbar code ensures that the toolbar doesn't overflow. It might however overflow initially before we remove elements. Maybe the graphics code never recovers from that? Though I still wouldn't expect that we have enough toolbar items when running talos to justify a 10mb difference.
Assignee | ||
Comment 16•5 years ago
|
||
(In reply to Dão Gottwald [::dao] from comment #15)
Though I still wouldn't expect that we have enough toolbar items when running talos to justify a 10mb difference.
I now realize that the urlbar results "popup" might get added to that texture. This was the case before bug 1580538, but we'd also reset the toolbar to overflow:hidden
every time the results closed. I think we could start doing that again but with clip-patch
instead of overflow:hidden
so that we don't reframe the urlbar.
Timothy, does this make sense?
Assignee | ||
Comment 17•5 years ago
|
||
(In reply to Dão Gottwald [::dao] from comment #16)
I now realize that the urlbar results "popup" might get added to that texture. This was the case before bug 1580538, but we'd also reset the toolbar to
overflow:hidden
every time the results closed. I think we could start doing that again but withclip-patch
instead ofoverflow:hidden
so that we don't reframe the urlbar.
I have a patch doing that. Currently running it through AWSY on Try.
Assignee | ||
Comment 18•5 years ago
|
||
Assignee | ||
Comment 19•5 years ago
|
||
Assignee | ||
Comment 20•5 years ago
|
||
Cancelling needinfos as mattwoodrow is now a reviewer for my patch.
Comment 21•5 years ago
|
||
Comment 22•5 years ago
|
||
Backed out changeset e8905725eb46 (Bug 1584101) for browser-chrome at widget/tests/browser/browser_test_procinfo.js.
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=272495130&repo=autoland&lineNumber=59765
[task 2019-10-23T00:36:05.890Z] 00:36:05 INFO - TEST-INFO | started process screenshot
[task 2019-10-23T00:36:05.970Z] 00:36:05 INFO - TEST-INFO | screenshot: exit 0
[task 2019-10-23T00:36:05.970Z] 00:36:05 INFO - Buffered messages logged at 00:34:35
[task 2019-10-23T00:36:05.970Z] 00:36:05 INFO - Entering test bound test_proc_info
[task 2019-10-23T00:36:05.970Z] 00:36:05 INFO - Buffered messages finished
[task 2019-10-23T00:36:05.970Z] 00:36:05 INFO - TEST-UNEXPECTED-FAIL | widget/tests/browser/browser_test_procinfo.js | Test timed out -
[task 2019-10-23T00:36:05.971Z] 00:36:05 INFO - GECKO(128) | MEMORY STAT | vsize 2104227MB | vsizeMaxContiguous 65978538MB | residentFast 270MB | heapAllocated 96MB
[task 2019-10-23T00:36:05.971Z] 00:36:05 INFO - TEST-OK | widget/tests/browser/browser_test_procinfo.js | took 90300ms
[task 2019-10-23T00:36:05.971Z] 00:36:05 INFO - Not taking screenshot here: see the one that was previously logged
[task 2019-10-23T00:36:05.971Z] 00:36:05 INFO - TEST-UNEXPECTED-FAIL | widget/tests/browser/browser_test_procinfo.js | Found a tab after previous test timed out: http://example.com/browser/widget/tests/browser/dummy.html -
[task 2019-10-23T00:36:05.971Z] 00:36:05 INFO - GECKO(128) | ++DOCSHELL 00000280B8250800 == 1 [pid = 9636] [id = {d662e728-54da-4d59-89ad-f93e18b257ea}]
[task 2019-10-23T00:36:05.971Z] 00:36:05 INFO - GECKO(128) | ++DOMWINDOW == 1 (00000280C0109020) [pid = 9636] [serial = 6] [outer = 0000000000000000]
[task 2019-10-23T00:36:05.971Z] 00:36:05 INFO - GECKO(128) | ++DOMWINDOW == 2 (00000280BE176800) [pid = 9636] [serial = 7] [outer = 00000280C0109020]
[task 2019-10-23T00:36:05.972Z] 00:36:05 INFO - GECKO(128) | [Child 9636, Main Thread] WARNING: NS_ENSURE_TRUE(frame) failed: file z:/build/build/src/layout/base/nsPresContext.cpp, line 821
[task 2019-10-23T00:36:05.991Z] 00:36:05 INFO - checking window state
[task 2019-10-23T00:36:06.009Z] 00:36:06 INFO - GECKO(128) | [Parent 6124, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file z:/build/build/src/dom/base/nsFrameLoader.cpp, line 673
[task 2019-10-23T00:36:06.009Z] 00:36:06 INFO - GECKO(128) | ++DOMWINDOW == 15 (000001790A572800) [pid = 6124] [serial = 23] [outer = 00000179092D02E0]
[task 2019-10-23T00:36:06.034Z] 00:36:06 INFO - GECKO(128) | ++DOMWINDOW == 16 (000001790A574800) [pid = 6124] [serial = 24] [outer = 00000179092D02E0]
[task 2019-10-23T00:36:06.172Z] 00:36:06 INFO - GECKO(128) | --DOCSHELL 0000014779055000 == 0 [pid = 7164] [id = {32689c2a-ef75-4634-8daa-6d346457b8b6}] [url = http://example.com/browser/widget/tests/browser/dummy.html]
[task 2019-10-23T00:36:06.191Z] 00:36:06 INFO - GECKO(128) | --DOCSHELL 000002A3AAD54800 == 0 [pid = 5252] [id = {919b96f9-ab62-4045-90e8-997c73ad1428}] [url = about:blank]
[task 2019-10-23T00:36:06.269Z] 00:36:06 INFO - GECKO(128) | --DOMWINDOW == 1 (0000026E4C10B3E0) [pid = 904] [serial = 4] [outer = 0000000000000000] [url = about:newtab]
[task 2019-10-23T00:36:06.269Z] 00:36:06 INFO - GECKO(128) | --DOCSHELL 0000026E4C1DD800 == 0 [pid = 904] [id = {5ccb4d00-b0f4-4020-bfe3-58fae24c4f29}] [url = about:newtab]
[task 2019-10-23T00:36:06.269Z] 00:36:06 INFO - GECKO(128) | --DOMWINDOW == 0 (0000026E4C957400) [pid = 904] [serial = 6] [outer = 0000000000000000] [url = about:newtab]
[task 2019-10-23T00:36:06.291Z] 00:36:06 INFO - GECKO(128) | --DOMWINDOW == 1 (000002916371BD40) [pid = 6060] [serial = 4] [outer = 0000000000000000] [url = about:newtab]
[task 2019-10-23T00:36:06.291Z] 00:36:06 INFO - GECKO(128) | --DOCSHELL 000002915C856000 == 0 [pid = 6060] [id = {c508ce40-53e2-4bb1-95f0-33472f2abde4}] [url = about:newtab]
[task 2019-10-23T00:36:06.293Z] 00:36:06 INFO - GECKO(128) | --DOMWINDOW == 0 (0000029165C0EC00) [pid = 6060] [serial = 6] [outer = 0000000000000000] [url = about:newtab]
[task 2019-10-23T00:36:07.467Z] 00:36:07 INFO - GECKO(128) | --DOMWINDOW == 1 (000002A3B0D67000) [pid = 5252] [serial = 2] [outer = 0000000000000000] [url = about:blank]
[task 2019-10-23T00:36:07.467Z] 00:36:07 INFO - GECKO(128) | --DOMWINDOW == 0 (000002A3B2D11020) [pid = 5252] [serial = 1] [outer = 0000000000000000] [url = about:blank]
[task 2019-10-23T00:36:07.489Z] 00:36:07 INFO - GECKO(128) | --DOMWINDOW == 1 (000001470671E020) [pid = 7164] [serial = 1] [outer = 0000000000000000] [url = http://example.com/browser/widget/tests/browser/dummy.html]
[task 2019-10-23T00:36:07.489Z] 00:36:07 INFO - GECKO(128) | --DOMWINDOW == 0 (000001470677E000) [pid = 7164] [serial = 3] [outer = 0000000000000000] [url = http://example.com/browser/widget/tests/browser/dummy.html]
[task 2019-10-23T00:36:07.997Z] 00:36:07 INFO - GECKO(128) | --DOMWINDOW == 15 (00000179092CF200) [pid = 6124] [serial = 19] [outer = 0000000000000000] [url = chrome://browser/content/browser.xhtml]
[task 2019-10-23T00:36:07.998Z] 00:36:07 INFO - GECKO(128) | --DOMWINDOW == 14 (000001790A579C00) [pid = 6124] [serial = 20] [outer = 0000000000000000] [url = about:blank]
[task 2019-10-23T00:36:08.302Z] 00:36:08 INFO - GECKO(128) | Completed ShutdownLeaks collections in process 9636
[task 2019-10-23T00:36:08.442Z] 00:36:08 INFO - GECKO(128) | --DOMWINDOW == 13 (00000179110B2C00) [pid = 6124] [serial = 9] [outer = 0000000000000000] [url = about:blank]
Comment 23•5 years ago
|
||
Assignee | ||
Comment 24•5 years ago
|
||
(In reply to Daniel Varga [:dvarga] from comment #22)
Backed out changeset e8905725eb46 (Bug 1584101) for browser-chrome at widget/tests/browser/browser_test_procinfo.js.
[...]
[task 2019-10-23T00:36:05.970Z] 00:36:05 INFO - Entering test bound test_proc_info
[task 2019-10-23T00:36:05.970Z] 00:36:05 INFO - Buffered messages finished
[task 2019-10-23T00:36:05.970Z] 00:36:05 INFO - TEST-UNEXPECTED-FAIL | widget/tests/browser/browser_test_procinfo.js | Test timed out -
Tarek, can you make sense of this failure? I'm guessing it times out at await ChromeUtils.requestProcInfo()
? Why would it do that?
Updated•5 years ago
|
Assignee | ||
Comment 25•5 years ago
|
||
(In reply to Dão Gottwald [::dao] from comment #24)
(In reply to Daniel Varga [:dvarga] from comment #22)
Backed out changeset e8905725eb46 (Bug 1584101) for browser-chrome at widget/tests/browser/browser_test_procinfo.js.
[...]
[task 2019-10-23T00:36:05.970Z] 00:36:05 INFO - Entering test bound test_proc_info
[task 2019-10-23T00:36:05.970Z] 00:36:05 INFO - Buffered messages finished
[task 2019-10-23T00:36:05.970Z] 00:36:05 INFO - TEST-UNEXPECTED-FAIL | widget/tests/browser/browser_test_procinfo.js | Test timed out -Tarek, can you make sense of this failure? I'm guessing it times out at
await ChromeUtils.requestProcInfo()
? Why would it do that?
Ah, this is webrender crashing:
PROCESS-CRASH | Main app process exited normally | application crashed [@ webrender::resource_cache::{{impl}}::drop(struct webrender::resource_cache::ResourceCache *)]
Matt or Emilio, any ideas?
Comment 26•5 years ago
|
||
(In reply to Dão Gottwald [::dao] from comment #25)
Ah, this is webrender crashing:
PROCESS-CRASH | Main app process exited normally | application crashed [@ webrender::resource_cache::{{impl}}::drop(struct webrender::resource_cache::ResourceCache *)]
Matt or Emilio, any ideas?
It looks like it's failing some of the debug assertions in here.
[task 2019-10-23T00:34:26.808Z] 00:34:26 INFO - GECKO(128) | Hit MOZ_CRASH(assertion failed: !self.blob_image_templates.keys().any(&blob_f)) at gfx\wr\webrender\src\resource_cache.rs:1963
This one in particular. So it seems WebRender is leaking some of the images, or leaving some blob image as registered while the renderer goes away. I'm not familiar with the resource cache blob image stuff...
Nical I think is the person that knows this better?
Comment 27•5 years ago
|
||
I am having a hard time reproducing this locally. As Emilio said this looks like an image resources being leaked. I don't think it is in webrender but rather mismanagement of the handles that are held outside of it, so typically a CreateImage or CreateBlobImage message that isn't matched with a DeleteImage.
In some of the logs there are some IPC messages that are sent too late during shutdown and dropped. Maybe a resource on the content process is destroyed too late. Or it could be a handle being leaked at some other place.
It is very very odd that this patch would cause this to fail reliably on try.
Comment 28•5 years ago
|
||
Bug 1593013 was filed to reenable them.
Comment 29•5 years ago
|
||
Updated•5 years ago
|
Assignee | ||
Updated•5 years ago
|
Comment 30•5 years ago
|
||
Comment 31•5 years ago
|
||
bugherder |
Assignee | ||
Updated•5 years ago
|
Comment 32•5 years ago
|
||
bugherder |
Comment 33•5 years ago
|
||
Does this need a Beta uplift request?
Comment 34•5 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #33)
Does this need a Beta uplift request?
I think that'd be good for the frontend patch. It's not clear to me if we'd need the gfx leakcheck patch then, too, and if so, if it's suitable for uplift. Nical?
Comment 35•5 years ago
|
||
The leakcheck patch can be uplifted without risk.
Comment 36•5 years ago
•
|
||
Comment on attachment 9102847 [details]
Bug 1584101 - Clip the navigation toolbar when the address bar can be rendered within the toolbar's bounds. r=Gijs,mattwoodrow
Beta/Release Uplift Approval Request
- User impact if declined: Increased memory usage
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: bug 1590203
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Straightforward patches to change how we clip the navigation toolbar to ensure we don't needlessly waste memory on graphics textures.
- String changes made/needed:
Updated•5 years ago
|
Comment 37•5 years ago
|
||
I noted these as "verified: yes" because https://treeherder.mozilla.org/perf.html#/alerts?id=23711, which is an improvement on the regressed metrics here, has this pushlog attached to it: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=c85ae1289d33e348bdda56d6ae3f61a987f784d7&tochange=fa1fe1088a5e415df857e6b0767442e0e9ceae25 which includes this bug. It looks like the alert is untriaged yet - :alexandrui, looks like you're assigned to that one, so giving you a heads up.
Assignee | ||
Comment 38•5 years ago
|
||
Bug 1590203 would need to be uplifted as well.
Comment 39•5 years ago
|
||
FWIW, I will wait for a decision in https://bugzilla.mozilla.org/show_bug.cgi?id=1590203#c11 before looking at this uplift.
Comment 40•5 years ago
|
||
Comment on attachment 9102847 [details]
Bug 1584101 - Clip the navigation toolbar when the address bar can be rendered within the toolbar's bounds. r=Gijs,mattwoodrow
Based on the discussion in https://bugzilla.mozilla.org/show_bug.cgi?id=1590203#c13 I think we need to hold off uplifting this given all the layout dependencies.
Updated•5 years ago
|
Comment 41•5 years ago
|
||
I got an unexpected explicit memory improvement in bug 1592981 comment 8 that I suspect should be credited to this changeset instead.
3% Explicit Memory windows7-32 opt 295,940,910.84 -> 285,632,278.54
3% Explicit Memory windows7-32-shippable opt 295,216,826.77 -> 285,805,301.59
2% Explicit Memory windows7-32-shippable opt stylo tp6 391,692,290.18 -> 383,544,099.35
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=23710
Assignee | ||
Updated•5 years ago
|
Updated•5 years ago
|
Updated•3 years ago
|
Description
•