1.92 - 2.33% Explicit Memory / Resident Memory (windows10-64-shippable-qr) regression (Mon July 20 2020)
Categories
(Core :: Networking, defect, P3)
Tracking
| | Tracking | Status |
| --- | --- | --- |
| firefox-esr68 | --- | unaffected |
| firefox-esr78 | --- | unaffected |
| firefox78 | --- | unaffected |
| firefox79 | --- | unaffected |
| firefox80 | --- | wontfix |
| firefox81 | --- | wontfix |
People
(Reporter: kimberlythegeek, Unassigned)
References
(Regression)
Details
(Keywords: perf, perf-alert, regression)
Perfherder has detected an awsy performance regression from push 78186a074f2f21eac00c492a34baf659a7ba203c. As author of one of the patches included in that push, we need your help to address this regression.
Regressions:
2% Explicit Memory windows10-64-shippable-qr opt 426,281,137.62 -> 436,221,252.04
2% Resident Memory windows10-64-shippable-qr opt 627,293,504.38 -> 639,362,785.45
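The percentages in the alert can be re-derived from the before/after byte values it reports; a quick sanity check (values copied from the alert above, matching the 1.92 - 2.33% range in the bug title):

```python
def regression_pct(before: float, after: float) -> float:
    """Percentage increase of `after` relative to `before`."""
    return (after - before) / before * 100

# Byte values from the Perfherder alert above.
explicit = regression_pct(426_281_137.62, 436_221_252.04)
resident = regression_pct(627_293_504.38, 639_362_785.45)

print(f"Explicit Memory: {explicit:.2f}%")  # → 2.33%
print(f"Resident Memory: {resident:.2f}%")  # → 1.92%
```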
Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.
For more information on performance sheriffing please see our FAQ.
Reporter
Updated•4 years ago
Comment 1•4 years ago
Set release status flags based on info from the regressing bug 1653485
Comment 2•4 years ago
That's a really odd result for these patches, because they make the editor code stop copying a stack-only class and use a const reference instead, so if anything they should improve performance. I'll check whether the patches are actually the cause of the regression, and if so, I'll back them out.
Immediately before the landing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d4927e4e8c6650d7296468fce7a0e890c12b2ddd
The landing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8bee7d72c53fac0defc71976b0a91df7dadba793
Comment 3•4 years ago
I ran AWSY 10 times on the tryserver for each, but it looks like my landing is not the cause of the regression.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=d4927e4e8c6650d7296468fce7a0e890c12b2ddd&newProject=try&newRevision=8bee7d72c53fac0defc71976b0a91df7dadba793&framework=4
Comment 5•4 years ago
As the test regressed by 2% and this is in the noise area of the test, let's trigger some more retriggers on autoland until we find the culprit.
Also, there are multiple metrics that point to this bug, so I would recommend keeping the regressed-by flag on this bug for the moment, until we find the culprit bug.
Comment 7•4 years ago
It's really hard to see how bug 1653485 could have caused this.
But masayuki, would it be easy to back it out for a day or so to see if something weird is happening?
The patch right before bug 1653485 seems more likely: https://hg.mozilla.org/integration/autoland/rev/cd05ec275e213937a1c9c5d98a9ee60bb84b6ea0
Comment 8•4 years ago
No problem backing them out of 80, although I have no idea how to fix this if it is actually caused by bug 1653485. (Backing out of 81 requires a backout patch due to a dependency.)
Comment 9•4 years ago
(According to the tryserver result (comment 3), I still believe my patches are not the cause of this bug.)
Comment 10•4 years ago
(Kicked off AWSY jobs before actually backing them out.)
Reporter
Comment 11•4 years ago
Hi there, apologies for the delays. I was in a biking accident last week and have been taking a day off here and there to recover.
I am the one investigating this regression (forgive me, I'm new to this process and still learning).
Here are the try pushes I did with the changes and with them backed out:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa853aff6d28cee77c12b215933479221ecaf7e6
after backing out changes:
remote: View your changes here:
remote: https://hg.mozilla.org/try/rev/37b4db7731a8c4afb7cad536cbdca69fde705529
remote: https://hg.mozilla.org/try/rev/b0f48d0a809e1bdbf538773049f45c7fc90fe41d
remote: https://hg.mozilla.org/try/rev/30b99830af2b691a5596048b3d60405a0f06d732
remote: https://hg.mozilla.org/try/rev/70c6c67694ebf9641e0ff096af2e24c2c31c35e8
Comment 12•4 years ago
With the AWSY reports, you should be able to download the memory reports from Perfherder before and after the change, and then do a diff in about:memory to hopefully get some insight about what increased. I'll try to find the time to look at that at some point.
Comment 13•4 years ago
This is the comparison result on autoland between bug 1653485 and bug 1652973, which landed five pushes before bug 1653485. AWSY ran 11 times for each landing, but it looks like the regression disappeared...
Comment 14•4 years ago
Retriggered AWSY for some landings after bug 1653485 too.
Comment 15•4 years ago
Ah, according to the graphs:
- https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=autoland,1959207,1,4&timerange=1209600&zoom=1595206162006,1595210262973,565834779.5111591,680347025.2101923
- https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=autoland,1959207,1,4&timerange=1209600&zoom=1595205969028,1595211273852,572539982.533614,718948944.475693
- https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=autoland,1959217,1,4&timerange=1209600&zoom=1595205818380,1595210894080,380628822.07295245,465048685.0706244
It seems that bug 1596741 might be the regression point.
https://hg.mozilla.org/integration/autoland/rev/cd05ec275e213937a1c9c5d98a9ee60bb84b6ea0
Comment 16•4 years ago
mayhemer:
What do you think about your fix? It added a lot of QI and logging method calls, but I don't know whether those paths are hot.
Comment 17•4 years ago
This is a memory usage regression, and dumping more stuff to the console certainly could end up causing higher memory usage.
Comment 18•4 years ago
(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900) from comment #16)
mayhemer:
What do you think about your fix? It added QI and logging method calls a lot, but I don't know whether the paths are hot paths or not.
I don't believe those are hot paths. We log only when something is wrong with cookies we get from servers. That's not common or expected, and these warnings are there to help diagnose problems, so they're quite useful.
But, looking at the numbers: the first test result shows an increase of 12'069'281.07? What is the unit, actually? I don't know of a machine able to allocate fractions of bytes ;)) so this has to be at least kB. That seems a little excessive for only console messages. Could this be a leak? What exactly does the test do?
Comment 19•4 years ago
I think the units are bytes. I think this particular measure is some kind of average of measurements taken at different points in time, which produces the fraction.
The test starts the browser, opens a bunch of tabs, then closes a bunch of tabs. At each of those states, it takes a measurement, waits 30 seconds and takes a measurement, then forces a GC and takes a measurement. The set of pages being loaded is TP6. I think they are loaded about 8 at a time, and it cycles through them.
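The measurement sequence just described can be sketched roughly as follows (an illustrative outline, not the actual harness; the checkpoint names mirror labels like "TabsOpenSettled" that appear in the reports, and the stubbed measurement function stands in for the harness reading memory from the browser):

```python
import gc
import time

def measure_bytes():
    # Stand-in for the real measurement; the harness collects
    # explicit/resident memory from the running browser.
    return 0

def checkpoint(name, settle_seconds=0):
    """One measurement, optionally after a settling delay."""
    if settle_seconds:
        time.sleep(settle_seconds)
    return (name, measure_bytes())

def run_iteration(settle=0):  # the real test settles for 30 seconds
    results = []
    # ...open a batch of TP6 pages here...
    results.append(checkpoint("TabsOpen"))
    results.append(checkpoint("TabsOpenSettled", settle))
    gc.collect()  # the real test forces a GC in the browser, not here
    results.append(checkpoint("TabsOpenForceGC"))
    # ...close the tabs, then take the same three measurements again...
    return results
```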
I downloaded some memory reports before and after the regression, at the "TabsClosedForceGC" and "TabsOpenSettled" points, because those showed clear increases. There's about a 10MB or 12MB increase of explicit/heap-overhead/bin-unused in the main process, and about 3MB of heap-unclassified in the GPU process.
The large increase in bin-unused probably indicates that a lot of allocation happened while the pages were loading, but they went away at some point. Would you expect there to be a lot of cookie errors with the TP6 page set? I could see there being a memory increase if there's a ton of allocations for logging messages that are quickly thrown out.
That said, it looks like the numbers here got bad around July 20, then stayed that way until July 23, when they dropped quite a bit and became bimodal, with the high value around what the memory usage was before July 20. Honza, maybe you could check whether TP6 generates a huge number of these cookie errors; if not, maybe we can consider this a false positive?
Comment 20•4 years ago
Pages producing a lot of the new cookie rejection messages (using live sites, not the recorded content):
https://www.instagram.com/ - 55
https://outlook.live.com/mail/inbox (not logged in) - 4
https://www.youtube.com/ - 1
60 * 8 = 480
12MB / 480 = ~25kB
Not sure how many times each message is copied/cloned, but even so, ~25kB per message is IMO a bit too much by common sense.
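Redoing that arithmetic as a quick check (the ~12 MB figure is the bin-unused increase from comment 19; the 60 messages × 8 page-set cycles are from the counts above):

```python
messages_per_pass = 60   # new cookie rejection messages per page-set pass
page_set_cycles = 8      # the page set is cycled roughly 8 times
total_messages = messages_per_pass * page_set_cycles

regression_bytes = 12 * 1024 * 1024  # ~12 MB increase in bin-unused
per_message_kb = regression_bytes / total_messages / 1024
print(f"{total_messages} messages, ~{per_message_kb:.1f} kB each")  # → 480 messages, ~25.6 kB each
```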
With MITM's response refreshing, the cookie expiry times are updated on replay. If not, we would produce a lot of "rejected because expired" messages.
How can I run this test locally and get the messages we generate?
Comment 21•4 years ago
Note that the MITM option server_replay_refresh must be true for cookie expiration times to be updated. It defaults to true, so unless it has been turned off, or is buggy, we should be OK.
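For reference, this is roughly how that option is controlled when replaying a recording with mitmproxy's mitmdump (a sketch; `recording.mp` is a placeholder path, and the option name is from mitmproxy's documented options):

```shell
# Replay recorded server responses. server_replay_refresh (default: true)
# rewrites date/expires/last-modified headers, including cookie expiry
# times, so replayed cookies don't all arrive already expired.
mitmdump --server-replay recording.mp --set server_replay_refresh=true
```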
Comment 22•4 years ago
It looks like mach awsy-test will run it locally. It is a standard Marionette test living in testing/awsy/awsy, so maybe you can modify the scripts there to do some additional logging.
Comment 23•4 years ago
It looks like the memory usage went down again, and from comment 20 it sounds like there aren't that many messages, so I'll assume this was just some weird temporary issue. The regression was also only on Windows, and the patch doesn't look Windows-specific.
Comment 24•4 years ago
Regression bugs raised via alerts should be resolved as WORKSFORME. I'm going to mark this as a regression that was accepted (WONTFIX) and later reversed by an unidentified series of commits.