Closed Bug 1654585 Opened 4 years ago Closed 4 years ago

1.92 - 2.33% Explicit Memory / Resident Memory (windows10-64-shippable-qr) regression (Mon July 20 2020)

Categories

(Core :: Networking, defect, P3)

defect

Tracking


RESOLVED WONTFIX
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox78 --- unaffected
firefox79 --- unaffected
firefox80 --- wontfix
firefox81 --- wontfix

People

(Reporter: kimberlythegeek, Unassigned)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Perfherder has detected an awsy performance regression from push 78186a074f2f21eac00c492a34baf659a7ba203c. As the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

2% Explicit Memory windows10-64-shippable-qr opt 426,281,137.62 -> 436,221,252.04
2% Resident Memory windows10-64-shippable-qr opt 627,293,504.38 -> 639,362,785.45

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Version: 79 Branch → unspecified
Blocks: 1650395
Regressed by: 1653485
Component: Performance → DOM: Editor
Product: Testing → Core
Flags: needinfo?(masayuki)

Set release status flags based on info from the regressing bug 1653485

That's a really odd result for these patches, because they make the editor code stop copying a stack-only class and use a const reference instead, so if anything they should improve performance. I'll check whether my patches are actually the cause of the regression, and if so, I'll back them out.

Immediately before the landing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d4927e4e8c6650d7296468fce7a0e890c12b2ddd
The landing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8bee7d72c53fac0defc71976b0a91df7dadba793

Assignee: nobody → masayuki
Status: NEW → ASSIGNED
Flags: needinfo?(masayuki)
Assignee: masayuki → nobody
Status: ASSIGNED → NEW
Component: DOM: Editor → Performance
Flags: needinfo?(ksereduck)
No longer regressed by: 1653485
Summary: 1.92 - 2.33% Explicit Memory / Resident Memory (windows10-64-shippable-qr) regression on push 78186a074f2f21eac00c492a34baf659a7ba203c (Mon July 20 2020) → 1.92 - 2.33% Explicit Memory / Resident Memory (windows10-64-shippable-qr) regression (Mon July 20 2020)

As the test regressed by only 2% and this is within the noise range of the test, let's do some more retriggers on autoland until we find the culprit.

Also, multiple metrics point to this bug, so for the moment I would recommend keeping the regressed-by flag on this bug until we find the culprit bug.

Flags: needinfo?(ksereduck)
Regressed by: 1653485

Checking in on those retriggers.

Flags: needinfo?(fstrugariu)
No longer regressed by: 1653485

It's really hard to see how bug 1653485 could have caused this.
But masayuki, would it be easy to back it out for a day or so to see if there is something weird happening?

The patch right before bug 1653485 could be a more likely cause: https://hg.mozilla.org/integration/autoland/rev/cd05ec275e213937a1c9c5d98a9ee60bb84b6ea0

Severity: -- → S3
Flags: needinfo?(masayuki)
Priority: -- → P3

No problem with backing them out from 80, although I have no idea how to fix this if it is actually caused by bug 1653485. (Backing out from 81 requires a backout patch due to a dependency.)

(According to the result of tryserver (comment 3), I still believe that my patches are not the cause of this bug.)

(Kicked off AWSY jobs before actually backing them out.)

Hi there, apologies for the delays--I was in a biking accident last week and have been taking a day off here and there to recover.

I am the one investigating this regression (forgive me, I am learning, I am new to this process)

Here are the try pushes I did with the changes and with them backed out:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa853aff6d28cee77c12b215933479221ecaf7e6

after backing out changes:

remote: View your changes here:
remote:   https://hg.mozilla.org/try/rev/37b4db7731a8c4afb7cad536cbdca69fde705529
remote:   https://hg.mozilla.org/try/rev/b0f48d0a809e1bdbf538773049f45c7fc90fe41d
remote:   https://hg.mozilla.org/try/rev/30b99830af2b691a5596048b3d60405a0f06d732
remote:   https://hg.mozilla.org/try/rev/70c6c67694ebf9641e0ff096af2e24c2c31c35e8
Flags: needinfo?(fstrugariu)

With the AWSY reports, you should be able to download the memory reports from Perfherder before and after the change, and then do a diff in about:memory to hopefully get some insight about what increased. I'll try to find the time to look at that at some point.
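If it helps anyone doing that comparison, here is a rough sketch of how such a diff could also be scripted outside of about:memory. It assumes the saved memory reports are gzipped JSON with a top-level "reports" array of process/path/amount entries (amounts in bytes); the file names and the 1 MiB print threshold are placeholders.

import gzip
import json
import re
from collections import defaultdict

def load_report(path):
    # Sum amounts per (process, path). Process names are normalized because
    # they usually embed a pid that differs between the two runs (the exact
    # "(pid NNNN)" format is an assumption).
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    totals = defaultdict(int)
    for r in data["reports"]:
        process = re.sub(r"\s*\(pid \d+\)", "", r["process"])
        totals[(process, r["path"])] += r["amount"]
    return totals

def diff_reports(before_path, after_path, threshold=1024 * 1024):
    before = load_report(before_path)
    after = load_report(after_path)
    for key in sorted(set(before) | set(after)):
        delta = after.get(key, 0) - before.get(key, 0)
        if abs(delta) >= threshold:  # only print changes of at least 1 MiB
            process, mem_path = key
            print(f"{delta / (1024 * 1024):+9.2f} MiB  {process}  {mem_path}")

if __name__ == "__main__":
    diff_reports("memory-report-before.json.gz", "memory-report-after.json.gz")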

mayhemer:

What do you think about your fix? It added a lot of QI and logging method calls, but I don't know whether those paths are hot or not.

Flags: needinfo?(honzab.moz)

This is a memory usage regression, so dumping more stuff to the console certainly could end up causing higher memory usage.

Component: Performance → Networking
Regressed by: 1596741
Has Regression Range: --- → yes

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900) from comment #16)

> mayhemer:
>
> What do you think about your fix? It added a lot of QI and logging method calls, but I don't know whether those paths are hot or not.

I don't believe those are hot paths. We log only when something is wrong with the cookies we get from servers. That's not common or expected, and these warnings are there to help diagnose such problems, so they are quite useful.

But, looking at the numbers - the Resident Memory result shows an increase of 12'069'281.07? What is the unit, actually? I don't know of a machine able to allocate fractions of bytes ;)) so this has to be at least kB. That seems a little excessive for only console messages. Could this be a leak? What exactly does the test do?

Flags: needinfo?(honzab.moz) → needinfo?(masayuki)

I think the units are bytes. I think this particular measure is some kind of average of measurements taken at different points in time, which produces the fraction.

The test starts the browser, opens a bunch of tabs, then closes a bunch of tabs. At each of those states, it takes a measurement, waits 30 seconds and takes a measurement, then forces a GC and takes a measurement. The set of pages being loaded is TP6. I think they are loaded about 8 at a time, and it cycles through them.
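To make that sequence concrete, here is a rough sketch (in Python, with placeholder functions) of the checkpoint schedule described above. This is not the real harness in testing/awsy/awsy; the batch size, settle time, and the checkpoint labels other than TabsOpenSettled and TabsClosedForceGC are guesses based on this comment.

import itertools
import time

TABS_PER_CYCLE = 8      # pages appear to be loaded roughly 8 at a time
SETTLE_SECONDS = 30

def open_pages(batch): ...        # placeholder: load a batch of TP6 pages in tabs
def close_all_tabs(): ...         # placeholder
def force_gc(): ...               # placeholder
def take_measurement(label): ...  # placeholder: record memory at this checkpoint

def run_schedule(tp6_pages):
    # Open the page set in batches, then measure: immediately, after settling,
    # and after a forced GC; repeat the same three measurements after closing.
    for batch in itertools.zip_longest(*[iter(tp6_pages)] * TABS_PER_CYCLE):
        open_pages([p for p in batch if p is not None])
    take_measurement("TabsOpen")
    time.sleep(SETTLE_SECONDS)
    take_measurement("TabsOpenSettled")
    force_gc()
    take_measurement("TabsOpenForceGC")

    close_all_tabs()
    take_measurement("TabsClosed")
    time.sleep(SETTLE_SECONDS)
    take_measurement("TabsClosedSettled")
    force_gc()
    take_measurement("TabsClosedForceGC")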

I downloaded some memory reports before and after the regression, at the "TabsClosedForceGC" and "TabsOpenSettled" points, because those showed clear increases. There's about a 10MB or 12MB increase of explicit/heap-overhead/bin-unused in the main process, and about 3MB of heap-unclassified in the GPU process.

The large increase in bin-unused probably indicates that a lot of allocations happened while the pages were loading, but they went away at some point. Would you expect there to be a lot of cookie errors with the TP6 page set? I could see there being a memory increase if there's a ton of allocations for logging messages that are quickly thrown out.

That said, it looks like the numbers here got bad around July 20, then stayed that way until July 23, when they dropped quite a bit and became bimodal, with the high value about what the memory usage was before July 20. Honza, maybe you could check whether TP6 generates a huge number of these cookie errors, and if not, maybe we can consider this a false positive?

Flags: needinfo?(masayuki) → needinfo?(honzab.moz)

Pages producing a lot of the new cookie rejection messages (using live sites, not the recorded content):
https://www.instagram.com/ - 55
https://outlook.live.com/mail/inbox (not logged in) - 4
https://www.youtube.com/ - 1

60 * 8 = 480 messages
12MB / 480 = ~25kB per message

Not sure how many times each message is copied/cloned, but still ~25kB per message is IMO a bit too much by common sense.

Provided that MITM's response refreshing is working, the cookie expiry times are updated. If not, we would produce a lot of "rejected because expired" messages.

How can I run this test locally and get the messages we generate?

Flags: needinfo?(honzab.moz) → needinfo?(continuation)

Note that the MITM option server_replay_refresh must be true for cookie expiration times to be updated. It defaults to true, so unless it has been turned off, or is buggy, we should be OK.

It looks like mach awsy-test will run it locally. It is a standard Marionette test living in testing/awsy/awsy, so maybe you can modify the scripts there to do some additional logging (see the sketch below).
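For what it's worth, here is a rough sketch of driving that locally while capturing cookie log output. Whether MOZ_LOG's cookie module covers these particular rejection messages is an assumption, and mach awsy-test may need extra arguments depending on the setup.

import os
import subprocess

env = dict(os.environ)
env["MOZ_LOG"] = "cookie:4"                   # verbose cookie logging (assumed to cover the rejection warnings)
env["MOZ_LOG_FILE"] = "/tmp/awsy-cookie.log"  # log files get a per-process suffix

# Run from the mozilla-central checkout root; extra awsy-test options may be needed.
subprocess.run(["./mach", "awsy-test"], env=env, check=True)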

Flags: needinfo?(continuation)

It looks like the memory usage went down again, and from comment 20 it sounds like there aren't that many messages, so I'll assume this was just some weird temporary issue. The regression was also only on Windows, and the patch doesn't look Windows-specific.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME

Regression bugs raised via alerts should not be resolved as WORKSFORME. I'm going to mark this as a regression that was accepted (WONTFIX), which was later reversed by an unidentified series of commits.

Resolution: WORKSFORME → WONTFIX