Closed Bug 1654585 Opened 4 years ago Closed 4 years ago

1.92 - 2.33% Explicit Memory / Resident Memory (windows10-64-shippable-qr) regression (Mon July 20 2020)

Categories

(Core :: Networking, defect, P3)

defect

Tracking


RESOLVED WONTFIX
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox78 --- unaffected
firefox79 --- unaffected
firefox80 --- wontfix
firefox81 --- wontfix

People

(Reporter: kimberlythegeek, Unassigned)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Perfherder has detected an awsy performance regression from push 78186a074f2f21eac00c492a34baf659a7ba203c. As the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

2% Explicit Memory windows10-64-shippable-qr opt 426,281,137.62 -> 436,221,252.04
2% Resident Memory windows10-64-shippable-qr opt 627,293,504.38 -> 639,362,785.45

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Version: 79 Branch → unspecified
Blocks: 1650395
Regressed by: 1653485
Component: Performance → DOM: Editor
Product: Testing → Core
Flags: needinfo?(masayuki)

Set release status flags based on info from the regressing bug 1653485

That's a really odd result for these patches, because they make the editor code stop copying a stack-only class and use a const reference instead, so if anything they should improve performance. I'll check whether my patches are actually the cause of the regression, and if so, I'll back them out.

Immediately before the landing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d4927e4e8c6650d7296468fce7a0e890c12b2ddd
The landing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8bee7d72c53fac0defc71976b0a91df7dadba793

Assignee: nobody → masayuki
Status: NEW → ASSIGNED
Flags: needinfo?(masayuki)
Assignee: masayuki → nobody
Status: ASSIGNED → NEW
Component: DOM: Editor → Performance
Flags: needinfo?(ksereduck)
No longer regressed by: 1653485
Summary: 1.92 - 2.33% Explicit Memory / Resident Memory (windows10-64-shippable-qr) regression on push 78186a074f2f21eac00c492a34baf659a7ba203c (Mon July 20 2020) → 1.92 - 2.33% Explicit Memory / Resident Memory (windows10-64-shippable-qr) regression (Mon July 20 2020)

As the test regressed by only 2% and this is within the noise range of the test, let's do some more retriggers on autoland until we find the culprit.

Also, multiple metrics point to this bug, so for the moment I would recommend keeping the regressed-by flag on this bug until we find the culprit bug.

Flags: needinfo?(ksereduck)
Regressed by: 1653485

Checking in on those retriggers.

Flags: needinfo?(fstrugariu)
No longer regressed by: 1653485

It's really hard to see how bug 1653485 could have caused this.
But masayuki, would it be easy to back it out for a day or so to see if there is something weird happening?

The patch right before bug 1653485 could be a more likely cause: https://hg.mozilla.org/integration/autoland/rev/cd05ec275e213937a1c9c5d98a9ee60bb84b6ea0

Severity: -- → S3
Flags: needinfo?(masayuki)
Priority: -- → P3

No problem with backing them out from 80, although I have no idea how to fix this if it is actually caused by bug 1653485. (Backing out from 81 requires a backout patch due to a dependency.)

(According to the result of tryserver (comment 3), I still believe that my patches are not the cause of this bug.)

(Kicked off AWSY jobs before actually backing them out.)

Hi there, apologies for the delays--I was in a biking accident last week and have been taking a day off here and there to recover.

I am the one investigating this regression (forgive me, I am learning, I am new to this process)

Here are the try pushes I did with the changes and with them backed out:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa853aff6d28cee77c12b215933479221ecaf7e6

after backing out changes:

remote: View your changes here:
remote:   https://hg.mozilla.org/try/rev/37b4db7731a8c4afb7cad536cbdca69fde705529
remote:   https://hg.mozilla.org/try/rev/b0f48d0a809e1bdbf538773049f45c7fc90fe41d
remote:   https://hg.mozilla.org/try/rev/30b99830af2b691a5596048b3d60405a0f06d732
remote:   https://hg.mozilla.org/try/rev/70c6c67694ebf9641e0ff096af2e24c2c31c35e8
Flags: needinfo?(fstrugariu)

With the AWSY reports, you should be able to download the memory reports from Perfherder before and after the change, and then do a diff in about:memory to hopefully get some insight about what increased. I'll try to find the time to look at that at some point.
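If it helps anyone doing that comparison, here is a rough sketch of how such a diff could also be scripted outside of about:memory. It assumes the saved memory reports are gzipped JSON with a top-level "reports" array of process/path/amount entries (amounts in bytes); the file names and the 1 MiB print threshold are placeholders.

import gzip
import json
import re
from collections import defaultdict

def load_report(path):
    # Sum amounts per (process, path). Process names are normalized because
    # they usually embed a pid that differs between the two runs (the exact
    # "(pid NNNN)" format is an assumption).
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    totals = defaultdict(int)
    for r in data["reports"]:
        process = re.sub(r"\s*\(pid \d+\)", "", r["process"])
        totals[(process, r["path"])] += r["amount"]
    return totals

def diff_reports(before_path, after_path, threshold=1024 * 1024):
    before = load_report(before_path)
    after = load_report(after_path)
    for key in sorted(set(before) | set(after)):
        delta = after.get(key, 0) - before.get(key, 0)
        if abs(delta) >= threshold:  # only print changes of at least 1 MiB
            process, mem_path = key
            print(f"{delta / (1024 * 1024):+9.2f} MiB  {process}  {mem_path}")

if __name__ == "__main__":
    diff_reports("memory-report-before.json.gz", "memory-report-after.json.gz")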

mayhemer:

What do you think about your fix? It added a lot of QI and logging method calls, but I don't know whether those paths are hot or not.

Flags: needinfo?(honzab.moz)

This is a memory usage regression, so dumping more stuff to the console certainly could end up causing higher memory usage.

Component: Performance → Networking
Regressed by: 1596741
Has Regression Range: --- → yes

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900) from comment #16)

> mayhemer:
>
> What do you think about your fix? It added a lot of QI and logging method calls, but I don't know whether those paths are hot or not.

I don't believe those are hot paths. We log only when something is wrong with the cookies we get from servers. That's not common or expected, and these warnings are there to help diagnose such problems, so they are quite useful.

But, looking at the numbers - the Resident Memory result shows an increase of 12'069'281.07? What is the unit, actually? I don't know of a machine able to allocate fractions of bytes ;)) so this has to be at least kB. That seems a little excessive for only console messages. Could this be a leak? What exactly does the test do?

Flags: needinfo?(honzab.moz) → needinfo?(masayuki)

I think the units are bytes. I think this particular measure is some kind of average of measurements taken at different points in time, which produces the fraction.

The test starts the browser, opens a bunch of tabs, then closes a bunch of tabs. At each of those states, it takes a measurement, waits 30 seconds and takes a measurement, then forces a GC and takes a measurement. The set of pages being loaded is TP6. I think they are loaded about 8 at a time, and it cycles through them.
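To make that sequence concrete, here is a rough sketch (in Python, with placeholder functions) of the checkpoint schedule described above. This is not the real harness in testing/awsy/awsy; the batch size, settle time, and the checkpoint labels other than TabsOpenSettled and TabsClosedForceGC are guesses based on this comment.

import itertools
import time

TABS_PER_CYCLE = 8      # pages appear to be loaded roughly 8 at a time
SETTLE_SECONDS = 30

def open_pages(batch): ...        # placeholder: load a batch of TP6 pages in tabs
def close_all_tabs(): ...         # placeholder
def force_gc(): ...               # placeholder
def take_measurement(label): ...  # placeholder: record memory at this checkpoint

def run_schedule(tp6_pages):
    # Open the page set in batches, then measure: immediately, after settling,
    # and after a forced GC; repeat the same three measurements after closing.
    for batch in itertools.zip_longest(*[iter(tp6_pages)] * TABS_PER_CYCLE):
        open_pages([p for p in batch if p is not None])
    take_measurement("TabsOpen")
    time.sleep(SETTLE_SECONDS)
    take_measurement("TabsOpenSettled")
    force_gc()
    take_measurement("TabsOpenForceGC")

    close_all_tabs()
    take_measurement("TabsClosed")
    time.sleep(SETTLE_SECONDS)
    take_measurement("TabsClosedSettled")
    force_gc()
    take_measurement("TabsClosedForceGC")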

I downloaded some memory reports before and after the regression, at the "TabsClosedForceGC" and "TabsOpenSettled" points, because those showed clear increases. There's about a 10MB or 12MB increase of explicit/heap-overhead/bin-unused in the main process, and about 3MB of heap-unclassified in the GPU process.

The large increase in bin-unused probably indicates that a lot of allocations happened while the pages were loading, but they went away at some point. Would you expect there to be a lot of cookie errors with the TP6 page set? I could see there being a memory increase if there's a ton of allocations for logging messages that are quickly thrown out.

That said, it looks like the numbers here got bad around July 20, then stayed that way until July 23, when they dropped quite a bit and became bimodal, with the high value about what the memory usage was before July 20. Honza, maybe you could check whether TP6 generates a huge number of these cookie errors, and if not, maybe we can consider this a false positive?

Flags: needinfo?(masayuki) → needinfo?(honzab.moz)

Pages producing a lot of the new cookie rejection messages (using live sites, not the recorded content):
https://www.instagram.com/ - 55
https://outlook.live.com/mail/inbox (not logged in) - 4
https://www.youtube.com/ - 1

60 * 8 = 480 messages
12MB / 480 = ~25kB per message

Not sure how many times each message is copied/cloned, but still ~25kB per message is IMO a bit too much by common sense.

Provided that MITM's response refreshing is working, the cookie expiry times are updated. If not, we would produce a lot of "rejected because expired" messages.

How can I run this test locally and get the messages we generate?

Flags: needinfo?(honzab.moz) → needinfo?(continuation)

Note that the MITM option server_replay_refresh must be true for cookie expiration times to be updated. It defaults to true, so unless it has been turned off, or is buggy, we should be OK.

It looks like mach awsy-test will run it locally. It is a standard Marionette test living in testing/awsy/awsy, so maybe you can modify the scripts there to do some additional logging (see the sketch below).
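For what it's worth, here is a rough sketch of driving that locally while capturing cookie log output. Whether MOZ_LOG's cookie module covers these particular rejection messages is an assumption, and mach awsy-test may need extra arguments depending on the setup.

import os
import subprocess

env = dict(os.environ)
env["MOZ_LOG"] = "cookie:4"                   # verbose cookie logging (assumed to cover the rejection warnings)
env["MOZ_LOG_FILE"] = "/tmp/awsy-cookie.log"  # log files get a per-process suffix

# Run from the mozilla-central checkout root; extra awsy-test options may be needed.
subprocess.run(["./mach", "awsy-test"], env=env, check=True)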

Flags: needinfo?(continuation)

It looks like the memory usage went down again, and from comment 20 it sounds like there aren't that many messages, so I'll assume this was just some weird temporary issue. The regression was also only on Windows, and the patch doesn't look Windows-specific.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME

Regression bugs raised via alerts should not be resolved as WORKSFORME. I'm going to mark this as a regression that was accepted (WONTFIX), which was later reversed by an unidentified series of commits.

Resolution: WORKSFORME → WONTFIX