Closed Bug 1683841 Opened 3 years ago Closed 3 years ago

3.23 - 24.96% Explicit Memory / Heap Unclassified / Images / JS / Resident Memory (linux1804-64, macosx1014-64-shippable-qr, windows10-64) regression on push ea10c714c606f9209a6b21c53ad46939f066cb71 (Thu December 17 2020)

Categories

(Remote Protocol :: Marionette, defect, P1)

Firefox 86
defect

Tracking

(firefox-esr78 unaffected, firefox84 unaffected, firefox85 unaffected, firefox86 affected)

RESOLVED WONTFIX
Tracking Status
firefox-esr78 --- unaffected
firefox84 --- unaffected
firefox85 --- unaffected
firefox86 --- affected

People

(Reporter: alexandrui, Assigned: whimboo)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Attachments

(1 obsolete file)

Perfherder has detected a awsy performance regression from push ea10c714c606f9209a6b21c53ad46939f066cb71. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Suite Test Platform Options Absolute values (old vs new)
25% Images linux1804-64-shippable-qr 4,556,804.97 -> 5,694,325.92
24% Images windows10-64-shippable-qr 6,150,925.17 -> 7,630,074.22
24% Images macosx1014-64-shippable-qr 5,290,836.76 -> 6,561,703.93
23% Images linux1804-64-shippable 4,512,409.84 -> 5,532,798.34
21% Heap Unclassified linux1804-64-shippable-qr 232,784,876.29 -> 282,282,125.03
21% Images windows10-64-shippable 6,171,707.90 -> 7,445,428.11
19% Explicit Memory linux1804-64-shippable-qr 491,215,659.89 -> 586,963,794.02
18% Resident Memory linux1804-64-shippable-qr 967,844,402.72 -> 1,142,448,644.71
17% Explicit Memory windows10-64-shippable 282,180,207.08 -> 329,561,430.74
17% Explicit Memory windows10-64-shippable-qr 295,683,644.75 -> 345,245,965.59
17% Explicit Memory linux1804-64-shippable 304,715,327.82 -> 355,748,455.71
15% Explicit Memory macosx1014-64-shippable-qr 350,059,387.42 -> 402,637,886.52
14% Resident Memory windows10-64-shippable-qr 478,407,584.93 -> 546,087,761.19
14% Resident Memory windows10-64-shippable-qr 477,504,971.06 -> 544,949,012.49
13% Resident Memory windows10-64-shippable 448,344,560.11 -> 507,302,623.86
13% Heap Unclassified windows10-64-shippable-qr 46,343,022.11 -> 52,335,693.41
13% Resident Memory macosx1014-64-shippable-qr 702,794,694.08 -> 791,511,418.06
12% JS linux1804-64-shippable-qr 80,482,513.69 -> 89,929,274.85
11% JS linux1804-64-shippable 80,569,985.14 -> 89,806,844.84
11% Resident Memory linux1804-64-shippable 555,918,751.78 -> 619,436,135.30
11% Heap Unclassified windows10-64-shippable 40,021,404.52 -> 44,494,636.96
11% JS macosx1014-64-shippable-qr 82,467,524.93 -> 91,684,953.50
11% JS windows10-64-shippable-qr 81,328,019.62 -> 90,398,017.49
11% JS windows10-64-shippable 81,471,212.08 -> 90,317,007.66
6% Heap Unclassified macosx1014-64-shippable-qr 102,187,088.53 -> 108,807,128.68
6% Images linux1804-64-shippable-qr tp6 7,983,977.07 -> 8,425,930.85
5% Images macosx1014-64-shippable-qr tp6 8,776,847.20 -> 9,251,056.85
5% Images linux1804-64-shippable tp6 8,073,095.93 -> 8,478,283.14
3% JS macosx1014-64-shippable-qr tp6 169,704,440.47 -> 175,179,864.48

Improvements:

Ratio Suite Test Platform Options Absolute values (old vs new)
6% Base Content JS macosx1014-64-shippable-qr 2,686,941.33 -> 2,518,160.00
6% Base Content JS linux1804-64-shippable 2,683,950.67 -> 2,515,744.00
6% Base Content JS windows10-64-shippable-qr 2,688,547.33 -> 2,519,948.00
6% Base Content JS windows10-64-shippable 2,688,513.33 -> 2,520,112.00
6% Base Content JS linux1804-64-shippable-qr 2,682,786.00 -> 2,515,292.00
2% Base Content Explicit windows10-64-shippable 8,546,901.33 -> 8,353,792.00

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(hskupin)

Set release status flags based on info from the regressing bug 1669174

Here the results....

When backing out the patch from bug 1669174 there are no improvements at all. Only the improvements for the Base Content JS opt are reverted.

Backing out the changes for the double click tracker show already a huge improvement for images and heap

And the backout of having the MarionetteEventsActor registered all the time also shows improvements.

As such this is a regression from bug 1682757..

Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Flags: needinfo?(hskupin)
Priority: -- → P1
Regressed by: 1682757
No longer regressed by: 1669174

This is quite an interesting behavior. So here a summary of my investigation from yesterday...

So I picked the top-most regression data, which is about Images, and happening across platforms with a ~24% increase:

25% 	Images 		linux1804-64-shippable-qr       4,556,804.97 -> 5,694,325.92
24% 	Images 		windows10-64-shippable-qr       6,150,925.17 -> 7,630,074.22
24% 	Images 		macosx1014-64-shippable-qr      5,290,836.76 -> 6,561,703.93
23% 	Images 		linux1804-64-shippable          4,512,409.84 -> 5,532,798.34

When checking the results directly on https://arewefastyet.com they directly apply to the very first graph. It clearly indicates the increase on Dec 18th. Also other graphs show the regression but interestingly aren't getting reported?

With some help from Andrew I had a look at graph 3 and 4. What's interesting here is that for graph 4 we seem to miss to garbage collect images, or AWSY records the memory usage before the next GC runs. With a forced GC (graph 3) everything is still fine. Diffing two memory reports for After tabs open [+30s] (one from before my landing, and one after) I can see a lot of images from tp5 pages, which were loaded in all the various tabs, that are not marked as unlocked. Means the next GC will not free those up. And as such more memory is used. In the above case we speak about 35MB.

As next step I continued on the results from comment 3, and identified bd98eeb06846 as the causing changeset. Here we are going to add some events (specifically click, dblclick, and unload) that should cause the MarionetteEventsActor to be created. Based on when those fire we want to set some internal state to track mouse clicks/double clicks. When inspecting the AWSY tests I cannot see anything around mouse clicks. What it basically does is opening tabs, and loading various pages from the tp5 pageset. That's all.

But that means that the one and only event that should have an effect here is unload. To check what's happening when we don't specify it when registering the actor but using addEventListener within the actor's child, I submitted a new try build, and interestingly the results are looking promising. All the values seem to go back to their original values:

https://treeherder.mozilla.org/perfherder/compare?originalProject=mozilla-central&newProject=try&newRevision=11e4dd86e8e37941e5f71402a008cf2f471ab424&framework=4&selectedTimeRange=172800

So I assume there is something strange going on for JSWindowActors with a unload event being registered. Note that we also have beforeunload and pagehide registered, which already cause an actor pair to be created. So that the unload event should basically be a no-op. But instead is causes various huge memory regressions.

Nika, Olli, and Kris, does one of you have an idea why this is happening? Could this be a bug in our JSWindowActor implementation?

Flags: needinfo?(nika)
Flags: needinfo?(kmaglione+bmo)
Flags: needinfo?(bugs)

Note that the framescript code before had the same problem. There we actually registered the unload event via addEventListener():
https://searchfox.org/mozilla-central/rev/5a90b263f71b558fd14bade65b0c4ae5e216e646/testing/marionette/listener.js#70

And by enabling the actors by default, and not using the framescript code, we got this massive improvement.

Unload prevents BFcaching doesn't it?

Specifying "unload" as part of registerWindowActor() causes a huge memory
regression. By using addEventListener inside the actors child this can
be avoided.

Yes, bfcache is disabled when there is an unload event listener on the window object.
So https://phabricator.services.mozilla.com/D100385 disables bfcaches.

Flags: needinfo?(bugs)

Ok, so here the full explanation:

Before bug 1669169 landed the unload event listener has been added within the framescript (listener.js) onto the TabChildGlobal:

https://hg.mozilla.org/integration/autoland/file/1363594da38eb65394034fb50448833cb93305ea/testing/marionette/listener.js#l69

By activating the actors by default we now had the following unload registration on the window object, which prevented bfcache:

https://hg.mozilla.org/integration/autoland/file/1363594da38eb65394034fb50448833cb93305ea/testing/marionette/actors/MarionetteCommandsChild.jsm#l70

This was NOT on purpose. Marionette should not disable bfache.

With my recent patch on bug 1682757 I moved out this code from MarionetteCommandsChild, and registered it with registerWindowActor for the MarionetteEvents actor child. It uses the windowroot for the event registration and not the window object.

As such we enabled bfcache again, and in-fact the memory usage increased.

Given that we never wanted to disable bfcache there is no actual regression in memory usage here. Actually, it's back at the level it has been before. That means there is nothing left to do on this bug and we have to accept these values as a new baseline.

Thanks Olli for all the help!

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(nika)
Flags: needinfo?(kmaglione+bmo)
Resolution: --- → WONTFIX
Attachment #9194528 - Attachment is obsolete: true

Is it possible to write a test to make sure that we do not accidentally disable bfcache again?

Yes, this will be done via bug 1673059.

See Also: → 1673059
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: