Bug 1543328 (Closed) - Opened 5 years ago, Closed 5 years ago

1.17 - 1.22% Heap Unclassified (windows10-64, windows10-64-pgo-qr) regression on push 9bad25ccc8d0414e21d3c8d4dcf2178a97bb07c3 (Mon Mar 25 2019)

Categories

Product/Component: Core :: General
Type: defect
Platform: Unspecified / Windows 10
Priority: Not set
Severity: normal

Tracking

Status: VERIFIED WORKSFORME

People

(Reporter: igoldan, Unassigned)

References

(Regression)

Details

(Keywords: perf, regression)

Attachments

(1 file)

We have detected an awsy regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=5dc0652cd024fef30217ac3e6087b7a32f16967f&tochange=9bad25ccc8d0414e21d3c8d4dcf2178a97bb07c3

You are the author of one of the patches included in that push, and we need your help to address this regression.

Regressions:

1% Heap Unclassified windows10-64 pgo stylo 48,906,485.70 -> 49,501,400.29
1% Heap Unclassified windows10-64-pgo-qr opt stylo 57,649,330.94 -> 58,321,834.02

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=20249

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/AWSY/Tests

Product: Testing → Core
Flags: needinfo?(mats)

Mats, I would also like to know which of the three bugs (bug 205202, bug 288704, bug 1518201) is most closely related to these regressions.

I wonder if bug 1539171 would mitigate this regression...

The timestamp for my push is: Sun Mar 24 22:14:10 2019 +0000.
When I look at the two graphs linked from https://treeherder.mozilla.org/perf.html#/alerts?id=20249:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=mozilla-inbound,1645853,1
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=mozilla-inbound,1893290,1
I see a large regression in those graphs early on March 21, which clearly
isn't these patches' fault. If I zoom in on the graphs after that
incident, from March 22 onward:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=mozilla-inbound,1893290,1,4&zoom=1553161466018.3606,1553753142000,50000000,65000000
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=mozilla-inbound,1645853,1,4&zoom=1553162563799.3442,1553753142000,40000000,55000000
then it's hard for me to detect any regression at all.
In the first graph there are a couple of changesets before
mine (529782921812, 8941a9be9141) which have data points
with a large margin of error (way more than 1%).

Would it be possible to fill in some more test runs on
the changesets before mine? The data points are a little
thin there.

Flags: needinfo?(igoldan)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #2)

I wonder if bug 1539171 would mitigate this regression...

We only add a counter-increment for list items though, so I doubt
that would move the needle at all. I don't know what content
"tp5o" loads, but I doubt it contains a huge number of list items.

Maybe we now have some persistent data about the ::marker
pseudo in the style system for every element?

We shouldn't, no... If there are a lot of list items maybe just the counter node allocations add up... tp5o had wikipedia pages IIRC, which are pretty list-heavy.

Well, in a 64-bit Linux Opt build, nsCounterList is 80 bytes
and nsCounterChangeNode is 88. We should have one per document
of the former and one per list item of the latter, plus one
per list. That seems like the wrong order of magnitude to
match the reported regression...

I would be happy to look at the actual numbers if someone can
point me to the pages these tests load.

Attached file tp5n URL manifest

More detailed analysis

We're looking primarily at a 5-6 MB regression in the heap-unclassified "tabs open" measure.

AFAICT it doesn't look like our overall resident memory changed, just that more of it is now unclassified. To get a better idea of what's going on we should at least add memory reporting for it.

re: what pages are in tp5

For reference this attachment is the list of pages we load for AWSY, one of which is in fact wikipedia.

Testing locally

You can run the tests locally with:

./mach awsy-test

You can specify your own manifest with the --page-manifest flag, so if you wanted to run locally and just test a single site, something like this would work:

echo "http://localhost/page_load_test/tp5n/en.wikipedia.org/en.wikipedia.org/wiki/Rorschach_test.html" > test.manifest
./mach awsy-test --page-manifest test.manifest --iterations 1

You can also run with DMD enabled to get an idea of where the unreported memory is:

./mach awsy-test --dmd

Thanks for the info, Eric!

So, some rough estimates from the contents in "tp5n.zip":

401 .html files, with
1 <menu>
172 <ol>
4275 <ul>
20839 <li> (~52 per file on avg!)

which means we have nsCounterList/nsCounterChangeNode for
the 'list-item' counter (worst case):

401 * 80
+ 1 * 88
+ 172 * 88
+ 4275 * 88
+ 20839 * 88
= ~2.26MB (~5.63kB per page on avg)
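
For anyone who wants to double-check that arithmetic, here is a small, self-contained C++ sketch (not part of any patch; it just replays the numbers quoted above):

// Back-of-the-envelope check of the estimate above, using the sizes quoted
// earlier (nsCounterList: 80 bytes, nsCounterChangeNode: 88 bytes on 64-bit
// Linux) and the element counts found in tp5n.zip.
#include <cstdio>

int main() {
  constexpr unsigned long kListSize = 80;  // one nsCounterList per document
  constexpr unsigned long kNodeSize = 88;  // one nsCounterChangeNode per <menu>/<ol>/<ul>/<li> (worst case)
  constexpr unsigned long kDocs = 401;     // .html files in the set
  constexpr unsigned long kNodes = 1 + 172 + 4275 + 20839;

  constexpr unsigned long total = kDocs * kListSize + kNodes * kNodeSize;
  std::printf("total: %lu bytes (~%.2f MB), ~%.2f kB per page on avg\n",
              total, total / 1e6, total / 1e3 / kDocs);
  // Prints: total: 2257336 bytes (~2.26 MB), ~5.63 kB per page on avg
  return 0;
}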

The real memory use is probably a bit higher than that, since those
structs aren't the only memory involved. It's still less
than half of the 5-6 MB heap-unclassified Eric mentioned though.

The CounterManager structs are missing AddSizeOf* methods,
so I assume that's why they end up as "heap unclassified".
That shouldn't be the case for style data though (I hope).

I guess we could try to optimize all the CounterManager stuff
for memory use, but I suspect they are currently optimized
for speed (lookup and insertion), so there's a trade-off there...

We should definitely add AddSizeOf* methods though, so that this
memory is classified in the memory log. I also note that none of these
structs are allocated from the shell arena, which would be nice
to do since we'd get poisoning, faster allocations,
and classification that way.
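
To make the AddSizeOf* idea concrete, here is a minimal, self-contained sketch of the usual reporting pattern: a SizeOf* method that takes a malloc-size-of callback and sums the heap blocks the container owns. The CounterList/CounterNode types below are illustrative stand-ins, not the real nsCounterList/nsCounterNode classes, and the iteration details are assumptions.

// Illustrative sketch only: stand-in types, not the real Gecko classes.
#include <cstddef>

using MallocSizeOf = size_t (*)(const void* aPtr);

struct CounterNode {
  CounterNode* mNext = nullptr;
  // ...real nodes also carry frame pointers, counter values, etc.
};

struct CounterList {
  CounterNode* mFirst = nullptr;

  size_t SizeOfIncludingThis(MallocSizeOf aMallocSizeOf) const {
    size_t n = aMallocSizeOf(this);  // the list object itself
    for (const CounterNode* node = mFirst; node; node = node->mNext) {
      n += aMallocSizeOf(node);      // each heap-allocated node
    }
    return n;
  }
};

With a method like that in place, a memory reporter can attribute these bytes to an explicit path instead of leaving them in heap-unclassified.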

I'm a bit surprised how heavily HTML lists are used in this set.
It's a lot more than I typically see when I casually browse.
Here are the top 10 users of <li>:

# ag -c '\<li[^a-zA-Z]' tp5n/ | sort -n -t : -k 2 -r | head
tp5n/youku.com/www.youku.com/index.html:1837
tp5n/globo.com/www.globo.com/index.html:679
tp5n/sohu.com/www.sohu.com/index.html:607
tp5n/spiegel.de/www.spiegel.de/index.html:527
tp5n/xunlei.com/xunlei.com/index.html:521
tp5n/yelp.com/www.yelp.com/biz/alexanders-steakhouse-cupertino.html:467
tp5n/ifeng.com/ifeng.com/index.html:450
tp5n/repubblica.it/www.repubblica.it/index.html:445
tp5n/rakuten.co.jp/www.rakuten.co.jp/index.html:414
tp5n/huffingtonpost.com/www.huffingtonpost.com/index.html:405

Now that the CSSWG has resolved not to include the automatic
'list-item' increment in computed style
(https://github.com/w3c/csswg-drafts/issues/3769),
we could see if fixing bug 1539171 gives us back some memory.
It definitely should, though I guess not as "heap unclassified".

BTW, the markup in the youku.com page is pretty much all <ul><li>text.
Most of the <li>s seem to have display:inline though, so I suspect my
estimates above for <li> are wrong, since those wouldn't create
counter-increments. Which makes me even more puzzled about where the
"heap unclassified" increase comes from...

I dumped the CounterManager for the youku.com page and it has
607 increment nodes (out of 1837 <li> elements) and 118 reset nodes.
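
As a quick back-of-the-envelope check (using the 88-byte node size quoted earlier), those nodes only account for a few tens of kilobytes on this page, which is nowhere near the reported regression:

// Rough check: counter nodes on the most list-heavy page in the set.
#include <cstdio>

int main() {
  constexpr unsigned long kNodes = 607 + 118;    // increment + reset nodes
  constexpr unsigned long kBytes = kNodes * 88;  // ~88 bytes per node (64-bit)
  std::printf("%lu bytes (~%.1f kB)\n", kBytes, kBytes / 1e3);
  // Prints: 63800 bytes (~63.8 kB)
  return 0;
}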

Hmm, do we atomize the counter name in the style system?
I wonder if we have thousands of "list-item" string copies lying around...
Anyway, bug 1539171 should fix that, so we'll see.

(In reply to Mats Palmgren (:mats) from comment #12)

Hmm, do we atomize the counter name in the style system?
I wonder if we have thousands of "list-item" string copies lying around...
Anyway, bug 1539171 should fix that, so we'll see.

Yeah. You're right that we should just use atoms in the style structs and in the counter manager as well, though. Will file.
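
For context, here is a rough, self-contained sketch of what "using atoms" buys here. It is not Gecko's real nsAtom machinery; it just illustrates the interning idea: every "list-item" reference shares one allocation instead of each style struct or counter node owning its own copy of the string.

// Illustrative interning sketch only; not the real nsAtom implementation.
#include <memory>
#include <string>
#include <unordered_map>

class NameTable {
 public:
  // Returns a shared, deduplicated handle for the given counter name.
  std::shared_ptr<const std::string> Intern(const std::string& aName) {
    auto& slot = mTable[aName];
    if (!slot) {
      slot = std::make_shared<const std::string>(aName);
    }
    return slot;
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<const std::string>> mTable;
};

// Thousands of list items then hold the same pointer, so the bytes for
// "list-item" exist once rather than once per node.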

Depends on: 1539171
Flags: needinfo?(mats)

I'm on Core :: General triage duty this week and would like to move this to a better component, but I'm not sure what that would be. Should we keep this open as a meta of sorts?

Flags: needinfo?(mats)

Did bug 1539171 make any difference?

Flags: needinfo?(overholt)

I'll let Ionuț answer since I don't know.

Flags: needinfo?(overholt)
Flags: needinfo?(mats)

Bug 1539171 should have fixed this. Feel free to reopen if that's not the case.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
Status: RESOLVED → VERIFIED
Flags: needinfo?(igoldan)
Has Regression Range: --- → yes
