70MB AWSY regression after landing of bug 1079627 on 1/15

RESOLVED FIXED

Status

()

RESOLVED FIXED
4 years ago
2 years ago

People

(Reporter: erahm, Unassigned)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [MemShrink:P1][awsy][gfx-noted])

Attachments

(2 attachments)

(Reporter)

Description

4 years ago
There's about a 70-80MB regression in "Explicit: After TP5, tabs closed" in the 1/15 - 1/16 time frame.

The pushlog [1] points to bug 1079627. Note there was a huge regression from cache2 at the same time, but that got backed out.

[1] https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=58f7a19c5572&tochange=37df11b1f5ca
(Reporter)

Comment 1

4 years ago
Created attachment 8550487 [details]
memory report before

This is before the regression(s).
(Reporter)

Comment 2

4 years ago
Created attachment 8550488 [details]
memory report after

This is after the regression.
(Reporter)

Comment 3

4 years ago
A snippet of the diff, it's pretty clear this is all image data:

70.35 MB (100.0%) -- explicit
├──69.61 MB (98.95%) -- images
│  ├──69.48 MB (98.77%) -- content/raster
│  │  ├──69.46 MB (98.73%) -- used
│  │  │  ├──31.93 MB (45.39%) ++ (577 tiny)
│  │  │  ├──11.74 MB (16.69%) -- image(http://localhost:8089/tp5/repubblica.it/www.repubblica.it/static/images/homepage/2010/sprite-repubblica.png)
│  │  │  │  ├──11.74 MB (16.69%) ── decoded-heap [+]
│  │  │  │  └───0.00 MB (00.00%) ── source
│  │  │  ├───3.66 MB (05.21%) -- image(http://localhost:8089/tp5/repubblica.it/www.repubblica.it/static/images/pubblico/sprite_poteri.png)
│  │  │  │   ├──3.66 MB (05.21%) ── decoded-heap [+]
│  │  │  │   └──0.00 MB (00.00%) ── source
Something very odd is going on here; why is source 0 MB for these images?! I need to investigate that.

But beyond that question: before, we have an estimated 8,061,744 bytes in the SurfaceCache, with 808,596 bytes locked. After, we have an estimated 77,277,380 bytes, with 806,684 bytes locked. So we're retaining about 70 MB more of decoded image surfaces, but they will actually expire in less than 60 seconds.

That raises the question: is this just a timing thing? Bug 1079627 made us lock images while they are being decoded, which means that if they get redecoded while unlocked, that will reset the timer that will discard their surfaces. It also might mean that, even if they started to be decoded while locked, if they became unlocked before decoding finished, the new policy might again mean that the timer that will discard them is started later.

This is pretty similar to the other AWSY regression, bug 1120607, which I also am pretty sure is timing related. Before doing anything else, I think we need to see what effect drastically reducing "image.mem.surfacecache.min_expiration_ms" has on these numbers.

I have to step away from the computer for a while, but tonight I'll try to find time to push a try job to explore that question.
Whiteboard: [MemShrink][awsy] → [MemShrink][awsy][gfx-noted]
I'm still investigating here. For future reference, the jump here was from ~415MB to ~485MB.
Whiteboard: [MemShrink][awsy][gfx-noted] → [MemShrink:P1][awsy][gfx-noted]
BTW, I just noticed I never posted the AWSY series I'm using to investigate this. It's here:

https://areweslimyet.com/?series=bug1122704
Depends on: 1125462
Depends on: 1125490
The original series became hard to interpret because it got so full of various experiments, so I created a new series for only the patches that seemed most promising. You can find it here:

http://arcus.corp.mtv2.mozilla.com:8000/?series=bug1122704_proposed_fixes

I've compiled the results below. (I wish AWSY could automatically spit out a report like this!)

936dcf9c0a9f - 473.3 MiB
  Baseline

6917e1d72ff7 - 469.57MiB
  (Bug 1125490) Make sure we request discarding for images in PresShell::Destroy

3849c7dad361 - 431.53MiB
  (Bug 1125491) Only lock the image during the initial decode

47e737e194fc - 390.46MiB
  (Bug 1125491) Only lock the image during the initial decode.
  (Bug 1125490) Make sure we request discarding for images in PresShell::Destroy.

393c89af6531 - 413.19MiB
  (Bug 1125491) Only lock the image during the initial decode.
  (Bug 1125462) Try to cancel decoders when an image's lock count drops to zero.

So, looking over these numbers, here are my thoughts:

Most of the original regression was due to one change in bug 1079627: we started
locking images while we were decoding them. This makes a lot more sense to me
than the cancelling decoders explanation, which doesn't really explain anything
when examined closely, because cancelling decoders did *not* cause the surfaces
they were decoding into to be immediately removed from the SurfaceCache. We also
preallocate the first surface when we create the decoder, so there's absolutely
no possibility that we could've cancelled the decoder before the first surface
(generally the only surface) got allocated.

That doesn't explain the remaining component of the original regression; there's
still about a 30MB regression even once we apply bug 1125491. I really don't
know where that difference comes from; for now I'm going to have to chalk it up
to "timing differences". A lot of this stuff could go wrong if the ordering
between different events changes, and there's no good way to figure out that's
happening besides adding a bunch of logging and trying different variations,
which is unfortunately not easy on AWSY right now.

While investigating this regression I noticed an issue in the order that we did
various image-related cleanup in PresShell::Destroy. This is an issue that
existed before bug 1079627; it's not obvious how the changes in bug 1079627
would have made it worse, though I can't rule out that there has been some sort
of negative interaction, which might explain why this hasn't been noticed
before. At any rate, the details are explained in bug 1125490. By itself, this
patch makes only a small improvement, but once combined with bug 1125491, we get
an improvement of 83MB - larger than the original regression! It's likely, then,
that this patch would have yielded an improvement even before bug 1079627
landed.

Previous to the discovery of the issue in PresShell::Destroy, the best candidate
for a fix here was bug 1125462, which both cancels decoders when an image's lock
count drops to zero and (importantly) immediately frees their surfaces, which
the code before bug 1079627 did *not* do. This change mostly fixes the
regression (a 60MB improvement), but I really did not want to apply it as an
image's lock count could drop to zero just because it scrolled a little bit off
the screen, and I don't think that we want to cancel decoders in such a
situation. I'm therefore very glad that the combo of bug 1125491 and bug 1125490
does better than this alternative solution. I *do* think there's merit to
cancelling decoders when a tab closes, in the interest of not wasting the user's
CPU time or battery life, and I'll be reworking bug 1125462 with that in mind.

TLDR: Bug 1125491 and bug 1125490 not only fix the regression but actually
appear to improve on the situation before the regression. Huzzah!
Thank you for the detailed analysis, Seth. It seems like things are on finally back on track :) Hopefully the investigation triggered by AWSY's numbers has been worthwhile.
After some additional discussion today Timothy and I were interested to see what the result would be of never locking during decoding at all. That turns out to yield 381.91 MiB, which gives almost another 10MB improvement over the best result achieved so far. The bad news is that I'm not super keen on making that change right now, because it's not safe for animated images (and unfortunately we don't know that an image is animated until we decode it!).

It might be interesting to look at how much benefit we'd get from not locking during decoding at all for image types that cannot be animated. (That's JPEGs and BMPs, basically.) *That* is a change we could make right away, and if we still get some decent savings, it might be worth considering.
No longer depends on: 1125462
Can we consider this resolved? Bug 1125490 has landed now.
Flags: needinfo?(erahm)
(Reporter)

Comment 11

4 years ago
I'm okay with resolving this regression.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Flags: needinfo?(erahm)
You need to log in before you can comment on or make changes to this bug.