There's about a 70-80 MB regression in "Explicit: After TP5, tabs closed" in the 1/15 - 1/16 time frame. The pushlog points to bug 1079627. Note there was a huge regression from cache2 at the same time, but that got backed out. https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=58f7a19c5572&tochange=37df11b1f5ca
This is before the regression(s).
This is after the regression.
A snippet of the diff; it's pretty clear this is all image data:

70.35 MB (100.0%) -- explicit
├──69.61 MB (98.95%) -- images
│  ├──69.48 MB (98.77%) -- content/raster
│  │  ├──69.46 MB (98.73%) -- used
│  │  │  ├──31.93 MB (45.39%) ++ (577 tiny)
│  │  │  ├──11.74 MB (16.69%) -- image(http://localhost:8089/tp5/repubblica.it/www.repubblica.it/static/images/homepage/2010/sprite-repubblica.png)
│  │  │  │  ├──11.74 MB (16.69%) ── decoded-heap [+]
│  │  │  │  └───0.00 MB (00.00%) ── source
│  │  │  ├───3.66 MB (05.21%) -- image(http://localhost:8089/tp5/repubblica.it/www.repubblica.it/static/images/pubblico/sprite_poteri.png)
│  │  │  │  ├──3.66 MB (05.21%) ── decoded-heap [+]
│  │  │  │  └──0.00 MB (00.00%) ── source
Something very odd is going on here; why is source 0 MB for these images?! I need to investigate that.

But beyond that question: before the regression, we have an estimated 8,061,744 bytes in the SurfaceCache, with 808,596 bytes locked. After, we have an estimated 77,277,380 bytes, with 806,684 bytes locked. So we're retaining about 70 MB more of decoded image surfaces, but they will actually expire in less than 60 seconds. That raises the question: is this just a timing issue? Bug 1079627 made us lock images while they are being decoded, which means that if they get redecoded while unlocked, the timer that discards their surfaces gets reset. It also means that even if an image started decoding while locked but became unlocked before decoding finished, the new policy could delay the start of that discard timer. This is very similar to the other AWSY regression, bug 1120607, which I'm also fairly sure is timing-related.

Before doing anything else, I think we need to see what effect drastically reducing "image.mem.surfacecache.min_expiration_ms" has on these numbers. I have to step away from the computer for a while, but tonight I'll try to find time to push a try job to explore that question.
Whiteboard: [MemShrink][awsy] → [MemShrink][awsy][gfx-noted]
I'm still investigating here. For future reference, the jump here was from ~415MB to ~485MB.
Whiteboard: [MemShrink][awsy][gfx-noted] → [MemShrink:P1][awsy][gfx-noted]
BTW, I just noticed I never posted the AWSY series I'm using to investigate this. It's here: https://areweslimyet.com/?series=bug1122704
The original series became hard to interpret because it got so full of various experiments, so I created a new series containing only the patches that seemed most promising. You can find it here: http://arcus.corp.mtv2.mozilla.com:8000/?series=bug1122704_proposed_fixes

I've compiled the results below. (I wish AWSY could automatically spit out a report like this!)

- 936dcf9c0a9f - 473.30 MiB - Baseline
- 6917e1d72ff7 - 469.57 MiB - (Bug 1125490) Make sure we request discarding for images in PresShell::Destroy
- 3849c7dad361 - 431.53 MiB - (Bug 1125491) Only lock the image during the initial decode
- 47e737e194fc - 390.46 MiB - (Bug 1125491) Only lock the image during the initial decode. (Bug 1125490) Make sure we request discarding for images in PresShell::Destroy.
- 393c89af6531 - 413.19 MiB - (Bug 1125491) Only lock the image during the initial decode. (Bug 1125462) Try to cancel decoders when an image's lock count drops to zero.

Looking over these numbers, here are my thoughts:

Most of the original regression was due to one change in bug 1079627: we started locking images while we were decoding them. This makes a lot more sense to me than the cancelling-decoders explanation, which doesn't really explain anything when examined closely, because cancelling decoders did *not* cause the surfaces they were decoding into to be immediately removed from the SurfaceCache. We also preallocate the first surface when we create the decoder, so there's absolutely no possibility that we could've cancelled the decoder before the first surface (generally the only surface) got allocated.

That doesn't explain the remaining component of the original regression; there's still about a 30MB regression even once we apply bug 1125491. I really don't know where that difference comes from; for now I'm going to have to chalk it up to "timing differences".
A lot of this stuff could go wrong if the ordering between different events changes, and there's no good way to figure out whether that's happening besides adding a bunch of logging and trying different variations, which is unfortunately not easy on AWSY right now.

While investigating this regression I noticed an issue in the order in which we did various image-related cleanup in PresShell::Destroy. This issue existed before bug 1079627; it's not obvious how the changes in bug 1079627 would have made it worse, though I can't rule out some sort of negative interaction, which might explain why this wasn't noticed earlier. At any rate, the details are explained in bug 1125490. By itself, this patch makes only a small improvement, but combined with bug 1125491, we get an improvement of 83MB - larger than the original regression! It's likely, then, that this patch would have yielded an improvement even before bug 1079627 landed.

Prior to the discovery of the issue in PresShell::Destroy, the best candidate for a fix here was bug 1125462, which both cancels decoders when an image's lock count drops to zero and (importantly) immediately frees their surfaces, which the code before bug 1079627 did *not* do. This change mostly fixes the regression (a 60MB improvement), but I really did not want to apply it, as an image's lock count could drop to zero just because it scrolled a little bit off the screen, and I don't think we want to cancel decoders in that situation. I'm therefore very glad that the combination of bug 1125491 and bug 1125490 does better than this alternative solution. I *do* think there's merit to cancelling decoders when a tab closes, in the interest of not wasting the user's CPU time or battery life, and I'll be reworking bug 1125462 with that in mind.

TL;DR: Bug 1125491 and bug 1125490 not only fix the regression but actually appear to improve on the situation before the regression. Huzzah!
Thank you for the detailed analysis, Seth. It seems like things are finally back on track :) Hopefully the investigation triggered by AWSY's numbers has been worthwhile.
After some additional discussion today Timothy and I were interested to see what the result would be of never locking during decoding at all. That turns out to yield 381.91 MiB, which gives almost another 10MB improvement over the best result achieved so far. The bad news is that I'm not super keen on making that change right now, because it's not safe for animated images (and unfortunately we don't know that an image is animated until we decode it!). It might be interesting to look at how much benefit we'd get from not locking during decoding at all for image types that cannot be animated. (That's JPEGs and BMPs, basically.) *That* is a change we could make right away, and if we still get some decent savings, it might be worth considering.
Can we consider this resolved? Bug 1125490 has landed now.
I'm okay with resolving this regression.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED