Closed Bug 1453801 Opened 2 years ago Closed 2 years ago

In SourceSurfaceSharedDataWrapper::SourceSurfaceSharedDataWrapper, mConsumer being initialized to 0 causes an assertion to trigger

Categories

(Core :: Graphics: WebRender, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox-esr52 --- unaffected
firefox59 --- disabled
firefox60 --- disabled
firefox61 --- fixed

People

(Reporter: tristanbourvon, Assigned: aosmond)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 3 obsolete files)

This hunk https://hg.mozilla.org/integration/mozilla-inbound/diff/d7d2f08e051c/gfx/layers/SourceSurfaceSharedData.h caused an assertion to trigger (https://treeherder.mozilla.org/logviewer.html#?job_id=173104451&repo=mozilla-inbound), presumably because RemoveConsumer is called before AddConsumer somewhere, and currently mConsumer gets positive garbage so this is invisible.

The hunk has since been backed out but should be a correct fix to land.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(aosmond)
Well that's embarrassing.
Assignee: nobody → aosmond
Status: NEW → ASSIGNED
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(aosmond)
I guess this also means that in SharedSurfacesParent::Release we never do sInstance->mSurfaces.Remove(id) ?
I'm not sure if I even need the consumer code anymore, the code has evolved a lot since it was first introduced. Also this bug only affects WebRender enabled builds.
Should be fine just ripping it out. The image cache in the content process keeps the surfaces alive in the GPU process as long as necessary anyways. The races I feared should be very uncommon, and now WebRender accepts an external image ID not mapping to anything (it will log a critical warning).

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c081d1c862f527c8efa22f578b60daea022ee7eb
Attachment #8967829 - Flags: review?(jmuizelaar)
Attachment #8967829 - Flags: review?(jmuizelaar) → review+
This should be unmarked as a security bug, given it will only affect WebRender, and the consequence is that we won't release images from the cache in the GPU process when we should (memory leak).
Depends on: 1454112
I think we need to back this out, it caused a huge AWSY regression w/ WebRender (bug 1454346) as well as frequent crashes (bug 1454112, bug 1454114).
Flags: needinfo?(aryx.bugmail)
https://hg.mozilla.org/mozilla-central/rev/789e30ff2e3d6e1fcfce1a373c1e5635488d24da
Status: RESOLVED → REOPENED
Flags: needinfo?(aryx.bugmail)
Resolution: FIXED → ---
Target Milestone: mozilla61 → ---
Group: core-security → gfx-core-security
My latest attempt to fix this has yielded fruit.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=0e45c13b34e815cb42a9f08bb44142d1a81e186e&newProject=try&newRevision=385b2b7f02edc4049f1bf48fc61299d174a39beb&framework=4&showOnlyImportant=1

I haven't looked into why the the Images measurement went up by 30% / 1.5MB, but the Resident Memory went down by 6% / 70MB. Finally results that are starting to make some sense.
(In reply to Andrew Osmond [:aosmond] from comment #9)
> My latest attempt to fix this has yielded fruit.
> 
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-
> central&originalRevision=0e45c13b34e815cb42a9f08bb44142d1a81e186e&newProject=
> try&newRevision=385b2b7f02edc4049f1bf48fc61299d174a39beb&framework=4&showOnly
> Important=1
> 
> I haven't looked into why the the Images measurement went up by 30% / 1.5MB,
> but the Resident Memory went down by 6% / 70MB. Finally results that are
> starting to make some sense.

Looking at the subtests it feels like you're keeping images alive longer [1]. The after 30s / forced GC numbers look okay.

[1] https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-central&originalRevision=0e45c13b34e815cb42a9f08bb44142d1a81e186e&newProject=try&newRevision=385b2b7f02edc4049f1bf48fc61299d174a39beb&originalSignature=192131af763ce4043e6b7dacf039fd3330f47e1a&newSignature=192131af763ce4043e6b7dacf039fd3330f47e1a&framework=4
After some consideration, I think this is what is happened before:

1) Load a page. Surface cache touched to create image container. Image container cached.
2) Tab switch away to prepare for a new page. Image container remains cached by the caller.
3) Repeat.
4a) Return to tabs in order to close. Surface cache not touched as the image container was kept alive. Doesn't matter if imgIContainer::GetImageContainer(AtSize) is called because it exits before hitting the cache if the container already exists.
4b) In parallel, items in cache are expired and removed from reporting. The surfaces may still be kept alive by the image container, but no longer reported.

With my new changes:

2) Tab switch away to prepare for a new page. Image container is discarded.
4a). Return to tabs in order to close. Hits surface cache in order to get the image container for the last brief display before closing.
4b). Same as before, but the timer got reset, so the images live a bit longer.
Component: Graphics: Layers → Graphics: WebRender
This reveals a shortcoming in imagelib as well. Ideally we wouldn't expire cache entries that still have an image container alive. This would also make reporting more accurate, e.g. not underreporting.
I don't believe linux64 suffers much from this misreporting. Images are less likely to be layerized. That would cause them to go via the imgIContainer::Draw path, which will touch the surface cache again and keep the images alive longer.

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-central&originalRevision=2c32c973232d7362ca9dfe103e1c3a4863e9a423&newProject=mozilla-central&newRevision=8ed49dd81059dfdd876cf62ad5def1cfa56ffbbf&originalSignature=64f1ad4137b67d2c468d4d5efc381c4aec875872&newSignature=64f1ad4137b67d2c468d4d5efc381c4aec875872&framework=4

While the patch appears to regress linux64-qr, in reality it just brings us back in line with what one would expect. Since we are showing the same images, why are we would imagelib report less memory usage? There isn't any obvious reason why WebRender would be more efficient here.
Group: gfx-core-security
This fixes how we would hold onto the image containers for too long. Tab switching away would keep the WebRenderUserData objects, while scrolling would toss them.
Attachment #8967829 - Attachment is obsolete: true
This sort of does a 180 on the consumer code and actually uses it. I didn't run into any failures without it, but unfortunately there is a small chance of a race...
And finally, add the missing shutdown code that should have been there all along. It is a bit tricky because we want to ensure the render thread has finished all its tasks before we remove any lingering textures.

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3f3a7290c590d109202fb61340ecff91c5af8b83
Attachment #8969372 - Attachment is obsolete: true
Attachment #8969745 - Flags: review?(sotaro.ikeda.g)
Attachment #8969375 - Attachment is obsolete: true
Attachment #8969747 - Flags: review?(sotaro.ikeda.g)
Comment on attachment 8969371 [details] [diff] [review]
0001-Bug-1453801-Part-1.-Ensure-WebRenderUserData-objects.patch, v1

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=80cba9c2fe8009cba5ebabaae4c4498eb1c93689&selectedJob=174820076
Attachment #8969371 - Flags: review?(sotaro.ikeda.g)
I see why the original change made things worse instead of fixing them. Correcting the image lifetime issue on the content process caused us to redecode the images more frequently. Since the rendering thread was keeping the old images alive, more images decoded will mean the memory footprint keeps getting higher. Hence why I needed to fix all of the bugs to see the improvement.
Attachment #8969371 - Flags: review?(sotaro.ikeda.g) → review+
Attachment #8969745 - Flags: review?(sotaro.ikeda.g) → review+
Comment on attachment 8969747 [details] [diff] [review]
0003-Bug-1453801-Part-3.-Fix-race-condition-shutting-down.patch, v2

Looks good!
Attachment #8969747 - Flags: review?(sotaro.ikeda.g) → review+
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f3eeeae62a5a
Part 1. Ensure WebRenderUserData objects are freed after a tab switch. r=sotaro
https://hg.mozilla.org/integration/mozilla-inbound/rev/c9f4fd7170ad
Part 2. Ensure shared surfaces are properly released from render texture cache. r=sotaro
https://hg.mozilla.org/integration/mozilla-inbound/rev/d1479b21a284
Part 3. Fix race condition shutting down the render thread and shared surfaces. r=sotaro
https://hg.mozilla.org/mozilla-central/rev/f3eeeae62a5a
https://hg.mozilla.org/mozilla-central/rev/c9f4fd7170ad
https://hg.mozilla.org/mozilla-central/rev/d1479b21a284
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
You need to log in before you can comment on or make changes to this bug.