Closed Bug 985193 Opened 6 years ago Closed 5 years ago

High heap unclassified/display driver crash during slideshow

Categories

(Core :: Graphics, defect)

x86
Windows 7
defect
Not set

Tracking


RESOLVED FIXED
mozilla38
Tracking Status
firefox37 + fixed
firefox38 --- fixed

People

(Reporter: ferongr, Assigned: seth)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(6 files, 4 obsolete files)

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0 ID:20140318030202 CSet: 082761b7bc54 2014-03-18 Nightly build.

STR

1. Follow URL
2. Enable auto-next at the bottom left panel and set it to a low value (for faster reproduction, the crash eventually happens regardless of speed).
3. Wait for a minute or so
4. Observe about:memory
5. Wait for 5 minutes or so
6a. With layers.offmainthreadcomposition.enabled;false: Display driver/DWM crash (resolution switches to 640x480, theme switches to Aero Basic, Windows recovers after a few seconds). Nightly becomes inoperable (the window shrinks completely, showing only a small titlebar), requiring a restart.
6b. With layers.offmainthreadcomposition.enabled;true: The browser crashes. Example crash ID: 203e780f-5466-400a-9132-e02802140318

Without OMTC, the heap-unclassified leaf node keeps growing until the driver crashes. With OMTC enabled it's the GFX -> heap-textures leaf node that keeps growing.

Forcing a GC/CC without OMTC does lower the memory used (heap-unclassified goes to normal levels). With OMTC though the memory stays around until the browser is restarted.

GPU: AMD HD6850 with 13.12 Catalyst.
Attached file about:memory snapshot with OMTC (obsolete) —
Blocks: DarkMatter
Component: General → Graphics
Whiteboard: [MemShrink]
Bad behaviour with and without OMTC, yikes.
Whiteboard: [MemShrink] → [MemShrink:P2]
In my testing, the images-uncompressed heap never went up with OMTC on or off, so I'm guessing the significant images-uncompressed heap in the OMTC-off report attachment is due to other activity in the session?

There was lots of memory usage in heap-textures and/or heap-unclassified though.
Status: UNCONFIRMED → NEW
Ever confirmed: true
On mac (with omtc on, the default) heap-textures grows to 1 GB in about a minute. So this seems bad.
No, I don't think so. The snapshot with OMTC off is with a brand-new profile and a session a few seconds old. The OMTC snapshot is from my normal profile, but I will upload a new snapshot with a fresh profile and OMTC enabled in a bit just to be sure.
Attachment #8393206 - Attachment is obsolete: true
(In reply to ferongr from comment #6)
> No, I don't think so. The snapshot with OMTC off is with a brand-new profile
> and a session a few seconds old. The OMTC snapshot is from my normal profile,
> but I will upload a new snapshot with a fresh profile and OMTC enabled in a
> bit just to be sure.

Interesting. I tested on a Windows machine (previous results were for mac) and I still never get any values on images uncompressed over 1 MB. I wish I could reproduce what you are seeing.

Either way those numbers are but a sideshow to heap-textures and heap-unclassified.
Ah, so I can get large images uncompressed-heap numbers in the release channel. I'm guessing that bug 962670 moved that to heap-textures or unclassified?

The site uses background-image to show the images, for which we don't have any smarts about keeping around decoded data like we do for img elements.
In my case, it's the layers.offmainthreadcomposition.enabled pref (false by default on Windows Nightly) that makes the high number move between heap-unclassified and heap-textures.
We appear to be discarding images fine. The difference between the number of RasterImage::DecodingComplete calls and RasterImage::Discard calls (on non-chrome images) is never more than 15, even after my laptop starts going into an unusable state due to memory use. And I traced one Discard call down to the vm_deallocate call, so that appears to be working fine. So the problem must be somewhere else.
The images are getting layerized, and the image container that RasterImage uses seems to be the source of the problem. We don't touch it in ::Discard, so if proper disposal of it is happening, it must be done by the layers subsystem.
Looks like this is still an issue. With a late Nightly 37 on Windows, after about an hour I see:

463.56 MB (100.0%) -- explicit
├──273.78 MB (59.06%) ── heap-unclassified
├──117.12 MB (25.27%) -- images

5,468.42 MB ── gpu-committed
  367.62 MB ── gpu-dedicated
  122.00 MB ── gpu-shared
  422.81 MB ── heap-allocated
        473 ── heap-chunks
    1.00 MB ── heap-chunksize
  426.46 MB ── heap-committed
  473.00 MB ── heap-mapped
      0.86% ── heap-overhead-ratio
    2.42 MB ── imagelib-surface-cache-estimated-locked
    2.42 MB ── imagelib-surface-cache-estimated-total
    0.34 MB ── js-main-runtime-temporary-peak
          0 ── low-commit-space-events
  734.10 MB ── private
  139.82 MB ── resident
3,906.76 MB ── vsize
   17.00 MB ── vsize-max-contiguous

VMMap says I have about 2700MB of "Shareable" write-combine regions. (Is the gpu-committed number double-counting?)

A sample WinDbg !address dump on one of those regions looks like:

Usage:                  MappedFile
Base Address:           dab00000
End Address:            db600000
Region Size:            00b00000
State:                  00001000	MEM_COMMIT
Protect:                00000404	PAGE_READWRITE|PAGE_WRITECOMBINE
Type:                   00040000	MEM_MAPPED
Allocation Base:        dab00000
Allocation Protect:     00000404	PAGE_READWRITE|PAGE_WRITECOMBINE
Mapped file name:       PageFile

I can't tell where those regions are coming from. AFAICT it's not through the VirtualAlloc or MapViewOfFile families.

On the bright side, those regions do get cleaned up if I navigate away and Minimize Memory.
No longer blocks: 1006295
Blocks: 1062065
Attached image gpu-memory-usage.svg
So I had a look at a gpuview trace of this situation. We definitely seem to be accumulating GPU allocations. I've attached a graph of GPU allocations from David's gpuview trace.
Looking more closely, it appears as though we are accumulating textures. I do not yet know why or how.
I can reproduce this locally and it looks like the gpuview logs contain allocation stacks so I should be able to figure out what's going on tomorrow.
Here's a sample deallocation from when everything gets freed.
It looks like imagelib is keeping all of the images on the page in the image cache. This is keeping a bunch of the GPU stuff alive.

How are images supposed to be removed from the cache? What's keeping them there on this page?
A few more details.

mCache is growing in size. The cacheQueue does not have very many items in it. It seems like the imgRequestProxy destructor is not being called much.
Flags: needinfo?(seth)
Flags: needinfo?(tnikkel)
I think the problem is that there is nothing to release mImageContainer on RasterImage. So as long as the image is kept alive (the imgCache is only for keeping the source data around) the image container sticks around. Instead it should be treated the same as the rest of the decoded data for images.

I think all we should need is nulling out mImageContainer in RasterImage::Discard.
Flags: needinfo?(tnikkel)
(In reply to Timothy Nikkel (:tn) from comment #20)
> I think the problem is that there is nothing to release mImageContainer on
> RasterImage. So as long as the image is kept alive (the imgCache is only for
> keeping the source data around) the image container sticks around. Instead
> it should be treated the same as the rest of the decoded data for images.
> 
> I think all we should need is nulling out mImageContainer in
> RasterImage::Discard.

This seems to have worked.
Assignee: nobody → tnikkel
(In reply to Timothy Nikkel (:tn) from comment #20)
> I think all we should need is nulling out mImageContainer in
> RasterImage::Discard.

Looking at this problem, I'm wondering whether what we should really do is get rid of RasterImage::mImageContainer completely and keep only RasterImage::mImageContainerCache. In other words, we should keep only a weak reference to the ImageContainer and not a strong reference.

Is there any reason we *should* keep a strong reference? I suppose whether that will hurt us depends on how the layer system behaves.
Flags: needinfo?(seth)
Here are some further comments from IRC:

seth:	tn: the comment in GetImageContainer says "We only need to be careful
about holding on to the image when it is discardable by the OS"

seth:	tn: that makes zero sense to me

seth:	tn: i think the goal here is that we keep the ImageContainer alive,
which keeps its reference to the imgFrame's SourceSurface alive, which keeps
the VolatileBuffer underneath alive

seth:	tn: but if the layer system drops its reference to the ImageContainer,
and the image is unlocked, and nothing is calling Draw, why *should* we keep the
VolatileBuffer alive?

I've gone ahead and written a patch that removes the strong reference to
RasterImage's ImageContainer, and keeps only a weak reference. In my local
testing it seems to work as expected and have no negative effects - the weak
reference stays alive as long as the image stays layerized, and gets dropped
when it stops being layerized.

I wasn't 100% sure who should review this; Matt, if you think it should be
someone else, please forward the r?.
Attachment #8549297 - Flags: review?(matt.woodrow)
Assignee: tnikkel → seth
Status: NEW → ASSIGNED
Comment on attachment 8549297 [details] [diff] [review]
Stop holding a strong reference to RasterImage's ImageContainer

Review of attachment 8549297 [details] [diff] [review]:
-----------------------------------------------------------------

::: image/src/RasterImage.cpp
@@ +850,5 @@
>  
> +  // We need a new ImageContainer, so create one.
> +  container = LayerManager::CreateImageContainer();
> +
> +  nsRefPtr<layers::Image> image = GetCurrentImage(container);

Add a comment saying that the Image contains a SourceSurface which is holding the lock on the VolatileBuffer, preventing it from getting freed.
Attachment #8549297 - Flags: review?(matt.woodrow) → review+
Thanks for the quick review! I'll make that change.
Addressed review comments.
Attachment #8549297 - Attachment is obsolete: true
Here's a try job. I'm going to be paranoid about this one and run this on all platforms:

https://tbpl.mozilla.org/?tree=Try&rev=03ea59b4a02f
It seems like we still accumulate image cache entries for this page without bound. Is that intended?
(In reply to Jeff Muizelaar [:jrmuizel] from comment #28)
> It seems like we still accumulate image cache entries for this page without
> bound. Is that intended?

So it seems like this is intended by this particular page. It keeps a cache of all the image elements that it has ever loaded, which explains the behaviour I was seeing.
So it looks like this actually fixed several tests that were marked fails-if on
Android, although I don't really understand why. I've updated the patch to
remove those fails-if annotations.
Attachment #8549307 - Attachment is obsolete: true
It would be valuable to know when this problem started occurring.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #31)
> It would be valuable to know when this problem started occurring.

I'd bet that it came from the bugs that added support for volatile buffers.
I pushed a new Android-only try job to make sure that the new test annotations are good:

https://tbpl.mozilla.org/?tree=Try&rev=36d55de4d291
(In reply to Seth Fowler [:seth] from comment #32)
> I'd bet that it came from the bugs that added support for volatile buffers.

I take it back! My guess was based on the comments in GetImageContainer, but Timothy investigated and it appears that the volatile buffers patches added the weak pointer, not the strong pointer. We were in an even worse position before those patches.
Sigh.. I appear to have neglected to request tests on that try push.

https://tbpl.mozilla.org/?tree=Try&rev=e927a3d40324
Wow, I appear to be hitting a bug in 'hg trychooser'. Good to know about. =\

At any rate, this new try run is now useless, because those same test annotations got changed by someone else.

I'm just going to go ahead and push.

Here's the patch without the new test annotations, and rebased.
Attachment #8549851 - Attachment is obsolete: true
https://hg.mozilla.org/mozilla-central/rev/ead7a5999fe8
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
Comment on attachment 8550545 [details] [diff] [review]
Stop holding a strong reference to RasterImage's ImageContainer

Approval Request Comment
[Feature/regressing bug #]: Part of fix for bug 1125272 and bug 1119938.
[User impact if declined]: Already approved.
[Describe test coverage new/current, TreeHerder]: In 38 for a long time.
[Risks and why]: Low risk; in 38 for a long time.
[String/UUID change made/needed]: None.
Attachment #8550545 - Flags: approval-mozilla-aurora?
Comment on attachment 8550545 [details] [diff] [review]
Stop holding a strong reference to RasterImage's ImageContainer

Aurora approval previously granted for a collection of ImageLib related bugs that Seth worked on and QE verified before uplift.
Attachment #8550545 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Covered by verification for bug 1125272 and bug 1119938.
Flags: qe-verify-