Closed Bug 902922 Opened 6 years ago Closed 6 years ago

Heavy, silent memory leak as the cycle collector runs into pldhash's 8M limit after browsing on some complex Web page

Categories

(Core :: XPCOM, defect)

23 Branch
x86_64
Windows 7
defect
Not set

Tracking

RESOLVED FIXED
mozilla27

People

(Reporter: typedef.dev, Assigned: njn)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [MemShrink])

Attachments

(4 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0 (Beta/Release)
Build ID: 20130730113002

Steps to reproduce:

- Test case A:
1/ Load a WebGL app (three.js example: http://threejs.org/examples/webgl_materials_cars_camaro.html ) that contains mesh geometry and textures.
2/ Reload the page n times.

- Test case B:
After searching the web to see whether anyone else had the same WebGL memory leak problems that I have in my WebGL apps, I found a texture-reloading WebGL test that someone had uploaded. It could be useful to also take a look at it.
1/ Open the WebGL textured cube demo: http://jodyj.com/test/webgl_panorama_memoryLeak.html
2/ Click the 'memory leak test' button, which swaps between two textures.


Actual results:

- Test case A:
Every time the page is reloaded, memory usage goes up by about 100 MB, no matter how long I wait between reloads. When memory is full, or after about 3 GB, Firefox crashes.

- Test case B:
Every click swaps between the two textures and makes memory usage go up by about 10 MB.
When memory is full, or after about 3 GB, Firefox crashes.


Expected results:

Memory usage should not increase indefinitely when the same page is reloaded.

Neither leak appears in the other browsers I tested.
Taking. This appears to be a texture leak.
Attachment #808341 - Flags: review?(jmuizelaar)
Attachment #808341 - Flags: review?(jmuizelaar) → review?(bjacob)
Comment on attachment 808341 [details] [diff] [review]
Stop leaking WebGLTexture

This doesn't look right to me.  That object will be returned with a refcount of 0, which is against the already_AddRefed<> contract.
Comment on attachment 808341 [details] [diff] [review]
Stop leaking WebGLTexture

Review of attachment 808341 [details] [diff] [review]:
-----------------------------------------------------------------

Indeed, already_AddRefed and dont_AddRef are pretty explicitly contradictory!
Attachment #808341 - Flags: review?(bjacob) → review-
Is this just a matter of GC not happening often enough, because the underlying memory used by the texture objects isn't being taken into account to trigger it... and we have old contexts sitting around after the reloads?  Or is three.js just never deleting the texture explicitly?

Hard to debug because the three.js that the second testcase is using is minified.
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #5)
> Is this just a matter of GC not happening often enough, because the
> underlying memory used by the texture objects isn't being taken into account
> to trigger it... and we have old contexts sitting around after the reloads? 
> Or is three.js just never deleting the texture explicitly?
> 
> Hard to debug because the three.js that the second testcase is using is
> minified.

Repeatedly stomping on "Minimize memory usage" in about:memory didn't do anything. The objects hung around until Firefox crashed or the tab was closed.
After testcase (2) was closed, a full GC took care of the memory.

The number of contexts remained stubbornly at 1.
Here, you want "GC" and "CC", but not "minimize memory usage". Indeed, GC and CC should be enough to free all the unneeded memory (or else, we have a leak) and conversely, "minimize memory usage" can corrupt your results by losing WebGL contexts, artificially hiding a leak.
Probably the most effective way to make progress on this bug is going to be to minimize a testcase.
The second testcase is pretty minimal; just needs non-minified three.js so we can see what it's doing.
Huh. Switching out for non-minified three.js at threejs.org gives:
TypeError: texture.deallocate is not a function @ file:///C:/Users/Michael/Desktop/testcase.html:114.
Different version?


Also, the author of that page gives this comment:
materialPano.materials=materialsT;  //without this line, I don't have memory leak but I can't change the texture. So it means that loadTexture doesn't have memory leak since the memory doesn't increase when I click on the button with only materialsT=[loadTexture(...),...] in the changeTexture() function.

JS holding onto something?
Yes, this sounds like JS is used to complete a cycle, which we leak.

When a cycle is completely made of GC'd objects, the GC takes care of them.

When non-GC'd objects are exposed to JS (for example, the underlying C++ object behind a WebGLTexture), all the strong references that they might hold outside of the GC must be declared to the cycle collector. Example: in content/canvas/src, in WebGLFramebuffer.cpp, we declare to the cycle collector that a WebGLFramebuffer may have a strong reference to a WebGLTexture (see the macros at the bottom of that file). If we didn't, then doing mytexture.foo = myframebuffer in JS would make it possible to complete a cycle that we would leak. The present bug is starting to sound like we have a similar leak somewhere.
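
For illustration, here is a rough sketch of what such a declaration looks like with the generic cycle collection macros. The class and member names are made up for the example; the real ones are in WebGLFramebuffer.cpp.

// Hypothetical sketch, not the actual WebGL code: a wrapped C++ object that
// holds a strong reference outside the GC declares that edge to the cycle
// collector, so a cycle completed from JS through it can still be collected.
class MyFramebuffer
{
public:
    NS_DECL_CYCLE_COLLECTION_NATIVE_CLASS(MyFramebuffer)
    NS_INLINE_DECL_CYCLE_COLLECTING_NATIVE_REFCOUNTING(MyFramebuffer)

private:
    nsRefPtr<MyTexture> mColorAttachment;  // strong ref the CC must know about
};

// Generates the Traverse/Unlink implementations that visit mColorAttachment.
NS_IMPL_CYCLE_COLLECTION(MyFramebuffer, mColorAttachment)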

What are materialPano and materialsT here?
materialPano=new THREE.MeshFaceMaterial( materialsT );

later:

materialPano.materials=materialsT;

where both of them are globals. materialsT is an array.
Next, we need to look into these arrays/dictionaries/structures, recursively, and find
 - what are the WebGL objects referenced by materialsT
 - what is referencing materialPano?

...so that we can hopefully find a reference from materialsT to materialPano, closing the cycle.
OK, so far,

the elements of materialsT are THREE.MeshBasicMaterial.

Note that the function creating/returning the material uses a local var for the THREE.Texture.

It supposedly calls the deallocate() method on that texture.

Back to three.js, 
>	var deallocateTexture = function ( texture ) {
>
>		if ( texture.image && texture.image.__webglTextureCube ) {
>
>			// cube texture
>
>			_gl.deleteTexture( texture.image.__webglTextureCube );
>
>		} else {
>
>			// 2D texture
>
>			if ( ! texture.__webglInit ) return;
>
>			texture.__webglInit = false;
>			_gl.deleteTexture( texture.__webglTexture );
>
>		}
>
>	};

Except the testcase uses a version that apparently has deallocate() stuck on the JS texture object itself.
I'll do some more digging later.
Looking at this in Refgraph, a tool that can examine heap objects, their types, and strong references between them.

On http://jodyj.com/test/webgl_panorama_memoryLeak.html after hitting 'leak test' a few times, here are the C++ WebGL objects on the heap:

64 WebGLBuffer
44 WebGLTexture
31 WebGLUniformLocation
6 WebGLShader
4 WebGLExtensionBase
3 WebGLProgram
1 WebGLContext
1 WebGLMemoryReporter
1 WebGLVertexArray
1 mozilla::WebGLMemoryPressureObserver

After hitting 'leak test' 10 more times:

104 WebGLTexture
64 WebGLBuffer
31 WebGLUniformLocation
6 WebGLShader
4 WebGLExtensionBase
3 WebGLProgram
1 WebGLContext
1 WebGLMemoryReporter
1 WebGLVertexArray
1 mozilla::WebGLMemoryPressureObserver

So we see that hitting 'leak test' once causes 6 new WebGLTextures to be created and kept alive.

After hitting CC/GC many times in about:memory, the numbers are still the same:

104 WebGLTexture
64 WebGLBuffer
31 WebGLUniformLocation
6 WebGLShader
4 WebGLExtensionBase
3 WebGLProgram
1 WebGLContext
1 WebGLMemoryReporter
1 WebGLVertexArray
1 mozilla::WebGLMemoryPressureObserver

After closing this page and GC/CC:

no WebGL object remains in memory.

The fact that all WebGL objects are correctly destroyed after closing the page means that they aren't completely leaked. If they are leaked, then it's a kind of leak that is plugged when the document is destroyed. The only point of contact between WebGL objects and the document is the *WebGLContext*. So the place to look is WebGL objects strongly referenced by the *WebGLContext* and not traversed by the CC.

The other possibility is that it's not a leak, and the JavaScript code is just explicitly keeping these objects alive... we should check that at some point. Even if the leak behavior doesn't happen in other browsers, it could be that this JS code doesn't do the same thing in other browsers.
I proofread WebGLContext for undeclared strong references to anything outside it, and didn't find anything.

But I've been working with mozilla-central, and just saw that this was reported against Firefox 23.

Can you reproduce the leak in Nightly?
Yes, I was working in Nightlies with this bug.
Also note that when I was running the testcase while debugging, I noticed that when the new textures were allocated by hitting the button, the old ones had a state of 'DeleteRequested'.
Interesting. DeleteRequested means that webgl.deleteTexture was called on them, which means that their OpenGL resources must be freed as soon as possible, but they can't be freed yet because something is holding on to them. There are only two things that can hold on to a WebGLTexture in this way (there are only two things that have WebGLRefPtr<WebGLTexture>'s):

 (1) either the texture is attached to a WebGLFramebuffer object
 (2) or the texture is still bound to some texture unit (as in webgl.activeTexture) on this WebGL context.

Of course (1) is far more likely and is the only way to have this unbounded number of textures around (as there is only a limited number of texture units).
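
To make that lifetime rule concrete, here is a small self-contained sketch (hypothetical names, with shared_ptr standing in for WebGLRefPtr; not the real WebGL code) of the delete-when-the-last-holder-drops pattern:

// Hypothetical sketch: deleteTexture() only flags the object as
// DeleteRequested; the GL resource is released once the last strong holder
// (a framebuffer attachment or a texture-unit binding) lets go of it.
#include <cstdio>
#include <memory>

struct TextureSketch {
    bool deleteRequested = false;
    ~TextureSketch() { std::puts("GL texture actually freed here"); }
};

int main()
{
    auto tex = std::make_shared<TextureSketch>();  // created via createTexture()
    auto boundToUnit0 = tex;                       // strong holder: a texture unit
    tex->deleteRequested = true;                   // webgl.deleteTexture(tex)
    tex.reset();                                   // the JS-side handle goes away
    std::puts("still alive: a texture unit binding holds it");
    boundToUnit0.reset();                          // unbind -> last holder gone, destructor runs
    return 0;
}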
... but my data in comment 15 shows that there isn't any WebGLFramebuffer object around. So I still have no clue.
Hey, has there been any progress on solving this?

We have noticed similar issues with the http://p3d.in/ WebGL viewer. Reloading heavy models in Firefox eats memory until the whole browser freezes or crashes.


Here is a link to a super-heavy million-poly test model: http://p3d.in/3axnM

Each reload eats about 1 GB of memory, so three reloads crash the browser. Closing the tab or even forcing garbage collection in about:memory doesn't reduce memory usage, no matter how long you wait.

Of course, not many models are that heavy, but a 250k+ poly sculpted model is fairly common... Chrome handles memory usage just fine on that same model.
Thanks, the testcase in the previous comment (http://p3d.in/3axnM) actually works for me (Nightly 27.0a1 64-bit, Ubuntu 12.04, Intel GPU).

That's great progress, as the previous STR didn't work for me.

By 'works for me' I mean that after loading the testcase, closing the tab, going to about:memory, and hitting GC and CC 3 times, I still have the 1 GB+ of extra memory usage. As a bonus, the browser freezes intermittently (I suspect an out-of-graphics-memory condition).
OK, in about:memory on my build (Win8 debug) I have a massive number of WebGL buffers being reported.
Those buffers don't seem to go away at all, and the destructor isn't getting called.
I think I understand what's happening, and why it's been hard to reproduce across different machines.

The problem is that we run out of memory during cycle collection. I saw an NS_ASSERTION failure "Ran out of memory while building cycle collector graph" flying by in my terminal while reproducing this. Then I replaced the NS_ASSERTION(false) with MOZ_CRASH in nsCycleCollector.cpp, and got this stack:

Program received signal SIGSEGV, Segmentation fault.
nsCycleCollector::MarkRoots (this=0x7fffe4344030, aBuilder=...)
    at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:2131
2131            MOZ_CRASH("Ran out of memory while building cycle collector graph");
(gdb) bt
#0  nsCycleCollector::MarkRoots (this=0x7fffe4344030, aBuilder=...)
    at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:2131
#1  0x00007ffff4129b83 in nsCycleCollector::BeginCollection (this=0x7fffe4344030, aCCType=
    ScheduledCC, aManualListener=0x0) at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:2792
#2  0x00007ffff4129beb in Collect (aManualListener=<optimized out>, aResults=<optimized out>, 
    aWhiteNodes=<optimized out>, aCCType=ScheduledCC, this=0x7fffe4344030)
    at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:2697
#3  nsCycleCollector::Collect (this=<optimized out>, aCCType=ScheduledCC, 
    aWhiteNodes=<optimized out>, aResults=<optimized out>, aManualListener=<optimized out>)
    at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:2684
#4  0x00007ffff4129c7a in nsCycleCollector_collect (aManuallyTriggered=false, aResults=
    0x7fffffffc31c, aManualListener=<optimized out>)
    at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:3162
#5  0x00007ffff37d54cb in CycleCollectNow (aManuallyTriggered=false, 
    aExtraForgetSkippableCalls=0, aListener=0x0)
    at /hack/mozilla-git/dom/base/nsJSEnvironment.cpp:2079
#6  nsJSContext::CycleCollectNow (aListener=0x0, aExtraForgetSkippableCalls=0, 
    aManuallyTriggered=false) at /hack/mozilla-git/dom/base/nsJSEnvironment.cpp:2036
#7  0x00007ffff37d5aef in CCTimerFired (aTimer=<optimized out>, aClosure=<optimized out>)
    at /hack/mozilla-git/dom/base/nsJSEnvironment.cpp:2292


I verified that when we are not running out of memory during cycle collection, the leak does not occur.

The first reason why other browsers would not be affected is that they do not use a cycle collector.

Now we should figure out whether it is really normal that we run out of memory during CC here.
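
For reference, the instrumentation used above boils down to turning the soft assertion into a hard crash so the debugger stops right at the failure point. Roughly (paraphrased sketch, not the exact tree state):

// Paraphrased sketch of the debugging change in nsCycleCollector.cpp, not the
// exact source: make the silent OOM assertion fatal so it cannot be missed.
if (builder.RanOutOfMemory()) {
    // was: NS_ASSERTION(false, "Ran out of memory while building cycle collector graph");
    MOZ_CRASH("Ran out of memory while building cycle collector graph");
}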
OOM during CC was a cause of some intermittent failures (see bugs 894030 and 900633). It was "fixed" by reducing the permutations of the test (bug 920043).
In GCGraphBuilder::AddNode, where we set

   mRanOutOfMemory = true;

I added a MOZ_CRASH();

Now I get this stack:

#6  0x00007ff771eeb51d in GCGraphBuilder::AddNode (this=0x7fff63454d70, aPtr=0x7ff706395680, 
    aParticipant=0x7ff776a57c38) at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:1687
#7  0x00007ff771ef18b0 in GCGraphBuilder::NoteChild (this=0x7fff63454d70, child=
    0x7ff706395680, cp=0x7ff776a57c38, edgeName=...)
    at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:1609
#8  0x00007ff771eebd18 in GCGraphBuilder::NoteJSChild (this=0x7fff63454d70, child=
    0x7ff706395680) at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:1835
#9  0x00007ff771ee3b57 in NoteJSChild (aTrc=0x7fff63454c00, aThing=0x7ff706395680, aTraceKind=
    JSTRACE_OBJECT) at /hack/mozilla-git/xpcom/base/CycleCollectedJSRuntime.cpp:391
#10 0x00007ff771ee3bea in NoteJSChildTracerShim (aTrc=0x7fff63454c00, aThingp=0x7fff63454ab8, 
    aTraceKind=JSTRACE_OBJECT) at /hack/mozilla-git/xpcom/base/CycleCollectedJSRuntime.cpp:402
#11 0x00007ff772f10ead in MarkInternal<JSObject> (trc=0x7fff63454c00, thingp=0x7fff63454ab8)
    at /hack/mozilla-git/js/src/gc/Marking.cpp:193
#12 0x00007ff772f069b1 in js::gc::MarkKind (trc=0x7fff63454c00, thingp=0x7fff63454ab8, kind=
    JSTRACE_OBJECT) at /hack/mozilla-git/js/src/gc/Marking.cpp:411
#13 0x00007ff772f0707f in MarkValueInternal (trc=0x7fff63454c00, v=0x7ff701986448)
    at /hack/mozilla-git/js/src/gc/Marking.cpp:531
#14 0x00007ff772f0772c in js::gc::MarkArraySlots (trc=0x7fff63454c00, len=4, vec=
    0x7ff701986430, name=0x7ff774499e03 "objectElements")
    at /hack/mozilla-git/js/src/gc/Marking.cpp:646
#15 0x00007ff772f2a97b in js::ObjectImpl::markChildren (this=0x7ff701986400, trc=
    0x7fff63454c00) at /hack/mozilla-git/js/src/vm/ObjectImpl.cpp:350
#16 0x00007ff772f08dc8 in js::gc::MarkChildren (trc=0x7fff63454c00, obj=
    (JSObject *) 0x7ff701986400 [object Array]) at /hack/mozilla-git/js/src/gc/Marking.cpp:1029
#17 0x00007ff772f0a154 in js::TraceChildren (trc=0x7fff63454c00, thing=0x7ff701986400, kind=
    JSTRACE_OBJECT) at /hack/mozilla-git/js/src/gc/Marking.cpp:1565
#18 0x00007ff772fd354b in JS_TraceChildren (trc=0x7fff63454c00, thing=0x7ff701986400, kind=
    JSTRACE_OBJECT) at /hack/mozilla-git/js/src/gc/Tracer.cpp:103
#19 0x00007ff771ee4357 in mozilla::CycleCollectedJSRuntime::NoteGCThingJSChildren (this=
    0x7ff776a57c30, aThing=0x7ff701986400, aTraceKind=JSTRACE_OBJECT, aCb=...)
    at /hack/mozilla-git/xpcom/base/CycleCollectedJSRuntime.cpp:571
#20 0x00007ff771ee4621 in mozilla::CycleCollectedJSRuntime::TraverseGCThing (this=
    0x7ff776a57c30, aTs=mozilla::CycleCollectedJSRuntime::TRAVERSE_FULL, aThing=
    0x7ff701986400, aTraceKind=JSTRACE_OBJECT, aCb=...)
    at /hack/mozilla-git/xpcom/base/CycleCollectedJSRuntime.cpp:626
#21 0x00007ff771ee38c4 in mozilla::JSGCThingParticipant::Traverse (this=0x7ff776a57c38, p=
    0x7ff701986400, cb=...) at /hack/mozilla-git/xpcom/base/CycleCollectedJSRuntime.cpp:326
#22 0x00007ff771eeb65a in GCGraphBuilder::Traverse (this=0x7fff63454d70, aPtrInfo=
    0x7ff682c8f7f8) at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:1712
#23 0x00007ff771eec3b7 in nsCycleCollector::MarkRoots (this=0x7ff76681a030, aBuilder=...)
    at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:2122
#24 0x00007ff771eedcc5 in nsCycleCollector::BeginCollection (this=0x7ff76681a030, aCCType=
    ScheduledCC, aManualListener=0x0) at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:2793
#25 0x00007ff771eed79f in nsCycleCollector::Collect (this=0x7ff76681a030, aCCType=ScheduledCC, 



The code here is:

PtrInfo*
GCGraphBuilder::AddNode(void *aPtr, nsCycleCollectionParticipant *aParticipant)
{
    PtrToNodeEntry *e = static_cast<PtrToNodeEntry*>(PL_DHashTableOperate(&mPtrToNodeMap, aPtr, PL_DHASH_ADD));
    if (!e) {
        mRanOutOfMemory = true;
        MOZ_CRASH();                 // <-- we crash here
        return nullptr;
    }


(gdb) p mPtrToNodeMap
$2 = {
  ops = 0x7ff77658ed40, 
  data = 0x0, 
  hashShift = 9, 
  maxAlphaFrac = 192 '\300', 
  minAlphaFrac = 64 '@', 
  entrySize = 16, 
  entryCount = 8388607, 
  removedCount = 0, 
  generation = 8, 
  entryStore = 0x7ff677c00030
}
I added some MOZ_CRASH in PL_DHashTableOperate in the PL_DHASH_ADD case to see exactly where it runs out of memory. Got this stack:

#6  0x00007fc7f4955162 in PL_DHashTableOperate (table=0x7fff0fe82370, key=0x7fc7888caa00, op=
    PL_DHASH_ADD) at /hack/mozilla-git/xpcom/glue/pldhash.cpp:620
#7  0x00007fc7f49eb586 in GCGraphBuilder::AddNode (this=0x7fff0fe82310, aPtr=0x7fc7888caa00, 
    aParticipant=0x7fc7f9557c38) at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:1684
#8  0x00007fc7f49f194c in GCGraphBuilder::NoteChild (this=0x7fff0fe82310, child=
    0x7fc7888caa00, cp=0x7fc7f9557c38, edgeName=...)
    at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:1609
#9  0x00007fc7f49ebdb4 in GCGraphBuilder::NoteJSChild (this=0x7fff0fe82310, child=
    0x7fc7888caa00) at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:1835
#10 0x00007fc7f49e3bf3 in NoteJSChild (aTrc=0x7fff0fe821a0, aThing=0x7fc7888caa00, aTraceKind=
    JSTRACE_OBJECT) at /hack/mozilla-git/xpcom/base/CycleCollectedJSRuntime.cpp:391
#11 0x00007fc7f49e3c86 in NoteJSChildTracerShim (aTrc=0x7fff0fe821a0, aThingp=0x7fff0fe82058, 
    aTraceKind=JSTRACE_OBJECT) at /hack/mozilla-git/xpcom/base/CycleCollectedJSRuntime.cpp:402
#12 0x00007fc7f5a10f4d in MarkInternal<JSObject> (trc=0x7fff0fe821a0, thingp=0x7fff0fe82058)
    at /hack/mozilla-git/js/src/gc/Marking.cpp:193
#13 0x00007fc7f5a06a51 in js::gc::MarkKind (trc=0x7fff0fe821a0, thingp=0x7fff0fe82058, kind=
    JSTRACE_OBJECT) at /hack/mozilla-git/js/src/gc/Marking.cpp:411
#14 0x00007fc7f5a0711f in MarkValueInternal (trc=0x7fff0fe821a0, v=0x7fc783eb4c28)
    at /hack/mozilla-git/js/src/gc/Marking.cpp:531
#15 0x00007fc7f5a077cc in js::gc::MarkArraySlots (trc=0x7fff0fe821a0, len=4, vec=
    0x7fc783eb4c10, name=0x7fc7f6f99ec3 "objectElements")
    at /hack/mozilla-git/js/src/gc/Marking.cpp:646
#16 0x00007fc7f5a2aa1b in js::ObjectImpl::markChildren (this=0x7fc783eb4be0, trc=
    0x7fff0fe821a0) at /hack/mozilla-git/js/src/vm/ObjectImpl.cpp:350
#17 0x00007fc7f5a08e68 in js::gc::MarkChildren (trc=0x7fff0fe821a0, obj=
    (JSObject *) 0x7fc783eb4be0 [object Array]) at /hack/mozilla-git/js/src/gc/Marking.cpp:1029
#18 0x00007fc7f5a0a1f4 in js::TraceChildren (trc=0x7fff0fe821a0, thing=0x7fc783eb4be0, kind=
    JSTRACE_OBJECT) at /hack/mozilla-git/js/src/gc/Marking.cpp:1565
#19 0x00007fc7f5ad35eb in JS_TraceChildren (trc=0x7fff0fe821a0, thing=0x7fc783eb4be0, kind=
    JSTRACE_OBJECT) at /hack/mozilla-git/js/src/gc/Tracer.cpp:103
#20 0x00007fc7f49e43f3 in mozilla::CycleCollectedJSRuntime::NoteGCThingJSChildren (this=
    0x7fc7f9557c30, aThing=0x7fc783eb4be0, aTraceKind=JSTRACE_OBJECT, aCb=...)
    at /hack/mozilla-git/xpcom/base/CycleCollectedJSRuntime.cpp:571
#21 0x00007fc7f49e46bd in mozilla::CycleCollectedJSRuntime::TraverseGCThing (this=
    0x7fc7f9557c30, aTs=mozilla::CycleCollectedJSRuntime::TRAVERSE_FULL, aThing=
    0x7fc783eb4be0, aTraceKind=JSTRACE_OBJECT, aCb=...)
    at /hack/mozilla-git/xpcom/base/CycleCollectedJSRuntime.cpp:626
#22 0x00007fc7f49e3960 in mozilla::JSGCThingParticipant::Traverse (this=0x7fc7f9557c38, p=
    0x7fc783eb4be0, cb=...) at /hack/mozilla-git/xpcom/base/CycleCollectedJSRuntime.cpp:326
#23 0x00007fc7f49eb6f6 in GCGraphBuilder::Traverse (this=0x7fff0fe82310, aPtrInfo=
    0x7fc70816b378) at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:1712
#24 0x00007fc7f49ec453 in nsCycleCollector::MarkRoots (this=0x7fc7e931a030, aBuilder=...)
    at /hack/mozilla-git/xpcom/base/nsCycleCollector.cpp:2122
#25 0x00007fc7f49edd61 in nsCycleCollector::BeginCollection (this=0x7fc7e931a030, aCCType=


The code in pldhash.cpp is:

      case PL_DHASH_ADD:
        /*
         * If alpha is >= .75, grow or compress the table.  If key is already
         * in the table, we may grow once more than necessary, but only if we
         * are on the edge of being overloaded.
         */
        size = PL_DHASH_TABLE_SIZE(table);
        if (table->entryCount + table->removedCount >= MAX_LOAD(table, size)) {
            /* Compress if a quarter or more of all entries are removed. */
            if (table->removedCount >= size >> 2) {
                METER(table->stats.compresses++);
                deltaLog2 = 0;
            } else {
                METER(table->stats.grows++);
                deltaLog2 = 1;
            }

            /*
             * Grow or compress table, returning null if ChangeTable fails and
             * falling through might claim the last free entry.
             */
            if (!ChangeTable(table, deltaLog2) &&
                table->entryCount + table->removedCount == size - 1) {
                METER(table->stats.addFailures++);
                entry = NULL;
                MOZ_CRASH();   // <-- line 620, crash here
                break;
            }
        }


(gdb) p table->entryCount
$2 = 8388607
(gdb) p table->removedCount
$3 = 0
(gdb) p size
$4 = 8388608

size is exactly 8M and was obtained above by:
  size = PL_DHASH_TABLE_SIZE(table);

I guess at this point the question is why ChangeTable failed.
Found it... both happy and sad at the same time. Happy to have found it. Sad that we are doing such things.

I sprinkled some MOZ_CRASH calls around the places where ChangeTable can fail, and it turns out we're failing here:


static bool
ChangeTable(PLDHashTable *table, int deltaLog2)
{
    int oldLog2, newLog2;
    uint32_t oldCapacity, newCapacity;
    char *newEntryStore, *oldEntryStore, *oldEntryAddr;
    uint32_t entrySize, i, nbytes;
    PLDHashEntryHdr *oldEntry, *newEntry;
    PLDHashMoveEntry moveEntry;
#ifdef DEBUG
    uint32_t recursionLevel;
#endif

    /* Look, but don't touch, until we succeed in getting new entry store. */
    oldLog2 = PL_DHASH_BITS - table->hashShift;
    newLog2 = oldLog2 + deltaLog2;
    oldCapacity = 1u << oldLog2;
    newCapacity = 1u << newLog2;
    if (newCapacity >= PL_DHASH_SIZE_LIMIT) {
        return false;           // <------ **** failing here !! ****
    }


(gdb) p newCapacity
$1 = 16777216

In pldhash.h we see that that is the value of PL_DHASH_SIZE_LIMIT:

#define PL_DHASH_SIZE_LIMIT     ((uint32_t)1 << 24)

In other words, cycle collection is not actually running out of memory. Instead, it is running into an arbitrary size restriction in pldhash: no pldhash table can have a capacity of more than 16M entries. That's a problem because cycle collection may have to collect arbitrarily complex graphs, depending on the complexity of the Web pages that were viewed. I suppose that the pldhash restriction was written at a time when 16M was considered an enormous amount of memory for a browser to use...
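
To spell out the arithmetic (a back-of-the-envelope check, not Mozilla code): ChangeTable() rejects any newCapacity >= PL_DHASH_SIZE_LIMIT, so the largest capacity a table can actually reach is half of that, and adds then keep succeeding until only one free slot remains -- exactly the entryCount seen in the gdb dump above.

#include <stdint.h>
#include <stdio.h>
int main(void)
{
    uint32_t limit = (uint32_t)1 << 24;   /* PL_DHASH_SIZE_LIMIT */
    uint32_t maxCapacity = limit >> 1;    /* biggest capacity a grow can still produce */
    printf("%u %u\n", maxCapacity, maxCapacity - 1);  /* 8388608 8388607 */
    return 0;
}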
Renaming the bug accordingly.

I am calling this leak 'silent' because it really is silent in release builds, and in debug builds, by default, NS_ASSERTION only outputs a line of text to stderr, which is very easy to miss.

I'm going to make a patch that just removes pldhash's 16M limit because I don't suppose that anyone is going to say that it was a good idea.
Summary: WebGL related, heavy memory leak → Heavy, silent memory leak as the cycle collector runs into pldhash's 16M limit after browsing on some complex Web page
This bug was introduced 12 years ago by this changeset:

https://github.com/mozilla/mozilla-central/commit/a96c9a987084e7ddfcac2e675dfc04587ff9426f#diff-5a53bc204352acd1b122479cfd10064dR53

I bet that at that time, 16M was considered a very large amount of memory...
Attached patch Remove pldhash's 16M limit (obsolete) — Splinter Review
Attachment #817818 - Flags: review?(khuey)
Impact note: take a look at comment 26, which shows that this bug has also likely been a cause of intermittent oranges.
Whiteboard: [MemShrink]
For History purposes, this was introduced by bug 103990.
Comment on attachment 817818 [details] [diff] [review]
Remove pldhash's 16M limit

Review of attachment 817818 [details] [diff] [review]:
-----------------------------------------------------------------

Adding :dbaron to the review. The primary thing that I'm worried about is that pldhash does some bit shifts that might actually rely on the size not being higher than a certain value (?).
Attachment #817818 - Flags: review?(dbaron)
In bug 811212 we made some of the CC assertions fatal in debug builds. Now it looks like we should also make the assertions on OOM fatal in debug builds, which would have greatly accelerated debugging here.

If the OOM condition is a true OOM, we're at risk of crashing soon anyway, and if it's not actually an OOM (but, say, an unwanted restriction in pldhash) then we want to find out about that.
Attachment #817853 - Flags: review?(continuation)
Just mochitests for now, we'll enable more tests if this is green...

https://tbpl.mozilla.org/?tree=Try&rev=dd378b000e2c
Comment on attachment 817818 [details] [diff] [review]
Remove pldhash's 16M limit

Shouldn't Benjamin review this patch?
Comment on attachment 817818 [details] [diff] [review]
Remove pldhash's 16M limit

Sorry, I didn't look very carefully into whom to ask for review. Anyway, this is a very small but potentially large-impact patch, so it can use multiple reviews.
Attachment #817818 - Flags: review?(benjamin)
Attachment #817853 - Attachment description: cc-oom-asserts-fatal → OOM assertions in the cycle collector should be fatal in debug builds
Comment on attachment 817818 [details] [diff] [review]
Remove pldhash's 16M limit

I think I need brendan to approve here. This sounds OK in theory, but I'm slightly afraid that there may be an intrinsic limit in jsdhash/pldhash that I'm not aware of.
Attachment #817818 - Flags: review?(brendan)
Attachment #817818 - Flags: review?(benjamin)
Attachment #817818 - Flags: review+
Status: UNCONFIRMED → NEW
Ever confirmed: true
Great investigation here, bjacob!

(In reply to Benoit Jacob [:bjacob] from comment #37)
> In bug 811212 we made some of the CC assertions fatal in debug builds. Now
> it looks like we should also make the assertions on OOM fatal in debug
> builds, which would have greally accelerated debugging here.

I initially made these asserts fatal, for the reasons you list, but I made them non-fatal after it turned out we were hitting them fairly frequently on TBPL.  From the TBPL perspective, NS_ASSERTION is strictly better, because we record it either way, plus we run the rest of the test suite.  I guess for local debugging, it is harder to notice that NS_ASSERTION is failing in the midst of the other debug spew.  Anyway, since the test suite has been cleaned up and isn't hitting these assertions anymore, it is probably okay to make these fatal again.  We also collect telemetry for OOMs (CYCLE_COLLECTOR_OOM), though I'm having trouble interpreting the results in the dashboard.
Component: Canvas: WebGL → XPCOM
Attachment #817853 - Flags: review?(continuation) → review+
Duplicate of this bug: 925298
I guess we should not have changed the test suite, since the OOM was not real. By doing so, we've lost a testcase that exercised a high-memory condition.
Yeah, it might be worth investigating how it works if it gets changed back a bit, with the patches in this bug.  I don't think they were all happening at this particular place, but I could imagine a failure cascade from this bug that results in a real OOM.
Comment on attachment 817818 [details] [diff] [review]
Remove pldhash's 16M limit

Review of attachment 817818 [details] [diff] [review]:
-----------------------------------------------------------------

Fantastic detective work, bjacob!

But the patch is unsafe.  minAlphaFrac and maxAlphaFrac are 8-bit fixed-point fractions, and the way they are used only works up to a limit of 1<<24 elements.

Here's a test program I used to check this, which simulates the arithmetic:

#include <stdint.h>
#include <stdio.h>
int main(void)
{
    uint32_t size = 1 << 24;
    uint8_t min = (uint8_t)(0x100 * 0.25);
    uint8_t max = (uint8_t)(0x100 * 0.75);
    printf("%u, %u, %u\n", size, (size*min)>>8, (size*max)>>8);
}

The last line prints the table size, and the computed table min and max number of entries.  As written, the output is this:

  16777216, 4194304, 12582912

which is correct.  If you increase size to 1<<25 or higher, you get bogus results for the 2nd and 3rd values.  Basically, because things get shifted left by 8 bits and then shifted back, you need to have 8 spare high bits, and beyond 1<<24 you don't, at least not for uint32_t.

We could increase the size of {min,max}AlphaFrac, but that would increase sizeof(PLDHashTable) on 32-bit platforms from 32 bytes to slightly more than 32 bytes, and when that's heap-allocated jemalloc would round it up to 48 bytes, which is a bummer.

But why do we actually have {min,max}AlphaFrac?  They are in PLDHashTable so that you can modify the min/max loading of a table, via PL_DHashTableSetAlphaBounds().  But guess what -- nobody ever calls that function!  It's unused and unnecessary configurability.  So let's just remove them (and the tricky 8-bit fixed-point arithmetic) and hardcode 0.25 and 0.75 as the min and max load factors.  I'm happy to write the patch that does this, if you like.
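
A rough sketch of what that hardcoding could look like (illustrative only, not the actual patch): with the alpha bounds gone, the load limits become plain integer expressions with no 8-bit scaling, so they stay exact for any capacity a uint32_t can represent.

static inline uint32_t MaxLoad(uint32_t aCapacity)
{
    return aCapacity - (aCapacity >> 2);  /* 0.75 * capacity, computed exactly */
}

static inline uint32_t MinLoad(uint32_t aCapacity)
{
    return aCapacity >> 2;                /* 0.25 * capacity */
}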
Attachment #817818 - Flags: review?(khuey)
Attachment #817818 - Flags: review?(dbaron)
Attachment #817818 - Flags: review?(brendan)
Attachment #817818 - Flags: review-
Depends on: 927705
Comment on attachment 817818 [details] [diff] [review]
Remove pldhash's 16M limit

I've spun off the increase-pldhash's-max-capacity part of this into bug 927705.
Attachment #817818 - Attachment is obsolete: true
Comment on attachment 817818 [details] [diff] [review]
Remove pldhash's 16M limit

There are some other things that would need to be fixed before doing this:

I think there still needs to be some code to fail sensibly in some edge
cases.  At the very least, SearchTable and FindFreeEntry don't handle
a 32-bit size (hashShift == 0, log2 == 32).

It might also be worth asserting in ChangeTable that deltaLog2 is never
greater than 1.  (It can be less than -1, though.)  This assertion would
really be for documentation purposes -- the code would be insufficient
to handle growing to a more-than-current but less-than-requested size.

The change you have also looks like it will introduce an integer overflow
bug in that, during an initialization with log2 / newLog2 equal to 32,
capacity/newCapacity would end up being (I think) 0, yielding a bogus
0 nbytes and the ability to stomp on lots of memory.

So I think you still at the very least need to bail if log2/newLog2 is
32.
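
To make the log2 == 32 concern concrete, a small standalone illustration
(hypothetical values, not pldhash code):

#include <stdint.h>
#include <stdio.h>
int main(void)
{
    uint64_t wanted = (uint64_t)1 << 32;    /* the requested capacity */
    uint32_t capacity = (uint32_t)wanted;   /* truncates to 0 in a 32-bit field */
    uint32_t entrySize = 16;                /* hypothetical entry size */
    printf("capacity=%u nbytes=%u\n", capacity, capacity * entrySize);  /* 0 and 0 */
    return 0;
}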
Attachment #817818 - Flags: review-
dbaron, see the patch I wrote in the spin-off bug 927705.  It increases the limit to 1<<30, which should be safe against the overflow cases you warned about.
Summary: Heavy, silent memory leak as the cycle collector runs into pldhash's 16M limit after browsing on some complex Web page → Heavy, silent memory leak as the cycle collector runs into pldhash's 8M limit after browsing on some complex Web page
Bug 927705, which increases pldhash's max capacity, has landed.  That should be enough to fix this bug, assuming the analysis is correct.  typedef or Earth4, can you try a mozilla-inbound/mozilla-central build and see if it fixes the leak?  I tried to verify it myself, but I had trouble replicating the leak with a pre-927705 build.
If the present STR doesn't reproduce anymore, closing this bug now would require filing a follow-up bug for really fixing this in general. I had no trouble hitting the 8M limit every time I loaded http://p3d.in/3axnM once, so I expect that a 64M limit won't be hard to hit with future testcases, possibly loaded more than once.
Hello,
just to note, I've updated to today's latest Nightly and I still get the same problem (https://bugzilla.mozilla.org/show_bug.cgi?id=925298).
Attached file Testcase for reference
Link to testcase
Still reproduces on m-i, haven't pulled m-c yet.

> top(none)/detached/window(http://p3d.in/3axnM)
Umm...
Attachment #821411 - Attachment mime type: text/plain → text/html
I too can reproduce the p3d.in leak in a browser with the bug 927705 fix present.  After loading the page three times and then closing its tab:

│  ├──2,700.09 MB (84.05%) -- top(none)/ghost/window(http://p3d.in/3axnM)
│  │  ├──2,700.09 MB (84.05%) -- js-compartment(http://p3d.in/3axnM)
│  │  │  ├──2,689.21 MB (83.71%) -- objects
│  │  │  │  ├──2,129.69 MB (66.29%) -- gc-heap
│  │  │  │  │  ├──1,295.20 MB (40.32%) ── ordinary [3]
│  │  │  │  │  ├────831.25 MB (25.88%) ── dense-array [3]
│  │  │  │  │  └──────3.24 MB (00.10%) ── function [3]
│  │  │  │  ├────559.53 MB (17.42%) -- malloc-heap
│  │  │  │  │    ├──558.10 MB (17.37%) ── elements/non-asm.js [3]
│  │  │  │  │    └────1.42 MB (00.04%) ── slots [3]
│  │  │  │  └──────0.00 MB (00.00%) ── non-heap/code/asm.js
│  │  │  └─────10.88 MB (00.34%) ++ (5 tiny)
│  │  └──────0.00 MB (00.00%) ++ dom
│  └─────17.06 MB (00.53%) ++ (5 tiny)

Next step is to see if the CC failure is still occurring for some reason.
Actually, the pldhash capacity increase *does* fix this bug... mostly.

With the old pldhash code, the p3d.in pages leaked consistently and I could never get them to disappear.  With the new pldhash code, I can get them to disappear via about:memory's "minimize memory usage".  Sometimes they linger for longer than I would expect, i.e. long enough to show up in about:memory as detached or a ghost (once the tab is closed), which makes it look like they've leaked.  But they do go away eventually;  forcing GC+CC is usually enough, forcing "minimize memory usage" is always enough.  (Actually, I see similar behaviour with other pages like mozilla.com, but it's less likely to be noticed because normal pages don't allocate 100s of MBs of JS stuff.)

Other good news is that with the old code I got frequent multi-second full-browser hangs on that page, and I don't get that with the new code.  This is because the old pldhash got *really* slow when it was full.

FWIW, I instrumented the new code and saw the max hash table size usually hitting 9.1M, and it even hit 18.2M at one point (presumably from having two copies of everything hanging around).

So I think it's enough to declare this bug fixed, but we might need a follow-up to investigate if the GC/CC clean-up can be made more aggressive.
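
As an aside on that slowness: with open addressing, the expected number of probes for an unsuccessful lookup grows roughly like 1 / (1 - load), so a table pinned at one slot short of full degenerates to scanning almost the entire table on every add. A back-of-the-envelope estimate (not pldhash code):

#include <stdio.h>
int main(void)
{
    double load = 8388607.0 / 8388608.0;   /* the old table, one slot short of full */
    printf("~%.0f probes per miss\n", 1.0 / (1.0 - load));  /* ~8388608 */
    return 0;
}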
(In reply to Nicholas Nethercote [:njn] from comment #56)
> So I think it's enough to declare this bug fixed, but we might need a
> follow-up to investigate if the GC/CC clean-up can be made more aggressive.

Bug 865959?
I'm not sure if GCing more will help.  It depends on whether there are just a million objects held alive by C++.  In that case, only something like bug 928179 will help.
> > So I think it's enough to declare this bug fixed, but we might need a
> > follow-up to investigate if the GC/CC clean-up can be made more aggressive.
> 
> Bug 865959?

> I'm not sure if GCing more will help.  It depends on whether there are just
> a million objects held alive by C++.  In that case, only something like bug
> 928179 will help.

Thanks for the links.  Since those bugs already exist (and I mentioned this bug's test case in 865959) I think there's nothing more to do here.

Thanks again to bjacob for diagnosing this!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Assignee: nobody → n.nethercote
Target Milestone: --- → mozilla27
I would like to file a follow-up bug for fixing the rest of these issues, but need to understand better what isn't fixed yet.

(In reply to Nicholas Nethercote [:njn] from comment #56)
> Actually, the pldhash capacity increase *does* fix this bug... mostly.
> 
> With the old pldhash code, the p3d.in pages leaked consistently and I could
> never get them to disappear.  With the new pldhash code, I can get them to
> disappear via about:memory's "minimize memory usage".  Sometimes they linger
> for longer than I would expect, i.e. long enough to show up in about:memory
> as detached or a ghost (once the tab is closed), which makes it look like
> they've leaked.  But they do go away eventually;  forcing GC+CC is usually
> enough, forcing "minimize memory usage" is always enough.  (Actually, I see
> similar behaviour with other pages like mozilla.com, but it's less likely to
> be noticed because normal pages don't allocate 100s of MBs of JS stuff.)

Is "minimize memory usage" something that always eventually happens without the user manually going to about:memory and clicking a button? Otherwise, isn't this an artificial way of solving this particular leak? For example, the WebGL implementation has deliberate code to destroy all WebGL objects on "minimize memory usage". What if another testcase created a similar CC graph, just not with WebGL?

Also, do I correctly understand that the only thing that changed here, to remove the leaks, was replacing an 8M limit in pldhash with a 64M limit? If so, do I correctly understand that a testcase with an 8x bigger CC graph will definitely still reproduce?

Would anyone have suggestions for what kind of JS-wrapped C++ objects with strong references between them (just one-way C++ strong refs; we'd close the cycles in JS) would have the smallest sizeof(), and, unlike WebGL objects, would not be destroyed on "minimize memory usage"? That way we could build "optimally nasty" testcases with the highest number of CC objects for a given overall memory usage, with no way of destroying them if CC doesn't work.
Flags: needinfo?(n.nethercote)
Landed the patch that makes CC OOM asserts fatal in debug builds:
https://hg.mozilla.org/integration/mozilla-inbound/rev/338760b1955c

Andrew, I wonder if these should simply be MOZ_CRASH, i.e. fatal also in release builds? I understand that there is already telemetry to find out about this, but this is such a bad scenario (leaking gigabytes) that maybe there is not much point in continuing to run, and maybe we want to get bug reports about this as soon as people run into it?
Flags: needinfo?(continuation)
Filed bug 931491 with a testcase that generates a CC graph with > 64 M nodes.
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/2d5462474ba2 because we apparently hit this (mostly 'while building graph' like https://tbpl.mozilla.org/php/getParsedLog.php?id=29739200&tree=Mozilla-Inbound but one 'ScanRoots' in https://tbpl.mozilla.org/php/getParsedLog.php?id=29740931&tree=Mozilla-Inbound) 100% of the time in Win7 mochitest-2 shutdown.
> Is "minimize memory usage" something that always eventually happens without
> the user manually going to about:memory and clicking a button?

The right combination of GC and CC will reclaim the memory.  CC then GC seems to be enough most of the time, but not all of the time.  You don't need "Minimize memory usage".

> Also, do I correctly understand that the only thing that changed here, to
> remove leaks, was replacing a 8M limit in pldhash by a 64M limit there? If
> so, do I correctly understand that a testcase with 8x bigger CC graph will
> definitely still reproduce?

Correct.
Flags: needinfo?(n.nethercote)
(In reply to Phil Ringnalda (:philor) from comment #63)
> Backed out in
> https://hg.mozilla.org/integration/mozilla-inbound/rev/2d5462474ba2 because
> we apparently hit this (mostly 'while building graph' like
> https://tbpl.mozilla.org/php/getParsedLog.php?id=29739200&tree=Mozilla-
> Inbound but one 'ScanRoots' in
> https://tbpl.mozilla.org/php/getParsedLog.php?id=29740931&tree=Mozilla-
> Inbound) 100% of the time in Win7 mochitest-2 shutdown.

Ouch, I didn't expect this to happen on TBPL, let alone reliably. I'm not sure how to interpret this. I didn't expect that the payloads we're running on test slaves would hit the now-64M limit in pldhash; and if this were a real OOM, it would most of the time give us a sigkill instead of that assertion. So does anyone have an interpretation?
bjacob: it seems to be happening intermittently on TBPL as bug 929359, starting on 10/22.
Flags: needinfo?(continuation)
> Andrew, I wonder if these should simply be MOZ_CRASH i.e. fatal also in release builds?
Yeah, maybe.  It would at least make the issue clearer, and we might get better information about how common it is.