Closed Bug 522193 Opened 15 years ago Closed 13 years ago

WebGL calls are slow because of memory allocation churning

Categories

(Core :: Graphics: CanvasWebGL, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla10

People

(Reporter: 00003b, Unassigned)

References

()

Details

Attachments

(1 file, 2 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20090910 Ubuntu/9.04 (jaunty) Shiretoko/3.5.3
Build Identifier: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.3a1pre) Gecko/20091013 Minefield/3.7a1pre

In test app at http://3bb.cc/projects/webgl/test1/ ~60% of cpu time is spent making GL calls according to firebug profiler at higher model counts, and framerate is lower than expected.



Reproducible: Always

Steps to Reproduce:
1.view specified page in browser with WebGL enabled
2.increase number of instances by picking a number from the select box
Actual Results:  
On a core i7 920, nvidia gtx285, ubuntu 9.04 x86-64, nvidia drivers 190.18.04
I get about 10-13 fps with 100 models (10 on select box)
According to firebug, ~45% of time is spent in the closure that binds texture and buffer objects, and 20% in the closure that updates the bone uniforms (that also does a bunch of matrix multiplies, but splitting it still shows the single call to uniformMatrix4fv as taking more time than the matrix multiply loop)



Expected Results:  
Hard to say exactly how fast it should be running, but rough tests outside the browser and reports from webkit users suggests there is room for at least 2x improvement.

This particular test case could avoid most of those GL calls, since it actually only has 1 model, but that optimization was skipped to try to test the worst case performance where there were a number of different models.
Component: Canvas: 2D → Canvas: WebGL
Interesting... on N900 it works much faster than on T61p with Nvidia card...
None of WebGL is currently quickstubbed, so it's taking the slow boat between JS and native code.  That'll be fixed soon.
Gonna call this WORKSFORME -- currently I get about 45fps with 100 models, though the models themselves are drawing weird/brokenly (not sure if there's something that needs to get updated here).  Some more profiling would be interesting, though.
Status: UNCONFIRMED → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
With 100 models, I get 40 FPS in Firefox Nightly (10.0a1) and 60 FPS in Chrome 14.
A profile showed, as usual, a bit of time spent in XPC overhead for WebGL calls; with the patches on bug 638328 applied, I get perhaps a tiny performance improvement (41 FPS instead of 40) and the profile is now as follows:

 * 22% in methodjit-generated code
 * 12% in libnvidia-glcore.so
 * 10% in libGL.so
 * 1.6% in js_CloneFunctionObject, called by js::mjit::stubs::Lambda, called from jitted code. Should I file a JS engine bug? CC'ing Andreas.
 * lots of time in realloc and malloc, some of which with stacks like

   - realloc                                                                     
      - 79.11% moz_xrealloc
         - 99.53% nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity
            - 38.10% WebGLZeroingObject::AddRefOwner(WebGLObjectBaseRefPtr*)
               - 92.16% WebGLObjectRefPtr<WebGLBuffer>::assign_with_AddRef
                   - WebGLContext::VertexAttribPointer

   and similar stacks pointing to BindBuffer and BindTexture. I need to investigate this.

 * also a lot of time in mutex locking, some of which with stacks like
   - pthread_mutex_lock
      - 8.74% nsTArray_base<nsTArrayDefaultAllocator>::ShrinkCapacity
         - 85.70% WebGLObjectRefPtr<WebGLBuffer>::assign_with_AddRef
            - 90.11% WebGLContext::VertexAttribPointer

   so we really need to look into these WebGL functions.

 * 1.1% in castNativeFromWrapper and 0.6% in nsID::Equals. This is much less XPC overhead than without the patches in bug 638328.
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: WORKSFORME → ---
So, the above-mentioned overhead goes away with this change replacing refptrs by plain pointers. This brings us to 50 FPS, up from 40 FPS, i.e. covers half the distance toward chrome's speed.

The question is how to turn this finding into a patch that we could actually land?
It looks like the work is all in manipulating the list of refs in WebGLZeroingObject -- try stubbing just those functions and keeping everything else the same, and see how it behaves?
Indeed, this patch instead of the previous one gives the same performance increase.
The benchmark has WebGLBuffers oscillating between 0, 1, and 2 owners. The ShrinkCapacity calls from above profile come from call stacks like below. When the number of owners goes down to 0, ShrinkCapacity throws away the whole mHdr so the next time it grows, it has to reallocate stuff from scratch.

#0  nsTArray_base<nsTArrayDefaultAllocator>::ShrinkCapacity (this=0x7fffd68f58d8, elemSize=8)
    at ../../dist/include/nsTArray-inl.h:143
#1  0x00007ffff583571e in RemoveElementsAt (count=1, start=<optimized out>, this=0x7fffd68f58d8)
    at ../../../dist/include/nsTArray.h:895
#2  RemoveElementAt (index=<optimized out>, this=0x7fffd68f58d8)
    at ../../../dist/include/nsTArray.h:900
#3  RemoveElement<mozilla::WebGLObjectBaseRefPtr*, nsDefaultComparator<mozilla::WebGLObjectBaseRefPtr*, mozilla::WebGLObjectBaseRefPtr*> > (item=<synthetic pointer>, this=0x7fffd68f58d8, 
    comp=<optimized out>) at ../../../dist/include/nsTArray.h:919
#4  RemoveElement<mozilla::WebGLObjectBaseRefPtr*> (item=<synthetic pointer>, 
    this=0x7fffd68f58d8) at ../../../dist/include/nsTArray.h:927
#5  mozilla::WebGLZeroingObject::RemoveRefOwner (this=0x7fffd68f58d8, owner=<optimized out>)
    at /home/bjacob/mozilla-central/content/canvas/src/WebGLContext.h:710
#6  0x00007ffff58493c5 in assign_assuming_AddRef (newPtr=0x7fffd75d1970, this=0x7fffd68c4cc0)
    at /home/bjacob/mozilla-central/content/canvas/src/WebGLContext.h:237
#7  mozilla::WebGLObjectRefPtr<mozilla::WebGLBuffer>::assign_with_AddRef (this=0x7fffd68c4cc0, 
    rawPtr=0x7fffd75d1970) at /home/bjacob/mozilla-central/content/canvas/src/WebGLContext.h:230
#8  0x00007ffff5849630 in operator= (rhs=<optimized out>, this=<optimized out>)
    at /home/bjacob/mozilla-central/content/canvas/src/WebGLContext.h:192
#9  mozilla::WebGLContext::BindBuffer (this=0x7fffd68c4c00, target=34963, bobj=<optimized out>)
    at /home/bjacob/mozilla-central/content/canvas/src/WebGLContextGL.cpp:231
Yup, this gives me as high performance as I've seen with any of the other patches, so let's go for this.
Attachment #564071 - Attachment is obsolete: true
Attachment #564083 - Attachment is obsolete: true
Attachment #564098 - Flags: review?
Wouldn't it be a better fix to just grow and not shrink the array so aggressively?
(In reply to Andreas Gal :gal from comment #11)
> Wouldn't it be a better fix to just grow and not shrink the array so
> aggressively?

Looking at my logs, I'd say yes, but do we have any Array class readily able to do that?

While that behavior would be nice to have, it's not worth making a new array class just to have that. The amount of memory that I preallocate here is not big compared to what these objects use anyway (that was the point of having this template parameter so we could adjust the preallocated size for each type of webgl object).
We could quite easily make nsTArray parameterizable with options to control the shrinking policy. But we probably don't need to here.
@ Andreas, while you're here, please have a look at the js_CloneFunctionObject showing up in the profile in comment 5.
Comment on attachment 564098 [details] [diff] [review]
use a nsAutoTArray with some preallocated capacity

Review of attachment 564098 [details] [diff] [review]:
-----------------------------------------------------------------

Who were you looking for review from?

I'm not really a peer, but the patch is trivial so who cares ...
Attachment #564098 - Flags: review? → review+
Note that between this patch and the patches from bug 638328, we improved from 50 FPS to 60 FPS on WebGL Aquarium with 1000 fish!
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #15)
> Who were you looking for review from?

Oops, I thought I had entered :vlad, but apparently I failed. Thanks.
Sure, happy to look. What does the JS code look like around the call site? Can you paste here.
(In reply to Andreas Gal :gal from comment #18)
> Sure, happy to look. What does the JS code look like around the call site?
> Can you paste here.

I broke in js_CloneFunctionObject and called DumpJSStack, and got alternatively



0 anonymous(cache = undefined, stat = [function], time = 0, animNumber = 0) ["http://3bb.cc/projects/webgl/test1/model.js":120]
    getKeyR = [function]
    getKeyP = [function]
    parent3568 = 5
    r = [object Float32Array]
    p = [object Float32Array]
    anim = [object Object]
    i = 5
    akl = 46
    a = [object Object]
    this = [object Object]
1 draw(0) ["http://3bb.cc/projects/webgl/test1/foo.js":147]
    _js_idx3650 = undefined
    _js_arrvar3651 = undefined
    m = undefined
    animNodes = null
    xx = [object Float32Array]
    pos3649 = 0,-60,0
    instance = [object Object]
    stat = [function]
    x = 0
    count = 1
    drawCount = 1
    r = [object Float32Array]
    d2 = [object Float32Array]
    d = [object Float32Array]
    p2 = [object Float32Array]
    la = [object Float32Array]
    time = 1317604694.539
    this = [object Window]



*** and ***


0 anonymous(cache = undefined, stat = [function], time = 0, animNumber = 0) ["http://3bb.cc/projects/webgl/test1/model.js":107]
    getKeyR = [function]
    getKeyP = [function]
    parent3568 = 2
    r = [object Float32Array]
    p = [object Float32Array]
    anim = [object Object]
    i = 6
    akl = 46
    a = [object Object]
    this = [object Object]
1 draw(0) ["http://3bb.cc/projects/webgl/test1/foo.js":147]
    _js_idx3650 = undefined
    _js_arrvar3651 = undefined
    m = undefined
    animNodes = null
    xx = [object Float32Array]
    pos3649 = 0,-60,0
    instance = [object Object]
    stat = [function]
    x = 0
    count = 1
    drawCount = 1
    r = [object Float32Array]
    d2 = [object Float32Array]
    d = [object Float32Array]
    p2 = [object Float32Array]
    la = [object Float32Array]
    time = 1317604694.539
    this = [object Window]


The code in http://3bb.cc/projects/webgl/test1/model.js, around lines 107 and 120, is (line markers added, search for ### ):



var I = _mat4i();
var Temp = _mat4i();
function makeOldAnimator(obj, nodes, anims, lastAnim) {
    var pI = new Float32Array([0.0, 0.0, 0.0,0.0]);
    var rI = new Float32Array([0.0, 1.0, 0.0, 0.0, 0.0]);
    var pt = new Float32Array([0.0, 0.0, 0.0, 0.0]);
    var rt = new Float32Array([0.0, 1.0, 0.0, 0.0, 0.0]);
    var mx = Temp;//_mat4i;//new Float32Array(16);
    return function (animNumber, time, stat, cache) {
        var a = anims[animNumber];
        if (a.length)
            time = time % a.length;
        else
            time = 0;
        var akl = a.keys.length;
        for (var i = 0; i < akl; i += 1) {
            var anim = (a.keys)[i];
            var p = anim.position;
            var r = anim.rotation;
            var parent3568 = (obj.nodes)[anim.index].parent;
            var getKeyP = function (keys, time,dd) {            // ### line 107
                var k = keys[0];
                for (var ii = 0; ii < keys.length; ii += 1) {
                    var ki = keys[ii];
                    if (ki[0] < time) {
                        k = ki;
                    } else {
                        var dt = (time - k[0]) / (ki[0] - k[0]);
                        return lerppInto(k, ki, dt,dd);
                    };
                };
                return k;
            };
            var getKeyR = function (keys, time,d) {             // ### line 120
                var k = keys[0];
                for (var ii = 0; ii < keys.length; ii += 1) {
                    var ki = keys[ii];
                    if (ki[0] < time) {
                        k = ki;
                    } else {
                        var dt = (time - k[0]) / (ki[0] - k[0]);
                        return lerprInto(k, ki, dt,d);
                    };
                };
                return k;
            };
            animsT++;
            if (p) {
                if (p.length == 1) {
                    animsP1++;
                    p = p[0];
                } else {
                    p = getKeyP(p, time,pt);
                };
            } else {
                animsPi++;
                p = pI;
            };
            if (r) {
                if (r.length == 1) {
                    animsR1++;
                    r = r[0];
                } else {
                    r = getKeyR(r, time,rt);
                };
                _translationxquatInto(p[1], p[2], p[3], r[1], r[2], r[3], r[4], mx);
            } else {
                animsRi++;
                //r = rI;
                mx[ 0] = 1.0;
                mx[ 1] = 0.0;
                mx[ 2] = 0.0;
                //mx[ 3] = 0.0;
                mx[ 4] = 0.0;
                mx[ 5] = 1.0;
                mx[ 6] = 0.0;
                //mx[ 7] = 0.0;
                mx[ 8] = 0.0;
                mx[ 9] = 0.0;
                mx[10] = 1.0;
                //mx[11] = 0.0;
                mx[12] = p[1];
                mx[13] = p[2];
                mx[14] = p[3];
                //mx[15] = 1.0;
            };
            if (parent3568 == 0) animsXi++;
            //var pm = parent3568 > 0 ? (obj.animNodes)[parent3568] : I;
            //_matx3(pm, mx, (obj.animNodes)[anim.index]);

            if (parent3568 > 0)
                _matx3((obj.animNodes)[parent3568], mx,
                       (obj.animNodes)[anim.index]);
            else
                _mcopy(mx, (obj.animNodes)[anim.index]);
        };
    };
};
Yeah, the closure in the function escapes and it closes over different objects every time. Not much we can do there. How much time do we spend in the JS engine after your optimization?
The profile now looks like:

    25.64%     firefox  perf-1813.map                [.] 0x7f24ad30b585  
    14.72%     firefox  libnvidia-glcore.so.275.28   [.] 0x12e8bc8       
    11.17%     firefox  libGL.so.275.28              [.] 0x9bd77         
     1.61%     firefox  libxul.so                    [.] js_CloneFunctionObject
     1.25%     firefox  libxul.so                    [.] castNativeFromWrapper
     0.99%     firefox  libpthread-2.13.so           [.] pthread_self
     0.80%     firefox  libxul.so                    [.] nsID::Equals(nsID const&) const
     0.79%     firefox  [kernel.kallsyms]            [k] system_call_after_swapgs
     0.68%     firefox  libxul.so                    [.] js::gc::FinalizeArenas
     0.67%     firefox  [usb_storage]                [.] 0x5e5a2108      
     0.65%     firefox  libpthread-2.13.so           [.] pthread_mutex_lock
     0.65%     firefox  libc-2.13.so                 [.] __memcpy_ssse3_back
     0.56%     firefox  firefox                      [.] arena_malloc
     0.51%     firefox  libxul.so                    [.] js_GCThingIsMarked
     0.49%     firefox  libxul.so                    [.] js_GetClassPrototype
     0.47%     firefox  libxul.so                    [.] JSObject::isWrapper
     0.47%     firefox  libxul.so                    [.] XPCWrappedNative::GetWrappedNativeOfJSObject
     0.47%     firefox  libxul.so                    [.] castNative
     0.46%     firefox  libxul.so                    [.] mozilla::WebGLContext::ValidateBuffers

The top one, perf-1813.map, is methodjit code (many of the call stacks have EnterMethodJIT)

The two next ones are the OpenGL libraries.
Summary: WebGL: GL calls are slow → WebGL calls are slow because of memory allocation churning
https://hg.mozilla.org/mozilla-central/rev/704f37801611
Status: REOPENED → RESOLVED
Closed: 14 years ago13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla10
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: