Last Comment Bug 522193 - WebGL calls are slow because of memory allocation churning
: WebGL calls are slow because of memory allocation churning
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Canvas: WebGL (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla10
Assigned To: Nobody; OK to take it and work on it
:
: Milan Sreckovic [:milan]
Mentors:
http://3bb.cc/projects/webgl/test1/
Depends on:
Blocks: 638321
  Show dependency treegraph
 
Reported: 2009-10-13 22:25 PDT by 00003b
Modified: 2011-10-03 16:55 PDT (History)
11 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
use plain pointers instead of WebGLObjectRefPtr (illegal, but fast) (1.79 KB, patch)
2011-10-02 12:27 PDT, Benoit Jacob [:bjacob] (mostly away)
no flags Details | Diff | Splinter Review
comment out WebGLZeroingObject AddRefOwner/RemoveRefOwner (879 bytes, patch)
2011-10-02 14:12 PDT, Benoit Jacob [:bjacob] (mostly away)
no flags Details | Diff | Splinter Review
use a nsAutoTArray with some preallocated capacity (8.54 KB, patch)
2011-10-02 17:58 PDT, Benoit Jacob [:bjacob] (mostly away)
roc: review+
Details | Diff | Splinter Review

Description 00003b 2009-10-13 22:25:37 PDT
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20090910 Ubuntu/9.04 (jaunty) Shiretoko/3.5.3
Build Identifier: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.3a1pre) Gecko/20091013 Minefield/3.7a1pre

In test app at http://3bb.cc/projects/webgl/test1/ ~60% of cpu time is spent making GL calls according to firebug profiler at higher model counts, and framerate is lower than expected.



Reproducible: Always

Steps to Reproduce:
1.view specified page in browser with WebGL enabled
2.increase number of instances by picking a number from the select box
Actual Results:  
On a core i7 920, nvidia gtx285, ubuntu 9.04 x86-64, nvidia drivers 190.18.04
I get about 10-13 fps with 100 models (10 on select box)
According to firebug, ~45% of time is spent in the closure that binds texture and buffer objects, and 20% in the closure that updates the bone uniforms (that also does a bunch of matrix multiplies, but splitting it still shows the single call to uniformMatrix4fv as taking more time than the matrix multiply loop)



Expected Results:  
Hard to say exactly how fast it should be running, but rough tests outside the browser and reports from webkit users suggests there is room for at least 2x improvement.

This particular test case could avoid most of those GL calls, since it actually only has 1 model, but that optimization was skipped to try to test the worst case performance where there were a number of different models.
Comment 1 Oleg Romashin (:romaxa) 2010-01-11 23:38:11 PST
Interesting... on N900 it works much faster than on T61p with Nvidia card...
Comment 2 Vladimir Vukicevic [:vlad] [:vladv] 2010-01-11 23:57:06 PST
None of WebGL is currently quickstubbed, so it's taking the slow boat between JS and native code.  That'll be fixed soon.
Comment 3 Vladimir Vukicevic [:vlad] [:vladv] 2010-05-28 13:12:19 PDT
Gonna call this WORKSFORME -- currently I get about 45fps with 100 models, though the models themselves are drawing weird/brokenly (not sure if there's something that needs to get updated here).  Some more profiling would be interesting, though.
Comment 4 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 10:35:20 PDT
With 100 models, I get 40 FPS in Firefox Nightly (10.0a1) and 60 FPS in Chrome 14.
Comment 5 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 11:11:06 PDT
A profile showed, as usual, a bit of time spent in XPC overhead for WebGL calls; with the patches on bug 638328 applied, I get perhaps a tiny performance improvement (41 FPS instead of 40) and the profile is now as follows:

 * 22% in methodjit-generated code
 * 12% in libnvidia-glcore.so
 * 10% in libGL.so
 * 1.6% in js_CloneFunctionObject, called by js::mjit::stubs::Lambda, called from jitted code. Should I file a JS engine bug? CC'ing Andreas.
 * lots of time in realloc and malloc, some of which with stacks like

   - realloc                                                                     
      - 79.11% moz_xrealloc
         - 99.53% nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity
            - 38.10% WebGLZeroingObject::AddRefOwner(WebGLObjectBaseRefPtr*)
               - 92.16% WebGLObjectRefPtr<WebGLBuffer>::assign_with_AddRef
                   - WebGLContext::VertexAttribPointer

   and similar stacks pointing to BindBuffer and BindTexture. I need to investigate this.

 * also a lot of time in mutex locking, some of which with stacks like
   - pthread_mutex_lock
      - 8.74% nsTArray_base<nsTArrayDefaultAllocator>::ShrinkCapacity
         - 85.70% WebGLObjectRefPtr<WebGLBuffer>::assign_with_AddRef
            - 90.11% WebGLContext::VertexAttribPointer

   so we really need to look into these WebGL functions.

 * 1.1% in castNativeFromWrapper and 0.6% in nsID::Equals. This is much less XPC overhead than without the patches in bug 638328.
Comment 6 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 12:27:04 PDT
Created attachment 564071 [details] [diff] [review]
use plain pointers instead of WebGLObjectRefPtr (illegal, but fast)

So, the above-mentioned overhead goes away with this change replacing refptrs by plain pointers. This brings us to 50 FPS, up from 40 FPS, i.e. covers half the distance toward chrome's speed.

The question is how to turn this finding into a patch that we could actually land?
Comment 7 Vladimir Vukicevic [:vlad] [:vladv] 2011-10-02 12:32:49 PDT
It looks like the work is all in manipulating the list of refs in WebGLZeroingObject -- try stubbing just those functions and keeping everything else the same, and see how it behaves?
Comment 8 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 14:12:18 PDT
Created attachment 564083 [details] [diff] [review]
comment out WebGLZeroingObject AddRefOwner/RemoveRefOwner

Indeed, this patch instead of the previous one gives the same performance increase.
Comment 9 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 15:55:03 PDT
The benchmark has WebGLBuffers oscillating between 0, 1, and 2 owners. The ShrinkCapacity calls from above profile come from call stacks like below. When the number of owners goes down to 0, ShrinkCapacity throws away the whole mHdr so the next time it grows, it has to reallocate stuff from scratch.

#0  nsTArray_base<nsTArrayDefaultAllocator>::ShrinkCapacity (this=0x7fffd68f58d8, elemSize=8)
    at ../../dist/include/nsTArray-inl.h:143
#1  0x00007ffff583571e in RemoveElementsAt (count=1, start=<optimized out>, this=0x7fffd68f58d8)
    at ../../../dist/include/nsTArray.h:895
#2  RemoveElementAt (index=<optimized out>, this=0x7fffd68f58d8)
    at ../../../dist/include/nsTArray.h:900
#3  RemoveElement<mozilla::WebGLObjectBaseRefPtr*, nsDefaultComparator<mozilla::WebGLObjectBaseRefPtr*, mozilla::WebGLObjectBaseRefPtr*> > (item=<synthetic pointer>, this=0x7fffd68f58d8, 
    comp=<optimized out>) at ../../../dist/include/nsTArray.h:919
#4  RemoveElement<mozilla::WebGLObjectBaseRefPtr*> (item=<synthetic pointer>, 
    this=0x7fffd68f58d8) at ../../../dist/include/nsTArray.h:927
#5  mozilla::WebGLZeroingObject::RemoveRefOwner (this=0x7fffd68f58d8, owner=<optimized out>)
    at /home/bjacob/mozilla-central/content/canvas/src/WebGLContext.h:710
#6  0x00007ffff58493c5 in assign_assuming_AddRef (newPtr=0x7fffd75d1970, this=0x7fffd68c4cc0)
    at /home/bjacob/mozilla-central/content/canvas/src/WebGLContext.h:237
#7  mozilla::WebGLObjectRefPtr<mozilla::WebGLBuffer>::assign_with_AddRef (this=0x7fffd68c4cc0, 
    rawPtr=0x7fffd75d1970) at /home/bjacob/mozilla-central/content/canvas/src/WebGLContext.h:230
#8  0x00007ffff5849630 in operator= (rhs=<optimized out>, this=<optimized out>)
    at /home/bjacob/mozilla-central/content/canvas/src/WebGLContext.h:192
#9  mozilla::WebGLContext::BindBuffer (this=0x7fffd68c4c00, target=34963, bobj=<optimized out>)
    at /home/bjacob/mozilla-central/content/canvas/src/WebGLContextGL.cpp:231
Comment 10 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 17:58:07 PDT
Created attachment 564098 [details] [diff] [review]
use a nsAutoTArray with some preallocated capacity

Yup, this gives me as high performance as I've seen with any of the other patches, so let's go for this.
Comment 11 Andreas Gal :gal 2011-10-02 18:00:16 PDT
Wouldn't it be a better fix to just grow and not shrink the array so aggressively?
Comment 12 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 18:08:02 PDT
(In reply to Andreas Gal :gal from comment #11)
> Wouldn't it be a better fix to just grow and not shrink the array so
> aggressively?

Looking at my logs, I'd say yes, but do we have any Array class readily able to do that?

While that behavior would be nice to have, it's not worth making a new array class just to have that. The amount of memory that I preallocate here is not big compared to what these objects use anyway (that was the point of having this template parameter so we could adjust the preallocated size for each type of webgl object).
Comment 13 Robert O'Callahan (:roc) (email my personal email if necessary) 2011-10-02 18:09:30 PDT
We could quite easily make nsTArray parameterizable with options to control the shrinking policy. But we probably don't need to here.
Comment 14 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 18:09:50 PDT
@ Andreas, while you're here, please have a look at the js_CloneFunctionObject showing up in the profile in comment 5.
Comment 15 Robert O'Callahan (:roc) (email my personal email if necessary) 2011-10-02 18:11:06 PDT
Comment on attachment 564098 [details] [diff] [review]
use a nsAutoTArray with some preallocated capacity

Review of attachment 564098 [details] [diff] [review]:
-----------------------------------------------------------------

Who were you looking for review from?

I'm not really a peer, but the patch is trivial so who cares ...
Comment 16 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 18:11:32 PDT
Note that between this patch and the patches from bug 638328, we improved from 50 FPS to 60 FPS on WebGL Aquarium with 1000 fish!
Comment 17 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 18:12:18 PDT
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #15)
> Who were you looking for review from?

Oops, I thought I had entered :vlad, but apparently I failed. Thanks.
Comment 18 Andreas Gal :gal 2011-10-02 18:13:36 PDT
Sure, happy to look. What does the JS code look like around the call site? Can you paste here.
Comment 19 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 18:23:51 PDT
(In reply to Andreas Gal :gal from comment #18)
> Sure, happy to look. What does the JS code look like around the call site?
> Can you paste here.

I broke in js_CloneFunctionObject and called DumpJSStack, and got alternatively



0 anonymous(cache = undefined, stat = [function], time = 0, animNumber = 0) ["http://3bb.cc/projects/webgl/test1/model.js":120]
    getKeyR = [function]
    getKeyP = [function]
    parent3568 = 5
    r = [object Float32Array]
    p = [object Float32Array]
    anim = [object Object]
    i = 5
    akl = 46
    a = [object Object]
    this = [object Object]
1 draw(0) ["http://3bb.cc/projects/webgl/test1/foo.js":147]
    _js_idx3650 = undefined
    _js_arrvar3651 = undefined
    m = undefined
    animNodes = null
    xx = [object Float32Array]
    pos3649 = 0,-60,0
    instance = [object Object]
    stat = [function]
    x = 0
    count = 1
    drawCount = 1
    r = [object Float32Array]
    d2 = [object Float32Array]
    d = [object Float32Array]
    p2 = [object Float32Array]
    la = [object Float32Array]
    time = 1317604694.539
    this = [object Window]



*** and ***


0 anonymous(cache = undefined, stat = [function], time = 0, animNumber = 0) ["http://3bb.cc/projects/webgl/test1/model.js":107]
    getKeyR = [function]
    getKeyP = [function]
    parent3568 = 2
    r = [object Float32Array]
    p = [object Float32Array]
    anim = [object Object]
    i = 6
    akl = 46
    a = [object Object]
    this = [object Object]
1 draw(0) ["http://3bb.cc/projects/webgl/test1/foo.js":147]
    _js_idx3650 = undefined
    _js_arrvar3651 = undefined
    m = undefined
    animNodes = null
    xx = [object Float32Array]
    pos3649 = 0,-60,0
    instance = [object Object]
    stat = [function]
    x = 0
    count = 1
    drawCount = 1
    r = [object Float32Array]
    d2 = [object Float32Array]
    d = [object Float32Array]
    p2 = [object Float32Array]
    la = [object Float32Array]
    time = 1317604694.539
    this = [object Window]


The code in http://3bb.cc/projects/webgl/test1/model.js, around lines 107 and 120, is (line markers added, search for ### ):



var I = _mat4i();
var Temp = _mat4i();
function makeOldAnimator(obj, nodes, anims, lastAnim) {
    var pI = new Float32Array([0.0, 0.0, 0.0,0.0]);
    var rI = new Float32Array([0.0, 1.0, 0.0, 0.0, 0.0]);
    var pt = new Float32Array([0.0, 0.0, 0.0, 0.0]);
    var rt = new Float32Array([0.0, 1.0, 0.0, 0.0, 0.0]);
    var mx = Temp;//_mat4i;//new Float32Array(16);
    return function (animNumber, time, stat, cache) {
        var a = anims[animNumber];
        if (a.length)
            time = time % a.length;
        else
            time = 0;
        var akl = a.keys.length;
        for (var i = 0; i < akl; i += 1) {
            var anim = (a.keys)[i];
            var p = anim.position;
            var r = anim.rotation;
            var parent3568 = (obj.nodes)[anim.index].parent;
            var getKeyP = function (keys, time,dd) {            // ### line 107
                var k = keys[0];
                for (var ii = 0; ii < keys.length; ii += 1) {
                    var ki = keys[ii];
                    if (ki[0] < time) {
                        k = ki;
                    } else {
                        var dt = (time - k[0]) / (ki[0] - k[0]);
                        return lerppInto(k, ki, dt,dd);
                    };
                };
                return k;
            };
            var getKeyR = function (keys, time,d) {             // ### line 120
                var k = keys[0];
                for (var ii = 0; ii < keys.length; ii += 1) {
                    var ki = keys[ii];
                    if (ki[0] < time) {
                        k = ki;
                    } else {
                        var dt = (time - k[0]) / (ki[0] - k[0]);
                        return lerprInto(k, ki, dt,d);
                    };
                };
                return k;
            };
            animsT++;
            if (p) {
                if (p.length == 1) {
                    animsP1++;
                    p = p[0];
                } else {
                    p = getKeyP(p, time,pt);
                };
            } else {
                animsPi++;
                p = pI;
            };
            if (r) {
                if (r.length == 1) {
                    animsR1++;
                    r = r[0];
                } else {
                    r = getKeyR(r, time,rt);
                };
                _translationxquatInto(p[1], p[2], p[3], r[1], r[2], r[3], r[4], mx);
            } else {
                animsRi++;
                //r = rI;
                mx[ 0] = 1.0;
                mx[ 1] = 0.0;
                mx[ 2] = 0.0;
                //mx[ 3] = 0.0;
                mx[ 4] = 0.0;
                mx[ 5] = 1.0;
                mx[ 6] = 0.0;
                //mx[ 7] = 0.0;
                mx[ 8] = 0.0;
                mx[ 9] = 0.0;
                mx[10] = 1.0;
                //mx[11] = 0.0;
                mx[12] = p[1];
                mx[13] = p[2];
                mx[14] = p[3];
                //mx[15] = 1.0;
            };
            if (parent3568 == 0) animsXi++;
            //var pm = parent3568 > 0 ? (obj.animNodes)[parent3568] : I;
            //_matx3(pm, mx, (obj.animNodes)[anim.index]);

            if (parent3568 > 0)
                _matx3((obj.animNodes)[parent3568], mx,
                       (obj.animNodes)[anim.index]);
            else
                _mcopy(mx, (obj.animNodes)[anim.index]);
        };
    };
};
Comment 20 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 18:25:02 PDT
On tryserver: https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=2790be438dc3
Comment 21 Andreas Gal :gal 2011-10-02 18:30:36 PDT
Yeah, the closure in the function escapes and it closes over different objects every time. Not much we can do there. How much time do we spend in the JS engine after your optimization?
Comment 22 Benoit Jacob [:bjacob] (mostly away) 2011-10-02 18:38:31 PDT
The profile now looks like:

    25.64%     firefox  perf-1813.map                [.] 0x7f24ad30b585  
    14.72%     firefox  libnvidia-glcore.so.275.28   [.] 0x12e8bc8       
    11.17%     firefox  libGL.so.275.28              [.] 0x9bd77         
     1.61%     firefox  libxul.so                    [.] js_CloneFunctionObject
     1.25%     firefox  libxul.so                    [.] castNativeFromWrapper
     0.99%     firefox  libpthread-2.13.so           [.] pthread_self
     0.80%     firefox  libxul.so                    [.] nsID::Equals(nsID const&) const
     0.79%     firefox  [kernel.kallsyms]            [k] system_call_after_swapgs
     0.68%     firefox  libxul.so                    [.] js::gc::FinalizeArenas
     0.67%     firefox  [usb_storage]                [.] 0x5e5a2108      
     0.65%     firefox  libpthread-2.13.so           [.] pthread_mutex_lock
     0.65%     firefox  libc-2.13.so                 [.] __memcpy_ssse3_back
     0.56%     firefox  firefox                      [.] arena_malloc
     0.51%     firefox  libxul.so                    [.] js_GCThingIsMarked
     0.49%     firefox  libxul.so                    [.] js_GetClassPrototype
     0.47%     firefox  libxul.so                    [.] JSObject::isWrapper
     0.47%     firefox  libxul.so                    [.] XPCWrappedNative::GetWrappedNativeOfJSObject
     0.47%     firefox  libxul.so                    [.] castNative
     0.46%     firefox  libxul.so                    [.] mozilla::WebGLContext::ValidateBuffers

The top one, perf-1813.map, is methodjit code (many of the call stacks have EnterMethodJIT)

The two next ones are the OpenGL libraries.
Comment 23 Benoit Jacob [:bjacob] (mostly away) 2011-10-03 05:43:02 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/704f37801611
Comment 24 Matt Brubeck (:mbrubeck) 2011-10-03 16:55:38 PDT
https://hg.mozilla.org/mozilla-central/rev/704f37801611

Note You need to log in before you can comment on or make changes to this bug.