Closed Bug 1329988 Opened 3 years ago Closed 3 years ago

Roughly 10% of total CPU time in a WebGL 2 demo is spent in transposing 4x4 matrices.

Categories

(Core :: Canvas: WebGL, defect, P2)

53 Branch
Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox51 --- wontfix
firefox53 --- affected
firefox54 --- fixed

People

(Reporter: jujjyl, Assigned: jgilbert)

References

(Blocks 1 open bug)

Details

(Whiteboard: gfx-noted)

Attachments

(5 files)

Benchmarking a WebGL 2 demo from a partner, with having ANGLE disabled (to avoid performance impact from bug 1329983), and faking framebuffer completeness checks as no-ops (to avoid performance impact from bug 1329815), what remains as the hottest bottleneck is glUniformMatrix4fv, which leads to about 10% of total CPU time is spent inside libGLESv2 in TransposeMatrix<float>.

Looking at the code for that function is extremely naive: (gfx\angle\src\libANGLE\renderer\d3d\ProgramD3D.cpp, line ~190 or so)

template <typename T>
static inline void SetIfDirty(T *dest, const T &source, bool *dirtyFlag)
{
    ASSERT(dest != NULL);
    ASSERT(dirtyFlag != NULL);

    *dirtyFlag = *dirtyFlag || (memcmp(dest, &source, sizeof(T)) != 0);
    *dest      = source;
}

template <typename T>
bool TransposeMatrix(T *target,
                     const GLfloat *value,
                     int targetWidth,
                     int targetHeight,
                     int srcWidth,
                     int srcHeight)
{
    bool dirty = false;
    int copyWidth = std::min(targetHeight, srcWidth);
    int copyHeight = std::min(targetWidth, srcHeight);

    for (int x = 0; x < copyWidth; x++)
    {
        for (int y = 0; y < copyHeight; y++)
        {
            SetIfDirty(target + (x * targetWidth + y), static_cast<T>(value[y * srcWidth + x]),
                       &dirty);
        }
    }
    // clear unfilled right side
    for (int y = 0; y < copyWidth; y++)
    {
        for (int x = copyHeight; x < targetWidth; x++)
        {
            SetIfDirty(target + (y * targetWidth + x), static_cast<T>(0), &dirty);
        }
    }
    // clear unfilled bottom.
    for (int y = copyWidth; y < targetHeight; y++)
    {
        for (int x = 0; x < targetWidth; x++)
        {
            SetIfDirty(target + (y * targetWidth + x), static_cast<T>(0), &dirty);
        }
    }

    return dirty;
}


Transposing a 4x4 matrix should take about eight hardware instructions. This is a ridiculously good example of the antipattern for how to route a specific task to such a generic function that is so unoptimized that it ends up running about a billion times slower than it needs to, and in this case, the number billion is probably not even an exaggeration.
Attached image glUniformMatrix4fv.png
Marked down https://bugs.chromium.org/p/angleproject/issues/detail?id=1694 which is related. It looks that in upstream ANGLE code, the above code path for uploading matrices has been optimized slightly to avoid the SetIfDirty() calls in the innermost loop, but the code is still going through a too-generic-to-be-fast path. (https://chromium.googlesource.com/angle/angle/+/master/src/libANGLE/renderer/d3d/ProgramD3D.cpp#132)
Running on OS X Mac Pro (Late 2013), 3.5 GHz 6-core Intel Xeon E5 with AMD FirePro D500, the corresponding time taken to upload matrices is only 2.8%, so this is Windows specific.
OS: Unspecified → Windows
Priority: -- → P2
Whiteboard: gfx-noted
We can just workaround this in our code.
Assignee: nobody → jgilbert
Attached file try_results.txt
Tried out the build, but it does not seem to improve, current FF Nightly 64-bit gets a bit better result.
Flags: needinfo?(jujjyl)
I wonder if the builds are with the same flags necessarily (try vs Nightly?)
Please diff perf between the two.

I would also love to additionally see the profiler stack for the with-patch version.
Flags: needinfo?(jujjyl)
Results against the stock try build show that the patch improves FPS performance by +8.4%.
Flags: needinfo?(jujjyl)
(In reply to Jukka Jylänki from comment #11)
> Created attachment 8826796 [details]
> second_test_results.txt
> 
> Results against the stock try build show that the patch improves FPS
> performance by +8.4%.

I'm sold.
Tried running a geckoprofile with the patched version, but unfortunately it only pulls .pdb files from the online symbol server, so it's not able to get native call stacks to show the internals.
(In reply to Jukka Jylänki from comment #14)
> Tried running a geckoprofile with the patched version, but unfortunately it
> only pulls .pdb files from the online symbol server, so it's not able to get
> native call stacks to show the internals.

You might try extracting the "firefox-53.0a1.en-US.win64.crashreporter-symbols.zip" (or -full version) to the same directory you run it from. It should pick local PDBs before checking the symbol server.
Although, now I'm very surprised and suspicious about the benchmark results I got. Because looking at the patch, I suspect that the new added code path should never be hit(!), because it is gated on the transpose parameter being true, which in GLES2 and WebGL is always required to be false.

The ANGLE D3D11 backend internals implement their own transpose, which is called when transpose = false, when transpose = true, they don't transpose because they think it's already "pre"transposed. I'm not sure why they have the transpose inverted, but they call it something with GL<->D3D differences (I'm sad if they think D3D "should" do v*M and GL "should" do M*v in shaders). See here: https://dxr.mozilla.org/mozilla-central/source/gfx/angle/src/libANGLE/renderer/d3d/ProgramD3D.cpp#2010. TransposeMatrix<> is the slow function.
(oh nevermind, looking at my own native graphics engine, I see I do equally transpose matrices to convert from column to row major)
Btw, regarding the patch, is it possible that an adversary could attempt to upload a huge array of 4x4 matrices causing the malloc() to fail, after which the code would crash attempting to transpose to addresses around zero? Or is there some size check already prior that effectively prevents this from happening?
(In reply to Jukka Jylänki from comment #18)
> Btw, regarding the patch, is it possible that an adversary could attempt to
> upload a huge array of 4x4 matrices causing the malloc() to fail, after
> which the code would crash attempting to transpose to addresses around zero?
> Or is there some size check already prior that effectively prevents this
> from happening?

Yes, I was lazy. Will update with OOM. (though it's not really dangerous. They'd have to pick a size between 50 and 100% of memory, since we're duplicating the size of (a subset of) the source buffer.
(In reply to Jeff Gilbert [:jgilbert] from comment #19)
> (In reply to Jukka Jylänki from comment #18)
> > Btw, regarding the patch, is it possible that an adversary could attempt to
> > upload a huge array of 4x4 matrices causing the malloc() to fail, after
> > which the code would crash attempting to transpose to addresses around zero?
> > Or is there some size check already prior that effectively prevents this
> > from happening?
> 
> Yes, I was lazy. Will update with OOM. (though it's not really dangerous.
> They'd have to pick a size between 50 and 100% of memory, since we're
> duplicating the size of (a subset of) the source buffer.

Actually I think we limit the size you upload based on what the shader supports, so hitting the OOM here indicates we're close to the OOM limit anyway.
(In reply to Jukka Jylänki from comment #16)
> Although, now I'm very surprised and suspicious about the benchmark results
> I got. Because looking at the patch, I suspect that the new added code path
> should never be hit(!), because it is gated on the transpose parameter being
> true, which in GLES2 and WebGL is always required to be false.

To be clear, WebGL1 does require `transpose` to be false, but WebGL2 allows `transpose: true`.
Comment on attachment 8826802 [details]
Bug 1329988 - Always use ANGLE's less-slow transpose:true path. -

https://reviewboard.mozilla.org/r/104682/#review105412

::: dom/canvas/WebGLContextGL.cpp:2055
(Diff revision 1)
>      {
>          return;
>      }
>      MOZ_ASSERT(!loc->mInfo->mSamplerTexList, "Should not be a sampler.");
>  
> +    ////

are these comments marking the start/end of an experimental block?

::: dom/canvas/WebGLContextGL.cpp:2061
(Diff revision 1)
> +
> +    bool uploadTranspose = transpose;
> +    const float* uploadBytes = elemBytes;
> +
> +    UniqueBuffer temp;
> +    if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) {

Given the latest info from Yukka, we should be checking for `!transpose` here, since passing `transpose=true` flag will let Angle skip the work.

::: dom/canvas/WebGLContextGL.cpp:2063
(Diff revision 1)
> +    const float* uploadBytes = elemBytes;
> +
> +    UniqueBuffer temp;
> +    if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) {
> +        // ANGLE is really slow at this.
> +        temp = malloc(sizeof(float) * numElementsToUpload * A * B);

wouldn't malloc alone be slower than a bunch of arithmetic?

::: dom/canvas/WebGLContextGL.cpp:2063
(Diff revision 1)
> +    const float* uploadBytes = elemBytes;
> +
> +    UniqueBuffer temp;
> +    if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) {
> +        // ANGLE is really slow at this.
> +        temp = malloc(sizeof(float) * numElementsToUpload * A * B);

where is it getting freed up?

::: dom/canvas/WebGLContextGL.cpp:2066
(Diff revision 1)
> +    if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) {
> +        // ANGLE is really slow at this.
> +        temp = malloc(sizeof(float) * numElementsToUpload * A * B);
> +
> +        for (uint32_t i = 0; i < numElementsToUpload; ++i) {
> +            const auto srcMatrix = (const float*)elemBytes + i * (A * B);

we need to tell the compiler that these two memory chunks don't intersect, using `__restrict`
(In reply to Jeff Gilbert [:jgilbert] from comment #22)
> (In reply to Jukka Jylänki from comment #16)
> > Although, now I'm very surprised and suspicious about the benchmark results
> > I got. Because looking at the patch, I suspect that the new added code path
> > should never be hit(!), because it is gated on the transpose parameter being
> > true, which in GLES2 and WebGL is always required to be false.
> 
> To be clear, WebGL1 does require `transpose` to be false, but WebGL2 allows
> `transpose: true`.

Jukka, could you confirm if the actual demo code passes `true` there for the hot spots?
Flags: needinfo?(jujjyl)
Jeff, are you going to address the other issues I marked with the review?
I'm not able to change its status from r? again, hope the associated bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1330672) gets resolved soon.
Flags: needinfo?(jgilbert)
The demo passes false for transpose always, which is the scenario when ANGLE flips.

Doing more heavyweight profiling with AMD CodeXL, which gives a better breakdown of native code callstacks compared to gecko profiler, CodeXL gives the estimate around 5%, though I am profiling a different build of the demo, which we spent a week optimizing, so the code paths are quite different.

I've implemented a leaner version of the matrix transpose directly within ANGLE, which removes this particular hotspot from showing up in profiles altogether. See here for the commit: https://github.com/juj/gecko-dev/commit/3f400ff3b1dc64a02453e63fdfabb9340c218cc4 . I wish that could make its way directly in ANGLE, but I don't have a workflow or familiarity right now to make that a patch upstream, I've never contributed before to ANGLE itself. Not sure how it would reach us back as well the quickest, since I suppose we'd need to update ANGLE as well to latest to get it.

After my above misbenchmarking, I feel a bit skeptical about using try builds to benchmark, so I've started to do local builds instead to profile, and more sample runs to get to more stable numbers.
Flags: needinfo?(jujjyl)
Comment on attachment 8826802 [details]
Bug 1329988 - Always use ANGLE's less-slow transpose:true path. -

https://reviewboard.mozilla.org/r/104682/#review109260

the last change looks good, but there are still issues opened from the original review that need to be addressed, or at least explained
Attachment #8826802 - Flags: review?(kvark) → review-
Comment on attachment 8826802 [details]
Bug 1329988 - Always use ANGLE's less-slow transpose:true path. -

https://reviewboard.mozilla.org/r/104682/#review105412

> are these comments marking the start/end of an experimental block?

No, it's just a way to visually separate code blocks.

> wouldn't malloc alone be slower than a bunch of arithmetic?

How would a bunch of arithmetic solve this? We can't mutate the source buffer.
I'm not too worried about the cost of malloc here.

> where is it getting freed up?

It's an RAII-freed object.

> we need to tell the compiler that these two memory chunks don't intersect, using `__restrict`

We don't need to but it's not a bad idea.
Comment on attachment 8826802 [details]
Bug 1329988 - Always use ANGLE's less-slow transpose:true path. -

https://reviewboard.mozilla.org/r/104682/#review115240

love those extra fixes, ship it!
Attachment #8826802 - Flags: review?(kvark) → review+
Comment on attachment 8826802 [details]
Bug 1329988 - Always use ANGLE's less-slow transpose:true path. -

https://reviewboard.mozilla.org/r/104682/#review105412

> How would a bunch of arithmetic solve this? We can't mutate the source buffer.
> I'm not too worried about the cost of malloc here.

A bunch of arithmetic is what Angle was doing in the first place with "slow transpose" path.
Pushed by jgilbert@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/20c8c1cd5a3f
Always use ANGLE's less-slow transpose:true path. - r=kvark
https://hg.mozilla.org/mozilla-central/rev/20c8c1cd5a3f
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Flags: needinfo?(jgilbert)
You need to log in before you can comment on or make changes to this bug.