1329988 - Roughly 10% of total CPU time in a WebGL 2 demo is spent in transposing 4x4 matrices.

Reporter

Description

•

8 years ago

Benchmarking a WebGL 2 demo from a partner, with having ANGLE disabled (to avoid performance impact from bug 1329983), and faking framebuffer completeness checks as no-ops (to avoid performance impact from bug 1329815), what remains as the hottest bottleneck is glUniformMatrix4fv, which leads to about 10% of total CPU time is spent inside libGLESv2 in TransposeMatrix<float>. Looking at the code for that function is extremely naive: (gfx\angle\src\libANGLE\renderer\d3d\ProgramD3D.cpp, line ~190 or so) template <typename T> static inline void SetIfDirty(T *dest, const T &source, bool *dirtyFlag) { ASSERT(dest != NULL); ASSERT(dirtyFlag != NULL); *dirtyFlag = *dirtyFlag || (memcmp(dest, &source, sizeof(T)) != 0); *dest = source; } template <typename T> bool TransposeMatrix(T *target, const GLfloat *value, int targetWidth, int targetHeight, int srcWidth, int srcHeight) { bool dirty = false; int copyWidth = std::min(targetHeight, srcWidth); int copyHeight = std::min(targetWidth, srcHeight); for (int x = 0; x < copyWidth; x++) { for (int y = 0; y < copyHeight; y++) { SetIfDirty(target + (x * targetWidth + y), static_cast<T>(value[y * srcWidth + x]), &dirty); } } // clear unfilled right side for (int y = 0; y < copyWidth; y++) { for (int x = copyHeight; x < targetWidth; x++) { SetIfDirty(target + (y * targetWidth + x), static_cast<T>(0), &dirty); } } // clear unfilled bottom. for (int y = copyWidth; y < targetHeight; y++) { for (int x = 0; x < targetWidth; x++) { SetIfDirty(target + (y * targetWidth + x), static_cast<T>(0), &dirty); } } return dirty; } Transposing a 4x4 matrix should take about eight hardware instructions. This is a ridiculously good example of the antipattern for how to route a specific task to such a generic function that is so unoptimized that it ends up running about a billion times slower than it needs to, and in this case, the number billion is probably not even an exaggeration.

Jukka Jylänki

Reporter

Comment 1

•

8 years ago

Attached image glUniformMatrix4fv.png — Details

Jukka Jylänki

Reporter

Comment 2

•

8 years ago

Attached image glUniformMatrix4fv_2.png — Details

Jukka Jylänki

Reporter

Comment 3

•

8 years ago

Marked down https://bugs.chromium.org/p/angleproject/issues/detail?id=1694 which is related. It looks that in upstream ANGLE code, the above code path for uploading matrices has been optimized slightly to avoid the SetIfDirty() calls in the innermost loop, but the code is still going through a too-generic-to-be-fast path. (https://chromium.googlesource.com/angle/angle/+/master/src/libANGLE/renderer/d3d/ProgramD3D.cpp#132)

Jukka Jylänki

Reporter

Comment 4

•

8 years ago

Running on OS X Mac Pro (Late 2013), 3.5 GHz 6-core Intel Xeon E5 with AMD FirePro D500, the corresponding time taken to upload matrices is only 2.8%, so this is Windows specific.

OS: Unspecified → Windows

Kelsey Gilbert [:jgilbert]

Assignee

Updated

•

8 years ago

status-firefox51: --- → wontfix

Priority: -- → P2

Whiteboard: gfx-noted

Kelsey Gilbert [:jgilbert]

Assignee

Comment 5

•

8 years ago

We can just workaround this in our code.

Kelsey Gilbert [:jgilbert]

Assignee

Updated

•

8 years ago

Assignee: nobody → jgilbert

Kelsey Gilbert [:jgilbert]

Assignee

Comment 6

•

8 years ago

Please retest with: https://treeherder.mozilla.org/#/jobs?repo=try&revision=67618658d398ae4647c8337b1b8ef31172988c64

Flags: needinfo?(jujjyl)

Jukka Jylänki

Reporter

Comment 7

•

8 years ago

Attached file try_results.txt — Details

Tried out the build, but it does not seem to improve, current FF Nightly 64-bit gets a bit better result.

Flags: needinfo?(jujjyl)

Jukka Jylänki

Reporter

Comment 8

•

8 years ago

I wonder if the builds are with the same flags necessarily (try vs Nightly?)

Kelsey Gilbert [:jgilbert]

Assignee

Comment 9

•

8 years ago

I'll just spin try builds without the patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=03fab9c985a831c4e5520d60c3dbec3a91c4b592

Kelsey Gilbert [:jgilbert]

Assignee

Comment 10

•

8 years ago

Please diff perf between the two. I would also love to additionally see the profiler stack for the with-patch version.

Flags: needinfo?(jujjyl)

Jukka Jylänki

Reporter

Comment 11

•

8 years ago

Attached file second_test_results.txt — Details

Results against the stock try build show that the patch improves FPS performance by +8.4%.

Flags: needinfo?(jujjyl)

Kelsey Gilbert [:jgilbert]

Assignee

Comment 12

•

8 years ago

(In reply to Jukka Jylänki from comment #11) > Created attachment 8826796 [details] > second_test_results.txt > > Results against the stock try build show that the patch improves FPS > performance by +8.4%. I'm sold.

Comment hidden (mozreview-request)

Jukka Jylänki

Reporter

Comment 14

•

8 years ago

Tried running a geckoprofile with the patched version, but unfortunately it only pulls .pdb files from the online symbol server, so it's not able to get native call stacks to show the internals.

Kelsey Gilbert [:jgilbert]

Assignee

Comment 15

•

8 years ago

(In reply to Jukka Jylänki from comment #14) > Tried running a geckoprofile with the patched version, but unfortunately it > only pulls .pdb files from the online symbol server, so it's not able to get > native call stacks to show the internals. You might try extracting the "firefox-53.0a1.en-US.win64.crashreporter-symbols.zip" (or -full version) to the same directory you run it from. It should pick local PDBs before checking the symbol server.

Jukka Jylänki

Reporter

Comment 16

•

8 years ago

Although, now I'm very surprised and suspicious about the benchmark results I got. Because looking at the patch, I suspect that the new added code path should never be hit(!), because it is gated on the transpose parameter being true, which in GLES2 and WebGL is always required to be false. The ANGLE D3D11 backend internals implement their own transpose, which is called when transpose = false, when transpose = true, they don't transpose because they think it's already "pre"transposed. I'm not sure why they have the transpose inverted, but they call it something with GL<->D3D differences (I'm sad if they think D3D "should" do v*M and GL "should" do M*v in shaders). See here: https://dxr.mozilla.org/mozilla-central/source/gfx/angle/src/libANGLE/renderer/d3d/ProgramD3D.cpp#2010. TransposeMatrix<> is the slow function.

Jukka Jylänki

Reporter

Comment 17

•

8 years ago

(oh nevermind, looking at my own native graphics engine, I see I do equally transpose matrices to convert from column to row major)

Jukka Jylänki

Reporter

Comment 18

•

8 years ago

Btw, regarding the patch, is it possible that an adversary could attempt to upload a huge array of 4x4 matrices causing the malloc() to fail, after which the code would crash attempting to transpose to addresses around zero? Or is there some size check already prior that effectively prevents this from happening?

Kelsey Gilbert [:jgilbert]

Assignee

Comment 19

•

8 years ago

(In reply to Jukka Jylänki from comment #18) > Btw, regarding the patch, is it possible that an adversary could attempt to > upload a huge array of 4x4 matrices causing the malloc() to fail, after > which the code would crash attempting to transpose to addresses around zero? > Or is there some size check already prior that effectively prevents this > from happening? Yes, I was lazy. Will update with OOM. (though it's not really dangerous. They'd have to pick a size between 50 and 100% of memory, since we're duplicating the size of (a subset of) the source buffer.

Kelsey Gilbert [:jgilbert]

Assignee

Comment 20

•

8 years ago

(In reply to Jeff Gilbert [:jgilbert] from comment #19) > (In reply to Jukka Jylänki from comment #18) > > Btw, regarding the patch, is it possible that an adversary could attempt to > > upload a huge array of 4x4 matrices causing the malloc() to fail, after > > which the code would crash attempting to transpose to addresses around zero? > > Or is there some size check already prior that effectively prevents this > > from happening? > > Yes, I was lazy. Will update with OOM. (though it's not really dangerous. > They'd have to pick a size between 50 and 100% of memory, since we're > duplicating the size of (a subset of) the source buffer. Actually I think we limit the size you upload based on what the shader supports, so hitting the OOM here indicates we're close to the OOM limit anyway.

Comment hidden (mozreview-request)

Kelsey Gilbert [:jgilbert]

Assignee

Comment 22

•

8 years ago

(In reply to Jukka Jylänki from comment #16) > Although, now I'm very surprised and suspicious about the benchmark results > I got. Because looking at the patch, I suspect that the new added code path > should never be hit(!), because it is gated on the transpose parameter being > true, which in GLES2 and WebGL is always required to be false. To be clear, WebGL1 does require `transpose` to be false, but WebGL2 allows `transpose: true`.

Dzmitry Malyshau [:kvark]

Comment 23

•

8 years ago

mozreview-review

Comment on attachment 8826802 [details] Bug 1329988 - Always use ANGLE's less-slow transpose:true path. - https://reviewboard.mozilla.org/r/104682/#review105412 ::: dom/canvas/WebGLContextGL.cpp:2055 (Diff revision 1) > { > return; > } > MOZ_ASSERT(!loc->mInfo->mSamplerTexList, "Should not be a sampler."); > > + //// are these comments marking the start/end of an experimental block? ::: dom/canvas/WebGLContextGL.cpp:2061 (Diff revision 1) > + > + bool uploadTranspose = transpose; > + const float* uploadBytes = elemBytes; > + > + UniqueBuffer temp; > + if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) { Given the latest info from Yukka, we should be checking for `!transpose` here, since passing `transpose=true` flag will let Angle skip the work. ::: dom/canvas/WebGLContextGL.cpp:2063 (Diff revision 1) > + const float* uploadBytes = elemBytes; > + > + UniqueBuffer temp; > + if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) { > + // ANGLE is really slow at this. > + temp = malloc(sizeof(float) * numElementsToUpload * A * B); wouldn't malloc alone be slower than a bunch of arithmetic? ::: dom/canvas/WebGLContextGL.cpp:2063 (Diff revision 1) > + const float* uploadBytes = elemBytes; > + > + UniqueBuffer temp; > + if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) { > + // ANGLE is really slow at this. > + temp = malloc(sizeof(float) * numElementsToUpload * A * B); where is it getting freed up? ::: dom/canvas/WebGLContextGL.cpp:2066 (Diff revision 1) > + if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) { > + // ANGLE is really slow at this. > + temp = malloc(sizeof(float) * numElementsToUpload * A * B); > + > + for (uint32_t i = 0; i < numElementsToUpload; ++i) { > + const auto srcMatrix = (const float*)elemBytes + i * (A * B); we need to tell the compiler that these two memory chunks don't intersect, using `__restrict`

Dzmitry Malyshau [:kvark]

Comment 24

•

8 years ago

(In reply to Jeff Gilbert [:jgilbert] from comment #22) > (In reply to Jukka Jylänki from comment #16) > > Although, now I'm very surprised and suspicious about the benchmark results > > I got. Because looking at the patch, I suspect that the new added code path > > should never be hit(!), because it is gated on the transpose parameter being > > true, which in GLES2 and WebGL is always required to be false. > > To be clear, WebGL1 does require `transpose` to be false, but WebGL2 allows > `transpose: true`. Jukka, could you confirm if the actual demo code passes `true` there for the hot spots?

Flags: needinfo?(jujjyl)

Comment hidden (mozreview-request)

Dzmitry Malyshau [:kvark]

Comment 26

•

8 years ago

Jeff, are you going to address the other issues I marked with the review? I'm not able to change its status from r? again, hope the associated bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1330672) gets resolved soon.

Flags: needinfo?(jgilbert)

Jukka Jylänki

Reporter

Comment 27

•

8 years ago

The demo passes false for transpose always, which is the scenario when ANGLE flips. Doing more heavyweight profiling with AMD CodeXL, which gives a better breakdown of native code callstacks compared to gecko profiler, CodeXL gives the estimate around 5%, though I am profiling a different build of the demo, which we spent a week optimizing, so the code paths are quite different. I've implemented a leaner version of the matrix transpose directly within ANGLE, which removes this particular hotspot from showing up in profiles altogether. See here for the commit: https://github.com/juj/gecko-dev/commit/3f400ff3b1dc64a02453e63fdfabb9340c218cc4 . I wish that could make its way directly in ANGLE, but I don't have a workflow or familiarity right now to make that a patch upstream, I've never contributed before to ANGLE itself. Not sure how it would reach us back as well the quickest, since I suppose we'd need to update ANGLE as well to latest to get it. After my above misbenchmarking, I feel a bit skeptical about using try builds to benchmark, so I've started to do local builds instead to profile, and more sample runs to get to more stable numbers.

Flags: needinfo?(jujjyl)

Dzmitry Malyshau [:kvark]

Comment 28

•

8 years ago

mozreview-review

Comment on attachment 8826802 [details] Bug 1329988 - Always use ANGLE's less-slow transpose:true path. - https://reviewboard.mozilla.org/r/104682/#review109260 the last change looks good, but there are still issues opened from the original review that need to be addressed, or at least explained

Attachment #8826802 - Flags: review?(kvark) → review-

Jukka Jylänki

Reporter

Updated

•

8 years ago

Blocks: webgl-perf-parity

Kelsey Gilbert [:jgilbert]

Assignee

Comment 29

•

8 years ago

mozreview-review-reply

Comment on attachment 8826802 [details] Bug 1329988 - Always use ANGLE's less-slow transpose:true path. - https://reviewboard.mozilla.org/r/104682/#review105412 > are these comments marking the start/end of an experimental block? No, it's just a way to visually separate code blocks. > wouldn't malloc alone be slower than a bunch of arithmetic? How would a bunch of arithmetic solve this? We can't mutate the source buffer. I'm not too worried about the cost of malloc here. > where is it getting freed up? It's an RAII-freed object. > we need to tell the compiler that these two memory chunks don't intersect, using `__restrict` We don't need to but it's not a bad idea.

Comment hidden (mozreview-request)

Dzmitry Malyshau [:kvark]

Comment 32

•

8 years ago

mozreview-review

Comment on attachment 8826802 [details] Bug 1329988 - Always use ANGLE's less-slow transpose:true path. - https://reviewboard.mozilla.org/r/104682/#review115240 love those extra fixes, ship it!

Attachment #8826802 - Flags: review?(kvark) → review+

Dzmitry Malyshau [:kvark]

Comment 33

•

8 years ago

mozreview-review-reply

Comment on attachment 8826802 [details] Bug 1329988 - Always use ANGLE's less-slow transpose:true path. - https://reviewboard.mozilla.org/r/104682/#review105412 > How would a bunch of arithmetic solve this? We can't mutate the source buffer. > I'm not too worried about the cost of malloc here. A bunch of arithmetic is what Angle was doing in the first place with "slow transpose" path.

Pulsebot

Comment 34

•

8 years ago

Pushed by jgilbert@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/20c8c1cd5a3f Always use ANGLE's less-slow transpose:true path. - r=kvark

Carsten Book [:Tomcat]

Comment 35

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/20c8c1cd5a3f

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox54: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla54

Kelsey Gilbert [:jgilbert]

Assignee

Updated

•

8 years ago

Flags: needinfo?(jgilbert)

glUniformMatrix4fv.png 8 years ago Jukka Jylänki 100.76 KB, image/png		Details
glUniformMatrix4fv_2.png 8 years ago Jukka Jylänki 71.98 KB, image/png		Details
try_results.txt 8 years ago Jukka Jylänki 7.34 KB, text/plain		Details
second_test_results.txt 8 years ago Jukka Jylänki 2.86 KB, text/plain		Details
Bug 1329988 - Always use ANGLE's less-slow transpose:true path. - 8 years ago Kelsey Gilbert [:jgilbert] 59 bytes, text/x-review-board-request	kvark : review+	Details