Closed
Bug 1329988
Opened 8 years ago
Closed 8 years ago
Roughly 10% of total CPU time in a WebGL 2 demo is spent in transposing 4x4 matrices.
Categories
(Core :: Graphics: CanvasWebGL, defect, P2)
Tracking
()
RESOLVED
FIXED
mozilla54
People
(Reporter: jujjyl, Assigned: jgilbert)
References
(Blocks 1 open bug)
Details
(Whiteboard: gfx-noted)
Attachments
(5 files)
Benchmarking a WebGL 2 demo from a partner, with having ANGLE disabled (to avoid performance impact from bug 1329983), and faking framebuffer completeness checks as no-ops (to avoid performance impact from bug 1329815), what remains as the hottest bottleneck is glUniformMatrix4fv, which leads to about 10% of total CPU time is spent inside libGLESv2 in TransposeMatrix<float>.
Looking at the code for that function is extremely naive: (gfx\angle\src\libANGLE\renderer\d3d\ProgramD3D.cpp, line ~190 or so)
template <typename T>
static inline void SetIfDirty(T *dest, const T &source, bool *dirtyFlag)
{
ASSERT(dest != NULL);
ASSERT(dirtyFlag != NULL);
*dirtyFlag = *dirtyFlag || (memcmp(dest, &source, sizeof(T)) != 0);
*dest = source;
}
template <typename T>
bool TransposeMatrix(T *target,
const GLfloat *value,
int targetWidth,
int targetHeight,
int srcWidth,
int srcHeight)
{
bool dirty = false;
int copyWidth = std::min(targetHeight, srcWidth);
int copyHeight = std::min(targetWidth, srcHeight);
for (int x = 0; x < copyWidth; x++)
{
for (int y = 0; y < copyHeight; y++)
{
SetIfDirty(target + (x * targetWidth + y), static_cast<T>(value[y * srcWidth + x]),
&dirty);
}
}
// clear unfilled right side
for (int y = 0; y < copyWidth; y++)
{
for (int x = copyHeight; x < targetWidth; x++)
{
SetIfDirty(target + (y * targetWidth + x), static_cast<T>(0), &dirty);
}
}
// clear unfilled bottom.
for (int y = copyWidth; y < targetHeight; y++)
{
for (int x = 0; x < targetWidth; x++)
{
SetIfDirty(target + (y * targetWidth + x), static_cast<T>(0), &dirty);
}
}
return dirty;
}
Transposing a 4x4 matrix should take about eight hardware instructions. This is a ridiculously good example of the antipattern for how to route a specific task to such a generic function that is so unoptimized that it ends up running about a billion times slower than it needs to, and in this case, the number billion is probably not even an exaggeration.
Reporter | ||
Comment 1•8 years ago
|
||
Reporter | ||
Comment 2•8 years ago
|
||
Reporter | ||
Comment 3•8 years ago
|
||
Marked down https://bugs.chromium.org/p/angleproject/issues/detail?id=1694 which is related. It looks that in upstream ANGLE code, the above code path for uploading matrices has been optimized slightly to avoid the SetIfDirty() calls in the innermost loop, but the code is still going through a too-generic-to-be-fast path. (https://chromium.googlesource.com/angle/angle/+/master/src/libANGLE/renderer/d3d/ProgramD3D.cpp#132)
Reporter | ||
Comment 4•8 years ago
|
||
Running on OS X Mac Pro (Late 2013), 3.5 GHz 6-core Intel Xeon E5 with AMD FirePro D500, the corresponding time taken to upload matrices is only 2.8%, so this is Windows specific.
OS: Unspecified → Windows
Assignee | ||
Updated•8 years ago
|
Assignee | ||
Comment 5•8 years ago
|
||
We can just workaround this in our code.
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → jgilbert
Assignee | ||
Comment 6•8 years ago
|
||
Please retest with:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=67618658d398ae4647c8337b1b8ef31172988c64
Flags: needinfo?(jujjyl)
Reporter | ||
Comment 7•8 years ago
|
||
Tried out the build, but it does not seem to improve, current FF Nightly 64-bit gets a bit better result.
Flags: needinfo?(jujjyl)
Reporter | ||
Comment 8•8 years ago
|
||
I wonder if the builds are with the same flags necessarily (try vs Nightly?)
Assignee | ||
Comment 9•8 years ago
|
||
I'll just spin try builds without the patch:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=03fab9c985a831c4e5520d60c3dbec3a91c4b592
Assignee | ||
Comment 10•8 years ago
|
||
Please diff perf between the two.
I would also love to additionally see the profiler stack for the with-patch version.
Flags: needinfo?(jujjyl)
Reporter | ||
Comment 11•8 years ago
|
||
Results against the stock try build show that the patch improves FPS performance by +8.4%.
Flags: needinfo?(jujjyl)
Assignee | ||
Comment 12•8 years ago
|
||
(In reply to Jukka Jylänki from comment #11)
> Created attachment 8826796 [details]
> second_test_results.txt
>
> Results against the stock try build show that the patch improves FPS
> performance by +8.4%.
I'm sold.
Comment hidden (mozreview-request) |
Reporter | ||
Comment 14•8 years ago
|
||
Tried running a geckoprofile with the patched version, but unfortunately it only pulls .pdb files from the online symbol server, so it's not able to get native call stacks to show the internals.
Assignee | ||
Comment 15•8 years ago
|
||
(In reply to Jukka Jylänki from comment #14)
> Tried running a geckoprofile with the patched version, but unfortunately it
> only pulls .pdb files from the online symbol server, so it's not able to get
> native call stacks to show the internals.
You might try extracting the "firefox-53.0a1.en-US.win64.crashreporter-symbols.zip" (or -full version) to the same directory you run it from. It should pick local PDBs before checking the symbol server.
Reporter | ||
Comment 16•8 years ago
|
||
Although, now I'm very surprised and suspicious about the benchmark results I got. Because looking at the patch, I suspect that the new added code path should never be hit(!), because it is gated on the transpose parameter being true, which in GLES2 and WebGL is always required to be false.
The ANGLE D3D11 backend internals implement their own transpose, which is called when transpose = false, when transpose = true, they don't transpose because they think it's already "pre"transposed. I'm not sure why they have the transpose inverted, but they call it something with GL<->D3D differences (I'm sad if they think D3D "should" do v*M and GL "should" do M*v in shaders). See here: https://dxr.mozilla.org/mozilla-central/source/gfx/angle/src/libANGLE/renderer/d3d/ProgramD3D.cpp#2010. TransposeMatrix<> is the slow function.
Reporter | ||
Comment 17•8 years ago
|
||
(oh nevermind, looking at my own native graphics engine, I see I do equally transpose matrices to convert from column to row major)
Reporter | ||
Comment 18•8 years ago
|
||
Btw, regarding the patch, is it possible that an adversary could attempt to upload a huge array of 4x4 matrices causing the malloc() to fail, after which the code would crash attempting to transpose to addresses around zero? Or is there some size check already prior that effectively prevents this from happening?
Assignee | ||
Comment 19•8 years ago
|
||
(In reply to Jukka Jylänki from comment #18)
> Btw, regarding the patch, is it possible that an adversary could attempt to
> upload a huge array of 4x4 matrices causing the malloc() to fail, after
> which the code would crash attempting to transpose to addresses around zero?
> Or is there some size check already prior that effectively prevents this
> from happening?
Yes, I was lazy. Will update with OOM. (though it's not really dangerous. They'd have to pick a size between 50 and 100% of memory, since we're duplicating the size of (a subset of) the source buffer.
Assignee | ||
Comment 20•8 years ago
|
||
(In reply to Jeff Gilbert [:jgilbert] from comment #19)
> (In reply to Jukka Jylänki from comment #18)
> > Btw, regarding the patch, is it possible that an adversary could attempt to
> > upload a huge array of 4x4 matrices causing the malloc() to fail, after
> > which the code would crash attempting to transpose to addresses around zero?
> > Or is there some size check already prior that effectively prevents this
> > from happening?
>
> Yes, I was lazy. Will update with OOM. (though it's not really dangerous.
> They'd have to pick a size between 50 and 100% of memory, since we're
> duplicating the size of (a subset of) the source buffer.
Actually I think we limit the size you upload based on what the shader supports, so hitting the OOM here indicates we're close to the OOM limit anyway.
Comment hidden (mozreview-request) |
Assignee | ||
Comment 22•8 years ago
|
||
(In reply to Jukka Jylänki from comment #16)
> Although, now I'm very surprised and suspicious about the benchmark results
> I got. Because looking at the patch, I suspect that the new added code path
> should never be hit(!), because it is gated on the transpose parameter being
> true, which in GLES2 and WebGL is always required to be false.
To be clear, WebGL1 does require `transpose` to be false, but WebGL2 allows `transpose: true`.
Comment 23•8 years ago
|
||
mozreview-review |
Comment on attachment 8826802 [details]
Bug 1329988 - Always use ANGLE's less-slow transpose:true path. -
https://reviewboard.mozilla.org/r/104682/#review105412
::: dom/canvas/WebGLContextGL.cpp:2055
(Diff revision 1)
> {
> return;
> }
> MOZ_ASSERT(!loc->mInfo->mSamplerTexList, "Should not be a sampler.");
>
> + ////
are these comments marking the start/end of an experimental block?
::: dom/canvas/WebGLContextGL.cpp:2061
(Diff revision 1)
> +
> + bool uploadTranspose = transpose;
> + const float* uploadBytes = elemBytes;
> +
> + UniqueBuffer temp;
> + if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) {
Given the latest info from Yukka, we should be checking for `!transpose` here, since passing `transpose=true` flag will let Angle skip the work.
::: dom/canvas/WebGLContextGL.cpp:2063
(Diff revision 1)
> + const float* uploadBytes = elemBytes;
> +
> + UniqueBuffer temp;
> + if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) {
> + // ANGLE is really slow at this.
> + temp = malloc(sizeof(float) * numElementsToUpload * A * B);
wouldn't malloc alone be slower than a bunch of arithmetic?
::: dom/canvas/WebGLContextGL.cpp:2063
(Diff revision 1)
> + const float* uploadBytes = elemBytes;
> +
> + UniqueBuffer temp;
> + if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) {
> + // ANGLE is really slow at this.
> + temp = malloc(sizeof(float) * numElementsToUpload * A * B);
where is it getting freed up?
::: dom/canvas/WebGLContextGL.cpp:2066
(Diff revision 1)
> + if (transpose && gl->WorkAroundDriverBugs() && gl->IsANGLE()) {
> + // ANGLE is really slow at this.
> + temp = malloc(sizeof(float) * numElementsToUpload * A * B);
> +
> + for (uint32_t i = 0; i < numElementsToUpload; ++i) {
> + const auto srcMatrix = (const float*)elemBytes + i * (A * B);
we need to tell the compiler that these two memory chunks don't intersect, using `__restrict`
Comment 24•8 years ago
|
||
(In reply to Jeff Gilbert [:jgilbert] from comment #22)
> (In reply to Jukka Jylänki from comment #16)
> > Although, now I'm very surprised and suspicious about the benchmark results
> > I got. Because looking at the patch, I suspect that the new added code path
> > should never be hit(!), because it is gated on the transpose parameter being
> > true, which in GLES2 and WebGL is always required to be false.
>
> To be clear, WebGL1 does require `transpose` to be false, but WebGL2 allows
> `transpose: true`.
Jukka, could you confirm if the actual demo code passes `true` there for the hot spots?
Flags: needinfo?(jujjyl)
Comment hidden (mozreview-request) |
Comment 26•8 years ago
|
||
Jeff, are you going to address the other issues I marked with the review?
I'm not able to change its status from r? again, hope the associated bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1330672) gets resolved soon.
Flags: needinfo?(jgilbert)
Reporter | ||
Comment 27•8 years ago
|
||
The demo passes false for transpose always, which is the scenario when ANGLE flips.
Doing more heavyweight profiling with AMD CodeXL, which gives a better breakdown of native code callstacks compared to gecko profiler, CodeXL gives the estimate around 5%, though I am profiling a different build of the demo, which we spent a week optimizing, so the code paths are quite different.
I've implemented a leaner version of the matrix transpose directly within ANGLE, which removes this particular hotspot from showing up in profiles altogether. See here for the commit: https://github.com/juj/gecko-dev/commit/3f400ff3b1dc64a02453e63fdfabb9340c218cc4 . I wish that could make its way directly in ANGLE, but I don't have a workflow or familiarity right now to make that a patch upstream, I've never contributed before to ANGLE itself. Not sure how it would reach us back as well the quickest, since I suppose we'd need to update ANGLE as well to latest to get it.
After my above misbenchmarking, I feel a bit skeptical about using try builds to benchmark, so I've started to do local builds instead to profile, and more sample runs to get to more stable numbers.
Flags: needinfo?(jujjyl)
Comment 28•8 years ago
|
||
mozreview-review |
Comment on attachment 8826802 [details]
Bug 1329988 - Always use ANGLE's less-slow transpose:true path. -
https://reviewboard.mozilla.org/r/104682/#review109260
the last change looks good, but there are still issues opened from the original review that need to be addressed, or at least explained
Attachment #8826802 -
Flags: review?(kvark) → review-
Reporter | ||
Updated•8 years ago
|
Blocks: webgl-perf-parity
Assignee | ||
Comment 29•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8826802 [details]
Bug 1329988 - Always use ANGLE's less-slow transpose:true path. -
https://reviewboard.mozilla.org/r/104682/#review105412
> are these comments marking the start/end of an experimental block?
No, it's just a way to visually separate code blocks.
> wouldn't malloc alone be slower than a bunch of arithmetic?
How would a bunch of arithmetic solve this? We can't mutate the source buffer.
I'm not too worried about the cost of malloc here.
> where is it getting freed up?
It's an RAII-freed object.
> we need to tell the compiler that these two memory chunks don't intersect, using `__restrict`
We don't need to but it's not a bad idea.
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 32•8 years ago
|
||
mozreview-review |
Comment on attachment 8826802 [details]
Bug 1329988 - Always use ANGLE's less-slow transpose:true path. -
https://reviewboard.mozilla.org/r/104682/#review115240
love those extra fixes, ship it!
Attachment #8826802 -
Flags: review?(kvark) → review+
Comment 33•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8826802 [details]
Bug 1329988 - Always use ANGLE's less-slow transpose:true path. -
https://reviewboard.mozilla.org/r/104682/#review105412
> How would a bunch of arithmetic solve this? We can't mutate the source buffer.
> I'm not too worried about the cost of malloc here.
A bunch of arithmetic is what Angle was doing in the first place with "slow transpose" path.
Comment 34•8 years ago
|
||
Pushed by jgilbert@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/20c8c1cd5a3f
Always use ANGLE's less-slow transpose:true path. - r=kvark
Comment 35•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox54:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Assignee | ||
Updated•8 years ago
|
Flags: needinfo?(jgilbert)
You need to log in
before you can comment on or make changes to this bug.
Description
•