Open Bug 749678 Opened 8 years ago Updated 4 years ago

GLContext optimization: avoid calling slow getCurrentContext functions, instead use some TLS and use the fact that we dont share GLContexts across threads

Categories

(Core :: Graphics, defect)

defect
Not set

Tracking

()

REOPENED
Future

People

(Reporter: bzbarsky, Assigned: bjacob)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: webgl-next [k9o:p2:fx?] [games:p?] [leave open])

Attachments

(8 files, 2 obsolete files)

1.11 KB, patch
ehsan
: review+
Details | Diff | Splinter Review
8.01 KB, patch
jrmuizel
: review+
Details | Diff | Splinter Review
9.19 KB, patch
jrmuizel
: review+
Details | Diff | Splinter Review
2.21 KB, patch
BenWa
: review+
Details | Diff | Splinter Review
730 bytes, patch
jrmuizel
: review+
Details | Diff | Splinter Review
791 bytes, patch
bjacob
: review+
Details | Diff | Splinter Review
17.38 KB, patch
Details | Diff | Splinter Review
1.52 KB, patch
jgilbert
: review+
Details | Diff | Splinter Review
Benoit wrote up some webgl uniform perf tests, and the MakeContextCurrent call is being about a third of the runtime.  The profile says that mozilla::gl::GLContextCGL::MakeCurrentImpl(bool) calls +[NSOpenGLContext currentContext] which calls CGLGetParameter which goes off into pthread mutex weeds.

Benoit says it might make sense to cache something in TLS here.  Looks like other GLContext impls do that, in fact.
It's nice to know we're doing good enough that this matters.
Component: Graphics → Canvas: WebGL
QA Contact: thebes → canvas.webgl
It doesn't just matter, it's the single biggest cost center outside the actual call to uniform* on the underlying GL impl in the uniform setters.  Somewhere around 25-35% depending on the setter.
Blocks: webgl-k9o
Whiteboard: webgl-next → webgl-next [k9o:p2:fx?] [games:p2]
Assignee: nobody → bjacob
Status: NEW → ASSIGNED
Attachment #621212 - Flags: review?(jmuizelaar)
Attachment #621212 - Attachment description: Switch GLContext to use mozilla::tls for current context → 1: Switch GLContext to use mozilla::tls for current context
Attachment #621214 - Flags: review?(bgirard)
Attachment #621214 - Attachment description: 3: check owning thread in MOZ_GL_DEBUG mode → 4: check owning thread in MOZ_GL_DEBUG mode
Summary: WebGL's MakeContextCurrent is showing up in profiles noticeably → Let GLContext use mozilla::tls instead of slow getCurrentContext functions
Summary: Let GLContext use mozilla::tls instead of slow getCurrentContext functions → GLContext optimization: avoid calling slow getCurrentContext functions, instead use some TLS and use on the fact that we dont share GLContexts across threads
Summary: GLContext optimization: avoid calling slow getCurrentContext functions, instead use some TLS and use on the fact that we dont share GLContexts across threads → GLContext optimization: avoid calling slow getCurrentContext functions, instead use some TLS and use the fact that we dont share GLContexts across threads
I just did some measurements on the float uniform tests at http://hg.mozilla.org/users/bjacob_mozilla.com/webgl-perf-tests/ on three builds: 1) m-c, 2) m-c + this patch, 3) m-c plus this patch plus new bindings for WebGLContext and WebGLUniformLocation.  This was on Mac; results on Windows would also be interesting.

In any case, on the uniform-float-taking-numbers.html test, I see the following median times:

  m-c: 64ms
  m-c + this patch: 48ms
  m-c + this patch + bindings: 44ms

A profile of that last one shows that about 68.5% of that time is spent under one of the glUniform*fARB_Exec functions.  So there's no way this number could drop below 30ms or so on my machine.

On uniform-float-taking-typed-array.html, I see:

  m-c: 70ms
  m-c + this patch: 51ms
  m-c + this patch + bindings: 45ms

About 57% of the profile is under setUniform, and another 5.5% under gldModifyPipelineProgram down in the system GL impl.  So here the best we could theoretically do would be about 28ms.  I do think I know how to shave off another 3-4ms here, for what it's worth; should happen this summer.

On uniform-float-taking-js-array.html, I see:

  m-c: 62ms
  m-c + this patch: 55ms
  m-c + this patch + bindings: 43ms

(note that this test only does 40% as many calls as the other two tests).  Here the malloc traffic for the C++ array we need to allocate is killing us.  I need to figure out a better way to deal with that, especially if passing short sequences is common in general, such that using auto arrays will be OK.
Assignee: bjacob → nobody
Component: Canvas: WebGL → Graphics
QA Contact: canvas.webgl → thebes
\o/

Even bigger win on my Thinkpad W520, Linux x86-64, NVIDIA 295.40, Quadro 1000M:

On uniform-float-taking-typed-array.html:
m-c: 166 ms
m-c + this patch: 97 ms
uniform-float-taking-numbers.html:
m-c: 157 ms
m-c + this patch: 77 ms

That is a > 2x improvement!
Comment on attachment 621214 [details] [diff] [review]
4: check owning thread in MOZ_GL_DEBUG mode

I was a bit confused when we spoke and understood that you wanted to do something else.

Current AFAIK when don't bind the gl context on the known owning thread however is still technically valid. I think area where we might use it would be to check the fence status or something to do with EGLImage. I don't object to landing this bit of code but it just come with a comment saying something like this:

We don't currently use a context outside of its owner thread but doing so is perfectly valid. When we have the need to do so this assertion will have to be removed.
Attachment #621214 - Flags: review?(bgirard)
(In reply to Benoit Girard (:BenWa) from comment #11)
> Current AFAIK we don't need to bind the GLContext outside the owning thread
> however it is still technically valid. I think an example where we might need to do this would
> be to check the fence status or certain usage of EGLImage.

Fixed the terrible grammar above. I think I need sleep now ^-^.
I'm thinking about this. Of course I agree that it's legit to use a GL context on a different thread, but the new element here is that we might start doing that soon. So of course I'm OK to add such a comment; my problem is that the optimization here relies fundamentally on the assumptions at all calls on GL context are made by the owning thread.

Some possible solutions:

 - If non-owning-thread GL calls remain exceptional, GLContext could have a boolean saying whether it allows non-owning-thread GL calls. Would default to false, giving the behavior of this patch. When set to true, non-owning-threads are allowed, but then MakeCurrent will be slower. The simplest would be to have MakeCurrent always forced in that mode (no check for already-current context), if that's just going to be for a few calls. If needed, we could optimize that with a TLS lookup per MakeCurrent call to check for already-current context.

- If non-owning-thread GL calls are going to be common / the majority of cases, then the optimization in this patch to avoid a TLS lookup per MakeCurrent call can't apply anymore. In this case, we'll have to do that TLS lookup per MakeCurrent call. Will make source code a bit simpler, but have to measure impact on performance, and it's going to be different on different platforms. It's still going to be faster than calling slow getCurrentContext functions (as current m-c does), but slower than not doing any TLS per MakeCurrent call (the present patch).
I'm fine with just documenting the work that is needed. Our latest attempt to bind the context to another thread had terrible performance implication. We tried binding the GLContext to a worker thread to make TexImage2D calls asynchronous but the overhead was too high.
Another idea, if only a few specific functions require calling from non-owning-thread (e.g. for fence sync), is to have GLContext methods doing that work specifically for them, bypassing the regular MakeCurrent just for these methods. As opposed to requiring the caller to toggle a boolean before and after making these calls.
OS: Mac OS X → All
Hardware: x86 → All
Attachment #621212 - Attachment is obsolete: true
Attachment #621212 - Flags: review?(jmuizelaar)
Attachment #621450 - Flags: review?(jmuizelaar)
Attachment #621214 - Attachment is obsolete: true
(In reply to Benoit Jacob [:bjacob] from comment #17)
> Created attachment 621451 [details] [diff] [review]
> 4: check owning thread in MOZ_GL_DEBUG mode

By the way, in MakeCurrent itself, the thread is now always checked in debug builds. In BeforeGLCall it remains MOZ_GL_DEBUG only to limit the performance impact.
Attachment #621451 - Attachment description: 4: check owning thread in MOZ_GL_DEBUG mode → 4: check owning thread in debug builds
Attachment #621213 - Flags: review?(ehsan) → review+
Attachment #621451 - Flags: review?(bgirard) → review+
Comment on attachment 621215 [details] [diff] [review]
3: stop calling platform-specific getCurrentContext functions

Review of attachment 621215 [details] [diff] [review]:
-----------------------------------------------------------------

I like
Attachment #621215 - Flags: review?(jmuizelaar) → review+
Comment on attachment 621450 [details] [diff] [review]
1: Switch GLContext to use mozilla::tls for current context

Review of attachment 621450 [details] [diff] [review]:
-----------------------------------------------------------------

I think this is lovely

::: gfx/gl/GLContext.h
@@ +557,5 @@
> +
> +        mStorage = tls::get<Storage>(sTLSKey);
> +
> +        if (!mStorage) {
> +            mStorage = new Storage;

I think the refcounting here is correct, but I'm not sure. Are you?
Attachment #621450 - Flags: review?(jmuizelaar) → review+
(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)
> I think the refcounting here is correct, but I'm not sure. Are you?

In fact, there seems to be a potential bug here, although it doesn't seem triggerable at the moment.

When the last GLContext owned by a thread is destroyed, the mStorage refcount hits zero, so that Storage object is destroyed. But the TLS pointer remains, so the next time a GLContext is created, it will use that already-destroyed Storage.

The reason why this doesn't seem exploitable at the moment is that on the main thread, the global shared context stays alive until shutdown, and on other threads, AFAIU the only context is the OMTC compositor context and it also stays alive until shutdown.

Ideally I'd like to just leak this pointer instead of doing this refcounting, but I don't want to generate more valgrind noise.

What I'll do is in the Storage destructor I'll destroy the TLS storage (containing the pointer to the Storage object that is being destroyed). This way, subsequent tls::get calls will never return a pointer to that already destroyed Storage.
So, that wasn't quite a refcounting bug, rather a dangling pointer bug.
Attachment #621667 - Flags: review?(jmuizelaar)
Attachment #621667 - Flags: review?(jmuizelaar) → review+
This broke the builds on platforms where the sps profiler doesn't exist, since now thread_helper.h is required.
See http://buildbot.rhaalovely.net/builders/mozilla-central-i386/builds/169/steps/build/logs/stdio

Maybe exporting thread_helper.h outside of  MOZ_ENABLE_PROFILER_SPS in tools/profiler/Makefile.in fixes it, trying... just, *SIGH*.
I suppose bug 753119 will take care of it, but unconditionally exporting thread_helper.h fixed it for me.
We should back out the patches until bug 753119 is fixed or change the patch to use nspr.
Blocks: 753119
(In reply to Benoit Girard (:BenWa) from comment #27)
> We should back out the patches until bug 753119 is fixed or change the patch
> to use nspr.

Hell no!! If we want a quick fix, let's do as comment 26 says.

Landry, can you attach your patch to this bug?
Here you are, but the fix from 753119 also works for me.
Attachment #622774 - Flags: review?(bjacob)
Attachment #622774 - Flags: review?(bjacob) → review+
Depends on: 754040
I'm hesitant to set checkin-needed here, since it will conflict with 753119 which might be quickly fixed too. Any thoughts ?
That's all the more a reason to land it: it will be impossible to forget to take it out at the same time as bug 753119 lands. I infer from comment 30 that you dont have l3 access so I'm landing it now.
http://hg.mozilla.org/mozilla-central/rev/94d9ddb6fed8

With this commit message:

bug 749678 - inconditionally export thread_helper.h, gfx/gl requires it - r=bjacob

This is a VERY TEMPORARY fix until bug 753119 lands. The only reason why it's acceptable
is that bug 753119 will take it out very soon. Non-profiler code should not use profiler headers.
Backed out:
http://hg.mozilla.org/integration/mozilla-inbound/rev/3e0f7b9a39d7

With this commit message:

Unfortunately, in-process plugins using OpenGL break the assumption made by these patches, that the current GL context is only changed by GLContext::MakeCurrent. Another issue, regardless of in-process, is that our host-side code in nsCoreAnimationSupport.mm uses direct CGL calls, bypassing GLContext.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 753350
No longer blocks: 753119
This applies against current mozilla-central. Adapted to the thread_helper -> ThreadLocal changes and cleaned things up a bit.

Boris: can you please describe the hook that we can use to know when we are guaranteed to NOT be running any in-process plugin?
There seems to be a performance issue in the DOM bindings with vector uniform setters taking a typed array: they're apparently not faster with the patch. Here is a profile taken with the patch, on uniform-int-taking-typed-array.html :

+  40.68%  [.] 0x10092cc
+   9.28%  [.] js::UnwrapOneChecked(JSContext*, JSObject*)
+   5.53%  [.] unsigned int mozilla::dom::UnwrapObject<(mozilla::dom::prototypes::id::ID)6, mozilla::WebGLContext, mozilla::WebGLContext*>(JSContext*, JSObject*, mozilla::WebGLContext*&)
+   5.29%  [.] unsigned int mozilla::dom::UnwrapObject<(mozilla::dom::prototypes::id::ID)7, mozilla::WebGLUniformLocation, mozilla::WebGLUniformLocation*>(JSContext*, JSObject*, mozilla::WebGLUniformLocation*&)
+   3.84%  [.] mozilla::WebGLContext::Uniform3iv_base(mozilla::WebGLUniformLocation*, unsigned int, int const*)
+   3.74%  [.] js::UnwrapObjectChecked(JSContext*, JSObject*)
+   2.30%  [.] void mozilla::Maybe<mozilla::dom::TypedArray<int, int, &(JS_GetInt32ArrayData(JSObject*, JSContext*)), &(JS_GetTypedArrayLength(JSObject*, JSContext*)), &(JS_NewInt32Array(JSContext*, unsigned int))> >::construct<JSContext*, JSObject*>(JSContext* co
+   2.24%  [.] mozilla::dom::WebGLRenderingContextBinding::uniform1iv(JSContext*, unsigned int, JS::Value*)
+   2.18%  [.] mozilla::WebGLContext::Uniform4iv_base(mozilla::WebGLUniformLocation*, unsigned int, int const*)
+   1.87%  [.] JS_IsInt32Array(JSObject*, JSContext*)
+   1.85%  [.] js::shadow::Object::slotRef(unsigned long) const
+   1.80%  [.] mozilla::dom::WebGLRenderingContextBinding::uniform4iv(JSContext*, unsigned int, JS::Value*)
+   1.76%  [.] mozilla::dom::WebGLRenderingContextBinding::uniform2iv(JSContext*, unsigned int, JS::Value*)
+   1.74%  [.] mozilla::dom::WebGLRenderingContextBinding::uniform3iv(JSContext*, unsigned int, JS::Value*)
+   1.38%  [.] mozilla::WebGLUniformLocation* mozilla::dom::UnwrapDOMObject<mozilla::WebGLUniformLocation>(JSObject*)
+   1.27%  [.] JS_GetDataViewByteOffset(JSObject*, JSContext*)
+   1.23%  [.] mozilla::WebGLContext::Uniform1iv_base(mozilla::WebGLUniformLocation*, unsigned int, int const*)
+   1.19%  [.] mozilla::WebGLContext::Uniform2iv_base(mozilla::WebGLUniformLocation*, unsigned int, int const*)
+   1.05%  [.] mozilla::WebGLContext* mozilla::dom::UnwrapDOMObject<mozilla::WebGLContext>(JSObject*)
+   0.98%  [.] bool mozilla::WebGLContext::ValidateObjectAssumeNonNull<mozilla::WebGLUniformLocation>(char const*, mozilla::WebGLUniformLocation*)
+   0.93%  [.] __memmove_ssse3_back
+   0.69%  [.] JS_GetUint8ClampedArrayData(JSObject*, JSContext*)
> Boris: can you please describe the hook that we can use

I don't know.  Josh?
> they're apparently not faster with the patch

Odd.  Last I measured, those setters got faster with this patch (in its earlier incarnation), and the binding code hasn't changed...
Odd indeed; I can't see any notable difference in profiles with and without the patch, on uniform-{int,float}-taking-typed-array.html. On the other hand the patch does speed up a lot other uniform setter testcases.
WOW. So I was thinking about what could possibly explained this, and thought: "hey, this is exactly what we would get if we were forgetting to make the OpenGL context current in these uniform array setters".

And then "Oh and that could be a recent regression since we made so many changes there recently".

And, bingo, we _are_ forgetting to make the context current _exactly_ in these setters.
There you go, now it is again a 2x speed difference:

uniform-float-taking-typed-array.html:

without patch: 45 ms
with patch: 23 ms
And for the record, here is what the actual profile looks like, with this fix, WITHOUT the big patch here making makecurrent fast. See how we are spending all our time in pthread_mutex_lock and various kernel symbols.

+  31.57%  [.] 0x995c
+   8.54%  [.] pthread_mutex_lock
+   5.94%  [k] system_call
+   5.83%  [k] system_call_after_swapgs
+   5.08%  [.] js::UnwrapOneChecked(JSContext*, JSObject*)
+   4.79%  [.] pthread_mutex_unlock
+   3.96%  [.] unsigned int mozilla::dom::UnwrapObject<(mozilla::dom::prototypes::id::ID)6, mozilla::WebGLContext, mo
+   2.26%  [k] pid_nr_ns
+   2.06%  [.] js::UnwrapObjectChecked(JSContext*, JSObject*)
+   1.84%  [.] mozilla::WebGLContext::Uniform3fv_base(mozilla::WebGLUniformLocation*, unsigned int, float const*)
+   1.61%  [k] sys_getpid
+   1.39%  [.] mozilla::dom::WebGLRenderingContextBinding::uniform3fv(JSContext*, unsigned int, JS::Value*)
+   1.10%  [.] mozilla::dom::WebGLRenderingContextBinding::uniform1fv(JSContext*, unsigned int, JS::Value*)
+   1.08%  [k] sysret_check
+   1.04%  [k] pid_vnr
+   1.01%  [.] mozilla::dom::WebGLRenderingContextBinding::uniform4fv(JSContext*, unsigned int, JS::Value*)
+   1.01%  [.] mozilla::WebGLContext::Uniform4fv_base(mozilla::WebGLUniformLocation*, unsigned int, float const*)
+   0.99%  [.] mozilla::WebGLContext::Uniform1fv_base(mozilla::WebGLUniformLocation*, unsigned int, float const*)
+   0.98%  [.] js::shadow::Object::slotRef(unsigned long) const
+   0.92%  [.] glXGetCurrentContext
Comment on attachment 647246 [details] [diff] [review]
re-add missing MakeCurrent call in certain WebGL uniform setters

Review of attachment 647246 [details] [diff] [review]:
-----------------------------------------------------------------

Oops.
Attachment #647246 - Flags: review?(jgilbert) → review+
http://hg.mozilla.org/integration/mozilla-inbound/rev/91d9f8930d4e
Whiteboard: webgl-next [k9o:p2:fx?] [games:p2] → webgl-next [k9o:p2:fx?] [games:p2] [leave open]
Target Milestone: mozilla15 → Future
Comment on attachment 647246 [details] [diff] [review]
re-add missing MakeCurrent call in certain WebGL uniform setters

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 747619
User impact if declined: Ability for script to cause the compositor to use the wrong textures. Not a security bug, as the output of the compositor is not exposed to scripts. But still, quite severe
Testing completed (on m-c, etc.): m-i
Risk to taking this patch (and alternatives if risky): completely trivial fix. no risk at all.
String or UUID changes made by this patch: none
Attachment #647246 - Flags: approval-mozilla-beta?
Attachment #647246 - Flags: approval-mozilla-aurora?
Comment on attachment 647246 [details] [diff] [review]
re-add missing MakeCurrent call in certain WebGL uniform setters

low risk, approving.
Attachment #647246 - Flags: approval-mozilla-beta?
Attachment #647246 - Flags: approval-mozilla-beta+
Attachment #647246 - Flags: approval-mozilla-aurora?
Attachment #647246 - Flags: approval-mozilla-aurora+
Regarding the original topic of this bug and why we don't just land the rebased patch:

The problem (as BenWa diligently pointed out to me) is that any library that we call into might be making GL contexts current, without telling us. For example, graphics libraries (e.g. Core Graphics) and widget libraries.

So the problem that's preventing this from landing is really deep.

Incidentally, that's one thing to add to the list of what Electrolysis would give us. Especially if we went for a Chromium-like model where all our GL calls are made in a separate process, we wouldn't have to worry about this.

Failing that, it's not clear to me if there's anything that will allow to land this optimization. Testing for "we're still in the current JS callback" is not enough as the JS callback might make Canvas 2D calls that result in calls to some graphics library that uses OpenGL, etc. If we could check if we're in the current JS callback and the only DOM API calls we're making are WebGL calls, then perhaps we could start talking... I don't know the browser well enough to be sure.

Maybe we could land this at least on Windows with ANGLE, as there, WebGL is the only OpenGL (ANGLE) user. (Even if libraries/plugins used OpenGL, they wouldn't conflict with ANGLE's notion of current EGL context). But I don't know how much of a perf difference this makes on ANGLE. The dramatic perf differences measured above were on Linux and on Mac.
Whiteboard: webgl-next [k9o:p2:fx?] [games:p2] [leave open] → webgl-next [k9o:p2:fx?] [games:p?] [leave open]
You need to log in before you can comment on or make changes to this bug.