Closed Bug 656824 Opened 13 years ago Closed 13 years ago

Use ARB_robustness and EGL_CONTEXT_LOST to detect driver resets in WebGL

Categories

(Core :: Graphics: CanvasWebGL, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla11

People

(Reporter: bjacob, Assigned: drs)

References

(Blocks 1 open bug, )

Details

Attachments

(5 files, 9 obsolete files)

16.70 KB, patch
Details | Diff | Splinter Review
111.48 KB, patch
drs
: review+
Details | Diff | Splinter Review
21.65 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
992 bytes, patch
bjacob
: review+
Details | Diff | Splinter Review
792 bytes, patch
Details | Diff | Splinter Review
If Security eventually decides that the DOS issues with WebGL are too severe, we need to be ready to respond quickly. To that effect, two patches need to be written:
  1. use ARB_robustness when it's available
  2. block WebGL when ARB_robustness is not available.

Of course patch 1 would be desirable in any case. Hopefully we won't need to use patch 2, but we need to have it ready just in case.
Assignee: nobody → bjacob
Can we combine the two? A pref setting "webgl.require.ARB_robustness" that when true does your 2) and when false uses it when available. Then if we need it it's much easier to turn on than keeping a patch in our back pocket. It can even be turned on by worried users without needing an update.
Our build system can even handle #ifdefs to generate different defaults on different platforms if we decide to go that way.
Sorry for the delay, I know a bit more now.

Relevant email:
https://www.khronos.org/webgl/public-mailing-list/archives/1105/msg00069.html

When ARB_robustness is available, that means that the graphics driver is able to reset when needed (e.g. long-running shader) --- we then get that part for free. However, the difficult thing is then to gracefully recover from the reset. The key problem with ARB_robustness is that *all* GL contexts are typically reset i.e. lost. The ARB_robustness only provides us a way of knowing when GL contexts are lost. It's then up to us to act upon that.

For WebGL it's not too hard to act upon that: just mark the WebGL context as lost. It's then up to the script to handle that situation.

For GL Layers on the other hand, it seems a lot harder to act upon that. Will file a bug about that.

For D3D context, there's no problem, Bas' code already handles resets.

Notice that ARB_robustness per se is only for desktop OpenGL; but OpenGL ES has a similar concept built-in ("lost context"). This is implemented in ANGLE, and actually is easier to use than ARB_robustness.

So on Windows, where we use ANGLE for WebGL and D3D Layers, the problem is the easiest to solve, so we should do that first. Filing a bug.
Summary: Use ARB_robustness in WebGL → Use ARB_robustness and GLESin WebGL
Summary: Use ARB_robustness and GLESin WebGL → Use ARB_robustness and EGL_CONTEXT_LOST to detect driver resets in WebGL
(In reply to comment #1)
> Can we combine the two? A pref setting "webgl.require.ARB_robustness" that
> when true does your 2) and when false uses it when available. Then if we
> need it it's much easier to turn on than keeping a patch in our back pocket.
> It can even be turned on by worried users without needing an update.

Yes, we can do that. GLES (in particular ANGLE) would count as ARB_robustness for this purpose.
Depends on: 660070
Depends on: 660072
Hm, I thought I attached this somewhere, but can't find it.  This is my old work-in-progress ARB_robustness-using patch, to start creating robust contexts (on WGL) and at least try to detect the reset status (without actually notifying the app).
Version: unspecified → Trunk
Assignee: bjacob → dsherk
Let's start with GLX.

The problem is that webgl pages with long-running draw calls can freeze the GPU for too long. Recent drivers have watchdog mechanism to reset the GPU and all GL context after a certain timeout. The goal of using ARB_robustness here is to detect these resets to be able to act accordingly.

Here's an example of a page exposing this issue:
https://cvs.khronos.org/svn/repos/registry/trunk/public/webgl/sdk/tests/extra/lots-of-polys-example.html


Here's what needs to be done:

part 1: GLContextProviderGLX changes

In gfx/thebes/GLContextProviderGLX.cpp, when we actually create a GL context,

http://hg.mozilla.org/mozilla-central/file/817c2b9dc11d/gfx/thebes/GLContextProviderGLX.cpp#l666

Instead of using glXCreateNewContext we must use glXCreateContextAttribsARB, and we must pass to it an attributes array where the GLX_CONTEXT_RESET_NOTIFICATION_STRATEGY_ARB attribute is specified as GLX_LOSE_CONTEXT_ON_RESET_ARB.

Notice that we don't currently load the glXCreateContextAttribsARB symbol. You'll have to edit this file to load this symbol too. See how it's done for other functions. Load the function pointer as glXCreateContextAttribsARBInternal and then implement glXCreateContextAttribsARB calling glXCreateContextAttribsARBInternal and adding the debug stuff, as for other functions.


part 2: GLContext changes

We will also need to load the glGetGraphicsResetStatusARB symbol, which we'll need in part 3. The declaration goes into GLContextSymbols.h because it's a OpenGL, not a GLX, symbol. The actual loading must be done in GLContext.cpp, in InitWithPrefix(). At the end of the big block there, there are symbols that are loaded conditionally. You'll want to load this one only if the GL_ARB_robustness extension is available.

So you'll need to add GL_ARB_robustness to the list of extensions we recognize and check for. See how it's done for other extensions, it's all in GLContext.h and GLContext.cpp. Notice that IsExtensionSupported() has 2 variants: a slow one taking a string, and a fast one taking an enum value and looking it up in a table of extensions checked for at GL context initialization. You want to add GL_ARB_robustness to this list.


part 3: WebGL changes

We don't automatically get notified when the context is lost, instead we have to query the status ourselves to check if a context loss occured. For now, let's say that the right place to do this check is right after draw-operations. So in WebGLContextGL.cpp, in the implementation of DrawArrays and DrawElements, right after the actual Draw call, we must call glGetGraphicsResetStatusARB.

Unfortunately, what comes next depends on proper support for "context lost" and "context restored" events, which we don't have yet, so for now you could simply printf the result of glGetGraphicsResetStatusARB. Once lost/restored events are implemented, if we get to this point then the rest will be easy. I'll work on this ASAP so we can complete this work.

Useful resources:
 * Vlad's patch above for WGL
 * WebKit/Chromium code:
      WebGL side: http://trac.webkit.org/browser/trunk/Source/WebCore/html/canvas/WebGLRenderingContext.cpp
      GLX side: http://src.chromium.org/viewvc/chrome/trunk/src/ui/gfx/gl/gl_context_glx.cc?view=markup

Notice that you need at least NVIDIA driver version 265. It's worth using a driver as recent as possible, even beta, as there are known issues and NVIDIA is fixing them. One known issue is that the context reset timeout can be as long as 2 minutes so you'll have to be patient.
Mostly complete patch for ARB_robustness, implemented on Linux and partially for OSX.

This patch implements ARB_robustness and allows scripts to handle driver resets through events. The changes in this patch are very sweeping; they hit almost every NS_IMETHODIMP function within WebGLContextGL.cpp and WebGLContext.cpp. More work must be done on this to support EGL_CONTEXT_LOST. Note that this patch depends on OES_standard_derivatives (not directly, but through some extension changes that were introduced with this patch). OES_standard_derivatives can be found in bug 684853.
Attachment #563915 - Flags: review?(bjacob)
Also note that this patch includes WEBKIT_lose_context.
Depends on: 684853
Fixed an issue preventing this from compiling on Windows. I think there were two references to "NO_ERROR". GCC just overrode it, while Windows complained but didn't say that it was being redefined.
Attachment #563915 - Attachment is obsolete: true
Attachment #563915 - Flags: review?(bjacob)
Attachment #564773 - Flags: review?(bjacob)
Not all conformance tests were passing on Windows. This was a problem on other OS's but was being masked by random things going on behind the scenes (places in memory happening to be 0 when they were never set to it). Also fixed "Mozilla Corporation" -> "Mozilla Foundation".
Attachment #564773 - Attachment is obsolete: true
Attachment #564773 - Flags: review?(bjacob)
Attachment #565422 - Flags: review?(bjacob)
Blocks: 660070
No longer depends on: 660070
Comment on attachment 565422 [details] [diff] [review]
Patch v1.2, ARB_robustness implementation.

Review of attachment 565422 [details] [diff] [review]:
-----------------------------------------------------------------

This patch looks really, really great overall. I learned stuff. Still r- for nits, the most important ones being:
 * unwanted stuff in ANGLE gl2ext.h
 * unwanted stuff in GLDefs.h
 * WEBKIT_lose_context not at home in GLContext
 * mIsRobust not a good name

::: content/canvas/src/WebGLContext.h
@@ +371,3 @@
>      nsresult SynthesizeGLError(WebGLenum err);
>      nsresult SynthesizeGLError(WebGLenum err, const char *fmt, ...);
> +    nsresult SynthesizeGLErrorNoContext(WebGLenum err);

No need for this function, just assign to mWebGLError

@@ +424,5 @@
>      }
>  
> +    // Sets up the GL_ARB_robustness timer if it isn't already, so that if the
> +    // driver gets restarted, the context may get reset with it.
> +    void MakeRobustCall() {

SetupRobustnessTimer would be a better name

@@ +709,5 @@
> +    enum ContextResetARB {
> +        LOCAL_NO_ERROR = 0,
> +        LOCAL_GUILTY_CONTEXT_RESET_ARB = 0x8253,
> +        LOCAL_INNOCENT_CONTEXT_RESET_ARB = 0x8254,
> +        LOCAL_UNKNOWN_CONTEXT_RESET_ARB = 0x8255,

Make these #defines, in gfx/thebes/GLContext.h, since this is general OpenGL stuff

::: content/canvas/src/WebGLContextUtils.cpp
@@ +154,4 @@
>  }
>  
>  nsresult
> +WebGLContext::SynthesizeGLErrorNoContext(WebGLenum err)

As said above: no need for this.

::: content/canvas/src/WebGLExtensionLoseContext.cpp
@@ +19,5 @@
> + * Portions created by the Initial Developer are Copyright (C) 2009
> + * the Initial Developer. All Rights Reserved.
> + *
> + * Contributor(s):
> + *   Vladimir Vukicevic <vladimir@pobox.com> (original author)

Looking at this code, this is really entirely new, so no need to attribute to Vlad. Add yourself.

The same is probably true for WebGLStandardDerivatives.cpp. Only the header, WebGLExtensions.h, seems to be largely deriving from Vlad's code.

::: content/canvas/src/WebGLExtensions.h
@@ +60,5 @@
> +    NS_DECLARE_STATIC_IID_ACCESSOR(WEBGLEXTENSIONLOSECONTEXT_PRIVATE_IID)
> +};
> +
> +NS_DEFINE_STATIC_IID_ACCESSOR(WebGLExtensionLoseContext, WEBGLACTIVEINFO_PRIVATE_IID)
> +

By the way have a look at Bug 693595: do for WEBKIT_lose_context the same as it's doing for OES_standard_derivatives. (For the latter, we'll take Ted's patch)

::: gfx/angle/include/GLES2/gl2ext.h
@@ +1152,5 @@
> +#endif // GL_INNOCENT_CONTEXT_RESET_ARB
> +#ifndef GL_UNKNOWN_CONTEXT_RESET_ARB
> +#define GL_UNKNOWN_CONTEXT_RESET_ARB 0x8255
> +#endif // GL_UNKNOWN_CONTEXT_RESET_ARB
> +#endif // GL_ARB_robustness

Whoa why is that needed? AFAIK we never include this header. Moreover, ANGLE implements OpenGL ES 2, and ARB_robustness doesn't exist in OpenGL _ES_ 2.

::: gfx/thebes/GLContext.cpp
@@ +174,4 @@
>          return PR_TRUE;
>      }
>  
> +    mIsRobust = IsExtensionSupported(ARB_robustness);

please rename mIsRobust to mHasRobustness or some such.

@@ +443,5 @@
>      "GL_ARB_texture_float",
>      "GL_EXT_unpack_subimage",
>      "GL_OES_standard_derivatives",
> +    "GL_ARB_robustness",
> +    "WEBKIT_lose_context",

WEBKIT_lost_context is purely a WebGL extension with no existence at the level of OpenGL. So it doesn't belong here.

::: gfx/thebes/GLContext.h
@@ +982,5 @@
>          ARB_texture_float,
>          EXT_unpack_subimage,
>          OES_standard_derivatives,
> +        ARB_robustness,
> +        WEBKIT_lose_context,

Same: WEBKIT_lose_context doesn't belong here, it's WebGL-specific.

@@ +1025,4 @@
>      bool mIsOffscreen;
>      bool mIsGLES2;
>      bool mIsGlobalSharedContext;
> +    bool mIsRobust;

please rename, see GLContext.cpp

::: gfx/thebes/GLContextProviderGLX.cpp
@@ +258,5 @@
> +            // exists, so it's best to just try to load it and accept that it
> +            // might fail.
> +            //NS_WARNING("Couldn't load ARB_robustness symbols");
> +        } else {
> +            mHasRobustness = true;

Oh you did call that mHasRobustness, not mIsRobust. good. then I hope you'll agree to use the mHasRobustness terminology everywhere.

@@ +654,5 @@
> +{
> +    BEFORE_GLX_CALL;
> +    GLXContext result = xCreateContextAttribsInternal(display, 
> +                                                      config, 
> +	                                                  share_list, 

indentation problem

::: gfx/thebes/GLDefs.h
@@ +3052,5 @@
> +#define LOCAL_GL_CONTEXT_FLAG_ROBUST_ACCESS_BIT_ARB         0x00000004
> +
> +#define LOCAL_GL_CONTEXT_RESET_NOTIFICATION_STRATEGY_ARB    0x8256
> +#define LOCAL_GL_NO_RESET_NOTIFICATION_ARB                  0x8261
> +#define LOCAL_GL_LOSE_CONTEXT_ON_RESET_ARB                  0x8252

OK

@@ +3082,5 @@
> +#define LOCAL_GL_CONTEXT_ROBUST_ACCESS_BIT_ARB              0x00000004
> +
> +#define LOCAL_WGL_CONTEXT_CORE_PROFILE_BIT_ARB              0x00000001
> +#define LOCAL_WGL_CONTEXT_COMPATIBILITY_PROFILE_BIT_ARB     0x00000002
> +#define LOCAL_WGL_CONTEXT_ROBUST_ACCESS_BIT_ARB             0x00000004

What's that doing here, is that unwanted copy & paste?
Attachment #565422 - Flags: review?(bjacob) → review-
Fixed code review comments.
Attachment #565422 - Attachment is obsolete: true
Attachment #567886 - Flags: review?(bjacob)
I made a mistake in my timer implementation that I fixed in this rev. The previous implementation worked fine (or at least it did according to my tests) but was not optimal.
Attachment #567886 - Attachment is obsolete: true
Attachment #567886 - Flags: review?(bjacob)
Attachment #568218 - Flags: review?(bjacob)
Blocks: 660072
No longer depends on: 660072
Attachment #568218 - Flags: review?(bjacob) → review+
Unbitrotted, r+ carried.
Attachment #568218 - Attachment is obsolete: true
Attachment #569117 - Flags: review+
Try run for 952a1d080311 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=952a1d080311
Results (out of 7 total builds):
    success: 3
    warnings: 4
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dsherk@mozilla.com-952a1d080311
Try run for ce0048897836 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=ce0048897836
Results (out of 8 total builds):
    success: 7
    failure: 1
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dsherk@mozilla.com-ce0048897836
Tiny change to fix a conformance issue, +r carried.
Attachment #569117 - Attachment is obsolete: true
Attachment #569476 - Flags: review+
http://hg.mozilla.org/integration/mozilla-inbound/rev/4c2d39588760

Don't close this bug yet! More patches are coming.
I wasn't setting the context to current before querying its GetGraphicsResetStatus(), which was causing an error on debug builds and potentially giving incorrect information. This patch fixes this.

Try build:
https://tbpl.mozilla.org/?tree=Try&rev=e90135cb2d6c
Attachment #570455 - Flags: review?(bjacob)
Attachment #570455 - Flags: review?(bjacob) → review+
Depends on: 699522
This patch should improve performance of the robustness timer by not resetting
it each time a draw operation happens. It still checks if there's any activity
and, if not, it will stop firing it. It includes a single extra timer firing
after activity dies to make sure we don't miss anything.
Attachment #570455 - Attachment is obsolete: true
Attachment #572176 - Flags: review?(bjacob)
There was a mistake in the original robustness implementation which caused it to incorrectly detect when a machine had robustness. I don't know why try didn't catch it because the results are pretty bad, but this patch fixes it. Other than that, it's pretty much the same as the v1.0 of this patch.
Attachment #572176 - Attachment is obsolete: true
Attachment #572176 - Flags: review?(bjacob)
Attachment #572253 - Flags: review?(bjacob)
Comment on attachment 572253 [details] [diff] [review]
Patch v1.1, fix timer, robustness detection, and set current context.

Review of attachment 572253 [details] [diff] [review]:
-----------------------------------------------------------------

r- for minor details and requesting explanations.

Another comment: looking back at draw-functions in WebGLContextGL.cpp I can see that SetupRobustnessTimer is called far away from the GL draw function (like fDrawArrays). I think it should happen right before the draw-call.

::: content/canvas/src/WebGLContext.cpp
@@ +300,5 @@
>      WebGLMemoryReporter::AddWebGLContext(this);
>  
>      mContextLost = false;
>      mAllowRestore = false;
> +    mTimerRunning = false;

Can you choose a name that makes it clear that this timer is related to robustness? mRobustnessTimerRunning?

@@ +301,5 @@
>  
>      mContextLost = false;
>      mAllowRestore = false;
> +    mTimerRunning = false;
> +    mActivitySinceTimerSet = false;

Same. Also, 'Activity' is relatively vague. If 'Activity' here means draw-calls, maybe say 'Drew' or 'Drawing' or some such.

::: content/canvas/src/WebGLContext.h
@@ +453,4 @@
>              return;
>  
> +        if (mTimerRunning) {
> +            mActivitySinceTimerSet = true;

I think a big explanatory comment is needed here.

It took me a while to remember that this SetupRobustness function is what's called by draw-calls.

Explain that this mechanism is an optimization to avoid actually resetting the timer on every draw-call, while mActivitySinceTimeSet allows ensuring that we don't get fooled by draw-calls happening after we've set up the timer.

::: gfx/thebes/GLContext.cpp
@@ +363,5 @@
>  
>      mInitialized = LoadSymbols(&symbols[0], trygl, prefix);
>  
> +    // We can't use ARB_robustness directly because the extensions array hasn't
> +    // been initialized yet.

Then I think it would be worth moving this until after the extensions array is initialized. Parsing the extensions string is costly enough.
Attachment #572253 - Flags: review?(bjacob) → review-
Addressed code review comments. Also fixed a bug that seems to have just shown up recently. For some reason, GL_ARB_robustness disappeared from the extensions array (probably because of some change I made recently) so I switched it to GLX_ARB_context_create_robust, which makes more sense anyways and is more in context with the rest of the code around it.
Attachment #572253 - Attachment is obsolete: true
Attachment #572719 - Flags: review?(bjacob)
Comment on attachment 572719 [details] [diff] [review]
Patch v1.2, fix timer, robustness detection, and set current context.

Review of attachment 572719 [details] [diff] [review]:
-----------------------------------------------------------------

Yay, that's all pretty brilliant (famous last words, I know). I mean, the mDrawSinceTimerSet stuff to avoid actually doing timer work on each draw call. Hope there won't be bugs there, but I can't see any just by looking at the code.
Attachment #572719 - Flags: review?(bjacob) → review+
Attachment #573433 - Flags: review?(bjacob) → review+
Depends on: 703487
https://hg.mozilla.org/mozilla-central/rev/8384c4295ffd
https://hg.mozilla.org/mozilla-central/rev/47743c433776
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla11
Comment on attachment 569476 [details] [diff] [review]
Patch v1.5, ARB_robustness implementation.

>+    enum ContextResetARB {
>+        CONTEXT_NO_ERROR = 0,
>+        CONTEXT_GUILTY_CONTEXT_RESET_ARB = 0x8253,
>+        CONTEXT_INNOCENT_CONTEXT_RESET_ARB = 0x8254,
>+        CONTEXT_UNKNOWN_CONTEXT_RESET_ARB = 0x8255,
>+    };
My gcc doesn't like this trailing comma.
(In reply to neil@parkwaycc.co.uk from comment #33)
> My gcc doesn't like this trailing comma.

Your compiler does not complain about the trailing semicolon in
gfx/gl/GLContext.h:81?

     78   namespace layers {
     79     class LayerManagerOGL;
     80     class ColorTextureLayerProgram;
     81   };
Stefan: That's a completely different and unrelated situation.  There's no "trailing semicolon" there -- "class ColorTextureLayerProgram" would simply be an invalid declaration without a semicolon at the end.

In contrast, enum lists were explicitly prohibited from having a comma after the final value in C89 (but it's allowed in later versions of C - hence it being a build warning rather than an error).
(In reply to Daniel Holbert [:dholbert] from comment #38)
> There's no
> "trailing semicolon" there -- "class ColorTextureLayerProgram" would simply
> be an invalid declaration without a semicolon at the end.

Daniel: I don't mean the semicolon in line 80 but that of line 81 after the namespace decl's closing brace.
http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html#569

gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291] is complaining since the source files moved into a new directory:

https://hg.mozilla.org/mozilla-central/rev/335e8e75eedc
Bug 703516 - Move GLContext code into a separate folder. r=jrmuizel
(In reply to Stefan from comment #39)
> Daniel: I don't mean the semicolon in line 80 but that of line 81 after the
> namespace decl's closing brace.

Ah, that makes more sense. :)  Sorry for misunderstanding you. In any case -- GCC 4.6 doesn't complain about that for me, but it sounds like you're saying earlier versions do.

(In reply to Stefan from comment #39)
> gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291] is complaining since
> the source files moved into a new directory:
> Bug 703516 - Move GLContext code into a separate folder. r=jrmuizel

Then we probably shouldn't resolve this (or discuss it further) as part of *this* bug here -- could you file a new bug on it, or note it on bug 703516 (assuming that's the bug responsible for triggering the gcc warning)?
Filed
Bug 706610 - remove trailing semicolon after namespace decl
Blocks: 707861
Blocks: 707860
You need to log in before you can comment on or make changes to this bug.