Closed Bug 1259811 Opened 4 years ago Closed 4 years ago

crash in @0x0 | mozilla::gl::GLContext::InitWithPrefix

Categories

(Core :: Canvas: WebGL, defect, critical)

x86
All
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla48
Tracking Status
firefox46 + verified
firefox47 + verified
firefox48 + verified
firefox-esr45 --- affected

People

(Reporter: milan, Assigned: jgilbert)

References

Details

(Keywords: crash, Whiteboard: [gfx-noted])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-92720f09-989c-4ae2-b7d7-481b52160325.
=============================================================

Crashing when initializing the context.  The number of crashes would (should) increase since bug 1247706, which removed the "crash guard", disallowing any further usage of WebGL if we crashed during the initialization.

With the automatic guard out of the way, we will probably have to blocklist, but let's see.

Mostly XP, but some later Windows as well.  Old cards in most, if not all.
I think the problem was that our crash guard as it was before removal generated a lot of false positives and people who never had seen WebGL crashes still had WebGL removed from them by the crash guard. If we need/want to reinstate the crash guard, we should figure out how to be sharper about it so that it won't cause as many false positive WebGL removals as it did.

If this crash is directly in the call stack mozilla::gl::GLContext::InitWithPrefix() when we are initializing the GL context and it deterministically occurs on these systems, I don't think it would cause false positives to crash guard out this exact call site alone. Although, if this is a jump to null address(?) directly from somewhere in mozilla::gl::GLContext::InitWithPrefix(), then doesn't that imply a bug in our code, since our code jumps to null directly? I.e. the crash signature doesn't seem to have any GL driver code in it.
Jeff, all of these crash in glBindFramebuffer call - that is available as of OpenGL 3, but we explicitly allow 2 and above (https://dxr.mozilla.org/mozilla-central/source/gfx/gl/GLContextProviderWGL.cpp#284) when doing GL through WGL.  Or am I reading this incorrectly?
Flags: needinfo?(jgilbert)
I presume it's possible the issue is that in here https://hg.mozilla.org/releases/mozilla-beta/annotate/53d6e6648f97/gfx/gl/GLContext.cpp#l1744 the function pointer 'fBindFramebuffer' could be null?
Assignee: nobody → jgilbert
(In reply to Jukka Jylänki from comment #1)
> I think the problem was that our crash guard as it was before removal
> generated a lot of false positives and people who never had seen WebGL
> crashes still had WebGL removed from them by the crash guard. If we
> need/want to reinstate the crash guard, we should figure out how to be
> sharper about it so that it won't cause as many false positive WebGL
> removals as it did.
> 
> If this crash is directly in the call stack
> mozilla::gl::GLContext::InitWithPrefix() when we are initializing the GL
> context and it deterministically occurs on these systems, I don't think it
> would cause false positives to crash guard out this exact call site alone.
> Although, if this is a jump to null address(?) directly from somewhere in
> mozilla::gl::GLContext::InitWithPrefix(), then doesn't that imply a bug in
> our code, since our code jumps to null directly? I.e. the crash signature
> doesn't seem to have any GL driver code in it.

I don't want to forget this topic, so we may want to open another bug to discuss it in detail.  Basically, whether it's still worth putting some kind of a crash guard on WebGL usage, and if so, where the guard should be.
Whiteboard: [gfx-noted]
[Tracking Requested - why for this release]: We removed the (too aggressive) crash guard from WebGL crashes, letting these grow in numbers (early days and very small numbers - single digits, but it is #3 in the latest 46 beta.)  We may want to see if the actual fix for this scenario can be uplifted.
(In reply to Jukka Jylänki from comment #1)
> I think the problem was that our crash guard as it was before removal
> generated a lot of false positives and people who never had seen WebGL
> crashes still had WebGL removed from them by the crash guard. If we
> need/want to reinstate the crash guard, we should figure out how to be
> sharper about it so that it won't cause as many false positive WebGL
> removals as it did.

Users would had to have experienced a GL crash for the crash guard to activate, so on Windows it's pretty likely that would have been WebGL, unless I'm missing something. (I guess we do create a context for about:support)
The issue was more "if the very first one fails it's probably different than if we succeed sometimes and fail sometimes".
Flags: needinfo?(jgilbert)
Attachment #8734926 - Flags: review?(jmuizelaar)
if you just look on 46beta data for this signature, the crashes are predominantly affecting xp with old nvidia hardware/drivers:

Platform pretty version facet Rank 	Platform pretty version 	Count 	%
1 	Windows XP 	579 	93.09 %
2 	Windows Vista 	23 	3.70 %
3 	Windows 7 	20 	3.22 %

Adapter vendor id facet Rank 	Adapter vendor id 	Count 	%
1 	0x10de 	569 	91.48 %
2 	0x8086 	53 	8.52 %

Adapter driver version facet Rank 	Adapter driver version 	Count 	% (when limiting the search to nvidia gpus)
1 	7.1.8.9 	101 	17.75 %
2 	7.2.1.4 	65 	11.42 %
3 	8.1.9.8 	43 	7.56 %
4 	6.6.9.3 	35 	6.15 %
5 	7.1.8.4 	35 	6.15 %
6 	9.1.3.6 	30 	5.27 %
7 	6.7.4.2 	29 	5.10 %
8 	8.3.2.0 	28 	4.92 %
9 	7.7.7.2 	27 	4.75 %
10 	2005.6.17.1 	22 	3.87 %
Yes, this would be likely to crash on anything that only has OpenGL 2.* support.
(In reply to philipp from comment #9)
> if you just look on 46beta data for this signature, the crashes are
> predominantly affecting xp with old nvidia hardware/drivers:
> 
> Platform pretty version facet Rank 	Platform pretty version 	Count 	%
> 1 	Windows XP 	579 	93.09 %
> 2 	Windows Vista 	23 	3.70 %
> 3 	Windows 7 	20 	3.22 %
> 
> Adapter vendor id facet Rank 	Adapter vendor id 	Count 	%
> 1 	0x10de 	569 	91.48 %
> 2 	0x8086 	53 	8.52 %
> 
> Adapter driver version facet Rank 	Adapter driver version 	Count 	% (when
> limiting the search to nvidia gpus)
> 1 	7.1.8.9 	101 	17.75 %
> 2 	7.2.1.4 	65 	11.42 %
> 3 	8.1.9.8 	43 	7.56 %
> 4 	6.6.9.3 	35 	6.15 %
> 5 	7.1.8.4 	35 	6.15 %
> 6 	9.1.3.6 	30 	5.27 %
> 7 	6.7.4.2 	29 	5.10 %
> 8 	8.3.2.0 	28 	4.92 %
> 9 	7.7.7.2 	27 	4.75 %
> 10 	2005.6.17.1 	22 	3.87 %

Great, I expected something like this.

(In reply to Milan Sreckovic [:milan] from comment #10)
> Yes, this would be likely to crash on anything that only has OpenGL 2.*
> support.

Right. To be clear, we can't run on a driver that doesn't support framebuffers. We should be gently failing there. This patch should convert the current crash to a gentle fail.
Asked another way - if I have a system with OpenGL 2.1 support, would it have crashed before, and will it gently fail now?  Framebuffers weren't available before OpenGL 3, right?  Or do I have that mixed up?
Comment on attachment 8734926 [details] [diff] [review]
0001-Require-framebuffers-for-GLContext.patch

Review of attachment 8734926 [details] [diff] [review]:
-----------------------------------------------------------------

So I had a look at ANGLE and it's not obvious that it supports this feature so I'm not sure this is the right approach.

I also noticed that the D3D11 context does not support framebufferBlit and framebufferMultisample extensions on D3D9 level hardware, where as the D3D9 backend does. However, that doesn't explain the crashes on XP. What ANGLE feature are you trying to detect?
Attachment #8734926 - Flags: review?(jmuizelaar) → review-
Tracking for 46+, this is noticeable in beta 5 (the #36 topcrash). Maybe blocklisting particular drivers for XP could help.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #14)
> Tracking for 46+, this is noticeable in beta 5 (the #36 topcrash). Maybe
> blocklisting particular drivers for XP could help.

This spiked because we removed the crash guard that was taking away WebGL from too many people.  The people that are now crashing just used to not be able to run WebGL before, so arguably things got only slightly worse for them :)
Let's see what we can come up with over the next few days.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #13)
> Comment on attachment 8734926 [details] [diff] [review]
> 0001-Require-framebuffers-for-GLContext.patch
> 
> Review of attachment 8734926 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> So I had a look at ANGLE and it's not obvious that it supports this feature
> so I'm not sure this is the right approach.
> 
> I also noticed that the D3D11 context does not support framebufferBlit and
> framebufferMultisample extensions on D3D9 level hardware, where as the D3D9
> backend does. However, that doesn't explain the crashes on XP. What ANGLE
> feature are you trying to detect?

framebuffer_object_EXT_OES covers framebuffer objects without fbBlit or rbMSStorage, etc.
Attachment #8734926 - Flags: review- → review?(jmuizelaar)
Does ANGLE support framebuffer_object_EXT_OES?
Flags: needinfo?(jgilbert)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #17)
> Does ANGLE support framebuffer_object_EXT_OES?

Based on the code, if it doesn't, we never load glBindFramebuffer, so it must support one of the two.
Otherwise we would always crash.

We have mochitests that guarantee that WebGL works on all platforms as we expect it to, so it's not possible to land something that would break this.
Flags: needinfo?(jgilbert)
(In reply to Jeff Gilbert [:jgilbert] from comment #18)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #17)
> > Does ANGLE support framebuffer_object_EXT_OES?
> 
> Based on the code, if it doesn't, we never load glBindFramebuffer, so it
> must support one of the two.
> Otherwise we would always crash.

What I'm trying to understand is under what circumstances we get which extensions. If ANGLE unconditionally supports framebuffer_object it doesn't make sense to test for it. In fact looking at these crashes it looks like they are all WGL related, so it looks like the proper fix is to try harder to not use WGL.
Comment on attachment 8734926 [details] [diff] [review]
0001-Require-framebuffers-for-GLContext.patch

Review of attachment 8734926 [details] [diff] [review]:
-----------------------------------------------------------------

So I had a look at ANGLE and it's not obvious that it supports this feature so I'm not sure this is the right approach.

I also noticed that the D3D11 context does not support framebufferBlit and framebufferMultisample extensions on D3D9 level hardware, where as the D3D9 backend does. However, that doesn't explain the crashes on XP. What ANGLE feature are you trying to detect?
Attachment #8734926 - Flags: review?(jmuizelaar) → review-
(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)
> Comment on attachment 8734926 [details] [diff] [review]
> 0001-Require-framebuffers-for-GLContext.patch
> 
> Review of attachment 8734926 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> So I had a look at ANGLE and it's not obvious that it supports this feature
> so I'm not sure this is the right approach.
> 
> I also noticed that the D3D11 context does not support framebufferBlit and
> framebufferMultisample extensions on D3D9 level hardware, where as the D3D9
> backend does. However, that doesn't explain the crashes on XP. What ANGLE
> feature are you trying to detect?

Alright, let's voice-chat tomorrow to clear this up. We can drop the notes here. This is too low-bandwidth.
:jrmuizel and I talked and we're going to ship this fix to patch the regression here, and also land a follow-up to stop falling back to desktop GL when ANGLE fails.
See Also: → 1261179
(In reply to Jeff Gilbert [:jgilbert] from comment #22)
> ...and also land a follow-up to stop falling back to desktop
> GL when ANGLE fails.

This is bug 1261179.
Attachment #8734926 - Flags: review- → review?(jmuizelaar)
Attachment #8734926 - Flags: review?(jmuizelaar) → review+
Comment on attachment 8734926 [details] [diff] [review]
0001-Require-framebuffers-for-GLContext.patch

Approval Request Comment
[Feature/regressing bug #]: bug 1247706, effectively
[User impact if declined]: Older machines may crash when trying to load WebGL.
[Describe test coverage new/current, TreeHerder]: none (hard to test)
[Risks and why]: very low
[String/UUID change made/needed]: none
Attachment #8734926 - Flags: approval-mozilla-beta?
Attachment #8734926 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/c3e113d9db02
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Comment on attachment 8734926 [details] [diff] [review]
0001-Require-framebuffers-for-GLContext.patch

Crash fix for a fairly significant crash in beta/release, let's uplift it.
Attachment #8734926 - Flags: approval-mozilla-beta?
Attachment #8734926 - Flags: approval-mozilla-beta+
Attachment #8734926 - Flags: approval-mozilla-aurora?
Attachment #8734926 - Flags: approval-mozilla-aurora+
We should verify that the crash rate drops after this lands in beta 9.
Flags: qe-verify+
There are no crashes in Socorro on FX 46 newer than 46b8 in the last 28 days.
No new crashes with this signature present in Socorro for 47 nor 48. Marking accordingly.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
Crash volume for signature '@0x0 | mozilla::gl::GLContext::InitWithPrefix':
 - nightly (version 50): 0 crash from 2016-06-06.
 - aurora  (version 49): 0 crash from 2016-06-07.
 - beta    (version 48): 0 crash from 2016-06-06.
 - release (version 47): 0 crash from 2016-05-31.
 - esr     (version 45): 46 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          0          0          0          0          0          0
 - aurora           0          0          0          0          0          0          0
 - beta             0          0          0          0          0          0          0
 - release          0          0          0          0          0          0          0
 - esr              1          5          4          6          2         13          7

Affected platform: Windows
You need to log in before you can comment on or make changes to this bug.