Closed Bug 593867 Opened 14 years ago Closed 14 years ago

crash [@WebGLContext::ValidateProgram] in NVIDIA driver on Mac

Categories

(Core :: Graphics: CanvasWebGL, defect)

x86_64
macOS
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla9
Tracking Status
blocking2.0 --- final+
status1.9.2 --- unaffected
status1.9.1 --- unaffected

People

(Reporter: posidron, Assigned: bjacob)

Details

(Keywords: crash, testcase, Whiteboard: [sg:critical])

Crash Data

Attachments

(8 files)

Attached file testcase
The testcase is a bit large. Firefox crashes in a different function each time, but glReadPixels() is always called beforehand.
Attached file callstack
Summary: WebGL gl_xxx crash [@glReadPixels] → WebGL gl_xxx crash [@glReadPixels/glrCompExecuteKernel]
I can't reproduce the crash here (linux x86-64 nvidia proprietary driver 195.36)

This page doesn't display or do anything here, all I can see is WebGL errors:

WebGL: framebufferRenderbuffer: renderbuffer: deleted object passed as argument
WebGL: texParameter: no texture is bound to this target
WebGL: GetFramebufferAttachmentParameter: pname: invalid enum value 0x8cd3
WebGL: GetFramebufferAttachmentParameter: pname: invalid enum value 0x8cd3
I haven't tested it on Linux. On MacOSX it's reproducible.
The assembly says it's crashing when trying to write into a buffer, which suggests the allocated buffer isn't big enough for what is being written to it. That's strange, because the code seems safe in that respect:

from GLContext.cpp:

void
GLContext::ReadPixelsIntoImageSurface(GLint aX, GLint aY,
                                      GLsizei aWidth, GLsizei aHeight,
                                      gfxImageSurface *aDest)
{
    MakeCurrent();

    if (aDest->Format() != gfxASurface::ImageFormatARGB32 &&
        aDest->Format() != gfxASurface::ImageFormatRGB24)
    {
        NS_WARNING("ReadPixelsIntoImageSurface called with invalid image format");
        return;
    }

    if (aDest->Width() != aWidth ||
        aDest->Height() != aHeight ||
        aDest->Stride() != aWidth * 4)
    {
        NS_WARNING("ReadPixelsIntoImageSurface called with wrong size or stride surface");
        return;
    }

    GLint currentPackAlignment = 0;
    fGetIntegerv(LOCAL_GL_PACK_ALIGNMENT, &currentPackAlignment);
    fPixelStorei(LOCAL_GL_PACK_ALIGNMENT, 4);

    [SNIP]

    fReadPixels(0, 0, aWidth, aHeight,
                format, datatype,
                aDest->Data());

    [SNIP]
}
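
A minimal sketch of why the size check above looks sufficient, assuming 8-bit components and the usual rule that glReadPixels rounds each row up to a multiple of GL_PACK_ALIGNMENT. The helper names PackedRowStride and ReadbackFits are hypothetical and do not exist in GLContext.cpp; they only illustrate the arithmetic.

```cpp
#include <cassert>
#include <cstddef>

// Distance in bytes between the starts of consecutive rows written by a
// readback of 8-bit components, with each row rounded up to a multiple of
// the pack alignment (1, 2, 4 or 8).
std::size_t PackedRowStride(std::size_t width, std::size_t bytesPerPixel,
                            std::size_t packAlignment)
{
    std::size_t rowBytes = width * bytesPerPixel;
    return (rowBytes + packAlignment - 1) / packAlignment * packAlignment;
}

// Hypothetical check: a destination with `destStride` bytes per row and
// `height` rows can hold the readback if its total size is at least the
// (upper bound on the) number of bytes the readback touches.
bool ReadbackFits(std::size_t width, std::size_t height,
                  std::size_t destStride, std::size_t bytesPerPixel,
                  std::size_t packAlignment)
{
    std::size_t needed = PackedRowStride(width, bytesPerPixel, packAlignment) * height;
    return destStride * height >= needed;
}
```

With 4 bytes per pixel and a pack alignment of 4, the packed row stride is exactly aWidth * 4, which is what the stride check in ReadPixelsIntoImageSurface requires of the surface. Under these assumptions the destination is large enough, so an overrun in our own code is unlikely.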

Seems like a question for graphics people, not WebGL-specific. Reassigning...
Component: Canvas: WebGL → Graphics
QA Contact: canvas.webgl → thebes
Summary: WebGL gl_xxx crash [@glReadPixels/glrCompExecuteKernel] → crash [@GLContext::ReadPixelsIntoImageSurface]
Are you running with OpenGL-accelerated layers enabled? As in layers.accelerate_all or MOZ_ACCELERATED? That could explain things. This is considered "not yet ready".
No, default build, default config except webgl.enabled_for_all_sites;true
OK. I tried with OSMesa in valgrind, no memory error reported.

What is your graphics card and driver?
Putting this back to webgl, since that's where it belongs because it's seen via webgl.
Component: Graphics → Canvas: WebGL
QA Contact: thebes → canvas.webgl
Model Identifier:	MacBookPro6,2

  Chipset Model:	NVIDIA GeForce GT 330M
  Type:	GPU
  Bus:	PCIe
  PCIe Lane Width:	x16
  VRAM (Total):	512 MB
  Vendor:	NVIDIA (0x10de)
  Device ID:	0x0a29
  Revision ID:	0x00a2
  ROM Revision:	3540
  gMux Version:	1.9.21


  Chipset Model:	Intel HD Graphics
  Type:	GPU
  Bus:	Built-In
  VRAM (Total):	288 MB
  Vendor:	Intel (0x8086)
  Device ID:	0x0046
  Revision ID:	0x0018
  gMux Version:	1.9.21
Component: Canvas: WebGL → Graphics
seems like there's been a mid-air collision and the category change was overwritten
Component: Graphics → Canvas: WebGL
Attached file testcase-reduced
Was able to reduce the testcase to a minimum.
It may also crash at the following location:

#0  0x00000001225eac8a in gleUpdateFragmentFallbackProgram ()
#1  0x00000001225db674 in gleUpdateDeferredState ()
#2  0x00000001225dc949 in gleDoSelectiveDispatchNoErrorCore ()
#3  0x000000012251ea7d in glFlush_Exec ()
#4  0x0000000102b3aa0c in mozilla::layers::BasicCanvasLayer::Updated (this=0x1220ee360, aRect=@0x7fff5fbf9970) at /Users/cdiehl/Mozilla/trunk/gfx/layers/basic/BasicLayers.cpp:719
#5  0x0000000100d136ef in mozilla::WebGLContext::GetCanvasLayer (this=0x121d31e00, aOldLayer=0x0, aManager=0x12245ebb0) at /Users/cdiehl/Mozilla/trunk/content/canvas/src/WebGLContext.cpp:554
#6  0x0000000100e1be2b in nsHTMLCanvasElement::GetCanvasLayer (this=0x121d2af20, aOldLayer=0x0, aManager=0x12245ebb0) at /Users/cdiehl/Mozilla/trunk/content/html/content/src/nsHTMLCanvasElement.cpp:545
Group: core-security
Whiteboard: [sg:critical]
The two stack traces say it's crashing when BasicCanvasLayer::Updated(nsIntRect const&) calls glFlush(). A glFlush() crash inside of the driver means it's just crashing as the result of earlier calls. So the stack traces are actually not useful.

One thing that would be useful would be to make a "synchronous GL" mode where we'd call glFinish() after every GL call. Would make GL stack traces actually useful...

If I make you a patch doing that will you build and try it?
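
A rough sketch of the "synchronous GL" idea described above, not the actual patch. FakeContext and SyncCall are hypothetical names; a real version would forward Finish() to glFinish() so that a driver crash surfaces at the call that caused it rather than at a later flush.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GL context. A real implementation would call
// glFinish() in Finish(); here we only count the calls.
struct FakeContext {
    int finishCount = 0;
    std::vector<std::string> trace;
    void Finish() { ++finishCount; }
};

// Record and print the call name, issue the call, then wait for completion.
// The last "before <name>" line printed before a crash names the culprit.
template <typename Fn, typename... Args>
void SyncCall(FakeContext& ctx, const std::string& name, Fn&& fn, Args&&... args)
{
    ctx.trace.push_back("before " + name);
    std::cerr << "before " << name << std::endl;  // std::endl flushes the stream
    std::forward<Fn>(fn)(std::forward<Args>(args)...);
    ctx.Finish();
}
```

Wrapping each call this way trades performance for stack traces and logs that point at the call that actually triggered the fault.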
Sure
Good news / bad news time.

Good news: no need to rebuild firefox, here's a new version of your test case that calls gl.finish() after every WebGL call. Indeed, I had forgotten, but it turns out that WebGL does have a finish() function! This new testcase also prints stuff in the terminal, so make sure to launch Minefield from a terminal to see the debug output. When you get a crash, paste here the terminal output, that should tell us which WebGL function crashed.

Bad news: this only helps debug into WebGL. If the crash is not in WebGL, for example if it is in the OpenGL layers code, this won't help. What you could do is disable OpenGL-accelerated layers (preferences layers.accelerate_xxx and environment variable MOZ_ACCELERATED) and see if the crash persists.
Ah, I had forgotten about comment 6. So ignore the part of my previous comment about accelerated layers.
--- WebGL context created: 0x11cc2b000
before deleteProgram
after deleteProgram
begin add
before createShader
before shaderSource
before compileShader
before getShaderParameter
before attachShader
end add
begin add
before createShader
before shaderSource
before compileShader
before getShaderParameter
before attachShader
end add
before linkProgram
before getProgramParameter
before useProgram
end initWebGL
before deleteProgram
after deleteProgram
before createProgram
after createProgram
before validateProgram

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 13 at address: 0x0000000000000000
0x000000020001e903 in gldUnbindPipelineProgram ()
OK, great! so validateProgram() is crashing. Can you confirm that the crash goes away if you remove the validateProgram() call?

The funny thing is that its code looks simple and fine:

NS_IMETHODIMP
WebGLContext::ValidateProgram(nsIWebGLProgram *pobj)
{
    WebGLuint progname;
    if (!GetGLName<WebGLProgram>("validateProgram", pobj, &progname))
        return NS_OK;

    MakeContextCurrent();

    gl->fValidateProgram(progname);

    return NS_OK;
}

When it crashes, can you attach a debugger and print the values of pobj and of progname?
Summary: crash [@GLContext::ReadPixelsIntoImageSurface] → crash [@WebGLContext::ValidateProgram]
Anyway... it's almost certainly a driver bug making it crash in glValidateProgram (what we call as gl->fValidateProgram)

Perhaps we should just let WebGLContext::ValidateProgram be a no-operation when this driver is used...
Yes, doesn't crash when validateProgram is removed.

gdb $ p pobj
$14 = (nsIWebGLProgram *) 0x12617b2d0
gdb $ p progname
$15 = 4
Keywords: crash, testcase
Let's block on figuring out what's going on here - we can always unblock on it later.
Assignee: nobody → bjacob
blocking2.0: --- → final+
OK, we do know what's going on here: NVIDIA driver crash in glValidateProgram(), not our fault.

Solutions by order of decreasing goodness and increasing quickness:

 * NVIDIA fixes their driver. We could convert this testcase into a C program using OpenGL, if that might help.

 * We use ANGLE to emulate glValidateProgram. Not realistic until certain ANGLE bugs are fixed.

 * We just avoid calling glValidateProgram, which means that we make webgl.validateProgram be a dummy no-op function, on NVIDIA. After all this function is not needed for rendering.
Wait, it's not NVIDIA --- it's Apple I guess who makes their own NVIDIA driver. right?
Christoph: 2 questions:

   * are you using Apple's driver for your NVIDIA card? (Apple makes their own NVIDIA driver, right?)

   * can you reproduce with the WebKit nightly builds? Download from
         http://nightly.webkit.org/
     and type this into a terminal:
         defaults write com.apple.Safari WebKitWebGLEnabled -bool YES
Yes, I am using the default provided drivers by Apple 
- http://www.nvidia.de/page/macintosh.html
I am not able to reproduce it against WebKit (r69183)
This patch works around the crash by implementing WebGLContext::ValidateProgram as a no-op on Mac/NVIDIA, and printing an informative message.
Attachment #484032 - Flags: review?(vladimir)
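
For illustration only, the shape of such a workaround might look like the sketch below. This is not the attached patch: DriverInfo, ShouldSkipValidateProgram and ValidateProgramWithWorkaround are hypothetical names, and matching on a vendor string that contains "NVIDIA" is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of the running GL implementation.
struct DriverInfo {
    bool isMac;
    std::string vendor;  // assumed to hold the GL vendor string
};

// True when the driver's validate call should be skipped: on Mac with a
// vendor string containing "NVIDIA" (an assumed matching rule).
bool ShouldSkipValidateProgram(const DriverInfo& info)
{
    return info.isMac && info.vendor.find("NVIDIA") != std::string::npos;
}

// Issues the driver call unless the workaround applies.
// Returns true if the driver call was actually made.
template <typename DriverCall>
bool ValidateProgramWithWorkaround(const DriverInfo& info, DriverCall&& call)
{
    if (ShouldSkipValidateProgram(info)) {
        // A real implementation would print an informative warning here.
        return false;
    }
    call();
    return true;
}
```

Keeping the decision in one predicate makes it easy to widen or narrow the condition later, for example to all Macs regardless of vendor.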
(In reply to comment #27)
> Yes, I am using the default provided drivers by Apple 
> - http://www.nvidia.de/page/macintosh.html
> I am not able to reproduce it against WebKit (r69183)

It's interesting that you can't reproduce with WebKit as they are definitely doing the glValidateProgram call; I wonder what difference between our and their GL setups causes us to trigger this crash; in any case it's a driver bug so all we can do is report it and work around it until it's fixed.
Vlad: ping, can you review the patch here?
Kev, do you know how to contact Apple about this?

This is a NVIDIA driver crash, but as far as I understand, Apple is the author of the NVIDIA driver on Mac.
Summary: crash [@WebGLContext::ValidateProgram] → crash [@WebGLContext::ValidateProgram] in NVIDIA driver on Mac
Comment on attachment 484032 [details] [diff] [review]
work around the crash

Do we know that this is NVIDIA-only, or is it in the common OSX driver layer?  Note that NVIDIA does not write the OSX drivers, Apple does. r+'ing this, but this might need to just be #ifdef XP_MACOSX without the vendor check.
Attachment #484032 - Flags: review?(vladimir) → review+
In the stack traces attached to this bug, we are in GeforceGLDriver, called from GLEngine.

I don't know if that's at all conclusive, and I can't find this crash on crash-stats.

I can find some Mac GL crashes on crash-stats, and most of them are NVIDIA, but that might just reflect what Mac computers have. Their stacks are probably meaningless as was the first stack here before we started calling glFinish().

Christoph: if you can still reproduce, can you please go to about:crashes and give us a link to your crash?
Benoit: Are these the types of crashes you are referring to in comment 33: http://tinyurl.com/36ac89l? I just noticed them in crash-stats today.
No, they're a different crash: they are crashing on GL initialization, while the present bug is about a crash that occurs while running GL commands long after initialization.

These 4 crash reports are in all likelihood the same user crashing 4 times. The reason why it's happening now could easily be that until bug 604395 landed, some (no data, actually) users got no GL at all, and now they do.

Right now the best thing to do about these 4 crashes would be to report to Apple; unless you get a hold of the person for whom it's crashing (then send him/her to me!)
http://hg.mozilla.org/mozilla-central/rev/0518ede13821
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Group: core-security
Christoph, out of curiosity, what Mac OS X version was this on? 10.5 or 10.6 or both?
10.5
Can you tell me your precise 10.5.x version?
Apple released an upgrade to 10.6 a few weeks back.

ProductName:	Mac OS X
ProductVersion:	10.6.6
BuildVersion:	10J56
Sorry. I tested this on 10.6.4 as far as I remember.
Forwarded to Apple: bug 9129482
Crash Signature: [@WebGLContext::ValidateProgram]
The patch that previously landed for this bug introduced a new bug: Macs with non-NVIDIA GPUs had their ValidateProgram() function disabled, but their getProgramParameter() function wasn't changed accordingly. Thus, validation would never be done, and the status would always be non-success even if the program was valid. This patch fixes that.
Attachment #557327 - Flags: review?(bjacob)
Attachment #557327 - Flags: review?(bjacob) → review+
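
A small model of the inconsistency described above, as a sketch rather than the actual patch. ProgramState and Context are hypothetical types, and reporting success whenever validation is skipped is an assumption about how the two functions can be kept in agreement.

```cpp
#include <cassert>

// Hypothetical model of a program object's validation state.
struct ProgramState {
    bool driverValidateStatus = false;  // what the driver last reported
};

struct Context {
    bool skipValidate;  // true when the validateProgram workaround is active

    // When the workaround is active the driver is never asked, so the
    // stored status keeps its initial "not validated" value.
    void ValidateProgram(ProgramState& prog, bool programIsValid) const
    {
        if (skipValidate)
            return;
        prog.driverValidateStatus = programIsValid;
    }

    // The status query consults the same flag as ValidateProgram. Without
    // this branch, a skipped validation would always read back as failure.
    bool GetValidateStatus(const ProgramState& prog) const
    {
        if (skipValidate)
            return true;  // assumed: report success when validation is skipped
        return prog.driverValidateStatus;
    }
};
```

The point is that both functions must test the same condition. If one is disabled on all Macs while the other only special-cases NVIDIA, the query reports a stale failure on the remaining hardware.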
The workaround that was added in this bug is about 5 years old now. Marked down bug 1284425 to discuss whether the workaround is still relevant on recent OS X versions.