Last Comment Bug 718629 - intermittent waitpid failure causes OpenGL features to be blacklisted on X11
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: Graphics
Version: unspecified
Platform: x86_64 Linux
Importance: -- normal
Target Milestone: mozilla13
Assigned To: Benoit Jacob [:bjacob] (mostly away)
Duplicates: 678372, 721343
Reported: 2012-01-17 06:22 PST by Benoit Jacob [:bjacob] (mostly away)
Modified: 2012-04-05 07:54 PDT


Attachments
report more info about waitpid failures (2.06 KB, patch)
2012-01-17 06:24 PST, Benoit Jacob [:bjacob] (mostly away)
no flags
report more info about waitpid failures (2.06 KB, patch)
2012-01-17 06:40 PST, Benoit Jacob [:bjacob] (mostly away)
joe: review+
tolerate ECHILD (1.51 KB, patch)
2012-01-29 20:07 PST, Benoit Jacob [:bjacob] (mostly away)
no flags
tolerate ECHILD as long as reading from the pipe succeeded (3.82 KB, patch)
2012-01-29 20:20 PST, Benoit Jacob [:bjacob] (mostly away)
no flags
tolerate ECHILD, and only write to the pipe at the very end of glxtest (4.05 KB, patch)
2012-01-30 15:34 PST, Benoit Jacob [:bjacob] (mostly away)
karlt: review+
akeybl: approval-mozilla-aurora+

Description Benoit Jacob [:bjacob] (mostly away) 2012-01-17 06:22:46 PST
Here on Ubuntu 11.10 with the radeon driver, I sometimes get a spurious 'waitpid failure' reported in about:support. This causes GL features to be disabled. Needs investigating.

Also see bug 622127 comment 16, it's the same thing.
Comment 1 Benoit Jacob [:bjacob] (mostly away) 2012-01-17 06:24:14 PST
Created attachment 589171
report more info about waitpid failures
Comment 2 Benoit Jacob [:bjacob] (mostly away) 2012-01-17 06:40:24 PST
Created attachment 589178
report more info about waitpid failures
Comment 3 Ryan S Kingsbury 2012-01-17 09:07:55 PST
Affects me also (reproducing my comments from Bug 622127)

The wiki page says that Webgl is enabled by default for Mesa >7.10.3
https://wiki.mozilla.org/Blocklisting/Blocked_Graphics_Drivers#On_X11

However, my Intel Ironlake card is blacklisted even though I'm running Mesa 7.11.3 (Firefox 9 stable channel, ArchLinux, Kernel 3.1.9)
The graphics section of about:support says this:

===================================================================
Adapter Description GLXtest process failed (waitpid failed): VENDOR
Tungsten Graphics, Inc
RENDERER
Mesa DRI Intel(R) Ironlake Desktop 
VERSION
2.1 Mesa 7.11.2
TFP
TRUE

WebGL Renderer Tungsten Graphics, Inc -- Mesa DRI Intel(R) Ironlake Desktop -- 2.1 Mesa 7.11.2
GPU Accelerated Windows 0/1. Blocked for your graphics driver version. Try updating your graphics driver to version <Anything with EXT_texture_from_pixmap support> or newer.
===========================================================================

If I run glxinfo | grep texture_from_pixmap I get the following:
====================
GLX_ARB_multisample, GLX_EXT_import_context, GLX_EXT_texture_from_pixmap, 
GLX_SGIX_visual_select_group, GLX_EXT_texture_from_pixmap, 
GLX_EXT_texture_from_pixmap
=====================

So it appears that my driver and hardware do support the EXT_texture_from_pixmap extension, contrary to what the about:support message says.

I can get it working by force-enabling webgl and layers acceleration, but why am I having to do this? Is the Wiki page just wrong?  Is there a bug with Ironlake hardware that needs to be filed with the Mesa team?
Comment 4 Benoit Jacob [:bjacob] (mostly away) 2012-01-17 11:33:26 PST
Ryan, we're still investigating this. I'll land my patch so it should be in Thursday's Nightly build (nightly.mozilla.org). If you could try it then, and paste here the about:support Graphics information from it, that would greatly help us understand this problem.

The texture_from_pixmap message will be fixed at the same time. The probe does recognize that your driver supports texture_from_pixmap, see "TFP TRUE".
Comment 5 Benoit Jacob [:bjacob] (mostly away) 2012-01-17 17:46:14 PST
http://hg.mozilla.org/integration/mozilla-inbound/rev/924d6091bec7
Comment 6 Ryan S Kingsbury 2012-01-17 20:03:27 PST
Thanks for the quick response, Jacob!  In preparation for Thursday I went ahead and tried tonight (1/17) nightly and the problem seems resolved already. 

About:support reads:
================
Adapter Description Tungsten Graphics, Inc -- Mesa DRI Intel(R) Ironlake Desktop
Vendor ID Tungsten Graphics, Inc
Device ID Mesa DRI Intel(R) Ironlake Desktop
Driver Version 2.1 Mesa 7.11.2
WebGL Renderer Tungsten Graphics, Inc -- Mesa DRI Intel(R) Ironlake Desktop -- 2.1 Mesa 7.11.2
GPU Accelerated Windows 0
=================

If I enable layers acceleration in about:config, performance seems decent, too!

I'll test again on Thursday once your patch hits.
Comment 8 Benoit Jacob [:bjacob] (mostly away) 2012-01-18 05:06:10 PST
(In reply to Ryan S Kingsbury from comment #6)
> Thanks for the quick response, Jacob!  In preparation for Thursday I went
> ahead and tried tonight (1/17) nightly and the problem seems resolved
> already. 

Hm, OK, as I said in the title of this bug, it's intermittent. Which means that I don't think it's resolved, it's just not manifesting itself for you at the moment.

When the problem reappears, go again to about:support, check if it mentions "waitpid", if yes paste that info here: starting in tomorrow's Nightly it will have more useful information.

Keeping this bug open until it's actually fixed.
Comment 9 Ryan S Kingsbury 2012-01-19 16:27:55 PST
You were right, I'm having problems again even before updating the nightly build to today's version.

Tried the 1/19 nightly and I see the following:
========================
Adapter Description Tungsten Graphics, Inc -- Mesa DRI Intel(R) Ironlake Desktop
Vendor ID Tungsten Graphics, Inc
Device ID Mesa DRI Intel(R) Ironlake Desktop 
Driver Version 2.1 Mesa 7.11.2
WebGL Renderer Blocked for your graphics driver version. Try updating your graphics driver to version <Anything with EXT_texture_from_pixmap support> or newer.
GPU Accelerated Windows 0
==========================
Comment 10 Ryan S Kingsbury 2012-01-19 16:31:48 PST
If I force enable webgl I get the following:
==================
Adapter Description Tungsten Graphics, Inc -- Mesa DRI Intel(R) Ironlake Desktop 
Vendor ID Tungsten Graphics, Inc
Device ID Mesa DRI Intel(R) Ironlake Desktop 
Driver Version 2.1 Mesa 7.11.2
WebGL Renderer Tungsten Graphics, Inc -- Mesa DRI Intel(R) Ironlake Desktop  -- 2.1 Mesa 7.11.2
GPU Accelerated Windows 0
================

The GPU Accelerated windows count remains at 0 even when I have a webgl tab open (in this case webglearth.com)

Unfortunately I don't see anything about waitpid
Comment 11 Ryan S Kingsbury 2012-01-19 16:34:53 PST
Update: now I see the latter output (no error) even without webgl force-enabled.  I haven't changed anything, just starting and restarting the browser a few times.

I should note that I'm running these tests in safe mode, so all add-ons are disabled.
Comment 12 Ryan S Kingsbury 2012-01-19 16:56:09 PST
Another update (sorry to spam):

I tried various combinations of
webgl.force-enabled T/F
layers.acceleration.force-enabled T/F
safe and normal mode

The only 2 configurations that result in "GPU Accelerated Windows" going from 0 to 1 involve using layers.acceleration.force-enabled TRUE, and opening in non-safe mode. This configuration works whether or not webgl is force-enabled.

The rest of the graphics output in about:support remains the same, only the window count changes.

Oddly, when I force-enable layers acceleration, I get strange behavior on input events. For instance, while trying to type the text into this box the keyboard became VERY laggy. Also, hovering the mouse over different HTML elements causes some of them to flicker and disappear. I don't know if that's related or off-track.  The layers acceleration DOES definitely bring the GPU online though; my framerate on webgl earth went from about 3 to 20+ when the GPU accelerated windows count was 1.
Comment 13 Karl Tomlinson (:karlt) 2012-01-27 20:30:05 PST
If PR_CreateProcess is called before waitpid is called, then this would be the same issue as bug 678372.
Comment 14 Benoit Jacob [:bjacob] (mostly away) 2012-01-29 19:43:16 PST
Karl, I set a breakpoint at prinit.c:731 and it's never reached here, during a normal Firefox run.
Comment 15 Benoit Jacob [:bjacob] (mostly away) 2012-01-29 20:00:00 PST
Though, in the debug build which I debugged, I can't reproduce this problem. And on Nightly, I get:

GLXtest process failed (waitpid failed with errno=10 for pid 31025):
VENDOR
X.Org
RENDERER
Gallium 0.4 on AMD RV710
VERSION
2.1 Mesa 7.11
TFP
TRUE

This means a few things. First, errno=10 is ECHILD ("No child processes"), which is in agreement with your theory about PR_WaitProcess reaping the glxtest process.

Second, the glxtest process was pid 31025 while this Firefox instance is pid 31023. So there was indeed a process in between at some point. I guess Nightly does stuff that my debug build doesn't do.

Third, even though the glxtest process was gone, the data was still successfully read from the pipe. So perhaps this just doesn't matter and we should just ignore the errno=10 case?
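
The situation described above can be reproduced in isolation. The following is a minimal sketch, not the Firefox code: the names ProbeResult and RunProbeWithEarlyReap are hypothetical, and a wildcard waitpid(-1) stands in for whatever other code reaps the child first. It shows that a targeted waitpid then fails with ECHILD while the data the child wrote is still readable from the pipe.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical result type for this sketch; not taken from the Firefox source.
struct ProbeResult {
    bool waitFailedWithEchild = false;  // targeted waitpid failed with ECHILD
    std::string pipeData;               // bytes the child wrote before exiting
};

ProbeResult RunProbeWithEarlyReap() {
    ProbeResult result;
    int fds[2];
    if (pipe(fds) != 0) {
        return result;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return result;
    }
    if (pid == 0) {
        // Child: write a short report, then exit (async-signal-safe calls only).
        close(fds[0]);
        const char msg[] = "VENDOR\nX.Org\n";
        ssize_t written = write(fds[1], msg, sizeof(msg) - 1);
        _exit(written == static_cast<ssize_t>(sizeof(msg) - 1) ? 0 : 1);
    }
    assert(pid > 0);  // parent from here on
    close(fds[1]);

    // Stand-in for unrelated code that reaps any child of this process.
    int status = 0;
    pid_t reaped;
    do {
        reaped = waitpid(-1, &status, 0);
    } while (reaped < 0 && errno == EINTR);

    // The targeted wait now finds no such child and reports ECHILD.
    pid_t waited;
    do {
        waited = waitpid(pid, &status, 0);
    } while (waited < 0 && errno == EINTR);
    result.waitFailedWithEchild = (waited < 0 && errno == ECHILD);

    // The pipe buffer outlives the child, so the report is still readable.
    char buf[256];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof(buf))) > 0) {
        result.pipeData.append(buf, static_cast<size_t>(n));
    }
    close(fds[0]);
    return result;
}
```

The sketch assumes the calling process has no other children; otherwise the wildcard wait could reap a different one.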
Comment 16 Benoit Jacob [:bjacob] (mostly away) 2012-01-29 20:07:24 PST
Created attachment 592597
tolerate ECHILD

A good reason to r- this would be if there were a good reason to fear that in case of ECHILD error, we could actually fail to get the data from the pipe. The premise of this patch is that ECHILD just means that the process was already reaped but we still get the data from the pipe.
Comment 17 Benoit Jacob [:bjacob] (mostly away) 2012-01-29 20:20:04 PST
Created attachment 592598
tolerate ECHILD as long as reading from the pipe succeeded

This is better: we only tolerate ECHILD if reading from the pipe succeeded. If reading from the pipe failed, we report explicitly about it in AppNotes (and remove a wishful comment about it being impossible) and we still consider ECHILD a waitpid failure, so we also still report about that in AppNotes.
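
The rule described in this comment can be written as a small decision function. This is only a sketch of the logic as stated here, not the patch itself: GlxtestOutcome, ClassifyGlxtestExit and the note strings are hypothetical names, and the real patch reports through AppNotes rather than returning a struct.

```cpp
#include <cassert>
#include <cerrno>
#include <string>

// Hypothetical outcome type for this sketch; not taken from the patch.
struct GlxtestOutcome {
    bool usable = true;  // may the probe data be trusted?
    std::string note;    // text that would go to the crash-report notes
};

GlxtestOutcome ClassifyGlxtestExit(bool readSucceeded,
                                   bool waitpidFailed,
                                   int waitpidErrno) {
    // A failed waitpid is expected to carry a nonzero errno.
    assert(!waitpidFailed || waitpidErrno != 0);

    GlxtestOutcome out;
    if (!readSucceeded) {
        out.usable = false;
        out.note += "glxtest: reading from the pipe failed; ";
    }

    // ECHILD is tolerated only when the data is already in hand.
    bool tolerated = waitpidFailed && waitpidErrno == ECHILD && readSucceeded;
    if (waitpidFailed && !tolerated) {
        out.usable = false;
        out.note += "glxtest: waitpid failed with errno=" +
                    std::to_string(waitpidErrno) + "; ";
    }
    return out;
}
```

With this shape, a failed read combined with ECHILD produces both notes, matching the behaviour described above.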
Comment 18 Benoit Jacob [:bjacob] (mostly away) 2012-01-30 15:34:11 PST
Created attachment 592894
tolerate ECHILD, and only write to the pipe at the very end of glxtest

Per IRC discussion.
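
The IRC discussion is not recorded in this bug, so the following is only one plausible reading of "only write to the pipe at the very end": the child buffers its report and emits it only at the end, so that any data in the pipe implies the probe ran to completion. The names Append, RunChildProbe and CollectProbeOutput, and the sample report text, are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Append text to a fixed buffer without allocating (safe after fork).
static size_t Append(char* buf, size_t cap, size_t len, const char* text) {
    assert(len <= cap);
    size_t n = std::strlen(text);
    if (n > cap - len) n = cap - len;
    std::memcpy(buf + len, text, n);
    return len + n;
}

// Child side: build the whole report first, write it only at the end.
static void RunChildProbe(int writeFd, bool crashMidway) {
    char report[256];
    size_t len = 0;
    len = Append(report, sizeof(report), len, "VENDOR\nX.Org\n");
    if (crashMidway) _exit(1);  // nothing has reached the pipe yet
    len = Append(report, sizeof(report), len, "RENDERER\nExample\n");

    size_t off = 0;
    while (off < len) {
        ssize_t n = write(writeFd, report + off, len - off);
        if (n <= 0) _exit(1);
        off += static_cast<size_t>(n);
    }
    _exit(0);
}

// Parent side: returns whatever the child managed to write.
std::string CollectProbeOutput(bool crashMidway) {
    int fds[2];
    if (pipe(fds) != 0) return "";
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return "";
    }
    if (pid == 0) {
        close(fds[0]);
        RunChildProbe(fds[1], crashMidway);
    }
    close(fds[1]);

    std::string data;
    char buf[256];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof(buf))) > 0) {
        data.append(buf, static_cast<size_t>(n));
    }
    close(fds[0]);

    int status = 0;
    waitpid(pid, &status, 0);  // may fail with ECHILD; the data decides
    return data;
}
```

Under this reading, the parent sees either a complete report or none at all, which makes "the pipe read succeeded" a reasonable signal to rely on when waitpid reports ECHILD.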
Comment 19 Karl Tomlinson (:karlt) 2012-01-30 18:54:52 PST
Comment on attachment 592894
tolerate ECHILD, and only write to the pipe at the very end of glxtest

>+            } else {
>+                // Bug 718629
>+                // ECHILD happens when the glxtest process got reaped by a PR_WaitProcess
>+                // as per bug 227246. This shouldn't matter, as we still seem to get the data

The first sentence here isn't quite accurate.  This happens even without PR_WaitProcess being called.
It happens from WaitPidDaemonThread, which will be run after PR_CreateProcess is called.
I'd suggest "... got reaped after a PR_CreateProcess as per ..."

r+ with that touch-up.
Comment 20 Benoit Jacob [:bjacob] (mostly away) 2012-01-31 12:33:25 PST
http://hg.mozilla.org/integration/mozilla-inbound/rev/d550ac61fc59

Thanks for the comment on the comment.
Comment 21 Benoit Jacob [:bjacob] (mostly away) 2012-01-31 12:37:06 PST
Comment on attachment 592894
tolerate ECHILD, and only write to the pipe at the very end of glxtest

[Approval Request Comment]
Regression caused by (bug #): don't know, but it must be 1 or 2 months old
User impact if declined: WebGL will be wrongly blacklisted half the time, intermittently, on linux (desktop, not android)
Testing completed (on m-c, etc.): just landed it on m-c.
Risk to taking this patch (and alternatives if risky): not risky. This will not cause us to wrongly whitelist a driver that should be blacklisted.
String changes made by this patch: None
Comment 22 Ed Morley [:emorley] 2012-02-01 11:23:59 PST
https://hg.mozilla.org/mozilla-central/rev/d550ac61fc59
Comment 23 Karl Tomlinson (:karlt) 2012-02-01 13:48:18 PST
*** Bug 678372 has been marked as a duplicate of this bug. ***
Comment 24 Alex Keybl [:akeybl] 2012-02-02 07:10:35 PST
Comment on attachment 592894
tolerate ECHILD, and only write to the pipe at the very end of glxtest

[Triage Comment]
Prevents WebGL from being blacklisted in certain circumstances. Approved for Aurora 12.
Comment 25 Benoit Jacob [:bjacob] (mostly away) 2012-02-09 05:14:45 PST
http://hg.mozilla.org/releases/mozilla-aurora/rev/97e11c38e51b
Comment 26 Benoit Jacob [:bjacob] (mostly away) 2012-03-15 11:43:30 PDT
*** Bug 721343 has been marked as a duplicate of this bug. ***
Comment 27 gk.bts1 2012-03-18 06:25:23 PDT
I am still seeing this issue even with firefox version 11:

Adapter Description GLXtest process failed (waitpid failed): VENDOR
NVIDIA Corporation
RENDERER
GeForce GTX 460/PCIe/SSE2
VERSION
4.2.0 NVIDIA 295.20
TFP
TRUE

WebGL Renderer Blocked for your graphics card because of unresolved driver issues.
GPU Accelerated Windows 0. Blocked for your graphics driver version. Try updating your graphics driver to version <Anything with EXT_texture_from_pixmap support> or newer.

But I can't reproduce it if I start it as a different user with a clean profile (I haven't tried this with earlier versions of firefox, that is < 11). According to the wiki page this driver version should be whitelisted.
Comment 28 Benoit Jacob [:bjacob] (mostly away) 2012-03-18 10:07:49 PDT
gk.bts1: this is strange; please file a new bug, as this has to be a different issue. To determine whether we should worry about this, we should check crash-stats data for what percentage of Firefox 11+ users get this 'waitpid failure'.
Comment 29 Karl Tomlinson (:karlt) 2012-03-18 15:42:58 PDT
This bug has been fixed for version 12, but not version 11 unfortunately.
Comment 30 gk.bts1 2012-03-22 09:00:07 PDT
Thank you for the hint - I was subscribed to a duplicate of this bug and the original submitter says it has been fixed for them with version 11 and I naively believed it was included, but now after checking the code I can see it's not there yet.  Will retest with version 12.  Sorry for the noise and keep up the good work.
Comment 31 Virgil Dicu [:virgil] [QA] 2012-04-04 08:40:17 PDT
Are there any reliable STR which could be used in order to verify this issue?
Comment 32 Benoit Jacob [:bjacob] (mostly away) 2012-04-04 09:52:41 PDT
Not easy to QA: this never was 100% reproducible.
