Closed Bug 786383 Opened 8 years ago Closed 8 years ago

OS crashes when opening a page

Categories

(Core :: Graphics, defect)

x86_64
Linux
defect
Not set
critical

Tracking

RESOLVED FIXED
mozilla18

People

(Reporter: baron, Assigned: bjacob)

Details

(Keywords: crash)

Attachments

(4 files)

Occasionally, and especially soon after resuming after suspend, Minefield crashes and takes the whole OS with it. (On another computer, I think it just kills the X server, but I'm not sure it is the same issue.)

I realize that a report like this is almost useless, but I am bothering to report it because I did manage to find some relevant information from /var/log/messages, Xorg.0.log.old, and .xsession-errors.old. Of course most of this looks like problems with the video system, but I am reporting it here because this problem happens only when I open a page with Minefield (and once when I closed a tab, but I'm not sure about that).

After I reboot, I can always open the same page that "caused" the crash with no problem.
Minefield no longer exists. It's now called Nightly: http://nightly.mozilla.org/
Can you provide a valid stack trace (see https://developer.mozilla.org/docs/How_to_get_a_stacktrace_for_a_bug_report)?
Severity: normal → critical
Keywords: crash
(In reply to Scoobidiver from comment #1)
> Minefield no longer exists. It's now called Nightly:
> http://nightly.mozilla.org/

I guess I knew this.

> Can you provide a valid stack trace (see
> https://developer.mozilla.org/docs/How_to_get_a_stacktrace_for_a_bug_report)?

I have already looked, and it seems that the answer is no. The crash is of the entire operating system, before the stacktrace is made. And of course I cannot reliably replicate the crash. Things work most of the time. And it seems that I have abrt running:

root@barber ~ > systemctl list-units | grep abrtd
abrtd.service             loaded active running       ABRT Automated Bug Reporting Tool

So I guess the crash is before this too.

Any other ideas?
Summary: crashes when opening a page → OS crashes when opening a page
Additional information. This seems like the same bug.
http://lists.freedesktop.org/archives/dri-devel/2012-June/024091.html

It does seem to happen reliably when I start Nightly after a resume. I'm not sure that this is the only time it happens.

And the crash does not happen until I start Nightly. That is the only thing that triggers it.
Component: General → Graphics
Product: Firefox → Core
First we need to know what driver you're using. From the log it appears to be Nouveau. Can you give more details: what exact Nouveau and Mesa versions? Do you have the Nouveau GL driver installed or only 2D? If you have GL, try:

    glxinfo | egrep vendor\|renderer\|version

Does your crash reproduce with default preferences (i.e. in a clean profile) or did you toggle some preference, such as layers.acceleration.force-enabled?
(In reply to Benoit Jacob [:bjacob] from comment #4)
> First we need to know what driver you're using. From the log it appears to
> be Nouveau. Can you give more details: what exact Nouveau and Mesa versions?

from lspci:
nVidia Corporation G86 [Quadro NVS 290] (rev a1) (prog-if 00 [VGA controller]) Subsystem: nVidia Corporation Device 0492

> Do you have the Nouveau GL driver installed or only 2D? If you have GL, try:
> 
>     glxinfo | egrep vendor\|renderer\|version

baron@barber ~ > glxinfo | egrep vendor\|renderer\|version
server glx vendor string: SGI
server glx version string: 1.4
client glx vendor string: Mesa Project and SGI
client glx version string: 1.4
GLX version: 1.4
OpenGL vendor string: nouveau
OpenGL renderer string: Gallium 0.4 on NV86
OpenGL version string: 2.1 Mesa 8.0.3
OpenGL shading language version string: 1.20

(I don't know whether this answers your question.)
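
For anyone collecting the same details from several machines, the fields that the egrep filter keeps can be pulled into a dictionary. This is a minimal sketch, assuming glxinfo prints "key: value" lines as in the output above; parse_glxinfo is a hypothetical helper name, not part of any tool mentioned in this bug.

```python
import re

def parse_glxinfo(text):
    """Collect the 'key: value' lines that the egrep filter above would keep."""
    fields = {}
    for line in text.splitlines():
        # Same alternation as: egrep vendor\|renderer\|version
        if not re.search(r"vendor|renderer|version", line):
            continue
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields

# Sample copied from the glxinfo output pasted in this comment.
SAMPLE = """\
server glx vendor string: SGI
server glx version string: 1.4
client glx vendor string: Mesa Project and SGI
client glx version string: 1.4
GLX version: 1.4
OpenGL vendor string: nouveau
OpenGL renderer string: Gallium 0.4 on NV86
OpenGL version string: 2.1 Mesa 8.0.3
OpenGL shading language version string: 1.20
"""

info = parse_glxinfo(SAMPLE)
print(info["OpenGL vendor string"], "/", info["OpenGL version string"])
```

The vendor and version strings are what identify the driver (Nouveau) and the Mesa release (8.0.3) asked about above.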
 
> Does your crash reproduce with default preferences (i.e. in a clean profile)
> or did you toggle some preference, such as layers.acceleration.force-enabled?

I find that I can now replicate the crash reliably. I close Nightly, suspend the computer, wake up the computer, and try to start Nightly. That does it. Other things may do it too, but this is reliable. I just did this with a completely new profile. So I did not toggle any preferences.

I also checked ps and lsmod to see if anything was different before or after suspend - thinking that maybe "resume" was the problem - and was unable to find anything. But clearly "resume" is not working properly. On the other hand, I have been suspending this computer every night for several years with no problem. Anything else I should check along these lines?
https://bugs.archlinux.org/task/31338
seems to be the same bug

And it links to this, which is also the same:
http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/01611.html
Thanks.

Can you try this: go to about:config, set gfx.xrender.enabled to false, restart the browser. Does the problem persist?

Also, very recently (yesterday) a fix landed that changes some OpenGL stuff that Firefox does unconditionally on startup to detect system information. That was done to avoid X server crashes on certain drivers. See bug 680644. Please retry with today's Nightly build, as it should be the first build with the fix.
(In reply to Benoit Jacob [:bjacob] from comment #7)
> Thanks.
> 
> Can you try this: go to about:config, set gfx.xrender.enabled to false,
> restart the browser. Does the problem persist?

Yes.

> Also, very recently (yesterday) a fix landed that changes some OpenGL stuff
> that Firefox does unconditionally on startup to detect system information.
> That was done to avoid X server crashes on certain drivers. See bug 680644.

This looks pretty different, unfortunately.

> Please retry with today's Nightly build, as it should be the first build
> with the fix.

That didn't help either. I waited until I thought the new version (8-30) came out and tried this first, before changing gfx.xrender.enabled. Neither change helped.
OK.

The only way we could understand more about this problem is by using some system tracing tool to record a trace of what happens last just before the system crash.

2 things come to mind:

 - strace can help you record low-level system calls

 - I would also look into recording X11 activity: try to see if there exists some X11 tracing tool out there.
I replied already, then "added" the attachment, but now the reply did not show up. I can never figure out how to do this correctly.

(In reply to Benoit Jacob [:bjacob] from comment #9)
> OK.
> 
> The only way we could understand more about this problem is by using some
> system tracing tool to record a trace of what happens last just before the
> system crash.
> 
> 2 things come to mind:
> 
>  - strace can help you record low-level system calls

I attached the output of strace. The crash happened as usual, and I let it run through what seemed to be two attempts to restart X11.

>  - I would also look into recording X11 activity: try to see if there exists
> some X11 tracing tool out there.

I looked and could not find one.
The only relevant thing I see in the strace is this at the end:

  write(2, "firefox: Fatal IO error 0 (Succe"..., 52) = 52

But this is only writing an error message about an IO error, not the IO error itself; I was hoping that the IO error itself would show in strace but that's not the case.
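
To find lines like this one in a long strace log, something along these lines may help. It is only a sketch: it assumes the plain strace line format shown above (no PID or timestamp prefixes), and stderr_writes is a hypothetical helper name.

```python
import re

# Matches strace lines of the form: write(2, "text"..., LEN) = RET
STDERR_WRITE = re.compile(
    r'write\(2, "(?P<text>.*?)"(?P<cut>\.\.\.)?, (?P<len>\d+)\)\s*=\s*(?P<ret>-?\d+)'
)

def stderr_writes(lines):
    """Return (text, was_truncated, return_value) for each write to fd 2."""
    found = []
    for line in lines:
        m = STDERR_WRITE.search(line)
        if m:
            found.append((m.group("text"), m.group("cut") is not None, int(m.group("ret"))))
    return found

# The line quoted above, copied from the attached strace output.
LOG = ['write(2, "firefox: Fatal IO error 0 (Succe"..., 52) = 52']
print(stderr_writes(LOG))
```

As noted above, this only locates the error message being written; the failing IO operation itself does not appear in the strace.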

A Google search for X11 trace gave this:
http://xtrace.alioth.debian.org/
(In reply to Benoit Jacob [:bjacob] from comment #12)

> A Google search for X11 trace gave this:
> http://xtrace.alioth.debian.org/

Sorry, I did the Google search but missed this one. (There were others that seemed not to have a way of installing on my system.)

I will attempt to attach the output of
xtrace /home/baron/firefox/firefox > xtrace-output
up to the time of the crash, in the next comment.
output of xtrace /home/baron/firefox/firefox > xtrace-output
I discovered that abrt (automatic bug reporting tool in Fedora 17) has actually been working, sort of. It saves a lot of information but to my knowledge does not actually report anything. I thought that it wasn't working at all. So I have a whole bunch of information from the last crash, and I wonder if any of the following might be relevant (before I just send it all). The information seems to be about a crash in Xorg and does not mention firefox.

abrt_version  component                   executable  package    pkg_release  usr_share_xorg_conf_d.tar.gz
analyzer      count                       hostname    pkg_arch   pkg_version  uuid
architecture  duphash                     kernel      pkg_epoch  reason       Xorg.0.log
backtrace     etc_X11_xorg_conf_d.tar.gz  os_release  pkg_name   time

#####

Another thought: Why does this happen after suspend/resume? Clearly there is another bug, not in Firefox, that is causing the resume to be incomplete in some way. I have looked quite a bit and found nothing so far. For example, all the same kernel modules are loaded before and after suspend/resume. All the same processes are running. The output of "systemctl list-units" is the same.

(Still, Firefox is the only thing that causes the crash, and note that others have reported the same problem, although not here.)
> abrt_version  component                   executable  package    pkg_release
> usr_share_xorg_conf_d.tar.gz
> analyzer      count                       hostname    pkg_arch   pkg_version
> uuid
> architecture  duphash                     kernel      pkg_epoch  reason     
> Xorg.0.log
> backtrace     etc_X11_xorg_conf_d.tar.gz  os_release  pkg_name   time

This looks mostly like "column titles" of a table without the table contents that should follow, except for the X-related filenames, which do point to an issue in X.


> 
> #####
> 
> Another thought: Why does this happen after suspend/resume? Clearly there is
> another bug, not in Firefox, that is causing the resume to be incomplete in
> some way. I have looked quite a bit and found nothing so far. For example,
> all the same kernel modules are loaded before and after suspend/resume. All
> the same processes are running. The output of "systemctl list-units" is the
> same.

No idea, but driver/X bugs on suspend/resume are not rare; I have some right here with the proprietary NVIDIA driver.
(In reply to Benoit Jacob [:bjacob] from comment #16)
> > abrt_version  component                   executable  package    pkg_release
> > usr_share_xorg_conf_d.tar.gz
> > analyzer      count                       hostname    pkg_arch   pkg_version
> > uuid
> > architecture  duphash                     kernel      pkg_epoch  reason     
> > Xorg.0.log
> > backtrace     etc_X11_xorg_conf_d.tar.gz  os_release  pkg_name   time
> 
> This looks mostly like "column titles" of a table without the table
> contents that should follow, except for the X-related filenames, which do
> point to an issue in X.

Sorry for not being clear. This is a listing of a directory. The file backtrace, for example, looks like this:

0: /usr/bin/Xorg (xorg_backtrace+0x36) [0x4652a6]
1: /usr/bin/Xorg (mieqEnqueue+0x26b) [0x5514ab]
2: /usr/bin/Xorg (0x400000+0x47f02) [0x447f02]
3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7f4421dbd000+0x60e4) [0x7f4421dc30e4]
4: /usr/bin/Xorg (0x400000+0x80787) [0x480787]
5: /usr/bin/Xorg (0x400000+0xa4a80) [0x4a4a80]
6: /lib64/libpthread.so.0 (0x38c4600000+0xefe0) [0x38c460efe0]
7: /lib64/libc.so.6 (ioctl+0x7) [0x38c3eea2f7]
8: /lib64/libdrm.so.2 (drmIoctl+0x28) [0x38ddc03548]
9: /lib64/libdrm.so.2 (drmCommandWrite+0x1b) [0x38ddc0577b]
10: /lib64/libdrm_nouveau.so.1 (0x7f4425b77000+0x3085) [0x7f4425b7a085]
11: /lib64/libdrm_nouveau.so.1 (nouveau_bo_map_range+0x103) [0x7f4425b7a6b3]
12: /usr/lib64/xorg/modules/drivers/nouveau_drv.so (0x7f4425d99000+0x6718) [0x7f4425d9f718]
13: /usr/lib64/xorg/modules/libexa.so (0x7f4424cf2000+0xb007) [0x7f4424cfd007]
14: /usr/bin/Xorg (0x400000+0x160483) [0x560483]
15: /usr/bin/Xorg (0x400000+0xc9d50) [0x4c9d50]
16: /usr/bin/Xorg (0x400000+0xfa8da) [0x4fa8da]
17: /usr/bin/Xorg (0x400000+0x3444a) [0x43444a]
18: /usr/bin/Xorg (0x400000+0x23485) [0x423485]
19: /lib64/libc.so.6 (__libc_start_main+0xf5) [0x38c3e21735]
20: /usr/bin/Xorg (0x400000+0x2375d) [0x42375d]
Ah, OK. I still don't see anything more there than "this is an issue with X".

But your xtrace output has more precise information: the last lines are:

001:<:001d: 16: Request(98): QueryExtension name='XFIXES'
001:>:001d:32: Reply to QueryExtension: present=true(0x01) major-opcode=147 first-event=98 first-error=158
001:<:001e: 12: XFIXES-Request(147,0): QueryVersion major version=5 minor version=0
001:>:001e:32: Reply to QueryVersion: major version=5 minor version=0
001:<:001f: 16: XFIXES-Request(147,5): CreateRegion region=0x02c00004 rectangles={x=0 y=0 w=16 h=16};
001:<:0020: 20: DRI2-Request(137,6): CopyRegion drawable=0x02c00002 region=0x02c00004 dest=FrontLeft(0x00000000) src=FakeFrontLeft(0x00000007)
001:>:0020:32: Reply to CopyRegion: 
001:<:0021:  8: XFIXES-Request(147,10): DestroyRegion region=0x02c00004
001:<:0022:  8: GLX-Request(153,4): glXDestroyContext context=0x02c00003
001:<:0023:  8: Request(4): DestroyWindow window=0x02c00002
001:<:0024:  8: Request(79): FreeColormap cmap=0x02c00001
001:<:0025:  8: Request(60): FreeGC gc=0x02c00000
001:<:0026:  4: Request(43): GetInputFocus 
001:>:0026:32: Reply to GetInputFocus: revert-to=Parent(0x02) focus=0x01e00037

*EOF*

As these are the last lines, it seems that the problem has something to do with the XFIXES extension. Could you try disabling it (maybe it's an option in xorg.conf)?
Also, the other question is why is GLX being used there? The preceding lines in the log, with MOZILLA_COMMANDLINE, show that we are running our X11 initialization code, which can be either
  http://mxr.mozilla.org/mozilla-central/source/widget/xremoteclient/XRemoteClient.cpp
or
  http://mxr.mozilla.org/mozilla-central/source/toolkit/components/remote/nsGTKRemoteService.cpp
This doesn't use GLX.

Are you by any chance using an OpenGL-based compositing window manager? Could it be what's using XFIXES and GLX here, resulting in the crash? Could you try checking if the crash reproduces with a non-OpenGL-compositing window manager?
xtrace run in this way wouldn't catch what the window manager was doing.
Mesa uses XFIXES with DRI2.

Note that there are two X client connections in the log.
000 is the main Firefox process doing the remote communication.
001, with the last requests, is glxtest.cpp.

(CopyRegion seems to be a result of dri2FlushFrontBuffer, which I guess happens when finished with the GL context.)

Attachment 656114 [details] and "Fatal IO error" look like an X server crash, but the stack in comment 17 may be a hang.

It does look like a problem with the graphics card/driver and suspend.
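
To see which client sent what last, the xtrace log can be split by the connection number at the start of each line. A minimal sketch, assuming the "conn:direction:sequence:length: text" layout visible in the quoted lines, with "<" marking client requests; last_requests is a hypothetical helper, and lines in any other layout are simply skipped.

```python
import re

# conn : direction : sequence (hex) : length : text
LINE = re.compile(
    r"^(?P<conn>\d+):(?P<dir>[<>]):(?P<seq>[0-9a-fA-F]+):\s*(?P<size>\d+):\s*(?P<body>.*)$"
)

def last_requests(lines):
    """Map each connection number to the last request that client sent."""
    last = {}
    for line in lines:
        m = LINE.match(line)
        if m and m.group("dir") == "<":  # "<" is client to server
            last[m.group("conn")] = m.group("body").strip()
    return last

# Final lines of the attached xtrace output, as quoted earlier in this bug.
TAIL = [
    "001:<:0022:  8: GLX-Request(153,4): glXDestroyContext context=0x02c00003",
    "001:<:0023:  8: Request(4): DestroyWindow window=0x02c00002",
    "001:<:0024:  8: Request(79): FreeColormap cmap=0x02c00001",
    "001:<:0025:  8: Request(60): FreeGC gc=0x02c00000",
    "001:<:0026:  4: Request(43): GetInputFocus ",
    "001:>:0026:32: Reply to GetInputFocus: revert-to=Parent(0x02) focus=0x01e00037",
]

print(last_requests(TAIL))
```

Run over the whole log, this would give one entry for connection 000 (the remote communication) and one for 001 (glxtest.cpp).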
Does running glxinfo (instead of firefox) after resume cause similar symptoms?
(In reply to Karl Tomlinson (:karlt) from comment #20)
> xtrace run in this way wouldn't catch what the window manager was doing.
> Mesa uses XFIXES with DRI2.
> 
> Note that there are two X client connections in the log.
> 000 is the main Firefox process doing the remote communication.
> 001, with the last requests, is glxtest.cpp.

Oh! Got it now. Thanks.

(In reply to Karl Tomlinson (:karlt) from comment #21)
> Does running glxinfo (instead of firefox) after resume cause similar
> symptoms?

That is indeed what I'd like to know, as glxtest.cpp does almost the same as glxinfo.
(In reply to Benoit Jacob [:bjacob] from comment #22)
> > Does running glxinfo (instead of firefox) after resume cause similar
> > symptoms?
> 
> That is indeed what I'd like to know, as glxtest.cpp does almost the same as
> glxinfo.

Yes. glxinfo causes the crash. It doesn't look quite the same, but it has all the main features. After a reboot, glxinfo and firefox both work fine.

I suppose the workaround is to turn off glx. Maybe I don't need it anyway.

I'd be interested to see if this is in fact a bug in firefox. Probably not, since Firefox no longer uniquely causes the crash.
Thanks, that was very helpful.

In case you're interested in what's happening: on startup, Firefox uses GLX to query information about the driver, to check if enabling certain graphics features is safe. This is what we called "glxtest" above. This is very similar to the glxinfo program. The fact that both show similar symptoms strongly suggests that this is a driver bug rather than an issue in either. It also means that GLX is really broken on this system (glxinfo is among the simplest and most common GLX-using programs) so you're better off disabling GLX anyway. An easy way to do that is to uninstall the Nouveau OpenGL driver (but do keep the Nouveau 2D driver). Alternatively you can also disable GLX in an xorg.conf file.

It would be interesting to debug this driver bug, but I don't know how to do that. Now that it reproduces with glxinfo, that is a much better testcase than Firefox. If you have the time, you could file a bug at

  https://bugs.freedesktop.org/enter_bug.cgi?product=Mesa

and select Drivers/DRI/Nouveau for Component.
On our side, it's hard to do anything as we crash precisely while trying to get the information that would tell us if we need to do something special.

What we could do though is check if glxtest crashed, and in that case permanently disable it (and all the features that depend on it). The downside is that we would in this case no longer automatically re-enable these features when the driver bug gets fixed.
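
One possible shape for such a check, sketched in Python rather than in Firefox's actual C++. Because the whole system can go down together with the probe, the sketch writes a marker file before probing and removes it only on success, so a marker left over at the next start means the previous probe never finished. guarded_probe, the marker file name, and the stand-in commands are all hypothetical; this is not how glxtest is implemented.

```python
import os
import subprocess
import sys
import tempfile

def guarded_probe(argv, marker_path, timeout=10.0):
    """Run a risky probe at most once per leftover marker.

    Returns "skipped" if an earlier probe never completed, "failed" if this
    one exits abnormally or hangs, and "ok" otherwise.
    """
    if os.path.exists(marker_path):
        return "skipped"
    # Record that a probe is in progress and push it to disk first,
    # since the machine may not survive the probe.
    with open(marker_path, "w") as f:
        f.write("probe started\n")
        f.flush()
        os.fsync(f.fileno())
    try:
        rc = subprocess.run(argv, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return "failed"  # marker stays in place
    if rc != 0:
        return "failed"  # marker stays in place
    os.remove(marker_path)
    return "ok"

# Demo with a harmless stand-in command instead of a real GLX probe.
demo_marker = os.path.join(tempfile.mkdtemp(), "glx-probe-in-progress")
print(guarded_probe([sys.executable, "-c", "pass"], demo_marker))
```

The downside described above is visible here too: once the marker is left behind, nothing re-enables the probe after the driver is fixed unless the marker is removed by hand.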
(In reply to Benoit Jacob [:bjacob] from comment #25)
> On our side, it's hard to do anything as we crash precisely while trying to
> get the information that would tell us if we need to do something special.
> 
> What we could do though is check if glxtest crashed, and in that case
> permanently disable it (and all the features that depend on it). The
> downside is that we would in this case no longer automatically re-enable
> these features when the driver bug gets fixed.

In my opinion, which is not worth much since I'm just a bug reporter, it should be labeled NOTABUG (if that still exists). Now that you all have found what the problem is, the few people who suffer from this problem will do a Google search and find this bug report, which should be sufficient until the actual bug is fixed.

An alternative might be to put something in about:config to disable the use of glx. (I actually looked for such a thing.) Users might have other reasons to do this.
(In reply to Jonathan Baron from comment #26)
> (In reply to Benoit Jacob [:bjacob] from comment #25)
> > On our side, it's hard to do anything as we crash precisely while trying to
> > get the information that would tell us if we need to do something special.
> > 
> > What we could do though is check if glxtest crashed, and in that case
> > permanently disable it (and all the features that depend on it). The
> > downside is that we would in this case no longer automatically re-enable
> > these features when the driver bug gets fixed.
> 
> In my opinion, which is not worth much since I'm just a bug reporter, it
> should be labeled NOTABUG (if that still exists).

That would be INVALID. We will do that unless we decide to do something here.

> Now that you all have
> found what the problem is, the few people who suffer from this problem will
> do a Google search and find this bug report, which should be sufficient
> until the actual bug is fixed.

Unfortunately, most users react differently when their browser repeatedly crashes on startup: they switch browsers. On the other hand, a good reason NOT to do as I proposed in comment 25 is that this will cause permanent degradation on systems that had a one-time issue.

> 
> An alternative might be to put something in about:config to disable the use
> of glx. (I actually looked for such a thing.) Users might have other reasons
> to do this.

We can't easily do this because this GLXtest thing has to run very early during startup, before we start reading preferences. But we could allow disabling it with an environment variable, and that would be a good idea indeed as that at least would have no downside.
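
The idea can be sketched as follows, in Python for brevity. The variable name MOZ_AVOID_OPENGL_ALTOGETHER is taken from the title of the patch attached to this bug; the assumption that merely defining it (with any value) is enough to skip the probe, and the helper name opengl_probe_allowed, are mine and not taken from the patch.

```python
import os

ENV_VAR = "MOZ_AVOID_OPENGL_ALTOGETHER"  # name from the attached patch title

def opengl_probe_allowed(environ=None):
    """Return False when the opt-out variable is defined (assumed semantics)."""
    env = os.environ if environ is None else environ
    return ENV_VAR not in env

# An environment check needs no preferences, so it can run before they are read.
print(opengl_probe_allowed({}))
print(opengl_probe_allowed({ENV_VAR: "1"}))
```

Under that assumption, a user hit by this driver bug would start the browser with the variable set in the environment.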
Comment on attachment 658950 [details] [diff] [review]
MOZ_AVOID_OPENGL_ALTOGETHER env var

Nice and simple.
Attachment #658950 - Flags: review?(karlt) → review+
http://hg.mozilla.org/integration/mozilla-inbound/rev/5c5001289c36
Assignee: nobody → bjacob
Target Milestone: --- → mozilla18
https://hg.mozilla.org/mozilla-central/rev/5c5001289c36
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: in-testsuite-
Resolution: --- → FIXED