Closed Bug 1193855 Opened 9 years ago Closed 6 years ago

Frequent hangs/crashes due to apparent threading error (in Ubuntu 15.04, began with 40.0)

Categories

(Core :: Graphics, defect, P3)

40 Branch
x86_64
Linux
defect

RESOLVED WONTFIX

People

(Reporter: adam, Unassigned)

Details

(Keywords: crash, hang, regression, Whiteboard: [gfx-noted])

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36

Steps to reproduce:

This behavior occurs in normal use and in safe mode and does not appear to be related to any particular activity in the browser. The delay before crashing has been as little as a few seconds or as much as 15+ minutes. This is new behavior following update to version 40.0. OS is Ubuntu 15.04.


Actual results:

See also this automated crash report (it's actually quite difficult in this crash/resume state to make a report, but I did get this one out: https://crash-stats.mozilla.org/report/index/9faaf25e-1bd0-4ef1-a10c-ad6772150812 ).

Browser hangs and fades to gray. It is not responsive to any inputs. On some of the hangs, there also seems to be some odd effect on UI functionality across the desktop (launcher icons not responding until the Firefox process is killed).

Further, on restart Firefox suddenly detected that it was not the default browser, and the search bar handler changed from Google to Yahoo. 

When starting Firefox (safe mode) in debug mode through gdb, this is the terminal readout.  I have also attached a backtrace from gdb.

(gdb) run
Starting program: /usr/lib/firefox/firefox --safe-mode
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
(process:10383): GLib-CRITICAL **: g_slice_set_config: assertion 'sys_page_size == 0' failed
warning: Corrupted shared library list: 0x7fffe8054800 != 0x7ffff6b93800
[New Thread 0x7fffc79fd700 (LWP 10448)]
[Thread 0x7fffc79fd700 (LWP 10448) exited]
[New Thread 0x7fffde1ed700 (LWP 10393)]
[New Thread 0x7fffb22ff700 (LWP 10559)]
[[Firefox keeps running fine. A bunch of other new threads truncated here...]]
[New Thread 0x7ffff7f71700 (LWP 10394)]
[New Thread 0x7fffde9ee700 (LWP 10392)]
[New Thread 0x7fffe9417700 (LWP 10391)]

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7fffde1ed700 (LWP 10393)]
0x00007ffff7bcb2ef in __libc_send (fd=120, buf=buf@entry=0x7fffbe6d0000, 
    n=n@entry=53, flags=flags@entry=0)
    at ../sysdeps/unix/sysv/linux/x86_64/send.c:31
31	../sysdeps/unix/sysv/linux/x86_64/send.c: No such file or directory.



Expected results:

The browser should not have crashed...?
Crash Signature: 9faaf25e-1bd0-4ef1-a10c-ad6772150812
Severity: normal → critical
OS: Unspecified → Linux
Hardware: Unspecified → x86_64
Summary: Firefox hangs at frequent random intervals (possibly on SIGPIPE) after update to 40.0 in Ubuntu 15.04 → Frequent hangs/crashes due to apparent threading error (in Ubuntu 15.04, began with 40.0)
I have now re-created this in beta (41.0b1). I have also tried nightly and am getting crashes, but it is difficult for me to tell if it's the same issue or not, as it is handling the back-end processes differently I believe, and I have not been able to make gdb play as nicely with it.
My best guess at the moment is that the regression is somewhere in this diff (from Launchpad):

https://launchpadlibrarian.net/213839557/firefox_39.0%2Bbuild5-0ubuntu0.15.04.1_40.0%2Bbuild4-0ubuntu0.15.04.1.diff.gz
> it's actually quite difficult in this crash/resume state to make a report, but I did get this one out
How? This should work: https://developer.mozilla.org/en-US/docs/How_to_Report_a_Hung_Firefox#Linux_and_Mac
(I'd like to determine if Firefox hangs and only crashes after you force it to by sending a signal, or if it crashes on its own.)

Try crashing again; if the stack always looks similar, it may be related to OMTC, which was enabled on Linux in v40 (bug 994541, https://mozillagfx.wordpress.com/2015/05/19/off-main-thread-compositing-on-linux/ )

I'm not sure which pref can be used to test with it disabled (perhaps layers.offmainthreadcomposition.enabled?)

Also, please confirm that you use a build from mozilla.org, not the distro-provided build.


If the above suggestions do not help identify the cause of the problem, use this to obtain the regression range: http://mozilla.github.io/mozregression/

> Program received signal SIGPIPE, Broken pipe.
That's not necessarily a crash. Per http://krijnhoetmer.nl/irc-logs/developers/20141126#l-1917 :
> <grobinson> seth: Does Firefox crash often when you're debugging it on Mac? I keep getting SIGPIPE
> <seth> grobinson: so that's actually not a crash; we use SIGPIPE in our IO code
> <seth> grobinson: i have a trick to fix that; just a sec
> <seth> grobinson: (unfortunately masking the signal in .lldbinit does not work for some reason)
> <seth> grobinson: OK, so i set a breakpoint on the function do_main in nsBrowserApp.cpp
> <seth> grobinson: the action type for the breakpoint is "debugger command"
> <seth> grobinson: and the command is "process handle SIGPIPE -n true -p true -s false"
> <seth> grobinson: breakpoints are saved in the project, so if you set this once, you'll solve the problem forever. it'll disable breaking on SIGPIPE every time you run firefox
> <seth> (there's nothing special about do_main, i was just trying to run the breakpoint as early as possible)
> * bz solves this problem by not using lldb
> <bz> and "handle SIGPIPE noprint nostop pass" in gdb
Flags: needinfo?(adam)
> This should work:
> https://developer.mozilla.org/en-US/docs/How_to_Report_a_Hung_Firefox#Linux_and_Mac
> (I'd like to determine if Firefox hangs and only crashes after you force it
> to by sending a signal, or if it crashes on its own.)

> Try crashing again [...]

Thanks for that (and thanks very much for taking an interest in the bug in general!). Here are a couple more manually-triggered crash reports:

https://crash-stats.mozilla.com/report/index/c2e63486-bbf8-4d70-9648-f3f262150816

https://crash-stats.mozilla.com/report/index/26b5c1ea-631a-4a39-a999-3cdfc2150816

> if the stack always looks similar, it may be related to
> OMTC, which was enabled on Linux in v40 (bug 994541,
> https://mozillagfx.wordpress.com/2015/05/19/off-main-thread-compositing-on-linux/ )

> I'm not sure which pref can be used to test with it disabled (perhaps
> layers.offmainthreadcomposition.enabled?)

I did set that to "false", and that session proceeded to crash as usual within a few minutes. Here's a crash report from there:

https://crash-stats.mozilla.com/report/index/25928c42-16be-411a-9d6e-65b612150816

Even after a restart, I still got another crash with that setting on "false":

https://crash-stats.mozilla.com/report/index/bp-0f6443fd-fb64-4a6f-b42b-baafc2150816

> Also, please confirm that you use a build from mozilla.org, not the
> distro-provided build.

I do use the distro-provided builds normally (https://launchpad.net/firefox), which is where the reports are coming from. The beta I tested was also from an Ubuntu PPA here (https://launchpad.net/~mozillateam/+archive/ubuntu/firefox-next) and the nightly here (https://launchpad.net/~ubuntu-mozilla-daily/+archive/ubuntu/ppa). Those Firefox builds hosted on Launchpad state that their bugs are tracked here at Bugzilla, though. 

I downloaded firefox from the Mozilla.org site, which gave me a .tar.bz that seems able to run straight from the folder when extracted. The build ID I saw was identical to the one I run, though: 20150807094836, and all of my configs looked the same. I'm not sure if this is expected behavior or if it just means that I wasn't actually running the version I downloaded but rather just launching the files that my system already associates with that program command. (Running a totally different build from a folder doesn't seem like a problem with things like the Tor Browser Bundle, though -- sorry for my ignorance about how it works.) I tried specifying a download for the en-US 64-bit Linux version, and that .tar.bz has the same MD5 as the one auto-generated for me. But when I launched firefox from *that* extracted folder (in a different, random-ish location on my machine), this time with no already-running instances, it opened with a "checking your add-ons" screen. It froze there, however. Here's that crash:

https://crash-stats.mozilla.com/report/index/b6af6b1e-dd6f-499d-8bb3-4bca52150816

After running it for a time, it went as usual:

https://crash-stats.mozilla.com/report/index/3406f2da-604a-44e2-8dc6-118eb2150816

> If the above suggestions do not help identify the cause of the problem,
> use this to obtain the regression range:
> http://mozilla.github.io/mozregression/

I did start making a run at that the other night, but it's quite difficult work for a couple of reasons. (1) It could be anywhere between 39.0.5 and 40.0.x, which I believe covers quite a few nightlies; and (2) the time to crash and actions to cause a crash are weirdly indeterminate, so testing each nightly in the sequence takes a long time, and it's hard to tell when the "good" build has been found. I will get back on it if necessary, but my hope has been that the traces themselves might provide more direct insight.

> > Program received signal SIGPIPE, Broken pipe.
> That's not necessarily a crash. Per
> http://krijnhoetmer.nl/irc-logs/developers/20141126#l-1917 :
> > <grobinson> seth: Does Firefox crash often when you're debugging it on Mac? I keep getting SIGPIPE
> > <seth> grobinson: so that's actually not a crash; we use SIGPIPE in our IO code
> > <seth> grobinson: i have a trick to fix that; just a sec
> > <seth> grobinson: (unfortunately masking the signal in .lldbinit does not work for some reason)
> > <seth> grobinson: OK, so i set a breakpoint on the function do_main in nsBrowserApp.cpp
> > <seth> grobinson: the action type for the breakpoint is "debugger command"
> > <seth> grobinson: and the command is "process handle SIGPIPE -n true -p true -s false"
> > <seth> grobinson: breakpoints are saved in the project, so if you set this once, you'll solve the problem forever. it'll disable breaking on SIGPIPE every time you run firefox
> > <seth> (there's nothing special about do_main, i was just trying to run the breakpoint as early as possible)
> > * bz solves this problem by not using lldb
> > <bz> and "handle SIGPIPE noprint nostop pass" in gdb

It's possible, then, that those lines were an artifact of my having opened that instance of firefox through gdb?  


Two other things I can add that may not be meaningful:

(1) So far, since this started, I have noticed one sort of use that does not seem to lead to a crash/hang/whatever it is. Sometimes the crash happens right on opening, before I've even been able to finish typing a URL or choose a bookmark (or even answer whether I want the session restored). But assuming that doesn't happen, I sometimes go to www.haxball.com , which is a multiplayer flash game I play through the Pipelight plugin. If I have done / do nothing else in the browser but play that one game, I don't recall it having crashed even after fairly extended periods (maybe 30-60+ minutes). For whatever that's worth. Note that crashes in other contexts can happen during actions as mundane as mousing over a link, so I don't know how much I'd read into it.

(2) Since I have been mostly using Chrome in the meantime, I have noticed a few similar (but maybe not identical, and much more rare) crashes with Chrome. I don't have debugging symbols for gdb, and I haven't been able to run the nacl-gdb version (probably because of sandboxing), but when I attach gdb to a Chrome process post-crash, I get something like this as a trace:

#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f5801088b58 in ?? ()
#2  0x000002018acf5020 in ?? ()
#3  0x0000020189097150 in ?? ()
#4  0x00007ffc0afa18ee in ?? ()
#5  0x0000000000000000 in ?? ()

Again, maybe/probably nothing, but I'm throwing darts at whatever.
Flags: needinfo?(adam)
>that session proceeded to crash as usual within a few minutes
I believe you're using the verb "crash" to refer to the "hangs and fades to gray" state you've described, not the death of the firefox process and/or appearance of the Crash reporter window. Is that correct?
The common term here on b.m.o for the former case is "hang", and for the latter, "crash".

>Those Firefox builds hosted on Launchpad state that their bugs are tracked here at Bugzilla, though. 
Where do they state that? I believe the Ubuntu mozilla team uses their own tracker.

>It's possible, then, that those lines were an artifact of my having opened that instance of firefox through gdb?
That's my guess. The IRC log I pasted has instructions on disabling this behavior in gdb (and/or just try continuing, I think it's "c" in gdb).
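As a minimal sketch of that gdb suggestion (an illustration, not something posted in the thread): the "handle SIGPIPE noprint nostop pass" line from the IRC log can be put in a small command file and loaded with gdb's -x option, so that gdb passes SIGPIPE through to Firefox without stopping on it. The helper name debug_firefox and the temporary command file are hypothetical.

```shell
# Write the gdb command quoted in the IRC log into a throwaway command file.
gdb_cmds="$(mktemp)"
printf '%s\n' 'handle SIGPIPE noprint nostop pass' > "$gdb_cmds"

# Hypothetical helper: start gdb on a Firefox binary with that file loaded.
# "$1" is the command file, "$2" is the Firefox binary; any further
# arguments are passed on to Firefox (for example --safe-mode).
debug_firefox() {
    cmd_file="$1"
    firefox_bin="$2"
    shift 2
    gdb -x "$cmd_file" --args "$firefox_bin" "$@"
}
```

With the path from comment 0, this would be called as: debug_firefox "$gdb_cmds" /usr/lib/firefox/firefox --safe-mode (then "run" at the gdb prompt).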

>I downloaded firefox from the Mozilla.org site
I couldn't figure out what you were trying to describe in this paragraph.

https://support.mozilla.org/en-US/kb/install-firefox-linux#firefox:linux:fx40 (starting from "2. Open a Terminal and go to your home directory: cd ~ ") has the instructions to install a build from tar.gz. Note that you must ensure no firefox processes are running, and run "path/to/firefox", not just "firefox".

(BTW, the Ubuntu builds, I believe, clearly identify themselves as such in the About dialog. At least the one that came with my Ubuntu install does.)

Your settings are stored in a "profile" and are reused across different versions of Firefox. It might be a good idea to test with a clean profile too: https://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles

A sure-fire way to run an isolated Firefox instance of a specific version is:
path/to/firefox -no-remote -profile /absolute/path/to/an/empty/dir
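A rough sketch of that command with a freshly created empty directory, assuming a shell with mktemp available. The helper name run_isolated is hypothetical, and the binary path in the usage line is only a placeholder for wherever the mozilla.org tarball was extracted.

```shell
# Create a brand-new, empty profile directory; mktemp -d prints its path.
profile_dir="$(mktemp -d)"

# Hypothetical helper: run the given Firefox binary against that profile only.
# "$1" is the path to the extracted firefox binary, "$2" the profile directory.
run_isolated() {
    "$1" -no-remote -profile "$2"
}
```

For example: run_isolated "$HOME/firefox/firefox" "$profile_dir". Calling the binary by explicit path avoids starting whatever "firefox" resolves to on the system.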

> http://mozilla.github.io/mozregression/
Did you succeed in reproducing the problem at least in one build launched with mozregression? It downloads/runs the mozilla.org builds.
Flags: needinfo?(adam)
And perhaps, someone from the graphics team can chime in with further suggestions.

Short summary (for details see comment 0): Firefox on Linux "hangs and fades to gray" starting with version 40. Force-crashing it in this state always has this stack (e.g. bp-3406f2da-604a-44e2-8dc6-118eb2150816):

0   libpthread-2.21.so  libpthread-2.21.so@0xcda0
1   libnspr4.so         PR_WaitCondVar
2   libxul.so           mozilla::Monitor::Wait
3   libxul.so           mozilla::ipc::MessageChannel::WaitForSyncNotify
4   libxul.so           mozilla::ipc::MessageChannel::Send
5   libxul.so           mozilla::layers::PLayerTransactionChild::SendUpdate
6   libxul.so           mozilla::layers::ShadowLayerForwarder::EndTransaction
7   libxul.so           mozilla::layers::ClientLayerManager::ForwardTransaction
8   libxul.so           mozilla::layers::ClientLayerManager::EndTransaction
9   libxul.so           nsDisplayList::PaintRoot
10  libxul.so           nsLayoutUtils::PaintFrame
11  libxul.so           PresShell::Paint
...

I guessed it could be related to OMTC, but setting layers.offmainthreadcomposition.enabled=false didn't have any effect.
Flags: needinfo?(nical.bugzilla)
This stack indicates that the main thread sent a synchronous transaction to the compositor thread, and is waiting for the compositor to receive the transaction. In the crash bp-3406f2da-604a-44e2-8dc6-118eb2150816 we can see that the compositor is indeed busy compositing (perhaps hung). It's the OpenGL compositor, so it'd be interesting to know if the problem also occurs with hardware acceleration turned off (which is the default configuration on Linux). If the problem only happens with hardware acceleration, then this bug should block bug 594876, and is probably a duplicate of some other hangs reported with gl layers on Linux (I don't have the bug numbers handy).
Flags: needinfo?(nical.bugzilla)
(In reply to Nicolas Silva [:nical] from comment #7)
> This stack indicates that the main thread sent a synchronous transaction to
> the compositor thread, and is waiting for the compositor to receive the
> transaction. In the crash bp-3406f2da-604a-44e2-8dc6-118eb2150816 we can see
> that the compositor is indeed busy compositing (perhaps hung). It's the
> OpenGL compositor, so it'd be interesting to know if the problem also occurs
> with hardware acceleration turned off (which is the default configuration on
> Linux). If the problem only happens with hardware acceleration, then this
> bug should block bug 594876, and is probably a duplicate of some other hangs
> reported with gl layers on Linux (I don't have the bug numbers handy).

The crash bp-3406f2da-604a-44e2-8dc6-118eb2150816 leads me to believe this could be related to XInitThreads pain, mainly with drivers that don't expect a thread-safe X11 environment (and end up deadlocking themselves somehow). It may be worth looking into this crash report with more rigor, since it's occurring in the open-source drivers.

Either way, this may make a good case for killing XInitThreads in gecko (as mentioned in bug 1189132).
Thank you all for your time so far in having a look at this.

(In reply to Nickolay_Ponomarev from comment #5)

> I believe you're using the verb "crash" to refer to the "hangs and fades to
> gray" state you've described, not the death of the firefox process and/or
> appearance of the Crash reporter window. Is that correct?
> The common term here on b.m.o for the former case is "hang", and for the
> latter, "crash".

Got it. There may have been instances where the process died on its own and the crash handler appeared, but only one or two (that was my earlier reference to why I had found it difficult to report). In general, I have to kill the process (sometimes possible through the OS GUI, sometimes requiring sending a kill signal because other OS elements have become unresponsive to the mouse at the same time as the Firefox hang). 

> >Those Firefox builds hosted on Launchpad state that their bugs are tracked here at Bugzilla, though. 
> Where do they state that? I believe the Ubuntu mozilla team uses their own
> tracker.

That's here for the main package (https://bugs.launchpad.net/firefox ). I see on re-reading it that while it is true that the primary message of that page is to say that bugs are tracked over here, it is also the case that there are some separate bug pages linked below that, which do exist on Launchpad. I guess I don't know quite enough about the relationship between the teams or the difference between the builds; I can try copying or linking the bug over to that side if you think that would be the thing to do.

> >It's possible, then, that those lines were an artifact of my having opened that instance of firefox through gdb?
> That's my guess. The IRC log I pasted has instructions on disabling this
> behavior in gdb (and/or just try continuing, I think it's "c" in gdb).

For now, I have stopped running most test sessions through gdb. So far, I do not think I have seen a hang in safe mode again in this limited testing. So it's even more plausible that this problem does not come up (or does not come up the same way) in safe mode, and only appeared to in that test because a similar hang condition resulted from the interaction of the session with gdb.


> A sure-fire way to run an isolated Firefox instance of a specific version is:
> path/to/firefox -no-remote -profile /absolute/path/to/an/empty/dir

I had been launching by clicking the application in the directory through Nautilus, but that may have been just issuing a shell command to launch whatever the "firefox" path was; I'm not sure.  Thanks for pointing me in this direction, though.  I have started doing some runs with the Mozilla builds and an empty profile folder, but I haven't been able to put enough time in quite yet to tell if there's a difference.

 
> > http://mozilla.github.io/mozregression/
> Did you succeed in reproducing the problem at least in one build launched
> with mozregression? It downloads/runs the mozilla.org builds.

I have not reproduced it there yet, although I'm not sure I started at quite the right point in the nightlies, as I thought I was armed only with the version number (40.0).  I need to find a bigger block of time to try to bisect.

________
New Updates:

I am working on a grid/tree of behavior based on the build, whether it's in safe mode, whether my normal profile or an empty one is attached, etc. Again, it's slow going because there is an uncertain time to hang, but hopefully I'll be able to find some more time. 

I can report this odd condition when using the standard Ubuntu release build of firefox from the PPA:

- In safe mode, using my normal profile, I have not yet caused a hang (have tried for 10-15 mins or so).

- Not in safe mode, but using an empty profile, I have also not yet caused a hang (again tried 10-15 mins). 

- Not in safe mode, using my normal profile, but with ALL add-ons manually disabled, I have gotten the hang very quickly several times (e.g., https://crash-stats.mozilla.com/report/index/bp-80eeb501-fd28-4c81-8aa5-ac4562150817 ).  

I'm currently testing with a normal profile, non-safe-mode, add-ons disabled manually, and HW accel also disabled.  My next steps in that area may be Pipelight, since it's still active in safe mode. Assuming that I cannot easily recreate the bug in nightlies or with an empty profile, I may just "reset firefox", though I'm loath to do that given the amount of individual customization I've put into some extensions (the settings, not the code itself).

In the meantime, is there a good summary somewhere of differences between normal and safe mode that are *not* the disabling of extensions? If I'm continuing to get the hangs with all extensions disabled but not getting them in safe mode, and all the remaining differences between the modes are things that show up in about:config... maybe I can use a config file diff to make my normal mode more and more like safe mode until it stops hanging, then see where the culprit was?
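One way such a config diff could be sketched, assuming the settings of interest are written as user_pref(...) lines in each profile's prefs.js (this may not cover everything safe mode changes, since some of its behavior may not be stored as prefs at all). The helper name prefs_diff is hypothetical; it compares two profiles, for example the normal one and an empty one that does not hang.

```shell
# Hypothetical helper: show user_pref lines that differ between two prefs.js
# files. "$1" and "$2" are paths to the two files. Lines are sorted first so
# that ordering differences between the files are ignored.
prefs_diff() {
    _left="$(mktemp)"
    _right="$(mktemp)"
    grep '^user_pref(' "$1" | sort > "$_left"
    grep '^user_pref(' "$2" | sort > "$_right"
    # diff exits 0 when the pref lines are identical and 1 when they differ.
    diff "$_left" "$_right"
    _status=$?
    rm -f "$_left" "$_right"
    return "$_status"
}
```

Lines prefixed with "<" exist only in the first profile and lines prefixed with ">" only in the second; prefs could then be copied across a few at a time until the hang stops or starts.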
Here is a crash of the Mozilla build (not the Ubuntu build). In this session, I did not specify an empty profile, but I also did not carry over any extensions (which were all deactivated in my main profile anyway). This session had hardware accel ENABLED: https://crash-stats.mozilla.com/report/index/7f99e53d-5034-4b23-be7e-f3c982150817 . 

So I went back and tried using the profile that the Mozilla version had created by default in an empty folder. I let it totally replace the prefs.js again through a session, and I just made one change: enable hardware acceleration. I re-opened it with the same profile and, lo and behold, I got the hang: https://crash-stats.mozilla.com/report/index/bp-891b58ac-45b4-464a-9cc9-0b5052150817 .

I assume this is decently good evidence that HW accel may be at the heart of this -- not sure if that's conclusive enough that I should change the title of this bug?
Flags: needinfo?(adam)
Looks like I may have spoken too soon there -- irritating thing about the indeterminate time to hang. Here is what I think is the usual hang with layers.acceleration.force-enabled FALSE. Context: Ubuntu build, normal profile, some of my add-ons re-enabled. Report: https://crash-stats.mozilla.com/report/index/bp-7e7c1e7b-a5e7-47c3-b025-b69b42150817 . The option media.hardware-video-decoding.enabled was still set to "true", though -- I don't know if that would make a difference or not. I'll start running with both at "false" from now.
Crash Signature: 9faaf25e-1bd0-4ef1-a10c-ad6772150812 → [@ libpthread-2.21.so@0xcda0 ]
Keywords: crash
Component: Untriaged → General
(In reply to Andrew Comminos [:acomminos] from comment #9)
> Either way, this may make a good case for killing XInitThreads in gecko (as
> mentioned in bug 1189132).

Interesting. Could you open a new bug about killing XInitThreads with a summary of the issues and possible solutions? All I remember is that before we added XInitThreads, OMTC would simply crash at startup with some assertion inside x11 or gtk or something like that. But that was more than three years ago.
(In reply to Nicolas Silva [:nical] from comment #13)
> (In reply to Andrew Comminos [:acomminos] from comment #9)
> > Either way, this may make a good case for killing XInitThreads in gecko (as
> > mentioned in bug 1189132).
> 
> Interesting. Could you open a new bug about killing XInitThreads with a
> summary of the issues and possible solutions? All I remember is that before
> we added XInitThreads, OMTC would simply crash at startup with some assertion
> inside x11 or gtk or something like that. But that was more than three years
> ago.

Unfortunately, my patches in bug 1195359 that replaced usage of XInitThreads with separate display connections for the widget code and the compositor only 'fixed' the hang in bug 1189132 by not flushing an XUnmapWindow call when the compositor is shut down; the hang still occurs if we flush the X11 client queue on the widget's display connection. It's likely not worth it to make the switch considering the hangs still occur.
Although I cannot reproduce this issue on my end, given the number of comments from developers and the bug it blocks (594876), I am changing the status from Unconfirmed to New.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Component: General → Graphics
Product: Firefox → Core
Priority: -- → P3
Whiteboard: [gfx-noted]
Closing because no crashes have been reported in 12 weeks.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX