Closed
Bug 973192
Opened 11 years ago
Closed 11 years ago
Firefox hangs on start when using NouVeau driver
Categories
(Core :: Graphics: Layers, defect)
Tracking
()
RESOLVED
FIXED
mozilla30
People
(Reporter: fredbezies, Unassigned)
References
()
Details
(Keywords: regression)
Attachments
(13 files)
8.30 KB,
text/x-log
|
Details | |
22.34 KB,
text/plain
|
Details | |
42.74 KB,
text/plain
|
Details | |
6.86 KB,
patch
|
Details | Diff | Splinter Review | |
2.63 KB,
patch
|
Details | Diff | Splinter Review | |
1.08 KB,
patch
|
Details | Diff | Splinter Review | |
7.73 KB,
text/plain
|
Details | |
860 bytes,
patch
|
Details | Diff | Splinter Review | |
1.03 KB,
patch
|
Details | Diff | Splinter Review | |
1.01 KB,
patch
|
Details | Diff | Splinter Review | |
1.32 KB,
patch
|
Details | Diff | Splinter Review | |
2.01 KB,
patch
|
karlt
:
review+
|
Details | Diff | Splinter Review |
2.97 KB,
patch
|
karlt
:
review+
|
Details | Diff | Splinter Review |
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0 (Beta/Release)
Build ID: 20140215125843
Steps to reproduce:
Since february 14th, I cannot launch Firefox. It eats all CPU. I have to open a terminal, use top and kill firefox task in order to get it run.
I'm using Archlinux, 64 bits, with gcc 4.8.2.
Seems to be a really recent regression. Removing addons doesn't change a thing. A firefox task is eating all cpu.
Actual results:
Just build firefox. Try to launch it. Wait for at least 10 seconds. Nothing happens.
Opening a terminal and entering top shows me that firefox is eating 100% of CPU.
Expected results:
Just launching.
Reporter | ||
Updated•11 years ago
|
Component: Untriaged → General
Reporter | ||
Comment 1•11 years ago
|
||
My last working and quick starting build was based on revision : http://hg.mozilla.org/mozilla-central/rev/6687d299c464
Strange part is that Firefox starts ok when I kill it in top.
Reporter | ||
Comment 2•11 years ago
|
||
Busted build : https://hg.mozilla.org/mozilla-central/rev/b80f7eece913
Last official working build : https://hg.mozilla.org/mozilla-central/rev/d275eebfae04
Reporter | ||
Comment 3•11 years ago
|
||
Reporter | ||
Comment 4•11 years ago
|
||
Could this bug be related ? https://bugzilla.mozilla.org/show_bug.cgi?id=959787
Reporter | ||
Updated•11 years ago
|
Summary: Firefox very slow to start → Firefox hangs on start
Comment 5•11 years ago
|
||
Bug 959787 seems a bit unlikely to cause this sort of problem....
Can you try tinderbox builds to narrow down the regression range further, perhaps?
Flags: needinfo?(fredbezies)
Reporter | ||
Comment 6•11 years ago
|
||
I'll try asap hoping I will find a smaller regression range.
Flags: needinfo?(fredbezies)
Reporter | ||
Comment 7•11 years ago
|
||
(In reply to Frederic Bezies from comment #6)
> I'll try asap hoping I will find a smaller regression range.
Looks like I got one.
Last working tinderbuild (no hang on start) : https://hg.mozilla.org/mozilla-central/rev/23f7a629a217
First busted tinderbuild (hang and having to kill process) : https://hg.mozilla.org/mozilla-central/rev/5d7caa093f4f
Looks weird anyway...
Comment 8•11 years ago
|
||
Can you try inbound tinderbuilds?
Reporter | ||
Comment 9•11 years ago
|
||
(In reply to Boris Zbarsky [:bz] (reviews will be slow; ask someone else) from comment #8)
> Can you try inbound tinderbuilds?
Try it. Report asap.
Reporter | ||
Comment 10•11 years ago
|
||
On mozilla-inbound :
Last working tinderbuild : https://hg.mozilla.org/integration/mozilla-inbound/rev/c7802c9d6eec
First broken tinderbuild : https://hg.mozilla.org/integration/mozilla-inbound/rev/9e7cf0b1d80c
Could it be bug #972397 ?!
Comment 11•11 years ago
|
||
So http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c7802c9d6eec&tochange=9e7cf0b1d80c
Bug 972397 sure seems like the most likely culprit in there.
Blocks: 972397
tracking-firefox30:
--- → ?
Component: General → Graphics: Layers
Flags: needinfo?(nical.bugzilla)
Keywords: regression
Product: Firefox → Core
Comment 12•11 years ago
|
||
One more question. When Firefox hangs, can you possibly get stacks for all its threads?
Reporter | ||
Comment 13•11 years ago
|
||
Will try. But looking at it again, I could find another possible culprit ? bug #972700... What about it ?
Comment 14•11 years ago
|
||
Is that really in your range from comment 10? In any case, it's just a string change; shouldn't cause hangs!
Reporter | ||
Comment 15•11 years ago
|
||
Just asking your opinion. Just tell me how to report stacks while hanging. Will report datas asap.
Comment 16•11 years ago
|
||
Something like this should work, if "firefox-pid" is the process id and you're using an sh-compatible shell:
gdb > /tmp/log.txt 2>&1
and then at the gdb prompt type:
attach firefox-pid
and then:
thread apply all bt
and then:
quit
Reporter | ||
Comment 17•11 years ago
|
||
Reporter | ||
Comment 18•11 years ago
|
||
Added gdb log, hoping it will help fixing this bug.
Comment 19•11 years ago
|
||
Thanks! Here's what the thread 1 (main thread) stack looks like:
Thread 1 (Thread 0x7f2d032c2740 (LWP 3445)):
#0 0x00007f2d02ebfa8d in read () from /usr/lib/libpthread.so.0
#1 0x00007f2cfd97264a in mozilla::widget::GfxInfo::GetData() ()
from /home/fred/Téléchargements/firefox/libxul.so
#2 0x00007f2cfd96ea2d in nsBaseWidget::ComputeShouldAccelerate(bool) ()
from /home/fred/Téléchargements/firefox/libxul.so
#3 0x00007f2cfd96fbf8 in nsBaseWidget::GetLayerManager(mozilla::layers::PLayerTransactionChild*, mozilla::layers::LayersBackend, nsIWidget::LayerManagerPersistence, bool*) () from /home/fred/Téléchargements/firefox/libxul.so
#4 0x00007f2cfdeb761e in PresShell::Paint(nsView*, nsRegion const&, unsigned int) () from /home/fred/Téléchargements/firefox/libxul.so
#5 0x00007f2cfdbadcb4 in nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) () from /home/fred/Téléchargements/firefox/libxul.so
#6 0x00007f2cfdebe827 in nsRefreshDriver::Tick(long, mozilla::TimeStamp) ()
from /home/fred/Téléchargements/firefox/libxul.so
That shows us reading from the pipe to the glxtest process, and presumably blocking on that read...
Flags: needinfo?(bjacob)
Comment 20•11 years ago
|
||
(In reply to Boris Zbarsky [:bz] (reviews will be slow; ask someone else) from comment #11)
> So
> http://hg.mozilla.org/integration/mozilla-inbound/
> pushloghtml?fromchange=c7802c9d6eec&tochange=9e7cf0b1d80c
>
> Bug 972397 sure seems like the most likely culprit in there.
I would be very surprised. This should just affect assertions in debug builds.
other comments of this thread also look unrelated to bug 972397
Flags: needinfo?(nical.bugzilla)
Comment 21•11 years ago
|
||
Hmm. Actually, if we're using CPU, not just blocked... It might be worth getting a profile to see what's using the CPU. All the threads showing in attachment 8378338 [details] seem to be blocked, not running.
Reporter | ||
Comment 22•11 years ago
|
||
Firefox is blocked. I have to kill its process using top. Log was obtained while Firefox is blocked.
Comment 23•11 years ago
|
||
If it's blocking on the read from the glxtest process, that means that 1) the glxtest process is still alive and 2) it hasn't finished writing the data yet. That could be caused, for example, by the glxtest process stalling somewhere before it gets to write the data, which is not impossible as it has to call into X11 and OpenGL libraries.
Can you please:
1) Paste here the output of this command:
ps aux | grep firefox
2) One of these processes will be the glxtest process. Attach GDB to it, and get a backtrace (like above). If you don't know which of these is the glxtest process, do that for all of them, and paste all the resulting stacks here.
Flags: needinfo?(bjacob)
Reporter | ||
Comment 24•11 years ago
|
||
Here is the output I get :
[fred@fredo-arch ~]$ ps aux | grep firefox
fred 21115 48.0 6.5 832532 263672 pts/0 Sl 18:07 0:02 ./firefox
fred 21117 99.5 0.4 309064 17672 pts/0 R 18:07 0:05 ./firefox
fred 21140 0.0 0.0 10692 1064 pts/1 S+ 18:07 0:00 grep firefox
No glxtest listed at all.
Comment 25•11 years ago
|
||
That's normal: 'glxtest' only appears in symbol names. The glxtest process is one of these two firefox processes here. Please attach gdb to each of them and get backtraces.
Reporter | ||
Comment 26•11 years ago
|
||
Logged thread process as asked on comment #25.
Comment 27•11 years ago
|
||
Weird: these two processes are in nearly exactly the same state. Both are real Firefox processes, not glxtest.
Do you really get two such processes everytime you run Firefox, i.e. can you confirm that neither of these processes is a leftover from a previous attempt to run Firefox?
Reporter | ||
Comment 28•11 years ago
|
||
I told you I had no glxtest processes :)
And yes, I had no remaining firefox processes. Will try again, killing any remaining firefox processes before logging anything.
This bug is really weird. As I said in comment #0 : "I have to open a terminal, use top and kill firefox task in order to get it run."
Comment 29•11 years ago
|
||
Quick question: if you run with the environment variable MOZ_AVOID_OPENGL_ALTOGETHER=1 does the problem still occur?
Reporter | ||
Comment 30•11 years ago
|
||
Well, looks like you find it ;)
Using export MOZ_AVOID_OPENGL_ALTOGETHER=1 fix the bug. So, what now ? :D
Comment 31•11 years ago
|
||
Please rebuild and run with this (without any environment variable or debugger).
Comment 32•11 years ago
|
||
(Let me know if it's useful that I spawn a tryserver build so you don't need to rebuild yourself)
Reporter | ||
Comment 33•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #32)
> (Let me know if it's useful that I spawn a tryserver build so you don't need
> to rebuild yourself)
Let me unpack a trunk source code copy I have, and in about 50 minutes, I will give you all returns.
Reporter | ||
Comment 34•11 years ago
|
||
Here is the output :
(process:29448): GLib-CRITICAL **: g_slice_set_config: assertion 'sys_page_size == 0' failed
entering: bool fire_glxtest_process() on pid 29448
fork succeeded, we are the parent process (pid 29448), our child process is pid 29450
fork succeeded, we are the child process (pid 29450), running glxtest
entering: void glxtest() on pid 29450
opened OpenGL library: libGL.so.1
got glXGetProcAddress
got all proc addresses
opened X display
set X error handler
chose visual
created window
created GLX context
got GL strings
Last mesa update on my computer ? February 4th.
Last xorg driver update (nouveau for my nvidia card) ? January 13th.
Comment 35•11 years ago
|
||
And I assume that that is all, and you don't see "cleaned up" ?
In that case, that means that the glxtest process is stalling while trying to destroy its X11 or GL resources. Let me whip up a second patch with more detailed logging around that part.
Reporter | ||
Comment 36•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #35)
> And I assume that that is all, and you don't see "cleaned up" ?
>
Nothing cleaned at all. Hang after "got GL strings".
> In that case, that means that the glxtest process is stalling while trying
> to destroy its X11 or GL resources. Let me whip up a second patch with more
> detailed logging around that part.
Ok. Let's hope it will be more informative ;)
Comment 38•11 years ago
|
||
Actually, it would be extremely useful to have GDB attached to the glxtest process. This way,
- if it crashes, we'll know why
- if it stalls, you'll be able to interrupt it (Ctrl+C) and know what it is doing.
This patch should make it easy, by telling you the right command line to attach GDB to glxtest process, and giving you 30 seconds to do so.
Reporter | ||
Comment 39•11 years ago
|
||
[fred@fredo-arch firefox]$ ./firefox &
[1] 5199
[fred@fredo-arch firefox]$
(process:5199): GLib-CRITICAL **: g_slice_set_config: assertion 'sys_page_size == 0' failed
entering: bool fire_glxtest_process() on pid 5199
fork succeeded, we are the parent process (pid 5199), our child process is pid 5201
fork succeeded, we are the child process (pid 5201), running glxtest
entering: void glxtest() on pid 5201
opened OpenGL library: libGL.so.1
got glXGetProcAddress
got all proc addresses
opened X display
set X error handler
chose visual
created window
created GLX context
got GL strings
writing temporary buffer...
starting to clean up...
calling glXMakeCurrent(display=0x7f3dfcfd4000, None, nullptr)...
calling glXDestroyContext(display=0x7f3dfcfd4000, context=0x7f3dfcfed0b0)...
calling XDestroyWindow(display=0x7f3dfcfd4000, window=0x1600002)...
calling XFreeColormap(display=0x7f3dfcfd4000, ...)...
calling XCloseDisplay(display=0x7f3dfcfd4000)...
Hangs after XCloseDisplay().
Flags: needinfo?(fredbezies)
Reporter | ||
Comment 40•11 years ago
|
||
After applying your last patch, attaching to glxtest process, I got this. Not really useful I think :(
(gdb) thread apply all bt
Thread 1 (Thread 0x7f8ab4776740 (LWP 12801)):
#0 0x00007f8ab365575d in nanosleep () from /usr/lib/libc.so.6
#1 0x00007f8ab36555f4 in sleep () from /usr/lib/libc.so.6
#2 0x00007f8aafa84845 in ?? ()
from /home/fred/logs/fox/objdir-fx/dist/firefox/libxul.so
#3 0x00007f8aafa85a27 in ?? ()
from /home/fred/logs/fox/objdir-fx/dist/firefox/libxul.so
#4 0x00007f8aafa80293 in ?? ()
from /home/fred/logs/fox/objdir-fx/dist/firefox/libxul.so
#5 0x00007f8aafa83661 in ?? ()
from /home/fred/logs/fox/objdir-fx/dist/firefox/libxul.so
#6 0x00007f8aafa838f3 in XRE_main ()
from /home/fred/logs/fox/objdir-fx/dist/firefox/libxul.so
#7 0x0000000000403bf3 in _start ()
Comment 41•11 years ago
|
||
Hey, after you attach GDB to the glxtest process, you need to tell GDB to let the glxtest process resume execution (attaching implicitly stops the process).
To do that, just enter 'c' in GDB:
(gdb) c
Then it will resume, but that will be resuming its 30-second sleep, so you may have to wait a bit for actually interesting things to happen. Let it run until it crashes or until it stalls and you have to Ctrl-C it.
Reporter | ||
Comment 42•11 years ago
|
||
I hanged for one minute before firefox starts. Wow !
Here is the log of firefox :
Start until hang :
"[fred@fredo-arch firefox]$ ./firefox &
[1] 20194
[fred@fredo-arch firefox]$
(process:20194): GLib-CRITICAL **: g_slice_set_config: assertion 'sys_page_size == 0' failed
entering: bool fire_glxtest_process() on pid 20194
fork succeeded, we are the parent process (pid 20194), our child process is pid 20196
fork succeeded, we are the child process (pid 20196), running glxtest
entering: void glxtest() on pid 20196
########################################################
You can attach GDB to the glxtest process now with this command:
gdb --pid 20196
Sleeping for 30 seconds now to give you time to attach...
########################################################
opened OpenGL library: libGL.so.1
got glXGetProcAddress
got all proc addresses
opened X display
set X error handler
chose visual
created window
created GLX context
got GL strings
writing temporary buffer...
starting to clean up...
calling glXMakeCurrent(display=0x7f4ccbad4000, None, nullptr)...
calling glXDestroyContext(display=0x7f4ccbad4000, context=0x7f4ccbaed0b0)...
calling XDestroyWindow(display=0x7f4ccbad4000, window=0x1400002)...
calling XFreeColormap(display=0x7f4ccbad4000, ...)...
calling XCloseDisplay(display=0x7f4ccbad4000)..."
1 minute after :
"calling dlclose on libGL...
cleaned up
wrote to pipe"
And the gdb output :
(gdb) c
Continuing.
[Inferior 1 (process 20196) exited normally]
I don't understand. Why is there a one minute hanging on start ?
Comment 43•11 years ago
|
||
OK, thanks, so now we know that the glxtest process is neither crashing nor even stalling forever, it is just spending a minute in XCloseDisplay().
Could you repeat the procedure, attaching GDB to glxtest process, and once you've seen the message "calling XCloseDisplay...", hit Ctrl-C to interrupt it, get a backtrace, then enter 'c' in GDB to continue execution, wait a few seconds, and repeat this a few times? I would like to get a sense of what it is, in XCloseDisplay(), that makes it take so long on your machine.
Alternatively: depending on what's happening here, a better tool to get this information might be a profiler. Install the 'perf' profiler that's part of the linux kernel (might be part of a "linux-utils" package, or some such). Run without GDB, but with the patch #3 so you get the PID of the glxtest process. Once you have that pid and see the message "calling XCloseDisplay...", do this:
perf record -f -g -p $THE_PID_OF_GLXTEST_PROCESS
let it run a few seconds, then hit Ctrl-C, then do this to view the resulting profile:
perf report
You can then output to a file by
perf report --stdio > somefile.txt
and attach this here (maybe trimmed / compressed).
Comment 44•11 years ago
|
||
Needless to say, it is very much worth installing debug symbols from your linux distro for X11 and GL libraries, to get useful stacks into them.
Comment 45•11 years ago
|
||
How to decide whether you need a debugger or a profiler:
- if the glxtest process is using 100% of a CPU, then you rather want a profiler to understand what it's doing.
- it it's not using much CPU, then you rather want a debugger to understand where it's waiting/sleeping.
Reporter | ||
Comment 46•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #45)
> How to decide whether you need a debugger or a profiler:
> - if the glxtest process is using 100% of a CPU, then you rather want a
> profiler to understand what it's doing.
> - it it's not using much CPU, then you rather want a debugger to understand
> where it's waiting/sleeping.
It eat 100% CPU. Profiling ? It will need a lot of packages, and I'm not sure I will have all debug ones.
I wonder bug #972397 broke everything for my linux installation.
I will try gdb first. If I cannot understand what is happening, I will give up with this bug, because I'm fed up with it. You can easily understand me.
Comment 47•11 years ago
|
||
I sure can understand :-)
At this point we pretty much know that the bug is somewhere in your distro's X libraries: XCloseDisplay should not eat 100% CPU for a minute. The question is should we do something about it? Is it likely to affect more users than just you? That's why I wanted to dig deep into what XCloseDisplay is actually doing here.
Comment 48•11 years ago
|
||
Bug 972397 is totally out of cause here; in fact, since the problem is a bug in XCloseDisplay, I am almost certain that even if you go back to an older version of Firefox, you will still experience the same symptons.
Comment 49•11 years ago
|
||
(In reply to Frederic Bezies from comment #10)
> On mozilla-inbound :
>
> Last working tinderbuild :
> https://hg.mozilla.org/integration/mozilla-inbound/rev/c7802c9d6eec
> First broken tinderbuild :
> https://hg.mozilla.org/integration/mozilla-inbound/rev/9e7cf0b1d80c
>
> Could it be bug #972397 ?!
Are we sure of that regression window? The debugging we've been doing around comment 39 and comment 42 says that we're just hitting a bug in X libraries (XCloseDisplay) long, long before any gecko gfx code kicks into play.
Reporter | ||
Comment 50•11 years ago
|
||
perf running during 15 seconds.
Reporter | ||
Comment 51•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #47)
> I sure can understand :-)
>
> At this point we pretty much know that the bug is somewhere in your distro's
> X libraries: XCloseDisplay should not eat 100% CPU for a minute. The
> question is should we do something about it? Is it likely to affect more
> users than just you? That's why I wanted to dig deep into what XCloseDisplay
> is actually doing here.
Note : Triggered by Mozilla Firefox code. I never had this bug before.
By the way, none of the Xorg libraries were modified on or around february 13th. So this is why I thought it was some code added / modified in Mozilla Firefox. I can see this behaviour on both homemade (so relying on my Xorg libs) and official builds.
If I look at perf log, I can see this :
13.38% firefox nouveau_dri.so [.] 0x000000000033f6a5
7.89% firefox [kernel.kallsyms] [k] put_prev_task_fair
7.82% firefox [kernel.kallsyms] [k] system_call
6.11% firefox [kernel.kallsyms] [k] sched_clock
5.30% firefox [kernel.kallsyms] [k] __schedule
Could it be a kernel bug ? If so, the best thing is to stop spamming this bug report.
Comment 52•11 years ago
|
||
I am filing a bug against DRI/nouveau. Meanwhile, if you are not convinced that this is a bug in X/GL libraries/drivers, do feel free to retry an older Firefox build. If my understanding is correct, this bug is completely independent of Firefox version; if you found otherwise, that would be very interesting!
Comment 53•11 years ago
|
||
Karl, would it make sense for us to just not call XCloseDisplay? Or not call it when running on this driver (nouveau)? Or would you rather have us get to the bottom of this issue before we start working around this bug?
Flags: needinfo?(karlt)
Reporter | ||
Comment 54•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #52)
> I am filing a bug against DRI/nouveau. Meanwhile, if you are not convinced
> that this is a bug in X/GL libraries/drivers, do feel free to retry an older
> Firefox build. If my understanding is correct, this bug is completely
> independent of Firefox version; if you found otherwise, that would be very
> interesting!
I'm writing this answer using my last working homemade build. So, using old code :)
It is based on this revision : http://hg.mozilla.org/mozilla-central/rev/6687d299c464
I can try any nightlies before february 14th and it will run flawlessly. Well I guess ;)
I will try yesterday nightly with nvidia drivers. And I will report to see if it works or not.
Comment 55•11 years ago
|
||
Updated•11 years ago
|
Comment 56•11 years ago
|
||
Note that if really this is a mozilla-side regression (i.e. if I'm wrong) then really the most important thing that you can do here is to bisect this down to one changeset. (Oh, now I remember, you already said that you were fed up with this bug ;-) )
Comment 57•11 years ago
|
||
A quicker thing you can do is revert the patch of Bug 972397 and see if that fixes the bug.
Reporter | ||
Comment 58•11 years ago
|
||
Thanks for bugreport on freedesktop.
Until this awful bug is either fixed or "workarounded", I will have to use Nvidia official drivers, which is not a really good news.
Sorry for being a little rude, but this bug is really annoying and can touch any NouVeau user.
Summary: Firefox hangs on start → Firefox hangs on start when using NouVeau driver
Reporter | ||
Comment 59•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #57)
> A quicker thing you can do is revert the patch of Bug 972397 and see if that
> fixes the bug.
Reverting it ? Tried, and got a reject message. Code is "too heavily" modified to do an automatic revert.
Comment 60•11 years ago
|
||
Generated like this:
bjacob:/hack/mozilla-central$ hg export 9e7cf0b1d80c > /tmp/a
bjacob:/hack/mozilla-central$ patch -p1 -R < /tmp/a
patching file gfx/layers/client/TextureClient.cpp
bjacob:/hack/mozilla-central$ hg qnew revert-bug-972397
Does this fix the bug?
Flags: needinfo?(fredbezies)
Reporter | ||
Comment 61•11 years ago
|
||
Thanks. Will try tomorrow morning and report.
Flags: needinfo?(fredbezies)
Comment 62•11 years ago
|
||
Frédéric: the good people over at the freedesktop.org bug would like to have the output of the 'glxinfo' program on your machine. While you're at it, it's probably not a bad idea to share some details about your graphics card, and the versions of some relevant packages installed on your system (Xorg, Mesa, kernel, etc).
Flags: needinfo?(fredbezies)
Comment 63•11 years ago
|
||
Frédéric, I have one more favor to ask you. The DRI developers wanted a stand-alone version of glxtest, which I put there:
https://bugs.freedesktop.org/attachment.cgi?id=94465
Obviously, they would like a confirmation that this program does reproduce the problem. You could check that in a few seconds by doing this:
wget https://bugs.freedesktop.org/attachment.cgi?id=94465 -O glxtest.cpp
c++ glxtest.cpp -o glxtest -ldl -lX11
./glxtest
Does this hang for minute?
Thanks!
(Side note: DRI developers also rightly point out that it might be most convenient if you CC'd yourself on the DRI bug).
Comment 64•11 years ago
|
||
It's hard to correlate the XCloseDisplay problem with the changesets in the regression range. A change would need to either affect what runs before glxtest forks, or affect the timing of the wait for the glxtest output.
(In reply to Benoit Jacob [:bjacob] from comment #53)
> Karl, would it make sense for us to just not call XCloseDisplay? Or not call
> it when running on this driver (nouveau)? Or would you rather have us get to
> the bottom of this issue before we start working around this bug?
If we are waiting for glxtest output earlier, then that would be worth understanding.
We want XCloseDisplay in builds with NS_FREE_PERMANENT_DATA defined, to clean up memory. In other builds, this could be changed to XSync, if that helps. We need the XSync to know whether the previous methods have generated any X errors, which would cause problems if they were used in the main process.
Flags: needinfo?(karlt)
Reporter | ||
Comment 65•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #63)
> Frédéric, I have one more favor to ask you. The DRI developers wanted a
> stand-alone version of glxtest, which I put there:
> https://bugs.freedesktop.org/attachment.cgi?id=94465
> Obviously, they would like a confirmation that this program does reproduce
> the problem. You could check that in a few seconds by doing this:
>
> wget https://bugs.freedesktop.org/attachment.cgi?id=94465 -O glxtest.cpp
> c++ glxtest.cpp -o glxtest -ldl -lX11
> ./glxtest
>
> Does this hang for minute?
Doesn't hang at all. Changed kernel less than one hour ago. Related ?!
>
> Thanks!
>
> (Side note: DRI developers also rightly point out that it might be most
> convenient if you CC'd yourself on the DRI bug).
Added some infos on freedesktop bugzilla tracker bug report.
Flags: needinfo?(fredbezies)
Reporter | ||
Comment 66•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #60)
> Created attachment 8379219 [details] [diff] [review]
> Revert bug 972397
>
> Generated like this:
>
> bjacob:/hack/mozilla-central$ hg export 9e7cf0b1d80c > /tmp/a
> bjacob:/hack/mozilla-central$ patch -p1 -R < /tmp/a
> patching file gfx/layers/client/TextureClient.cpp
> bjacob:/hack/mozilla-central$ hg qnew revert-bug-972397
>
> Does this fix the bug?
Sad news : no fix at all. Even reverting patch in gfx/layers/client/TextureClient.cpp doesn't "kill" this hang.
I don't know what is happening. A "dirty workaround" is to remove NouVeau and install nvidia driver. But it won't fix the problem :(
Comment 67•11 years ago
|
||
(In reply to Frederic Bezies from comment #65)
> (In reply to Benoit Jacob [:bjacob] from comment #63)
>
> Doesn't hang at all. Changed kernel less than one hour ago. Related ?!
Does the Firefox hang still reproduce?
Reporter | ||
Comment 68•11 years ago
|
||
Yes. Not kernel related. I grabbed last official nightly, and it hangs. So, either I replace NouVeau by nvidia which is the "worst workaround", or a "trick" is found.
I have no ideas, now.
Comment 69•11 years ago
|
||
Thanks. To be clear, we do have a trick we can possibly use: not call XCloseDisplay. See comment 64.
Reporter | ||
Comment 70•11 years ago
|
||
Ouch :(
I can verify this option. Just give me a patch with this trick. And I'll report you if it start flawlessly.
Comment 71•11 years ago
|
||
Pro: conceivably doable, because XSync will still catch any pending errors
Con: might not actually fix the issue, if it is happening because of asynchronous operations.
Comment 72•11 years ago
|
||
Pro: this is very likely to fix the issue
Con: this won't catch any error caused by pending async operations, since we're exiting without syncing.
Once we know whether each of these two tentative workarounds works, we can better decide what to do. If the hang is caused by syncing with a pending async operation, then we'll be able to figure which one by putting X in synchronized mode.
Comment 73•11 years ago
|
||
What I'd still like to understand is why the problem started. I mean, old nightlies work, new ones do not. So presumably some code change on our end is interacting with whatever is going on with Frederic's drivers and such... but which change is it? Frederic, if you do in fact have the patience to hg bisect this, that would be pretty awesome.
Reporter | ||
Comment 74•11 years ago
|
||
I would like too. But I don't know how much time it will take.
There is a tinderbox regression. So, if you tell me how to bisect this, I'll try to do my best. But it will be long to do :(
Comment 75•11 years ago
|
||
Assuming you have an hg pull of the tree...
The simplest thing to do is to start with the two nightlies that bracket the problem appearing. Look up their changeset IDs. Then run "hg bisect -g good-id" and "hg bisect -b bad-id".
Then hg will automatically update to some changeset in between. Build, test, and if it's good run "hg bisect -g" and if it's bad run "hg bisect -b". Keep doing that until it narrows things to one changeset.
Then paste the resulting output here.
Reporter | ||
Comment 76•11 years ago
|
||
[fred@fredo-arch src]$ hg bisect -g 23f7a629a217
[fred@fredo-arch src]$ hg bisect -b 5d7caa093f4f
The first bad revision is:
changeset: 168758:5d7caa093f4f
parent: 168692:23f7a629a217
parent: 168757:9e7cf0b1d80c
user: Ryan VanderMeulen <ryanvm@gmail.com>
date: Fri Feb 14 08:54:12 2014 -0500
summary: Merge inbound to m-c.
Not all ancestors of this changeset have been checked.
Use bisect --extend to continue the bisection from
the common ancestor, d275eebfae04.
When I enter hg bisect --extend ?
[fred@fredo-arch src]$ hg bisect --extend
Extending search to changeset 168677:d275eebfae04
3315 files updated, 0 files merged, 1858 files removed, 0 files unresolved
What to do now ? Building ?
Comment 77•11 years ago
|
||
Yep. Now you keep bisecting until it comes up with a new "the first bad revision is".
Comment 78•11 years ago
|
||
Oh, and yes, after runnigng "hg bisect --extend" you want to build, test, then run "hg bisect -g" or "hg bisect -b" as needed.
Reporter | ||
Comment 79•11 years ago
|
||
It will need only one build. I see the bug from one changeset to another. And in the "guilty changeset", there is bug #972397...
Reporter | ||
Comment 80•11 years ago
|
||
First build (based on last working tinderbox : ok).
[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168743:1f73309205ad (29 changesets remaining, ~4 tests)
97 files updated, 0 files merged, 0 files removed, 0 files unresolved
We will see ;)
Reporter | ||
Comment 81•11 years ago
|
||
Full bisect log ! Was long.
[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168743:1f73309205ad (29 changesets remaining, ~4 tests)
97 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -b
Testing changeset 168736:5863efef36ed (14 changesets remaining, ~3 tests)
17 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -b
Testing changeset 168732:a0fd5acb4ea5 (7 changesets remaining, ~2 tests)
5 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -b
Testing changeset 168730:c164916b5dc8 (3 changesets remaining, ~1 tests)
7 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -b
The first bad revision is:
changeset: 168730:c164916b5dc8
parent: 168729:532299d1368f
parent: 168677:d275eebfae04
user: Wes Kocher <wkocher@mozilla.com>
date: Thu Feb 13 18:50:08 2014 -0800
summary: Merge m-c to inbound on a CLOSED TREE
Not all ancestors of this changeset have been checked.
Use bisect --extend to continue the bisection from
the common ancestor, 0f18070eb5d6.
[fred@fredo-arch src]$ hg bisect --extend
Extending search to changeset 168617:0f18070eb5d6
154 files updated, 0 files merged, 20 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168711:5ae4fa0b1340 (38 changesets remaining, ~5 tests)
39 files updated, 0 files merged, 0 files removed, 0 files unresolved
Shall I continue builds ?
Comment 82•11 years ago
|
||
Thanks! Yes, what you have found is a merge cset, so yes it is useful that you keep bisecting into the merged range... if you have the patience!
Comment 83•11 years ago
|
||
(In reply to Frederic Bezies from comment #81)
> Testing changeset 168711:5ae4fa0b1340 (38 changesets remaining, ~5 tests)
> 39 files updated, 0 files merged, 0 files removed, 0 files unresolved
Note that the regression range in comment 10 has only 11 changesets, with no merges, so it may be faster to start fresh with that range, if hg doesn't let you just add info for those revisions into the current bisect. Perhaps easiest to continue from the current bisect process though.
Comment 84•11 years ago
|
||
We were doubting that the regression range in comment 10 was valid; at least, it is worth double-checking.
I've taken another approach now:
glxtest is invoked only by code in nsAppRunner.cpp, and that code runs very early at startup, so whatever changed affects glxtest in anyway is fairly likely to be found in nsAppRunner.cpp.
If we hg annotate toolkit/xre/nsAppRunner.cpp we see that between the XRE_main() entry point and the place where glxtest is invoked in XRE_mainInit(), the only thing that changed recently is this:
$ hg ann -c -d toolkit/xre/nsAppRunner.cpp
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: XREMain::XRE_main(int argc, char* argv[], const nsXREAppData* aAppData)
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: {
383bed640c7b Tue May 28 14:03:38 2013 +0200: char aLocal;
383bed640c7b Tue May 28 14:03:38 2013 +0200: GeckoProfilerInitRAII profilerGuard(&aLocal);
7b508d11a791 Sat Mar 16 00:47:02 2013 -0400: PROFILER_LABEL("Startup", "XRE_Main");
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: nsresult rv = NS_OK;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: gArgc = argc;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: gArgv = argv;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: NS_ENSURE_TRUE(aAppData, 2);
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: mAppData = new ScopedAppData(aAppData);
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: if (!mAppData)
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: return 1;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: // used throughout this file
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: gAppData = mAppData;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: ScopedLogging log;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
88d69928e3b1 Sat Jun 16 18:06:10 2012 +1200: #if defined(MOZ_WIDGET_GTK)
b07ab17dde6c Tue Feb 11 09:22:46 2014 -0500: #if defined(MOZ_MEMORY) || defined(__FreeBSD__) || defined(__NetBSD__)
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: // Disable the slice allocator, since jemalloc already uses similar layout
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: // algorithms, and using a sub-allocator tends to increase fragmentation.
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: // This must be done before g_thread_init() is called.
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: g_slice_set_config(G_SLICE_CONFIG_ALWAYS_MALLOC, 1);
621becf19fe6 Fri Jun 20 10:34:42 2008 -0700: #endif
01fa49869175 Thu Oct 10 16:36:42 2013 -0400: g_thread_init(nullptr);
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: #endif
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: // init
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: bool exit = false;
50027b0a1ed6 Wed Mar 27 11:19:10 2013 -0400: int result = XRE_mainInit(&exit);
See changeset b07ab17dde6c from February 11 here? I would look there. From reading configure.in it does seem that MOZ_MEMORY is defined on linux by default, so it is taking effect here.
$ hg log -v -r b07ab17dde6c
changeset: 168067:b07ab17dde6c
user: Jan Beich <jbeich@tormail.org>
date: Tue Feb 11 09:22:46 2014 -0500
files: toolkit/xre/nsAppRunner.cpp
description:
Bug 969759 - Remove NetBSD version check for unsupported releases. r=glandium
Comment 85•11 years ago
|
||
In case you want to try it... I totally agree that keeping bisecting is the 'right' thing to do here, and I don't want to distract you from that.
Updated•11 years ago
|
Attachment #8380059 -
Attachment is patch: true
Comment 86•11 years ago
|
||
Oops, sorry about the distraction, I only looked at the diff now, this can't take effect outside of NetBSD... sorry!
Reporter | ||
Comment 87•11 years ago
|
||
End of my bisect log :
[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168720:673bbf3b807f (19 changesets remaining, ~4 tests)
19 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168725:5b69776cb061 (10 changesets remaining, ~3 tests)
11 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168727:24548c52a802 (5 changesets remaining, ~2 tests)
3 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168728:ae11ce69621f (3 changesets remaining, ~1 tests)
8 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -b
The first bad revision is:
changeset: 168728:ae11ce69621f
user: Michael Wu <mwu@mozilla.com>
date: Wed Jan 22 08:23:27 2014 -0500
summary: Bug 748598 - Implement VolatileBuffer for OSX, Ashmem, and Windows, r=njn,glandium
What is going non ?!
Comment 88•11 years ago
|
||
This is really surprising as this changeset only adds new code that isn't used just yet... have you double-checked it by hg up <changeset> and rebuilding, for both this changeset and the parent changeset?
Reporter | ||
Comment 89•11 years ago
|
||
Yes I do, of course.
My computer spent 12 hours doing hg bissect.
Fed up, requiring this bug to be closed the way want to close it !
I think all have been done ! Sorry for being upset, but I can't stand this bug anymore.
Reporter | ||
Comment 90•11 years ago
|
||
Or at least give me a workaround code to test. More than a week and I cannot launch any homemade or official nightlies without seeing it hanging for one minute because I choose to use NouVeau for my nvidia card.
Thanks for any answer. I will be a painful bug for any NouVeau user when trunk code will get stabilizied.
Comment 91•11 years ago
|
||
Sorry if I missed it, but have you tried the tentative workaround patches from comment 71 and comment 72?
Comment 92•11 years ago
|
||
Glandium, see the bisection result in comment 87, can you make sense of it? The symptom we're observing here is that some code in nouveau_dri.so, that we call into from the glxtest, seems to go crazy since bug 748598 landed.
Flags: needinfo?(mh+mozilla)
Updated•11 years ago
|
Reporter | ||
Comment 93•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #91)
> Sorry if I missed it, but have you tried the tentative workaround patches
> from comment 71 and comment 72?
No. Will try them asap.
Comment 94•11 years ago
|
||
In fact, we have another avenue that is quite sure to work. In the main Firefox process, maybe we should, after, say, 1 second of waiting, decide to stop waiting and just kill the glxtest process.
Comment 95•11 years ago
|
||
Or better yet, a timebomb thread in the glxtest process that will terminate it after 1 second no matter what.
Reporter | ||
Comment 96•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #91)
> Sorry if I missed it, but have you tried the tentative workaround patches
> from comment 71 and comment 72?
Patch in comment 71 ? Works great. Will try comment 72 patch now.
Reporter | ||
Comment 97•11 years ago
|
||
Patch in comment #72 works too. So, now, what's up ?
Comment 98•11 years ago
|
||
Great! That comment 71 works is great news, as we could totally decide to land that and fix the problem that you're experiencing.
However, there remains a question. Comment 71 only removes one XCloseDisplay call; there are many other XCloseDisplay calls in the Gecko codebase. That Comment 71 works proves that there is a specific problem in XCloseDisplay on your machine, making it sometimes take 100% CPU for 1 minute... landing Comment 71 leaves open the possibility that the other XCloseDisplay calls in Gecko trigger the same bug in your drivers, which would be just as bad for you.
I guess that we should land comment 71, and cross fingers: maybe the other XCloseDisplay calls in Gecko won't trigger this driver bug. But we should be on our guard and react quickly if we get bug report that they in fact do.
Comment 99•11 years ago
|
||
Wow, look at this, some of our XCloseDisplay calls are already done only #ifdef NS_FREE_PERMANENT_DATA, because "closing the display is pretty scary" !
See this:
http://hg.mozilla.org/mozilla-central/file/2d5d5fd03050/dom/plugins/ipc/PluginModuleChild.cpp#l569
We definitely want to do the same, as Karl suggested above, then.
Comment 100•11 years ago
|
||
Attachment #8380230 -
Flags: review?(karlt)
Comment 101•11 years ago
|
||
Frederic, we really appreciate your help with getting this sorted out. We know it's been frustrating for you; it's also frustrating to us to try to debug and understand a bug we can't reproduce... And the issue is that we'd really like to understand it, not just work around, if we can manage, because without an understanding of what's going on we have no guarantee the workaround will work for other NouVeau users. :(
Reporter | ||
Comment 102•11 years ago
|
||
(In reply to Boris Zbarsky [:bz] (reviews will be slow; ask someone else) from comment #101)
> Frederic, we really appreciate your help with getting this sorted out. We
> know it's been frustrating for you; it's also frustrating to us to try to
> debug and understand a bug we can't reproduce... And the issue is that we'd
> really like to understand it, not just work around, if we can manage,
> because without an understanding of what's going on we have no guarantee the
> workaround will work for other NouVeau users. :(
I can easily understand you want to "fix" the bug, not workaround it. It could be a NouVeau bug, triggered by mesa 10.0.x and your code. If somebody using NouVeau with Mesa 10.0.x can confirm this bug report, it will be great.
Comment 103•11 years ago
|
||
Comment 10 and comment 87 are non-intersecting regression windows, so
either 2 different changes have triggered this, or the hang is intermittent.
Also the same code doesn't hang when not forked from firefox.
Sounds like memory layout is relevant, possibly even jemalloc, quite likely an
uninitialized variable.
I didn't find a stack for the hanging process.
Comment 104•11 years ago
|
||
Comment on attachment 8380230 [details] [diff] [review]
Workaround: call XSync instead of XCloseDisplay unless NS_FREE_PERMANENT_DATA is defined
The only likely harm here I think is that any gl library that would hang here may hang on application shutdown, but I guess that's better than hanging on startup.
Attachment #8380230 -
Flags: review?(karlt) → review+
Comment 105•11 years ago
|
||
Assignee: nobody → bjacob
Comment 106•11 years ago
|
||
(In reply to Karl Tomlinson (:karlt) from comment #103)
> Sounds like memory layout is relevant, possibly even jemalloc, quite likely
> an uninitialized variable.
Indeed, that is one of least unreasonable explanations. Note that the regressing changeset only adds code, that is not even called, supporting your memory layout hypothesis.
Valgrind would be a good tool to investigate this....
That's for trying to understand the root cause, and how it manifested itself as a regression.
If we now talk of the different topic of how to avoid running into this system/drivers bug elsewhere in Gecko, now there is something easy and concrete that we could do: we could have a function, XCloseDisplay_workaround(), that does either XCloseDisplay or XSync like in the patch here; and we could go over all current callers of XCloseDisplay in our codebase and make them call XCloseDisplay_workaround() instead. At least http://hg.mozilla.org/mozilla-central/file/2d5d5fd03050/dom/plugins/ipc/PluginModuleChild.cpp#l569 suggests that there have been issues with XCloseDisplay before, so this could be worth it?
Flags: needinfo?(karlt)
Comment 107•11 years ago
|
||
Probably, in most situations where XCloseDisplay is called and need not be called in NS_FREE_PERMANENT_DATA builds, XSync is not necessary.
I suspect PluginModuleChild.cpp#l569 was afraid of issues, rather than working around any particular known issue. I think we can leave the other callers until there are known problems.
Flags: needinfo?(karlt)
Comment 108•11 years ago
|
||
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Reporter | ||
Comment 109•11 years ago
|
||
Thanks for commiting the workaround. Too bad we won't find the real culprit here :(
Comment 110•11 years ago
|
||
Thanks for reporting this bug and all the work on it. If you really really feel motivated (still), I can walk you through valgrinding this, so we can find if there is some uninitialized-value problem here.
Updated•11 years ago
|
Flags: needinfo?(mh+mozilla)
Comment 112•11 years ago
|
||
Look at this, bug 975512 comment 1 has a stack with symbols!
#0 0x0000003f3a2da007 in sched_yield () from /lib64/libc.so.6
#1 0x00007f1e003bb1d1 in nouveau_fence_wait ()
from /usr/lib64/dri/nouveau_dri.so
#2 0x00007f1e0039d635 in nvc0_screen_destroy ()
from /usr/lib64/dri/nouveau_dri.so
#3 0x00007f1e002c6d9f in dri_destroy_screen_helper ()
from /usr/lib64/dri/nouveau_dri.so
#4 0x00007f1e002c6e25 in dri_destroy_screen ()
from /usr/lib64/dri/nouveau_dri.so
#5 0x00007f1e000087eb in driDestroyScreen ()
from /usr/lib64/dri/nouveau_dri.so
#6 0x000000321f646932 in dri2DestroyScreen () from /lib64/libGL.so.1
#7 0x000000321f61c526 in FreeScreenConfigs.isra.3 () from /lib64/libGL.so.1
#8 0x000000321f61c5a5 in glx_display_free () from /lib64/libGL.so.1
#9 0x000000321f61c6a5 in __glXCloseDisplay () from /lib64/libGL.so.1
#10 0x0000003f3e21fd22 in XCloseDisplay () from /lib64/libX11.so.6
#11 0x00007f1e09933dcf in glxtest() () from /home/sstangl/bin/nightly/libxul.so
#12 0x00007f1e09934b36 in fire_glxtest_process() ()
from /home/sstangl/bin/nightly/libxul.so
#13 0x00007f1e0992e655 in XREMain::XRE_mainInit(bool*) ()
from /home/sstangl/bin/nightly/libxul.so
#14 0x00007f1e09932886 in XREMain::XRE_main(int, char**, nsXREAppData const*)
() from /home/sstangl/bin/nightly/libxul.so
---Type <return> to continue, or q <return> to quit---
#15 0x00007f1e09932b94 in XRE_main () from /home/sstangl/bin/nightly/libxul.so
#16 0x0000000000406b66 in do_main(int, char**, nsIFile*) ()
#17 0x000000000040383b in main ()
Updated•11 years ago
|
Comment 113•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #112)
> Look at this, bug 975512 comment 1 has a stack with symbols!
The bug still reproduces for me with the 2014-02-25 Linux x86_64 Nightly, but with a different stack that doesn't include glxtest(). It looks like about half of the problem is solved, because startup isn't quite as bad as it used to be with the 2014-02-24 Nightly.
#0 0x0000003f3a2da007 in sched_yield () from /lib64/libc.so.6
#1 0x00007f1dd70b91d1 in nouveau_fence_wait ()
from /usr/lib64/dri/nouveau_dri.so
#2 0x00007f1dd709b635 in nvc0_screen_destroy ()
from /usr/lib64/dri/nouveau_dri.so
#3 0x00007f1dd6fc4d9f in dri_destroy_screen_helper ()
from /usr/lib64/dri/nouveau_dri.so
#4 0x00007f1dd6fc4e25 in dri_destroy_screen ()
from /usr/lib64/dri/nouveau_dri.so
#5 0x00007f1dd6d067eb in driDestroyScreen ()
from /usr/lib64/dri/nouveau_dri.so
#6 0x000000321f646932 in dri2DestroyScreen () from /lib64/libGL.so.1
#7 0x000000321f61c526 in FreeScreenConfigs.isra.3 () from /lib64/libGL.so.1
#8 0x000000321f61c5a5 in glx_display_free () from /lib64/libGL.so.1
#9 0x000000321f61c6a5 in __glXCloseDisplay () from /lib64/libGL.so.1
#10 0x0000003f3e21fd22 in XCloseDisplay () from /lib64/libX11.so.6
#11 0x0000003221651086 in gdk_display_x11_finalize ()
from /lib64/libgdk-x11-2.0.so.0
#12 0x0000003f3d614fcb in g_object_unref () from /lib64/libgobject-2.0.so.0
#13 0x00007f1e0992bde8 in MOZ_gdk_display_close(_GdkDisplay*) ()
from /home/sstangl/bin/nightly/libxul.so
#14 0x00007f1e099329ef in XREMain::XRE_main(int, char**, nsXREAppData const*)
() from /home/sstangl/bin/nightly/libxul.so
#15 0x00007f1e09932b94 in XRE_main () from /home/sstangl/bin/nightly/libxul.so
#16 0x0000000000406b66 in do_main(int, char**, nsIFile*) ()
#17 0x000000000040383b in main ()
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•11 years ago
|
Comment 114•11 years ago
|
||
Karl, comment 113 means that the same bug is also triggered by the MOZ_gdk_display_close calls made directly by XRE_main(). How can we work around that?
Updated•11 years ago
|
Flags: needinfo?(karlt)
Comment 115•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #114)
Looks like MOZ_gdk_display_close can be a no-op unless CLEANUP_MEMORY.
Flags: needinfo?(karlt)
Comment 116•11 years ago
|
||
Going to track this for now based on the reproduction reported in Comment 113.
Updated•11 years ago
|
Comment 117•11 years ago
|
||
For the record, I only was "assigned" this insofar as I was landing a patch there that was hopefully fixing this. I'm not actually responsible for working around all the ways that this hang can happen ;-)
Assignee: bjacob → nobody
Comment 118•11 years ago
|
||
Attachment #8384737 -
Flags: review?(karlt)
Comment 119•11 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #118)
> Created attachment 8384737 [details] [diff] [review]
> Workaround: do nothing in MOZ_gdk_display_close
I verified locally that this patch resolves the issue.
Comment 120•11 years ago
|
||
Comment on attachment 8384737 [details] [diff] [review]
Workaround: do nothing in MOZ_gdk_display_close
Surely we don't need to silence unused parameter warnings?
The very nature of virtual functions, for example, means that parameter will not always be used.
Attachment #8384737 -
Flags: review?(karlt) → review+
Comment 121•11 years ago
|
||
Not sure if that is your questions, but compilers do generate warnings about unused functions parameters; when a function does not use a parameter's value, the most recommended thing to do is to simply declare the parameter type only, omitting its name, like
void myfunction(int)
which is probably what I'd do for a virtual override not using a parameter; the problem here of course is that used-ness depends on #if branches.
Comment 122•11 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/85efdc0150dc
Intentionally not re-assigning to myself.
Comment 123•11 years ago
|
||
Assignee: nobody → bjacob
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
status-firefox30:
--- → fixed
Resolution: --- → FIXED
Comment 124•11 years ago
|
||
Nope, I still enjoy life, I don't want to be the assignee for this bug :-)
Assignee: bjacob → nobody
Just curious.
The patch produced this line (or maybe it was there before).
http://hg.mozilla.org/mozilla-central/annotate/441f5fd256e2/toolkit/xre/nsAppRunner.cpp#l2669
1.69 +#else // not CLEANUP_MEMORY
1.70 + // Don't do anything to avoid running into driver bugs under XCloseDisplay().
1.71 + // See bug 973192.
--> 1.72 + (void) display;
But the marked line above is NO-OP.
It should be eliminated completely, no?
TIA
Comment 126•11 years ago
|
||
The #else here refers to this #ifdef that was added in the same patch:
http://hg.mozilla.org/mozilla-central/annotate/441f5fd256e2/toolkit/xre/nsAppRunner.cpp#l2599
We made this entire function be a no-op.
(In reply to Benoit Jacob [:bjacob] from comment #126)
> The #else here refers to this #ifdef that was added in the same patch:
>
> http://hg.mozilla.org/mozilla-central/annotate/441f5fd256e2/toolkit/xre/
> nsAppRunner.cpp#l2599
>
> We made this entire function be a no-op.
Thank you.
[OK, there may lie a problem. I may have ended up with a few days old comm-central, which may not
have had the whole patches which have been referred to. comm-central can lag behind m-c sometimes.
I will do "python client.py checkout again" and see how it goes.]
My comment below is based on a few days old comm-central source.
I see, but in a FULL DEBUG build of TB (comm-central) of a few days ago
I got a crash due to a call of FcFini() in the said function at the line
http://hg.mozilla.org/mozilla-central/annotate/441f5fd256e2/toolkit/xre/nsAppRunner.cpp#l2660
and while looking around, I noticed this odd line
> (void) display;
[ this particular line does not seem to do anything ? ]
Well, if it is known to be safe, fine, but may be it can be removed completely was my s8uggestion.
TIA
Comment 128•11 years ago
|
||
This (void) display line is a no-op. Its purpose is to silence a compiler warning about this function parameter being unused.
(In reply to Benoit Jacob [:bjacob] from comment #128)
> This (void) display line is a no-op. Its purpose is to silence a compiler
> warning about this function parameter being unused.
Now finally I see the light :-)
[Anyone wanting to put this info in the comment line?]
Comment 130•11 years ago
|
||
(In reply to ISHIKAWA, Chiaki from comment #129)
> [Anyone wanting to put this info in the comment line?]
Better to silence the useless compiler warning and remove the confusing code.
You need to log in
before you can comment on or make changes to this bug.
Description
•