Closed Bug 973192 Opened 11 years ago Closed 11 years ago

Firefox hangs on start when using NouVeau driver

Categories

(Core :: Graphics: Layers, defect)

x86_64
Linux
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla30
Tracking Status
firefox30 + fixed

People

(Reporter: fredbezies, Unassigned)

References


Details

(Keywords: regression)

Attachments

(13 files)

8.30 KB, text/x-log
Details
22.34 KB, text/plain
Details
42.74 KB, text/plain
Details
6.86 KB, patch
Details | Diff | Splinter Review
2.63 KB, patch
Details | Diff | Splinter Review
1.08 KB, patch
Details | Diff | Splinter Review
7.73 KB, text/plain
Details
860 bytes, patch
Details | Diff | Splinter Review
1.03 KB, patch
Details | Diff | Splinter Review
1.01 KB, patch
Details | Diff | Splinter Review
1.32 KB, patch
Details | Diff | Splinter Review
2.01 KB, patch
karlt
: review+
Details | Diff | Splinter Review
2.97 KB, patch
karlt
: review+
Details | Diff | Splinter Review
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0 (Beta/Release)
Build ID: 20140215125843

Steps to reproduce:

Since February 14th, I cannot launch Firefox. It eats all the CPU. I have to open a terminal, use top and kill the firefox task in order to get it to run. I'm using Arch Linux, 64 bits, with gcc 4.8.2. This seems to be a really recent regression. Removing addons doesn't change a thing. A firefox task is eating all the CPU.

Actual results:

Just build firefox. Try to launch it. Wait for at least 10 seconds. Nothing happens. Opening a terminal and entering top shows me that firefox is eating 100% of the CPU.

Expected results:

It should just launch.
Component: Untriaged → General
My last working and quick-starting build was based on this revision: http://hg.mozilla.org/mozilla-central/rev/6687d299c464 The strange part is that Firefox starts OK when I kill it in top.
Summary: Firefox very slow to start → Firefox hangs on start
Bug 959787 seems a bit unlikely to cause this sort of problem.... Can you try tinderbox builds to narrow down the regression range further, perhaps?
Flags: needinfo?(fredbezies)
I'll try ASAP, hoping I will find a smaller regression range.
Flags: needinfo?(fredbezies)
(In reply to Frederic Bezies from comment #6)
> I'll try asap hoping I will find a smaller regression range.

Looks like I got one.

Last working tinderbuild (no hang on start): https://hg.mozilla.org/mozilla-central/rev/23f7a629a217
First busted tinderbuild (hang and having to kill process): https://hg.mozilla.org/mozilla-central/rev/5d7caa093f4f

Looks weird anyway...
Can you try inbound tinderbuilds?
(In reply to Boris Zbarsky [:bz] (reviews will be slow; ask someone else) from comment #8)
> Can you try inbound tinderbuilds?

I'll try it and report ASAP.
Blocks: 972397
Component: General → Graphics: Layers
Flags: needinfo?(nical.bugzilla)
Keywords: regression
Product: Firefox → Core
One more question. When Firefox hangs, can you possibly get stacks for all its threads?
Will try. But looking at it again, I found another possible culprit: bug #972700... What about it?
Is that really in your range from comment 10? In any case, it's just a string change; shouldn't cause hangs!
Just asking your opinion. Just tell me how to report stacks while it's hanging. Will report the data ASAP.
Something like this should work, if "firefox-pid" is the process id and you're using an sh-compatible shell:

gdb > /tmp/log.txt 2>&1

and then at the gdb prompt type:

attach firefox-pid

and then:

thread apply all bt

and then:

quit
Added a gdb log, hoping it will help fix this bug.
Thanks! Here's what the thread 1 (main thread) stack looks like:

Thread 1 (Thread 0x7f2d032c2740 (LWP 3445)):
#0  0x00007f2d02ebfa8d in read () from /usr/lib/libpthread.so.0
#1  0x00007f2cfd97264a in mozilla::widget::GfxInfo::GetData() () from /home/fred/Téléchargements/firefox/libxul.so
#2  0x00007f2cfd96ea2d in nsBaseWidget::ComputeShouldAccelerate(bool) () from /home/fred/Téléchargements/firefox/libxul.so
#3  0x00007f2cfd96fbf8 in nsBaseWidget::GetLayerManager(mozilla::layers::PLayerTransactionChild*, mozilla::layers::LayersBackend, nsIWidget::LayerManagerPersistence, bool*) () from /home/fred/Téléchargements/firefox/libxul.so
#4  0x00007f2cfdeb761e in PresShell::Paint(nsView*, nsRegion const&, unsigned int) () from /home/fred/Téléchargements/firefox/libxul.so
#5  0x00007f2cfdbadcb4 in nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) () from /home/fred/Téléchargements/firefox/libxul.so
#6  0x00007f2cfdebe827 in nsRefreshDriver::Tick(long, mozilla::TimeStamp) () from /home/fred/Téléchargements/firefox/libxul.so

That shows us reading from the pipe to the glxtest process, and presumably blocking on that read...
Flags: needinfo?(bjacob)
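For context on what that read() is, here is a minimal, self-contained sketch of the fork-and-pipe pattern involved (an illustration only, not Gecko's actual fire_glxtest_process code): the parent's blocking read() is the call the main-thread stack above is sitting in, and it only returns once the child writes its result or exits.

#include <sys/types.h>
#include <unistd.h>
#include <cstdio>

int main()
{
  int fds[2];
  if (pipe(fds) != 0)
    return 1;

  pid_t child = fork();
  if (child == 0) {
    // Child: stands in for the GL probe. If anything in here stalls
    // (e.g. driver teardown), the parent's read() below never returns.
    close(fds[0]);
    const char msg[] = "VENDOR\nRENDERER\n";  // placeholder probe result
    (void) write(fds[1], msg, sizeof(msg) - 1);
    _exit(0);
  }

  // Parent: this blocking read() corresponds to frame #0 in the stack above.
  close(fds[1]);
  char buf[256];
  ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
  if (n > 0) {
    buf[n] = '\0';
    std::printf("child reported:\n%s", buf);
  }
  return 0;
}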
(In reply to Boris Zbarsky [:bz] (reviews will be slow; ask someone else) from comment #11)
> So http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c7802c9d6eec&tochange=9e7cf0b1d80c
>
> Bug 972397 sure seems like the most likely culprit in there.

I would be very surprised. This should just affect assertions in debug builds. Other comments in this thread also look unrelated to bug 972397.
Flags: needinfo?(nical.bugzilla)
Hmm. Actually, if we're using CPU, not just blocked... It might be worth getting a profile to see what's using the CPU. All the threads showing in attachment 8378338 [details] seem to be blocked, not running.
Firefox is blocked. I have to kill its process using top. The log was obtained while Firefox was blocked.
If it's blocking on the read from the glxtest process, that means that 1) the glxtest process is still alive and 2) it hasn't finished writing the data yet. That could be caused, for example, by the glxtest process stalling somewhere before it gets to write the data, which is not impossible as it has to call into X11 and OpenGL libraries.

Can you please:

1) Paste here the output of this command:

ps aux | grep firefox

2) One of these processes will be the glxtest process. Attach GDB to it, and get a backtrace (like above). If you don't know which of these is the glxtest process, do that for all of them, and paste all the resulting stacks here.
Flags: needinfo?(bjacob)
Here is the output I get:

[fred@fredo-arch ~]$ ps aux | grep firefox
fred     21115 48.0  6.5 832532 263672 pts/0  Sl  18:07  0:02 ./firefox
fred     21117 99.5  0.4 309064  17672 pts/0  R   18:07  0:05 ./firefox
fred     21140  0.0  0.0  10692   1064 pts/1  S+  18:07  0:00 grep firefox

No glxtest listed at all.
That's normal: 'glxtest' only appears in symbol names. The glxtest process is one of these two firefox processes here. Please attach gdb to each of them and get backtraces.
Logged the thread backtraces as asked in comment #25.
Weird: these two processes are in nearly exactly the same state. Both are real Firefox processes, not glxtest. Do you really get two such processes every time you run Firefox, i.e. can you confirm that neither of these processes is a leftover from a previous attempt to run Firefox?
I told you I had no glxtest processes :) And yes, I had no remaining firefox processes. Will try again, killing any remaining firefox processes before logging anything. This bug is really weird. As I said in comment #0: "I have to open a terminal, use top and kill the firefox task in order to get it to run."
Quick question: if you run with the environment variable MOZ_AVOID_OPENGL_ALTOGETHER=1 does the problem still occur?
Well, looks like you found it ;) Using export MOZ_AVOID_OPENGL_ALTOGETHER=1 fixes the bug. So, what now? :D
Please rebuild and run with this (without any environment variable or debugger).
(Let me know if it's useful that I spawn a tryserver build so you don't need to rebuild yourself)
(In reply to Benoit Jacob [:bjacob] from comment #32)
> (Let me know if it's useful that I spawn a tryserver build so you don't need
> to rebuild yourself)

Let me unpack a trunk source code copy I have, and in about 50 minutes I will give you the results.
Here is the output:

(process:29448): GLib-CRITICAL **: g_slice_set_config: assertion 'sys_page_size == 0' failed
entering: bool fire_glxtest_process() on pid 29448
fork succeeded, we are the parent process (pid 29448), our child process is pid 29450
fork succeeded, we are the child process (pid 29450), running glxtest
entering: void glxtest() on pid 29450
opened OpenGL library: libGL.so.1
got glXGetProcAddress
got all proc addresses
opened X display
set X error handler
chose visual
created window
created GLX context
got GL strings

Last mesa update on my computer? February 4th. Last xorg driver update (nouveau for my nvidia card)? January 13th.
And I assume that that is all, and you don't see "cleaned up" ? In that case, that means that the glxtest process is stalling while trying to destroy its X11 or GL resources. Let me whip up a second patch with more detailed logging around that part.
(In reply to Benoit Jacob [:bjacob] from comment #35)
> And I assume that that is all, and you don't see "cleaned up"?

Nothing cleaned up at all. It hangs after "got GL strings".

> In that case, that means that the glxtest process is stalling while trying
> to destroy its X11 or GL resources. Let me whip up a second patch with more
> detailed logging around that part.

OK. Let's hope it will be more informative ;)
Actually, it would be extremely useful to have GDB attached to the glxtest process. This way:
- if it crashes, we'll know why
- if it stalls, you'll be able to interrupt it (Ctrl+C) and know what it is doing.

This patch should make it easy, by telling you the right command line to attach GDB to the glxtest process, and giving you 30 seconds to do so.
[fred@fredo-arch firefox]$ ./firefox &
[1] 5199
[fred@fredo-arch firefox]$
(process:5199): GLib-CRITICAL **: g_slice_set_config: assertion 'sys_page_size == 0' failed
entering: bool fire_glxtest_process() on pid 5199
fork succeeded, we are the parent process (pid 5199), our child process is pid 5201
fork succeeded, we are the child process (pid 5201), running glxtest
entering: void glxtest() on pid 5201
opened OpenGL library: libGL.so.1
got glXGetProcAddress
got all proc addresses
opened X display
set X error handler
chose visual
created window
created GLX context
got GL strings
writing temporary buffer...
starting to clean up...
calling glXMakeCurrent(display=0x7f3dfcfd4000, None, nullptr)...
calling glXDestroyContext(display=0x7f3dfcfd4000, context=0x7f3dfcfed0b0)...
calling XDestroyWindow(display=0x7f3dfcfd4000, window=0x1600002)...
calling XFreeColormap(display=0x7f3dfcfd4000, ...)...
calling XCloseDisplay(display=0x7f3dfcfd4000)...

It hangs after XCloseDisplay().
Flags: needinfo?(fredbezies)
After applying your last patch and attaching to the glxtest process, I got this. Not really useful, I think :(

(gdb) thread apply all bt

Thread 1 (Thread 0x7f8ab4776740 (LWP 12801)):
#0  0x00007f8ab365575d in nanosleep () from /usr/lib/libc.so.6
#1  0x00007f8ab36555f4 in sleep () from /usr/lib/libc.so.6
#2  0x00007f8aafa84845 in ?? () from /home/fred/logs/fox/objdir-fx/dist/firefox/libxul.so
#3  0x00007f8aafa85a27 in ?? () from /home/fred/logs/fox/objdir-fx/dist/firefox/libxul.so
#4  0x00007f8aafa80293 in ?? () from /home/fred/logs/fox/objdir-fx/dist/firefox/libxul.so
#5  0x00007f8aafa83661 in ?? () from /home/fred/logs/fox/objdir-fx/dist/firefox/libxul.so
#6  0x00007f8aafa838f3 in XRE_main () from /home/fred/logs/fox/objdir-fx/dist/firefox/libxul.so
#7  0x0000000000403bf3 in _start ()
Hey, after you attach GDB to the glxtest process, you need to tell GDB to let the glxtest process resume execution (attaching implicitly stops the process). To do that, just enter 'c' in GDB:

(gdb) c

Then it will resume, but that will be resuming its 30-second sleep, so you may have to wait a bit for actually interesting things to happen. Let it run until it crashes or until it stalls and you have to Ctrl-C it.
It hung for one minute before Firefox started. Wow! Here is the Firefox log.

Start until hang:

[fred@fredo-arch firefox]$ ./firefox &
[1] 20194
[fred@fredo-arch firefox]$
(process:20194): GLib-CRITICAL **: g_slice_set_config: assertion 'sys_page_size == 0' failed
entering: bool fire_glxtest_process() on pid 20194
fork succeeded, we are the parent process (pid 20194), our child process is pid 20196
fork succeeded, we are the child process (pid 20196), running glxtest
entering: void glxtest() on pid 20196
########################################################
You can attach GDB to the glxtest process now with this command:
gdb --pid 20196
Sleeping for 30 seconds now to give you time to attach...
########################################################
opened OpenGL library: libGL.so.1
got glXGetProcAddress
got all proc addresses
opened X display
set X error handler
chose visual
created window
created GLX context
got GL strings
writing temporary buffer...
starting to clean up...
calling glXMakeCurrent(display=0x7f4ccbad4000, None, nullptr)...
calling glXDestroyContext(display=0x7f4ccbad4000, context=0x7f4ccbaed0b0)...
calling XDestroyWindow(display=0x7f4ccbad4000, window=0x1400002)...
calling XFreeColormap(display=0x7f4ccbad4000, ...)...
calling XCloseDisplay(display=0x7f4ccbad4000)...

One minute after:

calling dlclose on libGL...
cleaned up
wrote to pipe

And the gdb output:

(gdb) c
Continuing.
[Inferior 1 (process 20196) exited normally]

I don't understand. Why is there a one-minute hang on start?
OK, thanks, so now we know that the glxtest process is neither crashing nor even stalling forever; it is just spending a minute in XCloseDisplay().

Could you repeat the procedure, attaching GDB to the glxtest process, and once you've seen the message "calling XCloseDisplay...", hit Ctrl-C to interrupt it, get a backtrace, then enter 'c' in GDB to continue execution, wait a few seconds, and repeat this a few times? I would like to get a sense of what it is, in XCloseDisplay(), that makes it take so long on your machine.

Alternatively: depending on what's happening here, a better tool to get this information might be a profiler. Install the 'perf' profiler that's part of the linux kernel (it might be part of a "linux-utils" package, or some such). Run without GDB, but with patch #3 so you get the PID of the glxtest process. Once you have that pid and see the message "calling XCloseDisplay...", do this:

perf record -f -g -p $THE_PID_OF_GLXTEST_PROCESS

Let it run a few seconds, then hit Ctrl-C, then do this to view the resulting profile:

perf report

You can then output to a file with

perf report --stdio > somefile.txt

and attach this here (maybe trimmed / compressed).
Needless to say, it is very much worth installing debug symbols from your linux distro for X11 and GL libraries, to get useful stacks into them.
How to decide whether you need a debugger or a profiler:
- if the glxtest process is using 100% of a CPU, then you rather want a profiler to understand what it's doing.
- if it's not using much CPU, then you rather want a debugger to understand where it's waiting/sleeping.
(In reply to Benoit Jacob [:bjacob] from comment #45)
> How to decide whether you need a debugger or a profiler:
> - if the glxtest process is using 100% of a CPU, then you rather want a
> profiler to understand what it's doing.
> - if it's not using much CPU, then you rather want a debugger to understand
> where it's waiting/sleeping.

It eats 100% CPU. Profiling? It will need a lot of packages, and I'm not sure I will have all the debug ones. I wonder whether bug #972397 broke everything for my linux installation.

I will try gdb first. If I cannot understand what is happening, I will give up on this bug, because I'm fed up with it. You can easily understand me.
I sure can understand :-) At this point we pretty much know that the bug is somewhere in your distro's X libraries: XCloseDisplay should not eat 100% CPU for a minute. The question is should we do something about it? Is it likely to affect more users than just you? That's why I wanted to dig deep into what XCloseDisplay is actually doing here.
Bug 972397 is totally ruled out here; in fact, since the problem is a bug in XCloseDisplay, I am almost certain that even if you go back to an older version of Firefox, you will still experience the same symptoms.
(In reply to Frederic Bezies from comment #10)
> On mozilla-inbound:
>
> Last working tinderbuild:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/c7802c9d6eec
> First broken tinderbuild:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/9e7cf0b1d80c
>
> Could it be bug #972397?!

Are we sure of that regression window? The debugging we've been doing around comment 39 and comment 42 says that we're just hitting a bug in X libraries (XCloseDisplay) long, long before any gecko gfx code kicks into play.
perf run for 15 seconds.
(In reply to Benoit Jacob [:bjacob] from comment #47)
> I sure can understand :-)
>
> At this point we pretty much know that the bug is somewhere in your distro's
> X libraries: XCloseDisplay should not eat 100% CPU for a minute. The
> question is should we do something about it? Is it likely to affect more
> users than just you? That's why I wanted to dig deep into what XCloseDisplay
> is actually doing here.

Note: it is triggered by Mozilla Firefox code. I never had this bug before. By the way, none of the Xorg libraries were modified on or around February 13th. So this is why I thought it was some code added / modified in Mozilla Firefox. I can see this behaviour on both homemade builds (so relying on my Xorg libs) and official builds.

If I look at the perf log, I can see this:

13.38%  firefox  nouveau_dri.so     [.] 0x000000000033f6a5
 7.89%  firefox  [kernel.kallsyms]  [k] put_prev_task_fair
 7.82%  firefox  [kernel.kallsyms]  [k] system_call
 6.11%  firefox  [kernel.kallsyms]  [k] sched_clock
 5.30%  firefox  [kernel.kallsyms]  [k] __schedule

Could it be a kernel bug? If so, the best thing is to stop spamming this bug report.
I am filing a bug against DRI/nouveau. Meanwhile, if you are not convinced that this is a bug in X/GL libraries/drivers, do feel free to retry an older Firefox build. If my understanding is correct, this bug is completely independent of Firefox version; if you found otherwise, that would be very interesting!
Karl, would it make sense for us to just not call XCloseDisplay? Or not call it when running on this driver (nouveau)? Or would you rather have us get to the bottom of this issue before we start working around this bug?
Flags: needinfo?(karlt)
(In reply to Benoit Jacob [:bjacob] from comment #52)
> I am filing a bug against DRI/nouveau. Meanwhile, if you are not convinced
> that this is a bug in X/GL libraries/drivers, do feel free to retry an older
> Firefox build. If my understanding is correct, this bug is completely
> independent of Firefox version; if you found otherwise, that would be very
> interesting!

I'm writing this answer using my last working homemade build. So, using old code :) It is based on this revision: http://hg.mozilla.org/mozilla-central/rev/6687d299c464

I can try any nightly from before February 14th and it will run flawlessly. Well, I guess ;) I will try yesterday's nightly with the nvidia drivers, and I will report whether it works or not.
Note that if this really is a mozilla-side regression (i.e. if I'm wrong), then the most important thing you can do here is to bisect this down to one changeset. (Oh, now I remember, you already said that you were fed up with this bug ;-) )
A quicker thing you can do is revert the patch of Bug 972397 and see if that fixes the bug.
Thanks for the bug report on freedesktop. Until this awful bug is either fixed or worked around, I will have to use the official Nvidia drivers, which is not really good news. Sorry for being a little rude, but this bug is really annoying and can affect any NouVeau user.
Summary: Firefox hangs on start → Firefox hangs on start when using NouVeau driver
(In reply to Benoit Jacob [:bjacob] from comment #57)
> A quicker thing you can do is revert the patch of Bug 972397 and see if that
> fixes the bug.

Reverting it? I tried, and got a reject message. The code is "too heavily" modified to do an automatic revert.
Generated like this:

bjacob:/hack/mozilla-central$ hg export 9e7cf0b1d80c > /tmp/a
bjacob:/hack/mozilla-central$ patch -p1 -R < /tmp/a
patching file gfx/layers/client/TextureClient.cpp
bjacob:/hack/mozilla-central$ hg qnew revert-bug-972397

Does this fix the bug?
Flags: needinfo?(fredbezies)
Thanks. Will try tomorrow morning and report.
Flags: needinfo?(fredbezies)
Frédéric: the good people over at the freedesktop.org bug would like to have the output of the 'glxinfo' program on your machine. While you're at it, it's probably not a bad idea to share some details about your graphics card, and the versions of some relevant packages installed on your system (Xorg, Mesa, kernel, etc).
Flags: needinfo?(fredbezies)
Frédéric, I have one more favor to ask you. The DRI developers wanted a stand-alone version of glxtest, which I put there: https://bugs.freedesktop.org/attachment.cgi?id=94465

Obviously, they would like a confirmation that this program does reproduce the problem. You could check that in a few seconds by doing this:

wget https://bugs.freedesktop.org/attachment.cgi?id=94465 -O glxtest.cpp
c++ glxtest.cpp -o glxtest -ldl -lX11
./glxtest

Does this hang for a minute?

Thanks!

(Side note: DRI developers also rightly point out that it might be most convenient if you CC'd yourself on the DRI bug.)
It's hard to correlate the XCloseDisplay problem with the changesets in the regression range. A change would need to either affect what runs before glxtest forks, or affect the timing of the wait for the glxtest output.

(In reply to Benoit Jacob [:bjacob] from comment #53)
> Karl, would it make sense for us to just not call XCloseDisplay? Or not call
> it when running on this driver (nouveau)? Or would you rather have us get to
> the bottom of this issue before we start working around this bug?

If we are waiting for glxtest output earlier, then that would be worth understanding.

We want XCloseDisplay in builds with NS_FREE_PERMANENT_DATA defined, to clean up memory. In other builds, this could be changed to XSync, if that helps. We need the XSync to know whether the previous methods have generated any X errors, which would cause problems if they were used in the main process.
Flags: needinfo?(karlt)
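To make that suggestion concrete, here is a minimal sketch of the proposed split, assuming Gecko's NS_FREE_PERMANENT_DATA macro and plain Xlib calls; the helper name is made up and this only illustrates the idea in the previous comment, not the patch that eventually landed.

#include <X11/Xlib.h>

// Hypothetical helper for the glxtest cleanup path.
static void CleanupGlxtestDisplay(Display* aDisplay)
{
#ifdef NS_FREE_PERMANENT_DATA
  // Leak-checking builds: actually release the display so tools don't report it.
  XCloseDisplay(aDisplay);
#else
  // Other builds: one round trip is enough to surface any pending X errors,
  // and skipping XCloseDisplay avoids the driver teardown path that hangs here.
  XSync(aDisplay, False);
#endif
}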
(In reply to Benoit Jacob [:bjacob] from comment #63)
> Frédéric, I have one more favor to ask you. The DRI developers wanted a
> stand-alone version of glxtest, which I put there:
> https://bugs.freedesktop.org/attachment.cgi?id=94465
>
> wget https://bugs.freedesktop.org/attachment.cgi?id=94465 -O glxtest.cpp
> c++ glxtest.cpp -o glxtest -ldl -lX11
> ./glxtest
>
> Does this hang for a minute?

It doesn't hang at all. I changed the kernel less than one hour ago. Related?!

> Thanks!
>
> (Side note: DRI developers also rightly point out that it might be most
> convenient if you CC'd yourself on the DRI bug).

Added some info on the freedesktop bugzilla bug report.
Flags: needinfo?(fredbezies)
(In reply to Benoit Jacob [:bjacob] from comment #60)
> Created attachment 8379219 [details] [diff] [review]
> Revert bug 972397
>
> Generated like this:
>
> bjacob:/hack/mozilla-central$ hg export 9e7cf0b1d80c > /tmp/a
> bjacob:/hack/mozilla-central$ patch -p1 -R < /tmp/a
> patching file gfx/layers/client/TextureClient.cpp
> bjacob:/hack/mozilla-central$ hg qnew revert-bug-972397
>
> Does this fix the bug?

Sad news: no fix at all. Even reverting the patch in gfx/layers/client/TextureClient.cpp doesn't "kill" this hang. I don't know what is happening. A "dirty workaround" is to remove NouVeau and install the nvidia driver. But it won't fix the problem :(
(In reply to Frederic Bezies from comment #65)
> (In reply to Benoit Jacob [:bjacob] from comment #63)
>
> Doesn't hang at all. Changed kernel less than one hour ago. Related ?!

Does the Firefox hang still reproduce?
Yes. It is not kernel related. I grabbed the latest official nightly, and it hangs. So either I replace NouVeau with nvidia, which is the "worst workaround", or a "trick" is found. I have no ideas now.
Thanks. To be clear, we do have a trick we can possibly use: not call XCloseDisplay. See comment 64.
Ouch :( I can verify this option. Just give me a patch with this trick, and I'll report whether it starts flawlessly.
Pro: conceivably doable, because XSync will still catch any pending errors
Con: might not actually fix the issue, if it is happening because of asynchronous operations.
Pro: this is very likely to fix the issue
Con: this won't catch any error caused by pending async operations, since we're exiting without syncing.

Once we know whether each of these two tentative workarounds works, we can better decide what to do. If the hang is caused by syncing with a pending async operation, then we'll be able to figure out which one by putting X in synchronized mode.
What I'd still like to understand is why the problem started. I mean, old nightlies work, new ones do not. So presumably some code change on our end is interacting with whatever is going on with Frederic's drivers and such... but which change is it? Frederic, if you do in fact have the patience to hg bisect this, that would be pretty awesome.
I would like to understand it too. But I don't know how much time it will take. There is a tinderbox regression range. So, if you tell me how to bisect this, I'll do my best. But it will take a long time :(
Assuming you have an hg pull of the tree... The simplest thing to do is to start with the two nightlies that bracket the problem appearing. Look up their changeset IDs. Then run "hg bisect -g good-id" and "hg bisect -b bad-id". Then hg will automatically update to some changeset in between. Build, test, and if it's good run "hg bisect -g" and if it's bad run "hg bisect -b". Keep doing that until it narrows things to one changeset. Then paste the resulting output here.
[fred@fredo-arch src]$ hg bisect -g 23f7a629a217
[fred@fredo-arch src]$ hg bisect -b 5d7caa093f4f
The first bad revision is:
changeset:   168758:5d7caa093f4f
parent:      168692:23f7a629a217
parent:      168757:9e7cf0b1d80c
user:        Ryan VanderMeulen <ryanvm@gmail.com>
date:        Fri Feb 14 08:54:12 2014 -0500
summary:     Merge inbound to m-c.

Not all ancestors of this changeset have been checked.
Use bisect --extend to continue the bisection from
the common ancestor, d275eebfae04.

When I enter hg bisect --extend:

[fred@fredo-arch src]$ hg bisect --extend
Extending search to changeset 168677:d275eebfae04
3315 files updated, 0 files merged, 1858 files removed, 0 files unresolved

What to do now? Building?
Yep. Now you keep bisecting until it comes up with a new "the first bad revision is".
Oh, and yes, after running "hg bisect --extend" you want to build, test, then run "hg bisect -g" or "hg bisect -b" as needed.
It will need only one build. I see the bug from one changeset to another. And in the "guilty changeset", there is bug #972397...
First build (based on last working tinderbox): OK.

[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168743:1f73309205ad (29 changesets remaining, ~4 tests)
97 files updated, 0 files merged, 0 files removed, 0 files unresolved

We will see ;)
Full bisect log! It was long.

[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168743:1f73309205ad (29 changesets remaining, ~4 tests)
97 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -b
Testing changeset 168736:5863efef36ed (14 changesets remaining, ~3 tests)
17 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -b
Testing changeset 168732:a0fd5acb4ea5 (7 changesets remaining, ~2 tests)
5 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -b
Testing changeset 168730:c164916b5dc8 (3 changesets remaining, ~1 tests)
7 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -b
The first bad revision is:
changeset:   168730:c164916b5dc8
parent:      168729:532299d1368f
parent:      168677:d275eebfae04
user:        Wes Kocher <wkocher@mozilla.com>
date:        Thu Feb 13 18:50:08 2014 -0800
summary:     Merge m-c to inbound on a CLOSED TREE

Not all ancestors of this changeset have been checked.
Use bisect --extend to continue the bisection from
the common ancestor, 0f18070eb5d6.
[fred@fredo-arch src]$ hg bisect --extend
Extending search to changeset 168617:0f18070eb5d6
154 files updated, 0 files merged, 20 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168711:5ae4fa0b1340 (38 changesets remaining, ~5 tests)
39 files updated, 0 files merged, 0 files removed, 0 files unresolved

Shall I continue building?
Thanks! Yes, what you have found is a merge cset, so yes it is useful that you keep bisecting into the merged range... if you have the patience!
(In reply to Frederic Bezies from comment #81)
> Testing changeset 168711:5ae4fa0b1340 (38 changesets remaining, ~5 tests)
> 39 files updated, 0 files merged, 0 files removed, 0 files unresolved

Note that the regression range in comment 10 has only 11 changesets, with no merges, so it may be faster to start fresh with that range, if hg doesn't let you just add info for those revisions into the current bisect. Perhaps easiest to continue from the current bisect process though.
We were doubting that the regression range in comment 10 was valid; at least, it is worth double-checking.

I've taken another approach now: glxtest is invoked only by code in nsAppRunner.cpp, and that code runs very early at startup, so whatever change affects glxtest in any way is fairly likely to be found in nsAppRunner.cpp. If we hg annotate toolkit/xre/nsAppRunner.cpp, we see that between the XRE_main() entry point and the place where glxtest is invoked in XRE_mainInit(), the only thing that changed recently is this:

$ hg ann -c -d toolkit/xre/nsAppRunner.cpp

0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: XREMain::XRE_main(int argc, char* argv[], const nsXREAppData* aAppData)
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: {
383bed640c7b Tue May 28 14:03:38 2013 +0200:   char aLocal;
383bed640c7b Tue May 28 14:03:38 2013 +0200:   GeckoProfilerInitRAII profilerGuard(&aLocal);
7b508d11a791 Sat Mar 16 00:47:02 2013 -0400:   PROFILER_LABEL("Startup", "XRE_Main");
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   nsresult rv = NS_OK;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   gArgc = argc;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   gArgv = argv;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   NS_ENSURE_TRUE(aAppData, 2);
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   mAppData = new ScopedAppData(aAppData);
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   if (!mAppData)
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:     return 1;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   // used throughout this file
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   gAppData = mAppData;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   ScopedLogging log;
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
88d69928e3b1 Sat Jun 16 18:06:10 2012 +1200: #if defined(MOZ_WIDGET_GTK)
b07ab17dde6c Tue Feb 11 09:22:46 2014 -0500: #if defined(MOZ_MEMORY) || defined(__FreeBSD__) || defined(__NetBSD__)
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   // Disable the slice allocator, since jemalloc already uses similar layout
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   // algorithms, and using a sub-allocator tends to increase fragmentation.
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   // This must be done before g_thread_init() is called.
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   g_slice_set_config(G_SLICE_CONFIG_ALWAYS_MALLOC, 1);
621becf19fe6 Fri Jun 20 10:34:42 2008 -0700: #endif
01fa49869175 Thu Oct 10 16:36:42 2013 -0400:   g_thread_init(nullptr);
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500: #endif
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   // init
0089fd403e76 Tue Mar 27 08:37:35 2012 -0500:   bool exit = false;
50027b0a1ed6 Wed Mar 27 11:19:10 2013 -0400:   int result = XRE_mainInit(&exit);

See changeset b07ab17dde6c from February 11 here? I would look there. From reading configure.in it does seem that MOZ_MEMORY is defined on linux by default, so it is taking effect here.

$ hg log -v -r b07ab17dde6c
changeset:   168067:b07ab17dde6c
user:        Jan Beich <jbeich@tormail.org>
date:        Tue Feb 11 09:22:46 2014 -0500
files:       toolkit/xre/nsAppRunner.cpp
description:
Bug 969759 - Remove NetBSD version check for unsupported releases. r=glandium
In case you want to try it... I totally agree that continuing to bisect is the 'right' thing to do here, and I don't want to distract you from that.
Attachment #8380059 - Attachment is patch: true
Oops, sorry about the distraction, I only looked at the diff now, this can't take effect outside of NetBSD... sorry!
End of my bisect log:

[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168720:673bbf3b807f (19 changesets remaining, ~4 tests)
19 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168725:5b69776cb061 (10 changesets remaining, ~3 tests)
11 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168727:24548c52a802 (5 changesets remaining, ~2 tests)
3 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -g
Testing changeset 168728:ae11ce69621f (3 changesets remaining, ~1 tests)
8 files updated, 0 files merged, 0 files removed, 0 files unresolved
[fred@fredo-arch src]$ hg bisect -b
The first bad revision is:
changeset:   168728:ae11ce69621f
user:        Michael Wu <mwu@mozilla.com>
date:        Wed Jan 22 08:23:27 2014 -0500
summary:     Bug 748598 - Implement VolatileBuffer for OSX, Ashmem, and Windows, r=njn,glandium

What is going on?!
This is really surprising as this changeset only adds new code that isn't used just yet... have you double-checked it by hg up <changeset> and rebuilding, for both this changeset and the parent changeset?
Yes I did, of course. My computer spent 12 hours doing hg bisect. I'm fed up and would like this bug to be closed whichever way you want to close it! I think everything has been done! Sorry for being upset, but I can't stand this bug anymore.
Or at least give me workaround code to test. For more than a week I have not been able to launch any homemade or official nightlies without seeing them hang for one minute, because I choose to use NouVeau for my nvidia card. Thanks for any answer. It will be a painful bug for any NouVeau user when the trunk code gets stabilized.
Sorry if I missed it, but have you tried the tentative workaround patches from comment 71 and comment 72?
Glandium, see the bisection result in comment 87, can you make sense of it? The symptom we're observing here is that some code in nouveau_dri.so, that we call into from the glxtest, seems to go crazy since bug 748598 landed.
Flags: needinfo?(mh+mozilla)
Blocks: 748598
No longer blocks: 972397
(In reply to Benoit Jacob [:bjacob] from comment #91)
> Sorry if I missed it, but have you tried the tentative workaround patches
> from comment 71 and comment 72?

No. Will try them ASAP.
In fact, we have another avenue that is quite sure to work. In the main Firefox process, maybe we should, after, say, 1 second of waiting, decide to stop waiting and just kill the glxtest process.
Or better yet, a timebomb thread in the glxtest process that will terminate it after 1 second no matter what.
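For illustration only (the fix that eventually landed took a different route), here is a minimal sketch of such a timebomb in the child process, using alarm() instead of a separate thread; the function names and the deadline are made up.

#include <unistd.h>
#include <csignal>
#include <cstdlib>

// Hard-exit the glxtest-like child: the parent then sees EOF on its pipe and
// can fall back to a "no GL info" path instead of blocking in read() forever.
static void GlxtestTimebomb(int)
{
  _exit(EXIT_FAILURE);
}

static void ArmGlxtestTimebomb(unsigned aSeconds)
{
  // SIGALRM gets delivered even if the process is spinning inside the
  // driver's teardown code, so the deadline holds wherever the child is stuck.
  signal(SIGALRM, GlxtestTimebomb);
  alarm(aSeconds);
}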
(In reply to Benoit Jacob [:bjacob] from comment #91)
> Sorry if I missed it, but have you tried the tentative workaround patches
> from comment 71 and comment 72?

The patch in comment 71? It works great. Will try the comment 72 patch now.
The patch in comment #72 works too. So, what now?
Great! That comment 71 works is great news, as we could totally decide to land that and fix the problem that you're experiencing.

However, there remains a question. Comment 71 only removes one XCloseDisplay call; there are many other XCloseDisplay calls in the Gecko codebase. That comment 71 works proves that there is a specific problem in XCloseDisplay on your machine, making it sometimes take 100% CPU for 1 minute... landing comment 71 leaves open the possibility that the other XCloseDisplay calls in Gecko trigger the same bug in your drivers, which would be just as bad for you.

I guess that we should land comment 71, and cross fingers: maybe the other XCloseDisplay calls in Gecko won't trigger this driver bug. But we should be on our guard and react quickly if we get bug reports that they in fact do.
Wow, look at this: some of our XCloseDisplay calls are already done only #ifdef NS_FREE_PERMANENT_DATA, because "closing the display is pretty scary"! See this: http://hg.mozilla.org/mozilla-central/file/2d5d5fd03050/dom/plugins/ipc/PluginModuleChild.cpp#l569 We definitely want to do the same, as Karl suggested above, then.
Frederic, we really appreciate your help with getting this sorted out. We know it's been frustrating for you; it's also frustrating to us to try to debug and understand a bug we can't reproduce... And the issue is that we'd really like to understand it, not just work around, if we can manage, because without an understanding of what's going on we have no guarantee the workaround will work for other NouVeau users. :(
(In reply to Boris Zbarsky [:bz] (reviews will be slow; ask someone else) from comment #101)
> Frederic, we really appreciate your help with getting this sorted out. We
> know it's been frustrating for you; it's also frustrating to us to try to
> debug and understand a bug we can't reproduce... And the issue is that we'd
> really like to understand it, not just work around, if we can manage,
> because without an understanding of what's going on we have no guarantee the
> workaround will work for other NouVeau users. :(

I can easily understand that you want to "fix" the bug, not work around it. It could be a NouVeau bug, triggered by mesa 10.0.x and your code. If somebody using NouVeau with Mesa 10.0.x can confirm this bug report, it would be great.
Comment 10 and comment 87 are non-intersecting regression windows, so either 2 different changes have triggered this, or the hang is intermittent. Also the same code doesn't hang when not forked from firefox. Sounds like memory layout is relevant, possibly even jemalloc, quite likely an uninitialized variable. I didn't find a stack for the hanging process.
Comment on attachment 8380230 [details] [diff] [review]
Workaround: call XSync instead of XCloseDisplay unless NS_FREE_PERMANENT_DATA is defined

The only likely harm here I think is that any gl library that would hang here may hang on application shutdown, but I guess that's better than hanging on startup.
Attachment #8380230 - Flags: review?(karlt) → review+
(In reply to Karl Tomlinson (:karlt) from comment #103)
> Sounds like memory layout is relevant, possibly even jemalloc, quite likely
> an uninitialized variable.

Indeed, that is one of the least unreasonable explanations. Note that the regressing changeset only adds code that is not even called, supporting your memory layout hypothesis. Valgrind would be a good tool to investigate this.... That's for trying to understand the root cause, and how it manifested itself as a regression.

If we now talk of the different topic of how to avoid running into this system/drivers bug elsewhere in Gecko, there is something easy and concrete that we could do: we could have a function, XCloseDisplay_workaround(), that does either XCloseDisplay or XSync like in the patch here, and we could go over all current callers of XCloseDisplay in our codebase and make them call XCloseDisplay_workaround() instead. At least http://hg.mozilla.org/mozilla-central/file/2d5d5fd03050/dom/plugins/ipc/PluginModuleChild.cpp#l569 suggests that there have been issues with XCloseDisplay before, so this could be worth it?
Flags: needinfo?(karlt)
Probably, in most situations where XCloseDisplay is called and need not be called in NS_FREE_PERMANENT_DATA builds, XSync is not necessary. I suspect PluginModuleChild.cpp#l569 was afraid of issues, rather than working around any particular known issue. I think we can leave the other callers until there are known problems.
Flags: needinfo?(karlt)
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Thanks for committing the workaround. Too bad we won't find the real culprit here :(
Thanks for reporting this bug and all the work on it. If you really really feel motivated (still), I can walk you through valgrinding this, so we can find if there is some uninitialized-value problem here.
Flags: needinfo?(mh+mozilla)
Look at this, bug 975512 comment 1 has a stack with symbols!

#0  0x0000003f3a2da007 in sched_yield () from /lib64/libc.so.6
#1  0x00007f1e003bb1d1 in nouveau_fence_wait () from /usr/lib64/dri/nouveau_dri.so
#2  0x00007f1e0039d635 in nvc0_screen_destroy () from /usr/lib64/dri/nouveau_dri.so
#3  0x00007f1e002c6d9f in dri_destroy_screen_helper () from /usr/lib64/dri/nouveau_dri.so
#4  0x00007f1e002c6e25 in dri_destroy_screen () from /usr/lib64/dri/nouveau_dri.so
#5  0x00007f1e000087eb in driDestroyScreen () from /usr/lib64/dri/nouveau_dri.so
#6  0x000000321f646932 in dri2DestroyScreen () from /lib64/libGL.so.1
#7  0x000000321f61c526 in FreeScreenConfigs.isra.3 () from /lib64/libGL.so.1
#8  0x000000321f61c5a5 in glx_display_free () from /lib64/libGL.so.1
#9  0x000000321f61c6a5 in __glXCloseDisplay () from /lib64/libGL.so.1
#10 0x0000003f3e21fd22 in XCloseDisplay () from /lib64/libX11.so.6
#11 0x00007f1e09933dcf in glxtest() () from /home/sstangl/bin/nightly/libxul.so
#12 0x00007f1e09934b36 in fire_glxtest_process() () from /home/sstangl/bin/nightly/libxul.so
#13 0x00007f1e0992e655 in XREMain::XRE_mainInit(bool*) () from /home/sstangl/bin/nightly/libxul.so
#14 0x00007f1e09932886 in XREMain::XRE_main(int, char**, nsXREAppData const*) () from /home/sstangl/bin/nightly/libxul.so
#15 0x00007f1e09932b94 in XRE_main () from /home/sstangl/bin/nightly/libxul.so
#16 0x0000000000406b66 in do_main(int, char**, nsIFile*) ()
#17 0x000000000040383b in main ()
(In reply to Benoit Jacob [:bjacob] from comment #112)
> Look at this, bug 975512 comment 1 has a stack with symbols!

The bug still reproduces for me with the 2014-02-25 Linux x86_64 Nightly, but with a different stack that doesn't include glxtest(). It looks like about half of the problem is solved, because startup isn't quite as bad as it used to be with the 2014-02-24 Nightly.

#0  0x0000003f3a2da007 in sched_yield () from /lib64/libc.so.6
#1  0x00007f1dd70b91d1 in nouveau_fence_wait () from /usr/lib64/dri/nouveau_dri.so
#2  0x00007f1dd709b635 in nvc0_screen_destroy () from /usr/lib64/dri/nouveau_dri.so
#3  0x00007f1dd6fc4d9f in dri_destroy_screen_helper () from /usr/lib64/dri/nouveau_dri.so
#4  0x00007f1dd6fc4e25 in dri_destroy_screen () from /usr/lib64/dri/nouveau_dri.so
#5  0x00007f1dd6d067eb in driDestroyScreen () from /usr/lib64/dri/nouveau_dri.so
#6  0x000000321f646932 in dri2DestroyScreen () from /lib64/libGL.so.1
#7  0x000000321f61c526 in FreeScreenConfigs.isra.3 () from /lib64/libGL.so.1
#8  0x000000321f61c5a5 in glx_display_free () from /lib64/libGL.so.1
#9  0x000000321f61c6a5 in __glXCloseDisplay () from /lib64/libGL.so.1
#10 0x0000003f3e21fd22 in XCloseDisplay () from /lib64/libX11.so.6
#11 0x0000003221651086 in gdk_display_x11_finalize () from /lib64/libgdk-x11-2.0.so.0
#12 0x0000003f3d614fcb in g_object_unref () from /lib64/libgobject-2.0.so.0
#13 0x00007f1e0992bde8 in MOZ_gdk_display_close(_GdkDisplay*) () from /home/sstangl/bin/nightly/libxul.so
#14 0x00007f1e099329ef in XREMain::XRE_main(int, char**, nsXREAppData const*) () from /home/sstangl/bin/nightly/libxul.so
#15 0x00007f1e09932b94 in XRE_main () from /home/sstangl/bin/nightly/libxul.so
#16 0x0000000000406b66 in do_main(int, char**, nsIFile*) ()
#17 0x000000000040383b in main ()
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Karl, comment 113 means that the same bug is also triggered by the MOZ_gdk_display_close calls made directly by XRE_main(). How can we work around that?
Flags: needinfo?(karlt)
(In reply to Benoit Jacob [:bjacob] from comment #114)

Looks like MOZ_gdk_display_close can be a no-op unless CLEANUP_MEMORY.
Flags: needinfo?(karlt)
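A rough sketch of what that no-op could look like, assuming the CLEANUP_MEMORY macro and GDK2's gdk_display_close(); the real function has more logic than this, so treat it only as an illustration of the idea, not the landed patch.

#include <gdk/gdk.h>

static void MOZ_gdk_display_close(GdkDisplay* aDisplay)
{
#ifdef CLEANUP_MEMORY
  // Cleanup/leak-checking builds: really tear the display down, which
  // eventually reaches XCloseDisplay() inside the GLX code.
  gdk_display_close(aDisplay);
#else
  // Other builds: deliberately do nothing, so the nouveau teardown path
  // that hangs in XCloseDisplay() is never entered at shutdown.
  (void) aDisplay;
#endif
}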
Going to track this for now based on the reproduction reported in Comment 113.
For the record, I only was "assigned" this insofar as I was landing a patch there that was hopefully fixing this. I'm not actually responsible for working around all the ways that this hang can happen ;-)
Assignee: bjacob → nobody
(In reply to Benoit Jacob [:bjacob] from comment #118)
> Created attachment 8384737 [details] [diff] [review]
> Workaround: do nothing in MOZ_gdk_display_close

I verified locally that this patch resolves the issue.
Comment on attachment 8384737 [details] [diff] [review]
Workaround: do nothing in MOZ_gdk_display_close

Surely we don't need to silence unused parameter warnings? The very nature of virtual functions, for example, means that a parameter will not always be used.
Attachment #8384737 - Flags: review?(karlt) → review+
Not sure if that is your question, but compilers do generate warnings about unused function parameters; when a function does not use a parameter's value, the usual recommendation is to declare the parameter type only, omitting its name, like void myfunction(int), which is probably what I'd do for a virtual override not using a parameter; the problem here, of course, is that used-ness depends on #if branches.
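A tiny illustration of the two approaches described above (all names are made up for the example): omitting the parameter name works when the parameter is never used, while the (void) cast is the practical option when use depends on #if branches.

#include <cstdio>

// Hypothetical helper so the example compiles.
void DoCleanup(int aFlags) { std::printf("cleanup %d\n", aFlags); }

// Parameter never used: declare the type only, and no warning is emitted.
void OverrideNotUsingParam(int /* aFlags */)
{
}

// Parameter used only in some #if branches: keep the name, and mark it as
// deliberately ignored in the branch that does not use it.
void MaybeUsingParam(int aFlags)
{
#ifdef CLEANUP_MEMORY
  DoCleanup(aFlags);
#else
  (void) aFlags;
#endif
}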
Assignee: nobody → bjacob
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Nope, I still enjoy life, I don't want to be the assignee for this bug :-)
Assignee: bjacob → nobody
Just curious. The patch produced this line (or maybe it was there before).

http://hg.mozilla.org/mozilla-central/annotate/441f5fd256e2/toolkit/xre/nsAppRunner.cpp#l2669

     1.69 +#else // not CLEANUP_MEMORY
     1.70 +  // Don't do anything to avoid running into driver bugs under XCloseDisplay().
     1.71 +  // See bug 973192.
 --> 1.72 +  (void) display;

But the marked line above is a no-op. It should be eliminated completely, no? TIA
The #else here refers to this #ifdef that was added in the same patch: http://hg.mozilla.org/mozilla-central/annotate/441f5fd256e2/toolkit/xre/nsAppRunner.cpp#l2599 We made this entire function be a no-op.
(In reply to Benoit Jacob [:bjacob] from comment #126)
> The #else here refers to this #ifdef that was added in the same patch:
>
> http://hg.mozilla.org/mozilla-central/annotate/441f5fd256e2/toolkit/xre/nsAppRunner.cpp#l2599
>
> We made this entire function be a no-op.

Thank you.

[OK, there may lie a problem. I may have ended up with a few days old comm-central, which may not have had all the patches that have been referred to. comm-central can lag behind m-c sometimes. I will do "python client.py checkout" again and see how it goes.]

My comment below is based on a few days old comm-central source.

I see, but in a FULL DEBUG build of TB (comm-central) of a few days ago, I got a crash due to a call of FcFini() in the said function at the line http://hg.mozilla.org/mozilla-central/annotate/441f5fd256e2/toolkit/xre/nsAppRunner.cpp#l2660 and while looking around, I noticed this odd line:

  (void) display;

[This particular line does not seem to do anything?] Well, if it is known to be safe, fine, but maybe it can be removed completely was my suggestion. TIA
This (void) display line is a no-op. Its purpose is to silence a compiler warning about this function parameter being unused.
(In reply to Benoit Jacob [:bjacob] from comment #128)
> This (void) display line is a no-op. Its purpose is to silence a compiler
> warning about this function parameter being unused.

Now finally I see the light :-)

[Anyone wanting to put this info in the comment line?]
(In reply to ISHIKAWA, Chiaki from comment #129)
> [Anyone wanting to put this info in the comment line?]

Better to silence the useless compiler warning and remove the confusing code.
Blocks: 993131