Closed Bug 323853 Opened 19 years ago Closed 19 years ago

JB_BP not exported by newer glibc (2.4)

Categories

(Core :: XPCOM, defect)

x86
Linux
defect
Not set
major

Tracking

()

RESOLVED FIXED

People

(Reporter: wolfiR, Assigned: dbaron)

References

Details

(4 keywords, Whiteboard: [nvn-dl])

Attachments

(5 files, 5 obsolete files)

4.78 KB, patch
Details | Diff | Splinter Review
5.29 KB, patch
caillon
: review+
shaver
: superreview+
Details | Diff | Splinter Review
3.89 KB, patch
Details | Diff | Splinter Review
4.58 KB, patch
brendan
: review+
Details | Diff | Splinter Review
3.29 KB, patch
bryner
: review+
bryner
: superreview+
Details | Diff | Splinter Review
JB_BP is not exported anymore by glibc 2.4pre.
So Linux i386 and PPC are affected.
I'm testing a patch to use backtrace() glibc functions right now and will attach it if seems to work.
glibc documentation about backtrace() stuff:
http://www.delorie.com/gnu/docs/glibc/libc_665.html
Attached patch patch (part 1) (obsolete) — Splinter Review
This patch uses the backtrace() glibc functions if glibc 2.3 or newer is used.
It's mandatory only for glibc 2.4. But using the backtrace functions enables more platforms to be able to dump a stacktrace.
(This has to be activated at caller side, too. Therefore this is "part 1")
Attached patch patch (complete) (obsolete) — Splinter Review
This patch adds a dummy function for Linux platforms where stacktrace isn't implemented and changes the callers accordingly. To make the logic for the callers simpler.
Attachment #208843 - Attachment is obsolete: true
Attachment #208878 - Flags: review?(benjamin)
Attachment #208878 - Flags: review?(benjamin) → review+
Attachment #208878 - Flags: superreview?(shaver)
Comment on attachment 208878 [details] [diff] [review]
patch (complete)

> static char _progname[1024] = "huh?";
> 
>-#if defined(LINUX) && defined(DEBUG) && (defined(__i386) || defined(PPC))
>+#if defined(LINUX) && defined(DEBUG)
> #define CRAWL_STACK_ON_SIGSEGV
> #endif

Doesn't this #if also need the glibc version check?  Otherwise older-glibc Linuxes that aren't on PPC or x86 will define CRAWL_STACK_ON_SIGSEGV but have no way to do it.  Likely a build-killer there, and I don't know what glibc version is seen on various ARM Linux setups, etc.
Attachment #208878 - Flags: superreview?(shaver) → superreview-
(In reply to comment #4)
> (From update of attachment 208878 [details] [diff] [review] [edit])
> > static char _progname[1024] = "huh?";
> > 
> >-#if defined(LINUX) && defined(DEBUG) && (defined(__i386) || defined(PPC))
> >+#if defined(LINUX) && defined(DEBUG)
> > #define CRAWL_STACK_ON_SIGSEGV
> > #endif
> 
> Doesn't this #if also need the glibc version check?  Otherwise older-glibc
> Linuxes that aren't on PPC or x86 will define CRAWL_STACK_ON_SIGSEGV but have
> no way to do it.  Likely a build-killer there, and I don't know what glibc
> version is seen on various ARM Linux setups, etc.

hmm, basically right, but there was no glibc check here before.
So it should just output "not implemented" instead of the stack.
That's one of the reasons for the not implemented case in nsStackFrameUnix.cpp

Comment on attachment 208878 [details] [diff] [review]
patch (complete)

Ah, OK.  So people on those platforms will get a "write me, dammit" output when they crash, rather than a stack.  I can live with that, though it's likely to be a little confusing (and make people wonder at first if their crash is due to a call of an unimplemented methods, etc).

I can live with that, though.  Thanks for the clarification.
Attachment #208878 - Flags: superreview- → superreview+
Attached patch patch (slightly changed) (obsolete) — Splinter Review
This patch addresses the comments with minor tweaks:
- we could have called DumpStackToFile with non-glibc Linux systems (are there any?) without implementing the function on those systems
- changed wording of the not implemented case to
  "Info: Stacktrace not implemented for this Linux platform"
- define CRAWL_STACK_ON_SIGSEGV only if there is a better chance to have a useful
  stacktrace if it's a Linux glibc system
Checking in xpcom/base/nsStackFrameUnix.cpp;
/cvsroot/mozilla/xpcom/base/nsStackFrameUnix.cpp,v  <--  nsStackFrameUnix.cpp
new revision: 1.13; previous revision: 1.12
done
Checking in xpcom/base/nsTraceRefcntImpl.cpp;
/cvsroot/mozilla/xpcom/base/nsTraceRefcntImpl.cpp,v  <--  nsTraceRefcntImpl.cpp
new revision: 1.97; previous revision: 1.96
done
Checking in xpfe/bootstrap/nsSigHandlers.cpp;
/cvsroot/mozilla/xpfe/bootstrap/nsSigHandlers.cpp,v  <--  nsSigHandlers.cpp
new revision: 1.43; previous revision: 1.42
done
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
So... this changed the format of the data output by this code, no?  What happened to updating the tools that are used to post-process said data?  As things stand, refcount logging, for example, became nearly useless when this landed because the tree-balancer and so forth can't handle the data anymore.

I've backed this out of my tree locally so that I can continue to debug leaks, but we should have a better solution than that, since I presume sometimes this patch really is needed.

Would you like a separate bug on fixing things to work again, or should I reopen this one?
Also, doesn't that patch limit the stack to 20 deep?  Why was that done?  Why 20?  If nothing else, if I understand what backtrace() does, 20 is way too low to work -- we have stacks with far more frames on them quite routinely.
(In reply to comment #9)
> So... this changed the format of the data output by this code, no?  What
> happened to updating the tools that are used to post-process said data?  As
> things stand, refcount logging, for example, became nearly useless when this
> landed because the tree-balancer and so forth can't handle the data anymore.

Sorry, I wasn't aware of tools which analyze that data.
And I'm not sure if there is a way to get the old format back.
A workaround would be to change the glibc version check to only use the new
code if glibc is >= 2.4. (This would still break some platforms as current devel glibc release seems to be 2.3.90 and JB_BP is already removed.

(In reply to comment #10)
> Also, doesn't that patch limit the stack to 20 deep?  Why was that done?  Why
> 20?  If nothing else, if I understand what backtrace() does, 20 is way too low
> to work -- we have stacks with far more frames on them quite routinely.

I agree that 20 might be too low. But AFAIK we have to choose a fixed number.
What would you propose?
I think it's OK to reopen this bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Sounds like we need a configure test for JP_BP, so that only those cases which wouldn't work with the current (previous) code anyway get the new backtrace() behaviour.  (Or, better, someone could write another way to walk the stack without JP_BP so that we could preserve the output format and tool utility.)

If backtrace() makes us pick a maximum number of frames, it's definitely broken for our purposes and we should just revert and start over.
> Sorry, I wasn't aware of tools which analyze that data.

They're at least in tools/rb; I don't think we have other tools, but I could be wrong...

> And I'm not sure if there is a way to get the old format back.

That might be fine if we can fix the scripts to deal with both formats.

> I agree that 20 might be too low. But AFAIK we have to choose a fixed number.
> What would you propose?

The only way I can see of making backtrace() work for us is to pick a large fixed number and then do something like:

retry:
  entries = backtrace(array, size);
  if (entries == size) {
    // allocate new array with bigger size
    goto retry;
  }
  // call backtrace_symbols_fd

(except with better flow control than using goto, hopefully).

Any choice of fixed number without something like the above is going to probably cause breakage in some cases, unless we pick it big enough to guarantee that that many frames would case stack overflow...
I don't think we can allocate memory while we're doing backtraces, e.g., in the segfault handler.

OTOH I don't see why a limit of say 1000 would be any real constraint.
Attached patch Patch for Seamonkey 1.0 (obsolete) — Splinter Review
Comment on attachment 208878 [details] [diff] [review]
patch (complete)

We'll need this on the various branches.  My builds are failing with this error trying to deploy 1.5.0.1.  Patching my RPMs for now, but would be good to get this in the next set of releases.
Attachment #208878 - Flags: branch-1.8.1?
Attachment #208878 - Flags: approval1.8.0.2?
Attachment #208878 - Flags: approval1.7.13?
Attachment #208878 - Flags: approval-aviary1.0.8?
Comment on attachment 208878 [details] [diff] [review]
patch (complete)

Well, except this doesn't apply to the branches.  :-)

I'll post a backport in a few.
Attachment #208878 - Flags: branch-1.8.1?
Attachment #208878 - Flags: approval1.8.0.2?
Attachment #208878 - Flags: approval1.7.13?
Attachment #208878 - Flags: approval-aviary1.0.8?
Actually, bz raised a good point that this shouldn't even be being built for releases.  Filed bug 325537
No, I don't think we want that patch on the branches, because it breaks our memory-leak detection tools.  What we want on the branch is either a patch to use some alternate means of walking the stack so that we can preserve the format (best), or a configure test to only use backtrace() when JB_BP is not available and the build would therefore fail.

We should back this patch out of the trunk shortly, if nobody is going to be able to commit to providing a compatibility-restoring fix, say by Saturday?
I agree that this should be backed out and should not be taken on branches.
I also agree for now. But anyway we need a solution soon, while I don't have one at the moment :-(
The upcoming distributions will use something like glibc 2.3.90 which doesn't export JB_BP.
timeless backed this out on trunk.
The "trivial" solution (much less invasive, anyway) to getting the frame pointer should be to use inline __asm__ rather than setjmp.
Attached patch patch to use inline assembler (obsolete) — Splinter Review
I tested that this works on x86; the bits of ppc and sparc (for fun!) assembly are completely untested.
Attachment #210815 - Flags: review?(mozilla)
Comment on attachment 210815 [details] [diff] [review]
patch to use inline assembler

The patch looks OK for certain circumstances. At least it should keep the functionality (when -fomit-frame-pointer is not used as compiler flag)

I'm about checking out if there are other options which keep the output but are more portable.
Please stay tuned.
Attachment #210815 - Flags: review?(mozilla) → review+
This patch uses backtrace() again but should keep the output format as it was.
There is still the problem that we have a fixed number (here 1024) of stack depth.
I'm fine with dbaron's patch as it fixes the compile error but looking at the other architectures it might be interesting to get this one ready.
Attachment #208878 - Attachment is obsolete: true
Attachment #209335 - Attachment is obsolete: true
Attachment #210230 - Attachment is obsolete: true
So, I had a talk with our toolchain folks who suggest using

__builtin_frame_address (0);

which has been around in gcc since the 2.x days.  We should also check to make sure though that we are not building with -fomit-frame-pointer because that will make walking the stack pretty difficult :-)

To get better portability, we should use glibc backtrace() since that will work on platforms where it is otherwise impossible to get to the unwind information, such as s390(x), x86-64 and ia64.

We should probably combine the two.  Use glibc backtrace() if available, else fall back to what we have now, using __builtin_frame_address () instead on the platforms with easy stack walking code (i386 and ppc).
Jakub Jelinek does not have a bugzilla account here, and asked me add this comment:

"If you are worrying about fixed buffer size, you can use some sane default. Say 100 or something like that, and if backtrace fills the whole array (returns 100), then retry with twice as big buffer."
I think __builtin_frame_address is better than inline assembler.  This patch also cleans up a few includes for stuff that's no longer in that file (these were copied to nsStackFrameUnix.cpp but never removed from the source) and fixes some warnings.
Attachment #210815 - Attachment is obsolete: true
Attachment #210915 - Flags: superreview?(shaver)
Attachment #210915 - Flags: review?(mozilla)
Attachment #210815 - Flags: superreview?(shaver)
Comment on attachment 210915 [details] [diff] [review]
use __builtin_frame_address

sr=shaver, but is __builtin_frame_address linux-only, or can we do this with gcc on any platforms?  (Should we be configure-testing for __builtin_frame_address instead?)
Attachment #210915 - Flags: superreview?(shaver) → superreview+
__builtin_frame_address is not platform specific.  Why should it?  Just think about it.  The compiler always has to know about the current location of the frame to access parameters, local variables, etc.  gcc and all compilers which claim to be gcc-compatible have this built-in.
I did "think about it", and I realized that I didn't know enough about whether all gcc-supported computation models had a compatible notion of a frame.  Sorry.
Comment on attachment 210915 [details] [diff] [review]
use __builtin_frame_address

I'll go ahead and mark r=caillon on this.  It gets the job done for me here, plus it seems to be the cleanest solution to the specific issue with JB_BP.  Ideally, we can get this working with backtrace() so other platforms, especially x86-64, can work with this code, but that might deserve a different bug anyway.
Attachment #210915 - Flags: review?(mozilla) → review+
Assignee: dougt → dbaron
Status: REOPENED → NEW
__builtin_frame_address isn't Linux-specific or architecture specific, but some of the other stuff that the ifdef-ed code does (especially the stack frame layout, having the saved PC one word mathematically above the frame pointer) could be.  I'd rather expand the ifdefs based on what's known to work.
Fix checked in to trunk.

Filed bug 326594 on using backtrace().
Status: NEW → RESOLVED
Closed: 19 years ago19 years ago
Resolution: --- → FIXED
Comment on attachment 210915 [details] [diff] [review]
use __builtin_frame_address

b181=shaver
Attachment #210915 - Flags: approval-branch-1.8.1?(shaver) → approval-branch-1.8.1+
Merge conflicts, simple backport.  We should get this on the older branches as well.
Attachment #211873 - Flags: approval1.7.13?
Attachment #211873 - Flags: approval-aviary1.0.8?
Fix checked in to MOZILLA_1_8_BRANCH.
Keywords: fixed1.8.1
I'm curious why anyone cares about updating any branch other than the latest release branch to build on new systems.  Any new distributions should be using the latest release branch, not using a previous release branch, right?  I'd expect the maintainance of old release branches to be for existing systems, none of which have this new a version of glibc.
Fedora Core 5 will ship with Mozilla 1.7.12
It would also be helpful to have our official binaries run on these newer glibcs.
OK, yikes.  (I'm not sure there's any runtime issue here, though, but I suppose there could be.)

So I think we should approve this.
Comment on attachment 211873 [details] [diff] [review]
1.7.x branch patch

approving for old branches assuming you're now on the hook to help us sanity check these releases on linux, a=dveditz for drivers
Attachment #211873 - Flags: approval1.7.13?
Attachment #211873 - Flags: approval1.7.13+
Attachment #211873 - Flags: approval-aviary1.0.8?
Attachment #211873 - Flags: approval-aviary1.0.8+
Checked caillon's 1.7.x branch patch into MOZILLA_1_7_BRANCH and AVIARY_1_0_1_20050124_BRANCH .
*** Bug 324786 has been marked as a duplicate of this bug. ***
Flags: blocking1.8.0.2+
Comment on attachment 210915 [details] [diff] [review]
use __builtin_frame_address

approved for 1.8.0 branch, a=dveditz
Attachment #210915 - Flags: approval1.8.0.2? → approval1.8.0.2+
Fix checked in to MOZILLA_1_8_0_BRANCH.
Keywords: fixed1.8.0.2
Whiteboard: [nvn-dl]
We have other debugging code with the same problem...
Attachment #215509 - Flags: review?(brendan)
Attachment #215509 - Flags: approval-branch-1.8.1?(brendan)
Attachment #215509 - Flags: review?(brendan)
Attachment #215509 - Flags: review+
Attachment #215509 - Flags: approval-branch-1.8.1?(brendan)
Attachment #215509 - Flags: approval-branch-1.8.1+
Comment on attachment 215509 [details] [diff] [review]
and apply the same fix elsewhere

Checked in to trunk.
I'm rebuilding with -fno-omit-frame-pointer (which I think is now needed if you want frame pointers reliably) to test that this really works...
Attachment #215705 - Flags: superreview?(bryner)
Attachment #215705 - Flags: review?(bryner)
Attachment #215705 - Flags: superreview?(bryner)
Attachment #215705 - Flags: superreview+
Attachment #215705 - Flags: review?(bryner)
Attachment #215705 - Flags: review+
So, FWIW, none of the stack trace tools work for me on FC5, even with -fno-omit-frame-pointer.  But I don't *think* this should break anything...
Attachment #215705 - Flags: approval-branch-1.8.1?(bryner)
Attachment #215705 - Flags: approval-branch-1.8.1?(bryner) → approval-branch-1.8.1+
And I tested on an RH 7.x system that the two give the same result (i.e., not a frame off, which would matter here more than elsewhere).
Comment on attachment 215705 [details] [diff] [review]
and fix jprof too

Checked in to trunk and MOZILLA_1_8_BRANCH.
Comment on attachment 215509 [details] [diff] [review]
and apply the same fix elsewhere

Checked in to MOZILLA_1_8_BRANCH.
So it turns out __builtin_frame_address(0) stopped working on Fedora Core 5, so none of this new code actually works anymore.  The problem is described in more detail in bug 331436.  Unless somebody comes up with a decent solution, I'll probably scrap it and replace it with inline assembler.
caillon: any new ideas from the toolchain people? :)
ok, it looks like the 7th patch in that bug makes it compile, if you go in and fix the rejects by hand
(In reply to comment #58)
> ok, it looks like the 7th patch in that bug makes it compile, if you go in and
> fix the rejects by hand
> 
Just a user comment here, Some IBM client software (Workplace WMC) wants you to build Mozilla 1.4.3 if you use a non-supported Linux version. It needs to be configured to use the the gtk2 library and the Xft font engine.  This bug keeps it from building on many versions.  Is there a fix for 1.4.3?
thanks,
fred
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: