Closed Bug 521750 Opened 15 years ago Closed 15 years ago

Put a runtime NS_IsMainThread check in nsCycleCollector::Suspect2 and Forget2

Categories

(Core :: XPCOM, defect, P1)

x86
Windows XP
defect

Tracking

()

RESOLVED FIXED
mozilla1.9.3a1
Tracking Status
status1.9.2 --- .2-fixed
blocking1.9.1 --- .5+
status1.9.1 --- .9-fixed

People

(Reporter: dbaron, Assigned: dbaron)

References

Details

(Whiteboard: [first fix landed in 1.9.2beta2 and 1.9.1.5][crashkill][crashkill-fix][crashkill-thirdparty])

Attachments

(7 files, 3 obsolete files)

11.08 KB, patch
Details | Diff | Splinter Review
7.93 KB, patch
peterv
: review+
Details | Diff | Splinter Review
6.17 KB, patch
dbaron
: review+
Details | Diff | Splinter Review
6.16 KB, patch
samuel.sidler+old
: approval1.9.1.5+
Details | Diff | Splinter Review
7.57 KB, patch
samuel.sidler+old
: approval1.9.1.5+
Details | Diff | Splinter Review
1.30 KB, patch
benjamin
: review+
samuel.sidler+old
: approval1.9.1.5+
samuel.sidler+old
: approval1.9.1.6+
Details | Diff | Splinter Review
1.25 KB, patch
benjamin
: review+
Details | Diff | Splinter Review
Right now nsCycleCollector::Suspect2 has an assertion that NS_IsMainThread() returns true.  If we changed this to a runtime test for all builds (which would return null if it's false), I think this would alleviate a substantial number of the crashes in Firefox 3.5.* related to violations of our threadsafety rules in extensions.

Benjamin, you said in http://groups.google.com/group/mozilla.dev.platform/browse_thread/thread/c61c764813ba11ab/5472241d613b1684 that you had ways to make NS_IsMainThread() faster; we are likely to need this in order to do this without hurting performance, although it might be worth checking what the current performance hit is.
Yeah, at least on platforms which support compiler-native thread-locals (which is all of our tier-1 platforms) we can just declare one of those and use inline C++ to test it, which is really really cheap and involves no function-call overhead at all. I'll try to put a patch together for that.
Summary: Put a runtime NS_IsMainThread check in nsCycleCollector::Suspect2 → Put a runtime NS_IsMainThread check in nsCycleCollector::Suspect2 and Forget2
Are the compiler-native thread locals actually faster, or just less typing?
They are faster. I believe they are a lot faster, but we'll obviously need to measure.
Compiles locally on Linux: running it through tryserver now.
Assignee: dbaron → benjamin
Status: NEW → ASSIGNED
Attachment #405870 - Flags: review?
Attachment #405870 - Flags: review? → review?(dbaron)
We don't want to fault (which disables collection and produces leaks that are probably just about as bad as crashing), we just want to return null (and potentially miss an object in the purple buffer).  The only case where we might want to fault is the case where nsCycleCollectingAutoRefCnt::decr calls Forget2.
We certainly want to put up an error console message. I see how skipping Suspect/Suspect2 isn't a big deal. But it seems to me that skipping Forget or Forget2 could end up causing CC to reference dead objects and should cause a fault. Especially if there were UI hooked up to the 'cycle-collector-fault' notification which could prompt the user to restart/uninstall extensions/whatever.
But the only dangerous case for Forget is the rarer call from decr rather than the more-common call from incr (especially given what these extensions tend to be doing, which is calling methods on existing objects).

But we can solve that by making the case in decr in which Forget fails null out e->mObject, and making sure that the cycle-collector handles that case (if it doesn't already).  (That won't catch the case of fixing this for external glue users being refcounted, but I don't think we've hit that problem.)
And, I agree we should put up at least a console message (if not more UI) when the cycle collector faults (maybe at the point where we skip the collection?).  But that should probably be a different bug.
Filed bug 521853 on Firefox UI for CC faults.
Note that I'd be ok with a fault in Forget, but I'd really prefer not to fault in Suspect, Forget2, or Suspect2; just return instead (and not in a macro).
I think this is what you meant. I've verified that the code in nsCycleCollector.cpp can deal with a null mObject.
Attachment #405870 - Attachment is obsolete: true
Attachment #405897 - Flags: review?(dbaron)
Attachment #405870 - Flags: review?(dbaron)
Mostly, except for a few things:

 * I don't think we should notify "cycle-collector-fault" if we didn't actually fault
 * Based on that, I think CC_MAINTHREAD_CHECK becomes much simpler (one line in each half of the ifdef)
 * Given *that*, CC_MAINTHREAD_FAULT shouldn't exist, and the Fault() call can be explicit in Forget (test CC_MAINTHREAD_CHECK, call Fault)


 * Forget2 shouldn't set e->mObject to null; that would leave a confused state in the forget-called-from-incr case (the more common one) and we'll start crashing the next time we reference-count the object after the next cycle collect (since it won't have gotten an unmarkPurple call); instead it should be decr that sets it to null when Forget2 fails, but incr's handling of Forget2 failure is already correct (and changing Forget2 would break that)


I didn't look closely at the TLS stuff yet.
CC_MAINTHREAD_CHECK should definitely at least issue some error-console notification, which is a side-effect of the current runnable.
(In reply to comment #13)
> CC_MAINTHREAD_CHECK should definitely at least issue some error-console
> notification, which is a side-effect of the current runnable.

I think we'd need to figure out how to do that without significant performance degradation; this could be happening a pretty large number of times.
Do you want me to update the patch per comment 12, or are you planning to?
Could you do it? I'm not sure exactly what you want in Forget2.
OK, I've written the changes I was thinking of, and now I'm testing them.  Current state is http://hg.mozilla.org/users/dbaron_mozilla.com/patches-clean/file/c3dfaf8dbbaa/cc-main-thread-check
When I tried compiling this in a 1.9.1 branch debug build, I got:

c:\builds\1.9.1\obj\firefox-debug-vc9\dist\include\xpcom\nsThreadUtils.h(104) :
error C2492: 'gTLSIsMainThread' : 'thread' data may not have dll interface
make[5]: *** [nsConsoleService.obj] Error 2
Yeah, apparently you can't have threadlocal dllexports (makes sense). We could make this enable-libxul-only, I suppose.
I separated out the faster-NS_IsMainThread part from the cycle collection part; here's just the latter, modified to do what I wanted.

I checked that, with the RelevantKnowledge spyware installed, I'm:
 * failing a bunch of the NS_IsMainThread checks
 * going through multiple cycle collections after that without crashing
 * at the end of nsPurpleBuffer::SelectPointers, mCount is 0 each time.
I pushed a baseline (06656ee2b5dc), the baseline plus attachment 406349 [details] [diff] [review], and the baseline plus attachment 406349 [details] [diff] [review] and the NS_IsMainThread performance improvement patch at http://hg.mozilla.org/users/dbaron_mozilla.com/patches-clean/file/fbf27e7fed33/faster-is-main-thread to try.  I got the following talos results:

both patches  e8b16838a0e1
just cc patch: 9b3f756f3927
baseline: 06656ee2b5dc

Linux try talos:
  twinopen:
    both: 112.74, 112.47
    just cc: 114.32, 113.81
    baseline: 111.53
  tdhtml:
    both: 736.65, 735.35
    just cc: 738.38, 737.94
    baseline: 735.24

MacOSX Darwin 9.0.0 try talos:
  twinopen:
    both: 320.11, 320.42
    just cc: 323.11, 323.89
    baseline: 320.42, 318.84
  tdhtml:
    both: 784.68, 789.79
    just cc: 793.15, 788.09
    baseline: 791.03, 788.53

WINNT 5.1 try talos appears to be some number of days behind

WINNT 6.0 try talos:
  twinopen:
    both: 122.32
    just cc: 126.37, 125.84
    baseline: 126.74, 126.11
  tdhtml:
    both: 715.12
    just cc: 714.18, 715.85
    baseline: 714.44, 714.24
Comment on attachment 406528 [details] [diff] [review]
the faster-NS_IsMainThread() half

r=dbaron on this half

nsThreadUtils gets compiled twice, right?  Once with MOZILLA_INTERNAL_API defined and once without (for the external glue library)?
Attachment #406528 - Flags: review+
Comment on attachment 406528 [details] [diff] [review]
the faster-NS_IsMainThread() half

Oh... except you need to do something about the bit that this doesn't compile in libxul builds.  I'm not sure what to do there.
Attachment #406349 - Flags: review?(peterv) → review+
Progress here?  This affects top crash killing efforts.  It's a big deal.
Whiteboard: [crashkill]
I'm going to put up a new patch for the TLS stuff tomorrow, hopefully. dbaron is doing the cycle-collector part.
This fixes the --disable-libxul parts by hiding the TLS inside xpcom_core.
Attachment #406528 - Attachment is obsolete: true
Assignee: benjamin → dbaron
I'll re-merge the two and run the pair through try server.
And I think I'm going to drop the |CheckMainThreadIfFast| and just make it use NS_IsMainThread directly.
(In reply to comment #29)
> And I think I'm going to drop the |CheckMainThreadIfFast| and just make it use
> NS_IsMainThread directly.

Actually, never mind.
http://hg.mozilla.org/mozilla-central/rev/24ce78dd533f
http://hg.mozilla.org/mozilla-central/rev/ea6f9b5338b3
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Priority: -- → P1
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.3a1
Flags: blocking1.9.2?
Flags: blocking1.9.2? → blocking1.9.2+
(In reply to comment #32)
> http://hg.mozilla.org/releases/mozilla-1.9.2/rev/62cbe17eeeb3
> http://hg.mozilla.org/releases/mozilla-1.9.2/rev/f9c274218b33

I've got following error during linking libxpcom_core.so:
nsThreadUtils.o(.text+0x21): In function `NS_IsMainThread_P()':
/home/users/firefox/mozhg/src/objdir-ff-debug/xpcom/build/nsThreadUtils.cpp:139: undefined reference to `___tls_get_addr'
../threads/libxpcomthreads_s.a(nsThreadManager.o)(.text+0xb15): In function `nsThreadManager::Init()':
/home/users/firefox/mozhg/src/xpcom/threads/nsThreadManager.cpp:110: undefined reference to `___tls_get_addr'
collect2: ld returned 1 exit status

I guess changes from this bug are the culprit...
(In reply to comment #33)
> (In reply to comment #32)
> > http://hg.mozilla.org/releases/mozilla-1.9.2/rev/62cbe17eeeb3
> > http://hg.mozilla.org/releases/mozilla-1.9.2/rev/f9c274218b33

Ooops, I cited links to changes in "mozilla-1.9.2", but I got this error during compiling trunk (I haven't tried compiling 1.9.2 branch).
GCC 4.3.3, locally installed (my system compiler is too old... OTOH I use system-wide ld, which comes from binutils 2.15.94.0.2.2...)
Pretty trivial merging; for the previous patch, just some context differences in comments; for this one, one (debugging-only) function it was patching didn't exist back on 1.9.1.
Attachment #409228 - Flags: approval1.9.1.5?
This should fix BartZ's error; this makes our autoconf test check that everything links.  Presumably __thread support requires some pieces from the compiler and some from something in binutils.
Attachment #409229 - Flags: review?(benjamin)
(In reply to comment #39)
> Created an attachment (id=409229) [details]
> use AC_TRY_LINK when checking for __thread
> 
> This should fix BartZ's error; this makes our autoconf test check that
> everything links.  Presumably __thread support requires some pieces from the
> compiler and some from something in binutils.

Even with this patch applied it doesn't work :-(. During configuration there is still
..
checking for __thread keyword for TLS variables... yes
..

and compilation ends with the same error as in comment 33.
Thanks for trying, I guess I have to compile newer binutils...
(In reply to comment #40)
> and compilation ends with the same error as in comment 33.
> Thanks for trying, I guess I have to compile newer binutils...

I compiled and locally installed binutils 2.20 and the error still persists (I checked with strace whether gcc really uses newer ld - it does (I put in my PATH my local bin dirs before system ones)).

Any ideas?
Attachment #409229 - Flags: review?(benjamin) → review+
(In reply to comment #40)
> (In reply to comment #39)
> > Created an attachment (id=409229) [details] [details]
> > use AC_TRY_LINK when checking for __thread
> > 
> > This should fix BartZ's error; this makes our autoconf test check that
> > everything links.  Presumably __thread support requires some pieces from the
> > compiler and some from something in binutils.
> 
> Even with this patch applied it doesn't work :-(. During configuration there is
> still

Did you build with client.mk (which will regenerate configure from configure.in), or did you just rerun configure?
Comment on attachment 409228 [details] [diff] [review]
cycle collector patch for 1.9.1 branch

Approved for 1.9.1.5, a=dveditz for release-drivers
Attachment #409228 - Flags: approval1.9.1.5? → approval1.9.1.5+
Attachment #409227 - Flags: approval1.9.1.5? → approval1.9.1.5+
Attachment #409229 - Flags: approval1.9.1.5? → approval1.9.1.5+
(In reply to comment #42)
> Did you build with client.mk (which will regenerate configure from
> configure.in), or did you just rerun configure?

I always start compilation with
make -f client.mk build
I never run configure script directly.
I even removed the whole objdir - the error still persists...
Could you experiment to figure out under what conditions your compiler can compile and link a program with __thread and under what conditions it can't?  Is it a C vs. C++ difference?  Something related to one of the command line options we pass?  Something else?
(In reply to comment #47)
> Could you experiment to figure out under what conditions your compiler can
> compile and link a program with __thread and under what conditions it can't? 
> Is it a C vs. C++ difference?  Something related to one of the command line
> options we pass?  Something else?

I haven't tested all possible combinations of options given during compilation, but I was playing with some of them for a while and...:

$ cat test.cpp 
__thread bool tlsIsMainThread = false;
int main() {
        return tlsIsMainThread;
}
$ g++-4.3.3 -o test test.cpp -shared -fPIC -Wl,-z,defs
/home/users/firefox/tmp/cc8h2amk.o: In function `main':
test.cpp:(.text+0x22): undefined reference to `___tls_get_addr'
collect2: ld returned 1 exit status

Removing one of the options (either "-shared" or "-fPIC" or "-Wl,-z,defs") results in not reporting the error during compilation anymore.

The same story with C version of the program:

$ cat test.c 
__thread int tlsIsMainThread = 0;
int main( void ) {
        return tlsIsMainThread;
}
$ gcc-4.3.3 -o test test.c -shared -fPIC -Wl,-z,defs
/home/users/firefox/tmp/ccQIxv7p.o: In function `main':
test.c:(.text+0x22): undefined reference to `___tls_get_addr'
collect2: ld returned 1 exit status

Again, removing one of the options (either "-shared" or "-fPIC" or "-Wl,-z,defs") results in not reporting the error during compilation anymore.

I read that support for "__thread" requires support from the following elements: (1) compiler; (2) linker; (3) system libraries. I locally installed (1) and (2), but glibc on this system is a little bit rusty (2.3.6). I guess this is the root cause of my problem, but upgrading glibc could be a little bit problematic and I'd prefer not to do this at this time...
(In reply to comment #48)
> I read that support for "__thread" requires support from the following
> elements: (1) compiler; (2) linker; (3) system libraries. I locally installed
> (1) and (2), but glibc on this system is a little bit rusty (2.3.6). I guess
> this is the root cause of my problem, but upgrading glibc could be a little bit
> problematic and I'd prefer not to do this at this time...

You can get a working build by locally patching the autoconf test to fail (e.g., by introducing a typo in it).
(In reply to comment #49)
> You can get a working build by locally patching the autoconf test to fail
> (e.g., by introducing a typo in it).

Sure, but why not to try compiling with the relevant options and check the result? Seems that options that resulted in the error are in vars DSO_PIC_CFLAGS and DSO_LDOPTS. This patch add these options to LDFLAGS right before compiling test code.
(In reply to comment #50)
> Created an attachment (id=409560) [details]
> configure.in: add some compilation options when detecting __thread support
 
Actually, I guess that checks for GNU_CC are kinda silly since __thread is GCC-specific, right?
Attachment #409227 - Flags: approval1.9.1.5+
Attachment #409228 - Flags: approval1.9.1.5+
Attachment #409229 - Flags: approval1.9.1.5+
All three of these patches need to land on GECKO1914_20091006_RELBRANCH for 1.9.1.5.
blocking1.9.1: --- → .5+
Nothing for QA to do here to verify this for 1.9.1.5.
(In reply to comment #54)
> Nothing for QA to do here to verify this for 1.9.1.5.

That's not entirely true; one thing that is probably worth doing is installing RelevantKnowledge spyware (see bug 521748) and checking that it at least hasn't caused any new crashes.
Whiteboard: [crashkill] → [crashkill][crashkill-fix]
Whiteboard: [crashkill][crashkill-fix] → [crashkill][crashkill-fix][crashkill-thirdparty]
How often does it crash with the spyware? That isn't clear in the other bug (nor are there any steps for triggering the crashes with the spyware).

QA can install the spyware but, for example, a lack of a crash won't prove much other than it didn't crash that time.
Lack of a crash will at least demonstrate that it didn't make things worse.
Well, I installed the software mentioned in bug 521748 but I'm not sure how a crash is supposed to be triggered. Nothing obvious happened but I wouldn't expect that.
Based on looking at the 3.5.5 crash data, it looks like this fixed the crashes at:
  nsCycleCollectingAutoRefCnt::decr(nsISupports*)
  nsEventListenerManager::Release()
  nsGlobalChromeWindow::Release()
  nsPresContext::Release()
and various other Release methods, but that it did NOT fix the crashes at:
  nsGlobalWindow::cycleCollection::UnmarkPurple(nsISupports*)
  nsCycleCollectingAutoRefCnt::unmarkPurple()
which I was also expecting it would fix.  I need to look into why that's the case.
I filed bug 527339 on figuring out the nsGlobalWindow::cycleCollection::UnmarkPurple(nsISupports*) crashes so that I don't make this bug too complicated.
Now that beta 5 has been built, can we get this re-landed on mozilla-1.9.2 and a follow up filed to have the problems that mobile was having?
(In reply to comment #51)
> (In reply to comment #50)
> > Created an attachment (id=409560) [details] [details]
> > configure.in: add some compilation options when detecting __thread support
> 
> Actually, I guess that checks for GNU_CC are kinda silly since __thread is
> GCC-specific, right?

OK, let's try to review and checkin this (or should I fill separate bug?).
Attached patch adds (via LDFLAGS) some actually used compilation options when compiling real code, so __thread support detection is more robust: without this patch compilation on my system fails (see comment #33), because test code is compiled without -shared -fPIC -Wl,-z,defs (and succeeds), but these options (among others) are used when compiling real code (and with these options linking code with __thread fails on my system, see comment #48).
Attachment #409560 - Attachment is obsolete: true
Attachment #411751 - Flags: review?(dbaron)
Attachment #411751 - Flags: review?(dbaron) → review?(benjamin)
Attachment #411751 - Flags: review?(benjamin) → review+
Comment on attachment 411751 [details] [diff] [review]
add some actually used compilation options when detecting __thread support

http://hg.mozilla.org/mozilla-central/rev/d6c911c1f46e
Depends on: 529216
Bug 549743 changed this to an abort :)
Blocks: 549743
Depends on: 543241
Comment on attachment 411751 [details] [diff] [review]
add some actually used compilation options when detecting __thread support


Any reason why this should not go on branches where previous patches went?
Attachment #411751 - Flags: approval1.9.2.2?
Attachment #411751 - Flags: approval1.9.1.9?
Should we take this despite comment 67 which says that the structure changed? If so, we'll need a roll up patch for each branch.
Attachment #411751 - Flags: approval1.9.2.2?
Attachment #411751 - Flags: approval1.9.2.2-
Attachment #411751 - Flags: approval1.9.1.9?
Attachment #411751 - Flags: approval1.9.1.9-
Comment 67 doesn't have anything to do with whether attachment 411751 [details] [diff] [review] should go on branches; it's describing something that we did (and should do) on trunk only.
Attachment #411751 - Flags: approval1.9.2.2?
Attachment #411751 - Flags: approval1.9.2.2-
Attachment #411751 - Flags: approval1.9.1.9?
Attachment #411751 - Flags: approval1.9.1.9-
Comment on attachment 411751 [details] [diff] [review]
add some actually used compilation options when detecting __thread support

a=beltzner, and I'll find someone to land these.
Attachment #411751 - Flags: approval1.9.2.2?
Attachment #411751 - Flags: approval1.9.2.2+
Attachment #411751 - Flags: approval1.9.1.9?
Attachment #411751 - Flags: approval1.9.1.9+
http://hg.mozilla.org/releases/mozilla-1.9.2/rev/54fc0a861260
http://hg.mozilla.org/releases/mozilla-1.9.1/rev/0d8bde830a1a

Not toggling any flags, since bug is already status1.9.2:beta2-fixed, as it should be.  (The follow-up patch probably should have had its own bug.)
It should have, but since the most recent fix went in, I've moved the flags over to that release. I think that sucks less.

Still: serge, next time, please file follow up bugs.
Whiteboard: [crashkill][crashkill-fix][crashkill-thirdparty] → [first fix landed in 1.9.2beta2 and 1.9.1.5][crashkill][crashkill-fix][crashkill-thirdparty]
No longer blocks: C192ConfSync
Blocks: 567949
Depends on: 586784
You need to log in before you can comment on or make changes to this bug.