Last Comment Bug 71718 - Carbonized Mac NSPR needs architectural work to run on dual-CPU machines
: Carbonized Mac NSPR needs architectural work to run on dual-CPU machines
Status: VERIFIED FIXED
:
Product: NSPR
Classification: Components
Component: NSPR (show other bugs)
: other
: PowerPC Mac OS X
: P1 critical with 7 votes (vote)
: 4.2
Assigned To: Simon Fraser
: larryh (gone)
Mentors:
: 90941 93720 93824 (view as bug list)
Depends on:
Blocks: 97866
  Show dependency treegraph
 
Reported: 2001-03-12 12:20 PST by Mike Pinkerton (not reading bugmail)
Modified: 2001-10-02 17:37 PDT (History)
24 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
diffs of nsprpb (8.35 KB, patch)
2001-06-12 16:08 PDT, Joe Francis
no flags Details | Diff | Review
nspr patch (30.48 KB, patch)
2001-06-17 17:39 PDT, Joe Francis
no flags Details | Diff | Review
work in progress patch. needs some cleanup. (21.28 KB, patch)
2001-09-12 13:41 PDT, Simon Fraser
no flags Details | Diff | Review
Final patch for review (11.25 KB, patch)
2001-09-12 19:14 PDT, Simon Fraser
no flags Details | Diff | Review
Really final patch (14.67 KB, patch)
2001-09-14 15:20 PDT, Simon Fraser
no flags Details | Diff | Review
New patch addressing wtc's email comments (13.48 KB, patch)
2001-09-14 17:35 PDT, Simon Fraser
no flags Details | Diff | Review
Super-final patch (16.17 KB, patch)
2001-09-18 12:09 PDT, Simon Fraser
no flags Details | Diff | Review
Patch with minor revisions (16.91 KB, patch)
2001-09-19 16:55 PDT, Simon Fraser
no flags Details | Diff | Review
Proposed patch, based on sfraser's last patch. (13.78 KB, patch)
2001-09-19 18:36 PDT, Wan-Teh Chang
no flags Details | Diff | Review
2001.09.18.20-trunk comm page load results (8.92 KB, text/html)
2001-09-20 16:33 PDT, sairuh (rarely reading bugmail)
no flags Details
two runs of page load results using 2001.09.18.20-trunk (22.59 KB, text/html)
2001-09-20 21:56 PDT, sairuh (rarely reading bugmail)
no flags Details
OS X perf results for Mail (4.96 KB, text/html)
2001-09-23 15:13 PDT, stephend@netscape.com (gone - use stephen.donner@gmail.com instead)
no flags Details
Patch to fix MPEnter/LeaveCriticalRegion on 10.0.x (6.33 KB, patch)
2001-09-24 15:11 PDT, Simon Fraser
no flags Details | Diff | Review
Tidied up MDEnter/Leave critical section code (9.82 KB, patch)
2001-09-25 11:37 PDT, Simon Fraser
sdagley: review+
Details | Diff | Review
Mail performance results, take 2 (10.1, this time) (5.98 KB, text/html)
2001-09-26 18:18 PDT, stephend@netscape.com (gone - use stephen.donner@gmail.com instead)
no flags Details
IM results of Performance testing on OSX. Everything is slower here also. (2.12 KB, text/html)
2001-09-27 13:09 PDT, scalkins
no flags Details

Description Mike Pinkerton (not reading bugmail) 2001-03-12 12:20:47 PST
Run the fizzilla build on a dual-cpu g4 (like the one in my cube) and the app 
stalls loading most pages. Here is the email from Apple (vlubet@apple.com):

Mike,

I looked at the code on macsockotpt.c and do not see any call to
OTEnterNotifier() and OTLeaveNotifier().

On Mac OS X, Open Transport notifiers, timer tasks and deferred task do not
run to completion. That means that code from the main event loop or another
thread may preempt code running from OT notifiers, OT timer tasks and OT
deferred tasks. The solution is to always protect your data by using either
atomic operations or by using OTEnterNotifier()/OTLeaveNotifier().

Bracketing your critical sections with OTEnterNotifier()/OTLeaveNotifier()
is usually the most convenient solution as this ensures that either the
notifier or the main event loop / cooperative thread is accessing common
data.

For example here is what I would do:

1369             do {
                     Boolean entered = OTEnterNotifier(endpoint);
1370                 fd->secret->md.read.thread = me;
1371                 fd->secret->md.readReady = PR_FALSE;                //
expect the worst   
1372                 result = OTRcv(endpoint, buf, bytesLeft, NULL);
1373                 if (fd->secret->nonblocking) {
1374                     fd->secret->md.readReady = (result !=
kOTNoDataErr);
                         if (entered)
                             OTLeaveNotifier(endpoint);
1375                     break;
1376                 } else {
1377                     if (result != kOTNoDataErr) {
1378                         // If we successfully read a blocking socket,
check for more data.
1379                         // According to IM:OT, we should be able to
rely on OTCountDataBytes
1380                         // to tell us whether there is a nonzero amount
of data pending.
1381                         size_t count;
1382                         OSErr tmpResult;
1383                         tmpResult = OTCountDataBytes(endpoint, &count);
1384                         fd->secret->md.readReady = ((tmpResult ==
kOTNoError) && (count > 0));
1385         
                             if (entered)
                                 OTLeaveNotifier(endpoint);
1386                         break;
1387                     }
1388 
1389                     // Blocking read, but no data available. Wait for
an event
1390                     // on this endpoint, and hope that we get a T_DATA
event.
                         if (entered)
                                 OTLeaveNotifier(endpoint);
1391                     WaitOnThisThread(me, PR_INTERVAL_NO_TIMEOUT);
1392                     result = me->md.osErrCode;
1393                     if (result != kOTNoError) // interrupted thread,
etc.
1394                         break;
1395 
1396                     // Prepare to loop back and try again
1397                     PrepareForAsyncCompletion(me, fd->secret->md.osfd);
1398                 }
1399             }
1400             // Retry read if we had to wait for data to show up.
1401             while(1);

It certainly behaves like our endpoints are being clobbered in a race-condition-
ish situation.
Comment 1 Steve Dagley 2001-04-17 21:26:45 PDT
Apple has shipped us a dual CPU G4 so we don't have to camp in pink's cube trying 
to debug this
Comment 2 Simon Fraser 2001-04-18 10:30:08 PDT
They have? Can I have it?  ;)
Comment 3 Steve Dagley 2001-04-22 18:29:33 PDT
sfraser has the dual CPU loaner Mac, now he has the bug too :-)
Comment 4 rubydoo123 2001-05-16 08:14:06 PDT
moving to 9.2
Comment 5 Steve Dagley 2001-05-16 08:29:59 PDT
I'll just point out that the dual-CPU loaner system is due back to Apple on 6/1 
so that, while this fix may not be critical for 0.9.1, it needs to land well 
before 0.9.2
Comment 6 benc 2001-05-23 12:44:22 PDT
mass move, v2.
qa to me.
Comment 7 Simon Fraser 2001-05-29 15:54:52 PDT
The Mac NSPR code has serious issues on Mac OS X, which are particularly visible
on machines with two processors.

The main issue here is that OT notifiers, async file I/O callbacks, and timer
tasks each use pthreads under the hood, and are thus interruptible. This breaks
assumptions that the mac code currently makes relating to the switching on and
off of interrupts, and lock consistency.

It would appear that the only good solution to this problem is to build NSPR as
a Mach-O binary, and have it use the Unix implementation. If we do that, we need
to ensure that the implementation permits clients to call any Carbon Toolbox
functions on threads, as they can now. It would be bad to move to pthreads, and
to find, suddenly, that bad things happen when trying to do, say, drawing, or
memory allocation, on non-main threads.
Comment 8 Wan-Teh Chang 2001-05-29 18:46:26 PDT
Steve, could you please work on this bug?  We already
have an implementation of NSPR for Mac OS X based on
pthreads.
Comment 9 Joe Francis 2001-05-30 11:50:39 PDT
wtc, can you tell me which nspr defines need to be lit to do a nspr for mac os x 
build?
Comment 10 Wan-Teh Chang 2001-05-30 17:35:14 PDT
The only way to build NSPR for Mac OS X is to use the Unix build
instructions, i.e., configure and make.

I don't know how to integrate this with Code Warrior.
Comment 11 Steve Dagley 2001-06-04 10:21:45 PDT
I spoke to Quinn at Apple's WWDC about calling into Mach-O from CFM and he 
pointed me to <http://developer.apple.com/samplecode/Sample_Code/
Runtime_Architecture/CallMachOFramework.htm>.  It's intended for calling 
frameworks but I'm hoping it can be adapted to general Mach-O libs.
Comment 12 Joe Francis 2001-06-05 03:08:51 PDT
Thanks.  As it happens I found that on my own.  And I got that much working... I 
can load the System framework at _MD_EARLY_INIT or whatever it's called.  But 
then I'm stuck.  I have no pthread headers that will work with CW.  In fact, just 
to get the first part working I had to copy over the Universal folder from CW7EA.  
There was needed stuff missing from the headers otherwise.  

I tried to handroll my own pthread declararions, because I need only a few calls.  
But unfortunately initializing a pthread mutex requires a *structure* as one of 
the params (what the heck were those posix folks thinking?) so it gets 
complicated to get it all correct.  I really need posix headers that work with 
CodeWarrior.

Or I could try porting nspr over to ProjectBuilder.  Then I could make a knock-
off nspr lib that had pthreads locks in the right places to get dual cpu macs 
happy.
Comment 13 Joe Francis 2001-06-05 03:16:15 PDT
oops.  Sorry Steve, I thought you were replying to a question I asked in another 
forum.  You are actually looking at the opposite solution from what I am 
thinking.  You want to make a mach-o nspr and clal it from cfm fizzilla, right?  
I want to make the current cfm nspr use pthread locks from the system framework 
make the INTS_OFF/INTS_ON stuff in nspr actually acquire a real lock that the 
underlying os will honor.

I'm worried about using real pthreads in mac nspr because I think you can only 
make most system calls from the main thread and I don't know that we can enforce 
that restriction on the current code base.  So I'm thinking keep the nspr threads 
and just enforce syncronization with the callbacks/timers/notifiers using a real 
lock.
Comment 14 Joe Francis 2001-06-05 06:27:40 PDT
ok, new idea.  make the INTS_ON/INTS_OFF stuff enter and leave a critical
region.  That way everything turning on/off nspr ints is competing on the same
critical region, acting just like a lock.  The reason this is good is that the
mac MP api supports this, and on mac os x it is just a wrapper for posix stuff,
so the underlying os threads running the callbacks/timers/notifiers will respect
this.  I'll wager the MP headers don't pose any problems for our build system,
so it should be easy to do.  i'll try it later today and see.
Comment 15 Mike Pinkerton (not reading bugmail) 2001-06-05 11:04:59 PDT
two thoughts:
- pro7 can build mach-o, so why aren't all the pthread headers there?
- aren't the correct pthread headers on the OSX developer install for you to
look at?
Comment 16 Simon Fraser 2001-06-05 11:19:34 PDT
Yes, you will be able to call MP stuff. This might be one way to go.
Comment 17 Mike Pinkerton (not reading bugmail) 2001-06-05 15:04:08 PDT
beta stopper
Comment 18 Joe Francis 2001-06-08 06:16:48 PDT
I've played around with this some.  My first attempt seems to just lock all the
threads up.  The chief difficulty is that _PR_INTSOFF retreives a value that
indicates whether the region is already entered.  But the MP api does not expose
any way to get that info for a critical region.  The result is that the nspr
code triggers more entries into the critical region than exits.

I think it may be possible to make _PR_INTSOFF & _PR_INTSON not care about that
value.  The number of places in the code that actually examine that value are
few, and are almost entirely in mac only code.  I think this value is only
needed because _PR_INTSOFF doesn't currently acquire a "real" lock of any kind. 

I'll play with this some more over the weekend.
Comment 19 gordon 2001-06-08 12:19:34 PDT
How about keeping a counter for the number of times _PR_INTSOFF has been called, 
and only enter/leave the critical section on 0/1 transitions, sort of like 
destructors and reference counting.  That should have similar semantics to the 
current usage of _PR_INTSOFF.

Or it may be even simpler than that.  You may just need to keep a flag indicating 
whether you're in the critical section, and return that flag.

If you're around this afternoon, I'd love to talk with you about it.
Comment 20 Mike Pinkerton (not reading bugmail) 2001-06-08 12:25:37 PDT
you would probably need a separate lock on this counter too, as you could be
pre-empted between checking the value and then acting on it causing a
race-condition.
Comment 21 Joe Francis 2001-06-11 14:02:42 PDT
re gordon: i was already trying that.
re mike: i was entering my critical section before using the counter.

there is some code around that saves the counter, and then enters the intsOff
state setting the counter to 1, and then restores the counter when it's done. 
This will have the effect of not turning ints back on if they weren't on before,
but it totally demolishes the concept of the counter being a reentrancy count.  
i think i can just not use a counter at all, and tweak the code in the handful
of places that use it.  Almost all are in mac only code, and most of it is to
detect when a notification might have been dropped becasue it came while ints
were off.  Under osx the notification wont be dropperd in that situation: if
someone else is in the critical section when the notification comes in, the
notification will block on the critical section, and then fire when it can enter.  
Comment 22 Joe Francis 2001-06-12 16:08:28 PDT
at request of macdev, attaching my patches.  they  DO NOT work.  But they 
show the lines I am thinking along.  
Comment 23 Joe Francis 2001-06-12 16:08:58 PDT
Created attachment 38164 [details] [diff] [review]
diffs of nsprpb
Comment 24 Joe Francis 2001-06-17 17:39:23 PDT
a second attempt.  not sure yet why this one doesn't work.
Comment 25 Joe Francis 2001-06-17 17:39:46 PDT
Created attachment 38876 [details] [diff] [review]
nspr patch
Comment 26 lchiang 2001-06-25 18:23:05 PDT
sdagley - are any of the machines which we'll get from Apple for testing a
dual-CPU machine?  If dual-CPU machine seems to have it's own set of problems,
we may need to test on that.
Comment 27 Simon Fraser 2001-06-26 11:27:10 PDT
We should *definately* test on dual-CPU machines, and it would be good for QA to 
invest in a few of these. Note that, currently, our builds are pretty much dead 
on such machines running Mac OS X because of this bug.
Comment 28 Steve Dagley 2001-06-28 15:54:37 PDT
lchiang, we aren't getting any dual CPUs in the new batch but Joe Francis has 
voluntereed his dual-cpu system for testing while he's on sabbatical next month
Comment 29 Marek Z. Jeziorek 2001-07-02 22:42:16 PDT
Steve, this is PDT+ bug. Tomorrow, on Tuesday, we'll try to build the first RTM
candidate. It would be good, if this could be resolved ASAP.
Comment 30 chris hofmann 2001-07-12 20:05:33 PDT
what going on with this bug?  getting close to the time where it may have to be
latered to future milestone/release if we are not there, or past there already...
Comment 31 Steve Dagley 2001-07-12 20:13:26 PDT
This bug is specific to Mac OS X and does not impact the 'Classic' builds for Mac 
OS 8.6/9.x.  As to where we're at with it, as an AOL employee I can't use the 
appropriate profanity in this public forum to describe it so I'll just say we 
don't expect to have a fix this month.
Comment 32 Boris Zbarsky [:bz] 2001-07-16 06:57:37 PDT
*** Bug 90941 has been marked as a duplicate of this bug. ***
Comment 33 Mike Pinkerton (not reading bugmail) 2001-07-16 09:23:37 PDT
we're not gonna have this fixed by beta, removing pdt+
Comment 34 Mike Pinkerton (not reading bugmail) 2001-08-06 01:06:07 PDT
*** Bug 93824 has been marked as a duplicate of this bug. ***
Comment 35 Mike Pinkerton (not reading bugmail) 2001-08-06 01:06:46 PDT
*** Bug 93720 has been marked as a duplicate of this bug. ***
Comment 36 Adam Masri 2001-08-13 11:14:59 PDT
The following Apple tech note may be helpful in resolving this bug, as well as
giving Fizzilla better OS X performance.

http://developer.apple.com/technotes/tn/tn2028.html

- Adam
Comment 37 Tiantian Kong 2001-08-17 12:15:52 PDT
Added keyword: nsenterprise+
Comment 38 Steve Dagley 2001-08-27 17:32:49 PDT
Simon is now working on this so reassigning bug
Comment 39 Simon Fraser 2001-09-12 13:41:28 PDT
Created attachment 49135 [details] [diff] [review]
work in progress patch. needs some cleanup.
Comment 40 Simon Fraser 2001-09-12 19:14:10 PDT
Created attachment 49168 [details] [diff] [review]
Final patch for review
Comment 41 Simon Fraser 2001-09-12 19:17:43 PDT
That last patch works. On single CPU machines, it's performance impact is below
the noise threshold of the page loading tests I used to measure it. It uses MP
critical sections to fix thread synchronization issues, blocking multiple MP
threads from running while NSPR interrupts are turned off.

The code is currently active if TARGET_CARBON is defined to be 1. It might be
necessary to detect the OS version using Gestalt if this affects Carbon under 9
performance (which is unknown).

I need some strong reviews, please.
Comment 42 Simon Fraser 2001-09-13 12:51:09 PDT
I have some further changes to the patch, to conditionalize the critical region 
code for Mac OS X only, and to add some explanatory comments. Please wait...
Comment 43 chris hofmann 2001-09-13 21:31:58 PDT
is this going to make it soon?  putting on the "+"
Comment 44 Simon Fraser 2001-09-14 15:20:29 PDT
Created attachment 49391 [details] [diff] [review]
Really final patch
Comment 45 Simon Fraser 2001-09-14 15:22:05 PDT
The latest patch does two new things:
1. Return non-Carbon builds to their previous behaviour (just some macro changes).
2. In carbon builds, conditionally uses critical sections only if we are running
on Mac OS X. So Carbon-on-classic builds are not penalized.
Comment 46 Simon Fraser 2001-09-14 17:35:16 PDT
Created attachment 49408 [details] [diff] [review]
New patch addressing wtc's email comments
Comment 47 Simon Fraser 2001-09-14 17:41:01 PDT
The new patch:
* Replaces the array of critical sections with just one (since I only used one),
  which now matches the single critical region entry count.
* Removes a bogus WalkTheStack prototype.
Comment 48 Wan-Teh Chang 2001-09-14 21:20:55 PDT
I reviewed Simon's "really final patch" (attachment 49391 [details] [diff] [review]).
I already sent him some suggested changes after a quick
look.  Then I stared at the patch for a good hour.  After
I internalized the patch, I found that Simon's solution is
actually very simple.  (It is all those macros that makes
the patch look complicated.) It is exactly what jfrancis
outlined in his comments back in June.  It surprises me that
the simple solution is enough to solve this problem.
Congratulations to all of you who worked on this bug!

I will go over the patch with Simon next week.  I just
wanted to point out that the patch also contains a fix
for bug 97866.  That bug is marked as depending on this
bug, but I think it is not related and its fix should
be reviewed and checked in independently.
Comment 49 lchiang 2001-09-17 10:52:14 PDT
(This looks close to a fix - great!  Who at Netscape has a dual-processor
machine besides kin?  Once this is fixed, we need to do some testing.)
Comment 50 Jaime Rodriguez, Jr. 2001-09-17 12:58:03 PDT
PDT+ per PDT. Let's get one.
Comment 51 benc 2001-09-18 00:23:43 PDT
Sorry for the somewhat basic quizzing but:

1- This bug discusses fixing the MacOS X builds only right? So it is completely
irrelevant to running MacOS X w/ classic and classic build.

2- What happened if you ran MacOS X builds on dual-processor systems before this
fix? (Boom?!). Did this affect dual processor systems in pre-X?

Need this info to figure out how much upgrading is needed for testing.
Comment 52 Mike Pinkerton (not reading bugmail) 2001-09-18 09:41:36 PDT
ben:

1) correct, this does not affect the classic bits (non-carbon) running on osx in 
the classic environment

2) pages would stall, and never finish loading. sometimes the app would crash 
when you quit after failing to load pages.
Comment 53 Simon Fraser 2001-09-18 12:09:31 PDT
Created attachment 49782 [details] [diff] [review]
Super-final patch
Comment 54 Simon Fraser 2001-09-18 12:14:01 PDT
The super-final patch:
* removes the async file I/O changes, which are now in a patch in bug 97866
* replaces separate intsoff/intson implemenations with a single, simplified
intsoff routine, and removes some macro code duplication.
* Uses intsoff/fastintson rather than setintsoff(1)/setintsoff(0) in a couple of
callback routines.
* Improves the comments in _MD_PauseCPU and DoneWaitingOnThisThread().

gordon, please review.
wtc, if you'll then bless this, I can check it in.
Comment 55 gordon 2001-09-18 18:04:16 PDT
r=gordon, unless you want to clean up _macos.h or add a wrapper to
DoneWaitingOnThisThread().

wtc, the #ifdefs in primpl.h give me headaches.  Are you sure they're correct? 
There seem to be redundant nesting of global and local thread cases.
Comment 56 Simon Fraser 2001-09-18 23:09:56 PDT
And now the bad news. Adding the critical regions on Mac OS X does have some 
perceptible impact on performance in some areas. Page loading is not affected, 
but some UI operations are (I noticed, for example, that loading the buddy list 
in a commercial build got slower), and the browser feels someone less responsive 
on X. It may be that parts of the code that rely a lot on PR_Lock/PR_Unlock, or 
atomic operations, are slowed down by this patch.
Comment 57 Jaime Rodriguez, Jr. 2001-09-19 10:07:55 PDT
This is not on Tian Tian Must Fix list for this release. Should we consider
minusing it, and changing the priority for this release?
Comment 58 Brian Nesse (gone) 2001-09-19 10:22:39 PDT
sdagley commenting here - If this isn't on tiantian's list it damn sure should
be - this is most DEFINITELY a stop ship bug.  Otherwise explain to Marty Fisher
hy we don't run on his new Mac :-)
Comment 59 Joe Francis 2001-09-19 14:15:43 PDT
Seconding Steve's comments.  As someone involved with the evolution of this bug 
fix I feel we are much better off with it than without it.  Both MacOS X and dual 
CPU macs are catching on pretty quickly.  Lets not leave those folks stranded.
Comment 60 Tiantian Kong 2001-09-19 14:56:54 PDT
Hi:
 This bug is the FIRST one listed on my stop ship bug list compiled yesterday.
Please check again the list on
http://bugscape.netscape.com/show_bug.cgi?id=9485. And it is the first bug on
the 8/21 list as well. 

Comment 61 Jaime Rodriguez, Jr. 2001-09-19 15:50:36 PDT
Check it into the trunk, and let's get some testing numbers (New Window,
Startup, etc.). Tian Tian and Hong to coordinate test efforts.
Comment 62 Simon Fraser 2001-09-19 16:55:55 PDT
Created attachment 49998 [details] [diff] [review]
Patch with minor revisions
Comment 63 Simon Fraser 2001-09-19 17:00:15 PDT
Chnages in this patch: 
* removed redundant duplicate of _MD_INTSOFF macro in _macos.h
* In code blocks like this:

    if (_PR_MD_GET_INTSOFF()) {
        thread->md.missedIONotify = PR_TRUE;
        cpu->u.missed[cpu->where] |= _PR_MISSED_IO;
        return;
    }

which are outside of the critical region, reversed the order of setting
missedIONotify and the _PR_MISSED_IO to mitigate the effects of preemption at
this point.

Note that the patch is a diff -b (ignoring space change), so spacing looks worse
than it really is.
Comment 64 gordon 2001-09-19 17:50:37 PDT
Comment on attachment 49998 [details] [diff] [review]
Patch with minor revisions

r=gordon
Comment 65 gordon 2001-09-19 17:53:04 PDT
I was unable to edit patch status to specify "has-review". Does that have
something to do with the Product field?
Comment 66 lchiang 2001-09-19 17:56:04 PDT
I will coordinate the testing for the perf numbers.  Results to be posted when 
they come in.  Per simon:

1.  Tests run on single processor CPU (we don't have dual handy)
2.  Run on today's trunk build vs. tomorrow's trunk build
3.  Run:  window open, startup, IM, prefs, mail performance tests
4.  Results posted back here as attachments

Thanks
Comment 67 Wan-Teh Chang 2001-09-19 18:36:02 PDT
Created attachment 50017 [details] [diff] [review]
Proposed patch, based on sfraser's last patch.
Comment 68 Wan-Teh Chang 2001-09-19 18:56:40 PDT
Most of the changes in my patch from Simon's last patch
are a matter of style.
- I make sure no lines are longer than 80 characters.
- I do not make whitespace changes.  This is to make the
  diffs cleaner.
- Whether tabs or spaces are used follows the local
  convention.  Again, this is to make the diffs easier
  to read.
- "Regions" and "REGIONS" are changed to "Region" and
  "REGION" because there is only one critical region.
- "TermCriticalRegions" is renamed "DestroyCriticalRegion".
- Some macro definitions were moved between primpl.h and
  _macos.h and the intermediate macro _MD_INTSOFF() was
  deleted.
- gUseCriticalRegion and gCriticalRegion are declared static
  because they are only used in macthr.c.

The real changes are:
- In _PR_INTSOFF(), I leave the critical region immediately
  so that the entry count is always 1.  This way, in
  _PR_FAST_INTSON() and _PR_INTSON(), I only need to leave
  the critical region once if we need to leave the critical
  region completely.  As a result, the global variable
  gCriticalRegionEntryCount can be deleted.
- In macio.c, AsyncIOCompletion(), _PR_INTSOFF() and
  _PR_FAST_INTSON() are not necessary because they are called
  in DoneWaitingOnThisThread().  I also moved the setting of
  thread->md.osErrCode before the if block.

Simon, please review my patch and test it.  If it doesn't
work and it takes too long to fix it, you can go ahead and
check in your patch on the NSPRPUB_PRE_4_2_CLIENT_BRANCH.
I will check email from home tonight.
Comment 69 Wan-Teh Chang 2001-09-19 19:00:52 PDT
The sentence
  In _PR_INTSOFF(), I leave the critical region immediately
  so that the entry count is always 1.
should read
  In _PR_INTSOFF(), I leave the critical region immediately
  if we are already in the critical region so that the entry
  count is always 1.

Comment 70 Simon Fraser 2001-09-19 20:00:29 PDT
wtc, your final patch does not work, alas. As-is, it deadlocks on launch. I 
noticed that this seems wrong:

 #define _PR_INTSOFF(_is) \
     PR_BEGIN_MACRO \
+        ENTER_CRITICAL_REGION(); \
         (_is) = _PR_MD_GET_INTSOFF(); \
         _PR_MD_SET_INTSOFF(1); \
+        if (_is) \
+            LEAVE_CRITICAL_REGION(); \
     PR_END_MACRO
 

and should probably be

         _PR_MD_SET_INTSOFF(1); \
+        if (_is == 0) \
+            LEAVE_CRITICAL_REGION(); \

but with that fixed, we get a slew of assertions about leaving non-existent 
critical sections.
Comment 71 Simon Fraser 2001-09-19 20:51:01 PDT
I checked my previous patch into the NSPRPUB_PRE_4_2_CLIENT_BRANCH at 8:45pm, so 
it will show up in trunk builds tomorrow (9/20).
Comment 72 sairuh (rarely reading bugmail) 2001-09-20 16:31:20 PDT
trunk testing [BEFORE checkin] using:

* 2001.09.18.20-trunk commercial
* on Mac OS 10.0.4, 450 Mhz G3, 384Mhz
* modern theme [note: separate profiles used for prefs and page load tests]

Preferences dialog open: 2.50sec
Switching from Navigator to Mail & Newsgroups panel: 0.70sec

will soon attach page load results [side note: crashed several times after page
load completed --see bug 100852].
Comment 73 sairuh (rarely reading bugmail) 2001-09-20 16:33:28 PDT
Created attachment 50169 [details]
2001.09.18.20-trunk comm page load results
Comment 74 sairuh (rarely reading bugmail) 2001-09-20 21:48:36 PDT
Comment on attachment 50169 [details]
2001.09.18.20-trunk comm page load results

re-ran the pre-checkin page  load tests, thus invalidating this run. will reattach new set.
Comment 75 sairuh (rarely reading bugmail) 2001-09-20 21:56:48 PDT
Created attachment 50209 [details]
two runs of page load results using 2001.09.18.20-trunk
Comment 76 sairuh (rarely reading bugmail) 2001-09-20 21:59:36 PDT
tried running the page load tests using the 2001.09.20.16-trunk bits: alas, it
would not complete. it hung [but not on the same page] each of the three times i
tried --even after rebooting the machine after the second hang. :(

will try to run the preferences perf tests. hopefully that will go more smoothly...
Comment 77 sairuh (rarely reading bugmail) 2001-09-20 22:17:16 PDT
alas, i'm blocked with the preferences testing as well. :(

i need to restart the app btwn each dialog-open timing [to get four timings and
report their average]. i was able to launch the app a first time with a new
profile...but with the second launch the browser window didn't even appear: only
got the menubar and the rainbow spinning cursor.

restarting the machine kinda "works around" the hang, but it's a rather harsh
workaround.

another note: simon mentioned that this might be a 10.0.4 issue --since he was
recently able to run the page load tests [same 2001.09.20.16-trunk build] on his
dual-cpu G4 [which has a later version of 10.x].
Comment 78 sairuh (rarely reading bugmail) 2001-09-20 23:06:10 PDT
recent email rec'd from todds@bigfoot.com:

"...I've been following this bug since I want to use mozilla on my dual 800 with
OS X 10.0.4.

"I just noticed your most recent comment at 22:17, and wanted to let you know
that I am experiencing the same spinning rainbow cursor, and the browser never
starts when using the 2001.09.20 build.

"Hope that helps in any way, let me know if I can help otherwise."
Comment 79 Simon Fraser 2001-09-20 23:22:00 PDT
Hold off on testing here folks. It looks like my changes causes deadlock on 
single and dual-CPU machines running 10.0.x builds. Damn. Running a new build 
now, more tomorrow.
Comment 80 Jussi-Pekka Mantere 2001-09-21 06:08:14 PDT
Running 10.1 would not affect the deadlock?
Comment 81 Jaime Rodriguez, Jr. 2001-09-21 15:52:02 PDT
Let's find out if this will ever work on 10.04? Simon is researching ...
Comment 82 Brian Bergstrand 2001-09-22 01:27:47 PDT
I think 10.0.x is a lost cause. I just got a new dual-800, and was having 
MAJOR stability problems with 10.0.4 4S10. The machine would randomly 
crash. I posted to the Darwin list, and was told 10.0.4 has major SMP 
issues and to wait for 10.1. Let's just say this was good advice.

See http://www.darwinfo.org/devlist.php3?number=11117 for the full 
message.
Comment 83 stephend@netscape.com (gone - use stephen.donner@gmail.com instead) 2001-09-23 15:13:20 PDT
Created attachment 50476 [details]
OS X perf results for Mail
Comment 84 lchiang 2001-09-24 10:25:38 PDT
Looking at Stephen's mail results for OS X for 9-18 vs. 9-21 builds, there is
some amount of slow down in the 9-21 build.
Comment 85 Steve Dagley 2001-09-24 10:50:49 PDT
Can we get those tests re-run with  Mac OS X 10.abouttobereleased?
Comment 86 Brice Ruth 2001-09-24 10:58:11 PDT
Do you need a dual CPU or does a single CPU work for the speed tests?  If a
single CPU works, lemme know exactly which files to grab for the tests to be
consistent.
Comment 87 lchiang 2001-09-24 13:41:22 PDT
Brice - thanks for the offer to help.  For the speed tests, here is one of them
(Mail).  All the test files and methodology are there. The important thing is to
use the same machine for the before/after builds with simon's fix.   A single
CPU is fine for this speed test.   Do you have version 10.1 of the OS?   If so,
that would be great.

sdagley - can you give me a CD with the updated OS?
Comment 88 Brice Ruth 2001-09-24 13:50:13 PDT
I looked at the perf results attachment and I think I can do all that ... gotta
see about the stopwatch, though :)

If I have any questions, I'll let ya know.

I assume grabbing today's 'nightly' is fine enough for a build w/ the patch
incorporated?
Comment 89 Simon Fraser 2001-09-24 13:51:05 PDT
Here's the deal. The MP APIs that I'm using for dual CPU protection are broken 
under 10.0.x, but fixed for 10.1. I have code that works around this problem, 
that will make things work on 10.0.x. So I need to check that in, and we then 
have to do another round of performance testing.
Comment 90 Brice Ruth 2001-09-24 13:52:01 PDT
Oh - and I can't confirm anything about having 10.1 - to my knowledge, there is
no such release yet. :)
Comment 91 Simon Fraser 2001-09-24 15:11:47 PDT
Created attachment 50595 [details] [diff] [review]
Patch to fix MPEnter/LeaveCriticalRegion on 10.0.x
Comment 92 stephend@netscape.com (gone - use stephen.donner@gmail.com instead) 2001-09-24 16:48:00 PDT
Brice, yes.  Thanks for helping with this.  So, to summarize:

Folder loading speed: done with local folders
Thread pane scrolling: local or IMAP
Message display speed: local folders
Thread pane sorting: local or imap
Message Reply Speed: IMAP, see
http://www.mozilla.org/mailnews/performance/msgcompose_loggin.html for details
on how to generate a performance logfile for this.
Batch delete 5 (10kb each) messages: local
Mail module loading time: I wouldn't bother with this, it's too complex
currently (using IMAP, non-isolated servers, etc.) to be truly reliable.
Comment 93 grega 2001-09-24 16:49:12 PDT
marking nsenterprise- to get off the enterprise stop ship list.
Comment 94 Simon Fraser 2001-09-24 17:44:42 PDT
I tested the lastest changes on 10.1 single- and dual- CPU machines, and both 
work just fine.
Comment 95 lchiang 2001-09-24 19:59:37 PDT
I have the 10.1 update to Mac OS X.  Stephen or Sairuh, pls stop by to install
for testing.

Simon - do you expect to see a timing difference between OS X and 10.1?
Comment 96 Simon Fraser 2001-09-25 11:37:20 PDT
Created attachment 50724 [details] [diff] [review]
Tidied up MDEnter/Leave critical section code
Comment 97 sairuh (rarely reading bugmail) 2001-09-25 15:34:11 PDT
spoke with lisa briefly: stephend upgraded to 10.1, but i'll stick with 10.0.4
for  the nonce --so that we can test with smfr's latest patch [after
review/check-in] on both OS versions.
Comment 98 Steve Dagley 2001-09-25 19:35:31 PDT
Comment on attachment 50724 [details] [diff] [review]
Tidied up MDEnter/Leave critical section code

r=sdagley
Comment 99 Simon Fraser 2001-09-25 19:38:52 PDT
I checked in the working critical region code to the NSPR branch, which is used 
by the mozilla TRUNK. So this bug is now good to go on for testing.
Comment 100 Joe Francis 2001-09-26 04:40:54 PDT
holy crap.  So that's why I couldn't get this to work before - I was doing the 
same thing under 10.0.4
Comment 101 Jaime Rodriguez, Jr. 2001-09-26 09:06:53 PDT
Simon - Pls get the SR=, so we can plus this one, and check it in today.
Comment 102 Simon Fraser 2001-09-26 11:08:38 PDT
This has all the reviews it needs to be checked in to the branch. It's already on 
the trunk.
Comment 103 sairuh (rarely reading bugmail) 2001-09-26 11:17:10 PDT
trunk testing of Preferences panel: compared with my 2001-09-20 16:31 results.
same configuration, with a new profile [modern, no sidebar].

AFTER checkin, using 2001.09.25.20-trunk
----------------------------------------
Open Preferences dialog open: 3.39sec
Switching from Navigator to Mail & Newsgroups panel: 0.97sec


BEFORE checkin, using 2001.09.18.20-trunk
-----------------------------------------
Preferences dialog open: 2.50sec
Switching from Navigator to Mail & Newsgroups panel: 0.70sec

looks like this checkin might've slowed thing down a bit.

currently running page load tests. will attach when finished.
Comment 104 stephend@netscape.com (gone - use stephen.donner@gmail.com instead) 2001-09-26 12:56:54 PDT
I thought we were supposed to test on the 26th builds:

AFTER build:

Trunk build from 9/26 (when it shows up).

If the fix was committed on 9/25 at 7:38pm, how does it show up in the 25-20
builds, which are produced at 8 pm (and presumably take more than ~20 minutes to
pull and build).

Just checking to make sure I test the right build.
Comment 105 Wan-Teh Chang 2001-09-26 13:19:27 PDT
I just wanted to point out the the focus of the QA
testing should be on the stability of this patch.
That is, whether the Mozilla client crashes or
deadlocks with the patch applied.

The current code is broken on Mac OS X.  No matter
how much faster it is, it is broken.

If Simon's patch fixes the thread safety problem
on Mac OS X, it should be checked in.  There is not
much point in comparing its performance with the
performance of broken code on Mac OS X.  Any
performance comparision should be done on classic
Mac OS, where the current code works.
Comment 106 Jaime Rodriguez, Jr. 2001-09-26 15:50:17 PDT
check it in - PDT+
Comment 107 sairuh (rarely reading bugmail) 2001-09-26 16:00:58 PDT
i've created bug 101838 for the performance slowdown i've noticed in the
2001.09.25.20-trunk builds.
Comment 108 stephend@netscape.com (gone - use stephen.donner@gmail.com instead) 2001-09-26 18:18:23 PDT
Created attachment 51005 [details]
Mail performance results, take 2 (10.1, this time)
Comment 109 Jennifer Kobayashi 2001-09-27 12:14:05 PDT
Startup Results
I take an average of three for launch and relaunch on both builds. I tested 
these manually via a stopwatch. Times are in seconds. 
Build 2001-09-18-20-trunk
Launch: 19.25 
Re-launch: 12.49
Build 2001-09-25-20-trunk
Launch: 24.17
Re-launch: 16.62

So, as you can see.. a fairly large jump in times in the 9-25 build :( 
There were no peaks or weird jumps, all times on both builds were fairly 
consistent. 
Comment 110 Brice Ruth 2001-09-27 12:55:01 PDT
OS X v10.1, iMac DV 400MHz, 768MB RAM

Test/Build|--9/18--|--9/27--|
Startup 1 |  24.45 |  18.40 |
Startup 2 |  10.31 |  11.78 |

Startup 1 indicates that I rebooted and launched Mozilla for the first time,
Startup 2 indicates that I launched Mozilla directly after quitting the same
build of Mozilla, after a clean reboot.

Basically, I rebooted, launched build 9/18, quit, launched again - recorded
times, rebooted, launched build 9/27, quit, launched again - recorded times. 
I'll try to get some Mail Perfs now that I have a stopwatch.
Comment 111 scalkins 2001-09-27 13:09:13 PDT
Created attachment 51106 [details]
IM results of Performance testing on OSX. Everything is slower here also.
Comment 112 Brice Ruth 2001-09-27 14:58:17 PDT
I'm really sorry that I have to report this ... but, I believe I just
experienced deadlock in Mozilla, build 2001092620 - I had the spinning pizza for
at least 5 minutes and when I checked out top, Moz was hovering right around 50%
usage.  I had to forcibly kill Moz :(

I don't really know what 'caused' it ... I was viewing a web page when my wife
sent me an email, so I command-2'd over to Mail/News, hit command-t to get new
mail, then clicked in the 'Denise' folder that her messages get filtered into
and that's when the pizza started spinning.

I don't have a dual-proc machine, this is the same machine I posted the perf
results from (iMac DV 400 w/ 768MB).  If this *wasn't* deadlock, what could it
have been?
Comment 113 Simon Fraser 2001-09-27 15:55:28 PDT
Checked in on 0.9.4 branch. Keeping open for NSPR tip.
Comment 114 Simon Fraser 2001-09-27 16:39:36 PDT
Checked in on the NSPR tip.
Comment 115 lchiang 2001-10-01 18:30:57 PDT
kin - I heard you have a dual processor machine.  Can you verify for QA that
that the branch build starts on your machine?
Comment 116 sharky_k_ 2001-10-01 22:20:41 PDT
I have a G4 533 dual and Mozilla build nr. 2001093021 works great. Startup time
could be less (I hope).
Comment 117 kinmoz 2001-10-02 09:37:04 PDT
lchiang - I have no dual processor machines ... I believe both jfrancis and
sfraser do.
Comment 118 lchiang 2001-10-02 13:08:22 PDT
Thank you.

I think I will mark this verified (for startup of application on dual processor
Mac) per sharky_k_@mac.com's comment.

I have also asked Marty F. at AOL to confirm for me on his dual processor for
the commercial build.  I will also inquire with one other source for the
commercial build.
Comment 119 Hong Kwon 2001-10-02 13:52:25 PDT
I will also test on fkeeney's dual processor which he has graciously lent to me.
 If anyone wants to come by and use it, you are more than welcome to (B20-1)
Comment 120 Hong Kwon 2001-10-02 16:13:48 PDT
Some results:

On sdagley's G4/733 it took around 5 minutes to download 2220 headers (using
yesterday's trunk)
On fkeeney's dualG4/500 it took around 5 minutes to download 3482 headers (using
today's branch)

On sdagley's G4/733 it took around 30 seconds to download 2220 headers (using a
debug build from a few weeks ago, pre Simon's fix).  This on a non-optimized build.

As you can see this is a HUGE performance degradation.  It is literally 10 times
slower.
Comment 121 Hong Kwon 2001-10-02 17:37:52 PDT
posted new bug for mail performance degradation: 
http://bugzilla.mozilla.org/show_bug.cgi?id=102797

lrg tested major windows and browsed around on dual CPU machine for about 3
hours and didn't find any problems (other than generally sluggish performance).

Note You need to log in before you can comment on or make changes to this bug.


Privacy Policy