Closed Bug 29908 Opened 25 years ago Closed 24 years ago

Crash on quit after using mail; horks machine

Categories

(Core :: Networking, defect, P3)

PowerPC
Mac System 9.x
defect

Tracking

()

VERIFIED FIXED

People

(Reporter: mikepinkerton, Assigned: gordon)

Details

(Keywords: crash, Whiteboard: [PDT+] mustfix (fix checked in))

Attachments

(3 files)

- launch apprunner
- click "mail" icon in browser window
- type your imap mail password
- click a mail message, any message
- quit the app w/ cmd-q

CRASH!

PowerPC access exception at 1566399C PR_Lock+00154
 (CurStackBase does not seem to apply...dumping 4K.)
  Calling chain using A6/R1 links
  Back chain  ISA  Caller
  00000000    PPC  00B4DDC0  CallOnSwappedStack+00078
  00C67898    PPC  00C32730  OTScheduleDriverDeferredTask+0016C
  00C67838    PPC  00C2D1C4  qrun+00194
  00C677D8    PPC  00C04630  TNativeProvider::Notify(unsigned long, long, void*)+00138
  00C67788    PPC  00C07B3C  OTOpenEndpointOnStreamPriv+02750
  00C676F8    PPC  00C029E8  TNativeProvider::NotifyClient(void*, unsigned long, long, void*)+000D4
  00C67688    PPC  00BFD808  TOTProcNotifier::~TOTProcNotifier()+000FC
  00C67638    PPC  1566C924  NotifierRoutine+00438
  00C675B8    PPC  1566C4A4  WakeUpNotifiedThread+00068
  00C67578    PPC  1566A8BC  DoneWaitingOnThisThread+00030
 Closing log

Doing an 'es' in MacsBug leaves the Mac in such a limping state that the next 
time any app tries to use the network, the machine locks hard.
keywords.
Keywords: beta1, crash
->gordon
Assignee: gagan → gordon
PDT+
Whiteboard: [PDT+] mustfix
I'm unable to reproduce this particular crash, and it seems to work for Mike now 
as well.  We'll continue to poke it a bit to see if we can get it to misbehave 
again.  If that doesn't work, then we'll close it.
Status: NEW → ASSIGNED
man, this worries me that this bug just seemed to go away....It was so severe, 
and then poof...no problems. *worried look on his face*
I just saw this today.
Adjust summary. Here's an ip for the crash:

PowerPC access exception at 1D9F5D2C PR_Lock+00150
 Disassembling PowerPC code from 1D9F5D04
  PR_Lock
     +00128 1D9F5D04   lwz        r3,0x0810(RTOC)                         | 80620810
     +0012C 1D9F5D08   mr         r4,r25                                  | 7F24CB78
     +00130 1D9F5D0C   li         r5,0x0106                               | 38A00106
     +00134 1D9F5D10   bl         PR_Assert                  ; 0x1D9E6458 | 4BFF0749
     +00138 1D9F5D14   nop                                                | 60000000
     +0013C 1D9F5D18   lwz        r30,0x000C(r29)                         | 83DD000C
     +00140 1D9F5D1C   addi       r0,r29,0x000C                           | 381D000C
     +00144 1D9F5D20   cmplw      r30,r0                                  | 7C1E0040
     +00148 1D9F5D24   beq        PR_Lock+00160              ; 0x1D9F5D3C | 41820018
     +0014C 1D9F5D28   lwz        r4,-0x0050(r30)                         | 809EFFB0
     +00150 1D9F5D2C  *lwz        r3,0x0010(r29)                          | 807D0010
     +00154 1D9F5D30   lwz        r0,-0x0050(r3)                          | 8003FFB0
     +00158 1D9F5D34   cmpw       r4,r0                                   | 7C040000
     +0015C 1D9F5D38   bne        PR_Lock+00184              ; 0x1D9F5D60 | 40820028
     +00160 1D9F5D3C   addi       r30,r29,0x000C                          | 3BDD000C
     +00164 1D9F5D40   b          PR_Lock+00190              ; 0x1D9F5D6C | 4800002C
     +00168 1D9F5D44   lwz        r3,0x000C(r29)                          | 807D000C
     +0016C 1D9F5D48   subi       r26,r3,0x0054                           | 3B43FFAC
     +00170 1D9F5D4C   lwz        r3,0x0004(r31)                          | 807F0004
     +00174 1D9F5D50   lwz        r0,0x0004(r26)                          | 801A0004
 Closing log
Summary: Opening Mail message takes out networking on Mac → Crash on quit after using mail; horks machine
If you can reproduce this, please dump the registers as well.  Simon, did your 
stack crawl look the same as Mike's?

We're dying somewhere in the second half of this condition:

    if (q == &lock->waitQ || _PR_THREAD_CONDQ_PTR(q)->priority ==
      	_PR_THREAD_CONDQ_PTR(lock->waitQ.prev)->priority) {

Here's the line in the file for more context:

http://lxr.mozilla.org/seamonkey/source/nsprpub/pr/src/threads/combined/prulock.c#281
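For anyone reading along without the NSPR source handy, here is a rough sketch of what that condition is doing. It assumes the macro recovers a thread record from an embedded queue link by subtracting a field offset. All of the names below (Link, DemoThread, DemoLock, DEMO_THREAD_PTR) are made up for illustration, and the struct layout is not the real PRThread layout. The point is only that both sides of the priority comparison dereference a thread through a queue link, so a stale link means a read of freed or bogus memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical circular list link, in the style of a PRCList. */
typedef struct Link {
    struct Link *next;
    struct Link *prev;
} Link;

/* Hypothetical thread record; the real PRThread layout differs. */
typedef struct DemoThread {
    int  priority;
    Link waitLinks;   /* embedded link used while waiting on a lock */
} DemoThread;

/* Recover the enclosing thread from a pointer to its embedded link. */
#define DEMO_THREAD_PTR(q) \
    ((DemoThread *)((char *)(q) - offsetof(DemoThread, waitLinks)))

typedef struct DemoLock {
    Link waitQ;       /* list head; empty when it points at itself */
} DemoLock;

static void queue_init(Link *head)
{
    head->next = head;
    head->prev = head;
}

/* Append item at the tail of the circular list rooted at head. */
static void queue_append(Link *head, Link *item)
{
    assert(head != NULL && item != NULL);
    item->next = head;
    item->prev = head->prev;
    head->prev->next = item;
    head->prev = item;
}

/* Same shape as the quoted condition: true when q is the list head, or
 * when q's thread has the same priority as the last waiter. Only valid
 * to call with a non-head q when the queue is non-empty. */
static int same_priority_as_last(DemoLock *lock, Link *q)
{
    return q == &lock->waitQ ||
           DEMO_THREAD_PTR(q)->priority ==
           DEMO_THREAD_PTR(lock->waitQ.prev)->priority;
}
```

The loads at a negative offset from a link pointer in the disassembly would be consistent with this kind of pointer arithmetic, though that is a guess from the listing rather than something confirmed against the build.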
Here's my stack crawl:

 (CurStackBase does not seem to apply...dumping 4K.)
  Calling chain using A6/R1 links
  Back chain  ISA  Caller
  00000000    PPC  1FD08B70  CallOnSwappedStack+00078
  00772C08    PPC  1FD2BAC0  OTScheduleDriverDeferredTask+0016C
  00772BA8    PPC  1FD26554  qrun+00194
  00772B48    PPC  1FCF3364  TNativeProvider::Notify(unsigned long, long, void*)+00138
  00772AF8    PPC  1FCF68A8  OTOpenEndpointOnStreamPriv+02788
  00772A68    PPC  1FCF171C  TNativeProvider::NotifyClient(void*, unsigned long, long, void*)+000D4
  007729F8    PPC  1FCEC4D8  TOTProcNotifier::~TOTProcNotifier()+000FC
  007729A8    PPC  1D9FED24  NotifierRoutine+00438
  00772928    PPC  1D9FE8A4  WakeUpNotifiedThread+00068
  007728E8    PPC  1D9FCCBC  DoneWaitingOnThisThread+00030
PowerPC access exception at 1D9F5D2C PR_Lock+00150
Target Milestone: M14
It looks like async socket I/O is completing on a thread that has been 
deallocated.  Simon, could you post the log from the crash we looked at 
yesterday?  Thanks.
For those taking a look at the MacsBug log, keep in mind that the crash is 
actually occurring on line PR_Lock+14c.
I spoke with Wan-Teh and he proposed changes to SendReceiveStream() to fix this.  
We still need to find a way to reproduce the problem, so we can tell if we've 
fixed it.  At a minimum, I'd like to examine another crash to try to confirm our 
current theories about the bug.
If Wan-Teh has a proposed fix, can we get that fix onto a machine that *can*
reproduce this ASAP, and see if the fix works?
Thanks,
Jim
Attached patch: Proposed fix.
I've sent the fix to wtc, sfraser, davidm, pinkerton, and mwelch for review and 
testing; see the attachment for the diffs.

To my knowledge, only pinkerton and sfraser have seen this problem.  I will work 
with them to try and verify the fix.
Another slight variation for us to verify:
1.  Launch to browser, go to mail.
2.  Select a message.
3.  Close mail window (via close box).
4.  Application hangs, no crash.
My comments above were referring to me seeing this with IMAP account using
2000-03-10-14m15 commercial build. Able to reproduce over a couple different
machines.
laurel: this could be two things:
1. It could be something like bug 29733. Here, the app fails to quit, but you
   can still choose menu items, and a second Quit works.
2. It could be a bug (maybe unfiled?) where we hang in an NSPR thread. If you
   drop into MacsBug, you'll see _MD_PauseCPU on the stack.

So please get a stack crawl if possible so we can determine which it is.
I'll keep trying to get a stack crawl, but so far I'm not dropping into macsbug
-- I just hang and CAN'T select any menu items and can't quit or close browser,
I need to force quit. I am also not quitting from compose window. I'll give it
some more tries... 
when it hangs, you can usually drop into macsbug manually (cmd-power or the right 
front button on the machine's front panel). 
Laurel's log looks like this:

 (CurStackBase does not seem to apply...dumping 4K.)
  Calling chain using A6/R1 links
  Back chain  ISA  Caller
  0E751D79    PPC  0046CFAC  EmToNatEndMoveParams+00014
  0E751D00    PPC  002D4FCC  
  0E751CB0    PPC  00D356C8  
  0E751BB8    68K  00428F3E  'scod BFAF 0002'+01F1E
  0E751B90    68K  0042D2D6  'scod BFAF 0002'+062B6
  0E751B7C    68K  00438AF8  'scod BFAF 0002'+11AD8
 Return addresses on the stack
  Stack Addr  Frame Addr   ISA   Caller
   0E751F18                PPC   1F18296C PRP_TryLock+00994
   0E751F08                68K   0E6E084E
   0E751EB8    0E751EB0    PPC   1F180EB8 PR_GetThreadPrivate+004D8
   0E751EAC                68K   0E6E084E
   0E751E68    0E751E60    PPC   1F1874BC PR_GetPrimaryThread+002A0
   0E751E5C                68K   0E6E084E
   0E751E58    0E751E50    PPC   1F181364 _PR_GetPrimordialCPU+0045C
   0E751E28    0E751E20    PPC   FFD0133C WaitNextEvent+00028
   0E751E22                68K   1E6041FE
   0E751E18                68K   0E6E084E
   0E751D74    0E751D70    68K   005829CE
   0E751D08                PPC   0046CFAC EmToNatEndMoveParams+00014
   0E751CE0                68K   0E751D9E
   0E751CB8    0E751CB0    PPC   002D4FCC
   0E751C78    0E751C70    PPC   002D6DD4 MPGetPoolStatistics+002A0
   0E751C68    0E751C60    PPC   00D356C8
   0E751B94    0E751B90    68K   00428F3E 'scod BFAF 0002'+01F1E
   0E751B80    0E751B7C    68K   0042D2D6 'scod BFAF 0002'+062B6
   0E751B70    0E751B6C    68K   00438AF8 'scod BFAF 0002'+11AD8
   0E751B60                68K   0F6A8F26
   0E751B54                68K   0F6DD306
   0E751B50    0E751B4C    68K   00438BA6 'scod BFAF 0002'+11B86

This looks to me like the zombie thread on quit problem.
gordon/sfraser: Given that it's the old "zombie thread on quit" problem
(whatever that means), where does that leave us in terms of a fix?  Is this a
notorious and unfixable problem?  A problem with a standard fix?  Please give us
a timeline for repair, and a proposed landing date.
The beta branch has been cut... and this is a bad problem.  Please give us some
info RSN.
Here's the deal. This bug (this particular crash on quit in PR_Lock()) is very 
hard to reproduce, happening only occasionally on some machines on quitting after 
using mail. I think its very infrequency is enough to justify release noting 
this for beta. This infrequency also makes it hard to verify that a fix works.

The 'zombie thread on quit' is a separate problem (as far as I can tell), for 
which no bug exists (my bad, I guess). Next time I see it, I'll get a stack and 
file a bug. It also happens more if you've been using mail.

I think we need to get feedback from QA on how frequent each of these bugs is 
(perhaps using Talkback data), and triage based on that.
No longer blocks: 7799
Pinkerton: How often have you seen this?  Are you in agreement with sfraser that
it is rare, and we should release note?
Whiteboard: [PDT+] mustfix → [PDT+] mustfix (unless this is rare enough to relnote)
I'll start trying to repro this again with gordon's patch that he mailed me.
w/out gordon's patch, i am _still_ seeing this crash (it reappeared!). I'm trying 
gordon's patch right now and will keep trying to dupe.
Whiteboard: [PDT+] mustfix (unless this is rare enough to relnote) → [PDT+] mustfix (waiting for mark welch to review)
I checked in gordon's patch (attachment 6542) on the
following branches (of NSPR):

1. Main trunk:
   /cvsroot/mozilla/nsprpub/pr/src/md/mac/macsockotpt.c, revision 3.16
2. NSPRPUB_RELEASE_4_0_BRANCH:
   /cvsroot/mozilla/nsprpub/pr/src/md/mac/macsockotpt.c, revision 3.14.8.4
3. NSPRPUB_CLIENT_BRANCH:
   /cvsroot/mozilla/nsprpub/pr/src/md/mac/macsockotpt.c, revision 3.15.2.1

The checkin to NSPRPUB_CLIENT_BRANCH is approved by jar@netscape.com.
We just missed the verification builds for today; we'll be ready to checkin to 
the beta branch on Wednesday 3/15.
Whiteboard: [PDT+] mustfix (waiting for mark welch to review) → [PDT+] mustfix (ready to checkin)
We've checked in a fix (to beta branch and trunk). The fix clears the thread
field of socket file descriptors after SendReceiveStream() is done using it, so
that later incoming async OT events for the socket don't mistakenly try to wake
the old thread up.
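For the record, the idea behind the fix can be sketched roughly as below. This is a toy model, not the checked-in patch (see the attachment for the real diffs): FakeThread, FakeSocket, notifier_event and send_receive are hypothetical stand-ins, and the "async event" is just a direct function call. It only shows why clearing the waiter field after the blocking call returns keeps a late event from touching a thread that may already be gone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the real NSPR types. */
typedef struct FakeThread {
    int wakeups;          /* how many times the notifier woke this thread */
} FakeThread;

typedef struct FakeSocket {
    FakeThread *waiter;   /* thread blocked on the current I/O, or NULL */
} FakeSocket;

/* Called from the (simulated) async notifier when an event arrives.
 * Returns 1 if a thread was woken, 0 if the event was ignored. */
static int notifier_event(FakeSocket *sock)
{
    assert(sock != NULL);
    FakeThread *t = sock->waiter;
    if (t == NULL)
        return 0;         /* nobody is waiting: do not touch any thread */
    t->wakeups++;
    return 1;
}

/* Simulated blocking call: register the waiter, let the I/O complete,
 * then clear the field so later events cannot reach this thread. */
static void send_receive(FakeSocket *sock, FakeThread *self)
{
    sock->waiter = self;
    notifier_event(sock); /* the completion for this call */
    sock->waiter = NULL;  /* the step this sketch attributes to the fix */
}
```

In this model, a stray call to notifier_event() after send_receive() has returned finds no waiter and does nothing. Without the line that resets the field, the same stray call would still bump the old thread's counter, which is the toy equivalent of waking a deallocated thread.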
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Whiteboard: [PDT+] mustfix (ready to checkin) → [PDT+] mustfix (fix checked in)
verified on Mac OS 9 - build 2000031615 nb1b
Status: RESOLVED → VERIFIED
OS: Mac System 9.x