Closed Bug 118004 Opened 23 years ago Closed 15 years ago

N70PR1 crash [@ nsTimerImpl::~nsTimerImpl]

Categories

(Core :: XPCOM, defect)

x86
Windows ME
defect
Not set
critical

Tracking

()

RESOLVED FIXED

People

(Reporter: mscott, Unassigned)

References

Details

(Keywords: crash, topcrash-)

Crash Data

Attachments

(2 files, 5 obsolete files)

The network connection was down. I started up the browser and it was trying to load the netscape home page. Instead I crashed in timer code. I think Pav recently re-wrote the timer code so I'm going to tentatively assign this to him. Using 2002010203, win32. Talkback ID: 1200742 nsTimerImpl::~nsTimerImpl [d:\builds\seamonkey\mozilla\xpcom\threads\nsTimerImpl.cpp, line 114] nsTimerImpl::`scalar deleting destructor' nsTimerImpl::Release [d:\builds\seamonkey\mozilla\xpcom\threads\nsTimerImpl.cpp, line 73] destroyMyEvent [d:\builds\seamonkey\mozilla\xpcom\threads\nsTimerImpl.cpp, line 286] PL_DestroyEvent [d:\builds\seamonkey\mozilla\xpcom\threads\plevent.c, line 628] PL_HandleEvent [d:\builds\seamonkey\mozilla\xpcom\threads\plevent.c, line 603] PL_ProcessPendingEvents [d:\builds\seamonkey\mozilla\xpcom\threads\plevent.c, line 524] _md_EventReceiverProc [d:\builds\seamonkey\mozilla\xpcom\threads\plevent.c, line 1072] nsAppShellService::Run [d:\builds\seamonkey\mozilla\xpfe\appshell\src\nsAppShellService.cpp, line 303] netscp6.exe + 0x167d (0x0040167d) netscp6.exe + 0x121c (0x0040121c) n
Keywords: crash
Severity: normal → critical
can someone get me a reproducable testcase?
Status: NEW → ASSIGNED
Target Milestone: --- → Future
Attached file Stacks from Purify
I've attached the stacks from Purify. Hopefully that might shed some light. Unfortunately I haven't been able to come up with steps to reproduce. Seems rather sporadic.
are either of you running (mscott or dbradley) on dual processor machines? There's a rumour that this kind of problem is easier to reproduce on the dual cpu machines.
No, single Intel CPU. It's not readily reproduceable but I find it happens more often running under Purify, so I supsect maybe a race condition or some such that the slowness of Purify exposes more.
I'm not on a dual processor either.
Bug 121845 might be the same problem. Try my reproducible case from there.
I'm seeing crashes like this just about every time I do an e-mail search (like bug 127621). This is with a fresh cvs build on Linux SMP. When I reboot to a non-SMP kernel, the problem goes away. I _think_ I've found a race condition in the timer code which causes this crash, but I'm not 100% sure, so go easy on me if I'm way off track :-) Object A has a timer (nsITimer/nsTimerImpl), and it's active (so it's in the TimerThread.mTimers nsVoidArray). It's got the only reference, so the refcount is 1. Object A gets destroyed so the refcount drops to zero, and the destructor for nsTimerImpl is called. This calls gThread->RemoveTimer(this) to get itself out of mTimers - to do this, it needs the TimerThread lock and so it blocks. Meanwhile, over in TimerThread::Run, it's got the lock and calls RemoveTimerInternal, and then NS_ADDREF(timer) on the 'doomed' timer. It then releases the lock, so the destructor in Object A continues and the nsTimerImpl is deleted. A few lines later in the TimerThread, the event is fired (which might work if we are lucky) and shortly after that NS_RELEASEd again (in destroyMyEvent) - the destructor gets called again. Chaos follows. I came to this conclusion by adding an extra 'magic' member to nsTimerImpl and setting it to 12345678 on construct, and to 87654321 as the first line of the destructor, sure enough I caught double destructs. I then added something to TimerThread::Run to check the magic of the timer it's about to fire and found that the timer had been destroyed (or rather was in the process of being destroyed) Well I hope that helps, I haven't got a fix for it just yet.
that looks very plausible to me.
Targeting. Thanks very much for analyzing this, my face is red for not catching it in review. /be
Keywords: mozilla1.0+
Target Milestone: Future → mozilla1.0
Blocks: 127621
kerz and putterman see the crash on email quick search all the time. we're debugging now. we'll start with rich's suggestions.
Summary: crash at [@ nsTimerImpl::~nsTimerImp] → crash at [@ nsTimerImpl::~nsTimerImpl]
sspitzer, I've been making up for bad super-review by hacking a fix. Almost ready to attach a patch -- don't want to duplicate your work. /be
no, I haven't done more that reproduce this with kerz on his machine, and add bug comments. when you have a patch, I'll try it out.
Attached patch proposed fix (obsolete) — Splinter Review
This was hard. Someone remind me again why we ditched native widget timers for XP timers built on a thread? I haven't tested yet, so I'm braced for bad news. I've been running the race condition outcomes through my brain, and I need to lie down now. I did rename nsITimerScriptable to nsIScriptableTimer, as threatened elsewhere. I also used the preferred spelling of canceled (stress goes on the syllable before the doubled consonant, e.g. "pre_ferr_ed", as opposed to "_can_celed"). Picky, picky! /be
Let's avoid adding noise to 0.9.9 talkback, and adding people to the ranks of unhappy 0.9.9 users. Looking for testing, r= and sr=. /be
Blocks: 122050
Keywords: mozilla0.9.9
Target Milestone: mozilla1.0 → mozilla0.9.9
Just as I was going to sleep, I realized I forgot to transpose an addref and a (subroutine-obscured) release. Fixing platform and os fields, reassigning. /be
Assignee: pavlov → brendan
Status: ASSIGNED → NEW
OS: Windows 2000 → All
Hardware: PC → All
Status: NEW → ASSIGNED
Dammit, now I'm wide awake! My build is still going, I have not yet tested. /be
Attachment #72930 - Attachment is obsolete: true
Well that second patch works for me, I'd previously be able to crash every single time. I had been running a temporary hack which stablised things a little (just a if (!timer->mRefCnt) continue) in TimerThread::Run) which seemed to confirm that it was a race condition type thing. Thanks
Ok, time for bed or I'll turn into one of those X-Files cases who grow telepathic powers from sleep deprivation! Where's Agent Scully? /be
If the timer has been cancelled should it still fire? If nsTimerImpl::Release was in RemoveTimer (Meaning it is now marked as cancelled) and TimerThread::Run is coming out of RemoveTimerInternal then the timer has been cancelled, but Fire is still called. Also RemoveTimerInternal doesn?t appear to check to see if the timer has been "disarmed" before removing. I think this state could occur if nsTimerImpl::Release is blocked in RemoveTimer, and Run has called RemoveTimerInternal. The thread unblocks in Release's call to RemoveTimer, then calls RemoveTimerInternal. I don't think this failure causes any major problems, other than causing Release to return 1 rather than 0. Also why the assigns of nsnull to "timer" in TimerThread::Run? What bugs me about this, is the fact we have two classes that have state that is touched by multiple threads, when I think we could have just one class with state that needs to be thread safe. Without some analysis its hard to say if that would reduce or increase the complexity.
dbradley wrote: >If the timer has been cancelled should it still fire? If nsTimerImpl::Release >was in RemoveTimer (Meaning it is now marked as cancelled) and >TimerThread::Run is coming out of RemoveTimerInternal then the timer has been >cancelled, but Fire is still called. Run could check timer->mCanceled, but it didn't before -- it Fired the timer, sending a PLEvent across a thread boundary, and let nsTimerImpl::Process, all the way on the other side of that boundary, notice mCanceled. I just work here trying to fix last-minute 0.9.9 crash/topcrash bugs :-). I didn't design this pav-code, and I didn't review it well enough. But in this case, I think pavlov may have been right to avoid adding more mCanceled checks, saving code and complexity. It costs a little in the cancel-races-with-run case, but who cares? No harm comes, right? And we save a little (tiny) code footprint. >Also RemoveTimerInternal doesn?t appear to check to see if the timer has been >"disarmed" before removing. I think this state could occur if >nsTimerImpl::Release is blocked in RemoveTimer, and Run has called >RemoveTimerInternal. RemoveTimerInternal calls nsVoidArray::RemoveElement and lets it search. If the timer was disarmed while this call (from nsTimerImpl::Release, as you say) had to wait on TimerThread::mLock until Run had already removed the timer, we just spend O(n) IndexOf search time, find nothing, and return false from RTI. You are right that an mArmed test would avoid this cost, but it's a hard case that I believe is rare (though it happens, witness this bug!). Is it worth the extra code to test mArmed, to optimize away the IndexOf done by RemoveElement? >The thread unblocks in Release's call to RemoveTimer, >then calls RemoveTimerInternal. I don't think this failure causes any major >problems, other than causing Release to return 1 rather than 0. That I took pains to do, not that anyone uses NS_RELEASE2. It just scared me that someone might do such a thing some day and draw the wrong conclusion. I think I got the logic right -- were you calling out a flaw, or just noticing how it all works? >Also why the assigns of nsnull to "timer" in TimerThread::Run? You mean in the line after timer->Fire() is called? Because as the comment just above says, the Release done by destroyMyEvent on the timer's thread, not on the TimerThread, matches the AddRef done in Run. Symmetry breaks, the timer's lifetime is handed off across the PLEvent-spanning thread boundary, and its dtor runs on the thread targeted by Fire, the one where Process (these names aren't great -- pav?) actually calls back into the timer's callback function, or notify method, or observer. >What bugs me about this, is the fact we have two classes that have state that >is touched by multiple threads, when I think we could have just one class with >state that needs to be thread safe. Without some analysis its hard to say if >that would reduce or increase the complexity. I think I know what you mean. Timers should be simply messages between threads, but we actually have both some fields in nsTimerImpl, and the mTimers array in TimerThread, shared among threads. But hey, I'm just trying to fix this for 0.9.9. A better design for the long run would be great, I just have not had time to think on it much. /be
Ok, testing seems good -- how about r= and sr= for 0.9.9? It's wrapping up fast, and I think we need this fix. /be
Attachment #72940 - Attachment is obsolete: true
Comment on attachment 73023 [details] [diff] [review] more cleanup, coments and spacing except for the switch (mCallbackType) to dispatch in nsTimerImpl::Process I didn't dig deep enough to find that nsTimerImpl::Process checked for mCancelled. I didn't read your comment in nsTimerImpl::Release thoroughly, that clears up the concern I had with that. The timer=nsnull has nothing to do with your patch I was just curious if I was missing something: >You mean in the line after timer->Fire() is called? Because as the comment just >above says, the Release done by destroyMyEvent on the timer's thread, not on the >TimerThread, matches the AddRef done in Run. That one and the original initialization. "timer" is just an interface pointer and not an nsCOMPtr pointer, so assigning null here, and then later has no effect on reference counting, correct? And the timer variable is essentially always set to a non-null value before it's use. So FWIW I think the patch is good. I stressed it a bit by stalling the timer thread and traversing some of the lesser frequented paths. r=dbradley
Attachment #73023 - Flags: review+
>That one and the original initialization. "timer" is just an interface pointer >and not an nsCOMPtr pointer, so assigning null here, and then later has no >effect on reference counting, correct? And the timer variable is essentially >always set to a non-null value before it's use. Yes, and 'tho I avoid unnecessary initialization in my own code, this code was shared, came with some tricky control flow that I made more tricky, and so it seemed worth the few cycles to initialize timer to null at its declaration, and to clear it once the thing it pointed at had been handed off via Fire. Thanks for the r= -- who will sr=? /be
So the decoupled design I think dbradley and I were imagining would not share memory the way the current code does. Instead, the TimerThread would be asked to create a peer data structure in its scheduled list, and hand back a serial number identifying it. All set-type, set-delay, and cancel operations would go through the same inter-thread channel as this create-timer message that returns a serial number response, and the subsequent messages would name the timer using the serial number. Then client object (nsITimer-implementing XPCOM object) life cycle would not be coupled by strong or weak refs to the TimerThread. The only shared memory is in the inter-thread message channel. That's a design we could get into after mozilla1.0, if this code proves to have more bugs, or is just too unwieldy to maintain. I thought I'd babble about it here for posterity. /be
I tried quite a bit and couldn't get it to crash in quicksearch with this now, so far so good. I'm going to get putterman a build for him to try, as he was crashing on every quicksearch on his machine.
Just starting to look at this. You forgot to change mozilla/xpcom/threads/MANIFEST_IDL.
*** Bug 127621 has been marked as a duplicate of this bug. ***
*** Bug 118648 has been marked as a duplicate of this bug. ***
Fixed MANIFEST_IDL, not bothering with a new patch just for that -- thanks! /be
Comment on attachment 73023 [details] [diff] [review] more cleanup, coments and spacing except for the switch (mCallbackType) to dispatch in nsTimerImpl::Process r-=pavlov With this patch, you will always leak mCallback.i/mCallback.o when Cancel is called since you have added: + mCallbackType = CALLBACK_TYPE_UNKNOWN; nsTimerImpl::~nsTimerImpl won't be able to determine what kind of callback type it is to determine how to NS_RELEASE it. This release must happen on the thread that created the timer as well, so it needs to happen in the destructor. I don't see any reason for adding the mCallbackType = CALLBACK_TYPE_UNKNOWN to nsTimerImpl::Cancel() except for in Process() you have the default: case for the callback type to NS_ERROR.. I would suggest checking mCanceled before the switch() and removing the default handler.
Attachment #73023 - Flags: review+ → needs-work+
Thanks pav -- late breaking brain damage on my part, following your original damage :-). There is already an mCanceled check at the top of Process, so there is no need to add one a little later before the switch, where your version had a test for a non-null mCallback.c (I should have not worried about conserving that redundant test, but I was hacking without enough sleep). I'll fix that, but I will have to leave an empty default to avoid a GCC warning. New patch in a sec, /be
Attachment #73023 - Attachment is obsolete: true
Comment on attachment 73162 [details] [diff] [review] Fix that was checked into 0.9.9 branch and trunk the rest of this looks fine. r=pavlov .. if someone can run purify with this patch, I think that would be a good idea incase I missed anything though.
Attachment #73162 - Flags: review+
Comment on attachment 73162 [details] [diff] [review] Fix that was checked into 0.9.9 branch and trunk sr=shaver. Do we care about components that are not in the tree, but may use scriptable timers? I guess not!
Attachment #73162 - Flags: superreview+
Comment on attachment 73162 [details] [diff] [review] Fix that was checked into 0.9.9 branch and trunk sr=jband I'm good with this. Besides the issue of *when* we schedule the public flogging of Stuart for his horrid design, I see a couple of things worth mentioning... - What is the pattern of TimerThread shutdown and destruction? If the timer array cleanup in TimerThread::~TimerThread matters at all then I'd think it should look like the cleanup in TimerThread::Shutdown in order to not do evil. - I'm fairly convinced that the posible nsTimerImpl::Release call patterns won't burn us. The code... > if (count == 1 && mArmed) Is a little funky because it tests the state of the count as it *was* against the state of mArmed as it *is*. If the multithreaded calling patterns were more complex (at a time when the recount could go to 1), then you'd have to lock out refcount changers inside the 'if (count ==1)' block and check mRefCnt to see if was *still* 1 && mArmed. But, it looks like the rules here preclude the pattern where the refcount would be going down to one while *another thread* might be arming the timer. I think that could only happen if multiple outside users of the timer were twiddling it on different threads. Anyway, I think your patch makes things better. I have not tested it.
Comment on attachment 73162 [details] [diff] [review] Fix that was checked into 0.9.9 branch and trunk a=asa (on behalf of drivers) for checkin to 0.9.9
Attachment #73162 - Flags: approval+
nsITimerScriptable went in just three weeks ago. I doubt anyone outside the tree is using it, but anyway, what does that shaver guy say about unfrozen interface users? :-) jband: right, as the comment in Release says, if the count goes down to 1 and mArmed is apparently set, there can't be any "timer resuscitation" by some other thread than the TimerThread. No one is able to pull a timer strongly referenced via mTimers[i] back from the brink of death. /be
Pav, can you respond to jband's question: >- What is the pattern of TimerThread shutdown and destruction? If the >timer array cleanup in TimerThread::~TimerThread matters at all then >I'd think it should look like the cleanup in TimerThread::Shutdown in >order to not do evil. I'm weary and I have other bugs to tend. /be
Fixed on branch and trunk. I'm reassigning this to pavlov for followup to jband's question, and to file a new bug whose number is recorded here before this one gets closed. The new bug would want the design in comment #24, with one change: The create-timer message need not be a synchronous request/response message exchange with the TimerThread, where the creating thread waits to learn the serial number allocated by the TimerThread. Instead, the creating thread can atomically increment a serial number generator and stipulate that as the id it will use, passing it asynchronously to the TimerThread. All messages in the inter-thread channel should be asynchronous (Cancel has never failed with a non-successful nsresult when the timer was not found, so there' no need for a handshake there). Removing from the 122050 tracking bug. /be
Assignee: brendan → pavlov
No longer blocks: 122050
Status: ASSIGNED → NEW
TimerThread's Shutdown and destruction both come, one after the other, from nsTimerImpl::Shutdown(), which is called during XPCOM's shutdown. In the destructor, there should never be any timers alive, so we should probably just ASSERT if there are and remove the code that walks the list.
*** Bug 121845 has been marked as a duplicate of this bug. ***
can we obsolete the approved patch from 03/07/02 so it doesn't look like an unlanded approved change? thanks.
Attachment #73162 - Attachment description: proposed fix for 0.9.9 → Fix that was checked into 0.9.9 branch and trunk
Attachment #73162 - Attachment is obsolete: true
Target Milestone: mozilla0.9.9 → mozilla1.0
i see that originally this bug affected all platforms. i double checked talkback data and this sig only happens on win machines now. i'm trying to repro this using recent user comments but, alas, no such luck. i will also attach two more call stacks with the same stack signature. today's talkbalk call stack: Count Offset Real Signature [ 21 nsTimerImpl::~nsTimerImpl 7d102c41 - nsTimerImpl::~nsTimerImpl ] Crash date range: 2002-05-22 to 2002-05-30 Min/Max Seconds since last crash: 161 - 46581 Min/Max Runtime: 161 - 1044390 Keyword List : Count Platform List 8 Windows 98 4.90 build 73010104 (winME) 6 Windows NT 5.0 build 2195 (win2000) 4 Windows 98 4.10 build 67766446 (win98SE) 2 Windows NT 5.1 build 2600 (winXP) 1 Windows NT 4.0 build 1381 Count Build Id List 21 2002051220 No of Unique Users 19 Stack trace(Frame) nsTimerImpl::~nsTimerImpl [d:\builds\seamonkey\mozilla\xpcom\threads\nsTimerImpl.cpp line 183] nsTimerImpl::`scalar deleting destructor' nsTimerImpl::Release [d:\builds\seamonkey\mozilla\xpcom\threads\nsTimerImpl.cpp line 96] nsSupportsArray::Clear [d:\builds\seamonkey\mozilla\xpcom\ds\nsSupportsArray.cpp line 601] nsSupportsArray::DeleteArray [d:\builds\seamonkey\mozilla\xpcom\ds\nsSupportsArray.cpp line 304] nsSupportsArray::`vector deleting destructor' 0x80000000 (6817403) Comments: I was trying to close the broswer and it jammed up. But BTW the 7.0 beta kicks ass. The mutiple email accounts feature is excellent. Just figure out how to import a complete profile (just everything from the mail profile......I have three internet (6817403) Comments: businessesand one personal email account....4 different profiles in Netscape 4.78). If you can figure out how to import the address book with allthe email folders then you have the best browser on the market.(Well I always thought you had the (6817403) Comments: best browser anyway).Thanks for making a great product.-Mick Mitchell (6791175) Comments: again this is the second time i have gone to close the Aim messenger and recieved an error message (6773959) Comments: closing the aol instant messenger menu.(i had shut down my internet before i signed off the messenger) then closed the window and then recieved the error mesage (6732792) Comments: Opening Application (6727241) Comments: just stand by suddenly this window comes. (6713431) Comments: attempting to save a draft - to send later. I was offline composing a message. When I finished the program crashed - "performed illegal operation" (6691257) Comments: I was editing my preferences and when i pressed the ok the program didnt respond (6665453) Comments: I picked up the pic then freezed. (6570254) Comments: had opened the entire program chose to close AIM tab when program bombed...something about xp---.dll error (6562045) Comments: Using a Java applet
Keywords: topcrash
OS: All → Windows ME
Hardware: All → PC
Summary: crash at [@ nsTimerImpl::~nsTimerImpl] → N70PR1 crash [@ nsTimerImpl::~nsTimerImpl]
Keywords: qawanted
i e-mailed one of the users that sent feedback via talkback. his browser crashed when he tried to open an attachment. below is his e-mail response: it happens every time i try to open an attachment. before i was just reading my mail. according to my computer i do not run out of memory [ram]. if i try to do anything else when i open an attachment nothing happens. if i click on the options at the top [file, edit, etc.] a drop down window will open but there is nothing in it. i have amd 1.2 processor, 512 meg ddr ram hercules 3d prophet video card with 64 meg ram windows xp
seems like from above comments double deleting should be fixed but...5/12's build is still showing similar call stack. taking into account of postprocessed code, for some reason the i var is null, 180 NS_RELEASE(mCallback.i); so when 165 NS_LOG_RELEASE_CALL((_ptr), (_ptr)->Release(), __FILE__, __LINE__); is executed it crashes. perhaps mCallback.i was already released??? can't quite find out specifically where since i can't repro and run the debugger on it.
The most recent crash for this stack signature in Talkback reports is from build 2002062713
topcrash-
Keywords: topcrashtopcrash-
Looking at bug 184363 led me back to this bug, and I'm puzzled by one thing in the analysis in comment 7: NS_IMPL_THREADSAFE_RELEASE does refcount stabilization before it calls the destructor. It does this in a non-threadsafe way, however, on the assumption that if nothing holds owning references to the refcounted object then nobody will be doing anything with the object. This assumption is incorrect in this case. This suggests that the sequence of events needed to cause the race condition is different from the sequence given in comment 7, that is: 1) in either order, Main thread: nsTimerImpl::Release calls PR_AtomicDecrement, setting mRefCnt=0 Timer thread: TimerThread::Run acquires |TimerThread|'s lock 2) Timer thread: TimerThread::Run calls NS_ADDREF(timer) 3) nsTimerImpl::Release sets mRefCnt=1, negating the AddRef done in (2). I'm having trouble seeing how it's possible to reenter the destructor without this sequence of events occuring, due to the refcount stabilization done by nsTimerImpl::Release. Does this make sense? (I've managed to confuse myself quite a bit thinking about it, so it may not.) If so, might changing NS_IMPL_THREADSAFE_RELEASE to look like the following, along with moving the |gThread->RemoveTimer(this);| to the beginning of nsTimerImpl's destructor, have fixed this bug? #define NS_IMPL_THREADSAFE_RELEASE(_class) \ NS_IMETHODIMP_(nsrefcnt) _class::Release(void) \ { \ nsrefcnt count; \ NS_PRECONDITION(0 != mRefCnt, "dup release"); \ if (1 == count) { \ /* Leave reference count at 1 to prevent reentering destructor, but */ \ /* avoid explicitly setting it in case another thread has a raw pointer.*/\ /* enable this to find non-threadsafe destructors: */ \ /* NS_ASSERT_OWNINGTHREAD(_class); */ \ NS_DELETEXPCOM(this); \ NS_LOG_RELEASE(this, 0, #_class); \ return 0; \ } \ count = PR_AtomicDecrement((PRInt32 *)&mRefCnt); \ NS_LOG_RELEASE(this, count, #_class); \ return count; \ }
That should have said: if (1 == mRefCnt) { \
Keywords: mozilla0.9.9
Target Milestone: mozilla1.0 → ---
I think bug 220959 might be related to this. The top of the stack in that case is nsTimerImpl::Fire(), though.
Component: XP Miscellany → XPCOM
Assignee: pavlov → nobody
QA Contact: brendan → xpcom
Brendan, dbaron, please file new followup bugs. "Leaving open for followup" never ends well.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Crash Signature: [@ nsTimerImpl::~nsTimerImpl]
Keywords: qawanted
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: