<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Comment 1

•

23 years ago

can someone get me a reproducable testcase?

Status: NEW → ASSIGNED

Updated

•

23 years ago

Target Milestone: --- → Future

Comment 2

•

23 years ago

Attached file Stacks from Purify — Details

I've attached the stacks from Purify. Hopefully that might shed some light. Unfortunately I haven't been able to come up with steps to reproduce. Seems rather sporadic.

David :Bienvenu

Comment 3

•

23 years ago

are either of you running (mscott or dbradley) on dual processor machines? There's a rumour that this kind of problem is easier to reproduce on the dual cpu machines.

Comment 4

•

23 years ago

No, single Intel CPU. It's not readily reproduceable but I find it happens more often running under Purify, so I supsect maybe a race condition or some such that the slowness of Purify exposes more.

Scott MacGregor

Reporter

Comment 5

•

23 years ago

I'm not on a dual processor either.

Brandon Ehle

Comment 6

•

23 years ago

Bug 121845 might be the same problem. Try my reproducible case from there.

Rich Williams

Comment 7

•

23 years ago

I'm seeing crashes like this just about every time I do an e-mail search (like bug 127621). This is with a fresh cvs build on Linux SMP. When I reboot to a non-SMP kernel, the problem goes away. I _think_ I've found a race condition in the timer code which causes this crash, but I'm not 100% sure, so go easy on me if I'm way off track :-) Object A has a timer (nsITimer/nsTimerImpl), and it's active (so it's in the TimerThread.mTimers nsVoidArray). It's got the only reference, so the refcount is 1. Object A gets destroyed so the refcount drops to zero, and the destructor for nsTimerImpl is called. This calls gThread->RemoveTimer(this) to get itself out of mTimers - to do this, it needs the TimerThread lock and so it blocks. Meanwhile, over in TimerThread::Run, it's got the lock and calls RemoveTimerInternal, and then NS_ADDREF(timer) on the 'doomed' timer. It then releases the lock, so the destructor in Object A continues and the nsTimerImpl is deleted. A few lines later in the TimerThread, the event is fired (which might work if we are lucky) and shortly after that NS_RELEASEd again (in destroyMyEvent) - the destructor gets called again. Chaos follows. I came to this conclusion by adding an extra 'magic' member to nsTimerImpl and setting it to 12345678 on construct, and to 87654321 as the first line of the destructor, sure enough I caught double destructs. I then added something to TimerThread::Run to check the magic of the timer it's about to fire and found that the timer had been destroyed (or rather was in the process of being destroyed) Well I hope that helps, I haven't got a fix for it just yet.

David :Bienvenu

Comment 8

•

23 years ago

that looks very plausible to me.

(not reading, please use seth@sspitzer.org instead)

Comment 9

•

23 years ago

Targeting. Thanks very much for analyzing this, my face is red for not catching it in review. /be

Keywords: mozilla1.0+

Target Milestone: Future → mozilla1.0

Julien Pierre

Updated

•

23 years ago

Blocks: 127621

Comment 10

•

23 years ago

kerz and putterman see the crash on email quick search all the time. we're debugging now. we'll start with rich's suggestions.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Updated

•

23 years ago

Summary: crash at [@ nsTimerImpl::~nsTimerImp] → crash at [@ nsTimerImpl::~nsTimerImpl]

(not reading, please use seth@sspitzer.org instead)

Comment 11

•

23 years ago

sspitzer, I've been making up for bad super-review by hacking a fix. Almost ready to attach a patch -- don't want to duplicate your work. /be

Comment 12

•

23 years ago

no, I haven't done more that reproduce this with kerz on his machine, and add bug comments. when you have a patch, I'll try it out.

Comment 13

•

23 years ago

Attached patch proposed fix (obsolete) — Details — Splinter Review

This was hard. Someone remind me again why we ditched native widget timers for XP timers built on a thread? I haven't tested yet, so I'm braced for bad news. I've been running the race condition outcomes through my brain, and I need to lie down now. I did rename nsITimerScriptable to nsIScriptableTimer, as threatened elsewhere. I also used the preferred spelling of canceled (stress goes on the syllable before the doubled consonant, e.g. "pre_ferr_ed", as opposed to "_can_celed"). Picky, picky! /be

Comment 14

•

23 years ago

Let's avoid adding noise to 0.9.9 talkback, and adding people to the ranks of unhappy 0.9.9 users. Looking for testing, r= and sr=. /be

Blocks: 122050

Keywords: mozilla0.9.9

Target Milestone: mozilla1.0 → mozilla0.9.9

Comment 15

•

23 years ago

Just as I was going to sleep, I realized I forgot to transpose an addref and a (subroutine-obscured) release. Fixing platform and os fields, reassigning. /be

Assignee: pavlov → brendan

Status: ASSIGNED → NEW

OS: Windows 2000 → All

Hardware: PC → All

Updated

•

23 years ago

Status: NEW → ASSIGNED

Comment 16

•

23 years ago

Attached patch transpose AddRef and subroutine-obscured Release; better comments (obsolete) — Details — Splinter Review

Dammit, now I'm wide awake! My build is still going, I have not yet tested. /be

Attachment #72930 - Attachment is obsolete: true

Rich Williams

Comment 17

•

23 years ago

Well that second patch works for me, I'd previously be able to crash every single time. I had been running a temporary hack which stablised things a little (just a if (!timer->mRefCnt) continue) in TimerThread::Run) which seemed to confirm that it was a race condition type thing. Thanks

Comment 18

•

23 years ago

Attached patch best comments yet (no non-comment changes from last patch) (obsolete) — Details — Splinter Review

Ok, time for bed or I'll turn into one of those X-Files cases who grow telepathic powers from sleep deprivation! Where's Agent Scully? /be

Comment 19

•

23 years ago

If the timer has been cancelled should it still fire? If nsTimerImpl::Release was in RemoveTimer (Meaning it is now marked as cancelled) and TimerThread::Run is coming out of RemoveTimerInternal then the timer has been cancelled, but Fire is still called. Also RemoveTimerInternal doesn?t appear to check to see if the timer has been "disarmed" before removing. I think this state could occur if nsTimerImpl::Release is blocked in RemoveTimer, and Run has called RemoveTimerInternal. The thread unblocks in Release's call to RemoveTimer, then calls RemoveTimerInternal. I don't think this failure causes any major problems, other than causing Release to return 1 rather than 0. Also why the assigns of nsnull to "timer" in TimerThread::Run? What bugs me about this, is the fact we have two classes that have state that is touched by multiple threads, when I think we could have just one class with state that needs to be thread safe. Without some analysis its hard to say if that would reduce or increase the complexity.

Comment 20

•

23 years ago

dbradley wrote: >If the timer has been cancelled should it still fire? If nsTimerImpl::Release >was in RemoveTimer (Meaning it is now marked as cancelled) and >TimerThread::Run is coming out of RemoveTimerInternal then the timer has been >cancelled, but Fire is still called. Run could check timer->mCanceled, but it didn't before -- it Fired the timer, sending a PLEvent across a thread boundary, and let nsTimerImpl::Process, all the way on the other side of that boundary, notice mCanceled. I just work here trying to fix last-minute 0.9.9 crash/topcrash bugs :-). I didn't design this pav-code, and I didn't review it well enough. But in this case, I think pavlov may have been right to avoid adding more mCanceled checks, saving code and complexity. It costs a little in the cancel-races-with-run case, but who cares? No harm comes, right? And we save a little (tiny) code footprint. >Also RemoveTimerInternal doesn?t appear to check to see if the timer has been >"disarmed" before removing. I think this state could occur if >nsTimerImpl::Release is blocked in RemoveTimer, and Run has called >RemoveTimerInternal. RemoveTimerInternal calls nsVoidArray::RemoveElement and lets it search. If the timer was disarmed while this call (from nsTimerImpl::Release, as you say) had to wait on TimerThread::mLock until Run had already removed the timer, we just spend O(n) IndexOf search time, find nothing, and return false from RTI. You are right that an mArmed test would avoid this cost, but it's a hard case that I believe is rare (though it happens, witness this bug!). Is it worth the extra code to test mArmed, to optimize away the IndexOf done by RemoveElement? >The thread unblocks in Release's call to RemoveTimer, >then calls RemoveTimerInternal. I don't think this failure causes any major >problems, other than causing Release to return 1 rather than 0. That I took pains to do, not that anyone uses NS_RELEASE2. It just scared me that someone might do such a thing some day and draw the wrong conclusion. I think I got the logic right -- were you calling out a flaw, or just noticing how it all works? >Also why the assigns of nsnull to "timer" in TimerThread::Run? You mean in the line after timer->Fire() is called? Because as the comment just above says, the Release done by destroyMyEvent on the timer's thread, not on the TimerThread, matches the AddRef done in Run. Symmetry breaks, the timer's lifetime is handed off across the PLEvent-spanning thread boundary, and its dtor runs on the thread targeted by Fire, the one where Process (these names aren't great -- pav?) actually calls back into the timer's callback function, or notify method, or observer. >What bugs me about this, is the fact we have two classes that have state that >is touched by multiple threads, when I think we could have just one class with >state that needs to be thread safe. Without some analysis its hard to say if >that would reduce or increase the complexity. I think I know what you mean. Timers should be simply messages between threads, but we actually have both some fields in nsTimerImpl, and the mTimers array in TimerThread, shared among threads. But hey, I'm just trying to fix this for 0.9.9. A better design for the long run would be great, I just have not had time to think on it much. /be

Comment 21

•

23 years ago

Attached patch more cleanup, coments and spacing except for the switch (mCallbackType) to dispatch in nsTimerImpl::Process (obsolete) — Details — Splinter Review

Ok, testing seems good -- how about r= and sr= for 0.9.9? It's wrapping up fast, and I think we need this fix. /be

Attachment #72940 - Attachment is obsolete: true

(not reading, please use seth@sspitzer.org instead)

Comment 22

•

23 years ago

Comment on attachment 73023 [details] [diff] [review] more cleanup, coments and spacing except for the switch (mCallbackType) to dispatch in nsTimerImpl::Process I didn't dig deep enough to find that nsTimerImpl::Process checked for mCancelled. I didn't read your comment in nsTimerImpl::Release thoroughly, that clears up the concern I had with that. The timer=nsnull has nothing to do with your patch I was just curious if I was missing something: >You mean in the line after timer->Fire() is called? Because as the comment just >above says, the Release done by destroyMyEvent on the timer's thread, not on the >TimerThread, matches the AddRef done in Run. That one and the original initialization. "timer" is just an interface pointer and not an nsCOMPtr pointer, so assigning null here, and then later has no effect on reference counting, correct? And the timer variable is essentially always set to a non-null value before it's use. So FWIW I think the patch is good. I stressed it a bit by stalling the timer thread and traversing some of the lesser frequented paths. r=dbradley

Attachment #73023 - Flags: review+

Updated

•

23 years ago

Blocks: 118648

Comment 23

•

23 years ago

>That one and the original initialization. "timer" is just an interface pointer >and not an nsCOMPtr pointer, so assigning null here, and then later has no >effect on reference counting, correct? And the timer variable is essentially >always set to a non-null value before it's use. Yes, and 'tho I avoid unnecessary initialization in my own code, this code was shared, came with some tricky control flow that I made more tricky, and so it seemed worth the few cycles to initialize timer to null at its declaration, and to clear it once the thing it pointed at had been handed off via Fire. Thanks for the r= -- who will sr=? /be

(not reading, please use seth@sspitzer.org instead)

Comment 24

•

23 years ago

So the decoupled design I think dbradley and I were imagining would not share memory the way the current code does. Instead, the TimerThread would be asked to create a peer data structure in its scheduled list, and hand back a serial number identifying it. All set-type, set-delay, and cancel operations would go through the same inter-thread channel as this create-timer message that returns a serial number response, and the subsequent messages would name the timer using the serial number. Then client object (nsITimer-implementing XPCOM object) life cycle would not be coupled by strong or weak refs to the TimerThread. The only shared memory is in the inter-thread message channel. That's a design we could get into after mozilla1.0, if this code proves to have more bugs, or is just too unwieldy to maintain. I thought I'd babble about it here for posterity. /be

Jason Kersey

Comment 25

•

23 years ago

I tried quite a bit and couldn't get it to crash in quicksearch with this now, so far so good. I'm going to get putterman a build for him to try, as he was crashing on every quicksearch on his machine.

John Bandhauer

Comment 26

•

23 years ago

Just starting to look at this. You forgot to change mozilla/xpcom/threads/MANIFEST_IDL.

Updated

•

23 years ago

No longer blocks: 118648, 127621

(not reading, please use seth@sspitzer.org instead)

Comment 27

•

23 years ago

*** Bug 127621 has been marked as a duplicate of this bug. ***

(not reading, please use seth@sspitzer.org instead)

Comment 28

•

23 years ago

*** Bug 118648 has been marked as a duplicate of this bug. ***

Comment 29

•

23 years ago

Fixed MANIFEST_IDL, not bothering with a new patch just for that -- thanks! /be

Comment 30

•

23 years ago

Comment on attachment 73023 [details] [diff] [review] more cleanup, coments and spacing except for the switch (mCallbackType) to dispatch in nsTimerImpl::Process r-=pavlov With this patch, you will always leak mCallback.i/mCallback.o when Cancel is called since you have added: + mCallbackType = CALLBACK_TYPE_UNKNOWN; nsTimerImpl::~nsTimerImpl won't be able to determine what kind of callback type it is to determine how to NS_RELEASE it. This release must happen on the thread that created the timer as well, so it needs to happen in the destructor. I don't see any reason for adding the mCallbackType = CALLBACK_TYPE_UNKNOWN to nsTimerImpl::Cancel() except for in Process() you have the default: case for the callback type to NS_ERROR.. I would suggest checking mCanceled before the switch() and removing the default handler.

Attachment #73023 - Flags: review+ → needs-work+

Comment 31

•

23 years ago

Thanks pav -- late breaking brain damage on my part, following your original damage :-). There is already an mCanceled check at the top of Process, so there is no need to add one a little later before the switch, where your version had a test for a non-null mCallback.c (I should have not worried about conserving that redundant test, but I was hacking without enough sleep). I'll fix that, but I will have to leave an empty default to avoid a GCC warning. New patch in a sec, /be

Comment 32

•

23 years ago

Attached patch Fix that was checked into 0.9.9 branch and trunk (obsolete) — Details — Splinter Review

Attachment #73023 - Attachment is obsolete: true

Mike Shaver (:shaver emeritus)

Comment 33

•

23 years ago

Comment on attachment 73162 [details] [diff] [review] Fix that was checked into 0.9.9 branch and trunk the rest of this looks fine. r=pavlov .. if someone can run purify with this patch, I think that would be a good idea incase I missed anything though.

Attachment #73162 - Flags: review+

Comment 34

•

23 years ago

Comment on attachment 73162 [details] [diff] [review] Fix that was checked into 0.9.9 branch and trunk sr=shaver. Do we care about components that are not in the tree, but may use scriptable timers? I guess not!

Attachment #73162 - Flags: superreview+

John Bandhauer

Comment 35

•

23 years ago

Comment on attachment 73162 [details] [diff] [review] Fix that was checked into 0.9.9 branch and trunk sr=jband I'm good with this. Besides the issue of *when* we schedule the public flogging of Stuart for his horrid design, I see a couple of things worth mentioning... - What is the pattern of TimerThread shutdown and destruction? If the timer array cleanup in TimerThread::~TimerThread matters at all then I'd think it should look like the cleanup in TimerThread::Shutdown in order to not do evil. - I'm fairly convinced that the posible nsTimerImpl::Release call patterns won't burn us. The code... > if (count == 1 && mArmed) Is a little funky because it tests the state of the count as it *was* against the state of mArmed as it *is*. If the multithreaded calling patterns were more complex (at a time when the recount could go to 1), then you'd have to lock out refcount changers inside the 'if (count ==1)' block and check mRefCnt to see if was *still* 1 && mArmed. But, it looks like the rules here preclude the pattern where the refcount would be going down to one while *another thread* might be arming the timer. I think that could only happen if multiple outside users of the timer were twiddling it on different threads. Anyway, I think your patch makes things better. I have not tested it.

Asa Dotzler [:asa]

Comment 36

•

23 years ago

Comment on attachment 73162 [details] [diff] [review] Fix that was checked into 0.9.9 branch and trunk a=asa (on behalf of drivers) for checkin to 0.9.9

Attachment #73162 - Flags: approval+

Comment 37

•

23 years ago

nsITimerScriptable went in just three weeks ago. I doubt anyone outside the tree is using it, but anyway, what does that shaver guy say about unfrozen interface users? :-) jband: right, as the comment in Release says, if the count goes down to 1 and mArmed is apparently set, there can't be any "timer resuscitation" by some other thread than the TimerThread. No one is able to pull a timer strongly referenced via mTimers[i] back from the brink of death. /be

Comment 38

•

23 years ago

Pav, can you respond to jband's question: >- What is the pattern of TimerThread shutdown and destruction? If the >timer array cleanup in TimerThread::~TimerThread matters at all then >I'd think it should look like the cleanup in TimerThread::Shutdown in >order to not do evil. I'm weary and I have other bugs to tend. /be

Comment 39

•

23 years ago

Fixed on branch and trunk. I'm reassigning this to pavlov for followup to jband's question, and to file a new bug whose number is recorded here before this one gets closed. The new bug would want the design in comment #24, with one change: The create-timer message need not be a synchronous request/response message exchange with the TimerThread, where the creating thread waits to learn the serial number allocated by the TimerThread. Instead, the creating thread can atomically increment a serial number generator and stipulate that as the id it will use, passing it asynchronously to the TimerThread. All messages in the inter-thread channel should be asynchronous (Cancel has never failed with a non-successful nsresult when the timer was not found, so there' no need for a handshake there). Removing from the 122050 tracking bug. /be

Assignee: brendan → pavlov

No longer blocks: 122050

Status: ASSIGNED → NEW

Comment 40

•

23 years ago

TimerThread's Shutdown and destruction both come, one after the other, from nsTimerImpl::Shutdown(), which is called during XPCOM's shutdown. In the destructor, there should never be any timers alive, so we should probably just ASSERT if there are and remove the code that walks the list.

David :Bienvenu

Comment 41

•

23 years ago

*** Bug 121845 has been marked as a duplicate of this bug. ***

Asa Dotzler [:asa]

Comment 42

•

23 years ago

can we obsolete the approved patch from 03/07/02 so it doesn't look like an unlanded approved change? thanks.

timeless

Updated

•

23 years ago

Attachment #73162 - Attachment description: proposed fix for 0.9.9 → Fix that was checked into 0.9.9 branch and trunk

Attachment #73162 - Attachment is obsolete: true

Updated

•

23 years ago

Target Milestone: mozilla0.9.9 → mozilla1.0

Comment 43

•

23 years ago

i see that originally this bug affected all platforms. i double checked talkback data and this sig only happens on win machines now. i'm trying to repro this using recent user comments but, alas, no such luck. i will also attach two more call stacks with the same stack signature. today's talkbalk call stack: Count Offset Real Signature [ 21 nsTimerImpl::~nsTimerImpl 7d102c41 - nsTimerImpl::~nsTimerImpl ] Crash date range: 2002-05-22 to 2002-05-30 Min/Max Seconds since last crash: 161 - 46581 Min/Max Runtime: 161 - 1044390 Keyword List : Count Platform List 8 Windows 98 4.90 build 73010104 (winME) 6 Windows NT 5.0 build 2195 (win2000) 4 Windows 98 4.10 build 67766446 (win98SE) 2 Windows NT 5.1 build 2600 (winXP) 1 Windows NT 4.0 build 1381 Count Build Id List 21 2002051220 No of Unique Users 19 Stack trace(Frame) nsTimerImpl::~nsTimerImpl [d:\builds\seamonkey\mozilla\xpcom\threads\nsTimerImpl.cpp line 183] nsTimerImpl::`scalar deleting destructor' nsTimerImpl::Release [d:\builds\seamonkey\mozilla\xpcom\threads\nsTimerImpl.cpp line 96] nsSupportsArray::Clear [d:\builds\seamonkey\mozilla\xpcom\ds\nsSupportsArray.cpp line 601] nsSupportsArray::DeleteArray [d:\builds\seamonkey\mozilla\xpcom\ds\nsSupportsArray.cpp line 304] nsSupportsArray::`vector deleting destructor' 0x80000000 (6817403) Comments: I was trying to close the broswer and it jammed up. But BTW the 7.0 beta kicks ass. The mutiple email accounts feature is excellent. Just figure out how to import a complete profile (just everything from the mail profile......I have three internet (6817403) Comments: businessesand one personal email account....4 different profiles in Netscape 4.78). If you can figure out how to import the address book with allthe email folders then you have the best browser on the market.(Well I always thought you had the (6817403) Comments: best browser anyway).Thanks for making a great product.-Mick Mitchell (6791175) Comments: again this is the second time i have gone to close the Aim messenger and recieved an error message (6773959) Comments: closing the aol instant messenger menu.(i had shut down my internet before i signed off the messenger) then closed the window and then recieved the error mesage (6732792) Comments: Opening Application (6727241) Comments: just stand by suddenly this window comes. (6713431) Comments: attempting to save a draft - to send later. I was offline composing a message. When I finished the program crashed - "performed illegal operation" (6691257) Comments: I was editing my preferences and when i pressed the ok the program didnt respond (6665453) Comments: I picked up the pic then freezed. (6570254) Comments: had opened the entire program chose to close AIM tab when program bombed...something about xp---.dll error (6562045) Comments: Using a Java applet

Keywords: topcrash

OS: All → Windows ME

Hardware: All → PC

Summary: crash at [@ nsTimerImpl::~nsTimerImpl] → N70PR1 crash [@ nsTimerImpl::~nsTimerImpl]

Comment 44

•

23 years ago

Attached file two more call stacks from talkback — Details

Updated

•

23 years ago

Keywords: qawanted

Comment 45

•

23 years ago

i e-mailed one of the users that sent feedback via talkback. his browser crashed when he tried to open an attachment. below is his e-mail response: it happens every time i try to open an attachment. before i was just reading my mail. according to my computer i do not run out of memory [ram]. if i try to do anything else when i open an attachment nothing happens. if i click on the options at the top [file, edit, etc.] a drop down window will open but there is nothing in it. i have amd 1.2 processor, 512 meg ddr ram hercules 3d prophet video card with 64 meg ram windows xp