Closed Bug 134070 Opened 23 years ago Closed 23 years ago

Crash dismissing dialogs [@ _pl_GetEventCount]

Categories

(Core :: XPCOM, defect)

x86
Linux
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla1.0

People

(Reporter: bryner, Assigned: bryner)

References

Details

(Keywords: crash, topcrash, Whiteboard: [ADT])

Crash Data

Attachments

(1 file, 3 obsolete files)

On my debug build on Linux (current trunk), I'm seeing certain dialogs (such as the mail password dialog or the "Incorrect password" dialog) cause a crash as the dialog is dismissed. This doesn't happen every time, but does happen 10% of the time for me. The following message is printed: pthread_mutex_lock failed: rv = 22 Assertion failure: 0 == rv, at /builds/test/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:186 /builds/test/mozilla/obj-dbg/dist/bin/run-mozilla.sh: line 72: 12678 Aborted Further investigation shows that pthread_mutex_lock returns EINVAL, because the lock itself is no longer valid, because the PLEventQueue has already been destroyed. However, it is being told to process events from event_processor_callback in widget/src/gtk/nsAppShell.cpp.
I believe this crash will occur on debug builds due to the definition of PR_ASSERT.
Keywords: crash
Keywords: nsbeta1
I see this on Mandrake 8.1 with the commercial 2002-03-29-11 build: _pl_GetEventCount() PL_ProcessPendingEvents() nsEventQueueImpl::ProcessPendingEvents() event_processor_callback() our_gdk_io_invoke() libglib-1.2.so.0 + 0xfbd4 (0x40393bd4) libglib-1.2.so.0 + 0x11390 (0x40395390) libglib-1.2.so.0 + 0x1196f (0x4039596f) libglib-1.2.so.0 + 0x11b2b (0x40395b2b) libgtk-1.2.so.0 + 0x986e3 (0x402a56e3) nsAppShell::Run() nsAppShellService::Run() netscape-bin + 0x84e9 (0x080504e9) netscape-bin + 0x8cc7 (0x08050cc7) libc.so.6 + 0x1c5b0 (0x404e95b0)
Keywords: nsbeta1
Attached patch patch (obsolete) — Splinter Review
Here's a patch which should fix the problem (or at least the manifestation of it that I was seeing). stephend, can you verify if this fixes your problem as well? We also need to check whether this causes any slowdown of startup/new window/page load.
Summary: Crash dismissing dialogs → Crash dismissing dialogs [@ _pl_GetEventCount()]
Attached patch patch #2 (obsolete) — Splinter Review
Actually, use this one instead.
Attachment #76942 - Attachment is obsolete: true
Brian, How is the nsEventQueue object being accessed from a thread other than the thread that owns the queue? In other words, how would PL_IsQueueOnCurrentThread be false if called from CheckForDeactivation()?
Doug, it's being accessed from the timer thread. This creates a problem because we then try to unhook glib's IO channel for the queue on the timer thread, and if glib is in the middle of dispatching an event for that queue (on the main thread), we end up accessing a deleted PLEventQueue.
Can the check for thread access be higher than CheckForDeactivation()?
Keywords: nsbeta1nsbeta1+
Comment on attachment 76972 [details] [diff] [review] patch #2 r=dougt
Attachment #76972 - Flags: review+
taking
Assignee: dougt → bryner
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.0
Keywords: adt1.0.0
*** Bug 134361 has been marked as a duplicate of this bug. ***
Is Bug 132597 a dup of this? The stack signature is the same...but that bug seems to be about logging into mail. There might be a relation, but I'm not sure. Can't hurt to take a look though. Adding topcrash since this is currently the #2 topcrash on the Trunk.
Keywords: topcrash
Summary: Crash dismissing dialogs [@ _pl_GetEventCount()] → Crash dismissing dialogs [@ _pl_GetEventCount]
Adding saari, gordon and joki to make sure this patch doesn't introduce a leak.
darin, can you think of any way that your checkin might have caused this?
Comment on attachment 76972 [details] [diff] [review] patch #2 a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76972 - Flags: approval+
Jay, it's the same. Logging into mail requires typing a password (unless, of course, you have it saved).
*** Bug 132597 has been marked as a duplicate of this bug. ***
Darin, I'm not able to reproduce this with your patch for bug 128508 (nsEventQueueService.cpp, rev 1.34) backed out.
bryner: ok, good... i think. now, to figure out why that is causing this problem.
ok, nsEventQueue::CheckForDeactivation is only called from the following methods: StopAcceptingEvents ProcessPendingEvents GetEvent WaitForEvent GetYoungestActive the only method being called from nsTimerImpl is GetYoungestActive. it is being called indirectly via nsEventQueueService::GetThreadEventQueue (because of my patch). it looks like GetYoungestActive was not designed to be called from a background thread. (i'm basing that on the fact that CheckForDeactivation probably wasn't meant to be called from a background thread.) bryner's patch probably nails this problem shut. another solution would be to not call CheckForDeactivation from GetYoungestActive. a quick LXR search reveals that GetThreadEventQueue is the only method that'll lead to GetYoungestActive being called, and before my patch it was only calling GetYoungest, which didn't call CheckForDeactivation. in other words, prior to my patch, nsEventQueue::GetYoungestActive was never being executed. so, why does it need to call CheckForDeactivation? i am slightly worried that we aren't protected against SetYounger(nsnull) being called while inside GetYoungestActive. is there anything to guarantee that that won't happen? the same potential problem existed when we were calling GetYoungest, so maybe it is protected against somehow?!
GetYoungestActive must be able to be called from *any* thread. This is so that thread B can find out what event queue on thread A should be used for posting. Why CheckForDeactivation in GetYoungestActive? It would seam that this check should be only called after events are processed (in ProcessPendingEvents, GetEvent). The problem is that these calls know nothing about nested events queues.
via IRC, timeless mentioned seeing the following crash: <timeless> fwiw, i've experience a crash near getyoungestactive, it's really odd because msvs claims that if(foo[==0]) foo->get... (...) // <- is called even though foo is listed as null <darin> ah.. so you are seeing the race condition i was afraid might exist where |foo| is mYoungerQueue (i think).
dougt: i agree.. GetYoungestActive must be callable from any thread, and it sounds like you agree that GetYoungestActive shouldn't be calling CheckForDeactivation?! is there any reason not to remove it?
yes. risk of a leak. I feel much more comfortable if we just go with patch #2 this late in the game. For the trunk, we can rip it out and have more bake time.
uhm... but before my recent checkin, we were calling GetYoungest instead of GetYoungestActive. IOW, we weren't calling CheckForDeactivation. so, why call it now? where's the risk?
ignore me. what darin said. we should be okay to remove CheckForDeactivation from GetYoungestActive. I would like danm to review this change also. Dan, are we hosing ourselves by removing this check from GetYoungestActive?
Removing adt1.0.0. Pls renominate when you have all reviews.
Keywords: adt1.0.0
Whiteboard: [ADT]
My take on why Darin's patch for bug 99026 exposed this bug: my original impression scores one, even though I couldn't defend it. I think that during a single pass through ProcessPendingEvents, another event gets pushed onto the youngest queue and then the queue is inactivated. With the current version of GetYoungestEventQueue, a subsequent pass through PPE won't pick up that event. I remember being able to shoot down every such possible scenario when I reviewed that patch, but I think we must have missed this (simple) one. So I'm still concerned that an event is going unprocessed. However, I agree with y'all; I think we won't introduce any leaks by removing CheckForDeactivation altogether from GetYoungestActive, and that's the best thing to do here. Meanwhile it probably wouldn't hurt to also add Brian's PL_IsQueueOnCurrentThread patch. (I suppose Brian gets to write the combined patch and run it through process now?)
*** Bug 134702 has been marked as a duplicate of this bug. ***
>Meanwhile it probably wouldn't hurt to also add Brian's PL_IsQueueOnCurrentThread patch. (I suppose Brian gets to write the combined patch and run it through process now?) Be sure to assert in debug code in this case!
*** Bug 135296 has been marked as a duplicate of this bug. ***
*** Bug 135224 has been marked as a duplicate of this bug. ***
Attached patch new patch (obsolete) — Splinter Review
I think this does everything dougt is suggesting. Can I get r=/sr=?
Attachment #76972 - Attachment is obsolete: true
Comment on attachment 77631 [details] [diff] [review] new patch i thought dougt was saying that an assertion should be enough, but perhaps it's best to beef this up anyways. how about using NS_ERROR instead of NS_ASSERTION? either way, sr=darin
Attachment #77631 - Flags: superreview+
Comment on attachment 77631 [details] [diff] [review] new patch I would have made it so that on debug build PL_IsQueueOnCurrentThread isn't called twice. In either case r=dougt
Attachment #77631 - Flags: review+
Attached patch final patchSplinter Review
Implemented everyone's suggestions.
Attachment #77631 - Attachment is obsolete: true
Comment on attachment 77728 [details] [diff] [review] final patch carrying over r=dougt, sr=darin.
Attachment #77728 - Flags: superreview+
Attachment #77728 - Flags: review+
Comment on attachment 77728 [details] [diff] [review] final patch a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #77728 - Flags: approval+
adding adt1.0.0+ on behalf of the adt.
Keywords: adt1.0.0+
I'm seeing this with increased frequency using apr5 commercial trunk build, linux rh6.2 -- happens every single time I login (using correct password).
Fix is checked in.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
*** Bug 136939 has been marked as a duplicate of this bug. ***
If it's checked in, it's back. as of the 2002041003 trunk builds.
Do you have a talkback ID of a crash dismissing a dialog from that build?
I don't see any crashes with this stack signature in talkback data for 4/10 builds...so if Charles did crash dismissing dialogs it must have a different stack signature. I searched by Charles email address and none of his crashes seem to be related to this bug. Charles: If you are able to reproduce this and get a Talkback incident, please let us know. Your crash will most likely need to be logged as a new bug. Verifying fixed based on Talkback data.
Status: RESOLVED → VERIFIED
Incident ID 5090976 Build 2002041003 What: Open mail client from browser, enter password, hit enter key, and crash. Happened using three different user profiles/mail accounts. Email me if you need specific configuration info (server/account/etc)
Charles crash looks different: Incident ID 5090976 Stack Signature PR_EnterMonitor aef27f05 Trigger Time 2002-04-11 11:29:42 Email Address carosendahl@netscape.com URL visited Build ID 2002041011 Product ID MozillaTrunk Platform Operating System Win32 Module Trigger Reason Access violation User Comments Tried to access mail client Stack Trace PR_EnterMonitor [../../../../pr/src/threads/prmon.c, line 93] PL_ProcessPendingEvents [d:\builds\seamonkey\mozilla\xpcom\threads\plevent.c, line 505] USER32.DLL + 0x1b60 (0x77e11b60) 0x5e204689 From the stack, this crash points to bug 119693. Leaving this one verified fixed.
Crash Signature: [@ _pl_GetEventCount]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: