Closed Bug 56337 Opened 24 years ago Closed 24 years ago

Dialogs posed from Web pages crash entire app

Categories

(Core :: XUL, defect, P1)

x86
Windows 2000
defect

Tracking

()

VERIFIED FIXED
mozilla0.8

People

(Reporter: hyatt, Assigned: danm.moz)

References

()

Details

(Keywords: top100, Whiteboard: [rtm-])

Attachments

(3 files)

1. Go to the above URL.
2. Switch skins.
3. Scroll to the bottom of the page and try to add a sidebar panel.

Result: You end up in a state where the window is unresponsive and you have to
kill the app.
Expected Result: The add sidebar dialog should come up.
Platform: Win32
Uh-oh.  I was able to produce this without skin-switching.

Steps.
1. Launch app
2. Go to the above URL.
3. Try to add a sidebar panel

Whole app locked up, forced to quit.

Nominating rtm and raising priority to P1.
Assignee: hyatt → danm
Summary: Skin-switching causes dialogs posed from Web pages to crash entire app → Dialogs posed from Web pages crash entire app
Keywords: rtm
Data point: every Windows user I have seen try to install a new sidebar panel
hits this.  That's 5 out of 5 so far.  Most (3) were unable to ever install.
Saw one instance of this on Mac.  In all cases where I knew the age of the build
it was recent.  Regression?
Seeing this on Linux too, recent trunk build. cpu just spikes to 100%, nothing
happens.
I see this on linux too, don't see the cpu spiking but it takes forever for the
dialog to pop up when I click on Scotts links that add the sidebar panels.
rtm+ need info
Severity: normal → critical
Priority: P3 → P1
Whiteboard: [rtm+ need info]
Target Milestone: --- → M18
went to above URL, added sidebar panel without problems under Mac OS 9.0.4, Mac
Mozilla installer build 2000101808.
Depends on: 55460
Yes, this bug seems to happen only on some machines. Mine generally shows the 
problem. I don't have a grip on this one yet. I'd like to. (Note this is related 
to bug 55460, which is pretty much the same problem though described differently. 
I'm marking them dependent to keep track.)
my.aol.com can also do this to a user.
Keywords: top100
Bug 55460 seems to be trunk-only. Anyone seeing this bug on a branch build?
Changing this from [rtm+ need info] to [rtm need info], because the [rtm+ need
info] state doesn't exist.
Whiteboard: [rtm+ need info] → [rtm need info]
With the 11/01 post-"limbo1.0" builds for the branch, mac/linux/win32, I can
install the Bugzilla panel with no problem (with the dialog or the 
installation). But as danm notes: "this bug seems to happen only on some 
machines".

(I'm not sure what pinkerton saw on my.aol.com that hung him, but when I go 
there, in an unsigned-in state, the dialogs that come up do not hang the 
limbo builds either (they are simply alert and window.open dialogs).

NOTE: only windows seamonkey will work at my.aol.com; it locks out Mac and 
Linux seamonkey (very nice!) -- filed http://bugscape/show_bug.cgi?id=3171
to get some AOL family love.
I get the error www.scottcollins.et could not be found.  Is the site down or gone?

Pink, what pre-conditions are required to see this on my.aol.com?  Need to
register and log in first?  Any other steps?

If this can be repro on my.aol.com, would that be enough to make this stop-ship?
which redirects you to the all-lower-case version. URL summary updated. The 
salient part is the bit of JS in that page that calls sidebar.addPanel.
Do we have any indication this is happening to any users of the branch candidate
builds?  Is it showing up on Talkback at all?
Whiteboard: [rtm need info] → [rtm+ need info]
Not in talkback, as it's a hang not a crash. 

I just installed 11-01-09-MN6 branch build on win98/winME/winNT/mac8.6/rh6.2
and installed scc's bugzilla panel on all five, with no hang or other problem.
Anyone see this on a branch build?
PDT marking [rtm-] since the bug doesn't seem to hit everyone. For those who do
see it, are the panels in your sidebar after coming back from a hang/force quit?
Whiteboard: [rtm+ need info] → [rtm-]
  Sidebar panels aren't affected. The hang has something to do with the modal "do 
you really want to add this sidebar panel?" dialog. Since the app dies at that 
point, it hasn't really done anything and the sidebar panels are unaffected.
  It's still not nice. And I'm pretty sure it's just a symptom of a more general 
problem.
Whiteboard: [rtm-] → [rtm-] [bug is kicking my butt]
The window for the messagebox/alert is never being created at all.

It's as if the event that should be handled to process the creation of the 
window is not getting handlde.

So the parent gets disabled and then nothing happens.

The best way to debug this is NS6 branch vs. trunk. I'll keep hammering away at 
it.
  Michael: the behaviour is machine dependent. Some machines -- one of mine, for 
instance -- never show the alert window at all. Some machines have no problem. 
Some machine probably have that other bug, where the alert window isn't on top 
and the app hangs until you figure out the secret. I think the machines that 
don't show the window at all are uncommon.
  If you figure anything out, let me know (or fix it!). I sure haven't. (Yet.)
OK, I'm much closer.

I was able to narrow it down to between 10/01/00 and 10/2/00

The problem appears to be caused by dougt's checkins for:

http://bugzilla.mozilla.org/show_bug.cgi?id=54248

I'm still investigating to find the exact problem.
OK, the window IS getting created.

We're never getting the EndDocument to cause the window to be made visible.

Still looking.
OK, still dougt but different bug.

I know believe it is these changes to nsThread.*

http://bugzilla.mozilla.org/show_bug.cgi?id=36750

I triple checked.

Without these changes, things work. I'm looking at the changes now.

BTW, it seems to happen more on Win95 and OS/2.
I concur. Backing out nsThread.h to 1.15 and nsThread.cpp to 1.26 fixes this 
problem and at least the javascript:alert() part of bug 55460 on my machine. Time 
to get dougt more involved, I think. Great work, Michael! I've been going at this 
the wrong way, instead of searching for the checkin that caused the problem, and 
I was getting nowhere forever. Thanks!
I've been looking at the changes in nsThread.cpp/h and it looks like one of the 
major parts of the update had to do with cleaning up threads from the thread 
pool if there are more than the "minimum" number of threads, where the minimum 
is defined as 1.  If I change this to 2, the problem goes away.  Not sure what 
this means, but I thought it might be interesting information for anyone else 
looking at this problem.
*** Bug 60155 has been marked as a duplicate of this bug. ***
I have attached an output log that shows the thread being removed, and then 
another request comes in (the last line in the file), which just hangs.

Looking at the same scenario on NT, the thread pool seems to stay up at 4 
threads (the max) and never gets reduced down to the min (1 thread).
Oops, that NT run was with the Netscape 6 branch.  I don't have a log from a 
trunk build on NT.  You can get one by setting NSPR_LOG_MODULES=nsIThread:5 
before running Mozilla.
Blocks: 60888
Two raw nsIThread logs beginning just after typing "javascript:alert('hi')" in 
the URL bar:

-- mMinThreads = 2 (worked)
0[494d48]: nsIThreadPool thread 4983c0 dispatched 3438e84 status 0
0[494d48]: nsIThreadPool thread 4983c0: 1 threads in pool, max = 4, requests = 
1, creating new thread...
0[494d48]: nsIThread 343c038 created
0[494d48]: nsIThreadPool adding new thread 343c038 (2 total)
0[494d48]: nsIThreadPool thread 4983c0 dispatched 343b364 status 0
0[494d48]: nsIThreadPool thread 4983c0: 2 threads in pool, max = 4, requests = 
2, creating new thread...
0[494d48]: nsIThread 30cb7a8 created
0[494d48]: nsIThreadPool adding new thread 30cb7a8 (3 total)
0[494d48]: nsIThreadPool thread 4983c0 dispatched 32e549c status 0
536[324dc68]: nsIThreadPool thread 324dc10 got request 3438e84
536[324dc68]: nsIThreadPool thread 324dc10 got request 3438e84
536[324dc68]: nsIThreadPool thread 324dc10 running 3438e84
536[324dc68]: nsIThreadPool thread 324dc10 completed 3438e84 status=0
536[324dc68]: nsIThreadPool thread 324dc10 got request 343b364
536[324dc68]: nsIThreadPool thread 324dc10 got request 343b364
536[324dc68]: nsIThreadPool thread 324dc10 running 343b364
536[324dc68]: nsIThreadPool thread 324dc10 completed 343b364 status=0
536[324dc68]: nsIThreadPool thread 324dc10 got request 32e549c
536[324dc68]: nsIThreadPool thread 324dc10 got request 32e549c
536[324dc68]: nsIThreadPool thread 324dc10 running 32e549c
0[494d48]: nsIThreadPool thread 4983c0 dispatched 3432e74 status 0
0[494d48]: nsIThreadPool thread 4983c0 dispatched 3438e84 status 0
768[343c090]: nsIThread 343c038 start run 343bfe0
768[343c090]: nsIThreadPool thread 343c038 got request 3432e74
768[343c090]: nsIThreadPool thread 343c038 got request 3432e74
768[343c090]: nsIThreadPool thread 343c038 running 3432e74
768[343c090]: nsIThreadPool thread 343c038 completed 3432e74 status=0
768[343c090]: nsIThreadPool thread 343c038 got request 3438e84
768[343c090]: nsIThreadPool thread 343c038 got request 3438e84
768[343c090]: nsIThreadPool thread 343c038 running 3438e84
768[343c090]: nsIThreadPool thread 343c038 completed 3438e84 status=0
768[343c090]: nsIThreadPool thread 343c038: 3 threads in pool, min = 2, 
exiting...
768[343c090]: nsIThreadPool thread 343c038 being removed
768[343c090]: nsIThreadPool thread 343c038 quitting 343bfe0
768[343c090]: nsIThread 343c038 end run 343bfe0
768[343c090]: nsIThread 343c038 exited
768[343c090]: nsIThread 343c038 destroyed
1016[33095d0]: nsIThread 30cb7a8 start run 32bc8b8
1016[33095d0]: nsIThreadPool thread 30cb7a8 waiting (2 threads in pool)
0[494d48]: nsIThreadPool thread 4983c0 dispatched 348668c status 0
1016[33095d0]: nsIThreadPool thread 30cb7a8 got request 348668c
1016[33095d0]: nsIThreadPool thread 30cb7a8 got request 348668c
1016[33095d0]: nsIThreadPool thread 30cb7a8 running 348668c
1016[33095d0]: nsIThreadPool thread 30cb7a8 completed 348668c status=0
1016[33095d0]: nsIThreadPool thread 30cb7a8 waiting (2 threads in pool)
(and on it runs)

-- mMinThreads = 1 (wedged)
0[494e28]: nsIThreadPool thread 497e48 dispatched 3388f44 status 0
0[494e28]: nsIThreadPool thread 497e48: 1 threads in pool, max = 4, requests = 
1, creating new thread...
0[494e28]: nsIThread 338b7d8 created
0[494e28]: nsIThreadPool adding new thread 338b7d8 (2 total)
0[494e28]: nsIThreadPool thread 497e48 dispatched 338ac94 status 0
0[494e28]: nsIThreadPool thread 497e48: 2 threads in pool, max = 4, requests = 
2, creating new thread...
0[494e28]: nsIThread 31b3720 created
0[494e28]: nsIThreadPool adding new thread 31b3720 (3 total)
0[494e28]: nsIThreadPool thread 497e48 dispatched 26f439c status 0
576[31c0be0]: nsIThreadPool thread 31c0b98 got request 3388f44
576[31c0be0]: nsIThreadPool thread 31c0b98 got request 3388f44
576[31c0be0]: nsIThreadPool thread 31c0b98 running 3388f44
576[31c0be0]: nsIThreadPool thread 31c0b98 completed 3388f44 status=0
576[31c0be0]: nsIThreadPool thread 31c0b98 got request 338ac94
576[31c0be0]: nsIThreadPool thread 31c0b98 got request 338ac94
576[31c0be0]: nsIThreadPool thread 31c0b98 running 338ac94
576[31c0be0]: nsIThreadPool thread 31c0b98 completed 338ac94 status=0
576[31c0be0]: nsIThreadPool thread 31c0b98 got request 26f439c
576[31c0be0]: nsIThreadPool thread 31c0b98 got request 26f439c
576[31c0be0]: nsIThreadPool thread 31c0b98 running 26f439c
0[494e28]: nsIThreadPool thread 497e48 dispatched 3384564 status 0
0[494e28]: nsIThreadPool thread 497e48 dispatched 3388f44 status 0
896[338b820]: nsIThread 338b7d8 start run 338b790
896[338b820]: nsIThreadPool thread 338b7d8 got request 3384564
896[338b820]: nsIThreadPool thread 338b7d8 got request 3384564
896[338b820]: nsIThreadPool thread 338b7d8 running 3384564
896[338b820]: nsIThreadPool thread 338b7d8 completed 3384564 status=0
896[338b820]: nsIThreadPool thread 338b7d8 got request 3388f44
896[338b820]: nsIThreadPool thread 338b7d8 got request 3388f44
896[338b820]: nsIThreadPool thread 338b7d8 running 3388f44
896[338b820]: nsIThreadPool thread 338b7d8 completed 3388f44 status=0
896[338b820]: nsIThreadPool thread 338b7d8: 3 threads in pool, min = 1, 
exiting...
896[338b820]: nsIThreadPool thread 338b7d8 being removed
896[338b820]: nsIThreadPool thread 338b7d8 quitting 338b790
896[338b820]: nsIThread 338b7d8 end run 338b790
896[338b820]: nsIThread 338b7d8 exited
896[338b820]: nsIThread 338b7d8 destroyed
764[31bb3f8]: nsIThread 31b3720 start run 30c7fd0
764[31bb3f8]: nsIThreadPool thread 31b3720: 2 threads in pool, min = 1, 
exiting...
764[31bb3f8]: nsIThreadPool thread 31b3720 being removed
764[31bb3f8]: nsIThreadPool thread 31b3720 quitting 30c7fd0
764[31bb3f8]: nsIThread 31b3720 end run 30c7fd0
764[31bb3f8]: nsIThread 31b3720 exited
764[31bb3f8]: nsIThread 31b3720 destroyed
0[494e28]: nsIThreadPool thread 497e48 dispatched 33cc604 status 0
(end of log)

My Cliff Notes version:

-- mMinThreads = 2 (worked)
(threads p, a already exist)
thread p dispatches request A
thread p gets request, creates new thread b 338b7d8
thread p dispatches request B
thread p gets request, creates new thread c 31b3720
thread p dispatches request C
thread a gets request A (completes)
thread a gets request B (completes)
thread a gets request C
thread p dispatches request D
thread p dispatches request E
thread b starts
thread b gets request D (completes)
thread b gets request E (completes)
thread b gets no request, exits
thread c starts
thread c gets no request, waits
thread p dispatches request F
thread c gets request F (completes)
(continue)

-- mMinThreads = 1 (wedged)
(threads p, a already exist)
thread p dispatches request A
thread p gets request, creates new thread b 338b7d8
thread p dispatches request B
thread p gets request, creates new thread c 31b3720
thread p dispatches request C
thread a gets request A (completes)
thread a gets request B (completes)
thread a gets request C
thread p dispatches request D
thread p dispatches request E
thread b starts
thread b gets request D (completes)
thread b gets request E (completes)
thread b gets no request, exits
thread c starts
thread c gets no request, exits
thread p dispatches request F
(application wedges)

These two runs are identical to the point where thread c starts looking for 
requests. In both cases its first attempt to get a waiting request finds none 
waiting. In one case, it was patient, a request came later, and the app 
continued. In the other case, thread c impatiently grabbed for the seppuku 
blade, and the app wedged.

As Jeffrey notes above, setting mMinThreads = 2 serendipitously evades seppuku 
wedging because it gives thread c in this case the necessary patience, and the 
app appears healthy. (Though the basic problem persists.)

Now, I'm not terribly familiar with this code, so my conclusions may be 
ignorable. But it seems clear that there are multithreaded timing issues afoot, 
and exiting a thread the moment it finds no pending requests isn't going to 
work. Do we have any better way to determine when a thread is truly finished? 
I'm expecting the answer to be "no."

Adding gagan to the cc: list because this bug really beards my goat, I'd love to 
see it fixed, and it's necko that's driving the data flow here. Perhaps he has 
suggestions.
Whiteboard: [rtm-] [bug is kicking my butt] → [rtm-] [we have clues now]
Strange... I just reproduced this problem on NT with a trunk build from code 
pulled yesterday.  The output log looks pretty much the same.
Looking at your "Cliff Notes", it looks like thread "a" is busy handling a 
request, and another request comes in that has to be handled (by another 
thread).  Another thread is kicked off, but it determines that there are no 
requests waiting and terminates before a request comes in.  The problem here is 
a timing issue: the new thread checks for requests before the request is added.

Take, for example, the code in nsThreadPool::DispatchRequest().  AddThread is 
called when it is determined that we need another thread, but 
mRequests->AppendElement isn't done until AFTER the thread is created.  If, in 
between, the thread runs and sees nothing on the request list, it will terminate 
before the request is added.

Is there a way to know that the thread hasn't received any requests yet, or if 
it should expect to receive any?  Starting up a thread and then killing it 
before it does anything doesn't seem right to me; whatever code started the 
thread needs to be able to indicate that it shouldn't go away yet.... or the 
nsThread code shouldn't kill a thread that has never received any requests.
Ok, more thoughts:

The code in nsThreadPool::DispatchRequest() looks to see if the number of 
requests exceeds or equals the number of threads in the pool, and if so creates 
a new thread; however, what if those threads in the pool are all busy doing 
something that requires another request to be processed first?  You might have 3 
running threads that are all processing requests, but no requests waiting on the 
queue.  In this case, the number of requests is 0, but the number of threads is 
3.... so no new thread would be created.  Am I understanding this correctly?

If this is the case, then my proposal would be to remove the "busy" threads from 
the pool while they are processing requests, then put them back on when they are 
available to handle a request again.  The only problem I see with this is the 
Shutdown proceedings, which go through the list of threads in the pool to 
interrupt them... if we remove busy threads from the list, they wouldn't be 
interrupted by that shutdown processing.  In any case, I've tried putting in 
what I just proposed, and it solved the javascript:alert() problem.  It's just a 
matter of adding in two lines in GetRequest: I will attach the diff.
OK, I think someone missed the concept of a thread pool when they wrote the 
original fix, or maybe I missed something.

A thread pool is about having a set number of threads that are ALWAYS around so 
that servicing requests does not require creating and destroying threads.

This implementation of a pool doesn't seem correct. It should create 
a new thread with each request up to MAX_THREADS and keep them around to service 
requests. The number of threads should never drop until the app is shut down and 
the threads need to be cleaned up.

Isn't that a thread pool?
Dan, I think that you are right on track here.  Thank you for digging up the
problem. You asked, how will we know when the thread should go away?  There are
two possibles to fix the bug here.

1.  We sit on a condVar when our thread is initally created by AddTread().  It
should be notified when we are done adding the requests. This will allow us to
avoid a greedy exit, although it would allow another thread to "steal" the
request that was created for the new thread.

2.  We could add a threshold wait on a condvar every time through GetRequest
prior to the check for the thread count.  This was in my original patch to
nsThread some months back.  It was dropped.

I can work up either fix and attach it to this bug.  Which do you feel is
better.  I personally like (2).

Jeff, your first comments about where the problem could be occuring is great.
(however, I disagree with your thoughts about removing threads from the list).

Mike, thank you for your comments too, but (as I stated in  an email) there is a
lower bound in this threadpool whereby threads hitting this will not be deleted.




Doug, I agree that removing the threads from the list may not be the right way 
to go, but I was doing that to illustrate the problem.  Take this hypothetical 
situation: One thread in the pool.  Request A comes in.  Thread 1 dispatches 
request A, removing it from the request list.  While processing Request A, 
Thread 1 does something that causes Request B to be generated, and can't 
continue until Request B is complete.  With one request in the list, and one 
thread in the pool (albeit a busy one), a new thread will NOT be created, and we 
will block indefinitely.

What if, instead of removing the threads from the pool, we have a second list of 
"active" threads, and remove the busy ones from that list?  Basically, the same 
code that I proposed, except using a new list mIdleThreads, and add every 
newly-created thread from mThreads onto mIdleThreads as well.  That way, we have 
the total list of threads in the pool in mThreads, but will still create another 
thread to service requests if all of the threads in the pool are busy (if 
mIdleThreads has zero elements, and we have fewer than the maximum in mThreads).
Status: NEW → ASSIGNED
Whiteboard: [rtm-] [we have clues now] → [rtm-]
  Michael: you're probably right about threadpools wanting to keep threads 
around and available, but I can't fault people who want to try to clean up 
unused threads. At any rate the threadpool does always maintain at least a 
minimum number of threads. It's just set low enough right now that we run into 
problems.
  Jeffrey: your 22 Nov patch scares me; I'm afraid it'll just move subtle timing 
problems into new territory. However, your idea posted today seems like the 
right thing to do.
  Below is a patch that effectively maintains two separate lists of available 
and busy threads, except I did this by merely keeping an integer count of busy 
threads, pretending they're on an "unavailable" list. It solves this bug and 
related ones on my machine.
  Adding threading gurus beard and saari to the cc:list and hoping for comment 
and review.
Didn't mean to scare you, Dan.  I had mentioned that my suggestion was probably 
not correct, but I was hoping to stir up the creative juices among the others 
looking at the problem towards a correct solution.  I like what you came up 
with, especially since it's a lot cleaner than what I had thrown together as a 
prototype.  Thanks for your attention to this problem....
Blocks: 55032
Blocks: 58404
Patch looks like Goodness (TM) to me. r=saari
All righty. Who should the super reviewer be on this one.

I assume brendan?
  I've asked Brendan to review it, yes.
  Doug points out that I unnecessarily moved two methods in nsThreadPool from 
public to protected, and that Brendan will probably hose me for that. So, a small 
preemptive strike.
  Originally I did that because I had a situation with the mBusyThreads variable; 
it could have been accessed outside the object lock. I realized that AddThread 
and RemoveThread were never used from the outside, and in my opinion aren't 
really part of the public interface, anyway. Later I simplified my patch so that 
this change was no longer necessary, but I continue to feel that those methods 
shouldn't be public. Always nice to remove extraneous stuff from a public 
interface.
PS: Doug and Patrick have also both come by, talked a bit, had their fears 
allayed, and approved the patch.
I don't mind cleanups, especially those that remove internal methods from
public: C++ APIs.  But why not go further:

+    nsresult IncrementBusyThreadCount();
+    nsresult DecrementBusyThreadCount();

don't seem generally useful, either -- they're not part of the nsIThread API,
and for proper operation, they must be called around the Run() method call for
any request got from a pool.  Would they be better coded as protected friends of
class nsThreadPoolRunnable?

Otherwise looks good.

/be
*** Bug 56313 has been marked as a duplicate of this bug. ***
cc: shrir
*** Bug 54793 has been marked as a duplicate of this bug. ***
Woo hoo! Evil dead bug. Worker threads' occasionally going AWOL was causing this 
problem. Thanks again, Michael and Jeffrey, for your help fixing this thing. 
Especially Michael who spent days tracking down the offending checkin.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
No longer depends on: 55460
Resolution: --- → FIXED
Target Milestone: M18 → mozilla0.8
*** Bug 55460 has been marked as a duplicate of this bug. ***
Congratulations on fixing this bug.  Could someone see whether bug 52910
(similar behavior on MacOS 9) looks like it could have been the same bug?  It
wasn't reproducible, but maybe you can tell whether the StdLog is consistent
with it being the same bug.
*** Bug 55088 has been marked as a duplicate of this bug. ***
Blocks: 61697
Verifying on 20010702 WindowsME.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: