Closed Bug 420187 Opened 16 years ago Closed 16 years ago

hang in nsNSSHttpRequestSession::internal_send_receive_attempt

Categories

(Core :: Security: PSM, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla1.9

People

(Reporter: nthomas, Assigned: KaiE)

References

Details

(Keywords: hang, relnote)

Attachments

(7 files, 8 obsolete files)

14.38 KB, text/plain
Details
491 bytes, patch
sayrer
: review+
ted
: review+
beltzner
: approval1.9+
Details | Diff | Splinter Review
15.93 KB, text/plain
Details
10.46 KB, text/plain
Details
80.92 KB, image/jpeg
Details
8.92 KB, text/plain
Details
57.44 KB, patch
KaiE
: review+
beltzner
: approval1.9+
Details | Diff | Splinter Review
On investigating fx-win32-tbox after 3hrs 30min building, it was stuck at:
 OBJDIR=obj-fx-trunk python obj-fx-trunk/_profile/pgo/profileserver.py

firefox.exe was in process list but not on the taskbar. When I ended the firefox process, the output was:

Application pid: 7328
FAIL Exited with code 1 during test run
fx-win32-tbox.build.mozilla.org - - [28/Feb/2008 11:33:33] "GET /index.html HTTP/1.1" 200 - fx-win32-tbox.build.mozilla.org - - [28/Feb/2008 11:33:33] "GET /quit.js HTTP/1.1" 200 -
fx-win32-tbox.build.mozilla.org - - [28/Feb/2008 11:33:34] code 404, message File not found
fx-win32-tbox.build.mozilla.org - - [28/Feb/2008 11:33:34] "GET /favicon.ico HTTP/1.1" 404 -

It continued to the second build, ignoring the exit status. If the build crashes out then tinderbox should start from the beginning.
Component: Cmd-line Features → Build Config
QA Contact: cmd-line → build-config
Had this problem with profiling again today, so attached with the debugger and did Break All. This is the resulting thread table and callstacks for each thread.
Assignee: nobody → kengert
Component: Build Config → Security: PSM
Keywords: hang
QA Contact: build-config → psm
Summary: Python profiler should pay attention to firefox exit code → hang in nsNSSHttpRequestSession::internal_send_receive_attempt
should split this into two bugs, one for the python script not exiting with failure if the browser does, and one for the actual hang (which it looks like this bug has already been morphed into).
Filed bug 421281 on the profiling script.
I can reproduce this 100% of the time on OS X using the profiling script, even with a vanilla libxul build.

#0  0x908d7a46 in semaphore_timedwait_signal_trap ()
#1  0x90909daf in _pthread_cond_wait ()
#2  0x90954de7 in pthread_cond_timedwait ()
#3  0x0005cd09 in pt_TimedWait (cv=0x1c1bdcc4, ml=0x1c1bdcc4, timeout=<value temporarily unavailable, due to optimizations>) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:280
#4  0x0005d05c in PR_WaitCondVar (cvar=0x1c1bdcc0, timeout=250) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:407
#5  0x0215b97a in nsNSSHttpRequestSession::internal_send_receive_attempt (this=0x1c1bd280, retryable_error=@0xb02a348c, pPollDesc=0x0, http_response_code=0xb02a35bc, http_response_content_type=0x0, http_response_headers=0x0, http_response_data=0xb02a35b0, http_response_data_len=0xb02a35b4) at /Users/luser/build/mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp:406

the main thread is sitting here:
0  0x908d7a22 in semaphore_wait_trap ()
#1  0x9094fb7f in pthread_join ()
#2  0x00061eb3 in PR_JoinThread (thred=0x3203) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:594
#3  0x021578d5 in nsPSMBackgroundThread::requestExit (this=0x3203) at /Users/luser/build/mozilla/security/manager/ssl/src/nsPSMBackgroundThread.cpp:97
#4  0x0215c63c in nsNSSComponent::DoProfileChangeNetTeardown (this=0x3203) at /Users/luser/build/mozilla/security/manager/ssl/src/nsNSSComponent.cpp:2326
Oh, I should have pasted a full stack from that first thread:
0  0x908d7a46 in semaphore_timedwait_signal_trap ()
#1  0x90909daf in _pthread_cond_wait ()
#2  0x90954de7 in pthread_cond_timedwait ()
#3  0x0005cd09 in pt_TimedWait (cv=0x1c1bdcc4, ml=0x1c1bdcc4, timeout=<value temporarily unavailable, due to optimizations>) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:280
#4  0x0005d05c in PR_WaitCondVar (cvar=0x1c1bdcc0, timeout=250) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:407
#5  0x0215b97a in nsNSSHttpRequestSession::internal_send_receive_attempt (this=0x1c1bd280, retryable_error=@0xb02a348c, pPollDesc=0x0, http_response_code=0xb02a35bc, http_response_content_type=0x0, http_response_headers=0x0, http_response_data=0xb02a35b0, http_response_data_len=0xb02a35b4) at /Users/luser/build/mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp:406
#6  0x0215bbc1 in nsNSSHttpRequestSession::trySendAndReceiveFcn (this=0x1c1bd280, pPollDesc=0x0, http_response_code=0xb02a35bc, http_response_content_type=0x0, http_response_headers=0x0, http_response_data=0xb02a35b0, http_response_data_len=0xb02a35b4) at /Users/luser/build/mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp:300
#7  0x00269a52 in ocsp_GetEncodedOCSPResponseFromRequest ()
#8  0x0026b339 in CERT_CheckOCSPStatus ()
#9  0x0026e4b6 in CERT_VerifyCert ()
#10 0x0026e513 in CERT_VerifyCertNow ()
#11 0x0023d482 in SSL_AuthCertificate ()
#12 0x0215a48b in AuthCertificateCallback (client_data=0x0, fd=0x1c1b85a0, checksig=-1869776314, isServer=-1869776314) at /Users/luser/build/mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp:907
#13 0x0023b31e in ssl3_HandleHandshakeMessage ()
#14 0x0023ca1e in ssl3_HandleRecord ()
#15 0x0023d0b7 in ssl3_GatherCompleteHandshake ()
#16 0x0023e08c in ssl_GatherRecord1stHandshake ()
#17 0x00241fee in ssl_Do1stHandshake ()
#18 0x00242f53 in ssl_SecureSend ()
#19 0x00242fec in ssl_SecureWrite ()
#20 0x00246df4 in ssl_Write ()
#21 0x02157db1 in nsSSLThread::Run (this=0x1b51dba0) at /Users/luser/build/mozilla/security/manager/ssl/src/nsSSLThread.cpp:1029
#22 0x0006189c in _pt_root (arg=0x1b51dc60) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:221
#23 0x90908c55 in _pthread_start ()
#24 0x90908b12 in thread_start ()

looks like it's trying to do some ocsp stuff? If I had to make an uneducated guess, I would say that we hit aus looking for extension updates, and then are screwing around in SSL code when the browser wants to shutdown.
I guess that's in Nick's stack too. This sucks, because this hangs fx-win32-tbox on occasion (usually during the nightly). I guess it could hang anyone's browser if they startup and shutdown too quickly.
Severity: normal → major
Flags: blocking1.9?
Priority: -- → P1
Hit this again today, which is painful because takes another two and a half hours to get a nightly build out.
Flags: blocking1.9?
Flags: blocking1.9?
Marking as blocking, bad effects on build infrastructure related to fast startup-shutdown, which is also a pretty common scenario during upgrade to a new Firefox (extension compat checks, EM restarts, etc.)
Once more, with actual flags.
Flags: blocking1.9? → blocking1.9+
Gentle ping - any update on this?
This will hopefully prevent the tinderbox hanging until this is resolved.
Attachment #309626 - Flags: review?(sayrer)
Attachment #309626 - Flags: review?(ted.mielczarek)
Comment on attachment 309626 [details] [diff] [review]
[checked in] Temporarily increase the time the browser stays open to 10s

worth a shot!
Attachment #309626 - Flags: review?(ted.mielczarek) → review+
Attachment #309626 - Flags: review?(sayrer) → review+
Attachment #309626 - Attachment description: Temporarily increase the time the browser stays open to 10s → [checked in] Temporarily increase the time the browser stays open to 10s
http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp&rev=1.62&mark=112#197

perhaps:
nsHTTPDownloadEvent::Run or nsHTTPListener should register an observer for some "shutdown" message and do something when they get it.
We ready to go on this?
Schrep: we have a workaround in place in the PGO script, but I haven't seen any progress on the actual hang.
I'm working on it now
I'm able to reproduce this quite reliably using the following approach:
- open 5 tabs
- in the tabs, load the following sites:
  https://www.paypal.com/
  https://www.verisign.com/
  https://signin.ebay.com/
  https://www.thawte.com/
  https://www.godaddy.com/
- use File/Quit
- Select "Save all tabs", and select "do not ask me next time"
- restart firefox
- after the window is up with all 5 tabs, wait about 2 seconds, then press CTRL-Q, requesting to quit without a prompt

With this approach I can reproduce a deadlock 80% of my attempts.
Target Milestone: --- → mozilla1.9beta5
I have a patch, but it needs a bit more polishing. Working on it. ETA later today. My patch works quite reliably.
However, using the STR from comment 18 I still get an occassional deadlock, which seems to be unrelated to the SSL thread, unrelated to OCSP. Rather the deadlock is when the SSL layer attempts to proxy a call to the main thread.

This is just an update, I'll attach more details later.
Attached patch Patch v1 (obsolete) — Splinter Review
This patch seems to work, I don't see hangs on exit anymore with this patch and my steps to reproduce.

I would like to invite you to test this patch with your own STR.
Thanks!
Could you please explain your steps to reproduce? How did you call the profile script?

On Linux, I did this, as inspired by comment 0:
  OBJDIR=obj-fire-debug obj-fire-debug/_profile/pgo/profileserver.py

Does NOT hang for me (without patch), so I'd appreciate more detailed descriptions of the STR. Thanks!
Attached patch Patch v2 (obsolete) — Splinter Review
Minor improvements compared to v1.

My proposed fix isn't small, so let me explain what I learned and the changes in this patch.

The OCSP HTTP download code usually waits for completion
(nsNSSHttpRequestSession::internal_send_receive_attempt).
We need a way to leave that loop if the application is exiting.
I added nsSSLThread::exitRequested() for that purpose.

But that isn't sufficient.
I'd expect the loop to exit after 10 seconds, but it doesn't.
So, some events are not coming in on shutdown.
Or rather: The current code is unable to cancel all loads it
wants to wait for, because:

When this code got implemented initially, an incorrect assumption was made.
It has been assumed there would be only a single OCSP download at any given time,
and tracked in static variable mPendingHTTPRequest.
The assumption is wrong.

I propose to drop the global references to pending OCSP requests
and use a direct reference in each context.
That's why I changed nsCancelHTTPDownloadEvent.

That way we are able to cancel all pending loads, the event dispatcher
will hold the last reference to the events, and they will go away as soon
as the dispatcher releases them.

In order to allow us to cancel from within the "cancel event",
forwarding to the "download event" and executing the actual network 
cancellation from there, the "download event" needs a reference 
to the loadgroup.

I'm worried about a race between nsHTTPDownloadEvent::Run
and nsHTTPDownloadEvent::Cancel.
Therefore I added a lock to synchronize access to the loadgroup member.

Then I ran into assertions, telling me that loadgroup is not threadsafe.
It happened during assignment to nsCOMPtr.

Because of that I'm using a raw pointer to refer to the loadgroup.
I remember the thread ID of the creation thread, and make sure
we'll only release it if we're on the same thread again
(we should, both actions will happen on the thread that processes events).

I hope this approach is ok. I hope the loadgroup object does not get 
created on a different thread using some proxying mechanisms that I don't see.

So, after all these changes I still ran into more problems :-(

First, I got assertions "wrong thread" from I/O-Service and about
standard-url not being thread safe. This was a simply caused by PSM
accessing the string bundle on a thread the I/O service is not happy with.
The assertion happens when using the JAR-URI for the string bundle.

I think this assertion is a separate issue, but why not fix it now.
My attempt to fix is: Load a dummy string at PSM init time.
That seems to init the string bundle objects somehow, 
it avoids the assertion in my testing.
I don't know if this is a real fix, but at least it should not hurt.

Finally, one more tricky issue.

During a recent code enhancement, when we fixed an assertion about security-UI
not being thread safe, we changed the code that tracks the security state
of a document window. We added a reentrance detection for notifications to
nsIWebProgress::OnStateChange and OnLocationChange.
I had naively assumed this reentrace would never happen and I was wrong.

The security state tracking gets notified that a load is finished,
it attempts to derive the security state, part of that is doing OCSP.
When the user cancels the load of the document, then we receive stop 
notifications, while we're still waiting for its associated OCSP download.

That's a problem. The code to track the security state is complex and
not reentrant (for a single tracked window).

I propose the following workaround:
As soon as we detect reentrace for a given object, we switch that object
into an "inconsistent" state. As long as we're in that state,
we ignore any further events that would influence or update 
the security state.

As soon as the reentrance situation has finished (we receive a new
notification back in the original frame), then we'll clear the 
tracking state and reset it to insecure.

You'll find that code if you search for "inconsistency".
Attachment #310823 - Attachment is obsolete: true
Attachment #310848 - Flags: review?(rrelyea)
Attached file another hang
Even with the most recent patch I still get occassional hangs!

This time the problem is with proxying of the socket's callback pointer to the main thread. I wish that wouldn't be necessary.

Look at this attachment.
You'll see that the main thread requests termination of the SSL thread, and it wants to wait for it (it really must, because of our NSS resources shutdown story).

However, the SSL thread is trying to produce an error message. Unfortunately, it wants to give nsISSLErrorListener a chance to supress error messages. That's being proxied to the main thread which, well, is busy waiting for the SSL thread.

to be continued...
Attached file yet another hang
And I got one more deadlock, this time not involving the SSL thread at all.

Occassionally Necko wants to shut down on the main thread. However, on the socket transport thread, it is still trying to set up a new SSL socket.

In nsNSSSocketInfo::SetNotificationCallbacks PSM wants to obtain some information from the provided interfaceRequested. It attempts to proxy to the main thread for querying docshell, well, but the main thread is blocked.
Attached patch Patch v3 (obsolete) — Splinter Review
All I said before when explaining the proposed patch is still valid.
This patch is based on v2, but has the following changes in addition:

When building a new SSL socket, when the notification-callback-object is passed to the socket, we'll now store the original un-proxied object. And we will not immediately query the docshell to derive the information we require (e.g. whether or not to do external-error-reporting (error page)). We'll postpone that until we require the information.

In addition, I added a safety check. I'm not sure if it will be helpful in all scenarios, but it can't hurt: Whenever the SSL socket requires information from the main thread, and plans to proxy a call to the main thread, it will check whether the SSL thread is about to shut down. If it is, it doesn't make any sense to proceed, and therefore the action will be omitted and we'll return with a failure.


I don't see any hangs with this patch revision, but I'll continue testing.
Please join me :-)
Attachment #310848 - Attachment is obsolete: true
Attachment #310884 - Flags: review?(rrelyea)
Attachment #310848 - Flags: review?(rrelyea)
Kai: before attachment 309626 [details] [diff] [review] was checked in, this was 100% reproducible for me on mac. You may try backing that out and testing.
(In reply to comment #26)
> Kai: before attachment 309626 [details] [diff] [review] was checked in, this was 100% reproducible for me
> on mac. You may try backing that out and testing.

Ted:

(a) I don't know HOW you tested. Can you please describe what exactly you did to reproduce the hang?

(b) Would you be able to back out the attachment in your local tree, apply my patch, and test that?
Kai: just running the profiling script, like:
"OBJDIR=obj-fx-trunk python obj-fx-trunk/_profile/pgo/profileserver.py"

I'll see if I have a chance to test with the backout+your patch today.
We're blocking on this bug: Robert (Rylea) - any ETA of review?
Kai: is there anyone else who can review this? rrylea seems to be AFK, and we're holding Firefox 3 Beta 5 on this fix.

Alternatively: do we need to be holding Firefox 3 Beta 5 on this fix?
Dcamp can you do a drive-by?
(In reply to comment #28)
> Kai: just running the profiling script, like:
> "OBJDIR=obj-fx-trunk python obj-fx-trunk/_profile/pgo/profileserver.py"
> 
> I'll see if I have a chance to test with the backout+your patch today.


Thanks Ted.

I backed out attachment 309626 [details] [diff] [review], and I was able to reproduce your hang using a debug build on Linux.

I applied my patch v3, attachment 310884 [details] [diff] [review], and I no longer hang.
Blocks: 424901
We haven't hit this on fx-win32-tbox since we bumped up the timeout, so unless Kai's patch needs beta exposure for some reason, I think we can take it afterwards.
Priority: P1 → P2
Target Milestone: mozilla1.9beta5 → mozilla1.9
(In reply to comment #33)
> We haven't hit this on fx-win32-tbox since we bumped up the timeout, so unless
> Kai's patch needs beta exposure for some reason, I think we can take it
> afterwards.

I'd recommend to have beta exposure for this patch.
There are enough changes in this patch that could allow for unexpected regressions, and we'd better see them during beta testing.
Priority: P2 → P1
Target Milestone: mozilla1.9 → mozilla1.9beta5
Kai can you suggest a different reviewer?
Comment on attachment 310884 [details] [diff] [review]
Patch v3

r+ I've asked kaie to include some comments, since this conditions is subtle and still has some corner cases.

Most importantly there is a corner case where the security UI may not exactly reflect the pages that are loaded. This can only happen when the stop button is pushed or the application exits. Obviously the latter isn't an issue. In the former case, the UI reflects and intermediate state that the page really had while loading, so the state of a stop page is inconsistent to begin with.

The alternative is that the browser hangs.
Attachment #310884 - Flags: review?(rrelyea) → review+
(In reply to comment #35)
> Kai can you suggest a different reviewer?

I think Bob is the person who has the most up to date knowledge on this code by far.

I guess we're good now. We have just spent 50 minutes discussing the patch, and he just gave r+.

He asked me for comments only. No code changes are planned as of now.

Given the urgency, I propose to land the code now, and I'll do the comments in a follow up patch - today!
Attachment #310884 - Flags: approval1.9b5?
Comment on attachment 310884 [details] [diff] [review]
Patch v3

a1.9b5=beltzner
Attachment #310884 - Flags: approval1.9b5? → approval1.9b5+
Checked in.
Marking fixed.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Attached patch fix red tinderbox on windows (obsolete) — Splinter Review
Comment on attachment 311638 [details] [diff] [review]
fix red tinderbox on windows

let's get the bustage fix in..
Attachment #311638 - Flags: approval1.9b5+
Attached patch fix leaks, v5 (obsolete) — Splinter Review
sorry my patch had introduced a leak. I have a trivial new patch that should fix it. basically I forgot to manually release that pointer to the pending load in the success scneario. but it required me to move the tracked pointer to a different object
Attachment #311657 - Flags: review?
Comment on attachment 311657 [details] [diff] [review]
fix leaks, v5

r+ rrelyea
Attachment #311657 - Flags: review? → review+
I'm afraid I have to reopen the bug.
The patch causes regressions around secure UI.

For example, open firefox, visit https://certs.starfieldtech.com/images/trust_logo.gif
then set firefox to save your session.

Quit firefox, start again, so it will load that page immediately.

With the patches applied, I get a weird UI.
No security indicators at all.
Page is being reported as insecure.

With the patches reverted, I get secure indicators and even EV status.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Kai/Rob - do we want to just back this out and try and address in time for 1.9RC1 or do you have a safe plan of attack?
I just download the latest hourly, and when visiting the URL I see EV (green) Larry, and yellow in the Awesome Bar.  
Close/restart Minefield - tab loads and Larry +EV and Awesome bar look OK here, am I missing something ?

Vista HP SP1 
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9b5pre) Gecko/2008032517 Minefield/3.0b5pre Firefox/3.0 ID:2008032517 <-Latest Hourly
I am ok to back it out.

I don't have a good plan right now, besides spending more time on this and carefully testing for regressions around security UI.
(In reply to comment #46)
> I just download the latest hourly, and when visiting the URL I see EV (green)
> Larry, and yellow in the Awesome Bar.  
> Close/restart Minefield - tab loads and Larry +EV and Awesome bar look OK here,
> am I missing something ?

I'm using Linux, debug build. Maybe it depends on platform and/or order of events?
I see now, once you restart if the url above is active in a tab, then the EV looks OK, but once you switch tabs and back to the tab with the url then the EV collapses to a small size and 'Larry' can't be brought up... 

I think the regression has to do with my change to delay querying the information from the NotifiationCallbacks / DocShell.

We could back out only that portion of the patch that introduced this new delay.

This would leave in some of the hangs I saw, but I think it fixes the most serious hangs. I'll do some tests around this idea.
(In reply to comment #49)
> I see now, once you restart if the url above is active in a tab, then the EV
> looks OK, but once you switch tabs and back to the tab with the url then the EV
> collapses to a small size and 'Larry' can't be brought up... 

Yep, I confirm this on Mac.  It is caused by the initial patch (attachment 310884 [details] [diff] [review]), not the red-fix or the leak-fix that followed it.

If I disable OCSP checking, Larry stops becoming unresponsive (unsurprisingly, since it means this code path isn't travelled), but the EV UI still breaks since these certs are now treated as DV by design.

In no case am I seeing anything logged to the error console.
At this point I think we should back out and shoot for RC1.  This doesn't seem to be hit as often in the wild as it is in our test harness, and with the workaround we can probably safely defer to RC1...
Agreed with Comment 52, looking at the dates didn't this problem ship in Beta 4?

Let's back out and give Kai and Robert time to do this right...
Attached image screenshot
now something is wrong with the EV Cert Display. The Screenshot is from a new Build with this Patch from this Bug.

When i have 2 Sites open (like starfield and british airways) and i change the tabs fast, i get for the starfield site, the starfield favicon....but the british airways text ? 

Not sure if this is also another bug.
Thanks for the quick response and backout Kai, now let's continue trying to figure out what's going on.  :) 

Tomcat - I suspect it's the same bug.  Both the "Green with no text when coming from http tab" and the "green with wrong text when coming from other EV tab" behaviours sound like the SSLStatus object is not getting set properly on the gBrowser.securityUI object.

I mis-spoke before when I said I was seeing nothing in the console - swapping back and forth, I do get |this._lastStatus is null| being logged from the getIdentityData() code: 

http://mxr.mozilla.org/mozilla/source/browser/base/content/browser.js#6370

Which is not logged without the patch here (i.e. when things work).  That cached value is set here:

http://mxr.mozilla.org/mozilla/source/browser/base/content/browser.js#6407

Which is called from onSecurityChange here:

http://mxr.mozilla.org/mozilla/source/browser/base/content/browser.js#4195

So the moral is, the Identity UI is still getting the *state* changes (hence the colouring on the site button), but since SSLStatus is not being updated properly, it errors out when attempting to set the UI strings, or respond to mouseclicks.

That doesn't explain the source of the error, but I've included it in case it helps the effort to trace it.
Adjusting TM and priority to match this not blocking Beta5
Priority: P1 → P2
Target Milestone: mozilla1.9beta5 → mozilla1.9
Attached patch regression culprit (obsolete) — Splinter Review
I identified the culprit to be this subset of patch v3.

This new code is supposed to detect an indirect reentrancy. Something the security tracking code is not designed to deal with. However, using the other portions of the patch, I detected this might happen when we quit the application or cancel loading (users presses stop).

So, I had decided, it's best to detect it, and prevent our code from executing, preventing that two parallel executions of the same function (separated by several stack levels) would cause bad things to the internal state.

Now, using the new test case "load secure page as part of session restore", I see this we run into this reentrance scenario. That's bad. The current code simply assumes that all nsIWebProgress notifications for a single target come in in serial order. According to my latest understanding, this assumption is false.

If I'm right, we must either declare that a bug in docshell (or whoever sends our the web progress notifications), or we must enhance the security tracking code to become fully reentrant.
Attached file unexpected stack
based on comment 57, it sounds like we want biesi and bz's brains in the room, too.

> The current code simply assumes that all nsIWebProgress notifications for a
> single target come in in serial order.

What do you mean by "target"?  Notifications for a single _request_ should always be serial (the latter not happening before the former return) unless the channel implementation screws up.  But notifications for different requests inside the same loadgroup are a different issue entirely.  Here you're getting an OnLocationChange with aRequest=0x9ae1090.  The PSM code calls into NSS, NSS calls into nsNSSHttpInterface::trySendAndReceiveFcn, which posts an event to do some stuff and then spins the main thread event queue if |running_on_main_thread| (which is true, in this case).  Once you're spinning the event queue, you'll get other notifications.  You'll get arbitrary script execution.  You'll get the browser shutting down in the middle of the NS_ProcessNextEvent call.  You have to deal with all of that if you're going to spin the event queue.

I think I said pretty much this when you brought this up on IRC a few days back....

Think about it this way: the way HTTP works, requests are serialized on a connection.  Which means that if all the connections to the server are already in use, the HTTP request you're making here can't possibly complete until some of the earlier requests complete.  As they proceed and complete, they will send notifications.

Ideally, we'd avoid doing this effectively-synchronous HTTP request here.  The effect that sort of thing has on the UI is very much unsalutary.  In this case, we're polling every 50 microseconds (assuming the OS can manage that sort of resolution).  Then we're doing a bunch of work (PR_IntervalNow() is pretty expensive, if nothing else).  This will pretty much peg the CPU, and might make the UI not so responsive for the 10 seconds or so it takes this request to time out (say if all the connections to the server _are_ tied up).

Is there no possible way to kick the load off, and then return to the event loop and when it finishes handle whatever it returned?  If not, that sounds like a pretty serious issue with NSS here...
Boris, thanks a lot for your valuable comments.
Is this crash related to the problem here?

http://crash-stats.mozilla.com/report/index/49b8500a-fc5d-11dc-90ee-001a4bd43e5c 

Looks to me like we re-entered an event loop and then ended up re-entering NSS, which seemed to get quite upset about it at the end.

If not, we can file another!
Yeah, that looks like exactly this issue...
nominating for b5 relnote.  Comment 18 has good repro steps.  and it looks like this affects Bug 424901 also, which involves certain extensions.
Keywords: relnote
This patch will require a new enhanced patch, but I'm afraid, the new patch will have even more non-trivial changes.

Furthermore, I've finally been able to work on the review/updated patch for bug 358438. This is another important non-trivial change.

To make things more funny, the patches from this patch and bug 358438 will heavily overlap.

I think we have two options:
- land bug 358438 asap and postpone the work for this one until it's done
- I start to work on a combined patch immediately

I tend to do the latter.

I'm adding 383369 to the mix of overlapping patches. Bug 358438 now contains a patch that fixes both of them.
Summary.

We've learned from Boris that the current page state tracking implementation makes some false assumptions.

During the work on this patch, I realized there must be something wrong. I blamed the fact that my code is canceling stuff. But Boris tells us, what I saw is possible to happen at any time, and the regression after landing confirmed that.

The patch subset which I declared as "regression culprit" only served a single purpose: Detect that reentrancy we can't deal with, panic, and try to recover later. But because this can happen at any time, we panic too often.


So, doing the real fix will involve an enhanced implementation of event tracking. It might require that I introduce queues for incoming events, that can't be processed immediately. It might require posting events to avoid the reentrance.


Do we really have to wait for this new implementation?
We should work on it ASAP but is it necessary to postpone fixing this bug?

I think it's not necessary to wait.

We could take the attached patches, but without the "regression culprit" portion. (Because we're already living with the fact that our code happens to receive events in an order that it might not be prepared to handle. The patch does not make that worse.)
Attached patch Patch v7 (obsolete) — Splinter Review
This is:
- patch v3 from this bug minus the regression culprit
- tinderbox fix from this bug
- fix leaks v5 from this bug
- the fix for bug 358438, which Honza and I worked on
- the fix for bug 383369, which is not yet reviewed

In my testing this bug fixes the hang on exit, and 358438 and 383369.

It probably does not yet fix the reported crash, but we can work on that in a separate step.

I would like to invite you to test this patch.
+  attribute long subRequestsHighSecurity;
+  attribute long subRequestsLowSecurity;
+  attribute long subRequestsBrokenSecurity;
+  attribute long subRequestsNoSecurity;

these need comments at the very least. personally, i'd much rather these be arrays of nsIURI or something so that I could actually tell my users which requests are problematic. The current properties are misnamed - the word "count" or similar is necessary.
(In reply to comment #60)
> requests are serialized on a
> connection.  Which means that if all the connections to the server are already
> in use, the HTTP request you're making here can't possibly complete until some
> of the earlier requests complete.  As they proceed and complete, they will send
> notifications.

FYI, that request we are starting is not going to the same server, but to another one (we connect to A, and ask B to confirm A's cert is still ok).


> Ideally, we'd avoid doing this effectively-synchronous HTTP request here.  The
> effect that sort of thing has on the UI is very much unsalutary.

I have some good news, but first some background.


In the past, our processing of nsIWebProgressNotifications did not cause those recursive stacks.

When we introduced EV verification, I had decided that we should only test certs for EV when needed, and as of today, we only really need the EV status for the cert for the toplevel page.

I had used the simple approach: Look for the code that derives the overall security state for the toplevel page, and execute EV verification just before that. Which happened to be in the middle of event processing.

I did not think about threads and potential consequences. Because EV verification will require additional results from OCSP, it caused the stacks that got reported.


But luckily, things have changed just 2 days ago.

Because of a different issue (bug 406755) we had decided that we must execute the EV verification at an earlier point of time. We must execute it, while NSS still has all information from the SSL handshake with the server in memory.

This means, since the landing of bug 406755 attachment 314248 [details] [diff] [review], we're now executing EV verification and OCSP requests from a network thread (I assume it's the SSL thread that runs the NSS handshake callback).

The result of the verification will be attached to the SSL status of the socket/channel.

PSM's nsIWebProgress event processing will still call the same function to obtain the EV verification result, but the execution will no longer involve networking activity, it will use the cached result.


I can confirm this theory because I no longer get the recently added assertion about "unexpected parallel nsIWebProgress OnStateChange and/or OnLocationChange"! (which had inspired me to add the code described further above as the "regression culprit) (and as expected I quickly see the assertion when I back out attachment 314248 [details] [diff] [review] for testing purposes)


These new insights give me much more confidence that my most recent attachment to this bug is a good solution for now.
Comment on attachment 315037 [details] [diff] [review]
Patch v7

I'm requesting review on this bug.

As said before, it's a combination of patches from 3 bugs.

It's the earlier patch from this bug, minus the (now) unnecessary inconsistency handling. I consider that part already reviewed.

It contains the overlapping fix for bug 358438. That part was mostly prepared by Honza, I reviewed his patch and made adjustments. So that part has my and Honza's collabarative review, but Bob Relyea might want to have a another look.

Finally, it has the rather trivial fix from bug 383369, which still needs review from Bob, too.
Attachment #315037 - Flags: review?(rrelyea)
(In reply to comment #69)
> Created an attachment (id=315037) [details]
> Patch v7
> I would like to invite you to test this patch.
> 

Hi,

the Patch works for me and also fix the crash/hang from bug 424901 for me - tested with :

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9pre) Gecko/2008041112 Firefox/3.0pre ID:2008041112

and

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9pre) Gecko/2008041111 Minefield/3.0pre
> This means, since the landing of bug 406755 attachment 314248 [details] [diff] [review], we're
> now executing EV verification and OCSP requests from a network thread

Do you mean that we're executing newChannel and such from off the main thread?  That's not really supported, and will likely cause issues...

But yeah, not doing the sync request from inside a web progress notification should be good.
(In reply to comment #74)
> 
> Do you mean that we're executing newChannel and such from off the main thread? 
> That's not really supported, and will likely cause issues...

We're initiating it on another thread, but we do it by sending an event to the main thread. (nsHTTPDownloadEvent, NS_DispatchToMainThread)
OK.  So as long as you're OK with other web progress notifications happening while that event is running, things are fine.
Attached patch Patch v8 (obsolete) — Splinter Review
Bob Relyea now has reviewed all individual patches.
This Patch v8 brings in a small change that he proposed in bug 383369.

I'm marking this one as r=rrelyea, as it's simply a merged version of the individual patches.

I'll quickly do another update of this patch, to address timeless request to add a comment.
Attachment #315219 - Flags: review+
(In reply to comment #70)
> +  attribute long subRequestsHighSecurity;
> +  attribute long subRequestsLowSecurity;
> +  attribute long subRequestsBrokenSecurity;
> +  attribute long subRequestsNoSecurity;
> 
> these need comments at the very least.

Did you see the big comment just above?
The single comment describes them all, however, ...

> The current properties are misnamed - the word
> "count" or similar is necessary.

I agree and have added the word "count" to their names.


> personally, i'd much rather these be
> arrays of nsIURI or something so that I could actually tell my users which
> requests are problematic. 

That's a good idea, but outside of the scope of these bug fix.

Also note this interface is purely internal to PSM right now.

If you want to propose that more information (expensive) should be made available using additional properties, could you please file a separate enhancement bug?
Attached patch Patch v9Splinter Review
Renamed the interface attributes as described in the previous comment, carrying forward r+
Attachment #310884 - Attachment is obsolete: true
Attachment #311638 - Attachment is obsolete: true
Attachment #311657 - Attachment is obsolete: true
Attachment #311726 - Attachment is obsolete: true
Attachment #315037 - Attachment is obsolete: true
Attachment #315219 - Attachment is obsolete: true
Attachment #315221 - Flags: review+
Attachment #315037 - Flags: review?(rrelyea)
Comment on attachment 315221 [details] [diff] [review]
Patch v9

Requesting approval.

This bug has blocking1.9
This patch fixes this bug, and incorporates the overlapping fixes from bug 383369 (wanted-next+, wanted1.8.1.x) and bug 358438 (wanted1.9.0.x+, wanted1.8.1.x)
Attachment #315221 - Flags: approval1.9?
Comment on attachment 315221 [details] [diff] [review]
Patch v9

a1.9=beltzner
Attachment #315221 - Flags: approval1.9? → approval1.9+
Comment on attachment 309626 [details] [diff] [review]
[checked in] Temporarily increase the time the browser stays open to 10s

Do you want me to undo this timeout increase, so we can better test whether the hang fix works?
Attachment #309626 - Flags: approval1.9?
Patch v9 checked in, marking fixed.
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
Comment on attachment 309626 [details] [diff] [review]
[checked in] Temporarily increase the time the browser stays open to 10s

Yes, please take this out, but only do it when someone's watching the tree.
Attachment #309626 - Flags: approval1.9? → approval1.9+
Hi Kai, i can confirm that the Patch fix the Problem from Bug 424901.

During the Tests for Verification of this Bug with your steps to reproduce from comment #18 i crashed on Debug Builds and Official Windows Nightly Builds. Not sure if this crashes related to this bug, so i filed a new Bug.

I filed Bug 428716 (for the Assertion i got before the crash) and  Bug 428726 for the crash. 
I didn't get that assertion.

I saw the crash from bug 428726, and another crash with a different stack, see new bug 428746.

I don't know what caused those crashes, no idea whether it's a regression from this bug. It could be memory corruption.

Finally, very rarely, I still get a deadlock. It's one of the scenarios I had attempted to unblock, but it's not yet eliminated completely. We can attempt to fix that in a follow up patch. I've filed bug 428749.
comments should really be per attribute. i don't know of *any* interfaces where attributes are just referenced by the interface comment. in general for constants there may be a comment before the first constant (not interface), but this is actually a bug, because it breaks javadoc and friends.

therefor should probably be therefore.

and i'd rather not have to file the bugs.
Blocks: 429110
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: