420187 - hang in nsNSSHttpRequestSession::internal_send_receive_attempt

Reporter

Description

•

16 years ago

On investigating fx-win32-tbox after 3hrs 30min building, it was stuck at:
 OBJDIR=obj-fx-trunk python obj-fx-trunk/_profile/pgo/profileserver.py

firefox.exe was in process list but not on the taskbar. When I ended the firefox process, the output was:

Application pid: 7328
FAIL Exited with code 1 during test run
fx-win32-tbox.build.mozilla.org - - [28/Feb/2008 11:33:33] "GET /index.html HTTP/1.1" 200 - fx-win32-tbox.build.mozilla.org - - [28/Feb/2008 11:33:33] "GET /quit.js HTTP/1.1" 200 -
fx-win32-tbox.build.mozilla.org - - [28/Feb/2008 11:33:34] code 404, message File not found
fx-win32-tbox.build.mozilla.org - - [28/Feb/2008 11:33:34] "GET /favicon.ico HTTP/1.1" 404 -

It continued to the second build, ignoring the exit status. If the build crashes out then tinderbox should start from the beginning.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Updated

•

16 years ago

Component: Cmd-line Features → Build Config

QA Contact: cmd-line → build-config

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 1

•

16 years ago

Attached file Stacks from "hung" firefox — Details

Had this problem with profiling again today, so attached with the debugger and did Break All. This is the resulting thread table and callstacks for each thread.

timeless

Updated

•

16 years ago

Assignee: nobody → kengert

Component: Build Config → Security: PSM

Keywords: hang

QA Contact: build-config → psm

Summary: Python profiler should pay attention to firefox exit code → hang in nsNSSHttpRequestSession::internal_send_receive_attempt

(not currently active) Ted Mielczarek

Comment 2

•

16 years ago

should split this into two bugs, one for the python script not exiting with failure if the browser does, and one for the actual hang (which it looks like this bug has already been morphed into).

(not currently active) Ted Mielczarek

Comment 3

•

16 years ago

Filed bug 421281 on the profiling script.

(not currently active) Ted Mielczarek

Comment 4

•

16 years ago

I can reproduce this 100% of the time on OS X using the profiling script, even with a vanilla libxul build.

#0  0x908d7a46 in semaphore_timedwait_signal_trap ()
#1  0x90909daf in _pthread_cond_wait ()
#2  0x90954de7 in pthread_cond_timedwait ()
#3  0x0005cd09 in pt_TimedWait (cv=0x1c1bdcc4, ml=0x1c1bdcc4, timeout=<value temporarily unavailable, due to optimizations>) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:280
#4  0x0005d05c in PR_WaitCondVar (cvar=0x1c1bdcc0, timeout=250) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:407
#5  0x0215b97a in nsNSSHttpRequestSession::internal_send_receive_attempt (this=0x1c1bd280, retryable_error=@0xb02a348c, pPollDesc=0x0, http_response_code=0xb02a35bc, http_response_content_type=0x0, http_response_headers=0x0, http_response_data=0xb02a35b0, http_response_data_len=0xb02a35b4) at /Users/luser/build/mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp:406

the main thread is sitting here:
0  0x908d7a22 in semaphore_wait_trap ()
#1  0x9094fb7f in pthread_join ()
#2  0x00061eb3 in PR_JoinThread (thred=0x3203) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:594
#3  0x021578d5 in nsPSMBackgroundThread::requestExit (this=0x3203) at /Users/luser/build/mozilla/security/manager/ssl/src/nsPSMBackgroundThread.cpp:97
#4  0x0215c63c in nsNSSComponent::DoProfileChangeNetTeardown (this=0x3203) at /Users/luser/build/mozilla/security/manager/ssl/src/nsNSSComponent.cpp:2326

(not currently active) Ted Mielczarek

Comment 5

•

16 years ago

Oh, I should have pasted a full stack from that first thread:
0  0x908d7a46 in semaphore_timedwait_signal_trap ()
#1  0x90909daf in _pthread_cond_wait ()
#2  0x90954de7 in pthread_cond_timedwait ()
#3  0x0005cd09 in pt_TimedWait (cv=0x1c1bdcc4, ml=0x1c1bdcc4, timeout=<value temporarily unavailable, due to optimizations>) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:280
#4  0x0005d05c in PR_WaitCondVar (cvar=0x1c1bdcc0, timeout=250) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:407
#5  0x0215b97a in nsNSSHttpRequestSession::internal_send_receive_attempt (this=0x1c1bd280, retryable_error=@0xb02a348c, pPollDesc=0x0, http_response_code=0xb02a35bc, http_response_content_type=0x0, http_response_headers=0x0, http_response_data=0xb02a35b0, http_response_data_len=0xb02a35b4) at /Users/luser/build/mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp:406
#6  0x0215bbc1 in nsNSSHttpRequestSession::trySendAndReceiveFcn (this=0x1c1bd280, pPollDesc=0x0, http_response_code=0xb02a35bc, http_response_content_type=0x0, http_response_headers=0x0, http_response_data=0xb02a35b0, http_response_data_len=0xb02a35b4) at /Users/luser/build/mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp:300
#7  0x00269a52 in ocsp_GetEncodedOCSPResponseFromRequest ()
#8  0x0026b339 in CERT_CheckOCSPStatus ()
#9  0x0026e4b6 in CERT_VerifyCert ()
#10 0x0026e513 in CERT_VerifyCertNow ()
#11 0x0023d482 in SSL_AuthCertificate ()
#12 0x0215a48b in AuthCertificateCallback (client_data=0x0, fd=0x1c1b85a0, checksig=-1869776314, isServer=-1869776314) at /Users/luser/build/mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp:907
#13 0x0023b31e in ssl3_HandleHandshakeMessage ()
#14 0x0023ca1e in ssl3_HandleRecord ()
#15 0x0023d0b7 in ssl3_GatherCompleteHandshake ()
#16 0x0023e08c in ssl_GatherRecord1stHandshake ()
#17 0x00241fee in ssl_Do1stHandshake ()
#18 0x00242f53 in ssl_SecureSend ()
#19 0x00242fec in ssl_SecureWrite ()
#20 0x00246df4 in ssl_Write ()
#21 0x02157db1 in nsSSLThread::Run (this=0x1b51dba0) at /Users/luser/build/mozilla/security/manager/ssl/src/nsSSLThread.cpp:1029
#22 0x0006189c in _pt_root (arg=0x1b51dc60) at /Users/luser/build/mozilla/nsprpub/pr/src/pthreads/ptthread.c:221
#23 0x90908c55 in _pthread_start ()
#24 0x90908b12 in thread_start ()

looks like it's trying to do some ocsp stuff? If I had to make an uneducated guess, I would say that we hit aus looking for extension updates, and then are screwing around in SSL code when the browser wants to shutdown.

(not currently active) Ted Mielczarek

Comment 6

•

16 years ago

I guess that's in Nick's stack too. This sucks, because this hangs fx-win32-tbox on occasion (usually during the nightly). I guess it could hang anyone's browser if they startup and shutdown too quickly.

Severity: normal → major

Reed Loden [:reed]

Updated

•

16 years ago

Flags: blocking1.9?

Reed Loden [:reed]

Updated

•

16 years ago

Priority: -- → P1

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 8

•

16 years ago

Hit this again today, which is painful because takes another two and a half hours to get a nightly build out.

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

16 years ago

Flags: blocking1.9?

Reed Loden [:reed]

Updated

•

16 years ago

Flags: blocking1.9?

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 9

•

16 years ago

Marking as blocking, bad effects on build infrastructure related to fast startup-shutdown, which is also a pretty common scenario during upgrade to a new Firefox (extension compat checks, EM restarts, etc.)

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 10

•

16 years ago

Once more, with actual flags.

Flags: blocking1.9? → blocking1.9+

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 11

•

16 years ago

Gentle ping - any update on this?

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 12

•

16 years ago

Attached patch [checked in] Temporarily increase the time the browser stays open to 10s — Details — Splinter Review

This will hopefully prevent the tinderbox hanging until this is resolved.

Attachment #309626 - Flags: review?(sayrer)

Nick Thomas [:nthomas] (UTC+12)

Reporter

Updated

•

16 years ago

Attachment #309626 - Flags: review?(ted.mielczarek)

(not currently active) Ted Mielczarek

Comment 13

•

16 years ago

Comment on attachment 309626 [details] [diff] [review]
[checked in] Temporarily increase the time the browser stays open to 10s

worth a shot!

Attachment #309626 - Flags: review?(ted.mielczarek) → review+

Robert Sayre

Updated

•

16 years ago

Attachment #309626 - Flags: review?(sayrer) → review+

Nick Thomas [:nthomas] (UTC+12)

Reporter

Updated

•

16 years ago

Attachment #309626 - Attachment description: Temporarily increase the time the browser stays open to 10s → [checked in] Temporarily increase the time the browser stays open to 10s

timeless

Comment 14

•

16 years ago

http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp&rev=1.62&mark=112#197

perhaps:
nsHTTPDownloadEvent::Run or nsHTTPListener should register an observer for some "shutdown" message and do something when they get it.

Mike Schroepfer

Comment 15

•

16 years ago

We ready to go on this?

(not currently active) Ted Mielczarek

Comment 16

•

16 years ago

Schrep: we have a workaround in place in the PGO script, but I haven't seen any progress on the actual hang.

Kai Engert (:KaiE:)

Assignee

Comment 17

•

16 years ago

I'm working on it now

Kai Engert (:KaiE:)

Assignee

Comment 18

•

16 years ago

I'm able to reproduce this quite reliably using the following approach:
- open 5 tabs
- in the tabs, load the following sites:
  https://www.paypal.com/
  https://www.verisign.com/
  https://signin.ebay.com/
  https://www.thawte.com/
  https://www.godaddy.com/
- use File/Quit
- Select "Save all tabs", and select "do not ask me next time"
- restart firefox
- after the window is up with all 5 tabs, wait about 2 seconds, then press CTRL-Q, requesting to quit without a prompt

With this approach I can reproduce a deadlock 80% of my attempts.

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

16 years ago

Target Milestone: --- → mozilla1.9beta5

Kai Engert (:KaiE:)

Assignee

Comment 19

•

16 years ago

I have a patch, but it needs a bit more polishing. Working on it. ETA later today. My patch works quite reliably.
However, using the STR from comment 18 I still get an occassional deadlock, which seems to be unrelated to the SSL thread, unrelated to OCSP. Rather the deadlock is when the SSL layer attempts to proxy a call to the main thread.

This is just an update, I'll attach more details later.

Kai Engert (:KaiE:)

Assignee

Comment 20

•

16 years ago

Attached patch Patch v1 (obsolete) — Details — Splinter Review

This patch seems to work, I don't see hangs on exit anymore with this patch and my steps to reproduce.

I would like to invite you to test this patch with your own STR.
Thanks!

Kai Engert (:KaiE:)

Assignee

Comment 21

•

16 years ago

Could you please explain your steps to reproduce? How did you call the profile script?

On Linux, I did this, as inspired by comment 0:
  OBJDIR=obj-fire-debug obj-fire-debug/_profile/pgo/profileserver.py

Does NOT hang for me (without patch), so I'd appreciate more detailed descriptions of the STR. Thanks!

Kai Engert (:KaiE:)

Assignee

Comment 22

•

16 years ago

Attached patch Patch v2 (obsolete) — Details — Splinter Review

Minor improvements compared to v1.

My proposed fix isn't small, so let me explain what I learned and the changes in this patch.

The OCSP HTTP download code usually waits for completion
(nsNSSHttpRequestSession::internal_send_receive_attempt).
We need a way to leave that loop if the application is exiting.
I added nsSSLThread::exitRequested() for that purpose.

But that isn't sufficient.
I'd expect the loop to exit after 10 seconds, but it doesn't.
So, some events are not coming in on shutdown.
Or rather: The current code is unable to cancel all loads it
wants to wait for, because:

When this code got implemented initially, an incorrect assumption was made.
It has been assumed there would be only a single OCSP download at any given time,
and tracked in static variable mPendingHTTPRequest.
The assumption is wrong.

I propose to drop the global references to pending OCSP requests
and use a direct reference in each context.
That's why I changed nsCancelHTTPDownloadEvent.

That way we are able to cancel all pending loads, the event dispatcher
will hold the last reference to the events, and they will go away as soon
as the dispatcher releases them.

In order to allow us to cancel from within the "cancel event",
forwarding to the "download event" and executing the actual network
cancellation from there, the "download event" needs a reference
to the loadgroup.

I'm worried about a race between nsHTTPDownloadEvent::Run
and nsHTTPDownloadEvent::Cancel.
Therefore I added a lock to synchronize access to the loadgroup member.

Then I ran into assertions, telling me that loadgroup is not threadsafe.
It happened during assignment to nsCOMPtr.

Because of that I'm using a raw pointer to refer to the loadgroup.
I remember the thread ID of the creation thread, and make sure
we'll only release it if we're on the same thread again
(we should, both actions will happen on the thread that processes events).

I hope this approach is ok. I hope the loadgroup object does not get
created on a different thread using some proxying mechanisms that I don't see.

So, after all these changes I still ran into more problems :-(

First, I got assertions "wrong thread" from I/O-Service and about
standard-url not being thread safe. This was a simply caused by PSM
accessing the string bundle on a thread the I/O service is not happy with.
The assertion happens when using the JAR-URI for the string bundle.

I think this assertion is a separate issue, but why not fix it now.
My attempt to fix is: Load a dummy string at PSM init time.
That seems to init the string bundle objects somehow,
it avoids the assertion in my testing.
I don't know if this is a real fix, but at least it should not hurt.

Finally, one more tricky issue.

During a recent code enhancement, when we fixed an assertion about security-UI
not being thread safe, we changed the code that tracks the security state
of a document window. We added a reentrance detection for notifications to
nsIWebProgress::OnStateChange and OnLocationChange.
I had naively assumed this reentrace would never happen and I was wrong.

The security state tracking gets notified that a load is finished,
it attempts to derive the security state, part of that is doing OCSP.
When the user cancels the load of the document, then we receive stop
notifications, while we're still waiting for its associated OCSP download.

That's a problem. The code to track the security state is complex and
not reentrant (for a single tracked window).

I propose the following workaround:
As soon as we detect reentrace for a given object, we switch that object
into an "inconsistent" state. As long as we're in that state,
we ignore any further events that would influence or update
the security state.

As soon as the reentrance situation has finished (we receive a new
notification back in the original frame), then we'll clear the
tracking state and reset it to insecure.

You'll find that code if you search for "inconsistency".

Attachment #310823 - Attachment is obsolete: true

Kai Engert (:KaiE:)

Assignee

Updated

•

16 years ago

Attachment #310848 - Flags: review?(rrelyea)

Kai Engert (:KaiE:)

Assignee

Comment 23

•

16 years ago

Attached file another hang — Details

Even with the most recent patch I still get occassional hangs!

This time the problem is with proxying of the socket's callback pointer to the main thread. I wish that wouldn't be necessary.

Look at this attachment.
You'll see that the main thread requests termination of the SSL thread, and it wants to wait for it (it really must, because of our NSS resources shutdown story).

However, the SSL thread is trying to produce an error message. Unfortunately, it wants to give nsISSLErrorListener a chance to supress error messages. That's being proxied to the main thread which, well, is busy waiting for the SSL thread.

to be continued...

Kai Engert (:KaiE:)

Assignee

Comment 24

•

16 years ago

Attached file yet another hang — Details

And I got one more deadlock, this time not involving the SSL thread at all.

Occassionally Necko wants to shut down on the main thread. However, on the socket transport thread, it is still trying to set up a new SSL socket.

In nsNSSSocketInfo::SetNotificationCallbacks PSM wants to obtain some information from the provided interfaceRequested. It attempts to proxy to the main thread for querying docshell, well, but the main thread is blocked.

Kai Engert (:KaiE:)

Assignee

Comment 25

•

16 years ago

Attached patch Patch v3 (obsolete) — Details — Splinter Review

All I said before when explaining the proposed patch is still valid.
This patch is based on v2, but has the following changes in addition:

When building a new SSL socket, when the notification-callback-object is passed to the socket, we'll now store the original un-proxied object. And we will not immediately query the docshell to derive the information we require (e.g. whether or not to do external-error-reporting (error page)). We'll postpone that until we require the information.

In addition, I added a safety check. I'm not sure if it will be helpful in all scenarios, but it can't hurt: Whenever the SSL socket requires information from the main thread, and plans to proxy a call to the main thread, it will check whether the SSL thread is about to shut down. If it is, it doesn't make any sense to proceed, and therefore the action will be omitted and we'll return with a failure.


I don't see any hangs with this patch revision, but I'll continue testing.
Please join me :-)

Attachment #310848 - Attachment is obsolete: true

Attachment #310884 - Flags: review?(rrelyea)

Attachment #310848 - Flags: review?(rrelyea)

(not currently active) Ted Mielczarek

Comment 26

•

16 years ago

Kai: before attachment 309626 [details] [diff] [review] was checked in, this was 100% reproducible for me on mac. You may try backing that out and testing.

Kai Engert (:KaiE:)

Assignee

Comment 27

•

16 years ago

(In reply to comment #26)
> Kai: before attachment 309626 [details] [diff] [review] was checked in, this was 100% reproducible for me
> on mac. You may try backing that out and testing.

Ted:

(a) I don't know HOW you tested. Can you please describe what exactly you did to reproduce the hang?

(b) Would you be able to back out the attachment in your local tree, apply my patch, and test that?

(not currently active) Ted Mielczarek

Comment 28

•

16 years ago

Kai: just running the profiling script, like:
"OBJDIR=obj-fx-trunk python obj-fx-trunk/_profile/pgo/profileserver.py"

I'll see if I have a chance to test with the backout+your patch today.

Mike Beltzner [:beltzner, not reading bugmail]

Comment 29

•

16 years ago

We're blocking on this bug: Robert (Rylea) - any ETA of review?

Mike Beltzner [:beltzner, not reading bugmail]

Comment 30

•

16 years ago

Kai: is there anyone else who can review this? rrylea seems to be AFK, and we're holding Firefox 3 Beta 5 on this fix.

Alternatively: do we need to be holding Firefox 3 Beta 5 on this fix?

Mike Schroepfer

Comment 31

•

16 years ago

Dcamp can you do a drive-by?

Kai Engert (:KaiE:)

Assignee

Comment 32

•

16 years ago

(In reply to comment #28)
> Kai: just running the profiling script, like:
> "OBJDIR=obj-fx-trunk python obj-fx-trunk/_profile/pgo/profileserver.py"
> 
> I'll see if I have a chance to test with the backout+your patch today.


Thanks Ted.

I backed out attachment 309626 [details] [diff] [review], and I was able to reproduce your hang using a debug build on Linux.

I applied my patch v3, attachment 310884 [details] [diff] [review], and I no longer hang.

Carsten Book [:Tomcat]

Updated

•

16 years ago

Blocks: 424901

(not currently active) Ted Mielczarek

Comment 33

•

16 years ago

We haven't hit this on fx-win32-tbox since we bumped up the timeout, so unless Kai's patch needs beta exposure for some reason, I think we can take it afterwards.

Priority: P1 → P2

Target Milestone: mozilla1.9beta5 → mozilla1.9

Kai Engert (:KaiE:)

Assignee

Comment 34

•

16 years ago

(In reply to comment #33)
> We haven't hit this on fx-win32-tbox since we bumped up the timeout, so unless
> Kai's patch needs beta exposure for some reason, I think we can take it
> afterwards.

I'd recommend to have beta exposure for this patch.
There are enough changes in this patch that could allow for unexpected regressions, and we'd better see them during beta testing.

Reed Loden [:reed]

Updated

•

16 years ago

Priority: P2 → P1

Target Milestone: mozilla1.9 → mozilla1.9beta5

Mike Schroepfer

Comment 35

•

16 years ago

Kai can you suggest a different reviewer?

Robert Relyea

Comment 36

•

16 years ago

Comment on attachment 310884 [details] [diff] [review]
Patch v3

r+ I've asked kaie to include some comments, since this conditions is subtle and still has some corner cases.

Most importantly there is a corner case where the security UI may not exactly reflect the pages that are loaded. This can only happen when the stop button is pushed or the application exits. Obviously the latter isn't an issue. In the former case, the UI reflects and intermediate state that the page really had while loading, so the state of a stop page is inconsistent to begin with.

The alternative is that the browser hangs.

Attachment #310884 - Flags: review?(rrelyea) → review+

Kai Engert (:KaiE:)

Assignee

Comment 37

•

16 years ago

(In reply to comment #35)
> Kai can you suggest a different reviewer?

I think Bob is the person who has the most up to date knowledge on this code by far.

I guess we're good now. We have just spent 50 minutes discussing the patch, and he just gave r+.

He asked me for comments only. No code changes are planned as of now.

Given the urgency, I propose to land the code now, and I'll do the comments in a follow up patch - today!

Kai Engert (:KaiE:)

Assignee

Updated

•

16 years ago

Attachment #310884 - Flags: approval1.9b5?

Mike Beltzner [:beltzner, not reading bugmail]

Comment 38

•

16 years ago

Comment on attachment 310884 [details] [diff] [review]
Patch v3

a1.9b5=beltzner

Attachment #310884 - Flags: approval1.9b5? → approval1.9b5+

Kai Engert (:KaiE:)

Assignee

Comment 39

•

16 years ago

Checked in.
Marking fixed.

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Kai Engert (:KaiE:)

Assignee

Comment 40

•

16 years ago

Attached patch fix red tinderbox on windows (obsolete) — Details — Splinter Review

Mike Schroepfer

Comment 41

•

16 years ago

Comment on attachment 311638 [details] [diff] [review]
fix red tinderbox on windows

let's get the bustage fix in..

Attachment #311638 - Flags: approval1.9b5+

Kai Engert (:KaiE:)

Assignee

Comment 42

•

16 years ago

Attached patch fix leaks, v5 (obsolete) — Details — Splinter Review

sorry my patch had introduced a leak. I have a trivial new patch that should fix it. basically I forgot to manually release that pointer to the pending load in the success scneario. but it required me to move the tracked pointer to a different object

Attachment #311657 - Flags: review?

Robert Relyea

Comment 43

•

16 years ago

Comment on attachment 311657 [details] [diff] [review]
fix leaks, v5

r+ rrelyea

Attachment #311657 - Flags: review? → review+

Kai Engert (:KaiE:)

Assignee

Comment 44

•

16 years ago

I'm afraid I have to reopen the bug.
The patch causes regressions around secure UI.

For example, open firefox, visit https://certs.starfieldtech.com/images/trust_logo.gif
then set firefox to save your session.

Quit firefox, start again, so it will load that page immediately.

With the patches applied, I get a weird UI.
No security indicators at all.
Page is being reported as insecure.

With the patches reverted, I get secure indicators and even EV status.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Mike Schroepfer

Comment 45

•

16 years ago

Kai/Rob - do we want to just back this out and try and address in time for 1.9RC1 or do you have a safe plan of attack?

Jim Jeffery not reading bug-mail 1/2/11

Comment 46

•

16 years ago

I just download the latest hourly, and when visiting the URL I see EV (green) Larry, and yellow in the Awesome Bar.  
Close/restart Minefield - tab loads and Larry +EV and Awesome bar look OK here, am I missing something ?

Vista HP SP1 
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9b5pre) Gecko/2008032517 Minefield/3.0b5pre Firefox/3.0 ID:2008032517 <-Latest Hourly

Kai Engert (:KaiE:)

Assignee

Comment 47

•

16 years ago

I am ok to back it out.

I don't have a good plan right now, besides spending more time on this and carefully testing for regressions around security UI.

Kai Engert (:KaiE:)

Assignee

Comment 48

•

16 years ago

(In reply to comment #46)
> I just download the latest hourly, and when visiting the URL I see EV (green)
> Larry, and yellow in the Awesome Bar.  
> Close/restart Minefield - tab loads and Larry +EV and Awesome bar look OK here,
> am I missing something ?

I'm using Linux, debug build. Maybe it depends on platform and/or order of events?

Jim Jeffery not reading bug-mail 1/2/11

Comment 49

•

16 years ago

I see now, once you restart if the url above is active in a tab, then the EV looks OK, but once you switch tabs and back to the tab with the url then the EV collapses to a small size and 'Larry' can't be brought up...

Kai Engert (:KaiE:)

Assignee

Comment 50

•

16 years ago

I think the regression has to do with my change to delay querying the information from the NotifiationCallbacks / DocShell.

We could back out only that portion of the patch that introduced this new delay.

This would leave in some of the hangs I saw, but I think it fixes the most serious hangs. I'll do some tests around this idea.

Johnathan Nightingale [:johnath]

Comment 51

•

16 years ago

(In reply to comment #49)
> I see now, once you restart if the url above is active in a tab, then the EV
> looks OK, but once you switch tabs and back to the tab with the url then the EV
> collapses to a small size and 'Larry' can't be brought up... 

Yep, I confirm this on Mac.  It is caused by the initial patch (attachment 310884 [details] [diff] [review]), not the red-fix or the leak-fix that followed it.

If I disable OCSP checking, Larry stops becoming unresponsive (unsurprisingly, since it means this code path isn't travelled), but the EV UI still breaks since these certs are now treated as DV by design.

In no case am I seeing anything logged to the error console.

Mike Connor [:mconnor]

Comment 52

•

16 years ago

At this point I think we should back out and shoot for RC1.  This doesn't seem to be hit as often in the wild as it is in our test harness, and with the workaround we can probably safely defer to RC1...

Mike Schroepfer

Comment 53

•

16 years ago

Agreed with Comment 52, looking at the dates didn't this problem ship in Beta 4?

Let's back out and give Kai and Robert time to do this right...

Carsten Book [:Tomcat]

Comment 54

•

16 years ago

Attached image screenshot — Details

now something is wrong with the EV Cert Display. The Screenshot is from a new Build with this Patch from this Bug.

When i have 2 Sites open (like starfield and british airways) and i change the tabs fast, i get for the starfield site, the starfield favicon....but the british airways text ? 

Not sure if this is also another bug.

Johnathan Nightingale [:johnath]

Comment 55

•

16 years ago

Thanks for the quick response and backout Kai, now let's continue trying to figure out what's going on.  :) 

Tomcat - I suspect it's the same bug.  Both the "Green with no text when coming from http tab" and the "green with wrong text when coming from other EV tab" behaviours sound like the SSLStatus object is not getting set properly on the gBrowser.securityUI object.

I mis-spoke before when I said I was seeing nothing in the console - swapping back and forth, I do get |this._lastStatus is null| being logged from the getIdentityData() code: 

http://mxr.mozilla.org/mozilla/source/browser/base/content/browser.js#6370

Which is not logged without the patch here (i.e. when things work).  That cached value is set here:

http://mxr.mozilla.org/mozilla/source/browser/base/content/browser.js#6407

Which is called from onSecurityChange here:

http://mxr.mozilla.org/mozilla/source/browser/base/content/browser.js#4195

So the moral is, the Identity UI is still getting the *state* changes (hence the colouring on the site button), but since SSLStatus is not being updated properly, it errors out when attempting to set the UI strings, or respond to mouseclicks.

That doesn't explain the source of the error, but I've included it in case it helps the effort to trace it.

Mike Schroepfer

Comment 56

•

16 years ago

Adjusting TM and priority to match this not blocking Beta5

Priority: P1 → P2

Target Milestone: mozilla1.9beta5 → mozilla1.9

Kai Engert (:KaiE:)

Assignee

Comment 57

•

16 years ago

Attached patch regression culprit (obsolete) — Details — Splinter Review

I identified the culprit to be this subset of patch v3.

This new code is supposed to detect an indirect reentrancy. Something the security tracking code is not designed to deal with. However, using the other portions of the patch, I detected this might happen when we quit the application or cancel loading (users presses stop).

So, I had decided, it's best to detect it, and prevent our code from executing, preventing that two parallel executions of the same function (separated by several stack levels) would cause bad things to the internal state.

Now, using the new test case "load secure page as part of session restore", I see this we run into this reentrance scenario. That's bad. The current code simply assumes that all nsIWebProgress notifications for a single target come in in serial order. According to my latest understanding, this assumption is false.

If I'm right, we must either declare that a bug in docshell (or whoever sends our the web progress notifications), or we must enhance the security tracking code to become fully reentrant.

Kai Engert (:KaiE:)

Assignee

Comment 58

•

16 years ago

Attached file unexpected stack — Details

Johnathan Nightingale [:johnath]

Comment 59

•

16 years ago

based on comment 57, it sounds like we want biesi and bz's brains in the room, too.

Boris Zbarsky [:bzbarsky]

Comment 60

•

16 years ago

> The current code simply assumes that all nsIWebProgress notifications for a
> single target come in in serial order.

What do you mean by "target"?  Notifications for a single _request_ should always be serial (the latter not happening before the former return) unless the channel implementation screws up.  But notifications for different requests inside the same loadgroup are a different issue entirely.  Here you're getting an OnLocationChange with aRequest=0x9ae1090.  The PSM code calls into NSS, NSS calls into nsNSSHttpInterface::trySendAndReceiveFcn, which posts an event to do some stuff and then spins the main thread event queue if |running_on_main_thread| (which is true, in this case).  Once you're spinning the event queue, you'll get other notifications.  You'll get arbitrary script execution.  You'll get the browser shutting down in the middle of the NS_ProcessNextEvent call.  You have to deal with all of that if you're going to spin the event queue.

I think I said pretty much this when you brought this up on IRC a few days back....

Think about it this way: the way HTTP works, requests are serialized on a connection.  Which means that if all the connections to the server are already in use, the HTTP request you're making here can't possibly complete until some of the earlier requests complete.  As they proceed and complete, they will send notifications.

Ideally, we'd avoid doing this effectively-synchronous HTTP request here.  The effect that sort of thing has on the UI is very much unsalutary.  In this case, we're polling every 50 microseconds (assuming the OS can manage that sort of resolution).  Then we're doing a bunch of work (PR_IntervalNow() is pretty expensive, if nothing else).  This will pretty much peg the CPU, and might make the UI not so responsive for the 10 seconds or so it takes this request to time out (say if all the connections to the server _are_ tied up).

Is there no possible way to kick the load off, and then return to the event loop and when it finishes handle whatever it returned?  If not, that sounds like a pretty serious issue with NSS here...

Kai Engert (:KaiE:)

Assignee

Comment 61

•

16 years ago

Boris, thanks a lot for your valuable comments.

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 63

•

16 years ago

Is this crash related to the problem here?

http://crash-stats.mozilla.com/report/index/49b8500a-fc5d-11dc-90ee-001a4bd43e5c 

Looks to me like we re-entered an event loop and then ended up re-entering NSS, which seemed to get quite upset about it at the end.

If not, we can file another!

Boris Zbarsky [:bzbarsky]

Comment 64

•

16 years ago

Yeah, that looks like exactly this issue...

Tony Chung [:tchung]

Comment 65

•

16 years ago

nominating for b5 relnote.  Comment 18 has good repro steps.  and it looks like this affects Bug 424901 also, which involves certain extensions.

Keywords: relnote

Kai Engert (:KaiE:)

Assignee

Comment 66

•

16 years ago

This patch will require a new enhanced patch, but I'm afraid, the new patch will have even more non-trivial changes.

Furthermore, I've finally been able to work on the review/updated patch for bug 358438. This is another important non-trivial change.

To make things more funny, the patches from this patch and bug 358438 will heavily overlap.

I think we have two options:
- land bug 358438 asap and postpone the work for this one until it's done
- I start to work on a combined patch immediately

I tend to do the latter.

Kai Engert (:KaiE:)

Assignee

Comment 67

•

16 years ago

I'm adding 383369 to the mix of overlapping patches. Bug 358438 now contains a patch that fixes both of them.

Kai Engert (:KaiE:)

Assignee

Comment 68

•

16 years ago

Summary.

We've learned from Boris that the current page state tracking implementation makes some false assumptions.

During the work on this patch, I realized there must be something wrong. I blamed the fact that my code is canceling stuff. But Boris tells us, what I saw is possible to happen at any time, and the regression after landing confirmed that.

The patch subset which I declared as "regression culprit" only served a single purpose: Detect that reentrancy we can't deal with, panic, and try to recover later. But because this can happen at any time, we panic too often.


So, doing the real fix will involve an enhanced implementation of event tracking. It might require that I introduce queues for incoming events, that can't be processed immediately. It might require posting events to avoid the reentrance.


Do we really have to wait for this new implementation?
We should work on it ASAP but is it necessary to postpone fixing this bug?

I think it's not necessary to wait.

We could take the attached patches, but without the "regression culprit" portion. (Because we're already living with the fact that our code happens to receive events in an order that it might not be prepared to handle. The patch does not make that worse.)

Kai Engert (:KaiE:)

Assignee

Comment 69

•

16 years ago

Attached patch Patch v7 (obsolete) — Details — Splinter Review

This is:
- patch v3 from this bug minus the regression culprit
- tinderbox fix from this bug
- fix leaks v5 from this bug
- the fix for bug 358438, which Honza and I worked on
- the fix for bug 383369, which is not yet reviewed

In my testing this bug fixes the hang on exit, and 358438 and 383369.

It probably does not yet fix the reported crash, but we can work on that in a separate step.

I would like to invite you to test this patch.

timeless

Comment 70

•

16 years ago

+  attribute long subRequestsHighSecurity;
+  attribute long subRequestsLowSecurity;
+  attribute long subRequestsBrokenSecurity;
+  attribute long subRequestsNoSecurity;

these need comments at the very least. personally, i'd much rather these be arrays of nsIURI or something so that I could actually tell my users which requests are problematic. The current properties are misnamed - the word "count" or similar is necessary.

Kai Engert (:KaiE:)

Assignee

Comment 71

•

16 years ago

(In reply to comment #60)
> requests are serialized on a
> connection.  Which means that if all the connections to the server are already
> in use, the HTTP request you're making here can't possibly complete until some
> of the earlier requests complete.  As they proceed and complete, they will send
> notifications.

FYI, that request we are starting is not going to the same server, but to another one (we connect to A, and ask B to confirm A's cert is still ok).


> Ideally, we'd avoid doing this effectively-synchronous HTTP request here.  The
> effect that sort of thing has on the UI is very much unsalutary.

I have some good news, but first some background.


In the past, our processing of nsIWebProgressNotifications did not cause those recursive stacks.

When we introduced EV verification, I had decided that we should only test certs for EV when needed, and as of today, we only really need the EV status for the cert for the toplevel page.

I had used the simple approach: Look for the code that derives the overall security state for the toplevel page, and execute EV verification just before that. Which happened to be in the middle of event processing.

I did not think about threads and potential consequences. Because EV verification will require additional results from OCSP, it caused the stacks that got reported.


But luckily, things have changed just 2 days ago.

Because of a different issue (bug 406755) we had decided that we must execute the EV verification at an earlier point of time. We must execute it, while NSS still has all information from the SSL handshake with the server in memory.

This means, since the landing of bug 406755 attachment 314248 [details] [diff] [review], we're now executing EV verification and OCSP requests from a network thread (I assume it's the SSL thread that runs the NSS handshake callback).

The result of the verification will be attached to the SSL status of the socket/channel.

PSM's nsIWebProgress event processing will still call the same function to obtain the EV verification result, but the execution will no longer involve networking activity, it will use the cached result.


I can confirm this theory because I no longer get the recently added assertion about "unexpected parallel nsIWebProgress OnStateChange and/or OnLocationChange"! (which had inspired me to add the code described further above as the "regression culprit) (and as expected I quickly see the assertion when I back out attachment 314248 [details] [diff] [review] for testing purposes)


These new insights give me much more confidence that my most recent attachment to this bug is a good solution for now.

Kai Engert (:KaiE:)

Assignee

Comment 72

•

16 years ago

Comment on attachment 315037 [details] [diff] [review]
Patch v7

I'm requesting review on this bug.

As said before, it's a combination of patches from 3 bugs.

It's the earlier patch from this bug, minus the (now) unnecessary inconsistency handling. I consider that part already reviewed.

It contains the overlapping fix for bug 358438. That part was mostly prepared by Honza, I reviewed his patch and made adjustments. So that part has my and Honza's collabarative review, but Bob Relyea might want to have a another look.

Finally, it has the rather trivial fix from bug 383369, which still needs review from Bob, too.

Attachment #315037 - Flags: review?(rrelyea)

Carsten Book [:Tomcat]

Comment 73

•

16 years ago

(In reply to comment #69)
> Created an attachment (id=315037) [details]
> Patch v7
> I would like to invite you to test this patch.
> 

Hi,

the Patch works for me and also fix the crash/hang from bug 424901 for me - tested with :

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9pre) Gecko/2008041112 Firefox/3.0pre ID:2008041112

and

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9pre) Gecko/2008041111 Minefield/3.0pre

Boris Zbarsky [:bzbarsky]

Comment 74

•

16 years ago

> This means, since the landing of bug 406755 attachment 314248 [details] [diff] [review], we're
> now executing EV verification and OCSP requests from a network thread

Do you mean that we're executing newChannel and such from off the main thread?  That's not really supported, and will likely cause issues...

But yeah, not doing the sync request from inside a web progress notification should be good.

Kai Engert (:KaiE:)

Assignee

Comment 75

•

16 years ago

(In reply to comment #74)
> 
> Do you mean that we're executing newChannel and such from off the main thread? 
> That's not really supported, and will likely cause issues...

We're initiating it on another thread, but we do it by sending an event to the main thread. (nsHTTPDownloadEvent, NS_DispatchToMainThread)

Boris Zbarsky [:bzbarsky]

Comment 76

•

16 years ago

OK.  So as long as you're OK with other web progress notifications happening while that event is running, things are fine.

Kai Engert (:KaiE:)

Assignee

Comment 77

•

16 years ago

Attached patch Patch v8 (obsolete) — Details — Splinter Review

Bob Relyea now has reviewed all individual patches.
This Patch v8 brings in a small change that he proposed in bug 383369.

I'm marking this one as r=rrelyea, as it's simply a merged version of the individual patches.

I'll quickly do another update of this patch, to address timeless request to add a comment.

Attachment #315219 - Flags: review+

Kai Engert (:KaiE:)

Assignee

Comment 78

•

16 years ago

(In reply to comment #70)
> +  attribute long subRequestsHighSecurity;
> +  attribute long subRequestsLowSecurity;
> +  attribute long subRequestsBrokenSecurity;
> +  attribute long subRequestsNoSecurity;
> 
> these need comments at the very least.

Did you see the big comment just above?
The single comment describes them all, however, ...

> The current properties are misnamed - the word
> "count" or similar is necessary.

I agree and have added the word "count" to their names.


> personally, i'd much rather these be
> arrays of nsIURI or something so that I could actually tell my users which
> requests are problematic. 

That's a good idea, but outside of the scope of these bug fix.

Also note this interface is purely internal to PSM right now.

If you want to propose that more information (expensive) should be made available using additional properties, could you please file a separate enhancement bug?

Kai Engert (:KaiE:)

Assignee

Comment 79

•

16 years ago

Attached patch Patch v9 — Details — Splinter Review

Renamed the interface attributes as described in the previous comment, carrying forward r+

Attachment #310884 - Attachment is obsolete: true

Attachment #311638 - Attachment is obsolete: true

Attachment #311657 - Attachment is obsolete: true

Attachment #311726 - Attachment is obsolete: true

Attachment #315037 - Attachment is obsolete: true

Attachment #315219 - Attachment is obsolete: true

Attachment #315221 - Flags: review+

Attachment #315037 - Flags: review?(rrelyea)

Kai Engert (:KaiE:)

Assignee

Comment 80

•

16 years ago

Comment on attachment 315221 [details] [diff] [review]
Patch v9

Requesting approval.

This bug has blocking1.9
This patch fixes this bug, and incorporates the overlapping fixes from bug 383369 (wanted-next+, wanted1.8.1.x) and bug 358438 (wanted1.9.0.x+, wanted1.8.1.x)

Attachment #315221 - Flags: approval1.9?

Mike Beltzner [:beltzner, not reading bugmail]

Comment 81

•

16 years ago

Comment on attachment 315221 [details] [diff] [review]
Patch v9

a1.9=beltzner

Attachment #315221 - Flags: approval1.9? → approval1.9+

Kai Engert (:KaiE:)

Assignee

Comment 82

•

16 years ago

Comment on attachment 309626 [details] [diff] [review]
[checked in] Temporarily increase the time the browser stays open to 10s

Do you want me to undo this timeout increase, so we can better test whether the hang fix works?

Attachment #309626 - Flags: approval1.9?

Kai Engert (:KaiE:)

Assignee

Comment 83

•

16 years ago

Patch v9 checked in, marking fixed.

Status: REOPENED → RESOLVED

Closed: 16 years ago → 16 years ago

Resolution: --- → FIXED

Mike Beltzner [:beltzner, not reading bugmail]

Comment 84

•

16 years ago

Comment on attachment 309626 [details] [diff] [review]
[checked in] Temporarily increase the time the browser stays open to 10s

Yes, please take this out, but only do it when someone's watching the tree.

Attachment #309626 - Flags: approval1.9? → approval1.9+

Carsten Book [:Tomcat]

Comment 85

•

16 years ago

Hi Kai, i can confirm that the Patch fix the Problem from Bug 424901.

During the Tests for Verification of this Bug with your steps to reproduce from comment #18 i crashed on Debug Builds and Official Windows Nightly Builds. Not sure if this crashes related to this bug, so i filed a new Bug.

I filed Bug 428716 (for the Assertion i got before the crash) and  Bug 428726 for the crash.

Kai Engert (:KaiE:)

Assignee

Comment 86

•

16 years ago

I didn't get that assertion.

I saw the crash from bug 428726, and another crash with a different stack, see new bug 428746.

I don't know what caused those crashes, no idea whether it's a regression from this bug. It could be memory corruption.

Finally, very rarely, I still get a deadlock. It's one of the scenarios I had attempted to unblock, but it's not yet eliminated completely. We can attempt to fix that in a follow up patch. I've filed bug 428749.

timeless

Comment 87

•

16 years ago

comments should really be per attribute. i don't know of *any* interfaces where attributes are just referenced by the interface comment. in general for constants there may be a comment before the first constant (not interface), but this is actually a bug, because it breaks javadoc and friends.

therefor should probably be therefore.

and i'd rather not have to file the bugs.

Kai Engert (:KaiE:)

Assignee

Updated

•

16 years ago

Blocks: 429110

Stacks from "hung" firefox 16 years ago Nick Thomas [:nthomas] (UTC+12) 14.38 KB, text/plain		Details
[checked in] Temporarily increase the time the browser stays open to 10s 16 years ago Nick Thomas [:nthomas] (UTC+12) 491 bytes, patch	sayrer : review+ ted : review+ beltzner : approval1.9+	Details \| Diff \| Splinter Review
Patch v1 16 years ago Kai Engert (:KaiE:) 18.41 KB, patch		Details \| Diff \| Splinter Review
Patch v2 16 years ago Kai Engert (:KaiE:) 18.41 KB, patch		Details \| Diff \| Splinter Review
another hang 16 years ago Kai Engert (:KaiE:) 15.93 KB, text/plain		Details
yet another hang 16 years ago Kai Engert (:KaiE:) 10.46 KB, text/plain		Details
Patch v3 16 years ago Kai Engert (:KaiE:) 29.17 KB, patch	rrelyea : review+ beltzner : approval1.9b5+	Details \| Diff \| Splinter Review
fix red tinderbox on windows 16 years ago Kai Engert (:KaiE:) 822 bytes, patch	mtschrep : approval1.9b5+	Details \| Diff \| Splinter Review
fix leaks, v5 16 years ago Kai Engert (:KaiE:) 7.96 KB, patch	rrelyea : review+	Details \| Diff \| Splinter Review
screenshot 16 years ago Carsten Book [:Tomcat] 80.92 KB, image/jpeg		Details
regression culprit 16 years ago Kai Engert (:KaiE:) 4.58 KB, patch		Details \| Diff \| Splinter Review
unexpected stack 16 years ago Kai Engert (:KaiE:) 8.92 KB, text/plain		Details
Patch v7 16 years ago Kai Engert (:KaiE:) 56.36 KB, patch		Details \| Diff \| Splinter Review
Patch v8 16 years ago Kai Engert (:KaiE:) 57.33 KB, patch	KaiE : review+	Details \| Diff \| Splinter Review
Patch v9 16 years ago Kai Engert (:KaiE:) 57.44 KB, patch	KaiE : review+ beltzner : approval1.9+	Details \| Diff \| Splinter Review