Last Comment Bug 589296 - Crash in [@ nsHttpChannel::ContinueProcessResponse(unsigned int) ][@ nsHttpChannel::ContinueProcessResponse]
: Crash in [@ nsHttpChannel::ContinueProcessResponse(unsigned int) ][@ nsHttpCh...
Status: RESOLVED FIXED
[mozmill], [4b5]
: crash, regression, relnote, topcrash
Product: Core
Classification: Components
Component: Networking: HTTP (show other bugs)
: Trunk
: All All
: -- critical (vote)
: ---
Assigned To: Honza Bambas (:mayhemer)
:
: Patrick McManus [:mcmanus]
Mentors:
: 589843 590094 (view as bug list)
Depends on:
Blocks: 513008
  Show dependency treegraph
 
Reported: 2010-08-20 13:56 PDT by Johnny Stenback (:jst, jst@mozilla.com)
Modified: 2011-06-13 10:01 PDT (History)
27 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
beta7+


Attachments
v1 - backout bug 513008 [Check-in comment 39] (27.58 KB, patch)
2010-09-01 14:31 PDT, Honza Bambas (:mayhemer)
michal.novotny: review+
benjamin: approval2.0+
Details | Diff | Splinter Review

Description Johnny Stenback (:jst, jst@mozilla.com) 2010-08-20 13:56:47 PDT
I've seen this crash on startup randomly twice now (recent trunk builds). Not able to reproduce tho. This was in a build with the patches from bug 513008 and bug 549767 applied, so could in theory be caused by those patches. Anyways, here's the stack:

#2  <signal handler called>
#3  0x00007f1dc7e9d7fa in nsHttpChannel::ContinueProcessResponse (this=
    0x7f1db87f1800, rv=2147500035)
    at ../../../../mozilla/netwerk/protocol/http/nsHttpChannel.cpp:1033
#4  0x00007f1dc7e9889f in nsHttpChannel::OnRedirectVerifyCallback (this=
    0x7f1db87f1800, result=2147500035)
    at ../../../../mozilla/netwerk/protocol/http/nsHttpChannel.cpp:4374
#5  0x00007f1dc7e2e50c in nsAsyncVerifyRedirectCallbackEvent::Run (
    this=<value optimized out>)
    at ../../../../mozilla/netwerk/base/src/nsAsyncRedirectVerifyHelper.cpp:73
#6  0x00007f1dc88207e4 in nsThread::ProcessNextEvent (this=0x7f1dc60491f0, 
    mayWait=0, result=0x7fffb728169c)
    at ../../../mozilla/xpcom/threads/nsThread.cpp:547

The crash is on the following line of code:

    if (mTransaction->SSLConnectFailed()) {

and mTransaction is null. mURI and mOriginalURI in the channel are both at this point "http://en-us.start3.mozilla.com/firefox?client=firefox-a&rls=org.mozilla:en-US:official". Here's what p *this says:

(gdb) p *this
$12 = (nsHttpChannel) {<mozilla::net::HttpBaseChannel> = 
    {<nsHashPropertyBag> = {<nsIWritablePropertyBag> = {<nsIPropertyBag> = 
    {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a7490}, <No data fields>}, <No data fields>}, <nsIWritablePropertyBag2> = {<nsIPropertyBag2> = {<nsIPropertyBag> = {<nsISupports> = {
              _vptr.nsISupports = 
    0x7f1dc92a7900}, <No data fields>}, <No data fields>}, <No data fields>}, 
      mRefCnt = {mValue = 7}, mPropertyHash = 
    {<nsBaseHashtable<nsStringHashKey, nsCOMPtr<nsIVariant>, nsIVariant*>> = 
    {<nsTHashtable<nsBaseHashtableET<nsStringHashKey, nsCOMPtr<nsIVariant> > >> = {mTable = {ops = 0x7f1dc9466c90, data = 0x0, hashShift = 28, maxAlphaFrac = 
    192 '\300', minAlphaFrac = 64 '@', entrySize = 32, entryCount = 1, 
              removedCount = 0, generation = 0, entryStore = 
    0x7f1db6c44e00 ""}}, <No data fields>}, <No data fields>}}, <nsIHttpChannel> = {<nsIChannel> = {<nsIRequest> = {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a79e8}, <No data fields>}, <No data fields>}, <No data fields>}, <nsIHttpChannelInternal> = {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a7b78}, <No data fields>}, <nsIUploadChannel> = {<nsISupports> = {
        _vptr.nsISupports = 
    0x7f1dc92a7be8}, <No data fields>}, <nsIUploadChannel2> = {<nsISupports> = 
    {_vptr.nsISupports = 
    0x7f1dc92a7c20}, <No data fields>}, <nsISupportsPriority> = 
    {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a7c50}, <No data fields>}, <nsIResumableChannel> = 
    {<nsISupports> = {_vptr.nsISupports = 0x7f1dc92a7c90}, <No data fields>}, 
    mURI = {<nsCOMPtr_base> = {mRawPtr = 0x7f1db592c3c0}, <No data fields>}, 
    mOriginalURI = {<nsCOMPtr_base> = {mRawPtr = 
    0x7f1db592c3c0}, <No data fields>}, mDocumentURI = {<nsCOMPtr_base> = {
        mRawPtr = 0x7f1db592c3c0}, <No data fields>}, mListener = 
    {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mListenerContext = 
    {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mLoadGroup = 
    {<nsCOMPtr_base> = {mRawPtr = 0x7f1db5b9e1a0}, <No data fields>}, mOwner = 
    {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mCallbacks = 
    {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mProgressSink = 
    {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mReferrer = 
    {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mApplicationCache = 
    {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mRequestHead = {
      mHeaders = {mHeaders = {<nsTArray_base> = {static sEmptyHdr = {mLength = 
    0, mCapacity = 0, mIsAutoArray = 0}, mHdr = 
    0x7f1db59b5ec0}, <No data fields>}}, mMethod = {_val = 
    0x7f1dc8b52b43 "GET"}, mVersion = 11 '\v', mRequestURI = 
    {<nsACString_internal> = {mData = 
    0x7f1db386d388 "/firefox?client=firefox-a&rls=org.mozilla:en-US:official", 
          mLength = 56, mFlags = 5}, <No data fields>}}, mUploadStream = 
    {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mResponseHead = {
      mRawPtr = 0x7f1db4eb1ba0}, mConnectionInfo = {mRawPtr = 0x7f1db59d7340}, 
    mSpec = {<nsACString_internal> = {mData = 
    0x7f1db59d8108 "http://en-us.start3.mozilla.com/firefox?client=firefox-a&rls=org.mozilla:en-US:official", mLength = 87, mFlags = 5}, <No data fields>}, 
    mContentTypeHint = {<nsACString_internal> = {mData = 0x7f1dc94d9c90 "", 
        mLength = 0, mFlags = 1}, <No data fields>}, mContentCharsetHint = 
    {<nsACString_internal> = {mData = 0x7f1dc94d9c90 "", mLength = 0, mFlags = 
    1}, <No data fields>}, mUserSetCookieHeader = {<nsACString_internal> = {
        mData = 0x7f1dc94d9c90 "", mLength = 0, mFlags = 
    1}, <No data fields>}, mEntityID = {<nsACString_internal> = {mData = 
    0x7f1dc94d9c90 "", mLength = 0, mFlags = 1}, <No data fields>}, 
    mStartPos = 18446744073709551615, mStatus = 2147549183, mLoadFlags = 
    6881280, mPriority = 0, mCaps = 1 '\001', mRedirectionLimit = 20 '\024', 
    mCanceled = 0, mIsPending = 0, mWasOpened = 1, mResponseHeadersModified = 
    0, mAllowPipelining = 1, mForceAllowThirdPartyCookie = 0, 
    mUploadStreamHasHeaders = 0, mInheritApplicationCache = 0, 
    mChooseApplicationCache = 0, mLoadedFromApplicationCache = 
    0}, <nsIStreamListener> = {<nsIRequestObserver> = {<nsISupports> = {
        _vptr.nsISupports = 
    0x7f1dc92a7cc8}, <No data fields>}, <No data fields>}, <nsICachingChannel> = {<nsICacheInfoChannel> = {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a7d08}, <No data fields>}, <No data fields>}, <nsICacheListener> = {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a7db8}, <No data fields>}, <nsIEncodedChannel> = {<nsISupports> = 
    {_vptr.nsISupports = 
    0x7f1dc92a7de8}, <No data fields>}, <nsITransportEventSink> = 
    {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a7e28}, <No data fields>}, <nsIProtocolProxyCallback> = 
    {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a7e58}, <No data fields>}, <nsIHttpAuthenticableChannel> = 
    {<nsIProxiedChannel> = {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a7e88}, <No data fields>}, <No data fields>}, <nsITraceableChannel> = {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a7f30}, <No data fields>}, <nsIApplicationCacheChannel> = 
    {<nsIApplicationCacheContainer> = {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a7f60}, <No data fields>}, <No data fields>}, <nsIAsyncVerifyRedirectCallback> = {<nsISupports> = {_vptr.nsISupports = 
    0x7f1dc92a7fc0}, <No data fields>}, mSecurityInfo = {<nsCOMPtr_base> = {
      mRawPtr = 0x0}, <No data fields>}, mProxyRequest = {<nsCOMPtr_base> = {
      mRawPtr = 0x0}, <No data fields>}, mTransactionPump = {mRawPtr = 0x0}, 
  mTransaction = {mRawPtr = 0x0}, mLogicalOffset = 278, mCacheEntry = 
    {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mCachePump = {
    mRawPtr = 0x0}, mCachedResponseHead = {mRawPtr = 0x0}, mCacheAccess = 0, 
  mPostID = 0, mRequestTime = 1281980612, mOnCacheEntryAvailableCallback = 
    NULL, mAsyncCacheOpen = 1, mOfflineCacheEntry = {<nsCOMPtr_base> = {
      mRawPtr = 0x0}, <No data fields>}, mOfflineCacheAccess = 65280, 
  mOfflineCacheClientID = {<nsACString_internal> = {mData = 0x7f1dc94d9c90 "", 
      mLength = 0, mFlags = 1}, <No data fields>}, mAuthProvider = 
    {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, 
  mPendingAsyncCallOnResume = NULL, mTargetProxyInfo = {<nsCOMPtr_base> = {
      mRawPtr = 0x0}, <No data fields>}, mSuspendCount = 0, mFallbackKey = 
    {<nsACString_internal> = {mData = 0x7f1dc94d9c90 "", mLength = 0, mFlags = 
    1}, <No data fields>}, mRedirectURI = {<nsCOMPtr_base> = {mRawPtr = 
    0x7f1db592c580}, <No data fields>}, mRedirectChannel = {<nsCOMPtr_base> = {
      mRawPtr = 0x0}, <No data fields>}, mRedirectType = 301, 
  mApplyConversion = 1, mCachedContentIsValid = 0, mCachedContentIsPartial = 
    0, mTransactionReplaced = 0, mAuthRetryPending = 0, mResuming = 0, 
  mInitedCacheEntry = 0, mCacheForOfflineUse = 0, mCachingOpportunistically = 
    0, mFallbackChannel = 0, mTracingEnabled = 1, mCustomConditionalRequest = 
    0, mFallingBack = 0, mWaitingForRedirectCallback = 0, mRemoteChannel = 0, 
  mRequestTimeInitialized = 1, mRedirectFuncStack = {<nsTArray_base> = {
      static sEmptyHdr = {mLength = 0, mCapacity = 0, mIsAutoArray = 0}, 
      mHdr = 0x7f1dc94d9788}, <No data fields>}}

That's about all the information I have about this for now, but if specific information would be useful here I can try to dig more out of the core file that I do have saved.
Comment 1 Bjarne (:bjarne) 2010-08-21 02:26:40 PDT
This happens at startup with no stored tabs, just loading the default startpage? IPC-build?
Comment 2 Josh Matthews [:jdm] (on vacation until Dec 5) 2010-08-23 11:29:33 PDT
*** Bug 589843 has been marked as a duplicate of this bug. ***
Comment 3 Paul O'Shannessy [:zpao] (not reading much bugmail, email directly) 2010-08-23 11:35:09 PDT
(coming from bug 589843...) I hit this restoring a session on 32bit OS X 10.5.8
http://crash-stats.mozilla.com/report/index/bp-dab7e8a6-5f45-4572-bf9c-ad6862100823
Comment 4 Johnny Stenback (:jst, jst@mozilla.com) 2010-08-23 11:42:09 PDT
Blocking since others are seeing this as well.

Bjarne, I hit this on startup with a regular trunk Firefox build, so IPC enabled, but not actually doing any IPC for networking. I was not restoring a session, just loading the default start page.
Comment 5 Honza Bambas (:mayhemer) 2010-08-23 12:39:47 PDT
Are any of you experiencing this problem have Firebug installed?
Comment 6 Honza Bambas (:mayhemer) 2010-08-23 12:41:12 PDT
Ah, I found mTracingEnabled = PR_TRUE that is used by Firebug in the p *this, but it is enabled by default.
Comment 7 Honza Bambas (:mayhemer) 2010-08-23 12:44:40 PDT
To all having chance to reproduce this, please set following env vars to always produce logs, it may help to figure out:

NSPR_LOG_FILE=somepath/log.log
NSPR_LOG_MODULES=URILoader:5,LoadGroup:5,DocLoader:5,nsHttp:5,nsSocketTransport:5
Comment 8 Marcia Knous [:marcia - use ni] 2010-08-23 17:08:09 PDT
http://tinyurl.com/23mcb2z shows this is currently the #4 top crash on the trunk, but there are no comments in any of these reports.
Comment 9 Paul O'Shannessy [:zpao] (not reading much bugmail, email directly) 2010-08-23 17:13:45 PDT
(In reply to comment #5)
> Are any of you experiencing this problem have Firebug installed?

I have it installed but not enabled.

(In reply to comment #7)
> To all having chance to reproduce this, please set following env vars to always
> produce logs, it may help to figure out:
> 
> NSPR_LOG_FILE=somepath/log.log
> NSPR_LOG_MODULES=URILoader:5,LoadGroup:5,DocLoader:5,nsHttp:5,nsSocketTransport:5

Will this work in normal nightlies?
Comment 10 Johnny Stenback (:jst, jst@mozilla.com) 2010-08-23 17:24:01 PDT
Per the crash reports this does not appear to be a startup only crash, most happen thousands of seconds after startup.
Comment 11 Johnny Stenback (:jst, jst@mozilla.com) 2010-08-23 17:24:28 PDT
(In reply to comment #9)
...
> Will this work in normal nightlies?

Yes.
Comment 12 Bjarne (:bjarne) 2010-08-24 05:38:31 PDT
To me this looks like nsHttpChannel::OnStopRequest() has been called before the nsHttpChannel is is informed about the decision for the redirect (by  nsAsyncVerifyRedirectCallbackEvent::Run() ). Even tho all the sinks currently respond in a synchronous way, the decision is sent to the nsHttpChannel in an event.

OnStopRequest() will clear a number of members, including mTransaction. See e.g. mCallbacks, mProgressSink and mIsPending.

I can try to whip up a test which triggers this problem.
Comment 13 Josh Matthews [:jdm] (on vacation until Dec 5) 2010-08-24 07:54:04 PDT
*** Bug 590094 has been marked as a duplicate of this bug. ***
Comment 14 Honza Bambas (:mayhemer) 2010-08-24 09:32:16 PDT
I slowly start thinking of bug 513008 is the culprit here.

I it is the only check-in between nightly 20100821040844 and first crashing 20100822040607 (between 5c5c3bf8dfeb and 29ac0213b54c).

I suspect the channel gets resumed or somehow canceled (cleaned) before we get the callback, same as Bjarne suspects.
Comment 15 Honza Bambas (:mayhemer) 2010-08-24 09:59:01 PDT
Checked that http://hg.mozilla.org/mozilla-central/rev/72d2863f43c7 is contained in 20100824040950 nightly and http://hg.mozilla.org/mozilla-central/rev/643758385088 in 20100822040607.

Also jst reported the patch was in his repo when the crash occurred.
Comment 16 Daniel Cater 2010-08-26 15:56:15 PDT
Mozilla/5.0 (X11; Linux i686; rv:2.0b5pre) Gecko/20100826 Minefield/4.0b5pre

Just hit this on Linux: bp-a987478f-6264-4741-8eb3-a1f952100826

Frame 1 is nsHttpChannel::ProcessFailedSSLConnect

I middle-clicked a link and it had just started to load in a background tab, but trying again with that exact same link doesn't crash. This is the first time I've hit this crash (late into using this nightly) and I didn't hit it in yesterday's build which also had the fix for bug 513008, so it's difficult to narrow down when it happens.
Comment 17 Bjarne (:bjarne) 2010-08-27 04:02:37 PDT
(In reply to comment #4)
> Bjarne, I hit this on startup with a regular trunk Firefox build, so IPC
> enabled, but not actually doing any IPC for networking. I was not restoring a
> session, just loading the default start page.

Do you know if your cache is clean when you start? Phrased differently: Do you normally start up with a clean profile/cache, or could it be that cached content was being used when this crash was observed. (Just gathering more information.)

(In reply to comment #16)
> Just hit this on Linux: bp-a987478f-6264-4741-8eb3-a1f952100826
> 
> Frame 1 is nsHttpChannel::ProcessFailedSSLConnect

This stack is a little confusing to me...  could anyone explain how

nsAsyncVerifyRedirectCallbackEvent::Run()  can call
nsHttpChannel::ProcessFailedSSLConnect()   which then calls
nsHttpChannel::ContinueProcessResponse()   ?

If this is not explainable I guess this can be the root of the problem we see...
Comment 18 Honza Bambas (:mayhemer) 2010-08-27 06:46:48 PDT
This is the top crasher.

We should collect logs from people.  Anybody experiencing this crash, please set following environment variables in a shell and start firefox from there:

NSPR_LOG_FILE=somepath/log.log
NSPR_LOG_MODULES=URILoader:5,LoadGroup:5,DocLoader:5,nsHttp:5,nsSocketTransport:5,cache:5

If the crash occurs, then BEFORE you RESTART firefox, please put the log at 'somepath' to this bug as an attachment or just email it to my bug mail.

Thanks a lot, we have to figure this out soon.

Also I vote for backing bug 513008 out for few days to see if it is triggering this crash.  JST, agree?
Comment 19 Johnny Stenback (:jst, jst@mozilla.com) 2010-08-27 11:01:27 PDT
(In reply to comment #17)
> Do you know if your cache is clean when you start? Phrased differently: Do you
> normally start up with a clean profile/cache, or could it be that cached
> content was being used when this crash was observed. (Just gathering more
> information.)

This was not with a clean profile, the start page could very well have been in the cache.
Comment 20 Benjamin Smedberg [:bsmedberg] 2010-08-29 05:26:16 PDT
This is the #1 topcrasher for 4.0b5pre, tentatively marking blocking beta5, unless beltzner disagrees. The earliest I can see in crash-stats is 27-August on Windows, and 23-August on Mac. This has different signatures for Mac/Linux and Windows.
Comment 21 Wes Kocher (:KWierso) 2010-08-29 18:13:38 PDT
I had this crash only happen once for me today, with last night's nightly, as I was opening a link from within Echofon.  http://crash-stats.mozilla.com/report/index/bp-2813c142-9628-4f6a-a4a9-6ab782100829

Haven't been able to reproduce since.
Comment 22 Bjarne (:bjarne) 2010-08-30 08:15:43 PDT
(In reply to comment #14)
> I slowly start thinking of bug 513008 is the culprit here.

After digging through this all day I tend to agree that this is probably caused by bug 513008. Not sure we should put all blame on it, though. :)

Some observations from the printout in comment #0:

mAsyncCacheOpen is set, meaning that OpenCacheEntry() found an entry. It was most likely found in the "normal" cache (let's ignore the app-cache for now) which means OpenNormalCacheEntry() was called. Furthermore, mOnCacheEntryAvailableCallback is cleared, meaning that OnCacheEntryAvailable() was called (before mIsPending was cleared). OnCacheEntryAvailable() has then called OnNormalCacheEntryAvailable(), which eventually calls Connect(FALSE).

Note that if OnNormalCacheEntryAvailable returns a failure-code, OnCacheEntryAvailable() will call CloseCacheEntry() and then AsyncAbort() which in turn calls HandleAsyncNotifyListener(), DoNotifyListener() and finally OnStopRequest(). Perhaps this is what happens here...?

Continuing, mRequestTime and mRequestTimeInitialized are set, meaning that SetupTransaction() was called. This probably happened from the call to Connect(FALSE) mentioned above. But it means that mCacheEntry at that point was not set, or it was set but CheckCache() fails.

What I don't see at the moment is how ProcessResponse() gets called. It means that OnStartRequest() is invoked somehow while mTransaction is set, but I don't see how. A little stuck for the moment...
Comment 23 Mike Beltzner [:beltzner, not reading bugmail] 2010-08-30 09:10:29 PDT
--> beta6+
Comment 24 chris hofmann 2010-08-30 12:03:49 PDT
did you mean to set "2.0-" ?   renominating.
Comment 25 Johnny Stenback (:jst, jst@mozilla.com) 2010-08-30 16:37:16 PDT
I got this crash once more, and this time I had the logging enabled, but the log file was empty! :( Repeated runs of firefox do produce data in the logs, but not that one time when I hit the crash. I can't imagine the logging isn't started before we hit this crash...
Comment 26 Bjarne (:bjarne) 2010-08-31 13:08:22 PDT
(In reply to comment #25)
> I got this crash once more, and this time I had the logging enabled, but the
> log file was empty! :( 

That just makes it a bit more interesting, no...?  :)  (Maybe we should file a new bug for this?)

Seriously: I'll try to summarize results from yesterday into a theory of what happens to the channel causing this crash.

The channel starts loading default startpage for minefield (mURI in dump from comment #0). Connect(TRUE) is called, entering the firstTime-block and calling OpenCacheEntry(). IMO (since |mAsyncCacheOpen| is set) OpenCacheEntry() calls down to OpenNormalCacheEntry() which returns success (if-block which sets |mAsyncCacheOpen| and |mOnCacheEntryAvailableCallback|). Hence, Connect(TRUE) returns success at the end of the firstTime-block.

So, default startpage is found in the cache, and since |mOnCacheEntryAvailableCallback| is cleared I believe OnCacheEntryAvailable() was called. OnNormalCacheEntryAvailable() is therefore called, which finally calls Connect(FALSE). Connect(FALSE) skips firstTime-block and I assume |mCacheEntry| has been set at this point, causing a call to CheckCache().

Now, since |mRequestTime| is set in comment #0, SetupTransaction() must have been called. I believe this is because CheckCache() decides that the entry must be validated, and Connect(FALSE) ends up calling SetupTransaction(). Loading the URI manually from the server confirms that there is no header setting the entry valid for any amount of time. (There are two concerns though: The response-code is 302 as opposed to 301 in comment #0, and Cache-Control: private is set. But lets ignore these for now...)

The channel now goes to the net to load the URI and on the response, ProcessResponse() is called which triggers calls to redirection-sinks, which eventually results in the callback to OnRedirectVerifyCallback(), which crashes.

Honza, Michal: Does this make sense or am I wrong in this analysis? Is there anything in async-reading-from-cache which can cause a call to OnStopRequest() after we have set up the transaction?
Comment 27 Michal Novotny (:michal) 2010-08-31 17:41:07 PDT
(In reply to comment #22)
> mAsyncCacheOpen is set, meaning that OpenCacheEntry() found an entry. It was

That's wrong. mAsyncCacheOpen==PR_TRUE means that AsyncOpenCacheEntry() succeeded and posted a request to the background thread.


> Note that if OnNormalCacheEntryAvailable returns a failure-code,
> OnCacheEntryAvailable() will call CloseCacheEntry() and then AsyncAbort() which
> in turn calls HandleAsyncNotifyListener(), DoNotifyListener() and finally
> OnStopRequest(). Perhaps this is what happens here...?

OnNormalCacheEntryAvailable() fails only in specific cases:

- channel was canceled before OnCacheEntryAvailable()
- cache entry wasn't found and LOAD_ONLY_FROM_CACHE was specified
- Connect(PR_FALSE) failed

Could it be one of these cases?
Comment 28 Michal Novotny (:michal) 2010-08-31 17:42:58 PDT
(In reply to comment #26)
> So, default startpage is found in the cache, and since
> |mOnCacheEntryAvailableCallback| is cleared I believe OnCacheEntryAvailable()
> was called. OnNormalCacheEntryAvailable() is therefore called, which finally
> calls Connect(FALSE). Connect(FALSE) skips firstTime-block and I assume
> |mCacheEntry| has been set at this point, causing a call to CheckCache().

But it could also happen that the startpage isn't found in the cache...


> Honza, Michal: Does this make sense or am I wrong in this analysis? Is there
> anything in async-reading-from-cache which can cause a call to OnStopRequest()
> after we have set up the transaction?

That makes sense with the possibility that the cache entry wasn't found. Regarding the OnStopRequest(), IMO the only possibility is via AsyncAbort() in OnCacheEntryAvailable().
Comment 29 Honza Bambas (:mayhemer) 2010-09-01 07:03:37 PDT
(In reply to comment #25)
> I can't imagine the logging isn't
> started before we hit this crash...

Please add 'sync' among the modules to NSPR_LOG_MODULES.  Sorry, I forgot to mention it...

I have set my env to do this as well, but I cannot reproduce the crash for 3 days of intensive Minefield usage.

Bjarne: I would like to get answer to a different question - how is that possible that we are in the middle of opening a cache entry (waiting for OnCacheEntryAvaialble) and also in the middle of redirecting (waiting for a redirect callback).  I don't think it is possible, but I might be wrong...
Comment 30 Bjarne (:bjarne) 2010-09-01 07:11:46 PDT
(In reply to comment #29)
> Bjarne: I would like to get answer to a different question - how is that
> possible that we are in the middle of opening a cache entry (waiting for
> OnCacheEntryAvaialble) and also in the middle of redirecting (waiting for a
> redirect callback).  I don't think it is possible, but I might be wrong...

I actually think OnCacheEntryAvailable() has been called since |mOnCacheEntryAvailableCallback| is cleared. It must have been set since |mAsyncCacheOpen| is set, and there is only one place it is cleared afaics.
Comment 31 Michal Novotny (:michal) 2010-09-01 07:20:21 PDT
(In reply to comment #30)
> (In reply to comment #29)
> > Bjarne: I would like to get answer to a different question - how is that
> > possible that we are in the middle of opening a cache entry (waiting for
> > OnCacheEntryAvaialble) and also in the middle of redirecting (waiting for a
> > redirect callback).  I don't think it is possible, but I might be wrong...
> 
> I actually think OnCacheEntryAvailable() has been called since
> |mOnCacheEntryAvailableCallback| is cleared. It must have been set since
> |mAsyncCacheOpen| is set, and there is only one place it is cleared afaics.

Exactly. mAsyncCacheOpen is set to PR_FALSE only in OpenCacheEntry() and it's not an error.
Comment 32 alex_mayorga 2010-09-01 13:49:37 PDT
Got this one today http://crash-stats.mozilla.com/report/index/bp-515d31ca-4b59-4fc7-952b-60e4f2100901
on Mozilla/5.0 (Windows NT 6.0; rv:2.0b6pre) Gecko/20100901 Firefox/4.0b6pre is it this bug?
Finding it hard/impossible to reproduce. FWIW I got it shortly after trying to reproduce Bug 590247
Would enable logging from now on in case it happens again.
Comment 33 Honza Bambas (:mayhemer) 2010-09-01 14:31:01 PDT
Created attachment 471285 [details] [diff] [review]
v1 - backout bug 513008 [Check-in comment 39]

Partially backing out nsHttpChannel only changes and disabling ProcessRequest call on the background thread.
Comment 34 Bjarne (:bjarne) 2010-09-01 14:51:32 PDT
Ok, here's a theory:

Say ContinueProcessResponse() is pushed on the manual callstack. Then, a new function f() is pushed on the stack, sinks are invoked followed by WaitForRedirectCallback(). When OnRedirectVerifyCallback() is called it pops f() and executes it. Say now that f() for whatever reason pushes another function g() on the stack and invokes sinks and WaitForRedirectCallback() again. When OnRedirectVerifyCallback() is called the second time it pops and executes g(). Here's the snag: Even if g() for some reason fails and somehow triggers OnStopRequest(), the manual callstack will pop and execute ContinueProcessResponse().

Could this be what we see? I.e. are there functions f() and g() which may cause this scenario? (Too late for me now to read code to figure out - will give it a shot tomorrow if still unresolved.)
Comment 35 Marcia Knous [:marcia - use ni] 2010-09-01 15:11:05 PDT
Adding relnote keyword since I hit this testing Beta 5.
Comment 36 Michal Novotny (:michal) 2010-09-01 17:10:11 PDT
Comment on attachment 471285 [details] [diff] [review]
v1 - backout bug 513008 [Check-in comment 39]

Seems to be correct.
Comment 37 Bjarne (:bjarne) 2010-09-02 07:08:00 PDT
(In reply to comment #34)
> Could this be what we see? I.e. are there functions f() and g() which may cause
> this scenario? (Too late for me now to read code to figure out - will give it a
> shot tomorrow if still unresolved.)

I have not been able to identify candidates for f() and g() yet. Any ideas (or arguments why this is impossible) are appreciated. It's worth mentioning that ContinueProcessResponse() is called with failure-status, otherwise we wouldn't need |mTransaction| to be set.
Comment 38 Honza Bambas (:mayhemer) 2010-09-02 09:36:37 PDT
Comment on attachment 471285 [details] [diff] [review]
v1 - backout bug 513008 [Check-in comment 39]

Not sure this backout/disability patch needs a=2.0...
Comment 39 Honza Bambas (:mayhemer) 2010-09-02 11:15:33 PDT
Comment on attachment 471285 [details] [diff] [review]
v1 - backout bug 513008 [Check-in comment 39]

http://hg.mozilla.org/mozilla-central/rev/a93c9118f16d
Comment 40 Henrik Skupin (:whimboo) 2010-09-05 06:23:37 PDT
We are getting this crash mostly each day for our Mozmill testrun against trunk builds. See http://crash-stats.mozilla.com/report/index/4d04336a-9dd8-41fc-aa0c-388f42100905
Comment 41 Bjarne (:bjarne) 2010-09-05 13:58:46 PDT
(In reply to comment #40)
> We are getting this crash mostly each day for our Mozmill testrun against trunk
> builds. See
> http://crash-stats.mozilla.com/report/index/4d04336a-9dd8-41fc-aa0c-388f42100905

After the backout? Any chance of getting a log?
Comment 42 Henrik Skupin (:whimboo) 2010-09-05 16:27:58 PDT
Looks like the crash happened for our update tests and we weren't able to update our builds on that machine. I have now updated the builds manually, means the build id is now 100906. I'm sure that will fix the crashes.
Comment 43 Honza Bambas (:mayhemer) 2010-09-06 13:00:10 PDT
Back out of bug 513008 apparently helped to get rid of this crash on nightly builds.  Curious is that this crash is exceptionally only windows crash.

We cannot get the log from people now.  If anyone is willing to get log, please use this nightly build, and disable auto-update:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2010-08-31-04-mozilla-central/

I'll try to figure out using a human brain debugger what's going on here.
Comment 44 Daniel Cater 2010-09-06 14:08:00 PDT
(In reply to comment #43)
> Curious is that this crash is exceptionally only windows crash.

This bug was filed on Linux and comment 3 confirmed it on OS X.
Comment 45 Bjarne (:bjarne) 2010-09-07 01:54:40 PDT
(In reply to comment #42)
> Looks like the crash happened for our update tests and we weren't able to
> update our builds on that machine. I have now updated the builds manually,
> means the build id is now 100906. I'm sure that will fix the crashes.

Any chance to identify which part of your tests which caused this crash consistently and report it here so we could try to reproduce? We're pretty much in the dark here..
Comment 46 chris hofmann 2010-09-07 09:51:50 PDT
looks like the backout change did not make it into beta5.  the crash volume is growing fast with this signature and is about 8 times the volume of the #2 crash in early b5 testing before release.  wonder if we should consider respinning b5 to pick up this change?
Comment 47 Daniel Cater 2010-09-07 12:57:23 PDT
(In reply to comment #46)
> looks like the backout change did not make it into beta5.

I assumed that that was intentional given comment 23.
Comment 48 Mike Beltzner [:beltzner, not reading bugmail] 2010-09-07 13:45:19 PDT
Beta 5 was just released; will have to wait for beta 6. I may have screwed this up, but based on the comments at the time of code freeze, it didn't look like a solution was readily apparent, or a known set of triggers was, either.

We should keep an eye on this as the beta goes out there; are we convinced that it's happening randomly / through normal use, or is it exacerbated by a certain type of page or software addition? Based on the volume I would have expected to experience it, and to date have not, though I'm not a perfect litmus test for our userbase (I also missed the Farmville problems, woe!)
Comment 49 Henrik Skupin (:whimboo) 2010-09-08 14:29:44 PDT
(In reply to comment #45)
> (In reply to comment #42)
> > Looks like the crash happened for our update tests and we weren't able to
> > update our builds on that machine. I have now updated the builds manually,
> > means the build id is now 100906. I'm sure that will fix the crashes.
> 
> Any chance to identify which part of your tests which caused this crash
> consistently and report it here so we could try to reproduce? We're pretty much
> in the dark here..

I'm at a conference right now and can't check our Mozmill tests when this crash happens. Anthony, can you have a look at this? You should take a nightly build before Aug 31 and simply check when the build crashes. If you can reproduce it on qa-set I can try to get a stack and local variable output end of this week.
Comment 50 Johnny Stenback (:jst, jst@mozilla.com) 2010-09-08 15:43:21 PDT
Henric and/or Anthony, if you guys try to reproduce this then please make sure that you run with the environment variables set as Honza mentions earlier in this bug, getting logs from a crashing build would be very very valuable here, more so than getting this in a debugger, even. To recap, here's the environment variables you should make sure are set:

NSPR_LOG_FILE=firefox.log
NSPR_LOG_MODULES=URILoader:5,LoadGroup:5,DocLoader:5,nsHttp:5,nsSocketTransport:5,sync

and if you hit this crash, make firefox.log available to Honza and others (ideally attach it to this bug, or if it contains private info we can work out other ways to get the necessary parts to the people who need it).
Comment 51 Mike Beltzner [:beltzner, not reading bugmail] 2010-09-09 06:20:56 PDT
Choff: how has this crash reacted to our wider beta audience? Still near the top?
Comment 52 Marcia Knous [:marcia - use ni] 2010-09-09 07:25:07 PDT
http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/4.0b5 shows this is currently the #1 crash for B5, with 9399 crashes on Windows.

(In reply to comment #51)
> Choff: how has this crash reacted to our wider beta audience? Still near the
> top?
Comment 53 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2010-09-09 15:02:33 PDT
(In reply to comment #49)
> Anthony, can you have a look at this? You should take a nightly build
> before Aug 31 and simply check when the build crashes. If you can reproduce it
> on qa-set I can try to get a stack and local variable output end of this week.

So far, I've been unable to reproduce this crash.
Comment 54 Marcia Knous [:marcia - use ni] 2010-09-09 16:27:59 PDT
Honza: When I review trunk crash-stats, I still see a handful of crashes such as http://crash-stats.mozilla.com/report/index/381abd8a-4146-4764-8b88-e0ba92100908. That person was using yesterday's build. If you look http://tinyurl.com/32e2a78 and select the "Table" tab, you can see that overall the numbers have dwindled significantly since the 2010090100 build.

(In reply to comment #43)
> Back out of bug 513008 apparently helped to get rid of this crash on nightly
> builds.  Curious is that this crash is exceptionally only windows crash.
> 
> We cannot get the log from people now.  If anyone is willing to get log, please
> use this nightly build, and disable auto-update:
> http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2010-08-31-04-mozilla-central/
> 
> I'll try to figure out using a human brain debugger what's going on here.
Comment 55 Honza Bambas (:mayhemer) 2010-09-10 07:17:06 PDT
Marcia, those are tracemonkey builds.  See:
http://hg.mozilla.org/tracemonkey/shortlog/8d80930e10f3

- the latest tracemonkey changeset is 9ce0b72525ba (2010-09-08 01:13 -0700)
- after that you can see merge with mozilla-central where a93c9118f16d is present.

I cannot find the tracemonkey nightly build ID that contains this fix, but I believe this is it.  I cannot see a different cause of the bug then patch for bug 513008.
Comment 56 Mike Beltzner [:beltzner, not reading bugmail] 2010-09-10 14:57:08 PDT
So is this FIXED?
Comment 57 Johnny Stenback (:jst, jst@mozilla.com) 2010-09-10 17:04:02 PDT
I *think* we can call this fixed, but Honza would know with more certainty as he was looking into crash data after this change landed etc. Honza, what's your thoughts, can we close this and open a new bug to track getting the backed out changes re-landed once we figure out what the problem is?
Comment 58 Honza Bambas (:mayhemer) 2010-09-11 05:38:46 PDT
I think we already have that tracking bug.  And I cannot create a fix unless we re-land the backout as the fix is part of that code.  So, I'll fix and re-land in that followup bug.

Closing this one.
Comment 59 Mike Beltzner [:beltzner, not reading bugmail] 2010-09-13 13:40:59 PDT
Can we get this landed on mozilla-central GECKO20b5_20100831_RELBRANCH as well, please? We may want to do a beta6 based on this.
Comment 60 Honza Bambas (:mayhemer) 2010-09-13 13:52:54 PDT
If we get patch from bug 594882 reviewed soon, we may rather land the true fix.  Otherwise I'll do that.
Comment 61 Mike Beltzner [:beltzner, not reading bugmail] 2010-09-13 14:44:01 PDT
Please land the fix we know works, which is the backout. We can take the true fix in Beta 6.
Comment 62 Mike Beltzner [:beltzner, not reading bugmail] 2010-09-13 21:46:47 PDT
This was also fixed on the relbranch (see bug 513008 for cset)

Note You need to log in before you can comment on or make changes to this bug.