Closed Bug 1362978 Opened 2 years ago Closed 2 years ago

Crash in nsCOMPtr_base::assign_with_AddRef | nsInputStreamReadyEvent::OnInputStreamReady

Categories

(Core :: Networking: HTTP, defect, critical)

55 Branch
Unspecified
Windows 10
defect
Not set
critical

Tracking

()

RESOLVED DUPLICATE of bug 1363372
Tracking Status
firefox55 --- disabled
firefox56 --- disabled

People

(Reporter: calixte, Assigned: dragana)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [clouseau][necko-active])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-ee1732cc-d833-4fd3-b632-54e4a0170507.
=============================================================

There are 8 crashes in nightly 55 with buildid 20170507030205. In analyzing the backtrace, the regression may have introduced by patch [1] to fix bug 1359938.

[1] https://hg.mozilla.org/mozilla-central/rev?node=4a3896a7e29f605d0bc509ca98095c1960aa70cf
Flags: needinfo?(dd.mozilla)
There are 4 crashes with signature "RefPtr<T>::assign_with_AddRef | nsInputStreamReadyEvent::OnInputStreamReady" and buildid 20170506030204.
Crash Signature: [@ nsCOMPtr_base::assign_with_AddRef | nsInputStreamReadyEvent::OnInputStreamReady] → [@ nsCOMPtr_base::assign_with_AddRef | nsInputStreamReadyEvent::OnInputStreamReady] [@ RefPtr<T>::assign_with_AddRef | nsInputStreamReadyEvent::OnInputStreamReady ]
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Flags: needinfo?(dd.mozilla)
Whiteboard: [clouseau] → [clouseau][necko-active]
This one is not an easy one. I have a theory:

This is a crash in inputStream. During TFO we should only wait for outputStream, i.e. we are waiting for a connect. After connect TFO is finished the connection continues to perform in the same way as before.

I was assuming we never call ResumeRecv and I made sure that nsHttpConnection flow does not do it, but h2 does it on its own.
H2 do call ResumeRecv independent of nsHttpConnection. So when we cancel TFO connection we do not clear inputStream callbacks and if the underlying socketTransport is reused, this can happen after RecoverFromError is called and socketTransport retries and reconnects, it can happend that inputStream still has the old callback, which is a old transaction and not the one new one...

I will make a patch that clears inputStream callbacks if TFO connection gets canceled.
I will also disallow call to inputStream->AsyncWait if we are in TFO, but I will ask Patrick to confirm that I will not mess up h2 work flow, we knows H2 code.

I am not sure that this is going to solve this problem, let's see.
Crash Signature: [@ nsCOMPtr_base::assign_with_AddRef | nsInputStreamReadyEvent::OnInputStreamReady] [@ RefPtr<T>::assign_with_AddRef | nsInputStreamReadyEvent::OnInputStreamReady ] → [@ nsCOMPtr_base::assign_with_AddRef | nsInputStreamReadyEvent::OnInputStreamReady] [@ RefPtr<T>::assign_with_AddRef | nsInputStreamReadyEvent::OnInputStreamReady] [@ nsCOMPtr_base::assign_assuming_AddRef | nsInputStreamReadyEvent::Run]
May I forbid calling mSocketIn->AsyncWait(nullptr, 0, 0, nullptr); in ResumeRecv if fastopen is in progress? I am asking if whether is this going to break h2. If my theory is correct the other part of the patch will resolve the crash on it own.
Attachment #8866312 - Flags: review?(mcmanus)
Comment on attachment 8866312 [details] [diff] [review]
bug_1362978_v1.patch

Review of attachment 8866312 [details] [diff] [review]:
-----------------------------------------------------------------

I think it is fine to suppress resumerecv as long as it is called unsupressed again when the handshake is over.. and that seems to be missing.

relatedly halfopensocket::setfastopenstatus calls nshttpconnection::setfastopenstatus, but instead of that calling through to its nsAHttpTransaction, the halfopen socket does that directly only if the connection is h1. shouldn't the connection just call through the nsAHttpTransaction which might be h1, or it might be the h2 session which would want to know this for exactly reasons like calling resumerecv() ?
Attachment #8866312 - Flags: review?(mcmanus) → review+
(In reply to Patrick McManus [:mcmanus] from comment #5)
> Comment on attachment 8866312 [details] [diff] [review]
> bug_1362978_v1.patch
> 
> Review of attachment 8866312 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I think it is fine to suppress resumerecv as long as it is called
> unsupressed again when the handshake is over.. and that seems to be missing.

The patch suppresses ResumeRecv in nsHttpConnection if mFastOpen is true. It is set to false in 
nsHttpConnectionMgr::nsHalfOpenSocket::SetFastOpenConnected which is called in both cases on error and on a successful connect

> relatedly halfopensocket::setfastopenstatus calls
> nshttpconnection::setfastopenstatus, but instead of that calling through to
> its nsAHttpTransaction, the halfopen socket does that directly only if the
> connection is h1. shouldn't the connection just call through the
> nsAHttpTransaction which might be h1, or it might be the h2 session which
> would want to know this for exactly reasons like calling resumerecv() ?

You have a point there, but that is a separate issue. This is only used for telemetry. I will fix that I added a check because the function was not implemented in NullHttpTransaction. I will open a separate bug for this one.
Pushed by dd.mozilla@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/34a781e4f023
Remove callbacks on socketInput stream if we cancel a TFO socketTransport. r=mcmanus
Keywords: leave-open
Dragana, should this still be left open?
Flags: needinfo?(dd.mozilla)
(In reply to Mike Taylor [:miketaylr] from comment #9)
> Dragana, should this still be left open?

I have turned off tfo, so I do not know if this is fix or not.
I want to land one more patch and do some testing before turning TFO back on.

I will leave this bug open until I confirm that it is fixed.
Flags: needinfo?(dd.mozilla)
this is still present with TFO from 31.5.
Crash signatures here are still high--between the 3 crash sigs listed above, we're getting around 60 crashes/day on 56.
Flags: needinfo?(dd.mozilla)
This is tfo crash. TFO will be disabled in 56.
Flags: needinfo?(dd.mozilla)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1363372
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.