Closed Bug 1685942 Opened 3 years ago Closed 3 years ago

Crash in [@ mozilla::net::nsHttpTransaction::Finish0RTT]

Categories

(Core :: Networking, defect, P1)

Firefox 85
defect

Tracking

()

RESOLVED FIXED
87 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox85 --- wontfix
firefox86 --- wontfix
firefox87 --- fixed

People

(Reporter: furkan, Assigned: kershaw)

References

Details

(Keywords: crash, regression, Whiteboard: [necko-triaged])

Crash Data

Attachments

(1 file)

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/2c3db9f9-54e7-498d-b678-d61050210110

Reason: SIGSEGV /SEGV_MAPERR

Top 10 frames of crashing thread:

0 libxul.so mozilla::net::nsHttpTransaction::Finish0RTT netwerk/protocol/http/nsHttpTransaction.cpp:2902
1 libxul.so mozilla::net::Http2Session::Finish0RTT netwerk/protocol/http/Http2Session.cpp:3459
2 libxul.so mozilla::net::nsHttpConnection::EnsureNPNComplete netwerk/protocol/http/nsHttpConnection.cpp:574
3 libxul.so mozilla::net::nsHttpConnection::OnSocketWritable netwerk/protocol/http/nsHttpConnection.cpp:1953
4 libxul.so {virtual override thunk} 
5 libxul.so mozilla::net::nsSocketOutputStream::OnSocketReady netwerk/base/nsSocketTransport2.cpp:520
6 libxul.so mozilla::net::nsSocketTransport::OnSocketReady netwerk/base/nsSocketTransport2.cpp:2296
7 libxul.so mozilla::net::nsSocketTransportService::Run netwerk/base/nsSocketTransportService2.cpp:1156
8 libxul.so {virtual override thunk} 
9 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1200
Assignee: nobody → dd.mozilla
Severity: -- → S3
Priority: -- → P1
Whiteboard: [necko-triaged]

Looks like a dupe of bug 1685923.

I've been getting these crashes lately (1-2x per day). This is on a Windows laptop running Nightly with Fission disabled. My work computer (where Nightly sees much more use) is MacOS running Nightly+Fission, and I haven't seen the crash there.

If there are any diagnostics, logging, etc. that would be helpful for me to run, let me know. I have a couple of sites that seem to trigger it more frequently and I can probably catch it in the act.

(In reply to Kevin Jacobs [:kjacobs] from comment #1)

Looks like a dupe of bug 1685923.

They are dups.

I've been getting these crashes lately (1-2x per day). This is on a Windows laptop running Nightly with Fission disabled. My work computer (where Nightly sees much more use) is MacOS running Nightly+Fission, and I haven't seen the crash there.

If there are any diagnostics, logging, etc. that would be helpful for me to run, let me know. I have a couple of sites that seem to trigger it more frequently and I can probably catch it in the act.

A http log would be useful, but you would need to run it all the time, because I need to figure out what is exactly happening. it will make browser slow.
If you could do that for a day or until it crashes that would be great.
Logging instruction:
https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

you can use rotate to limit log file.

Thanks!

Flags: needinfo?(kjacobs.bugzilla)

Okay, I've emailed you HTTP logs of the crash with Fission both enabled and disabled.

Let me know if you need me to try anything else.

Flags: needinfo?(kjacobs.bugzilla)
See Also: → 1685923

This is related to HTTPS RR records. We restart a transaction 2 times and we have the transaction 2 time added to nsHttpConnectionMgr.

Kershaw, ca you take a look? I am no sure what is a right solution here. Restarting transaction on error is what cause the problem.

We are also to aggressive when we restart. We are restarting a connection even if it is cancled by nsHttpChannel which is not right. The second restart in the log that I will send you is due to nsHttpChannel canceling transaction because it got a cache response first (race cache with network feature).

Flags: needinfo?(kershaw)

I think we should not restart the transaction when the close reason is not a network error.

Assignee: dd.mozilla → kershaw
Flags: needinfo?(kershaw)
Attachment #9198590 - Attachment description: Bug 1685942 - Only fallback to original conn info when network error happens → Bug 1685942 - Only fallback to original conn info when the close reason is a network hard error or a security error
Attachment #9198590 - Attachment description: Bug 1685942 - Only fallback to original conn info when the close reason is a network hard error or a security error → Bug 1685942 - Only fallback to original conn info when network error happens

I've tried to create a short list of error codes that are allowed to restart transactions with HTTPS RR.

static inline bool AllowedErrorForHTTPSRRFallback(nsresult aError) {
  return psm::IsNSSErrorCode(-1 * NS_ERROR_GET_CODE(aError)) ||
         aError == NS_ERROR_NET_RESET ||
         aError == NS_ERROR_CONNECTION_REFUSED ||
         aError == NS_ERROR_UNKNOWN_HOST;
}

I assume we want to be conservative about restarting transactions and I am also wondering if we want to add NS_ERROR_NET_TIMEOUT into this list, but there are couple of places (here and here) that channels could be canceled by this error code. Currently, we don't know if the transaction is closed by nsHttpChannel::Cancel or not, so it'd be a bit tricky if we want to add NS_ERROR_NET_TIMEOUT into this list.

I also tried to figure out if nsHttpChannel::Cancel could be called by the error codes listed above. There are two cases where this could happen.

  1. nsHttpChannel::Cancel is called by nsHttpChannel::ProcessFailedProxyConnect at here. This only happens when CONNECT is used, so we can easily filter this case out (I should file another bug for this as described here).
  2. nsHttpChannel::Cancel is called with it's mStatus error code. For example, in HTMLMediaElement::MediaLoadListener::OnStartRequest it returns the request's status when it's failed and then Cancel is called in HttpChannelChild::DoOnStartRequest. This should be fine, since the transaction should be already closed.

Except two cases above, I think it should be safe to restart transactions with these error codes.
Dragana, could you take a look and let me know your thoughts? Thanks.

Flags: needinfo?(dd.mozilla)

Thank you for investigating this!!!!!!

(In reply to Kershaw Chang [:kershaw] from comment #7)

I've tried to create a short list of error codes that are allowed to restart transactions with HTTPS RR.

static inline bool AllowedErrorForHTTPSRRFallback(nsresult aError) {
  return psm::IsNSSErrorCode(-1 * NS_ERROR_GET_CODE(aError)) ||
         aError == NS_ERROR_NET_RESET ||
         aError == NS_ERROR_CONNECTION_REFUSED ||
         aError == NS_ERROR_UNKNOWN_HOST;
}

I assume we want to be conservative about restarting transactions and I am also wondering if we want to add NS_ERROR_NET_TIMEOUT into this list, but there are couple of places (here and here) that channels could be canceled by this error code. Currently, we don't know if the transaction is closed by nsHttpChannel::Cancel or not, so it'd be a bit tricky if we want to add NS_ERROR_NET_TIMEOUT into this list.

This is worrying. I think we should include NS_ERROR_NET_TIMEOUT. At lease the firs link uses that error code so that a proper error page is shown. I do not really know what the second link is doing :)

What about introducing NS_ERROR_NET_TIMEOUT_EXTERNAL ? that behaves the same as NS_ERROR_NET_TIMEOUT, but we can distinguish them.

I also tried to figure out if nsHttpChannel::Cancel could be called by the error codes listed above. There are two cases where this could happen.

  1. nsHttpChannel::Cancel is called by nsHttpChannel::ProcessFailedProxyConnect at here. This only happens when CONNECT is used, so we can easily filter this case out (I should file another bug for this as described here).

This will probably never happen. As you pointed out we should not use ECH for the connection to the proxy (only for the inner TLS to the end server), and we cannot use it for altSvc also because HTTP3 does not support CONNECT at moment.

  1. nsHttpChannel::Cancel is called with it's mStatus error code. For example, in HTMLMediaElement::MediaLoadListener::OnStartRequest it returns the request's status when it's failed and then Cancel is called in HttpChannelChild::DoOnStartRequest. This should be fine, since the transaction should be already closed.

Except two cases above, I think it should be safe to restart transactions with these error codes.
Dragana, could you take a look and let me know your thoughts? Thanks.

This is a good set of errors. Let me know what you think about NS_ERROR_NET_TIMEOUT_EXTERNAL

Flags: needinfo?(dd.mozilla)

This is a good set of errors. Let me know what you think about NS_ERROR_NET_TIMEOUT_EXTERNAL

Thanks for this suggestion.
I think the way we want to proceed is to avoid channels being cancelled with hard error codes, so I'll create this new error code.

Blocks: 1685923
See Also: 1685923
Keywords: regression
Version: unspecified → Firefox 85

¡Hola y'all!

Nightly crashed like https://crash-stats.mozilla.org/report/index/ba9b040d-20a5-456e-88c1-0b5d30210204

Is this the same bug?

Also, should this be set NEW now?

¡Gracias!
Alex

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/8a2fb0aee75c
Only fallback to original conn info when network error happens r=necko-reviewers,valentin,dragana
Status: UNCONFIRMED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 87 Branch

The patch landed in nightly and beta is affected.
:kershaw, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(kershaw)

Low level of crashes in beta and we are entering RC week, wontfix 86.

Flags: needinfo?(kershaw)
Regressions: 1692120
Crash Signature: [@ mozilla::net::nsHttpTransaction::Finish0RTT] → [@ mozilla::net::nsHttpTransaction::Finish0RTT] [@ mozilla::net::nsHttpTransaction::ReadSegments]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: