Closed Bug 1423988 Opened 2 years ago Closed 2 years ago

Crash in mozilla::net::nsHttpTransaction::Finish0RTT

Categories

(Core :: Networking: HTTP, defect, P1, critical)

58 Branch
defect

Tracking

()

VERIFIED FIXED
mozilla59
Tracking Status
firefox-esr52 --- unaffected
firefox57 --- unaffected
firefox58 blocking verified
firefox59 --- verified

People

(Reporter: philipp, Assigned: dragana)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [necko-triaged])

Crash Data

Attachments

(2 files)

[Tracking Requested - why for this release]:
this cross-platform crash signature started taking off in 58.0b and 59.0a1 just a couple of hours hours ago and so far it looks to become a very frequent crash (20% of browser crashes on beta in the past hours).

url correlations or installed addons in crash reports didn't provide any particular lead what might cause this when i was looking into that.


This bug was filed from the Socorro interface and is
report bp-28cd8d9f-59fa-44b4-9864-baa4c0171207.
=============================================================

Top 10 frames of crashing thread:

0 XUL mozilla::net::nsHttpTransaction::Finish0RTT netwerk/protocol/http/nsHttpTransaction.cpp:2343
1 XUL mozilla::net::Http2Session::Finish0RTT netwerk/protocol/http/Http2Stream.cpp:1622
2 XUL mozilla::net::nsHttpConnection::EnsureNPNComplete netwerk/protocol/http/nsHttpConnection.cpp:503
3 XUL mozilla::net::nsHttpConnection::OnSocketWritable netwerk/protocol/http/nsHttpConnection.cpp:1786
4 XUL non-virtual thunk to mozilla::net::nsHttpConnection::OnOutputStreamReady netwerk/protocol/http/nsHttpConnection.cpp:2312
5 XUL mozilla::net::nsSocketOutputStream::OnSocketReady netwerk/base/nsSocketTransport2.cpp:563
6 XUL mozilla::net::nsSocketTransport::OnSocketReady netwerk/base/nsSocketTransport2.cpp:2199
7 XUL mozilla::net::nsSocketTransportService::DoPollIteration netwerk/base/nsSocketTransportService2.cpp:1193
8 XUL mozilla::net::nsSocketTransportService::Run netwerk/base/nsSocketTransportService2.cpp:947
9 XUL non-virtual thunk to mozilla::net::nsSocketTransportService::Run netwerk/base/nsSocketTransportService2.cpp:857

=============================================================
Flags: needinfo?(mcmanus)
Flags: needinfo?(dd.mozilla)
dragana do you think the nss changes are related?
Flags: needinfo?(mcmanus)
In the channel meeting there is a note "tls1.3-middlebox addon updated to 10% of en-US beta" - does this have anything to do with the crash?
Several of the comments mention crashing when reloading a tab.
I have reliable STR:

1) Load https://austinyallhands2017.sched.com/event/CwLb/17-death-defying-stats-2-using-telemetry-data-to-answer-more-questions
2) Force-reload it (Ctrl+Shift+R on Linux/Windows)

I can reproduce it on a fresh profile on a dev-built Firefox as well as on oldish profiles on current Nightly and Beta on Linux and Windows, respectively.
not able to reproduce it here with those steps - perhaps it's depending on the network environment one is in.
does it make a difference when you set security.tls.enable_0rtt_data to false in about:config?
Flags: needinfo?(chutten)
(In reply to [:philipp] from comment #5)
> not able to reproduce it here with those steps - perhaps it's depending on
> the network environment one is in.
> does it make a difference when you set security.tls.enable_0rtt_data to
> false in about:config?

Same with me, haven't yet been able to reproduce this on two different Macs using both Dev Edition and Beta.
(In reply to Chris H-C :chutten from comment #4)
> I have reliable STR:
> 
> 1) Load
> https://austinyallhands2017.sched.com/event/CwLb/17-death-defying-stats-2-
> using-telemetry-data-to-answer-more-questions
> 2) Force-reload it (Ctrl+Shift+R on Linux/Windows)
> 
> I can reproduce it on a fresh profile on a dev-built Firefox as well as on
> oldish profiles on current Nightly and Beta on Linux and Windows,
> respectively.

Can you create a http log:
https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

you can also send he log to me directly.

Thanks
Marking as blocking at least until we figure out what caused this to spike yesterday.
Here's the network log. Next: trying with the pref unset.
Flags: needinfo?(chutten)
(In reply to [:philipp] from comment #5)
> not able to reproduce it here with those steps - perhaps it's depending on
> the network environment one is in.
> does it make a difference when you set security.tls.enable_0rtt_data to
> false in about:config?

With security.tls.enable_0rtt_data set to false, this bug no longer reproduces.

When I switch it back on, I needed to force-reload twice to reproduce. Then we were back to "normal" (crashing).

Mind if I just leave that as 'false' for a little while?
Adding topcrash keyboard. Almost overnight this crash moved into the #2 top slot in 58.0b9 and is also moving up on nightly 59 (#10).
Keywords: topcrash
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Flags: needinfo?(dd.mozilla)
Priority: -- → P1
Whiteboard: [necko-triaged]
I think this is related with bug 1406908.
(In reply to Dragana Damjanovic [:dragana] from comment #12)
> I think this is related with bug 1406908.

425 response is something new. Probably there was no server implementation until this point, therefore a big jump in crashes.
The crash should be easy to fix. (sorry for a delay there is a public holiday today in Austria :)
This is more interesting than I thought. A transaction is added 2 times to the pending queue :)

And than added 2 times to a h2 session, there we hit this line:
https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/Http2Session.cpp#389
which would assert in debug build.
If a crash rate is to high we can turn off the early-data until I find a proper fix for this.

The problem is in a very old code that was not use that often with h2.

I will need a bit of time to figure out what is the best way to fix this and also to try to find an upliftable patch. Probably we can just turn off the early-data for 58, I am not even sure if tls1.3 will be turned on on 58 release.
So far we have 1370 crashes in 58.0b9 in a short time period, and this crash is already at the top of 58.0b10 crashes. Would suggest for 58 turning off early data if we think that is the cause. On nightly 59 the crash has moved up to #8. Thanks.
next week there will only be one beta & that's built on monday. if there was some stopgap fix that could make it into that somehow, it would be great (as otherwise we may be dragging that problem around for another week and the volume of this crash is rather unhealthy).
When we force-restart a httpTransaction in case of 421 and 425 and the transaction is on a h2 session, the transaction will be added to the httpConnectionMgr's pending queue twice. The problem exist for a long time but 425 was implemented by some server and turned on recently and therefore it is triggered a lot in last couple of days.

Let's turn off the early-data on 59 and 58 for now, until I find a fix.

I am not sure I will have a easy and safe fix for 58, so maybe we can keep it off on 58. I am not sure if tls1.3 will be turned off on the 58 release anyway. I will ask.
Attachment #8936084 - Flags: review?(mcmanus)
I will finish the explanation:
so a transaction is added 2 times to the pending queue and then dispatch to a h2 session 2 times. The h2 session will close the transaction at the moment when it is added the second time:
https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/Http2Session.cpp#389

and then we have a httpTransaction that is closed, i.e. nsHttpTransaction::mConnection==nullptr, and call to Finish0RTT tries to access mConnection and it crashes.

The real problem is that we add the transaction 2 times to nsHttpConnectionMgr.
Attachment #8936084 - Flags: review?(mcmanus) → review+
why does it get added twice?
(In reply to Patrick McManus [:mcmanus] from comment #20)
> why does it get added twice?

Http2Session::WriteSegments calls a transaction to read its data, the transaction gets the complete header, detects 425 and restart the transaction(this adds it to the pending queue), after that we return to Http2Session::WriteSegments with an error that cause the stream to be closed and also calls nshttpTransaction::Close that restarts the transaction again....



2017-12-08 13:49:56.067524 UTC - [Socket Thread]: I/nsHttp Http2Session::WriteSegments 0x7efeaad78000 InternalState 7
2017-12-08 13:49:56.067526 UTC - [Socket Thread]: I/nsHttp Http2Stream::WriteSegments 0x7efedc834320 count=32768 state=4
2017-12-08 13:49:56.067528 UTC - [Socket Thread]: V/nsHttp nsHttpTransaction::WriteSegments 0x7efeb93fa000
2017-12-08 13:49:56.067530 UTC - [Socket Thread]: V/nsHttp nsHttpConnectionMgr::TouchThrottlingTimeWindow
2017-12-08 13:49:56.067535 UTC - [Socket Thread]: I/nsHttp Http2Stream::OnWriteSegment 0x7efedc834320 count=32768 state=4 0xF
2017-12-08 13:49:56.067542 UTC - [Socket Thread]: I/nsHttp Http2Session::ResetDownstreamState() 0x7efeaad78000
2017-12-08 13:49:56.067544 UTC - [Socket Thread]: I/nsHttp Http2Session::ChangeDownstreamState() 0x7efeaad78000 from 7 to 1
2017-12-08 13:49:56.067546 UTC - [Socket Thread]: V/nsHttp nsHttpTransaction::WritePipeSegment 0x7efeb93fa000 written=148
2017-12-08 13:49:56.067547 UTC - [Socket Thread]: V/nsHttp nsHttpTransaction::ProcessData [this=0x7efeb93fa000 count=148]

....

2017-12-08 13:49:56.067635 UTC - [Socket Thread]: I/nsHttp http response [
2017-12-08 13:49:56.067651 UTC - [Socket Thread]: I/nsHttp   HTTP/2.0 425 No Reason Phrase
2017-12-08 13:49:56.067653 UTC - [Socket Thread]: I/nsHttp   server: cloudflare-nginx
2017-12-08 13:49:56.067655 UTC - [Socket Thread]: I/nsHttp   date: Fri, 08 Dec 2017 13:49:56 GMT
2017-12-08 13:49:56.067657 UTC - [Socket Thread]: I/nsHttp   content-length: 0
2017-12-08 13:49:56.067659 UTC - [Socket Thread]: I/nsHttp   cf-ray: 3ca030d94f700fab-YYZ
2017-12-08 13:49:56.067660 UTC - [Socket Thread]: I/nsHttp   X-Firefox-Spdy: h2
2017-12-08 13:49:56.067663 UTC - [Socket Thread]: I/nsHttp     OriginalHeaders
2017-12-08 13:49:56.067666 UTC - [Socket Thread]: I/nsHttp   server: cloudflare-nginx
2017-12-08 13:49:56.067669 UTC - [Socket Thread]: I/nsHttp   date: Fri, 08 Dec 2017 13:49:56 GMT
2017-12-08 13:49:56.067671 UTC - [Socket Thread]: I/nsHttp   content-length: 0
2017-12-08 13:49:56.067673 UTC - [Socket Thread]: I/nsHttp   cf-ray: 3ca030d94f700fab-YYZ
2017-12-08 13:49:56.067675 UTC - [Socket Thread]: I/nsHttp   X-Firefox-Spdy: h2
2017-12-08 13:49:56.067677 UTC - [Socket Thread]: I/nsHttp ]
2017-12-08 13:49:56.067679 UTC - [Socket Thread]: V/nsHttp nsHttpTransaction::CheckForStickyAuthScheme this=0x7efeb93fa000
2017-12-08 13:49:56.067692 UTC - [Socket Thread]: V/nsHttp nsHttpConnection::OnHeadersAvailable [this=0x7efeb97f4200 trans=0x7efeb93fa000 response-head=0x7efedbef9880]
2017-12-08 13:49:56.067696 UTC - [Socket Thread]: V/nsHttp Connection can be reused [this=0x7efeb97f4200 idle-timeout=170sec]
2017-12-08 13:49:56.067699 UTC - [Socket Thread]: V/nsHttp Too Early.
2017-12-08 13:49:56.067712 UTC - [Socket Thread]: V/nsHttp nsHttpTransaction::Close [this=0x7efeb93fa000 reason=804b0014]
2017-12-08 13:49:56.067716 UTC - [Socket Thread]: V/nsHttp nsHttpConnectionMgr::RemoveActiveTransaction t=0x7efeb93fa000 tabid=380000001(1) thr=0
2017-12-08 13:49:56.067720 UTC - [Socket Thread]: V/nsHttp Active transactions -[5,1,2,0]
2017-12-08 13:49:56.067727 UTC - [Socket Thread]: V/nsHttp restarting transaction @0x7efeb93fa000
2017-12-08 13:49:56.067740 UTC - [Socket Thread]: I/nsHttp Http2Session::DontReuse 0x7efeaad78000
2017-12-08 13:49:56.067744 UTC - [Socket Thread]: V/nsHttp nsHttpConnectionMgr::AddTransaction [trans=0x7efeb93fa000 0]
2017-12-08 13:49:56.067747 UTC - [Socket Thread]: D/nsSocketTransport STS dispatch [0x7efebaba4f60]
2017-12-08 13:49:56.067760 UTC - [Socket Thread]: D/nsSocketTransport OnDispatchedEvent Same Thread Skip Signal
2017-12-08 13:49:56.067762 UTC - [Socket Thread]: V/nsHttp nsHttpResponseHead::Reset
2017-12-08 13:49:56.067765 UTC - [Socket Thread]: V/nsHttp transaction force restarted
2017-12-08 13:49:56.067770 UTC - [Socket Thread]: I/nsHttp Http2Session::ResetDownstreamState() 0x7efeaad78000
2017-12-08 13:49:56.067772 UTC - [Socket Thread]: I/nsHttp Http2Session::ChangeDownstreamState() 0x7efeaad78000 from 4 to 1
2017-12-08 13:49:56.067774 UTC - [Socket Thread]: I/nsHttp Http2Session::WriteSegments session=0x7efeaad78000 id 0xF needscleanup=(nil). cleanup stream based on stream->writeSegments returning code 804b0004
2017-12-08 13:49:56.067776 UTC - [Socket Thread]: I/nsHttp Http2Session::CleanupStream 0x7efeaad78000 by ID 0xF to stream 0x7efedc834320
2017-12-08 13:49:56.067778 UTC - [Socket Thread]: I/nsHttp Http2Session::CleanupStream 0x7efeaad78000 0x7efedc834320 0xF 0
2017-12-08 13:49:56.067780 UTC - [Socket Thread]: I/nsHttp Stream 0xF had not processed recv FIN, sending RST code 8
2017-12-08 13:49:56.067782 UTC - [Socket Thread]: I/nsHttp Http2Session::GenerateRst 0x7efeaad78000 0xF 8
2017-12-08 13:49:56.067784 UTC - [Socket Thread]: V/nsHttp Http2Session::LogIO 0x7efeaad78000 stream=(nil) id=0x0 [Generate Reset]
2017-12-08 13:49:56.067788 UTC - [Socket Thread]: V/nsHttp 00000000: 00 00 04 03 00 00 00 00 0F 00 00 00 08
2017-12-08 13:49:56.067791 UTC - [Socket Thread]: D/nsSocketTransport nsSocketOutputStream::Write [this=0x7efeb0d7f690 count=13]
2017-12-08 13:49:56.067794 UTC - [Socket Thread]: D/nsSocketTransport   calling PR_Write [count=13]
2017-12-08 13:49:56.067833 UTC - [Socket Thread]: D/nsSocketTransport   PR_Write returned [n=13]
2017-12-08 13:49:56.067836 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::SendStatus [this=0x7efeb0d7f400 status=804b0005]
2017-12-08 13:49:56.067840 UTC - [Socket Thread]: I/nsHttp Http2Session::FlushOutputQueue 0x7efeaad78000 sz=13 rv=0 actual=13
2017-12-08 13:49:56.067844 UTC - [Socket Thread]: I/nsHttp Http2Session::CloseStream 0x7efeaad78000 0x7efedc834320 0xf 0
2017-12-08 13:49:56.067847 UTC - [Socket Thread]: I/nsHttp MaybeDecrementConcurrent 0x7efeaad78000 id=0xF concurrent=4 active=1
2017-12-08 13:49:56.067850 UTC - [Socket Thread]: V/nsHttp nsHttpTransaction::Close [this=0x7efeb93fa000 reason=0]
2017-12-08 13:49:56.067857 UTC - [Socket Thread]: V/nsHttp restarting transaction @0x7efeb93fa000
2017-12-08 13:49:56.067870 UTC - [Socket Thread]: V/nsHttp nsHttpConnectionMgr::AddTransaction [trans=0x7efeb93fa000 0]
Keywords: checkin-needed
Comment on attachment 8936084 [details] [diff] [review]
bug_1423988.patch

Approval Request Comment
[Feature/Bug causing the regression]: Old code that gets triggered because of bug 1406908. The code from bug 1406908 is only triggered if the tls1.3 early-data is turned on.
[User impact if declined]: crash
[Is this code covered by automated tests?]: all automatic test run without the tls1.3 early-data, so yes.
[Has the fix been verified in Nightly?]: no, but tested in comment 10 
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: The patch only changes a pref that turns off the tls1.3 early-data feature.
[String changes made/needed]: none
Attachment #8936084 - Flags: approval-mozilla-beta?
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4ddaa45e3ec2
Turn odd the tls 1.3 early-data. r=mcmanus
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/4ddaa45e3ec2
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Comment on attachment 8936084 [details] [diff] [review]
bug_1423988.patch

avoid crashing on some sites with tls1.3, beta58+

there's an off by one in the commit message btw (odd vs off) :)
Attachment #8936084 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
No more crashes in nightly and beta since the patches landed.
Status: RESOLVED → VERIFIED
I accidentally changed the tracking flag.
You need to log in before you can comment on or make changes to this bug.