crash in mozilla::net::nsHttpTransaction::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*)

RESOLVED WORKSFORME

Status

()

P3
critical
RESOLVED WORKSFORME
4 years ago
9 months ago

People

(Reporter: kairo, Assigned: valentin)

Tracking

(Blocks: 2 bugs, {crash, leave-open, topcrash-win})

Trunk
x86
Windows NT
crash, leave-open, topcrash-win
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox37 wontfix, firefox38+ wontfix, firefox39+ wontfix, firefox40+ wontfix, firefox41+ wontfix, firefox42- affected, firefox-esr38 affected)

Details

(Whiteboard: [necko-backlog], crash signature)

Attachments

(3 attachments)

(Reporter)

Description

4 years ago
[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-61c3b717-2934-40f2-ab5b-d0c8e2150413.
=============================================================

This signature is/was a low-volume crash on 37 and older as well as Nightly and Dev Edition channels, but is in the top 20 with ~.7% of all crashes in 38 beta now.

Top frames:
0 	xul.dll 	mozilla::net::nsHttpTransaction::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*) 	netwerk/protocol/http/nsHttpTransaction.cpp
1 	xul.dll 	mozilla::net::nsHttpConnection::OnSocketReadable() 	netwerk/protocol/http/nsHttpConnection.cpp
2 	xul.dll 	mozilla::net::nsHttpConnection::OnInputStreamReady(nsIAsyncInputStream*) 	netwerk/protocol/http/nsHttpConnection.cpp
3 	xul.dll 	nsSocketInputStream::OnSocketReady(nsresult) 	netwerk/base/nsSocketTransport2.cpp
4 	xul.dll 	nsSocketTransport::OnSocketReady(PRFileDesc*, short) 	netwerk/base/nsSocketTransport2.cpp
5 	xul.dll 	nsSocketTransportService::DoPollIteration(bool) 	netwerk/base/nsSocketTransportService2.cpp
6 	nss3.dll 	PR_Assert 	nsprpub/pr/src/io/prlog.c
7 	xul.dll 	nsSocketTransportService::Run() 	netwerk/base/nsSocketTransportService2.cpp
8 		@0xd4b5ff 	
9 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp
10 	xul.dll 	mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp

I saw a bit of variance of stacks there, esp. sometimes Spdy stuff is in there and sometimes stacks are cut off.


Top module correlations on Firefox 38.0b Windows:

78% (166/212	6% (2036/36148	AcLayers.dll
84% (179/212	26% (9359/36148	mpr.dll
83% (175/212	35% (12706/36148	winspool.drv
44% (93/212	8% (2746/36148	sfc.dll
44% (93/212	9% (3277/36148	sfc_os.dll
100% (211/212	69% (24932/36148	FWPUCLNT.DLL
99% (209/212	68% (24704/36148	explorerframe.dll


Not sure if some of those correlated DLLs are causing this or if it's on our code, but given it's almost non-existent in Nightly and Dev Edition but there in beta, I'd lean towards the DLLs playing a role (as beta and release users have way more such things installed usually than Nightly and DevEd users).
(Reporter)

Comment 1

4 years ago
Oh, and as always nowadays, click the link in the Crash Signature field on this bug for more reports and stats on this signature.
I looked at a few of these

https://crash-stats.mozilla.com/report/index/49139b89-536a-4be6-bc8f-b95702150408
https://crash-stats.mozilla.com/report/index/c9ef5a78-6e26-4245-97d8-4b4312150408
https://crash-stats.mozilla.com/report/index/51d0ea31-10df-40d9-bcca-7ec602150408
https://crash-stats.mozilla.com/report/index/9efa67e4-16d9-4a84-be32-63c442150409

they all segv on 0x1c in nsHttpTransaction.cpp
   mPipeOut->WriteSegments(WritePipeSegment, this, count, contWritten);

presumably that means mPipeOut is nullptr; does that mean the whole transaction is uaf
if we did figure this out for 38 it would be good.. any ideas honza or nick?
Flags: needinfo?(hurley)
Flags: needinfo?(honzab.moz)
Could it be that the transaction fails to initialize (Init() fails before we create the pipe) but we don't check the result and use it?
Top crash, tracking.
status-firefox37: --- → wontfix
tracking-firefox38: ? → +
(In reply to Patrick McManus [:mcmanus] from comment #5)
> I think there is just one new/init
> 
> https://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/
> nsHttpChannel.cpp#796

Worth to check if there are no others?
(Reporter)

Comment 8

4 years ago
This is 0.7% of 38.0b6 crashes and sit in the top 20 crash signatures.
Patrick, could you propose a patch as it is a topcrash? Thanks
Flags: needinfo?(mcmanus)
Keywords: topcrash-win
(In reply to Sylvestre Ledru [:sylvestre] from comment #9)
> Patrick, could you propose a patch as it is a topcrash? Thanks

not at this point - I don't know what's wrong. It doesn't particularly look like a h2 or spdy thing, those crashes are present but so are h1 versions.

nick - any ideas?

honza - re comment 7; no others.
Flags: needinfo?(mcmanus)

Comment 11

4 years ago
I'm pretty well stumped at this point (though I will of course keep this at least in the back of my mind). I did see something in nsPipe that would be bad in certain cases (the pipe having mInited = true even if init fails, and the caller doesn't check rv), but that doesn't seem to be an issue in the code related to this bug (nsHttpTransaction doesn't ignore the rv of nsPipe::Init).

A reliable repro (or even an unreliable one!) with an http log would be nice.
Flags: needinfo?(hurley)
Too late for 38. Tracking for 39 in case we have a fix for it.
status-firefox38: affected → wontfix
status-firefox39: --- → affected
tracking-firefox39: --- → +
(Assignee)

Comment 13

4 years ago
The plan we talked about in the Necko meeting:

1. Figure out if mPipeOut is actually null.
2. Add a null check so this does not crash - and uplift it
3. Revert it on nightly, so we don't end up hiding an underlying bug.

This may be due to FWPUCLNT.DLL or explorerframe.dll, but we need to properly check this.
Assignee: nobody → valentin.gosu
(Assignee)

Comment 14

4 years ago
Created attachment 8602356 [details] [diff] [review]
Add assertions and null checks to fix windows crash in nsHttpTransaction
Attachment #8602356 - Flags: review?(mcmanus)
(Assignee)

Comment 15

4 years ago
I added a check for mPipeOut not being null.
I also added a check for mWriter not being null, but I don't think that's really necessary.

If we uplift this and the crashes go away, maybe it could help us figure out why it happens.
Keywords: leave-open
Comment on attachment 8602356 [details] [diff] [review]
Add assertions and null checks to fix windows crash in nsHttpTransaction

Review of attachment 8602356 [details] [diff] [review]:
-----------------------------------------------------------------

doesn't really inspire confidence, does it? :)
Attachment #8602356 - Flags: review?(mcmanus) → review+
Flags: needinfo?(honzab.moz)
(Assignee)

Comment 19

4 years ago
There seem to be 2 crashes on nightly since this patch landed, so this didn't really fix the issue.

https://crash-stats.mozilla.com/search/?signature=~mozilla%3A%3Anet%3A%3AnsHttpTransaction%3A%3AWriteSegments&version=41.0a1&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

I suspect there's an issue with the DLLs, maybe some form of memory corruption?!
I will keep looking at this, but it seems quite tricky.
(Assignee)

Comment 20

4 years ago
I took a look at the ASM code, and it seems that although mPipeOut is not null, for some reason its vtable is null!!
https://pastebin.mozilla.org/8834769

So while
  51361DC4  test        ecx,ecx
will succeed, because the pointer is not null
  51361DCF  mov         eax,dword ptr [ecx]
  51361DE2  call        dword ptr [eax+1Ch]
will make it crash at address 0x1c

All that's left now is figuring out why the vtable ends up NULL (easier said than done)
(Assignee)

Comment 21

4 years ago
I'm trying to narrow down when the breaking change occurred. 

According to the graph, it would seem the changeset we are looking for was somewhere between 8-15 feb 2015.

https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=mozilla%3A%3Anet%3A%3AnsHttpTransaction%3A%3AWriteSegments%28mozilla%3A%3Anet%3A%3AnsAHttpSegmentWriter*%2C+unsigned+int%2C+unsigned+int*%29&page=2#tab-graph

Maybe unrelated, but a lot of the comments mention flash being involved.
(Assignee)

Comment 22

4 years ago
Judging by the date, I'm beginning to  suspect that bug 1100398 introduced a bug in nsPipe3.cpp
Ben, do you think this could be related to bug 1136762?
Though I'm not sure how that would lead to a null vtable for mPipeOut
Flags: needinfo?(bkelly)
(In reply to Valentin Gosu [:valentin] from comment #20)
> I took a look at the ASM code, and it seems that although mPipeOut is not
> null, for some reason its vtable is null!!

This suggests to me that mPipeOut has been overwritten with garbage.  It seems UAF somewhere is the most likely cause of that.

(In reply to Valentin Gosu [:valentin] from comment #22)
> Judging by the date, I'm beginning to  suspect that bug 1100398 introduced a
> bug in nsPipe3.cpp
> Ben, do you think this could be related to bug 1136762?
> Though I'm not sure how that would lead to a null vtable for mPipeOut

I also don't see how bug 1136762 could be the culprit here.

I'll look at the pipe code today to see if I can see anything.  I would expect breakage across more code if it really was the pipe, though, since its used in so many places in the tree.
Note, there have been instances of this crash going back long before FF38:

  https://crash-stats.mozilla.com/search/?product=Firefox&version=!38.0a2&version=!38.0&version=!39.0a1&version=!40.0a1&version=!38.0b9&version=!38.0a1&version=!38.0b8&version=!38.0b6&signature=%3Dmozilla%3A%3Anet%3A%3AnsHttpTransaction%3A%3AWriteSegments%28mozilla%3A%3Anet%3A%3AnsAHttpSegmentWriter*%2C+unsigned+int%2C+unsigned+int*%29&date=%3E11%2F27%2F2014&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

For example, here is one from FF37:

  https://crash-stats.mozilla.com/report/index/ad3cecd8-cc35-47dc-9956-413702150501

And here is one in FF30:

  https://crash-stats.mozilla.com/report/index/ed8bd230-dfeb-44cd-8c48-e159a2150503

This is prior to the introduction of bug 1100398 in FF38.

Now bug 1100398 did change the structure of nsPipe significantly.  Its possible that an existing problem just triggers a crash more frequently with the new pipe object layout.  The nsPipeOutputStream is embedded by value in nsPipe and it moved up in the nsPipe layout by many bytes in bug 1100398.
Flags: needinfo?(bkelly)
How do you guarantee the lifetime of the nsASocketHandler extends past the time the SocketContext::mHandler raw pointer references it?

https://dxr.mozilla.org/mozilla-central/source/netwerk/base/nsSocketTransportService2.h#159
Flags: needinfo?(valentin.gosu)
AFAICT, this does not hold the nsUDPSocket alive:

  https://dxr.mozilla.org/mozilla-central/source/netwerk/base/nsUDPSocket.cpp#556

It seems either the socket service must hold the socket alive or the socket destructor needs a way to remove itself.
(Assignee)

Comment 27

4 years ago
Hi Ben, thanks for looking into the issue. I hadn't found the crash reports from previous versions, probably because the crash address was different. I think you're right, and bug 1100398 made it occur more often, and always crash at the same address.
I'll look further into the socket lifetime issue, and hopefully figure it out. Thanks a lot!
Flags: needinfo?(valentin.gosu)
(Assignee)

Comment 28

4 years ago
Dragana took a look at this issue and pointed out that mPipeOut, since it is a member (by value) of nsPipe, it is kept alive by nsPipeInputStream, which in this case is nsHttpChannel::mTransactionPump.
So this could caused by us releasing mTransactionPump in nsHttpChannel::OnStopRequest() on the main thread, while on the socket thread nsHttpTransaction::WriteSegments() gets called.
So we may need to set mTransactionDone before we null out mTransactionPump. Or better yet, add a new guard variable to let the transaction know that the pipe is going away (mPipeClosing?)
Is there a reason we can't just AddRef() the nsPipeOutputStream directly?
(Assignee)

Comment 30

4 years ago
Actually, you're right, Ben. The transaction already holds a RefPtr to both the input and output, so that can't be the problem. Maybe it's the nsHttpTransaction's destructor that causing the pipe to go away, 
because the channel also sets mTransaction to null in OnStopRequest().
Did you figure out how the lifetime of the socket is guaranteed in comment 25 and comment 26?  Because if the socket is UAF, then the transaction object being operated on could be completely bogus.
(Assignee)

Comment 32

4 years ago
(In reply to Ben Kelly [:bkelly] from comment #31)
> Did you figure out how the lifetime of the socket is guaranteed in comment
> 25 and comment 26?  Because if the socket is UAF, then the transaction
> object being operated on could be completely bogus.

From what I can tell it is manually AddRef'd at
https://dxr.mozilla.org/mozilla-central/source/netwerk/base/nsSocketTransportService2.cpp#216
(In reply to Valentin Gosu [:valentin] from comment #32)
> From what I can tell it is manually AddRef'd at
> https://dxr.mozilla.org/mozilla-central/source/netwerk/base/
> nsSocketTransportService2.cpp#216

Ah, yea.  Ok.  Thanks for looking.
(Assignee)

Comment 34

4 years ago
(In reply to Valentin Gosu [:valentin] from comment #30)
> Maybe it's the
> nsHttpTransaction's destructor that causing the pipe to go away, 
> because the channel also sets mTransaction to null in OnStopRequest().

Actually, I don't think it's possible for the transaction to be released - nsHttpConnection also holds a ref to the transaction, and it's definitely not null because it calls to WriteSegments.

Back to square 1.

Comment 35

4 years ago
Patrick/Nick, we could use some fresh eyes on this bug, if you've got cycles.  I'd throw Honza at it but he's sick.
Flags: needinfo?(mcmanus)

Updated

4 years ago
Flags: needinfo?(hurley)
Wontfix for 39. There are a few hundred crashes showing up in 39.0b5 and 39.0b6. 
Marking 40 as affected.
status-firefox40: --- → affected
status-firefox41: --- → ?
tracking-firefox40: --- → +
tracking-firefox41: --- → +
status-firefox39: affected → wontfix
(Assignee)

Comment 37

4 years ago
I am running out of ideas for this bug, and it seems to me that the crash volume is quite large.
Even though it's super hacky, I'm considering checking the vtable for mPipeOut and if that's null, returning an error code. We could at least avoid being so crashy, even if this might be caused by a plugin or something out of our control.
(Assignee)

Comment 38

4 years ago
Hi David,
We currently have no way of reproducing this crash, and little information about its causes.
It seems the crash occurs when accessing mPipeOut, but the object has a null vtable.
Do you think we could safely check the vtable, and return an error code if it's null? 
I guess this would not be portable, however the crashes are mostly confined to windows x86.

Thanks!
Flags: needinfo?(dbaron)
Are there any common DLLs here that might indicate some kind of 3rd party software is trying to tie into our network stack?
(Assignee)

Comment 40

4 years ago
comment 0 lists some correlations with the most common DLLs.
(In reply to Valentin Gosu [:valentin] from comment #38)
> Do you think we could safely check the vtable, and return an error code if
> it's null? 

That's not a normal way to fix a crash, and might leave bad crashes still present.  There should be a better way.
Flags: needinfo?(dbaron)
(Assignee)

Comment 42

4 years ago
If there's a better way, I can't seem to find it.
We've been looking at this for a months now, and no cause for this seems to pop up.
Looking through the comments from the crash report, a lot of them seem to point out that the Flash player crashes just before this occurs.
So in the absence of a way to reproduce this bug, I can't tell if the problem is in our way of handling plugin crashes, an external library or the Flash plugin itself.
Is there any way to get additional telemetry in the crash reporter?  Can we add boolean flags or counters to the crash report?  That would help us provide more context on these crashes.
(Reporter)

Comment 44

4 years ago
You probably can add some annotations before crashing, if that's possible.
(Assignee)

Comment 45

4 years ago
Hi Benjamin,
We are pretty much out of ideas on this bug, and we wanted to get your opinion on it. Apart from finding someone who can reproduce it, is there a way to check if the crash is caused by the DLLs mentioned in comment 0? I understand you've dealt with stuff like this before.
Flags: needinfo?(benjamin)
Some correlations from version 39:

     63% (148/234) vs.   5% (3173/68114) AcLayers.dll
     72% (168/234) vs.  25% (16936/68114) mpr.dll
     39% (92/234) vs.   7% (4708/68114) sfc.dll
     40% (93/234) vs.   8% (5699/68114) sfc_os.dll
     65% (153/234) vs.  34% (23479/68114) winspool.drv

The numbers aren't very strong. AcLayers and the SFC modules are suspiciously high, but I don't think those modules _cause_ this crash. More likely is that whatever causes this crash also loads those modules.

I agree that papering over the null vtable doesn't seem like a good long-term fix, but maybe we can use that vtable for diagnostics. What if you sprinkle null-checks all over this class and crash if you ever find one? It won't reduce total crash volume but may give you better information on the timeline, e.g. is mPipeOut already busted when you return from NS_NewPipe2?
Flags: needinfo?(benjamin)
(Assignee)

Comment 47

4 years ago
Created attachment 8634378 [details] [diff] [review]
Add diagnostic asserts to check vtable is still present for mPipeOut
Attachment #8634378 - Flags: review?(mcmanus)
Comment on attachment 8634378 [details] [diff] [review]
Add diagnostic asserts to check vtable is still present for mPipeOut

Review of attachment 8634378 [details] [diff] [review]:
-----------------------------------------------------------------

you might also add a check to make sure writesegments isn't being called re-entrantly somehow (a local static bool)
Attachment #8634378 - Flags: review?(mcmanus) → review+
Flags: needinfo?(mcmanus)

Updated

3 years ago
Flags: needinfo?(hurley)
there is a shot in the dark that bug 1152048 shares a root cause here.
(Assignee)

Comment 52

3 years ago
We finally got a crash report following the landing of the last patch.
Interestingly, we haven't had any of the crashes related to the bad vtable - maybe because I nulled out the ref to the pipe in the transaction's Close(), maybe something else changed, or maybe there just weren't enough people testing.

Anyway, from the latest crash it seems that indeed, nsHttpTransaction::WriteSegments is being called reentrantly.

https://crash-stats.mozilla.com/report/index/3af0d06a-d316-4c64-a30d-12d1a2150726
(Assignee)

Comment 53

3 years ago
> https://crash-stats.mozilla.com/report/index/3af0d06a-d316-4c64-a30d-12d1a2150726

Patrick, can you make sense of the stack trace on this crash?
I don't get how this is called reentrantly (WriteSegments should be in the crashing thread twice, right?)
Neither does it seem to be called on a different thread, as nsHttpTransaction does not appear in the stacks of other threads.
Flags: needinfo?(mcmanus)
I've been staring at the stack for the last day wondering the same thing. I don't have a better answer than corruption. It might be interesting to see where that static lives in memory if it got corrupted.
Flags: needinfo?(mcmanus)
(In reply to Valentin Gosu [:valentin] from comment #53)
> > https://crash-stats.mozilla.com/report/index/3af0d06a-d316-4c64-a30d-12d1a2150726
> 
> Neither does it seem to be called on a different thread, as
> nsHttpTransaction does not appear in the stacks of other threads.

This is a slightly weaker conclusion just because of the way the asserts are handled.. when the diagnostic_assert() fired it had to run a bunch of code to handle the assert and stop the other threads - the other cores don't stop synchronously with the assert failure. So its possible that write_segments was executing somehow on 2 threads when the assert() failed, but the other thread completed before the stack trace was taken. Seems unlikely, but not impossible.

the function already assert()s the socket thread, I think based on this you should change that to a diagnostic assert to see if it fires.. so that (along with the comment 54 bit about finding the layout of the static) gives a couple things to do..

and as another hope, when https://bugzilla.mozilla.org/show_bug.cgi?id=1152048 we should track any correlation of this bug in crash stats.
I don't really see any incidence of this crash on 40. Interestingly, there are more reported crashes on Aurora 41 in the past month than on Beta 40. This bug is now wontfix for 40. Please do keep pushing.
status-firefox40: affected → wontfix
status-firefox41: ? → affected
status-firefox42: --- → affected
status-firefox-esr38: --- → affected
tracking-firefox42: --- → +
(Assignee)

Comment 57

3 years ago
Created attachment 8641978 [details] [diff] [review]
Null mPipeOut in nsHttpTransaction after closing. [aurora]
(Assignee)

Comment 58

3 years ago
Comment on attachment 8641978 [details] [diff] [review]
Null mPipeOut in nsHttpTransaction after closing. [aurora]

I would like to uplift part of the landed patch to aurora, more specifically the the part that nulls out mPipeOut after it's closed.
There are a bunch of related crashes that occur in ::~nsHttpTransaction() also because of a null vtable of mPipeOut when releasing the reference.
If memory corruption of its vtable occurs after the pipe is closed, at least this will prevent us from crashing in ::~nsHttpTransaction. This doesn't effectively fix the underlying cause, but it should improve user experience by reducing the number of crashes.

Approval Request Comment
[Feature/regressing bug #]:
This bug.

[User impact if declined]:
Continued crashes in nsHttpTransaction's destructor.

[Describe test coverage new/current, TreeHerder]:
Has been on m-c since July 19th. No crashes in the destructor since then.

[Risks and why]:
This is low risk. Worst case scenario, these crashes still happen, but in a different place, which might indicate the root cause. Best case, a large portion of the crashes stop happening.

[String/UUID change made/needed]:
None
Attachment #8641978 - Flags: feedback?(mcmanus)
Attachment #8641978 - Flags: approval-mozilla-aurora?
valentin's suggestion in comment 58 is reasonable. The worst case for it is really that it moves the crash around to somewhere that might be more actionable.
Attachment #8641978 - Flags: feedback?(mcmanus) → feedback+
Comment on attachment 8641978 [details] [diff] [review]
Null mPipeOut in nsHttpTransaction after closing. [aurora]

Simple fix, Aurora+
Attachment #8641978 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(Assignee)

Updated

3 years ago
Blocks: 1193389
(Assignee)

Updated

3 years ago
Duplicate of this bug: 844583
Based on comment 60, FF41 was in Aurora phase and since merge day of 8-10, this fix should be in Beta41. Updating status to reflect that.
status-firefox41: affected → fixed
I see crashes in 41.0b8, b7 & b6. I don't think it is fixed.
status-firefox41: fixed → affected
Too late to fix in 41.
status-firefox41: affected → wontfix
Stop tracking this bug, I started to track it 5 months ago and we have been shipping several releases with this bug...
tracking-firefox42: + → -

Updated

3 years ago
Crash Signature: [@ mozilla::net::nsHttpTransaction::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*)] → [@ mozilla::net::nsHttpTransaction::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*)] [@ mozilla::net::nsHttpTransaction::WriteSegments]
there are only a small handful of these on 43 or 44 in the last month.. still don't understand it
Whiteboard: [necko-backlog]
(Assignee)

Updated

3 years ago
Blocks: 1196787
there are 0 crashes on >= 46 in the last 28 days and only 5 on some version of 45. (2 of those were betas of 45).
(Assignee)

Comment 71

9 months ago
This crash went away (either fixed or moved to another place).
Status: NEW → RESOLVED
Last Resolved: 9 months ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.