Closed Bug 1153929 Opened 8 years ago Closed 5 years ago
crash in mozilla::net::ns
Http Transaction::Write Segments(mozilla::net::ns AHttp Segment Writer*, unsigned int, unsigned int*)
1.96 KB, patch
|Details | Diff | Splinter Review|
2.54 KB, patch
|Details | Diff | Splinter Review|
955 bytes, patch
|Details | Diff | Splinter Review|
[Tracking Requested - why for this release]: This bug was filed from the Socorro interface and is report bp-61c3b717-2934-40f2-ab5b-d0c8e2150413. ============================================================= This signature is/was a low-volume crash on 37 and older as well as Nightly and Dev Edition channels, but is in the top 20 with ~.7% of all crashes in 38 beta now. Top frames: 0 xul.dll mozilla::net::nsHttpTransaction::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*) netwerk/protocol/http/nsHttpTransaction.cpp 1 xul.dll mozilla::net::nsHttpConnection::OnSocketReadable() netwerk/protocol/http/nsHttpConnection.cpp 2 xul.dll mozilla::net::nsHttpConnection::OnInputStreamReady(nsIAsyncInputStream*) netwerk/protocol/http/nsHttpConnection.cpp 3 xul.dll nsSocketInputStream::OnSocketReady(nsresult) netwerk/base/nsSocketTransport2.cpp 4 xul.dll nsSocketTransport::OnSocketReady(PRFileDesc*, short) netwerk/base/nsSocketTransport2.cpp 5 xul.dll nsSocketTransportService::DoPollIteration(bool) netwerk/base/nsSocketTransportService2.cpp 6 nss3.dll PR_Assert nsprpub/pr/src/io/prlog.c 7 xul.dll nsSocketTransportService::Run() netwerk/base/nsSocketTransportService2.cpp 8 @0xd4b5ff 9 xul.dll NS_ProcessNextEvent(nsIThread*, bool) xpcom/glue/nsThreadUtils.cpp 10 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) ipc/glue/MessagePump.cpp I saw a bit of variance of stacks there, esp. sometimes Spdy stuff is in there and sometimes stacks are cut off. Top module correlations on Firefox 38.0b Windows: 78% (166/212 6% (2036/36148 AcLayers.dll 84% (179/212 26% (9359/36148 mpr.dll 83% (175/212 35% (12706/36148 winspool.drv 44% (93/212 8% (2746/36148 sfc.dll 44% (93/212 9% (3277/36148 sfc_os.dll 100% (211/212 69% (24932/36148 FWPUCLNT.DLL 99% (209/212 68% (24704/36148 explorerframe.dll Not sure if some of those correlated DLLs are causing this or if it's on our code, but given it's almost non-existent in Nightly and Dev Edition but there in beta, I'd lean towards the DLLs playing a role (as beta and release users have way more such things installed usually than Nightly and DevEd users).
Oh, and as always nowadays, click the link in the Crash Signature field on this bug for more reports and stats on this signature.
I looked at a few of these https://crash-stats.mozilla.com/report/index/49139b89-536a-4be6-bc8f-b95702150408 https://crash-stats.mozilla.com/report/index/c9ef5a78-6e26-4245-97d8-4b4312150408 https://crash-stats.mozilla.com/report/index/51d0ea31-10df-40d9-bcca-7ec602150408 https://crash-stats.mozilla.com/report/index/9efa67e4-16d9-4a84-be32-63c442150409 they all segv on 0x1c in nsHttpTransaction.cpp mPipeOut->WriteSegments(WritePipeSegment, this, count, contWritten); presumably that means mPipeOut is nullptr; does that mean the whole transaction is uaf
if we did figure this out for 38 it would be good.. any ideas honza or nick?
Could it be that the transaction fails to initialize (Init() fails before we create the pipe) but we don't check the result and use it?
I think there is just one new/init https://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpChannel.cpp#796
(In reply to Patrick McManus [:mcmanus] from comment #5) > I think there is just one new/init > > https://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/ > nsHttpChannel.cpp#796 Worth to check if there are no others?
This is 0.7% of 38.0b6 crashes and sit in the top 20 crash signatures.
Patrick, could you propose a patch as it is a topcrash? Thanks
(In reply to Sylvestre Ledru [:sylvestre] from comment #9) > Patrick, could you propose a patch as it is a topcrash? Thanks not at this point - I don't know what's wrong. It doesn't particularly look like a h2 or spdy thing, those crashes are present but so are h1 versions. nick - any ideas? honza - re comment 7; no others.
I'm pretty well stumped at this point (though I will of course keep this at least in the back of my mind). I did see something in nsPipe that would be bad in certain cases (the pipe having mInited = true even if init fails, and the caller doesn't check rv), but that doesn't seem to be an issue in the code related to this bug (nsHttpTransaction doesn't ignore the rv of nsPipe::Init). A reliable repro (or even an unreliable one!) with an http log would be nice.
Too late for 38. Tracking for 39 in case we have a fix for it.
The plan we talked about in the Necko meeting: 1. Figure out if mPipeOut is actually null. 2. Add a null check so this does not crash - and uplift it 3. Revert it on nightly, so we don't end up hiding an underlying bug. This may be due to FWPUCLNT.DLL or explorerframe.dll, but we need to properly check this.
Assignee: nobody → valentin.gosu
I added a check for mPipeOut not being null. I also added a check for mWriter not being null, but I don't think that's really necessary. If we uplift this and the crashes go away, maybe it could help us figure out why it happens.
Comment on attachment 8602356 [details] [diff] [review] Add assertions and null checks to fix windows crash in nsHttpTransaction Review of attachment 8602356 [details] [diff] [review]: ----------------------------------------------------------------- doesn't really inspire confidence, does it? :)
Attachment #8602356 - Flags: review?(mcmanus) → review+
There seem to be 2 crashes on nightly since this patch landed, so this didn't really fix the issue. https://crash-stats.mozilla.com/search/?signature=~mozilla%3A%3Anet%3A%3AnsHttpTransaction%3A%3AWriteSegments&version=41.0a1&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports I suspect there's an issue with the DLLs, maybe some form of memory corruption?! I will keep looking at this, but it seems quite tricky.
I took a look at the ASM code, and it seems that although mPipeOut is not null, for some reason its vtable is null!! https://pastebin.mozilla.org/8834769 So while 51361DC4 test ecx,ecx will succeed, because the pointer is not null 51361DCF mov eax,dword ptr [ecx] 51361DE2 call dword ptr [eax+1Ch] will make it crash at address 0x1c All that's left now is figuring out why the vtable ends up NULL (easier said than done)
I'm trying to narrow down when the breaking change occurred. According to the graph, it would seem the changeset we are looking for was somewhere between 8-15 feb 2015. https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=mozilla%3A%3Anet%3A%3AnsHttpTransaction%3A%3AWriteSegments%28mozilla%3A%3Anet%3A%3AnsAHttpSegmentWriter*%2C+unsigned+int%2C+unsigned+int*%29&page=2#tab-graph Maybe unrelated, but a lot of the comments mention flash being involved.
Judging by the date, I'm beginning to suspect that bug 1100398 introduced a bug in nsPipe3.cpp Ben, do you think this could be related to bug 1136762? Though I'm not sure how that would lead to a null vtable for mPipeOut
(In reply to Valentin Gosu [:valentin] from comment #20) > I took a look at the ASM code, and it seems that although mPipeOut is not > null, for some reason its vtable is null!! This suggests to me that mPipeOut has been overwritten with garbage. It seems UAF somewhere is the most likely cause of that. (In reply to Valentin Gosu [:valentin] from comment #22) > Judging by the date, I'm beginning to suspect that bug 1100398 introduced a > bug in nsPipe3.cpp > Ben, do you think this could be related to bug 1136762? > Though I'm not sure how that would lead to a null vtable for mPipeOut I also don't see how bug 1136762 could be the culprit here. I'll look at the pipe code today to see if I can see anything. I would expect breakage across more code if it really was the pipe, though, since its used in so many places in the tree.
Note, there have been instances of this crash going back long before FF38: https://crash-stats.mozilla.com/search/?product=Firefox&version=!38.0a2&version=!38.0&version=!39.0a1&version=!40.0a1&version=!38.0b9&version=!38.0a1&version=!38.0b8&version=!38.0b6&signature=%3Dmozilla%3A%3Anet%3A%3AnsHttpTransaction%3A%3AWriteSegments%28mozilla%3A%3Anet%3A%3AnsAHttpSegmentWriter*%2C+unsigned+int%2C+unsigned+int*%29&date=%3E11%2F27%2F2014&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports For example, here is one from FF37: https://crash-stats.mozilla.com/report/index/ad3cecd8-cc35-47dc-9956-413702150501 And here is one in FF30: https://crash-stats.mozilla.com/report/index/ed8bd230-dfeb-44cd-8c48-e159a2150503 This is prior to the introduction of bug 1100398 in FF38. Now bug 1100398 did change the structure of nsPipe significantly. Its possible that an existing problem just triggers a crash more frequently with the new pipe object layout. The nsPipeOutputStream is embedded by value in nsPipe and it moved up in the nsPipe layout by many bytes in bug 1100398.
How do you guarantee the lifetime of the nsASocketHandler extends past the time the SocketContext::mHandler raw pointer references it? https://dxr.mozilla.org/mozilla-central/source/netwerk/base/nsSocketTransportService2.h#159
AFAICT, this does not hold the nsUDPSocket alive: https://dxr.mozilla.org/mozilla-central/source/netwerk/base/nsUDPSocket.cpp#556 It seems either the socket service must hold the socket alive or the socket destructor needs a way to remove itself.
Hi Ben, thanks for looking into the issue. I hadn't found the crash reports from previous versions, probably because the crash address was different. I think you're right, and bug 1100398 made it occur more often, and always crash at the same address. I'll look further into the socket lifetime issue, and hopefully figure it out. Thanks a lot!
Dragana took a look at this issue and pointed out that mPipeOut, since it is a member (by value) of nsPipe, it is kept alive by nsPipeInputStream, which in this case is nsHttpChannel::mTransactionPump. So this could caused by us releasing mTransactionPump in nsHttpChannel::OnStopRequest() on the main thread, while on the socket thread nsHttpTransaction::WriteSegments() gets called. So we may need to set mTransactionDone before we null out mTransactionPump. Or better yet, add a new guard variable to let the transaction know that the pipe is going away (mPipeClosing?)
Is there a reason we can't just AddRef() the nsPipeOutputStream directly?
Actually, you're right, Ben. The transaction already holds a RefPtr to both the input and output, so that can't be the problem. Maybe it's the nsHttpTransaction's destructor that causing the pipe to go away, because the channel also sets mTransaction to null in OnStopRequest().
Did you figure out how the lifetime of the socket is guaranteed in comment 25 and comment 26? Because if the socket is UAF, then the transaction object being operated on could be completely bogus.
(In reply to Ben Kelly [:bkelly] from comment #31) > Did you figure out how the lifetime of the socket is guaranteed in comment > 25 and comment 26? Because if the socket is UAF, then the transaction > object being operated on could be completely bogus. From what I can tell it is manually AddRef'd at https://dxr.mozilla.org/mozilla-central/source/netwerk/base/nsSocketTransportService2.cpp#216
(In reply to Valentin Gosu [:valentin] from comment #32) > From what I can tell it is manually AddRef'd at > https://dxr.mozilla.org/mozilla-central/source/netwerk/base/ > nsSocketTransportService2.cpp#216 Ah, yea. Ok. Thanks for looking.
(In reply to Valentin Gosu [:valentin] from comment #30) > Maybe it's the > nsHttpTransaction's destructor that causing the pipe to go away, > because the channel also sets mTransaction to null in OnStopRequest(). Actually, I don't think it's possible for the transaction to be released - nsHttpConnection also holds a ref to the transaction, and it's definitely not null because it calls to WriteSegments. Back to square 1.
Patrick/Nick, we could use some fresh eyes on this bug, if you've got cycles. I'd throw Honza at it but he's sick.
Wontfix for 39. There are a few hundred crashes showing up in 39.0b5 and 39.0b6. Marking 40 as affected.
I am running out of ideas for this bug, and it seems to me that the crash volume is quite large. Even though it's super hacky, I'm considering checking the vtable for mPipeOut and if that's null, returning an error code. We could at least avoid being so crashy, even if this might be caused by a plugin or something out of our control.
Hi David, We currently have no way of reproducing this crash, and little information about its causes. It seems the crash occurs when accessing mPipeOut, but the object has a null vtable. Do you think we could safely check the vtable, and return an error code if it's null? I guess this would not be portable, however the crashes are mostly confined to windows x86. Thanks!
Are there any common DLLs here that might indicate some kind of 3rd party software is trying to tie into our network stack?
comment 0 lists some correlations with the most common DLLs.
(In reply to Valentin Gosu [:valentin] from comment #38) > Do you think we could safely check the vtable, and return an error code if > it's null? That's not a normal way to fix a crash, and might leave bad crashes still present. There should be a better way.
If there's a better way, I can't seem to find it. We've been looking at this for a months now, and no cause for this seems to pop up. Looking through the comments from the crash report, a lot of them seem to point out that the Flash player crashes just before this occurs. So in the absence of a way to reproduce this bug, I can't tell if the problem is in our way of handling plugin crashes, an external library or the Flash plugin itself.
Is there any way to get additional telemetry in the crash reporter? Can we add boolean flags or counters to the crash report? That would help us provide more context on these crashes.
You probably can add some annotations before crashing, if that's possible.
Hi Benjamin, We are pretty much out of ideas on this bug, and we wanted to get your opinion on it. Apart from finding someone who can reproduce it, is there a way to check if the crash is caused by the DLLs mentioned in comment 0? I understand you've dealt with stuff like this before.
Some correlations from version 39: 63% (148/234) vs. 5% (3173/68114) AcLayers.dll 72% (168/234) vs. 25% (16936/68114) mpr.dll 39% (92/234) vs. 7% (4708/68114) sfc.dll 40% (93/234) vs. 8% (5699/68114) sfc_os.dll 65% (153/234) vs. 34% (23479/68114) winspool.drv The numbers aren't very strong. AcLayers and the SFC modules are suspiciously high, but I don't think those modules _cause_ this crash. More likely is that whatever causes this crash also loads those modules. I agree that papering over the null vtable doesn't seem like a good long-term fix, but maybe we can use that vtable for diagnostics. What if you sprinkle null-checks all over this class and crash if you ever find one? It won't reduce total crash volume but may give you better information on the timeline, e.g. is mPipeOut already busted when you return from NS_NewPipe2?
Comment on attachment 8634378 [details] [diff] [review] Add diagnostic asserts to check vtable is still present for mPipeOut Review of attachment 8634378 [details] [diff] [review]: ----------------------------------------------------------------- you might also add a check to make sure writesegments isn't being called re-entrantly somehow (a local static bool)
Attachment #8634378 - Flags: review?(mcmanus) → review+
there is a shot in the dark that bug 1152048 shares a root cause here.
We finally got a crash report following the landing of the last patch. Interestingly, we haven't had any of the crashes related to the bad vtable - maybe because I nulled out the ref to the pipe in the transaction's Close(), maybe something else changed, or maybe there just weren't enough people testing. Anyway, from the latest crash it seems that indeed, nsHttpTransaction::WriteSegments is being called reentrantly. https://crash-stats.mozilla.com/report/index/3af0d06a-d316-4c64-a30d-12d1a2150726
> https://crash-stats.mozilla.com/report/index/3af0d06a-d316-4c64-a30d-12d1a2150726 Patrick, can you make sense of the stack trace on this crash? I don't get how this is called reentrantly (WriteSegments should be in the crashing thread twice, right?) Neither does it seem to be called on a different thread, as nsHttpTransaction does not appear in the stacks of other threads.
I've been staring at the stack for the last day wondering the same thing. I don't have a better answer than corruption. It might be interesting to see where that static lives in memory if it got corrupted.
(In reply to Valentin Gosu [:valentin] from comment #53) > > https://crash-stats.mozilla.com/report/index/3af0d06a-d316-4c64-a30d-12d1a2150726 > > Neither does it seem to be called on a different thread, as > nsHttpTransaction does not appear in the stacks of other threads. This is a slightly weaker conclusion just because of the way the asserts are handled.. when the diagnostic_assert() fired it had to run a bunch of code to handle the assert and stop the other threads - the other cores don't stop synchronously with the assert failure. So its possible that write_segments was executing somehow on 2 threads when the assert() failed, but the other thread completed before the stack trace was taken. Seems unlikely, but not impossible. the function already assert()s the socket thread, I think based on this you should change that to a diagnostic assert to see if it fires.. so that (along with the comment 54 bit about finding the layout of the static) gives a couple things to do.. and as another hope, when https://bugzilla.mozilla.org/show_bug.cgi?id=1152048 we should track any correlation of this bug in crash stats.
I don't really see any incidence of this crash on 40. Interestingly, there are more reported crashes on Aurora 41 in the past month than on Beta 40. This bug is now wontfix for 40. Please do keep pushing.
Comment on attachment 8641978 [details] [diff] [review] Null mPipeOut in nsHttpTransaction after closing. [aurora] I would like to uplift part of the landed patch to aurora, more specifically the the part that nulls out mPipeOut after it's closed. There are a bunch of related crashes that occur in ::~nsHttpTransaction() also because of a null vtable of mPipeOut when releasing the reference. If memory corruption of its vtable occurs after the pipe is closed, at least this will prevent us from crashing in ::~nsHttpTransaction. This doesn't effectively fix the underlying cause, but it should improve user experience by reducing the number of crashes. Approval Request Comment [Feature/regressing bug #]: This bug. [User impact if declined]: Continued crashes in nsHttpTransaction's destructor. [Describe test coverage new/current, TreeHerder]: Has been on m-c since July 19th. No crashes in the destructor since then. [Risks and why]: This is low risk. Worst case scenario, these crashes still happen, but in a different place, which might indicate the root cause. Best case, a large portion of the crashes stop happening. [String/UUID change made/needed]: None
valentin's suggestion in comment 58 is reasonable. The worst case for it is really that it moves the crash around to somewhere that might be more actionable.
Attachment #8641978 - Flags: feedback?(mcmanus) → feedback+
Comment on attachment 8641978 [details] [diff] [review] Null mPipeOut in nsHttpTransaction after closing. [aurora] Simple fix, Aurora+
Attachment #8641978 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Based on comment 60, FF41 was in Aurora phase and since merge day of 8-10, this fix should be in Beta41. Updating status to reflect that.
Stop tracking this bug, I started to track it 5 months ago and we have been shipping several releases with this bug...
Crash Signature: [@ mozilla::net::nsHttpTransaction::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*)] → [@ mozilla::net::nsHttpTransaction::WriteSegments(mozilla::net::nsAHttpSegmentWriter*, unsigned int, unsigned int*)] [@ mozilla::net::nsHttpTransaction::WriteSegments]
there are only a small handful of these on 43 or 44 in the last month.. still don't understand it
there are 0 crashes on >= 46 in the last 28 days and only 5 on some version of 45. (2 of those were betas of 45).
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
This crash went away (either fixed or moved to another place).
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.