Closed Bug 702042 Opened 13 years ago Closed 8 years ago

crash in nsSocketTransport::IsAlive @WSARecv

Categories

(Core :: Networking, defect, P1)

8 Branch
x86
Windows 7
defect

Tracking

()

RESOLVED WONTFIX
mozilla8
Tracking Status
firefox8 + ---

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-2e1c37d4-8041-4786-9bbd-195da2111112 .
============================================================= 

Seen while looking at the explosive report for Firefox 7 - all Windows crashes that span multiple versions.

https://crash-stats.mozilla.com/report/list?signature=WSARecv

https://crash-stats.mozilla.com/report/index/baa3db05-d2bd-4091-868e-d7e812111111

Frame 	Module 	Signature [Expand] 	Source
0 	ws2_32.dll 	WSARecv 	
1 		@0xc39c98a 	
2 	wsock32.dll 	recv 	
3 	nspr4.dll 	_PR_MD_RECV 	nsprpub/pr/src/md/windows/w95sock.c:354
4 	nspr4.dll 	SocketRecv 	nsprpub/pr/src/io/prsocket.c:639
5 	xul.dll 	nsSocketTransport::IsAlive 	netwerk/base/src/nsSocketTransport2.cpp:1855
6 	xul.dll 	nsHttpConnection::IsAlive 	netwerk/protocol/http/nsHttpConnection.cpp:280
7 	xul.dll 	nsHttpConnection::CanReuse 	netwerk/protocol/http/nsHttpConnection.cpp:245
8 	xul.dll 	nsHttpConnection::OnInputStreamReady 	netwerk/protocol/http/nsHttpConnection.cpp:901
9 	xul.dll 	nsSocketInputStream::OnSocketReady 	netwerk/base/src/nsSocketTransport2.cpp:256
10 	xul.dll 	nsSocketTransport::OnSocketReady 	netwerk/base/src/nsSocketTransport2.cpp:1537
11 	xul.dll 	nsSocketTransportService::DoPollIteration 	netwerk/base/src/nsSocketTransportService2.cpp:742
12 	xul.dll 	nsSocketTransportService::Run 	netwerk/base/src/nsSocketTransportService2.cpp:631
13 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:617
14 	xul.dll 	nsRunnable::Release 	obj-firefox/xpcom/build/nsThreadUtils.cpp:55
15 	nspr4.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:426
16 	nspr4.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:122
17 	mozcrt19.dll 	_callthreadstartex 	obj-firefox/memory/jemalloc/crtsrc/threadex.c:348
18 	mozcrt19.dll 	_threadstartex 	obj-firefox/memory/jemalloc/crtsrc/threadex.c:326
19 	kernel32.dll 	BaseThreadStart
Component: General → Networking
Priority: -- → P1
Product: Firefox → Core
QA Contact: general → networking
Target Milestone: --- → mozilla8
11686 crashes in the last three days across all Firefox versions.
(In reply to Marcia Knous [:marcia] from comment #1)
> 11686 crashes in the last three days across all Firefox versions.

Can you also figure out the date of the spike, if there is any?  This bug seems to be present for a very long time not much attended until now.
According to the explosive report, the spike started 2011-11-07. See https://crash-analysis.mozilla.com/rkaiser/2011-11-07/2011-11-07.firefox.7.explosiveness.html.
Honza, do you need something more specific than the graph that is available on the graph tab here?:
https://crash-stats.mozilla.com/report/list?signature=WSARecv
Callstacks are quit random (meaning almost always different, some making sense - WSARecv expected to be called, some quit strange).  This also applies when grouping by e.g. the same error.

I suspect a bit the ssl thread doing something unexpected with the system socket.

Brian, I miss the following:
- see spike dates (some nice graphs)
- see the first occurrence of the signature crash
- see the first occurrence of the whole callstack crash
- be able to group by (and see) the whole stack/exception type

This all not limited to just 4 weeks.  However, not sure I would be wiser from it.

(In reply to Marcia Knous [:marcia] from comment #3)
> According to the explosive report, the spike started 2011-11-07. See
> https://crash-analysis.mozilla.com/rkaiser/2011-11-07/2011-11-07.firefox.7.
> explosiveness.html.


I'd more say since 2011-11-10: https://crash-analysis.mozilla.com/rkaiser/2011-11-13/2011-11-13.firefox.7.explosiveness.html


I received following updates on 2001-11-09 (Win7/64bit):

Security Update for Windows 7 for x64-based Systems (KB2588516)
http://go.microsoft.com/fwlink/?LinkId=229071

Security Update for Windows 7 for x64-based Systems (KB2617657)
http://go.microsoft.com/fwlink/?LinkId=229245

Security Update for Windows 7 for x64-based Systems (KB2620704)
http://go.microsoft.com/fwlink/?LinkId=229638

Windows Malicious Software Removal Tool x64 - November 2011 (KB890830)
http://go.microsoft.com/fwlink/?LinkId=39987

Definition Update for Windows Defender - KB915597 (Definition 1.115.1462.0)
http://go.microsoft.com/fwlink/?LinkId=52661


But from the crash-stats hard to say if that could be related, hard to say which callstack occurrence has raised, which error.
Assigning to Patrick based on hg blame and because of the MSG_PEEK issue mentioned below:
http://hg.mozilla.org/mozilla-central/annotate/24c8d04f6174/netwerk/base/src/nsSocketTransport2.cpp#l1875

These crashes only happen on Windows. There are a variety: some are access violations during read, some are access violations during write, some or access violations trying to *execute* bad memory (EXCEPTION_ACCESS_VIOLATION_EXEC).

I also noticed that many of the EXCEPTION_ACCESS_VIOLATION_EXEC crashes happen at the same address; this indicates to me that the LSP(s) probably doesn't implement ASLR. Jesse, is this a correct inference?

Windows-only plus lots of EXCEPTION_ACCESS_VIOLATION_EXEC crashes make me think this is due to some LSP (e.g. buggy anti-virus, malware, etc.).

AFAICT, in our entire product, this is the only place we ever do I/O using PR_MSG_PEEK, which translates into the MSG_PEEK flag of recv. MSG_PEEK is a relatively rarely-used feature. I would not be surprised if some Windows LSP that normally work fine have bugs in their MSG_PEEK checks. We should investigate if we can avoid using MSG_PEEK ASAP.

It is also possible that we are passing a bad file descriptor to recv(), e.g. one that we have already closed. I recommend that we make sure that we modify our code so that it NULLs our all file descriptors after closing them, even in "unnecessary cases.

Secondly, I noticed that this code is doing a PR_Recv into a one-byte stack-allocated variable. Given that we know buggy LSPs exist, it would be smart to size this buffer to be at least 16 long and 16-byte aligned to account for any LSP that is trying to access the output buffer with SSE instructions (which may require 16-byte alignment) or other multi-byte read/writes.

We could also investigate using SEH (Strucured Exception Handling) to catch these crashes and simply have this function return FALSE when such a crash happens. I do not have experience with SEH to know how horrible this would be. bsmedberg, am I on crack?
Assignee: nobody → mcmanus
*IF* (big if) this is caused by LSPs, then there are other measures we can take, in addition to the one above, to make the suggested use of SEH safer and more effective:
1. Allocate a buffer using VirtualAlloc that is larger than necessary, that is memory-page-aligned and 3 memory pages long.
1. Use crazy VirtualProtect magic to make the first and third pages unreadable and unwritable.
2. Do all recv() calls into the middle page.
3. Trap all READ/WRITE violations from this PR_Recv. If those violations happen on either the first or third pages, assume buggy LSP. (What then?)
4. Trap all EXEC violations that happen from this PR_Recv, and assume buggy LSP.
5. Long-term, modify all of our network I/O code to use these LSP protections.

But, first, I would try whatever is simplest to remove the use of MSG_PEEK (again, assuming buggy LSPs).
FYI we've used SEH to wrap buggy accessibility calls in the past. For example:
http://mxr.mozilla.org/mozilla-central/source/accessible/src/msaa/nsAccessNodeWrap.cpp#233
7. When doing I/O into/out of a buffer, put that buffer as close to the third page as possible, while still having it 16-byte aligned, so that VirtualProtect will help us catch most buffer overruns.
6. When doing a PR_Recv of <n> bytes, set buffer[n], buffer[n+1], buffer[n+2], etc. to some known values (e.g. 0xDE), and make sure that after the PR_Recv, they still have the same values; else assume buggy LSP. This will catch bugs in LSPs that VirtualProtect would not (between buffer[n] and the start of the next page), since VirtualProtect only works on pages, AFAICT.
(In reply to Honza Bambas (:mayhemer) from comment #5)
> Callstacks are quit random (meaning almost always different, some making
> sense - WSARecv expected to be called, some quit strange).  This also
> applies when grouping by e.g. the same error.

> - be able to group by (and see) the whole stack/exception type

Yes, it would be extremely helpful if to have the crash data grouped by at least the caller of PR_Recv, or even further up. Also, grouping by exception type would be helpful. How do we get this data?
(In reply to Brian Smith (:bsmith) from comment #10)
> How do we get this data?

Ask some Socorro people around, there could be access to the raw database we could try do queries on.


This has also spiked up: https://crash-stats.mozilla.com/report/list?signature=send, https://crash-analysis.mozilla.com/rkaiser/2011-11-13/2011-11-13.firefox.explosiveness.html.


Brian: do you understand the data we get for the LSPs registration in reports?  I'm not deeply familiar with LSPs, but finding some correlation there first would be useful before trying to code what you have suggested.
For at least the stack in comment 0, this is very likely triggered by receiving a FIN from the other end of a persistent connection. That is what the code is testing for, and there is no outstanding request so the FIN is pretty much the only thing we expect to see (sometimes we see a 408 or some whitespace too, but the FIN is what we would expect) in that state.

So some kind of use-after-free in the LSP for Peek, perhaps a less well tested path, makes some sense at least. I'm not a fan of msg_peek myself.

I'll see if I can figure out a way to just make that a normal read and buffer it in the nsSocketInputStream  (and adjust the transportservice poll loop to note it, which is probably a bigger deal). That might even save a system call or two.

Honza, biesi, what do you think of that approach?

Meanwhile,
1] STR would be really helpful to verify the change
2] would be great if NSPR or NSS expert could come up with a theory on what is causing that crash.. lacking that we should probably remove the API as broken if I remove the only reference from nsSocketTransport.
(In reply to Patrick McManus from comment #12)
> So some kind of use-after-free in the LSP for Peek, perhaps a less well
> tested path, makes some sense at least. I'm not a fan of msg_peek myself.

Chromium has almost the exact same code as us for this, using recv(fd, c, 1, MSG_PEEK)

"use-after-free" makes a lot of sense and would explain the EXEC. However, the cause of the "use-after-free" within WSARecv could also be that we have closed the socket already. This seems like a pretty strong possibility too. We may be able to discover if this were the case by patching NSPR to always set the OS-level file descriptor in the PRFileDesc to INVALID_HANDLE on Windows after every close().

AFAICT, such use-after-close errors could potentially happen due to races with the SSL thread, like Honza mentioned, or by some bug that closes a PRFileDesc but leaves a reference to the PRFileDesc in some data structure.

> I'll see if I can figure out a way to just make that a normal read and
> buffer it in the nsSocketInputStream  (and adjust the transportservice poll
> loop to note it, which is probably a bigger deal). That might even save a
> system call or two.
> 
> Honza, biesi, what do you think of that approach?

It seems like too risky of a change for a chemspill, which is why we're all looking at this now.

> 2] would be great if NSPR or NSS expert could come up with a theory on what
> is causing that crash.. lacking that we should probably remove the API as
> broken if I remove the only reference from nsSocketTransport.

It seems less likely that buggy LSP + MSG_PEEK is the problem if Chromium is also using it. If it were buggy LSP + MSG_PEEK then we should be able to find some bug reports similar to this one in the Chromium bug tracker. I didn't find anything that jumped out, just:
http://code.google.com/p/chromium/issues/detail?id=93122&q=WSARecv&colspec=ID%20Pri%20Mstone%20ReleaseBlock%20Area%20Feature%20Status%20Owner%20Summary
http://code.google.com/p/chromium/issues/detail?id=93628&q=WSARecv&colspec=ID%20Pri%20Mstone%20ReleaseBlock%20Area%20Feature%20Status%20Owner%20Summary

I noticed that in the Chromium bug report, they are able to get a deeper trace; i.e. it didn't just stop at WSARecv like ours does. Why is that?;
0x7782de7d	 [msvcrt.dll	 + 0x0000de7d]	strstr
0x7782de49	 [msvcrt.dll	 + 0x0000de49]	strtolX
0x06d7b3f3	 [netconfig.dll	 + 0x0000b3f3]	
0x06d8682b	 [netconfig.dll	 + 0x0001682b]	
0x7782de10	 [msvcrt.dll	 + 0x0000de10]	atol
0x7379b15d	 [mfc42.dll	 + 0x0001b15d]	CString::~CString()
0x06d7df1b	 [netconfig.dll	 + 0x0000df1b]	
0x06d81c77	 [netconfig.dll	 + 0x00011c77]	
0x06d7d91d	 [netconfig.dll	 + 0x0000d91d]	
0x06d81f5f	 [netconfig.dll	 + 0x00011f5f]	
0x06d7facd	 [netconfig.dll	 + 0x0000facd]	
0x06d82097	 [netconfig.dll	 + 0x00012097]	
0x762170f9	 [ws2_32.dll	 + 0x000070f9]	WSARecv
0x640606b1	 [chrome.dll	 - tcp_client_socket_win.cc:682]	
[...]
I have found in many reports of crashes @WSARecv (with EXCEPTION_ACCESS_VIOLATION_EXEC) and @send reference to a weird modules sysappNN.dll, for N in 1,2,...29, etc.  (also contained in some stacks).

According virustotal it obviosly seems to be a virus: http://www.virustotal.com/file-scan/report.html?id=9240d03e0fb7e6bcac987baaf04b93dccc7ddeece6fbe82aaa70777944d385a7-1307200254

However, this virus seems to be older then just few days...
I suggest blacklisting these dll's and see how that works.  If it helps, we can easily port back.

Any objections?
Am I reading this right?:
https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/8.0/7
https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/11.0a1/7

According to that, this is the #106 topcrash for Firefox 8 and it doesn't even show up in the top 300 for Firefox 11.0a1. So, I don't quite understand why this particular bug is more urgent than other (networking) bugs for Firefox 8. What am I missing?

Also, almost all of crashes in WSARecv for Firefox 8 are on Windows XP SP2 or XP SP3. Only a handful are Vista or 7. (This is all according to the Socorro data.) Am I wrong in thinking that Socorro should generally show a more even distribution of affected version numbers, if this were a bug in our code, instead of OS-level or LSP badness?
chofmann's correlation report for today, based on 58 crashes, shows this correlation for this signature:

90% (52/58) vs.   0% (271/65058) sysapp6.dll
     90% (52/58) vs.   0% (271/65058) sysapp12.dll
     90% (52/58) vs.   0% (272/65058) sysapp43.dll
     90% (52/58) vs.   0% (275/65058) sysapp8.dll
     90% (52/58) vs.   0% (276/65058) sysapp2.dll
     90% (52/58) vs.   0% (276/65058) sysapp9.dll
     88% (51/58) vs.   0% (263/65058) sysapp32.dll
     88% (51/58) vs.   0% (274/65058) sysapp21.dll
    100% (58/58) vs.  13% (8310/65058) GdiPlus.dll
(In reply to Brian Smith (:bsmith) from comment #16)
> Am I reading this right?:
> https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/8.0/7
> https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/11.0a1/7
> 
> According to that, this is the #106 topcrash for Firefox 8 and it doesn't
> even show up in the top 300 for Firefox 11.0a1. So, I don't quite understand
> why this particular bug is more urgent than other (networking) bugs for
> Firefox 8. What am I missing?

This only became explosive on Friday, so it may not yet be a top crasher for 8 across the entire release so far. Additionally, we think that this crash may be related to 702041 and 702040 (due to their spikes temporally lining up). Together all three make up 22000 crashes in the past 3 or so days, multiplied by 10 due to Socorro throttling.

We have a checkpoint in Warp Core at 2PM PT to decide how best to move forward with this investigation for 8 (if at all). We'll make sure to take into consideration that this looks like a malware issue, which is typically outside the scope of a chemspill.
Assignee: mcmanus → honzab.moz
Status: NEW → ASSIGNED
Comment on attachment 574406 [details] [diff] [review]
v1 - blacklisting of sysapp*.dll

Benjamin, we would like to be able to get this in ASAP.  Thanks.
Attachment #574406 - Flags: review?(benjamin)
(In reply to Honza Bambas (:mayhemer) from comment #20)
> Comment on attachment 574406 [details] [diff] [review] [diff] [details] [review]
> v1 - blacklisting of sysapp*.dll

Are we absolutely sure that we don't block any valid DLLs with this?
I'm concerned about this as well: we don't know what might be called sysapp*.dll in the future, and we have only anecdotal evidence that this is always malware, which means that we could be blocking legitimate functionality. While I understand that we want to keep our crashiness down, the DLL blocklist is ultimately a poor tool to do anti-malware blocking, especially when we start adding these sorts of wildcards which can have unintended consequences.
At least regexp would be good.  True is that blocking some random dlls is not a true solution here and I also don't know well what are the rules for adding dlls to this list.

We should at least contact people that gave their emails and there was sysappNN.dll module in their reports, that they might have a malware.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #22)
> I'm concerned about this as well: we don't know what might be called
> sysapp*.dll in the future, and we have only anecdotal evidence that this is
> always malware, which means that we could be blocking legitimate
> functionality. While I understand that we want to keep our crashiness down,
> the DLL blocklist is ultimately a poor tool to do anti-malware blocking,
> especially when we start adding these sorts of wildcards which can have
> unintended consequences.

Yeah - and above and beyond that I'd be particularly cautious about adding file globbing/regex to custom loader code during a chemspill. I wouldn't resist globbing in general on the principle of "what if some future thing is called sysapp*?" but I agree deeply with the argument that the DLL blocklist is a brittle and inappropriate mechanism for dealing with malware in the general case. We can use it to swat down the most egregious offenders, but comment 16 and 18 leave me wondering whether this is one of those or not.

Adding rstrong, because he flirted last year with just blocking all LSPs from our process, iinvmm.
This is much less invasive than blocking all LSPs but still has the potential of having unintended consequences. That along with comment #16 and other comments makes me hesitant to recommend this approach.
Attachment #574406 - Flags: review?(benjamin) → review-
I missed the bugmail for this for a long time.  But I should mention that I will fight pretty hard for any glob/regexp facility in the blocklist handler.  Not only blocking DLLs matching a pattern is a bad idea most of the time, but also we want to keep the code which we run in our LdrLoadDll thunk as simple as possible.
Assignee: honzab.moz → nobody
Status: ASSIGNED → NEW
I wonder if this crash is caused by the fact that we are calling recv() on the main thread, causing the LSP (malware?) to be calling socket functions only on our socket transport thread? I.e. maybe the LSP isn't thread safe? If so, we could look at alternatives to calling recv() on the main thread like this.
Crash Signature: [@ WSARecv] → [@ WSARecv] [@ WSARecv | recv | _PR_MD_RECV | SocketRecv | nsSocketTransport::IsAlive]
Depends on: 716345
Summary: crash WSARecv → crash in nsSocketTransport::IsAlive @WSARecv
this particular stack is gone afaict.. and there is a separate bug underway to isolate all nspr io to a single thread. so this one can be closed i think
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: