Closed
Bug 702042
Opened 13 years ago
Closed 8 years ago
crash in nsSocketTransport::IsAlive @WSARecv
Categories
(Core :: Networking, defect, P1)
Tracking
()
RESOLVED
WONTFIX
mozilla8
Tracking | Status | |
---|---|---|
firefox8 | + | --- |
People
(Reporter: marcia, Unassigned)
References
Details
(Keywords: crash)
Crash Data
Attachments
(1 file)
2.40 KB,
patch
|
benjamin
:
review-
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is report bp-2e1c37d4-8041-4786-9bbd-195da2111112 . ============================================================= Seen while looking at the explosive report for Firefox 7 - all Windows crashes that span multiple versions. https://crash-stats.mozilla.com/report/list?signature=WSARecv https://crash-stats.mozilla.com/report/index/baa3db05-d2bd-4091-868e-d7e812111111 Frame Module Signature [Expand] Source 0 ws2_32.dll WSARecv 1 @0xc39c98a 2 wsock32.dll recv 3 nspr4.dll _PR_MD_RECV nsprpub/pr/src/md/windows/w95sock.c:354 4 nspr4.dll SocketRecv nsprpub/pr/src/io/prsocket.c:639 5 xul.dll nsSocketTransport::IsAlive netwerk/base/src/nsSocketTransport2.cpp:1855 6 xul.dll nsHttpConnection::IsAlive netwerk/protocol/http/nsHttpConnection.cpp:280 7 xul.dll nsHttpConnection::CanReuse netwerk/protocol/http/nsHttpConnection.cpp:245 8 xul.dll nsHttpConnection::OnInputStreamReady netwerk/protocol/http/nsHttpConnection.cpp:901 9 xul.dll nsSocketInputStream::OnSocketReady netwerk/base/src/nsSocketTransport2.cpp:256 10 xul.dll nsSocketTransport::OnSocketReady netwerk/base/src/nsSocketTransport2.cpp:1537 11 xul.dll nsSocketTransportService::DoPollIteration netwerk/base/src/nsSocketTransportService2.cpp:742 12 xul.dll nsSocketTransportService::Run netwerk/base/src/nsSocketTransportService2.cpp:631 13 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:617 14 xul.dll nsRunnable::Release obj-firefox/xpcom/build/nsThreadUtils.cpp:55 15 nspr4.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c:426 16 nspr4.dll pr_root nsprpub/pr/src/md/windows/w95thred.c:122 17 mozcrt19.dll _callthreadstartex obj-firefox/memory/jemalloc/crtsrc/threadex.c:348 18 mozcrt19.dll _threadstartex obj-firefox/memory/jemalloc/crtsrc/threadex.c:326 19 kernel32.dll BaseThreadStart
Updated•13 years ago
|
Component: General → Networking
Priority: -- → P1
Product: Firefox → Core
QA Contact: general → networking
Target Milestone: --- → mozilla8
Reporter | ||
Comment 1•13 years ago
|
||
11686 crashes in the last three days across all Firefox versions.
Comment 2•13 years ago
|
||
(In reply to Marcia Knous [:marcia] from comment #1) > 11686 crashes in the last three days across all Firefox versions. Can you also figure out the date of the spike, if there is any? This bug seems to be present for a very long time not much attended until now.
Reporter | ||
Comment 3•13 years ago
|
||
According to the explosive report, the spike started 2011-11-07. See https://crash-analysis.mozilla.com/rkaiser/2011-11-07/2011-11-07.firefox.7.explosiveness.html.
Comment 4•13 years ago
|
||
Honza, do you need something more specific than the graph that is available on the graph tab here?: https://crash-stats.mozilla.com/report/list?signature=WSARecv
Updated•13 years ago
|
tracking-firefox8:
--- → +
Comment 5•13 years ago
|
||
Callstacks are quit random (meaning almost always different, some making sense - WSARecv expected to be called, some quit strange). This also applies when grouping by e.g. the same error. I suspect a bit the ssl thread doing something unexpected with the system socket. Brian, I miss the following: - see spike dates (some nice graphs) - see the first occurrence of the signature crash - see the first occurrence of the whole callstack crash - be able to group by (and see) the whole stack/exception type This all not limited to just 4 weeks. However, not sure I would be wiser from it. (In reply to Marcia Knous [:marcia] from comment #3) > According to the explosive report, the spike started 2011-11-07. See > https://crash-analysis.mozilla.com/rkaiser/2011-11-07/2011-11-07.firefox.7. > explosiveness.html. I'd more say since 2011-11-10: https://crash-analysis.mozilla.com/rkaiser/2011-11-13/2011-11-13.firefox.7.explosiveness.html I received following updates on 2001-11-09 (Win7/64bit): Security Update for Windows 7 for x64-based Systems (KB2588516) http://go.microsoft.com/fwlink/?LinkId=229071 Security Update for Windows 7 for x64-based Systems (KB2617657) http://go.microsoft.com/fwlink/?LinkId=229245 Security Update for Windows 7 for x64-based Systems (KB2620704) http://go.microsoft.com/fwlink/?LinkId=229638 Windows Malicious Software Removal Tool x64 - November 2011 (KB890830) http://go.microsoft.com/fwlink/?LinkId=39987 Definition Update for Windows Defender - KB915597 (Definition 1.115.1462.0) http://go.microsoft.com/fwlink/?LinkId=52661 But from the crash-stats hard to say if that could be related, hard to say which callstack occurrence has raised, which error.
Comment 6•13 years ago
|
||
Assigning to Patrick based on hg blame and because of the MSG_PEEK issue mentioned below: http://hg.mozilla.org/mozilla-central/annotate/24c8d04f6174/netwerk/base/src/nsSocketTransport2.cpp#l1875 These crashes only happen on Windows. There are a variety: some are access violations during read, some are access violations during write, some or access violations trying to *execute* bad memory (EXCEPTION_ACCESS_VIOLATION_EXEC). I also noticed that many of the EXCEPTION_ACCESS_VIOLATION_EXEC crashes happen at the same address; this indicates to me that the LSP(s) probably doesn't implement ASLR. Jesse, is this a correct inference? Windows-only plus lots of EXCEPTION_ACCESS_VIOLATION_EXEC crashes make me think this is due to some LSP (e.g. buggy anti-virus, malware, etc.). AFAICT, in our entire product, this is the only place we ever do I/O using PR_MSG_PEEK, which translates into the MSG_PEEK flag of recv. MSG_PEEK is a relatively rarely-used feature. I would not be surprised if some Windows LSP that normally work fine have bugs in their MSG_PEEK checks. We should investigate if we can avoid using MSG_PEEK ASAP. It is also possible that we are passing a bad file descriptor to recv(), e.g. one that we have already closed. I recommend that we make sure that we modify our code so that it NULLs our all file descriptors after closing them, even in "unnecessary cases. Secondly, I noticed that this code is doing a PR_Recv into a one-byte stack-allocated variable. Given that we know buggy LSPs exist, it would be smart to size this buffer to be at least 16 long and 16-byte aligned to account for any LSP that is trying to access the output buffer with SSE instructions (which may require 16-byte alignment) or other multi-byte read/writes. We could also investigate using SEH (Strucured Exception Handling) to catch these crashes and simply have this function return FALSE when such a crash happens. I do not have experience with SEH to know how horrible this would be. bsmedberg, am I on crack?
Assignee: nobody → mcmanus
Comment 7•13 years ago
|
||
*IF* (big if) this is caused by LSPs, then there are other measures we can take, in addition to the one above, to make the suggested use of SEH safer and more effective: 1. Allocate a buffer using VirtualAlloc that is larger than necessary, that is memory-page-aligned and 3 memory pages long. 1. Use crazy VirtualProtect magic to make the first and third pages unreadable and unwritable. 2. Do all recv() calls into the middle page. 3. Trap all READ/WRITE violations from this PR_Recv. If those violations happen on either the first or third pages, assume buggy LSP. (What then?) 4. Trap all EXEC violations that happen from this PR_Recv, and assume buggy LSP. 5. Long-term, modify all of our network I/O code to use these LSP protections. But, first, I would try whatever is simplest to remove the use of MSG_PEEK (again, assuming buggy LSPs).
Comment 8•13 years ago
|
||
FYI we've used SEH to wrap buggy accessibility calls in the past. For example: http://mxr.mozilla.org/mozilla-central/source/accessible/src/msaa/nsAccessNodeWrap.cpp#233
Comment 9•13 years ago
|
||
7. When doing I/O into/out of a buffer, put that buffer as close to the third page as possible, while still having it 16-byte aligned, so that VirtualProtect will help us catch most buffer overruns. 6. When doing a PR_Recv of <n> bytes, set buffer[n], buffer[n+1], buffer[n+2], etc. to some known values (e.g. 0xDE), and make sure that after the PR_Recv, they still have the same values; else assume buggy LSP. This will catch bugs in LSPs that VirtualProtect would not (between buffer[n] and the start of the next page), since VirtualProtect only works on pages, AFAICT.
Comment 10•13 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #5) > Callstacks are quit random (meaning almost always different, some making > sense - WSARecv expected to be called, some quit strange). This also > applies when grouping by e.g. the same error. > - be able to group by (and see) the whole stack/exception type Yes, it would be extremely helpful if to have the crash data grouped by at least the caller of PR_Recv, or even further up. Also, grouping by exception type would be helpful. How do we get this data?
tracking-firefox8:
+ → ---
Comment 11•13 years ago
|
||
(In reply to Brian Smith (:bsmith) from comment #10) > How do we get this data? Ask some Socorro people around, there could be access to the raw database we could try do queries on. This has also spiked up: https://crash-stats.mozilla.com/report/list?signature=send, https://crash-analysis.mozilla.com/rkaiser/2011-11-13/2011-11-13.firefox.explosiveness.html. Brian: do you understand the data we get for the LSPs registration in reports? I'm not deeply familiar with LSPs, but finding some correlation there first would be useful before trying to code what you have suggested.
Comment 12•13 years ago
|
||
For at least the stack in comment 0, this is very likely triggered by receiving a FIN from the other end of a persistent connection. That is what the code is testing for, and there is no outstanding request so the FIN is pretty much the only thing we expect to see (sometimes we see a 408 or some whitespace too, but the FIN is what we would expect) in that state. So some kind of use-after-free in the LSP for Peek, perhaps a less well tested path, makes some sense at least. I'm not a fan of msg_peek myself. I'll see if I can figure out a way to just make that a normal read and buffer it in the nsSocketInputStream (and adjust the transportservice poll loop to note it, which is probably a bigger deal). That might even save a system call or two. Honza, biesi, what do you think of that approach? Meanwhile, 1] STR would be really helpful to verify the change 2] would be great if NSPR or NSS expert could come up with a theory on what is causing that crash.. lacking that we should probably remove the API as broken if I remove the only reference from nsSocketTransport.
Comment 13•13 years ago
|
||
(In reply to Patrick McManus from comment #12) > So some kind of use-after-free in the LSP for Peek, perhaps a less well > tested path, makes some sense at least. I'm not a fan of msg_peek myself. Chromium has almost the exact same code as us for this, using recv(fd, c, 1, MSG_PEEK) "use-after-free" makes a lot of sense and would explain the EXEC. However, the cause of the "use-after-free" within WSARecv could also be that we have closed the socket already. This seems like a pretty strong possibility too. We may be able to discover if this were the case by patching NSPR to always set the OS-level file descriptor in the PRFileDesc to INVALID_HANDLE on Windows after every close(). AFAICT, such use-after-close errors could potentially happen due to races with the SSL thread, like Honza mentioned, or by some bug that closes a PRFileDesc but leaves a reference to the PRFileDesc in some data structure. > I'll see if I can figure out a way to just make that a normal read and > buffer it in the nsSocketInputStream (and adjust the transportservice poll > loop to note it, which is probably a bigger deal). That might even save a > system call or two. > > Honza, biesi, what do you think of that approach? It seems like too risky of a change for a chemspill, which is why we're all looking at this now. > 2] would be great if NSPR or NSS expert could come up with a theory on what > is causing that crash.. lacking that we should probably remove the API as > broken if I remove the only reference from nsSocketTransport. It seems less likely that buggy LSP + MSG_PEEK is the problem if Chromium is also using it. If it were buggy LSP + MSG_PEEK then we should be able to find some bug reports similar to this one in the Chromium bug tracker. I didn't find anything that jumped out, just: http://code.google.com/p/chromium/issues/detail?id=93122&q=WSARecv&colspec=ID%20Pri%20Mstone%20ReleaseBlock%20Area%20Feature%20Status%20Owner%20Summary http://code.google.com/p/chromium/issues/detail?id=93628&q=WSARecv&colspec=ID%20Pri%20Mstone%20ReleaseBlock%20Area%20Feature%20Status%20Owner%20Summary I noticed that in the Chromium bug report, they are able to get a deeper trace; i.e. it didn't just stop at WSARecv like ours does. Why is that?; 0x7782de7d [msvcrt.dll + 0x0000de7d] strstr 0x7782de49 [msvcrt.dll + 0x0000de49] strtolX 0x06d7b3f3 [netconfig.dll + 0x0000b3f3] 0x06d8682b [netconfig.dll + 0x0001682b] 0x7782de10 [msvcrt.dll + 0x0000de10] atol 0x7379b15d [mfc42.dll + 0x0001b15d] CString::~CString() 0x06d7df1b [netconfig.dll + 0x0000df1b] 0x06d81c77 [netconfig.dll + 0x00011c77] 0x06d7d91d [netconfig.dll + 0x0000d91d] 0x06d81f5f [netconfig.dll + 0x00011f5f] 0x06d7facd [netconfig.dll + 0x0000facd] 0x06d82097 [netconfig.dll + 0x00012097] 0x762170f9 [ws2_32.dll + 0x000070f9] WSARecv 0x640606b1 [chrome.dll - tcp_client_socket_win.cc:682] [...]
Comment 14•13 years ago
|
||
I have found in many reports of crashes @WSARecv (with EXCEPTION_ACCESS_VIOLATION_EXEC) and @send reference to a weird modules sysappNN.dll, for N in 1,2,...29, etc. (also contained in some stacks). According virustotal it obviosly seems to be a virus: http://www.virustotal.com/file-scan/report.html?id=9240d03e0fb7e6bcac987baaf04b93dccc7ddeece6fbe82aaa70777944d385a7-1307200254 However, this virus seems to be older then just few days...
Comment 15•13 years ago
|
||
I suggest blacklisting these dll's and see how that works. If it helps, we can easily port back. Any objections?
Updated•13 years ago
|
tracking-firefox8:
--- → +
Comment 16•13 years ago
|
||
Am I reading this right?: https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/8.0/7 https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/11.0a1/7 According to that, this is the #106 topcrash for Firefox 8 and it doesn't even show up in the top 300 for Firefox 11.0a1. So, I don't quite understand why this particular bug is more urgent than other (networking) bugs for Firefox 8. What am I missing? Also, almost all of crashes in WSARecv for Firefox 8 are on Windows XP SP2 or XP SP3. Only a handful are Vista or 7. (This is all according to the Socorro data.) Am I wrong in thinking that Socorro should generally show a more even distribution of affected version numbers, if this were a bug in our code, instead of OS-level or LSP badness?
Reporter | ||
Comment 17•13 years ago
|
||
chofmann's correlation report for today, based on 58 crashes, shows this correlation for this signature: 90% (52/58) vs. 0% (271/65058) sysapp6.dll 90% (52/58) vs. 0% (271/65058) sysapp12.dll 90% (52/58) vs. 0% (272/65058) sysapp43.dll 90% (52/58) vs. 0% (275/65058) sysapp8.dll 90% (52/58) vs. 0% (276/65058) sysapp2.dll 90% (52/58) vs. 0% (276/65058) sysapp9.dll 88% (51/58) vs. 0% (263/65058) sysapp32.dll 88% (51/58) vs. 0% (274/65058) sysapp21.dll 100% (58/58) vs. 13% (8310/65058) GdiPlus.dll
Comment 18•13 years ago
|
||
(In reply to Brian Smith (:bsmith) from comment #16) > Am I reading this right?: > https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/8.0/7 > https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/11.0a1/7 > > According to that, this is the #106 topcrash for Firefox 8 and it doesn't > even show up in the top 300 for Firefox 11.0a1. So, I don't quite understand > why this particular bug is more urgent than other (networking) bugs for > Firefox 8. What am I missing? This only became explosive on Friday, so it may not yet be a top crasher for 8 across the entire release so far. Additionally, we think that this crash may be related to 702041 and 702040 (due to their spikes temporally lining up). Together all three make up 22000 crashes in the past 3 or so days, multiplied by 10 due to Socorro throttling. We have a checkpoint in Warp Core at 2PM PT to decide how best to move forward with this investigation for 8 (if at all). We'll make sure to take into consideration that this looks like a malware issue, which is typically outside the scope of a chemspill.
Comment 19•13 years ago
|
||
Assignee: mcmanus → honzab.moz
Status: NEW → ASSIGNED
Comment 20•13 years ago
|
||
Comment on attachment 574406 [details] [diff] [review] v1 - blacklisting of sysapp*.dll Benjamin, we would like to be able to get this in ASAP. Thanks.
Attachment #574406 -
Flags: review?(benjamin)
Comment 21•13 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #20) > Comment on attachment 574406 [details] [diff] [review] [diff] [details] [review] > v1 - blacklisting of sysapp*.dll Are we absolutely sure that we don't block any valid DLLs with this?
Comment 22•13 years ago
|
||
I'm concerned about this as well: we don't know what might be called sysapp*.dll in the future, and we have only anecdotal evidence that this is always malware, which means that we could be blocking legitimate functionality. While I understand that we want to keep our crashiness down, the DLL blocklist is ultimately a poor tool to do anti-malware blocking, especially when we start adding these sorts of wildcards which can have unintended consequences.
Comment 23•13 years ago
|
||
At least regexp would be good. True is that blocking some random dlls is not a true solution here and I also don't know well what are the rules for adding dlls to this list. We should at least contact people that gave their emails and there was sysappNN.dll module in their reports, that they might have a malware.
Comment 24•13 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #22) > I'm concerned about this as well: we don't know what might be called > sysapp*.dll in the future, and we have only anecdotal evidence that this is > always malware, which means that we could be blocking legitimate > functionality. While I understand that we want to keep our crashiness down, > the DLL blocklist is ultimately a poor tool to do anti-malware blocking, > especially when we start adding these sorts of wildcards which can have > unintended consequences. Yeah - and above and beyond that I'd be particularly cautious about adding file globbing/regex to custom loader code during a chemspill. I wouldn't resist globbing in general on the principle of "what if some future thing is called sysapp*?" but I agree deeply with the argument that the DLL blocklist is a brittle and inappropriate mechanism for dealing with malware in the general case. We can use it to swat down the most egregious offenders, but comment 16 and 18 leave me wondering whether this is one of those or not. Adding rstrong, because he flirted last year with just blocking all LSPs from our process, iinvmm.
Comment 25•13 years ago
|
||
This is much less invasive than blocking all LSPs but still has the potential of having unintended consequences. That along with comment #16 and other comments makes me hesitant to recommend this approach.
Updated•13 years ago
|
Attachment #574406 -
Flags: review?(benjamin) → review-
Comment 26•12 years ago
|
||
I missed the bugmail for this for a long time. But I should mention that I will fight pretty hard for any glob/regexp facility in the blocklist handler. Not only blocking DLLs matching a pattern is a bad idea most of the time, but also we want to keep the code which we run in our LdrLoadDll thunk as simple as possible.
Updated•12 years ago
|
Assignee: honzab.moz → nobody
Updated•12 years ago
|
Status: ASSIGNED → NEW
Comment 27•12 years ago
|
||
I wonder if this crash is caused by the fact that we are calling recv() on the main thread, causing the LSP (malware?) to be calling socket functions only on our socket transport thread? I.e. maybe the LSP isn't thread safe? If so, we could look at alternatives to calling recv() on the main thread like this.
Updated•12 years ago
|
Crash Signature: [@ WSARecv] → [@ WSARecv]
[@ WSARecv | recv | _PR_MD_RECV | SocketRecv | nsSocketTransport::IsAlive]
Depends on: 716345
Summary: crash WSARecv → crash in nsSocketTransport::IsAlive @WSARecv
Comment 28•8 years ago
|
||
this particular stack is gone afaict.. and there is a separate bug underway to isolate all nspr io to a single thread. so this one can be closed i think
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•