Crash in [@ memcpy | nsJARInputStream::Read]
Categories
(Core :: Networking, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr68 | --- | wontfix |
firefox-esr78 | --- | wontfix |
firefox66 | --- | wontfix |
firefox67 | --- | wontfix |
firefox68 | --- | wontfix |
firefox69 | --- | wontfix |
firefox70 | --- | wontfix |
firefox71 | --- | wontfix |
firefox80 | --- | wontfix |
firefox81 | --- | wontfix |
firefox82 | + | wontfix |
firefox87 | --- | wontfix |
firefox88 | --- | wontfix |
firefox89 | --- | fixed |
People
(Reporter: pascalc, Assigned: valentin)
References
Details
(Keywords: crash, Whiteboard: [necko-triaged])
Crash Data
Attachments
(1 file)
This bug is for crash report bp-42282c80-52e2-41d9-be6a-3c8230190514.
Top 10 frames of crashing thread:
0 vcruntime140.dll memcpy f:\dd\vctools\crt\vcruntime\src\string\i386\memcpy.asm:319
1 xul.dll nsresult nsJARInputStream::Read modules/libjar/nsJARInputStream.cpp:226
2 xul.dll nsresult nsJARInputThunk::Read modules/libjar/nsJARChannel.cpp:153
3 xul.dll nsresult mozilla::net::nsInputStreamTransport::Read netwerk/base/nsStreamTransportService.cpp:149
4 xul.dll static nsresult nsStreamCopierIB::ConsumeInputBuffer xpcom/io/nsStreamUtils.cpp:492
5 xul.dll nsresult nsPipeOutputStream::WriteSegments xpcom/io/nsPipe3.cpp:1671
6 xul.dll unsigned int nsStreamCopierOB::DoCopy xpcom/io/nsStreamUtils.cpp:551
7 xul.dll nsresult nsAStreamCopier::Run xpcom/io/nsStreamUtils.cpp:409
8 xul.dll nsresult nsThreadPool::Run xpcom/threads/nsThreadPool.cpp:241
9 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1179
Reporter | ||
Updated•6 years ago
|
Comment 1•5 years ago
|
||
This is fairly low volume on release. Comments are not particularly useful. Not exactly sure where to bucket this one based on the stack.
Comment 2•5 years ago
|
||
I have attempted reproducing this crash using the (dead) links from comments in the versions it occurred most: ESR v68.2.0 and Release v70.0.
Unfortunately, nothing sticks.
Comment 3•5 years ago
|
||
It's been 15 days since this bug had its last poke. Does any of you have an idea about its cause of reproduction? What is there to do about this bug?
Reporter | ||
Updated•5 years ago
|
Comment 4•5 years ago
|
||
Bugbug thinks this bug should belong to this component, but please revert this change in case of error.
Updated•5 years ago
|
Comment 6•5 years ago
|
||
Looking at the error codes, this can be read as:
"the memory mapped page cannot be loaded from disk because reading it from the media failed"
the second error is the reason why it has failed. there are bad sectors errors, invalid handles (closed by local clients, perhaps because of earlier detected errors), nothing saying STATUS_OBJECT_NAME_NOT_FOUND, network read errors suggesting profiles on network disks.
the crash rate is high. there is definitely some percent of users running remote profiles and also running portable usb-stick installation. but that so many of them would be faulty? hard to say, this can't be read from our crash reports.
to fix this, I believe memory mapped access should be in a try/catch and handled gracefully as a read failure. or rewrite the zip reader to not use mmap - zip reader implemented that way was an experiment and I heard a lot of complains about it since then.
Comment 7•5 years ago
|
||
the error code reference page:
https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
Comment 8•5 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #7)
the error code reference page:
https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
Thanks for this link.
(In reply to Honza Bambas (:mayhemer) from comment #6)
to fix this, I believe memory mapped access should be in a try/catch and handled gracefully as a read failure.
But that code is already in try-catch block (https://searchfox.org/mozilla-central/source/modules/libjar/nsJARInputStream.cpp#196-238). Well, it's not in try block if HAVE_SEH_EXCEPTIONS isn't defined. Not sure when this can happen.
Comment 9•5 years ago
|
||
Looks like we just need to add more error codes here https://searchfox.org/mozilla-central/rev/3811b11b5773c1dccfe8228bfc7143b10a9a2a99/modules/libjar/MmapFaultHandler.h#17
Comment 10•5 years ago
|
||
(In reply to Michal Novotny [:michal] from comment #9)
Looks like we just need to add more error codes here https://searchfox.org/mozilla-central/rev/3811b11b5773c1dccfe8228bfc7143b10a9a2a99/modules/libjar/MmapFaultHandler.h#17
Hmm.. I'm not sure. I think the exception code and the handling code itself is correct according docs. We also build with SEH enabled on win (my local debug and searchfox builds).
The "EXCEPTION_IN_PAGE_ERROR_READ" string comes from breakpad and handles MD_EXCEPTION_CODE_WIN_IN_PAGE_ERROR
which is defined as 0xc0000006 which is what EXCEPTION_IN_PAGE_ERROR
code in win32 is defined as (static_assert(EXCEPTION_IN_PAGE_ERROR == 0xc0000006);
proves that).
So I'm puzzled.
Comment 11•5 years ago
|
||
Just in case: static_assert(EXCEPTION_EXECUTE_HANDLER == 1);
passes too.
Comment 12•5 years ago
|
||
Thanks for confirming it. That's what I realized yesterday when I was looking at the error codes. Then I'm out of ideas...
Updated•4 years ago
|
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Updated•4 years ago
|
Comment 13•4 years ago
|
||
We seem to have a pretty big increase in volume of these crashes in beta 81.
Comment 14•4 years ago
|
||
This is the top crash on Windows for the August 26th Nightlies, with 10% of all Windows crashes.
Assignee | ||
Comment 15•4 years ago
|
||
I've been looking at this for a long time yesterday.
There seem to be several oddities related to this bug.
- In 81b the crash reason changes from
EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_DATA_ERROR
toEXCEPTION_ACCESS_VIOLATION_READ
. Also the crash address goes from things like 0x2bea2a69000 to 0xcb5000 - It's not clear to me why the try catch block doesn't work properly when the crash reason is: EXCEPTION_IN_PAGE_ERROR_READ?
It also seems the crashes have started spiking with 81.b2 - 81.b1 had the same crash rate as 80.0
There is a chance the spike is related to bug 1627075.
My running theories for the problem:
- We have a bug in the zip decoding code that is sometimes causing us to go over the bounds of the mmaped memory
- The zip archive is corrupted in a way that is causing us to decode it incorrectly (unlikely)
- Something is rewriting the archive while we're reading it - bug 1583735 comment 12 indicates that may happen sometimes.
I think we may be able to make the exception handler work if we also catch EXCEPTION_ACCESS_VIOLATION errors here.
My only concern is that this may be hiding other bugs. Aaron, what do you think?
Assignee | ||
Comment 16•4 years ago
•
|
||
For 81.b2 crashes
https://crash-stats.mozilla.org/report/index/615c0865-df7e-4699-85a0-8e9a50200826
VCRUNTIME140.dll!memcpy(unsigned char * dst, unsigned char * src, unsigned long count) Line 319 Unknown
count 396818516 unsigned long
- dst 0x087eb000 <Error reading characters of string.> unsigned char *
<Unable to read memory> unsigned char
- src 0x00008000 <Error reading characters of string.> unsigned char *
<Unable to read memory> unsigned char
The source really shouldn't be that low - and the count doesn't make much sense either.
The main difference between release and beta are the changes made in bug 1627075.
Comment 17•4 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #15)
It also seems the crashes have started spiking with 81.b2 - 81.b1 had the same crash rate as 80.0
FWIW 81.0b1 was only rolled out at 25%, 81.0b2 at 50%, so that would explain some difference in volume. 81.0b3 is going out today at 100%.
Comment 18•4 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #15)
I think we may be able to make the exception handler work if we also catch EXCEPTION_ACCESS_VIOLATION errors here.
My only concern is that this may be hiding other bugs. Aaron, what do you think?
I'm not a fan of using the exception handler for this use case.
Comment 19•4 years ago
|
||
The changes from 1627075 and friends have been backed out on beta two days ago in bug 1656261 (https://hg.mozilla.org/releases/mozilla-beta/rev/79ff75977f09), so if that is the culprit we should see this dying down.
Assignee | ||
Comment 20•4 years ago
|
||
(In reply to Doug Thayer [:dthayer] (he/him) (PTO until Sept 1st) from comment #19)
The changes from 1627075 and friends have been backed out on beta two days ago in bug 1656261 (https://hg.mozilla.org/releases/mozilla-beta/rev/79ff75977f09), so if that is the culprit we should see this dying down.
No crashes in 81.0b6 on only a handful in 81.0b7. I think it worked. Thanks!
Assignee | ||
Comment 21•4 years ago
|
||
Julien, I assume this doesn't block 81 anymore?
Comment 22•4 years ago
|
||
Right, the volume seems way down.
Comment 23•4 years ago
|
||
This crash is the top signature for the September 9th Windows Nightly builds , with 13% of all crashes. I don't know if this might be caused by or fixed by the patch that landed that day in bug 1656261.
Comment 24•4 years ago
|
||
[@ vcruntime140.dll | nsJARInputStream::Read ] looks like a Windows-specific variation.
Comment 25•4 years ago
|
||
Bug 1656261 comment 48 should have stopped the spike in 82.
Updated•4 years ago
|
Updated•4 years ago
|
Comment 26•4 years ago
|
||
Looking again at the crash reason of memcpy | nsJARInputStream::Read I see:
1 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_DATA_ERROR 746 55.10 %
2 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IO_DEVICE_ERROR 452 33.38 %
3 EXCEPTION_ACCESS_VIOLATION_READ 70 5.17 %
4 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_UNEXPECTED_NETWORK_ERROR 21 1.55 %
5 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_HARDWARE_ERROR 14 1.03 %
6 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_FILE_INVALID 12 0.89 %
7 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IN_PAGE_ERROR 8 0.59 %
8 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_CONNECTION_DISCONNECTED 6 0.44 %
9 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IO_TIMEOUT 6 0.44 %
10 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_OBJECT_NAME_NOT_FOUND 6 0.44 %
11 EXCEPTION_ACCESS_VIOLATION_READ / STATUS_VOLUME_DISMOUNTED 4 0.30 %
12 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_ACCESS_DENIED 2 0.15 %
13 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_INVALID_DEVICE_REQUEST 2 0.15 %
14 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_NO_SUCH_DEVICE 2 0.15 %
15 EXCEPTION_ACCESS_VIOLATION_WRITE 1 0.07 %
16 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_NOT_CONNECTED 1 0.07 %
17 EXCEPTION_IN_PAGE_ERROR_READ / STATUS_INVALID_PARAMETER 1 0.07 %
The vast majority of cases (EXCEPTION_IN_PAGE_ERROR_READ
) still seem to indicate some problem with code not yet in or paged out of the memory that cannot be read anymore. This seems to happen randomly all over our code, I wonder if we can do anything about this?
The remainder of EXCEPTION_ACCESS_VIOLATION_READ/WRITE
could be our fault, though. I'd concentrate investigations on those.
Comment 27•4 years ago
|
||
So we expect to have an exception handling in place for EXCEPTION_IN_PAGE_ERROR_READ
if HAVE_SEH_EXCEPTIONS
is true.
But looking at the disassembly taken from a minidump I cannot see any generated code for the macros.
...
*aBytesRead = 0;
00007FFE3E496B0C 49 89 D7 mov r15,rdx
00007FFE3E496B0F 48 89 CF mov rdi,rcx
00007FFE3E496B12 41 C7 04 24 00 00 00 00 mov dword ptr [r12],0
nsresult rv = NS_OK;
MMAP_FAULT_HANDLER_BEGIN_HANDLE(mFd)
switch (mMode) {
00007FFE3E496B1A 8B 81 B0 00 00 00 mov eax,dword ptr [rcx+0B0h]
00007FFE3E496B20 45 31 F6 xor r14d,r14d
00007FFE3E496B23 83 F8 04 cmp eax,4
00007FFE3E496B26 75 6A jne nsJARInputStream::Read+0B2h (07FFE3E496B92h)
}
break;
...
mFd = nullptr;
00007FFE3E496B76 4C 89 E9 mov rcx,r13
00007FFE3E496B79 E8 E2 03 00 00 call RefPtr<nsZipHandle>::operator= (07FFE3E496F60h)
}
break;
}
MMAP_FAULT_HANDLER_CATCH(NS_ERROR_FAILURE)
return rv;
}
00007FFE3E496B7E 44 89 F0 mov eax,r14d
00007FFE3E496B81 48 83 C4 28 add rsp,28h
00007FFE3E496B85 5B pop rbx
00007FFE3E496B86 5F pop rdi
00007FFE3E496B87 5E pop rsi
00007FFE3E496B88 41 5C pop r12
00007FFE3E496B8A 41 5D pop r13
00007FFE3E496B8C 41 5E pop r14
00007FFE3E496B8E 41 5F pop r15
00007FFE3E496B90 5D pop rbp
00007FFE3E496B91 C3 ret
Assignee | ||
Comment 28•4 years ago
|
||
So, a small update here.
I checked that exception handling works properly on windows and properly returns NS_ERROR_FAILURE.
I did this by creating a build that calls RaiseException(EXCEPTION_IN_PAGE_ERROR) when a pref is set.
I proceeded to load an addon xpi (uncompressed), flip the pref, and reload it - the result was that reloading failed with NS_ERROR_FAILURE.
Not exactly sure where the exception handler is stored on Windows, but it's probably expected that it wouldn't be on the stack.
That said, it's still rather curious that the crashes we're getting are caused by EXCEPTION_IN_PAGE_ERROR even though that should be caught.
I just took another look at some stack traces and found something interesting:
https://crash-stats.mozilla.org/report/index/4e9cfafa-7837-4826-9570-037030210401#tab-rawdump
This one happens to have both mZs and the mmaped area in the crash dump, so I could see what's being loaded via the JARChannel, and check where things are going wrong.
The file being loaded here is devtools/client/inspector/computed/computed.js
It has 48401 (0xbd11) characters which matches the values of count and mOutSize.
Looking at the memory of mZs.next_in (0x000001868c7479ec) it seems the file is properly loaded into memory, and we finally crash when going to the next page 0x1868c748000 - which doesn't seem to be mapped yet.
I think this is a step forward. I was beginning to wonder if there's some other obvious logic error causing these crashes to happen.
Now the question is why is this happening?
Assignee | ||
Comment 29•4 years ago
|
||
I think I finally figured it out.
It turns out the problem isn't with MMAP_FAULT_HANDLER_CATCH
, but with calling memcpy from inside the try/catch block.
Since memcpy
is a C function, it is guaranteed to not throw any exception, so the exception handling was optimized out.
https://stackoverflow.com/questions/7164019/does-memcpy-not-throw-exceptions
I did a few tests with reading past the end of the mmaped memory. When using memcpy Firefox crashed with EXCEPTION_ACCESS_VIOLATION no matter how I configured the exception handling. If I switched to using CopyMemory
from windows.h
then I was able to intercept the exception and not crash.
My quick fix for this is to use CopyMemory when HAVE_SEH_EXCEPTIONS is defined, in nsJARInputStream.cpp
This should get rid of a large fraction of these issues.
As a follow-up we should check that the C-implemented methods that we call from inside the try/catch blocks, like the inflate()
and BrotliDecoderDecompressStream
don't have the same issue.
Assignee | ||
Comment 30•4 years ago
|
||
Comment 31•4 years ago
|
||
Updated•4 years ago
|
Comment 32•4 years ago
|
||
bugherder |
Comment 33•4 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #29)
I think I finally figured it out.
It turns out the problem isn't withMMAP_FAULT_HANDLER_CATCH
, but with calling memcpy from inside the try/catch block.
Sincememcpy
is a C function, it is guaranteed to not throw any exception, so the exception handling was optimized out.
https://stackoverflow.com/questions/7164019/does-memcpy-not-throw-exceptionsI did a few tests with reading past the end of the mmaped memory. When using memcpy Firefox crashed with EXCEPTION_ACCESS_VIOLATION no matter how I configured the exception handling. If I switched to using
CopyMemory
fromwindows.h
then I was able to intercept the exception and not crash.
Great find, Valentin!
Updated•4 years ago
|
Description
•