Closed Bug 1551562 Opened 5 years ago Closed 3 years ago

Crash in [@ memcpy | nsJARInputStream::Read]

Categories

(Core :: Networking, defect, P1)

x86
Windows 7
defect

Tracking

()

RESOLVED FIXED
89 Branch
Tracking Status
firefox-esr68 --- wontfix
firefox-esr78 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox70 --- wontfix
firefox71 --- wontfix
firefox80 --- wontfix
firefox81 --- wontfix
firefox82 + wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox89 --- fixed

People

(Reporter: pascalc, Assigned: valentin)

References

Details

(Keywords: crash, Whiteboard: [necko-triaged])

Crash Data

Attachments

(1 file)

This bug is for crash report bp-42282c80-52e2-41d9-be6a-3c8230190514.

Top 10 frames of crashing thread:

0 vcruntime140.dll memcpy f:\dd\vctools\crt\vcruntime\src\string\i386\memcpy.asm:319
1 xul.dll nsresult nsJARInputStream::Read modules/libjar/nsJARInputStream.cpp:226
2 xul.dll nsresult nsJARInputThunk::Read modules/libjar/nsJARChannel.cpp:153
3 xul.dll nsresult mozilla::net::nsInputStreamTransport::Read netwerk/base/nsStreamTransportService.cpp:149
4 xul.dll static nsresult nsStreamCopierIB::ConsumeInputBuffer xpcom/io/nsStreamUtils.cpp:492
5 xul.dll nsresult nsPipeOutputStream::WriteSegments xpcom/io/nsPipe3.cpp:1671
6 xul.dll unsigned int nsStreamCopierOB::DoCopy xpcom/io/nsStreamUtils.cpp:551
7 xul.dll nsresult nsAStreamCopier::Run xpcom/io/nsStreamUtils.cpp:409
8 xul.dll nsresult nsThreadPool::Run xpcom/threads/nsThreadPool.cpp:241
9 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1179

This is fairly low volume on release. Comments are not particularly useful. Not exactly sure where to bucket this one based on the stack.

I have attempted reproducing this crash using the (dead) links from comments in the versions it occurred most: ESR v68.2.0 and Release v70.0.
Unfortunately, nothing sticks.

It's been 15 days since this bug had its last poke. Does any of you have an idea about its cause of reproduction? What is there to do about this bug?

Flags: needinfo?(pascalc)

Bugbug thinks this bug should belong to this component, but please revert this change in case of error.

Component: Untriaged → Networking
Product: Firefox → Core

michal, can you take a look?

Flags: needinfo?(michal.novotny)
Assignee: nobody → michal.novotny
Flags: needinfo?(michal.novotny)
Priority: -- → P1
Whiteboard: [necko-triaged]

Looking at the error codes, this can be read as:
"the memory mapped page cannot be loaded from disk because reading it from the media failed"

the second error is the reason why it has failed. there are bad sectors errors, invalid handles (closed by local clients, perhaps because of earlier detected errors), nothing saying STATUS_OBJECT_NAME_NOT_FOUND, network read errors suggesting profiles on network disks.

the crash rate is high. there is definitely some percent of users running remote profiles and also running portable usb-stick installation. but that so many of them would be faulty? hard to say, this can't be read from our crash reports.

to fix this, I believe memory mapped access should be in a try/catch and handled gracefully as a read failure. or rewrite the zip reader to not use mmap - zip reader implemented that way was an experiment and I heard a lot of complains about it since then.

(In reply to Honza Bambas (:mayhemer) from comment #7)

the error code reference page:
https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55

Thanks for this link.

(In reply to Honza Bambas (:mayhemer) from comment #6)

to fix this, I believe memory mapped access should be in a try/catch and handled gracefully as a read failure.

But that code is already in try-catch block (https://searchfox.org/mozilla-central/source/modules/libjar/nsJARInputStream.cpp#196-238). Well, it's not in try block if HAVE_SEH_EXCEPTIONS isn't defined. Not sure when this can happen.

(In reply to Michal Novotny [:michal] from comment #9)

Looks like we just need to add more error codes here https://searchfox.org/mozilla-central/rev/3811b11b5773c1dccfe8228bfc7143b10a9a2a99/modules/libjar/MmapFaultHandler.h#17

Hmm.. I'm not sure. I think the exception code and the handling code itself is correct according docs. We also build with SEH enabled on win (my local debug and searchfox builds).

The "EXCEPTION_IN_PAGE_ERROR_READ" string comes from breakpad and handles MD_EXCEPTION_CODE_WIN_IN_PAGE_ERROR which is defined as 0xc0000006 which is what EXCEPTION_IN_PAGE_ERROR code in win32 is defined as (static_assert(EXCEPTION_IN_PAGE_ERROR == 0xc0000006); proves that).

So I'm puzzled.

Just in case: static_assert(EXCEPTION_EXECUTE_HANDLER == 1); passes too.

Thanks for confirming it. That's what I realized yesterday when I was looking at the error codes. Then I'm out of ideas...

Crash Signature: [@ memcpy | nsJARInputStream::Read] → [@ memcpy | nsJARInputStream::Read] [@ memcpy_repmovs | nsJARInputStream::Read ]
Assignee: michal.novotny → nobody
Assignee: nobody → valentin.gosu
Depends on: 1657913

We seem to have a pretty big increase in volume of these crashes in beta 81.

This is the top crash on Windows for the August 26th Nightlies, with 10% of all Windows crashes.

I've been looking at this for a long time yesterday.
There seem to be several oddities related to this bug.

  1. In 81b the crash reason changes from EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_DATA_ERROR to EXCEPTION_ACCESS_VIOLATION_READ. Also the crash address goes from things like 0x2bea2a69000 to 0xcb5000
  2. It's not clear to me why the try catch block doesn't work properly when the crash reason is: EXCEPTION_IN_PAGE_ERROR_READ?

It also seems the crashes have started spiking with 81.b2 - 81.b1 had the same crash rate as 80.0
There is a chance the spike is related to bug 1627075.

My running theories for the problem:

  • We have a bug in the zip decoding code that is sometimes causing us to go over the bounds of the mmaped memory
  • The zip archive is corrupted in a way that is causing us to decode it incorrectly (unlikely)
  • Something is rewriting the archive while we're reading it - bug 1583735 comment 12 indicates that may happen sometimes.

I think we may be able to make the exception handler work if we also catch EXCEPTION_ACCESS_VIOLATION errors here.
My only concern is that this may be hiding other bugs. Aaron, what do you think?

Flags: needinfo?(aklotz)

For 81.b2 crashes
https://crash-stats.mozilla.org/report/index/615c0865-df7e-4699-85a0-8e9a50200826

VCRUNTIME140.dll!memcpy(unsigned char * dst, unsigned char * src, unsigned long count) Line 319	Unknown

		count	396818516	unsigned long
-		dst	0x087eb000 <Error reading characters of string.>	unsigned char *
			<Unable to read memory>	unsigned char
-		src	0x00008000 <Error reading characters of string.>	unsigned char *
			<Unable to read memory>	unsigned char

The source really shouldn't be that low - and the count doesn't make much sense either.

The main difference between release and beta are the changes made in bug 1627075.

Flags: needinfo?(dothayer)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #15)

It also seems the crashes have started spiking with 81.b2 - 81.b1 had the same crash rate as 80.0

FWIW 81.0b1 was only rolled out at 25%, 81.0b2 at 50%, so that would explain some difference in volume. 81.0b3 is going out today at 100%.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #15)

I think we may be able to make the exception handler work if we also catch EXCEPTION_ACCESS_VIOLATION errors here.
My only concern is that this may be hiding other bugs. Aaron, what do you think?

I'm not a fan of using the exception handler for this use case.

Flags: needinfo?(aklotz)

The changes from 1627075 and friends have been backed out on beta two days ago in bug 1656261 (https://hg.mozilla.org/releases/mozilla-beta/rev/79ff75977f09), so if that is the culprit we should see this dying down.

Flags: needinfo?(dothayer)

(In reply to Doug Thayer [:dthayer] (he/him) (PTO until Sept 1st) from comment #19)

The changes from 1627075 and friends have been backed out on beta two days ago in bug 1656261 (https://hg.mozilla.org/releases/mozilla-beta/rev/79ff75977f09), so if that is the culprit we should see this dying down.

No crashes in 81.0b6 on only a handful in 81.0b7. I think it worked. Thanks!

See Also: → 1627075

Julien, I assume this doesn't block 81 anymore?

Flags: needinfo?(jcristau)

Right, the volume seems way down.

Flags: needinfo?(jcristau)

This crash is the top signature for the September 9th Windows Nightly builds , with 13% of all crashes. I don't know if this might be caused by or fixed by the patch that landed that day in bug 1656261.

[@ vcruntime140.dll | nsJARInputStream::Read ] looks like a Windows-specific variation.

Crash Signature: [@ memcpy | nsJARInputStream::Read] [@ memcpy_repmovs | nsJARInputStream::Read ] → [@ memcpy | nsJARInputStream::Read] [@ memcpy_repmovs | nsJARInputStream::Read ] [@ vcruntime140.dll | nsJARInputStream::Read ]
Crash Signature: [@ memcpy | nsJARInputStream::Read] [@ memcpy_repmovs | nsJARInputStream::Read ] [@ vcruntime140.dll | nsJARInputStream::Read ] → [@ memcpy | nsJARInputStream::Read] [@ memcpy_repmovs | nsJARInputStream::Read ] [@ vcruntime140.dll | nsJARInputStream::Read ] [@ vcruntime140.dll | `anonymous namespace'::VerifyAppManifest]
Crash Signature: [@ memcpy | nsJARInputStream::Read] [@ memcpy_repmovs | nsJARInputStream::Read ] [@ vcruntime140.dll | nsJARInputStream::Read ] [@ vcruntime140.dll | `anonymous namespace'::VerifyAppManifest] → [@ memcpy | nsJARInputStream::Read] [@ memcpy_repmovs | nsJARInputStream::Read ] [@ vcruntime140.dll | nsJARInputStream::Read ] [@ vcruntime140.dll | `anonymous namespace'::VerifyAppManifest] [@ nsJARInputStream::Read ]
Depends on: 1675465

Looking again at the crash reason of memcpy | nsJARInputStream::Read I see:

 1 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_DATA_ERROR 	746 	55.10 %
2 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IO_DEVICE_ERROR 	452 	33.38 %
3 	EXCEPTION_ACCESS_VIOLATION_READ 	70 	5.17 %
4 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_UNEXPECTED_NETWORK_ERROR 	21 	1.55 %
5 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_HARDWARE_ERROR 	14 	1.03 %
6 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_FILE_INVALID 	12 	0.89 %
7 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IN_PAGE_ERROR 	8 	0.59 %
8 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_CONNECTION_DISCONNECTED 	6 	0.44 %
9 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IO_TIMEOUT 	6 	0.44 %
10 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_OBJECT_NAME_NOT_FOUND 	6 	0.44 %
11 	EXCEPTION_ACCESS_VIOLATION_READ / STATUS_VOLUME_DISMOUNTED 	4 	0.30 %
12 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_ACCESS_DENIED 	2 	0.15 %
13 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_INVALID_DEVICE_REQUEST 	2 	0.15 %
14 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_NO_SUCH_DEVICE 	2 	0.15 %
15 	EXCEPTION_ACCESS_VIOLATION_WRITE 	1 	0.07 %
16 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_NOT_CONNECTED 	1 	0.07 %
17 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_INVALID_PARAMETER 	1 	0.07 %

The vast majority of cases (EXCEPTION_IN_PAGE_ERROR_READ) still seem to indicate some problem with code not yet in or paged out of the memory that cannot be read anymore. This seems to happen randomly all over our code, I wonder if we can do anything about this?

The remainder of EXCEPTION_ACCESS_VIOLATION_READ/WRITE could be our fault, though. I'd concentrate investigations on those.

So we expect to have an exception handling in place for EXCEPTION_IN_PAGE_ERROR_READ if HAVE_SEH_EXCEPTIONS is true.

But looking at the disassembly taken from a minidump I cannot see any generated code for the macros.

...
  *aBytesRead = 0;
00007FFE3E496B0C 49 89 D7             mov         r15,rdx  
00007FFE3E496B0F 48 89 CF             mov         rdi,rcx  
00007FFE3E496B12 41 C7 04 24 00 00 00 00 mov         dword ptr [r12],0  

  nsresult rv = NS_OK;
  MMAP_FAULT_HANDLER_BEGIN_HANDLE(mFd)
  switch (mMode) {
00007FFE3E496B1A 8B 81 B0 00 00 00    mov         eax,dword ptr [rcx+0B0h]  
00007FFE3E496B20 45 31 F6             xor         r14d,r14d  
00007FFE3E496B23 83 F8 04             cmp         eax,4  
00007FFE3E496B26 75 6A                jne         nsJARInputStream::Read+0B2h (07FFE3E496B92h)  
      }
      break;
...
        mFd = nullptr;
00007FFE3E496B76 4C 89 E9             mov         rcx,r13  
00007FFE3E496B79 E8 E2 03 00 00       call        RefPtr<nsZipHandle>::operator= (07FFE3E496F60h)  
      }
      break;
  }
  MMAP_FAULT_HANDLER_CATCH(NS_ERROR_FAILURE)
  return rv;
}
00007FFE3E496B7E 44 89 F0             mov         eax,r14d  
00007FFE3E496B81 48 83 C4 28          add         rsp,28h  
00007FFE3E496B85 5B                   pop         rbx  
00007FFE3E496B86 5F                   pop         rdi  
00007FFE3E496B87 5E                   pop         rsi  
00007FFE3E496B88 41 5C                pop         r12  
00007FFE3E496B8A 41 5D                pop         r13  
00007FFE3E496B8C 41 5E                pop         r14  
00007FFE3E496B8E 41 5F                pop         r15  
00007FFE3E496B90 5D                   pop         rbp  
00007FFE3E496B91 C3                   ret  

So, a small update here.
I checked that exception handling works properly on windows and properly returns NS_ERROR_FAILURE.
I did this by creating a build that calls RaiseException(EXCEPTION_IN_PAGE_ERROR) when a pref is set.
I proceeded to load an addon xpi (uncompressed), flip the pref, and reload it - the result was that reloading failed with NS_ERROR_FAILURE.

Not exactly sure where the exception handler is stored on Windows, but it's probably expected that it wouldn't be on the stack.
That said, it's still rather curious that the crashes we're getting are caused by EXCEPTION_IN_PAGE_ERROR even though that should be caught.

I just took another look at some stack traces and found something interesting:
https://crash-stats.mozilla.org/report/index/4e9cfafa-7837-4826-9570-037030210401#tab-rawdump

This one happens to have both mZs and the mmaped area in the crash dump, so I could see what's being loaded via the JARChannel, and check where things are going wrong.
The file being loaded here is devtools/client/inspector/computed/computed.js
It has 48401 (0xbd11) characters which matches the values of count and mOutSize.

Looking at the memory of mZs.next_in (0x000001868c7479ec) it seems the file is properly loaded into memory, and we finally crash when going to the next page 0x1868c748000 - which doesn't seem to be mapped yet.

I think this is a step forward. I was beginning to wonder if there's some other obvious logic error causing these crashes to happen.
Now the question is why is this happening?

I think I finally figured it out.
It turns out the problem isn't with MMAP_FAULT_HANDLER_CATCH, but with calling memcpy from inside the try/catch block.
Since memcpy is a C function, it is guaranteed to not throw any exception, so the exception handling was optimized out.
https://stackoverflow.com/questions/7164019/does-memcpy-not-throw-exceptions

I did a few tests with reading past the end of the mmaped memory. When using memcpy Firefox crashed with EXCEPTION_ACCESS_VIOLATION no matter how I configured the exception handling. If I switched to using CopyMemory from windows.h then I was able to intercept the exception and not crash.

My quick fix for this is to use CopyMemory when HAVE_SEH_EXCEPTIONS is defined, in nsJARInputStream.cpp
This should get rid of a large fraction of these issues.

As a follow-up we should check that the C-implemented methods that we call from inside the try/catch blocks, like the inflate() and BrotliDecoderDecompressStream don't have the same issue.

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/18416ab86595
Use CopyMemory instead of memcpy on Windows inside try/catch blocks r=necko-reviewers,kershaw
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 89 Branch

(In reply to Valentin Gosu [:valentin] (he/him) from comment #29)

I think I finally figured it out.
It turns out the problem isn't with MMAP_FAULT_HANDLER_CATCH, but with calling memcpy from inside the try/catch block.
Since memcpy is a C function, it is guaranteed to not throw any exception, so the exception handling was optimized out.
https://stackoverflow.com/questions/7164019/does-memcpy-not-throw-exceptions

I did a few tests with reading past the end of the mmaped memory. When using memcpy Firefox crashed with EXCEPTION_ACCESS_VIOLATION no matter how I configured the exception handling. If I switched to using CopyMemory from windows.h then I was able to intercept the exception and not crash.

Great find, Valentin!

Depends on: 1707853
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: