<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Updated

•

5 years ago

Assignee: nobody → michal.novotny

Flags: needinfo?(michal.novotny)

Priority: -- → P1

Whiteboard: [necko-triaged]

Comment 6

•

5 years ago

Looking at the error codes, this can be read as:
"the memory mapped page cannot be loaded from disk because reading it from the media failed"

the second error is the reason why it has failed. there are bad sectors errors, invalid handles (closed by local clients, perhaps because of earlier detected errors), nothing saying STATUS_OBJECT_NAME_NOT_FOUND, network read errors suggesting profiles on network disks.

the crash rate is high. there is definitely some percent of users running remote profiles and also running portable usb-stick installation. but that so many of them would be faulty? hard to say, this can't be read from our crash reports.

to fix this, I believe memory mapped access should be in a try/catch and handled gracefully as a read failure. or rewrite the zip reader to not use mmap - zip reader implemented that way was an experiment and I heard a lot of complains about it since then.

Comment 7

•

5 years ago

the error code reference page:
https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55

Comment 8

•

5 years ago

(In reply to Honza Bambas (:mayhemer) from comment #7)

the error code reference page:
https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55

Thanks for this link.

(In reply to Honza Bambas (:mayhemer) from comment #6)

to fix this, I believe memory mapped access should be in a try/catch and handled gracefully as a read failure.

But that code is already in try-catch block (https://searchfox.org/mozilla-central/source/modules/libjar/nsJARInputStream.cpp#196-238). Well, it's not in try block if HAVE_SEH_EXCEPTIONS isn't defined. Not sure when this can happen.

Comment 9

•

5 years ago

Looks like we just need to add more error codes here https://searchfox.org/mozilla-central/rev/3811b11b5773c1dccfe8228bfc7143b10a9a2a99/modules/libjar/MmapFaultHandler.h#17

Comment 10

•

5 years ago

(In reply to Michal Novotny [:michal] from comment #9)

Looks like we just need to add more error codes here https://searchfox.org/mozilla-central/rev/3811b11b5773c1dccfe8228bfc7143b10a9a2a99/modules/libjar/MmapFaultHandler.h#17

Hmm.. I'm not sure. I think the exception code and the handling code itself is correct according docs. We also build with SEH enabled on win (my local debug and searchfox builds).

The "EXCEPTION_IN_PAGE_ERROR_READ" string comes from breakpad and handles MD_EXCEPTION_CODE_WIN_IN_PAGE_ERROR which is defined as 0xc0000006 which is what EXCEPTION_IN_PAGE_ERROR code in win32 is defined as (static_assert(EXCEPTION_IN_PAGE_ERROR == 0xc0000006); proves that).

So I'm puzzled.

Comment 11

•

5 years ago

Just in case: static_assert(EXCEPTION_EXECUTE_HANDLER == 1); passes too.

Comment 12

•

5 years ago

Thanks for confirming it. That's what I realized yesterday when I was looking at the error codes. Then I'm out of ideas...

Updated

•

4 years ago

Crash Signature: [@ memcpy | nsJARInputStream::Read] → [@ memcpy | nsJARInputStream::Read] [@ memcpy_repmovs | nsJARInputStream::Read ]

Updated

•

4 years ago

Assignee: michal.novotny → nobody

Ryan VanderMeulen [:RyanVM]

Assignee

Updated

•

4 years ago

Assignee: nobody → valentin.gosu

Updated

•

4 years ago

status-firefox70: affected → wontfix

status-firefox71: affected → wontfix

status-firefox-esr68: affected → wontfix

Assignee

Updated

•

4 years ago

Depends on: 1657913

Comment 13

•

4 years ago

We seem to have a pretty big increase in volume of these crashes in beta 81.

status-firefox80: --- → wontfix

status-firefox81: --- → affected

status-firefox82: --- → affected

tracking-firefox81: --- → blocking

tracking-firefox82: --- → +

Comment 14

•

4 years ago

This is the top crash on Windows for the August 26th Nightlies, with 10% of all Windows crashes.

Assignee

Comment 15

•

4 years ago

I've been looking at this for a long time yesterday.
There seem to be several oddities related to this bug.

In 81b the crash reason changes from EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_DATA_ERROR to EXCEPTION_ACCESS_VIOLATION_READ. Also the crash address goes from things like 0x2bea2a69000 to 0xcb5000
It's not clear to me why the try catch block doesn't work properly when the crash reason is: EXCEPTION_IN_PAGE_ERROR_READ?

It also seems the crashes have started spiking with 81.b2 - 81.b1 had the same crash rate as 80.0
There is a chance the spike is related to bug 1627075.

My running theories for the problem:

We have a bug in the zip decoding code that is sometimes causing us to go over the bounds of the mmaped memory
The zip archive is corrupted in a way that is causing us to decode it incorrectly (unlikely)
Something is rewriting the archive while we're reading it - bug 1583735 comment 12 indicates that may happen sometimes.

I think we may be able to make the exception handler work if we also catch EXCEPTION_ACCESS_VIOLATION errors here.
My only concern is that this may be hiding other bugs. Aaron, what do you think?

Flags: needinfo?(aklotz)

Assignee

Comment 16

•

4 years ago

•

Edited

For 81.b2 crashes
https://crash-stats.mozilla.org/report/index/615c0865-df7e-4699-85a0-8e9a50200826

VCRUNTIME140.dll!memcpy(unsigned char * dst, unsigned char * src, unsigned long count) Line 319	Unknown

		count	396818516	unsigned long
-		dst	0x087eb000 <Error reading characters of string.>	unsigned char *
			<Unable to read memory>	unsigned char
-		src	0x00008000 <Error reading characters of string.>	unsigned char *
			<Unable to read memory>	unsigned char

The source really shouldn't be that low - and the count doesn't make much sense either.

The main difference between release and beta are the changes made in bug 1627075.

Flags: needinfo?(dothayer)

(No longer employed by Mozilla) Aaron Klotz

Comment 17

•

4 years ago

(In reply to Valentin Gosu [:valentin] (he/him) from comment #15)

It also seems the crashes have started spiking with 81.b2 - 81.b1 had the same crash rate as 80.0

FWIW 81.0b1 was only rolled out at 25%, 81.0b2 at 50%, so that would explain some difference in volume. 81.0b3 is going out today at 100%.

Comment 18

•

4 years ago

(In reply to Valentin Gosu [:valentin] (he/him) from comment #15)

I think we may be able to make the exception handler work if we also catch EXCEPTION_ACCESS_VIOLATION errors here.
My only concern is that this may be hiding other bugs. Aaron, what do you think?

I'm not a fan of using the exception handler for this use case.

Flags: needinfo?(aklotz)

Alex Thayer [:alexical] (she/her)

Comment 19

•

4 years ago

The changes from 1627075 and friends have been backed out on beta two days ago in bug 1656261 (https://hg.mozilla.org/releases/mozilla-beta/rev/79ff75977f09), so if that is the culprit we should see this dying down.

Flags: needinfo?(dothayer)

Assignee

Comment 20

•

4 years ago

(In reply to Doug Thayer [:dthayer] (he/him) (PTO until Sept 1st) from comment #19)

The changes from 1627075 and friends have been backed out on beta two days ago in bug 1656261 (https://hg.mozilla.org/releases/mozilla-beta/rev/79ff75977f09), so if that is the culprit we should see this dying down.

No crashes in 81.0b6 on only a handful in 81.0b7. I think it worked. Thanks!

Comment 21

•

4 years ago

Julien, I assume this doesn't block 81 anymore?

Flags: needinfo?(jcristau)

Comment 22

•

4 years ago

Right, the volume seems way down.

status-firefox81: affected → wontfix

tracking-firefox81: blocking → ---

Flags: needinfo?(jcristau)

Comment 23

•

4 years ago

This crash is the top signature for the September 9th Windows Nightly builds , with 13% of all crashes. I don't know if this might be caused by or fixed by the patch that landed that day in bug 1656261.

Comment 24

•

4 years ago

[@ vcruntime140.dll | nsJARInputStream::Read ] looks like a Windows-specific variation.

Comment 25

•

4 years ago

Bug 1656261 comment 48 should have stopped the spike in 82.

status-firefox82: affected → wontfix

[:philipp]

Updated

•

4 years ago

Jens Stutte [:jstutte]

Updated

•

4 years ago

Crash Signature: [@ memcpy | nsJARInputStream::Read] [@ memcpy_repmovs | nsJARInputStream::Read ] [@ vcruntime140.dll | nsJARInputStream::Read ] [@ vcruntime140.dll | `anonymous namespace'::VerifyAppManifest] → [@ memcpy | nsJARInputStream::Read] [@ memcpy_repmovs | nsJARInputStream::Read ] [@ vcruntime140.dll | nsJARInputStream::Read ] [@ vcruntime140.dll | `anonymous namespace'::VerifyAppManifest] [@ nsJARInputStream::Read ]

Assignee

Updated

•

4 years ago

Depends on: 1675465

Jens Stutte [:jstutte]

Comment 26

•

4 years ago

Looking again at the crash reason of memcpy | nsJARInputStream::Read I see:

 1 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_DATA_ERROR 	746 	55.10 %
2 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IO_DEVICE_ERROR 	452 	33.38 %
3 	EXCEPTION_ACCESS_VIOLATION_READ 	70 	5.17 %
4 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_UNEXPECTED_NETWORK_ERROR 	21 	1.55 %
5 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_HARDWARE_ERROR 	14 	1.03 %
6 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_FILE_INVALID 	12 	0.89 %
7 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IN_PAGE_ERROR 	8 	0.59 %
8 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_CONNECTION_DISCONNECTED 	6 	0.44 %
9 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IO_TIMEOUT 	6 	0.44 %
10 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_OBJECT_NAME_NOT_FOUND 	6 	0.44 %
11 	EXCEPTION_ACCESS_VIOLATION_READ / STATUS_VOLUME_DISMOUNTED 	4 	0.30 %
12 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_ACCESS_DENIED 	2 	0.15 %
13 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_INVALID_DEVICE_REQUEST 	2 	0.15 %
14 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_NO_SUCH_DEVICE 	2 	0.15 %
15 	EXCEPTION_ACCESS_VIOLATION_WRITE 	1 	0.07 %
16 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_NOT_CONNECTED 	1 	0.07 %
17 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_INVALID_PARAMETER 	1 	0.07 %

The vast majority of cases (EXCEPTION_IN_PAGE_ERROR_READ) still seem to indicate some problem with code not yet in or paged out of the memory that cannot be read anymore. This seems to happen randomly all over our code, I wonder if we can do anything about this?

The remainder of EXCEPTION_ACCESS_VIOLATION_READ/WRITE could be our fault, though. I'd concentrate investigations on those.

Jens Stutte [:jstutte]

Comment 27

•

4 years ago

So we expect to have an exception handling in place for EXCEPTION_IN_PAGE_ERROR_READ if HAVE_SEH_EXCEPTIONS is true.

But looking at the disassembly taken from a minidump I cannot see any generated code for the macros.

...
  *aBytesRead = 0;
00007FFE3E496B0C 49 89 D7             mov         r15,rdx  
00007FFE3E496B0F 48 89 CF             mov         rdi,rcx  
00007FFE3E496B12 41 C7 04 24 00 00 00 00 mov         dword ptr [r12],0  

  nsresult rv = NS_OK;
  MMAP_FAULT_HANDLER_BEGIN_HANDLE(mFd)
  switch (mMode) {
00007FFE3E496B1A 8B 81 B0 00 00 00    mov         eax,dword ptr [rcx+0B0h]  
00007FFE3E496B20 45 31 F6             xor         r14d,r14d  
00007FFE3E496B23 83 F8 04             cmp         eax,4  
00007FFE3E496B26 75 6A                jne         nsJARInputStream::Read+0B2h (07FFE3E496B92h)  
      }
      break;
...
        mFd = nullptr;
00007FFE3E496B76 4C 89 E9             mov         rcx,r13  
00007FFE3E496B79 E8 E2 03 00 00       call        RefPtr<nsZipHandle>::operator= (07FFE3E496F60h)  
      }
      break;
  }
  MMAP_FAULT_HANDLER_CATCH(NS_ERROR_FAILURE)
  return rv;
}
00007FFE3E496B7E 44 89 F0             mov         eax,r14d  
00007FFE3E496B81 48 83 C4 28          add         rsp,28h  
00007FFE3E496B85 5B                   pop         rbx  
00007FFE3E496B86 5F                   pop         rdi  
00007FFE3E496B87 5E                   pop         rsi  
00007FFE3E496B88 41 5C                pop         r12  
00007FFE3E496B8A 41 5D                pop         r13  
00007FFE3E496B8C 41 5E                pop         r14  
00007FFE3E496B8E 41 5F                pop         r15  
00007FFE3E496B90 5D                   pop         rbp  
00007FFE3E496B91 C3                   ret

Assignee

Comment 28

•

4 years ago

So, a small update here.
I checked that exception handling works properly on windows and properly returns NS_ERROR_FAILURE.
I did this by creating a build that calls RaiseException(EXCEPTION_IN_PAGE_ERROR) when a pref is set.
I proceeded to load an addon xpi (uncompressed), flip the pref, and reload it - the result was that reloading failed with NS_ERROR_FAILURE.

Not exactly sure where the exception handler is stored on Windows, but it's probably expected that it wouldn't be on the stack.
That said, it's still rather curious that the crashes we're getting are caused by EXCEPTION_IN_PAGE_ERROR even though that should be caught.

I just took another look at some stack traces and found something interesting:
https://crash-stats.mozilla.org/report/index/4e9cfafa-7837-4826-9570-037030210401#tab-rawdump

This one happens to have both mZs and the mmaped area in the crash dump, so I could see what's being loaded via the JARChannel, and check where things are going wrong.
The file being loaded here is devtools/client/inspector/computed/computed.js
It has 48401 (0xbd11) characters which matches the values of count and mOutSize.

Looking at the memory of mZs.next_in (0x000001868c7479ec) it seems the file is properly loaded into memory, and we finally crash when going to the next page 0x1868c748000 - which doesn't seem to be mapped yet.

I think this is a step forward. I was beginning to wonder if there's some other obvious logic error causing these crashes to happen.
Now the question is why is this happening?

Assignee

Comment 29

•

4 years ago

I think I finally figured it out.
It turns out the problem isn't with MMAP_FAULT_HANDLER_CATCH, but with calling memcpy from inside the try/catch block.
Since memcpy is a C function, it is guaranteed to not throw any exception, so the exception handling was optimized out.
https://stackoverflow.com/questions/7164019/does-memcpy-not-throw-exceptions

I did a few tests with reading past the end of the mmaped memory. When using memcpy Firefox crashed with EXCEPTION_ACCESS_VIOLATION no matter how I configured the exception handling. If I switched to using CopyMemory from windows.h then I was able to intercept the exception and not crash.

My quick fix for this is to use CopyMemory when HAVE_SEH_EXCEPTIONS is defined, in nsJARInputStream.cpp
This should get rid of a large fraction of these issues.

As a follow-up we should check that the C-implemented methods that we call from inside the try/catch blocks, like the inflate() and BrotliDecoderDecompressStream don't have the same issue.

Ryan VanderMeulen [:RyanVM]

Assignee

Comment 30

•

4 years ago

Attached file Bug 1551562 - Use CopyMemory instead of memcpy on Windows inside try/catch blocks r=#necko — Details

Pulsebot

Comment 31

•

4 years ago

Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/18416ab86595 Use CopyMemory instead of memcpy on Windows inside try/catch blocks r=necko-reviewers,kershaw

Updated

•

4 years ago

status-firefox87: --- → wontfix

status-firefox88: --- → wontfix

status-firefox89: --- → affected

status-firefox-esr78: --- → affected

Alexandru Michis [:malexandru]

Comment 32

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/18416ab86595

Status: NEW → RESOLVED

Closed: 4 years ago

status-firefox89: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 89 Branch

(No longer employed by Mozilla) Aaron Klotz

Comment 33

•

4 years ago

(In reply to Valentin Gosu [:valentin] (he/him) from comment #29)

I think I finally figured it out.
It turns out the problem isn't with MMAP_FAULT_HANDLER_CATCH, but with calling memcpy from inside the try/catch block.
Since memcpy is a C function, it is guaranteed to not throw any exception, so the exception handling was optimized out.
https://stackoverflow.com/questions/7164019/does-memcpy-not-throw-exceptions

I did a few tests with reading past the end of the mmaped memory. When using memcpy Firefox crashed with EXCEPTION_ACCESS_VIOLATION no matter how I configured the exception handling. If I switched to using CopyMemory from windows.h then I was able to intercept the exception and not crash.

Great find, Valentin!