Closed Bug 1035892 Opened 10 years ago Closed 3 years ago

All 64-bit mode crashes on OS X with reason EXC_BAD_ACCESS have crash addresses truncated to 32-bits

Categories

(Toolkit :: Crash Reporting, defect, P2)

x86_64
macOS
defect

Tracking

()

RESOLVED FIXED
91 Branch
Tracking Status
firefox91 --- fixed

People

(Reporter: smichaud, Assigned: gsvelto)

References

(Blocks 4 open bugs)

Details

Attachments

(1 file, 1 obsolete file)

[This bug is spun off from bug 1002564, which has grown too unwieldy. This bug covers issue #2 from bug 1002564 comment #14.] Bugs in OS X Breakpad code cause crashes with "reason" EXC_BAD_ACCESS to have their crash addresses > 0xffffffff truncated to 32 bits. (Crash addresses > 0x80000000 are also sign extended.) A patch for this was already posted at bug 1002564 comment #26. I'll update it to current trunk and repost it here.
Severity: normal → major
Attached patch Fix (obsolete) — Splinter Review
I based this partly on the following patch from Vlad: http://people.mozilla.com/~vladimir/misc/breakpad-64-bit-exc.patch I also found valuable information here: http://stackoverflow.com/questions/2824105/handling-mach-exceptions-in-64bit-os-x-application The only real documentation for this stuff is in the xnu source code. The source for Apple's implementation of it is available at http://opensource.apple.com/. Unlike Vlad, I made Breakpad (and the OS) use 64-bit "subcodes" in both 64-bit and 32-bit mode. This works just fine in 32-bit mode (which I tested), and it makes the code simpler. Ted, if you think this needs more reviewers, please add them as you see fit. I've started tryserver builds here: https://tbpl.mozilla.org/?tree=Try&rev=d9283e658e9c
Assignee: nobody → smichaud
Attachment #8452400 - Flags: review?(ted)

Taking this.

Assignee: smichaud → gsvelto
Status: NEW → ASSIGNED
Attachment #8452400 - Attachment is obsolete: true

I've updated the patch trying to address the review comments as best as I could. I've also double-checked other implementations and run both automated and manual tests and everything seems to work correctly. Heavy emphasis on seems considering how many unknown corner-cases there might be.

There seems to be an issue with marionette tests, it didn't show up locally in debug builds only try opt ones: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fd1961ad3347270a4827ab719dff4fd303cd18cb&selectedJob=238146499

I'm testing again with a local opt build.

While I couldn't reproduce the failure locally it seems like it's a race in the minidump generation or something along the lines. In the marionette logs I can see three lock failures before the actual error:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=238146533&repo=try&lineNumber=39510-39513

Locally everything works fine, even after hundreds of runs with a heavy background load to slow down the system. This is definitely a race and possibly a narrow one. My changes here might have made it wide enough to reproduce frequently on try.

Interestingly we have a bunch of similar issues regarding marionette that occur only very rarely and I haven't been able to debug them yet (see bug 1523583). Hopefully this is something similar. There is also a known race in the crash reporting code (bug 1489536) but I don't think that's what causing the problem here.

Priority: -- → P2
Blocks: 1523276

I finally found some time to investigate the issue I encountered in comment 6... and it's gone. I'm running more tests to make sure it's really gone for good. It looked like a race between the crash generation and the front-end code so either the race is still there but not triggered or it was solved (most likely in the front-end code). Either way if testing comes up clean I'll land this, it's about time. This is my latest try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=900578d15112851fecfe57dd53dde0a8e04dcc2f

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2b1ffc8004d5 Handle 64-bit addresses for EXC_BAD_ACCESS exceptions on Mac r=froydnj
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Status: RESOLVED → REOPENED
Flags: needinfo?(gsvelto)
Resolution: FIXED → ---

I have just tested on my Mac... and I don't reproduce, no matter how hard I try. I'll push this to try again but I'm wondering why this doesn't reproduce locally nor it happened on try before landing. Sigh. I'll never land this fix.

Flags: needinfo?(gsvelto)
No longer blocks: 1523276

I gave this another spin on try and the issue with gtests seems to have gone away... I'll try to land this again and hope that it sticks this time.

Nope, I was wrong, I was looking at the opt build but it's the debug one that was failing.

Blocks: 1707751
Blocks: 1707746

I did more investigation on this today, and spent a long time poring over Apple's kernel code as well as crashpad's implementation. Long story short, we need a lot more changes here compared to what we have in attachment 9052569 [details]. I'll work on this starting with next week.

Some results from my investigation: recent version of macOS deliver a lot more exception as signals and Chrome is mostly using these to catch crashes... maybe we should too? It's definitely easier than dealing with mach exceptions. That being said I finally figured out what's wrong with gtests: for some reason on mac we're calling XRE_mainInit() which sets the exception handler twice, so when we disable it only one of the two exception ports we had set up get removed and the following crash is directed to the second one. How did this ever work is beyond me but somehow this patch was needed to surface the problem.

I'm now trying to figure out how this is happening given that this most definitely does not happen on Linux.

Attachment #9052569 - Attachment description: Bug 1035892 - Handle 64-bit addresses for EXC_BAD_ACCESS exceptions on Mac → Bug 1035892 - Handle 64-bit addresses for EXC_BAD_ACCESS exceptions on Mac r=froydnj

Alright, I think I finally fixed the last issue with this. The behavior I saw in comment 16 was "normal" in the sense that we were forking the main process and that's supposed to happen (I'm not sure why it doesn't on Linux but I don't care). The actual issue was that with the changes in the patch the following chain of events happened:

  • The child process would crash, with crash reporting enabled (in the build) but disabled (at runtime)
  • Since the child process' exception handler was disabled the exception would be delivered to the main process as its exception handler had been fork()'d into the child (note, the exception was not delivered to the copy of the main process' exception handler in the child because the test disabled all exception handling)
  • The main process would realize that the exception was not its own and tried to forward it
  • Previously we would tell the macOS kernel to handle the exception on its own, as it was not meant for the main process, but the forked code tried to exit the main process - for some reason this didn't work
  • At this point the exception handler thread would hang waiting for a shutdown message that never came and the process would be stuck

The fix involved not trying to close the main process if we receive an exception not meant for it, but just ignore the exception and proceed as normal.

Let's see if this sticks.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a6185badd5c7 Handle 64-bit addresses for EXC_BAD_ACCESS exceptions on Mac r=froydnj
Status: REOPENED → RESOLVED
Closed: 5 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: mozilla70 → 91 Branch
Blocks: 1724215
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: