Annotate crashes when the code in memory around the crashing instruction differs from the code in the shipped binary
Status: NEW · Unassigned · Priority: P3 · Severity: normal
(Reporter: dbaron, Unassigned)
In bug 1272750 we have a minidump that crashes because the code being executed is corrupt relative to what is in the binary we ship.  We know this, I believe, because the minidump contains some of the memory around the instruction pointer.

Crashes where the code around the instruction pointer doesn't match the code in the binary are clear signs of memory corruption, and they may account for a decent portion of our crashes.  It would be useful if they could be prominently marked in Socorro so that it's clear they're not worth investigating (perhaps with details showing the difference so that we can spot-check that the analysis is correct), and also so that we could get statistics on how common this problem is.

(This is perhaps an alternative to bug 1274428.  This is a much more focused heuristic for detecting the particular problem we saw there.)
Also note that this would only catch a subset of memory/disk corruption of the binary, since some corruption could lead to things like jumping elsewhere before crashing.
We'd have to exclude JIT-generated code.
See also bug 1251395, which would take this a bit further and actually add suspicious pages to the crash dump.
See Also: → bug 1251395
I don't see how that bug is related; this is about doing more with information we already have in minidumps.

Depends on: 1280442
Duplicate of this bug: 1280442
Some extra info that was from bug 1280442:

> It may be that this is mismatched because of:
> 
> * function hooking (from 3rd-party)
> * bad memory
> * bad disk
> 
> In reporting, if there are mismatches we should highlight the particular disassembly situation (both original
> and modified) and show whether this is a one-bit flip or something else.

And relocations will need to be accounted for, which is finicky but doable.
Extracting the memory around the crashing instruction from the minidump sounds straightforward.

Extracting the corresponding bytes from the shipped binary for matching sounds harder. We'd need to access the binary from somewhere (in bug 1280442 ted mentioned the Windows symbol server), and we'd need to do this on every(?) Windows crash. All that sounds slow...
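The comparison step itself is cheap once both byte ranges are in hand. A minimal sketch (all names hypothetical) that reports contiguous mismatch runs, in the spirit of the chkimg-style output shown later in this bug:

```python
def mismatch_runs(actual: bytes, expected: bytes, base_addr: int):
    """Compare two equal-length byte ranges and yield contiguous mismatch
    runs as (start_addr, end_addr, actual_bytes, expected_bytes), with the
    end address inclusive."""
    assert len(actual) == len(expected)
    start = None
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            if start is None:
                start = i
        elif start is not None:
            yield (base_addr + start, base_addr + i - 1,
                   actual[start:i], expected[start:i])
            start = None
    if start is not None:
        yield (base_addr + start, base_addr + len(actual) - 1,
               actual[start:], expected[start:])
```

The hard part remains obtaining `expected` (the bytes from the shipped binary), not the diff itself.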
Summary: annotate crashes where the code around the instruction pointer is not the code that's in the binary we ship → Annotate crashes when the code in memory around the crashing instruction differs from the code in the shipped binary
> Extracting the corresponding bytes from the shipped binary for matching
> sounds harder. We'd need to access the binary from somewhere (in bug 1280442
> ted mentioned the Windows symbol server), and we'd need to do this on
> every(?) Windows crash. All that sounds slow...

I talked with lonnen about this. He said the interface between Socorro and releng infrastructure isn't great at the moment, and so getting ahold of the right binaries, while possible, would require significant effort, and it's probably not something the Socorro team is likely to be able to help with in the near future :(
A couple of other notes, for future reference:

- Bug 1124241 and bug 1268029 were about integrating jit-crash-categorize, which is a similar sort of thing and indicates how this step might fit into Socorro's processing.

- froydnj thinks that adjusting for relocations shouldn't be necessary, because text (code) doesn't have relocations on Windows.
I don't think that finding the matching binaries would be hard at all. We don't need any releng infrastructure: the binaries are already stored as part of the symbol server and can be looked up by debug ID directly from the minidump without any other metadata.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #10)
> I don't think that finding the matching binaries would be hard at all. We
> don't need any releng infrastructure: the binaries are already stored as
> part of the symbol server and can be looked up by debug ID directly from the
> minidump without any other metadata.

Ok, good to know. This matching is something that we'd want to do on almost every crash, so we'd want to cache copies of the binaries on Socorro somehow.
Just to fill in some info here:
The "symbol server" just serves files from an s3 bucket where we upload symbols (both .sym format and the native format that the toolchain produces). For Windows builds we also upload the binaries, since Microsoft's debuggers support fetching symbols+binaries given just a minidump (very handy for debugging minidumps from random builds from crash-stats).

The stackwalk binary that Socorro uses for processing has code to fetch symbols via HTTP and cache them on local disk, so we already fetch the .sym files for processing that way.

I have a script that fetches Microsoft system library symbols from their symbol server, and as part of that for Win64 libraries it will also fetch the DLL from their symbol server, since on Win64 that's where the unwind tables live:
https://hg.mozilla.org/users/tmielczarek_mozilla.com/fetch-win32-symbols/file/52d03ad15bdf/symsrv-fetch.py#l121
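For reference, the lookup path that such scripts build follows the standard symstore layout: for a PE binary, the key is the header's TimeDateStamp followed by SizeOfImage in hex. A sketch (the exact casing convention is an assumption; server lookups are typically case-insensitive):

```python
def symstore_path(filename: str, timestamp: int, size_of_image: int) -> str:
    """Build the symbol-server relative path for a PE binary.
    The key is TimeDateStamp as 8 uppercase hex digits followed by
    SizeOfImage in hex with no padding, per the symstore convention."""
    key = "%08X%X" % (timestamp, size_of_image)
    return "%s/%s/%s" % (filename, key, filename)
```

(Compressed files on the server replace the last character of the filename with an underscore, e.g. `xul.dl_`, which is why a cabextract step is needed.)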

We'd need something like that to fetch the binaries during processing. That code shells out to `cabextract` to decompress the files (they're stored compressed on the symbol server), but I recently wrote a Rust implementation of the `makecab` program we use to compress them during the build, so it wouldn't be much work to write a cabextract-equivalent on top of that same codebase:
https://github.com/luser/rust-makecab

We would then have to have code to parse the PE binary format and locate the matching bytes of the text section given an address. There's lots of code floating around online to do that. Breakpad has some that uses Win32 APIs, and I know there's at least one Rust crate to parse PE. Handling relocations is a bit out of my range of knowledge, I assume we'd need a disassembler at that point? I've used capstone in the recent past with good results:
http://capstone-engine.org/
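The address-to-file-bytes step comes down to: subtract the module base to get an RVA, find the section containing it, and convert to a file offset. A sketch over a plain section table (field names follow the PE spec; a real implementation would read the table with a PE parser):

```python
from typing import NamedTuple, Optional

class Section(NamedTuple):
    virtual_address: int      # RVA where the section is mapped
    virtual_size: int
    pointer_to_raw_data: int  # file offset of the section's data
    size_of_raw_data: int

def rva_to_file_offset(sections, rva: int) -> Optional[int]:
    """Translate an RVA into an offset into the on-disk binary."""
    for s in sections:
        if s.virtual_address <= rva < s.virtual_address + s.virtual_size:
            off = rva - s.virtual_address
            if off < s.size_of_raw_data:  # beyond raw data is zero-fill
                return s.pointer_to_raw_data + off
            return None
    return None
```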
This came up in the context of bug 1475772, where Cameron used windbg's !chkimg command to prove that the crashes were the result of bad RAM. It would be awesome to have this automated (either during processing, or on-demand) so that developers can easily check this without getting minidump access, finding a windows machine, and doing all the steps.

Just chatted with Lonnen, WillK is going to think about how this could be done.
Flags: needinfo?(willkg)
Making this a P2 to look into.

If anyone has updates on the state of things and anything that's changed since comment #12, I'd be much obliged.
Flags: needinfo?(willkg)
Priority: -- → P2
For a weekend project, I wrote a command line tool that basically does what chkimg does.

https://github.com/heycam/chkimg

I really don't know how the crash report processor environment works so I don't know how easily that could be integrated.  But I could turn that tool into a library crate easily enough.
Wow, that's awesome - thanks Cameron!

That both (a) simplifies the process for somebody to analyze dumps without a Windows machine (though they still need minidump access), and (b) hopefully makes it much more straightforward for willkg to plug this analysis into Socorro.
This is awesome! I need to finish up some other things and then I can get to looking at this.

Couple of things off the top of my head:

1. Is it possible for the output from chkimg to contain user data or information that could be exploited?

2. It's probably the case we need to change chkimg to output in JSON so Socorro can read it. Probably makes sense to add an output file as an argument.

I'll write up some issues in the chkimg issue tracker when I spend more time on this.

Backing up a bit, I want to throw together a "cupcake" of this with minimal work so we can start seeing what it looks like performance-wise and also numbers. We can run this on -stage for a while to figure things out. Then if we decide that we want to move forward, we can fix performance issues, make the code maintainable, and move it forward to prod.
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #17)
> 1. Is it possible for the output from chkimg to contain user data or
> information that could be exploited?

I don't think the actual values are that interesting to most people, so a summary like "chkimg has detected an N-bit error in an instruction near the crash, which is likely caused by faulty RAM" would probably be enough.  We can provide the full output to the group that already has access to the minidump, should it be needed.

It would be awesome if we could also give this feedback to the crash submitter (when possible) and provide advice on how to run a memory-check tool, etc.
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #17)
> 1. Is it possible for the output from chkimg to contain user data or
> information that could be exploited?

As Mats says we probably just want to know whether there was corruption rather than what the specific differences were.  (The differences printed would only be bytes from one of our DLLs, assuming you specify only our own symbol server on the command line, so that's not user specific data.)

> 2. It's probably the case we need to change chkimg to output in JSON so
> Socorro can read it. Probably makes sense to add an output file as an
> argument.
> 
> I'll write up some issues in the chkimg issue tracker when I spend more time
> on this.

Please needinfo me if you'd like me to make some changes to the tool to help, or feel free to file a PR.

> Backing up a bit, I want to throw together a "cupcake" of this with minimal
> work so we can start seeing what it looks like performance-wise and also
> numbers. We can run this on -stage for a while to figure things out. Then if
> we decide that we want to move forward, we can fix performance issues, make
> the code maintainable, and move it forward to prod.

Thanks for looking at this, I think it's going to save developers a lot of time.
That said... if it's a single-bit difference, which is probably the most common case of differences, it seems like it doesn't contain any obvious private data, and it might be interesting in terms of understanding how the crash occurred (and might sometimes be useful in distinguishing disk corruption from memory corruption by seeing whether it's consistent across reboots).  It could be more problematic in terms of data leakage if large numbers of bits were different (e.g., if part of the binary has been overwritten with something else on the user's disk).  On the other hand, writing that code is extra complexity...
Duplicate of this bug: 1488880
If it doesn't add too much complexity, I'd suggest a count of the total number of mismatched bits across all the bytes that were checked. If a report shows a single-bit difference, we can with high confidence call it a hardware error and move on without even looking at the details. If the error is many bits, I would at least open the dump and check the page table flags on @ip to see if anyone had written to that region (injected code doing a hook or something).
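The suggested count is just an XOR and a popcount over the compared ranges; a minimal sketch:

```python
def flipped_bits(actual: bytes, expected: bytes) -> int:
    """Total number of mismatched bits between two equal-length byte
    ranges; a result of 1 strongly suggests a hardware bit flip."""
    return sum(bin(a ^ e).count("1") for a, e in zip(actual, expected))
```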
Running this on a bunch of crashes shows some 1-bit corruptions (yeah! kinda), and other more systematic changes which I suspect are something unrelated to either instruction trashing or bad RAM.  Relocation?  Hooking by some AV dll?  For example:

crashing IP: 0x7ff829b704c9
mismatch: 0x7ff8537ba040 .. 0x7ff8537ba04f (16 bytes) in ntdll.dll
  [ 48 b8 20 0a a3 8d f6 7f 00 00 ff e0 0d 00 00 00 ] should be [ 4c 8b d1 b8 0d 00 00 00 f6 04 25 08 03 fe 7f 01 ]

and the exact same "corruption" on multiple dumps for the same crash and same FF version, or

crashing IP: 0x612efe49
mismatch: 0x612efe34 .. 0x612efe35 (2 bytes) in xul.dll
  [ ac 63 ] should be [ c6 12 ]
mismatch: 0x612efe42 .. 0x612efe42 (1 byte) in xul.dll
  [ 0d ] should be [ 0f ]
mismatch: 0x612efff8 .. 0x612efff9 (2 bytes) in xul.dll
  [ 2f 61 ] should be [ 49 10 ]

or 

crashing IP: 0x7ffee98737e6
mismatch: 0x7ffeed330c32 .. 0x7ffeed330c35 (4 bytes) in xul.dll
  [ a2 ec fe 7f ] should be [ 53 84 01 00 ]
mismatch: 0x7ffeed330d02 .. 0x7ffeed330d05 (4 bytes) in xul.dll
  [ a2 ec fe 7f ] should be [ 53 84 01 00 ]
mismatch: 0x7ffeed330d72 .. 0x7ffeed330d75 (4 bytes) in xul.dll
  [ a9 ec fe 7f ] should be [ 5a 84 01 00 ]
mismatch: 0x7ffeed330d82 .. 0x7ffeed330d85 (4 bytes) in xul.dll
  [ 0c ed fe 7f ] should be [ bd 84 01 00 ]
[20 or 30 more like this]


or

crashing IP: 0x5a5603ab
mismatch: 0x5a560344 .. 0x5a560345 (2 bytes) in xul.dll
  [ 89 5c ] should be [ 7c 12 ]
mismatch: 0x5a5603a7 .. 0x5a5603a8 (2 bytes) in xul.dll
  [ 78 04 ] should be [ 6a 00 ]
mismatch: 0x5a5603aa .. 0x5a5603aa (1 byte) in xul.dll
  [ ff ] should be [ 01 ]
mismatch: 0x5a5603ac .. 0x5a5603ad (2 bytes) in xul.dll
  [ 94 33 ] should be [ 74 24 ]
mismatch: 0x5a5603af .. 0x5a5603b0 (2 bytes) in xul.dll
  [ 84 de ] should be [ 8b ce ]
mismatch: 0x5a5603b7 .. 0x5a5603b7 (1 byte) in xul.dll
  [ 1d ] should be [ f5 ]
mismatch: 0x5a5603b9 .. 0x5a5603b9 (1 byte) in xul.dll
  [ e8 ] should be [ ff ]
mismatch: 0x5a5603bc .. 0x5a5603bf (4 bytes) in xul.dll
  [ c3 d4 f8 40 ] should be [ 24 34 85 c0 ]
mismatch: 0x5e3baac9 .. 0x5e3baaca (2 bytes) in DWrite.dll
  [ 34 5e ] should be [ 4d 1e ]
mismatch: 0x5e3bab07 .. 0x5e3bab08 (2 bytes) in DWrite.dll
  [ 34 5e ] should be [ 4d 1e ]
mismatch: 0x5e3bab11 .. 0x5e3bab12 (2 bytes) in DWrite.dll
  [ 34 5e ] should be [ 4d 1e ]
mismatch: 0x5e3bab36 .. 0x5e3bab37 (2 bytes) in DWrite.dll
  [ 3f 5e ] should be [ 58 1e ]
mismatch: 0x5e3bab47 .. 0x5e3bab48 (2 bytes) in DWrite.dll
  [ 40 5e ] should be [ 59 1e ]
mismatch: 0x5e3bab6d .. 0x5e3bab6e (2 bytes) in DWrite.dll
  [ 34 5e ] should be [ 4d 1e ]
mismatch: 0x5e3bab78 .. 0x5e3bab79 (2 bytes) in DWrite.dll
  [ 34 5e ] should be [ 4d 1e ]
(In reply to David Major [:dmajor] from comment #22)
> If the error
> is many bits, I would at least open the dump and check the page table flags
> on @ip to see if anyone had written to that region (injected code doing a
> hook or something).

How do I check the page table flags?  Do these tools doing injection or whatever tend to just leave the page as writable after doing their dirty work?
Flags: needinfo?(dmajor)
(In reply to Cameron McCormack (:heycam) from comment #24)
> How do I check the page table flags?  Do these tools doing injection or
> whatever tend to just leave the page as writable after doing their dirty
> work?

Well, I was going to suggest the following, but it seems I'm wrong about this:

I thought that `!address @rip` would show a separate line when the current flags don't match the original allocation flags, so we could take advantage of the fact that writing to a DLL will drop its copy-on-write status:
Protect:                00000020          PAGE_EXECUTE_READ
Allocation Protect:     00000080          PAGE_EXECUTE_WRITECOPY

...but it looks like this is the case for untouched DLL pages too. :-( So you're right, this would only work if the injected code neglects to clean up after itself (which isn't unheard of, at least).

Caution: `!address` resets your exception context so you have to do `.ecxr` to get your state back.
Flags: needinfo?(dmajor)
But I guess you could `u` the mismatched bytes to see if they at least decode to valid instructions (especially a jmp or call).
Note that I just fixed a bug in chkimg which was resulting in a bunch of false positives -- I wasn't ignoring parts of the binary that were fixed up due to relocations!
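For the record, fixing this means masking out bytes covered by base relocations. The `.reloc` section is a sequence of blocks, each a page RVA and block size followed by 16-bit entries whose high 4 bits are the relocation type and low 12 bits the offset within the page; on x64, type `IMAGE_REL_BASED_DIR64` (10) covers 8 bytes. A minimal parsing sketch:

```python
import struct

IMAGE_REL_BASED_DIR64 = 10

def relocated_rvas(reloc_data: bytes):
    """Parse raw .reloc section data and return the set of RVAs whose
    bytes the loader may rewrite (and which chkimg should ignore)."""
    rvas, pos = set(), 0
    while pos + 8 <= len(reloc_data):
        page_rva, block_size = struct.unpack_from("<II", reloc_data, pos)
        if block_size < 8:
            break
        for off in range(pos + 8, pos + block_size, 2):
            (entry,) = struct.unpack_from("<H", reloc_data, off)
            kind, page_off = entry >> 12, entry & 0xFFF
            if kind == IMAGE_REL_BASED_DIR64:  # 64-bit absolute fixup
                rvas.update(range(page_rva + page_off,
                                  page_rva + page_off + 8))
        pos += block_size
    return rvas
```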
I pulled down the latest chkimg today and ran it on 100 crash reports from today for Windows for 62.0.3. Of those 100, 65 printed "no errors found" and the other 35 didn't. Most of them look like mismatches, but three look like this:

>>> Working on crashdata/v1/dump/57d50768-6d72-4a62-94c6-d19f10181003...
PE failure: ResolveMapError

>>> Working on crashdata/v1/dump/73f94540-d841-4905-b7c2-965ca0181003...
PE failure: ResolveMapError

>>> Working on crashdata/v1/dump/cecc6f7e-36b5-41a6-836c-2c4780181003...
PE failure: ResolveMapError


(The "Working on..." output is from my script.)

I have a few questions.

First, I sort of understand what this utility is doing, but not enough to verify correctness over a larger set of crashes. It doesn't look like there are tests. How do we verify that this is working correctly?

Second, PE stands for Portable Executable and I sort of understand what that is, but beyond that I don't know what a PE failure might be here.

Third, 32 out of the 100 crash reports I looked at have mismatches--does that seem plausible to you? Assuming chkimg is working correctly, does that suggest that 1/3 of the crashes I looked at are from computers with bad memory?
One related metric that I'd find useful here is to auto classify if the current instruction can even fault. I often see crashes on things like |cmp rax, rdx| which have been traced to bad memory when I've engaged with users. Probably way out of scope for this bug, but might be a future extension for chkimg.

When I last looked in-depth at JIT crashes I saw at least 1/3 of crashes that were clearly bit-flips. (The numbers were a bit skewed by known-bad CPU silicon.) Some of them manifested as bad data in the minidump, like this bug looks at, and some of them showed the correct machine code, but the observed fault only made sense if a single bit had been flipped in the crashing instruction, given the register and memory state.

This is interesting work, but needs more people interested in working on it to move it forward. For now, I'm going to bump it down to P3.

Priority: P2 → P3