Open Bug 974420 Opened 6 years ago Updated 7 days ago

Addresses >128TB are displayed as 0xffffffffffffffff on crash-stats

Categories

(Socorro :: Backend, task)

x86_64
Windows 7
task
Not set

Tracking

(Not tracked)

People

(Reporter: dmajor, Unassigned)

References

Details

Look at the "Crash Address" field in bp-9604e9ff-e608-466c-8e23-2d69b2140217 or the "Address" column in https://crash-stats.mozilla.com/report/list?product=Firefox&signature=nsSocketInputStream%3A%3AOnSocketReady%28tag_nsresult%29#tab-reports

The actual crash address is 0x5a5a5a5a5a5a5a5a which is the memory poisoning value on 64-bit. Showing the correct address would allow us to quickly identify crashes related to poisoning.
This is a breakpad bug not a Socorro bug, see the raw dump tab at https://crash-stats.mozilla.com/report/index/ed94c121-5c35-491a-8bda-095642140219

Crash|EXCEPTION_ACCESS_VIOLATION_READ|0xffffffffffffffff|5
Assignee: nobody → dmajor
Component: Webapp → Backend
Turns out that this value comes straight from the .dmp file. Even !analyze gets mislead:

ExceptionAddress: 00007ff8606fb9e1 (xul!nsSocketInputStream::OnSocketReady+0x00000000000000b5)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: ffffffffffffffff
Attempt to read from address ffffffffffffffff

PROCESS_NAME:  firefox.exe

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_PARAMETER1:  0000000000000000

EXCEPTION_PARAMETER2:  ffffffffffffffff

READ_ADDRESS:  ffffffffffffffff
You said that windbg knew that it was 0x5a5a5a5a5a5a5a5a though... how did it report that?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #3)
> You said that windbg knew that it was 0x5a5a5a5a5a5a5a5a though... how did
> it report that?

It was decoding the instruction at the faulting IP:

xul!nsSocketInputStream::OnSocketReady+0xb5:
00007ff8`606fb9e1 ff5010          call    qword ptr [rax+10h] ds:5a5a5a5a`5a5a5a6a=????????????????
This may be related to the 48-bit hardware limit.

I tampered with some registers of 64-bit calc.exe (to make sure Breakpad wasn't a factor) and it seems that anything over 0x00007fff`ffffffff gets reported as 0xffffffff`ffffffff.
I guess this isn't as bad as I first thought. Access violations on meaningful addresses still show up correctly on crash-stats. And because of the non-monospace font, I originally thought the site was showing only 32 bits worth of f's, but it's actually the full 0xffffffffffffffff.
Summary: Addresses >4GB are displayed as 0xffffffff on crash-stats → Addresses >128TB are displayed as 0xffffffffffffffff on crash-stats
Interesting. If we ever wrote our own minidump writer for Windows, we should fix this. Otherwise I feel like we should just mark it INCOMPLETE. We'd solve this problem for the poison address by poisoning using the frame-poisoning buffer which is an unmapped memory block but <128TB right?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #7)
> If we ever wrote our own minidump writer for Windows, we should fix this. 
I suspect the value is already that way in the Windows EXCEPTION_RECORD structure, so we'd have to get into the business of decoding instructions in order to "fix" this...

> We'd solve this problem for the poison address by poisoning using the
> frame-poisoning buffer which is an unmapped memory block but <128TB right?
Hmm, even the frame poisoning value is above that range:
0:000> ?? mozglue!gMozillaPoisonValue
unsigned int64 0x7fffffff`f0de7fff

Deliberately, it seems: http://mxr.mozilla.org/mozilla-central/source/mfbt/Poison.cpp#151
Hrm, given the data here I think we should fix that to do what we do on win32 and reserve a region that doesn't display this bug.
(In reply to comment #5)

Something very similar happens on OS X (which is of course now Intel-only) -- though there the kernel zeros out the crash address in that case.  See bug 1002564 comment #30.
See Also: → 1002564
It'd (presumably) be easy to work around this kind of problem (in Socorro data) if we added the "thread state" of the crashing thread to the data collected by Breakpad (and displayed by Socorro).  The "thread state" is basically the contents of all the user-level registers.

In the case of this bug, I'm sure we'd find that rax == 0x5a5a5a5a5a5a5a5a.
We do have the thread state in minidumps, it's just not displayed in Socorro. If you download a minidump and run minidump_stackwalk on it it will show you the registers it has for each frame.
Just noticed this is assigned to me. I haven't been actively working on it, so it's up for grabs.
Assignee: dmajor → nobody
Bug 1002564 comment #34 has even more explanation, it's due to how the x86_64 CPUs work.

Ted, is there anything we can do (in Breakpad or Socorro) to either get a valid address or make the right info available in the Socorro UI?

The more we get people to use 64bit builds, the more developers we get confused by those mangled addresses, see e.g. the recent discussion in bug 1232229.
Flags: needinfo?(ted)
See also bug 1018360, and particularly bug 1018360 comment #0.

The crashing thread's thread state is already available in the Socorro UI (under "raw dump", at the top).  That lets you work around this bug.  But to be sure that a particular register >= 0x7FFFFFFFFFFF is the one that was dereferenced, you'll need to look at the assembly code around where the crash happened.
The problem is that the raw dump tab isn't available without special permissions, not searchable, and that information can't be easily displayed as columns on lists of crashes in the UI.

Additionally, displaying a bogus value as the crash address confuses developers into thinking that's actually an address that was seen in the crash.
Per comment 2 I'm not really sure how to handle this more sensibly than Microsoft's own debuggers. :-(

I think we could potentially change our poison patterns to use canonical form, but I'm not sure of the impact of that.
Flags: needinfo?(ted)
(In reply to comment #16)

> the raw dump tab isn't available without special permissions

That surprises me.  I just looked at the "Raw Dump" tab for a recent bug report (bp-340079ed-2428-4875-a8ed-2042a2160213) and had no problem seeing it (including the thread state), even though I no longer work for Mozilla and (presumably) no longer have access to any privileged information from crash reports.  (I also no longer have a Socorro account and am not signed in.)

> not searchable

> displaying a bogus value as the crash address confuses developers into thinking that's actually
> an address that was seen in the crash

Good points.

It's easy enough to flag crash addresses of 0xffffffffffffffff.  But on OS X (and Linux?) the bogus value is zero/null, which (of course) can also be a legitimate value.  So it will be a lot harder to (automatically) flag this problem on those OSes, and maybe impossible.
(In reply to Steven Michaud [:smichaud] (Retired) from comment #18)
> That surprises me.  I just looked at the "Raw Dump" tab for a recent bug
> report (bp-340079ed-2428-4875-a8ed-2042a2160213) and had no problem seeing
> it (including the thread state), even though I no longer work for Mozilla
> and (presumably) no longer have access to any privileged information from
> crash reports.  (I also no longer have a Socorro account and am not signed
> in.)

I stand corrected. I see that you actually can access it without being signed in, I thought you couldn't, but good to know :)

> It's easy enough to flag crash addresses of 0xffffffffffffffff.  But on OS X
> (and Linux?) the bogus value is zero/null, which (of course) can also be a
> legitimate value.  So it will be a lot harder to (automatically) flag this
> problem on those OSes, and maybe impossible.

Even 0xffffffffffffffff is a "legitimate" value (-1) in some cases - and in the end, half of the address space ends up with this "magic" address and developers think they have a -1 somewhere (or a 0 on Mac/Linux).
If we can't fix / work around this on the Breakpad side, we should at least find a way to fix the confusion on the Socorro side and e.g. do not display a value for the "address" field at all or something like that.
After some discussion on IRC, I think our best course of action here would be to at least make sure that we're not intentionally inflicting this situation upon ourselves, by making our poison patterns into canonical pointers on x86-64. As dmajor pointed out in comment 4, currently the mfbt/Poison code intentionally chooses an address in the hardware-inaccessible region for x86-64. We could change that to map an inaccessible page like it does on other architectures, which should work around this problem. We could then use any value in that page as a poison pattern, setting the low 3 bytes to whatever.
There is a strong rationale for using addresses inside the hardware-inaccessible region for poisoning:

 * Nothing can ever make those addresses accessible.  Anything you can do with mmap and mprotect, in contrast, can be undone again, by accident or malice.

 * The hardware-inaccessible region is huge; the software-created poison region is only one page.  It's therefore much easier to escape the software region.  When I created this stuff, there weren't any frame-tree objects that were bigger than half a page, but it looks like it's getting used for a lot more stuff now.

As such, it's disappointing to hear that the hardware doesn't report precise fault locations for addresses inside the permanently inaccessible region, and I would encourage y'all to try to find a way to report such crashes that does not involve giving up the above advantages.
Zack: unfortunately this is broken at the hardware level. Accesses to memory addresses in that region are reported as general protection faults, and the address in question is not provided as a parameter.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #22)
> Zack: unfortunately this is broken at the hardware level. Accesses to memory
> addresses in that region are reported as general protection faults, and the
> address in question is not provided as a parameter.

Yes, that's what I said I was disappointed about.

I just think the direct benefits of continuing to use the inaccessible region outweigh the negative side effects (needing to find some sort of workaround in breakpad/socorro).
On the Breakpad side, could we find the register used by the crashing frame? I think someone said that's the value that should contain the correct address.
If the crash was caused by a register (containing a value > 0x7FFFFFFFFFFF) being dereferenced, you should be able to tell which register by looking at the assembly code around where the crash happened.  That register's value (plus those of all other user-level registers) will be visible in the crashing thread's thread state (at the top of Socorro's "raw dump").

Can this be automated?  Comment #3 and comment #4 seem to imply that windbg can do it.  If so, maybe other programs also can, and maybe some of them are open source (and can be used to add this automation capability to Socorro or Breakpad).

But before we go to all this trouble, let's just disassemble a few lines of code around where the crash happened and add this to what's displayed in Socorro's "raw dump" tab (presumably near the crashing thread's thread state).
(In reply to Steven Michaud [:smichaud] (Retired) from comment #25)
> But before we go to all this trouble, let's just disassemble a few lines of
> code around where the crash happened and add this to what's displayed in
> Socorro's "raw dump" tab (presumably near the crashing thread's thread
> state).

I thought ted had something in this direction. I would *love* this. It's enough of a nuisance right now to load up a minidump that I just won't bother for bugs that I'm not directly involved in. With a few lines of assembly, I can avoid a lot of unfounded speculation immediately.
Yes, I have:
http://hg.mozilla.org/users/tmielczarek_mozilla.com/get-minidump-instructions/

Which can disassemble the code where we crashed (using objdump) and also present an interleaved source+assembly view. We've had bug 977778 on getting this integrated into crash-stats for a while.
So! I sat down this week and tried something out, and I have a patch to fix this in the Breakpad processor:
https://github.com/luser/breakpad/commit/c8c0bbcdba178b74db171c21d2c6f4a59efd0131

It depends on another upstream patch that needs to get reviewed and landed first:
https://codereview.chromium.org/1821293002/

I'm using capstone, a disassembler library, to diassemble the instruction that crashed and look in the registers to find the actual value. I generated test minidumps on Windows/Mac/Linux with this test program and verified that they all work as expected (and actually added them to the Breakpad test suite):
https://github.com/luser/breakpad-non-canonical-pointer-crash/blob/master/test_crash.cpp
Assignee: nobody → ted
Blocks: 1289676
No longer blocks: 1289663
Getting the crash addresses completely correct would be ideal, but if we can at least indicate when a crash address is definitely or possibly wrong that alone would be a big help.
Assignee: ted → nobody
You need to log in before you can comment on or make changes to this bug.