Closed Bug 507294 Opened 15 years ago Closed 15 years ago

Validate frame-poisoning poison address on Windows and on secondary targets

Categories

(Core :: Layout, defect, P2)

RESOLVED DUPLICATE of bug 522088

People

(Reporter: zwol, Assigned: zwol)

Details

(Whiteboard: [sg:dupe 522088])

Attachments

(1 file, 3 obsolete files)

The "frame poisoning" work in bug 497495 uses the 4- or 8-byte pattern 0x(FFFF_FFFF_)FFDE_ADFF (depending on the size of a pointer) to overwrite freed frame-tree memory.  When interpreted as a pointer, this is believed to refer to a region of the address space that is always unmapped (or more precisely, reserved to the kernel) on all platforms of interest.  However, I have only verified that this is the case on i386 and amd64 Linux and Win32, and I'm not 100% sure about the latter two.
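To make the pattern above concrete, here is a minimal Python sketch that builds the 4- or 8-byte value from the pointer size and computes the page it points into. The function names and the 4 KiB page size are assumptions made for illustration; this is not code from nsPresArena.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


def poison_pattern(pointer_size: int) -> int:
    """Return 0xFFDEADFF, prefixed with 0xFFFFFFFF for 8-byte pointers."""
    if pointer_size == 4:
        return 0xFFDEADFF
    if pointer_size == 8:
        return (0xFFFFFFFF << 32) | 0xFFDEADFF
    raise ValueError("pointer size must be 4 or 8 bytes")


def page_base(address: int, page_size: int = PAGE_SIZE) -> int:
    """Round an address down to the start of its page."""
    return address & ~(page_size - 1)


if __name__ == "__main__":
    for size in (4, 8):
        value = poison_pattern(size)
        print(f"{size}-byte pointer: pattern {value:#x}, page {page_base(value):#x}")
```

The page base is the address that would have to be unmapped (or unmappable) for a dereference of the poison value to fault.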

Other platforms of interest -- at least OSX (PPC, i386, amd64), Win64, WinCE (various CPUs), and ARM Linux -- may put their always-unmapped address regions somewhere else, or may not even have them.  We need to investigate what is done on these platforms, and possibly modify nsPresArena to use a different poison pattern or even allocate and mprotect() its own unusable page or two.
How about using a non-zero value on page 0, such as 0x10F?
There's a detailed writeup of the frame poisoning stuff at the URL in the header now.  I have specific requests for platform experts to verify the "totally inaccessible" property of the currently-chosen poison address, or find a better address if the one we're using turns out to be accessible under some circumstances.

 * Most urgent is 32-bit Windows.  Despite what I said in the description, I have only marginal confidence in the address used on that platform.
 * ARM Linux and WinCE for various mobile platforms are second priority.
 * Confirming assumptions about other Linux architectures (32- and 64-bit non-x86, non-ARM) would also be of interest but is low priority.

Jesse: a non-zero value on page 0 would still be vulnerable to OS bugs that make page 0 accessible, and would be too hard to tell apart from an offset dereference of the null pointer in crash logs.
Summary: Determine guaranteed-unmapped address regions for frame poisoning → Validate frame-poisoning poison address on Windows and on secondary targets
Flags: blocking1.9.2?
This program uses mlock() and mmap() to find regions of the address space that the OS doesn't allow programs to map.  This is *not* the same thing as "totally inaccessible" in the frame poisoning writeup - my amd64 Linux box has a Very Special chunk of code at address 0xFFFF_FFFF_FF60_0000 which is readable and executable by user space, but cannot be written to, mapped, or unmapped.  But it gives us a starting point.

WARNING: If you run it as is with 64-bit pointers, it will take several hundred years to finish.  No joke.

Sample output (32-on-64, Linux) looks like this:

0000000000000000-0000000000000fff inaccessible
0000000000001000-0000000008047fff usable
0000000008048000-0000000008049fff mapped
000000000804a000-0000000043c57fff usable
0000000043c58000-0000000043c78fff mapped
0000000043c79000-0000000043e96fff usable
0000000043e97000-0000000043febfff mapped
0000000043fec000-00000000f7fc4fff usable
00000000f7fc5000-00000000f7fc5fff mapped
00000000f7fc6000-00000000f7fddfff usable
00000000f7fde000-00000000f7fe1fff mapped
00000000f7fe2000-00000000ff954fff usable
00000000ff955000-00000000ff969fff mapped
00000000ff96a000-00000000ffffdfff usable
00000000ffffe000-00000000ffffffff inaccessible

"mapped" regions already have something in them (as determined by trying to mlock() the page); "usable" regions can be assigned to anonymous memory with mmap(); "inaccessible" regions fail both mlock() and mmap().

A version of this program that used Windows kernel primitives instead of Unix primitives would be a good start on resolving the "where to put the poison address on Windows" question.
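A report in this format is easy to check mechanically. The sketch below is an illustrative helper (its names are hypothetical, not part of the attached program) that parses report lines and looks up which kind of region a candidate address falls in. It embeds only the last four lines of the 32-on-64 Linux sample above. On that sample, the 32-bit pattern 0xFFDEADFF from the description lands in a "usable" range, meaning the process itself could map that page.

```python
import re

# Last four lines of the 32-on-64 Linux sample report quoted above.
SAMPLE_REPORT = """\
00000000f7fe2000-00000000ff954fff usable
00000000ff955000-00000000ff969fff mapped
00000000ff96a000-00000000ffffdfff usable
00000000ffffe000-00000000ffffffff inaccessible
"""

LINE_RE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\w+)$")


def parse_report(text):
    """Turn 'start-end state' lines into (start, end, state) tuples."""
    regions = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            regions.append((start, end, match.group(3)))
    return regions


def classify(address, regions):
    """Return the state of the region containing address, or None."""
    for start, end, state in regions:
        if start <= address <= end:
            return state
    return None


if __name__ == "__main__":
    regions = parse_report(SAMPLE_REPORT)
    print(hex(0xFFDEADFF), classify(0xFFDEADFF, regions))
```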
Here's the report on OSX 10.5 (Intel, 32-bit binary):

0000000000000000-0000000000007fff mapped
0000000000008000-00000000000fffff usable
0000000000100000-00000000001fffff mapped
0000000000200000-00000000007fffff usable
0000000000800000-0000000000ffffff mapped
0000000001000000-000000008fdfffff usable
000000008fe00000-000000008fe74fff mapped
000000008fe75000-000000008fffffff usable
0000000090000000-0000000097b49fff mapped
0000000097b4a000-000000009fffffff usable
00000000a0000000-00000000a08dcfff mapped
00000000a08dd000-00000000a09fffff usable
00000000a0a00000-00000000a0aa7fff mapped
00000000a0aa8000-00000000bbffffff usable
00000000bc000000-00000000bfffffff mapped
00000000c0000000-00000000ffdfffff usable
00000000ffe00000-00000000ffffffff mapped

As you can see there are no "inaccessible" regions *at all*, although I would not be surprised if some of those "mapped" areas cannot be touched by CPU instructions.  "mapped" just means that mlock() succeeded on the page.
I'm gonna do 10.6 as soon as I manage to do the upgrade; I would very much like to know what the results are on 10.4 (both Intel and PPC) but I don't have anything that old.
Attached file Program results on 10.4 (obsolete)
(Intel)
Interesting that it has so many segments.  And *something* happens at 0xC0000000.  I think I'm not inclined to investigate much closer, though - we know we have to fabricate an inaccessible address for 10.5 and up, and 10.4 is of dwindling significance.
Maybe Jim can help on Windows?
Flags: blocking1.9.2? → blocking1.9.2+
Priority: -- → P2
Assignee: nobody → zweinberg
(In reply to comment #9)
> Maybe Jim can help on Windows?

On win32, using a system info call  - 

Lowest   : 0x0000000000010000
Highest  : 0x000000007FFEFFFF

These are the lowest and highest application address I can allocate at.

Not sure if that's what you're looking for. I can go in and probe ranges as well if need be.
Would someone mind cc'ing me in on bug 497495?
Group: core-security
(In reply to comment #10)
> (In reply to comment #9)
> > Maybe Jim can help on Windows?
> 
> On win32, using a system info call  - 
> 
> Lowest   : 0x0000000000010000
> Highest  : 0x000000007FFEFFFF
> 
> These are the lowest and highest application address I can allocate at.
> 
> Not sure if that's what you're looking for. I can go in and probe ranges as
> well if need be.

That's good information.  I'd like to know how you got those numbers, and I'd also like to know if you know of any circumstances where the OS will hand out a (possibly read-only, or execute-only) pointer outside that range.
(In reply to comment #12)
> (In reply to comment #10)
> > (In reply to comment #9)
> > > Maybe Jim can help on Windows?
> > 
> > On win32, using a system info call  - 
> > 
> > Lowest   : 0x0000000000010000
> > Highest  : 0x000000007FFEFFFF
> > 
> > These are the lowest and highest application address I can allocate at.
> > 
> > Not sure if that's what you're looking for. I can go in and probe ranges as
> > well if need be.
> 
> That's good information.  I'd like to know how you got those numbers, ..

Those numbers come from a call to GetSystemInfo:

http://msdn.microsoft.com/en-us/library/ms724381%28VS.85%29.aspx

0x80000000 or 0xC0000000 are the general boundaries between user and system memory.

> also like to know if you know of any circumstances where the OS will hand out
> a (possibly read-only, or execute-only) pointer outside that range.

The upper limit is absolute: user-mode apps can't access anything above it, except maybe through a security flaw in a driver. :) There are probably some debugging APIs that allow you to look at that memory as well. Nothing conventional, though.
Ok, then I think we can trust the poison address currently in use (starts 0xF0....) on Windows generally.  If someone manages to persuade the browser to call debugging APIs on itself, we've already lost.

That means we're covered on all primary platforms modulo bug 522088.  I'd like to know if WoW64 needs to be included in the work in that bug, but I don't yet have that environment - Jim, could you possibly post a comparison of GetSystemInfo vs GetNativeSystemInfo for a WoW64 process?

For secondary platforms (especially thinking of mobile) I think what I'm going to do is write a "make check" test that runs all the setup code that the browser will, then tries to access the selected poison region and fails the test if it can.
(In reply to comment #14)
> Ok, then I think we can trust the poison address currently in use (starts
> 0xF0....) on Windows generally.  If someone manages to persuade the browser to
> call debugging APIs on itself, we've already lost.
> 
> That means we're covered on all primary platforms modulo bug 522088.  I'd like
> to know if WoW64 needs to be included in the work in that bug, but I don't yet
> have that environment - Jim, could you possibly post a comparison of
> GetSystemInfo vs GetNativeSystemInfo for a WoW64 process?
> 
> For secondary platforms (especially thinking of mobile) I think what I'm going
> to do is write a "make check" test that runs all the setup code that the
> browser will, then tries to access the selected poison region and fails the
> test if it can.

Unfortunately, I'm not running 64-bit. Rob Arnold might be.
I am running 64 bit Windows 7. By WoW64, do you mean a 32 bit process or a 64 bit process? The 32 bit process should have the same address space limitations but I'll double check.
(In reply to comment #16)
> I am running 64 bit Windows 7. By WoW64, do you mean a 32 bit process or a 64
> bit process? The 32 bit process should have the same address space limitations
> but I'll double check.

I'd like to see both, but the 32-bit process on 64-bit kernel case is more interesting.
Attached file beginnings of poison test program (obsolete)
I was hoping to be a little further along with this program now, but it's noon and I'm starving, so I'm going to post it for comments at this point.

The reserve_poison_area() function is a first approximation to the logic that will be added to the main browser in bug 522088.  This program is intended to be a standalone unit test, and in its full form, it's going to try to access the area that reserve_poison_area sets up in various different ways and make sure it gets crashed.  (That logic will *not* go into the main browser. ;-)

I'd like people to poke at this particularly on exotic Unix (i.e. neither Linux nor OSX) and on Windows.  The Windows code in here has not been compiled at all, as my Windows install destroyed itself earlier this week and I have too many other demands on my time to fix it right now. :-/  Hopefully, though, if it is broken, it's just a typo or two.
I should add that the behavior of the program on all three of:

 - native Win32
 - Win32-on-Win64
 - native Win64

is of interest.
Whiteboard: [sg:investigate]
We have 64-bit Windows on a machine in the office; Chris P can probably help here.
Win32-on-Win64:
Min: 0x00010000
Max: 0x7FFEFFFF
It reports the native system info as
Min: 0x00010000
Max: 0xFFFEFFFF
though since the structure is the same, the upper 32 bits are being cut off.
Native 64 bit:
Min: 0x0000000000010000
Max: 0x000007FFFFFEFFFF
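The arithmetic behind the "upper 32 bits are being cut off" remark can be checked directly. This small Python sketch (names are illustrative) masks the native 64-bit maximum to 32 bits and compares it with the value the WoW64 process saw. A match only shows the numbers are consistent with that explanation; it does not prove how Windows computes the value.

```python
# Values reported in this comment.
NATIVE_64_MAX = 0x000007FFFFFEFFFF      # native 64-bit process
WOW64_NATIVE_MAX = 0xFFFEFFFF           # native system info seen from Win32-on-Win64


def truncate_to_32_bits(value: int) -> int:
    """Keep only the low 32 bits, as a 32-bit pointer field would."""
    return value & 0xFFFFFFFF


if __name__ == "__main__":
    truncated = truncate_to_32_bits(NATIVE_64_MAX)
    print(f"{truncated:#010x}", truncated == WOW64_NATIVE_MAX)
```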
I should note that the server editions of Windows may have different limits set for 64 bit processes. They certainly have higher limits for the physical memory. Someone with MSDN access and some spare time should be able to confirm this.

I don't think we officially support the server editions so perhaps this isn't an important issue.
This program finally does all that I need it to do for a thorough test of frame poisoning's assumptions.  The ReservePoisonArea() function is, I hope, what's going to be incorporated into the browser core in bug 522088, and I intend this to become a compiled unit test.  I've confirmed that it behaves as I expect on 32- and 64-bit Linux, 32- and 64-bit OSX (10.5), and 32-bit Windows.

I don't have a 64-bit Windows installation.  I would really appreciate it if someone could give this program a try in that environment, compiled both as 32-bit and as 64-bit.  Its output is in reftest/mochitest format, so if you get nothing but INFO and TEST-PASS, you're good.  If there are any ERRORs or UNEXPECTED-FAILs, please try changing GetSystemInfo() to GetNativeSystemInfo() in main() and see if that helps.
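For anyone collecting output from several machines, the "nothing but INFO and TEST-PASS" rule can be applied with a few lines of Python. This is a hypothetical helper, not part of the attached program. It treats the text before the first "|" as the line's status and flags anything else, so it makes no assumption about how failure lines are spelled.

```python
ALLOWED_STATUSES = {"INFO", "TEST-PASS"}


def status_of(line: str) -> str:
    """Return the text before the first '|' with whitespace stripped."""
    return line.split("|", 1)[0].strip()


def unexpected_lines(output: str):
    """Return every non-blank line whose status is not INFO or TEST-PASS."""
    return [
        line
        for line in output.splitlines()
        if line.strip() and status_of(line) not in ALLOWED_STATUSES
    ]


if __name__ == "__main__":
    import sys

    bad = unexpected_lines(sys.stdin.read())
    for line in bad:
        print("unexpected:", line)
    sys.exit(1 if bad else 0)
```

Piping the test program's output into this script gives a nonzero exit status whenever a line other than INFO or TEST-PASS appears.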
Attachment #406159 - Attachment is obsolete: true
Attachment #406288 - Attachment is obsolete: true
Attachment #410840 - Attachment is obsolete: true
Attachment #411565 - Attachment mime type: application/octet-stream → text/plain
(In reply to comment #23)
> I don't have a 64-bit Windows installation.  I would really appreciate it if
> someone could give this program a try in that environment, compiled both as
> 32-bit and as 64-bit. 

Compiled as X64 on Windows Vista 64bit, did not fail:

INFO | negative control allocated at 0x0000000000020000
INFO | positive control allocated at 0x0000000000150000
INFO | poison area assumed at 0x7ffffffff0de0000
TEST-PASS | reading negative control
TEST-PASS | executing negative control
TEST-PASS | writing negative control
TEST-PASS | reading positive control | exception code c0000005
TEST-PASS | executing positive control | exception code c0000005
TEST-PASS | writing positive control | exception code c0000005
TEST-PASS | reading poison area | exception code c0000005
TEST-PASS | executing poison area | exception code c0000005
TEST-PASS | writing poison area | exception code c0000005

There were exceptions in the Visual Studio output log when control reached line 493 ("failed |= TestPage("poison area", poison, 0);"). It logged the following errors, but the program didn't seem to crash.

First-chance exception at 0x00000001400017f5 in create-poison.exe: 0xC0000005: Access violation reading location 0x0000000000157fff.
First-chance exception at 0x00158000 in create-poison.exe: 0xC0000005: Access violation at location 0x0000000000158000.
First-chance exception at 0x000000014000182f in create-poison.exe: 0xC0000005: Access violation writing location 0x0000000000157fff.
First-chance exception at 0x00000001400017f5 in create-poison.exe: 0xC0000005: Access violation reading location 0xffffffffffffffff.
First-chance exception at 0x0000000140001811 in create-poison.exe: 0xC0000005: Access violation reading location 0xffffffffffffffff.
First-chance exception at 0x000000014000182f in create-poison.exe: 0xC0000005: Access violation reading location 0xffffffffffffffff.

For regular win32 I had to manually define "RETURN_INSTR 0xC3C3C3C3", as it wouldn't compile otherwise. I hope that's correct.

Compiled as win32/x86 app running on Vista64bit, did not fail:

INFO | negative control allocated at 0x00020000
INFO | positive control allocated at 0x00030000
INFO | poison area probed at 0xf0de0000 | Attempt to access invalid address.
TEST-PASS | reading negative control
TEST-PASS | executing negative control
TEST-PASS | writing negative control
TEST-PASS | reading positive control | exception code c0000005
TEST-PASS | executing positive control | exception code c0000005
TEST-PASS | writing positive control | exception code c0000005
TEST-PASS | reading poison area | exception code c0000005
TEST-PASS | executing poison area | exception code c0000005
TEST-PASS | writing poison area | exception code c0000005

Again there were exceptions in the Visual Studio output log when control ran line 493:

First-chance exception at 0x00411d25 in create-poison.exe: 0xC0000005: Access violation reading location 0x00037fff.
First-chance exception at 0x00038000 in create-poison.exe: 0xC0000005: Access violation reading location 0x00038000.
First-chance exception at 0x00411d69 in create-poison.exe: 0xC0000005: Access violation writing location 0x00037fff.
First-chance exception at 0x00411d25 in create-poison.exe: 0xC0000005: Access violation reading location 0xf0de7fff.
First-chance exception at 0xf0de8000 in create-poison.exe: 0xC0000005: Access violation reading location 0xf0de8000.
First-chance exception at 0x00411d69 in create-poison.exe: 0xC0000005: Access violation writing location 0xf0de7fff.

Note those exceptions don't appear when you run this program from the command line, they only appear in the VS output log. I'm not sure if they're a problem or not...
(In reply to comment #24)
> First-chance exception at 0x00000001400017f5 in create-poison.exe: 0xC0000005:
> Access violation reading location 0x0000000000157fff.
> First-chance exception at 0x00158000 in create-poison.exe: 0xC0000005: Access
> violation at location 0x0000000000158000.
[...]

Ah, this is just VS logging exceptions that are thrown. Neat.
(In reply to comment #24)

This all looks good.

...
> There were exceptions in the Visual Studio output log when control ran line 493
> ("failed |= TestPage("poison area", poison, 0);") it logged the following
> errors, but didn't seem to crash.

The program is deliberately doing things that should provoke exceptions, and using SEH to detect those exceptions and recover control.  I imagine Visual Studio is using the debugger interface to report those exceptions as they go by.  The trace looks consistent with what the program is doing, except:

> First-chance exception at 0x00000001400017f5 in create-poison.exe: 0xC0000005:
> Access violation reading location 0xffffffffffffffff.
> First-chance exception at 0x0000000140001811 in create-poison.exe: 0xC0000005:
> Access violation reading location 0xffffffffffffffff.
> First-chance exception at 0x000000014000182f in create-poison.exe: 0xC0000005:
> Access violation reading location 0xffffffffffffffff.

it's not reporting the nature and location of the fault correctly, with a 64-bit process using the hardware-inaccessible region.  This is not a fatal flaw, it just means our crash reports for native 64-bit processes on Windows may be less informative than they could be.  I would describe this as a bug in Windows.

> For regular win32 I had to manually define "RETURN_INSTR 0xC3C3C3C3", as it
> wouldn't compile otherwise. I hope that's correct.

It is supposed to figure that out from the _M_ defines.  In my own testing, I had to do /D_M_I386 when I was running cl on the command line, but didn't need to do that when I was using a proper VS project.  For incorporation into our own builds, I *hope* the Makefiles do the right thing.

> Compiled as win32/x86 app running on Vista64bit, did not fail:
> 
> INFO | negative control allocated at 0x00020000
> INFO | positive control allocated at 0x00030000
> INFO | poison area probed at 0xf0de0000 | Attempt to access invalid address.
> TEST-PASS | reading negative control
> TEST-PASS | executing negative control
> TEST-PASS | writing negative control
> TEST-PASS | reading positive control | exception code c0000005
> TEST-PASS | executing positive control | exception code c0000005
> TEST-PASS | writing positive control | exception code c0000005
> TEST-PASS | reading poison area | exception code c0000005
> TEST-PASS | executing poison area | exception code c0000005
> TEST-PASS | writing poison area | exception code c0000005

Looks good here.  Crucially, it didn't let it *allocate* the poison area at 0xf0de0000, which is (per comment #21) outside the application address space range that GetSystemInfo reports, but inside the range reported by GetNativeSystemInfo.  I couldn't figure out from MSDN which of those to use.
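The range comparison in the previous paragraph can be written out as a short check. The sketch below uses only numbers quoted in this bug (the 0xf0de0000 area from the test output and the Win32-on-Win64 limits from comment #21); the names are illustrative.

```python
POISON_AREA = 0xF0DE0000

# Application address ranges reported for a Win32-on-Win64 process.
GET_SYSTEM_INFO_RANGE = (0x00010000, 0x7FFEFFFF)
GET_NATIVE_SYSTEM_INFO_RANGE = (0x00010000, 0xFFFEFFFF)


def in_range(address: int, bounds) -> bool:
    """True if address lies within the inclusive (lowest, highest) bounds."""
    lowest, highest = bounds
    return lowest <= address <= highest


if __name__ == "__main__":
    print("GetSystemInfo range:", in_range(POISON_AREA, GET_SYSTEM_INFO_RANGE))
    print("GetNativeSystemInfo range:", in_range(POISON_AREA, GET_NATIVE_SYSTEM_INFO_RANGE))
```

The first check is false and the second is true, which is why the choice between the two calls matters for deciding whether the poison area can be handed out to the application.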
> The program is deliberately doing things that should provoke exceptions, and
> using SEH to detect those exceptions and recover control.

When this lands, would we expect the crash rate to go even higher than what we see in bug 526587?

Will it make the debugging of these or other crashes easier or harder?

If crash rates go higher, and/or debugging of the crashes gets easier or harder, I'd argue for a longer beta period to flush out the impact before we ship the change to a lot of people.
(In reply to comment #27)
> > The program is deliberately doing things that should provoke exceptions, and
> > using SEH to detect those exceptions and recover control.

It is maybe worth emphasizing that "the program" there (included in the patch in 522088, btw) is a unit test.  The browser will not do any such thing.

> when this lands would we expect the crash rate to go even higher than what we
> see in bug 526587?
> 
> will it make the debugging of these or other crashes easier or harder?

I expect this to have little or no practical impact on anything, frankly.  It is another layer of defensiveness, ensuring that the address we use as a poison value cannot become the address of any other piece of data.  There are only a few OS/CPU combinations where this could happen in the first place:

 - 32-bit browser binary on OSX (any current version)
 - 32-bit browser binary on 64-bit Linux kernel
 - possibly some nonstandard Windows configurations (server mode was mentioned)

and the odds of "other data" being allocated to the 0xF0DEA000 page were pretty slim, anyway (the main malloc() heap, which is where most of our allocations come from, is nowhere near there).
Seems to me that the test here is in bug 522088, so we should just resolve this as FIXED since we've checked Windows and it's OK. If other platforms have problems the test in bug 522088 will catch them. Does that sound right, Zack?
(In reply to comment #29)
> Seems to me that the test here is in bug 522088, so we should just resolve this
> as FIXED since we've checked Windows and it's OK. If other platforms have
> problems the test in bug 522088 will catch them. Does that sound right, Zack?

Yes, my intent was to close this when I landed bug 522088.
http://hg.mozilla.org/mozilla-central/rev/3173494c8bdb
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → DUPLICATE
Group: core-security
Whiteboard: [sg:investigate] → [sg:dupe 522088]
Product: Core → Core Graveyard
Component: Layout: Misc Code → Layout
Product: Core Graveyard → Core