Windows requires stack to grow one page at a time

RESOLVED FIXED in Q2 11 - Wasabi

Status

defect
P2
normal
RESOLVED FIXED
9 years ago
8 years ago

People

(Reporter: wmaddox, Assigned: wmaddox)

Tracking

unspecified
Q2 11 - Wasabi
x86
All
Dependency tree / graph
Bug Flags:
in-testsuite -
flashplayer-injection -
flashplayer-qrb +
flashplayer-bug +

Details

(Whiteboard: WE:2801119,fixed-in-nanojit)

Attachments

(2 attachments, 4 obsolete attachments)

Windows uses a single guard page for extending the stack.  If allocating a frame larger than a single page, special care must be taken that each page is first written in sequential order in the direction of stack growth.

See http://support.microsoft.com/kb/100775 .

The JIT does not currently respect this requirement.
Posted patch Draft patch -- needs testing (obsolete) — Splinter Review
Insert a sequence of dummy writes to fault and commit new stack pages.  This should be faster for reasonably-sized frames than calling an out-of-line function like Microsoft's __chkstk() function, which supposedly just does something similar.
Assignee: nobody → wmaddox
Status: NEW → ASSIGNED
Whiteboard: WE:2801119
Target Milestone: --- → Q2 11 - Wasabi
Flags: in-testsuite-
Flags: flashplayer-qrb+
Flags: flashplayer-injection-
Flags: flashplayer-bug+
OS: Windows XP → All
Priority: -- → P2
Attachment #521722 - Attachment is obsolete: true
Last post was wrong version -- Updated.
Attachment #521992 - Attachment is obsolete: true
Attachment #521992 - Flags: review?(edwsmith)
Attachment #521994 - Flags: review?(edwsmith)
Comment on attachment 521994 [details] [diff] [review]
Patch for tr-spicy. Uses old i386 register names (EAX instead of rEAX, etc.).

Is it possible to add a lirasm test case for this?

It would probably be better to write a constant to the stack than the contents of EAX.  stretching, but: EAX could have a pointer, causing conservative retention.  or eax could somehow be attacker-controlled.  that reminds me, we do have an api for random bits that could be used, too.

nit: house style is braces on top right.

conditional R+: use VMPI_getVMPageSize() for the page size instead of hard coding 4096.
Attachment #521994 - Flags: review?(edwsmith) → review+
1) Uses VMPI_getVMPageSize() instead of hard-coded value.

2) A lirasm test case is provided.  It is principally useful for examining
the verbose disassembly output, but it can be used with the next feature as well.

3) Lirasm is extended to facilitate experimentation.  The --stkskip N switch causes Lirasm to push approximately N Kbytes of stack before beginning execution of any code fragments.  This is done by simply performing the execution at the end of a chain of recursions in which each recursive call pushes roughly 1Kbyte of memory.  (The method declares one local array 1K in length.  Frame overhead and spill slots are not accounted for.) With a sufficiently-large choice of N, one can provide that the first fragment will execute near (i.e., within a page) of the committed extent of the stack.  It's up to the user to choose an appropriate value for N, so as to keep Lirasm from needing to know the details of its own stack usage, including the invocation of the compiler proper.

4) There's a bit of refactoring in Lirasm to avoid duplicating code with the --stkskip option.  parseOptionalInt() handles options that behave similarly to --stkskip and --random.
Attachment #521991 - Attachment is obsolete: true
Attachment #521994 - Attachment is obsolete: true
Cool extension to lirasm.

Do we know whether other OSes require sequential page accesses for large stack allocations?
(In reply to comment #7)
> Cool extension to lirasm.
> 
> Do we know whether other OSes require sequential page accesses for large stack
> allocations?

Googling around a bit, it appears that this is widely recognized as a Windows issue, but I was unable to find a satisfactorily detailed and definititive description how Linux and MacOS manage their user-mode stack.

On both Linux and MacOSX, there is a fixed stack limit set in the executable, by a ulimit call (or shell command), or by default.  It is typically 8MB. Access outside of the reserved stack region produces a segfault exception.  In Windows, there is a similar reservation (1MB by default), but the space is mapped one page at a time on demand.  Empirically, I verified that on MacOSX, one can allocate a single frame sufficiently large as to extend into the last reserved page without failure.  It is not clear whether the OS is extending the mapped stack region dynamically, or whether the entire reserved area is mapped at the outset, though most likely committed lazily as the pages are touched.

By default, at least, GCC does not generate stack probes or anything similar in the function prologues on either MacOSX or Linux, so I suspect they work the same way.
In multi-threaded programs, threads are generally allocated a fixed-size stack, without the guarantee of an unallocated region following, as would be the case in the usual storage layout of a single-threaded process.  Guard pages are used in this case to detect stack overflow, so that overflow does not cause silent corruption.  There is an option in GCC to generate stack checks, including guard page probes.

If a nanojit host is expecting to use such mechanisms, then probes may be needed in the generated code.  The situation differs in Windows in that stack space that is nominally allocated is incrementally committed.  At present, it appears to be assumed that nanojit-generated code will not produce a stack overflow, or that the host will explicitly generate LIR to perform an overflow check as we do in Tamarin.

BTW, I question whether the check we are doing in Tamarin is adequate.  We compare the address of the method frame with the stack limit minstack, but the method frame is allocated prior to the variables.  The frame can extend beyond this point, limited only by NJ_MAX_STACK_ENTRY, e.g, 16KB on i386 or 32KB on SPARC.  I imagine the only reason we are getting away with this is that we allocate some slop with kStackMargin, which is woefully small on the embedded platforms.
Attachment #522065 - Flags: review?(edwsmith)
Attachment #522065 - Flags: review?(edwsmith) → review+
Unfortunately, the patch for nanojit-central relies on some changes made since the Spicy branch.  On i386, the old register name EBP is used instead of rEBP.
On X64, the MOVLMI opcode emitter is needed, to generate instructions of the
form 'movl disp(r), imm32'.  Support for immediate moves to memory was added recently.
Attachment #523501 - Flags: review?(rreitmai)
Attachment #523501 - Flags: review?(rreitmai) → review+
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
marking as verified.
Status: RESOLVED → VERIFIED
This bug should remain open until the change has propagated to tamarin-redux and tracemonkey.
Whiteboard: WE:2801119 → WE:2801119,fixed-in-nanojit
http://hg.mozilla.org/tracemonkey/rev/70dc21f4187f(In reply to comment #14)

> This bug should remain open until the change has propagated to tamarin-redux
> and tracemonkey.

Yes.  People haven't been following protocol lately.  I recently cleaned up https://developer.mozilla.org/en/NanojitMerge which describes the procedure, now might be a good time for everyone to refresh their memories.
Status: VERIFIED → RESOLVED
Closed: 9 years ago8 years ago
changeset: 6227:0c16b8944742
user:      William Maddox <wmaddox@adobe.com>
summary:   Bug 644900 - Generate probes for stack expansion when allocating large frames on Windows (r=edwsmith)

http://hg.mozilla.org/tamarin-redux/rev/0c16b8944742
Depends on: 656017
You need to log in before you can comment on or make changes to this bug.