Closed Bug 958796 Opened 11 years ago Closed 11 years ago

Users of _beginthreadex that set a custom stack size may not be getting the behavior they want

Categories

(NSPR :: NSPR, defect, P2)

All
Windows 7
defect

Tracking

(Not tracked)

RESOLVED FIXED
4.10.4

People

(Reporter: ehoogeveen, Assigned: ehoogeveen)

References

Details

Attachments

(2 files, 1 obsolete file)

In bug 930797 comment #66 I found that various users of _beginthreadex in the tree are setting the stack size for new threads without passing in the STACK_SIZE_PARAM_IS_A_RESERVATION flag. This flag is only documented for CreateThread [0], but _beginthreadex is simply a wrapper for this internal function. When STACK_SIZE_PARAM_IS_A_RESERVATION is *not* used, the stack size parameter of CreateThread and _beginthreadex will only set the *commit* size of the stack. The language used is confusing, but I think this means each thread can end up using more memory than anticipated - I honestly can't tell if they mean virtual or real memory, but either is bad in a 32-bit process. I made a small patch to add this flag to each _beginthreadex call in the tree that sets the stack size to a non-zero value and sent it to try. dmajor, could you test the following build to see if you still get 1MB of additional stack like in bug 930797 comment #56? Let me know if you can't do this test and I'll see if I can get the xperf stuff working. I have no experience with this though. https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/emanuel.hoogeveen@gmail.com-e624dc06954e/try-win32/ The patch isn't really ready for production as is, since it doesn't check for a stack size of 0 (it probably shouldn't use the flag in that case), and it should be split across the various components it touches. I haven't seen any problems browsing around with it for a while though. If we can confirm that this build has better memory or VM usage I'll get real patches up. A few further notes: 1) The Chromium code in ipc/chromium/src/base/platform_thread_win.cc and security/sandbox/base/threading/platform_thread_win.cc sets this flag when using CreateThread. This code also only passes the flag and sets the stack size if the OS is Windows XP or higher. Since Firefox no longer supports Windows 2k, that probably isn't required here. 2) There was also a bug filed about this on the Java VM: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6316878 [0] http://msdn.microsoft.com/en-us/library/windows/desktop/ms682453.aspx
Flags: needinfo?(dmajor)
> dmajor, could you test the following build to see if you still get 1MB of > additional stack like in bug 930797 comment #56? Let me know if you can't do > this test and I'll see if I can get the xperf stuff working. I have no > experience with this though. By default, try builds don't save the symbol files that xperf needs. Could you push another build with MOZ_CRASHREPORTER_UPLOAD_FULL_SYMBOLS=1? (https://wiki.mozilla.org/ReleaseEngineering/TryServer#Getting_debug_symbols)
Flags: needinfo?(dmajor)
Oops, thanks for the tip. New try build is here: https://tbpl.mozilla.org/?tree=Try&rev=96320bc611c4
Flags: needinfo?(dmajor)
Fascinating! With this change, the WoW64 thread stack now honors the same reserve size that you specify for the regular stack. (I went back and checked that this is not the case in bug 930797 just to make sure.) However, this is only true if you specify a non-default stack size. If you leave it at zero (which many callers still do because e.g. _PR_MD_CREATE_THREAD is called very indirectly) then you still get a 1MB reserve for WoW64.
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #4) > However, > this is only true if you specify a non-default stack size. If you leave it > at zero (which many callers still do because e.g. _PR_MD_CREATE_THREAD is > called very indirectly) then you still get a 1MB reserve for WoW64. Ah, that's good! There's no reason then to leave out the flag if zero is specified - Windows isn't stupid enough to actually try and give you 0 stack. Thanks for testing, I'll see about splitting this up into patches for each component for review (the changes themselves are tiny).
I should emphasize that *very* few callers are using non-default sizes. In a startup to wikipedia, I see exactly one 256K thread (xul.dll!mozilla::dom::DOMStorageDBThread::Init) and one 128K thread (xul.dll!mozilla::net::CacheIOThread::Init) versus 40 threads that defaulted to 1MB. So in practice we really won't see much savings unless we also change the size parameters. Let me know if you want data on who are the most frequent offenders.
I think bhackett ran into this in bug 943924, so the JS engine could get gain here. I also noticed that some of the cubeb backends (WASAPI and winmm) set the stack to 32kiB, so they could gain a lot. I think that's imported code, so that might have to be upstreamed first.
I think I've been getting these stack types backwards. The WOW64 stack (RtlpWow64CreateUserStack) is actually the 32-bit stack that regular application code runs on. The other one (RtlCreateUserStack) is the 64-bit stack for internal use. Confusing! Anyway, the flag only controls the behavior of the application stack. Here's how CacheIOThread::Init looks: Before: RtlpWow64CreateUserStack: reserve 1024k, commit 128k RtlCreateUserStack: reserve 256k, commit 28k After: RtlpWow64CreateUserStack: reserve 128k, commit 12k RtlCreateUserStack: reserve 256k, commit 28k
Ted, I would like to add this flag (STACK_SIZE_PARAM_IS_A_RESERVATION) in nsprpub/pr/src/md/windows/ntthread.c, but it seems that it was introduced in Windows XP and NSPR goes back a lot further than that. There are a few functions in NSPR that call GetVersionEx to get the Windows version, but the result of that should probably be cached somehow if I want to use it here. Could you advise on how to create a patch for NSPR? Firefox includes its own copy and doesn't support < Windows XP, but I'm not sure what the policy is here. For motivation see comment #0, and the results in comment #4 and comment #8.
Flags: needinfo?(ted)
If we will modify the version check logic, GetVersionEx should be replaced with VerifyVersionInfo. Microsoft deprecated the former.
The <video>/<audio> code sets the stack sizes of media decoding threads [1]. When we made this change it reduced our random orange rate especially on 32bit Linux which reserved 8MB stack sizes IIRC. Running the content/media mochitests on Windows, or otherwise starting up a large number of <audio>/<video> elements may be a good way to test your patch. [1] http://mxr.mozilla.org/mozilla-central/search?string=MEDIA_THREAD_STACK_SIZE
This should be fixed in cubeb with https://github.com/kinetiknz/cubeb/commit/ca908882adf65a25a7630ee6a0c4297b2ffe5546 which should get into Firefox shortly. (In reply to Chris Pearce (:cpearce) from comment #11) > Running the content/media > mochitests on Windows, or otherwise starting up a large number of > <audio>/<video> elements may be a good way to test your patch. Thanks! It looks like these eventually flow into NSPR, so a fix there should help. I'll look into creating a small test case for this.
I honestly don't know what NSPR's guarantees of platform support are. We'd need wtc to weigh in on that, unless there's documentation somewhere that I just haven't seen.
Flags: needinfo?(ted) → needinfo?(wtc)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #9) > Ted, I would like to add this flag (STACK_SIZE_PARAM_IS_A_RESERVATION) in > nsprpub/pr/src/md/windows/ntthread.c, but it seems that it was introduced in > Windows XP and NSPR goes back a lot further than that. NSPR supports Windows XP SP2 and later only. So it is OK to use this flag without a Windows version check. You can just create a patch for the copy of NSPR in Firefox. Ted or I will apply it to NSPR upstream. Switching to VerifyVersionInfo is a good idea, too.
Flags: needinfo?(wtc)
(In reply to Wan-Teh Chang from comment #14) > NSPR supports Windows XP SP2 and later only. So it is OK to > use this flag without a Windows version check. You can just > create a patch for the copy of NSPR in Firefox. Ted or I will > apply it to NSPR upstream. Excellent, that makes this much easier. I'm attaching a patch that adds the flag everywhere in NSPR where _beginthreadex is called with a non-default stack size - including two places in tests, for the sake of completeness. I was confused at first about w95thred versus ntthread, but it seems they're just different implementations of the same thing and Firefox uses WIN95, so it seemed appropriate to add the flag to both. I didn't have much luck testing the effects of the patch locally, so I turned to Talos instead: I see about a 2% improvement in tp5o_pbytes_paint and (for some reason) a11yr_paint [1] - everything else is in the noise (except for a spurious high value in tp5n_main_startup_netio_paint). Pretty good for adding a single flag :) Try also looks green (apart from infra failure) [2]. [1] http://compare-talos.mattn.ca/?oldRevs=210ac94a316c&newRev=4209c8db54be&submit=true&server=graphs.mozilla.org [2] https://tbpl.mozilla.org/?tree=Try&rev=4209c8db54be
Assignee: nobody → emanuel.hoogeveen
Status: NEW → ASSIGNED
Attachment #8365577 - Flags: review?(wtc)
Out of curiosity, why only pass the flag when the size is non-default?
There's no functional difference, and whenever the API allows setting the stack size, users can still pass in 0 to get the default (in which case the flag does nothing) - so it just seemed like increasing the size of the patch for no real reason. But perhaps consistently passing in the flag would alert future implementers to the problem? FWIW there are 3 uses of _beginthreadex in NSPR that always pass in 0 for the stack size parameter, and all of them are in tests.
The default is 1mb right? Seems like we could still get some savings by not committing most of that. (I'm thinking more generally, not just the NSPR tests) Or are you saying that the flag is ignored if size is default?
Yes, if you pass in 0 you get 1MiB of stack regardless as far as I know. The flag only changes the behavior if you pass in a non-default value.
I should clarify, I think when you pass in 0 Windows will *reserve* 1MiB regardless of the flag, but it probably commits as little as it thinks it can get away with. Based on your results in comment #8, it looks like passing in a non-zero value for the stack size causes Windows to commit *more* memory initially (and also allows the stack to grow to the nearest MiB, potentially reserving more address space). You could presumably pass in a very low value for the stack size (and omit the flag) to cause Windows to commit even less physical memory initially than it might normally, while allowing the stack to grow. But most users probably expect the stack size parameter to represent the maximum size the stack is allowed to have, and so will pass in a value that causes Windows (without the flag) to commit *more*. Even if you use the parameter to request a stack size bigger than 1MiB because you expect to use a lot of stack, you probably don't expect Windows to commit all of it right away - but if you don't use STACK_SIZE_PARAM_IS_A_RESERVATION, that's what I expect will happen :)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #20) > I should clarify, I think when you pass in 0 Windows will *reserve* 1MiB > regardless of the flag, but it probably commits as little as it thinks it > can get away with. You're right -- I took a trace and confirmed this.
Thank you for the patch. I assume you already verified that STACK_SIZE_PARAM_IS_A_RESERVATION can be passed to _beginthreadex even though it's not documented in the MSDN page for _beginthreadex. I also changed the other three _beginthreadex calls in NSPR test programs. Patch checked in: https://hg.mozilla.org/projects/nspr/rev/539942e2dd07
Attachment #8365577 - Attachment is obsolete: true
Attachment #8365577 - Flags: review?(wtc)
Attachment #8366763 - Flags: review+
Attachment #8366763 - Flags: checkin+
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #17) > There's no functional difference, and whenever the API allows setting the > stack size, users can still pass in 0 to get the default (in which case the > flag does nothing) - so it just seemed like increasing the size of the patch > for no real reason. But perhaps consistently passing in the flag would alert > future implementers to the problem? FWIW there are 3 uses of _beginthreadex > in NSPR that always pass in 0 for the stack size parameter, and all of them > are in tests. Ah, I just noticed that you intentionally skipped those three NSPR test programs. (I updated your patch without reading these comments). Should I revert my changes to those three NSPR test programs?
Thanks! Should I take any action (open a bug) to sync the in-tree version of NSPR with upstream? (In reply to Wan-Teh Chang from comment #22) > Thank you for the patch. I assume you already verified that > STACK_SIZE_PARAM_IS_A_RESERVATION can be passed to _beginthreadex > even though it's not documented in the MSDN page for _beginthreadex. Yes; the msvcrt source (included in Visual Studio 2010) shows that _beginthreadex calls CreateThread internally and passes in the initflag argument (called 'createflag' in the source code) without modifying it [1]. dmajor's experiments (e.g. comment #8) also show that this works as expected :) (In reply to Wan-Teh Chang from comment #23) > Ah, I just noticed that you intentionally skipped those three NSPR test > programs. (I updated your patch without reading these comments). Should I > revert my changes to those three NSPR test programs? No, that's fine. If anything it's more consistent, and it can't do any harm. [1] At $(VCInstallDir)VC\crt\src\threadex.c#127
Adding bug 962346 as a dependency for tracking and posterity.
Depends on: 962346
Emanuel: I will push a new NSPR update to mozilla-central within two weeks. I'll just use this bug. No need to open a new bug.
I went through the tree and made a list of functions that call either _beginthreadex or CreateThread and request a non-default stack size. Of these, the ones that don't pass in the flag (and probably should) are: 1) http://dxr.mozilla.org/mozilla-central/source/media/webrtc/trunk/webrtc/system_wrappers/source/thread_win.cc#77 2) http://dxr.mozilla.org/mozilla-central/source/tools/profiler/platform-win32.cc#252 3) http://dxr.mozilla.org/mozilla-central/source/other-licenses/7zstub/src/Windows/Thread.h#19 There's also http://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/windows/handler/exception_handler.cc#196 but given the parameter name, I think that one is intentional. Of the above three, 1) is fairly bad: it looks like they try to be explicit by requesting a stack size of 1MiB (which is the default anyway), but what this means is that Windows will initially commit 1 MiB of real RAM, even though the thread might never use that much stack. 2) and 3) appear to be harmless at present - even though their APIs use a stack size parameter, it is always 0 in practice. There's a lot more places where improper usage *might* sneak in if we ever start passing in a non-zero value, but if we can get the WebRTC one fixed we should be good for now (along with the NSPR and cubeb fixes).
Comment on attachment 8366763 [details] [diff] [review] Use STACK_SIZE_PARAM_IS_A_RESERVATION when setting stack size in NSPR, v2, by Emanuel Hoogeveen Patch pushed to mozilla-inbound as part of NSPR_4_10_4_BETA1: https://hg.mozilla.org/integration/mozilla-inbound/rev/ed9194eec73b
Looks like this caused a lot of test failures due to assertions: https://tbpl.mozilla.org/?tree=Mozilla-Inbound
Flags: needinfo?(wtc)
Flags: needinfo?(emanuel.hoogeveen)
The changes in this bug only affect Windows (_beginthreadex doesn't exist anywhere else), so these failures must be due to the fixes for bug 844784 and bug 939786 that were also present in the landing.
Flags: needinfo?(emanuel.hoogeveen)
Comment on attachment 8366763 [details] [diff] [review] Use STACK_SIZE_PARAM_IS_A_RESERVATION when setting stack size in NSPR, v2, by Emanuel Hoogeveen Patch pushed to mozilla-inbound as part of NSPR_4_10_4_BETA2: https://hg.mozilla.org/integration/mozilla-inbound/rev/ae20eb00cabf This time only this patch plus the NSPR version change was pushed.
Flags: needinfo?(wtc)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Component: General → NSPR
Flags: checkin+
OS: All → Windows 7
Priority: -- → P2
Product: Core → NSPR
Target Milestone: mozilla30 → 4.10.4
Version: Trunk → other
ehoogeveen, could you briefly summarize the effect of this change? It's not clear to me how it affects memory consumption in practice. Thanks!
I don't know if you can call this brief, but I hope it clears things up :) By default when you create a thread on Windows using either CreateThread or _beginthreadex, it will reserve 1MiB of address space but commit only as much RAM as needed (growing up to the 1MiB cap as necessary). A lot of users don't need 1MiB of stack, so some of them pass a non-zero value into the stack size parameter that these functions offer. However it turns out that by default, this overrides the *initial commit size* that Windows uses for the stack of the thread you created. It will still allow the stack to grow to the nearest multiple of 1MiB, but commit as much RAM as the amount you passed in. Since users intuitively expect the amount they pass in to represent an upper limit on the amount of stack the thread can use, this means that Windows will always overcommit, potentially by a lot. By passing in STACK_SIZE_PARAM_IS_A_RESERVATION you override the reserve size instead, getting the behavior users expect and allowing Windows to commit only as much memory as it needs to. To give a practical example: in bug 942984 bhackett gave JS worker threads a stack size of 512kiB, but this caused a tp5o private bytes regression that made him disable it on Windows in bug 943924. Given the above, what I think happened is that his change caused Windows to initially commit 512kiB of RAM for the stack of each worker thread, where before most worker threads probably never got close to needing that much. Now that NSPR is using STACK_SIZE_PARAM_IS_A_RESERVATION, backing out bug 943924 no longer regresses the private bytes measurement (I intend to file a bug to do this soon): http://compare-talos.mattn.ca/?oldRevs=208155fe186c&newRev=4ba604c8e19a&server=graphs.mozilla.org&submit=true What isn't entirely clear to me is how this affects the vsize of the process - I would expect asking Windows to reserve less address space to lower the vsize, but in my (somewhat clumsy) local testing I didn't see the kind of difference I expected. So it's possible that Windows doesn't actually reserve the full 1MiB initially either - or maybe there's some trickery going on with the WoW64 stack that wraps the 32-bit stack that the process sees.
Blocks: 969038
Thanks for the detailed explanation. The JS worker example is a nice one. Do you know how many threads will typically be affected, and thus how much memory will no longer be committed? I'm only looking for rough numbers :)
I don't really have a handle on the number of threads, but in comment #15 I saw about a 2% improvement to tp5o private bytes (from 155.4MiB to 151.8MiB on Windows XP, from 167.0MiB to 163.8MiB on Windows 7). I guess I'd have to add some logging code to NSPR and test it on the Talos workload to find out how many threads it uses.
Here's some more information: This is a list of all the (direct and indirect) callers of PR_CreateThread that pass a non-default stack size. There's some pretty interesting ones in there.
To note, the original bug that spawned this bug was bug 930797, where this was presumed to be a contributor to a user's OOM crashes via media decoding threads. It's possible that this will have a noticeable impact on the memory usage of users viewing HTML5 media (like YouTube in HTML5 mode).
So: multiple MBs, possibly more when media is involved. Nice work!
Yep :) Actually, the component that should benefit from this the most is DOM Workers, which (now correctly) reserve a stack size of 1MiB on 32-bit systems and 2MiB on 64-bit systems (but of course Win64 isn't a main focus yet).
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #40) > To note, the original bug that spawned this bug was bug 930797, where this > was presumed to be a contributor to a user's OOM crashes via media decoding > threads. It's possible that this will have a noticeable impact on the memory > usage of users viewing HTML5 media (like YouTube in HTML5 mode). Note that most (I think maybe 80%?) of those threads were spawned by system media libraries whose code we don't control. This patch ought to get us *some* wins in that scenario, but it isn't intended to fix the underlying problem.
(In reply to David Major [:dmajor] from comment #43) > Note that most (I think maybe 80%?) of those threads were spawned by system > media libraries whose code we don't control. This patch ought to get us > *some* wins in that scenario, but it isn't intended to fix the underlying > problem. Bug 963922 will help us by reducing the number of idle media elements' decoders we keep lying around.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: