Bug 867779 (Closed)
Opened 12 years ago • Closed 12 years ago
Linux: Use /proc/meminfo CommitLimit to define an upper-bound to GC allocations.
Categories: Core :: JavaScript Engine, defect
Status: RESOLVED INVALID
People: Reporter: nbp; Assigned: nbp
+++ This bug was initially created as a clone of Bug #863398 +++
The optimal GC settings found in Bug 863398 are suggesting that a GC might grow by 300% if the heap size is above 40 MB. Doing so might allocate more virtual memory than is available to the process, which will cause the process to be killed by the Linux kernel once that memory is committed.
The goal of this bug is to use the CommitLimit [1] reported in /proc/meminfo and allow the process to reserve at most some percentage of it (~75%). In the case of the JS engine, we want to prevent the over-allocation case and trigger more GCs, and ultimately stop script execution, instead of getting the application killed by the kernel.
The idea is to use the CommitLimit when the mem.max preference is set to a negative (or invalid) number.
Sadly, this information does not seem to be available through any syscall, as it is computed on the fly when the file content is generated [2], which means that the most reliable way to get it is to parse /proc/meminfo.
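A minimal sketch of such parsing, with a hypothetical helper name (this is not an existing Gecko function). Per the kernel documentation, CommitLimit is roughly swap + physical RAM * overcommit_ratio / 100, and /proc/meminfo reports it in kB:

#include <stdio.h>
#include <string.h>

// Hypothetical helper: scan /proc/meminfo for the CommitLimit field and
// return its value in kB, or 0 if it cannot be found or parsed.
static unsigned long
GetCommitLimitKB()
{
    FILE* fp = fopen("/proc/meminfo", "r");
    if (!fp)
        return 0;

    unsigned long commitLimitKB = 0;
    char line[128];
    while (fgets(line, sizeof(line), fp)) {
        // Lines look like: "CommitLimit:      90368 kB"
        if (strncmp(line, "CommitLimit:", 12) == 0) {
            sscanf(line + 12, "%lu", &commitLimitKB);
            break;
        }
    }
    fclose(fp);
    return commitLimitKB;
}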
PS: By the way, I was surprised to see that on the full-ram kernel unagi (512 MB) the overcommit_ratio is the default setting which is set to 50%. Which means that an application cannot access the totality of the memory available on the device, even with the full-ram kernel.
[1] https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc4/2.6.9-rc4-mm1/broken-out/add-documentation-for-new-commitlimit-and-commitavail-meminfo.patch
[2] https://github.com/torvalds/linux/blob/master/fs/proc/meminfo.c
Comment 1 • 12 years ago
> The optimal GC settings found in Bug 863398 are suggesting that a GC might grow by 300% if the
> heap size is above 40 MB.
A GC might grow what by 300%?
I don't understand exactly what you propose we modify. We're going to set the overcommit_ratio to what, exactly? And are you proposing changing a global setting or a per-process setting?
But the net result will be that no one process can commit more than 75% of the phone's physical memory? Or is the idea that no process can have a vm space greater than 75% of the phone's physical memory?
Then either way, you're hoping that with this limit, the GC will see its malloc/mmap's return null, and will respond by allocating less memory? Or something else?
> By the way, I was surprised to see that on the full-ram kernel unagi (512 MB) the
> overcommit_ratio is the default setting which is set to 50%. Which means that an application
> cannot access the totality of the memory available on the device, even with the full-ram kernel.
That sounds like a (separate) bug in the full-ram kernel to me.
Comment 2 (Assignee) • 12 years ago
(In reply to Justin Lebar [:jlebar] from comment #1)
> > The optimal GC settings found in Bug 863398 are suggesting that a GC might grow by 300% if the
> > heap size is above 40 MB.
>
> A GC might grow what by 300%?
A GC can multiply the heap size by 3. This is what provides the best Octane / Snappy results in Bug 863398.
> I don't understand exactly what you propose we modify. We're going to set
> the overcommit_ratio to what, exactly? And are you proposing changing a
> global setting or a per-process setting?
I am not suggesting any modification to the overcommit_ratio in this bug; I was just surprised by its value, which seems quite low.
> But the net result will be that no one process can commit more than 75% of the
> phone's physical memory? Or is the idea that no process can have a vm space
> greater than 75% of the phone's physical memory?
The idea is that the GC cannot have a VM space greater than what can be committed to the memory. This way we could have any GC growth factor without any risk of OOM Kill caused by the kernel (as long as the rest of gecko does not cause us to over commit).
> Then either way, you're hoping that with this limit, the GC will see its
> malloc/mmap's return null, and will respond by allocating less memory? Or
> something else?
It will respond by reporting a JS OOM, which causes script execution to stop. This means that the content of the application will remain, and the user will still be able to interact with the page.
> > By the way, I was surprised to see that on the full-ram kernel unagi (512 MB) the
> > overcommit_ratio is the default setting which is set to 50%. Which means that an application
> > cannot access the totality of the memory available on the device, even with the full-ram kernel.
>
> That sounds like a (separate) bug in the full-ram kernel to me.
Indeed. 50 seems to be the kernel's default value for vm.overcommit_ratio. We probably want to change that on the phone so that larger applications can run on it.
Comment 3 • 12 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #2)
> (In reply to Justin Lebar [:jlebar] from comment #1)
> > > The optimal GC settings found in Bug 863398 are suggesting that a GC might grow by 300% if the
> > > heap size is above 40 MB.
> >
> > A GC might grow what by 300%?
>
> A GC can multiply by 3 the heap size. This is what provides the best octane
> / snappy results in Bug 863398.
Just to clarify, Nicolas is talking about javascript.options.mem.gc_high_frequency_heap_growth_max, which controls when we trigger GCs. Imagine the pref is set to 300%. Then at the end of a GC, if we're using X MB of GC memory, we would trigger the next GC when the heap grows to 3*X. With the current setting of the pref (150%), we would wait until the heap grows to 1.5*X.
The goal of this bug is to cap the trigger at the physical memory of the phone. So we would really set it to Min(3*X, phone_memory*factor), where Nicolas suggested 75% for the factor.
These heuristics only apply to high-frequency GCs, which are ones that happen within about a second of each other. Our hope is that these GCs don't happen during normal phone operation, and so the new heuristics will only affect our benchmark scores and not our memory footprint. We have to verify that, though.
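A rough sketch of that cap, with illustrative names rather than the actual SpiderMonkey heuristics code (the 75% factor is the one suggested above, and the commit limit would come from /proc/meminfo as described in the bug description):

#include <stddef.h>

// Illustrative only: cap the high-frequency GC trigger at a fraction of the
// commit limit instead of letting it grow unbounded with the heap size.
static size_t
ComputeNextGCTrigger(size_t heapSizeAfterGC,      // X: live heap after the last GC
                     double highFrequencyGrowth,  // e.g. 3.0 for a 300% pref value
                     size_t commitLimitBytes)     // from /proc/meminfo, 0 if unknown
{
    size_t trigger = (size_t)(heapSizeAfterGC * highFrequencyGrowth);
    if (commitLimitBytes) {
        size_t cap = (size_t)(commitLimitBytes * 0.75);  // suggested factor
        if (trigger > cap)
            trigger = cap;
    }
    return trigger;
}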
Comment 4 • 12 years ago
Isn't this GC trigger a per-process thing? In which case setting it to 75% of physical memory seems like you'll never hit it, unless there is only one process consuming all of the memory.
Comment 5 • 12 years ago
> The goal of this bug is to cap the trigger at the physical memory of the phone. So we would really
> set it to Min(3*X, phone_memory*factor), where Nicolas suggested 75% for the factor.
I see, okay. Additionally, the suggestion is that phone_memory*factor should be a hard limit on the memory used by JS, right?
This doesn't seem sound to me. Comment 1 explains why:
> This way we could have any GC growth factor without any risk of OOM Kill caused by the kernel (as
> long as the rest of gecko does not cause us to over commit).
If we ignore the rest of Gecko (and also all other processes running on the system which have a lower oom_adj than this process), then maybe JS won't cause us to OOM. But I don't see how we can soundly ignore those things.
Child processes are sometimes JS-heavy, and sometimes not. Whatever constant |factor| we choose, it's going to be either too high to prevent OOMs in some cases, so low that we don't use all available memory in other cases, or both.
We have a similar problem with images in b2g. We'd like to know "how much memory can we safely use for images in this process?", but that's a hard question to answer, even approximately. We have working low-memory notifications, and maybe that can help with JS. But to make them work, you have to be able to check for low-memory pretty frequently, and then act on it quickly.
(Right now, low-memory notifications come in on something other than the main thread, and then we fire an event to the main thread that causes us to do a shrinking GC, among other things. But that's obviously not fast enough to catch all cases.)
> Our hope is that these GCs don't happen during normal phone operation, and so the new heuristics
> will only affect our benchmark scores and not our memory footprint. We have to verify that,
> though.
I'm ignorant of how we normally tune these numbers, but naively, it seems to me that tweaking a value that we think has no effect on user-perceived performance at the expense of possibly OOM'ing processes more often is a bad trade-off.
Comment 6 • 12 years ago
I guess the conclusion of comment 5 is that periodically checking for low-memory while running JS is probably a lot saner than enforcing a hard limit on JS memory usage. But even checking periodically isn't particularly good, since Gecko can OOM us, and since another process on the system can OOM us. The best thing to do is to use as little memory as possible, always.
Comment 7 (Assignee) • 12 years ago
(In reply to Justin Lebar [:jlebar] from comment #5)
> > The goal of this bug is to cap the trigger at the physical memory of the phone. So we would really
> > set it to Min(3*X, phone_memory*factor), where Nicolas suggested 75% for the factor.
>
> I see, okay. Additionally, the suggestion is that phone_memory*factor
> should be a hard limit on the memory used by JS, right?
>
> This doesn't seem sound to me. Comment 1 explains why:
>
> > This way we could have any GC growth factor without any risk of OOM Kill caused by the kernel (as
> > long as the rest of gecko does not cause us to over commit).
>
> If we ignore the rest of Gecko (and also all other processes running on the
> system which have a lower oom_adj than this process), then maybe JS won't
> cause us to OOM. But I don't see how we can soundly ignore those things.
1/ The rest of Gecko we cannot ignore, but handling that would be way more complex and I don't have the knowledge to deal with it yet; the factor is a conservative way to say that we care about it.
2/ As mentioned in Comment 1, this is not the phone_memory, but the limit of memory allocatable by an application (aka. CommitLimit). This means that an application which reaches this limit, if it is the only application running, will be killed by the kernel (at 50% of the available memory on the phone, since overcommit_ratio = 50).
3/ Other processes we don't really care about, because they would be killed by the kernel as soon as the foreground application tries to commit allocated memory. This is already what happens, if I am not mistaken.
> Child processes are sometimes JS-heavy, and sometimes not. Whatever
> constant |factor| we choose, it's going to be either too high to prevent
> OOMs in some cases, so low that we don't use all available memory in other
> cases, or both.
When I look at an application which is not DOM-intensive, I can see that the browser takes approximately 20 MB; using 75% of a 100 MB limit gives us a conservative bound for running JS benchmarks without getting the application running the benchmark killed by the kernel.
> We have a similar problem with images in b2g. We'd like to know "how much
> memory can we safely use for images in this process?", but that's a hard
> question to answer, even approximately. We have working low-memory
> notifications, and maybe that can help with JS. But to make them work, you
> have to be able to check for low-memory pretty frequently, and then act on
> it quickly.
All the memory that an application can commit is bounded by the CommitLimit; you can allocate more as long as you don't access it, or as long as you decommit pages which have been allocated before.
Correct me if I am wrong, but I think background applications are killed first, no?
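To illustrate the reserve/commit/decommit distinction above (a generic POSIX sketch under Linux's usual accounting of writable private mappings, not the GC's actual chunk-allocation code):

#include <sys/mman.h>
#include <string.h>

int main()
{
    // Reserve 64 MB of address space without committing it; a PROT_NONE
    // anonymous mapping is not charged against the commit limit.
    const size_t size = 64 * 1024 * 1024;
    void* p = mmap(nullptr, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    // Commit one page by making it writable and touching it.
    mprotect(p, 4096, PROT_READ | PROT_WRITE);
    memset(p, 1, 4096);

    // Decommit it again: the contents are discarded and the physical page
    // can be reclaimed by the kernel.
    madvise(p, 4096, MADV_DONTNEED);

    munmap(p, size);
    return 0;
}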
> > Our hope is that these GCs don't happen during normal phone operation, and so the new heuristics
> > will only affect our benchmark scores and not our memory footprint. We have to verify that,
> > though.
>
> I'm ignorant of how we normally tune these numbers, but naively, it seems to
> me that tweaking a value that we think has no effect on user-perceived
> performance at the expense of possibly OOM'ing processes more often is a bad
> trade-off.
I want to rectify something here: we are tweaking a value knowing that it has an effect on the UX.
Currently, apps will just OOM and disappear when the application crashes, with no possible fallback for the application. After this patch, some apps would not crash but would stop executing any script, which might leave the application in a broken state (the script does not run to completion) that might still be of interest for the user.
Comment 8 • 12 years ago
I want to be clear that there are two separate proposals here.
1) Limit JS to using 75% of CommitLimit, and
2) Make JS grow more quickly towards CommitLimit.
My main concern is about (2), since this will make the phone more likely to
crash if we ever enter into this fast-gc domain. It's been argued here that we
should probably never enter into this domain except for benchmarks, in which
case it seems to me like a bad idea to make the phone more likely to crash so
that we can get a better score on benchmarks.
Additionally, I don't think the increased likelihood of crashing is balanced
out by (1); to argue that this is the case requires us to ignore Gecko and
other processes running on the system.
I think (1) is probably wrong on its own, but if you guys want to limit the
amount of memory available to JS because that's the best way to avoid crashes,
then that's fine with me. If our goal is to prevent crashes, my feeling is
that we should consider alternative ways to limit memory used by JS.
>> I see, okay. Additionally, the suggestion is that phone_memory*factor
>> should be a hard limit on the memory used by JS, right?
>>
>> This doesn't seem sound to me. Comment 1 explains why:
>>
>>> This way we could have any GC growth factor without any risk of OOM Kill
>>> caused by the kernel (as long as the rest of gecko does not cause us to
>>> over commit).
>>
>> If we ignore the rest of Gecko (and also all other processes running on the
>> system which have a lower oom_adj than this process), then maybe JS won't
>> cause us to OOM. But I don't see how we can soundly ignore those things.
>
>1/ The rest of Gecko we cannot ignore, but handling that would be way more
>complex and I don't have the knowledge to deal with it yet; the factor is a
>conservative way to say that we care about it.
Because the factor is conservative, it sounds like we agree that (1) does not
completely remove the chance that (2) will cause OOMs. That is my main point.
I also argue below that 75% is not actually conservative; it would reduce the
amount of memory available to JS by a factor of .80 in some cases.
>2/ As mentioned in Comment 1, this is not the phone_memory, but the limit of
>memory allocatable by an application (aka. CommitLimit). This means that an
>application which reaches this limit, if it is the only application running,
>will be killed by the kernel (at 50% of the available memory on the phone,
>since overcommit_ratio = 50).
I have observed processes on my phone using more than CommitLimit memory and
not getting killed, so I think it may not behave as you describe.
> # b2g-procrank
> APPLICATION PID Vss Rss Pss Uss cmdline
> Browser 15628 113908K 106424K 103865K 101472K /system/b2g/plugin-container
> b2g 144 62844K 52808K 50285K 47904K /system/b2g/b2g
>
> # cat /proc/meminfo
> [...]
> CommitLimit: 90368 kB
>
> # cat /proc/sys/vm/overcommit_memory
> 1
I think this is because /proc/sys/vm/overcommit_memory is 1, not 2, on my
Hamachi.
> CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'),
> this is the total amount of memory currently available to
> be allocated on the system. **This limit is only adhered to
> if strict overcommit accounting is enabled (mode 2 in
> 'vm.overcommit_memory').**
What is overcommit_memory on your phone? I'd believe that the 512mb Unagi has
weird settings; we shouldn't worry about that device+kernel.
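For reference, the overcommit mode that determines whether CommitLimit is enforced can be read the same way as the other proc files (a small sketch, not existing Gecko code):

#include <stdio.h>

// Sketch: read /proc/sys/vm/overcommit_memory. 0 = heuristic overcommit,
// 1 = always overcommit, 2 = strict accounting (only mode 2 enforces
// CommitLimit).
static int
GetOvercommitMode()
{
    int mode = -1;
    FILE* fp = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (fp) {
        if (fscanf(fp, "%d", &mode) != 1)
            mode = -1;
        fclose(fp);
    }
    return mode;
}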
>3/ Other processes we don't really care about, because they would be killed by
>the kernel as soon as the foreground application tries to commit allocated
>memory. This is already what happens, if I am not mistaken.
If you're a foreground app, you can be oom'ed because you're using too much
memory, or because any process with a lower oom_adj than you is using too much
memory. There are many such processes on the device, but the one which
allocates the most memory is the B2G master process.
So we do very much care about this case.
>> Child processes are sometimes JS-heavy, and sometimes not. Whatever
>> constant |factor| we choose, it's going to be either too high to prevent
>> OOMs in some cases, so low that we don't use all available memory in other
>> cases, or both.
>
>When I look at an application which is not DOM-intensive, I can see that the
>browser takes approximately 20 MB; using 75% of a 100 MB limit gives us a
>conservative bound for running JS benchmarks without getting the application
>running the benchmark killed by the kernel.
On my Hamachi device -- which is an actual device we'll be shipping -- I have
CommitLimit = 90368kB, and I can allocate 83mb of JS memory before we crash.
(*)
If we instead set a 75% limit on the JS memory, I'd be able to allocate only
67mb of JS memory.
That sounds like a pretty bad regression to me. But if you're OK with that,
then I guess I'm OK with that.
Note that after using the phone for a while, the main process usually uses an
additional ~25mb of memory. If that were the case, then we'd crash before we
could allocate even 67mb of JS memory (83 - 25 = 58). This illustrates how the
75% limit will not work well in common situations.
You could use a lower limit instead, but then you have the opposite problem of
leaving memory that you could use on the table. I contend there's no good way
to set this value.
>> We have a similar problem with images in b2g. We'd like to know "how much
>> memory can we safely use for images in this process?", but that's a hard
>> question to answer, even approximately. We have working low-memory
>> notifications, and maybe that can help with JS. But to make them work, you
>> have to be able to check for low-memory pretty frequently, and then act on
>> it quickly.
>
>All the memory that an application can commit is bounded by the CommitLimit;
>you can allocate more as long as you don't access it, or as long as you
>decommit pages which have been allocated before.
>
>Correct me if I am wrong, but I think background applications are killed first, no?
Background apps are killed first, yes. But the master process is always there.
>> I'm ignorant of how we normally tune these numbers, but naively, it seems to
>> me that tweaking a value that we think has no effect on user-perceived
>> performance at the expense of possibly OOM'ing processes more often is a bad
>> trade-off.
>
>I want to rectify something here: we are tweaking a value knowing that it has
>an effect on the UX.
>
>Currently, apps will just OOM and disappear when the application crashes, with
>no possible fallback for the application. After this patch, some apps would not
>crash but would stop executing any script, which might leave the application in
>a broken state (the script does not run to completion) that might still be of
>interest for the user.
I believe that (1) has an effect on UX; the question is about (2). Does that
also have an effect on UX?
I am totally in favor of preventing JS OOM crashes. I just think that a 75%
limit (or any constant-factor limit) on JS memory is far too simplistic a way
of accomplishing this. It would cause us to leave memory on the table in
common cases, and it would not let us prevent many common JS memory OOMs.
Separately, I think that growing the heap faster (i.e., (2)) is really scary,
and if the only reason to do this is to improve our benchmark scores, I don't
understand why we'd take the risk of this having real effects on users. I
don't think that (1) mitigates the risk of (2).
(*) Tested using bit.ly/membuster; click "bust system memory".
Comment 9 • 12 years ago
> I am totally in favor of preventing JS OOM crashes.
To be clear, I think it would be /fantastic/ if we could say "this page is using too much JS memory, so we've killed scripts here". Or even "this page is using a lot of JS memory; do you want to stop scripts?" Either one of those is probably better than crashing, which is what we do now.
Comment 10 (Assignee) • 12 years ago
(In reply to Justin Lebar [:jlebar] from comment #8)
> I think this is because /proc/sys/vm/overcommit_memory is 1, not 2, on my
> Hamachi.
Same here with the full-ram kernel for the unagi.
> > CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'),
> > this is the total amount of memory currently available to
> > be allocated on the system. **This limit is only adhered to
> > if strict overcommit accounting is enabled (mode 2 in
> > 'vm.overcommit_memory').**
Ok, so this bug is invalid, and we will just crash as we currently do … so nothing to worry about.
> What is overcommit_memory on your phone? I'd believe that the 512mb Unagi
> has
> weird settings; we shouldn't worry about that device+kernel.
Sorry if we are not aware of the latest device to test anything with. From my understanding, it was that we should only care about the Unagi to safely scale on any other devices.
Updated (Assignee) • 12 years ago
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
Comment 11 • 12 years ago
> Sorry if we are not aware of the latest device to test anything with. From my understanding, it was
> that we should only care about the Unagi to safely scale on any other devices.
The Unagi device is fine; it's the Unagi with the special kernel that unlocks all 512mb of RAM that's problematic, because the device we're targeting only has 256mb of RAM.
Comment 12 (Assignee) • 12 years ago
(In reply to Justin Lebar [:jlebar] from comment #11)
> > Sorry if we are not aware of the latest device to test anything with. From my understanding, it was
> > that we should only care about the Unagi to safely scale on any other devices.
>
> The Unagi device is fine; it's the Unagi with the special kernel that
> unlocks all 512mb of RAM that's problematic, because the device we're
> targeting only has 256mb of RAM.
The one with 256 MB of RAM does not have enough memory to run Octane at all, and *the phone crashes* (with the default config) by running out of memory.