Add memory reserve and xmalloc() API

RESOLVED FIXED

Status

()

Core
Memory Allocator
RESOLVED FIXED
9 years ago
7 years ago

People

(Reporter: bsmedberg, Assigned: Jason Evans)

Tracking

(Depends on: 1 bug, Blocks: 2 bugs)

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 3 obsolete attachments)

(Reporter)

Description

9 years ago
The default jemalloc allocation API should never fail. If it really can't allocate any memory, it should abort() or something equivalent.

We should provide a separate API for locations in our code which allocate a lot of memory (e.g. image decoders, network caches) and expect allocation to perhaps fail.

This bug covers only the core allocator API. Teaching code to use it will be separate bugs, see blockers of this one.

The actual name of this API is TBD, see bug 427107
(Reporter)

Updated

9 years ago
Depends on: 427111
(Reporter)

Updated

9 years ago
Blocks: 427115
(Assignee)

Comment 1

9 years ago
malloc() is already defined to possibly fail, so it seems to me that we only need to add a malloc_nofail() interface.
(Reporter)

Comment 2

9 years ago
Perhaps so... in that case we'd need to rewrite almost everything in our codebase to use it.
Can we throw an exception rather than abort() please? E.g. if someone tries to load an SVG containing a stupidly large amount of GIS data, I'd rather we throw a catchable exception and fail to load that page than abort() and trash all other windows and tabs. There's no guarantee that the first OOM failure will occur under the page load stack/thread of course, but it gives us a chance to do sensible things for OOM at certain points in the code, and some chance is better than none.
(Reporter)

Comment 4

9 years ago
No. We do not use exceptions and the amount of work to get exception-safe code is huge.

The point of this bug is that if you know that a particular kind of large allocation can come from the user, you can make that allocation-point use the null-returning allocator and handle failure in the appropriate way.
I assumed that this bug would be a Mozilla 2 thing, and in the Mozilla 2 time frame we're turning on exceptions, no?
(Reporter)

Comment 6

9 years ago
Not for mozilla 2.0, no. That work, if it happens at all, will be sometime later in 2.x
Ah. Fair enough then.
(Assignee)

Updated

9 years ago
Assignee: jasone → nobody
Component: General → jemalloc
QA Contact: general → jemalloc
(Assignee)

Updated

9 years ago
Assignee: nobody → jasone
(Assignee)

Comment 8

9 years ago
For operating systems that over-commit (pretty much everything except Windows), there is the possibility that malloc() will succeed, but when the application writes to the allocated space, the virtual memory system will fail to allocate a physical page for the memory.  We can avoid this failure mode by pre-faulting all memory as it is allocated, at the cost of some extra complexity and some extra memory writes.  Is it desirable to avoid asynchronous memory allocation failures due to over-commit?
(Assignee)

Comment 9

9 years ago
Another possibly superior solution to the over-commit problem would be to create temporary page files as necessary when allocating memory via mmap(2), so that the operating system is guaranteed to be able to physically back all allocated pages.  I have never experimented with this approach, but it is likely to work.
The latter approach sounds good.

If all available solutions to the over-commit problem impose a significant performance penalty, then we should probably have a mode switch so that regular desktop Firefox turns it off and can still die on over-commit, but embedded/mobile builds can turn it on if they need it. You'd hope that the mobile folks actually turn over-commit off, though...
(Assignee)

Comment 11

9 years ago
The "memory is low" callback approach is inherently racy in the multi-threaded case.  It is possible to guarantee that an individual malloc_nofail() call will succeed, but another thread may call malloc_nofail() before the main application loop has a chance to free up memory.  

One option is to make malloc_nofail() keep trying forever, under the assumption that some other thread will free up enough memory.  However, for that to be 100% safe, there really needs to be a dedicated thread that never calls malloc_nofail().  The advantage with this approach is that as long as memory is not unrecoverably exhausted, the application will always make forward progress.

I am concerned about the tractability of making the app resilient to two forms of memory exhaustion: 1) the race condition described above, and 2) memory-intensive code paths that cause sufficient memory allocation to cause permanent failure before the main application loop gets a chance to discard non-critical data.

Am I missing something about the intended use of malloc_nofail() and the low memory callback that makes these non-issues?  For desktop use, none of this really matters much in practice, but for mobile devices, it seems to me that fixed memory budgets could make the low memory handling a critical part of memory management.
(In reply to comment #11)
> Am I missing something about the intended use of malloc_nofail() and the low
> memory callback that makes these non-issues?  For desktop use, none of this
> really matters much in practice, but for mobile devices, it seems to me that
> fixed memory budgets could make the low memory handling a critical part of
> memory management.

Indeed.

> The "memory is low" callback approach is inherently racy in the multi-threaded
> case.  It is possible to guarantee that an individual malloc_nofail() call
> will succeed, but another thread may call malloc_nofail() before the main
> application loop has a chance to free up memory.  

That's possible.

> One option is to make malloc_nofail() keep trying forever, under the
> assumption that some other thread will free up enough memory.  However, for
> that to be 100% safe, there really needs to be a dedicated thread that never
> calls malloc_nofail().  The advantage with this approach is that as long as
> memory is not unrecoverably exhausted, the application will always make
> forward progress.

This approach is quite likely to deadlock in practice. That's worse than just aborting the process.

> I am concerned about the tractability of making the app resilient to two forms
> of memory exhaustion: 1) the race condition described above, and 2)
> memory-intensive code paths that cause sufficient memory allocation to cause
> permanent failure before the main application loop gets a chance to discard
> non-critical data.

We will probably not be able to completely avoid those problems. However, I think it's plausible we can avoid them "most of the time".

Ideally we'd be able to compute a bound on the amount of memory any one thread can consume before it reaches a "safe point" where we can free arbitrary amounts of memory or exit the thread. Then the reserve memory buffer would just be sized to the sum of all those bounds. This would still be simpler than OOM checking and recovery at every single allocation point.

In practice we probably won't be able to compute hard bounds, but we will be able to test to get some reasonable estimates.
(Reporter)

Comment 13

9 years ago
Did you write up a proposal for the callbacks yet? I was imagining that the allocator would block while the callbacks were running, or at least block if allocation failed and the callbacks were running.

As for overcommit, I think we should ignore that problem for the time being: we can solve it as a later step if necessary, right?
(Assignee)

Comment 14

9 years ago
(In reply to comment #13)
> Did you write up a proposal for the callbacks yet?

No, I have not, but will do so shortly.

> I was imagining that the
> allocator would block while the callbacks were running, or at least block if
> allocation failed and the callbacks were running.

Ah, I thought the intention was to handle low memory entirely via a "memory is low" callback, but I see now that you had in mind two complementary callback mechanisms: 1) "memory is low, FYI" and 2) "memory is exhausted, do something now or we die".

> As for overcommit, I think we should ignore that problem for the time being: we
> can solve it as a later step if necessary, right?

I don't think we can ignore over-commit, because the allocator will be able to allocate more virtual memory than can be backed by physical pages.  This will prevent us from ever detecting a low memory situation, and we will instead crash when we run out of physical pages.
(Reporter)

Comment 15

9 years ago
Yes, we will crash, but that's no worse than what happens right now, is it?
(Assignee)

Comment 16

9 years ago
(In reply to comment #15)
> Yes, we will crash, but that's no worse than what happens right now, is it?

It's no worse than what happens right now, but there's little point in writing the low-memory callback infrastructure unless we prevent over-commit, since the code will go unused for typical system configuration (less physical memory than virtual memory).  It doesn't make sense to maintain a memory reserve that is merely virtual address space; it has little more practical value than as-yet-unallocated virtual address space.

By the way, I wrote a working prototype yesterday that confirms the technique works.  With a bit of additional caching infrastructure, performance impact should be minimal.  The only real gotcha is that disk space can become a limiting factor.
(Assignee)

Comment 17

9 years ago
See http://wiki.mozilla.org/Mozilla_2/Memory/OOM_API for OOM API documentation
(subject to review/revision).
(Reporter)

Comment 18

9 years ago
So, we have two different goals to serve:

1) simplify allocation points by guaranteeing non-null
2) handle OOM more gracefully with an OOM flag and callbacks

I think we can solve #1 without dealing with overcommit, and that's something I'd like to do for 1.9.1 if possible. #2 makes no sense without overcommit protection. Does that sound reasonable? Of course if overcommit protection is easy to come by, you could just do it all at once, but I don't want that necessarily to gate landing of bug 427115 etc.
Looks good. You probably should support a set of "reserve low" callbacks. Otherwise you end up in a situation like signals where multiple libraries and apps fight over a single resource.
Benjamin: you're right, my only concern is that I'd like to have some hard evidence that approach #2 is the right approach before we go down path #1. Because it would totally suck if we removed a lot of OOM-handling code and then found out that approach #2 doesn't work.
(Reporter)

Comment 21

9 years ago
roc, assuming we have callbacks but not overcommit protection, how would be worse than now? We would use the callbacks on Windows, and Linux would continue to overcommit as it always did, and we'd be none the wiser.
OK, I guess we can prove the concept on Windows, sure.
(Assignee)

Comment 23

9 years ago
(In reply to comment #18)
> So, we have two different goals to serve:
> 
> 1) simplify allocation points by guaranteeing non-null
> 2) handle OOM more gracefully with an OOM flag and callbacks
> 
> I think we can solve #1 without dealing with overcommit, and that's something
> I'd like to do for 1.9.1 if possible. #2 makes no sense without overcommit
> protection. Does that sound reasonable?

Yes, it sounds reasonable to me.

> Of course if overcommit protection is
> easy to come by, you could just do it all at once, but I don't want that
> necessarily to gate landing of bug 427115 etc.

The over-commit protection and memory reserve mechanisms overlap quite a bit,
but they are both part of your item (2).  Item (1) in isolation is a trivial
change to the allocator; the hard part there is rewriting the rest of mozilla
to call xmalloc() and friends.
(Assignee)

Comment 24

9 years ago
(In reply to comment #19)
> Looks good. You probably should support a set of "reserve low" callbacks.
> Otherwise you end up in a situation like signals where multiple libraries and
> apps fight over a single resource.

Do I understand correctly that the idea is to allow multiple callback functions to be registered so that, for example, a different callback function can be registered for each flushable cache?  If so, there are some issues I'm not sure how to deal with, all of which might be better dealt with outside the allocator.

1) Should the callbacks be called in a particular order?  If so, then I don't think this feature belongs in the allocator at all; the allocator should not be responsible for implementing cache flushing policy.

2) Should all the callbacks be called, even if one of the early ones frees enough memory to restore the reserve to sufficient size?  I don't think they should be called, since the precondition for calling them no longer exists.  However, if call order is unspecified (as I think it should be), then it is not possible to track all memory reserve state changes in a single callback function.

3) If multiple "reserve low" callbacks are supported, shouldn't multiple "reserve critical" callbacks also be supported?

4) [Implementation detail.]  There is not a simple way to compactly store an arbitrary number of callback pointers, due to the way internal data structures are allocated.  In other words, this feature would introduce more complexity than you might initially think.
It's not important for us so don't worry about it if it's a big deal.

But if you package this as something useful for general purpose use, you'll probably find situations where you have multiple clients in the same address space that don't know about each other (e.g., an embedded Gecko plus some jemalloc-using application). Though I suppose you can just treat those as separate heaps.
(Reporter)

Comment 26

9 years ago
I think we should try to do multiple callbacks, either in jemalloc itself or one level up as an API exposed by XPCOM. We're certainly going to have to have an expandable set of caches which may want to implement memory-flushing behavior. Perhaps we can do something simple like a chain of callback functions, except that we can't guarantee any sort of LIFO removal of callbacks during shutdown.

Are there any restrictions on what a callback function may do? e.g. can it attempt to allocate memory? Will the attempt fail? Will smaller-size reallocs work?

Perhaps rather than having separate low/critical callbacks, we could have a single callback with an flag?
(Assignee)

Comment 27

9 years ago
I have revised http://wiki.mozilla.org/Mozilla_2/Memory/OOM_API based on comment #19 and comment #26.

(In reply to comment #19)
> Looks good. You probably should support a set of "reserve low" callbacks.
> Otherwise you end up in a situation like signals where multiple libraries and
> apps fight over a single resource.

I have followed your suggestion, but note that "condition notification" is different than "signal delivery".  Do the "condition notification" semantics look sane to you?

(In reply to comment #26)
> I think we should try to do multiple callbacks, either in jemalloc itself or
> one level up as an API exposed by XPCOM. We're certainly going to have to have
> an expandable set of caches which may want to implement memory-flushing
> behavior.

I guess the multiplexing of callbacks might as well be in the allocator, as long as mozilla doesn't need tight control over callback order.  Do you think the current specification is adequate for mozilla's needs?

> Are there any restrictions on what a callback function may do? e.g. can it
> attempt to allocate memory? Will the attempt fail? Will smaller-size reallocs
> work?

There are no restrictions on what callback functions may do, though they must be thread-safe, and allocating memory has the potential to cause callback recursion (and therefore deadlock if the callback functions hold any locks during allocation).

Shrinking reallocs do not necessarily allocate in-place, so they are not guaranteed to succeed.

> Perhaps rather than having separate low/critical callbacks, we could have a
> single callback with an flag?

I have followed your suggestion, and overall I think it's an improvement, though it adds some complexity in the form of tracking which conditions are reported to the callback functions.
You probably want to make the (callback_function, data) pair be the key that you register and unregister, not just the callback_function. Otherwise looks good.
(Assignee)

Comment 29

9 years ago
As I was falling asleep last night I realized that it is exceptionally difficult to reliably call callback functions more than once per condition-causing allocation request, due to the possibility that other threads are concurrently performing callbacks.  Therefore I want to drop that feature.
(Assignee)

Comment 30

9 years ago
(In reply to comment #28)
> You probably want to make the (callback_function, data) pair be the key that
> you register and unregister, not just the callback_function. Otherwise looks
> good.

Done.
(Assignee)

Comment 31

9 years ago
I now have a functionally complete (though only moderately tested) memory reserve implementation.  In order for it to see adequate testing, mozilla needs to start using the x*() functions and register callbacks.

My code is based on the actionmonkey tree, but it is layered on 9 other patches that have not been committed.  What is the best way for me to make the memory reserve code available to those who need to use it?
(Assignee)

Comment 32

9 years ago
Created attachment 322695 [details] [diff] [review]
Memory reserve rollup patch

[This patch includes numerous indirectly related features that have not yet been integrated into the actionmonkey tree, so that the patch is actually usable.]

This patch implements a memory reserve, as documented in jemalloc.h and implemented in jemalloc.c.  It also implements pagefile support, in order to prevent over-commit on Unix-like systems.

The memory reserve code is moderately tested; I made some local changes to use the x*() functions instead of their standard counterparts (ex: malloc --> xmalloc), and then added numerous callback functions that release pre-allocated memory when notifications indicate low/critical memory conditions.  I simulated memory pressure by limiting virtual memory, but this can cause non-fatal errors for extension loading that would not exist in normal memory pressure situations.  In any case, the code is sufficiently exercised that there are no remaining obvious deadlock conditions, and condition notification appears to work as specified.

Pagefile support and memory reserve size can be controlled via the MALLOC_OPTIONS environment variable (see the O, G, and R options).  We can set better memory reserve defaults once we actually have callback functions registered.
(Assignee)

Comment 33

9 years ago
Created attachment 324115 [details] [diff] [review]
Memory reserve rollup patch (v2)

This revised patch includes general bugfixes and Windows-specific portability fixes.

The patch also includes a fix for a long-standing jemalloc bug that (as near as I can tell) should only be hit if a double-free is happening in the application.  Without the allocator fix, firefox crashes, whereas with the fix it gets lucky and... does something non-crashy instead.  I plan to add diagnostic output for debug builds when assert() fails (rather than unconditionally disabling assert()), so hopefully we can eventually figure out the root cause of the crashes.

This rollup patch is composed of 12 distinct patches, so I'll upload the reserve-specific patch for review shortly.
Attachment #322695 - Attachment is obsolete: true
(Assignee)

Comment 34

9 years ago
Created attachment 324117 [details] [diff] [review]
Memory reserve

In addition to implementing the memory reserve infrastructure, this patch completely removes the sbrk-related code.  This is because the MALLOC_PAGEFILE code cannot be made to work in conjunction with sbrk, and we disable the sbrk code on all platforms anyway.
Attachment #324117 - Flags: review?(benjamin)
(sbrk disabled in patch for bug 427351.)

/be
(Reporter)

Comment 36

9 years ago
Comment on attachment 324117 [details] [diff] [review]
Memory reserve

Looks good!
Attachment #324117 - Flags: review?(benjamin) → review+
(Reporter)

Comment 37

9 years ago
Jason, what should we do about the case where jemalloc isn't enabled?

1) on mac... can we just turn jemalloc on by default, or are there perf issues?
2) for Windows users without VC pro: I was thinking that we would re-implement the 4 x-functions in XPCOM with a dirt-simple abort, such as:

void*
xmalloc(size_t size)
{
  void *r = malloc(size);
  if (!r)
    abort();

  return r;
}

Does that sound reasonable?
I think on OS X, there just wasn't a real win, but stuart or jason may know more.
(Assignee)

Comment 39

9 years ago
(In reply to comment #37)
> Jason, what should we do about the case where jemalloc isn't enabled?
> 
> 1) on mac... can we just turn jemalloc on by default, or are there perf issues?

Yes, I think we can turn on jemalloc for OS X (after cleaning up likely
bitrot).  Performance is fine; we punted on jemalloc for OS X because of
stability issues, and I expect that recent changes to jemalloc will make those
issues easier to work out.

> 2) for Windows users without VC pro: I was thinking that we would re-implement
> the 4 x-functions in XPCOM with a dirt-simple abort, such as:
> 
> [...]
> 
> Does that sound reasonable?

Yes, that seems reasonable to me.
(Assignee)

Updated

9 years ago
Summary: Make je_malloc never fail, and provide a separate je_malloc_canfail API → Add memory reserve and xmalloc() API
(Assignee)

Comment 40

9 years ago
During final testing, I encountered a critical bug when running on Windows.  The cause turned out to be a problem with chunk coalescing in the memory reserve.  Windows' virtual memory system requires all map/unmap operations via VirtualAlloc()/VirtualFree() calls to have a 1:1 correspondence.  For example, it is illegal to map two adjacent 1MiB chunks, then unmap them in a single operation.

I am currently unsure how to proceed; suggestions welcome.  Here are the options I have explored:

1) Do not coalesce adjacent chunks in the memory reserve.  This is simple to implement, but it allows significant virtual memory fragmentation.  In practice, such fragmentation doesn't matter much, unless address space exhaustion becomes an issue.

2) Only store one-chunk regions in the memory reserve, and do not coalesce adjacent chunks.  This is relatively simple to implement, and it works fine, except that the memory reserve can directly thwart huge allocations when memory is nearly exhausted.  This is because the memory reserve cannot be used for multi-chunk huge allocations, yet the reserved memory, if released, would be sufficient to satisfy the request.

3) On Windows, add unmap/map operations that allow memory reserve coalescing/splitting.  For 1MiB chunks (the current default), this causes a ~1% slowdown for sunspider.  I also tested with 4MiB chunks, and the performance regression went away, but there are reasons to prefer as small a chunk size as is practical.
(Reporter)

Comment 41

9 years ago
Interesting. I'm not sure I understand all the tradeoffs involved here, but... in option 2, how large is "huge"? Any allocations larger than 16k or so should be fail-safe, so I don't think that we need to worry about those allocations failing.

So, I'm currently thinking we should try option 2, and maybe revisit with option 3 if it becomes an issue... does that sound reasonable?

Is this tradeoff windows-only?
(Assignee)

Comment 42

9 years ago
(In reply to comment #41)
> Interesting. I'm not sure I understand all the tradeoffs involved here, but...
> in option 2, how large is "huge"? Any allocations larger than 16k or so should
> be fail-safe, so I don't think that we need to worry about those allocations
> failing.

For most purposes, "huge" in jemalloc is ~2-3 pages shy of 1MiB, but in the context of this issue, we start running into issues at >1MiB.

> So, I'm currently thinking we should try option 2, and maybe revisit with
> option 3 if it becomes an issue... does that sound reasonable?

Yes, I agree that this is the right course of action.  Since writing comment #40, I've been developing an option 4 that turns out to be alarmingly complex (several hundred lines of branch-heavy code):

4) Modify the memory reserve to logically coalesce adjacent mappings, but keep enough information to track the actual mappings so that they can be manipulated correctly.

My (incomplete) attempt at (4) abuses existing extent tree data structures, but I have come to the conclusion that if we ever do need to make this approach work, we need a better way of intermingling the address-ordered and size/address-ordered trees.  The obvious solution is to use doubly linked rings to track logically coalesced mappings, and add a size field that specifically tracks the logically coalesced size.

I think (4) a better option than (3), but its complexity needs to be justified by some real-world problems that show (2) to be inadequate.

> Is this tradeoff windows-only?

Yes, this is a Windows-only issue.
(Assignee)

Comment 43

9 years ago
Created attachment 325784 [details] [diff] [review]
Incomplete attempt at option 4

Here is a patch that captures the essence of my aborted attempt at option 4, posted for posterity, in case we have to revisit this issue.
(Assignee)

Comment 44

9 years ago
Created attachment 326014 [details] [diff] [review]
Memory reserve (v2)

This revised patch fixes crashes on Windows by disabling memory reserve coalescing and storing only single-chunk mappings in the reserve.
Attachment #324117 - Attachment is obsolete: true
Attachment #326014 - Flags: review?(benjamin)
(Reporter)

Comment 45

9 years ago
Comment on attachment 326014 [details] [diff] [review]
Memory reserve (v2)

Reviewed via interdiff.
Attachment #326014 - Flags: review?(benjamin) → review+
(Assignee)

Comment 46

9 years ago
changeset:   15482:c14ab4f6cec6
user:        Jason Evans <jasone@canonware.com>
date:        Mon Jun 23 07:46:37 2008 -0700
summary:     Bug 427109: Add memory reserve and xmalloc() API, r=benjamin
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
I backed this out to try to fix the persistent Linux Talos bustage that appeared right after this was checked in.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
There was intermittent green on some boxes today, while this patch was in, but after the backout everything has gone green. So I'm pretty sure this patch was the problem.
(Assignee)

Comment 49

9 years ago
I've tried to reproduce the talos failures on a 32-bit Ubuntu 8.04 system.  What I see is crashes in tp, both with and without the reserve patch.  However, the failure mode changes a bit, from (without reserve):

========================================================================
Running test tp: 
                Started Tue, 24 Jun 2008 17:49:17
        Screen width/height:1600/1200
        colorDepth:24
        Browser inner width/height: 1024/646
        Browser outer width/height: 1024/768

(firefox-bin:12392): Gtk-CRITICAL **: gtk_widget_destroy: assertion `GTK_IS_WIDGET (widget)' failed

(firefox-bin:12392): Gtk-CRITICAL **: gtk_widget_destroy: assertion `GTK_IS_WIDGET (widget)' failed
Segmentation fault
Completed test tp: 
                Stopped Tue, 24 Jun 2008 17:52:04
========================================================================

to (with reserve):

========================================================================
Running test tp: 
                Started Wed, 25 Jun 2008 09:05:22
        Screen width/height:1600/1200
        colorDepth:24
        Browser inner width/height: 1024/646
        Browser outer width/height: 1024/768
Segmentation fault
Completed test tp: 
                Stopped Wed, 25 Jun 2008 09:08:39
========================================================================

I also tested the reserve patch with overcommit enabled (MALLOC_OPTIONS=O), and I saw the same failure mode as without the reserve patch.  This led me to suspect that there is a pre-existing crashing bug that simply behaves differently when more aggressive chunk caching is enabled.  The most likely explanation for this behavior change is that increased chunk caching makes it possible to do huge allocations (~1MiB+) without zeroing memory.  To support this hypothesis, I did a talos run with MALLOC_OPTION=Z, which causes all allocations to be pre-zeroed, and indeed the failure mode was the same as without the reserve patch.  In summary, I am guessing that somewhere in firefox, huge allocations are incorrectly asusmed to be zeroed.

Incidentally, during one of these talos runs, I saw an assertion failure for twinopen that looks like it may be due to a double free bug, but I have not eliminated the possibility that it is a bug in jemalloc.
Ok, but tinderboxes aren't crashing at all without this patch so it's hard to diagnose without the patch in.

Can you get a stack trace for these faults (both with/without zeroing)?
Does running under valgrind help catch the issue?
Since this is getting close to landing, I should point out that I have a rewriting program ready that rewrites all calls to malloc/calloc/realloc/memalign to the 'x' versions (as long as they occur outside of macros--inside macros I just find the location of the macro).
(Assignee)

Comment 53

9 years ago
(In reply to comment #50)
> Ok, but tinderboxes aren't crashing at all without this patch so it's hard to
> diagnose without the patch in.
> 
> Can you get a stack trace for these faults (both with/without zeroing)?

It turns out that all of the crashes I'd seen at the time I wrote comment #49 were due to the flash plugin.  (I didn't realize that talos would automatically use flash if the plugin were present.)  Intermittent crashes exist without flash, so since then I have spent considerable time looking into the problem, but it is proving very time-intensive to track down.

Due to the intermittency of the crashes, I suspected the problem might be a race condition, so I audited the memory reserve locking and did find one place where locking was missing.  However, adding a fix for that bug did not have any apparent impact on the crashes.

I am slowly gaining an understanding of the problem, but since talos only fails approximately one run in five, it could take some time yet.  What I know so far is that a run tree is being corrupted.  Unfortunately, by the time assertions catch this corruption, most of the evidence is long lost.  I've been incrementally adding assertions, but have yet to find the cause of the problem.
Can valgrind help at all, possibly in concert with modifications to page-separate more of the jemalloc data?
(Assignee)

Comment 55

9 years ago
(In reply to comment #54)
> Can valgrind help at all, possibly in concert with modifications to
> page-separate more of the jemalloc data?

It's potentially worth a try, but I've been trying to avoid things that change layout or timing much, since it's hard to make the crash happen.

Incidentally, I think I just got a lot closer; one of the assertions I added this morning appears to be easy to trip.
(Assignee)

Comment 56

9 years ago
Created attachment 327521 [details] [diff] [review]
Memory reserve (v3)

This patch fixes two races present in v2 of the patch:

1) chunk_dealloc() failed to acquire reserve_mtx when calling chunk_dealloc_reserve().

2) Dropping the arena mutex during chunk allocation has rather subtle, far-reaching implications with regard to race recovery.  v2 of the patch failed to take bin->runcur into account.  I added diagnostic output to tell when this race occurs, and verified that with this patch the browser continues to run without apparent fault afterwards.
Attachment #326014 - Attachment is obsolete: true
Attachment #327521 - Flags: review?(benjamin)
(Reporter)

Updated

9 years ago
Attachment #327521 - Flags: review?(benjamin) → review+
(Assignee)

Comment 57

9 years ago
changeset:   15611:fdc5cdb7c20b
parent:      15609:446be0510148
parent:      15610:16faa0c2a6d5
user:        Jason Evans <jasone@canonware.com>
date:        Tue Jul 01 16:00:48 2008 -0700
summary:     Merge commit for bug 427109.

changeset:   15609:446be0510148
user:        Jason Evans <jasone@canonware.com>
date:        Tue Jul 01 15:41:14 2008 -0700
summary:     Bug 427109: Add memory reserve and xmalloc() API, r=benjamin
Status: REOPENED → RESOLVED
Last Resolved: 9 years ago9 years ago
Resolution: --- → FIXED
Depends on: 447710

Comment 58

9 years ago
Jason,

The last line of reserve_cb_unregister() will not be reached.

(Assignee)

Comment 59

9 years ago
(In reply to comment #58)
> The last line of reserve_cb_unregister() will not be reached.

Ah, missing braces above.  Thanks!
You need to log in before you can comment on or make changes to this bug.