Closed Bug 573192 Opened 16 years ago Closed 11 years ago

pt_SetMethods seems to have a tendency to leak

Categories

(NSPR :: NSPR, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED
4.10.8

People

(Reporter: jruderman, Assigned: froydnj)

References

(Blocks 4 open bugs)

Details

(Keywords: memory-leak, valgrind, Whiteboard: [MemShrink:P2])

Attachments

(3 files, 2 obsolete files)

I'm using a 64-bit Firefox built from mozilla-central on Ubuntu 10.04. Valgrind reports a bunch of leaks of what I think are PRFileDesc objects. I looked at the CrashReporter leak (the first one in the Valgrind log) and saw that it calls PR_Close(fd), which I think is all it's supposed to do to clean up.
Jesse: thanks a lot for the bug fix. NSPR has a freelist for PRFileDesc objects. This freelist is destroyed by PR_Cleanup, but Mozilla cannot call PR_Cleanup because it has C++ static destructors that call NSPR. It doesn't seem possible to disable NSPR's PRFileDesc freelist. I found two environment varibales NSPR_FD_CACHE_SIZE_LOW and NSPR_FD_CACHE_SIZE_HIGH, but they seem to merely change the behavior of the PRFileDesc freelist, without fully disabling it. You can try adding a _PR_CleanupIO(); call to _PR_Fini in mozilla/nsprpub/pr/src/pthreads/ptthread.c: http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/nsprpub/pr/src/pthreads/ptthread.c&rev=3.88&mark=1010,1028#1010 If that doesn't work, we'll need to look into adding an environment variable that disables NSPR's PRFileDesc freelist completely.
Status: NEW → ASSIGNED
(In reply to comment #1) > Jesse: thanks a lot for the bug fix. I meant "thanks a lot for the bug report." Sorry about the typo.
We are seeing these leaks come up in LSan (Leak Sanitizer; similar ideas to valgrind, but lower overhead at the cost of slightly less precision) runs; we've had to write fairly broad suppressions to silence complaints about the leaks: http://mxr.mozilla.org/mozilla-central/source/build/sanitizers/lsan_suppressions.txt#75 http://mxr.mozilla.org/mozilla-central/source/build/sanitizers/lsan_suppressions.txt#59 http://mxr.mozilla.org/mozilla-central/source/build/sanitizers/lsan_suppressions.txt#52 We're also seeing over 1MB allocated to file descriptors in some tests: https://bugzilla.mozilla.org/show_bug.cgi?id=1039914#c6 This is an active leak while the browser is running, so just adding _PR_CleanupIO() calls doesn't help. According to: http://mxr.mozilla.org/nspr/source/pr/src/io/prfdcach.c#234 we ought to be disabling the cache when limit_high is 0, but that's clearly not happening: http://mxr.mozilla.org/nspr/source/pr/src/io/prfdcach.c#57 http://mxr.mozilla.org/nspr/source/pr/src/io/prfdcach.c#131 Can we just change those blocks of code to not cache at all, rather than adding yet another environment variable that users of Firefox won't be able to set anyway?
Blocks: 1039914
Flags: needinfo?(wtc)
OS: Linux → All
Hardware: x86_64 → All
Whiteboard: [MemShrink]
Attached patch nspr-filedesc.patch (obsolete) — Splinter Review
Here's a patch which completely blows away the stack-based cache that was being used and arguably makes the code closer to what was intended (e.g. the comments about .limit_high being 0 disabling the cache). Modern malloc implementations should be fast enough to deal with repeated reallocations nowadays. There's no indication that this caching scheme helps on any workload, and plenty of evidence that leak-checking tools are quite confused by this cache.
Attachment #8464772 - Flags: review?(wtc)
Comment on attachment 8464772 [details] [diff] [review] nspr-filedesc.patch Review of attachment 8464772 [details] [diff] [review]: ----------------------------------------------------------------- This is a nice cleanup. Sorry for the drive-by, I was interested enough in this to take a look at the patch. I have a few comments on possible improvements and possibly some preexisting bugs that could be addressed as well. ::: pr/src/io/prfdcach.c @@ +35,5 @@ > PRIntn limit_low, limit_high; > } _PR_Fd_Cache; > > static _PR_Fd_Cache _pr_fd_cache; > static PRFileDesc **stack2fd = &(((PRFileDesc*)NULL)->higher); min: This can go as well. @@ +120,5 @@ > { > PR_ASSERT(PR_NSPR_IO_LAYER == fd->identity); > fd->methods = &_pr_faulty_methods; > fd->identity = PR_INVALID_IO_LAYER; > fd->secret->state = _PR_FILEDESC_FREED; min: This part is only needed in the caching branch. @@ +127,1 @@ > { --min: We _could_ get rid of this as well, the test below covers both cases. @@ +163,5 @@ > if (low > high) low = high; /* sanity check the params */ > > PR_Lock(_pr_fd_cache.ml); > + _pr_fd_cache.limit_high = high; > + _pr_fd_cache.limit_low = low; med: We still should do two things here: #1 - Going from caching to not caching: delete all the cached entries, update the count to 0. #2 - Going from more caching to less caching: delete all the cached entries above the new threshold. Update the count to the new threshold. This wasn't being done before, but it's the correct thing to do. maj: (Not your bug) It looks like |count| isn't actually initialized in |_PR_InitFdCache|.
Attachment #8464772 - Flags: feedback+
(In reply to Eric Rahm [:erahm] from comment #5) > maj: (Not your bug) It looks like |count| isn't actually initialized > in |_PR_InitFdCache|. Nix that, I see now that it's a file level static.
Whiteboard: [MemShrink] → [MemShrink:P2]
Comment on attachment 8464772 [details] [diff] [review] nspr-filedesc.patch Review of attachment 8464772 [details] [diff] [review]: ----------------------------------------------------------------- Ted might have some useful feedback here.
Attachment #8464772 - Flags: review?(ted)
(In reply to Eric Rahm [:erahm] from comment #5) > @@ +163,5 @@ > > if (low > high) low = high; /* sanity check the params */ > > > > PR_Lock(_pr_fd_cache.ml); > > + _pr_fd_cache.limit_high = high; > > + _pr_fd_cache.limit_low = low; > > med: We still should do two things here: > > #1 - Going from caching to not caching: delete all the cached entries, > update the count to 0. > > #2 - Going from more caching to less caching: delete all the cached > entries above the new threshold. Update the count to the new > threshold. This wasn't being done before, but it's the correct > thing to do. I support doing these, but I think these requests require a separate bug. I didn't want to complicate the patch review with bugfixes for separate things.
I'm sorry I forgot to review the patch yesterday. I will review it tomorrow.
(In reply to Nathan Froyd (:froydnj) from comment #3) > > This is an active leak while the browser is running, so just adding > _PR_CleanupIO() calls doesn't help. Can you give me an LSan report of the active leak? I inspected the relevant code but didn't see anything obviously wrong. I did see that PR_Close may leak the fd when it returns PR_FAILURE. PR_Close may return PR_FAILURE for two reasons: 1. The thread has been interrupted by a PR_Interrupt call. I will attach a patch to fix this, but in mozilla-central I only found a PR_Interrupt call in xpcom/threads/BackgroundHangMonitor.cpp, and the hang monitor thread doesn't seem to call PR_Close. So this should not affect Firefox. 2. The close() system call fails. But this can only happen if the application has a bug. So I'd like to understand this active leak before I approve your patch to remove the use of PRStack in prfdcach.c. Thanks.
Flags: needinfo?(wtc)
Since PR_Close isn't a blocking operation, it doesn't really need to check the thread interrupt status. Also, most applications don't know what to do if PR_Close fails, so a PR_Close failure will result in a leak of the PRFileDesc.
Attachment #8469756 - Flags: review?(nfroyd)
Comment on attachment 452429 [details] parts of a valgrind log Nathan: this old valgrind log shows the allocation sites but doesn't show where the memory blocks are leaked. Can LSan show where the memory blocks are (actively) leaked?
(In reply to Wan-Teh Chang from comment #10) > Can you give me an LSan report of the active leak? I inspected the > relevant code but didn't see anything obviously wrong. There are leak stacks in bug 1028456, bug 1021854, and bug 1021302. > Can LSan show where the memory blocks are (actively) leaked? Unfortunately, LSan only shows the allocation stacks. I do have an experimental tool that does conservative heap scanning at shutdown to give more information about what is holding things alive, if that's what you mean. I could try that.
"active leak" is probably a bad term. See for instance bug 1039914, especially bug 1039914 comment 20, where there's ~1.5MB of unfreed memory allocated by _PR_Getfd. That memory will get reused if we open more file descriptors, but it'd be better to have the memory allocator managing that memory for all requests, rather than _PR_Getfd managing it for its own purposes.
Comment on attachment 8469756 [details] [diff] [review] PR_Close should not fail if the calling thread has been interrupted Review of attachment 8469756 [details] [diff] [review]: ----------------------------------------------------------------- Makes sense to me.
Attachment #8469756 - Flags: review?(nfroyd) → review+
Andrew, Nathan: thanks for the info. So, are these leaks unreachable memory, or are they the memory still reachable from the NSPR fd cache? If the latter, then I think the problem is that NSPR's fd cache, when implemented using a PRStack (an atomic singly-linked list), has an unbounded size. Do you agree?
(In reply to Nathan Froyd (:froydnj) from comment #14) > ..., but it'd be better to have the memory allocator managing that > memory for all requests, rather than _PR_Getfd managing it for its own > purposes. Free lists for commonly used objects were a common optimization technique when this code was written. But that was before the era of jemalloc and tcmalloc. Is the current recommendation to just rely on the memory allocator?
> Is the current recommendation to just rely on the memory allocator? Yes. I've removed from Firefox several such cases of heap blocks being cached and recycled over the past few years. Bug 814312 is one case, and the first part of https://blog.mozilla.org/nnethercote/2012/01/18/memshrink-progress-week-31/ describes another, and I know there were at least two more. In every case it was a simplification, a memory win, and didn't hurt speed at all. In general, custom allocators rarely beat well-written general-purpose allocators. There's even academic research about it (http://emeryblogger.com/2012/10/28/most-influential-oopsla2012/). One common exception to this rule is that pool/arena allocators are often a win, but this case isn't one of those.
Comment on attachment 8464772 [details] [diff] [review] nspr-filedesc.patch It looks like Wan-Teh is already invested here, so it doesn't seem worthwhile for me to try to come up to speed. Feel free to re-request if you really need me.
Attachment #8464772 - Flags: review?(ted)
(In reply to Nicholas Nethercote [:njn] from comment #18) > > Is the current recommendation to just rely on the memory allocator? > > Yes. I've removed from Firefox several such cases of heap blocks being > cached and recycled over the past few years. Nathan, Nicholas: in that case, should we just remove the _PR_Fd_Cache and the _PR_Getfd and _PR_Putfd functions completely?
(In reply to Wan-Teh Chang from comment #20) > (In reply to Nicholas Nethercote [:njn] from comment #18) > > > Is the current recommendation to just rely on the memory allocator? > > > > Yes. I've removed from Firefox several such cases of heap blocks being > > cached and recycled over the past few years. > > Nathan, Nicholas: in that case, should we just remove the _PR_Fd_Cache > and the _PR_Getfd and _PR_Putfd functions completely? I think removing _PR_Fd_Cache makes sense. Removing the other two functions seems like a bad idea, since they perform functions that would be tedious to reproduce at each callsite: basic clearing for _PR_Getfd and poisoning and secret-freeing for _PR_Putfd. Renaming the functions to something like _PR_AllocFD and _PR_FreeFD make more sense to me.
Nathan's suggestion sounds good to me.
Wan-Teh, do you have an opinion on the patch?
Flags: needinfo?(wtc)
(In reply to Nathan Froyd (:froydnj) from comment #21) > > I think removing _PR_Fd_Cache makes sense. Removing [the _PR_Getfd and _PR_Putfd functions] > seems like a bad idea, since they perform functions that would be tedious to > reproduce at each callsite: basic clearing for _PR_Getfd and poisoning and > secret-freeing for _PR_Putfd. Poisoning will be done by the memory allocator. But I agree with the other points. > Renaming the functions to something like _PR_AllocFD and _PR_FreeFD make more > sense to me. Yes. They could be confused with the exported functions PR_AllocFileDesc and PR_FreeFileDesc though. I don't think we need _PR_FreeFD. We may not need _PR_AllocFD -- we may be able to just change the callers of _PR_Getfd to call PR_AllocFileDesc instead. I haven't looked into this though.
Flags: needinfo?(wtc)
Nathan, Nicholas: Sorry about the late reply. Ideally, I'd prefer that we make the current lock-free _PR_Fd_Cache have a bounded size. We should be able to do this by incrementing and decrementing an element count variable atomically. This way, applications that can't use an alternative memory allocator will continue to benefit from the _PR_Fd_Cache. If you don't have time to do that, let's just get rid of _PR_Fd_Cache. I won't be able to review your patches in the next two weeks. Please have one of you review the other person's patches. Then, Kai Engert (:kaie) or Ted can check in the patches for you. Thanks.
Attached patch nspr-filedesc.patch (obsolete) — Splinter Review
This is a new patch, with the deletion of stack2fd as suggested by Eric. I've chosen to leave Eric's other suggestions regarding code simplification undone in this bug, because I think they make the patch easier to review. Eric's comment about _pr_fd_cache.count not being initialized in _PR_InitFdCache is, I think, superfluous, because the whole of _pr_fd_cache is initialized to zero as static data. So I think this is ready for review by Nicholas, per Wan-Teh's comment 25.
Attachment #8464772 - Attachment is obsolete: true
Attachment #8464772 - Flags: review?(wtc)
Attachment #8485959 - Flags: review?(n.nethercote)
Comment on attachment 8485959 [details] [diff] [review] nspr-filedesc.patch Review of attachment 8485959 [details] [diff] [review]: ----------------------------------------------------------------- I'm giving f+ because this is the first time I've looked at this code. I suggest r?ing erahm once you've addressed my comments, because he has looked more closely at this code than I have. ::: pr/src/io/prfdcach.c @@ +126,2 @@ > { > + if (_pr_fd_cache.count < _pr_fd_cache.limit_high) Shouldn't this be |<=| instead of |<|? @@ +146,5 @@ > } > } > + > + PR_Free(fd->secret); > + PR_Free(fd); These PR_Free() calls now execute if |0 == _pr_fd_cache.limit_high| is true. They didn't before. Is this intentional? If not, given this issue and the |<=| issue above, I'd suggest changing the control flow as little as possible. Changing the |0 ==| to |0 !=| is fine, because that removes an otherwise-empty then-branch, but I'd undo the other control flow changes.
Attachment #8485959 - Flags: review?(n.nethercote) → feedback+
(In reply to Nicholas Nethercote [:njn] from comment #27) > Comment on attachment 8485959 [details] [diff] [review] > nspr-filedesc.patch > > Review of attachment 8485959 [details] [diff] [review]: > ----------------------------------------------------------------- > > I'm giving f+ because this is the first time I've looked at this code. I > suggest r?ing erahm once you've addressed my comments, because he has looked > more closely at this code than I have. > > ::: pr/src/io/prfdcach.c > @@ +126,2 @@ > > { > > + if (_pr_fd_cache.count < _pr_fd_cache.limit_high) > > Shouldn't this be |<=| instead of |<|? Let's say _pr_fd_cache.limit_high is 100. * In the _pr_fd_cache.count = 99 case, we do want to add the entry to the cache; * In the _pr_fd_cache.count = 100 case, we do not want to add the entry to the cache. So I think |<| is correct and the previous code had an off-by-one error in it: it would add an entry to the cache when the cache already contained as many entries as limit_high specified. I could see the case for being conservative and making it |<=| for compatibility with the existing code. > @@ +146,5 @@ > > } > > } > > + > > + PR_Free(fd->secret); > > + PR_Free(fd); > > These PR_Free() calls now execute if |0 == _pr_fd_cache.limit_high| is true. > They didn't before. Is this intentional? Yes, this is intentional. Elsewhere in the code, |_pr_fd_cache.limit_high == 0| is cited as the condition for disabling the cache. And when the cache is disabled, we always want to free the entries. > If not, given this issue and the |<=| issue above, I'd suggest changing the > control flow as little as possible. Changing the |0 ==| to |0 !=| is fine, > because that removes an otherwise-empty then-branch, but I'd undo the other > control flow changes. I am confused as to how inverting the |0 ==| to |0 !=| eliminates an otherwise-empty then-case if you want to retain the rest of the control flow in that function. Because you still need to do something in the |0 == _pr_fd_cache.limit_high| case, whether you have control flow like: if (0 != _pr_fd_cache.limit_high) { if (_pr_fd_cache.count > _pr_fd_cache.limit_high) { /* free the fd */ } else { /* place fd in cache */ } return; } /* free the fd */ or whether it's written as in the patch, combining the two /* free the fd */ cases. If you want maximal retaining of the control flow, I suppose you could write: if (0 == _pr_fd_cache.limit_high) { /* free the fd */ } else { if (_pr_fd_cache.count > _pr_fd_cache.limit_high) { /* free the fd */ } else { /* place fd in cache */ } } Can you help clarify things for me here?
Flags: needinfo?(n.nethercote)
If the control flow changes are deliberate, that's fine. I just wanted to make sure they were deliberate.
Flags: needinfo?(n.nethercote)
Comment on attachment 8485959 [details] [diff] [review] nspr-filedesc.patch Review of attachment 8485959 [details] [diff] [review]: ----------------------------------------------------------------- OK, asking for Eric's review then.
Attachment #8485959 - Flags: review?(erahm)
Comment on attachment 8485959 [details] [diff] [review] nspr-filedesc.patch Review of attachment 8485959 [details] [diff] [review]: ----------------------------------------------------------------- lgtm. Can we also make FD caching opt-in for debug builds [1] as well? Can you also file follow ups for the issues mentioned in comment 8? [1] http://hg.mozilla.org/mozilla-central/annotate/6b8da5940f74/nsprpub/pr/src/io/prfdcach.c#l239
Attachment #8485959 - Flags: review?(erahm) → review+
Updating patch with r=erahm. Ted, Wan-Teh said you can you check this in; can you do that, please?
Attachment #8485959 - Attachment is obsolete: true
Attachment #8489559 - Flags: review+
Flags: needinfo?(ted)
Blocks: 1067536
Blocks: 1067537
I filed bug 1067536 for the opt-in DEBUG setting and bug 1067537 for addressing the issues Eric pointed out in comment 8.
http://hg.mozilla.org/projects/nspr/rev/84acbf6789fb Sorry for the delay. Do you need a new NSPR tag to sync this to mozilla-central?
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #34) > http://hg.mozilla.org/projects/nspr/rev/84acbf6789fb > > Sorry for the delay. Do you need a new NSPR tag to sync this to > mozilla-central? Yes please. Can you handle the tagging and the syncing to central?
Flags: needinfo?(ted)
I tagged NSPR_4_10_8_BETA1, you can push it yourself using python client.py update_nspr NSPR_4_10_8_BETA1
Flags: needinfo?(ted)
Assignee: wtc → nfroyd
Priority: -- → P1
Target Milestone: --- → 4.10.8
Comment on attachment 8489559 [details] [diff] [review] nspr-filedesc.patch Review of attachment 8489559 [details] [diff] [review]: ----------------------------------------------------------------- r=wtc. I have a question about a minor issue. ::: pr/src/io/prfdcach.c @@ +126,2 @@ > { > + if (_pr_fd_cache.count < _pr_fd_cache.limit_high) To match the original code, this should be <= instead of < , right? The original code tests _pr_fd_cache.count > _pr_fd_cache.limit_high for the opposite condition. Is the original code wrong?
Attachment #8489559 - Flags: review+
[setting needinfo for comment 37]
Flags: needinfo?(nfroyd)
(In reply to Wan-Teh Chang from comment #37) > Comment on attachment 8489559 [details] [diff] [review] > nspr-filedesc.patch > > Review of attachment 8489559 [details] [diff] [review]: > ----------------------------------------------------------------- > > r=wtc. I have a question about a minor issue. > > ::: pr/src/io/prfdcach.c > @@ +126,2 @@ > > { > > + if (_pr_fd_cache.count < _pr_fd_cache.limit_high) > > To match the original code, this should be <= instead of < , right? Yes. > The original code tests _pr_fd_cache.count > _pr_fd_cache.limit_high > for the opposite condition. Is the original code wrong? I think so. Nicholas asked the same question in his review of the patch. I responded to that question in the first part of comment 28; assuming I understood what |limit_high| represented, I think the original code had an off-by-one error, because we could wind up adding an entry to the cache when count == limit_high. Which would mean that we could have more than limit_high entries in the cache, which seems wrong (although admittedly a minor mistake).
Flags: needinfo?(nfroyd)
This issue is in ASSIGNED state, but I can see a commit already against the 4.10.8 version and 4.10.8 is already released. Is this issue resolved or not?
This bug is fixed in NSPR 4.10.8, which was released on Jan 28, 2015.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: