Closed
Bug 654926
Opened 14 years ago
Closed 14 years ago
Hang in nsDiskCacheStreamIO::CloseOutputStream while downloading WebM video from YouTube
Categories
(Core :: Networking: Cache, defect)
Core
Networking: Cache
Tracking
()
RESOLVED
FIXED
mozilla7
People
(Reporter: kinetik, Assigned: bjarne)
Details
Attachments
(1 file, 12 obsolete files)
13.71 KB,
patch
|
mounir
:
checkin+
|
Details | Diff | Splinter Review |
Running a fairly recent nightly (86248f7209b7), the browser hung while downloading a video (via the Save Video As context menu) from YouTube. I had maybe 30 other tabs open, but my last interaction with the browser was saving the video to disk.
Possible steps to reproduce:
- Find HTML 5 video on YouTube
- Open Page Info and find the video URL
- Open video URL in a new tab
- Hit the Escape key to stop the video redownloading
- Save Video As...
- Open the download manager to watch download progress
- Close YouTube tab
After this, I switched back to my editor and noticed the download manager reached 15.0/15.1MB and then stopped making progress. Attempting to switch focus to the browser revealed that it had hung.
The main thread had the following stack:
PR_Lock () at nsprpub/pr/src/pthreads/ptsynch.c:206
nsDiskCacheStreamIO::CloseOutputStream(nsDiskCacheOutputStream*) () at netwerk/cache/nsCacheService.h:304
nsDiskCacheOutputStream::Close() () at netwerk/cache/nsDiskCacheStreams.cpp:244
nsDiskCacheOutputStream::~nsDiskCacheOutputStream() () at netwerk/cache/nsDiskCacheStreams.cpp:233
nsDiskCacheOutputStream::~nsDiskCacheOutputStream() () at netwerk/cache/nsDiskCacheStreams.cpp:235
nsDiskCacheOutputStream::Release() () at netwerk/cache/nsDiskCacheStreams.cpp:219
nsCOMPtr_base::begin_assignment() () at xpcom/build/../glue/nsCOMPtr.h:479
nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() () at netwerk/cache/nsCacheEntryDescriptor.cpp:613
nsCacheEntryDescriptor::nsOutputStreamWrapper::Close() () at netwerk/cache/nsCacheEntryDescriptor.cpp:642
nsCacheEntryDescriptor::nsOutputStreamWrapper::~nsOutputStreamWrapper() () at netwerk/cache/nsCacheEntryDescriptor.h:155
nsCacheEntryDescriptor::nsOutputStreamWrapper::~nsOutputStreamWrapper() () at netwerk/cache/nsCacheEntryDescriptor.h:161
nsCacheEntryDescriptor::nsOutputStreamWrapper::Release() () at netwerk/cache/nsCacheEntryDescriptor.cpp:595
nsStreamListenerTee::OnStopRequest(nsIRequest*, nsISupports*, unsigned int) () at ../../../dist/include/nsCOMPtr.h:663
nsHttpChannel::OnStopRequest(nsIRequest*, nsISupports*, unsigned int) () at netwerk/protocol/http/nsHttpChannel.cpp:4116
nsInputStreamPump::OnStateStop() () at netwerk/base/src/nsInputStreamPump.cpp:578
nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream*) () at netwerk/base/src/nsInputStreamPump.cpp:403
nsInputStreamReadyEvent::Run() () at xpcom/io/nsStreamUtils.cpp:114
No others threads seemed to be doing much with the cache, and were all sleeping in poll() or pthread_cond_wait().
I notice that nsDiskCacheStreamIO::CloseOutputStream and nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit both take the nsCacheService lock, which is non-recursive, so it seems like this is the source of the deadlock.
Assignee | ||
Comment 1•14 years ago
|
||
Few days ago I saw an identical hang caused by a code-blunder in a different context. It was even reproducible, but after fixing my blunder I didn't see the hang again and forgot all about it...
I'll take a closer look by re-introducing my mistake. :)
Assignee | ||
Comment 2•14 years ago
|
||
I've reproduced and identified the problem. Will attach a patch shortly.
The issue is that nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit is called twice and on the second call |mOutput| is set. When LazyInit assigns a new value to |mOutput|, the old value is cleared and we end up in the dtor of nsDiskCacheOutputStream which tries to reclaim the lock. And boom... deadlock!
This can happen because after setting |mOutput| but before setting the flag |mInitialized| , LazyInit() can return (for two separate reasons). Hence we may end up with |mOutput| set and not |mInitialized|, and LazyInit could be called twice. My blunder (mentioned in comment #1) caused OnDataSizeChange to return an error and LazyInit to return early.
Assignee: nobody → bjarne
Status: NEW → ASSIGNED
Assignee | ||
Comment 3•14 years ago
|
||
I'm not able to reproduce my issue with this patch applied. I'd be interested to known if reporter can reproduce *his* issue with it applied.
Pushed to tryserver, requesting review.
Attachment #531006 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 4•14 years ago
|
||
Comment on attachment 531006 [details] [diff] [review]
Proposed fix
Ahh - it has a leak! I'll fix and re-submit.
Attachment #531006 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 5•14 years ago
|
||
Passes simple local testing. Pushed to tryserver...
Attachment #531006 -
Attachment is obsolete: true
Attachment #531037 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 6•14 years ago
|
||
Tryserver ok. Awaiting review.
Comment 7•14 years ago
|
||
Comment on attachment 531037 [details] [diff] [review]
Proposed fix (wo/ leaks this time)
> +#include "nsAutoPtr.h"
...
> + nsRefPtr<nsIOutputStream> stream;
Use nsCOMPtr.
Otherwise looks good. And please add a test.
Attachment #531037 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 8•14 years ago
|
||
Comment on attachment 531037 [details] [diff] [review]
Proposed fix (wo/ leaks this time)
I'd love to add a test here and I've spent quite some time trying to figure out how. The issue blocking me is that in order to test this patch, nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() must return after creating |mOutput| but before setting |mInitialized|. I.e. in this interval either nsCacheEntry::CacheDevice() must return null or nsCacheDevice::OnDataSizeChange() must return an error. I see no way to reliably achieve this without replacing the entire cache-service in the test. This is certainly doable but requires quite a lot of coding so I'm not sure it's worth the time and trouble.
If you have an idea of how to achieve this in a different way, or a pointer/hint to a test or similar which do replace the cache-service, I'd be happy to start from there.
Attachment #531037 -
Flags: feedback?(michal.novotny)
Comment 9•14 years ago
|
||
Entry is first written with flag STORE_ON_DISK_AS_FILE, so it isn't doomed. And STORE_ON_DISK is used when appending the entry, which causes nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() to fail.
Updated•14 years ago
|
Attachment #531037 -
Flags: feedback?(michal.novotny)
Assignee | ||
Comment 10•14 years ago
|
||
Brilliant! I'll make a patch out of it, attach and r+ it (it's yours).
Assignee | ||
Comment 11•14 years ago
|
||
I put the ostream->close-call inside a try-catch because it will fail if LazyInit fails. Test has an r+ from me...
Attachment #532403 -
Attachment is obsolete: true
Attachment #532606 -
Flags: review+
Assignee | ||
Comment 12•14 years ago
|
||
Comment on attachment 532606 [details] [diff] [review]
Slightly modified test
Requesting review for the change/modified behaviour of Close()...
Attachment #532606 -
Flags: review?(michal.novotny)
Assignee | ||
Comment 13•14 years ago
|
||
In this version nsDiskCacheStreamIO::Flush() closes the fd if entry is doomed.
Without this change the test for this bug asserts when shutting down the cacheservice, and this can happen in real-life as well. IMO it makes sense to close the fd, but exactly *where* to do it could probably be discussed. My proposal is in Flush().
Pushed to tryserver.
Attachment #531037 -
Attachment is obsolete: true
Attachment #532607 -
Flags: review?(michal.novotny)
Comment 14•14 years ago
|
||
Comment on attachment 532606 [details] [diff] [review]
Slightly modified test
> + do_check_true(false); // should fail
you can use do_throw()
> + do_check_true(true); // should succeed
you can omit this check
Attachment #532606 -
Flags: review?(michal.novotny) → review+
Comment 15•14 years ago
|
||
Comment 16•14 years ago
|
||
Comment 17•14 years ago
|
||
with this patch all three tests pass successfully
test_bug654926.js
test_bug654926_test_seek.js
test_bug654926_doom_and_read.js
Attachment #532607 -
Attachment is obsolete: true
Attachment #532607 -
Flags: review?(michal.novotny)
Attachment #533020 -
Flags: review?(bjarne)
Assignee | ||
Comment 18•14 years ago
|
||
You've been busy... :) I agree with your change to Flush() and I've consolidated your new tests and the patch and pushed everything to tryserver.
Assignee | ||
Comment 19•14 years ago
|
||
Oh - btw: In the previous tryserver-run, the test for this particular bug on Win/Opt failed (i.e. both append and close completed without errors). Let's see if the adjustment you made changes this...
Assignee | ||
Comment 20•14 years ago
|
||
Tryserver is all green.
Assignee | ||
Comment 21•14 years ago
|
||
Combined tests.
I'm uncertain about the process here but I'll r+ this as well as Michals modification to the code-patch, then request check-in.
Attachment #532606 -
Attachment is obsolete: true
Attachment #533016 -
Attachment is obsolete: true
Attachment #533018 -
Attachment is obsolete: true
Attachment #533555 -
Flags: review+
Assignee | ||
Comment 22•14 years ago
|
||
Comment on attachment 533020 [details] [diff] [review]
slightly modified patch
Review of attachment 533020 [details] [diff] [review]:
-----------------------------------------------------------------
Modification in Flush() seems fine. r=me
Together with comment #7 I consider this ready to go.
Attachment #533020 -
Flags: review?(bjarne) → review+
Assignee | ||
Comment 23•14 years ago
|
||
Requesting check-in for attachment #533020 [details] [diff] [review] (code change) and attachment #533555 [details] [diff] [review] (tests).
Keywords: checkin-needed
Updated•14 years ago
|
Comment 24•14 years ago
|
||
Pushed:
http://hg.mozilla.org/mozilla-central/rev/33e975a5d150
http://hg.mozilla.org/mozilla-central/rev/b63a7d9c1735
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Whiteboard: [fixed in cedar]
Target Milestone: --- → mozilla6
Comment 25•14 years ago
|
||
Backed out because of a perma orange on Win opt xpcshell:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1305819552.1305821909.7630.gz&fulltext=1
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla6 → ---
Assignee | ||
Comment 26•14 years ago
|
||
Mmmm... same problem as mentioned in comment #19. Not seen on try-run, ref comment #20. Is it really perma? Any chance to get a nspr-log from it?
Comment 27•14 years ago
|
||
(In reply to comment #26)
> Mmmm... same problem as mentioned in comment #19. Not seen on try-run, ref
> comment #20. Is it really perma? Any chance to get a nspr-log from it?
Happens 2 times out of two on Cedar and I saw it happens one time out of one on mozilla-central. It's likely perma. Maybe intermittent but in that case, rarely passing.
How can I help with nspr-log?
Assignee | ||
Comment 28•14 years ago
|
||
Yeah - I rebuilt these tests on tryserver and it came up with warnings. Must have passed first time on pure coincidence... :(
Nspr-log: Is there a way to rebuild/re-run these tests, setting the env-variables as described in https://developer.mozilla.org/en/HTTP_Logging#Logging_HTTP_Activity ? If there is, plz add ",cache:5" to the end of NSPR_LOG_MODULES. The result is a file pointed to by NSPR_LOG_FILE which would be useful to try to understand what's going on.
Comment 29•14 years ago
|
||
See result of a modified test at http://tbpl.mozilla.org/?tree=Try&rev=60487b56b6cc
I'd guess that the problem is in reusing an entry that is being deactivated
http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheDevice.cpp#510
The reused entry has policy STORE_ON_DISK_AS_FILE instead of requested STORE_ON_DISK so LazyInit() doesn't fail or it isn't called at all because the output is still open. Syncing with cache IO thread (like in test in bug 651100) would fix this. But maybe we should rather change the code so that we won't reuse the entry if the policy differs.
Assignee | ||
Comment 30•14 years ago
|
||
Nice catch! :) I have not been able to reproduce the orange, nor has Honza on opt/Win7, so this is good insight.
(In reply to comment #29)
> Syncing with cache IO thread (like in test in bug
> 651100) would fix this.
I'd prefer this. IMO the orange we see here is a special case/result of another bug (see below). We should either make this bug depend on the other one, or implement your suggested workaround in the test to get this bug out of the way. Input?
> But maybe we should rather change the code so that
> we won't reuse the entry if the policy differs.
This may actually be a more general issue: Afaics when using an active entry we never ensure that the requested storage-policy is compatible with existing storage for the entry. The orange in this bug is IMO a special case because even if the entry we find is in the pending state between "active" and "deactivated" we just re-use the object like we would have done if it was "active".
So (theoretically) if we have an active entry stored in memory we may find and use it for a request with storage-policy STORE_ON_DISK_AS_FILE, for example. (Or vice-versa, which would be a security-issue.)
Another thing which puzzles me is why we never search the disk for a request with policy STORE_ON_DISK_AS_FILE. It means that even if we have the resource cached it will always be re-loaded from server, and I'm not sure how the disk-cache deals with book-keeping in this case.
I'll try to come up with some testcases for the above scenarios - your input is appreciated also.
Assignee | ||
Comment 31•14 years ago
|
||
Hmm... apparently nsHttpHandler::GetCacheSession() takes some steps which handle at least the obvious issue described above. This needs some more investigation, and it is a question whether we should leave such things to the client...
Comment 32•14 years ago
|
||
(In reply to comment #30)
> Another thing which puzzles me is why we never search the disk for a request
> with policy STORE_ON_DISK_AS_FILE. It means that even if we have the
> resource cached it will always be re-loaded from server, and I'm not sure
> how the disk-cache deals with book-keeping in this case.
I guess that this was an intention, otherwise we'd need to check if the data is in a separate file. But you're right. The cache code doesn't handle this correctly and opening an already existing entry results in duplicate entries.
Assignee | ||
Comment 33•14 years ago
|
||
(In reply to comment #32)
> I guess that this was an intention, otherwise we'd need to check if the data
> is in a separate file.
You mean we'd have to handle the situation where a cached entry is NOT in a separate file and the new request has policy STORE_ON_DISK_AS_FILE? (Because if it already is in a separate file there is no problem, and checking it is trivial, no?)
Comment 34•14 years ago
|
||
(In reply to comment #33)
> You mean we'd have to handle the situation where a cached entry is NOT in a
> separate file and the new request has policy STORE_ON_DISK_AS_FILE? (Because
> if it already is in a separate file there is no problem, and checking it is
> trivial, no?)
Yes, exactly.
Assignee | ||
Comment 35•14 years ago
|
||
(In reply to comment #30)
> So (theoretically) if we have an active entry stored in memory we may find
> and use it for a request with storage-policy STORE_ON_DISK_AS_FILE, for
> example. (Or vice-versa, which would be a security-issue.)
This can not happen with the current http-implementation because nsHttpHandler::GetCacheSession() creates sessions with different names for requests requiring memory- or offline-storage. The session-name becomes the prefix of the cache-key in nsCacheService::CreateRequest(), hence a request which have to go to memory (or offline) cannot be mistaken for a request which can go to disk (or can go anywhere).
I'm not 100% confident about the implications of this scheme. One implication for sure is that even if we have a resource in cache we may end up reloading it because the necko-client requested a different storage-policy. Not sure if this is a problem in real life though.
The good news is that the "special case" from comment #30 is in fact the only case where we may get a policy-clash so I'll focus on this for now.
Assignee | ||
Comment 36•14 years ago
|
||
Right - after some thinking I've come to the following:
- this bug is about avoiding deadlock by ensuring that LazyInit() is only called once, hence the tests for this bug should test that
- the problem found in comment #29 and comment #32 and clarified in comment #34 is separate and needs its own fix and test
So, I suggest to patch Michals test for this bug as suggested at the end of comment #29 in order to avoid the problem causing the orange. I'll open a new bug for the issue clarified in comment #34.
Assignee | ||
Comment 37•14 years ago
|
||
The oranges shown in comment #25 and comment #29 only applies to one of the tests so I've changed only that one to sync with cache-operations in order to avoid the subtle issue identified in comment #29. I'm not 100% sure why the two other tests don't fail also but I have not analyzed it very deeply either.
Requesting review plus clarification about the two other tests. Passes local testing, pushed to tryserver together with attachment #533020 [details] [diff] [review].
Attachment #533555 -
Attachment is obsolete: true
Attachment #536552 -
Flags: review?(michal.novotny)
Attachment #536552 -
Flags: feedback?(michal.novotny)
Assignee | ||
Comment 38•14 years ago
|
||
(In reply to comment #36)
>I'll open a
> new bug for the issue clarified in comment #34.
Bug #661115 opened.
Assignee | ||
Comment 39•14 years ago
|
||
(In reply to comment #37)
> pushed to tryserver together with attachment #533020 [details] [diff] [review]
> [review].
Tryserver green. Awaiting review and/or feedback...
Comment 40•14 years ago
|
||
Comment on attachment 536552 [details] [diff] [review]
Version of test where we sync with cache-io
(In reply to comment #37)
> I'm not 100% sure why the two other tests don't fail also but I have not
> analyzed it very deeply either.
It's because the other tests don't write beyond the 5MB limit, so it doesn't matter whether the entry is reopened with policy STORE_ON_DISK_AS_FILE or STORE_ON_DISK.
Attachment #536552 -
Flags: review?(michal.novotny)
Attachment #536552 -
Flags: review+
Attachment #536552 -
Flags: feedback?(michal.novotny)
Assignee | ||
Comment 41•14 years ago
|
||
Requesting check-in for attachment #536552 [details] [diff] [review] (tests) and attachment #533020 [details] [diff] [review] (code-change)
Keywords: checkin-needed
Comment 42•14 years ago
|
||
The "slightly modified patch" doesn't have an author. (Michal attached it, but this bug is assigned to Bjarne?!) The "Version of test where we sync with cache-io" says "From: r=michal.novotny@gmail.com", which doesn't seem to make any sense. Please fix this, then add checkin-needed back.
Keywords: checkin-needed
Assignee | ||
Comment 43•14 years ago
|
||
Attachment #533020 -
Attachment is obsolete: true
Assignee | ||
Comment 44•14 years ago
|
||
Attachment #536552 -
Attachment is obsolete: true
Assignee | ||
Comment 45•14 years ago
|
||
Requesting check-in for the two latest attachments
Keywords: checkin-needed
Comment 46•14 years ago
|
||
Apparently the first patch tries to modify a file added by the second patch, but there are conflicts. Can you just attach the code changes and tests in one patch?
Keywords: checkin-needed
Assignee | ||
Comment 47•14 years ago
|
||
Combined into one patch as requested. I have not sent this to tryserver since it is a simple qfold and "I see no reasons for it to fail (tm)"... :)
Attachment #537982 -
Attachment is obsolete: true
Attachment #537983 -
Attachment is obsolete: true
Assignee | ||
Comment 49•14 years ago
|
||
Comment on attachment 539463 [details] [diff] [review]
Combined and unbitrotted
Requesting check-in using this flag also
Attachment #539463 -
Flags: checkin?
Comment 50•14 years ago
|
||
Comment on attachment 539463 [details] [diff] [review]
Combined and unbitrotted
Pushed to mozilla-inbound.
Attachment #539463 -
Flags: checkin? → checkin+
Updated•14 years ago
|
Keywords: checkin-needed
Whiteboard: [inbound]
Comment 51•14 years ago
|
||
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Whiteboard: [inbound]
Target Milestone: --- → mozilla7
You need to log in
before you can comment on or make changes to this bug.
Description
•