Closed Bug 654926 Opened 9 years ago Closed 9 years ago
Hang in ns
Disk Cache Stream IO::Close Output Stream while downloading Web M video from You Tube
Running a fairly recent nightly (86248f7209b7), the browser hung while downloading a video (via the Save Video As context menu) from YouTube. I had maybe 30 other tabs open, but my last interaction with the browser was saving the video to disk. Possible steps to reproduce: - Find HTML 5 video on YouTube - Open Page Info and find the video URL - Open video URL in a new tab - Hit the Escape key to stop the video redownloading - Save Video As... - Open the download manager to watch download progress - Close YouTube tab After this, I switched back to my editor and noticed the download manager reached 15.0/15.1MB and then stopped making progress. Attempting to switch focus to the browser revealed that it had hung. The main thread had the following stack: PR_Lock () at nsprpub/pr/src/pthreads/ptsynch.c:206 nsDiskCacheStreamIO::CloseOutputStream(nsDiskCacheOutputStream*) () at netwerk/cache/nsCacheService.h:304 nsDiskCacheOutputStream::Close() () at netwerk/cache/nsDiskCacheStreams.cpp:244 nsDiskCacheOutputStream::~nsDiskCacheOutputStream() () at netwerk/cache/nsDiskCacheStreams.cpp:233 nsDiskCacheOutputStream::~nsDiskCacheOutputStream() () at netwerk/cache/nsDiskCacheStreams.cpp:235 nsDiskCacheOutputStream::Release() () at netwerk/cache/nsDiskCacheStreams.cpp:219 nsCOMPtr_base::begin_assignment() () at xpcom/build/../glue/nsCOMPtr.h:479 nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() () at netwerk/cache/nsCacheEntryDescriptor.cpp:613 nsCacheEntryDescriptor::nsOutputStreamWrapper::Close() () at netwerk/cache/nsCacheEntryDescriptor.cpp:642 nsCacheEntryDescriptor::nsOutputStreamWrapper::~nsOutputStreamWrapper() () at netwerk/cache/nsCacheEntryDescriptor.h:155 nsCacheEntryDescriptor::nsOutputStreamWrapper::~nsOutputStreamWrapper() () at netwerk/cache/nsCacheEntryDescriptor.h:161 nsCacheEntryDescriptor::nsOutputStreamWrapper::Release() () at netwerk/cache/nsCacheEntryDescriptor.cpp:595 nsStreamListenerTee::OnStopRequest(nsIRequest*, nsISupports*, unsigned int) () at ../../../dist/include/nsCOMPtr.h:663 nsHttpChannel::OnStopRequest(nsIRequest*, nsISupports*, unsigned int) () at netwerk/protocol/http/nsHttpChannel.cpp:4116 nsInputStreamPump::OnStateStop() () at netwerk/base/src/nsInputStreamPump.cpp:578 nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream*) () at netwerk/base/src/nsInputStreamPump.cpp:403 nsInputStreamReadyEvent::Run() () at xpcom/io/nsStreamUtils.cpp:114 No others threads seemed to be doing much with the cache, and were all sleeping in poll() or pthread_cond_wait(). I notice that nsDiskCacheStreamIO::CloseOutputStream and nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit both take the nsCacheService lock, which is non-recursive, so it seems like this is the source of the deadlock.
Few days ago I saw an identical hang caused by a code-blunder in a different context. It was even reproducible, but after fixing my blunder I didn't see the hang again and forgot all about it... I'll take a closer look by re-introducing my mistake. :)
I've reproduced and identified the problem. Will attach a patch shortly. The issue is that nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit is called twice and on the second call |mOutput| is set. When LazyInit assigns a new value to |mOutput|, the old value is cleared and we end up in the dtor of nsDiskCacheOutputStream which tries to reclaim the lock. And boom... deadlock! This can happen because after setting |mOutput| but before setting the flag |mInitialized| , LazyInit() can return (for two separate reasons). Hence we may end up with |mOutput| set and not |mInitialized|, and LazyInit could be called twice. My blunder (mentioned in comment #1) caused OnDataSizeChange to return an error and LazyInit to return early.
Assignee: nobody → bjarne
Status: NEW → ASSIGNED
I'm not able to reproduce my issue with this patch applied. I'd be interested to known if reporter can reproduce *his* issue with it applied. Pushed to tryserver, requesting review.
Comment on attachment 531006 [details] [diff] [review] Proposed fix Ahh - it has a leak! I'll fix and re-submit.
Passes simple local testing. Pushed to tryserver...
Tryserver ok. Awaiting review.
Comment on attachment 531037 [details] [diff] [review] Proposed fix (wo/ leaks this time) > +#include "nsAutoPtr.h" ... > + nsRefPtr<nsIOutputStream> stream; Use nsCOMPtr. Otherwise looks good. And please add a test.
Comment on attachment 531037 [details] [diff] [review] Proposed fix (wo/ leaks this time) I'd love to add a test here and I've spent quite some time trying to figure out how. The issue blocking me is that in order to test this patch, nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() must return after creating |mOutput| but before setting |mInitialized|. I.e. in this interval either nsCacheEntry::CacheDevice() must return null or nsCacheDevice::OnDataSizeChange() must return an error. I see no way to reliably achieve this without replacing the entire cache-service in the test. This is certainly doable but requires quite a lot of coding so I'm not sure it's worth the time and trouble. If you have an idea of how to achieve this in a different way, or a pointer/hint to a test or similar which do replace the cache-service, I'd be happy to start from there.
Entry is first written with flag STORE_ON_DISK_AS_FILE, so it isn't doomed. And STORE_ON_DISK is used when appending the entry, which causes nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() to fail.
Brilliant! I'll make a patch out of it, attach and r+ it (it's yours).
I put the ostream->close-call inside a try-catch because it will fail if LazyInit fails. Test has an r+ from me...
Comment on attachment 532606 [details] [diff] [review] Slightly modified test Requesting review for the change/modified behaviour of Close()...
Attachment #532606 - Flags: review?(michal.novotny)
In this version nsDiskCacheStreamIO::Flush() closes the fd if entry is doomed. Without this change the test for this bug asserts when shutting down the cacheservice, and this can happen in real-life as well. IMO it makes sense to close the fd, but exactly *where* to do it could probably be discussed. My proposal is in Flush(). Pushed to tryserver.
Comment on attachment 532606 [details] [diff] [review] Slightly modified test > + do_check_true(false); // should fail you can use do_throw() > + do_check_true(true); // should succeed you can omit this check
Attachment #532606 - Flags: review?(michal.novotny) → review+
with this patch all three tests pass successfully test_bug654926.js test_bug654926_test_seek.js test_bug654926_doom_and_read.js
You've been busy... :) I agree with your change to Flush() and I've consolidated your new tests and the patch and pushed everything to tryserver.
Oh - btw: In the previous tryserver-run, the test for this particular bug on Win/Opt failed (i.e. both append and close completed without errors). Let's see if the adjustment you made changes this...
Tryserver is all green.
Combined tests. I'm uncertain about the process here but I'll r+ this as well as Michals modification to the code-patch, then request check-in.
Comment on attachment 533020 [details] [diff] [review] slightly modified patch Review of attachment 533020 [details] [diff] [review]: ----------------------------------------------------------------- Modification in Flush() seems fine. r=me Together with comment #7 I consider this ready to go.
Attachment #533020 - Flags: review?(bjarne) → review+
Requesting check-in for attachment #533020 [details] [diff] [review] (code change) and attachment #533555 [details] [diff] [review] (tests).
OS: Linux → All
Hardware: x86_64 → All
Whiteboard: [fixed in cedar]
Pushed: http://hg.mozilla.org/mozilla-central/rev/33e975a5d150 http://hg.mozilla.org/mozilla-central/rev/b63a7d9c1735
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [fixed in cedar]
Target Milestone: --- → mozilla6
Backed out because of a perma orange on Win opt xpcshell: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1305819552.1305821909.7630.gz&fulltext=1
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla6 → ---
Mmmm... same problem as mentioned in comment #19. Not seen on try-run, ref comment #20. Is it really perma? Any chance to get a nspr-log from it?
(In reply to comment #26) > Mmmm... same problem as mentioned in comment #19. Not seen on try-run, ref > comment #20. Is it really perma? Any chance to get a nspr-log from it? Happens 2 times out of two on Cedar and I saw it happens one time out of one on mozilla-central. It's likely perma. Maybe intermittent but in that case, rarely passing. How can I help with nspr-log?
Yeah - I rebuilt these tests on tryserver and it came up with warnings. Must have passed first time on pure coincidence... :( Nspr-log: Is there a way to rebuild/re-run these tests, setting the env-variables as described in https://developer.mozilla.org/en/HTTP_Logging#Logging_HTTP_Activity ? If there is, plz add ",cache:5" to the end of NSPR_LOG_MODULES. The result is a file pointed to by NSPR_LOG_FILE which would be useful to try to understand what's going on.
See result of a modified test at http://tbpl.mozilla.org/?tree=Try&rev=60487b56b6cc I'd guess that the problem is in reusing an entry that is being deactivated http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheDevice.cpp#510 The reused entry has policy STORE_ON_DISK_AS_FILE instead of requested STORE_ON_DISK so LazyInit() doesn't fail or it isn't called at all because the output is still open. Syncing with cache IO thread (like in test in bug 651100) would fix this. But maybe we should rather change the code so that we won't reuse the entry if the policy differs.
Nice catch! :) I have not been able to reproduce the orange, nor has Honza on opt/Win7, so this is good insight. (In reply to comment #29) > Syncing with cache IO thread (like in test in bug > 651100) would fix this. I'd prefer this. IMO the orange we see here is a special case/result of another bug (see below). We should either make this bug depend on the other one, or implement your suggested workaround in the test to get this bug out of the way. Input? > But maybe we should rather change the code so that > we won't reuse the entry if the policy differs. This may actually be a more general issue: Afaics when using an active entry we never ensure that the requested storage-policy is compatible with existing storage for the entry. The orange in this bug is IMO a special case because even if the entry we find is in the pending state between "active" and "deactivated" we just re-use the object like we would have done if it was "active". So (theoretically) if we have an active entry stored in memory we may find and use it for a request with storage-policy STORE_ON_DISK_AS_FILE, for example. (Or vice-versa, which would be a security-issue.) Another thing which puzzles me is why we never search the disk for a request with policy STORE_ON_DISK_AS_FILE. It means that even if we have the resource cached it will always be re-loaded from server, and I'm not sure how the disk-cache deals with book-keeping in this case. I'll try to come up with some testcases for the above scenarios - your input is appreciated also.
Hmm... apparently nsHttpHandler::GetCacheSession() takes some steps which handle at least the obvious issue described above. This needs some more investigation, and it is a question whether we should leave such things to the client...
(In reply to comment #30) > Another thing which puzzles me is why we never search the disk for a request > with policy STORE_ON_DISK_AS_FILE. It means that even if we have the > resource cached it will always be re-loaded from server, and I'm not sure > how the disk-cache deals with book-keeping in this case. I guess that this was an intention, otherwise we'd need to check if the data is in a separate file. But you're right. The cache code doesn't handle this correctly and opening an already existing entry results in duplicate entries.
(In reply to comment #32) > I guess that this was an intention, otherwise we'd need to check if the data > is in a separate file. You mean we'd have to handle the situation where a cached entry is NOT in a separate file and the new request has policy STORE_ON_DISK_AS_FILE? (Because if it already is in a separate file there is no problem, and checking it is trivial, no?)
(In reply to comment #33) > You mean we'd have to handle the situation where a cached entry is NOT in a > separate file and the new request has policy STORE_ON_DISK_AS_FILE? (Because > if it already is in a separate file there is no problem, and checking it is > trivial, no?) Yes, exactly.
(In reply to comment #30) > So (theoretically) if we have an active entry stored in memory we may find > and use it for a request with storage-policy STORE_ON_DISK_AS_FILE, for > example. (Or vice-versa, which would be a security-issue.) This can not happen with the current http-implementation because nsHttpHandler::GetCacheSession() creates sessions with different names for requests requiring memory- or offline-storage. The session-name becomes the prefix of the cache-key in nsCacheService::CreateRequest(), hence a request which have to go to memory (or offline) cannot be mistaken for a request which can go to disk (or can go anywhere). I'm not 100% confident about the implications of this scheme. One implication for sure is that even if we have a resource in cache we may end up reloading it because the necko-client requested a different storage-policy. Not sure if this is a problem in real life though. The good news is that the "special case" from comment #30 is in fact the only case where we may get a policy-clash so I'll focus on this for now.
Right - after some thinking I've come to the following: - this bug is about avoiding deadlock by ensuring that LazyInit() is only called once, hence the tests for this bug should test that - the problem found in comment #29 and comment #32 and clarified in comment #34 is separate and needs its own fix and test So, I suggest to patch Michals test for this bug as suggested at the end of comment #29 in order to avoid the problem causing the orange. I'll open a new bug for the issue clarified in comment #34.
The oranges shown in comment #25 and comment #29 only applies to one of the tests so I've changed only that one to sync with cache-operations in order to avoid the subtle issue identified in comment #29. I'm not 100% sure why the two other tests don't fail also but I have not analyzed it very deeply either. Requesting review plus clarification about the two other tests. Passes local testing, pushed to tryserver together with attachment #533020 [details] [diff] [review].
(In reply to comment #36) >I'll open a > new bug for the issue clarified in comment #34. Bug #661115 opened.
(In reply to comment #37) > pushed to tryserver together with attachment #533020 [details] [diff] [review] > [review]. Tryserver green. Awaiting review and/or feedback...
Comment on attachment 536552 [details] [diff] [review] Version of test where we sync with cache-io (In reply to comment #37) > I'm not 100% sure why the two other tests don't fail also but I have not > analyzed it very deeply either. It's because the other tests don't write beyond the 5MB limit, so it doesn't matter whether the entry is reopened with policy STORE_ON_DISK_AS_FILE or STORE_ON_DISK.
Requesting check-in for attachment #536552 [details] [diff] [review] (tests) and attachment #533020 [details] [diff] [review] (code-change)
The "slightly modified patch" doesn't have an author. (Michal attached it, but this bug is assigned to Bjarne?!) The "Version of test where we sync with cache-io" says "From: email@example.com", which doesn't seem to make any sense. Please fix this, then add checkin-needed back.
Requesting check-in for the two latest attachments
Apparently the first patch tries to modify a file added by the second patch, but there are conflicts. Can you just attach the code changes and tests in one patch?
Combined into one patch as requested. I have not sent this to tryserver since it is a simple qfold and "I see no reasons for it to fail (tm)"... :)
Comment on attachment 539463 [details] [diff] [review] Combined and unbitrotted Requesting check-in using this flag also
Attachment #539463 - Flags: checkin?
Comment on attachment 539463 [details] [diff] [review] Combined and unbitrotted Pushed to mozilla-inbound.
Attachment #539463 - Flags: checkin? → checkin+
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla7
You need to log in before you can comment on or make changes to this bug.