Last Comment Bug 654926 - Hang in nsDiskCacheStreamIO::CloseOutputStream while downloading WebM video from YouTube
: Hang in nsDiskCacheStreamIO::CloseOutputStream while downloading WebM video f...
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Networking: Cache (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla7
Assigned To: Bjarne (:bjarne)
:
: Patrick McManus [:mcmanus]
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-05-04 23:08 PDT by Matthew Gregan [:kinetik]
Modified: 2011-06-22 14:50 PDT (History)
7 users (show)
mounir: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Proposed fix (2.42 KB, patch)
2011-05-09 03:31 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Proposed fix (wo/ leaks this time) (4.34 KB, patch)
2011-05-09 06:53 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
reliable test (2.13 KB, application/javascript)
2011-05-13 19:10 PDT, Michal Novotny (:michal)
no flags Details
Slightly modified test (2.74 KB, patch)
2011-05-16 04:50 PDT, Bjarne (:bjarne)
bjarne: review+
michal.novotny: review+
Details | Diff | Splinter Review
Slightly modified patch which also closes fd (4.61 KB, patch)
2011-05-16 04:56 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
test that works with current FF but fails with patch 532607 (2.37 KB, application/javascript)
2011-05-17 11:03 PDT, Michal Novotny (:michal)
no flags Details
test that fails with as well as without patch 532607 (2.12 KB, application/javascript)
2011-05-17 11:03 PDT, Michal Novotny (:michal)
no flags Details
slightly modified patch (4.54 KB, patch)
2011-05-17 11:04 PDT, Michal Novotny (:michal)
bjarne: review+
Details | Diff | Splinter Review
All three tests in one patch (7.69 KB, patch)
2011-05-19 00:47 PDT, Bjarne (:bjarne)
bjarne: review+
Details | Diff | Splinter Review
Version of test where we sync with cache-io (9.38 KB, patch)
2011-06-01 02:44 PDT, Bjarne (:bjarne)
michal.novotny: review+
Details | Diff | Splinter Review
Metainformation fixed (8.64 KB, patch)
2011-06-08 04:30 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Metainformation fixed (9.38 KB, patch)
2011-06-08 04:31 PDT, Bjarne (:bjarne)
no flags Details | Diff | Splinter Review
Combined and unbitrotted (13.71 KB, patch)
2011-06-15 02:48 PDT, Bjarne (:bjarne)
mounir: checkin+
Details | Diff | Splinter Review

Description Matthew Gregan [:kinetik] 2011-05-04 23:08:12 PDT
Running a fairly recent nightly (86248f7209b7), the browser hung while downloading a video (via the Save Video As context menu) from YouTube.  I had maybe 30 other tabs open, but my last interaction with the browser was saving the video to disk.

Possible steps to reproduce:
- Find HTML 5 video on YouTube
- Open Page Info and find the video URL
- Open video URL in a new tab
- Hit the Escape key to stop the video redownloading
- Save Video As...
- Open the download manager to watch download progress
- Close YouTube tab

After this, I switched back to my editor and noticed the download manager reached 15.0/15.1MB and then stopped making progress.  Attempting to switch focus to the browser revealed that it had hung.

The main thread had the following stack:

PR_Lock () at nsprpub/pr/src/pthreads/ptsynch.c:206
nsDiskCacheStreamIO::CloseOutputStream(nsDiskCacheOutputStream*) () at netwerk/cache/nsCacheService.h:304
nsDiskCacheOutputStream::Close() () at netwerk/cache/nsDiskCacheStreams.cpp:244
nsDiskCacheOutputStream::~nsDiskCacheOutputStream() () at netwerk/cache/nsDiskCacheStreams.cpp:233
nsDiskCacheOutputStream::~nsDiskCacheOutputStream() () at netwerk/cache/nsDiskCacheStreams.cpp:235
nsDiskCacheOutputStream::Release() () at netwerk/cache/nsDiskCacheStreams.cpp:219
nsCOMPtr_base::begin_assignment() () at xpcom/build/../glue/nsCOMPtr.h:479
nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() () at netwerk/cache/nsCacheEntryDescriptor.cpp:613
nsCacheEntryDescriptor::nsOutputStreamWrapper::Close() () at netwerk/cache/nsCacheEntryDescriptor.cpp:642
nsCacheEntryDescriptor::nsOutputStreamWrapper::~nsOutputStreamWrapper() () at netwerk/cache/nsCacheEntryDescriptor.h:155
nsCacheEntryDescriptor::nsOutputStreamWrapper::~nsOutputStreamWrapper() () at netwerk/cache/nsCacheEntryDescriptor.h:161
nsCacheEntryDescriptor::nsOutputStreamWrapper::Release() () at netwerk/cache/nsCacheEntryDescriptor.cpp:595
nsStreamListenerTee::OnStopRequest(nsIRequest*, nsISupports*, unsigned int) () at ../../../dist/include/nsCOMPtr.h:663
nsHttpChannel::OnStopRequest(nsIRequest*, nsISupports*, unsigned int) () at netwerk/protocol/http/nsHttpChannel.cpp:4116
nsInputStreamPump::OnStateStop() () at netwerk/base/src/nsInputStreamPump.cpp:578
nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream*) () at netwerk/base/src/nsInputStreamPump.cpp:403
nsInputStreamReadyEvent::Run() () at xpcom/io/nsStreamUtils.cpp:114

No others threads seemed to be doing much with the cache, and were all sleeping in poll() or pthread_cond_wait().

I notice that nsDiskCacheStreamIO::CloseOutputStream and nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit both take the nsCacheService lock, which is non-recursive, so it seems like this is the source of the deadlock.
Comment 1 Bjarne (:bjarne) 2011-05-05 01:02:00 PDT
Few days ago I saw an identical hang caused by a code-blunder in a different context. It was even reproducible, but after fixing my blunder I didn't see the hang again and forgot all about it...

I'll take a closer look by re-introducing my mistake. :)
Comment 2 Bjarne (:bjarne) 2011-05-09 03:15:38 PDT
I've reproduced and identified the problem. Will attach a patch shortly.

The issue is that nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit is called twice and on the second call |mOutput| is set. When LazyInit assigns a new value to |mOutput|, the old value is cleared and we end up in the dtor of nsDiskCacheOutputStream which tries to reclaim the lock. And boom... deadlock!

This can happen because after setting |mOutput| but before setting the flag |mInitialized| , LazyInit() can return (for two separate reasons). Hence we may end up with |mOutput| set and not |mInitialized|, and LazyInit could be called twice. My blunder (mentioned in comment #1) caused OnDataSizeChange to return an error and LazyInit to return early.
Comment 3 Bjarne (:bjarne) 2011-05-09 03:31:52 PDT
Created attachment 531006 [details] [diff] [review]
Proposed fix

I'm not able to reproduce my issue with this patch applied. I'd be interested to known if reporter can reproduce *his* issue with it applied.

Pushed to tryserver, requesting review.
Comment 4 Bjarne (:bjarne) 2011-05-09 04:30:56 PDT
Comment on attachment 531006 [details] [diff] [review]
Proposed fix

Ahh - it has a leak! I'll fix and re-submit.
Comment 5 Bjarne (:bjarne) 2011-05-09 06:53:55 PDT
Created attachment 531037 [details] [diff] [review]
Proposed fix (wo/ leaks this time)

Passes simple local testing. Pushed to tryserver...
Comment 6 Bjarne (:bjarne) 2011-05-10 01:08:56 PDT
Tryserver ok. Awaiting review.
Comment 7 Michal Novotny (:michal) 2011-05-11 08:33:36 PDT
Comment on attachment 531037 [details] [diff] [review]
Proposed fix (wo/ leaks this time)

> +#include "nsAutoPtr.h"
...
> +    nsRefPtr<nsIOutputStream> stream;

Use nsCOMPtr.


Otherwise looks good. And please add a test.
Comment 8 Bjarne (:bjarne) 2011-05-13 03:14:51 PDT
Comment on attachment 531037 [details] [diff] [review]
Proposed fix (wo/ leaks this time)

I'd love to add a test here and I've spent quite some time trying to figure out how. The issue blocking me is that in order to test this patch, nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() must return after creating |mOutput| but before setting |mInitialized|. I.e. in this interval either nsCacheEntry::CacheDevice() must return null or nsCacheDevice::OnDataSizeChange() must return an error. I see no way to reliably achieve this without replacing the entire cache-service in the test. This is certainly doable but requires quite a lot of coding so I'm not sure it's worth the time and trouble.

If you have an idea of how to achieve this in a different way, or a pointer/hint to a test or similar which do replace the cache-service, I'd be happy to start from there.
Comment 9 Michal Novotny (:michal) 2011-05-13 19:10:28 PDT
Created attachment 532403 [details]
reliable test

Entry is first written with flag STORE_ON_DISK_AS_FILE, so it isn't doomed. And STORE_ON_DISK is used when appending the entry, which causes nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() to fail.
Comment 10 Bjarne (:bjarne) 2011-05-16 01:42:55 PDT
Brilliant! I'll make a patch out of it, attach and r+ it (it's yours).
Comment 11 Bjarne (:bjarne) 2011-05-16 04:50:53 PDT
Created attachment 532606 [details] [diff] [review]
Slightly modified test

I put the ostream->close-call inside a try-catch because it will fail if LazyInit fails. Test has an r+ from me...
Comment 12 Bjarne (:bjarne) 2011-05-16 04:52:04 PDT
Comment on attachment 532606 [details] [diff] [review]
Slightly modified test

Requesting review for the change/modified behaviour of Close()...
Comment 13 Bjarne (:bjarne) 2011-05-16 04:56:37 PDT
Created attachment 532607 [details] [diff] [review]
Slightly modified patch which also closes fd

In this version nsDiskCacheStreamIO::Flush() closes the fd if entry is doomed.

Without this change the test for this bug asserts when shutting down the cacheservice, and this can happen in real-life as well. IMO it makes sense to close the fd, but exactly *where* to do it could probably be discussed. My proposal is in Flush().

Pushed to tryserver.
Comment 14 Michal Novotny (:michal) 2011-05-16 12:10:50 PDT
Comment on attachment 532606 [details] [diff] [review]
Slightly modified test

> +    do_check_true(false); // should fail

you can use do_throw()


> +    do_check_true(true); // should succeed

you can omit this check
Comment 15 Michal Novotny (:michal) 2011-05-17 11:03:14 PDT
Created attachment 533016 [details]
test that works with current FF but fails with patch 532607
Comment 16 Michal Novotny (:michal) 2011-05-17 11:03:44 PDT
Created attachment 533018 [details]
test that fails with as well as without patch 532607
Comment 17 Michal Novotny (:michal) 2011-05-17 11:04:56 PDT
Created attachment 533020 [details] [diff] [review]
slightly modified patch

with this patch all three tests pass successfully

test_bug654926.js
test_bug654926_test_seek.js
test_bug654926_doom_and_read.js
Comment 18 Bjarne (:bjarne) 2011-05-18 02:33:48 PDT
You've been busy...  :)  I agree with your change to Flush() and I've consolidated your new tests and the patch and pushed everything to tryserver.
Comment 19 Bjarne (:bjarne) 2011-05-18 02:38:28 PDT
Oh - btw: In the previous tryserver-run, the test for this particular bug on Win/Opt failed (i.e. both append and close completed without errors). Let's see if the adjustment you made changes this...
Comment 20 Bjarne (:bjarne) 2011-05-19 00:43:55 PDT
Tryserver is all green.
Comment 21 Bjarne (:bjarne) 2011-05-19 00:47:52 PDT
Created attachment 533555 [details] [diff] [review]
All three tests in one patch

Combined tests.

I'm uncertain about the process here but I'll r+ this as well as Michals modification to the code-patch, then request check-in.
Comment 22 Bjarne (:bjarne) 2011-05-19 00:54:52 PDT
Comment on attachment 533020 [details] [diff] [review]
slightly modified patch

Review of attachment 533020 [details] [diff] [review]:
-----------------------------------------------------------------

Modification in Flush() seems fine.  r=me

Together with comment #7 I consider this ready to go.
Comment 23 Bjarne (:bjarne) 2011-05-19 00:56:35 PDT
Requesting check-in for attachment #533020 [details] [diff] [review] (code change) and attachment #533555 [details] [diff] [review] (tests).
Comment 25 Mounir Lamouri (:mounir) 2011-05-19 09:56:57 PDT
Backed out because of a perma orange on Win opt xpcshell:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1305819552.1305821909.7630.gz&fulltext=1
Comment 26 Bjarne (:bjarne) 2011-05-19 15:15:25 PDT
Mmmm...  same problem as mentioned in comment #19. Not seen on try-run, ref comment #20. Is it really perma? Any chance to get a nspr-log from it?
Comment 27 Mounir Lamouri (:mounir) 2011-05-20 03:00:24 PDT
(In reply to comment #26)
> Mmmm...  same problem as mentioned in comment #19. Not seen on try-run, ref
> comment #20. Is it really perma? Any chance to get a nspr-log from it?

Happens 2 times out of two on Cedar and I saw it happens one time out of one on mozilla-central. It's likely perma. Maybe intermittent but in that case, rarely passing.

How can I help with nspr-log?
Comment 28 Bjarne (:bjarne) 2011-05-20 03:14:16 PDT
Yeah - I rebuilt these tests on tryserver and it came up with warnings. Must have passed first time on pure coincidence... :(

Nspr-log: Is there a way to rebuild/re-run these tests, setting the env-variables as described in https://developer.mozilla.org/en/HTTP_Logging#Logging_HTTP_Activity ?  If there is, plz add ",cache:5" to the end of NSPR_LOG_MODULES. The result is a file pointed to by NSPR_LOG_FILE which would be useful to try to understand what's going on.
Comment 29 Michal Novotny (:michal) 2011-05-23 12:43:54 PDT
See result of a modified test at http://tbpl.mozilla.org/?tree=Try&rev=60487b56b6cc

I'd guess that the problem is in reusing an entry that is being deactivated
http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheDevice.cpp#510

The reused entry has policy STORE_ON_DISK_AS_FILE instead of requested STORE_ON_DISK so LazyInit() doesn't fail or it isn't called at all because the output is still open. Syncing with cache IO thread (like in test in bug 651100) would fix this. But maybe we should rather change the code so that we won't reuse the entry if the policy differs.
Comment 30 Bjarne (:bjarne) 2011-05-24 04:56:24 PDT
Nice catch! :)  I have not been able to reproduce the orange, nor has Honza on opt/Win7, so this is good insight.

(In reply to comment #29)
> Syncing with cache IO thread (like in test in bug
> 651100) would fix this.

I'd prefer this. IMO the orange we see here is a special case/result of another bug (see below). We should either make this bug depend on the other one, or implement your suggested workaround in the test to get this bug out of the way. Input?

> But maybe we should rather change the code so that
> we won't reuse the entry if the policy differs.

This may actually be a more general issue: Afaics when using an active entry we never ensure that the requested storage-policy is compatible with existing storage for the entry. The orange in this bug is IMO a special case because even if the entry we find is in the pending state between "active" and "deactivated" we just re-use the object like we would have done if it was "active".

So (theoretically) if we have an active entry stored in memory we may find and use it for a request with storage-policy STORE_ON_DISK_AS_FILE, for example. (Or vice-versa, which would be a security-issue.)

Another thing which puzzles me is why we never search the disk for a request with policy STORE_ON_DISK_AS_FILE. It means that even if we have the resource cached it will always be re-loaded from server, and I'm not sure how the disk-cache deals with book-keeping in this case.

I'll try to come up with some testcases for the above scenarios - your input is appreciated also.
Comment 31 Bjarne (:bjarne) 2011-05-24 07:06:17 PDT
Hmm...  apparently nsHttpHandler::GetCacheSession() takes some steps which handle at least the obvious issue described above. This needs some more investigation, and it is a question whether we should leave such things to the client...
Comment 32 Michal Novotny (:michal) 2011-05-24 08:53:40 PDT
(In reply to comment #30)
> Another thing which puzzles me is why we never search the disk for a request
> with policy STORE_ON_DISK_AS_FILE. It means that even if we have the
> resource cached it will always be re-loaded from server, and I'm not sure
> how the disk-cache deals with book-keeping in this case.

I guess that this was an intention, otherwise we'd need to check if the data is in a separate file. But you're right. The cache code doesn't handle this correctly and opening an already existing entry results in duplicate entries.
Comment 33 Bjarne (:bjarne) 2011-05-24 14:41:24 PDT
(In reply to comment #32)
> I guess that this was an intention, otherwise we'd need to check if the data
> is in a separate file. 

You mean we'd have to handle the situation where a cached entry is NOT in a separate file and the new request has policy STORE_ON_DISK_AS_FILE? (Because if it already is in a separate file there is no problem, and checking it is trivial, no?)
Comment 34 Michal Novotny (:michal) 2011-05-25 02:09:26 PDT
(In reply to comment #33)
> You mean we'd have to handle the situation where a cached entry is NOT in a
> separate file and the new request has policy STORE_ON_DISK_AS_FILE? (Because
> if it already is in a separate file there is no problem, and checking it is
> trivial, no?)

Yes, exactly.
Comment 35 Bjarne (:bjarne) 2011-05-25 06:46:24 PDT
(In reply to comment #30)
> So (theoretically) if we have an active entry stored in memory we may find
> and use it for a request with storage-policy STORE_ON_DISK_AS_FILE, for
> example. (Or vice-versa, which would be a security-issue.)

This can not happen with the current http-implementation because nsHttpHandler::GetCacheSession() creates sessions with different names for requests requiring memory- or offline-storage. The session-name becomes the prefix of the cache-key in nsCacheService::CreateRequest(), hence a request which have to go to memory (or offline) cannot be mistaken for a request which can go to disk (or can go anywhere).

I'm not 100% confident about the implications of this scheme. One implication for sure is that even if we have a resource in cache we may end up reloading it because the necko-client requested a different storage-policy. Not sure if this is a problem in real life though.

The good news is that the "special case" from comment #30 is in fact the only case where we may get a policy-clash so I'll focus on this for now.
Comment 36 Bjarne (:bjarne) 2011-05-31 13:26:29 PDT
Right - after some thinking I've come to the following:

- this bug is about avoiding deadlock by ensuring that LazyInit() is only called once, hence the tests for this bug should test that

- the problem found in comment #29 and comment #32 and clarified in comment #34 is separate and needs its own fix and test

So, I suggest to patch Michals test for this bug as suggested at the end of comment #29 in order to avoid the problem causing the orange. I'll open a new bug for the issue clarified in comment #34.
Comment 37 Bjarne (:bjarne) 2011-06-01 02:44:29 PDT
Created attachment 536552 [details] [diff] [review]
Version of test where we sync with cache-io

The oranges shown in comment #25 and comment #29 only applies to one of the tests so I've changed only that one to sync with cache-operations in order to avoid the subtle issue identified in comment #29. I'm not 100% sure why the two other tests don't fail also but I have not analyzed it very deeply either.

Requesting review plus clarification about the two other tests. Passes local testing, pushed to tryserver together with attachment #533020 [details] [diff] [review].
Comment 38 Bjarne (:bjarne) 2011-06-01 03:26:33 PDT
(In reply to comment #36)
>I'll open a
> new bug for the issue clarified in comment #34.

Bug #661115 opened.
Comment 39 Bjarne (:bjarne) 2011-06-03 03:20:19 PDT
(In reply to comment #37)
> pushed to tryserver together with attachment #533020 [details] [diff] [review]
> [review].

Tryserver green. Awaiting review and/or feedback...
Comment 40 Michal Novotny (:michal) 2011-06-07 17:00:33 PDT
Comment on attachment 536552 [details] [diff] [review]
Version of test where we sync with cache-io

(In reply to comment #37)
> I'm not 100% sure why the two other tests don't fail also but I have not
> analyzed it very deeply either.

It's because the other tests don't write beyond the 5MB limit, so it doesn't matter whether the entry is reopened with policy STORE_ON_DISK_AS_FILE or STORE_ON_DISK.
Comment 41 Bjarne (:bjarne) 2011-06-08 01:31:54 PDT
Requesting check-in for attachment #536552 [details] [diff] [review] (tests) and attachment #533020 [details] [diff] [review] (code-change)
Comment 42 Dão Gottwald [:dao] 2011-06-08 02:40:19 PDT
The "slightly modified patch" doesn't have an author. (Michal attached it, but this bug is assigned to Bjarne?!) The "Version of test where we sync with cache-io" says "From: r=michal.novotny@gmail.com", which doesn't seem to make any sense. Please fix this, then add checkin-needed back.
Comment 43 Bjarne (:bjarne) 2011-06-08 04:30:07 PDT
Created attachment 537982 [details] [diff] [review]
Metainformation fixed
Comment 44 Bjarne (:bjarne) 2011-06-08 04:31:52 PDT
Created attachment 537983 [details] [diff] [review]
Metainformation fixed
Comment 45 Bjarne (:bjarne) 2011-06-08 04:32:37 PDT
Requesting check-in for the two latest attachments
Comment 46 Dão Gottwald [:dao] 2011-06-11 09:44:45 PDT
Apparently the first patch tries to modify a file added by the second patch, but there are conflicts. Can you just attach the code changes and tests in one patch?
Comment 47 Bjarne (:bjarne) 2011-06-15 02:48:51 PDT
Created attachment 539463 [details] [diff] [review]
Combined and unbitrotted

Combined into one patch as requested. I have not sent this to tryserver since it is a simple qfold and "I see no reasons for it to fail (tm)"...  :)
Comment 48 Bjarne (:bjarne) 2011-06-15 02:49:20 PDT
Requesting check-in...
Comment 49 Bjarne (:bjarne) 2011-06-18 13:55:08 PDT
Comment on attachment 539463 [details] [diff] [review]
Combined and unbitrotted

Requesting check-in using this flag also
Comment 50 Mounir Lamouri (:mounir) 2011-06-22 01:56:16 PDT
Comment on attachment 539463 [details] [diff] [review]
Combined and unbitrotted

Pushed to mozilla-inbound.
Comment 51 Matt Brubeck (:mbrubeck) 2011-06-22 10:23:45 PDT
http://hg.mozilla.org/mozilla-central/rev/80a510fa4b7f

Note You need to log in before you can comment on or make changes to this bug.