Hang in nsDiskCacheStreamIO::CloseOutputStream while downloading WebM video from YouTube

RESOLVED FIXED in mozilla7

Status

()

Core
Networking: Cache
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: kinetik, Assigned: bjarne)

Tracking

Trunk
mozilla7
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 12 obsolete attachments)

13.71 KB, patch
mounir
: checkin+
Details | Diff | Splinter Review
(Reporter)

Description

6 years ago
Running a fairly recent nightly (86248f7209b7), the browser hung while downloading a video (via the Save Video As context menu) from YouTube.  I had maybe 30 other tabs open, but my last interaction with the browser was saving the video to disk.

Possible steps to reproduce:
- Find HTML 5 video on YouTube
- Open Page Info and find the video URL
- Open video URL in a new tab
- Hit the Escape key to stop the video redownloading
- Save Video As...
- Open the download manager to watch download progress
- Close YouTube tab

After this, I switched back to my editor and noticed the download manager reached 15.0/15.1MB and then stopped making progress.  Attempting to switch focus to the browser revealed that it had hung.

The main thread had the following stack:

PR_Lock () at nsprpub/pr/src/pthreads/ptsynch.c:206
nsDiskCacheStreamIO::CloseOutputStream(nsDiskCacheOutputStream*) () at netwerk/cache/nsCacheService.h:304
nsDiskCacheOutputStream::Close() () at netwerk/cache/nsDiskCacheStreams.cpp:244
nsDiskCacheOutputStream::~nsDiskCacheOutputStream() () at netwerk/cache/nsDiskCacheStreams.cpp:233
nsDiskCacheOutputStream::~nsDiskCacheOutputStream() () at netwerk/cache/nsDiskCacheStreams.cpp:235
nsDiskCacheOutputStream::Release() () at netwerk/cache/nsDiskCacheStreams.cpp:219
nsCOMPtr_base::begin_assignment() () at xpcom/build/../glue/nsCOMPtr.h:479
nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() () at netwerk/cache/nsCacheEntryDescriptor.cpp:613
nsCacheEntryDescriptor::nsOutputStreamWrapper::Close() () at netwerk/cache/nsCacheEntryDescriptor.cpp:642
nsCacheEntryDescriptor::nsOutputStreamWrapper::~nsOutputStreamWrapper() () at netwerk/cache/nsCacheEntryDescriptor.h:155
nsCacheEntryDescriptor::nsOutputStreamWrapper::~nsOutputStreamWrapper() () at netwerk/cache/nsCacheEntryDescriptor.h:161
nsCacheEntryDescriptor::nsOutputStreamWrapper::Release() () at netwerk/cache/nsCacheEntryDescriptor.cpp:595
nsStreamListenerTee::OnStopRequest(nsIRequest*, nsISupports*, unsigned int) () at ../../../dist/include/nsCOMPtr.h:663
nsHttpChannel::OnStopRequest(nsIRequest*, nsISupports*, unsigned int) () at netwerk/protocol/http/nsHttpChannel.cpp:4116
nsInputStreamPump::OnStateStop() () at netwerk/base/src/nsInputStreamPump.cpp:578
nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream*) () at netwerk/base/src/nsInputStreamPump.cpp:403
nsInputStreamReadyEvent::Run() () at xpcom/io/nsStreamUtils.cpp:114

No others threads seemed to be doing much with the cache, and were all sleeping in poll() or pthread_cond_wait().

I notice that nsDiskCacheStreamIO::CloseOutputStream and nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit both take the nsCacheService lock, which is non-recursive, so it seems like this is the source of the deadlock.
(Assignee)

Comment 1

6 years ago
Few days ago I saw an identical hang caused by a code-blunder in a different context. It was even reproducible, but after fixing my blunder I didn't see the hang again and forgot all about it...

I'll take a closer look by re-introducing my mistake. :)
(Assignee)

Comment 2

6 years ago
I've reproduced and identified the problem. Will attach a patch shortly.

The issue is that nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit is called twice and on the second call |mOutput| is set. When LazyInit assigns a new value to |mOutput|, the old value is cleared and we end up in the dtor of nsDiskCacheOutputStream which tries to reclaim the lock. And boom... deadlock!

This can happen because after setting |mOutput| but before setting the flag |mInitialized| , LazyInit() can return (for two separate reasons). Hence we may end up with |mOutput| set and not |mInitialized|, and LazyInit could be called twice. My blunder (mentioned in comment #1) caused OnDataSizeChange to return an error and LazyInit to return early.
Assignee: nobody → bjarne
Status: NEW → ASSIGNED
(Assignee)

Comment 3

6 years ago
Created attachment 531006 [details] [diff] [review]
Proposed fix

I'm not able to reproduce my issue with this patch applied. I'd be interested to known if reporter can reproduce *his* issue with it applied.

Pushed to tryserver, requesting review.
Attachment #531006 - Flags: review?(michal.novotny)
(Assignee)

Comment 4

6 years ago
Comment on attachment 531006 [details] [diff] [review]
Proposed fix

Ahh - it has a leak! I'll fix and re-submit.
Attachment #531006 - Flags: review?(michal.novotny)
(Assignee)

Comment 5

6 years ago
Created attachment 531037 [details] [diff] [review]
Proposed fix (wo/ leaks this time)

Passes simple local testing. Pushed to tryserver...
Attachment #531006 - Attachment is obsolete: true
Attachment #531037 - Flags: review?(michal.novotny)
(Assignee)

Comment 6

6 years ago
Tryserver ok. Awaiting review.
Comment on attachment 531037 [details] [diff] [review]
Proposed fix (wo/ leaks this time)

> +#include "nsAutoPtr.h"
...
> +    nsRefPtr<nsIOutputStream> stream;

Use nsCOMPtr.


Otherwise looks good. And please add a test.
Attachment #531037 - Flags: review?(michal.novotny)
(Assignee)

Comment 8

6 years ago
Comment on attachment 531037 [details] [diff] [review]
Proposed fix (wo/ leaks this time)

I'd love to add a test here and I've spent quite some time trying to figure out how. The issue blocking me is that in order to test this patch, nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() must return after creating |mOutput| but before setting |mInitialized|. I.e. in this interval either nsCacheEntry::CacheDevice() must return null or nsCacheDevice::OnDataSizeChange() must return an error. I see no way to reliably achieve this without replacing the entire cache-service in the test. This is certainly doable but requires quite a lot of coding so I'm not sure it's worth the time and trouble.

If you have an idea of how to achieve this in a different way, or a pointer/hint to a test or similar which do replace the cache-service, I'd be happy to start from there.
Attachment #531037 - Flags: feedback?(michal.novotny)
Created attachment 532403 [details]
reliable test

Entry is first written with flag STORE_ON_DISK_AS_FILE, so it isn't doomed. And STORE_ON_DISK is used when appending the entry, which causes nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit() to fail.
Attachment #531037 - Flags: feedback?(michal.novotny)
(Assignee)

Comment 10

6 years ago
Brilliant! I'll make a patch out of it, attach and r+ it (it's yours).
(Assignee)

Comment 11

6 years ago
Created attachment 532606 [details] [diff] [review]
Slightly modified test

I put the ostream->close-call inside a try-catch because it will fail if LazyInit fails. Test has an r+ from me...
Attachment #532403 - Attachment is obsolete: true
Attachment #532606 - Flags: review+
(Assignee)

Comment 12

6 years ago
Comment on attachment 532606 [details] [diff] [review]
Slightly modified test

Requesting review for the change/modified behaviour of Close()...
Attachment #532606 - Flags: review?(michal.novotny)
(Assignee)

Comment 13

6 years ago
Created attachment 532607 [details] [diff] [review]
Slightly modified patch which also closes fd

In this version nsDiskCacheStreamIO::Flush() closes the fd if entry is doomed.

Without this change the test for this bug asserts when shutting down the cacheservice, and this can happen in real-life as well. IMO it makes sense to close the fd, but exactly *where* to do it could probably be discussed. My proposal is in Flush().

Pushed to tryserver.
Attachment #531037 - Attachment is obsolete: true
Attachment #532607 - Flags: review?(michal.novotny)
Comment on attachment 532606 [details] [diff] [review]
Slightly modified test

> +    do_check_true(false); // should fail

you can use do_throw()


> +    do_check_true(true); // should succeed

you can omit this check
Attachment #532606 - Flags: review?(michal.novotny) → review+
Created attachment 533016 [details]
test that works with current FF but fails with patch 532607
Created attachment 533018 [details]
test that fails with as well as without patch 532607
Created attachment 533020 [details] [diff] [review]
slightly modified patch

with this patch all three tests pass successfully

test_bug654926.js
test_bug654926_test_seek.js
test_bug654926_doom_and_read.js
Attachment #532607 - Attachment is obsolete: true
Attachment #532607 - Flags: review?(michal.novotny)
Attachment #533020 - Flags: review?(bjarne)
(Assignee)

Comment 18

6 years ago
You've been busy...  :)  I agree with your change to Flush() and I've consolidated your new tests and the patch and pushed everything to tryserver.
(Assignee)

Comment 19

6 years ago
Oh - btw: In the previous tryserver-run, the test for this particular bug on Win/Opt failed (i.e. both append and close completed without errors). Let's see if the adjustment you made changes this...
(Assignee)

Comment 20

6 years ago
Tryserver is all green.
(Assignee)

Comment 21

6 years ago
Created attachment 533555 [details] [diff] [review]
All three tests in one patch

Combined tests.

I'm uncertain about the process here but I'll r+ this as well as Michals modification to the code-patch, then request check-in.
Attachment #532606 - Attachment is obsolete: true
Attachment #533016 - Attachment is obsolete: true
Attachment #533018 - Attachment is obsolete: true
Attachment #533555 - Flags: review+
(Assignee)

Comment 22

6 years ago
Comment on attachment 533020 [details] [diff] [review]
slightly modified patch

Review of attachment 533020 [details] [diff] [review]:
-----------------------------------------------------------------

Modification in Flush() seems fine.  r=me

Together with comment #7 I consider this ready to go.
Attachment #533020 - Flags: review?(bjarne) → review+
(Assignee)

Comment 23

6 years ago
Requesting check-in for attachment #533020 [details] [diff] [review] (code change) and attachment #533555 [details] [diff] [review] (tests).
Keywords: checkin-needed
Keywords: checkin-needed
OS: Linux → All
Hardware: x86_64 → All
Whiteboard: [fixed in cedar]
Pushed:
http://hg.mozilla.org/mozilla-central/rev/33e975a5d150
http://hg.mozilla.org/mozilla-central/rev/b63a7d9c1735
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Whiteboard: [fixed in cedar]
Target Milestone: --- → mozilla6
Backed out because of a perma orange on Win opt xpcshell:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1305819552.1305821909.7630.gz&fulltext=1
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla6 → ---
(Assignee)

Comment 26

6 years ago
Mmmm...  same problem as mentioned in comment #19. Not seen on try-run, ref comment #20. Is it really perma? Any chance to get a nspr-log from it?
(In reply to comment #26)
> Mmmm...  same problem as mentioned in comment #19. Not seen on try-run, ref
> comment #20. Is it really perma? Any chance to get a nspr-log from it?

Happens 2 times out of two on Cedar and I saw it happens one time out of one on mozilla-central. It's likely perma. Maybe intermittent but in that case, rarely passing.

How can I help with nspr-log?
(Assignee)

Comment 28

6 years ago
Yeah - I rebuilt these tests on tryserver and it came up with warnings. Must have passed first time on pure coincidence... :(

Nspr-log: Is there a way to rebuild/re-run these tests, setting the env-variables as described in https://developer.mozilla.org/en/HTTP_Logging#Logging_HTTP_Activity ?  If there is, plz add ",cache:5" to the end of NSPR_LOG_MODULES. The result is a file pointed to by NSPR_LOG_FILE which would be useful to try to understand what's going on.
See result of a modified test at http://tbpl.mozilla.org/?tree=Try&rev=60487b56b6cc

I'd guess that the problem is in reusing an entry that is being deactivated
http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheDevice.cpp#510

The reused entry has policy STORE_ON_DISK_AS_FILE instead of requested STORE_ON_DISK so LazyInit() doesn't fail or it isn't called at all because the output is still open. Syncing with cache IO thread (like in test in bug 651100) would fix this. But maybe we should rather change the code so that we won't reuse the entry if the policy differs.
(Assignee)

Comment 30

6 years ago
Nice catch! :)  I have not been able to reproduce the orange, nor has Honza on opt/Win7, so this is good insight.

(In reply to comment #29)
> Syncing with cache IO thread (like in test in bug
> 651100) would fix this.

I'd prefer this. IMO the orange we see here is a special case/result of another bug (see below). We should either make this bug depend on the other one, or implement your suggested workaround in the test to get this bug out of the way. Input?

> But maybe we should rather change the code so that
> we won't reuse the entry if the policy differs.

This may actually be a more general issue: Afaics when using an active entry we never ensure that the requested storage-policy is compatible with existing storage for the entry. The orange in this bug is IMO a special case because even if the entry we find is in the pending state between "active" and "deactivated" we just re-use the object like we would have done if it was "active".

So (theoretically) if we have an active entry stored in memory we may find and use it for a request with storage-policy STORE_ON_DISK_AS_FILE, for example. (Or vice-versa, which would be a security-issue.)

Another thing which puzzles me is why we never search the disk for a request with policy STORE_ON_DISK_AS_FILE. It means that even if we have the resource cached it will always be re-loaded from server, and I'm not sure how the disk-cache deals with book-keeping in this case.

I'll try to come up with some testcases for the above scenarios - your input is appreciated also.
(Assignee)

Comment 31

6 years ago
Hmm...  apparently nsHttpHandler::GetCacheSession() takes some steps which handle at least the obvious issue described above. This needs some more investigation, and it is a question whether we should leave such things to the client...
(In reply to comment #30)
> Another thing which puzzles me is why we never search the disk for a request
> with policy STORE_ON_DISK_AS_FILE. It means that even if we have the
> resource cached it will always be re-loaded from server, and I'm not sure
> how the disk-cache deals with book-keeping in this case.

I guess that this was an intention, otherwise we'd need to check if the data is in a separate file. But you're right. The cache code doesn't handle this correctly and opening an already existing entry results in duplicate entries.
(Assignee)

Comment 33

6 years ago
(In reply to comment #32)
> I guess that this was an intention, otherwise we'd need to check if the data
> is in a separate file. 

You mean we'd have to handle the situation where a cached entry is NOT in a separate file and the new request has policy STORE_ON_DISK_AS_FILE? (Because if it already is in a separate file there is no problem, and checking it is trivial, no?)
(In reply to comment #33)
> You mean we'd have to handle the situation where a cached entry is NOT in a
> separate file and the new request has policy STORE_ON_DISK_AS_FILE? (Because
> if it already is in a separate file there is no problem, and checking it is
> trivial, no?)

Yes, exactly.
(Assignee)

Comment 35

6 years ago
(In reply to comment #30)
> So (theoretically) if we have an active entry stored in memory we may find
> and use it for a request with storage-policy STORE_ON_DISK_AS_FILE, for
> example. (Or vice-versa, which would be a security-issue.)

This can not happen with the current http-implementation because nsHttpHandler::GetCacheSession() creates sessions with different names for requests requiring memory- or offline-storage. The session-name becomes the prefix of the cache-key in nsCacheService::CreateRequest(), hence a request which have to go to memory (or offline) cannot be mistaken for a request which can go to disk (or can go anywhere).

I'm not 100% confident about the implications of this scheme. One implication for sure is that even if we have a resource in cache we may end up reloading it because the necko-client requested a different storage-policy. Not sure if this is a problem in real life though.

The good news is that the "special case" from comment #30 is in fact the only case where we may get a policy-clash so I'll focus on this for now.
(Assignee)

Comment 36

6 years ago
Right - after some thinking I've come to the following:

- this bug is about avoiding deadlock by ensuring that LazyInit() is only called once, hence the tests for this bug should test that

- the problem found in comment #29 and comment #32 and clarified in comment #34 is separate and needs its own fix and test

So, I suggest to patch Michals test for this bug as suggested at the end of comment #29 in order to avoid the problem causing the orange. I'll open a new bug for the issue clarified in comment #34.
(Assignee)

Comment 37

6 years ago
Created attachment 536552 [details] [diff] [review]
Version of test where we sync with cache-io

The oranges shown in comment #25 and comment #29 only applies to one of the tests so I've changed only that one to sync with cache-operations in order to avoid the subtle issue identified in comment #29. I'm not 100% sure why the two other tests don't fail also but I have not analyzed it very deeply either.

Requesting review plus clarification about the two other tests. Passes local testing, pushed to tryserver together with attachment #533020 [details] [diff] [review].
Attachment #533555 - Attachment is obsolete: true
Attachment #536552 - Flags: review?(michal.novotny)
Attachment #536552 - Flags: feedback?(michal.novotny)
(Assignee)

Comment 38

6 years ago
(In reply to comment #36)
>I'll open a
> new bug for the issue clarified in comment #34.

Bug #661115 opened.
(Assignee)

Comment 39

6 years ago
(In reply to comment #37)
> pushed to tryserver together with attachment #533020 [details] [diff] [review]
> [review].

Tryserver green. Awaiting review and/or feedback...
Comment on attachment 536552 [details] [diff] [review]
Version of test where we sync with cache-io

(In reply to comment #37)
> I'm not 100% sure why the two other tests don't fail also but I have not
> analyzed it very deeply either.

It's because the other tests don't write beyond the 5MB limit, so it doesn't matter whether the entry is reopened with policy STORE_ON_DISK_AS_FILE or STORE_ON_DISK.
Attachment #536552 - Flags: review?(michal.novotny)
Attachment #536552 - Flags: review+
Attachment #536552 - Flags: feedback?(michal.novotny)
(Assignee)

Comment 41

6 years ago
Requesting check-in for attachment #536552 [details] [diff] [review] (tests) and attachment #533020 [details] [diff] [review] (code-change)
Keywords: checkin-needed
The "slightly modified patch" doesn't have an author. (Michal attached it, but this bug is assigned to Bjarne?!) The "Version of test where we sync with cache-io" says "From: r=michal.novotny@gmail.com", which doesn't seem to make any sense. Please fix this, then add checkin-needed back.
Keywords: checkin-needed
(Assignee)

Comment 43

6 years ago
Created attachment 537982 [details] [diff] [review]
Metainformation fixed
Attachment #533020 - Attachment is obsolete: true
(Assignee)

Comment 44

6 years ago
Created attachment 537983 [details] [diff] [review]
Metainformation fixed
Attachment #536552 - Attachment is obsolete: true
(Assignee)

Comment 45

6 years ago
Requesting check-in for the two latest attachments
Keywords: checkin-needed
Apparently the first patch tries to modify a file added by the second patch, but there are conflicts. Can you just attach the code changes and tests in one patch?
Keywords: checkin-needed
(Assignee)

Comment 47

6 years ago
Created attachment 539463 [details] [diff] [review]
Combined and unbitrotted

Combined into one patch as requested. I have not sent this to tryserver since it is a simple qfold and "I see no reasons for it to fail (tm)"...  :)
Attachment #537982 - Attachment is obsolete: true
Attachment #537983 - Attachment is obsolete: true
(Assignee)

Comment 48

6 years ago
Requesting check-in...
Keywords: checkin-needed
(Assignee)

Comment 49

6 years ago
Comment on attachment 539463 [details] [diff] [review]
Combined and unbitrotted

Requesting check-in using this flag also
Attachment #539463 - Flags: checkin?
Comment on attachment 539463 [details] [diff] [review]
Combined and unbitrotted

Pushed to mozilla-inbound.
Attachment #539463 - Flags: checkin? → checkin+
Keywords: checkin-needed
Whiteboard: [inbound]
http://hg.mozilla.org/mozilla-central/rev/80a510fa4b7f
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago6 years ago
Resolution: --- → FIXED
Whiteboard: [inbound]
Target Milestone: --- → mozilla7
You need to log in before you can comment on or make changes to this bug.