Bug 651100 - Browser becomes unresponsive while watching embedded Youtube <video>

Status: RESOLVED FIXED
Product: Core
Component: Networking: Cache
Version: Trunk
Platform: x86 All
Importance: -- critical
Target Milestone: mozilla6
Assigned To: Michal Novotny (:michal)
URL: http://oduinn.com/blog/2011/04/16/mis...
Reported: 2011-04-19 05:27 PDT by Ted Mielczarek [:ted.mielczarek]
Modified: 2011-10-20 07:50 PDT


Attachments

stack samples (17.27 KB, text/plain)
2011-04-19 05:27 PDT, Ted Mielczarek [:ted.mielczarek]

fix (5.84 KB, patch)
2011-04-30 14:58 PDT, Michal Novotny (:michal)

patch v2 (13.09 KB, patch)
2011-05-05 04:35 PDT, Michal Novotny (:michal)
bjarne: feedback+

patch v3 (12.30 KB, patch)
2011-05-19 17:29 PDT, Michal Novotny (:michal)
bjarne: review+

patch v4 (12.61 KB, patch)
2011-05-23 13:16 PDT, Michal Novotny (:michal)
jduell.mcbugs: review+

minor fix: xpcshell test needs manifest entry (13.17 KB, patch)
2011-05-23 14:40 PDT, Jason Duell [:jduell] (needinfo me)
jduell.mcbugs: review+

Description Ted Mielczarek [:ted.mielczarek] 2011-04-19 05:27:13 PDT
Created attachment 526962 [details]
stack samples

The linked URL contains two Youtube embeds; for me the first is Flash and the second is an actual <video>. After playing the first, I started playing the second, and after 10 seconds or so my browser became completely unresponsive. I attached a debugger and broke into it a few times (stacks attached), and the main thread was consistently evicting cache entries.

I wasn't sure whether to file this under Video or Networking, since it seems like a bad interaction between the two. I couldn't find any existing bugs that seemed to cover this, although it's possible there are undiagnosed "Firefox hangs" bugs that have this as the underlying cause.

I haven't been able to reproduce it either, although I had to forcibly quit my browser, and last I knew we threw away the entire cache after that, so that's a possible explanation.
Comment 1 Jason Duell [:jduell] (needinfo me) 2011-04-19 11:49:41 PDT
In Firefox 3.x and earlier, we allowed big files (such as youtube vids) to evict items in the cache until they reached half the cache size.  In FF 4 that has dropped to 5 MB (or 1/8th of cache, if that's smaller than 5 MB) (bug 81640), AND if we have a Content-Length header that indicates the file is >5MB we skip caching it entirely (for media files, the media cache will cache it instead) (bug 588507).

How big is the YouTube vid? Do you know if YouTube vids pass a Content-Length?  If not, it would be normal for 5 MB of eviction to occur.  Of course, having your browser freeze (as a result of the evictions, or something else) isn't normal.

But it's hard to follow up w/o a reproducible test case.

I see all the stack traces stop with

 ntdll.dll!_ZwClose@4()  + 0x12 bytes	
 ntdll.dll!_ZwClose@4()  + 0x12 bytes	
 xul.dll!nsLocalFile::IsDirectory(int * _retval=0x00000001)  Line 248

I'm guessing that means you're hanging somewhere in IsDirectory (which calls Close at some point)?
Comment 2 Ted Mielczarek [:ted.mielczarek] 2011-04-19 12:12:26 PDT
No, if you look at the stacks, they're each different (the cache entries they're evicting are different):
 	xul.dll!nsDiskCacheMap::DeleteStorage(nsDiskCacheRecord * record=0x0ced43f0, int metaData=0)  + 0x1bffb5 bytes	C++
 	xul.dll!nsDiskCacheMap::DeleteStorage(nsDiskCacheRecord * record=0x0cf12cf0, int metaData=1)  + 0x1bffb5 bytes	C++
 	xul.dll!nsDiskCacheMap::DeleteStorage(nsDiskCacheRecord * record=0x0ce52e50, int metaData=0)  + 0x1bffb5 bytes	C++

It just happens that when you break in the debugger you usually catch that thread in the middle of a syscall. (The active thread was some other thread doing media work.)

The second video (that was actually showing up as HTML5 <video>) is this one:
http://www.youtube.com/watch?v=QZyE4xiJ3KA

I have no idea how to find the actual webm video from that.
Comment 3 Ted Mielczarek [:ted.mielczarek] 2011-04-19 12:13:43 PDT
The video is 10 minutes long, so I suspect it's probably rather large, but I don't know how large that is.
Comment 4 Jason Duell [:jduell] (needinfo me) 2011-04-19 12:20:59 PDT
What's the video thread up to?  So far we've established that it's not necessarily a problem for a video to be evicting stuff (up to 5 MB: if you're well into the video and it's still happening, that's probably a bug.  You say you freeze after 10 seconds, so it could be either way).  This may yet be a video issue, not an HTTP cache one.
Comment 5 Ted Mielczarek [:ted.mielczarek] 2011-04-19 12:48:20 PDT
I don't know, I didn't save all the thread stacks, just the main thread. :-/ I let it run for a while and it didn't seem to unstick itself. The audio from the video was still playing while the browser UI was unresponsive, FWIW.
Comment 6 Matthew Gregan [:kinetik] 2011-04-19 13:39:54 PDT
It looks like we do get Content-Length, and the video is well over 5MB:

Date: Tue, 19 Apr 2011 20:30:12 GMT
X-Content-Type-Options: nosniff
Connection: close
Content-Length: 71056183
Last-Modified: Mon, 28 Feb 2011 04:12:24 GMT
Server: gvs 1.0
Content-Type: video/webm
Cache-Control: private, max-age=23088
Accept-Ranges: bytes
Expires: Tue, 19 Apr 2011 20:30:12 GMT

Ted, are you running Firefox 4 or a nightly?
Comment 7 Jason Duell [:jduell] (needinfo me) 2011-04-19 14:28:19 PDT
Re: comment 6: Well that's a bug then (I filed bug 651234).

Still not clear it's causing the hang.
Comment 8 Ted Mielczarek [:ted.mielczarek] 2011-04-19 17:11:19 PDT
I'm running Nightly builds, I'm a few days behind the latest though.
Comment 9 Jason Duell [:jduell] (needinfo me) 2011-04-26 22:03:33 PDT
So we've established that the caching of the flash video is intentional (I'm not sure if it's necessary in some absolute sense, but the media player code passes a flag that makes us cache the content as a file, in which case we cache it regardless of size).

So the fact that cache entries are being evicted is not an issue per se, and it's not clear if the cache is causing the hang here.
Comment 10 Ted Mielczarek [:ted.mielczarek] 2011-04-27 05:44:12 PDT
It may not be, but the fact that the main thread was evicting cache entries every time I broke into the process while the browser was unresponsive is fairly damning...
Comment 11 Michal Novotny (:michal) 2011-04-30 14:57:44 PDT
This is a regression caused by the fix in bug #644431. The video is larger than 65535KiB and nsDiskCacheRecord::DataFileSize() overflows. In nsDiskCacheStreamIO::UpdateFileSize() we decrement the size modulo 64MiB but increment the real size. So although the file is only 69MiB, we evict the whole cache, which causes the hang. IMO the best solution for this problem is to ignore anything over 65535KiB.
Comment 12 Michal Novotny (:michal) 2011-04-30 14:58:34 PDT
Created attachment 529317 [details] [diff] [review]
fix
Comment 13 Bjarne (:bjarne) 2011-05-01 13:11:26 PDT
Comment on attachment 529317 [details] [diff] [review]
fix

Looks reasonable but please add a test.
Comment 14 Jason Duell [:jduell] (needinfo me) 2011-05-02 17:53:21 PDT
Comment on attachment 529317 [details] [diff] [review]
fix

Some questions:

The 5 MB limit for disk cache object is quite recent (as of ff 4).  I'm puzzled as to how we run into issues now that we've re-enabled > 5MB files (when stored to separate file), and this breaks things, when before FF4 we'd routinely store all kinds of things >5MB.  

The patch to fix this basically seems to be "treat anything >64K as if it were 65K"?  I don't know enough about our cache internals to know if that's a reasonable hack or not.  It seems weird to me (at a minimum won't it mean we can't enforce total cache size limits if we're counting potentially huge files as 64K?).

Am I correct in thinking that the stack trace for this bug indicates that we're doing the I/O to evict entries from OnDataAvailable, i.e. we're still doing I/O on the main thread?  I've been telling the fennec guys we now do all I/O off-main-thread, but it sounds like that's not true...

> #define kSeparateFile      0
>-// #define must always  be <= 65535KB, or overflow. See bug 443067 Comment 8
>+#define kMaxDataSizeK      0xFFFF  // eFileSizeMask >> eFileSizeOffset
> #define kMaxDataFileSize   5 * 1024 * 1024  // 5 MB (in bytes) 
> #define kBuckets           (1 << 5)    // must be a power of 2!


Add a comment describing kMaxDataSizeK?
Comment 15 Michal Novotny (:michal) 2011-05-03 14:01:04 PDT
(In reply to comment #14)
> The 5 MB limit for disk cache object is quite recent (as of ff 4).  I'm puzzled
> as to how we run into issues now that we've re-enabled > 5MB files (when stored
> to separate file), and this breaks things, when before FF4 we'd routinely store
> all kinds of things >5MB.

You need to differentiate between storing as a separate file because of file size and storing as a separate file upon request. We allow >5MB only in the latter case. We set the flag at these places:
http://mxr.mozilla.org/mozilla-central/ident?i=cacheAsFile&tree=mozilla-central&filter=


> The patch to fix this basically seems to be "treat anything >64K as if it were
> 65K"?  I don't know enough about our cache internals to know if that's a
> reasonable hack or not.  It seems weird to me (at a mimumum won't it mean we
> can't enforce total cache size limits if we're counting potentially huge files
> as 64K?).

It is 64M and not 64K, so IMO not a big problem. I agree that this is a hack, but the proper solution would be to stat the file to find out the real size when it is >64M, which would affect the performance.


> Am I correct in thinking that the stack trace for this bug indicates that we're
> doing the I/O to evict entries from OnDataAvailable, i.e. we're still doing I/O
> on the main thread?  I've been telling the fennec guys we now do all I/O
> off-main-thread, but it sounds like that's not true...

We're doing IO on the main thread only in cases described above. See
http://hg.mozilla.org/mozilla-central/annotate/7528b2718827/netwerk/protocol/http/nsHttpChannel.cpp#l3169
Comment 16 Michal Novotny (:michal) 2011-05-05 04:35:28 PDT
Created attachment 530271 [details] [diff] [review]
patch v2

I've finally got a test that seems to be reliable. I've added a description for kMaxDataSizeK and updated the note for eFileSizeMask. I've also fixed another small bug in nsDiskCacheMap::EvictRecords() that I found while debugging the test.
Comment 17 Bjarne (:bjarne) 2011-05-10 02:48:19 PDT
Comment on attachment 530271 [details] [diff] [review]
patch v2

Review of attachment 530271 [details] [diff] [review]:
-----------------------------------------------------------------

Seems like there is consensus about the fact that this is a hack, and as a hack it seems reasonable, but I'd like to clarify one thing: What would be a proper solution? What does the hack achieve, and how does this relate to the desired/proper (if any) way to handle this?
Comment 18 Bjarne (:bjarne) 2011-05-12 02:07:58 PDT
Comment on attachment 530271 [details] [diff] [review]
patch v2

Forgot in previous step...  clear review-request flag and request feedback.
Comment 19 Michal Novotny (:michal) 2011-05-12 04:23:00 PDT
(In reply to comment #17)
> What would be a proper solution?

As already stated in paragraph 2 of comment 15, the proper solution would be to stat the file whenever nsDiskCacheRecord::DataFileSize() or nsDiskCacheRecord::MetaFileSize() returns some magic number (e.g. 0xFFFF) that signals that the file was too big to store the size in nsDiskCacheRecord.


> What does the hack achieve, and how does this relate to the desired/proper
> (if any) way to handle this?

I'm not sure I understand this question...
Comment 20 Bjarne (:bjarne) 2011-05-12 05:13:04 PDT
(In reply to comment #19)

> > What does the hack achieve, and how does this relate to the desired/proper
> > (if any) way to handle this?
> 
> I'm not sure I understand this question...

Since it's not a proper solution, how does its result compare to the result of a proper solution? Or: How does the effect of the hack deviate from the effect of a proper solution?
Comment 21 Michal Novotny (:michal) 2011-05-12 05:20:34 PDT
(In reply to comment #20)

> Since it's not a proper solution, how does its result compare to the result
> of a proper solution? Or: How does the effect of the hack deviate from the
> effect of a proper solution?

It differs only in performance because of stat() (hack is faster) and in cache size accuracy because of not counting anything above 64MiB (proper solution is closer to the real cache size).
Comment 22 Bjarne (:bjarne) 2011-05-13 04:28:33 PDT
Comment on attachment 530271 [details] [diff] [review]
patch v2

Review of attachment 530271 [details] [diff] [review]:
-----------------------------------------------------------------

In my book, a hack is a perfectly reasonable solution if we understand its limitations (compared to a "proper" solution) AND we are ok with the tradeoffs between these limitations and the wins of the hack.

If I understand you correctly, this hack trades improved performance for inaccurate cache-size in cases where cached files are larger than 64MiB. To me this is reasonable, but please add a comment at the appropriate place in the code to clarify this tradeoff and your choice.

A few questions about your test:

You set disk-cache capacity to 50MiB and then write a cache-entry >64MiB which is automatically doomed because it's too large, right? Wouldn't this happen also without your patch? Then you sync with cache-operations, verify that a small cache-entry can be cached, and assume this implies that the large entry has been evicted. Could you test that the large entry is evicted in a more direct way instead of this indirect way?

Next, you write cache-metadata >64MiB, but its entry is not automatically doomed and a small cache-entry must be written in order to make the cache evict the entry with large metadata? Is this expected and desired, or should we change this? And then, the same questions as above: Could you test that the large entry is evicted in a more direct way? Wouldn't this work also without your patch?
Comment 23 Michal Novotny (:michal) 2011-05-18 07:10:24 PDT
(In reply to comment #22)
> You set disk-cache capacity to 50MiB and then writes a cache-entry >64MiB
> which is automatically doomed because it's too large, right? Wouldn't this
> happen also without your patch? 

Well, actually the entry is doomed because the cache is full and not because the entry is too large. But yes, this happens without the patch too.


> Then you sync with cache-operations and verifies that a small cache-entry
> can be cached and assumes this implies that the large entry has been evicted.
> Could you test that the large entry is evicted in a more direct way instead
> of this indirect way?

Why? In fact, what I'm really interested in is whether the small cache entry can be cached. If the small entry can't be cached, it could be because

- the big entry wasn't for some reason evicted
- the cache size did overflow
- some other error...

In any case this is a valid test and the cache should always behave this way.


> Next, you write cache-metadata >64MiB, but it's entry is not automatically
> doomed and a small cache-entry must be written in order to make the cache
> evict the entry with large metadata? Is this expected and desired or should
> we change this? 

Right. The current code doesn't check cache fullness or cache entry size when writing metadata. We probably could fix this some day, but IMO it isn't a priority because the metadata is pretty small in real life...


> And then, the same questions as above: Could you test that the large entry
> is evicted in a more direct way?

As above... I don't see a reason for doing this. The main goal of this test is to test if cache size overflows in certain situations.

> Wouldn't this work also without your patch?

No, because of the cache size overflow...
Comment 24 Bjarne (:bjarne) 2011-05-19 03:28:07 PDT
Comment on attachment 530271 [details] [diff] [review]
patch v2

Review of attachment 530271 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to comment #23)
> In fact, what I'm really interested in is if the small cache entry can
> be cached. 

Really? I thought you wanted to verify that a large entry does not cause the cache size to overflow. Further below you write

> The main goal of this test
> is to test if cache size overflows in certain situations.

My point is that there is no direct test for that: it's tested indirectly by checking that a small entry can be cached. This may or may not be equivalent to cache size not overflowing, but that is not obvious to me.

To be clear: I'll accept that testing that a small entry can be cached after a large one is a valid test for this. As long as you have a clear understanding of how your test relates to your code-changes, I'm ok with it.

> Right. The current code doesn't check cache fullness or cache entry size
> when writing metadata. We probably could fix this some day, but IMO it isn't
> a priority because the metadata is pretty small in real life...

New bug? I assume writing metadata updates cache size properly so that size is in sync with reality...
Comment 25 Michal Novotny (:michal) 2011-05-19 17:29:00 PDT
Created attachment 533855 [details] [diff] [review]
patch v3

> To me this is reasonable, but please add a comment at the appropriate place
> in the code to clarify this tradeoff and your choice.

I've added a comment to nsDiskCacheMap.h.


> My point is that there is no direct test for that: it's tested indirectly by
> checking that a small entry can be cached. This may or may not be equivalent
> to cache size not overflowing, but that is not obvious to me.

Hmm, you're probably right. I've changed the test and now it checks nsICacheDeviceInfo.totalSize directly.


> New bug? I assume writing metadata updates cache size properly so that size
> is in sync with reality...

The cache size is in sync. We just don't check cache fullness and metadata size like we do for data changes in OnDataSizeChange()...
Comment 26 Bjarne (:bjarne) 2011-05-20 02:46:03 PDT
Comment on attachment 533855 [details] [diff] [review]
patch v3

Review of attachment 533855 [details] [diff] [review]:
-----------------------------------------------------------------

See comment about proposed extra text below. Assuming this passes tryserver, r=me

::: netwerk/cache/nsDiskCacheMap.h
@@ +103,5 @@
>  #define kBuckets           (1 << 5)    // must be a power of 2!
>  
> +// Maximum size in K which can be stored in the location (see eFileSizeMask).
> +// Both data and metadata can be larger, but only up to kMaxDataSizeK can be
> +// counted into total cache size.

Suggested additional text: "I.e. if there are entries where either data or metadata is larger than kMaxDataSizeK, the total cache size will be inaccurate (smaller than the actual cache size). The alternative is to stat the files to find the real size, which was decided against for performance reasons. See bug #651100 comment #21."
Comment 27 Michal Novotny (:michal) 2011-05-23 13:16:19 PDT
Created attachment 534535 [details] [diff] [review]
patch v4

More verbose comment on kMaxDataSizeK.

Tryserver is as green as possible. There are some failures on Android and Maemo, but those platforms seem to have some problems these days.
Comment 28 Jason Duell [:jduell] (needinfo me) 2011-05-23 14:20:10 PDT
Comment on attachment 534535 [details] [diff] [review]
patch v4

carrying forward bjarne's +r
Comment 29 Jason Duell [:jduell] (needinfo me) 2011-05-23 14:40:16 PDT
Created attachment 534578 [details] [diff] [review]
minor fix: xpcshell test needs manifest entry
Comment 30 Jason Duell [:jduell] (needinfo me) 2011-05-23 18:52:09 PDT
http://hg.mozilla.org/mozilla-central/rev/f7fc3beccae7
Comment 31 Honza Bambas (:mayhemer) 2011-10-20 06:38:33 PDT
I figured out recently that the Flash player on YouTube caches the video as it is being downloaded (not through our cache).  I discovered this while watching a 30min long HD (720p) video, when my old Windows XP notebook with a small disk suddenly started to complain about low capacity on my C: drive.  So I started to look for the cause and found a huge file kept by the plugin process that disappears right when I close the tab.

I'm not sure whether we have any mechanism that watches the remaining disk capacity at run-time and, if it is low, evicts entries from the cache. Raising this here as it could potentially be related..

Michal, do you know about anything like that?
Comment 32 Michal Novotny (:michal) 2011-10-20 07:50:46 PDT
What was the name of the file and where was it located? I guess that it was created by nsPluginStreamListener, see http://hg.mozilla.org/mozilla-central/annotate/67673422f7d2/dom/plugins/base/nsPluginStreamListenerPeer.cpp#l446

The disk cache isn't used when nsHttpChannel::SetCacheAsFile() fails or when there is any encoding of the content, see nsPluginStreamListenerPeer::SetUpStreamListener().
