Closed Bug 1397128 Opened 2 years ago Closed 2 years ago

Crash on www.ft.com

Categories

(Core :: DOM: Core & HTML, defect, P1)

57 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox-esr52 --- disabled
firefox55 - wontfix
firefox56 + wontfix
firefox57 + fixed

People

(Reporter: c.justin88, Assigned: bkelly)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(13 files, 28 obsolete files)

193.54 KB, text/plain
Details
3.45 KB, patch
Details | Diff | Splinter Review
2.17 KB, patch
Details | Diff | Splinter Review
4.68 KB, patch
Details | Diff | Splinter Review
4.74 KB, patch
Details | Diff | Splinter Review
8.50 KB, patch
Details | Diff | Splinter Review
7.72 KB, patch
Details | Diff | Splinter Review
3.56 KB, patch
bkelly
: feedback+
Details | Diff | Splinter Review
16.34 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
9.57 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
5.82 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
4.88 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
15.50 KB, patch
Details | Diff | Splinter Review
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0
Build ID: 20170905220108

Steps to reproduce:

In Firefox 55.02+ (including 56 and 57), open a fresh Firefox profile, go to www.ft.com and click on some of the pages and scroll.

Using Fedora 26 with Wayland enabled and Intel integrated graphics with the stock (i915) driver. Kernel 4.12.9-300.fc26.x86_64. 


Actual results:

The tab will crash or Firefox will crash entirely.


Expected results:

No crashing should happen. Stable in Firefox 54.
Please provide the crash ID from about:crashes : https://developer.mozilla.org/en/How_to_get_a_stacktrace_for_a_bug_report explains how to do this. When doing so, please also add the keyword "crashreportid" to the "Keywords" field of this report.
For more information, see http://support.mozilla.com/kb/Firefox%20crashes
Flags: needinfo?(c.justin88)
This looks like another instance of fd exhaustion, but I'm not sure what's causing it.  With multiprocess disabled, it becomes much harder to crash the browser.  Instead, it'll start giving lots of graphical glitches once all fds are used up.  This is accompanied by a flood of three different error messages to the terminal:

IPDL protocol error: [PBackgroundChild] Received an invalid file descriptor!
[32557] WARNING: file /builds/worker/workspace/build/src/ipc/chromium/src/base/shared_memory_posix.cc, line 216
[GFX1-]: Failed 2 buffer db=0 dw=0 for 0, 0, 1280, 972

However, I *did* manage to crash twice with multiprocess disabled.  Unfortunately the crash reporter only managed one of them:

https://crash-stats.mozilla.com/report/index/79fbe6b9-4848-48a3-bd3a-6ced10170906
FWIW blocking https://www.ft.com/__assets/hashed/n-ui/6049dca4/es5.js with UBO seems to eliminate the issue.
Is there supposed to be some limit or throttle on file access via the cache API?
(In reply to Andre Klapper from comment #1)
> Please provide the crash ID from about:crashes :
> https://developer.mozilla.org/en/How_to_get_a_stacktrace_for_a_bug_report
> explains how to do this. When doing so, please also add the keyword
> "crashreportid" to the "Keywords" field of this report.
> For more information, see http://support.mozilla.com/kb/Firefox%20crashes


Crash ID from crashing of Firefox 57.0a1 (2017-09-07) entirely: de16fb0a-b8d7-4baf-a191-7a4330170908

(Sometimes the page only crashes the tab, so I don't get a crash report)
Flags: needinfo?(c.justin88)
(In reply to Henri Kemppainen from comment #3)
> This looks like another instance of fd exhaustion, but I'm not sure what's
> causing it.  With multiprocess disabled, it becomes much harder to crash the
> browser.  Instead, it'll start giving lots of graphical glitches once all
> fds are used up.  This is accompanied by a flood of three different error
> messages to the terminal:
> 
> IPDL protocol error: [PBackgroundChild] Received an invalid file descriptor!
> [32557] WARNING: file
> /builds/worker/workspace/build/src/ipc/chromium/src/base/shared_memory_posix.
> cc, line 216
> [GFX1-]: Failed 2 buffer db=0 dw=0 for 0, 0, 1280, 972
> 
> However, I *did* manage to crash twice with multiprocess disabled. 
> Unfortunately the crash reporter only managed one of them:
> 
> https://crash-stats.mozilla.com/report/index/79fbe6b9-4848-48a3-bd3a-
> 6ced10170906

I can confirm that I have also experienced graphical glitches (black square and page tearing). This is usually followed by a complete crash of Firefox.

Can we have an expert triage this bug and assign it the appropriate status flags?
Crash Signature: [@ mozilla::ipc::BackgroundChildImpl::ProcessingError]
Resource exhaustion in mozilla::dom::cache?
Crash Signature: [@ mozilla::ipc::BackgroundChildImpl::ProcessingError]
Component: Untriaged → DOM
Product: Firefox → Core
Crash Signature: [@ mozilla::ipc::BackgroundChildImpl::ProcessingError ]
Maybe related to bug 1325918?
Flags: needinfo?(bhsu)
Sorry for not being working on bug 1325918 recently.

Forwarding the NI? to a more knowledgeable person :)
Flags: needinfo?(bhsu) → needinfo?(kchen)
I can easily reproduce this on Linux, though it's not clear to me where the file descriptors are used. Maybe Andrea or Ben knows.
Flags: needinfo?(kchen)
Flags: needinfo?(bkelly)
Flags: needinfo?(amarchesini)
It can be a duplicate of bug 1398635.
Flags: needinfo?(amarchesini)
No, it's not. bug 1398635 is generated by bug 1397627 that landed the 9/8.
From comment 4 its pretty clear Firefox is holding open a bunch of Cache API files.  This could be a leak or it could be caused by script.

The way Cache API uses these files is:

1. Site makes a Cache API call in the child process.
2. Request goes to the parent process and we find the file.
3. Open the file in the parent process and send the resulting file descriptor back to the child process.
4. The fd remains open until the Response is read by script or is GC'd

So if the site just holds on to thousands of Response objects without reading them then they won't be GC'd and the fds won't be closed.

Since this appears to reproduce in linux we could instrument Response or InternalResponse to try to determine if this is happening.
Flags: needinfo?(bkelly)
BTW, I don't see any cache API usage in the script linked in comment 5.
> BTW, I don't see any cache API usage in the script linked in comment 5.

Yep, that script was just an intermediate step to get us to the real problem.

Actual cache use is in the service worker script at https://www.ft.com/__sw-prod.js
Hmm.  If its only in the service worker then it would need to be a lot of Responses in a short period.  We kill the service worker when its idle, so that should free any in-memory responses.

I wonder if we have a test account for ft.com.  I don't have an account, so can't scroll around articles.
> Hmm.  If its only in the service worker then it would need to be a lot of
> Responses in a short period.  We kill the service worker when its idle, so
> that should free any in-memory responses.
> 
> I wonder if we have a test account for ft.com.  I don't have an account, so
> can't scroll around articles.

Clicking one of the main news sections (e.g. Home, Companies, Opinion) right under the heading will result in the opening of hundreds of fds.  They seem to be closed some 15 to 30 seconds after opening.  So clicking a few of these sections in a quick succession is all it takes to bring the browser on its knees, at least with the default fd limit of 1024 on Fedora.
Do you have any addons installed?  Does running in safe mode change anything for you?

https://support.mozilla.org/en-US/kb/troubleshoot-firefox-issues-using-safe-mode

I'm just curious if some addon is triggering additional network requests and service worker FetchEvents for the page.  I see the same files open many times in the lsof log you posted.
No addons, I've been testing under a fresh profile in /tmp.  Safe mode does not change behavior.
Tom do you have any time to look at this?

As a first step, we need to try to reproduce it.  I'm curious which situation we are hitting:

a. The Response objects are being used and collected normally.  We just have so many alive at once we are hitting the file descriptor limit.
b. The Response objects are not being GC'd when they should and being leaked until the worker is killed.  This would inflate how many file descriptors we keep alive at once.
c. The file descriptors are being leaked past the end of life of the Response objects.

I'm hoping we are in case (a).  If that is so, then I think we should:

1. Write a test that opens thousands of Cache API Response objects simultaneously to provoke the condition.  See if this triggers problems on mac/windows in addition to linux.
2. Design and implement some "deferred open" nsIInputStream that wraps the nsFileInputStream.  This class would send an IPC message to the parent on first read to open the file and pass back the underlying nsFileInputStream.
3. Make Cache API track use "deferred open" streams after it hits some limit on currently open streams.  Something like 256 or 512, etc.

What do you think?
Flags: needinfo?(ttung)
I also wonder if this could explain some perf issues I've seen on android.
(In reply to Ben Kelly [:bkelly] from comment #22)
> 3. Make Cache API track use "deferred open" streams after it hits some limit
> on currently open streams.  Something like 256 or 512, etc.

Or more likely, just use this deferred open mode by default at first.  If there is a perf hit we can allow the first "N" streams to bypass it later.
When considering a deferred implementation, it's worth noting that IPCBlobInputStream already does something like this.  It only causes the parent to serialize the stream across (StreamNeeded request, StreamReady response) when its nsIAsyncInputStream::AsyncWait() or nsIAsyncFileMetadata::AsyncWait() methods are invoked.
(In reply to Ben Kelly [:bkelly] from comment #22)
> Tom do you have any time to look at this?

Yes, I'll work on this!

> As a first step, we need to try to reproduce it.  I'm curious which
> situation we are hitting:
> 
> a. The Response objects are being used and collected normally.  We just have
> so many alive at once we are hitting the file descriptor limit.
> b. The Response objects are not being GC'd when they should and being leaked
> until the worker is killed.  This would inflate how many file descriptors we
> keep alive at once.
> c. The file descriptors are being leaked past the end of life of the
> Response objects.
> 
> I'm hoping we are in case (a).  If that is so, then I think we should:
> 
> 1. Write a test that opens thousands of Cache API Response objects
> simultaneously to provoke the condition.  See if this triggers problems on
> mac/windows in addition to linux.
> 2. Design and implement some "deferred open" nsIInputStream that wraps the
> nsFileInputStream.  This class would send an IPC message to the parent on
> first read to open the file and pass back the underlying nsFileInputStream.
> 3. Make Cache API track use "deferred open" streams after it hits some limit
> on currently open streams.  Something like 256 or 512, etc.
> 
> What do you think?

Thanks for providing hints! I'll start from reproducing it to see if it's case (a).

(In reply to Andrew Sutherland [:asuth] from comment #25)
> When considering a deferred implementation, it's worth noting that
> IPCBlobInputStream already does something like this.  It only causes the
> parent to serialize the stream across (StreamNeeded request, StreamReady
> response) when its nsIAsyncInputStream::AsyncWait() or
> nsIAsyncFileMetadata::AsyncWait() methods are invoked.

Thanks for the hint! It does save my time to investigate it
Assignee: nobody → ttung
Flags: needinfo?(ttung)
Keywords: regression
Priority: -- → P1
I don't think we have any evidence this is a regression yet.

Tom, I also had another thought:  If the service worker is passing these Cache API Response objects to respondWith(), they should be getting closed pretty quickly.  Is that not happening for some reason?  Thanks for investigating!
(In reply to Ben Kelly [:bkelly] from comment #27)
> I don't think we have any evidence this is a regression yet.

I was thinking along the lines of "ft.com didn't used to crash"-type regression.
I can reproduce the crash by debug build on both my desktop (Ubuntu 14.04) and my MacBookPro (Sierra 10.12.5). However, it seems to have more than one issue which make this website crash.

On Ubuntu:
It's relativly easy to reproduce the crash (by randomly click main news sections around 4~5 times). 
And, I got:
Hit MOZ_CRASH(MsgDropped: Channel error: cannot send/recv) at /home/tom/mozilla/inbound/ipc/glue/BackgroundChildImpl.cpp:152

On Mac:
First, I got several messages which are related to not be able to open file (hit the fd limit?):
[Parent 42184] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file /Mozilla/Work/m-i/dom/quota/FileStreams.cpp, line 56
[Parent 42184] WARNING: NS_ENSURE_SUCCESS(rv, nullptr) failed with result 0x80004005: file /Mozilla/Work/m-i/dom/quota/FileStreams.cpp, line 104
[Parent 42184] WARNING: '!fileStream', file /Mozilla/Work/m-i/dom/cache/FileUtils.cpp, line 270
[Parent 42184] WARNING: 'NS_FAILED(rv)', file /Mozilla/Work/m-i/dom/cache/Manager.cpp, line 548
[Parent 42184] WARNING: 'aRv.Failed()', file /Mozilla/Work/m-i/dom/cache/CacheOpParent.cpp, line 170
[Parent 42184] WARNING: Converting non-IndexedDB error code (0x80520015) to NS_ERROR_DOM_INDEXEDDB_UNKNOWN_ERR: file /Mozilla/Work/m-i/dom/indexedDB/ActorsParent.cpp, line 592
[Child 42188] WARNING: 'aRv.Failed()', file /Mozilla/Work/m-i/dom/cache/CacheOpChild.cpp, line 113
[Parent 42184] WARNING: Converting non-IndexedDB error code (0x80520015) to NS_ERROR_DOM_INDEXEDDB_UNKNOWN_ERR: file /Mozilla/Work/m-i/dom/indexedDB/ActorsParent.cpp, line 592
[Parent 42184] WARNING: Converting non-IndexedDB error code (0x80520015) to NS_ERROR_DOM_INDEXEDDB_UNKNOWN_ERR: file /Mozilla/Work/m-i/dom/indexedDB/ActorsParent.cpp, line 592

Then, I hit the assertion below, it seems that we have another issue on Necko?
Assertion failure: channel == mChannel, at /Mozilla/Work/m-i/dom/workers/ServiceWorkerScriptCache.cpp:853
#01: mozilla::dom::workers::serviceWorkerScriptCache::(anonymous namespace)::CompareNetwork::OnStartRequest(nsIRequest*, nsISupports*)[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0x495cabe]
#02: non-virtual thunk to mozilla::dom::workers::serviceWorkerScriptCache::(anonymous namespace)::CompareNetwork::OnStartRequest(nsIRequest*, nsISupports*)[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0x495cd04]
#03: mozilla::net::nsStreamLoader::OnStartRequest(nsIRequest*, nsISupports*)[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0x44839e]
#04: mozilla::net::HttpChannelChild::DoOnStartRequest(nsIRequest*, nsISupports*)[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0x937ca0]
#05: mozilla::net::HttpChannelChild::OnStartRequest(nsresult const&, mozilla::net::nsHttpResponseHead const&, bool const&, mozilla::net::nsHttpHeaderArray const&, bool const&, bool const&, int const&, unsigned int const&, nsTString<char> const&, nsTString<char[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0x93e0b0]
#06: mozilla::net::StartRequestEvent::Run()[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0x9c068a]
#07: mozilla::net::ChannelEventQueue::RunOrEnqueue(mozilla::net::ChannelEvent*, bool)[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0x8843d2]
#08: mozilla::net::HttpChannelChild::RecvOnStartRequest(nsresult const&, mozilla::net::nsHttpResponseHead const&, bool const&, mozilla::net::nsHttpHeaderArray const&, bool const&, bool const&, int const&, unsigned int const&, nsTString<char> const&, nsTString<[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0x93d5df]
#09: mozilla::net::PHttpChannelChild::OnMessageReceived(IPC::Message const&)[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0xe813c2]
#10: mozilla::dom::PContentChild::OnMessageReceived(IPC::Message const&)[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0x1422287]
#11: mozilla::dom::ContentChild::OnMessageReceived(IPC::Message const&)[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0x4802b07]
#12: mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&)[/Mozilla/Work/m-i/obj-x86_64-apple-darwin16.6.0/dist/NightlyDebug.app/Contents/MacOS/XUL +0xccdaf3]
(In reply to Ben Kelly [:bkelly] from comment #27)
> Tom, I also had another thought:  If the service worker is passing these
> Cache API Response objects to respondWith(), they should be getting closed
> pretty quickly.  Is that not happening for some reason?  Thanks for
> investigating!

Honestly, I have no idea about this for now. I'll dig into it more. Thanks for providing information!
(In reply to Tom Tung [:tt] (OoO: 9/22 ~ 9/29) from comment #29)
Note: 
I check out the soft fd limit on my machine by "ulimit -n"
The result on my desktop Ubuntu: 1024
The result on my MacBook: 7168

Besides, setting fd limit to a bigger number (4096) on Ubuntu does make Firefox harder(need to click items more times) to crash on www.ft.com
I create a test for trying to reproduce this issue. However, I can only get lots of warning as following rather than crash.

GECKO(20038) | [Parent 20038] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file /home/tom/mozilla/inbound/dom/quota/FileStreams.cpp, line 56
GECKO(20038) | [Parent 20038] WARNING: NS_ENSURE_SUCCESS(rv, nullptr) failed with result 0x80004005: file /home/tom/mozilla/inbound/dom/quota/FileStreams.cpp, line 119
GECKO(20038) | [Parent 20038] WARNING: '!fileStream', file /home/tom/mozilla/inbound/dom/cache/FileUtils.cpp, line 191
GECKO(20038) | [Parent 20038] WARNING: 'NS_FAILED(rv)', file /home/tom/mozilla/inbound/dom/cache/Manager.cpp, line 944
GECKO(20038) | [Parent 20038] WARNING: 'NS_FAILED(rv)', file /home/tom/mozilla/inbound/dom/cache/Manager.cpp, line 722
GECKO(20038) | [Parent 20038] WARNING: 'aRv.Failed()', file /home/tom/mozilla/inbound/dom/cache/CacheOpParent.cpp, line 170
GECKO(20038) | [Child 20095] WARNING: 'aRv.Failed()', file /home/tom/mozilla/inbound/dom/cache/CacheOpChild.cpp, line 113
> I create a test for trying to reproduce this issue. However, I can only get
> lots of warning as following rather than crash.

The common crash signatures from me and Justin show a stack frame from CacheMatchAction.  So I guess you would have to be using cache.match to reproduce that particular crash?  But I'll note that even if that particular call didn't crash, there's plenty other code that will fail under fd pressure.  So it's not just that one crash that is problematic, but fd exhaustion in general.
FWIW I have this behaviour on ft.com on both Android and macOS 55.0.2/3. However macOS version never crashes and only hangs with all network activity in all tabs paused for a minute or so. Android though has graphics glitches and crashes without triggering the reporter. Either way no crash report is ever generated.

While firefox is hanging on macOS lsof shows lots (like well over 2000) open handles to: /Users/nick/Library/Application Support/Firefox/Profiles/ar7ybhg8.default/storage/default/https+++www.ft.com/cache/morgue/(someId)/{some-UUID}.final

For comparison ulimit -n on my Mac is 256 (no idea why it's so low) and 1024 on the Android device.
(In reply to Henri Kemppainen from comment #33) 
> The common crash signatures from me and Justin show a stack frame from
> CacheMatchAction.  So I guess you would have to be using cache.match to
> reproduce that particular crash?  But I'll note that even if that particular
> call didn't crash, there's plenty other code that will fail under fd
> pressure.  So it's not just that one crash that is problematic, but fd
> exhaustion in general.

Thanks for the information! 

You are right. I create a simple test in order to reproduce this issue, so I can narrow down the scope (ensuring there are no other things happen at the same time) and verify if I fix this issue easily. I'll try to testing cache.match or other API that cause crash while fd exhaustion. 

(In reply to nickl from comment #34)
> FWIW I have this behaviour on ft.com on both Android and macOS 55.0.2/3.
> However macOS version never crashes and only hangs with all network activity
> in all tabs paused for a minute or so. Android though has graphics glitches
> and crashes without triggering the reporter. Either way no crash report is
> ever generated.
> 
> While firefox is hanging on macOS lsof shows lots (like well over 2000) open
> handles to: /Users/nick/Library/Application
> Support/Firefox/Profiles/ar7ybhg8.default/storage/default/https+++www.ft.com/
> cache/morgue/(someId)/{some-UUID}.final
> 
> For comparison ulimit -n on my Mac is 256 (no idea why it's so low) and 1024
> on the Android device.

Thanks for the information as well!
(In reply to Tom Tung [:tt] (OoO: 9/22 ~ 9/29) from comment #29)
> Then, I hit the assertion below, it seems that we have another issue on
> Necko?
> Assertion failure: channel == mChannel, at
> /Mozilla/Work/m-i/dom/workers/ServiceWorkerScriptCache.cpp:853

We probably need a similar fix to what I just pushed for bug 1398484.  This assertion does not permit redirects in importScripts() which is bogus.
(In reply to Ben Kelly [:bkelly] from comment #36)
> We probably need a similar fix to what I just pushed for bug 1398484.  This
> assertion does not permit redirects in importScripts() which is bogus.

I see. I'll try again with your patches.
(In reply to Tom Tung [:tt] (OoO: 9/22 ~ 9/29) from comment #37)
> I see. I'll try again with your patches.

Ok, but you also need to modify this line:

http://searchfox.org/mozilla-central/source/dom/workers/ServiceWorkerScriptCache.cpp#853

To be `MOZ_ASSERT_IF(mIsMainScript, channel == mChannel)`.
Note, if we go the deferred-open route, I think the easiest thing to do would be to add the IPC messaging to this existing actor:

http://searchfox.org/mozilla-central/source/dom/cache/PCacheStreamControl.ipdl#13

The Cache returned streams already implement a custom stream class that integrates with that actor.

Tom, I know you are going on PTO shortly.  If you have time to write a test, I can try to add this IPC messaging if you don't get to it.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Andrew and I spoke, and given the short timeline I'm going to try to implement the messaging I mention in comment 39.
These are some work-in-progress patches.  They are not functional yet.  Its maybe 50% complete.
P5 Add ReadStream::Inner::EnsureStream() to synchronously open stream when Availabe/Read/ReadSegments is called. r=asuth
Attachment #8907901 - Attachment description: bug1397128_p5_ensurestream.patch → P5 Add ReadStream::Inner::EnsureStream() to synchronously open stream when Availabe/Read/ReadSegments is called. r=asuth
Attached patch wip.patch (obsolete) — Splinter Review
Still a work-in-progress, but with this patch I can run the dom/cache mochitests locally with some amount of deferred opening.  This proves that its feasible.  I need to clean up the patches some more.

Note, one big restriction is that cache API ReadStream objects can't be read on the main IPC owning thread.  We will MOZ_CRASH() if someone tries that.  This *should* be ok because we really shouldn't be doing any sync I/O on these threads anyway.
(In reply to Ben Kelly [:bkelly] from comment #39)
> Note, if we go the deferred-open route, I think the easiest thing to do
> would be to add the IPC messaging to this existing actor:
> 
> http://searchfox.org/mozilla-central/source/dom/cache/PCacheStreamControl.
> ipdl#13
> 
> The Cache returned streams already implement a custom stream class that
> integrates with that actor.
> 
> Tom, I know you are going on PTO shortly.  If you have time to write a test,
> I can try to add this IPC messaging if you don't get to it.

Thanks for consideration! I'll try to finish the test. Right now, attachment 8907478 [details] [diff] [review] can run out of fds on my desktop(Ubuntu). I'll figure out how to make it crash :p
Tom, I did not observe any remarkable fd use with your test (it stays under 200).  However, with a little addition (use cache.match() to mass-open the cached responses), I can get it to reach the limit and crash on my end.  I also had to bump the fd count to make it a more reliable crasher.
(In reply to Henri Kemppainen from comment #50)
> Created attachment 8907997 [details] [diff] [review]
> test-cache-tons-of-fd-tweaked.patch
> 
> Tom, I did not observe any remarkable fd use with your test (it stays under
> 200).  However, with a little addition (use cache.match() to mass-open the
> cached responses), I can get it to reach the limit and crash on my end.  I
> also had to bump the fd count to make it a more reliable crasher.

Thanks for the help! I have a similar patch as yours in my local machine (cache.put() and then cache.match(), but I don't bump the fd count to 4096). However, it's not always crashing. Most of the time it reached the test time out, but I did get warning like "Warning: failed to duplicate file handle for current process" from ipc/glue/FileDescriptor.cpp, line 175.
Still WIP patch.
I cannot ensure it always crash or test time out. 

I make sure it does run out of fd by "lsof -p $pid | wc -l". And, I can get warnings after : "[Child 15598] WARNING: Failed to duplicate file handle for current process!: file /home/tom/mozilla/inbound/ipc/glue/FileDescriptor.cpp, line 175".
Attachment #8907478 - Attachment is obsolete: true
(In reply to Tom Tung [:tt] (OoO: 9/22 ~ 9/29) from comment #52)
> I cannot ensure it always crash or test time out. 

I ran this a couple dozen times.  Not a single crash.  I also uncommented the idb open, but that doesn't crash it either.  Fd use does indeed reach the limit though.
In addition to opening stream lazily I think we should probably throttle them at some point.  So if Cache API or a specific origin with Cache API tries to open more than N bodies then we block further opens until previous ones complete.
Updated to handle shutdown and premature closing better.
Attachment #8907901 - Attachment is obsolete: true
This patch set will make all content script usage of Cache API use lazy body opening.  Internal chrome-only usage, like service worker scripts, will use eager body opening.  I partly want to keep chrome-only eager to help ensure that code patch continues to work.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=12c0aaae7d03dbfc3ac0ec2bf2d4a9247ee0047f
Note, I haven't implemented any throttling of total number of open files yet.
Try looks green.  I'm going to see if I can reproduce the crash with/without the patches.
I'm having trouble reproducing, but I see why ft.com is triggering this behavior.  It seems it frequently tries to expire all Response objects in its Cache.  It does:

1. Cache.keys() to get all requests
2. Cache.match() on each request
3. Checks the Response timestamp using headers or something
4. Delete the response if its expired, otherwise let it the Response object get GC'd

In the case where its not expired (most of the time) we end up with a lot of Response objects waiting to be GC'd.  We don't aggressively GC on workers, so typically these live until we time out the service worker for being idle.
So the good news I think the current patch set is adequate and we don't need to throttle opens.  This is good because I was worrying about potential deadlocks when opening a lot of streams through our limited sized StreamTransportService thread pool.
Henri, can you reproduce with this linux build?

https://queue.taskcluster.net/v1/task/Os4IQMw_QJCGL4PkQsp4Zg/runs/0/artifacts/public/build/target.tar.bz2
Flags: needinfo?(mozilla)
Given that its hard to trigger the crash in a test I think I'm going to go ahead with just this patch set for now.  Adding to the test difficulty is that the file descriptor limit in automation is not well controlled.  Its basically set by the environment hosting our docker image which may change over time.
And I've pretty much stolen this bug.  Sorry Tom!  I'll make it up to you by sending you all the reviews. :-)
Assignee: ttung → bkelly
Status: NEW → ASSIGNED
(In reply to Ben Kelly [:bkelly] from comment #64)
> Henri, can you reproduce with this linux build?

It appears to be fixed.  ft.com is stable and performant, and fd count stays under 300.
Flags: needinfo?(mozilla)
Maybe instead of crashing we can make the test check that the body can be read correctly on each match().  Maybe do 5000 cache.match() for the same Request and store the Responses in an array.  Then read their bodies one by one to make sure they can be consumed correctly.  I think this test would fail on linux/mac/android today.
Comment on attachment 8908284 [details] [diff] [review]
P1 Make CacheReadStream ipdl struct contain an OptionalIPCStream. r=tt

Review of attachment 8908284 [details] [diff] [review]:
-----------------------------------------------------------------

Tom, this patch just makes the CacheReadStream IPC structure capable of passing a void stream instead of requiring it to be populated every time.  No functional change.
Attachment #8908284 - Flags: review?(ttung)
I have to run now, but will continue flagging these for review later today.

Henri, thanks for testing!
This patch refactors StreamList::Add() a bit and makes it possible to pass a nullptr stream.
Attachment #8908285 - Attachment is obsolete: true
Attachment #8908360 - Flags: review?(ttung)
This patch adds a Manager::ExecuteOpenStream() method and a backing OpenStreamAction.

Note, this code takes a Manager::Listener, but we actually signal completion through a std::function callback object.  This is a little nicer because it lets us use a lambda in the caller.  We still must pass the Listener, though, in order to signal if the caller is still alive.

Ideally we would unify the Listener/lambda approach in the future, but I don't think this is too bad for now.  I tried to add comments to clarify what is going on.

The other reason I wanted to use a lambda here was it fits better with the new IPDL async returns generated code.  This lets us avoid building a new "operation" actor for this kind of asynchronous rpc-type call.
Attachment #8908286 - Attachment is obsolete: true
Attachment #8908363 - Flags: review?(ttung)
This patch adds the IPC messaging to PCacheStreamControl.  We use the new async returns syntax.  This produces a MozPromise as the return on the caller side and produces a resolver function on the receiver side.

Note, we have to copy the `this` pointer for the actor into the lambda we pass to the Manager.  Normally this would not be safe because and ActorDestroy could come in at any time and delete the object.  In this case, though, we also pass the actor as a Listener and clear it from the Listener list when the actor is destroyed.  This ensures the Manager will only "complete" the operation and call the lambda if the actor is still alive.

Again, we should consider refactoring the Listener machinery in the future to better fit lambda like this.

Also note we must provide a base class StreamControl::OpenStream() in both the child and the parent.  This method can be called in both processes.  In the child we start the async IPC messaging as a result.  In the parent we start the Manager::ExecuteOpenStream operation.
Attachment #8908287 - Attachment is obsolete: true
Attachment #8908369 - Flags: review?(ttung)
This patch modifies ReadStream::Inner to synchronously open the stream if its inner mStream/mSnappyStream attributes are nullptr.

It does this by using a mutex and condition variable to block the current thread while the async IPC/open operation is running.  Since these streams are already blocking on file I/O, it seems ok to perform this blocking.

Note, however, this means we can't support reading these streams on the main js owning thread.  If code tried that it would deadlock on the condition variable wait.  We prevent this by checking the current thread and triggering a MOZ_CRASH() if called on the wrong thread.  This should be reasonable to do because we really shouldn't be using these file blocking streams on the main js threads anyway.  Its bad for UX.

If anything goes wrong we end up creating an empty string stream that is immediately closed.

If the actor is destroyed we create the empty string stream immediately, close it, and signal the condition variable to wake up any listeners.  This should avoid the possibility of anyone waiting forever.
Attachment #8908288 - Attachment is obsolete: true
Attachment #8908370 - Flags: review?(ttung)
This patch fixes various assertions and checks so that we can pass a nullptr stream through to the ReadStream in the child.
Attachment #8908289 - Attachment is obsolete: true
Attachment #8908371 - Flags: review?(ttung)
This creates an enumeration that can be set in the operation args to indicate if bodies should be eagerly or lazily loaded.  I want to eager load bodies for service worker scripts, but lazy load bodies for content script usage of Cache API for now.
Attachment #8908290 - Attachment is obsolete: true
Attachment #8908372 - Flags: review?(ttung)
This makes us pass the Cache API "namespace" back with the opened Cache object.  We need this to track the namespace on the Cache object so we know whether to eager or lazy load bodies.
Attachment #8908291 - Attachment is obsolete: true
Attachment #8908373 - Flags: review?(ttung)
Finally, set BodyLoad::Eager for chrome-only namespace Cache operations.  All other requests get BodyLoad::Lazy.
Attachment #8908292 - Attachment is obsolete: true
Attachment #8908374 - Flags: review?(ttung)
thanks for the detective work Ben :-) We should probably severely throttle those calls as our cache TTLs are typically in days not milliseconds. And thanks for patching too
Thank you Ben, Tom, Henri, and Andrew for working so quickly on this. For over a month, I was reluctant to report the bug because I thought it would get swept under the rug/fixed as part of a bigger bug/oh it only affects me (nonsense reasons in retrospect). I'm very happy and impressed by the work being done on Firefox (especially with version 57). Thank you. I will report bugs in the future and hope to contribute to the development when I am a better programmer :)

Justin
Attachment #8908284 - Flags: review?(ttung) → review+
Attachment #8908360 - Flags: review?(ttung) → review+
Attachment #8908363 - Flags: review?(ttung) → review+
Comment on attachment 8908369 [details] [diff] [review]
P4 Add a PCacheStreamControl message to open a stream based on its body ID. r=tt

Review of attachment 8908369 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good! Maybe we could add one ~ two diagnostic asserts to ensure we won't get a null pointer from GetManager(), though we should be able to get the pointer to manager at this moment.

::: dom/cache/CacheStreamControlParent.cpp
@@ +83,5 @@
> +  // Make sure to add ourself as a Listener even thought we are using
> +  // a separate resolver function to signal the completion of the
> +  // operation.  The Manager uses the existence of the Listener to ensure
> +  // that its safe to complete the operation.
> +  mStreamList->GetManager()->ExecuteOpenStream(this, Move(aResolver), aId);

Perhaps MOZ_DIAGNOSTIC_ASSERT(mStreamList->GetManager()) before calling this line or MOZ_DIAGNOSTIC_ASSERT(mManager) inside the function GetManager()?

@@ +111,5 @@
>    // be called before mStreamList is set.
>    if (!mStreamList) {
>      return;
>    }
> +  mStreamList->GetManager()->RemoveListener(this);

So does this?
Attachment #8908369 - Flags: review?(ttung) → review+
Attachment #8908370 - Flags: review?(ttung) → review+
Attachment #8908371 - Flags: review?(ttung) → review+
Comment on attachment 8908372 [details] [diff] [review]
P7 Add a BodyLoad enumeration to control if individual Cache API requests eagerly or lazily load bodies. r=tt

Review of attachment 8908372 [details] [diff] [review]:
-----------------------------------------------------------------

Note to my self:
This patch creates an enum (BodyLoad) and pass as an argument to cache.match/matchAll/keys and caches.match (eagerly load by default for now). They will be set to be lazily loaded if it's not a service worker script later in P9.
Attachment #8908372 - Flags: review?(ttung) → review+
Attachment #8908373 - Flags: review?(ttung) → review+
Comment on attachment 8908374 [details] [diff] [review]
P9 Make content script usage of Cache API use lazy body loading. r=tt

Review of attachment 8908374 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for fixing this issue!
Attachment #8908374 - Flags: review?(ttung) → review+
Address Comment 77, but I only cached 1024 responses rather than 5000. The reason is that it takes too much time to cached 5000 responses.

Right now, this test will stuck until test time out with message below:
GECKO(24479) | [Child 24536] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file /home/tom/mozilla/inbound/netwerk/base/nsFileStreams.cpp, line 528
GECKO(24479) | [Child 24536] WARNING: 'NS_FAILED(rv)', file /home/tom/mozilla/inbound/xpcom/io/SnappyUncompressInputStream.cpp, line 333
GECKO(24479) | [Child 24536] WARNING: 'NS_FAILED(rv)', file /home/tom/mozilla/inbound/xpcom/io/SnappyUncompressInputStream.cpp, line 223
GECKO(24479) | [Child 24536] WARNING: 'NS_FAILED(aStatus)', file /home/tom/mozilla/inbound/dom/fetch/FetchConsumer.cpp, line 551

Hi Ben,

Would you mind giving quick feedback if I understand Comment 77 correctlly? If so, I will test it with applying your patches later.
Thanks!
Attachment #8908067 - Attachment is obsolete: true
Attachment #8908571 - Flags: feedback?(bkelly)
Comment on attachment 8908571 [details] [diff] [review]
Test for caching tons of response

Review of attachment 8908571 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great!  I'll try it with the other patches today.

::: dom/cache/test/mochitest/test_cache_tons_of_fd.html
@@ +38,5 @@
> +          ["dom.caches.testing.enabled", true],
> +          ["dom.quotaManager.testing", true]]
> +}, async function() {
> +  const name = 'cacheTonsOfFD';
> +  const number_of_fd = 1024;

Let's make this a bit larger to try to force the problem.  Maybe 5000 or something.

@@ +56,5 @@
> +  for (let i = 0; i < number_of_fd; ++i) {
> +    // Give request differnet url to create different entry.
> +    let request = new Request(i);
> +    request_array.push(request);
> +    await cache.put(request, response.clone());

We actually only need one entry in the Cache.  It doesn't matter how many files there, just how many times we open any files.  So you can do a single cache.put() and then just cache.match() the same request N times.

@@ +66,5 @@
> +  let cached_response_array = [];
> +  for (let i = 0; i < number_of_fd; ++i) {
> +    let cachedResponse = await cache.match(request_array[i]);
> +    cached_response_array.push(cachedResponse);
> +  }

We could run this a bit faster in the test by letting the match() calls work in parallel:

let matchList = [];
for (let i = 0; i < number_of_fd; ++i) {
  matchList.push(cache.match(request));
}
let cached_response_array = await Promise.all(matchList);

@@ +72,5 @@
> +  info("Stage 4: Consume the cached body")
> +  for (let i = 0; i < number_of_fd; ++i) {
> +    let bodyText = await cached_response_array[i].text();
> +    ok(bodyText == body, "The cached body does the same as original one");
> +  }

Doing this loop one at a time is good, though.

Maybe do a `ok(cached_response_array[i], "response exists")` check before trying to read the text.
Attachment #8908571 - Flags: feedback?(bkelly) → feedback+
The test passes with the patches here applied.

Without the patches here the test does time out.  If we add a try/catch to the test, though, it will fail with an AbortError from the .text() call.

Since the final merge to beta is coming up I want to land these changes today.  I'm going to move landing the test to a separate bug.  We can finish reviewing and landing the test next week.
Blocks: 1400282
I decided to add the assert to StreamList::GetManager() method itself.  Also, I figured out how to call it StreamList::Manager().  I originally used Get* because the method name conflicted with the Manager type.  If I just use cache::Manager elsewhere in StreamList, though, I can safely name the method Manager() which is our preferred style.
Attachment #8908369 - Attachment is obsolete: true
Attachment #8908664 - Flags: review+
I decided to rename the BodyLoad enumeration to OpenMode.  I was worried "Load" would make people think we were eagerly loading the entire body stream into memory.
Attachment #8908372 - Attachment is obsolete: true
Attachment #8908668 - Flags: review+
Rebase to use OpenMode instead of BodyLoad.
Attachment #8908374 - Attachment is obsolete: true
Attachment #8908670 - Flags: review+
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3f1e68da9634
P1 Make CacheReadStream ipdl struct contain an OptionalIPCStream. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/15edfaab92f6
P2 Refactor StreamList::Add() to use move constructors and accept a nullptr stream. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/bfbcc07fcb4b
P3 Add an OpenStreamAction class to open the file stream on the io thread. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/389e9f39d6b8
P4 Add a PCacheStreamControl message to open a stream based on its body ID. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/5988716b66bb
P5 Add ReadStream::Inner::EnsureStream() to synchronously open stream when Availabe/Read/ReadSegments is called. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/bb52082d2d3c
P6 Allow ReadStream to be created with an initially nullptr base stream. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/b28740aedf34
P7 Add an OpenMode enumeration to control if individual Cache API requests eagerly or lazily opens bodies. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/e37614cb41e1
P8 Track which storage "namespace" each Cache object is associated with. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/b5ea37e08220
P9 Make content script usage of Cache API use lazy body loading. r=tt
Blocks: 1400298
Backed out for bustage at mozilla/dom/cache/CacheStorage.h:108:

https://hg.mozilla.org/integration/mozilla-inbound/rev/ee4c8b6d9a083e24a6f902e9f3b3621561ab9451
https://hg.mozilla.org/integration/mozilla-inbound/rev/c0a6198c2345c243a40f82220f7640ddcb755515
https://hg.mozilla.org/integration/mozilla-inbound/rev/3011315fd77cd184ca7e4721b7e35233b945b0c9
https://hg.mozilla.org/integration/mozilla-inbound/rev/0241711c57ebe826345fc38ee5e695ff301c24b8
https://hg.mozilla.org/integration/mozilla-inbound/rev/b1630c159759298a39b7a33397832c458d28ab9f
https://hg.mozilla.org/integration/mozilla-inbound/rev/852e0b52c84f041880f1ed404be20d6c8022977f
https://hg.mozilla.org/integration/mozilla-inbound/rev/ea17b06fd8eb6e5cdd6021f8095c25d640b226d8
https://hg.mozilla.org/integration/mozilla-inbound/rev/768e5defb464e14a76b83884702b9ed5bff07401
https://hg.mozilla.org/integration/mozilla-inbound/rev/927cceba22bea3a94e7ae10112f0e037588c583f

Push with bustage: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=b5ea37e08220cef0cd3bb3130ea0b1104ad8d16b&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Build log: https://treeherder.mozilla.org/logviewer.html#?job_id=131312380&repo=mozilla-inbound

[task 2017-09-15T16:19:59.938Z] 16:19:59     INFO -  In file included from /builds/worker/workspace/build/src/js/xpconnect/src/Sandbox.cpp:32:0,
[task 2017-09-15T16:19:59.938Z] 16:19:59     INFO -                   from /builds/worker/workspace/build/src/obj-firefox/js/xpconnect/src/Unified_cpp_js_xpconnect_src0.cpp:11:
[task 2017-09-15T16:19:59.940Z] 16:19:59     INFO -  /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/dom/cache/CacheStorage.h:108:14: error: declaration of 'mozilla::dom::cache::OpenMode mozilla::dom::cache::CacheStorage::OpenMode() const' [-fpermissive]
[task 2017-09-15T16:19:59.940Z] 16:19:59     INFO -     OpenMode() const;
[task 2017-09-15T16:19:59.940Z] 16:19:59     INFO -                ^~~~~
[task 2017-09-15T16:19:59.940Z] 16:19:59     INFO -  In file included from /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/dom/cache/CacheStorage.h:10:0,
[task 2017-09-15T16:19:59.940Z] 16:19:59     INFO -                   from /builds/worker/workspace/build/src/js/xpconnect/src/Sandbox.cpp:32,
[task 2017-09-15T16:19:59.943Z] 16:19:59     INFO -                   from /builds/worker/workspace/build/src/obj-firefox/js/xpconnect/src/Unified_cpp_js_xpconnect_src0.cpp:11:
[task 2017-09-15T16:19:59.943Z] 16:19:59     INFO -  /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/dom/cache/Types.h:42:12: error: changes meaning of 'OpenMode' from 'enum class mozilla::dom::cache::OpenMode' [-fpermissive]
[task 2017-09-15T16:19:59.943Z] 16:19:59     INFO -   enum class OpenMode : uint8_t
[task 2017-09-15T16:19:59.943Z] 16:19:59     INFO -              ^~~~~~~~
[task 2017-09-15T16:19:59.943Z] 16:19:59     INFO -  /builds/worker/workspace/build/src/config/rules.mk:1064: recipe for target 'Unified_cpp_js_xpconnect_src0.o' failed
[task 2017-09-15T16:19:59.943Z] 16:19:59     INFO -  gmake[5]: *** [Unified_cpp_js_xpconnect_src0.o] Error 1
Flags: needinfo?(bkelly)
Attachment #8908670 - Attachment is obsolete: true
Flags: needinfo?(bkelly)
Attachment #8908728 - Flags: review+
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b16b6ed01f51
P1 Make CacheReadStream ipdl struct contain an OptionalIPCStream. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/88015eaa125d
P2 Refactor StreamList::Add() to use move constructors and accept a nullptr stream. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/6a1adad5e9db
P3 Add an OpenStreamAction class to open the file stream on the io thread. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/faa7c0f91b4e
P4 Add a PCacheStreamControl message to open a stream based on its body ID. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/afc46e4aa18c
P5 Add ReadStream::Inner::EnsureStream() to synchronously open stream when Availabe/Read/ReadSegments is called. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/c02de0753e45
P6 Allow ReadStream to be created with an initially nullptr base stream. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/882ab6414140
P7 Add an OpenMode enumeration to control if individual Cache API requests eagerly or lazily opens bodies. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/7650e58c8b52
P8 Track which storage "namespace" each Cache object is associated with. r=tt
https://hg.mozilla.org/integration/mozilla-inbound/rev/5657bd411b28
P9 Make content script usage of Cache API use lazy body loading. r=tt
Just to clarify, the problem exists in 52 but service workers are disabled there.  So sites are much less likely to encounter this issue.
FWIW, this causes us to slightly regress one of our SW metrics:

* FETCH_EVENT_FINISH_SYNTHESIS_MS median went from ~2ms to ~5.5ms
* Most of this was due to the 25th-percentile data going from 0ms to ~1ms
* The mean and 75th-percentile were also affected, but its hard to tell if this is normal noise in the data or due to this change.
* It seems this is mostly noticeable on desktop.  On mobile the numbers are noisy enough we can't really see the impact.

I don't think there is a problem here.  I just wanted to note the impact on our metrics.
Depends on: 1402586
Depends on: 1426395
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.