Closed Bug 1154124 Opened 5 years ago Closed 4 years ago
Stack overflow by a sync call of net::Cache
Entry's callbacks in a chain
Steps to reproduce: 1. Go to this URL: http://www.geeksofdoom.com/2015/01/13/rockstar-delays-grand-theft-auto-v-pc-release-minimum-specs-4k-screenshots-shared 2. Middle click three of the screenshots at the bottom of this post. 3. Wait. First the browser becomes unresponsive, showing the dialog. "Stop Script" is either unclickable or ineffective. Eventually the browser crashes. Here are two crash signatures I got for this crash: https://crash-stats.mozilla.com/report/index/56a058a3-d1aa-4c50-adc6-564392150414 https://crash-stats.mozilla.com/report/index/4f42870b-d66a-452d-bff9-fcf9c2150414 (the latter is a Windows XP VirtualBox VM with Nightly, following the same STR)
Stack overflow: mozilla::net::CacheEntryHandle::Release() nsRefPtr<nsIAtom>::~nsRefPtr<nsIAtom>() mozilla::net::CacheEntry::InvokeAvailableCallback(mozilla::net::CacheEntry::Callback const&) mozilla::net::CacheEntry::InvokeCallback(mozilla::net::CacheEntry::Callback&) mozilla::net::CacheEntry::InvokeCallbacks(bool) mozilla::net::CacheEntry::InvokeCallbacks() mozilla::net::CacheEntry::OnHandleClosed(mozilla::net::CacheEntryHandle const*) mozilla::net::CacheEntryHandle::~CacheEntryHandle() mozilla::net::CacheEntryHandle::`scalar deleting destructor'(unsigned int) mozilla::net::CacheEntryHandle::Release() ...
Roman, are you using a fresh profile? If not, can you retry with it? I was trying to reproduce this bug with Nightly (e10s on and off) and Fx 37 (all fresh profile) but no luck.
Assignee: nobody → honzab.moz
nsHttpChannel::OnCacheEntryCheck apparently returns ENTRY_NEEDS_REVALIDATION, hence we go to nsHttpChannel::OnCacheEntryAvailable. But that doesn't take the entry handle ownership (and returns OK) and also somehow restarts wait for the same entry (ending up added as a callback to the list) so on releasing the handle inside CacheEntry::InvokeAvailableCallback (line 832, not 802) we do immediately another loop. How this happens state happens, is absolutely unclear to me. Anyway, we should be able resilient. Have to think on a fix. @Roman, can you please produce an NSPR log while reproducing the bug? set two environment vars (best in a cmd.exe window): NSPR_LOG_MODULES=cache2:5,nsHttp:4,timestamp,sync NSPR_LOG_FILE=\path\to\the\file and then run firefox with that env. If you could not reproduce, remove the 'sync' module from the list and retry. Thanks!
The crash reproduces on today's nightly with a clean profile (previously the profile was almost-clean). Add one step to the STR: close the browser. That seems to be what finally triggers the crash reporter. What I observe when I open the screenshots in new tabs is that the browser becomes very unresponsive, but the crash reporter seems to only be triggered when the browser is closed. NSPR log (sync module included): attached. This was done on a WinXP VM, because that's where my nightly lives after a couple of incidents with accidentally running the nightly on my main profile. Let me know if you still can't reproduce and require further logs from Win7.
Thanks. Now I know what's going on. A script schedules an enormous amount of requests to a single URL. It means there are 100+ consumer waiting for a single entry. nsHttpChannel::OnCacheEntryCheck returns REVAL_NEEDED but nsHttpChannel::OnCacheEntryAvailable later fails since the channel has been canceled in the mean time. This leads to processing (no redispatch) of the next callback in chain => stack overflow. I am in favor of leaving the code as much as possible w/o redispatching (i.e. breaking the stack). But in this case we need to make an exception, tho there may be more places. This will not fix the unresponsive script. e10s turned on would, but still, this is somewhat malicious behavior.
Status: NEW → ASSIGNED
Perfect! As for e10s, my nightly was pretty unresponsive with e10s enabled - a bit better than v37, but only slightly. Is it supposed to remain perfectly responsive? If so, maybe I should file another bug related to how these pages affect UI responsiveness with e10s enabled?
Crash Signature: [@ choose_arena ] [@ nsACString_internal::Replace(unsigned int, unsigned int, char const*, unsigned int) ]
Summary: Reproduceable crash in 37 through nightly, preceded by unresponsive script warning → Reproducible crash preceded by Unresponsive Script warning. high cache hit results in stack overflow
Crash Signature: [@ choose_arena ] [@ nsACString_internal::Replace(unsigned int, unsigned int, char const*, unsigned int) ] → [@ choose_arena ] [@ nsACString_internal::Replace(unsigned int, unsigned int, char const*, unsigned int) ] [@ nsACString_internal::Replace ]
Summary: Reproducible crash preceded by Unresponsive Script warning. high cache hit results in stack overflow → Stack overflow by a sync call of net::CacheEntry's callbacks in a chain
The fix is actually that easy...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=edc9797d8e2d (I've also added a comment to the patch)
Comment on attachment 8706917 [details] [diff] [review] v1 I think we should also prevent the race with closing the output stream that btw invokes (by a dispatch!) CacheEntry::OnOutputClosed() with invokes the callbacks again. I will update the patch accordingly.
- the phantom output stream can be easily closed and released under the lock - mOutputStream->Close() invokes callbacks (asynchronously), so no need to do invoke callbacks twice https://treeherder.mozilla.org/#/jobs?repo=try&revision=6b43e1905e73 -w patch follows
Attachment #8708032 - Flags: review?(michal.novotny) → review+
You need to log in before you can comment on or make changes to this bug.