Closed Bug 717761 Opened 10 years ago Closed 6 years ago

Main thread can be blocked by IO on the cache thread

Categories

(Core :: Networking: Cache, defect)

defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 913806

People

(Reporter: u408661, Assigned: michal)

References

Details

(Keywords: main-thread-io, perf, Whiteboard: [Snappy:P1][necko-backlog])

Attachments

(4 files, 2 obsolete files)

In bug 715774 comment 3 dmandelin identifies a call stack that caused a disturbingly long jank via the cache:

    NtWaitForSingleObject
    RtlIntegerToUnicodeString
    PR_Lock
    nsCacheService::Lock
    nsCacheEntryDescriptor::GetMetaDataElement
    nsHttpChannel::CheckCache
    nsHttpChannel::Connect
    nsHttpChannel::OnNormalCacheEntryAvailable
    nsHttpChannel::OnCacheEntryAvailable
    nsCacheListenerEvent::Run
    nsThread::ProcessNextEvent

It seems that getting a metadata element from a particular cache entry can block if anything else is being done with the cache service, as we have one global lock for the entire cache. In this case, it runs on the main thread, causing the jank. We should probably have finer-grained locking to prevent things like this from happening. Assigning to myself as part of the cache work to be done this quarter.
There are many problems in this code.

One of them is that the HSTS (and SPDY Alternate-Protocol?) checks are happening in the wrong spot--they should be happening BEFORE we looked up the entry in the cache, but currently they are happening AFTER we look up the entry in the cache.

But, a more major problem is that it doesn't make sense for the cache to call a callback on the main thread so that the callback can decide whether to read an entry from the cache (involving the cache thread/lock) or from the network (involving the socket transport thread). AFAICT, instead of calling this callback on the main thread, the cache itself should decide whether validation is necessary. If validation isn't necessary, then the cache should return the response directly to the main thread. If validation isn't necessary, it should jump directly to the socket transport thread to do the validation (skipping the main thread). If that validation response comes back 304, then the socket transport thread should transfer control back to the cache thread to read the response, and then the response should be sent to the main thread. If the validation response comes back otherwise, then that response data should be sent to the main thread and then later sent to the cache.

More generally, the main thread is responsible for issuing a cache lookup (to the cache thread) or issuing a network request (to the socket transport thread), and it is responsible for receiving the response data or error result. But, between those two points, the main thread should never be involved. And, even more generally than that, there should never be any case where any thread (except the cache thread) obtains any cache lock, except for the lock built into the internals of the cache thread's nsIEventTarget implementation. (All of this AFAICT.)

Basically, this is very much like what bz suggested in bug 612632 comment 8. AFAICT, a good solution for bug 612632 will solve this problem too.
Although the I/O doesn't happen directly on the main thread, the main thread is blocked (because of the lock) on I/O happening on the cache thread, so this is effectively main thread I/O.
Keywords: main-thread-io, perf
Whiteboard: [Snappy]
Version: unspecified → Trunk
(In reply to Brian Smith (:bsmith) from comment #1)
> Basically, this is very much like what bz suggested in bug 612632 comment 8.
> AFAICT, a good solution for bug 612632 will solve this problem too.

While I agree in general that we should be better for validation to be non-blocking in some way or another (which is why I filed this bug), that's different (in my view) from what bz is talking about. He's talking, effectively, about reducing the number of "round trips" in the cache, while this bug is talking about not being dumb about our locking. Perhaps they can/will both be resolved in the same patch, but I'm not convinced of that, since we could easily solve one without solving the other. (I would even argue that we SHOULD do this piecemeal, to make it easier to identify errors we may have made in the implementation of one versus the other.)
Summary: nsHttpChannel::CheckCache can be blocked on unrelated cache operations → Main thread can be blocked by IO on the cache thread
Duplicate of this bug: 716172
This may be causing massive slowdowns when filling the cache. See http://groups.google.com/group/mozilla.dev.tech.network/browse_thread/thread/6dfa3e6ebe80c336#
Initially I was hoping this would be pretty simple, but the locking & interdependencies are much hairier than I initially expected. There might still be a solution for this particular issue that doesn't involve redoing *all* the cache locking in one big go, but that requires a bit more investigation.
Attached patch start of a patchSplinter Review
Here's the start of a patch for this. Doesn't fix all the places yet (particularly calls to nsCacheEntryDescriptor::{Doom,Close} and one synchronous call to OpenCacheEntry), but it's a start. Seems to be working without crashes so far, but I have no quantification of how well this fixes (or doesn't) the slowdowns.
Attached file stacks of bad locks (obsolete) —
This is the output generated by part of the patch that adds an assertion when we take the lock on the main thread. It's from opening www.cnn.com and then quitting the browser. There are lots of stack traces in here, but a lot of them are the same, just for different channels. You can write a script to slog through it fairly easily to detect the unique stacks that matter.
Attached file stacks of bad locks
Re-upload with the appropriate mime type
Attachment #593534 - Attachment is obsolete: true
FYI: you can change the MIME type of an existing attachment, via "details | edit details"
Depends on: 723362
Depends on: 723577
Depends on: 723582
Depends on: 723585
Here is the cleaned-up version of my lock classifications:

* evil (will be taken on the main thread in common operation)

* bad (lock will be taken on the main thread in less performance-critical situations like private browsing transition or only by tests or only when using about:cache)

* benign (lock is always taken on the cache thread).

* unclasified. There are a few instances that I still left unclassified, which are marked UNCLASSIFIED_CACHE_LOCK in the patch. I would like to know what the classification for those instances should be.
Attachment #594338 - Flags: feedback?(michal.novotny)
Attachment #594338 - Flags: feedback?(hurley)
Attached patch Remove isStorageEnabledForPolicy (obsolete) — Splinter Review
This method is dead code in libxul (AFAICT) and it takes the cache lock on the main thread, so we should just remove it.
Attachment #594346 - Flags: review?(michal.novotny)
Comment on attachment 594346 [details] [diff] [review]
Remove isStorageEnabledForPolicy

I'm working on this as well a fixing the locking with the lower layers in bug 723582, let's leave it all together
Attachment #594346 - Flags: review-
Comment on attachment 594346 [details] [diff] [review]
Remove isStorageEnabledForPolicy

Clearing the flag since Nick will solve this in a different bug.
Attachment #594346 - Flags: review?(michal.novotny)
Whiteboard: [Snappy] → [Snappy:P1]
Attachment #594346 - Attachment is obsolete: true
Depends on: 730163
Attachment #594338 - Flags: feedback?(hurley)
I recently looked at Telemetry reports of transient main thread hangs and I noticed that cache operations show up fairly often:

http://blog.mozilla.org/vdjeric/2012/06/08/cache-plugin-font-operations-most-common-in-chrome-hang-reports/

I attached the call stacks in question. All of the  stacks were reported by a single Windows XP machine, however the reports are over the span of a week and this machine is responsible for almost half of the hang stacks in the entire data set.
Assignee: hurley → michal.novotny
Blocks: 642257
Attachment #594338 - Flags: feedback?(michal.novotny)
honza, what do you think the current status of this is?
Flags: needinfo?(honzab.moz)
Whiteboard: [Snappy:P1] → [Snappy:P1][necko-backlog]
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(honzab.moz)
Resolution: --- → DUPLICATE
Duplicate of bug: cache2enable
(In reply to Honza Bambas (:mayhemer) from comment #17)
> 
> *** This bug has been marked as a duplicate of bug 913806 ***

Either that's the wrong bug... or you should provide some context for duplicating this bug against that bug :)
Flags: needinfo?(honzab.moz)
This was a cache1 code bug.  It's no longer used after we switched to cache2.
Flags: needinfo?(honzab.moz)
Ok, thanks for explaining!
You need to log in before you can comment on or make changes to this bug.