Closed Bug 1028415 Opened 5 years ago Closed 5 years ago

Cache thread gets stuck in CacheFileIOManager::OverLimitEvictionInternal loop

Categories

(Core :: Networking: Cache, defect)

x86_64
All
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox30 --- unaffected
firefox31 --- unaffected
firefox32 + fixed
firefox33 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: jchen, Assigned: michal)

References

Details

Attachments

(2 files)

Attached file nightly-cache-loop.txt
For some reason, Nightly on my MacBook Air was stuck in a loop in CacheFileIOManager::OverLimitEvictionInternal (CPU was at 100% for many minutes and it was enough to turn on the CPU fan at full speed). This happened on the Cache2 I/O thread, and the main thread was fully responsive.

I finally used "Clear Recent History" to clear the last hour of history, and afterwards CPU usage immediately dropped down to 0%.

Attached is the output from OS X's internal profiler (I didn't have the Gecko profiler installed).
(In reply to Jim Chen [:jchen :nchen] from comment #0)
> Created attachment 8443751 [details]
> nightly-cache-loop.txt
> 
> For some reason, Nightly on my MacBook Air was stuck in a loop in
> CacheFileIOManager::OverLimitEvictionInternal (CPU was at 100% for many
> minutes and it was enough to turn on the CPU fan at full speed). This
> happened on the Cache2 I/O thread, and the main thread was fully responsive.

How big was the cache when this happened?


> I finally used "Clear Recent History" to clear the last hour of history, and
> afterwards CPU usage immediately dropped down to 0%.

"Clear Recent History" always deletes the whole cache and it is always faster than selective evicting of the least useful entries, so this is expected behavior.

It seems to me that for some reason the maximum cache size has been dramatically lowered, so we started to evict a lot of entries. But it's not clear to me how this could happen. What's your maximum cache size? Do you have smartsize enabled?
This happened to me twice yesterday while I was streaming a live BBC broadcast. about:cache from the first time,

> Information about the Network Cache Storage Service
> Private Anonymous AppID In Browser Element
> memory
> Number of entries:  1098
> Maximum storage size:   32768 KiB
> Storage in use:     6784 KiB
> Storage disk location:  none, only stored in memory
> List Cache Entries
> disk
> Number of entries:  7896
> Maximum storage size:   327680 KiB
> Storage in use:     331063 KiB
> Storage disk location:  /Users/nchen/Library/Caches/Firefox/Profiles/u6sv5ul4.default/cache2
> List Cache Entries
> appcache
> Number of entries:  0
> Maximum storage size:   512000 KiB
> Storage in use:     0 KiB
> Storage disk location:  /Users/nchen/Library/Caches/Firefox/Profiles/u6sv5ul4.default/OfflineCache

After that, I cleared recent history (about:cache showed cache was cleared), and things returned to normal. But after some time, while I was still streaming the broadcast, this happened again. about:cache from the second time,

> Information about the Network Cache Storage Service
> Private Anonymous AppID In Browser Element
> memory
> Number of entries:  0
> Maximum storage size:   32768 KiB
> Storage in use:     0 KiB
> Storage disk location:  none, only stored in memory
> disk
> Number of entries:  1788
> Maximum storage size:   358400 KiB
> Storage in use:     2785412 KiB
> Storage disk location:  /Users/nchen/Library/Caches/Firefox/Profiles/u6sv5ul4.default/cache2
> List Cache Entries
> appcache
> Number of entries:  0
> Maximum storage size:   512000 KiB
> Storage in use:     0 KiB
> Storage disk location:  /Users/nchen/Library/Caches/Firefox/Profiles/u6sv5ul4.default/OfflineCache

Both times CPU was at 100%, but there was *no* disk activity according to Activity Monitor. The broadcast I was watching was using Flash for streaming; not sure if it's related. Smartsize is enabled according to about:config.
For reference:

   [18:19]	NeilAway: mayhemer: I've had a bug where the cache doom thread was in an infinite loop because the lru entry had already been doomed
   [18:21]	NeilAway: mayhemer: I used the debugger to nop-out both the IsDoomed() checks and the thread eventually finished shrinking the cache, but unfortunately the browser still wouldn't shut down normally and I had to kill it anyway :s
I can verify that my 2-week-old 64-bit Windows self-build hung on a background thread in a loop in the same function.

I didn't notice the problem until I tried to shut the build down and noticed that the process was lingering with 25% (1 of 4 CPUs) utilisation.

My build was an unoptimised build with symbols so I attached a debugger and noticed that the cache entry that it was trying to doom was already doomed.

At first I just used the debugger to skip over the IsDoomed() checks in the appropriate functions and some of the cache got evicted (it turns out that forcibly dooming a doomed entry still removes it from the mru list) but the loop got stuck again and I got bored and used the debugger to remove the checks. This allowed the function to complete its loop and return to the thread but unfortunately the browser didn't complete shutdown and I had to kill it anyway.
OS: Mac OS X → All
We should never have doomed entries in frecency/expiration array in the index. The CacheFileHandle::mIsDoomed is set to true only in CacheFileIOManager::DoomFileInternal() and the entry is removed from the index just one line above that. Neil, could you please try to reproduce it with NSPR logging turned on (nsHttp:5,cache2:5)?
Attached patch patch v1Splinter Review
This patch doesn't fix the root of the cause of the problem. It just breaks the loop so we won't try to doom already doomed handle over and over again. I still have no idea how we can have such entry in the index. Neil, did you try to reproduce the problem with NSPR logging enabled?
Assignee: nobody → michal.novotny
Attachment #8456219 - Flags: review?(honzab.moz)
Flags: needinfo?(neil)
(In reply to Michal Novotny from comment #6)
> This patch doesn't fix the root of the cause of the problem. It just breaks
> the loop so we won't try to doom already doomed handle over and over again.
> I still have no idea how we can have such entry in the index. Neil, did you
> try to reproduce the problem with NSPR logging enabled?

Sorry, I haven't been watching the bug, but as it happens, I've not seen the problem again anyway.
Flags: needinfo?(neil)
Comment on attachment 8456219 [details] [diff] [review]
patch v1

Review of attachment 8456219 [details] [diff] [review]:
-----------------------------------------------------------------

Hope this helps.  Tho seems like the problem may actually be somewhere else according michal's claims that this simple cannot happen.

r=honzab
Attachment #8456219 - Flags: review?(honzab.moz) → review+
Comment on attachment 8456219 [details] [diff] [review]
patch v1

Approval Request Comment
[Feature/regressing bug #]: cache2
[User impact if declined]: browser can go into 100% CPU loop on cache thread 
[Risks and why]: low: fairly small change
[String/UUID change made/needed]: none

This cache2 fix also seems worth taking on aurora IMO.
Attachment #8456219 - Flags: approval-mozilla-aurora?
Going to wait for this to land on m-c before granting Aurora approval.
https://hg.mozilla.org/mozilla-central/rev/a078bf8b709d
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Attachment #8456219 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Depends on: 1085172
You need to log in before you can comment on or make changes to this bug.