Closed Bug 321361 Opened 19 years ago Closed 16 years ago

Cache corruption when viewing sites with several large images (simultaneous hash collision & cache dooming by disk cache size limit)

Categories

(Core :: Networking: Cache, defect)

x86
All
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla1.9.2a1

People

(Reporter: public, Assigned: michal)

References

Details

Attachments

(12 files, 2 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

FF 1.5 release on PC/XP SP 2 with new profile, default settings, no extensions.

When viewing sites with several large images (where the amount of data per page
exceeds both memory and disk cache size), and changing back and forth between multiple such pages, cache gets corrupted such that images show up in disk or memory cache, but are not loaded or not completely loaded. Also seen corrupt image contents after repeating for a while.

Reproduced several times. Pages used to reproduce were checked for correct HTML.

Reproducible: Always

Steps to Reproduce:
1. Go to www.theresia.name. 
2. Select "The Talers", then "Catalog".
3. Load different parts of catalog; e.g., pages 1-10, 11-20, 21-30, Entire Catalog. Swap back and forth between pages.
4. Go back to "The Talers".
5. Select "Guenzburg".
6. Check if all images display.
7. Repeat until failure.

Actual Results:  
Some images either missing or corrupt.

Expected Results:  
Images should continue to be displayed correctly.

Used default setting with new profile to reproduce.
Attached file Disk cache
Attached file Memory cache
Assignee: nobody → darin
Component: General → Networking: Cache
Product: Firefox → Core
QA Contact: general → networking.cache
Version: unspecified → 1.8 Branch
Problem still seen with:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060127 Firefox/1.5.0.1

cut/paste from about:cache for failing entry:

key:	http://www.theresia.name/150/H13c_Revers_150.jpg
fetch count: 6
last fetched: 2006-01-29 08:37:21
last modified: 2006-01-29 08:35:48
expires: 2006-03-31 20:03:13
Data size: 32561
file on disk: C:\Documents and Settings\Guenter Roeck\Local Settings\Application Data\Mozilla\Firefox\Profiles\default.d8c\Cache\C396CD5Dd01
Security: This document does not have any security info associated with it.
Client: HTTP
request-method: GET
response-head: HTTP/1.1 200 OK
Date: Sun, 29 Jan 2006 16:31:26 GMT
Server: Apache/1.3.34 (Unix) PHP/4.4.2 mod_ssl/2.8.25 OpenSSL/0.9.7c FrontPage/5.0.2.2635 mod_throttle/3.1.2
Last-Modified: Mon, 24 May 2004 21:13:35 GMT
Etag: "1e9b33-7f31-40b2657f"
Accept-Ranges: bytes
Content-Length: 32561
Content-Type: image/jpeg

-------------

Error message when I try to load the image by selecting the key:
    Firefox can't find the file at /150/H13c_Revers_150.jpg

Looking into the cache on disk itself, the "file on disk" as listed above
does not exist in the cache directory.
Assignee: darin → nobody
I still see this problem (or a similar problem) with Firefox 2.0.1.
With this version, some of the images are sometimes simply wrong - instead of showing the expected image, some other image may be displayed.
"Reload" does not help. Problem goes away after the cache was cleared.
Also see https://bugzilla.mozilla.org/show_bug.cgi?id=330820.
Looks like others see the same or a similar problem.
Can you still reproduce the bug using latest firefox? I can't reproduce it, but I can't reproduce it in firefox 1.5.0.1 either...

> 3. Load different parts of catalog; e.g., pages 1-10, 11-20, 21-30, Entire
> Catalog. Swap back and forth between pages.

Do you wait for all images to load or do you swap back and forth in the middle of the loading?
Current version:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1b3pre) Gecko/20090122 Shiretoko/3.1b3pre

Yes, I still do see the problem. I wait for the images to finish loading, then swap back and forth between pages.

I'll try to attach a couple of current screen shots.
Assignee: nobody → michal
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows XP → All
Version: 1.8 Branch → Trunk
Blocks: 290032
> Error message when I try to load the image by selecting the key:
>   Firefox can't find the file at /150/H13c_Revers_150.jpg
> Looking into the cache on disk itself, the "file on disk" as listed above does not exist in the cache directory.

I could observe similar error by manual deletion of files in cache directory while Fx is running, during test for a problem in the past. ("Push away from disk cache" due to disk cache size limit or Bug 175600 was suspected, so I deleted files manually)
After manual deletion of files in cache, click of link to the URL produced similar error. HTTP GET was not issued upon the error(i.e. cache entry for URL exists and it's not expired, but data file is not found.)

Bug 175600 is relevant? How many files exist in Firefox's Cache directory?
> Bug 175600 Only 8192 objects can be stored in disk cache
There are less than 600 objects in the cache when I see the problem.
How about "about:cache" display? Still same as previous screen shot? 
>(your screen shot says as follows)
> Disk cache device
> Number of entries   : 556
> Maximum storage size: 50000 KiB
> Storage in use:       49920 KiB
What is real disk cache directory size? (Property of directory calculates total size) Same as about:cache?

Following may help your problem analysis.
(a) Fx's activity log on Cache directory/files in Cache directory
    by such as Filemon, Process Monitor.
> http://technet.microsoft.com/en-us/sysinternals/default.aspx
> http://technet.microsoft.com/en-us/sysinternals/cb56073f-62a3-4ed8-9dd6-40c84cb9e2f5.aspx
> http://technet.microsoft.com/en-us/sysinternals/bb545046.aspx
(b) NSPR log with timestamp(use DebugView).
    See Bug 402793 Comment #1 and Bug 402793 Comment #6.
    (NSPR log still doesn't have timestamp capability)
 (1) Start DebugView(as admin user), stop capture
 (2) SET NSPR_LOG_MODULES=all:5
     SET NSPR_LOG_FILE=WinDebug
     firefox.exe -P
 (3) Execute required setups for problem recreation
 (4) Enable capture at DebugView
 (5) Execute required test
 (6) Confirm that problem occurs
 (7) Stop capture at DebugView, and save log data to a file
 (8) Terminate Firefox
 (9) Teminate DebugView

Check when the cache file is deleted from cache directory with (a), and check Fx's activity around it with (b).
After lot of debugging I have finally a reproducible testcase:

1) set following preferences in about:config

browser.cache.check_doc_frequency = 1
image.cache.size = 1
network.http.max-persistent-connections-per-server = 1
browser.cache.disk.capacity = 1024
browser.cache.memory.enable = false

2) restart of FF is needed for image cache change to take effect
3) clear the cache
4) visit http://michal.etc.cz/bug321361/test1.html
5) click on link "page2"
6) 4th image on the page2 isn't loaded

There is a problem with collision detection system. Following happens when STR above are performed:

1) URLs of image 1 and 4 on the first page have the same hash. Cache entry for image 4 is obtained before image 1 starts downloading so there is no collision in nsCacheService::OpenCacheEntry().
2) Image 1 starts downloading, BindEntry() is called and generation number is set to 1. Cache entry is closed and removed from nsDiskCacheBindery when download finishes.
3) Images 2 and 3 are loaded (they aren't interesting...)
4) Image 4 starts downloading, BindEntry() is called and here collision detection system fails. It gets also generation number 1 and cache file of image 1 is rewritten.
5) The cache is now almost full, size of these 4 images is 944036 bytes.
6) Only image 4 is reused on the page2, other images have different URL so loading of the page will evict some entries.
7) HTTP channels for images 1,2,3,4 are created and cache entry is found for image 4, so only if-modified request is sent.
8) Image 1 (from page 2) evicts image 1 (from page 1) from the cache.
9) 304 Not modified is returned by server for image 4. nsHttpChannel::ReadFromCache() is called, but the cache file was deleted at previous step.

To be honest I must say that I don't fully understand all the "generation" stuff. Should it solve exactly these issues or does it exist for another purpose? Also I'm not sure if description of method nsDiskCacheDevice::BindEntry() is correct. From its description I would assume that mCacheMap.AddRecord() detects hash collisions but it detects only bucket collision. We can fix the bug exactly according to BindEntry()'s description, i.e. detect hash collision and either remove old entry to refuse to bind the new one. Do we then need the generations?
ccing dcamp, but it might be that you know more about this code than anyone else active at this point...  :(
Attached patch patch v1 (obsolete) — Splinter Review
This patch adds a collision check into nsDiskCacheDevice::BindEntry(). There are comments explaining the change in the patch.
Attachment #362442 - Flags: review?(bzbarsky)
Attachment #362442 - Flags: superreview?(bzbarsky)
Attachment #362442 - Flags: review?(dcamp)
Attachment #362442 - Flags: review?(bzbarsky)
Comment on attachment 362442 [details] [diff] [review]
patch v1

dcamp's probably a better reviewer here.
Comment on attachment 362442 [details] [diff] [review]
patch v1

>diff --git a/netwerk/cache/src/nsDiskCacheDevice.cpp b/netwerk/cache/src/nsDiskCacheDevice.cpp
>--- a/netwerk/cache/src/nsDiskCacheDevice.cpp
>+++ b/netwerk/cache/src/nsDiskCacheDevice.cpp

>+    nsDiskCacheBinding *binding;
>+    PLDHashNumber hashNumber = nsDiskCache::Hash(entry->Key()->get());
>+
>+    // Find out if there is already an active binding for this hash. If yes it
>+    // should have another key since BindEntry() shouldn't be called twice for
>+    // the same entry. Doom the old entry, the new one will get another
>+    // generation number so files won't collide.
>+    binding = mBindery.FindActiveBinding(hashNumber);
>+    if (binding) {
>+        if (PL_strcmp(binding->mCacheEntry->Key()->get(),
>+                      entry->Key()->get()) != 0)

Any reason to not just use mCacheEntry->Key()->Equals() here?

>+    // Lookup hash number in cache map. There can be a colliding inactive entry.
>+    // See bug #321361 comment 23 for the scenario. If there is such entry,
>+    // delete it.

Looks like this should be comment 21.
Attachment #362442 - Flags: review?(dcamp) → review+
Attachment #362442 - Attachment is obsolete: true
Attachment #362831 - Flags: superreview?(bzbarsky)
Attachment #362442 - Flags: superreview?(bzbarsky)
Attachment #362831 - Flags: superreview?(bzbarsky) → superreview+
Comment on attachment 362831 [details] [diff] [review]
patch v2 - uses Equals() to compare strings, fixed comment number

>+++ b/netwerk/cache/src/nsDiskCacheDevice.cpp
>+        if (!binding->mCacheEntry->Key()->Equals(*entry->Key()))
>+            nsCacheService::DoomEntry(binding->mCacheEntry);
>+        else
>+            NS_ASSERTION(PR_FALSE, "BindEntry called for already bound entry!");

Why not make the Doom unconditional and just assert the condition?  If you really want to play it safe, use NS_ERROR instead of NS_ASSERTION in the else.

sr=bzbarsky with that.
Attachment #362831 - Attachment is obsolete: true
Keywords: checkin-needed
Comment on attachment 363000 [details] [diff] [review]
patch v3
[Checkin: Comment 29]


http://hg.mozilla.org/mozilla-central/rev/f796debb11af
Attachment #363000 - Attachment description: patch v3 → patch v3 [Checkin: Comment 29]
Status: NEW → RESOLVED
Closed: 16 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.2a1
Guenter, could you please test if you still have some problems? I wasn't able to reproduce the problem with corrupted image. Note that the fix was committed to mozilla-central only, so use following nightly build for testing:

http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2009/02/2009-02-20-10-mozilla-central/
Problem not seen anymore with the following build:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090223 Minefield/3.2a1pre
Adding some words(hash collision, cache dooming) in summary, for ease of search.
Michal Novotny, please correct it if wrong or inappropriate.
Summary: Cache corruption when viewing sites with several large images → Cache corruption when viewing sites with several large images (simultaneous hash collision & cache doomig by disk cache size limit)
Summary: Cache corruption when viewing sites with several large images (simultaneous hash collision & cache doomig by disk cache size limit) → Cache corruption when viewing sites with several large images (simultaneous hash collision & cache dooming by disk cache size limit)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: