Closed Bug 771832 Opened 12 years ago Closed 12 years ago

File not found error for HTTP URL, can be fixed by a forced reload

Categories

(Core :: Networking: Cache, defect)

x86
Windows 7
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla19
Tracking Status
firefox16 --- affected
firefox17 --- fixed
firefox18 + fixed
firefox19 + fixed

People

(Reporter: Matti, Assigned: michal)

References

Details

(Keywords: dogfood, Whiteboard: [qa-])

Attachments

(5 files)

Attached file Http log
After a long day doing bug triage I got hungry and opened the website of a nearby pizza delivery service.
I had used this URL already yesterday and the page should be in my cache.
The URL is http://www.pizza777-777.de/
I clicked on "Speisekarte" (translated=menu) and I got an error message

>File Not Found
>The file http://www.pizza777-777.de/speisekarte.html cannot be found. Please 
>check the location and try again.

The URL itself works in a different profile and you can fix this single URL by forcing a reload (ctrl+f5) or all pages by clearing the cache.
Each menu URL is affected by this e.g. http://www.pizza777-777.de/speisekarte_2.html 

This is the third time that this happened in 4 months with my Seamonkey trunk builds. I currently use 
Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/16.0 Firefox/16.0 SeaMonkey/2.13a1
Build identifier: 20120707003002
Note: I can't reproduce this easily after clearing the cache.

My cache is currently still in this state if you need additional logs.
I'll take a look at this.
Assignee: nobody → honzab.moz
Matthias, I can see we successfully open the cache entry (for R/W), but we probably fail to open the file when opening the input stream.  There is an error NS_ERROR_FILE_NOT_FOUND seen.  

Could you please add nsCache:5 to the list of modules and rerun?  

The question here is more why we have a cache entry and not the file.  Also I would expect the page be in one of the block files unless its content is really huge.

The fix here is to fail load from cache gracefully and load from the network instead.
Err: it's cache:5.
Damn, my browser crashed today and the cached entries are gone. It may take some time until it happens again :-(

I would have included cache:5 in my first log but I couldn't find it documented somewhere like here https://developer.mozilla.org/en/HTTP_Logging
I'll change the code to not fail on missing file/data (if I'm right with the cause of this bug's symptoms).

Matthias, next time you have a profile that can reproduce a bug, please make a backup ;)
Status: NEW → ASSIGNED
Attached file Http log with cache:5
I don't know if it's exact the same issue but I got the same error again on http://wikipedia.org.

I started with a blank page, loaded wikipedia.org and closed the browser and I made a copy of my Cache directory.
bug 791101 looks like the same issue (it's the same error message in the screenshot)
(In reply to Matthias Versen (Matti) from comment #6)
> Created attachment 662609 [details]
> Http log with cache:5

Sorry Matthias, but either me or Michal are able to find cache:5 logging in the file.  It seems you didn't do it correctly.  The setting should be:

NSPR_LOG_MODULES=nsHttp:5,cache:5
Attached file http log next try
directly copied from my commandline:
C:\Program Files\SeaMonkey>set NSPR_LOG_MODULES=nsHttp:5,cache:5
as far as I can tell, cache doesn't define FORCE_PR_LOG and therefore needs a debug build to have working logging...
Right, I'm just creating a patch which adds FORCE_PR_LOG...
I got this once in gdb; nsDiskCacheStreamIO::OpenCacheFile was trying to open the file Cache/8/8D/08AF4d01 which doesn't exist, although Cache/8/8D/08AF4m01 does (although obviously it has no data in it).

Reloading the URL causes the d01 file to become populated.

My breakpoint also triggered on a bunch of other URLs but there it's less obvious when something is wrong (e.g. one was a stylesheet).
I often saw some pages which had been visited already broken due to a part of .css or .js files failed to load. I guess that it's same bug of this.

Now, I have never seen the bug and this bug since I cleared my cache. If I will reproduce this bug with the same profile, I guess that we have a bug in cache management. Otherwise, I guess that current cache code thinks some cache made by older build are broken. If it's the former or the latter and the cache made by release build, I think that this is very serious bug for everyone.

IIRC, I've seen this bug after m-c became 18.
Anything we can do to figure this out? I'm hitting this very regularly when using bugzilla
This is a log of a page load that had cache failures.
Is this log useful? Is there anything more I can provide here?
Keywords: dogfood
Severity: normal → major
Regardless of whether there is a bug in the cache that causes a wrong file not found error, we should make sure we're never stopping a page load due to an error reported from the cache; the cache should be purely optional. So, potentially there are two bugs here.
Tracking this for FF18,FF19 because the latest reports is more on this issue
(In reply to Dave Townsend (:Mossop) from comment #18)
> Created attachment 669310 [details]
> NSPR log with caching info
> 
> This is a log of a page load that had cache failures.

Dave, thanks for the log.

I can see the problem happening (between else) for https://bugzilla.mozilla.org/js/field.js?1349333318.  In my cache (usual settings) it is stored in a block file.

Based on

9308[10b55210]: Opened cache input stream [channel=10e8dbc0, wrapper=94fa528, transport=10b3a9a0, stream=18724c20]
5672[62780c0]: nsInputStreamWrapper::LazyInit [entry=147fbc70, wrapper=18724c20, mInput=0, rv=-2142109678]
5672[62780c0]: nsInputStreamWrapper::Read [entry=147fbc70, wrapper=18724c20, mInput=0, rv=-2142109678]

where (-2142109678 = FFFFFFFF80520012 = NS_ERROR_FILE_NOT_FOUND) the error with high probability comes from [1]  :

265 nsDiskCacheStreamIO::GetInputStream(uint32_t offset, nsIInputStream ** inputStream)

...

286     } else if (mBinding->mRecord.DataFile() == 0) {
287         // open file desc for data
288         rv = OpenCacheFile(PR_RDONLY, &fd);
289         if (NS_FAILED(rv))  return rv;  // unable to open file        

Could it be, that we mistakenly change the metadata as the data were stored in a disk file instead of in a block file?


[1] http://hg.mozilla.org/mozilla-central/annotate/99898ec9976a/netwerk/cache/nsDiskCacheStreams.cpp#l286
Something interesting. I'm suddenly no longer seeing this with bugzilla really when I used to be able to reproduce it almost at will. Now I'm seeing it with wiki.mozilla.org instead.
(In reply to Honza Bambas (:mayhemer) from comment #23)
> Could it be, that we mistakenly change the metadata as the data were stored
> in a disk file instead of in a block file?

First, there might be a bug in the cache code.

But, even if there is not a bug in the cache code, this kind of failure will occur sometimes, because we don't have an ACID datastore for the cache. It could be that we change the metadata correctly, create and write the file correctly, but for some reason the OS writes the metadata to disk, but never writes the actual file. So many possibilities there.

Now that we have the HttpCacheQuery code, we should be able to actually open the file in AsyncOpenCacheEntry. That way, if we get a "file not found" error during opening, we can doom the entry and get the resource from the network. In other words, the logic in the patch for bug 746018 would help us recover from this particular error without user disruption. So, IMO, regardless of whether we have an underlying cache issue to solve, we should fix bug 746018 to make the cache more reliable.
(In reply to Dave Townsend (:Mossop) from comment #24)
> Something interesting. I'm suddenly no longer seeing this with bugzilla

i still see it on bmo, but with a much reduced frequency.
To all on this bug that collects logs: please add module "append" (no :5 after) to NSPR_LOG_MODULES.  This will preserve the log by not truncating but appending it on Fx restart and we may then try to backtrace origin of the failing entry.

(In reply to Brian Smith (:bsmith) from comment #25)
> (In reply to Honza Bambas (:mayhemer) from comment #23)
> > Could it be, that we mistakenly change the metadata as the data were stored
> > in a disk file instead of in a block file?
> 
> First, there might be a bug in the cache code.
> 
> It
> could be that we change the metadata correctly, create and write the file
> correctly, but for some reason the OS writes the metadata to disk, but never
> writes the actual file. 

Maybe the cache entry data didn't fit block file on machines the bug manifests on?

> 
> Now that we have the HttpCacheQuery code, we should be able to actually open
> the file in AsyncOpenCacheEntry. That way, if we get a "file not found"
> error during opening, we can doom the entry and get the resource from the
> network. In other words, the logic in the patch for bug 746018 would help us
> recover from this particular error without user disruption. So, IMO,
> regardless of whether we have an underlying cache issue to solve, we should
> fix bug 746018 to make the cache more reliable.

Sure, I want to do exactly that.  But first I would like to try to find the true issue since this may lead to raise in network traffic.  However, any fix must land on m-a not to get this horrible bad UX bug to m-b.

Also, this is a recent regression, so this is probably caused by some recent change to how we are opening/closing the cache entry.  Threading has changed and we may now uncover some bug in the deep and dark forest of the cache code.

I believe Michal is the only one able to put some light in here.
I'm looking at the same symptoms (file not found opening one page's css) and using about:cache I tracked it to its cache file.

It's a standalone file ("file on disk" is not "none" like some smaller cache entries) and it does exist on my file system, but its size is zero.

Hopefully this helps a bit.
Attached patch fixSplinter Review
We don't close file and release the file descriptor when we delete the old storage. This is a problem when the old location is a separate file and the new data is small enough that it would normally fit into the block file but the block files are full. In this case the data is written to the separate file in nsDiskCacheStreamIO::FlushBufferToFile() but since the mFD is still open, we write the data to already deleted file.
Assignee: honzab.moz → michal.novotny
Attachment #671869 - Flags: review?(honzab.moz)
Comment on attachment 671869 [details] [diff] [review]
fix

Changing reviewer because Honza is offline this week.
Attachment #671869 - Flags: review?(honzab.moz) → review?(hurley)
Comment on attachment 671869 [details] [diff] [review]
fix

Looks good to me.
Attachment #671869 - Flags: review?(hurley) → review+
https://hg.mozilla.org/mozilla-central/rev/cc7c7fdbfd53

Should this have a test?
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
In bug 802729, michal wrote:

> Yes, this is a dupe of #771832. Maybe it's worth to note that with fix in bug #771832 we just stop creating new bad entries, so it can take some time until all the existing entries with missing data are evicted or reloaded from the net...

I filed bug 802915 about recovering from these failures.
I'm still seeing this on Bugzilla with the latest hourly build m-c win32   
https://hg.mozilla.org/mozilla-central/rev/7fcac3016159

previously reported/duped https://bugzilla.mozilla.org/show_bug.cgi?id=800266


win7 x64
Jim: The wrong entries are still in your cache files. This patch fixes the writing of bad entries to the disk cache. Clearing the disk cache once should fix your cache.
Are we planning on uplifting this patch? Matti reported this on Gecko 16 but the exact cause of the bug is not clear to me.
Comment on attachment 671869 [details] [diff] [review]
fix

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 604897
User impact if declined: users will see "File Not Found" error page more often
Testing completed (on m-c, etc.): manually on m-c
Risk to taking this patch (and alternatives if risky): probably none
String or UUID changes made by this patch: none
Attachment #671869 - Flags: approval-mozilla-beta?
Attachment #671869 - Flags: approval-mozilla-aurora?
Attachment #671869 - Flags: approval-mozilla-beta?
Attachment #671869 - Flags: approval-mozilla-beta+
Attachment #671869 - Flags: approval-mozilla-aurora?
Attachment #671869 - Flags: approval-mozilla-aurora+
It's not fixed for me. I still see an error File not found.

> File not found
> Firefox can't find the file at http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml/4.

Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/19.0 Firefox/19.0
BuildID: 20121023030553

HTTP Connection Diagnostics: http://pastebin.mozilla.org/1876770
HTTP Log: http://pastebin.mozilla.org/1876771
(In reply to Ekanan Ketunuti from comment #44)
> It's not fixed for me. I still see an error File not found.

Have you delete your cache?
I remember I delete my cache two days ago.
I am still seeing this behavior on the 10/26 Aurora build. It appears to be fixed in Nightly, as I am unable to reproduce the problem.
Is deleting the cache required to see the effects of this fix?
(In reply to Ehsan Akhgari [:ehsan] from comment #50)
> Is deleting the cache required to see the effects of this fix?

Yes
I still see this bug happen from time to time (current trunk build, cache has been cleared multiple times) on the German Wikipedia. Did we actually figure why this bug started happening? As far as I see this, this was/is a regression.
Blocks: 808532
Is there something QA can do to help dogfood and verify this issue, thereby verifying it fixed? It's not clear to me that we have a reliable testcase, steps, nor if this is truly fixed.
Whiteboard: [qa?]
I'm still seeing the effects of this on Aurora with a cleared cache so I don't think this is fixed either. Sadly I don't have reliable repro steps but I'm seeing the stylesheets for http://xboxforums.create.msdn.com/ fail to load around 50% of the time even though the web console's net view shows 304s for all the relevant assets. All I have to do is browse around through the various subforums.

I even just cleared my cache 15 minutes ago and it's still happening so I think something is still broken here.
Oh, and I just checked and this seems to affect view source in the developer tools too, thus:

Firefox can't find the file at view-source:http://xboxforums.create.msdn.com/assets/css/main_Default_UpLevel.css?version=3.0.10335.
View Source very often fails to load for me as well (shows an empty document). A quick reload (F5) of the page in question, followed again by CTRL+U (View Source) does work; since this doesn't actually clear the cache... I'm not too sure it really has to do with this problem...
(In reply to M. Sokolewicz from comment #56)
> View Source very often fails to load for me as well (shows an empty
> document). A quick reload (F5) of the page in question, followed again by
> CTRL+U (View Source) does work; since this doesn't actually clear the
> cache... I'm not too sure it really has to do with this problem...

I think it's still related because if you read the original comment on the bug:
> The URL itself works in a different profile and you can fix this single URL by forcing a reload (ctrl+f5)
So that is consistent with the behavior you can observe by view-sourcing the stylesheet (and force-reloading the view-source fixes the bad cache entry, at least for a little bit).
See Bug 808532 on the remaining issues with the cache, it's being tracked in a new bug.
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #53)
> Is there something QA can do to help dogfood and verify this issue, thereby
> verifying it fixed? It's not clear to me that we have a reliable testcase,
> steps, nor if this is truly fixed.

STR: download http://bagr.lightcomp.cz/bug771832/Cache.zip, unpack it and replace the cache directory in the FF profile with it. Start Firefox, go to URL http://bagr.lightcomp.cz/bug771832/test.html and follow the steps.
Adding verifyme to test with the steps in comment 59.
Keywords: verifyme
Whiteboard: [qa?]
I have tried to reproduce this issue on Firefox 16b6 and 16.0.2 but it never reproduced. Using the steps in comment 59, I get the same results for Fx 16 and Fx 17 (20121116115405): refresh happens every time I press F5 or click the reload button, but it doesn't always happen when I click the URL bar and press Enter. No other issues happen.

Are there any rebliable steps to reproduce this issue?
Keywords: verifyme
Whiteboard: [qa?]
May be there still need to be more cases where the file that was written too, needs to closed. mFD for output streams is only closed when Flush() is called, but it seems that in some error cases, Flush() is not called on close or destroy of the nsDiskCacheStream.
I just experienced this on Firefox 30 running on an Ubuntu netbook.  It started around the time I was browsing while the disk partition unexpectedly ran completely out of space.  Then, every Back resulted in the misleading “File not found” message until I cleared the cache.
(In reply to Travis Evans from comment #63)
> I just experienced this on Firefox 30 running on an Ubuntu netbook. 

You're probably hitting a different bug with a similar symptom since *this* bug was fixed over a year ago. Please file a new bug report if you can reproduce what you've described in comment 63. Thank you.
Whiteboard: [qa?] → [qa-]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: