Closed Bug 617123 Opened 14 years ago Closed 13 years ago

Images are randomly corrupted when fallocate used on block files

Categories

(Core :: Networking: Cache, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla5

People

(Reporter: polidobj, Assigned: michal)

References

Details

(Keywords: regression)

Attachments

(4 files)

Randomly I run into images being corrupted.  I have a fairly reproducible set of steps.  

1. Start here:
http://www.flickr.com/photos/15530177@N05/5210167211/in/photostream/
2. keep going through the photo stream. I just kept going to the next photo in the stream.
Eventually I get a corrupt image shown. Reloading does not fix the problem.  But doing a Ctrl-F5 to reload the page corrects the problem. 

The file in the cache seems to be ok. I looked in about:cache and found the entry for the photo. I copied the file from the cache directory using windows explorer and renamed it as a jpg. Viewing that showed a photo with no problem.

Here's the nightly regression range that I found:
11-24 - ok - http://hg.mozilla.org/mozilla-central/rev/7f5cd850578e
11-25 - broken - http://hg.mozilla.org/mozilla-central/rev/32cd5d95a95f
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=7f5cd850578e&tochange=32cd5d95a95f

Since the cache seems to be involved bug 592422 seems like a likely culprit.
I see this too
blocking2.0: --- → ?
Confirmed on Mozilla/5.0 (Windows NT 5.1; rv:2.0b8pre) Gecko/20101202 Firefox/4.0b8pre ID:20101202030316

Regression window in local build:
Build from 59144b59cd11 : fails
Build from 2c8654b57cca : not corruption
So this is regression of changeset 59144b59cd11 bug 592422
It's happening with PNGs too. Very hard to reproduce, kinda random, but the STR always brings the problem.
Attached image reproduceable
could reproduce it on a new profile .
I've been experiencing this quite a lot. Also a quick search to find this bug shows that quite a few bugs could be duped to this.
I see it to on the linked flickr page, it seems random but usually happens within 10 or so pictures viewed.

Reloading does not fix it but shift+reload (reload cache) does.

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b8pre) Gecko/20101206 Firefox/4.0b8pre
What platforms are you all using?  We've got two reports from Windows so far. I'm wondering if this is Windows-specific, or specific to FAT32 vs NFTS.

> So this is regression of changeset 59144b59cd11 bug 592422

Very strange.  It's hard to imagine how fallocate-ing the file length could be causing such problems.  Unless we've still got the Windows fallocate emulation code wrong.  I just pored over it and don't see anything obvious.

If we can't reproduce and fix, I guess we'll have to back out bug 592422 (again), at least for the platforms where we're seeing this.
If anyone is seeing this reliably, and can run a debugger with a breakpoint set at mozilla::fallocate() to see if the file pointer is not getting restored to the correct position (ie. PR_Seek64(aFD, oldpos, PR_SEEK_SET) doesn't get reached for the windows case), that would be most useful.
I've been seeing this on Flickr on my MacBook Pro in 10.6.4.
I've been seeing it on my laptop (Windows Vista x64, D3D9) and my PC (Windows 7 x64, D2D/D3D10). I haven't seen it on today's nightly however, even though it happened to me a lot yesterday. Might just be an unfortunate coincidence though ...
(In reply to comment #11)
> I've been seeing it on my laptop (Windows Vista x64, D3D9) and my PC (Windows 7
> x64, D2D/D3D10). I haven't seen it on today's nightly however, even though it
> happened to me a lot yesterday. Might just be an unfortunate coincidence though
> ...
Coincidence.. Yesterday I was able to reproduce it after only like 6-8 pictures, today it took 50+ samples, then only like 6 samples after that one.
(In reply to comment #8)
> What platforms are you all using?  We've got two reports from Windows so far.
> I'm wondering if this is Windows-specific, or specific to FAT32 vs NFTS.

I'm using build Mozilla/5.0 (Windows NT 6.1; rv:2.0b8pre) Gecko/20101208 Firefox/4.0b8pre ID:20101208030336
(In reply to comment #8)
> What platforms are you all using?  We've got two reports from Windows so far.
> I'm wondering if this is Windows-specific, or specific to FAT32 vs NFTS.

I was able to reproduce it on Fedora 14.

> Very strange.  It's hard to imagine how fallocate-ing the file length could be
> causing such problems.  Unless we've still got the Windows fallocate emulation
> code wrong.  I just pored over it and don't see anything obvious.

I've checked the cache entry and it is correct. The image is shown correctly after FF restart without re-fetching it again from the server. So the problem is in the image cache, not necko cache.
Is this reproducible elsewhere or is it very specific to the way flickr embeds images? I've browsed through lots of sites that are image based and couldn't reproduce it elsewhere.

In any case I think this should probably be a blocker, possibly b8.
(In reply to comment #15)
> Is this reproducible elsewhere or is it very specific to the way flickr...

It's not flickr-only.  Flickr is just convenient.
I saw this a lot on www.mangafox.com for various manga. The corruption appears in different forms - sometimes only part of the image loads, other times images appear grainy (corrupt pixels in a grid-like pattern), and still other times images would fully load but change colour partway through and either stay reasonably normal or become completely corrupted. I haven't seen it since yesterday though, for whatever reason.
(In reply to comment #15)
> Is this reproducible elsewhere or is it very specific to the way flickr embeds
> images? I've browsed through lots of sites that are image based and couldn't
> reproduce it elsewhere.
> 
> In any case I think this should probably be a blocker, possibly b8.

I've seen it elsewhere. I'm actually seeing this a lot, but I peruse images online a lot.
Blocking on this recent regression. Michal says he can look into this one.
Assignee: nobody → michal.novotny
blocking2.0: ? → betaN+
For those who can't reproduce: the STR on bug 618014 (combined with an artificially-slowed-down internet connection with a tool like "wondershaper" on linux) seem to be pretty reliable.  (Basically: load a page with a large image, hit 'stop' while the image is partially loaded, and reload)
If a fix can't be made in time shouldn't the regression bug be backed out before Beta8?
> shouldn't the regression bug be backed out before Beta8?

Yes. I've removed the cache's calls to fallocate for now:

http://hg.mozilla.org/mozilla-central/rev/8d59a3285d92
Please let me know if this actually appears to fix things: it's still mysterious to me as to how fallocate is breaking these image loads.

(I didn't back out mozilla:fallocate itself, but if just having the function around w/o getting called is breaking stuff, then we have Big Problems :)
Things appear to have been fixed. But you know, it could be due to the networking problems as that patch was backed out too. I'd suggest pushing the fallocate to m-c again for a day to test.
Paul,

Which networking patch backout are you referring to?
(In reply to comment #23)
> > shouldn't the regression bug be backed out before Beta8?
> 
> Yes. I've removed the cache's calls to fallocate for now:
> 
> http://hg.mozilla.org/mozilla-central/rev/8d59a3285d92

(In reply to comment #23)
> > shouldn't the regression bug be backed out before Beta8?
> 
> Yes. I've removed the cache's calls to fallocate for now:
> 
> http://hg.mozilla.org/mozilla-central/rev/8d59a3285d92

I am on current hourly and seen the bug on the STR after the 3rd image. Was corrupted with half of the image being green. only fixed by a shift-reload (cache reload).
Scott:  Can you test and see if the regression range in comment 3 is correct?  Testing with a nightly from Nov 23rd ought to do the trick.
Scott: Oh, stupid me.  Please also try clearing your cache.  I'm going to push a cache version++ in case that's the issue.
I cleared the cache and got the latest hourly, went to the testcase and clicked through over 100 images and haven't seen the issue. Maybe it needed a cache flush first or it wasn't in my previous hourly changeset. Haven't gotten to testing pre Nov 23rd builds.
(In reply to comment #26)
> Paul,
> 
> Which networking patch backout are you referring to?

bug 614677
Is anyone still seeing this bug in nightlies?

I'm hoping the backout of bug 592422 (comment 23) and/or bug 614677 has fixed this for now, and we revisit after 2.0 ships.
Not seeing this at all any more.
Same here. Can't get it to reproduce on on nightles.
This appears to have been fixed by one of the things backed out.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Depends on: 619685
Still seems to be fixed running today's nightly (20101217) which includes the relanding(?) of the HTTP acceleration bugs.
(In reply to comment #36)
> relanding(?) of the HTTP acceleration bugs.

fixes! (not bugs :))
Csaba,

What build are you using?
Latest win32 nightly.
Well then this is reopened!  Patrick, I suppose the main culprit is the HTTP acceleration patches, since fallocate hasn't re-landed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Forgive me if I'm wrong, but I'm not sure that what Csaba is seeing is the same as what this bug pertains to. Since last week I haven't noticed any image corruption artifacts in the nightlies.
Maybe the link is confusing. It's not a screenshot of the corruption, it's just the image which was corrupted, and it's displayed well for you.

Now it's okay for me to, because I reloaded the image, and the corruption went away. When it was corrupted, it looked like attachments.
(In reply to comment #41)
> Well then this is reopened!  Patrick, I suppose the main culprit is the HTTP
> acceleration patches, since fallocate hasn't re-landed.

fwiw - I'm not able to see a problem - even with the shaper active.

csaba, is it reproducible for you - and if so, does setting the pref network.http.connection-retry-timeout to 0 fix the problem? 

Theories are welcome.
I can't reproduce, but it still happens randomly for me. It didn't happen since that image.
I've run a bunch of tests this morning, and I definitely think we need to stare at fallocate some more.

I tested using http://www.flickr.com/photos/15530177@N05/5210167211/in/photostream/ and clicked the "older" link up to 25 times. I tried each build a few times and for each iteration on the same build I reached a consistent "reproduced or can't reproduce" within those 25 clicks. For whatever reason when it did reproduce the fifth click was very often (though not always) the first place I saw it.

Every test was with a clean profile on linux 64 using ext4.

1] 59033:4182651505fb - this was back during beta 7 with both the fallocate and syn retry patches included and I built this basically as a control. The problem easily reproduced.

2] 59034:0a9e64523c06 - the only change here was to disable the syn retry code via pref. The problem easily reproduced with syn retry disabled.
 
3] 59076:ed44f2f4c7cb - wrt fallocate and syn retry this is the same (no syn retry, yes fallocate) as #2. The problem did still reproduce.

4] 59077:8d59a3285d92 - this is the patch that removes the fallocate calls. The problem was not reproducible.

5] 59552:49a1b2aa43c5 - this was the tip this morning. Syn retry has been reneabled there. The problem was not reproducible,

I'm not sure how to explain comment 38 other than to wonder about some kind of artifact in the profile. I tested the URL cited there on builds 3, 4, and 5. I was able to reproduce the problem only on build 3. I did it there by interrupting the loading of the image and then pressing ^R. About 1/2 the time on build three that would put a green line in the image, where I tried that a dozen times on builds 4 and 5 and never got a corruption.

The fact that there is just 1 change between builds 3 and 4 and such a dramatic change in behavior is pretty convincing to me.
(In reply to comment #45)
> I can't reproduce, but it still happens randomly for me. It didn't happen since
> that image.

WonderCsabo: Based on other comments, I wonder if there might be some bad stuff cached by the older nightly builds that had the problem. Have you cleared your cache since the other changes were backed out (around 10th December)?
Whiteboard: hardblocker
Whiteboard: hardblocker → [hardblocker]
I see it's marked as a hardblocker.

I don't see this any more.
Michal tells me that this is no longer an issue that we need to worry about for Firefox 4 since the fallocate changes in bug 592422 seemed to have caused this, and those changes were backed out for now. Leaving the bug open to track this after we re-land bug 592422, but removing the blocker flags etc.
blocking2.0: betaN+ → ---
Whiteboard: [hardblocker]
Whiteboard: [currently fixed by backout of bug 592422]
Attached patch fixSplinter Review
Flickr preloads images that are on the next page. If the next page is opened before the preload is finished then the partial data from the cache is used. Reading goes beyond cached data because nsDiskCacheInputStream::Read() doesn't check the end of the stream when reading from the separate file.
Attachment #503029 - Flags: review?(bjarne)
Attachment #503029 - Flags: approval2.0?
If you stop an image load midway, instead of continuing to show the partially loaded image the text "The image [image location here] cannot be displayed because it contains errors." replaces it. Is that related to this issue, or a different bug? (I decided to mention it here because the patch sounds somewhat related)
This is another bug. I saw it some time ago too and IIRC this happens only in case of gzipped encoding and (or?) chunked transfer.
Comment on attachment 503029 [details] [diff] [review]
fix

Could you clarify this patch vs comment #49 before we continue the process?
(In reply to comment #53)
> Comment on attachment 503029 [details] [diff] [review]
> fix
> 
> Could you clarify this patch vs comment #49 before we continue the process?

That comment is true. The patch re-lands fallocating from bug 592422 and fixes the problem.
Comment on attachment 503029 [details] [diff] [review]
fix

Please re-request approval once this is reviewed.
Attachment #503029 - Flags: approval2.0?
The fix in this bug resolves Bug 592422 which had been marked "checkin-needed".
Bjarne: your review is blocking
Hmm...  wonder why I didn't end up on CClist after adding comment #53...

Michal: If you want a quick review please ask Jason since he is familiar with the issue and bug #592422. I will need to start pretty much from the beginning in order to review this. It may also be useful to consolidate states for this bug and bug #592422 since it seems to me like only one of them should be open?
Comment on attachment 503029 [details] [diff] [review]
fix

Looks good.  I'm going to push through try and if it's green, land this.

Note this the patch re-enables the patch for 644431, so we can close that when this is resolved.
Attachment #503029 - Flags: review?(bjarne) → review+
Summary: Images are randomly corrupted → Images are randomly corrupted when fallocate used on block files
Whoops!  I mean to say in comment 58 that this also fixes bug 592422, not 644431.
Just curious; will this fix cause a slight improvement to caching? When I noticed the bug, it appeared that Firefox was trying to use the partially downloaded image from cache and resume downloading the rest of the image as normal. Perhaps this was even the same behaviour before the bug, I just never noticed it.
Landed on cedar:

   http://hg.mozilla.org/projects/cedar/rev/473ec8c05215

Lozzy:  The idea is to improve cache IO performance, yes.  Our imagelib code will use a cache copy of an image, and then replace it with an updated image from the net.  Some consider that a bug, some a feature...
Pushed to m-c: http://hg.mozilla.org/mozilla-central/rev/473ec8c05215
Status: REOPENED → RESOLVED
Closed: 14 years ago13 years ago
Resolution: --- → FIXED
Does that mean bug 592422 is fixed as well and that the whiteboard in this bug should be cleared?
Target Milestone: --- → mozilla2.2
Whiteboard: [currently fixed by backout of bug 592422]
Target Milestone: mozilla2.2 → ---
Target Milestone: --- → mozilla2.2
Do we have tests in the testsuite that can detect regressions on this issue? This would be useful for bug 816642.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: