617123 - Images are randomly corrupted when fallocate used on block files

Reporter

Description

•

14 years ago

Attached image screenshot of problem - the bottom of the photo always is a shade of green — Details

Randomly I run into images being corrupted.  I have a fairly reproducible set of steps.  

1. Start here:
http://www.flickr.com/photos/15530177@N05/5210167211/in/photostream/
2. keep going through the photo stream. I just kept going to the next photo in the stream.
Eventually I get a corrupt image shown. Reloading does not fix the problem.  But doing a Ctrl-F5 to reload the page corrects the problem. 

The file in the cache seems to be ok. I looked in about:cache and found the entry for the photo. I copied the file from the cache directory using windows explorer and renamed it as a jpg. Viewing that showed a photo with no problem.

Here's the nightly regression range that I found:
11-24 - ok - http://hg.mozilla.org/mozilla-central/rev/7f5cd850578e
11-25 - broken - http://hg.mozilla.org/mozilla-central/rev/32cd5d95a95f
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=7f5cd850578e&tochange=32cd5d95a95f

Since the cache seems to be involved bug 592422 seems like a likely culprit.

(dormant account)

Comment 2

•

14 years ago

I see this too

blocking2.0: --- → ?

Alice0775 White

Comment 3

•

14 years ago

Confirmed on Mozilla/5.0 (Windows NT 5.1; rv:2.0b8pre) Gecko/20101202 Firefox/4.0b8pre ID:20101202030316

Regression window in local build:
Build from 59144b59cd11 : fails
Build from 2c8654b57cca : not corruption
So this is regression of changeset 59144b59cd11 bug 592422

Csaba Kozák [:WonderCsabo]

Comment 4

•

14 years ago

It's happening with PNGs too. Very hard to reproduce, kinda random, but the STR always brings the problem.

Exactly {Clear}

Comment 5

•

14 years ago

Attached image reproduceable — Details

could reproduce it on a new profile .

Paul [pwd]

Comment 6

•

14 years ago

Attached image Image showing the problem — Details

I've been experiencing this quite a lot. Also a quick search to find this bug shows that quite a few bugs could be duped to this.

Scott A.

Comment 7

•

14 years ago

I see it to on the linked flickr page, it seems random but usually happens within 10 or so pictures viewed.

Reloading does not fix it but shift+reload (reload cache) does.

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b8pre) Gecko/20101206 Firefox/4.0b8pre

Jason Duell

Comment 8

•

14 years ago

What platforms are you all using?  We've got two reports from Windows so far. I'm wondering if this is Windows-specific, or specific to FAT32 vs NFTS.

> So this is regression of changeset 59144b59cd11 bug 592422

Very strange.  It's hard to imagine how fallocate-ing the file length could be causing such problems.  Unless we've still got the Windows fallocate emulation code wrong.  I just pored over it and don't see anything obvious.

If we can't reproduce and fix, I guess we'll have to back out bug 592422 (again), at least for the platforms where we're seeing this.

Jason Duell

Comment 9

•

14 years ago

If anyone is seeing this reliably, and can run a debugger with a breakpoint set at mozilla::fallocate() to see if the file pointer is not getting restored to the correct position (ie. PR_Seek64(aFD, oldpos, PR_SEEK_SET) doesn't get reached for the windows case), that would be most useful.

Josh Matthews [:jdm]

Comment 10

•

14 years ago

I've been seeing this on Flickr on my MacBook Pro in 10.6.4.

Emanuel Hoogeveen [:ehoogeveen]

Comment 11

•

14 years ago

I've been seeing it on my laptop (Windows Vista x64, D3D9) and my PC (Windows 7 x64, D2D/D3D10). I haven't seen it on today's nightly however, even though it happened to me a lot yesterday. Might just be an unfortunate coincidence though ...

Scott A.

Comment 12

•

14 years ago

(In reply to comment #11)
> I've been seeing it on my laptop (Windows Vista x64, D3D9) and my PC (Windows 7
> x64, D2D/D3D10). I haven't seen it on today's nightly however, even though it
> happened to me a lot yesterday. Might just be an unfortunate coincidence though
> ...
Coincidence.. Yesterday I was able to reproduce it after only like 6-8 pictures, today it took 50+ samples, then only like 6 samples after that one.

Paul [pwd]

Comment 13

•

14 years ago

(In reply to comment #8)
> What platforms are you all using?  We've got two reports from Windows so far.
> I'm wondering if this is Windows-specific, or specific to FAT32 vs NFTS.

I'm using build Mozilla/5.0 (Windows NT 6.1; rv:2.0b8pre) Gecko/20101208 Firefox/4.0b8pre ID:20101208030336

Michal Novotny [:michal]

Assignee

Comment 14

•

14 years ago

(In reply to comment #8)
> What platforms are you all using?  We've got two reports from Windows so far.
> I'm wondering if this is Windows-specific, or specific to FAT32 vs NFTS.

I was able to reproduce it on Fedora 14.

> Very strange.  It's hard to imagine how fallocate-ing the file length could be
> causing such problems.  Unless we've still got the Windows fallocate emulation
> code wrong.  I just pored over it and don't see anything obvious.

I've checked the cache entry and it is correct. The image is shown correctly after FF restart without re-fetching it again from the server. So the problem is in the image cache, not necko cache.

Scott A.

Comment 15

•

14 years ago

Is this reproducible elsewhere or is it very specific to the way flickr embeds images? I've browsed through lots of sites that are image based and couldn't reproduce it elsewhere.

In any case I think this should probably be a blocker, possibly b8.

IU

Comment 16

•

14 years ago

(In reply to comment #15)
> Is this reproducible elsewhere or is it very specific to the way flickr...

It's not flickr-only.  Flickr is just convenient.

Emanuel Hoogeveen [:ehoogeveen]

Comment 17

•

14 years ago

I saw this a lot on www.mangafox.com for various manga. The corruption appears in different forms - sometimes only part of the image loads, other times images appear grainy (corrupt pixels in a grid-like pattern), and still other times images would fully load but change colour partway through and either stay reasonably normal or become completely corrupted. I haven't seen it since yesterday though, for whatever reason.

Paul [pwd]

Comment 18

•

14 years ago

(In reply to comment #15)
> Is this reproducible elsewhere or is it very specific to the way flickr embeds
> images? I've browsed through lots of sites that are image based and couldn't
> reproduce it elsewhere.
> 
> In any case I think this should probably be a blocker, possibly b8.

I've seen it elsewhere. I'm actually seeing this a lot, but I peruse images online a lot.

Johnny Stenback (:jst)

Comment 19

•

14 years ago

Blocking on this recent regression. Michal says he can look into this one.

Assignee: nobody → michal.novotny

blocking2.0: ? → betaN+

Daniel Holbert [:dholbert]

Comment 21

•

14 years ago

For those who can't reproduce: the STR on bug 618014 (combined with an artificially-slowed-down internet connection with a tool like "wondershaper" on linux) seem to be pretty reliable.  (Basically: load a page with a large image, hit 'stop' while the image is partially loaded, and reload)

Scott A.

Comment 22

•

14 years ago

If a fix can't be made in time shouldn't the regression bug be backed out before Beta8?

Jason Duell

Comment 23

•

14 years ago

> shouldn't the regression bug be backed out before Beta8?

Yes. I've removed the cache's calls to fallocate for now:

http://hg.mozilla.org/mozilla-central/rev/8d59a3285d92

Jason Duell

Comment 24

•

14 years ago

Please let me know if this actually appears to fix things: it's still mysterious to me as to how fallocate is breaking these image loads.

(I didn't back out mozilla:fallocate itself, but if just having the function around w/o getting called is breaking stuff, then we have Big Problems :)

Paul [pwd]

Comment 25

•

14 years ago

Things appear to have been fixed. But you know, it could be due to the networking problems as that patch was backed out too. I'd suggest pushing the fallocate to m-c again for a day to test.

Jason Duell

Comment 26

•

14 years ago

Paul,

Which networking patch backout are you referring to?

Scott A.

Comment 27

•

14 years ago

(In reply to comment #23)
> > shouldn't the regression bug be backed out before Beta8?
> 
> Yes. I've removed the cache's calls to fallocate for now:
> 
> http://hg.mozilla.org/mozilla-central/rev/8d59a3285d92

(In reply to comment #23)
> > shouldn't the regression bug be backed out before Beta8?
> 
> Yes. I've removed the cache's calls to fallocate for now:
> 
> http://hg.mozilla.org/mozilla-central/rev/8d59a3285d92

I am on current hourly and seen the bug on the STR after the 3rd image. Was corrupted with half of the image being green. only fixed by a shift-reload (cache reload).

Jason Duell

Comment 28

•

14 years ago

Scott:  Can you test and see if the regression range in comment 3 is correct?  Testing with a nightly from Nov 23rd ought to do the trick.

Jason Duell

Comment 29

•

14 years ago

Scott: Oh, stupid me.  Please also try clearing your cache.  I'm going to push a cache version++ in case that's the issue.

Scott A.

Comment 30

•

14 years ago

I cleared the cache and got the latest hourly, went to the testcase and clicked through over 100 images and haven't seen the issue. Maybe it needed a cache flush first or it wasn't in my previous hourly changeset. Haven't gotten to testing pre Nov 23rd builds.

Paul [pwd]

Comment 31

•

14 years ago

(In reply to comment #26)
> Paul,
> 
> Which networking patch backout are you referring to?

bug 614677

Jason Duell

Comment 32

•

14 years ago

Is anyone still seeing this bug in nightlies?

I'm hoping the backout of bug 592422 (comment 23) and/or bug 614677 has fixed this for now, and we revisit after 2.0 ships.

Paul [pwd]

Comment 33

•

14 years ago

Not seeing this at all any more.

Lozzy

Comment 34

•

14 years ago

Same here. Can't get it to reproduce on on nightles.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 35

•

14 years ago

This appears to have been fixed by one of the things backed out.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

timeless

Updated

•

14 years ago

Depends on: 619685

Scott A.

Comment 36

•

14 years ago

Still seems to be fixed running today's nightly (20101217) which includes the relanding(?) of the HTTP acceleration bugs.

Patrick McManus [:mcmanus]

Comment 37

•

14 years ago

(In reply to comment #36)
> relanding(?) of the HTTP acceleration bugs.

fixes! (not bugs :))

Csaba Kozák [:WonderCsabo]

Comment 38

•

14 years ago

I still see this:

http://img193.imageshack.us/img193/9853/gladiator1a.jpg

Jason Duell

Comment 39

•

14 years ago

Csaba,

What build are you using?

Csaba Kozák [:WonderCsabo]

Comment 40

•

14 years ago

Latest win32 nightly.

Jason Duell

Comment 41

•

14 years ago

Well then this is reopened!  Patrick, I suppose the main culprit is the HTTP acceleration patches, since fallocate hasn't re-landed.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Lozzy

Comment 42

•

14 years ago

Forgive me if I'm wrong, but I'm not sure that what Csaba is seeing is the same as what this bug pertains to. Since last week I haven't noticed any image corruption artifacts in the nightlies.

Csaba Kozák [:WonderCsabo]

Comment 43

•

14 years ago

Maybe the link is confusing. It's not a screenshot of the corruption, it's just the image which was corrupted, and it's displayed well for you.

Now it's okay for me to, because I reloaded the image, and the corruption went away. When it was corrupted, it looked like attachments.

Patrick McManus [:mcmanus]

Comment 44

•

14 years ago

(In reply to comment #41)
> Well then this is reopened!  Patrick, I suppose the main culprit is the HTTP
> acceleration patches, since fallocate hasn't re-landed.

fwiw - I'm not able to see a problem - even with the shaper active.

csaba, is it reproducible for you - and if so, does setting the pref network.http.connection-retry-timeout to 0 fix the problem? 

Theories are welcome.

Csaba Kozák [:WonderCsabo]

Comment 45

•

14 years ago

I can't reproduce, but it still happens randomly for me. It didn't happen since that image.

Patrick McManus [:mcmanus]

Comment 46

•

14 years ago

I've run a bunch of tests this morning, and I definitely think we need to stare at fallocate some more.

I tested using http://www.flickr.com/photos/15530177@N05/5210167211/in/photostream/ and clicked the "older" link up to 25 times. I tried each build a few times and for each iteration on the same build I reached a consistent "reproduced or can't reproduce" within those 25 clicks. For whatever reason when it did reproduce the fifth click was very often (though not always) the first place I saw it.

Every test was with a clean profile on linux 64 using ext4.

1] 59033:4182651505fb - this was back during beta 7 with both the fallocate and syn retry patches included and I built this basically as a control. The problem easily reproduced.

2] 59034:0a9e64523c06 - the only change here was to disable the syn retry code via pref. The problem easily reproduced with syn retry disabled.

3] 59076:ed44f2f4c7cb - wrt fallocate and syn retry this is the same (no syn retry, yes fallocate) as #2. The problem did still reproduce.

4] 59077:8d59a3285d92 - this is the patch that removes the fallocate calls. The problem was not reproducible.

5] 59552:49a1b2aa43c5 - this was the tip this morning. Syn retry has been reneabled there. The problem was not reproducible,

I'm not sure how to explain comment 38 other than to wonder about some kind of artifact in the profile. I tested the URL cited there on builds 3, 4, and 5. I was able to reproduce the problem only on build 3. I did it there by interrupting the loading of the image and then pressing ^R. About 1/2 the time on build three that would put a green line in the image, where I tried that a dozen times on builds 4 and 5 and never got a corruption.

The fact that there is just 1 change between builds 3 and 4 and such a dramatic change in behavior is pretty convincing to me.

Michael Lefevre

Comment 47

•

14 years ago

(In reply to comment #45)
> I can't reproduce, but it still happens randomly for me. It didn't happen since
> that image.

WonderCsabo: Based on other comments, I wonder if there might be some bad stuff cached by the older nightly builds that had the problem. Have you cleared your cache since the other changes were backed out (around 10th December)?

Johnny Stenback (:jst)

Updated

•

14 years ago

Whiteboard: hardblocker

Johnny Stenback (:jst)

Updated

•

14 years ago

Whiteboard: hardblocker → [hardblocker]

Csaba Kozák [:WonderCsabo]

Comment 48

•

14 years ago

I see it's marked as a hardblocker.

I don't see this any more.

Johnny Stenback (:jst)

Comment 49

•

14 years ago

Michal tells me that this is no longer an issue that we need to worry about for Firefox 4 since the fallocate changes in bug 592422 seemed to have caused this, and those changes were backed out for now. Leaving the bug open to track this after we re-land bug 592422, but removing the blocker flags etc.

blocking2.0: betaN+ → ---

Whiteboard: [hardblocker]

Daniel Holbert [:dholbert]

Updated

•

14 years ago

Whiteboard: [currently fixed by backout of bug 592422]

Michal Novotny [:michal]

Assignee

Comment 50

•

14 years ago

Attached patch fix — Details — Splinter Review

Flickr preloads images that are on the next page. If the next page is opened before the preload is finished then the partial data from the cache is used. Reading goes beyond cached data because nsDiskCacheInputStream::Read() doesn't check the end of the stream when reading from the separate file.

Attachment #503029 - Flags: review?(bjarne)

Attachment #503029 - Flags: approval2.0?

Emanuel Hoogeveen [:ehoogeveen]

Comment 51

•

14 years ago

If you stop an image load midway, instead of continuing to show the partially loaded image the text "The image [image location here] cannot be displayed because it contains errors." replaces it. Is that related to this issue, or a different bug? (I decided to mention it here because the patch sounds somewhat related)

Michal Novotny [:michal]

Assignee

Comment 52

•

14 years ago

This is another bug. I saw it some time ago too and IIRC this happens only in case of gzipped encoding and (or?) chunked transfer.

Bjarne (:bjarne)

Comment 53

•

14 years ago

Comment on attachment 503029 [details] [diff] [review]
fix

Could you clarify this patch vs comment #49 before we continue the process?

Michal Novotny [:michal]

Assignee

Comment 54

•

14 years ago

(In reply to comment #53)
> Comment on attachment 503029 [details] [diff] [review]
> fix
> 
> Could you clarify this patch vs comment #49 before we continue the process?

That comment is true. The patch re-lands fallocating from bug 592422 and fixes the problem.

Johnny Stenback (:jst)

Comment 55

•

13 years ago

Comment on attachment 503029 [details] [diff] [review]
fix

Please re-request approval once this is reviewed.

Attachment #503029 - Flags: approval2.0?

Harsh86

Comment 56

•

13 years ago

The fix in this bug resolves Bug 592422 which had been marked "checkin-needed".
Bjarne: your review is blocking

Bjarne (:bjarne)

Comment 57

•

13 years ago

Hmm...  wonder why I didn't end up on CClist after adding comment #53...

Michal: If you want a quick review please ask Jason since he is familiar with the issue and bug #592422. I will need to start pretty much from the beginning in order to review this. It may also be useful to consolidate states for this bug and bug #592422 since it seems to me like only one of them should be open?

Jason Duell

Comment 58

•

13 years ago

Comment on attachment 503029 [details] [diff] [review]
fix

Looks good.  I'm going to push through try and if it's green, land this.

Note this the patch re-enables the patch for 644431, so we can close that when this is resolved.

Attachment #503029 - Flags: review?(bjarne) → review+

Jason Duell

Updated

•

13 years ago

Summary: Images are randomly corrupted → Images are randomly corrupted when fallocate used on block files

Jason Duell

Comment 59

•

13 years ago

Whoops!  I mean to say in comment 58 that this also fixes bug 592422, not 644431.

Lozzy

Comment 60

•

13 years ago

Just curious; will this fix cause a slight improvement to caching? When I noticed the bug, it appeared that Firefox was trying to use the partially downloaded image from cache and resume downloading the rest of the image as normal. Perhaps this was even the same behaviour before the bug, I just never noticed it.

Jason Duell

Comment 61

•

13 years ago

Landed on cedar:

   http://hg.mozilla.org/projects/cedar/rev/473ec8c05215

Lozzy:  The idea is to improve cache IO performance, yes.  Our imagelib code will use a cache copy of an image, and then replace it with an updated image from the net.  Some consider that a bug, some a feature...

Benjamin Stover (:stechz)

Comment 62

•

13 years ago

Pushed to m-c: http://hg.mozilla.org/mozilla-central/rev/473ec8c05215

Status: REOPENED → RESOLVED

Closed: 14 years ago → 13 years ago

Resolution: --- → FIXED

:Ms2ger (he/him; ⌚ UTC+1/+2)

Comment 63

•

13 years ago

Does that mean bug 592422 is fixed as well and that the whiteboard in this bug should be cleared?

Target Milestone: --- → mozilla2.2

Jason Duell

Updated

•

13 years ago

Whiteboard: [currently fixed by backout of bug 592422]

Target Milestone: mozilla2.2 → ---

:Ms2ger (he/him; ⌚ UTC+1/+2)

Updated

•

13 years ago

Target Milestone: --- → mozilla2.2

David Teller [:Yoric] - still alive but not very active

Comment 64

•

12 years ago

Do we have tests in the testsuite that can detect regressions on this issue? This would be useful for bug 816642.

screenshot of problem - the bottom of the photo always is a shade of green 14 years ago Brian Polidoro 647.48 KB, image/png		Details
reproduceable 14 years ago Exactly {Clear} 355.87 KB, image/jpeg		Details
Image showing the problem 14 years ago Paul [pwd] 164.71 KB, image/png		Details
fix 14 years ago Michal Novotny [:michal] 2.01 KB, patch	jduell.mcbugs : review+	Details \| Diff \| Splinter Review