Closed
Bug 617123
Opened 14 years ago
Closed 14 years ago
Images are randomly corrupted when fallocate used on block files
Categories
(Core :: Networking: Cache, defect)
Tracking
()
RESOLVED
FIXED
mozilla5
People
(Reporter: polidobj, Assigned: michal)
References
Details
(Keywords: regression)
Attachments
(4 files)
Randomly I run into images being corrupted. I have a fairly reproducible set of steps. 1. Start here: http://www.flickr.com/photos/15530177@N05/5210167211/in/photostream/ 2. keep going through the photo stream. I just kept going to the next photo in the stream. Eventually I get a corrupt image shown. Reloading does not fix the problem. But doing a Ctrl-F5 to reload the page corrects the problem. The file in the cache seems to be ok. I looked in about:cache and found the entry for the photo. I copied the file from the cache directory using windows explorer and renamed it as a jpg. Viewing that showed a photo with no problem. Here's the nightly regression range that I found: 11-24 - ok - http://hg.mozilla.org/mozilla-central/rev/7f5cd850578e 11-25 - broken - http://hg.mozilla.org/mozilla-central/rev/32cd5d95a95f http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=7f5cd850578e&tochange=32cd5d95a95f Since the cache seems to be involved bug 592422 seems like a likely culprit.
Comment 3•14 years ago
|
||
Confirmed on Mozilla/5.0 (Windows NT 5.1; rv:2.0b8pre) Gecko/20101202 Firefox/4.0b8pre ID:20101202030316 Regression window in local build: Build from 59144b59cd11 : fails Build from 2c8654b57cca : not corruption So this is regression of changeset 59144b59cd11 bug 592422
Comment 4•14 years ago
|
||
It's happening with PNGs too. Very hard to reproduce, kinda random, but the STR always brings the problem.
Comment 6•14 years ago
|
||
I've been experiencing this quite a lot. Also a quick search to find this bug shows that quite a few bugs could be duped to this.
I see it to on the linked flickr page, it seems random but usually happens within 10 or so pictures viewed. Reloading does not fix it but shift+reload (reload cache) does. Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b8pre) Gecko/20101206 Firefox/4.0b8pre
Comment 8•14 years ago
|
||
What platforms are you all using? We've got two reports from Windows so far. I'm wondering if this is Windows-specific, or specific to FAT32 vs NFTS. > So this is regression of changeset 59144b59cd11 bug 592422 Very strange. It's hard to imagine how fallocate-ing the file length could be causing such problems. Unless we've still got the Windows fallocate emulation code wrong. I just pored over it and don't see anything obvious. If we can't reproduce and fix, I guess we'll have to back out bug 592422 (again), at least for the platforms where we're seeing this.
Comment 9•14 years ago
|
||
If anyone is seeing this reliably, and can run a debugger with a breakpoint set at mozilla::fallocate() to see if the file pointer is not getting restored to the correct position (ie. PR_Seek64(aFD, oldpos, PR_SEEK_SET) doesn't get reached for the windows case), that would be most useful.
Comment 11•14 years ago
|
||
I've been seeing it on my laptop (Windows Vista x64, D3D9) and my PC (Windows 7 x64, D2D/D3D10). I haven't seen it on today's nightly however, even though it happened to me a lot yesterday. Might just be an unfortunate coincidence though ...
Comment 12•14 years ago
|
||
(In reply to comment #11) > I've been seeing it on my laptop (Windows Vista x64, D3D9) and my PC (Windows 7 > x64, D2D/D3D10). I haven't seen it on today's nightly however, even though it > happened to me a lot yesterday. Might just be an unfortunate coincidence though > ... Coincidence.. Yesterday I was able to reproduce it after only like 6-8 pictures, today it took 50+ samples, then only like 6 samples after that one.
Comment 13•14 years ago
|
||
(In reply to comment #8) > What platforms are you all using? We've got two reports from Windows so far. > I'm wondering if this is Windows-specific, or specific to FAT32 vs NFTS. I'm using build Mozilla/5.0 (Windows NT 6.1; rv:2.0b8pre) Gecko/20101208 Firefox/4.0b8pre ID:20101208030336
Assignee | ||
Comment 14•14 years ago
|
||
(In reply to comment #8) > What platforms are you all using? We've got two reports from Windows so far. > I'm wondering if this is Windows-specific, or specific to FAT32 vs NFTS. I was able to reproduce it on Fedora 14. > Very strange. It's hard to imagine how fallocate-ing the file length could be > causing such problems. Unless we've still got the Windows fallocate emulation > code wrong. I just pored over it and don't see anything obvious. I've checked the cache entry and it is correct. The image is shown correctly after FF restart without re-fetching it again from the server. So the problem is in the image cache, not necko cache.
Comment 15•14 years ago
|
||
Is this reproducible elsewhere or is it very specific to the way flickr embeds images? I've browsed through lots of sites that are image based and couldn't reproduce it elsewhere. In any case I think this should probably be a blocker, possibly b8.
Comment 16•14 years ago
|
||
(In reply to comment #15) > Is this reproducible elsewhere or is it very specific to the way flickr... It's not flickr-only. Flickr is just convenient.
Comment 17•14 years ago
|
||
I saw this a lot on www.mangafox.com for various manga. The corruption appears in different forms - sometimes only part of the image loads, other times images appear grainy (corrupt pixels in a grid-like pattern), and still other times images would fully load but change colour partway through and either stay reasonably normal or become completely corrupted. I haven't seen it since yesterday though, for whatever reason.
Comment 18•14 years ago
|
||
(In reply to comment #15) > Is this reproducible elsewhere or is it very specific to the way flickr embeds > images? I've browsed through lots of sites that are image based and couldn't > reproduce it elsewhere. > > In any case I think this should probably be a blocker, possibly b8. I've seen it elsewhere. I'm actually seeing this a lot, but I peruse images online a lot.
Comment 19•14 years ago
|
||
Blocking on this recent regression. Michal says he can look into this one.
Assignee: nobody → michal.novotny
blocking2.0: ? → betaN+
Comment 21•14 years ago
|
||
For those who can't reproduce: the STR on bug 618014 (combined with an artificially-slowed-down internet connection with a tool like "wondershaper" on linux) seem to be pretty reliable. (Basically: load a page with a large image, hit 'stop' while the image is partially loaded, and reload)
Comment 22•14 years ago
|
||
If a fix can't be made in time shouldn't the regression bug be backed out before Beta8?
Comment 23•14 years ago
|
||
> shouldn't the regression bug be backed out before Beta8? Yes. I've removed the cache's calls to fallocate for now: http://hg.mozilla.org/mozilla-central/rev/8d59a3285d92
Comment 24•14 years ago
|
||
Please let me know if this actually appears to fix things: it's still mysterious to me as to how fallocate is breaking these image loads. (I didn't back out mozilla:fallocate itself, but if just having the function around w/o getting called is breaking stuff, then we have Big Problems :)
Comment 25•14 years ago
|
||
Things appear to have been fixed. But you know, it could be due to the networking problems as that patch was backed out too. I'd suggest pushing the fallocate to m-c again for a day to test.
Comment 27•14 years ago
|
||
(In reply to comment #23) > > shouldn't the regression bug be backed out before Beta8? > > Yes. I've removed the cache's calls to fallocate for now: > > http://hg.mozilla.org/mozilla-central/rev/8d59a3285d92 (In reply to comment #23) > > shouldn't the regression bug be backed out before Beta8? > > Yes. I've removed the cache's calls to fallocate for now: > > http://hg.mozilla.org/mozilla-central/rev/8d59a3285d92 I am on current hourly and seen the bug on the STR after the 3rd image. Was corrupted with half of the image being green. only fixed by a shift-reload (cache reload).
Comment 28•14 years ago
|
||
Scott: Can you test and see if the regression range in comment 3 is correct? Testing with a nightly from Nov 23rd ought to do the trick.
Comment 29•14 years ago
|
||
Scott: Oh, stupid me. Please also try clearing your cache. I'm going to push a cache version++ in case that's the issue.
Comment 30•14 years ago
|
||
I cleared the cache and got the latest hourly, went to the testcase and clicked through over 100 images and haven't seen the issue. Maybe it needed a cache flush first or it wasn't in my previous hourly changeset. Haven't gotten to testing pre Nov 23rd builds.
Comment 31•14 years ago
|
||
(In reply to comment #26) > Paul, > > Which networking patch backout are you referring to? bug 614677
Comment 32•14 years ago
|
||
Is anyone still seeing this bug in nightlies? I'm hoping the backout of bug 592422 (comment 23) and/or bug 614677 has fixed this for now, and we revisit after 2.0 ships.
This appears to have been fixed by one of the things backed out.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 36•14 years ago
|
||
Still seems to be fixed running today's nightly (20101217) which includes the relanding(?) of the HTTP acceleration bugs.
Comment 37•14 years ago
|
||
(In reply to comment #36) > relanding(?) of the HTTP acceleration bugs. fixes! (not bugs :))
Comment 41•14 years ago
|
||
Well then this is reopened! Patrick, I suppose the main culprit is the HTTP acceleration patches, since fallocate hasn't re-landed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 42•14 years ago
|
||
Forgive me if I'm wrong, but I'm not sure that what Csaba is seeing is the same as what this bug pertains to. Since last week I haven't noticed any image corruption artifacts in the nightlies.
Comment 43•14 years ago
|
||
Maybe the link is confusing. It's not a screenshot of the corruption, it's just the image which was corrupted, and it's displayed well for you. Now it's okay for me to, because I reloaded the image, and the corruption went away. When it was corrupted, it looked like attachments.
Comment 44•14 years ago
|
||
(In reply to comment #41) > Well then this is reopened! Patrick, I suppose the main culprit is the HTTP > acceleration patches, since fallocate hasn't re-landed. fwiw - I'm not able to see a problem - even with the shaper active. csaba, is it reproducible for you - and if so, does setting the pref network.http.connection-retry-timeout to 0 fix the problem? Theories are welcome.
Comment 45•14 years ago
|
||
I can't reproduce, but it still happens randomly for me. It didn't happen since that image.
Comment 46•14 years ago
|
||
I've run a bunch of tests this morning, and I definitely think we need to stare at fallocate some more. I tested using http://www.flickr.com/photos/15530177@N05/5210167211/in/photostream/ and clicked the "older" link up to 25 times. I tried each build a few times and for each iteration on the same build I reached a consistent "reproduced or can't reproduce" within those 25 clicks. For whatever reason when it did reproduce the fifth click was very often (though not always) the first place I saw it. Every test was with a clean profile on linux 64 using ext4. 1] 59033:4182651505fb - this was back during beta 7 with both the fallocate and syn retry patches included and I built this basically as a control. The problem easily reproduced. 2] 59034:0a9e64523c06 - the only change here was to disable the syn retry code via pref. The problem easily reproduced with syn retry disabled. 3] 59076:ed44f2f4c7cb - wrt fallocate and syn retry this is the same (no syn retry, yes fallocate) as #2. The problem did still reproduce. 4] 59077:8d59a3285d92 - this is the patch that removes the fallocate calls. The problem was not reproducible. 5] 59552:49a1b2aa43c5 - this was the tip this morning. Syn retry has been reneabled there. The problem was not reproducible, I'm not sure how to explain comment 38 other than to wonder about some kind of artifact in the profile. I tested the URL cited there on builds 3, 4, and 5. I was able to reproduce the problem only on build 3. I did it there by interrupting the loading of the image and then pressing ^R. About 1/2 the time on build three that would put a green line in the image, where I tried that a dozen times on builds 4 and 5 and never got a corruption. The fact that there is just 1 change between builds 3 and 4 and such a dramatic change in behavior is pretty convincing to me.
Comment 47•14 years ago
|
||
(In reply to comment #45) > I can't reproduce, but it still happens randomly for me. It didn't happen since > that image. WonderCsabo: Based on other comments, I wonder if there might be some bad stuff cached by the older nightly builds that had the problem. Have you cleared your cache since the other changes were backed out (around 10th December)?
Updated•14 years ago
|
Whiteboard: hardblocker
Updated•14 years ago
|
Whiteboard: hardblocker → [hardblocker]
Comment 49•14 years ago
|
||
Michal tells me that this is no longer an issue that we need to worry about for Firefox 4 since the fallocate changes in bug 592422 seemed to have caused this, and those changes were backed out for now. Leaving the bug open to track this after we re-land bug 592422, but removing the blocker flags etc.
blocking2.0: betaN+ → ---
Whiteboard: [hardblocker]
Updated•14 years ago
|
Whiteboard: [currently fixed by backout of bug 592422]
Assignee | ||
Comment 50•14 years ago
|
||
Flickr preloads images that are on the next page. If the next page is opened before the preload is finished then the partial data from the cache is used. Reading goes beyond cached data because nsDiskCacheInputStream::Read() doesn't check the end of the stream when reading from the separate file.
Attachment #503029 -
Flags: review?(bjarne)
Attachment #503029 -
Flags: approval2.0?
Comment 51•14 years ago
|
||
If you stop an image load midway, instead of continuing to show the partially loaded image the text "The image [image location here] cannot be displayed because it contains errors." replaces it. Is that related to this issue, or a different bug? (I decided to mention it here because the patch sounds somewhat related)
Assignee | ||
Comment 52•14 years ago
|
||
This is another bug. I saw it some time ago too and IIRC this happens only in case of gzipped encoding and (or?) chunked transfer.
Comment 53•14 years ago
|
||
Comment on attachment 503029 [details] [diff] [review] fix Could you clarify this patch vs comment #49 before we continue the process?
Assignee | ||
Comment 54•14 years ago
|
||
(In reply to comment #53) > Comment on attachment 503029 [details] [diff] [review] > fix > > Could you clarify this patch vs comment #49 before we continue the process? That comment is true. The patch re-lands fallocating from bug 592422 and fixes the problem.
Comment 55•14 years ago
|
||
Comment on attachment 503029 [details] [diff] [review] fix Please re-request approval once this is reviewed.
Attachment #503029 -
Flags: approval2.0?
Comment 56•14 years ago
|
||
The fix in this bug resolves Bug 592422 which had been marked "checkin-needed". Bjarne: your review is blocking
Comment 57•14 years ago
|
||
Hmm... wonder why I didn't end up on CClist after adding comment #53... Michal: If you want a quick review please ask Jason since he is familiar with the issue and bug #592422. I will need to start pretty much from the beginning in order to review this. It may also be useful to consolidate states for this bug and bug #592422 since it seems to me like only one of them should be open?
Comment 58•14 years ago
|
||
Comment on attachment 503029 [details] [diff] [review] fix Looks good. I'm going to push through try and if it's green, land this. Note this the patch re-enables the patch for 644431, so we can close that when this is resolved.
Attachment #503029 -
Flags: review?(bjarne) → review+
Updated•14 years ago
|
Summary: Images are randomly corrupted → Images are randomly corrupted when fallocate used on block files
Comment 59•14 years ago
|
||
Whoops! I mean to say in comment 58 that this also fixes bug 592422, not 644431.
Comment 60•14 years ago
|
||
Just curious; will this fix cause a slight improvement to caching? When I noticed the bug, it appeared that Firefox was trying to use the partially downloaded image from cache and resume downloading the rest of the image as normal. Perhaps this was even the same behaviour before the bug, I just never noticed it.
Comment 61•14 years ago
|
||
Landed on cedar: http://hg.mozilla.org/projects/cedar/rev/473ec8c05215 Lozzy: The idea is to improve cache IO performance, yes. Our imagelib code will use a cache copy of an image, and then replace it with an updated image from the net. Some consider that a bug, some a feature...
Comment 62•14 years ago
|
||
Pushed to m-c: http://hg.mozilla.org/mozilla-central/rev/473ec8c05215
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Comment 63•14 years ago
|
||
Does that mean bug 592422 is fixed as well and that the whiteboard in this bug should be cleared?
Target Milestone: --- → mozilla2.2
Updated•14 years ago
|
Whiteboard: [currently fixed by backout of bug 592422]
Target Milestone: mozilla2.2 → ---
Updated•14 years ago
|
Target Milestone: --- → mozilla2.2
Do we have tests in the testsuite that can detect regressions on this issue? This would be useful for bug 816642.
You need to log in
before you can comment on or make changes to this bug.
Description
•