Created attachment 8832839 [details]
gfx-corruption.PNG

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0
Build ID: 20170202004013

Steps to reproduce:

1. Use a Windows 10 machine with an Intel HD 4600 graphics adapter; the screen is in portrait mode at 1440x2560.
2. Open a long page, e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1067470
3. Scroll down "quite a bit".

Actual results:

Eventually a horizontal band of display corruption appears. This band is usually hundreds of pixels high; background colors still appear, but font glyphs and images do not (see screenshot). Sometimes scrolling a small amount causes the corrupted area to turn completely black (or, if already black, background-only) again. Continuous scrolling causes the region to flicker.

Expected results:

The region should render as usual (i.e. as in Chrome/Edge/Firefox without hardware acceleration).
This bug has occurred for at least several months, on various stable, dev edition and nightly builds.
Corruption appears to start around pixel-row 8192.
If you test with some old versions of Firefox, does it happen too? http://ftp.mozilla.org/pub/firefox/releases/ (try FF48, 45, 40 e.g.)
Since there are so many copies, I used the following command to start a new instance without disturbing my normal profile or needing full Windows profiles (I doubt that affects anything, but just to be sure):

firefox -profile EmnTmpProfile -no-remote -new-instance

I tried various versions (ESR versions first, and 32-bit unless otherwise mentioned):

Bug-free: 10.0.0, 31.0.0, 38.8.0, 45.7.0 (32+64-bit), 48.0.2 (32+64-bit)
Buggy: 48.0.2 (32-bit), 51.0.1 (32+64-bit)

It's something of a hassle because if you're not careful, the old versions will in-place auto-update and then you'll be testing something other than what you downloaded. However, I'm pretty sure I got these tests without browser restarts (and I checked about:support to double-check the versions too, which is how I noticed that auto-update had kicked in sometimes).
Amusing trip down memory lane, all those old skins ;-).
OS: Unspecified → Windows
Priority: -- → P3
> Bugfree: 10.0.0, 31.0.0, 38.8.0, 45.7.0 (32+64 bit), 48.0.2 (32+64-bit)
> Buggy: 48.0.2 (32-bit), 51.0.1 (32+64bit)

I just noticed I listed 48 twice, as both buggy and non-buggy; this was one of the versions where I accidentally let the auto-update happen, so I'm guessing the "buggy v48" was actually auto-updated to v51, but I'll recheck next time I'm at that machine. Sorry for the confusion!
You can disable the update system in the options of your testing profile; I do that to keep 10 old versions installed on my computer for testing, and they never update.
platform-rel: --- → ?
Whiteboard: [gfx-noted] → [gfx-noted][platform-rel-Intel]
Using the mozregression GUI tool, specifying 45 as a good release and the latest as a bad one, would let us narrow down this regression a lot better. Any chance you could try that? It doesn't change the version(s) of Firefox you are using; it just downloads and runs different ones to narrow things down.
mozregression looks like a huge timesaver - thanks for the pointer! I'll have access to the machine again on tuesday; I'll be sure to run it then!
I ran the 32-bit bisection a couple of times to be sure (if you scroll too quickly, the bug does not appear, so it took a few tries to nail down). But at this point, I'm sure this is the first problematic build:

app_name: firefox
build_date: 2015-07-25
build_file: C:\Users\nerbonne\.mozilla\mozregression\persist\2015-07-25--mozilla-central--firefox-42.0a1.en-US.win32.zip
build_type: nightly
build_url: https://archive.mozilla.org/pub/firefox/nightly/2015/07/2015-07-25-03-02-09-mozilla-central/firefox-42.0a1.en-US.win32.zip
changeset: d3228c82badd
pushlog_url: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b0b3dcfa5557&tochange=d3228c82badd
repo_name: mozilla-central
repo_url: https://hg.mozilla.org/mozilla-central

mozregression then prints a bunch of debug messages looking for builds between b0b3dcfa5557..d3228c82badd, but appears unable to find any builds in that range.

I also checked about:support in the bad 2015-07-25 build and the last working 2015-07-24 build; they're identical except for some version info (unsurprising), some temp-dir info, and one more interesting difference: the bug-free 2015-07-24 reports "Asynchronous Pan/Zoom: none", whereas the buggy 2015-07-25 reports "Asynchronous Pan/Zoom: wheel input enabled".

I just checked with my primary profile (currently at 54.0a2 (2017-04-18) (64-bit)): by default it exhibits this bug too, but turning off hardware acceleration prevents the bug, as does leaving hardware acceleration on but setting layers.async-pan-zoom.enabled to false.
In the range b0b3dcfa5557..d3228c82badd; this merged: https://bugzilla.mozilla.org/show_bug.cgi?id=1157745
Kats, this sounds vaguely familiar, but I thought only on OS X and with much larger pages. Do you recall? Either way, any more information we need, can we tell what's going on?
I don't recall seeing a bug quite like this one. However, it's interesting that the corruption starts at y=8192. From the screenshot it looks like the corruption band is exactly 768 pixels tall. I'm guessing that for some reason we're going down into the codepath at http://searchfox.org/mozilla-central/rev/66d9eb3103e429127c85a7921e16c5a02458a127/layout/base/nsLayoutUtils.cpp#1102 and creating a displayport that's 8192 pixels tall. But I'm not sure why the displayport isn't shifting down as you scroll, or why the corruption area is where it is. I can put together some builds with extra logging to try to get more information on what's going on here. Leaving needinfo on me for now.
Actually, before I do a custom build, it would be useful to get a layers dump of this scenario. Do you happen to have cygwin installed? I find that cygwin shells can capture the stdout/stderr output from Firefox, whereas other shells such as the standard Windows command prompt, PowerShell, and msys do not.

If you do have cygwin, please open a cygwin command shell and run firefox.exe from there. If you have it set up so that the profile manager window pops up first, please pass the -P <profile_name> option to firefox.exe to bypass the profile manager and start the profile directly; otherwise Firefox restarts when you select a profile, and the restarted process doesn't send its output to the shell anymore.

Once you have it running, go to about:config, set layers.dump to true, and reproduce the problem. While reproducing it, grab the last few screenfuls of output from the command shell window (around 100 lines of output should do it) and attach it to this bug.

If you don't have cygwin and are unwilling to install it, let me know and I can make a build that logs to a file or somewhere else that's easier to access. Thanks!
Flags: needinfo?(bugmail) → needinfo?(eamon)
Created attachment 8872600 [details]
layers.dump.txt.xz

I set layers.dump to true and used 54.0a2 (2017-04-18) (64-bit) to open one page and autoscroll past the problematic 8192-pixel boundary. The corruption appeared when I did so (although the flashing seemed a lot slower than usual, perhaps due to the additional logging?). Since the file compresses down to pretty much nothing, I just attached the whole thing, so I don't accidentally leave out anything relevant.

Aside: as you said, the output streams didn't work in plain cmd.exe, and they also didn't work in the (newish) bash in Linux on Windows, but they did appear using git bash (which is mingw bash).
Thanks! This is very helpful - it shows that the displayport is 8960 pixels tall, which in a way explains some of the behaviour. For whatever reason we're making the displayport 8960 pixels tall, but presumably the graphics card only uses 8192 pixels of that, and so the remaining 768 end up garbage. Some open questions:

- Are we computing the 8960 correctly? In theory we should not be generating a displayport larger than the max allowed texture size on the graphics card. If the max allowed texture size is being misreported, or there is a bug in our calculations here, we might end up with a too-large displayport.
- Is there an 8192 limit elsewhere in the code? Maybe we only upload textures up to 8192 pixels somewhere else in the code, so even though the displayport is correct and the graphics card supports it, we're ending up with garbage in the bottom 768 pixels.
- Why is it possible to scroll into this area of garbage? As the scroll position moves down, the displayport should as well, so unless you're at the bottom of the page we shouldn't really be seeing the garbage.

I'll put together a build with more logging for you to run that should help answer some of these questions.
It'd be interesting to see if the problem goes away with the about:config value of gfx.max-texture-size set to 8192 (and a restart.)
Sorry for the delay; the build with logging is at https://firstname.lastname@example.org/try-win64/ - please run it, reproduce the problem, and collect the output as before. It would be good to do it with layers.dump set to true, as before. I didn't add much logging because, honestly, I wasn't sure what the good places to log would be, but hopefully this will narrow it down a little. And yes, trying the suggestion in comment 18 would also be useful, to see if it fixes the problem.
Created attachment 8873386 [details] firefox-stderr-stdout-layers-dump-and-more.7z I attached the dump of the logging output of the extra build you provided (with layers.dump true).
Created attachment 8873389 [details]
with-max-texture-size-8192.PNG

Separately, I also tried playing with gfx.max-texture-size; I tried 4096, 8192, and 16384. 16384 superficially appears to have no effect. 8192 and 4096 break differently: it appears the content sometimes wraps. This is much harder to notice, however, since most of the screen looks reasonable; it's just that there are discontinuities. I attached a screenshot of a fairly obvious example. But I also encountered situations where the scrollbar was at the top of its extent and the entire screen looked reasonable, and yet it was obviously not the top of the page.
So the logging indicates the driver/LayerManager is reporting a max texture size of 16384, and the Gecko side is capping the max texture at 32767 in the absence of a pref override. So as far as we can tell, displayport sizes anywhere up to 16384 should work fine. The layout code therefore picks a displayport size of 8960, which should work. However, when we go to actually upload the texture, something limits it to 8192 pixels and fills the rest with garbage. I can't find any place in the Gecko code that imposes this limit, so I suspect it is in the graphics driver.

Also supporting this theory are your results with different gfx.max-texture-size values. When this pref is changed, the displayport size doesn't change (it doesn't take into account the value of this pref, although arguably it should... I'll file something for that). So let's say you've changed gfx.max-texture-size to 8192: in this case the displayport is still going to be 8960 pixels tall, but this time it's the Gecko code that limits the size of the texture upload. The remaining space is filled with whatever was drawn before, which is why you see discontinuities, but of "reasonable looking" screen contents. In this scenario, at no point do we actually upload garbage from Gecko. The discontinuities are a result of the mismatch between the displayport's and the compositor's notions of the max texture size, which is something we can fix.
Actually, I found another place where we have an 8192 limit. I'll make another build with both things fixed and let's see if that helps.
Can you please try the build at https://email@example.com/try-win64/ ? No need to collect logging for this one; just observe the behaviour in the default configuration and with gfx.max-texture-size set to 4096 or 8192. With the modified gfx.max-texture-size, at least, I would expect the behaviour to be normal (no more discontinuities), but in the default configuration it may or may not be fixed.
Sorry for the delay, I'm on vacation; I'll update once I have access to the problem machine again, which should be at the latest the first week of July.
Back from vacation! The link you provided (https://firstname.lastname@example.org/try-win64/) returns a 404 - perhaps it's expired? Can you re-upload the build?
The link should work now, I triggered a rebuild of the expired one.
Flags: needinfo?(bugmail) → needinfo?(eamon)
I ran the build you provided multiple times with various values for gfx.max-texture-size. When gfx.max-texture-size is unset, the corruption occurs as usual. When gfx.max-texture-size is set to 4096 or 8192, the issue does not occur.
Thanks! I filed bug 1378355 for updating GetMaxDisplayPortSize. However, since the corruption still occurred for you with gfx.max-texture-size unset, the other "8192" that I changed had no effect. I suspect a driver limitation here.
Bug 1378355 has now landed, so if you go to about:config on today's nightly and set gfx.max-texture-size to 8192, you shouldn't see the issue any more. This lets you work around what I believe is a driver fault.
Works for me (with gfx.max-texture-size=8192). I wouldn't know how to test for this particular driver bug, but it certainly sounds plausible that it's a driver issue. It's a little worrying to see this in what's probably one of the most common graphics drivers around (Intel HD Graphics), but then again, the screen layout is unusual (portrait mode, 2560x1440). Thanks for the fix!
(Incidentally, I just tried reinstalling the latest driver from windows update; no change).
I'm tempted to limit Intel HD to an 8K maximum texture size. It shouldn't be much of a limiting factor, and we have proof by example that some devices in that range do not like larger textures.
This is a Haswell Gen7.5, which is pretty common, so we might want to investigate more.