Closed Bug 1396984 Opened 2 years ago Closed 2 years ago

Scrollbar becomes black on first connection of second screen

Categories

(Core :: DOM: Content Processes, defect, P3)

x86
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- unaffected
firefox56 --- unaffected
firefox57 + fixed
firefox58 --- unaffected
firefox59 --- unaffected
firefox60 --- fixed

People

(Reporter: agashlin, Assigned: bobowen)

References

Details

(Keywords: regression, regressionwindow-wanted, Whiteboard: [gfx-noted])

Attachments

(5 files, 2 obsolete files)

Steps to reproduce:

1. On a fresh Windows "session" (either straight from initial login or after doing Switch user), start Firefox
2. Navigate to a page with a scroll bar
3. Attach an external monitor

Actual result:

Scroll bar becomes black, sometimes only the handle part indicating the current view, sometimes only with mouseover, sometimes it is the whole bar including arrows.

Expected result:

Scroll bar remains two tones of grey.

The black scrollbar does not return when Firefox is restarted, and subsequent reconnections of the monitor on the same Windows "session"(?) don't fail in this way. Sometimes it is only every 4th opened tab that has the issue, or 3 of every 4.

Setup info:
Windows 10 on MacBook Pro (Retina, 13-inch, Early 2015)
Intel(R) Iris(TM) Graphics 6100
I've attached my about:support report.

This may be a Windows-on-Mac graphics driver issue, but the bar sticking black is a regression. Before the regression sometimes when the monitor was connected the bar would sometimes flash black but then return to normal.

mozregression narrowed it down to: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=24d5dbf3a9dcadf708aa6874d9cfbe911af3d3d7&tochange=01fd73e4b8b89080505bf9f004c82c3e1043f40e
so I blocked on bug 1384336
Component: DOM: Content Processes → Graphics
Adam, can you get us the regression range (e.g. mozregression result.)
Priority: -- → P3
Whiteboard: [gfx-noted]
11:40.00 INFO: No more inbound revisions, bisection finished.
11:40.00 INFO: Last good revision: 24d5dbf3a9dcadf708aa6874d9cfbe911af3d3d7
11:40.00 INFO: First bad revision: 01fd73e4b8b89080505bf9f004c82c3e1043f40e
11:40.00 INFO: Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=24d5dbf3a9dcadf708aa6874d9cfbe911af3d3d7&tochange=01fd73e4b8b89080505bf9f004c82c3e1043f40e

I noticed that this also causes form control widgets to disappear (just saw in Bugzilla now): checkboxes, scrollbars for multiple select boxes, dropdown widgets for select (but not the text).

It happens on a per-process basis, seemingly, tabs in the same process (going by the number in the tooltip) exhibit the issue but others don't.
I can reproduce this with hardware acceleration disabled, (options -> uncheck "Use recommended performance settings" -> uncheck "Use hardware acceleration when available"), about:support attached.
Nice.  That sounds very much like bug 1386598.  I'll verify your regression range with my STR (high contrast white mode triggers it.)
See Also: → 1386598
I can confirm the same regression range.  Will make bug 1386598 a duplicate of this one as well.
Component: Graphics → DOM: Content Processes
[Tracking Requested - why for this release]:
The two known STRs (plug external monitor or trigger high contrast white mode) may not be common, but aoverholt and I ran into this without doing either (see bug 1386598), so there are undiscovered STR that are likely more common.
Flags: needinfo?(wmccloskey)
I have an easier way to reproduce this now, and it works on a different machine as well (Quantum reference hardware, Acer Aspire E 15):

1. Switch user
2. Start Firefox
3. Change scaling 150% (or just different from what it was before, this Windows setting is called "Change the size of text, apps, and other items", if you search for "dpi" it will show up)

This seems to work (black scroll bar, invisible widgets) 100% of the time on current Nightly, but I've not been able to find a stable regression range for it, it seems like maybe it only became more likely recently?
One more about:support, I was able to reproduce without e10s.

I've also been able to reproduce, with Nightly via mozregression, all the way back to 2015-07-01, but I haven't found a solid lead on a regression range yet.
(In reply to Adam Gashlin [:agashlin] from comment #9)
> I've also been able to reproduce, with Nightly via mozregression, all the
> way back to 2015-07-01, but I haven't found a solid lead on a regression
> range yet.

Huh. FWIW, I use an external monitor multiple times every day and only very recently started experiencing this (and not every time I plug/unplug my monitor).
(In reply to Andrew Overholt [:overholt] from comment #10)
> Huh. FWIW, I use an external monitor multiple times every day and only very
> recently started experiencing this (and not every time I plug/unplug my
> monitor).

I should have been clearer in comment 9, I meant I was able to reproduce using the change of scale mentioned in comment 8 (most readily reproduced switching while still loading the initial page), the symptoms are the same but the screen plugging trigger might be recent.
Tracking as regression in 57.
I believe am experiencing this bug as well, though in a slightly different way. I often experience it when RDP-ing into a machine with a running Firefox session or after I have ended the RDP connection and am back at the machine (in person).

When I experience it, the scroll bar consistently becomes solid black. It seems to behave normally, but is very hard to use since I cannot see where any of the discrete elements of the bar are.

I additionally find that <input> elements do not seem to be rendered correctly; they seems to be invisible. With text inputs, I can type into them and the text will appear, but the input will continue to have no border or color. With checkbox inputs, I cannot tell whether they are checked at all (or maybe I simply cannot set them to checked - it is hard to tell). The chrome inputs (location bar and search box) seem to be unaffected.
Attached image Screen shot of issue
This screen shot shows the black scroll bar and the invisible input elements.
This just happened to me again, presumably after disconnecting my external monitor.

Markus, you reviewed billm's patch in bug 1384336 which is one of two bugs that mozregression points to here. Do you have any thoughts? Can we back that out?
Aaron, you reviewed Bill's *other* patch (bug 1384336 as well) that mozregression points to here. Same questions as to Markus :)
Flags: needinfo?(mstange)
Flags: needinfo?(aklotz)
The patch is already backed out.
What Bill said.
Flags: needinfo?(aklotz)
Flags: needinfo?(mstange)
Bug 1384336 comment 32 indicates bug 1384336 was disabled in 57 in bug 1397956 but it seems like the patch in that bug (bug 1397956) was backed out.
Oops, you're right. I thought I re-pushed a patch for bug 1397956 on Monday, but it looks like it got rebased out of existence.
20170915100121 works for me (as in, the high contrast workflow doesn't break things.)
This should be fixed by a backout in bug 1397956.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Flags: needinfo?(wmccloskey)
Blocks: 1423628
Re-opening this, so it can be addressed and we can hopefully turn off native event processing in bug 1423628.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I needed a distraction so I took a look at this.  First, for me at least, there is an easy STR:

0. Set your desktop to a low DPI (I use 125%).  (Just search windows for the 'DPI' setting as mentioned in comment 8)
1. Launch firefox to a page with a scrollbar (about:home works fine if it has a scrollbar)
2. Change the DPI to something very big (I use 225%).

Expected result:
Big everything, including scrollbar

Actual Result:
Nearly every time, the scrollbar handle will become black.  Most of the time, the whole scrollbar becomes black.

-----------------------

The issue is similar to one that I ran into a bunch of times back in the days of e10s -- the Widget hasn't yet updated the scale factor in the parent process (so it uses -1.0) when the nsNativeThemeWin fetches the theme widget dimensions.  It then scales the native widget dimensions by -1.0, causing negative dimensions and, therefore, havoc.   Sometimes with these bugs, we can reschedule the task since the scale factor will arrive soon.  Other times, we needed to complete the task with bad data, like using a scale of 1.0, _and_ reschedule the task to clean it up soon.  This would result in glitches but sometimes the alternative was visual garbage.  I don't think we have any more of those glitches tho so we shouldn't create one.

Here's a sample stack that gets the negative size.  Note that this happens in the _parent_ (I think this bug would be better classified as a widget bug):

 	xul.dll!nsNativeThemeWin::GetMinimumWidgetSize(nsPresContext * aPresContext=0x0000000000000001, nsIFrame * aFrame=0x000001f8e6eca5b8, unsigned char aWidgetType='6', mozilla::gfx::IntSizeTyped<mozilla::LayoutDevicePixel> * aResult=0x0000004452ffd0a8, bool * aIsOverridable=0x0000004452ffd0a0) Line 2308	C++
 	xul.dll!nsIFrame::AddXULMinSize(nsBoxLayoutState & aState={...}, nsIFrame * aBox=0x000001f8e6eca5b8, nsSize & aSize={...}, bool & aWidthSet=false, bool & aHeightSet=false) Line 677	C++
 	xul.dll!nsBoxFrame::GetXULMinSize(nsBoxLayoutState & aBoxLayoutState={...}) Line 848	C++
 	xul.dll!nsBoxFrame::GetXULPrefSize(nsBoxLayoutState & aBoxLayoutState={...}) Line 800	C++
 	xul.dll!nsSprocketLayout::GetXULPrefSize(nsIFrame * aBox=0x000001f8e6ec9488, nsBoxLayoutState & aState={...}) Line 1326	C++
 	xul.dll!nsBoxFrame::GetXULPrefSize(nsBoxLayoutState & aBoxLayoutState={...}) Line 789	C++
 	xul.dll!GetScrollbarMetrics(nsBoxLayoutState & aState={...}, nsIFrame * aBox=0x000001f8e6ec9488, nsSize * aMin=0x0000000000000000, nsSize * aPref=0x0000004452ffd488, bool aVertical=true) Line 293	C++
 	xul.dll!nsHTMLScrollFrame::ReflowScrolledFrame(mozilla::ScrollReflowInput * aState=0x0000004452ffd840, bool aAssumeHScroll=false, bool aAssumeVScroll=true, mozilla::ReflowOutput * aMetrics=0x0000004452ffd6f0, bool aFirstPass=true) Line 516	C++
 	xul.dll!nsHTMLScrollFrame::ReflowContents(mozilla::ScrollReflowInput * aState=0x0000004452ffd840, const mozilla::ReflowOutput & aDesiredSize) Line 698	C++
 	xul.dll!nsHTMLScrollFrame::Reflow(nsPresContext * aPresContext=0x000001f8e6d1a000, mozilla::ReflowOutput & aDesiredSize={...}, const mozilla::ReflowInput & aReflowInput={...}, nsReflowStatus & aStatus={...}) Line 1054	C++
 	xul.dll!nsContainerFrame::ReflowChild(nsIFrame * aKidFrame=0x000001f8e6ec30e8, nsPresContext * aPresContext=0x000001f8e6d1a000, mozilla::ReflowOutput & aDesiredSize={...}, const mozilla::ReflowInput & aReflowInput={...}, int aX=0, int aY=0, unsigned int aFlags=0, nsReflowStatus & aStatus={...}, nsOverflowContinuationTracker * aTracker=0x0000000000000000) Line 989	C++
 	xul.dll!mozilla::ViewportFrame::Reflow(nsPresContext * aPresContext=0x000001f8e6d1a000, mozilla::ReflowOutput & aDesiredSize={...}, const mozilla::ReflowInput & aReflowInput={...}, nsReflowStatus & aStatus={...}) Line 337	C++
 	xul.dll!mozilla::PresShell::DoReflow(nsIFrame * target=0x000001f8e6ec2918, bool aInterruptible=true) Line 8978	C++
>	xul.dll!mozilla::PresShell::ProcessReflowCommands(bool aInterruptible=true) Line 9145	C++
 	xul.dll!mozilla::PresShell::DoFlushPendingNotifications(mozilla::ChangesToFlush aFlush) Line 4265	C++
 	xul.dll!nsRefreshDriver::Tick(__int64 aNowEpoch, mozilla::TimeStamp aNowTime={...}) Line 1923	C++
 	xul.dll!mozilla::RefreshDriverTimer::TickDriver(nsRefreshDriver * driver=0x000001f8e5c0e400, __int64 jsnow=1516943467196105, mozilla::TimeStamp now={...}) Line 337	C++
 	xul.dll!mozilla::RefreshDriverTimer::TickRefreshDrivers(__int64 aJsNow=1516943467196105, mozilla::TimeStamp aNow={...}, nsTArray<RefPtr<nsRefreshDriver> > & aDrivers) Line 308	C++
 	xul.dll!mozilla::RefreshDriverTimer::Tick(__int64 jsnow=1516943467196105, mozilla::TimeStamp now={...}) Line 330	C++
 	xul.dll!mozilla::VsyncRefreshDriverTimer::RunRefreshDrivers(mozilla::TimeStamp aTimeStamp={...}) Line 770	C++
 	xul.dll!mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver(mozilla::TimeStamp aVsyncTimestamp={...}) Line 685	C++
 	xul.dll!mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::ParentProcessVsyncNotifier::Run() Line 529	C++
 	xul.dll!nsThread::ProcessNextEvent(bool aMayWait=48, bool * aResult=0x0000004452ffee80) Line 1041	C++
 	xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait=false) Line 517	C++
 	xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate * aDelegate=0x000001f8d7c86080) Line 97	C++
 	xul.dll!MessageLoop::RunHandler() Line 320	C++
 	xul.dll!MessageLoop::Run() Line 300	C++
 	xul.dll!nsBaseAppShell::Run() Line 159	C++
 	xul.dll!nsAppShell::Run() Line 346	C++
 	xul.dll!nsAppStartup::Run() Line 288	C++
 	xul.dll!XREMain::XRE_mainRun() Line 4707	C++
 	xul.dll!XREMain::XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig={...}) Line 4842	C++
 	xul.dll!XRE_main(int argc=2, char * * argv=0x000001f8d7c03060, const mozilla::BootstrapConfig & aConfig={...}) Line 4934	C++
 	firefox.exe!do_main(int argc=2, char * * argv=0x000001f8d7c03060, char * * envp=0x000001f8d79541a0) Line 232	C++
 	firefox.exe!NS_internal_main(int argc=2, char * * argv=0x000001f8d7c03060, char * * envp=0x000001f8d79541a0) Line 306	C++
 	firefox.exe!wmain(int argc, wchar_t * * argv=0xffffffffffd43070) Line 115	C++
 	firefox.exe!__scrt_common_main_seh() Line 253	C++
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

---

My sense is that we shouldn't reflow without a valid display scale.  Otherwise, I'm not sure what to do here.
(In reply to David Parks (dparks) [:handyman] from comment #23)
> I needed a distraction so I took a look at this.  First, for me at least,
> there is an easy STR:
> 
> 0. Set your desktop to a low DPI (I use 125%).  (Just search windows for the
> 'DPI' setting as mentioned in comment 8)
> 1. Launch firefox to a page with a scrollbar (about:home works fine if it
> has a scrollbar)
> 2. Change the DPI to something very big (I use 225%).

Thanks for looking at this.
This is very similar to the STR I'm using.

I should have updated the bug more ... the calls that appear to be failing are NtGdiDrawStream ones (under our call to DrawThemeBackground) in the content process.
I think this might be because the theme library in the content process is picking up the wrong DPI and then the library gets into an unrecoverable state.

Of course, it could be that this negative scale/widget size is getting down to the content process as well and that's the root cause.
So that gives me something else to look into.
Cool.  yeah, I didn't chase it too far so you probably have a better approach.  I just glanced at it because it sounded familiar (its not tho).  I'll leave you to it.
Assignee: nobody → bobowencode
Status: REOPENED → ASSIGNED
I've sort of got to the bottom of what's going on here after some kernel debugging.
Basically during the theme change the internal GDI stock bitmaps get removed and re-added to a table with new indexes (or at least parts of the index).
The content process doesn't get updated with these new indexes and so even though we're closing and reopening the theme, the internal checks that involves those parts of the index fail.

It looks like these updates normally happen as part of the internal Windows messages that are processed when you call PeekMessage and as we stop calling that, we get the problem.
I haven't quite debugged through PeekMessage to see these updates in action yet.

The first patch that I attached does a dummy PeekMessage before each OpenTheme call, this generally seems to work, but is a bit unreliable.

The second one adds in a FakeNativeEventPump when we're not properly processing them and does the dummy PeekMessage every time we would normally process native events.
This seems to fix the issue and I think is probably what we should go with as it may well fix other issues that we don't know about yet.

Once we finally get all of the GDI stuff out of the content process, we should be able to get rid of this.
Attachment #8951021 - Flags: review?(jmathies)
(In reply to Bob Owen (:bobowen) (PTO until 26th Feb) from comment #28)
> Once we finally get all of the GDI stuff out of the content process, we
> should be able to get rid of this.

Good to hear.
Attachment #8951021 - Flags: review?(jmathies) → review+
I pushed to try [1] with an assertion to see if we ever get any messages and we still do for the OleMainThreadWndName window, which is created in the CoInitializeEx call here [2].

They seem pretty rare, so I think we should change this to do a single peek and dispatch instead of the dummy peek.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=1b7821409508e343ecbb622a5afe9058466a8c87&selectedJob=164241039
[2] https://hg.mozilla.org/mozilla-central/file/580d833df9c4/ipc/mscom/COMApartmentRegion.h#l23
Attachment #8951019 - Attachment is obsolete: true
We still get occasional messages for the internal OLE Main Thread window.
Also, the PeekMessage call allows internal windows messages to be processed for
things like GDI.
Attachment #8954346 - Flags: review?(jmathies)
Attachment #8951021 - Attachment is obsolete: true
Attachment #8954346 - Flags: review?(jmathies) → review+
I pushed a try push that logged out any messages [1].

I spot checked a fair few, many reports had none and the most I found in a single log was 5 messages.
I only found 1 in 1 opt build run, for some reason it seems to be much more common in debug builds, possibly just the length of time for which they are running.

They were all for the OLE Main Thread window, a WM_USER (0x400) message and wParam of 0xBABE.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=5133aca3747cb998ca0298e98beaccea4a03b632
Pushed by bobowencode@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/825fd04dacc6
When not generally processing native events, do single message pumps instead. r=jimm
https://hg.mozilla.org/mozilla-central/rev/825fd04dacc6
Status: ASSIGNED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: mozilla57 → mozilla60
Hi Adam, when this gets into the next Nightly, it would be great if you could confirm that this fixes your issue.
Flags: needinfo?(agashlin)
I haven't been able to reproduce by plugging in an external monitor, even with old builds, something about how Windows switches displays may have changed.

But I can still reproduce by switching to the High Contrast White theme.
Flags: needinfo?(agashlin)
(In reply to Adam Gashlin [:agashlin] from comment #37)
> I haven't been able to reproduce by plugging in an external monitor, even
> with old builds, something about how Windows switches displays may have
> changed.
> 
> But I can still reproduce by switching to the High Contrast White theme.

Can you reproduce this with dom.ipc.useNativeEventProcessing.content set back to true?
Flags: needinfo?(agashlin)
Yes, it still happens on build 20180305220117 with that pref set true when switching to High Contrast White while Firefox is running.
Flags: needinfo?(agashlin)
(In reply to Adam Gashlin [:agashlin] from comment #39)
> Yes, it still happens on build 20180305220117 with that pref set true when
> switching to High Contrast White while Firefox is running.

OK thanks, can you file that as a new bug then, I suspect that it will be reproducible in Release as well.
You need to log in before you can comment on or make changes to this bug.