Closed Bug 1401250 Opened 7 years ago Closed 6 years ago

Permafailing Win10 toolkit/xre/test/browser_checkdllblockliststate.js | Windows dll blocklist status should be true, indicating it is running properly. A failure in this test is considered a release blocker. -

Categories

(Toolkit :: Blocklist Implementation, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox56 --- wontfix
firefox57 - verified
firefox58 --- fixed
firefox59 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: away)

References

Details

Attachments

(2 files)

Filed by: rvandermeulen [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=132006346&repo=mozilla-beta

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-beta-win64/mozilla-beta_win10_64_test-mochitest-e10s-browser-chrome-4-bm109-tests1-windows-build0.txt.gz

This showed up when I did a sync between m-c and m-b today.

What we know so far:
* Fails only on Desktop Firefox builds. DevEdition is passing. And only opt builds - debug is green.
* We switched Win64 mochitest-browser-chrome from Win8 to Win10 (see bug 1397229) in that range. It was previously passing on Win8.
* It is passing on Trunk. The failures are only happening on Beta.
[Tracking Requested - why for this release]: According to the error message in the test, this is bad.
If the test says that, it is probably right :)
ni?ing jimm who wrote the test in bug 1286306.
Flags: needinfo?(jmathies)
will investigate.
Assignee: nobody → jmathies
Flags: needinfo?(jmathies)
Priority: P5 → P1
Flags: needinfo?(jmathies)
Component: Security: Process Sandboxing → Blocklisting
Flags: needinfo?(jmathies)
Product: Core → Toolkit
Version: unspecified → Trunk
Local beta build passes running on one of our low end win10 reference laptops. Joel, do you know what kind of hardware these tests run on? I think it's pretty easy to gain access to one of these slaves, I'd like to do that to investigate*. Curious if you can point me to some docs on how to do that?

* only if I can install dev tools on it, check out source and build.
Flags: needinfo?(jmaher)
these were running on hardware, as of 2 days ago I moved them to virtual machines (and oddly the failures have gone away).  This was on trunk, we could uplift the patch to beta if needed.

Possibly we can revisit this bug in a week to see if it is still a problem?
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #7)
> and oddly the failures have gone away

Where did they go away from? As far as I know, this only affected Beta and is still happening as of the latest push there today. But yes, I'd be happy to land the patches that moved the tests over to VMs if you think it might help.
Flags: needinfo?(jmaher)
:ryanvm, lets upload the patch from bug 1402068 to mozilla-beta, that would give us a good chance to debug on what we have ensured is greenest on trunk.
Flags: needinfo?(jmaher)
thanks :RyanVM!

this was failing on hardware:
https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation

now it continues to fail on a VM.

here is how to get a windows 10 vm loaner:
https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance

I do not think you will easily be able to build on there- probably installing mozilla-build and maybe some additional tools will help, it will be a longer bootstrap problem though.  Historically on all loaners we have not had the ability to easily build and run debugging tools- the builders have different packages installed than the testers.
Flags: needinfo?(jmathies)
Any luck with this so far, Jim? This failure is still hitting on the 58-as-Beta Try simulation runs as well.
Hi Jim,
Any progress here? It seems it is still the #28 most frequent failure.
(In reply to Gerry Chang [:gchang] - On PTO - Back on October 23 from comment #21)
> Hi Jim,
> Any progress here? It seems it is still the #28 most frequent failure.

Nope not yet. Need to get a hold of one of these slaves and poke around, figure out what's going on. I haven't had the time to go step through the "borrow a slave" process.

If you would like to pick this up and have the cycles, please do.
Flags: needinfo?(jmathies)
Aaron, is this something you might be able to look at? 
I know you're busy - so I'll n-i to dmajor or handyman as well. 

From what I understand here, the test is failing, and we aren't sure if there's a real problem with dll blocklists for 57, or if it is a minor problem with the test itself. DLL blocklisting is pretty important for stability on the release channel (more so than on beta), so the test failure is marked as a release blocker.
Flags: needinfo?(dmajor)
Flags: needinfo?(davidp99)
Flags: needinfo?(aklotz)
Does this ever happen on debug builds? There ought to be enough printf in there to help narrow this down.
All opt builds, unfortunately. It's easily reproducible on Try if you want to attempt any extra logging patches, though.
Can you try this? It'll turn the failure into a crash, and we can read the variables from the crash .extra file.

--- a/mozglue/build/WindowsDllBlocklist.cpp
+++ b/mozglue/build/WindowsDllBlocklist.cpp
@@ -924,6 +924,6 @@ MFBT_API bool
 DllBlocklist_CheckStatus()
 {
   if (sBlocklistInitFailed || sUser32BeforeBlocklist)
-    return false;
+    __debugbreak();
   return true;
 }
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #23)
> Aaron, is this something you might be able to look at? 
> I know you're busy - so I'll n-i to dmajor or handyman as well. 
> 
> From what I understand here, the test is failing, and we aren't sure if
> there's a real problem with dll blocklists for 57, or if it is a minor
> problem with the test itself. DLL blocklisting is pretty important for
> stability on the release channel (more so than on beta), so the test failure
> is marked as a release blocker.

I don't think there's a real problem here, we've tested this on various real hardware and it doesn't reproduce. the failure was introduced when we switched a set of tests to aws vms.

Akotz is busy with a11y, davidp is busy with flash sandbox hardening. dmajor, if you have cycles I'd appreciate the help.. Otherwise I'll try to get back to this in a week or so.
Flags: needinfo?(davidp99)
Flags: needinfo?(aklotz)
The build from Ryan's push has a load-time dependency on user32. Here's the module list before the initial process breakpoint. I'll investigate further tomorrow.

ModLoad: 00007ff6`199d0000 00007ff6`19a3f000   firefox.exe
ModLoad: 00007fff`8ce70000 00007fff`8d04b000   ntdll.dll
ModLoad: 00007fff`8aba0000 00007fff`8ac4e000   C:\WINDOWS\System32\KERNEL32.DLL
ModLoad: 00007fff`89f00000 00007fff`8a149000   C:\WINDOWS\System32\KERNELBASE.dll
ModLoad: 00007fff`81cc0000 00007fff`81d3e000   C:\WINDOWS\SYSTEM32\apphelp.dll
ModLoad: 00007fff`738e0000 00007fff`73937000   C:\WINDOWS\AppPatch\AppPatch64\AcGenral.DLL
ModLoad: 00007fff`8a530000 00007fff`8a5cd000   C:\WINDOWS\System32\msvcrt.dll
ModLoad: 00007fff`8b550000 00007fff`8b5a9000   C:\WINDOWS\System32\sechost.dll
ModLoad: 00007fff`8a690000 00007fff`8a7b5000   C:\WINDOWS\System32\RPCRT4.dll
ModLoad: 00007fff`8acc0000 00007fff`8ad11000   C:\WINDOWS\System32\SHLWAPI.dll
ModLoad: 00007fff`8cab0000 00007fff`8cda9000   C:\WINDOWS\System32\combase.dll
ModLoad: 00007fff`8a270000 00007fff`8a366000   C:\WINDOWS\System32\ucrtbase.dll
ModLoad: 00007fff`89e90000 00007fff`89efa000   C:\WINDOWS\System32\bcryptPrimitives.dll
ModLoad: 00007fff`8aa20000 00007fff`8aa47000   C:\WINDOWS\System32\GDI32.dll
ModLoad: 00007fff`89430000 00007fff`895b7000   C:\WINDOWS\System32\gdi32full.dll
ModLoad: 00007fff`89390000 00007fff`8942a000   C:\WINDOWS\System32\msvcp_win.dll
ModLoad: 00007fff`8aa50000 00007fff`8ab9a000   C:\WINDOWS\System32\USER32.dll
For reference here's a list from a good build; these modules can be cleared of suspicion.

start             end                 module name
00007ff6`90310000 00007ff6`90380000   firefox    (deferred)             
00007fff`5dc10000 00007fff`5dcaf000   MSVCP140   (deferred)             
00007fff`61880000 00007fff`618ae000   mozglue    (deferred)             
00007fff`730e0000 00007fff`730f6000   VCRUNTIME140   (deferred)             
00007fff`78700000 00007fff`788a9000   dbghelp    (deferred)             
00007fff`859c0000 00007fff`859ca000   VERSION    (deferred)             
00007fff`88d30000 00007fff`88d3b000   CRYPTBASE   (deferred)             
00007fff`89e90000 00007fff`89efa000   bcryptPrimitives   (deferred)             
00007fff`89f00000 00007fff`8a149000   KERNELBASE   (deferred)             
00007fff`8a270000 00007fff`8a366000   ucrtbase   (deferred)             
00007fff`8a530000 00007fff`8a5cd000   msvcrt     (deferred)             
00007fff`8a690000 00007fff`8a7b5000   RPCRT4     (deferred)             
00007fff`8aba0000 00007fff`8ac4e000   KERNEL32   (deferred)             
00007fff`8b550000 00007fff`8b5a9000   sechost    (deferred)             
00007fff`8cdb0000 00007fff`8ce51000   ADVAPI32   (deferred)             
00007fff`8ce70000 00007fff`8d04b000   ntdll      (pdb symbols)
AcGenral.DLL is my biggest suspect right now. Why on earth is it in our process? Did we accidentally request some appcompat thing?
Ok, I think I know what's triggering this.

According to the Compatibility Administrator in the Windows ADK, Windows 10 ships with an appcompat setting for Firefox called "DamThrottleControl". It keys off PRODUCT_NAME="Firefox" and FILE_DESCRIPTION="Firefox" so we don't trigger it on trunk where our app calls itself "Nightly". This "DamThrottleControl" must be new to Windows 10 because I don't see it in my Windows 8.1 machine's compat database.

I have no idea what this compat fix was intended to solve, or how to get around it (short of using a different product name in our app metadata!).

Milan, I've heard you mention a "Microsoft list" in another bug. Would you be able to get some help on this? To recap: this compat setting causes AcGenrl.dll to load early in our process, which in turn pulls in user32.dll, which loads AppInit_DLLs before our DLL blocklist is initialized, which prevents us from blocking crashy injected libraries.
Flags: needinfo?(dmajor) → needinfo?(milan)
That sounds plausible. So far, I've ruled out via Try pushes the version number change and --enable-profiling as related to this, which isn't leaving much left other than official branding :)
(In reply to David Major [:dmajor] from comment #32)
> Ok, I think I know what's triggering this.
> 
> According to the Compatibility Administrator in the Windows ADK, Windows 10
> ships with an appcompat setting for Firefox called "DamThrottleControl". It
> keys off PRODUCT_NAME="Firefox" and FILE_DESCRIPTION="Firefox" so we don't
> trigger it on trunk where our app calls itself "Nightly". This
> "DamThrottleControl" must be new to Windows 10 because I don't see it in my
> Windows 8.1 machine's compat database.
> 
> I have no idea what this compat fix was intended to solve, or how to get
> around it (short of using a different product name in our app metadata!).
> 
> Milan, I've heard you mention a "Microsoft list" in another bug. Would you
> be able to get some help on this? To recap: this compat setting causes
> AcGenrl.dll to load early in our process, which in turn pulls in user32.dll,
> which loads AppInit_DLLs before our DLL blocklist is initialized, which
> prevents us from blocking crashy injected libraries.

I've run this test manually on a fully updated win 10 laptop and didn't get a failure. Could this be something weird aws installed on it's slaves?
(In reply to Jim Mathies [:jimm] from comment #34)
> I've run this test manually on a fully updated win 10 laptop and didn't get
> a failure. Could this be something weird aws installed on it's slaves?

Was it on a local build? You need to have official branding enabled so that the product name is "Firefox" rather than "Nightly".
Message sent to the microsoft discuss list, :dmajor is CC-d on it.
Flags: needinfo?(milan)
If we don't hear back from Microsoft, I think this is our only option: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3e0a693a081920b58fea13415a2c4bad185bfa40

This patch changes our Windows file-description metadata from "Firefox" to "Mozilla Firefox" in order to work around an issue on Windows 10.

To be clear: This patch does NOT change any branding normally presented to a typical user. Short of attaching a debugger, the only way to see this change is to right-click firefox.exe and choose Properties then Details.

We're doing this because Windows 10 ships with an appcompat fix called DamThrottleControl that applies to any firefox.exe with product name and description equal to the string "Firefox". I have no idea what that fix was intended to do, but Nightly and DevEdition appear to be doing fine without it. In any case, the appcompat fix comes with the side effect of loading user32.dll very early in our process, which allows AppInit_DLLs to sneak in before our blocklist initializes. That's bad (and it fails a mochitest). This patch evades the Windows appcompat engine by making us no longer match the string that it's testing for.
Assignee: jmathies → dmajor
Attachment #8925932 - Flags: review?(core-build-config-reviews)
Attached image filedescription.png
Example of what the patch affects
(In reply to David Major [:dmajor] from comment #35)
> (In reply to Jim Mathies [:jimm] from comment #34)
> > I've run this test manually on a fully updated win 10 laptop and didn't get
> > a failure. Could this be something weird aws installed on it's slaves?
> 
> Was it on a local build? You need to have official branding enabled so that
> the product name is "Firefox" rather than "Nightly".

Yes, it's a non-branded build. Thanks for picking this up.
Pike, I'm planning to change an internal string in comment 37. I pulled up a ru build (without my patch) and the file properties still had their English, Latin-script values, so I'm pretty sure these strings are not localized. But I wanted to give you a heads up in case you foresee any problems I'm not aware of.
Flags: needinfo?(l10n)
dbolter: I don't expect comment 37 to break any screen readers (they better not be testing for product name "Firefox" or else they'd be broken already in Nightly and FirefoxDeveloperEdition) but just to be safe, in case we rush this to release, could you take the try build for a spin? Thanks!
Flags: needinfo?(dbolter)
That string is coming from https://dxr.mozilla.org/mozilla-central/source/browser/branding/official/configure.sh#5, I think, so keeping it hard-coded in English is fine. We also don't want to modify our branding from a trademarks POV, so this is OK from an l10n pov. Thanks for asking.
Flags: needinfo?(l10n)
(In reply to Mike Conley (:mconley) (:⚙️) - Backlogged on reviews and needinfos from comment #39)
> For the curious, here's what I've found about "DAM":
> https://channel9.msdn.com/Shows/Going+Deep/Inside-Windows-8-Jon-Berry-
> Desktop-Activity-Moderator-and-Connected-Standby

WOW. This might explain some problems that we've had in a11y-land.
David, Jim, :Philipp and a bunch of others have reviewed the severity of this issue and whether it should block 57. The findings so far is that this is not a new problem in 57. This has been a problem with all Firefox versions (55, 56) on win10. The crash rates from 56 are not concerning around this specific issue. David and Jim also mentioned that uplifting this patch to 57 is risky and should be avoided if possible.

To summarize, while this is a problem that needs to be fixed in 58+. For 57, this is not release blocking as the crash rates due to this problem is low (malware crashes on win10 due to some of the dll blocklisting code not working as expected). 

I would like to keep this tracking for 57 and status affected for a bit longer post 57 launch.
(In reply to David Major [:dmajor] from comment #37)
> Created attachment 8925932 [details] [diff] [review]
> Change firefox.exe description metadata

Graphics drivers are keying off the name of the executable, so as long as that stays firefox.exe, I don't think the change in the meta data will have any negative effect on the graphics.
(In reply to David Major [:dmajor] from comment #42)
> dbolter: I don't expect comment 37 to break any screen readers (they better
> not be testing for product name "Firefox" or else they'd be broken already
> in Nightly and FirefoxDeveloperEdition) but just to be safe, in case we rush
> this to release, could you take the try build for a spin? Thanks!

I'm betting MarcoZ will beat me to this (and he has more screen readers to try out). Transferring NI. Marco can you jump on this?
Flags: needinfo?(dbolter) → needinfo?(mzehe)
(In reply to David Major [:dmajor] from comment #37)
> the appcompat fix comes with the side effect of loading
> user32.dll very early in our process

Maybe that was the goal of that fix?
Comment on attachment 8925932 [details] [diff] [review]
Change firefox.exe description metadata

Review of attachment 8925932 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/app/module.ver
@@ +2,5 @@
>  WIN32_MODULE_COPYRIGHT=©Firefox and Mozilla Developers; available under the MPL 2 license.
>  WIN32_MODULE_PRODUCTVERSION=@MOZ_APP_WINVERSION@
>  WIN32_MODULE_PRODUCTVERSION_STRING=@MOZ_APP_VERSION@
>  WIN32_MODULE_TRADEMARKS=Firefox is a Trademark of The Mozilla Foundation.
> +WIN32_MODULE_DESCRIPTION=Mozilla @MOZ_APP_DISPLAYNAME@

Hardcoding Mozilla here is not really great.
Sadly the branding's configure.sh doesn't have an equivalent to brandFullName.
(In reply to Mike Hommey [:glandium] from comment #48)
> (In reply to David Major [:dmajor] from comment #37)
> > the appcompat fix comes with the side effect of loading
> > user32.dll very early in our process
> 
> Maybe that was the goal of that fix?

Given the DAM context, I'm doubtful. I think it's more likely that user32 is merely along for the ride, as it's an (indirect) dependency of the appcompat DLLs.
Blocks: 1415337
Comment on attachment 8925932 [details] [diff] [review]
Change firefox.exe description metadata

Review of attachment 8925932 [details] [diff] [review]:
-----------------------------------------------------------------

I don't see any reason this shouldn't be OK from a build system perspective. It is a little worrying that we are working around a Microsoft appcompat fix, but it's also annoying that they would ship something like that without asking us to fix a bug on our side first.
Attachment #8925932 - Flags: review?(core-build-config-reviews) → review+
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #47)
> (In reply to David Major [:dmajor] from comment #42)
> > dbolter: I don't expect comment 37 to break any screen readers (they better
> > not be testing for product name "Firefox" or else they'd be broken already
> > in Nightly and FirefoxDeveloperEdition) but just to be safe, in case we rush
> > this to release, could you take the try build for a spin? Thanks!
> 
> I'm betting MarcoZ will beat me to this (and he has more screen readers to
> try out). Transferring NI. Marco can you jump on this?

I tried the try build from comment #37, and both NVDA and JAWS, our main assistive technologies, still recognize that they're running in Firefox. So I am reasonably sure we're safe.
Flags: needinfo?(mzehe)
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0b07ff291caa
Change firefox.exe description metadata to evade the Windows 10 appcompat engine. r=ted
Keywords: checkin-needed
Jim based on the "DamThrottleControl" email thread and a discussion just now on IRC (release-drivers) I understand we're not taking the current fix on 57 correct?
Flags: needinfo?(jmathies)
Per the MS discussion list, this appcompat setting is applied only on Windows 10 Creators Update (1703), and it allows Firefox to run without throttling in that version. I've confirmed that we don't get AcGenral/user32 in Anniversary Update (1607) or Fall Creators Update (1709). Since our workaround means Firefox will be subject to throttling, it's probably doing more harm than good, and we'd be better off living with the weakened blocklist in 1703.

Ryan, could you please back out this patch?
Flags: needinfo?(ryanvm)
Flags: needinfo?(ryanvm)
Backout by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6d174f463c1a
Backed out changeset 0b07ff291caa because it'll probably do more harm than good.
Flags: needinfo?(jmathies)
Sylvestre - I suggest we remove this bug from tracking 57 because it isn't a regression relative to 56. The issue is fixed in Win 10 1709 which came out on the 17th of October so is slowly making its way out to users.
Flags: needinfo?(sledru)
OK, thanks!
Flags: needinfo?(sledru)
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/76c9e778b0e1
Skip browser_checkdllblockliststate.js on Win10 since it permafails on the version we use in automation. r=me
Hopefully we remember to un-skip this once we update to a newer version of Win10 in CI.
Whiteboard: [checkin-needed-beta]
https://hg.mozilla.org/mozilla-central/rev/76c9e778b0e1
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Depends on: 1522896
No longer depends on: 1522896
Component: Blocklist Policy Requests → Blocklist Implementation
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: