Closed Bug 918676 Opened 9 years ago Closed 9 years ago

High cpu load Windows machines due to Windows Defender service

Categories

(Mozilla QA Graveyard :: Infrastructure, defect)

All
Windows 8
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cosmin-malutan, Assigned: whimboo)

Attachments

(2 files)

Found this while investigating bug 856541.
Something has gone badly wrong with Windows Defender in the last few days. Not sure what, but for whatever task the CPU load goes up like crazy. Installing Firefox already takes about 30s longer with Defender's real-time checks enabled. Excluding our c:\jenkins folder will not help, given that we still install Firefox into the tmp folder, create the profile in the tmp folder, and do other stuff there.

As an immediate action I will disable the real-time checks, and we will not re-enable them as long as we don't store all related files inside the workspace. Once that is done we can exclude our c:\jenkins folder from real-time checks but leave the rest of the system protected.
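
To illustrate why excluding only c:\jenkins would not help at the moment, here is a minimal Python sketch. It models a folder exclusion as a plain path-prefix check, which is an assumption made for illustration and not a description of how Defender evaluates exclusions internally. The file locations used below are hypothetical.

```python
from pathlib import PureWindowsPath


def is_excluded(path, exclusions):
    """Return True if `path` is an excluded folder or lies below one."""
    target = PureWindowsPath(path)
    for folder in exclusions:
        excluded = PureWindowsPath(folder)
        # PureWindowsPath compares case-insensitively, like Windows paths.
        if target == excluded or excluded in target.parents:
            return True
    return False


exclusions = [r"c:\jenkins"]

# Hypothetical locations, for illustration only.
workspace_file = r"C:\Jenkins\workspace\mozilla-aurora_update\report.xml"
temp_install = r"C:\Users\tester\AppData\Local\Temp\firefox\firefox.exe"

print(is_excluded(workspace_file, exclusions))  # True
print(is_excluded(temp_install, exclusions))    # False
```

Under this simplified model, anything installed or created in the tmp folder would still be scanned, which matches the reasoning above.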

Here are two update test examples:

Before: http://mm-ci-master.qa.scl3.mozilla.com:8080/view/-mozilla-aurora/job/mozilla-aurora_update/4345/console
After: http://mm-ci-master.qa.scl3.mozilla.com:8080/view/-mozilla-aurora/job/mozilla-aurora_update/4344/console 

The first job took 3:07 minutes while the second took only 1:30 minutes. That's half the time!
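
As a quick check of that comparison, here is a small Python sketch that converts the two durations quoted above into seconds and computes the ratio. The helper name `to_seconds` is only illustrative.

```python
def to_seconds(duration):
    """Convert an 'M:SS' string into seconds."""
    minutes, seconds = duration.split(":")
    return int(minutes) * 60 + int(seconds)


before = to_seconds("3:07")  # real-time checks enabled
after = to_seconds("1:30")   # real-time checks disabled

saved = before - after
ratio = after / before

print(f"{before}s -> {after}s, saved {saved}s, ratio {ratio:.2f}")
# 187s -> 90s, saved 97s, ratio 0.48
```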

I hope that by decreasing the CPU load we will also see fewer disconnect failures, which might also be a result of this behavior.
Assignee: nobody → hskupin
Severity: normal → critical
Status: NEW → ASSIGNED
Summary: CPU usage is 100% on Windows machines causing freezes → High cpu load Windows machines due to Windows Defender service
Blocks: 862748
I have now disabled Defender on all Windows 8 and 8.1 machines. Let's see how it works and whether it even stops the restarts on the two Windows 8.1 64-bit nodes. Once https://github.com/mozilla/mozmill-automation/issues/80 has been fixed we might want to consider re-enabling the real-time protection.

Adrian, can you please check if the general CPU load of our Windows VMs went down significantly? Thanks.
So on older machines where Microsoft Security Essentials is installed we also have to make those changes. Looks like the real-time protection is coming from exactly that software, and according to the Microsoft website it has been integrated into Windows Defender starting with Windows 8.

On XP we drop from 4:40min to about 2:45min. Transferring the mozmill-tests takes a long time here. That's something I will investigate separately.
Blocks: 916746
Blocks: 862747
Adrian, all the latest issues with high CPU load across all of our Windows nodes are really suspicious. Have any changes been made to the ESX system? Does it show a general high load? I really cannot explain why we got such a change in CPU utilization. And it is still not bound to a single process; it varies between Defender, network svchost processes, and the desktop window manager (constantly 25% load).
Flags: needinfo?(afernandez)
There's no general high load on the actual ESX cluster.

However, it seems that right around August 20th and moving forward, something occurred on all the VMs in terms of CPU stats.

We are currently investigating this.
Flags: needinfo?(afernandez)
Verified the reported performance stats with the Virtualization team and was advised to use more accurate reporting (vCenter Operations Manager).

Seems out of the Windows VMs, only the Windows 8 VMs show 100% utilization of the CPU (flat line).
Seems the flat lining started around July 30-31.

The following was mentioned in irc;
"just to note... QA had to wait 5h for our update results yesterday for 25.0b1. usually it takes about 1h"

Was this for ALL vms or just the Windows 8 ones?

Just to note, at this time, the Virtualization team does not see any load issues with the ESX cluster. So we'll need to go at this at OS level.
(In reply to Adrian Fernandez [:Aj] from comment #7)
> Was this for ALL vms or just the Windows 8 ones?

This was for all VMs, but I believe that to be a flaw in our Jenkins CI which was exacerbated by this bug. Henrik can give you more technical details, but when one of the jobs hangs, the rest wait until it times out.
(In reply to Adrian Fernandez [:Aj] from comment #7)
> Seems out of the Windows VMs, only the Windows 8 VMs show 100% utilization of
> the CPU (flat line).
> Seems the flat lining started around July 30-31.

I was on vacation at this time. So if a change on our side should have caused this, it wasn't me. Dave, can you please check our weekly pads if you did some os level changes or updates around that time?

> The following was mentioned in irc;
> "just to note... QA had to wait 5h for our update results yesterday for
> 25.0b1. usually it takes about 1h"
> 
> Was this for ALL vms or just the Windows 8 ones?

All VMs show a high CPU load. I was also able to see it on XP, where Microsoft Security Essentials is using around 20% CPU almost all the time when real-time protection is enabled. But I think the most problematic VMs are the ones for 8 and 8.1, which show about 100% utilization, and Firefox takes about 4 minutes to be installed via the silent installer!

> Just to note, at this time, the Virtualization team does not see any load
> issues with the ESX cluster. So we'll need to go at this at OS level.

Do we have some long-term graphs we can have a look at? Screenshots if possible? Are there times when the utilization drops? I would like to figure out a correlation to the times when tests have been run.

(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #8)
> This was for all VMs but I believe that to be a flaw in our Jenkins CI which

There is absolutely no flaw in our Jenkins CI. To get appropriate regression ranges the junit report generator for the current job has to wait until the formerly started job has been finished. We have mentioned that already a couple of times. So please don't blame it as a flaw. As a workaround we can disable it for sure, but it should not be a default setting given that we will miss very important historical stats without it. If slave nodes are working as expected we never should run into those issues.
Flags: needinfo?(dave.hunt)
(In reply to Henrik Skupin (:whimboo) from comment #9)
> (In reply to Adrian Fernandez [:Aj] from comment #7)
> > Seems out of the Windows VMs, only the Windows 8 VMs show 100% utilization of
> > the CPU (flat line).
> > Seems the flat lining started around July 30-31.
> 
> I was on vacation at this time. So if a change on our side should have
> caused this, it wasn't me. Dave, can you please check our weekly pads if you
> did some os level changes or updates around that time?

As you may remember I was also on PTO at this time, so no, I didn't change anything on the OS.

> (In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #8)
> > This was for all VMs but I believe that to be a flaw in our Jenkins CI which
> 
> There is absolutely no flaw in our Jenkins CI. To get appropriate regression
> ranges the junit report generator for the current job has to wait until the
> formerly started job has been finished. We have mentioned that already a
> couple of times. So please don't blame it as a flaw. As a workaround we can
> disable it for sure, but it should not be a default setting given that we
> will miss very important historical stats without it. If slave nodes are
> working as expected we never should run into those issues.

Yes, this is in fact a feature, so the reports can identify the build where a regression first occurred. It doesn't necessarily apply perfectly to the way we use Jenkins, but there is currently no way to disable this analysis without disabling the entire report processing.
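
The waiting behavior described above can be sketched with a simplified Python model. This is not Jenkins' actual implementation: the function name, the time units, and the assumption that a hung job is ended after a fixed timeout are all hypothetical and only serve to show why one hung job delays the reports of all later jobs.

```python
def publish_times(starts, durations, timeout):
    """Return the time at which each job's report can be published.

    Simplified model: a job's report is published only once the job itself
    has ended and the report of the previously started job is out. A job
    running longer than `timeout` is assumed to be ended at the timeout.
    """
    previous_publish = 0
    result = []
    for start, duration in zip(starts, durations):
        finish = start + min(duration, timeout)
        publish = max(finish, previous_publish)
        result.append(publish)
        previous_publish = publish
    return result


HUNG = float("inf")  # marker for a job that never finishes on its own

# Three jobs started one minute apart; the second one hangs.
print(publish_times(starts=[0, 1, 2], durations=[3, HUNG, 3], timeout=60))
# [3, 61, 61]
```

In this model the third job ends after 5 minutes but its report only appears at minute 61, once the hung job has timed out.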
Flags: needinfo?(dave.hunt)
I updated the Windows XP machines yesterday to use the VNC mirror driver. That seems to drop the CPU usage by about 4%, and it now rarely shows more than 0%. I have also re-enabled the real-time protection of Defender. I will check today whether that shows any performance issues. If not, I will do the same for the other machines. Otherwise we will stop using Defender.

Armen, how does RelEng handle software like Windows Defender? Are those tools installed at all on the buildbot machines?
Flags: needinfo?(armenzg)
We disable every service that could cause performance anomalies; that includes Defender.
We don't want anything running that should not be running while our talos jobs run.
Flags: needinfo?(armenzg)
That makes sense. Thank you, Armen. Whenever we really have to test the interaction of Firefox with firewalls, virus scanners, and similar software, we should have dedicated machines for that.

As of late Friday I disabled Windows Defender completely on all Windows boxes. Recent ondemand updates were working pretty well. So I will call this fixed.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 912363
Product: Mozilla QA → Mozilla QA Graveyard