Closed Bug 1561324 Opened 5 years ago Closed 5 years ago

Determine Windows configuration options that reduce noise on reference laptop (windows10-64-ref-hw-2017)

Categories: Testing :: Raptor, task, P1
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: Reporter: acreskey, Assigned: acreskey
Attachments: 2 files

Using the reference laptop in CI and locally we see very significant variations in performance results.
This makes getting reproducible results extremely difficult.

The purpose of this bug is to collect Windows configuration options that minimize OS-induced noise.

These are the configuration options that I've been disabling locally:
• Indexing Service (file search)
• Windows Defender (default antivirus)

Denis, Mike, I know that you two have had some success in reducing noise on the 2017 reference laptop.

Can you please add any OS features that you are disabling?

Flags: needinfo?(mconley)
Flags: needinfo?(dpalmeiro)
Priority: -- → P1

In my experience, operating system updates make the system much slower because they trigger a lot of disk activity, both while they are being downloaded and for the next ~10h after installation, while Windows is 'optimizing' files on the disk.

Not sure if your scripts already include this, but when I was trying to get numbers automatically from this hardware, I used a script that waited for CPU idle and disk idle before starting Firefox.

Windows Defender was the big one for me. After I turned that off (I used 3 different ways to do this to make sure it's never on), the machine became quite usable. Other than that, I just make sure disk usage is close to 0% before I begin my tests.

Flags: needinfo?(dpalmeiro)

I disable the Superfetch / Prefetch stuff (now called SysMain in Services), because otherwise, I was noticing a big shift in measurement over time as Windows "learned" what I liked to run during start-up.
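
For anyone scripting this, here is a minimal sketch of how the service could be turned off. It assumes the service is registered as SysMain (the name shown in Services), that the script runs from an elevated prompt, and that the stock sc.exe tool is available. The helper names are hypothetical, and this is not necessarily how the service was disabled on the machine described above.

```python
import subprocess
import sys


def sysmain_disable_commands(service="SysMain"):
    """Return the sc.exe invocations that stop a service and disable its startup."""
    return [
        ["sc.exe", "stop", service],
        # sc.exe syntax is "start= disabled": the option name ends in "=" and
        # the value follows as a separate token.
        ["sc.exe", "config", service, "start=", "disabled"],
    ]


def disable_sysmain(dry_run=True):
    """Print each command; execute it only on Windows when dry_run is False."""
    for cmd in sysmain_disable_commands():
        print(" ".join(cmd))
        if not dry_run and sys.platform == "win32":
            # check=False: "stop" returns an error if the service is not running,
            # which is harmless here.
            subprocess.run(cmd, check=False)


# Dry run by default, so nothing is changed on the machine.
disable_sysmain(dry_run=True)
```

Keeping the command list separate from the execution step makes it easy to review what would be run before applying it to a test machine.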

Flags: needinfo?(mconley)

When I was running some tests locally, to reduce noise, I disabled Bluetooth, enabled a metered network connection, and disabled Windows updates.

I should note that Windows Defender on is the "default" mode that users will use systems in, so our testing should reflect that.

:jesup, the problem with leaving defaults on is that we introduce false positives/negatives into our data. That makes it more difficult to tell whether a performance issue comes from a Firefox change or from an OS task that has intermittently started and whose resource usage interacts poorly with Firefox. However, based on this, I'm thinking it might be worthwhile to look into testing interoperability throughput performance separately from application-only throughput performance.

Assignee: nobody → acreskey
Status: NEW → ASSIGNED
Summary: Determine Windows configuration options that reduce noise on reference laptop (-ux) → Determine Windows configuration options that reduce noise on reference laptop (windows10-64-ref-hw-2017)

Noise is still a major problem on the reference laptop.
Comparing 10 runs against 10 runs of the same changeset I see large differences:

• Amazon warm load metrics off by ~10%
• Facebook loadtime off by 13%
• Netflix metrics off by ~10%
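
As a rough illustration of what "off by ~10%" means when two sets of replicates are compared, here is a minimal sketch that takes the percent change between the medians of two pushes. The function name and the sample loadtimes are hypothetical, and this is not necessarily the calculation Perfherder performs.

```python
from statistics import median


def percent_change(base_runs, new_runs):
    """Percent change of the median of new_runs relative to the median of base_runs."""
    base = median(base_runs)
    new = median(new_runs)
    return (new - base) / base * 100.0


# Hypothetical loadtimes in ms for 10 runs each of two pushes of the same changeset.
push_a = [2000, 2050, 1980, 2100, 1990, 2020, 2010, 2070, 1995, 2030]
push_b = [2200, 2250, 2190, 2300, 2210, 2230, 2180, 2260, 2240, 2220]

print(f"{percent_change(push_a, push_b):+.1f}%")
```

With identical code on both sides, any non-trivial value here is noise from the environment rather than a real regression or improvement.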

Attached image netflix.png
Attached image netflix_replicates.png

This is a particularly interesting replicates view.
Note the batch of loadtimes that come in at ~200ms while the median is about 2000ms.

Ionut, can I ask for your thoughts on the noise in this comparison?
It's a changeset compared against itself on windows10-64-ref-hw-2017 and also windows10-64
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=792c62b2ad217ba4c0a49c639d1f6696eac2578b&newProject=try&newRevision=5aec96c94b123de3b4b7d2af1292c86ac20e3e01&framework=10

Perfherder is picking up two significant changes on the windows10-64-ref-hw-2017
I also see a 9% and a 10% change on raptor-tp6-imgur-firefox and raptor-tp6-outlook-firefox for windows10-64

I don't have any experience with sheriffing, but I would imagine that all of these are problematic.

If we could solve just the issues that lead to the changes marked as "Significant/Important" by perfherder, would that get us most of the value?

Flags: needinfo?(igoldan)

(In reply to Andrew Creskey from comment #12)

> Ionut, can I ask for your thoughts on the noise in this comparison?
> It's a changeset compared against itself on windows10-64-ref-hw-2017 and also windows10-64
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=792c62b2ad217ba4c0a49c639d1f6696eac2578b&newProject=try&newRevision=5aec96c94b123de3b4b7d2af1292c86ac20e3e01&framework=10
>
> Perfherder is picking up two significant changes on the windows10-64-ref-hw-2017
> I also see a 9% and a 10% change on raptor-tp6-imgur-firefox and raptor-tp6-outlook-firefox for windows10-64
>
> I don't have any experience with sheriffing, but I would imagine that all of these are problematic.

Indeed, this is a weird situation. I actually see more changes here than those you mentioned. They vary from +/- 4% to 10%.
Seems like our Windows platform's environments aren't yet quite suited for properly running perf tests.

> If we could solve just the issues that lead to the changes marked as "Significant/Important" by perfherder, would that get us most of the value?

Yes, I see this as a valuable step forward.

Flags: needinfo?(igoldan)

Thank you Ionut.

I did do some quick tests:

1. Disable OCSP and compare against same revision
This is a known source of noise on the reference laptop. I didn't think it would have any impact here because we connect to mitmproxy and thus don't use the actual site certificates.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1a1b8f51b847ef0affe3f17df8a399911ded8021&newProject=try&newRevision=344295d59f90f2e5a1bb2c4aa471c0e903bbe60c&framework=10

Maybe I got lucky on the runs but this comparison doesn't show any flagged perf differences.
So this could be worth looking into.
If this did reduce noise in the test environment, we could argue for disabling OCSP in the perf profile, since OCSP itself will be replaced in the not-too-distant future.

2. Defer setTimeouts() during pageload (would otherwise run on idle)

This is another known source of bimodal behaviour.

Comparing this job against itself also gives a perfherder diff with no flagged perf differences. (although still quite a bit of noise).
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=411785ebb7de902bc37af52759d4d4e2aef83532&newProject=try&newRevision=71cb7ba08f32eea6891e401174323e62209050bf&framework=10

If these results had been smoother, maybe we could consider a 'deterministic load' preference for the perf profile, but I'm not sure...

Back to the bug as logged.

Dave, who could explain to me the differences in the OS setup between the reference laptops in test (i.e. windows10-64-ref-hw-2017) and the devices that, to my understanding, run virtualized on AWS, such as windows10-64 ?

Presumably the images for windows10-64 on AWS don't allow system updates and Windows Defender to be running?

Flags: needinfo?(dave.hunt)

(In reply to Andrew Creskey from comment #15)

> Dave, who could explain to me the differences in the OS setup between the reference laptops in test (i.e. windows10-64-ref-hw-2017) and the devices that, to my understanding, run virtualized on AWS, such as windows10-64 ?

Kendall: could someone from your team help explain the differences between these platforms in automation, or point Andrew to the relevant documentation/configuration?

Flags: needinfo?(dave.hunt) → needinfo?(klibby)

Mark knows the most about the ref laptops and is familiar with AWS, so I'm redirecting the NI to him.

Flags: needinfo?(klibby) → needinfo?(mcornmesser)

The two most significant differences are the quality of the hardware and the Windows 10 build: 1803 for AWS and 1703 for the reference laptops. General configuration items like Windows Update and Windows Defender are disabled on both platforms.

Do we have examples of the noisy tests from the last week? If so I can start looking through papertrail logs and see if anything obvious jumps out.

Also feel free to hit me up on Slack or send a meeting invite to discuss this in further detail.

Flags: needinfo?(mcornmesser)

Thank you Mark.

I think raptor-tp6-netflix-firefox loadtime opt is as good as any for a noisy test example.
If you see anything in the papertrail, I would be curious.
My hunch is that the runtime is fighting with the OS for resources like the slow platter drive, but I don't know for sure.

I'll try some local testing with the script and suggestion from Comment 2 and Comment 3 (wait until the disk is quiet before starting tests) to see if that helps.

By the way, disabling OCSP is not helpful here; I was just lucky in Comment 14.
Here are two pushes with ~10 retries of the same revision compared: 4 tests flagged as significant changes (~8% to 13%)
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3ebb29d9815219c462e95200016d6fadd84331dc&newProject=try&newRevision=d8fdc966ff39defbe61af3491ffde17fe6983bb8&framework=10

There was nothing significant in Papertrail.

Is this an issue that has progressively gotten worse, or have these tests always been noisy?

Thanks for looking, Mark.
As far as I know these tests have always been very noisy.
These are some related bugs:
Bug 1525017 (8 months ago)
Bug 1536090 (7 months ago)

See Also: → 1525017, 1536090

I ran a test where I made raptor wait for idle CPU (below 3%) and disk (no new activity) as in :florian's script

However, within the time given to wait (only 15 seconds), the device never comes close to idle:

For example,

[task 2019-10-03T15:27:53.365Z] 15:27:53     INFO -  raptor-main Info: CPU use: 27.7%
[task 2019-10-03T15:27:53.365Z] 15:27:53     INFO -  raptor-main Info: AJC - disk reads: 11
[task 2019-10-03T15:27:53.365Z] 15:27:53     INFO -  raptor-main Info: AJC - disk writes: 8

and

[task 2019-10-03T15:25:27.378Z] 15:25:27     INFO -  raptor-main Info: CPU use: 5.8%
[task 2019-10-03T15:25:27.378Z] 15:25:27     INFO -  raptor-main Info: AJC - disk reads: 321
[task 2019-10-03T15:25:27.378Z] 15:25:27     INFO -  raptor-main Info: AJC - disk writes: 4

I'll try to relax the conditions and give it a bit more time to wait.
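
For comparing many such samples, the values can be pulled out of the log with a small parser. This is a minimal sketch that assumes the exact line format shown above (including the "AJC - " debugging prefix on the disk lines); the regular expressions and the function name are hypothetical.

```python
import re

# Patterns match the "CPU use: 27.7%" and "disk reads: 11" fragments of the lines above.
CPU_RE = re.compile(r"CPU use: (?P<cpu>[\d.]+)%")
DISK_RE = re.compile(r"disk (?P<kind>reads|writes): (?P<count>\d+)")


def parse_idle_sample(lines):
    """Collect the CPU percentage and disk read/write counts from one group of log lines."""
    sample = {}
    for line in lines:
        cpu = CPU_RE.search(line)
        if cpu:
            sample["cpu"] = float(cpu.group("cpu"))
        disk = DISK_RE.search(line)
        if disk:
            sample[disk.group("kind")] = int(disk.group("count"))
    return sample


log = [
    "[task 2019-10-03T15:27:53.365Z] 15:27:53     INFO -  raptor-main Info: CPU use: 27.7%",
    "[task 2019-10-03T15:27:53.365Z] 15:27:53     INFO -  raptor-main Info: AJC - disk reads: 11",
    "[task 2019-10-03T15:27:53.365Z] 15:27:53     INFO -  raptor-main Info: AJC - disk writes: 8",
]
print(parse_idle_sample(log))  # {'cpu': 27.7, 'reads': 11, 'writes': 8}
```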

Mark, I forgot to ask you -- can you tell me if the Windows Indexing Service is disabled on these configurations?

Flags: needinfo?(mcornmesser)

I'll investigate this further, but I was able to get the reference laptop to be roughly idle before starting the pageload tests.

I reduced the raptor post_startup_delay from 30 seconds to 1 second and instead made the runner wait for <5% CPU usage and only a handful of disk reads/writes.

The wait for near idle seems to take between 25 and 45 seconds.
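
The wait described above can be sketched roughly as follows. This is not the actual Raptor change: the function name, the default of 5 disk operations for "a handful", the 60 second timeout, and the injected sampler callables are all assumptions. On a real machine the samplers could be backed by something like psutil, which is left out here so the sketch stays standard-library only.

```python
import time


def wait_for_idle(sample_cpu, sample_disk_ops, max_cpu=5.0, max_disk_ops=5,
                  timeout=60.0, interval=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until CPU use and new disk activity drop below the thresholds.

    sample_cpu() returns current CPU use in percent.
    sample_disk_ops() returns a cumulative count of disk reads + writes.
    Returns True once the machine looks idle, False if the timeout expires.
    """
    deadline = clock() + timeout
    last_ops = sample_disk_ops()
    while clock() < deadline:
        sleep(interval)
        ops = sample_disk_ops()
        new_ops = ops - last_ops  # disk operations since the previous poll
        last_ops = ops
        if sample_cpu() < max_cpu and new_ops <= max_disk_ops:
            return True
    return False


# Usage with fake samplers standing in for real OS counters (hypothetical values).
cpu_readings = iter([27.7, 12.0, 4.1])
disk_totals = iter([1000, 1321, 1340, 1343])
idle = wait_for_idle(lambda: next(cpu_readings), lambda: next(disk_totals),
                     sleep=lambda seconds: None)
print(idle)  # True: the third poll sees 4.1% CPU and 3 new disk operations
```

Returning False on timeout rather than raising lets the caller decide whether to start the test anyway or to fail the job.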

I'm now bumping into test timeouts, but it could be an error in how I've set this up.

(In reply to Andrew Creskey from comment #23)

> Mark, I forgot to ask you -- can you tell me if the Windows Indexing Service is disabled on these configurations?

It is disabled.

Flags: needinfo?(mcornmesser)

I've spun off Bug 1589356 based on Florian's script in comment 2 - waiting for the OS to be idle before Raptor starts a test (warm or cold load).
Early results are promising, at least on the other desktop hardware.

See Also: → 1589356

Mark, I think this is the last question: can you tell me whether Windows Superfetch / Prefetch is disabled, as described in comment 4?
This could, at least theoretically, introduce some irregularities into the page load tests.

Flags: needinfo?(mcornmesser)

(In reply to Andrew Creskey from comment #27)

> Mark, I think this is the last question: can you tell me whether Windows Superfetch / Prefetch is disabled, as described in comment 4?
> This could, at least theoretically, introduce some irregularities into the page load tests.

Currently those services are not explicitly disabled. I have asked Bitbar to check one of the laptops to see whether the services are running.

Flags: needinfo?(mcornmesser)

Bitbar verified that Superfetch was running.

Thank you Mark.
Let me ask around for input on this.
Disabling Superfetch could give us more reliable results (again reducing 'realism' in the same way that Windows Update, Windows Defender, and Windows Indexing Service are disabled).

The view of the performance team was that Superfetch should not impact pageload performance.

And it turns out that :denispal had done tests that confirm this.

So I'll close this bug -- it doesn't look like there's anything to be done here.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX

FWIW, Superfetch / Prefetch would definitely impact startup tests. Is that a consideration here, or is this bug strictly about page load?

Flags: needinfo?(acreskey)

I did create this bug to try and reduce the page load noise but if it could help other tests, that would be great.
Specifically the target was the Windows configuration used in CI (AWS and Bitbar devices).

I know next to nothing about the startup tests -- are they running on AWS and/or on the reference laptop in automation?

Flags: needinfo?(acreskey)

(In reply to Andrew Creskey from comment #33)

> I know next to nothing about the startup tests -- are they running on AWS and/or on the reference laptop in automation?

They will eventually be running on the reference laptop in automation.

(In reply to Mike Conley (:mconley) (:⚙️) (Wayyyy behind on needinfos) from comment #34)

> (In reply to Andrew Creskey from comment #33)
>
> > I know next to nothing about the startup tests -- are they running on AWS and/or on the reference laptop in automation?
>
> They will eventually be running on the reference laptop in automation.

Interesting.

Then let's flip this around: is there any reason to not disable SuperFetch/Sysmain in the automation Windows configurations?

If we're favouring reproducible results in general then I think this can't hurt.

If startup tests are coming to automation then I think this is absolutely necessary.

Mark, I'm leaning on you again for thoughts, next steps?

Status: RESOLVED → REOPENED
Flags: needinfo?(mcornmesser)
Resolution: WONTFIX → ---

The startup testing is going to use a very small pool, separate from the other reference laptops.

I can set up a laptop with a testing workerType in automation, and have Superfetch disabled on that laptop. We will then be able to push tests to it with changes similar to https://hg.mozilla.org/try/rev/c7c581111bdf320defefe476560897c7c810d62e . It is such a small pool of nodes that we would have to stick to one or two testing nodes.

Flags: needinfo?(mcornmesser)

Thanks again Mark.
Given that there is already work planned for setting up the separate startup testing pool, I don't see anything else to do here.

Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX