Open Bug 2039391 Opened 10 days ago Updated 1 day ago

Investigate if we should run windows VM (azure) in parallel with windows hardware (just sp3 for now)

Tracking

(Not tracked)

Status:

ASSIGNED

People

(Reporter: kshampur, Assigned: kshampur)

References

Details

(Whiteboard: [fxp][vision])

Attachments

(3 files)

WIP: Bug 2039391 - run sp3 on windows VM 10 days ago Kash Shampur [:kshampur] ⌚EST 48 bytes, text/x-phabricator-request
[mozilla-releng/fxci-config] Bug 2039391 - Add gecko-t/win11-64-25h2-gpu-perf-experiment pool locked to west-us-3 (#1001) 5 days ago BMO Github Automation 55 bytes, text/x-github-pull-request
[mozilla-releng/fxci-config] Bug 2039391 - Increase SP3 perf experiment pool capacity (#1009) 2 days ago BMO Github Automation 55 bytes, text/x-github-pull-request

Kash Shampur [:kshampur] ⌚EST

Assignee

Description

•

10 days ago

•

Edited

Windows Azure VMs have GPUs and might be a more realistic VM. :florian found the sp3 mochitests to be very stable. We should run this in parallel for a bit.

Jira Integration Bot

Updated

•

10 days ago

See Also: → https://mozilla-hub.atlassian.net/browse/FXP-4947

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 1

•

10 days ago

Talked to Florian; we can try the browsertime version on here: https://searchfox.org/firefox-main/rev/5f15185120f488edb4b4f2f38599fff543b9aa5a/taskcluster/config.yml#849

He was using it for the mochitest version.

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 2

•

10 days ago

to start with...

see if browsertime sp3 passes
just run on sp3 for a bit
re-evaluate in 2 weeks

Kash Shampur [:kshampur] ⌚EST

Assignee

Updated

•

10 days ago

Summary: Run windows VM (azure) in parallel with windows hardware → Run windows VM (azure) in parallel with windows hardware (just sp3 for now)

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 3

•

10 days ago

https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=74ca538f8fe9c091d586e4f0b9b9a7608374aeb6

I believe we need to turn off the power test flag.

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 4

•

10 days ago

passing https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=f2de500d95d2d18d5be6308b9ccf71bb3d3d37fc

(It wasn't the --power-test flag; it needed a change in profiling.js.)

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 5

•

10 days ago

Attached file WIP: Bug 2039391 - run sp3 on windows VM — Details

Phabricator Automation

Updated

•

10 days ago

Assignee: nobody → kshampur

Status: NEW → ASSIGNED

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 6

•

10 days ago

I'll make another push with CaR and Chrome.

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 7

•

10 days ago

Ah, it looks like the issue from bug 2028315 is hitting Chrome here on the VMs.

https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=cd1f4e9792ad2dc88062b536192a81d805108b36

Let's first see how CaR does before I ask mcornmesser about this.

Kash Shampur [:kshampur] ⌚EST

Assignee

Updated

•

9 days ago

Summary: Run windows VM (azure) in parallel with windows hardware (just sp3 for now) → Investigate if we should run windows VM (azure) in parallel with windows hardware (just sp3 for now)

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 8

•

9 days ago

•

Edited

When looking at summary scores, it looks more stable compared to the win non-ref machines (which have a known noise issue at the moment):
https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=0&highlightCommonAlerts=0&highlightInitialDataPoints=0&highlightedRevisions=cd1f4e9792ad2dc88062b536192a81d805108b36&replicates=0&series=try,5835703,1,13&series=autoland,5257392,1,13&series=mozilla-central,5273367,1,13&timerange=604800&zoom=1778179613524,1778769928551,13.866999999999997,27.281000000000002

But it doesn't seem to beat the ref-hw in terms of stability.

Also, the replicates view suggests the VM has even more spread than the known noisy hardware machines:
https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=0&highlightCommonAlerts=0&highlightInitialDataPoints=0&highlightedRevisions=cd1f4e9792ad2dc88062b536192a81d805108b36&replicates=1&series=try,5835703,1,13&series=autoland,5257392,1,13&timerange=172800

Mark Cornmesser [:markco]

Comment 9

•

8 days ago

•

Edited

There are a few challenges with using Azure VMs as a performance baseline.

First, for standard shared Azure VMs, we do not control the hypervisor, host placement, or the underlying physical hardware. The host environment can change due to Azure maintenance, live migration, hardware decommissioning, or other platform-level events, and there may also be differences across regions or zones.

Second, performance can be affected by host-level scheduling and contention from other workloads running on the same physical hardware. That makes it harder to determine whether a performance shift came from Firefox, the test environment, or the cloud platform itself.

Also worth noting: we are already exploring options to potentially replace the non-reference hardware.

For testing, we may also want to create a separate pool isolated to a specific Azure region, and investigate which VM size or type best matches the workload. That would help reduce variability and give us a cleaner signal while we evaluate whether this is a viable path.

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 10

•

6 days ago

•

Edited

(In reply to Mark Cornmesser [:markco] from comment #9)

Thanks for the detailed breakdown. This came up at the work week as a possible alternative (partly due to the recent PSU failures on hardware) ... but between the replicate spread we saw in the try runs and the platform-level unpredictability you've described (live migration, host contention, no control over placement), I don't think VMs are a viable baseline right now.

Good to know the non-ref hardware replacement is already being explored. Unless someone has anything to chime in within the next 1-2w I think it makes sense to close this out as a WONTFIX

If the dedicated/region-isolated pool option ever gets traction, that would be interesting though.

Florian Quèze [:florian]

Comment 11

•

5 days ago

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #10)

If the dedicated/region-isolated pool option ever gets traction, that would be interesting though.

It seems like this bug is what would give it traction.

Unless someone has anything to chime in within the next 1-2w I think it makes sense to close this out as a WONTFIX

The point of the bug is to evaluate what would make running these tests in the cloud possible and to assess the noise level. Comment 9 suggested a few possible sources of noise, but also how we could address some of them, so I don't see how that comment would be a reason to wontfix.

BMO Github Automation

Comment 12

•

5 days ago

Attached file [mozilla-releng/fxci-config] Bug 2039391 - Add gecko-t/win11-64-25h2-gpu-perf-experiment pool locked to west-us-3 (#1001) — Details

Mark Cornmesser [:markco]

Comment 13

•

5 days ago

Additional worker pools are relatively easy to create. I created gecko-t/win11-64-25h2-gpu-perf-experiment based on gecko-t/win11-64-25h2-gpu, with the main difference being that the experiment pool is limited to a single region to reduce variability.

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 14

•

4 days ago

(In reply to Mark Cornmesser [:markco] from comment #13)

Additional worker pools are relatively easy to create. I created gecko-t/win11-64-25h2-gpu-perf-experiment based on gecko-t/win11-64-25h2-gpu, with the main difference being that the experiment pool is limited to a single region to reduce variability.

thanks Mark!

Try run pushed to that pool

https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=d3419e628ea0362b17cb75ab96ca716fc8db7e33&selectedTaskRun=VzU8s81_TxmyIUj3viG3Kw.0

I don't really see an improvement. The spread looks worse on this one than on the previous VM push, which was not region isolated...

Compared against hardware (ref and non-ref) data:

https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=0&highlightCommonAlerts=0&highlightInitialDataPoints=0&replicates=0&selected=5835703,2493744836&series=try,5835703,1,13&series=mozilla-central,5276630,1,13&series=mozilla-central,5273367,1,13&timerange=1209600

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 15

•

4 days ago

•

Edited

But as Florian/Mark pointed out, questions remain about the other sources of noise on the VM, and they are worth exploring.

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 16

•

4 days ago

(take this with a grain of salt)

Compared a bunch of Windows VM (windows11-64-25h2-shippable) runs from a try revision against a few hardware runs (ref machines, since they are stable), using the profiler-cli tool on the raw profile artifacts.

Score ranges:

VM: 17.989 – 19.835 (geomean CV ~2.5%, with 3 outlier runs near 18.1–18.3)
Hardware: 20.556 – 20.652 (CV <1%)
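For readers unfamiliar with the CV figures above: the coefficient of variation is the standard deviation divided by the mean. A minimal sketch of that calculation follows. The scores in it are made up to fall inside the reported ranges (the raw per-run scores are not listed in this comment), and it may not match how the figures above were derived from the geomean scores.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """Return the sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(scores) / mean(scores)


# Hypothetical scores, NOT the real run data from the try pushes.
vm_scores = [18.1, 18.3, 19.2, 19.5, 19.8]
hw_scores = [20.56, 20.60, 20.62, 20.65]

vm_cv = coefficient_of_variation(vm_scores)
hw_cv = coefficient_of_variation(hw_scores)
print(f"VM CV: {vm_cv:.2f}%  hardware CV: {hw_cv:.2f}%")
```

A wider spread of scores around a similar mean gives a larger CV, which is the pattern described for the VM runs.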

Apparently the Azure VMs can have their execution interrupted by the hypervisor at any point. This shows up in the GC data: the Firefox garbage collector runs short "slices" of work with a ~50ms time budget. On hardware those slices finish in 26–30ms. On VMs they regularly hit the full 54ms ceiling, because some of that time was spent waiting for the CPU rather than doing actual work.

Another issue: if the VM gets interrupted during the warmup phase, more garbage accumulates in memory before the first GC cycle runs. The worst VM run's first GC cycle took 422ms, whereas the best VM run's first GC cycle took 50ms. That seems to set the tone for the entire run.

What could help is forcing a GC cycle before the benchmark timing starts. It should be a very small change in Raptor to test out this profiler-cli hypothesis.
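To illustrate the idea only: the sketch below uses Python's own `gc` module as an analogy and is not Raptor or Firefox code. `timed_run` and `sample_workload` are hypothetical names. The point is simply that the collection is paid for before the clock starts, so leftover warmup garbage does not land inside the measured region.

```python
import gc
import time


def timed_run(workload, force_gc_first=True):
    """Time workload(), optionally collecting garbage before the clock starts."""
    if force_gc_first:
        # Pay for leftover warmup garbage outside the timed region.
        gc.collect()
    start = time.perf_counter()
    result = workload()
    elapsed = time.perf_counter() - start
    return result, elapsed


def sample_workload():
    # Hypothetical stand-in for one benchmark iteration.
    return sum(i * i for i in range(10_000))


result, elapsed = timed_run(sample_workload)
print(f"result={result} elapsed={elapsed:.6f}s")
```

Whether the equivalent change in the harness actually narrows the VM spread is exactly what the proposed experiment would need to show.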

:jmoss

Comment 17

•

4 days ago

We have updated gecko-t/win11-64-25h2-gpu-perf-experiment to test Azure Dedicated Hosts. :kshampur could you try to do some more pushes against that worker pool now?

Worker Manager places the Windows 25H2 GPU VMs directly on a dedicated host in westus3, so the goal is to remove one possible source of Azure placement noise and see whether SP3 becomes closer to hardware.

Initial try comparison:

Dedicated-host 25H2 push:
https://treeherder.mozilla.org/jobs?repo=try&revision=d8d7bbd73860706cdb4410d6e38f8f57cfa0df19

Current/default Azure 25H2 GPU pool comparison push:
https://treeherder.mozilla.org/jobs?repo=try&revision=4e70f1e0ef13a4bdb405502b33182ee945da51f2

Flags: needinfo?(kshampur)

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 18

•

4 days ago

oh cool. Here is a new push
https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=e21d4c103eaa101ca0a92711b113c234be084a37

Flags: needinfo?(kshampur)

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 19

•

3 days ago

•

Edited

Ran a validation experiment to test whether the VMs can actually detect real improvement despite the variance we've been seeing. (it should, but would be good to see the extent...)

I reverted Bug 2030147 (a word cache optimization) as the base (t-test view for alert parity)

https://perf.compare/compare-results?baseRev=ae5ad2bf98a2821fdd83b67f979da2a2381e3ce3&baseRepo=try&newRev=87e0f02f184337dbd24cbfcfd088c37304ec969e&newRepo=try&framework=13&title=VM+before%2Fafter+with++Bug+2030147+reverted&test_version=student-t

and the VM runs were able to pick up that improvement

Looking at the subtest view, Editor-TipTap/total had the highest delta (to be more comparable with the alert summary, I looked at the t-test view for subtests as well, e.g. LINK).

This is expected: if you look at https://bugzilla.mozilla.org/show_bug.cgi?id=2030147#c11 and examine the alert summary, you'll see that TipTap was what was affected by that patch.

The MWU view shows a lot of improvements which may or may not be real, but for comparison with the original bug we are just looking at the t-test view (since the t-test is currently used for generating alerts).
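As a rough illustration of what a t-test comparison measures, here is the textbook pooled-variance two-sample t statistic. This is a sketch only: it is not claimed to be how perf.compare or the alerting code computes its values, and the replicate scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def student_t(base, new):
    """Pooled-variance two-sample t statistic (new minus base)."""
    n1, n2 = len(base), len(new)
    pooled = ((n1 - 1) * variance(base) + (n2 - 1) * variance(new)) / (n1 + n2 - 2)
    return (mean(new) - mean(base)) / sqrt(pooled * (1 / n1 + 1 / n2))


# Hypothetical replicate scores, not taken from the try pushes.
base = [18.0, 18.4, 18.2, 18.6, 18.3]
shifted = [19.0, 19.4, 19.2, 19.6, 19.3]  # same spread, mean moved by +1.0
same = [18.1, 18.5, 18.2, 18.5, 18.2]     # no real change relative to base

t_shift = student_t(base, shifted)
t_same = student_t(base, same)
print(f"t (shifted vs base) = {t_shift:.2f}, t (same vs base) = {t_same:.2f}")
```

The statistic grows when the difference in means is large relative to the run-to-run spread, which is why a noisier platform needs either a bigger change or more replicates to reach the same confidence.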

Mark Cornmesser [:markco]

Comment 20

•

3 days ago

If we want to continue testing with dedicated hosts for VMs in Azure, the next step would be to run two dedicated hosts with six VMs per host, then load them with 100–200 tasks to evaluate performance and stability.

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 21

•

2 days ago

https://perf.compare/compare-results?baseRev=87e0f02f184337dbd24cbfcfd088c37304ec969e&baseRepo=try&newRev=9967ca4f7de2afb6ce49d26de89277f473ab03a4&newRepo=try&framework=13&title=A%2FA+test+azure+VM&test_version=student-t

Denis suggested an A/A test to get an idea of the noise as well. The low confidence is good to see.

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 22

•

2 days ago

(In reply to Mark Cornmesser [:markco] from comment #20)

If we want to continue testing with dedicated hosts for VMs in Azure, the next step would be to run two dedicated hosts with six VMs per host, then load them with 100–200 tasks to evaluate performance and stability.

Makes sense. Before we go ahead, what's the effort/timeline on your side to get those dedicated hosts set up? And is the main goal to see if the VMs on the same host interfere with each other under load, or is there something else you'd want to evaluate with 100-200 tasks?

Separately, Denis suggested a follow-up to comment 19 to test whether the VMs can catch smaller regressions/improvements too. The example I used was a pretty big win by contrast, so the two-host setup could be useful for testing that sensitivity as well.

Mark Cornmesser [:markco]

Comment 23

•

2 days ago

Makes sense. Before we go ahead, what's the effort/timeline on your side to get those dedicated hosts set up? And is the main goal to see if the VMs on the same host interfere with each other under load, or is there something else you'd want to evaluate with 100-200 tasks?

Adding additional dedicated hosts is straightforward and should have a relatively quick turnaround. Jmoss is on top of this and has submitted a request to Azure to increase our quotas. Once that is complete, the remaining work should primarily involve updating the relevant Terraform code.

There are three things I want to evaluate under heavy load across two dedicated hosts:

Interference between VMs under continuous load
Consistency over time
Consistency between physical hosts
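One way the results from such a load test could be summarized is sketched below, assuming each task's score can be tagged with the dedicated host and VM it ran on. The host names, VM names, and scores are all made up, and `summarize_by_host` is a hypothetical helper, not existing tooling.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (host, vm, score) records; names and values are made up.
records = [
    ("host-a", "vm-1", 19.1), ("host-a", "vm-2", 19.3), ("host-a", "vm-1", 19.0),
    ("host-b", "vm-1", 18.6), ("host-b", "vm-2", 19.4), ("host-b", "vm-2", 18.2),
]


def summarize_by_host(rows):
    """Map each host to (mean score, CV in percent) over all of its runs."""
    by_host = defaultdict(list)
    for host, _vm, score in rows:
        by_host[host].append(score)
    return {
        host: (mean(scores), 100.0 * stdev(scores) / mean(scores))
        for host, scores in by_host.items()
    }


summary = summarize_by_host(records)
for host, (avg, cv) in sorted(summary.items()):
    print(f"{host}: mean={avg:.2f} cv={cv:.2f}%")
```

Comparing the per-host means speaks to consistency between physical hosts. Grouping the same records by VM or by time window instead would cover the other two questions.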

BMO Github Automation

Comment 24

•

2 days ago

Attached file [mozilla-releng/fxci-config] Bug 2039391 - Increase SP3 perf experiment pool capacity (#1009) — Details

Kash Shampur [:kshampur] ⌚EST

Assignee

Comment 25

•

2 days ago

•

Edited

Mark Cornmesser [:markco]

Comment 26

•

1 day ago

It might be worth limiting the pool size to 1 so that there would be excess resources available on the dedicated host and no contention between competing VMs. We could then see whether that affects the results.

Let me know if we want to try that. It would be a quick one-line patch on our side.
