Open Bug 2039391 Opened 10 days ago Updated 1 day ago

Investigate if we should run windows VM (azure) in parallel with windows hardware (just sp3 for now)

Categories

(Testing :: Performance, enhancement, P3)

enhancement

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: kshampur, Assigned: kshampur)

References

Details

(Whiteboard: [fxp][vision])

Attachments

(3 files)

Windows Azure VMs have GPUs and might be a more realistic VM option. The sp3 mochitests that :florian found were very stable. We should run this in parallel for a bit.

talked to Florian, we can try the browsertime version on here https://searchfox.org/firefox-main/rev/5f15185120f488edb4b4f2f38599fff543b9aa5a/taskcluster/config.yml#849

he was using it for the mochitest version

to start with...

  1. see if browsertime sp3 passes
  2. just run on sp3 for a bit
  3. re-evaluate in 2 weeks
Summary: Run windows VM (azure) in parallel with windows hardware → Run windows VM (azure) in parallel with windows hardware (just sp3 for now)

passing https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=f2de500d95d2d18d5be6308b9ccf71bb3d3d37fc

(it wasn't the --power-test flag, but it needed a change in profiling.js)

Assignee: nobody → kshampur
Status: NEW → ASSIGNED

i'll make another push with CaR and chrome

ah looks like the issue from bug 2028315 is hitting chrome here on the VMs

https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=cd1f4e9792ad2dc88062b536192a81d805108b36

let's first see how CaR does before i ask mcornmesser about this

Summary: Run windows VM (azure) in parallel with windows hardware (just sp3 for now) → Investigate if we should run windows VM (azure) in parallel with windows hardware (just sp3 for now)

There are a few challenges with using Azure VMs as a performance baseline.

First, for standard shared Azure VMs, we do not control the hypervisor, host placement, or the underlying physical hardware. The host environment can change due to Azure maintenance, live migration, hardware decommissioning, or other platform-level events, and there may also be differences across regions or zones.

Second, performance can be affected by host-level scheduling and contention from other workloads running on the same physical hardware. That makes it harder to determine whether a performance shift came from Firefox, the test environment, or the cloud platform itself.

Also worth noting: we are already exploring options to potentially replace the non-reference hardware.

For testing, we may also want to create a separate pool isolated to a specific Azure region, and investigate which VM size or type best matches the workload. That would help reduce variability and give us a cleaner signal while we evaluate whether this is a viable path.

(In reply to Mark Cornmesser [:markco] from comment #9)

There are a few challenges with using Azure VMs as a performance baseline.

First, for standard shared Azure VMs, we do not control the hypervisor, host placement, or the underlying physical hardware. The host environment can change due to Azure maintenance, live migration, hardware decommissioning, or other platform-level events, and there may also be differences across regions or zones.

Second, performance can be affected by host-level scheduling and contention from other workloads running on the same physical hardware. That makes it harder to determine whether a performance shift came from Firefox, the test environment, or the cloud platform itself.

Also worth noting: we are already exploring options to potentially replace the non-reference hardware.

For testing, we may also want to create a separate pool isolated to a specific Azure region, and investigate which VM size or type best matches the workload. That would help reduce variability and give us a cleaner signal while we evaluate whether this is a viable path.

Thanks for the detailed breakdown. This came up at the work week as a possible alternative (partly due to the recent PSU failures on hardware) ... but between the replicate spread we saw in the try runs and the platform-level unpredictability you've described (live migration, host contention, no control over placement), I don't think VMs are a viable baseline right now.

Good to know the non-ref hardware replacement is already being explored. Unless someone has anything to chime in within the next 1-2w I think it makes sense to close this out as a WONTFIX

if dedicated/region-isolated pool option ever gets traction, that would be interesting though

(In reply to Kash Shampur [:kshampur] ⌚EST from comment #10)

if dedicated/region-isolated pool option ever gets traction, that would be interesting though

It seems like this bug is what would give it traction.

Unless someone has anything to chime in within the next 1-2w I think it makes sense to close this out as a WONTFIX

The point of the bug is to evaluate what would make running these tests in the cloud possible and to assess the noise level. comment 9 suggested a few possible sources of noise, but also how we could address some of them, so I don't see how that comment would be a reason to wontfix.

Additional worker pools are relatively easy to create. I created gecko-t/win11-64-25h2-gpu-perf-experiment based on gecko-t/win11-64-25h2-gpu, with the main difference being that the experiment pool is limited to a single region to reduce variability.

(In reply to Mark Cornmesser [:markco] from comment #13)

Additional worker pools are relatively easy to create. I created gecko-t/win11-64-25h2-gpu-perf-experiment based on gecko-t/win11-64-25h2-gpu, with the main difference being that the experiment pool is limited to a single region to reduce variability.

thanks Mark!

Try run pushed to that pool

https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=d3419e628ea0362b17cb75ab96ca716fc8db7e33&selectedTaskRun=VzU8s81_TxmyIUj3viG3Kw.0

Don't really see an improvement. The spread looks worse on this one than on the previous VM push, which was not region-isolated...

compared against hardware (ref and non-ref) data

https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=0&highlightCommonAlerts=0&highlightInitialDataPoints=0&replicates=0&selected=5835703,2493744836&series=try,5835703,1,13&series=mozilla-central,5276630,1,13&series=mozilla-central,5273367,1,13&timerange=1209600

But as Florian/Mark pointed out, the question remains about the other sources of noise on the VM, and that is worth exploring.

(take this with a grain of salt)

Compared a bunch of Windows VM (windows11-64-25h2-shippable) runs from a try revision against a few hardware runs (ref machines, since they are stable), using the profiler-cli tool on the raw profile artifacts.

Score ranges:

  • VM: 17.989 – 19.835 (geomean CV ~2.5%, with 3 outlier runs near 18.1–18.3)
  • Hardware: 20.556 – 20.652 (CV <1%)
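For anyone wanting to reproduce the CV numbers, here is a minimal sketch of how the run-to-run coefficient of variation could be computed from the per-run overall scores. The score lists are placeholder values made up for illustration, not the actual try data, and the helper name is hypothetical.

```python
import statistics


def cv_percent(scores):
    """Coefficient of variation: sample stdev / mean, as a percentage."""
    return statistics.stdev(scores) / statistics.mean(scores) * 100.0


# Placeholder per-run overall scores (NOT the real data from the try pushes).
vm_scores = [18.0, 18.2, 19.0, 19.4, 19.8]
hw_scores = [20.56, 20.60, 20.62, 20.65]

for label, scores in (("VM", vm_scores), ("Hardware", hw_scores)):
    print(
        f"{label}: min={min(scores):.3f} max={max(scores):.3f} "
        f"CV={cv_percent(scores):.2f}%"
    )
```

The sample standard deviation (n - 1 denominator) is used here since each platform only has a handful of runs; that is a choice of this sketch, not necessarily what the tooling does.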

Apparently the Azure VMs can have their execution interrupted by the hypervisor at any point. This shows up in the GC data: the Firefox garbage collector runs short "slices" of work with a ~50ms time budget. On hardware those slices finish in 26–30ms. On VMs they regularly hit the full 54ms ceiling, because some of that time was spent waiting for the CPU rather than doing actual work.

Another issue... If the VM gets interrupted during the warmup phase, more garbage accumulates in memory before the first GC cycle runs. The worst VM run's first GC cycle took 422ms, whereas the best VM run's first GC cycle took 50ms. That seems to set the tone for the entire run.

What could help is forcing a GC cycle before the benchmark timing starts. It should be a very small change in Raptor, and it would let us test this hypothesis from the profiler-cli data.

We have updated gecko-t/win11-64-25h2-gpu-perf-experiment to test Azure Dedicated Hosts. :kshampur could you try to do some more pushes against that worker pool now?

Worker Manager places the Windows 25H2 GPU VMs directly on a dedicated host in westus3. The goal is to remove one possible source of Azure placement noise and see whether SP3 gets closer to hardware.

Initial try comparison:

Dedicated-host 25H2 push:
https://treeherder.mozilla.org/jobs?repo=try&revision=d8d7bbd73860706cdb4410d6e38f8f57cfa0df19

Current/default Azure 25H2 GPU pool comparison push:
https://treeherder.mozilla.org/jobs?repo=try&revision=4e70f1e0ef13a4bdb405502b33182ee945da51f2

Flags: needinfo?(kshampur)

Ran a validation experiment to test whether the VMs can actually detect a real improvement despite the variance we've been seeing. (They should, but it would be good to see the extent...)

I reverted Bug 2030147 (a word cache optimization) as the base (t-test view for alert parity)

https://perf.compare/compare-results?baseRev=ae5ad2bf98a2821fdd83b67f979da2a2381e3ce3&baseRepo=try&newRev=87e0f02f184337dbd24cbfcfd088c37304ec969e&newRepo=try&framework=13&title=VM+before%2Fafter+with++Bug+2030147+reverted&test_version=student-t

and the VM runs were able to pick up that improvement

Looking at the subtest view, Editor-TipTap/total had the highest delta (to be more comparable with the alert summary, I'm looking at the t-test view for subtests as well, e.g. LINK).

This is expected: if you look at https://bugzilla.mozilla.org/show_bug.cgi?id=2030147#c11 and examine the alert summary, you'll see TipTap was what was affected by that patch.

the MWU view shows a lot of improvements which may or may not be real, but for comparison with the original bug we are just looking at the t-test view (since t-test is used currently for generating alerts)
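As a rough illustration of what a t-test comparison between base and new replicates looks like, here is a minimal sketch of Welch's t statistic. This is an assumption about the general technique only; the exact test perf.compare runs for its student-t view may differ, and the replicate values below are placeholders, not data from the pushes above.

```python
import math
import statistics


def welch_t(base, new):
    """Return (t, df) for Welch's unequal-variance t-test on two samples.

    Raises ZeroDivisionError if both samples have zero variance.
    """
    n1, n2 = len(base), len(new)
    # Per-sample variance of the mean (sample variance / sample size).
    s1 = statistics.variance(base) / n1
    s2 = statistics.variance(new) / n2
    t = (statistics.mean(new) - statistics.mean(base)) / math.sqrt(s1 + s2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (s1 + s2) ** 2 / (s1 ** 2 / (n1 - 1) + s2 ** 2 / (n2 - 1))
    return t, df


# Placeholder replicate scores (hypothetical, higher is better).
base_scores = [18.1, 18.6, 19.0, 18.4, 18.8]
new_scores = [19.2, 19.6, 19.1, 19.8, 19.4]

t, df = welch_t(base_scores, new_scores)
print(f"t={t:.3f}, df={df:.1f}")
```

A large positive t with these inputs would mean the new revision scores higher than the base by more than the replicate noise explains. Turning t and df into a p-value needs a t-distribution CDF (for example from SciPy), which is left out to keep the sketch stdlib-only.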

If we want to continue testing with dedicated hosts for VMs in Azure, the next step would be to run two dedicated hosts with six VMs per host, then load them with 100–200 tasks to evaluate performance and stability.

(In reply to Mark Cornmesser [:markco] from comment #20)

If we want to continue testing with dedicated hosts for VMs in Azure, the next step would be to run two dedicated hosts with six VMs per host, then load them with 100–200 tasks to evaluate performance and stability.

Makes sense. Before we go ahead, what's the effort/timeline on your side to get those dedicated hosts set up? And is the main goal to see if the VMs on the same host interfere with each other under load, or is there something else you'd want to evaluate with 100-200 tasks?

Separately, Denis suggested a follow-up to comment 19 to test whether the VMs can catch smaller regressions/improvements too. The example I used was a pretty big win by comparison, so the two-host setup could be useful for testing that sensitivity as well.

Makes sense. Before we go ahead, what's the effort/timeline on your side to get those dedicated hosts set up? And is the main goal to see if the VMs on the same host interfere with each other under load, or is there something else you'd want to evaluate with 100-200 tasks?

Adding additional dedicated hosts is straightforward and should have a relatively quick turnaround. Jmoss is on top of this and has submitted a request to Azure to increase our quotas. Once that is complete, the remaining work should primarily involve updating the relevant Terraform code.

There are three things I want to evaluate under heavy load across two dedicated hosts:

  • Interference between VMs under continuous load
  • Consistency over time
  • Consistency between physical hosts
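One way the results from the two-host run could be sliced for these three questions is sketched below. It assumes each task result can be tagged with the host and VM it ran on; the record layout, host/VM names, scores, and helper names are all hypothetical placeholders.

```python
import statistics
from collections import defaultdict


def cv_percent(scores):
    """Sample stdev / mean, as a percentage."""
    return statistics.stdev(scores) / statistics.mean(scores) * 100.0


def summarize_by_host(runs):
    """Group (host, vm, score) records by host and summarize each group."""
    by_host = defaultdict(list)
    for host, _vm, score in runs:
        by_host[host].append(score)
    return {
        host: {
            "n": len(scores),
            "mean": statistics.mean(scores),
            "cv_percent": cv_percent(scores),
        }
        for host, scores in by_host.items()
    }


def host_mean_gap_percent(summary):
    """Spread between the highest and lowest host mean, relative to the average."""
    means = [entry["mean"] for entry in summary.values()]
    return (max(means) - min(means)) / statistics.mean(means) * 100.0


# Placeholder records: (host, vm, score). Not real results.
runs = [
    ("host-a", "vm-1", 19.1), ("host-a", "vm-2", 18.7), ("host-a", "vm-3", 19.4),
    ("host-b", "vm-1", 18.2), ("host-b", "vm-2", 18.9), ("host-b", "vm-3", 18.5),
]

summary = summarize_by_host(runs)
for host, entry in sorted(summary.items()):
    print(host, entry)
print(f"gap between host means: {host_mean_gap_percent(summary):.2f}%")
```

The per-host CV speaks to interference and noise within a host, and the gap between host means speaks to consistency between physical hosts. Consistency over time would need the same grouping keyed on a timestamp bucket instead of the host.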


It might be worth limiting the pool size to 1 so that there would be excess resources available on the dedicated host and no contention between competing VMs. We could then see whether that affects the results.

Let me know if we want to try that. It would be a quick one-line patch on our side.
