Closed Bug 1935151 Opened 2 months ago Closed 2 months ago

[LINUX] linux1804-64-shippable-qr on try runs unexpectedly fast every now and then

Categories

(Core :: Performance, defect)

Tracking

()

RESOLVED FIXED

People

(Reporter: jstutte, Unassigned)

Details

Attachments

(1 file)

Attached image Outlier in SP3 run

### Basic information

Steps to Reproduce:

Run the Speedometer 3 tests many times on Linux shippable with the same revision.

Expected Results:

Deviations in results should be reasonably small.

Actual Results:

Every 15-30 runs we see a (positive) spike in execution performance, roughly 20-25% better than the rest.


An example perf run where this happens.

My best guess is that those runs hit a cold physical CPU that is otherwise idle and has more room in its thermal budget for overclocking.

If this is inevitable (which might be the case), could we exclude extreme outliers, both faster and slower, from our calculations, provided we have enough runs?
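One way the outlier exclusion suggested above could look is a filter based on the median absolute deviation (MAD), which treats fast and slow outliers alike. This is only a sketch, not what Perfherder does: the function name `filter_outliers`, the 3.5 cutoff, and the sample scores are assumptions made up for illustration.

```python
from statistics import median


def filter_outliers(scores, threshold=3.5):
    """Split scores into (kept, dropped) using a MAD-based modified z-score.

    Hypothetical helper: the name and the 3.5 cutoff are illustrative
    assumptions, not part of any Mozilla tooling.
    """
    scores = list(scores)
    med = median(scores)
    # Median absolute deviation: robust to a handful of extreme values.
    mad = median(abs(s - med) for s in scores)
    if mad == 0:
        # All values (or most of them) are identical; nothing to judge against.
        return scores, []
    kept, dropped = [], []
    for s in scores:
        # 0.6745 scales MAD to be comparable with a standard deviation.
        z = 0.6745 * (s - med) / mad
        (dropped if abs(z) > threshold else kept).append(s)
    return kept, dropped


# Invented scores: six ordinary runs and one unusually fast run.
example_scores = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 12.5]
kept, dropped = filter_outliers(example_scores)
```

A median-based cutoff is used here because a single extreme run shifts the mean and standard deviation far more than it shifts the median, which matters when the spike only shows up every 15-30 runs.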

It would be nice to investigate that improvement outlier, but I'm not sure we have the infrastructure setup to properly profile it yet. I plan on continuing the work in bug 1893493 which should help collect this sort of data in the future.

Sparky, I recall some discussions before about trying to filter out these outliers. Do you remember where we landed on that?

Flags: needinfo?(gmierz2)

Also, are we clamping the CPU frequency on the moonshot devices so every test uses the same frequency?

There are no CPU optimizations being done on the CI machines. Regarding outliers, there isn't much we can do on the test side, so we'd like to handle them better on the analysis side. That might come from better detection techniques or something else; we're currently looking into alternative detection techniques.

This seems to be a machine-specific issue though. Here's a graph showing that the noisy sp3 data points are the ones scoring above 11: https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightCommonAlerts=0&replicates=0&series=try,4569401,1,13&timerange=1209600&zoom=1733205310825,1733301950374,8.438664737088072,12.704742188068465

I went to redash and queried to find which machine is causing them: https://sql.telemetry.mozilla.org/queries/104251/source

All of those noisy score values are coming from a single machine: t-linux64-ms-055
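The per-machine breakdown described above can be sketched as a simple grouping step. The actual Redash query is not reproduced here, so everything in this example is an assumption: the `(machine, score)` record shape, the function names, the 1.15 ratio, and the placeholder host names and scores.

```python
from collections import defaultdict
from statistics import median


def median_score_by_machine(runs):
    """Group (machine, score) pairs and return each machine's median score."""
    by_machine = defaultdict(list)
    for machine, score in runs:
        by_machine[machine].append(score)
    return {machine: median(scores) for machine, scores in by_machine.items()}


def machines_above(runs, ratio=1.15):
    """Return machines whose median score exceeds ratio * overall median.

    Hypothetical helper: the 1.15 ratio is an illustrative assumption.
    """
    overall = median(score for _, score in runs)
    per_machine = median_score_by_machine(runs)
    return sorted(m for m, med in per_machine.items() if med > ratio * overall)


# Invented records with placeholder host names, not real CI data.
example_runs = [
    ("host-a", 10.0), ("host-a", 10.1),
    ("host-b", 9.9), ("host-b", 10.0),
    ("host-c", 12.4), ("host-c", 12.6),
]
suspects = machines_above(example_runs)
```

If the fast runs were spread evenly across hosts, no machine would stand out in this comparison, which is what would distinguish a machine-specific cause from something like thermal headroom.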

:aerickson, could we remove the Linux machine t-linux64-ms-055 and replace it with a new one?

Flags: needinfo?(gmierz2) → needinfo?(aerickson)

Not sure why this one blade would be so much faster. I've quarantined the host.

We haven't done blade swapping on the Moonshots yet, so we'll need to develop a plan and procedure. We don't have any spares online.

Flags: needinfo?(aerickson)

Thanks :aerickson! That sounds good to me.

:denispal/:jstutte, could either of you do some try runs for sp3 to see if that outlier is still there?

(In reply to Greg Mierzwinski [:sparky] from comment #6)

> :denispal/:jstutte, could either of you do some try runs for sp3 to see if that outlier is still there?

On its way

> All of those noisy score values are coming from a single machine

Surprising, but obviously a much better explanation for such a consistent difference. Thanks for finding!

After ~80 runs I still see no more extreme outliers. Thanks!

Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED