Closed Bug 1416501 Opened 7 years ago Closed 6 years ago

Try to reduce noise on taskcluster linux talos hardware/vms

Categories

(Testing :: Talos, enhancement)

Version 3
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rwood, Unassigned)

References

Details

(Whiteboard: [PI:January])

Attachments

(9 files)

In Bug 1412031 some new ubuntu machines (dedicated hardware running 16.04 vms, for running talos via taskcluster) were stood up and made available for testing.

Initial talos results look quite noisy, making it very difficult (if not impossible) for our talos/Perfherder algorithms to automatically detect regressions.

Using a loaner we need to try to figure out if it is possible to reduce the talos testing data noise on these machines/vms.
talos.json output from running perf-reftest-singletons-e10s on the tc linux loaner machine/vm
Attached file loaner-tp6-local.json
talos.json from running talos tp6-e10s suite on the loaner tc linux hw/vm
Whiteboard: [PI:November]
Screenshot of 'top' utility running on tc linux hw/loaner during the talos perf-reftest-singletons-e10s suite run
It is odd that Firefox is chewing up so much CPU, although that is probably expected. We might need to look at IO counters and memory; going by top, it doesn't look like memory is exhausted.
Using 'atop' for a little more detail during talos run
A bit of disk i/o info via 'atop' during talos run
(In reply to Robert Wood [:rwood] from comment #6)
> Created attachment 8927872 [details]
> tc-linux-loaner-during-talos-atop-2.png
> 
> A bit of disk i/o info via 'atop' during talos run

Not sure what 121% 'ACPU' for "Web Content" means vs Firefox
Demonstration of 'noisy' data.    

Recent run from existing linux x64 bb hardware:
     
    name    "bloom-basic.html"
    replicates     
    0       89.84
    1       90.63499999999999
    2       88.42999999999999
    3       92.66499999999999
    4       91.58000000000001
    5       92.14
    6       88.355
    7       96.63000000000001
    8       89.29499999999999
    9       90.51
    10      89.945
    11      90.06000000000002
    12      90.565
    13      86.74
    14      87.595
    unit    "ms"
    value   90.0025
     
Run on new linux tc hw / vm loaner (ssh'd into the machine and running talos from terminal mirroring production):
     
    name    "bloom-basic.html"
    replicates     
    0       54.789999999999964
    1       214.23000000000002
    2       143.695
    3       201.41500000000002
    4       194.84500000000003
    5       43.035
    6       212.43
    7       40.07000000000002
    8       165.12
    9       47.35499999999999
    10      146.09999999999997
    11      187.78000000000003
    12      40.670000000000016
    13      59.150000000000034
    14      49.670000000000016
    unit    "ms"
    value   54.410000000000025
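One way to put numbers on the difference between the two runs above is sketched below. It is not Talos or Perfherder code: the helper name `summarize` is made up for illustration, and the rule of dropping the first five replicates before taking the median is an assumption inferred from the numbers in this comment (it reproduces both reported `value` fields), not something stated in this bug. The coefficient of variation (standard deviation divided by mean) is used only as a rough noise measure.

```python
import statistics

# Replicates copied from the two runs above.
BB_HW = [
    89.84, 90.63499999999999, 88.42999999999999, 92.66499999999999,
    91.58000000000001, 92.14, 88.355, 96.63000000000001,
    89.29499999999999, 90.51, 89.945, 90.06000000000002,
    90.565, 86.74, 87.595,
]
TC_VM = [
    54.789999999999964, 214.23000000000002, 143.695, 201.41500000000002,
    194.84500000000003, 43.035, 212.43, 40.07000000000002,
    165.12, 47.35499999999999, 146.09999999999997, 187.78000000000003,
    40.670000000000016, 59.150000000000034, 49.670000000000016,
]


def summarize(replicates, skip=5):
    """Return (median, coefficient of variation) after dropping warm-up runs.

    skip=5 is an assumption inferred from the reported `value` fields,
    not a documented setting taken from this bug.
    """
    kept = replicates[skip:]
    median = statistics.median(kept)
    cv = statistics.stdev(kept) / statistics.mean(kept)
    return median, cv


for label, data in (("bb hardware", BB_HW), ("tc hw/vm", TC_VM)):
    median, cv = summarize(data)
    print(f"{label}: median={median:.4f} ms, cv={cv:.1%}")
```

Under these assumptions the medians match the reported values (about 90.0025 ms and 54.41 ms), while the coefficient of variation on the tc hw/vm run is more than ten times that of the bb hardware run.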
Unsure if this helps, but it's a netstat capture taken during the perf-reftest-singletons talos suite.
On the existing buildbot linux hardware, cpu usage during talos is similar, so I don't believe that's an issue on the new hw.
Sample run on existing buildbot linux hardware loaner (initiated via terminal & mozharness mirroring production):

          "name": "bloom-basic.html",
          "replicates": [
            106.31500000000003,
            101.89500000000001,
            101.21000000000001,
            93.385,
            97.60500000000002,
            92.215,
            105.52000000000001,
            93.55,
            95.795,
            94.21000000000001,
            95.57,
            101.30999999999999,
            93.255,
            100.57,
            96.195
          ],
          "unit": "ms",
          "value": 95.6825
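As a rough check of how tight this loaner run is, the fragment above can be loaded and its spread computed. This is only a sketch: wrapping the fragment in braces to make it valid JSON and the helper name `relative_spread` are assumptions for illustration, and the full talos.json presumably nests this object inside a larger structure that is not shown here.

```python
import json
import statistics

# The fragment above, wrapped in braces so it parses as a JSON object.
FRAGMENT = """
{
  "name": "bloom-basic.html",
  "replicates": [
    106.31500000000003, 101.89500000000001, 101.21000000000001,
    93.385, 97.60500000000002, 92.215, 105.52000000000001,
    93.55, 95.795, 94.21000000000001, 95.57,
    101.30999999999999, 93.255, 100.57, 96.195
  ],
  "unit": "ms",
  "value": 95.6825
}
"""


def relative_spread(replicates):
    """(max - min) divided by the median: a unitless indicator of noise."""
    return (max(replicates) - min(replicates)) / statistics.median(replicates)


subtest = json.loads(FRAGMENT)
spread = relative_spread(subtest["replicates"])
print(f'{subtest["name"]}: spread is {spread:.1%} of the median')
```

For this run the spread is a small fraction of the median; applying the same ratio to the tc hw/vm replicates from the earlier comment gives a value above 1.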
about:support for existing talos buildbot linux hw
about:support for new talos taskcluster linux hw (vm)
it is interesting that webrender is not enabled on the new machines- possibly this is by design.

:milan, can you look at the about:support from comment 13 and verify this looks right for what we should be testing with talos performance and graphics?
Flags: needinfo?(milan)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #14)
> it is interesting that webrender is not enabled on the new machines-
> possibly this is by design.
> 
> :milan, can you look at the about:support from comment 13 and verify this
> looks right for what we should be testing with talos performance and
> graphics?

I think it's just because I'm using an older mozharness release url/download package (Firefox 58) on the new hw (comment 13), but on the existing bb hardware (comment 14) I used a new release package today (Firefox 59).
that might be what is going on assuming we changed that in the last week.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #14)
> it is interesting that webrender is not enabled on the new machines-
> possibly this is by design.
> 
> :milan, can you look at the about:support from comment 13 and verify this
> looks right for what we should be testing with talos performance and
> graphics?

You'd need to force enable acceleration on Linux in order for WebRender to be available (and then enable WebRender if you actually want to use it).
Flags: needinfo?(milan)
I am not sure if that makes a difference in our reliability- just thought I would point it out.
Looking at https://bug1416501.bmoattachments.org/attachment.cgi?id=8928211 it looks like we're a bit behind on the Intel graphics driver. Currently reporting 12.0.6, but the 2017Q3 Intel graphics stack (https://01.org/linuxgraphics/downloads/2017q3-intel-graphics-stack-recipe) has 17.1.0. I suspect other things there are also older.

Is there a requirement for the older version, or should we update that?
I don't think there is a specific version required.

:milan- do you have a specific version or set of features on the intel graphics driver that we use in the new hardware for performance?
Flags: needinfo?(milan)
See Also: → 1419161
We don't currently have a minimum version, but that may show up if we start discovering issues specific to some driver versions.  I'd keep as much up to date as practical.  I don't know how old what we currently have is, but it's probably good to be less than a year out of date, and up to date is the best.
Flags: needinfo?(milan)
Whiteboard: [PI:November] → [PI:December]
Depends on: 1424465
Whiteboard: [PI:December] → [PI:January]
I think we can mark this as done?
Yes! It was resolved in Bug 1424465
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED