Open Bug 1589356 Opened 5 years ago Updated 2 years ago

Determine if waiting for the OS to be idle (low CPU and disk usage) can reduce raptor test noise

Categories

(Testing :: Raptor, enhancement, P3)

Version 3
Desktop
All
enhancement

Tracking

(Not tracked)

People

(Reporter: acreskey, Unassigned)


It was suggested in Bug 1561324 that waiting for the OS to be idle (defined as low CPU and disk usage determined via psutil) could significantly reduce test variance.

I integrated a prototype of this into Raptor's tp6 Desktop pageload tests.
Instead of waiting a fixed duration (e.g. 30 seconds after startup), the Raptor web extension polls the Python control server and only starts the test once the OS is relatively idle (in this case defined as < 25% CPU and < 20 reads and writes).
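As a rough illustration of the idle check described above, here is a minimal sketch. The function names, the one-second sampling window, and the reading of "< 20 reads and writes" as a separate limit on each counter are assumptions for illustration; the prototype's actual code is not shown in this bug.

```python
# Thresholds quoted in the comment above.
CPU_IDLE_MAX_PERCENT = 25.0
DISK_IDLE_MAX_OPS = 20  # assumed to apply to reads and writes separately


def is_os_idle(cpu_percent, disk_reads, disk_writes):
    """Return True when CPU use and both disk counters are under the thresholds."""
    return (
        cpu_percent < CPU_IDLE_MAX_PERCENT
        and disk_reads < DISK_IDLE_MAX_OPS
        and disk_writes < DISK_IDLE_MAX_OPS
    )


def sample_os_load(window_seconds=1.0):
    """Measure CPU percent and disk read/write counts over one window.

    Hypothetical helper. Requires the third-party psutil package, imported
    lazily so the threshold check above stays usable without it. Note that
    psutil.disk_io_counters() can return None on a machine with no disks;
    that case is not handled here.
    """
    import psutil

    before = psutil.disk_io_counters()
    cpu_percent = psutil.cpu_percent(interval=window_seconds)  # blocks for the window
    after = psutil.disk_io_counters()
    return (
        cpu_percent,
        after.read_count - before.read_count,
        after.write_count - before.write_count,
    )
```

On a machine with psutil installed, `is_os_idle(*sample_os_load())` would give one reading of the idle state under these assumptions.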

Early results show some promise:
a 50% reduction in the noise metric on Linux, 40% on Mac, and 9% on Windows10-64.

Left hand side: baseline
Right hand side: prototype wait_for_os_idle
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1e65376557aa7e122dbec1149dabfe223111ca20&newProject=try&newRevision=b8fb3c7b69edab08d06b62a09c8eed63c8513eec&framework=10

This approach would change the duration of the tests:
-The initial 30-second post-startup delay looks to be significantly reduced (e.g. to 8 seconds)
-The delay between warm page cycles (currently 1 second) is almost always increased (to anywhere from 4 to 10 seconds, from what I see)

If the noise level can indeed be reduced by those amounts (e.g. 40%), then we would likely need fewer page cycles and so could save test running time that way.
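To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the noise metric scales with the standard deviation of independent page-cycle measurements, so that the standard error of the mean is sigma / sqrt(n); neither assumption is verified in this bug, and the function name and the cycle count used in the note below are hypothetical.

```python
import math


def cycles_for_same_precision(current_cycles, noise_reduction):
    """Page cycles needed to keep the standard error of the mean unchanged.

    Assumes independent samples, so standard error = sigma / sqrt(n);
    scaling sigma by (1 - noise_reduction) scales n by the square of that.
    """
    if not 0.0 <= noise_reduction < 1.0:
        raise ValueError("noise_reduction must be in [0, 1)")
    scaled = current_cycles * (1.0 - noise_reduction) ** 2
    # Round first so float error (e.g. 9.000000000000002) does not add a cycle.
    return max(1, math.ceil(round(scaled, 9)))
```

Under these assumptions, a hypothetical run of 25 page cycles with a 40% noise reduction would need 9 cycles for the same precision. Since each warm cycle would take longer with the idle wait (4 to 10 seconds instead of 1), the net time saved would be smaller than the cycle count alone suggests.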

Status: NEW → ASSIGNED
See Also: → 1561324

There are quite a few factors at play here.
I've been splitting up the tests into a few categories (warm tests, cold tests, primary desktop platforms, and the reference laptop) in order to isolate the impactful changes.

This compares baseline vs. 'waiting for OS idle' on Linux64, macOS, and Windows10-64:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=df9a034bbebbd84633af817d996dd20ca0701c41&newProject=try&newRevision=538cd6efdf8b3de3878b54c7c12969d8b11fa498&framework=10

Overall the results are not as encouraging as the earlier comparison, although a ~10% reduction in noise on Linux and Windows10 would still be good.

But the major problem is that on macOS many of the tests never reach the "OS idle" state and tend to time out.
Here is a common failure state:

[task 2019-10-21T19:40:30.858Z] 19:40:30     INFO -  raptor-control-server Info: CPU use: 22.3%
[task 2019-10-21T19:40:30.859Z] 19:40:30     INFO -  raptor-control-server Info: disk reads: 5283
[task 2019-10-21T19:40:30.859Z] 19:40:30     INFO -  raptor-control-server Info: disk writes: 873
[task 2019-10-21T19:40:30.859Z] 19:40:30     INFO -  raptor-control-server Info: AJC - is_os_idle: false
[task 2019-10-21T19:40:30.859Z] 19:40:30     INFO -  raptor-control-server Info: received webext_status: begin pagecycle 1
[task 2019-10-21T19:40:30.859Z] 19:40:30     INFO -  PID 1488 | console.log: "[raptor-runnerjs] beginPageCycleOnOSIdle - response: {\"is_os_idle\": false}"
[task 2019-10-21T19:40:30.859Z] 19:40:30     INFO -  PID 1488 | console.log: "[raptor-runnerjs] AJC - aborted wait for OS idle (waited: 60 seconds). beginPageCycle"

This isn't always the case; in some instances OS idle is achieved:

[task 2019-10-21T20:01:47.877Z] 20:01:47     INFO -  raptor-control-server Info: CPU use: 2.2%
[task 2019-10-21T20:01:47.877Z] 20:01:47     INFO -  raptor-control-server Info: disk reads: 7
[task 2019-10-21T20:01:47.877Z] 20:01:47     INFO -  raptor-control-server Info: disk writes: 2
[task 2019-10-21T20:01:47.877Z] 20:01:47     INFO -  raptor-control-server Info: AJC - is_os_idle: true
[task 2019-10-21T20:01:47.889Z] 20:01:47     INFO -  raptor-control-server Info: received webext_status: begin pagecycle 22
[task 2019-10-21T20:01:47.898Z] 20:01:47     INFO -  PID 1488 | console.log: "[raptor-runnerjs] beginPageCycleOnOSIdle - response: {\"is_os_idle\": true}"
[task 2019-10-21T20:01:47.898Z] 20:01:47     INFO -  PID 1488 | console.log: "[raptor-runnerjs] AJC - OS is idle (waited: 4 seconds). beginPageCycle"

However, it looks to be quite rare.
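The two log excerpts above suggest a poll-with-timeout loop: wait until the idle check passes, or give up after 60 seconds and begin the page cycle anyway. A minimal sketch of that pattern follows. In the prototype this logic appears to live in the web extension's runner script, so the Python version, the function name, and the one-second poll interval are assumptions for illustration only.

```python
import time


def wait_for_os_idle(check_idle, timeout_seconds=60.0, poll_seconds=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll check_idle() until it returns True or the timeout elapses.

    Returns (idle, waited_seconds). The caller starts the page cycle in
    either case, mirroring the "aborted wait for OS idle" log line.
    The clock and sleep parameters exist so the loop can be tested
    without real delays.
    """
    start = clock()
    while True:
        if check_idle():
            return True, clock() - start
        waited = clock() - start
        if waited >= timeout_seconds:
            return False, waited
        sleep(poll_seconds)
```

Returning the waited time alongside the result makes it straightforward to log both outcomes, as the excerpts do with "waited: 4 seconds" and "waited: 60 seconds".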

Priority: -- → P1
Priority: P1 → P2
See Also: → 1598014, 1578694

Andrew, are you still working on this bug? If not please unassign, and reset the priority to P3. Thanks.

Flags: needinfo?(acreskey)
Priority: P2 → P1

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #3)

> Andrew, are you still working on this bug? If not please unassign, and reset the priority to P3. Thanks.

Done. I still think this is promising, at least for Desktop, but it's not something I can work on right now.

Assignee: acreskey → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(acreskey)
Priority: P1 → P3
Severity: normal → S3