Closed Bug 834003 Opened 11 years ago Closed 11 years ago

Compare telemetry histograms on a Talos run on a PGO versus a non PGO build

Categories

(Core :: General, defect)

x86
macOS
defect
Not set
normal

RESOLVED FIXED

People

(Reporter: ehsan.akhgari, Assigned: vladan)

Attachments

(3 files, 3 obsolete files)

See <https://wiki.mozilla.org/Buildbot/Talos> for how to run Talos locally.  For Tp5, you can get the pageset zip file by pinging :jmaher or somebody on #releng.

I think the interesting tests will be Tp5 and Ts.  The JS tests are explicitly non-interesting since we don't have any reason to consider stopping PGOing JS.
Depends on: 834029
Attachment #708922 - Attachment is obsolete: true
Attachment #708923 - Attachment is obsolete: true
Attachment #708925 - Attachment is obsolete: true
I ran the Talos suite locally on a slow Windows 7 laptop using the January 23rd (PGO) and January 24th (non-PGO) Nightly builds, then wrote a script to compare the gathered Telemetry data.

The first attachment shows Telemetry measurements that suffered 1% or greater regressions when PGO was disabled, e.g. GC, CC, image decode, page load, session restore, search service initialization, etc. You will also notice regressions in several MOZ_SQLITE_* histograms in this file. These SQLite operations are I/O bound and their histograms are not meaningful to this experiment, but they do help explain regressions in other operations which use SQLite, e.g. PLACES_FRECENCY_CALC_TIME_MS.

The second attachment shows measures that were unaffected by disabling PGO -- unsurprisingly, these are mostly I/O bound operations. The third attachment has a list of histograms that seemingly benefited from disabling PGO. Some of these improvements are clearly I/O timing noise (e.g. DNS_LOOKUP_TIME, FX_SESSION_RESTORE_WRITE_FILE_MS, MOZ_SQLITE*), but others are a bit harder to explain:
- All the cache lock wait times improved. This might be an I/O artifact.
- GC_SLICE_MS improved by 27% but almost twice as many GCs were done
- EVENTLOOP_UI_LAG_EXP_MS, a measure of browser responsiveness, improved by 12.9%
- Gradient generation time improved by 2.2% (probably noise)
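The comparison script itself is not attached here, so the following is only a minimal sketch of how the three-way split described above (regressed by 1% or more, unaffected, improved) could be computed. It assumes each histogram has already been reduced to a single mean value per build; the function names and the dictionary layout are hypothetical and not taken from the actual script.

```python
def percent_change(pgo_mean, nopgo_mean):
    """Relative change when PGO is disabled, in percent.

    For time-based histograms a positive value means the non-PGO build
    was slower (a regression); a negative value means it was faster.
    """
    return (nopgo_mean - pgo_mean) / pgo_mean * 100.0


def classify(pgo, nopgo, threshold=1.0):
    """Split histograms into regressed / unaffected / improved.

    `pgo` and `nopgo` map histogram names to mean values (an assumed
    input format). `threshold` is the cutoff in percent; 1.0 mirrors the
    "1% or greater" rule mentioned in the comment above.
    """
    result = {"regressed": [], "unaffected": [], "improved": []}
    # Only compare histograms that were recorded in both runs.
    for name in sorted(pgo.keys() & nopgo.keys()):
        if pgo[name] == 0:
            continue  # no baseline to compare against
        change = percent_change(pgo[name], nopgo[name])
        if change >= threshold:
            bucket = "regressed"
        elif change <= -threshold:
            bucket = "improved"
        else:
            bucket = "unaffected"
        result[bucket].append((name, change))
    return result
```

Note that a per-sample mean hides changes in sample count. A histogram such as GC_SLICE_MS can show a lower mean while the total time spent goes up if many more samples were recorded, so the counts are worth reporting next to the percentage.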

A few notes on methodology:
- I used the Nightly builds Ehsan linked to in comment 1. They're from different days, so there might be some variation from patches that landed on mozilla-central on January 23rd.
- I limited my script to the histograms collected by Telemetry. The simpleMeasurements section of Telemetry (e.g. startup & shutdown timings) isn't meaningful here since the entire Talos suite is run in a single Firefox session. We can refer to the real Talos numbers for PGO impact on startup & shutdown times. The Telemetry chromeHang & slowSQL data isn't relevant to this experiment.
- I had to configure Talos to run only 45 of the 100 pages in the benchmark since it would error out after benchmarking ~50% of the pages and I didn't want to waste time debugging the test scripts.
- I used the histograms from bug 833917 + about 80 other timing-based histograms which could be easily identified by names of the form *_MS
- The test machine was an E-350 laptop with a mechanical hard drive, Windows 7, 2GB RAM shared with video card, power options set to max performance
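As an illustration of the histogram selection described in the notes above, here is a small sketch that keeps every histogram whose name matches *_MS plus an explicitly supplied list (standing in for the histograms from bug 833917, whose actual contents are not reproduced here). The function name and its parameters are hypothetical.

```python
from fnmatch import fnmatchcase


def select_timing_histograms(names, explicit=()):
    """Return the sorted histogram names to compare.

    A name is kept if it appears in `explicit` (a caller-supplied list,
    assumed here to stand in for the bug 833917 histograms) or if it
    matches the case-sensitive pattern *_MS.
    """
    wanted = set(explicit)
    return sorted(
        name for name in names
        if name in wanted or fnmatchcase(name, "*_MS")
    )
```

Matching on the name suffix is a heuristic: it misses timing histograms that do not end in _MS (DNS_LOOKUP_TIME, for example), which is why an explicit list is accepted alongside the pattern.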
Attachment #708939 - Attachment is patch: false
Please note that most of the compared histograms are time measures and not necessarily performance measures -- you'll need to understand the Telemetry probe to interpret the regression.
Thanks a lot, Vladan, this is super helpful!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
The TOTAL_CONTENT_PAGE_LOAD_TIME result was surprising to me, because I usually think of page load as being mostly I/O-bound. Maybe it's something in the network cache.