Closed Bug 1271035 Opened 8 years ago Closed 8 years ago

Reduce I/O during reftests by disabling Places

Categories: Testing :: Reftest
Type: defect
Version: Version 3
Priority: Not set
Severity: normal

Tracking
Status: RESOLVED FIXED
Target Milestone: mozilla49
firefox49: fixed

People
(Reporter: gps, Assigned: gps)

References
(Blocks 1 open bug)

Details

Attachments
(1 file)

Patch forthcoming.
It was a cold Friday night in San Francisco. Earlier in the day, I
informed Chris AtLee that I was going to start focusing on improving
the efficiency of Firefox automation and asked him where the biggest
capacity issues were. He said "we're hurting most on Windows tests."

As I was casually drinking a barleywine (note to reader: barleywines
are serious beers - there's nothing casual about them), I found myself
tediously clicking through Treeherder looking at logs for Windows jobs,
looking for patterns and other oddities. As I was clicking through,
something stood out to me: the sheer number of reftest jobs. I
recalled a random project I started a few years ago. Its aim was to
analyze buildbot job metadata so we could better understand where time
was spent in automation. I had mostly written off the side project as
a failure and a near complete waste of my time. Not even a stray
random thought of this project had entered my mind in the past year.
But clicking through Treeherder after a few glasses of barleywine
somehow reminded me of one of the few useful findings of that project:
reftest jobs consumed a seemingly disproportionate amount of machine time:
something like 35 or 40% of the time spent on all jobs, IIRC.

Now, this finding was from several years ago and almost certainly no longer
relevant. But, again, I had a few glasses of barleywine in me and was
bothered by the number of reftest jobs and their duration, so I thought
"hey, why don't I run reftests and see why they take so long." So I
built Firefox on Windows - the platform Chris AtLee said we're "hurting
most on."

I decided to start my very casual profiling session by recording a
`mach reftest` run using Sysinternals Process Monitor. To my surprise,
it yielded a very obvious and unexpected result: the Places SQLite
database was incurring a lot of I/O. On my i7-6700K Windows 10 desktop
with a high performance SSD, `mach reftest` yielded the following:

File Time  #Events  #Reads   #Writes  RBytes        WBytes          Path
198s       980,872  243,270  669,231  7,971,471,360 20,667,084,080  places.sqlite-wal
165s       645,853  222,407  367,775  7,287,701,820 14,071,529,472  places.sqlite
  2s       377,121        1        0         32,768              0  places.sqlite-shm

The Places SQLite database accounts for 2,003,846 of the total of
3,547,527 system calls (56.49%) recorded by procmon during `mach
reftest` execution. This represents a staggering 49,997,786,732 of the
50,307,660,589 (99.38%) bytes of I/O recorded! Yes, that's 50 GB.
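For anyone who wants to reproduce this kind of breakdown, the per-file table
above can be produced from a Process Monitor CSV export with a small script.
Here is a rough sketch; the column names (Operation, Path, Detail, Duration)
and the "Length:" parsing are assumptions about what a procmon CSV export
contains, not something taken from this bug:

```python
import csv
import re
from collections import defaultdict

# Matches the byte count in procmon's Detail column, e.g. "Offset: 0, Length: 32,768".
LENGTH_RE = re.compile(r"Length: ([\d,]+)")

def summarize(csv_path):
    """Aggregate per-file event counts, read/write counts and bytes from a procmon CSV."""
    stats = defaultdict(lambda: {"events": 0, "reads": 0, "writes": 0,
                                 "rbytes": 0, "wbytes": 0, "time": 0.0})
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            s = stats[row["Path"]]
            s["events"] += 1
            s["time"] += float(row.get("Duration") or 0)
            m = LENGTH_RE.search(row.get("Detail", ""))
            length = int(m.group(1).replace(",", "")) if m else 0
            if row["Operation"] == "ReadFile":
                s["reads"] += 1
                s["rbytes"] += length
            elif row["Operation"] == "WriteFile":
                s["writes"] += 1
                s["wbytes"] += length
    return stats

if __name__ == "__main__":
    # Print the top 10 files by write bytes, roughly in the format of the table above.
    for path, s in sorted(summarize("procmon.csv").items(),
                          key=lambda kv: kv[1]["wbytes"], reverse=True)[:10]:
        print(f"{s['time']:8.2f}s {s['events']:>9,} {s['reads']:>9,} "
              f"{s['writes']:>9,} {s['rbytes']:>15,} {s['wbytes']:>15,}  {path}")
```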

I reckon the reason the Places database accumulates so much I/O load
during reftests is that the reftest suite essentially loads thousands
of pages as quickly as possible. This effectively performs a stress
test against the Places history recording service.

As effective as reftests are at stress-testing Places, that stress adds no
value: reftests exercise layout features, not the performance of history
recording. So these 2M system calls and 50 GB of I/O are pure overhead.

This commit disables Places when executing reftests and prevents
the overhead.
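Conceptually, the change is a one-line preference addition to the profile
the reftest harness generates. The sketch below only illustrates the idea by
appending the pref to a scratch profile's user.js; the pref name
places.history.enabled is the one discussed later in this bug, while the
function and path here are hypothetical:

```python
import os

def disable_places_history(profile_dir):
    """Append the history-disabling pref to a test profile's user.js.

    Illustration only; the real reftest harness bakes its prefs into the
    profile it creates for each run rather than editing one after the fact.
    """
    user_js = os.path.join(profile_dir, "user.js")
    with open(user_js, "a") as f:
        f.write('user_pref("places.history.enabled", false);\n')

# Hypothetical usage against a scratch profile directory:
# disable_places_history("/tmp/reftest-profile")
```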

After this commit, `mach reftest` has significantly reduced interaction
with the Places SQLite database:

File Time  #Events  #Reads   #Writes  RBytes        WBytes         Path
0.09s          502     138       302     4,521,984      8,961,528  places.sqlite-wal
0.07s          254      20       140       524,604      8,126,464  places.sqlite
0.01s        3,289       1         0        32,768              0  places.sqlite-shm

Of the 948,033 system calls recorded with this change (26.7% of the
original), 691,322 were related to I/O. The Places SQLite database
only consumed ~22 MB of I/O, <0.01% of the original. Roughly half of the
remaining I/O system calls are related to reftest.log, which now accounts
for the largest share of write I/O (only ~53 MB, however). reftest.log
appears to use unbuffered writes and requires an excessive number of
system calls for writing. But that's for another bug and commit.
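As an aside on that observation: the difference is simply whether every log
line becomes its own write system call, or whether lines are batched in a
userspace buffer first. A minimal Python illustration (file names and log
lines here are made up, not the harness's actual logging code):

```python
# Line buffered: every log line is handed to the OS as its own write call.
with open("unbuffered.log", "w", buffering=1) as f:
    for i in range(100_000):
        f.write(f"REFTEST TEST-PASS | test {i}\n")

# Default buffering: Python batches lines into larger writes (roughly 8 KB
# blocks), cutting the number of write system calls by orders of magnitude.
with open("buffered.log", "w") as f:
    for i in range(100_000):
        f.write(f"REFTEST TEST-PASS | test {i}\n")
```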

In terms of wall time, the drastic I/O reduction during `mach reftest`
appears to have minimal impact on my machine: maybe 30s shaved from a
~900s execution, or ~3%. But my machine with its modern SSD doesn't
struggle with I/O.

In automation, it is a different story.

I pushed this change to Try along with the base revision and triggered
4 runs of most reftest jobs. The runtime improvements in automation
are impressive. Here are the fastest reported times for various jobs:

Job                     Before      After     Delta
Linux Opt R1               31m        34m       +3m
Linux Opt R2               43m        35m       -8m
Linux Opt Ru1              40m        34m       -6m
Linux Opt Ru2              43m        37m       -6m
Linux Opt R E10s           89m        72m      -17m
Linux Debug R1             52m        40m      -12m
Linux Debug R2             49m        42m       -7m
Linux Debug R3             60m        51m       -9m
Linux Debug R4             42m        37m       -5m
Linux Debug R1 E10s        84m        72m      -12m
Linux Debug R2 E10s        97m        85m      -12m
Linux64 Opt R1             35m        24m      -11m
Linux64 Opt R2             37m        26m      -11m
Linux64 Opt Ru1            32m        29m       -3m
Linux64 Opt Ru2            37m        26m      -12m
Linux64 Opt TC R1          12m        10m       -2m
Linux64 Opt TC R2          10m         7m       -3m
Linux64 Opt TC R3          11m         9m       -2m
Linux64 Opt TC R4          11m         9m       -2m
Linux64 Opt TC R5          13m        11m       -2m
Linux64 Opt TC R6          11m         9m       -2m
Linux64 Opt TC R7           9m         8m       -1m
Linux64 Opt TC R8          11m        10m       -1m
Linux64 Opt TC Ru1         30m        25m       -5m
Linux64 Opt TC Ru2         36m        27m      -11m
OS X 10.10 Opt             31m        27m       -4m
OS X 10.10 Opt E10s        26m        25m       -1m
OS X 10.10 Debug           68m        55m      -13m
Win7 Opt R                 30m        28m       -2m
Win7 Opt Ru                28m        26m       -2m
Win7 Opt R E10S            29m        27m       -2m
Win7 Debug R               85m        76m       -9m
Win7 Debug R E10S          75m        65m      -10m
Win8 x64 Opt R             29m        26m       -3m
Win8 x64 Opt Ru            27m        25m       -2m
Win8 x64 Debug R           90m        71m      -19m
Android 4.3 API15 Opt R1   89m        71m      -18m
Android 4.3 API15 Opt R2   78m        64m      -14m
Android 4.3 API15 Opt R3   75m        64m      -11m
Android 4.3 API15 Opt R4   74m        68m       -6m
Android 4.3 API15 Opt R5   75m        69m       -6m
Android 4.3 API15 Opt R6   91m        86m       -5m
Android 4.3 API15 Opt R7   87m        66m      -21m
Android 4.3 API15 Opt R8   87m        82m       -5m
Android 4.3 API15 Opt R9   80m        66m      -14m
Android 4.3 API15 Opt R10  80m        67m      -13m
Android 4.3 API15 Opt R11  73m        66m       -7m
Android 4.3 API15 Opt R12 105m        91m      -14m
Android 4.3 API15 Opt R13  72m        59m      -13m
Android 4.3 API15 Opt R14  82m        61m      -21m
Android 4.3 API15 Opt R15  73m        62m      -11m
Android 4.3 API15 Opt R16  79m        78m       -1m

The savings total 6+ *hours* or ~15% when running all reftests. I'd
say this isn't bad for a one-line code change!

Review commit: https://reviewboard.mozilla.org/r/51267/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/51267/
Attachment #8749983 - Flags: review?(dholbert)
(In reply to Gregory Szorc [:gps] from comment #1)

> As I was casually drinking a barleywine (note to reader: barleywines
> are serious beers - there's nothing casual about them)

What barleywine were you drinking, and how did you like it?
Flags: needinfo?(gps)
Attachment #8749983 - Flags: review?(dholbert) → review+
Comment on attachment 8749983 [details]
MozReview Request: Bug 1271035 - Disable Places during reftests, preventing 50 GB of I/O; r?dholbert

https://reviewboard.mozilla.org/r/51267/#review47931

Very nice. r=me
(In reply to Justin Dolske [:Dolske] from comment #4)
> What barleywine were you drinking, and how did you like it?

It was a Sierra Nevada Barrel Aged Bigfoot. Solid beer and definitely a memorable experience. Would do again.
Flags: needinfo?(gps)
margaret: mfinkle: I observe from gps's work that Android reftest times dropped significantly.  This surprised me, but I think I now understand why: Fennec uses the same pref to disable maintaining visited links and writes to its Android SQLite DBs as Desktop.  Could one of you confirm?

Future thought: if we separated the Android history flag from the Desktop one, would we opt-out of additional toolkit/ Places cruft?  Or is that already all removed from Fennec, and the flag name is just ... ancient history?
Flags: needinfo?(mark.finkle)
Flags: needinfo?(margaret.leibovic)
Reading the commit message the day after, it's obviously a by-product of drinking a high alcohol beer and staying up late to finish :) But some things deserve to be kept for posterity, so I didn't change it before landing.

For the record, I'm still somewhat shocked that a random side project I did a few years ago to explore and analyze buildbot data produced a takeaway that I randomly recalled last night, that it contributed to me looking at reftests first, and that by some miracle there was low-hanging fruit there that resulted in double-digit percentage wins mostly across the board. It feels like I won the lottery. I would be shocked if I manage to find a single low-effort, high-impact improvement like this again.
(In reply to Nick Alexander :nalexander from comment #8)
> Future thought: if we separated the Android history flag from the Desktop
> one, would we opt-out of additional toolkit/ Places cruft? 

I think Android doesn't define MOZ_PLACES and thus should have no Places code at all. It's likely the pref is there to simplify syncing prefs across profiles with Sync.
(In reply to Nick Alexander :nalexander from comment #8)
> Fennec uses the same pref to disable maintaining visited links and writes to
> its Android SQLite DBs as Desktop.  Could one of you confirm?

Yes. I did this in Bug 1093886, to allow us to turn off history in web apps for Bug 851854. This is a nice side-effect.


> Future thought: if we separated the Android history flag from the Desktop
> one, would we opt-out of additional toolkit/ Places cruft?  Or is that
> already all removed from Fennec, and the flag name is just ... ancient
> history?

We don't ship Places, but we obey the same pref.
Flags: needinfo?(mark.finkle)
Flags: needinfo?(margaret.leibovic)
Blocks: 1271068
(In reply to Marco Bonardo [::mak] from comment #10)
> It's likely the pref is there to simplify syncing prefs across
> profiles with Sync.

Thread-drift, but: we don't sync prefs across applications (Thunderbird <-> Firefox, for example), and we don't implement prefs sync at all on either iOS or Android.
(In reply to Nick Alexander :nalexander from comment #8)
> margaret: mfinkle: I observe from gps's work that Android reftest times
> dropped significantly.  This surprised me, but I think I now understand why:
> Fennec uses the same pref to disable maintaining visited links and writes to
> its Android SQLite DBs as Desktop.  Could one of you confirm?

Yes, we added support for the pref a little while ago:
http://mxr.mozilla.org/mozilla-central/source/mobile/android/components/build/nsAndroidHistory.cpp#129

> Future thought: if we separated the Android history flag from the Desktop
> one, would we opt-out of additional toolkit/ Places cruft?  Or is that
> already all removed from Fennec, and the flag name is just ... ancient
> history?

I don't think we are opting into Places cruft as it is. Android uses an entirely different History storage system. It just so happens that we used the same preference to control it. I think the moral of this long, long, long commit-message story is: I/O while loading web pages slows everything down.
Any chance we'd get a similar win from setting this pref during mochitests?
Flags: needinfo?(gps)
(In reply to Nathan Froyd [:froydnj] from comment #14)
> Any chance we'd get a similar win from setting this pref during mochitests?

We'd have to be more careful about that, because some mochitests explicitly test history-related things (like visited coloring of links).
(In reply to Nathan Froyd [:froydnj] from comment #14)
> Any chance we'd get a similar win from setting this pref during mochitests?

Procmon says yes. Also, the Necko cache seems to be generating a fair bit of I/O.

I'm not sure what is safe to disable where in mochitest land. Obviously some mochitests need to test features backed by I/O, and disabling too many browser features may undermine the value of some tests.

We really need the system resource metrics obtained by mozharness to be surfaced better, e.g. from http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-win32/1462612660/mozilla-inbound_win7_ix_test-mochitest-1-bm109-tests1-windows-build32.txt.gz:

04:16:08     INFO - Total resource usage - Wall time: 981s; CPU: 10.0%; Read bytes: 96571392; Write bytes: 1094087168; Read time: 2205010; Write time: 46124710
04:16:08     INFO - install - Wall time: 3s; CPU: 11.0%; Read bytes: 0; Write bytes: 33091072; Read time: 0; Write time: 326250
04:16:08     INFO - run-tests - Wall time: 979s; CPU: 10.0%; Read bytes: 96571392; Write bytes: 1035534336; Read time: 2205010; Write time: 45589980

That's 1 GB of write I/O during a mochitest-1 job. Seems fishy to me. And I /think/ this is actual disk I/O, not what the processes are generating (during many workloads large percentages of process I/O are serviced by the page cache and wouldn't be reported here).
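Pulling those lines out of raw logs is easy enough to script. A rough sketch
follows; the regex is an assumption based on the excerpt above, not something
mozharness provides:

```python
import re
import sys

# Matches lines like:
# "04:16:08     INFO - run-tests - Wall time: 979s; CPU: 10.0%; Read bytes: ...; Write bytes: ...; Read time: ...; Write time: ..."
LINE_RE = re.compile(
    r"INFO - (?P<phase>.+?) - Wall time: (?P<wall>\d+)s; "
    r"CPU: (?P<cpu>[\d.]+)%; Read bytes: (?P<rbytes>\d+); "
    r"Write bytes: (?P<wbytes>\d+); Read time: (?P<rtime>\d+); "
    r"Write time: (?P<wtime>\d+)"
)

def extract_resource_usage(log_path):
    """Yield (phase, wall seconds, cpu %, read bytes, write bytes) per reported phase."""
    with open(log_path, errors="replace") as f:
        for line in f:
            m = LINE_RE.search(line)
            if m:
                yield (m["phase"], int(m["wall"]), float(m["cpu"]),
                       int(m["rbytes"]), int(m["wbytes"]))

if __name__ == "__main__":
    for phase, wall, cpu, rbytes, wbytes in extract_resource_usage(sys.argv[1]):
        print(f"{phase:30} {wall:6}s  cpu {cpu:5.1f}%  "
              f"write {wbytes / 1e9:6.2f} GB  read {rbytes / 1e6:6.1f} MB")
```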
Bug 859573 added the resource monitoring to mozharness. It is using https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py which uses psutil for gathering *system* level metrics. Those are described at http://pythonhosted.org/psutil/#system-related-functions.

What immediately stands out to me is that "write time" value of 45,589,980. That's supposed to be in milliseconds, which would be over 12 hours of write time in a ~980 second job. Sounds like a bug in data collection or documentation to me. We're also likely running an older psutil in automation.
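For reference, resourcemonitor.py polls psutil's system-wide counters. A
stripped-down sketch of the same idea, sampling disk counters around a
command (psutil documents read_time/write_time as milliseconds, which is
exactly the field that looks suspicious above); the wrapper function and
usage here are hypothetical:

```python
import subprocess
import psutil

def run_with_io_stats(cmd):
    """Run a command and report the system-wide disk I/O delta around it."""
    before = psutil.disk_io_counters()
    subprocess.run(cmd, check=True)
    after = psutil.disk_io_counters()
    print("write bytes:", after.write_bytes - before.write_bytes)
    print("read bytes: ", after.read_bytes - before.read_bytes)
    # psutil documents these two fields as milliseconds spent doing I/O.
    print("write time: ", after.write_time - before.write_time, "ms")
    print("read time:  ", after.read_time - before.read_time, "ms")

# Hypothetical usage:
# run_with_io_stats(["./mach", "reftest"])
```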

I think conditions are ripe for a lot of follow-up bugs :)
Flags: needinfo?(gps)
Perhaps we should file a followup bug to investigate whether there are things we could do to improve the performance characteristics for our end-users?
(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) (busy May 9-13) from comment #18)
> Perhaps we should file a followup bug to investigate whether there are
> things we could do to improve the performance characteristics for our
> end-users?

Is bug 1209027 sufficient?
In bug 1271077 I coded up a quick patch to enable system resource numbers to be printed in Treeherder. It is appropriate that Mother's Day is this weekend, because we appear to have hit the mother lode: there is a treasure trove of interesting data pointing at potential bottlenecks and areas for optimization.

We know the patch in this bug significantly improves reftest execution times by reducing I/O. Procmon recorded ~50 GB of saved I/O in system calls. However, not all of that I/O hits disk. What was the actual disk I/O implication of this change?

Before: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d09130fbe0545f1b6e29a97045949e1d385a819b
After:  https://treeherder.mozilla.org/#/jobs?repo=try&revision=65b2a1870e1819e1d54fca904a404c4646843939

And the breakdown of disk write bytes before and after disabling Places:

                           Before          After
Linux64 Opt R1          9,235,111,936   274,952,192
Linux64 Opt R2          9,109,041,152   344,014,848
Linux64 Opt R E10s     19,124,137,984   412,995,584
Win7 Opt R             11,082,519,552   550,678,016
Win7 Opt Ru             9,697,164,800   518,457,856
OS X 10.10 Opt R E10s  18,420,071,424   609,679,872

I suspect most of the write bytes after this patch are due to extracting the build and test files. That will almost certainly be a few hundred megabytes due to the way we're doing things today. Although we've been talking about changing how that works. Stay tuned...
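For scale, the write-byte reductions in the table above all land in roughly
the 95-98% range; a quick computation using the numbers copied from that
table:

```python
# Disk write bytes before/after this patch, copied from the Try pushes above.
jobs = {
    "Linux64 Opt R1":         (9_235_111_936, 274_952_192),
    "Linux64 Opt R2":         (9_109_041_152, 344_014_848),
    "Linux64 Opt R E10s":    (19_124_137_984, 412_995_584),
    "Win7 Opt R":            (11_082_519_552, 550_678_016),
    "Win7 Opt Ru":            (9_697_164_800, 518_457_856),
    "OS X 10.10 Opt R E10s": (18_420_071_424, 609_679_872),
}

for job, (before, after) in jobs.items():
    saved = before - after
    print(f"{job:22} saved {saved / 1e9:5.1f} GB ({saved / before:.1%})")
```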

As for non-reftest jobs, there appears to be significant I/O usage across the board. Mochitest jobs are frequently writing ~1 GB. xpcshell jobs are writing ~11 GB (!!!). We're doing a *lot* of I/O in automation. Keep in mind that a number of machines in automation have spinning disks or are virtualized (e.g. in EC2), so I/O isn't great.

We also report CPU metrics in those Try pushes. A number of jobs are reporting <50% CPU utilization. Win7 Opt X reported ~24% CPU usage - and xpcshell tests execute concurrently. With 10.5 GB of write I/O in that job, I suspect I/O wait is slowing us down. OS X mochitest jobs are hovering around 50% CPU utilization. We know mochitests execute one test at a time. I suspect the OS X machines have 2 cores/threads. Before you look at CPU times too much, pay attention to whether the machine is a VM, especially in EC2. We're currently reporting physical CPU utilization. So e.g. on EC2 CPU utilization will often be near 100% because some other VM you don't have insight into is using the CPU. Bug 893391 tracks improving this reporting.
Blocks: 1271141
Blocks: fastci
https://hg.mozilla.org/mozilla-central/rev/6b064f9f6e10
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
See Also: → 1271448