Closed Bug 912596 Opened 11 years ago Closed 7 years ago

Run xpcshell tests on more powerful hardware

Categories: Firefox Build System :: Task Configuration
Type: task
Priority: Not set
Severity: normal

Tracking: (Not tracked)

Status: RESOLVED WONTFIX

People: Reporter: gps; Assignee: Unassigned

Whiteboard: [buildfaster:1]

xpcshell tests are not running as fast as they could in release automation due to hardware limitations. This conclusion follows from two recent changes:

1) We now collect system resource usage during mozharness jobs as of bug 883209.
2) xpcshell tests now run concurrently, on multiple cores as of bug 660788.

#1 resulted in bug 895225, which seemed to indicate that xpcshell tests were incurring gigabytes of I/O. :mihneadb has been patching the xpcshell harness to collect per-test resource usage. His work has confirmed these findings: some xpcshell tests incur heavy I/O load - several hundred megabytes each in some cases. This all adds up.

Compounding the problem of I/O load is the fact we now run xpcshell tests concurrently. Before, we only ran 1 at a time. Now, we scale up to the number of cores in a machine. This effectively multiplies the I/O load by the number of cores in a machine.

Comparing a known fast machine (a local development machine with 4 cores and an SSD) to logs from jobs in automation reveals that the machines in automation are significantly slower than they could be.

*The entire xpcshell test suite can now execute in under 3 minutes on a 4-core machine with an SSD*. Here are the times for a recent m-c push (this is total job time - there is a minute or two of overhead in these numbers for fetching the test archive, extracting it, etc.):

Linux opt (tst-linux32-ec2-344) - 21 minutes
Linux PGO (tst-linux32-ec2-042) - 21 minutes
Linux64 opt (tst-linux64-ec2-347) - 20 minutes
Linux64 PGO (tst-linux64-ec2-431) - 20 minutes
OS X 10.6 opt (talos-r4-snow-060) - 12 minutes
OS X 10.7 opt (talos-r4-lion-080) - 14 minutes
OS X 10.8 opt (talos-mtnlion-r5-081) - 10 minutes
Windows XP opt (t-xp32-ix-064) - 8 minutes
Windows XP PGO (t-xp32-ix-035) - 9 minutes
Windows 7 opt (t-w732-ix-093) - 20 minutes
Windows 7 PGO (t-w732-ix-042) - 17 minutes
Windows 8 opt (t-w864-ix-106) - 36 minutes (remote?)
Windows 8 PGO (t-w864-ix-085) - 35 minutes (remote?)

If you assume we could get runtimes down to 5 minutes for all but the Windows 8 tests, that would result in a total of 55 minutes vs. the 172 minutes those jobs currently take - a net savings of 117 minutes. Not too shabby. (This is already on top of the speedup from making the tests run concurrently.)
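
A quick sanity check of that estimate (a minimal sketch; the per-job times are the ones listed above, with the Windows 8 jobs excluded):

  # Times in minutes for the 11 non-Windows-8 jobs listed above.
  times = [21, 21, 20, 20, 12, 14, 10, 8, 9, 20, 17]
  current = sum(times)        # 172 minutes today
  target = 5 * len(times)     # 55 minutes at 5 minutes/job
  print(current - target)     # 117 minutes saved per push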

Why are the tests running so slowly compared to local developer machines? Two reasons:

1) Not enough CPU cores to take advantage of parallel execution
2) I/O layer can't keep up with test load

I recommend the following actions to improve the situation:

1) Run xpcshell test jobs on machines with more CPU cores
2a) Install SSDs on machines (to keep up with I/O load)
2b) Have tests use a RAM disk instead of local storage

All of these actions should be fully within the realm of Release Engineering to enact. Even 2b is straightforward - just set the TMPDIR, TEMP, or TMP environment variables to point to a RAM disk and the xpcshell harness should use that directory for temporary files.
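
For illustration, a minimal sketch of 2b from the job-setup side (the /run/shm path and the harness command line are assumptions for illustration, not the actual mozharness code):

  import os
  import subprocess

  # Point the temp-dir environment variables at a tmpfs mount so the
  # harness writes its scratch files to RAM instead of local disk.
  ramdisk = "/run/shm/xpcshell-tmp"  # assumed tmpfs path; varies per platform
  os.makedirs(ramdisk, exist_ok=True)

  env = os.environ.copy()
  for var in ("TMPDIR", "TEMP", "TMP"):
      env[var] = ramdisk

  # Launch the xpcshell harness with the modified environment
  # (command line is illustrative only).
  subprocess.check_call(["python", "runxpcshelltests.py"], env=env)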
Can you measure the build time with each change in isolation?
1) 4 cores with a regular hard disk
2) 1 core with an SSD?

I'm interested in knowing if one or the other change has a big impact on its own.
Flags: needinfo?(gps)
If the issue is the number of cores, then using the physical linux hardware in the datacenters instead of the ec2 instances will probably help quite a bit.

I'm not sure how much free RAM we have on these machines (since we are not permitted to gather system metrics on test machines), so a RAM disk may or may not be a viable choice for those OSes that support it.
(In reply to John Hopkins (:jhopkins) from comment #1)
> Can you measure the build time with each change in isolation?
> 1) 4 cores with a regular hard disk
> 2) 1 core with an SSD?
> 
> I'm interested in knowing if one or the other change has a big impact on its
> own.

For #1, look at job times before and after parallel xpcshell tests landed. Also read http://www.mihneadb.net/post/parallelizing-a-test-harness/. I know tests used to take ~13 minutes on my MBP (with an SSD) before parallel execution. Now they take under 3 minutes. The speedup is what you'd expect going from 1 core to 4+4HT cores - 4-5x (each hyperthreaded core is typically only good for about 25% of a physical core).
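
Rough arithmetic behind that expectation (a sketch; the 25% figure is the rule of thumb above, not a measurement):

  physical, ht = 4, 4
  effective_cores = physical + ht * 0.25  # ~5 physical cores' worth of throughput
  print(effective_cores)                  # 5.0, i.e. roughly a 4-5x speedup
  print(13 / effective_cores)             # ~2.6 min, consistent with "under 3 minutes"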

If your goal is to measure which has a bigger impact, "it depends." It depends on the model of HD, CPU frequency, etc.

Furthermore, the more tests you run in parallel, the more important I/O becomes because it is stressed more. We could probably throw 2 cores and a mechanical HD at this for a nice win without the HD becoming a major bottleneck. But once we get into 4+ core territory, a mechanical HD will be stressed and will become the bottleneck.
Flags: needinfo?(gps)
(In reply to Amy Rich [:arich] [:arr] from comment #2)
> If the issue is number of cores, then using the physical linux hardware in
> the datacenters instead of the ec2 instances will probably help quite a bit.
> 
> I'm not sure how much free RAM we have on these machines (since we are not
> permitted to gather system metrics on test machines), so a RAM disk may or
> may not be a viable choice for those OSes that support it.

Would it be possible for someone to "benchmark" xpcshell tests on various ec2 instance types (small, medium, large, etc)? If the EC2 cost per xpcshell test job is less for a beefier instance (due to less wall time), perhaps we should be looking at upgrading our instance types.
We can create and loan out a test instance if anybody has time to do the benchmarking. It's pretty easy to change instance types; the machine just needs to be offline.

We're currently using m1.medium instances (http://aws.amazon.com/ec2/instance-types/#instance-details). m1.large are about twice as powerful for twice the cost.
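
The cost question in the previous comment comes down to dollars per job rather than dollars per hour (a sketch with hypothetical numbers - real per-hour pricing and measured runtimes would be needed):

  def cost_per_job(hourly_price, runtime_minutes):
      # EC2 bills for instance time, so a pricier instance can still be
      # cheaper per job if it finishes the job proportionally faster.
      return hourly_price * runtime_minutes / 60.0

  # Hypothetical numbers only: the larger instance costs 2x per hour
  # but cuts a 20-minute job to 8 minutes.
  print(cost_per_job(1.0, 20))  # 0.33 units per job
  print(cost_per_job(2.0, 8))   # 0.27 units per job - cheaper overall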
Could we also look at running the tests on the physical hardware? Those machines seem like perfect candidates since they're multi-core and have lots of RAM.
I support :gps's suggestion: using a ramdisk for (the equivalent of) /tmp should help.
The test slaves have /run/shm already configured as a tmpfs mount with about 1.7G free.

How much data will be written to the ramdisk? Obviously this impacts the amount of ram available to test processes. We currently don't have swap configured for the EC2 instances.
(In reply to Chris AtLee [:catlee] from comment #8)
> The test slaves have /run/shm already configured as a tmpfs mount with about
> 1.7G free.
> 
> How much data will be written to the ramdisk? Obviously this impacts the
> amount of ram available to test processes. We currently don't have swap
> configured for the EC2 instances.

I ran "watch -n 0.3 du -hs /tmp" while the xpcshell tests were running and the highest size was around 250 MB, so let's say 300 MB to be safe. Not sure about the other OSes, though. Worth a try.
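
For reference, a minimal Python sketch of the same measurement (polling the directory size while the suite runs; the 0.3-second interval mirrors the watch command above):

  import os
  import time

  def dir_size(path):
      # Total size of files under path, roughly what "du -s" reports.
      total = 0
      for root, _dirs, files in os.walk(path):
          for name in files:
              try:
                  total += os.path.getsize(os.path.join(root, name))
              except OSError:
                  pass  # files come and go while tests run
      return total

  peak = 0
  while True:  # run alongside the test suite; Ctrl-C to stop
      peak = max(peak, dir_size("/tmp"))
      print("peak so far: %.1f MB" % (peak / 1e6))
      time.sleep(0.3)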
Moving this over to the Taskcluster queue where active development is happening.

There are two issues in this bug:
1) Make xpcshell tests run faster using more powerful hardware
2) Make xpcshell tests run faster using ramdisks for /tmp
Component: General Automation → Task Configuration
Product: Release Engineering → Taskcluster
QA Contact: catlee
Is this something someone cc'd would be willing to mentor for a contributor?
Is this still a going concern?
Flags: needinfo?(catlee)
I don't know. gps?
Flags: needinfo?(catlee) → needinfo?(gps)
I'm going with "no" :)
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(gps)
Resolution: --- → WONTFIX
A quick perusal of logs shows xpcshell on Windows is still a bit slow. But I don't think there's anything to be done in this 4-year-old bug. I filed this bug back in the days of buildbot. So many things have changed that I don't think it is worth keeping this open.
Product: TaskCluster → Firefox Build System