Closed Bug 912596 Opened 11 years ago Closed 7 years ago

Run xpcshell tests on more powerful hardware

Categories: Firefox Build System :: Task Configuration
Type: task
Priority: Not set
Severity: normal

Tracking: (Not tracked)

Status: RESOLVED WONTFIX

People: Reporter: gps; Assignee: Unassigned

Whiteboard: [buildfaster:1]

xpcshell tests are not running as fast as they could in release automation due to hardware limitations. This conclusion follows from two recent changes:

1) We now collect system resource usage during mozharness jobs as of bug 883209.
2) xpcshell tests now run concurrently, on multiple cores as of bug 660788.

#1 resulted in bug 895225, which seemed to indicate that xpcshell tests were incurring gigabytes of I/O. :mihneadb has been patching the xpcshell harness to collect per-test resource usage. His work has confirmed these findings: some xpcshell tests incur heavy I/O load - several hundred megabytes each in some cases. This all adds up.

Compounding the problem of I/O load is the fact we now run xpcshell tests concurrently. Before, we only ran 1 at a time. Now, we scale up to the number of cores in a machine. This effectively multiplies the I/O load by the number of cores in a machine.

Comparing a known fast machine (a local development machine with 4 cores and an SSD) to logs from jobs in automation reveals that the machines in automation are significantly slower than they could be.

*The entire xpcshell test suite can now execute in under 3 minutes on a 4-core machine with an SSD*. Here are the times for a recent m-c push (this is total job time - there is a minute or two of overhead in these numbers for fetching the test archive, extracting it, etc.):

Linux opt (tst-linux32-ec2-344) - 21 minutes
Linux PGO (tst-linux32-ec2-042) - 21 minutes
Linux64 opt (tst-linux64-ec2-347) - 20 minutes
Linux64 PGO (tst-linux64-ec2-431) - 20 minutes
OS X 10.6 opt (talos-r4-snow-060) - 12 minutes
OS X 10.7 opt (talos-r4-lion-080) - 14 minutes
OS X 10.8 opt (talos-mtnlion-r5-081) - 10 minutes
Windows XP opt (t-xp32-ix-064) - 8 minutes
Windows XP PGO (t-xp32-ix-035) - 9 minutes
Windows 7 opt (t-w732-ix-093) - 20 minutes
Windows 7 PGO (t-w732-ix-042) - 17 minutes
Windows 8 opt (t-w864-ix-106) - 36 minutes (remote?)
Windows 8 PGO (t-w864-ix-085) - 35 minutes (remote?)

If you assume we could get runtimes down to 5 minutes for all but the Windows 8 tests, that would result in a total of 55 minutes vs. the 172 minutes those jobs currently take - a net savings of 117 minutes. Not too shabby. (This is already on top of the speedup from making the tests run concurrently.)
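
A quick sanity check of that estimate (a minimal sketch; the per-job times are the ones listed above, with the Windows 8 jobs excluded):

  # Times in minutes for the 11 non-Windows-8 jobs listed above.
  times = [21, 21, 20, 20, 12, 14, 10, 8, 9, 20, 17]
  current = sum(times)        # 172 minutes today
  target = 5 * len(times)     # 55 minutes at 5 minutes/job
  print(current - target)     # 117 minutes saved per push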

Why are the tests running so slowly compared to local developer machines? Two reasons:

1) Not enough CPU cores to take advantage of parallel execution
2) I/O layer can't keep up with test load

I recommend the following actions to improve the situation:

1) Run xpcshell test jobs on machines with more CPU cores
2a) Install SSDs on machines (to keep up with I/O load)
2b) Have tests use a RAM disk instead of local storage

All of these actions should be fully within the realm of Release Engineering to enact. Even 2b is straightforward - just set the TMPDIR, TEMP, or TMP environment variables to point to a RAM disk and the xpcshell harness should use that directory for temporary files.
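
For illustration, a minimal sketch of 2b from the job-setup side (the /run/shm path and the harness command line are assumptions for illustration, not the actual mozharness code):

  import os
  import subprocess

  # Point the temp-dir environment variables at a tmpfs mount so the
  # harness writes its scratch files to RAM instead of local disk.
  ramdisk = "/run/shm/xpcshell-tmp"  # assumed tmpfs path; varies per platform
  os.makedirs(ramdisk, exist_ok=True)

  env = os.environ.copy()
  for var in ("TMPDIR", "TEMP", "TMP"):
      env[var] = ramdisk

  # Launch the xpcshell harness with the modified environment
  # (command line is illustrative only).
  subprocess.check_call(["python", "runxpcshelltests.py"], env=env)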
Can you measure the build time with each change in isolation?
1) 4 cores with a regular hard disk
2) 1 core with an SSD?

I'm interested in knowing if one or the other change has a big impact on its own.
Flags: needinfo?(gps)
If the issue is the number of cores, then using the physical linux hardware in the datacenters instead of the ec2 instances will probably help quite a bit.

I'm not sure how much free RAM we have on these machines (since we are not permitted to gather system metrics on test machines), so a RAM disk may or may not be a viable choice for those OSes that support it.
(In reply to John Hopkins (:jhopkins) from comment #1)
> Can you measure the build time with each change in isolation?
> 1) 4 cores with a regular hard disk
> 2) 1 core with an SSD?
> 
> I'm interested in knowing if one or the other change has a big impact on its
> own.

For #1, look at job times before and after parallel xpcshell tests landed. Also read http://www.mihneadb.net/post/parallelizing-a-test-harness/. I know tests used to take ~13 minutes on my MBP (with an SSD) before parallel execution. Now they take under 3 minutes. The speedup is what you'd expect going from 1 core to 4+4HT cores - 4-5x (each hyperthreaded core is typically only good for about 25% of a physical core).
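
Rough arithmetic behind that expectation (a sketch; the 25% figure is the rule of thumb above, not a measurement):

  physical, ht = 4, 4
  effective_cores = physical + ht * 0.25  # ~5 physical cores' worth of throughput
  print(effective_cores)                  # 5.0, i.e. roughly a 4-5x speedup
  print(13 / effective_cores)             # ~2.6 min, consistent with "under 3 minutes"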

If your goal is to measure which has a bigger impact, "it depends." It depends on the model of HD, CPU frequency, etc.

Furthermore, the more tests you run in parallel, the more important I/O becomes because it is stressed more. We could probably throw 2 cores and a mechanical HD at this for a nice win without the HD becoming a major bottleneck. But once we get into 4+ core territory, a mechanical HD will be stressed and will become the bottleneck.
Flags: needinfo?(gps)
(In reply to Amy Rich [:arich] [:arr] from comment #2)
> If the issue is number of cores, then using the physical linux hardware in
> the datacenters instead of the ec2 instances will probably help quite a bit.
> 
> I'm not sure how much free RAM we have on these machines (since we are not
> permitted to gather system metrics on test machines), so a RAM disk may or
> may not be a viable choice for those OSes that support it.

Would it be possible for someone to "benchmark" xpcshell tests on various ec2 instance types (small, medium, large, etc)? If the EC2 cost per xpcshell test job is less for a beefier instance (due to less wall time), perhaps we should be looking at upgrading our instance types.
We can create and loan out a test instance if anybody has time to do the benchmarking. It's pretty easy to change instance types; the machine just needs to be offline.

We're currently using m1.medium instances (http://aws.amazon.com/ec2/instance-types/#instance-details). m1.large are about twice as powerful for twice the cost.
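
The cost question in the previous comment comes down to dollars per job rather than dollars per hour (a sketch with hypothetical numbers - real per-hour pricing and measured runtimes would be needed):

  def cost_per_job(hourly_price, runtime_minutes):
      # EC2 bills for instance time, so a pricier instance can still be
      # cheaper per job if it finishes the job proportionally faster.
      return hourly_price * runtime_minutes / 60.0

  # Hypothetical numbers only: the larger instance costs 2x per hour
  # but cuts a 20-minute job to 8 minutes.
  print(cost_per_job(1.0, 20))  # 0.33 units per job
  print(cost_per_job(2.0, 8))   # 0.27 units per job - cheaper overall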
Could we also look at running the tests on the physical hardware? Those machines seem like perfect candidates since they're multi-core and have lots of RAM.
I support :gps's suggestion: using a ramdisk for (the equivalent of) /tmp should help.
The test slaves have /run/shm already configured as a tmpfs mount with about 1.7G free.

How much data will be written to the ramdisk? Obviously this impacts the amount of ram available to test processes. We currently don't have swap configured for the EC2 instances.
(In reply to Chris AtLee [:catlee] from comment #8)
> The test slaves have /run/shm already configured as a tmpfs mount with about
> 1.7G free.
> 
> How much data will be written to the ramdisk? Obviously this impacts the
> amount of ram available to test processes. We currently don't have swap
> configured for the EC2 instances.

I ran "watch -n 0.3 du -hs /tmp" while the xpcshell tests were running and the highest size was around 250 MB, so let's say 300 MB to be safe. Not sure about the other OSes, though. Worth a try.
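
For reference, a minimal Python sketch of the same measurement (polling the directory size while the suite runs; the 0.3-second interval mirrors the watch command above):

  import os
  import time

  def dir_size(path):
      # Total size of files under path, roughly what "du -s" reports.
      total = 0
      for root, _dirs, files in os.walk(path):
          for name in files:
              try:
                  total += os.path.getsize(os.path.join(root, name))
              except OSError:
                  pass  # files come and go while tests run
      return total

  peak = 0
  while True:  # run alongside the test suite; Ctrl-C to stop
      peak = max(peak, dir_size("/tmp"))
      print("peak so far: %.1f MB" % (peak / 1e6))
      time.sleep(0.3)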
Moving this over to the Taskcluster queue where active development is happening.

There are two issues in this bug:
1) Make xpcshell tests run faster using more powerful hardware
2) Make xpcshell tests run faster using ramdisks for /tmp
Component: General Automation → Task Configuration
Product: Release Engineering → Taskcluster
QA Contact: catlee
Is this something someone cc'd would be willing to mentor for a contributor?
Is this still a going concern?
Flags: needinfo?(catlee)
I don't know. gps?
Flags: needinfo?(catlee) → needinfo?(gps)
I'm going with "no" :)
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(gps)
Resolution: --- → WONTFIX
A quick perusal of logs shows xpcshell on Windows is still a bit slow. But I don't think there's anything to be done in this 4-year-old bug. I filed this bug back in the days of buildbot. So many things have changed that I don't think it is worth keeping this open.
Product: TaskCluster → Firefox Build System