Closed Bug 1321026 Opened 5 years ago Closed 4 years ago

Make number of threads for XPCShell test more optimal

Categories

(Testing :: XPCShell Harness, defect)

defect
Not set
normal

Tracking

(firefox55 fixed)

VERIFIED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: mayhemer, Assigned: mayhemer)

References

Details

Attachments

(1 file, 4 obsolete files)

Using 4*CPU count parallel thread for xpcshell tests is an overkill, specially on windows (debug builds).  It was a pain to run in background and do something else on an 8-thread CPU, is incredible on an 20-thread CPU (80 tests in parallel!!)

I want to introduce an optional argument to runxpcshelltests.py that will override the default cpu_count() * 4.
Attached patch v1 (--threads arg) (obsolete) — Splinter Review
- new |--threads=N| arg taking an int (signed) when running |mach xpcshelltests|
- when the number is a positive number, it will override the default cpu_count()*4
- when the number provided is negative, it will do cpu_count() - abs(thread_count) ; i.e. preserve the CPU cores specified
- when 0, the default cpu_count() * 4 is used (hence, w/o specifying the arg the behavior is unchanged)
Attachment #8815383 - Flags: review?(ted)
This is a very annoying thing indeed, doing anything while tests are running is impossible.
I think today what most devs do is using --sequential option.
Could we evaluate using a smaller count by default, and enforce with the option a larger count for tinderboxes, rather than forcing every dev to provide an option to slow it down?
(In reply to Marco Bonardo [::mak] from comment #2)
> This is a very annoying thing indeed, doing anything while tests are running
> is impossible.
> I think today what most devs do is using --sequential option.
> Could we evaluate using a smaller count by default, and enforce with the
> option a larger count for tinderboxes, rather than forcing every dev to
> provide an option to slow it down?

I personally don't have an opinion here, but I'm glad I'm not alone here.  

--sequential means to return to the dark age and let the tests run for several minutes.  It's just another extreme, to the opposite direction.

If we by default preserve at least one or two cores, I think it would be alright.  I can also do some measurements on how long the tests take regarding various threads/cores rate and report here.  But it will be specific to a 20-core i7 and windows.
I think when we initially landed this we did some brief measurements. This might have wound up like this because the test machines were single-core at the time, and running 4 processes was faster than running just one? If you're going to change this I'd suggest just changing the default heuristic. We could either cap it at the number of cores, or min(4, num_cores) or something like that.
Ideally we'd just do the right thing and nobody would have to fiddle with options. If you change the default it'd be good to do some try pushes to compare how long the xpcshell test runs take in automation.
I'll do some local and try tests with different heuristic.  Tomorrow or later, tho.
Comment on attachment 8815383 [details] [diff] [review]
v1 (--threads arg)

Dropping r? based on comment from Ted.
Attachment #8815383 - Flags: review?(ted)
Status: ASSIGNED → NEW
So, here are some numbers.  Running ./mach xpcshell-test netwerk/ --threads=N on i7-6950X (10 cores/20 threads):

       CPU util.           time to finish        feeling

N=20:  35-40               2m35s                 no slowdown
N=30:  80-100              1m57s                 tiny slowdown
N=40:  80-100              1m54s                 some small but bearable slowdown
N=80:  80-100              1m55s                 noticeable slowdown



The same on i7-950 (4 cores/8 threads):

       CPU util.           time to finish        feeling

N=8:   95-100              7m37s                 tiny-small slowdown
N=12:  100-100             7m34s                 bearable slowdown
N=16:  100-100             9m08s*                major slowdown
N=32:  100-100             9m13s*                major slowdown



So, for me the magic number is 1.5 x CPUs.


I still think it's worth landing the arg (when we already have it written) and also change the default to cpu_count() * 1.5.


* the test_sjs_throwing_exceptions.js test, which usually is exhaustive itself, timed out (I can see this happening quite often).
Status: NEW → ASSIGNED
Attached patch v2 (1.5*cpu, --threads arg) (obsolete) — Splinter Review
try pushes on the same base changeset with --rebuild 5

no patch (4*cpu): https://treeherder.mozilla.org/#/jobs?repo=try&revision=42ce85fb38faf22097493a3c40afb0b01d344ad1

w/ patch (1.5*cpu): https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca859902358e6a8e93e186383b466bb60435b9df

Let's see which one wins!
Attachment #8815383 - Attachment is obsolete: true
Attachment #8838512 - Flags: review?(ted)
Comment on attachment 8838512 [details] [diff] [review]
v2 (1.5*cpu, --threads arg)

Review of attachment 8838512 [details] [diff] [review]:
-----------------------------------------------------------------

This is mostly fine, I just have a suggestion for a change that might make it more user-friendly.

::: testing/xpcshell/runxpcshelltests.py
@@ +1085,5 @@
>                   testingModulesDir=None, pluginsPath=None,
>                   testClass=XPCShellTestThread, failureManifest=None,
>                   log=None, stream=None, jsDebugger=False, jsDebuggerPort=0,
>                   test_tags=None, dump_tests=None, utility_path=None,
> +                 rerun_failures=False, threadCount=None, failure_manifest=None, jscovdir=None, **otherOptions):

You do numeric comparisons on `threadCount` below, so the default should be a number. You could just set it `=DEFAULT_NUM_THREADS` and drop the elif there.

::: testing/xpcshell/xpcshellcommandline.py
@@ +115,5 @@
> +    parser.add_argument("--threads",
> +                        type=int, dest="threadCount", default=0,
> +                        help="override the number of jobs (threads) when running tests in parallel"
> +												     ", use negative value to let the harness calculate number of preserved CPU cores"
> +														 " (e.g. -2 will run |CPU count - 2| threads)")

This seems a little overly-complicated. What if you made this accept an int, or some named strings, like "max" and "usable", where "max" uses the default value to max out speed, and "usable" keeps a core free so you can use your machine?
Attachment #8838512 - Flags: review?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10)
> You do numeric comparisons on `threadCount` below, so the default should be
> a number. You could just set it `=DEFAULT_NUM_THREADS` and drop the elif
> there.

Yep, probably better.

									 " (e.g. -2 will run |CPU count - 2| threads)")
> 
> This seems a little overly-complicated. What if you made this accept an int,
> or some named strings, like "max" and "usable", where "max" uses the default
> value to max out speed, and "usable" keeps a core free so you can use your
> machine?

Sure, I don't have other work to do :D
Attached patch v3 (just lowering the max) (obsolete) — Splinter Review
No time to fiddle with the arg, hence here is a patch to just lower the default max to 1.5*cpu.

In comment 5 you said if we find an optimal default we don't need an arg at all anyway.

Depends on results from comment 9 to take or not.
Attachment #8838512 - Attachment is obsolete: true
Attachment #8838514 - Flags: review?(ted)
Summary: Parametric number of threads for XPCShell test → Make number of threads for XPCShell test more optimal
(In reply to Honza Bambas (:mayhemer) from comment #9)
> Created attachment 8838512 [details] [diff] [review]
> v2 (1.5*cpu, --threads arg)
> 
> try pushes on the same base changeset with --rebuild 5
> 
> no patch (4*cpu):
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=42ce85fb38faf22097493a3c40afb0b01d344ad1
> 
> w/ patch (1.5*cpu):
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=ca859902358e6a8e93e186383b466bb60435b9df
> 
> Let's see which one wins!

Some test timings are mostly identical, some seems to be slower with the patch (e.g. Linux opt tc-X(2): 16min w/ patch vs 14-15min w/o patch.)

We then may need a different default when running on infra than locally.
Attachment #8838514 - Flags: review?(ted) → review+
Doesn't seem worth doing a bunch of extra work for a single minute on some already fairly quick jobs.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #14)
> Doesn't seem worth doing a bunch of extra work for a single minute on some
> already fairly quick jobs.

Sorry I didn't drop the r? flag, could save you some time.

Since this affects all trees and all builds, I think it's worth to update a bit here.  Releng would not like me :)  

I think of a bit of a hack here - when there is MOZ_UPLOAD_DIR env var defined, use 4*cpu, ohterwise 1.5*cpu.
Oh, I think there's a simpler way to do the right thing here--make your change to pass 1.5 * ncpu in the mach command:
https://dxr.mozilla.org/mozilla-central/rev/e150eaff1f83e4e4a97d1e30c57d233859efe9cb/testing/xpcshell/mach_commands.py#240

and then keep runxpcshelltest.py defaulting to the existing behavior. That way developers running via `mach xpcshell-test` will get the new behavior, but automation running runxpcshelltest.py directly will use the current behavior.

Does that sound sensible?
Thanks for the tip, Ted!

- reintroduced the argument (just for completeness) and kept very simple this time
- mach_commands.py defaults it to 1.5*cpus
- the param default is 0 so that we can detect it has not been set, either by user or by mach's default from mach_commands.py
- when runxpcshelltests.py finds threadCount == 0 (when running directly as part of automation), it's set to 4*cpus

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0170bcb45480d602fff0a6d486ab832dd0d09c5d
Attachment #8838514 - Attachment is obsolete: true
Attachment #8843956 - Flags: review?(ted)
Comment on attachment 8843956 [details] [diff] [review]
v4 (mach commands hack: 1.5x, direct run: 4xCPUs)

Review of attachment 8843956 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/xpcshell/runxpcshelltests.py
@@ +1193,5 @@
>          self.testingModulesDir = testingModulesDir
>          self.pluginsPath = pluginsPath
>          self.sequential = sequential
>          self.failure_manifest = failure_manifest
> +        self.threadCount = threadCount or NUM_THREADS

You can probably leave off the `or NUM_THREADS` given that it has a default value in the parameter list. Although I guess you set the default in the commandline argument to be 0, so you want to handle that here?
Attachment #8843956 - Flags: review?(ted) → review+
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #18)
> Comment on attachment 8843956 [details] [diff] [review]
> v4 (mach commands hack: 1.5x, direct run: 4xCPUs)
> 
> Review of attachment 8843956 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: testing/xpcshell/runxpcshelltests.py
> @@ +1193,5 @@
> >          self.testingModulesDir = testingModulesDir
> >          self.pluginsPath = pluginsPath
> >          self.sequential = sequential
> >          self.failure_manifest = failure_manifest
> > +        self.threadCount = threadCount or NUM_THREADS
> 
> You can probably leave off the `or NUM_THREADS` given that it has a default
> value in the parameter list. Although I guess you set the default in the
> commandline argument to be 0, so you want to handle that here?

Yes.  It's a leftover.
Attachment #8843956 - Attachment is obsolete: true
Attachment #8844032 - Flags: review+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a293a617ebed
Have an argument to change the number of parallel tests when running xpcshelltests. r=ted
Keywords: checkin-needed
Backed out on suspicion of causing xpcshell bustage:

https://hg.mozilla.org/integration/mozilla-inbound/rev/a33c70b339b9c1e015165170ef38d1768a553881

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2e025eb770e85de558fb72825539903b3e049172&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=81990620&repo=mozilla-inbound

[task 2017-03-06T22:10:31.757682Z] 22:10:31     INFO - Calling ['/home/worker/workspace/build/venv/bin/python', '-u', '/home/worker/workspace/build/tests/xpcshell/runxpcshelltests.py', '--total-chunks', '8', '--this-chunk', '1', '--symbols-path=https://queue.taskcluster.net/v1/task/cyWiF4bFRbuy5pUHojYipA/artifacts/public/build/target.crashreporter-symbols.zip', '--test-plugin-path=/home/worker/workspace/build/application/firefox/plugins', '--log-raw=/home/worker/workspace/build/blobber_upload_dir/xpcshell_raw.log', '--log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/xpcshell_errorsummary.log', '--utility-path=tests/bin', '--xpcshell=/home/worker/workspace/build/application/firefox/xpcshell', '--manifest=tests/xpcshell/tests/xpcshell.ini'] with output_timeout 1000
[task 2017-03-06T22:10:32.235809Z] 22:10:32     INFO -  Found node at /usr/local/bin/node
[task 2017-03-06T22:10:32.238723Z] 22:10:32     INFO -  Found moz-http2 at /home/worker/workspace/build/tests/xpcshell/moz-http2/moz-http2.js
[task 2017-03-06T22:10:33.191997Z] 22:10:33     INFO -  Using at most 0 threads.
[task 2017-03-06T22:10:33.195945Z] 22:10:33     INFO -  SUITE-START | Running 391 tests
[task 2017-03-06T22:27:13.274505Z] 22:27:13     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/home/worker/workspace/build/venv/bin/python', '-u', '/home/worker/workspace/build/tests/xpcshell/runxpcshelltests.py', '--total-chunks', '8', '--this-chunk', '1', '--symbols-path=https://queue.taskcluster.net/v1/task/cyWiF4bFRbuy5pUHojYipA/artifacts/public/build/target.crashreporter-symbols.zip', '--test-plugin-path=/home/worker/workspace/build/application/firefox/plugins', '--log-raw=/home/worker/workspace/build/blobber_upload_dir/xpcshell_raw.log', '--log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/xpcshell_errorsummary.log', '--utility-path=tests/bin', '--xpcshell=/home/worker/workspace/build/application/firefox/xpcshell', '--manifest=tests/xpcshell/tests/xpcshell.ini']
[task 2017-03-06T22:27:13.321220Z] 22:27:13    ERROR - timed out after 1000 seconds of no output
[task 2017-03-06T22:27:13.321609Z] 22:27:13    ERROR - Return code: -9
[task 2017-03-06T22:27:13.323760Z] 22:27:13    ERROR - No tests run or test summary not found
Flags: needinfo?(honzab.moz)
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #22)
> Backed out on suspicion of causing xpcshell bustage:
> 
> https://hg.mozilla.org/integration/mozilla-inbound/rev/
> a33c70b339b9c1e015165170ef38d1768a553881
> 
> Push with failures:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> inbound&revision=2e025eb770e85de558fb72825539903b3e049172&filter-
> resultStatus=testfailed&filter-resultStatus=busted&filter-
> resultStatus=exception&filter-resultStatus=retry&filter-
> resultStatus=usercancel&filter-resultStatus=runnable
> Failure log:
> https://treeherder.mozilla.org/logviewer.html#?job_id=81990620&repo=mozilla-
> inbound

Why doesn't it show in my try run?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0170bcb45480d602fff0a6d486ab832dd0d09c5d

The patch should have zero impact in automation.  I suspect this was something else.
Flags: needinfo?(honzab.moz) → needinfo?(aryx.bugmail)
Ah!! 

"Using at most 0 threads."

Yes, seems like it had some meaning to leave the "or NUM_THREADS" bit...
Flags: needinfo?(aryx.bugmail)
Attachment #8844032 - Attachment is obsolete: true
Attachment #8844032 - Flags: review+
Attachment #8843956 - Attachment is obsolete: false
Let's land the v4 patch which is tested (comment 17).  Thanks.
Keywords: checkin-needed
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/96d360b9eadf
Have an argument to change the number of parallel tests when running xpcshelltests, r=ted
Keywords: checkin-needed
Depends on: 1345620
https://hg.mozilla.org/mozilla-central/rev/96d360b9eadf
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Just trying this in current central.
Thank you, Thank you, Thank you, Thank you, Thank you.
Status: RESOLVED → VERIFIED
(In reply to Marco Bonardo [::mak] from comment #28)
> Just trying this in current central.
> Thank you, Thank you, Thank you, Thank you, Thank you.

:))
Duplicate of this bug: 960767
You need to log in before you can comment on or make changes to this bug.