Closed Bug 1205785 Opened 9 years ago Closed 9 years ago

Use ec2 instances with more memory when running linux gtests

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chmanchester, Assigned: kmoir)

References

Details

Attachments

(5 files, 11 obsolete files)

2.52 KB, patch
rail
: review+
kmoir
: checked-in+
Details | Diff | Splinter Review
5.98 KB, text/plain
Details
858 bytes, patch
rail
: review+
kmoir
: checked-in+
Details | Diff | Splinter Review
14.13 KB, patch
kmoir
: checked-in+
Details | Diff | Splinter Review
19.23 KB, patch
kmoir
: checked-in+
Details | Diff | Splinter Review
We found in bug 1204148 that a few gtests require a lot of memory. The path forward seems to be to allow these to run on instances with more memory.

It looks like the tst-linux64 instance type has 3.75 GiB, the next level up, 7.5 GiB, should be enough. I have a loaner out in bug 1204153 we can change the type of to confirm this is sufficient.
rail, is this something you can help me figure out? If it's something I could conceivably do myself I'm more than happy to write a patch with a few pointers. Thank you.
Flags: needinfo?(rail)
We tried to switch to a different instance type in bug 945981 for all tst-linux machines, but failed due to some flaky tests.

Maybe we can use tst-emulator64 slaves for this? Kim added them at some point IIRC. We can ask our buildduty to work on this I think.

ATM we use c3.xlarge (7.5G of RAM) and m3.xlarge (15G of RAM) (see https://github.com/mozilla/build-cloud-tools/blob/master/configs/watch_pending.cfg#L76-L84) for these slaves. 

If this approach works, we will need to 1) use "tst-emulator64" slaves for these tests, 2) update regexes at https://github.com/mozilla/build-cloud-tools/blob/master/configs/watch_pending.cfg#L7 and 3) probably increase the pool. 

If we need linux32 as well, we may need to setup a different slave type.
Flags: needinfo?(rail)
If tst-emulator64 is a lot tst-linux64, but with more memory, that should work great.

The tests in question are disabled on linux32, so that would not be an immediate need.

I will ask in bug 1204153 to switch to this instance type to confirm.
Component: General Automation → Buildduty
QA Contact: catlee → bugspam.Callek
What is the priority on this bug?  Right now the buildduty folks are working on testing our new 10.10 machines.  I can add it to their queue

As for expanding the emulator pool, we have to bring up new masters first (bug 1205409) before we increase the pool size again (bug 1204756)
Depends on: 1205409
Thank you for the additional information.

I think this is "normal" priority. This is the last step in what has been a quarter-long effort to get the GTests running and green on test machines. There isn't a single factor I'm aware of that would make this time sensitive in a way that would make it a priority over ongoing work, but there is some urgency in the sense that the longer these tests are run on builders the more likely it is issues will creep in that keep them from running and passing on test machines, and the process will be prolonged yet again.

With this in mind, what kind of time frame is realistic for this change?
I confirmed in bug 1204153 that simply increasing the instance size is enough for these tests.
I see the dependent bug was marked as fixed, is there anything I can do to help here? I think this needs at least an entry in watch_pending.cfg. We want to match builders like:

Ubuntu VM 12.04 x64 mozilla-inbound debug test gtest
Ubuntu VM 12.04 x64 mozilla-inbound opt test gtest
Ubuntu VM 12.04 x64 mozilla-inbound pgo test gtest

so I think we can go with

"^Ubuntu VM 12.04 .+ test gtest$"

as our regex.
Flags: needinfo?(kmoir)
Thanks for the regexp for watch pending.  We need to make some changes in buildbot-configs to add this as a platform. Is this just for linux64? (not linux which may be deprecated anyways - see bug 1209932) I can look at this now, it should be a small patch.
Flags: needinfo?(kmoir)
Assignee: nobody → kmoir
Great, thank you! The issue in question is only a problem for linux64, so we can change that to

"^Ubuntu VM 12.04 x64 .+ test gtest$"

Different instance types for different architectures might cause some confusion for people trying to debug these tests, but if 32 bit linux is on the way out anyway, it's probably ok.
Attached patch bug1205785.patch (obsolete) — Splinter Review
I'm having some problems enabling these tests on ubuntu64_vm_large (which exists an a attached machine type on the same masters that provide ubuntu64_vm)

I tried to enable this machine type for linux64 builds with the attached patch only this builder is enabled.  

Ubuntu ASAN VM 12.04 x64 <branch> opt test gtest ScriptFactory

Versus before the patch where
Ubuntu ASAN VM 12.04 x64 <branch> opt test gtest ScriptFactory
Ubuntu VM 12.04 x64 <branch> opt test gtest ScriptFactory
Ubuntu VM 12.04 x64 <branch> pgo test gtest ScriptFactory
Ubuntu VM 12.04 x64 <branch> debug test gtest ScriptFactory

Rail do you have any suggestions on what I might be missing?
Attachment #8669085 - Flags: feedback?(rail)
Comment on attachment 8669085 [details] [diff] [review]
bug1205785.patch

Review of attachment 8669085 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla-tests/config.py
@@ +139,5 @@
>  PLATFORMS['linux64']['talos_slave_platforms'] = ['ubuntu64_hw']
>  PLATFORMS['linux64']['env_name'] = 'linux-perf'
>  PLATFORMS['linux64']['ubuntu64_vm'] = {'name': 'Ubuntu VM 12.04 x64'}
>  PLATFORMS['linux64']['ubuntu64_hw'] = {'name': 'Ubuntu HW 12.04 x64'}
> +PLATFORMS['linux64']['ubuntu64_vm_large'] = {'name': 'Ubuntu HW 12.04 x64'}

Maybe because of this line? It says "HW", but it is a "VM".

@@ +2466,5 @@
> +                continue
> +    
> +            if platform in BRANCHES[name]['platforms']:
> +                if slave_platform in BRANCHES[name]['platforms'][platform]:
> +                    if platform in "linux64" and slave_platform in "ubuntu64_vm":

This condition is a bit confusing. I'd rather use:
  platform == "linux64"
or
  platform in ["linux64"]

This way you do exact matching instead of partial (`"linux" in "linux64"` returns True). This may lead to hidden issues in the future.

@@ +2468,5 @@
> +            if platform in BRANCHES[name]['platforms']:
> +                if slave_platform in BRANCHES[name]['platforms'][platform]:
> +                    if platform in "linux64" and slave_platform in "ubuntu64_vm":
> +                        continue
> +                    elif platform in "linux64" and slave_platform in "ubuntu64_vm_large":

The same here.

@@ +2469,5 @@
> +                if slave_platform in BRANCHES[name]['platforms'][platform]:
> +                    if platform in "linux64" and slave_platform in "ubuntu64_vm":
> +                        continue
> +                    elif platform in "linux64" and slave_platform in "ubuntu64_vm_large":
> +                        BRANCHES[name]['platforms'][platform][slave_platform]['debug_unittest_suites'] = []

Looks like this line is redundant. Just set the value to GTEST instead of += GTEST.

@@ +2471,5 @@
> +                        continue
> +                    elif platform in "linux64" and slave_platform in "ubuntu64_vm_large":
> +                        BRANCHES[name]['platforms'][platform][slave_platform]['debug_unittest_suites'] = []
> +                        BRANCHES[name]['platforms'][platform][slave_platform]['debug_unittest_suites'] += GTEST
> +                        BRANCHES[name]['platforms'][platform][slave_platform]['opt_unittest_suites'] = []

The same here.
Attachment #8669085 - Flags: feedback?(rail) → feedback+
Thanks for the feedback rail.  The reason that builder wasn't appearing was that
1) had to create a new platform type to avoid duplicate builder names
2) add the new platform type to master_config.json

I have some new patches which I'll attach soon
Attached patch bug1205785bb.patch (obsolete) — Splinter Review
Attachment #8669085 - Attachment is obsolete: true
puppet changes
Attached patch bug1205785bb.patch (obsolete) — Splinter Review
Attachment #8670317 - Attachment is obsolete: true
Attached patch bug1205785bb.patch (obsolete) — Splinter Review
Attachment #8670354 - Attachment is obsolete: true
Attached file bug1205785builder.diff (obsolete) —
builder diff
Attached file bug1205785builder.diff (obsolete) —
Attachment #8670567 - Attachment is obsolete: true
Attached patch bug1205785bb.patch (obsolete) — Splinter Review
Attachment #8670563 - Attachment is obsolete: true
Attached patch bug1205785tools.patch (obsolete) — Splinter Review
Attached patch bug1205785builder.diff (obsolete) — Splinter Review
Attachment #8670577 - Attachment is obsolete: true
Attached patch bug1205785cloud.patch (obsolete) — Splinter Review
Attachment #8670318 - Flags: review?(rail)
Comment on attachment 8670578 [details] [diff] [review]
bug1205785bb.patch

I don't like this patch.  It feels so twisted and complicated.  

But it enables emulator64 instances for linux64 + asan + tsan + code coverage gtests
Attachment #8670578 - Flags: review?(rail)
Attachment #8670585 - Flags: review?(rail)
Attachment #8671020 - Flags: review?(rail)
Comment on attachment 8670318 [details] [diff] [review]
bug1205785puppet.patch

Review of attachment 8670318 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/buildmaster/templates/BuildSlaves-tests.py.erb
@@ +17,5 @@
>      'ubuntu64_vm-mulet': """<%=scope.function_secret(["linux_tests_password"])%>""",
>      'ubuntu64_hw': """<%=scope.function_secret(["linux_tests_password"])%>""",
>      'ubuntu64_vm_mobile': """<%=scope.function_secret(["linux_tests_password"])%>""",
>      'ubuntu64_vm_large': """<%=scope.function_secret(["linux_tests_password"])%>""",
> +    'ubuntu64_vm_lnx_large': """<%=scope.function_secret(["linux_tests_password"])%>""",    

A nit: trailing whit espace.
Attachment #8670318 - Flags: review?(rail) → review+
Comment on attachment 8670585 [details] [diff] [review]
bug1205785tools.patch

Review of attachment 8670585 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/maintenance/production-masters.json
@@ +547,5 @@
>      "limit_fx_slave_platforms": {
>        "linux64": [
> +        "ubuntu64_vm",
> +        "ubuntu64_vm_lnx_large"
> +        

A nit: extra line with trailing whitespace

@@ +1961,5 @@
>        ],
>        "linux64-cc": [
> +        "ubuntu64_vm",
> +        "ubuntu64_vm_lnx_large"
> +        

the same here
Attachment #8670585 - Flags: review?(rail) → review+
Comment on attachment 8671020 [details] [diff] [review]
bug1205785cloud.patch

Review of attachment 8671020 [details] [diff] [review]:
-----------------------------------------------------------------

Instead of attaching a patch you can submit a PR on github :)

::: configs/watch_pending.cfg
@@ +29,5 @@
>          "^(TB )?WINNT (5.2|6.1|6.2)( x86-64)? try": "y-2008",
>          "^b2g_try_linux(32|64)_gecko(-debug)?": "try-linux64",
> +        "^Ubuntu VM large 12.04 x64.* (gtest).*" : "tst-emulator64",
> +        "^Ubuntu Code Coverage VM large 12.04 x64.* test (gtest).*": "tst-emulator64",
> +        "^Ubuntu ASAN VM large 12.04 x64.* test (gtest).*": "tst-emulator64",

(gtest).* looks confusing, it should be either without parenthesis or (gtest)?.* to indicate that whatever it's in parenthesis is optional (question mark).
Attachment #8671020 - Flags: review?(rail) → review-
Comment on attachment 8670578 [details] [diff] [review]
bug1205785bb.patch

Review of attachment 8670578 [details] [diff] [review]:
-----------------------------------------------------------------

Only one question:

::: mozilla-tests/config.py
@@ +2604,5 @@
> +                    BRANCHES[name]['platforms'][platform][slave_platform]['opt_unittest_suites'] += GTEST
> +
> +trunk_branches = []
> +for name, branch in items_at_least(BRANCHES, 'gecko_version', 44):
> +    trunk_branches.append(name)

Hmm, "trunk_branches" sounds confusing. Do you want the tests to ride the trains or want them stay only on the trunk branches? For the latter you may want to use BRANCHES['mozilla-central']['gecko_version'] instead of 44.
Comment on attachment 8670318 [details] [diff] [review]
bug1205785puppet.patch

and merged to production
Attachment #8670318 - Flags: checked-in+
Attached patch bug1205785bb-2.patch (obsolete) — Splinter Review
Attachment #8670578 - Attachment is obsolete: true
Attachment #8670578 - Flags: review?(rail)
Attachment #8671005 - Attachment is obsolete: true
Attachment #8673132 - Attachment is patch: false
Attachment #8671020 - Attachment is obsolete: true
Attachment #8673134 - Flags: review?(rail)
Attachment #8673130 - Attachment is patch: true
Attachment #8673130 - Attachment mime type: text/x-patch → text/plain
Attachment #8673130 - Flags: review?(rail)
Attachment #8673134 - Flags: review?(rail) → review+
Comment on attachment 8673130 [details] [diff] [review]
bug1205785bb-2.patch

Review of attachment 8673130 [details] [diff] [review]:
-----------------------------------------------------------------

\o/

r+ with 3 fixes below

::: mozilla-tests/config.py
@@ +2610,5 @@
> +delete_slave_platform(BRANCHES, PLATFORMS, {'linux64-tsan': 'ubuntu64_vm' }, branch_exclusions=["try"])
> +delete_slave_platform(BRANCHES, PLATFORMS, {'linux64-tsan': 'ubuntu64_vm_lnx_large'}, branch_exclusions=["try"])
> +delete_slave_platform(BRANCHES, PLATFORMS, {'linux64-asan': 'ubuntu64_vm'}, branch_exclusions=ride_train_branches)
> +delete_slave_platform(BRANCHES, PLATFORMS, {'linux64-asan': 'ubuntu64-asan_vm_lnx_large'}, branch_exclusions=ride_train_branches)
> +delete_slave_platform(BRANCHES, PLATFORMS, {'linux64': 'ubuntu64_vm_lnx_large'}, branch_exclusions=ride_train_branches)

Typo in 3 places:
s/ride_train_branches/ride_trains_branches/g
Attachment #8673130 - Flags: review?(rail) → review+
tools patch with fixed white space
Attachment #8670585 - Attachment is obsolete: true
Attachment #8673156 - Flags: checked-in+
thanks for the reviews rail. I had the variable name correct on my master but didn't export the correct patch in my workspace ...monday...I mean tuesday
Attachment #8673130 - Attachment is obsolete: true
Attachment #8673157 - Flags: checked-in+
Attachment #8673134 - Flags: checked-in+
This is deployed and the old gtest builders are removed but the new ones aren't appearing.  I am wondering if this is a problem where the master needs to be restarted to pick up the new platform.  I'm going to graceful bm125 and restart to test this theory.
restarted bm125, this did not fix my builder problem, investigating further....
My current theory on why the builders are not enabled.

I think the new platform names are too long and are being truncated as part of the builder dir.
My test master did not have the same master_config.json as a production master (to save reconfig time).  If I add the same platform limitations to my test master, the builders disappear.  Continuing debugging....
no, this is not it, the problem is the master_config.json doesn't have the new platforms
I have fixed this issue.  It seems that a reconfig doesn't update the master_config.py.  You need to run

cd /builds/buildbot/tests1-linux64/
export PRODUCTION_MASTERS=tools/buildfarm/maintenance/production-masters.json
python buildbot-configs/update-master-json.py $PRODUCTION_MASTERS master/master_config.json
make checkconfig
make reconfig

on the masters to update it which I did on bm51-54, 67,68, 113-125  I didn't run a reconfig on all the masters because I wanted the regular reconfig process to run. I did run on on bm125 and can see gtests are running yet failing.

Are the gtests not defined in tree yet?  I noticed they were failing before the instance type was changed.  Looking at the results, I can stuff like this in the logs which suggests the test scripts are not in place yet.  This is for a Ubuntu VM large 12.04 x64 try opt test gtest job.

13:24:39     INFO - copying tree: /builds/slave/test/build/tests/gtest/gtest_bin to /builds/slave/test/build/application/firefox
13:24:39     INFO - Running post-action listener: _package_coverage_data
13:24:39     INFO - Running post-action listener: _resource_record_post_action
13:24:39    FATAL - Uncaught exception: Traceback (most recent call last):
13:24:39    FATAL -   File "/builds/slave/test/scripts/mozharness/base/script.py", line 1718, in run
13:24:39    FATAL -     self.run_action(action)
13:24:39    FATAL -   File "/builds/slave/test/scripts/mozharness/base/script.py", line 1660, in run_action
13:24:39    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
13:24:39    FATAL -   File "/builds/slave/test/scripts/mozharness/base/script.py", line 1601, in _possibly_run_method
13:24:39    FATAL -     return getattr(self, method_name)()
13:24:39    FATAL -   File "scripts/scripts/desktop_unittest.py", line 503, in run_tests
13:24:39    FATAL -     preflight_run_method=self.preflight_gtest)
13:24:39    FATAL -   File "scripts/scripts/desktop_unittest.py", line 596, in _run_category_suites
13:24:39    FATAL -     preflight_run_method(suites)
13:24:39    FATAL -   File "scripts/scripts/desktop_unittest.py", line 563, in preflight_gtest
13:24:39    FATAL -     os.path.join(abs_app_dir))
13:24:39    FATAL -   File "/builds/slave/test/scripts/mozharness/base/script.py", line 619, in copytree
13:24:39    FATAL -     files = os.listdir(src)
13:24:39    FATAL - OSError: [Errno 2] No such file or directory: '/builds/slave/test/build/tests/gtest/gtest_bin'
13:24:39    FATAL - Running post_fatal callback...
Flags: needinfo?(cmanchester)
Great, thank you for working on this! I need to land bug 992983 to move these to the test job (they're currently running during make check, so when we can't leave a gap when moving them).

I'll check these on try today, and if everything looks good there land the conversion.
Flags: needinfo?(cmanchester)
Wooot!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.