unable to schedule e10s specific talos jobs on tc scheduled platforms

RESOLVED FIXED in mozilla55

Status

Firefox Build System
Task Configuration
--
critical
RESOLVED FIXED
a year ago
2 months ago

People

(Reporter: jmaher, Assigned: dustin)

Tracking

({regression})

unspecified
mozilla55
regression

Firefox Tracking Flags

(Not tracked)

Details

MozReview Requests

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(2 attachments)

in this try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4173fcce525300eadf66cc9295e7371d0484a69a&group_state=expanded&selectedJob=97665271

the syntax:
try: -b o -p linux,linux64,macosx64,win32,win64 -u none -t chromez-e10s,dromaeojs-e10s,other-e10s,g1-e10s,g2-e10s,svgr-e10s,tp5o-e10s,xperf-e10s --rebuild-talos 5 

doesn't trigger jobs except for win7.  It should have linux64 and and osx scheduled (win8 would need custom syntax).

what is interesting is both osx and linux64 are managed by taskcluster and most likely the try syntax parser isn't understanding other-e10s, g1-e10s, etc.

I have seen this often in the last week and I am excited that people are being responsible with their try pushes, but asking them to push again with -t all which only adds to the burden on our infrastructure.
FYI, more than anything: 

Once I used -t all, the try submission email has 

"It looks like this submission has talos jobs. You can compare the performance of your push against a baseline revision here:
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=4aeff94a7fc872e4716def28f89647fd0d0b4488"

Attempts with e10s only talos didn't have the above text in the submission email.  Just the usual results on Treeherder link and builds and logs link.
There are a half-dozen things that interpret try syntax, and the one that prints the message when you push is the least accurate of them, so I wouldn't read too much into what it says!  It even sounds uncertain ("it looks like..").

It looks like the original push *did* trigger jobs for all of those platforms, it just didn't run the talos jobs.

To start debugging this, I'd suggest taking a look at the talos_try_name attribute of the jobs that you'd expect to see scheduled.  Maybe those aren't having `-e10s` appended?
:garndt, can someone from the TC team look into this?
Flags: needinfo?(garndt)
I do not have anyone immediately that could look at this, but there are a few other people that have had their hands in it that could help out I hope.  comment 2 does seem to be a step in the right direction for someone that wants to debug it. 

From my cursory glance at the scheduling, what I can tell is that we allow a user to specifically schedule *only* e10s jobs by appending e10s to the test suite name for both talos and unittests.  The intree scheduling respects this for unittests, but I'm not seeing anything that will match talos_try_name in this way.  If I adjust the talos_try_name for a talos suite to include "e10s" so it will match the try syntax and include the job in the graph, which comment 2 suggests.
Flags: needinfo?(garndt)
Also, I'm starting to dig into the in-tree scheduling myself as I'm curious, so for those that dive into it and have questions, I can try to answer what I know so far.  If I have any strokes of insight into this problem, I'll leave notes here.
Created attachment 8866128 [details]
talos_e10s_first_stab

I'm not entirely sure if this is the right direction, but this seemed to cause the talos e10s tests to show up in the graph.  Not sure if the task definitions are correctly though.
ok, it seems that this has now landed on m-c which enables e10s for talos jobs, and it will default to running both non-e10s and e10s.

https://hg.mozilla.org/mozilla-central/rev/77584adad526


It doesn't appear to address the issue where you *only* want to run e10s talos tests, which might be related to the talos-try-name change here:

--- a/taskcluster/taskgraph/transforms/tests.py
+++ b/taskcluster/taskgraph/transforms/tests.py
@@ -533,16 +533,17 @@ def split_e10s(config, tests):
             test['test-name'] += '-e10s'
             test['e10s'] = True
             test['attributes']['e10s'] = True
             group, symbol = split_symbol(test['treeherder-symbol'])
             if group != '?':
                 group += '-e10s'
             test['treeherder-symbol'] = join_symbol(group, symbol)
             if test['suite'] == 'talos':
+                test['talos-try-name'] += '-e10s'
                 for i, option in enumerate(test['mozharness']['extra-options']):
                     if option.startswith('--suite='):
                         test['mozharness']['extra-options'][i] += '-e10s'
             else:
                 test['mozharness']['extra-options'].append('--e10s')
         yield test



Kim, do you have any knowledge about this?  I see you enabled the e10s option for talos tests.
Flags: needinfo?(kmoir)
So looking at tests.py:

521 def split_e10s(config, tests):
522     for test in tests:
..
532         if e10s:
533             test['test-name'] += '-e10s'

and later

671 def make_job_description(config, tests):
675     for test in tests:
..
682         if 'talos-try-name' in test:
683             try_name = test['talos-try-name']
684             attr_try_name = 'talos_try_name'
685         else:
686             try_name = test.get('unittest-try-name', test['test-name'])
687             attr_try_name = 'unittest_try_name'
..
698         attributes.update({
..
705             attr_try_name: try_name,
706         })

Looking in taskcluster/ci/test/tests.yml, `unittest-try-name` is only given explicitly for tasks which run without e10s.  But `talos-try-name` is given for every talos job.  So the fallback on line 686 to `test['test-name']` occurs for *all* e10s unittests, but for none of the e10s talos.

So the fix in comment 7 would work, but it leaves unittests and talos with markedly different implementations of the same thing, and will break when a unittest gets 'unittest-try-name' added to it in tests.yml, or when the `talos-try-name` is removed from a talos stanza in tests.yml.

I suspect the fix is to move the *-try-name setting logic earlier in `tests.py`, perhaps to `set_defaults`.  Then by the time `split_e10s` runs, one of `unittest-try-name` and `talos-try-name` will definitely be set, and we can append `"-e10s"` to it.

Probably a nicer cleanup would be to combine `unittest-try-name` and `talos-try-name` into just `try-name` in the schema, and only split them out into the separate attributes (`unittest_try_name`, `talos_try_name`) in `make_job_description`.  But that could be deferred to another bug.
Flags: needinfo?(kmoir)
Major regressions like this (ones without any indication of failure) should be announced to mozilla.dev.platform and firefox-dev to avoid wasting developer and CPU time wondering when/if talos jobs will appear.

I've now sent such an email: https://mail.mozilla.org/pipermail/firefox-dev/2017-May/005429.html
Severity: normal → critical
Keywords: regression
@@ -415066,7 +415066,7 @@
         "autoland",
         "try"
       ],
-      "talos_try_name": "chromez",
+      "talos_try_name": "chromez-e10s",
       "test_chunk": "1",
       "test_platform": "linux64-nightly/opt",
       "unittest_flavor": "talos",

.. that's a promising diff (same for all the suites)
Comment hidden (mozreview-request)
Attachment #8867502 - Flags: review?(garndt)

Comment 12

11 months ago
mozreview-review
Comment on attachment 8867502 [details]
Bug 1363409: ensure e10s talos have -e10s in their try name

https://reviewboard.mozilla.org/r/139042/#review142828

So I tried this out locally and it seemed to take the try syntax for specifying e10s talos tests and also obeyed "all".  It also seems to obey tests that are set with "e10s: both" and scheduling non-e10s as well.  To be honest, I do not know much about the in-tree scheduling so some of this a little lost on me on how it works, so I totally understand if you want a second opinion.

::: taskcluster/taskgraph/transforms/tests.py:738
(Diff revision 1)
>  
>          build_label = test['build-label']
>  
> -        if 'talos-try-name' in test:
> -            try_name = test['talos-try-name']
> +        try_name = test['try-name']
> +        if test['suite'] == 'talos':
>              attr_try_name = 'talos_try_name'

It looks like you got rid of this attribute from tests.yml.  Wouldn't this be trying to find something that doesn't exist later?
Attachment #8867502 - Flags: review?(garndt) → review+
The *attributes* (keys in `task.attributes`) stick around, and are used for try filtering.  I just combined the `talos-try-name` and `test-try-name` keys in test descriptions into `try-name`, since it's easy for the code to figure out which attribute the value should end up in.

Comment 14

11 months ago
Pushed by dmitchell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/31710e5be104
ensure e10s talos have -e10s in their try name r=garndt

Comment 15

11 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/31710e5be104
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Assignee: nobody → dustin

Updated

2 months ago
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.