Open Bug 1502247 Opened 6 years ago Updated 10 months ago

Reject Tier 1 jobs at the taskgraph level if they don't run across all branches

Categories

(Firefox Build System :: Task Configuration, task)

task

Tracking

(Not tracked)

People

(Reporter: RyanVM, Unassigned)

References

Details

Attachments

(2 files)

One of the requirements of the job visibility policy (https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy) is that a job must run on all supported branches in order to be Tier 1. Therefore, jobs which only run on m-c by definition can't be higher than Tier 2 and should be rejected as such during taskgraph generation.

I assume that fixing this will also be enough to ensure that Tier 1 jobs are always available from ./mach try fuzzy without having to use --full, but if not, we should also enforce that.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #0)
> I assume that fixing this will also be enough to ensure that Tier 1 jobs are
> always available from ./mach try fuzzy without having to use --full, but if
> not, we should also enforce that.

Note that there *are* tier 1 jobs that don't show up without --full, and that's probably fine: toolchains (and things alike).
Hm. In theory this is good. In practice, does that mean we have to run devedition builds on ESR60 and Release? or beta-specific things like pushsnap?
Tom is right. Sorry, I was too loose in my wording. The spirit of the rule is that we don't want jobs to break for the first time on m-c and then expect it to be followed with an immediate backout as would normally be expected for Tier 1 bustage.

Now, IMO, it *is* an interesting discussion to have about whether we should be running DevEdition builds on trunk or not (and other jobs which normally only run on release branches), but we can punt on that discussion for for now :)
This is a really cool use of taskgraph!  There's some `verify` support in generator.py that could accomplish this, or it'd be possible to add a little more structure to support checks like this.
So, I came up with a simple patch that prints out task names along with their run_on_projects when it doesn't contain 'all' or 'trunk'. The result is attached. There are 8092 of them.

The vast majority of them are release-related, like balrog, beetmover-checksums, beetmover-repackage, checksums-signing, etc.

Many have an empty run_on_projects, which is fine for many of them (at least toolchain, packages, fetch, docker-images) , but could be a problem for some.

There are a few that are try-only, and they should clearly not be tier 1.

Everything devedition-related is obviously mozilla-beta-only.

Some nightly-related tasks only mozilla-central/try only. Which actually sounds wrong (why shouldn't they run on beta/release?)

There are some obviously problematic ones:
  addon-tps-xpi ['mozilla-central']
  build-android-api-16-ccov/debug ['mozilla-central']
  build-android-api-16-without-google-play-services/opt ['mozilla-central']
  build-android-test-ccov/opt ['mozilla-central']
  build-linux64-asan-fuzzing-ccov/opt ['mozilla-central', 'try']
  build-linux64-fuzzing-ccov/opt ['mozilla-central', 'try']
  spidermonkey-sm-mozjs-sys-linux64/debug ['integration', 'release', 'try']
  test-linux64-pgo/opt-talos-sessionrestore-many-windows-e10s ['mozilla-central', 'try']
  test-linux64-qr/opt-talos-sessionrestore-many-windows-e10s ['mozilla-central', 'try']
  test-linux64/opt-browser-screenshots-e10s ['mozilla-central', 'integration']
  test-macosx64/opt-browser-screenshots-e10s ['mozilla-central', 'integration']
  test-macosx64/opt-talos-sessionrestore-many-windows-e10s ['mozilla-central', 'try']
  test-macosx64/opt-talos-tp6-stylo-threads-e10s ['mozilla-beta', 'autoland', 'try']
  test-windows10-64-pgo/opt-talos-sessionrestore-many-windows-e10s ['mozilla-central', 'try']
  test-windows10-64-qr/opt-talos-sessionrestore-many-windows-e10s ['mozilla-central', 'try']
  test-windows10-64/opt-browser-screenshots-e10s ['mozilla-central', 'integration']
  test-windows10-64/opt-talos-sessionrestore-many-windows-e10s ['mozilla-central', 'try']

There are a bunch of linux64 debug mochitest/reftests that are central-only.

It's not going to be trivial to enforce rules under those conditions...
Attachment #9020979 - Attachment mime type: application/octet-stream → text/plain
Depends on: 1518972

So the different values of run-on-projects that we have for tier 1 jobs at the moment are:

  • all
  • release
  • release, integration, try
  • mozilla-beta
  • mozilla-beta, trunk, try
  • mozilla-beta, autoland, try
  • mozilla-central
  • mozilla-central, trunk, try
  • mozilla-central, try
  • mozilla-central, integration
  • trunk, try
  • try

First, I'll note that "mozilla-central, integration" is the mostly same as "trunk" (trunk includes comm-central, but I don't think this matters, comm-central uses different yaml definitions auiu)

Similarly, "mozilla-central, trunk, try" is the same as "trunk, try".

The "mozilla-beta, autoland, try" one is weird. Why autoland specifically, and not integration?

"release, integration, try" is kind of confusing. "release, trunk, try" would be the same thing, but less confusing.

Now, looking at all the tier 1 jobs that aren't balrog, beetmover, build-signing, cron-bouncer-check, checksums-signing, docker-image, fetch, mar-signing, nightly, packages, pipfile-update, release-*, repackage, repo-update, toolchain, upload-generated-sources or devedition, we're left with:

  • A bunch of tests set run_on_projects for an explicit disable on some platforms. I'd say ideally we shouldn't even create tasks for those.
  • */opt-browser-screenshots-e10s run on trunk but not try.
  • The following tasks run on mozilla-central only (not integration, not try):
    • addon-tps-xpi
    • build-android-api-16-ccov/debug
    • build-android-api-16-without-google-play-services/opt
    • build-android-test-ccov/opt
    • all serviceworker-e10s tests
  • The following tasks run on mozilla-central or try, but not integration:
    • build-linux64-asan-fuzzing-ccov/opt
    • build-linux64-fuzzing-ccov/opt
    • build-macosx64-ccov/debug
    • test-android-hw-p2-8-0-android-aarch64/opt-raptor-speedometer-geckoview-power-e10s
    • test-android-hw-p2-8-0-arm7-api-16/opt-raptor-speedometer-geckoview-power-e10s
    • test-linux64-pgo/opt-talos-sessionrestore-many-windows-e10s
    • test-linux64-qr/opt-talos-sessionrestore-many-windows-e10s
    • test-linux64/opt-talos-sessionrestore-many-windows-e10s
    • test-macosx64/opt-talos-sessionrestore-many-windows-e10s
    • test-windows10-64-pgo/opt-talos-sessionrestore-many-windows-e10s
    • test-windows10-64-qr/opt-talos-sessionrestore-many-windows-e10s
    • test-windows10-64/opt-talos-sessionrestore-many-windows-e10s
    • test-windows7-32-pgo/opt-talos-sessionrestore-many-windows-e10s
    • test-windows7-32/opt-talos-sessionrestore-many-windows-e10s
  • The following tasks don't run on mozilla-central (in fact, they only run on try):
    • */opt-awsy-base-dmd-e10s
    • */opt-awsy-dmd-e10s
    • */*-upload-symbols

(In reply to Mike Hommey [:glandium] from comment #7)

First, I'll note that "mozilla-central, integration" is the mostly same as "trunk" (trunk includes comm-central, but I don't think this matters, comm-central uses different yaml definitions auiu)

I was going to suggest that there might be places where run-on-projects is set in transforms (which might impact c-c), but that doesn't appear to be the case.

  • A bunch of tests set run_on_projects for an explicit disable on some platforms. I'd say ideally we shouldn't even create tasks for those.

I haven't looked at any details, but one thing that not generating tasks at all means, is that they can't selected via try, which might be relevant, depending on the reason the tests are disabled on a given platform. (also for add-new-jobs, which might be relevant for some talos jobs)

(In reply to Mike Hommey [:glandium] from comment #7)

The "mozilla-beta, autoland, try" one is weird. Why autoland specifically, and not integration?

That's from https://searchfox.org/mozilla-central/source/taskcluster/ci/test/talos.yml#264-267

Maybe it was for osx capacity issues? No clue why it isn't running on m-c...

Is this basically what we would want? For each task in the full task graph, validation is run to check the following:

  • The task is tier-1
  • The task name doesn't make an exact match to any of the task names in WHITELISTED_TASKS
  • The task name doesn't make a substring match to any of the task names in WHITELISTED_SUBSTRINGS

If a task meets those conditions, we look at the run-on-projects value for the test, and check the following:

  • run-on-projects does not include 'trunk'
  • run-on-projects does not include 'all'
  • run-on-projects does not include both 'mozilla-central' and 'integration'

If one of those conditions occurs, the task name is added to a list of invalid results. If the list is not empty, we print out the list of invalid task names, along with their current run-on-projects values.

Here's what happens for some jobs that don't meet the new validation:

Task configuration changed, generating target task graph
Error running mach:

    ['try', 'fuzzy']

The error occurred in code that was called by the mach command. This is either
a bug in the called code itself or in the way that mach is calling it.

You should consider filing a bug for this issue.

If filing a bug, please include the full output of mach, including this error
message.

The details of the failure are as follows:

Exception: Invalid Tier-1 tasks identified.

Tier-1 tasks must be run on mozilla-central and all projects that merge into it.

The following task(s) do not meet that requirement:
  beetmover-source-fennec-source/opt: []
  beetmover-source-firefox-source/opt: []


Please set run-on-projects to include at least one of the following:
  [all]
  [trunk]
  [mozilla-central, integration]

  File "/Users/wkocher/mozilla/mozilla-central/tools/tryselect/mach_commands.py", line 150, in try_fuzzy
    return run_fuzzy_try(**kwargs)
  File "/Users/wkocher/mozilla/mozilla-central/tools/tryselect/selectors/fuzzy.py", line 220, in run_fuzzy_try
    tg = generate_tasks(parameters, full, root=vcs.path)
  File "/Users/wkocher/mozilla/mozilla-central/tools/tryselect/tasks.py", line 96, in generate_tasks
    tg = getattr(TaskGraphGenerator(root_dir=root, parameters=params), attr)
  File "/Users/wkocher/mozilla/mozilla-central/taskcluster/taskgraph/generator.py", line 168, in target_task_graph
    return self._run_until('target_task_graph')
  File "/Users/wkocher/mozilla/mozilla-central/taskcluster/taskgraph/generator.py", line 341, in _run_until
    k, v = self._run.next()
  File "/Users/wkocher/mozilla/mozilla-central/taskcluster/taskgraph/generator.py", line 286, in _run
    yield verifications('full_task_graph', full_task_graph, graph_config)
  File "/Users/wkocher/mozilla/mozilla-central/taskcluster/taskgraph/util/verify.py", line 36, in __call__
    verification(None, graph, scratch_pad=scratch_pad, graph_config=graph_config)
  File "/Users/wkocher/mozilla/mozilla-central/taskcluster/taskgraph/util/verify.py", line 334, in verify_tier_one_requirements
    .format(testString)

(The "[]" next to the task names printed out will include the current run-on-project value for that task, these particular tasks just don't have anything for that value.)

Updated the patch to fix lint errors and address glandium's comments.

Locally, a verification failure will print out an error during taskgraph generation like comment 11 shows.

That should cause ./mach try to fail in most cases (assuming the chosen selector generates the taskgraph; empty can still proceed), preventing the try push from ever going out.

If you do manage to push with this verification failure, the Decision task will fail like this: https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=retry%2Ctestfailed%2Cbusted%2Cexception%2Csuccess%2Crunning%2Cpending%2Crunnable&group_state=expanded&revision=9588652966c1ef3bcda86e03a22d59a11add1d2a

IMO, this is still something we should do.

Assignee: kwierso → nobody
Severity: normal → --
Severity: -- → N/A
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: