1632946 - [meta] Support for manifest based scheduling

Assignee

Description

•

5 years ago

For smart scheduling we're looking at what are the pieces to move to a manifest based scheduling rather than a task based scheduling.

We currently schedule tasks based if the manifest to be scheduled as part of the task has high value to be scheduled or not.

We believe that currently there's a backfill-and-filter workflow where a sheriff follows these steps:

Task has regressed and uses backfill action
The sheriff either uses the signature or the extend task label to look at the backfilled tasks [1] (there's even a hotkey shortcut to get to it)

Please correct me if that workflow is incorrect.

In the new model of scheduling by manifest we cannot guarantee that the manifest will have the same extended task label (Linux 18.04 x64 asan opt Mochitests test-linux1804-64-asan/opt-mochitest-devtools-chrome-e10s-3 M(dt3)) or signature (c15377f1f0ac8c097f7cd61753999deee12596aa). This is because chunking will be dynamic, thus, manifests can be changing from push to push.

It might be possible to adjust the job signature to take manifests into consideration and exclude the symbol:
https://github.com/mozilla/treeherder/blob/368c112266f4f276251a4886a5337fcb17b3a1e9/treeherder/etl/jobs.py#L147-L171

                [
                    build_system_type,
                    repository.name,
                    build_platform.os_name,
                    build_platform.platform,
                    build_platform.architecture,
                    machine_platform.os_name,
                    machine_platform.platform,
                    machine_platform.architecture,
                    job_group.name,
                    job_group.symbol,
                    job_type.name,
                    job_type.symbol,
                    option_collection_hash,
                    reference_data_name,
                ],

Perhaps we need a new signature as a stepping stone to deprecate the current one.

In the new model, we will backfill new tasks with the list of manifests executed in the backfilled task rather than based on the task label. We will be able to filter those tasks by looking at tasks that have that set of manifests and that platform configuration.

Now, we currently have the ability to filter by manifest by appending &test_paths, however, the platform will also need to match.

Originally I thought we could reach the maximum URL length, however, upon further investigation it is unlikely.

Perhaps we can add a link that will adjust the &test_paths and platform related parameters to match the backfilled tasks. The information is extractable from the taskgraph.json, but there's no convenient place / answer. marco is trying to solve the same problem.

If we use &test_paths we should fix bug 1626623 by moving all the frontend fetching and data manipulations to the backend. That will probably need to define a new Django model to represent a Manifest. A task would probably need a reference to a ManifestGroup which refers to an N number of Manifests. We will need to modify the jobs endpoint to the test_paths property. I wonder if we would need a TestPath model as well which would probably lead to the need for a TestGroup. In short:

Task -1:1-> ManifestGroup -1:N-> Manifest -1:1-> TestGroup -1:N-> TestPath
Such model probably would be the least amount of data needed to store it. We need to evaluate the storage cost. We would also need to verify if cycling data would delete these. I wonder how a test path can be stored compressed rather than as plain text.

This is probably good enough of a description to discuss things.

[1]
Job: (sig): Linux 18.04 x64 asan opt Mochitests test-linux1804-64-asan/opt-mochitest-devtools-chrome-e10s-3 M(dt3)

Flags: needinfo?(cdawson)

Armen [:armenzg]

Assignee

Updated

•

5 years ago

Summary: Support filtering by manifests associated to a backfilled task → Support filtering for manifest-based backfilled task

Kyle Lahnakoski [:ekyle]

Updated

•

5 years ago

Flags: needinfo?(klahnakoski)

Summary: Support filtering for manifest-based backfilled task → Support filtering by manifests associated to a backfilled task

Armen [:armenzg]

Assignee

Updated

•

5 years ago

Summary: Support filtering by manifests associated to a backfilled task → Support filtering for manifest-scheduled backfilled task

Joel Maher ( :jmaher ) (UTC -8) (PTO back normal Nov 17)

Comment 1

•

5 years ago

I believe we need "add new jobs" support as well, likewise "retrigger" support. These workflows would use what is outlined in comment 0.

I like the idea of changing the table sooner to get into a hybrid state, we will need 4 months of data before expiring. We will probably need both models supported (current vs new) as 4 months is a long time; although I think having 4 weeks is enough time to cover almost all if not all scenarios for backfill, retrigger, add new job.

Armen [:armenzg]

Assignee

Comment 2

•

5 years ago

Ahal and I did not believe there was a technichal reason that would make adding new jobs or retriggers require any special changes. As far as we understand we would schedule what the Gecko decision task would have scheduled (the artifact contains all the data) and re-triggers re-run a clone of the task that got scheduled. Please let me know if we made an oversight. In any case, if we were to encounter any issues we missed later on we would tackle it.

We might consider backfilling data if we need to. We can backfill up to when Andrew got the artifacts generated by the Gecko decision task (sometime in Q1).

Joel Maher ( :jmaher ) (UTC -8) (PTO back normal Nov 17)

Comment 3

•

5 years ago

I guess there is an 'add new jobs' feature now, but what would you be adding? mochitest-1 or dom/indexedDB ? how will we display the results of manifest jobs M(m1 m2 m3) where m1 = set of manifests ? then is there a purpose to running the original M(1) job? It is ok if we duplicate tests, but want to think this through.

Cameron Dawson [:camd]

Comment 4

•

5 years ago

So this signature has a couple uses:

We can filter with it by clicking the (sig) link. That will show only those jobs in all the loaded pushes that have the same signature. The link of text strings next to it is likely ALMOST as precise as that. So the filtering aspect of it may not be super crucial.
We use it to trigger "add new jobs". But the signatures in there are not the same as what we store in the DB. The signatures for Runnable Jobs are like this: addon-tps-xpi, condprof-linux64-firefox, searchfox-linux64-searchfox/debug

These signatures do not match the Treeherder signatures. These come from taskcluster from the url we get by calling getRunnableJobsURL.

History: These signatures were originally created for when the sheriffs managed a list of jobs to hide. We now use Tier-3 for that, so that functionality is no longer needed.

I think the only thing that field on the jobs table is used for is that sig filter link.

Flags: needinfo?(cdawson)

Cameron Dawson [:camd]

Comment 5

•

5 years ago

Asked Sebastian to comment on the value of that sig link.

Flags: needinfo?(aryx.bugmail)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 6

•

5 years ago

The sig link will show if the task config changed and also has the benefit of creating a shorter url than the task name. The task name will always show the tasks which match the name even if their config changed.

RyanVM, how is your usage of the 'sig' link at the bottom left?

Flags: needinfo?(aryx.bugmail) → needinfo?(ryanvm)

Ryan VanderMeulen [:RyanVM]

Comment 7

•

5 years ago

Never used it.

Flags: needinfo?(ryanvm)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 8

•

5 years ago

Ok to remove it from the sheriffing side.

Cameron Dawson [:camd]

•

5 years ago

I'm turning this bug into a meta bug because there's various components involved until we get this working.

bug 1633866 takes care of the Firefox build changes to support dynamic scheduling.

Here are some steps I documented in the PR of what is needed (I will update the information there as I think of it):

Store transformed joint data of manifests-by-task.json and tests-by-manifest.json in the proposed models
- https://github.com/mozilla/treeherder/blob/master/ui/job-view/pushes/Push.jsx#L39-L57
- Create API to expose that data (project/<repo>/<revision>/task-to-test-paths)
  - This would not be used by the UI since the jobs endpoint would include manifest/test paths, however,
    it is cheap to add and can be of use for some other systems
- Script to backfill past data?
Bug 1636506 - API to return test_paths as a property
- Switch UI to use new API and deprecate JS code
  - This will improve the memory usage when test_paths is used (bug 1626623)
Add link to the UI that will filter out tasks that run the same platform config and the same set of manifests
- We can take advantage of concatenating &test_paths
- We need to determine if we need to show tasks showing the same manifest set OR tasks that match one of the manifests
  - I'm leaning for the latter one
  - The question is if we want to show past tasks that cover some of the manifests from the originating task OR only
    show tasks that have been backfilled
- This important to get right on the first PR
- This link should show the tasks that a backfill will schedule
Change behaviour of backfills to trigger tasks with the same manifest set
- The backfilled tasks need to run the same manifest set as the originating task

A separate project is to return the test_paths property as part of the jobs endpoint. All of the above will enable to make that change.

Updated

•

5 years ago

Depends on: 1639873

Armen [:armenzg]

Assignee

Comment 13

•

5 years ago

NOTE to self: the source of truth for what manifests and tests paths a task executes is via MOZHARNESS_TEST_PATH env.

Tasks scheduled out of band would not be using what the artifacts generated by the Gecko decision task says they should execute.

Maybe we need to store the value of MOZHARNESS_TEST_PATH in the DB or Redis.
I was planning of returning the tasks' test paths via the jobs endpoint, however, we might want to remove using test_paths as a filtering method and use hyperlink from a selected job.

I'm trying to avoid storing such piece of data in the DB if we don't have to.

I have not this profoundly so don't read too much into it.

Kyle Lahnakoski [:ekyle]

Comment 14

•

5 years ago

On another note, we may want to remove the job signature:

The job signature can be replaced by a (albeit long) unique tuple, as demonstrated in the code. This signature is a design flaw that is hindering our use of the treeherder data in other ways. Specifically, the signature is very specific optimization, that will get in the way of manifest scheduling.

I suggest columns in the signature table be merged with the job table; and code that uses a signature be replaced with code that uses the tuple-of-values it represents. The immediate benefit being the tuple better describes the class of tasks than a hash value does. Using tuples-of-values will also allow shorter tuples (we only need to specify job_type.name or job_type.symbol not both). A bigger benefit comes from other use cases, like manifest scheduling:

The manifest_name will be unique; the job_type.name is irrelevant and job_group.name is functionally dependent on the manifest_name (if you know the manifest_name, you will be able to conclude the suite). This means finding the class of jobs which run a manifest is best described by this tuple (depending on how specific you want to be:

            [
                repository.name,
                machine_platform.platform,
                manifest_name
            ]

By simply storing the signature properties in the job table, we can use different tuples to select jobs in different ways.

Flags: needinfo?(klahnakoski)

Kyle Lahnakoski [:ekyle]

Comment 15

•

5 years ago

From https://github.com/mozilla/treeherder/pull/6384/files

I'm debating between these two relationships:

Job -1:1-> ManifestSet -1:N-> Manifest -1:1-> TestSet -1:N-> TestPath

Job -1:N-> Manifest -1:N-> TestPath

The second is preferred; jobs, manifests and tests are the only entities we are dealing with.

All 1:1 relations are "annotations": A 1:1 relation is logically no different from merging columns from both tables into one: Any columns you may have in ManifestSet can be added to Job for the same effect. 1:1 relations also increase query complexity. That said, 1:1 relations can be useful for avoiding an ALTER TABLE command on the main table.

Armen [:armenzg]

Assignee

Updated

•

5 years ago

Depends on: 1649229

Armen [:armenzg]

Assignee

Updated

•

5 years ago

Depends on: 1650224

Armen [:armenzg]

Assignee

Updated

•

5 years ago

No longer depends on: 1636506

Armen [:armenzg]

Assignee

Updated

•

5 years ago

Type: enhancement → task

Summary: [meta] Support filtering for manifest-scheduled backfilled task → [meta] Support for manifest based scheduling

Armen [:armenzg]

Assignee

Comment 16

•

5 years ago

There are improvements that can be made but for now we shipped this.

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED

Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/6371 5 years ago GitHub Bugzilla PR Linker 47 bytes, text/x-github-pull-request		Details \| Review
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/6384 5 years ago GitHub Bugzilla PR Linker 47 bytes, text/x-github-pull-request		Details \| Review