Store taskId of taskcluster jobs in treeherder and make it accessible

RESOLVED FIXED

Status

Tree Management
Treeherder: API
a year ago
10 months ago

People

(Reporter: bstack, Assigned: wlach)

Tracking

(Depends on: 1 bug)

Details

Attachments

(2 attachments, 1 obsolete attachment)

(Reporter)

Description

a year ago
In https://reviewboard.mozilla.org/r/97634 we extract the taskId of the decision task from the task inspector link available from a task. If I'm not mistaken, this is how the taskId is determined in some places in the Treeherder UI as well. It would be nice if the taskId were available by itself too. Not a high-priority change, but something nice-to-have we found during the review.

If this already exists, please just let me know and feel free to close this.
Duplicate of this bug: 1317958
Yeah, this came up in the context of triggering in-tree actions from Treeherder as well (bug 1322433). We should definitely do this, or something equivalent. Having to parse out a URL to get the taskId is ridiculous.

We could consider, in fact, storing this information in the jobs table itself (maybe returning it in the instance endpoint of a job). For legacy buildbot information, we could just store null/None for this field. Did you have thoughts on this Ed?
Flags: needinfo?(emorley)
(In reply to William Lachance (:wlach) from comment #2)
> We could consider, in fact, storing this information in the jobs table
> itself (maybe returning it in the instance endpoint of a job). For legacy
> buildbot information, we could just store null/None for this field. 

I think the field should be generically named (eg "source_job_identifier" or something less verbose) and then for buildbot we'd just set the request ID instead. Ideally we'd also move the "build_system" field (and pick a clearer name; it stores taskcluster vs buildbot) from the job signatures table to the jobs table, alongside this new field. Then a combination of the two fields could allow for many different workflows interacting with upstream job submitters.
Flags: needinfo?(emorley)
I humbly suggest two separate fields `task_id` and `buildbot_id`. The `source_job_identifier` is a context-sensitive field, which adds complexity to code that touches it; specifically, `source_job_identifier` requires `build_system` (or external knowledge) to be usable. Two context free fields can be used independently, you can use them for foreign keys, plus, one can be removed without changing the meaning of the other.
(In reply to Kyle Lahnakoski [:ekyle] from comment #4)
> I humbly suggest two separate fields `task_id` and `buildbot_id`. The
> `source_job_identifier` is a context-sensitive field, which adds complexity
> to code that touches it; specifically, `source_job_identifier` requires
> `build_system` (or external knowledge) to be usable. Two context free fields
> can be used independently, you can use them for foreign keys, plus, one can
> be removed without changing the meaning of the other.

One advantage of storing the scheduler/runner vs. task id separately is that it's expandable to any number of possible scheduler/runner systems, rather than being constrained to just taskcluster and buildbot (and one instance thereof, at that). 

I'm a little confused about why we'd want to use these fields for foreign keys; I also don't understand your point about removing one without changing the meaning of the other. Do you have specific use cases in mind here?
My suggestion comes from the good practice of reducing the number of functional dependencies found in the attributes of the database [1]. In general, reducing functional dependencies simplifies the database, and also simplifies the application code that uses it. I have no specific use case; I am appealing to an aesthetic sense of not adding complexity when it is not needed. Specifically, the `build_system` field is functionally dependent on the `source_job_identifier`. If we wanted to rigorously remove the functional dependencies, we would add more relations:

> CREATE TABLE build_systems (
> 	id   INT NOT NULL PRIMARY KEY,
> 	name VARCHAR(30)
> );
> INSERT INTO build_systems VALUES (1, 'taskcluster');
> INSERT INTO build_systems VALUES (2, 'buildbot');
> 
> CREATE TABLE external_jobs (
> 	id                    INT NOT NULL PRIMARY KEY,
> 	source_job_identifier VARCHAR(50),
> 	build_system          INT,
> 	FOREIGN KEY (build_system) REFERENCES build_systems(id)
> )> ;
>
> ALTER TABLE job ADD FOREIGN KEY (external_job) REFERENCES external_jobs(id)

With the functional dependencies removed, we can see that adding more build_systems is trivial, and will not change the `job` table nor the application code that uses this schema. 

Of course, I do not seriously suggest we normalize the TH database to this extent; all design is a compromise.

My suggestion of two separate fields is an attempt to avoid the need for more tables, yet still get the benefits of removing the functional dependency. In this case, I am suggesting a technique an ORM would use to model inheritance (jobs come in two forms: taskcluster jobs and buildbot jobs). It is not perfect given the database constraints, because it still allows both a taskcluster_id and a buildbot_id for a single job.

Concerning my statement "...you can use them for foreign keys...": You can view your database as just a small part of a much larger data landscape. Your database is designed well when the model it uses for its data matches the subset of the model that should be used for the greater data landscape. Imagine you wanted to track extra taskcluster properties in the TH database: you would make a taskcluster table, and the task_id in your job table would be a perfect foreign key. The same goes for making a buildbot table (with probably a different set of properties); the buildbot_id will make a perfect foreign key. If you used the functionally dependent (`build_system`, `source_job_identifier`) pair, you could not set up such relations without another redesign.

[1] https://en.wikipedia.org/wiki/Functional_dependency
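To illustrate the context-sensitivity argument above, here is a small hypothetical Python sketch (the field and function names are illustrative, not Treeherder code): with the paired fields every consumer must branch on `build_system`, while a dedicated context-free field can be interpreted on its own.

```python
# Hypothetical sketch: interpreting the paired (build_system,
# source_job_identifier) fields requires a branch in every consumer.
def task_inspector_url(build_system, source_job_identifier):
    # Context-sensitive: the identifier is meaningless without build_system.
    if build_system == "taskcluster":
        return "https://tools.taskcluster.net/task-inspector/#%s" % source_job_identifier
    elif build_system == "buildbot":
        raise ValueError("buildbot request ids have no task inspector page")
    raise ValueError("unknown build system: %s" % build_system)

# Context-free alternative: a dedicated task_id field needs no extra context.
def task_inspector_url_v2(task_id):
    return "https://tools.taskcluster.net/task-inspector/#%s" % task_id
```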
When we wrote the pulse ingestion format we added 3 identifiers that combined should uniquely identify the task/job [0]:
  buildSystem:  Name of build system, examples: 'buildbot', 'taskcluster'
  taskId:       Unique id for a set of similar jobs (when combined with retryId this is the job_guid)
  retryId:      Integer count of how many times it's been retried
---
Unfortunately, in taskcluster-treeherder it was implemented as [1]:
Example A)
  taskId: `${slugid.decode(taskId)}/${runId}`,
  retryId: runId,
When it was the intent that it should have been implemented as:
Example B)
  taskId: `${taskId}`,
  retryId: runId,
and then the treeherder ingestion should have created job_guid as:
Example C)
  job_guid = taskId + '/' + retryId
I don't know if this was ever done.

But if someone can ensure that (C) is implemented in treeherder ingestion,
I would be happy to switch from (A) to (B) in taskcluster-treeherder.
(Obviously, this will mess up status of active job symbols when we do this).

The idea when moving to pulse ingestion was to feed taskId/retryId to treeherder.
Then have treeherder construct the ugly job_guid like: taskId + '/' + retryId,
but then it would be possible for treeherder to refactor and store the
taskId and retryId as separate properties in the future.
I see that we failed to get this right. So there is no way around a hard compatibility break.

[0] https://github.com/mozilla/treeherder/blob/master/schemas/pulse-job.yml
[1] https://github.com/taskcluster/taskcluster-treeherder/blob/44e5f2d0ca74af2b812b624fd49d19cc83853c22/src/handler.js#L199-L200
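The intended scheme in (B)/(C) can be sketched in Python (illustrative only; the real ingestion lives in Treeherder's job loader): compose job_guid from taskId and retryId, and split it back when the parts are needed separately.

```python
def make_job_guid(task_id, retry_id):
    # (C): treeherder ingestion composes the guid from the two parts.
    return "%s/%d" % (task_id, retry_id)

def split_job_guid(job_guid):
    # Inverse: recover the separate properties, e.g. for storing
    # taskId and retryId as individual columns later.
    task_id, _, retry_id = job_guid.rpartition("/")
    return task_id, int(retry_id)

guid = make_job_guid("abc123XYZ", 2)   # -> "abc123XYZ/2"
assert split_job_guid(guid) == ("abc123XYZ", 2)
```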
Making it clearer that we don't necessarily want to use job details here.
Summary: Allow jobdetail to have taskcluster taskId of job → Store taskId of taskcluster jobs in treeherder and make it accessible
Depends on: 1334654
I'm going to take this one; it will make other work I want to do easier and is a good break from trying to fully understand taskcluster.
Assignee: nobody → wlachance

Comment 10

11 months ago
Created attachment 8835792 [details] [review]
[treeherder] jonasfj:patch-2 > mozilla:master

Comment 11

11 months ago
Created attachment 8836128 [details] [review]
[treeherder] wlach:1323110 > mozilla:master
Comment on attachment 8836128 [details] [review]
[treeherder] wlach:1323110 > mozilla:master

Hey Cam, I know you're away right now, but we're not in a *gigantic* rush to push this (though I would like to get it done sooner than later) so it can wait until next week when you're back. I'm pretty happy with how my implementation turned out.

I know that Jonas would have preferred to also fix taskcluster-treeherder which would render some of the convoluted hacks in job_loader.py unnecessary, but I'm not sure if we should block doing this work (which makes a lot of things easier) on that.
Attachment #8836128 - Flags: review?(cdawson)
Attachment #8836128 - Flags: feedback?(jopsen)
Attachment #8836128 - Flags: feedback?(emorley)
For those curious, my implementation wound up using an external table with the following format:

Table: taskcluster_metadata

CREATE TABLE `taskcluster_metadata` (
  `job_id` bigint(20) NOT NULL,
  `task_id` varchar(24) COLLATE utf8_bin NOT NULL,
  `retry_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`job_id`),
  CONSTRAINT `model_taskclustermetadata_job_id_943e05ff_fk_job_id` FOREIGN KEY (`job_id`) REFERENCES `job` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin

Per discussion amongst treeherder devs, I've decided to focus my solution here on taskcluster, rather than try to come up with some kind of "ultimate" schema for supporting everything else under the sun: buildbot and others can just continue using the job details table to store their custom information.
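The schema above can be exercised with an illustrative sqlite3 sketch (a simplified stand-in job table, not the real Treeherder models): job_id is the primary key, while task_id carries no uniqueness constraint, so retriggers of the same task can coexist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY)")  # simplified stand-in
conn.execute("""
    CREATE TABLE taskcluster_metadata (
        job_id   INTEGER NOT NULL PRIMARY KEY REFERENCES job(id),
        task_id  TEXT NOT NULL,
        retry_id INTEGER NOT NULL
    )""")

# Two jobs sharing a task_id (a retrigger) must both be storable:
conn.execute("INSERT INTO job (id) VALUES (1)")
conn.execute("INSERT INTO job (id) VALUES (2)")
conn.execute("INSERT INTO taskcluster_metadata VALUES (1, 'abc123XYZ', 0)")
conn.execute("INSERT INTO taskcluster_metadata VALUES (2, 'abc123XYZ', 1)")
rows = conn.execute(
    "SELECT job_id, retry_id FROM taskcluster_metadata "
    "WHERE task_id = 'abc123XYZ' ORDER BY retry_id").fetchall()
# rows == [(1, 0), (2, 1)]
```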
Comment on attachment 8836128 [details] [review]
[treeherder] wlach:1323110 > mozilla:master

Yeah, okay fixing the ingestion to do the backwards compatible thing after this would be pretty easy.
Attachment #8836128 - Flags: feedback?(jopsen) → feedback+

Comment 15

11 months ago
Comment on attachment 8836128 [details] [review]
[treeherder] wlach:1323110 > mozilla:master

Only had a quick glance, but I've left a comment.
Attachment #8836128 - Flags: feedback?(emorley) → feedback+

Comment 16

11 months ago
Comment on attachment 8836128 [details] [review]
[treeherder] wlach:1323110 > mozilla:master

I asked for one comment, but otherwise, I think this is ready to go.  Thanks!!
Attachment #8836128 - Flags: review?(cdawson) → review+
Filed bug 1341345 to deal with the treeherder-taskcluster formatting issues, to reduce confusion.
(Assignee)

Updated

11 months ago
Blocks: 1330762
This seems to be working well on stage, resolving.
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED

Updated

11 months ago
Depends on: 1342371

Comment 20

11 months ago
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/6b77a588f6d920e439f929b394ddfacba4ba6f29
Revert "Bug 1323110 - Return taskcluster information in job detail endpoint"

This reverts commit 6832102dffea24eebd63d17ffd945559e9c3b3ae.

https://github.com/mozilla/treeherder/commit/79e930338eddc6bc85af6f18dc42d351260f0943
Revert "Bug 1323110 - Store task and retry ids from taskcluster jobs explicitly"

This reverts commit f0eaf7f954957581443021d598ddf506a981d10a.
(Assignee)

Updated

11 months ago
Duplicate of this bug: 1342371
Bug 1342371 showed a bug in the original implementation: I incorrectly added a unique constraint on the taskId field, which prevented retriggers from appearing.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 23

11 months ago
Since the migration has been reverted, the table will need manually dropping (or a `./manage.py migrate <undo X>` command run on prod before this is deployed), otherwise the next attempt at landing will fail during migrate.

As such, might it be best to revert the ingestion part only, and leave the model (which will be fixed in the followup)?

Also, I was going to post this on bug 1342371 but got mid-aired:

(In reply to William Lachance (:wlach) (use needinfo!) from bug 1342371 comment #2)
> The problem is that we have a unique constraint on the task id, but that
> obviously doesn't hold in the case of retries. 

Ah that too.

However seeing that TaskclusterMetadata has a OneToOneField mapping to Job, the job_id should be the only unique field here, in which case the get_or_create() should still be using `defaults`. (Or the get_or_create switched to create or update_or_create per bug 1342371 comment 1).
(In reply to Ed Morley [:emorley] from comment #23)
> Since the migration has been reverted, the table will need manually dropping
> (or a `./manage.py migrate <undo X>` command run on prod before this is
> deployed), otherwise the next attempt at landing will fail during migrate.
> 
> As such, might it be best to revert the ingestion part only, and leave the
> model (which will be fixed in the followup)?

Too late, I've already done it. I don't think it's worth complicating this in any case -- I'll just reapply the original commit verbatim, and put a fix on top of that.

> Also, I was going to post this on bug 1342371 but got mid-aired:
> 
> (In reply to William Lachance (:wlach) (use needinfo!) from bug 1342371
> comment #2)
> > The problem is that we have a unique constraint on the task id, but that
> > obviously doesn't hold in the case of retries. 
> 
> Ah that too.
> 
> However seeing that TaskclusterMetadata has a OneToOneField mapping to Job,
> the job_id should be the only unique field here, in which case the
> get_or_create() should still be using `defaults`. (Or the get_or_create
> switched to create or update_or_create per bug 1342371 comment 1).

Yes, I agree. I think get_or_create is still the right one -- task id / retry information shouldn't change per job after first submission.

Comment 25

11 months ago
(In reply to William Lachance (:wlach) (use needinfo!) from comment #24)
> Too late, I've already done it. I don't think it's worth complicating this
> in any case -- I'll just reapply the original commit verbatim, and put a fix
> on top of that.

Sounds good :-)

> Yes, I agree. I think get_or_create is still the right one -- task id /
> retry information shouldn't change per job after first submission.

I agree it shouldn't change after first submission. However, if that's the case, given that we don't use the return value, wouldn't `create()` (optionally with a try-except that ignores, or logs and then ignores, the IntegrityError) be more appropriate? (We use `Job.objects.create()` for example; albeit that case probably wants the try-except logging at debug level, given the recent exceptions we saw)
(In reply to Ed Morley [:emorley] from comment #25)
> > Yes, I agree. I think get_or_create is still the right one -- task id /
> > retry information shouldn't change per job after first submission.
> 
> I agree it shouldn't change after first submission. However if that's the
> case, given that we don't use the return value, wouldn't `create()` (with
> optionally a try-except that ignores or logs then ignores the
> IntegrityError) be more appropriate? (We use `Job.objects.create()` for
> example; albeit that case probably wants the try-except logging to debug,
> given the recent exceptions we saw)

I don't really know, the two seem pretty equivalent to me. It's normal for jobs to be submitted multiple times (as they change state) so it's not really an error. And create() also has a return value. But whichever, it doesn't really matter to me. :)

I'll do up a PR shortly.

Comment 27

11 months ago
Created attachment 8840913 [details] [review]
[treeherder] wlach:1323110-take2 > mozilla:master
(Assignee)

Updated

11 months ago
Attachment #8840913 - Flags: review?(emorley)

Comment 28

11 months ago
(In reply to William Lachance (:wlach) (use needinfo!) from comment #26)
> I don't really know, the two seem pretty equivalent to me. It's normal for
> jobs to be submitted multiple times (as they change state) so it's not
> really an error. And create() also has a return value. But whichever, it
> doesn't really matter to me. :)

get_or_create() generates two SQL statements for an insert:
https://github.com/django/django/blob/1.10.5/django/db/models/query.py#L462-L475

...compared to one for create():
https://github.com/django/django/blob/1.10.5/django/db/models/query.py#L392-L400
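Outside Django, the difference between the two idioms can be sketched with sqlite3 (illustrative only; the real code uses the Django ORM): get_or_create is a SELECT followed by an INSERT when the row is missing, while create is a single INSERT with the duplicate case surfacing as an IntegrityError.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taskcluster_metadata "
             "(job_id INTEGER PRIMARY KEY, task_id TEXT, retry_id INTEGER)")

def get_or_create(job_id, task_id, retry_id):
    # Two statements: SELECT, then INSERT only if missing (get_or_create style).
    row = conn.execute("SELECT job_id FROM taskcluster_metadata "
                       "WHERE job_id = ?", (job_id,)).fetchone()
    if row is None:
        conn.execute("INSERT INTO taskcluster_metadata VALUES (?, ?, ?)",
                     (job_id, task_id, retry_id))

def create(job_id, task_id, retry_id):
    # One statement: INSERT, swallowing the duplicate-key error (create style).
    try:
        conn.execute("INSERT INTO taskcluster_metadata VALUES (?, ?, ?)",
                     (job_id, task_id, retry_id))
    except sqlite3.IntegrityError:
        pass  # job already submitted; task/retry ids don't change afterwards

get_or_create(1, "abc123XYZ", 0)
get_or_create(1, "abc123XYZ", 0)  # no-op on resubmission
create(2, "abc123XYZ", 1)
create(2, "abc123XYZ", 1)         # IntegrityError swallowed
```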

Comment 29

11 months ago
Comment on attachment 8840913 [details] [review]
[treeherder] wlach:1323110-take2 > mozilla:master

r+ with the Travis failure resolved :-)
Attachment #8840913 - Flags: review?(emorley) → review+

Comment 30

11 months ago
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/0409fff9d1ea3c546bf012c5cd39f8defff80b90
Bug 1323110 - Return taskcluster information in job detail endpoint

https://github.com/mozilla/treeherder/commit/6b96b91fbf3c28c57381fd290405814960476796
Bug 1323110 - Store task and retry ids from taskcluster jobs explicitly

https://github.com/mozilla/treeherder/commit/38149ad0be0a65be3a4f4e080bb662d697c079c9
Bug 1323110 - Don't have a uniqueness constraint on task id column

We can have multiple jobs with the same task id in the case of retries

https://github.com/mozilla/treeherder/commit/de5e188a5abc8455decf567d5eb22fe937f37129
Bug 1323110 - Skip SETA test that started failing after ingestion logic fixed

Updated

11 months ago
Status: REOPENED → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED

Updated

11 months ago
See Also: → bug 1345220

Updated

11 months ago
Depends on: 1345220
See Also: bug 1345220

Comment 31

10 months ago
Comment on attachment 8835792 [details] [review]
[treeherder] jonasfj:patch-2 > mozilla:master

This part has moved to bug 1334654.
Attachment #8835792 - Attachment is obsolete: true