Closed Bug 1595381 Opened 5 years ago Closed 5 years ago

tasks on pre-migration pushes should allow retriggers, backfills etc.

Categories

(Tree Management :: Treeherder, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: aryx, Unassigned)

References

Details

(Keywords: perf-alert)

Attachments

(1 file)

Retriggering a task which ran before the taskcluster migration fails:

Error message: "Wrong version of actions.json, unable to continue"
404 for https://treeherder.mozilla.org/#/jobs?repo=mozilla-esr68&selectedJob=275535527&revision=a43e93dedfe9c6a6f11fb339ac400f23fd606250

Performance and code sheriffs retrigger or backfill on older pushes to identify what caused a performance regression or triggered an intermittent failure to become frequent.

The failed Android nightlies are also because of that (decision task not found): https://treeherder.mozilla.org/#/jobs?repo=mozilla-esr68&selectedJob=275535527&revision=a43e93dedfe9c6a6f11fb339ac400f23fd606250

A quick, short-term fix to this in TH would be to make a get_repo_root_url utility function which takes a push date and repo and returns repo.tc_root_url if the push date is after november 9, otherwise a literal https://taskcluster.net. That "hack" could be backed out some time later (a month or two).

Component: Operations and Service Requests → Treeherder
Product: Taskcluster → Tree Management
Version: unspecified → ---

Grabbing this, in an attempt to prepare a hotfix patch in advance.

Assignee: nobody → igoldan
Priority: -- → P1
Severity: normal → blocker
Type: task → defect

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #1)

A quick, short-term fix to this in TH would be to make a get_repo_root_url utility function which takes a push date and repo and returns repo.tc_root_url if the push date is after november 9, otherwise a literal https://taskcluster.net. That "hack" could be backed out some time later (a month or two).

Is this a Javascript-only fix? As I've spotted this line of code from treeherder/ui/models/runnableJob.js.

Flags: needinfo?(dustin)

I'm able to reproduce locally, when using yarn start.

Yes, the rough idea is to replace

-    const queue = taskcluster.getQueue(currentRepo.tc_root_url, testMode);
+    const tcwDate = new Date('2019-11-09');
+    const rootUrl = pushDate < tcwDate ? 'https://taskcluster.net' : currentRepo.tc_root_url;
+    const queue = taskcluster.getQueue(currentRepo.tc_root_url, testMode);

(and remove the redefinition of queue on line 102).

From my look at the code, the hard part would be finding pushDate, particularly since it seems that TaskclusterModel.load is sometimes called with job=null.

Flags: needinfo?(dustin)

Couldn't figure out how to prepare the hotfix for this. Cameron, could you take it from here?

Flags: needinfo?(cdawson)
Assignee: igoldan → nobody

Yep, I'm working on this now. Unfortunately, as Dustin stated, it is a little complicated to get that information in some places. Not terrible, I think. Just takes some needle-threading... It may take me a couple hours to get this patch going.

Flags: needinfo?(cdawson)
Assignee: nobody → cdawson
Status: NEW → ASSIGNED

Cam, I'd suggest Dustin as a reviewer once you have a patch ready. I can be a "reviewer of last resort," but I trust Dustin's knowledge of this code much more than I trust mine.

Ping me when you've got something rough -- tom pointed out to me today that most old tasks won't run on the new deployment without some changes landed, so it's not entirely clear that this functionality will even be useful. If the patch is easy, we can land it and see -- but if it's hard, let's reconsider.

As far as credentials go, if you pass in 'taskcluster.net' to taskcluster.getCredentials, its going to look for credentials indexed by that root url so if you want to use firefox-ci credentials instead (if I'm understanding what dustin is saying about running old tasks on the new deployment, not the legacy cluster) you'll need to add some logic there.

As I mentioned in comment 11, I'm not sure this is ever going to work -- the pushes from before the 9th don't have the necessary patches to work on the new deployment. So no matter what Treeherder does, the resulting retriggered jobs would fail.

I think the fix here may be to re-push those old revisions to try and get the perf data there. It would probably be useful to come up with a list of revs to cherry-pick onto those try pushes.

Let's talk about this tomorrow.

Hah, it looks like we don't have a good time to talk synchronously. So, bug comments will have to do.

Tom, is the assessment in the previous comment correct? Do you have a list of revs for cherry-picking?

igoldan, dave, aside from waving a magic wand and making this "just work", could we do anything else to make this easier?

Flags: needinfo?(mozilla)
Flags: needinfo?(igoldan)
Flags: needinfo?(dave.hunt)
Assignee: cdawson → nobody
Status: ASSIGNED → NEW

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #15)

igoldan, dave, aside from waving a magic wand and making this "just work", could we do anything else to make this easier?

I don't know what other alternative I could ask for. Aryx, any thoughts?

Flags: needinfo?(igoldan) → needinfo?(aryx.bugmail)
Assignee: nobody → cdawson
Assignee: cdawson → nobody

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #15)

igoldan, dave, aside from waving a magic wand and making this "just work", could we do anything else to make this easier?

The impact to perf sheriffing has mostly passed as we're able to retrigger and backfill since the migration. I think the best we can do is to consider how Taskcluster infrastructure changes may affect the ability to detect and investigate performance regressions.

Flags: needinfo?(dave.hunt)

Close this bug as INVALID?

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID
Flags: needinfo?(mozilla)
Keywords: perf-alert
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: