Closed Bug 1668693 Opened 4 years ago Closed 4 years ago

Unable to backfill further down - backfilled tasks run but are not shown by Treeherder

Categories

(Firefox Build System :: Task Configuration, defect)

defect

Tracking

(firefox85 fixed)

RESOLVED FIXED
85 Branch
Tracking Status
firefox85 --- fixed

People

(Reporter: dluca, Assigned: mozilla)

References

Details

Attachments

(2 files)

I have encoutered this issue while backfilling some Wd jobs on autoland:
https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&fromchange=7cecd07c5e2e53e495cf5dda766868a9145eb2e5&searchStr=osx&selectedTaskRun=Q2bmEwQnS3GSnjUxhztFow.0

No matter how many times I tried to backfill, jobs won't start after further down after the selected job in the link above.

Aki has found this:
"I think the backfill issue is [task 2020-10-01T19:59:14.829Z] 19:59:14 INFO - STDOUT: E InvalidArgumentException: invalid argument (400): unknown variantwheel, expected one of none, key, pointer at line 1 column 215"

That appears to be the test bustage issue, not the backfill issue. I'm wondering:

  • Is this because of a backstop push? Digging, doesn't seem like it for at least one of the backfill issues
  • Is this because the backfilled task is unscheduled, waiting for an upstream dependency? Not finding this.
  • Not sure if there's a limit to how old a push can be for us to backfill.

The log in the backfill action task does say that the task was created. Here's the task view for one of them:
https://firefox-ci-tc.services.mozilla.com/tasks/WFW5OH8RTIW4eLfitiGyyw

The wdspec task is listed as a dependent (though it failed):
https://firefox-ci-tc.services.mozilla.com/tasks/GWbKHNOgTHqzNwDDdO3UJw

But for some reason treeherder doesn't display it. Could there have been some kind of treeherder ingestion issue around this time?

Things at least seem to be working from the taskgraph point of view.

Sheriff note: if you run into this again, it'll be possible to view the task by drilling into the task view of the "backfill" action, and then clicking "See more" to find the "Dependents" section. The requested task should show up there, and clicking it will display its state and a link to the log (under "Artifacts"). I'm only suggesting this as a temporary workaround, not a solution to this bug.

There was at least one more instance around this time of Treeherder showing tasks as still running, while Taskcluster showed them as completed. It's possible there was a larger hiccup.

This is not a hiccup but a general issue.

Steps to reproduce:

  1. Open this Treeherder link.
  2. In the bottom toolbar, click on "..." > "Custom Actions".
  3. Pick "Backfill".
  4. Change depth to 5 and times to 3 and submit.
  5. Wait for the tasks to be shown.

Actual result:

  1. The main action task which schedules the action tasks for each push gets shown by Treeherder.
  2. That launches this action task.
  3. Which creates the actual backfilled mda task which taskcluster shows as running.

Actual result:
Opening the Treeherder link near the top of the task page from step 3 shows the parent push of the one from which we initiated the backfill but the error message Task not found: dY1PkZ73Rb-e8nEUoGZOIQ. The Treeherder database (table taskcluster_metadata) doesn't find the task id.

Comparison of the original backfill *-bk task and the 'missing one only shows the following parameters for run-task as main difference:

    "--total-chunk=4",
    "--this-chunk=2",

Sarah, do you have insights why these tasks might be missing in TH?

Flags: needinfo?(sclements)
Summary: Unable to backfill further down → Unable to backfill further down - backfilled tasks run but are not shown by Treeherder

Hrmm... so the first thing I've noticed is that the url to Treeherder from Taskcluster is wrong.

In the example you have above, the url to the task looks like this: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=9283658152d84f604d80ddf693746f3d2e66e3f7&selectedTaskRun=dY1PkZ73Rb-e8nEUoGZOIQ

When it should be have .1 or .0 to signify the run/retry id.

But even after correcting the url, the task is not found. It hits this API: https://treeherder.mozilla.org/api/jobs/?task_id=dY1PkZ73Rb-e8nEUoGZOIQ&retry_id=1

So this makes me think there is something happening in the task ingestion process where we're either not processing that task or not receiving it. I'll have to dig into this more. cc'ing Cam in case he has any ideas.

Does this only happen with certain jobs/repos or is it random? How frequently has it occurred in the past month?

Flags: needinfo?(sclements)

As far as I know it reproduces always. Sheriffs only backfill on autoland (except rare circumstances on other repositories, e.g. to check if an issue is not related to check-ins but infrastructure).

Circling back to this, I think I need to clarify a few things:

  • Is the issue that only backfilled jobs at a certain depth don't show in Treeherder? What does the depth mean exactly?
  • Is there a way to see the type of message we'd get from the pulse exchange for the kinds of tasks that are not showing in TH?
  • Is the issue that only backfilled jobs at a certain depth don't show in Treeherder? What does the depth mean exactly?
    depth is a parameter for backfill actions which defines to how many of the previous pushes the task shall be added. It has an upper limit or a sheriff might be conservative with scheduling and then wants to add more backfill tasks for the pushes which initially got the tasks from the backfill actions.
  • Is there a way to see the type of message we'd get from the pulse exchange for the kinds of tasks that are not showing in TH?
    Dustin, can you provide any guidance for this, please?
Flags: needinfo?(dustin)

The messages are all documented here. At a guess, whatever is producing these tasks and getting something slightly unusual about them that causes treeherder ingestion to treat them differently. I suppose tracking down the bug in comment 5, if that is not the cause of the issue, would lead to some bits of code that could do with further scrutiny -- perhaps there's another error nearby.

Flags: needinfo?(dustin)

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #9)

I suppose tracking down the bug in comment 5, if that is not the cause of the issue, would lead to some bits of code that could do with further scrutiny -- perhaps there's another error nearby.

Which team would investigating that fall to?

I'm going to push some code to improve our logging surrounding the ingestion. I can't investigate further until I know what the messages look like for those specific tasks. Sheriffs, the next time this situation occurs, please ping me or record the task id in this bug and I can check the logs to see if that task has been caught by an exception.

It's 100% reproducible: Go to this Wr4 backfilled task and try to backfill it (I tried with "..." menu > Custom Actions > Backfill > depth: 2).

Which team would investigating that fall to?

I'm not too sure, but jwalker's team.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #11)

It's 100% reproducible: Go to this Wr4 backfilled task and try to backfill it (I tried with "..." menu > Custom Actions > Backfill > depth: 2).

I tried doing that (accidentally just hitting the backfill button first, oops). However, I realized I could try to manually ingest the task that that you listed in comment #4 and it hit a schema exception for the job symbol:

JSON Schema validation error during Taskcluster message ingestion: 'mda2-dafa26b89ed-bk-1754415059f-bk' is too long

Failed validating 'maxLength' in schema['properties']['symbol']:
    {'description': 'This is the symbol that will appear in a Treeherder '
                    'resultset for a\n'
                    'given push.  This symbol could be something such as '
                    '"B" or a number representing\n'
                    'the current chunk.\n',
     'maxLength': 25,
     'minLength': 0,
     'title': 'symbol',
     'type': 'string'}

This is what one of the messages looks like:

{
   "buildSystem":"taskcluster",
   "owner":"bpostelnicu@mozilla.com",
   "taskId":"758d4f91-9ef7-45bf-9ef2-7114a0664e21/0",
   "retryId":0,
   "isRetried":False,
   "display":{
      "jobSymbol":"mda2-dafa26b89ed-bk-1754415059f-bk",
      "groupSymbol":"M-bk",
      "jobName":"test-linux1804-64-tsan/opt-mochitest-media-e10s-2",
      "groupName":"Mochitests"
   },
   "state":"completed",
   "result":"exception",
   "tier":1,
   "timeScheduled":"2020-10-23T11:57:30.998Z",
   "jobKind":"test",
   "reason":"scheduled",
   "jobInfo":{
      "links":[
         {
            "label":"artifact uploaded",
            "linkText":"live.log",
            "url":"https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/dY1PkZ73Rb-e8nEUoGZOIQ/runs/0/artifacts/public/logs/live.log"
         }
      ],
      "summary":"Mochitest media run ([Treeherder job](https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=9283658152d84f604d80ddf693746f3d2e66e3f7&selectedTaskRun=dY1PkZ73Rb-e8nEUoGZOIQ))"
   },
   "version":1,
   "origin":{
      "kind":"hg.mozilla.org",
      "project":"autoland",
      "revision":"9283658152d84f604d80ddf693746f3d2e66e3f7",
      "pushLogID":0
   },
   "labels":[
      "opt"
   ],
   "buildMachine":{
      "name":"i-0a996d6ba8fed23f1",
      "platform":"linux1804-64-tsan",
      "os":"-",
      "architecture":"-"
   },
   "timeStarted":"2020-10-23T11:57:37.807Z",
   "timeCompleted":"2020-10-23T12:17:39.200Z",
   "logs":[
      
   ]
}

I've asked :jmaher about this since :ahal is on parental leave.

Just to clarify the above ^. I know where it's hitting those exceptions - we have two schema validation checks (at least one of them is unnecessary and can go), but the question I have is whether this job symbol is correct. If so, I can revise the maxLength on the schemas and on the JobType table. Otherwise this task configuration needs to be corrected.

I believe that the limit of 25 is a hard pulse limit, but we'll need to double-check with the taskcluster team if this information is still valid.

We can fix this by changing this line: https://hg.mozilla.org/mozilla-central/file/tip/taskcluster/taskgraph/actions/backfill.py#l153 . What the fix should be may be up for discussion.

(We may want to just add a -bk to the symbol, and add metadata into task.extra? Or maybe we should find an existing -bk, and strip it, since it looks like the offending symbol has multiple: mda2-dafa26b89ed-bk-1754415059f-bk) If the double -bk is useful, perhaps we can reduce the revision from 11 chars to 5?

Component: Treeherder: Job Triggering & Cancellation → Task Configuration
Product: Tree Management → Firefox Build System
Version: --- → unspecified

In my day, symbols were 1-2 characters!

Do the symbols end up in pulse routes anywhere? If so, then the real limit is overall length of the pulse routing key. If not, then there's no platform limit on this value.

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #16)

Do the symbols end up in pulse routes anywhere? If so, then the real limit is overall length of the pulse routing key. If not, then there's no platform limit on this value.

How would we find the answer to this? I don't see any routes in the task definition with the job symbol, and I don't see any dependent tasks, e.g. index tasks.

:dluca, do we explicitly want the symbol to be mda2-dafa26b89ed-bk-1754415059f-bk ? is that useful? Or would it be just as useful to see a symbol of, say, mda2-1754415059f-bk ?

Flags: needinfo?(dustin)
Flags: needinfo?(dluca)

It sounds like you looked at the output of ./mach taskgraph and that's the strategy I'd use.

Flags: needinfo?(dustin)

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #18)

It sounds like you looked at the output of ./mach taskgraph and that's the strategy I'd use.

Makes sense. I only looked at this specific backfill task, so I don't know the story for all hg.m.o branches yet. I also don't know if there's a different story in mobile / ci-{configuration,admin} etc. I'm leaning away from changing the schema if we have a straightforward symbol shortening fix.

The more I think about it and dig into our codebase, the symbol is used in the Treeherder UI (for the job buttons, for example) and in push health so I'd vote for changing it to a more human readable name. When I look at other backfilled tasks, the symbol is "backfill-task" (like here: https://firefox-ci-tc.services.mozilla.com/tasks/R9XuK1QmTKa7SnvS6LuoCA) and that's what you see in the TH UI. Is there a reason why certain backfilled tasks shouldn't be changed (like the one in comment 13) and need to have the lengthy symbol?

We ended up with a symbol mda2-dafa26b89ed-bk-1754415059f-bk, which is longer
than the allowed schema length of 25 characters. I don't believe the multiple
REVISION-bk suffixes are useful; stripping an existing backfill suffix before
appending a new one should help avoid this error case.

Assignee: nobody → aki
Status: NEW → ASSIGNED

Adding a quick test python file, since it's not always easy to eyeball-review regexes.

So this strips it a bit, but it doesn't really answer my question in comment #20 about whether we can just call them 'backfill-task' or if it needs to be bk-sha. Keeping in mind this symbol will show up as a button in the treeherder UI like so: https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=backfill&selectedTaskRun=RhO0rtSPQSGLZKN22Qd-AQ.0

But will not show the button as 'backfill-task', just whatever the symbol is.

This is a question for the sheriffs, but I would be surprised if they're ok with backfill-task. If they backfill 10 different types of tasks in the same sets of pushes, and they're all labeled backfill-task, it would be difficult to tell what tasks are green or red without clicking through.

This is a question for the sheriffs, but I would be surprised if they're ok with backfill-task. If they backfill 10 different types of tasks in the same sets of pushes, and they're all labeled backfill-task, it would be difficult to tell what tasks are green or red without clicking through.

Not sure I follow - Treeherder's UI color codes the symbol name/job buttons based on their status, so it will show if they are green or red at a glance.

Aryx can you weigh in what this task symbol should be renamed to please?

I'm really curious why some tasks have the symbol of 'backfill-task' and some don't and how that is determined.

Flags: needinfo?(aryx.bugmail)
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a08e52498be3
fix too-long treeherder symbols on backfill. r=aryx,taskgraph-reviewers,jmaher DONTBUILD

I asked in matrix #sheriffs yesterday, ~11:38am pst. We were both wrong; if there is an existing symbol-origrevision-bk symbol, we want to keep it rather than replacing it with backfill-task or symbol-newrevision-bk. Landed on autoland with that. This bug should be resolved once this merges to central.

Flags: needinfo?(dluca)
Flags: needinfo?(aryx.bugmail)
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch

(In reply to Aki Sasaki [:aki] (he/him) from comment #27)

I asked in matrix #sheriffs yesterday, ~11:38am pst. We were both wrong; if there is an existing symbol-origrevision-bk symbol, we want to keep it rather than replacing it with backfill-task or symbol-newrevision-bk. Landed on autoland with that. This bug should be resolved once this merges to central.

Thanks!

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: