Unable to backfill further down - backfilled tasks run but are not shown by Treeherder
Categories
(Firefox Build System :: Task Configuration, defect)
Tracking
(firefox85 fixed)
Tracking | Status | |
---|---|---|
firefox85 | --- | fixed |
People
(Reporter: dluca, Assigned: mozilla)
References
Details
Attachments
(2 files)
I have encoutered this issue while backfilling some Wd jobs on autoland:
https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&fromchange=7cecd07c5e2e53e495cf5dda766868a9145eb2e5&searchStr=osx&selectedTaskRun=Q2bmEwQnS3GSnjUxhztFow.0
No matter how many times I tried to backfill, jobs won't start after further down after the selected job in the link above.
Aki has found this:
"I think the backfill issue is [task 2020-10-01T19:59:14.829Z] 19:59:14 INFO - STDOUT: E InvalidArgumentException: invalid argument (400): unknown variantwheel, expected one of none, key, pointer at line 1 column 215"
Assignee | ||
Comment 1•4 years ago
|
||
That appears to be the test bustage issue, not the backfill issue. I'm wondering:
- Is this because of a backstop push? Digging, doesn't seem like it for at least one of the backfill issues
- Is this because the backfilled task is unscheduled, waiting for an upstream dependency? Not finding this.
- Not sure if there's a limit to how old a push can be for us to backfill.
Comment 2•4 years ago
•
|
||
The log in the backfill action task does say that the task was created. Here's the task view for one of them:
https://firefox-ci-tc.services.mozilla.com/tasks/WFW5OH8RTIW4eLfitiGyyw
The wdspec task is listed as a dependent (though it failed):
https://firefox-ci-tc.services.mozilla.com/tasks/GWbKHNOgTHqzNwDDdO3UJw
But for some reason treeherder doesn't display it. Could there have been some kind of treeherder ingestion issue around this time?
Things at least seem to be working from the taskgraph point of view.
Sheriff note: if you run into this again, it'll be possible to view the task by drilling into the task view of the "backfill" action, and then clicking "See more" to find the "Dependents" section. The requested task should show up there, and clicking it will display its state and a link to the log (under "Artifacts"). I'm only suggesting this as a temporary workaround, not a solution to this bug.
Assignee | ||
Comment 3•4 years ago
|
||
There was at least one more instance around this time of Treeherder showing tasks as still running, while Taskcluster showed them as completed. It's possible there was a larger hiccup.
![]() |
||
Comment 4•4 years ago
|
||
This is not a hiccup but a general issue.
Steps to reproduce:
- Open this Treeherder link.
- In the bottom toolbar, click on "..." > "Custom Actions".
- Pick "Backfill".
- Change
depth
to 5 andtimes
to 3 and submit. - Wait for the tasks to be shown.
Actual result:
- The main action task which schedules the action tasks for each push gets shown by Treeherder.
- That launches this action task.
- Which creates the actual backfilled mda task which taskcluster shows as running.
Actual result:
Opening the Treeherder link near the top of the task page from step 3 shows the parent push of the one from which we initiated the backfill but the error message Task not found: dY1PkZ73Rb-e8nEUoGZOIQ
. The Treeherder database (table taskcluster_metadata
) doesn't find the task id.
Comparison of the original backfill *-bk task and the 'missing one only shows the following parameters for run-task
as main difference:
"--total-chunk=4",
"--this-chunk=2",
Sarah, do you have insights why these tasks might be missing in TH?
Comment 5•4 years ago
|
||
Hrmm... so the first thing I've noticed is that the url to Treeherder from Taskcluster is wrong.
In the example you have above, the url to the task looks like this: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=9283658152d84f604d80ddf693746f3d2e66e3f7&selectedTaskRun=dY1PkZ73Rb-e8nEUoGZOIQ
When it should be have .1
or .0
to signify the run/retry id.
But even after correcting the url, the task is not found. It hits this API: https://treeherder.mozilla.org/api/jobs/?task_id=dY1PkZ73Rb-e8nEUoGZOIQ&retry_id=1
So this makes me think there is something happening in the task ingestion process where we're either not processing that task or not receiving it. I'll have to dig into this more. cc'ing Cam in case he has any ideas.
Does this only happen with certain jobs/repos or is it random? How frequently has it occurred in the past month?
![]() |
||
Comment 6•4 years ago
|
||
As far as I know it reproduces always. Sheriffs only backfill on autoland (except rare circumstances on other repositories, e.g. to check if an issue is not related to check-ins but infrastructure).
Comment 7•4 years ago
|
||
Circling back to this, I think I need to clarify a few things:
- Is the issue that only backfilled jobs at a certain depth don't show in Treeherder? What does the depth mean exactly?
- Is there a way to see the type of message we'd get from the pulse exchange for the kinds of tasks that are not showing in TH?
![]() |
||
Comment 8•4 years ago
|
||
- Is the issue that only backfilled jobs at a certain depth don't show in Treeherder? What does the depth mean exactly?
depth is a parameter for backfill actions which defines to how many of the previous pushes the task shall be added. It has an upper limit or a sheriff might be conservative with scheduling and then wants to add more backfill tasks for the pushes which initially got the tasks from the backfill actions.
- Is there a way to see the type of message we'd get from the pulse exchange for the kinds of tasks that are not showing in TH?
Dustin, can you provide any guidance for this, please?
Comment 9•4 years ago
|
||
The messages are all documented here. At a guess, whatever is producing these tasks and getting something slightly unusual about them that causes treeherder ingestion to treat them differently. I suppose tracking down the bug in comment 5, if that is not the cause of the issue, would lead to some bits of code that could do with further scrutiny -- perhaps there's another error nearby.
Comment 10•4 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #9)
I suppose tracking down the bug in comment 5, if that is not the cause of the issue, would lead to some bits of code that could do with further scrutiny -- perhaps there's another error nearby.
Which team would investigating that fall to?
I'm going to push some code to improve our logging surrounding the ingestion. I can't investigate further until I know what the messages look like for those specific tasks. Sheriffs, the next time this situation occurs, please ping me or record the task id in this bug and I can check the logs to see if that task has been caught by an exception.
![]() |
||
Comment 11•4 years ago
|
||
It's 100% reproducible: Go to this Wr4 backfilled task and try to backfill it (I tried with "..." menu > Custom Actions > Backfill > depth: 2).
Comment 12•4 years ago
|
||
Which team would investigating that fall to?
I'm not too sure, but jwalker's team.
Comment 13•4 years ago
•
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #11)
It's 100% reproducible: Go to this Wr4 backfilled task and try to backfill it (I tried with "..." menu > Custom Actions > Backfill > depth: 2).
I tried doing that (accidentally just hitting the backfill button first, oops). However, I realized I could try to manually ingest the task that that you listed in comment #4 and it hit a schema exception for the job symbol:
JSON Schema validation error during Taskcluster message ingestion: 'mda2-dafa26b89ed-bk-1754415059f-bk' is too long
Failed validating 'maxLength' in schema['properties']['symbol']:
{'description': 'This is the symbol that will appear in a Treeherder '
'resultset for a\n'
'given push. This symbol could be something such as '
'"B" or a number representing\n'
'the current chunk.\n',
'maxLength': 25,
'minLength': 0,
'title': 'symbol',
'type': 'string'}
This is what one of the messages looks like:
{
"buildSystem":"taskcluster",
"owner":"bpostelnicu@mozilla.com",
"taskId":"758d4f91-9ef7-45bf-9ef2-7114a0664e21/0",
"retryId":0,
"isRetried":False,
"display":{
"jobSymbol":"mda2-dafa26b89ed-bk-1754415059f-bk",
"groupSymbol":"M-bk",
"jobName":"test-linux1804-64-tsan/opt-mochitest-media-e10s-2",
"groupName":"Mochitests"
},
"state":"completed",
"result":"exception",
"tier":1,
"timeScheduled":"2020-10-23T11:57:30.998Z",
"jobKind":"test",
"reason":"scheduled",
"jobInfo":{
"links":[
{
"label":"artifact uploaded",
"linkText":"live.log",
"url":"https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/dY1PkZ73Rb-e8nEUoGZOIQ/runs/0/artifacts/public/logs/live.log"
}
],
"summary":"Mochitest media run ([Treeherder job](https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=9283658152d84f604d80ddf693746f3d2e66e3f7&selectedTaskRun=dY1PkZ73Rb-e8nEUoGZOIQ))"
},
"version":1,
"origin":{
"kind":"hg.mozilla.org",
"project":"autoland",
"revision":"9283658152d84f604d80ddf693746f3d2e66e3f7",
"pushLogID":0
},
"labels":[
"opt"
],
"buildMachine":{
"name":"i-0a996d6ba8fed23f1",
"platform":"linux1804-64-tsan",
"os":"-",
"architecture":"-"
},
"timeStarted":"2020-10-23T11:57:37.807Z",
"timeCompleted":"2020-10-23T12:17:39.200Z",
"logs":[
]
}
I've asked :jmaher about this since :ahal is on parental leave.
Comment 14•4 years ago
|
||
Just to clarify the above ^. I know where it's hitting those exceptions - we have two schema validation checks (at least one of them is unnecessary and can go), but the question I have is whether this job symbol is correct. If so, I can revise the maxLength on the schemas and on the JobType table. Otherwise this task configuration needs to be corrected.
Assignee | ||
Comment 15•4 years ago
•
|
||
I believe that the limit of 25 is a hard pulse limit, but we'll need to double-check with the taskcluster team if this information is still valid.
We can fix this by changing this line: https://hg.mozilla.org/mozilla-central/file/tip/taskcluster/taskgraph/actions/backfill.py#l153 . What the fix should be may be up for discussion.
(We may want to just add a -bk
to the symbol, and add metadata into task.extra
? Or maybe we should find an existing -bk
, and strip it, since it looks like the offending symbol has multiple: mda2-dafa26b89ed-bk-1754415059f-bk
) If the double -bk
is useful, perhaps we can reduce the revision from 11 chars to 5?
Assignee | ||
Updated•4 years ago
|
Comment 16•4 years ago
|
||
In my day, symbols were 1-2 characters!
Do the symbols end up in pulse routes anywhere? If so, then the real limit is overall length of the pulse routing key. If not, then there's no platform limit on this value.
Assignee | ||
Comment 17•4 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #16)
Do the symbols end up in pulse routes anywhere? If so, then the real limit is overall length of the pulse routing key. If not, then there's no platform limit on this value.
How would we find the answer to this? I don't see any routes in the task definition with the job symbol, and I don't see any dependent tasks, e.g. index tasks.
:dluca, do we explicitly want the symbol to be mda2-dafa26b89ed-bk-1754415059f-bk
? is that useful? Or would it be just as useful to see a symbol of, say, mda2-1754415059f-bk ?
Comment 18•4 years ago
|
||
It sounds like you looked at the output of ./mach taskgraph
and that's the strategy I'd use.
Assignee | ||
Comment 19•4 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #18)
It sounds like you looked at the output of
./mach taskgraph
and that's the strategy I'd use.
Makes sense. I only looked at this specific backfill task, so I don't know the story for all hg.m.o branches yet. I also don't know if there's a different story in mobile / ci-{configuration,admin} etc. I'm leaning away from changing the schema if we have a straightforward symbol shortening fix.
Comment 20•4 years ago
•
|
||
The more I think about it and dig into our codebase, the symbol is used in the Treeherder UI (for the job buttons, for example) and in push health so I'd vote for changing it to a more human readable name. When I look at other backfilled tasks, the symbol is "backfill-task" (like here: https://firefox-ci-tc.services.mozilla.com/tasks/R9XuK1QmTKa7SnvS6LuoCA) and that's what you see in the TH UI. Is there a reason why certain backfilled tasks shouldn't be changed (like the one in comment 13) and need to have the lengthy symbol?
Assignee | ||
Comment 21•4 years ago
|
||
We ended up with a symbol mda2-dafa26b89ed-bk-1754415059f-bk
, which is longer
than the allowed schema length of 25 characters. I don't believe the multiple
REVISION-bk
suffixes are useful; stripping an existing backfill suffix before
appending a new one should help avoid this error case.
Updated•4 years ago
|
Assignee | ||
Comment 22•4 years ago
|
||
Adding a quick test python file, since it's not always easy to eyeball-review regexes.
Comment 23•4 years ago
|
||
So this strips it a bit, but it doesn't really answer my question in comment #20 about whether we can just call them 'backfill-task' or if it needs to be bk-sha. Keeping in mind this symbol will show up as a button in the treeherder UI like so: https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=backfill&selectedTaskRun=RhO0rtSPQSGLZKN22Qd-AQ.0
But will not show the button as 'backfill-task', just whatever the symbol is.
Assignee | ||
Comment 24•4 years ago
|
||
This is a question for the sheriffs, but I would be surprised if they're ok with backfill-task
. If they backfill 10 different types of tasks in the same sets of pushes, and they're all labeled backfill-task
, it would be difficult to tell what tasks are green or red without clicking through.
Comment 25•4 years ago
|
||
This is a question for the sheriffs, but I would be surprised if they're ok with
backfill-task
. If they backfill 10 different types of tasks in the same sets of pushes, and they're all labeledbackfill-task
, it would be difficult to tell what tasks are green or red without clicking through.
Not sure I follow - Treeherder's UI color codes the symbol name/job buttons based on their status, so it will show if they are green or red at a glance.
Aryx can you weigh in what this task symbol should be renamed to please?
I'm really curious why some tasks have the symbol of 'backfill-task' and some don't and how that is determined.
Comment 26•4 years ago
|
||
Pushed by asasaki@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a08e52498be3 fix too-long treeherder symbols on backfill. r=aryx,taskgraph-reviewers,jmaher DONTBUILD
Assignee | ||
Comment 27•4 years ago
|
||
I asked in matrix #sheriffs yesterday, ~11:38am pst. We were both wrong; if there is an existing symbol-origrevision-bk
symbol, we want to keep it rather than replacing it with backfill-task
or symbol-newrevision-bk
. Landed on autoland with that. This bug should be resolved once this merges to central.
Comment 28•4 years ago
|
||
bugherder |
Comment 29•4 years ago
|
||
(In reply to Aki Sasaki [:aki] (he/him) from comment #27)
I asked in matrix #sheriffs yesterday, ~11:38am pst. We were both wrong; if there is an existing
symbol-origrevision-bk
symbol, we want to keep it rather than replacing it withbackfill-task
orsymbol-newrevision-bk
. Landed on autoland with that. This bug should be resolved once this merges to central.
Thanks!
Description
•