Closed Bug 1585757 Opened 11 months ago Closed 1 month ago

Retriggers use build for previous push if backfill requested before

Categories

(Firefox Build System :: Task Configuration, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Unassigned)

Details

(Keywords: intermittent-failure, regression)

Filed by: dvarga [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=269458109&repo=autoland
Full log: https://queue.taskcluster.net/v1/task/HL4JRIPwRHKZJGeYAtqMkA/runs/0/artifacts/public/logs/live_backing.log


[task 2019-10-02T15:58:02.467Z] 15:58:02     INFO -  Included file 'Z:\task_1570028969\build\tests\mochitest\tests\toolkit\mozapps\extensions\test\mochitest\mochitest.ini' does not exist
[task 2019-10-02T15:58:02.467Z] 15:58:02     INFO -  Included file 'Z:\task_1570028969\build\tests\mochitest\tests\toolkit\xre\test\mochitest.ini' does not exist
[task 2019-10-02T15:58:02.467Z] 15:58:02     INFO -  Included file 'Z:\task_1570028969\build\tests\mochitest\tests\uriloader\exthandler\tests\mochitest\mochitest.ini' does not exist
[task 2019-10-02T15:58:02.467Z] 15:58:02     INFO -  Included file 'Z:\task_1570028969\build\tests\mochitest\tests\widget\tests\mochitest.ini' does not exist
[task 2019-10-02T15:58:02.467Z] 15:58:02    ERROR - No tests were found for flavor 'plain' and the following manifest filters:
[task 2019-10-02T15:58:02.467Z] 15:58:02    ERROR - skip_if, run_if, fail_if, remove_imptest_failure_expectations, subsuite(name=None), chunk_by_dir(1, 5, 4)
[task 2019-10-02T15:58:02.468Z] 15:58:02    ERROR - 
[task 2019-10-02T15:58:02.468Z] 15:58:02    ERROR - Make sure the test paths (if any) are spelt correctly and the corresponding
[task 2019-10-02T15:58:02.468Z] 15:58:02    ERROR - --flavor and --subsuite are being used. See `mach mochitest --help` for a
[task 2019-10-02T15:58:02.468Z] 15:58:02    ERROR - list of valid flavors.
[task 2019-10-02T15:58:02.468Z] 15:58:02    ERROR - 
[task 2019-10-02T15:58:02.468Z] 15:58:02     INFO - SUITE-START | Running 0 tests
[task 2019-10-02T15:58:02.468Z] 15:58:02     INFO -  0 INFO TEST-START | Shutdown
[task 2019-10-02T15:58:02.468Z] 15:58:02     INFO -  1 INFO Passed:  0
[task 2019-10-02T15:58:02.468Z] 15:58:02     INFO -  2 INFO Failed:  0
[task 2019-10-02T15:58:02.468Z] 15:58:02     INFO -  3 INFO Todo:    0
[task 2019-10-02T15:58:02.468Z] 15:58:02     INFO -  4 INFO Mode:    e10s
[task 2019-10-02T15:58:02.468Z] 15:58:02     INFO -  5 INFO SimpleTest FINISHED
[task 2019-10-02T15:58:02.469Z] 15:58:02     INFO - Buffered messages finished
[task 2019-10-02T15:58:02.469Z] 15:58:02     INFO - SUITE-END | took 0s
[task 2019-10-02T15:58:02.502Z] 15:58:02    ERROR - Return code: 1
[task 2019-10-02T15:58:02.502Z] 15:58:02    ERROR - No checks run.
[task 2019-10-02T15:58:02.502Z] 15:58:02     INFO - TinderboxPrint: mochitest-mochitest-plain-chunked<br/><em class="testfail">T-FAIL</em>
[task 2019-10-02T15:58:02.502Z] 15:58:02    ERROR - # TBPL FAILURE #
[task 2019-10-02T15:58:02.503Z] 15:58:02  WARNING - setting return code to 2
[task 2019-10-02T15:58:02.503Z] 15:58:02    ERROR - The mochitest suite: mochitest-plain-chunked ran with return status: FAILURE
[task 2019-10-02T15:58:02.503Z] 15:58:02     INFO - Running post-action listener: _package_coverage_data
[task 2019-10-02T15:58:02.503Z] 15:58:02     INFO - Running post-action listener: _resource_record_post_action
[task 2019-10-02T15:58:02.503Z] 15:58:02     INFO - Running post-action listener: process_java_coverage_data
[task 2019-10-02T15:58:02.503Z] 15:58:02     INFO - [mozharness: 2019-10-02 15:58:02.503000Z] Finished run-tests step (success)
[task 2019-10-02T15:58:02.503Z] 15:58:02     INFO - Running post-run listener: _resource_record_post_run

These failures are all for one push and on one platform (Windows 10 asan). Reruns for one of those failures were green: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=269458109&revision=b3743b6fb2f3408201cb491fa4f3c8d69222a1a6

Andrew, have you seen anything like this before, and is there a difference in the logic that would explain why the retriggers passed?

Flags: needinfo?(ahal)

I haven't seen this before and I'm not sure how it would be possible :/.

Were there any issues with the build task?

Flags: needinfo?(ahal)
Component: Mochitest → Treeherder: Job Triggering & Cancellation
Product: Testing → Tree Management
Version: Version 3 → ---

Chatted with Aryx on Zoom. This may be related to doing a Retrigger All (pinned jobs) with jobs from multiple pushes. I will investigate and do some testing on my end in Treeherder. That said, we could not reproduce the problem in the case we tried: the jobs used the correct installer_urls.

Flags: needinfo?(cdawson)

Cameron and I took a look at it. The retrigger (https://tools.taskcluster.net/groups/HylFB9IrRtWk6laSKtuNcA/tasks/YMUtQw-BS8unY8tZS0k3cw/details) has the Gecko decision task (https://tools.taskcluster.net/tasks/HylFB9IrRtWk6laSKtuNcA) for the same push, and the action task (https://tools.taskcluster.net/groups/HylFB9IrRtWk6laSKtuNcA/tasks/PZpzJ2qNRZeUnfLqRtDRKA/details) references it. But the build dependency, the Windows asan opt build (https://tools.taskcluster.net/tasks/eqsZsOBvTISMlKgdka0o5g), belongs to the previous push. In the action task, the correct Gecko decision task is referenced: HylFB9IrRtWk6laSKtuNcA

There is also a line:
No label-to-taskid.json found for UNFTUAyoRxO6lTw9Xxt-xw: 404 Client Error: Not Found for url: https://queue.taskcluster.net/v1/task/UNFTUAyoRxO6lTw9Xxt-xw/artifacts/public/label-to-taskid.json

Dustin, any idea where the switch to a build from a different push could originate from?

Flags: needinfo?(dustin)

This sounds like something else I did some digging on 2-3 weeks ago, but I can't find that bug now. It came down to some unintuitive interplay between action tasks and the records kept in files like label-to-taskid.json and full-task-graph.json. To dig into this, I'd suggest mapping out all of those files and also looking at what existed when.
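One way to do the "what existed when" mapping is to collect the label-to-taskid.json contents of the decision task and every action task involved, then list in time order what each one resolves a given label to. The sketch below is a rough illustration only: `TaskRecord`, `resolution_history`, and all IDs and labels in the example are hypothetical, and it assumes the artifacts have already been downloaded (with `None` standing in for a 404 such as the one quoted above).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TaskRecord:
    """One decision or action task and the mapping it published (hypothetical model)."""
    task_id: str
    kind: str  # "decision" or "action"
    created: str  # ISO 8601 UTC timestamp; same format everywhere, so it sorts lexically
    label_to_taskid: Optional[Dict[str, str]]  # None if the artifact was missing (404)


def resolution_history(records: List[TaskRecord], label: str) -> List[Tuple[str, str, str]]:
    """Return (created, publishing task, resolved task) for `label`, oldest first."""
    history = []
    for rec in sorted(records, key=lambda r: r.created):
        if rec.label_to_taskid is None:
            continue  # nothing was published, so this task cannot influence lookups
        if label in rec.label_to_taskid:
            history.append((rec.created, rec.task_id, rec.label_to_taskid[label]))
    return history


# Hypothetical data: a decision task, an action without an artifact, and a later action.
records = [
    TaskRecord("action-2", "action", "2019-10-02T15:00:00Z", {"build-label": "build-prev"}),
    TaskRecord("decision-1", "decision", "2019-10-02T14:00:00Z", {"build-label": "build-cur"}),
    TaskRecord("action-1", "action", "2019-10-02T14:30:00Z", None),
]
for row in resolution_history(records, "build-label"):
    print(row)
```

If the last row for a build label points at a task from a different push than the first row, that is the place to look for the switch.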

Flags: needinfo?(dustin)

Steps to reproduce:

1. Find a push on a tree for which you are allowed to trigger and backfill.
2. Select a test task, e.g. a mochitest M one.
3. From the "..." menu at the bottom left, use "Backfill".
4. Wait for the action task for the backfill (AC(Bk)) to finish.
5. Retrigger the same task for which you backfilled.

When I do a retrigger > backfill > retrigger, the retriggered jobs use the same installer URLs (I waited until each action task finished before triggering the next one; please correct me if more wait time is needed).

FWIW, Treeherder calls retrigger-multiple and breakpoints in https://github.com/mozilla/treeherder/blob/master/ui/models/job.js point to the correct decision task.

https://searchfox.org/mozilla-central/rev/1fe0cf575841dbf3b7e159e88ba03260cd1354c0/taskcluster/taskgraph/actions/util.py#66 indicates gecko.v2.autoland.pushlog-id.99293.actions does not exist (only .decision). 99293 is the push id submitted for retriggers of https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&selectedJob=273799497&resultStatus=pending%2Crunning%2Csuccess%2Csuperseded%2Cusercancel%2Cretry%2Ctestfailed%2Cbusted%2Cexception&searchStr=browser-chrome&revision=55724db5349e429c044d3493eb13bfc94c620ecf
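For reference, the two index namespaces being compared can be written out as below. This is only a sketch: the format is inferred from the namespace string quoted in this comment, and the function names and the `trust_domain` parameter are made up for illustration rather than taken from util.py.

```python
from typing import Dict, Iterable, List


def push_index_namespaces(project: str, push_id: int, trust_domain: str = "gecko") -> Dict[str, str]:
    """Build the per-push index namespaces (format assumed from the string quoted above)."""
    base = f"{trust_domain}.v2.{project}.pushlog-id.{push_id}"
    return {"decision": f"{base}.decision", "actions": f"{base}.actions"}


def available_sources(existing: Iterable[str], project: str, push_id: int) -> List[str]:
    """Return which of the two namespaces are present in a known set of namespaces."""
    known = set(existing)
    namespaces = push_index_namespaces(project, push_id)
    return [kind for kind, name in namespaces.items() if name in known]


# The situation described above: only the .decision namespace exists for push 99293.
existing = ["gecko.v2.autoland.pushlog-id.99293.decision"]
print(available_sources(existing, "autoland", 99293))
```

With only the decision namespace present, no action task's records are indexed under that push.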

Tom, any idea what causes this unexpected behavior?

Component: Treeherder: Job Triggering & Cancellation → Task Configuration
Flags: needinfo?(mozilla)
Priority: P5 → --
Product: Tree Management → Firefox Build System
Summary: Intermittent No tests were found for flavor 'plain' and the following manifest filters: → Retriggers use build for previous push if backfill requested before
Version: --- → unspecified

Increasing severity as this affects all tasks on the platform. For example, the Linux x64 debug tests of https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&tochange=304f062595a5d8adf3f7f5932b48f305213e64dc&fromchange=d8796ee34018b922213c371407c701e679cc95c2 ran against the build from before the backout, even though they were scheduled for the backout push.

Severity: normal → critical

The issue here is that the backfill job is an action on one push, but creates jobs on other pushes. Thus, those jobs can't be found by other actions looking at the other pushes, but can be found on the original push. The backfill action needs to be split into two parts: one that runs on the original push and triggers actions on the other pushes, and the per-push actions, which create the appropriate tasks.
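A toy model of that explanation, in case it helps. It assumes (this is not confirmed anywhere in this bug) that a later action resolves labels by overlaying the label-to-taskid mappings of earlier actions recorded on the same push on top of the decision task's mapping, and that the backfill's mapping includes the build label. All labels, IDs, and the `merged_labels` helper are hypothetical.

```python
from typing import Dict, List

Mapping = Dict[str, str]


def merged_labels(decision: Mapping, actions_on_push: List[Mapping]) -> Mapping:
    """Overlay each action's mapping on the decision task's mapping (assumed behavior)."""
    merged = dict(decision)
    for mapping in actions_on_push:
        merged.update(mapping)  # later entries win
    return merged


# Push N and the push before it, each with its own build (hypothetical ids).
decision_n = {"build-asan": "build-N", "test-mochitest": "test-N"}
decision_prev = {"build-asan": "build-N-1"}

# Current behavior: one backfill action recorded on push N, creating tasks for push N-1.
backfill_on_n = {"build-asan": "build-N-1", "test-mochitest": "backfill-test-N-1"}
broken_n = merged_labels(decision_n, [backfill_on_n])
broken_prev = merged_labels(decision_prev, [])  # push N-1 never hears about the new test

# Proposed split: push N only records a coordinating action with no task labels,
# and an action recorded on push N-1 creates and records the backfilled test.
fixed_n = merged_labels(decision_n, [{}])
fixed_prev = merged_labels(decision_prev, [{"test-mochitest": "backfill-test-N-1"}])

print("retrigger on N, current:", broken_n["build-asan"])
print("retrigger on N, split:  ", fixed_n["build-asan"])
```

In the current arrangement a retrigger on push N picks up the build from push N-1, while the backfilled test is invisible to actions on push N-1. With the split, each push only sees tasks that belong to it.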

Flags: needinfo?(mozilla)

It might make sense to see if we can handle the failure case of Bug 1617107 better when we address this, too.

Is this now fixed? (Since we now schedule an intermediary action for every push)

I believe so. Please reopen if it is seen happening again.

Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED