Open Bug 1617107 Opened 5 years ago Updated 1 year ago

Taskcluster task failing silently (bustage on base build prevents other pgo builds and tests from running)

Categories

(Firefox Build System :: Task Configuration, defect, P3)

Tracking

(Not tracked)

People

(Reporter: alexandrui, Unassigned)

Details

After several attempts to backfill test-android-hw-p2-8-0-android-aarch64/pgo-raptor-tp6m-12-geckoview-cold-e10s, I checked the rt/Bk jobs from the Gecko Decision Task opt and found that a Taskcluster task is failing silently because of an exception on Android 5.0 AArch64 PGO.
Context:
Sheriffing alert 24945
Doing backfill on Android 8.0 Pixel2 AArch64 pgo / tp6m-c-12
Following the jobs, I reached two exceptions, which end in two corresponding failures.

Component: Performance → General
Product: Testing → Taskcluster
Version: Version 3 → unspecified

Geoff, any info about this?

Flags: needinfo?(gbrown)

I notice that Android 4.0 API16+ pgo / run tasks are marked busted and fixed by commit https://hg.mozilla.org/integration/autoland/rev/c7172b32a80d6ec2be386c60d48f42b4dbbcf5a6, in that range. I don't see any Android 5.0 AArch64 pgo builds in that range, so I wonder if Android 5.0 AArch64 pgo builds are dependent on Android 4.0 API16+ pgo builds. :mshal might be able to tell us...

Flags: needinfo?(gbrown) → needinfo?(mshal)

Yeah, all Android PGO builds use the profile data from the android-api-16 build and run task. This comes from the 'use-pgo' attribute in the task definition:

https://searchfox.org/mozilla-central/rev/c1e3d3edd4a9b784971555dc74a5de23d768b2e1/taskcluster/ci/build/android.yml#232

So it will be difficult to get an Android 5.0 AArch64 PGO build in that range because the run task to generate the profile data had a high failure rate until the patch mentioned in #c2 was backed out.
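
To make it easier to see which builds are affected, here is a minimal Python sketch that lists the task definitions carrying 'use-pgo'. It assumes the definitions have already been loaded into plain dictionaries; the task names and the attribute values below are hypothetical, and only the 'use-pgo' key name comes from android.yml.

```python
# Hypothetical, simplified task definitions. Only the 'use-pgo' key name is
# taken from android.yml; the task names and values are illustrative.
TASKS = {
    "android-api-16/pgo": {"use-pgo": True},
    "android-aarch64/pgo": {"use-pgo": True},
    "android-api-16/opt": {},
}


def tasks_using_pgo(tasks):
    """Return the sorted names of task definitions that carry 'use-pgo'."""
    return sorted(
        name for name, definition in tasks.items() if "use-pgo" in definition
    )


print(tasks_using_pgo(TASKS))
```

Every task returned this way would be blocked whenever the shared profile-generation run task fails.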

Flags: needinfo?(mshal)
Product: Taskcluster → Firefox Build System

Should we close this as WONTFIX? If it is just failing within the regression range, I don't really see what we can do here.

Well, this prevents the sheriffs from identifying the culprits for some alerts (true, it doesn't happen often), but if there's nothing you can do here, you can close this.
It is probably worth telling us (the sheriffs) how to identify this case, so we avoid spending too much time backfilling jobs like this. Is it happening on Android 5.0 AArch64 PGO only? I see 4 task definitions in android.yml containing use-pgo.

Flags: needinfo?(mshal)
Summary: Taskcluster task failing silently → Taskcluster task failing silently (bustage on base build prevents other pgo builds and tests from running)

I don't think I have the answer to that unfortunately. :tomprince, is there a better way to get feedback from taskcluster on what's happening here? If I understand correctly, I think the fundamental issue is that it can be hard to tell why retriggering a job isn't working (an android test in this case) if one of its dependencies earlier in the taskgraph is the one with the actual failure (android PGO profile generation here). Or is this something that would need to be solved in treeherder?

Flags: needinfo?(mshal) → needinfo?(mozilla)

We could perhaps error out if we find that any of the task dependencies used in backfilling have already failed. I'm not sure whether that would be better or worse than the current situation in the case where some pushes have broken jobs and others don't. This could maybe be improved as part of Bug 1585757.
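
As a rough illustration of that idea, the following Python sketch walks a task's dependencies transitively and reports any that have already failed. It is not the backfill implementation: the graph layout, the task labels, and the helper name `failed_upstream` are all hypothetical, and the sketch assumes dependency edges and task states are available as plain dictionaries.

```python
def failed_upstream(task, dependencies, status):
    """Return the sorted upstream tasks of `task` that failed or hit an exception.

    `dependencies` maps a task label to the labels it depends on directly;
    `status` maps a task label to its state. Both are hypothetical inputs.
    """
    seen = set()
    stack = list(dependencies.get(task, []))
    failed = []
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        if status.get(dep) in ("failed", "exception"):
            failed.append(dep)
        # Keep walking so failures further up the graph are found too.
        stack.extend(dependencies.get(dep, []))
    return sorted(failed)


# Illustrative graph: the test depends on a PGO build, which depends on the
# profile-generation run task, which depends on an instrumented build.
DEPENDENCIES = {
    "raptor-test": ["aarch64-pgo-build"],
    "aarch64-pgo-build": ["pgo-profile-run"],
    "pgo-profile-run": ["api-16-instrumented-build"],
}
STATUS = {
    "api-16-instrumented-build": "completed",
    "pgo-profile-run": "failed",
}

blockers = failed_upstream("raptor-test", DEPENDENCIES, STATUS)
if blockers:
    print("cannot backfill raptor-test; failed dependencies:", blockers)
```

Reporting the failed dependency by name would at least tell the sheriff that retriggering the test cannot succeed on that push, although it does not settle whether erroring out is preferable when only some pushes are broken.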

Component: General → Task Configuration
Flags: needinfo?(mozilla)
Severity: normal → S3