Closed Bug 1668645 Opened 4 years ago Closed 4 years ago

Intermittent test tasks only containing test manifest which they skip, causing the task to get judged as failed

Categories

(Firefox Build System :: Task Configuration, defect, P2)

Tracking

(firefox83 fixed)

RESOLVED FIXED
83 Branch

People

(Reporter: intermittent-bug-filer, Assigned: ahal)

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Filed by: abutkovits [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=317307261&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KhZTe_eLQOK_gAr-nmwOmQ/runs/0/artifacts/public/logs/live_backing.log


[task 2020-10-01T17:48:15.430Z] 17:48:15     INFO - Using env: (same as previous command)
[task 2020-10-01T17:48:15.717Z] 17:48:15     INFO -  adb Using adb 1.0.41
[task 2020-10-01T17:48:16.135Z] 17:48:16     INFO -  adb /system/bin/ls -1A supported
[task 2020-10-01T17:48:16.240Z] 17:48:16     INFO -  adb Native cp support: True
[task 2020-10-01T17:48:16.345Z] 17:48:16     INFO -  adb Native chmod -R support: True
[task 2020-10-01T17:48:16.450Z] 17:48:16     INFO -  adb Native chown -R support: True
[task 2020-10-01T17:48:16.867Z] 17:48:16     INFO -  adb Native flaky pidof support: True
[task 2020-10-01T17:48:16.972Z] 17:48:16     INFO -  adb adbd running as root
[task 2020-10-01T17:48:17.389Z] 17:48:17     INFO -  adb Setting SELinux Permissive
[task 2020-10-01T17:48:18.744Z] 17:48:18     INFO -  adb Setting test_root to /data/local/tmp/test_root
[task 2020-10-01T17:48:18.961Z] 17:48:18     INFO -  pushing /builds/worker/workspace/build/tests/xpcshell/tests
[task 2020-10-01T17:48:24.195Z] 17:48:24     INFO -  Pushing xpcshell..
[task 2020-10-01T17:48:24.822Z] 17:48:24     INFO -  Pushing ssltunnel..
[task 2020-10-01T17:48:25.448Z] 17:48:25     INFO -  Pushing certutil..
[task 2020-10-01T17:48:26.076Z] 17:48:26     INFO -  Pushing pk12util..
[task 2020-10-01T17:48:26.703Z] 17:48:26     INFO -  Pushing BadCertAndPinningServer..
[task 2020-10-01T17:48:27.330Z] 17:48:27     INFO -  Pushing DelegatedCredentialsServer..
[task 2020-10-01T17:48:27.958Z] 17:48:27     INFO -  Pushing OCSPStaplingServer..
[task 2020-10-01T17:48:28.585Z] 17:48:28     INFO -  Pushing GenerateOCSPResponse..
[task 2020-10-01T17:48:29.212Z] 17:48:29     INFO -  Pushing SanctionsTestServer..
[task 2020-10-01T17:48:31.495Z] 17:48:31     INFO -  Pushing lib/x86_64/libfreebl3.so..
[task 2020-10-01T17:48:32.136Z] 17:48:32     INFO -  Pushing lib/x86_64/liblgpllibs.so..
[task 2020-10-01T17:48:32.762Z] 17:48:32     INFO -  Pushing lib/x86_64/libmozavcodec.so..
[task 2020-10-01T17:48:33.404Z] 17:48:33     INFO -  Pushing lib/x86_64/libmozavutil.so..
[task 2020-10-01T17:48:34.033Z] 17:48:34     INFO -  Pushing lib/x86_64/libmozglue.so..
[task 2020-10-01T17:48:34.687Z] 17:48:34     INFO -  Pushing lib/x86_64/libnss3.so..
[task 2020-10-01T17:48:35.347Z] 17:48:35     INFO -  Pushing lib/x86_64/libnssckbi.so..
[task 2020-10-01T17:48:35.980Z] 17:48:35     INFO -  Pushing lib/x86_64/libplugin-container.so..
[task 2020-10-01T17:48:36.608Z] 17:48:36     INFO -  Pushing lib/x86_64/libsoftokn3.so..
[task 2020-10-01T17:48:37.242Z] 17:48:37     INFO -  Pushing lib/x86_64/libxul.so..
[task 2020-10-01T17:48:41.674Z] 17:48:41     INFO -  ro.product.cpu.abi x86_64
[task 2020-10-01T17:48:41.779Z] 17:48:41     INFO -  ro.product.cpu.abilist x86_64,x86
[task 2020-10-01T17:48:41.780Z] 17:48:41     INFO -  Using abi x86_64.
[task 2020-10-01T17:48:41.780Z] 17:48:41     INFO -  Found node at /builds/worker/fetches/node/bin/node
[task 2020-10-01T17:48:41.780Z] 17:48:41     INFO -  Found moz-http2 at /builds/worker/workspace/build/tests/xpcshell/moz-http2/moz-http2.js
[task 2020-10-01T17:48:41.887Z] 17:48:41     INFO -  Found /builds/worker/workspace/build/tests/xpcshell/http3server/http3server
[task 2020-10-01T17:48:41.888Z] 17:48:41     INFO -  Using /builds/worker/workspace/build/tests/xpcshell/http3server/http3serverDB
[task 2020-10-01T17:48:41.894Z] 17:48:41     INFO -  Could not run the http3 server: [Errno 2] No such file or directory
[task 2020-10-01T17:48:42.145Z] 17:48:42     INFO -  no tests to run using specified combination of filters: skip_if, run_if, fail_if, pathprefix(['browser/extensions/formautofill/test/unit/xpcshell.ini', 'services/sync/tests/unit/xpcshell.ini'])

I think http3server has never been set up for android -- we normally run android xpcshell tests without http3server, and it should not be a problem.

:ahal - Failures here look like they might be related to manifest scheduling? I think I'm seeing the android xpcshell test harness called only with manifests that would normally not run on android.

Flags: needinfo?(ahal)
Component: General → Task Configuration
Product: Testing → Firefox Build System
Summary: Intermittent Perma Xpc http3 server: [Errno 2] No such file or directory → Intermittent test tasks only containing test manifest which they skip, causing the task to get judged as failed

Yes, this is a known issue. When we assign manifests to tasks in the taskgraph, we try to take into account the skip-if conditions using a "guessed" mozinfo context:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/util/chunking.py#32

But this context isn't comprehensive and so we can sometimes assign manifests to configs where the entire manifest is skipped. Android is susceptible to this bug since there are a lot of such manifests.
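To illustrate the failure mode described above, here is a minimal sketch. It assumes a simplified skip-if model (a dict of key/value pairs that must all match) and uses hypothetical names throughout (`is_skipped`, `has_runnable_tests`, `hypothetical_flag`); it is not the real taskgraph or manifestparser code. It only shows how a context that lacks a key can make a fully skipped manifest look runnable.

```python
# Illustrative only: these names are hypothetical and do not mirror the real
# taskgraph or manifestparser code.

def is_skipped(skip_if, mozinfo):
    """Return True if every key in the skip condition matches mozinfo.

    A key that is absent from mozinfo never matches, so an incomplete
    context makes a test look runnable.
    """
    if not skip_if:
        return False
    return all(key in mozinfo and mozinfo[key] == value
               for key, value in skip_if.items())


def has_runnable_tests(tests, mozinfo):
    """Return True if at least one test in the manifest is not skipped."""
    return any(not is_skipped(test.get("skip_if"), mozinfo) for test in tests)


# "hypothetical_flag" stands in for whatever value the guessed context lacks.
manifest = [{"name": "test_example.js", "skip_if": {"hypothetical_flag": True}}]
guessed_mozinfo = {"os": "android"}
real_mozinfo = {"os": "android", "hypothetical_flag": True}

scheduled = has_runnable_tests(manifest, guessed_mozinfo)   # taskgraph view
runs_anything = has_runnable_tests(manifest, real_mozinfo)  # harness view
```

With the guessed context the manifest appears to contain runnable tests, so it gets assigned to the task; with the fuller context every test is skipped and the harness has nothing to run.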

There are a few solutions / mitigations here:

  1. Land ML-based config selection. Currently configs are selected naively, but marco is working on a patch to have those chosen by the ML instead. Presumably the ML would learn that manifests do not need to run on configs where they are skipped.

  2. Figure out what context value is missing from our "guessed" mozinfo and add it.

  3. Modify the test harnesses to not error when all tests are skipped and a test path is passed in (they could still error if using --this-chunk/--total-chunks if we wanted though).

Flags: needinfo?(ahal)

Oh, it looks like the issue isn't that the test got skipped, but that it got run in the first place. I see firefox-appdir = browser in the manifest. I guess that's some kind of custom xpcshell thing that gets ignored when you pass in paths directly instead of using the chunking params? I'm not finding its implementation in searchfox though...

https://searchfox.org/mozilla-central/rev/7ef5cefd0468b8f509efe38e0212de2398f4c8b3/testing/xpcshell/runxpcshelltests.py#1561 sheds some light. I don't think firefox-appdir has much influence on this issue...but I don't know much about it.

I think moz.build exclusions might be more important: for example, most of <topsrcdir>/browser is excluded from android builds, as I recall.

We've encountered this again here: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=317610389&repo=autoland&lineNumber=1627
This seems to be happening at high frequency. Are there any updates?

Flags: needinfo?(ahal)

Ah, I think I see why this is happening.

The manifest from the latest link is services/sync/tests/unit/xpcshell.ini. It is included from a moz.build file that is only conditionally traversed based on the build config:
https://searchfox.org/mozilla-central/source/services/moz.build#29

I'm going to go out on a limb and guess that MOZ_SERVICES_SYNC is not defined on Android builds. It's also why we see a lot of these failures in browser-chrome tasks on Android. There are likely tons of moz.build manifests under /browser that don't get processed by Android builds.

We don't filter them out in the taskgraph because we do a "file system traversal" of the moz.build files rather than a build system one. So variables like MOZ_SERVICES_SYNC don't get taken into account. There's not really much we can do about this w.r.t. moz.build processing.
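A rough sketch of the difference between the two traversals, under stated assumptions: the in-memory `MOZ_BUILD` tree, the `services/example` directory and the function names are all made up for illustration, and the real moz.build reader works quite differently. The only details taken from this bug are the sync manifest path and the guess that its directory is gated on MOZ_SERVICES_SYNC.

```python
# Illustrative model of a moz.build tree. Each directory lists its test
# manifests and its subdirectories, with an optional build variable that
# must be set for the subdirectory to be traversed.
MOZ_BUILD = {
    "services": {
        "manifests": [],
        "dirs": [("sync", "MOZ_SERVICES_SYNC"), ("example", None)],
    },
    "services/sync": {
        "manifests": ["services/sync/tests/unit/xpcshell.ini"],
        "dirs": [],
    },
    # Hypothetical unconditional directory, for contrast.
    "services/example": {
        "manifests": ["services/example/tests/unit/xpcshell.ini"],
        "dirs": [],
    },
}


def filesystem_manifests(tree):
    """File system traversal: every manifest in every directory."""
    return sorted(m for node in tree.values() for m in node["manifests"])


def build_manifests(tree, root, config):
    """Build system traversal: only follow subdirectories whose condition holds."""
    found = list(tree[root]["manifests"])
    for subdir, condition in tree[root]["dirs"]:
        if condition is None or config.get(condition):
            found.extend(build_manifests(tree, f"{root}/{subdir}", config))
    return sorted(found)


android_config = {}  # assumption: MOZ_SERVICES_SYNC is not set on Android
taskgraph_view = filesystem_manifests(MOZ_BUILD)
android_build_view = build_manifests(MOZ_BUILD, "services", android_config)
```

In this model the taskgraph view still contains the sync manifest, while the Android build view does not, which matches the mismatch seen in the failing tasks.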

There are three other ways forward though:

  1. The root of this issue is that platform selection is naive and not controlled by the ML. Therefore once we choose platforms via ML, this should become very rare as the ML will learn that these tests have no correlation to the platform.
  2. We can simply make this not an error (especially if test paths were passed in).
  3. Add a "redundant" skip-if to the manifests. E.g., even if a manifest wouldn't normally be picked up on Android, add skip-if = os == "android" to it anyway. This will cause it to not be considered when resolving tests in the taskgraph.

I vote we do 2) and/or 3), and then possibly consider turning it back into an error when manifest-scheduling lands (but not a huge deal to leave it off IMO).

I'll try to fix this in the next week or so.

Assignee: nobody → ahal
Severity: normal → S3
Status: NEW → ASSIGNED
Flags: needinfo?(ahal)
Priority: P5 → P2

This appears to be exclusively happening with xpcshell. I believe this is because it is one of the few remaining suites that still uses the DesktopUnittestOutputParser in mozharness (as opposed to the StructuredOutputParser). Xpcshell emits structured logs, so I wonder why we never got around to switching it. Maybe I can solve this bug by making that switch.

If none of the test paths resolve to any tests and we're running in CI, exit without
causing the task to fail. This situation can happen due to moz.build traversal that
causes manifests to not exist under certain configurations.

Ideally I'd love it if we could prevent those cases from happening in the first place (i.e.,
generate the 'all-tests.pkl' file via a file-system traversal and then rely on skip-if's
to not run things), but until then this fixes a fairly frequent intermittent.
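The behaviour described in the patch summary above could look roughly like the following. This is a sketch, not the landed change: `exit_status_for_empty_run` and `run` are hypothetical names, and how the harness decides that it is running in CI is left to the caller here (the `in_ci` parameter is an assumption).

```python
import sys


def exit_status_for_empty_run(test_paths, in_ci):
    """Pick an exit status when no tests were resolved.

    Only tolerate the empty run when explicit test paths were given and we
    are in CI; in every other case keep treating it as an error.
    """
    if test_paths and in_ci:
        return 0
    return 1


def run(resolved_tests, test_paths, in_ci):
    """Return the exit status the harness would report."""
    if not resolved_tests:
        print("no tests to run for the given paths", file=sys.stderr)
        return exit_status_for_empty_run(test_paths, in_ci)
    # A real harness would execute the tests here; this sketch just succeeds.
    return 0
```

Keeping the error for local runs and for runs without explicit paths means a typo in a path or a misconfigured chunk still gets noticed outside of this specific CI scenario.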

Pushed by rmaries@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f9a6da3ea564 [xpcshell] Don't fail CI when no specified test paths contain tests, r=jmaher
Whiteboard: [stockwell disable-recommended]
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 83 Branch