Permafail Windows coverage build running out of time

RESOLVED FIXED in Firefox 68

Status

defect
P1
normal
RESOLVED FIXED
2 months ago
2 months ago

People

(Reporter: intermittent-bug-filer, Assigned: ahal)

Tracking

(Blocks 1 bug, Regression, {intermittent-failure, regression})

Version 3
mozilla68
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr60 unaffected, firefox66 unaffected, firefox67 unaffected, firefox68 fixed)

Details

Attachments

(1 attachment)

Reporter

Description

2 months ago
treeherder

#[markdown(off)]
Filed by: btara [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=238345090&repo=mozilla-central

https://queue.taskcluster.net/v1/task/LSSyyg_HSzOnM9bVve7K-w/runs/0/artifacts/public/logs/live_backing.log

11:04:40 INFO - package-tests> mozmake.EXE[2]: Leaving directory 'z:/build/build/src/obj-firefox'
[taskcluster 2019-04-05T11:10:27.663Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/LSSyyg_HSzOnM9bVve7K-w/0/on/us-west-1/i-0f6ccd032e967af2c/until/1554463828.376
[taskcluster 2019-04-05T11:27:29.123Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/LSSyyg_HSzOnM9bVve7K-w/0/on/us-west-1/i-0f6ccd032e967af2c/until/1554464849.637
[taskcluster 2019-04-05T11:44:30.293Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/LSSyyg_HSzOnM9bVve7K-w/0/on/us-west-1/i-0f6ccd032e967af2c/until/1554465870.715
[taskcluster 2019-04-05T12:01:31.456Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/LSSyyg_HSzOnM9bVve7K-w/0/on/us-west-1/i-0f6ccd032e967af2c/until/1554466891.74
[taskcluster:error] Aborting task...
[taskcluster 2019-04-05T12:02:23.898Z] ERROR: The process "1316" not found.
[taskcluster 2019-04-05T12:02:23.898Z]
[taskcluster:warn 2019-04-05T12:02:23.898Z] exit status 128
[taskcluster 2019-04-05T12:02:23.935Z] ERROR: The process "3064" not found.
[taskcluster 2019-04-05T12:02:23.935Z]
[taskcluster:warn 2019-04-05T12:02:23.935Z] exit status 128
[taskcluster 2019-04-05T12:02:23.969Z] ERROR: The process "1436" not found.
[taskcluster 2019-04-05T12:02:23.969Z]
[taskcluster:warn 2019-04-05T12:02:23.969Z] exit status 128
[taskcluster 2019-04-05T12:02:24.003Z] ERROR: The process "3956" not found.
[taskcluster 2019-04-05T12:02:24.003Z]
[taskcluster:warn 2019-04-05T12:02:24.003Z] exit status 128
[taskcluster 2019-04-05T12:02:24.038Z] ERROR: The process "4028" not found.
[taskcluster 2019-04-05T12:02:24.038Z]
[taskcluster:warn 2019-04-05T12:02:24.038Z] exit status 128
[taskcluster 2019-04-05T12:02:24.072Z] ERROR: The process "3396" not found.
[taskcluster 2019-04-05T12:02:24.072Z]
[taskcluster:warn 2019-04-05T12:02:24.072Z] exit status 128
[taskcluster 2019-04-05T12:02:24.107Z] ERROR: The process "2508" not found.
[taskcluster 2019-04-05T12:02:24.107Z]
[taskcluster:warn 2019-04-05T12:02:24.107Z] exit status 128
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3484 (child process of PID 4228) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 4228 (child process of PID 3368) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3368 (child process of PID 680) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 680 (child process of PID 4964) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 4964 (child process of PID 1500) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 1500 (child process of PID 4568) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 4568 (child process of PID 3824) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3824 (child process of PID 4280) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 4280 (child process of PID 732) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 732 (child process of PID 3776) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3776 (child process of PID 2176) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 2176 (child process of PID 3404) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3404 (child process of PID 700) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3540 (child process of PID 3600) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 700 (child process of PID 3600) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3600 (child process of PID 2484) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z]
[taskcluster 2019-04-05T12:02:24.146Z] === Task Finished ===
[taskcluster 2019-04-05T12:02:24.146Z] Task Duration: 2h0m1.4955177s
[taskcluster:error] Uploading error artifact public/code-coverage-grcov.zip from file build\src\obj-firefox\code-coverage-grcov.zip with message "Could not read file 'Z:\\task_1554458424\\build\\src\\obj-firefox\\code-coverage-grcov.zip'", reason "file-missing-on-worker" and expiry 2020-04-04T09:52:14.758Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read file 'Z:\task_1554458424\build\src\obj-firefox\code-coverage-grcov.zip'
[taskcluster 2019-04-05T12:02:25.542Z] Uploading artifact public/logs/localconfig.json from file logs\localconfig.json with content encoding "gzip", mime type "application/octet-stream" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:26.183Z] Uploading artifact public/build/target.awsy.tests.tar.gz from file public\build\target.awsy.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:26.640Z] Uploading artifact public/build/target.common.tests.tar.gz from file public\build\target.common.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:27.580Z] Uploading artifact public/build/target.cppunittest.tests.tar.gz from file public\build\target.cppunittest.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:28.616Z] Uploading artifact public/build/target.gtest.tests.tar.gz from file public\build\target.gtest.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:34.317Z] Uploading artifact public/build/target.mochitest.tests.tar.gz from file public\build\target.mochitest.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:35.813Z] Uploading artifact public/build/target.raptor.tests.tar.gz from file public\build\target.raptor.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:36.528Z] Uploading artifact public/build/target.reftest.tests.tar.gz from file public\build\target.reftest.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:38.543Z] Uploading artifact public/build/target.talos.tests.tar.gz from file public\build\target.talos.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:39.298Z] Uploading artifact public/build/target.updater-dep.tests.tar.gz from file public\build\target.updater-dep.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:39.683Z] Uploading artifact public/build/target.web-platform.tests.tar.gz from file public\build\target.web-platform.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:41.130Z] Uploading artifact public/build/target.xpcshell.tests.tar.gz from file public\build\target.xpcshell.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:41.820Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:43.612Z] Uploading artifact public/chain-of-trust.json from file generic-worker\chain-of-trust.json with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:43.985Z] Uploading artifact public/chain-of-trust.json.sig from file generic-worker\chain-of-trust.json.sig with content encoding "gzip", mime type "application/octet-stream" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:44.372Z] Uploading artifact public/chainOfTrust.json.asc from file generic-worker\chainOfTrust.json.asc with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:44.943Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/LSSyyg_HSzOnM9bVve7K-w/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2020-04-04T09:52:14.758Z
[taskcluster:error] Task aborted - max run time exceeded
[taskcluster:error] file-missing-on-worker: Could not read file 'Z:\task_1554458424\build\src\obj-firefox\code-coverage-grcov.zip'

Flags: needinfo?(mcastelluccio)

This is a permafailure. Since this push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=ccov&revision=a765634b9ec26b122c094ad5def8e70ea5c23e17.
The runtime before that push was ~80 minutes, now it breaks the 120 minutes barrier.

Summary: Intermittent Tier 2 Win 2012 [taskcluster:error] Uploading error artifact public/code-coverage-grcov.zip from file build\src\obj-firefox\code-coverage-grcov.zip with message "Could not read file 'Z:\\*\code-coverage-grcov.zip'", reason "file-missing-on-wo → Permafail Windows coverage build running out of time

I've triggered ccov builds for some of the autoland pushes from that push, and https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=c28c538fca0a&searchStr=ccov is the first failing one.

Flags: needinfo?(mcastelluccio) → needinfo?(ahal)
Regressed by: 1195299
Comment hidden (Intermittent Failures Robot)
Assignee

Comment 5

2 months ago

Looks like the relevant part of the log is near the end:

14:05:49 INFO - package-tests> Wrote 72285 files in 54373432 bytes to target.web-platform.tests.tar.gz in 39.02s
14:05:49 INFO - package-tests> mozmake.EXE[2]: Leaving directory 'z:/build/build/src/obj-firefox'
[taskcluster 2019-04-07T14:18:02.388Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/fHyuCZfgQpe6BkY5crxuJg/0/on/us-west-2/i-02a455cadb66f0fc1/until/1554647882.465
[taskcluster 2019-04-07T14:35:03.170Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/fHyuCZfgQpe6BkY5crxuJg/0/on/us-west-2/i-02a455cadb66f0fc1/until/1554648903.22
[taskcluster 2019-04-07T14:52:03.886Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/fHyuCZfgQpe6BkY5crxuJg/0/on/us-west-2/i-02a455cadb66f0fc1/until/1554649923.912
[taskcluster 2019-04-07T15:09:04.598Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/fHyuCZfgQpe6BkY5crxuJg/0/on/us-west-2/i-02a455cadb66f0fc1/until/1554650944.584
[taskcluster:error] Aborting task...
[taskcluster 2019-04-07T15:10:01.465Z] ERROR: The process "2556" not found.
[taskcluster 2019-04-07T15:10:01.465Z]
...
[taskcluster 2019-04-07T15:10:01.723Z] === Task Finished ===

Pay close attention to the timestamps. So the build finishes packaging web-platform-tests and then there are four Successfully refreshed taskcluster-proxy credentials log messages, and a 15 minute gap between each one. After an hour we see Aborting task.... It looks like that taskcluster refresh log happens periodically throughout the build process, so I assume there is an actual hang happening here (and not just a case of making the builds slower).

Marco, there are two high level changes that the regressing bug did:

  1. Update the copies of mozprocess, mozinfo, mozfile and manifestparser that mozharness uses
  2. Call the builds via mach python instead of python2.7 (which could potentially cause different versions of packages to be used)

Marco, do either of those sound particularly suspicious?

Flags: needinfo?(ahal)
Assignee

Comment 6

2 months ago

I checked a couple failing builds and they all seem to hang after packaging web-platform tests. In a normal passing build, the next couple log lines should have been:

14:23:16 INFO - package-tests> Wrote 72285 files in 54373432 bytes to target.web-platform.tests.tar.gz in 59.50s
14:23:16 INFO - package-tests> mozmake.EXE[2]: Leaving directory 'z:/build/build/src/obj-firefox'
14:26:27 INFO - package> mozmake.EXE[5]: Entering directory 'z:/build/build/src/obj-firefox/browser/installer'
14:26:27 INFO - package> NO_PKG_FILES="core bsdecho js js-config jscpucfg nsinstall viewer TestGtkEmbed elf-dynstr-gc mangle* maptsv* mfc* msdump* msmap* nm2tsv* nsinstall* res/samples res/throbber shlibsign* certutil* pk12util* BadCertServer* OCSPStaplingServer* SymantecSanctionsServer* GenerateOCSPResponse* chrome/chrome.rdf chrome/app-chrome.manifest chrome/overlayinfo components/compreg.dat components/xpti.dat content_unit_tests necko_unit_tests *.dSYM SmokeDMD"
14:26:27 INFO - package> z:/build/build/src/obj-firefox/_virtualenvs/init/Scripts/python.exe z:/build/build/src/toolkit/mozapps/installer/packager.py

Assignee

Comment 8

2 months ago

The Wrote .. to target.web-platform.tests line comes from here:
https://searchfox.org/mozilla-central/source/python/mozbuild/mozbuild/action/test_archive.py#820

Which gets called from here:
https://searchfox.org/mozilla-central/source/testing/testsuite-targets.mk#158

Reading makefiles is not my forte though. I'm also confused why ccov builds would be different from normal ones here.

I tried to disable Rust tests, which is one of the differences between normal builds and ccov builds, but it didn't make a difference: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5de35c528c768271487823284b9d23d1d32ce46e.

Assignee

Comment 10

2 months ago

I ran a few try runs:
https://treeherder.mozilla.org/#/jobs?repo=try&author=ahalberstadt%40mozilla.com&fromchange=87f98216ee979df1af96e598e7322130d16499e2&tochange=c4c688b2ec2723fd4a269823eade5bf4edfd5c8f

The bottom is baseline (aka current central), then I backed out the two patches one by one. In otherwords, this is the specific commit that caused the regression:
https://hg.mozilla.org/mozilla-central/rev/50495a5da9f6

Since this is Windows only, it's likely this hunk specifically:
https://hg.mozilla.org/mozilla-central/diff/50495a5da9f6/taskcluster/taskgraph/transforms/job/mozharness.py

Assignee

Updated

2 months ago
Blocks: 1543149
Assignee

Updated

2 months ago
Assignee: nobody → ahal
Status: NEW → ASSIGNED
Priority: P5 → P1
Assignee

Comment 11

2 months ago

Have it working on try using PYTHONPATH rather than mach python:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5def23c2c7f27d9a4831db0daf292cf03487032e

I filed bug 1543149 to figure out why using mach python breaks the ccov build. I think we'll need someone on the build team to investigate as I wasn't getting anywhere.

Assignee

Comment 12

2 months ago

This is a hack to get around Windows ccov build hangs caused by bug 1195299.
Bug 1543149 will track the investigation of the hang and removal of this hack.

Comment 13

2 months ago
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/20c86313e332
[taskgraph] Hack required mozbase module onto PYTHONPATH rather than run 'mach python' for Windows builds, r=marco

Comment 14

2 months ago
bugherder
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
You need to log in before you can comment on or make changes to this bug.