Closed Bug 1542242 Opened 5 years ago Closed 5 years ago

Permafail Windows coverage build running out of time

Categories

(Testing :: Code Coverage, defect, P1)

Version 3
defect

Tracking

(firefox-esr60 unaffected, firefox66 unaffected, firefox67 unaffected, firefox68 fixed)

RESOLVED FIXED
mozilla68
Tracking Status
firefox-esr60 --- unaffected
firefox66 --- unaffected
firefox67 --- unaffected
firefox68 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: ahal)

References

(Regression)

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

#[markdown(off)]
Filed by: btara [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=238345090&repo=mozilla-central

https://queue.taskcluster.net/v1/task/LSSyyg_HSzOnM9bVve7K-w/runs/0/artifacts/public/logs/live_backing.log

11:04:40 INFO - package-tests> mozmake.EXE[2]: Leaving directory 'z:/build/build/src/obj-firefox'
[taskcluster 2019-04-05T11:10:27.663Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/LSSyyg_HSzOnM9bVve7K-w/0/on/us-west-1/i-0f6ccd032e967af2c/until/1554463828.376
[taskcluster 2019-04-05T11:27:29.123Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/LSSyyg_HSzOnM9bVve7K-w/0/on/us-west-1/i-0f6ccd032e967af2c/until/1554464849.637
[taskcluster 2019-04-05T11:44:30.293Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/LSSyyg_HSzOnM9bVve7K-w/0/on/us-west-1/i-0f6ccd032e967af2c/until/1554465870.715
[taskcluster 2019-04-05T12:01:31.456Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/LSSyyg_HSzOnM9bVve7K-w/0/on/us-west-1/i-0f6ccd032e967af2c/until/1554466891.74
[taskcluster:error] Aborting task...
[taskcluster 2019-04-05T12:02:23.898Z] ERROR: The process "1316" not found.
[taskcluster 2019-04-05T12:02:23.898Z]
[taskcluster:warn 2019-04-05T12:02:23.898Z] exit status 128
[taskcluster 2019-04-05T12:02:23.935Z] ERROR: The process "3064" not found.
[taskcluster 2019-04-05T12:02:23.935Z]
[taskcluster:warn 2019-04-05T12:02:23.935Z] exit status 128
[taskcluster 2019-04-05T12:02:23.969Z] ERROR: The process "1436" not found.
[taskcluster 2019-04-05T12:02:23.969Z]
[taskcluster:warn 2019-04-05T12:02:23.969Z] exit status 128
[taskcluster 2019-04-05T12:02:24.003Z] ERROR: The process "3956" not found.
[taskcluster 2019-04-05T12:02:24.003Z]
[taskcluster:warn 2019-04-05T12:02:24.003Z] exit status 128
[taskcluster 2019-04-05T12:02:24.038Z] ERROR: The process "4028" not found.
[taskcluster 2019-04-05T12:02:24.038Z]
[taskcluster:warn 2019-04-05T12:02:24.038Z] exit status 128
[taskcluster 2019-04-05T12:02:24.072Z] ERROR: The process "3396" not found.
[taskcluster 2019-04-05T12:02:24.072Z]
[taskcluster:warn 2019-04-05T12:02:24.072Z] exit status 128
[taskcluster 2019-04-05T12:02:24.107Z] ERROR: The process "2508" not found.
[taskcluster 2019-04-05T12:02:24.107Z]
[taskcluster:warn 2019-04-05T12:02:24.107Z] exit status 128
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3484 (child process of PID 4228) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 4228 (child process of PID 3368) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3368 (child process of PID 680) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 680 (child process of PID 4964) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 4964 (child process of PID 1500) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 1500 (child process of PID 4568) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 4568 (child process of PID 3824) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3824 (child process of PID 4280) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 4280 (child process of PID 732) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 732 (child process of PID 3776) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3776 (child process of PID 2176) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 2176 (child process of PID 3404) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3404 (child process of PID 700) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3540 (child process of PID 3600) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 700 (child process of PID 3600) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z] SUCCESS: The process with PID 3600 (child process of PID 2484) has been terminated.
[taskcluster 2019-04-05T12:02:24.146Z]
[taskcluster 2019-04-05T12:02:24.146Z] === Task Finished ===
[taskcluster 2019-04-05T12:02:24.146Z] Task Duration: 2h0m1.4955177s
[taskcluster:error] Uploading error artifact public/code-coverage-grcov.zip from file build\src\obj-firefox\code-coverage-grcov.zip with message "Could not read file 'Z:\\task_1554458424\\build\\src\\obj-firefox\\code-coverage-grcov.zip'", reason "file-missing-on-worker" and expiry 2020-04-04T09:52:14.758Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read file 'Z:\task_1554458424\build\src\obj-firefox\code-coverage-grcov.zip'
[taskcluster 2019-04-05T12:02:25.542Z] Uploading artifact public/logs/localconfig.json from file logs\localconfig.json with content encoding "gzip", mime type "application/octet-stream" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:26.183Z] Uploading artifact public/build/target.awsy.tests.tar.gz from file public\build\target.awsy.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:26.640Z] Uploading artifact public/build/target.common.tests.tar.gz from file public\build\target.common.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:27.580Z] Uploading artifact public/build/target.cppunittest.tests.tar.gz from file public\build\target.cppunittest.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:28.616Z] Uploading artifact public/build/target.gtest.tests.tar.gz from file public\build\target.gtest.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:34.317Z] Uploading artifact public/build/target.mochitest.tests.tar.gz from file public\build\target.mochitest.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:35.813Z] Uploading artifact public/build/target.raptor.tests.tar.gz from file public\build\target.raptor.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:36.528Z] Uploading artifact public/build/target.reftest.tests.tar.gz from file public\build\target.reftest.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:38.543Z] Uploading artifact public/build/target.talos.tests.tar.gz from file public\build\target.talos.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:39.298Z] Uploading artifact public/build/target.updater-dep.tests.tar.gz from file public\build\target.updater-dep.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:39.683Z] Uploading artifact public/build/target.web-platform.tests.tar.gz from file public\build\target.web-platform.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:41.130Z] Uploading artifact public/build/target.xpcshell.tests.tar.gz from file public\build\target.xpcshell.tests.tar.gz with content encoding "", mime type "application/x-gzip" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:41.820Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:43.612Z] Uploading artifact public/chain-of-trust.json from file generic-worker\chain-of-trust.json with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:43.985Z] Uploading artifact public/chain-of-trust.json.sig from file generic-worker\chain-of-trust.json.sig with content encoding "gzip", mime type "application/octet-stream" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:44.372Z] Uploading artifact public/chainOfTrust.json.asc from file generic-worker\chainOfTrust.json.asc with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2020-04-04T09:52:14.758Z
[taskcluster 2019-04-05T12:02:44.943Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/LSSyyg_HSzOnM9bVve7K-w/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2020-04-04T09:52:14.758Z
[taskcluster:error] Task aborted - max run time exceeded
[taskcluster:error] file-missing-on-worker: Could not read file 'Z:\task_1554458424\build\src\obj-firefox\code-coverage-grcov.zip'

Flags: needinfo?(mcastelluccio)

This is a permafailure. Since this push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=ccov&revision=a765634b9ec26b122c094ad5def8e70ea5c23e17.
The runtime before that push was ~80 minutes, now it breaks the 120 minutes barrier.

Summary: Intermittent Tier 2 Win 2012 [taskcluster:error] Uploading error artifact public/code-coverage-grcov.zip from file build\src\obj-firefox\code-coverage-grcov.zip with message "Could not read file 'Z:\\*\code-coverage-grcov.zip'", reason "file-missing-on-wo → Permafail Windows coverage build running out of time

I've triggered ccov builds for some of the autoland pushes from that push, and https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=c28c538fca0a&searchStr=ccov is the first failing one.

Flags: needinfo?(mcastelluccio) → needinfo?(ahal)
Regressed by: 1195299

Looks like the relevant part of the log is near the end:

14:05:49 INFO - package-tests> Wrote 72285 files in 54373432 bytes to target.web-platform.tests.tar.gz in 39.02s
14:05:49 INFO - package-tests> mozmake.EXE[2]: Leaving directory 'z:/build/build/src/obj-firefox'
[taskcluster 2019-04-07T14:18:02.388Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/fHyuCZfgQpe6BkY5crxuJg/0/on/us-west-2/i-02a455cadb66f0fc1/until/1554647882.465
[taskcluster 2019-04-07T14:35:03.170Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/fHyuCZfgQpe6BkY5crxuJg/0/on/us-west-2/i-02a455cadb66f0fc1/until/1554648903.22
[taskcluster 2019-04-07T14:52:03.886Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/fHyuCZfgQpe6BkY5crxuJg/0/on/us-west-2/i-02a455cadb66f0fc1/until/1554649923.912
[taskcluster 2019-04-07T15:09:04.598Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/fHyuCZfgQpe6BkY5crxuJg/0/on/us-west-2/i-02a455cadb66f0fc1/until/1554650944.584
[taskcluster:error] Aborting task...
[taskcluster 2019-04-07T15:10:01.465Z] ERROR: The process "2556" not found.
[taskcluster 2019-04-07T15:10:01.465Z]
...
[taskcluster 2019-04-07T15:10:01.723Z] === Task Finished ===

Pay close attention to the timestamps. So the build finishes packaging web-platform-tests and then there are four Successfully refreshed taskcluster-proxy credentials log messages, and a 15 minute gap between each one. After an hour we see Aborting task.... It looks like that taskcluster refresh log happens periodically throughout the build process, so I assume there is an actual hang happening here (and not just a case of making the builds slower).

Marco, there are two high level changes that the regressing bug did:

  1. Update the copies of mozprocess, mozinfo, mozfile and manifestparser that mozharness uses
  2. Call the builds via mach python instead of python2.7 (which could potentially cause different versions of packages to be used)

Marco, do either of those sound particularly suspicious?

Flags: needinfo?(ahal)

I checked a couple failing builds and they all seem to hang after packaging web-platform tests. In a normal passing build, the next couple log lines should have been:

14:23:16 INFO - package-tests> Wrote 72285 files in 54373432 bytes to target.web-platform.tests.tar.gz in 59.50s
14:23:16 INFO - package-tests> mozmake.EXE[2]: Leaving directory 'z:/build/build/src/obj-firefox'
14:26:27 INFO - package> mozmake.EXE[5]: Entering directory 'z:/build/build/src/obj-firefox/browser/installer'
14:26:27 INFO - package> NO_PKG_FILES="core bsdecho js js-config jscpucfg nsinstall viewer TestGtkEmbed elf-dynstr-gc mangle* maptsv* mfc* msdump* msmap* nm2tsv* nsinstall* res/samples res/throbber shlibsign* certutil* pk12util* BadCertServer* OCSPStaplingServer* SymantecSanctionsServer* GenerateOCSPResponse* chrome/chrome.rdf chrome/app-chrome.manifest chrome/overlayinfo components/compreg.dat components/xpti.dat content_unit_tests necko_unit_tests *.dSYM SmokeDMD"
14:26:27 INFO - package> z:/build/build/src/obj-firefox/_virtualenvs/init/Scripts/python.exe z:/build/build/src/toolkit/mozapps/installer/packager.py

The Wrote .. to target.web-platform.tests line comes from here:
https://searchfox.org/mozilla-central/source/python/mozbuild/mozbuild/action/test_archive.py#820

Which gets called from here:
https://searchfox.org/mozilla-central/source/testing/testsuite-targets.mk#158

Reading makefiles is not my forte though. I'm also confused why ccov builds would be different from normal ones here.

I tried to disable Rust tests, which is one of the differences between normal builds and ccov builds, but it didn't make a difference: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5de35c528c768271487823284b9d23d1d32ce46e.

I ran a few try runs:
https://treeherder.mozilla.org/#/jobs?repo=try&author=ahalberstadt%40mozilla.com&fromchange=87f98216ee979df1af96e598e7322130d16499e2&tochange=c4c688b2ec2723fd4a269823eade5bf4edfd5c8f

The bottom is baseline (aka current central), then I backed out the two patches one by one. In otherwords, this is the specific commit that caused the regression:
https://hg.mozilla.org/mozilla-central/rev/50495a5da9f6

Since this is Windows only, it's likely this hunk specifically:
https://hg.mozilla.org/mozilla-central/diff/50495a5da9f6/taskcluster/taskgraph/transforms/job/mozharness.py

Blocks: 1543149
Assignee: nobody → ahal
Status: NEW → ASSIGNED
Priority: P5 → P1

Have it working on try using PYTHONPATH rather than mach python:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5def23c2c7f27d9a4831db0daf292cf03487032e

I filed bug 1543149 to figure out why using mach python breaks the ccov build. I think we'll need someone on the build team to investigate as I wasn't getting anywhere.

This is a hack to get around Windows ccov build hangs caused by bug 1195299.
Bug 1543149 will track the investigation of the hang and removal of this hack.

Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/20c86313e332
[taskgraph] Hack required mozbase module onto PYTHONPATH rather than run 'mach python' for Windows builds, r=marco
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: