Closed Bug 1487587 Opened 6 years ago Closed 2 years ago

Intermittent [taskcluster:error] Task aborted - max run time exceeded [taskcluster:error] Task timeout after X seconds. Force killing container.due to slow cloning

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Assigned: markco)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

Attachments

(1 file)

Filed by: ncsoregi [at] mozilla.com https://treeherder.mozilla.org/logviewer.html#?job_id=196763202&repo=mozilla-central https://queue.taskcluster.net/v1/task/f2YQvf9fRkG1Z0axZ0n8Sg/runs/0/artifacts/public/logs/live_backing.log 23:47:07 INFO - l10n-check> 23:47:07 INFO - l10n-check> 5 warnings: 23:47:07 INFO - l10n-check> 23:47:07 INFO - l10n-check> LangString "OPTIONAL_EXTENSIONS_CHECKBOX_DESC" is not set in language table of language baseLocale 23:47:07 INFO - l10n-check> 23:47:07 INFO - l10n-check> LangString "OPTIONAL_EXTENSIONS_DESC" is not set in language table of language baseLocale 23:47:07 INFO - l10n-check> 23:47:07 INFO - l10n-check> Variable "InstallOptionalExtensions" not referenced or never set, wasting memory! 23:47:07 INFO - l10n-check> 23:47:07 INFO - l10n-check> Variable "ExtensionRecommender" not referenced or never set, wasting memory! 23:47:07 INFO - l10n-check> 23:47:07 INFO - l10n-check> Insecure filename "setup.exe", Windows will unsafely load compatibility shims into the process. 23:47:07 INFO - l10n-check> 23:47:07 INFO - l10n-check> mozmake.EXE[6]: Leaving directory 'z:/build/build/src/obj-firefox/browser/installer/windows' 23:47:51 INFO - l10n-check> mozmake.EXE[6]: Entering directory 'z:/build/build/src/obj-firefox/browser/installer/windows' 23:47:51 INFO - l10n-check> Packaging . 23:47:51 INFO - l10n-check> z:/build/build/src/obj-firefox/_virtualenvs/init/Scripts/python.exe z:/build/build/src/config/nsinstall.py -D 'z:/build/build/src/obj-firefox/dist/install/sea/' 23:47:51 INFO - l10n-check> z:/build/build/src/mach repackage installer \ 23:47:51 INFO - l10n-check> -o 'z:/build/build/src/obj-firefox/dist/install/sea/firefox-63.0a1.x-test.win64.installer.exe' \ 23:47:51 INFO - l10n-check> --package-name 'firefox' \ 23:47:51 INFO - l10n-check> --package 'z:/build/build/src/obj-firefox/dist/firefox-63.0a1.x-test.win64.zip' \ 23:47:51 INFO - l10n-check> --tag z:/build/build/src/browser/installer/windows/app.tag \ 23:47:51 INFO - l10n-check> --setupexe l10ngen/setup.exe \ 23:47:51 INFO - l10n-check> --sfx-stub z:/build/build/src/other-licenses/7zstub/firefox/7zSD.sfx 23:47:51 INFO - l10n-check> Ultimate Packer for eXecutables 23:47:51 INFO - l10n-check> Copyright (C) 1996 - 2017 23:47:51 INFO - l10n-check> UPX 3.94w Markus Oberhumer, Laszlo Molnar & John Reiser May 12th 2017 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> File size Ratio Format Name 23:47:51 INFO - l10n-check> -------------------- ------ ----------- ----------- 23:47:51 INFO - l10n-check> 180736 -> 120320 66.57% win32/pe 7zSD.sfx 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> Packed 1 file. 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 7-Zip [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> Scanning the drive: 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 10 folders, 103 files, 162592808 bytes (156 MiB) 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> Creating archive: c:/users/task_1535663680/appdata/local/temp/tmpxnzpeu/app.7z 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> Items to compress: 113 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> Files read from disk: 100 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> Archive size: 41352985 bytes (40 MiB) 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> Everything is Ok 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> mozmake.EXE[6]: Leaving directory 'z:/build/build/src/obj-firefox/browser/installer/windows' 23:47:51 INFO - l10n-check> mozmake.EXE[4]: Entering directory 'z:/build/build/src/obj-firefox/browser/locales' 23:47:51 INFO - l10n-check> repackaging done 23:47:51 INFO - l10n-check> 23:47:51 INFO - l10n-check> mozmake.EXE[4]: Leaving directory 'z:/build/build/src/obj-firefox/browser/locales' 23:50:15 INFO - l10n-check> mozmake.EXE[3]: Entering directory 'z:/build/build/src/obj-firefox/browser/locales' 23:50:15 INFO - l10n-check> z:/build/build/src/obj-firefox/_virtualenvs/init/Scripts/python.exe z:/build/build/src/toolkit/mozapps/installer/unpack.py ../../dist/l10n-stage/firefox 23:50:15 INFO - l10n-check> cd ../../dist/l10n-stage && test $(cat firefox/update.locale) = x-test 23:50:15 INFO - l10n-check> mozmake.EXE[3]: Leaving directory 'z:/build/build/src/obj-firefox/browser/locales' [taskcluster:error] Aborting task... [taskcluster 2018-08-30T23:53:31.092Z] ERROR: The process "4000" not found. [taskcluster 2018-08-30T23:53:31.092Z] [taskcluster:warn 2018-08-30T23:53:31.092Z] exit status 128 [taskcluster 2018-08-30T23:53:31.133Z] ERROR: The process "2644" not found. [taskcluster 2018-08-30T23:53:31.133Z] [taskcluster:warn 2018-08-30T23:53:31.133Z] exit status 128 [taskcluster 2018-08-30T23:53:31.174Z] ERROR: The process "1128" not found. [taskcluster 2018-08-30T23:53:31.174Z] [taskcluster:warn 2018-08-30T23:53:31.174Z] exit status 128 [taskcluster 2018-08-30T23:53:31.214Z] ERROR: The process "3620" not found. [taskcluster 2018-08-30T23:53:31.214Z] [taskcluster:warn 2018-08-30T23:53:31.214Z] exit status 128 [taskcluster 2018-08-30T23:53:31.254Z] ERROR: The process "2552" not found. [taskcluster 2018-08-30T23:53:31.254Z] [taskcluster:warn 2018-08-30T23:53:31.254Z] exit status 128 [taskcluster 2018-08-30T23:53:31.293Z] ERROR: The process "2200" not found. [taskcluster 2018-08-30T23:53:31.293Z] [taskcluster:warn 2018-08-30T23:53:31.293Z] exit status 128 [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 4236 (child process of PID 5036) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 5780 (child process of PID 5036) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 5036 (child process of PID 5304) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 5304 (child process of PID 3092) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 3092 (child process of PID 5488) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 5488 (child process of PID 3408) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 3408 (child process of PID 996) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 888 (child process of PID 956) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 996 (child process of PID 956) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 956 (child process of PID 2660) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 2660 (child process of PID 1956) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 1900 (child process of PID 3732) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 1956 (child process of PID 3732) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] SUCCESS: The process with PID 3732 (child process of PID 2964) has been terminated. [taskcluster 2018-08-30T23:53:31.335Z] [taskcluster 2018-08-30T23:53:31.335Z] === Task Finished === [taskcluster 2018-08-30T23:53:31.335Z] Task Duration: 1h59m59.0288045s [taskcluster 2018-08-30T23:53:33.932Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-08-30T21:51:30.277Z [taskcluster 2018-08-30T23:53:34.807Z] Uploading artifact public/chainOfTrust.json.asc from file generic-worker\chainOfTrust.json.asc with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-08-30T21:51:30.277Z [taskcluster 2018-08-30T23:53:35.378Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/f2YQvf9fRkG1Z0axZ0n8Sg/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-08-30T21:51:30.277Z [taskcluster:error] Task aborted - max run time exceeded
Component: Untriaged → Release Automation: Uploading
Product: Firefox → Release Engineering
QA Contact: mtabara
Summary: Intermittent tier 2 bmsvc [taskcluster:error] Task aborted - max run time exceeded → Intermittent bmsvc [taskcluster:error] Task aborted - max run time exceeded
Component: Release Automation: Uploading → Toolchains
Product: Release Engineering → Firefox Build System
QA Contact: mtabara
Component: Toolchains → Mercurial: hg.mozilla.org
Product: Firefox Build System → Developer Services
Summary: Intermittent bmsvc [taskcluster:error] Task aborted - max run time exceeded → Intermittent [taskcluster:error] Task aborted - max run time exceeded due to slow cloning
Greg, what can we do to diagnose the issue here? Here's an example log: https://taskcluster-artifacts.net/W74h37GrTcq0EQ5ZYaYyig/0/public/logs/live_backing.log I'm worried this is going to end up biting us at an inopportune time if we don't get to the bottom of it now.
Flags: needinfo?(gps)
Severity: normal → major
Priority: P5 → P2
Connor, is this something you could help with? This just cost us both Windows nightlies (2 hours down the drain) and it's only a matter of time until this is burning us on dot releases or something urgent.
Flags: needinfo?(sheehan)
By EOY 2018, we hope to have rolled out Mercurial 4.8 on all our hg servers and leverage shallow clones/fast checkouts everywhere in CI. We are hoping this change will alleviate many of the various "slow/failing checkout" bugs that are currently on file, including this one. As of right now, our team's position is that we will be re-examining these bugs once the changes are deployed and we will be prioritizing any issues that persist. Bug hg48 would be worth following if you are interested in progress there. :)
Flags: needinfo?(sheehan)
Flags: needinfo?(gps)
Summary: Intermittent [taskcluster:error] Task aborted - max run time exceeded due to slow cloning → Intermittent [taskcluster:error] Task aborted - max run time exceeded [taskcluster:error] Task timeout after X seconds. Force killing container.due to slow cloning

Today the Windows ccov job failed 4 times because cloning was slow (15-30 minutes). For the 99.9% percentile, it's 10 minutes: https://sql.telemetry.mozilla.org/queries/65011/source

Classifications in this bug indicate this mostly affects Linux.

Connor, can you check if their is a pattern for the log clone times?

Flags: needinfo?(sheehan)

The only pattern I seem to see are tasks taking a long time during the working directory update stage (ie writing all the files to disk once the worker has a copy of the repo itself). A few examples:

10 minutes to 124400/283483 files.
20 minutes to 200100/283497 files.
10 minutes to 191261/27776 files. Interestingly this one slows down considerably about halfway through.

An aside: apologies for taking so long to respond to your needinfo.

Flags: needinfo?(sheehan)

There are 35 total failures in the last 7 days and 114 total failures in the last 30 days on linux64-shippable and linux32-shippable opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=278169683&repo=autoland&lineNumber=343

fetches 2019-11-26T06:41:32.740Z] Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/amJy-JSaSQeIAec8oSLdAg/artifacts/public/build/target.tar.bz2
[fetches 2019-11-26T06:41:32.961Z] https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/ZYGge5qsTr6LK-0URjRksA/artifacts/public/build/minidump_stackwalk.tar.xz resolved to 1059240 bytes with sha256 434e9ef66fd15c388a8d32141ec80de91d1434d543b4094d3efdf8a42610216d in 0.222s
[fetches 2019-11-26T06:41:32.961Z] Extracting /builds/worker/fetches/minidump_stackwalk.tar.xz to /builds/worker/fetches
[fetches 2019-11-26T06:41:33.058Z] /builds/worker/fetches/minidump_stackwalk.tar.xz extracted in 0.097s
[fetches 2019-11-26T06:41:33.058Z] Removing /builds/worker/fetches/minidump_stackwalk.tar.xz
[fetches 2019-11-26T06:41:35.702Z] https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/amJy-JSaSQeIAec8oSLdAg/artifacts/public/build/target.crashreporter-symbols.zip resolved to 46728139 bytes with sha256 dc4b4ee3c7dd901f90d90c490947613ce01f657efe87d416ec4925313296cb23 in 2.962s
[fetches 2019-11-26T06:41:35.702Z] Extracting /builds/worker/fetches/target.crashreporter-symbols.zip to /builds/worker/fetches
[fetches 2019-11-26T06:41:35.704Z] Archive: /builds/worker/fetches/target.crashreporter-symbols.zip
[fetches 2019-11-26T06:41:35.705Z] inflating: certutil/76E7CD108CC27257328B56C47CD334100/certutil.sym
[fetches 2019-11-26T06:41:35.709Z] inflating: crashreporter/A1FFF060CE84FE55D994BCEE60B9A50B0/crashreporter.sym
[fetches 2019-11-26T06:41:35.709Z] inflating: firefox/03EE84E97F81759D018D295F1A6A93B50/firefox.sym
[fetches 2019-11-26T06:41:35.718Z] inflating: firefox-bin/71AD26377FE876DE485640E729CA02ED0/firefox-bin.sym
[fetches 2019-11-26T06:41:35.719Z] inflating: libclearkey.so/580AE5298498DDFBB4125FF926C5574C0/libclearkey.so.sym
[fetches 2019-11-26T06:41:35.719Z] inflating: libfake.so/6B7A3E79866BEFFB4000C16C5BCE149F0/libfake.so.sym
[fetches 2019-11-26T06:41:35.720Z] inflating: libfakeopenh264.so/5435E7B18F1ED2629787D95AF83915BB0/libfakeopenh264.so.sym
[fetches 2019-11-26T06:41:35.720Z] inflating: libfreebl3.so/FBF49429510AD4BC4D58CE77A42937EB0/libfreebl3.so.sym
[fetches 2019-11-26T06:41:35.723Z] inflating: libfreeblpriv3.so/33F589E9D4312E2800BF25B0D078AE0A0/libfreeblpriv3.so.sym
[fetches 2019-11-26T06:41:35.724Z] inflating: liblgpllibs.so/678F58EE229BE9F45E61281490864D420/liblgpllibs.so.sym
[fetches 2019-11-26T06:41:35.930Z] inflating: libmozavcodec.so/E7854067DA8B6284B2FAB7D9BAC69C840/libmozavcodec.so.sym
[fetches 2019-11-26T06:41:35.932Z] inflating: libmozavutil.so/60815712CF6559ADCAD034645D02E4250/libmozavutil.so.sym
[fetches 2019-11-26T06:41:35.932Z] inflating: libmozgtk.so/57A90437BF2EF72D1EA14AC4E44926E30/libmozgtk.so.sym
[fetches 2019-11-26T06:41:35.932Z] inflating: libmozgtk.so/C5FE8C993BA0F6B5E77A2BCB71D84D2E0/libmozgtk.so.sym
[fetches 2019-11-26T06:41:35.933Z] inflating: libmozgtk_stub.so/3A4D6E41EBE27A9813BA4A152F985A900/libmozgtk_stub.so.sym
[fetches 2019-11-26T06:41:35.937Z] inflating: libmozsandbox.so/71E6EFC3EED512540C336A02A6F870560/libmozsandbox.so.sym
[fetches 2019-11-26T06:41:35.950Z] inflating: libmozsqlite3.so/8D3704DF18441DC39B430C977EF9E2960/libmozsqlite3.so.sym
[fetches 2019-11-26T06:41:35.950Z] inflating: libmozwayland.so/C79186E09B6D1CC1D2B66ADC1B65656D0/libmozwayland.so.sym
[fetches 2019-11-26T06:41:35.953Z] inflating: libnspr4.so/E8DCACEE7AC116F7E4F9E33441AF2BE00/libnspr4.so.sym
[fetches 2019-11-26T06:41:36.158Z] inflating: libnss3.so/F418568F16AF15E5C26C1140D89813C10/libnss3.so.sym
[fetches 2019-11-26T06:41:36.160Z] inflating: libnssckbi.so/5402DB9A5BB9B1CDD9919C51B7F0BE190/libnssckbi.so.sym
[fetches 2019-11-26T06:41:36.161Z] inflating: libnssdbm3.so/A1B4CC76EF52CCF1886F6A767FE2ADDD0/libnssdbm3.so.sym
[fetches 2019-11-26T06:41:36.162Z] inflating: libnssutil3.so/53F461198067A7408B6CE1A949EBC6F00/libnssutil3.so.sym
[fetches 2019-11-26T06:41:36.163Z] inflating: libplc4.so/65C1756078EA29F37A00CF496A220EA50/libplc4.so.sym
[fetches 2019-11-26T06:41:36.163Z] inflating: libplds4.so/AD755F5643F033D4E2642A673FAE52DC0/libplds4.so.sym
[fetches 2019-11-26T06:41:36.165Z] inflating: libsmime3.so/CDC81B706813F94C9C45B426FBE819130/libsmime3.so.sym
[fetches 2019-11-26T06:41:36.167Z] inflating: libsoftokn3.so/4C0994035F8D41A6DD0EE5E2F4DE67690/libsoftokn3.so.sym
[fetches 2019-11-26T06:41:36.171Z] inflating: libssl3.so/E2154F25538C375DDF46F2532F9552980/libssl3.so.sym
[fetches 2019-11-26T06:42:14.528Z] inflating: libxul.so/87D3F7C9536291DA200AFD00DA50798F0/libxul.so.sym https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/amJy-JSaSQeIAec8oSLdAg/artifacts/public/build/target.tar.bz2 resolved to 117839081 bytes with sha256 f0510d5434f4563bd33b8e8a89506e25a07433d5527fd839f63b562d64a322fd in 41.788s
[fetches 2019-11-26T06:42:14.528Z] Extracting /builds/worker/fetches/target.tar.bz2 to /builds/worker/fetches
[fetches 2019-11-26T06:43:43.460Z] https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/JtZkq06jSIOAKM4JBolPSg/artifacts/public/build/clang.tar.xz resolved to 182985204 bytes with sha256 950192ece3a8f114f22faab48f5b1c36fe75f2ac2c9dff729bc1361e19b5998e in 130.720s
[fetches 2019-11-26T06:43:43.460Z] Extracting /builds/worker/fetches/clang.tar.xz to /builds/worker/fetches
[taskcluster:error] Task timeout after 1200 seconds. Force killing container.
[taskcluster 2019-11-26 06:44:35.798Z] === Task Finished ===
[taskcluster 2019-11-26 06:44:35.798Z] Unsuccessful task run with exit code: -1 completed in 1236.448 seconds

Connor can you assign someone to work on this?

Flags: needinfo?(sheehan)
Whiteboard: [stockwell needswork:owner]

In the last 7 days there were 22 occurrences most on linux32-shippable, some on linux64-shippable, all on build type opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=279773029&repo=autoland&lineNumber=586

Flags: needinfo?(sheehan)

See also the comments on bug 1580652 comment 112 up to bug 1580652 comment 118.

(In reply to Connor Sheehan [:sheehan] from comment #118)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #114)

(In reply to Connor Sheehan [:sheehan] from comment #113)

IIRC the shared repository is a .hg directory mounted into the worker without a working directory checkout. This .hg directory is shared across task containers so each container can use the repo to create a working directory checkout for the task. If the worker doesn't have a copy of the repo in it's shared mount, it has to do a full clone to get one. Later tasks running on the same worker re-use this cached repo to do their working directory checkouts. So we essentially do a full clone when we spawn new workers.

How often do we create workers? It feels like that at some times during the day it's happening more often than in others, so that failures for test jobs are more likely to hit.

I'm not sure to be honest. We ought to have some telemetry that can answer these questions. I'm not entirely sure if that exists or where it might live.

Also given that the initial clone from a remote location can take such a long time would it be possible to already run this command when a new worker is getting created?

This was actually one of the things I had in mind - moving the clone time out of task-execution time and into worker-creation time, so the long clone time isn't developer facing. Then vcs operations in tasks would be much more consistent in their runtimes. This may affect some assumptions about how fast we can stand up workers when demand arises, so it's likely a more involved change I'll be looking into next year.

If the issue with these exceptionally slow tasks is filesystem/storage related we will obviously still see slow vcs operations, so that area is ripe for optimization regardless of whether we move the clone to worker-startup time.

There have been 42 total failures in the last 7 days, recent failure log.
Affected platforms are:

  • windows10-64
Flags: needinfo?(sheehan)

There have been 30 total failures in the last 7 days, recent failure log.
Affected platforms are:

  • windows10-64
  • android-hw-a51-11-0-aarch64-shippable-qr

There have been 42 total failures in the last 7 days, recent failure log.
Affected platforms are:

  • android-hw-a51-11-0-aarch64-shippable-qr
  • windows10-64
[vcs 2022-10-08T10:54:41.500Z] updating [=================================>             ] 237000/327286 8m42s
[vcs 2022-10-08T10:54:42.708Z] updating [=================================>             ] 237200/327286 8m40s
[vcs 2022-10-08T10:54:43.750Z] updating [=================================>             ] 237400/327286 8m39s
[vcs 2022-10-08T10:54:45.272Z] updating [=================================>             ] 237500/327286 8m42s
[vcs 2022-10-08T10:54:46.775Z] updating [=================================>             ] 237700/327286 8m44s
[vcs 2022-10-08T10:54:47.840Z] updating [=================================>             ] 237900/327286 8m46s
[vcs 2022-10-08T10:54:48.977Z] updating [=================================>             ] 238200/327286 8m38s
[taskcluster:error] Aborting task...
[taskcluster 2022-10-08T10:54:50.641Z] SUCCESS: The process with PID 6432 (child process of PID 6164) has been terminated.
[taskcluster 2022-10-08T10:54:50.641Z] SUCCESS: The process with PID 5268 (child process of PID 8128) has been terminated.
[taskcluster 2022-10-08T10:54:50.641Z] SUCCESS: The process with PID 6164 (child process of PID 8128) has been terminated.
[taskcluster 2022-10-08T10:54:50.641Z] SUCCESS: The process with PID 8128 (child process of PID 4448) has been terminated.
[taskcluster 2022-10-08T10:54:50.641Z] 
[taskcluster 2022-10-08T10:54:50.642Z] === Task Finished ===
[taskcluster 2022-10-08T10:54:50.642Z] Task Duration: 1h0m0.2658106s
[taskcluster 2022-10-08T10:54:50.651Z] [mounts] Preserving cache: Moving "Z:\\task_166522274304904\\build" to "Z:\\caches\\NWV9cMErQoOa8tIiNk8_yA"
[taskcluster 2022-10-08T10:54:50.991Z] [mounts] Denying task_166522274304904 access to 'Z:\caches\NWV9cMErQoOa8tIiNk8_yA'
[taskcluster 2022-10-08T10:55:45.346Z] Uploading link artifact public/logs/live.log to artifact public/logs/live_backing.log with expiry 2023-01-06T09:39:28.623Z
[taskcluster:error] Task aborted - max run time exceeded

Kash, could the browsertime test running on A51 be made more reliable?

Flags: needinfo?(kshampur)

:aryx Looking at the most recent week-to-date orange factor graph, looks like the btime failures have gone down, especially since Oct 6th onwards.

Regardless, I see some overlap of the a51 btime failures here and recent a51 failures in Bug 1589796 (which we are actively looking into, best case the solution for that works for this)

I will bring up this bug in our Triage meeting next week.

Flags: needinfo?(kshampur)
Whiteboard: [stockwell unknown] → [stockwell unknown], [perftest:triage]
See Also: → 1589796

a51 btime failures have gone down a lot. I think resetting the docker on all hosts, as done in Bug 1589796, comment548, has helped with this. Feel free to ni? me or someone else from my team if this reoccurs

Whiteboard: [stockwell unknown], [perftest:triage] → [stockwell unknown]

There have been 30 total failures in the last 7 days, recent failure log.
Affected platforms are:

  • windows10-64
[vcs 2022-11-12T10:52:08.088Z] updating [=======================================>       ] 286400/332294 9m38s
[vcs 2022-11-12T10:52:09.553Z] updating [=======================================>       ] 286500/332294 9m36s
[vcs 2022-11-12T10:52:10.829Z] updating [=======================================>       ] 286600/332294 9m37s
[vcs 2022-11-12T10:52:12.195Z] updating [=======================================>       ] 286700/332294 9m35s
[vcs 2022-11-12T10:52:13.606Z] updating [=======================================>       ] 286800/332294 9m35s
[vcs 2022-11-12T10:52:14.893Z] updating [=======================================>       ] 286900/332294 9m35s
[vcs 2022-11-12T10:52:16.298Z] updating [=======================================>       ] 287000/332294 9m34s
[vcs 2022-11-12T10:52:17.920Z] updating [=======================================>       ] 287100/332294 9m34s
[taskcluster:error] Aborting task...
[taskcluster 2022-11-12T10:52:18.399Z] SUCCESS: The process with PID 8432 (child process of PID 3532) has been terminated.
[taskcluster 2022-11-12T10:52:18.399Z] SUCCESS: The process with PID 256 (child process of PID 6584) has been terminated.
[taskcluster 2022-11-12T10:52:18.399Z] SUCCESS: The process with PID 3532 (child process of PID 6584) has been terminated.
[taskcluster 2022-11-12T10:52:18.399Z] SUCCESS: The process with PID 6584 (child process of PID 6592) has been terminated.
[taskcluster 2022-11-12T10:52:18.399Z] 
[taskcluster 2022-11-12T10:52:18.399Z] === Task Finished ===
[taskcluster 2022-11-12T10:52:18.399Z] Task Duration: 1h0m0.163866s
[taskcluster 2022-11-12T10:52:18.399Z] [mounts] Preserving cache: Moving "Z:\\task_166823680117229\\build" to "Z:\\caches\\ILgLDzM5RhebSfLPMr0BNw"
[taskcluster 2022-11-12T10:52:18.879Z] [mounts] Denying task_166823680117229 access to 'Z:\caches\ILgLDzM5RhebSfLPMr0BNw'
[taskcluster 2022-11-12T10:53:13.089Z] Uploading link artifact public/logs/live.log to artifact public/logs/live_backing.log with expiry 2023-02-10T09:51:01.941Z
[taskcluster:error] Task aborted - max run time exceeded

There have been 42 total failures in the last 7 days, recent failure log.
Affected platforms are:

  • windows10-64
[vcs 2023-01-04T22:42:12.157Z] updating [======================================>       ] 285900/335224 11m57s
[vcs 2023-01-04T22:42:13.645Z] updating [======================================>       ] 286000/335224 11m53s
[vcs 2023-01-04T22:42:14.868Z] updating [======================================>       ] 286100/335224 11m52s
[vcs 2023-01-04T22:42:16.181Z] updating [======================================>       ] 286200/335224 11m48s
[vcs 2023-01-04T22:42:17.714Z] updating [======================================>       ] 286300/335224 11m45s
[vcs 2023-01-04T22:42:19.219Z] updating [======================================>       ] 286400/335224 11m45s
[vcs 2023-01-04T22:42:20.572Z] updating [======================================>       ] 286500/335224 11m44s
[vcs 2023-01-04T22:42:21.867Z] updating [======================================>       ] 286600/335224 11m42s
[vcs 2023-01-04T22:42:23.159Z] updating [======================================>       ] 286700/335224 11m38s
[vcs 2023-01-04T22:42:24.617Z] updating [======================================>       ] 286800/335224 11m35s
[taskcluster:error] Aborting task...
[taskcluster 2023-01-04T22:42:25.937Z] SUCCESS: The process with PID 7712 (child process of PID 8700) has been terminated.
[taskcluster 2023-01-04T22:42:25.937Z] SUCCESS: The process with PID 8176 (child process of PID 1576) has been terminated.
[taskcluster 2023-01-04T22:42:25.937Z] SUCCESS: The process with PID 8700 (child process of PID 1576) has been terminated.
[taskcluster 2023-01-04T22:42:25.937Z] SUCCESS: The process with PID 1576 (child process of PID 4408) has been terminated.
[taskcluster 2023-01-04T22:42:25.937Z] 
[taskcluster 2023-01-04T22:42:25.937Z] === Task Finished ===
[taskcluster 2023-01-04T22:42:25.938Z] Task Duration: 1h0m0.2030385s
[taskcluster 2023-01-04T22:42:25.938Z] [mounts] Preserving cache: Moving "Z:\\task_167286411738299\\build" to "Z:\\caches\\XDYx5yC0TfuQ7T4VYwAVyQ"
[taskcluster 2023-01-04T22:42:26.465Z] [mounts] Denying task_167286411738299 access to 'Z:\caches\XDYx5yC0TfuQ7T4VYwAVyQ'
[taskcluster 2023-01-04T22:43:21.201Z] Uploading link artifact public/logs/live.log to artifact public/logs/live_backing.log with expiry 2024-01-04T21:42:06.103Z
[taskcluster:error] Task aborted - max run time exceeded

In the last 14+ hours, several machines (both northcentralus and southcentralus datacenters) which ran Windows source tasks (linting, python tests) failed permanently during cloning:
vm-pa6xdm9qt6zw8jej6opwut9xgsogs460654
vm-ec7vjb9hryy1gthn3tu5vwbwdkqc6cqhi6h
vm-wq105tgwrygycmzzob0uyqc4lybzioqluiw
vm-fgobhmnhr2sstpqjkjm3lgiwtef8imtkmwo
vm-m7y3n7oftsk4gkbjax38qsefrgbsxtwmxuq

example: https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/win10-64-2004-source/workers/northcentralus/vm-wq105tgwrygycmzzob0uyqc4lybzioqluiw?sortBy=started&sortDirection=desc

jcristau mentioned bug 1766735 as change to the workers, no idea how this could be related. Mark, could you investigate? This query will list machines which failed their last 3+ tasks.

Flags: needinfo?(mcornmesser)
Assignee: nobody → mcornmesser
Status: NEW → ASSIGNED

I am not sure why increasing the disk would cause this behavior. I have a phab patch in to increase the disk size and to use a premium SSD. Hopefully this will improve the performance of the clone.

Fortunately we have improvements coming with the future win11-2009 and win10-2009 source pools.

Flags: needinfo?(mcornmesser)
Pushed by mcornmesser@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/2196b1b6420b Bump win 10 source pools to premium 500 GB data disks. r=releng-reviewers,hneiva

Tentatively calling this fixed.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(sheehan)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: