Closed
Bug 1060214
Opened 10 years ago
Closed 9 years ago
Intermittent command timed out: 10800 seconds without output running ['/tools/buildbot/bin/python', '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config', 'builds/releng_base_linux_64_builds.py', '--branch', 'mozilla-inbound', '--build-poo
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: cbook, Unassigned)
References
()
Details
(Keywords: intermittent-failure)
b2g_mozilla-inbound_linux32_gecko build on 2014-08-28 14:48:00 PDT for push b54f8eca9cb9 slave: bld-linux64-spot-1031 https://tbpl.mozilla.org/php/getParsedLog.php?id=46991464&tree=Mozilla-Inbound command timed out: 10800 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-i386', '--cwd', '/builds/slave/m-in-linux32_g-000000000000000/build', '--unpriv', '--shell', u'/usr/bin/env HG_SHARE_BASE_DIR="/builds/hg-shared" LOCALES_FILE="locales/languages_dev.json" TOOLTOOL_HOME="/builds" SYMBOL_SERVER_HOST="symbolpush.mozilla.org" CCACHE_DIR="/builds/ccache" POST_SYMBOL_UPLOAD_CMD="/usr/local/bin/post-symbol-upload.py" MOZ_AUTOMATION="1" MOZ_OBJDIR="obj-firefox" SYMBOL_SERVER_... [exceeded max length]
Reporter | ||
Comment 1•10 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=46994801&tree=Fx-Team https://tbpl.mozilla.org/php/getParsedLog.php?id=46993432&tree=Fx-Team
Reporter | ||
Comment 2•10 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=46994801&tree=Fx-Team
Comment 3•10 years ago
|
||
how can we get the tbpl robot to pick this up? Is that something releng can do?
Comment 4•10 years ago
|
||
There are two ways: have your error messages output as "Thing that says it's an error | Place where the error happened | Description of the error" (like "TEST-UNEXPECTED-FAIL | testInputConnection | Can set empty composition - got foofoo, expected foo") in which case tbpl will search for bugs with "Place where the error happened" in the summary, or, have the entire line of your error message match the bug summary, because for things that aren't "FAIL | Foo | Bar" tbpl does full line matching. Given that your full line includes a commandline, and that commandline includes paths, and those paths are repo-specific... good luck. Possibly this bug would be suggested for another Linux32 failure on mozilla-inbound, but probably not even for that, with the combination of Bugzilla's maxlength and tbpl's trimming of long error lines. Practically, for these timeouts, the solution would be to print the name of the buildstep, in this case "compile", rather than the commandline. Dunno whether it was implemented printing commandline for a reason, like maybe that distinguishes the dozen different things that all happen in the mozharness step for mozharness jobs, though.
Comment 5•10 years ago
|
||
(In reply to Phil Ringnalda (:philor) from comment #4) > There are two ways: have your error messages output as "Thing that says it's > an error | Place where the error happened | Description of the error" (like > "TEST-UNEXPECTED-FAIL | testInputConnection | Can set empty composition - > got foofoo, expected foo") in which case tbpl will search for bugs with > "Place where the error happened" in the summary, or, have the entire line of > your error message match the bug summary, because for things that aren't > "FAIL | Foo | Bar" tbpl does full line matching. > > Given that your full line includes a commandline, and that commandline > includes paths, and those paths are repo-specific... good luck. Possibly > this bug would be suggested for another Linux32 failure on mozilla-inbound, > but probably not even for that, with the combination of Bugzilla's maxlength > and tbpl's trimming of long error lines. > > Practically, for these timeouts, the solution would be to print the name of > the buildstep, in this case "compile", rather than the commandline. Dunno > whether it was implemented printing commandline for a reason, like maybe > that distinguishes the dozen different things that all happen in the > mozharness step for mozharness jobs, though. hmm so sounds like not trivially or with accuracy in this case. thanks for the explanation of how this works.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 7•10 years ago
|
||
Though, oddly, a wrong explanation - according to https://tbpl.mozilla.org/php/getLogExcerpt.php?type=annotated®enerate=1&id=47125137&debug=1 what tbpl actually searches for is "command timed out: 10800 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-i386" randomly cutting it off at a convenient spot.
Comment 8•10 years ago
|
||
moving to ::Other pending any actual actionable item for buildduty
Component: Buildduty → Other
QA Contact: bugspam.Callek → pmoore
Comment 9•10 years ago
|
||
Ed, is this something we can work into https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php? If yes, is there also an equivalent for tree herder?
Flags: needinfo?(emorley)
Comment 10•10 years ago
|
||
(In reply to Pete Moore [:pete][:pmoore] from comment #9) > Ed, is this something we can work into > https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php? That file is responsible for highlighting error lines within the log. That's already happening successfully (as is the bug suggestions), since I can see the error line in comment 6. (Both TBPL and Treeherder, search for the first 100 characters of the error line). The problem is that treeherder is currently struggling to suggest bugs for some of these long error line variants at the at the moment - filed as bug 1072377 (and also other deps of bug 1057359). (And if it helps for reference... Log parser regex: https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php https://github.com/mozilla/treeherder-service/blob/master/treeherder/log_parser/parsers.pyx#L283 Bug suggestions logic & blacklist: https://hg.mozilla.org/webtools/tbpl/file/default/php/inc/AnnotatedSummaryGenerator.php#l73 https://github.com/mozilla/treeherder-service/blob/master/treeherder/log_parser/utils.py )
Flags: needinfo?(emorley)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 12•10 years ago
|
||
Ed, now that 1072377 has landed, and I see a TBPL Robot comment above (comment 11) can I assume that this bug now correctly gets reported by the TBPL Robot? I guess the bug itself still remains, so we need to keep this bug open, even if the "TBPL Robot should report automatically" part is done.
Flags: needinfo?(emorley)
Comment 14•10 years ago
|
||
Hi guys, I don't see any more additions by the TBPL Robot - can we assume this matter is resolved now? Pete
Component: Other → Buildduty
Flags: needinfo?(philringnalda)
QA Contact: pmoore → bugspam.Callek
Comment 15•10 years ago
|
||
Probably, dunno.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(philringnalda)
Resolution: --- → WORKSFORME
Comment 16•10 years ago
|
||
Yeah, make that a "no, we switched to building with mozharness (and got pretty indifferent to infra failures and build timeouts and now just retrigger them)."
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: Intermittent command timed out: 10800 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-i386', '--cwd', '/builds/slave/m-in-linux32_g-000000000000000/build', '--unpriv', '--shell', u'/usr/bin/env HG_SHARE_BASE_DIR="/builds/hg-shared" → Intermittent command timed out: 10800 seconds without output running ['/tools/buildbot/bin/python', '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config', 'builds/releng_base_linux_64_builds.py', '--branch', 'mozilla-inbound', '--build-poo
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 51•9 years ago
|
||
below are some stats collected from tbpl robot. Will dive in more tomorrow but this doesn't look to be exclusive to any one slave or even a slave pool, though linux64 ec2 is obviously the dominant one this happens too. Lot's of asan and debug jobs. Frequencies all over the map but '2015-01-08': 7, and '2015-01-23': 5, stand out.
[playground]jlund@Hastings163:~/devel/mozilla/cleanRepos/braindump/buildduty
> ./diagnose_intermittent_bug.py 1060214
{'builder_counts': {'Linux fx-team build': 2,
'Linux fx-team leak test build': 3,
'Linux mozilla-inbound build': 1,
'Linux mozilla-inbound leak test build': 1,
'Linux x86-64 b2g-inbound debug asan build': 1,
'Linux x86-64 b2g-inbound debug static analysis build': 1,
'Linux x86-64 b2g-inbound pgo-build': 1,
'Linux x86-64 fx-team build': 3,
'Linux x86-64 fx-team leak test build': 3,
'Linux x86-64 mozilla-central debug asan build': 2,
'Linux x86-64 mozilla-central debug asan nightly': 1,
'Linux x86-64 mozilla-inbound asan build': 2,
'Linux x86-64 mozilla-inbound build': 3,
'Linux x86-64 mozilla-inbound debug asan build': 6,
'Linux x86-64 mozilla-inbound leak test build': 1,
'OS X 10.7 64-bit mozilla-inbound leak test build': 1,
'b2g_b2g-inbound_linux32_gecko-debug build': 1,
'b2g_mozilla-central_macosx64_gecko build': 1,
'b2g_mozilla-inbound_macosx64_gecko build': 1},
'repo_counts': {'b2g-inbound': 4,
'fx-team': 11,
'mozilla-central': 4,
'mozilla-inbound': 16},
'slave_counts': {'bld-linux64-spot-037': 1,
'bld-linux64-spot-041': 1,
'bld-linux64-spot-051': 1,
'bld-linux64-spot-052': 1,
'bld-linux64-spot-075': 1,
'bld-linux64-spot-097': 1,
'bld-linux64-spot-1010': 1,
'bld-linux64-spot-1013': 1,
'bld-linux64-spot-1043': 1,
'bld-linux64-spot-1068': 1,
'bld-linux64-spot-109': 1,
'bld-linux64-spot-1094': 1,
'bld-linux64-spot-1095': 1,
'bld-linux64-spot-132': 1,
'bld-linux64-spot-136': 1,
'bld-linux64-spot-176': 1,
'bld-linux64-spot-178': 1,
'bld-linux64-spot-192': 1,
'bld-linux64-spot-200': 1,
'bld-linux64-spot-205': 1,
'bld-linux64-spot-210': 1,
'bld-linux64-spot-215': 1,
'bld-linux64-spot-216': 1,
'bld-linux64-spot-252': 1,
'bld-linux64-spot-257': 1,
'bld-linux64-spot-306': 1,
'bld-linux64-spot-312': 1,
'bld-linux64-spot-345': 1,
'bld-linux64-spot-347': 1,
'bld-linux64-spot-362': 1,
'bld-linux64-spot-399': 1,
'bld-linux64-spot-476': 1,
'bld-lion-r5-062': 1,
'bld-lion-r5-076': 1,
'bld-lion-r5-080': 1},
'time_counts': {'2014-10-09': 1,
'2014-12-25': 1,
'2014-12-29': 1,
'2014-12-30': 2,
'2015-01-05': 1,
'2015-01-06': 2,
'2015-01-08': 7,
'2015-01-12': 1,
'2015-01-16': 1,
'2015-01-17': 1,
'2015-01-19': 3,
'2015-01-20': 2,
'2015-01-22': 1,
'2015-01-23': 5,
'2015-01-26': 2,
'2015-01-27': 2,
'2015-01-28': 2}}
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 76•9 years ago
|
||
I think we can scale the 'timeout without output' limit down a bit here so we are not wasting so much time on, obviously, hung slaves. looks like we upped the total time by an hour back in "Bug 1027983 - bm-2008-sm-00xx fail to link xul.dll when build is PGO enabled" since we don't have those machines anymore, it's probably safe to bring that back down to 2hours
Comment 77•9 years ago
|
||
looking at the actual issue though: it appears like most of these intermittents happen while doing some checksums: 02:40:33 INFO - '../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums' '../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums'.asc command timed out: 10800 seconds without output running ['/tools/buildbot/bin/python', '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config' looking at a good build, we can do this step in less than 10min so something obviously seems hung: 19:07:18 INFO - '../../dist//firefox-39.0a1.en-US.linux-x86_64-asan.checksums' 19:14:37 INFO - sys.argv: ['/usr/local/bin/post_upload.py', '--tinderbox-builds-dir', 'mozilla-central-linux64-asan', '-p', 'firefox', '-i', mshal, any ideas ^ ?
Flags: needinfo?(mshal)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 79•9 years ago
|
||
(In reply to Jordan Lund (:jlund) from comment #77) > looking at the actual issue though: > > it appears like most of these intermittents happen while doing some > checksums: > > 02:40:33 INFO - > '../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums' > '../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums'.asc > command timed out: 10800 seconds without output running > ['/tools/buildbot/bin/python', > '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config' > > > looking at a good build, we can do this step in less than 10min so something > obviously seems hung: > > 19:07:18 INFO - > '../../dist//firefox-39.0a1.en-US.linux-x86_64-asan.checksums' > 19:14:37 INFO - sys.argv: ['/usr/local/bin/post_upload.py', > '--tinderbox-builds-dir', 'mozilla-central-linux64-asan', '-p', 'firefox', > '-i', > > mshal, any ideas ^ ? I think it is probably not the checksum that is hanging, but the upload step that comes right after (so maybe either upload.py or post_upload.py). The output can be misleading since make is doing some buffering here, and it might not always print the full buffer when it gets killed. AFAIK the checksums.py just reads some files on the local disk and generates sha/md5 sums for them, and writes the output to another local file. Maybe we need to make upload.py and/or post_upload.py more noisy to track this further, though again the output buffering may make that tricky. How are we killing the make process when we hit the timeout? Testing locally on linux, it seems that a hanging target will flush its output buffer when killed with ctrl-C, but not with 'kill -9'.
Flags: needinfo?(mshal)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 86•9 years ago
|
||
Inactive; closing (see bug 1180138).
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → WORKSFORME
Comment hidden (Legacy TBPL/Treeherder Robot) |
Updated•6 years ago
|
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•