If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Intermittent command timed out: 10800 seconds without output running ['/tools/buildbot/bin/python', '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config', 'builds/releng_base_linux_64_builds.py', '--branch', 'mozilla-inbound', '--build-poo

RESOLVED WORKSFORME

Status

Release Engineering
Buildduty
RESOLVED WORKSFORME
3 years ago
2 years ago

People

(Reporter: Tomcat, Unassigned)

Tracking

(Depends on: 1 bug, {intermittent-failure})

Firefox Tracking Flags

(Not tracked)

Details

(URL)

(Reporter)

Description

3 years ago
b2g_mozilla-inbound_linux32_gecko build on 2014-08-28 14:48:00 PDT for push b54f8eca9cb9

slave: bld-linux64-spot-1031

https://tbpl.mozilla.org/php/getParsedLog.php?id=46991464&tree=Mozilla-Inbound



command timed out: 10800 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-i386', '--cwd', '/builds/slave/m-in-linux32_g-000000000000000/build', '--unpriv', '--shell', u'/usr/bin/env HG_SHARE_BASE_DIR="/builds/hg-shared" LOCALES_FILE="locales/languages_dev.json" TOOLTOOL_HOME="/builds" SYMBOL_SERVER_HOST="symbolpush.mozilla.org" CCACHE_DIR="/builds/ccache" POST_SYMBOL_UPLOAD_CMD="/usr/local/bin/post-symbol-upload.py" MOZ_AUTOMATION="1" MOZ_OBJDIR="obj-firefox" SYMBOL_SERVER_... [exceeded max length]
(Reporter)

Comment 1

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=46994801&tree=Fx-Team
https://tbpl.mozilla.org/php/getParsedLog.php?id=46993432&tree=Fx-Team
(Reporter)

Comment 2

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=46994801&tree=Fx-Team

Comment 3

3 years ago
how can we get the tbpl robot to pick this up? Is that something releng can do?
There are two ways: have your error messages output as "Thing that says it's an error | Place where the error happened | Description of the error" (like "TEST-UNEXPECTED-FAIL | testInputConnection | Can set empty composition - got foofoo, expected foo") in which case tbpl will search for bugs with "Place where the error happened" in the summary, or, have the entire line of your error message match the bug summary, because for things that aren't "FAIL | Foo | Bar" tbpl does full line matching.

Given that your full line includes a commandline, and that commandline includes paths, and those paths are repo-specific... good luck. Possibly this bug would be suggested for another Linux32 failure on mozilla-inbound, but probably not even for that, with the combination of Bugzilla's maxlength and tbpl's trimming of long error lines.

Practically, for these timeouts, the solution would be to print the name of the buildstep, in this case "compile", rather than the commandline. Dunno whether it was implemented printing commandline for a reason, like maybe that distinguishes the dozen different things that all happen in the mozharness step for mozharness jobs, though.

Comment 5

3 years ago
(In reply to Phil Ringnalda (:philor) from comment #4)
> There are two ways: have your error messages output as "Thing that says it's
> an error | Place where the error happened | Description of the error" (like
> "TEST-UNEXPECTED-FAIL | testInputConnection | Can set empty composition -
> got foofoo, expected foo") in which case tbpl will search for bugs with
> "Place where the error happened" in the summary, or, have the entire line of
> your error message match the bug summary, because for things that aren't
> "FAIL | Foo | Bar" tbpl does full line matching.
> 
> Given that your full line includes a commandline, and that commandline
> includes paths, and those paths are repo-specific... good luck. Possibly
> this bug would be suggested for another Linux32 failure on mozilla-inbound,
> but probably not even for that, with the combination of Bugzilla's maxlength
> and tbpl's trimming of long error lines.
> 
> Practically, for these timeouts, the solution would be to print the name of
> the buildstep, in this case "compile", rather than the commandline. Dunno
> whether it was implemented printing commandline for a reason, like maybe
> that distinguishes the dozen different things that all happen in the
> mozharness step for mozharness jobs, though.

hmm so sounds like not trivially or with accuracy in this case. thanks for the explanation of how this works.
Comment hidden (Treeherder Robot)
Though, oddly, a wrong explanation - according to https://tbpl.mozilla.org/php/getLogExcerpt.php?type=annotated&regenerate=1&id=47125137&debug=1 what tbpl actually searches for is "command timed out: 10800 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-i386" randomly cutting it off at a convenient spot.
moving to ::Other pending any actual actionable item for buildduty
Component: Buildduty → Other
QA Contact: bugspam.Callek → pmoore
Ed, is this something we can work into https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php?

If yes, is there also an equivalent for tree herder?
Flags: needinfo?(emorley)
(In reply to Pete Moore [:pete][:pmoore] from comment #9)
> Ed, is this something we can work into
> https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php?

That file is responsible for highlighting error lines within the log. That's already happening successfully (as is the bug suggestions), since I can see the error line in comment 6. (Both TBPL and Treeherder, search for the first 100 characters of the error line).

The problem is that treeherder is currently struggling to suggest bugs for some of these long error line variants at the at the moment - filed as bug 1072377 (and also other deps of bug 1057359).


(And if it helps for reference...

Log parser regex:
https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php
https://github.com/mozilla/treeherder-service/blob/master/treeherder/log_parser/parsers.pyx#L283

Bug suggestions logic & blacklist:
https://hg.mozilla.org/webtools/tbpl/file/default/php/inc/AnnotatedSummaryGenerator.php#l73
https://github.com/mozilla/treeherder-service/blob/master/treeherder/log_parser/utils.py

)
Flags: needinfo?(emorley)
Comment hidden (Treeherder Robot)
Ed, now that 1072377 has landed, and I see a TBPL Robot comment above (comment 11) can I assume that this bug now correctly gets reported by the TBPL Robot?

I guess the bug itself still remains, so we need to keep this bug open, even if the "TBPL Robot should report automatically" part is done.
Flags: needinfo?(emorley)
Correct :-)
Flags: needinfo?(emorley)
Hi guys,

I don't see any more additions by the TBPL Robot - can we assume this matter is resolved now?

Pete
Component: Other → Buildduty
Flags: needinfo?(philringnalda)
QA Contact: pmoore → bugspam.Callek
Probably, dunno.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Flags: needinfo?(philringnalda)
Resolution: --- → WORKSFORME
Yeah, make that a "no, we switched to building with mozharness (and got pretty indifferent to infra failures and build timeouts and now just retrigger them)."
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: Intermittent command timed out: 10800 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-i386', '--cwd', '/builds/slave/m-in-linux32_g-000000000000000/build', '--unpriv', '--shell', u'/usr/bin/env HG_SHARE_BASE_DIR="/builds/hg-shared" → Intermittent command timed out: 10800 seconds without output running ['/tools/buildbot/bin/python', '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config', 'builds/releng_base_linux_64_builds.py', '--branch', 'mozilla-inbound', '--build-poo
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
below are some stats collected from tbpl robot. Will dive in more tomorrow but this doesn't look to be exclusive to any one slave or even a slave pool, though linux64 ec2 is obviously the dominant one this happens too. Lot's of asan and debug jobs. Frequencies all over the map but '2015-01-08': 7, and '2015-01-23': 5, stand out.

[playground]jlund@Hastings163:~/devel/mozilla/cleanRepos/braindump/buildduty
> ./diagnose_intermittent_bug.py 1060214
{'builder_counts': {'Linux fx-team build': 2,
                    'Linux fx-team leak test build': 3,
                    'Linux mozilla-inbound build': 1,
                    'Linux mozilla-inbound leak test build': 1,
                    'Linux x86-64 b2g-inbound debug asan build': 1,
                    'Linux x86-64 b2g-inbound debug static analysis build': 1,
                    'Linux x86-64 b2g-inbound pgo-build': 1,
                    'Linux x86-64 fx-team build': 3,
                    'Linux x86-64 fx-team leak test build': 3,
                    'Linux x86-64 mozilla-central debug asan build': 2,
                    'Linux x86-64 mozilla-central debug asan nightly': 1,
                    'Linux x86-64 mozilla-inbound asan build': 2,
                    'Linux x86-64 mozilla-inbound build': 3,
                    'Linux x86-64 mozilla-inbound debug asan build': 6,
                    'Linux x86-64 mozilla-inbound leak test build': 1,
                    'OS X 10.7 64-bit mozilla-inbound leak test build': 1,
                    'b2g_b2g-inbound_linux32_gecko-debug build': 1,
                    'b2g_mozilla-central_macosx64_gecko build': 1,
                    'b2g_mozilla-inbound_macosx64_gecko build': 1},
 'repo_counts': {'b2g-inbound': 4,
                 'fx-team': 11,
                 'mozilla-central': 4,
                 'mozilla-inbound': 16},
 'slave_counts': {'bld-linux64-spot-037': 1,
                  'bld-linux64-spot-041': 1,
                  'bld-linux64-spot-051': 1,
                  'bld-linux64-spot-052': 1,
                  'bld-linux64-spot-075': 1,
                  'bld-linux64-spot-097': 1,
                  'bld-linux64-spot-1010': 1,
                  'bld-linux64-spot-1013': 1,
                  'bld-linux64-spot-1043': 1,
                  'bld-linux64-spot-1068': 1,
                  'bld-linux64-spot-109': 1,
                  'bld-linux64-spot-1094': 1,
                  'bld-linux64-spot-1095': 1,
                  'bld-linux64-spot-132': 1,
                  'bld-linux64-spot-136': 1,
                  'bld-linux64-spot-176': 1,
                  'bld-linux64-spot-178': 1,
                  'bld-linux64-spot-192': 1,
                  'bld-linux64-spot-200': 1,
                  'bld-linux64-spot-205': 1,
                  'bld-linux64-spot-210': 1,
                  'bld-linux64-spot-215': 1,
                  'bld-linux64-spot-216': 1,
                  'bld-linux64-spot-252': 1,
                  'bld-linux64-spot-257': 1,
                  'bld-linux64-spot-306': 1,
                  'bld-linux64-spot-312': 1,
                  'bld-linux64-spot-345': 1,
                  'bld-linux64-spot-347': 1,
                  'bld-linux64-spot-362': 1,
                  'bld-linux64-spot-399': 1,
                  'bld-linux64-spot-476': 1,
                  'bld-lion-r5-062': 1,
                  'bld-lion-r5-076': 1,
                  'bld-lion-r5-080': 1},
 'time_counts': {'2014-10-09': 1,
                 '2014-12-25': 1,
                 '2014-12-29': 1,
                 '2014-12-30': 2,
                 '2015-01-05': 1,
                 '2015-01-06': 2,
                 '2015-01-08': 7,
                 '2015-01-12': 1,
                 '2015-01-16': 1,
                 '2015-01-17': 1,
                 '2015-01-19': 3,
                 '2015-01-20': 2,
                 '2015-01-22': 1,
                 '2015-01-23': 5,
                 '2015-01-26': 2,
                 '2015-01-27': 2,
                 '2015-01-28': 2}}
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
I think we can scale the 'timeout without output' limit down a bit here so we are not wasting so much time on, obviously, hung slaves.

looks like we upped the total time by an hour back in "Bug 1027983 - bm-2008-sm-00xx fail to link xul.dll when build is PGO enabled"

since we don't have those machines anymore, it's probably safe to bring that back down to 2hours

Updated

3 years ago
Depends on: 1141941
looking at the actual issue though:

it appears like most of these intermittents happen while doing some checksums:

02:40:33     INFO -  	'../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums' '../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums'.asc
command timed out: 10800 seconds without output running ['/tools/buildbot/bin/python', '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config'


looking at a good build, we can do this step in less than 10min so something obviously seems hung:

19:07:18     INFO -  	'../../dist//firefox-39.0a1.en-US.linux-x86_64-asan.checksums'
19:14:37     INFO -  sys.argv: ['/usr/local/bin/post_upload.py', '--tinderbox-builds-dir', 'mozilla-central-linux64-asan', '-p', 'firefox', '-i',

mshal, any ideas ^ ?
Flags: needinfo?(mshal)
Comment hidden (Treeherder Robot)
(In reply to Jordan Lund (:jlund) from comment #77)
> looking at the actual issue though:
> 
> it appears like most of these intermittents happen while doing some
> checksums:
> 
> 02:40:33     INFO -  
> '../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums'
> '../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums'.asc
> command timed out: 10800 seconds without output running
> ['/tools/buildbot/bin/python',
> '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config'
> 
> 
> looking at a good build, we can do this step in less than 10min so something
> obviously seems hung:
> 
> 19:07:18     INFO -  
> '../../dist//firefox-39.0a1.en-US.linux-x86_64-asan.checksums'
> 19:14:37     INFO -  sys.argv: ['/usr/local/bin/post_upload.py',
> '--tinderbox-builds-dir', 'mozilla-central-linux64-asan', '-p', 'firefox',
> '-i',
> 
> mshal, any ideas ^ ?

I think it is probably not the checksum that is hanging, but the upload step that comes right after (so maybe either upload.py or post_upload.py). The output can be misleading since make is doing some buffering here, and it might not always print the full buffer when it gets killed. AFAIK the checksums.py just reads some files on the local disk and generates sha/md5 sums for them, and writes the output to another local file. Maybe we need to make upload.py and/or post_upload.py more noisy to track this further, though again the output buffering may make that tricky.

How are we killing the make process when we hit the timeout? Testing locally on linux, it seems that a hanging target will flush its output buffer when killed with ctrl-C, but not with 'kill -9'.
Flags: needinfo?(mshal)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Inactive; closing (see bug 1180138).
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago2 years ago
Resolution: --- → WORKSFORME
Comment hidden (Treeherder Robot)
You need to log in before you can comment on or make changes to this bug.