Closed Bug 1060214 Opened 10 years ago Closed 9 years ago

Intermittent command timed out: 10800 seconds without output running ['/tools/buildbot/bin/python', '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config', 'builds/releng_base_linux_64_builds.py', '--branch', 'mozilla-inbound', '--build-poo

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
Gonk (Firefox OS)
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: cbook, Unassigned)

References

()

Details

(Keywords: intermittent-failure)

b2g_mozilla-inbound_linux32_gecko build on 2014-08-28 14:48:00 PDT for push b54f8eca9cb9

slave: bld-linux64-spot-1031

https://tbpl.mozilla.org/php/getParsedLog.php?id=46991464&tree=Mozilla-Inbound



command timed out: 10800 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-i386', '--cwd', '/builds/slave/m-in-linux32_g-000000000000000/build', '--unpriv', '--shell', u'/usr/bin/env HG_SHARE_BASE_DIR="/builds/hg-shared" LOCALES_FILE="locales/languages_dev.json" TOOLTOOL_HOME="/builds" SYMBOL_SERVER_HOST="symbolpush.mozilla.org" CCACHE_DIR="/builds/ccache" POST_SYMBOL_UPLOAD_CMD="/usr/local/bin/post-symbol-upload.py" MOZ_AUTOMATION="1" MOZ_OBJDIR="obj-firefox" SYMBOL_SERVER_... [exceeded max length]
how can we get the tbpl robot to pick this up? Is that something releng can do?
There are two ways: have your error messages output as "Thing that says it's an error | Place where the error happened | Description of the error" (like "TEST-UNEXPECTED-FAIL | testInputConnection | Can set empty composition - got foofoo, expected foo") in which case tbpl will search for bugs with "Place where the error happened" in the summary, or, have the entire line of your error message match the bug summary, because for things that aren't "FAIL | Foo | Bar" tbpl does full line matching.

Given that your full line includes a commandline, and that commandline includes paths, and those paths are repo-specific... good luck. Possibly this bug would be suggested for another Linux32 failure on mozilla-inbound, but probably not even for that, with the combination of Bugzilla's maxlength and tbpl's trimming of long error lines.

Practically, for these timeouts, the solution would be to print the name of the buildstep, in this case "compile", rather than the commandline. Dunno whether it was implemented printing commandline for a reason, like maybe that distinguishes the dozen different things that all happen in the mozharness step for mozharness jobs, though.
(In reply to Phil Ringnalda (:philor) from comment #4)
> There are two ways: have your error messages output as "Thing that says it's
> an error | Place where the error happened | Description of the error" (like
> "TEST-UNEXPECTED-FAIL | testInputConnection | Can set empty composition -
> got foofoo, expected foo") in which case tbpl will search for bugs with
> "Place where the error happened" in the summary, or, have the entire line of
> your error message match the bug summary, because for things that aren't
> "FAIL | Foo | Bar" tbpl does full line matching.
> 
> Given that your full line includes a commandline, and that commandline
> includes paths, and those paths are repo-specific... good luck. Possibly
> this bug would be suggested for another Linux32 failure on mozilla-inbound,
> but probably not even for that, with the combination of Bugzilla's maxlength
> and tbpl's trimming of long error lines.
> 
> Practically, for these timeouts, the solution would be to print the name of
> the buildstep, in this case "compile", rather than the commandline. Dunno
> whether it was implemented printing commandline for a reason, like maybe
> that distinguishes the dozen different things that all happen in the
> mozharness step for mozharness jobs, though.

hmm so sounds like not trivially or with accuracy in this case. thanks for the explanation of how this works.
Though, oddly, a wrong explanation - according to https://tbpl.mozilla.org/php/getLogExcerpt.php?type=annotated&regenerate=1&id=47125137&debug=1 what tbpl actually searches for is "command timed out: 10800 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-i386" randomly cutting it off at a convenient spot.
moving to ::Other pending any actual actionable item for buildduty
Component: Buildduty → Other
QA Contact: bugspam.Callek → pmoore
Ed, is this something we can work into https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php?

If yes, is there also an equivalent for tree herder?
Flags: needinfo?(emorley)
(In reply to Pete Moore [:pete][:pmoore] from comment #9)
> Ed, is this something we can work into
> https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php?

That file is responsible for highlighting error lines within the log. That's already happening successfully (as is the bug suggestions), since I can see the error line in comment 6. (Both TBPL and Treeherder, search for the first 100 characters of the error line).

The problem is that treeherder is currently struggling to suggest bugs for some of these long error line variants at the at the moment - filed as bug 1072377 (and also other deps of bug 1057359).


(And if it helps for reference...

Log parser regex:
https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php
https://github.com/mozilla/treeherder-service/blob/master/treeherder/log_parser/parsers.pyx#L283

Bug suggestions logic & blacklist:
https://hg.mozilla.org/webtools/tbpl/file/default/php/inc/AnnotatedSummaryGenerator.php#l73
https://github.com/mozilla/treeherder-service/blob/master/treeherder/log_parser/utils.py

)
Flags: needinfo?(emorley)
Ed, now that 1072377 has landed, and I see a TBPL Robot comment above (comment 11) can I assume that this bug now correctly gets reported by the TBPL Robot?

I guess the bug itself still remains, so we need to keep this bug open, even if the "TBPL Robot should report automatically" part is done.
Flags: needinfo?(emorley)
Correct :-)
Flags: needinfo?(emorley)
Hi guys,

I don't see any more additions by the TBPL Robot - can we assume this matter is resolved now?

Pete
Component: Other → Buildduty
Flags: needinfo?(philringnalda)
QA Contact: pmoore → bugspam.Callek
Probably, dunno.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(philringnalda)
Resolution: --- → WORKSFORME
Yeah, make that a "no, we switched to building with mozharness (and got pretty indifferent to infra failures and build timeouts and now just retrigger them)."
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: Intermittent command timed out: 10800 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-i386', '--cwd', '/builds/slave/m-in-linux32_g-000000000000000/build', '--unpriv', '--shell', u'/usr/bin/env HG_SHARE_BASE_DIR="/builds/hg-shared" → Intermittent command timed out: 10800 seconds without output running ['/tools/buildbot/bin/python', '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config', 'builds/releng_base_linux_64_builds.py', '--branch', 'mozilla-inbound', '--build-poo
below are some stats collected from tbpl robot. Will dive in more tomorrow but this doesn't look to be exclusive to any one slave or even a slave pool, though linux64 ec2 is obviously the dominant one this happens too. Lot's of asan and debug jobs. Frequencies all over the map but '2015-01-08': 7, and '2015-01-23': 5, stand out.

[playground]jlund@Hastings163:~/devel/mozilla/cleanRepos/braindump/buildduty
> ./diagnose_intermittent_bug.py 1060214
{'builder_counts': {'Linux fx-team build': 2,
                    'Linux fx-team leak test build': 3,
                    'Linux mozilla-inbound build': 1,
                    'Linux mozilla-inbound leak test build': 1,
                    'Linux x86-64 b2g-inbound debug asan build': 1,
                    'Linux x86-64 b2g-inbound debug static analysis build': 1,
                    'Linux x86-64 b2g-inbound pgo-build': 1,
                    'Linux x86-64 fx-team build': 3,
                    'Linux x86-64 fx-team leak test build': 3,
                    'Linux x86-64 mozilla-central debug asan build': 2,
                    'Linux x86-64 mozilla-central debug asan nightly': 1,
                    'Linux x86-64 mozilla-inbound asan build': 2,
                    'Linux x86-64 mozilla-inbound build': 3,
                    'Linux x86-64 mozilla-inbound debug asan build': 6,
                    'Linux x86-64 mozilla-inbound leak test build': 1,
                    'OS X 10.7 64-bit mozilla-inbound leak test build': 1,
                    'b2g_b2g-inbound_linux32_gecko-debug build': 1,
                    'b2g_mozilla-central_macosx64_gecko build': 1,
                    'b2g_mozilla-inbound_macosx64_gecko build': 1},
 'repo_counts': {'b2g-inbound': 4,
                 'fx-team': 11,
                 'mozilla-central': 4,
                 'mozilla-inbound': 16},
 'slave_counts': {'bld-linux64-spot-037': 1,
                  'bld-linux64-spot-041': 1,
                  'bld-linux64-spot-051': 1,
                  'bld-linux64-spot-052': 1,
                  'bld-linux64-spot-075': 1,
                  'bld-linux64-spot-097': 1,
                  'bld-linux64-spot-1010': 1,
                  'bld-linux64-spot-1013': 1,
                  'bld-linux64-spot-1043': 1,
                  'bld-linux64-spot-1068': 1,
                  'bld-linux64-spot-109': 1,
                  'bld-linux64-spot-1094': 1,
                  'bld-linux64-spot-1095': 1,
                  'bld-linux64-spot-132': 1,
                  'bld-linux64-spot-136': 1,
                  'bld-linux64-spot-176': 1,
                  'bld-linux64-spot-178': 1,
                  'bld-linux64-spot-192': 1,
                  'bld-linux64-spot-200': 1,
                  'bld-linux64-spot-205': 1,
                  'bld-linux64-spot-210': 1,
                  'bld-linux64-spot-215': 1,
                  'bld-linux64-spot-216': 1,
                  'bld-linux64-spot-252': 1,
                  'bld-linux64-spot-257': 1,
                  'bld-linux64-spot-306': 1,
                  'bld-linux64-spot-312': 1,
                  'bld-linux64-spot-345': 1,
                  'bld-linux64-spot-347': 1,
                  'bld-linux64-spot-362': 1,
                  'bld-linux64-spot-399': 1,
                  'bld-linux64-spot-476': 1,
                  'bld-lion-r5-062': 1,
                  'bld-lion-r5-076': 1,
                  'bld-lion-r5-080': 1},
 'time_counts': {'2014-10-09': 1,
                 '2014-12-25': 1,
                 '2014-12-29': 1,
                 '2014-12-30': 2,
                 '2015-01-05': 1,
                 '2015-01-06': 2,
                 '2015-01-08': 7,
                 '2015-01-12': 1,
                 '2015-01-16': 1,
                 '2015-01-17': 1,
                 '2015-01-19': 3,
                 '2015-01-20': 2,
                 '2015-01-22': 1,
                 '2015-01-23': 5,
                 '2015-01-26': 2,
                 '2015-01-27': 2,
                 '2015-01-28': 2}}
I think we can scale the 'timeout without output' limit down a bit here so we are not wasting so much time on, obviously, hung slaves.

looks like we upped the total time by an hour back in "Bug 1027983 - bm-2008-sm-00xx fail to link xul.dll when build is PGO enabled"

since we don't have those machines anymore, it's probably safe to bring that back down to 2hours
Depends on: 1141941
looking at the actual issue though:

it appears like most of these intermittents happen while doing some checksums:

02:40:33     INFO -  	'../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums' '../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums'.asc
command timed out: 10800 seconds without output running ['/tools/buildbot/bin/python', '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config'


looking at a good build, we can do this step in less than 10min so something obviously seems hung:

19:07:18     INFO -  	'../../dist//firefox-39.0a1.en-US.linux-x86_64-asan.checksums'
19:14:37     INFO -  sys.argv: ['/usr/local/bin/post_upload.py', '--tinderbox-builds-dir', 'mozilla-central-linux64-asan', '-p', 'firefox', '-i',

mshal, any ideas ^ ?
Flags: needinfo?(mshal)
(In reply to Jordan Lund (:jlund) from comment #77)
> looking at the actual issue though:
> 
> it appears like most of these intermittents happen while doing some
> checksums:
> 
> 02:40:33     INFO -  
> '../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums'
> '../../dist//firefox-38.0a2.en-US.linux-x86_64.checksums'.asc
> command timed out: 10800 seconds without output running
> ['/tools/buildbot/bin/python',
> '/tools/checkouts/mozharness/scripts/fx_desktop_build.py', '--config'
> 
> 
> looking at a good build, we can do this step in less than 10min so something
> obviously seems hung:
> 
> 19:07:18     INFO -  
> '../../dist//firefox-39.0a1.en-US.linux-x86_64-asan.checksums'
> 19:14:37     INFO -  sys.argv: ['/usr/local/bin/post_upload.py',
> '--tinderbox-builds-dir', 'mozilla-central-linux64-asan', '-p', 'firefox',
> '-i',
> 
> mshal, any ideas ^ ?

I think it is probably not the checksum that is hanging, but the upload step that comes right after (so maybe either upload.py or post_upload.py). The output can be misleading since make is doing some buffering here, and it might not always print the full buffer when it gets killed. AFAIK the checksums.py just reads some files on the local disk and generates sha/md5 sums for them, and writes the output to another local file. Maybe we need to make upload.py and/or post_upload.py more noisy to track this further, though again the output buffering may make that tricky.

How are we killing the make process when we hit the timeout? Testing locally on linux, it seems that a hanging target will flush its output buffer when killed with ctrl-C, but not with 'kill -9'.
Flags: needinfo?(mshal)
Inactive; closing (see bug 1180138).
Status: REOPENED → RESOLVED
Closed: 10 years ago9 years ago
Resolution: --- → WORKSFORME
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.