Closed Bug 803530 Opened 12 years ago Closed 12 years ago

dxr-mozilla-central builds failing after Oct 10-12th with "command timed out: 3600 seconds without output"

Categories

(Release Engineering :: General, defect, P2)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

References

Details

(Whiteboard: [dxr])

DXR builds are currently failing with:

{
...
...
webapprt.cpp
/builds/slave/dxr-mozilla-central/dxr-build-env/src/webapprt/components.manifest: WARNING: no useful preprocessor directives found
/builds/slave/dxr-mozilla-central/dxr-build-env/src/webapprt/ContentPolicy.js: WARNING: no preprocessor directives found
/builds/slave/dxr-mozilla-central/dxr-build-env/src/webapprt/ContentPermission.js: WARNING: no preprocessor directives found
/builds/slave/dxr-mozilla-central/dxr-build-env/src/webapprt/CommandLineHandler.js: WARNING: no preprocessor directives found
/builds/slave/dxr-mozilla-central/dxr-build-env/src/webapprt/DirectoryProvider.js: WARNING: no preprocessor directives found
AboutRedirector.cpp
DirectoryProvider.cpp
nsPrivateBrowsingServiceWrapper.cpp
nsFeedSniffer.cpp
/builds/slave/dxr-mozilla-central/dxr-build-env/src/config/makefiles/mochitest.mk:47: browser_forgetthissite_single.js temporarily disabled because of very frequent oranges, see bug 551540
/builds/slave/dxr-mozilla-central/dxr-build-env/src/config/makefiles/mochitest.mk:47: browser_sidebarpanels_click.js temporarily disabled cause it breaks the treeview, see bug 658744
/builds/slave/dxr-mozilla-central/dxr-build-env/src/config/rules.mk:1649: browser_forgetthissite_single.js temporarily disabled because of very frequent oranges, see bug 551540
/builds/slave/dxr-mozilla-central/dxr-build-env/src/config/rules.mk:1649: browser_sidebarpanels_click.js temporarily disabled cause it breaks the treeview, see bug 658744
/builds/slave/dxr-mozilla-central/dxr-build-env/src/config/rules.mk:1655: browser_forgetthissite_single.js temporarily disabled because of very frequent oranges, see bug 551540
/build
command timed out: 3600 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=9174.435886
========= Finished 'scripts/scripts/dxr/dxr.sh' failed (results: 2, elapsed: 2 hrs, 32 mins, 54 secs) (at 2012-10-18 05:40:05.617573) =========
}

eg:
https://tbpl.mozilla.org/php/getParsedLog.php?id=16229815&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=16194482&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=16155389&tree=Firefox

First known bad:
https://tbpl.mozilla.org/php/getParsedLog.php?id=16041442&tree=Firefox

Last known good:
https://tbpl.mozilla.org/php/getParsedLog.php?id=16009709&tree=Firefox
How long should this be taking without producing any output?
Severity: critical → major
Whiteboard: [dxr]
Hmm, that "last known good" shouldn't be good at all (the FF compile failed), which probably means that the build script doesn't have exit-on-error enabled somewhere. The logs appear to indicate that the build succeeds and that the post-processing is what times out.

Jonas, what kind of intermediate output does your build.sh script dump out?
build.sh essentially sets some environment variables and calls dxr/update.sh with the config file and tree. So it dumps all the useless compiler warnings and the warnings from the DXR clang plugin.

But I think catlee added a grep at the invocation of build.sh, to eat all the "Unprocessekind..." warnings from the dxr-clang plugin, so they wouldn't fill up the log and cause problems.

With regard to exit-on-error, this was a minimum effort to get DXR deployed. So I wouldn't be surprised if the build could fail silently, although we do check the size of the generated tarball afterwards.
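For illustration only, the two safeguards discussed here (failing as soon as a step fails, and checking the tarball size afterwards) could be sketched in Python roughly as follows. This is not the actual build.sh or update.sh logic: the helper names and the `MIN_TARBALL_BYTES` threshold are hypothetical, and the real size check may look quite different.

```python
import os
import subprocess

# Hypothetical threshold; the size the real check uses is not stated in this bug.
MIN_TARBALL_BYTES = 1024


def run_step(argv):
    """Run one build step and raise CalledProcessError on a non-zero exit.

    This mirrors exit-on-error: a failed step stops the script
    instead of letting later steps run against a broken tree.
    """
    subprocess.check_call(argv)


def tarball_looks_sane(path, min_bytes=MIN_TARBALL_BYTES):
    """Return True if the tarball exists and is at least min_bytes long."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes
```

With something like this, a failed compile would surface as an exception at the failing step rather than as an apparently "good" run whose only safety net is the size check.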
The dxr compile for today seems to have passed, but not the ones from the past few days. The successful compile appears to have taken about 15 minutes less than the unsuccessful one, so it seems very likely that the post-processing time is hovering around the 1-hour mark, which is the cutoff for killing processes with no output.

I worked with catlee a bit today to debug this. Most likely, the combination of DXR producing sparse output and its I/O being unflushed (!) means that DXR never appears to make any progress. Calling sys.stdout.flush() after most of the print statements should break the run into chunks small enough that buildbot won't think it's stuck in an infinite loop...
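A minimal sketch of the flushing idea, assuming a post-processing loop that reports once per file. The function names and message format are hypothetical and not taken from DXR; the point is only that each progress line is flushed as soon as it is written, so a pipe reader such as buildbot sees it immediately instead of when the buffer fills.

```python
import sys


def log_progress(message, stream=sys.stdout):
    """Write one progress line and flush it right away."""
    stream.write(message + "\n")
    # Without this flush, output sent to a pipe can sit in the buffer
    # for a long time, which looks like "no output" to the harness.
    stream.flush()


def process_files(paths, stream=sys.stdout):
    """Hypothetical post-processing loop that reports each file as it finishes."""
    for index, path in enumerate(paths, start=1):
        # ... the real per-file work would happen here ...
        log_progress("[%d/%d] processed %s" % (index, len(paths), path), stream)
    return len(paths)
```

Flushing only helps if some output is produced within each timeout window; a single step that runs silently for longer than the cutoff would still be killed.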
Priority: -- → P2
The build for 2012-10-24 also failed, so it's definitely not fixed.
Blocks: 784681
I pushed a change which flushes, as mentioned in comment 4. Let's see if this gets things to work.
The first run ran out of disk space. I set a retry, but the retry landed on the same slave that the previous run had broken. Better luck tomorrow.
Depends on: 807680
(In reply to Joshua Cranmer [:jcranmer] from comment #6)
> I pushed a change which flushes, as mentioned in comment 4. Let's see if
> this gets things to work.

This does not appear to have worked. Is it possible to try bumping the timeout to 2 hours without output?
I bumped the timeout to 2 hours. This will take effect after the next reconfig.
In production
Looks like this worked, I see green builds.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General