Open Bug 948145 Opened 6 years ago Updated 2 years ago

Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-x86_6"

Categories: Testing :: General (defect)
Hardware: x86 Linux
Priority: Not set
Tracking: Not tracked
Status: REOPENED
People: Reporter: njn; Assignee: Unassigned
References: Blocks 1 open bug
Details: Keywords: intermittent-failure

Valgrind-on-TBPL runs are now often timing out, with this message:

> command timed out: 1200 seconds without output, attempting to kill

https://tbpl.mozilla.org/php/getParsedLog.php?id=31676175&tree=Mozilla-Central is a recent example in which Valgrind didn't produce any output.  Here's some of the output:

> + make pgo-profile-run 'EXTRA_TEST_ARGS=--debugger=valgrind --debugger-args='\''--error-exitcode=1 --smc-check=all-non-file --gen-suppressions=all --leak-check=full --num-callers=15 --show-possibly-lost=no --track-origins=yes --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/cross-architecture.sup --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/x86_64-redhat-linux-gnu.sup'\''' 
> /builds/slave/m-cen-l64-valgrind-00000000000/objdir/_virtualenv/bin/python ../src/build/pgo/profileserver.py --debugger=valgrind --debugger-args='--error-exitcode=1 --smc-check=all-non-file --gen-suppressions=all --leak-check=full --num-callers=15 --show-possibly-lost=no --track-origins=yes --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/cross-architecture.sup --suppressions=/builds/slave/m-cen-l64-valgrind-00000000000/objdir/_valgrind/x86_64-redhat-linux-gnu.sup'
> 
> command timed out: 1200 seconds without output, attempting to kill
> program finished with exit code 247

https://tbpl.mozilla.org/php/getParsedLog.php?id=31654180&tree=Mozilla-Central is the first example that I saw.  Valgrind produced almost all of its output, with only a bit truncated at the end:

> ==1979== LEAK SUMMARY:
> ==1979==    definitely lost: 9,032 bytes in 7 blocks
> ==1979==    indirectly lost: 1,552 bytes in 6 blocks
> ==1979==      possibly lost: 96,747 bytes in 1,749 blocks
> ==1979==    still reachable: 3,452,056 bytes in 6,091 blocks
> ==1979==         suppressed: 94,165 bytes in 1,775 blocks
> ==1979== Reachable blocks (those to which a pointer was found) ar
> command timed out: 1200 seconds without output, attempting to kill
> program finished with exit code 247
> elapsedTime=3493.368914
I turned on extra logging (-d -v) in an attempt to understand this better:
https://hg.mozilla.org/build/tools/rev/2b49e8cb95c1

In many of the runs, Valgrind isn't even printing its start-up messages.  Humph.
I just retriggered some old m-c Valgrind runs from Dec 6, which was before the timeout showed up.  They timed out.  This suggests that it wasn't a change in m-c that started causing the problems.  Ideas:

- Changes to valgrind.sh in the tools/ repo?  Or some other part of the testing infrastructure that is stored outside of mozilla-central?

- Changes to the build machines' configurations?

And it's really weird the way the valgrind output is being truncated.  E.g. the final calls to VG_(debugLog)() are printing some but not all of their output, despite the fact that the actual writing of the output happens in a single call to the write() syscall, using stderr.  It doesn't make sense that that operation could be failing/hanging/whatever... it really feels more like the harness that Valgrind is running inside is doing something wrong.
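
For what it's worth, one mundane way output can look truncated even though it was emitted in a single write() call is back-pressure on the pipe: if whatever is reading the child's stderr stops draining it, the kernel pipe buffer fills and the write blocks partway through the message. A rough stand-alone sketch (not Mozilla code) of that effect:

    import subprocess, sys, time

    # A child does one big write() to stderr; once the pipe buffer fills it blocks.
    child = subprocess.Popen(
        [sys.executable, "-c",
         "import os; os.write(2, b'x' * (1 << 20) + b' END\\n')"],
        stderr=subprocess.PIPE)

    time.sleep(1)                        # pretend the reader has stopped draining the pipe
    partial = child.stderr.read(4096)    # read only a little; the child stays blocked in write()
    print("read %d bytes; child still running: %s" % (len(partial), child.poll() is None))
    child.kill()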
Let's try running natively, instead of under Valgrind, in case that tells us anything:
https://hg.mozilla.org/build/tools/rev/7ad8a0bfc8cc
> it really feels more like the harness that Valgrind is running inside is doing
> something wrong.

> Let's try running natively, instead of under Valgrind, in case that tells us
> anything:

The timeout still occurs!  E.g.
https://tbpl.mozilla.org/php/getParsedLog.php?id=31726798&tree=Mozilla-Inbound

Well isn't that interesting.
> Well isn't that interesting.

Just to clarify:  this proves that it's not Valgrind that's hanging.  So what other component could be responsible?
> Let's try running natively, instead of under Valgrind, in case that tells us
> anything:
> https://hg.mozilla.org/build/tools/rev/7ad8a0bfc8cc

I reverted this change here:
https://hg.mozilla.org/build/tools/rev/8e17d3c1e37b
I temporarily disabled the |make pgo-profile-run| step, to see how that affects things:
https://hg.mozilla.org/build/tools/rev/3db2a940cdc6
(In reply to Nicholas Nethercote [:njn] from comment #124)
> I temporarily disabled the |make pgo-profile-run| step, to see how that
> affects things:
> https://hg.mozilla.org/build/tools/rev/3db2a940cdc6

They were green.  Not terribly surprising.  I reverted the change:
https://hg.mozilla.org/build/tools/rev/8a908482428e
(In reply to Nicholas Nethercote [:njn] from comment #26)
> I just retriggered some old m-c Valgrind runs from Dec 6, which was before
> the timeout showed up.  They timed out.  This suggests that it wasn't a
> change in m-c that started causing the problems.  Ideas:
> 
> - Changes to valgrind.sh in the tools/ repo?  Or some other part of the
> testing infrastructure that is stored outside of mozilla-central?
> 
> - Changes to the build machines' configurations?

Ick. Maybe catlee can help us find out if anything changed on RelEng's side in that timeframe?
Flags: needinfo?(catlee)
I don't see any relevant changes happening in puppet in that time.

The primary change is that we switched from running nightly to running per-push. This has resulted in one change to the build environment: MOZ_UPDATE_CHANNEL is unset rather than being set to "nightly".
Flags: needinfo?(catlee)
This results in slightly different arguments to configure:

--enable-update-channel=nightly 

vs

--enable-update-channel=
https://tbpl.mozilla.org/php/getParsedLog.php?id=31789399&full=1&branch=fx-team has a smoking gun:
JavaScript error: http://localhost:8888/index.html, line 55: Quitter is not defined

This is:
http://mxr.mozilla.org/mozilla-central/source/build/pgo/index.html?force=1#55

If the Quitter extension isn't being installed and we hit that error then it makes sense that it would hang.
I'm wondering if there's something weird going on here where the quitter addon install fails silently and then we hang.
Pretty sure http://hg.mozilla.org/build/tools/rev/c46f324a8b46 is the cause here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try some new jobs.
(In reply to Chris AtLee [:catlee] from comment #140)
> here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try

That strikes me as pretty unlikely to be the root cause, esp given that
it hangs when running natively (comment 47).

Some non-default setting of --vex-iropt-register-updates= now appears
to be pretty much mandatory; see eg bug 913876.  I suspect
--vex-iropt-register-updates=allregs-at-mem-access is the required
minimum now.
> (In reply to Chris AtLee [:catlee] from comment #140)
> > here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try

Just to be sure, I reran V's testsuite on amd64-linux with this flag set,
and didn't get any failures as a result.
I was able to reproduce the hang on the same build with that flag set, and when I removed it, it exited successfully... Maybe there's something else at play here?

I see some green V's on inbound prior to me landing.

This failed:
https://tbpl.mozilla.org/?showall=1&jobname=valgrind&tree=Mozilla-Inbound&rev=316ab101c526

And this passed:
https://tbpl.mozilla.org/?showall=1&jobname=valgrind&tree=Mozilla-Inbound&rev=092c3bc22ae8

This code changed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/10800c0df8b5
Oh, I guess that failure is a build failure, not a valgrind failure...
(In reply to Julian Seward from comment #142)
> (In reply to Chris AtLee [:catlee] from comment #140)
> > here. I'll remove --vex-iropt-register-updates=allregs-at-each-insn and try
> 
> That strikes me as pretty unlikely to be the root cause, esp given that
> it hangs when running natively (comment 47).

I agree.  I was seeing hangs prior to setting that flag.  Also, I re-triggered an old job from Dec 5 with an updated valgrind.sh which had that flag and it ran successfully.

> I see some green V's on inbound prior to me landing.

BTW, you can't trust green V's... well, hanging runs always show up as red, but other failures (such as crashes) show up as green due to bug 823787.

In fact, if you disable the --vex-iropt-register-updates=allregs-at-each-insn flag then Valgrind will sometimes crash before it hangs, and it'll show up as green, and you'll probably think everything went ok unless you look in the logs.  (The crash message involves "clone".)
I was able to reproduce on a build slave.  I had to ctrl-c to kill it.  Here's the relevant part of the output:


==10502== LEAK SUMMARY:
==10502==    definitely lost: 9,032 bytes in 7 blocks
==10502==    indirectly lost: 1,552 bytes in 6 blocks
==10502==      possibly lost: 96,747 bytes in 1,749 blocks
==10502==    still reachable: 3,449,372 bytes in 6,088 blocks
==10502==         suppressed: 94,171 bytes in 1,775 blocks
==10502== Reachable blocks (those to which a pointer was found) are not shown.
==10502== To see them, rerun with: --leak-check=full --show-reachable=yes
==10502==
==10502== For counts of detected and suppressed errors, rerun with: -v
==10502== ERROR SUMMARY: 7 errors from 7 contexts (suppressed: 15 from 15)
JavaScript error: http://localhost:8888/index.html, line 55: Quitter is not defined

^CTraceback (most recent call last):
  File "../src/build/pgo/profileserver.py", line 68, in <module>
    runner.wait()
  File "/builds/slave/m-cen-l64-valgrind-00000000000/src/testing/mozbase/mozrunner/mozrunner/runner.py", line 83, in wait
    self.process_handler.wait(timeout)
  File "/builds/slave/m-cen-l64-valgrind-00000000000/src/testing/mozbase/mozprocess/mozprocess/processhandler.py", line 741, in wait
    self.outThread.join(timeout=1)
  File "/tools/python27/lib/python2.7/threading.py", line 674, in join
    self.__block.wait(delay)
  File "/tools/python27/lib/python2.7/threading.py", line 262, in wait
    _sleep(delay)
KeyboardInterrupt
WARNING: Command failed. See logs for output.


So Valgrind finished (as indicated by the "ERROR SUMMARY" line).  The |Quitter is not defined| thing showed up after a while.  And then |runner.wait()| ended up hanging.
So my current theory is that something changed in mozharness.  Ted, what do you think, based on comment 148?
Flags: needinfo?(ted)
http://hg.mozilla.org/build/mozharness has all the changes to mozharness; which change do you suspect?
> If the Quitter extension isn't being installed and we hit that error then it
> makes sense that it would hang.

I was discounting that because the "Quitter is not defined" message wasn't showing up in logs of some of the hanging jobs.  Though maybe that's because buildbot's buffering is hiding it.  But if it was a factor, I wouldn't expect Valgrind to finish?  Hmm, not sure.
(In reply to Aki Sasaki [:aki] from comment #150)
> http://hg.mozilla.org/build/mozharness has all the changes to mozharness;
> which change do you suspect?

None specifically;  it's just a theory and could be totally wrong :)
Ok, Quitter is looking more likely.  I have the process hanging, and |pstree| says this:

    |-sshd-+-sshd---sshd---bash-+-mock_mozilla---valgrind.sh---make---python-+-firefox---30*[{firefox}]

I.e. Firefox is still running.  |top| tells me that it's not consuming any CPU.

So why isn't Quitter installed, has this changed lately, and can I remedy it?
So objdir/dist/xpi-stage/quitter is present, and has various symlinks back to src/tools/quitter/, so that looks reasonable...
> So objdir/dist/xpi-stage/quitter is present, and has various symlinks back
> to src/tools/quitter/, so that looks reasonable...

glandium says that's not enough.  My best guess is that Quitter isn't running for some reason.
Running pgo-profile-run locally (where it doesn't hang) I can see Quitter is installed and running.  But if I rename the $OBJDIR/dist/xpi-stage/quitter directory to something else, it doesn't get run, and the hang occurs.

I checked the build slave, but $OBJDIR/dist/xpi-stage/quitter is present and its contents are identical to those on my local machine.  Which means that Quitter is present and correct, but not being picked up for some reason.
I re-enabled --vex-iropt-register-updates=allregs-at-each-insn because it's clearly not the problem.  Also, it prevents crashes, which confusingly show up as green runs on TBPL due to bug 823787:

https://hg.mozilla.org/build/tools/rev/61bc99912845
Aha!

http://mxr.mozilla.org/mozilla-central/source/build/pgo/profileserver.py#51 looks like this:

49     profile = FirefoxProfile(profile=profilePath,
50                              preferences=prefs,
51                              addons=[os.path.join(build.distdir, 'xpi-stage', 'quitter')],
52                              locations=locations)

That's what tells Firefox to look in the non-standard place for the Quitter extension.

I'm still trying to work out what goes wrong w.r.t. this code on the build slave.
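
For what it's worth, a crude pre-flight check along these lines (hypothetical, not part of profileserver.py; the objdir-relative path is an assumption) would have flagged the problem before Firefox even started, by resolving every file under the staged add-on and confirming the symlink targets are actually readable:

    import os

    def check_addon_dir(addon_dir):
        # Walk the staged add-on, resolve symlinks, and report any file whose
        # target is missing or unreadable.
        bad = []
        for root, dirs, files in os.walk(addon_dir):
            for name in files:
                path = os.path.join(root, name)
                real = os.path.realpath(path)   # follows relative and absolute symlinks
                if not os.path.isfile(real) or not os.access(real, os.R_OK):
                    bad.append((path, real))
        return bad

    addon = os.path.join('dist', 'xpi-stage', 'quitter')   # assumed objdir-relative path
    for link, target in check_addon_dir(addon):
        print("broken add-on file: %s -> %s" % (link, target))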
Right, that's what I was thinking back in comment 138 and comment 139.  I don't know offhand what would have broken it. The code that added quitter dates back 3 months: bug 641829.
Flags: needinfo?(ted)
> I'm still trying to work out what goes wrong w.r.t. this code on the build
> slave.

The profile directory (which is created in /tmp) disappears earlier in the build slave run, somehow.  If I remove the clean-up code on the very last line of profileserver.py, on my local machine the profile directory survives the run, but on the build slave the profile directory doesn't survive the run.
> The profile directory (which is created in /tmp) disappears earlier in the
> build slave run, somehow.  If I remove the clean-up code on the very last
> line of profileserver.py, on my local machine the profile directory survives
> the run, but on the build slave the profile directory doesn't survive the
> run.

Wait, on the build slave I think the /tmp directory used in the chroot during the build slave run isn't the same as the standard /tmp, so that observation is probably bogus.
So when Firefox starts up, both machines appear to have a valid profile with the Quitter extension installed appropriately.  So I want to turn my attention to the loading of add-ons within Firefox, but I'm not sure where/how that happens...
Here's progress:  I've done various bits of logging relating to extension loading.  There's something very weird about the file QuitterObserver.js on the build slaves -- the add-on service complains about not being able to get its last-modified-time, and my ad hoc code to query it after installation from profileserver.py fails, even though the parent directory claims to contain it.

So just now I tried changing $OBJDIR/dist/xpi-stage/quitter/components/QuitterObserver.js from a symlink (which points to tools/quitter/QuitterObserver.js) to a real file (by copying tools/quitter/QuitterObserver.js) and *the damn thing runs without hanging*.  It works.

WTF?

One curious thing is that on my local machine the symlink has an absolute path:

> lrwxrwxrwx 1 njn njn 50 Dec  7 21:46 QuitterObserver.js -> /home/njn/moz/mc1/tools/quitter/QuitterObserver.js

And on the build slave it has a tortuous relative path:

> lrwxrwxrwx 1 cltbld mock_mozilla   71 Dec 11 12:40 QuitterObserver.js -> ../../../../tools/quitter/../../../src/tools/quitter/QuitterObserver.js

I don't know why this would matter but it seems worth mentioning.
> I don't know why this would matter but it seems worth mentioning.

I just tried a symlink with an absolute path:

> lrwxrwxrwx 1 cltbld cltbld         81 Dec 11 20:13 QuitterObserver.js -> /builds/slave/m-cen-l64-valgrind-00000000000/src/tools/quitter/QuitterObserver.js

and that worked too.
I bet this has something to do with the chroot these tests are run in -- that screws up the relative symlink somehow.
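
To illustrate that suspicion (a sketch with made-up paths, not the actual slave layout): a relative symlink encodes a path from its own location, so it only resolves if the directories it climbs through look the same from wherever the reader runs. If a chroot or bind mount exposes a different view of the tree, the same link can dangle:

    import os, shutil, tempfile

    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, 'src', 'tools', 'quitter'))
    os.makedirs(os.path.join(root, 'objdir', 'dist', 'xpi-stage', 'quitter', 'components'))
    open(os.path.join(root, 'src', 'tools', 'quitter', 'QuitterObserver.js'), 'w').close()

    link = os.path.join(root, 'objdir', 'dist', 'xpi-stage', 'quitter',
                        'components', 'QuitterObserver.js')
    os.symlink('../../../../../src/tools/quitter/QuitterObserver.js', link)  # relative link
    print(os.path.exists(link))        # True: the relative path climbs back to src/ correctly

    # Copy only the objdir elsewhere -- a crude stand-in for seeing the tree
    # through a different root -- and the same link now dangles.
    other = tempfile.mkdtemp()
    shutil.copytree(os.path.join(root, 'objdir'), os.path.join(other, 'objdir'), symlinks=True)
    print(os.path.exists(os.path.join(other, 'objdir', 'dist', 'xpi-stage', 'quitter',
                                      'components', 'QuitterObserver.js')))  # False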
Anti-climax time:  it appears that while I was doing all this digging, glandium fixed the problem unintentionally in bug 948275 -- Valgrind runs are currently green on m-i and m-c!
(In reply to Nicholas Nethercote [:njn] from comment #166)
> Anti-climax time:  it appears that while I was doing all this digging,
> glandium fixed the problem unintentionally in bug 948275 -- Valgrind runs
> are currently green on m-i and m-c!

I confirmed this:  I updated the build slave's m-c tree and re-ran.  No hang, and the symlink is absolute:

> lrwxrwxrwx 1 cltbld mock_mozilla 81 Dec 11 21:20 QuitterObserver.js -> /builds/slave/m-cen-l64-valgrind-00000000000/src/tools/quitter/QuitterObserver.js
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
(In reply to TBPL Robot from comment #168)
> KWierso
> https://tbpl.mozilla.org/php/getParsedLog.php?id=35822195&tree=Mozilla-
> Inbound
> Linux x86-64 mozilla-inbound valgrind on 2014-03-07 13:47:33
> revision: 8394fed3332e
> slave: bld-linux64-spot-383
> 
> command timed out: 1200 seconds without output, attempting to kill

Huh, that's weird. |mach valgrind-test| produced no output. It's the same as the previous issue, because the "Quitter is not defined" string doesn't appear in the logs.
> It's the the same as the previous issue

That should say "It's not the same..."
The harness needs to catch these hangs and output a unique/identifiable cause for each (similar to how we do this for the other suites); at present, this doesn't meet:
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#6.29_Outputs_failures_in_a_TBPL-starrable_format
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: Valgrind-on-TBPL: "command timed out: 1200 seconds without output, attempting to kill" → Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output, attempting to kill"
Flags: needinfo?(n.nethercote)
(In reply to Ed Morley [:edmorley UTC+0] from comment #176)
> The harness needs to catch these hangs and output unique/identifiable cause
> for each (similar to how we do this for the other suites), at present, this
> doesn't meet:
> https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#6.29_Outputs_failures_in_a_TBPL-starrable_format

I guess you're referring to this:

> Exceptions & timeouts must be handled with appropriate log output
> (eg: the failure line must state in which test the timeout occurred,
> not just that the entire run has timed out).

Looking at the hangs in the past month... those in comment 171 and comment 173 both occurred while Valgrind was actually running. The Valgrind test job is unusual in that there aren't distinct tests; it just runs a single thing.

The hangs in comment 172 and comment 174 were on this line of scripts/valgrind/valgrind.sh:

> python $SCRIPTS_DIR/buildfarm/utils/retry.py -s 1 -r 5 -t 3660 \
>      python $SCRIPTS_DIR/buildfarm/utils/hgtool.py --rev $REVISION \
>           --bundle $HG_BUNDLE $HG_REPO src || exit 2

In both cases an |hg| command timed out.

And the hang in comment 175 is occurring during this command:

> MOZCONFIG=$srcdir/browser/config/mozconfigs/linux${_arch}/valgrind make -f $srcdir/client.mk configure || exit 2

Specifically, during this part of the build:

> /builds/slave/fx-team-l64-valgrind-000000000/src/obj-firefox/_virtualenv/bin/python /builds/slave/fx-team-l64-valgrind-000000000/src/config/expandlibs_exec.py --depend .deps/libjs_static.a.pp --target libjs_static.a --extract -- ar crs libjs_static.a jitprofiling.o RegExp.o Parser.o jsarray.o jsatom.o jsmath.o jsutil.o ExecutableAllocatorPosix.o MacroAssemblerX86Common.o CTypes.o Library.o pm_linux.o Unified_cpp_js_src0.o Unified_cpp_js_src1.o Unified_cpp_js_src2.o Unified_cpp_js_src3.o Unified_cpp_js_src4.o Unified_cpp_js_src5.o Unified_cpp_js_src6.o Unified_cpp_js_src7.o Unified_cpp_js_src8.o Unified_cpp_js_src9.o Unified_cpp_js_src10.o Unified_cpp_js_src11.o ctypes/libffi/.libs/libffi.a  ../../intl/icu/target/lib/libicui18n.a ../../intl/icu/target/lib/libicuuc.a ../../intl/icu/target/lib/libicudata.a

which is the linking of libjs_static.a.

I'm happy to hear suggestions on how these different kinds of failures (some of which appear to be infra problems) should be reported.
Flags: needinfo?(emorley)
Flags: needinfo?(n.nethercote)
(In reply to Nicholas Nethercote [:njn] from comment #177)
> I guess you're referring to this:
> 
> > Exceptions & timeouts must be handled with appropriate log output
> > (eg: the failure line must state in which test the timeout occurred,
> > not just that the entire run has timed out).
> 
> Looking at the hangs in the past month... those in comment 171 and comment
> 173 both occurred while Valgrind was actually running. The Valgrind test job
> is unusual in that there aren't distinct tests; it just runs a single thing.

It would be great if the harness had a max runtime and/or a max time without output (similar to buildbot, but set to something shorter than the buildbot times), so that the harness could kill the valgrind job and output something more identifiable (once shown in the TBPL annotated summary), eg:
TEST-UNEXPECTED-FAIL | valgrind | Timed out during foo
or
Automation Error: Valgrind job exceeded max runtime

or something of that form.
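
Something along those lines, very roughly (a sketch only, not tied to the real harness code; the command, job name, and message text are placeholders):

    import subprocess, threading, time

    def run_with_output_watchdog(cmd, name, max_quiet_seconds=900):
        # Kill the job if it produces no output for max_quiet_seconds and emit
        # a TBPL-starrable failure line; the message format is a placeholder.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        last_output = [time.time()]

        def watchdog():
            while proc.poll() is None:
                if time.time() - last_output[0] > max_quiet_seconds:
                    print("TEST-UNEXPECTED-FAIL | %s | no output for %d seconds, killing job"
                          % (name, max_quiet_seconds))
                    proc.kill()
                    return
                time.sleep(10)

        t = threading.Thread(target=watchdog)
        t.daemon = True
        t.start()
        for line in iter(proc.stdout.readline, b''):
            last_output[0] = time.time()
            print(line.rstrip().decode('utf-8', 'replace'))
        return proc.wait()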

> The hangs in comment 172 and comment 174 were on this line of
> scripts/valgrind/valgrind.sh:
> 
> > python $SCRIPTS_DIR/buildfarm/utils/retry.py -s 1 -r 5 -t 3660 \
> >      python $SCRIPTS_DIR/buildfarm/utils/hgtool.py --rev $REVISION \
> >           --bundle $HG_BUNDLE $HG_REPO src || exit 2
> 
> In both cases an |hg| command timed out.

retry.py should time out & retry N times; if there's still no luck after that, it should output something like:
Automation Error: Timeout cloning from foo

See bug 808536, bug 961048, bug 961042 for examples of fixing this elsewhere in automation.
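
And for the hg/retry case, the shape of it might be something like this (a sketch, not the real retry.py; the argument names and message text are made up):

    import subprocess, threading, time

    def retry_with_timeout(cmd, attempts=5, timeout=3660, sleeptime=1):
        # Bound each attempt with its own timeout; after exhausting all attempts,
        # print an identifiable automation-error line.
        code = None
        for _ in range(attempts):
            proc = subprocess.Popen(cmd)
            timer = threading.Timer(timeout, proc.kill)   # kill the attempt if it stalls
            timer.start()
            try:
                code = proc.wait()
            finally:
                timer.cancel()
            if code == 0:
                return 0
            time.sleep(sleeptime)
        print("Automation Error: %r failed after %d attempts" % (cmd, attempts))
        return code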

> And the hang in comment 175 is occurring during this command:
> 
> > MOZCONFIG=$srcdir/browser/config/mozconfigs/linux${_arch}/valgrind make -f $srcdir/client.mk configure || exit 2
> 
> Specifically, during this part of the build:
> 
> > /builds/slave/fx-team-l64-valgrind-000000000/src/obj-firefox/_virtualenv/bin/python /builds/slave/fx-team-l64-valgrind-000000000/src/config/expandlibs_exec.py --depend .deps/libjs_static.a.pp --target libjs_static.a --extract -- ar crs libjs_static.a jitprofiling.o RegExp.o Parser.o jsarray.o jsatom.o jsmath.o jsutil.o ExecutableAllocatorPosix.o MacroAssemblerX86Common.o CTypes.o Library.o pm_linux.o Unified_cpp_js_src0.o Unified_cpp_js_src1.o Unified_cpp_js_src2.o Unified_cpp_js_src3.o Unified_cpp_js_src4.o Unified_cpp_js_src5.o Unified_cpp_js_src6.o Unified_cpp_js_src7.o Unified_cpp_js_src8.o Unified_cpp_js_src9.o Unified_cpp_js_src10.o Unified_cpp_js_src11.o ctypes/libffi/.libs/libffi.a  ../../intl/icu/target/lib/libicui18n.a ../../intl/icu/target/lib/libicuuc.a ../../intl/icu/target/lib/libicudata.a
> 
> which is the linking of libjs_static.a.

Not sure about this one. Let's skip over it for now.

> I'm happy to hear suggestions on how these different kinds of failures (some of
> which appear to be infra problems) should be reported.

Thank you for looking at this - I really do appreciate it :-)
Blocks: 778688
Flags: needinfo?(emorley)
Depends on: 1029204
https://tbpl.mozilla.org/php/getParsedLog.php?id=45060194&tree=Fx-Team

Adjusting the summary to pick up the new failure line.
Summary: Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output, attempting to kill" → Valgrind jobs intermittently fail with "command timed out: 1200 seconds without output running ['mock_mozilla', '-r', 'mozilla-centos6-x86_6"
Component: New Frameworks → General